between concepts P (the part) and W (the whole) from a relation, where the concept Part of W reifies, i.e., embeds in its name, the PART-OF relationship to W, and the equivalent concepts and relations (e.g., where Part of W embeds a reified PART-OF relationship). (2) If a relation to be modified is specific to the base relations (e.g., 54.92% in FMA as shown in Table 3), find all relations inferable from this relation (or using it for inference) and check their validity. (3) If a relation to be modified is represented explicitly and inferable (e.g., 38.83% in FMA as shown in Table 3), identify all relations from which this relation can be inferred, and check their validity. Detecting inconsistency. Both FMA and GALEN were found to contain a small number of hierarchical cycles, resulting from either reflexive or circular hierarchical relations. Cycles may be found among the relations explicitly represented (e.g.,
Joint>, where the concept Component of Joint reifies a specialized PART-OF relationship. Examples of augmentation based on nominal modification and prepositional attachment include
Identifying the origin of semantic relations
Semantic relations may be acquired by several methods. They can be explicitly represented, added by complementation, generated by augmentation, or generated by inference. The former two categories constitute explicit knowledge (i.e., the base semantic relations in this study) and the latter two implicit knowledge. In other words, each method produces a set of semantic relations. Augmentation relies solely on concept names, so only one set of augmented relations obtains. In contrast, inference can be applied to the base relations only, to the augmented relations only, or to both, resulting in three distinguishable sets of inferred relations. The five sets of semantic relations studied are: B (base semantic relations), A (augmented semantic relations), I_B (inferred semantic relations based on the base relations alone), I_A (inferred semantic relations based on the augmented relations alone), and I_B∪A (inferred semantic relations based on the base and augmented relations). Depending on which method (or methods) can generate it, each semantic relation belongs to at least one and at most five of the sets B, A, I_B, I_A, and I_B∪A. When a relation can be generated by several methods, it is common to the corresponding sets of relations and, thus, belongs to the intersection of these sets. We use the intersection of sets as a unique identifier for the origin of a relation, hereafter referred to as its source. For example, the source (B ∩ A ∩ I_B∪A ∩ I_A) identifies the relations common to the sets B, A, I_B∪A, and I_A, but absent from I_B. More concretely, the semantic relation
Prostate> (i.e., in I_B∪A); and cannot be inferred solely from base relations using our inference rules (i.e., not in I_B).
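The source bookkeeping described above can be sketched in a few lines: each acquisition method yields a set of relations, and the source of a relation is exactly the combination of methods that can generate it. The relation triples and the contents of the five sets below are invented placeholders, not actual FMA or GALEN content.

```python
# Sketch of the "source" of a relation: the exact set of methods
# (B, A, I_B, I_A, I_BuA) whose relation sets contain it.

def source_of(relation, method_sets):
    """Return the frozenset of method names whose sets contain relation."""
    return frozenset(name for name, rels in method_sets.items()
                     if relation in rels)

# Toy method sets; triples are invented, not real FMA/GALEN relations.
methods = {
    "B":    {("head", "PART-OF", "body"), ("hand", "PART-OF", "arm")},
    "A":    {("head", "PART-OF", "body")},
    "IBuA": {("head", "PART-OF", "body"), ("finger", "PART-OF", "arm")},
    "IB":   {("finger", "PART-OF", "arm")},
    "IA":   set(),
}

# A relation in B, A, and I_BuA, but absent from I_B and I_A:
print(source_of(("head", "PART-OF", "body"), methods))
```

Relations sharing the same frozenset fall into the same disjoint subset, which is how the partition into sources in the next section can be computed.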
4
4.1
Results
Number of semantic relations acquired
The number of semantic relations acquired from FMA and GALEN is presented in Table 1. The base semantic relations include the relations explicitly represented and those added by complementation, as described earlier. The implicit relations are generated by augmentation and inference. Because semantic relations may be acquired by several methods, the total number of unique semantic relations is slightly less than the sum of the number of relations in the four subcategories listed.
Table 1. Number of semantic relations acquired from FMA and GALEN
4.2
Origin of the semantic relations acquired
From the perspective of the semantic relations, the source of a relation represents the method (or methods) by which this relation can be generated. From the five individual methods studied in this paper (B, A, I_B, I_A, and I_B∪A), nineteen sources in FMA and sixteen in GALEN were found to partition the total set of relations into disjoint subsets. To each subset corresponds a combination of methods by which the relations in the subset can be generated. As shown in Figure 1, four sources contribute the vast majority of relations in both FMA (about 95%) and GALEN (nearly 99%). These sources are: (I_B∪A ∩ I_B), (A ∩ I_B∪A), (B), and (B ∩ I_B∪A ∩ I_B). The number and percentage of relations coming from each source for FMA and GALEN are presented in Table 2. For example, 105,084 relations in FMA can be generated by both A (augmentation) and I_B∪A (inference based on the base and augmented relations), but not by the other three methods. As shown in the table next to the label (A ∩ I_B∪A), these 105,084 relations are represented by two gray slots (in columns A and I_B∪A) and white
slots in the other three columns. Note that row (A) represents the relations that can only be generated by augmentation, while a gray slot in column A identifies the relations that may be generated by augmentation.
Table 2. Source of the semantic relations acquired from FMA and GALEN
Figure 1. Contribution of the top four sources of relations in FMA and GALEN
4.3
Base semantic relations
The base semantic relations come from all sources involving B, i.e., not only the row (B) in Table 2, but all ten rows marked in grey in column B, including, for example, (B ∩ I_B∪A). While some of these relations are only present in the base, some of them may also be augmentable, inferable, or both. The proportion of base relations in each of these categories in FMA and GALEN is shown in Table 3.
[Table 3 values only partially recovered: 6.74 %, 2.68 %]
Table 3. The base semantic relations
4.4
Augmented semantic relations
The augmented semantic relations come from all sources involving A, i.e., not only the row (A) in Table 2, but all ten rows marked in grey in column A, including, for example, (A ∩ I_B∪A). While some of these relations can be generated only by augmentation, some of them may also be present in the base, be inferable, or both. The proportion of augmented relations for each of these categories in FMA and GALEN is shown in Table 4.
Augmented semantic relations        FMA (N=392,314)   GALEN (N=32,922)
Can only be augmented               24.52 %           13.02 %
Also present in the base            11.12 %           20.50 %
Also inferable                      65.14 %           68.28 %
(Both in the base and inferable)     0.78 %            1.80 %

Table 4. The augmented semantic relations

4.5
Inferred semantic relations
The inferred semantic relations come from all sources involving I_B∪A, I_B, or I_A, i.e., not only the rows (I_B∪A), (I_B), and (I_A) in Table 2, but all rows except (B), (A), and (B ∩ A). These rows are all marked in grey in column I_B∪A, I_B, or I_A, and include, for example, (I_B∪A ∩ I_A). While some of these relations can be generated only by inference, some of them may also be present in the base, be augmentable, or both. The
proportion of inferred relations for each of these categories in FMA and GALEN is shown in Table 5.
Inferred semantic relations          FMA (N=11,896,508)   GALEN (N=4,356,244)
Can only be inferred                 95.77 %              98.86 %
Also present in the base              2.11 %               0.64 %
Also augmentable                      2.15 %               0.52 %
(Both in the base and augmentable)    0.03 %               0.02 %

Table 5. The inferred semantic relations
The last row in Tables 3, 4, and 5 corresponds in all three cases to relations which are present in the base and are also augmentable and inferable (3,082 in FMA and 590 in GALEN). These relations correspond to the following four rows in Table 2: (B ∩ A ∩ I_B∪A), (B ∩ A ∩ I_B∪A ∩ I_B), (B ∩ A ∩ I_B∪A ∩ I_A), and (B ∩ A ∩ I_B∪A ∩ I_B ∩ I_A).
5
5.1
Discussion
Specificity and common features of the various methods generating relations
Each method provides specific relations. With the exception of I_B and I_A, each method contributes specific relations, i.e., relations that could not be generated by other methods. By definition, I_B∪A includes both I_B and I_A, i.e., every relation in I_B or I_A is also in I_B∪A. However, as reflected by the two non-empty sets (I_B∪A ∩ I_B) and (I_B∪A ∩ I_A), not every relation generated by I_B can also be generated by I_A, and vice versa. The largest proportion of specific relations is associated with inference (more than 95% of the relations inferred from FMA and GALEN can be generated only by inference). The base relations represent the second pool of specific relations (the proportion of base relations which cannot be generated by augmentation or inference is nearly 55% in FMA and 86% in GALEN). Many relations can be generated by more than one method. Many relations generated by augmentation (11% in FMA and 20% in GALEN) and, to a lesser extent, by inference (2.1% in FMA and 0.6% in GALEN) are also present in the base, i.e., explicitly represented in most cases. There is also a significant overlap between the relations generated by augmentation and by inference, especially when examined from the perspective of augmented relations (about two thirds of augmented relations can also be inferred). Finally, a few hundred relations can be generated by all the methods under investigation. These relations, B ∩ A ∩ I_B∪A ∩ I_B ∩ I_A, are present in the base, augmentable, and inferable from both the base and augmented relations. Examples of such relations include
5.2
Applications
5.2.1
Ontology auditing, validation, and maintenance
This study showed that the relations represented in ontologies - explicitly or not - may be redundant. When relations can be acquired by several different methods (e.g., explicitly represented and inferable from a combination of other relations), the relations in the ontology are no longer independent of each other. Redundancy may have beneficial effects for users of the ontology, such as providing direct links between important concepts. However, the dependence among equivalent relations or combinations thereof is rarely explicit. Therefore, there is a chance that, over time, one relation is modified without the dependent relations being modified accordingly, leading to inconsistency. Recognizing redundancy. Using techniques such as augmentation and inference, we showed that it is possible to identify relations which can be generated by more than one method, i.e., redundant relations. The percentage of redundant relations can be used as an indicator for auditing ontologies. A small percentage is likely to be associated with consistency and ease of maintenance, but the ontology may be more difficult for humans to use without the help of an inference engine.
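The redundancy indicator described above amounts to counting relations that appear in more than one method's set. A minimal sketch, with invented relation identifiers and counts rather than real ontology data:

```python
# Sketch of the redundancy percentage used as an auditing indicator:
# a relation is redundant when more than one method can generate it.

def redundancy_percentage(method_sets):
    """Percentage of relations generated by more than one method."""
    all_relations = set().union(*method_sets.values())
    redundant = [r for r in all_relations
                 if sum(r in s for s in method_sets.values()) > 1]
    return 100.0 * len(redundant) / len(all_relations)

# Toy sets: r2 is in all three, r3 is in two, so 2 of 5 are redundant.
methods = {
    "base":      {"r1", "r2", "r3"},
    "augmented": {"r2", "r4"},
    "inferred":  {"r2", "r3", "r5"},
}
print(round(redundancy_percentage(methods), 1))  # -> 40.0
```

In an audit, a low percentage would suggest the ontology's relations are largely independent, matching the interpretation given in the text.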
Identifying dependence among relations. An ontology in which dependence among equivalent relations is explicit would be easier to maintain in a consistent state. For example, the following guidelines, inspired by the two ontologies of anatomy under investigation, could be adopted: (1) If a relation to be modified is represented explicitly and augmentable (6.74% in FMA as shown in Table 3), modify the explicit representation (e.g.,
5.2.2
Integration of multiple ontologies
Facilitating comparisons across ontologies. The ontologies to be integrated may use different modeling conventions, resulting not only in different relations being represented, but also in different ways to represent the same relations. In both cases, integration is facilitated by forcing all relations to be explicitly represented. This enables comparisons across systems based on simple matches among
5.3
Advantages and limitations of this approach
Formalism. While other ontology tools (e.g., [6, 7]) require OKBC-compliance, the approach described in this paper is not tied to a particular formalism. FMA is a frame-based system and GALEN is based on description logics (DL). One requirement is to extract hierarchical relations from the system (e.g., superclass-subclass). The other requirement is to augment knowledge using linguistic clues in concept names. This presupposes the existence of concept names and is therefore not applicable to some 3,000 anonymous concepts in GALEN. Of note, the relations resulting from applying inference rules to hierarchical relations would certainly have been generated by a reasoner in a DL-based system. By generating these relations independently of such a system, however, our method is applicable to ontologies represented in other formalisms as well. Domain. As a method for auditing ontologies (see section 5.2.1), this approach can be used with any ontology, as long as the requirements mentioned above are met. In its application to integrating multiple ontologies (section 5.2.2), this method requires that the ontologies to be integrated be of the same domain or, at least, have a significant overlap, as is the case with FMA and GALEN. Our method has in common with other alignment methods (e.g., [17]) that it intersects the content of several ontologies. However, we take advantage of techniques such as augmentation and inference, described in this paper and quantified for the FMA-GALEN alignment, to maximize the intersection. Validation. One limitation of this study is that no validation of the relations generated has been performed yet. However, some elements of validation are built into the method. Redundant relations are likely to be valid, as are the relations represented in several ontologies. Finally, relations resulting from inference mechanisms should generally be valid.
The evaluation provided by this method is essentially quantitative, resulting from auditing the ontology automatically. For this reason, our method can be seen as complementary to a qualitative analysis of taxonomic relationships (e.g., [18]), which requires extensive manual work. Acknowledgements The research was supported in part by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the National Library of Medicine. Thanks for their support and encouragement to Cornelius Rosse, José Mejino, and Kurt Richards for FMA and Alan Rector, Jeremy Rogers, and Angus Roberts for GALEN. Thanks also to Pieter Zanstra at Kermanog for providing us with an extended license for the GALEN server.
References
1. Corcho O, Fernandez-Lopez M, Gomez-Perez A. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data & Knowledge Engineering 2003;46(1):41-64
2. Duc HN. Resource-bounded reasoning about knowledge [PhD Thesis]: University of Leipzig; 2001
3. Sima J, Cervenka J. Neural knowledge processing in expert systems. In: Cloete I, Zurada JM, editors. Knowledge-based neurocomputing. Cambridge, Mass.: MIT Press; 2000. p. 419-466
4. Zhang S, Bodenreider O. Aligning representations of anatomy using lexical and structural methods. Proc AMIA Symp 2003:(to appear)
5. Baader F, Horrocks I, Sattler U. Description logics as ontology languages for the Semantic Web. In: Hutter D, Stephan W, editors. Festschrift in honor of Jörg Siekmann: Springer; 2003. p. (to appear)
6. Noy NF, Musen MA. PROMPT: algorithm and tool for automated ontology merging and alignment. Proc of AAAI 2000:450-455
7. McGuinness DL, Fikes R, Rice J, Wilder S. The Chimaera ontology environment. Proc of AAAI 2000:1123-1124
8. Reed SL, Lenat D. Mapping ontologies into Cyc. Proc of AAAI 2002. http://citeseer.nj.nec.com/509738.html
9. Bailin SC, Truszkowski W. Ontology negotiation as a basis for opportunistic cooperation between intelligent information agents. In: Cooperative Information Agents V, Proceedings; 2001. p. 223-228
10. Uschold M, Gruninger M. Creating semantically integrated communities on the world wide web. Proc International Workshop on the Semantic Web 2002. http://semanticweb2002.aifb.uni-karlsruhe.de/USCHOLD-Hawaii-InvitedTalk2002.pdf
11. Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: the digital anatomist symbolic knowledge base. J Am Med Inform Assoc 1998;5(1):17-40
12. Noy NF, Musen MA, Mejino JL, Rosse C. Pushing the envelope: challenges in a frame-based representation of human anatomy: Technical Report of Stanford Medical Informatics; 2002. Report No.: SMI-2002-0925
13. Rector AL, Bechhofer S, Goble CA, Horrocks I, Nowlan WA, Solomon WD. The GRAIL concept modelling language for medical terminology. Artif Intell Med 1997;9(2):139-71
14. Rogers J, Rector A. GALEN's model of parts and wholes: experience and comparisons. Proc AMIA Symp 2000:714-8
15. Schulz S. Bidirectional mereological reasoning in anatomical knowledge bases. Proc AMIA Symp 2001:607-11
16. Bodenreider O. Circular hierarchical relationships in the UMLS: etiology, diagnosis, treatment, complications and prevention. Proc AMIA Symp 2001:57-61
17. Wiederhold G. An algebra for ontology composition. Proceedings of the 1994 Monterey Workshop on Formal Methods 1994:56-61
18. Welty C, Guarino N. Supporting ontological analysis of taxonomic relationships. Data & Knowledge Engineering 2001;39(1):51-74
JOINT LEARNING FROM MULTIPLE TYPES OF GENOMIC DATA

A.J. HARTEMINK
Department of Computer Science and Center for Bioinformatics and Computational Biology, Duke University, Box 90129, Durham, NC 27708-0129
[email protected]

E. SEGAL
Department of Computer Science, Stanford University, Stanford, CA 94303
eran@cs.stanford.edu
Recent technological advances enable us to collect many different types of data at a genome-wide scale, including DNA sequences, gene and protein expression measurements, protein-protein interactions, protein structural information, and protein-DNA binding data. These data provide us with a means to begin elucidating the large-scale modular organization of the cell. Indeed, much recent work has been devoted to the analysis of these data for this purpose. However, most of this work has been devoted to the analysis of a single type of data at a time, using other types of data only for validation. In contrast, results jointly learned from more than one type of data are likely to lead to new insights that might not be as readily available from analyzing one type of data in isolation. For instance, experimental genomic datasets often contain errors arising from imperfections in the applied technology. Thus, some of the findings of methods that analyze a single type of data may be erroneous. If we assume that technological errors across different genomic datasets are largely independent, then the probability of error in results that are supported by two different types of data is dramatically reduced. The Joint Learning from Multiple Types of Genomic Data session at PSB 2004 was created to provide a forum for novel methods that use more than one type of data in their analysis and do so jointly. Our goal in organizing this new session at PSB is two-fold: first, we hope to encourage the computational biology community to develop methods that are capable of integrating the large number of different types of data that are becoming increasingly available; second, we
hope to stimulate the discovery of new biological insights that would be difficult or impossible to identify in the analysis of only single types of data. Based on the number of excellent papers submitted, the session has clearly tapped into a growing interest in such joint methods. Because of this large number of quality submissions, we were able to accept nine papers for publication. Interestingly, almost every one is different from the others in terms of the types of data used and the goal of the study. Some examples include: combining sequences from multiple organisms, or combining phylogenetic trees with sequence, for the task of detecting cis-regulatory motifs; combining gene expression and sequence for detecting operon structure; combining protein sequences with tertiary structural information for classifying proteins; combining protein-protein interaction data with gene expression for learning regulatory networks; and combining text from the literature with protein sequences for discovering functional domains in proteins. The methods employed for the joint learning were also very diverse, and included probabilistic methods, support vector machines, and methods from combinatorial optimization. Taken together, these papers represent a fairly thorough cross-section of the most promising directions in this field. As more types of data become widely available, it is our belief that these kinds of unified approaches are likely to produce great insights into the complex biological systems that we are trying to better understand. The session co-chairs are grateful to those who submitted papers to the session for their contributions in advancing the field of joint learning, and especially grateful to those who reviewed submissions for their contributions in selecting the most outstanding papers to present this year, which was a challenging task given the large number of excellent submissions.
ProGreSS: SIMULTANEOUS SEARCHING OF PROTEIN DATABASES BY SEQUENCE AND STRUCTURE a

A. BHATTACHARYA, T. CAN, T. KAHVECI, A. K. SINGH, Y.-F. WANG
Department of Computer Science, University of California, Santa Barbara, CA 93106
{arnab, tcan, tamer, ambuj, yfwang}@cs.ucsb.edu
Abstract
We consider the problem of similarity searches on protein databases based on both sequence and structure information simultaneously. Our program extracts feature vectors from both the sequence and structure components of the proteins. These feature vectors are then combined and indexed using a novel multi-dimensional index structure. For a given query, we employ this index structure to find candidate matches from the database. We develop a new method for computing the statistical significance of these candidates. The candidates with high significance are then aligned to the query protein using the Smith-Waterman technique to find the optimal alignment. The experimental results show that our method can classify up to 97% of the superfamilies and up to 100% of the classes correctly according to the SCOP classification. Our method is up to 37 times faster than CTSS, a recent structure search technique, combined with the Smith-Waterman technique for sequences.
1 Introduction
The industrialization of molecular biology research has resulted in an explosion of bioinformatics data (DNA and protein sequences, protein structures, gene expression data and genome pathways). Each of these data types presents a different kind of information about the functions of the genes and the interactions between them. Most of the earlier work focuses on only one type of data, since each type of data has a different representation and the notion of similarity varies for each data type. Combined learning from multiple types of data will help biologists achieve more precise results for several reasons: a) The probability of having false positive results due to errors in data generation decreases, since it is less likely for the same error to appear in all the datasets. b) More than one aspect of the biological objects can be captured simultaneously.
1.1 Problem definition
In this paper, we consider the problem of joint similarity searches on protein sequences and structures. A protein is represented as an ordered list of amino acids, where each amino acid has a sequence and a structure component (the terms amino acid and residue are used interchangeably). The sequence component of an amino acid is its residue name, indicated by a one-letter code from a 20-letter alphabet. The structure component consists of the Secondary Structure Element (SSE) type of that residue (α-helix, β-sheet, or turn), and a 3-D vector which shows the position of its carbon-alpha (Cα) atom.

a Work supported partially by NSF under grants EIA-0080134, IIS-9877142, DBI-0213903, and IRI-9908441.

1.2 Related work
It has been one of the most important goals in molecular biology to elucidate the relationship among the sequence, structure and function of proteins. A handful of algorithms and tools have been developed to analyze sequence and structure similarities between proteins. These methods usually focus on either sequence (Smith-Waterman (SW) 6, BLASTP 5, PSI-BLAST 7) or structure information (VAST 8, DALI 9, CE 10, PSI 11, CTSS 12) for finding similarities between different proteins. On the other hand, a few tools have been developed to provide integrated environments for analyzing sequence and structure information together. Protein Analyst 13, 3DinSight 14, and the integrated tools by Huang et al. 15 are among those tools. They provide a combination of separate (but cooperating) programs for integration of sequence and structure analysis under a single working environment. The components of these systems are usually run one after another, with one's results being the input to the other. Although these tools provide integration of multiple types of data, they perform search on only one type of data at a time. We believe that integration of multiple data sources at the indexing and search level would provide more precise and efficient tools.
1.3 An overview of our method
We extract a number of feature vectors from the sequence and structure components of each protein in the database by sliding a window. Each feature vector maps to a point in a multi-dimensional space. Thus, a protein is represented by a number of points. This multi-dimensional space consists of orthogonal dimensions for sequence and structure. Later, we partition the space with the help of a grid and index these points using Minimum Bounding Rectangles (MBRs). Given a query, our search method runs in three phases: Phase 1 (index search): Feature vectors (i.e., points) are extracted from the query protein. For each of these query points, all the database points that are within ε_q and ε_t distance along the sequence and the structure dimensions are found using the index structure. Each such point casts a vote for the protein to which it belongs, as in geometric hashing 16. Phase 2 (statistical significance): For each database protein, a statistical significance value is computed based on the votes it obtained in Phase 1 and its length. Phase 3 (post-processing): The top c proteins of highest significance are selected, where c is a predefined cutoff. The optimal pairwise alignments of these c proteins to the query protein are then computed using the SW technique. Finally, the Cα atoms of
the matching residues are superposed using the least-squares method by Arun et al. 17 to find the optimal RMSD (Root Mean Square Distance). We name our method ProGreSS (Protein Grep by Sequence and Structure) since it enables queries based on sequence and structure simultaneously. The rest of the paper is organized as follows. Section 2 discusses our index structure for proteins. Section 3 explains our search algorithm. Section 4 presents the experimental results. We end with a brief discussion in Section 5.
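The superposition step of Phase 3 can be illustrated with a least-squares rigid alignment in the spirit of Arun et al.'s SVD method. This is a sketch with toy coordinates, not the authors' implementation; the matched point sets are centered, an optimal rotation is recovered by SVD, and the RMSD of the aligned points is reported.

```python
import numpy as np

def optimal_rmsd(P, Q):
    """Optimal RMSD after least-squares rigid superposition.
    P, Q: (n, 3) arrays of matched atom positions."""
    P = P - P.mean(axis=0)                      # center both point sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation P -> Q
    diff = (P @ R.T) - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Toy coordinates: Q is P rotated about the z axis, so RMSD is ~0.
P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
theta = np.pi / 3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Q = P @ Rz.T
print(optimal_rmsd(P, Q))   # a pure rotation: RMSD near machine epsilon
```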
2 Feature vectors and index construction
In this section, we develop new methods to extract features for protein structures and sequences. Feature vectors for structures are computed as the curvature and torsion values of the residues in a sliding window. Curvature and torsion values provide a necessary and sufficient condition for the isomorphism of two space curves 12. For a detailed explanation of how curvature and torsion are computed, refer to CTSS 12. Feature vectors for sequences are computed using a sliding window and a score matrix that defines the similarity between all the amino acids. We also propose a novel index structure to provide efficient access to these features.
2.1
Feature vectors for structure
We slide a window of a prespecified size, w, on the proteins (i.e., each positioning of the window contains w consecutive residues). We will discuss the choice of w later. Figure 1(a) depicts two positionings of the window. For a given window, the curvature and torsion values for each residue in that window are computed. The resulting vector contains 2w values, since two values are stored per residue in the window. This vector maps to a point in a 2w-dimensional space. Having a large number of dimensions increases the cost of computing the similarity 18 and the cost of storing the vectors. Therefore, we reduce the number of dimensions to a smaller number, d_t, using the Haar wavelet transformation, at the cost of reduced precision (see 19 for details on the Haar transformation). We use d_t = 2 in our experiments. The transformed vector is normalized to the [0,1]^d_t space. Along with each feature vector, we also store the SSE types of the residues. As w increases, the feature vector contains information about the correlation between a larger number of residues. Thus the similarity between two feature vectors implies longer matches. On the other hand, very large values of w may cause false dismissals, since shorter matches may be discarded due to their neighboring residues. We set w = 3 for our experiments.
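As an illustration of this dimension-reduction step, the following sketch flattens a window of (curvature, torsion) pairs into a 2w-dimensional vector and shrinks it with a Haar-style averaging cascade before normalizing into [0,1]^d_t. The numeric values, the padding rule for odd lengths, and the normalization bounds are all assumptions for illustration, not the paper's exact transform.

```python
import math

def haar_reduce(values, d):
    """Repeatedly replace adjacent pairs by their scaled averages
    (the Haar approximation coefficients) until d values remain."""
    v = list(values)
    while len(v) > d:
        if len(v) % 2:                 # pad odd-length input (assumption)
            v.append(v[-1])
        v = [(a + b) / math.sqrt(2) for a, b in zip(v[0::2], v[1::2])]
    return v

def normalize(v, lo, hi):
    """Map values into [0, 1] given assumed global bounds lo and hi."""
    return [(x - lo) / (hi - lo) for x in v]

# Toy window of w = 3 residues: (curvature, torsion) per residue.
window = [(0.2, 1.1), (0.3, 0.9), (0.25, 1.0)]
flat = [x for pair in window for x in pair]       # 2w = 6 values
print(normalize(haar_reduce(flat, 2), lo=0.0, hi=2.0))
```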
2.2 Feature vectors for sequence
The similarity between two amino acids of protein sequences is usually defined using score matrices (e.g., PAM and BLOSUM). A score matrix consists of 20 rows and columns, one for each amino acid. The entries of a score matrix denote the score for aligning a pair of residues. If two amino acids are similar, then the score for that pair is large; otherwise it is small.
Given a score matrix M, we call each row of M the score vector of the amino acid corresponding to that row. Thus, each entry of this vector shows the similarity of that amino acid to one of the 20 possible amino acids. We define the distance between two amino acids as the Euclidean distance between their score vectors. This is justified because, if the score of aligning two amino acids x and y is high in a score matrix, then they are similar. Therefore, if x is similar (or dissimilar) to another amino acid z, then y is also similar (or dissimilar) to z. Similar to protein structures, we extract feature vectors for protein sequences by sliding a window of length w (see Figure 1(b) for w = 3). Each positioning of the window contains w amino acids. We append the score vectors of these amino acids in the same order as they appear in the window to obtain a vector of size 20w. This vector maps to a point in a 20w-dimensional space. Since the number of dimensions is too large for efficient indexing even for small values of w, we reduce the number of dimensions to a smaller number, d_q, using Haar wavelets. Similar to the structure component, we recommend d_q = 2 for an optimal quality/time trade-off. The resulting vector is then normalized to the [0,1]^d_q space. We again choose w = 3.
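The score-vector distance can be sketched directly. The 3-letter matrix below is a made-up stand-in for a real 20x20 PAM or BLOSUM matrix, so the window vectors have length 3w rather than the 20w of the paper.

```python
import math

# Toy "score matrix": each row is an amino acid's score vector.
# Values are invented, not real substitution scores.
SCORE = {
    "A": [4, -1, 0],
    "R": [-1, 5, 1],
    "N": [0, 1, 6],
}

def aa_distance(x, y):
    """Euclidean distance between the score vectors of two amino acids."""
    return math.dist(SCORE[x], SCORE[y])

def window_vector(seq):
    """Append the score vectors of a window's residues, in order
    (length 20w with a real matrix; 3w with this toy one)."""
    return [v for aa in seq for v in SCORE[aa]]

print(aa_distance("A", "A"))   # -> 0.0 (identical residues)
print(window_vector("ARN"))    # 3 residues x 3-dim toy rows = 9 values
```

With a real matrix, this window vector is exactly what the Haar reduction of the previous subsection would then shrink to d_q dimensions.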
2.3 Indexing feature vectors
So far we have discussed how to extract feature vectors for the structure and sequence components of the proteins separately. In this section, we will discuss how to build an index structure on these feature vectors. In order to search the protein database based on both sequence and structure, we need to combine the feature vectors for these two components. Since the same window size is used for both components, every positioning of the window produces one d_t-dimensional feature vector for its structure component and one d_q-dimensional feature vector for its sequence component. We append these two vectors to obtain a single (d_t + d_q)-dimensional vector. The resulting vector is called the combined feature vector. Since the entries of each of the feature vectors are normalized to the [0,1] interval, the combined feature vector resides in a [0,1]^(d_t+d_q) search space. The index structure is built by first partitioning the search space into η equal pieces along each dimension. The resulting grid contains η^(d_t+d_q) cells of length 1/η along each dimension. We will discuss the choice of η in Section 3.1. Once the space is partitioned, a window of length w is slid on each protein in the database. For each positioning of the window, the combined feature vector is computed. Each such vector maps to a point p in one of the cells of the grid. For each such point, we check whether that cell is empty. If it is empty, we construct an MBR that contains only p. Otherwise, we find the MBR B in that cell whose volume becomes the smallest after extending it to contain p. If the volume of B, after its expansion, is less than a precomputed volume threshold, V, then we extend B and insert p into B; otherwise we create a new MBR that covers only p. V affects only the performance, not the quality of the search. We chose V = (1/2η)^(d_t+d_q) experimentally. Figure 2 presents
/* Let D be a dataset that contains proteins, w be the window size, and V be the volume cutoff. */
Procedure CreateIndex(D, w, V)
  for each protein x ∈ D
    for each positioning of window of length w
      p := combined feature vector for current window;
      C := cell that contains p;
      if C = ∅ then
        B.Lower := p; B.Higher := p;
        Insert B into C;
      else
        B := argmin_{B∈C} {volume(B ∪ p)};
        if volume(B ∪ p) ≤ V then
          B := B ∪ p;
        else
          B.Lower := p; B.Higher := p;
          Insert B into C;
        endif
      endif
    endfor
  endfor
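In Python, the construction in Figure 2 might look roughly as follows. This is a sketch, not the published implementation: the `MBR` class, the cell-coordinate computation, and the protein-id bookkeeping are our own assumptions.

```python
from collections import defaultdict
from math import prod

class MBR:
    """Minimum bounding rectangle over points, tagged with protein ids."""
    def __init__(self, p, pid):
        self.lower, self.upper = list(p), list(p)
        self.entries = [(tuple(p), pid)]
    def volume_with(self, p):
        """Volume of this MBR after extending it to contain point p."""
        return prod(max(u, x) - min(l, x)
                    for l, u, x in zip(self.lower, self.upper, p))
    def add(self, p, pid):
        self.lower = [min(l, x) for l, x in zip(self.lower, p)]
        self.upper = [max(u, x) for u, x in zip(self.upper, p)]
        self.entries.append((tuple(p), pid))

def create_index(points_by_protein, eta, V):
    """points_by_protein: {protein id: [combined feature vectors in [0,1]^d]}.
    Each point goes to its grid cell; it joins the MBR whose expanded volume
    is smallest if that volume stays <= V, otherwise it starts a new MBR."""
    grid = defaultdict(list)             # cell coordinates -> list of MBRs
    for pid, points in points_by_protein.items():
        for p in points:
            cell = tuple(min(int(x * eta), eta - 1) for x in p)
            boxes = grid[cell]
            best = min(boxes, key=lambda b: b.volume_with(p), default=None)
            if best is not None and best.volume_with(p) <= V:
                best.add(p, pid)
            else:
                boxes.append(MBR(p, pid))
    return grid
```

The grid itself is just a dictionary keyed by cell coordinates, so only non-empty cells consume memory.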
Figure 1: Feature vectors for (a) protein structure, and (b) protein sequence.
Figure 2: Algorithm for building the index structure.
the algorithm that constructs the index structure. Figure 3 depicts a layout of a 2-D search space and the MBRs built on the data points for η = 4. Here, dt = dq = 1.
3 Query method Given a query <Q, εq, εt, τ>, where Q is a query protein, εq ∈ [0,1] and εt ∈ [0,1] are the distance thresholds for sequence and structure respectively, and τ is the boolean value regarding the use of SSE information, our search algorithm runs in three phases: 1) index search, 2) statistical significance computation, and 3) post-processing. In this section, we will discuss each of these phases. We will assume that the index structure is built using a user-specified score matrix for sequence (e.g., PAM or BLOSUM), and w for the window size. 3.1 Index search
Each residue of the query protein Q consists of a sequence component and a structure component. We extract combined feature vectors from Q by sliding a window of length w on it. Each of these combined feature vectors defines a query point in the search space. Figure 4 shows a sample query point in a 2-D search space, where the horizontal axis is the structure dimension and the vertical axis is the sequence dimension. In this figure, the search space is split into 16 cells numbered from 0 to 15. The query point falls into cell 10. We want to find the database points that are within
Figure 3: A layout of the MBRs and data points on the search space for η = 4 in 2-D.
Figure 4: A sample query point and its query box for η = 4 in 2-D.
an εt distance along the structure dimensions and an εq distance along the sequence dimensions from the query point. In Figure 4, we are interested in the points in the shaded region. Note that if τ = true, then we only consider the database points that have the same SSE type as the query point. For each query point, we construct a query box by extending it by εt and by εq in both directions along the structure and the sequence dimensions respectively (see Figure 4). Next, we find the cells in the search space that overlap the query box. If a cell does not overlap the given query box, then it is guaranteed that it does not contain any database points that are in the query box. A cell can overlap a query box in two ways: 1) it is contained in the query box (e.g., cell 10 in Figure 4), or 2) it partially overlaps the query box (e.g., cells 5, 6, 7, 9, 11, 13, 14, and 15 in Figure 4). 1) If a cell is contained in the query box, all the points in that cell are guaranteed to overlap the query box. Therefore, for each data point in that cell, we add a vote to the database protein that contains it (if τ = true, the vote is added only for the points that have the same SSE type as the query point). 2) If a cell partially overlaps the query box, then we check all the MBRs in that cell. If an MBR is contained in the query box (e.g., the MBR in cell 10), each point in that MBR contributes a vote. If an MBR partially overlaps the query box (e.g., the MBR in cell 15), then we find the points in that MBR that are in the query box to find the votes. If an MBR does not overlap the query box (e.g., the MBR in cell 6), we ignore all the points in that MBR. This method is more precise than geometric hashing [16], because for a given query point it inspects the neighboring cells in addition to the cell into which that query point falls. The number of partitions η in the search space affects the run time of the index search.
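A stripped-down version of this voting scheme can be sketched as follows. This is our own simplification: it stores raw points per cell and omits both the MBR pruning layer and the SSE-type filter described above.

```python
import itertools
from collections import Counter

def search(grid, eta, query_points, eps):
    """grid: {cell tuple: [(point, protein_id), ...]} over [0,1]^d with eta
    cells per dimension. Each query point is extended by eps[i] along
    dimension i into a query box; only cells overlapping the box are
    visited, and every database point inside the box adds one vote to
    the protein that contains it."""
    votes = Counter()
    for q in query_points:
        lo = [x - e for x, e in zip(q, eps)]
        hi = [x + e for x, e in zip(q, eps)]
        # grid cells whose index range intersects the query box
        ranges = [range(max(0, int(l * eta)), min(eta - 1, int(h * eta)) + 1)
                  for l, h in zip(lo, hi)]
        for cell in itertools.product(*ranges):
            for p, pid in grid.get(cell, []):
                if all(l <= x <= h for l, x, h in zip(lo, p, hi)):
                    votes[pid] += 1
    return votes
```

In the full method, cells and MBRs wholly contained in the query box contribute all of their points without the per-point test.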
As η decreases, each cell contains more MBRs. Therefore, if a query box partially overlaps a cell, then more MBRs need to be tested for intersection with the
query box, thus increasing the search time. On the other hand, having too many partitions has two disadvantages: 1) most of the cells will be sparse or empty, incurring space cost; 2) the volume of the boxes will be very small since each cell will get smaller. This increases the total number of MBRs, and hence the number of MBRs for the intersection test. From our experiments we recommend η = 10 for optimal results. 3.2 Statistical significance computation Once the index structure is searched, we obtain a number of votes for each protein in the database. The total number of votes for a protein x shows the number of query points that are close to x's points. We define the p-value of a match as the unexpectedness of that result. Smaller p-values imply better matches. Definition 1 Given a protein x with n points in the index structure and v votes for a given query, the p-value of x for that query is defined as the probability of having at least v votes for a randomly generated protein with n points in the search space. Next, we discuss the computation of p-values. Consider a protein in the database that is represented in the search space using n points (n = length of protein - window size + 1). Let the protein receive v votes for a given query. Let X be a random variable representing the number of query boxes that overlap with a randomly selected point in the search space. Let μX and σX² be the mean and the variance of X. The total number of query boxes that overlap with n randomly selected points can be computed as Xn = X + X + ... + X (exactly n X's). Since the X's are independent and identically distributed random variables, using the Central Limit Theorem, one can show that Xn is normally distributed with mean μXn = n · μX and variance σ²Xn = n · σX². Thus, if μX and σX² are known, one can compute the distribution of Xn using a normal distribution. Since the protein has v votes, its p-value can be computed as P(Xn ≥ v).
The computation of p-values requires the values of μX and σX². The distribution of X depends on the distribution of query points, and the distance thresholds εq and εt. We compute the values of μX and σX² by generating a large number of random points in the search space and counting the number of query boxes that each overlaps. In our experiments, we generate 10,000 random points for this estimation.
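The estimation and the normal-tail p-value might be sketched as follows. Assumptions of ours, not the authors' code: query boxes are given as per-dimension (low, high) intervals, and the Gaussian tail is evaluated with the complementary error function.

```python
import math
import random

def estimate_moments(query_boxes, trials=10000, dim=2, seed=0):
    """Monte Carlo estimate of the mean and variance of X, the number of
    query boxes overlapping a uniformly random point in [0,1]^dim.
    Each box is a list of (low, high) intervals, one per dimension."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        p = [rng.random() for _ in range(dim)]
        counts.append(sum(all(l <= x <= h for x, (l, h) in zip(p, box))
                          for box in query_boxes))
    mu = sum(counts) / trials
    var = sum((c - mu) ** 2 for c in counts) / trials
    return mu, var

def p_value(v, n, mu, var):
    """P(Xn >= v), where Xn ~ Normal(n*mu, n*var) by the Central Limit
    Theorem; smaller values indicate a more surprising (better) match."""
    if var == 0:
        return 1.0 if v <= n * mu else 0.0
    z = (v - n * mu) / math.sqrt(n * var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Because μX and σX² depend only on the query boxes, they are estimated once per query and reused for every database protein.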
3.3 Post-processing After the statistical significances of all the proteins are computed, the top c proteins with the highest significance are selected as candidates for post-processing, where c is a predefined cutoff. The purpose of post-processing is to find the optimal alignment between the query protein and the most promising proteins. Let q be the query protein. For every protein x in the candidate set, post-processing runs in two steps: Step 1: We build a |x| × |q| score matrix, Mstr, for the structure component, where |x| and |q| are the number of residues in x and q, as follows: For each residue in x and q, we construct a 2-D vector from its curvature and torsion. Each entry of Mstr is then computed as the negative of the Euclidean distance between the <curvature, torsion>-vectors of the corresponding residues. For the sequence component, we create another |x| × |q| score matrix, Mseq, such that for all i, j the entry Mseq[i,j] is equal to the score of aligning the ith letter of x with the jth letter of q in the underlying score matrix (e.g., BLOSUM62). The matrices Mseq and Mstr are normalized and a combined score matrix Mcom = (1 - εt) · Mstr + (1 - εq) · Mseq is computed. Here, the weights (1 - εt) and (1 - εq) represent the importance that the user gives to each of the components. The optimal alignment between x and q is then found by running the Smith-Waterman dynamic programming using Mcom. Step 2: The alignment obtained in Step 1 defines a one-to-one mapping between a subset of residues of x and q, and is optimal with respect to Mcom. Finally, we find the 3-D rotation and translation of x that gives the minimum RMSD to q by using a least-squares fitting method [17].
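A minimal Smith-Waterman recurrence over a precomputed residue-residue score matrix such as Mcom can be sketched as below. This is our own illustration with a simple linear gap penalty, which the paper does not specify, and it returns only the best local score rather than the traceback.

```python
import numpy as np

def smith_waterman(M, gap=-1.0):
    """Best local alignment score over an |x| x |q| score matrix M
    (e.g., Mcom = (1 - eps_t) * Mstr + (1 - eps_q) * Mseq)."""
    n, m = M.shape
    H = np.zeros((n + 1, m + 1))
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + M[i - 1, j - 1],  # match/mismatch
                          H[i - 1, j] + gap,                   # gap in q
                          H[i, j - 1] + gap)                   # gap in x
            best = max(best, H[i, j])
    return best
```

Recovering the aligned residue pairs requires the usual traceback from the cell holding the best score.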
4 Experimental evaluation We used single-domain chains in our experiments. We downloaded all the protein chains in PDB (http://www.rcsb.org/pdb) that contain only one domain according to the VAST [8] and SCOP [20] classifications. We only considered proteins that are members of one of the following SCOP classes: all α, all β, α+β and α/β. We identified the superfamilies (according to the SCOP classification) that have at least 10 representatives in this dataset. There are 181 such superfamilies. We created a database D of size 1810 proteins by including 10 proteins from each of these superfamilies. We formed a query set, DQ, by choosing a random chain from each of the 181 superfamilies in D. DQ is large enough to sample D since it contains one protein from each superfamily. We ran a number of experiments on these sets to test the quality and the performance of ProGreSS. The tests were run on a computer with two AMD Athlon MP 1600+ processors with 2 GB of RAM, running Linux 2.4.19. In the rest of this section, we use w for the window size, c for the cutoff, εt and εq for the structure and sequence distance thresholds, τ for the SSE type match choice, and η for the number of partitions. We employ the BLOSUM62 score matrix for sequences in all of our experiments. The numbers of dimensions dq and dt for sequence and structure are both set to 2. 4.1
Quality test Our first experiment set inspects the effect of various indexing and search parameters on the quality of our index search results. We classify a given query protein into one of the superfamilies and classes using the c best seeds as follows. The logarithms of the p-values of the matches in the top c results in each superfamily are accumulated. The query protein is classified into the superfamily that has the largest magnitude of this sum. We use the same technique to classify the query protein into one of the four SCOP classes: all α, all β, α+β and α/β. Since the queries are selected from the database, in order to be fair, we do not take into account the query protein itself if it is among the top c results. We will only report the results for τ = true, since it usually produced slightly better results than τ = false.
Figure 5: Percentage of query proteins correctly classified for different values of c.
Figure 6: Percentage of query proteins correctly classified for different values of the distance threshold when εt = εq.
Figure 5 shows the percentage of query proteins correctly classified to classes (CL) and superfamilies (SF) for different values of c, where εt = εq = 0.01 and 0.02, and w = 3. In all these experiments, we obtained the best results for c = 2 and 3. We achieved up to 96% and 94% correct classification for classes and superfamilies respectively. As c increases, our method starts retrieving proteins from other classes and superfamilies. We set c = 3 for the rest of the experiments. Figure 6 plots the percentage of correctly classified proteins for varying distance thresholds when εt = εq and w = 3. The purpose of this experiment is to understand what a good distance threshold should be when sequence and structure have equal importance. The graph shows that the accuracy of ProGreSS increases when the distance threshold increases from 0.005 to 0.01. At εt = εq = 0.01, ProGreSS achieves 96% and 94% correct classification for classes and superfamilies. As the distance threshold increases, ProGreSS starts retrieving distant proteins and its accuracy drops. Figure 7 shows the percentage of correctly classified superfamilies for different values of εt when εq is fixed and for different values of εq when εt is fixed, for w = 3. This experiment shows the effect of the distance threshold for each of the structure and sequence components separately. When εq is fixed, as εt decreases, the classification quality of ProGreSS increases. This implies that our method can find better results when the distance threshold is small. The highest accuracy obtained is 62%. For εq = 1.0 (i.e., when the sequence component is ignored), ProGreSS performs the worst. This is an important result since it shows that searches based on structure alone would incur more false positives than searches based on both sequence and structure. When εt is fixed, as εq decreases, ProGreSS classifies more proteins correctly. In this case, 94% of the proteins are correctly classified into their superfamilies.
Our method performs the worst when εt = 1.0. This result leads to two important conclusions: 1) Searching by sequence information alone is worse than searching based on
Figure 7: Percentage of query proteins correctly classified for different values of εt (εq) when εq (εt) is fixed.
Figure 8: Percentage of query proteins correctly classified for different values of w.
sequence and structure simultaneously. 2) For purposes of classification, our extraction of feature vectors for sequence is better than that for structure. Figure 8 plots the effect of window size on the classification quality of ProGreSS. The best results are achieved at w = 3. At this window size, ProGreSS can classify 100% and 97% of the classes and superfamilies correctly. ProGreSS performs worse for smaller window sizes since correlations between consecutive residues are not reflected in the index structure. As w becomes larger than 3, ProGreSS starts to miss some of the good results since shorter local matches are not preserved for large w. Finally, Figure 9 compares the accuracy of our technique with CTSS, a recent algorithm that considers structure alone. We show the number of correct proteins (those from the same superfamily as the query protein) for different values of c. CTSS finds 3 out of 10 correct proteins in the first 100 candidates. On the other hand, our method finds the same number of proteins within the first 4 candidates. 4.2 Performance test In this experiment set we compare the performance of our method to CTSS. In order to have fair results, we run CTSS in two phases: 1) the top c candidates are found using the original CTSS code and each candidate is aligned to the query by using SW based on its structure score matrix. 2) The optimal sequence alignments of all the database proteins to the query are determined using SW alignment. For CTSS and ProGreSS, we choose c = 100 and 4 respectively. This is because the quality of their candidates is similar for these values of c (see Figure 9). We run queries for all of the 181 proteins and align only the candidate proteins to each of the query proteins. Figure 10 shows the average time spent by CTSS and our method. The run times for CTSS and SW are 38 and 18 seconds respectively. The graph for CTSS+SW is flat since these methods are independent of η.
ProGreSS runs faster than CTSS+SW for all values of η. For η = 10, ProGreSS runs in only 1.5 seconds (i.e., 37 times faster
Figure 9: Number of proteins found from the same superfamily as the query protein for ProGreSS and CTSS for different values of c.
Figure 10: Comparison of running times of ProGreSS and CTSS+SW.
than CTSS+SW). As η gets smaller, ProGreSS runs slower. This is because when a query box partially overlaps a cell, more MBRs are tested for intersection. As η becomes larger than 10, the performance of ProGreSS drops since the total number of MBRs in the index structure increases. 5 Discussion In this paper, we considered the problem of joint similarity searches on protein sequences and structures. We proposed a sliding-window-based method to extract feature vectors on the sequence and structure components of the proteins. Each feature vector is mapped to a point in a multi-dimensional space. We developed a novel index structure by partitioning the space with the help of a grid, and clustering these points using Minimum Bounding Rectangles (MBRs). For each database protein, our search method finds the number of its feature vectors that are similar to the feature vectors of a given query. We also proposed a new statistical method to compute the significance of the results found at the index search phase. The results are sorted according to their significance and the most promising results are aligned using the Smith-Waterman (SW) method [4] and the least-squares method by Arun et al. [17] to find the optimal alignment. According to the experimental results on a set of representative query proteins, ProGreSS classified all of the classes and 97% of the superfamilies correctly. Our method ran 37 times faster than CTSS, a recent structure search technique, combined with the SW technique for sequences. Combined learning from multiple data sources is an important research problem since each data source provides a correlated yet different type of information about the protein. ProGreSS provides the user wide flexibility in search parameters to assign weights to each of these data types. We believe that the methods discussed in this
paper are an important step toward better understanding the functions of proteins, and will be widely applicable in the area of proteomics. In the future, we would like to include other features into our index structure such as expression arrays and pathways.
References 1. T. C. Wood and W. R. Pearson. Evolution of Protein Sequences and Structures. J. of Mol. Biol., 291(4):977-995, 1999. 2. H. Hegyi and M. Gerstein. The Relationship between Protein Structure and Function: a Comprehensive Survey with Application to the Yeast Genome. J. of Mol. Biol., 288(1):147-164, 1999. 3. J. M. Sauder, J. W. Arthur, and R. L. Dunbrack Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Structure, Function, and Genetics, 40(1):6-22, 2000. 4. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. of Molecular Biology, March 1981. 5. S. Altschul, W. Gish, W. Miller, E. W. Meyers, and D. J. Lipman. Basic local alignment search tool. J. Molecular Biology, 215(3):403-410, 1990. 6. W. Gish and D.J. States. Identification of protein coding regions by database similarity search. Nature Genet., pages 266-272, 1993. 7. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25(17):3389-3402, 1997. 8. T. Madej, J.-F. Gibrat, and S.H. Bryant. Threading a database of protein cores. Proteins, 23:356-369, 1995. 9. L. Holm and C. Sander. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233:123-138, 1993. 10. H.N. Shindyalov and P.E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11(9):739-747, 1998. 11. O. Camoglu, T. Kahveci, and A. K. Singh. Towards Index-based Similarity Search for Protein Structure Databases. In CSB, 2003. 12. T. Can and Y.F. Wang. CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features. In CSB, 2003. 13. M. A. S. Saqi, D. L. Wild, and M. J. Hartshorn.
Protein Analyst - a distributed object environment for protein sequence and structure analysis. Bioinformatics, 15:521-522, 1999. 14. J. An, T. Nakama, Y. Kubota, H. Wako, and A. Sarai. Construction of an Integrated Environment for Sequence, Structure, Property and Function Analysis of Proteins. Genome Informatics, 10:229-230, 1999. 15. C. C. Huang, W. R. Novak, P. C. Babbitt, A. I. Jewett, T. E. Ferrin, and T. E. Klein. Integrated Tools for Structural and Sequence Alignment and Analysis. In PSB, pages 227-238, 2000. 16. H.J. Wolfson and I. Rigoutsos. Geometric hashing: An introduction. IEEE Computational Science & Engineering, pages 10-21, Oct-Dec 1997. 17. K.S. Arun, T.S. Huang, and S.D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):698-700, September 1987. 18. K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is "Nearest Neighbor" Meaningful? In ICDT, pages 217-235, 1999. 19. R.M. Rao and A.S. Bopardikar. Wavelet Transforms: Introduction to Theory and Applications. Addison Wesley, 1998. 20. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-540, 1995.
PREDICTING THE OPERON STRUCTURE OF BACILLUS SUBTILIS USING OPERON LENGTH, INTERGENE DISTANCE, AND GENE EXPRESSION INFORMATION M.J.L. DE HOON¹, S. IMOTO¹, K. KOBAYASHI², N. OGASAWARA², S. MIYANO¹ ¹Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan ²Graduate School of Biological Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
We predict the operon structure of the Bacillus subtilis genome using the average operon length, the distance between genes in base pairs, and the similarity in gene expression measured in time-course and gene disruptant experiments. By expressing the operon prediction for each method as a Bayesian probability, we are able to combine the four prediction methods into a Bayesian classifier in a statistically rigorous manner. The discriminant value for the Bayesian classifier can be chosen by considering the associated cost of misclassifying an operon or a non-operon gene pair. For equal costs, an overall accuracy of 88.7% was found in a leave-one-out analysis for the joint Bayesian classifier, whereas the individual information sources yielded accuracies of 58.1%, 83.1%, 77.3%, and 71.8% respectively. The predicted operon structure based on the joint Bayesian classifier is available from the DBTBS database (http://dbtbs.hgc.jp).
1 Introduction
In prokaryotes, open reading frames (ORFs) belonging to the same operon are transcribed together into a single mRNA molecule. To understand gene regulation in prokaryotic organisms, as a first step it is important to determine the operon structure of their genomes. In addition, as genes in the same operon are likely to be functionally related, the inferred operon structure may reveal the role of currently unknown genes. The distance between two adjacent genes on the same strand of DNA tends to be shorter if they belong to the same operon, and longer if they belong to different operons. Using a list of experimentally known operons, we can determine the discriminant value of the intergenic distance at which an adjacent gene pair is more likely to be an operon pair than a non-operon pair. For the Escherichia coli genome, operon pair predictions using the intergenic distance information were 82% accurate [1,2]. An alternative method of operon prediction is based on gene expression measurements. Using cDNA microarray technology, the expression levels can be measured simultaneously for all genes in the genome by measuring the corresponding mRNA concentrations. In time-course gene expression experiments, the expression levels are measured at several time points following a change in the environment of the organism, such as an increase in the temperature or the salt concentration. In gene disruptant experiments, the steady-state gene expression levels are measured for an organism in which the expression of a specific gene has been disrupted. As genes belonging to the same operon are transcribed into a single mRNA molecule, the degree of similarity in the gene expression profiles of two adjacent genes can be used to assess the likelihood that the gene pair belongs to the same operon. When applied to operon prediction in Escherichia coli using a collection of 72 cDNA microarray experiments to calculate the similarity in gene expression, a sensitivity of 82% was found [3]. Sabatti et al. postulated that gene experiments that perturb a large number of genes offer more information for operon prediction than confined perturbations. Time-course gene expression data may therefore be more suitable for operon prediction than gene disruptant expression data, as changes in the environment of an organism in a time-course experiment are likely to affect a larger number of genes than the disruption of a single gene in a gene disruptant experiment. In practice, the distribution functions of both the intergenic distance and the measured similarity in gene expression exhibit a large degree of overlap for operon gene pairs and non-operon gene pairs, and the choice between operon and non-operon may become ambiguous. The reliability of operon prediction can be improved by considering the intergenic distance and the similarity in gene expression together in a Bayesian posterior probability, which resulted in a sensitivity of 88% for the Escherichia coli genome [3]. For these predictions, a constant (uninformative) prior was used.
To find the true Bayesian posterior probability, we would have to consider the relative abundance of operon pairs in comparison to non-operon pairs. This will give us a base-line rate of finding operon gene pairs among the adjacent gene pairs, depending on the average number of genes per operon. Within a Bayesian framework, we can consider this rate as the prior probability of a gene pair to belong to the same operon, while the intergenic distance and gene expression information are used to calculate the Bayesian posterior probability. As on average an operon in Bacillus subtilis contains more than two genes, there are more operon gene pairs than non-operon gene pairs. Including the prior probability will therefore lead to a more accurate prediction for operon pairs, a less accurate prediction for non-operon pairs, and a higher overall prediction accuracy. To guard against a less accurate prediction for non-operon pairs, we can consider the relative cost of misclassification as an operon pair compared to the cost of misclassification as a non-operon pair. For example, if we want to
verify experimentally the operon boundaries by considering all candidate non-operon gene pairs, the cost of misclassifying a non-operon pair as an operon pair would be relatively high, and we might consider classifying a gene pair as a tentative operon pair even if the posterior probability is somewhat lower than 50%. Here, we use the combination of intergenic distance and similarity in gene expression from 99 gene disruptant experiments and 75 time-course expression measurements to predict the operon structure in Bacillus subtilis. From a list of 635 known operons we found 582 operon pairs and 91 non-operon pairs. Using these data, we predicted the operon structure of Bacillus subtilis, and assessed the overall prediction accuracy and the relative contributions of operon length, intergenic distance, and expression information.

2 Operon structure predictors

2.1 Operon length
Table 1 shows the distribution of the operon length based on our list of 635 known operons. To infer a base-line rate for adjacent gene pairs to belong to the same operon, we would like to fit a statistical model to these measured operon lengths. The simplest statistical model consistent with the data is the geometric distribution:

Pr[operon contains n genes] = p^(n-1) (1 - p)    (1)
Accordingly, we regard operons as being produced by a Bernoulli process with probability p, as shown in Figure 1. A Bernoulli process is the discrete equivalent of a Poisson process, and is the only discrete distribution without memory. Biologically, it means that a priori there is a probability p for each intergenic region to contain a terminator sequence to mark the end of an operon, independent of its length. Using Eq. 1, we can calculate the probability p from the average operon length n̄ as p = (n̄ - 1)/n̄ (Eq. 2), where n̄ = 2.39 is determined from Table 1, leading to a prior probability p = 0.581 of finding an operon pair. Figure 2 shows the distribution of the measured operon lengths, as well as the geometric distribution fitted to it. Note that except for singletons, any known operon will contribute to the set of known operon pairs, while non-operon pairs can only be found if two adjacent operons both happen to be known. Estimating p directly from the number of known operon pairs and known non-operon pairs would therefore lead to a severely biased estimate.
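The moment-matching step of Eq. 2 is a one-liner; the sketch below (our own illustration) computes the prior from a length histogram such as the one in Table 1:

```python
def operon_prior(length_counts):
    """Fit the geometric model Pr[n genes] = p^(n-1) * (1 - p) by moment
    matching: p = (nbar - 1) / nbar, where nbar is the mean operon length.
    length_counts: {operon length in genes: number of operons}."""
    total = sum(length_counts.values())
    nbar = sum(n * c for n, c in length_counts.items()) / total
    return (nbar - 1.0) / nbar
```

With nbar = 2.39, this yields p = 1.39/2.39 ≈ 0.581, the prior quoted above.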
Table 1: Number of genes per operon, calculated from the list of 635 known operons.
Figure 1: The distribution of the operon length can be described in terms of a Bernoulli process with probability p.
Figure 2: The distribution of the operon length, as determined from the list of 635 known operons.
Figure 3: The distribution function of the distance in base pairs between adjacent genes for operon pairs and non-operon pairs.
2.2 Intergenic distance Using the list of known operon and non-operon pairs, we estimated the probability density distribution of the distance between the genes, measured in base pairs, using an estimation procedure based on the Epanechnikov kernel. As some genes partially overlap each other, the intergenic distance is allowed to be negative. Figure 3 shows the inferred probability distribution for operon pairs and non-operon pairs. Whereas the intergenic distance on average is considerably smaller for operon pairs than for non-operon pairs, there is a substantial overlap between the two distribution functions, highlighting the need for additional predictors to distinguish operon pairs from non-operon pairs. 2.3 Gene expression data
As genes that belong to the same operon are transcribed into a single mRNA molecule, we expect their measured expression profiles to be highly similar. In cluster analysis, the Pearson correlation and the Euclidean distance are commonly used to assess the similarity in gene expression profiles. In operon prediction from gene expression data, the Pearson correlation is typically used. However, the theory of discriminant analysis suggests that the Euclidean
Table 2: The time points at which expression measurements were made for the eight time-course experiments of Bacillus subtilis.

Experiment                                  | Measurement time points in minutes
Cold shock                                  | 0, 5, 10, 30, 60, 120
Competence                                  | 0, 60, 120, 180, 240, 300, 360
Glucose, glutamine added during sporulation | 0, 60, 120, 180, 240, 300
Glucose limitation                          | 0, 60, 125, 180, 240
Heat shock                                  | 0, 5, 10, 30, 60
Increased amino-acid availability           | 0, 30, 60, 120, 210, 300, 420, 540
Phosphate, glucose starvation               | 0, 60, 120, 180, 240, 300, 360, 420
Phosphate limitation                        | 0, 55, 115, 175, 235, 295
Salt stress                                 | 0, 5, 10, 30, 60
Sporulation                                 | 0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420, 450, 480, 510, 540
distance would be optimal, given that the expression profiles of gene pairs in the same operon are equal rather than merely correlated. Here, we will apply both the Euclidean distance and the Pearson correlation to evaluate their effectiveness in separating operon pairs from non-operon pairs. We consider the gene expression data measured at 75 time points in total in eight time-course experiments, described in Table 2, together with 99 gene disruptant experiments, listed in Table 3. Genes with more than 50% missing data were removed for the leave-one-out analysis described below. Furthermore, in each disruptant experiment the measured expression levels for the disrupted gene were marked as missing. Global normalization was applied to the remaining genes. Figures 4 and 5 show the distribution functions of the Pearson correlation and the Euclidean distance for known operon and non-operon gene pairs. To guarantee that the probability density function vanishes for distances less than zero, a mirroring technique was used in which the negative of each data point was added to the data set. The probability density function estimated from the padded data set was subsequently multiplied by two and set to zero for negative distances. For the Pearson correlation r, the same mirroring technique was used at r = 1; at r = -1, no mirroring was needed as both probability density functions were already zero. Both figures show a considerable amount of overlap between the distribution functions for operon pairs and non-operon pairs, although the Pearson correlation achieves a slightly better separation.
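A sketch of the Epanechnikov estimate with the mirroring correction described above (our own illustration; the paper does not specify its bandwidth choice, so h is left as a free parameter):

```python
def epanechnikov_kde(data, h):
    """Density estimate with the Epanechnikov kernel
    K(u) = 0.75 * (1 - u^2) for |u| <= 1, bandwidth h."""
    n = len(data)
    def f(x):
        s = 0.0
        for d in data:
            u = (x - d) / h
            if abs(u) <= 1.0:
                s += 0.75 * (1.0 - u * u)
        return s / (n * h)
    return f

def mirrored_kde(data, h, boundary=0.0):
    """Boundary correction by mirroring: add the reflection of every point
    about the boundary, estimate on the padded sample, then double the
    estimate on the valid side and set it to zero beyond the boundary
    (here: Euclidean distances cannot be negative)."""
    padded = list(data) + [2.0 * boundary - d for d in data]
    g = epanechnikov_kde(padded, h)
    return lambda x: 2.0 * g(x) if x >= boundary else 0.0
```

The same construction applies at r = 1 for the Pearson correlation, with the reflection taken about 1 instead of 0.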
Table 3: Disrupted gene in each experiment. The genes degU, sigF, sigW, and veg were each disrupted in two experiments, as indicated here.
abh, abrB, acoR, ahrC, alsR, ansR, araR, azlB, ccpA, citR, citT, codY, comA, comK, cspB, ctsR, degU, degU, deoR, gerE, glcR, glcT, glnR, gntR, gutR, hpr, hrcA, hutP, iolR, lacR, levR, lexA, lmrA, lrpA, lrpC, mtrB, paiA, paiB, phoP, purR, pyrR, rocR, sacT, senS, sigB, sigD, sigE, sigF, sigF, sigG, sigL, sigV, sigW, sigW, sigX, sigY, sigZ, sinR, soj, splA, spo0A, spo0J, spoIIIC, spoIIID, spoVT, tenA, tnrA, treR, veg, veg, xylR, ybbH, ybfA, ycsO, ydbG, yesS, yhdM, yhjM, yjmH, ykoZ, ykuM, yotL, yqhN, ytzE, yyaG, Y9kL, YPG

2.4 Bayesian classifier
From the estimated distribution functions f_OP(d), f_NOP(d) of the intergenic distance d for known operon pairs (OP) and known non-operon pairs (NOP), and the estimated distribution functions g_OP(D), g_NOP(D) of the dissimilarity D between two expression profiles, we construct the joint Bayesian classifier

P(OP | d, D) = p f_OP(d) g_OP(D) / [p f_OP(d) g_OP(D) + (1 - p) f_NOP(d) g_NOP(D)].   (3)
With the prior probability p calculated from the average operon length (Eq. 2), the joint Bayesian classifier is equal to the posterior probability of finding an operon pair. The prediction accuracy will be higher for operon pairs than for non-operon pairs, due to the former being more abundant than the latter in the Bacillus subtilis genome, as parameterized by p. With the uninformative prior (p = 1/2) proposed previously, Eq. 3 is no longer the true Bayesian posterior probability. The uninformative prior leads to an equal accuracy for operon and non-operon pairs, but to a lower overall accuracy. Usually, a gene pair is predicted to belong to the same operon if the posterior probability is more than 1/2, and to different operons if the posterior probability is less than 1/2. Instead, we propose to classify a gene pair as an operon pair if the posterior probability surpasses a certain discriminant value p_D, which is not necessarily equal to 0.5. This allows us to tune the relative accuracy of finding operon pairs or non-operon pairs by choosing the parameter p_D appropriately, depending on how the operon predictions will be used. For example, for terminator sequence prediction we may want to include all gene pairs that have a posterior probability of 30% or more of being a non-operon pair (p_D = 0.7), as requiring a posterior probability of 50% will cause us to miss many potential terminator sequences.

Figure 4: The probability density function of the measured Euclidean distance between the expression log-ratios for known operon and known non-operon gene pairs, as calculated from the combined gene disruptant and time-course gene expression data.
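The posterior defined by this classifier is a two-line computation; the sketch below uses invented density values for a single gene pair (the function name and numbers are ours, not the paper's):

```python
def joint_posterior(f_op_d, f_nop_d, g_op_D, g_nop_D, p):
    """Posterior probability that a gene pair is an operon pair, combining
    the intergenic-distance densities f and the expression-dissimilarity
    densities g under a naive independence assumption."""
    num = p * f_op_d * g_op_D
    den = num + (1.0 - p) * f_nop_d * g_nop_D
    return num / den

# illustrative density values for a single gene pair (made up)
post = joint_posterior(f_op_d=0.8, f_nop_d=0.1, g_op_D=0.6, g_nop_D=0.3, p=0.581)
is_operon_pair = post > 0.7   # a stricter discriminant, p_D = 0.7
```

Raising or lowering the 0.7 threshold is exactly the p_D tuning described above.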
3 Prediction accuracy
The operon prediction accuracy was assessed using a leave-one-out analysis, in which each of the known operon or non-operon pairs was consecutively ignored in the learning phase, followed by a prediction of the operon status of the gene pair that was left out. Using only the operon length information, the Bayesian classifier reduces to the prior probability for all gene pairs. Consequently, all gene pairs are predicted to be operon pairs, resulting in a 100% prediction accuracy for operon pairs, a 0% accuracy for non-operon pairs, and a 58.1% overall prediction accuracy, corresponding to the prior probability p. Table 4 shows the accuracy of predictions based on the intergenic distance, the gene expression data, and on the joint Bayesian classifier, using a discriminant p_D = 0.5 for the posterior probability. The intergenic distance, at an accuracy of 83.1%, is a somewhat more reliable predictor of the operon structure than the gene expression data, which yielded an accuracy of 79.9%.

Figure 5: The distribution of the measured Pearson correlation between the expression log-ratios for known operon and known non-operon gene pairs, as calculated from the combined gene disruptant and time-course gene expression data.

As expected, the joint Bayesian classifier surpasses each of the separate predictors, reaching an accuracy of 88.7%. Here, the similarity in the gene expression profiles was assessed using the Pearson correlation r by defining D = 1 - r. The Euclidean distance yielded a marginally lower prediction accuracy of 88.6% for the joint Bayesian classifier. The time-course gene expression data achieved a better prediction accuracy (77.3%, based on 75 expression measurements) than the gene disruptant experiments (71.8%, based on 99 expression measurements). This is consistent with the conjecture by Sabatti et al.3 that gene expression experiments affecting a large number of genes are more suitable for operon prediction. The combined expression data of the time-course and the gene disruptant experiments achieved an improved prediction accuracy of 79.9%. As in this analysis the cost of misclassifying an operon pair is regarded to be equal to the cost of misclassifying a non-operon pair, the discriminant value for the posterior probability was chosen to be 50%. The prediction accuracy of non-operon pairs can be improved at the expense of a less accurate prediction for operon pairs by increasing the discriminant value p_D, and vice versa. Figure 6 shows the prediction accuracy of the joint Bayesian classifier as a function of the discriminant probability p_D. The optimal overall accuracy is achieved for a discriminant probability less than 0.5, which reflects the fact
Table 4: The accuracy of operon prediction.

Predictor                     Operon pairs   Non-operon pairs   Overall accuracy
Intergenic distance           82.1%          89.0%              83.1%
Gene expression, overall      80.1%          79.1%              79.9%
Time-course experiments       76.8%          80.2%              77.3%
Gene disruptant experiments   69.9%          83.5%              71.8%
Joint Bayesian classifier     88.8%          87.9%              88.7%
that operon pairs are more abundant than non-operon pairs in the Bacillus subtilis genome. Next, we used the joint Bayesian classifier to predict the operon structure of the complete Bacillus subtilis genome, using the Pearson correlation to assess the similarity in the expression profiles. The predicted operon structure is available from the DBTBS database5 in terms of the posterior probability, enabling users to assess the reliability of each prediction, as well as to choose the discriminant value p_D corresponding to their interests. In addition to the predictors described above, we examined the viability of determining the operon structure by finding the σA transcription factor binding site and the terminator sequence motif. For all regions between adjacent gene pairs on the same strand of DNA, we calculated the motif score using the Position Specific Score Matrix for the σA binding site.5 The terminator sequence motif was predicted using dtp, a prediction tool for finding rho-independent transcription terminators.12 Neither of these predictors produced a clear distinction between operon pairs and non-operon pairs, and they were therefore not included in the joint Bayesian classifier. Note that in both cases the aim of the predictor is to find where a motif is located in a given sequence segment, rather than whether a given sequence segment contains the motif. It may therefore be possible to construct better sequence analysis tools for the specific task of operon structure prediction.

4 Conclusion
We predicted the operon structure of the Bacillus subtilis genome by combining operon length, intergenic distance, and gene disruptant and time-course gene expression experiments at an estimated overall accuracy of almost 89%. The intergenic distance information was the most accurate single predictor (83.1%), followed by the time-course gene expression data (77.3%) and the
Figure 6: The prediction accuracy as a function of the choice for the discriminant probability p_D. A large value of p_D corresponds to a high cost of misclassifying a non-operon gene pair. (The plot shows prediction accuracy against the discriminant value p_D for the posterior probability, from 0.0 to 1.0, with separate curves for operon pairs and non-operon pairs.)
gene disruptant data (71.8%). The average operon length was considered in order to determine the baseline probability of finding an operon pair. The distribution of the operon length was modeled by a geometric distribution, which means that a priori there is an equal probability of finding a terminator sequence between any pair of adjacent genes, irrespective of the length of the operons in which those genes are located. The predicted operon structure is available from the DBTBS database.5 In the leave-one-out analysis, we found that assessing the expression similarity using the Euclidean distance does not yield a better separation between operon and non-operon pairs than the Pearson correlation. This is somewhat surprising from the viewpoint of discriminant analysis. The superior results of the Pearson correlation may be due to the error structure in gene expression measurements, or to hitherto unexplained dependencies in the expression level of two adjacent genes in different operons. Similarity measures may exist that are even more suitable for operon prediction than the Pearson correlation.
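Under the geometric operon-length model described above, the prior follows in one line: if a terminator follows each gene independently with probability 1/L, where L is the mean operon length, then an adjacent same-strand pair stays in the same operon with probability 1 - 1/L. A sketch (the helper name and the mean-length value of 2.4 are ours, chosen only so that the prior lands near the reported 58.1%):

```python
def operon_pair_prior(mean_operon_length):
    """Prior probability that two adjacent same-strand genes lie in the
    same operon, assuming a geometric operon-length distribution: a
    terminator follows each gene with probability q = 1/L, so an
    adjacent pair stays together with probability 1 - q."""
    return 1.0 - 1.0 / mean_operon_length

p = operon_pair_prior(2.4)   # illustrative mean operon length
```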
Acknowledgments

We would like to thank Yuko Makita and Mitsuteru Nakao of the University of Tokyo for assisting us with the σA and terminator sequence motif prediction.
References

1. G. Moreno-Hagelsieb and J. Collado-Vides. A powerful non-homology method for the prediction of operons in prokaryotes. In Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology (ISMB 2002), Bioinformatics Supplement 1, pages S329-S336, 2002.
2. H. Salgado, G. Moreno-Hagelsieb, T.F. Smith, and J. Collado-Vides. Operons in Escherichia coli: Genomic analyses and predictions. Proc. Natl. Acad. Sci. USA, 97:6652-6657, 2000.
3. C. Sabatti, L. Rohlin, M.-K. Oh, and J.C. Liao. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res., 30:2886-2893, 2002.
4. S. Okuda, S. Kawashima, and M. Kanehisa. Database of operons in Bacillus subtilis. In Genome Informatics, volume 13, pages 496-497, 2002.
5. Y. Makita and K. Nakai. DBTBS: Database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Research, submitted, 2003. http://dbtbs.hgc.jp.
6. A.L. Sonenshein, J.A. Hoch, and R. Losick. Bacillus subtilis and its closest relatives: From genes to cells. ASM Press, Washington, DC, 2001.
7. J.H. Zar. Biostatistical Analysis. Prentice-Hall, London, 4th edition, 1999.
8. B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
9. M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863-14868, 1998.
10. D.K. Slonim, P. Tamayo, J.P. Mesirov, T. Golub, and E.S. Lander. Class prediction and discovery using gene expression data. In RECOMB 2000, pages 263-272, 2000.
11. M.S. Bartlett and N.W. Please. Discrimination in the case of zero mean differences. Biometrika, 50:17-21, 1963.
12. T. Yada, M. Nakao, Y. Totoki, and K. Nakai. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15(12):987-993, 1999.
COMBINING TEXT MINING AND SEQUENCE ANALYSIS TO DISCOVER PROTEIN FUNCTIONAL REGIONS

E. ESKIN
School of Computer Science and Engineering, Hebrew University
[email protected]. ac.il
E. AGICHTEIN
Department of Computer Science, Columbia University
[email protected]. edu

Recently presented protein sequence classification models can identify relevant regions of the sequence. This observation has many potential applications to detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, as relatively few types of information have enough annotated sequences to perform this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class for the rest of the unlabeled sequences. We then train a joint sequence text classifier over the text contained in the functional annotations of the sequences, and the actual sequences, in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein sub-cellular localization and determining localization specific functional regions of these proteins.
1 Introduction
Supervised learning techniques over sequences have had a tremendous amount of success in modeling proteins. Some of the most widely used methods are Hidden Markov Models (HMMs) to model protein families1,2,3 and neural network techniques for predicting secondary structure4. Recently, a new class of models, which use margin learning algorithms such as the Support Vector Machine (SVM) algorithm, has been applied to modeling protein families. These models include the spectrum kernel5 and mismatch kernel6, which have been shown to be competitive with state-of-the-art methods for protein family classification. These methods represent sequences as collections of short substrings of length k, or k-mers. One property of these classifiers is that we can examine the trained models generated by these methods and discover which k-mers are the most important for discriminating between the
classes. By projecting these k-mers onto the original sequences, we can discover which regions of the protein specifically correspond to the class and potentially discover the relevant functional region of the protein. In a recent paper, it has been shown that some of the k-mers with the highest weights in a protein family classification model correspond to known motifs of the protein family7. This technique is general in that it can be applied to determine the relevant functional region of a set of proteins given a set of example proteins, by creating a data set where the examples of the class of proteins are positive training examples and a sampling of other proteins are negative examples. However, despite the large size of protein databases and the large amount of annotated proteins, very few types of information are sufficiently annotated to generate a large enough training set of proteins to perform this analysis. For example, consider the sub-cellular localization of proteins. Only a very small fraction of the database, 15%, is annotated with sub-cellular localization, despite the fact that 35% of the database is annotated with functional annotation which corresponds to localization. If we can somehow use the functional annotation as a proxy for localization information, we can then apply our analysis to identify the regions of the proteins that are specific to each sub-cellular location. In their recent work, Nair and Rost8 defined a method for inferring localization information from the functional annotation which greatly influenced our work. In this paper, we introduce a framework that combines text mining over database annotations with sequence learning to both classify proteins and determine the functional regions specific to the classes. Our framework is designed specifically for the case when we are given a relatively small set of example sequences compared to a much larger amount of text-annotated, yet unlabeled, sequences.
Our framework learns how the text is correlated with the labels and jointly learns over sequences and text of both the example (labeled) and unlabeled (yet annotated) examples. The output of the learning is a sequence classifier which can be used to identify the regions in the proteins specific to the class. We demonstrate our method with a proof of concept application to identify regions correlated to sub-cellular localization. We choose sub-cellular location as the proof of concept application because two recent works by Nair and Rost show that functional annotations of proteins correlate with localization and localization can be inferred from sequences. Using the small set of labeled examples and sequences as a seed, we train a text classifier to predict the sub-cellular localization based on the functional annotations, similar to the approach presented in Nair and Rost, 2002. This effectively augments the seed set of labeled sequences with a larger set of sequences with predicted localizations. We then jointly learn a sequence and text classifier over the extended dataset.
Figure 1: Framework for Extending and Combining Textual Annotations with Sequence Data. (Step 1: extend the training set by exploiting text annotations. Step 2: exploit both text and sequence information in the extended training set.)
This is similar to the work by Nair and Rost, 2002, where they showed that sequence homology can be used to predict localization. Finally, we then use the sequence model to identify the localization specific regions of the proteins. Preliminary analysis of the regions shows that some correspond with known biological sites, such as DNA-binding sites for the nuclear proteins.
2 Methods

2.1 Framework Overview
The framework for discovering functional regions of proteins given a set of examples of the protein consists of several steps, as shown in Figure 1. First, we create a seed dataset which consists of the labeled proteins as positive training examples and a sampling of other proteins as the negative examples. Using this seed set, we train a text classifier over the annotations of the sequences. Then, using the text classifier, we predict over the database additional sequences which correspond to the class. Using this extended dataset, we train a joint sequence and text classifier. By projecting the classifier onto our original sequences, we can identify which regions of the protein have a high positive weight with respect to the class corresponding to the example proteins and are likely candidates for the relevant functional region of the protein. The input to our framework is a set of examples of the proteins, and the output is a joint text sequence classifier for predicting other examples of that protein and predictions for regions in the original proteins which correspond to the common function of the example set of proteins.
2.2 Extending the Seed Dataset

A significant problem in machine learning is the scarcity of training data. Insufficient training data often prevents machine learning techniques from achieving acceptable accuracy. In this section we present an application of text classification that allows us to automatically construct a comprehensive training set by starting with the initial smaller seed set of labeled sequences. Combining labeled and unlabeled examples is a topic that has been thoroughly studied in the machine learning community (see, e.g., Blum and Mitchell, 199810 and Tsuda et al.11 and the references therein for a starting point). The simple approach that we describe below was sufficient for our task, and we plan to explore more sophisticated approaches in our future work. To extend the training set, we exploit the large amount of textual information often associated with a sequence. For example, SWISS-PROT12 provides rich textual annotations for each entry in the database. Unfortunately, these annotations are difficult to compile and maintain, and as a result important information is often missing for many entries (e.g., the localization information). However, we can sometimes deduce this missing information from the textual annotations that happen to be present for a database entry. This general approach was presented in Nair and Rost8. The predictions for the unknown sequences rely on some form of classifying the textual annotations. After training over a number of labeled training examples, text classifiers can successfully predict the correct class of unlabeled texts. We represent the text using a bag of words model, where each text annotation is mapped to a vector containing the frequency of each word. As the actual classifier, we use RIPPER13, a state of the art text classification system. RIPPER operates by learning rules to describe the text in the training examples, and then applies these rules to predict the appropriate classification of new, unlabeled texts.

2.3 Training a Joint Sequence Text Classifier
Each protein record consists of the sequence and the text from its functional annotation. We construct a classifier to predict members of the class of proteins corresponding to the example proteins by learning from both the text and the sequences. In order to learn from the text and sequences jointly, we use a kernel framework. Both sequences and text are mapped to points in a feature space, which is a high dimensional vector space. A kernel for both sequences and text allows us to efficiently compute inner products between points in the space. Using this kernel, we apply the SVM algorithm to train. The kernel, described below, is constructed in such a way as to take into account interactions between the text and sequences during the learning, which results in a true joint sequence text classifier.
Text Kernel

The feature space for the text portion of a protein record uses the bag of words representation described above. The feature space corresponding to the kernel is a very high dimensional space where each dimension corresponds to a specific word. Each word w corresponds to a vector in the space, φ_T(w), where the value of the vector is 1 for the word's dimension and 0 for all other dimensions. A text string x is mapped to a vector which is the sum of the vectors corresponding to the words in the text, φ_T(x) = Σ_{w∈x} φ_T(w). Although the dimension of the feature space is very large, the vectors corresponding to the text strings are very sparse. We can take advantage of this to compute inner products between points in the feature space very efficiently. For two text annotations x and y, we denote the text kernel to be

K_T(x, y) = φ_T(x) · φ_T(y).
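In code, the bag-of-words map and the text kernel reduce to a sparse dot product of word counts. A minimal sketch (function names and the example annotations are ours):

```python
from collections import Counter

def phi_T(text):
    """Bag-of-words feature map: a sparse vector of word frequencies."""
    return Counter(text.lower().split())

def K_T(x, y):
    """Text kernel: inner product of two sparse word-count vectors.
    Only words occurring in both annotations contribute."""
    fx, fy = phi_T(x), phi_T(y)
    return sum(count * fy[word] for word, count in fx.items())

a = "nuclear protein binds DNA"
b = "DNA binding nuclear localization"
k = K_T(a, b)   # shared words "nuclear" and "dna" contribute 1 each
```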
Sequence Kernel
Sequences are also represented as points in a high dimensional feature space. Sequences are represented as a collection of their substrings of a fixed length k (or k-mers), obtained by sliding a window of length k across the length of the sequence. The simplest sequence feature space contains a dimension for each possible k-mer, for a total dimension of 20^k. For a k-mer a, the image of the k-mer in the sequence feature space, φ_S(a), has the value 1 for the k-mer a and the value 0 for the other dimensions. The image of a sequence x is the sum of the images of its k-mers, φ_S(x) = Σ_{a∈x} φ_S(a). This sequence representation is equivalent to the k-spectrum kernel5. An advantage of this representation is that we can compute kernels or inner products of points in the feature space very efficiently using a trie data structure5. In practice, because of mutations in the sequences, exactly matching k-mers between sequences are very rare. In order to more effectively model biological sequences, we use the sparse kernel sequence representation that allows for approximate matching. The sparse kernel is similar in flavor to the mismatch kernel and is fully described elsewhere14,15. Consider two sequences of length k, a and b. Each sequence consists of a single substring. The sparse kernel defines a mapping into a feature space which has the following property

φ(a) · φ(b) = α^{d_H(a,b)},   (1)
where d_H(a, b) is the Hamming distance between substrings a and b, and 0 < α < 1 is a parameter in the construction of the mapping. If the two substrings are identical, then the Hamming distance is zero and the substrings contribute 1 to the inner product of the sequences, exactly as in the spectrum kernel. However, if the Hamming distance is greater than zero, the similarity is reduced by a factor of α for every mismatch. Details of the sparse kernel implementation are described elsewhere14,15.

Combining Text and Sequences

We can use the framework of kernels to define a feature space which allows for interactions between sequences and text annotations. In our approach, we use a very simple method for combining the text and sequence classifiers. There exists a vast literature in machine learning on alternative techniques for this problem. We now define our combined kernel

K_C(x, y) = K_T(x, y) + K_S(x, y) + (K_T(x, y) + K_S(x, y))^2.

The first two terms effectively include the two feature spaces of text and sequences. The third term is a degree two polynomial kernel over the sum of the two kernels. If we explicitly determine the feature map for the combined kernel, the third term would include features for all pairs of sequences and words. Since the classifier trains over this space, it effectively learns from both sequence and text and the interactions between them.
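A sketch of the combined kernel on toy inputs, using exact-matching k-spectrum counts in place of the sparse kernel for brevity; here K_C takes the two already-computed kernel values as arguments (all names, sequences, and numbers are ours):

```python
from collections import Counter

def spectrum(seq, k=3):
    """k-spectrum feature map: counts of all length-k substrings."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def K_S(x, y, k=3):
    fx, fy = spectrum(x, k), spectrum(y, k)
    return sum(count * fy[kmer] for kmer, count in fx.items())

def K_C(kt, ks):
    """Combined kernel: the two base kernels plus a degree-two
    polynomial term over their sum, which implicitly adds features
    for all pairs of words and k-mers."""
    s = kt + ks
    return s + s ** 2

ks = K_S("MKVLAA", "KVLAAG")   # shared 3-mers: KVL, VLA, LAA
kc = K_C(2, ks)                # with a text-kernel value of 2
```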
Support Vector Machines

Support Vector Machines (SVMs) are a type of supervised learning algorithm first introduced by Vapnik16. Given a set of labeled training vectors (positive and negative input examples), SVMs learn a linear decision boundary to discriminate between the two classes. The result is a linear classification rule that can be used to classify new test examples. Suppose our training set consists of labeled input vectors (x_i, y_i), i = 1...m, where x_i ∈ R^n and y_i ∈ {±1}. We can specify a linear classification rule f by a pair (w, b), where w ∈ R^n and b ∈ R, via

f(x) = w · x + b,   (2)

where a point x is classified as positive (negative) if f(x) > 0 (f(x) < 0). Such a classification rule corresponds to a linear (hyperplane) decision boundary between positive and negative points. The SVM algorithm computes a hyperplane that satisfies a trade-off between maximizing the geometric margin, which is the distance between positive and negative labeled points, and the training errors. A key feature of any SVM optimization problem is that it is equivalent to solving a dual quadratic programming problem that depends only on the inner products x_i · x_j of the training vectors, which allows for the application of kernel techniques. For example, by replacing x_i · x_j by K_C(x_i, x_j) in the dual problem, we can use SVMs in our combined text sequence feature space.
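The point of the dual formulation is that the learner touches the data only through the Gram matrix. The sketch below substitutes a kernel perceptron for the SVM optimizer, since it exhibits exactly the same inner-products-only dependence while keeping the code short (all names and toy data are ours; any kernel, including K_C, could supply the Gram matrix):

```python
import numpy as np

def train_kernel_perceptron(K, y, epochs=10):
    """Dual learner: like the SVM dual, it sees the data only through
    the Gram matrix K[i, j] = k(x_i, x_j)."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            if np.sign(np.dot(alpha * y, K[:, i])) != y[i]:
                alpha[i] += 1.0   # dual update on a mistake
    return alpha

def decide(alpha, y, k_col):
    """Classify a point given its kernel values against the training set."""
    return np.sign(np.dot(alpha * y, k_col))

X = np.array([[0.0], [0.2], [2.0], [2.2]])
y = np.array([-1, -1, 1, 1])
K = (X @ X.T + 1.0) ** 2   # toy polynomial kernel as a stand-in
alpha = train_kernel_perceptron(K, y)
```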
2.4 Predicting Relevant Functional Regions

Once a SVM is trained over a set of data, the classifier is represented in its dual form as a set of support vector weights s_i, one for each training example x_i. The form of the SVM classifier is

f(x) = Σ_i s_i K(x, x_i),   (3)

which can be represented in the primal form as

f(x) = φ(x) · Σ_i s_i φ(x_i) = φ(x) · w,   (4)

where w = Σ_i s_i φ(x_i) is the SVM hyperplane. By explicitly computing φ(x_i) we can compute w directly. In the case of sequences, this can be efficiently implemented using the same data structures used for computing kernels between sequences14. We are interested in the sequence-only portion of the feature space. For the sequence portion, w has a weight for every possible k-mer. The score can be interpreted as a measure of how discriminative the k-mer is with respect to the classifier. High positive scores correspond to k-mers that tend to occur in the example set and not in other proteins. We define the score for a region on the protein as the sum of the k-mer scores contained in the region. If a region score is above a threshold, we predict that the region is a potential functional region associated with the example proteins.
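The region scoring just described can be sketched directly from the primal weights. Everything below is invented for illustration: the support coefficients are made up, and the lysine/arginine-rich toy sequence merely mimics a nuclear localization signal.

```python
from collections import Counter

def kmer_weights(support_seqs, support_coeffs, k=3):
    """Sequence portion of the primal weight vector:
    w[kmer] = sum_i s_i * (count of kmer in support sequence x_i)."""
    w = Counter()
    for seq, s in zip(support_seqs, support_coeffs):
        for i in range(len(seq) - k + 1):
            w[seq[i:i + k]] += s
    return w

def region_scores(seq, w, window=5, k=3):
    """Score each window as the sum of its k-mer weights; windows above
    a threshold are candidate functional regions."""
    kmer_score = [w[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    width = window - k + 1
    return [sum(kmer_score[i:i + width])
            for i in range(len(seq) - window + 1)]

w = kmer_weights(["AKRKRKA", "GGAGGAG"], [1.0, -1.0])
scores = region_scores("MAAKRKRKAAM", w)
best = max(range(len(scores)), key=scores.__getitem__)   # window start
```

The highest-scoring window picks out the basic-residue stretch, which is the behavior the projection step relies on.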
3 Results for Protein Localization
We evaluate our framework in three ways. First we measure the accuracy of extending the set of labeled examples. Second, we evaluate the joint text sequence classifier over 20% of the annotated localization data. This data was held out of the training in all steps of the framework. We evaluate the accuracy of predicting localization from the functional class over this data. We also evaluate the joint sequence text classifier over this data and compare it to a text only and sequence only classifier. Finally, we perform a preliminary
Localization    Annotated count   Predicted count   Precision   Recall
cytoplasm       4,976             28,318            0.869       0.705
nuclear         3,843             10,504            0.940       0.790
mitoch          1,925             6,996             0.823       0.656
chloroplast     1,693             3,414             0.869       0.705
extracel        755               7,724             0.728       0.474
endoplas        655               2,742             0.696       0.538
perox           174               810               0.442       0.217
golgi           160               914               0.805       0.559
lyso            167               1,004             0.654       0.530
vacuolar        97                470               0.579       0.112
Total           14,454            62,806

Figure 2: (a) Explicitly annotated localization, and localization predicted based on textual information, for SWISS-PROT 4.0 (the table above). (b) Precision vs. recall of the text classifier using keywords only, vs. field-specific text annotations, vs. all available text annotations. (The plot in (b) spans precision 65-100% and recall 20-80%.)
analysis of the predictions of the regions in the proteins relevant to localization. We specifically examine nuclear localization signals, since many of these are well known and there are readily available databases which we can use to verify our predictions.
3.1 Data Description

We use SWISS-PROT 4.0,12 a large database of sequences and associated textual annotations. In this proof of concept application, we focus on the specific task of inferring sub-cellular localization. A fraction of the sequences in SWISS-PROT have associated annotations that explicitly state their sub-cellular localization. We report the number of sequences with explicitly annotated localization of each type in Figure 2(a). As we can see, out of more than 100,000 entries in SWISS-PROT, less than 15% have explicit localization information.
3.2 Increasing the Set of Localization Annotated Sequences

We can increase the amount of information available to a learner by augmenting the explicitly labeled examples with unlabeled data. Useful information relevant to localization is often contained within unlabeled text annotations. By learning to recognize the textual annotations associated with localizations, we can assign localization labels to the unlabeled text annotated sequences. This general approach for predicting localization of unlabeled, but annotated, sequences is presented in Nair and Rost8. In their approach, the training focuses on detecting a set of discriminating keywords. If such a keyword is present, the sequence is predicted to belong to the appropriate class. In
this work we used RIPPER13, a rule-based classifier, to learn rules to predict the localization of a SWISS-PROT entry based on textual annotations. The classifier was trained over the 14,454 explicitly annotated sequences. The derived rules were then used to predict the localization of the remaining (unlabeled) SWISS-PROT entries. The approach described in Nair and Rost, 20028 focuses on carefully selected and assigned keyword annotations, and does not consider the unstructured annotations that are often available for the sequences. Text classification systems such as RIPPER implement sophisticated feature selection algorithms, and can be trained over the noisy, but potentially informative, unstructured data. To evaluate this hypothesis, we varied the types of textual annotations available to the classifier. We compared the quality of prediction based only on the keywords information, as used in Nair and Rost8, to the prediction accuracy achieved by considering other text fields, such as descriptions, and finally with using all of the available textual annotations for the sequence. The experimental results for varying the type of textual annotation are reported in Figure 2(b). While the specific evaluation setup and methodology that we used is slightly different from the evaluation of Nair and Rost for the same task, the overall results for keywords-based classification appear comparable. As we can see in Figure 2(b), considering all of the available textual annotations significantly increases both the recall and the precision of predicting the localization of unknown sequences. For example, at the precision level of 80%, using all of the text annotations allows RIPPER to achieve significantly higher recall. Therefore, for the remainder of this paper our text classifier considers all of the textual annotations that are available for each SWISS-PROT entry. The counts of the automatically predicted SWISS-PROT entries are reported in Figure 2(a).
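Rule-based text classification of the kind RIPPER performs can be sketched with a few ordered keyword rules. The rules below are hand-written stand-ins for learned ones, and all names and annotations are ours, purely for illustration:

```python
def predict_localization(annotation, rules):
    """Apply ordered keyword rules RIPPER-style: the first rule whose
    keywords all occur in the annotation fires; otherwise abstain."""
    words = set(annotation.lower().split())
    for keywords, label in rules:
        if set(keywords) <= words:
            return label
    return None

# illustrative rules, not the ones RIPPER actually learned
rules = [(["nuclear"], "nuclear"),
         (["secreted"], "extracel"),
         (["mitochondrial", "matrix"], "mitoch")]

lab = predict_localization("putative nuclear DNA-binding protein", rules)
```

A real learned rule set would be induced from the 14,454 annotated examples rather than written by hand.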
We also report the precision and recall of the classifier, evaluated over the held-out data using cross-validation. These accuracy figures serve as an estimate of the accuracy, or the "quality", of the resulting extended training set. Note that while the text classifier introduces some noise into the training set, the extended training set, at over 62,000 examples, is significantly larger than the original training set. This extended, automatically labeled training set can now be used to train a better joint text and sequence classifier.
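The keyword-based label-extension idea can be sketched as follows. This is an illustrative toy sketch, not the actual RIPPER system: the rule-learning criterion, the `min_ratio` threshold, and the toy annotations are all assumptions.

```python
def learn_keywords(labeled, min_ratio=2.0):
    """For each class, pick keywords that appear much more often inside the
    class than outside it. `labeled` maps an entry id to (keywords, label)."""
    from collections import Counter, defaultdict
    in_class, out_class = defaultdict(Counter), defaultdict(Counter)
    labels = {lab for _, lab in labeled.values()}
    for kws, lab in labeled.values():
        for kw in kws:
            for l in labels:
                (in_class if l == lab else out_class)[l][kw] += 1
    return {l: {kw for kw, c in in_class[l].items()
                if c >= min_ratio * (out_class[l][kw] + 1)}
            for l in labels}

def predict(rules, keywords):
    """Assign the first class whose discriminating keyword set is hit."""
    for label, kws in rules.items():
        if kws & set(keywords):
            return label
    return None

# Hypothetical toy annotations (not real SWISS-PROT entries).
annotations = {
    "Q00001": (("nucleus", "dna-binding"), "nuclear"),
    "Q00002": (("nucleus",), "nuclear"),
    "Q00003": (("signal", "membrane"), "extracellular"),
    "Q00004": (("signal", "secreted"), "extracellular"),
}
rules = learn_keywords(annotations)
assert predict(rules, ("nucleus", "rna-binding")) == "nuclear"
```

Real rule induction (RIPPER) learns ordered condition lists rather than single-keyword tests, but the labeling flow is the same: train on the annotated entries, then label the rest.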
3.3 Evaluating the Joint Text-Sequence Classifier Over the extended data described in Section 3.2, we performed experiments to measure the improvement of the classifier when considering text and sequences
Table 1: Comparison of the text-only classifier, the sequence-only classifier, and the joint classifier for each localization category. Each classifier is evaluated by computing the ROC50 score.

| Localization Category | Text Classifier | Sequence Classifier | Joint Classifier |
| cytoplasm | 0.91 | 0.86 | 0.93 |
| nuclear | 0.94 | 0.91 | 0.97 |
| mitoch | 0.96 | 0.91 | 0.99 |
| chloroplast | 0.96 | 0.96 | 0.96 |
| extracel | 0.92 | 0.93 | 0.95 |
| endoplas | 0.89 | 0.94 | 0.96 |
| perox | 0.93 | 0.88 | 0.95 |
| golgi | 0.91 | 0.83 | 0.93 |
| lyso | 0.93 | 0.99 | 0.99 |
| vacuolar | 0.94 | 0.94 | 0.94 |
together. We ran three experiments by leaving out 20% of the original annotated sequence data as a test set and using the remaining data as a training set. We trained three models on the training set: a text-only classifier, a sequence-only classifier, and a joint sequence-text classifier. For all three classifiers, we used the SVM algorithm, the only difference being the choice of kernel. The text classifier uses the text kernel K_T(x, y), the sequence classifier uses the sequence kernel K_S(x, y), and the combined classifier uses the kernel K_C(x, y). For each class, we used all of the members of the class as positive examples and a sampling of the remaining classes as negative examples. For each of the classes of localization data, we report the classifiers' performance over the test data in Table 1. We use ROC50 scores to compare the performance of the different methods. The ROC50 score is the area under the receiver operating characteristic curve, the plot of true positives as a function of false positives, up to the first 50 false positives [17]. A score of 1 indicates perfect separation of positives from negatives, whereas a score of 0 indicates that none of the top 50 sequences (or text annotations) selected by the algorithm were positives.
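The ROC50 score can be computed directly from its definition; this sketch is illustrative (the normalization convention when fewer than 50 false positives exist is an assumption):

```python
def roc50_score(scores, labels, max_fp=50):
    """Area under the ROC curve up to the first `max_fp` false positives,
    normalized so that perfect separation scores 1.0 and a ranking whose
    top hits are all negatives scores 0.0."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for l in labels if l)
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp        # one unit-width column per false positive seen
            if fp == max_fp:
                break
    if fp == 0 or n_pos == 0:
        return 1.0 if n_pos else 0.0
    return area / (fp * n_pos)

# Perfect ranking: every positive outranks every negative.
assert roc50_score([0.9, 0.8, 0.2, 0.1], [True, True, False, False]) == 1.0
# Worst ranking: the top-ranked items are all negatives.
assert roc50_score([0.9, 0.8, 0.2, 0.1], [False, False, True, True]) == 0.0
```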
3.4 Identifying Regions Relevant to Localization We made predictions for regions correlated with localization using the method described in Section 2.4. Of all the localization signals, nuclear localization signals are the best characterized and have a searchable database of signals, the NLS database [18], so we restricted our evaluation to these signals. We examined the 20 highest-scoring non-overlapping regions, compared them to the NLS database, and found 8 common signals. Table 2 shows the eight predicted regions and the corresponding entries from the NLS database.
Table 2: Eight predicted regions corresponding to nuclear localization and the corresponding entries from the NLS database. The signal entry is a database signal that is close to the predicted signal. The origin describes whether the signal was experimentally verified or predicted according to the database, and the reference is the corresponding reference for the predicted signals. References: (A) Bouvier, D., Badacci, G., Mol. Biol. Cell, 1995, 6, 1697-705. (B) Youssoufian, H. et al., Blood Cells Mol. Dis., 1999, 25, 305-9. (C) Truant, R., Cullen, B.R., Mol. Cell. Biol., 1998, 18, 1449-1458.

| Predicted Region | NLS Signal | Origin |
| KKKKKKK | | |
| RKRKK | RfCRKK | (A) |
| KKEKKEKKDKKEKKEKKEKKDKKEKKEKKEKK | KKEKKKSKK | (B) |
| GGGTGGTGTGTGGG | RGGRGRGRG | predicted |
| QRFTQRGGGAVGKNRRGGRGGNRGGRNNNSTR | GGGxxxKNRRxxxxxxRGGRN | (C) |
| EVLKVQKRRIYD | [FL]KxxKRR | predicted |
| LSGGTPKRCLDLSNLS | T[PLV]KRC | predicted |

4 Discussion
We have presented a framework for combining textual annotations and sequence data to determine the relevant functional regions of a set of example proteins. Since a set of examples large enough to perform this kind of analysis is often difficult to obtain, we use a general approach that extends the original training set by exploiting textual annotations. This results in a significantly larger set of labeled examples. We can then train a joint text and sequence classifier over the extended training set, and subsequently project the classifier onto the original sequences to identify the relevant regions. We have shown how we can recover nuclear localization signals using this analysis. The framework takes advantage of recent sequence classification models which are based on analysis of subsequences of the protein and which, for each position in the sequence, can determine how relevant that position is to predicting the class. We have applied the framework to sub-cellular localization of proteins. We plan to explore alternative ways of combining textual and sequence information using our general approach, as well as a more thorough analysis of the localization predictions. We also plan to apply our framework to determine relevant regions for other properties of proteins.

References
1. S. R. Eddy. Multiple alignment using hidden Markov models. In C. Rawlings, editor, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pages 114-120. AAAI Press, 1995.
2. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501-1531, 1994.
3. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
4. B. Rost and C. Sander. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19(1):55-72, 1994.
5. C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing (PSB), Kaua'i, Hawaii, 2002.
6. C. Leslie, E. Eskin, and W. S. Noble. Mismatch string kernels for SVM protein classification. In Proceedings of Advances in Neural Information Processing Systems 15 (NIPS), 2002.
7. C. Leslie, E. Eskin, A. Cohen, and W. S. Noble. Mismatch string kernels for SVM protein classification. Technical report, Columbia University, 2003.
8. R. Nair and B. Rost. Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1:S78-S86, Jul 2002.
9. R. Nair and B. Rost. Sequence conserved for subcellular localization. Protein Sci., 11(12):2836-47, Dec. 2002.
10. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
11. K. Tsuda, S. Akaho, and K. Asai. The EM algorithm for kernel matrix completion with auxiliary data. Journal of Machine Learning Research, 4:67-81, 2003.
12. A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J. Mol. Med., 75:312-316, 1997.
13. W. W. Cohen. Fast effective rule induction. In International Conference on Machine Learning, 1995.
14. E. Eskin and S. Snir. A biologically motivated sequence embedding into Euclidean space. Technical report, Hebrew University, 2003.
15. E. Eskin, W. S. Noble, Y. Singer, and S. Snir. A unified approach for sequence prediction using sparse sequence models. Technical report, Hebrew University, 2003.
16. V. N. Vapnik. Statistical Learning Theory. Springer, 1998.
17. M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25-33, 1996.
18. R. Nair, P. Carter, and B. Rost. NLSdb: database of nuclear localization signals. Nucleic Acids Research, 31(1):397-9, Jan 2003.
KERNEL-BASED DATA FUSION AND ITS APPLICATION TO PROTEIN FUNCTION PREDICTION IN YEAST

G. R. G. LANCKRIET
Division of Electrical Engineering, University of California, Berkeley

M. DENG
Department of Biological Sciences, University of Southern California

N. CRISTIANINI
Department of Statistics, University of California, Davis

M. I. JORDAN
Division of Computer Science, Department of Statistics, University of California, Berkeley

W. S. NOBLE
Department of Genome Sciences, University of Washington
Abstract Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously-described Markov random field method, and better than the SVM trained on any single type of data.
Online supplement at noble.gs.washington.edu/yeast

1 Introduction Much research in computational biology involves drawing statistically sound inferences from collections of data. For example, the function of an unannotated protein sequence can be predicted based on an observed similarity between that protein sequence and the sequence of a protein of known function. Related methodologies involve inferring related functions of two proteins if they occur in fused form in some other organism, if they co-occur in multiple
species, if their corresponding mRNAs share similar expression patterns, or if the proteins interact with one another. It seems natural that, while all such data sets contain important pieces of information about each gene or protein, the comparison and fusion of these data should produce a much more sophisticated picture of the relations among proteins, and a more detailed representation of each protein. This fused representation can then be exploited by machine learning algorithms. Combining information from different sources contributes to forming a complete picture of the relations between the different components of a genome. This paper presents a computational and statistical framework for integrating heterogeneous descriptions of the same set of genes, proteins or other entities. The approach relies on the use of kernel-based statistical learning methods that have already proven to be very useful tools in bioinformatics [1]. These methods represent the data by means of a kernel function, which defines similarities between pairs of genes, proteins, etc. Such similarities can be quite complex relations, implicitly capturing aspects of the underlying biological machinery. One reason for the success of kernel methods is that the kernel function takes relationships that are implicit in the data and makes them explicit, so that it is easier to detect patterns. Each kernel function thus extracts a specific type of information from a given data set, thereby providing a partial description or view of the data. Our goal is to find a kernel that best represents all of the information available for a given statistical learning task.
Given many partial descriptions of the data, we solve the mathematical problem of combining them using a convex optimization method known as semidefinite programming (SDP) [2]. This SDP-based approach [3] yields a general methodology for combining many partial descriptions of data that is statistically sound, as well as computationally efficient and robust. In order to demonstrate the feasibility of these methods, we address the problem of predicting the functions of yeast proteins. Following the experimental paradigm of Deng et al. [4], we use a collection of five publicly available data sets to learn to recognize 13 broad functional categories of yeast proteins. We demonstrate that incorporating knowledge derived from amino acid sequences, protein complex data, gene expression data and known protein-protein interactions significantly improves classification performance relative to our method trained on any single type of data, and relative to a previously described method based on a Markov random field model [4].
2 Related Work
Considerable work has been devoted to the problem of automatically integrating genomic datasets, leveraging the interactions and correlations between them to obtain more refined and higher-level information. Previous research in this field can be divided into three classes of methods. The first class treats each data type independently. Inferences are made separately from each data type, and an inference is deemed correct if the various data agree. This type of analysis has been used to validate, for example, gene expression and protein-protein interaction data [5,6,7], to validate protein-protein interactions predicted using five different methods [8], and to infer protein function [9]. A slightly more complex approach combines multiple data sets using intersections and unions of the overlapping sets of predictions [10]. The second formalism to represent heterogeneous data is to extract binary relations between genes from each data source, and represent them as graphs. As an example, sequence similarity, protein-protein interaction, gene co-expression or closeness in a metabolic pathway can be used to define binary relations between genes. Several groups have attempted to compare the resulting gene graphs using graph algorithms [11,12], in particular to extract clusters of genes that share similarities with respect to different sorts of data. The third class of techniques uses statistical methods to combine heterogeneous data. For example, Holmes and Bruno use a joint likelihood model to combine gene expression and upstream sequence data for finding significant gene clusters [13]. Similarly, Deng et al.
use a maximum likelihood method to predict protein-protein interactions and protein function from three types of data [14]. Alternatively, protein localization can be predicted by converting each data source into a conditional probabilistic model and integrating via Bayesian calculus [15]. The general formalism of graphical models, which includes Bayesian networks and Markov random fields as special cases, provides a systematic methodology for building such integrated probabilistic models. As an instance of this methodology, Deng et al. developed a Markov random field model to predict yeast protein function [4]. They found that the use of different sources of information indeed improved prediction accuracy when compared to using only one type of data. This paper describes a fourth type of data fusion technique, also statistical, but of a more nonparametric and discriminative flavor. The method, described in detail below, consists of representing each type of data independently as a matrix of kernel similarity values. These kernel matrices are then combined to make overall predictions. An early example of this approach, based on fixed sums of kernel matrices, showed that combinations of kernels can yield improved gene classification performance in yeast, relative to learning from a single kernel matrix [16]. The current work takes this methodology further: we use a weighted linear combination of kernels, and demonstrate how to estimate the kernel weights from the data. This yields not only predictions that reflect contributions from multiple data sources, but also an indication of the relative importance of these sources. The graphical model formalism, as exemplified by the Markov random field model of Deng et al., has several advantages in the biological setting. In particular, prior knowledge can be readily incorporated into such models, with standard Bayesian inference algorithms available to combine such knowledge with data. Moreover, the models are flexible, accommodating a variety of data types and providing a modular approach to combining multiple data sources. Classical discriminative statistical approaches, on the other hand, can provide superior performance in simple situations, by focusing explicitly on the boundary between classes, but tend to be significantly less flexible and less able to incorporate prior knowledge. As we discuss in this paper, however, recent developments in kernel methods have yielded a general class of discriminative methods that readily accommodate non-standard data types (such as strings, trees and graphs), allow prior knowledge to be brought to bear, and provide general machinery for combining multiple data sources.

3 Methods and Approach
Kernel Methods. Kernel methods work by embedding data items (genes, proteins, etc.) into a vector space F, called a feature space, and searching for linear relations in such a space. This embedding is defined implicitly, by specifying an inner product for the feature space via a positive semidefinite kernel function: K(x1, x2) = ⟨Φ(x1), Φ(x2)⟩, where Φ(x1) and Φ(x2) are the embeddings of data items x1 and x2. Note that if all we require in order to find those linear relations are inner products, then we do not need to have an explicit representation of the mapping Φ, nor do we even need to know the nature of the feature space. It suffices to be able to evaluate the kernel function, which is often much easier than computing the coordinates of the points explicitly. Evaluating the kernel on all pairs of data points yields a symmetric, positive semidefinite matrix K known as the kernel matrix, which can be regarded as a matrix of generalized similarity measures among the data points. The kernel-based binary classification algorithm that we use in this paper, the 1-norm soft margin support vector machine [17,18], forms a linear discriminant boundary in feature space F, f(x) = wᵀΦ(x) + b, where w ∈ F and b ∈ R.
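To make the kernel-matrix construction concrete, here is a toy sketch. The Gaussian kernel, its width convention, and the sample points are illustrative assumptions, not the paper's data:

```python
import math

def gaussian_kernel(x, z, sigma=0.5):
    """K(x, z) = exp(-||x - z||^2 / (2*sigma)), one common choice of
    positive semidefinite kernel (the width convention is an assumption)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma))

def kernel_matrix(points, k):
    """Evaluate the kernel on all pairs, yielding the symmetric matrix K."""
    return [[k(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # toy data items
K = kernel_matrix(points, gaussian_kernel)
# K is symmetric with unit diagonal (each item is maximally similar to itself).
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(K[i][i] - 1.0) < 1e-12 for i in range(3))
```

The point of the sketch is that only `gaussian_kernel` ever touches the coordinates; everything downstream sees the matrix K alone.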
Given a labelled sample S_n = {(x_1, y_1), ..., (x_n, y_n)}, w and b are optimized to maximize the distance ("margin") between the positive and negative class, allowing misclassifications (hence "soft margin"):

    min_{w,b,ξ}   wᵀw + C ∑_{i=1}^{n} ξ_i
    subject to    y_i (wᵀΦ(x_i) + b) ≥ 1 − ξ_i,   i = 1, ..., n        (1)
                  ξ_i ≥ 0,   i = 1, ..., n

where C is a regularization parameter, trading off error against margin. By considering the corresponding dual problem of (1), one can prove [18] that the weight vector can be expressed as w = ∑_{i=1}^{n} α_i y_i Φ(x_i), where the support values α_i are solutions of the following dual quadratic program (QP):

    max_α   2αᵀe − αᵀ diag(y) K diag(y) α   :   C ≥ α ≥ 0,   αᵀy = 0.

An unlabelled data item x_new can subsequently be classified by computing the linear function

    f(x_new) = wᵀΦ(x_new) + b = ∑_{i=1}^{n} α_i y_i K(x_i, x_new) + b.

If f(x_new) is positive, then we classify x_new as belonging to class +1; otherwise, we classify x_new as belonging to class −1.
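As a concrete toy instance (hand-constructed, not from the paper), consider two points x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1 under the linear kernel. By symmetry the optimal support values are alpha1 = alpha2 = 0.5 with b = 0, and the discriminant can be evaluated through kernel calls alone, without ever forming w explicitly:

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

# Hand-solved toy problem: optimal alpha = (0.5, 0.5), b = 0 by symmetry.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
b = 0.0

def f(x_new):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, evaluated via kernels only."""
    return sum(a_i * y_i * linear_kernel(x_i, x_new)
               for a_i, y_i, x_i in zip(alpha, y, X)) + b

assert f((2.0, 3.0)) > 0    # classified as +1
assert f((-0.5, 1.0)) < 0   # classified as -1
```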
Kernel Methods for Data Fusion. Given multiple related data sets (e.g., gene expression, protein sequence, and protein-protein interaction data), each kernel function produces, for the yeast genome, a square matrix in which each entry encodes a particular notion of similarity of one yeast protein to another. Implicitly, each matrix also defines an embedding of the proteins in a feature space. Thus, the kernel representation casts heterogeneous data (variable-length amino acid strings, real-valued gene expression data, a graph of protein-protein interactions) into the common format of kernel matrices. The kernel formalism also allows these various matrices to be combined. Basic algebraic operations such as addition, multiplication and exponentiation preserve the key property of positive semidefiniteness, and thus allow a simple but powerful algebra of kernels [19]. For example, given two kernels K_1 and K_2, inducing the embeddings Φ_1(x) and Φ_2(x), respectively, it is possible to define the kernel K = K_1 + K_2, inducing the embedding Φ(x) = [Φ_1(x), Φ_2(x)]. Of even greater interest, we can consider parameterized combinations of kernels. In this paper, given a set of kernels {K_1, K_2, ..., K_m}, we will form the linear combination

    K = ∑_{i=1}^{m} μ_i K_i.        (2)
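Forming such a weighted combination is elementwise over the kernel matrices; a toy sketch (the 2x2 matrices and weights are hypothetical):

```python
def combine_kernels(kernels, weights):
    """K = sum_i mu_i * K_i, elementwise over same-sized square matrices."""
    n = len(kernels[0])
    return [[sum(mu * K[r][c] for mu, K in zip(weights, kernels))
             for c in range(n)] for r in range(n)]

# Two toy PSD kernel matrices, standing in for two data sources.
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K1, K2], [0.75, 0.25])
# For a 2x2 symmetric matrix, PSD <=> non-negative diagonal and determinant.
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
assert K[0][0] >= 0 and det >= 0
```

Non-negative weights keep the combination positive semidefinite, which is what lets K serve as a kernel matrix in its own right.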
As we have discussed, fitting an SVM to a single data source involves solving a QP based on the kernel matrix and the labels. We have shown that it is possible to extend this optimization problem not only to find optimal linear discriminant boundaries but also to find optimal values of the coefficients μ_i in (2) for problems involving multiple kernels [3]. In the case of the 1-norm soft margin SVM, we want to minimize the same cost function (1), now with respect to both the discriminant boundary and the μ_i. Again considering the Lagrangian dual problem, it turns out that the problem of finding optimal μ_i and α_i reduces to a convex optimization problem known as a semidefinite program (SDP), whose constraints include

    K = ∑_{i=1}^{m} μ_i K_i ⪰ 0,   trace(∑_{i=1}^{m} μ_i K_i) = c,        (3)

where c is a constant. SDP can be viewed as a generalization of linear programming, where scalar linear inequality constraints are replaced by more general linear matrix inequalities (LMIs): F(x) ⪰ 0, meaning that the matrix F has to be in the cone of positive semidefinite matrices, as a function of the decision variables x. Note that the first LMI constraint in (3), K = ∑_{i=1}^{m} μ_i K_i ⪰ 0, emerges very naturally, because the optimal kernel matrix must indeed come from the cone of positive semidefinite matrices. Linear programs and semidefinite programs are both instances of convex optimization problems, and both can be solved via efficient interior-point algorithms [2]. In this paper, the weights μ_i are constrained to be non-negative and the K_i are positive semidefinite and normalized ([K_i]_{jj} = 1) by construction; thus K ⪰ 0 is automatically satisfied. In that case, one can prove [3] that the SDP (3) can be cast as a quadratically constrained quadratic program (QCQP), which
Table 1: Functional categories. The table lists the 13 CYGD functional classifications used in these experiments. The class listed as "others" is a combination of four smaller classes: (1) cellular communication/signal transduction mechanism, (2) protein activity regulation, (3) protein with binding function or cofactor requirement (structural or catalytic) and (4) transposable elements, viral and plasmid proteins.

| # | Category | Size |
| 1 | metabolism | 1048 |
| 2 | energy | 242 |
| 3 | cell cycle & DNA processing | 600 |
| 4 | transcription | 753 |
| 5 | protein synthesis | 335 |
| 6 | protein fate | 578 |
| 7 | cellular transp. & transp. mech. | 479 |
| 8 | cell rescue, defense & virulence | 264 |
| 9 | interaction w/ cell. envt. | 193 |
| 10 | cell fate | 411 |
| 11 | control of cell. organization | 192 |
| 12 | transport facilitation | 306 |
| 13 | others | 81 |
improves the efficiency of the computation:

    max_{α,t}    2αᵀe − ct
    subject to   t ≥ (1/n) αᵀ diag(y) K_i diag(y) α,   i = 1, ..., m        (4)
                 αᵀy = 0,
                 C ≥ α ≥ 0.

Thus, by solving a QCQP, we are able to find an adaptive combination of kernel matrices, and thus an adaptive combination of heterogeneous information sources, that solves our classification problem. The output of our procedure is a set of weights μ_i and a discriminant function based on these weights. We obtain a classification decision that merges information encoded in the various kernel matrices, and we obtain weights μ_i that reflect the relative importance of these information sources.
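The structure of the QCQP can be illustrated numerically (toy kernels, labels, and a fixed candidate alpha, all hypothetical; the 1/n scaling follows the constraint as printed): for a given alpha, the smallest feasible t is the largest per-kernel quadratic form, so the kernels whose constraints are active are the ones that shape the solution.

```python
def quad_form(alpha, y, K):
    """(1/n) * alpha^T diag(y) K diag(y) alpha for one candidate kernel K."""
    n = len(alpha)
    v = [a * yi for a, yi in zip(alpha, y)]   # diag(y) applied to alpha
    return sum(v[r] * K[r][c] * v[c]
               for r in range(n) for c in range(n)) / n

alpha = [0.5, 0.5]            # hypothetical candidate dual variables
y = [1.0, -1.0]
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
# Smallest t satisfying every per-kernel constraint is the max quadratic form.
t_min = max(quad_form(alpha, y, K) for K in (K1, K2))
c = 2.0
objective = 2 * sum(alpha) - c * t_min    # the QCQP objective 2*alpha^T e - c*t
```

Here K1 produces the larger quadratic form, so its constraint is the binding one at this candidate alpha.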
4 Experimental Design
In order to test our kernel-based approach, we follow the experimental paradigm of Deng et al. [4]. The task is predicting functional classifications associated with yeast proteins, and we use as a gold standard the functional catalogue provided by the MIPS Comprehensive Yeast Genome Database (CYGD, mips.gsf.de/proj/yeast). The top-level categories in the functional hierarchy produce 13 classes (see Table 1). These 13 classes contain 3588 proteins; the remaining yeast proteins have uncertain function and are therefore not used in evaluating the classifier. Because a given protein can belong to several functional classes, we cast the prediction problem as 13 binary classification tasks, one for each functional class.
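Casting the multi-label problem as one-vs-rest binary tasks can be sketched as follows (the ORF names and class assignments here are hypothetical):

```python
def one_vs_rest_labels(annotations, classes):
    """For each functional class, build a +1/-1 labeling over all proteins.
    `annotations` maps a protein to the set of classes it belongs to."""
    return {c: {p: (1 if c in cls else -1)
                for p, cls in annotations.items()}
            for c in classes}

# Hypothetical multi-label annotations: a protein may sit in several classes.
annotations = {"YAL001C": {"transcription"},
               "YBR045W": {"metabolism", "energy"}}
tasks = one_vs_rest_labels(annotations,
                           ["metabolism", "energy", "transcription"])
assert tasks["energy"]["YBR045W"] == 1
assert tasks["energy"]["YAL001C"] == -1
```

Each of the 13 resulting labelings is then handed to its own SVM.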
The primary input to the classification algorithm is a collection of kernel matrices representing different types of data. In order to compare the SDP/SVM approach to the MRF method of Deng et al., we perform two variants of the experiment: one in which the five kernels are restricted to contain precisely the same binary information as used by the MRF method, and a second experiment in which two of the kernels use richer representations and a sixth kernel is added. For the first kernel, the domain structure of each protein is summarized using the mapping provided by SwissProt v7.5 (us.expasy.org/sprot) from protein sequences to Pfam domains (pfam.wustl.edu). Each protein is characterized by a 4950-bit vector, in which each bit represents the presence or absence of one Pfam domain. The kernel function K_Pfam is simply the inner product applied to these vectors. This bit vector representation was used by the MRF method. In the second experiment, the domain representation is enriched by adding additional domains (Pfam 9.0 contains 5724 domains) and by replacing the binary scoring with log E-values derived by comparing the HMMs with a given protein using the HMMER software toolkit (hmmer.wustl.edu). Three kernels are derived from CYGD information regarding three different types of protein interactions: protein-protein interactions, genetic interactions, and co-participation in a protein complex, as determined by tandem affinity purification (TAP). All three data sets can be represented as graphs, with proteins as nodes and interactions as edges. Kondor and Lafferty [20] propose a general method for establishing similarities among the nodes of a graph, based on a random walk on the graph. This method efficiently accounts for all possible paths connecting two nodes, and for the lengths of those paths. Nodes that are connected by shorter paths or by many paths are considered more similar. The resulting diffusion kernel generates three interaction kernel matrices, K_Gen, K_Phys, and K_TAP. A diffusion constant τ controls the rate of diffusion through the network [20]. For K_Gen and K_Phys, τ = 5, and for K_TAP, τ = 1. The fifth kernel is generated using 77 cell cycle gene expression measurements per gene [21]. Two genes with similar expression profiles are likely to have similar functions; accordingly, Deng et al. convert the expression matrix to a square binary matrix in which a 1 indicates that the corresponding pair of expression profiles exhibits a Pearson correlation greater than 0.8. We use this matrix to form a diffusion kernel K_Exp. In the second experiment, a Gaussian kernel is defined directly on the expression profiles: for expression profiles x and z, the kernel is K(x, z) = exp(−||x − z||²/2σ) with width σ = 0.5. In the second experiment, we construct one additional kernel matrix by applying the Smith-Waterman pairwise sequence comparison algorithm [22] to
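A diffusion kernel on a small interaction graph can be sketched in pure Python (the 3-protein graph is hypothetical, and the matrix exponential uses a truncated Taylor series, which is adequate for a toy matrix). Following the Kondor-Lafferty construction, K = exp(τH) with H = A − D, the negative graph Laplacian:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(H, terms=30):
    """exp(H) via truncated Taylor series; adequate for small toy matrices."""
    n = len(H)
    K = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    P = [row[:] for row in K]
    fact = 1.0
    for t in range(1, terms):
        P = matmul(P, H)       # P = H^t
        fact *= t
        K = [[K[i][j] + P[i][j] / fact for j in range(n)] for i in range(n)]
    return K

# Hypothetical interaction graph on 3 proteins: edges (0,1) and (1,2).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
deg = [sum(row) for row in A]
tau = 1.0
H = [[tau * (A[i][j] - (deg[i] if i == j else 0)) for j in range(3)]
     for i in range(3)]
K = expm(H)
# Nodes joined by shorter paths come out more similar: 0-1 are adjacent,
# while 0-2 are two hops apart, yet still positively related.
assert K[0][1] > K[0][2] > 0
```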
Figure 1: Classification performance for the 13 functional classes. The height of each bar is proportional to the ROC score. The standard deviation across the 15 experiments is usually 0.01 or smaller (see supplement), so most of the depicted differences are significant. Black bars correspond to the MRF method of Deng et al.; gray bars correspond to the SDP/SVM method using five kernels computed on binary data, and white bars correspond to the SDP/SVM using the enriched Pfam kernel and replacing the expression kernel with the SW kernel.
the yeast protein sequences. Each protein is represented as a vector of Smith-Waterman log E-values, computed with respect to all 6355 yeast genes. The kernel matrix K_SW is computed using an inner product applied to pairs of these vectors. This matrix is complementary to the Pfam domain matrix, capturing sequence similarities among yeast genes, rather than similarities with respect to the Pfam database. Each algorithm's performance is measured by performing 5-fold cross-validation three times. For a given split, we evaluate each classifier by reporting the receiver operating characteristic (ROC) score on the test set. The ROC score is the area under a curve that plots true positive rate as a function of false positive rate for differing classification thresholds [23]. For each classification, we measure 15 ROC scores (three 5-fold splits), which allows us to estimate the variance of the score.

5 Results
The experimental results are summarized in Figure 1. The figure shows that, for each of the 13 classifications, the ROC score of the SDP/SVM method is better than that of the MRF method. Overall, the mean ROC improves from 0.715 to 0.854. The improvement is consistent and statistically significant across all 13 classes. An additional improvement, though not as large, is gained by replacing the expression and Pfam kernels with their enriched versions (see supplement). The most improvement is offered by using the enriched Pfam kernel and replacing the expression kernel with the Smith-Waterman kernel. The resulting mean ROC is 0.870. Again, the improvement occurs in every class, although some class-specific differences are not statistically significant.

Table 2: Kernel weights and ROC scores for the transport facilitation class. The table shows, for both experiments, the mean weight associated with each kernel, as well as the ROC score resulting from learning the classification using only that kernel. The final row lists the mean ROC score using all kernels.

| Kernel | Binary data: Weight | Binary data: ROC | Enriched kernels: Weight | Enriched kernels: ROC |
| K_Pfam | 2.21 | 0.9331 | 1.58 | 0.9461 |
| K_Gen | 0.18 | 0.6093 | 0.21 | 0.6093 |
| K_Phys | 0.94 | 0.6655 | 1.01 | 0.6655 |
| K_TAP | 0.74 | 0.6499 | 0.49 | 0.6499 |
| K_Exp | 0.93 | 0.5457 | - | 0.7126 |
| K_SW | - | - | 1.72 | 0.9180 |
| all | - | 0.9674 | - | 0.9733 |

Table 2 provides detailed results for a single functional classification, the transport facilitation class. The weight assigned to each kernel indicates the importance that the SDP/SVM procedure assigns to that kernel. The Pfam and Smith-Waterman kernels yield the largest weights, as well as the largest individual ROC scores. Results for the other twelve classifications are similar (see supplement).

6 Discussion
We have described a general method for combining heterogeneous genome-wide data sets in the setting of kernel-based statistical learning algorithms, and we have demonstrated an application of this method to the problem of predicting the function of yeast proteins. The resulting SDP/SVM algorithm yields significant improvement relative to an SVM trained from any single data type, and relative to a previously proposed graphical model approach for fusing heterogeneous genomic data. Kernel-based statistical learning methods have a number of general virtues as tools for biological data analysis. First, the kernel framework accommodates non-vector data types such as strings, trees and graphs. Second, kernels provide significant opportunities for the incorporation of specific biological knowledge, as we have seen with the Pfam kernel, and unlabelled data, as in the diffusion
and Smith-Waterman kernels. Third, the growing suite of kernel-based data analysis algorithms requires only that data be reduced to a kernel matrix; this creates opportunities for standardization. Finally, as we have shown here, the reduction of heterogeneous data types to the common format of kernel matrices allows the development of general tools for combining multiple data types. Kernel matrices are required only to respect the constraint of positive semidefiniteness, and thus the powerful technique of semidefinite programming can be exploited to derive general procedures for combining data of heterogeneous format and origin.
Acknowledgements
WSN is supported by a Sloan Foundation Research Fellowship and
by National Science Foundation grants DBI-0078523 and ISI-0093302. MIJ and GL acknowledge support from ONR MURI N00014-00-1-0637 and NSF grant IIS-9988642.
References
1. B. Scholkopf, K. Tsuda and J.-P. Vert. Support vector machine applications in computational biology. MIT Press, Cambridge, MA, 2004.
2. L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.
3. G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. In Proc 19th Int Conf Machine Learning, pp. 323-330, 2002.
4. M. Deng, T. Chen, and F. Sun. An integrated probabilistic model for functional prediction of proteins. Proc 7th Int Conf Comp Mol Biol, pp. 95-103, 2003.
5. H. Ge, Z. Liu, G. Church, and M. Vidal. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics, 29:482-486, 2001.
6. A. Grigoriev. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucl Acids Res, 29:3513-3519, 2001.
7. R. Mrowka, W. Liebermeister, and D. Holste. Does mapping reveal correlation between gene expression and protein-protein interaction? Nature Genetics, 33:15-16, 2003.
8. C. von Mering, R. Krause, B. Snel et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399-403, 2002.
9. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402(6757):83-86, 1999.
10. R. Jansen, N. Lan, J. Qian, and M. Gerstein. Integration of genomic datasets to predict protein complexes in yeast. Journal of Structural and Functional Genomics, 2:71-81, 2002.
11. A. Nakaya, S. Goto, and M. Kanehisa. Extraction of correlated gene clusters by multiple graph comparison. In Genome Informatics 2001, pp. 44-53, 2001.
12. A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18:S136-S144, 2002.
13. I. Holmes and W. J. Bruno. Finding regulatory elements using joint likelihoods for sequence and expression profile data. In Proc Int Sys Mol Biol, pp. 202-210, 2000.
14. M. Deng, F. Sun, and T. Chen. Assessment of the reliability of protein-protein interactions and protein function prediction. In Proc Pac Symp Biocomputing, pp. 140-151, 2003.
15. A. Drawid and M. Gerstein. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol, 301:1059-1075, 2000.
16. P. Pavlidis, J. Weston, J. Cai, and W. N. Grundy. Gene functional classification from heterogeneous data. In Proc 5th Int Conf Comp Mol Biol, pp. 242-248, 2001.
17. B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Computational Learning Theory, pp. 144-152, 1992.
18. B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
19. C. Berg, C. J. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer, New York, NY, 1984.
20. R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proc Int Conf Machine Learning, pp. 315-322, 2002.
21. P. T. Spellman, G. Sherlock, M. Q. Zhang et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9:3273-3297, 1998.
22. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147(1):195-197, 1981.
23. J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982.
DISCOVERY OF BINDING MOTIF PAIRS FROM PROTEIN COMPLEX STRUCTURAL DATA AND PROTEIN INTERACTION SEQUENCE DATA
H. LI, J. LI, S. H. TAN, S.-K. NG
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
School of Computing, National University of Singapore, Singapore 119260
Email: {haiquan,jinyan,soonheng,skng}@i2r.a-star.edu.sg
Abstract
Unravelling the underlying mechanisms of protein interactions requires knowledge about the interactions' binding sites. In this paper, we use a novel concept, binding motif pairs, to describe binding sites. A binding motif pair consists of two motifs, each derived from one side of the binding protein sequences. The discovery is a directed approach that uses a combination of two data sources: 3-D structures of protein complexes and sequences of interacting proteins. We first extract maximal contact segment pairs from the protein complexes' structural data. We then use these segment pairs as templates to sub-group the interacting protein sequence dataset, and conduct an iterative refinement to derive significant binding motif pairs. This combination approach is efficient in handling large datasets of protein interactions. From a dataset of 78,390 protein interactions, we have discovered 896 significant binding motif pairs. The discovered motif pairs include many novel motif pairs as well as motifs that agree well with experimentally validated patterns in the literature.
1 Introduction
Protein-protein interactions play a crucial role in the operation of many key biological functions such as inter-cellular communication, signal transduction, and regulation of gene expression. Unravelling the underlying mechanisms of these interactions will provide invaluable knowledge that could lead to the discovery of new drugs and better treatments for many human diseases. Physically, protein interactions are mediated by short sequences of residues that form the contact interfaces between two interacting proteins, often referred to as their binding sites. Though many experimental methods [2] and computational methods have been developed to detect protein interactions with increasing levels of accuracy, few methods can
aTo whom correspondence should be addressed.
pinpoint the specific residues in the proteins that are involved in the interactions. Such information is necessary for the interaction data to be directly useful for drug discovery. To determine the binding sites between interacting proteins, experimental methods usually include mutagenesis studies and phage display [3], which are tedious and time-consuming. Computational methods often include docking approaches and domain-domain interaction approaches. The docking approach is based on the analysis of bound protein structures. The use of this approach is very limited, mainly because resolved structures of proteins are often not available due to the limitations in scalability and coverage of current protein structure determination technologies. The domain-domain interaction approach assumes that protein interactions are determined by the interactions between domains and aims to identify the interactions only among predefined domains [5,6]. However, some domains may not directly determine the interactions, but only function as determinants of protein folding. Even when domains are involved in protein interactions, not all of their residues are contained in the binding sites and contribute to the role of the interactions. In this work, we study the problem of binding sites at the residue level rather than at the domain level. Our basic idea is that correlated sequence motif pairs determine the interactions. A similar concept, correlated sequence-signature pairs, was first proposed by Sprinzak and Margalit [4], expressed in terms of domain pairs. We focus on efficient in silico discovery of our motif pairs from multiple data sources about protein interactions. Ideally, such interacting motif pairs should be discovered from protein complex structural data. However, as discussed above, the availability of such data is very limited. Alternatively, interacting motif pairs may be discovered by analyzing their co-occurrence rates in interacting protein pairs' sequences.
However, as high-throughput detection technologies such as two-hybrid screening [7,8] can rapidly generate large datasets of experimentally determined protein interactions, the search space on the associated protein sequences is enormous. The high false positive rates observed in high-throughput protein interaction data could also diminish the biological significance of motif pairs detected solely from protein interaction sequences. To address these issues in mining motif pairs, we propose a joint approach that makes use of the two available types of interaction data: (1) the limited structural data of protein complexes, which provide exact information on inter-protein contact sites, and (2) the abundantly available interacting protein sequence pairs from high-throughput interaction detection experiments. The structural data of protein complexes are carefully mined for contact residues; these are then computationally extended into the so-called maximal contact segment pairs, which we will define later. The complexes' maximal segment pairs are then deployed to seed the discovery of motif pairs from large sequence datasets of interacting proteins, followed by an iterative refinement procedure to ensure the significance of the derived motif pairs. This combined directed approach reduces the formidable search space of interacting protein sequences while providing some biological support for the motifs discovered. Indeed, many of our motif pairs discovered this way can be confirmed by biological patterns reported in the literature, as we will show later. We present the overall picture of our method in Section 2. In Sections 3 and 4, we describe new algorithms to discover maximal contact segment pairs from protein complex data, and then to discover binding motif pairs from interacting protein sequence data. Results showing the effectiveness and significance of this joint approach are presented in Section 5. Finally, we conclude and discuss possible future work in Section 6.
2 Overview of Our Method and Data Used
A key idea in our proposed method for discovering significant binding motif pairs is the detection of maximal contact segment pairs between two proteins residing in a complex. First, all possible pairs of spatially contacting residues are determined from the 3-D structure data of a protein complex. These contact points are then extended to capture as many continuous binding residues along the two proteins as possible, deriving the maximal contact segment pairs. Computationally, the derivation of maximal contact segment pairs is a challenging problem. In Section 3, we will describe an algorithm to discover them efficiently. Our objective is to discover significant binding motif pairs from protein-protein interaction sequence datasets. Using the maximal contact segment pairs that we have discovered from the protein complex structural data, we cluster the interacting protein sequence data into sub-groups, each corresponding to one maximal contact segment pair. Then, from each sub-group, we use a new motif discovery algorithm and an iterative optimization refinement algorithm to discover a binding motif pair. To assess the significance of binding motif pairs in the refinement procedure, we define a measure called emerging significance, which is similar to the concept of emerging patterns [9]. This measure is based on both positive and negative interaction datasets: a pattern or motif pair is said to have a high emerging significance if it has a high frequency in the positive dataset but a relatively low frequency in the negative dataset. The iterative refinement is terminated when the motif pairs reach an optimized level of emerging significance. The protein complex dataset used in this study is a non-redundant subset from PDB where the maximum pairwise sequence identity is 30% and only structures with a resolution of 2.0 Å or better are included. The set
used was generated on 9th June 2003 and contained 1533 entries, each with at least 2 chains. As mentioned, our emerging significance approach requires the use of both positive and negative instances of pairwise protein-protein interactions. For positive protein-protein interaction sequence data, we used the data by von Mering et al. [1]. This dataset covers almost all the interaction data generated by experimental methods and in silico methods for yeast proteins. In total, there are 78,390 non-redundant interactions in this dataset. However, there are currently no large datasets of experimentally validated negative interactions. As such, we generated a putative negative interaction dataset by treating any possible protein pair in yeast that does not occur in the positive dataset as a negative interaction. As our emerging significance measure only requires that the detected patterns have relatively lower frequency in the negative datasets, the effect of potential false negative interactions in this putative negative dataset is minimal.
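The putative negative set described here can be sketched as rejection sampling over protein pairs. Sampling a fixed number k of negatives, rather than enumerating every non-interacting pair, is our own simplification to keep the sketch small; the function and parameter names are illustrative.

```python
import random

def putative_negatives(proteins, positives, k, seed=0):
    """Build a putative negative interaction set by sampling protein
    pairs that are absent from the positive interaction dataset.
    proteins: list of protein ids; positives: iterable of known
    interacting pairs; k: number of negatives to draw."""
    pos = {frozenset(p) for p in positives}
    rng = random.Random(seed)
    out = set()
    while len(out) < k:
        a, b = rng.sample(proteins, 2)   # two distinct proteins
        pair = tuple(sorted((a, b)))
        if frozenset(pair) not in pos:   # reject known interactions
            out.add(pair)
    return sorted(out)
```

In practice k would be chosen to balance the positive set; false negatives matter little here because the significance measure only compares relative frequencies.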
3 Discovering Maximal Contact Segment Pairs from Protein Complexes

3.1 Preprocessing: Compute Contact Sites

Given a pair of proteins in a complex, a contact site is an elemental pair of two residues or atoms, each coming from one of the two proteins, that are close enough in space. A protein complex usually consists of multiple proteins, so in this study we consider all pairs of proteins in a protein complex to obtain all contact sites in this step. We define a contact site mathematically as follows: Suppose two proteins with 3-D structural coordinates in (x, y, z), La = {(ai, xai, yai, zai), i = 1...m} and Lb = {(bj, xbj, ybj, zbj), j = 1...n}. The pair (ai, bj) is a contact site if dist(ai, bj) ≤ ε, where ai and bj are the atom ids in the proteins La and Lb respectively, and ε is an empirical threshold for the Euclidean distance function dist(., .). Such a pair is denoted Contact(ai, bj), or equivalently Contact(bj, ai). Note that a contact site at the atom level directly implies a contact site at the residue level, because each atom is a part of a unique residue. Hereafter, we will discuss contact sites only at the residue level. Since two residues are said to be in contact if one of the atoms in one residue is in contact with an atom in the other residue, it is possible for a residue to be in contact with multiple residues.
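This preprocessing step can be sketched directly. The (residue_id, x, y, z) tuples below are a hypothetical flattening of the complex's coordinate data, not the paper's data format; the 5 Å threshold is the one reported in Section 5.

```python
import math

EPSILON = 5.0  # the distance threshold, in Angstroms

def contact_sites(atoms_a, atoms_b, eps=EPSILON):
    """Residue-level contact sites between two chains of a complex.

    atoms_a / atoms_b: lists of (residue_id, x, y, z) tuples.
    Two residues are in contact if any pair of their atoms lies
    within eps of each other; an atom-level contact thus directly
    yields a residue-level one."""
    contacts = set()
    for ra, xa, ya, za in atoms_a:
        for rb, xb, yb, zb in atoms_b:
            if math.dist((xa, ya, za), (xb, yb, zb)) <= eps:
                contacts.add((ra, rb))
    return contacts
```

For instance, contact_sites([(1, 0.0, 0.0, 0.0), (2, 10.0, 0.0, 0.0)], [(7, 3.0, 0.0, 0.0)]) yields {(1, 7)}: residue 1 lies 3 Å from residue 7, while residue 2 lies 7 Å away.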
3.2 Extract Contact Segment Pairs

Next, we extend the concept of contact sites to the concept of contact segment pairs, aiming to search for large areas of contact sites in a pair of
Figure 1: An illustration of contact segment pairs in a pair of interacting proteins A and B. Here, protein A is said to be the opposite protein of B, and vice versa.
binding proteins. Figure 1 shows our idea, depicting a typical scenario where segments of residues in one protein are continuously in contact with segments of residues in the other protein. As an illustration, the segment [a10, a15] in protein A of Figure 1 is in contact with the segment [b21, b27] in protein B. That is, they are a contact segment pair. But the segment [a30, a40] in protein A and the segment [b21, b27] in protein B are collectively not a contact segment pair. Formally, the definition is: A contact segment pair is a segment pair ([ai1, ai2], [bj1, bj2]) satisfying: for every ai ∈ [ai1, ai2] there exists bj ∈ [bj1, bj2] such that (ai, bj) is a contact site, and symmetrically for every bj ∈ [bj1, bj2] there exists such an ai, where ai1, ai2, bj1, bj2 are residue ids in the two proteins La and Lb. Such a pair of segments is sometimes denoted Contact([ai1, ai2], [bj1, bj2]). A maximal contact segment pair is then defined as a contact segment pair such that no other contact segment pair can contain both segments of this contact pair. In this paper, we are interested in the following problem:
Definition 1 (Maximal Contact Segment Pairs Problem): Given a pair of binding proteins La and Lb, and C = {(ai, bj) | Contact(ai, bj) with respect to the two proteins La and Lb}, find all possible maximal contact segment pairs from C whose segment lengths are longer than a threshold.
A naive approach to solving this problem would require testing all possible segment pairs. Suppose two proteins La and Lb have m and n residues respectively; then La and Lb have m^2 and n^2 possible segments respectively. For each combination, O(mn) time would be required for the check, so the total time complexity of such a naive approach is O(m^3 * n^3) per pair of proteins in each complex. This is very expensive, particularly when the protein complexes are large and there are hundreds or thousands of protein complexes to be examined. We present a more efficient method to discover maximal contact segment pairs here. Observe that each residue may be in contact with multiple
residues in the opposite protein (see Figure 1). We introduce a concept named coverage to capture this phenomenon; it will be shown later that this is a useful concept for improving the efficiency of our discovery algorithm. The coverage of a residue ai, denoted Cov(ai), is the set of all residues in the opposite protein that are in contact with this residue, namely Cov(ai) = {bj | (ai, bj) ∈ C}. The coverage of a segment [ai1, ai2], denoted Cov([ai1, ai2]), is the union of the coverages of all its residues, namely Cov([ai1, ai2]) = ∪_{ai ∈ [ai1, ai2]} Cov(ai). The following proposition is useful in our algorithm to discover maximal contact segment pairs efficiently.

Proposition 1: A segment pair ([ai1, ai2], [bj1, bj2]) is a contact segment pair iff the coverage of each of the two segments contains the other segment, i.e. Contact([ai1, ai2], [bj1, bj2]) ⇔ (Cov([ai1, ai2]) ⊇ [bj1, bj2]) ∧ (Cov([bj1, bj2]) ⊇ [ai1, ai2]).

Proof: (⇒) We use contradiction. Suppose Cov([ai1, ai2]) ⊇ [bj1, bj2] is not true; then there exists a bj ∈ [bj1, bj2] with bj ∉ Cov([ai1, ai2]). This means there is no ai ∈ [ai1, ai2] in contact with bj, which contradicts the assumption that the pair is a contact segment pair. Therefore, Cov([ai1, ai2]) ⊇ [bj1, bj2]. We can prove Cov([bj1, bj2]) ⊇ [ai1, ai2] in a symmetrical manner. (⇐) If Cov([ai1, ai2]) ⊇ [bj1, bj2], this means that for each bj ∈ [bj1, bj2] there exists at least one contact site in [ai1, ai2]. Similarly, the residues in the other segment have the same property.

Our algorithm is a top-down recursive algorithm. At the initial step, each entire protein in a pair is treated as a segment. A series of recursive breaking-down steps is then performed to output maximal contact segment pairs, using the above proposition to determine when to break a segment down into several smaller segments and when to terminate producing a new candidate segment pair. The details of our algorithm are as follows:

Input: Two proteins La = {(ai, xai, yai, zai), i = 1...m} and Lb = {(bj, xbj, ybj, zbj), j = 1...n}, two special segments [a1, am] and [b1, bn], and C = {(ai, bj) | Contact(ai, bj), 1 ≤ i ≤ m, 1 ≤ j ≤ n}.
Output: A set of maximal contact segment pairs.
Preparation Step: Compute Cov(ai) and Cov(bj) for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.
Initialization Step: Put the initial segment pair ([a1, am], [b1, bn]) into the candidate list.
repeat
    Segment Coverage Step: Remove the first segment pair from the candidate list, denoted ([xi1, xi2], [yj1, yj2]); compute the coverage Cov([xi1, xi2]) ∩ [yj1, yj2].
    Splitting Step:
    if (Cov([xi1, xi2]) ∩ [yj1, yj2]) == [yj1, yj2] then
        if (Cov([yj1, yj2]) ∩ [xi1, xi2]) == [xi1, xi2] then
            Output the segment pair.
        else
            Add ([yj1, yj2], [xi1, xi2]) into the candidate list.
        end if
    else
        Split Cov([xi1, xi2]) ∩ [yj1, yj2] into w continuous subsegments, denoted [yk2t-1, yk2t], t = 1...w; put each segment pair ([yk2t-1, yk2t], [xi1, xi2]), t = 1...w, into the candidate list.
    end if
until The candidate list is empty.
A detailed example can be found in this paper's supplementary information [10].
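The algorithm above can be sketched in Python. This is our own simplified rendering of the breaking-down scheme (residues as plain integer ids, the candidate list as a deque, and only a min_len filter rather than a full maximality post-check), not the authors' implementation.

```python
from collections import deque

def maximal_segment_pairs(contacts, m, n, min_len=1):
    """Top-down search for mutually covering contact segment pairs.

    contacts: the set C of residue-level contact sites (i, j), with
    residues numbered 1..m in protein A and 1..n in protein B."""
    cov_a = {i: {j for (x, j) in contacts if x == i} for i in range(1, m + 1)}
    cov_b = {j: {i for (i, y) in contacts if y == j} for j in range(1, n + 1)}

    def cov(table, seg):
        s = set()
        for r in range(seg[0], seg[1] + 1):
            s |= table[r]
        return s

    def runs(ids):
        # Split a set of residue ids into maximal continuous segments.
        out, cur = [], None
        for r in sorted(ids):
            if cur is not None and r == cur[1] + 1:
                cur[1] = r
            else:
                cur = [r, r]
                out.append(cur)
        return [tuple(c) for c in out]

    results = set()
    todo = deque([("A", (1, m), (1, n))])  # (side of first segment, segment, opposite)
    while todo:
        side, x, y = todo.popleft()
        cx, cy = (cov_a, cov_b) if side == "A" else (cov_b, cov_a)
        other = "B" if side == "A" else "A"
        full_y = set(range(y[0], y[1] + 1))
        hit = cov(cx, x) & full_y
        if hit == full_y:                                 # Cov(x) covers all of y
            if cov(cy, y) >= set(range(x[0], x[1] + 1)):  # mutual coverage holds
                a, b = (x, y) if side == "A" else (y, x)
                if min(a[1] - a[0], b[1] - b[0]) + 1 >= min_len:
                    results.add((a, b))
            else:
                todo.append((other, y, x))                # re-examine from the other side
        else:
            for seg in runs(hit):                         # split y and recurse
                todo.append((other, seg, x))
    return results
```

For the fully contacting diagonal {(1, 1), (2, 2), (3, 3)} on two 3-residue chains, the single pair ((1, 3), (1, 3)) is returned; dropping the middle contact splits it into ((1, 1), (1, 1)) and ((3, 3), (3, 3)).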
4 Discovering Binding Motif Pairs from Interacting Protein Sequence Pairs
Next, we describe how to discover binding motif pairs from protein interaction sequence data using the maximal contact segment pairs detected from protein complexes.
4.1 Seeded Sub-grouping and Consensus Motif Discovery

We use each of the discovered maximal contact segment pairs as a seed to sub-group the interaction sequence pairs, such that all the interaction pairs that contain the contact segment pair are grouped together. We then conduct a consensus motif discovery in each of the sub-groups of protein interaction sequences. First, let us give the following two definitions:
Contain: Suppose a sequence S = s1 s2 ... su and a segment P = p1 p2 ... pv. S contains P, denoted Contain(S, P), if LocalAlignment(S, P) ≥ λ, where λ is an empirical threshold.
Cluster of a Contact Segment Pair: Given an interaction dataset D consisting of n sequence pairs, denoted D = {(Si^1, Si^2), 1 ≤ i ≤ n}, and a segment pair P = (P1, P2), the cluster of this segment pair with respect to D, denoted CD(P), is
{(Si^1, Si^2) | (Si^1, Si^2) ∈ D, Contain(Si^1, P1) and Contain(Si^2, P2)}
∪ {(Si^1, Si^2) | (Si^1, Si^2) ∈ D, Contain(Si^2, P1) and Contain(Si^1, P2)}
With this way of sub-grouping the interaction dataset, the resulting clusters of different segment pairs may overlap with one another. Biologically, this is important because one protein may be involved in interactions with different proteins at different locations.
Given the cluster of a contact segment pair, our subsequent step is to find two consensus motifs, one from all the left-side sequences of the protein sequence pairs in the cluster, and the other from all the right-side sequences. On each side, we align all the sequences according to the best alignments with respect to the corresponding segment (P1 or P2 in this case). We used the score matrix developed by Azarya-Sprinzak et al. [11] for the local alignment [12], since structure is preserved for residue pairs that have high scores in the matrix. To obtain the consensus motif from each side of these alignments, every column in the alignment is examined as follows: if the occurrence of a residue in this column is above the stated threshold, we include it in the consensus motif; if there is no such residue, we treat this column as a wildcard. It is also possible to use alternative methods such as EMOTIF [13] to find the consensus motifs. These two consensus motifs form a binding motif pair. Note that we derive this binding motif pair starting from one contact segment pair. So, given a set of maximal contact segment pairs discovered from the protein complex dataset, we can obtain a set of binding motif pairs by going through all these maximal contact segment pairs on the interacting protein sequence datasets.
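The column-by-column consensus scan can be sketched as follows. The 60% column threshold here is an illustrative assumption of ours, not the paper's setting, and the input is assumed to be already aligned to equal length.

```python
def consensus_motif(aligned, threshold=0.6, wildcard="*"):
    """Column-by-column consensus over equal-length aligned sequences:
    a residue whose column frequency reaches `threshold` enters the
    consensus; a column with no such residue becomes a wildcard."""
    motif = []
    for col in zip(*aligned):
        best = max(set(col), key=col.count)  # most frequent residue in the column
        motif.append(best if col.count(best) / len(col) >= threshold else wildcard)
    return "".join(motif)
```

For example, consensus_motif(["PADLS", "PVDLS", "PKDLS"]) returns "P*DLS", the same shape as the motif shown in Table 1.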
4.2 Iterative Refinement
Next, we perform an iterative refinement on the binding motif pairs discovered in the last subsection. The purpose is to optimize these binding motif pairs. Given a binding motif pair Q, our refinement algorithm uses Q to sub-group the interacting protein sequence dataset and generates a new binding motif pair Q' (using exact match instead of local alignment here), as discussed in the last subsection but replacing the maximal contact segment pair P with Q. Iteratively, the algorithm repeats the procedure, using Q' as Q, until Q' reaches an optimized state. The stopping criterion used here is based on a concept of emerging significance of consensus motifs. Recall that we have established two protein sequence pair datasets: the interaction dataset (also called the positive dataset) and the negative dataset. So far, we have used only the positive dataset in generating the consensus motifs. To measure the emerging significance of a pair of consensus motifs, we make use of both the positive and negative datasets. If a motif pair is significant, it is reasonable to expect the pair to occur in the positive dataset much more frequently than in the negative dataset. We give the definitions for emerging significance below:
Frequency of a motif pair with respect to a dataset: Suppose we have a dataset D consisting of sequence pairs D = {(Si^1, Si^2), 1 ≤ i ≤ n}; the frequency of a motif pair P = (P1, P2) with respect to D is defined as: Freq(P, D) = |CD(P)| / |D|, where CD(P) is the cluster of P with respect to D as defined above.
Significant motif pairs: Suppose we have a positive dataset DPos and a negative dataset DNeg. A motif pair P is significant if: ratio(P, DPos, DNeg) = Freq(P, DPos) / Freq(P, DNeg) ≥ r, where r is a threshold. We also call ratio(P, DPos, DNeg) the emerging significance of P.
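Under these definitions, the emerging-significance computation reduces to two frequency counts. In the sketch below, the pluggable `contains` predicate and the small floor that guards against a zero negative-set frequency are our own simplifications, not details from the paper.

```python
def freq(motif_pair, dataset, contains):
    """Fraction of sequence pairs in `dataset` matched by the motif
    pair in either orientation, mirroring the cluster definition.
    `contains(seq, motif)` is a pluggable matcher (exact matching
    during refinement, local alignment during seeding)."""
    p1, p2 = motif_pair
    hits = sum(1 for s1, s2 in dataset
               if (contains(s1, p1) and contains(s2, p2))
               or (contains(s1, p2) and contains(s2, p1)))
    return hits / len(dataset)

def emerging_significance(motif_pair, d_pos, d_neg, contains, floor=1e-9):
    """ratio(P, DPos, DNeg); `floor` avoids division by zero when the
    pair never occurs in the negative dataset."""
    return freq(motif_pair, d_pos, contains) / max(freq(motif_pair, d_neg, contains), floor)
```

With a plain substring matcher, contains = lambda s, m: m in s, a pair such as ("PVDLS", "GVFS") that occurs in every positive pair but in no negative pair gets a very large ratio and passes any reasonable threshold r.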
4.3 Time Complexity of the Method

The time complexity for sub-grouping based on the segment pairs is O((|DPos| + |DNeg|) * |CP|) because of the use of local alignment, where CP represents the set of maximal contact segment pairs. The number of binding motif pairs is O(|CP|) in the case of using our column-by-column consensus algorithm. The time used to compute the clusters for motif pairs in each pass is linear if the suffix tree approach [14] is applied to conduct the exact match for regular patterns. The complexity of computing a consensus motif pair from a cluster is also linear. Suppose there are at most K passes before the algorithm terminates and the number of motif pairs is NCP; then the time complexity for the refinement of the motif pairs is O(((|DPos| + |DNeg|) * NCP + |CP|) * K). In total, the time complexity for this step is O((|DPos| + |DNeg|) * (|CP| + NCP * K) + |CP| * K).
5 Implementation and Results
In the initial step of computing contact sites from the protein complex data, we set the threshold ε to 5 Å. More than 56% of the complexes were found to contain at least one contact site. We also set 4 as the threshold on segment length. We found 1403 maximal segment pairs from this complex dataset. For sub-grouping the interaction dataset using the maximal segment pairs, a threshold must be set in the Contain operation. Instead of setting λ to a constant, it is more reasonable to set the threshold strictly for short segments but loosely for long segments. The actual parameters used in our experiment are provided in our supplementary information [10]. Our refinement procedure was performed for 7 iterative passes, after which all the motif pairs became stable. We found a total of 896 motif pairs to be significant when the emerging significance threshold r was set to 2. The detailed distribution of emerging significance values can again be found in our supplementary information [10]. All our source codes of the algorithms were run on a Pentium 4 PC with a 2.4 GHz CPU and 256 MB RAM. Most of the time (around 12 hours) was spent sub-grouping the interaction sequence data using the maximal contact segment pairs. The mining of all the maximal segment pairs was very fast, taking only 50 seconds. The refinement algorithm was also fast, taking about 1 hour. Note that this time cost is acceptable considering the enormity of the problem space. Although the objective is to discover novel motif pairs, to evaluate the biological significance of the motif pairs found by our algorithms, it is important to verify that some of the discovered motifs agree well with experimentally validated patterns in the literature. However, most publications on the experimental discovery of binding motifs only report a single motif on one side rather than a pair of binding motifs. As such, we can only confirm the coincidence of individual motifs in our motif pairs with the reported binding motifs found by traditional experimental methods. For example, for the mutagenesis method, we used the key words 'binding motif OR site AND mutagenesis' to search all biomedical abstracts in PubMed at NCBI. 202 motifs were found, of which 91 are compatible with at least one of our motifs and 58 are highly similar to ours. We show the first 5 matches in Table 1. A similar comparison with the phage display method is provided in our supplementary information [10].

Table 1: Motif coincidence with the mutagenesis method.
Reported motifs: ALETS, PVDLS, LLDLL, PIDLSLKP
Our motif: P*DLS
PubMed IDs: 11435317, 11373277, 11451993, 10748065, 11062046
Table 2 illustrates how we can compare motif pairs using the individual binding motifs reported in the literature. As an example, we use the binding consensus sequences in the list compiled by Kay et al. [15] for various proteins by phage display. First, we identify the individual motifs in our population of discovered motif pairs that match closely with a binding consensus sequence in the compiled list. Then, for each such matched motif, we verify whether the motif on the other side of the corresponding motif pair is found in proteins known to bind to the particular consensus sequence. In Table 2, we list six example binding consensus sequences from the list compiled by Kay et al. [15] in the first column. In the second column, we list the individual matched motifs from our population of discovered motif pairs; we arbitrarily assign these motifs as the "left motifs". In the third column, we show the motifs on the other sides (the "right motifs") of the matched motif pairs. Since these right motifs are also found in the proteins (shown in the fourth column) reported to bind to the corresponding consensus sequence, the motif pairs
can be considered to be biologically verified. More examples are detailed on our website [10].

Table 2: Motif pair coincidence between our motif pairs and peptide-protein binding pairs.

Consensus Sequence | Left Motif               | Right Motif   | Binding Protein
P*LP*[KR]          | P[EK]*P                  | GV[FI]S       | CRK A
P*LP*[KR]          | P[ILV][FIL]PG            | P[ILV][FIL]PG | CRK A
P*LP*[KR]          | P[ILV][FL]PG             | P[ILV][FIL]PG | CRK A
                   | [RKH]PP[AILVP]P[AILVP]KP | AAS[FI]       | Cortactin
                   | P[IV][EP][IV]A           | GV[FI]S       | Synaptojanin I
                   | RLP*LP                   | PL[DP]PL      | Shank
6 Conclusion and Further Work
The mining of binding motif pairs from protein interaction data is important for extracting knowledge that can lead to the discovery of new drugs. Most of the work reported in the literature deals only with finding individual binding motifs rather than pairs of interacting motifs. Since motif pairs, unlike single binding motifs, can provide better information for understanding the interactions between proteins, we studied the problem of finding binding motif pairs from large protein interaction datasets. Our approach combines the mining of large protein interaction sequence datasets with the use of smaller protein complex structural datasets to direct the search. For mining protein complex structural data, we have formulated the detection of maximal contact segment pairs as a novel computational search and optimization problem, and we have provided an efficient algorithm for it. The maximal contact segment pairs derived can then be deployed as seeds for sub-grouping the vast dataset of interacting protein sequence pairs, so that motif discovery algorithms can be directed to find the motif pairs within sub-groups. By iteratively applying this technique, we refine these motif pairs until they reach a satisfactory level of emerging significance. The results have shown that our combination approach is efficient and effective in finding biologically significant binding motif pairs. Many of the motif pairs that we have discovered coincided well with known motif pairs independently discovered by experimental methods. However, this directed approach depends heavily on the protein complex data source. As the current complex dataset is very limited, our approach may miss many other important motif pairs. On the other hand, it is worthwhile to improve our approach for discovering more significant binding motif pairs.
For example, in our current definition of contact segment pairs, each residue in one segment is strictly required to have at least one contact residue in the other segment. Biologically, contact segment pairs are still valid even if a few residues in the segments are not in contact.
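The strict contact requirement just described can be stated as a small predicate. This is an illustrative sketch only; the function name, the list-of-residue-indices encoding of a segment, and the set-of-index-pairs contact map are assumptions for exposition, not the paper's data structures:

```python
def is_contact_segment_pair(seg_a, seg_b, contacts):
    """Strict contact condition from the text: every residue in each segment
    must be in contact with at least one residue of the other segment.
    seg_a, seg_b: lists of residue indices; contacts: set of (i, j) pairs."""
    def touches(i, other):
        return any((i, j) in contacts or (j, i) in contacts for j in other)
    return all(touches(i, seg_b) for i in seg_a) and \
           all(touches(j, seg_a) for j in seg_b)
```

Relaxing the condition to tolerate a few uncontacted residues amounts to replacing the `all(...)` checks with a count threshold, which is exactly the change that would invalidate the top-down recursion discussed below.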
Computationally, however, our top-down recursive algorithm for finding maximal contact segment pairs will no longer be valid without this constraint. Therefore, one future research direction will be to explore relaxing this constraint while retaining the efficiency of the algorithm.
PHYLOGENETIC MOTIF DETECTION BY EXPECTATION-MAXIMIZATION ON EVOLUTIONARY MIXTURES

A.M. MOSES
Graduate Group in Biophysics and Center for Integrative Genomics, University of California, Berkeley
Email: amoses@ocf.berkeley.edu

D.Y. CHIANG
Department of Molecular and Cell Biology, University of California, Berkeley
Email: dchiang@ocf.berkeley.edu

M.B. EISEN
Department of Genome Sciences, Lawrence Berkeley Lab and Department of Molecular and Cell Biology, University of California, Berkeley
Email: [email protected]
1 Abstract
The preferential conservation of transcription factor binding sites implies that non-coding sequence data from related species will prove a powerful asset to motif discovery. We present a unified probabilistic framework for motif discovery that incorporates evolutionary information. We treat aligned DNA sequence as a mixture of evolutionary models, for motif and background, and, following the example of the MEME program, provide an algorithm to estimate the parameters by Expectation-Maximization. We examine a variety of evolutionary models and show that our approach can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes. We compare our method to traditional motif finding on conserved regions only. An implementation will be made available at http://rana.lbl.gov.
2 Introduction
A wide range of biological processes involve the activity of sequence-specific DNA binding proteins, and an understanding of these processes requires the accurate elucidation of these proteins' binding specificities. The functional binding sites for a given protein are rarely identical, with most proteins binding to families of related sequences collectively referred to as their 'motif' [1]. Although experimental methods exist to identify sequences bound by a specific protein, they have not been widely applied, and computational approaches [2,3,4] to 'motif discovery' have proven to be a useful alternative. For example, the program MEME [5] models a collection of sequences as a mixture of multinomial models for motif and background and uses an Expectation-Maximization (EM) algorithm to estimate the parameters.
Because functional binding sites are evolutionarily constrained, their preferential conservation relative to background sequence has proven a useful approach for their identification [6]. With the availability of complete genomes for closely related species, e.g., [7], it is possible to incorporate an understanding of binding site evolution into motif discovery as well. At present, few motif discovery methods simultaneously take advantage of both the statistical enrichment of motifs and the preferential conservation of the sequences that match them. One recent study [7] enumerated spaced hexamers that were both preferentially conserved (in multiple sequence alignments) and statistically enriched. Another method, FootPrinter [8], identifies sequences (with mismatches) with few changes over an evolutionary tree. Neither of these methods, however, makes use of an explicit probabilistic model. Here we present a unified probabilistic framework that combines the mixture models of MEME with probabilistic models of evolution, and can thus be viewed as an evolutionary extension of MEME. These evolutionary models (used in the maximum likelihood estimation of phylogeny [9]) consider observed sequences to have been generated by a continuous time Markov substitution process from unobserved ancestral sequences, and can accurately model the complicated statistical relationship between sequences that have diverged along a tree from a common ancestor. Our approach considers observed sequences to have been generated from ancestral sequences that are two component mixtures of motif and background, each with their own evolutionary model. The value of varying evolutionary models has been realized in other contexts as well, e.g., [10], and such models have been successfully trained using EM [11]. A mixture of evolutionary models has been used previously to identify slowly evolving non-coding sequences [12], and this work can equally be regarded as an extension of that approach.
Given a set of aligned sequences, we use an EM algorithm to obtain the maximum likelihood estimates of the motif matrix and a corresponding evolutionary model.
3 Methods
3.1 Probabilistic model
We first describe the probabilistic framework used to model aligned non-coding sequences. We employ a mixture model, which can be written generically as
p(\mathrm{data}) = \sum_{\mathrm{models}} p(\mathrm{model})\, p(\mathrm{data} \mid \mathrm{model})

where p(x) is the probability density function for the random variable x. The sum over models indicates that the data is distributed as some mixture of component models, where the prior, p(model), is the mixing proportion. For simplicity, we first address the case of pair-wise sequence alignments.
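The generic mixture form above can be sketched in a few lines. The `components` encoding (a list of prior/density pairs) is an illustrative assumption for exposition, not the paper's implementation:

```python
def mixture_likelihood(x, components):
    """p(x) = sum over models of p(model) * p(x | model).
    components: list of (prior, density) pairs; the densities here are
    placeholders for the motif/background models of the text."""
    return sum(prior * density(x) for prior, density in components)
```

With priors 0.3 and 0.7 and constant densities 0.5 and 0.1, the mixture likelihood is 0.3*0.5 + 0.7*0.1 = 0.22, illustrating how the prior acts as a mixing proportion.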
Given some motif size, w, we treat the entire alignment as a series of alignments of length w, each of which may be an instance of the motif or a piece of background sequence. We denote the pair of aligned sequences as X and Y, and represent the ith position in a sequence as a vector of length 4 (one entry for each of A, C, G, T), with X_{ib} = 1 if the bth base is observed, and 0 otherwise. We denote the unobserved ancestral sequence, A, similarly, except that the values of A_{ib} are not observed. For a series of alignments of total length N, the likelihood, L, is given by

L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i)

where the m_i are unobserved indicator variables indexing the component models; in our case m is either motif or background. Generically, we let p(m_i) be the prior probability for each component. We incorporate the sequence specificity of the motif by letting the prior probabilities of observing each base in the ancestral sequence, p(A_{kb} \mid m_i), be the frequency of each base at each position in the motif (the frequency matrix). We write p(A_{kb} \mid m_i) = f_{mkb}, such that if m is motif, f_{mkb} gives the probability of observing the bth base at the (k-i)th position. For the background model we use the average base frequencies for each alignment, and assume that they are independent of position. This allows us to run our algorithm on several alignments simultaneously [13]; the densities are therefore conditioned on the alignment as well, but we omit this here for notational clarity. Finally, noting that the two sequences descended independently from the ancestor, we can write p(X_k, Y_k \mid A_{kb}, m_i) = p(X_k \mid A_{kb}, m_i)\, p(Y_k \mid A_{kb}, m_i), where p(X_k \mid A_{kb}, m_i) is the probability of the residue X_k given that the ancestral sequence, A, was base b at that position: a substitution matrix for each component model. For simplicity we use the Jukes-Cantor [14] substitution matrix, which is, in our notation,

p(X_k \mid A_{kb}, m_i) = \tfrac{1}{4}\left(1 + 3 e^{-\alpha_{k-i}}\right) \text{ if } X_k = b, \qquad \tfrac{1}{4}\left(1 - e^{-\alpha_{k-i}}\right) \text{ otherwise,}
where \alpha_{k-i} is the rate parameter at position k-i. It is here that we incorporate differences in evolution between the motif and background by specifying different substitution matrices for each component. For example, if we set \alpha_{k-i} smaller for the motif than for background, the motif
evolves at a slower rate than the background: it is conserved. We test a variety of different substitution models for the motif and summarize the implications for motif discovery in the Gcn4p targets (see Results). Unfortunately, as the dependence of these models on the equilibrium frequencies becomes more complicated, deriving ML estimators for the parameters becomes more difficult, and more general optimization methods may be necessary. Once again, we can allow each alignment its own background rate [13] and express the motif rate as a proportion of background.
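Under the single-rate-parameter Jukes-Cantor form quoted above, the substitution matrix can be sketched directly; the function name and nested-list layout are illustrative assumptions:

```python
import math

def jukes_cantor(alpha):
    """4x4 Jukes-Cantor substitution matrix for rate parameter alpha:
    P[a][b] = p(observed base b | ancestral base a), matching the
    single-parameter form quoted in the text."""
    same = 0.25 * (1.0 + 3.0 * math.exp(-alpha))
    diff = 0.25 * (1.0 - math.exp(-alpha))
    return [[same if a == b else diff for b in range(4)] for a in range(4)]
```

At alpha = 0 the matrix is the identity (no change), and as alpha grows every entry approaches 1/4 (complete saturation), which is why a smaller motif rate models conservation.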
3.2 An EM algorithm to train parameters
Following the example of the MEME program [5], which uses an EM algorithm (an iterative optimization scheme guaranteed to find local maxima in the likelihood) to fit mixtures to unrelated sequences, we now derive an EM algorithm to train the parameters of the model described above. We write the 'expected complete log likelihood' [15]
\langle \ln L_c \rangle = \sum_{i=0}^{N-w} \left[ \sum_{m_i} \langle m_i \rangle \ln p(m_i) + \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} m_i \rangle \left( \ln p(X_k \mid A_{kb}, m_i) + \ln p(Y_k \mid A_{kb}, m_i) + \ln f_{mkb} \right) \right]

where ln denotes the natural logarithm, and maximize by setting the derivatives with respect to the parameters to zero at each iteration. Setting \partial \langle \ln L_c \rangle / \partial \alpha_m = 0 and solving gives

\hat{\alpha}_m = -\ln\!\left( \frac{3 - R_m}{3\,(1 + R_m)} \right)

where R_m is the ratio of expected changed to identical residues under each model, and is given by

R_m = \frac{ \sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left( 2 - X_{kb} - Y_{kb} \right) }{ \sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left( X_{kb} + Y_{kb} \right) }

for all k in the case of a constant rate across the motif. The sufficient statistics \langle A_{kb} m_i \rangle and \langle m_i \rangle are derived by applying Bayes' theorem and are computed using the values of the parameters from the previous iteration; for example,

\langle m_i \rangle = \frac{ p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i) }{ \sum_{m'} p(m') \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m')\, p(A_{kb} \mid m') }
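The E-step computation of the responsibilities ⟨m_i⟩ for the pairwise model can be sketched as follows. This is a minimal sketch under stated assumptions: per-position rates are collapsed to a single alpha per component, and all names and the window encoding are illustrative, not EMnEM's internals:

```python
import math

def window_likelihood(x, y, freqs, alpha):
    """p(window | component) = prod_k sum_b p(x_k|b) p(y_k|b) p(b), with a
    single Jukes-Cantor rate alpha for the component.  x and y are lists of
    base indices (0..3); freqs[k][b] is the ancestral base distribution."""
    same = 0.25 * (1.0 + 3.0 * math.exp(-alpha))
    diff = 0.25 * (1.0 - math.exp(-alpha))
    like = 1.0
    for k, (xk, yk) in enumerate(zip(x, y)):
        like *= sum((same if xk == b else diff) * (same if yk == b else diff)
                    * freqs[k][b] for b in range(4))
    return like

def posterior_motif(x, y, motif_f, bg_f, a_motif, a_bg, prior_motif):
    """E-step responsibility <m_i>: Bayes' theorem over the two components."""
    lm = prior_motif * window_likelihood(x, y, motif_f, a_motif)
    lb = (1.0 - prior_motif) * window_likelihood(x, y, bg_f, a_bg)
    return lm / (lm + lb)
```

A conserved window matching the motif consensus receives a high responsibility even under a small prior, while a mismatching window falls back toward the background component.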
In order to extend these results beyond pair-wise alignments, we can simply replace the two sequences X and Y with the probability of the entire tree below, conditioned on having observed base b in the ancestral sequence. The likelihood becomes

L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(\mathrm{tree} \mid A_{kb})\, p(A_{kb} \mid m_i)

where the p(tree | A_kb) are computed using the 'pruning' algorithm [9]. Of course, a tree topology is needed in these cases; we used the accepted topology for the sensu stricto Saccharomyces [7] and computed for each alignment the maximum likelihood branch lengths using the PAML package [16].
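Felsenstein's pruning recursion, which supplies p(tree | A_kb) above, can be sketched as follows. The nested-tuple tree encoding and the Jukes-Cantor branch model are illustrative assumptions, not EMnEM's data structures:

```python
import math

def jc_prob(alpha, anc, obs):
    # Jukes-Cantor probability of observing base obs given ancestor anc
    e = math.exp(-alpha)
    return 0.25 * (1 + 3 * e) if anc == obs else 0.25 * (1 - e)

def prune(node):
    """Felsenstein pruning: return L[b] = p(observed leaves below node | ancestral base b).
    A node is an int (observed leaf base, 0..3) or a pair of (child, branch_rate) tuples."""
    if isinstance(node, int):
        return [1.0 if b == node else 0.0 for b in range(4)]
    (left, a_l), (right, a_r) = node
    Ll, Lr = prune(left), prune(right)
    def down(L, a):
        return [sum(jc_prob(a, b, c) * L[c] for c in range(4)) for b in range(4)]
    Dl, Dr = down(Ll, a_l), down(Lr, a_r)
    return [Dl[b] * Dr[b] for b in range(4)]
```

Summing L[b] against the ancestral base distribution p(A_kb | m_i) then yields the per-position term in the likelihood above. The recursion is linear in the number of sequences, which is the cost noted in the time-complexity discussion below.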
3.3 Implementation
We implemented a C++ program (EMnEM: Expectation-Maximization on Evolutionary Mixtures) to execute the algorithm described above, with the following extensions. Because instances of a motif may occur on either strand of DNA sequence, we also treat the strand of each occurrence as a hidden variable, and sum over the two possible orientations. In addition, because the mixture model treats each position in the alignment independently, we down-weight overlapping matches by limiting the total expected number of matches in any window of 2w to be less than one. Finally, because EM is guaranteed only to converge to a local optimum in the likelihood, we need to initialize the model in the region of the likelihood space where we believe the global optimum lies. Similar to the strategy used in the MEME program [5], we initialize the motif matrix with the reconstructed ancestral sequence of length w at each position in the alignments, and perform the full EM starting with the sequence at the position that had the greatest likelihood. EMnEM will be made available at http://rana.lbl.gov.
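The strand summation described above can be sketched as follows; `window_like` stands in for any per-window likelihood (such as the mixture components above), and the 50/50 strand prior is an illustrative assumption:

```python
def strand_sum(window_like, x, y, prior_fwd=0.5):
    """Treat the strand of a motif occurrence as a hidden variable: sum the
    window likelihood over both orientations of the aligned pair (x, y).
    window_like is any function of two base-index lists (0=A,1=C,2=G,3=T)."""
    comp = {0: 3, 1: 2, 2: 1, 3: 0}          # A<->T, C<->G complement
    rc = lambda s: [comp[b] for b in reversed(s)]
    return prior_fwd * window_like(x, y) + (1 - prior_fwd) * window_like(rc(x), rc(y))
```

For a palindromic window the two orientation terms coincide, so the summation leaves the likelihood unchanged, as expected.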
3.4 Time complexity
The time complexity of the EM algorithm is linear in the total length of the data, and the initialization heuristic we have implemented is quadratic in the length. Interestingly, because our algorithm runs on aligned sequences, relative to MEME, which treats sequences independently, the total length is reduced by a factor of 1/S, where S is the number of sequences in the alignment. Usually, we lose this factor in each iteration when calculating p(tree | A_kb) using the 'pruning' algorithm [9], as it is linear in S. We note, however, that for evolutionary models (e.g., Jukes-Cantor) where p(tree | A_kb) is independent of p(A_kb | m_i), we may learn the matrix without re-estimating the sufficient statistics ⟨A_kb⟩ (the reconstructed ancestral sequence) at each iteration. In these cases the complexity of EMnEM will indeed be linear in the length of the aligned sequence, a considerable speedup, especially in the quadratic initialization step.
4 Results and Discussion
4.1 A test case from the budding yeasts
In order to compare our algorithm under various evolutionary models as well as to other motif discovery strategies, we chose to compare all methods on a single test case: the upstream regions from 5 sensu stricto Saccharomyces species (S. bayanus, S. cerevisiae, S. kudriavzevii, S. mikatae, and S. paradoxus) of 9 known Gcn4p targets that are listed in SCPD [17]. In order to control for variability in alignment quality at different evolutionary distances, we made multiple alignments of all available upstream regions using T-Coffee [18] and then extracted the appropriate sequences for any subset of the species. The Gcn4p targets from SCPD are a good set on which to test our method because there are a relatively high number of characterized sites in these promoters. In addition, the upstream regions of these genes contain stretches of poly T, which are not known to be binding sites. As a result, MEME ('tcm' model, w = 10) assigns a lower (better) e-value to a 'polyT' motif (e=2.7e-03) than to the known Gcn4p motif (e=1.6e06) when run on the S. cerevisiae upstream regions. Because this is typical of the types of false positives that motif finding algorithms produce, we use as an indicator of the success of our method the log ratio of the likelihood of the evolutionary mixture model using the real Gcn4p matrix to that using the polyT matrix. If this indicator is greater than zero, i.e., the real motif has a greater likelihood than the false positive, it should be returned as the top motif.
4.2 Incorporating a model of motif evolution can eliminate false positives
In order to explore the effects of incorporating models of motif evolution into motif detection, we tested several evolutionary models. In particular we were interested in the effect of incorporating evolutionary rate, as real motifs evolve more slowly than surrounding sequences. Using alignments of S. cerevisiae and S. mikatae, we calculated the log ratio of the likelihood using the real Gcn4p matrix to the likelihood using the polyT matrix with Jukes-Cantor substitution under several assumptions about the rate of evolution in the motif (Figure 1). Interestingly, slower evolution in the motif, either 1/4 or 0.03 (the ML estimate) times the background rate, is enough to assign a higher likelihood to the Gcn4p motif and thus eliminate the false positive. We tried two additional evolutionary models, in which the rate of substitution at each position depends on the frequency matrix. In the Felsenstein '81 model (F81) the different types of changes occur at different rates, but the overall rate at each position is constant, while the Halpern-Bruno model (HB) assumes there is purifying selection at each position and can account for positional variation in overall rate [19,20]. In each case, these more realistic models further favored the Gcn4p matrix over the polyT.
Figure 1. Effect of models for motif evolution on motif detection. Plotted is the log ratio of the likelihood using the Gcn4p matrix to the likelihood using the polyT matrix under various evolutionary models in alignments of S. cerevisiae to S. mikatae. Models that allow the motif to evolve more slowly than background, JC (0.25), JC (ML) and JC (HB), and models in which the rates of evolution take into account the deviation from equilibrium base frequencies, F81 and JC (HB), assign higher likelihood to the Gcn4p matrix. Also plotted is the negative log ratio of the e-values from MEME ('tcm' model, w = 10). JC are Jukes-Cantor models with rate parameter equal to background (bg), 1/4 of background (0.25), or set to the maximum-likelihood estimate below background (ML).
4.3 Success of motif discovery is dependent on evolutionary distance
In order to test the generality of the results achieved for the S. cerevisiae-S. mikatae alignments, we calculated the log ratio of the likelihood of the evolutionary mixture using the real Gcn4p matrix to the polyT matrix over a range of evolutionary distances and rates of evolution (Figure 2, filled symbols). At closer distances, more of the data is redundant, while over longer
comparisons, conserved sequences should stand out more against the background. Indeed, at the distance of S. cerevisiae to S. paradoxus (~0.13 substitutions per site), the likelihood of polyT is greater, while at the distance of S. cerevisiae, S. mikatae, and S. paradoxus (~0.31 subs. per site) the Gcn4p matrix is favored. Interestingly, this is true regardless of the rate of evolution assumed for the motif. While at all evolutionary distances slow evolution favors the Gcn4p matrix more than when the motif evolves at the background rate, the effect of including slower evolution is smaller than the effect of varying the evolutionary distance. Only at the borderline distance of S. cerevisiae to S. mikatae (~0.25 subs. per site) do the models perform differently. We also ran MEME (with the 'tcm' model, w set at 10) on all the sequences (from all genes and all species) and calculated the negative log ratio of the MEME e-values for the two motifs (Figure 2, heavy trace). MEME treats all the sequences independently, and continues to assign the polyT matrix a lower e-value over all the evolutionary distances. At least for this case, it seems more important to accurately model the phylogenetic relationships between the sequences (i.e., using a tree) than to accurately model the evolution within the motif.
Figure 2. Effect of evolutionary distance on motif detection. Log ratio of the likelihood using the Gcn4p matrix to the likelihood using the polyT matrix for alignments that span increasing evolutionary distance. At distances greater than S. cerevisiae to S. mikatae the evolutionary mixture assigns the Gcn4p matrix a greater likelihood whether the rate of evolution in the motif is equal to, 1/2, 1/4 or 1/8 of the background rate (diamonds, squares, triangles and circles, respectively). Also plotted are negative log ratios of the MEME e-values for the Gcn4p to polyT, using the entire sequences, or prefiltering alignments for 20 base pair windows of at least 70% or 50% identity to a reference genome (heavy, lighter and lightest traces, respectively).
4.4 The unified framework is preferable to using evolutionary information separately
In order to compare our method, which incorporates evolutionary information directly into motif discovery, to approaches that use such information separately, we scanned the alignments at each evolutionary distance and removed regions that were less than 50% or 70% identical to a reference genome in a 20 base pair window. This allows MEME, which does not take phylogenetic information into account, to focus on the conserved regions. We ran MEME and computed the negative log ratio of the e-values for the Gcn4p matrix and the polyT matrix. While in both cases there were distances where the real motif was favored (Figure 2, lighter traces), the effect of the filtering was not consistent. At
Binding factor | Target genes | EMnEM rank | MEME rank | Motif
Rox1p (+) | HEM13, RTT101 | | 2 | TCTATTGTTC
| ERG2, ERG3, ERG9, UPC2 | | | TCTAAACGAA
Rfx1p (++) | RNR2, RNR3, RNR4, RFX1 | | |
Gcr1p (+) | CDC19, PGK1, TPI1, ENO1, ENO2 | 1 | |
Aro80p (++) | ARO80, ARO9, ARO10 | 1 | |
Yap1p (++) | TRR1, TRX2, GSH1 | | | -
Zap1p (++) | FET4 | 3 | | GTTGCCAGAC

Table 1. Motif discovery using EMnEM and MEME. The EMnEM program was run using the Jukes-Cantor model for motif evolution with the rate set to 1/4 background (JC 0.25) on S. cerevisiae-S. mikatae alignments in each case. For cases where EMnEM ranked the motif higher, the consensus sequence and a plot of the information content is shown. MEME was run on the unaligned sequences from both species simultaneously. Target genes are from SCPD [17] (+) or YPD [21] (++). - indicates that a plausible motif was not found.
distances too close, not enough is filtered out, and the polyT is still preferred, while at distances too far, real instances of the motif no longer pass the cutoff and the real motif is no longer recovered (Figure 2, lighter traces). Thus, while incorporating evolutionary information separately can help recover the real motif, it depends critically on the choice of percent identity cutoff.
4.5 Examples of other discovered motifs
We ran both our program and MEME on the upstream regions of target genes of some transcription factors with few characterized targets and/or poorly defined motifs. In several cases, for a given motif size, our algorithm ranked a plausible motif first, while MEME ranked a polyT motif first (see Table 1).
5 Conclusions and future directions
We have provided an evolutionary mixture model for transcription factor binding sites in aligned sequences, and a motif finding algorithm based on this framework. We believe that our approach has many advantages over current methods: it produces probabilistic models of motifs, can be applied directly to multiple or pair-wise alignments, and can be applied simultaneously at multiple loci. Our method should be applicable to any group of species whose intergenic regions can be aligned, though because alignments may not be possible at large evolutionary distances, our reliance on them is a disadvantage of our method relative to FootPrinter [8]. It is not difficult to conceive of extending this framework to unaligned sequences by treating the alignment as a hidden variable as well; unfortunately, the space of multiple alignments is large, and improved optimization methods would certainly be needed. In addition to motif discovery, our probabilistic framework is also applicable to binding site identification. Current methods that search genome sequence for matches to motifs are also plagued by false positives, but optimally combining sequence specificity and evolutionary constraint may lead to considerable improvement.
Acknowledgements
We thank Dr. Audrey Gasch, Emily Hare and Dan Pollard for comments on the manuscript. MBE is a Pew Scholar in the Biomedical Sciences. This work was conducted under US Department of Energy contract No. DE-AC03-76SF00098.
References
1. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23.
2. Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A. 1989 Feb;86(4):1183-7.
3. Lawrence CE, Reilly AA. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins. 1990;7(1):41-51.
4. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993 Oct 8;262(5131):208-14.
5. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
6. Hardison R. Conserved noncoding sequences are reliable guides to regulatory elements. Trends in Genetics. 2000 Sep;16(9):369-372.
7. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.
8. Blanchette M, Schwikowski B, Tompa M. Algorithms for phylogenetic footprinting. J Comput Biol. 2002;9(2):211-23.
9. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17(6):368-76.
10. Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics. 2000 Sep;16(9):760-6. Erratum in: Bioinformatics. 2001 Mar;17(3):290.
11. Holmes I, Rubin GM. An expectation maximization algorithm for training hidden substitution models. J Mol Biol. 2002 Apr 12;317(5):753-64.
12. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003 Feb 28;299(5611):1391-4.
13. Yang Z. Maximum likelihood models for combined analyses of multiple sequence data. J Mol Evol. 1996;42:587-596.
14. Yang Z, Goldman N, Friday AE. Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol Biol Evol. 1994;11:316-324.
15. Jordan MI. An Introduction to Probabilistic Graphical Models, in preparation.
16. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13(5):555-556.
17. Zhu J, Zhang MQ. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999 Jul-Aug;15(7-8):607-611.
18. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17.
19. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998 Jul;15(7):910-917.
20. Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol. 2003;3:18.
21. Hodges PE, Payne WE, Garrels JI. The Yeast Protein Database (YPD): a curated proteome database for Saccharomyces cerevisiae. Nucleic Acids Res. 1998 Jan 1;26(1):68-72.
USING PROTEIN-PROTEIN INTERACTIONS FOR REFINING GENE NETWORKS ESTIMATED FROM MICROARRAY DATA BY BAYESIAN NETWORKS

N. NARIAI, S. KIM, S. IMOTO, S. MIYANO
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan

We propose a statistical method to estimate gene networks from DNA microarray data and protein-protein interactions. Because physical interactions between proteins or multiprotein complexes are likely to regulate biological processes, using only mRNA expression data is not sufficient for estimating a gene network accurately. Our method adds knowledge about protein-protein interactions to the estimation method of gene networks under a Bayesian statistical framework. In the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. We show the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae cell cycle data. The proposed method improves the accuracy of the estimated gene networks, and successfully identifies some biological facts.
1 Introduction
The complete DNA sequences of many organisms, such as yeast, mouse, and human, have recently become available. Genome sequences specify the gene expressions that produce proteins of living cells, but how the biological system as a whole really works is still unknown. Currently, a large number of gene expression data and protein-protein (p-p) interaction data have been collected from high-throughput analyses, and estimating gene networks from these data has become an important topic in systems biology. Several methods have been proposed for estimating gene networks from microarray data by using Boolean networks, differential equation models, and Bayesian networks. However, using only microarray data is not sufficient for estimating gene networks accurately, because the information contained in microarray data is limited by the number of arrays, their quality, noise and experimental errors. Therefore, the use of other biological knowledge together with microarray data is a key for extracting more reliable information. Hartemink et al. noticed this idea previously and proposed a method to use localization data combined with microarray data for estimating a gene network. There are other works combining microarray data with biological knowledge, such as DNA sequences of promoter elements and transcriptional bindings of regulators.
In this paper, we propose a statistical method for estimating gene networks from microarray data and p-p interactions by using a Bayesian network model. We extract 9,030 physical interactions from the MIPS database to add knowledge about p-p interactions to the estimation method of gene networks. If multiple genes form a protein complex, then it is natural to treat them as one variable in the estimated gene network. In the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. That is, the protein complexes are dynamically found and modeled by the proposed method while we estimate a gene network. Previously, Segal et al. proposed a method for identifying pathways from microarray data and p-p interaction data. A different point of our method is that we model protein complexes directly in the Bayesian network model, aimed at refining the estimated gene network. Also, our method can decide whether to make a protein complex based on our criterion. We evaluate our method through the analysis of Saccharomyces cerevisiae cell cycle gene expression data. First, we estimated three gene networks: by microarray data alone, by p-p interactions alone, and by our method. Then, we compared them with the gene network compiled by KEGG for evaluation. We successfully show that the accuracy of the estimated gene network is improved by our approach. Second, among 350 cell cycle related genes, we found 34 gene pairs as protein complexes. In reality, most of them are likely to form protein complexes considering biological databases and existing literature. Third, we show an example of using additional information, "phase", together with the microarray data and p-p interactions for estimating a more meaningful gene network.

2 Bayesian Network Model with Protein Complex
Bayesian networks (BNs) are a type of graphical model that represents relationships between variables. That is, for each variable there is a probability distribution function whose definition depends on the edges leading into the variable. A BN is a directed acyclic graph (DAG) encoding the Markov assumption that each variable is independent of its non-descendants, given just its parents. In the context of BNs, a gene is regarded as a random variable and shown as a node in the graph, and the relationship between a gene and its parents is represented by the conditional probability. Thus, the joint probability of all genes can be decomposed as the product of the conditional probabilities. Suppose that we have n sets of microarray data \{x_1, \dots, x_n\} of p genes. A BN model is then written as

f(x_{i1}, \dots, x_{ip} \mid \theta_G) = \prod_{j=1}^{p} f_j(x_{ij} \mid p_{ij}, \theta_j),

where p_{ij} is the parent observation vector of the jth gene (gene_j) measured by the ith array. For example, if gene_2 and gene_3 are parents of gene_1, we set p_{i1} = (x_{i2}, x_{i3})^T. If we ignore the information of p-p interactions, the relationship between x_{ij} and p_{ij} can be modeled by using a nonparametric additive regression
    x_ij = Σ_k m_jk(p_ij^(k)) + ε_ij,

where p_ij^(k) is the kth element of p_ij, m_jk is a regression function, and ε_ij is a random variable with a normal distribution with mean 0 and variance σ_j^2. When a gene is regulated by a protein complex, it is natural to consider the protein complex as a direct parent. Therefore, we consider the use of virtual nodes corresponding to protein complexes in the BN model. Concretely, if gene_2 and gene_3 form a protein complex and regulate gene_1, we construct a new variable "complex23" from the expression data of gene_2 and gene_3. In the BN model, we then consider the relation "complex23 -> gene_1" instead of "gene_2 -> gene_1 <- gene_3". If genes form a protein complex, it is expected that there will be a relatively high correlation among the expression values of those genes. For constructing a new variable representing a protein complex, therefore, we use principal component analysis^17 (PCA). By using PCA, we can reduce the dimension of the data with the least loss of information. Suppose that genes gene_1 to gene_d form a protein complex and that the d-dimensional vector a_1^[1-d] is the first eigenvector of the matrix

    (1/n) Σ_{i=1}^{n} (x_i^[1-d] - x̄^[1-d])(x_i^[1-d] - x̄^[1-d])^T,

with x_i^[1-d] = (x_i1, ..., x_id)^T and x̄^[1-d] = Σ_i x_i^[1-d] / n. Here x^T is the transpose of x. The ith observation of the protein complex is then obtained by

    c_i = a_1^[1-d]T (x_i^[1-d] - x̄^[1-d]).

In such a case, we use the regression function m_j[1-d](c_i) instead of the additive regression m_j1(x_i1) + ... + m_jd(x_id). Figure 1 shows an example of modeling a protein complex. SPC97 and SPC98 form a protein complex. The solid line is the first principal component and the observations of the protein complex are obtained by projecting the expression data onto this line.
This model can be viewed as an extension of principal component regression^2, in which we choose whether to form protein complexes based on a criterion that evaluates the goodness of the BN model as a gene network.
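The virtual-node construction described above can be sketched as follows. This is a minimal sketch using numpy; the function name `complex_profile` is hypothetical and not from the paper:

```python
import numpy as np

def complex_profile(X):
    """Collapse the expression profiles of genes in a putative complex
    into one virtual-node profile via the first principal component.

    X : (n_arrays, d_genes) expression matrix for the d member genes.
    Returns (c, rate): the n-vector of virtual-node observations and
    the contribution rate of the first principal component.
    """
    Xc = X - X.mean(axis=0)              # centre each gene's profile
    cov = Xc.T @ Xc / X.shape[0]         # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    a1 = vecs[:, -1]                     # first eigenvector a_1
    c = Xc @ a1                          # projections c_i = a_1^T (x_i - xbar)
    rate = vals[-1] / vals.sum()         # variance explained by PC1
    return c, rate
```

The returned rate corresponds to the "rate" column of Table 2: a pair whose profiles are nearly collinear keeps almost all its information in the single virtual node.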
Figure 1: An example of modeling a protein complex by using principal component analysis. The scatter plot of SPC97 and SPC98, and the first principal component are shown.
3
Criterion and Algorithm for Estimating a Gene Network
From a Bayesian statistical viewpoint, we can choose the graph structure by maximizing the posterior probability of the graph G,

    π(G | X) ∝ π(G) ∫ Π_{i=1}^{n} f(x_i1, ..., x_ip | θ_G) π(θ_G | λ) dθ_G,   (2)
where π(G) is a prior probability of the graph G, π(θ_G | λ) is the prior distribution on the parameter θ_G, and λ is the hyperparameter vector. The marginal likelihood measures the closeness between the microarray data and the graph G. We add the knowledge about p-p interactions into π(G). Following the result of Imoto et al.^15, we can model the knowledge about p-p interactions as a prior probability of the graph G by using the Gibbs distribution^10. Let U_ij be the interaction energy of the edge from gene_i to gene_j, categorized into two values H_1 and H_2 (H_1 < H_2). If there is a p-p interaction between gene_i and gene_j, we set U_ij = U_ji = H_1. The total energy of the graph G can then be defined as

    E(G) = Σ_{(i,j)∈G} U_ij,

where the sum is taken over the existing edges in the graph G. The probability π(G) is naturally modeled by the Gibbs distribution of the form π(G) = Z^{-1} exp{-ζE(G)}, where ζ (> 0) is an inverse temperature and Z is the partition function given by Z = Σ_{G'∈𝒢} exp{-ζE(G')}. Here 𝒢 is the set of possible graphs. By replacing ζH_1 and ζH_2 with ζ_1 and ζ_2, respectively, the prior probability π(G) is specified by ζ_1 and ζ_2. Hence, we have

    π(G) = Z^{-1} Π_{(i,j)∈G} exp(-ζ_{a(i,j)}),  with a(i,j) = k for U_ij = H_k.
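The prior just described can be sketched in a few lines. Since the partition function Z sums over all possible graphs, the sketch below returns the unnormalized log prior, which suffices for comparing candidate graphs because Z cancels; function and variable names are hypothetical:

```python
def log_prior_unnormalized(edges, ppi, zeta1, zeta2):
    """Unnormalized log prior, i.e. log pi(G) + log Z, for a graph G.

    edges : iterable of directed edges (i, j) present in G
    ppi   : set of frozensets {i, j} with a known p-p interaction
    zeta1, zeta2 : edge penalties with / without p-p support
                   (zeta1 < zeta2; the paper uses 0.5 and 25.0)
    """
    # each edge contributes -zeta_{a(i,j)}; supported edges cost less
    return -sum(zeta1 if frozenset(e) in ppi else zeta2 for e in edges)
```

Because a supported edge is penalized by ζ_1 rather than ζ_2, graphs containing known p-p interactions receive higher prior probability, which is exactly how the interaction data biases the network search.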
For computing the marginal likelihood represented by the integration in (2), we used the Laplace approximation for integrals^{6,19,33}; the result was shown by Imoto et al.^14. Hence, we have a Bayesian information criterion, named BNRC (Bayesian network and Nonparametric Regression Criterion), for evaluating networks; in it,

    J_λ(θ̂_G) = -∂²{l_λ(θ_G | X)} / ∂θ_G ∂θ_G^T,

and θ̂_G is the mode of l_λ(θ_G | X). We can choose the graph structure as the minimizer of BNRC. Based on the BN model with protein complexes and the information criterion described above, we naturally obtain the following greedy hill-climbing algorithm for finding and modeling protein complexes and estimating a gene network:
Step 1. For gene_i, perform whichever of the four procedures "add a parent", "remove a parent", "reverse the parent-child relationship", and "none" gives the lowest BNRC score. If a directed cycle is formed, cancel the operation.

Step 2. If "add a parent" was performed in Step 1, go to Step 3. Otherwise, go to Step 6.

Step 3. If the relation between gene_i and the added gene (denoted gene_(i)) is listed in the p-p interactions, go to Step 4. Otherwise, go to Step 6.

Step 4. Construct a protein complex from the expression values of gene_i and gene_(i) based on principal component analysis.

Step 5. If the protein complex works better than using only gene_i or gene_(i) as a parent of each child of gene_i or gene_(i), use this protein complex in the estimated network. Otherwise, ignore this protein complex.

Step 6. If the BNRC score is unchanged, the learning is finished. Otherwise, go to Step 1 and continue the greedy hill-climbing.
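The greedy search of Steps 1-6 can be sketched as follows. This is a simplified sketch: the "reverse" operation and the protein-complex test of Steps 3-5 are omitted, and a generic `score` callback stands in for the local BNRC computation; all names are hypothetical:

```python
def greedy_search(n_genes, score):
    """Greedy hill-climbing over parent sets (Steps 1, 2 and 6,
    simplified: 'reverse' and the complex-merging test are omitted).

    score(j, parents) -> BNRC-like local score to MINIMIZE for gene j.
    Returns a dict mapping each gene to its learned set of parents.
    """
    parents = {j: set() for j in range(n_genes)}

    def creates_cycle(child, parent):
        # adding parent -> child forms a cycle iff child is already
        # an ancestor of parent (walk upward along parent links)
        stack, seen = [parent], set()
        while stack:
            g = stack.pop()
            if g == child:
                return True
            if g not in seen:
                seen.add(g)
                stack.extend(parents[g])
        return False

    improved = True
    while improved:                       # Step 6: stop when no move helps
        improved = False
        for j in range(n_genes):          # Step 1: best of add/remove/none
            best = score(j, parents[j])
            best_set = set(parents[j])
            for k in range(n_genes):
                if k == j:
                    continue
                cand = set(parents[j])
                if k in cand:
                    cand.discard(k)       # "remove a parent"
                elif creates_cycle(j, k):
                    continue              # cancel cycle-forming operation
                else:
                    cand.add(k)           # "add a parent"
                s = score(j, cand)
                if s < best:
                    best, best_set, improved = s, cand, True
            parents[j] = best_set
    return parents
```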
Table 1: Comparison result of the cell cycle pathway in KEGG. "agree", "reverse", "false negative" and "false positive" edges are counted by comparing the estimated networks with the KEGG pathway. Note that edges among protein complexes are not counted in this table.

    edge type        microarray data only   p-p interactions only    our method
    agree                   4                19 (directions unknown)      16
    reverse                 2                -                             4
    false negative         20                26                           18
    false positive         55                11                           14

4
Computational Experiments
We apply our method to the Saccharomyces cerevisiae cell cycle microarray data^31 and the 9,030 p-p interactions extracted from the MIPS database^21. For the prior probability π(G) given in Section 3, we chose ζ_1 = 0.5 and ζ_2 = 25.0 experimentally; this is the point where the maximum number of protein complexes is observed in the estimated gene networks. When we used a larger ζ_1 and a smaller ζ_2, the p-p interactions did not contribute to the gene network refinement. On the other hand, when we used a smaller ζ_1 and a larger ζ_2, the resulting network reflected the p-p interactions too strongly.

4.1
Cell Cycle Pathway in KEGG
For evaluating the accuracy of the estimated gene networks, we chose 99 genes from the KEGG pathway database entry for the Saccharomyces cerevisiae cell cycle^18. In this analysis, we focus on how the accuracy of the estimated network increases when the information on p-p interactions is added. We estimated three gene networks: by using only microarray data, by using only p-p interactions, and by using the proposed method. We then compared them with the gene network compiled by KEGG for evaluation. Table 1 summarizes the result of the comparison among the three networks. Note that in this table edges among protein complexes are not counted, because these edges should not be considered "gene regulation" in the gene network. By comparing the network estimated by microarray data alone with the network estimated by our method, we can immediately see that the number of edges that agree with the KEGG pathway, denoted as "agree", increases substantially when p-p interactions are added to the microarray data. We can also observe that the proposed method drastically reduces the number of false positive edges. By comparing the network estimated by p-p interactions alone with the network
Figure 2: Cell cycle gene network estimated by our method
estimated by our method, we can find that several false negative edges of p-p interactions are newly estimated by adding microarray data, though the number of agree edges is almost the same. As for false positive edges, we could not observe apparent improvements by adding microarray data. Figure 2 shows a part of the estimated gene network based on the proposed method. We can find that the proposed method succeeded in finding APC (Anaphase Promoting Complex), MCM (Mini-Chromosome Maintenance) complex, and clb5-cdc28p complex. 4.2
Gene Network with 350 Cell Cycle Genes
For evaluating our method in the sense of modeling protein complexes, we chose 350 genes from the MIPS functional category "mitotic cell cycle and cell cycle control", and searched for protein complexes while learning gene networks. We found the 34 candidate protein complexes listed in Table 2. Among these 34 candidates, 22 pairs are also listed in the MIPS complex catalogue, and six pairs are supported by the existing literature.
Table 2: Detected protein complexes among 350 cell cycle genes. "rate" means the contribution rate of the 1st principal component of the two genes, and "eval." means the evaluation of the result. "O" shows that the MIPS protein complexes catalogue contains the pair as a protein complex. "Δ" shows that while the MIPS catalogue does not contain the pair, the existing literature supports it. "?" shows that the result has not been reported yet.

    gene A   gene B     rate   eval.   annotation
    RSC6     RSC8       0.91   O       RSC complex
    MCM5     MCM7       0.89   O       MCM complex
    SPC97    SPC98      0.80   O       gamma-tubulin complex
    CIK1     KAR3       0.70   O       kinesin-related motor proteins
    CLB5     CDC28      0.69   O       clb5-cdc28p complex
    GIM3     PAC10      0.67   O       gim complex
    SKP1     CDC53      0.66   O       SCF complex
    CDC11    CDC12      0.80   O       septin filaments
    CDC3     SHS1       0.55   O       septin filaments
    CDC10    SHS1       0.54   O       septin filaments
    APC1     APC10      0.75   O       APC complex
    APC4     CDC23      0.74   O       APC complex
    APC4     APC11      0.73   O       APC complex
    APC10    APC11      0.72   O       APC complex
    APC5     APC10      0.71   O       APC complex
    APC1     CDC23      0.66   O       APC complex
    APC2     CDC16      0.66   O       APC complex
    APC5     CDC16      0.66   O       APC complex
    APC1     CDC26      0.64   O       APC complex
    APC2     APC5       0.63   O       APC complex
    APC3     CDC16      0.63   O       APC complex
    APC11    CDC26      0.55   O       APC complex
    SMC1     SMC3       0.84   Δ       cohesin complex^11
    SCC3     SMC3       0.63   Δ       cohesin complex^11
    BIM1     TUB1       0.69   Δ       tubulin complex^35
    CLN2     CDC53      0.64   Δ       G1/S transition^34
    CKS1     CDC28      0.57   Δ       cyclin-dependent kinase^24
    HSL7     SWE1       0.55   Δ       septin assembly checkpoint^5
    RAD23    RPT6       0.82   ?       proteasome
    NUF2     NUM1       0.80   ?       nuclear migration
    NUF1     SPC97      0.79   ?       nuclear migration
    NUF2     SMC1       0.77   ?       nuclear migration
    CBF2     YGR179C    0.65   ?       centromere/kinetochore-associated
    CDC24    SWE1       0.55   ?       serine/threonine protein kinase
Although the six pairs denoted "?" in Table 2 have not been reported, our results suggest that each pair forms a protein complex. For example, RAD23 and RPT6 may form a protein complex that is involved in proteasome activity. Similarly, NUF2 and NUM1 may work together in nuclear migration. There are 309 p-p interactions among the 350 cell cycle related genes, of which only 119 are in fact protein complex related. These results suggest that our method successfully models protein complexes and finds biologically plausible ones.
4.3 Using Phase Information together with Microarrays and P-P Interactions

In this section, we show a case that uses additional information, the cell cycle "phase", together with the microarray data and p-p interactions. It is known that the cyclins "CLN1 and CLN2", "CLB5 and CLB6", and "CLB1 and CLB2" are activated in the G1/S, S, and M phases, respectively. Before estimating a gene network, we choose phase-specific genes whose expression levels are highly correlated with each cyclin listed above. We collected 33 genes whose correlation with one of the cyclins is greater than 0.75. Also, we selected 93 genes that show p-p interactions with those 33 genes and the six cyclins. That is, in this analysis, we focus on the gene network with 132 genes. Figure 3 shows the expression patterns of the genes divided into three groups by the correlations and p-p interactions. First, we estimate a gene network for each phase, i.e., the G1/S, S, and M phases. We then combine those three networks and obtain the final network shown in Figure 4. Genes on the dotted line are selected as members of both phases, i.e., YOX1 belongs to the G1/S phase and also the S phase. In this analysis, we can find biologically important genes, such as HCM1, FKH2, and ACE2. These genes are known transcription factors, and FKH2 was reported^36 as a regulator of CLB2, SWI5, and HST3. Although the KEGG pathway does not include those genes, we succeeded in finding these important relationships based on our approach.

5
Discussion
In this paper we proposed a statistical method for estimating gene networks by combining microarray gene expression data and p-p interactions. We also proposed a method for modeling protein complexes in the estimated gene network by using principal component analysis. An advantage of our method is that not only p-p interactions, but also protein complexes are naturally modeled under a Bayesian statistical framework. By adding p-p interaction data into our Bayesian network estimation method, we successfully estimated the gene
Figure 3: Gene expression profiles that belong to (Top) the G1/S phase, (Middle) the S phase, and (Bottom) the M phase.
Figure 4: Cell cycle gene network estimated by using “phase” information together with microarray data and p-p interactions.
network more accurately than by using only microarray data. We also observed that protein complexes were correctly found and modeled while learning gene networks. We consider the following topics as future work. First, our greedy algorithm currently only merges protein pairs based on PCA; modeling larger protein complexes in the gene network will be an important problem. Second, as real biological processes are often condition specific, it is important to take "conditions" or "environments" into account. Third, in the last experiment, we showed an example in which we added additional information, the cell cycle "phase", to the microarray data and p-p interaction data, and estimated a gene network based on those three types of data. We expect that estimating an accurate gene network by using further genomic data, including DNA-protein interactions, binding site information, and so on, will give us more meaningful information about biological processes. We would like to investigate these topics in future papers.
Acknowledgements

The authors would like to thank the three referees for their helpful comments and suggestions.
References

1. T. Akutsu, S. Miyano and S. Kuhara, Pac. Symp. Biocomput., 4, 17 (1999).
2. S. Chatterjee and B. Price, John Wiley and Sons (1977).
3. T. Chen, H. L. He and G. M. Church, Pac. Symp. Biocomput., 4, 29 (1999).
4. R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart and R. W. Davis, Molecular Cell, 2, 65 (1998).
5. V. J. Cid, M. J. Shulewitz, K. L. McDonald and J. Thorner, Mol. Biol. Cell, 12, 1645 (2001).
6. A. C. Davison, Biometrika, 73, 323 (1986).
7. M. J. L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara and S. Miyano, Pac. Symp. Biocomput., 8, 17 (2003).
8. N. Friedman and M. Goldszmidt, in M. I. Jordan, ed., Kluwer Academic Publishers, 421 (1998).
9. N. Friedman, M. Linial, I. Nachman and D. Pe'er, J. Comp. Biol., 7, 601 (2000).
10. S. Geman and D. Geman, IEEE TPAMI, 6, 721 (1984).
11. C. H. Haering, J. Lowe, A. Hochwagen and K. Nasmyth, Molecular Cell, 9, 773 (2002).
12. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola and R. A. Young, Pac. Symp. Biocomput., 6, 422 (2001).
13. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola and R. A. Young, Pac. Symp. Biocomput., 7, 437 (2002).
14. S. Imoto, T. Goto and S. Miyano, Pac. Symp. Biocomput., 7, 175 (2002).
15. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara and S. Miyano, Proc. 2nd IEEE Computer Society Bioinformatics Conference, 104 (2003).
16. S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara and S. Miyano, J. Bioinformatics and Comp. Biol., 1(2), 231 (2003).
17. I. J. Jolliffe, Springer-Verlag, New York (1986).
18. M. Kanehisa, S. Goto, S. Kawashima and A. Nakaya, Nucleic Acids Res., 30, 42 (2002).
19. S. Konishi, T. Ando and S. Imoto, Biometrika (2003), in press.
20. H. J. McBride, Y. Yu and D. J. Stillman, J. Biol. Chem., 274, 21029 (1999).
21. H. W. Mewes, D. Frishman, U. Gueldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Muensterkoetter, S. Rudd and B. Weil, Nucleic Acids Res., 30(1), 31 (2002).
22. D. Pe'er, A. Regev, G. Elidan and N. Friedman, Bioinformatics, 17, S1 (2001).
23. Y. Pilpel, P. Sudarsanam and G. M. Church, Nature Genetics, 29, 153 (2001).
24. G. J. Reynard, W. Reynolds, R. Verma and R. J. Deshaies, Mol. Cell. Biol., 20, 5858 (2000).
25. K. Schwartz, K. Richards and D. Botstein, Mol. Biol. Cell, 8, 2677 (1997).
26. E. Segal, Y. Barash, I. Simon, N. Friedman and D. Koller, RECOMB, 273 (2002).
27. E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller and N. Friedman, Nature Genetics, 34(2), 166 (2003).
28. E. Segal, H. Wang and D. Koller, Bioinformatics, 19, S264 (ISMB 2003).
29. E. Segal, R. Yelensky and D. Koller, Bioinformatics, 19, S273 (ISMB 2003).
30. I. Shmulevich, E. R. Dougherty, S. Kim and W. Zhang, Bioinformatics, 18, 261 (2002).
31. P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein and B. Futcher, Mol. Biol. Cell, 9, 3273 (1998).
32. Y. Tamada, S. Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara and S. Miyano, Bioinformatics (ECCB 2003), in press.
33. L. Tierney and J. B. Kadane, J. Amer. Statist. Assoc., 81, 82 (1986).
34. A. R. Willems, S. Lanker, E. E. Patton, K. L. Craig, T. F. Nason, N. Mathias, R. Kobayashi, C. Wittenberg and M. Tyers, Cell, 86, 453 (1996).
35. G. Zhu and T. N. Davis, Biochim. Biophys. Acta, 1448(2), 236 (1998).
36. G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis and B. Futcher, Nature, 406, 90 (2000).
MOTIF DISCOVERY IN HETEROGENEOUS SEQUENCE DATA
A. PRAKASH
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350 U.S.A.

M. BLANCHETTE
School of Computer Science
McGill University
Montreal, Quebec, Canada H3A 2A7

S. SINHA
Center for Studies in Physics and Biology
The Rockefeller University
New York, NY 10021 U.S.A.

M. TOMPA
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350 U.S.A.
Abstract This paper introduces the first integrated algorithm designed to discover novel motifs in heterogeneous sequence data, which is comprised of coregulated genes from a single genome together with the orthologs of these genes from other genomes. Results are presented for regulons in yeasts, worms, and mammals.
1
Regulatory Elements and Sequence Sources
An important and challenging question facing biologists is to understand the varied and complex mechanisms that regulate gene expression: how, when, in what cells, and at what rate is a given gene turned on and off? This paper focuses on one important aspect of this challenge, the discovery of novel binding sites in DNA (also called regulatory elements) for the proteins involved in such gene regulation. This is an important first step in determining which proteins regulate the gene and how.
Until now, nearly all regulatory element discovery algorithms have focused on what will be called homogeneous data sources, in which all the sequence data is of the same type (see Section 1.1). This paper introduces the first integrated algorithm designed to exploit the richer potential of heterogeneous sequence data, which is comprised of coregulated genes from a single genome together with the orthologs of these genes from other genomes.

1.1 Regulatory Elements from Homogeneous Data
A number of algorithms have been proposed for the discovery of novel regulatory elements in nucleotide sequences. Most of these try to deduce the regulatory elements by considering the regulatory regions of several (putatively) coregulated genes from a single genome. Such algorithms search for overrepresented motifs in this collection of regulatory regions, these motifs being good candidates for regulatory elements. Some examples of this approach include Bailey and Elkan^1, Brazma et al.^2, Buhler and Tompa^3, Hertz and Stormo^4, Hughes et al.^5, Lawrence et al.^6, Lawrence and Reilly^7, Rigoutsos and Floratos^8, Rocke and Tompa^9, Sinha and Tompa^10, van Helden et al.^11, and Workman and Stormo^12. An orthogonal approach deduces regulatory elements by considering orthologous regulatory regions of a single gene from multiple species. This approach has been used in phylogenetic footprinting (Tagle et al.^13, Loots et al.^14) and phylogenetic shadowing (Boffelli et al.^15). The simple premise underlying these comparative approaches is that selective pressure causes functional elements to evolve at a slower rate than nonfunctional sequences. This means that unusually well conserved sites among a set of orthologous regulatory regions are good candidates for functional regulatory elements. The standard method that has been used for phylogenetic footprinting is to construct a global multiple alignment of the orthologous regulatory sequences using a tool such as CLUSTAL W (Thompson et al.^16), and then identify well conserved regions in the alignment. An algorithm designed specifically for phylogenetic footprinting without resorting to global alignment has been developed by Blanchette et al.^{17,18}.
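The post-alignment step described above, scoring windows of a multiple alignment for conservation, can be sketched as follows. This is a naive illustration of the idea, not the algorithm of Blanchette et al.; the function name is hypothetical and a gapless alignment is assumed:

```python
def conserved_windows(aligned, width, min_identity):
    """Score each width-column window of a gapless multiple alignment
    by the fraction of columns in which all sequences agree, and
    return the start positions of windows at or above min_identity.

    aligned : list of equal-length strings (one per species)
    """
    length = len(aligned[0])
    # column is "identical" when every sequence shows the same residue
    identical = [len(set(col)) == 1 for col in zip(*aligned)]
    hits = []
    for start in range(length - width + 1):
        frac = sum(identical[start:start + width]) / width
        if frac >= min_identity:
            hits.append(start)
    return hits
```

Windows passing the threshold are the "unusually well conserved sites" that become candidate regulatory elements.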
As more related genomes are sequenced and our understanding of regulatory relationships among genes improves, we will find ourselves in a situation with richer data sources than in the past. Namely, the data to be analyzed will often be heterogeneous: a collection of coregulated genes from one genome together with their orthologous genes in several related genomes. There is an obvious advantage to considering heterogeneous
data when it is available: namely, motifs may not be detectable when one considers only the coregulated regions from one genome or only the orthologous regions of one gene (McGuire et al.^19, Wang and Stormo^20). The most obvious way to handle heterogeneous data is to treat all the regulatory regions identically: pool all the input sequences, and search for overrepresented motifs. This is precisely what was done in studies by Gelfand et al.^21 and McGuire et al.^19. There are several reasons why treating the heterogeneous data homogeneously in this way discards valuable information that may be necessary for accurate prediction of regulatory elements:

1. This method ignores the phylogeny underlying the data so that, for example, similar sequences from a subset of closely related species will have an unduly high weight in the choice of motifs predicted.

2. Phylogenetic studies such as that of Lane et al.^22 show that instances of orthologous regulatory elements, because they evolved from a common ancestral sequence, tend to be better conserved than instances across coregulated genes of the same genome. By pooling all the sequences, this distinction is lost.
3. Perhaps most importantly, the number of occurrences of a given regulatory element will vary greatly across putatively coregulated genes: some regulatory regions will contain no occurrences, while others will contain multiple occurrences. This variance in number should be much smaller across orthologous genes, again because they evolved from a single ancestral sequence. By pooling all the sequences, this distinction too is lost.
Another method for exploiting heterogeneous data involves two separate passes. For instance, Wasserman et al.^23, Kellis et al.^24, Cliften et al.^25, and Wang and Stormo^20 search for well conserved motifs across the orthologous genes and then, among these, search for overrepresented motifs. GuhaThakurta et al.^26 do the opposite, searching for overrepresented motifs in one species and eliminating those that are not well conserved in the orthologs. In both cases, the first pass acts as a filter before the second pass is performed, and a drawback is that the true motif may be filtered out because it is not conserved well enough in the dimension of the first pass. In other words, these algorithms do not integrate all the available information from the very beginning. In this paper we propose the first algorithm that uses heterogeneous sequence data in an integrated manner. We focus on the 2-species case for concreteness and efficiency, but also because of its timeliness for the study of regulons in important sequenced pairs such as human/mouse, fruitfly/mosquito, and C. elegans/C. briggsae.
2
Expectation-Maximization for Heterogeneous Data
The Expectation-Maximization algorithm of MEME is very well suited to the discovery of regulatory elements in single-species regulons. We have generalized MEME's framework and algorithm so that it is suited to the two-species heterogeneous data problem. We call the new algorithm OrthoMEME. The inputs to OrthoMEME are sequences X_1, X_2, ..., X_n, Y_1, Y_2, ..., Y_n, where X_1, X_2, ..., X_n are the regulatory regions of n genes from species X, and Y_i is X_i's orthologous sequence from species Y. For ease of discussion we will assume that the motif width W is fixed but, like MEME, OrthoMEME iterates over different values of W and chooses the best result. Also like MEME, OrthoMEME can be run in any of three modes: OOPS (One Occurrence Per Sequence), ZOOPS (Zero or One Occurrence Per Sequence), or TCM (zero or more occurrences per sequence). TCM mode is particularly appropriate for most regulatory element problems. In the heterogeneous data setting, a motif occurrence in sequence i means an occurrence in X_i and an orthologous occurrence in Y_i. That is, even in TCM mode every motif occurrence consists of an orthologous pair. Accordingly, the hidden random variables are Z_isjk, defined to be 1 if there are orthologous motif occurrences that begin at position j of X_i and position k of Y_i, both occurrences in orientation s (either + or -), and 0 otherwise. (An underlying assumption is that sequences outside motif occurrences are drawn from the background distributions and, in particular, are not orthologous. This is in general untrue, but for sufficiently diverged sequences the resulting inaccuracy should be minimal.) OrthoMEME's objective is to maximize the expected log likelihood of the model, divided by the motif width, given the input sequences and hidden variables.

The model parameters specify how well conserved the motif is among the sequences of species X (parameter θ, a position weight matrix), and how well conserved orthologous pairs of motif instances are (parameter η, a vector of 4 x 4 transition probability matrices). More specifically,

    θ_jr = Pr(residue r in the background distribution),        if j = 0,
    θ_jr = Pr(residue r at position j of X's occurrences),      if 1 ≤ j ≤ W,

    η_jrs = Pr(at position j of the motif, residue r of X maps to residue s of Y).

There is also a corresponding parameter θ'_0r that specifies the background distribution in species Y. In ZOOPS and TCM modes, there is an additional parameter λ that specifies the expected frequency of motif occurrences. Let φ be a vector containing all the model parameters.
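To make the roles of θ and η concrete, here is a sketch of how an orthologous occurrence pair could be scored against the background model under these parameters. The scoring function and its data layout are illustrative assumptions, not code from OrthoMEME:

```python
import math

# Hypothetical containers for the parameters described above:
# theta[j][r] for j = 1..W, with theta[0][r] the background in species X;
# theta_bg_y[r] the background in species Y; and eta[j][r][s] the
# per-position probability that residue r of X maps to residue s of Y.

def pair_log_odds(x_site, y_site, theta, theta_bg_y, eta):
    """Log-odds of an orthologous occurrence pair of width W against
    the background: the X occurrence is scored by the position weight
    matrix, and the Y occurrence by how its residues map from X's."""
    score = 0.0
    for j, (r, s) in enumerate(zip(x_site, y_site), start=1):
        score += math.log(theta[j][r] / theta[0][r])      # motif vs background in X
        score += math.log(eta[j][r][s] / theta_bg_y[s])   # mapping vs background in Y
    return score
```

A positive score indicates the pair is better explained by the motif-plus-mapping model than by the two background distributions.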
In classic expectation-maximization fashion, OrthoMEME alternates between E-steps (which update the expected values of the hidden variables) and M-steps (which update the model parameters). More specifically, the E-step computes E(Z_isjk | X_i, Y_i, φ), where φ consists of the values of the model parameters computed in the previous M-step. The M-step finds the values of the model parameters φ that maximize the log likelihood of the model, given the input sequences and the expected values of Z_isjk computed in the previous E-step. The formulas for these steps depend on the mode (OOPS, ZOOPS, TCM). For simplicity, we present only the formulas for OOPS mode. Let X_isp be the residue present at position p of strand s in sequence X_i, and let m be the length of each input sequence. The E-step for OOPS mode normalizes the joint likelihoods over all candidate occurrence pairs:

    E(Z_isjk | X_i, Y_i, φ) = Pr(X_i, Y_i | Z_isjk = 1, φ) / Σ_{s',j',k'} Pr(X_i, Y_i | Z_is'j'k' = 1, φ),

where Pr(X_i, Y_i | Z_isjk = 1, φ) scores positions p ∈ {j, ..., j+W-1} of X_i under θ, the paired positions p ∈ {k, ..., k+W-1} of Y_i under η, and all remaining positions p = 1, ..., m under the background distributions.
0 is updated as in MEME. Each E-step and M-step runs in time O(nm2W),since the number of hidden variables is 2nm2. This causes the algorithm to run slowly when the input sequences are long, which is an aspect of the algorithm that we are striving t o improve. MEME’s running time per step is O(nmW). The algorithm needs a measure t o compare solutions found, in order t o choose the best motif among all those found from different initial
353 values of 4 and different choices of motif width W . Unlike MEME, OrthoMEME compares solutions on the basis of the expected log likelihood of the model, divided by the motif width, given the input sequences and hidden variables. That is, it uses the very evaluation function that it is optimizing. (MEME instead uses the p-value of the relative entropy of the motif instances predicted.) There is an interesting algorithmic problem that arises only in the TCM mode of OrthoMEME and not at all in MEME. In order to produce actual motif occurrences from the final values Z i s j k of 4), OrthoMEME must choose 0 or more good ortholE(Zi,jk 1 X i , ogous pairs (jl,k l ) , ( j 2 , k2), . . . for each value of i. These pairs should represent nonoverlapping occurrences whose order is conserved between the two species, that is, j h W 5 j h + l and k h W 5 k h + l , for all h. For each value of i, OrthoMEME does this by retaining only those pairs ( j ,k ) such that z i s j k exceeds a threshold, and then using dynamic programming (quite similar to that for optimal alignment) to choose those pairs that represent nonoverlapping occurrences with conserved order and maximum total value of z i s j k .
x,
+
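The chaining step just described can be sketched as a quadratic-time dynamic program over the threshold-passing pairs (a simplified sketch; `best_pair_chain` is a hypothetical name):

```python
def best_pair_chain(pairs, width):
    """Choose a maximum-weight chain of orthologous occurrence pairs.

    pairs : list of (j, k, weight), where weight is the posterior z
            value of the pair after thresholding.
    A chain requires j_h + width <= j_{h+1} and k_h + width <= k_{h+1},
    i.e. nonoverlapping occurrences whose order is conserved between
    the two species.  Runs in O(len(pairs)^2), in the spirit of (but
    simpler than) the alignment-style dynamic program in the paper.
    """
    pairs = sorted(pairs)
    if not pairs:
        return []
    n = len(pairs)
    best = [p[2] for p in pairs]          # best chain weight ending at h
    prev = [-1] * n
    for h in range(n):
        jh, kh, wh = pairs[h]
        for g in range(h):
            jg, kg, _ = pairs[g]
            if jg + width <= jh and kg + width <= kh and best[g] + wh > best[h]:
                best[h] = best[g] + wh
                prev[h] = g
    # trace back from the best endpoint
    h = max(range(n), key=best.__getitem__)
    chain = []
    while h != -1:
        chain.append(pairs[h][:2])
        h = prev[h]
    return chain[::-1]
```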
3
Experimental Results
OrthoMEME is implemented and we intend to make it publicly available. This section reports initial results of OrthoMEME on several heterogeneous data sets. All MEME and OrthoMEME motifs discussed below were among the top 3 motifs reported on those input sequences. Tables 1-3 show the predictions of OrthoMEME on yeast regulons from Saccharomyces cerewisiae and their orthologs in Saccharomyces bayanus. The S. cerewisiae target genes and binding sites for these transcription factors come from SCPD 27 The homogeneous S. cerevisiae data sets of Tables 1 and 2 are known to be particularly difficult: the motif discovery tools YMF l o , MEME ', and AlignACE all failed t o find the known transcription factor binding sites in these S. cerewisiae regulons (Sinha and Tompa2'). Table 1 shows OrthoMEME's predictions on the genes known to be regulated by HAP2;HAP3;HAP4. There are 5 known binding sites contained in 4 target genes. MEME predicted only 1 of these binding sites (whether run on just S. cerewisiae sequences or on the pooled sequences of both species), whereas OrthoMEME predicted 3 using the same parameters. In this and all subsequent tables, the underlined portions of the predicted motif occurrences are the subsequences that overlap the known binding sites. Table 2 shows OrthoMEME's predictions on the genes known to be regulated by UASCAR. There are 4 known binding sites contained in 3 target genes, all 4 of which are predicted by OrthoMEME. MEME pre~
354 Table 1: HAP2;HAP3;HAP4 predicted motif, OOPS mode, sequence length 600. The column labeled “Mut” shows the number of mismatches between the orthologous motif occurrences. The underlined portions of the motif occurrences are the subseauences that overlar, the known binding sites. OrthoMEME missed one occurrence in each of S P R J and 6 Y C l . Source: SCPD 2 7 .
(Table 1 body: rows for the genes CYC1, SPRJ, and QCR8 and their S. bayanus orthologs; the position, instance, and mutation columns were not recoverable.)
Table 2: UASCAR predicted motif, TCM mode, sequence length 300. OrthoMEME missed no occurrences. Source: SCPD [27].
Gene     Pos    Instance (S. cerevisiae)   Pos    Instance (S. bayanus)   Mut
CAR2     -218   CTCTGTTAAC                 -222   CTCTGTTAAC              0
CAR2     -154   TGCCCTTGCC                 -153   TGCCCTTGCC              0
ARG5,6   -114   TTCCATTAGG                 -122   TTCCATTAGG              0
CAR1     -169   TTCACTTAGC                 -176   TTCACTTAGC              0
ARG5,6   -52    TGCCTTTAGT                 -56    TGCCTTTAGT              0
ARG5,6   -286   TTCACTTAAA                 -294   TTCACTTAAG              1
CAR2     -189   TGCCGTTAGC                 -193   TGCCGTTAGC              0
CAR2     -252   TTGCGTGTGG                 -257   TTGCGTGCGG              1
ARG5,6   -224   ATGACTCAGT                 -228   ATGACTCAGT              0
CAR1     -209   TGCCATTAGC                 -216   TGCCGTTAGC              1
CAR1     -232   TGCCCTTCGC                 -239   TGCCCTTGGC              1
CAR1     -86    TTCTCTTCTC                 -73    TTCTCCTCTC              1
dicted none of these binding sites when run on the S. cerevisiae sequences alone, and all 4 when run on the pooled sequences of both species. Table 3 summarizes the performance of OrthoMEME on some less difficult yeast regulons [28]. On all three regulons OrthoMEME had few true negatives. On the SCB and PDR3 regulons, OrthoMEME's number of false positives was comparable to that of MEME. On the MCB regulon, OrthoMEME had many more false positives than MEME, but many fewer true negatives to compensate. Tables 4 and 5 give examples of OrthoMEME run on heterogeneous human/mouse data. Table 4 shows target genes of the human transcription factor SRF together with their mouse orthologs. TRANSFAC [29] reports one known binding site in each of these 4 regulatory sequences.
Table 3: Summary of other yeast regulons, S. cerevisiae vs. S. bayanus, TCM mode, sequence length 1000. Column headings: "genes", the number of target genes in the regulon; "known", the number of known S. cerevisiae binding sites in these target genes; "MEME, S. cer.", MEME run on the S. cerevisiae sequences; "MEME, pooled", MEME run on the pooled sequences of both species; "FP", the number of false positives (predictions that were not binding sites); "TN", the number of true negatives (binding sites that were not predicted). Source: SCPD [27].
                         OrthoMEME     MEME, S. cer.   MEME, pooled
factor   genes   known   FP     TN     FP     TN       FP     TN
SCB      3       11      6      2      8      2        13     5
MCB      5       11      10     1      5      7        6      1
PDR3     4       8       7      2      6      1        4      13
Table 4: SRF predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed one occurrence in each of B-ACT and apoE. Source: TRANSFAC [29].
Gene     Str   Pos    Instance (H. sapiens)   Pos    Instance (M. musculus)   Mut
B-ACT    +     -73    CCTTTTATGG              -65    CCTTTTATGG               0
c-fos    ?     -314   CCTAATATGG              -459   CCTAATATGG               0
apoE     -     -43    CCAATTATAG              -855   CCAATTATAG               0
CA-ACT   -     -850   CCTTATTTGG              -111   CCTTATTTGG               0
OrthoMEME predicted 2 of these 4 known binding sites. MEME, using the same parameters, found none of them, whether run on just the human sequences or on the pooled human and mouse sequences. Table 5 shows target genes of the human transcription factor NF-κB together with their mouse orthologs. TRANSFAC [29] reports 11 known binding sites in these 10 genes. Because OrthoMEME was run in OOPS mode, it missed one of the two occurrences in IL-2. It also missed the known occurrences in SELE and IL-2Rα. MEME, using the same parameters, performed as well on this regulon. Table 6 shows an example of OrthoMEME's predictions on a worm regulon. This is a collection of Caenorhabditis elegans genes regulated by the transcription factor DAF-19 (Swoboda et al. [30]), together with orthologs from Caenorhabditis briggsae. Each regulatory region in C. elegans is known to contain one instance of the "x-box", which is the binding site of DAF-19. OrthoMEME predicted all five of the documented x-boxes [30], as did MEME. (The full x-box has width 14 bp, of which OrthoMEME omitted the somewhat less conserved first 4 bp.)
Table 5: NF-κB predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed one occurrence in each of SELE, IL-2Rα, and IL-2. Source: TRANSFAC [29].
Gene     Pos    Instance (H. sapiens)   Pos    Instance (M. musculus)   Mut
SELE     -285   CCCGGGAATATCCAC         -262   TCTGGGAATATCCAC          2
ICAM-1   -228   CTCCGGAATTTCCAA         -250   TCTAGGAATTTCCAA          4
GRO-γ    -160   TCCGGGAATTTCCCT         -140   TCCGGGAATTTCCCT          0
GRO-α    -160   TCCGGGAATTTCCCT         -140   TCCGGGAATTTCCCT          0
IL-2Rα   -306   TGCGGTAATTTTTCA         -276   TGCGGTAATTTTTCA          0
GRO-β    -156   TCCGGGAATTTCCCT         -146   TCAGGGAATTTCCCT          1
TNF-β    -274   CCTGGGGGCTTCCCC         -251   CCTGGGGGCTTCCCC          0
IL-6     -139   TGTGGGATTTTCCCA         -125   TGTGGGATTTTCCCA          0
IFN-β    -140   CAGAGGAATTTCCCA         -137   CAGAGGAATTTCCCA          0
IL-2     -255   AGAGGGATTTCACCT         -257   AGAGGGATTTCACCT          0
Table 6: DAF-19 predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed no occurrences. Source: Swoboda et al. [30].
Gene      Pos    Instance (C. elegans)   Pos    Instance (C. briggsae)   Mut
che-2     -126   TCATGGTGAC              -178   CCATGGCAAC               3
osm-1     -86    CCATGGTAGC              -79    CCATGGCAAC               2
f02d8.3   -79    CCATGGAAAC              -93    CCATGGAAAC               0
osm-6     -100   CTATGGTAAC              -764   CGATGACAAA               4
daf-19    -109   CCATGGAAAC              -243   CTTTGGCAAA               4
Conclusion
As more genomes are sequenced and our understanding of regulatory relationships among genes improves, algorithms for motif discovery from the rich source of heterogeneous sequence data will become prevalent. We have introduced the first algorithm to deal with heterogeneous data sources in a truly integrated manner, using all the data from the onset of analysis. We are still in the early stages of experimenting with the implementation and its parameters. There is much room for improved prediction accuracy, and we are optimistic that, with more experience, we will consistently be able to solve problems with OrthoMEME that cannot be solved from homogeneous data alone. There is a reasonably straightforward extension to K > 2 species in which the transition matrices ν_j are replaced by rate matrices and one assumes that the phylogeny and its branch lengths are given. For this
extension the running time would be O(nm^K W), which is prohibitive. We are working on faster algorithms for this case and also the important case K = 2. For the case K = 2, it seems important to have a better understanding of how evolutionary distance between the species affects OrthoMEME's accuracy.

Acknowledgments

Peter Swoboda provided us with the C. elegans DAF-19 data set, and Phil Green and Joe Felsenstein made helpful suggestions. This material is based upon work supported in part by the Howard Hughes Medical Institute, by the National Science Foundation under grants DBI-9974498 and DBI-0218798, and by the National Institutes of Health under grant R01 HG02602.

References

1. Timothy L. Bailey and Charles Elkan. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Third
International Conference on Intelligent Systems for Molecular Biology, pages 21-29, Menlo Park, CA, 1995. AAAI Press.
2. Alvis Brazma, Inge Jonassen, Jaak Vilo, and Esko Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.
3. Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225-242, 2002.
4. Gerald Z. Hertz and Gary D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.
5. J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology, 296:1205-1214, 2000.
6. Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.
7. Charles E. Lawrence and Andrew A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.
8. Isidore Rigoutsos and Aris Floratos. Motif discovery without alignment or enumeration. In RECOMB 98: Proceedings of the Second
Annual International Conference on Computational Molecular Biology, pages 221-227, New York, NY, March 1998.
9. Emily Rocke and Martin Tompa. An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB 98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 228-233, New York, NY, March 1998.
10. Saurabh Sinha and Martin Tompa. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30(24):5549-5560, December 2002.
11. J. van Helden, A. Rios, and J. Collado-Vides. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Research, 28:1808-1818, 2000.
12. C. T. Workman and G. D. Stormo. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In Pacific Symposium on Biocomputing, pages 464-475, Honolulu, Hawaii, January 2000.
13. D. A. Tagle, B. F. Koop, M. Goodman, J. L. Slightom, D. L. Hess, and R. T. Jones. Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. Journal of Molecular Biology, 203:439-455, 1988.
14. Gabriela G. Loots, Ivan Ovcharenko, Lior Pachter, Inna Dubchak, and Edward M. Rubin. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Research, 12:832-839, May 2002.
15. Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, Lior Pachter, and Edward M. Rubin. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299(5611):1391-1394, February 2003.
16. J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680, 1994.
17.
Mathieu Blanchette, Benno Schwikowski, and Martin Tompa. Algorithms for phylogenetic footprinting. Journal of Computational Biology, 9(2):211-223, 2002.
18. Mathieu Blanchette and Martin Tompa. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Research, 12(5):739-748, May 2002.
19. Abigail Manson McGuire, Jason D. Hughes, and George M. Church. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Research, 10:744-757, 2000.
20. Ting Wang and Gary D. Stormo. Combining phylogenetic data
with coregulated genes to identify regulatory motifs. Bioinformatics, 2003. To appear.
21. M. S. Gelfand, E. V. Koonin, and A. A. Mironov. Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Research, 28(3):695-705, 2000.
22. Robert P. Lane, Tyler Cutforth, Janet Young, Maria Athanasiou, Cynthia Friedman, Lee Rowen, Glen Evans, Richard Axel, Leroy Hood, and Barbara J. Trask. Genomic analysis of orthologous mouse and human olfactory receptor loci. Proceedings of the National Academy of Sciences USA, 98(13):7390-7395, June 19, 2001.
23. Wyeth W. Wasserman, Michael Palumbo, William Thompson, James W. Fickett, and Charles E. Lawrence. Human-mouse genome comparisons to locate regulatory sites. Nature Genetics, 26:225-228, October 2000.
24. Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254, May 2003.
25. Paul Cliften, Priya Sudarsanam, Ashwin Desikan, Lucinda Fulton, Bob Fulton, John Majors, Robert Waterston, Barak A. Cohen, and Mark Johnston. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science, 301:71-76, 2003.
26. Debraj GuhaThakurta, Lisanne Palomar, Gary D. Stormo, Pat Tedesco, Thomas E. Johnson, David W. Walker, Gordon Lithgow, Stuart Kim, and Christopher D. Link. Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods. Genome Research, 12:701-712, 2002.
27. Jian Zhu and Michael Q. Zhang. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15(7/8):607-611, July/August 1999. http://cgsigma.cshl.org/jian/.
28. Saurabh Sinha and Martin Tompa. Performance comparison of algorithms for finding transcription factor binding sites. In 3rd IEEE Symposium on Bioinformatics and Bioengineering, pages 214-220.
IEEE Computer Society, March 2003.
29. E. Wingender, P. Dietze, H. Karas, and R. Knuppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbf-braunschweig.de/TRANSFAC/.
30. Peter Swoboda, Haskell T. Adler, and James H. Thomas. The RFX-type transcription factor DAF-19 regulates sensory neuron cilium formation in C. elegans. Molecular Cell, 5:411-421, March 2000.
NEGATIVE INFORMATION FOR MOTIF DISCOVERY
K. T. TAKUSAGAWA
D. K. GIFFORD
kenta@mit.edu, gifford@mit.edu
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

We discuss a method of combining genome-wide transcription factor binding data, gene expression data, and genome sequence data for the purpose of motif discovery in S. cerevisiae. Within the word-counting algorithmic approach to motif discovery, we present a method of incorporating information from negative intergenic regions, where a transcription factor is thought not to bind, and a statistical significance measure which accounts for intergenic regions of different lengths. Our results demonstrate that our method performs slightly better than other motif discovery algorithms. Finally, we present significant potential new motifs discovered by the algorithm.
1 Introduction
In the field of computational biology, motif discovery is one tool for unraveling the transcriptional regulatory network of an organism. The underlying model assumes that a transcription factor binds to a specific short sequence ("a motif") in an intergenic region near a gene the factor regulates. With the recent availability of many genome-wide data sets, we can predict certain motifs by computational methods rather than laborious experimentation. Such computational techniques rely on fusing genome sequence data with other data sets. In this paper, we discover motifs by fusing sequence data with transcription factor binding data and gene expression data. Chromatin immunoprecipitation (ChIP) microarray experiments can determine where in the genome a particular transcription factor binds, to a resolution of a single intergenic region (usually 500-2000 bp) [8]. The GRAM algorithm [2] combines such genome-wide location information with gene-expression experiments. The algorithm discovers additional intergenic regions that are likely bound by the transcription factor but did not cause a strong signal in the ChIP experiment. For motif discovery, intergenic regions are partitioned into two categories: those to which the transcription factor is thought to bind (according to raw ChIP experiments or after incorporating additional information via an algorithm like GRAM) and those to which it does not bind. We will refer to the bound sequences as the "positive intergenic sequences" and those not bound as the "negative intergenic sequences".
If an algorithm were only to use the positive sequences for motif discovery, then it would likely discover many false motifs. Such false motifs are caused by sequences which appear frequently in all the intergenic sequences of a genome. In S. cerevisiae, two prominent simple examples of such sequences are poly-A (long strings of consecutive adenine nucleotides) and poly-CA (long strings of alternating cytosine and adenine nucleotides). Fortunately, fusing binding data with the complete sequencing of the S. cerevisiae genome provides us with a conceptually simple method of discovering a transcription factor's motif: find a sequence which is present in the positive sequences and not present in the negative sequences. However, because of experimental noise and variability of binding by a transcription factor, we expect to find occasional examples of the correct motif in the negative sequences, so we instead seek a motif that is significantly over-represented in the positive intergenic sequences when compared with the negative intergenic sequences.

1.1 Related work
There have been many past efforts to use negative intergenic sequences to derive a statistical test. The very popular "Random Sequence Null Hypothesis" (so named in Barash et al. [3]) uses the negative sequences to discover the parameters of an n-th order background Markov model (n = 0 and n = 3 are popular). This approach greatly dilutes the information content of the negative intergenic sequences, and especially loses information about false motifs whose length is greater than the order of the Markov model. In contrast, the approach pursued in this paper is similar to Vilo et al. [11] and Barash et al. [3]. Vilo et al. cluster genes by their expression profiles and seek to discover motifs within each cluster. Their test for significance compares the total occurrences of a potential motif in all intergenic sequences to the within-cluster count. Their significance test compares a statistic against a binomial distribution. Barash et al. describe an alternative to the "Random Sequence Null Hypothesis", namely a "Random Selection Null Hypothesis". They perform a calculation similar to that of Vilo et al., but compare against a hyper-geometric distribution. (The difference appears to be the assumption of whether motif-containing sequences are selected "with replacement" or "without replacement" from all the sequences.) A somewhat different approach is described by Sinha [9], who shows how to view motif discovery as a feature selection problem for classification. Sinha's algorithm requires the input of positive and negative intergenic sequences.
Sinha generates the negative examples (intergenic sequences) artificially using a Markov model, but the framework presented in the paper could easily use actual
negative intergenic sequences from ChIP experiments. This paper makes the following two contributions to the field. First, we describe a modification to the statistical methods of Vilo et al. and Barash et al. which allows for intergenic sequences with different lengths. Second, we apply our motif discovery method and statistical test to transcription factor binding data from ChIP microarray experiments. The papers cited above were published before ChIP data were available; therefore the authors used clustered gene-expression data for groups of genes thought to be regulated by a common transcription factor. Recently, other researchers have taken techniques similar to those described in this paper and fused them with other data sets. Kellis et al. [6] incorporate conservation information from different yeast species. Gordon et al. [5] incorporate structural data about the transcription factor and its likely binding domain.
2 Methods
We perform motif discovery in the framework of word-counting. This framework exhaustively enumerates a class of potential motifs (or words) and scores each word for its likelihood of being a true motif. We searched for potential motifs of width 7 with up to 2 wildcard elements among the 7 positions. The wildcard elements permitted were the double-degenerate nucleotides (IUPAC codes M, R, W, S, Y, K) and the quadruple-degenerate "gap" nucleotide (IUPAC code N). For each potential motif m, we determine in which positive sequences and in which negative sequences m occurs. We then determine if m occurs in the positive sequences more often than would be expected by chance. We must therefore first define a null hypothesis of what in fact is expected by chance. Biologically, the null hypothesis corresponds to the situation that m is not the motif for the transcription factor. To be able to statistically reject the null hypothesis, we must quantify what we would expect to see if the null hypothesis were true. We will present two different null hypotheses, the latter of which incorporates sequence lengths as additional information to the statistical measure. Computational constraints determined the limits of width 7 and 2 wildcards. At those limits, a search for a transcription factor's motif (within approximately 3 Mbase of S. cerevisiae sequence) took approximately 20 minutes on a 1.6 GHz Athlon system. The running time scales exponentially with re-
spect to the width and number of allowed wildcards. As an aside, we note that this exponential increase could be addressed in future investigations in two ways. For slightly wider motifs or more wildcards, more computing power can be applied: the algorithm parallelizes trivially by having different processors examine separate regions of the search space. Beyond that, if one wanted to discover long motifs, one can use the short motifs discovered by exhaustive search as starting points for an expectation-maximization-type algorithm, as done by Barash et al. [3] and Gordon et al. [5].
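The word-counting search described above can be sketched as follows, assuming a simple string representation of degenerate words; the function names are ours, not from the paper.

```python
from itertools import combinations, product

# IUPAC code -> set of matching nucleotides
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
         "N": "ACGT"}
WILDCARDS = "MRWSYKN"

def enumerate_words(width=7, max_wild=2):
    """Yield every candidate motif of the given width with at most
    max_wild wildcard positions (the six double-degenerate IUPAC codes
    plus the fully degenerate N)."""
    for n_wild in range(max_wild + 1):
        for positions in combinations(range(width), n_wild):
            fixed = [p for p in range(width) if p not in positions]
            for wild in product(WILDCARDS, repeat=n_wild):
                for bases in product("ACGT", repeat=len(fixed)):
                    word = [""] * width
                    for p, w in zip(positions, wild):
                        word[p] = w
                    for p, b in zip(fixed, bases):
                        word[p] = b
                    yield "".join(word)

def occurs(word, seq):
    """True if the degenerate word matches anywhere in seq."""
    w = len(word)
    return any(all(seq[i + p] in IUPAC[c] for p, c in enumerate(word))
               for i in range(len(seq) - w + 1))
```

At width 7 with up to 2 wildcards this enumerates roughly 1.3 million candidate words, which is consistent with the exponential scaling in width and wildcard count noted above.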
2.1 Sequences chosen with uniform probability

The two null hypotheses are instances of the "Random Selection Null Hypothesis" of Barash et al. [3], which states that when the null hypothesis is true (i.e., the motif is incorrect), the positive sequences are "randomly selected" from among all the intergenic sequences, without any correlation or bias toward sequences containing the incorrect motif. (One can visualize a transcription factor as the "hand" which randomly selects from an urn of intergenic sequences.) For their model, "randomly selected" means "all sequences are equally likely to be chosen without replacement". For this definition of "randomly selected", they give a formula for the probability that m occurs in k sequences by chance alone:
    P_hyper(k) = C(K, k) C(N - K, n - k) / C(N, n)    (1)

where C(a, b) denotes the binomial coefficient, n is the number of positive sequences, N is the total number of sequences (positive and negative), and K is the number of sequences in which the word m occurs. The above formula is the hyper-geometric probability distribution. Using this formula we can calculate a p-value that the null hypothesis is true. The p-value sums the tail of the probability distribution for k' >= k:

    p = sum_{k'=k}^{min(n,K)} P_hyper(k')    (2)
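The hypergeometric test and its tail-sum p-value can be computed directly with exact integer arithmetic; a minimal sketch (function name ours):

```python
from math import comb

def hypergeometric_pvalue(k, n, K, N):
    """P(at least k of the n positive sequences contain the word), under
    the null hypothesis that the n positives are drawn uniformly without
    replacement from all N sequences, of which K contain the word."""
    total = comb(N, n)
    return sum(comb(K, kp) * comb(N - K, n - kp)
               for kp in range(k, min(n, K) + 1)) / total

# e.g. 5 positives out of 100 sequences, word present in 10 sequences
# overall, all 5 positives containing it:
# hypergeometric_pvalue(5, 5, 10, 100) == comb(10, 5) / comb(100, 5)
```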
2.2 Sequences chosen by length

Instead of "all sequences equally likely" as the behavior under the null hypothesis, we propose the null hypothesis that:

    Sequences will be selected (without replacement) with probability proportional to the sequence's length.
Figure 1: Distribution of intergenic sequence lengths in S. cerevisiae.
The motivation for this alternative stems from the fact that sequences from the ChIP experiments are of different lengths (Figure 1). The modification is plausible: given no other knowledge about the transcription factor, a longer sequence is more likely to contain the transcription factor's true motif. Let A_L be the bag (multi-set) of all sequence lengths, and K_L be the sub-bag of the lengths of the sequences in which the word m occurs. (Thus |A_L| = N and |K_L| = K.) We use bags to allow for distinct sequences which happen to have the same length. Having defined the null hypothesis, we can define the probability of it being true as the probability that k or more of the sequences in which the word occurs are selected. Because computing this probability exactly is computationally prohibitive, we instead compute an approximation: instead of selecting sequences without replacement, we select sequences with replacement. The probability of selecting exactly k sequences is binomial:

    P_binom(k) = C(n, k) r^k (1 - r)^{n-k}    (3)
where r is the proportion of total sequences (weighted by lengths) containing the word, that is, r = (sum of the lengths in K_L) / (sum of the lengths in A_L).
To calculate the p-value that the null hypothesis is true, we reuse equation 2, substituting P_binom for P_hyper.
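A sketch of the length-weighted binomial test; the function name and the representation of the bags K_L and A_L as plain lists of lengths are our assumptions.

```python
from math import comb

def length_binomial_pvalue(k, n, K_L, A_L):
    """Approximate p-value under the length-weighted null hypothesis:
    each of the n positive sequences is drawn (here, with replacement)
    with probability proportional to its length.  K_L and A_L are the
    bags of lengths of the word-containing sequences and of all
    sequences, so r is the length-weighted fraction containing the
    word.  The p-value is the binomial tail for k' >= k."""
    r = sum(K_L) / sum(A_L)
    return sum(comb(n, kp) * r**kp * (1 - r)**(n - kp)
               for kp in range(k, n + 1))
```

For example, with two positive sequences, where the word-containing sequences account for half the total length (r = 0.5), the probability of seeing the word in at least one positive sequence by chance is 0.75.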
Table 1: Consensus sequences

TF      Consensus           TF      Consensus
ABF1    TCRNNNNNNACG        CBF1    RTCACRTG
GAL4    CGGNNNNNNNNNNNCCG   GCN4    TGACTCA
GCR1    CTTTCC              HAP2    CCAATNA
HAP3    CCAATNA             HAP4    CCAATNA
HSF1    GAANNTTTCNNGAA      INO2    ATGTGAAA
MATa1   TGATGTANNT          MCM1    CCNNNWRGG
MIG1    WWWWSYGGGG          PHO4    CACGTG
RAP1    RMACCCANNCAYY       REB1    CGGGTRR
STE12   TGAAACA             SWI4    CACGAAA
SWI6    CACGAAA             YAP1    TTACTAA
3 Results and Discussion

The results and discussion are organized into the following sections. §3.1 validates the algorithm by attempting to replicate known motifs. §3.2 presents potential new motifs discovered by the algorithm. Finally, §3.3 discusses ideas for future work.
3.1 Validation

This section measures and compares the algorithm's motif discovery performance. For an absolute measure, the algorithm was run on binding data for transcription factors whose motifs were previously discovered and confirmed biologically. For a comparative measure, the same data were analyzed with the motif discovery programs MEME and MDscan [7]. The algorithm was also run on differently processed binding data for each transcription factor to determine the effect of the type of binding data on motif discovery.

Program parameters
MDscan was run through the web interface with the following parameters:

- Motif width: 7
- Number of top sequences to look for candidate motifs: 10
- Number of candidate motifs for scanning the rest of the sequences: 20
- Report the top 10 final motifs found
- Precomputed genome background model: S. cerevisiae intergenic
MEME was run with the command-line parameters -dna -w 7 -nmotifs 10 -revcomp -bfile $MEME/tests/yeast.nc.6.freq. The parameters direct MEME to attempt to discover 10 motifs of width 7 on either strand using the pre-computed order-6 Markov background model of the yeast non-coding regions.
Binding data

Three different sets of positive sequences were used. That is, three different methods were used to determine which sequences are bound by a transcription factor. The first two are a simple p-value threshold on the ChIP experiment [8] (not related to the p-values calculated in the statistical tests of Section 2). The last uses the GRAM gene modules described in Bar-Joseph et al. [2], which fuse both binding data and expression level data.

1. Bound intergenic regions, cutoff p-value 0.001
2. Bound intergenic regions, cutoff p-value 0.0001
3. GRAM gene modules under YPD

To score the performance of this paper's algorithm, MEME, and MDscan, the discovered motifs were compared against the consensus sequences for transcription factors (Table 1), which were gathered from the TRANSFAC database. We score the closeness of a discovered motif to the consensus using a Euclidean distance metric described in the thesis version of this paper [10]. The threshold of correctness was chosen "by eye" to be a value for which discovered motifs below the threshold seemed close to consensus motifs. The threshold was loose enough that a motif is scored "correct" even when the discovered motif spans only half of a wide gapped motif (for example ABF1 or GAL4). We report the number of times the most statistically significant discovered motif was correct, and the number of times a correct motif was found somewhere in the top 10 significant motifs. This paper's algorithm only reported motifs with significance greater than 10^{-10}, so sometimes no motifs were found. Table 2 gives the number of correct motifs found by the algorithm and other motif-discovery algorithms on different data sets. We can make the following observations:

- The best performance was this paper's algorithm using binding data with threshold p-value 0.001.
Table 2: Verified consensus motifs

Algorithm    Data set    Choose from   Number correct (out of 20)
This paper   p=0.001     Top 10        14
MDscan       p=0.001     Top 10        12
MEME         p=0.001     Top 10        10
This paper   p=0.001     Top 1         10
MDscan       p=0.001     Top 1         9
MEME         p=0.001     Top 1         0
This paper   GRAM        Top 10        12
This paper   GRAM        Top 1         9
This paper   p=0.0001    Top 10        12
MEME         p=0.0001    Top 10        12
This paper   p=0.0001    Top 1         9
- Choosing a more rigorous threshold for the binding data, namely 0.0001, resulted in slightly poorer performance, most likely because of insufficient positive intergenic sequences for a significant result.
- Incorporating gene expression information with the GRAM modules algorithm caused the algorithm to perform slightly more poorly than using the raw binding data. However, the modules result did find 2 correct motifs that the raw binding data did not (at the cost of failing to find 4 others).
- The algorithm finds slightly more correct motifs than MEME or MDscan.
3.2 New motifs

Tables 3 and 4 give the top-scoring motifs for some transcription factors not listed in Table 1. These are candidates for further investigation. The positive sequences used for the tables were the bound sequences at p-value 0.001. From discussion with a colleague [4], we note that the motifs for CIN5, GAT3, GLN3, IME4, YAP5, and YAP6 are probably not correct, while those for BAS1, FKH1, FKH2, INO4, and SUM1 are consistent with what is known about the transcription factors.
Results on shuffled data

To judge the background level of motifs, the algorithm was also run on random sets of intergenic sequences. Ideally, these runs should produce no significant
Table 3: Top-scoring motifs discovered for transcription factors not in Table 1, with binomial significance greater than 10^{-10}. The significance values are log10 of the p-value. The gap wildcard is denoted by a dot.

TF     Condition   Motif       Binomial   Hypergeometric
BAS1   YPD         LiAtiYtiG   -10.99     -14.71
CIN5   YPD         TAYGSAA     -10.86     -19.67
FHL1   Rapamycin   CC.TACA     -27.28     -39.88
FHL1   YPD         CC.TACA     -35.12     -50.61
FKH1   YPD         GTAAACA     -10.85     -14.72
FKH2   YPD         GTSAACA     -12.16     -18.49
GAT3   YPD         CYGACGC     -15.90     -21.14
GLN3   Rapamycin   C.GCGGA     -11.46     -16.65
IME4   YPD         CACACAC     -12.16     -15.22
INO4   YPD         CATGTGA     -12.14     -14.36
MBP1   YPD         GACGCG?     -20.14     -25.40
MET4   Rapamycin   ATTCGGC     -10.25     -13.13
MET4   YPD         CtCGTGA     -10.78     -13.08
Table 4: Top-scoring motifs (continued from Table 3)

TF     Condition   Motif      Binomial   Hypergeometric
NRG1   YPD         CTGC?T"G   -11.65     -19.00
PHD1   YPD         AT"GCAC.   -10.86     -20.01
RGM1   YPD         CCC$CGA    -12.91     -15.94
STB1   YPD         CGCGAAA    -12.36     ?
SUM1   YPD         G$CAC$A    -11.38     -17.18
YAP5   YPD         ACGCGCP    -11.94     -16.98
YAP6   YPD         gGGCACO    -11.44     -18.78
motifs. Twenty-five random trials were run for each of 20, 40, 80, 120, and 160 randomly chosen S. cerevisiae intergenic sequences (for a total of 125 trials). Five of the 125 experiments discovered a total of 11 motifs with binomial p-values less than 10^{-4}, with the most significant motif having significance 10^{-4.7}. These falsely significant motifs were more likely to be found when there were fewer positive sequences, as 8 of the 11 motifs were found in data sets with 20 positive sequences. In the course of the 125 trials, over 70 million hypotheses (i.e., candidate motifs) are tested, so it is reasonable to see a few false positives with significance higher than 10^{-4}.

3.3 Future work
The statistical test developed in Section 2 can make use of more information for a better measure of significance. In §2.2 we defined the null hypothesis behavior "random selection" to be selection with probability proportional to length. A straightforward modification would be to instead use the number of different subsequences of a sequence as its probability (appropriately normalized). As an extreme example, consider a very long sequence consisting of a repeat of a single nucleotide. While long, such a sequence offers few possibilities of where a transcription factor might bind. Such a long repetitive
sequence ought to be selected with low probability. Continuing in this manner, other biological prior knowledge can be incorporated into the prior probability that a sequence is selected. Such knowledge might involve the location of the sequence on the chromosome, knowledge about the gene which the sequence precedes, or other genetic markers. Biologically, we must question the assumption of independence (modulo choosing without replacement) between the n = IPI random selections from A. For example, it would be reasonable to hypothesize that if two sequences are very similar, they would likely both be selected, or neither. Not only can we incorporate biologically relevant information into the prior probability of the binding, but we can also try to incorporate more information about the binding event itself. Currently, the algorithm only makes use of the binary presence (“yes” or “no”) of words in sequences. It could, for example, incorporate the following features: 0 0
- Number of occurrences of the word in the sequence
- Position of the occurrence(s) with respect to the start of transcription or other genetic markers in the sequence
- Strand of the occurrence of the word
- p-value of the binding event.
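The modified null model suggested above (selection probability proportional to the number of distinct subsequences rather than to raw length) could be sketched as follows; the k-mer length and the simple normalization are illustrative assumptions:

```python
def distinct_kmers(seq: str, k: int = 8) -> int:
    """Number of distinct length-k subsequences (k-mers) in seq."""
    if len(seq) < k:
        return 0
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def selection_probs(seqs, k: int = 8):
    """Null-model selection probability for each sequence, proportional
    to its distinct k-mer count rather than its raw length."""
    weights = [distinct_kmers(s, k) for s in seqs]
    total = sum(weights)
    return [w / total for w in weights]

# A long homopolymer run contains only one distinct 8-mer, so it is
# selected with far lower probability than a diverse sequence:
probs = selection_probs(["A" * 100, "ACGTACGGTTACGATCGATGCAT" * 5])
```

This captures the intuition in the text: a long repetitive sequence offers few distinct binding possibilities and so should carry little weight under the null.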
Beyond yeast, of course, are the many organisms whose genomes have been recently sequenced, including human. It will be only a matter of time before ChIP and other genome-scale location experiments are performed on those organisms. We expect that to do worthwhile motif discovery on larger and more complicated genomes, careful attention will have to be paid to the statistical issues and improvements mentioned above.
Acknowledgements

Special thanks to Richard A. Young, D. Benjamin Gordon, and Ziv Bar-Joseph for help with the data sources used in this project. K.T.T. was supported by an NDSEG/ASEE Graduate Fellowship.
INTRODUCTION TO INFORMATICS APPLICATIONS IN STRUCTURAL GENOMICS

S.D. MOONEY
Stanford Medical Informatics, Department of Genetics, Stanford University, Stanford, CA 94305

P.E. BOURNE
The San Diego Supercomputer Center, The University of California San Diego, San Diego, CA 92093
P.C. BABBITT
Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA 94143
1. Structural Genomics
Structural genomics initiatives aim to determine all of the naturally evolved macromolecular scaffolds of proteins, RNA, and DNA. In this introduction, we introduce several recent advances in the computational methods that support structural genomics. These include improvements at all levels of structure analysis, from fold identification of a target sequence and structure prediction, to structure evaluation and classification. The reader is referred to Goldsmith-Fischman and Honig [1] for a thorough treatment of computational methods in structural genomics and to Bourne, et al. in this volume for the status of target structure determination. Improvements in computational methods for structural genomics are facilitating the identification of new, previously uncharacterized targets with novel fold classifications and predicted functions. These computational methods support the structural genomics pipeline by identifying targets, storing assay data, and analyzing results in a statistically sound manner. The six papers presented here address many aspects of this diverse topic.

One of the primary ways of identifying the function of an unknown structure is to identify its most similar structural neighbors. These "nearest neighbor" structural classification methods have proven to be powerful tools for identifying unknown
function. For example, the Structural Classification of Proteins project, SCOP, is an effort to classify all protein domains. SCOP classification is performed both through human intervention and through automated methods. Therefore, the challenge for fully automated computational methods is to correctly classify protein domains and to produce results similar to those of methods or databases that rely on human annotation. In this volume, Huan, et al. apply an information-theoretic approach to identify coherent subgraphs in graphs that represent protein structures. They test their method on several families and find that their classifications correlate well with SCOP.

Another challenge for computational structural bioinformatics methods is macromolecular structure prediction. A common approach to predicting the structure of an amino acid sequence is to apply comparative modeling methods, by modeling an unknown sequence upon a structure having a similar sequence. Comparative modeling is often performed in a four-step process: fold identification, threading, model building, and evaluation with refinement of the structure. Fold identification and threading remain significant challenges, since a target sequence may have little sequence similarity to any known scaffold. This volume presents two papers aimed at improving the identification of the appropriate fold for a target protein sequence through experimental intervention. First, Potluri et al. present a method for discriminating well-predicted structures from poorly predicted ones using chemical cross-linking data. Second, Qu et al. present a method for identifying the fold of a sequence using the NMR technique of residual dipolar coupling. Their program, RDCthread, identifies structural homologs of a target protein using RDC data and secondary structure prediction. Although most structural genomics techniques aim at studying protein structures, similar techniques have been applied to RNA structure prediction.
For a review of structure prediction techniques as applied to RNA structure, see Schuster, et al. [2]. In this volume, Nebel presents a method for identifying good predictions of RNA secondary structure, thereby improving secondary structure prediction overall.

Finally, one of the most exciting activities in structural genomics is studying the many structures that are now stored in public databases such as the Protein Data Bank (PDB) [3]. Peng, et al. apply contrast classifiers to explore bias in the PDB. When they compared the distributions of proteins in SWISS-PROT and the PDB, they found that transmembrane, signal, disordered, and low-complexity regions are poorly represented in the PDB. They reason that contrast classifiers can be used to select important targets for structural genomics initiatives.
Successes in structural genomics initiatives continue to be accompanied by the development of computational methods that apply sophisticated analyses from such diverse fields as information theory and clustering, together with novel experimental techniques. As a result, novel structures continue to be added to our structural repertoire, giving new biological insight in this post-genomic era.

Acknowledgements
We would like to thank Giselle Knudsen for her advice in preparing this document.
References
1. Goldsmith-Fischman S and Honig B (2003) "Structural genomics: computational methods for structure analysis" Protein Science 12(9):1813-21.
2. Schuster P, Stadler PF, and Renner A (1997) "RNA structures and folding: from conventional to new issues in structure predictions" Current Opinion in Structural Biology 7(2):229-35.
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) "The Protein Data Bank" Nucleic Acids Research 28(1):235-42.
THE STATUS OF STRUCTURAL GENOMICS DEFINED THROUGH THE ANALYSIS OF CURRENT TARGETS AND STRUCTURES
P.E. BOURNE, C.K.J. ALLERSTON, W. KREBS, W. LI, and I.N. SHINDYALOV
The San Diego Supercomputer Center, The University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA

A. GODZIK, I. FRIEDBERG, and T. LIU
The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 USA

D. WILD and S. HWANG
The Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711 USA
Z. GHAHRAMANI
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK

L. CHEN and J. WESTBROOK
Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854 USA
Structural genomics - large-scale macromolecular 3-dimensional structure determination - is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.org) reports this progress through the status of each protein sequence (target) under consideration by the major structural genomics centers worldwide. Hence, TargetDB provides a unique opportunity to analyze the potential impact that this major initiative provides to scientists interested in the sequence-structure-function-disease paradigm. Here we report such an analysis with a focus on: (i) temporal characteristics - how is the project doing and what can we expect in the future? (ii) target characteristics - what are the predicted functions of the proteins targeted by structural genomics and how biased is the target set when compared to the PDB and to predictions across complete genomes? (iii) structures solved - what are the characteristics of structures solved thus far and what do they contribute? The analysis required a more extensive database of structure predictions using different methods integrated with data from other sources. This database, associated tools, and related data sources are available from http://spam.sdsc.edu.
1 Introduction

Structural genomics has been heralded as the follow-on to the human genome project. This is interpreted to mean a large-scale project, with scientific,
engineering, and technological components and with the potential to have a large impact on the life sciences. Whereas the goals of the human genome project were relatively well defined - sequence the 3 billion nucleotides comprising the human genome and define all open reading frames - the goals advanced for structural genomics are more diverse (http://www.nigms.nih.gov/news/meetings/hinxton.html) [1]. For instance, some of the NIH P50 structural genomics centers have focused on all of the protein structures in a given genome - A. thaliana, T. maritima, and M. tuberculosis are examples under scrutiny. Other groups have focused on obtaining sufficient coverage of fold space [2] to facilitate accurate homology modeling of the majority of proteins of biological interest (see http://spam.sdsc.edu/sgtdb for a description of the focus of each center). Since structure has already taught us so much about biological function when undertaken as a functionally driven initiative, undertaking structure determination in a broader genomic sense will likely also bring significant new understanding of living systems. Further, it will likely lead to advances in the process of structure determination, whether by X-ray crystallography or NMR.

With such diversity of deliverables and with some projects now well established, an obvious question is: how are we doing? This paper addresses this question. The question has been addressed before in the context of new folds and functions and has proven to be somewhat controversial. An initial report in Science [3] implied that the number of structures produced as of November 2002 was minimal. A response from the US Northeast Structural Genomics Consortium (NESG) [4] indicated that it was early in the process and that the absolute number of structures produced may not be the best measure; rather, the value of those structures is more to the point.
NESG indicated that a structure containing a novel fold would indeed provide a new template from which many sequences could be related and hence was a significant contribution. It is not our intent here to join this argument but to simply point readers at some quantitative data and suggest how the process might proceed in the future and the challenges it provides to the bioinformatics community.
2 Methods

An important feature of structural genomics, laid out by the NIH as part of the awards made to the pilot centers engaged in this high-throughput structure determination, was the importance of reporting their progress on a regular basis. The 16 pilot centers in the US and worldwide do this by way of weekly updates made available through their individual centers and collated by the Protein Data Bank (PDB) into what is known as the target database (TargetDB; http://targetdb.pdb.org)
[5]. The contents of the target database are also available as an XML file. This file was used to create a local database from which the results presented here are derived. This database is available at http://spam.sdsc.edu/sgtdb. Fold prediction is based on three existing methodologies, FFAS [6], iGAP [7], and Bayesian networks [8], which are fully described elsewhere. Prediction of all open reading frames from complete proteomes uses the iGAP methodology and is part of the Encyclopedia of Life (EOL; http://eol.sdsc.edu) project.
3 Results
3.1 Progress

In the past year (May 1, 2002 - May 31, 2003), 314 structures resulting from structural genomics were reported by TargetDB. During the same period, a total of 3324 structures were deposited with the PDB. Thus structural genomics is currently contributing approximately 10% of structures to the field of structural biology. The number of structures at each stage in the pipeline is shown in Figure 1.
Figure 1 Structural Genomics Targets at Different Stages of Solution (April 1, 2003)
Slightly less than 50% of targets are selected for scrutiny. From these a high percentage can be expressed, but the number purified and crystallized drops off dramatically, indicating these steps continue to register low success rates and should be a focus of renewed efforts. Of those that crystallize, the majority find their way into the PDB.

Is the percentage of structures determined by structural genomics likely to increase in the near future? To address this question requires that we look for temporal trends in the data. This is possible since TargetDB is updated each week, so the mean time that an active target spends at each step in the structure determination pipeline can be assessed. These results are shown in Figure 2. It should be noted that not all of the centers reporting weekly status update their internal status-tracking data with the same frequency. Consequently, the interval assessment here must be interpreted with care.
Figure 2 Mean Time of Targets at Each Structure Determination Step
For targets that make it to the next step, the data indicates that there is no specific bottleneck at this point, but rather a balance between the time taken at each structure determination step. Without a significant bottleneck the prospects for improving the rate of structure determination would seem good, particularly as the early stages of the project have included a significant engineering component for some projects. However, a final answer to the question will come from further review of TargetDB in the next two years.
3.2 Target characteristics
The characteristics of targets being attempted by individual structural genomics groups are highly variable (see http://spam.sdsc.edu/sgtdb for a synopsis of the activities of each individual group). Groups are focusing on one or more of the following: complete proteomes, pathways and diseases, new folds, new technologies, and specific structures. Thus the relative number of active targets from each group is meaningless, and no attempt is made here to compare groups; rather, the characteristics of the targets as a whole are considered.

A review of the over 30,000 targets in the database (April 1, 2003) indicates a 13% redundancy at the 100% sequence identity level and 38% redundancy at the 30% sequence identity level. This implies that either individual groups are operating without regard for other groups, or there is interest in the same targets by different groups, perhaps indicating some important functional significance for a particular target. These data could be probed further to ascertain (if possible from sequence alone) the functional significance of these hotly contested targets. It should be noted that there is a temporal aspect to these target data. When a target was selected, which may be up to three years ago, the level of redundancy with respect to NR may have been significantly different, so these data need to be interpreted with care.

A review of each group's targets indicates that there is a significant level of redundancy within a group's targets (Figure 3). In some cases this reflects the redundancy in the complete proteome under study; in other cases, perhaps a desire to solve multiple instances of an important structure that, based on sequence identity, are known to have the same fold.
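Redundancy figures of this kind can be computed by greedy, CD-HIT-style clustering at an identity threshold. The sketch below uses a naive ungapped identity measure for illustration; a real analysis would use alignment-based identity (e.g. BLAST or CD-HIT):

```python
def identity(a: str, b: str) -> float:
    """Crude percent identity: matching positions over the shorter length.
    (Real pipelines would align the sequences first.)"""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(a[i] == b[i] for i in range(n)) / n

def redundancy(seqs, threshold: float) -> float:
    """Fraction of sequences within `threshold` identity of an earlier
    representative (greedy clustering in input order)."""
    reps = []
    redundant = 0
    for s in seqs:
        if any(identity(s, r) >= threshold for r in reps):
            redundant += 1
        else:
            reps.append(s)
    return redundant / len(seqs)
```

Running this at thresholds of 1.0 and 0.3 would correspond to the 100% and 30% identity levels quoted for the target set.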
Figure 3 Sequence Redundancy within each Group's Targets (redundancy at the 30% identity level plotted against each lab's main stated goals: Fold Space, Complete Genome(s), Specific Proteins, Technology Driven)
3.3 Structure characteristics

Are there any specific characteristics of the novel folds in the structures determined by the structural genomics initiative? How do these differ from the general population in the PDB, and why? In short, what is novel in the structures being determined by structural genomics, and how do they aid us by increasing our understanding of living systems and/or enabling more rapid structure determination or modeling? An analysis of the former is provided by [9]. Here we focus on the characteristics important to bioinformatics, specifically fold and function, which can be used in further analysis, for example, in homology modeling. An analysis of the new folds as defined by SCOP is given in Table 1.
Table 1 New Folds Resulting from Structural Genomics

Period            | Total New Folds | New Folds from Structural Genomics
Oct 2001-Mar 2002 | 48              | 1. YchN-like (c.144); 2. Hypothetical protein MTH777 (c.115); 3. alpha/beta knot (c.116); 4. Archaeosine tRNA-guanine transglycosylase, C-terminal additional domains (e.36); 5. YebC-like (e.39)
Apr 2002-Sep 2002 | 27              | 1. DsrC, the gamma subunit of dissimilatory sulfite reductase (d.203); 2. Ribosome binding protein Y (d.204); 3. Hypothetical protein MTH637 (d.206); 4. Thymidylate synthase-complementing protein Thy1 (d.207); 5. MTH1598-like (d.208)
Oct 2002-Mar 2003 | 64              | 1. S13-like H2TH domain (a.156); 2. C-terminal domain of DFF45/ICAD (a.164); 3. BEACH domain (a.169); 4. Viral chemokine binding protein m3 (b.116); 5. Obg-fold (b.117); 6. N-terminal domain of MutM-like DNA repair proteins (b.113); 7. Putative glycerate kinase (c.118); 8. DegV-like (c.119); 9. YbaB-like (d.222); 10. SufE (d.224); 11. Replication modulator SeqA, C-terminal DNA-binding domain (d.228)
In the first reporting period the number of new folds reported by structural genomics was approximately 10% of the total number reported (5 out of 48), a result proportional to the percentage of structures coming from structural genomics. In the second and third periods this jumped to 18% (5 out of 27) and 17% (11 out of 64), respectively, indicating that the goal of new fold discovery may be being met, given that only 10% of structures overall are coming from structural genomics. However, the sample of new folds is small, and hence we will need to wait for additional time periods and review this trend again.
A review of the sequences of solved structures against the non-redundant protein sequence database (NR), ordered in bins of expectation value (E-value), is given in Figure 4.
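The E-value screen underlying such a figure can be sketched as a simple log10 binning; the 10^-3 novelty threshold follows the text, while the bin width and function interface are assumptions:

```python
import math
from collections import Counter

def evalue_bins(evalues):
    """Histogram of BLAST E-values in log10 bins; hits with E >= 1e-3
    are flagged as lacking guaranteed sequence homology (and hence as
    candidates for possible new functions)."""
    bins = Counter(math.floor(math.log10(e)) for e in evalues)
    possibly_novel = sum(1 for e in evalues if e >= 1e-3)
    return bins, possibly_novel
```

Applied to the 314 solved structures, this kind of screen yields the "approximately 70 with E-value of 10^-3 or higher" group discussed below the figure.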
Figure 4 Likely Uniqueness of New Targets

Approximately 70 of a total of 314 structures have an E-value of 10^-3 or higher and represent a group for which sequence homology is not guaranteed; hence they represent possible new functions (assuming functions were correctly assigned to sequences in NR). Again, the above is only an indicator of the situation. A better analysis would require comparison against NR at the time the structure was solved or released.

What of the overall distribution of folds represented by TargetDB? Figure 5 shows the distribution of folds derived by FFAS [6], iGAP [7], and Bayesian networks [8]. The level of reliability is not considered; only possible predictions are represented. Both FFAS and iGAP provided predictions for nearly all targets, Bayesian networks for about 10%, based on a smaller template library. Not only does this highlight internal consistency between the methods of prediction, it also indicates differences. The distribution of major folds seems consistent with the distribution of associated biological functions in living systems. For example, it is known that P-loop-containing protein families are very prevalent in nature.
Figure 5 Predicted Folds from TargetDB (SCOP Fold Distribution): 1=FFAS; 2=iGAP; 3=Bayesian Networks

This relationship is probed further in Figure 6. Fold predictions are made for all open reading frames in a variety of organisms as well as the PDB and TargetDB.
Figure 6 SCOP Fold Distributions in Several Model Organisms, PDB and TargetDB
A question that can be posed from these data is: how biased are the distributions of folds in TargetDB relative to those from specific target organisms and the PDB? Intuitively one would expect the PDB to be biased towards proteins that are (a) likely to be crystallized easily, (b) smaller proteins amenable to NMR, or (c) over-represented by particular classes of proteins since they represent drug targets or functionally important proteins. Conversely, TargetDB would be somewhat closer to what is found in nature, as whole genomes are being attempted. Having said that, at this stage of structural genomics projects may be going for the low-hanging fruit, and hence it may be too early to make such a comparison. It should also be noted that there is an undetermined bias in these data, and hence they should be considered cautiously. The bias arises in that predictions are done with a mix of fold prediction and homology modeling, and in both cases there is a bias towards known folds. Nevertheless, expected trends do occur.

Immunoglobulin-like beta sandwiches (b.1) are over-represented in the PDB and under-represented in TargetDB. This would suggest they have proven particularly amenable to crystallization and represent a sequence-rich fold class that matches many of the targets; if new folds are an aim, groups will likely discount a large number of such targets, hence the under-representation in TargetDB. The same argument can be made for TIM barrels (c.1). The empirical rule that emerges from these and other fold classes is that a class that is over-represented in the PDB is under-represented in TargetDB. RNA/DNA-binding 3-helical bundles (a.4) appear to be over-represented in TargetDB relative to what appears in the PDB and several model organisms. The same is true of P-loop-containing nucleotide triphosphate hydrolases, perhaps a reflection of their role as drug targets. S-adenosyl-L-methionine-dependent methyltransferases also appear over-represented in TargetDB.
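The over/under-representation argument can be made concrete as a per-fold enrichment ratio between two fold distributions. The fold counts below are purely hypothetical, chosen only to illustrate the computation:

```python
def enrichment(counts_a: dict, counts_b: dict) -> dict:
    """Ratio of each fold's relative frequency in distribution A to its
    relative frequency in distribution B (>1 = over-represented in A,
    <1 = under-represented in A)."""
    na, nb = sum(counts_a.values()), sum(counts_b.values())
    folds = set(counts_a) & set(counts_b)
    return {f: (counts_a[f] / na) / (counts_b[f] / nb) for f in folds}

# Hypothetical counts: immunoglobulin-like beta sandwiches (b.1) heavy
# in the PDB, RNA/DNA-binding 3-helical bundles (a.4) heavy in TargetDB.
pdb = {"b.1": 400, "c.1": 300, "a.4": 100}
targetdb = {"b.1": 100, "c.1": 150, "a.4": 250}
ratios = enrichment(pdb, targetdb)
```

Using relative frequencies rather than raw counts keeps the comparison meaningful when the two databases differ greatly in size.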
4 Discussion
Structural genomics is a large science project involving multidisciplinary teams seeking to increase the number of macromolecular structures. From this process comes new understanding of living systems, derived from functional inference from structure, and improved methodologies. Improved methodologies range from new engineering practices, which speed the structure determination process, to an increased number of known folds, which improves our ability to provide realistic models of proteins of unknown structure.

A unique aspect of structural genomics is a weekly report by all groups engaged in this activity. Thus for the first time we are in a position to monitor quantitatively the scientific progress of a major scientific project. This progress is in the form of the status in the structure determination process of protein sequence targets. This status terminates at the point the structure enters the PDB, and hence structures completed by structural genomics can be compared against structures
derived from conventional, functionally driven structure determination experiments. Targets which have not yet been solved can be predicted with a variety of existing structure prediction methods. Taking existing unsolved targets, solved structures, and predicted structures of the targets, a picture of the progress of structural genomics begins to emerge. Here we have reported on that picture.

The percentage of structures being contributed by structural genomics is approximately 10% at this time. The time to solution ranges from three to eighteen months, with a peak in the 8-10 month range (data not shown). Data are not available for how this compares to conventional structure determination, but it is estimated to be of a similar order. At this time structural genomics would seem to be contributing twice the number of new folds as conventional structure determination, but the numbers are too small to be considered statistically significant. An argument has been made that structural genomics might contribute fewer new folds than one might anticipate, since the emphasis will be on determining the maximum number of structures. Maximizing numbers implies taking what crystallizes easily, and this could be construed as favoring those structures that fall in the subset of folds most amenable to crystallization. Conversely, a functionally driven initiative on a single target might expend more time and energy performing experiments that would result in the crystallization of a less amenable fold not pursued by structural genomics. This type of conjecture will become testable as the number of structures increases. We will continue to process TargetDB and report our findings through the Web site at http://spam.sdsc.edu/sgtdb.
Acknowledgments

This work is supported by the National Institutes of Health grant number 1P01GM63208-01.
References

1. S.E. Brenner and M. Levitt. Expectations from Structural Genomics. Protein Sci 9(1), 197 (2000).
2. E. Portugaly and M. Linial. Estimating the Probability for a Protein to have a New Fold: A Statistical Computational Model. Proc Natl Acad Sci USA 97(10), 5161 (2000).
3. R.F. Service. Tapping DNA for Structures Produces a Trickle. Science 298, 948 (2002).
4. M. Gerstein et al. Structural Genomics: Current Progress. Science 299, 1663 (2003).
5. J. Westbrook et al. The Protein Data Bank and Structural Genomics. Nucleic Acids Research 31(1), 489 (2003).
6. L. Rychlewski, L. Jaroszewski, W. Li, and A. Godzik. Comparison of Sequence Profiles. Strategies for Structural Predictions using Sequence Information. Protein Science 9, 232 (2000).
7. W.W. Li, G.B. Quinn, N.N. Alexandrov, P.E. Bourne, and I.N. Shindyalov. Proteins of Arabidopsis (PAT) database: A Resource for Comparative Proteomics. Genome Biology, In Press (2003).
8. A. Raval, Z. Ghahramani, and D.L. Wild. A Bayesian Network Model for Protein Fold and Remote Homologue Recognition. Bioinformatics 18(6), 788 (2002).
9. C. Zhang and S.-H. Kim. Overview of Structural Genomics: From Structure to Function. Current Opinion in Chemical Biology 7, 28 (2003).
PROTEIN STRUCTURE AND FOLD PREDICTION USING TREE-AUGMENTED NAIVE BAYESIAN CLASSIFIER

A. CHINNASAMY, W.K. SUNG
(arun, ksung)@comp.nus.edu.sg, Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543.
A. MITTAL
ankush@bits-pilani.ac.in, Department of Computer Science, Birla Institute of Technology and Science, Pilani, India.

For determining the structure class and fold class of a protein, computer-based techniques have become essential considering the large volume of data. Several techniques based on sequence similarity, neural networks, SVMs, etc. have been applied. This paper presents a framework using Tree-Augmented Networks (TAN), based on the theory of learning Bayesian networks but with less restrictive assumptions than the naive Bayesian networks. In order to enhance TAN's performance, pre-processing of data is done by feature discretization and post-processing is done by using a Mean Probability Voting (MPV) scheme. The advantage of using the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the mystery of protein structure. Experimental results and comparison with other works over two databases show the effectiveness of our TAN-based framework. The idea is implemented as the BAYESPROT web server and it is available at http://wwwappn.comp.nus.edu.sg/~bioinfo/bayesprot/Default.htm.
1 Introduction
In proteomics, finding the structure and the fold of a protein is very important since it helps to understand the functions and the catalytic and structural roles of proteins. Protein structure can be determined experimentally by X-ray diffraction and NMR techniques. These methods are expensive, tedious, labor intensive, and have their own limitations. This leads to research in predicting the protein folding pattern given only its primary structure. This computational approach to protein structure prediction can be classified into two general types.
1. Homology methods: (a) Sequence similarity methods: these are based on the observation that two proteins have very similar structure if their sequences have high homology [3]. (b) Threading methods: these predict the structure of a protein sequence by aligning it with a known structure [12].
2. Discriminative methods: these extract some general "rules" from the known protein structures and apply the "rules" to a new protein sequence to make the prediction [16].

Sequence similarity has its limitation, as it applies only to those sequences which are similar in terms of both sequence and structure. Several discriminative methods based on statistical techniques, neural networks, and SVMs have been applied in the past. The main difficulty in applying learning (discriminative) methods is that the folding prediction becomes less accurate with an increasing number of classes. This study hopes to solve these issues using the Bayesian classifier framework. The Bayesian classifier is theoretically the best classifier provided the underlying distribution functions are well estimated [7]. However, the Bayesian classifier requires prior knowledge of many probabilities. This paper designs a framework called BAYESPROT, with discretization of the feature space and a Tree-Augmented Network (TAN) Bayesian classifier as its foundation, to address the problem of structure and fold classification. In addition, a Mean Probability Voting (MPV) method is employed to improve the performance.

For the prediction in this paper, we use the protein classification of the SCOP [22] database; that is, proteins are classified in a hierarchical order of structures, folds, superfamilies, and families. Since finding the structural and the fold class is most significant, in this paper we apply our classification system to classify a protein into different structural and fold classes.
2 Review
Recently, machine learning tools have been widely used for classification based on tertiary super classes. These methods are denoted discriminative methods or data mining approaches. Since no direct relationship between sequence and structure is derived, much attention has been paid to statistical and machine learning techniques that classify proteins using feature vector representations of the available knowledge. Dubchak et al. (1995, 1999) [5,6] conducted classification studies based on neural networks. Ding and Dubchak (2001) [4] classified proteins into 27 fold classes using SVMs and neural networks based on three
multi-classification methods (OvO, uOvO, AvA) and concluded that the performance of SVMs is better than that of neural networks. Their study introduced SVMs to the protein classification problem. The accuracy measurement in their method assumes that a prediction is partially correct when ties exist (we instead count such predictions as failures). Their method also uses a large number of classifiers. Cai et al. (2001) [19] used SVMs to classify proteins into the four major protein classes and compared the results with a component-coupled neural network. Edler et al. (2001) conducted a statistical study based on logistic regression, additive models, and projection pursuit for protein fold prediction with a dataset containing 268 proteins. Markowetz et al. (2003) used Gaussian and various polynomial kernels with SVMs and showed that their approach performed better than earlier work. From all these studies it is evident that, among the prediction methods, SVMs perform best. Although most recent work has shown that SVMs have good generalization properties and statistically outperform neural network methods for protein fold prediction, SVM methods are reported to produce a high number of false positives. Besides, the number of binary classifiers is large and the computational time for SVM training is high when the number of classes is large. It has also been shown that SVM performance varies with the dimensionality of the feature vector, so SVM methods may require feature selection. Therefore, alternative learning methods are sought that might not share these shortcomings.
3 Overview of BAYESPROT
Figure 1 shows an overview of the BAYESPROT system. Given a database of several million protein sequences, their attributes are extracted and transformed into features, namely composition (20 features), secondary structure (21), hydrophobicity (21), polarity (21), polarizability (21), and van der Waals volume (21). After feature vector extraction, the feature values are discretized into four discrete states by the frequency discretization method. Three separate TAN Bayesian classifiers are constructed using the full concatenated feature vector (126), the composition feature vector (20), and the secondary structure feature vector (21), respectively. Previous research and our experiments suggest that, among all the attributes, composition and secondary structure are the most important for protein structure prediction. Hence we construct separate TAN classifiers for composition and secondary structure, and chose only these two additional classifiers to reduce complexity. Next, MPV is employed to predict the structural class. A similar procedure classifies the fold class, as shown in Figure 1.

Figure 1: Architecture of BAYESPROT
4 Dataset and feature vector representation
We used the datasets referred to in two prominent recent works: Ding and Dubchak (2001) and Markowetz et al. (2003). A summary of the two datasets (Dataset I and Dataset II) is tabulated in Table 2.
4.1 Dataset I
Dataset I was originally built for the study of Ding and Dubchak (2001) and later used by Markowetz et al. (2003). Both studies confirm that the dataset is reasonable, as it is based on the PDB-select sets, in which no two proteins share more than 35% sequence identity for sequences longer than 80 residues. Dataset I is available at http://www.nersc.gov/~cding/protein/.
4.2 Dataset II

Dataset II was built from the Database for expected Fold-Classes (DEF) for a statistical study. Markowetz et al. (2003) used this dataset and concluded that SVMs were better than the previous statistical approaches. Dataset II is available at http://www.dkfz.de/biostatistics/protein/gsme97.html.
4.3 Feature Vectors or Global Descriptors of Amino Acid Sequence
To apply machine learning algorithms, we have to turn amino acid sequences of heterogeneous length into feature vectors of homogeneous length. The feature vector construction is based on physical and stereochemical properties of amino acids; the method was used and explained in previous studies. Each protein sequence is represented by a set of six attribute feature vectors. The composition feature vector of length 20, which lists the proportion of each of the 20 amino acids, is constructed in a straightforward manner. Apart from composition, the other attributes used are predicted secondary structure, polarity, polarizability, hydrophobicity and van der Waals volume. Except for composition, the feature vectors for these five attributes are constructed in two steps.
Step 1: For each attribute, the twenty amino acids are divided into three groups (see Table 1). For each protein sequence, every amino acid is replaced by the index 1, 2, or 3 according to its group. For example, the protein sequence KLLSHCLLVTLAAHLPAEFTPAV is replaced by 13322333323222322132232 under the hydrophobicity grouping of amino acids (see Table 1).

Step 2: For each converted sequence from Step 1, three descriptors are calculated: "composition" (C), "transition" (T), and "distribution" (D), defined as follows.

Composition: the composition of each group is C_i = (n_i / L) × 100, where C_i is the percent composition of group i, n_i is the total number of group-i residues in the sequence, and L is the length of the sequence.

Transition: the transition T_ij is the percent frequency with which a group-i residue is followed by a group-j residue, or a group-j residue by a group-i residue, where i, j take the values 1, 2 and 3.

Distribution: the distribution descriptor D consists of five numbers for each of the three groups: the fractions of the entire sequence at which the first residue of a given group is located, and at which 25%, 50%, 75%, and 100% of the residues of that group are contained.

Each attribute's feature vector therefore contains 21 features: 3 composition features, 3 transition features and 5 × 3 distribution features. The full feature vector of length 126 is constructed by concatenating the five attribute vectors of length 21 each (5 × 21 = 105), the amino acid composition vector of length 20, and the sequence length (1 feature).

Table 1: Amino acid attributes and corresponding groups.

Attribute              Group 1                Group 2               Group 3
Secondary structure    Helix                  Strand                Coil
Hydrophobicity         Polar:                 Neutral:              Hydrophobic:
                       R,K,E,D,Q,N            G,A,S,T,P,H,Y         C,V,L,I,M,F,W
Van der Waals volume   (0-2.78)               (2.95-4.0)            (4.43-8.08)
                       G,A,S,C,T,P,D          N,V,E,Q,I,L           M,H,K,F,R,Y,W
Polarity               (4.9-6.2)              (8.0-9.2)             (10.4-13.0)
                       L,I,F,W,C,M,V,Y        P,A,T,G,S             H,Q,R,K,N,E,D
Polarizability         (0-0.108)              (0.128-0.186)         (0.219-0.409)
                       G,A,S,D,T              C,P,N,V,E,Q,I,L       K,M,H,F,R,Y,W
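As a concrete illustration, the composition/transition/distribution (CTD) descriptors of Step 2 can be sketched as follows for a single attribute (here the hydrophobicity grouping of Table 1). The function and variable names are ours, not from the original implementation, and edge-case conventions (e.g. rounding of the distribution quantiles) are assumptions:

```python
# Sketch of the 21 CTD features for one attribute, assuming the
# hydrophobicity grouping of Table 1. Illustrative only.

HYDROPHOBICITY_GROUPS = {
    **{aa: 1 for aa in "RKEDQN"},      # group 1: polar
    **{aa: 2 for aa in "GASTPHY"},     # group 2: neutral
    **{aa: 3 for aa in "CVLIMFW"},     # group 3: hydrophobic
}

def ctd_features(seq, groups=HYDROPHOBICITY_GROUPS):
    """Return 3 composition + 3 transition + 15 distribution features."""
    idx = [groups[aa] for aa in seq]
    L = len(idx)
    # Composition: percent of residues in each group.
    comp = [100.0 * idx.count(g) / L for g in (1, 2, 3)]
    # Transition: percent frequency of adjacent pairs between groups i and j.
    pairs = list(zip(idx, idx[1:]))
    trans = []
    for gi, gj in ((1, 2), (1, 3), (2, 3)):
        n = sum(1 for a, b in pairs if {a, b} == {gi, gj})
        trans.append(100.0 * n / (L - 1))
    # Distribution: sequence fractions at which the first residue and
    # 25%, 50%, 75%, 100% of each group's residues are located.
    dist = []
    for g in (1, 2, 3):
        positions = [i + 1 for i, v in enumerate(idx) if v == g]
        n = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, int(round(frac * n))) if n else 0
            dist.append(100.0 * positions[k - 1] / L if n else 0.0)
    return comp + trans + dist

features = ctd_features("KLLSHCLLVTLAAHLPAEFTPAV")
```

The three composition features necessarily sum to 100%, which gives a quick sanity check on any implementation of this descriptor.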
5 Our Framework
5.1 Discretization
In our datasets, the feature values are continuous. Although the Bayesian classifier supports both continuous and discrete probability distributions, we found experimentally that continuous distributions were not suitable for these datasets. Therefore, we pre-processed the data by converting the continuous attribute values into discrete ones. One popular and simple discretization approach is range discretization. However, under range discretization some of the partitions become over-populated while others remain empty, leading to poor discretization. To avoid this problem, we employ frequency-based discretization, which partitions each attribute into intervals containing approximately the same number of instances. Frequency-based discretization was tried with 3, 4, 5, 7 and 10 intervals; by experiment, 4 intervals yielded the best classification performance and was chosen.
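A minimal sketch of the equal-frequency discretization described above, assuming four intervals and percentile-based cut points (the paper does not publish its implementation, so the function name and details below are ours):

```python
import numpy as np

def equal_frequency_discretize(values, n_bins=4):
    """Map continuous values to bins 0..n_bins-1 with ~equal occupancy."""
    values = np.asarray(values, dtype=float)
    # Cut points at the 25th/50th/75th percentiles for n_bins = 4.
    quantiles = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.digitize(values, quantiles)

x = [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0, 10.0]
bins = equal_frequency_discretize(x, n_bins=4)
# Each of the four bins receives two of the eight values, whereas
# range (equal-width) discretization would put seven of them in one bin.
```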
5.2 TAN Bayesian Classifier
Bayesian networks are directed acyclic graphs that combine statistics and graph theory to represent conditional independencies [10]. A directed edge A → B indicates a causal relationship (A causes B), and thus Bayesian
networks are quite intuitive. Optimal classifications can be achieved by reasoning about these probabilities together with the observed data. Classification is done by applying Bayes' rule to compute the probability of a class C given a particular instance of attributes A_1, ..., A_n, and then predicting the class with the highest probability. The structural relationship among the attributes is important for a Bayesian network classifier to construct the relationships among the various nodes. However, no clear structural relationship is known at present, owing to the nature of the problem, and structural learning is not possible with the present database. Therefore, we chose the TAN Bayesian classifier [13,15] rather than a general Bayesian network classifier, as it is more relevant to the problem given the feature vector properties and relations. The TAN Bayesian classifier is an extension of the naive Bayesian classifier. As in the naive Bayesian classifier, TAN consists of a class node connected to all child nodes, each representing a feature; in addition, each child node may have at most one other feature node as a parent. An attractive property of the TAN Bayesian classifier is that it learns the probabilities from the data in polynomial time. For our case, we create a TAN Bayesian classifier with a class node representing the protein structure/fold classes, connected to 126 child nodes for the 126 features. In addition, it is assumed that composition node C_i has a structural relationship with C_{i+1}, and that each attribute's percent composition and each distribution vector have structural relationships. Three TAN Bayes classifiers were constructed, for the concatenated feature vector of length 126, the composition feature vector of length 20 and the secondary structure feature vector of length 21, respectively. The TAN Bayes classifier is defined by the following equation, where α is a normalization constant:

P(Class | A_1, ..., A_n) = α P(Class) ∏_{i=1}^{n} P(A_i | parents(A_i))    (1)
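Equation (1) with the assumed chain structure over features can be sketched as a small discrete classifier. Everything below (function names, Laplace smoothing, the toy data) is illustrative, not the paper's implementation; a fixed chain P(A_i | Class, A_{i-1}) stands in for the TAN tree, mirroring the paper's assumption that node C_i depends on its neighbor:

```python
import numpy as np

def fit_chain_tan(X, y, n_states=4, n_classes=2, alpha=1.0):
    """Estimate P(Class) and chain CPTs by counting with Laplace smoothing."""
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) + alpha
    prior = prior / prior.sum()
    # cpts[i][c, p, s] = P(A_i = s | Class = c, A_{i-1} = p); A_0 has no feature parent.
    cpts = []
    for i in range(d):
        if i == 0:
            counts = np.full((n_classes, 1, n_states), alpha)
            for xi, c in zip(X[:, 0], y):
                counts[c, 0, xi] += 1
        else:
            counts = np.full((n_classes, n_states, n_states), alpha)
            for row, c in zip(X, y):
                counts[c, row[i - 1], row[i]] += 1
        cpts.append(counts / counts.sum(axis=2, keepdims=True))
    return prior, cpts

def predict_proba(x, prior, cpts):
    """Apply Eq. (1): P(Class|A) ∝ P(Class) * prod_i P(A_i | parents(A_i))."""
    logp = np.log(prior)
    for c in range(len(prior)):
        for i, cpt in enumerate(cpts):
            parent = 0 if i == 0 else x[i - 1]
            logp[c] += np.log(cpt[c, parent, x[i]])
    p = np.exp(logp - logp.max())
    return p / p.sum()          # the normalization constant alpha of Eq. (1)

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 5))     # toy discretized features
y = (X[:, 0] > 1).astype(int)             # toy class depending on feature 0
prior, cpts = fit_chain_tan(X, y)
p = predict_proba(np.array([3, 0, 0, 0, 0]), prior, cpts)
```

Because the CPTs are learned by simple counting over a complete dataset with a known structure, training is a single pass over the data, which is the polynomial-time property the text refers to.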
5.3 Mean Probability Voting
Let P_i, PC_i and PS_i, for i = 1, 2, ..., k, be the marginal probabilities from the TAN Bayesian classifiers that use the length-126 concatenated feature vectors, the length-20 composition feature vectors and the length-21 secondary structure feature vectors, respectively, where k is the number of classes. The mean probability MP_i, for i = 1, 2, ..., k, is calculated by averaging P_i, PC_i and PS_i. The structural/fold class is predicted by selecting the class with the highest mean probability (MP). It is accepted from previous studies that composition and secondary structure are important
in determining protein structure, and in our experiments voting increased the accuracy by around 4%.

Figure 2: TAN Bayesian Classifier

Table 2: Structural and Fold Classification Results of BAYESPROT.

                     No. of    Proteins in  Proteins in  Test         Cross Validation
Dataset              Classes   Train Data   Test Data    Accuracy(%)  Accuracy(%)
Structural Classes
  Dataset I          5         313          385          80.52        83.09
  Dataset II         4         143          125          77.6         79.85
Fold Classes
  Dataset I          27        313          385          58.18        59.77
  Dataset II         42        143          125          74.40        75.75

6 Experimental Results
6.1 Results

Both structural and fold classification were carried out using BAYESPROT on Dataset I and Dataset II. Table 2 summarizes the results for both datasets. The classifiers were evaluated on an independent test dataset and by 10-fold cross-validation. In Dataset I, the 27 fold classes are drawn from the structural classes α, β, α/β, α+β and small, and in Dataset II, the 42 fold classes are drawn from the structural classes α, β, α/β and α+β.
For the Dataset I structural classes, the confusion matrix is shown in Table 3, and the sensitivity and specificity for the five structural classes are shown in Table 4. Except for the α+β super class, all super classes are predicted with sensitivity greater than 70%. From the confusion matrix for the structural classifier it is evident that a significant number of proteins of the α+β class are misclassified into the α and β classes; similarly, some β class proteins are misclassified as α/β. The specificity of each
Table 3: Confusion Matrix for Super Classifier (Dataset I)

Table 4: Sensitivity and Specificity for each class (Dataset I)

Class                       Sensitivity (%)   Specificity (%)
α                           80.33             94.44
β                           77.78             92.16
α/β                         91.03             90.83
α+β                         40.00             96.29
Small                       88.89             99.72
Average over fold classes   50.89             61.76
structural class is very high compared to its sensitivity. The confusion matrix and individual accuracy tables for the Dataset II structural classes are available at http://www.comp.nus.edu.sg/~bioinfo/bayesprot/results.htm. From the experiments, it can be concluded that on Dataset I BAYESPROT classified six fold classes with accuracy greater than 60% and predicted 15 fold classes with accuracy greater than 50%. The average specificity over the 27 fold classes is 61.76%, which is higher than the average sensitivity of 50.89%. Confusion matrices and detailed results for the 27 and 42 fold classes are available at http://www.comp.nus.edu.sg/~bioinfo/bayesprot/results.htm.
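The per-class sensitivity and specificity reported in Table 4 can be computed from a confusion matrix as follows; the matrix below is a made-up toy example, not the paper's data:

```python
import numpy as np

def sensitivity_specificity(cm):
    """Per-class sensitivity and specificity from a confusion matrix.

    Rows are actual classes, columns are predicted classes, values are counts.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # actual class i, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # other classes predicted as class i
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

cm = [[50, 5],
      [10, 35]]                       # toy 2-class confusion matrix
sens, spec = sensitivity_specificity(cm)
```

Specificity pools all other classes as negatives, which is why, as in Table 4, it tends to be much higher than sensitivity when there are many classes.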
7 Analysis and Discussions

7.1 Dataset I: Comparison with Ding and Dubchak (2001)
In their study, Ding and Dubchak used One-versus-Others (OvO), unique One-versus-Others (uOvO) and All-versus-All (AvA) methods for multi-class classification, with binary SVMs or neural networks as building blocks. Table 5 summarizes the results of BAYESPROT and SVM [4] on the 27 fold classes. In the 10-fold cross-validation study, BAYESPROT achieves an accuracy of 59.77%, which is 31.57% higher (in relative terms) than the SVM AvA method. The number
Table 5: Comparative Results of BAYESPROT and SVM with Dataset I

                        Test Dataset                                Cross Validation
Method        BAYESPROT  SVM OvO   SVM uOvO  SVM AvA     BAYESPROT   SVM AvA
Accuracy (%)  58.8       41.8      45.2      56.0        59.77       45.4
Classifiers   3 TAN      168       2457      2106        30 TAN      84,240
              Bayes      binary    binary    binary      Bayes       binary
                         SVM       SVM       SVM                     SVM
of classifiers used in this cross-validation study is 10 × 3 = 30 TAN Bayesian classifiers, substantially fewer than in SVM AvA, where 84,240 binary SVM classifiers were employed. It is important to note that the accuracy measurements used in our study and in theirs differ in how the number of correctly classified proteins is counted. In their method, if the voting outputs for the three top classes C1, C2 and C3 are 2, 2 and 1 respectively, and the correct class is C2, the number of correctly predicted proteins is incremented by 0.5. Our work considers such a case a misclassification and does not increment the number of true positives. Thus, the superiority of the BAYESPROT method over SVM can be observed. Another consideration is that the number of classifiers used by the SVM and neural network approaches is much higher than in BAYESPROT. The learning complexity of SVM depends on the number of iterations and is in many cases quite high, whereas in BAYESPROT, since the dataset is complete and the structure is known, the time required to learn the parameters is very small. In addition, the number of classifiers used in the Bayesian network approach is substantially smaller than for SVM, as can be seen in Table 5.

7.2 Dataset II: Comparison with Markowetz et al. (2003)
Dataset II consists of 42 fold classes, 143 training proteins and 125 test proteins. In their study, an OvO SVM multi-class method was employed, achieving a best accuracy among the various kernels of 76.8% on the test dataset and 70.9% under cross-validation. Table 6 summarizes the BAYESPROT and SVM results. Unlike Dataset I, the proteins in Dataset II are spread thinly across the classes: of the 42 fold classes, 36 have at most 4 proteins in the training dataset.
Table 6: Comparative Results of BAYESPROT and SVM with Dataset II (accuracy and number of classifiers for BAYESPROT and the SVM kernel methods, on the test set and under cross-validation).
7.3 Effects of a Large Number of Training Samples

Cross-validation is a method to estimate the generalization error of a given model. We conducted a 10-fold cross-validation study to estimate the generalization error and to compare with the previous SVM methods. From Table 5 and Table 6, it is evident that under cross-validation over Dataset I and Dataset II the accuracy of BAYESPROT increases while the accuracy of the SVM methods decreases.

7.4 Interpreting the Classification Results
Analyzing the classification results is very important for solving biological problems. Biologists need to know the confidence level of the classes output by the classifiers for further analysis. Understanding the marginal differences between the top predicted classes is also important in further confirming the structural class of a protein. Our classification approach supports this type of interpretation, as it gives a probability for each class. Such interpretation is not possible with neural networks and is difficult with SVMs: neural networks contain many hidden nodes and the final output is based on a threshold value, and in SVMs, because the number of classifiers is high, reading the distances between the hyperplanes and the classes is very difficult.

8 Conclusions and future work
In this paper, we presented a framework based on TAN and a voting method that is shown to perform better than SVM in most cases. Since the network structure and the probabilities are well understood, the BAYESPROT framework also has several theoretical advantages relevant to biology researchers, and is thus a better tool for analyzing protein sequences. Further research is being carried out on incorporating network structures better than TAN to improve the performance.
References
1. A. Mittal et al., SPIE Conf. on Applications of Artificial Neural Networks in Image Processing VI, USA, 97-107, (2001).
2. A. Mittal and L.-F. Cheong, IEEE Transactions on Knowledge & Data Engineering, vol 15, no 4, (2003).
3. D. W. Mount, Cold Spring Harbor Laboratory Press, (2001).
4. C. H. Q. Ding and I. Dubchak, Bioinformatics, 17(4):349-358, (2001).
5. I. Dubchak, I. Muchnik, C. Mayor, I. Dralyuk and S.-H. Kim, Proteins, 35(4):401-7, (1999).
6. I. Dubchak, I. Muchnik, S. R. Holbrook and S.-H. Kim, Proc. Natl. Acad. Sci. USA, 92, 8700-8704, (1995).
7. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, (2001).
8. L. Edler et al., Math. and Computer Modelling 33, 1401-1417, (2001).
9. F. Markowetz, L. Edler and M. Vingron, Biometrical Journal 45(3), 377-389, (2003).
10. F. V. Jensen, Springer-Verlag, New York, (2001).
11. G. H. John and P. Langley, In Proc. of the 11th Conf. on Uncertainty in AI, Montreal, Quebec, Morgan Kaufmann, pp. 338-345, (1995).
12. D. Jones, W. Taylor and J. Thornton, Nature, 358:86-89, (1992).
13. N. Friedman et al., Machine Learning 29(2-3):131-163, (1997).
14. P. Domingos and M. Pazzani, Machine Learning, 29:103-130, (1997).
15. P. Langley et al., In Proc. of the 10th Natl. Conference on AI, pages 223-228, AAAI Press and MIT Press, (1992).
16. P. Wang and D. Zhang, 14th IEEE Int. Conf. on Tools with AI, November, pp. 252-257, (2002).
17. R. Collobert and S. Bengio, J. of Machine Learning Research, vol 1, pages 143-160, (2001).
18. M. J. Sippl and H. Flockner, Structure 4, 15-19, (1996).
19. Y.-D. Cai, X.-J. Liu, X.-B. Xu and G.-P. Zhou, BMC Bioinformatics 2:3, (2001).
20. J. Grassmann, M. Reczko, S. Suhai and L. Edler, In Proc. Int. Conf. Intell. Syst. Mol. Biol. (ISMB 1999), pp. 106-112, (1999).
21. J. R. Bock and D. A. Gough, Bioinformatics vol 17(5), 455-460, (2001).
22. A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, J. Mol. Biol. 247, 536-540, (1995).
CLUSTERING PROTEIN SEQUENCE AND STRUCTURE SPACE WITH INFINITE GAUSSIAN MIXTURE MODELS

A. DUBEY, S. HWANG, C. RANGEL
Keck Graduate Institute, 535 Watson Drive, Claremont CA 91711, USA

C. E. RASMUSSEN
Max Planck Institute for Biological Cybernetics, Spemann Strasse 38, 72076 Tübingen, Germany

Z. GHAHRAMANI
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK
D. L. WILD
Keck Graduate Institute, 535 Watson Drive, Claremont CA 91711, USA

Abstract

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/~wild/PSB04/index.html
1 Introduction
The clustering of protein sequences into families and superfamilies is a common approach both for comparative genomics and for the prediction of protein function. With the advent of structural genomics projects, the clustering of protein sequences with those of known structure has also
been proposed as a method of target selection for structure determination. Newly determined protein structures must then be classified, both to assess their novelty and, in the case of proteins of unknown function, as a first step in functional annotation. Most methods for clustering protein sequences begin with an all-against-all pairwise similarity search and use the pairwise score as a measure of the similarity of the two sequences. A variety of approaches have been described for constructing clusters from these scores: GENERAGE uses recursive single-linkage hierarchical clustering, and PROTOMAP constructs hierarchical clusters in a similar manner but using the means of all pairwise scores. SYSTERS uses heuristics derived from set-theoretic considerations to obtain a set of disjoint clusters. Abascal and Valencia describe a method for clustering protein families which uses the Ncut algorithm derived from graph theory. All these methods rely on setting some score threshold to distinguish members of a particular cluster from non-members, making the determination of the number of clusters arbitrary and subjective. Approaches based on single-linkage hierarchical clustering can give results which are highly dependent on small changes to the data (such as adding or removing a single sequence). Moreover, non-probabilistic approaches do not provide a measure of uncertainty about the clustering, and make it difficult to compute the predictive quality of the clustering and to compare clusterings based on different model assumptions (e.g. numbers of clusters, shapes of clusters, etc.). Krogh et al. provided an alternative probabilistic approach which used hidden Markov models (HMMs) to cluster protein sequences from the globin family into subfamilies. They fit a mixture of HMMs (which is itself a special kind of HMM) using maximum likelihood methods.
The results of these experiments were promising for this particular example, yielding clusters that correspond to known globin subfamilies. Little work has followed up on this area. Methods for automatically clustering sequences into hypothesized classes will be increasingly useful as the amounts of sequence and structural data continue to grow. An important issue that must be addressed in any clustering method is the question of how many clusters to use. Bayesian statistics can provide a solution to model selection questions of this kind. Within the Bayesian framework, an elegant alternative approach is to assume that the data was in fact generated from an infinite number of Gaussian clusters. Any actual clusters in the protein sequence data will surely not be Gaussian distributed (we discuss below how one can derive vectorial representations of sequences so that questions about Gaussianity are well-defined). Infinite mixtures are a sensible way to capture the fact that we do not really believe that protein sequence data is well modeled by a finite number of Gaussians. An infinite Gaussian
mixture model can readily model a finite number of non-Gaussian clusters. Finally, in an infinite Gaussian mixture model there is no need to make arbitrary choices about how many clusters there are in the data; nevertheless, after modeling one can ask questions such as how probable it is that two protein sequences or structures belong to the same cluster. We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite mixtures [8]. This theory is based on the observation that the mathematical limit of an infinite number of components in an ordinary finite mixture model (i.e. clustering model) corresponds to a Dirichlet process prior. Such a prior allows the data itself to dictate how many mixture components are required to model it: a diverse family may require several components, whereas a simpler family may require only one. Although in theory the infinite mixture has an infinite number of parameters, it is, surprisingly, possible to sample from these infinite mixture models efficiently, since only the parameters of a few of the components need to be represented. The theory of infinite mixture models is laid out by Rasmussen [8], who showed that the procedure works effectively with mixtures of Gaussians. It has since been applied to the clustering of gene expression profiles by Medvedovic and Sivaganesan [11].
2 Infinite Gaussian Mixture Models
One commonly used computational method of non-hierarchical clustering, based on measuring the Euclidean distance between feature vectors, is the k-means algorithm. However, the k-means algorithm is inadequate for describing clusters of unequal size or shape. A generalization of k-means can be derived from the theory of maximum likelihood estimation of Gaussian mixture models. In a Gaussian mixture model, the data (e.g. features of protein sequences or gene expression profiles, arranged into p-dimensional vectors y) is assumed to have been generated from a finite number k of Gaussians, P(y) = Σ_{j=1}^{k} φ_j P_j(y), where φ_j is the mixing proportion for cluster j (the fraction of the population belonging to cluster j; Σ_j φ_j = 1, φ_j ≥ 0) and P_j(y) is a multivariate Gaussian distribution with mean μ_j and covariance matrix Σ_j. The clusters can be found by fitting the maximum likelihood Gaussian mixture model as a function of the set of parameters θ = {φ_j, μ_j, Σ_j}_{j=1}^{k} using the EM algorithm [12]. Euclidean distance corresponds to assuming that the Σ_j are all equal multiples of the identity matrix. Starting from a finite mixture model, we define a prior over the mixing proportion parameters φ. The natural conjugate prior for
mixing proportions is the symmetric Dirichlet distribution:

P(φ_1, ..., φ_k | α) = Γ(α) / Γ(α/k)^k ∏_{j=1}^{k} φ_j^{α/k − 1},

where α controls the distribution of the prior weight assigned to each cluster, and Γ is the gamma function. We then explicitly include indicator variables c_i for each data point (i.e. protein sequence), which can take on integer values c_i = j, j ∈ {1, ..., k}, corresponding to the hypothesis that data point i belongs to cluster j. Under the mixture model, by definition, the prior probability is proportional to the mixing proportion: P(c_i = j | φ) = φ_j. A key observation is that we can compute the conditional probability of one indicator variable given the settings of all the other indicator variables, after integrating over all possible settings of the mixing proportion parameters:
P(c_i = j | c_{−i}, α) = (n_{−i,j} + α/k) / (n − 1 + α),    (1)

where c_{−i} is the setting of all indicator variables except the ith, n is the total number of data points, and n_{−i,j} is the number of data points belonging to class j, not including i. By Bayes' rule,

P(φ | c_{−i}, α) ∝ P(φ | α) P(c_{−i} | φ),    (2)

which is also a Dirichlet distribution, making it possible to perform the above integral analytically. We can now take the limit of k going to infinity, obtaining a Dirichlet process with differing conditional probabilities for clusters with and without data. For clusters where n_{−i,j} > 0:

P(c_i = j | c_{−i}, α) = n_{−i,j} / (n − 1 + α);

for all other clusters combined:

P(c_i ≠ c_{i′} for all i′ ≠ i | c_{−i}, α) = α / (n − 1 + α).

This shows that the probabilities are proportional to the occupation numbers n_{−i,j}. Using these conditional probabilities one can Gibbs sample the indicator variables efficiently, even though the model has infinitely many Gaussian clusters. Having integrated out the mixing proportions, one can also Gibbs sample all of the remaining parameters of the model, i.e. {μ_j, Σ_j}. The details of these procedures can be found in Rasmussen (2000) [8]. We have used infinite Gaussian mixtures to model protein sequence data with the intention of answering queries of the kind: what is the probability that two proteins belong to the same cluster? Unlike previous methods based on a single clustering of the data, this approach computes this probability while taking into account all sources of model uncertainty (including the number and locations of the clusters). We use the probability p_ij that two proteins i and j belong to the same cluster in the infinite mixture model as a measure of the similarity of these protein sequences. Conversely, 1 − p_ij defines a dissimilarity measure
which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering (see Figure 3). We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences.
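The Dirichlet-process conditionals above can be sketched as a single Gibbs sweep over the indicator variables. For brevity the likelihood terms P(y_i | μ_j, Σ_j) are omitted, so this samples from the prior only: an occupied cluster j is chosen with probability proportional to n_{−i,j}, a new cluster with probability proportional to α. Names and the toy initialization are ours:

```python
import numpy as np

def gibbs_sweep_prior(c, alpha, rng):
    """One Gibbs sweep over indicators under the Dirichlet-process prior."""
    c = list(c)
    for i in range(len(c)):
        # Occupation numbers n_{-i,j}, excluding point i itself.
        counts = {}
        for j, cj in enumerate(c):
            if j != i:
                counts[cj] = counts.get(cj, 0) + 1
        # Existing clusters plus one brand-new cluster label.
        labels = list(counts) + [max(c) + 1]
        weights = np.array([counts[l] for l in counts] + [alpha], dtype=float)
        c[i] = labels[rng.choice(len(labels), p=weights / weights.sum())]
    return c

rng = np.random.default_rng(1)
c = gibbs_sweep_prior([0, 0, 1, 1, 2], alpha=1.0, rng=rng)
```

Only the occupied clusters are ever represented explicitly, which is why sampling is feasible even though the model nominally has infinitely many components.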
3 Methods
To be able to cluster protein sequences, we need to obtain a vector representation of each protein in a suitable metric space. We use the Fisher score vector representation described by Jaakkola et al. [13], which provides an appropriate measure of similarity between sequences. The Fisher score vector for a particular protein X is obtained by evaluating the derivative of the log-likelihood with respect to the vector of parameters θ of a hidden Markov model (HMM) trained on the set of protein sequences: U_X = ∇_θ log P(X | θ). Each component of the vector U_X is the derivative of the log-likelihood for the sequence X with respect to a particular parameter (the emission probabilities of the HMM). In the work described below, we first train an HMM on the set of protein sequences of interest and then calculate the Fisher score vector as described above. In the case of sequences of known structure, we use the Bayesian network model of Raval et al. [14], which can be thought of as an extension of a hidden Markov model that incorporates multiple observations of primary sequence, secondary structure and residue solvent accessibility, the latter calculated from the three-dimensional coordinates by the DSSP method of Kabsch and Sander [15]. For all data sets, the dimensionality of the Fisher score vector was then reduced by principal components analysis, and we used this reduced-dimension vector as the input vector y to the infinite Gaussian mixture model. We used the first 10 principal components, which captured most of the variance in the U_X vectors. The mixture model was initialized with all data belonging to a single Gaussian, and a large number of Gibbs sampling sweeps were performed, updating all variables and parameters, i.e. {{μ_j, Σ_j}, {c_i}, α}, in turn by sampling from the conditional distributions derived in the previous sections and described in more detail in Rasmussen (2000) [8].
We typically run the chain for 110,000 iterations, discarding the initial 11,000 steps as “burn-in” and keeping every 1000th step after that, generating 100 roughly independent samples from the posterior distribution.
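The burn-in and thinning schedule just described can be sketched as an indexing operation over the chain's sweep indices (the counts below follow the text; with this schedule 99 samples are retained, i.e. roughly 100):

```python
# Sketch of the burn-in/thinning schedule described above: 110,000 Gibbs
# sweeps, discard the first 11,000 as burn-in, keep every 1000th thereafter.
n_iter, burn_in, thin = 110_000, 11_000, 1_000
kept = list(range(n_iter))[burn_in::thin]
print(len(kept), kept[0], kept[-1])  # 99 retained samples (roughly 100)
```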
4 Results
4.1 Globin Sequences
The mixture of HMMs method of Krogh et al. [5] discovered 7 clusters in a set of 628 globin sequences, corresponding to: 1. Class 1, 233 sequences: principally all α, with a few ζ (an α-type chain of mammalian embryonic hemoglobin), π/π′ (the counterpart of the α chain in major early embryonic hemoglobin P), and θ-1 chains (early erythrocyte α-like). 2. Class 2, 232 sequences: almost all β, with a few δ (β-like), ε (a β-type chain found in early embryos), γ (which combines with two α chains in fetal hemoglobin F), ρ (a major early embryonic β-type chain) and η chains (an embryonic β-type chain).
3. Class 3, 71 sequences: myoglobins.
4. Class 4, 58 sequences: the 13 highest scoring in this cluster were leghemoglobins. This class contained a variety of sequences, including 3 non-globins in the original data set. 5. Class 5, 19 sequences: midge globins. 6. Class 6, 8 sequences: globins from agnatha (jawless fish). 7. Class 7, 7 sequences: varied. Our results, using an updated version of the same data set (630 globin sequences, distributed with the HMMER2 software package), are shown in Figure 1. In this plot we show the number of times, out of 100 samples, that the indicator variables for two sequences were equal. As shown above, this may be interpreted as the probability p_ij that two proteins i and j belong to the same cluster. It is evident that our model has discovered a larger number of clusters than the method of Krogh et al. [5]. The granularity of this clustering is determined by the data and not by some user-defined threshold. Large solid blocks of color along the diagonal correspond to homogeneous clusters. Note that in our method, sequences may belong to more than one cluster with a defined probability: off-diagonal elements indicate 'cross-clustering'. For comparison, we also clustered the sequences using BLASTCLUST, which clusters the sequences according to a sequence identity threshold and a single linkage algorithm. With a 90% sequence identity threshold, 261 clusters were obtained. The first large homogeneous cluster in Figure 1 (bottom right hand corner) comprises 37 hemoglobin β sequences plus two δ sequences (HBD_COLPO and HBD_PANTR). Although a number of these sequences are contained within the same cluster in the BLASTCLUST output, indicating that they have > 90% sequence identity, we note that the clusters are by no means identical.
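The co-clustering probability p_ij described above (the fraction of posterior samples in which the indicator variables of proteins i and j agree) can be sketched as follows; the sample indicator vectors are made up for illustration:

```python
# Sketch: estimate p_ij, the probability that proteins i and j share a
# cluster, as the fraction of posterior samples in which their cluster
# indicator variables are equal. `samples` is a hypothetical list of
# per-sweep indicator vectors (one integer cluster label per protein).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 1],
]
n = len(samples[0])
p = [[sum(s[i] == s[j] for s in samples) / len(samples) for j in range(n)]
     for i in range(n)]
print(p[0][1])  # proteins 0 and 1 share a cluster in 3 of 4 samples -> 0.75
```

Plotting the matrix p as a gray-scale image yields a figure of the kind shown in Figure 1, with homogeneous clusters appearing as solid blocks along the diagonal.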
The BLASTCLUST cluster containing many of these hemoglobin β sequences also contains 8 hemoglobin δ sequences and one hemoglobin β-2 chain (HBB2_PANLE). Figure 1 indicates that all sequences within this cluster also 'cross-cluster' with another group of β sequences with a probability of around 20-30%. The next cluster from the bottom right (Figure 1) contains all α sequences and cross-clusters with another group of α sequences with a probability of around 40-50%. Although a detailed analysis of these results is beyond the scope of this paper, we identify at least 11 distinct α and 13 distinct β clusters (plus some additional smaller ones). Although some of the variant sequences cluster with α and β sequences, we identify a number of clusters composed only of variant sequences: 3 clusters comprising only γ, ε and θ sequences, one cluster of δ and one cluster of ζ sequences. We identify 3 distinct clusters of leghemoglobins and 1 cluster of midge hemoglobins (6 sequences), a small cluster of fish hemoglobins and a small cluster comprising clam and earthworm sequences. Myoglobins, which Krogh et al. (1994) found in one cluster, form 10 distinct clusters, mainly comprising proteins from related species. BLASTCLUST groups these into 6 clusters plus 9 singletons at a 90% identity threshold. We identify only 11 singletons (proteins which never cluster with another), none of which are myoglobins. The largest cluster comprises 40 hemoglobin β sequences.
Figure 1: Clustering of the 630 globin sequences. The gray scale indicates the number of times, out of 100 samples, that the indicator variables for two sequences were equal, or the probability that two sequences belong to the same cluster
These results indicate that our method is capable of producing biologically meaningful results and correctly classifies the main globin subfamilies. In addition, it provides a finer level of clustering within these subfamilies than either the use of BLAST alignments and sequence identity or the method of Krogh et al. [5].
4.2 Globin Sequences of Known Structure
For this experiment we obtained globin sequences from the Structural Classification of Proteins (SCOP) database [16] using the ASTRAL resource^b. Sequences with > 95% sequence identity were excluded, leaving 91 proteins. According to the SCOP classification, these comprised representatives of 4 globin structural subfamilies (a.1.1.1: truncated hemoglobins (4 sequences); a.1.1.2: glycera globins, myoglobins, hemoglobin I, flavohemoglobins, leghemoglobins, hemoglobin α and β chains; a.1.1.3: phycocyanins, allophycocyanins, phycoerythrins; and a.1.1.4: nerve tissue mini-hemoglobin (1 sequence)). The sequences were clustered using feature vectors derived from two models: a sequence-only HMM and a Bayesian network model (structural HMM). The results are shown in Figure 2 and Figure 3. The results from the sequence-only clustering (Figure 2, left) show a similar pattern to those obtained with the 630 globin sequences. Fairly homogeneous clusters are mainly composed of related sequences, e.g. β hemoglobin chains, α hemoglobin chains, myoglobins, phycocyanin α and β, phycoerythrin α and β and allophycocyanin α and β chains (which all form separate clusters). Glycera globins form a separate cluster, as do leghemoglobins. Three or four heterogeneous (loosely associated) clusters are observed, which include truncated hemoglobins, hemoglobin I's, dehaloperoxidase etc. The results from the model which includes secondary structure and residue accessibility information show fewer clusters; 12 in all, plus two singletons (dehaloperoxidase and pig roundworm hemoglobin, domain 1) (Figure 2, right). Again α and β hemoglobin chains form distinct and fairly homogeneous clusters, as do the myoglobins, with the exception of 1MYT (a myoglobin which lacks the D helix), which clusters more strongly with β hemoglobins, as well as weakly with the myoglobin cluster, and 1MBA (a mollusc myoglobin), which clusters with clam hemoglobins and glycera globins from bloodworms.
Phycocyanins, allophycocyanins and phycoerythrins (which are all classified by SCOP into the same subfamily a.1.1.3) form two distinct large joint clusters. Within these clusters one can detect subfamilies corresponding to the allophycocyanins, phycoerythrins and phycocyanins, which cluster amongst themselves with a higher probability. Leghemoglobins cluster strongly with a single non-symbiotic plant hemoglobin from rice, and weakly with a clam hemoglobin I. Truncated hemoglobins, which SCOP classifies into a different subfamily (a.1.1.1), form two distinct clusters, and the sole member of subfamily a.1.1.4 (nerve tissue mini-hemoglobin) clusters with 1CH4 (chimeric synthetic hemoglobin beta-alpha). In comparison, 13 clusters are produced with BLASTCLUST only at a 29% sequence
^b http://astral.stanford.edu
identity threshold or lower. These comprise a single cluster for a.1.1.1, nine separate clusters for a.1.1.2 (including 4 singletons), a single cluster for a.1.1.3 and a singleton for a.1.1.4. Our results, which do not require a predefined threshold to be specified, reflect the underlying SCOP classifications, but the biologically meaningful sub-clusters also suggest that a further level of subfamily subdivision is possible.
Figure 2: Clustering of the 91 SCOP globin sequences: left, by sequence information only; right, with the inclusion of structural information. Sequence labels on the y-axis are ordered optimally for each plot.
Figure 3: Dendrogram representation of the clustering of the 91 SCOP globin sequences shown in Figure 2: left, by sequence information only; right, with the inclusion of structural information.
4.3 G-Protein Coupled Receptors (GPCRs)
According to the GPCRDB classification system [17], the G-protein coupled receptor (GPCR) superfamily is classified into 5 major classes: Class A (related to rhodopsin and adrenergic receptors), Class B (related to
calcitonin and PTH/PTHrP receptors), Class C (related to metabotropic receptors), Class D (related to pheromone receptors) and Class E (related to cAMP receptors). The classes share 20% sequence identity over predicted transmembrane helices [17]. Each class is further divided into level 1 subfamilies (e.g. Amine, Peptide, Opsin etc. for Class A) and further into level 2 subfamilies (Muscarinic, Histamine, Serotonin etc. for the Amine subfamily). A number of putative GPCRs have no identified natural ligand and are dubbed 'orphan' receptors. The sequence diversity of the GPCR classes makes subfamily classification a challenging problem. The problem of recognizing GPCR subfamilies is compounded by the fact that the subfamily classifications in GPCRDB are defined chemically (that is, according to the differential binding of ligands to the receptors) and not necessarily by either sequence similarity or the post ligand-receptor binding pathways. A number of other authors have described computational approaches to classifying GPCRs. Karchin et al. [18] trained 2-class support vector machines (SVMs) using Fisher score vectors derived from HMMs [13]. Joost and Methner [19] used a phylogenetic tree constructed by neighbor joining with bootstrapping. Lapinsh et al. [20] translated amino acid sequences into vectors based on the physicochemical properties of the amino acids and used an autocross-covariance transformation followed by principal components analysis (PCA) to classify GPCRs. For our experiments, sequences were obtained from the GPCRDB database [17]^c. Because of the smaller number of sequences in Classes B-E, we have focussed our analysis on Class A sequences. Our dataset comprised 946 sequences, of which 303 were "orphan" receptors, with no family classification. A portion of the clustering results using the infinite Gaussian mixture model are shown in Figure 4.
Because of the sequence diversity of this superfamily, a larger number of smaller clusters are evident around the diagonal than were observed with the globin sequences. Most of the homogeneous clusters (solid color) comprise sequences from the same subfamily (level 3 in the GPCRDB hierarchy), and appear to be orthologs of the same protein from related species. Whilst a detailed analysis of these is beyond the scope of the present paper, as an illustration we note that the largest cluster (bottom right hand corner) comprises Rhodopsin (Rhodopsin Vertebrate type 1) sequences from mammals and reptiles (plus lamprey), whilst the second cluster is composed entirely of fish Rhodopsins. Some unexpected associations also appear. Although in some cases our results indicate assignments for certain orphan receptors which agree with those of the authors cited above, in other cases our predictions are novel. A detailed analysis of these will be published in an extended version of this paper.
^c http://www.gpcr.org
Figure 4: Part of the clustering of the GPCR Class A sequences.
5 Discussion
The consistency of the clusters we obtain with a well-annotated superfamily of protein sequences such as the globins gives us confidence that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. Homogeneous clusters tend to consist of orthologs of the same protein, and paralogs appear to be separated into distinct clusters. This pattern appears to be repeated in our clustering of the GPCR sequences, with the potential of providing functional annotations for certain orphan receptors. Whilst some of these agree with predictions derived from neighbor-joining phylogenetic trees and principal component analysis, a number are novel. In all cases, our method provides a finer level of granularity than the method of Lapinsh et al. [20], clustering orphan receptors with members of particular GPCRDB subfamilies, rather than a broad family classification. With the inclusion of secondary structure and residue solvent accessibility information in the HMM on which our method is based, the clustering of the SCOP globin sequences changes from a large number of small clusters of functionally related sequences to a smaller number of clusters, in which the members of the SCOP globin families are clearly separated. However, once again we achieve an even finer level of classification, clearly separating α, β and myoglobins, as well as other members of SCOP class a.1.1.2. This suggests that our method also has the potential to provide a novel automated method for the structural classification of proteins. In order to achieve a large scale clustering of sequence or structure space we will investigate the use of Fisher scores obtained from a "mixture model" which combines individual models for different superfamilies as described in [14].
Acknowledgments
This work was supported by the National Institutes of Health (NIH) under Grant Number 1 PO1 GM63208. CER was supported by the German Research Council (DFG) through grant RA 1030/1.
References
1. A.J. Enright and C.A. Ouzounis, Bioinformatics 16, 451-457 (2000)
2. G. Yona, N. Linial and M. Linial, Proteins 37, 360-378 (1999)
3. A. Krause and M. Vingron, Bioinformatics 14, 430-438 (1998)
4. F. Abascal and A. Valencia, Bioinformatics 18, 908-921 (2002)
5. A. Krogh, M. Brown, I.S. Mian, K. Sjolander and D. Haussler, J. Mol. Biol. 235, 1501-1531 (1994)
6. Y. Barash and N. Friedman, J. Comput. Biol. 9, 161-191 (2002)
7. S. Richardson and P. Green, J. Roy. Stat. Soc. B 59, 731-792 (1997)
8. C.E. Rasmussen, in Advances in Neural Information Processing Systems 12, ed. S.A. Solla, T.K. Leen and K.-R. Muller (MIT Press, 2000)
9. C.E. Antoniak, Annals of Statistics 2, 1152-1174 (1974)
10. R.M. Neal, J. Comp. and Graphical Statistics 9, 249-265 (2000)
11. M. Medvedovic and S. Sivaganesan, Bioinformatics 18, 1194-1206 (2002)
12. G. McLachlan and D. Peel, Finite Mixture Models (Wiley, New York, 2000)
13. T. Jaakkola, M. Diekhans and D. Haussler, J. Comput. Biol. 7, 95-114 (2000)
14. A. Raval, Z. Ghahramani and D.L. Wild, Bioinformatics 18, 788-801 (2002)
15. W. Kabsch and C. Sander, Biopolymers 22, 2577-2637 (1983)
16. A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, J. Mol. Biol. 247, 536-540 (1995)
17. F. Horn, J. Weare, M.W. Beukers, S. Hoersch, A. Bairoch, W. Chen, O. Edvardsen, F. Campagne and G. Vriend, Nucleic Acids Res. 26, 277-281 (1998)
18. R. Karchin, K. Karplus and D. Haussler, Bioinformatics 18, 147-159 (2002)
19. P. Joost and A. Methner, Genome Biol. 3, RESEARCH0063 (2002)
20. M. Lapinsh, A. Gutcaits, P. Prusis, C. Post, T. Lundstedt and J.E. Wikberg, Protein Sci. 11, 795-805 (2002)
ACCURATE CLASSIFICATION OF PROTEIN STRUCTURAL FAMILIES USING COHERENT SUBGRAPH ANALYSIS
J. HUAN¹, W. WANG¹, A. WASHINGTON¹, J. PRINS¹, R. SHAH², A. TROPSHA²
¹Department of Computer Science, ²The Laboratory for Molecular Modeling, Division of Medicinal Chemistry and Natural Products, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599
Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within a certain distance of each other. Application of subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. To address these challenges, we have developed an information theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X, and the mutual information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that reports only coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al., 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.
1 Introduction
1.1 Spatial Motif Discovery in Proteins
Recurring substructures in proteins reveal important information about protein structure and function. For instance, common structural fragments may represent fixed 3D arrangements of residues that correspond to active sites or other functionally relevant features such as Prosite patterns (Hofmann et al., 1999). Understanding recurring substructures in proteins aids in protein classification (Chakraborty et al. 1999), function prediction (Fischer et al. 1994), and folding (Kleywegt 1999).
Many computational methods have been proposed to find motifs in proteins. Multiple sequence alignments of proteins with similar structural domains (Henikoff et al., 1999) can be used to provide information about possible common substructures, in the hope that conserved sequence patterns in a group of homologous proteins may have similar 3D arrangements. This method generally does not work well for proteins with low sequence similarity: structurally similar proteins can have sequence identities below 10%, far too low to propose any structural similarity on the basis of sequence comparison (Orengo & Taylor, 1996). Several research groups have addressed the problem of finding spatial motifs by using computational geometry/computer vision approaches. From the geometric point of view, a protein can be modeled as a set of points in R^3, and the problem of (pairwise) spatial motif finding can be formalized as that of finding the Largest Common Point (LCP) set (Akutsu et al. 1997). Many variations of this problem have been explored, including the approximate LCP problem (Chakraborty et al. 1999, Indyk et al. 1999) and LCP-α (finding a sufficiently large common point set of two sets of points, but not necessarily the maximal one) (Finn et al. 1997). Applying frequent subgraph mining techniques to find patterns in a group of proteins is a non-trivial task. The total number of frequent subgraphs for a set of graphs grows exponentially as the average graph size increases, as graphs become denser, as the number of node and edge labels decreases, and as the size of the recurring subgraphs increases (Huan et al. 2003). For instance, for a moderate protein dataset (about 100 proteins with an average of 200 residues per protein), the total number of frequent subgraphs can be extremely high (>> one million).
Since the underlying operation of subgraph isomorphism testing is NP-complete, it is critical to minimize the number of frequent subgraphs that must be analyzed. In order to apply the graph-based spatial motif identification method to proteins, we have developed a novel information theoretic model called coherent subgraphs. A graph G is coherent if it is strongly correlated with every sufficiently large subgraph embedded in it. As discussed in the following parts of this report, coherent subgraphs capture discriminative features and afford high accuracy of protein structural classification.
1.2 Related Work
Finding patterns in graphs has long been a topic of interest in the data mining/machine learning community. For instance, Inductive Logic Programming (ILP) has been widely used to find patterns in graph datasets (Dehaspe 1998). However, ILP is not designed for large databases. Other methods have focused on approximation techniques such as SUBDUE (Holder 1994) or heuristics such as greedy algorithms (Yoshida and Motoda, 1995).
Several algorithms have been developed in the data mining community to find all frequent subgraphs of a group of general graphs (Kuramochi and Karypis 2001, Yan and Han 2002, Huan et al. 2003). These techniques have been successfully applied in cheminformatics, where compounds are modeled by undirected graphs. Recurring substructures in a group of chemicals with similar activity are identified by finding frequent subgraphs in their related graphical representations. The recurring substructures can implicate chemical features responsible for compounds' biological activities (Deshpande et al. 2002). Recent subgraph mining algorithms can be roughly classified into two categories. Algorithms in the first category use a level-wise search scheme like Apriori (Agrawal and Srikant, 1994) to enumerate the recurring subgraphs. Examples of such algorithms include AGM (Inokuchi et al. 2000) and FSG (Kuramochi and Karypis 2001). Instead of performing a level-wise search, algorithms in the second category use a depth-first enumeration of frequent subgraphs (Yan and Han 2002, Huan et al. 2003). A depth-first search usually has better memory utilization and thus better performance. As reported by Yan and Han (2002), a depth-first search can outperform FSG, the current state-of-the-art level-wise search scheme, by an order of magnitude overall. All of the above methods rely on a single threshold to qualify interesting patterns. Herein, we propose the coherent subgraph model, which uses a statistical metric to qualify interesting patterns. This leads to computationally more efficient and more accurate classification. The remainder of the paper is organized as follows. Section 2 presents a formal basis for the coherent subgraph mining problem.
This includes the definition of the labeled graph and labeled graph database (Section 2.1), the canonical representation of graphs (Section 2.2), the coherent subgraph mining problem, and our algorithm for efficient coherent subgraph mining (Section 2.3). Section 3 presents the results of an experimental study to classify protein structural families using the coherent subgraph mining approach and a case study of identifying fingerprints in the family of serine proteases. Finally, Section 4 summarizes our conclusions and discusses future challenges.
2 Methodology
2.1 Labeled Graph
We define a labeled graph G as a four-element tuple G = (V, E, Σ, l), where V is the set of nodes of G and E ⊆ V × V is the set of undirected edges of G. Σ is a set of labels, and the labeling function l: V ∪ E → Σ maps nodes and edges in G to their labels. The same label may appear on multiple nodes or on multiple edges, but we require that the set of edge labels and the set of node labels are disjoint. For our purposes we assume that there is a total order ≥ associated with the label set Σ.
A labeled graph G = (V, E, Σ, l) is isomorphic to another graph G' = (V', E', Σ', l') iff there is a bijection f: V → V' such that ∀ u ∈ V, l(u) = l'(f(u)), and ∀ u, v ∈ V, ((u, v) ∈ E ⇔ (f(u), f(v)) ∈ E') ∧ (l(u, v) = l'(f(u), f(v))). The bijection f denotes an isomorphism between G and G'. A labeled graph G = (V, E, Σ, l) is an induced subgraph of a graph G' = (V', E', Σ', l') iff V ⊆ V', E ⊆ E', ∀ u, v ∈ V, ((u, v) ∈ E' ⇒ (u, v) ∈ E), ∀ u ∈ V, l(u) = l'(u), and ∀ (u, v) ∈ E, l(u, v) = l'(u, v). A labeled graph G is induced subgraph isomorphic to a labeled graph G', denoted by G ⊑ G', iff there exists an induced subgraph G'' of G' such that G is isomorphic to G''. Examples of labeled graphs, induced subgraph isomorphism, and frequent induced subgraphs are presented in Figure 1.
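The induced subgraph isomorphism test defined above can be sketched by brute force for tiny graphs; the graphs and labels below are hypothetical (not the exact graphs of Figure 1), and the factorial-time search is illustrative only, since the test is NP-complete in general.

```python
from itertools import permutations

# A graph is (node_labels, edge_labels): node_labels maps node id -> label,
# edge_labels maps frozenset({u, v}) -> label. Absent keys mean "no edge".

def induced_subgraph_isomorphic(g, gp):
    """Brute-force test of G ⊑ G' (G induced-subgraph-isomorphic to G')."""
    nodes, edges = g
    nodes_p, edges_p = gp
    for mapping in permutations(nodes_p, len(nodes)):
        f = dict(zip(nodes, mapping))
        # Node labels must agree under the candidate mapping f.
        if any(nodes[u] != nodes_p[f[u]] for u in nodes):
            continue
        # Induced: edge presence AND edge labels must agree for every pair.
        pairs = [(u, v) for u in nodes for v in nodes if u < v]
        if all(edges.get(frozenset((u, v))) ==
               edges_p.get(frozenset((f[u], f[v]))) for u, v in pairs):
            return True
    return False

# Hypothetical example graphs with the paper's label alphabet.
P = ({"p1": "a", "p2": "b", "p3": "c"},
     {frozenset(("p1", "p2")): "y", frozenset(("p1", "p3")): "x"})
Q = ({"q1": "a", "q2": "b"}, {frozenset(("q1", "q2")): "y"})
print(induced_subgraph_isomorphic(Q, P))  # True
```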
Figure 1. (a) Examples of three labeled graphs (referred to as a graph database) and an induced subgraph isomorphism. The labels of the nodes are specified within the circles and the labels of the edges are specified along the edges. We assume the order a > b > c > d > x > y > 0 throughout this paper. The mapping q1 → p2, q2 → p1, q3 → p3 represents an induced subgraph isomorphism from graph Q to P. (b) All the frequent induced subgraphs with minSupport set to 2/3 for the graph database presented in (a).
Given a set of graphs GD (referred to as a graph database), the support of a graph G, denoted by sup_G, is defined as the fraction of graphs in GD which embed the subgraph G. Given a threshold t (0 < t ≤ 1) (denoted minSupport), we define G to be frequent iff sup_G is at least t. All the frequent induced subgraphs in the graph database GD presented in Figure 1(a) (with minSupport 2/3) are presented in Figure 1(b).
Throughout this paper, we use the term subgraph to denote an induced subgraph unless stated otherwise.
2.2 Canonical Representation of Graphs
We represent every graph G by an adjacency matrix M. Slightly differently from the adjacency matrix used for an unlabeled graph (Cormen et al., 2001), every diagonal entry of M represents a node in G and is filled with the label of the node. Every off-diagonal entry corresponds to a pair of nodes, and is filled with the edge label if there is an edge between these two nodes in G, or with zero if there is no edge. Given an n × n adjacency matrix M of a graph with n nodes, we define the code of M, denoted code(M), as the sequence of lower triangular entries of M (including the diagonal entries) in the order M_{1,1} M_{2,1} M_{2,2} ... M_{n,1} M_{n,2} ... M_{n,n-1} M_{n,n}, where M_{i,j} represents the entry at the ith row and jth column of M. The standard lexicographic ordering of sequences defines a total order on codes. For example, code "ayb" is greater than code "byb" since the first symbol in string "ayb" is greater than the first symbol in string "byb" (we use the order a > b > c > d > x > y > 0). For a graph G, we define the Canonical Adjacency Matrix (CAM) of G as the adjacency matrix that produces the maximal code among all adjacency matrices of G. Interested readers might verify that the adjacency matrix M1 in Figure 2 is the CAM of the graph P shown in Figure 1.
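A brute-force sketch of code(M) and CAM computation for a tiny labeled graph, under the label order assumed above; the example graph and its labels are hypothetical, and the factorial-time search over node permutations is only feasible for toy inputs.

```python
from itertools import permutations

# Label order assumed in the text: a > b > c > d > x > y > 0.
ORDER = {c: i for i, c in enumerate("0yxdcba")}  # '0' smallest, 'a' largest

def code(matrix):
    """code(M): lower-triangular entries, row by row, diagonal included."""
    n = len(matrix)
    return "".join(matrix[i][j] for i in range(n) for j in range(i + 1))

def cam_code(node_labels, edge_labels):
    """Maximal code over all adjacency matrices (node orderings) of G."""
    n = len(node_labels)
    best = None
    for perm in permutations(range(n)):
        m = [["0"] * n for _ in range(n)]
        for i in range(n):
            m[i][i] = node_labels[perm[i]]
        for (i, j), lab in edge_labels.items():
            a, b = perm.index(i), perm.index(j)
            m[a][b] = m[b][a] = lab
        c = code(m)
        if best is None or [ORDER[ch] for ch in c] > [ORDER[ch] for ch in best]:
            best = c
    return best

# Hypothetical 3-node graph: node labels a, b, c; edges a-y-b and a-x-c.
print(cam_code(["a", "b", "c"], {(0, 1): "y", (0, 2): "x"}))  # "axcy0b"
```

Because x > y in the assumed order, the canonical matrix places the x-labeled edge earlier in the lower triangle, giving the code "axcy0b" rather than "aybx0c".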
Figure 2. Three examples of adjacency matrices. After applying the total ordering, we have code(M1) = "aybyxb0yxc00y0d" > code(M2) = "aybyxb00yd0yx0c" > code(M3) = "bxby0dxy0cyy00a".
Given an n × n matrix N and an m × m matrix M, we define N to be the maximal proper submatrix (MP submatrix for short) of M iff n = m − 1 and N_{i,j} = M_{i,j} (0 < i, j ≤ n). One of the nice properties of the canonical form we use (as compared to the one used in Inokuchi et al. 2000 and Kuramochi et al. 2001) is that, given a graph database GD, all the frequent subgraphs (represented by their CAMs) can be organized as a rooted tree. This tree is referred to as the CAM Tree of GD and is formally described as follows:
- The root of the tree is the empty matrix;
- Each node in the tree is a distinct frequent connected subgraph of GD, represented by its CAM;
- For a given non-root node (with CAM M), its parent is the graph represented by the MP submatrix of M.
Figure 3. Tree organization of all the frequent subgraphs of the graph database shown in Figure 1 (a)
2.3 Finding Patterns from a Labeled Graph Database
As mentioned earlier, the subgraph mining of protein databases presents a significant challenge because protein graphs are large and dense, resulting in an overwhelmingly large number of possible subgraphs (Huan et al. 2003). In order to select important features from the huge list of subgraphs, we have proposed a subgraph mining model based on mutual information, as explained below.
2.3.1 Mutual Information and Coherent Induced Subgraphs
We define a random variable X_G for a subgraph G in a graph database GD as follows: X_G = 1 with probability sup_G, and X_G = 0 with probability 1 − sup_G. Given a graph G and its subgraph G', we define the mutual information I(G, G') as follows:
I(G, G') = Σ_{X_G, X_G'} P(X_G, X_G') log ( P(X_G, X_G') / (P(X_G) P(X_G')) ),
where P(X_G, X_G') is the (empirical) joint probability distribution of (X_G, X_G'), which is defined as follows: P(X_G, X_G') = sup_G if X_G = 1 and X_G' = 1; 0 if X_G = 1 and X_G' = 0; sup_G' − sup_G if X_G = 0 and X_G' = 1; and 1 − sup_G' otherwise.
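Because every occurrence of G contains its subgraph G', the joint distribution above depends only on the two supports, so I(G, G') can be computed directly from them. A sketch (using base-2 logarithms, a choice the text does not fix):

```python
import math

def mutual_information(sup_g, sup_gp):
    """I(G, G') from sup_G and sup_G', where G' is a subgraph of G,
    so sup_gp >= sup_g and the (X_G=1, X_G'=0) cell has probability 0."""
    joint = {
        (1, 1): sup_g,
        (1, 0): 0.0,
        (0, 1): sup_gp - sup_g,
        (0, 0): 1.0 - sup_gp,
    }
    px = {1: sup_g, 0: 1.0 - sup_g}
    py = {1: sup_gp, 0: 1.0 - sup_gp}
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# If G' occurs exactly where G occurs, the two indicators are identical:
print(round(mutual_information(0.5, 0.5), 4))  # 1.0 bit
```

When sup_G' > sup_G the correlation weakens and I(G, G') drops toward 0, which is exactly what the k-coherence threshold t screens for.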
Given a threshold t (t > 0) and a positive integer k, a graph G is k-coherent iff ∀ G' ⊆ G with |G'| ≥ k, I(G, G') ≥ t, where |G'| denotes the number of nodes in G'. The Coherent Subgraph Mining problem is to find all the k-coherent subgraphs in a graph database, given a mutual information threshold t (t > 0) and a positive integer k. Our algorithm for mining coherent subgraphs relies on the following two well-known properties (Tan et al. 2002):
Theorem. For graphs P ⊑ Q ⊑ G, we have the inequalities I(P, G) ≤ I(P, Q) and I(P, G) ≤ I(Q, G).
The first inequality implies that every subgraph (with size ≥ k) of a k-coherent graph is itself k-coherent. This property enables us to integrate the k-coherent subgraph search into tree-based subgraph enumeration using available enumeration techniques (Yan and Han 2002, Huan et al. 2003). The second inequality suggests that, in order to tell whether a graph G is k-coherent or not, we only need to check all k-node subgraphs of G. This simplifies the search. In the following section, we discuss how to enumerate all connected induced subgraphs from a graph database. This work is based on the algebraic graph framework (Huan et al. 2003) for enumerating all subgraphs (not just induced subgraphs) from a graph database.
2.3.2 Coherent Subgraph Mining Algorithm

CSM
  input: a graph database GD, a mutual information threshold t (0 < t ≤ 1) and a positive integer k
  output: the set S of all of GD's coherent induced subgraphs
  P ← {all coherent subgraphs with size k in GD}
  S ← ∅
  CSM-Explore(P, S, t, k)

CSM-Explore
  input: a CAM list P, a mutual information threshold t (0 < t ≤ 1), a positive integer k, and a set S of coherent connected subgraphs' CAMs
  output: the set S containing the CAMs of all coherent subgraphs found so far
  for each X ∈ P:
    S ← S ∪ {X}
    C ← {Y | Y is a CAM and X is the MP submatrix of Y}
    remove non-k-coherent element(s) from C
    CSM-Explore(C, S, t, k)
  end
3 Experimental Study
3.1 Implementation and Test Platform
The coherent subgraph mining algorithm is implemented in the C++ programming language and compiled using g++ with -O3 optimization. The tests were performed using a single processor of a 2.0 GHz Pentium PC with 2 GB memory, running RedHat Linux 7.3. We used LIBSVM for protein family classification (further discussed in Section 3.4); the LIBSVM executable was downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
3.2 Protein Representation as a Labeled Graph
We model a protein by an undirected graph in which each node corresponds to an amino acid residue in the protein, with the residue type as the label of the node. We introduce a "peptide" edge between two residues X and Y if there is a peptide bond between X and Y, and a "proximity" edge if the distance between the two associated Cα atoms of X and Y is below a certain threshold (10 Å in our study) and there is no peptide bond between X and Y.^a
3.3 Datasets and Coherent Subgraph Mining
Three protein families from the SCOP database (Murzin et al., 1995) were used to evaluate the performance of the proposed algorithm under a binary (pairwise) classification scheme. SCOP is a domain expert maintained database, which hierarchically classifies proteins at five levels: Class, Fold, Superfamily, Family and individual proteins. The SCOP families included the Nuclear receptor ligand-binding domain (NRLB) family from the all-alpha proteins class, the Prokaryotic serine protease (PSP) family from the all-beta proteins class, and the Eukaryotic serine protease (ESP) family from the same class. Three datasets for the pairwise comparison and classification of the above families were then constructed: C1, including the NRLB and PSP families; C2, including the ESP and PSP families; and C3, including both eukaryotic and prokaryotic serine proteases (SP) and a random selection of 50 unrelated proteins (RP).
All the proteins were selected from the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html) with less than 60% sequence homology (resolution = 3.0, R factor = 1.0) in order to remove redundant sequences from the datasets. These three datasets are further summarized in Table 1. For each of the datasets, we ran the coherent subgraph identification algorithm. Thresholds ranging from 0.5 to 0.25 were tested; however, we only
Note that this graph representation provides a lot of flexibility for future studies, e.g. using a smaller number of residue classes or using additional edge labels.
report the results with threshold 0.3, which gave the best classification accuracy in our experiments.
3.4 Pair-wise Protein Classification Using Support Vector Machines (SVM)
Given a total of n coherent subgraphs f1, f2, ..., fn, we represent each protein G in a dataset as an n-element vector V = (v1, v2, ..., vn) in the feature space, where vi is the total number of distinct occurrences of the subgraph fi in G (zero if not present). We build the classification models using the SVM method (Vapnik 1998). There are several advantages of using SVM for the classification task in our context: 1) SVM is designed to handle sparse high-dimensional datasets (there are many features in the dataset and each feature may only occur in a small set of samples), and 2) there is a set of kernel functions (such as linear, polynomial and radial basis) we could choose from, depending on the properties of the dataset. Table 1 summarizes the results of the three classification experiments and the average five-fold cross-validation total classification accuracy [i.e., (TP + TN)/N, where TP stands for true positives, TN stands for true negatives, and N is the total number of testing samples]. In order to address the problem of possible over-fitting in the training phase, we created artificial datasets with exactly the same attributes but randomly permuted class labels. This is typically referred to as the Y-randomization test. The classification accuracy for the randomized datasets was significantly lower than for the original datasets (data not shown) and hence we concluded that there is no evidence of over-fitting in our models.
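The classification setup can be sketched as follows. This is our own illustration on fabricated toy data, using scikit-learn's SVC (which wraps LIBSVM) rather than the libsvm command-line tools the paper used; each row of the feature matrix stands for a protein and each column for the occurrence count of one mined subgraph.

```python
# Sketch of the classification setup described above, on fabricated toy counts.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_proteins, n_features = 40, 200

# Toy counts: class-1 proteins tend to contain the first 20 subgraphs.
X = rng.poisson(0.2, size=(n_proteins, n_features)).astype(float)
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :20] += rng.poisson(2.0, size=(20, 20))

clf = SVC(kernel="linear", C=1.0)              # C-SVM with linear kernel
acc = cross_val_score(clf, X, y, cv=5).mean()  # five-fold cross-validation

# Y-randomization control: permute labels and expect near-chance accuracy.
y_perm = rng.permutation(y)
acc_perm = cross_val_score(clf, X, y_perm, cv=5).mean()
print(acc, acc_perm)
```

On such clearly separable toy data the cross-validated accuracy is near 1, while the permuted-label control drops toward chance, mirroring the Y-randomization argument above.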
Dataset   Class A   Total # Proteins   Class B   Total # Proteins   Features   Time (sec.)   Accuracy (%)
C1        PSP       9                  NRLB      13                 40274      240           96
C2        PSP       9                  ESP       35                 34697      450           93
C3        SP        44                 RP        50                 42265      872           95
Table 1. Accuracy of classification tasks C1, C2, C3. We used the C-SVM classification model with the linear kernel and left the other parameters at their default values. The first columns give basic information about each dataset. SP - serine proteases; PSP - prokaryotic SP; ESP - eukaryotic SP; NRLB - nuclear receptor ligand-binding proteins; RP - random proteins. The Features column records the total number of features mined by CSM and the Time column records how much CPU time was spent on the mining task. The last column gives the five-fold cross-validation accuracy.
3.5 Identification of Fingerprints for the Serine Protease Family Features found for the task C3 in Table 1 were analyzed to test the ability of the CSM method to identify recurrent sequence-structure motifs common to particular protein families; we used serine proteases as a test case. For every coherent subgraph, we can easily define an underlying elementary sequence motif similar to Prosite patterns as:
M = { AAp, d1, AAq, d2, AAr, d3, AAs }, where AA is the residue type, p, q, r and s are residue numbers in a protein sequence, and d1 = q-p-1, d2 = r-q-1, d3 = s-r-1, i.e., the sequence separation distances. We selected a subset of discriminative features from the mined features such that every feature occurs in at least 80% of the proteins in the SP family and in less than 10% of the proteins of the RP family. For each occurrence of such features, the sequence distances were analyzed. Features with conserved sequence separation were used to generate consensus sequence motifs. We found that some of our spatial motifs correspond to serine protease sequence signatures from the Prosite database. An example (G1) of such a spatial motif and its corresponding sequence motif C-x(12)-A-x-H-C (where x is any residue and the number in parentheses is the length of the sequence separation) is shown in Fig. 4. This example demonstrates that the spatial motifs found by subgraph mining can capture features that correspond to motifs with known utility in identifying protein families. The spatial motif G2, which was also highly discriminative, occurs in SP proteins at a variety of positions, with varying separations between the residues. Such patterns seem to defy a sequence-level description, raising the possibility that spatial motifs can capture features beyond those described at the sequence level.
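Turning one occurrence of a four-residue spatial motif into the elementary sequence pattern M can be sketched as follows (an illustrative construction; the `occurrence` format, residue type plus sequence position in order, is our assumption):

```python
# Sketch of turning one occurrence of a four-residue spatial motif into the
# elementary sequence pattern M defined above. `occurrence` is a hypothetical
# format: the (residue_type, position) pairs in sequence order.

def sequence_motif(occurrence):
    """Return residue types and separation distances d_i = next - prev - 1."""
    types = [t for t, _ in occurrence]
    pos = [i for _, i in occurrence]
    seps = [pos[k + 1] - pos[k] - 1 for k in range(len(pos) - 1)]
    return types, seps

def prosite_like(types, seps):
    """Render as a Prosite-style string, e.g. C-x(12)-A-x-H-C."""
    parts = [types[0]]
    for t, d in zip(types[1:], seps):
        if d == 1:
            parts.append("x")
        elif d > 1:
            parts.append("x(%d)" % d)
        parts.append(t)
    return "-".join(parts)

# The G1 example from the text: C, A, H, C with separations 12, 1, 0.
occ = [("C", 10), ("A", 23), ("H", 25), ("C", 26)]
types, seps = sequence_motif(occ)
print(prosite_like(types, seps))   # -> C-x(12)-A-x-H-C
```

The residue positions here are invented; any occurrence with the same separations yields the same pattern, which is exactly what "conserved sequence separation" exploits.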
Figure 4: Two discriminative features that appear very frequently in the SP family but are infrequent in the RP family. Left: the graphical representation of the two subgraphs (with the residue type specified within each circle). A dotted line represents a proximity edge and a solid line represents a peptide edge. Right: the 3D occurrences of G1 (right) and G2 (left) within the backbone of one of the serine proteases, Human Kallikrein 6 (Hk6).
4
Conclusions and Future Work
We have developed a novel coherent subgraph mining approach and applied it to the problem of protein structural annotation and classification. As a proof of concept, characteristic subgraphs have been identified for three protein families from the SCOP database, i.e., eukaryotic and prokaryotic serine proteases and nuclear receptor ligand-binding proteins. Using a Support Vector Machine binary
classification algorithm, we have demonstrated that coherent subgraphs can serve as unique structural family identifiers that discriminate one family from another with high accuracy. We have also shown that some of the subgraphs can be transformed into sequence patterns similar to Prosite motifs, allowing their use in the annotation of protein sequences. The coherent subgraph mining method advanced in this paper affords a novel automated approach to protein structural classification and annotation, including possible annotation of orphan protein structures and sequences resulting from genome sequencing projects. We are currently expanding our research to include all protein structural families and to employ multi-family classification algorithms to afford a global classification of the entire protein databank.
Acknowledgments The authors would like to thank Prof. Jack Snoeyink and Deepak Bandyopadhyay for many helpful discussions.
References
1. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. of the 20th Int. Conf. on Very Large Databases (VLDB), 487-499 (1994)
2. T. Akutsu, H. Tamaki and T. Tokuyama, "Distribution of distances and triangles in a point set and algorithms for computing the largest common point sets", Proc. 13th Annual ACM Symp. on Computational Geometry, 314-323 (1997)
3. S. Chakraborty and S. Biswas, "Approximation Algorithms for 3-D Common Substructure Identification in Drug and Protein Molecules", Workshop on Algorithms and Data Structures, 253-264 (1999)
4. T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algorithms, (MIT Press, 2001)
5. L. Dehaspe, H. Toivonen and R. D. King, "Finding frequent substructures in chemical compounds", Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, 30-36 (1998)
6. M. Deshpande, M. Kuramochi and G. Karypis, "Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds", Proc. of the 8th International Conference on Knowledge Discovery and Data Mining (2002)
7. P. W. Finn, L. E. Kavraki, J. Latombe, R. Motwani, C. R. Shelton, S. Venkatasubramanian and A. Yao, "RAPID: Randomized Pharmacophore Identification for Drug Design", Symposium on Computational Geometry, 324-333 (1997)
8. D. Fischer, H. Wolfson, S. L. Lin and R. Nussinov, "Three-dimensional, sequence order-independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding", Protein Sci. 3, 769-778 (1994)
9. S. Henikoff, J. Henikoff and S. Pietrokovski, "Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations", Bioinformatics, 15(6):471-9 (1999)
10. K. Hofmann, P. Bucher, L. Falquet and A. Bairoch, "The PROSITE database, its status in 1999", Nucleic Acids Res. 27(1):215-9 (1999)
11. L. B. Holder, D. J. Cook and S. Djoko, "Substructure discovery in the SUBDUE system", Proc. AAAI'94 Workshop on Knowledge Discovery in Databases, 169-180 (1994)
12. J. Huan, W. Wang and J. Prins, "Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism", Proc. of the 3rd International Conference on Data Mining (2003)
13. P. Indyk, R. Motwani and S. Venkatasubramanian, "Geometric Matching Under Noise: Combinatorial Bounds and Algorithms", ACM Symposium on Discrete Algorithms (1999)
14. A. Inokuchi, T. Washio and H. Motoda, "An Apriori-based algorithm for mining frequent substructures from graph data", Proc. of the 4th European Conf. on Principles and Practices of Knowledge Discovery in Databases, 13-23 (2000)
15. G. J. Kleywegt, "Recognition of spatial motifs in protein structures", J. Mol. Biol. 285(4):1887-97 (1999)
16. M. Kuramochi and G. Karypis, "Frequent subgraph discovery", Proc. of the 1st International Conference on Data Mining (2001)
17. A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures", J. Mol. Biol. 247, 536-540 (1995)
18. C. A. Orengo and W. R. Taylor, "SSAP: Sequential Structure Alignment Program for Protein Structure Comparison", Methods in Enzymol. 266: 617-643 (1996)
19. P. Tan, V. Kumar and J. Srivastava, "Selecting the right interestingness measure for association patterns", Proc. of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (2002)
20. V. Vapnik, Statistical Learning Theory, (John Wiley, 1998)
21. X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining", Proc. of the 2nd International Conference on Data Mining (2002)
22. K. Yoshida and H. Motoda, "CLIP: Concept learning from inference patterns", Artificial Intelligence, 75(1):63-92 (1995)
IDENTIFYING GOOD PREDICTIONS OF RNA SECONDARY STRUCTURE
M. E. NEBEL
Johann Wolfgang Goethe-Universität, Institut für Informatik, 60325 Frankfurt am Main, Germany
Abstract
Predicting the secondary structure of RNA molecules from the knowledge of the primary structure (the sequence of bases) is still a challenging task. There are algorithms that provide good results, e.g. based on the search for an energetically optimal configuration. However, the output of such algorithms does not always give the real folding of the molecule, and therefore a feature to judge the reliability of the prediction would be appreciated. In this paper we present results on the expected structural behavior of LSU rRNA derived using a stochastic context-free grammar and generating functions. We show how these results can be used to judge the predictions made for LSU rRNA by any algorithm. In this way it is possible to identify those predictions which are close to the natural folding of the molecule with a probability of success of 97%.
1
Introduction and Basic Definitions
A ribonucleic acid (RNA) molecule consists of a chain of nucleotides (of which there are four different types). Each nucleotide consists of a base, a phosphate group and a sugar group. The various types of nucleotides differ only in the base involved; there are four choices for the base, namely adenine (A), cytosine (C), guanine (G) and uracil (U). The specific sequence of the bases along the chain is called the primary structure of the molecule. It is usually modeled as a word over the alphabet {A, C, G, U}. Through the creation of hydrogen bonds, the complementary bases A and U (resp. C and G) form stable base pairs with each other. Additionally, there is the weaker G-U pair, where the bases bind in a skewed fashion. Due to these base pairs, the linear chain is folded into a three-dimensional conformation called the tertiary structure of the molecule. For some types of RNA molecules, like transfer RNA, the tertiary structure is highly connected with the function of the molecule. Since experimental approaches which allow the discovery of the tertiary structure are quite expensive, biologists are looking for methods to predict the tertiary structure from the knowledge of the primary structure. It is common practice to consider the simplified secondary structure of the molecule, where we restrict the possible base pairs such
that only planar structures occur. So far, several algorithms for the prediction of secondary structures using rather different ideas have been presented. However, the output of such algorithms cannot be assumed to be error-free, so they might predict a wrong folding of a molecule. A tool to quantify the reliability of a prediction would therefore be helpful. In this paper we propose to use a statistical filter which compares structural parameters of the predicted molecule with those of an expected molecule of the same type and the same size (number of nucleotides/bases), and we show that such a filter offers good results. In the literature one finds many different results dealing with the expected structure of RNA molecules. Waterman [6] gave the first formal framework for secondary structures. Later on, some authors considered the combinatorial and the Bernoulli model of RNA secondary structures (where the molecule is modeled as a certain kind of planar graph) and derived numerous results such as the average size and number of hairpins and bulges, the number of ladders, the expected order of a structure and its distribution, or the distribution of unpaired bases (see [8,9,10,11]). In [11] it was pointed out (by comparison to real-world data) that both models are rather unrealistic and thus the corresponding results can hardly be used for our purposes. In this paper we sketch one possible way to construct a realistic model for RNA secondary structures which allows us to derive the corresponding expectations, variances and all other higher moments to be used according to our ideas. In the rest of this paper we assume that the reader is familiar with the basic notions of formal language theory such as context-free grammars, derivation trees, etc. A helpful introduction to the theory can be found in [12].
We refer to [13] for a related introduction. Besides modeling a secondary structure as a planar graph, a slightly different approach is to model it using stochastic context-free grammars, as proposed in [14]. A stochastic context-free grammar (SCFG) is a 5-tuple G = (I, T, R, S, P), where I (resp. T) is an alphabet (finite set) of intermediate (resp. terminal) symbols (I and T are disjoint), S ∈ I is a distinguished intermediate symbol called the axiom, R ⊆ I × (I ∪ T)* is a finite set of production rules and P is a mapping from R to [0,1] such that each rule f ∈ R is equipped with a probability p_f := P(f). The probabilities are chosen in such a way that for all A ∈ I the equality ∑_{f ∈ R} p_f δ_{Q(f),A} = 1 holds. Here δ is Kronecker's delta and Q(f) denotes the source of the production f, i.e. the first component A of a production rule (A, α) ∈ R. In the sequel we will write p_f : A → α instead of f = (A, α) ∈ R, p_f = P(f). In information theory SCFGs were introduced as a device for producing a language together with a corresponding
probability distribution (see e.g. [15,16]). Words are generated in the same way as for usual context-free grammars; the product of the probabilities of the production rules used provides the probability of the generated word. Note that this does not always provide a probability distribution for the language. However, there are sufficient conditions which allow us to check whether or not a given grammar provides a distribution. At first, researchers were interested in parameters like the moments of the word and derivation lengths [17] or the moments of certain subwords. Furthermore, they looked for the existence of standard forms for SCFGs, such as Chomsky normal form or Greibach normal form, in order to simplify proofs. Some authors used the ideas of Schützenberger [20] to translate the corresponding grammars into probability generating functions to derive their results. However, languages resp. grammars were not used to model any sort of combinatorial object besides languages themselves, and therefore the question of how to determine probabilities was not asked. In computational biology, SCFGs are used as a model for RNA secondary structures. In contrast to information theory, not only the words generated by the grammar are used, but also the corresponding derivation trees are taken into consideration: a word generated by the grammar is identified with the primary structure of an RNA molecule, and its derivation tree is considered as the related secondary structure. Note that there exists a one-to-one correspondence between the planar graphs used by Waterman as a model for RNA secondary structures and a certain kind of unary/binary trees (see e.g. [10]). Thus the major impact of using SCFGs is given by the way in which probabilities are generated. Since a single primary structure can have numerous secondary structures, an ambiguous SCFG is the right choice. The probabilities of such a grammar can be trained from a database.
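The SCFG definition above can be made concrete with a small sketch (an illustrative data structure, not code from the paper): productions are stored per intermediate symbol, the normalization condition that the probabilities of all rules with the same source sum to 1 is checked, and the probability of a word is the product of the probabilities of the rules used in its derivation.

```python
from math import isclose

# Minimal sketch of an SCFG as defined above: each production A -> alpha
# carries a probability, and for every intermediate symbol A the
# probabilities of all productions with source A must sum to 1.
# The toy grammar here generates runs of '|' (a single-stranded region).

grammar = {
    # source symbol: list of (right-hand side, probability)
    "C": [(("C", "|"), 0.8), ((), 0.2)],  # C -> C| with 0.8, C -> eps with 0.2
}

def is_normalized(g):
    return all(isclose(sum(p for _, p in rules), 1.0) for rules in g.values())

def word_probability(derivation):
    """Probability of a word = product of probabilities of the rules used."""
    prob = 1.0
    for p in derivation:
        prob *= p
    return prob

# Deriving "||" uses C -> C| twice and C -> eps once: 0.8 * 0.8 * 0.2
print(is_normalized(grammar), word_probability([0.8, 0.8, 0.2]))
```

As the text notes, the per-symbol normalization alone does not guarantee that the word probabilities sum to 1 over the whole language; this sketch only checks the local condition.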
The algorithms applied for this purpose are generalizations of the forward/backward algorithm used in the context of hidden Markov models [2,21] and are also applied in linguistics, where one usually works with ambiguous grammars, too. At the end of the training, the most probable derivation tree of a primary structure in the database equals the secondary structure given by the database. Applications were found in the prediction of RNA secondary structure, where the most probable derivation tree is assumed to be the secondary structure belonging to the primary structure processed by the algorithm. So far, no one has used these grammars to derive structural results, which in the case of an ambiguous grammar is obvious, since it is impossible to find any sense in such results. In Section 2 we provide the link between SCFGs and the mathematical research on RNA. We use non-ambiguous stochastic context-free grammars to model the secondary structures. This is done by disregarding the primary structure and representing the secondary structure as a certain kind of Motzkin language (i.e. a language over the alphabet {(, ), |} which encodes unary/binary trees equivalent to the secondary structure), which now is the language generated by the grammar. After training the SCFG, it is used to derive probability generating functions which enable us to draw quantitative conclusions about the expected shape of RNA secondary structures. These results will be the basis for our quantitative judgement of predictions. In order to train the grammar we derived a database of Motzkin words which correspond one-to-one to the secondary structures contained in the databases of Wuyts et al. We have also used the databases of Brown for RNase P sequences and of Sprinzl et al. for tRNA molecules; the corresponding results are not reported here due to lack of space.
2
The Expected Structure of rRNA Molecules
In this section we present our results concerning the expected structure of rRNA molecules, with only a few comments on how they were derived; technical details can be found in [25]. As described in the first section, we used an SCFG whose probabilities were trained on all entries of the database of Wuyts et al. in order to derive our results. This grammar can easily be translated into an equivalent probability generating function according to the ideas of Schützenberger [20]. From those generating functions we derived expected values for structural parameters of large subunit (LSU) ribosomal RNA molecules, such as the average number and length of hairpin-loops or the average degree of multiloops. The corresponding formulae are presented in Table 1, where each parameter is presented together with its expected asymptotic behavior, i.e. its expected behavior within a large (number of nucleotides) molecule. Note that we have investigated all the different substructures which must be distinguished in order to determine the total free energy of a molecule, which is necessary e.g. for certain prediction algorithms. Compared to all previous attempts to describe the structure of RNA quantitatively (see for instance [6,9,10,11,26]), the results presented here are the most realistic ones. This is in line with the positive experience of Knudsen et al. and of Eddy et al. with respect to the prediction of secondary structures based on trained SCFGs (resp. covariance models). The results in Table 1 should be considered as the structural behavior of an RNA molecule folded with respect to its energetic optimum. Therefore, they are of interest in themselves; for the first time we get some (mathematical) insight into how real secondary structures behave.
Besides the application which is the subject of this paper, the realistic modeling of the secondary structures gives rise to further applications like the following. First, we can use our results to provide bounds for the running time of algorithms working on secondary structures as their input. Second, when predicting a
Table 1: Expectations for different parameters of large subunit ribosomal RNA secondary structures. In all cases n is used to represent the total size of the molecule.

Parameter                                            Expectation
Number of hairpins                                   0.0226n
Length of a hairpin-loop                             7.3766
Number of bulges                                     0.0095n
Length of a bulge                                    1.5949
Number of ladders                                    0.0593n
Length of a ladder (counting the number of pairs)    4.1887
Number of interior loops                             0.0164n
Length of a single loop within an interior loop      3.8935
Number of multiloops                                 0.0106n
Degree of a multiloop                                4.1311
Length of a single loop within a multiloop           4.3686
Number of single stranded regions                    18.1679
Length of a single stranded region                   18.1353
secondary structure, our results may provide initial values for loop lengths etc. when searching for an optimal configuration, so that a faster convergence should be expected. We used the following grammar to derive the results in Table 1 (all capital letters are intermediate symbols):
f1 = S → SAC,    f2 = S → C,      f3 = C → C|,     f4 = C → ε,      f5 = A → (L),
f6 = L → (L),    f7 = L → M,      f8 = L → I,      f9 = L → |H,     f10 = L → (L)B|,
f11 = L → |B(L), f12 = B → B|,    f13 = B → ε,     f14 = H → H|,    f15 = H → ε,
f16 = I → |J(L)K|, f17 = J → J|,  f18 = J → ε,     f19 = K → K|,    f20 = K → ε,
f21 = M → U(L)U(L)N, f22 = N → U(L)N, f23 = N → U, f24 = U → U|,    f25 = U → ε.
The idea behind the grammar is the following: starting at the axiom S, a sentential form of the pattern CACAC...AC is generated, where each A stands for the starting point of a folded region and C represents a single stranded region. Applying production A → (L) produces the foundation of the folded region. From there the process has different choices. It may continue building up a ladder by applying L → (L). It might introduce a multiloop by the application of L → M or an interior loop by the application of L → I. A
Table 2: The probabilities for the productions of our grammar obtained from its training on a database of large subunit ribosomal RNA secondary structures
rule f   prob. p_f     rule f   prob. p_f     rule f   prob. p_f
f1       0.8628        f2       0.1372        f3       0.9477
f4       0.0523        f5       1.0000        f6       0.7612
f7       0.0402        f8       0.0662        f9       0.0941
f10      0.0207        f11      0.0176        f12      0.3730
f13      0.6270        f14      0.8644        f15      0.1356
f16      1.0000        f17      0.7401        f18      0.2599
f19      0.7461        f20      0.2539        f21      1.0000
f22      0.5149        f23      0.4851        f24      0.8137
f25      0.1863
hairpin-loop is produced by L → |H. Additionally, the grammar may introduce a bulge by the productions L → (L)B| resp. L → |B(L), where the two productions distinguish between a bulge at the 3' resp. 5' strand of the corresponding ladder. An interior loop is generated by the production I → |J(L)K|, where J and K are used to produce the loops. The multiloop is generated by the productions M → U(L)U(L)N, N → U(L)N and N → U, i.e. we have at least three single stranded regions represented by U; by additional applications of the production N → U(L)N the degree of the multiloop can be increased. The other production rules are used to generate unpaired regions in different contexts. We used different intermediate symbols in all cases because otherwise we would get an averaged length of the different regions instead of a distinguished length for each of the substructures considered. We first had to determine the probabilities for this grammar in order to derive the results in Table 1. We used a special parsing algorithm with all entries of the database as the input. Table 2 presents the resulting probabilities. Then the grammar was translated into a probability generating function from which our expectations were derived using Newton's polygon method and singularity analysis (details can be found in [25]). Table 3 compares the expected values according to our formulae to statistics computed from the database (archaea and bacteria data only). For this purpose we have set the parameter n to the average length of the structures used to compute the statistics. We observe that most parameters are described pretty well by our formulae (the root mean square deviation of the statistics compared to our formulae is 3.5260...), so it makes sense to use them according to our ideas.
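As a sanity check (our own illustration, not the paper's generating-function derivation), several constant entries of Table 1 can be recovered directly from the trained probabilities: a symbol with rules X → X| (probability p) and X → ε emits p/(1−p) unpaired bases on average, and a ladder grown by L → (L) with probability p6 has expected length 1/(1−p6).

```python
# Sanity check (an illustration, not the paper's derivation): simple geometric
# expectations recover several entries of Table 1 from the probabilities in
# Table 2, up to rounding of the published values.

p6 = 0.7612    # L -> (L)   (ladder growth)
p12 = 0.3730   # B -> B|    (bulge growth; L -> (L)B| contributes one '|')
p14 = 0.8644   # H -> H|    (hairpin-loop growth; L -> |H contributes one '|')

ladder_len = 1 / (1 - p6)            # approx. 4.1887 in Table 1
bulge_len = 1 + p12 / (1 - p12)      # approx. 1.5949 in Table 1
hairpin_len = 1 + p14 / (1 - p14)    # approx. 7.3766 in Table 1

print(round(ladder_len, 4), round(bulge_len, 4), round(hairpin_len, 4))
```

The n-dependent entries of Table 1 cannot be read off this simply; they are where the generating-function machinery is actually needed.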
Table 3: The average values computed statistically from the database compared to the values implied by the corresponding formulae in Table 1. All values were rounded to the second decimal place.

Parameter                                  Statistics   Formula   Quotient
number of hairpins                         51.76        52.02     99.49%
length of a hairpin-loop                   7.43         7.38      100.70%
number of bulges                           20.94        21.87     95.78%
length of a bulge                          1.59         1.59      99.88%
number of ladders                          130.94       136.50    95.92%
length of a ladder                         4.18         4.19      99.85%
number of interior loops                   36.25        37.75     96.02%
length of single loop in interior loop     3.89         3.89      99.98%
number of multiloops                       21.98        24.40     90.10%
degree of a multiloop                      4.06         4.13      98.31%
length of single loop in multiloop         4.80         4.37      109.96%
number of single stranded regions          7.44         18.17     40.97%
length of single stranded regions          15.62        18.14     86.15%

3
Identifying Good Predictions
In order to see whether or not our expectations for certain structural parameters of RNA secondary structure can be used for identifying good or bad predictions, we proceeded in the following way. First we used the RNAstructure software by Mathews, Zuker and Turner (version 3.71) in order to obtain predicted secondary structures for all sequences for archaea and bacteria in the database of Wuyts et al.; the default settings of the program were used. We decided to use those parameters for the judgement of the predictions for which, according to Table 3, the relative error of the value of the formula compared to the statistics computed from the database is at most 2%. Then the quality of the predictions was quantified as follows: for every prediction generated (for some sequences the software provides several predictions) we computed the number of hairpins x1, the average length of a hairpin-loop x2, the average length of a bulge x3, the average length of a ladder x4, the average length of a single loop in an interior loop x5 and the average degree of a multiloop x6. Furthermore we computed the corresponding values yi from our formulae, 1 ≤ i ≤ 6, setting n to the length of the sequence under consideration. Let z := (|x1 - y1|, ..., |x6 - y6|) denote the vector of the differences of these values (|.| denoting the absolute value) and let Z denote the set of all vectors z obtained by considering all predicted structures. In order to
endow every parameter with the same weight, every z ∈ Z was normalized by dividing each component by the maximal observed value for that component in Z. Finally, assuming that the resulting vectors are denoted by (v1, v2, ..., v6), the corresponding structure was ranked by
√( ∑_{1 ≤ i ≤ 6} v_i² )        (1)
Squares were used to amplify differences. This ranking must be considered as the distance of the structure under investigation to some sort of consensus structure implicitly provided by the expected values presented in Section 2. Therefore a small rank should imply a good prediction, while high ranks should disclose bad results of the prediction algorithm. In order to see whether this worked, we needed some notion of the similarity of structures. We chose the most simple but also most stringent one: two structures (the predicted structure and the corresponding structure in the database of Wuyts et al.) are compared position by position (using the ct-files), counting the number of bases which are bound to exactly the same counterpart in both files. This total number is divided by the length of the related primary structure. We call the resulting percentage the matching rate; a matching rate of 70% or larger is assumed to be a successful prediction. For the data of archaea and bacteria considered in our experiments^a, all structures with a matching rate greater than or equal to 70% were rated 3.54... or less. Additionally, only about 2.56% of all predictions had a rank of 3.54... or less, so that a rank of 3.54 or less implies a successful prediction with a probability close to 97%. Assuming a linear dependence between the matching rate of the predictions and the rank according to (1), an ideal ranking would possess a correlation coefficient of -1 when comparing the two. However, in our case we observed a correlation coefficient of -0.3645235338. Furthermore, when looking at the quantile-quantile plot which compares the distributions of ranking and matching rates, as shown in Figure 1, we observe a poor behavior especially for predictions with a matching rate between 55% and 65%. Note that an ideal ranking would result in a linear (diagonal) plot.
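The ranking pipeline just described can be sketched as follows (our own illustration with invented difference vectors; the function name is ours): take the absolute differences z_i = |x_i - y_i|, normalize each component by its maximum over all predictions, and rank by the root of the sum of squares.

```python
# Sketch of the ranking described above: normalize each component of the
# difference vectors by its maximum over all predictions, then take the
# square root of the sum of squares over the chosen components.
from math import sqrt

def ranks(z_vectors, components=None):
    """z_vectors: list of difference vectors z = (|x1-y1|, ..., |x6-y6|).
    components: 0-based indices entering the sum; None means all six."""
    dims = range(len(z_vectors[0]))
    maxima = [max(z[i] for z in z_vectors) or 1.0 for i in dims]
    idx = list(dims) if components is None else components
    out = []
    for z in z_vectors:
        v = [z[i] / maxima[i] for i in dims]      # per-component normalization
        out.append(sqrt(sum(v[i] ** 2 for i in idx)))
    return out

zs = [(2.0, 0.5, 1.0, 0.2, 0.1, 0.3),   # three hypothetical predictions
      (0.4, 0.1, 0.2, 0.1, 0.0, 0.1),
      (4.0, 1.0, 2.0, 0.4, 0.2, 0.6)]
r_all = ranks(zs)                           # ranking (1): all six parameters
r_sub = ranks(zs, components=[0, 2, 3, 4])  # 0-based analogue of i in {1,3,4,5}
print(r_all, r_sub)
```

The `components` argument anticipates the restricted ranking discussed below, where only the parameters positively correlated with the rank are kept.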
^a Note that the data of archaea and bacteria used for our experiments is a subset of the data used to train the grammar. However, since the grammar was trained on the entire database it was also trained on other families of rRNA, and thus good results with respect to our task should result from some sort of generalization.

Searching for an explanation of this rather poor correlation, we took a look at the correlations between the overall ranking according to (1) and the values of the different v_i, 1 ≤ i ≤ 6. The results can be found in Table 4.

Table 4: The correlation of a single v_i to the overall rank. Within the table each v_i is identified by the name of its associated parameter.

Parameter                                  Correlation
number of hairpins                         0.6575498439
length of a hairpin-loop                   -0.3432207906
length of a bulge                          0.4460590292
length of a ladder                         0.2158570276
length of single loop in interior loop     0.3850727833
degree of a multiloop                      -0.0844724840

One immediately notices that the (expected) length of a hairpin-loop and the (expected) degree of the multiloops are negatively correlated with the rank, i.e. they have a counterproductive effect on our ranking. Therefore we ran a second set of experiments, now using

√( ∑_{i ∈ {1,3,4,5}} v_i² )        (2)
as the rank of the prediction. The new filter assigns a rank of at most 1.87... to those predictions that have a matching rate of 70% or larger. Again, only about 2.56% of all predictions were ranked 1.87... or less; thus the new filter works with the same accuracy as the former one. But now we observe a correlation coefficient of -0.4745120689. Additionally, the quantile-quantile plot shown in Figure 2 is much closer to the diagonal, thus giving rise to a better judgement of the predictions, particularly for predictions with a matching rate between 55% and 65%. Note that the number of hairpins is the only parameter used in (2) which depends on the size of the structures and thus needs our methods based on SCFGs to be derived. All the other parameters could have been determined by simple statistical methods only. However, omitting v1 from the computations results in a worse accuracy and in a poor correlation coefficient of -0.24249...

4
4 Possible Improvements
Certainly the results reported in the previous section are only a first step towards a precise judgement of an algorithmic prediction of RNA secondary structure. However, the author believes this first step to be promising. There is potential for improving our approach in many directions. First, one might consider additional parameters, e.g., the order of a secondary structure introduced by Waterman [6]. In contrast to the parameters considered here, the order does not only take care of small parts of a secondary structure but it is
Figure 1: The quantile-quantile plot of the ranking according to (1) compared to the matching rate of the predicted secondary structures.
Figure 2: The quantile-quantile plot of the ranking according to (2) compared to the matching rate of the predicted secondary structures.
a sort of global parameter considering the balanced nesting depth of hairpins. Mathematical results for the expected order of a secondary structure which fit pretty well with the real-world behavior can be found in [11]. Second, it can be helpful to give different weights to the different parameters used when computing the rank of a structure. For instance, it seems reasonable to give a higher weight to those parameters which have a smaller (relative) variance than others, since these parameters must be assumed to be conserved more strongly; a deviating behavior with respect to them is therefore more unlikely than a deviating behavior with respect to the others. So far, the author has not been able to gather experience in this field, but it is a starting point for further research.

5 Conclusions
In this paper we have shown how results for the expected structural behavior of RNA secondary structures can be used in order to judge the quality of a prediction made by any algorithm. First experiences were gained by considering large subunit ribosomal RNA molecules. To judge a single predicted structure S it is necessary to compute the length n of the corresponding primary structure and the values observed within S for the four parameters attached to the v_i in (2). Then it is possible to compute the rank of S which, according to our experiments, provides information on the quality (matching rate)
of the prediction with high probability. The methods presented in [25], which were used to derive the key results for our methodology, i.e., expected values for structural parameters within a realistic model for the molecules, are not restricted to this family of RNA. So they might be used for other kinds of RNA as well. Furthermore, it should be possible to implement a corresponding set of routines using a computer algebra system like Maple such that the expectations needed in order to judge predictions for other kinds of RNA can be computed automatically. As a consequence, the ideas presented in this article may lead to the development of a new kind of software tool which supports the automated prediction of secondary structure with a posteriori information on the quality of the results. In the long run, these ideas might be transferred to other areas of structural genomics, e.g., the prediction of the three-dimensional structure of proteins.

Acknowledgements
I wish to thank Matthias Rupp for his support in writing the programs for the statistical analysis presented in Section 3 and for helpful suggestions.

References
1. S. R. Eddy and R. Durbin, Nucleic Acids Res. 22 (1994), 2079-2088.
2. B. Knudsen and J. Hein, Bioinformatics 15 (1999), 446-454.
3. R. Nussinov, G. Pieczenik, J. R. Griggs and D. J. Kleitman, SIAM Journal on Applied Mathematics 35 (1978), 68-82.
4. J. M. Pipas and J. E. McMahon, Proceedings of the National Academy of Sciences 72 (1975), 2017-2021.
5. D. Sankoff, Tenth Numerical Taxonomy Conference, Kansas, 1976.
6. M. S. Waterman, Advances in Mathematics Supplementary Studies 1 (1978), 167-212.
7. M. Zuker and P. Stiegler, Nucleic Acids Res. 9 (1981), 133-148.
8. W. Fontana, D. A. M. Konings, P. F. Stadler and P. Schuster, Biopolymers 33 (1993), 1389-1404.
9. I. L. Hofacker, P. Schuster and P. F. Stadler, Discrete Applied Mathematics 88 (1998), 207-237.
10. M. E. Nebel, Journal of Computational Biology 9 (2002), 541-573.
11. M. E. Nebel, Bulletin of Mathematical Biology, to appear.
12. J. E. Hopcroft, R. Motwani and J. D. Ullman, Addison Wesley, 2001.
13. D. Sankoff and J. Kruskal, CSLI Publications, 1999.
14. Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. C. Underwood and D. Haussler, Nucleic Acids Res. 22 (1994), 5112-5120.
15. T. L. Booth, IEEE Tenth Annual Symposium on Switching and Automata Theory, 1969.
16. U. Grenander, Tech. Rept., Division of Applied Mathematics, Brown University, 1967.
17. S. E. Hutchins, Information Sciences 4 (1972), 179-191.
18. H. Enomoto, T. Katayama and M. Okamoto, Systems Computer Controls 6 (1975), 1-8.
19. T. Huang and K. S. Fu, Information Sciences 3 (1971), 201-224.
20. N. Chomsky and M. P. Schützenberger, in Computer Programming and Formal Systems (P. Braffort and D. Hirschberg, eds.), North-Holland, Amsterdam, 1963, 118-161.
21. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Cambridge University Press.
22. J. Wuyts, P. De Rijk, Y. Van de Peer, T. Winkelmans and R. De Wachter, Nucleic Acids Res. 29 (2001), 175-177.
23. J. W. Brown, Nucleic Acids Res. 27 (1999), http://jwbrown.mbio.ncsu.edu/RNaseP/home.html.
24. M. Sprinzl, K. S. Vassilenko, J. Emmerich and F. Bauer (20 December, 1999), http://www.uni-bayreuth.de/departments/biochemie/trna/.
25. M. E. Nebel, technical report, http://boa.sads.informatik.uni-frankfurt.de:8000/nebel.html.
26. M. Régnier, Generating Functions in Computational Biology: a Survey, submitted.
27. E. Hille, Blaisdell Publishing Company, Waltham, 1962, 2 vol.
EXPLORING BIAS IN THE PROTEIN DATA BANK USING CONTRAST CLASSIFIERS

K. PENG, Z. OBRADOVIC, S. VUCETIC

Center for Information Science and Technology, Temple University, 1805 N Broad St, Philadelphia, PA 19122, USA
In this study we analyzed the bias existing in the Protein Data Bank (PDB) using the novel contrast classifier approach. We trained an ensemble of neural network classifiers, called a contrast classifier, to learn the distributional differences between non-redundant sequence subsets of PDB and SWISS-PROT. Assuming that SWISS-PROT is a representative of the sequence diversity in nature while the PDB is a biased sample, output of the contrast classifier can be used to measure whether the properties of a given sequence or its region are underrepresented in PDB. We applied the contrast classifier to SWISS-PROT sequences to analyze the bias in PDB towards different functional protein properties. The results showed that transmembrane, signal, disordered, and low complexity regions are significantly underrepresented in PDB, while disulfide bonds, metal binding sites, and sites involved in enzyme activity are overrepresented. Additionally, hydroxylation and phosphorylation posttranslational modification sites were found to be underrepresented while acetylation sites were significantly overrepresented. These results suggest the potential usefulness of contrast classifiers in the selection of target proteins for structural characterization experiments.
1 Introduction

The ultimate goal of structural genomics is to determine structures for every natural protein through large-scale structure characterization and computational analysis. However, in anticipation of the development of cost-effective techniques and protocols for large-scale experiments, current efforts in structural genomics are aimed towards determining structures of a limited portion of representative proteins to achieve a rapid coverage of the protein sequence/structure space [3]. As a common approach, the proteins are first filtered to remove those considered inappropriate for structural characterization, e.g., membrane, low complexity, and signal peptides. The remaining proteins are clustered into families based on sequence similarity. Finally, representative proteins from the families of largest biological interest are selected for structural characterization experiments. Although some progress has been made, selection of the target proteins remains an open problem in structural genomics [3]. As the main database of experimentally characterized structural information, the Protein Data Bank (PDB) [1] contains more than 20,000 structures of proteins, nucleic acids and other related macromolecules characterized by methods such as X-ray diffraction and nuclear magnetic resonance (NMR) spectroscopy. However, current information in PDB is highly biased in the sense that it does not adequately cover the whole sequence/structure space. For example, membrane proteins represent a very important structural class in nature, but their structures are usually extremely difficult to determine due to the need for a lipid bilayer or substitute amphiphile [18]. In
general, PDB is positively biased towards proteins that are more amenable to expression, purification and crystallization. Another source of bias is the fact that different research groups usually have different objectives when selecting the target proteins: some aim at determining structures of proteins from a specific model organism; some may focus on proteins in a single pathway; others may be more interested in certain types of proteins, e.g., disease-related proteins. PDB is also statistically redundant due to the presence of multiple entries for highly similar or identical proteins. According to the PDB statistics available at http://www.rcsb.org/pdb/holdings.html, out of the 3,298 structures deposited in year 2001 only 204 could be considered novel, while the remaining ones were mostly minor variants of those already reported. Understanding the bias and redundancy in PDB is crucial for the selection of further structural targets as well as for various structure predictions. Several studies have been performed towards this goal. Brenner et al. [4] analyzed the SCOP [12] structural classification of PDB proteins and reported high skewness at all classification levels. Gerstein [6] compared several complete genomes with a non-redundant subset of PDB and concluded that the proteins encoded by the genomes were significantly different from those in the PDB with respect to sequence length, amino acid composition and predicted secondary structure composition. Liu and Rost [10] analyzed the proteomes of 30 organisms, estimated that current structural information in PDB and other databases was available for only 6-38% of all proteins, and found over 18,000 segment clusters suitable for structural genomics. In this paper we provide a complementary view of the bias in PDB that explores differences in sequence properties of PDB and SWISS-PROT [2] proteins.
This was accomplished by training an ensemble of neural network classifiers to distinguish between the distributions of the non-redundant subsets of PDB and SWISS-PROT. Following the recently proposed contrast classifier framework [14], the output of such an ensemble of classifiers measures the level to which a given sequence property is overrepresented/underrepresented in PDB as compared to SWISS-PROT. We applied the contrast classifier to analyze the bias in PDB towards numerous protein properties and to examine whether our approach can be useful in selecting the most interesting target proteins for structural characterization.
2 Methods

2.1 Datasets

Since both PDB and SWISS-PROT are statistically redundant due to the presence of a large number of homologues, learning on such data could lead to biased results. Thus, non-redundant subsets were used as unbiased representatives of the two databases. The non-redundant representative of PDB used in this study was PDB-Select-25 [7]
constructed based on all-against-all Smith-Waterman alignments between PDB chains. In this subset, the maximal pairwise identity was limited to 25% since it is believed to be an appropriate compromise between reducing the sequence redundancy and preserving the sequence diversity [17]. The version used in this study was released in December 2002 and consisted of 1,949 chains. After removing chains shorter than 40 residues, the resulting set PDB25 contained 1,824 chains with 324,783 residues. For SWISS-PROT (October 2001, Release 40, 101,602 sequences), we applied an approach used in our previous study [20] to construct its non-redundant representative subset. Sequence similarity information from the ProtoMap database [22] was used to group all SWISS-PROT proteins into 17,676 clusters using the default ProtoMap E-value cutoff of 1. A representative protein with the richest annotation in SWISS-PROT was then selected from each cluster. Similarly to PDB25, proteins shorter than 40 residues were removed. The resulting set SwissRep consisted of 16,360 proteins with 6,946,185 residues. The relatively high E-value cutoff, leading to quite aggressive redundancy reduction, was acceptable since the resulting SwissRep was still sufficiently large to represent the diversity of SWISS-PROT.

Table 1. Summary of special regions in SwissRep.

Regions          number of regions   number of residues
transmembrane    10,274              215,109 (3.1%)
low complexity   14,648              2,041,162 (29.4%)
disordered       11,332              506,229 (7.3%)
We also identified various regions of interest from SwissRep proteins for further analysis. Transmembrane regions were identified through the keywords (KW lines) and feature tables (FT lines) associated with each SWISS-PROT sequence. We identified transmembrane helix regions as the most distinctive among all types of membrane regions. Low complexity regions were marked by the SEG program [21] using the standard parameters K1 = 3.4 and K2 = 3.75, and a window of length 45. Disordered regions longer than 30 residues were predicted by the VL3 disorder predictor [13] with Win/Wout = 41/1 and a threshold of 0.85. Table 1 shows the summary of these identified regions, with their sizes measured as the number of regions, the number of residues, and the percentage of residues in SwissRep.
2.2 Contrast Classifiers

Let us assume we are given dataset G obtained by unbiased sampling from a multivariate underlying distribution, and dataset H obtained by potentially biased sampling from the same distribution. This scenario could occur when objects from H are characterized by a larger set of attributes than those of G. For example, SwissRep is an example of an unbiased dataset G that contains only protein sequence information, while PDB25 is an example of a biased dataset H that contains both protein sequence and structure information. Understanding the bias in data H is of major importance for
an appropriate analysis and inference from such data. The recently proposed contrast classifier approach [14] provides a simple but effective framework for detecting and exploring the data bias. By g(x) and h(x) let us denote the probability density functions (pdf) of unbiased data G and biased data H, respectively. The contrast classifier is a classifier trained to discriminate between the distributions of datasets G and H. Using classification algorithms that are able to approximate the posterior class probability (e.g., neural networks), the output cc(x) of a contrast classifier trained on a balanced set with the same number of examples from G (class 1) and examples from H (class 0) approximates cc(x) = g(x)/(g(x) + h(x)) [14]. With a simple transformation it follows that g(x)/h(x) = cc(x)/(1 - cc(x)), and that cc(x) = 0.5 corresponds to a data point x that is represented equally well in both datasets (i.e., h(x)/g(x) = 1). The contrast classifier output cc(x) is therefore a very suitable measure for analysis of the data bias. The distribution of cc(x) gives information about the overall level of bias in dataset H: if it is concentrated around 0.5 the bias is negligible, while if it is dispersed across the interval [0, 1] the bias is significant. Moreover, we could measure the level to which a given data point is overrepresented/underrepresented in dataset H: data points with cc(x) < 0.5 are overrepresented, while those with cc(x) > 0.5 are underrepresented.
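The identities above can be checked numerically. A minimal sketch in which two known Gaussian densities stand in for g(x) and h(x) (in the real setting cc(x) comes from a trained ensemble, not from closed-form densities):

```python
import math

def gauss_pdf(x, mu, sigma):
    # density of a normal distribution, standing in for a known pdf
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def g(x):  # hypothetical "unbiased" density (plays the role of SwissRep)
    return gauss_pdf(x, 0.0, 1.0)

def h(x):  # hypothetical "biased" density (plays the role of PDB25)
    return gauss_pdf(x, 1.0, 1.0)

def cc(x):
    # output of an ideal contrast classifier trained on a balanced set:
    # the posterior probability of class G given x
    return g(x) / (g(x) + h(x))

# the density ratio is recovered from the classifier output:
# g(x)/h(x) = cc(x) / (1 - cc(x)); at x = 0.5 both densities agree,
# so cc(0.5) = 0.5, while x = -2 is far more typical of G, so cc(-2) > 0.5
ratio_at_2 = cc(2.0) / (1 - cc(2.0))
```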
2.3 Training Contrast Classifiers for Bias Detection in PDB

In this study we assume that SwissRep is a representative of the protein sequence space, while PDB25 is a biased sample. Note that, while the first assumption is probably not completely correct since SWISS-PROT represents only the proteins studied in sufficient detail, it is acceptable for the purpose of analyzing the bias in PDB. Based on the description in Section 2.2, it is evident that contrast classifiers can be used directly to explore the bias in PDB. While any classification algorithm able to approximate the posterior class probability can be employed to train a contrast classifier, in this study we used feedforward neural networks with one hidden layer and sigmoidal neurons. Since there is a large imbalance in the number of data points in the SwissRep and PDB25 datasets (with a proportion of approximately 21:1), learning a single neural network on balanced samples from the two datasets would not properly utilize the data diversity present in SwissRep. We addressed this by training an ensemble of neural networks on balanced training sets consisting of an equal number of PDB25 and SwissRep examples randomly sampled from the available data. Similar to bagging [5], we constructed a contrast classifier by aggregating the predictions of these neural networks through averaging. An additional benefit of using an ensemble of neural networks is that averaging is a successful technique for increasing their accuracy by reducing variance while retaining low bias in prediction.
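A minimal sketch of this balanced-resampling-plus-averaging scheme, with simple histogram posterior estimates standing in for the neural networks and synthetic 1-D data standing in for the two sequence sets (all names, sizes, and distributions here are illustrative, not the paper's):

```python
import random
from collections import Counter

random.seed(0)

# imbalanced synthetic data: a large "unbiased" set G (class 1) and a
# small "biased" set H (class 0), one scalar attribute per example
G = [random.gauss(0.0, 1.0) for _ in range(21000)]
H = [random.gauss(1.0, 1.0) for _ in range(1000)]

BINS, LO, HI = 20, -4.0, 5.0

def bin_of(x):
    i = int((x - LO) / (HI - LO) * BINS)
    return min(max(i, 0), BINS - 1)

def train_member(g_pool, h_pool, n=800):
    """One ensemble member: a histogram estimate of P(G | x) fit on a
    balanced sample of n examples per class."""
    g = random.sample(g_pool, n)
    h = random.sample(h_pool, n)
    cg = Counter(bin_of(x) for x in g)
    ch = Counter(bin_of(x) for x in h)
    # Laplace smoothing so empty bins give 0.5
    return [(cg[b] + 1) / (cg[b] + ch[b] + 2) for b in range(BINS)]

members = [train_member(G, H) for _ in range(25)]

def cc(x):
    # contrast-classifier output: average of the member posteriors
    b = bin_of(x)
    return sum(m[b] for m in members) / len(members)
```

Each member sees a different balanced resample, so together they cover far more of the large set than a single balanced training set would.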
2.4 Knowledge Representation
For each sequence position, a set of relevant attributes was derived from statistics of a subsequence within a window of length W centered at the position. More specifically, given a sequence s = {s_i, i = 1, ..., L} of length L, for each sequence position s_i an appropriate M-dimensional attribute vector x_i = {x_ij, j = 1, ..., M} is constructed and a corresponding class label y_i is assigned. Thus, sequence s is represented as a set of L examples {(x_i, y_i), i = 1, ..., L}. Using a window of length W = 21, a total of 25 attributes were derived for each sequence position. These attributes were proved to be useful in various protein sequence analyses and structure prediction problems. The first 19 attributes were the amino acid frequencies within the window, since it has been shown that PDB proteins exhibit unique amino acid composition patterns [6]. Only 19 of the 20 frequencies were used since the remaining one could be uniquely determined from the rest. Based on the amino acid frequencies, an attribute called K2-entropy [21] was calculated to measure local sequence complexity. We also measured flexibility [19] and hydropathy [9] propensities obtained by a triangular moving average window where the center position had weight 1 and the most distant positions had weight 0.25. While the window length was 21 for the hydropathy attribute, it was only 9 for the flexibility attribute, as suggested by the previous study [19]. The final 3 attributes were outputs of the PHD secondary structure predictor [16], i.e., the prediction scores for alpha helix (H), beta strand (E) and loop (L). Finally, class labels 0 and 1 were assigned to examples from PDB25 and SwissRep, respectively.
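A simplified sketch of this per-position attribute construction, computing the 19 windowed amino-acid frequencies plus a Shannon-entropy complexity term (only a stand-in for the K2-entropy; the flexibility, hydropathy, and PHD attributes are omitted here):

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def window_attributes(seq, i, w=21):
    """Attribute vector for position i: 19 amino-acid frequencies in the
    length-w window centered at i, plus a Shannon-entropy complexity
    attribute (an assumption standing in for the K2-entropy of SEG)."""
    half = w // 2
    win = seq[max(0, i - half): i + half + 1]
    freqs = [win.count(a) / len(win) for a in AMINO_ACIDS]
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    # drop the last frequency: it is uniquely determined by the other 19
    return freqs[:-1] + [entropy]

# hypothetical sequence; attribute vector for its 11th residue
x = window_attributes("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10)
```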
2.5 Using Contrast Classifiers to Explore Bias in PDB

Given a measure of contrast cc(x) at each sequence position, we explored the bias in PDB towards numerous protein functional properties, as defined by the SWISS-PROT keyword and feature classification. It was expected that the analysis would confirm known results (e.g., that transmembrane and low complexity regions are underrepresented in PDB) and point to some less-known sources of bias. For a set R of regions with a given functional property, the mean and standard deviation of the corresponding cc(x) were calculated to measure the direction and level of bias. Additionally, the Kolmogorov-Smirnov goodness-of-fit test [11] (KS test) was used to measure the difference between the cc(x) distributions of R and PDB25. The KS test measures the maximum absolute difference between the empirical cumulative distributions of the two samples and uses it to estimate the test p-value. Since cc(x) of neighboring sequence positions are correlated due to the use of the window W (= 21) in attribute construction (see Section 2.4), we estimated the effective length as L_e = 1 + (L - 1)/W for each sequence region of length L and used it in the calculation of the KS test p-values.
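A small sketch of the two quantities introduced here, the effective length and the two-sample KS statistic (the p-value estimation step is omitted):

```python
def effective_length(L, W=21):
    # neighboring positions share window content, so a region of length L
    # contributes only about 1 + (L - 1)/W roughly independent samples
    return 1 + (L - 1) / W

def ks_statistic(a, b):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

For example, a 43-residue region with W = 21 counts as only 3 effective samples, which appropriately weakens the resulting p-value.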
3 Results and Discussions
3.1 Training the Contrast Classifier

We built the contrast classifier as an ensemble of 50 neural networks, each having 5 hidden neurons and 1 output neuron with a sigmoid activation function. To reduce bias towards long sequences, a balanced training set for each neural network was selected in two steps: (a) 20 examples sampled randomly without replacement were taken from each sequence in PDB25 and SwissRep, and (b) a balanced set of 8,000 examples was sampled randomly with replacement from the resulting set. Individual neural networks were trained with the backpropagation algorithm. To avoid overfitting, 80% of the balanced set was used for training and the rest was reserved to signal the training termination. If the training was not stopped after 300 epochs it was terminated automatically.
3.2 Distributions of Contrast Classifier Outputs

Comparing contrast classifier outputs on PDB25 and SwissRep. A trained contrast classifier was applied to both PDB25 and SwissRep sequences, and their cc(x) distributions were compared in Figure 1(a). Since SwissRep contained a number of PDB25 sequences and/or their homologues, it was expected that the two distributions would overlap. However, a considerable proportion of SwissRep sequences had relatively large cc(x) values (e.g., larger than 0.7) while most PDB25 sequences had smaller cc(x) values concentrated around 0.47. This result clearly illustrated the existence of bias in PDB. In the following subsections, we analyze the sources of bias in greater detail. We also examined the distributions of another two sets in Figure 1(a): PDB25H, the homologues of PDB25 in SwissRep; and PDB25NH, the remaining sequences of SwissRep. The homologues were identified through 3 iterations of PSI-BLAST search using E-value thresholds of 0.001 for sequence inclusion in the profile and 1 for including sequences in the final selection. As expected, the distribution for PDB25H was similar to PDB25, while the distribution for PDB25NH was similar to SwissRep.

Distributions of 3 specific sequence regions. We examined distributions of cc(x) for transmembrane regions, low complexity regions, and predicted disordered regions from SwissRep (see Section 2.1). As shown in Figure 1(b), all these regions exhibited cc(x) values significantly higher than PDB25 sequences, indicating that they were highly underrepresented in PDB. As discussed in the Introduction, transmembrane regions are typically excluded from structural characterizations. Low complexity regions have a biased amino acid composition involving a few amino acid types and they often do not fold into a stable 3D structure [15]. Huntley and Golding [8] performed an extensive investigation on eukaryotic proteins in PDB and reported a
large deficiency in low complexity regions. Their results indicated that even for the few low complexity regions with structural data present in PDB, tertiary structures were missing in most cases. Predicted disordered regions [13] correspond to the regions very likely to have flexible structure that could not be captured by X-ray crystallography or NMR. Since disordered proteins are hard to crystallize, it was expected that they are underrepresented in PDB.
Figure 1. Comparison of cc(x) distributions between PDB25 and other sets: (a) SwissRep, PDB25H and PDB25NH; (b) various regions of interest from SwissRep.
Distributions of functional regions characterized by SWISS-PROT FT lines. We extended the analysis to functional regions described by feature tables (FT lines in SWISS-PROT) with the FT keywords. Note that the length of functional regions could range from one residue (e.g., posttranslational modification sites) to a few hundred residues. In Figure 2 we plot the distributions of the 3 selected functional region types. The supplementary material with the plots of all functional regions listed in the FT lines can be accessed at http://www.ist.temple.edu/disprot/PSB04. Given the explanation of the contrast classifier output discussed in Section 2.2, a positively skewed output distribution indicates that a certain type of functional site or region is underrepresented in PDB, while a negatively skewed output distribution indicates that it is overrepresented. For example, disulfide bonds (DISULFID) play important roles in stabilizing protein tertiary structure and thus should be abundant in PDB. Consistent with this fact is that their cc(x) distribution is highly negatively skewed (see Figure 2). On the other hand, signal peptides (SIGNAL) are short segments of amino acids in a particular order that govern the transportation of newly synthesized proteins, and are then cleaved from the matured proteins. Since structure characterization experiments usually target matured proteins, signal peptides are expected to be underrepresented in PDB. Accordingly, we observe a positively skewed distribution similar to that of transmembrane regions in Figure 1(b). Repeats
(REPEAT) are internal sequence repetitions; they typically have low sequence complexity and thus exhibit a distribution similar to that of low complexity regions in Figure 1(b).
Figure 2. Distributions of cc(x) of 3 selected sites or regions from SwissRep sequences.
Figure 3. Distributions of contrast classifier output cc(x) of 3 selected posttranslational modification sites from SwissRep sequences.
Comparing distributions of PDB and different functional regions. The cc(x) distributions of the functional sites or regions were compared with the distributions of the PDB25 sequences using the 2-sample Kolmogorov-Smirnov test described in Section 2.5. The FT keywords corresponding to these sites or regions were then ranked according to the resulting p-values, as shown in Table 2. Note that the table does not list FT keywords ACT-SITE, CA-BIND, CONFLICT, INIT-MET, LIPID, MUTAGEN, SIMILAR, SITE, THIOLEST, TRANSIT, NON-CONS, NON-TER, NOT SPECIFIED, NP-BIND, UNSURE, VARIANT, and VARSPLIC since they were either of less interest or their total effective length was less than 1000 residues. Also shown in Table 2 are means and standard deviations of cc(x) values, and the total effective length used in the KS test. We further examined contrast classifier output cc(x) on different posttranslational modification sites identified by FT keyword MOD-RES. Results for the 5 most frequent sites are shown in Table 3. Similar to Table 2, these sites were ranked according to their Kolmogorov-Smirnov test p-values when compared to the distribution of PDB25 sequences. Among the top 3 sites, phosphorylation and hydroxylation sites have positively skewed distributions, while acetylation sites have negatively skewed distribution, as shown in Figure 3. This suggests that the first 2 modification sites are underrepresented in PDB, while the acetylation sites are overrepresented.
Table 2. Comparison of distributions of contrast classifier outputs on sites or regions of interest and PDB25 sequences. The p-values were obtained using the Kolmogorov-Smirnov 2-sample test.

FT keyword   p-value     M(cc)   S(cc)   L
TRANSMEM     0           0.72    0.12    20531
REPEAT       0           0.53    0.14    13377
DOMAIN       7.93e-318   0.51    0.12    82912
CARBOHYD     1.17e-138   0.51    0.12    7015
CHAIN        5.09e-118   0.50    0.12    74737
MOD-RES      3.57e-061   0.53    0.13    1515
PEPTIDE      7.24e-061   0.53    0.13    1225

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Table 3. Comparison of contrast classifier output distributions of different posttranslational modification sites with that of PDB25 sequences.

modification site   p-value     M(cc)   S(cc)   L
phosphorylation     3.54e-054   0.55    0.11    608
amidation           5.68e-015   0.55    0.14    170
methylation         8.66e-010   0.57    0.12    93
PDB25               n/a         0.47    0.10    17194

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Distributions of SCOP structural classes. According to the SCOP database [12] (release 1.61, Nov. 2002), 1,685 out of the 1,824 chains in PDB25 can be classified into 11 structural classes, as shown in Table 4. Note that different parts of a chain might belong to different classes. We examined the cc(x) distributions of individual structural classes and compared them with the overall distribution of PDB25 sequences using the Kolmogorov-Smirnov test (results shown in Table 4). The most significant difference corresponded to sequences from the small class, with a negatively skewed distribution. It is worth noting that the membrane and cell surface, coiled coils, and peptide structural classes appeared to be significantly underrepresented in PDB25.
Table 4. Comparison of cc(x) distributions on PDB25 proteins from different fold classes with that of all PDB25 proteins.

Fold class                  p-value     M(cc)   S(cc)   L
small                       8.74e-037   0.41    0.10    608
alpha                       4.65e-018   0.49    0.10    2779
membrane and cell surface   3.86e-015   0.53    0.14    382
peptides                    0.000585    0.52    0.12    85
PDB25                       n/a         0.47    0.10    17194

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Analysis of underrepresented proteins. Complementing the study of cc(x) distributions of different functional protein regions or protein types, we explored the properties of proteins that are most highly underrepresented by PDB25. Some of these proteins are arguably the most interesting targets for future structural determination experiments. For this study, each SwissRep sequence s was represented with a single number cc-avg(s) representing the average cc(x) over the sequence. A total of 2,814 (or 17.2% of) SwissRep sequences having cc-avg(s) > 0.597 were selected, with the threshold chosen such that only 1% of PDB25H sequences satisfied the inequality. We analyzed the properties of the resulting set, called SwissOut, by comparing the commonness of different SWISS-PROT keywords (KW line) in SwissOut and PDB25H (see Section 3.2). By denoting f_SwissOut and f_PDB25H as the frequencies of proteins with a given keyword among SwissOut and PDB25H, respectively, their difference can be quantified by measuring the Z-score defined as

    Z = (f_SwissOut - f_PDB25H) / sqrt( f_PDB25H (1 - f_PDB25H) / N ),

where N is the number of proteins in SwissOut.

Table 5. The top 6 SWISS-PROT keywords associated with underrepresented sequences.

Keyword                f_SwissRep [%]   f_PDB25H [%]   f_SwissOut [%]   Z-score
hypothetical protein   42.38            15.64          54.34            56.52
transmembrane          17.55            14.51          42.89            42.74
complete proteome      31.64            18.24          32.55            19.66
inner membrane         1.49             1.45           3.62             9.64
chloroplast            1.36             1.22           3.13             9.23
chromosomal protein    0.66             1.29           2.74             6.82
In Table 5 we list the 6 SWISS-PROT keywords with the highest Z-scores among the ones represented with more than 50 SwissOut proteins. By careful examination, it is evident that the obtained results are reasonable, and are another indication of the potential of the proposed contrast classifier approach. Furthermore, it is likely that SwissOut proteins with the keyword "complete proteome" would be very interesting structural targets.
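The Z-score can be checked against the transmembrane row of Table 5. A small sketch, assuming the standard one-proportion form of the statistic:

```python
import math

def z_score(f_out, f_pdb, n):
    """Z-score comparing a keyword's frequency among underrepresented
    proteins (f_out) with its frequency among PDB25 homologues (f_pdb),
    for n proteins in the underrepresented set."""
    return (f_out - f_pdb) / math.sqrt(f_pdb * (1 - f_pdb) / n)

N = 2814  # number of SwissOut proteins
# 'transmembrane' row of Table 5: f_PDB25H = 14.51%, f_SwissOut = 42.89%
z = z_score(0.4289, 0.1451, N)
```

Evaluating this reproduces the tabulated Z-score of 42.74 to within rounding of the input frequencies.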
4 Concluding Remarks

We applied the contrast classifier to explore the bias existing in the Protein Data Bank towards different functional protein properties. Assuming SWISS-PROT is a representative of the protein universe while the PDB is a biased sample, we trained a contrast classifier with the non-redundant subsets of PDB and SWISS-PROT and used its output to analyze the bias in PDB. Compared to other methods for examining bias in PDB (see the Introduction), the main strength of our approach is that it provides a quantitative measure to assess the bias in a uniform way. Our results confirmed some well-known facts such as the lack of transmembrane, low complexity and disordered regions among PDB sequences. They have also revealed some less recognized facts such as the depletion of PDB in phosphorylation and hydroxylation modification sites and its overrepresentation of acetylation sites. These results are a strong indication that contrast classifiers should be considered as an attractive tool for the selection of target proteins for future structural characterization experiments. There are several immediate avenues of future research. As shown by our results, a contrast classifier trained with attributes derived from simple statistics over a local window was able to successfully explore the bias in PDB. This suggests that a more sophisticated choice of attributes could provide additional insight into the sources of bias. Similarly, removing well-known underrepresented regions (e.g., transmembrane, low complexity) before the training of the contrast classifier would allow a better focus on the less known sources of bias in PDB. Finally, by a slight extension of the proposed methodology, contrast classifiers could be trained with sequences of known folds vs. sequences in SWISS-PROT. This could have potential in detecting sequences with potentially novel fold structures.
References
1. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov and P.E. Bourne, "The Protein Data Bank", Nucleic Acids Res., 28, 235 (2000).
2. B. Boeckmann, A. Bairoch, R. Apweiler, M.C. Blatter, A. Estreicher, E. Gasteiger, M.J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout and M. Schneider, "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003", Nucleic Acids Res., 31, 365 (2003).
3. S.E. Brenner, "Target selection for structural genomics", Nat. Struct. Biol., Structural Genomics supplement, 7, 967 (2000).
4. S.E. Brenner, C. Chothia and T.J. Hubbard, "Population statistics of protein structures: lessons from structural classifications", Curr. Opin. Struct. Biol., 7, 369 (1997).
5. L. Breiman, "Bagging predictors", Mach. Learning, 24, 123 (1996).
6. M. Gerstein, "How representative are the known structures of the proteins in a complete genome? A comprehensive structural census", Fold Des., 3, 497 (1998).
7. U. Hobohm and C. Sander, "Enlarged representative set of protein structures", Protein Sci., 3, 522 (1994).
8. M.A. Huntley and G.B. Golding, "Simple sequences are rare in the Protein Data Bank", Proteins: Struc. Funct. Gen., 48, 134 (2002).
9. J. Kyte and R.F. Doolittle, "A simple method for displaying the hydropathic character of a protein", J. Mol. Biol., 157, 105 (1982).
10. J. Liu and B. Rost, "Target space for structural genomics revisited", Bioinformatics, 18, 922 (2002).
11. F.J. Massey Jr., "The Kolmogorov-Smirnov test of goodness of fit", J. Amer. Statist. Assoc., 46, 68 (1951).
12. A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures", J. Mol. Biol., 247, 536 (1995).
13. Z. Obradovic, K. Peng, S. Vucetic, P. Radivojac, C. Brown and A.K. Dunker, "Predicting intrinsic disorder from amino acid sequence", Proteins: Struc. Funct. Gen., Special Issue on CASP5, in press.
14. K. Peng, S. Vucetic, B. Han, H. Xie and Z. Obradovic, "Exploiting unlabeled data for improving accuracy of predictive data mining", In Proc. Third IEEE Int'l Conf. on Data Mining, November 2003, Melbourne, FL, in press.
15. P. Romero, Z. Obradovic, X. Li, E. Garner, C.J. Brown and A.K. Dunker, "Sequence complexity and disordered protein", Proteins: Struc. Funct. Gen., 42, 38 (2001).
16. B. Rost, "PHD: predicting one-dimensional protein structure by profile-based neural networks", Methods Enzymol., 266, 525 (1996).
17. B. Rost, "Twilight zone of protein sequence alignments", Protein Eng., 12(2), 85 (1999).
18. H. Sakai and T. Tsukihara, "Structures of membrane proteins determined at atomic resolution", J. Biochem., 124, 1051 (1998).
19. M. Vihinen, E. Torkkila and P. Riikonen, "Accuracy of protein flexibility predictions", Proteins: Struc. Funct. Gen., 19, 141 (1994).
20. S. Vucetic, D. Pokrajac, H. Xie and Z. Obradovic, "Detection of underrepresented biological sequences using class-conditional distribution models", In Proc. Third SIAM Int'l Conf. on Data Mining, May 2003, San Francisco, CA.
21. J.C. Wootton and S. Federhen, "Analysis of compositionally biased regions in sequence databases", Methods Enzymol., 266, 554 (1996).
22. G. Yona, N. Linial and M. Linial, "ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space", Proteins, 37, 360 (1999).
GEOMETRIC ANALYSIS OF CROSS-LINKABILITY FOR PROTEIN FOLD DISCRIMINATION

S. POTLURI(1), A.A. KHAN(1), A. KUZMINYKH(2), J.M. BUJNICKI(3), A.M. FRIEDMAN(4), C. BAILEY-KELLOGG(1)

Depts. of (1)Comp. Sci., (2)Math., and (4)Biol. Sci., Purdue Univ., West Lafayette, IN 47907, USA; (3)Intl. Inst. Molec. and Cell Biol., Warsaw, Poland
Abstract
Protein structure provides insight into the evolutionary origins, functions, and mechanisms of proteins. We are pursuing a minimalist approach to protein fold identification that characterizes possible folds in terms of the consistency of their geometric features with restraints derived from relatively cheap, high-throughput experiments. One such experiment is residue-specific cross-linking analyzed by mass spectrometry. This paper presents a suite of novel lower- and upper-bounding algorithms for analyzing the distance between surface cross-link sites and thereby validating predicted models against experimental cross-linking results. Through analysis and computational experiments, using simulated and published experimental data, we demonstrate that our algorithms enable effective model discrimination.
1 Introduction

Knowledge of protein structure is vital for understanding protein function and evolution. Traditional protein structure determination techniques, X-ray crystallography and nuclear magnetic resonance spectroscopy, provide atomic detail, but despite many advances, they remain difficult, expensive, and time-consuming. Recent reports from labs conducting the high-throughput protein structure initiative indicate that only 10 percent of expressed and purified proteins advance to full 3D structure. Alternatively, purely computational techniques (homology modeling, fold recognition, and ab initio) are much faster, but due to the inherent difficulty in scoring predictions, they encounter significant ambiguity in reliably identifying correct structures. We seek a middle ground, verifying predicted structures against minimalist experiments that provide relatively sparse, noisy information relatively quickly and cheaply. In particular, this paper focuses on developing and applying geometric algorithms for model discrimination using data from residue-specific cross-linking, analyzed by mass spectrometry (Fig. 1). We assume here that the models have already been generated and the experimental data have been analyzed to identify a set of cross-links. We present algorithms for checking the consistency of the identified cross-links with the structure models, in order to discriminate among the models.
Figure 1: Cross-linking mass spectrometry protocol. (1) Computationally generate a set of possible structure models. (2) Specifically cross-link the protein using a small molecule of a fixed maximum length. (3) Digest the cross-linked protein with a protease. (4) Obtain and interpret a mass spectrum, using identified cross-links as evidence for spatial proximity and thus for a particular model.
Employing Edman sequencing and mass spectroscopy of cross-links, Haniu et al. developed a largely correct model of human erythropoietin consistent with the cross-linking data, although no alternatives were explicitly considered. Later, Young et al. pioneered the use of mass spectroscopy alone to correctly discriminate among threading models of Basic Fibroblast Growth Factor, FGF-2, in spite of very low sequence similarity. More recent work employs a "top-down" method to fragment proteins within a Fourier transform mass spectrometer, so as to focus on only singly cross-linked protein monomers 4. Similarly, cross-linking has been used to determine tertiary and quaternary arrangements of proteins 5, including membrane proteins that are inherently difficult to crystallize 6,7. The minimalist philosophy has also been applied by other groups in support of approximate structure determination. For example, a limited number of long-range distance constraints from NMR 8,9, mutagenesis followed by functional evaluation 10,11, chemical modification 12, and the pair distance distribution function from small-angle X-ray scattering 13, have all been employed. While traditional structure determination techniques provide substantial overdetermination, minimalist experimental methods for rapid confirmation are noisy and yield only very sparse information. This places a significant burden on computational analyses to carefully characterize model geometry and maximize discriminatory power, in order to be robust to experimental noise and ambiguity. This paper develops a suite of new algorithms, trading complexity vs. accuracy, for analysis of cross-linkability in predicted structure models. The algorithms provide better discriminability and robustness than previously published approaches, and thus promise to enable broader applicability of cross-linking to protein fold identification.
2 Cross-Linkability Analysis

2.1 Problem Formulation

A cross-linker serves as a molecular ruler by linking only "close-enough" pairs of residues. Since the atoms of the cross-linker occupy physical space, the measurements are greatly constrained. We assume here that the cross-linker is energetically excluded
Input: A polyhedral protein surface S, representing the boundary of the body from which the cross-linker is excluded; let Sint denote the interior of the body. A set P of point cross-linking sites on S, representing potentially cross-linked atoms.
Computation: Cross-linking paths between site pairs pi, pj in P, exterior to Sint.
Output: For each pair of sites pi, pj in P, the cross-linking distance D(pi, pj), the minimum of the lengths of cross-linking paths between pi and pj.
Figure 2: Cross-link problem formulation, with a 2D schematic illustrating surface S, atoms, cross-linking sites p1 and p2, and cross-linking paths Q (achieving the cross-linking distance) and R.
from penetrating the protein interior. Since cross-linked residues (e.g. Lys) must be on or near the protein surface in order for the cross-linker to react with them, we represent cross-linked atoms (e.g. Lys NZ) by points on a solvent accessible surface 14. For example, one could find the closest surface point, or a set of "close-enough" such points, reachable from an atom without intersecting the van der Waals spheres of other atoms. While the cross-linked atoms have considerable mobility in solution, we assume that they are fixed for these algorithms. (Dynamics may be accounted for by applying the algorithms to multiple conformations.) We also assume the cross-linker is infinitely flexible. Alternatives will be addressed in a separate publication. With this representation, cross-linkability is determined by testing whether or not the distance between cross-linking sites, measured exterior to the protein, is short enough for the cross-linking molecule. Fig. 2 formalizes the problem and terminology. The basic protein surface representation we employ is a triangulation of the solvent accessible surface, where vertices indicate locations of a probe molecule's center (typically water) when in contact with the protein, and edges connect triangle vertices. In order to allow for uncertainty in the atomic coordinates of models, we have found it desirable to ignore part or all of the protein side chains. For example, Calpha coordinates, as employed by Young 3, completely ignore side chains, while Cbeta coordinates ignore many atoms but retain the side chain direction. We have developed an iterative "peeling" algorithm to remove exposed side chain atoms while leaving internal ones intact so that no voids are introduced. The algorithm first identifies solvent accessible residues (with solvent accessible area above some threshold), and then removes those
side chain atoms that are solvent accessible, starting from the end and moving towards the Calpha in subsequent iterations. This approach guarantees that, upon termination, all and only the outer atoms are removed. The problem of computing cross-linking distance requires finding the shortest path between two points. This is a well-studied problem in graph theory and networks (e.g. Dijkstra's algorithm 15). The complexity of geometric shortest path algorithms (e.g. for robotics) grows rapidly with the dimension. Our cross-linking problem can be viewed as finding the shortest obstacle-avoiding path, treating the protein body as an obstacle. When the path is not constrained to a discrete graph, but can include bends, the number of combinatorially different paths becomes exponential. Several approximation algorithms for finding the shortest path have been developed 16. Here we specialize the shortest path problem to take into consideration the special geometry of proteins. We obtain a hierarchy of novel lower- and upper-bound algorithms for estimating cross-linking distance. Due to space constraints, we present here only high-level pseudocode (Fig. 3), examples (Fig. 4), and sketches of some correctness and complexity arguments.
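The peeling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: side chains are represented as atom lists ordered from the Calpha outward, and `is_accessible` is a hypothetical stand-in for a solvent accessible surface area test.

```python
def peel_side_chains(side_chains, is_accessible):
    """Iteratively strip solvent-exposed side-chain atoms.

    side_chains: dict mapping residue id -> list of atom names ordered
    from Calpha outward, so the terminal atom is last. Atoms are removed
    from the tip inward; a buried tip halts peeling for that chain, so
    no interior voids are introduced.
    """
    chains = {r: list(atoms) for r, atoms in side_chains.items()}
    changed = True
    while changed:
        changed = False
        for atoms in chains.values():
            if atoms and is_accessible(atoms[-1]):
                atoms.pop()          # remove the exposed tip atom
                changed = True
    return chains
```

For example, peeling a lysine side chain `["CB", "CG", "CD", "CE", "NZ"]` with `NZ` and `CE` exposed leaves `["CB", "CG", "CD"]`.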
2.2 Lower Bound Algorithms
The Euclidean distance d(pi, pj) between cross-linking sites provides an obvious lower bound, Dline, on cross-linking distance. This straight-line approach does not account for the model's surface geometry, and provides relatively little information, but has been employed for model discrimination by Young et al. A tighter bound is obtained by sampling cross-sections of the protein at points along the segment connecting cross-link sites. Our disk algorithm (Figs. 3, 4a) computes a lower bound Ddisk by sampling a set C of points on the pi-pj segment and in Sint, and then constructing a sequence of disks with centers in C, perpendicular to pi-pj and contained entirely within the body S together with Sint (they intersect the protein surface only by their boundary circles). The convex hull of the union of the disks and endpoints captures some of the essential surface geometry and provides for immediate computation of a lower bound path. The distance from one site to the other is measured along a path in the intersection of the boundary of the convex hull with a plane containing the segment pi-pj. Ddisk(pi, pj) depends on the sample points C, which we treat as fixed for the following arguments. For all pi, pj, Dline(pi, pj) <= Ddisk(pi, pj), because the length of each path from pi to pj is at least the Euclidean distance. For all pi, pj, Ddisk(pi, pj) <= D(pi, pj) follows from the fact that if the length of a path P from pi to pj is less than Ddisk(pi, pj), then P intersects the interior of at least one of the disks. Thus, if there exists a cross-linking path Pc with |Pc| = D(pi, pj) < Ddisk(pi, pj), then Pc contains an interior point of at least one of the disks. By construction, each interior
PlaneDistance(S, pi, pj)
  C <- a set of sample points on [pi, pj] in Sint
  Theta <- a set of sample plane normals not perpendicular to pi-pj
  return max over c in C of (max over theta in Theta of (min { d(pi, p) + d(p, pj) | p in S intersect plane(c, theta) }))

ShortcutDistance(S, pi, pj)
  P <- a set of sample paths on the graph of S, from pi to pj
  for each P = (pi = v1, v2, ..., vn = pj) in P
    GP <- (V, E): V = {v1, ..., vn}, E = { {vk, vl} | segment vk-vl intersect Sint = empty }
    dP <- length of shortest pi to pj path on GP
  return min over P in P of dP

VisibilityDistance(S, pi, pj)
  G <- (V, E): V = vertices of S, E = { {vk, vl} | segment vk-vl intersect Sint = empty }
  return length of shortest pi to pj path on G

Figure 3: Cross-linking distance bounding algorithms.
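Selecting interior sample points requires classifying a point as inside or outside the triangulated surface. A minimal ray-parity sketch, using the standard Moller-Trumbore ray/triangle intersection (an illustration, not the authors' implementation; a robust version would also perturb the ray direction away from degenerate edge hits):

```python
def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_hits_triangle(orig, d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: does the ray orig + t*d (t > 0) hit
    triangle (v0, v1, v2)?"""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(d, e2)
    det = _dot(e1, p)
    if abs(det) < eps:
        return False                     # ray parallel to triangle plane
    tv = _sub(orig, v0)
    u = _dot(tv, p) / det
    if u < 0 or u > 1:
        return False
    q = _cross(tv, e1)
    v = _dot(d, q) / det
    if v < 0 or u + v > 1:
        return False
    return _dot(e2, q) / det > eps       # hit strictly in front of origin

def point_inside(point, triangles, d=(1.0, 0.0, 0.0)):
    """Parity test: an odd crossing count means the point is in Sint."""
    hits = sum(ray_hits_triangle(point, d, *tri) for tri in triangles)
    return hits % 2 == 1
```

With the four faces of a tetrahedron as `triangles`, an interior point yields one crossing (odd), an exterior point zero.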
point of each of the disks belongs to Sint, so Pc intersects Sint, a contradiction. The complexity of the disk algorithm depends on the implementation of the various geometric tests. Selecting sample points requires testing whether a point is inside or outside the polyhedral surface, and determining disk radii requires finding distances to surface points in the perpendicular plane. We employ a straightforward inside/outside test that counts the number of intersections of a ray from the sample point with the triangles of the protein surface, requiring O(|C||T|) total time, where T is the set of triangles of S. We compute disk radii by first sorting surface vertices in order along the segment pi-pj, and then, for each sample point, using binary search to find vertices of triangles that potentially intersect the disk at the sample point. This requires output-sensitive time O(|C||TC| log |T|), where TC is the set of triangles found by the search. We note that if a very finely sampled set of points is desired (trading increased complexity for increased accuracy), a plane sweep algorithm could be employed, keeping track of surface triangles intersecting the current plane and iterating by vertices in order of
Figure 4: 2D schematics and examples on the protein FGF-2 for (a) disk, (b) plane, and (c) shortcut algorithms.
their projections onto pi-pj. A complementary lower bound, Dplane, considers single cross-sections at multiple angles and positions. Our plane algorithm (Figs. 3, 4b) employs this idea to compute the bound by finding, at each sample point and each admissible plane orientation, the shortest path from one cross-link site to the other via a point on the intersection of the plane and the protein surface. The longest such path determines the lower bound. Correctness of the plane algorithm follows from the fact that the cross-linking path must pass through each such plane without intersecting Sint. The complexity analysis for the plane algorithm is similar to that for the disk algorithm. The disk algorithm considers the sample points simultaneously, at a uniform cross-section angle, while the plane algorithm considers the sample points independently, at variable angles. Both the lower bounds and the computational complexity of these algorithms depend not only on S, pi, pj, but also on the sample points (and, for the plane algorithm, the sample normal directions). The two degrees of freedom sampled for the plane orientations result in more intersection tests than are required for the disk algorithm.
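The PlaneDistance pseudocode of Fig. 3 can be sketched on a point-sampled surface. This is a simplified illustration under stated assumptions: the surface is a finite point sample, planes are given by a center and a unit normal, and `tol` (hypothetical) decides which sampled surface points count as lying on a plane.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def plane_lower_bound(surface_pts, pi, pj, samples, normals, tol=0.1):
    """Point-sampled version of the PlaneDistance bound.

    samples: interior points on the pi-pj segment; normals: unit plane
    normals (not perpendicular to pi-pj). Any exterior path must cross
    each sampled plane at some surface-adjacent point p, costing at
    least d(pi, p) + d(p, pj); the bound is the maximum of these
    minima over all sampled cross-sections.
    """
    best = dist(pi, pj)   # the straight line is itself a lower bound
    for c in samples:
        for n in normals:
            on_plane = [p for p in surface_pts
                        if abs(sum(nk * (pk - ck)
                                   for nk, pk, ck in zip(n, p, c))) < tol]
            if on_plane:
                via = min(dist(pi, p) + dist(p, pj) for p in on_plane)
                best = max(best, via)
    return best
```

For sites at (-2, 0, 0) and (2, 0, 0) separated by a unit-radius obstacle, the cross-section through the origin raises the bound from the straight-line 4 to roughly 4.47.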
2.3 Upper Bound Algorithms
An immediate upper bound on the cross-linking distance is obtained by taking the convex hull of the protein surface, finding paths outside Sint from the cross-linking sites to representative points on the surface of the hull, and finding shortest paths on the hull surface between these points. The correctness of the upper bound computed by this hull algorithm follows immediately, since the hull is exterior to the protein. The bound depends on the paths from the sites to the hull surface, and is useful when the computation of these paths is easy (e.g. when a line segment not intersecting Sint can be identified). By applying Chen and Han's 17 single-source shortest-paths
algorithm for polyhedral surfaces, the complexity for a single site pi to all other pj in P is O(|V|^2), where V is the set of hull vertices.
The convex hull approach takes "shortcuts" across the mouths of concavities by traversing the hull of the protein, but can miss shortcuts through the concavities. A complementary approach is to start with a sample of paths on the protein surface, rather than on the hull, and then take shortcuts where possible to reduce the lengths of these paths. More precisely, a shortcut of a path replaces the subsequence of vertices (pk, pk+1, ..., pl) with the sequence (pk, pl) when the segment pk-pl doesn't intersect Sint. We call such a pair pk, pl a visible pair. Our shortcut algorithm (Figs. 3, 4c) applies this approach to compute an upper bound Dshortcut. Since initial paths are on the surface and shortcuts do not penetrate the body, this is a correct upper bound. The complexity of the shortcut algorithm depends on the approaches to generating paths, computing visibility, and selecting shortcuts. Our current implementation generates diverse paths by repeatedly performing a breadth-first search from pi to pj (taking time linear in the number of surface vertices) and removing edges for path vertices before the next iteration. Other approaches are also possible to achieve diversity. We shortcut a path by an iterative greedy refinement algorithm, starting at pi and at each iteration jumping to the vertex that is furthest along the path and still visible. Visibility can be tested by computing surface triangle intersections, as discussed for the disk algorithm, yielding O(|T||P|^2) total time to shortcut a path P. An alternate approach that we are exploring is to test intersection of a segment with each of the protein atom spheres, using an atomic radius expanded by that of the solvent. In either case, efficient data structures could reduce the number of triangles tested. Dijkstra's single-source shortest path algorithm 15 could be employed instead of the greedy shortcutting, requiring O(|T||P|^2) time to guarantee optimal shortcutting.
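The greedy refinement can be sketched compactly. Here `segment_clear` is a hypothetical visibility oracle (e.g. the triangle-intersection or atom-sphere test mentioned above); vertices adjacent on the original path are always kept connected.

```python
def greedy_shortcut(path, segment_clear):
    """Greedy path shortcutting: from the current vertex, jump to the
    farthest later vertex that is still visible.

    path: list of vertices from pi to pj on the surface graph;
    segment_clear(a, b): True when segment a-b misses the interior.
    """
    out = [path[0]]
    i = 0
    while i < len(path) - 1:
        j = len(path) - 1
        # back off until a visible vertex (or the next one) is found
        while j > i + 1 and not segment_clear(path[i], path[j]):
            j -= 1
        out.append(path[j])
        i = j
    return out
```

For a five-vertex path where only jumps of at most two steps are visible, the result is the three-vertex path through the middle; with full visibility it collapses to the endpoints.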
We find that in practice the greedy approach usually makes substantial progress per iteration and is closer to linear than quadratic in path length. Rather than considering shortcuts on a few sample paths, we can compute, at the cost of additional complexity, a complete visibility graph for the protein surface. A visibility graph 18 indicates all visible pairs of vertices. Given a visibility graph, we can apply standard shortest path algorithms (e.g. Dijkstra's algorithm 15). Our visibility algorithm (Fig. 3) uses this approach to compute an upper bound Dvisibility. As with the shortcut algorithm, correctness as an upper bound is immediate. A straightforward construction of the visibility graph, using the techniques mentioned above for shortcutting, requires O(|T||V|^2) time, where T and V are respectively the set of triangles and the set of vertices of S. This preprocessing is shared across all cross-linking site pairs; Dijkstra's algorithm then requires an additional O(|V|^2) time for each site.
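Given the visibility graph (with Euclidean edge lengths), the single-source computation is standard. A compact Dijkstra sketch, with the graph as an adjacency dict (hypothetical representation):

```python
import heapq

def dijkstra(adj, src):
    """Dijkstra's single-source shortest paths (ref. 15).

    adj: dict vertex -> list of (neighbor, edge_length) pairs, e.g.
    the visibility graph with Euclidean edge lengths. Returns a dict
    of shortest distances from src to every reachable vertex.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                  # stale queue entry, already settled
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

With a binary heap this runs in O(|E| log |V|); the O(|V|^2) figure quoted above corresponds to the classical array implementation on a dense visibility graph.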
2.4 Protein Model Discrimination
In order to discriminate among a set of predicted protein models, we must test for each of them the feasibility of the distances for all observed cross-links. We note that less information can be gained from the absence of evidence for a cross-link under a bottom-up mass spectrometry approach, since several factors other than cross-linking distance can contribute to the absence. More powerful reasoning from negative evidence will be possible in future work, particularly following the application of top-down mass spectrometry for cross-linking analysis 4. When employed with observed cross-links, lower and upper bounds provide complementary information for model discrimination. A lower bound can provide evidence against a model, when the estimated distance for an observed cross-link exceeds the expectation for the cross-linker. An upper bound can provide evidence for a model, when the estimated distance for an observed cross-link is less than the maximum distance. We adopt a simple strategy that assumes cross-links are independent and sums their scores: +1 when an upper bound is satisfied, -1 when a lower bound is violated, and 0 when neither holds. (It is impossible for both to hold.)
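This scoring strategy can be sketched directly. The bound functions and the 24 Angstrom default are stand-ins (the threshold used in the Results for the BS3 cross-linker):

```python
def score_model(cross_links, lower, upper, max_len=24.0):
    """Sum per-cross-link evidence for a model.

    lower/upper: functions giving lower- and upper-bound distances for
    a cross-linked pair (e.g. the disk and shortcut bounds); max_len is
    the feasible cross-linker span. +1 when the upper bound fits
    (evidence for), -1 when even the lower bound exceeds it (evidence
    against), 0 when neither bound is decisive.
    """
    total = 0
    for pair in cross_links:
        if upper(pair) <= max_len:
            total += 1
        elif lower(pair) > max_len:
            total -= 1
    return total
```

For three cross-links with (lower, upper) bounds of (10, 20), (26, 30), and (15, 30), the contributions are +1, -1, and 0, giving a model score of 0.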
3 Results

We have tested the performance of our algorithms for model selection with both published experimental and simulated data. Fibroblast growth factor (FGF-2) is the primary target because of available data 3 and structure (PDB id 4FGF). Competing models were obtained for the published template structures via the protein fold-recognition meta-server 19; two of the models are of the same fold (beta trefoil) as 4FGF. The Lys-specific cross-linker BS3 was used. To further demonstrate the utility of our approach, we chose two CASP4 20 targets with many high-quality models: deoxyribonucleoside kinase (PDB id 1J90) and alpha-catenin (PDB id 1L7C). We applied our algorithms, using NZ, Cgamma, Cbeta, or Calpha atoms (with surfaces appropriately peeled), and found Cbeta to provide the best results. The Calpha straight-line measurement of Young et al. provides a control, although we could not exactly reproduce their model discrimination results (presumably due to differences in the details of the protein models). Visualizations like those in Fig. 4 provide evidence of the ability of our algorithms to better approximate cross-linking distance. To quantitatively characterize discriminatory power, we computed, for each distance between 1 and 45 Angstroms, the number of possible Lys pairs in 4FGF whose length exceeds the threshold, and compared the number for experimentally identified cross-links (to be maximized) and unidentified ones (to be minimized). A greater difference between these numbers at a threshold indicates better abstraction of structural features and an enhanced ability of the method
Figure 5: Comparison of cross-linking distances for (left) Calpha straight-line, (middle) Cbeta disk, and (right) Cbeta plane methods. The x-axis indicates a distance and the y-axis the number of experimentally-identified (blue lower line; 18 maximum) and unidentified (red upper line; 48 maximum) cross-links exceeding that threshold.
employed to separate identified from unidentified cross-links for a cross-linker of that length. Fig. 5 compares the straight-line distance against two of our lower bound methods. The area between the curves (summing the count difference over the range) is 641 for Calpha straight-line, 826 for Cbeta disk, and 887 for Cbeta plane, demonstrating the more informative bounds provided by our algorithms. In model discrimination, Young et al. employ a maximum value of 24 Angstroms for feasible cross-linking distance; we use the same threshold for testing both upper and lower bounds. This value accounts for the BS3 length (11.4 Angstroms), the distance from the reactive NZ to the representative cross-linking site, and a small amount of uncertainty. Fig. 5 shows that some of the experimentally-determined cross-links have distances exceeding even this threshold (e.g. Ddisk(Lys21, Lys125) is 29.5 Angstroms). These large distances were confirmed visually. Possible explanations include experimental errors, artificial distortion of the protein, or extensive natural flexibility. Artificial distortion (e.g. by partial denaturation due to multiple cross-links) may be alleviated by a better choice of experimental conditions. The work of Falke 21 suggests it is possible to obtain cross-links more than 10 Angstroms longer than expected, in mobile situations, although the rate of cross-linking falls off by orders of magnitude. To study such flexibility, we intend to apply our algorithms to multiple frames of a molecular dynamics simulation, boosting the need to trade off efficiency and tightness of bound. We note that infrequent conformations might in general be detected rarely by mass spectrometry, and thus could be treated as noise in a probabilistic analysis. The cross-link experiment could also be altered to exploit differences in rates. We further quantified discriminatory power by comparing differences in estimated cross-link distances between models.
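Comparing models by their estimated cross-link distances can be sketched as a Euclidean difference between per-model distance vectors (a hypothetical helper, one coordinate per cross-link):

```python
import math

def model_difference(dists_a, dists_b):
    """Euclidean distance between two models' cross-link distance
    vectors (one coordinate per cross-link). A larger value suggests
    the cross-linker's fixed length is more likely to separate the
    models on some individual cross-link."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dists_a, dists_b)))
```

For example, two models whose two estimated cross-link distances differ by 3 and 4 Angstroms are 5 Angstroms apart in this space.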
We treat the set of cross-linking distances for a model as a point in l-dimensional space (for l cross-links), and compute differences (Euclidean distances) between these points. A larger difference is indicative of greater discriminatory power, since the cross-linker's fixed length is more likely to separate the points on some dimension (cross-link). We compared our disk Cbeta
algorithm to the control straight-line Calpha, and found that our algorithm yields an average of 0.2-0.3 Angstroms larger average differences for both experimentally observed and all possible cross-links, whether comparing 4FGF to all other models, 4FGF to non-beta-trefoil models, or each model to all other models. We tested our methods by ranking the correct structure vs. the models, scoring with either the Young approach of counting violations (straight-line distance > 24 Angstroms) or our discrimination method combining disk (lower bound) and shortcut (upper bound) distances. We analyzed the effects of cross-link sparsity and noise by choosing datasets consisting of a random subset of the identified plus a random set of the unidentified cross-links. Fig. 6 illustrates the average rank of the correct structure over 100 such simulations for each of several different numbers of observed and unobserved cross-links. (We apply the conservative choice of ranking the correct structure worst in case of a tie.) With smaller subsets of identified cross-links, the two methods are comparable. Larger subsets tend to include more cross-links labeled infeasible by the disk bound, and our method degrades. Finally, we analyzed model discriminability by varying the number of simulated "good" and "bad" cross-links and finding the average rank of the correct structure as above. For tests with our method, good cross-links were chosen from those with shortcut Cbeta distance below 24 Angstroms in the correct structure, and bad cross-links from those with disk Cbeta distance greater than 24 Angstroms. Similarly, good and bad cross-links for the straight-line method were chosen using the 24 Angstrom threshold. Fig. 7 shows results for FGF using each method to analyze the corresponding simulated dataset.
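The conservative tie-handling rank used in these experiments can be sketched as follows (higher scores assumed better, as in the +1/-1/0 scheme; names are hypothetical):

```python
def conservative_rank(correct_score, model_scores):
    """Rank of the correct structure among competing models, counting
    ties against it: every model scoring at least as well as the
    correct structure pushes its rank down (rank 1 is best)."""
    return 1 + sum(1 for s in model_scores if s >= correct_score)
```

For example, a correct-structure score of 5 against model scores [7, 5, 3] yields rank 3, since both the 7 and the tying 5 count against it.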
These results test discriminability and robustness to sparsity and noise: over many different sets of feasible/infeasible cross-links, our distances distinguish the correct structure from the models better than do straight-line distances. Fig. 8 shows our results on the CASP4 targets; straight-line is again inferior (not shown).
4 Conclusions
We have developed and applied a set of lower- and upper-bound algorithms for estimating cross-linking distance. The algorithms trade off complexity and tightness of bound. We have shown that, by taking into account protein surface geometry, our algorithms provide better model discriminability, in terms of cross-link separability, distance differences, and discrimination effectiveness. We illustrated the robustness of our techniques by simulating sets of good and bad cross-link data. Our results demonstrate that information from relatively rapid and inexpensive experiments permits model discrimination in spite of sparse information and the presence of noise. The current work can be further extended in several ways. Protein dynamics can be taken into consideration. As more experimental data become available, better classifiers can be developed to apply distance estimates to model discrimination. While
Figure 6: Discrimination using experimental data for FGF-2 with (a) straight-line Calpha, (b) combined disk and shortcut Cbeta. The x- and y-axes indicate the number of cross-link pairs identified and unidentified, respectively; the z-axis shows the average rank of the actual structure over 100 random subsets.
Figure 7: Discriminability for FGF-2 with (a) straight-line Calpha, (b) combined disk and shortcut Cbeta. The x- and y-axes indicate the number of good and bad cross-link pairs, respectively, chosen according to the same methods; the z-axis shows the average rank of the actual structure over 100 random subsets.
Figure 8: Discriminability, as in Fig. 7, with combined disk-shortcut Cbeta using simulated data for (a) deoxyribonucleoside kinase and (b) alpha-catenin models.
cross-links were considered independent here, a more complex framework would capture dependencies with respect to differential reactivity, competing cross-links, and so forth. Our analysis can also be used in planning experiments, e.g. proposing a cross-linker of the best length or the substitution of particular residues to lysine.

Acknowledgments

This work is supported in part by a US NSF CAREER award to CBK (IIS-0237654), and an EMBO/HHMI Young Investigator and Foundation for Polish Science Young Scholar award to JMB. Thanks to Mike Stoppelman, Xiaoduan Ye, and other members of our labs for helpful discussions and related work.

References

1. Natl. Inst. Gen. Med. Sci. http://www.structuralgenomics.org.
2. M. Haniu, L. O. Narhi, T. Arakawa, S. Elliott, and M. F. Rohde. Protein Sci, 9:1441-51, 1993.
3. M. M. Young et al. PNAS, 97:5802-5806, 2000.
4. G. H. Kruppa, J. Schoeniger, and M. M. Young. Rapid Commun Mass Spectrom, 17(2):155-62, 2003.
5. A. Scaloni et al. J Mol Biol, 277:945-958, 1998.
6. J. B. Swaney. Methods Enzymol, 128:613-626, 1986.
7. I. Kwaw, I. Sun, and H. R. Kaback. Biochemistry, 39:3134-3140, 2000.
8. J. Skolnick, A. Kolinski, and A. R. Ortiz. J Mol Biol, 265:217-241, 1997.
9. P. M. Bowers, C. E. M. Strauss, and D. Baker. J Biomol NMR, 18:311-318, 2000.
10. S. Elliott et al. Blood, 87(7):2702-13, 1996.
11. A. Bohm et al. J Biol Chem, 277(5):3708-17, 2002.
12. F. Zappacosta et al. Protein Sci, 6(9):1901-9, 1997.
13. W. Zheng and S. Doniach. J Mol Biol, 316:173-87, 2002.
14. B. Lee and F. M. Richards. J Mol Biol, 55(3):379-400, 1971.
15. E. W. Dijkstra. Numerische Mathematik, 1:269-271, 1959.
16. J. S. B. Mitchell. Geometric shortest paths and network optimization. Handbook of Computational Geometry, 2000.
17. J. Chen and Y. Han. In Proc ACM Symp Comp Geom, pp. 360-369, 1990.
18. J. C. Latombe. Robot Motion Planning. Kluwer, 1991.
19. M. A. Kurowski and J. M. Bujnicki. Nucleic Acids Res, 31(13):3305-7, 2003. http://genesilico.pl/meta.
20. J. Moult et al. Proteins, S5:2-7, 2001.
21. C. L. Careaga and J. J. Falke. J Mol Biol, 226:1219-35, 1992.
PROTEIN FOLD RECOGNITION THROUGH APPLICATION OF RESIDUAL DIPOLAR COUPLING DATA
Y. QU, J.-T. GUO, V. OLMAN, and Y. XU*

Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA, and Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA (*correspondence: [email protected])

Residual dipolar coupling (RDC) represents one of the most exciting emerging NMR techniques for studying protein structures. However, solving a protein structure using RDC data alone is a highly challenging problem, as it often requires the starting structural model to be close to the actual structure of a protein for the structure calculation procedure to be effective. We report in this paper a computer program, RDC-PROSPECT, for identification of a structural homolog or analog of a target protein in PDB, which best matches the 15N-1H RDC data of the protein recorded in a single ordering medium. The identified structural homolog/analog can then be used as a starting model for RDC-based structure calculation. Since RDC-PROSPECT uses only RDC data and predicted secondary structure information, its performance is virtually independent of sequence similarity between a target protein and its structural homolog/analog, making it applicable to protein targets beyond the scope of current protein threading techniques. We have tested RDC-PROSPECT on all 15N-1H RDC data (representing 33 proteins) available in the BMRB database and the literature. The program correctly identified the structural folds for 80% of the target proteins, significantly better than previously reported results, and achieved an average alignment accuracy of 97.9% of residues aligned within a 4-residue shift. Through careful algorithmic design, RDC-PROSPECT is at least one order of magnitude faster than previously reported algorithms for principal alignment frame search, making it fast enough for large-scale applications.
1 Introduction

Since the publication of the seminal work by Tolman et al.1 and Tjandra & Bax,2 residual dipolar coupling (RDC) in weak alignment media has gained great popularity in solving protein structures using NMR techniques. RDC provides information about the angles of atomic bonds, e.g., N-H bonds, of a protein's amino acids with respect to a specific 3-dimensional (3D) reference frame. Using such information, an NMR structure could, at least theoretically, be solved through molecular dynamics (MD) simulation and energy minimization under the constraints of the RDC angle information. A key advantage of RDC-based NMR structure solution is that RDC data can be obtained from a small number of NMR experiments in a very efficient manner.3 Potentially, it could also overcome a number of limitations of traditional NOE-based NMR structure determination techniques, e.g., the size limit on a target protein.4 Though recognized for its great potential for solving larger proteins faster, direct application of RDC data to protein structure solution remains a highly challenging problem. The problem mainly comes from the well-known four-fold degeneracy of RDC. An RDC value of an N-H bond (for example) does not
uniquely define a single orientation of the N-H bond; rather, it only restricts the orientation to two symmetric cones, making the search space of feasible structural conformations extremely large. In addition, inclusion of the RDC terms in the NMR energy function for structure calculation results in a highly rippled energy surface with innumerable sharp local minima,6 making the search problem exceedingly difficult. In the absence of long-range NOE distance information, it is practically intractable to find the global minimum by conventional optimization techniques. However, if the starting model is close to the true structure, convergence becomes much easier. Therefore, a great deal of effort has been made to obtain good starting structures for RDC-based NMR structure calculation. Existing methods for deriving protein structures from RDC data alone fall mainly into two categories: de novo fragment assembly methods7-10 and whole-protein structural homology search methods.11,12 De novo methods build protein structures by assembling structural fragments that are consistent with the RDC data. These methods typically require a complete or near-complete set of RDC data to be effective, and are often very time-consuming. One example is the RosettaNMR program,10 which typically needs more than 3 RDC data per residue for its structure calculation to be accurate. As these methods typically attempt to assemble a protein structure in a sequential manner, they often suffer from accumulation and propagation of small errors from individual fragments. Structural homology search methods generally require fewer RDC data and much less computing time, but are applicable only to proteins with solved homologous structures.
Based on theoretical estimates of the total number of unique structural folds in nature, and on the low percentage (< 5%) of novel structural folds among all structures submitted to PDB in the past few years,13 it is generally believed that the majority of the unique structural folds in nature are already included in PDB. Hence structural homology search methods are becoming increasingly popular. Annila et al.11 were the first to use assigned RDC data to search for structural homologs; their work demonstrated the feasibility of fold recognition using RDC data alone. Meiler et al.12 developed a program, DipoCoup, for structural homology search using secondary structure alignment. While all the aforementioned methods contain interesting ideas, they have been tested only on a very small set of proteins, in a few cases on only one protein, ubiquitin. Therefore, their true practical usefulness is yet to be determined. We have recently developed a computer program, RDC-PROSPECT (RDC-PROtein Structure PrEdiCtion Toolkit), for protein fold recognition and protein backbone structure prediction. Currently the program uses only assigned N-H RDC data in a single ordering medium, together with predicted secondary structure, to identify structural homologs or analogs from the PDB database. RDC-PROSPECT identifies a structural fold by finding the fold in PDB that best matches the N-H RDC data, using a dynamic programming approach. Compared with existing methods, RDC-PROSPECT has a number of unique capabilities. First, RDC-PROSPECT requires only a small number of RDC data for fold recognition. On our test set, consisting of all publicly available N-H RDC data of 33 proteins deposited in the
BMRB database (www.bmrb.wisc.edu) and published in the literature, RDC-PROSPECT achieves an 80% fold recognition rate with an average of 0.7 RDC data per residue. Requiring fewer RDC data implies a smaller number of NMR experiments needed to solve a structure. Second, RDC-PROSPECT does not require sequence similarity information for fold recognition, making the program equally applicable to proteins with only remote homologs or structural analogs in the PDB database, which represent a significant challenge to current threading methods. Third, RDC-PROSPECT runs significantly faster than almost all existing RDC-based methods, using a novel search algorithm for the principal alignment frame of the RDC data.
2 Methods

An RDC measures the relative angle of an atomic bond in a residue with respect to the principal alignment frame14 of the protein (more rigorously, of each rigid portion of the protein structure). The principal alignment frame, represented as an (x, y, z) Cartesian coordinate system, depends on the medium in which the protein is situated and on the protein structure itself. In this paper, we consider only the RDC data of N-H bonds, the easiest RDC data to obtain experimentally. The RDC value measured by NMR experiments for each N-H bond is defined as15

D = Da (3cos²θ − 1) + 1.5 Dr sin²θ cos 2φ
(1)
where θ is the angle between the bond and the z-axis of the principal alignment frame (x, y, z), and φ is the angle between the bond's projection in the x-y plane and the x-axis; Da and Dr represent the axial and rhombic components of the alignment tensor, respectively. Intuitively, Da and Dr measure the magnitude (intensity) of the alignment. From an NMR experiment, we get a set of {Di} values without knowing which Di corresponds to the N-H bond of which residue, or what the principal alignment frame is. Our goal is to develop a computational procedure that finds a protein fold in the PDB database and searches for an (x, y, z) Cartesian coordinate system producing a set of calculated N-H bond RDC values, using equation (1), that best match the experimental RDC data. In this paper, we solve a constrained version of this fold recognition problem, assuming that the RDC data are already correctly assigned to individual residues.

2.1 Alignment of RDC data with structural fold

The RDC-based fold recognition problem can be rigorously stated as follows. Let D = (D1, ..., DK) be a list of assigned experimental N-H RDC data (DNH) of a target protein. Let D*(T, F) = (D*1, ..., D*M) be the calculated RDC data of a template structure T, assuming the principal alignment frame is F. We want to find an alignment A: i → A(i) between D and D*(T, F) that minimizes the following function:
S(A) = Σi (Di − D*A(i))²/σ² + w1 Σi M(Si, S*A(i)) + w2 Σj pGj    (2)
where Di is aligned with D*A(i), and σ is the standard deviation of the experimental DNH; Si and S*A(i) are the predicted secondary structure type of position i of the target protein and the assigned secondary structure of position A(i) of the template structure; M() is a penalty function for secondary structure type match/mismatch, equal to -1 for a match and 1 for a mismatch; pGj is the total penalty for the j-th gap in the alignment, of the form a + Lj·b, with a the gap opening penalty, b the gap elongation penalty, and Lj the length of the j-th gap (the number of consecutive skipped elements). w1 and w2 are two scaling factors, empirically determined (using simulated data) as w1 = 1 and w2 = 1. The D*(T, F) values of the template structure T are calculated using equation (1) for a specified alignment frame F (we discuss how to systematically search for the correct alignment frame in the next subsection). To estimate Da and Dr in (1), we use the equations of the histogram method proposed by Clore et al.:16

Dzz = 2 Da,    Dyy = −Da (1 + 1.5 Dr/Da)    (3)

where Dzz and Dyy are the maximum or minimum values of the experimental DNH, respectively, with |Dzz| > |Dyy|. θ and φ in equation (1) are calculated for the N-H bond of each residue of the template structure with respect to the specified alignment frame F. We used PSIPRED17 for secondary structure prediction of a target protein sequence. We consider three classes of secondary structure: helix (H), strand (E), and coil (C). In assessing secondary structure matches (using function M()), we consider only PSIPRED predictions with a confidence level of at least 8 on the 0-9 scale. For a prediction with confidence level < 8, we assign a special category U (uncertain) to the position and set M(Si, S*A(i)) = 0 when Si = U. The alignment problem also employs a few additional rules as hard constraints when aligning a list of RDC data with a protein structure.
These include: (a) if a position in the target protein has no assigned RDC data, its corresponding alignment score (the D-portion of (2)) is set to zero; (b) no penalty is charged for gaps at the beginning and end of a global alignment; (c) no alignment gap is allowed in the middle of an H or E secondary structure element of the template structure; and (d) we consider alignment scores defined by (2) only for helix and strand regions, while for coil regions we penalize the length difference of aligned coils. The last rule reflects the following consideration: homologous proteins are generally conserved in their core secondary structures (helices and strands) but not in their coil regions, and detailed alignment of coil regions often hurts fold recognition and alignment accuracy, especially when dealing with remote homologs and structural analogs. We have implemented a simple dynamic programming algorithm for finding the globally optimal solution of this alignment problem under the specified hard
constraints. The dynamic programming algorithm consists of a set of recurrences similar to those of the Needleman-Wunsch algorithm.18 At each step of the recurrence calculation, the hard constraints are checked to guarantee that none is violated.
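As an illustration only (not the authors' code), the back-calculation of equation (1), the tensor estimates of equation (3), and a stripped-down version of the alignment recurrence can be sketched in Python; all function names are ours, a plain linear gap cost stands in for the affine penalty a + Lj·b, and the secondary structure term and hard constraints are omitted:

```python
import math

def rdc(theta, phi, d_a, d_r):
    """Back-calculated N-H RDC from equation (1).

    theta: angle between the bond and the z-axis of the alignment frame
           (radians); phi: angle between the bond's x-y projection and
           the x-axis; d_a, d_r: axial and rhombic tensor components."""
    return (d_a * (3.0 * math.cos(theta) ** 2 - 1.0)
            + 1.5 * d_r * math.sin(theta) ** 2 * math.cos(2.0 * phi))

def estimate_tensor(d_exp):
    """Estimate (Da, Dr) from the extremes of the experimental RDC
    histogram, per equation (3): Dzz = 2 Da, Dyy = -Da (1 + 1.5 Dr/Da)."""
    hi, lo = max(d_exp), min(d_exp)
    d_zz, d_yy = (hi, lo) if abs(hi) > abs(lo) else (lo, hi)  # |Dzz| > |Dyy|
    d_a = d_zz / 2.0
    d_r = -(d_yy + d_a) / 1.5
    return d_a, d_r

def align_cost(target, template, pair_cost, gap=1.0):
    """Needleman-Wunsch-style global alignment minimizing a total cost.

    pair_cost(a, b) plays the role of the per-position D-term of (2)."""
    n, m = len(target), len(template)
    # dp[i][j] = best cost of aligning target[:i] with template[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + pair_cost(target[i - 1], template[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[n][m]
```

For instance, with Da = 10 and Dr = 3, a bond along the z-axis (θ = 0) back-calculates to D = 2·Da = 20, the Dzz extreme used by equation (3).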
2.2 Assessment of prediction confidence

Because the alignment scores are not normalized with respect to sequence length and amino acid composition, we use a Z-score to assess the quality of an alignment. For an RDC alignment problem with a set of experimental RDC data DNH and a template structure T, we calculate the Z-score of the alignment score T0 as follows. The RDC data, with their respective secondary structure types, are randomly shuffled multiple times, and for each reshuffled RDC list we calculate the alignment score against the template T. The Z-score of T0 is defined as

Z = (Tr − T0) / σr
(4)
where Tr and σr are the mean and standard deviation of the alignment scores of the reshuffled RDC lists. For the current work, we use 500 reshufflings (we have also tried significantly larger numbers of reshufflings, but found that 500 gives Z-scores similar to those obtained with higher numbers). Figure 1 plots Z-score against fold recognition specificity on our test set of 33 proteins, searched against our template structure database. For example, when the Z-score is > 20, the prediction specificity is > 70%.
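A minimal sketch of this shuffle test (illustrative only; `score_fn` stands in for the full alignment of Section 2.1, and since lower alignment scores are better here, the sign follows equation (4)):

```python
import random

def z_score(t0, rdc_list, score_fn, n_shuffles=500, seed=0):
    """Z = (Tr - T0) / sigma_r, where Tr and sigma_r are the mean and
    standard deviation of scores over reshuffled copies of the RDC list."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    scores = []
    for _ in range(n_shuffles):
        shuffled = rdc_list[:]
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    mean = sum(scores) / n_shuffles
    sigma = (sum((s - mean) ** 2 for s in scores) / n_shuffles) ** 0.5
    return (mean - t0) / sigma
```

A genuinely good alignment (low T0) then yields a large positive Z, matching the specificity trend of Figure 1.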
Figure 1. Fold recognition Z-score versus prediction specificity
2.3 Principal alignment frame search and fold recognition
One of the challenging issues with RDC-based fold recognition is that the principal alignment frame, required for calculating RDC values using equation (1), is not known from the experimental data. If the 3D structure of the target protein were known, this problem would be equivalent to finding the correct rotation, in a fixed 3D Cartesian coordinate system of the structure, that gives the (θ, φ) angles of its N-H bonds, and hence the calculated RDC values, that best match the experimental RDC data. For our fold recognition work, the problem is to find the rotation of a template structure that gives the best match with the experimental data, as defined by equations (2) and (4). Note that any rotation of a 3D protein structure
(say in PDB format) can be accomplished by a combination of clockwise rotations around the x-axis by α degrees and around the z-axis by γ degrees. More specifically, the new coordinates of a data point [x, y, z], after an (α, γ)-rotation, can be calculated as

[x', y', z']ᵀ = Rz(γ) Rx(α) [x, y, z]ᵀ

where the two rotation matrices are defined as

Rx(α) = | 1     0      0    |      Rz(γ) = |  cos γ  sin γ  0 |
        | 0   cos α  sin α  |              | −sin γ  cos γ  0 |
        | 0  −sin α  cos α  |              |    0      0    1 |
For each given template structure, our fold recognition algorithm searches through all possible (α, γ)-rotations. For each (α, γ)-rotation, the algorithm employs the alignment algorithm of Section 2.1 to find the optimal alignment between the (assigned) experimental RDC data and the calculated RDC data for the template under this particular rotation. Note that the range of both α and γ is between 0 and 180 degrees; there is no need to consider 180 < α, γ ≤ 360 because of the four-fold degeneracy of RDC data. We have extensively tested and evaluated different increments for α and γ, ranging from 1 degree to 30 degrees. We found that the search surface (made of values of the calculated RDC) over the (α, γ)-plane is very smooth, and an increment of 30 degrees is adequate for our fold recognition, so we use 30 degrees as the default increment in RDC-PROSPECT. For each template, the algorithm thus conducts 36 (6×6) RDC data alignments; the alignment with the best score among the 36 is taken as the best alignment between the RDC data and the template. When a very accurate alignment frame is needed, we use a finer grid for searching the (α, γ) angles, at the cost of longer search time. Our overall fold recognition procedure is carried out as follows. For each set of assigned RDC data, we search our template database consisting of all proteins in the SCOP40 database.19 Currently, SCOP40 (release 1.63 of May 2003) consists of approximately 5,200 protein domains covering 765 folds and 2,164 families. Hydrogen atoms are added to each structure using the program REDUCE.20 Secondary structure assignment is carried out using the program DSSPcont.21 For each template, we calculate the Z-score of its best alignment with the experimental RDC data using equation (4). Then all templates are ranked based on their alignment raw scores.
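The (α, γ) grid search can be sketched in Python (an illustrative sketch: `score_fn` stands in for the Section 2.1 alignment, and the rotation sign conventions are our assumption):

```python
import math

def rot_x(a):
    """Clockwise rotation matrix about the x-axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def rot_z(g):
    """Clockwise rotation matrix about the z-axis by angle g (radians)."""
    c, s = math.cos(g), math.sin(g)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def best_rotation(bonds, score_fn, step_deg=30):
    """Grid search over (alpha, gamma) in [0, 180) at step_deg increments.

    The default 30-degree step gives 6 x 6 = 36 alignments per template;
    [180, 360) is redundant by the four-fold degeneracy of RDC.
    bonds: N-H unit vectors of the template; score_fn maps the rotated
    vectors to an alignment score (lower is better)."""
    best_angles, best_score = None, float("inf")
    for a_deg in range(0, 180, step_deg):
        for g_deg in range(0, 180, step_deg):
            rx = rot_x(math.radians(a_deg))
            rz = rot_z(math.radians(g_deg))
            rotated = [mat_vec(rz, mat_vec(rx, v)) for v in bonds]
            s = score_fn(rotated)
            if s < best_score:
                best_angles, best_score = (a_deg, g_deg), s
    return best_angles, best_score
```

Only two nested loops are needed (N² rather than N³ rotation combinations), which is the source of the speedup discussed in Section 4.1.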
3 Results

We have tested RDC-PROSPECT on all publicly available N-H RDC data deposited in the BMRB database and published in the literature (as of July 2003), comprising 51 sets of RDC data for 33 proteins. The goal of the tests is to evaluate the fold recognition rate using RDC data (plus predicted secondary structure of a target protein) and the accuracy of the alignment with the correct structural folds. Tables 1 and 2 summarize the fold recognition and alignment results on the 33 proteins using the 51 sets of RDC data; for some proteins, there are multiple sets of RDC data collected by different labs and/or in different ordering media. For the fold recognition prediction, we consider a prediction correct if a member protein from the same family or superfamily as the target protein is ranked in the top three among all proteins in SCOP40, and incorrect otherwise. From Table 1, we can see that RDC-PROSPECT correctly identified the structural folds for 41 out of 51 RDC data sets (80.4% success rate), and identified 26 structural folds for the 33 target proteins (78.8% success rate). Hence we consider the performance of RDC-PROSPECT quite successful, even under our very conservative definition of correct fold recognition, i.e., ranked among the top three out of thousands of possible structures. Unfortunately, there is very little published data for other RDC-based structure prediction programs; most were tested on only one protein, ubiquitin. The only meaningful comparison we can make is with RosettaNMR, which was tested on 4 proteins using experimental RDC data, ubiquitin (1d3z), BAF (1cmz), cyanovirin-N (1ci4), and GAP (2ezx), and 7 proteins using simulated RDC data.10 Of the 4 proteins with experimental data, RosettaNMR predicted correct structures for 1d3z and 1cmz, and partially (~50%) correct structures for 1ci4 and 2ezm.
Our program correctly identified the backbone structures for 1d3z, 1cmz, and 2ezx (the same protein as 1ci4), but did not find the correct structural fold for 2ezm due to inadequate secondary structure information (only 9.9% of the residues have reliable secondary structure predictions by PSIPRED). From Table 2, we can see that the alignment accuracy for the 26 target proteins with correct fold recognition is very high. The percentage of 4-shifts is commonly used for assessing threading alignment accuracy; RDC-PROSPECT achieved an average alignment accuracy of 97.9% of residues aligned within a 4-residue shift of their correct positions. None of the other RDC-based structure prediction programs provides this kind of statistics. Figure 2 shows the predicted structures (right) versus the actual structures (left) for four target proteins with < 25% sequence identity with their best structural templates.

4 Discussion
Our results have clearly demonstrated that RDC-based fold recognition, when
Table 1. A summary of fold recognition accuracy

[Table body not reproduced: the column entries were scattered beyond reliable reconstruction in this copy.]

Only the highest ranked correct template is listed for each protein. The first two columns give the target id in our test and its PDB code. The third column gives the sequence length of the target. The fourth column gives the id of the RDC data set; some proteins have multiple data sets. The fifth and sixth columns give the correct template id in SCOP code and its sequence length. The seventh column gives the sequence identity between a target protein and its correct template. The eighth column shows the rank of the top template among all SCOP40 proteins, and the ninth the corresponding Z-score. No correct templates were identified among the top three for proteins 27-33 (including 1d2b, 1ghh, 1o8r, 1qnl, 2ezm, 2gat, 4gat).
Table 2. Summary of alignment accuracy

Shift          0-shift   1-shift   2-shift   3-shift   4-shift
Accuracy (%)    63.1      90.1      95.3      96.8      97.9

x-shift represents the percentage of residues that are within x residues of their correct alignment positions.
1ap4        1d3z        1j6t,A        11qQ

Figure 2. Actual (left) and predicted (right) structures of four target proteins with < 25% sequence identity with their best structural folds in SCOP40.
coupled with predicted secondary structure, is highly effective and robust for identifying native-like structural folds and predicting backbone structures. Our test examples cover a wide range of prediction scenarios: the test proteins span 5 SCOP classes and more than 20 SCOP fold families with varying sequence lengths; their N-H RDC data coverage ranges from 43.4% to 95.5%; and their predicted secondary structure coverage ranges from 9.9% to 76.3% (the predictions for the remaining residues are "uncertain" and hence not used). We now discuss some key advantages and unsolved issues of RDC-PROSPECT, along with some future developments.
4.1 Efficient algorithm for alignment tensor orientation search

If we use N to denote the number of rotation angles to search along each axis, previous similar algorithms9,22 all require N³ combinations of rotations, while our algorithm requires only N², saving at least one order of magnitude of search time and making our program much faster than other similar programs.
4.2 Combination of RDC data and predicted secondary structure for fold recognition We found that predicted secondary structure, though not perfect, complements the RDC data for fold recognition. While RDC data are good for identification of global
structural environment, secondary structure is good for finding the local structural environment (e.g., in a helix or in a strand). Our test data have shown that without either one of the two types of data, RDC-PROSPECT's performance drops significantly. In this work, we used predicted secondary structures based on protein sequence information only. Secondary structures could actually be derived more accurately using experimental data, such as chemical shift data. The only reason we did not use chemical shifts is that only 10 of the 33 proteins have such data available in the BMRB database. Using chemical shift data would improve the performance of the program; for example, the otherwise missed correct template for the protein 2ezm can be identified when chemical-shift-based secondary structure prediction is used.
4.3 Why can some protein structures not be correctly predicted?

For 7 of the 33 target proteins, RDC-PROSPECT did not place the correct structural fold in the top three templates. We have done a detailed analysis of the failed predictions and found that the failures can be attributed to two classes of reasons.

a. Proteins composed mainly of coils: this group includes 1o8r, 1qnl, 2gat, 4gat (6gat). As discussed in Section 2, RDC-PROSPECT considers only coil length conservation and does not conduct detailed alignment for coil regions. When a protein is composed mainly of coils, RDC-PROSPECT does not perform well. Work is currently under way to improve on such cases.

b. Others: various other reasons can also contribute to the failure of our RDC-based fold recognition, ranging from inaccurate estimation of Da and Dr, to incorrect prediction of secondary structures, to errors in the measured RDC data.

In this work, we have used raw RDC data without correcting for contributions from internal dynamics, and our results suggest this is feasible in practice. As Rohl and Baker discussed,10 internal dynamics likely contribute to the observed RDC to a greater extent in flexible loops. Since our method does not perform detailed alignment in coil regions, the effect of dynamics that could potentially harm the alignment is greatly alleviated.
4.4 Comparisons with DipoCoup

DipoCoup is a popular program for 3D structure homology search using RDC and pseudo-contact shifts together with secondary structure information. A basic problem with DipoCoup is that it does not use gap penalties in alignment, which significantly limits its applicability. In contrast, RDC-PROSPECT allows the flexibility of having gaps inside or outside secondary structures. Moreover, DipoCoup uses secondary structure fragments as the alignment unit, while RDC-PROSPECT conducts alignments at the residue level, making it more flexible and robust. This also allows us to use sparse secondary structure information, which DipoCoup cannot handle.
4.5 Assignment of RDC data

Like other RDC-based structure prediction programs, RDC-PROSPECT assumes that the RDC data have been assigned to individual residues. This should not limit its applications, as sequential assignments of NMR data (RDC data included), unlike NOE data assignments, are generally solvable using existing programs. A recently published work by Coggins & Zhou23 achieved ~80% assignments without any error for 27 test proteins using their PACES program. Assignments at this level are adequate for RDC-PROSPECT to perform well for most proteins. We have previously published an algorithm/software24 for sequential assignment of NMR data using chemical shift data, and we are in the process of merging the two programs to do fold recognition using unassigned RDC data.
In conclusion, our method has convincingly demonstrated the capability of fast and accurate protein fold recognition through combining sparse RDC data with threading technology. An important feature of our RDC-based homology search method is that it does not use sequence information for alignment; our program thus provides a good complementary and cross-checking tool for conventional threading methods. It is especially attractive in low sequence identity situations, where conventional structure prediction methods generally do not perform reliably. As we continue this project, we will (a) use chemical shift data for more reliable prediction of secondary structures, (b) include other types of RDC data, such as C-H RDC, which can be easily added into the framework of RDC-PROSPECT, and (c) include traditional statistics-based threading energy terms, such as pair-wise interaction potentials, in our RDC-based fold recognition method, as in our threading program PROSPECT.25 We expect RDC-PROSPECT to prove useful for high-throughput structure determination in structural genomics projects, as it can efficiently and effectively fit sparse RDC data, obtained from a minimal number of NMR experiments, to solved structures.
Acknowledgments

This work was funded in part by the Structural Biology Program of the Office of Health and Environmental Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725, managed by UT-Battelle, LLC. We thank Drs. Nitin Jain, Dong Xu and Dongsup Kim for helpful discussions.
References
1. J.R. Tolman, J.M. Flanagan, M.A. Kennedy, and J.H. Prestegard, Proc. Natl. Acad. Sci. U.S.A. 92, 9279 (1995)
2. N. Tjandra and A. Bax, Science 278, 1111 (1997)
3. J.R. Tolman, Curr. Opin. Struct. Biol. 11, 532 (2001)
4. J.H. Prestegard, Nat. Struct. Biol. 5 Suppl, 517 (1998)
5. J.H. Prestegard, H.M. Al-Hashimi, and J.R. Tolman, Q. Rev. Biophys. 33, 371 (2000)
6. A. Bax, Protein Sci. 12, 1 (2003)
7. F. Delaglio, G. Kontaxis, and A. Bax, J. Am. Chem. Soc. 122, 2142 (2000)
8. J.C. Hus, D. Marion, and M. Blackledge, J. Am. Chem. Soc. 123, 1541 (2001)
9. F. Tian, H. Valafar, and J.H. Prestegard, J. Am. Chem. Soc. 123, 11791 (2001)
10. C.A. Rohl and D. Baker, J. Am. Chem. Soc. 124, 2723 (2002)
11. A. Annila, H. Aitio, E. Thulin, and T. Drakenberg, J. Biomol. NMR 14, 223 (1999)
12. J. Meiler, W. Peti, and C. Griesinger, J. Biomol. NMR 17, 283 (2000)
13. D. Lee, A. Grant, D. Buchan, C. Orengo, Curr. Opin. Struct. Biol. 13, 359 (2003)
14. J.A. Losonczi, M. Andrec, M.W. Fischer, and J.H. Prestegard, J. Magn. Reson. 138, 334 (1999)
15. G.M. Clore, A.M. Gronenborn, and N. Tjandra, J. Magn. Reson. 131, 159 (1998)
16. G.M. Clore, A.M. Gronenborn, and A. Bax, J. Magn. Reson. 133, 216 (1998)
17. D.T. Jones, J. Mol. Biol. 292, 195 (1999)
18. S.B. Needleman, C.D. Wunsch, J. Mol. Biol. 48, 443 (1970)
19. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995)
20. J.M. Word, S.C. Lovell, J.S. Richardson, and D.C. Richardson, J. Mol. Biol. 285, 1735 (1999)
21. P. Carter, C.A. Andersen, and B. Rost, Nucleic Acids Res. 31, 3293 (2003)
22. J.C. Hus, J.J. Prompers, and R. Bruschweiler, J. Magn. Reson. 157, 119 (2002)
23. B.E. Coggins and P. Zhou, J. Biomol. NMR 26, 93 (2003)
24. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang, IEEE Computing in Science & Engineering 4, 50 (2002)
25. Y. Xu and D. Xu, Proteins 40, 343 (2000)
COMPUTATIONAL AND SYMBOLIC SYSTEMS BIOLOGY

T. IDEKER
Department of Bioengineering, U.C. San Diego, La Jolla, CA 92093
[email protected]

E. NEUMANN
Beyond Genomics, 40 Bear Hill Road, Waltham, MA
[email protected]
V. SCHACHTER
GENOSCOPE (National Consortium for Genomics Research)
2, rue Gaston Crémieux, F-91000 EVRY, FRANCE
[email protected]
It has become increasingly evident that the use of large-scale experimental data and the invocation of 'Systems Biological' principles are gaining widespread acceptance in mainstream biology. Systems Biology involves the use of global cellular measurements (genomic, proteomic, and metabolomic) to construct computational models of cellular processes and disease. It typically involves an iterative computational/experimental cycle of 1) inferring an initial model of the cellular process of interest through sequencing, expression profiling, and/or molecular interaction mapping projects; 2) perturbing each model component and recording the corresponding global cellular response to each perturbation; 3) integrating the measured global responses with the current model; and 4) formulating and testing new hypotheses for unexpected observations. Recent technological developments are enabling us to define and interrogate cellular processes more directly and systematically than ever before, using two complementary approaches. First, it is now possible to systematically measure pathway interactions themselves, such as those between proteins and proteins or between proteins and DNA. Several methods are available for measuring protein-protein interactions at large scale, two of the most popular being the two-hybrid system and protein coimmunoprecipitation in conjunction with tandem mass spectrometry. Protein-DNA interactions, as commonly occur between transcription
factors and their DNA binding sites, are also being measured systematically using the technique of chromatin immunoprecipitation. Other types of molecular interactions and reactions, such as those involving metabolites and drugs, have been culled from the literature and stored in large, publicly-accessible databases such as MetaCyc and KEGG. A second major approach for interrogating pathways has been to systematically measure the molecular and cellular states induced by the pathway structure. For example, global changes in gene expression are measured with DNA microarrays, while changes in protein and metabolite concentrations may be quantitated with mass spectrometry, NMR, and other advanced techniques. The amount of quantitative data these experiments yield is on the order of thousands of individual molecular channels, and has been used to successfully identify patterns indicative of biological responses or disease states. However, it has become apparent that single genes or their products do not cause most of the biological phenomena observed. These findings have drawn researchers to the conclusion that the most interesting phenomena in biology result from the interrelated actions of many components within the system as a whole. Recent computational approaches to Systems Biology have involved formulating both molecular interactions and molecular states into computational pathway models of various types. The amount of research in this area has exploded in recent years, as witnessed by the number of research presentations at meetings such as PSB, RECOMB, the Biopathways Consortium, and the International Conference on Systems Biology. Although much of this research has focused on systems of differential equations and other numerical pathway simulations, a variety of model types and formalisms are in fact possible. Models may be numerically computable, but they may also be symbolic and accessible to inferential logic.
Logical formalisms that describe complex phenomena are just as important as modeling molecular dynamics, and may lead to faster insight where the computational complexities are too great for a full-scale simulation. These research areas need to be pursued in parallel to more numerically-driven approaches, since they may offer a way to merge much of the symbolic knowledge derived from existing biological research. In support of this view, almost half of the papers presented in this session involve the use of logical formalisms for modeling pathways, pathway dynamics, and/or network inference. Symbolic logic is used to analyze protein functional domains (Talcott et al.); to infer novel metabolic pathways using information on known pathways and the biochemical structures of their metabolites (McShan et al.); or to
model cell-cell interactions using a stochastic extension of the pi-calculus (Lecca et al.). Many of these papers combine more than one large-scale data type, including gene expression profiles, protein-protein interaction data, and/or pathway databases. Another group of papers concentrates on either new formal representations for network inference or efficient experimental design, i.e., choosing an optimal set of gene deletions, overexpressions, or other experiments to maximize the information gained about the network. Of particular interest here is work by Gat-Viks et al. on representing gene regulation by 'chain functions'; inferring a system of differential equations through systematic overexpressions (di Bernardo et al.); and methods for decomposing gene expression data into its component cellular processes within a Bayesian framework (Lu et al.). Finally, as an overlapping theme, several papers point to how Systems Biology may be used as part of a high-throughput drug discovery and development platform. For instance, the work by McShan et al. might be used to explore how newly developed drugs will be metabolised by the body; the work by di Bernardo et al. could be applied to predict primary drug targets based on the pathways they affect; while the work of Kightley et al. is a method for network inference submitted by researchers in the biotechnology/pharma industry. The field of Systems Biology still includes many challenges and holds much promise. By increasing our repertoire of model representations and analytical formalisms, the methods explored here are the starting points for numerous advances in biotechnology, not the least of which is an enhanced ability to target therapeutics appropriately in diseased cells. Thus, we move one step closer to the day in which computational pathway modeling techniques will have widespread impact and acceptance within basic biological research and replace high-throughput screening as a de facto standard in “big pharma”.
A MIXED INTEGER LINEAR PROGRAMMING (MILP) FRAMEWORK FOR INFERRING TIME DELAY IN GENE REGULATORY NETWORKS

M. S. DASIKA, A. GUPTA AND C. D. MARANAS
Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802
E-mail: {msd179, axg218, costas}@psu.edu

In this paper, an optimization based modeling and solution framework for inferring gene regulatory networks while accounting for time delay is described. The proposed framework uses the basic linear model of gene regulation. Boolean variables are used to capture the existence of discrete time delays between the various regulatory relationships. Subsequently, the time delay that best fits the expression profiles is inferred by minimizing the error between the predicted and experimental expression values. Computational experiments are conducted for both in numero and real expression data sets. The former reveal that if time delay is neglected in a system a priori known to be characterized by time delay, then a significantly larger number of parameters is needed to describe the system dynamics. The real microarray data example reveals a considerable number of time-delayed interactions, suggesting that time delay is ubiquitous in gene regulation. Incorporation of time delay leads to inferred networks that are sparser. Analysis of the amount of variance in the data explained by the model, and comparison with randomized data, reveals that accounting for time delay explains more variance in real than in randomized data.
1 Introduction
The advent of microarray technology has made it possible to gather genome-wide expression data. In addition to experimentally quantifying system-wide responses of biological systems, these technologies have provided a major impetus for developing computational approaches for deciphering the gene regulatory networks that control the response of these systems to cellular and environmental stimuli. A complete understanding of the organization and dynamics of gene regulatory networks is an essential first step towards realizing this goal [1, 2]. To date, many computational/algorithmic frameworks have been proposed for inferring regulatory relationships from microarray data. Initial efforts primarily relied on the clustering of genes based on similarity in their expression profiles [3]. This was motivated by the hypothesis that genes with similar expression profiles are likely to be coregulated. Hwang et al. [4] and Stephanopoulos et al. [5] extended these clustering approaches to classify distinct physiological states. However, clustering approaches alone cannot extract any causal relationships among the genes. Many researchers have attempted to explain the regulatory network structure by modeling it as Boolean networks [6, 7]. These networks model the state of a gene as either ON or OFF, and the input-output relationships are postulated as logical functions. Measured transcript levels, however, vary in a continuous manner, implying that
the idealizations underlying Boolean networks may not be appropriate and more general models are required [8]. Recently, there have been many attempts to develop approaches that can uncover the extent and directionality of the interactions among the genes, rather than simply grouping genes based on their expression profiles. These approaches include the modeling of genetic expression using differential equations [9-11], Bayesian networks [12] and neural networks [13]. Even though a lot of progress has been made, key biological features such as time delay have been left largely unaddressed in the context of inferring regulatory networks. Experimentally measured time delay in gene expression has been widely reported in the literature [14-16]. However, on the computational front, the fact that gene expression regulation might be asynchronous in nature (i.e., the expression profiles of all the genes in the system may not be regulated simultaneously) has largely been left unexplored. From a biological viewpoint, time delay in gene regulation arises from the delays characterizing the various underlying processes such as transcription, translation and transport. For example, time delay in regulation may result from the time taken for the transport of a regulatory protein to its site of action. Consequently, accounting for this key attribute of the regulatory structure is essential to ensure that the proposed inference model accurately captures the dynamics of the system. Prominent among the initial efforts made to incorporate time delay is the framework developed by Yildirim and Mackey [17]. The authors examined the effect of time delay in a previously developed mechanistic model of gene expression in the Lac operon [18]. Chen et al. [9] proposed a general mathematical framework to incorporate time delay but did not apply it to any gene expression data to produce verifiable results.
While interesting, these methods are not scalable to large expression data sets where the mechanistic details are often absent. Quin et al. [19] have proposed a time-shifted correlation based approach to infer time delay using dynamic programming. Since this approach relies on pairwise comparisons, it fails to recognize the potential existence of multiple regulatory inputs with different time delays. In this paper, we propose an optimization based modeling and solution framework for inferring gene regulatory relationships while accounting for time delays in these interactions using mixed-integer linear programming (MILP). We compare the proposed model with a model that does not account for time delay, both in terms of its capability to uncover a target network that exhibits time delays for a test example and in terms of computational requirements. The rest of the paper is organized as follows. In the following section, a detailed description of the proposed model formulation is provided. Subsequently, the performance of the proposed model is evaluated on two data sets (one in numero, one real). Finally, concluding remarks are provided and the work is summarized.
2 Method
Here, an inference method is described for extracting the regulatory inputs for each gene in a genetic regulatory network, while accounting for time delays in the system. To this end, the linear model of network inference [20-22] is adopted as a benchmark and modified to account for time delay as shown in Eq 1.
$$\dot{z}_i(t) = \frac{z_i(t+1) - z_i(t)}{\Delta t} = \sum_{\tau=0}^{\tau_{\max}} \sum_{j=1}^{N} w_{ji\tau}\, z_j(t-\tau), \qquad i = 1,2,\ldots,N;\; t = 1,2,\ldots,T \qquad (1)$$
In Eq 1, z_i(t) is the expression level of gene i at time point t and w_{jiτ} is the regulatory coefficient that captures the regulatory effect of gene j on gene i. The index τ indicates that this regulation has a time delay of τ associated with it, while the integer parameter τ_max denotes the longest time delay accounted for. Note that the frequency at which gene expression is sampled through the microarray experiment determines the maximum amount of biologically relevant time delay that can be inferred. For example, if the time points are separated by seconds/minutes then a higher value of τ_max can be used. Subsequently, if w_{jiτ} > 0 then gene j activates gene i with a time delay τ, while if w_{jiτ} < 0 then gene j represses gene i with a time delay τ.
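As an illustration of Eq 1, the following sketch evaluates the delayed linear model for a toy two-gene system and compares it with the forward-difference derivative; all names and numerical values here are invented for the example and are not taken from the paper's data.

```python
# Sketch of the delayed linear model of Eq 1 (illustrative values only).
def predict_derivative(z, w, i, t):
    """z[j][t] = expression of gene j at time t; w[(j, i, tau)] = regulatory
    coefficient of gene j on gene i with delay tau. Returns the modeled
    derivative: sum over (j, tau) of w[j, i, tau] * z[j][t - tau]."""
    return sum(coef * z[j][t - tau]
               for (j, gi, tau), coef in w.items()
               if gi == i and t - tau >= 0)

# Two genes, four time points; gene 0 activates gene 1 with a delay of 1.
z = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.5, 1.5, 2.5]]
w = {(0, 1, 1): 1.0}          # w_{j=0, i=1, tau=1} = 1.0

# Forward-difference "observed" derivative of gene 1 at t = 1 (dt = 1):
obs = z[1][2] - z[1][1]       # 1.5 - 0.5 = 1.0
pred = predict_derivative(z, w, i=1, t=1)   # 1.0 * z[0][0] = 1.0
```

Here the modeled and forward-difference derivatives agree at t = 1 by construction, since the toy profile of gene 1 was built to follow gene 0 with a unit delay.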
$$Y_{ji\tau} = \begin{cases} 1 & \text{if gene } j \text{ regulates gene } i \text{ with a time delay } \tau \\ 0 & \text{otherwise} \end{cases}$$
Subsequently, the network inference model with time delay is formulated as the following mixed integer linear programming (MILP) model.
$$\text{Minimize} \quad \varepsilon = \sum_{i=1}^{N} \sum_{t=1}^{T} \left( e_i^+(t) + e_i^-(t) \right) \qquad (2)$$

subject to

$$\dot{z}_i(t) - \sum_{\tau=0}^{\tau_{\max}} \sum_{j=1}^{N} w_{ji\tau}\, z_j(t-\tau) = e_i^+(t) - e_i^-(t), \quad \forall\, i = 1,2,\ldots,N;\; t = 1,2,\ldots,T \qquad (3)$$

$$\Omega^{L}_{ji\tau}\, Y_{ji\tau} \le w_{ji\tau} \le \Omega^{U}_{ji\tau}\, Y_{ji\tau}, \quad \forall\, i, j, \tau \qquad (4)$$

$$\sum_{\tau=0}^{\tau_{\max}} Y_{ji\tau} \le 1, \quad \forall\, i, j \qquad (5)$$

$$\sum_{j=1}^{N} \sum_{\tau=0}^{\tau_{\max}} Y_{ji\tau} \le N_i, \quad \forall\, i = 1,2,\ldots,N \qquad (6)$$

$$e_i^+(t),\, e_i^-(t) \ge 0 \qquad (7)$$

$$Y_{ji\tau} \in \{0, 1\} \qquad (8)$$
The objective function (Eq 2) minimizes the total (over all genes and time points) absolute error ε between the predicted and the experimental expression values. The absolute value of the error is determined from Eq 3 through the positive and negative error variables e_i^+(t) and e_i^-(t), respectively. For a given gene i and time point t, only one of these variables can be non-zero. Specifically, if the error is positive then e_i^+(t) is non-zero, while if the error is negative then e_i^-(t) is non-zero. This property arises from the fact that, when the constraints of the model are placed in matrix form, the columns associated with these two variables are linearly dependent. Consequently, the linear programming (LP) theory principle stating that the columns of the basic variables (variables that are non-zero at the optimal solution) are linearly independent ensures the above property. Eq 4 ensures that the coefficients for all regulatory relationships not present in the network are forced to zero. In this constraint, Ω^L_{jiτ} and Ω^U_{jiτ} are the lower and upper bounds, respectively, on the values of the regulatory coefficients. Eq 5 imposes the constraint that each regulatory interaction, if it exists, may assume only a single value of time delay associated with it, while Eq 6 limits N_i, the maximum number of regulatory inputs to gene i.
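The paper solves this MILP with an industrial solver; as a solver-free illustration of what the binary variables Y encode, the sketch below enumerates candidate (regulator, delay) pairs for a single target gene with at most one regulatory input (N_i = 1) and fits each coefficient by least squares. The function name and toy data are assumptions for the example, not part of the paper's implementation.

```python
def infer_single_input(zdot_i, z, tau_max):
    """Enumerate candidate (regulator j, delay tau) pairs -- the role played
    by the binary variables Y in the MILP -- and fit the single coefficient
    w by least squares. Returns (best_j, best_tau, best_w, best_abs_err)."""
    best = None
    T = len(zdot_i)
    for j, zj in enumerate(z):
        for tau in range(tau_max + 1):
            xs = [zj[t - tau] for t in range(tau, T)]
            ys = [zdot_i[t] for t in range(tau, T)]
            denom = sum(x * x for x in xs)
            if denom == 0:
                continue                      # regulator carries no signal
            w = sum(x * y for x, y in zip(xs, ys)) / denom
            err = sum(abs(y - w * x) for x, y in zip(xs, ys))
            if best is None or err < best[3]:
                best = (j, tau, w, err)
    return best

# Toy data: gene 1's derivative is exactly 2 * z_0(t - 1).
z = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
zdot_1 = [0.0, 2.0, 4.0, 6.0]
j, tau, w, err = infer_single_input(zdot_1, z, tau_max=2)
# recovers j = 0, tau = 1, w = 2.0 with zero error
```

Unlike this brute-force sketch, the MILP handles multiple simultaneous regulators (up to N_i) and lets the solver prune the combinatorial search.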
The proposed framework has a number of key advantages. The basic linear model with no time delay is a special case of the proposed model and can be recovered by including the following constraint.
$$Y_{ji\tau} = 0, \quad \forall\, i, j = 1,2,\ldots,N;\; \tau > 0 \qquad (9)$$

Additional environmental stimuli may be incorporated by introducing an additional node that describes the influence of the stimulus into the network. Furthermore, various biologically relevant hypotheses can be tested by translating them into
either additional/alternative constraints or objective functions. For example, one of the hypotheses recently proposed concerns the robustness of gene regulatory networks, defined as the ability of these networks to effectively tolerate random fluctuations in gene expression levels [23, 24]. Within the context of the linear model, this translates into having small values of the regulatory coefficients w_{jiτ}, so that small variations in the expression levels of gene j have a small impact on the rate of change of expression of gene i. From a statistical perspective, the proposed framework can be used to capture the trade-off between the degree of model fit and the number of model parameters. By systematically varying the maximum number of regulatory inputs to a particular gene and computing the resulting minimum error, a trade-off curve between accuracy and model complexity can be generated. This curve provides an appropriate means for determining the critical number of regulatory inputs above which the model is tending towards over-fitting of the data. In a system with N genes, there will be N²(τ_max + 1) binary variables, implying a total of 2^{N²(τ_max + 1)} possible alternatives for the network connectivity. Even for a relatively small network inference setting, it is computationally expensive to conduct an exhaustive search through these alternatives. The computational requirements can be reduced, to a certain extent, by exploiting the decomposable structure of the proposed model. This is achieved by recognizing that the model can be solved for each gene i separately without any loss of generality. Note, however, that this model structure is lost if an overall maximum connectivity constraint is imposed in the same spirit as the individual gene maximum connectivity constraint (Eq 6).
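A quick back-of-the-envelope computation (illustrative only; the function name is our own) shows how fast the number of candidate topologies grows, and how the gene-by-gene decomposition shrinks each subproblem:

```python
# N^2 * (tau_max + 1) binary variables overall, hence
# 2 ** (N^2 * (tau_max + 1)) candidate connectivity patterns.
def n_alternatives(N, tau_max):
    return 2 ** (N * N * (tau_max + 1))

# Even a tiny 5-gene network with tau_max = 2 is far beyond enumeration:
total = n_alternatives(5, 2)        # 2**75, about 3.8e22 alternatives

# Solving gene-by-gene (the decomposition described above) leaves only
# N * (tau_max + 1) binaries per subproblem:
per_gene = 2 ** (5 * (2 + 1))       # 2**15 = 32768 alternatives per gene
```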
In addition to improved computational performance, another key advantage of the decomposable property is that it limits the amount of computational resources that need to be expended if only a sub-network involving a sub-set of the genes is to be inferred. The key parameters that determine the computational complexity of the proposed model are the bounds Ω^L_{jiτ} and Ω^U_{jiτ} imposed on the regulatory coefficients in Eq 4.
While in certain special application settings there are pre-specified upper and lower bounds that are part of the model, in our proposed model these bounds are not known a priori. For such cases, typically the "Big-M" approach is utilized, whereby arbitrarily large/small bounds are imposed [25]. Such a simplistic approach circumvents the need to determine tight valid bounds, although at the expense of much higher computational requirements. On the other hand, if tight invalid bounds are specified, the computational gains realized will be off-set by the inability to attain the global optimal solution. In light of this trade-off between computational requirements and quality of the optimal solution, a sequential bound relaxation
procedure is developed and described next. As a starting point for this procedure, for a given gene i*, both the upper and lower bounds are fixed such that

$$|\Omega^{L}_{i^*}| = |\Omega^{U}_{i^*}| = \Omega^{0}_{i^*}$$

The initial value of the bound is selected based on the scaling of the expression values. Specifically, this initial bound value is determined as a value proportional to the ratio of the order of magnitude of the derivative values and that of the expression values. Subsequently, given these bounds, the inference model is solved to obtain the optimal values of the regulatory coefficients w_{ji*τ}(Ω⁰_{i*}) and the absolute error

$$\varepsilon_{i^*}(\Omega^{0}_{i^*}) = \sum_{t=1}^{T} \left( e_{i^*}^+(t) + e_{i^*}^-(t) \right)$$

Next, the bounds are relaxed such that

$$\Omega^{1}_{i^*} = (1 + \delta_{i^*}) \cdot \Omega^{0}_{i^*}, \quad \text{where } 0 < \delta_{i^*} \le 1,$$

followed by re-optimization of the model with these updated bounds. Since the relaxation of the bounds leads to a larger feasible region, it is guaranteed that ε_{i*}(Ω¹_{i*}) ≤ ε_{i*}(Ω⁰_{i*}). These two steps of bound relaxation and optimization
are repeated until the total absolute error for gene i* reduces to below the desired tolerance level. This procedure is then repeated for all the genes in the network until the entire network topology (or a sub-set of it) has been inferred.
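The two steps described above (relax, then re-optimize) can be sketched as a simple loop; `solve_milp` is a hypothetical stand-in for the MILP solve and returns the optimal absolute error for a given symmetric bound.

```python
def sequential_bound_relaxation(solve_milp, omega0, delta, tol, max_iter=50):
    """Sketch of the sequential bound relaxation: start from a tight
    symmetric bound |Omega^L| = |Omega^U| = omega0, solve, and enlarge the
    bound by a factor (1 + delta) until the absolute error drops below tol.
    solve_milp(bound) -> error stands in for the actual MILP solve."""
    bound = omega0
    for _ in range(max_iter):
        err = solve_milp(bound)      # error is non-increasing in the bound
        if err <= tol:
            return bound, err
        bound *= (1 + delta)         # relax: Omega^{k+1} = (1+delta)*Omega^k
    return bound, err

# Toy stand-in: error vanishes once the bound can reach the true weight 2.5.
toy_solve = lambda b: max(0.0, 2.5 - b)
bound, err = sequential_bound_relaxation(toy_solve, omega0=1.0, delta=1.0,
                                         tol=1e-6)
# bounds tried: 1.0 (err 1.5), 2.0 (err 0.5), 4.0 (err 0.0)
```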
3 Results and Discussion
To highlight and test the inference capabilities of the proposed model, it is applied to two different data sets. Data set 1 (40 genes, 8 time points) is generated in numero by assuming known time delays in the system dynamics. The ability of the inference procedure to uncover an a priori known target network, as well as the computational performance of the model, is studied by employing this data set. Subsequently, a real microarray data subset (24 genes, 9 time points) is analyzed to highlight the applicability of the inference procedure to data derived from real biological systems.

3.1 Data Set 1
The expression data for the 40 gene network is generated by assuming that 6 genes have 3 regulatory inputs, 10 genes have 2 regulatory inputs, while the remaining genes have a single regulatory input. 33 interactions are designed to have a time delay of zero, 21 have a time delay of one and 9 have a time delay of two time points. Given this topology of the regulatory network, gene expression values are computed for each one of the 40 genes at 8 time points. The derivatives are computed by employing forward difference. The starting value for the bound for
each gene is set to 1.0 and a bound increment value δ_i = 1.0 is employed for computation. The assumed network constituted 63 interactions with known regulatory weights and time delays associated with these interactions. The original network, in terms of all 63 regulatory interactions and the associated regulatory weights and time delays, is perfectly recovered by solving the proposed model with time delay. The optimization model is solved using the CPLEX solver accessed via the GAMS modeling environment. The CPU time needed to recover the original regulatory inputs for each gene is shown in Figure 1(a), while the distribution of the total number of sequential bound relaxation iterations required is shown in Figure 1(b).
Figure 1: Comparison of computational performance for the model with and without time delay. (a) Total CPU time required for each of the 40 genes in the network (b) Distribution of the total number of sequential bound relaxation procedure iterations.
Specifically, 9,505 CPU seconds (on an IBM RISC 6000 machine) are required for the 86 iterations. In addition to the model with time delay, network inference is also carried out by neglecting time delay. This is achieved by including Eq 9 in the inference model (Eqs 2-8). The model without time delay provides the appropriate benchmark for systematically highlighting the gains, if any, that are realized by accounting for time delay. The computational results for the two models are contrasted in Figures 1(a) and 1(b). For the model without time delay, a total of 4,696 CPU seconds are expended for the 227 iterations that are needed to infer a network with zero error. However, even though the model without time delay is able to fit the data perfectly with fewer computational resources, it is unable to identify the assumed target network in terms of the network topology and regulatory parameters. In addition, as expected, the number of parameters required increases significantly for the model without time delay. In particular, 121 regulatory relationships are inferred by the model without time delay, implying an almost two-fold increase in the number of parameters needed over the model with time delay.
3.2 Data Set 2
The second microarray data set analyzed consisted of time course expression profiles of 24 genes of Bacillus subtilis subjected to an amino-acid pulse in minimal media. Gene expression is measured using Affymetrix GeneChip® arrays at 0, 8, 13, 18, 28, 38, 68, 118 and 178 minutes. The amino-acid pulse is introduced for 8 minutes at the start of the experiment. Subsequently, cubic splines are used to interpolate the expression data and the derivatives are computed by employing a local finite difference approximation at each of the time points. The model with time delay is used to infer the regulatory network. The trade-off curve between error and the maximum number of parents is shown in Figures 2(a) and 2(b) for both the model with and without time delay. Note that the maximum number of parents determines the number of parameters available for fitting. In accordance with the results obtained for data set 1, Figure 2 highlights the fact that, for any imposed threshold error tolerance value, the model with time delay infers a network which is sparser.
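Because the sampling times above are non-uniform, derivative estimation needs some care. The sketch below uses a three-point Lagrange (local finite-difference) derivative on a non-uniform grid as a simplified stand-in for the spline-plus-local-finite-difference procedure described in the text; the toy expression profile is invented for the example.

```python
def local_derivative(times, values):
    """Local finite-difference derivative estimates on a non-uniform grid:
    three-point Lagrange differentiation at interior points, one-sided
    differences at the two endpoints."""
    n = len(times)
    d = [0.0] * n
    d[0] = (values[1] - values[0]) / (times[1] - times[0])
    d[-1] = (values[-1] - values[-2]) / (times[-1] - times[-2])
    for k in range(1, n - 1):
        t0, t1, t2 = times[k - 1], times[k], times[k + 1]
        y0, y1, y2 = values[k - 1], values[k], values[k + 1]
        # derivative at t1 of the quadratic through the three points
        d[k] = (y0 * (t1 - t2) / ((t0 - t1) * (t0 - t2))
                + y1 * (2 * t1 - t0 - t2) / ((t1 - t0) * (t1 - t2))
                + y2 * (t1 - t0) / ((t2 - t0) * (t2 - t1)))
    return d

# Sampling times (minutes) from the experiment described above:
times = [0, 8, 13, 18, 28, 38, 68, 118, 178]
expr = [t * t for t in times]        # toy profile z(t) = t^2, so dz/dt = 2t
d = local_derivative(times, expr)    # interior estimates equal 2*t exactly
```

The three-point scheme is exact for quadratics, which is why the interior estimates reproduce 2t here; a cubic-spline derivative would behave similarly on smooth profiles.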
Figure 2: Trade-off between the number of model parameters and the quality of fit. (a) Model with time delay. (b) Model without time delay.
The inferred regulatory relationships are shown in Figures 3 and 4. The proposed model is able to identify a number of regulatory relationships that have been previously reported in the literature. Jin et al. [26] have hypothesized the existence of a regulatory relationship between citH and genes involved in aspartate production (nadB and purA). The inferred regulatory network identifies a potential indirect mechanism for these regulations, mediated by pycA and odhB (Figure 3). Miller et al. [27] have reported that the genes sdhA and citG might share a common regulatory mechanism. The inferred network indicates that genes involved in glycine, serine and threonine metabolism regulate both citG and sdhC, which is a part of the sdhCAB operon. These results highlight the capability of the proposed inference framework to capture biologically plausible regulatory interactions.
Figure 3. Regulatory network inferred by the model with time delay.
Figure 4. Time delays associated with the inferred regulatory network.
From a statistical perspective, in addition to the relative error, a metric that is widely used to determine how well a regression fits is the coefficient of determination (or multiple correlation coefficient) R² [28]. This metric quantifies the fraction of variability in the response variable that can be explained by the variability in the input variables. In the context of our current setting, the average R² value is given by

$$\bar{R}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\mathrm{Var}[\varepsilon_i(t)]}{\mathrm{Var}[\dot{z}_i(t)]} \right) \qquad (10)$$

where the Var[·] operator determines the variance of a particular quantity over time and ε_i(t) = e_i^+(t) − e_i^-(t) is the computed error for gene i at time point t. Given this metric, the additional variance explained by the model with time delay is determined as

$$\text{Add. Variance Explained} = R^2[\text{With time delay}] - R^2[\text{Without time delay}] \qquad (11)$$
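A sketch of the average R² metric and the additional-variance measure described above, assuming Var[·] is the population variance over time; the residual series below are invented for the example.

```python
from statistics import pvariance

def avg_r_squared(zdot, errors):
    """Average coefficient of determination over genes: for each gene i,
    1 - Var[error_i(t)] / Var[zdot_i(t)], averaged over all genes.
    zdot[i] and errors[i] are time series for gene i."""
    r2 = [1 - pvariance(e) / pvariance(z) for z, e in zip(zdot, errors)]
    return sum(r2) / len(r2)

# Toy numbers (illustrative, not from the paper's data):
zdot = [[1.0, 3.0, 5.0, 7.0]]
err_delay = [[0.1, -0.1, 0.1, -0.1]]     # residuals of the delayed model
err_nodelay = [[1.0, -1.0, 1.0, -1.0]]   # residuals without time delay
additional = (avg_r_squared(zdot, err_delay)
              - avg_r_squared(zdot, err_nodelay))
# additional variance explained by allowing time delay, about 0.198 here
```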
Figure 5 shows the additional variance explained for data set 2 as the number of parents is varied. In addition to the real data set, the additional variance explained for randomized data is also shown.

Figure 5: Additional variance explained by including time delay for real and randomized data.

Specifically, the randomized data is obtained by permuting the rows and columns of the expression matrix, such that any underlying structure of the data is lost while the scaling of the data is retained. The results of Figure 5 indicate that the model with time delay is able to discriminate between real and randomized data only when the maximum number of inputs allowed is either 4 or 5. For a relatively small number of inputs (1, 2 and 3), the model is unable to capture the underlying structure of the real data due to the lack of a sufficient number of parameters. Similarly, at the other extreme, when too many parameters are made available (6 and 7), the model starts tending towards over-fitting, leading to the overlap between real and randomized data. A clear separation between the two data sets is realized only in the intermediate range of inputs (4 and 5). These results highlight the capabilities of the proposed modeling and solution framework in not only accounting for key system dynamics such as
time delay, but also gaining deeper insights into the topological features of regulatory networks.
4 Summary and Conclusions
In this work, an optimization based modeling and solution framework for incorporating time delay in transcriptional regulation was proposed. The proposed model used the existing linear model as a benchmark and employed Boolean variables to incorporate discrete time delays into the interactions. Since the system of equations describing the interactions is underdetermined, and consequently has a family of solutions that fit the data equally well, various properties of biological networks, such as sparseness and uniqueness of time delay, were employed to search through the solution space. A number of key advantages of the model, in terms of examining the impact of alternative objective functions, incorporating known biological interactions and including environmental stimuli, were discussed. On the computational front, however, the proposed model formulation is NP-hard, implying that the computational requirements increase exponentially with the model size. To alleviate this problem, a sequential bound relaxation procedure was proposed. The inferential potential of the proposed methodology was determined by applying it to an in numero data set and a real expression data set. Results for the in numero data set confirmed that neglecting time delay, in a system a priori known to be characterized by it, results in a significant increase in the number of parameters needed to describe the system dynamics. Subsequently, application of the model to real microarray data uncovered numerous regulatory relationships with time delay, suggesting that time delay is ubiquitous in gene regulation. In the spirit of the results obtained for the first data set, inclusion of time delay resulted in inferred networks that were sparser. In addition, analysis of the amount of variance in the data explained by the model revealed that the proposed methodology explained more variance in real data as compared to randomized data.

References

1. Bolouri, H. and J.M. Bower, Computational Modeling of Genetic and Biochemical Networks, ed. H. Bolouri and J.M. Bower. 2001, Cambridge, Massachusetts: The MIT Press.
2. Bolouri, H. and E.H. Davidson. BioEssays, 2002. 24(12): p. 1118-1127.
3. Spellman, P.T., et al. Mol. Biol. Cell, 1998. 9: p. 3273-3297.
4. Hwang, D., et al. Bioinformatics, 2002. 18: p. 1184-1193.
5. Stephanopoulos, G., et al. Bioinformatics, 2002. 18: p. 1054-1063.
6. Akutsu, T. and S. Miyano. Pac. Symp. Biocomput., 2000. 5: p. 290-301.
7. Ideker, T.E., V. Thorsson, and R.M. Karp. Pac. Symp. Biocomput., 2000. 5: p. 302-313.
8. de Jong, H. Journal of Computational Biology, 2002. 9(1): p. 67-103.
9. Chen, T., H.G.L. He, and G.M. Church. Pac. Symp. Biocomput., 1999. 4: p. 102-111.
10. Yeung, M.K.S., J. Tegner, and J.J. Collins. PNAS, 2002. 99: p. 6163-6168.
11. de Hoon, M.J.L., et al. Pac. Symp. Biocomput., 2003. 8: p. 17-28.
12. Friedman, N., et al. Journal of Computational Biology, 2000. 7: p. 601-620.
13. Vohradsky, J. The Journal of Biological Chemistry, 2001. 276: p. 36168-36173.
14. Jagle, U., et al. The Journal of Biological Chemistry, 1997. 272: p. 5871-5879.
15. Gill, R.T., et al. Journal of Bacteriology, 2002. 184(13): p. 3671-3681.
16. Rosenfeld, N. and U. Alon. J. Mol. Biol., 2003. 329: p. 645-654.
17. Yildirim, N. and M.C. Mackey. Biophysical Journal, 2003. 84: p. 2841-2851.
18. Wong, P., S. Gladney, and J.D. Keasling. Biotechnol. Prog., 1997. 13: p. 132-143.
19. Quin, J., et al. J. Mol. Biol., 2001. 314: p. 1053-1066.
20. D'haeseleer, P., L. Shoudan, and R. Somogyi. Pac. Symp. Biocomput., 1999. 4: p. 41.
21. Weaver, D.C., C.T. Workman, and G.D. Stormo. Pac. Symp. Biocomput., 1999. 4: p. 112-123.
22. Someren, E.P.V., L.F.A. Wessels, and M.J.T. Reinders. 2000.
23. Someren, E.P.V., et al. Proceedings of the 2001 IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP01), Baltimore, Maryland, June 2001.
24. Someren, E.P.V., L.F.A. Wessels, and M.J.T. Reinders. Signal Processing, 2003. 83: p. 763-775.
25. Winston, W.L. and M. Venkataraman, Introduction to Mathematical Programming. 4th ed. Vol. 1. 2003, Pacific Grove: Brooks/Cole-Thomson Learning.
26. Jin, S., M.D. Jesus-Berrios, and A.L. Sonenshein. J. Bacteriol., 1996. 178(2): p. 560-563.
27. Miller, P., et al. J. Bacteriol., 1988. 170(6): p. 2742-2748.
28. Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists. 2nd ed. 2000: Harcourt Academic Press.
ROBUST IDENTIFICATION OF LARGE GENETIC NETWORKS

D. DI BERNARDO
TIGEM, Via P. Castellino 111, 80131 Naples, Italy
email: [email protected]; Tel: +39 081 6132 229; FAX: +39 081 560 98 77

T.S. GARDNER, J.J. COLLINS
Center for BioDynamics and Department of Biomedical Engineering, Boston University, 44 Cummington St., Boston, MA 02215, USA
Temporal and spatial gene expression, together with the concentration of proteins and metabolites, is tightly controlled in the cell. This is possible thanks to complex regulatory networks between these different elements. The identification of these networks would be extremely valuable. We developed a novel algorithm to identify a large genetic network, as a set of linear differential equations, starting from measurements of gene expression at steady state following transcriptional perturbations. Experimentally, it is possible to overexpress each of the genes in the network using an episomal expression plasmid and measure the change in mRNA concentration of all the genes, following the perturbation. Computationally, we reduced the identification problem to a multiple linear regression, assuming that the network is sparse. We implemented a heuristic search method in order to apply the algorithm to large networks. The algorithm can correctly identify the network, even in the presence of large noise in the data, and can be used to predict the genes that directly mediate the action of a compound. Our novel approach is experimentally feasible and it is readily applicable to large genetic networks.
1 Introduction
Temporal and spatial gene expression, together with the concentration of proteins and metabolites, is tightly controlled in the cell. This is possible thanks to complex regulatory networks between these different elements. The identification of these networks would be extremely valuable. Different experimental and computational methods have been proposed to tackle the network identification problem 1,2,3,4,5,6. Although implemented with some success, they are data intensive and the description of the network they provide is limited. A variety of mathematical models may be used to describe genetic networks 7,8,9, including Boolean logic 10,11, Bayesian networks 12, graph theory 13, and
ordinary differential equations. We concentrated our efforts on this last model, because it offers a description of the network as a continuous-time dynamical system that can be used to infer the genes with the major regulatory function in the network. In addition, it can be applied to the RNA expression measurements obtained from pharmacological perturbations to identify the genes that directly mediate a compound's bio-activity in the cell. We already developed and tested in vitro an algorithm to identify a genetic network of nine genes, as a set of linear differential equations, starting from measurements of gene expression at steady state following transcriptional perturbations 14. In what follows we describe a modification of the algorithm to tackle the problem of reverse-engineering large genetic networks.
2 Methods
2.1 Network model description
A network can be described by a set of ordinary differential equations describing the time evolution of the mRNA concentration of the genes in the network:

$$\dot{g} = f(g, u) \qquad (1)$$
where g represents the mRNA concentrations of the genes in the network, and u is a set of transcriptional perturbations. Assuming that the cell under investigation is at equilibrium near a stable steady-state point, we can apply a small perturbation to each of its genes. A perturbation is small if it does not drive the network out of the basin of attraction of its stable steady-state point and if the stable manifold in the neighborhood of the steady-state point is approximately linear. With these assumptions, we can linearize the set of nonlinear rate equations near the stable steady-state point. Thus, for each gene i in a network of N genes we can write the following equation:
$$\dot{x}_{il} = \sum_{j=1}^{N} a_{ij}\, x_{jl} + u_{il} \qquad (2)$$

where x_il is the mRNA concentration of gene i following the perturbation in experiment l; a_ij represents the influence of gene j on gene i; and u_il is an external perturbation to the expression of gene i in experiment l. (From now on we will use the following notation: g represents a column vector, g^T a row vector, x a scalar and A a matrix.) For all N genes, Eqs. 2 can be rewritten in more compact form using matrix notation:
$$\dot{x}_l = A \cdot x_l + u_l \qquad (3)$$

where x_l is an N x 1 vector of the mRNA concentrations of the N genes in experiment l, A is an N x N connectivity matrix composed of elements a_ij, and u_l is an N x 1 vector of the perturbations applied to each of the N genes in experiment l.
2.2 Network Identification

To identify the network, using the model described above, means to retrieve matrix A. This is possible if we measure the mRNA concentration of all the N genes at steady state (i.e., ẋ_l = 0) in M experiments and then solve the system of equations:

$$A \cdot X = -U \qquad (4)$$
where X is an N x M matrix whose columns are the vectors x_l, and U is an N x M matrix whose columns are the vectors u_l. Equation 4 can be solved only if M ≥ N. However, the recovered weights, A, will be extremely sensitive to noise both in the data, X, and in the perturbations, U, and thus unreliable unless we overdetermine the system of Eqs. 4. This can be accomplished either by increasing the number of experiments (M > N), or by assuming that the maximum number of regulators acting on each gene, k, is less than M (i.e., the network is not fully connected 15,16), thus reducing the number of weights a_ij to be recovered.
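In the idealized, noise-free case with M = N independent perturbation experiments, Eq. 4 can be inverted directly. A minimal NumPy sketch of this baseline (the paper's implementation was in MATLAB; the toy network and sizes here are illustrative):

```python
import numpy as np

# Noise-free baseline for Eq. 4: with M = N experiments, A = -U . X^-1.
rng = np.random.default_rng(0)
N = 5
A_true = -np.eye(N) + 0.3 * rng.standard_normal((N, N))  # toy network
U = np.eye(N)                      # one unit perturbation per experiment
X = -np.linalg.inv(A_true) @ U     # steady-state responses (A . X = -U)
A_est = -U @ np.linalg.inv(X)      # invert Eq. 4
assert np.allclose(A_est, A_true)
```

With noisy data, this direct inversion is exactly the unstable step the paper avoids; the sparse regression of sec. 2.4 replaces it.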
2.3 Experimental approach

To identify the network we need to perform transcriptional perturbations for each of the genes in the network and to measure, at steady state following each perturbation, the changes in the mRNA concentrations of all the genes in the network. In each perturbation experiment, it is possible to overexpress a different one of the genes in the network using an episomal expression plasmid. We then let the cells grow under constant physiological conditions to their steady state after the perturbation and measure the change in mRNA concentration compared to cells under the same physiological conditions but unperturbed. This can be achieved using microarrays or real-time quantitative PCR.
2.4 Algorithm

A genetic network can be described by the system of linear differential equations, Eqs. 2. For each gene i at steady state (ẋ_il = 0) in experiment l, we can therefore write:

$$a_i^T \cdot x_l = -u_{il} \qquad (5)$$

where u_il is the transcriptional perturbation applied to gene i in experiment l, a_i^T is a row of A, and x_l (N x 1) are the mRNA concentrations at steady state following the perturbation in experiment l. The algorithm assumes that only k out of the N weights in a_i for gene i are different from zero. For each possible combination of k out of N weights, the algorithm computes the solution to the following linear regression model:

$$y_{il} = b_i^T \cdot z_l + \epsilon_{il} \qquad (6)$$

where y_il = -u_il is the perturbation applied to gene i in experiment l; b_i is a k x 1 vector representing one of the (N choose k) possible combinations of k weights for gene i; ε_il is a scalar stochastic normal variable with zero mean and variance var(ε_il), representing measurement noise on the perturbation of gene i in experiment l; and z_l is a k x 1 vector of mRNA concentrations following the perturbation in experiment l, with added uncorrelated Gaussian noise (γ_l) with zero means and variances var(γ_l). Equation 6 represents a multiple linear regression model with noise η_il = b_i^T · γ_l + ε_il, with zero mean and variance:
$$var(\eta_{il}) = \sum_{j=1}^{k} b_{ij}^2\, var(\gamma_{jl}) + var(\epsilon_{il}) \qquad (7)$$

(if ε_il and γ_l are uncorrelated).
If we collect data in M different experiments, then we can write Eq. 6 for each experiment and obtain the system of equations:

$$y_i = Z^T \cdot b_i + \eta_i \qquad (8)$$

where y_i is an M x 1 vector of the measurements of the perturbation y_il to gene i in the M experiments; Z is a k x M matrix, where each column is the vector z_l for one of the M experiments; and η_i is an M x 1 vector of the noise in the M experiments. From Eqs. 8, it follows that a predictor for y_i given the data matrix Z is:

$$\hat{y}_i = Z^T \cdot \hat{b}_i \qquad (9)$$
We chose to minimize the following cost function to find the k weights, b̂_i, for gene i:

$$C_i = \sum_{l=1}^{M} (y_{il} - \hat{y}_{il})^2 \qquad (10)$$

The solution can be obtained by computing the pseudo-inverse of Z:

$$\hat{b}_i = (Z \cdot Z^T)^{-1} \cdot Z \cdot y_i \qquad (11)$$
Note that the solution, b̂_i, in Eq. 11 is not the maximum likelihood estimate for the parameters b_i when the regressors Z are stochastic variables 17, but it nevertheless is a good estimate. We select, as the best approximation of the weights in Eqs. 2 for gene i, the solution with the smallest least-squares error among the (N choose k) possible solutions b̂_i.
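The exhaustive search of sec. 2.4 can be sketched compactly in NumPy on a noise-free toy problem (the paper's implementation was in MATLAB; the function name and toy sizes are illustrative):

```python
import numpy as np
from itertools import combinations

def best_k_regulators(y_i, X, k):
    """For one gene: try every combination of k candidate regulators
    (rows of X), solve the regression of Eq. 11, and keep the combination
    with the smallest least-squares error (Eq. 10). y_i holds -u_il."""
    N, M = X.shape
    best = (np.inf, None, None)
    for subset in combinations(range(N), k):
        Z = X[list(subset), :]                 # k x M regressor matrix
        b = np.linalg.solve(Z @ Z.T, Z @ y_i)  # Eq. 11 pseudo-inverse
        err = np.sum((y_i - Z.T @ b) ** 2)     # cost function, Eq. 10
        if err < best[0]:
            best = (err, subset, b)
    return best

# tiny noise-free example: the gene is regulated by genes 1 and 3 only
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))               # 5 genes, 20 experiments
a_true = np.array([0.0, 2.0, 0.0, -1.5, 0.0])  # true row of A
y = X.T @ a_true                               # y_l = a^T x_l
err, subset, b = best_k_regulators(y, X, k=2)
assert subset == (1, 3)
```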
2.5 Estimation of the variance of the parameters

We now turn to the estimation of the variance of the estimated parameters b̂_i and the calculation of the goodness of fit. If, in each experiment, the noise is uncorrelated and Gaussian with zero mean and known variance, then the covariance matrix of the estimated parameters b̂_i is 16:

$$Cov(\hat{b}_i) = (Z \cdot Z^T)^{-1} \cdot Z \cdot C_\eta \cdot Z^T \cdot (Z \cdot Z^T)^{-1} \qquad (12)$$

where C_η is an M x M diagonal matrix with diagonal elements equal to the noise variance for gene i in the M experiments, var(η_i1) ... var(η_iM). We assume that we can estimate var(η_il) in each experiment using the parameters b̂_i estimated with Eq. 11 and substituting in Eq. 7:
$$\widehat{var}(\eta_{il}) = \sum_{j=1}^{k} \hat{b}_{ij}^2\, var(\gamma_{jl}) + var(\epsilon_{il}) \qquad (13)$$
We can now compute the variances of the parameters using Eq. 12, where C_η is computed using Eq. 13. The quantities var(γ_jl) and var(ε_jl) are supposed to have been estimated experimentally. We can also compute a goodness-of-fit test using the Chi-squared statistic:

$$\chi_i^2 = \sum_{l=1}^{M} \frac{(y_{il} - \hat{y}_{il})^2}{var(\eta_{il})} \qquad (14)$$
2.6 Modification of the algorithm for large networks

For a network of N genes, with k ≤ N connections for each gene, we need to solve Eq. 6 for all the (N choose k) possible combinations of k genes and then select the one that fits the data best. For large networks, this exhaustive approach is unfeasible since there are too many combinations to test. We used a heuristic search method (Forw-TopD-reest-K 18) to reduce the number of solutions to test. We first compute all the possible solutions with single connections (k = 1) as described in sec. 2.4. We then select the best D solutions (the ones with the smallest least-squares error), and only for these intermediate solutions, we compute all the possible solutions with an additional connection. We then again select the best D solutions, and so on until the number of connections found for each gene is k. We implemented this approach using a value of D = 5.
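The heuristic amounts to a beam search over regulator subsets. A NumPy sketch of this idea (a simplification of the heuristic cited above, not the paper's MATLAB code; names and toy sizes are illustrative):

```python
import numpy as np

def sse(Z, y):
    """Least-squares error of regressing y on the rows of Z (Eqs. 10-11)."""
    b = np.linalg.lstsq(Z.T, y, rcond=None)[0]
    return float(np.sum((y - Z.T @ b) ** 2))

def forw_top_d(y, X, k, D=5):
    """Beam-search sketch of sec. 2.6: rank all single-regulator models,
    keep the best D, extend each kept model by one regulator, re-rank,
    and repeat until the models have k regulators."""
    N = X.shape[0]
    beam = sorted((sse(X[[j], :], y), (j,)) for j in range(N))[:D]
    for _ in range(k - 1):
        cand = {}
        for _, subset in beam:
            for j in range(N):
                if j not in subset:
                    s = tuple(sorted(subset + (j,)))
                    cand.setdefault(s, sse(X[list(s), :], y))
        beam = sorted((e, s) for s, e in cand.items())[:D]
    return beam[0]

# toy check: gene regulated by genes 1 and 3 only, noise-free data
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))
y = X.T @ np.array([0.0, 2.0, 0.0, -1.5, 0.0])
err, regulators = forw_top_d(y, X, k=2, D=5)
assert regulators == (1, 3)
```

Instead of (N choose k) regressions, the beam evaluates on the order of k·D·N, which is what makes networks of hundreds of genes tractable.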
2.7 Target prediction
It is possible to use the recovered network to deconvolve the results of an experiment, i.e., to recover the unknown perturbations u_l in an experiment, given the measurements of the response to that perturbation, x_l. The predicted perturbations û_l can be computed from:

$$\hat{u}_l = -\hat{A} \cdot x_l \qquad (15)$$
The variance of the estimated perturbation of gene i can be computed as 19:

$$var(\hat{u}_{il}) = x_l^T \cdot (Z \cdot Z^T)^{-1} \cdot Z \cdot C_\eta \cdot Z^T \cdot (Z \cdot Z^T)^{-1} \cdot x_l + \sum_{j=1}^{k} \hat{a}_{ij}^2\, var(\gamma_{jl}) \qquad (16)$$
Using the variance of the estimated perturbation, we perform a t-test to test the hypothesis that the predicted perturbations are significantly different from zero.
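The deconvolution step itself is a single matrix product. A minimal NumPy sketch (the network, sizes, and perturbed gene are illustrative, and the true A is used in place of a recovered estimate):

```python
import numpy as np

# Sketch of sec. 2.7: with a network model A, the perturbation behind a
# new steady-state profile x is recovered as u_hat = -A . x.
rng = np.random.default_rng(2)
N = 6
A = -np.eye(N) + 0.2 * rng.standard_normal((N, N))  # toy network model
u = np.zeros(N)
u[3] = 1.0                      # gene 3 is secretly perturbed
x = np.linalg.solve(A, -u)      # its steady-state response (Eq. 3, x_dot = 0)
u_hat = -A @ x                  # predicted perturbations, Eq. 15
assert int(np.argmax(np.abs(u_hat))) == 3
```

With a noisy estimate of A, the t-test described above decides which entries of u_hat are significantly nonzero.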
2.8 Simulated data
To test the algorithm on a realistic data set, we generated 10 random networks of N = 100 genes with an average of k = 10 connections for each gene. Each network was represented by a full-rank sparse matrix A (N x N), as described in sec. 2.2. We made sure that all the eigenvalues of these random sparse matrices had a real part less than 0 to ensure that the dynamical systems described by
them were stable. The data set X (N x M) was obtained by inverting Eq. 4 to obtain:

$$X = -A^{-1} \cdot U \qquad (17)$$
where U (N x M) contains the perturbations in M = 100 experiments. We chose U to be the identity matrix. This is equivalent to saying that in each experiment only 1 out of the 100 genes was perturbed, by increasing its transcription rate by 1. The data the algorithm needs to identify the network A are the gene expression data matrix X and the perturbation matrix U. We added white Gaussian noise to each data matrix. For the perturbation matrix U, the standard deviation of the noise was fixed to σ_u = 0.3 (i.e., 30% of the magnitude of the perturbation), while for the gene expression data matrix it varied from σ_x = 0.1·X̄ to σ_x = 0.5·X̄, where X̄ represents the average of the absolute values of the elements in X. The performance of the algorithm in identifying the network A was tested using these data at the different noise levels. We used two measures of performance: coverage (correct connections in the recovered network model / total connections in the true network) and false positives (incorrect connections in the recovered model / total number of recovered connections). In order to test the ability of the identified network to predict unknown perturbations given the gene expression data, for each random network we generated 10 additional experiments in which 3 genes, randomly chosen out of the 100, were perturbed simultaneously. We computed the ability of the recovered network to predict which genes had been perturbed, using the method described in sec. 2.7. The algorithm described in this section was fully implemented in the MATLAB environment. For a network of 100 genes, the algorithm took 50 s to run on a Pentium III with a clock speed of 1.2 GHz.
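The data-generation procedure can be sketched at reduced scale in NumPy (the paper used MATLAB; N, k, the eigenvalue shift, and the seed are illustrative):

```python
import numpy as np

# Scaled-down sketch of sec. 2.8: a sparse random network made stable by
# shifting its eigenvalues, identity perturbations, and Eq. 17 for the data.
rng = np.random.default_rng(3)
N, k = 20, 3
A = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=k, replace=False)  # k regulators per gene
    A[i, cols] = rng.standard_normal(k)
# push all eigenvalues' real parts below zero so the dynamics are stable
A -= (np.max(np.linalg.eigvals(A).real) + 0.5) * np.eye(N)
U = np.eye(N)                        # perturb one gene per experiment
X = -np.linalg.inv(A) @ U            # noise-free steady states (Eq. 17)
# 30% white Gaussian noise relative to the mean absolute signal
X_noisy = X + 0.3 * np.mean(np.abs(X)) * rng.standard_normal((N, N))
assert np.max(np.linalg.eigvals(A).real) < 0
```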
3 Results
3.1 Identification of networks

Figure 1 shows the average performance of the algorithm across the 10 random networks described in sec. 2.8 for noise levels ranging from 10% to 50%. Since the algorithm also reports the variance of the identified elements in matrix A, it is possible to compute a p-value for each of its elements a_ij. We used a Student t distribution to test the hypothesis that the element a_ij identified
Figure 1: Model recovery performance for simulations. Perturbations of magnitude u_i = 1 (arbitrary units) were applied to ten randomly connected networks of one hundred genes with an average of ten regulatory inputs per gene. For each perturbation to each random network, the mRNA concentrations at steady state were calculated, and normally-distributed, uncorrelated noise was added both to the mRNA concentrations and to the perturbations to represent measurement error. The noise (noise = σ_x/μ_x, where σ_x is the standard deviation of the mean μ_x of x) on the perturbations was set to 30%. The noise on the mRNA concentrations was varied from 10% to 50%. The average coverage, top panel (correct connections in the recovered network model / total connections in the true network), and average false positives, bottom panel (incorrect connections in the recovered model / total number of recovered connections), were calculated across all the models recovered. Filled circles: all the recovered connections were included in the computation of coverage and false positives. Filled squares: only the recovered connections with a p-value ≤ 0.05 were included in the computation.
by the algorithm is significantly different from 0. This is equivalent to testing whether gene i is significantly regulated by gene j. Figure 1 also reports the coverage and false positives in the case where we consider significantly different from 0 only those elements with a p-value ≤ 0.05 (dashed lines).
3.2 Target prediction

Figure 2 shows the coverage (genes correctly identified as perturbed by the network model / total number of perturbed genes) and the percentage of false positives (genes wrongly identified as perturbed by the network model / total number of genes identified as perturbed by the network model) for noise levels ranging from 10% to 50%, averaged across the 10 random networks and across 10 perturbation experiments, as described in sec. 2.8. In Figure 2, open bars show coverage and false positives considering the predicted perturbations correct only if they have a p-value ≤ 0.01; black bars show the same quantities for a p-value ≤ 0.1.

4 Discussion
The algorithm we propose requires only measurements of mRNA concentrations at steady state following transcriptional perturbations. Therefore, the experimental time and costs involved in the procedure are affordable. This is a very useful feature of our approach. Another essential feature is its robustness to measurement noise. Measurements of mRNA concentration using microarrays are noisy, and therefore an algorithm to identify networks is useful only if it is robust to such noise. We showed that the recovered network can be used for target prediction; this can be very useful for drug discovery. Using measurements of mRNA concentration changes at steady state following the application of a compound to a cell population, we can predict which are the direct targets of that drug in a large gene network using the recovered network model. The recovered network model, A, is a linear representation of a nonlinear system. Nonlinear behaviours that are sometimes exhibited by gene, protein, and metabolite networks, including bifurcations, thresholds, and multistability, cannot be described by A. Nevertheless, the linear approximation is topologically equivalent to the nonlinear system near a steady-state point. Therefore, to apply our algorithm, it is necessary to remain near a single steady state during the course of all experiments. From a practical perspective, this means that cells must be maintained under consistent and constant environmental
Figure 2: Perturbation prediction performance for simulations. Three genes were randomly and simultaneously perturbed. Using the steady-state measurements following the perturbation, the network model was used to predict which genes had been perturbed. This experiment was repeated ten times for each one of ten different random networks of one hundred genes with an average of ten regulators per gene. Coverage (genes correctly identified as perturbed by the network model / total number of perturbed genes) and the percentage of false positives (genes wrongly identified as perturbed by the network model / total number of genes identified as perturbed by the network model) are shown for noise levels ranging from 10% to 50%, averaged across the ten random networks and the ten perturbation experiments. Open bars: coverage (tall) and false positives (short) considering correct only predictions with a p-value ≤ 0.01. Filled bars: coverage (tall) and false positives (short) considering correct only predictions with a p-value ≤ 0.1.
and physiological conditions, and the applied perturbations must be relatively small. If these conditions are not met, the recovered model may contain a certain degree of nonlinear error, or, in the extreme, it may not be possible to adequately fit a linear model. In practice, it is generally straightforward to keep the cells in a constant environmental and physiological state, but due to the presence of measurement noise, it can be challenging to meet the condition of small perturbations. For errors due to noise, we can improve the signal-to-noise ratio (S/N) by boosting the size of the perturbations. However, larger perturbations can lead to larger nonlinear errors. Thus, the experimenter must identify an acceptable balance between noise and nonlinear error. The network must be sparse for the method to work. Our algorithm can be successfully applied as long as the real connectivity of the network (i.e., the number of connections per gene) is less than the number of perturbation experiments. An exact threshold for the maximum number of connections that can be recovered correctly with this algorithm cannot be computed, because this will depend on the noise level of the data. For noise-free data, the maximum connectivity will be equal to the number of perturbation experiments performed. Our approach to inferring genetic networks has been shown to work in vivo for small networks 14. The computer simulations described here suggest that a modified version of the algorithm will work also for large genetic networks. We showed that even with considerable noise, it is still possible to recover 60% of the real network with less than 10% wrongly identified connections. This is important in biological research because it can provide a first draft of the map of interactions among hundreds of genes whose function or regulation is partly or completely unknown.
Also, the network recovered with the algorithm can predict the direct targets of an unknown perturbation with a specificity of approximately 80%, even in the presence of large noise. This would greatly help in the identification of the real targets of a novel molecule in a large network, by greatly reducing the number of targets to be tested experimentally. In addition, the experiments required to generate the data needed by the algorithm are feasible and economically affordable also for large networks.
References

1. A. H. Y. Tong, B. Drees, G. Nardelli, G. D. Bader, B. Brannetti, L. Castagnoli, M. Evangelista, S. Ferracuti, B. Nelson, S. Paoluzi, M. Quondam, A. Zucconi, C. W. V. Hogue, S. Fields, C. Boone, and G. Cesareni, Science 295, 321-324 (2002).
2. T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young, Science 298, 799-804 (2002).
3. T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgarner, D. R. Goodlett, R. Aebersold, and L. Hood, Science 292, 929-934 (2001).
4. E. H. Davidson, J. P. Rast, P. Oliveri, A. Ransick, C. Calestani, et al., Science 295, 1669-1678 (2002).
5. A. Arkin, P. D. Shen, and J. Ross, Science 277, 1275-1279 (1997).
6. M. K. S. Yeung, J. Tegner, and J. J. Collins, Proc. Natl. Acad. Sci. U.S.A. 99, 6163-6168 (2002).
7. H. de Jong, J. Comp. Biol. 9, 67-103 (2002).
8. M. A. Savageau, Chaos, 142-159 (2001).
9. A. Levchenko and A. Iglesias, Biophys. J., 50-63 (2002).
10. I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, Bioinformatics 18, 261-274 (2002).
11. S. Liang, S. Fuhrman, and R. Somogyi, Proc. Pacific Symp. Biocomp. 3, 18-29 (1998).
12. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, Proc. Pacific Symp. Biocomp. 7, 437-449 (2002).
13. A. Wagner, Bioinformatics 17, 1183-1197 (2001).
14. T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, Science 301, 102-105 (2003).
15. Z. N. Oltvai and A. L. Barabási, Science 298, 763-764 (2002).
16. L. Ljung, System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ (1999).
17. W. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK (1993).
18. E. P. van Someren, L. F. A. Wessels, M. J. T. Reinders, and E. Backer, Proc. 2nd Int. Conf. Systems Biol., 222-230 (2001).
19. D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, Inc., New York (2001).
RECONSTRUCTING CHAIN FUNCTIONS IN GENETIC NETWORKS

I. GAT-VIKS, R. SHAMIR
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. {iritg,rshamir}@tau.ac.il

R. M. KARP, R. SHARAN
International Computer Science Institute, 1947 Center St., Berkeley, CA 94704. {karp,roded}@icsi.berkeley.edu

Abstract
Deciphering the mechanisms that control gene expression in the cell is a fundamental question in molecular biology. This task is complicated by the large number of possible regulation relations in the cell, and the relatively small amount of available experimental data. Recently, a new class of regulation functions called chain functions was suggested. Many signal transduction pathways can be accurately modeled by chain functions, and the restriction to chain functions greatly reduces the vast search space of regulation relations. In this paper we study the computational problem of reconstructing a chain function using a minimum number of experiments, in each of which only a few genes are perturbed. We give optimal reconstruction schemes for several scenarios and show their application in reconstructing the regulation of galactose utilization in yeast.
1 Introduction
The regulation of mRNA transcription is key to cellular function. High-throughput genomic technologies, such as DNA microarrays, enable a global view of the transcriptome, and provide the means for reconstructing regulatory relations among genes, that is, inferring the set of genes that cooperate in the regulation of a given gene and the particular logical function by which this regulation is determined. This paper studies the number and complexity of biological experiments that are needed in order to infer certain regulatory relations. An experiment involves knocking out or over-expressing certain genes, and measuring the expression levels of all other genes. The order of an experiment is the number of genes that are perturbed. A key obstacle in the inference of regulation relations is the large number of possible solutions and, consequently, the unrealistically large amount of data needed to identify the right one. Akutsu et al. showed that even for a Boolean
network model, the number of experiments that are needed for reconstructing a network of N genes is prohibitive: the lower and upper bounds on the number of experiments of order N - 1 that are needed are Ω(2^(N-1)) and O(N · 2^(N-1)), respectively. Even with no more than d regulators for each regulated gene, the number of required experiments of order d is still Ω(N^d) and O(N^d), respectively. The inherent complexity of genetic network inference led researchers to seek ways around this problem. Ideker et al. studied how to dynamically design experiments so as to maximize the amount of information extracted. Friedman et al. used Bayesian networks to reveal parts of the genetic network that are strongly supported by the data. Tanay and Shamir suggested a method of expanding a known network core using expression data. Several studies used prior knowledge about the network structure, or restrictive models of the structure, in order to identify relevant processes in gene expression data 5,6,7,8. Recently, a biologically motivated model of regulation relations based on chain functions was suggested in order to cope with the problem of genetic network inference 9. In a chain function, the state of the regulated gene depends on the influence of its direct regulator, whose activity may in turn depend on the influence of another regulator, and so on, in a chain of dependencies (we defer formal definitions till later). The chain model further assumes that variable states are Boolean. The latter assumption is a drastic simplification of real biology, yet it captures important features of biological systems and was frequently used in previous studies. The class of chain functions has several important advantages 9: these functions reflect common biological regulation behavior, so many real biological regulatory relations can be elucidated using them (examples include the SOS response mechanism in E. coli 10 and galactose utilization in yeast 11).
Moreover, by restricting consideration to chain functions, the number of candidate functions drops from doubly exponential to singly exponential. In this paper we study the computational problems arising when one wishes to reconstruct chain functions using a minimum number of experiments of the smallest possible order. We address both the question of finding the set of regulators of a chain function, which is typically much smaller than the entire set of genes, and the question of reconstructing the function given its regulators. We give optimal reconstruction schemes for several scenarios and show their application on real data. Our analysis focuses on the theoretical complexity of reconstructing regulation relations (the number and order of experiments), assuming that experiments provide accurate results and that the target function can be studied in isolation from the rest of the genetic network. The paper is organized as follows: Section 2 contains basic definitions related to chain functions. In Section 3 we give worst-case and average-case analyses of the number of experiments needed in order to reconstruct a chain function. Both low-order and high-order experimental settings are considered. In Section 4 we study the reconstruction of composite regulation functions that combine several chains. Finally, in Section 5 we describe a biological application of our analysis to reconstruct the regulation mechanism of galactose utilization in yeast. For lack of space, some proofs are shortened or omitted.

2 Chain Functions
Chain functions were introduced by Gat-Viks and Shamir 9. In the following we define these functions and describe their main properties. Our presentation differs from the original one, to allow a succinct description of the reconstruction schemes in later sections. Let U denote the set of all variables in a network, where |U| = N. These variables correspond to genes, mRNAs, proteins or metabolites. Each variable may attain one of two states: 1 or 0. The state of gene g, denoted by state(g), indicates the discretized expression level of the gene. The intended interpretation is that state(g) is 1 if gene g is capable of being activated in a given environment, and 0 otherwise. A variable normally exists in its wild-type state, but perturbations such as gene knockouts may change its state. Let g_0 ∈ U be regulated by a set S of n variables. In that case we say that S is the regulator set of g_0, and g_0 is called the regulatee. A candidate regulation function for the regulatee g_0 has the form f_{g_0}: {0,1}^n → {0,1}. In other words, the state of g_0 is a function of the states of its regulators. The chain function model assumes that the functional relations are deterministic. The chain function f_{g_0} on the regulators g_n, ..., g_1 determines the state of the regulatee g_0. The order of the regulators is important, as it reflects the order of influence among them. We call g_i the predecessor of g_j for i > j, and the successor of g_k for i < k. Each regulator may activate or repress its successor, and this chain of events enables a signal to propagate from g_n to g_0, in a manner described below. Associated with each regulator g_i is a fixed value y_i which dictates the regulatory influence of g_i on g_{i-1}. If y_i = 0 then g_i is an activator; otherwise g_i is a repressor. The value y_i represents an intrinsic property of the chain and is not subject to change. The control pattern of f_{g_0} is the binary vector (y_n, ..., y_1).
The function f_{g_0} can be defined using two n-long Boolean vectors attributing activity and influence to each g_i. The definitions of the activity and influence are recursive. Let a(g_i) denote the activity of g_i, and infl(g_i) denote the influence of g_i on g_{i-1}. The influence on g_n is always 1. g_i is activated (a(g_i) = 1) iff it is capable of being activated and it receives a positive activation signal from its predecessor. The activation signal infl(g_i), transmitted from g_i to g_{i-1},
is 1 if g_i is an activator and is itself activated, or if g_i is a repressor and is not activated (so that it fails to repress g_{i-1}). Formally:

$$a(g_i) = 1 \text{ iff } (infl(g_{i+1}) = 1 \text{ and } state(g_i) = 1) \qquad (1)$$

$$infl(g_i) = y_i \oplus a(g_i) \qquad (2)$$

Finally, the state of the regulatee g_0 is simply the influence of g_1. We define the output of f_{g_0} to be state(g_0). A chain function is uniquely determined by its set of regulators, their order and the control pattern. Any control pattern may be separated into blocks of consecutive regulators by truncating the control pattern after each 1. The first block (rightmost, ending at g_1) has two possible forms: 0...0 or 0...01. All other blocks are of the form 0...01.
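Equations 1-2 give a direct recursive evaluation procedure. A small Python sketch (the function name and 0/1 list encodings are illustrative, not from the paper):

```python
def chain_output(states, control):
    """Evaluate a chain function (Eqs. 1-2).

    states  = [state(g_n), ..., state(g_1)]   (0/1 each)
    control = [y_n, ..., y_1]                 (0 = activator, 1 = repressor)
    Returns state(g_0), the output of the chain.
    """
    infl = 1                      # the influence on g_n is always 1
    for s, y in zip(states, control):
        a = infl & s              # Eq. 1: active iff influenced and capable
        infl = y ^ a              # Eq. 2: signal passed to the successor
    return infl                   # state(g_0) = infl(g_1)

# all regulators capable: output = state(g_n) XOR y_n XOR ... XOR y_1
assert chain_output([1, 1, 1], [1, 0, 0]) == 0
# g_2 knocked out: output = y_2 XOR y_1
assert chain_output([1, 0, 1], [0, 1, 0]) == 1
```

The two asserted identities are exactly the two cases of Proposition 1.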
3 Reconstruction of Chain Functions
In this section we study the question of uniquely determining the chain function which operates on a known regulatee, using a minimum number of experiments. We assume throughout that all variable states in wildtype are known (or, else, these could be measured). We further assume that all regulator states in wild type are 1, except possibly g n . The latter assumption is motivated by the observation that in many biological examples, all regulators are expressed in wild type and the state of the regulatee is determined by the presence or absence of a metabolite g n . (Examples include the Trp, lac and araBAD operons in E . COZZ~~, and the regulation of galactose utilization in yeast l l . See Section 6 for a discussion of the situation when this assumption does not hold.) An experiment is defined by a set of variables that are externally perturbed (knocked-out or over-expressed). The states of the perturbed variables are thus fixed, and the states of all non-perturbed regulators are assumed to remain at the wild-type values, with the exception of the regulatee. Its state is determined by the chain function. The order of an experiment is the number of externally perturbed variables in it. Our reconstruction algorithms are based on performing various experiments and observing their influence on the state of the regulatee. The algorithms implicitly assume that the regulation function is indeed a chain function and do not explicitly test this property. We now devise a simple set of equations that characterize the output of a chain function as a function of the control pattern and the states of the regulators, both in the wild-type state and in states produced by perturbing some regulators. These equations are the foundation of all the subsequent reconstruction schemes:
Proposition 1 Let f be a chain function on g_n, ..., g_1. If state(g_i) = 1 for 1 ≤ i < n then state(g_0) = state(g_n) ⊕ (⊕_{i=1}^{n} y_i). For any other
state vector, if the least index of a state-0 regulator is j ≤ n then f_{g_0}(g_n, ..., g_1) = ⊕_{i=1}^{j} y_i.
Proof: By definition, a(g_n) = state(g_n). For i < n, state(g_i) = 1 implies that a(g_i) = a(g_{i+1}) ⊕ y_{i+1}. It follows by induction that state(g_0) = state(g_n) ⊕ (⊕_{i=1}^{n} y_i). Similarly, if state(g_j) = 0 and state(g_i) = 1 for all i < j, it follows by induction that f_{g_0}(g_n, ..., g_1) = ⊕_{i=1}^{j} y_i.
3.1 Types and Blocks
A perturbation is an experiment that changes the state of a variable to the opposite of its state in wild type. By our assumption on the regulator states in wild type, the perturbation of a regulator in {g_{n-1}, ..., g_1} is a knockout. For S ⊆ U, an S-perturbation is an experiment in which the states of all the variables in S are perturbed. Let w be state(g_0) in wild type. Let w̄ be the opposite state. For the reconstruction, we first classify the variables in U into two types: W and W̄. A variable is in W (W̄) if its perturbation produces output w (w̄). Naturally, the majority of the genes have type W, since in particular all the genes that are not part of the chain function are such. By Proposition 1 we have g_n ∈ W̄. We call a gene that belongs to W (W̄) a W-gene (W̄-gene). W-successor, W̄-successor of a gene and W-regulator, W̄-regulator are similarly defined. The type of a single gene can be determined by a single perturbation of the gene. Such an experiment will be referred to as a typing experiment throughout.
Corollary 2 Given an ordered set of regulators g_n, ..., g_1, their control pattern can be reconstructed using n typing experiments.
Consider now the block partition of the regulators. The right boundary of a block corresponds to a regulator g_j with y_j = 1 (unless j = 1, in which case y_1 = 0 is also possible), and any other regulator g_i in the block has y_i = 0.
Lemma 3 Each block contains regulators of a single type, and two adjacent blocks contain regulators of opposite types.
The proof follows from the fact that the type of g_i differs from the type of g_{i-1} iff y_i = 1. Thus, we can refer to a block as either a W-block or a W̄-block, and the two types of blocks alternate.
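Corollary 2 can be sketched as follows. By Proposition 1, the experiment perturbing g_i alone yields output y_1 ⊕ ... ⊕ y_i, so consecutive typing outcomes XOR to the control pattern. A minimal sketch, assuming wild-type regulator states are all 1; the function names are ours:

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def typing_results(y):
    """Simulate the n typing experiments: perturb each regulator alone
    (wild-type states are all 1, so each perturbation is a knockout)."""
    n = len(y)
    results = []
    for i in range(1, n + 1):     # perturb g_i, i = 1..n
        s = [1] * n
        s[n - i] = 0              # position of g_i in (g_n, ..., g_1)
        results.append(chain_output(y, s))
    return results                # results[i-1] = y_1 XOR ... XOR y_i

def control_pattern(results):
    """Corollary 2: recover (y_1, ..., y_n) from the typing outcomes."""
    y = [results[0]]
    for i in range(1, len(results)):
        y.append(results[i] ^ results[i - 1])
    return y
```

Note that `control_pattern` returns the pattern indexed (y_1, ..., y_n), i.e., the reverse of the (y_n, ..., y_1) convention used for evaluation.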
3.2 Reconstructing the Regulator Set and the Function
Consider a chain function with control pattern (y_n, ..., y_1) and let g_j, ..., g_i be a block. Then infl(g_i) = [infl(g_{j+1}) ∧ (∧_{h=i}^{j} state(g_h))] ⊕ y_i. Thus,
the behavior of the chain is determined by the boolean variable infl(g_{j+1}), by the control pattern, and by the conjunction of the states of its regulators. Since this conjunction is independent of the order of occurrence of these genes, no experiment based on perturbing the states of the genes can determine the order of the genes within the block. In view of this limitation, our goal is to reconstruct the control pattern, the set of genes within each block (but not the order of their occurrence) and the ordering of the blocks. Correspondingly, in the following we will use the term successor of a gene to denote a regulator that succeeds that gene in the chain and is not a member of its block. For convenience, we shall refer to W-genes that are not regulators of g_0 as predecessors of g_n. The above discussion implies that once we have typed each gene, it remains to determine, for each pair consisting of a W-gene and a W̄-gene, which of these genes precedes the other in the chain. Let k_W, k_W̄ denote the number of regulators of types W, W̄, respectively. Note that k_W + k_W̄ = n ≤ N, and in fact, typically, n << N, as k_W << |W|. Suppose we perform an {i, k}-perturbation with g_i ∈ W and g_k ∈ W̄. If the result is w, then g_k precedes g_i. Otherwise, g_i precedes g_k. A 2-order experiment for determining the relative order of a W-gene and a W̄-gene will be called a comparison throughout.
Proposition 4 Given the set of regulators of a chain function and their types, k_W k_W̄ comparisons are necessary and sufficient to reconstruct the function.
Proof: The upper bound follows by comparing every W-regulator with every W̄-regulator. The lower bound follows from the fact that, in the special case where every W-regulator precedes every W̄-regulator, no set of comparisons can determine the relative order of a given pair consisting of a W-regulator and a W̄-regulator, unless it includes a direct comparison between the pair. Therefore, all such comparisons must be performed.
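Proposition 4's comparison scheme can be simulated end to end. The sketch below (function names, the ordering convention, and the grouping strategy are our own) runs the typing experiments against a simulated hidden chain, performs every W-gene/W̄-gene comparison, groups same-type genes with identical predecessor sets into blocks, and orders the blocks along the chain:

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def reconstruct_blocks(n, y_true):
    """Reconstruct the block partition of a hidden chain with pattern
    y_true (ordered (y_n, ..., y_1)) from typing and comparison experiments."""
    def output(perturbed):
        # knock out the regulators in `perturbed`; wild-type states are 1
        s = [0 if (n - pos) in perturbed else 1 for pos in range(n)]
        return chain_output(y_true, s)

    w = output(set())                                      # wild-type output
    wbar = {i: output({i}) != w for i in range(1, n + 1)}  # True = W̄-gene
    # one comparison per (W-gene, W̄-gene) pair, as in Proposition 4
    pred = {i: set() for i in range(1, n + 1)}
    for i in (g for g in wbar if not wbar[g]):
        for k in (g for g in wbar if wbar[g]):
            if output({i, k}) == w:
                pred[i].add(k)                             # g_k precedes g_i
            else:
                pred[k].add(i)                             # g_i precedes g_k
    # same-type genes with identical predecessor sets share a block
    blocks = {}
    for g in range(1, n + 1):
        blocks.setdefault((wbar[g], frozenset(pred[g])), set()).add(g)
    # order the blocks: a block's predecessors are exactly the
    # opposite-type genes already placed before it
    ordered, placed = [], {False: set(), True: set()}
    items = list(blocks.items())
    while items:
        for idx, ((t, p), genes) in enumerate(items):
            if p == placed[not t]:
                ordered.append(genes)
                placed[t] |= genes
                items.pop(idx)
                break
    return ordered
```

As the proposition states, the order of genes within a block is not recoverable, so the result is a list of sets, from the g_n side down to g_1.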
We now turn to the question of reconstructing a chain function without prior knowledge of the identity of its regulators. The discussion above suggests a way to solve the problem: First, we find the gene types using N typing experiments. Next, we reconstruct the block structure by performing all possible comparisons between a W-gene and a W̄-gene. A more efficient reconstruction is possible when g_n is known. This is common in functions in which g_n stimulates the response. If g_n is known, then, since g_n ∈ W̄, all W-regulators can be identified by comparing every W-gene with g_n, for a total of N − k_W̄ comparisons. Since any W̄-gene is a regulator, these experiments are sufficient to identify all the regulators, and we can apply Proposition 4 to complete the reconstruction.
Proposition 5 A chain function can be reconstructed using at most N typing experiments and k_W̄ × (N − k_W̄) comparisons. Given g_n, a chain function can be reconstructed using at most N − 1 typing experiments and N − n + k_W k_W̄ comparisons.
Propositions 4 and 5 provided a worst-case analysis. Next, we describe another reconstruction algorithm, whose expected number of required experiments is lower. The algorithm is based on identifying g_n efficiently and using it for the reconstruction. Denote by D_g the set of W̄-successors of g ∈ W̄ in f.
Proposition 6 A chain function can be reconstructed using N typing experiments and an expected number of O(N log k_W̄ + k_W k_W̄) comparisons.
Proof: Algorithm: We perform N typing experiments. Next, we apply a randomized scheme to identify g_n and reconstruct the chain: Each time we pick a gene g ∈ W̄ at random, find its successors and their order, and remove g and all its successors from further consideration. We stop when no W̄-genes are left, identifying g_n as the last picked gene. In order to find the successors of g, we first identify the members of D_g using at most N − k_W̄ comparisons. Using D_g, we then reconstruct the part of the chain that spans g and its successors by at most |D_g| · k_W comparisons, as in Proposition 4. Complexity: The set of comparisons can be divided into two parts: those required to identify the sets D_g, and those required to reconstruct the chain parts induced by these sets. For the latter, k_W k_W̄ comparisons are needed in total, since every pair consisting of a W-regulator and a W̄-regulator is compared exactly once. Thus, it suffices to compute the expectation of the first part. Let T(x) be this expectation, given that the current W̄ set contains x elements, where T(0) = 0. Then T(x) ≤ (1/x) Σ_{q=1}^{x} (N + T(x − q)) for x ≥ 1. By induction, T(x) ≤ 2N log x + N. Substituting x = k_W̄ we obtain the required bound.
3.3 Using High-Order Experiments
In this section we show how to improve the above results when using experiments of order q > 2. The results in this section are mainly of theoretical interest, since high-order experiments may not be practical.
Proposition 7 Given the set of n regulators of a chain function, the function can be reconstructed using O(n + (n² log n)/q) experiments of order at most q. This is optimal up to constant factors for q = Θ(n).
Proof: The number of possible chain functions with n regulators is Θ((log_2 e)^{n+1} n!) [9]. Since each experiment provides one bit of information, the information lower bound is Ω(n log n) experiments. We give the upper bound proof for q = n. The proof for other values of q follows by appropriately choosing subsets of regulators of cardinality q, and reconstructing their sub-chains using the method we give next, thereby inferring the entire chain. Let n_i be the number of regulators in block i, where blocks are indexed in right-to-left order. Our reconstruction algorithm is as follows: First, we perform n typing experiments. Next, we identify the type of the first block using one experiment of order n, in which all regulators are perturbed. We proceed to reconstruct the blocks one by one, according to their order along the chain. Note that the type of each block is now known, since the two types alternate. Suppose we have already reconstructed blocks 1, ..., i − 1. For reconstructing the i-th block we only consider the set of regulators that do not belong to the first i − 1 blocks. Out of this set, let A be the subset of regulators that have the same type as block i, and let B be the subset of regulators of the opposite type. We use standard binary search on the set A to identify the members of the i-th block, including in the perturbations also all regulators in B. This requires O(n_i log n) experiments. Thus, altogether we perform O(n log n) experiments.
4 Combining Several Chains
In this section we extend the notion of a chain function to cover common biological examples in which the regulatee state is a boolean function of several chains. Frequently, a combination of several signals influences the transcription of a single regulatee via several pathways that carry these signals to the nucleus, and a regulation function that combines them together. Here, we formalize this situation by modeling each signal transduction pathway by a chain function, and letting the outputs of these paths enter a boolean gate. Define a k-chain function f as a boolean function which is composed of k chain functions over disjoint sets of regulators, that enter a boolean gate G(f). Let f^i be the i-th chain function and let g_j^i denote the j-th regulator in f^i. The output of the function is G(infl(g_1^1), ..., infl(g_1^k)). In the following we present several biological examples of k-chain functions that arise in transcriptional regulation in different organisms: The lac operon codes for lactose utilization enzymes in E. coli. It is under both negative and positive transcriptional control. In the absence of lactose, lac-repressor protein binds to the promoter of the lac operon and inhibits transcription. In the absence of glucose, the level of cAMP
in the cell rises, which leads to the activation of CAP, which in turn promotes transcription of the lac operon. In our formalism, the lac operon is controlled by a 2-chain function with an AND gate. The chains are: f^1(g_2^1, g_1^1) = f^1(lactose, lac-repressor), with control pattern 11, and f^2(g_3^2, g_2^2, g_1^2) = f^2(glucose, cAMP, CAP), with control pattern 100. Other examples of 2-chains with AND gates are the regulation of arginine metabolism and galactose utilization in yeast [11]. A 2-chain with an OR gate regulates lysine biosynthesis pathway enzymes in yeast [11]. These examples motivate us to restrict attention to gates that are either OR or AND. We first show that we can distinguish between OR and AND gates. We then show how to reconstruct k-chain functions in the case of OR and later extend our method to handle AND gates. Denote the output of f^i by O_i. If O_i = 1 in wild type, we call f^i a 1-chain and, otherwise, a 0-chain. A regulator g_j^i is called a 0-regulator (1-regulator) if its perturbation produces O_i = 0 (O_i = 1). Let k_0 (k_1) be the number of 0-regulators (1-regulators) in f. A block is called a 0-block (1-block) if it consists of 0-regulators (1-regulators).
Lemma 8 Given a k-chain function f with gate G(f) which is either AND or OR, k ≥ 2, we can determine, using O(N²) experiments of order at most 2, if G(f) is an AND gate or an OR gate.
Proof: We perform N typing experiments. If w = 0 and W̄ = ∅ then G(f) is an AND gate. If w = 1 and W̄ = ∅ then G(f) is an OR gate. Otherwise, W̄ ≠ ∅. In this situation the cases of w = 0 and w = 1 are similarly analyzed. We describe only the former. If w = 0 we have to differentiate between the case of an OR gate, whose inputs are all 0-chains, and the case of an AND gate, whose inputs are one 0-chain and (k − 1) 1-chains. To this end we perform all comparisons of a W-gene and a W̄-gene. Let T be the set of W-genes g such that the result of a {g, g′}-perturbation is w for every g′ ∈ W̄.
Then T ≠ ∅ iff G(f) is an AND gate.
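The k-chain model itself is straightforward to write down. The following is a sketch, not the paper's code, instantiated on the lac operon example above (parameter names are ours; note that glucose plays the role of g_n for f^2, so its wild-type state may be 0 or 1):

```python
def chain_output(y, s):
    # evaluate a single chain; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def k_chain_output(chains, gate):
    """A k-chain function: each (pattern, states) pair feeds one chain,
    and the chain outputs enter an AND or OR gate."""
    outs = [chain_output(y, s) for y, s in chains]
    return min(outs) if gate == 'AND' else max(outs)

# lac operon as a 2-chain with an AND gate:
# f1(lactose, lac-repressor) with pattern 11,
# f2(glucose, cAMP, CAP) with pattern 100
def lac_operon(lactose, glucose, lac_repressor=1, camp=1, cap=1):
    return k_chain_output(
        [([1, 1], [lactose, lac_repressor]),
         ([1, 0, 0], [glucose, camp, cap])], 'AND')
```

The sketch reproduces the expected behavior: the operon is transcribed only when lactose is present and glucose is absent.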
We now study the reconstruction of an OR gate. Let S be the (possibly empty) set of regulators that reside in one of the first blocks (i.e., blocks containing g_1^i) that are also 1-blocks. We observe that a perturbation of any regulator in S results in state(g_0) = 1 regardless of any other simultaneous perturbations we may perform. Hence, our reconstruction will be unique up to the ordering within blocks and the assignments of the regulators in S to their chains. The next lemma handles the case w = 0. The subsequent lemma treats the case w = 1.
Lemma 9 Given a k-chain function f with an OR gate and assuming that w = 0, we can reconstruct f using N typing experiments and (N − k_1)k_1 comparisons.
Proof: We perform N typing experiments. Then, for each 1-regulator b, we perform all possible comparisons, thereby identifying all 0-regulators that succeed b in its chain. This completes the reconstruction.
Lemma 10 Let f be a k-chain function with an OR gate. Assume that w = 1, and let r be the number of 1-chains entering the OR gate. Then f can be reconstructed using O(N^r + N k_0^r) experiments of order at most min{k + 1, r + 2}.
Proof: First, we determine r, the minimum order of an experiment that will produce output 0 for f. For successive values i we perform all possible i-order experiments; r is determined as the smallest i for which we obtain output 0. In total we perform O(N^r) experiments. We call the set of perturbed genes in an r-order experiment which results in output 0 a reset combination. Next, we identify all 1-regulators. This is done by performing O(N k_0^r) experiments of order (r + 1) as follows: For each reset combination discovered, we perturb in addition each other gene, one at a time, and record those that produce output 1 as 1-regulators. Each reset combination identifies a set of 1-regulators. These sets form a partial order under set inclusion. Let M be a reset combination corresponding to a minimal set in the partial order of 1-regulator sets. The genes in this minimal set will be exactly the 1-regulators in the 0-chains and the 1-regulators in S. By perturbing all r regulators in M, we deactivate the 1-chains, thereby reducing the problem of reconstructing the 0-chains to that of reconstructing a (k − r)-chain function with an OR gate and w = 0. This is done by applying the reconstruction method of Lemma 9 using experiments of order at most min{k + 1, r + 2}. The assignment of 1-regulators in S will remain uncertain. The 1-chains can now be computationally inferred as follows: Pick an arbitrary reset combination and consider in turn each of its subsets of cardinality r − 1. Fixing a subset, consider all reset combinations that contain it. The variable 0-regulators in these combinations correspond to the 0-regulators of a particular 1-chain. For each of these variable 0-regulators our experiments determine a set consisting of the 1-regulators in its chain that succeed it, plus the 1-regulators in S and in the 0-chains, which have been identified by the reset combination M, and can be removed from consideration.
Performing this computation for all combinations and subsets, we will have determined, for each 1-chain, its 0-regulators, its 1-regulators and the ordering relations between them.
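The first phase of Lemma 10 (determining r and the reset combinations by exhaustive search over experiment orders) can be sketched on a small simulated 2-chain OR gate. The function names and the toy instance are our own:

```python
from itertools import combinations

def chain_output(y, s):
    # evaluate a single chain; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def or_gate_output(chains, perturbed):
    # chains: list of (pattern, gene_ids), each ordered (g_n, ..., g_1);
    # perturbed genes are knocked out, all others stay at wild-type state 1
    outs = [chain_output(y, [0 if g in perturbed else 1 for g in gene_ids])
            for y, gene_ids in chains]
    return max(outs)

def find_resets(chains, N):
    """Try all i-order experiments for i = 1, 2, ... until some experiment
    drives the OR gate to 0; return r and all reset combinations found."""
    for r in range(1, N + 1):
        resets = [set(c) for c in combinations(range(N), r)
                  if or_gate_output(chains, set(c)) == 0]
        if resets:
            return r, resets
    return None, []
```

For a 1-chain f^1 on genes (0, 1) with pattern 11 and a 0-chain f^2 on genes (2, 3) with pattern 10, wild-type output is 1 and the single reset combination {0} is found at order r = 1, matching r = number of 1-chains.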
Note that for k = 1 the above algorithms will reconstruct a single chain. Further note that these algorithms may be used for the reconstruction of an AND gate as well, by exchanging the roles of 0 and 1 in the above description. This gives rise to the following result:
Theorem 11 A k-chain function with an OR or an AND gate can be reconstructed using O(N^k) experiments of order at most k + 1.
5 A Biological Application
The methods we presented above can be applied to reconstruct chain functions from biological data. We describe in detail one such reconstruction of the yeast galactose chain function, for which some of the required perturbations have been performed. We show that one additional experiment suffices to fully reconstruct the regulation function. Galactose utilization in the yeast S. cerevisiae [11] occurs in a biochemical pathway that converts galactose into glucose-6-phosphate. The transporter gene gal2 encodes a protein that transports galactose into the cell. A group of enzymatic genes, gal1, gal7, gal10, gal5 and gal6, encode the proteins responsible for galactose conversion. The regulators gal4p, gal3p and gal80p control the transporter, the enzymes, and to some extent each other (Xp denotes the protein product of gene X). In the following, we describe the regulatory mechanism, assuming that glucose is absent in the medium. gal4p is a DNA binding factor that activates transcription. In the absence of galactose, gal80p binds gal4p and inhibits its activity. In the presence of galactose in the cell, gal80p binds gal3p. This association releases gal4p, promoting transcription. This mechanism can be viewed as a chain function, where f^1(g_4, g_3, g_2, g_1) = f^1(galactose, gal3, gal80, gal4), and the corresponding control pattern is 0110. The gal7, gal10 and gal1 regulatees are also negatively controlled by another chain f^2 containing MIG1 and glucose. The two chains are combined by an AND gate. We focus here on the reconstruction of f^1, since the other chain has no influence in the experiments that we describe below (as those were conducted in the presence of glucose). f^1 consists of 3 blocks, where in wild type (in the presence of glucose and galactose) gal3, gal80 and gal4 are in state 1 (using the same discretization procedure employed by Ideker et al. [12]).
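The galactose chain just described can be written down and checked against the mechanism. A minimal sketch (the helper name is ours; the gene ordering follows f^1(galactose, gal3, gal80, gal4) with control pattern 0110):

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

GAL_PATTERN = [0, 1, 1, 0]   # (y4, y3, y2, y1) for f1

def gal_transcription(galactose=1, gal3=1, gal80=1, gal4=1):
    # f1(g4, g3, g2, g1) = f1(galactose, gal3, gal80, gal4)
    return chain_output(GAL_PATTERN, [galactose, gal3, gal80, gal4])
```

With galactose present the GAL genes are on; removing galactose, or knocking out gal3 or gal4, turns them off, while a gal80 knockout leaves them constitutively on, matching the double-negative regulatory mechanism.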
Assuming we know the group of four regulators, we need, according to Proposition 4, a total of 4 typing experiments and 3 comparisons (since only gal80 is of type W) to reconstruct the chain. Notably, all 4 typings and 2 of the 3 comparisons were performed by Ideker et al. [12], yielding the correct results. The missing experiment is a comparison of gal80 and gal3. A correct result of this experiment will lead to full reconstruction of the chain function.
6 Concluding Remarks
In this paper we studied the computational problems arising when wishing to reconstruct regulation relations using a minimum number of experiments, assuming that the experiments provide correct results. We restricted attention to common biological relations, called chain functions, and exploited their special structure in the reconstruction. We also suggested an extension of that model, that combines several chain functions, and studied the implied reconstruction questions. On the practical side, we have shown an application of our reconstruction scheme for inferring the regulation of galactose utilization in yeast. The task of designing optimal experimental settings is fundamental in meeting the great challenge of regulatory network reconstruction. While this task entails coping with complex interacting regulation functions, we chose here to focus on the reconstruction of a single regulation relation of a single regulatee. We also made two strong assumptions that simplify the analysis considerably: (1) The function can be studied in isolation. Hence, upon any perturbation, none of the other regulators change their states; (2) the wild-type state of all regulators (except possibly g_n) is 1. Our study could serve as a component in a more general scheme for dealing with entire networks, whose regulation relations possibly interact with one another.
Acknowledgments
R. M. Karp and R. Shamir were supported by a grant from the US-Israel Binational Science Foundation (BSF). R. Sharan was supported by a Fulbright grant. I. Gat-Viks was supported by a Colton fellowship.
References
1. T. Akutsu et al. Theor. Comp. Sci., 298:235-251, 2003.
2. T. Ideker, V. Thorsson, and R.M. Karp. In Proc. of the Pacific Symposium on Biocomputing, pages 305-316, 2000.
3. N. Friedman et al. J. Comp. Biol., 7:601-620, 2000.
4. A. Tanay and R. Shamir. Bioinformatics, 17, Supplement 1:270-278, 2001.
5. D. Hanisch et al. Bioinformatics, 18, Supplement 1:145-154, 2002.
6. T. Ideker et al. Bioinformatics, 18, Supplement 1:233-240, 2002.
7. E. Segal et al. Bioinformatics, 17, Supplement 1:243-252, 2001.
8. D. Pe'er, A. Regev, and A. Tanay. Bioinformatics, 18, Supplement 1:258-267, 2002.
9. I. Gat-Viks and R. Shamir. Bioinformatics, 19, Supplement 1:108-117, 2003.
10. F. C. Neidhardt, editor. ASM Press, 1996.
11. E. W. Jones, J. R. Pringle, and J. R. Broach, editors. Cold Spring Harbor Laboratory Press, 1992.
12. T. Ideker et al. Science, 292:929-933, 2001.
INFERRING GENE REGULATORY NETWORKS FROM RAW DATA - A MOLECULAR EPISTEMICS APPROACH
D. A. KIGHTLEY, N. CHANDRA AND K. ELLISTON
Genstruct Inc., 125 Cambridgepark Drive, Cambridge, MA 01702, USA
Biopathways play an important role in the functional understanding and interpretation of gene function. In this paper we present the results of an iterative algorithm for automatically generating gene regulatory networks from raw data. The algorithm is based on an epistemics approach of conjecture (hypothesis formation) and refutation (hypothesis testing). These operations are performed on a matrix representation of the gene network. Our approach also provides a way of incorporating external biological knowledge into the model. This is done by preassigning portions of the matrix, which represent previously known background knowledge. This background knowledge helps make the results closer to a human's rendition of such networks. We illustrate our approach by having the computer replicate a gene regulatory network generated by human scientists at an academic lab.
1 Introduction
Gene regulation in eukaryotes is the result of a complex interaction of numerous elements that combine to determine the expression of genes. The bindings of multiple transcription factors at cis-regulatory sites act in combination to determine the level of gene transcription. Discovering the nature of these interactions remains a challenging problem. Elucidation of the regulatory network architecture from a set of experimental data is a complex problem, and development of an automated process can help in generating networks that are too large and too complex for humans to handle. Algorithms for automatically generating a genetic regulatory network have been used on a number of different data types. Microarrays [5] give a measure of levels of gene expression in a cell and these data have been used to generate the underlying genetic network [17]. However, the cost of analysis of each interaction in the network is high. The complete set of data is rarely produced and data are frequently sparse. As a result, network inference algorithms are typically applied for recreating complex functional network structures from limited datasets [11, 13, 15]. A different technique measures changes in mRNA transcription of various target genes, measured by PCR, when another gene is perturbed. These perturbation studies [8, 10] can yield information as to which genes are regulated, either directly or indirectly, by another. Thus by combining the interactions it is possible to build up a regulatory network. However, an effect can be the result of a direct interaction or an indirect action through intermediate genes. Therefore, it is necessary to incorporate prior knowledge of the system to infer the network structure; a Bayesian network has been used for this purpose [14, 16].
An alternative approach for generating gene regulatory networks has been to use reverse engineering of data using generative algorithms [6, 7, 12]. This approach starts with a set of observations and generates networks that approximate the solution. Through modification and refinement, the network that best explains the data is arrived at (see Section 3.1).
2 Gene Perturbation Data
2.1 Source of the Data
The data relating to gene regulation of purple sea urchin (Strongylocentrotus purpuratus) embryo development have been made available on the Internet [2], from where the data were transcribed. Figure 1 is a sample of the data giving the effects on two transcription factors out of a total of 60 genes. The dataset relates to experiments performed at the Davidson Laboratory at the California Institute of Technology that involved quantitative PCR studies on embryos during the early stages of development (< 72 hr). Details of the findings from the studies have been published [3].
2.2 Gene Perturbation
The experiments performed on the sea urchin embryos involved perturbation of genes and measurement of changes in expression of a second, target gene. In the absence of other influences, perturbation of a gene that is an activator of another will cause the expression of the second gene to be decreased. Alternatively, if the perturbed gene is inhibitory, the expression level of the latter will be increased. The numerical values refer to the cycle number in the PCR experiment, and this relates back to the starting level of mRNA, which is amplified exponentially during PCR. A value of 1 represents an approximate doubling of the initial mRNA level. Thus, if a value of 3 is reported for an interaction, perturbation of the gene resulted in an 8-fold increase in the gene product compared with the unchanged cell. The convention used in the data is that negative values mean less starting mRNA. Thus, if perturbation of a gene results in lower quantities of mRNA transcribed from target genes, the relationship must have been activation. Similarly, positive values indicate inhibition. Transcription regulation involves a complex network of genes that encode transcription factors which, in turn, regulate other genes. A specific transcription factor can regulate multiple genes, and there are chains of interactions which form a cascade. Thus perturbation of a single gene can affect the expression of many other genes, both directly and indirectly. Consequently, an observed change in gene expression is the result of the combined effects on all of the regulatory genes that influence its transcription. Being able to determine whether an interaction is direct or indirect is a hurdle in deciphering causality in gene regulatory networks.
2.3 A Look at the Data from the Davidson Lab
The experimenters presented data relating to three types of perturbations:
Morpholino-substituted antisense oligonucleotide (MASO) - the oligonucleotide binds to the complementary mRNA strand, thereby preventing translation of the gene product.
Messenger RNA overexpression (MOE) - involves amplification of gene products from the perturbed gene.
Engrailed repressor domain fusion (En) - the transcription factor is converted into a form in which it becomes the dominant repressor of all target genes.
The three techniques represent distinctly different methods for gene perturbation. However, we do not have enough details on them to determine whether there are any useful differences in the results. Therefore, no distinction between techniques was made, results having been taken as equivalent, and data for the same perturbation, but from different experimental techniques, were combined. The results for each perturbation experiment were reported as up to 7 individual values that relate to both replicate measurements of the same cDNA batch and separate experiments. These values were averaged to provide a single value for equivalent samples. Results recorded as Not Significant (NS) were treated as zero.
Figure 1. A sample of the data presented on the Davidson Lab website. This portion of the data relates to perturbation of multiple genes and the effect on the transcription factors GataC and GataE.
The original data used ±1.6 as the significance threshold. However, by treating non-significant samples as zero, time-averaged samples were reduced in value, so a lower threshold was needed. After analysis of the data, values that fell below ±0.75 were taken to indicate no significant interaction. Data are presented as a set of time slices that cover intervals in embryo development between 12 and 72 hours after fertilization. However, most data are for three time slices between 12 and 28 hr and the remaining information is very sparse. For the majority of the work, mean values for the first 4 ranges were combined to yield an average across these times. In addition to gene perturbation results, there is a table of genes that are not affected by perturbation during the first 24 hrs, and also footnotes that provide information about gene interactions, many highlighting possible indirect effects. This additional information was incorporated into the experimental data to yield a single value for the effect of one gene on another. Data were available for only around 12.8% (460 out of 3600) of the possible interactions. Some of the remainder may be filled in by future experimentation but, for the purpose of this analysis, these 'unknowns' were taken to indicate no interaction unless there were indications to the contrary.
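The preprocessing just described (averaging replicates, treating NS entries as zero, applying the ±0.75 cutoff, and mapping negative means to activation) might be sketched as follows. The function name and input format are our own assumptions, not the paper's code:

```python
NS_THRESHOLD = 0.75   # values in (-0.75, 0.75) are treated as no interaction

def interaction_sign(values):
    """Collapse replicate PCR cycle-number changes for one perturbation
    into a +1 / -1 / 0 matrix entry. 'NS' entries count as zero; a negative
    average (less starting mRNA) means the perturbed gene was an activator,
    and a positive average means it was an inhibitor."""
    nums = [0.0 if v == 'NS' else float(v) for v in values]
    mean = sum(nums) / len(nums)
    if abs(mean) < NS_THRESHOLD:
        return 0
    return 1 if mean < 0 else -1     # activation = +1, inhibition = -1
```

For example, replicates of [-2.0, -1.5, 'NS'] average to about -1.17, which passes the threshold and is recorded as an activation (+1).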
2.4 Gene Selection
The overall dataset contained 60 genes identified to regulate gene expression in sea urchin embryos. To simplify the system, a decision was made to concentrate on the endomesoderm, since the greatest quantity of data related to these cells. The remainder of the embryonic regions had considerably less experimental coverage. Twenty-one regulatory genes are active in the sea urchin endomesoderm during the chosen developmental stages and, of the 441 possible interactions, there are 162 data points, or 36.7% coverage. In addition to the 21 genes, the published endomesoderm regulatory network also includes complexes (e.g. Su(H)-N'', n-TCF) involving endomesoderm gene products. However, no data were presented that supported the formation of these complexes, nor were there any data for their action within the cell. Therefore, complexes were omitted from the analysis.
3 Algorithm for Network Analysis
3.1 The Flowchart
The algorithm used is based on exploring the state space of all possible gene networks (models) in a systematic, iterative fashion. The first step involves generating a model from a given set of components. The components for the gene network are:
An activation
An inhibition
No effect
These three relations between genes are represented as +1, -1 and 0 in a matrix of gene-to-gene interactions. The initial model generated represents a hypothesis that has to be tested and scored. The next step involves simulation. The model, which represents a set of regulatory connections between genes, can be simulated qualitatively. For example, suppose the network contains the relation: A activates B, which activates C. The experimental data are checked to see what experiments have been done. Assume that one of the experiments involved overexpressing A; then, according to our hypothesized model, an overactivation of A will result in an increase in B and C. The results of the simulation are tested against the actual data. As indicated below, the actual data will show that B increases and C decreases.
(Diagram: the hypothesized model alongside the actual connections, which are unknown to the computer.)
This comparison is then used to score the model. The model is then modified using a state-space search algorithm to create a new model. The process is followed iteratively until the score no longer improves. To avoid local minima, the modified models are randomly perturbed using an annealing method.
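The simulate-compare-score loop can be sketched on the A activates B activates C example. Sign propagation here is a simple breadth-first pass, and everything (function names, scoring as a match count) is our own illustration rather than the paper's implementation:

```python
def predicted_effects(model, source, n):
    """Qualitatively simulate overexpression of `source` on a hypothesized
    sign matrix (model[i][j] = effect of gene i on gene j: +1, -1 or 0):
    propagate signs along directed paths, breadth-first."""
    effect = {source: +1}
    frontier = [source]
    while frontier:
        nxt = []
        for i in frontier:
            for j in range(n):
                if model[i][j] != 0 and j not in effect:
                    effect[j] = effect[i] * model[i][j]
                    nxt.append(j)
        frontier = nxt
    del effect[source]          # report only downstream effects
    return effect

def score(model, experiments, n):
    """Count agreements between simulated and observed effect signs."""
    s = 0
    for source, observed in experiments:
        pred = predicted_effects(model, source, n)
        s += sum(1 for j, sign in observed.items() if pred.get(j, 0) == sign)
    return s
```

For the hypothesized model A activates B activates C, overexpressing A predicts both B and C up; against observed data with B up and C down, the model scores 1 out of 2 agreements, and the search would then modify it.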
(Flowchart: generate a model (hypothesis); refine the model against footnotes and experimental data; compare results; score the model; output the result.)
Figure 2. Molecular Epistemics Algorithm of Conjecture and Refutation.
3.2 Handling non-numerical biological knowledge outside the raw data
The process of scientific discovery involves experimentation, but interpretation of the results involves bringing to bear one's prior knowledge of the underlying biology. Our approach allows outside literature, footnotes and personal knowledge to be added to the model before it runs. This is achieved in two ways. The first approach is to incorporate externally known regulatory knowledge into the input data prior to running the algorithm. Another approach involves incorporating the known prior knowledge into the initial model. The idea here is to make some of the gene-to-gene connections 'fixed', or pre-set, before the model generation process is started. If this cannot be done for all the knowledge, it can be incorporated into the scoring algorithm [1].
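One way to pre-set known connections in the initial model can be sketched as below; the function and its locking convention are hypothetical, since the paper gives no code for this step:

```python
def apply_prior_knowledge(M, fixed):
    """Write externally known interactions into the initial model and
    return the set of locked entries that the search must not modify.
    `fixed` maps (regulator, target) index pairs to +1, -1 or 0."""
    locked = set()
    for (i, j), value in fixed.items():
        M[i][j] = value
        locked.add((i, j))
    return M, locked

# Example: the literature says gene 0 activates gene 2.
model = [[0] * 3 for _ in range(3)]
model, locked = apply_prior_knowledge(model, {(0, 2): 1})
```

A search procedure would then skip perturbing any entry in `locked`, so the fixed links survive every iteration.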
4 Endomesoderm Gene Regulatory Networks
4.1 Representation of the Regulatory Networks
Networks generated by the algorithm were displayed graphically using Netbuilder, a tool for constructing computational models developed by the Science and Technology Research Centre, University of Hertfordshire, UK. This tool was also used by the Davidson Lab team to display their network results. The colors and overall network layout presented here were chosen to closely resemble those used in the Davidson paper, making for easier comparison.
4.2 The Complete Regulatory Network
By a straight substitution of the data, with values greater than or equal to the threshold taken to mean activation or inhibition depending on sign, and all other values taken to signify no connection, a simple representation of the entire network of connections was obtained (Figure 3). This interpretation takes into account the additional information provided in the footnotes to the data (incorporated into the values), but performs no further interpretation or analysis of the data. The generated network comprises 56 links between the genes, of which 45 are activations and 11 inhibitions.
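The straight substitution described above amounts to a sign/threshold map over the data matrix. A minimal sketch, where the names and the `None` convention for missing experiments are our assumptions:

```python
def threshold_network(data, threshold):
    """Map measured effects to a signed connection matrix: a value whose
    magnitude is at or above the threshold becomes +1 (activation) or
    -1 (inhibition) according to its sign; anything else, including
    missing experiments (None), means no connection."""
    n = len(data)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            v = data[i][j]
            if v is not None and abs(v) >= threshold:
                M[i][j] = 1 if v > 0 else -1
    return M
```

For instance, `threshold_network([[0.2, 2.5], [-1.8, None]], 1.0)` yields `[[0, 1], [-1, 0]]`: one activation, one inhibition, and no link where the signal is weak or the experiment is missing.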
Figure 3: Automatically generated Endomesoderm gene regulatory network that directly reflects the raw data.
The complete network generated directly from the data is similar to the endomesoderm network published by Davidson; however, there are some notable differences which may not be related directly to interpretation of the information. Firstly, the data available on the website are constantly under review and are augmented as new results become available. The dataset used in this study was dated October 28th, 2002 and so was considerably newer than that used to construct the network for the article that appeared in the March 1 issue of Science [3]. Although the network displayed on the website is also being updated, it is changed less frequently than the data and may not reflect all the updates. Secondly, the Davidson Lab's network represents the regulatory network for the organism and includes many genes that are not active in the endomesoderm. These genes will have interactions with the 21 genes under study which may have effects that are not apparent when the endomesoderm is viewed in isolation. Nevertheless, there are still discrepancies. Some links are present on the published network even though the dataset indicates they should not be there. For instance, there are data to suggest an activation link between bra and nrl; however, a footnote states that this must be an indirect link, since bra is not active in the cell at this time. The data used for this work took all of the footnotes into account and so do not show this link, whereas the published network includes it. On the other hand, there are data to support an activation link between eve and four other genes, yet the published network shows only a single effect. Thus, while these networks and the Davidson Lab's published networks show similar information, they show some differences which are, at least partly, due to differences in the source data.
4.3 Network reduction
The scoring mechanism in the underlying algorithm was modified to give a low score to links that can be explained by intermediate genes. This was done to remove indirect links, thereby generating a minimal network that explains the raw data faithfully. For instance, elk, Sox-1 and Notch all activate both GataC and gcm, and gcm activates GataC (Figure 3). Therefore, it is possible that the observed effects on GataC were really the result of an indirect effect through gcm. This suggests that the three links from elk, Sox-1 and Notch to GataC could be removed without contradicting information contained in the data. By eliminating the maximum number of links without breaking any of the connections between genes or making a link with too many intermediates, it was possible to remove 13 links from the network (all activations) and reduce the total number of links from 56 to 43 (Figure 4). In separate runs of the algorithm it was possible to get slightly different sets of links removed, but the minimum number of links necessary to explain all of the data was still 43. The algorithm was also run in a configuration that permitted the removal of links that can be explained through pathways of up to 2 intermediate genes. In this way 3 extra edges could be removed; however, the more intermediates a path has, the harder it is to justify that a removed link is truly indirect and that the observed effect is still explained.
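The single-intermediate test can be sketched as follows. This just enumerates removable candidates rather than reproducing the authors' modified scoring mechanism:

```python
def reducible_links(M):
    """Return (i, j, k) triples where the direct link i -> j could be
    explained by a two-step path i -> k -> j whose combined sign
    matches, making i -> j a candidate for removal."""
    n = len(M)
    candidates = []
    for i in range(n):
        for j in range(n):
            if i == j or M[i][j] == 0:
                continue
            for k in range(n):
                if k != i and k != j and M[i][k] * M[k][j] == M[i][j]:
                    candidates.append((i, j, k))
                    break
    return candidates

# elk (0) activates both gcm (1) and GataC (2); gcm activates GataC,
# so the direct elk -> GataC link is explainable through gcm.
M = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
```

On this toy matrix, `reducible_links(M)` returns `[(0, 2, 1)]`: only the elk-to-GataC link has a sign-consistent one-intermediate explanation, mirroring the example in the text.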
Figure 4: Automatically generated minimal Endomesoderm network with links removed where a connection is already present through a single intermediate node. On the complete network, genes highlighted in rectangular boxes have links to both GataC and gcm (ellipses). In the minimal network, their actions on GataC are all through gcm.
4.4 Networks from separate stages of embryo development
Data for the 21 endomesoderm genes at each time period were rendered into a separate network to compare expression profiles at each time. This yielded a set of networks that contained 15 (12-16 hr), 30 (18-21 hr), 45 (24-28 hr), 6 (32-36 hr), 2 (40-48 hr) and 0 (60-72 hr) links. Although gene expression does change through the development stages, it is unlikely that these results represent an accurate picture of the regulatory system; rather, they are an indication that the dataset is incomplete. Thus, without additional data to indicate that genes operational at one period are turned off in another (there are some such data), it will be very difficult to draw any conclusions from these observations.
5 Next steps
5.1 Probabilistic assignment of effects
The approach taken for this study relied on definitive assignment of a link (or no link) between two genes based on the data. The output from the algorithm is trinary and, therefore, relies heavily on the thresholding function to define whether a gene is activated or inhibited. There is no indication of the certainty of these predictions, and this all-or-nothing approach leads to the possibility that a small change in the threshold level can create or eliminate links. The idea here is to generate networks with links of varying levels of confidence. This may be done in our platform by placing link values on a continuous scale, for example from -10 to +10. The output value is a measure of the certainty with which the algorithm can predict the presence of a link. For instance, a value of +10 would mean an activation relationship with absolute certainty, and likewise -10 a certain inhibition. A value closer to zero is less certain. A threshold function will still be required to apply the cut-off that defines no link. Nevertheless, a value just exceeding the threshold will be labeled as uncertain, rather than all links having equal validity.
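The proposed continuous scale could be read off as below. The cut-off values and the 'certain'/'uncertain' labels are illustrative assumptions, with the sign convention following the +1/-1 coding used earlier:

```python
def classify_link(value, no_link_cutoff=3.0, certain_cutoff=8.0):
    """Interpret a link score on a continuous [-10, +10] scale: small
    magnitudes mean no link; larger ones are activations (positive) or
    inhibitions (negative), labeled by how far they clear the cut-off."""
    if abs(value) < no_link_cutoff:
        return ("none", None)
    kind = "activation" if value > 0 else "inhibition"
    certainty = "certain" if abs(value) >= certain_cutoff else "uncertain"
    return (kind, certainty)
```

A score of 9.5 would thus be a certain activation, -3.2 an uncertain inhibition, and 1.0 no link at all, so values just past the threshold are flagged rather than treated as equally valid.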
5.2 Incorporation of auxiliary information
A mechanism for incorporating external auxiliary knowledge of biology is needed. An example of where auxiliary information could be used is the action of Otx on wnt8. The data indicate that this should be a straightforward inhibition. However, the published network indicates that Otx activates an intermediate gene labeled 'Rep. of wnt8' [Repressor] and that this gene inhibits wnt8. There is no footnote with the data that could indicate why the link was drawn like this, yet evidence can be found in another publication by the group at the Davidson Laboratory [4]. This paper reported that introduction of an obligate repressor of Otx target genes resulted in a many-fold increase in the transcripts of wnt8. Thus, this information shows that the action of Otx on wnt8 is a two- (or more) step process. This knowledge could have been incorporated into the algorithm to improve the accuracy of the output. A future development of the module would, therefore, utilize the auxiliary information known about interactions and incorporate it into the decision to include a link or not. Thus, additional knowledge could be used to strengthen the case for one particular configuration of the network over another.
6 Discussion
Automated generation of biopathways can help generate large, complex gene regulatory networks that can then be minimized to best explain the raw data. These methods can incorporate knowledge gleaned from the literature, footnotes and other sources. This makes the approach closer to how a human would work: bringing knowledge and prior experience to bear when interpreting results from experiments.
Acknowledgements
We would like to thank our scientific advisors Atul Butte and Trey Ideker for their input and direction in selecting the data set and developing the approach.
References
1. Chandra et al. "Epistemics Engine", U.S. Patent application (Nov 2002)
2. Davidson Laboratory Website. http://its.caltech.edu/~mirsky/awr.html
3. Davidson et al. A genomic regulatory network for development. Science 295, 1669-1678 (2002)
4. Davidson et al. A provisional regulatory gene network for specification of endomesoderm in the sea urchin embryo. Developmental Biology 246, 162-190 (2002)
5. Kohane IS, Kho A, Butte AJ. Microarrays for an Integrative Genomics, MIT Press (2002)
6. Koza et al. Reverse engineering of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing 6, 434-445 (2000)
7. Koza et al. Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Stanford University Technical report SMI-2000-0851 (2000)
8. Ideker TE, Thorsson V, Karp RM. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing 5, 302-313 (2000)
9. Wessels LFA, Van Someren EP, Reinders MJT. A comparison of genetic network models. Pacific Symposium on Biocomputing 6, 508-519 (2001)
10. Maki Y et al. Development of a system for the inference of large scale genetic networks. Pacific Symposium on Biocomputing 6, 446-458 (2000)
11. Smith VA, Jarvis ED, Hartemink AJ. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics 18(Suppl. 1), S216-S224 (2002)
12. Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 3, 18-29 (1998)
13. Imoto S, Goto T, Miyano S. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing 7, 175-186 (2002)
14. Chrisman et al. Incorporating biological knowledge into evaluation of causal regulatory hypotheses. Pacific Symposium on Biocomputing 8, in press (2003)
15. Akutsu T, Miyano S, Kuhara S. Algorithms for inferring qualitative models of biological networks. Pacific Symposium on Biocomputing 5, 290-301 (2000)
16. Hartemink AJ et al. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing 7, 437-449 (2002)
17. Wimberly FC, Glymour C, Ramsey J. Experiments on the accuracy of algorithms for inferring the structure of genetic regulatory networks from associations of gene expressions, I: algorithms using binary variables. Submitted to the Journal of Machine Learning Research (2002)
A BIOSPI MODEL OF LYMPHOCYTE-ENDOTHELIAL INTERACTIONS IN INFLAMED BRAIN VENULES
P. LECCA AND C. PRIAMI
Dipartimento di Informatica e Telecomunicazioni, Università di Trento
{lecca,priami}@science.unitn.it
C. LAUDANNA AND G. CONSTANTIN
Dipartimento di Patologia, Università di Verona
{carlo.laudanna,gabriela.constantin}@univr.it
This paper presents a stochastic model of lymphocyte recruitment in inflamed brain microvessels. The framework used is based on stochastic process algebras for mobile systems. The automatic tool used in the simulation is BioSpi. We compare our approach with classical hydrodynamical specifications.
1 Introduction
Lymphocytes roll along the walls of vessels to survey the endothelial surface for chemotactic signals, which stimulate the lymphocyte to stop rolling and migrate through the endothelium and its supporting basement membrane. Lymphocyte adhesion to the endothelial wall is mediated by binding between cell surface receptors and complementary ligands expressed by the endothelium. The dynamics of adhesion are regulated by the bond association and dissociation rates: different values of these rates give rise to different dynamical behaviors of cell adhesion. The most common approach to the simulation of the rolling process of lymphocytes is based on hydrodynamical models of the particle motion under normal or stressed flow. At a macroscopic scale, the process is generally modeled with the typical equations of mass continuity, momentum transport and interfacial dynamics. At a microscopic scale, the cell rolling is simulated as a sequence of elastic jumps on the endothelial surface that result from sequential breaking and formation of molecular bonds between ligands and receptors. This kind of model is able to simulate the time-evolution of bond density. A major challenge for a mechanical approach is to treat the disparate scales between the cell (typically of the order of micrometers) and the bonds (of the order of nanometers). In fact, rolling involves both dynamical interaction between the cell and the surrounding fluid and microscopic elastic deformations of the bonds with the substrate cells. Moreover, recent studies have revealed
that the process leading to lymphocyte extravasation is a sequence of dynamical states (contact with endothelium, rolling and firm adhesion), mediated by partially overlapping interactions of different adhesion molecules and activation factors. The classical mechanical models are inefficient tools to describe the concurrency of the molecular interactions; even if they treat the physical system at the scale of intermolecular bonds with appreciable detail, they are not able to reproduce the sensitivity to small perturbations in the reagent concentrations or in reaction rates typical of microscopic stochastic systems governed by complex and concurrent contributions of many different molecular reactions. The probabilistic nature of a biological system at the molecular scale requires new languages able to describe and predict the fluctuations in the population levels. We rely on a stochastic extension 21,22 of the π-calculus 17, a calculus of mobile processes based on the notion of naming. The basic idea of this biochemical stochastic π-calculus is to model a system as a set of concurrent processes selected according to a suitable probability distribution in order to quantitatively accommodate the rates and the times at which the reactions occur. We use this framework to model and simulate the molecular mechanism involved in encephalitogenic lymphocyte recruitment in inflamed brain microvessels. Our development can also be interpreted as a comparison between the most common modeling methods, based on hydrodynamical and mechanical studies, and the π-calculus representation, in order to point out the ability of this new tool to perform a stochastic simulation of chemical interactions that is highly sensitive to small perturbations. We also present data obtained from BioSpi simulations.
2 Molecular mechanism of autoreactive lymphocyte recruitment in brain venules
A critical event in the pathogenesis of multiple sclerosis, an autoimmune disease of the central nervous system, is the migration of lymphocytes from the brain vessels into the brain parenchyma. The extravasation of lymphocytes is mediated by highly specialized groups of cell adhesion molecules and activation factors. The process leading to lymphocyte migration, illustrated in Fig. 1, is divided into four main kinetic phases: 1) initial contact with the endothelial membrane (tethering) and rolling along the vessel wall; 2) activation of a G-protein, induced by a chemokine exposed by the inflamed endothelium, and subsequent activation of integrins; 3) firm arrest; and 4) crossing of the endothelium (diapedesis). For this study, we have used a model of
early inflammation in which brain venules express E- and P-selectin, ICAM-1 and VCAM-1 20. The leukocyte is represented by encephalitogenic CD4+ T lymphocytes specific for PLP139-151, cells that are able to induce experimental autoimmune encephalomyelitis, the animal model of multiple sclerosis. Tethering and rolling are mediated by binding between cell surface receptors and complementary ligands expressed on the surface of the endothelium. The principal adhesion molecules involved in these phases are the selectins: the P-selectin glycoprotein ligand-1 (PSGL-1) on the autoreactive lymphocytes and the E- and P-selectin on the endothelial cells. The action of integrins partially overlaps that of the selectins/mucins: a4 integrins and LFA-1 are also involved in the rolling phase, but they have a less relevant role. Chemokines have been shown to trigger rapid integrin-dependent lymphocyte adhesion in vivo through a receptor coupled with Gi proteins. Integrin-dependent firm arrest in the brain microcirculation is blocked by pertussis toxin (PTX), a molecule able to ADP-ribosylate Gi proteins and block their function. Thus, as previously shown in studies on naïve lymphocytes homing to Peyer's patches and lymph nodes, encephalitogenic lymphocytes also require in situ activation by an adhesion-triggering agonist which exerts its effect via a Gi-coupled surface receptor. The firm adhesion/arrest is mediated by lymphocyte integrins and their ligands from the immunoglobulin superfamily expressed by the endothelium. The main adhesion molecule involved in cell arrest is the integrin LFA-1 on the lymphocyte and its counterligand ICAM-1 on the endothelium. The action of a4 integrins partially overlaps that of LFA-1: a4 integrins are involved in the arrest but have a less relevant role 20.
Figure 1. The process leading to lymphocyte extravasation is a finely regulated sequence of steps controlled by both adhesion molecules and activating factors.
3 Kinetics models of cell adhesion
In this section we first describe the micro-scale model of cell adhesion proposed by Dembo et al. 6, which computes the time-evolution of the bond density between ligands and receptors during the rolling phase. Second, we briefly report the recent results of the computational method called Adhesive Dynamics, developed by Chang et al. and based on the Bell model 1, which expresses the dissociation rate as a function of the total force applied on the lymphocyte and simulates the adhesion of a cell to a surface under flow. Here the relationship between ligand/receptor functional properties and the dynamics of adhesion is expressed in state diagrams, drawing the variation of the lymphocyte centroid position in time. We have considered these two models because they describe the two main aspects of cell motion: the molecular interaction at the scale of molecular bonds and the dynamics of the motion at the lymphocyte scale, in order to compare the two kinds of results with the π-calculus simulations. Dembo adhesion model. Rolling is a state of dynamic equilibrium in which there is rapid breaking of bonds at the trailing edge of the lymphocyte-endothelium contact zone, matched by rapid formation of new bonds at the leading edge. The process of lymphocyte rolling and adhesion under blood flow involves the balance of the forces arising from hydrodynamic effects, including shear and normal stresses, and the number and strength of the molecular bonds 7,12,23,24,25.
The kinetic reaction model proposed by Dembo et al. 6 simulates the rolling lymphocyte as a viscous Newtonian fluid enclosed in a pre-stressed elastic membrane, and the adhesion bonds formed between the rolling cell and its substrate are simulated as elastic springs perpendicular to the substrate. The parameters considered by this model are: N_l (ligand density) = N_r (receptor density) = 400 μm⁻², k_on (equilibrium association rate) = 84 s⁻¹, k_off (equilibrium dissociation rate) = 1 s⁻¹, σ (equilibrium spring constant) = 5 dyne/cm, σ_ts (transient bond elastic constant) = 4.5 dyne/cm, K_BT (thermal energy) = 3.8 × 10⁻⁷ ergs and λ (equilibrium bond length) = 20 nm. They are used to compute the bond density N_b, assuming the adhesion bond force F_b = N_b σ(λ - x) 16,18. The hyperbolic analytic solution for the time-evolution of the bond density N_b is plotted in Fig. 2.
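The hyperbolic time course can be illustrated with a generic first-order kinetic sketch using the Dembo parameter values. The rate law below is our simplification, not Dembo's exact equation:

```python
def bond_density(t, Nl=400.0, kon=84.0, koff=1.0, dt=1e-4):
    """Euler integration of dNb/dt = kon*(Nl - Nb) - koff*Nb: bonds form
    from unoccupied ligands and break at rate koff, so the density rises
    hyperbolically and saturates near Nl*kon/(kon + koff), about 395
    bonds per square micrometer with the Dembo values."""
    nb = 0.0
    for _ in range(int(t / dt)):
        nb += dt * (kon * (Nl - nb) - koff * nb)
    return nb
```

Sampling `bond_density` at increasing times reproduces the qualitative shape of Fig. 2: a steep initial rise followed by a plateau.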
Figure 2. Time-evolution of bond density.
external force f can be modeled as k_off = k_off^0 exp(sf / K_BT), where k_off^0 is the unstressed dissociation rate and K_BT is the thermal energy; s is a parameter with units of length that relates the reactivity of the molecule to the distance to the transition state in the intramolecular potential of mean force for single bonds. The Bell model parameters k_off^0 and s are functional properties of the molecules. Using the equation above to model the force dependence of dissociation, Chang et al. performed Adhesive Dynamics computer simulations to obtain the state diagrams of the lymphocyte motion. In the Adhesive Dynamics method 3,13,14, the simulation begins with a freely moving cell, modeled as a sphere with receptors distributed at random over its surface and given kinetic parameters. The cell is allowed to reach a steady translational velocity in the absence of specific interactions, after which receptor-mediated binding is initiated. The adhesion molecules involved and the uniformly reactive substrate react with association rate k_on and dissociation rate k_off. During each time step, bond formation and breakage are simulated by Monte Carlo methods, in which random numbers are compared with the probabilities for binding and unbinding to determine whether a bond will form or break in the time interval. The dynamics of motion involve the elastic bond force, given by Hooke's law, the colloidal force and the force imparted to the cell by the fluid shear. The motion of the lymphocyte is obtained from the mobility matrix for a sphere near a plane in a viscous fluid. The new positions of free receptors and tethers at t + dt are updated from their positions at t, using the translational and angular velocity of the cell. The process is repeated until the cell travels 0.1 cm, or 10 s of simulated time has elapsed. The Adhesive Dynamics simulation parameters are: R (cell radius) = 5 μm, λ (equilibrium bond length) = 20 nm, σ (spring constant) = 100 dyne/cm, μ (viscosity) = 0.01 g cm⁻¹ s⁻¹, T (temperature) = 310 K and γ_w (wall shear rate) = 100 s⁻¹.
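The Bell relation and the Monte Carlo breakage rule can be sketched directly. The parameter defaults below are illustrative assumptions in CGS units; the thermal energy k_B·T at 310 K is about 4.28e-14 erg:

```python
import math
import random

KBT = 4.28e-14  # thermal energy k_B * T at T = 310 K, in erg

def k_off_bell(f, k_off0=1.0, s=1e-7, kbt=KBT):
    """Bell model: dissociation rate of a bond under an applied force f
    (dyne), k_off = k_off0 * exp(s * f / kBT), where s (cm) relates the
    reactivity of the molecule to the transition-state distance."""
    return k_off0 * math.exp(s * f / kbt)

def bond_breaks(f, dt, rng=random.random, **bell_kw):
    """One Adhesive-Dynamics-style Monte Carlo step: the bond under
    force f breaks during dt with probability 1 - exp(-k_off * dt)."""
    p_break = 1.0 - math.exp(-k_off_bell(f, **bell_kw) * dt)
    return rng() < p_break
```

With f = 0 the rate reduces to the unstressed value k_off0; increasing the force raises the breakage probability exponentially, which is what produces the transition from rolling to detachment in the state diagrams.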
From different values of the rate constants in the Bell model (see caption of Fig. 3), different motion state diagrams emerge 16. Tethering, in which lymphocytes move at a translational velocity v < 0.5 v_h (where v_h is the hydrodynamic velocity of the blood flow) but exhibit no durable arrest, is shown in Fig. 3 (upper left). Rolling, for which cells travel at v < 0.5 v_h but experience durable arrests, is shown in Fig. 3 (upper right). Finally, in firm adhesion, shown in Fig. 3 (lower), cells bind to the endothelium and remain motionless.
Figure 3. Representative trajectory of lymphocyte tethering at a mean velocity v equal to one half of the hydrodynamic velocity v_h. The parameters are: γ = 0.001 nm, k_on = 84 s⁻¹, k_off^0 = 1 s⁻¹ (upper left). Representative trajectory of rolling motion of a lymphocyte, with a mean velocity v < 0.5 v_h, experiencing durable arrests (upper right). Representative trajectory of a lymphocyte in firm adhesion. The parameters are: γ = 0.001 nm, k_on = 84 s⁻¹, k_off^0 = 20 s⁻¹ (lower).
4 The BioSpi model implementation and results
We first recall the syntax and the intuitive semantics of the stochastic π-calculus 22. We then describe our specification of the lymphocyte recruitment process, and eventually we discuss the simulation results. Biomolecular processes are carried out by networks of interacting protein molecules, each composed of several distinct independent structural parts, called domains. The interaction between proteins causes biochemical modification of domains (e.g. covalent changes). These modifications affect the potential of the modified protein to interact with other proteins. Since protein interactions directly affect cell function, these modifications are the main mechanism underlying many cellular functions, making the stochastic π-calculus particularly suited for their modeling as mobile communicating systems. The syntax of the calculus follows:
P ::= 0 | (π, r).P | (νx)P | [x = y]P | P|P | P + P | A(y1, ..., yn)
where π may be either x(y) for input, or x̄⟨y⟩ for output (where x is the subject and y is the object), or τ for silent moves. The parameter r corresponds to the basal rate of a biochemical reaction and is the parameter of an exponential distribution associated with the channel occurring in π. The order of precedence among the operators is the order (from left to right) listed above. Processes model molecules and domains. Global channel names and co-names represent complementary domains, and newly declared private channels define complexes and cellular compartments. Communication and channel transmission model chemical interaction and subsequent modifications. The actual rate of a reaction between two proteins is determined according to an empirically determined constant basal rate and the concentrations or quantities of the reactants. Two different reactant molecules, P and Q, are involved, and the reaction rate is given by Brate × |P| × |Q|, where Brate is the reaction's basal rate, and |P| and |Q| are the concentrations of P and Q in the chemical solution, computed via the two auxiliary functions In_x and Out_x that inductively count the number of receive and send operations on a channel x enabled in a process. The semantics of the calculus thereby defines the dynamic behaviour of the modeled system driven by a race condition, yielding a probabilistic model of computation. All the activities enabled in a state compete and the fastest one succeeds. The continuity of exponential distributions ensures that the probability that two activities end simultaneously is zero. The reduction semantics of the biochemical stochastic π-calculus is:
((x̄⟨z⟩, r_b).Q + M') | ((x(y), r_b).P + N') → P{z/y} | Q, with rate r_b · r_0 · r_1
A reaction is implemented by the three parameters r_b, r_0 and r_1, where r_b represents the basal rate, and r_0 and r_1 denote the quantities of interacting molecules; they are computed compositionally by In_x and Out_x.
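The rate computation and the race condition can be sketched as a Gillespie-style selection. The function names below are our own, not BioSpi's API:

```python
import math
import random

def reaction_rate(brate, n_senders, n_receivers):
    """Actual rate of a bimolecular interaction on a channel: the basal
    rate times the counts of enabled send and receive operations,
    i.e. Brate * |P| * |Q|."""
    return brate * n_senders * n_receivers

def next_reaction(rates, rng=random.random):
    """Race condition: every enabled activity draws an exponential delay
    at its own rate; the fastest one fires. Returns (index, delay)."""
    enabled = [(i, r) for i, r in enumerate(rates) if r > 0]
    delays = [(-math.log(rng()) / r, i) for i, r in enabled]
    delay, winner = min(delays)
    return winner, delay
```

For example, a channel with basal rate 0.5 and 10 senders against 4 receivers yields an actual rate of 20.0; among several such rates, `next_reaction` picks the activity whose exponential delay is smallest.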
4.2 Specification
The system of interacting adhesion molecules that regulates lymphocyte recruitment on the endothelial surface, illustrated in Fig. 1, has been implemented in the biochemical stochastic π-calculus. The system is composed of eight concurrent processes, corresponding to the eight species of adhesion molecules that regulate cell rolling and arrest: PSGL1, PSELECTIN, CHEMOKIN, CHEMOREC, ALPHA4, VCAM1, LFA1 and ICAM1. The code implements the four phases of lymphocyte recruitment: the interaction between PSGL1 and PSELECTIN, the ALPHA4 and LFA1 activation by chemokines, and the firm arrest mainly caused by the interaction between the active form of LFA1, LFA1ACTIVE, and ICAM1, and in part also due to the interaction of the active form of ALPHA4, ALPHA4ACTIVE, with VCAM1. Its specification is given below. We simulated the role and the contribution of the different interactions as bimolecular binding processes occurring at different rates. The selectin interaction PSGL1/PSELECTIN plays a crucial role in guaranteeing efficient rolling; therefore the channel rates for the communication in the binding process between PSGL1 and PSELECTIN have been calculated from the deterministic rates of the Bell model that reproduce the tethering and rolling motion. Analogously, for the ALPHA4ACTIVE/VCAM1 interaction, which contributes to rolling and, in part, also to cell arrest, the channel rates have been calculated from the Bell model rates that recreate the rolling motion. The interaction LFA1ACTIVE/ICAM1 is mainly responsible for the firm arrest of the cell on the endothelium, and thus the rates of communication between LFA1ACTIVE and ICAM1 have been calculated from those reproducing firm adhesion in the Bell model simulations. The activation of ALPHA4 and LFA1 integrins by the chemokines is implemented in two steps: first a chemokine CHEMOKIN binds to its receptor CHEMOREC and changes to a "bound" state CHEMOKINBOUND.
Then the complex CHEMOKINBOUND sends two names sign1 and sign2 on the channels alpha_act and lfa_act, on which the processes ALPHA4 and LFA1 are ready to receive them as inputs. After ALPHA4 and LFA1 have received the signals from CHEMOKINBOUND, they change to the active forms ALPHA4ACTIVE and LFA1ACTIVE. The whole process of lymphocyte recruitment occurs in a space of V = 1.96 × 10⁵ μm³, corresponding to the volume of a vessel of 25 μm radius and 100 μm length, and in a simulated time of 15 s. In the considered volume V, the number of molecules is of the order of 10⁶. In our simulations the values
529 S Y S T E M ::= PSGLlIPSELECTINICHEMOKINICHEMORECIALPHAl IVCAMlILFAIIICAMl P S G L l ::= ( u b a c k b o n e ) B I N D I N G P S I T E l B I N D I N G P S I T E ::= (@(backbone), RA).PSGLlBOUND(backbone) P S G L l B O U N D ( b b ) ::= (bb,RDo).PSGLl P S E L E C T I N ::= (bind(cross-backbone),R A ) . P S E LECT I N B O U ND(crossbackbone) P S E L E C T I N B O U N D ( c b b ) ::= RDo).PSELECTIN C H E M O K I N ::= (u chemobb)B I N D I N G - C S I T E B I N D I N G C S I T E ::= (G(chemobb),RA-C).CHEMOCHIN-BOUND(chemobb) C H E M O C H I N B O U N D ( c h e m o b b ) ::= ACTlIACT2IACT3(cbb) ACT1 ::= (alpha-act ( s i g n l ) ,A).ACTI ACT2 ::= (lfa-a&sign2), A).ACT2 ACT3(chb) ::= (chb,R D _ C ) . C H E M O K I N C H E M O R E C ::= (lig(crossxhemobb),R A E ). C H E M O R E C B O U N D ( cross-chemobb) C H E M O R E C B O U N D ( c c r ) ::= (ccr,A ) . C H E M O R E C A L P H A 4 ::= (alphaact(act-a),A ) . A L P H A 4 A C T I V E L F A l ::= (If a-act (act-1),A).L F A l A C T I V E A L P H A 4 A C T I V E ::= ( u backbone2)BINDINGASITE B I N D I N G A S I T E ::= (binda(backbme2),RA).ALPHA4BOUND(backbone2) A L PH A 4 B O U N D( bb2) ::= RD1 ).ALP HA4 V C A M 1 ::= (bind2(cross-back~one2), R A ).V C AM I B O U N D( cross backbone2) V C A M l B O U N D ( c b b 2 )::= (cbba,RDl).VCAMl L F A l A C T I V E ::= ( u backbone3)BINDINGYITE3 B I N D I N G S I T E 3 ::= (bind3(backbwne3),RA).LFAIBOUND(backbone3) L F A l B O U N D ( b b 3 ) ::= (bb3,R D 2 ) . L F A l B O U N D I C AM 1 ::= (bind3(cross-backbae3),R A ).I C A M 1B O U N D ( crossbackbone3) I C A M l B O U N D ( c b b 3 ) ::=(cbb3,R D 2 ) . I C A M l B O U N D
RA = 6.500
RA_C =
RD0 = 0.051
RD1 = 5.100
RD2 = 1.000
RD_C = 3.800
A = infinite
Radius of vessel = 25 micrometers
Length of vessel = 100 micrometers
Volume of vessel = 1.96 × 10^5 cubic micrometers
Radius of lymphocyte = 5 micrometers
of the volume and of the number of molecules have been proportionally re-scaled by this factor, to make the code computationally faster. The stochastic reaction rates for a bimolecular binding/unbinding reaction are inversely proportional to the volume of the space in which the reactions occur [10]; in particular, for the stochastic association rate we have RA = kon/V and for the stochastic dissociation rate we have RD = 2koff/V, where the k's are the deterministic rates. The output of the simulation is the time-evolution of the number of bonds (shown in Fig. 4), assuming the following densities expressed in μm^-2: PSGL-1 [19] and P-SELECTIN 5600, ALPHA4 and VCAM-1 85, CHEMOREC and CHEMOKINES 15000, LFA-1 [11] and ICAM-1 5500. The characterization of the steps and the adhesion molecules implicated in lymphocyte recruitment in brain venules was performed by intravital microscopy, a powerful technique allowing the visualization and analysis of the adhesive interactions directly through the skull in live animals.
[Figure 4: number of bonds vs. time (sec) for the PSGL-1/P-SELECTIN, ALPHA4/VCAM-1 and CHEMOKINES/RECEPTORS interactions.]
Figure 4. BioSpi simulation of the 4-phase model of lymphocyte recruitment.
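The deterministic-to-stochastic rate conversion used in the simulations (RA = kon/V, RD = 2koff/V, as stated in the text) can be sketched as follows; the kon, koff and rescaling-factor values here are hypothetical, not taken from the paper.

```python
# Sketch of the rate conversion described in the text: RA = kon / V and
# RD = 2 * koff / V. The kon, koff and scale factor f below are invented
# for illustration only.
def stochastic_rates(kon, koff, volume):
    """Return (RA, RD) for deterministic rates kon, koff in volume V."""
    return kon / volume, 2.0 * koff / volume

V = 1.96e5  # vessel volume in cubic micrometers, as given in the text

ra, rd = stochastic_rates(kon=1.3e6, koff=5.0, volume=V)

# Rescaling the volume by a factor f scales both stochastic rates by 1/f;
# the text rescales the volume and the molecule numbers proportionally to
# make the simulation computationally cheaper.
f = 0.01
ra_s, rd_s = stochastic_rates(kon=1.3e6, koff=5.0, volume=V * f)
assert abs(ra_s * f - ra) < 1e-9 and abs(rd_s * f - rd) < 1e-9
```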
The BioSpi simulations reproduce the hyperbolic behavior predicted by the Dembo model. However, unlike the Dembo model, the BioSpi model is more sensitive to variations of the dissociation rate constant koff. Moreover, the plots in Fig. 4 show the relevant roles played by the PSGL-1/P-Selectin and LFA-1/ICAM-1 interactions. The curve describing the time-evolution of the number of bonds of the LFA-1/ICAM-1 interaction presents an approximately linear steep increase (with an angular coefficient of the order of 10^3) followed by a clearly constant behavior: this curve represents the firm adhesion of the lymphocyte and is comparable with the state diagram of the Bell model in Fig. 3. In fact, firm arrest is reached when the number of bonds becomes stably constant in time or, analogously, when the position of the cell centroid no longer changes. On the contrary, the plots representing the PSGL-1/P-SELECTIN and ALPHA4/VCAM-1 interactions present, after a steep increase with about the same slope as that of the LFA-1/ICAM-1 binding, an oscillating behavior with respect to the equilibrium positions given by y = 80 and y = 1, respectively. This behavior represents the sequential breaking and formation of bonds in selectin and integrin binding during rolling (see Fig. 3 for comparison). The results obtained in this work assert that the formal description provided by the BioSpi model represents in a concise and expressive way the basic physics governing the process of lymphocyte recruitment.
More generally, physics describes both microscopic and macroscopic interactions between bodies by means of the concept of force, which expresses the action of the field generated by a particle (or a set of particles) on the other bodies of the system. The BioSpi representation captures this view, which is the central paradigm of the physical description of nature, and summarizes it in the new concept of communication exchange (name passing). Moreover, the rates of communication in the stochastic π-calculus encode the whole dynamics of the system, because they contain the quantitative information about the intensity of the forces transmitted between the particles. Finally, the main advantage of the BioSpi model is that the π-calculus permits a deeper investigation of dynamics and of molecular and biochemical details. It has a solid theoretical basis and linguistic structure, unlike other approaches [5].
Conclusion
The usage of new languages such as the stochastic π-calculus to describe and simulate the migration of autoreactive lymphocytes in the target organ will help us better understand the complex dynamics of lymphocyte recruitment during autoimmune inflammation in live animals. Furthermore, our approach may represent an important step toward future predictive studies on lymphocyte behavior in inflamed brain venules. The stochastic π-calculus may, thus, open new perspectives for the simulation of key phenomena in the pathogenesis of autoimmune diseases, implicating not only better knowledge, but also better future control of the autoimmune attack.
References
1. Bell G. I., Science 200, 618-627, 1978.
2. The BioSpi project web site: http://www.wisdom.weizmann.ac.il/~aviv
3. Chang K.-C., Tees D. F. J. and Hammer D. A., The state diagram for cell adhesion under flow: leukocyte adhesion and rolling, Proc. Natl. Acad. Sci. USA, 10.1073/pnas.200240897, 2000.
4. Chigaev A., Blenc A. M., Braaten J. V., Kumaraswamy N., Kepley C. L., Andrews R. P., Oliver J. M., Edwards B. S., Prossnitz E. R., Larson R. S. and Sklar L. A., Real time analysis of the affinity regulation of alpha 4-integrin. The physiologically activated receptor is intermediate in affinity between resting and Mn(2+) or antibody activation, J. Biol. Chem. 2001 Dec 28;276(52):48670-8.
5. Curti M., Degano P. and Baldari C. T., Causal π-calculus for biochemical modelling, Computational Methods in Systems Biology, CMSB 2003, Springer.
6. Dembo M., Torney D. C., Saxman K. and Hammer D., The reaction-limited kinetics of membrane-to-surface adhesion and detachment, Proc. R. Soc. Lond. B, Vol. 234, pp. 55-83, 1988.
7. Dong C., Cao J., Struble E. J. and Lipowsky H., Mechanics of leukocyte deformation and adhesion to endothelium in shear flow, Annals of Biomedical Engineering, Vol. 27, pp. 298-312, 1999.
8. Evans E. and Ritchie K., Biophys. J., Vol. 72, 1541-1555, 1997.
9. Fritz J., Katopodis A. G., Kolbinger F. and Anselmetti D., Force-mediated kinetics of single P-selectin/ligand complexes observed by atomic force microscopy, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 12283-12288, 1998.
10. Gillespie D. T., Exact stochastic simulation of coupled chemical reactions, Journal of Physical Chemistry, 81(25):2340-2361, 1977.
11. Goebel M. U. and Mills P. J., Acute psychological stress and exercise and changes in peripheral leukocyte adhesion molecule expression and density, Psychosom. Med. 2000 Sep-Oct;62(5):664-670.
12. Goldman A. J., Cox R. G. and Brenner H., Slow viscous motion of a sphere parallel to a plane wall: Couette flow, Chem. Eng. Sci., 22:653-660, 1967.
13. Hammer D. A. and Apte S. M., Biophys. J. 63, 35-57, 1992.
14. Kuo S. C., Hammer D. A. and Lauffenburger D. A., Biophys. J. 73, 517-531, 1996.
15. Laudanna C., Kim J. Y., Constantin G. and Butcher E., Rapid leukocyte integrin activation by chemokines, Immunological Reviews, Vol. 186:37-46, 2002.
16. Lei X. and Dong C., Cell deformation and adhesion kinetics in leukocyte rolling, BED-Vol. 50, Bioengineering Conference, ASME 1999 (available at http://asme.pinetec.com/biol999/data/pdfs/a0081514.pdf).
17. Milner R., Communicating and Mobile Systems: the π-calculus, Cambridge University Press, 1999.
18. N'dri N., Shyy W., Udaykumar H. S. and Tran-Son-Tay R., Computational modeling of cell adhesion and movement using continuum-kinetics approach, BED-Vol. 50, Bioengineering Conference, ASME 2001 (available at http://asme.pinetec.com/bio2001/data/pdfs/aOO12976.pdf).
19. Norman K. E., Katopodis A. G., Thoma G., Kolbinger F., Hicks A. E., Cotter M. J., Pockley A. G. and Hellewell P. G., P-selectin glycoprotein ligand-1 supports rolling on E- and P-selectin in vivo, Blood 2000 Nov 15;96(10):3585-3591.
20. Piccio L., Rossi B., Scarpini E., Laudanna C., Giagulli C., Issekutz A. C., Vestweber D., Butcher E. C. and Constantin G., Molecular mechanisms involved in lymphocyte recruitment in inflamed brain microvessels: critical roles for P-selectin glycoprotein ligand-1 and heterotrimeric Gi-linked receptors, The Journal of Immunology, 2002.
21. Priami C., Stochastic π-calculus, The Computer Journal, 38, 6, 578-589, 1995.
22. Priami C., Regev A., Shapiro E. and Silverman W., Application of a stochastic name-passing calculus to representation and simulation of molecular processes, Information Processing Letters, 80, 25-31, 2001.
23. Schmidtke D. W. and Diamond S. L., Direct observation of membrane tethers formed during neutrophil attachment to platelets or P-selectin under physiological flow, The Journal of Cell Biology, Vol. 149, Number 3, 2000.
24. Udaykumar H. S., Kan H.-C., Shyy W. and Tran-Son-Tay R., Multiphase dynamics in arbitrary geometries on fixed Cartesian grids, J. Comp. Phys., Vol. 137, pp. 366-405, 1997.
25. Zhu C., Bao G. and Wang N., Cell mechanics: mechanical response, cell adhesion and molecular deformation, Annual Review of Biomedical Engineering 2:189-226.
MODELING CELLULAR PROCESSES WITH VARIATIONAL BAYESIAN COOPERATIVE VECTOR QUANTIZER

X. LU(1,2,4), M. HAUSKRECHT(2) and R. S. DAY(3)
(1) Center for Biomedical Informatics, (2) Dept. of Computer Science, (3) Dept. of Biostatistics, University of Pittsburgh; (4) Dept. of Biometry and Epidemiology, Medical University of South Carolina
email: [email protected]*, [email protected], [email protected]
Abstract
Gene expression of a cell is controlled by sophisticated cellular processes. The capability of inferring the states of these cellular processes would provide insight into the mechanisms of the gene expression control system. In this paper, we propose and investigate the cooperative vector quantizer (CVQ) model for the analysis of microarray data. The CVQ model is capable of decomposing observed microarray data into many different regulatory subprocesses. To make the CVQ analysis tractable we develop and apply variational approximations. Bayesian model selection is employed in the model, so that the optimal number of processes is determined purely from observed microarray data. We test the model and algorithms on two datasets: (1) simulated gene-expression data and (2) real-world yeast cell-cycle microarray data. The results illustrate the ability of the CVQ approach to recover and characterize regulatory gene expression subprocesses, indicating a potential for advanced gene expression data analysis.
1 Introduction

Current DNA microarray technology allows scientists to monitor gene expression at the genome level. Although microarray data are not direct measurements of the activity of cellular processes (or signal transduction pathways), they provide opportunities to infer the states of the cellular processes and study the mechanism of gene expression control at the system level. When a cell is subjected to different conditions, the states of the processes controlling gene expression change accordingly and result in different gene expression patterns. One important task for systems biologists is to identify the cellular processes controlling gene expression and infer their states under a specific condition based on observed expression patterns. Different approaches have been applied in order to identify the cellular processes by decomposing (deconvoluting) the observed microarray data into different components. For example, singular value decomposition (SVD) [1], principal components analysis (PCA) [2], independent component analysis (ICA) [3,4], Bayesian decomposition [5] and probabilistic
*To whom correspondence should be addressed.
relational modeling (PRM) [6] have been used to decompose observed microarray data into different processes. The problem of identifying hidden regulatory processes in a cell can be formulated as a blind source separation problem, where distinct regulatory processes, which we would like to identify and characterize, are modeled as hidden sources.^b The task is to identify the source signals purely based on observed data. An additional challenge is that the separation process must be performed fully unsupervised - the number of sources is not known in advance. To facilitate biological interpretation, the originating signals of the processes in a system should be identified uniquely. Some of the aforementioned models, such as SVD and PCA, restrict the components to be orthonormal, thus they are not suitable for blind source separation. Independent component analysis (ICA), independent factor analysis (IFA) and various vector quantization models [7,8,9,10] are among the models used for blind source separation. In this work we develop an inference algorithm for one such model - the cooperative vector quantizer (CVQ) model. The main advantage of the CVQ model over other blind source separation models is that it mimics the switching-state nature of the regulatory processes; consequently, the results of the analysis can be easily interpreted by biologists. Fully unsupervised blind source separation requires learning the model structure. In microarray data analysis, one needs to infer the optimal number of latent regulatory processes in the system. The parameters of a latent variable model with a fixed structure (known number of processes) can be learned using maximum likelihood estimation (MLE) techniques, e.g. the expectation maximization (EM) algorithm [11], as in Segal et al [6]. Unfortunately, the value of the likelihood by itself is not suitable for model selection. The main reason is that MLE prefers more complex models and tends to over-fit the training data.
That is, more complex models return higher likelihood scores for the training data, but they do not generalize well to future, yet to be seen, data. On the other hand, the methods used in the studies by Alter et al [1] and Liebermeister [3] simply dictate the number of processes of the model and do not have the flexibility of model selection. Model selection can be addressed effectively within the Bayesian framework [12,13,14]. Bayesian selection penalizes models for complexity as well as for poor fit; therefore it implements Occam's Razor. In this work, we investigate the Bayesian model selection framework in the context of the CVQ model. More specifically, we derive and implement a variational Bayesian approach which can automatically learn both the structure and parameters of the CVQ model, and thus perform full-scale blind source separation. In the following sections, we first present the CVQ model. After that, we discuss the theory of Bayesian model selection and its approximations. We derive and present a variational Bayesian approximation for learning the CVQ model from data.

^b We use "sources" and "processes" interchangeably throughout the rest of the paper.
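Bayesian model selection by marginal likelihood, as discussed above, can be illustrated on a toy problem where the evidence integral is available in closed form. The sketch below compares a zero-parameter coin model against one with a free bias parameter; the data counts are hypothetical, chosen only to show the complexity penalty at work.

```python
from math import lgamma, log

# Toy illustration of Bayesian model selection by marginal likelihood
# ("evidence") on coin-flip data. M0 fixes p = 0.5 (no free parameters);
# M1 places a uniform Beta(1,1) prior on p. The counts are invented.
def betaln(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_m0(heads, tails):
    return (heads + tails) * log(0.5)

def log_evidence_m1(heads, tails, a=1.0, b=1.0):
    # closed form of the evidence integral: B(a + h, b + t) / B(a, b)
    return betaln(a + heads, b + tails) - betaln(a, b)

# Near-fair data: the simpler model M0 wins (Occam's Razor at work).
assert log_evidence_m0(52, 48) > log_evidence_m1(52, 48)
# Strongly biased data: the flexible model M1 wins despite its penalty.
assert log_evidence_m1(90, 10) > log_evidence_m0(90, 10)
```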
Figure 1: A directed acyclic graph (DAG) representation of the cooperative vector quantizer (CVQ) model. The square corresponds to an individual data point, which consists of observed variables y and latent variables s. W, γ, π and τ are model parameters.
Finally, we test the model and algorithms on (1) simulated gene expression data and (2) yeast cell-cycle microarray data [20], and discuss the results.
2 The CVQ Model

In the CVQ model, the states of the cellular processes are represented as a set of binary variables s = {s_k}_{k=1}^K referred to as sources, where K is the number of processes in a given model. Each source assumes a value of 0/1, which simulates the "off/on" state of a cellular process. Each microarray experiment is represented as a D-dimensional vector y, where D is the number of genes on a microarray. An observed data point y^(n) is produced cooperatively by the sources depending on their states. When a source s_k equals 1, it outputs a D-dimensional weight w_k to y. We can think of the source variable s_k as a switch which, when turned on, allows the outflow of the weights w_k to y. More formally,

y = Σ_{k=1}^{K} s_k w_k + ε    (1)
where N(·|μ, Σ) denotes a multivariate Gaussian distribution; s_k is an indicator variable; w_k is the weight output by source s_k; ε ~ N(0, Λ) is the noise of the system. The parameters θ of the model are: π = {π_1, π_2, ..., π_K}, where π_k is the probability of s_k = 1; a D × K weight matrix W whose column w_k corresponds to the weight output for source s_k; γ = {γ_1, γ_2, ..., γ_K}, whose components are the precisions of the columns of the weight matrix; and the covariance matrix Λ = τ^{-1}I, where τ is the precision of the noise ε. The graphical representation of the model is shown in Figure 1. The learning task includes parameter estimation and model selection based on the Bayesian framework.
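A minimal sketch of this generative process, with illustrative dimensions and parameter values (the weights, source probabilities and noise precision below are invented for the example, not taken from the paper):

```python
import numpy as np

# Each binary source s_k gates its weight column w_k; the observation is the
# gated sum plus Gaussian noise with precision tau (Lambda = tau^-1 * I).
rng = np.random.default_rng(0)
D, K, N = 16, 8, 600            # genes, sources, data points (illustrative)
pi = np.full(K, 0.5)            # pi_k = P(s_k = 1)
W = rng.normal(size=(D, K))     # weight matrix; column k is w_k
tau = 25.0                      # noise precision

S = (rng.random((N, K)) < pi).astype(float)               # 0/1 source states
Y = S @ W.T + rng.normal(scale=tau ** -0.5, size=(N, D))  # y = sum_k s_k w_k + eps
```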
3 Bayesian Model Selection

The main task of model selection in the VBCVQ model is to determine the number of processes (sources) in the model. In the Bayesian model selection framework, we choose the model M_i with the highest posterior probability P(M_i|Y) among a set of models M = {M_j}, based on the observed data. Therefore the selection of the model is dictated by the observed data, not arbitrarily by the modeler. According to Bayes' theorem, the posterior probability of a model satisfies P(M_i|Y) ∝ P(Y|M_i)P(M_i), where

P(Y|M_i) = ∫_θ P(Y|θ, M_i) P(θ|M_i) dθ    (2)

and Y = {y^(n)}_{n=1}^N are the observed data; P(Y|M_i) is the marginal likelihood or "evidence" for the model; P(M_i) is the prior probability of the model M_i. If no prior knowledge is available, we use an uninformative prior P(M_i), and the model selection is determined by P(Y|M_i).

Variational approximations. The evaluation of equation (2) is often intractable in practice. Various techniques are used to approximate the integration, e.g., the Laplace approximation, the Bayesian information criterion (BIC) and Markov chain Monte Carlo (MCMC) simulation [13]. Recently, the variational Bayesian approach has been used in various statistical models to approximate the integration in equation (2) [15,16,12,10]. The approach takes advantage of the fact that, for a given model M_i, the log marginal likelihood ln P(Y|M_i) can be bounded from below [15,12] as:
ln P(Y|M_i) ≥ ∫ Q(H, θ) ln [P(Y, H, θ|M_i) / Q(H, θ)] dH dθ = F(Q)    (4)

where Q(·) is an arbitrary distribution, and H and θ denote the sets of hidden variables and parameters of a given model, respectively. The inequality is established by Jensen's inequality. Thus, one can treat the lower bound F as a function of the free distribution Q(H, θ) and maximize F with respect to Q(H, θ). The best result is achieved if Q(H, θ) equals the posterior joint distribution over the hidden variables H and parameters θ. However, the evaluation of the true posterior distribution is intractable in most practical cases. To overcome the difficulty, a variational approximation can be obtained by restricting the maximization of Q(H, θ) to a smaller family of distributions chosen for convenience. A common approach is the mean-field approximation, which maximizes over the family of distributions in which the hidden variables
and parameters are independent. Then the joint distribution can be fully factorized: Q(H, θ) = Π_k Q_H(H_k) Π_j Q_θ(θ_j). Restricting Q(H, θ) to this family gives a less tight bound in equation (4), but one can analytically maximize the lower bound of the log marginal likelihood with respect to the factorized family of distributions by an iterative algorithm similar to the EM algorithm [12]. In the Bayesian framework, the parameters of a given model are treated as random quantities, requiring us to specify prior distributions P(θ|M_i) for all model parameters. We choose the following conjugate priors to facilitate the estimation of the approximate posterior distributions:

P(π) = Π_{k=1}^{K} Beta(π_k|α, β);  P(γ) = Π_{k=1}^{K} G(γ_k|a_γ, b_γ);  P(τ) = G(τ|c_τ, d_τ);  P(W|γ) = Π_{k=1}^{K} N(w_k|0, γ_k^{-1}I);
where Beta(·|α, β) is a beta distribution and G(·|a, b) is a gamma distribution. We use the following values of the hyper-parameters during training: α = β = 1, a_γ = b_γ = c_τ = d_τ = 10^{-3}.

4 Variational Bayesian Learning

In the variational Bayesian approach, we maximize the lower bound F of the log marginal likelihood ln P(Y|M_i) with respect to a set of parameterized variational distributions Q(H_k), k = 1, 2, ..., K, and Q(θ_p), p = 1, 2, ..., P, which are the approximate posterior distributions of the hidden variables and parameters [15,12]. The process of maximizing the lower bound F and learning the parameters is very similar to the conventional expectation-maximization (EM) algorithm [11]. We adopt the iterative variational approximation principle [15,12], which maximizes the functional F by iterating over two alternating re-estimation steps:
• Estimation of the hidden source distributions Q_H(H):

Q*_H(H) ∝ exp ⟨ln P(Y, H|θ)⟩_{Q_θ(θ)}    (5)

• Estimation of the parameter posteriors Q_θ(θ):

Q*_θ(θ) ∝ P(θ) exp ⟨ln P(Y, H|θ)⟩_{Q_H(H)}    (6)
where ⟨·⟩_{Q(·)} denotes the expectation with respect to the distribution Q(·). Expanding and evaluating equations (5) and (6), we obtain a set of approximate posterior distributions of the hidden sources H and parameters θ. Thus, the variational Bayesian approach allows us not only to approximate the log marginal likelihood ln P(Y|M_i) to achieve model selection, but also to learn the approximate distributions of the parameters. In the following, we summarize the form of the approximate posterior distributions and the rules for updating the parameters of the distributions. Complete derivations can be found in a separate report [17].
Q(s) = Π_{k=1}^{K} Be(s_k|λ_k);  Q(π) = Π_{k=1}^{K} Beta(π_k|α̂_k, β̂_k);  Q(τ) = G(τ|ĉ_τ, d̂_τ);

where Be(·|λ) is a Bernoulli distribution. One can maximize the lower bound F by initializing the parameters of the model with a suitable guess and then iteratively updating the parameters of the individual approximate distributions using the following updating rules until F converges to a local maximum, e.g.

ĉ_τ = c_τ + ND/2.
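The target that these updates approximate, the posterior over the binary source states, can be computed exactly by enumeration when K is tiny. The hedged sketch below (with invented W, π and τ values) shows what the variational distributions are approximating; for realistic K the 2^K enumeration is infeasible, which is why the variational scheme is needed.

```python
import itertools
import numpy as np

# Exact posterior P(s | y) over source configurations by enumeration of all
# 2^K states, for a tiny toy CVQ instance. All parameter values are invented.
rng = np.random.default_rng(1)
D, K = 6, 3
W = rng.normal(size=(D, K))
pi = np.array([0.3, 0.5, 0.7])
tau = 10.0

s_true = np.array([1.0, 0.0, 1.0])
y = W @ s_true + rng.normal(scale=tau ** -0.5, size=D)

log_post = {}
for s in itertools.product([0.0, 1.0], repeat=K):
    s = np.array(s)
    resid = y - W @ s
    log_lik = -0.5 * tau * resid @ resid          # up to an s-independent constant
    log_prior = np.sum(s * np.log(pi) + (1 - s) * np.log1p(-pi))
    log_post[tuple(s)] = log_lik + log_prior

# Normalize in log space for numerical stability.
z = max(log_post.values())
probs = {k: np.exp(v - z) for k, v in log_post.items()}
norm = sum(probs.values())
probs = {k: v / norm for k, v in probs.items()}
best = max(probs, key=probs.get)   # most probable source configuration
```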
Figure 2: Left panel: Original source images used to generate the data. Middle panel: Observed images resulting from the mixture of sources. Right panel: Recovered sources.
5 Analysis of Simulated Data

We have implemented the variational Bayesian inference algorithm for the CVQ model. To demonstrate the capability of the model to identify the source processes uniquely, we first applied the model to simulated microarray data. In this experiment, we used 8 hidden sources to simulate cellular processes that control the expression of 16 genes. The left panel of Figure 2 depicts the components of the model, where genes are represented by the pixels of a 4 × 4 image. Each of the 8 sources controls a subset of the 16 genes, where the intensity of the pixels reflects the degree of influence by the source. As the figure shows, some genes are controlled by multiple sources. We generated 600 images (experimental data) by setting sources "on/off" stochastically, summing the weights output by the sources and adding random noise to the images. The middle panel of Figure 2 illustrates some of the data images generated during the process. We ran our program to test its ability to automatically recover the number of sources and their patterns. The right panel of Figure 2 shows the result of an experiment where the algorithm is initialized with 16 hidden sources. The program correctly identified all 8 sources that were used to generate the data and eliminated the remaining 8 unnecessary sources. The experiment demonstrates an excellent performance of the variational Bayesian approach on blind source separation for simulated gene expression data. Figure 2 also shows an interesting characteristic of our Bayesian CVQ model: its ability to eliminate unnecessary sources automatically, thus achieving the effect of model selection. This ability is due to the introduction of the hierarchical parameters γ (see Section 2) into the model. The approach is referred to as automatic relevance determination (ARD). It has been used in a number of Bayesian linear latent variable models to determine the model dimension automatically [16,18,10]. When a variational Bayesian ICA model with mixture-of-Gaussian sources was first tested on a similar image separation task [19,10], the recovery of source images
Figure 3: Source processes recovered from the training data containing a background signal and both positive and negative weight sources. The first image captures the background signal. Black pixels capture negative weights.
from the mixed image data was hindered by contamination with negative "ghost" images. In order to prevent "ghost" images, special constraints on the distributions were incorporated into the ICA model. Specifically, the use of rectified Gaussian priors [10] restricted both the sources and the weight matrix to the positive domain. In contrast, the CVQ model performs blind source separation without special constraints. Adopting Bernoulli distributions for the sources in the CVQ model naturally constrains the sources to the non-negative domain, preventing "ghost" images. No constraint on the weight matrix appears necessary. This flexibility allows the capture of genuine negative influences of sources on the observed data, which is a highly desirable characteristic for detecting the repressive effects of signal transduction components on gene expression. To test the model's ability to capture repressive effects, we generated 600 training data points with 8 sources similar to those described earlier, with one exception: the weight outputs of two sources are negative on some of the pixels. We randomly initialized the parameters for the hidden sources, and then ran the algorithm to recover the sources. Once again our variational Bayesian algorithm was able to correctly identify not only the number of underlying regulatory signals but also their weight matrices, including their repressive (negative) components. Figure 3 shows the sources and weights recovered by the algorithm for the simulated data. Black pixels correspond to negative weights.
6 Application in Microarray Data Analysis

In this section, we present the results of applying the CVQ data analysis to the yeast cell cycle data of Spellman et al [20]. These cell cycle data have been widely used to test different algorithms, including SVD [1] and ICA [3]. The data set contains a collection of whole yeast genome expression measurements (77 samples) across the yeast cell cycle. During the cell cycle, the states of the cellular processes that control
the progression of the cell cycle switch "on/off" periodically. Thus, these data are suitable to test the ability of the CVQ model to capture such periodic behavior of cellular processes. We have extracted the expression patterns of 697 genes that are documented to be cell-cycle dependent [20] and used the CVQ to model the data. The original data is in the form of the log ratio of the fluorescence of labeled sample cDNA and control cDNA. Before fitting the model, the log ratios were transformed to positive values by subtracting the minimum ratio of each gene. In order to determine the optimal model that fits the data well, we tested CVQ models setting the initial number of sources to values ranging from 8 to 30. We ran each model 30 times. Figure 4 shows the results of the experiments. We can see that the lower bound F for the log marginal likelihood reaches a plateau between the models with 12 to 20 sources. Inspecting the recovered models, we found that most of these models have 12 working sources; excess sources were eliminated by the ARD phenomenon. Note that models initialized with more than 20 sources are penalized by the Bayesian approach in that the F values begin to drop. Thus, the variational Bayesian approach consistently returned models with 12 sources as the most suitable for the observed data. In comparison to the models studied by Alter et al [1] and Liebermeister [3], where the number of processes was determined by the number of samples, our approach determines the number of processes based on the sound statistical foundation of the Bayesian framework. In addition, the larger number of processes in their models significantly increases the number of parameters to estimate - about 50,000 more parameters would be needed to carry out a similar experiment. It is well known that models with a large number of parameters are prone to over-fitting the training data, especially with a training set of a small size like the one used in our experiment.
The full Bayesian treatment of the CVQ model implicitly penalizes models with too many parameters, thus making it less likely to over-fit the data.
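The per-gene shift to non-negative values described above can be sketched as follows; the toy matrix is illustrative (rows are genes, columns are samples).

```python
import numpy as np

# Shift each gene's log-ratio profile so its minimum across samples is zero,
# as described in the text. The matrix below is a toy example.
log_ratios = np.array([[-1.2,  0.3, 0.8],
                       [ 0.1, -0.4, 0.6]])
shifted = log_ratios - log_ratios.min(axis=1, keepdims=True)
assert shifted.min() == 0.0 and (shifted >= 0).all()
```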
’,
We have studied the recovered CVQ model to see if it can capture the periodic behavior of the processes. The middle and right panels of Figure 4 show one of the recovered models with the highest F. The middle panel shows the states of the 12 hidden sources across the experimental conditions, in this case a time series of gene expression observations. One can clearly see the cyclic "on/off" pattern of the sources, which is far from random. This is not surprising, and encouraging, as we are modeling the expression control processes of cell-cycle-related genes. For each of the cell cycle time points, we can see sources cooperatively contributing to the observations. Thus, the CVQ model provides another approach to decomposing the overall observation at the genome level into different processes, which may reflect the states of different cellular signal transduction components. A more detailed biological analysis of the results is being carried out and will be reported separately.
Figure 4: Left panel: Mean and standard deviation of F for models initialized with different numbers of sources. Middle panel: Top: States of the hidden sources (rows) for each time series observation (columns). Black blocks indicate the source is "on" and white blocks indicate the source is "off". Bottom: Corresponding cell cycle phase for each observation. Right panel: The weights associated with the sources (columns).
7 Discussion
One important aspect of systems biology is to understand how information is organized inside the cell. For example, an interesting question is: what is the minimum number of central signal transduction components needed to coordinate the variety of cellular signals and cellular functions? A cell is constantly bombarded by extracellular signals; many of these signals are eventually propagated to the nucleus in order to regulate gene expression. It would be surprisingly inefficient for nature to endow every receptor at the plasma membrane with a unique pathway to pass its signal from the plasma membrane to the promoter of a gene. More plausible is a minimum set of partially shared signal transduction components that play a central role in coordinating signals from the extracellular environment and disseminating them to the transcription factor level. These components work as encoders that compress a large amount of information from the extracellular and intracellular environments to a minimum length, then pass the information to gene expression regulating components such as transcription factors or repressors. To model these signal transduction components, model selection becomes a key issue, one which has not been well addressed previously. Bayesian model selection respects Occam's razor, minimizing a fitted model's complexity and potentially increasing the interpretability of the data in terms of information organization and flow inside living cells. These characteristics put the model a step ahead of some commonly used models for modeling the cellular processes controlling gene expression. Like most other models used to decompose observed microarray data into components, the CVQ model is a linear model. In microarray data analysis, measurements are usually transformed by the logarithm, so that cooperative effects that combine multiplicatively at the raw data level can be handled as additive. This simplifies
model-fitting but may be too restrictive. To capture nonlinear relationships in the log space, the CVQ model could naturally be extended to mixtures of CVQ models. This extension will be studied in the future. Another possible improvement of the model is the use of more sophisticated approximation methods, such as Minka's expectation propagation method [21], to obtain a better approximation of the log marginal likelihood and, thus, better model selection and optimization.
Acknowledgments

The authors would like to extend special thanks to Dr. Zoubin Ghahramani for constructive initial inputs and discussion. We thank Drs. Gregory Cooper, Chengxiang Zhai, Rong Jin, Vanathi Gopalakrishnan, Matthew J. Beal and two anonymous reviewers for insightful discussions and comments. Xinghua Lu would like to acknowledge support from the National Library of Medicine (training grant 3T15 LM0705915S1) and the MUSC COBRE for Cardiovascular Disease.
References
1. Alter, O., Brown, P. O. and Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America, 97:10101-10106, 2000.
2. Raychaudhuri, S., Stuart, J. M. and Altman, R. B. Principal components analysis to summarize microarray experiments: application to sporulation time series. In Proceedings of the Pacific Symposium on Biocomputing, pages 455-466, 2000.
3. Liebermeister, W. Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18:51-60, 2002.
4. Martoglio, A., Miskin, J. W., Smith, S. K. and MacKay, D. J. C. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18(12):1617-1624, 2002.
5. Moloshok, T. D., Klevecz, R. R., Grant, J. D., Manion, F. J., Speier, W. F. and Ochs, M. F. Application of Bayesian decomposition for analysing microarray data. Bioinformatics, 18(4):566-575, 2002.
6. Segal, E., Battle, A. and Koller, D. Decomposing gene expression into cellular processes. In Proceedings of the Pacific Symposium on Biocomputing, volume 8, pages 89-100, 2003.
7. Attias, H. Independent factor analysis. Neural Computation, 11(4):803-851, 1999.
8. Hinton, G. E. and Zemel, R. S. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.
9. Ghahramani, Z. Factorial learning and the EM algorithm. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann Publishers, 1995.
10. Miskin, J. and MacKay, D. Ensemble learning for blind source separation. In S. Roberts and R. Everson, editors, Independent Component Analysis: Principles and Practice, pages 209-233. Cambridge University Press, 2001.
11. Dempster, A. P., Laird, N. M. and Rubin, D. B. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, B 39:1-38, 1977.
12. Ghahramani, Z. and Beal, M. J. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 12, pages 507-513. MIT Press, 2000.
13. Kass, R. E. and Raftery, A. E. Bayes factors. Technical Report No. 254 and Technical Report No. 571, Depts. of Statistics, Univ. of Washington and Carnegie Mellon Univ., 1994.
14. MacKay, D. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995.
15. Attias, H. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Uncertainty in AI Conference, pages 21-30, 1999.
16. Bishop, C. M. Variational principal components. In Proceedings of the Ninth International Conference on Artificial Neural Networks, volume 1, pages 509-514. ICANN, 1999.
17. Lu, X., Hauskrecht, M. and Day, R. S. Variational Bayesian learning of the cooperative vector quantizer model - theory. Technical Report No. CBMI-02-181, The Center for Biomedical Informatics, University of Pittsburgh, 2002.
18. Ghahramani, Z. and Beal, M. J. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.
19. Lawrence, N. D. and Bishop, C. M. Variational Bayesian independent component analysis. Technical report, Computer Laboratory, University of Cambridge, 2000.
20. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Molecular Biology of the Cell, 9(12):3273-3297, 1998.
21. Minka, T. P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
SYMBOLIC INFERENCE OF XENOBIOTIC METABOLISM
D.C. MCSHAN, M. UPADHYAYA and I. SHAH
School of Medicine, University of Colorado
4200 East 9th Avenue, B-119, Denver, CO 80262
{daniel.mcshan,minesh.upadhyaya,imran.shah}@uchsc.edu
Abstract
We present a new symbolic computational approach to elucidate the biochemical networks of living systems de novo, and we apply it to an important biomedical problem: xenobiotic metabolism. A crucial issue in analyzing and modeling a living organism is understanding its biochemical network beyond what is already known. Our objective is to use the available metabolic information in a representational framework that enables the inference of novel biochemical knowledge and whose results can be validated experimentally. We describe a symbolic computational approach consisting of two parts. First, biotransformation rules are inferred from the molecular graphs of compounds in enzyme-catalyzed reactions. Second, these rules are recursively applied to different compounds to generate novel metabolic networks, containing new biotransformations and new metabolites. Using data for 456 generic reactions and 825 generic compounds from KEGG we were able to extract 110 biotransformation rules, which generalize a subset of known biocatalytic functions. We tested our approach by applying these rules to ethanol, a common substance of abuse, and to furfuryl alcohol, a xenobiotic organic solvent that is absent from metabolic databases. In both cases our predictions on the fate of ethanol and furfuryl alcohol are consistent with the literature on the metabolism of these compounds.
Introduction
The objective of this work is to develop a predictive strategy for elucidating metabolism. We mold available metabolic information into an expressive symbolic representation and employ a novel inference framework to explore uncharted pathways. We hypothesize that biochemical rules can be inferred from the databases of endogenous metabolism and that we can use these rules to predict the metabolism of unknown xenobiotics through detoxification pathways. In particular, we focus on xenobiotic pathways in mammalian systems. What is the importance of discovering new pathways? Our knowledge of metabolism is essentially incomplete and it can be argued that cataloging
all possible mammalian xenobiotic pathways is infeasible. With the availability of the complete genomic blueprint for living systems and a large set of known biotransformations, it is becoming possible to theoretically elucidate metabolism. This includes the analysis of endogenous as well as xenobiotic pathways. Drugs, substances of abuse and environmental pollutants are examples of compounds that may not occur naturally in a living system. Since these compounds and/or their metabolic by-products can be potentially toxic, investigating xenobiotic metabolism is important for human health and the environment. Pathway inference is a computationally challenging problem even with the availability of the genomic blueprint for a living system and the functional annotations of its putative genes. Since the availability of the first microbial genome, Haemophilus influenzae, a number of metabolic reconstruction tools have been developed. These include PathoLogic and PathFinder. These methods focused on matching putatively identified enzymes with known, or "reference", pathways. Although reconstruction is an important starting point for metabolic processes, it does not enable the discovery of new pathways. To overcome some of these issues we have recently developed a new pathway inference system, called PathMiner, to search for novel metabolic routes. PathMiner uses known biotransformations to synthesize new pathways and employs heuristics to contain the combinatorial complexity of the search. This paper delves into a deeper biological problem, de novo pathway inference, and its practical application to a biomedical problem: xenobiotic metabolism. The metabolic potential of a living system depends on biocatalysis. However, understanding the mechanisms of enzymatic catalysis is an extremely difficult problem, and knowledge in this area is limited to a handful of well-studied examples.
Generally, biochemists can abstract empirical "rules" for the biotransformation of metabolites by enzymes. For instance, consider the broad range of substrates for Saccharomyces cerevisiae (yeast) alcohol dehydrogenase (YADH), which reduces acetaldehyde and a variety of other aldehydes, and oxidizes ethanol and other acyclic primary alcohols. Yet an alcohol dehydrogenase from Thermoanaerobium brockii (TADH) catalyzes the stereospecific reduction of ketones and the oxidation of secondary alcohols. The functions of YADH and TADH share common attributes and have some unique differences: they are both alcohol dehydrogenases, but their specificities for the alcohols are different. The functions of these enzymes can be expressed in terms of the functional groups modified (alcohol to aldehyde or ketone), and the backbone structure of the molecule (primary or secondary alcohol). This is essentially a symbolic description of biocatalysis and we believe that it can be applied to complete metabolic systems.
Methods
Our strategy for elucidating de novo xenobiotic metabolism consists of two main steps. First, we use biotransformation data to derive symbolic chemical substructural rules that generalize the action of enzymes on specific compounds. Second, we apply these rules iteratively to a compound to generate a plausible metabolic system. We describe these steps in the following sections, but first we discuss our metabolic representation.
Representing biotransformations and rules
Our abstraction of metabolic concepts is based on work by Karp in terms of high-level concepts including pathways, enzyme-catalyzed reactions and transformations. At the level of biotransformations we are motivated by Kazic in that we focus on the specific chemical substructural details of metabolites that are modified through biocatalysis. In our system, compounds are represented as X. Compounds in our abstraction have a chemical structure which is represented as a molecular graph, Γ, in which nodes are atoms and edges are bonds. In the context of a biotransformation the pattern of substructural changes from the input compound to the output compound is represented as a rule, U. A rule captures the concept of functional group changes that occur in a biotransformation. Rules are implicitly unidirectional, so reversible transformations are represented as two separate rules. The two molecular graphs of a rule are indicated by the input graph, Δ−, and the output graph, Δ+. For instance, the rule for the conversion of a primary alcohol to an aldehyde is shown in Figure 1. In this case Δ− is an alcohol moiety, which is converted to Δ+, an aldehyde moiety.
Figure 1: Alcohol dehydrogenase (EC 1.1.99.9). Transformation from abstract PrimaryAlcohol to abstract Aldehyde showing the computed Δ− and Δ+ moieties. The Δ− moiety is the subgraph that is in Xs but not in Xp. The Δ+ moiety is the subgraph that is in Xp but not in Xs.
In the present work we focus on changes at the level of functional groups between pairs of compounds. We represent the conversion of one input compound to one output compound as a transformation. This simplifies our representation of reactions in terms of the main metabolites. In this work we obtain this data from the KEGG distribution, but we are also exploring automated methods for identifying the main metabolites in a reaction.
Extracting transformation rules from reaction data
One strategy for identifying rules is to curate them manually; however, our goal is to use the available metabolic data to derive biotransformation rules automatically. This is a difficult problem in general, as the information about reactive moieties is not explicitly available. In this paper we have used a simple strategy for extracting rules automatically from "general" reactions. In KEGG, for instance, general reactions are defined when the input and the output compounds are both Markush structures. We find 741 general reactions in KEGG, which constitute 20% of the reactions annotated as being human-specific. For example, a gene that is extremely important in xenobiotic metabolism and encodes a cytochrome P-450 enzyme, CYP2D6, is implicated in the disposition of over thirty toxins. In KEGG, the P-450 enzyme (EC 1.14.14.1) is associated with only four reactions, as shown in Figure 2. There are two specific reactions involved in endogenous functions associated with tryptophan metabolism and gamma-hexachlorocyclohexane degradation. The other two operate on general compounds denoted by their Markush structures (these are abstract structures containing a wildcard "R" group and specific functional groups). We convert these general reactions automatically to rules as described above. This is done by replacing the wildcard of the substrate with "C" and storing it as the Δ− subgraph in the resulting rule; similarly, the "R" in the product graph is replaced and the resulting graph is stored as Δ+. To our knowledge no one has taken advantage of this annotation before in metabolic pathway inference. In this work we focus on the rules important in xenobiotic metabolism in mammalian systems, including oxidation, reduction, hydrolysis and conjugation, to mention a few. There are generally two phases in xenobiotic metabolism. In phase 1 the compounds are 'functionalized', which means that a reactive functional group is exposed.
Detoxification occurs in phase 2 by further action on the functional groups, yielding the form in which the compound is excreted. For instance, the first phase activates a molecular oxygen in the input compound, and the second phase conjugates it. The glucuronide is the most common conjugate and can be attached to any labile oxygen. In the case of alcohol metabolism, both the alcohol and the acid can usually be conjugated.
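The wildcard-substitution step described above can be sketched in a few lines. This is a hypothetical simplification in which Markush structures are treated as SMILES-like strings rather than molecular graphs; the function name and string encoding are illustrative assumptions, not the implementation used in this work.

```python
def reaction_to_rule(substrate_markush, product_markush, wildcard="R"):
    """Turn a general (Markush) reaction into a (delta_minus, delta_plus)
    rule by instantiating the wildcard with a carbon, as described above.
    Structures are simplified to SMILES-like strings for illustration."""
    delta_minus = substrate_markush.replace(wildcard, "C")
    delta_plus = product_markush.replace(wildcard, "C")
    return delta_minus, delta_plus

# e.g. the general reaction  Alkane -> R-OH  from Figure 2
print(reaction_to_rule("R-H", "R-OH"))  # ('C-H', 'C-OH')
```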
Melatonin → 6-Hydroxymelatonin
FattyAcid → alpha-HydroxyFattyAcid
Alkane → R-OH
Parathion → Paraoxon
Figure 2: CYP2D6 (EC 1.14.14.1) reactions in KEGG. Compounds are either Abstract (contain one or more Markush "R" groups) or Normal (have a unique structure).
Biotransformation rule application
Our rule application algorithm is illustrated in Algorithm 1. A rule is applied to a substrate Xs by searching the graph of Xs, Γs, for the subgraph Δ−. If the subgraph Δ− is found, it is replaced by the Δ+ graph to yield the product graph, Γp. This is summarized as follows:
    Γs − Δ− + Δ+ → Γp.

This is graphically illustrated in Figure 3.
Figure 3: Application of alcohol dehydrogenase rule to ethanol
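The graph search-and-replace step just described can be sketched compactly. As a hedged simplification, molecules here are encoded as SMILES-like strings so that subgraph matching reduces to substring matching; a real implementation would perform subgraph isomorphism on the molecular graphs.

```python
def apply_rule(substrate, delta_minus, delta_plus):
    """Apply one biotransformation rule: find the input moiety (delta_minus)
    in the substrate and replace it with the output moiety (delta_plus).
    Returns None when the rule does not apply (no match found)."""
    if delta_minus not in substrate:
        return None
    return substrate.replace(delta_minus, delta_plus, 1)

# alcohol dehydrogenase rule of Figure 1: primary alcohol -> aldehyde
print(apply_rule("CC-OH", "C-OH", "C=O"))  # ethanol -> acetaldehyde: CC=O
```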
The product of applying a rule to a compound can be a completely novel compound or a known compound. We use subgraph isomorphism to search the product molecular graph against the database of known compounds. If the compound is not found, a novel compound Xi is created and given a unique identifier (Nxxxxxx, in which x is a digit from 0-9). The corpus of all rules is designated U. We have a top-level function metabolize(X, U, n) which takes a compound X and systematically applies each rule in the rule-base through n iterations.
input  : Xs, compound to metabolize
         U, list of rules
         n, iterations
output : graphical visualization, Products

Products ← ∅
Γs ← molecular-graph(Xs)
for (Δ−, Δ+) ∈ U do
    Γp ← graph-replace(Γs, Δ−, Δ+)
    if Γp ≠ ∅ then
        Xp ← find-compound-by-graph(Γp)
        if Xp = ∅ then
            Xp ← make-novel-compound(Γp)
        pushnew(Xp, Products)
if n > 1 then
    for X in Products do
        append(metabolize(X, U, n−1), Products)
Algorithm 1: metabolize(X, U, n). Algorithm to create a network of pathways of length n from input compound Xs by applying rules U. Initially the list of Products is set to null. The molecular graph, Γs, of the input compound is obtained from the KEGG mol file representation. For every rule in the rule-base U, we obtain the Δ− and Δ+ subgraphs. The product graph, Γp, is obtained by performing a graphical search/replace on the input graph, Γs. If Γp is non-empty, i.e., a match was found and applied, then the product graph Γp is searched against the database of known compounds and the database of novel compounds to see if an isomorphic graph exists. If the graph matches an existing compound, then Xp is returned. If there is no identified compound with the graph, then a novel compound, Xp, is generated and given a unique identifier (the Nxxxxxx symbols in the diagrams). In either case, the product, Xp, is pushed onto the Products list for this metabolite Xs. This process can occur iteratively for every product X in the Products list. The metabolize function is simply called again with the recursion level reduced. The results are appended to the Products list.
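The recursion of Algorithm 1 can also be paraphrased as a short Python function. This is a hypothetical sketch: molecules are strings, rules are substring pairs, and the database lookup for known compounds is replaced by simple deduplication of the product list.

```python
def metabolize(compound, rules, n):
    """Sketch of Algorithm 1: apply every rule to the compound, then
    recurse on each product for n-1 further iterations, accumulating
    the products of the whole metabolic network."""
    products = []
    for delta_minus, delta_plus in rules:
        if delta_minus in compound:                     # Δ− "subgraph" found?
            product = compound.replace(delta_minus, delta_plus, 1)
            if product not in products:
                products.append(product)
    if n > 1:
        for p in list(products):
            for q in metabolize(p, rules, n - 1):
                if q not in products:
                    products.append(q)
    return products

# ethanol-like chain: alcohol -> aldehyde -> acid (cf. Figure 4)
rules = [("C-OH", "C=O"), ("C=O", "COOH")]
print(metabolize("CH3-C-OH", rules, 2))  # ['CH3-C=O', 'CH3-COOH']
```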
Implementation
The system is implemented in Allegro Common Lisp. The metabolic databases are read in and parsed into CLOS structures. For visualization, the transformations are exported to the AT&T graphviz program neato, which does a simple force-based layout of the metabolic graph. This network is read back in and presented with the nodes replaced by compound structures using our internal visualization system. The novel compounds that are produced by the application of the rules are simply graphs. In order to visualize the compounds, we require 2D coordinates. To achieve this, we export the graph as a mol file with the 2D coordinates as zeroes and then lay out the mol file using the JChem molconvert package. The mol files are read back in and stored with the compounds as they are created.

Reactant | Product | E.C. | Enzyme
Table 1: Simplest 10 of 110 rules inferred from KEGG generic reactions
Results and Discussion
We used a recent version of the KEGG database which had 10,635 compounds, of which 825 are generic. Of the 5,428 reactions in the KEGG database, 741 operate on the generic compounds. From this data, we infer 110 biotransformation rules; the 10 simplest ones are summarized in Table 1. These rules correspond to enzymes which have flexibility in the substrates they can transform. Using the symbolic computational approach described in the previous sections, we elucidate the de novo metabolism of two compounds. First, we consider ethanol, which is a common substance of abuse and for which we have some data on human metabolism. Second, we demonstrate the fate of furfuryl alcohol, which is an industrial organic solvent used as a paint thinner and is absent from our database. Experimental evidence suggests that prolonged exposure to furfuryl alcohol may have significant toxicological effects. We first apply the rules to the compound ethanol, which is in the database. The graph is shown in Figure 4. Next, we apply the rules to a new compound, furfuryl alcohol, which is not in the database. The result is shown in Figure 5. That some of the
Figure 4: The de novo prediction of ethanol metabolism. Ethanol is in the center of the figure. The highlighted transformations are the activation of the alcohol to an aldehyde by alcohol dehydrogenase (EC 1.1.99.20), then to an acid by aldehyde oxidase (EC 1.2.3.1), respectively. Not shown, but in the next iteration is the O-glycosylation of the aldehyde by beta-Glucuronidase (EC 3.2.1.31).
nodes in our ethanol metabolism graph match known compounds in the database is encouraging. Additionally, we were able to identify the pathway alcohol → aldehyde → acid → conjugation, which recapitulates the standard ethanol detoxification pathway. We are also able to predict metabolites for a compound previously unknown to the system. The furfuryl alcohol metabolic predictions are consistent with the literature. Martin et al. report that furfuryl alcohol can be O-glycosylated by beta-Glucuronidase, as we predict (shown as compound N00482 in Figure 5). Additionally, the acid of furfurol, 2-furoate, is actually in the KEGG database and is identified as such by the algorithm. Nomeir et al. report that the initial step in furfuryl alcohol metabolism in rat is the oxidation to furoic acid, which is excreted unchanged and decarboxylated, or conjugated with glycine or condensed with acetic acid. In this case, the limitations in our system to predict the condensation with acetic acid, for instance, lie in the breadth of the rules, not in the fundamental methodology. By extending our method for inferring new rules based on known biochemistry we can overcome this limitation.
Figure 5: The de novo prediction of furfuryl alcohol metabolism. Furfurol is in the center of the figure. The highlighted transformation between furfurol and compound N00482 (up and to the left) is an O-glycosylation by beta-Glucuronidase (EC 3.2.1.31). The highlighted transformations below furfurol are the activation of the alcohol to an aldehyde (N00479, furfural) by alcohol dehydrogenase (EC 1.1.99.20), then to an acid by aldehyde oxidase (EC 1.2.3.1), respectively. The acid is identified by the algorithm as being in the KEGG database (by graph similarity) as 2-Furoate (C01546). In the next iteration, not shown, the acid is finally O-glycosylated by beta-Glucuronidase (EC 3.2.1.31).
Most of the complex products of furfuryl alcohol are simply consecutive glucuronidations by the rule:

    Alcohol → β-D-Glucuronide
Due to the lack of specificity of this rule to primary alcohols, glucuronidation is applied to the hydroxyl groups on the β-D-Glucuronide. While this might be
biologically valid, in reality glucuronidation renders a compound water-soluble, after which it is eliminated by excretion. This limitation is beyond the scope of the current work but can be addressed in the future by considering the physical properties of compounds, like water solubility. That a biotransformation rule can be applied does not imply that it is biochemically valid. For instance, consider the biotransformation rules that apply to a hydroxyl functional group. Compounds containing this functional group include primary alcohols, secondary alcohols, and also carboxylic acids. Enzymes that act on alcohols may not act on carboxylic acids and vice versa. To capture the substrate specificity of enzymes we are working on a more sophisticated representation of rules that can improve their biological validity. Though this is a limitation of our present algorithm, our predictions are still useful for elucidating potential xenobiotic metabolism, which can be tested experimentally. It is important to contrast our approach to other rule-based approaches in pathway prediction. One of the main advantages of our strategy is automated biotransformation rule extraction from available resources of metabolic data. As opposed to manual curation-based efforts, our approach will scale gracefully with increasing data for two important reasons. First, our algorithm for rule extraction can be extended to utilize most of the available enzyme-catalyzed reaction data beyond the generic reactions in KEGG. Second, we can control the combinatorial explosion of plausible biotransformations by extending our existing algorithm for pathway search. Another advantage of our approach is that we can relate our biotransformation predictions to the organism-specific enzymes and genes, which is crucial for in vivo or in vitro experimental validation.
Conclusion
We have developed a symbolic inference approach and demonstrated the de novo elucidation of metabolism. This was accomplished by representing biocatalysis, which is the basis of metabolism, in terms of expressive symbolic biotransformation rules. These biotransformation rules generalize the biocatalytic functions of enzymes and enable the discovery of new metabolic potential in living systems. We developed an algorithm to extract these rules from known enzyme-catalyzed reactions and to apply them to elucidate the metabolism of new compounds. We successfully tested this concept by predicting the xenobiotic metabolism of ethanol and furfuryl alcohol. The results are encouraging because furfuryl alcohol is absent from our database and yet we can correctly identify its products through O-glycosylation and oxidation to
furoic acid in agreement with the literature. These results are also biologically interesting because they support the notion that xenobiotic metabolism is a manifestation of endogenous biocatalytic abilities in an organism. Though there are some limitations in our approach, the method is quite general and scalable for investigating the metabolic network of any living system. This work supports the relevance of symbolic approaches in discovering the biochemical capabilities of living systems. Our results on xenobiotic metabolism offer a prelude to the potential discoveries that can be made in combination with high-throughput or traditional experimental strategies.
Acknowledgments
The authors acknowledge Weiming Zhang for the visualization software. This work is sponsored by the National Science Foundation (BES-9911447), the Department of Energy (DE-FG03-01ER63111/M003), and the Office of Naval Research (N00014-00-1-0749).
References
1. Applications of Biochemical Systems in Organic Chemistry. Wiley, New York, N.Y., 1976.
2. R.D. Fleischmann et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269:496-512, 1995.
3. T. Gaasterland and E.E. Selkov. Automatic reconstruction of metabolic networks using incomplete information. ISMB, 3:127-135, 1995.
4. T. Gaasterland and C.W. Sensen. MAGPIE: automated genome interpretation. Trends Genet, 12(2):76-78, 1996.
5. A. Goesmann, M. Haubrock, F. Meyer, J. Kalinowski, and R. Giegerich. PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics, 18(1):124-9, 2002. PMID 11836220.
6. B.K. Hou, L.P. Wackett, and L.B. Ellis. Microbial pathway prediction: a functional group approach. J Chem Inf Comput Sci, 43(3):1051-7, 2003.
7. P. Karp and M. Riley. Representations of metabolic knowledge: pathways. In R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, editors, Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1994.
8. P.D. Karp, M. Krummenacker, S.M. Paley, and J. Wagg. Integrated pathway/genome databases and their role in drug discovery. Trends in Biotechnology, 17(7):275-281, 1999.
9. T. Kazic. Reasoning about biochemical compounds and processes. Pages 35-49. World Scientific, Singapore, 1992.
10. B.D. Martin, E.R. Welsh, J.C. Mastrangelo, and R. Aggarwal. General O-glycosylation of 2-furfuryl alcohol using beta-glucuronidase. Biotechnol Bioeng, 80(2):222-7, 2002.
11. A.A. Nomeir, D.M. Silveira, M.F. McComish, and M. Chadwick. Comparative metabolism and disposition of furfural and furfuryl alcohol in rats. Drug Metab Dispos, 20(2):198-204, 1992.
FINDING OPTIMAL MODELS FOR SMALL GENE NETWORKS
S. OTT, S. IMOTO, S. MIYANO
Human Genome Center, Institute of Medical Science, The University of Tokyo
4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
{ott,imoto,miyano}@ims.u-tokyo.ac.jp
Finding gene networks from microarray data has been one focus of research in recent years. Given search spaces of super-exponential size, researchers have been applying heuristic approaches like greedy algorithms or simulated annealing to infer such networks. However, the accuracy of heuristics is uncertain, which, in combination with the high measurement noise of microarrays, makes it very difficult to draw conclusions from networks estimated by heuristics. We present a method that finds optimal Bayesian networks of considerable size and show first results of its application to yeast data. Having removed the uncertainty due to heuristic methods, it becomes possible to evaluate the power of different statistical models to find biologically accurate networks.
1 Introduction
Inference of gene networks from gene expression measurements is a major challenge in Systems Biology. If gene networks can be inferred correctly, this can lead to a better understanding of cellular processes and, therefore, have applications to drug discovery, disease studies, and other areas. Bayesian networks are a widely used approach to model gene networks. In Bayesian networks, the behaviour of the gene network is modeled as a joint probability distribution for all genes. This allows a very general modeling of gene interactions. The joint probability distribution can be decomposed as a product of conditional probabilities P(Xg | X1, ..., Xm), each representing the regulation of a gene g by some genes g1, ..., gm. This decomposition can be represented as a directed acyclic graph. The Bayesian network model has been shown to allow finding biologically plausible gene networks. However, the difficulty of learning Bayesian networks lies in the large search space. The search space for a gene network of n genes is the space of directed acyclic graphs with n vertices. A recursive formula as well as an asymptotic expression for the number of directed acyclic graphs with n vertices (G(n)) was derived by Robinson. We state the asymptotic expression here:

    |G(n)| ~ n! 2^(n(n−1)/2) / (M p^n),  where p ≈ 1.488 and M ≈ 0.574.    (1)
For example, there are roughly 2.34 · 10^72 possible networks with 20 genes, and about 2.71 · 10^158 possible solutions for a gene network with 30 genes. Even for a gene network of 9 genes (search space size roughly 1.21 · 10^15), a brute-force approach would take years of computation time even on a supercomputer. Moreover, it is known that the problem of finding an optimal network is NP-hard, even for the discrete scores BDe and MDL. Therefore, researchers have so far used heuristic approaches like simulated annealing or greedy algorithms to estimate Bayesian networks. However, since the accuracy of heuristics is uncertain, it is difficult to base conclusions on heuristically estimated networks. In order to overcome this problem, we have analysed the structure of the super-exponential search space and developed an algorithm that finds the optimal solution within the super-exponential search space in exponential time. This approach is feasible for gene networks of 20 or more genes, depending on the concrete probability distribution used. Furthermore, adding biologically justified assumptions, the optimal network can be inferred for gene networks of up to 40 genes. Overcoming the uncertainties of heuristics opens up the possibility to compare statistical models with respect to their power to infer biologically accurate gene networks. Also, this method is a valuable tool for refining gene networks of known functional groups of genes. We present the method in Section 2. In Section 3, we present results of an application of this method, which show that it can estimate gene networks biologically accurately.
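Robinson's recursive formula for the number of labeled DAGs can be evaluated directly, which reproduces the search-space sizes quoted above. The sketch below is an illustration (the function name is ours); it uses the standard inclusion-exclusion recurrence a(n) = Σ_{k=1}^{n} (−1)^(k+1) C(n, k) 2^(k(n−k)) a(n−k) over the k vertices of in-degree zero.

```python
from math import comb

def num_dags(n):
    """Number of directed acyclic graphs on n labeled vertices,
    via Robinson's recurrence (inclusion-exclusion over the set
    of k vertices with in-degree zero)."""
    a = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

print(num_dags(9))  # 1213442454842881, i.e. roughly 1.21e15 as quoted above
```

Evaluating num_dags(20) gives a 73-digit number of about 2.34 · 10^72, matching the figure in the text.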
2 The Method

2.1 Preliminaries
Throughout this section, we assume we are given a set of genes G and a network score function as used by several groups, i.e., a function s : G × 2^G → R that assigns a score to a gene g ∈ G and a set of parent genes A ⊆ G. Given a network N, the score of N is defined as score(N) =def Σ_{g∈G} s(g, P_N(g)), where P_N(g) denotes the set of g's parents in N.
Examples:
1. BDe score: The score is proportional to the posterior probability of the network, given the data. When the BDe score is used, the microarray data needs to be discretized.
2. MDL score: The MDL score makes use of the minimal description length principle and also uses discretized data.
3. BNRC score: The BNRC score uses nonparametric regression to capture nonlinear gene interactions. Since the data does not need to be discretized, no information is lost.
The task of inferring a network is to find a set of parent genes for each gene, such that the resulting network is acyclic and the score of the network is minimal. We introduce some notations needed to describe the algorithm.
Definition 1: F
We define F : G × 2^G → R as F(g, A) =def min_{B⊆A} s(g, B) for all g ∈ G and A ⊆ G.
By this definition, F(g, A) is the score of the optimal choice of parents for gene g when the parents must be selected from the subset A. For every acyclic graph, there is an ordering of the vertices such that all edges are oriented in the direction of the ordering. Conversely, when given a fixed order of G, we can consider the set of all graphs that comply with the given order, as we do in the next definition. An ordering of a set A ⊆ G can be described as a permutation π : {1, ..., |A|} → A. Let us use Π_A to denote the set of all permutations of A.
Definition 2: π-linearity
Let A ⊆ G and π ∈ Π_A. Let N ⊆ A × A be a network. We say N is π-linear iff for all (g, h) ∈ N, π⁻¹(g) < π⁻¹(h) holds. □
Now we use the above definitions and define the function Q^A, which will allow us to compute the score of the best π-linear network for a given π, as we show below.
Definition 3: Q^A
Let A ⊆ G. We define Q^A : Π_A → ℝ as

Q^A(π) =def Σ_{g∈A} F(g, {h ∈ A | π⁻¹(h) < π⁻¹(g)})   (2)

for all π ∈ Π_A. □
If we can compute the best π-linear network for a given permutation π using the functions F and Q, then what we need to do in order to find the optimal network is to find the optimal permutation π, which yields the global minimum. Formally, we define the function M for this step.
Definition 4: M
We define M : 2^G → ∪_{A⊆G} Π_A as

M(A) =def argmin_{π∈Π_A} Q^A(π)   (3)

for all A ⊆ G. □

2.2 The Algorithm
Using the above notations, the algorithm can be defined as follows.

Step 1: Compute F(g, ∅) = s(g, ∅) for all g ∈ G.
Step 2: For all A ⊆ G, A ≠ ∅, and all g ∈ G, compute F(g, A) as min{s(g, A), min_{a∈A} F(g, A − {a})}.
Step 3: Set M(∅) = ( ) (the empty permutation).
Step 4: For all A ⊆ G, A ≠ ∅, do the following two steps:
Step 4a: Compute g* = argmin_{g∈A} (F(g, A − {g}) + Q^{A−{g}}(M(A − {g}))).
Step 4b: For all 1 ≤ i < |A|, set M(A)(i) = M(A − {g*})(i), and M(A)(|A|) = g*.
Step 5: Return Q^G(M(G)).
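The five steps above can be sketched directly in Python. This is an illustrative re-implementation, not the authors' C++ program, and the three-gene score table at the end is a hypothetical example chosen so that the chain a → b → c is optimal.

```python
from itertools import combinations

def optimal_network(genes, s):
    """Steps 1-5 of the algorithm: returns (optimal score, parent sets).
    `s(g, B)` is the score of gene g with parent set B (a frozenset)."""
    genes = tuple(genes)
    # Steps 1-2: F[g][A] = min over B subseteq A of s(g, B), remembering best B.
    F = {g: {frozenset(): (s(g, frozenset()), frozenset())} for g in genes}
    subsets = sorted((frozenset(c) for r in range(1, len(genes) + 1)
                      for c in combinations(genes, r)), key=len)
    for A in subsets:
        for g in genes:
            if g in A:          # parent sets for g are drawn from G - {g}
                continue
            best = (s(g, A), A)
            for a in A:
                if F[g][A - {a}][0] < best[0]:
                    best = F[g][A - {a}]
            F[g][A] = best
    # Steps 3-4: M[A] = optimal permutation of A, Q[A] = its score Q^A(M(A)).
    M, Q = {frozenset(): ()}, {frozenset(): 0.0}
    for A in subsets:
        g_star = min(A, key=lambda g: F[g][A - {g}][0] + Q[A - {g}])
        M[A] = M[A - {g_star}] + (g_star,)
        Q[A] = F[g_star][A - {g_star}][0] + Q[A - {g_star}]
    # Step 5: read the optimal parents off along the optimal order M(G).
    parents, upstream = {}, set()
    for g in M[frozenset(genes)]:
        parents[g] = F[g][frozenset(upstream)][1]
        upstream.add(g)
    return Q[frozenset(genes)], parents

# Hypothetical score table for three genes (lower is better).
table = {
    ('a', frozenset()): 1.0, ('a', frozenset('b')): 2.0,
    ('a', frozenset('c')): 2.0, ('a', frozenset('bc')): 3.0,
    ('b', frozenset()): 3.0, ('b', frozenset('a')): 1.0,
    ('b', frozenset('c')): 2.5, ('b', frozenset('ac')): 2.0,
    ('c', frozenset()): 3.0, ('c', frozenset('a')): 2.5,
    ('c', frozenset('b')): 1.0, ('c', frozenset('ab')): 2.0,
}
best_score, best_parents = optimal_network('abc', s=lambda g, B: table[(g, B)])
```

Running this recovers the acyclic network a → b → c with score 3.0, the global minimum over all DAGs on the three genes.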
In the recursive formulas given in Step 2 and in Step 4, we want to compute the function F resp. M for a subset A ⊆ G of cardinality m = |A|, and need values of the function F resp. M for subsets of cardinality m − 1. Therefore, we can apply dynamic programming in Step 2 as well as in Step 4 to compute the functions F resp. M for subsets A of increasing cardinality. In the recursive formula in Step 4, first the last element g* of the permutation M(A) is computed in Step 4a, and then M(A) is set in Step 4b.

2.3 Correctness and Time Complexity
First, we prove the correctness of the algorithm. The correctness of the recursive formula in Step 2 of the algorithm follows directly from the definition of F . Therefore, after execution of Step 1 and Step 2, the values of function
F for all genes g and all subsets A ⊆ G are stored in memory. Before proceeding to Step 3 and Step 4, we state a lemma on the meaning of the function Q^A.
Lemma 1
Let A ⊆ G and π ∈ Π_A. Let N* ⊆ A × A be a π-linear network with minimal score. Then, Q^A(π) = score(N*) holds.
Proof. In a π-linear graph, a gene g can only have parents h which are upstream in the order coded by π, that is, π⁻¹(h) < π⁻¹(g). Therefore, when selecting parents for g, we are restricted to B = {h ∈ A | π⁻¹(h) < π⁻¹(g)}, and F(g, B) is the optimal choice in this case. Since in a π-linear graph all edges comply with the order coded by π, we can choose parents in this way for all genes independently, which proves the claim. □
Using Lemma 1, we prove that the function M can be computed by the formula given in Step 4.
Lemma 2
Let A ⊆ G. Let g* = argmin_{g∈A}(F(g, A − {g}) + Q^{A−{g}}(M(A − {g}))). Define π ∈ Π_A by π(i) = M(A − {g*})(i) for 1 ≤ i < |A|, and π(|A|) = g*. Then, π = M(A).
Proof. Let π′ ∈ Π_A. By the definition of M, we have to show Q^A(π) ≤ Q^A(π′). Let N* be an optimal π-linear network and M* an optimal π′-linear network. Then, by Lemma 1, Q^A(π) ≤ Q^A(π′) is equivalent to score(N*) ≤ score(M*). Let us denote the last element of π′ as h = π′(|A|). We note that for any B ⊆ G, Q^B(M(B)) is the score of a globally optimal network on B by the above definitions. Therefore, we have:

score(M*) = s(h, P_{M*}(h)) + Σ_{g∈A−{h}} s(g, P_{M*}(g))
          ≥ s(h, P_{M*}(h)) + Q^{A−{h}}(M(A − {h}))
          ≥ F(h, A − {h}) + Q^{A−{h}}(M(A − {h}))
          ≥ min_{h∈A}(F(h, A − {h}) + Q^{A−{h}}(M(A − {h})))
          = F(g*, A − {g*}) + Q^{A−{g*}}(M(A − {g*}))
          = score(N*),

which shows the claim. □
Since Q can be directly computed using F , the algorithm can compute
Q^G(M(G)) in Step 5. Finally, Q^G(M(G)) is the score of an optimal Bayesian network by definition, which shows the correctness. If the information of the best parents is stored together with F(g, A) for every gene g and every subset A ⊆ G, the optimal network can be constructed during the computation of Q^G(M(G)).
Theorem 1
Optimal networks can be found using O(n·2ⁿ) dynamic programming steps.
Proof. The dynamic programming in Step 1 and Step 2 requires O(n·2ⁿ) (n = |G|) steps, and in each step one score is computed. In the dynamic programming in Step 3 and Step 4, O(2ⁿ) steps are needed, where each step involves looking up some previously stored scores. Note that the function Q^A does not need to be actually computed in Step 4a, because Q^{A−{g}}(M(A − {g})) can be stored together with M(A − {g}) in previous steps. Therefore, the overall time complexity is O(n·2ⁿ). □
In biological reality, while the number of children of a regulatory gene may be very high, the number of parents can be assumed to be limited. When we limit the number of parents, the number of score calculations reduces substantially, allowing the computation of larger networks. We state the following trivial corollary, which is practically very meaningful (see Section 3).
Corollary 1
Let m ∈ ℕ be a constant. Optimal networks, in which no gene has more than m parents, can be found in O(n·2ⁿ) dynamic programming steps.
If we do not want to limit the number of parents by a constant, but instead can select for each gene a fixed number of candidate parents, the complexity changes as follows.
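The practical effect of bounding the in-degree can be seen by counting the score evaluations in Step 2, which dominate the running time in practice. This is a back-of-the-envelope sketch; the concrete numbers (n = 30, m = 6) are our own illustrative choices.

```python
from math import comb

def score_calls_unbounded(n):
    # Step 2 evaluates s(g, A) once for each gene g and each A subseteq G - {g}.
    return n * 2 ** (n - 1)

def score_calls_bounded(n, m):
    # With at most m parents (Corollary 1), only parent sets of size <= m
    # require a score computation; the count is polynomial in n for fixed m.
    return n * sum(comb(n - 1, k) for k in range(m + 1))

print(score_calls_unbounded(30))   # about 1.6e10 score computations
print(score_calls_bounded(30, 6))  # about 1.9e7 score computations
```

For 30 genes and at most 6 parents, the number of score evaluations drops by roughly three orders of magnitude, although the O(2ⁿ) dynamic programming steps of Steps 3-4 remain.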
Corollary 2
Let m ∈ ℕ be a constant. For each g ∈ G, let C_g ⊆ G be a set with |C_g| ≤ m. Optimal networks, in which each gene g has parents only in C_g, can be found in O(2ⁿ) dynamic programming steps.
Proof. Since the parents of each gene are selected from a set of constant size, the complexity of the dynamic programming in Step 1 and Step 2 becomes
constant. Therefore, the overall complexity becomes O(2ⁿ). □
We note that the two applications of dynamic programming in our algorithm can be implemented as a single application of dynamic programming, because when we compute the function M for a set of size m, we only need values of the function F for sets of size m − 1. Therefore, only the values of the functions F and M for sets of size m − 1 and m need to be stored in memory at the same time. This is practically meaningful for reducing the required amount of memory. We also note that the algorithm can be modified to also compute suboptimal solutions. Computing the second-best or the third-best network might be valuable in order to assess the stability of the inferred networks under marginal changes of the score.
3 Results
The algorithm described above was implemented as a C++ program. As scoring functions, existing implementations of the BNRC score, the BDe score and the MDL score are used. All three approaches (Theorem 1, Corollary 1 and Corollary 2) were implemented. We applied the program to a dataset of 173 microarrays measuring the response of Saccharomyces cerevisiae to various stress conditions.
3.1 Application to Heat Shock Data

From the dataset we selected 15 microarrays from 25°C to 37°C heat shock experiments and 5 microarrays from heat shock experiments from various temperatures to 37°C. Then we selected a set of 9 genes which are involved or putatively involved in the heat shock response. Figure 1 shows the optimal network with respect to the BNRC score. We observe that the transcription factor MCM1 is estimated to regulate three other genes, while it is not regulated by any of the genes in this set, which is plausible. The second transcription factor in our set of genes, HSF1, is estimated to regulate three other heat shock genes. It is also estimated to be regulated by an HSP70 protein (SSA1), which was reported before [16]. Another chaperone among these genes, SSA3, also seems to play an active role in the heat shock response and interacts with SSA1 and HSP104, coinciding with a report by Glover and Lindquist [6]. Overall, the result is biologically plausible and gives an indication of the active role of the chaperones SSA1 and SSA3 during the heat shock response.
We conclude that optimally inferred gene networks are meaningful and useful for the elucidation of gene regulation.

gene     annotation
HSF1     heat shock transcription factor
SSA1     ER and mitochondrial translocation, cytosolic HSP70
SSA3     ER and mitochondrial translocation, cytosolic HSP70
HIG1     heat shock response, heat-induced protein
HSP104   heat shock response, thermotolerance heat shock protein
MCM1     transcription, multifunctional regulator
HSP82    protein folding, HSP90 homolog
YRO2     unknown, putative heat shock protein
HSP26    diauxic shift, stress-induced protein

3.2 Computational Possibilities and Limitations
While even networks of small scale like the network inferred in Section 3.1 cannot be inferred with a brute-force approach (Eqn. 1), they can be optimally inferred by our program using a single 1.9 GHz Pentium CPU in about 10 minutes. In order to evaluate the practical possibilities of this approach, we selected 20 genes with a known active role in gene regulation from the
data set and estimated a network with optimal BNRC score using all 173 microarrays. The computation finished within about 50 hours using a Sun Fire 15K supercomputer with 96 CPUs at 900 MHz each. As a result of this computational experiment, we conclude that our method is feasible for gene networks of 20 genes, even if no constraints are made and a complex scoring scheme like the BNRC score is used. For the discrete scores BDe and MDL, which can be computed much faster, even networks of more than 20 genes can be inferred optimally without constraints. When the number of parents is limited to about 6 (Corollary 1) or, alternatively, sets of about 20 candidate parents are preselected (Corollary 2), gene networks of more than 30 genes can be inferred optimally even with the BNRC score. However, the method as it is now will not allow the estimation of networks of more than about 40 genes.
While the theoretical time complexity of the approach given in Corollary 2 is below the time complexity of the approach given in Corollary 1, we argue that the latter might be practically more important. First, limiting the number of parents by a constant can be done easily and is biologically justified, while selecting a set of candidate parents for each gene requires a method of gene selection, which can potentially bias the computation result. Second, it has to be considered that each dynamic programming step in the computation of the function F requires the computation of one score, while one dynamic programming step for the function M only requires looking up some previous results. When the number of parents is limited as in Corollary 1, the required number of score calculations becomes polynomial, which makes this approach faster in practical applications, though the approach in Corollary 2 is theoretically superior.

4 Conclusion
We have presented a method that allows gene networks of 20-40 genes to be inferred optimally, depending on the probability distribution used and on whether additional assumptions are made. This makes it possible to compare different scoring schemes, to assess the best parameters for a given scoring scheme, and to evaluate the usefulness of given microarray data, since optimal solutions are obtained. Also, the method is especially useful in settings where researchers focus on a certain group of genes and want to exploit gene expression measurements concerning these genes to the full extent. In contrast to heuristic approaches, if the results are unsatisfying or contradictory to biological knowledge, it can be concluded that the statistical model is incorrect or the data is insufficient. Even for a network of 20 genes,
knowing the best network from the huge search space is a large amount of information. We note that the method is not dependent on a certain scoring scheme or a certain kind of gene expression measurements. It can be applied in any setting where a score as defined in Section 2 is given. For example, when sequence information [19], protein interaction data [10], or other knowledge is incorporated in the score function, this method can also be applied. In order to find gene networks with more than 40 genes, two directions of future work open up. First, if a part of the set of subsets in which the algorithm performs the actual search can be pruned, the limit of feasibility might be increased. Second, compartmentalization of gene networks might be used to decompose larger networks into smaller parts, and each partial network inferred optimally.
Acknowledgements

The authors would like to thank Michiel de Hoon for discussions of the manuscript, and Hideo Bannai for advice on implementational issues.
References
1. D.M. Chickering. Learning Bayesian networks is NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, 1996.
2. G.F. Cooper, E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9: 309-347, 1992.
3. N. Friedman, M. Goldszmidt. Learning Bayesian networks with local structure. In M.I. Jordan, editor, Learning in Graphical Models, Kluwer Academic Publishers, pp. 421-459, 1998.
4. N. Friedman, M. Linial, I. Nachman, D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7: 601-620, 2000.
5. A.P. Gasch, et al. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11: 4241-4257, 2000.
6. J.R. Glover, S. Lindquist. Hsp104, Hsp70, and Hsp40: a novel chaperone system that rescues previously aggregated proteins. Cell, 94: 73-82, 1998.
7. A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young. Using graphical models and genomic expression data to statistically validate models
of genetic regulatory networks. Pacific Symposium on Biocomputing, 6: 422-433, 2001.
8. A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing, 7: 437-449, 2002.
9. S. Imoto, T. Goto, S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing, 7: 175-186, 2002.
10. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences USA, 97: 4569-4574, 2001.
11. S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara, S. Miyano. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology, in press, 2003.
12. T.I. Lee, N.J. Rinaldi, F. Robert, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298: 799-804, 2002.
13. I.M. Ong, J.D. Glasner, D. Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18: 241-248, 2002.
14. D. Pe'er, A. Regev, G. Elidan, N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17: 215-224, 2001.
15. R.W. Robinson. Counting labeled acyclic digraphs. New Directions in the Theory of Graphs, pp. 239-273, 1973.
16. Y. Shi, D.D. Mosser, R.I. Morimoto. Molecular chaperones as HSF1-specific transcriptional repressors. Genes & Development, 12: 654-666, 1998.
17. V.A. Smith, E.D. Jarvis, A.J. Hartemink. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics, 18: 216-224, 2002.
18. E.P. van Someren, L.F.A. Wessels, E. Backer, M.J.T. Reinders. Genetic network modeling. Pharmacogenomics, 3(4): 507-525, 2002.
19. Y. Tamada, S.
Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara, S. Miyano. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, in press, 2003.
PATHWAY LOGIC MODELING OF PROTEIN FUNCTIONAL DOMAINS IN SIGNAL TRANSDUCTION

C. TALCOTT, S. EKER, M. KNAPP, P. LINCOLN, K. LADEROUTE
SRI International, 333 Ravenswood Avenue, Menlo Park CA 94025
(firstname.lastname)@sri.com

Abstract
Protein functional domains (PFDs) are consensus sequences within signaling molecules that recognize and assemble other signaling components into complexes. Here we describe the application of an approach called Pathway Logic to the symbolic modeling of signal transduction networks at the level of PFDs. These models are developed using Maude, a symbolic language founded on rewriting logic. Models can be queried (analyzed) using the execution, search and model-checking tools of Maude. We show how signal transduction processes can be modeled using Maude at very different levels of abstraction involving either an overall state of a protein or its PFDs and their interactions. The key insight for the latter is our algebraic representation of binding interactions as a graph.
1 Introduction

There is a practical need to represent very large biological networks of all kinds as models at different levels of abstraction. For example, consider the following:
- The proteome of eukaryotic cells is at least an order of magnitude larger than the genome (very large and diverse protein networks)
- A large fraction of the genome of mammalian cells (≈ 10% of the human genome) encodes genomic regulators, producing very large regulatory networks of the genome itself
- Biological networks interact as modules/subnetworks to produce high levels of physiological organization (e.g., circadian clock subnetworks are integrated with metabolic, survival, and growth subnetworks)
In silico models of such networks would be valuable but must have certain features. In particular, they must be easily modified (extended or updated) and usable by bench researchers for formulating and testing hypotheses about how signals and other changes are propagated. Pathway Logic [1,2] is an application of techniques from formal methods and rewriting logic to develop models of biological processes. The goals of the Pathway Logic work include: building network models that working biologists and biomedical researchers can interact with and modify; making formal methods tools accessible to the general biological and biomedical research community; and enabling wet-lab researchers to generate informed hypotheses about complex biological networks.
The Pathway Logic work has initially focused on curation of models of signal transduction networks, including the Epidermal Growth Factor Receptor (EGFR) network and closely related networks [4,5,6]. Signal transduction processes are modeled at different levels of abstraction involving: (I) the overall state of proteins, or (II) protein functional domains (PFDs) and their interactions. These signaling networks can be queried using formal methods tools, for example, by choosing an initial condition and trying the following: (i) execution: show me some signaling pathway; (ii) search: show me all pathways leading to a specified final condition; or (iii) model-checking: is there a pathway with certain given properties? In this paper we use the recruitment and activation of the ubiquitous Raf1 serine-threonine protein kinase to illustrate the two levels of representation and in particular to show how PFDs are modeled and how the resulting model can be used. This more detailed representation of signaling proteins in which PFDs are explicit can be used to model domain-specific interactions in signaling networks, an important area of modern signal transduction research. Future work includes expanding the collection of proteins modeled at the level of PFD interactions as data becomes available, modeling additional signal transduction networks, and modeling metabolic pathways and their interactions with signal transduction pathways.

1.1 Formal Methods in Biology
Formal methods techniques have been used by various groups to develop executable models of biological systems at high levels of abstraction. Typically the techniques are based on a model of concurrent computation with associated formal languages for describing system behavior and tools for simulation and analysis. Petri nets were developed to specify and analyze concurrent systems. There are many variants of the Petri net formalism and a variety of languages and tools for specification and analysis of systems using the Petri net model [7]. Petri nets have a graphical representation that corresponds naturally to conventional representations of biochemical networks. They have been used to model metabolic pathways and simple genetic networks (examples include [8,9,10,11]). However, these efforts have largely been concerned with kinetic or stochastic models of biochemistry. In [12] a more abstract and qualitative view was taken, mapping biochemical concepts such as stoichiometry, flux modes, and conservation relations to well-known Petri net theory concepts. The pi-calculus [13] is a process algebra originally developed for describing concurrent computer processes. There are a number of specification languages and tools based on the pi-calculus. A pi-calculus model for the receptor tyrosine kinase/mitogen-activated protein kinase (RTK-MAPK) signal transduction pathway is presented in [14]. Signaling proteins are represented as processes and interactions as synchronous communications between processes (handshakes).
A stochastic variant of the pi-calculus is used in [15] to model both the time and probability of biochemical reactions. Statecharts are a visual notation for specifying reactive concurrent systems [16], used in object-oriented software design methodologies. Statecharts naturally express compartmentalization and hierarchical processes as well as flow of control amongst subprocesses. The resulting models can be used for simulation and visualization of biochemical processes. Statecharts have been used to model biological processes such as T-cell activation [17,18]. Live Sequence Charts [19] are an extension of the Message Sequence Charts modeling notation for system design. Using the associated Play-In/Play-Out approach, models can be built and tested by acting out reaction scenarios. Models of subsystems can be combined, and charts can be annotated with assertions that allow invariants and prohibited conditions to be expressed and checked. This approach has been used to model the process of cell fate acquisition during C. elegans vulval development [20].

1.2 Pathway Logic
Pathway Logic is an approach to modeling biological entities and processes based on formal methods and rewriting logic [3]. Pathway Logic models are developed using the Maude (http://maude.csl.sri.com) system, a formal language and tool set based on rewriting logic. Like the approaches to modeling biological processes mentioned above, Pathway Logic models are executable, hence they can be used for simulation. In addition, the Maude system provides search and model-checking capabilities. Using the search capability, all possible future states of a system can be computed to show its evolution from a given initial state (specified by the states of individual components) in response to a stimulus or perturbation. Using model-checking, a system in a given initial state can be shown to never exhibit pathways with certain properties, or the model-checker can be used to produce a pathway with a given property (by trying to show that no such pathway exists). Using the reflective capability of Maude, models can be mapped to other formalisms and exported in formats suitable for input to other tools for additional analysis capabilities and visualization. Rewriting logic [3] is a logical formalism based on two simple ideas: states of a system are represented as elements of an algebraic data type, and the behavior of a system is given by local transitions between states described by abstractions called rewrite rules. In Pathway Logic, algebraic data types are used to represent concepts from cell biology needed to model signaling processes, including intracellular proteins, biochemicals such as second messengers, extracellular stimuli, biochemical modification of proteins, protein association, and cellular compartmentalization of proteins. Rewrite rules are used to model local processes within a cell or transmission of a signal across a cell membrane. A signaling network is represented as a collection of rewrite rules together with the algebraic declarations. Rewriting logic then allows reasoning about possible complex changes given the basic changes (rules) specified by the model. In particular, pathways in the network satisfying different properties can be generated automatically using tools based on logical inference for execution (deduction), search, and model-checking.
2 Activation of Raf1 modeled at two levels

A Pathway Logic model of the Epidermal Growth Factor Receptor (EGFR) network (reviewed in [4,5,6]) is being developed by curating rewrite rules for relevant biochemical processes from the scientific literature. Depending on what data is available, processes are modeled at different levels of abstraction. Level I rules model processes in terms of overall protein states. Protein functional domains (PFDs) are consensus sequences within signaling molecules that recognize and bind other signaling components to make complexes. When there is enough information about a protein and the domains it contains to hypothesize the details of activation and translocation, Level II rules are developed. These rules model processes in terms of protein functional domains, and explicit posttranslational modifications of individual signaling molecules are included in the model. A key idea for the Level II rules is the representation of PFDs and their interactions algebraically as a graph. Here we use the recruitment and activation of the ubiquitous Raf1 serine-threonine protein kinase to illustrate the two levels of representation. The Raf1 system is a reasonably well-established and detailed example of a signal integrator in the EGFR network [21,22]. The Raf1 kinase is an effector of EGFR and other RTK signaling through the ERK1/2 MAPK pathway, which is organized in a module that can be represented by the kinase cascade MAPKKK → MAPKK → MAPK (reviewed in [5]). In this module, Raf1 is a MAPKKK.

2.1 Activation of Raf1 at Level I
An early step in the activation of Raf1 is recruitment of cytoplasmic Raf1 to the inner side of the cell membrane by Ras, following stimulation of the EGFR. Figure 1 shows both a graphical representation and the Maude representation (from which the picture is generated) of the Level I rule 280 modeling the activation of Raf1 and its recruitment to the cell membrane. This rule says that if the cell contains a Ras type protein with a GTP modification, activated Pak and Src protein kinases on the interior side of the cell membrane, and Raf1, phosphorylated 14-3-3 scaffold/adaptor proteins, and the phosphatase PP2A in the cytoplasm, then Raf1 can be activated and recruited to the membrane along with 14-3-3, leaving PP2A in the cytoplasm. In Maude a cell is represented by a term of the form {CM | ... {...}} where the first ellipsis stands for biochemicals in or attached to the interior of the
crl[280.?Ras.?Pak.Src.PP2A.?14-3-3.->.Raf1] :
  {CM | cm [?Ras - GTP] [?Pak - act] [Src - act]
    {cyto Raf1 [?14-3-3 - phos] PP2A}}
  =>
  {CM | cm [?Ras - GTP] [?Pak - act] [Src - act] [Raf1 - act] [?14-3-3 - phos]
    {cyto PP2A}}
  if ?Ras S:Soup := N-Ras K-Ras H-Ras
  [metadata "21192014(R)"] .
Figure 1: Raf1 activation rule (Level I)
cell membrane, and the second ellipsis stands for the biochemicals and compartments in the cytoplasm. A particular cell state is represented by replacing the ellipses by terms representing specific biochemicals and compartments. In a Maude rule the ellipses are replaced by patterns: terms with variables ranging over some set of biochemicals, represented as sorts in Maude. One of the sorts is Ras, representing the Ras type proteins. We use the convention that the name of a class of proteins prefixed by a ? is a variable ranging over the corresponding sort. Thus ?Ras can be instantiated to any of the proteins in the model declared to be of sort Ras. At Level I, posttranslational modification is represented abstractly by a modification operator [_ - _] applied to a protein and a set of abstract modifications. In the left-hand side of rule 280 the term [?Ras - GTP] represents a Ras type protein with a GTP modification, while the term [Src - act] represents activated Src protein kinase on the interior side of the cell membrane. The occurrences of Raf1, PP2A, and [?14-3-3 - phos] represent Raf1, PP2A and phosphorylated 14-3-3 in the cytoplasm. The variables cm and cyto serve as placeholders for any remaining unspecified biochemicals in (or on the interior side of) the cell membrane and the cytoplasm, respectively. In order to apply a set of rules to a particular cell, the components of that cell are formally represented as a multiset of ground terms (constants and other terms containing no variables) declared to be the initial cell state. A rule such as 280 is then applied to the cell by finding a substitution of components for the variables appearing in the left-hand side that makes it equal to the cell in question (matching), and replacing the cell by the result of applying the matching substitution
to the right-hand side of the rule. Representing cell contents using multisets means that the order in which individual components are listed does not matter, and the matching process takes this into consideration. With the above in mind we can see that application of rule 280 to the initial cell state:

eq cell = PD({CM | [N-Ras - GTP] [Pak1 - act] [Src - act]
    {Raf1 [14-3-3t - phos] PP2A}}) .
does indeed move Raf1 and 14-3-3 from the cytoplasm to the membrane, activating Raf1 and leaving the phosphorylation state of the 14-3-3 protein unchanged. The condition following the if in rule 280 constrains the matching protein found for the variable ?Ras to be one of those listed. The term [metadata "21192014(R)"] represents information that is not used in execution of the model but provides evidence and other useful information that can be used in other operations on the model. This particular metadata is the Medline citation for a paper used in curation of the rule. Level I rules have an alternative representation in terms of occurrences and transitions (corresponding to a special kind of Petri net). An occurrence is a biochemical paired with its location in the cell. For example, the occurrence of Raf1 on the left-hand side of the rule is represented by the pair < Raf1, cyto > and the pair < [Raf1 - act], cm > represents the occurrence on the right-hand side. A rule is then represented by a triple consisting of the multiset of left-hand side occurrences, the rule identifier, and the multiset of right-hand side occurrences. (Generic variables such as cm and cyto are ignored.) In the picture the occurrences are represented by ovals labelled by a printed form and the transition by a rectangle labeled with the rule identifier. Occurrences that appear only on the left-hand side are indicated by arrows from the oval to the rectangle, those that appear only on the right-hand side by arrows from the rectangle to the oval, and those that appear on both sides (enzymes, coenzymes) by dashed bidirectional arrows.
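The occurrence-based reading of Level I rules can be mimicked outside Maude as multiset rewriting over (location, biochemical) pairs. The sketch below is a toy Python rendering of rule 280 with the variables already instantiated (?Ras to N-Ras, ?Pak to Pak1, ?14-3-3 to 14-3-3t); it is illustrative only, uses naming conventions of our own, and is not part of Pathway Logic or Maude.

```python
from collections import Counter

# A cell state is a multiset (Counter) of occurrences: (location, entity) pairs.
# A rule is a pair of occurrence multisets (left-hand side, right-hand side).
RULE_280 = (
    Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
             ('cyto', 'Raf1'): 1, ('cyto', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1}),
    Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
             ('cm', 'Raf1@act'): 1, ('cm', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1}),
)

def apply_rule(state, rule):
    """Apply one rewrite step: if the left-hand side occurrences are all present,
    remove them and add the right-hand side occurrences; otherwise no match."""
    lhs, rhs = rule
    if any(state[occ] < n for occ, n in lhs.items()):
        return None                      # rule does not match this cell state
    return state - lhs + rhs             # replace matched occurrences

cell = Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
                ('cyto', 'Raf1'): 1, ('cyto', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1})
after = apply_rule(cell, RULE_280)
```

As in the Maude rule, one application moves Raf1 (now activated) and 14-3-3t to the membrane while PP2A stays in the cytoplasm, and the rule then no longer matches the resulting state.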
2.2 Activation of Rafl at Level II
The difference between a Level I rule and a Level II rule is that a Level I rule deals with interactions between whole proteins whereas a Level II rule deals with interactions between protein domains. In Level I, Rafl is considered to be inactive by (1) not having the modification "act" and (2) being located in the cytoplasm. In Level II the phosphorylation states of relevant amino acids, and the domains and sites which are bound intra- or inter-molecularly, are made explicit. Based on work by Dhillon and Kolch22 (augmented with details from a number of other publications) we drew, by hand, a stylized diagram of a possible Rafl activation process (Figure 2). The diagram is focused on the Rafl protein. Rafl is represented as a list of domains (blue bars) and potential phosphorylation sites
(lavender bars) relevant to the interaction being studied. Phosphorylation is indicated by a button labeled P hanging below the site bar. Other proteins binding to Rafl are represented by a bar labeled by the bound domain and the protein name. Those above the Rafl list (red) are in or attached to the cell membrane (also indicated by [CM]), and those below (green) are in the cytoplasm. The first row of the diagram represents inactive Rafl. It is associated with a dimer of 14-3-3 scaffold/adaptor proteins through binding of phosphorylated serines 259 and 621 in Rafl to serine binding domains (SBD) in the 14-3-3 dimer. In the diagram the 14-3-3 dimer is represented by the two 14-3-3 binding domains (green bars) and the line connecting these domains to each other. The arrows in the diagram indicate the progression of the activation process; the arrow labels give a description of the rule governing the interaction and indicate the key triggering biochemistry. For example, the trigger for Raf rule #1 is activated PKCz ([PKCz - act]). Based on this diagram, rules were written to model the steps of Rafl activation. To represent the functional domains of a signaling protein explicitly, we annotate proteins using the notation [p:Protein | atts:Atts]. Here atts:Atts is a set of attributes representing one or more PFDs or amino acid residues (sites). Each attribute may have associated modifications such as phosphorylation (phos) or an indication that the domain/site is participating in a binding (bound). Thus, a protein at Level II can be thought of as an encapsulated collection of functional domains and sites. The association or binding of signaling proteins through their functional domains is explicitly represented by edges in a graph whose nodes are protein-attribute pairs. For example, the inactivated form of Rafl shown in the first row of Figure 2 is represented by the right-hand side of the following Maude equation.
eq Rafl.inact =
  [Rafl | (S 43), RBD, C1, (S 259 - phos - bound), (S 338), (Y 341),
          PABM, (S 621 - phos - bound)]
  [14-3-3a | (SBD - bound), (DMD - bound)]
  [14-3-3b | (SBD - bound), (DMD - bound), (T 141 - phos)]
  e((Rafl,(S 621)), (14-3-3a,SBD))
  e((Rafl,(S 259)), (14-3-3b,SBD))
  e((14-3-3a,DMD), (14-3-3b,DMD)) .
The attributes
(S 43), RBD, C1, (S 259 - phos - bound), (S 338), (Y 341), PABM, (S 621 - phos - bound)
correspond to the bars in Figure 2. The attribute (S 621 - phos - bound) denotes the site (S 621) with two modifications, phos and bound. The modification -phos on the sites S 259 and S 621 corresponds to the buttons labeled P, and the modification -bound is used to indicate locally that the attribute has a binding. In the Maude term the 14-3-3 dimer is represented by the two 14-3-3 protein terms and the edge e((14-3-3a,DMD), (14-3-3b,DMD)).
The two vertical lines connecting the phosphorylated sites on Rafl to the 14-3-3 dimer are represented in the Maude term by the edges
e((Rafl,(S 621)), (14-3-3a,SBD))
e((Rafl,(S 259)), (14-3-3b,SBD)) .
In the Level II representation the activation of Rafl, represented at Level I by the single rule 280, requires several rules in which structural features of some of the proteins, including Rafl, are annotated with information about relevant PFDs and binding sites, and the binding between proteins is made explicit. As an example, we show the Maude representation of the rule numbered 6 in the diagram, in which activated Src phosphorylates partially activated Rafl at tyrosine 341.
rl[Rafl#6.Y341phos]:
  {CM | cm PS PA [?Slk - act] [?Ras | GTPbound, (RafBD - bound)]
   [Rafl | (S 43), (S 259), (Y 341), (C1 - bound), (S 621 - phos - bound),
           (PABM - bound), (RBD - bound), rafl:Atts]
   [14-3-3a | (SBD - bound), (DMD - bound), la:Atts]
   [14-3-3b | SBD, (DMD - bound), (T 141 - phos)]
   e((14-3-3a,DMD), (14-3-3b,DMD))
   e((Rafl,(S 621)), (14-3-3a,SBD))
   e((Rafl,C1), b(PS))
   e((Rafl,PABM), b(PA))
   e((Rafl,RBD), (?Ras,RafBD))
   {cyto}}
  =>
  {CM | cm PS PA [?Slk - act] [?Ras | GTPbound, (RafBD - bound)]
   [Rafl | (S 43), (S 259), (Y 341 - phos), (C1 - bound),
           (S 621 - phos - bound), (PABM - bound), (RBD - bound), rafl:Atts]
   [14-3-3a | (SBD - bound), (DMD - bound), la:Atts]
   [14-3-3b | SBD, (DMD - bound), (T 141 - phos)]
   e((14-3-3a,DMD), (14-3-3b,DMD))
   e((Rafl,(S 621)), (14-3-3a,SBD))
   e((Rafl,C1), b(PS))
   e((Rafl,PABM), b(PA))
   e((Rafl,RBD), (?Ras,RafBD))
   {cyto}} .
The left-hand side of the rule matches a situation in which Rafl is associated with a dimer of 14-3-3 proteins through binding of phosphorylated serine 621 (represented by (S 621 - phos - bound)) to the serine-binding domain ((SBD - bound)) in the 14-3-3 dimer, represented by the edge e((Rafl,(S 621)), (14-3-3a,SBD)).
The additional requirements that Rafl must be bound to Ras, phosphatidylserine (PS), and phosphatidic acid (PA) are represented by the edges
e((Rafl,C1), b(PS))
e((Rafl,PABM), b(PA))
e((Rafl,RBD), (?Ras,RafBD))
where the terms b(PS) and b(PA) represent unspecified binding domains or sites on PS and PA, respectively. Notice that the representation of overall cell structure is the same, and that Level I and Level II notation for proteins can be mixed, using Level II detail only where relevant. For example, Src is used as a Level I protein (as a variable ?Slk) of sort Slk (Src-like kinase). In order for Rafl to be fully activated it must be phosphorylated on both Y341 (by a Src-like kinase) and on S338 (by a member of the Pak family). It is unclear whether Y341 or S338 is phosphorylated first. This is represented in Figure 2 by the branch in the sequence of rules. In the Maude representation, rule 6 deals with this ambiguity by using the variable rafl:Atts instead of requiring a particular phosphorylation state for S338. Rule 5 (not shown) similarly uses an attribute variable instead of requiring a particular phosphorylation state for Y341. The application of Level II rules follows the same procedure as for Level I. Although domains and sites have a fixed order within a protein sequence, in the Maude model we treat them as a set because the ordering information plays no role in the processes represented. (Some ordering information is implicit in the site numbers and could easily be added if required for other purposes.) Level II rules for Rafl are connected to Level I by the equational rule shown above that converts the Level I representation Rafl.inact of inactivated Rafl to its Level II representation, and a dual rule that converts the Level II complex representing activated Rafl to its Level I representation (rule 7 in the pathway shown below).
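The rule-application procedure described above (match a multiset of occurrences, delete the left-hand-side occurrences, add the right-hand-side ones) can be sketched outside Maude. The following Python sketch is purely illustrative, not Pathway Logic code; the occurrence pairs are loosely based on the Level I rule 280, and all names are simplified.

```python
from collections import Counter

def apply_rule(state, lhs, rhs):
    """Fire a transition: if every left-hand-side occurrence is present in the
    state multiset, remove the LHS occurrences and add the RHS ones."""
    state = Counter(state)
    if all(state[occ] >= n for occ, n in lhs.items()):
        state.subtract(lhs)
        state.update(rhs)
        return +state          # unary + drops zero counts
    return None                # the rule does not match this state

# Roughly rule 280: Rafl and 14-3-3 move from cytoplasm (cyto) to the cell
# membrane (cm); Rafl becomes activated, 14-3-3 keeps its phosphorylation.
lhs = Counter({("Rafl", "cyto"): 1, ("14-3-3 - phos", "cyto"): 1,
               ("N-Ras - GTP", "cm"): 1})
rhs = Counter({("Rafl - act", "cm"): 1, ("14-3-3 - phos", "cm"): 1,
               ("N-Ras - GTP", "cm"): 1})   # enzyme-like: on both sides

cell = Counter({("Rafl", "cyto"): 1, ("14-3-3 - phos", "cyto"): 1,
                ("N-Ras - GTP", "cm"): 1, ("PP2A", "cyto"): 1})
new_cell = apply_rule(cell, lhs, rhs)
```

Because the state is a multiset, components may be listed in any order, mirroring the matching behaviour described for the Maude rules.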
3 Using the Pathway Logic Model

We now illustrate some of the ways in which the tools supplied by Maude can be used to query and analyze a Pathway Logic model. To set a context for using the rules for Rafl activation at the PFD level (Level II) we define an initial cell state qraf containing inactive Rafl and the postulated necessary conditions to activate it.
eq qraf =
  PD({CM | PS PA [Pakl - act] [PKCz - act] [Src - act] [H-Ras - GTP]
       {Rafl.inact PP2A}}) .
The form PD( ... ) represents a cell in a Petri dish, possibly with some external signaling compounds. As a first example of using the model, the question "can Rafl in a cell described by qraf be activated?" is answered by defining a proposition praf0 that expresses the query and then using the findPath query.
eq PD( out {CM | cm [Rafl - act] {cyto}} ) |= praf0 = true .
The above equation says that the proposition praf0 is true for a cell if the dish containing it matches the pattern on the left.
The query findPath(qraf, praf0) uses the Maude model checker to find a counterexample to the assertion that no state satisfying praf0 can be reached from the initial state qraf by applying the rules of the model (in this case the equation for Rafl.inact and Raf rules 1-7). If a counterexample is found, the query function extracts a path giving the labels of the rules applied and the state reached that satisfies the property praf0. The Maude command red findPath(qraf, praf0) executes this query, returning the following.
result SimplePath:
  spath('Rafl#1.PKCz 'Rafl#2.PP2A 'Rafl#3.PS.PA 'Rafl#4.Ras
        'Rafl#5.S338phos 'Rafl#6.Y341phos 'Rafl#7.Rafl.is.act,
        PD({CM | PA PS [Pakl - act] [PKCz - act] [Rafl - act]
             [H-Ras - GTP] [Src - act] {14-3-3b PP2A 14-3-3a}}))
The label 'Rafl#7.Rafl.is.act refers to a rule that converts the Rafl complex from Level II to Level I to connect with downstream Level I rules. To determine if other pathways are possible, we use the search command
search qraf =>! d:Dish .
to ask for all paths leading to a final state (a state to which no more rewrite rules apply). The answer here is that there is one final state, the one found by the above query, and two paths. The second path differs from the first only in the order in which rules 5 and 6 are applied. In general we might discover quite different pathways to a given final state, and/or more than one possible final state. The findPath query can also be used to check whether a model can generate expected intermediate states. For example, proposition praf1 expresses the property that a certain collection of bindings occurs.
eq PD( out {CM | cm e((Rafl,(S 621)), (14-3-3a,SBD)) e((Rafl,C1), b(PS))
        e((Rafl,PABM), b(PA)) e((14-3-3a,DMD), (14-3-3b,DMD)) {cyto}} )
   |= praf1 = true .
Executing the query findPath(qraf, praf1) results in a path in which rules 1, 2, and 3 have been applied. Although these results seem satisfactory, we might be concerned that the rules could also generate impossible or unlikely states, such as one in which Rafl is bound to both 14-3-3's in the dimer as well as being bound to PS and PA. To determine whether this possibility is predicted by the model, we can search for a cell state satisfying praf2, defined by matching the pattern
PD( out {CM | cm [H-Ras - GTP] e((14-3-3a,DMD), (14-3-3b,DMD))
     e((Rafl,(S 621)), (14-3-3a,SBD)) e((Rafl,(S 259)), (14-3-3b,SBD))
     e((Rafl,C1), b(PS)) e((Rafl,PABM), b(PA)) {cyto}} )
Indeed, executing the query findPath(qraf, praf2) confirms that such a state is not reachable: Maude returns the result (noPath).SimplePath.
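The behaviour of a findPath-style query can be mimicked with an explicit-state search: treat each dish state as a set of facts, each rule as a (label, precondition, postcondition) triple, and breadth-first search for a state satisfying the property. This is only a conceptual sketch with hypothetical rule names; Maude's model checker is far more general.

```python
from collections import deque

def find_path(initial, rules, prop):
    """Return the rule labels on a shortest path from `initial` to a state
    satisfying `prop`, or None if no such state is reachable."""
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, path = queue.popleft()
        if prop(state):
            return path
        for label, pre, post in rules:
            if pre <= state:                       # all preconditions present
                nxt = frozenset((state - pre) | post)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [label]))
    return None                                    # analogous to (noPath)

# Two toy rules standing in for the Rafl activation sequence.
rules = [("Rafl#1", frozenset({"Rafl.inact"}), frozenset({"Rafl.partial"})),
         ("Rafl#7", frozenset({"Rafl.partial"}), frozenset({"Rafl.act"}))]
path = find_path(frozenset({"Rafl.inact"}), rules, lambda s: "Rafl.act" in s)
```

A reachable property yields a path of rule labels, as in the spath result above; an unreachable one yields None, the analogue of (noPath).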
4 Conclusions

Pathway Logic is an example of how logical formalisms and formal modeling techniques can be used to develop a new science of symbolic systems biology. We believe that this computational science will provide researchers with powerful tools to facilitate the understanding of complex biological systems and accelerate the design of experiments to test hypotheses about their functions in vivo. In particular, we are interested in formalizing models that biologists can use to think about signaling pathways and other processes in familiar terms while allowing them to computationally ask questions about possible outcomes. Here we have exemplified our approach using the biochemistry of signaling involving the mammalian Rafl protein kinase. The use of a logic such as rewriting logic for this kind of modeling has many practical benefits, including the ability to (1) build and analyze models with multiple levels of detail, (2) represent general rules, (3) define new kinds of data and properties, and (4) execute queries using logical inference. Model validation is done both by experimental testing of predictions and by using the analysis tools to check consistency with known results. Already the Pathway Logic models are useful for clarifying and organizing experimental data from the literature. The eventual goal is to reach a level of maturity that supports prediction of new and possibly unexpected results.
Acknowledgments

We thank the anonymous reviewers for their helpful criticisms. This work was supported in part by grant CA73807 from the National Institutes of Health (KL). Maude tool development has been supported by NSF grants CCR-9900326 and CCR-9900334, and DARPA through Air Force Research Laboratory Contract F30602-02-C-0130.
References

1. S. Eker et al. Pathway logic: Symbolic analysis of biological signaling. In Proceedings of the Pacific Symposium on Biocomputing, pages 400-412, January 2002.
2. S. Eker, M. Knapp, K. Laderoute, P. Lincoln, and C. Talcott. Pathway logic: Executable models of biological networks. In Fourth International Workshop on Rewriting Logic and Its Applications (WRLA'2002), 2002. http://www.elsevier.nl/locate/entcs/volume71.html.
3. J. Meseguer. Conditional rewriting logic as a unified model of concurrency. Theoretical Computer Science, 96(1):73-155, 1992.
4. J. M. Kyriakis and J. Avruch. Mammalian mitogen-activated protein kinase signal transduction pathways activated by stress and inflammation. Physiol. Rev., 81:807-869, 2001.
5. G. Pearson et al. Mitogen-activated protein (MAP) kinase pathways: regulation and physiological functions. Endocr. Rev., pages 153-183, 2001.
6. J. D. Jordan, E. Landau, and R. Iyengar. Signaling networks: The origins of cellular multitasking. Cell, 103:193-200, 2000.
7. J. L. Peterson. Petri Nets: Properties, Analysis, and Applications. Prentice-Hall, 1981.
8. P. J. Goss and J. Peccoud. Quantitative modeling of stochastic systems in molecular biology using stochastic Petri nets. Proc. Natl. Acad. Sci. U. S. A., 95:6750-6755, 1998.
9. H. Matsuno, A. Doi, M. Nagasaki, and S. Miyano. Hybrid Petri net representation of gene regulatory network. In Pacific Symposium on Biocomputing, volume 5, pages 341-352, 2000.
10. H. Genrich, R. Kuffner, and K. Voss. Executable Petri net models for the analysis of metabolic pathways. Int. J. STTT, 3, 2001.
11. J. S. Oliveira et al. A computational model for the identification of biochemical pathways in the Krebs cycle. J. Computational Biology, 10:57-82, 2003.
12. I. Zevedei-Oancea and S. Schuster. Topological analysis of metabolic networks based on Petri net theory. In Silico Biology, 3(0029), 2003.
13. R. Milner. Communication and Concurrency. Prentice Hall, 1989.
14. A. Regev, W. Silverman, and E. Shapiro. Representation and simulation of biochemical processes using the pi-calculus process algebra. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein, editors, Pacific Symposium on Biocomputing, volume 6, pages 459-470. World Scientific Press, 2001.
15. C. Priami, A. Regev, E. Shapiro, and W. Silverman. Application of a stochastic name-passing calculus to representation and simulation of molecular processes. Information Processing Letters, 2001. In press.
16. D. Harel. Statecharts: A visual formalism for complex systems. Science of Computer Programming, 8:231-274, 1987.
17. N. Kam, I. R. Cohen, and D. Harel. The immune system as a reactive system: Modeling T cell activation with statecharts. Bulletin of Mathematical Biology, 2002. To appear.
18. S. Efroni, D. Harel, and I. R. Cohen. Towards rigorous comprehension of biological complexity: Modeling, execution and visualization of thymic T-cell maturation. Genome Research, 2003. Special issue on Systems Biology, in press.
19. W. Damm and D. Harel. Breathing life into message sequence charts. Formal Methods in System Design, 19(1), 2001.
20. N. Kam et al. Formal modeling of C. elegans development: A scenario-based approach. In First International Workshop on Computational Methods in Systems Biology, volume 2602 of Lecture Notes in Computer Science, pages 4-20. Springer, 2003.
21. W. Kolch. Meaningful relationships: The regulation of the Ras/Raf/MEK/ERK pathway by protein interactions. Biochem. J., 351:289-305, 2000.
22. A. S. Dhillon and W. Kolch. Untying the regulation of the Raf-1 kinase. Arch. Biochem. Biophys., 404:3-9, 2002.
MODELING GENE EXPRESSION FROM MICROARRAY EXPRESSION DATA WITH STATE-SPACE EQUATIONS

F. X. WU, W. J. ZHANG, A. J. KUSALIK
Division of Biomedical Engineering, Department of Computer Science, University of Saskatchewan, 57 Campus Dr., Saskatoon, SK, S7N 5A9, CANADA
[email protected]; [email protected]; [email protected]

We describe a new method to model gene expression from time-course gene expression data. The modelling is in terms of state-space descriptions of linear systems. A cell can be considered to be a system where the behaviours (responses) of the cell depend completely on the current internal state plus any external inputs. The gene expression levels in the cell provide information about the behaviours of the cell. In previously proposed methods, genes were viewed as internal state variables of a cellular system and their expression levels were the values of the internal state variables. This viewpoint suffers from underdetermination of the model parameters. Instead, we view genes as the observation variables, whose expression values depend on the current internal state variables and any external inputs. Factor analysis is used to identify the internal state variables, and the Bayesian Information Criterion (BIC) is used to determine the number of internal state variables. By building dynamic equations of the internal state variables and the relationships between the internal state variables and the observation variables (gene expression profiles), we obtain state-space descriptions of a gene expression model. In the present method, model parameters may be unambiguously identified from time-course gene expression data. We apply the method to two time-course gene expression datasets to illustrate it.
1. Introduction

With advances in DNA microarray technology and genome sequencing, it has become possible to measure gene expression levels on a genomic scale. Data thus collected promise to enhance fundamental understanding of life on the molecular level, from regulation of gene expression and gene function to cellular mechanisms, and may prove useful in medical diagnosis, treatment, and drug design. Analysis of these data requires mathematical tools that are adaptable to the large scale of the data, and capable of reducing the complexity of the data to make it comprehensible. Substantial effort is being made to build models to analyze such data. Non-hierarchical clustering techniques such as k-means clustering are a class of mixture model-based approaches. They group genes with similar expression patterns and have already proven useful in identifying genes that contribute to common functions and are therefore likely to be coregulated. However, as pointed out by Holter et al., whether information about the underlying genetic architecture and regulatory interconnections can be derived from the analysis of gene expression patterns remains to be determined. It is also important to note that models based on clustering analysis are static and thus cannot describe the dynamic evolution of gene expression.
Boolean networks can be applied to gene expression, where a gene's expression (state) is simplified to being either completely "on" or "off". These states are often represented by the binary values 1 and 0, respectively, and the state of a gene is determined by a Boolean function of the states of other genes. The functions can be represented in tables, or as rules. An example of the latter is "if gene A is 'on' AND either gene B OR C is 'off' at time t, then gene D is 'on' at time t + Δt". As the system proceeds from one state (or time point) to the next, the pattern of currently expressed/non-expressed genes is used as input to rules which specify which genes will be "on" at the next state or time point. Somogyi and Sniegoski showed that such Boolean networks have features similar to those in biological systems, such as global complex behaviour, self-organization, stability, redundancy, and periodicity. Liang et al. described an algorithm for inferring genetic network architectures from the rules table of a Boolean network model. Their computational experiments showed that a small number of state transition pairs are sufficient to infer the original network. Akutsu et al. devised a much simpler algorithm for the same problem and proved that if the in-degree of each node (i.e., the number of input nodes to each node) is bounded by a constant h, only O(log n) state transition pairs (from the 2^n possible pairs) are necessary and sufficient to identify the original Boolean network of n nodes (genes) correctly with high probability. However, Boolean network models depend on simplifying assumptions about biological systems. For example, by treating gene expression as either completely "on" or "off", these models ignore those genes that have a range of expression levels and can have regulatory effects at intermediate expression levels. Therefore they ignore those regulatory genes that influence the transcription of other genes to variable degrees.
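The quoted rule style can be made concrete with a toy network. The gene names and update functions below are illustrative only, chosen to mirror the rule "if gene A is 'on' AND either gene B OR C is 'off' at time t, then gene D is 'on' at time t + Δt":

```python
# A minimal synchronous Boolean-network step: every gene's next state is a
# Boolean function of the current states of the other genes.

def step(state):
    """Compute the next state of each gene from the current state."""
    return {
        "A": state["A"],                                 # A holds its value
        "B": not state["A"],                             # A represses B
        "C": state["C"],                                 # C is constant here
        "D": state["A"] and (not state["B"] or not state["C"]),
    }

s0 = {"A": True, "B": True, "C": False, "D": False}
s1 = step(s0)   # D turns on, since A is on and C is off
```

Iterating step from an initial state yields the state-transition trajectory that network-inference algorithms such as those of Liang et al. and Akutsu et al. take as input.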
In addition to Boolean network models (of discrete variables), dynamic models (of continuous variables) have also been applied to gene expression. Chen et al. proposed a differential equation model of gene expression. Due to the lack of gene expression data, the model is usually underdetermined. Using the additional requirement that the gene regulatory network should be sparse, they showed that the model can be constructed in O(n^(h+1)) time, where n is the number of genes and/or proteins in the model and h is the maximum number of nonzero coefficients (connectivity degree of genes in a regulatory network) allowed for each differential equation in the model. In order that the parameters of the models are identifiable, both Chen et al. and Akutsu et al. assume that all genes have a fixed maximum connectivity degree h (often small). These assumptions obviously contradict biological reality. For instance, some genes are known to have many regulatory inputs, while others are not known to have more than a few. Another shortcoming of the previous work is that the fixed maximum connectivity degree h of Chen et al. is chosen in an ad hoc manner. De Hoon et al. considered Chen's differential model and used Akaike's Information Criterion (AIC) to determine the connectivity degree h of each gene. In their method, not all
genes must have a fixed connectivity. However, they do not present an efficient algorithm to identify the parameters of their differential equation model; the brute-force algorithm used in the paper has a computational complexity of O(2^n), where n is the number of genes in the model. The authors claim that their method can be applied to find a network among individual genes. However, for biologically realistic regulatory networks, the computational complexity is prohibitive. For instance, De Hoon et al. do not build any gene expression models among individual genes and instead choose to group the genes into several clusters and only study the interrelationships between the clusters. D'haeseleer et al. proposed a linear model for mRNA expression levels during CNS (Central Nervous System) development and injury. To deal with the lack of gene expression data, the authors used a nonlinear interpolation scheme to guess the shapes of gene expression profiles between the measured time points. Such an interpolation scheme is ad hoc. Therefore, the reasonableness of a model built from such interpolated data is questionable. In addition, while the authors built a linear model for 65 measured mRNA species, a dimensionality problem arises when the number of genes in a model is large, for example, about 6000 (the number of genes in yeast). Recently we have investigated strategies for identifying gene regulatory networks from gene expression data with a state-space description of the gene expression model. We have found that modeling gene expression is key to inferring the regulatory networks among individual genes. Therefore, in this paper we focus on modeling gene expression.
The contributions of this paper are as follows: (1) A state-space description of a gene expression dynamic model is proposed, where gene expression levels are viewed as the observation variables of a cellular system, which in turn are linear combinations of the internal variables of the system. (2) Factor analysis is used to separate the internal variables and calculate their expression values from the values of the observation variables (gene expression data), where the Bayesian Information Criterion (BIC) is used to determine the number of the internal variables. (3) The method is applied to two time-course gene expression datasets. The results suggest that it is possible to determine unambiguously a gene expression dynamic model from limited time-course gene expression data.
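The BIC-based selection in item (2) trades goodness of fit against model size via the standard criterion BIC = -2·logL + k·log(n). A minimal sketch follows; the log-likelihood values and the assumed parameter count per internal variable are made up purely for illustration:

```python
import math

def bic(log_likelihood, num_params, sample_size):
    """Bayesian Information Criterion: penalized negative log-likelihood."""
    return -2.0 * log_likelihood + num_params * math.log(sample_size)

# Hypothetical fitted log-likelihoods for p = 1, 2, 3 internal variables,
# assuming (for illustration) 10 free parameters per internal variable.
candidates = {1: -120.0, 2: -95.0, 3: -93.5}
n = 50                                              # sample size

best_p = min(candidates, key=lambda p: bic(candidates[p], p * 10, n))
```

Here the jump from p = 1 to p = 2 improves the likelihood enough to pay the penalty, while p = 3 does not, so the smallest-BIC model has two internal variables; this is the overfitting guard discussed in the Methods section.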
2. Methods

Chen et al. theoretically model biological data with the following linear differential equations:
dx(t)/dt = A·x(t)    (1)

where the vector x(t) = [x_1(t) ... x_n(t)]^T contains the mRNA and/or protein concentrations as a function of time t, the matrix A is constant and represents the extent or degree of regulatory relationships among genes and/or proteins, and n is the number of genes and/or proteins in the model. The superscript "T" in the formula indicates the transposition of a vector. D'haeseleer et al. proposed the following linear difference equations to model gene expression data:

x(t + Δt) = W·x(t)    (2)
where the vector x(t) = [x_1(t) ... x_n(t)]^T contains gene expression levels as a function of time t, the matrix W = [w_ij]_{n×n} represents the regulatory relationships and degrees among genes, and n is the number of genes in the model. In detail, x_i(t + Δt) is the expression level of gene i at time t + Δt, and w_ij indicates how much the level of gene j influences gene i when time goes from t to t + Δt. Models (1) and (2) are equivalent. When Δt tends to zero, model (2) may be transformed into model (1). On the other hand, to identify the parameters in model (1), one must discretize it into the formalism of model (2). Since gene expression data from DNA microarrays can only be obtained at a series of discrete time points with present experimental technologies, difference equations are employed to model gene expression data in this paper. In addition, in DNA microarray experiments usually only the gene expression levels are determined, while the concentrations of the resulting proteins are unknown. Therefore this work only considers constructing a system describing a gene expression dynamic model. In the Boolean network model, model (1), or model (2), genes are viewed as state variables in a cellular system. This makes parameter identification of the models impossible without additional assumptions when using microarray data. In addition, previous models assume that regulatory relationships among genes are direct; for example, gene j directly regulating gene i with the weight w_ij in model (2). In fact, genes may not be regulated in such a direct way in a cellular system and may instead be regulated by some internal regulatory elements. The following state-space description of a gene expression model is proposed to model gene expression evolution:

z(t + Δt) = A·z(t) + n_1(t)
x(t) = C·z(t) + n_2(t)    (3)
where, in terms of linear system theory, equations (3) are called the state-space description of a system. The vector x(t) = [x_1(t) ... x_n(t)]^T consists of the observation variables of the system, and x_i(t) (i = 1,...,n) represents the expression level of gene i at time t, where n is the number of genes in the model. The vector z(t) = [z_1(t) ... z_p(t)]^T consists of the internal state variables of the system, and z_i(t) (i = 1,...,p) represents the expression value of internal element i at time t, which directly regulates gene expression, where p is the number of internal state variables. The matrix A = [a_ij]_{p×p} is the time translation matrix of the internal state variables, or the state transition matrix. It provides key information on the influences of the internal variables on each other. The matrix C = [c_ij]_{n×p} is the transformation matrix between the observation variables and the internal state variables. The entries of the matrix encode information on the influences of the internal regulatory elements on the genes. Finally, the vectors n_1(t) and n_2(t) stand for system noise and observation noise. For simplicity, noise is ignored in this development. Let X(t) be the gene expression data matrix with n rows and m columns, where n and m are the numbers of genes and measuring time points, respectively. The building of model (3) from microarray gene expression data X(t) may be divided into two phases. Phase one identifies the internal state variables and their expression matrix Z(t), with p rows and m columns, from the data matrix X(t), and computes the transformation matrix C such that

X(t) = C·Z(t)    (4)
Phase two builds the difference equations of the internal states; i.e., it determines the state transition matrix A from the expression matrix Z(t). In the process of building model (3), phase one, i.e., establishing equations (4), is key. There are many methods that may be used to obtain the decomposed equations (4) describing the gene expression data. For example, one may employ cluster analysis, where the means of the clusters may be viewed as the internal variables. One may also employ singular value decomposition, where the characteristic modes or eigengenes may be viewed as the internal variables. However, in typical applications of cluster analysis and singular value decomposition, the number of such internal variables is chosen in ad hoc fashion, with the result that the matrix C and the expression data matrix of the internal variables Z(t) are decided subjectively rather than from the data themselves. Note that the matrices C and Z(t) are dependent. After Z(t) is identified, C may be calculated by the formula C = X(t)·Z⁺(t), where Z⁺(t) is the unique Moore-Penrose generalized inverse of the matrix Z(t). Next, maximum likelihood factor analysis is used to identify the internal state variables, and BIC is used to determine the number of the internal state variables,
where X(t) is the n × m observed data matrix, C is the n × p unobserved loading matrix, and Z(t) is the p × m factor-score matrix. In fact, both the generalized likelihood ratio test (GLRT) and Akaike's Information Criterion (AIC) may also be used to determine the number of the internal variables, but they share a drawback: as the sample size increases there is an increasing tendency to accept the more complex model. The BIC takes sample size into account. Although the BIC method was developed from a Bayesian standpoint, the result is insensitive to the prior distribution for adequate sample size. Thus a prior distribution does not need to be specified, which simplifies the method. For each model, the BIC is calculated as

BIC = -2·(log-likelihood of the estimated model) + log(n)·(number of estimated parameters in the model)    (5)

where n is the sample size. As with AIC, the model with the smallest BIC is chosen. BIC avoids the overfitting of a model to data. After obtaining the expression data matrix of the internal variables Z(t) and the transformation matrix C in phase one, we develop the difference equations in model (3)

Z(t + Δt) = A·Z(t)    (6)
from the data matrix z(t) in phase two. The matrix A contains p² unknown elements, while the matrix z(t) contains m · p known expression data points. If p > m, equations (6) would be underdetermined. Fortunately, the number of internal variables p chosen using BIC is generally less than the number of time points m, so the matrix A is identifiable. To determine the matrix A, the time step Δt is chosen to be the highest common factor of all the experimentally measured time intervals, so that the time of the j-th measurement is t_j = n_j · Δt, where n_j is an integer. For equally spaced measurements, n_j = j.
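The choice of Δt as the highest common factor of the measured intervals can be sketched with hypothetical, unequally spaced measurement times:

```python
from math import gcd
from functools import reduce

# Hypothetical, unequally spaced measurement times (e.g., in minutes).
times = [0, 10, 30, 40, 70, 90]
intervals = [b - a for a, b in zip(times, times[1:])]  # [10, 20, 10, 30, 20]

dt = reduce(gcd, intervals)        # highest common factor of the intervals
n_j = [t // dt for t in times]     # so that t_j = n_j * dt with integer n_j

assert dt == 10
assert n_j == [0, 1, 3, 4, 7, 9]
```

For equally spaced measurements all intervals equal Δt and n_j reduces to j.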
We define a time-variant vector v(t) with the same dimensions as the internal state vector z(t) and with the initial value v(t₀) = z(t₀). For all subsequent times, v(t) is determined from v(t + Δt) = A · v(t). For any integer k, we have

v(t₀ + k · Δt) = A^k · v(t₀).    (7)
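Equation (7) states that k applications of the one-step recursion equal multiplication by the k-th matrix power; a quick numerical check with a synthetic transition matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
A = 0.5 * rng.normal(size=(p, p)) / np.sqrt(p)  # hypothetical transition matrix
z0 = rng.normal(size=p)                          # initial internal state z(t0)

# Propagate v step by step: v(t + dt) = A v(t), starting from v(t0) = z(t0).
k = 4
v = z0.copy()
for _ in range(k):
    v = A @ v

# Equation (7): v(t0 + k*dt) = A^k v(t0).
assert np.allclose(v, np.linalg.matrix_power(A, k) @ z0)
```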
The p² unknown elements of the matrix A are chosen to minimize the cost function (the sum of squared relative errors)

F(A) = Σ_j ||v(t_j) - z(t_j)||² / ||z(t_j)||²,    (8)

where ||·|| stands for the Euclidean norm of a vector. For equally spaced measurements, the problem is a linear regression problem and the cost function (8) can be minimized by least squares. For unequally spaced measurements, the problem becomes nonlinear, and it is necessary to determine the matrix A using an optimization technique such as those in Chapter 10 of Press's text [26].
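For the equally spaced case, stacking equation (6) over consecutive time points gives the linear regression z(:, 1:) ≈ A · z(:, :-1). The sketch below solves this one-step regression by ordinary least squares via the pseudoinverse on synthetic, noise-free data; note that this minimizes the plain squared error rather than the relative-error cost (8), so it is a simplification of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 5, 12
# Hypothetical stable transition matrix: 0.9 times a random orthogonal matrix,
# so all eigenvalues have modulus 0.9 (inside the unit circle).
A_true = 0.9 * np.linalg.qr(rng.normal(size=(p, p)))[0]

# Simulate noise-free, equally spaced internal state profiles z(t).
Z = np.empty((p, m))
Z[:, 0] = rng.normal(size=p)
for j in range(1, m):
    Z[:, j] = A_true @ Z[:, j - 1]

# Least-squares solution of the one-step regression via the pseudoinverse.
A_hat = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])

assert np.allclose(A_hat @ Z[:, :-1], Z[:, 1:])       # fitted one-step predictions
assert np.allclose(A_hat, A_true, atol=1e-6)          # exact recovery (full-rank data)
```

With noisy or unequally spaced data, the relative-error cost (8) must instead be minimized numerically, as the text describes.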
3. Applications
Figure 1. Profiles of BIC with respect to the number of internal variables for (a) the CDC15 data and (b) the BAC data.
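BIC profiles like those in Figure 1 come from evaluating BIC at each candidate number of internal variables. The sketch below does this on synthetic data, using a probabilistic-PCA-style covariance estimate as a stand-in for the EM-fitted maximum likelihood factor analysis used in the paper; the parameter count m·p + m (loadings plus per-variable residual variances) follows the text:

```python
import numpy as np

def factor_model_bic(X, p):
    """BIC of a p-factor Gaussian model of the columns of X (n samples x m variables).

    A probabilistic-PCA-style estimate stands in for full ML factor analysis
    (which the paper fits with the EM algorithm)."""
    n, m = X.shape
    S = (X.T @ X) / n                           # sample covariance (columns centered)
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenvalues descending
    sigma2 = evals[p:].mean()                   # residual variance from discarded modes
    load = evecs[:, :p] * np.sqrt(np.maximum(evals[:p] - sigma2, 0.0))
    cov = load @ load.T + sigma2 * np.eye(m)    # model covariance: Lambda Lambda' + Psi
    _, logdet = np.linalg.slogdet(cov)
    loglik = -0.5 * n * (m * np.log(2 * np.pi) + logdet
                         + np.trace(np.linalg.solve(cov, S)))
    k = m * p + m                               # number of estimated parameters
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(1)
n, m, p_true = 500, 12, 3
Z = rng.normal(size=(p_true, m))                # hypothetical internal variable profiles
X = rng.normal(size=(n, p_true)) @ Z + 0.1 * rng.normal(size=(n, m))
X -= X.mean(axis=0)

bics = {p: factor_model_bic(X, p) for p in range(1, 7)}
best_p = min(bics, key=bics.get)
```

On this synthetic data, generated with three underlying factors, the minimum-BIC choice recovers p = 3.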
In this section, the proposed methodology is applied to two publicly available microarray datasets. The first dataset (CDC15) is from Spellman et al. [27] and consists of the expression data of 799 cell-cycle-related genes at the first 12 equally spaced time points, representing the first two cycles. The dataset is available at http://cellcycle-www.stanford.edu, and missing data were imputed by the mean values of the microarrays. The second dataset (BAC) is from Laub et al. [28] and consists of the expression data of 1590 genes at 11 equally spaced time points with no missing data. The dataset is available at http://caulobacter.stanford.edu/CellCycle. As the mean values and magnitudes for genes and microarrays mainly reflect the experimental procedure, we normalize the expression profile of each gene to have length one, and then normalize the expression values on each microarray to have mean zero and length one. These normalizations also make the factor analysis simpler.
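The two normalization steps described above (gene profiles to unit length, then each microarray to mean zero and unit length, in that order) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical raw expression matrix: genes x time points (microarrays).
X = rng.normal(loc=2.0, scale=3.0, size=(799, 12))

# Step 1: scale each gene (row) profile to unit Euclidean length.
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Step 2: center each microarray (column) to mean zero, then scale to unit length.
X = X - X.mean(axis=0, keepdims=True)
X = X / np.linalg.norm(X, axis=0, keepdims=True)

assert np.allclose(X.mean(axis=0), 0.0)
assert np.allclose(np.linalg.norm(X, axis=0), 1.0)
```

Note that the second step partially undoes the row normalization; the steps are applied sequentially, as the text describes them.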
Table 1. The internal variable expression matrices for (a) the CDC15 dataset and (b) the BAC dataset; each column gives the expression profile of one internal variable.
The EM algorithm for maximum likelihood factor analysis [22] was employed for the two datasets. The gene expression profile of one gene is one sample observation, and the identified parameters are the p · m elements of the matrix z(t) and the variances of the m residual errors [22]. Figure 1 depicts the profiles of BIC with respect to the number of internal variables. Clearly from Figure 1, 5 is the best choice for the number of internal variables for both datasets. The expression matrices for the internal variables are listed in Table 1, where each column describes one internal variable. Table 2. The state transition matrices of the internal variables
CDC15:
A = [ 0.4378  -1.0077   0.5009   0.1851  -0.1189
      0.6649   0.5244   0.2475   0.1511  -0.1356
     -0.0702   0.1734   0.6794  -0.3092  -0.5279
     -0.0699  -0.0103   0.1786   0.6163  -0.5190
      0.0161   0.0316  -0.0700   0.1358   0.6662 ]

BAC:
A = [ ... ]
In order to determine the state transition matrices in the models from the internal variable expression matrices, we solve the two optimization problems (8), one for each dataset. As both datasets consist of equally spaced measurements, the least-squares method can be used to obtain the two state transition matrices A in the models, shown in Table 2. Figure 2 gives a comparison of the internal variable expression profiles in Table 1 and their profiles calculated from the model (3) for (a) CDC15 and (b) BAC,
respectively. The values of the cost function are 0.2321 and 0.0761 for the CDC15 dataset and the BAC dataset, respectively. That is, at each time point the average relative errors between the internal variable profiles in Table 1 and their values calculated by model (3) are 0.0622 and 0.0372 for the CDC15 dataset and the BAC dataset, respectively. Therefore, the two state transition matrices in Table 2 are plausible.
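The reported per-point average relative errors are consistent with reading the cost as a sum of squared relative errors over all m · p internal-variable data points, i.e. average error = sqrt(cost / (m · p)):

```python
import math

def avg_relative_error(cost, m, p):
    # Average relative error per data point implied by a summed
    # squared-relative-error cost over m time points and p internal variables.
    return math.sqrt(cost / (m * p))

# CDC15: m = 12 time points, p = 5 internal variables, cost 0.2321.
assert round(avg_relative_error(0.2321, 12, 5), 4) == 0.0622
# BAC: m = 11 time points, p = 5 internal variables, cost 0.0761.
assert round(avg_relative_error(0.0761, 11, 5), 4) == 0.0372
```

Both reported error values are reproduced exactly by this relation.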
Figure 2. A comparison of the internal variable expression profiles in Table 1 and their calculated profiles from the model (3) for (a) CDC15 and (b) BAC. The solid lines correspond to the profiles in Table 1 and the dashed lines to the profiles calculated from the model (3).
Since an exponential or polynomial growth rate of gene expression is unlikely, gene expression systems are assumed to be stable systems [13]. This means that all eigenvalues of the state transition matrix A in model (3) should lie
inside the unit circle if model (3) describes a gene expression dynamic system. The five eigenvalues of the state transition matrix A for the CDC15 dataset are 0.4262 - 0.8488i, 0.4262 + 0.8488i, 0.5509, 0.7605 - 0.2950i, and 0.7605 + 0.2950i, all of which lie inside the unit circle. The five eigenvalues of the state transition matrix A for the BAC dataset are 1.0282, 0.6835 - 0.4997i, 0.6835 + 0.4997i, 0.3092 - 0.5769i, and 0.3092 + 0.5769i. All of these except the first lie inside the unit circle; however, the first eigenvalue is very close to 1. Since these two systems are (almost) stable, they are robust to system noise, for example, squared-summable noise. Therefore, these two models are sound models of gene expression dynamic systems.
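The stability check amounts to computing the moduli of the eigenvalues (in practice via np.linalg.eigvals(A)) and comparing them with 1; using the eigenvalues reported above:

```python
import numpy as np

# Eigenvalues of the state transition matrix A reported for each dataset.
cdc15 = np.array([0.4262 - 0.8488j, 0.4262 + 0.8488j, 0.5509,
                  0.7605 - 0.2950j, 0.7605 + 0.2950j])
bac = np.array([1.0282, 0.6835 - 0.4997j, 0.6835 + 0.4997j,
                0.3092 - 0.5769j, 0.3092 + 0.5769j])

# Stability of z(t + dt) = A z(t) requires all eigenvalues inside the unit circle.
assert np.abs(cdc15).max() < 1          # CDC15: stable
assert np.abs(bac[1:]).max() < 1        # BAC: all stable except the first eigenvalue
assert abs(np.abs(bac[0]) - 1) < 0.05   # ...which is only slightly outside
```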
4. Discussion
This paper proposes a method to model gene expression dynamics from measured time-course gene expression data. The model takes the form of the state-space description of linear systems. Two gene expression models for two previously published gene expression datasets were constructed to show how the method works. The results demonstrate that some features of the models are consistent with biological knowledge: for example, genes may be regulated by internal regulatory elements, and gene expression dynamic systems are stable and robust [29]. Compared to previous models, our model (3) has the following characteristics. First, gene expression profiles are the observation variables rather than the internal state variables. Second, from a biological angle, our model (3) can capture the fact that genes may be regulated by internal regulatory elements. Finally, although it contains two groups of equations (one group of difference equations and one of algebraic equations), the parameters in model (3) are identifiable from existing microarray gene expression data without any assumptions on the connectivity degrees of genes, and identifying them is computationally simple. The main shortcomings of this approach are: 1) its inherent linearity, which can capture only the primary linear components of a biological system that may be nonlinear; 2) its neglect of time delays in a biological system resulting, for example, from the time necessary for transcription, translation, and diffusion; 3) its failure to handle external inputs and noise. In future work, we will address these shortcomings, especially the last one. In addition, the present approach will be applied to more datasets, and the biological relevance of the internal variables will be demonstrated. This last goal requires closer collaboration with biologists.
We cannot expect to obtain, from existing gene expression data, perfect gene expression models that completely explain organismal or suborganismal behaviours at this time. On the other hand, models enforced by subjective assumptions may misinterpret organismal or suborganismal behaviours. Using the present methodology, one may sufficiently explore the data to construct sound models that reflect what the data can tell us. We believe that our method, along with the results of its application to two datasets, advances gene expression modelling from time-course gene expression datasets.
Acknowledgements
We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for partial financial support of this research. The first author thanks the University of Saskatchewan for funding him through a graduate scholarship award and Mrs. Mirka B. Pollak for funding him through The Dr. Victor A. Pollak and Mirka B. Pollak Scholarship(s).
References
1. Pease, A. C., et al. "Light-Generated Oligonucleotide Arrays for Rapid DNA Sequence Analysis" Proc. Natl. Acad. Sci. USA 91: 5022-5026, (1994).
2. Schena, M., et al. "Quantitative monitoring of gene expression patterns with a complementary DNA microarray" Science 270: 467-470, (1995).
3. Sherlock, G., et al. "The Stanford Microarray Database" Nucleic Acids Research 29: 152-155, (2001).
4. Everitt, B. S. and Dunn, G. "Applied Multivariate Data Analysis" New York: Oxford University Press, (1992).
5. Tavazoie, S., et al. "Systematic determination of genetic network architecture" Nature Genetics 22: 281-285, (1999).
6. Yeung, K. Y., et al. "Model-based clustering and data transformations for gene expression data" Bioinformatics 17: 977-987, (2001).
7. Ghosh, D. and Chinnaiyan, A. M. "Mixture modelling of gene expression data from microarray experiments" Bioinformatics 18: 275-286, (2002).
8. McLachlan, G. J., Bean, R. W., and Peel, D. A. "Mixture model-based approach to the clustering of microarray expression data" Bioinformatics 18: 413-422, (2002).
9. Holter, N. S., et al. "Dynamic modeling of gene expression data" Proc. Natl. Acad. Sci. USA 98: 1693-1698, (2001).
10. Somogyi, R. and Sniegoski, C. A. "Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation" Complexity 1: 45-63, (1996).
11. Liang, S., et al. "REVEAL, a general reverse engineering algorithm for inference of genetic network architectures" Pacific Symposium on Biocomputing 3: 18-29, (1998).
12. Akutsu, T., et al. "Identification of gene networks from a small number of gene expression patterns under the Boolean network model" Pacific Symposium on Biocomputing 4: 17-28, (1999).
13. Chen, T., He, H. L., and Church, G. M. "Modeling Gene Expression with Differential Equations" Pacific Symposium on Biocomputing 4: 29-40, (1999).
14. de Hoon, M. J. L., et al. "Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations" Pacific Symposium on Biocomputing 8: 17-28, (2003).
15. D'haeseleer, P., et al. "Linear Modeling of mRNA Expression Levels During CNS Development and Injury" Pacific Symposium on Biocomputing 4: 41-52, (1999).
16. Wu, F. X., et al. "Reverse engineering gene regulatory networks using the state-space description of microarray gene expression data" in preparation.
17. Baldi, P. and Hatfield, G. W. "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling" New York: Cambridge University Press, (2002).
18. Chen, C. T. "Linear System Theory and Design" 3rd edition, New York: Oxford University Press, (1999).
19. van Someren, E. P., Wessels, L. F. A., and Reinders, M. J. T. "Linear modeling of genetic networks from experimental data" In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), La Jolla, California, USA, (2000).
20. Alter, O., Brown, P. O., and Botstein, D. "Singular value decomposition for genome-wide expression data processing and modeling" Proc. Natl. Acad. Sci. USA 97: 10101-10106, (2000).
21. Lawley, D. N. and Maxwell, A. E. "Factor Analysis as a Statistical Method" 2nd edition, London: Butterworth, (1971).
22. Rubin, D. B. and Thayer, D. T. "EM algorithms for ML factor analysis" Psychometrika 47: 69-76, (1982).
23. Burnham, K. P. and Anderson, D. R. "Model selection and inference: a practical information-theoretic approach" New York: Springer, (1998).
24. Raftery, A. E. "Choosing models for cross-classification" American Sociological Review 51: 145-146, (1986).
25. Schwarz, G. "Estimating the dimension of a model" Annals of Statistics 6: 461-464, (1978).
26. Press, W. H., et al. "Numerical Recipes in C: The Art of Scientific Computing" 2nd edition, Cambridge, UK: Cambridge University Press, (1992).
27. Spellman, P. T., et al. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization" Mol. Biol. Cell 9: 3273-3297, (1998).
28. Laub, M. T., et al. "Global analysis of the genetic network controlling a bacterial cell cycle" Science 290: 2144-2148, (2000).
29. Hartwell, L. H., et al. "From molecular to modular cell biology" Nature 402: C47-52, (1999).