Life Sciences Society
COMPUTATIONAL SYSTEMS BIOINFORMATICS
Life Sciences Society
COMPUTATIONAL SYSTEMS
BIOINFORMATICS CSB2007 CONFERENCE PROCEEDINGS Volume 6
University of California San Diego, USA
13-17 August 2007
EDITORS
Peter Markstein, In Silico Labs, LLC, USA
Ying Xu, University of Georgia, USA
Imperial College Press
Published by
Imperial College Press, 57 Shelton Street, Covent Garden, London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd., 5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
COMPUTATIONAL SYSTEMS BIOINFORMATICS: Proceedings of the Conference CSB 2007 - Vol. 6. Copyright © 2007 by Imperial College Press. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-1-86094-872-5  ISBN-10 1-86094-872-3
Printed by Fulsland Offset Printing (S) Pte Ltd, Singapore
Life Sciences Society

Thank You, CSB2007 Gold Sponsor

The Life Sciences Society (LSS) Directors, together with the CSB2007 Program Committee and Conference Organizing Committee, are extremely grateful to the Hewlett-Packard Company for their Gold Sponsorship of the Sixth Annual Computational Systems Bioinformatics Conference, CSB2007, at the University of California San Diego, La Jolla, California, August 13-17, 2007.
PREFACE

The 21st century has seen the emergence of tremendous vigor and excitement at the interface between computing and biology. Following years of investment by industry and then private foundations, the US Federal Government has greatly increased its support. Increasingly, the experimental findings from all of the biological sciences are becoming data rich, and their practitioners are turning to computational methods to manage and analyze the data. In light of the growing opportunities and excitement at this frontier interface between computing and biology, a few scientists turned conference planners organized the first Computational Systems Bioinformatics (CSB) conference in 2001 at Stanford, CA; CSB continued each August at Stanford over the next five years. During this time, many computer scientists and other engineers, as well as biologists, have attended CSB meetings, which have particularly served to introduce cutting-edge biological inquiry and challenge problems to investigators from quantitative science backgrounds. CSB more recently became the public venue for the not-for-profit Life Sciences Society, or LSS, which was founded, in part, to enhance the opportunities at the interface between the quantitative / engineering sciences and the biological sciences. In 2006, LSS was honored to be invited to hold CSB2007 on the campus of the University of California at San Diego, UCSD. Future meetings at UCSD and, ultimately, at other universities, as well as satellite sessions at bioinformatics meetings around the world, are anticipated over the next several years. The Stanford / Bay Area / Silicon Valley venue, along with presenting highlights in bioinformatics, was especially valuable for connecting individuals from the computer and electronics industry with investigators in the pure and applied life sciences. The current venue should provide some connections to telecommunications, while sustaining some of the earlier opportunities, but we anticipate an enhanced interaction with basic and applied biotechnology.

The University of California at San Diego grew from the Scripps Institution of Oceanography and began as a graduate school with a strong focus on the natural sciences. The rich research culture around UCSD and many neighboring institutions is generally termed the venue of the Torrey Pines Mesa, an area very rich in biotechnology research activities. Today, San Diego has the largest cluster of life sciences centers, with 26 research institutes (including UCSD and a suite of institutes: Salk, Neurosciences, Scripps Research, Burnham Medical, as well as smaller not-for-profits) located in an area of less than 10 square miles. For more on San Diego's R&D life sciences centers and the vibrant biotechnology and pharmacology efforts, do visit the BioCom website: http://www.biocom.org/Portals/0/SanDiegoLifeScienceNumbers-Fall06.pdf.

CSB2007 will continue to be a 5-day, single-track conference, with three core plenary presentation days sandwiched between a day of practical tutorials, long a very popular feature, and a day of workshops exploring the future. Thus, CSB2007 includes several half-day tutorials, 30 refereed papers plus keynote and invited speakers, and posters, during its five full days. Special events for the evenings are planned.

CSB2007, as in each of its previous years, owes a lot to its many hard-working volunteers, who are listed under the Committees. The indefatigable energy of Vicky and Peter Markstein continues to sustain the magnitude and amplitude of the extraordinary science and technology vector that is CSB, and their partnership with Ying Xu also remains essential. The efforts to manage and enhance local arrangements, by Kayo Arima, Patrick Shih, Lydia Grech and Ed Buckingham, should also be acknowledged.

A few words, naturally, about SoCal: bring family and guests to enjoy San Diego's world-famous attractions such as SeaWorld, the San Diego Zoo, the Wild Animal Park and LEGOLAND California, as well as the historic cultural gems Balboa Park and Old Town, and, of course, the "endless" beach.
John Wooley, General Conference Chair
COMMITTEES
Steering Committee
Phil Bourne - University of California, San Diego
Eric Davidson - California Institute of Technology
Steven Salzberg - The Institute for Genomic Research
John Wooley - University of California San Diego, San Diego Supercomputer Center
Organizing Committee
Kayo Arima - University of California San Diego, Local Arrangements
Pat Blauvelt - Communications
Ed Buckingham - LSS VP Conferences
Kass Goldfein - Finance Consultant
Lydia Grech - University of California San Diego, Local Arrangements
Fenglou Mao - University of Georgia, On-Line Registration and Refereeing Website
Vicky Markstein - Life Sciences Society, Co-Chair, LSS President
Patrick Shih - University of California San Diego, Local Arrangements
Jean Tsukamoto - Graphics Design
Bill Wang - Sun Microsystems Inc, LSS Information Technology
John Wooley - University of California San Diego, San Diego Supercomputer Center, Co-Chair
Program Committee
Tatsuya Akutsu - Kyoto University
Phil Bourne - University of California San Diego
Jake Chen - Indiana University
Amar Das - Stanford University
Chris Ding - Lawrence Berkeley Laboratory
Roderic Guigo - IMIM, Barcelona
Tao Jiang - University of California Riverside
Lydia Kavraki - Rice University
Hoong-Chien Lee - National Central University, Taiwan
Ann Loraine - University of Alabama
Michele Markstein - Harvard University
Peter Markstein - Hewlett-Packard Co., Co-chair
Satoru Miyano - University of Tokyo
Sean Mooney - Indiana University
Jan Mrazek - University of Georgia
Isidore Rigoutsos - IBM TJ Watson Research Center
Andrey Rzhetsky - Columbia University
Hershel M. Safer - Weizmann Institute of Science
David States - University of Michigan
Anna Tramontano - University of Rome
Olga Troyanskaya - Princeton University
Alfonso Valencia - Centro Nacional de Biotecnologia, Spain
Eberhard Voit - Georgia Tech
Limsoon Wong - Institute for Infocomm Research
Ying Xu - University of Georgia, Co-chair
Aidong Zhang - SUNY Buffalo
Michael Zhang - Cold Spring Harbor Laboratory
Xianghong Jasmine Zhou - University of Southern California
Yaoqi Zhou - Indiana University
Assistants to the Program Co-Chairs
Ann Terka - University of Georgia
Joan Yantko - University of Georgia
Poster Committee
Nigam Shah - Stanford University, Chair
Patrick Shih - University of California San Diego
Tutorial Committee
Weizhong Li - University of California San Diego
Al Shpuntoff - Syngenta Biotechnology Institute, Chair
John Wooley - University of California San Diego
Workshop Committee
Iddo Friedberg - University of California San Diego, Co-Chair
Weizhong Li - University of California San Diego, Co-Chair
Patrick Shih - University of California San Diego
REFEREES
Tatsuya Akutsu, Mar Alba, Takis Benos, Phil Bourne, Liming Cai, Ildefonso Cases, Robert Castelo, Dongsheng Che, Jake Chen, Liang Chen, David Chew, Young-Rae Cho, I-Chun Chou, Xiangqin Cui, PhuongAn Dam, Amar Das, David de Juan, Chris Ding, Iakes Ezkurdia, Matteo Floris, Sylvain Foissac, David Gilley, Gautam Goel, Roderic Guigo, Scott Harrison, Nurit Haspel, Jianjun Hu, Woochang Hwang, Seiya Imoto, Tao Jiang, Yuki Kato, Lydia Kavraki, Melissa Kemp, Hoong-Chien Lee, Haiquan Li, Jing Li, Xiaoman Shawn Li, Guohui Lin, Chun-Chi Liu, Guimei Liu, Huiqing Liu, Yunlong Liu, Ann Loraine, Michia Ma, Fenglou Mao, Peter Markstein, David Martin, Satoru Miyano, Mark Moll, Jan Mrazek, Masao Nagasaki, Luay Nakhleh, Christoforos Nikolaou, Juan Nunez-Iglesias, Victor Olman, Miguel Padilla, Grier Page, Florencio Pazos, Daniel Platt, Zhen Qi, Predrag Radivojac, Isidore Rigoutsos, Andrey Rzhetsky, Hershel M. Safer, Sudipto Saha, David States, Wing-Kin Sung, Takeyuki Tamura, Anna Tramontano, Olga Troyanskaya, Aristotelis Tsirigos, Alfonso Valencia, Siren Veflingstad, Eberhard Voit, John Wagner, Mingyi Wang, Limsoon Wong, Hongwei Wu, Jialiang Wu, Min Xu, Ying Xu, Weiwei Yin, Kangyu Zhang, Michael Zhang, Shiju Zhang, Fengfeng Zhou, Ruhong Zhou, Wen Zhou, Xianghong Jasmine Zhou, Yaoqi Zhou
CONTENTS

Preface ..... vii
Committees ..... ix
Referees ..... xi

Keynote Address

Quantitative Aspects of Gene Regulation in Bacteria: Amplification, Threshold, and Combinatorial Control
Terry Hwa

Whole-Genome Analysis of Dorsal Gradient Thresholds in the Drosophila Embryo
Julia Zeitlinger, Rob Zinzen, Dmitri Papatsenko et al.

Invited Talks

Learning Predictive Models of Gene Regulation
Christina Leslie ..... 9

The PhyloFacts Phylogenomic Encyclopedias: Structural Phylogenomic Analysis Across the Tree of Life
Kimmen Sjolander ..... 11

Mapping and Analysis of the Human Interactome Network
Kavitha Venkatesan ..... 13

Gene-Centered Protein-DNA Interactome Mapping
A.J. Marian Walhout ..... 15

Proteomics

Algorithm for Peptide Sequencing by Tandem Mass Spectrometry Based on Better Preprocessing and Anti-Symmetric Computational Model
Kang Ning and Hon Wai Leong ..... 19

Algorithms for Selecting Breakpoint Locations to Optimize Diversity in Protein Engineering by Site-Directed Protein Recombination
Wei Zheng, Xiaoduan Ye, Alan M. Friedman and Chris Bailey-Kellogg ..... 31

An Algorithmic Approach to Automated High-Throughput Identification of Disulfide Connectivity in Proteins Using Tandem Mass Spectrometry
Timothy Lee, Rahul Singh, Ten-Yang Yen and Bruce Macher ..... 41

Biomedical Application

Cancer Molecular Pattern Discovery by Subspace Consensus Kernel Classification
Xiaoxu Han ..... 55

Efficient Algorithms for Genome-Wide tagSNP Selection Across Populations via the Linkage Disequilibrium Criterion
Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang ..... 67

Transcriptional Profiling of Definitive Endoderm Derived from Human Embryonic Stem Cells
Huiqing Liu, Stephen Dalton and Ying Xu ..... 79

Pathways, Networks and Systems Biology

Bayesian Integration of Biological Prior Knowledge into the Reconstruction of Gene Regulatory Networks with Bayesian Networks
Dirk Husmeier and Adriano V. Werhli ..... 85

Using Indirect Protein-Protein Interactions for Protein Complex Prediction
Hon Nian Chua, Kang Ning, Wing-Kin Sung et al. ..... 97

Finding Linear Motif Pairs from Protein Interaction Networks: A Probabilistic Approach
Henry C.M. Leung, M.H. Siu, S.M. Yiu et al. ..... 111

A Markov Model Based Analysis of Stochastic Biochemical Systems
Preetam Ghosh, Samik Ghosh, Kalyan Basu and Sajal K. Das ..... 121

An Information Theoretic Method for Reconstructing Local Regulatory Network Modules from Polymorphic Samples
Manjunatha Jagalur and David Kulp ..... 133

Using Directed Information to Build Biologically Relevant Influence Networks
Arvind Rao, Alfred O. Hero III, David J. States and James Douglas Engel ..... 145

Discovering Protein Complexes in Dense Reliable Neighborhoods of Protein Interaction Networks
Xiao-Li Li, Chuan-Sheng Foo and See-Kiong Ng ..... 157

Mining Molecular Contexts of Cancer via In-Silico Conditioning
Seungchan Kim, Ina Sen and Michael Bittner ..... 169

Genomics

Prediction of Transcription Start Sites Based on Feature Selection Using AMOSA
Xi Wang, Sanghamitra Bandyopadhyay, Zhenyu Xuan et al. ..... 183

Clustering of Main Orthologs for Multiple Genomes
Zheng Fu and Tao Jiang ..... 195

Deconvoluting the BAC-Gene Relationships Using a Physical Map
Yonghui Wu, Lan Liu, Timothy J. Close and Stefano Lonardi ..... 203

A Grammar Based Methodology for Structural Motif Finding in ncRNA Database Search
Daniel Quest, William Tapprich and Hesham Ali ..... 215

IEM: An Algorithm for Iterative Enhancement of Motifs Using Comparative Genomics Data
Erliang Zeng, Kalai Mathee and Giri Narasimhan ..... 227

MANGO: A New Approach to Multiple Sequence Alignment
Zefeng Zhang, Hao Lin and Ming Li ..... 237

Learning Position Weight Matrices from Sequence and Expression Data
Xin Chen, Lingqiong Guo, Zhaocheng Fan and Tao Jiang ..... 249

Structural Bioinformatics

Effective Labeling of Molecular Surface Points for Cavity Detection and Location of Putative Binding Sites
Mary Ellen Bock, Claudio Garutti and Concettina Guerra ..... 263

Extraction, Quantification and Visualization of Protein Pockets
Xiaoyu Zhang and Chandrajit Bajaj ..... 275

Uncovering the Structural Basis of Protein Interactions with Efficient Clustering of 3-D Interaction Interfaces
Zeyar Aung, Soon-Heng Tan, See-Kiong Ng and Kian-Lee Tan ..... 287

Enhanced Partial Order Curve Comparison Over Multiple Protein Folding Trajectories
Hong Sun, Hakan Ferhatosmanoglu, Motonori Ota and Yusu Wang ..... 299

fRMSDPred: Predicting Local fRMSD Between Structural Fragments Using Sequence Information
Huzefa Rangwala and George Karypis ..... 311

Consensus Contact Prediction by Linear Programming
Xin Gao, Dongbo Bu, Shuai Cheng Li et al. ..... 323

Improvement in Protein Sequence-Structure Alignment Using Insertion/Deletion Frequency Arrays
Kyle Ellrott, Jun-Tao Guo, Victor Olman and Ying Xu ..... 335

Composite Motifs Integrating Multiple Protein Structures Increase Sensitivity for Function Prediction
Brian Y. Chen, Drew H. Bryant, Amanda E. Cruess et al. ..... 343

Ontology, Database and Text Mining

An Active Visual Search Interface for Medline
Weijian Xuan, Manhong Dai, Barbara Mirel et al. ..... 359

Rule-Based Human Gene Normalization in Biomedical Text with Confidence Estimation
William W. Lau, Calvin A. Johnson and Kevin G. Becker ..... 371

CBioC: Beyond a Prototype for Collaborative Annotation of Molecular Interactions from the Literature
Chitta Baral, Graciela Gonzalez, Anthony Gitter et al. ..... 381

Biocomputing

Supercomputing with Toys: Harnessing the Power of NVIDIA 8800GTX and Playstation 3 for Bioinformatics Problems
Justin Wilson, Manhong Dai, Elvis Jakupovic et al. ..... 387

Exact and Heuristic Algorithms for Weighted Cluster Editing
Sven Rahmann, Tobias Wittkop, Jan Baumbach et al. ..... 391

Method for Effective Virtual Screening and Scaffold-Hopping in Chemical Compounds
Nikil Wale, George Karypis and Ian A. Watson ..... 403

Transcriptomics and Phylogeny

Improving the Design of GeneChip Arrays by Combining Placement and Embedding
Sergio Anibal de Carvalho Jr. and Sven Rahmann ..... 417

Modeling Species-Genes Data for Efficient Phylogenetic Inference
Wenyuan Li and Ying Liu ..... 429

Reconciliation with Non-Binary Species Trees
Benjamin Vernot, Maureen Stolzer, Aiton Goldman and Dannie Durand ..... 441

Author Index ..... 453
Computational Systems Bioinformatics 2007
Keynote Address
QUANTITATIVE ASPECTS OF GENE REGULATION IN BACTERIA: AMPLIFICATION, THRESHOLD, AND COMBINATORIAL CONTROL

Terry Hwa
Center for Theoretical Biological Physics and Department of Physics, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0374
Biological organisms possess an enormous repertoire of genetic responses to ever-changing combinations of cellular and environmental signals. Unlike digital electronic circuits, however, signal processing in cells is carried out by a limited number of asynchronous devices in fluctuating aqueous environments. In this talk, I will discuss the control of genetic responses in bacteria. Theoretical analysis of the known mechanisms of transcriptional control suggests "programmable" mechanisms for implementing a broad class of combinatorial control. Further analysis of post-transcriptional control suggests mechanisms for signal amplification, threshold response, and noise attenuation. I will present experimental characterization of some of these bio-computational "devices", as well as experiments illustrating how promoter sequences may be "trained" by directed evolution. Quantitative characterization and controlled manipulation of these devices may bring about predictive understanding of biological control systems, and reveal interesting, novel strategies of distributed computation.
WHOLE-GENOME ANALYSIS OF DORSAL GRADIENT THRESHOLDS IN THE DROSOPHILA EMBRYO

Julia Zeitlinger¹, Rob Zinzen², Dmitri Papatsenko², Rick Young¹, and Mike Levine²
¹Whitehead Institute, M.I.T., Cambridge, MA
²Dept. MCB, Center for Integrative Genomics, UC Berkeley, Berkeley, CA
Dorsal is a sequence-specific transcription factor related to NF-κB. The protein is distributed in a broad nuclear gradient in the precellular Drosophila embryo. This gradient controls dorsal-ventral patterning by regulating at least 50 target genes in a concentration-dependent manner. Dorsal works with two additional regulatory proteins, Twist and Snail, that are encoded by genes directly regulated by the gradient. To determine how the Dorsal gradient generates diverse thresholds of gene activity, we have used ChIP-chip assays with Dorsal, Twist, and Snail antibodies. This method efficiently identified 20 known enhancers and predicted another 30-50 novel enhancers associated with known or suspected dorsal-ventral patterning genes. At least one-third of the Dorsal target genes appear to contain "shadow" enhancers. These are additional cis-regulatory sequences with activities that overlap the principal enhancer guiding the expression of the associated gene. Shadow enhancers might arise from duplications of regulatory DNAs and could provide an important source for novel patterns of gene expression during evolution. The analysis of ~30 different Dorsal target enhancers suggests that those mediating gene expression in response to high levels of the Dorsal gradient contain a series of disordered low-affinity Dorsal and/or Twist activator binding sites. In contrast, enhancers mediating expression in response to low levels of the gradient (5% or less of the peak levels of the Dorsal protein) contain an ordered arrangement of optimal Dorsal and Twist binding sites. This organization is likely to foster cooperative occupancy of linked operator sites. We
discuss the importance of enhancer structure in mediating a sensitive threshold response to a morphogen gradient. Although there are many examples of gene regulation via elongation of stalled RNA polymerase II (Pol II), it is not known to what extent this mechanism is used to establish differential patterns of gene expression during Drosophila embryogenesis. To investigate this issue, we performed ChIP-chip assays using antibodies directed against Pol II. A specific mutant embryo was used, Toll10b, that contains high, uniform levels of the Dorsal, Twist, and Snail proteins. As a result, all of the cells form mesoderm derivatives. Ectodermal derivatives, such as the CNS and extraembryonic membranes, are completely absent. Previous whole-genome tiling arrays identified every gene that is active and inactive in Toll10b mutant embryos. Neurogenic genes that are activated by intermediate levels of the Dorsal gradient are repressed due to overexpression of the Snail repressor. Although silent, most of these genes contain a peak of Pol II binding at the 5' end of the transcription unit. In contrast, genes that are uniformly expressed in these embryos display distinct Pol II binding profiles (across the length of the transcription unit). It was possible to classify 75% of all protein-coding genes in the Drosophila genome into 3 categories based on Pol II binding profiles: uniform binding, no binding, or restricted binding near the start site. The ~3600 genes exhibiting a uniform Pol II binding profile encode housekeeping functions that are constitutively expressed throughout embryogenesis. The ~5,000 genes lacking
Pol II binding tend to be silent in the embryo, but expressed during larval and adult development. Finally, the ~1600 genes that exhibit 5' binding (i.e. stalling) tend to exhibit localized patterns of gene expression during embryogenesis and function as developmental control genes, such as Hox genes and components of the FGF, Wnt, Hedgehog, TGF-β, and Notch signaling pathways.
These observations suggest that the regulation of Pol II elongation is a major mechanism of differential gene activity in the Drosophila embryo. We discuss the use of Pol II stalling as a mechanism of transcriptional repression, and as a means of preparing developmental control genes for rapid and dynamic induction during embryogenesis.
Invited Talks
LEARNING PREDICTIVE MODELS OF GENE REGULATION

Christina Leslie
Memorial Sloan-Kettering Cancer Center, New York, NY
Studying the behavior of gene regulatory networks by learning from high-throughput genomic data has become one of the central problems in computational systems biology. Most work in this area focuses on learning structure from data -- e.g. finding clusters or modules of potentially co-regulated genes, or building a graph of putative regulatory "edges" between genes -- and generating qualitative hypotheses about regulatory networks. Instead of adopting the structure learning viewpoint, our focus is to build predictive models of gene regulation that allow us both to make accurate quantitative predictions on new or held-out experiments (test data) and to capture mechanistic information about transcriptional regulation. Our algorithm, called MEDUSA, integrates promoter sequence, mRNA expression, and transcription factor occupancy data to learn gene regulatory programs that predict the
differential expression of target genes. MEDUSA does not rely on clustering or correlation of expression profiles to infer regulatory relationships. Instead, the algorithm learns to predict up/down expression of target genes by identifying condition-specific regulators and discovering regulatory motifs that may mediate their regulation of targets. We use boosting, a technique from machine learning, to help avoid overfitting as the algorithm searches through the high dimensional space of potential regulators and sequence motifs. We will describe results of a recent gene expression study of hypoxia in yeast, in collaboration with the lab of Li Zhang. We used MEDUSA to propose the first global model of the oxygen and heme regulatory network, including new putative context-specific regulators. We then performed biochemical experiments to confirm that regulators identified by MEDUSA indeed play a causal role in oxygen regulation.
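To give a concrete flavor of the boosting step described above (this is a generic illustration, not the MEDUSA implementation; the features, data, and function names here are invented), the sketch below combines decision stumps over binary motif/regulator features with AdaBoost-style reweighting to predict up/down expression:

```python
import math

# Toy AdaBoost over decision stumps on binary features: a flavor of the
# boosting step, not the actual MEDUSA algorithm. Data are invented.
X = [{"motif_A": 1, "reg_up": 1}, {"motif_A": 1, "reg_up": 0},
     {"motif_A": 0, "reg_up": 1}, {"motif_A": 0, "reg_up": 0}]
y = [1, 1, -1, -1]  # up (+1) / down (-1) expression of a target gene

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        # Pick the stump (feature, sign) with the least weighted error.
        f, s, err = min(((f, s, sum(wi for wi, xi, yi in zip(w, X, y)
                                    if (s if xi[f] else -s) != yi))
                         for f in X[0] for s in (1, -1)),
                        key=lambda stump: stump[2])
        err = max(err, 1e-9)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((f, s, alpha))
        # Reweight: misclassified examples gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * (s if xi[f] else -s))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    vote = sum(alpha * (s if x[f] else -s) for f, s, alpha in model)
    return 1 if vote >= 0 else -1

model = adaboost(X, y)
print([predict(model, x) for x in X])  # recovers the up/down labels
```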
THE PHYLOFACTS PHYLOGENOMIC ENCYCLOPEDIAS: STRUCTURAL PHYLOGENOMIC ANALYSIS ACROSS THE TREE OF LIFE
Kimmen Sjolander
Berkeley Phylogenomics Group, University of California, Berkeley
http://phylogenomics.berkeley.edu
Protein families evolve a multiplicity of functions and structures through gene duplication, domain shuffling, speciation and other processes. Phylogenomic analysis, combining phylogenetic tree construction, integration of experimental data, and differentiation of orthologs and paralogs, has been shown to address the systematic errors associated with standard protocols of protein function prediction. The explicit integration of structure prediction and analysis in this framework, which we call structural phylogenomics, provides additional insights into protein superfamily evolution, and improves function prediction accuracy. The Berkeley Phylogenomics Group has developed the PhyloFacts Phylogenomic Encyclopedia for protein
families across the Tree of Life. At present (April 17, 2007), PhyloFacts contains over 27,000 “books” for protein families and domains and over 988,000 hidden Markov models (HMMs) enabling classification of proteins to functional families and subfamilies. Other functionality provided by PhyloFacts includes prediction of protein structure, active site residues, and cellular localization. In this talk, I will present new methods developed by my group for key tasks in a phylogenomic pipeline, including multiple sequence alignment, phylogenetic tree construction, subfamily identification and critical residue prediction.
MAPPING AND ANALYSIS OF THE HUMAN INTERACTOME NETWORK

Kavitha Venkatesan
Harvard University. Email: kavitha-venkatesan@dfci.harvard.edu
1. INTRODUCTION
We have mapped a first version of the human interactome network using a high-throughput Y2H (HT-Y2H) technology. Our data set, CCSB-HI1, is high in specificity and adds ~2700 new protein-protein interactions to existing interactome maps. CCSB-HI1 interactions are enriched for correlations with mRNA coexpression, presence of shared conserved cis-regulatory DNA motifs, shared phenotypes and shared function. A systematic quantitative examination of various existing human interactome maps shows that, contrary to existing notions, high-throughput Y2H maps are in fact higher in specificity than the composite information obtained from curating a literature comprising a large number of papers describing one or a few interactions at a time.
Furthermore, combined experimental and computational modeling of repeat trials of an HT-Y2H screen predicts the size of the Y2H-detectable human interactome and demonstrates the feasibility of mapping a nearly complete set of human interactions through multiple screens in a reasonable time frame. Novel candidate disease genes and associated hypotheses emerge from more than 300 interactions involving disease proteins in this data set. This existing interactome map can be used to begin to investigate how cellular networks are perturbed in disease. For example, from analysis of a draft interactome map of Epstein-Barr virus proteins with human proteins that we generated, we find that EBV proteins tend to target highly connected (hub) proteins in the human interactome, and moreover, proteins that are central in the network, having relatively short paths to other proteins in the network.
GENE-CENTERED PROTEIN-DNA INTERACTOME MAPPING

A.J. Marian Walhout
Program in Gene Function and Expression and Program in Molecular Medicine, UMass Medical School, Worcester, MA
Transcription regulatory networks play a pivotal role in the development, function and pathology of metazoan organisms. Such networks are comprised of protein-DNA interactions between transcription factors (TFs) and their target genes¹. We are interested in the architecture and functionality of such networks. We developed high-throughput gene-centered methods² for the identification of protein-DNA interactions between large sets of regulatory gene segments and various TF resources, including novel Steiner Triple System-based TF smart pools and a TF array³. So far, we have mapped two gene-centered networks using C. elegans gene promoters⁴,⁵. These networks have already provided insights into differential gene expression at a systems level. For instance, we found that most C. elegans genes are controlled by a layered hierarchy of TFs that sometimes function in a modular manner. Our data can be accessed in our database, EDGEdb⁶.
1. Walhout, A. J. M. Unraveling Transcription Regulatory Networks by Protein-DNA and Protein-Protein Interaction Mapping. Genome Res 16, 1445-1454 (2006).
2. Deplancke, B., Dupuy, D., Vidal, M. & Walhout, A. J. M. A Gateway-compatible yeast one-hybrid system. Genome Res 14, 2093-2101 (2004).
3. Vermeirssen, V. et al. A C. elegans transcription factor array and Steiner Triple System-based smart pools: high-performance tools for transcription regulatory network mapping. Nat Methods, in press (2007).
4. Deplancke, B. et al. A gene-centered C. elegans protein-DNA interaction network. Cell 125, 1193-1205 (2006).
5. Vermeirssen, V. et al. Transcription factor modularity in a gene-centered C. elegans core neuronal protein-DNA interaction network. Genome Res May 18 [Epub ahead of print] (2007).
6. Barrasa, M. I., Vaglio, P., Cavasino, F., Jacotot, L. & Walhout, A. J. M. EDGEdb: a transcription factor-DNA interaction database for the analysis of C. elegans differential gene expression. BMC Genomics 8, 21 (2007).
Proteomics
ALGORITHM FOR PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY BASED ON BETTER PREPROCESSING AND ANTI-SYMMETRIC COMPUTATIONAL MODEL

Kang Ning and Hon Wai Leong
Department of Computer Science, National University of Singapore
Block S15, 3 Science Drive 2, Singapore 117543
{ningkang, leonghw}@comp.nus.edu.sg

Peptide sequencing by tandem mass spectrometry is a very important, interesting, yet challenging problem in proteomics. The problem has been extensively investigated in recent years, and peptide sequencing results are becoming more and more accurate. However, many of these algorithms use computational models based on unverified assumptions. We believe that investigating the validity of these assumptions and related problems will lead to improvements in current algorithms. In this paper, we first investigate peptide sequencing without preprocessing the spectrum, and we show that by introducing preprocessing, peptide sequencing can be faster, easier and more accurate. We then investigate one very important issue, the anti-symmetric problem, and we show by experiments that models that simply ignore the anti-symmetric problem, or that remove all anti-symmetric instances, are too simple for the peptide sequencing problem. We propose a new, more realistic model for the anti-symmetric problem. We also propose a novel algorithm that incorporates the preprocessing model and the new anti-symmetric model, and experiments show that this algorithm has better performance on the datasets examined.
1. INTRODUCTION
Peptide sequencing by mass spectrometry (referred to simply as "peptide sequencing" in what follows) is the process of interpreting the peptide sequence from a mass spectrum. Peptide sequencing is an important problem in proteomics. Currently, though high-throughput mass spectrometers have generated huge amounts of spectra, peptide sequencing from these spectrum data is still slow and not accurate. Algorithms for peptide sequencing can be categorized into database search algorithms [1-3] and de novo algorithms [4-6]. Database search algorithms are suitable for known sequences already existing in the database; however, they do not perform well on novel sequences not available in a database. For such peptide sequences, de novo algorithms are the methods of choice. De novo algorithms interpret peptide sequences from spectrum data purely by analyzing the intensity and correlation of the peaks in the spectrum. Though extensive research in de novo peptide sequencing has helped improve accuracies, there are still many obstacles for both de novo and database search approaches, which make further improvement of peptide sequencing accuracy difficult. Among these obstacles, preprocessing to remove the noise from a spectrum before peptide sequencing, as well as the anti-symmetric problem, are two of the most important issues.
Preprocessing to remove noisy peaks
A peak in a spectrum is noisy if it does not correspond to a peptide fragment but to a contaminant introduced by the mass spectrometer, the experimental environment, etc. Since most spectra contain a significant amount of noise, and noisy peaks may mislead interpretation, preprocessing to remove noisy peaks from the spectrum is necessary.
The anti-symmetric problem
A peak p_i is anti-symmetric if there can be different fragment ion interpretations for p_i; otherwise, p_i is symmetric. A spectrum S has an anti-symmetric problem if S has a peak p_i that is anti-symmetric. For the spectrum graph G [4] used to represent a spectrum, a path in G is called anti-symmetric if no two vertices (fragment ion interpretations) on the path represent the same peak; otherwise, we say that the path has the anti-symmetric problem. The anti-symmetric problem is common in peptide sequencing. Currently there are generally two approaches to it. One approach is to ignore the anti-symmetric problem [6]; the other is to apply the "strict" anti-symmetric rule, which requires each peak to be represented by at most one vertex (fragment ion interpretation) on a path in the spectrum graph G [4, 7, 8]. The "strict" anti-symmetric rule is used in many peptide sequencing algorithms, but whether applying this rule is realistic is doubtful. In this paper, we present a computational model to remove noise peaks from the spectrum. This model also includes a method for introducing "pseudo peaks"
into the spectrum to improve peptide sequencing accuracies. We also propose the restricted anti-symmetric model for the anti-symmetric problem, and then a novel peptide sequencing algorithm that incorporates these two computational models.
2. ANALYSIS OF PROBLEMS AND CURRENT ALGORITHMS
In this section, we analyze the presence of noise in the spectrum, as well as the difference between algorithms that use preprocessing and those that do not. We also investigate how significant the anti-symmetric problem is in peptide sequencing by mass spectrometry, and how current algorithms cope with this problem.
2.1. General Terminologies
We first define some general terms. Through a mass spectrometer or tandem mass spectrometer, a peptide P = (a_1 a_2 ... a_n), where each of a_1, ..., a_n is one of the amino acids, is fragmented into a spectrum S with maximum charge α. The parent mass of the peptide P is given by M = m(P) = Σ_{i=1..n} m(a_i). Consider a peptide prefix fragment p_k = (a_1 a_2 ... a_k), for k ≤ n; the prefix mass is defined as m(p_k) = Σ_{i=1..k} m(a_i). Suffix masses are defined similarly. We always express a fragment mass in an experimental spectrum using the PRM (prefix residue mass) representation, which is the mass of the prefix fragment. Mathematically, for a fragment q with mass m(q), we define PRM(q) = m(q) if q is a prefix fragment (such as a b-ion), and PRM(q) = M - m(q) if q is a suffix fragment (such as a y-ion). A spectrum S is composed of many peaks {p_1, p_2, ..., p_m}. Each peak p_i is represented by its intensity, intensity(p_i), and its mass-to-charge ratio, mz(p_i). If peak p_i is not a noisy peak, it represents a fragment ion of P. Each peak p_i can be characterized by an ion type, specified by (t, h, z) ∈ (Δ_t × Δ_h × Δ_z), where Δ_z is the set of charges of the ions, Δ_t is the set of basic ion types, and Δ_h is the set of neutral losses incurred on the ion. In this paper, we restrict our attention to the set of ion types Δ_R = (Δ_t × Δ_h × Δ_z), where Δ_z = {1, 2, ..., α}, Δ_t = {a-ion, b-ion, y-ion} and Δ_h = {∅, -H2O, -NH3}. Suppose the (t, h, z)-ion of a fragment q (prefix or suffix fragment) produces an observed peak p_i in the experimental spectrum S with a mass-to-charge ratio of mz(p_i); then m(q) can be computed from mz(p_i) using a shifting function, Shift:

m(q) = Shift(mz(p_i), (t, h, z))
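As an illustration of the PRM and Shift bookkeeping just defined, a minimal sketch follows; the residue-mass table is abbreviated, and the ion-offset constants are standard monoisotopic values assumed here rather than taken from the paper:

```python
# Hypothetical sketch of the PRM / Shift bookkeeping defined above.
# Residue masses (monoisotopic, Da); table abbreviated for illustration.
AMINO_MASS = {"A": 71.03711, "G": 57.02146, "S": 87.03203, "V": 99.06841}
PROTON, WATER = 1.00728, 18.01056  # standard constants, not from the paper

def parent_mass(peptide):
    """Parent mass M = sum of the residue masses of the peptide."""
    return sum(AMINO_MASS[a] for a in peptide)

def shift(mz, t, h, z):
    """Recover the fragment mass m(q) from an observed peak at m/z.

    Assumes the usual relation mz = (m(q) + offset(t) - h + z*PROTON) / z,
    where offset is 0 for b-ions and WATER for y-ions, and h is the
    neutral-loss mass (e.g. 18.01056 for -H2O, 17.02655 for -NH3)."""
    offset = {"b": 0.0, "y": WATER}[t] - h
    return z * mz - z * PROTON - offset

def prm(fragment_mass, is_prefix, M):
    """PRM(q) = m(q) for a prefix fragment, M - m(q) for a suffix."""
    return fragment_mass if is_prefix else M - fragment_mass

# Example: the singly charged y-ion of the suffix "V" of peptide GASV,
# observed at m/z 118.086, maps back to the PRM of the prefix "GAS".
M = parent_mass("GASV")
mq = shift(118.086, "y", 0.0, 1)
print(round(prm(mq, False, M), 3))  # ~215.091, the mass of prefix GAS
```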
In the spectrum graph built from these ion interpretations, an edge connects two vertices if their mass difference is equal to the mass of one amino acid. We also define the theoretical spectrum TS_α(P) that completely characterizes the set of all possible peaks for a peptide, assuming that the ions can take charges 1, 2, ..., α. Note that by comparing the theoretical spectrum with the experimental spectrum, theoretical upper bounds for different measurements of peptide sequencing results can be calculated [9]. Another useful measure is the shared peaks count (SPC): the SPC between an experimental spectrum S and a peptide P is defined as the number of peaks in S that have the same mass-to-charge ratio (mz) as peaks in TS(P), the theoretical spectrum of P.
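A small sketch of the two constructions above: spectrum-graph edges between PRM vertices separated by one residue mass, and the shared peaks count. The tolerance and the toy mass table are illustrative assumptions:

```python
# Illustrative sketch: spectrum-graph edges and the shared peaks count.
AMINO_MASS = {"A": 71.03711, "G": 57.02146, "S": 87.03203, "V": 99.06841}
TOL = 0.5  # assumed mass tolerance (Da)

def graph_edges(prms):
    """Connect two PRM vertices if their difference matches one residue."""
    edges = []
    for i, u in enumerate(prms):
        for v in prms[i + 1:]:
            for aa, mass in AMINO_MASS.items():
                if abs(abs(v - u) - mass) <= TOL:
                    edges.append((u, v, aa))
    return edges

def spc(experimental_mz, theoretical_mz, tol=TOL):
    """Shared peaks count: experimental peaks matching a theoretical one."""
    return sum(any(abs(e - t) <= tol for t in theoretical_mz)
               for e in experimental_mz)

print(graph_edges([0.0, 57.021, 128.058, 215.090]))  # path G -> A -> S
print(spc([57.0, 128.1, 300.2], [57.021, 128.058, 215.090]))  # 2 shared
```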
2.2. Datasets
All of the experiments in this paper use spectra of different charges selected from (a) the Amethyst data set from the Global Proteome Machine (GPM) [10] and (b) the data set from the Institute for Systems Biology (ISB) [11]. The GPM dataset consists of MS/MS spectra obtained from QSTAR, from both MALDI and ESI sources. The ISB dataset was generated using an ESI source from a mixture of 18 proteins, obtained from an Ion-Trap, and consists of spectra of up to charge 3. In contrast to the GPM datasets, the ISB datasets are of low quality. We have selected spectra with corresponding peptide sequences validated by an Xcorr score > 2.5. Table 1 lists the number of spectra and the number of peaks per spectrum for the different charges of GPM and ISB spectra.
Table 1. The number of spectra and the number of peaks per spectrum, based on the GPM and ISB datasets of different charges. (The body of this table did not survive extraction; the recoverable entries are spectrum totals of 2328 and 995, and peaks-per-spectrum values of 42.6 and 46.5 for GPM and 230.7 and 226.0 for ISB.)
Each GPM spectrum has between 20-50 peaks (usually high quality peaks) and an average of about 40 peaks. In contrast, each ISB spectrum has between 50-300 peaks and an average of 150 peaks. Moreover, for the corresponding peptide sequences, GPM
sequences have average lengths of 14.5 amino acids, and ISB sequences have average length of 15.0.
2.3. Problem Analysis
Since binning is the general prerequisite for spectrum data preprocessing, in this section we first analyze methods for binning the peaks in a spectrum, then discuss preprocessing that removes noisy peaks from, and introduces "pseudo peaks" into, the spectrum, and finally analyze the anti-symmetric problem.

Binning of peaks in spectrum
Binning discretizes the mass-to-charge ratios of the peaks into a series of bins of equal size, so that each bin contains at most a single peak. The binning idea is already embedded in [12, 13] for mass spectrum alignment: there, the peaks of the spectrum are packed into many bins of the same size, and the spectrum is transformed into a sequence of 0s and 1s. Recently, a database search algorithm, COMET [14], was proposed, which uses bins (usually of size 1 Da) together with their correlations and statistical analysis (Z-scores) for accurate peptide sequencing by database search (spectrum comparison). The important parameters in binning include the size of the bins, the number of supporting peaks, and the intensities of the peaks. Lemma 1 (whose formal statement was lost in extraction) shows that connected peaks remain connected after binning if we adjust the mass tolerance properly.
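A minimal sketch of the binning step, using the 0.25 Da bin size and 0.5 Da tolerance adopted later in Section 3.1; the rule for merging intensities within a bin is an assumption, since the paper does not specify it here:

```python
# Sketch of peak binning: discretize m/z into fixed-width bins (0.25 Da,
# half of the 0.5 Da mass tolerance used in Section 3.1).
BIN_SIZE = 0.25

def bin_peaks(peaks, bin_size=BIN_SIZE):
    """peaks: list of (mz, intensity) pairs.

    Returns {bin_index: intensity}; summing intensities of peaks that
    land in the same bin is an assumed merging rule."""
    bins = {}
    for mz, intensity in peaks:
        idx = int(mz / bin_size)
        bins[idx] = bins.get(idx, 0.0) + intensity
    return bins

peaks = [(117.105, 191.0), (117.180, 12.0), (205.103, 20.0)]
print(bin_peaks(peaks))  # the first two peaks share one 0.25 Da bin
```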
Table 2. The average contents of different types of peaks in GPM and ISB spectra. Symmetric peaks are counted only once in the total content measures. (Most of the table body was lost in extraction; recoverable fragments include rows for ion types such as {b-ion, 0, 1}, {b-ion, 0, 2}, {b-ion, -H2O, 1} and {y-ion, -NH3, 1}, and Noises/Total rows of 26.0/32.2 peaks for GPM and 189.1 noise peaks for ISB, each with a total content of 1.00.)
Therefore, it is clear that, given a proper value of the tolerance, binning preserves accuracy. The binning method makes the removal of noise easier, and also makes sequencing faster and potentially more accurate, especially for noisy spectra.

Preprocessing to remove noisy peaks and introduce pseudo peaks
Noisy peaks exist in every spectrum, but distinguishing them from "true" peaks is not easy. The first step is to analyze the spectrum data and find the patterns of noisy peaks. To this end, we have analyzed the most abundant ion types: {b-ion, 0, 1}, {b-ion, 0, 2}, {b-ion, -H2O, 1}, {b-ion, -NH3, 1}, {y-ion, 0, 1}, {y-ion, 0, 2}, {y-ion, -H2O, 1} and {y-ion, -NH3, 1}, and we assume that peaks not of these ion types are noise. The analysis was done on the binned GPM and ISB datasets. The experimental spectrum and the theoretical spectrum for the corresponding sequence are compared, and peaks in the experimental spectrum that can be matched with certain ion types are counted. The "content of peaks" for a specific ion type is defined as the ratio of the number of peaks (in the experimental spectrum) of that ion type over the total number of peaks in the experimental spectrum. The numbers of peaks and the contents of peaks of different ion types were analyzed, with average results in Table 2.
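The "content of peaks" computation just defined reduces to matching experimental peaks against theoretical masses per ion type; a hedged sketch, with the matching tolerance assumed:

```python
# Sketch: per-ion-type "content of peaks" = matched peaks / total peaks.
TOL = 0.5  # assumed matching tolerance (Da)

def peak_contents(exp_mz, theo):
    """exp_mz: experimental m/z values; theo: {ion_type: [m/z, ...]}.

    Returns {ion_type: content}; peaks matching no ion type are noise."""
    n = len(exp_mz)
    matched = set()
    contents = {}
    for ion_type, masses in theo.items():
        hits = [i for i, mz in enumerate(exp_mz)
                if any(abs(mz - t) <= TOL for t in masses)]
        contents[ion_type] = len(hits) / n
        matched.update(hits)
    contents["noise"] = (n - len(matched)) / n
    return contents

theo = {("b", 0, 1): [117.10, 205.10], ("y", 0, 1): [231.12]}
print(peak_contents([117.105, 231.125, 302.2, 450.0, 555.5], theo))
```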
From Table 2, we can see that noisy peaks comprise a significant portion of the peaks in the experimental spectrum. For the GPM datasets, 80% of the peaks are noisy, and the most abundant ion types, the b- and y-ion types, compose only 6% and 5% of the peaks. For the ISB datasets, 83% of the peaks are noisy, and the b- and y-ion types compose only 4% and 5% of the peaks. ISB spectra have more noisy peaks, and peptide sequencing for these spectra is more difficult. Further analysis of the noisy peaks indicates that there are more noisy peaks in the middle part (by mass-to-charge ratio) of the spectrum than at the two ends. Also, most of the noisy peaks have some features in common, such as low intensity and little support from other ions (b-, y-, loss of water or ammonia, for example). Some well-known algorithms, such as Lutefisk [6], have no such preprocessing to remove noise. PEAKS [15] and PepNovo [5] are two well-known algorithms that have implemented preprocessing. In PEAKS, the noise level of the spectrum is estimated, and the intensities of all peaks in the spectrum are reduced by this noise level; all peaks with zero or negative intensities are then removed. In PepNovo, preprocessing is applied to remove or downgrade peaks that have low intensity and do not appear to be b- or y-ions. Recently, the AUDENS algorithm was proposed [16]; it has a flexible preprocessing module that screens the peaks in the spectrum and distinguishes between signal and noise peaks. Previous preprocessing for peptide sequencing by mass spectrometry only considered how to remove noisy peaks. However, some fragment ions are not represented by any of the peaks. Appropriate introduction of "pseudo peaks" into the spectrum may help
in the interpretation of these fragment ions and increase sequencing accuracy. The idea of "pseudo peaks" was first described in PEAKS [15]. PEAKS assumes that peaks exist at every place in the spectrum, and that those not present in the actual spectrum are peaks with 0 intensity. It has been shown that appropriate introduction of "pseudo peaks" can partially solve the problem of missing edges in the spectrum graph approach [15]. Our preprocessing computational model covers both the removal of noisy peaks from, and the introduction of pseudo peaks into, the spectrum. Notice that though the process is similar to previous work, the computational model is different.
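A small sketch of the pseudo-peak idea as it is used in Section 3.1: every empty bin receives a placeholder peak at one tenth of the weakest real intensity (the paper's empirically determined factor); the bin layout here is assumed:

```python
# Sketch: introduce "pseudo peaks" into every empty bin of the spectrum.
def add_pseudo_peaks(bins, n_bins):
    """bins: {bin_index: intensity} for occupied bins; n_bins: total bins.

    Each empty bin receives 1/10 of the lowest real intensity (the
    empirically determined factor used in Section 3.1)."""
    floor = min(bins.values()) / 10.0
    return {i: bins.get(i, floor) for i in range(n_bins)}

filled = add_pseudo_peaks({2: 40.0, 5: 12.0}, n_bins=8)
print(filled)  # bins 0,1,3,4,6,7 now hold pseudo peaks of intensity 1.2
```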
The anti-symmetric problem
We have mentioned that there are two approaches to the anti-symmetric problem: 1) ignore it, and 2) apply the "strict" anti-symmetric rule. In the following, we show that since both approaches are based on unverified assumptions, they do not reflect the nature of real spectra. First we give part of a real spectrum from the GPM datasets (Fig 1). Note that peak no. 1 has multiple annotations. If we just ignore this peak, then there are two peptide fragments that we cannot interpret (AGFAGDDA and AGFAGDDAPRAVFPS), while the peptide itself has 21 amino acids. Therefore, we see that the simple model that applies the strict anti-symmetric rule may miss some interpretations of peptide fragments.
Fig 1. Example of a real spectrum (left) with its corresponding peptide, AGFAGDDAPRAVFPSIVGRPR (right). The ion types are represented by (t, h, z) ∈ (Δ_t × Δ_h × Δ_z), as defined above; the bracket after each peptide fragment gives the corresponding peak number. (The peak list itself, with its m/z and intensity columns, did not survive extraction.)
To analyze the significance of the anti-symmetric problem in peptide sequencing, we have generated the theoretical spectra of known peptide sequences. We analyzed the most abundant ion types {b-ion, 0, 1}, {b-ion, 0, 2}, {b-ion, -H2O, 1}, {b-ion, -NH3, 1}, {y-ion, 0, 1}, {y-ion, 0, 2}, {y-ion, -H2O, 1} and {y-ion, -NH3, 1}, and assume there is no noise. Two peaks are said to overlap if their mass difference is within a threshold (default 0.25 Da). Note that each such overlapping peak is equivalent to a symmetric peak. Results on the selected GPM and ISB spectrum datasets are shown in Table 3. The "average numbers" are the average number of symmetric peaks in the theoretical spectrum of one peptide sequence, and the "average ratios" are computed as the "average numbers" over the average number of peaks in the theoretical spectrum. It is obvious that instances of overlap (within the threshold, 0.25 Da) are quite common. For the overlaps of b- and y-ions in the GPM datasets, there is one overlap instance in about 5 peptide sequences, or in about 67 amino acids. Overall overlap instances are even more common: one instance in about 0.36 sequences, or about 5 amino acids. The ISB datasets have somewhat fewer overlaps, but overall there is still more than one instance in 0.35 sequences, or in 4 amino acids. Note that we have not considered peaks with higher charges (≥3), but previous research [9] found a significant number of higher-charge (≥3) peaks in high-charge spectra. Naturally, the number of overlapping instances will increase when we have
considered high-charge peaks and more ion types. Therefore, the "strict" anti-symmetric rule is not realistic.

Table 3. The average numbers and ratios of overlapping instances for different kinds of overlaps, on all of the GPM and ISB data. (The table body was garbled in extraction; the recoverable rows pair ion types such as {b-ion, 0, 1} ↔ {y-ion, 0, 1}, {b-ion, 0, 1} ↔ {y-ion, 0, 2}, {b-ion, 0, 1} ↔ {y-ion, -H2O, 1}, and so on, with surviving average ratios including 0.154, 0.012, 0.173, 0.011, 0.152, 0.128 and 0.0001.)

Experiments were also performed with random introduction of noise into the theoretical spectrum. Results (details not shown) indicate a significant increase in the number of overlap instances. Therefore, ignoring the anti-symmetric problem is also not realistic, especially for noisy spectra. In Lutefisk [6], the anti-symmetric problem is assumed not to exist, and a peak can be annotated as different ion types. In the Sherenga algorithm [4], only one ion type is possible for each peak, but the exact algorithm that solves the anti-symmetric problem is not described. A dynamic programming algorithm for solving the anti-symmetric problem is described in [7, 8], and a suboptimal algorithm that gives suboptimal results for the anti-symmetric problem is shown in [17]. Since our experiments have shown that neither of the two approaches (assumptions) to the anti-symmetric problem is realistic, the simple models based on these assumptions may be obstacles to improving current algorithms. We have therefore proposed a more realistic computational model for the anti-symmetric problem.

3. NEW COMPUTATIONAL MODELS AND ALGORITHM

We propose a new algorithm that is based on two new computational models: 1) preprocessing that removes noisy peaks from, and introduces pseudo peaks into, the spectrum; and 2) a new anti-symmetric model that is more flexible and realistic about the anti-symmetric problem.

3.1. Preprocessing to remove noisy peaks and introduce pseudo peaks
First, the binning process is applied to the peaks in the spectrum. The masses of amino acids differ by at least 1.0 Da (except for (I, L) and (Q, K), which cannot be distinguished by any de novo peptide sequencing algorithm without isotope information). We thus set the mass tolerance m_t to 0.5 Da and the bin size m_bin to 0.25 Da (according to Lemma 1). With binning, later processing can be just as accurate (Lemma 1 shows that there is no loss of accuracy) as well as more efficient, because fewer peaks are considered. After binning, pseudo peaks are introduced into every empty bin, each with 1/10 of the lowest intensity in the original spectrum (an empirically determined factor). After binning the peaks and introducing pseudo peaks, support scores are computed for every bin (peak). Here, we transform each bin (peak) into vertices (ion interpretations) in the extended spectrum graph G_α(S*_P), and then score each vertex. Define Nsupport(v_i) as the number of vertices v_j (v_j ≠ v_i) with PRM(v_j) = PRM(v_i). Define the intensity function as f_intensity(v_i) = max(0.01, log10(intensity(v_i))), where log10(intensity(v_i)) is normalized so that f_intensity(v_i) cannot be less than 0. Let L be the total number of incoming and outgoing edges of v_i, and let a_j be the amino acid on the edge (v_j, v_i) (or (v_i, v_j)). Then Σ_j | |PRM(v_j) - PRM(v_i)| - mass(a_j) | / L is the average mass error for v_i. To avoid a divide-by-zero error in the weight function, we define the error function as f_error(v_i) = max(0.05, Σ_j | |PRM(v_j) - PRM(v_i)| - mass(a_j) | / L); this definition ensures that f_error(v_i) is larger than 0.05, a reasonably small error value. Then the score of a vertex v_i in G_α(S*_P) is defined as

score(v_i) = Nsupport(v_i) · f_intensity(v_i) / f_error(v_i)    (9)
For each bin, the support score is computed and ranked. Some actual peaks that are highly likely to be noise are deleted, and some pseudo peaks that are highly likely to represent ion types are kept. Using this method, we can 1) prune noise from the spectrum and 2) introduce meaningful peaks into it, so that we can build a better spectrum graph to process. Based on an analysis of the scores of peaks in the spectrum (details not shown here), the lowest 20% of bins in the score ranking, or bins with scores less than 1% of the highest, are filtered out.
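A sketch of the vertex score of Eq. (9), combining the support count, the floored log-intensity term, and the floored average-edge-error term; the multiplicative form is inferred from the surrounding definitions rather than stated explicitly in the recovered text:

```python
import math

# Sketch of the vertex score of Eq. (9): support * intensity / error.
def vertex_score(n_support, intensity, edge_mass_errors):
    """n_support: number of other vertices sharing this PRM; intensity:
    normalized peak intensity; edge_mass_errors: |mass error| of each
    incident edge. The 0.01 and 0.05 floors follow the text above."""
    f_intensity = max(0.01, math.log10(intensity))
    f_error = max(0.05, sum(edge_mass_errors) / len(edge_mass_errors))
    return n_support * f_intensity / f_error

print(round(vertex_score(2, 150.0, [0.02, 0.10]), 2))
```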
3.2. The Anti-Symmetric Problem
Since a significant fraction of the peaks in a spectrum can be (correctly) annotated as different ion types, the anti-symmetric rule should not be followed strictly; otherwise, information is lost. However, since quite a few noisy peaks remain after preprocessing, peptide sequencing that ignores the anti-symmetric problem may be misled by noisy peaks, and is thus not preferable either. Therefore, a more flexible and less strict anti-symmetric rule should be applied to the spectrum. We propose the restricted anti-symmetric model, in which a restricted number (r) of peaks can have different ion types. It is easy to observe that the two current approaches to the anti-symmetric problem are special cases of this model: the approach that ignores the anti-symmetric problem corresponds to an unbounded r, and the approach that applies the "strict" anti-symmetric rule corresponds to r = 0. The restricted anti-symmetric model is based on the extended spectrum graph G_α(S*_P) model using multi-charge strong tags [18]. Multi-charge strong tags are highly reliable tags in the spectrum graph G_α(S*_P). A multi-charge strong tag of ion type (z*, t, h) is a maximal path (v_0, v_1, v_2, ..., v_k) in G_1(S*_P, {Δ_R}), where every vertex v_i is a (z*, t, h)-ion, in which t and h are the same for all vertices, and z* can take different values from {1, ..., α}. The principle of the restricted anti-symmetric model is that if a multi-charge strong tag T_i in G_α(S*_P) has a high score, and on this tag the number (r) of overlapping instances (an instance being two vertices of different ion types for the same peak) is within a certain tolerance (half of the length of the tag), then
T_i is a good tag in G_α(S*_P), and it is selected for subsequent processing. It is easy to see that the preprocessing and restricted anti-symmetric models can be applied to any de novo peptide sequencing algorithm to improve its accuracy (details in the experiments). Below we describe our novel algorithm based on these two models.
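A hedged sketch of the restricted anti-symmetric check on a tag: count the peaks that appear on the tag under more than one ion interpretation, and accept the tag only if that count r stays within half the tag length:

```python
# Sketch: restricted anti-symmetric check on a multi-charge strong tag.
def is_good_tag(tag):
    """tag: list of (peak_id, ion_type) vertices along the tag path.

    r = number of peaks interpreted as more than one ion type on the tag;
    the tag is kept if r is at most half the tag length."""
    interpretations = {}
    for peak_id, ion_type in tag:
        interpretations.setdefault(peak_id, set()).add(ion_type)
    r = sum(1 for types in interpretations.values() if len(types) > 1)
    return r <= len(tag) / 2

tag = [(1, "b"), (2, "y"), (1, "y"), (3, "b")]  # peak 1 interpreted twice
print(is_good_tag(tag))  # r = 1 <= 2, so the tag is accepted
```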
3.3. Novel Peptide Sequencing Algorithm
Our novel algorithm (GST-SPC*) is based on the previously proposed GST-SPC algorithm [18], which has good performance. The GST-SPC algorithm has two phases. In the first phase, it computes a set of tags, the set of all multi-charge strong tags (corresponding to tags of maximal length in the extended spectrum graph), which leads to an improvement in the attainable sensitivity. In the second phase, it links these tags and computes a peptide sequence that is optimal with respect to the shared peaks count (SPC) among all sequences derived from the tags. GST-SPC performs comparably to or better than other de novo sequencing algorithms (Lutefisk and PepNovo), especially on multi-charge spectra. In the GST-SPC* algorithm, before peptide sequencing, all peaks of the spectrum are binned, with each bin covering the mass range m_bin (0.25 Da). Pseudo peaks are introduced into every empty bin. Bins (transformed into vertices in the extended spectrum graph) that have very low scores or low support rank are filtered out: based on the analysis of the peaks in the spectrum, the lowest 20% of bins, as well as bins with support scores less than 5% of the highest, are filtered out. In the GST-SPC algorithm, we note that all tags can have their SPC computed before deriving paths in the spectrum graph. So in the GST-SPC* algorithm, after tags are generated in the extended spectrum graph G_α(S*_P), we filter out the tags that violate the "restricted anti-symmetric rule"; for the restricted anti-symmetric model on tags, we restrict r to be at most half the length of the tag. We then compute the SPC for the remaining "good" tags. A variant of breadth-first search is then applied to G_α(S*_P) to find paths from v_0 to v_M, such that these paths have high SPC and are consistent with the restricted anti-symmetric model. Since
the number of tags is small, the algorithm is efficient. A flowchart of the whole algorithm is illustrated in Fig 2.
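As a toy illustration of the final search phase (the paper uses a breadth-first variant over the tag graph; this sketch uses a simple depth-first enumeration over an invented miniature DAG, keeping the v0-to-vM path with the highest SPC credit):

```python
# Toy sketch of the final search phase over an invented miniature DAG of
# tag-derived vertices: keep the v0 -> vM path with the highest SPC credit.
# (Depth-first enumeration for brevity; the paper uses a breadth-first
# variant on the extended spectrum graph.)
def best_path(edges, spc_credit, v0, vM):
    """edges: {u: [v, ...]} adjacency of a DAG; spc_credit: {vertex: SPC}."""
    best_score, best_route = float("-inf"), None
    stack = [(v0, [v0], spc_credit.get(v0, 0))]
    while stack:
        u, path, score = stack.pop()
        if u == vM:
            if score > best_score:
                best_score, best_route = score, path
            continue
        for v in edges.get(u, []):
            stack.append((v, path + [v], score + spc_credit.get(v, 0)))
    return best_score, best_route

edges = {"v0": ["tagA", "tagB"], "tagA": ["vM"], "tagB": ["vM"]}
print(best_path(edges, {"tagA": 2, "tagB": 3, "vM": 1}, "v0", "vM"))
```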
4. EXPERIMENTS

4.1. Experiment Settings

All of the experiments in this paper were performed on a PC with a 3.0 GHz CPU and 1.0 GB memory, running Linux. Our algorithm is implemented in Perl. We also selected Lutefisk [6], PepNovo [5] and PEAKS [15], three modern and commonly used algorithms with freely available implementations (an online portal in the case of PEAKS), for analysis and comparison. The best results given by the different algorithms are used for comparison. To measure sequencing performance, we adopted the following measurements: Sensitivity and Positive Predictive Value (PPV), together with their tag-based variants:

    Sensitivity = #correct / |p|    (10)
    PPV = #correct / |P|    (11)
    Tag-Sensitivity = #tag-correct / |p|    (12)
    Tag-PPV = #tag-correct / |P|    (13)

where #correct is the number of correctly sequenced amino acids and #tag-correct is the sum of the lengths of correctly sequenced tags (of length > 1). #correct is computed as the longest common subsequence (LCS) of the correct peptide sequence p and the sequencing result P. Sensitivity indicates the quality of the result with respect to the correct peptide sequence; a high sensitivity means that the algorithm recovers a large portion of the correct peptide. The tag-sensitivity accuracy takes into consideration the continuity of the correctly sequenced amino acids. For a fair comparison with algorithms such as PepNovo that only output the highest-scoring tags, we also use the PPV and tag-PPV measures, which indicate how much of the output is correct.

Upper Bound on Sensitivity: Given a spectrum S and the correct peptide sequence p, let U(S_P^E, {d}) denote the theoretical upper bound on the sensitivity that can be attained by any algorithm using the extended spectrum graph G(S_P^E), namely using the extended spectrum S_P^E and a connectivity d. The bound U(S_P^E, {d}) is computed as the maximum number of amino acids that can be identified from G(S_P^E) with all of the ion types in A, divided by the length of p. PepNovo and Lutefisk, which consider charges of up to 2, are bounded by U(S_P, {2}), and there is a sizeable gap between U(S_P, {2}) and U(S_P^E, {2}). This bound was introduced in [18] for the analysis of multi-charge spectra. In this paper, we also compute this bound to evaluate the performance of the different algorithms.
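The LCS-based #correct term and Eqs. (10)-(11) are straightforward to compute; a minimal Python sketch (helper names are ours, not from the paper):

```python
def lcs_length(p, q):
    """Length of the longest common subsequence of two peptide strings."""
    dp = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            if p[i - 1] == q[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(p)][len(q)]

def sensitivity_and_ppv(correct_peptide, result):
    """Eqs. (10) and (11): #correct over |p| and over |P|, respectively."""
    n_correct = lcs_length(correct_peptide, result)
    return n_correct / len(correct_peptide), n_correct / len(result)

# Example: a result recovering 6 of the 8 residues of the true peptide.
print(sensitivity_and_ppv("PAAPAAPA", "PAAPAA"))
```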
Fig 2. Flowchart of the whole algorithm (binning; introduction of pseudo peaks; score computation and removal of noisy peaks; multi-charge tag generation, e.g. tags STSQKR and CCTGDHTK; restricted anti-symmetric model). "Bad" tags are tags that violate the restricted anti-symmetric model. [Figure itself not recoverable from the extraction.]
4.2. Results
We first analyzed the performance of the preprocessing method, and compared the results of Lutefisk, PepNovo, PEAKS and GST-SPC. We also compared these results with the theoretical upper bounds on sensitivity, to measure how close the results of these algorithms are to optimal ones. The GPM and ISB spectra are categorized by charge (given by the spectrum data). The results are shown in Table 4. From the results, we observe that preprocessing to remove noise can effectively increase sequencing accuracy. Compared with the results from the original GST-SPC without preprocessing, both the PPV and sensitivity accuracies increase by about 8% for the GPM datasets, and by about 5% for the ISB datasets, after preprocessing. This difference is probably due to the fact that ISB spectra contain more noise than GPM spectra, so that even after preprocessing to filter out noise, ISB spectra still retain more noise. These accuracies are much superior to the results of the Lutefisk algorithm, especially on spectra with high charges (>= 3). The novel algorithm outperforms the PepNovo algorithm on the GPM dataset; for the ISB dataset, the accuracies are closer. Interestingly, when compared with PEAKS, we discovered that though PEAKS's results on spectra with charge 1 and 2 are comparable with ours, PEAKS is better on multi-charge spectra. This is because PEAKS also has a preprocessing step to remove noisy peaks and introduce pseudo peaks, again showing that such preprocessing is necessary. As shown later, once we use the new anti-symmetric model, the accuracies of our algorithm improve, and there is almost no difference between them. Compared with the theoretical upper bounds, we can see that there is still much room for improvement.

We then analyzed the restricted anti-symmetric model. All of the results based on the GST-SPC algorithm use preprocessed spectra. The results based on the restricted anti-symmetric model (GST-SPC*) are compared with the results based on the strict anti-symmetric rule (strict anti-symmetric) and the results from GST-SPC ignoring the anti-symmetric issue (no anti-symmetric). The results are shown in Table 5.

Table 5. Results based on the restricted anti-symmetric model, compared with other models. The accuracies in cells are represented in a (PPV/sensitivity [tag-PPV/tag-sensitivity]) format.
[Table 5 data partially lost in extraction; per-dataset rows and spectrum counts are not recoverable. Recovered Total row: GST-SPC (no anti-symmetric) 0.409/0.447 [0.109/0.120]; GST-SPC (strict anti-symmetric) 0.419/0.464 [0.118/0.112]; GST-SPC* 0.427/0.475 [0.119/0.141].]
Table 4. The performance of preprocessing. The accuracies in cells are represented in a (PPV/sensitivity) format. "-" means that the value is not available from the algorithm, and "*" indicates average values based on charge 1 and charge 2 spectra. [Table data not recoverable from the extraction.]
Table 5 shows that the restricted anti-symmetric model has superior accuracy. Compared with the results from the algorithm that ignores the anti-symmetric problem (no anti-symmetric), applying the restricted anti-symmetric model improves the accuracies by about 5%, probably because the restricted anti-symmetric model removes some "bad" tags. An improvement of about 2% to 5% is observed when compared with the results from the strict anti-symmetric model; this is consistent with the significance of the anti-symmetric problem shown in Table 3. The results also show a great improvement in tag-PPV and tag-sensitivity from using the restricted anti-symmetric rule, especially on the ISB datasets. This, too, may be caused by the restricted anti-symmetric model removing the "bad" tags. Comparing the results in Table 5 with those in Table 4, we also observe that with the restricted anti-symmetric rule in GST-SPC*, the peptide sequencing results are more accurate. The results of GST-SPC* are closer to the accuracies of PepNovo (charge 1 and 2) and PEAKS, and significantly better than the results of Lutefisk. We also note that these results are still about 20% (charge 1 and charge 2 spectra) to 50% (charge 5 spectra) below the theoretical upper bounds on accuracy given in [9]. We then counted the results that match the correct peptide sequences 100% (sensitivity = 1 and PPV = 1). The results show that all of these algorithms output more than 5% fully matching results. For our novel algorithm, which introduces pseudo peaks, the problem that many of the missing fragmentations do not have enough peak support still exists. We think that a better scoring function can help to improve the ratio of 100% matching results.
We also applied preprocessing and the restricted anti-symmetric model to other algorithms; we selected the PepNovo algorithm for this experiment. PepNovo takes as input the spectra preprocessed by our preprocessing model and outputs tags. We then rescored and ranked these tags according to the restricted anti-symmetric model. We refer to this method based on preprocessing and the restricted anti-symmetric model as PepNovo*.

Table 7. The performance of preprocessing and the anti-symmetric model on PepNovo. The accuracies in cells are represented in a (PPV/sensitivity) format.
Dataset   | PepNovo      | PepNovo with preprocessing | PepNovo*
Charge 1  | 0.322/0.186  | 0.320/0.190                | 0.330/0.201
Charge 2  | 0.481/0.445  | 0.480/0.445                | 0.488/0.445
Total     | 0.486/0.455  | 0.485/0.417                | 0.489/0.425
[The "No. of spectra" column is not recoverable from the extraction.]
Table 6. Sequencing results of Lutefisk, PepNovo, GST-SPC and our novel algorithm. The accurate subsequences are labeled in italics. "M/Z" means mass-to-charge ratio, "Z" means charge, and "-" means there is no result. [Table data not recoverable from the extraction.]
In Table 6, we list a few "good" interpretations by the GST-SPC* algorithm on spectra for which Lutefisk does not provide good results. It is interesting to note that more and longer peptide fragments are correctly sequenced by the novel algorithm, showing the power of preprocessing and the restricted anti-symmetric rule. In these interpretations, we observe that the novel algorithm, which incorporates preprocessing and the restricted anti-symmetric model, can predict more and longer fragments of the correct peptides than Lutefisk, PepNovo and the original GST-SPC. For example, for the peptide sequence "PAAPAAPAPAEKTPVKK", the two tags "APAAPAPA" and "KE" are both interpreted correctly only by this novel algorithm.

Efficiency: The GST-SPC* algorithm can process a GPM spectrum (fewer peaks) in about 8 seconds, and an ISB spectrum (many peaks) in about 20 seconds. This is a little faster than the original GST-SPC algorithm, but slower than the Lutefisk algorithm (within 10 seconds for these spectra) and PepNovo (about 10 to 15 seconds for these spectra). This is because preprocessing reduces the number of peaks, but the restricted anti-symmetric rule increases the running time. For the PEAKS algorithm, the average processing time is 0.3 seconds per spectrum on the powerful computation facility of PEAKS Online (http://www.bioinfor.com:8080/peaksonline). Because of preprocessing, the space needed by GST-SPC* is less than that of the original GST-SPC algorithm. The novel algorithm used approximately 20 MB of memory to process a GPM spectrum, and about 50 MB to process an ISB spectrum; most of this space is used to store the extended spectrum graph.
5. CONCLUSIONS
In this paper, we have addressed two important issues in peptide sequencing. The first is preprocessing to remove noisy peaks from the spectrum while introducing pseudo peaks into the spectrum at the same time. We have shown by experiments that there is a significant portion of noisy peaks in the spectrum, and that our preprocessing method, which removes noisy peaks and introduces pseudo peaks, can make peptide sequencing more efficient and more accurate. The second issue is the anti-symmetric problem. We have shown that neither the strict anti-symmetric rule nor ignoring the anti-symmetric problem is realistic, and we have proposed a restricted anti-symmetric model. Both models can help improve the accuracy of de novo algorithms, and the novel GST-SPC* algorithm that incorporates them is shown to perform well on the datasets examined. However, there are still gaps between the accuracy of this algorithm and the theoretical upper bounds. The algorithm can be improved by using a better scoring function (rather than SPC), a better preprocessing method, and a more adaptable anti-symmetric model. We are currently working on these aspects, and preliminary results are encouraging. The peptide sequencing problem is a very interesting problem in bioinformatics, and there are many other related problems, such as peptide sequence assembly. We will apply our computational models to some of these interesting problems in the future.
References
1. Eng, J.K., McCormack, A.L. and Yates, J.R., III (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database, JASMS, 5, 976-989.
2. Perkins, D.N., Pappin, D.J.C., Creasy, D.M. and Cottrell, J.S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data, Electrophoresis, 20, 3551-3567.
3. Tanner, S., Shu, H., Frank, A., Mumby, M., Pevzner, P. and Bafna, V. (2005) InsPecT: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra, Anal. Chem., 77, 4626-4639.
4. Dancik, V., Addona, T., Clauser, K., Vath, J. and Pevzner, P. (1999) De novo protein sequencing via tandem mass-spectrometry, J. Comp. Biol., 6, 327-341.
5. Frank, A. and Pevzner, P. (2005) PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling, Anal. Chem., 77, 964-973.
6. Taylor, J.A. and Johnson, R.S. (1997) Sequence database searches via de novo peptide sequencing by tandem mass spectrometry, Rapid Commun. Mass Spectrom., 11, 1067-1075.
7. Chen, T., Kao, M.-Y., Tepel, M., Rush, J. and Church, G.M. (2001) A Dynamic Programming Approach to De Novo Peptide Sequencing via Tandem Mass Spectrometry, Journal of Computational Biology, 8, 325-337.
8. Lu, B. and Chen, T. (2004) Algorithms for de novo peptide sequencing via tandem mass spectrometry, Drug Discovery Today: BioSilico, 2, 85-90.
9. Chong, K.F., Ning, K., Leong, H.W. and Pevzner, P. (2006) Modeling and Characterization of Multi-Charge Mass Spectra for Peptide Sequencing, Journal of Bioinformatics and Computational Biology, 4, 1329-1352.
10. Craig, R. and Beavis, R.C. (2004) TANDEM: matching proteins with mass spectra, Bioinformatics, 20, 1466-1467.
11. Keller, A., Purvine, S., Nesvizhskii, A.I., Stolyar, S., Goodlett, D.R. and Kolker, E. (2002) Experimental protein mixture for validating tandem mass spectral analysis, OMICS, 6, 207-212.
12. Pevzner, P.A., Dancik, V. and Tang, C.L. (2000) Mutation-tolerant protein identification by mass-spectrometry, International Conference on Computational Molecular Biology (RECOMB 2000), 231-236.
13. Tsur, D., Tanner, S., Zandi, E., Bafna, V. and Pevzner, P.A. (2005) Identification of Post-translational Modifications via Blind Search of Mass-Spectra, IEEE Computer Society Bioinformatics Conference (CSB) 2005.
14. Keller, A., Eng, J., Zhang, N., Li, X.-j. and Aebersold, R. (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats, Molecular Systems Biology, doi:10.1038/msb4100024.
15. Ma, B., Zhang, K. and Liang, C. (2005) An Effective Algorithm for the Peptide De Novo Sequencing from MS/MS Spectrum, Journal of Computer and System Sciences, 70, 418-430.
16. Grossmann, J., Roos, F.F., Cieliebak, M., Liptak, Z., Mathis, L.K., Müller, M., Gruissem, W. and Baginsky, S. (2005) AUDENS: A Tool for Automated Peptide de Novo Sequencing, J. Proteome Res., 4, 1768-1774.
17. Lu, B. and Chen, T. (2003) A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry, J. Comput. Biol., 10, 1-12.
18. Ning, K., Chong, K.F. and Leong, H.W. (2007) De novo Peptide Sequencing for Multi-charge Mass Spectra based on Strong Tags, Fifth Asia Pacific Bioinformatics Conference (APBC 2007).
ALGORITHMS FOR SELECTING BREAKPOINT LOCATIONS TO OPTIMIZE DIVERSITY IN PROTEIN ENGINEERING BY SITE-DIRECTED PROTEIN RECOMBINATION
Wei Zheng(1), Xiaoduan Ye(1), Alan M. Friedman(2*), and Chris Bailey-Kellogg(1*)
(1) Department of Computer Science, Dartmouth College
(2) Department of Biological Sciences, Markey Center for Structural Biology, Purdue Cancer Center, and Bindley Bioscience Center, Purdue University

Protein engineering by site-directed recombination seeks to develop proteins with new or improved function, by accumulating multiple mutations from a set of homologous parent proteins. A library of hybrid proteins is created by recombining the parent proteins at specified breakpoint locations; subsequent screening/selection identifies hybrids with desirable functional characteristics. In order to improve the frequency of generating novel hybrids, this paper develops the first approach to explicitly plan for diversity in site-directed recombination, including metrics for characterizing the diversity of a planned hybrid library and efficient algorithms for optimizing experiments accordingly. The goal is to choose breakpoint locations to sample sequence space as uniformly as possible (which we argue maximizes diversity), under the constraints imposed by the recombination process and the given set of parents. A dynamic programming approach selects optimal breakpoint locations in polynomial time. Application of our method to optimizing breakpoints for an example biosynthetic enzyme, purE, demonstrates the significance of diversity optimization and the effectiveness of our algorithms.
1. INTRODUCTION

Protein engineering aims to create amino acid sequences encoding proteins with desired characteristics, such as improved or novel function. Two contrasting strategies are commonly employed to attempt to improve an existing protein. One approach focuses on redesigning a single sequence towards a new purpose, selecting a small number of mutations to the wild-type [1-5]. Another approach creates libraries of variant proteins to be selected or screened for desired characteristics. The library approach samples a larger portion of the sequence space, accumulating multiple mutations in each library member, increasing both the ability to reveal novel solutions to attaining function and the risk of obtaining non-functional sequences.

Protein engineering by site-directed recombination (Fig. 1) provides one approach for generating libraries of variant proteins. A set of homologous parent genes are recombined at defined breakpoint locations, yielding a combinatorial set of hybrids. In contrast to stochastic library construction, site-directed approaches choose breakpoint locations to optimize expected library quality, e.g., predicted disruption [7, 13, 14]. In both cases, the use of recombination enables the creation of protein variants that simultaneously accumulate a relatively large number of "natural" mutations relative to the parent. The mutations have been previously proven compatible with each other and within a similar structural and functional context, and are thus less disruptive than random mutations. Recombination-based approaches, when combined with high-throughput screening and selection, can avoid the need for precise modeling of the biophysical implications of mutations. They employ an essentially "generate-and-test" paradigm. As always, the goal is to bias the "generate" phase to improve the hit rate of the "test" phase.

A library is completely determined by selecting a set of parents and a set of breakpoint locations. To optimize an experiment so as to improve the expected quality of the resulting library, there are essentially two competing goals: we want the resulting proteins to be both viable and novel. Most previous work on planning site-directed recombination experiments has focused on enhancing viability, by seeking to minimize the amount of structural disruption due to recombination [6, 14-17]. However, breakpoints can also be selected so as to enhance novelty, by maximizing the diversity of the hybrids. For example, consider choosing one internal breakpoint (in addition to the one at the end) for the three parents in Fig. 1, left.
*Contact authors. CBK: 6211 Sudikoff Laboratory, Hanover, NH 03755, USA; [email protected]. AMF: Lilly Hall, Purdue University, West Lafayette, IN 47907, USA; [email protected].
If we put the breakpoint between the last two residues, all hybrids will be the same as the parents (i.e., a zero-mutation library). To improve the chance of getting novel hybrids, we must choose breakpoints that make hybrids different from each other and/or from the parents (Fig. 1, right).
Fig. 1. Diversity optimization in site-directed protein recombination. (Left) Recombination of three parent sequences at a set of three breakpoints (we always include an extra breakpoint at the end of the sequence). A total of 3^3 = 27 hybrids results, including three sequences equivalent to the parents. (Right) Repulsive spring analogy for library diversity. Hybrids (circles) are defined by parents (stars) and breakpoint locations. In order to sample the sequence space well, we want to choose breakpoint locations that push hybrids away from each other. (For clarity, only some relationships are illustrated.) Since the parents will also appear in the hybrid library, the hybrids are pushed away from them as well.

Alternatively, an explicit goal may be to push the hybrids away from the parents as much as possible, so as to maximize the possibility of novel characteristics that are not found in the parents. We capture these two goals as the v_HH (hybrid-hybrid) and v_HP (hybrid-parent) metrics below, and demonstrate that they are highly correlated as a function of breakpoint location. Note that at all times, the hybrids are restricted to being a combination of the parents.
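The combinatorics of Fig. 1 are easy to reproduce: with n parents and lambda breakpoints (the last fixed at the sequence end), the library contains n^lambda hybrids. A small Python sketch (the function name and example sequences are ours, purely for illustration):

```python
from itertools import product

def hybrid_library(parents, breakpoints):
    """Enumerate all n**len(breakpoints) hybrids. The last breakpoint must be
    the sequence length, so the fragments tile each aligned parent completely."""
    bounds = [0] + list(breakpoints)
    fragments = [[p[bounds[k]:bounds[k + 1]] for p in parents]
                 for k in range(len(breakpoints))]
    return ["".join(choice) for choice in product(*fragments)]

# Three parents, breakpoints {4, 8, 12} (x_lambda = sequence length 12).
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "DDDDDDDDDDDD"]
lib = hybrid_library(parents, [4, 8, 12])
assert len(lib) == 3 ** 3        # 27 hybrids, including the three parents
```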
Diversity has been experimentally demonstrated to be important to obtaining new characteristics. The number of mutations has been correlated with functional change from wild-type in several proteins modified by different methodologies. Hybrid cytochromes P450 with the most altered profiles and greatest activity on a new substrate (allyloxybenzene) were found to have higher effective mutation levels (30-50 mutations among the 460 residues) than the enzymes with activities similar to the parents' [16]. A random mutant library of TEM-1 beta-lactamase with a minimal mutation load (8.2 mutations/gene) was found to have the highest frequency of clones carrying wild-type or minimally different activity, while a mutant library with maximal mutation load (27.2 mutations/gene) had the highest frequency of clones with improved activity on the normally poor substrate cefotaxime [18]. In a study of single-chain Fv antibodies, the greatest affinity improvement was exhibited by libraries of moderate to high mutation levels (3.8-22.5 mutations/gene) [19]. Mutants with significantly higher affinity than the wild-type were well represented within the active fraction of the library population with high mutation levels.

This paper represents the first approach to explicitly plan for diversity in site-directed recombination. We develop metrics for evaluating diversity, in terms of both the differences among hybrids and the differences between hybrids and parents. We develop polynomial-time dynamic programming algorithms to select optimal breakpoint locations for these diversity metrics. We show that the algorithms are effective and significant in optimizing libraries from the purE family of biosynthetic enzymes.

2. METHODS

We are given a set of n parent sequences P = {P_1, P_2, ..., P_n}, forming a multiple sequence alignment with each sequence of length l, including residues and gaps. Our goal is to select a set of λ breakpoint locations X = {x_1, x_2, ..., x_λ | 1 ≤ x_1 < x_2 < ... < x_λ = l}. For simplicity in notation, we always place the final breakpoint after the final residue position (i.e., x_λ = l). The breakpoints partition each parent P_a into λ fragments with sequences P_a[1, x_1], P_a[x_1 + 1, x_2], ..., P_a[x_{λ-1} + 1, x_λ], where in general we use S[r, r'] to denote the amino acid string from position r to r' in sequence S, and S[r] to denote the single amino acid at position r. A hybrid protein H_i is a concatenation of chosen parental fragments, assembled in the original order; thus it is also of length l. A hybrid library H(P, X) = {H_1, H_2, ..., H_{n^λ}} includes all combinations. Our goal is to choose X (such that |X| = λ and x_λ = l) to optimize the diversity of the library H(P, X), for a set P of parents.
2.1. Library Diversity

For two amino acid sequences S and S' of length l, we define the mutation level m(S, S') as the number of corresponding residues that differ:

    m(S, S') = Σ_{1 ≤ r ≤ l} I{S[r] ≠ S'[r]},    (1)
where the indicator function I is 1 when the predicate is true and 0 when it is false. To mitigate the effect of neutral mutations, rather than using literal equality we measure functional relatedness using one of the standard sets of amino acid classes: {{C}, {F,Y,W}, {H,R,K}, {N,D,Q,E}, {S,T,P,A,G}, {M,I,L,V}}. In either case, a "gap" in the alignment is taken as a distinct amino acid type. Our approach can be used with any similarly-structured metric for mutation level.

While our goal is to optimize library diversity, we show that the choice of parents and number of breakpoints, independent of breakpoint location, determines the mutation level between all pairs of hybrids (Claim 2.1), between one parent and all hybrids (Claim 2.2), and between all hybrids and all parents (Claim 2.3).

Claim 2.1. Σ_{i=1}^{n^λ - 1} Σ_{j=i+1}^{n^λ} m(H_i, H_j) = n^{2(λ-1)} Σ_{a=1}^{n-1} Σ_{b=a+1}^{n} m(P_a, P_b).

Claim 2.2. For all P_a ∈ P: Σ_{i=1}^{n^λ} m(H_i, P_a) = n^{λ-1} Σ_{b=1}^{n} m(P_a, P_b).

Claim 2.3. Σ_{i=1}^{n^λ} Σ_{a=1}^{n} m(H_i, P_a) = n^{λ-1} Σ_{a=1}^{n} Σ_{b=1}^{n} m(P_a, P_b).

Proof. Consider residue position r, where 1 ≤ r ≤ l. Over the set of n^λ hybrids, there must be n^{λ-1} instances of P_1[r], n^{λ-1} of P_2[r], ..., and n^{λ-1} of P_n[r]. Thus we have

    Σ_{i=1}^{n^λ - 1} Σ_{j=i+1}^{n^λ} I{H_i[r] ≠ H_j[r]} = n^{2(λ-1)} Σ_{a=1}^{n-1} Σ_{b=a+1}^{n} I{P_a[r] ≠ P_b[r]}.    (2)

By extending this to all pairs and all positions we have (Claim 2.1):

    Σ_{i=1}^{n^λ - 1} Σ_{j=i+1}^{n^λ} m(H_i, H_j) = n^{2(λ-1)} Σ_{r=1}^{l} Σ_{a=1}^{n-1} Σ_{b=a+1}^{n} I{P_a[r] ≠ P_b[r]},    (3)

and by similarly comparing to a fixed parent we have (Claim 2.2):

    Σ_{i=1}^{n^λ} m(H_i, P_a) = n^{λ-1} Σ_{b=1}^{n} m(P_a, P_b).    (4)

Claim 2.3 follows immediately from Claim 2.2. □

The right-hand sides of the claims involve the parents but not the hybrids. Thus, surprisingly, the total numbers of mutations differentiating hybrids from each other and from the parents are independent of breakpoint locations and determined solely by the choice of parents. However, the distribution of the diversity within the library does depend on the breakpoints.

2.2. Metrics for Breakpoint Selection

Intuitively (Fig. 1, right), hybrids sample a sequence space defined by the parents and the breakpoint locations. A priori, we don't know what parts of the space are most promising, and thus we seek to generate novel proteins by sampling the space as uniformly as possible, rather than clustering hybrids near each other or near the parents. More formally, consider one particular hybrid H_i. We want to make the other hybrids all roughly as different from H_i; i.e., for the other H_j, the various m(H_i, H_j) should be roughly equal. If we do this for all H_i, then we will also make the H_j different from each other (and not just from one particular H_i). That is, we want to make m(H_i, H_j) relatively uniform, or minimize its deviation:

    Σ_{i=1}^{n^λ - 1} Σ_{j=i+1}^{n^λ} (m(H_i, H_j) - m̄)²,    (5)

where m̄ is the mean value of m(H_i, H_j). Expanding the square in Eq. (5) yields an m(H_i, H_j)² term, a constant m̄² term, and an m̄ × m(H_i, H_j) term whose sum is constant by Claim 2.1. Thus we need only minimize the m(H_i, H_j)² term, which we call the "variance." This gives us the first of two diversity optimization targets.
Problem 2.1. (Hybrid-Hybrid Diversity Optimization) Given n parent sequences P of l residues and a positive integer λ, choose a set X of λ breakpoints (with x_λ = l) to minimize the hybrid-hybrid "variance" v_HH(X) of the resulting library, where

    v_HH(X) = Σ_{i=1}^{n^λ - 1} Σ_{j=i+1}^{n^λ} m(H_i, H_j)²    (6)

for H_i, H_j ∈ H(P, X).

In addition to making hybrids different from each other, we also may want to focus on making them different from the parents. Following a similar intuition and argument as above, we obtain a second diversity optimization target:

Problem 2.2. (Hybrid-Parent Diversity Optimization) Given n parent sequences P of l residues and a positive integer λ, choose a set X of λ breakpoints (where x_λ = l) to minimize the hybrid-parent "variance" v_HP(X) of the resulting library, where

    v_HP(X) = Σ_{i=1}^{n^λ} Σ_{a=1}^{n} m(H_i, P_a)²    (7)

for H_i ∈ H(P, X), P_a ∈ P.

Intuitively (Fig. 1, right), both H-H and H-P diversity optimization will spread hybrids out in sequence space. In fact, we can show that for any set X of λ breakpoints, v_HH(X) and v_HP(X) satisfy a fixed relationship (Eq. (8)). Due to lack of space, we omit the proof, which is an algebraic manipulation of the terms. This relationship means that the two criteria should be highly correlated, as our results below confirm.

2.3. Dynamic Programming for Breakpoint Selection

In order to select an optimal set of breakpoints, we select breakpoints from left to right (N- to C-terminal) in the sequences. We slightly abuse our previous notation, truncating the parents at the last breakpoint selected (consistent with our previous use of the end of the sequence as the final breakpoint). As Fig. 2 illustrates, a hybrid library with breakpoints X = {x_1, ..., x_{k-1} = r', x_k = r} extends a hybrid library with breakpoints X' = {x_1, ..., x_{k-1} = r'} by concatenating each of the hybrids with each parent fragment P_a[r' + 1, r]. Optimal substructure holds, since the best choice for x_k depends only on the best choice for x_{k-1}.

Fig. 2. Library substructure: library H(P, X) ending at position r extends library H'(P, X') ending at position r' by adding each parent fragment P_a[r' + 1, r] to each hybrid H' in H'(P, X'). [Figure itself not recoverable from the extraction.]

H-H Diversity Optimization. We use this insight to devise a dynamic programming recurrence to compute the optimal value of v_HH for the kth breakpoint location, based on the optimal values of v_HH for the possible (k-1)st locations. Define d_HH(r, k) to be the minimum value of v_HH(X) for any X = {x_1, ..., x_k = r}. Then d_HH(l, λ) is the optimal value for H-H diversity optimization.

Claim 2.4. We can compute d_HH(r, k) recursively in time O(λn²l²) as

    d_HH(r, k) = Σ_{a=1}^{n-1} Σ_{b=a+1}^{n} m(P_a[1, r], P_b[1, r])²    if k = 1,
    d_HH(r, k) = min_{r' < r} { n² d_HH(r', k-1) + e_HH(k, r, r') }    if k > 1,

where e_HH is defined in Eq. (10).
Proof. As discussed above, the hybrid library H(P, X) is extended from H(P, X'), where X' is missing the final breakpoint in X. Let us use H_i for the members of H(P, X) and H'_i for those of H(P, X'), and "+" to denote sequence concatenation. Following the structure in Fig. 2, we can separate v_HH into terms H(P, X') + P_a[r' + 1, r] from hybrids in a single "sub-library" sharing the same added fragment, and terms H(P, X') + P_a[r' + 1, r] and H(P, X') + P_b[r' + 1, r] between separate "sub-libraries" with distinct added fragments. This gives Eq. (11). Expanding the second term on the right-hand side in Eq. (11) gives Eq. (12). By Claim 2.1 for parents with k - 1 breakpoints (and thus truncated at r'), we have Eq. (13). We can substitute twice the right-hand side of Eq. (13) into the third term in Eq. (12) (with "twice" to account for summing over all pairs vs. all distinct pairs), noting that the sums over the parents a and b in Eqs. (12) and (13) are independent. We can then substitute the resulting formula back into Eq. (11). Simplification yields Eq. (14), where most terms are collected into e_HH, except for the sums of m(H'_i, H'_j)², including n from the first term in Eq. (11) and twice (n choose 2) from the first term in Eq. (12) (with "twice" again due to all vs. all distinct). Because Eq. (14) only depends on r' and not the previous breakpoints,

    d_HH(r, k) = min_{r'} { n² d_HH(r', k-1) + e_HH(k, r, r') }.    (9)

Computing this recurrence using dynamic programming requires a table of size λ × l; filling in each entry requires time O(n²) to compute e_HH and must look back at O(l) previous entries to compute the minimum, for a total time of O(λn²l²). □

[Eqs. (10)-(14) are display equations that did not survive extraction.]
H-P Diversity Optimization. A dynamic programming algorithm similar to the H-H one above allows us to optimize H-P diversity. Let d_HP(r, k) be the minimum value of v_HP(X) for any X = {x_1, ..., x_k = r}, so that d_HP(l, λ) is the optimal value for H-P diversity optimization.

Claim 2.5. We can compute d_HP(r, k) recursively in time O(λn²l²) as

    d_HP(r, k) = Σ_{a=1}^{n} Σ_{b=1}^{n} m(P_a[1, r], P_b[1, r])²    if k = 1,
    d_HP(r, k) = min_{r' < r} { n d_HP(r', k-1) + e_HP(k, r, r') }    if k > 1,    (15)

where e_HP is defined in Eq. (16).

Proof. The proof is similar to that for H-H diversity. By partitioning the library, we have Eq. (17). By Claim 2.2 for parents with k - 1 breakpoints truncated at position r', we have Eq. (18). Substituting the right-hand side of Eq. (18) into the third term in Eq. (17) and simplifying, we get Eq. (19). Here, e_HP(k, r, r') also depends only on r' and not the preceding breakpoints, so we have Eq. (15). The table size and time to fill in each entry are the same as with H-H diversity. □

[Eqs. (16)-(19) are display equations that did not survive extraction.]
3. RESULTS AND DISCUSSION

The orthologous proteins of the purE family (COG 41 and pfam 731) form a valuable target for engineering a diverse hybrid library. The small (generally about 120 residue) purE sequences, which form either a single protein or a single domain in a fusion protein, catalyze steps in the de novo synthesis of purines. While clear orthologs, purE proteins carry out substantially different enzymatic activities in different organisms: in eubacteria, fungi and plants (as well as probably most archaebacteria), the purE product functions as a mutase in the second step of a two-step reaction, while in metazoans and methanogenic archaebacteria, the purE product functions as a carboxylase in a single-step reaction that yields the same product [20, 21]. A genetic system allows selection in vivo for both the catalytic mechanism and different levels of enzymatic activity. In order to uncover explanations for the striking divergence of function (mutase vs. carboxylase activity) within homologous sequences, we sought to evenly partition the sequence space, bridging the two "islands." To establish a set of purE parents, we performed standard sequence search and alignment techniques, eliminated columns not mapped to
the structure of E. coli purE (PDB id: 1qcz), and eliminated sequences with more than 20% gaps. This yielded a diverse set of 367 sequences of 162 residues each, including 28 of the rarer class of metazoans and methanogens with inferred carboxylase activity. The average pairwise sequence identity (under the classes of Sec. 2.1) is 65.8%. We first chose three diverse parent sequences from the purE family: P_1 from the eubacterium Escherichia coli, P_2 from the vertebrate chicken (Gallus gallus), and P_3 from the methanogenic archaebacterium Methanothermobacter thermautotrophicus. The mutation levels among these three parent sequences are m(P_1, P_2) = 94, m(P_1, P_3) = 65 and m(P_2, P_3) = 85. We applied our algorithms to choose sets of 4, 5, 6 and 7 internal breakpoints (Fig. 3).
Fig. 3. Breakpoint locations for three purE proteins, under (top) H-H and (bottom) H-P diversity optimization. The sequence is labeled with residue indices, with alpha-helices shown with light boxes and beta-sheets with dark ones, according to the crystal structure of E. coli purE (PDB id: 1qcz). Numbers above the dashed lines indicate the positions of breakpoints. Numbers within the fragments give the sum of the intra-fragment mutation levels between all pairs of parents. [Figure itself not recoverable from the extraction; the panel titles read "breakpoint locations and fragment mutation levels for H-H optimization" and "breakpoint locations and fragment mutation levels for H-P optimization".]

For 4, 5, and 6 internal breakpoints, both H-H and H-P optimization yield the same breakpoint locations. For 7 internal breakpoints, the locations differ only by a few residues for the last two breakpoints. As the mutation levels show, in seeking to make hybrids distributed uniformly in the sequence space, breakpoint selection optimization equalizes the contributions to diversity from the fragments. To show that it is not likely to generate equivalent diversity by chance, we chose 10000 random sets of four internal breakpoints. The distributions of v_HH and v_HP for these random sets are plotted in Fig. 4.

Fig. 4. Distribution of diversity values for random breakpoint selection compared with dynamic programming optimization. The x-axis indicates different diversity values. The y-axis indicates the frequencies of the diversity values among 10000 random sets of four internal breakpoints. Dark diamonds indicate diversity values for breakpoints selected by our algorithm: 9.63 x 10^7 for H-H, 2.39 x 10^6 for H-P, and 8565 for sum-min (using the H-P breakpoints). [Figure itself not recoverable from the extraction.]
The breakpoints selected by our algorithms are better than any random selection. For comparison, we also calculated the "sum-min" diversity metric Σ_i min_a m(H_i, P_a) used by Arnold and colleagues. Currently no efficient algorithm has
been found to directly maximize sum-min diversity, but our H-H and H-P optimization algorithms also apparently do a good job of optimizing it; no random breakpoint selection was found to do better.

As we proved in Claims 2.1-2.3, the choice of parent sequences determines the total number of mutations. We also expect it to affect library diversity, since the choice of parents defines the available sequence space (we can only recombine the parents). To test the effect of parent diversity on the optimization of library diversity, we randomly chose 1000 three-member purE parent sets. For each set, we selected optimized breakpoints with our algorithms, and calculated the three diversity values as above (using the H-P breakpoints for the calculation of sum-min diversity). For each parent set, we also calculated the means of the three diversity metrics over 1000 random sets of four internal breakpoints. Fig. 5 plots the additive difference between the values under our optimized breakpoint sets and the mean values for random breakpoint sets. As the total mutation level of the parents increases, so does the improvement of our breakpoints over random. Presumably, more parent diversity provides more opportunity to explicitly optimize library diversity.

As shown by the ratio analysis of v_HH and v_HP in Eq. (8) and confirmed empirically in Fig. 3, hybrid-parent diversity optimization is highly correlated with hybrid-hybrid diversity optimization. It also appears to be highly correlated with the sum-min diversity of Arnold and co-workers. Fig. 6(a,c) shows the relationship among these values, using the same random breakpoint selections as in Fig. 4. Optimization for hybrid-parent diversity also achieves good diversity according to the other two metrics. Fig. 6(b,d) shows that the correlation remains extremely high (R near 1 and -1) over the random parent sets and random breakpoint sets used in Fig. 5. These correlations allow us to do just one polynomial-time diversity optimization, achieving three goals simultaneously.
Fig. 5. Effect of parent selection on diversity optimization. The x-axis indicates the total number of mutations between pairs of purE parents in 1000 randomly chosen three-parent plans. The y-axis indicates, for each parent choice, the improvement in diversity from 1000 random plans to the optimized plan (larger y values indicate more improvement). For H-H and H-P, improvement is measured as the mean random plan value minus the value of our plan; for sum-min, improvement is the value of our plan minus the mean random plan value. [Figure itself not recoverable from the extraction.]
4. CONCLUSION
While diversity in hybrid libraries is the key to finding novel function, library design has previously focused instead on reducing the fraction of non-viable hybrids; diversity has been a side-effect, rather than an explicit optimization target. In this initial approach to optimizing diversity, we showed here that the total number of mutations in a library is fixed by the choice of parents, but that their distribution among hybrids can be optimized so that the hybrids broadly sample sequence space. Our metrics and algorithms enable efficient selection of breakpoint locations to optimize diversity. In practical applications, a suitable combination of diversity and viability will be desired. Since the dynamic programming approach here has a similar structure to algorithms for minimizing disruption [13, 14], it might be possible to optimize for a desired trade-off between these two competing goals. We likewise anticipate integrating knowledge of important residues (e.g., targeting an active site), via appropriate weights. Finally, since the parents define the searchable sequence space and the total possible diversity, the importance of parent selection is re-emphasized.
Fig. 6. Relationship among three diversity metrics. (a,c): Correlation over random four-breakpoint sets with the fixed three-parent set of Fig. 4. The x-axis indicates H-P variance (v_HP); the y-axis indicates H-H variance (v_HH) or sum-min diversity, respectively. (b,d): Histogram of correlation coefficients of diversity metrics for random sets of four internal breakpoints with the same random parent sets as Fig. 5. Note that the histograms are focused on a small region very near 1 and -1, respectively. [Figure itself not recoverable from the extraction.]
ACKNOWLEDGMENTS

This work was supported in part by an NSF CAREER award to CBK (IIS-0444544) and a grant from NSF SEIII (IIS-0502801) to CBK, AMF, and Bruce Craig.

References
1. B. Kuhlman, G. Dantas, G.C. Ireton, G. Varani, B.L. Stoddard, and D. Baker. Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364-8, 2003.
2. L.L. Looger, M.A. Dwyer, J.J. Smith, and H.W. Hellinga. Computational design of receptor and sensor proteins with novel functions. Nature, 423(6936):185-90, 2003.
3. R.H. Lilien, B.W. Stevens, A.C. Anderson, and B.R. Donald. A novel ensemble-based scoring and search algorithm for protein redesign and its application to modify the substrate specificity of the gramicidin synthetase A phenylalanine adenylation enzyme. J. Comput. Biol., 12(6):740-61, 2005.
4. J. Li, Z. Yi, M.C. Laskowski, M. Laskowski Jr., and C. Bailey-Kellogg. Analysis of sequence-reactivity space for protein-protein interactions. Proteins, 58(3):661-71, 2005.
5. I. Georgiev, R.H. Lilien, and B.R. Donald. A novel minimized dead-end elimination criterion and its application to protein redesign in a hybrid scoring and search algorithm for computing partition functions over molecular ensembles. In Proc. RECOMB, pages 530-45, 2006.
6. C.A. Voigt, C. Martinez, Z.G. Wang, S.L. Mayo, and F.H. Arnold. Protein building blocks preserved by recombination. Nat. Struct. Biol., 9(7):553-8, 2002.
7. M.M. Meyer, J.J. Silberg, C.A. Voigt, J.B. Endelman, S.L. Mayo, Z.G. Wang, and F.H. Arnold. Library analysis of SCHEMA-guided protein recombination. Protein Sci., 12:1686-93, 2003.
8. C.R. Otey, M. Landwehr, J.B. Endelman, K. Hiraga, J.D. Bloom, and F.H. Arnold. Structure-guided recombination creates an artificial family of cytochromes P450. PLoS Biol., 4(5):e112, 2006.
9. L. Saftalov, P.A. Smith, A.M. Friedman, and C. Bailey-Kellogg. Site-directed combinatorial construction of chimaeric genes: general method for optimizing assembly of gene fragments. Proteins, 64(3):629-42, 2006.
10. W.P. Stemmer. Rapid evolution of a protein in vitro by DNA shuffling. Nature, 370(6488):389-91, 1994.
11. A.M. Aguinaldo and F.H. Arnold. Staggered extension process (StEP) in vitro recombination. Methods Mol. Biol., 231:105-10, 2003.
12. W.M. Coco. RACHITT: Gene family shuffling by random chimeragenesis on transient templates. Methods Mol. Biol., 231:111-127, 2003.
13. J.B. Endelman, J.J. Silberg, Z.G. Wang, and F.H. Arnold. Site-directed protein recombination as a shortest-path problem. Protein Eng. Des. Sel., 17:589-594, 2004.
14. X. Ye, A.M. Friedman, and C. Bailey-Kellogg. Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination. J. Comput. Biol., in press, 2007. Conference version: Proc. RECOMB, 2006, pp. 15-29.
15. G.L. Moore and C.D. Maranas. Identifying residue-residue clashes in protein hybrids by using a second-order mean-field approach. PNAS, 100(9):5091-6, 2003.
16. C.R. Otey, J.J. Silberg, C.A. Voigt, J.B. Endelman, G. Bandara, and F.H. Arnold. Functional evolution and structural conservation in chimeric cytochromes P450: calibrating a structure-guided approach. Chem. Biol., 11(3):309-18, 2004.
17. M.C. Saraf, A. Gupta, and C.D. Maranas. Design of combinatorial protein libraries of optimal size. Proteins, 60(4):769-77, 2005.
18. M. Zaccolo and E. Gherardi. The effect of high-frequency random mutagenesis on in vitro protein evolution: a study on TEM-1 beta-lactamase. J. Mol. Biol., 285:775-83, 1999.
19. P.S. Daugherty, G. Chen, B.L. Iverson, and G. Georgiou. Quantitative analysis of the effect of the mutation frequency on the affinity maturation of single chain Fv antibodies. PNAS, 97:2029-34, 2000.
20. S.M. Firestine, S.W. Poon, E.J. Mueller, J. Stubbe, and V.J. Davisson. Reactions catalyzed by 5-aminoimidazole ribonucleotide carboxylases from Escherichia coli and Gallus gallus: a case for divergent catalytic mechanisms. Biochemistry, 33:11927-34, 1994.
21. J. Thomas et al., in preparation.
AN ALGORITHMIC APPROACH TO AUTOMATED HIGH-THROUGHPUT IDENTIFICATION OF DISULFIDE CONNECTIVITY IN PROTEINS USING TANDEM MASS SPECTROMETRY
Ten-Yang Yen and Bruce Macher Department ojChemistuy and Biochemistry, Sun Francisco State University, 1600 Holloway Avenue, Sun Francisco, CA 941324025, U.S.A. Knowledge of the pattern of disulfide linkages in a protein leads to a better understanding of its tertiary structure and biological function. At the state-of-the-art, liquid chromatographyielectrospray ionization-tandem mass spectrometry (LCIEST-MSIMS) can produce spectra of the peptides in a protein that are putatively joined by a disulfide bond. In this setting, efficient algorithms are required for matching the theoretical mass spaces of all possible bonded peptide fragments to the experimentally derived spectra to determine the number and location of the disulfide bonds. The algorithmic solution must also account for issues associated with interpreting experimental data from mass spectrometry, such as noise, isotopic variation, neutral loss, and charge state uncertainty. In this paper, we propose a algorithmic approach to high-throughput disulfide bond identification using data from mass spectrometry, that addresses all the aforementioned issues in a unified framework. The complexity of the proposed solution is of the order of the input spectra. The efficacy and efficiency of the method was validated using experimental data derived from proteins with with diverse disulfide linkage patterns
1. INTRODUCTION Cysteine residues have a property unique among the 20 naturally occurring amino acids, in that they can pair to form disulfide bonds. These covalent bonds occur when the sulfhydryl groups of cysteine residues become oxidized (S-H + S-H + S-S + 2H).’ Because disulfide bonds impose length and angle constraints on the backbone of a protein, knowledge of the location of these bonds significantly constrains the searchspace of possible stable tertiary structures which the protein folds into. The disulfide linkage pattern of a protein also can have an important effect on its function. For example, the disulfide bond structures of STSSia IV are necessary for its polysialyation activity.2 Methods for determining disulfide bonds in a protein can be classified as either: (1) purely predictive, based completely on the protein’s primary structure, or ( 2 ) based on analyzing data from experimental methods, such as Crystallography, NMR, and Mass Spe~trometry.~,~ Predictive approaches typically aim to infer the disulfide bonding state of cysteine residues in a protein, primarily by characterizing a heuristically defined local sequence environment. Towards this goal, predictive approaches include graph-theoretic method^,^ combinatorial optimization formulations,6 techniques
* Corresponding author. Email: [email protected]
based on efficient indexing of the search space,’ and a variety of supervised learning formulations involving neural-networks, hidden Markov models, and support vector machines.’.’’ However, Vullo and Frasconi concluded that any prediction algorithm must have a computational time complexity bounded by O(n( & /2)”), where n is the number of cysteines in the protein.* This limits the application of such an algorithm to proteins with only a few disulfide bonds. In addition, the prediction accuracies or these methods, defined as the fraction of the total number of proteins whose connectivity patterns are correctly predicted, are currently limited to about 60%. By contrast, determination of disulfide bonds can also be achieved with high accuracy for any number of bonds by analyzing data from structure elucidation techniques such as X-ray crystallography and NMR. These techniques require relatively large amounts (10 to 100 mg) of pure protein in a particular solution or crystalline state and are fundamentally low-throughput in nature. In this context, the use of information from mass spectrometric (MS) analysis constitutes an important direction for elucidation of structural features, such as disulfide bonds. For identification of disulfide linkages, the general strategy involves mass spectrometry-based
analysis to make an initial identification of the putative peptides involved in a disulfide bond. These peptides are then fragmented, and a tandem mass spectrum (MUMS) of the fragments generated. The MS/MS spectrum is subsequently analyzed to confirm the initial identification of a disulfide bond. Such an approach can offer accurate identification and, in principle, can scale to any number of bonds with much less stringent sample purity requirements when compared to NMR or X-ray crystallography. Although the aforementioned approach is conceptually straightforward, the actual task of identifying the MS/MS spectra corresponding to disulfide linkages is non-trivial. In this paper we investigate this precise problem. The key contributions of this work lie in addressing the problem of disulfide bond identification in the context of the technical challenges arising from the use of the real-world data from tandem mass spectrometric analysis. The combination of experimental procedure and algorithmic analysis proposed is scalable to structures having a large number of disulfide bonds. Furthermore, the processing is inherently highthroughput. Other features of the proposed approach include: Invariance to the topology of the disulJde bonds: Disulfide bonds may be classified as intramolecular bonded (within a single peptide chains) or inter-molecular bonded (between different peptide chains). The proposed methodology can identify such bonds within a single framework. Analysis of experimental errorshoise at the level of the produced spectrum: Our proposed methodology requires the mass spectra and tandem mass spectra to be converted into a finite set of discrete “mass peaks.” We present algorithms to resolve such peaks from spectra having peaks of non-zero width. We also address how to obtain the optimal set of peaks from each tandem mass spectrum. Accounting for neutral loss and isotopic variation: During the collision induced disassociation step of an LCIESI-MS/MS analysis, a peptide fragment may have undergone neutral loss, resulting in the loss of a small molecule such as water or ammonia. In addition, the constituent atoms that comprise an amino acid exist in a number of isotopic forms. As a result, peptides consisting of the same sequence of amino acids will be measured as a series of masses by the mass spectrometer. This must be considered
when computing the expected mass of a disulfide bonded peptide fragment. Interpretation of the charge state: Precursor ions with a high charge state (triply charged ion or greater) can be misinterpreted by MS data processing programs commonly supplied as part of the MS instrumentation. For example, ion trap mass spectrometers have a relatively low resolution. In such cases, a quadruply charged ion may not be well resolved and can be misinterpreted as a triply charged ion. This error often cannot be identified unless a higher resolution scan (zoom scan) is employed during the experiment. Consequently, the mass of a disulfide bonded pair of peptides is incorrectly computed, resulting in either not identifying (false negative) or incorrectly identifying (false positive) the bond.
1.1. Comparison of the Proposed Approach with Related Works Examples of techniques employing purely predictive methodology include DiANNA,” DISULFIND,I2 and P r e C y ~ . ’ ~ Each of these implementations employ weighted graph matching to predict the final disulfide connectivity pattern. In fact, these implementations all use a program (by Rothberg) that implements Gabow’s algorithmic solution of the maximal weighted graph matching p r ~ b l e m . ’Additionally, ~ a learning strategy is involved where fundamental assumptions are made about the relationship between the cysteine residues in order to obtain the edge weights. Examples of such assumptions include length of the local sequence environment to be considered, formulation of the residue contact potential function used, and assumptions involved in defining the training set. However, their reported prediction accuracies indicate that these underlying assumptions remain open to further investigations. Existing web-based programs such as MS-Bridge in the ProteinProspector tools,15 X! Protein Disulphide Linkage Modeler,I6 and PeptidemapI7 are useful when analyzing MS data from MALDI-TOF (Matrix Assisted Laser Desorption Ionization-Time of Flight) experiments. However, these programs do not analyze MSiMS data, thus missing the useful structural information inherent in this data. The program MS2Assign can be used to analyze disulfide linkages from MS/MS data.I8 However, because it was designed
43 primarily for the analysis of results from cross-linking studies, MS2Assign requires the user to input detailed information on the specific modifications expected. As a result, there is a need for a software tool that utilizes both MS and MSIMS data to identify disulfide linkage patterns in a high throughput manner.
2. THE PROPOSED METHOD
2.1. Problem Formulation

Let {a_i} denote the set of amino acid residues, each with mass m(a_i). A peptide p = (a_i) is then a string of amino acids with mass m(p) = Σ_i m(a_i) + 18 Da (Daltons). Peptides have a specific directionality: the string starts at the unbonded amide group, called the N-terminus, and ends at the carboxylic acid group, called the C-terminus. The term 18 Da is included in this formula to account for the masses of the H and OH of the N- and C-termini of the peptide, respectively. In an LC/ESI-MS/MS experiment, a protease is used to divide a protein into peptides. A protein A = {p_i} denotes the set of all peptides. A cysteine-containing peptide c is a peptide of protein A that has at least one of its amino acids a_i identified as a cysteine residue. Thus if C = {c_j} is the set of all cysteine-containing peptides, then C ⊆ A. In practice, it is very rare that C = A. A disulfide bonded peptide D_{1,2} is a pair of cysteine-containing peptides c_1 and c_2, with mass m(D_{1,2}) = m(c_1) + m(c_2) - 2m(H), where 2m(H) accounts for the mass of the two protons that are lost when the disulfide bond is formed.

A disulfide connectivity pattern can be modeled in terms of an undirected graph G = (V, E). The vertex set V represents the set of bonded cysteines and an edge e ∈ E corresponds to a disulfide bridge between its adjacent cysteines. Admissible vertex and edge sets are constrained because an even number of intra-chain bonded cysteines is required and a cysteine can be bridged to one and only one other cysteine. Thus, we have |V| = 2B, |E| = B and degree(v) = 1 for any v ∈ V (a perfect matching), where B denotes the number of disulfide bonds in a chain. The problem of identifying the correct connectivity pattern for a given disulfide bonded chain is simply formulated as finding the best possible candidate as given by a suitable scoring function. If we consider only those cysteines that are known to be involved in a disulfide bond, it is evident that this problem is equivalent to the problem of computing the maximum-weight perfect matching. In a perfect matching, every vertex of the graph is incident to exactly one of the edges of the matching. In this formulation, we attribute a weight w_e ≥ 0 to the edge e of G that was initially identified by the MS spectrum match to each pair of cysteines.

The disulfide bond mass space BMS = {bms_i} of a protein is the set of masses of every possible pair of cysteine-containing peptides. A mass list ML = {ml_j} is the set of numbers that represent the masses of the precursor ions obtained from an LC/ESI-MS/MS experiment. A bond match bm_k between BMS and ML occurs between bms_i and ml_j when |bms_i - ml_j| < bmt, where bmt is defined as the bond mass tolerance, the amount of experimental uncertainty that ml_j is allowed to have to determine the match. A bond spectrum match is the set of matches BSM = {bsm_k} between ML and BMS.

In an LC/ESI-MS/MS experiment, each precursor ion undergoes collision-induced dissociation, resulting in fragment ions that constitute an MS/MS spectrum. If the precursor ion is a disulfide bonded pair of peptides, the fragmentation process typically keeps this bond intact. Let FML = {fml_u} denote the set of MS/MS values corresponding to the masses of the peptide fragments. A peptide fragment F is a substring of a peptide with mass m(F) = Σ_{r≤i≤s} m(a_i), where r and s denote the locations of the starting and ending amino acids of the peptide fragment. A disulfide bonded fragment F_{1,2} is a pair of peptide fragments F_1 and F_2, with mass m(F_{1,2}) = m(F_1) + m(F_2) - 2m(H). For there to be a disulfide bond between F_1 and F_2, each fragment must contain at least one cysteine. The disulfide bonded fragment mass space FMS = {fms_v} for two cysteine-containing peptides P_i and P_j is the set of every disulfide bonded fragment mass that can be obtained from these two peptides. A fragment match fm_k between FML and FMS occurs between fml_u and fms_v when |fms_v - fml_u| < fmt, where fmt is defined as the MS/MS mass tolerance, the amount of experimental uncertainty that fml_u is allowed to have to determine the match. An MS/MS spectrum match TSM is the set of matches TSM = {tsm_k} between FMS and FML. The match ratio r is then defined as the number of matches divided by the size of the tandem mass spectrum, i.e., r = |TSM|/|FMS|. Since each match ratio is a measure of how well the LC/ESI-MS/MS experimental data supports the hypothesis of a disulfide bond between two of the cysteines in the protein being analyzed, we denote r_{i,j} as the match ratio for a bond between cysteines C_i and C_j. As a result, each r_{i,j} is assigned as the weight w_e of the graph G which models the overall connectivity pattern. Thus, the disulfide linkage pattern identification problem is to find a perfect matching in G of maximum weight.
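To make the definitions above concrete, the following minimal Python sketch enumerates the disulfide bond mass space from the cysteine-containing peptides and performs the bond spectrum match. All function names are illustrative, residue_mass is an assumed lookup table of average residue masses, and the default tolerance mirrors the bmt value used later in the experiments.

    # Minimal sketch of BMS construction and the bond spectrum match.
    import itertools

    H2O, H = 18.011, 1.008  # average masses in Da (assumed values)

    def peptide_mass(seq, residue_mass):
        # m(p) = sum of residue masses + 18 Da for the terminal H and OH
        return sum(residue_mass[a] for a in seq) + H2O

    def disulfide_pair_mass(c1, c2, residue_mass):
        # m(D_{1,2}) = m(c1) + m(c2) - 2 m(H): two protons lost on bond formation
        return peptide_mass(c1, residue_mass) + peptide_mass(c2, residue_mass) - 2 * H

    def bond_spectrum_match(cys_peptides, ml, residue_mass, bmt=3.0):
        """All (pair, precursor mass) combinations within the bond mass tolerance."""
        bsm = []
        for c1, c2 in itertools.combinations(cys_peptides, 2):
            bms_i = disulfide_pair_mass(c1, c2, residue_mass)
            bsm += [((c1, c2), m) for m in ml if abs(bms_i - m) < bmt]
        return bsm

Each returned pair is a candidate edge of G; its weight w_e is the match ratio r_{i,j} computed later from the MS/MS spectrum match.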
2.2. Algorithmic Framework

Determining the disulfide linkage patterns involves solving the following four sub-problems:

1. Find the bond spectrum match BSM between the mass list ML and the disulfide bond mass space BMS.
2. Determine the MS/MS spectrum match TSM between the disulfide bonded fragment mass space FMS and the MS/MS mass list FML.
3. Find a perfect matching of maximum weight for a fully connected graph with |C| vertices, with edges of weight r_{i,j}.
4. Utilize experimental data that contains noise, isotopic variation, neutral loss, and charge state uncertainty to achieve the matchings in sub-problems 1 and 2.

In the following subsections, we present our approach to each of these sub-problems.
2.2.1. Finding the MS spectrum match

Let k denote the number of sites where an arbitrary protein A can be cleaved with a certain protease. The construction of the mass space then requires O(k²) time. This is because the k proteolytic amino acids divide the protein A into k+1 subsequences, leading to k(k+1)/2 unique pairs of subsequences that can be formed. For the case of disulfide bonds, we are concerned with forming unique pairs of subsequences from C as opposed to A. Because C is a proper subset of A for almost all proteins and proteases, the disulfide bond mass space BMS is likely to be smaller than the mass space obtained from every peptide in A. The quadratic time complexity can be further reduced if the data structure used to construct and search BMS did not require computing the mass of every member of BMS. The intuition lies in computing the masses of only those possible disulfide bonded peptides that are expected to be close in value to a given peak in the mass spectrum S. This can be done by use of the expected amino acid mass, defined as the weighted mean over the residue alphabet, m_e = Σ_i w_i m(a_i), where {w_i} denotes the relative abundance of each amino acid. Using published values for masses and relative abundances,¹⁹ we obtain m_e = 111.17 Da. Using this definition, we can predict the mass of a peptide as m(p) ≈ ||p|| m_e + 18, where ||p|| represents the number of amino acids contained in the peptide. The additional 18 Da was explained in Section 2.1. Statistically, this is justified to a first approximation because the weighted standard deviation, again using published data,¹⁹ is 28.86 Da. Thus, the number of amino acids in a bonded pair of peptides, denoted ||d_i||, can be used to construct BMS in such a way that it is approximately mass sorted. This is the motivation for exploring the use of a hash table to construct and search BMS.

The hash table is a well known data structure for efficient searching of a data space.²⁰ If the hash function employed satisfies the assumption of simple uniform hashing, then the expected time to search for an element is O(1). Simple uniform hashing means that, given a hash table T with |T| buckets, any data element d_i is equally likely to hash into any bucket, independently of where any other element has hashed to. Using the expected amino acid mass to predict the mass of a peptide, we implement the simple hash function h(d_i) = ||d_i|| as a first approximation. This results in our algorithm for this sub-problem (which we call MSHashID) having an overall complexity of O(|C|² + |BMS|), where |BMS| is the size of the mass spectrum.

Table 1 presents a toy example illustrating the construction of the hash table. In this example, the three pairs of peptides will be hashed to buckets 10, 12, and 14, respectively. Let the MS spectrum for peptides of the protein being considered in this example contain a mass peak having the value m(p) = 1332 Da. Following our approach, this results in an estimated number of amino acids of 12 (||p|| = 12). Subsequently, the corresponding bucket in the hash table is accessed.
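The following sketch illustrates the MSHashID idea; the bucket estimate uses the expected amino acid mass m_e = 111.17 Da quoted above, while the function and variable names are our own.

    # Sketch of MSHashID: hash each candidate peptide pair by its residue
    # count, then probe the bucket (and its neighbors) implied by an
    # observed precursor mass.
    from collections import defaultdict

    M_E, H2O, M_H = 111.17, 18.0, 1.008

    def build_hash_table(pairs):
        # h(d_i) = ||d_i||: bucket a pair by its total number of residues
        table = defaultdict(list)
        for c1, c2 in pairs:
            table[len(c1) + len(c2)].append((c1, c2))
        return table

    def probe(table, residue_mass, precursor_mass, bmt=3.0):
        est = round((precursor_mass - H2O) / M_E)  # e.g. 1332 Da -> ~12 residues
        hits = []
        for bucket in (est - 1, est, est + 1):     # neighbors absorb the 28.86 Da spread
            for c1, c2 in table.get(bucket, []):
                mass = (sum(residue_mass[a] for a in c1 + c2)
                        + 2 * H2O - 2 * M_H)       # m(c1) + m(c2) - 2 m(H)
                if abs(mass - precursor_mass) < bmt:
                    hits.append((c1, c2))
        return hits

Probing the two neighboring buckets reflects the step in the text where buckets 11 and 13 are also accessed around the estimated bucket 12.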
Table 1. Example showing how the hash table is constructed.

  1. Given protein sequence          2. Identify cysteine-   3. Form all possible     4. Determine number of
     and protease                       containing peptides     pairs of peptides        amino acids (bucket)
  EC*GRNVNC*TKAIQC*LDEH,             EC*GR                   EC*GR + NVNC*TK          10
  trypsin (cleaves after K and R)    NVNC*TK                 EC*GR + AIQC*LDEH        12
                                     AIQC*LDEH               NVNC*TK + AIQC*LDEH      14

In our example, this bucket contains the peptide pair that includes NVNC*TK. The mass of this pair is then computed, and compared with m(p) to determine if there is a match. Because there is a possibility that another bucket may contain a peptide pair with a matching mass, neighboring buckets (i.e., buckets 11 and 13) are also accessed.

2.2.2. Finding the MS/MS spectrum match

Based on experimental observation, when peptides undergo collision-induced dissociation (CID), the fragments produced are mostly either b-ions (containing the N-terminus) or y-ions (containing the C-terminus).³ We have also observed that the disulfide bond remains intact during CID. Let p1 denote a peptide with its possible y-ions y1 and b-ions b1, and similarly y2 and b2 for peptide p2. If p1 and p2 are in a disulfide bond, four types of fragments may occur: y1+y2, y1+b2, b1+y2, and b1+b2. The most convenient way to compute and display the disulfide bonded pair mass space is to generate four tables in which each row represents the mass of an ion of the first peptide and each column represents the mass of an ion of the second peptide. Then, each entry in this MS/MS mass table (subsequently referred to as mass table) is the sum of its row and column, minus 2m(H) Da.

Next, let peptides p1 and p2 consist of m and n amino acid residues, respectively. The first step is to compute the lowest and highest masses m_min and m_max in the mass table. The former is in the first row and first column of the mass table, and the latter is in its last row and last column. Also, because the dynamic range of amino acid residue masses is relatively small (about 3.3:1 in the extreme case of tryptophan:glycine), the increase in mass is approximately linear as the values are read "diagonally" from the lowest to the highest value. Thus, given an MS/MS fragment ion mass, it is possible to make an initial estimate of the location of the diagonal band of table masses that are most likely to match this fragment ion mass.

Let s be an MS/MS fragment ion mass peak value. If either s < m_min or s > m_max, the algorithm returns no value. Otherwise, in the second step, we compute the average amino acid residue mass E = (m(p1) + m(p2))/(n + m). This is the approximate mass difference between an element and the (up to) four elements that are a "step" away from it. A step is defined to be the movement of an index that points from an element to a neighboring element, either vertically or horizontally, in a mass table. Thus, the estimate of the number of steps used to index into the table to locate the band for a particular mass peak is n_steps = s / E. While any continuous path of steps from m_min to m_max can be used to locate the band, it is simplest to step along the perimeter of the mass table. In this algorithm, we start by stepping "down" along the first column, and then "across" along the last row.

We note that the initial estimate may not index into the actual location of the band. Therefore, we need a strategy to reach the actual location starting from the initial estimate. For relatively short peptides of under one hundred amino acid residues (much longer than usually encountered in tryptic digests), one can simply generate neighboring mass table elements along the path used to index into the table until the band is reached. The location of the band is identified as the index of the element that has the mass closest to s. Once the location of the band is identified, the remaining elements of the band are generated and compared to s. The second element will be found either directly above, or above and to the right (row = row - 1, column = column + k, where k depends on the relative sizes of the peptides) of the first element.

As an example, let the two amino acid sequences be p1 = NVNCTK and p2 = AIQCLDEH. Table 2 shows all of the possible y- and b-ions that contain a cysteine, as well as the mass of each ion. Note that for y-ions, an additional 18 Da is added to the sum of the residue masses. The resulting mass table for the b1 + y2 combination is shown in Table 3.

The algorithm described by this approach, IndexID, has a worst case time complexity of O(n + m) to locate the band. However, because this approach usually indexes into the mass table just a few elements away from the band, the time complexity can be estimated by a constant. Because the band is in general a diagonal along the mass table, generating the band elements has a complexity of O(√(nm)). Since IndexID is invoked for each instance of a bond spectrum match, the time complexity of the solution to sub-problem 2 for a protein is O(|FML| √(nm)), where |FML| is the size of the tandem mass spectrum.
Table 2. Example fragment mass space.

  Peptide   Ion type   Sequence    Mass (Da)
  1         y          CTK         351
  1         y          NCTK        446
  1         y          VNCTK       564
  1         y          NVNCTK      678
  1         b          NVNC        431
  1         b          NVNCT       532
  1         b          NVNCTK      660
  2         y          CLDEH       501
  2         y          QCLDEH      639
  2         y          IQCLDEH     752
  2         y          AIQCLDEH    813
  2         b          AIQCL       …
  2         b          AIQCLD      673
  2         b          AIQCLDE     …
  2         b          AIQCLDEH    810

Table 3. Example mass table: b1 + y2 - 2m(H).

                  CLDEH    QCLDEH   IQCLDEH   AIQCLDEH
                  (501)    (639)    (752)     (813)
  NVNC   (431)    930      1068     1181      1242
  NVNCT  (532)    1031     1169     1282      1343
  NVNCTK (660)    1159     1297     1410      1471
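The band search can be sketched as follows. Note that this is not the perimeter walk described above: for brevity, the sketch recovers the same diagonal band by a per-row binary search, which is O(n log m) rather than the O(√(nm)) incremental band generation of IndexID. b1 and y2 are assumed to be ascending lists of singly protonated fragment ion masses, such as the row and column masses of Table 3.

    # Sketch of locating the diagonal band of a b1 + y2 - 2m(H) mass table.
    import bisect

    def index_id(b1, y2, s, fmt=1.0):
        """Cells (i, j) of the mass table whose mass is within fmt of s."""
        H = 1.008
        n, m = len(b1), len(y2)
        if s < b1[0] + y2[0] - 2 * H or s > b1[-1] + y2[-1] - 2 * H:
            return []                     # outside the table's mass range
        hits = []
        for i in range(n):                # matching cells form a diagonal band
            target = s - b1[i] + 2 * H    # y2 mass that would complete a match
            j = bisect.bisect_left(y2, target - fmt)
            while j < m and y2[j] < target + fmt:
                hits.append((i, j))
                j += 1
        return hits

For the example of Table 3, index_id([431, 532, 660], [501, 639, 752, 813], 1169.0) would return the single cell (1, 1), corresponding to NVNCT + QCLDEH.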
2.2.3. Finding a perfect matching of maximum weight for a fully connected graph

Sub-problem 3, the maximum-weight perfect matching problem, is a well understood problem in graph theory. At present, the best performing algorithm that solves this problem for a fully connected graph was designed by Gabow.²¹ This algorithm has a worst-case bound of O(|C|³).
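In practice, any blossom-style solver can stand in for Gabow's implementation. A sketch using the networkx library follows; its max_weight_matching routine is of the same algorithmic family, though not Gabow's O(|C|³) implementation, and the match-ratio matrix r is assumed to be indexed by cysteine position.

    # Sketch of sub-problem 3 via networkx's maximum weight matching.
    import networkx as nx

    def connectivity_pattern(cysteines, r):
        G = nx.Graph()
        for a in range(len(cysteines)):
            for b in range(a + 1, len(cysteines)):
                G.add_edge(cysteines[a], cysteines[b], weight=r[a][b])
        # maxcardinality=True enforces the perfect-matching requirement
        return nx.max_weight_matching(G, maxcardinality=True)

The returned edge set is the predicted disulfide connectivity pattern.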
2.2.4. Consideration of missed proteolytic cleavages and intra-molecular bonded cysteines

In the laboratory, a protease used to digest a protein may sometimes miss a cleavage point. For example, a protein with sequence NRDKTA should be digested by trypsin into three peptides: NR, DK, and TA. However, if one cleavage point is missed, two peptides are created: either NRDK and TA, or NR and DKTA. We model this behavior by including the parameter m_max, the maximum number of missed cleavages allowed (a sketch of this digestion model follows this subsection). It can be inferred by induction that a protein with k cleavage sites and m_max = m will digest into (m + 1)k unique peptides, assuming k >> m. Note that m_max includes all smaller values of missed cleavage levels, e.g., m_max = 2 includes m = 1 and m = 0 as well. If m_max is small (e.g., three or smaller), missed cleavages can be considered to be a constant multiplicative factor in our time complexity analysis as described earlier.

Since the proteolytic digestion process may produce peptides that contain two or more cysteine residues, there is the possibility that intra-molecular bonds may occur, i.e., disulfide bonds exist within a single peptide. These peptides must be included in the mass list ML, with mass m(p) = Σ_i m(a_i) + 18 - 2m(H), if at most one disulfide bond per peptide is considered. The impact on time complexity is simply the larger disulfide bond mass space BMS, which can be modeled as an additive factor, f(|A|, |C|, m_max). The disulfide bonded fragment mass space for an intra-molecular bonded peptide consists of the union of the mass spaces of the possible b-ions and y-ions that can result from its fragmentation. For example, for the peptide ASICQQNCQY, the possible b-ions are b1, b2, b3, b8, b9, and b10, and the possible y-ions are y1, y2, y7, y8, y9, and y10. Thus the complexity of the solution to sub-problem 2 is increased by an additive factor, O(|FML| max[n, m]).
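The missed-cleavage model of the first paragraph can be sketched as follows; function and parameter names are illustrative, and the doctest line reproduces the NRDKTA example from the text.

    # Sketch of proteolytic digestion with up to m_max missed cleavages.
    def digest(sequence, cut_after=("K", "R"), m_max=1):
        sites = [i + 1 for i, a in enumerate(sequence) if a in cut_after]
        bounds = [0] + sites
        if not sites or sites[-1] != len(sequence):
            bounds.append(len(sequence))
        peptides = []
        for i in range(len(bounds) - 1):
            for m in range(m_max + 1):          # span m missed cleavage points
                if i + 1 + m < len(bounds):
                    peptides.append(sequence[bounds[i]:bounds[i + 1 + m]])
        return peptides

    # digest("NRDKTA", m_max=1) -> ['NR', 'NRDK', 'DK', 'DKTA', 'TA']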
2.2.5. Peak finding in the presence of noise

Using Bioworks software from Thermo-Fisher, the raw data obtained from an LC/ESI-MS/MS analysis of a single protein is converted to a series of DTA files. The DTA format is very simple: the first line contains the mass of the precursor ion and the peptide charge state as a pair of space-separated values. Subsequent lines contain space-separated pairs of fragment ion mass-to-charge ratios (denoted m/z) and intensity values. These lines are sorted in order of increasing m/z. Typically hundreds of DTA files are produced per analysis. A typical DTA file contains on the order of 10² to 10³ lines of fragment ion information. The intensity values can range from 1 up to several orders of magnitude larger. It is expected that only a fraction of these lines correspond to meaningful fragment ion peaks; we therefore introduce a threshold t, defined as a percentage of the maximum intensity, below which lines are discarded. Because the number of remaining candidates can still be large (> 100), a limit l is placed on the number of peaks to be considered for matching.

Next, we consider the correlation between MS/MS spectrum peaks and the mass/intensity lines in the associated DTA file. In the graphical representation of an MS/MS spectrum the peaks are very sharp. In the DTA file the more intense mass peaks typically occupy several neighboring lines, reflecting the slightly differing masses of the isotopes of a fragmented ion. If each line in a DTA file is considered to be a separate mass peak, the data analysis would be biased towards masses associated with more intense peaks. To correct for this bias, we represent a set of neighboring lines with similar intensity as a single peak. We formalize the concept of "neighborhood" by defining the maximum peak width p_w as the maximum difference in mass-to-charge ratio that two consecutive lines in the DTA file can have and yet be considered to be a single peak. "Similar" is defined as the absolute difference in intensity of two neighboring peaks being less than 50% of the larger intensity. We denote the set of peaks that result as p_i, where 0 ≤ i ≤ l.

Figure 1 illustrates an example of how the threshold t, limit l, and maximum peak width p_w work together to find the best mass peaks. Let the masses of the six peaks shown there be labeled a through f, and let t = 10%, l = 2 and p_w be the mass range as shown. Peak c has the maximum intensity; peak f is eliminated, since its intensity is less than 10% of c's. Peaks a, b, and c fall within one peak window and would have been replaced by a single peak with the average mass of these three peaks. However, because the intensity of peak b is less than 50% of that of peak a, this is not done. Instead, the peak window moves to peaks c, d, and e. In this case, these peaks are replaced by a single peak of mass (c + d + e)/3. Since the limit is two, this peak and peak a are identified as the peaks to use for subsequent analysis.

Figure 1. Illustrative example of peak finding.
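The following is one plausible reading of the peak finding rule as Python; the merging of a group into its average mass and the 50% similarity test follow the description above, while the exact tie-breaking behavior is our assumption.

    # Sketch of peak finding with threshold t, limit l and peak width p_w.
    # lines is the DTA body as (m/z, intensity) pairs sorted by m/z.
    def find_peaks(lines, t=0.02, l=50, p_w=2.0):
        top = max(i for _, i in lines)
        kept = [(m, i) for m, i in lines if i >= t * top]   # threshold filter
        peaks, group = [], [kept[0]]
        for m, i in kept[1:]:
            prev_m, prev_i = group[-1]
            # merge neighbors that are close in m/z and similar in intensity
            if m - prev_m <= p_w and abs(i - prev_i) < 0.5 * max(i, prev_i):
                group.append((m, i))
            else:
                peaks.append((sum(g[0] for g in group) / len(group),
                              max(g[1] for g in group)))
                group = [(m, i)]
        peaks.append((sum(g[0] for g in group) / len(group),
                      max(g[1] for g in group)))
        peaks.sort(key=lambda p: -p[1])                     # keep the l most intense
        return peaks[:l]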
2.2.6. Addressing isotopic variation and neutral loss
To account for the possibility of neutral loss, for each element fml_u of the MS/MS mass list FML computed in the preceding section, we add three more elements: m(fml_u) + m(H2O), m(fml_u) + m(NH3), and m(fml_u) + m(H2O) + m(NH3), where m(H2O) is the mass of a water molecule and m(NH3) is the mass of an ammonia molecule. This accounting increases the size of the MS/MS mass list FML by a factor of four. In addition, we use the average masses for amino acid residues to compute the mass of peptides and their fragments with molecular weights greater than 1500 Da. Our experiments using an ion trap indicate that this results in more accurate correlations with observed fragment ion peaks than simply using monoisotopic masses. As a consequence of this step, we empirically observed a much closer correlation between the MS/MS mass list FML and the disulfide bonded fragment mass space FMS values.
2.2.7. Interpretation of peaks given charge state uncertainty
For some low resolution mass spectrometers, it has been observed that the charge state of the precursor ion used to generate the MS/MS spectra may be reported incorrectly. An incorrect number for the charge state will significantly impact the MS/MS mass space that is searched for matches. To address such cases our system is implemented such that the user can intervene and correct the mass assignment.

Next, we examine how to process the values of fragment ion m/z in the DTA file to obtain the MS/MS mass list FML used to search for matches with the disulfide bonded fragment mass space FMS. Let the charge state value (reported or corrected) for a DTA file be denoted as c. No fragment of the precursor ion can have a charge larger than c. Then each element of FML is obtained by computing FML_i(z) = z p_i - (z - 1) m(H), where 1 ≤ z ≤ c and 1 ≤ i ≤ l, and m(H) is the mass of a single proton. The second term is needed because FMS is computed for singly protonated ions.
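The charge state expansion is a one-liner; the sketch below derives FML from the l filtered peaks, with the function name and the proton-mass constant as assumptions.

    # Sketch of deriving the MS/MS mass list FML from DTA m/z peaks:
    # FML_i(z) = z * p_i - (z - 1) * m(H), since FMS holds singly
    # protonated masses; c is the (reported or corrected) charge state.
    M_H = 1.00728  # proton mass, Da

    def fragment_mass_list(peaks, c):
        return sorted({z * mz - (z - 1) * M_H for mz in peaks for z in range(1, c + 1)})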
2.2.8. Overall complexity

The overall time complexity of our algorithmic approach is computed as follows:

1. Finding the bond spectrum match BSM between the mass list ML and the disulfide bond mass space BMS is performed once per analysis, with a time complexity of |ML| O(MSHashID) = O(|ML|(|C|² + |BMS|)).
2. Determining the MS/MS spectrum match TSM between the disulfide bonded fragment mass space FMS and the MS/MS mass list FML is performed each time there is a bond spectrum match, or |BSM| times, with a time complexity of |BSM| O(IndexID) = O(|BSM| |FML| √(nm)).
3. Finding a perfect matching of maximum weight for a fully connected graph with |C| vertices has a time complexity of O(|C|³).
4. The techniques developed to utilize experimental data constitute a constant factor multiplying |ML| and |FML|.

Thus, the overall complexity of our approach is O(|ML|(|C|² + |BMS|) + |BSM| |FML| √(nm) + |C|³). Since n, m and |C| are typically small (< 100), the performance of this algorithm is dominated by |ML|, |FML| and the I/O cost to process the spectrum data.
3. EXPERIMENTAL RESULTS
3.1. Description of the Data and Experimental Procedures
The proposed method was validated utilizing MS and MS/MS data obtained by LC/ESI-MS/MS analysis for three eukaryotic glycosyltransferases with varying numbers of cysteines and disulfide bonds:

1. Mouse Core 2 1,6-N-Acetylglucosaminyltransferase I (C2GnT-I)²²
2. ST8Sia IV Polysialyltransferase (ST8Sia IV)²
3. Human Fucosyltransferase VII (FucT VII)²³

The disulfide linkage pattern for each of these proteins is known and reported in each cited reference. The experimental data was obtained using a capillary liquid chromatography system coupled with a Thermo-Fisher LCQ ion trap mass spectrometer; this LC/ESI-MS/MS system was used to obtain the MS and MS/MS data. Further details of the experimental protocols used are available.²⁴ We obtained the primary sequences from the Swiss-Prot database,²⁵ and DTA files were obtained from LC/ESI-MS/MS analyses of each protein.

For each experiment, we set the bond mass tolerance bmt = 3.0 Da, the maximum peak width p_w = 2 Da, the threshold t = 2% of the maximum intensity, and the limit l = 50 peaks. We used the MS/MS mass tolerance fmt = 1.0 Da, except when intra-molecular bonded cysteines were identified, when a value of 1.5 Da was used. The protease is set to what was used in the actual experiment. We set the maximum number of missed cleavages allowed m_max = 1, except for one case where a combination of trypsin and chymotrypsin was used, where we set m_max = 3.
3.2. Summary of Results

The proposed method was applied to determine the disulfide-bonding patterns of three proteins, with varying numbers of cysteines and disulfide bonds. Our results are presented in the form of a connectivity matrix, as proposed in Ref. 26. Each matrix element below the diagonal corresponds to a possible disulfide bond. In this matrix we indicate the "known" linkage patterns by a gray shaded matrix element. If our method computes a match ratio of over 50% for a particular combination, we record it in the table. In addition, we assign one of the values TP, FP, FN, or TN to each matrix element per the following conventions:

- For match ratios of at least 50%, a true positive (TP) is assigned if the same matrix element is shaded gray.
- A false positive (FP) is assigned if the matrix element is not shaded.
- A false negative (FN) is assigned to a matrix element if the matrix element is shaded but its match ratio is less than 50%.
- A true negative (TN) is assigned otherwise.

Table 4 summarizes our results for an analysis of 233 DTA files of C2GnT-I. For this dataset, the charge state reported in two DTA files needed to be reinterpreted in order to avoid false negative results. In Table 5 we present the results from the analysis of 79 DTA files of ST8Sia IV, and Table 6 contains the results obtained from the analysis of 158 DTA files of FucT VII.

We evaluate the performance using the following metrics:

- Precision P = TP/(TP+FP)
- Recall R = TP/(TP+FN)
- Specificity S = TN/(TN+FP)

Table 7 summarizes our results for these metrics. Although our precision result for C2GnT-I is low compared to the precision results for ST8Sia IV and FucT VII, it still compares favorably with the results reported by the purely predictive methods.¹¹⁻¹³ In addition, we note that we can improve the precision from P = 0.40 to P = 0.70 if we choose to ignore all match ratios less than 85%.
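The three metrics translate directly into code; this tiny sketch takes the counts tallied from the connectivity matrix.

    # Sketch of the evaluation metrics applied to the connectivity matrix.
    def evaluate(tp, fp, fn, tn):
        return {"precision":   tp / (tp + fp),
                "recall":      tp / (tp + fn),
                "specificity": tn / (tn + fp)}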
Table 5. ST8Sia IV validation testing results (connectivity matrix over cysteine locations 142, 156, 292, and 356).
Table 6. FucT VII validation testing results (connectivity matrix over cysteine locations).
Table 7. Overall performance results (precision, recall and specificity for C2GnT-I, ST8Sia IV and FucT VII).
Following the implementations of the purely predictive methodology, we adapted WMATCH, Rothberg's implementation of Gabow's algorithm,¹⁴,²¹ to find the maximum weight matching. This analysis component was only conducted for the C2GnT-I intermediate results, as the linkage patterns for ST8Sia IV and FucT VII are already evident. Our result was in agreement with the published bonding pattern.²²
3.2.1. Analysis of the effect of varying threshold t on results

The values we used for many of the parameters introduced in this paper, such as the threshold t, limit l, and maximum peak width p_w, were based on heuristics developed by experimenters. In this section, we examine the effect of varying the threshold t on our results. We used the C156-C356 bond in ST8Sia IV for data. Figure 2 consists of two graphs: (1) a plot of match ratio vs. t, and (2) a plot of the fraction of total peaks used vs. t. The intersection of these two graphs is close to t = 2, confirming that the heuristic value used in our experiments optimizes performance and data utilization.
Figure 2. Match ratio and peak utilization vs. threshold t.
3.2.2. Comparison with the MS2Assign program

As discussed in Section 1, the program MS2Assign can be configured to process MS/MS data to identify disulfide bonds in proteins. However, we note that while MS2Assign automates the identification of disulfide bonds, it does not do so in a high-throughput manner. For example:

- The two peptides MS2Assign takes as input must be obtained from another program, such as Peptidemap.
- MS2Assign accepts the input of only one MS/MS mass list (from one DTA file).

Also, because MS2Assign does not account for experimental noise, isotopic variation, or the intensity of the fragmented ion, the accuracy of its results may not be as high as the accuracy of a program that takes these factors into consideration. To investigate this, we identified the DTA files that MS2DB used to obtain match ratios for C13 to C59 (a true positive identification) and C199 to C413 (a false positive identification) of C2GnT-I. We then copied the fragment ion m/z portion of the file to use for the Peak List in MS2Assign. Our results are summarized in Tables 8 and 9.

Table 8. Comparison for the true positive identification (C13 to C59).

  Program     Number of peaks utilized   Number of matches   Match ratio
  MS2Assign   1774                       1646                0.93
  MS2DB       50                         48                  0.96

Table 9. Comparison for the false positive identification (C199 to C413).

  Program     Number of peaks utilized   Number of matches   Match ratio
  MS2Assign   2169                       1791                0.78
  MS2DB       50                         44                  0.72
These studies suggest that MS2DB may be slightly better than MS2Assign at discriminating between a true positive and a false positive result. More studies are needed to support this conclusion.

4. CONCLUSIONS AND DISCUSSION
In this paper we have presented a comprehensive algorithmic framework for the determination of disulfide bonds by utilizing data from tandem mass spectrometry. The proposed approach involves addressing four key sub-problems. First, the match between a given mass spectrum and the set of every possible pair of cysteine-containing peptides of the given protein is obtained. Next, the correspondence between the tandem mass spectrum and the set of every disulfide bonded fragment mass is determined. The actual disulfide connectivity pattern is then determined by solving the maximum weight matching problem. The salient contribution of our approach is the use of real-world data from mass spectrometry in the above steps. Doing so requires addressing a series of algorithmic challenges that include peak finding in noisy spectra, addressing issues of isotopic variation and neutral loss, peak interpretation in the presence of charge state uncertainty, consideration of both inter-peptide and intra-peptide bonds, and consideration of missed proteolytic cleavages. Until now, techniques for disulfide bond identification have tended to remain on either side of the model-or-measure dichotomy. The proposed work seeks to span this divide and identifies the core algorithmic challenges at the intersection of purely computational and purely experimental strategies. Experimental results highlight the high precision and recall that can be obtained with such a hybrid strategy. Another advantage of this approach is its data-driven and high-throughput nature. An implementation of our approach is available for public use at http://tintin.sfsu.edu:3319/ms2db/.
Acknowledgments

The research presented in this paper was partially supported by grants IIS-0644418 and CHE-0619163 of the National Science Foundation, a grant from the Center for Computing in Life Science of San Francisco State University, and grant P20MD000262 from the NIH. The authors also thank the anonymous reviewers for their comments.
References
1. Creighton TE, Zapun A, Darby NJ. Mechanisms and catalysts of disulfide bond formation in proteins. Trends in Biotechnology 1995; 13:18-23.
2. Angata K, Yen TY, El-Battari A, Macher BA, Fukuda M. Unique disulfide bond structures found in ST8Sia IV polysialyltransferase are required for its activity. J Biol Chem 2001; 18:15369-15377.
3. Gorman JJ, Wallis TP, Pitt JJ. Protein disulfide bond determination by mass spectrometry. Mass Spectrometry Reviews 2002; 21:183-216.
4. Brunger AT. X-ray crystallography and NMR reveal complementary views of structure and dynamics. Nature Structural Biology 1997; 4 Suppl:862-865.
5. Klepeis JL, Floudas CA. Prediction of β-sheet topology and disulfide bridges in polypeptides. J Comput Chem 2003; 24:191-208.
6. Taskar B, Chatalbashev V, Koller D, Guestrin C. Learning structured prediction models: A large margin approach. Proc. of the International Conference on Machine Learning; 2005.
7. Ferre F, Clote P. Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics 2005; 21:2336-2346.
8. Vullo A, Frasconi P. Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics 2004; 20:653-659.
9. Baldi P, Cheng J, Vullo A. Large-scale prediction of disulphide bond connectivity. Advances in Neural Information Processing Systems 2004; 11:97-104.
10. Tsai CH, Chen BJ, Chan CH, Liu HL, Kao CY. Improving disulfide connectivity prediction with sequential distance between oxidized cysteines. Bioinformatics 2005; 21:4416-4419.
11. DiANNA: http://clavius.bc.edu/~clotelab/DiANNA/
12. DISULFIND: http://disulfind.dsi.unifi.it/
13. PreCys: http://bioinfo.csie.ntu.edu.tw:5433/Disulfide/
14. WMATCH: Solver for the Maximum Weight Matching Problem: http://elib.zib.de/pub/Packages/mathprog/matching/weighted/
15. MS-Bridge: http://prospector.ucsf.edu/prospector
16. X! Protein Disulphide Linkage Modeler: http://www.systemsbiology.ca/xbang/DisulphideModeler/DisulphideModeler.html
17. Peptidemap: http://prowl.rockefeller.edu/prowl
18. MS2Assign: http://roswell.ca.sandia.gov/~mmyoung/ms2assign.html
19. http://prowl.rockefeller.edu/aainfo/struct.htm
20. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, MIT Press, 2001: 224-229.
21. Gabow H. Implementation of Algorithms for Maximum Matching on Nonbipartite Graphs. Ph.D. thesis, Stanford University, 1973.
22. Yen TY, Macher BA, Bryson S, Chang X, Tvaroska I, Tse R, Takeshita S, Lew AM, Datti A. Highly conserved cysteines of mouse Core 2 1,6-N-acetylglucosaminyltransferase I form a network of disulfide bonds and include a thiol that affects enzyme activity. J Biol Chem 2003; 278:45864-45881.
23. De Vries T, Yen T, Joshi RK, Storm J, van den Eijnden DH, Knegtel RMA, Bunschoten H, Joziasse DH, Macher BA. Neighboring cysteine residues in human fucosyltransferase VII are engaged in disulfide bridges, forming small loop structures: a proposed 3D model based on location of cysteines, and threading and homology modeling. Glycobiology 2001; 11:423-432.
24. Yen TY, Macher BA. Determination of glycosylation sites and disulfide bond structures using LC/ESI-MS/MS analysis. Methods in Enzymology 2006; 415:103-113.
25. Swiss-Prot database: http://us.expasy.org/
26. Fariselli P, Casadio R. Prediction of disulfide connectivity in proteins. Bioinformatics 2001; 17:957-964.
Biomedical Application
CANCER MOLECULAR PATTERN DISCOVERY BY SUBSPACE CONSENSUS KERNEL CLASSIFICATION

Xiaoxu Han
Department of Mathematics and Bioinformatics Program, Eastern Michigan University, Ypsilanti, MI 48197, USA
xiaoxu.han@emich.edu

Efficient discovery of cancer molecular patterns is essential in molecular diagnostics. The characteristics of gene/protein expression data challenge traditional unsupervised classification algorithms. In this work, we describe a subspace consensus kernel clustering algorithm based on the projected gradient nonnegative matrix factorization (PG-NMF). The algorithm is a consensus kernel hierarchical clustering (CKHC) method in the subspace generated by the PG-NMF. It integrates convergence-sound parts-based learning, subspace and kernel space clustering in microarray and proteomics data classification. We first integrated subspace methods and kernel methods by following our framework of input space, subspace and kernel space clustering. We demonstrate more effective classification results from our algorithm by comparison with those of the classic NMF and sparse-NMF classifications and supervised classifications (kNN and SVM) for four benchmark cancer datasets. Our algorithm can generate a family of classification algorithms in machine learning by selecting different transforms to generate subspaces and different kernel clustering algorithms to cluster data.
1. INTRODUCTION
With the development of genomics and proteomics, molecular diagnostics has emerged as a new tool to diagnose cancers. It takes a patient's tissue or blood samples and uses DNA microarray or mass spectrometry (MS) based proteomics techniques to generate their gene expressions or protein expressions. The gene/protein expressions reflect gene/protein activity patterns in different types of cancerous or precancerous cells. They are molecular patterns or molecular signatures of cancers. Different cancers will have different molecular patterns, and the molecular patterns of a normal cell will be different from those of a cancer cell. Clinicians identify potential biomarkers by analyzing the gene/protein patterns. However, robustly classifying cancer molecular patterns is still a challenge for clinicians and bioinformaticians.

Many classification methods from statistics and machine learning have been proposed for cancer molecular pattern classification. These methods can be generally classified as supervised classification methods, such as k-nearest neighbor (kNN), linear discriminant analysis (LDA), neural networks (NN) and support vector machines (SVM); unsupervised classification (clustering) methods, such as hierarchical clustering (HC), self-organizing maps (SOM) and principal component analysis (PCA); and their variants, such as particle swarm optimization support vector machines (PSO-SVM), kernel principal component analysis (KPCA), etc.⁴⁻⁷

We are particularly interested in unsupervised molecular pattern discovery algorithms, because they do not need or have prior knowledge about the data. They also have the potential to explore the latent structure of data. However, the traditional clustering algorithms HC and SOM have already been shown to be unstable for gene and protein expression data, although they are widely used in the cancer molecular pattern discovery community.⁴,⁸,¹⁵

Actually, the characteristics of gene and protein expression data challenge traditional unsupervised classification algorithms. These high dimensional data can be represented by an n × m matrix after preprocessing. The row data in the matrix are the expression levels of a gene across different experiments, or the intensity values of a measured data point in different samples (observations) corresponding to an m/z ratio. The column data are the gene expression levels of a genome under an experiment, or the intensity values of all measured data points in a sample corresponding to m/z ratios. Usually, n >> m; that is, the number of variables in a dataset is much greater than the number of observations/experiments. For gene expression data, the row number in the matrix is usually on the order of 5000; for proteomics data, the matrix column number is < 200 and the matrix row number is generally on the order of 10⁵-10⁶. These data are not noise free, because their raw data contain noise that preprocessing algorithms cannot remove completely. Although there are a large number of variables in these data, only a small set of variables account for most of the data variation.
1.1. Nonnegative matrix factorization

It is obvious that dimension reduction / feature selection should be conducted to reduce data to a much lower dimension before classification. Several well-known global feature selection methods, such as principal component analysis (PCA), singular value decomposition (SVD) and independent component analysis (ICA), have been applied in cancer molecular pattern classification.⁹,¹⁰,¹¹,¹² However, the holistic feature selection mechanism of these methods prevents the alternative of local feature selection. For example, PCA can only capture the global characteristics of data, and each principal component (PC) contains information from all input variables. This makes it hard to interpret PCs intuitively. Data representation in PCA is not "purely additive": each PC has both positive and negative entries, which are likely to partly cancel each other in the feature selection. On the other hand, there is a local feature selection algorithm, nonnegative matrix factorization (NMF), with a parts-based learning mechanism.¹³ In contrast to the global feature selection algorithms, NMF can capture variables contributing to the local characteristics of data with obvious interpretations. It makes the global characteristics simple "additions/combinations" of the local characteristics. In fact, data representation in NMF is purely additive because of the nonnegativity constraints in the NMF.

Given a nonnegative matrix X ∈ R^{n×m} and a rank r < min(n, m), NMF is a nonlinear programming problem to find two optimal nonnegative matrices W ∈ R^{n×r} and H ∈ R^{r×m} that minimize the reconstruction error between the matrices X and WH, measured by a distance metric E(W, H) = ||X - WH||; that is, X ≈ WH. We name W a basis matrix and H a feature matrix. The columns of W (a set of bases) set up a new coordinate system, and the elements of H are the coordinates of X in this new coordinate system. The feature matrix H is the prototype dataset of X after feature selection, where each column is the prototype of an observation. After NMF, each column (observation) of X can be represented approximately as a linear combination of the r bases W_i, i = 1, 2, ..., r:

    x_j ≈ Σ_{i=1}^{r} W_i h_{ij}                                        (1)

That is, each observation is expressed as the product of the basis matrix and its corresponding prototype after feature selection. The objective function E(W, H) = ||X - WH|| can be expressed as the Euclidean distance or the Kullback-Leibler (KL) divergence between X and WH. For example, the Euclidean distance objective function is defined as follows:

    E(W, H) = ||X - WH||² = Σ_{i,j} (X_{ij} - (WH)_{ij})²               (2)

Lee and Seung gave a multiplicative update algorithm for NMF by conducting dynamic-step gradient descent learning with respect to W and H.¹³ The iteration schemes for the Euclidean distance objective function are as follows (the iteration schemes for the KL divergence are similar); in the iteration, W and H are initialized randomly:

    H_{aj} ← H_{aj} (WᵀX)_{aj} / (WᵀWH)_{aj},
    W_{ia} ← W_{ia} (XHᵀ)_{ia} / (WHHᵀ)_{ia}                            (3)
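A minimal NumPy sketch of the multiplicative updates in Eq. (3) follows; the iteration count, seed and epsilon guard are illustrative choices, not part of the original formulation.

    # Sketch of the Lee-Seung multiplicative updates for ||X - WH||^2.
    import numpy as np

    def nmf(X, r, iters=500, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W, H = rng.random((n, r)), rng.random((r, m))  # random initialization
        for _ in range(iters):
            H *= (W.T @ X) / (W.T @ W @ H + eps)       # update H, Eq. (3)
            W *= (X @ H.T) / (W @ H @ H.T + eps)       # update W, Eq. (3)
        return W, H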
The multiplicative update algorithm works well experimentally. However, there is no guarantee that it converges to local minimum points of the objective function, because the limit of the non-increasing sequence {(W^(k), H^(k))} generated by the multiplicative update algorithm may not be a stationary point;¹⁴ that is, it lacks "convergence soundness".

Brunet et al. used NMF to classify cancer molecular patterns by conducting NMF based clustering of gene expression data.¹⁵ Their NMF clustering consists of three steps. First, decompose the gene expression data X under a rank r by the multiplicative update algorithm, i.e., each observation is represented as a linear combination of bases by Eq. (1), where h_{ij} is the i-th element of H_j, which is the prototype of the j-th observation X_j after feature selection. Second, clustering is conducted by the following query asked by each sample: 'which basis has the largest expression level in my prototype? I will belong to the cluster associated with that basis'. For example, suppose h_{ij} is the largest value in H_j; then sample X_j will be assigned to cluster i, because the i-th basis has the largest expression level in its prototype H_j. The number of clusters is just the decomposition rank r. Finally, the rank leading to the most meaningful clustering is decided by a Monte Carlo based model selection mechanism, by finding the rank with the maximum cophenetic correlation coefficient in the hierarchical clustering. The cophenetic correlation coefficient is a measure to evaluate the stability of a hierarchical clustering: it is the correlation between the pairwise distances and the linkage distances in the hierarchical clustering. A large cophenetic correlation coefficient value indicates high stability of a hierarchical clustering. Brunet et al. proved this method superior to the HC and SOM methods for three benchmark cancer datasets.¹⁵

Inspired by this work, Gao and Church developed a sparse nonnegative matrix factorization to cluster the cancer samples by adding sparseness control to the basic NMF formulation (sparse-NMF).¹⁶,¹⁷ They demonstrated that sparse-NMF based clustering was superior to the basic NMF clustering method for the same datasets. However, Brunet et al.'s NMF based clustering method has the following weak points:

1. The multiplicative update algorithm in the NMF lacks convergence soundness.
2. The model selection mechanism in the NMF clustering is expensive, because it requires computing cophenetic correlation coefficients for the hierarchical clustering conducted at all possible ranks to decide the final optimal decomposition rank.
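Brunet et al.'s cluster-assignment query reduces to a column-wise argmax over the feature matrix; a one-line sketch:

    # Sketch of the NMF clustering query: each sample joins the cluster of
    # the basis with the largest coefficient in its prototype H_j.
    import numpy as np

    def nmf_cluster_labels(H):
        return np.argmax(H, axis=0)   # one label per column (sample) of H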
1.2. Contributions

In this study, we describe a subspace consensus kernel clustering technique based on the projected gradient nonnegative matrix factorization (PG-NMF), which was developed by Lin,¹⁴ to conduct cancer molecular pattern classification for microarray and proteomics data. The PG-NMF has sound convergence and converges faster than the basic NMF.¹⁴ In addition, we present the ideas of input space, subspace and kernel space clustering before elaborating on our PG-NMF based classification method under the framework of subspace and kernel space clustering. The idea of our method is to transform a gene/protein expression dataset X ∈ R^{n×m} into a subspace S ⊂ R^n by using the PG-NMF algorithm. Then, a consensus kernel hierarchical clustering algorithm (CKHC) is developed to cluster the projections of the dataset X in the subspace S to infer the latent structure of the data. We have shown that the PG-NMF based subspace kernel clustering (PG-NMF-CKHC) is superior to the basic NMF and sparse-NMF clustering and to supervised classification (kNN and SVM) in cancer molecular pattern discovery for four benchmark cancer datasets.

This paper is organized as follows. Section 2 presents the concepts of input space, subspace and kernel space clustering, before our PG-NMF based consensus kernel hierarchical clustering is introduced in Section 3. Section 4 shows the experimental results of our algorithm. Finally, we discuss possible generalizations of the algorithm and draw conclusions.
2. INPUT SPACE, SUBSPACE AND KERNEL CLUSTERING

For a given dataset X = (x_1, x_2, ..., x_m) ∈ R^{n×m}, clustering is to find an implicit classification function f: X → Γ that maps each data sample x_i to its target function value y_j (label) in a set Γ according to some dissimilarity metric (j = 1, 2, ..., |Γ|). Data samples with the same target function value (label) after classification are said to share the same cluster. We classify clustering as input space, subspace and kernel space clustering according to where the implicit classification function f is computed.

In input space clustering, the implicit classification function f is computed in the input space R^{n×m} of the dataset. Hierarchical clustering (HC), K-means clustering and expectation maximization (EM) clustering all belong to input space clustering. In kernel space clustering, the classification function f is computed in a kernel space Ω of the input space, which is a high dimensional Hilbert space generated by a feature map function Φ: X → Ω, dim(Ω) >> dim(X); that is, the clustering is conducted on the high dimensional data Φ(X). On the other hand, in subspace clustering, the classification function f is computed in a subspace S of the input space, generated by a linear or nonlinear transform φ, dim(S) ≤ dim(X). Generally, almost all input space clustering methods can be used in subspace clustering to cluster the feature data in the subspace. However, not all input space clustering algorithms have corresponding kernel space clustering algorithms. In the following, we use HC as an example to demonstrate input space, subspace and kernel space clustering.
2.1. Subspace clustering

A subspace S is generated from a linear or nonlinear transform φ: X ∈ R^{n×m} → X* ∈ R^{r×m}, and clustering is conducted on the transformed data X*. For example, SOM and PCA based clustering are typical subspace clustering approaches. Most likely, the subspace has a lower dimensionality than the original dataset, i.e., dim(S) < dim(X). Each transform φ applied to X can be represented as TX = X*, where T is the matrix representation of the transform φ. Writing it as a matrix decomposition of X, we have X = WX*, where the matrix W is the inverse or pseudo-inverse of the matrix T. We still call W a basis matrix and X* a feature matrix. The columns of the basis matrix span the subspace: S = span(W_1, W_2, ..., W_r). Depending on the properties of the transform φ, the basis matrix may not be unique, and the corresponding matrix decomposition may not be unique either. Geometrically, each column of X* gives the coordinates of each observation/column of X in the subspace S, which can be viewed as a new coordinate system.

Self-organizing map clustering can be viewed as a simple subspace clustering, where the target function value of each sample is determined by the location of its corresponding reference vector of the best matching unit (BMU) on the SOM plane. In the nonlinear transform conducted by a self-organizing map (SOM), the feature matrix X* is called the prototype data, including all reference vectors on the SOM plane. The subspace bases (W_1, W_2, ..., W_r) can be obtained by solving r least squares problems, where r is the number of neurons on the SOM plane.

Actually, the transform φ can be implemented by any linear or nonlinear feature selection method, such as principal component analysis (PCA), independent component analysis (ICA), self-organizing maps (SOM) and nonnegative matrix factorization (NMF). Spectral analysis methods like the fast Fourier transform and wavelet transform can also implement φ. That is, any input space clustering algorithm can be employed to cluster the feature data X*. For example, clustering the data principal components (PCA clustering) by HC or other input space clustering methods is a typical subspace clustering, where the subspace generated by the PCA transform is an orthogonal space.¹⁸ Similar are the hierarchical clustering of the independent components of data (ICA clustering) and of the FFT coefficients of data (FFT clustering).¹⁹
2.2. Kernel space clustering: conduct clustering in a high dimensional space with kernel tricks

Kernel space clustering conducts clustering in the kernel/feature space Ω of a dataset X ∈ R^{n×m}. The motivation for kernel space clustering is that classification/learning in a high dimensional space can have desirable results. We use the kernel tricks to avoid the huge computational complexity of clustering in the feature space Ω. To apply the kernel tricks in clustering, we first need to formulate an input space clustering algorithm in inner product form. Then a kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩ is employed to evaluate all the inner products. The kernel function has to satisfy the Mercer theorem.²⁰ Through the kernel tricks, classification/clustering can be conducted in a high dimensional space while only paying input-space-level computational complexity, and the feature map Φ need not be explicit. Although several input space clustering methods have corresponding kernel extensions, we give the kernelization of hierarchical clustering (HC) in this work. Qin et al. mentioned applications of kernel hierarchical clustering to gene expression data. However, they only gave an approximation based kernel extension rather than a rigorous kernel extension of classic hierarchical clustering.

Kernelization of the general hierarchical clustering algorithm consists of two steps: kernelize the pairwise distance computation and the linkage computation. In the kernelization of the pairwise distances, we focus on the Euclidean and correlation distances because they are the most used dissimilarity metrics in HC. The Euclidean distance between samples x_i and x_j in the kernel space can be kernelized as

    d(Φ(x_i), Φ(x_j)) = (K_ii - 2K_ij + K_jj)^{1/2}                     (5)

where K_ij = K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩. In the kernelization of the correlation distance between samples x_i and x_j, if we assume the mapped vectors Φ(x_i), Φ(x_j) are zero mean data in the kernel space, then the correlation distance between Φ(x_i) and Φ(x_j) can be formulated in the inner product form

    c_ij = 1 - ⟨Φ(x_i), Φ(x_j)⟩ / (⟨Φ(x_i), Φ(x_i)⟩^{1/2} ⟨Φ(x_j), Φ(x_j)⟩^{1/2})    (6)

where c_ij = c(Φ(x_i), Φ(x_j)). However, we shall drop this zero-mean assumption in the kernel space for more general practice. We use the expectation of all feature data to center each feature datum:

    Φ̃(x_i) = Φ(x_i) - (1/m) Σ_{k=1}^{m} Φ(x_k)                          (7)

Then the corresponding correlation distance can be formulated in a form similar to Eq. (6). Let K̃_ij = ⟨Φ̃(x_i), Φ̃(x_j)⟩; then we have

    K̃_ij = K_ij - (1/m) Σ_k K_ik - (1/m) Σ_k K_kj + (1/m²) Σ_{k,l} K_kl  (8)

Since the kernel matrix K̃ is a semi-positive definite matrix, summarizing the previous results, the correlation distance in the kernel space between Φ(x_i) and Φ(x_j) can be computed as

    c̃_ij = 1 - K̃_ij / (K̃_ii^{1/2} K̃_jj^{1/2})                           (9)

The extension of the single, complete and average linkages to the kernel space is trivial, but not that of the centroid linkage. The centroid linkage between two clusters is defined as the Euclidean distance between the centroids of the two clusters. We give the centroid linkage d_rs between clusters C_r and C_s in Eq. (10):

    d_rs = [ (1/|C_r|²) Σ_{i,i'} k_{ii'}^{(r)} - (2/(|C_r||C_s|)) Σ_{i,j} k_{ij}^{(r,s)} + (1/|C_s|²) Σ_{j,j'} k_{jj'}^{(s)} ]^{1/2}    (10)

where x_i^{(r)} is the i-th sample in cluster C_r; |C_r| and |C_s| are the numbers of samples in clusters C_r and C_s; and k_{ii'}^{(r)} = k(x_i^{(r)}, x_{i'}^{(r)}), k_{jj'}^{(s)} = k(x_j^{(s)}, x_{j'}^{(s)}) and k_{ij}^{(r,s)} = k(x_i^{(r)}, x_j^{(s)}).
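Eqs. (5)-(9) translate into a short NumPy sketch over a precomputed kernel matrix K between the columns of a feature matrix; the function name and the small epsilon guard are our own.

    # Sketch of the kernel pairwise distances of Eqs. (5)-(9).
    import numpy as np

    def kernel_distances(K, metric="correlation"):
        m = K.shape[0]
        if metric == "euclidean":               # Eq. (5)
            d2 = np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :]
            return np.sqrt(np.maximum(d2, 0))
        # Center the feature map, Eqs. (7)/(8): K~ = K - 1K/m - K1/m + 1K1/m^2
        one = np.full((m, m), 1.0 / m)
        Kc = K - one @ K - K @ one + one @ K @ one
        norms = np.sqrt(np.maximum(np.diag(Kc), 1e-12))
        return 1.0 - Kc / np.outer(norms, norms)  # Eq. (9)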
2.3. What is the ideal unsupervised classification algorithm for high dimensional gene/protein expression data?

We believe that an ideal unsupervised classification or clustering algorithm for high dimensional gene and protein data should satisfy the following criteria:

1. Some feature selection method ought to be applied to reduce the data dimensionality such that the data are "clean and compact".
2. The feature selection method employed should have the parts-based learning property to maintain the data locality well; that is, the feature selection method should conduct local feature selection.
3. Kernel tricks are desirable in the clustering of the data after feature selection, to achieve better classification results in a kernel space.

According to these criteria, we give our subspace consensus kernel classification algorithm based on the projected gradient NMF (PG-NMF). The basic idea is to apply a convergence-sound local feature selection algorithm, PG-NMF, to the gene/protein expression dataset X, which is equivalent to projecting the dataset X into the subspace S generated by the PG-NMF: X ≈ WH, where W is the basis matrix generating the subspace. Then kernel hierarchical clustering is applied to the column data of the feature matrix H, which are the prototype data of the original data. Since the basis matrix and feature matrix are not unique in NMF, we develop the consensus kernel hierarchical clustering algorithm (CKHC) to obtain the final classification.
3. PG-NMF SUBSPACE KERNEL HIERARCHICAL CLASSIFICATION
PG-NMF based subspace kernel classification conducts consensus kernel hierarchical clustering (CKHC) on each feature matrix H in the subspace generated by the PG-NMF. The CKHC is an algorithm that runs kernel hierarchical clustering in a Monte Carlo simulation approach and computes the final classification by building a consensus tree. It consists of two general steps. 1. Build a consensus tree for the expression dataset X at each rank by conducting CKHC on the feature matrices H from the PG-NMF. 2. Select the best consensus tree, which is the final classification, by our novel model selection method. The following algorithm describes the consensus kernel hierarchical clustering (CKHC) at rank r.

Algorithm 1. Consensus kernel hierarchical clustering at rank r
Input: nonnegative matrix X (n×m), rank r, PG-NMF running times N >= 100, kernel function k(x, y), linkage metric l
Output: the consensus tree T at rank r
// Run PG-NMF X ≈ WH to do feature selection at rank r, N times
1.  For run = 1:N
2.      Initialize W and H randomly
3.      Compute X ≈ WH by PG-NMF, W ∈ R^{n×r}, H ∈ R^{r×m}
4.      Compute the kernel pairwise distances between columns of the feature matrix H in the kernel space by Eq. (5)/(9)
5.      Record the kernel pairwise distances in an m(m-1)/2 × 1 vector d
6.      Concatenate all such kernel distance vectors for the N feature matrices in a matrix D: D = [D, d]
7.  End
8.  Compute a consensus kernel distance vector d_consensus by weighting the ratios of the sum of each column in D over the sum of the elements of matrix D
9.  Build the consensus tree T from the consensus kernel distance vector under the linkage metric l
10. Return T
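Steps 8 and 9 of Algorithm 1 can be sketched compactly with SciPy; D is the m(m-1)/2 × N matrix of condensed distance vectors collected over the N runs, and the weighting is our reading of step 8.

    # Sketch of the consensus step of Algorithm 1.
    import numpy as np
    from scipy.cluster.hierarchy import linkage

    def consensus_tree(D, method="average"):
        # weight each run by its share of the total distance mass (step 8)
        w = D.sum(axis=0) / D.sum()
        d_consensus = D @ w              # condensed m(m-1)/2 distance vector
        return linkage(d_consensus, method=method)   # the consensus tree (step 9)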
We still need to answer the following question: what is the model selection method to find the most robust consensus tree (classification)? To avoid an exhaustive search over all possible ranks, we give a singular-value based rank selection method to find an optimal rank search interval [2, r*]. The idea can be described as follows. Given a threshold ε (ε ∈ [0.90, 1)), we compute the smallest r* such that the importance ratio of the first r* singular values reaches the threshold. The importance ratio of the first r* singular values is defined as the ratio of the sum of the first r* singular values over the sum of all singular values:

    Σ_{i=1}^{r*} σ_i / Σ_i σ_i >= ε                                     (11)

That is, PG-NMF is only conducted in the optimal rank search interval [2, r*], and we only search for the best consensus tree among these consensus trees. From which rank in the interval [2, r*] will the most robust consensus tree come? It is reasonable that the most robust consensus tree should come from a rank where the bases of the subspace generated by the PG-NMF each time represent all levels of patterns inherent in the dataset. From the point of view of data variability, it is a rank where the ratio between the largest data variability and the smallest data variability of the basis data reaches its maximum value. We propose a measure, the robust index δ, to find the most robust consensus tree according to these considerations. The robust index δ is the condition number of the covariance matrix of the average basis matrix E(W) over the N runs of the PG-NMF. The average basis matrix is defined as

    E(W) = (1/N) Σ_{i=1}^{N} W^{(i)}

The condition number of the covariance matrix of the average basis matrix E(W) is the ratio between its maximum and minimum eigenvalues: δ = λ_max / λ_min. Here λ_max is the variance of the first principal component of the average basis matrix, i.e., the largest data variability of the basis data; λ_min is the variance of the last principal component, i.e., the smallest variability of the basis data. The robust index can be huge, but it cannot reach infinity because λ_min is the smallest positive eigenvalue of the covariance matrix of E(W). The final classification is just the consensus tree with the largest robust index.
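Both model selection ingredients are short in NumPy. In the sketch below, taking the covariance over the basis columns of E(W) is our reading of the text; the epsilon guard is likewise an assumption.

    # Sketch of Eq. (11) and the robust index delta.
    import numpy as np

    def rank_upper_bound(X, eps=0.90):
        s = np.linalg.svd(X, compute_uv=False)
        ratios = np.cumsum(s) / s.sum()               # importance ratio, Eq. (11)
        return int(np.searchsorted(ratios, eps)) + 1  # smallest r* reaching eps

    def robust_index(Ws):
        E_W = np.mean(Ws, axis=0)                     # average basis matrix over N runs
        lam = np.linalg.eigvalsh(np.cov(E_W, rowvar=False))
        return lam.max() / max(lam.min(), 1e-12)      # delta = lambda_max / lambda_min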
The PG-NMF based consensus kernel hierarchical clustering algorithm (PG-NMF-CKHC) can be described as follows.

Algorithm 2. PG-NMF based consensus kernel hierarchical clustering
Input: an n×m nonnegative data matrix X, importance ratio threshold ε >= 0.90
Output: the final consensus tree T
1. Decide the rank search interval [2, r*] by the importance ratio threshold ε
2. For r = 2 : r*
3.     Conduct consensus kernel hierarchical clustering at rank r to get a consensus tree T_r at rank r
4.     Compute the robust index δ of the consensus tree T_r
5. End
6. T ← T_r with the maximum robust index
4. EXPERIMENTS

We apply the PG-NMF-CKHC algorithm to discover the cancer molecular patterns of several benchmark cancer datasets. We use a measure called the classification rate, C_r = Σ_{i=1}^{m} δ(i)/m, to evaluate the accuracy of the unsupervised classification for a dataset with m samples, where δ(i) = 1 if sample i is assigned to a correct cluster and δ(i) = 0 otherwise. We use three kernel functions in our algorithms: linear, polynomial and Gaussian kernels. The dissimilarity measures in the kernel hierarchical clustering are chosen as the Euclidean and correlation distances. We choose the average linkage metric in the kernel hierarchical clustering. The PG-NMF algorithm is run N = 100 times in each optimal rank search interval with tolerance 1.0e-9.

The first dataset is the Leukemia dataset, a benchmark dataset in cancer research consisting of 38 samples. It can be classified into 27 acute lymphoblastic leukemia (ALL) and 11 acute myelogenous leukemia (AML) marrow samples. The ALL samples can be further divided into 19 'B' and 8 'T' subtypes. HC and SOM were proved to be unstable for this dataset.¹⁵ The optimal search interval for this dataset is [2, 6] under the importance ratio threshold 0.90. The robust index in PG-NMF-CKHC reaches its largest value at rank 5 for a Gaussian kernel under the correlation distance (Figure 2). Figure 1 is the visualization of the final consensus tree. It is clear that there are three clusters, AML, ALL-B, and ALL-T, in the final consensus tree. There is only one misclassification, i.e., ALL-14749-B-cell was assigned to AML. We have found that the combinations of the Gaussian kernel function with the correlation or Euclidean distance under the average linkage metric can both reach the best performance in the classification. Under the linear kernel, we can see that the classification results under the correlation distance are better than those of the Euclidean distance (Figure 3). The NMF clustering has two misclassified samples: ALL-21302-B-cell and ALL-14749-B-cell. Sparse-NMF clustering has one misclassified sample: AML-12. However, the running times of NMF and sparse-NMF clustering are more than twice that of our algorithm.
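The classification rate itself is a one-line measure; the sketch assumes the cluster labels have already been matched to the known class labels.

    # Sketch of the classification rate C_r = sum_i delta(i) / m.
    import numpy as np

    def classification_rate(assigned, truth):
        return np.mean(np.asarray(assigned) == np.asarray(truth))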
Fig. 1. The visualization of the consensus tree at rank 5 for a Gaussian kernel under the correlation distance and average linkage metric.
nm.
Fig. 2. The largest robust index reached at rank 5 for the Gaussian kernel with correlation distance.
Fig. 3. The classification rates under linear, polynomial and Gaussian kernel for Euclidean and correlation distances.
The second dataset is the Medulloblastoma dataset, gene expression data from childhood brain tumors known as medulloblastomas. The pathogenesis of these tumors is still not well understood. However, there are two generally accepted histological subclasses: classic and desmoplastic. The samples are divided into 25 classic and 9 desmoplastic medulloblastomas. General HC and SOM failed to reveal the classifications of these samples [15]. The robust index reaches its maximum in the optimal rank search interval [2,10] at rank 7 for a polynomial kernel under the correlation distance. Figure 4 is the visualization of the final classification. There are 8 desmoplastic samples clustered, and in total 2 samples are misclassified: sample 25 and sample 33.
Fig. 4. Visualization of the final consensus tree of the medulloblastoma dataset at rank 7 under the polynomial kernel with the average linkage metric and correlation distance.

The NMF decomposition algorithm also has 2 samples misclassified at its best rank 5 [15]. However, it only gets 7 desmoplastic samples clustered. Although our result also has 2 misclassified samples, we have a better clustering structure, since there are 8 desmoplastic samples clustered. On the other hand, sparse-NMF has 7 samples misclassified at its best rank 5 [16]. It seems sparseness constraints do not contribute to improving the classification rates for this dataset. Since the pathogenesis of medulloblastoma is still not well understood, we did not compute the classification rates for this dataset.

The third dataset is an ovarian cancer dataset, an MS proteomics dataset consisting of 20 cancer and 20 normal samples, which presents as a 15142x40 positive matrix. This dataset is a subset of Ovarian Dataset 8-7-02, which was generated using the WCX2 protein array and includes 91 controls and 162 ovarian cancers. For this dataset, we try supervised classification first. We randomly pick another 40 samples (20 cancer and 20 normal) from the original dataset as a training set; then we use kNN under the Euclidean and correlation distances to classify the MS data. We have found that the best classification rate from kNN is 92%, but it cannot classify samples 3, 12 and 36 correctly. Our algorithm reaches the best classification at rank 7 in the optimal rank search interval [2,10]. There is only one misclassified sample: sample 36 (Figure 6).
Fig. 5. The largest robust index reached at rank 7 for the polynomial kernel with correlation distance.

Fig. 6. The final consensus tree at rank 7 under Gaussian kernel with correlation distance.
Figure 7 shows the performance of the linear, Gaussian and polynomial kernels in the classification. The combination of the polynomial kernel and correlation distance has the best performance under the average linkage metric. Classification rates generally decrease after rank 7, and the correlation distance generally performs better than the Euclidean distance in the classification.
Fig. 7. The classification rates of PG-NMF-CKHC for this dataset: the polynomial kernel + correlation distance reaches the best classification rate.
We also apply NMF and sparse-NMF classification to the proteomics data, although they were developed in the context of gene expression data. There are 8 samples misclassified by NMF clustering and 12 samples misclassified by sparse-NMF clustering for our ovarian cancer dataset. Both algorithms indicate that there are 2 clusters from their cophenetic coefficients. Since a proteomics dataset generally has much higher dimensionality than a gene expression dataset, NMF and sparse-NMF clustering have high time complexity for a proteomics dataset. For this dataset, NMF clustering takes >78 hours and sparse-NMF clustering takes >153 hours running on two PCs with 3.0 GHz CPUs and 504 MB RAM under the Windows XP OS. It seems that the NMF based clustering/classification mechanism does not work well in the context of proteomics data.
4.1. Comparing classification results from kNN, sparse-NMF and support vector machines (SVM)

We compare PG-NMF-CKHC on four datasets (the leukemia, medulloblastoma and ovarian cancer datasets, and a colon cancer dataset consisting of 22 control and 40 cancer samples) with the classic NMF clustering, sparse-NMF clustering, and SVM and kNN
classifications. For kNN and SVM, we run the classification 10 times under holdout cross-validation with a 50% holdout percentage for each case. We take the average classification rates as the final classification rates. In the SVM classification, we also use the linear, polynomial and Gaussian kernels, and we select the best final classification rate among the three kernels as the final classification rate of SVM. For the leukemia data, we use SVM/kNN to classify the ALL and AML types instead of all three types. Although the pathogenesis of medulloblastoma is not well established, we still compute the classification rates of this dataset based on the general assumption that the samples are divided into 25 classic and 9 desmoplastic medulloblastomas, for the convenience of comparison.

Table 1 shows the classification rates for the four benchmark datasets from the kNN, PG-NMF-CKHC, NMF, sparse-NMF and SVM classifications. We have found that our algorithm is superior to the NMF, sparse-NMF and supervised SVM classification algorithms for these datasets. The NMF classification has better performance than SVM and kNN for the three gene expression datasets. Sparse-NMF has on average better performance than kNN for the three gene expression datasets. However, NMF and sparse-NMF cannot compete with kNN and SVM on the proteomics data. According to our classification results, it seems that the sparseness constraint on NMF may not always contribute to an improvement in the classifications for some datasets. Besides the ovarian dataset, for the medulloblastoma dataset the classic NMF clustering seems to perform better in classifying desmoplastic medulloblastomas than the sparse-NMF clustering at rank 5, where both algorithms reach their most robustly reproducible partitions. We also noticed that the NMF and sparse-NMF clustering cannot compete with the SVM classification for the ovarian dataset. It is interesting to see that the sparseness constraint may not lead to better classification results for the colon cancer dataset. The classic NMF clustering reaches its largest cophenetic correlation coefficient at rank 2 (2 clusters) and its corresponding classification rate is 0.9355. However, the sparse NMF clustering reaches its largest cophenetic correlation coefficient at rank 4 (4 clusters) and its corresponding classification rate is 0.7581. This is possibly due to the fact that the expression patterns of dominant co-expressed genes, such as oncogenes and tumor suppressor genes, are not extracted in the sparse representation. This may also indicate that sparseness
control may not always lead to better classification results for some datasets. Figures 8 and 9 give the visualization of the NMF and sparse-NMF clustering from rank 2 to 5 for the colon cancer dataset. The probability of two samples being clustered together is indicated by color: generally, blue indicates a value near 0 and red indicates a value near 1. Deep blue, standing for 0, indicates samples that are never assigned to one cluster, and dark red, standing for 1, indicates samples that are always assigned to one cluster.
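For reference, the holdout evaluation protocol used for kNN and SVM in this section can be sketched as follows, assuming scikit-learn and synthetic data in place of the actual expression matrices; `SVC` with an RBF kernel stands in for the Gaussian-kernel SVM:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 10 random 50% holdout splits; the average score is the final rate.
X, y = make_classification(n_samples=62, n_features=200, random_state=0)
rates = []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5,
                                          random_state=seed)
    clf = SVC(kernel="rbf").fit(Xtr, ytr)
    rates.append(clf.score(Xte, yte))   # fraction correctly classified
print(np.mean(rates))                   # average classification rate
```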
Fig. 8. The visualization of the NMF clustering from rank 2-5 for the colon dataset
5. CONCLUSIONS

As a parts-based machine learning algorithm, NMF has been successfully applied in image analysis, document clustering and cancer molecular pattern discovery. In this study, we present an NMF based subspace kernel clustering algorithm, PG-NMF-CKHC, built on the input space, subspace and kernel space clustering framework. We have shown that PG-NMF-CKHC improves cancer molecular pattern discovery for four well-studied datasets. According to our current results, it works well for both gene expression data and protein expression data. Our algorithm can be generalized to a family of subspace kernel classification/clustering algorithms in machine learning by selecting different transforms to generate subspaces and different kernel clustering algorithms to cluster data. For example, one can conduct kernel k-means clustering in a subspace generated by independent component analysis (ICA) applied to a high dimensional dataset, or conduct kernel Fisher discriminant analysis (KFDA) [22] in a subspace generated by principal component analysis (PCA), as sketched below. Despite its promising features, it is also worth pointing out that PG-NMF based consensus kernel hierarchical clustering has the limitation of greater algorithmic complexity, especially compared with traditional hierarchical clustering (HC). However, our algorithm fits easily into a parallel computing structure due to its Monte Carlo simulation mechanism. Thus, we plan to implement a parallel version of the subspace based kernel classification algorithm for cancer molecular pattern classification in future work.
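As one illustration of this generalization (a sketch under our own assumptions, not the authors' implementation), the code below pairs scikit-learn's FastICA with a small hand-rolled kernel k-means on synthetic data:

```python
import numpy as np
from sklearn.decomposition import FastICA

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Plain kernel k-means on a precomputed kernel matrix K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue            # empty cluster stays unattractive
            # ||phi(x)-mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m**2)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

# Project samples into an ICA subspace, then cluster in a Gaussian
# kernel space (synthetic data: 40 samples, 200 features).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
S = FastICA(n_components=5, random_state=0).fit_transform(X)
sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # squared distances
labels = kernel_kmeans(np.exp(-sq / S.shape[1]), n_clusters=2)
```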
Fig. 9. The visualization of the sparse-NMF clustering from rank 2-5 for the colon dataset
Table 1. Comparison of the PG-NMF-CKHC classification results with those of the NMF, sparse-NMF, SVM and kNN classifications.

Cancer Name      Data Size  #type  kNN     PG-NMF-CKHC  NMF     Sparse-NMF  SVM
Leukemia         5000x38    3      0.8860  0.9737       0.9470  0.9737      0.9132
Medulloblastoma  5893x34    2      0.7611  0.9412       0.9412  0.8235      0.8300
Ovarian          15142x40   2      0.8990  0.9750       0.8000  0.7000      0.9474
Colon            2000x62    2      0.7667  0.9355       0.9032  0.7581      0.8542
Acknowledgments
The author wishes to thank the New Faculty Research Award at Eastern Michigan University for supporting this research.
References
1. Lilien, R. and Farid, H. Probabilistic Disease Classification of Expression-dependent Proteomic Data from Mass Spectrometry of Human Serum. Journal of Computational Biology 2003; 10(6): 925-946.
2. Golub, T. et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999; 286: 531-537.
3. Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M. and Haussler, D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000; 16(10): 906-914.
4. Hautaniemi, S., Yli-Harja, O., Astola, J., Kauraniemi, P. et al. Analysis and Visualization of Gene Expression Microarray Data in Human Cancer Using Self-organizing Maps. Machine Learning 2003; 52: 45-66.
5. Ressom, H., Varghese, R., Saha, D., Orvisky, R. et al. Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics 2005; 21: 4039-4045.
6. Liu, Z., Chen, D. and Bensmail, H. Gene expression data classification with kernel principal component analysis. J Biomed Biotechnol 2005; (2): 155-159.
7. Eisen, M. et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA 1998; 95: 14863-14868.
8. Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA 1999; 96: 2907-2912.
9. Bicciato, S. et al. PCA disjoint models for multiclass cancer analysis using gene expression data. Bioinformatics 2003; 19: 571-578.
10. Wall, M., Rechtsteiner, A. and Rocha, L. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis. Berrar, D., Dubitzky, W. and Granzow, M., eds. Kluwer: Norwell, 2003; 91-109.
11. Tan, Y., Shi, L., Tong, W. and Wang, C. Multi-class cancer classification by total principal component regression using microarray gene expression data. Nucleic Acids Res. 2005; 33(1): 56-65.
12. Zhang, X., Yap, Y., Wei, D., Chen, F. and Danchin, A. Molecular diagnosis of human cancer type by gene expression profiles and independent component analysis. European Journal of Human Genetics 2005; 1-9.
13. Lee, D.D. and Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999; 401: 788-791.
14. Lin, C. Projected gradient methods for non-negative matrix factorization. Neural Computation 2007; in press.
15. Brunet, J., Tamayo, P., Golub, T. and Mesirov, J. Molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 2004; 101(12): 4164-4169.
16. Gao, Y. and Church, G. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics 2005; 21(21): 3970-3975.
17. Hoyer, P.O. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research 2004; 5: 1457-1469.
18. Yeung, K. and Ruzzo, W. Principal component analysis for clustering gene expression data. Bioinformatics 2001; 17(9): 763-774.
19. Lee, S. and Batzoglou, S. ICA-based clustering of genes from microarray expression data. Neural Information Processing Systems (NIPS) 2003.
20. Vapnik, V. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
21. Qin, J. et al. Kernel hierarchical gene clustering from microarray expression data. Bioinformatics 2003; 19(16): 2097-2104.
22. Mika, S., Ratsch, G., Weston, J., Scholkopf, B. and Muller, K.R. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing IX, 1999; 41-48.
EFFICIENT ALGORITHMS FOR GENOME-WIDE TAGSNP SELECTION ACROSS POPULATIONS VIA THE LINKAGE DISEQUILIBRIUM CRITERION
Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang*
Department of Computer Science and Engineering, University of California, Riverside, CA 92507, USA
*Email: [email protected]

In this paper, we study the tagSNP selection problem on multiple populations using the pairwise r² linkage disequilibrium criterion. We propose a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and present efficient solutions for MCTS. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying some data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e. the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time tagging lower bounds are discussed in the literature. We assess the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrate that our algorithms run 3 to 4 orders of magnitude faster than the existing single-population tagging programs like FESTA and LD-Select and the multiple-population tagging method MultiPop-Tagselect. Our method also greatly reduces the required tagSNPs compared to LD-Select on a single population and MultiPop-Tagselect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal since they are very close to the corresponding lower bounds obtained by our method.
1. INTRODUCTION

The rapid development of high-throughput genotyping technologies has recently enabled genome-wide association studies to detect connections between genetic variants and human diseases. The single-nucleotide polymorphism (SNP) is the most frequent form of polymorphism in the human genome. Common SNPs with minor-allele frequency (MAF) of 5% have been estimated to occur once every ~600 bps, and there are more than 10 million verified SNPs in dbSNP. Given these numbers, it is currently infeasible to consider all the available SNPs to carry out association studies. This motivates the selection of a subset of informative SNPs, called tagSNPs.

The selection of tagSNPs in silico is a well-studied research topic. Existing computational methods for tagSNP selection can be classified into two categories: haplotype-based methods [1, 12, 17, 19, 24, 28, 31, 32, 34] and haplotype-independent methods [5, 15, 16, 20-22, 25, 26, 27]. The haplotype-based methods require phased multi-locus haplotypes, whereas the haplotype-independent methods do not require haplotype information. The main shortcoming of haplotype-based methods is that the preprocessing step (i.e. the inference of haplotypes from genotypes) is computationally demanding. In addition, since there is no authoritative inference method, the haplotypes generated by the existing haplotype inference methods are often quite different [7, 35]. Consequently, the tagSNPs selected by the haplotype-based methods would be quite different. Recently, Carlson et al. proposed a haplotype-independent method that employs the r² linkage disequilibrium (LD) statistical criterion to measure the association between SNPs [5]. The tagSNPs selected by this method are shown to be effective in disease association mapping studies, because the measure r² is directly related to the statistical power of association mapping. Because this method has comparable performance at a lower computational cost than many other methods [33, 27], tagging approaches based on r² LD statistics have gained popularity among researchers in the SNP community [2, 5, 8, 22, 26, 33].

Most approaches using the r² criterion require that tagSNPs be defined within a single population, because LD patterns (see the caption of Figure 1(A) for a definition) are quite susceptible to population stratification. In two populations with different evolutionary histories, a pair of SNPs having remarkably different allele frequencies and very weak LD may show strong LD in the admixed population (see such an example in Table 1). A recent study shows that the LD patterns and allele frequencies across populations are in fact very different [29]. For example, among the populations collected in the HapMap project (i.e. YRI, CEU, CHB and JPT), 81% of the SNPs in the YRI population have a near perfect proxy (i.e. SNPs that have r² ≥ 0.8 with other SNPs), while in the other three populations, 91% of the SNPs have a near perfect proxy. Therefore, tagSNPs picked from the combined populations or from one of the populations might not be sufficient to capture the variations in all populations. In order to maintain the power of association mapping, we need to generate a common (or universal) tagSNP set to type all the populations with sufficient accuracy. A simple approach to select a universal tagSNP set is to tag one population first and then select a supplementary set for each of the other populations one by one [22].
*Corresponding author.
Table 1. r² statistics for a pair of SNP markers in single and admixed populations. One SNP has alleles denoted as A and a while the other SNP has alleles denoted as B and b. Population 3 is an even mixture of populations 1 and 2.

Population 1:
        B       b
A       0.9025  0.0475   0.95
a       0.0475  0.0025   0.05
        0.95    0.05     r² = 0

Population 2:
        B       b
A       0.0025  0.0475   0.05
a       0.0475  0.9025   0.95
        0.05    0.95     r² = 0

Population 3:
        B       b
A       0.4525  0.0475   0.5
a       0.0475  0.4525   0.5
        0.5     0.5      r² = 0.6561
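The admixture effect shown in Table 1 can be verified directly with the standard two-locus r² formula; the short Python check below reproduces the table's values:

```python
# r^2 from the four two-locus haplotype frequencies (AB, Ab, aB, ab).
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    pA, pB = p_AB + p_Ab, p_AB + p_aB
    D = p_AB - pA * pB                    # linkage disequilibrium coefficient
    return D**2 / (pA * (1 - pA) * pB * (1 - pB))

pop1 = (0.9025, 0.0475, 0.0475, 0.0025)  # p(A) = p(B) = 0.95, independent
pop2 = (0.0025, 0.0475, 0.0475, 0.9025)  # p(A) = p(B) = 0.05, independent
pop3 = tuple((x + y) / 2 for x, y in zip(pop1, pop2))   # even admixture
print(r_squared(*pop1), r_squared(*pop2), r_squared(*pop3))
# approximately 0.0, 0.0, 0.6561: weak LD in each pure population,
# strong LD in the admixed one
```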
For instance, we can select a tagSNP set for non-African populations and a supplement for populations with significant African ancestry [23]. However, this sequential approach might not give a satisfactory solution, as the tagSNP set selected for one population might be far from adequate to type the SNPs of the remaining populations. As a result, the supplementary tagSNP sets are large and the total number of tagSNPs chosen is far from the optimum. Moreover, the performance of the approach is sensitive to the specific order of the input populations. In order to generate the smallest set of tagSNPs on K populations, one would have to execute the tagging procedure K! times considering all possible orderings, which would be extremely inefficient for genome-wide tagging. We can improve the performance of the tagging approach by evaluating multiple populations at the same time. When choosing tagSNPs, we prefer those with "good properties" with respect to the collection of populations as a whole. An example of our tagging strategy is given in Figure 1.
Fig. 1. (A) LD patterns in two populations. The vertices denote the SNP markers and the edges denote pairs of markers with strong LD (i.e. the r² measure between the markers is greater than a given threshold). (B) Tagging results of the simple sequential approach above. We first choose markers 3 and 6 to tag population 1 and then choose an additional marker 5 to tag population 2; three markers are selected in total to tag both populations. (C) Tagging results of an improved approach. We select markers 4 and 6 considering both populations simultaneously; only two markers are selected in total to tag both populations.
Previous work on tagSNP selection based on the linkage disequilibrium criterion. There is a large body of scientific literature on the problem of selecting tagSNPs based on the r² LD criterion. Carlson et al. suggested a greedy procedure called LD-Select, which works as follows: (i) select the SNP with the maximum number of proxies, (ii) remove the SNP and its proxies from consideration, and (iii) repeat the above two steps until all SNPs have been tagged [5]. This algorithm is very simple; however, it may miss solutions with the smallest number of tagSNPs in general, as shown in [26]. More recently, Qin et al. implemented a comprehensive search algorithm called FESTA, which first breaks down a large set of markers into disjoint pieces (called precincts), and then performs an exhaustive search on each piece if the estimated computational cost is below a certain threshold [26]. FESTA usually gives a better solution than LD-Select, but because it employs exhaustive search, it is too slow to be practical for genome-wide tagSNP selection. The above methods are only applicable to single-population tagSNP selection. Recently, Howie et al. presented an algorithm for multiple populations, called MultiPop-Tagselect, which combines the tagSNPs selected for each population by LD-Select to produce a universal tagSNP set for a collection of populations [13]. The algorithm works reliably, and it could in principle be used with any tagSNP selection method for single populations. However, its accuracy highly depends on the performance of the single-population tagSNP selection method. Magi et al. [22] also designed a software tool called REAPER, which is rather similar to LD-Select when applied to a single population. To select a universal tagSNP set for several populations, it first selects a tagSNP set for one population, and then it selects a supplement for the remaining populations one by one. As mentioned above, the performance of the method crucially depends on the choice of the initial tagSNP set and the ordering of the populations. It is not clear, moreover, how one should select tagSNPs for the first population so as to minimize the size of the final solution.
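For concreteness, the LD-Select greedy loop described above admits a very short sketch (our paraphrase, not the published implementation; the toy `proxies` graph on six markers is hypothetical):

```python
def ld_select(proxies):
    """Greedy tagging: proxies maps each SNP to the set of SNPs
    in strong LD with it."""
    untagged = set(proxies)
    tags = []
    while untagged:
        # (i) pick the SNP covering the most still-untagged markers
        best = max(untagged, key=lambda s: len(proxies[s] & untagged))
        tags.append(best)
        # (ii) remove it and its proxies; (iii) repeat until all tagged
        untagged -= proxies[best] | {best}
    return tags

proxies = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}, 5: {6}, 6: {5}}
print(ld_select(proxies))   # e.g. [2, 5, 4]: three tagSNPs suffice
```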
Our contribution on tagSNP selection based on the linkage disequilibrium criterion. In this paper, we take a different approach to the multi-population tagSNP selection problem. Contrary to the previous methods, we do not generate a tagSNP set for each individual population separately, but rather we evaluate all the populations at the same time. The method that we propose can be used to generate a universal or cosmopolitan tagSNP set for multi-ethnic, ethnicity-unknown or admixed populations [13]. The main idea of our approach is to transform the multi-population tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem (to be defined more precisely later in the paper), into a minimum common dominating vertex set problem on multiple graphs. Each graph corresponds to one of the populations under consideration. The vertices in a graph correspond to the SNP markers of the population, and there is an edge between two markers when they are in strong LD
(according to some given threshold). To find an optimal solution to MCTS, we first decompose it into disjoint subproblems, each of which is essentially a connected component of the union graph^a and represents a precinct as defined in [26]. Then, for each precinct, we apply three data reduction rules repeatedly to further reduce the size of the subproblem, until none of the rules can be applied anymore. Finally, the reduced subproblems are solved by either a simple greedy approach (similar to cosmopolitan tagging) or a more sophisticated Lagrangian relaxation heuristic. Both algorithms are explained in detail later in the paper. Along with the solution produced by our algorithm, we also obtain lower bounds on the minimum number of tagSNPs required, which allow us to quantitatively assess how close our solution is to the optimum.

We evaluate the performance of our method on real HapMap data for genome-wide tagging. The experimental results demonstrate that our algorithms run 3 to 4 orders of magnitude faster than the existing single-population tagging programs like FESTA and LD-Select and the multiple-population tagging method MultiPop-Tagselect. Our method also greatly reduces the required tagSNPs compared to LD-Select on a single population and MultiPop-Tagselect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal since they are very close to the corresponding lower bounds provided by our method. For example, the gap between our solution and the lower bound is 1061 SNPs with the r² threshold being 0.5 and 142 SNPs with the r² threshold being 0.8, given the entire human genome with 2,862,454 SNPs (MAF ≥ 5%).

The rest of the paper is organized as follows. In Section 2, we first propose a combinatorial optimization model for the MCTS problem and then present a computational complexity result. In Section 3, we introduce three rules to reduce the size of the problem, and devise a greedy tagging algorithm, called GreedyTag, and a Lagrangian relaxation heuristic, called LRTag. After showing the experimental results in Section 4, we conclude the paper with some remarks about the performance of our tagging method in Section 5. Due to the page limit, some of the illustrative figures and tables are given in the appendix.

2. FORMULATION OF THE MCTS PROBLEM
Consider K distinct populations and a set V of biallelic SNP markers $v_1, v_2, \ldots, v_n$. Since the $r^2$ coefficient is unreliable for rare SNPs when the sample size is small [5], we will consider only SNPs with MAF ≥ 5%. The set of SNPs might differ from population to population. We use $V_i \subseteq V$ to denote the SNP set in population $i$. Clearly, we have $V = V_1 \cup V_2 \cup \cdots \cup V_K$.

For a pair of SNP markers $v_{j_1}$ and $v_{j_2}$ in a population $i$ (for any $1 \le i \le K$), the $r^2$ coefficient between them is denoted by $r_i^2(v_{j_1}, v_{j_2})$. Markers $v_{j_1}$ and $v_{j_2}$ are said to be in high LD in population $i$ if $r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0$, where $\gamma_0$ is a pre-defined threshold ($\gamma_0$ will be set to 0.5 or higher in our study). Moreover, $v_{j_1}$ (or $v_{j_2}$) is considered to be the tagSNP or proxy for $v_{j_2}$ (or $v_{j_1}$, respectively) in population $i$. For convenience, we define $E_i$ to be the set containing all the high-LD marker pairs in population $i$, i.e. $E_i = \{(v_{j_1}, v_{j_2}) \mid r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0,\ v_{j_1}, v_{j_2} \in V_i\}$. Now we can formally define the MCTS problem.

MINIMUM COMMON TAGSNP SELECTION (MCTS)
Instance: A collection of K populations and a set V of biallelic SNP markers. Each population $i$ ($1 \le i \le K$) has its marker set $V_i \subseteq V$ and LD patterns $E_i = \{(v_{j_1}, v_{j_2}) \mid r_i^2(v_{j_1}, v_{j_2}) \ge \gamma_0,\ v_{j_1}, v_{j_2} \in V_i\}$, where $\gamma_0$ is a pre-defined threshold.
Feasible solution: A subset $T \subseteq V$ such that for any marker $v \in V_i$, $v \notin T$, from some population $i$, there exists a marker $v'$ in $T \cap V_i$ with $(v, v') \in E_i$ (that is, $r_i^2(v, v') \ge \gamma_0$).
Objective: Minimize $|T|$.

It is easy to observe that any feasible solution to the MCTS problem is a common dominating vertex set in the graphs $\{G_i \mid 1 \le i \le K\}$, where $G_i = (V_i, E_i)$. In particular, the smallest set of tagSNPs for a single population is a minimum dominating vertex set of the corresponding graph. Obviously, the MCTS problem is NP-hard, since it is a generalization of the minimum dominating vertex set problem, which is known to be NP-hard.

Theorem 2.1. The MCTS problem is NP-hard.

We introduce some additional notations to be used later. To differentiate the occurrences of a marker in different populations, we use $v_j^i$ to represent the $j$th marker appearing in the $i$th population. Given a marker $v_j \in V$, we define the following two sets:

$N^i(v_j) = \{v_{j'}^i \mid (v_{j'}, v_j) \in E_i,\ v_{j'}, v_j \in V_i\} \cup \{v_j^i \mid v_j \in V_i\}$, $N^*(v_j) = \bigcup_{1 \le i \le K} N^i(v_j)$   (1)

The set $N^i(v_j)$ represents the subset of marker occurrences in strong LD with $v_j$ in population $i$, and the set $N^*(v_j)$ represents the union of such subsets over all the populations. Note that $N^i(v_j)$ is empty if $v_j \notin V_i$. Given a marker occurrence $v_j^i$ in population $i$, we define the following set:

$C(v_j^i) = \{v_{j'} \mid (v_j, v_{j'}) \in E_i,\ v_j, v_{j'} \in V_i\} \cup \{v_j\}$   (2)

The set $C(v_j^i)$ is the subset of markers each of which can tag the occurrence $v_j^i$, whereas $N^*(v_j)$ is the subset of occurrences that the marker $v_j$ can tag.

Based on the above definitions, the MCTS problem can also be viewed as the following set cover problem. Given the universe $U = \bigcup_{1 \le i \le K} \{v_j^i \mid v_j \in V_i\}$ of all marker occurrences and the collection of sets $\{N^*(v_j) \mid v_j \in V\}$, find a minimum subset $T \subseteq V$ such that $\bigcup_{v_j \in T} N^*(v_j) = U$.

^a Given graphs $G_i = (V_i, E_i)$ ($1 \le i \le K$), the union graph is defined as $G = (V, E)$, where $V = \bigcup_i V_i$ and $E = \bigcup_i E_i$.
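The set-cover view translates directly into data structures. A minimal sketch (with a made-up two-population instance) builds the universe U of occurrences and the covering sets N*(v):

```python
from itertools import chain

# V[i] is the marker set of population i; E[i] its strong-LD pairs.
def build_cover_instance(V, E):
    universe = {(i, v) for i, Vi in V.items() for v in Vi}
    N_star = {v: set() for v in set(chain.from_iterable(V.values()))}
    for i, Vi in V.items():
        for v in Vi:
            N_star[v].add((i, v))      # a marker tags its own occurrence
        for u, w in E[i]:
            N_star[u].add((i, w))      # u tags w's occurrence in pop i
            N_star[w].add((i, u))      # and vice versa
    return universe, N_star

# Two toy populations; a feasible tagSNP set must cover every occurrence.
V = {1: {1, 2, 3}, 2: {2, 3, 4}}
E = {1: [(1, 2), (2, 3)], 2: [(3, 4)]}
U, N_star = build_cover_instance(V, E)
print(N_star[2] | N_star[3] == U)      # True: T = {2, 3} is feasible
```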
3. OPTIMIZATION TECHNIQUES TO SOLVE THE MCTS PROBLEM

In principle, a minimum common tagSNP set can be found by exhaustive search. In reality, there are millions of markers, and it is infeasible to conduct an exhaustive search. Since human chromosomes consist of high-LD regions (i.e. haplotype blocks) interspersed with recombination hotspots, we partition the markers into precincts such that markers in strong LD belong to the same precinct. In this way, we can narrow down the search space and improve the efficiency of our algorithm. In order to deal with multiple populations, we extend the concept of precinct defined originally in [26]. We say that two markers are in the same precinct if and only if they are in strong LD in some population. Based on the simple observation that no marker in a precinct can tag a marker in another precinct, we can obtain a minimum tagSNP set for the whole problem by combining the minimum tagSNP sets for each precinct. The precincts can be easily identified by running a breadth-first search (BFS) in the union graph G, as sketched below. By partitioning the markers into precincts, we decompose the original problem into a set of disjoint subproblems of much smaller sizes. We then select tagSNPs for each precinct independently, which saves a lot of running time.
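A breadth-first search over the union graph yields the precincts; the following self-contained sketch illustrates the decomposition on a toy edge list:

```python
from collections import deque, defaultdict

def precincts(nodes, edges):
    """Connected components of the union graph via BFS."""
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = [], deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for w in adj[u] - seen:
                seen.add(w)
                queue.append(w)
        comps.append(comp)
    return comps

print(precincts({1, 2, 3, 4, 5, 6}, [(1, 2), (2, 3), (5, 6)]))
# three precincts, e.g. [[1, 2, 3], [4], [5, 6]] (order may vary)
```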
3.1. Data Reduction Rules

We introduce three simple data reduction rules to further reduce the subproblem sizes and improve efficiency.

Rule 1: Pick all irreplaceable markers. If a marker $v_j$ has no proxy in population $i$ (that is, $v_j$ is a singleton in $G_i = (V_i, E_i)$), then marker $v_j$ must be in the minimum tagSNP set.

Rule 2: Remove less informative markers. Given two markers $v_j$ and $v_{j'}$, if $N^*(v_{j'}) \subseteq N^*(v_j)$, we say that $v_j$ is more informative than $v_{j'}$. Similarly, given a set of markers $v_{j_1}, v_{j_2}, \ldots, v_{j_k}$, if $N^*(v_{j_1}) \subseteq N^*(v_{j_2}) \subseteq \cdots \subseteq N^*(v_{j_k})$, then $v_{j_k}$ is called the maximally informative SNP marker in the set. It is clear that we can discard less informative SNPs and keep only the maximally informative ones without degrading the quality of the solution.

Rule 3: Remove less stringent occurrences. Given two occurrences $v_{j_1}^{i_1}$ and $v_{j_2}^{i_2}$, if $C(v_{j_2}^{i_2}) \subseteq C(v_{j_1}^{i_1})$, we say that $v_{j_1}^{i_1}$ is less stringent than $v_{j_2}^{i_2}$. Similarly, given a set of occurrences $v_{j_1}^{i_1}, v_{j_2}^{i_2}, \ldots, v_{j_k}^{i_k}$, if $C(v_{j_1}^{i_1}) \subseteq C(v_{j_2}^{i_2}) \subseteq \cdots \subseteq C(v_{j_k}^{i_k})$, then $v_{j_1}^{i_1}$ is called the most stringent occurrence in the set. Observe that the markers selected to tag the most stringent occurrences will also tag the less stringent occurrences. Therefore, we consider only the most stringent occurrences and discard the others.

The above rules can also be viewed as data reduction rules applied to a 0/1 matrix obtained as follows. Given the notations of the occurrence set U, the marker set V and the neighborhood collections C introduced in the previous section, the rows in the matrix represent U, the columns represent V, and each cell $(i, j)$ indicates whether the marker corresponding to column $j$ can tag the occurrence corresponding to row $i$ (i.e. the value of a cell is set to 1 if the marker can tag the occurrence, and 0 otherwise). Thus, Rule 2 (or Rule 3) is equivalent to redundant column deletion (or row deletion, respectively). The above rules can be applied repeatedly and in any combination whenever applicable. The reduced problem obtained after the application of the data reduction rules is then subject to our greedy algorithm or Lagrangian relaxation (LR) algorithm, as explained next.
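Viewed on the 0/1 matrix, Rules 2 and 3 become dominated-column and dominated-row deletion. A small NumPy sketch follows (Rule 1, which force-selects irreplaceable markers, is omitted for brevity):

```python
import numpy as np

def reduce_matrix(M):
    """Delete dominated columns (Rule 2) and dominated rows (Rule 3)
    of the 0/1 tagging matrix: rows = occurrences, columns = markers."""
    keep_rows = np.ones(M.shape[0], dtype=bool)
    keep_cols = np.ones(M.shape[1], dtype=bool)
    # Rule 2: drop column j1 if its 1-set is contained in column j2's.
    for j1 in range(M.shape[1]):
        for j2 in range(M.shape[1]):
            if j1 != j2 and keep_cols[j1] and keep_cols[j2]:
                if np.all(M[:, j1] <= M[:, j2]):
                    keep_cols[j1] = False
                    break
    # Rule 3: drop row i1 if it contains another row's 1-set
    # (it is then less stringent and automatically covered).
    for i1 in range(M.shape[0]):
        for i2 in range(M.shape[0]):
            if i1 != i2 and keep_rows[i1] and keep_rows[i2]:
                if np.all(M[i2, :] <= M[i1, :]):
                    keep_rows[i1] = False
                    break
    return M[np.ix_(keep_rows, keep_cols)]

M = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])
print(reduce_matrix(M))   # collapses to a single column, two rows
```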
3.2. A Greedy Algorithm for MCTS
Greedy algorithms are often desirable due to their simplicity and efficiency. The greedy algorithm GreedyTag below is adapted from the greedy algorithm for the set cover problem as presented in [30]. By first applying the above data reduction rules, we will show later in the paper that GreedyTag greatly outperforms other greedy algorithms such as LD-Select and MultiPop-Tagselect. Moreover, a lower bound, called GreedyTag-lb, is produced by GreedyTag; it is equal to the number of tagSNPs selected by data reduction Rule 1. Even though this lower bound is somewhat loose, since it only considers Rule 1, it turned out to be quite tight in our experiments on real data (see Section 4 for more details). Due to the space constraint, we present the pseudo-code of GreedyTag in the appendix.
3.3. A Lagrangian Relaxation Algorithm for MCTS
A subset T of SNPs can be denoted by its characteristic vector $t = t_1 t_2 \cdots t_n$, where $t_j = 1$ if $v_j \in T$ and $t_j = 0$ otherwise. It is thus easy to formulate the following integer linear program for MCTS:

Minimize $|T| = \sum_{1 \le j \le n} t_j$
subject to $\sum_{v_{j'} \in C(v_j^i)} t_{j'} \ge 1$, $1 \le i \le K$ and $1 \le j \le n$
$t_j \in \{0, 1\}$, $1 \le j \le n$   (3)

Our second algorithm for MCTS is based on the Lagrangian relaxation framework. We assign a non-negative vector $\lambda = \lambda_{1,1}\lambda_{1,2}\cdots\lambda_{K,n}$ of Lagrangian multipliers to the inequalities, and obtain the following relaxed integer program:

Minimize $L(t, \lambda) = \sum_{1 \le j \le n} t_j + \sum_{1 \le i \le K,\, 1 \le j \le n} \lambda_{i,j} \left( 1 - \sum_{v_{j'} \in C(v_j^i)} t_{j'} \right)$   (4)

Rearranging the terms in (4), we have the objective function $L(t, \lambda) = \sum_{1 \le i \le K,\, 1 \le j \le n} \lambda_{i,j} + \sum_{1 \le j \le n} s(t_j)\, t_j$, where the $s(t_j) = 1 - \sum_{v_{j'}^i \in N^*(v_j)} \lambda_{i,j'}$ are the Lagrangian costs associated with $t_j$ in (4). Given $\lambda$, $L(t, \lambda)$ is minimized by setting $t_j = 0$ if $s(t_j) > 0$, $t_j = 1$ if $s(t_j) < 0$, and $t_j$ to an arbitrary value if $s(t_j) = 0$.

The vector t obtained above may not be a feasible solution to (3). In other words, some occurrence might not be tagged by any marker in $T = \{v_j \mid t_j = 1,\ 1 \le j \le n\}$ induced by the characteristic vector t. We adopt the reduced cost heuristic (RCH) strategy introduced by Balas and Ho to deal with this issue (the details are given in the pseudo-code in the appendix).

Next we need to find a good multiplier vector $\lambda$, i.e. one that gives a near optimal lower bound. We utilize a standard optimization technique called subgradient optimization, which iteratively updates the solution along the subgradient direction to reach the optimum. We define

$s(\lambda_{i,j}) = 1 - \sum_{v_{j'} \in C(v_j^i)} t_{j'}$,

and the subgradient is $\nabla\lambda = (\nabla\lambda_{1,1}, \nabla\lambda_{1,2}, \ldots, \nabla\lambda_{K,n})$, where $\nabla\lambda_{i,j} = s(\lambda_{i,j})$. Starting from an initial setting $\lambda^0$, we sequentially generate $\lambda^1, \lambda^2, \lambda^3, \ldots$ based on the following formula:

$\lambda^{k+1} = \max\left\{ 0,\ \lambda^k + \alpha_k \frac{|T^*| - L^*}{\|\nabla\lambda\|^2} \nabla\lambda \right\}$   (5)

where $T^*$ is the smallest feasible tagSNP set found so far (i.e. the best upper bound), $L^*$ is the largest value of $L(t, \lambda)$ found so far (i.e. the best lower bound), and $\{\alpha_0, \alpha_1, \ldots\}$ is a decreasing sequence of pre-defined scalars. The pseudo-code for the Lagrangian relaxation algorithm, LRTag, is given in the appendix. In the algorithm, we start from an initial setting of $\lambda^0$, generate a solution $t^0$ and extend it to a valid tagSNP set as mentioned above. Then we update $\lambda^0$ into $\lambda^1$ according to formula (5). We repeat the process until we cannot improve $\lambda$ or a pre-defined maximum number of iterations is reached. Over the entire iterative process, the smallest feasible set of tagSNPs found by LRTag is output as a solution to the MCTS problem, and the largest $L(t, \lambda)$ is a lower bound for tagSNP selection, called LRTag-lb.
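A compact numerical sketch of this subgradient loop follows (ours, not the LRTag implementation). Here `cover` is the 0/1 occurrence-by-marker matrix of the set-cover view, `T_ub` is the size of a known feasible solution, and by weak duality every value of L(t, λ) computed along the way is a valid lower bound on the optimum:

```python
import numpy as np

def lagrangian_lower_bound(cover, T_ub, alpha=2.0, alpha_min=0.005,
                           iters=200):
    """Maximize L(t, lambda) over lambda >= 0 by subgradient ascent;
    returns the best lower bound on the minimum cover size found."""
    n_occ, _ = cover.shape
    lam = np.zeros(n_occ)
    best = 0.0
    for _ in range(iters):
        s = 1.0 - lam @ cover                 # Lagrangian costs s(t_j)
        t = (s < 0).astype(float)             # minimizer: t_j = 1 iff s_j < 0
        L = lam.sum() + s @ t                 # L(t, lambda), Equation (4)
        best = max(best, L)
        g = 1.0 - cover @ t                   # subgradient, Equation (5)
        norm = g @ g
        if norm == 0 or alpha < alpha_min:
            break                             # t feasible, or step too small
        lam = np.maximum(0.0, lam + alpha * (T_ub - best) / norm * g)
        alpha *= 0.98                         # decreasing scalar sequence
    return best

# Three occurrences, three markers, each marker covering two occurrences;
# any single marker leaves one occurrence uncovered, so the optimum is 2.
cover = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
print(lagrangian_lower_bound(cover, T_ub=2))  # a valid lower bound, <= 2
```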
4. EXPERIMENTAL RESULTS

In our experiments, we test the algorithms GreedyTag and LRTag on the HapMap populations, and compare their performance and efficiency with the single-population tagging programs LD-Select and FESTA and the multiple-population tagging program MultiPop-Tagselect. For convenience, we will also denote by GreedyTag the cardinality of a feasible tagSNP set obtained by the GreedyTag algorithm; we use similar notations for LRTag, LD-Select, FESTA and MultiPop-Tagselect. Both of our algorithms calculate lower bounds on the minimum number of required tagSNPs, one of which is found by GreedyTag (i.e. GreedyTag-lb) and the other by LRTag (i.e. LRTag-lb). We define the gap as the difference between the highest lower bound and the cardinality of the smallest tagSNP set found by our algorithms, which will be used to measure the quality of the solutions.

We apply all the methods to the entire human genome data involving chromosomes 1 through 22 and to all ENCODE regions (ENm010, ENm013, ENm014, ENr112, ENr113, ENr123, ENr131, ENr213, ENr232 and ENr321) genotyped by the HapMap project (release #19, NCBI build 34, October 2005). For the ENCODE data, we estimate the r² statistics by using a two-marker EM algorithm to compute the maximum-likelihood values of the four gamete frequencies, which is also commonly adopted by LD-Select and Haploview. For the entire human genome data, we directly download the r² statistics from the HapMap website [10], generated
by Haploview, to save computational cost. Note that Haploview only calculates LD for markers up to 250 kbps apart, which is reasonable because the LD for markers that are farther than 250 kbps apart would normally be very weak anyway, and high LD in such a case can happen only purely by chance. In order to save running time when dealing with the entire human genome data, we prune the LD pattern data downloaded from the HapMap website by keeping only entries with r² no less than 0.5. We ran all the programs on a 32-processor SGI Altix 4700 supercomputer system with 1.6 GHz CPUs and 64 GB shared memory in the Computer Science Department, University of California, Riverside. Our GreedyTag and LRTag algorithms used up to 15 threads in parallel, while each of the other programs is single-threaded.^b

4.1. Tagging the ENCODE Regions
A dense set of SNPs across ten large genomic regions has been produced by the HapMap ENCODE project. These regions serve as the foundation for evaluating the development of methodologies and technologies for detecting functional elements in human DNA. Each region is about 500 Kb in length and has an SNP density of about 1 SNP per 600 bps.
Tagging ENCODE regions for a single population. We tag each HapMap population separately by LD-Select, FESTA, and our new algorithms GreedyTag and LRTag. For illustration purposes, we only show the results for tagging the CEU population and compare the performance of the above algorithms in Table 2. When the r² threshold is set as 0.5, the number of tagSNPs selected by our algorithm is on the average 9.3% of the total number of markers (the actual percentage ranges from 5.1% to 15.3%). With a more stringent r² threshold of 0.8, the average number of tagSNPs rises to 17.6% of the total number of markers (ranging from 11.4% to 24.9%). The same trend was observed when applying our algorithms to the other populations (results are not shown due to the space constraint). On each ENCODE region, we observe that the gap between LRTag-lb and LRTag is at most one with the r² threshold being 0.5, and there is no gap when the r² threshold is set as 0.8. This demonstrates that our algorithm LRTag found near-optimal solutions in all test cases. In general, LRTag and GreedyTag always generated the smallest sets of tagSNPs, FESTA selected at most three more tagSNPs, and LD-Select might select up to eight more tagSNPs. Since our algorithms and FESTA are all near-optimal, we compare the time efficiency of these programs in Table 3. Because LD-Select takes genotype data
as input and the other programs take pairwise LD data as input, we do not compare LD-Select's running times directly with those of the others here (generally speaking, LD-Select takes from 30m to 2h on an ENCODE region). From Table 3, we can see that the running time of FESTA varied widely from 1s to 64h on different regions, while our algorithms GreedyTag and LRTag consistently took 1-2s on all regions. In conclusion, our algorithms were 3 to 4 orders of magnitude faster than FESTA in most cases, and found slightly smaller sets of tagSNPs.
Tagging ENCODE regions for multiple populations. We tag each ENCODE region and the entire set of ENCODE regions for all four HapMap populations by MultiPop-Tagselect, GreedyTag and LRTag. The tagging results of these methods on each ENCODE region are summarized in Table 4. We also highlight the results for region ENm013 and for the entire ENCODE region in Figure 2. With the r² threshold set as 0.5, the number of tagSNPs selected by our algorithms is on the average 18.3% of the total number of markers (the actual percentage ranges from 11.0% to 34.5%). With a more stringent r² threshold of 0.8, the average number of tagSNPs increases to 33.7% (ranging from 24.0% to 50.5%). We observe that LRTag always performs the best in these tests, followed by the GreedyTag algorithm, and MultiPop-Tagselect always performs the worst. When the r² threshold is set as 0.5, LRTag requires 16.4% fewer markers on the average than MultiPop-Tagselect. When the r² threshold is 0.8, LRTag usually requires 5.1% fewer markers on the average than MultiPop-Tagselect. The gap between LRTag-lb and LRTag is at most two for each ENCODE region and six in total for all ENCODE regions with the r² threshold being 0.5. There is no gap with the r² threshold being 0.8.

4.2. Genome-wide Tagging

Because both LD-Select and MultiPop-Tagselect (written in Perl) took more than 20 hours to tag a single chromosome, we re-implemented their algorithms in C++, called LD-Select* and MultiPop-Tagselect* respectively, in order to give a fair comparison of the programs. Since FESTA's "greedy-exhaustive hybrid search" is very computationally demanding and hard to emulate, we exclude FESTA from the following comparative study.
Tagging the human genome for a single population. We apply LD-Select*, GreedyTag and LRTag to each HapMap population separately. For illustration purposes, we only discuss the results for tagging the CEU population and compare the performance of the above three algorithms. The details can be found in Table 5 given in the appendix.
^b Note that if a program runs in time t with 15 threads, then its running time with one thread would be 15t. This transformation can be used to compare the running times of our programs and those of the other programs in a single-thread mode.
Table 2. Summary of tagSNPs identified by FESTA, LD-Select, GreedyTag and LRTag for a single population, CEU, on all ENCODE regions.

Region                   ENm010  ENm013  ENm014  ENr112  ENr113  ENr123  ENr131  ENr213  ENr232  ENr321
# SNP                    525     692     904     947     1080    864     990     612     457     544

r² ≥ 0.5
# precinct               39      27      47      52      40      30      83      42      64      52
# tagSNP (upper bound)
  LD-Select              62      38      65      84      77      69      112     62      72      68
  FESTA                  57      35      63      76      73      65      107     61      70      65
  GreedyTag              56      35      63      76      73      62      107     61      70      64
  LRTag                  56      35      63      76      73      62      107     61      70      64
# tagSNP (lower bound)
  LRTag-lb               55      35      63      76      73      62      107     60      70      64
  GreedyTag-lb           50      33      63      72      69      54      101     55      70      62
Gap                      1       0       0       0       0       0       0       1       0       0

r² ≥ 0.8
# precinct               116     69      121     139     131     129     175     105     106     107
# tagSNP (upper bound)
  LD-Select              123     82      129     152     146     139     189     110     115     109
  FESTA                  122     79      129     152     143     139     186     110     114     109
  GreedyTag              122     79      129     152     143     139     186     110     114     109
  LRTag                  122     79      129     152     143     139     186     110     114     109
# tagSNP (lower bound)
  LRTag-lb               122     79      129     152     143     139     186     110     114     109
  GreedyTag-lb           122     79      129     150     143     139     186     107     110     109
Gap                      0       0       0       0       0       0       0       0       0       0
Table 3. The time efficiency of FESTA, GreedyTag and LRTag for selecting tagSNPs from a single population, CEU, on all ENCODE regions. The running times were obtained on a 32-processor SGI Altix 4700 supercomputer system.

Region       ENm010  ENm013  ENm014  ENr112  ENr113  ENr123  ENr131  ENr213  ENr232  ENr321

r² ≥ 0.5
  FESTA      3h14m   3h16m   4h51m   3h18m   14h24m  64h4m   5h13m   2h24m   1h38m   45m19s
  GreedyTag  1s      1s      1s      1s      1s      1s      1s      1s      1s      1s
  LRTag      1s      1s      1s      2s      1s      2s      1s      1s      1s      1s

r² ≥ 0.8
  FESTA      3s      28m20s  44m6s   12m8s   50m16s  3s      1s      2s      1s      1s
  GreedyTag  <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s
  LRTag      <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s     <1s
Table 4. Summary of tagSNPs identified by MultiPop-Tagselect, GreedyTag and LRTag for all HapMap populations on the ENCODE regions.

Region                   ENm010  ENm013  ENm014  ENr112  ENr113  ENr123  ENr131  ENr213  ENr232  ENr321
# SNP                    783     1063    1261    1158    1485    1221    1186    900     777     1025

r² ≥ 0.5
# precinct               65      38      48      44      67      38      73      56      126     53
# tagSNP (upper bound)
  MultiPop-Tagselect     206     150     201     238     260     181     306     200     286     233
  GreedyTag              179     117     164     184     228     141     257     173     268     193
  LRTag                  179     117     164     184     228     141     257     171     268     191
# tagSNP (lower bound)
  LRTag-lb               178     117     162     182     226     141     256     173     268     193
  GreedyTag-lb           169     107     149     168     218     139     250     173     264     189
Gap                      1       0       2       2       0       0       1       0       0       0

r² ≥ 0.8
# precinct               156     111     146     101     209     106     194     170     210     191
# tagSNP (upper bound)
  MultiPop-Tagselect     338     275     321     425     454     329     462     324     402     396
  GreedyTag              322     255     305     391     437     300     445     318     392     377
  LRTag                  322     255     305     389     437     300     445     318     392     377
# tagSNP (lower bound)
  LRTag-lb               322     255     305     389     437     300     445     318     392     377
  GreedyTag-lb           319     253     303     374     435     300     443     318     392     371
Gap                      0       0       0       0       0       0       0       0       0       0
With the r² threshold set as 0.5, the number of tagSNPs selected by our algorithms is 14.4% of the total number of markers on the average (the actual percentage ranges from 11.2% to 21.4%). With a more stringent r² threshold of 0.8, the average number of tagSNPs increases to 26.6% (ranging from 22.2% to 35.5%). Similar trends were observed when applying our algorithms to the other populations (the results are not shown due to the space limitation).
Fig. 2. (A) Tagging for HapMap populations on region ENm013 with 783 markers. (B) Tagging for HapMap populations on all ENCODE regions with 10,859 markers. Here, "Grdy-lb" stands for "GreedyTag-lb", "LR-lb" stands for "LRTag-lb", "LR" stands for "LRTag", "Greedy" stands for "GreedyTag", and "MPS" stands for "MultiPop-Tagselect".
Fig. 3. (A). Tagging chromosome 3 for all HapMap populations with 196,535 markers. (B). Tagging the entire human genome for all HapMap populations with 2,862,454 markers. See the caption of Figure 2 for the definitions of the legends in the subfigures.
We observe that LRTag always performs the best, followed by the GreedyTag algorithm, and LD-Select* always performs the worst. With the r² threshold set as 0.5, LRTag usually requires 4.9% fewer tagSNPs (the actual percentage ranges from 2.8% to 6.3%) on average than LD-Select* on each chromosome. When the r² threshold is increased to 0.8, LRTag usually requires 1.2% fewer tagSNPs (ranging from 0.07% to 1.5%) on average than LD-Select*. We can see that, on each chromosome, the gap between the lower bound from LRTag-lb and the upper bound obtained by LRTag is on the average 7 (the actual number ranges from 1 to 20) with the r² threshold set as 0.5, and less than 1 (ranging from 0 to 2) with the r² threshold set as 0.8. This demonstrates that LRTag finds near-optimal solutions in all test cases even for genome-wide tagging on a single population. In fact, the performance of GreedyTag is not bad either.

Tagging the human genome for multiple populations. Finally, we tag the entire human genome for all four HapMap populations by MultiPop-Tagselect*, GreedyTag and LRTag. We summarize the tagging results of these methods on each chromosome in Table 8 (given in the appendix), and then highlight the results for chromosome 3 and all chromosomes in Figure 3. With the r² threshold set as 0.5, the number of tagSNPs selected by our methods is on the average 27.3% of the total number of markers (the actual percentage ranges from 21.9% to 47.2%). With a more stringent r² threshold of 0.8, the average number of tagSNPs increases to 46.0% (ranging from 29.4% to 60.4%). Based on Table 8, we observe that LRTag always performs slightly better than GreedyTag and significantly better than MultiPop-Tagselect*. With the r² threshold set as 0.5, LRTag requires 6.8% fewer tagSNPs on average (the actual number ranges from 4.0% to 8.0%) than MultiPop-Tagselect* on each chromosome. With the r² threshold set as 0.8, LRTag requires 3.6% fewer markers on average (ranging from 2.7% to 4.3%) than MultiPop-Tagselect*. The gap between the lower bound from LRTag-lb and the upper bound of LRTag is on the average 48 for each chromosome (the actual number ranges from 6 to 109) with the r² threshold set as 0.5, and 6.5 (ranging from 0 to 16) with the r² threshold set as 0.8, as shown in Table 8.
5. CONCLUSION

Our LRTag and GreedyTag algorithms run quickly on ENCODE regions and the entire human genome for both single and multiple populations. On an ENCODE region with the r² threshold being 0.5, it takes our algorithms no more than 2 seconds to tag a single population (as shown in Table 3) and less than 7 seconds to tag multiple populations (as displayed in Table 9 in the appendix). On a human chromosome, it takes no more than 4 minutes to tag a single population (as shown in Table 6 in the appendix) and less than 12 minutes to tag multiple populations (as displayed in Table 7 in the appendix). For r² thresholds greater than 0.5, our algorithms run faster. Hence, for any given r² threshold, it takes our algorithms less than a minute to tag the entire ENCODE region and less than an hour to tag the entire human genome. If the number of populations of interest increases, the genotyping density increases or the r² threshold increases, the number of required tagSNPs also increases. For example, on multiple HapMap populations with the r² threshold being 0.5, we need to tag one SNP for about every 6 SNPs on the densely genotyped ENCODE regions, and one SNP for about every 4 SNPs on the sparsely genotyped HapMap chromosomes. All the lower and upper bounds produced by the discussed methods are shown in Figures 2 and 3. In the figures, we tag the ENCODE regions and the human genome on the HapMap populations with the r² thresholds being 0.5, 0.6, 0.7 and 0.8 separately. From all these test cases, we observe that LRTag always chooses the smallest set of tagSNPs, closely followed by GreedyTag, while MultiPop-Tagselect chooses the largest set. LRTag-lb always provides the best lower bound and LRTag the best upper bound among all methods considered. The simple greedy algorithm, GreedyTag, chooses slightly more tagSNPs than LRTag, and the lower bound GreedyTag-lb is only slightly lower than LRTag-lb, which indicates that the data reduction rules in Section 3 are very powerful. When the r² threshold increases, the size of the precincts decreases; consequently, the gap between the lower bound and the upper bound decreases. For the entire human genome with 2,862,454 markers, the gap between LRTag and LRTag-lb is 1061 when the r² threshold is 0.5, and 142 when the r² threshold increases to 0.8. The small gap shows that LRTag finds near-optimal solutions for genome-wide tagging.
ACKNOWLEDGEMENT
The research was supported in part by NSF grant CCR-0309902, NIH grant LM008991-01, NSFC grant 60528001, NSF CAREER award IIS-0447773, NSF grant DBI-0321756 and a Changjiang Visiting Professorship at Tsinghua University. Our programs (GreedyTag and LRTag) are available upon request.
References
1. H. Avi-Itzhak et al. Selection of Minimum Subsets of Single Nucleotide Polymorphisms to Capture Haplotype Block Diversity. Proc. Pac. Symp. Biocomput. 2003; 466-477.
2. P. Bakker et al. Transferability of Tag SNPs in Genetic Association Studies in Multiple Populations. Nat. Genet. 2006; 38: 1298-1303.
3. E. Balas and M.C. Carrera. A Dynamic Subgradient-Based Branch-and-Bound Procedure for Set Covering. Operations Research 1996; 44: 875-890.
4. R. Bar-Yehuda and S. Moran. On Approximation Problems Related to the Independent Set and Vertex Cover Problems. Disc. Appl. Math. 1984; 9: 1-10.
5. C.S. Carlson et al. Selecting a Maximally Informative Set of Single-nucleotide Polymorphisms for Association Analyses using Linkage Disequilibrium. Am. J. Hum. Genet. 2004; 74: 106-120.
6. D.F. Conrad et al. A Worldwide Survey of Haplotype Variation and Linkage Disequilibrium in the Human Genome. Nat. Genet. 2006; 38: 1251-1260.
7. K. Ding et al. The Effect of Haplotype-block Definitions on Inference of Haplotype-block Structure and htSNPs Selection. Mol. Biol. Evol. 2005; 22: 148-159.
8. D.A. Hinds et al. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 2005; 307: 1072-1079.
9. International HapMap Consortium. Nature 2005; 437: 1299-1320.
10. HapMap LD data. http://www.hapmap.org/downloads/lddata/2005-10/
11. The International SNP Working Group. A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms. Nature 2001; 409: 928-933.
12. S.B. Gabriel et al. The Structure of Haplotype Blocks in the Human Genome. Science 2002; 296: 2225-2229.
13. B.N. Howie et al. Efficient Selection of Tagging Single-Nucleotide Polymorphisms in Multiple Populations. Hum. Genet. 2006; 120: 58-68.
14. Whole-Genome Patterns of Common DNA Variation in Three Human Populations. Science 2005; 307: 1072-1079.
15. E. Halperin, G. Kimmel and R. Shamir. Tag SNP Selection in Genotype Data for Maximizing SNP Prediction Accuracy. Bioinformatics 2005; 21(Suppl 1): i195-i203.
16. J. Hampe, S. Schreiber and M. Krawczak. Entropy-based SNP Selection for Genetic Association Studies. Hum. Genet. 2003; 114(1): 36-43.
17. G.C. Johnson et al. Haplotype Tagging for the Identification of Common Disease Genes. Nat. Genet. 2001; 29: 233-237.
18. L. Kruglyak and D. Nickerson. Variation is the Spice of Life. Nat. Genet. 2001; 27: 234-236.
19. X. Ke and L.R. Cardon. Efficient Selective Screening of Haplotype Tag SNPs. Bioinformatics 2003; 19: 287-288.
20. Z. Lin and R.B. Altman. Finding Haplotype Tagging SNPs by Use of Principal Components Analysis. Am. J. Hum. Genet. 2004; 75: 850-861.
21. Z. Liu, S. Lin and M. Tan. Genome-Wide Tagging SNPs with Entropy-Based Monte Carlo Method. J. Comput. Biol. 2006; 13(9): 1606-1614.
22. R. Magi, L. Kaplinski and M. Remm. The Whole Genome TagSNP Selection and Transferability Among HapMap Populations. Pacific Symposium on Biocomputing 2006; 11: 535-543.
23. A.C. Need and D.B. Goldstein. Genome-wide Tagging for Everyone. Nat. Genet. 2006; 38: 1227-1228.
24. N. Patil et al. Blocks of Limited Haplotype Diversity Revealed by High-resolution Scanning of Human Chromosome 21. Science 2001; 294: 1719-1723.
25. T.M. Phuong, Z. Lin and R.B. Altman. Choosing SNPs Using Feature Selection. Proc. Computational Systems Bioinformatics Conference (CSB) 2005; 301-309.
26. Z.S. Qin, S. Gopalakrishnan and G.R. Abecasis. An Efficient Comprehensive Search Algorithm for TagSNP Selection using Linkage Disequilibrium Criteria. Bioinformatics 2006; 22: 220-225.
27. D.O. Stram et al. Choosing Haplotype-tagging SNPs based on Unphased Genotype Data using a Preliminary Sample of Unrelated Subjects with an Example from the Multiethnic Cohort Study. Hum. Hered. 2003; 55: 27-36.
28. P. Sebastiani et al. Minimal Haplotype Tagging. Proc. Natl. Acad. Sci. USA 2003; 100: 9900-9905.
29. S.L. Sawyer et al. Linkage Disequilibrium Patterns Vary Substantially among Populations. Eur. J. Hum. Genet. 2005; 13: 677-686.
30. V.V. Vazirani. Approximation Algorithms, 2003.
31. N. Wang et al. Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination, and Mutation. Am. J. Hum. Genet. 2002; 71: 1227-1234.
32. K. Zhang et al. A Dynamic Programming Algorithm for Haplotype Partitioning. Proc. Natl. Acad. Sci. USA 2002; 99: 7335-7339.
33. K. Zhang and L. Jin. HaploBlockFinder: Haplotype Block Analyses. Bioinformatics 2003; 19: 1300-1301.
34. K. Zhang et al. HapBlock: Haplotype Block Partitioning and Tag SNP Selection Software Using a Set of Dynamic Programming Algorithms. Bioinformatics 2005; 21: 131-134.
35. E. Zeggini et al. Characterisation of the Genomic Architecture of Human Chromosome 17q and Evaluation of Different Methods for Haplotype Block Definition. BMC Genet. 2005; 6: 21.
APPENDIX

Algorithm 5.1 (GreedyTag: Greedy Algorithm for TagSNP Selection in Multiple Populations)
Input: A set V of biallelic SNP markers and their pairwise r2 LD statistics in K distinct populations. A pre-defined threshold γ0 for r2 LD statistics.
Output: A feasible tagSNP set T ⊆ V, and a lower bound LB.
Begin
  Partition markers into precincts. Let the set of precincts be P.
  For each precinct p ∈ P {the following will be executed in parallel on a multi-processor machine}
    Let U be the set of SNPs and W the set of marker occurrences in p.
    Step 1: Apply the three data reduction rules.
      Tp ← ∅; LBp ← 0; UPDATED ← true;
      While UPDATED {execute the three rules iteratively}
        UPDATED ← false;
        If ∃ an irreplaceable marker vj ∈ U {Rule 1}
          U ← U − {vj}; W ← W − N*(vj); {N*(vj) is defined in Equation (1)}
          Tp ← Tp ∪ {vj}; LBp ← LBp + 1; UPDATED ← true;
        If ∃ a less informative marker vj ∈ U {Rule 2}
          U ← U − {vj}; UPDATED ← true;
        If ∃ a less stringent occurrence vj ∈ W {Rule 3}
          W ← W − {vj}; UPDATED ← true;
      For each vj ∈ U
        D(vj) ← N*(vj) ∩ W;
    Step 2: Select tagSNPs greedily.
      While W is non-empty {there are markers to be tagged}
        Let vj0 ← argmax over vj ∈ U of |D(vj)|;
        Tp ← Tp ∪ {vj0}; U ← U − {vj0}; W ← W − N*(vj0);
        For each vj ∈ U
          D(vj) ← D(vj) ∩ W;
  T ← ∪ over p ∈ P of Tp; LB ← Σ over p ∈ P of LBp;
  Output T, LB {output the solution and the lower bound}
End

Algorithm 5.2 (LRTag: Lagrangian Relaxation Algorithm for TagSNP Selection in Multiple Populations)
Input: A set V of n biallelic SNP markers and their pairwise r2 LD statistics in K distinct populations. A pre-defined threshold γ0 for r2 LD statistics. A pre-defined initial scalar α0 and threshold αmin for subgradient optimization. A pre-defined maximum number Itermax of iterations and a pre-defined threshold Kmax of maximum trials.
Output: A feasible tagSNP set T ⊆ V, and a lower bound LB.
Begin
  Partition markers into precincts. Let the set of precincts be P.
  For each precinct p ∈ P {the following will be executed in parallel}
    Let U be the SNP set and W be the marker occurrences set in p.
    Step 1: Apply the three data reduction rules and obtain a temporary tagSNP set Tp and a lower bound LBp. {The same as the rules in Algorithm 5.1}
    Step 2: Select tagSNPs under a LR framework.
      Generate the Lagrangian relaxation formula as in Equation (4);
      k ← 0; α ← α0; Iter ← 0;
      Initialize λ as an arbitrary non-negative vector; LBpl ← 0; Tpl ← U;
      While (α > αmin) and (Iter < Itermax)
        Iter ← Iter + 1; new_LB ← Σ over 1≤i≤K, 1≤j≤n of λi,j; new_T ← ∅;
        {Calculate a new lower bound new_LB}
        For each vj ∈ U
          sj ← 1 − Σ over vi,j' ∈ N*(vj) of λi,j'; {N*(vj) is given in Equation (1)}
          If sj ≤ 0 tj ← 1; Else tj ← 0;
          new_LB ← new_LB + sj · tj;
        {Obtain a feasible tagSNP set new_T by the RCH method}
        For each vj ∈ U RCH_sj ← sj;
        For each vi,j ∈ W RCH_λi,j ← λi,j;
        For each vi,j ∈ W
          If Σ over vj' ∈ C(vi,j) of tj' < 1 {C(vi,j) is defined in Equation (2)}
            RCH_s ← min{RCH_sj' : vj' ∈ C(vi,j)};
            RCH_λi,j ← RCH_λi,j + RCH_s;
            For each vj' ∈ C(vi,j)
              RCH_sj' ← RCH_sj' − RCH_s;
              If RCH_sj' ≤ 0 tj' ← 1;
        For each vj ∈ U
          If tj = 1 new_T ← new_T ∪ {vj};
        {Update the lower bound LBpl and the tagSNP set Tpl}
        If new_LB ≤ LBpl
          k ← k + 1;
          If (k ≥ Kmax) α ← α/2; k ← 0;
        Else LBpl ← new_LB; k ← 0;
        If |new_T| < |Tpl| Tpl ← new_T;
        {Update the Lagrangian multipliers λ by the subgradient optimization method}
        For each vi,j ∈ W
          ∇λi,j ← 1 − Σ over vj' ∈ C(vi,j) of tj';
        λ ← max{0, λ + α · (|Tpl| − LBpl) / ||∇λ||² · ∇λ};
    {Combine the solutions from Step 1 and Step 2}
    Tp ← Tp ∪ Tpl; LBp ← LBp + LBpl;
  T ← ∪ over p ∈ P of Tp; LB ← Σ over p ∈ P of LBp;
  Output T, LB {output the solution and the lower bound}
End
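As a concrete illustration of the greedy covering in Step 2 of Algorithm 5.1, here is a minimal Python sketch. It assumes the neighbourhoods N*(v) have already been precomputed; the names cover and greedy_tag are illustrative rather than taken from the paper.

def greedy_tag(cover):
    """cover: dict mapping each candidate SNP in U to the set of marker
    occurrences in W that it tags (its N*).  Returns a tagSNP set T_p."""
    uncovered = set().union(*cover.values())   # W: occurrences still to be tagged
    tags = set()                               # T_p: selected tagSNPs
    while uncovered:
        # choose the SNP with the largest residual coverage D(v) = N*(v) & W
        best = max(cover, key=lambda v: len(cover[v] & uncovered))
        tags.add(best)
        uncovered -= cover.pop(best)
    return tags

# toy example: SNP 2 covers three occurrences and is therefore picked first
print(greedy_tag({1: {"a", "b"}, 2: {"b", "c", "d"}, 3: {"d"}, 4: {"e"}}))

The same loop, applied per precinct after the reduction rules of Step 1, reproduces the overall structure of GreedyTag.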
Table 5. Summary of the tagSNPs selected by LD-Select, GreedyTag and LRTag for a single population, CEU, on each human chromosome. (ub: # tagSNP, upper bound; lb: # tagSNP, lower bound.)

Chromosome            1       2       3       4       5       6       7       8       9      10      11
# SNP            151195  181499  143472  130823  138817  149514  113037  122646  100352  110942  104661
r2 >= 0.5:
# precinct        15752   29426   12901   11906   11998   11831   10512    9900    9438   10153    9979
LD-Select* (ub)   21865   36238   19063   17212   17765   17921   15418   15140   13800   14882   14307
GreedyTag (ub)    20806   35083   17984   16295   16769   16815   14584   14203   13066   14041   13600
LRTag (ub)        20800   35065   17977   16286   16756   16798   14577   14196   13058   14038   13589
LRTag-lb          20793   35059   17958   16279   16736   16784   14569   14182   13049   14031   13578
GreedyTag-lb      20123   34202   17155   15675   15965   16086   14021   13568   12530   13477   13089
Gap                   7       6      19       7      20      14       8      14       9       7      11
r2 >= 0.8:
# precinct        35990   51098   31916   28650   29931   30632   26181   26120   23739   25186   23544
LD-Select* (ub)   38944   54612   35092   31590   32978   33723   28754   28822   26008   27698   25826
GreedyTag (ub)    38534   54081   34602   31124   32502   33229   28362   28394   25666   27302   25485
LRTag (ub)        38534   54080   34601   31123   32501   33227   28362   28393   25665   27302   25484
LRTag-lb          38534   54080   34600   31123   32501   33225   28361   28391   25664   27301   25483
GreedyTag-lb      38269   53687   34310   30824   32189   32962   28083   28110   25396   27077   25276
Gap                   0       0       1       0       0       2       1       2       1       1       1

Chromosome           12      13      14      15      16      17      18      19      20      21      22
# SNP            100437   84184   68485   58491   57083   47505   62666   29341   51206   27955   26996
r2 >= 0.5:
# precinct         9960    7476    6751    6740    7184    6764    6534    5291    5874    3270    3829
LD-Select* (ub)   14243   10996    9703    9364    9962    8656    9115    6464    7972    4470    5029
GreedyTag (ub)    13554   10374    9215    8930    9503    8355    8652    6286    7637    4258    4831
LRTag (ub)        13548   10370    9212    8923    9500    8354    8649    6284    7634    4257    4831
LRTag-lb          13539   10363    9203    8920    9500    8353    8646    6283    7630    4256    4830
GreedyTag-lb      13048    9988    8867    8589    9241    8180    8332    6190    7380    4125    4714
Gap                   9       7       9       3       0       1       3       1       4       1       1
r2 >= 0.8:
# precinct        23809   18509   16391   15629   16869   13942   15262   10019   13177    7390    8240
LD-Select* (ub)   25887   20221   17723   16908   18194   14778   16498   10494   14194    7912    8727
GreedyTag (ub)    25579   19967   17546   16722   18012   14670   16299   10420   14052    7844    8652
LRTag (ub)        25579   19967   17545   16722   18012   14669   16299   10420   14052    7844    8652
LRTag-lb          25578   19966   17545   16722   18012   14668   16299   10420   14051    7844    8652
GreedyTag-lb      25387   19778   17405   16608   17836   14588   16181   10382   13943    7774    8601
Gap                   1       1       0       0       0       1       0       0       1       0       0
Table 6. The speeds of GreedyTag and LRTag for tagging the human genome for a single population, CEU, with the r2 threshold being 0.5. The running time is evaluated on a 32-processor SGI Altix 4700 supercomputer system.

Chromosome   LRTag   GreedyTag   Chromosome   LRTag   GreedyTag
1            1m18s   1m17s       12           56s     56s
2            1m44s   1m41s       13           50s     50s
3            1m28s   1m16s       14           34s     37s
4            1m12s   1m15s       15           28s     27s
5            1m27s   1m24s       16           23s     20s
6            3m7s    3m11s       17           46s     47s
7            1m3s    58s         18           31s     31s
8            1m15s   1m16s       19           9s      10s
9            57s     57s         20           23s     22s
10           1m6s    1m6s        21           11s     11s
11           1m8s    1m10s       22           10s     9s
Table 7. The speeds of GreedyTag and LRTag for tagging the entire human genome for all HapMap populations with the r2 threshold being 0.5. The running time is evaluated on a 32-processor SGI Altix 4700 supercomputer system.

Chromosome   LRTag   GreedyTag   Chromosome   LRTag   GreedyTag
1            3m14s   3m11s       12           2m55s   3m
2            2m2s    1m13s       13           2m11s   2m27s
3            3m19s   3m43s       14           1m28s   1m16s
4            2m51s   2m46s       15           1m      52s
5            3m37s   3m20s       16           48s     23s
6            11m4s   10m45s      17           1m10s   1m9s
7            2m12s   2m25s       18           1m17s   1m16s
8            3m45s   2m52s       19           27s     27s
9            2m24s   2m18s       20           56s     50s
10           2m49s   2m55s       21           25s     25s
11           2m20s   2m16s       22           30s     30s
Table 8. Summary of the tagSNPs selected by MultiPop-TagSelect, GreedyTag and LRTag for all HapMap populations on each human chromosome. (ub: # tagSNP, upper bound; lb: # tagSNP, lower bound.)

Chromosome                1       2       3       4       5       6       7       8       9      10      11
# SNP                216357  249136  196535  182273  187924  205496  155224  170136  138047  156089  144083
r2 >= 0.5:
# precinct            16234   26836   12835   12251   12414   11862   10332   10101    9254   10220    9568
MultiPop-TagSelect*   64892  126408   56978   52828   54087   54454   45943   46927   41341   45226   41556
GreedyTag (ub)        59126  122372   51266   47650   48661   48817   41169   42206   37289   40713   37365
LRTag (ub)            55016  117537   47450   44223   45186   44987   38150   39149   34439   37554   34590
LRTag-lb              54942  117511   47362   44141   45102   44878   38090   39076   34381   37486   34537
GreedyTag-lb          53937  117155   46330   43239   44145   43845   37280   38161   33534   36713   33778
Gap                      74      26      88      82      84     109      60      73      58      68      53
r2 >= 0.8:
# precinct            42450   56135   35192   33434   33211   33228   28543   28948   25485   28277   25428
MultiPop-TagSelect*  100062  155505   89195   82835   84998   86313   72024   74934   65442   70817   64679
GreedyTag (ub)        94797  150664   84091   78077   80188   80981   67818   70678   61708   66676   60721
LRTag (ub)            94797  150664   84090   78076   80186   80980   67817   70677   61706   66674   60718
LRTag-lb              94788  150660   84079   78072   80174   80964   67808   70667   61699   66663   60705
GreedyTag-lb          94362  150393   83585   77674   79663   80507   67461   70285   61321   66291   60294
Gap                       9       4      11       4      12      16       9      10       7      11      13

Chromosome               12      13      14      15      16      17      18      19      20      21      22
# SNP                141943  119080   94528   81687   79898   64645   89024   40549   70877   39400   39523
r2 >= 0.5:
# precinct            10086    7810    6532    6667    7328    6952    6875    5127    6139    3290    3884
MultiPop-TagSelect*   42362   33477   28465   27847   28987   23601   27109   15768   23243   13100   13895
GreedyTag (ub)        38563   30183   25706   25408   26432   21931   24789   14785   21319   12010   12980
LRTag (ub)            35493   27927   23932   23721   24791   20647   23174   14007   19994   11253   12174
LRTag-lb              35449   27881   23903   23686   24761   20636   23141   14001   19971   11238   12160
GreedyTag-lb          34833   27306   23440   23229   24294   20366   22697   13814   19601   11035   12017
Gap                      44      46      29      35      30      11      33       6      23      15      14
r2 >= 0.8:
# precinct            27027   21084   17723   17526   18943   16278   17866   11289   15438    8366    9480
MultiPop-TagSelect*   65521   52863   44226   42380   43913   34289   41893   22274   35251   19990   20624
GreedyTag (ub)        61828   49797   41867   40250   41726   32862   39833   21465   33676   19060   19742
LRTag (ub)            61826   49796   41867   40250   41724   32862   39833   21464   33675   19060   19741
LRTag-lb              61816   49791   41860   40247   41721   32860   39832   21464   33673   19059   19739
GreedyTag-lb          61450   49498   41642   40029   41497   32740   39625   21377   33525   18996   19660
Gap                      10       5       7       3       3       2       1       0       2       1       2
Table 9. The speeds of GreedyTag and LRTag for tagging the entire ENCODE region for all HapMap populations with the r2 threshold being 0.5. The running time is evaluated on a 32-processor SGI Altix 4700 supercomputer system.

Region    LRTag   GreedyTag
ENm010    1s      1s
ENm013    4s      4s
ENm014    3s      4s
ENr112    5s      5s
ENr113    1s      1s
ENr123    5s      6s
ENr131    1s      1s
ENr213    1s      1s
ENr232    1s      1s
ENr321    2s      2s
TRANSCRIPTIONAL PROFILING OF DEFINITIVE ENDODERM DERIVED FROM HUMAN EMBRYONIC STEM CELLS

Huiqing Liu, Stephen Dalton, Ying Xu*

Dept. of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
*Email: [email protected]

Definitive endoderm (DE), the inner germ layer of the trilaminar embryo, forms the gastrointestinal tract and its derivatives: thyroid, thymus, pancreas, lungs and liver. Studies on DE formation in Xenopus, zebrafish and mouse suggest a conserved molecular mechanism among vertebrates. However, relevant analysis of this activity in humans has not been extensively carried out. With the maturity of the techniques for monitoring how human embryonic stem cells (hESCs) react to signals that determine their pluripotency, proliferation, survival, and differentiation status, we are now able to conduct similar research in humans. In this paper, we present an analysis of gene expression profiles obtained from two recent experiments to identify genes expressed differentially during the process of hESC differentiation to DE. We have carried out a systematic study on these genes to understand the related transcriptional regulations and signaling pathways using computational predictions and comparative genome analyses. Our preliminary results draw a transcriptional profile of hESC-DE formation similar to that of other vertebrates.
1. INTRODUCTION
During gastrulation, three primary germ layers (endoderm, mesoderm and ectoderm) are derived from the epiblast of human embryonic stem cells (hESCs). From these initial embryo layers, all the other somatic tissue types will develop. For instance, endoderm (the inner layer, also known as definitive endoderm, DE) forms the gastrointestinal tract and its derivatives: thyroid, thymus, pancreas, lungs, and liver. Therefore, investigation of the biological mechanisms that occur during hESC differentiation will help us understand the developmental pathways involved in the formation of a mature organ. In this study, we attempt to investigate transcriptional regulation and associated signaling pathways related to DE formation from hESCs. Although studies on DE formation in Xenopus, zebrafish, and mouse suggested a conserved molecular mechanism among vertebrates, relevant analysis of this activity in humans has not been extensively done. Recently, several new methods for directing the differentiation of hESCs towards DE have been investigated, and two techniques were reported to have successfully directed DE formation. The core part of one technique, Ref. 1, is to first treat hESCs with Activin A in a low FCS (fetal calf serum) condition, and then enrich the culture by the DE cell
*Corresponding author
surface marker CXCR4. Another technique, as described in our previous publication (Ref. 5), is to grow hESCs in mouse embryonic fibroblast conditioned medium (MEF-CM) under feeder-free conditions with phosphatidylinositol 3-kinase signaling being suppressed. After five days, about 70-80% of the hESC culture is converted into DE. To compare the DE generated by the two techniques and to obtain an overall gene expression profile of this cell line, RNA samples from these two experiments are hybridized to the Affymetrix HG-U133 Plus 2.0 oligonucleotide microarray, which contains more than 54,000 probe sets, representing 38,500 human genes. Studies on other vertebrates suggest that DE formation requires first the Activin/Nodal signaling of the TGFβ (transforming growth factor β) super-family, followed by the activation of a set of downstream transcription factors (TFs) such as SOX17 of the Sox (SRY-related HMG-box) family, FOXA2 (HNF3β) of the Forkhead family, and a number of TFs from the Gata family. Manipulation of the Activin/Nodal ligands is done through a few transcription factors of the Smad family that lie at the core of the TGFβ pathway. Current understanding of this process is that, when the Activin/Nodal signaling protein meets its receptor, the highly homologous SMAD2 and SMAD3 intracellular mediators get phosphorylated on their conserved C-terminal motif and translocate
with SMAD4 into the nucleus. SMAD2 or SMAD3 associates with SMAD4 to form a Smad complex, which incorporates an additional DNA-binding cofactor to activate or repress the expression of the regulated genes.
2. DATA

Temporal gene expression profiles from the two experiments described in Ref. 1 and Ref. 5 are collected. Data are scaled to a median intensity with a target setting of 500, and CEL files were normalized using probe quantile normalization. In both experiments, only those transcripts differentially expressed during DE formation are kept for further study. Table 1 shows the number of genes with certain fold changes in their expression levels during DE development in the experiments (data set I is from Ref. 5 and data set II is from Ref. 1). Seventy-five genes that exhibit substantial changes in both experiments are selected; they are categorized in Table 1 of Ref. 5 according to their biological functions. Throughout the rest of this paper, we use this gene set for our data analysis studies unless stated otherwise. By looking at the functional (biological process) assignments of these genes according to the Gene Ontology (GO) database (http://www.geneontology.org/), we found that 70% of these genes have the GO term "cellular physiological process" (level 3 biological process), 50% have the term "cell communication" and 48% are involved in "regulation of cellular process". In addition to the microarray data, we have also collected gene expression information using the quantitative polymerase chain reaction (Q-PCR) under the same protocol described in Ref. 5.

Table 1. Number of genes differentially expressed in DE formation from two microarray data sets. nf means n-fold change of expression level (n = 2, 4, 6, 8, 10).

Data Set     2f      4f      6f      8f     10f
I          3360     901     475     325     236
II         4168    1088     527     339     232
Both       1926     389     190     113      75
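The fold-change filter behind Table 1 can be sketched as follows; the array names expr0 and expr1 (expression at the start and end of DE formation) and the synthetic values are hypothetical stand-ins for the scaled, quantile-normalized data.

import numpy as np

# synthetic per-gene expression levels standing in for the real arrays
rng = np.random.default_rng(0)
expr0 = rng.lognormal(mean=6, sigma=1, size=10000)
expr1 = expr0 * rng.lognormal(mean=0, sigma=1, size=10000)

# fold change in either direction (up- or down-regulation)
fold = np.maximum(expr1 / expr0, expr0 / expr1)
for n in (2, 4, 6, 8, 10):
    print(f"{n}-fold: {np.sum(fold >= n)} genes")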
3. METHODS AND PRELIMINARY RESULTS

To analyze DE genes in a systematic manner, we present a number of studies in this section. For each study, we report the preliminary results that we have obtained so far.

3.1. Markers of hESC-DE

Among the seventy-five genes identified above, we found all previously known markers of DE, namely SOX17, CXCR4, GSC, CER1, HHEX, FOXA2, GATA4 and GATA6. All these genes are up-regulated during the differentiation. On the other hand, different from these genes, three indicators of the mesoderm (ME) patterning from mesendoderm, namely Brachyury (T), MIXL1 and FOXC1, all have their expression levels stop increasing in the middle of the DE formation (at ~24 or ~36 hours; both microarray and Q-PCR data) and then drop sharply immediately after that turning point. This confirms that the differentiation is to DE, not to ME.
3.2. Transcription factor identification
To investigate the functional roles of the Smad family members and other TFs during human DE formation, we analyzed the promoter regions of the obtained hESC-DE genes using computational tools and a comparative genomics approach. Figure 1 describes the workflow of our procedure.
Fig. 1. A workflow for identifying transcription factors in hESC-DE: align DE genes with their ortholog genes in the mouse genome; extract the conserved promoter sequences of the genes; find over-represented TFBS (rVISTA, TRANSFAC); discover new binding motifs (motif finding program); and check the conservation of binding motifs in the rat and dog genomes (Molecular Signature Database).
We used the (whole genome) rVISTA tool to assist our work in this step; rVISTA is one of the VISTA computational tools developed at Lawrence Berkeley National Laboratory (http://genome.lbl.gov/vista/index.shtml). By checking against the known transcription factor binding sites (TFBS) stored in the TRANSFAC database, rVISTA is able to identify TFBS that are enriched in the promoter regions of a group of input genes and are conserved between pairs of species. In our study, we chose to scan the 5,000 bps upstream region of each of our genes for possible conserved TFBS in the human genome against its counterpart in the mouse genome. The enrichment is measured by a p-value taking all upstream regions of human RefSeq genes as the background, and the output is the corresponding TFs. The two top enriched TFBS returned by rVISTA are for the Forkhead family members FOXO1 and FOXO4. FOXO is one of the identified DNA-binding cofactors of SMAD2/3-SMAD4. Furthermore, by scanning MSigDB (Molecular Signature Database; http://www.broad.mit.edu/gsea/msigdb/msigdb-index.html), 25 out of the 75 hESC-DE specific genes are reported to have FOXO motifs conserved across the human, mouse, rat and dog genomes. MSigDB stores all conserved transcription factor binding motifs derived from a recent comparative analysis of the four genomes9. Other top TFs whose binding sites are over-represented include E2F1/DP1, LEF1/TEF1, PITX2 (p < 10^-12), and TCF4, suggesting the involvement of the Wnt/β-catenin and Nodal signaling pathways in the DE formation. We observed that the Smad binding element (SBE) is not enriched in our data set. This is not surprising since the SBE sequence, 5'-GTCT-3' or its complement 5'-AGAC-3', is too short to be identified alone against the background of the large population of these 4-mers occurring, by chance, in the genome. To discover new binding motifs, or those not captured by the existing popular TFBS databases, we applied CUBIC, a motif finding program developed by our group, to the relevant promoter regions (conserved between human and mouse) of the identified hESC-DE genes. CUBIC is an efficient tool to identify transcription factor binding sites via data clustering6. One motif identified by CUBIC is similar to the previously reported FOXH (FAST) binding site (CAATxxACA). FOXH is another known important cofactor of SMAD2/3-SMAD4 in response to the TGFβ signaling. A sequence logo of this motif is given in Figure 2, drawn by WebLogo (http://weblogo.berkeley.edu/). In addition, several GC-rich motifs are also identified, which is consistent with the previously reported fact that Smad complexes recognize GC-rich regions in certain promoters. Our results support the current general belief that in SMAD2/3-SMAD4 regulations, DNA-binding partners determine the choice of the target genes.

Fig. 2. A motif identified by CUBIC (a predicted motif similar to the FoxH binding site). It is similar to the reported binding site of FOXH.
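The paper does not spell out the exact statistic behind the rVISTA p-values; assuming the common hypergeometric model of TFBS enrichment, a minimal sketch looks like this (all counts in the example call are made up):

from scipy.stats import hypergeom

def tfbs_enrichment(N, K, n, k):
    """Of N background promoters, K carry a conserved site for the TF;
    k of the n input promoters carry it.  Returns P(X >= k)."""
    return hypergeom.sf(k - 1, N, K, n)

# e.g. 25 of 75 DE genes with a conserved motif, 2000 of 20000 background
print(tfbs_enrichment(N=20000, K=2000, n=75, k=25))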
3.3. Transcriptional regulation in the different phases of DE formation

Smad regulation is the first response to the TGFβ signaling in the DE formation. In order to further direct the differentiation to DE, several downstream transcription factors are also required. SOX17 and FOXA2 are among the ones that have been previously reported. To identify the transcription factors that function in different phases during the DE formation, we have clustered genes based on the similarities of their expression profiles and attempted to find the "dominant" regulator(s) for each gene cluster. Figure 3 shows that the genes are grouped into five clusters. Most genes in Cluster 1 start to get up-regulated from ~72 hours. Genes in Cluster 2 get up-regulated around 48 hours, while genes in Cluster 4 start from ~24/36 hours. Besides, genes in Cluster 3 represent "early response" genes, which start to increase their expression levels in the very early phase, and most of them have a decreasing pattern in the later phases during the DE formation. We notice that two of them are previously known ME markers, i.e. MIXL1 and FOXC1. Different from these four clusters, Cluster 5 consists of genes that are down-regulated during the DE differentiation.
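A minimal sketch of this clustering step, assuming log-transformed temporal profiles arranged as a genes-by-time-points matrix (the synthetic profiles below merely stand in for the real data):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
profiles = np.log2(rng.lognormal(0, 1, size=(75, 6)))   # 75 genes, 6 time points
profiles -= profiles.mean(axis=1, keepdims=True)        # centre each gene's profile

Z = linkage(profiles, method="average", metric="correlation")  # hierarchical tree
labels = fcluster(Z, t=5, criterion="maxclust")                # cut into 5 clusters
for c in range(1, 6):
    print(f"cluster {c}: {np.sum(labels == c)} genes")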
Fig. 3. Clustering on temporal expression profiles of hESC-DE genes. Hierarchical clustering was performed by Cluster 3.0 on log-transformed and normalized data. This software can be accessed from http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm.

When we applied rVISTA to each cluster of genes, we found additional over-represented TFBS. Particularly, the binding sites of FOXO1 and FOXO4 are ranked as the top TFBS in Cluster 3 and Cluster 4 genes, confirming the Smad regulation in the early phases during the DE formation. Binding sites of MEIS1 (Meis homeobox 1) (p < 10^-3) and TITF1 (thyroid transcription factor) (p < 10^-3) are the two top TFBS for genes in Cluster 5. Although the expression profile of MEIS1 does not show substantial changes in our data, its overall trend is clearly decreasing (data not shown). In addition, previous studies in vertebrates have shown the involvement of TITF1 in the organogenesis of the thyroid, lungs and some areas of the forebrain2. The best TFBS from Cluster 2 genes is the muscle initiator sequence 20 (p < 10^-5), and for Cluster 1 it is POU6F1 (POU domain, class 6, transcription factor 1) (p < 10^-3). This may imply the involvement of these two TFs in the later phases of DE formation.

4. ON-GOING WORK AND CONCLUSION

Since SOX17 is not recorded in TRANSFAC, we cannot get any TFBS information for this transcription factor by using rVISTA. We are currently employing other computational methods to identify its target genes and the potential binding sites from the transcripts in Cluster 1 and Cluster 2 of Figure 3. Meanwhile, some of our discoveries reported in this paper are under wet-lab verification. After a clear picture of the transcriptional regulations is disclosed, we will investigate more of the signaling pathways related to hESC-DE. In summary, we have applied computational tools and a comparative genomics approach to analyze temporal gene expression data of hESC-DE formation. Our preliminary results demonstrate that the biological mechanism of DE differentiation in human is similar to that of the other vertebrates.

Acknowledgment

HL and YX's work is supported in part by the Georgia Cancer Coalition and NSF IIS-0407204. SD's work is supported by the Georgia Research Alliance.

References

1. D'Amour KA, et al. (2005) Efficient differentiation of human embryonic stem cells to definitive endoderm. Nat Biotechnol, 23(12):1534-41.
2. Devriendt K, et al. (1998) Deletion of thyroid transcription factor-1 gene in an infant with neonatal thyroid dysfunction and respiratory failure. (Letter) New Eng. J. Med. 338:1317-1318.
3. Labbe E, et al. (1998) Smad2 and Smad3 positively and negatively regulate TGF beta-dependent transcription through the forkhead DNA-binding protein FAST2. Mol Cell, 2(1):109-20.
4. Massagué J, Seoane J, Wotton D. (2005) Smad transcription factors. Genes Dev, 19(23):2783-2810.
5. McLean AB, et al. (2007) Activin A efficiently specifies definitive endoderm from human embryonic stem cells only when phosphatidylinositol 3-kinase signaling is suppressed. Stem Cells, 25(1):29-38.
6. Olman V, Xu D, Xu Y. (2003) CUBIC: identification of regulatory binding sites through data clustering. JBCB, 1(1):21-40.
7. Stainier DY. (2002) A glimpse into the molecular entrails of endoderm formation. Genes Dev, 16(8):893-907.
8. Sinner D, et al. (2004) Sox17 and β-catenin cooperate to regulate the transcription of endodermal genes. Development, 131(13):3069-80.
9. Xie X, et al. (2005) Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature, 434(7031):338-345.
Pathways, Networks and Systems Biology
BAYESIAN INTEGRATION OF BIOLOGICAL PRIOR KNOWLEDGE INTO THE RECONSTRUCTION OF GENE REGULATORY NETWORKS WITH BAYESIAN NETWORKS

Dirk Husmeier* and Adriano V. Werhli

Biomathematics and Statistics Scotland, Edinburgh, United Kingdom
*Email: [email protected]

There have been various attempts to improve the reconstruction of gene regulatory networks from microarray data by the systematic integration of biological prior knowledge. Our approach is based on pioneering work by Imoto et al.11, where the prior knowledge is expressed in terms of energy functions, from which a prior distribution over network structures is obtained in the form of a Gibbs distribution. The hyperparameters of this distribution represent the weights associated with the prior knowledge relative to the data. To complement the work of Imoto et al.11, we have derived and tested an MCMC scheme for sampling networks and hyperparameters simultaneously from the posterior distribution. We have assessed the viability of this approach by reconstructing the RAF pathway from cytometry protein concentrations and prior knowledge from KEGG.

*Corresponding author
1. INTRODUCTION

Bayesian networks have received increasing attention from the computational biology community as models of gene regulatory networks, following up on pioneering work by Friedman et al.4 and Hartemink et al.6. Several tutorials on Bayesian networks have been published8, 10, 16. We therefore only qualitatively recapitulate some aspects that are of relevance for the present study, and refer the reader to the above tutorials for a thorough and more rigorous introduction. The structure of a Bayesian network is defined by a directed acyclic graph (DAG) indicating how different variables of interest, represented by nodes, "interact". The word "interact" has a causal connotation, which is ultimately of interest to the biologist, but has to be taken with caution in this context, as explained shortly. The edges of a Bayesian network are associated with conditional probabilities, defined by a functional family and their parameters. The interacting entities are associated with random variables, which represent some measured entities of interest, like relative gene expression levels or protein concentrations. We denote the set of all the measurements of all the random variables as the data, represented by the letter D. As a consequence of the acyclicity of the network structure, the joint probability of all the random variables can be factorized into a product of lower-complexity conditional probabilities according to conditional independence relations defined by the graph structure G. Under certain regularity conditions, the parameters associated with these conditional probabilities can be integrated out analytically. This allows us to compute the marginal likelihood or evidence P(D|G), which captures how well the network structure G explains the data D. In the present study we computed P(D|G) under the assumption of a linear Gaussian distribution. The resulting score was derived by Geiger and Heckerman5 and is referred to as the BGe score. We are interested in learning a network of causal relations between interacting nodes. While such a causal network forms a valid Bayesian network, the inverse relation does not hold: when we have learned a Bayesian network from the data, the resulting graph does not necessarily represent the correct causal graph. One reason for this discrepancy is the existence of unobserved nodes. When we find a probabilistic dependence between two nodes, we cannot necessarily conclude that there exists a causal interaction between them, as this dependence could have been brought about by a common yet unobserved regulator. However, even under the assumption of complete observation the inference of causal interaction networks is impeded by symmetries within so-called equivalence classes, which consist of networks that yield the same evidence scores P(D|G).
A simple example is two conditionally dependent nodes, say A and B, where the two networks related to the two possible directions of the edge, A → B and A ← B, are equivalent. There are two ways to break the symmetries of the equivalence classes. One approach is to use active interventions, like gene knockouts and overexpressions. When knocking out gene A affects gene B, while knocking out gene B does not affect gene A, then A → B will tend to have a higher evidence than A ← B. For more details, see Refs. 23, 24. An alternative way to break the symmetries, investigated in this paper, is to use prior information. When genes A and B are conditionally dependent, and we have prior knowledge that A is a transcription factor that regulates genes in the functional category that B belongs to, then we will presumably favour A → B over A ← B. To formalize this notion, we score networks by the posterior probability P(G|D):
P(G|D) ∝ P(D|G) P(G)    (1)

where P(D|G) is the evidence, and P(G) is the prior distribution over network structures; the latter distribution captures the biological knowledge that we have prior to measuring the data D. While different graphs might have identical scores in light of the data, P(D|G), symmetries can be broken by the inclusion of prior knowledge, P(G), and these two sources of information are systematically integrated into the posterior distribution P(G|D). Our ultimate objective, hence, is to find the network structure G that maximizes P(G|D). Unfortunately, the number of structures increases super-exponentially with the number of nodes. Also, in systems biology, where we aim to learn complex interaction patterns involving many components, the amount of information from the data and the prior is usually not sufficient to render the distribution P(G|D) sharply peaked at a single graph. Instead, the distribution is usually diffusely spread over a large set of networks. Summarizing this distribution by a single network would not be appropriate. Instead, we aim to sample network structures from the posterior distribution P(G|D) so as to obtain a typical collection of high-scoring networks and, thereby, capture intrinsic inference uncertainty. Direct sampling from this distribution is usually intractable, though. Hence, we resort to a Markov chain Monte Carlo (MCMC) scheme17, which under fairly general regularity conditions is theoretically guaranteed to converge to the posterior distribution of equation (1)7. Given a network structure Gold, a new network structure Gnew is proposed from the proposal distribution Q(Gnew|Gold), which is then accepted according to the standard Metropolis-Hastings scheme7 with the following acceptance probability:

A = min{ 1, [P(D|Gnew) P(Gnew)] / [P(D|Gold) P(Gold)] × Q(Gold|Gnew) / Q(Gnew|Gold) }    (2)

The functional form of the proposal distribution Q(Gnew|Gold) depends on the chosen type of proposal moves. In the present paper, we consider three edge-based proposal operations: creating, deleting, or inverting an edge. The computation of the Hastings factor Q(Gold|Gnew)/Q(Gnew|Gold) is, for instance, discussed in Ref. 10.
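A minimal sketch of one such Metropolis-Hastings structure move, working in log space; log_evidence stands for log P(D|G) (e.g. the BGe score) and log_prior for log P(G), both assumed to be supplied elsewhere:

import math, random

def mh_structure_step(G_old, propose, log_evidence, log_prior):
    """propose(G_old) returns a new graph and the log Hastings factor
    log Q(G_old|G_new) - log Q(G_new|G_old)."""
    G_new, log_hastings = propose(G_old)
    log_A = (log_evidence(G_new) + log_prior(G_new)
             - log_evidence(G_old) - log_prior(G_old) + log_hastings)
    if math.log(random.random()) < min(0.0, log_A):
        return G_new       # accept the proposed structure
    return G_old           # reject and keep the current structure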
2. METHODOLOGY
2.1. Biological prior knowledge

To integrate biological prior knowledge into the inference of gene regulatory networks, we define a function that measures the agreement between a given network G and our biological prior knowledge. Following an approach first proposed by Imoto et al.11 and subsequently applied in Refs. 12, 18, 21, 22, we call this measure the energy E, borrowing the name from statistical physics. We split E into two components. One of the components, E0, is associated with the absence of edges. The other component, E1, is associated with the presence of edges. A network G is represented by a binary adjacency matrix, where each entry Gij can be either 0 or 1. A zero entry, Gij = 0, indicates the absence of an edge between node i and node j. Conversely, if Gij = 1 there is a directed edge from node i to node j. We define the biological prior knowledge matrix B to be a matrix in which the entries Bij ∈ [0, 1] represent our knowledge about interactions between nodes as follows: If entry Bij = 0.5, we do not have any prior knowledge about the presence or absence of the directed edge between node i and node j. If 0 ≤ Bij < 0.5 we have prior evidence that the directed edge between node i and node j is absent. The evidence is stronger as Bij is closer to 0. If 0.5 < Bij ≤ 1 we have prior evidence
that the directed edge pointing from node i to node j is present. The evidence is stronger as Bij is closer to 1. Having defined how to represent a network G and the biological prior knowledge B, we now define the energies associated with the absence and presence of edges as follows:

E0(G) = Σ over i,j = 1..n, Gij = 0 of Bij    (3)

E1(G) = Σ over i,j = 1..n, Gij = 1 of (1 − Bij)    (4)

where n is the total number of nodes. To integrate the prior knowledge expressed by Equations (3) and (4) into the inference procedure, we follow Imoto et al.11 and define the prior distribution over network structures G to take the form of a Gibbs distribution:

P(G|β0, β1) = exp(−{β0 E0(G) + β1 E1(G)}) / Z(β0, β1)    (5)

where the partition function is defined as:

Z(β0, β1) = Σ over all network structures G of exp(−{β0 E0(G) + β1 E1(G)})    (6)

Unfortunately, the number of graphs increases super-exponentially with the number of nodes, rendering the computation of Z not viable for large networks. To proceed, we define:

E0(G) = Σ over n of ε0(πn[G])    (7)

E1(G) = Σ over n of ε1(πn[G])    (8)

where πn[G] is the set of parents of node n in the graph G and we have defined:

ε0(πn[G]) = Σ over i with vi ∉ πn[G] of Bin    (9)

ε1(πn[G]) = Σ over i with vi ∈ πn[G] of (1 − Bin)    (10)

Akin to the ideal gas approximation in statistical physics, we now approximate the partition function of the whole network by a product of single-node partition functions:

Z(β0, β1) ≈ Π over n of Σ over πn of exp(−{β0 ε0(πn) + β1 ε1(πn)})    (11)

Here, the summation in the last equation extends over all parent configurations πn of node n, which in the case of a fan-in restriction is subject to constraints on their cardinality. Note that the essence of equation (11) is a dramatic reduction in the computational complexity. Rather than summing over the whole space of network structures, whose cardinality increases super-exponentially with the number of nodes N, we only need to sum over all parent configurations of each node; the complexity of this operation is polynomial in N. However, we have ignored interactions between the nodes; modifications of a parent configuration πn may lead to a directed cyclic structure, which is invalid and should be excluded from the summation in equation (11). The detection of directed cycles is a global operation. This destroys the modularity inherent in equation (11), and leads to a considerable explosion of the computational complexity. Note, however, that equation (11) still provides an upper bound on the true partition function. When densely connected graphs are ruled out by a fan-in restriction, as commonly done, the number of cyclic terms that need to be excluded from equation (11) can be assumed to be relatively small. We can then expect the bound to be rather tight, and use it to approximate the true partition function. In all our simulations we assumed a fan-in restriction of three, as has widely been applied by different authors3, 4, 9.
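The energies (3)-(4) and the single-node approximation (11) can be sketched as follows for a small network; the function names are illustrative, and the sketch ignores the acyclicity correction discussed above:

import numpy as np
from itertools import combinations

def energies(G, B):
    """G: binary adjacency matrix; B: prior knowledge matrix in [0,1]."""
    E0 = np.sum(B[G == 0])          # edges the prior favours but G omits
    E1 = np.sum((1 - B)[G == 1])    # edges the prior doubts but G contains
    return E0, E1

def log_Z_approx(B, beta0, beta1, max_fan_in=3):
    """Product over nodes of sums over parent sets, as in equation (11)."""
    n = B.shape[0]
    logZ = 0.0
    for child in range(n):
        others = [i for i in range(n) if i != child]
        total = 0.0
        for k in range(max_fan_in + 1):
            for parents in combinations(others, k):
                e0 = sum(B[i, child] for i in others if i not in parents)
                e1 = sum(1 - B[i, child] for i in parents)
                total += np.exp(-(beta0 * e0 + beta1 * e1))
        logZ += np.log(total)
    return logZ

B = np.full((5, 5), 0.5)                  # uninformative prior on 5 nodes
print(log_Z_approx(B, beta0=1.0, beta1=1.0))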
2.2. MCMC sampling scheme

Having defined the prior probability distribution over network structures, our next objective is to extend the MCMC scheme of equation (2) to sample both the network structure and the hyperparameters from the posterior distribution. Starting from a definition of the prior distributions on the hyperparameters β0 and β1, P(β0) and P(β1), our aim is to sample the network structure G and the hyperparameters β0 and β1 from the posterior distribution P(G, β0, β1|D). To this end, we propose a new network structure Gnew from the proposal distribution Q(Gnew|Gold) and, additionally, new hyperparameters from the proposal distributions R(β0new|β0old) and R(β1new|β1old). We then accept this move according to the standard Metropolis-Hastings update rule7 with the corresponding acceptance probability. To increase the acceptance probability and, hence, mixing and convergence of the Markov chain, it is advisable to break the move up into three submoves:

- Sample a new network structure Gnew from the proposal distribution Q(Gnew|Gold) for fixed hyperparameters β0 and β1.
- Sample a new hyperparameter β0new from the proposal distribution R(β0new|β0old) for fixed hyperparameter β1 and fixed network structure G.
- Sample a new hyperparameter β1new from the proposal distribution R(β1new|β1old) for fixed hyperparameter β0 and fixed network structure G.

Assuming uniform prior distributions P(β0) and P(β1) as well as symmetric proposal distributions R(β0new|β0old) and R(β1new|β1old), the data term cancels in the hyperparameter submoves, and the corresponding acceptance probabilities are given by the following expressions:

A(β0new|β0old) = min{ 1, [Z(β0old, β1) / Z(β0new, β1)] exp(−(β0new − β0old) E0(G)) }

A(β1new|β1old) = min{ 1, [Z(β0, β1old) / Z(β0, β1new)] exp(−(β1new − β1old) E1(G)) }
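A sketch of one hyperparameter submove under these assumptions (uniform prior, symmetric uniform proposal of width L); log_Z stands for the log of the approximate partition function of equation (11), supplied elsewhere, and the β1 move is symmetric:

import math, random

def beta0_submove(beta0, beta1, E0_of_G, log_Z, L=6.0, MAX=30.0):
    """One Metropolis-Hastings move on beta0 for fixed beta1 and fixed G."""
    b_new = beta0 + random.uniform(-L / 2, L / 2)
    if not 0.0 <= b_new <= MAX:            # enforce the [0, MAX] constraint
        return beta0
    log_A = (log_Z(beta0, beta1) - log_Z(b_new, beta1)
             - (b_new - beta0) * E0_of_G)
    if math.log(random.random()) < min(0.0, log_A):
        return b_new
    return beta0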
The three submoves are iterated until some convergence criterion is satisfied, discarding an initial burn-in phase before sampling configurations. In our simulations, we chose the prior distribution of each hyperparameter P(βi), i ∈ {0, 1}, to be the uniform distribution over the interval [0, MAX], with MAX = 30. The proposal distribution of the hyperparameters was chosen to be a uniform distribution over a moving interval of length L = 6 << MAX, centred on the current value of the respective hyperparameter and subject to the constraint βi ∈ [0, MAX]. Note that L only affects the convergence and mixing of the Markov chain (that is, the computational efficiency) and could, in principle, be adjusted during the burn-in phase. To test for convergence of the MCMC simulations, various methods have been developed1. In our work, we applied the scheme used in Ref. 23: each MCMC run was repeated from independent initializations, and consistency in the marginal posterior probabilities of the edges was taken as indication of sufficient convergence, leading to a typical trajectory length of 5 x 10^5 steps, of which the first half was discarded as the burn-in phase.

3. DATA

3.1. Cytometry data

Sachs et al.19 have applied intracellular multicolour flow cytometry experiments to quantitatively measure protein concentrations related to the RAF pathway. RAF is a critical signalling protein involved in regulating cellular proliferation in human immune system cells. The deregulation of the RAF pathway can lead to carcinogenesis, and this pathway has therefore been extensively studied in the literature2, 19; see Figure 1 for a representation of the currently accepted gold standard network. In our experiments we used 5 data sets with 100 measurements each, obtained by randomly sampling subsets from the original observational (i.e. unintervened) data of Sachs et al.19. This subsampling was motivated by the fact that we wanted to investigate the learning performance on sample sizes typical of current microarray experiments, which do not provide the abundance of experimental conditions that one gets from cytometry experiments. Details about how we standardized the data can be found in Ref. 23.
Fig. 1. RAF signalling pathway. The graph shows the currently accepted RAF signalling network, taken from Ref. 19. Nodes represent proteins, edges represent interactions, and arrows indicate the direction of signal transduction.
3.2. Synthetic data
A realistic simulation of data typical of signals measured in molecular biology is based on treating the interactions in the network as enzyme-substrate reactions in organic chemistry. From chemical kinetics it is known that the concentrations of the molecules involved in these reactions can be described by a system of ordinary differential equations (ODEs). Assuming equilibrium and adopting a steady-state approximation, it is possible to derive a set of closed-form equations that describe the product concentrations as nonlinear (sigmoidal) functions of combinations of substrates. However, instead of solving the steady-state approximation to the ODEs explicitly, we approximate the solution with a qualitatively equivalent combination of multiplications and sums of sigmoidal transfer functions. The resulting sigma-pi formalism has been implemented in the software package Netbuilder26, 27, which we have used for simulating the data from the RAF signalling pathway displayed in Figure 1. We used the same amount of data as for the flow cytometry experiments and created 5 simulated data sets with 100 measurements each. To model the stochastic influences, all nodes were subjected to additive Gaussian noise with zero mean and standard deviation equal to 0.1. More details about the generation of these data can be found in Ref. 23.
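A minimal sketch of this style of data generation, in which each node is a sigmoidal function of a weighted sum of its parents plus Gaussian noise of standard deviation 0.1; the wiring and weights below are illustrative, not the actual Netbuilder model of the RAF pathway:

import numpy as np

def simulate(adjacency, weights, n_samples=100, noise_sd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    data = np.zeros((n_samples, n))
    for s in range(n_samples):
        x = rng.uniform(0, 1, n)                 # root activations
        for j in range(n):                       # nodes visited in topological order
            parents = np.flatnonzero(adjacency[:, j])
            if parents.size:
                x[j] = 1 / (1 + np.exp(-weights[parents, j] @ x[parents]))
        data[s] = x + rng.normal(0, noise_sd, n) # additive Gaussian noise
    return data

A = np.zeros((3, 3)); A[0, 1] = A[1, 2] = 1      # toy chain: 0 -> 1 -> 2
print(simulate(A, 4.0 * A).shape)                # (100, 3)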
3.3. Biological prior knowledge

We extracted biological prior knowledge from the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways database13-15. KEGG pathways represent the current knowledge of the molecular interaction and reaction networks related to metabolism, other cellular processes, and human diseases. As KEGG contains different pathways for different diseases, molecular interactions and types of metabolism, it is possible to find the same pair of genes* in more than one pathway. We therefore extracted all pathways from KEGG that contained at least one pair of the 11 proteins/phospholipids included in the RAF pathway. We found 20 pathways that satisfied this condition. From these pathways, we computed the prior knowledge matrix, introduced in Section 2.1, as follows. Denote by Mij the total number of times a pair of genes i and j appears in a pathway, and by mij the number of times the genes are connected by a (directed) edge in the KEGG pathways. The elements Bij of the prior knowledge matrix are then defined by

Bij = mij / Mij    (16)
If a pair of genes is not found in any of the KEGG pathways, we set the respective prior association to Bij = 0.5, implying that we have no information about this relationship.
*We use the term "gene" generically for all interacting nodes in the network. This may include proteins encoded by the respective genes.
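Equation (16) can be sketched as follows; the pathway encoding and the gene names in the toy input are illustrative, not the actual KEGG extraction:

from itertools import permutations

def prior_matrix(genes, pathways):
    """pathways: list of (node_set, directed_edge_set) pairs.
    Returns B as a dict over ordered gene pairs, defaulting to 0.5."""
    B = {(i, j): 0.5 for i, j in permutations(genes, 2)}
    for i, j in permutations(genes, 2):
        M = sum(1 for nodes, _ in pathways if i in nodes and j in nodes)
        m = sum(1 for nodes, edges in pathways
                if i in nodes and j in nodes and (i, j) in edges)
        if M > 0:
            B[i, j] = m / M
    return B

pathways = [({"raf", "mek", "erk"}, {("raf", "mek"), ("mek", "erk")}),
            ({"raf", "mek"}, {("raf", "mek")})]
print(prior_matrix(["raf", "mek", "erk"], pathways)[("raf", "mek")])  # 1.0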
4. SIMULATIONS
4.1. Motivation
As described in Section 3.1, the RAF pathway has been extensively studied in the literature. We therefore have a sufficiently reliable gold standard network for evaluating the results of our inference procedure, as depicted in Figure 1. Additionally, recent work by Sachs et al.19 provides us with an abundance of protein concentration data from cytometry experiments, and the authors have also demonstrated the viability of learning the regulatory network from these data with Bayesian networks. However, the abundance of cytometry data substantially exceeds that of currently available gene expression data from microarrays. We therefore pursued the approach taken in Ref. 23 and downsampled the data to a sample size representative of current microarray experiments (100 exemplars). Although the RAF pathway has been extensively studied, we have to appreciate that the published gold standard network only reflects the current state of our knowledge and does not necessarily represent the true biological network. As we will discuss in the final two sections, there are, in fact, indications that the currently accepted gold standard network is incomplete and possibly partially wrong. In order to evaluate the performance of the proposed Bayesian inference scheme on data for which we know the true network, we tested it independently on data generated from the gold standard network with the Netbuilder simulator, as described in Section 3.2. Hence, we repeated all evaluations twice: on real cytometry protein concentrations, and on data synthetically generated from the published gold standard network. The objective of our study is to assess the viability of the proposed Bayesian inference scheme and to estimate by how much the network reconstruction results improve as a consequence of combining the data with prior knowledge from the KEGG pathway database. To this end, we have compared the results obtained with the methodology described in Section 2 with our earlier results from Werhli et al.23, where we had evaluated the performance of Bayesian networks (BNs) and Graphical Gaussian models (GGMs, applied as described in Ref. 20) without the inclusion of prior knowledge.

4.2. Reconstructing the regulatory network
While the true network is a directed graph, our reconstruction methods may lead to undirected, directed, or partially directed graphsb. To assess the performance of these methods, we applied two different criteria. The first approach, referred to as the undirected graph evaluation (UGE), discards the information about the edge directions altogether. To this end, the original and learned networks are replaced by their skeletons, where the skeleton is defined as the network in which two nodes are connected by an undirected edge whenever they are connected by any type of edge. The second approach, referred to as the directed graph evaluation (DGE), compares the predicted network with the original directed graph. A predicted undirected edge is interpreted as the superposition of two directed edges, pointing in opposite directions. The application of any of the machine learning methods considered in our study leads to a matrix of scores associated with the edges in a network. For BNs sampled from the posterior distribution with MCMC, these scores are the marginal posterior probabilities of the edges. For GGMs, these are partial correlation coefficients. Both scores define a ranking of the edges. This ranking defines a receiver operator characteristics (ROC) curve, where the relative number of true positive (TP) edges is plotted against the relative number of false positive (FP) edges. The results are shown in Figure 2.
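A sketch of this evaluation, assuming an edge-score matrix and a binary gold standard adjacency matrix; for the UGE variant both matrices are symmetrized to their skeletons first:

import numpy as np

def roc_points(scores, truth, undirected=False):
    """Rank edges by score and trace relative TP vs FP counts."""
    if undirected:
        scores = np.maximum(scores, scores.T)   # skeleton of the scores
        truth = np.maximum(truth, truth.T)      # skeleton of the gold standard
    mask = ~np.eye(truth.shape[0], dtype=bool)  # ignore self-edges
    s, t = scores[mask], truth[mask].astype(bool)
    order = np.argsort(-s)                      # highest scores first
    tp = np.cumsum(t[order]) / max(t.sum(), 1)
    fp = np.cumsum(~t[order]) / max((~t).sum(), 1)
    return fp, tp

# toy example: area under the curve via the trapezoidal rule
fp, tp = roc_points(np.random.rand(5, 5), np.random.rand(5, 5) > 0.7)
print(np.trapz(tp, fp))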
5. RESULTS AND DISCUSSION

Figure 2 shows the ROC curves for four different network reconstruction methods: using the prior knowledge from KEGG only, according to equation (16); learning Bayesian networks and graphical Gaussian models from the protein concentration data alone; and the proposed Bayesian inference scheme for integrating prior knowledge and data. The figure also distinguishes between learning the skeleton of the graph only (UGE: undirected graph evaluation) and considering the direction of the edges also (DGE: directed graph evaluation).
b GGMs are undirected graphs. While BNs are, in principle, directed graphs, partially directed graphs may result as a consequence of equivalence classes, which were briefly discussed in Section 1.
[Figure 2: four ROC panels, Netbuilder UGE and Netbuilder DGE (top row), Real UGE and Real DGE (bottom row), each plotting the relative number of true positive edges against the relative number of false positive edges (% False) for BN, GGM, BN-Prior and PriorOnly.]

Fig. 2. Reconstruction of the RAF signalling pathway. The figure evaluates the accuracy of inferring the RAF signalling network from cytometry data (bottom row) and from simulated Netbuilder data (top row), each combined with prior information from KEGG. This evaluation was carried out twice: with and without taking the edge direction into account (UGE: undirected graph evaluation, left column; DGE: directed graph evaluation, right column). Four machine learning methods were compared: Bayesian networks without prior knowledge (BNs), Graphical Gaussian Models without prior knowledge (GGMs), Bayesian networks with prior knowledge from KEGG (BN-Prior), and prior knowledge from KEGG only (PriorOnly). In the latter case, the elements of the prior knowledge matrix (introduced in Section 2.1) were computed from equation (16). The ROC curves presented are the mean ROC curves obtained by averaging the results over five different data sets.
Recall that larger areas under the ROC curves indicate a better prediction performance overall, although the slope on the left is also of interest, as we are usually interested in keeping the number of false positives bounded at low values. The figure suggests that the systematic integration of prior knowledge with the proposed Bayesian inference scheme leads, overall, to a considerable improvement in the prediction performance over the three alternative schemes that are based on either the data or the prior knowledge from KEGG alone. There are various interesting trends to be noted, though.
[Figure 3: four contour plots (AUC-DGE, AUC-UGE) over the hyperparameter grid, with β0 and β1 ranging over [0, 20] on the axes.]

Fig. 3. Learning the hyperparameters associated with the prior knowledge from KEGG on simulated Netbuilder data and real flow cytometry data. The grey shading of the contour plots represents the mean area under the ROC curve (AUC value), averaged over five different data sets, as a function of the fixed values of the hyperparameters β0 and β1. The black dots show the values of these hyperparameters that were sampled in the MCMC simulations. The top row shows the results obtained on the simulated data. The bottom row shows the results obtained on the real flow cytometry protein concentrations. The left column shows the results for the directed graph evaluation (DGE), while the column on the right shows the results obtained when ignoring edge directions and only taking the skeleton of the network into account (UGE: undirected graph evaluation).
For learning the skeleton of the graph (UGE), the improvement obtained on the real cytoflow data is more substantial than on the synthetic data; see the left panels of Figure 2. This is a consequence of the fact that on the synthetic data, Bayesian networks already show a strong performance on learning the skeleton of the network, leaving not much room for further improvement. On the cytoflow data, on the other hand, the performance is much poorer. Consequently, the integration of prior knowledge leads to a more substantial improvement. When taking the edge directions into consideration (DGE), the proposed Bayesian integration scheme outperforms all other methods on the synthetic data; see Figure 2, top right. This result is consistent with what has been discussed in the Introduction section: when learning Bayesian networks from non-dynamical, non-interventional data (as considered here) without prior knowledge, there is inherent uncertainty about the direction of edges owing to intrinsic symmetries within network equivalence classes; see Section 1. These symmetries are broken by the inclusion of prior knowledge; hence the improvement in the prediction performance. This improvement is also observed on the real cytoflow data (Figure 2, bottom right), but to a lesser extent. Although the area under the ROC curve related to the Bayesian integration scheme exceeds that of all other ROC curves, the prediction based on prior knowledge alone shows a steeper slope in the very left region of the false-positive axis. This implies that for very high values of the threshold on the edge scores, a network learned from prior knowledge alone is more accurate than a network learned with any of the three methods that make use of the data. While the resulting network itself would not be particularly interesting (it would only contain a very small number, 3 or 4, of the highest scoring edges), this observation is interesting nevertheless, and can be explained as follows. The discrepancy between the UGE and DGE scores indicates that the Bayesian network learns the skeleton of the graph more accurately than the direction of the interactions, with some of the edge directions systematically inverted. A possible explanation is errors in the gold standard network. The recent literature describes evidence for a negative feedback loop between RAF and ERK via MEK. Active RAF phosphorylates and activates MEK, which, in turn, activates ERK. This corresponds to the directed regulatory path shown in Figure 1. However, through a negative feedback mechanism involving ERK, RAF is phosphorylated on inhibitory sites, generating an inactive, desensitized RAF. Details can be found in Ref. 2. This feedback loop is not included in the gold standard network reported by Sachs et al.19, shown in Figure 1. Such as yet unaccounted feedback loops could explain systematic deviations between the predicted and the gold standard network, not only because the structure of a Bayesian network is constrained to be acyclic, but also because we ultimately do not have a reliable gold standard to assess the quality of the predictions. This example points to a fundamental problem inherent in any evaluation based solely on real biological data, and illustrates clearly the advantage of our combined evaluation based on both laboratory and simulated data.

It is obviously of interest to test how well the inference of the hyperparameters β0 and β1 works, especially as this inference depends on the partition function Z of equation (6), which can only be computed approximately; see equation (11). To this end, we repeated the MCMC simulations for a large set of fixed values of β0 and β1, selected from the grid [0, 20] x [0, 20]. For each pair of fixed values (β0, β1), we sampled BNs from the posterior distribution with MCMC, and evaluated the network reconstruction accuracy using the evaluation criteria described in Section 4.2. We compared these results with the proposed Bayesian inference scheme, where both hyperparameters and networks are simultaneously sampled from the posterior distribution with the MCMC scheme discussed in Section 2.2. The results are shown in Figure 3. The grey shading of the contour plots indicates the network reconstruction accuracy in terms of the directed (DGE: left panels) and undirected (UGE: right panels) graph evaluation, obtained from the synthetic (top panels) and real cytometry data (bottom panels). The black dots show the hyperparameter values sampled with the MCMC simulations. While the distribution of β0, the hyperparameter associated with the non-edges, is fairly peaked, the distribution of β1, the hyperparameter associated with the edges, is rather diffuse. This diffusion is particularly noticeable on the synthetic data. However, even on the real cytometry data, the distribution of β1 has a long tail, with values being sampled across the whole permissible spectrum. An inspection of the prior knowledge matrix B extracted from KEGG according to equation (16) reveals that the prior knowledge associated with the energy function E1, equation (4), accounts for only 25% of the true edges in the gold standard network of Figure 1, while the prior knowledge associated with the energy function E0, equation (3), accounts for 92% of the non-edges. Consequently, it appears that E0 captures more relevant information for network reconstruction than E1, which is reflected by the tighter distribution of the respective hyperparameter. The location of the sampled values of the hyperparameters β0 and β1 falls into the region of high network reconstruction scores. This suggests that the proposed Bayesian sampling scheme succeeds in finding hyperparameter values that lead to good network reconstructions. A certain deviation from the optimal reconstruction would be expected owing to the approximation made for computing the partition function; see equation (11). However, this deviation is small for both scores (UGE and DGE) on the synthetic data, and for the UGE score on the cytometry data. A noticeable deviation occurs for the DGE score on the cytometry data, though; see Figure 3, bottom left panel. This deviation indicates a systematic mismatch between the DGE score and the posterior probability of the hyperparameters, which suggests that the cytometry data do not support all the edge directions in the gold standard network of Figure 1. Two possible explanations are either wrong edge directions in the gold standard network, or the existence of as yet unaccounted feedback loops, in confirmation of what has been discussed above.
6. CONCLUSION
Our paper complements the work of Imoto et al. (Ref. 11) on improving the reconstruction of regulatory networks from postgenomic data by the systematic integration of prior knowledge. The idea is to express the prior knowledge in terms of energy functions, from which a prior distribution over network structures is obtained in the form of a Gibbs distribution. The hyperparameters of this distribution represent the weights associated with the various sources of prior knowledge relative to the data. We have developed a Bayesian approach to inferring these hyperparameters, based on MCMC. We have tested the viability of this approach by trying to reconstruct the RAF pathway from flow cytometry protein concentrations and prior knowledge from KEGG. As an independent source of validation, we repeated the evaluation on synthetic data generated from the gold standard network. Our findings suggest that the Bayesian integration scheme systematically improves the network reconstruction over approaches that either use only the protein concentrations or only the prior knowledge from KEGG. Also, the hyperparameters are sampled in regions close to those that yield the best possible network reconstruction, suggesting that the ideal gas approximation made for computing the partition function does not adversely affect the performance of the scheme. Learning the undirected skeleton graph from the cytometry data led to results that were systematically better than those obtained when learning the directed graph from these data, though. This difference between the directed and undirected graph reconstruction did not occur on the synthetic data, which suggests that either certain edge directions in the gold standard network are wrong, or that certain feedback loops are missing, in corroboration of the findings reported by Dougherty et al. (Ref. 2).

ACKNOWLEDGEMENT
Dirk Husmeier is supported by the Scottish Executive Environmental and Rural Affairs Department (SEERAD). Adriano Werhli is supported by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

References
1. M. K. Cowles and B. P. Carlin. Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association, 91:883-904, 1996.
2. M. K. Dougherty, J. Muller, D. A. Ritt, M. Zhou, X. Z. Zhou, T. D. Copeland, T. P. Conrads, T. D. Veenstra, K. P. Lu, and D. K. Morrison. Regulation of Raf-1 by direct feedback phosphorylation. Molecular Cell, 17:215-224, January 2005.
3. N. Friedman and D. Koller. Being Bayesian about network structure. Machine Learning, 50:95-126, 2003.
4. N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7:601-620, 2000.
5. D. Geiger and D. Heckerman. Learning Gaussian networks. Pages 235-243, San Francisco, CA, 1994. Morgan Kaufmann.
6. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing, 6:422-433, 2001.
7. W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109, 1970.
8. D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models, Adaptive Computation and Machine Learning, pages 301-354, Cambridge, Massachusetts, 1999. MIT Press.
9. D. Husmeier. Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19:2271-2282, 2003.
10. D. Husmeier, R. Dybowski, and S. Roberts. Probabilistic Modeling in Bioinformatics and Medical Informatics. Advanced Information and Knowledge Processing. Springer, New York, 2005.
11. S. Imoto, T. Higuchi, T. Goto, S. Kuhara, and S. Miyano. Combining microarrays and biological knowledge for estimating gene networks via Bayesian networks. Proceedings IEEE Computer Society Bioinformatics Conference (CSB'03), pages 104-113, 2003.
12. S. Imoto, T. Higuchi, T. Goto, and S. Miyano. Error tolerant model for incorporating biological knowledge with expression data in estimating gene networks. Statistical Methodology, 3(1):1-16, January 2006.
13. M. Kanehisa. A database for post-genome analysis. Trends in Genetics, 13:375-376, 1997.
14. M. Kanehisa and S. Goto. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28:27-30, 2000.
15. M. Kanehisa, S. Goto, M. Hattori, K. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Research, 34:D354-357, 2006.
16. P. J. Krause. Learning probabilistic networks. Knowledge Engineering Review, 13:321-351, 1998.
17. D. Madigan and J. York. Bayesian graphical models for discrete data. International Statistical Review, 63:215-232, 1995.
18. N. Nariai, S. Kim, S. Imoto, and S. Miyano. Using protein-protein interactions for refining gene networks estimated from microarray data by Bayesian networks. Pacific Symposium on Biocomputing, 9:336-347, 2004.
19. K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523-529, 2005.
20. J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, 2005. Article 32.
21. Y. Tamada, H. Bannai, S. Imoto, T. Katayama, M. Kanehisa, and S. Miyano. Utilizing evolutionary information and gene expression data for estimating gene networks with Bayesian network models. Journal of Bioinformatics and Computational Biology, 3(6):1295-1313, June 2005.
22. Y. Tamada, S. Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara, and S. Miyano. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, 19:ii227-ii236, June 2003.
23. A. V. Werhli, M. Grzegorczyk, and D. Husmeier. Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical Gaussian models and Bayesian networks. Bioinformatics, 22(20):2523-2531, 2006.
24. L. Wernisch and I. Pournara. Reconstruction of gene networks using Bayesian learning and manipulation experiments. Bioinformatics, 20:2934-2942, 2004.
25. C.-R. Yang, B. E. Shapiro, E. D. Mjolsness, and G. W. Hatfield. An enzyme mechanism language for the mathematical modeling of metabolic pathways. Bioinformatics, 21(6):774-780, 2005.
26. C. H. Yuh, H. Bolouri, and E. H. Davidson. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science, 279:1896-1902, March 1998.
27. C. H. Yuh, H. Bolouri, and E. H. Davidson. Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development, 128:617-629, 2001.
USING INDIRECT PROTEIN-PROTEIN INTERACTIONS FOR PROTEIN COMPLEX PREDICTION
Hon Nian Chua¹, Kang Ning², Wing-Kin Sung², Hon Wai Leong² and Limsoon Wong²
¹Graduate School of Integrated Sciences and ²Department of Computer Science, National University of Singapore
go30641 [email protected], {ninghng, bung, leonghw, wongls}@comp.nus.edu.sg
Protein complexes are fundamental for understanding principles of cellular organization. Accurate and fast protein complex prediction from PPI networks of increasing size can serve as a guide for biological experiments to discover novel protein complexes. However, protein complex prediction from PPI networks is a hard problem, especially in situations where the PPI network is noisy. We know from previous work that proteins that do not interact but share interaction partners (level-2 neighbors) often share biological functions. The strength of this functional association can be estimated using a topological weight, FS-Weight. Here we study the use of indirect interactions between level-2 neighbors (level-2 interactions) for protein complex prediction. All direct and indirect interactions are first weighted using the topological weight (FS-Weight). Interactions with low weight are removed from the network, while level-2 interactions with high weight are introduced into the interaction network. Existing clustering algorithms can then be applied to this modified network. We also propose a novel algorithm that searches for cliques in the modified network and merges cliques to form clusters using a "partial clique merging" method. In this paper, we show that: 1) the use of indirect interactions and topological weights to augment protein-protein interactions can improve the precision of clusters predicted by various existing clustering algorithms; and 2) our complex-finding algorithm performs very well on interaction networks modified in this way. Since no information other than the original PPI network is used, our approach would be very useful for protein complex prediction, especially for the prediction of novel protein complexes.
Keywords: protein-protein interaction, protein complex prediction, level-2 interaction, partial clique merging
1. INTRODUCTION
The identification of functional modules in protein interaction networks is a first step in understanding the organization and dynamics of cell functions. Protein-protein interaction (PPI) networks are rapidly becoming larger and more complete as research on proteomics and systems biology proliferates [1]. As a result, more protein complexes are being identified [2]. A protein complex is a group of two or more associated proteins; it is a form of quaternary structure. Similar to phosphorylation, complex formation often serves to activate or inhibit one or more of the associated proteins. Many protein complexes are established, particularly in the model organism Saccharomyces cerevisiae (Baker's yeast). With the wealth and constantly increasing size of PPI datasets, efficient and accurate intelligent tools for the identification of protein complexes are of great importance. In this paper, we focus on predicting protein complexes from PPI data. Currently, there are several approaches to the protein complex prediction problem [3-8]. Spirin et al. [3] proposed using clique finding and super-paramagnetic clustering with Monte Carlo optimization to find clusters of proteins. They found a significant number of protein complexes that overlap with experimentally derived ones.
While clique finding [3] imposes a stringent search criterion and generally results in greater precision, recall is limited because: 1) protein interaction networks are incomplete; and 2) protein complexes may not necessarily be complete subgraphs. Other approaches, such as MCODE [5], are clustering-based. MCODE makes use of local graph density to find protein complexes. PPI networks are transformed into weighted graphs in which vertices are proteins and edges represent protein interactions. The algorithm operates in three stages: vertex weighting, complex prediction and optional post-processing. Each stage involves several parameters that can be fine-tuned to get better predictions. However, clustering approaches [5, 8] yield good recall but sacrifice precision. To make clustering-based approaches more viable, [4, 7] show that it is possible to identify high-precision subsets of clusters from clustering results by post-processing based on functional homogeneity, cluster size and interaction density. While post-processing significantly improves precision, recall is drastically reduced. Moreover, the approach makes use of functional information, which limits its applicability in less-studied genomes such as Homo sapiens, Mus musculus and Arabidopsis thaliana. Recently, a popular clustering algorithm, the Markov
clustering algorithm (MCL) [9], has also been shown to perform well in an evaluation of algorithms for protein clustering in PPI networks [6]. MCL partitions the graph by discriminating strong and weak flow in the graph, and has been shown to be very robust against graph alterations. Table 1 gives the main features of the algorithms that we have used for comparison in this paper.
Table 1. Main features of the compared algorithms.

                                  RNSC                     MCODE                                MCL
Type                              Local search, cost based Local neighbourhood density search   Flow simulation
Multiple assignment of protein    No                       Yes                                  No
Weighted edge                     No                       No                                   Yes
We know from [10] that many proteins that do not interact, but share common interaction partners, share functions and participate in similar pathways. Such protein pairs are referred to as "level-2 neighbors". [10] also proposed a topological weight, FS-Weight, for estimating the functional association of direct and indirect interactions, which has been shown to work well. In this paper, we propose using these indirect interactions together with FS-Weight to modify the existing PPI network as a preprocessing step for complex prediction. The original PPI network is expanded by including indirect interactions (relationships between pairs of proteins that do not interact but share common interactors). A topological weight, FS-Weight (functional similarity weight), is then computed for both direct and indirect interactions. Interactions with weights below a threshold are removed. We also propose a new algorithm that incorporates FS-Weight for complex prediction. The algorithm employs clique finding on the modified PPI network, retaining the benefits of clique-based approaches while improving recall. It first searches for cliques in the modified network, and iteratively merges them by "partial clique merging" to form larger clusters. For the rest of this paper, we refer to predicted protein clusters as clusters, and known protein complexes as complexes.
2. INTRODUCTION OF INDIRECT NEIGHBORS
The PPI network is transformed into a graph G = (V, E). Each vertex v_k ∈ V represents a protein, while each edge {v_i, v_j} ∈ E represents an interaction between the proteins v_i and v_j. For the rest of this section, we consider PPI networks in this graph-based representation. We refer to level-1 interactions as the original interactions in the PPI network, and to a level-2 interaction as an indirect interaction between two proteins which do not interact but share common interactors. Members of a real complex may not have physical interactions with all other members; hence conventional (clique-based, density-based) methods may miss many members. By introducing level-2 interactions, which represent strong functional relations (from [10]), we are able to capture members with less physical involvement in the complex. [10] showed that a topological weight, the FS-Weight, can identify both level-1 and level-2 interactions that are likely to share common functions within the local (level-1 and level-2) PPI neighborhood. Since proteins within a complex interact to perform a common function, it makes sense to identify protein complexes using FS-Weight. Through topological weighting, we can identify interactions with a good likelihood of indicating a functional relationship, and use these for complex prediction. This also reduces the impact of noise and makes predictions more robust.
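Enumerating the level-2 pairs of a network takes only a few lines of code. The sketch below is our own illustration (the names are not from the paper): it builds adjacency sets and reports every pair of proteins that share at least one common interactor but do not interact directly.

```python
from collections import defaultdict

def level2_pairs(edges):
    """Pairs of proteins that share a level-1 neighbour but do not
    interact directly (the level-2 neighbours described above)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    pairs = set()
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                if v not in adj[u]:   # no direct (level-1) interaction
                    pairs.add((u, v))
    return pairs
```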
Topological Weighting
All level-1 and level-2 interactions in the PPI network are given a weight using the topological weight, FS-Weight, defined as follows:

S_{FS}(u,v) = \frac{2\sum_{w \in N_u \cap N_v} r_{u,w} r_{v,w}}{\sum_{w \in N_u \setminus N_v} r_{u,w} + 2\sum_{w \in N_u \cap N_v} r_{u,w} r_{v,w} + \lambda_{u,v}} \times \frac{2\sum_{w \in N_u \cap N_v} r_{u,w} r_{v,w}}{\sum_{w \in N_v \setminus N_u} r_{v,w} + 2\sum_{w \in N_u \cap N_v} r_{u,w} r_{v,w} + \lambda_{v,u}}    (1)

N_p refers to the set that contains protein p and its level-1 neighbors; r_{u,w} refers to the estimated reliability of the interaction between u and w. In [10], r_{u,w} is estimated based on annotated proteins in the training set during cross validation. To avoid possible bias that may be caused by using additional information (functional annotation), we exclude the reliability estimation of interactions and set all r_{u,w} to 1. λ_{u,v} is a pseudo-count included in the computation to penalize similarity weights between protein pairs when the proteins have very few level-1 neighbors, and is defined as:

\lambda_{u,v} = \max(0,\ n_{avg} - (|N_u \setminus N_v| + |N_u \cap N_v|))    (2)
in which n_avg is the average number of neighbors per protein in the PPI network. Using FS-Weight, we modify an existing protein-protein interaction network in the following manner: 1) level-1 interactions in the network that have low FS-Weights (weight below a certain threshold, FS-Weight_min) are removed from the PPI network; 2) level-2 interactions that have high FS-Weights (above or equal to FS-Weight_min) are added into the PPI network. FS-Weight_min is a value that is determined empirically.
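The weighting and filtering step can be summarized in code. The sketch below is a minimal Python illustration, assuming all reliabilities r are set to 1 as stated above; fs_min stands for the threshold FS-Weight_min (tuned to 0.4 in Section 4), and the function names are ours rather than the authors'.

```python
def fs_weight(u, v, adj, n_avg):
    """Equation (1) with all reliabilities r set to 1, as in this
    paper; N_p contains p itself and its level-1 neighbours."""
    Nu, Nv = adj[u] | {u}, adj[v] | {v}
    common = len(Nu & Nv)
    only_u, only_v = len(Nu - Nv), len(Nv - Nu)
    lam_uv = max(0, n_avg - (only_u + common))   # equation (2)
    lam_vu = max(0, n_avg - (only_v + common))
    left = 2 * common / (only_u + 2 * common + lam_uv)
    right = 2 * common / (only_v + 2 * common + lam_vu)
    return left * right

def modify_network(adj, level2, n_avg, fs_min=0.4):
    """Drop level-1 edges below fs_min; add level-2 pairs at or
    above fs_min (fs_min stands for FS-Weight_min)."""
    kept = set()
    for u in adj:
        for v in adj[u]:
            if u < v and fs_weight(u, v, adj, n_avg) >= fs_min:
                kept.add((u, v))
    for u, v in level2:
        if fs_weight(u, v, adj, n_avg) >= fs_min:
            kept.add((u, v))
    return kept
```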
3. PCP ALGORITHM
After we have generated a modified PPI network, existing protein complex prediction algorithms can be applied to it for more reliable protein complex prediction. However, we have also designed a novel algorithm, ProteinComplexPrediction (PCP), for complex prediction using "partial clique merging". This method differs from existing approaches in the following ways: 1) it uses the FS-Weight information during the merging of cliques (clusters); 2) merging based on cliques is a clear and rigid method in graph theory, and it is more viable on reliable PPI networks. PCP attempts to achieve the high precision of clique-finding algorithms whilst providing greater recall and computational tractability, without using any external information. Results show that this method performs well and is robust against noise.
Maximal Clique Finding
We first find all maximal cliques within the modified PPI network. To do this, we implement the maximal clique finding algorithm described in [11]. This algorithm has been shown to be very efficient on sparse graphs. All cliques of at least size 2 are reported. To make sure that there is no overlap among cliques, any overlap between cliques can only be assigned to one clique. There can be many ways to do this. Since FS-Weight is an estimate of the likelihood of sharing functions, a cluster with a larger average FS-Weight is more likely to represent a subset of a real complex. The average FS-Weight of a subgraph S with edges E_S is defined as:

FS_{avg}(S) = \frac{\sum_{(u,v) \in E_S} FS(u,v)}{|E_S|}    (3)
Ideally, we want to find the best way to remove overlaps so that the total average FS-Weight, FS_avg, of all the final non-overlapping cliques is maximized. However, since this is an NP-hard problem, we turn to heuristics. All cliques are first sorted by decreasing FS_avg. The clique with the highest FS_avg is selected and compared with the rest of the cliques. Whenever an overlap is found with another clique, the overlapping nodes are assigned to one of the two cliques such that the two cliques have a higher average FS_avg. An example is given in Fig 1 (b).

Inter-Cluster Density
A protein complex is likely to consist of proteins forming a dense network of interactions, but may not necessarily form a complete clique. Due to the stringent definition of a clique, the resulting maximal cliques from the clique finding step are relatively small and are likely to be partial representations of real complexes. To reconcile these smaller protein clusters into larger clusters that form fuller representations of real complexes, we previously tried to merge overlapping clusters based on the number of overlapping vertices between them. However, the corresponding prediction results were not good, since each merge considers only overlapping vertices between two clusters, but overlooks the density of interactions between them. Hence we define the Inter-Cluster Density (ICD), a measure of interconnectedness between two subgraphs, as a criterion for merging clusters. The ICD essentially computes the FS-Weight density of inter-cluster interactions between the non-overlapping proteins of two clusters. A high ICD indicates that the two clusters are highly connected. Using ICD to impose criteria for merging ensures that merged clusters retain a certain degree of interconnectedness between their members. The Inter-Cluster Density (ICD) between subgraphs S_a and S_b is defined as:

ICD(S_a, S_b) = \frac{\sum_{u \in V_a \setminus V_b,\ v \in V_b \setminus V_a} FS(u,v)}{|V_a \setminus V_b| \cdot |V_b \setminus V_a|}    (4)

where V_x is the set of vertices of subgraph S_x. An example of ICD computation is given in Fig 1 (a).

Partial Clique Merging
To merge cliques found in the PPI network, we define the term "partial cliques" as strongly connected subgraphs formed from the amalgamation of one or more cliques. Trivially, all cliques in the PPI network G are partial cliques. We begin with an initial graph G_p^1 in which each vertex represents a partial clique, and add an edge (u, v) between any pair of partial cliques u and v in G_p^1 if ICD(u,v) > ICD_thres. From G_p^1, we can again find maximal cliques among the vertices. Each clique in G_p^1 is therefore a cluster of partial cliques from G, where every pair of partial cliques in the cluster fulfils a minimum level of interconnectedness defined by ICD. In other words, the vertices in each clique from G_p^1 can be merged to form a larger partial clique. This process is then repeated to form bigger partial cliques. In each iteration i, a graph G_p^i is formed from
PC^{i-1}, the partial cliques from the previous iteration, i.e. G_p^i = (PC^{i-1}, {(u,v) | ICD(u,v) > ICD_thres, u, v ∈ PC^{i-1}}). From G_p^i, we can again find maximal cliques among the vertices (partial cliques in G_p^{i-1}) and merge the proteins in these cliques to form bigger partial cliques. This is done until no further merge can be made. In order for the more connected partial cliques to merge first, we first perform the merge using ICD_thres = 1. The merging process is then repeatedly reinitiated while reducing ICD_thres by 0.1 until ICD_thres ≤ ICD_min. ICD_min is a threshold to be determined empirically. A smaller ICD_min will yield bigger clusters and vice versa. We refer to this merging method as "partial clique merging"; a condensed sketch is given below.
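The following Python sketch condenses the ICD computation of equation (4) and one possible reading of the iterative merge. It leans on networkx for maximal clique finding, uses >= at the threshold, and omits the resolution of overlaps between meta-cliques (which would be handled as in the base clique step), so it is an illustration rather than the authors' implementation.

```python
import networkx as nx

def icd(Sa, Sb, fs):
    """Inter-Cluster Density, equation (4): FS-Weight mass of the
    edges between the non-overlapping parts of two clusters,
    normalised by the number of possible such edges."""
    only_a, only_b = Sa - Sb, Sb - Sa
    if not only_a or not only_b:
        return 0.0
    total = sum(fs.get((min(u, v), max(u, v)), 0.0)
                for u in only_a for v in only_b)
    return total / (len(only_a) * len(only_b))

def partial_clique_merging(cliques, fs, icd_min=0.1):
    """Iterative merge: build the meta-graph whose vertices are
    partial cliques, connect pairs whose ICD reaches the current
    threshold, and merge the maximal cliques of that meta-graph.
    The threshold starts at 1.0 and is lowered by 0.1 down to
    icd_min (= ICD_min), as described in the text."""
    clusters = [set(c) for c in cliques]
    thresh = 1.0
    while thresh >= icd_min:
        meta = nx.Graph()
        meta.add_nodes_from(range(len(clusters)))
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if icd(clusters[i], clusters[j], fs) >= thresh:
                    meta.add_edge(i, j)
        merged = [set().union(*(clusters[i] for i in q))
                  for q in nx.find_cliques(meta)]
        if len(merged) < len(clusters):
            clusters = merged        # keep merging at this threshold
        else:
            thresh -= 0.1            # relax the threshold
    return clusters
```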
4. EXPERIMENTS
Experiment Settings and Datasets
The PCP algorithm is implemented in C++ and Perl. We compare PCP with the state-of-the-art RNSC [4], MCODE [5] and MCL [6] algorithms. The experiments are performed on a PC with a 3.0 GHz CPU and 1.0 GB RAM, running a Linux system.
PPI datasets
We use two high-throughput datasets obtained from different sources for the analysis of these algorithms. The first dataset is obtained from the GRID database [12]. This dataset is a combination of six protein interaction networks from the Saccharomyces cerevisiae (Baker's
ICD(S_a, S_b) = (0.8 + 0.5 + 0.7 + 0.6 + 0.9 + 0.8) / (3 × 4) = 0.36
Fig 1. (a) Example of ICD computation. There are two clusters, and solid lines are used for the ICD calculation. (b) Example of resolving overlapping cliques. Edge thickness represents the FS-Weight of the edge.
Yeast) genome. These include interactions characterized by the mass spectrometry technique from Ho et al. [13], Gavin et al. [14], Gavin et al. [15] and Krogan et al. [16], as well as two-hybrid interactions from Uetz et al. [1] and Ito et al. [17]. We shall refer to this dataset as PPI[Combined]. The second dataset is taken from a current release of the BioGRID database [18]. We only consider interactions derived from mass spectrometry and two-hybrid experiments, since these represent physical interactions. We shall refer to this dataset as PPI[BioGRID]. Table 3 presents the features of the two datasets, as well as some characteristics of the clusters predicted by different algorithms.
Protein complex datasets
As a yardstick for prediction performance, we use protein complex data from the MIPS database [2]. These protein complexes are treated as a gold standard for the analysis. To examine whether false positives in predictions may turn out to be undiscovered annotations, we use two releases of the MIPS complex datasets: a dataset released on 03/30/2004 and a newer dataset released on 05/18/2006. We refer to the two protein complex datasets as PC2004 and PC2006, respectively. During validation, proteins that cannot be found in the input interaction network are removed from the complex data.
Cluster Scoring
The density of a graph G = (V, E) is defined as D_G = |E| / |E|_max, where |E|_max = |V|(|V|+1)/2 for a graph with loops and |E|_max = |V|(|V|−1)/2 for a graph with no loops. So, D_G is a real number ranging from 0.0 to 1.0. Each resulting cluster S from an algorithm is scored and ranked by its cluster score, which is defined as the product of the density and the number of vertices in S, (D_S × |V_S|). This ranks larger, denser clusters higher in the results.
Validation Criterion
In order to study the relative performance of PCP against existing algorithms, we need to define the criterion that determines whether a predicted protein cluster matches a true protein complex. [5] defined a matching criterion using the overlap between a protein cluster S and a true protein complex C:

Overlap(S, C) = \frac{|V_S \cap V_C|^2}{|V_S| \cdot |V_C|}    (5)

where V_S is the set of vertices of the subgraph defined by S, and V_C is the set of vertices of the subgraph defined by C. In [5], an overlap threshold of 0.2 is used to determine a match. [4] used a modified version of the overlap which is more stringent but involves many empirically derived parameters which may not be applicable across different datasets. To simplify comparison, we used an overlap threshold of 0.25 to determine a match for all experiments in this work. Predicted protein clusters that match one or more true protein complexes with overlap score above this threshold are identified as "matched predicted complexes", and the corresponding complexes are identified as "matched known complexes". Note that the number of matched clusters, matched_clusters, may differ from the number of matched complexes, matched_complexes, because one known complex can match one or more predicted clusters. To measure the accuracy of the predictions, the Precision and Recall of the different algorithms are computed, defined as:

Precision = matched_clusters / predicted_clusters    (6)
Recall = matched_complexes / known_complexes    (7)

where predicted_clusters and known_complexes are the number of predicted clusters and the number of known (real) complexes, respectively. The recall measure in our validation is determined by matched complexes instead of predicted clusters, and is hence not prone to bias. Moreover, the precision measure uses the number of predicted clusters as a denominator. Hence there should not be any significant bias in these validation measures. We only consider clusters and complexes of size 4 and above, since matches between clusters and complexes of smaller sizes have relatively high probabilities of occurring by chance [4]. Note that unlike the validation measures used in [6], we do not seek to evaluate the clustering properties of each algorithm. Rather, we are concerned with the actual usefulness of the algorithms in detecting clusters that match real complexes reasonably well.
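As an illustration of the cluster-level validation just described, the following Python sketch computes the overlap score of equation (5) and the precision and recall of equations (6) and (7). The function names and the data representation (clusters and complexes as sets of protein identifiers) are our own, and we read "above this threshold" as >=.

```python
def overlap(cluster, complex_):
    """Overlap score between a predicted cluster and a known
    complex, equation (5); both arguments are sets of proteins."""
    i = len(cluster & complex_)
    return i * i / (len(cluster) * len(complex_))

def precision_recall(clusters, complexes, thresh=0.25):
    """Cluster-level precision (6) and recall (7); only clusters
    and complexes of size >= 4 are considered, as in the text."""
    clusters = [c for c in clusters if len(c) >= 4]
    complexes = [c for c in complexes if len(c) >= 4]
    matched_clusters = sum(
        any(overlap(cl, co) >= thresh for co in complexes)
        for cl in clusters)
    matched_complexes = sum(
        any(overlap(cl, co) >= thresh for cl in clusters)
        for co in complexes)
    return (matched_clusters / len(clusters),
            matched_complexes / len(complexes))
```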
To avoid bias that may arise from large variations in the size of predicted complexes, we also introduce another precision-recall analysis based on protein membership assignment. For this analysis, we define two terms: the protein-cluster pair (PCl) and the protein-complex pair (PCo). Each PCl represents a unique protein-cluster relationship. For example, given two predicted clusters Cl(A) = {P1, P2} and Cl(B) = {P1, P3}, we have four PCls, namely (Cl(A), P1), (Cl(A), P2), (Cl(B), P1) and (Cl(B), P3). Similarly, each PCo represents a unique protein-complex relationship.
Precision_protein: A PCl is considered to be matched if its protein belongs to some complex that matches its cluster. The definition of a match between a predicted cluster and a complex is described earlier in this section. Precision_protein is defined as:

Precision_protein = |matched_PCl| / |predicted_PCl|    (8)

Recall_protein: A PCo is considered to be matched if its protein belongs to some cluster that matches its complex. Recall_protein is defined as:

Recall_protein = |matched_PCo| / |known_PCo|    (9)
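The protein-membership variant can be sketched in the same style. This is illustrative only: we assume the denominator of equation (9) is the total number of protein-complex pairs |PCo|, and reuse the overlap match of equation (5).

```python
def protein_level_scores(clusters, complexes, thresh=0.25):
    """Protein-membership precision (8) and recall (9) based on
    protein-cluster (PCl) and protein-complex (PCo) pairs."""
    def matches(a, b):                       # equation (5) match
        i = len(a & b)
        return i * i / (len(a) * len(b)) >= thresh
    pcl = [(ci, p) for ci, cl in enumerate(clusters) for p in cl]
    matched_pcl = sum(
        any(p in co and matches(clusters[ci], co) for co in complexes)
        for ci, p in pcl)
    pco = [(ki, p) for ki, co in enumerate(complexes) for p in co]
    matched_pco = sum(
        any(p in cl and matches(cl, complexes[ki]) for cl in clusters)
        for ki, p in pco)
    return matched_pcl / len(pcl), matched_pco / len(pco)
```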
Results

Parameters determination
The optimal parameters for the RNSC, MCODE and MCL algorithms are given by [6] (Table 2). There are two tunable parameters in our own experiments: FS-Weight_min and ICD_min. FS-Weight_min determines the FS-Weight (equation (1)) threshold for filtering out level-1 and level-2 interactions. ICD_min determines the Inter-Cluster Density (equation (4)) threshold down to which two clusters are allowed to merge during clustering in the PCP algorithm. Based on PPI[Combined] and PC2004, we use level-1 interactions (without any filtering) to determine the ICD threshold. The FS-Weight threshold is determined on the same dataset using the PCP algorithm.
Inter-Cluster Density Threshold: We first vary ICD_min, the Inter-Cluster Density threshold for merging clusters, between 0.1 and 0.5 and perform the predictions. The corresponding precision and recall of the predictions are shown in Fig 2 (a). A lower ICD_min results in more clusters being merged and vice versa. We find that ICD_min = 0.1 yields the best precision against recall and use this value for the rest of our experiments.
FS-Weight Threshold: [10] showed that filtering level-1 and level-2 interactions with an FS-Weight threshold of 0.2 resulted in interactions that have a significantly higher likelihood of sharing functions. Here we perform protein complex prediction using the PCP algorithm with a range of values of FS-Weight_min to determine which value yields the best prediction performance. ICD_min is set to 0.1. The corresponding precision and recall of the predictions are shown in Fig 2 (b).
Table 2. Optimal parameters for the RNSC, MCODE and MCL algorithms, as given by [6].

Algorithm   Parameter                   Optimal value
RNSC        No. of experiments          3
            Tabu length                 50
            Scaled stopping tolerance   15
MCODE       Depth                       100
            Node score %                0
            Haircut                     True
            Fluff                       False
            % of complex fluffing       0.2
MCL         Inflation                   1.8

Fig 2. Effect of (a) the ICD threshold and (b) the FS-Weight threshold on precision and recall values for the PPI[Combined] dataset.
We find that FS-Weight_min = 0.4 yields the best precision against recall, and use this value for the rest of our experiments.
Introduction of indirect neighbors
The introduction of indirect neighbors is the key part of our analysis in this paper. To evaluate the performance of this process, we transform the original PPI network in four different ways: (1) all level-1 interactions; (2) all level-1 and level-2 interactions; (3) all level-1 interactions, plus level-2 interactions with FS-Weight ≥ FS-Weight_min; and (4) level-1 and level-2 interactions with FS-Weight ≥ FS-Weight_min. For (2), due to the large number of level-2 interactions, results could only be obtained for MCL and RNSC. For example, on PPI[Combined], there are 20,461 level-1 interactions. With the introduction of level-2 interactions, the number of interactions increases to 404,511. After filtering level-2 interactions based on FS-Weight, we have 23,356 interactions. Finally, upon filtering both level-1 and level-2 interactions, we are left with only 7,303 interactions. If two proteins in an interaction belong to some common known complex, we define the interaction as an intra-complex interaction. To justify our intuition for using level-2 interactions and FS-Weight for complex prediction, we compute the fraction of interactions in the four transformed networks that are intra-complex interactions. Since proteins are clustered based on interactions, a higher fraction of intra-complex interactions will naturally yield more accurately predicted clusters. In Fig 3, we present the corresponding fractions for the two PPI networks, PPI[Combined] and PPI[BioGRID], using the known protein complexes in PC2004. We observe that the fraction of intra-complex interactions did not change significantly after adding filtered level-2 interactions into the network. However, if both level-1 and level-2 interactions are filtered, the fraction of intra-complex interactions becomes significantly higher. Without any filtering, level-2 interactions contain too many false positives to be useful, as reflected by the very small fraction of intra-complex interactions. This is consistent with the findings for function similarity in [10]. From these observations, we believe that using a PPI network with filtered level-1 and level-2 interactions would yield the best results for protein complex prediction.
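The quantity plotted in Fig 3 is straightforward to compute. A minimal sketch, assuming edges is a list of protein pairs and complexes a list of protein sets:

```python
def intra_complex_fraction(edges, complexes):
    """Fraction of interactions whose two proteins share membership
    in at least one known complex (the quantity shown in Fig 3)."""
    def intra(u, v):
        return any(u in co and v in co for co in complexes)
    hits = sum(1 for u, v in edges if intra(u, v))
    return hits / len(edges) if edges else 0.0
```

Running this on the four network variants (L1; L1 & L2; L1 & filtered L2; filtered L1 & L2) against PC2004 would reproduce the comparison described above.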
Fig 3. Fraction of intra-complex interactions (interactions whose nodes share some complex membership) for the different PPI networks: L1; L1 & L2; L1 & filtered L2; filtered L1 & L2.
Comparison with existing approaches
We compared the clusters predicted using four clustering algorithms: MCL, RNSC, MCODE and PCP, on the two datasets PPI[Combined] and PPI[BioGRID]. PC2004 is used to represent the real protein complexes against which the results from these algorithms are validated. Table 3 summarizes some general characteristics of the clusters predicted by the four clustering algorithms. The PPI[BioGRID] dataset is larger than PPI[Combined]. We observe that upon the introduction of filtered level-2 interactions, the number of predicted clusters generally decreases while average cluster sizes increase. This is due to greater connectivity in the graph, since more edges are added among the same number of nodes. We also observe that the average sizes of the clusters predicted by the MCODE and MCL algorithms are larger than those predicted by the RNSC and PCP algorithms. After filtering both level-1 and level-2 interactions using FS-Weight, all algorithms produced fewer clusters. With the exception of MCODE, the average sizes of the clusters predicted by the various algorithms are also larger. We have also studied the average density of the clusters predicted by the four different algorithms using the different networks. Generally, all algorithms predicted the clusters with the highest density using only level-1 interactions, followed by using level-1 and filtered level-2 interactions. Using filtered level-1 and level-2 interactions resulted in clusters of lower density. When level-1 and level-2 interactions without filtering are used, the clusters found have the lowest density. RNSC yielded clusters
with the highest density, followed by MCODE, PCP and MCL. Interestingly, we found that the average density of real protein complexes is quite low, around 0.55, which suggests that the density of predicted clusters does not correlate with prediction accuracy. Fig 4 presents the precision-recall analysis of the predictions made by the four algorithms. By varying a threshold on the cluster score, we can obtain a range of recall and precision values for the predictions from each algorithm.

Table 3. Features of the two PPI datasets and characteristics of the clusters predicted by the four algorithms under settings 1) level-1 interactions; 2) level-1 and level-2 interactions; 3) level-1 and filtered level-2 interactions; 4) filtered level-1 and level-2 interactions.

Datasets        Nodes  Edges  No. of complexes  Avg. complex size
PPI[Combined]   4672   20461  815               8.80
PPI[BioGRID]    5036   27560  815               8.82

                         No. of clusters               Avg. cluster size
Dataset        Setting  RNSC  MCODE  MCL   PCP        RNSC  MCODE  MCL    PCP
PPI[Combined]  1)       2332  121    936   1537       2.00  5.75   4.99   3.04
               2)       874   -      209   -          5.34  -      22.35  -
               3)       2233  1499   120   720        2.09  6.48   6.49   3.12
               4)       699   417    92    259        2.44  5.83   6.59   4.09
PPI[BioGRID]   1)       2404  1764   152   830        2.20  3.98   2.85   6.38
               2)       811   -      159   -          6.21  -      31.67  -
               3)       2331  1557   142   681        2.16  5.69   7.40   3.23
               4)       901   555    121   285        2.36  5.51   7.46   3.83
Fig 4. The precision and recall of the RNSC, MCODE, MCL and PCP algorithms on PPI[Combined] with (a) original level-1 interactions, (b) level-1 and level-2 interactions, (c) original level-1 and filtered level-2 interactions, and (d) filtered level-1 and level-2 interactions; and on PPI[BioGRID] with (e) original level-1 interactions, (f) level-1 and level-2 interactions, (g) original level-1 and filtered level-2 interactions, and (h) filtered level-1 and level-2 interactions. Results are based on comparison with the PC2004 protein complex dataset.
From Fig 4 (a)-(d), on the PPI[Combined] dataset, we observed that RNSC performs the best in precision and recall on the original network (level-1 interactions). With the introduction of level-2 interactions, the precision and recall decreased. When these level-2 interactions are filtered, precision and recall are improved for MCODE and RNSC, while PCP and MCL remain almost unchanged. However, when filtered level-1 and level-2 interactions are used, all methods show significant improvement in precision except RNSC. Among all the combinations, PCP with filtered level-1 and level-2 interactions performs the best (Fig 4 (d)). A similar trend is observed on the bigger PPI[BioGRID] dataset (Fig 4 (e)-(h)). Precision is improved in most algorithms with the introduction of filtered level-2 neighbors, and further improvement is achieved when level-1 interactions are also filtered based on FS-Weight. In particular, the performance of MCODE and MCL improved substantially with the introduction of level-2 interactions and FS-Weight filtering. Again, PCP
with filtered level-1 and level-2 interactions performs the best (Fig 4 (h)). To illustrate the contribution of PCP to complex prediction, we compare the predictions made by each algorithm natively (i.e. RNSC, MCODE and MCL on the original level-1 interactions against PCP on filtered level-1 and level-2 interactions) in Fig 5. We observe that PCP outperforms the other algorithms significantly (Fig 5 (a) and (b)). We arrived at similar conclusions using the precision-recall analysis based on protein membership assignment (Fig 5 (c) and (d)).

Fig 5. Precision-recall analysis of the RNSC, MCODE, MCL and PCP algorithms on (a) PPI[Combined] and (b) PPI[BioGRID] using native settings (RNSC, MCODE and MCL on original level-1 interactions, and PCP on filtered level-1 and level-2 interactions); precision-recall analysis based on protein membership assignment on the same predictions on (c) PPI[Combined] and (d) PPI[BioGRID]. Results are based on comparison with the PC2004 protein complex dataset.

Examples of predicted complexes: We have proposed two new concepts in this paper: the introduction of indirect interactions as a preprocessing step, and the PCP clustering algorithm. To illustrate how these concepts can help to predict protein clusters that better match real complexes, we examine some examples of protein clusters predicted by PCP based on the modified network, as well as by the RNSC and MCL algorithms based on the original network, and how they correspond to real protein complexes in the PC2004 dataset. Fig 6 shows two examples where PCP can predict protein clusters that match a real complex more precisely than the other algorithms. In the first example (Fig 6 (a)), PCP predicted a cluster that matches a 4-member protein complex completely, while RNSC's 3-member cluster has only one member, YDR121W, that matches the same complex.

Fig 6. Example of predicted and matched complexes. Complexes in PC2004 and the clusters predicted by MCL, RNSC and PCP are shown in different boxes. (a) A complex in PC2004 of size 4; PCP's cluster matched it perfectly, while MCL's and RNSC's clusters matched 1 and 2 of the proteins in the complex, respectively. (b) In this complex in PC2004 of size 8, RNSC's predicted cluster matched only 2 proteins, while PCP's predicted cluster matched 5 proteins; MCL also matched 5 proteins, but predicted 6 proteins that are not in the complex.
This is probably due to the fact that the members of RNSC's cluster are well connected by level-1 interactions. But by including level-2 interactions and filtering unreliable interactions, their connections are shown not to be strong enough to be in one cluster. Therefore PCP is able to identify the correct complex. Similarly, the cluster predicted by MCL only overlaps with two members of the complex, while the other 6 members of the cluster do not belong to the real complex. The second example (Fig 6 (b)) shows a 5-member protein cluster predicted by PCP, which is a subset of an 8-member protein complex. The best match with the same complex from RNSC is a 7-member cluster, of which only 2 members belong to the real complex. Though PCP's predicted cluster matched 5 proteins and MCL also matched 5 proteins, the latter predicted 6 proteins that are not in the complex. A closer look reveals that PCP's cluster members do not have any interactions among themselves; this subset of the real protein complex can only be identified by its level-2 interactions with the rest of the complex members. PCP is unable to discover the rest of the complex, as their connectivity with the other members is very weak or unknown. The protein YLL011W is missed by PCP because its local topology results in a low FS-Weight score. This may be because "hub proteins" like YLL011W are automatically penalized by the FS-Weight score.
Validation on newer protein complex data
A comparison of prediction performance validated against an old protein complex dataset and a newer, more updated standard protein complex dataset can reveal the parameter-independent identification power of the different algorithms. We have previously assessed the RNSC, MCODE, MCL and PCP algorithms with PC2004. Here, we validate the predicted clusters of PCP and the other algorithms against a more recent and more updated protein complex dataset, PC2006. We have used the modified PPI networks (PPI[Combined] and PPI[BioGRID]) with filtered level-1 and level-2 interactions, which were shown earlier (Fig 4) to yield the best performance for most of the algorithms studied. The corresponding precision-versus-recall graphs are shown in Fig 7. Comparing Fig 4 against Fig 7, we find that over the same recall range, the precision of all algorithms studied has increased
substantially when validating against PC2006 for both PPI network datasets. A significant number of clusters which are predicted by PCP, but had been treated as false positives because they could not be matched against any known complex in PC2004, are now found to match known complexes in PC2006. This indicates that PCP has a good potential for finding novel protein complexes.
Fig 7. The precision and recall of the different algorithms on (a) PPI[Combined] and (b) PPI[BioGRID] with filtered level-1 and level-2 interactions. Results are based on comparison with the PC2006 protein complex dataset.
We also present two illustrative examples in Fig 8 which show that PCP predicted novel members of some complexes, which are later verified in the newer complex dataset. In the first example (Fig 8 (a)), PCP predicted a cluster of 4 proteins. The cluster is found to match well with a real 4-member complex from PC2004 that contains all but 1 of the proteins in the predicted cluster. A comparison with PC2006, however, reveals that the predicted cluster matches a real complex in that dataset that contains all 4 of the proteins. The protein YFL008W in PC2006 has level-1 interactions with the other 3 proteins, but since the FS-Weights of these interactions are low, PCP did not predict it to be in the same cluster. It is also interesting that in Fig 8 (b), PCP has predicted YHR033W to be in the same cluster as the other 5 proteins, and this is consistent with PC2006 but not PC2004. However, the other 5 proteins in the new complex are not predicted by PCP, since they do not have any level-1 interactions with the other proteins. We think that more accurate prediction of this protein complex may be achieved by incorporating additional information such as function annotations. Moreover, while the protein YJR072C is predicted by PCP, it is not in the new protein complex. Since the interactions of this protein with YDR212W and YJR064W are present in quite a few other protein
complexes [8], we believe that even though this protein is not in the same complex as the other proteins, it should be in the same "function unit" [3] with these proteins. Discriminating "function units" from protein complexes may need additional information such as function annotations.
Fig 8. Examples of predicted and matched complexes based on the old and new protein complex datasets. Complexes in PC2004 and PC2006 and the predicted PCP clusters are shown in different boxes for comparison. (a) The complex in PC2004 is of size 4, while in PC2006 its size is 5; PCP predicted 4 proteins in this complex correctly. (b) This complex is of size 5 in PC2004, for which PCP predicted all 5 proteins correctly. In PC2006, its size is 11, of which the PCP algorithm predicted 6 correctly.
Robustness against noise in interaction data
To assess the robustness of the algorithm, we have computed the precision and recall of the predictions made by PCP when noise of different types and amounts is randomly added to the reliable PPI[Combined] network. In robustness experiments, noise is usually introduced by swapping edges or randomizing the node labels. However, these methods, which are used in estimating p-values and the uniqueness of PPI motifs, are not a good model for our purpose. We are considering errors produced by high-throughput PPI experiments. In this type of experiment, the errors are closer to missing (undetected) edges or sticky proteins, which are modeled by random noise. Hence, to simulate such noise, we randomly add, delete and reroute (delete and add) 10% to 50% of "pseudo" interactions in the network. The precision and recall of the predicted clusters on the various perturbed datasets are shown in Fig 9. We can see from Fig 9 (a) that the precision against recall of the clusters predicted by PCP remains fairly consistent even with random additions of interactions of up to 50% of the original interactions in PPI[Combined]. This is a clear indication that the PCP algorithm is robust against spurious interactions. The filtering of the PPI network based on FS-Weight removes most of these random additions, and retains only confident interactions for clustering. Random deletion of interactions has a greater impact on clustering performance, as can be seen in Fig 9 (b). This is analogous to a lack of information, leading to a reduction in recall. As FS-Weight is a local topology measure, it becomes less effective when the interaction network becomes very sparse, since there will be insufficient interactions in the local neighborhood to give a confident score. The formulation of the measure will assign low weights in these cases, which will cause many interactions to be filtered out. Nonetheless, precision remains high for the clusters that can be discovered. A combination of random additions and deletions results in a simultaneous reduction in precision and recall.
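A minimal sketch of the three perturbation schemes, under the assumption that "reroute" means deleting a fraction of edges and then adding the same number of random non-edges; the function name and interface are ours.

```python
import random

def perturb(edges, proteins, frac, mode, seed=0):
    """Randomly perturb a PPI network for robustness experiments.
    edges    : iterable of canonical (u, v) tuples with u < v
    proteins : list of all protein identifiers
    frac     : fraction of the original edge count to perturb
    mode     : 'add', 'delete' or 'reroute' (delete then add)"""
    rng = random.Random(seed)
    edges = set(edges)
    k = int(frac * len(edges))
    if mode in ('delete', 'reroute'):
        for e in rng.sample(sorted(edges), k):
            edges.remove(e)
    if mode in ('add', 'reroute'):
        added = 0
        while added < k:
            u, v = rng.sample(proteins, 2)
            e = (min(u, v), max(u, v))
            if e not in edges:        # only genuinely new pseudo-edges
                edges.add(e)
                added += 1
    return edges
```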
Fig 9. The precision and recall of predictions made by the PCP algorithm when different types and amounts of noise (10% to 50%) are introduced into the reliable PPI network. Three ways of perturbing the network are studied: (a) random addition, (b) random deletion, and (c) random deletion and addition (reroute).
5. DISCUSSIONS AND CONCLUSIONS
Since protein complexes play an important role in cells, the identification of protein complexes from PPI networks is an interesting and challenging problem in systems biology. However, current PPI networks are incomplete and contain many errors. In this paper, we proposed a preprocessing step on PPI networks before complex prediction: 1) introduce level-2 interactions; 2) weigh level-1 and level-2 interactions using FS-Weight; and 3) remove interactions with weight lower than a certain threshold. From our experiments, we have shown that existing clustering algorithms are able to produce clusters that match protein complexes with significantly higher precision and recall using PPI networks processed in this way. Based on the modified PPI network, we have also proposed the PCP clustering algorithm, in which cliques are identified in the network and merged progressively using the "partial clique merging" method. We have compared PCP with the RNSC, MCODE and MCL algorithms and showed that PCP has superior precision and recall in complex prediction. By validating against newer MIPS complex data, we find that PCP can discover novel members of complexes which are only found in the newer complex dataset. Through a comprehensive noise analysis, we also showed that PCP maintains high precision even when used on significantly noisier datasets. Nonetheless, one limitation still plagues both previous approaches and our current approach: complexes which have subsets of proteins that are not tightly connected to the rest of the complex members cannot be identified, as illustrated in Fig 8 (b). This is inevitable since clustering methods are highly dependent on interaction density. We are currently studying the possibility of using other biological information to represent a more reliable and complete network of relationships between proteins for complex prediction.
Acknowledgements
We would like to thank Igor Jurisica for kindly providing us with the source code of the RNSC algorithm. We would also like to thank Sylvain Brohee for providing us with the source code of the MCL and MCODE algorithms.
References
1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P et al: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000, 403(6770):623-627.
2. Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, Frishman D: MIPS: a database for genomes and protein sequences. Nucleic Acids Research 1999, 27(1):44-48.
3. Spirin V, Mirny LA: Protein complexes and functional modules in molecular networks. PNAS 2003, 100(21):12123-12128.
4. King AD, Przulj N, Jurisica I: Protein complex prediction via cost-based clustering. Bioinformatics 2004, 20(17):3013-3020.
5. Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003, 4(2):27.
6. Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 2006, 7:488.
7. Przulj N, Wigle DA, Jurisica I: Functional topology in a network of protein interactions. Bioinformatics 2003, 20(3):340-348.
8. Asthana A, King OD, Gibbons FD, Roth FP: Predicting protein complex membership using probabilistic network reliability. Genome Research 2004, 14(6):1170-1175.
9. van Dongen S: Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.
10. Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22(13):1623-1630.
11. Tomita E, Tanaka A, Takahashi H: The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science 2006, 363:28-42.
12. Breitkreutz BJ, Stark C, Tyers M: The GRID: the General Repository for Interaction Datasets. Genome Biology 2003, 4(3):R23.
13. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams S-L, Millar A, Taylor P, Bennett K, Boutilier K et al: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415:180-183.
14. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141-147.
15. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C, Jensen LJ, Bastuck S, Dumpelfeld B et al: Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440(7084):631-636.
16. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, Li J, Pu S, Datta N, Tikuisis AP et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440(7084):637-643.
17. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 2001, 98(8):4569-4574.
18. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research 2006, 34(Database issue):D535-539.
FINDING LINEAR MOTIF PAIRS FROM PROTEIN INTERACTION NETWORKS: A PROBABILISTIC APPROACH
Henry C.M. Leung*, M. H. Siu, S.M. Yiu and Francis Y.L. Chin
Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
{cmleung2, mhsiu, smyiu, chin}@cs.hku.hk
* Corresponding author.
Ken W.K. Sung
Department of Computer Science, National University of Singapore, Singapore
[email protected]
Abstract. Finding motif pairs from a set of protein sequences based on protein-protein interaction data is a challenging computational problem. Existing effective approaches usually rely on additional information, such as prior knowledge of protein groupings based on protein domains. In reality, this kind of knowledge is not always available, so novel approaches that do not use it are much desired. Recently, Tan et al. [10] proposed such an approach. However, there are two problems with their approach. The scoring function (using χ² testing) used in their approach is not adequate: random motif pairs may have higher scores than the correct ones. Their approach is also not scalable; it may take days to process a set of 5000 protein sequences with about 20,000 interactions. In this paper, our contribution is two-fold. We first introduce a new scoring method, which is shown to be more accurate than the χ-score used in [10]. Then, we present two efficient algorithms, one exact algorithm and a heuristic version of it, to solve the problem of finding motif pairs. Based on experiments on real datasets, we show that our algorithms are efficient and can accurately locate the motif pairs. We have also evaluated the sensitivity and efficiency of our heuristic algorithm using simulated datasets; the results show that the algorithm is very efficient with reasonably high sensitivity.
1. INTRODUCTION
In the cells of all organisms, protein-protein interactions occur in the structure of sub-cellular organelles, the transport machinery across different membranes, the packaging of chromatin, the network of sub-membrane filaments, signal transduction and the regulation of gene expression, etc. Aberrant protein-protein interactions have been linked to a number of neurological disorders such as Alzheimer's disease [12], Muscular Dystrophy [4] and Huntington's disease [2]. Because of their importance, much research has been performed in order to understand the mechanism of protein-protein interactions. A protein is a sequence of amino acids with a 3D structure. Some subsequences of a protein will form substructures, called domains, on the surface of the 3D structure. These domains characterize the functions of each protein by controlling what kinds of molecules the protein will bind to. When two proteins bind together
(interact), their domains will exchange charges to form bonds which stabilize the protein-protein complex. For example, proteins with a Src homology 3 (SH3) domain (GxxPxNY) usually bind to proteins with a polyproline type II helical structure (PxxP). We call GxxPxNY and PxxP a binding motif pair, or motif pair in short. Discovering motif pairs helps us understand many protein-involved mechanisms and predict the functions of a protein. Biological experiments such as site-directed mutagenesis [3] and phage display [5] are available for discovering motif pairs. However, these experiments are both laborious and expensive. If we are given protein-protein interaction data (e.g. the DIP (Database of Interacting Proteins) database [14]), discovering a motif pair is more difficult than discovering the motif of the binding sites of co-regulated DNA sequences, which has been well studied in the literature [1, 6-8], because of the following issues:
1. Protein sequences are composed of 20 amino acids, whereas DNA sequences are comprised of 4 nucleotides. Therefore, it is more computationally involved to work with protein sequences.
2. For co-regulated DNA sequences, it is assumed that the given DNA sequences contain over-represented subsequences of a similar pattern (motif). However, we can only identify the pair of similar patterns (motif pair) if the set of relevant protein-protein interactions is isolated. This might not be easy, as there are many possible subsets of interactions. Even when the set of n interactions can be isolated, there are still 2^(n-1) ways of grouping the protein sequences to identify the motif pair.
3. Usually there is more information available for discovering DNA motifs since, for example, the sequences without the binding sites (the control set) can provide extra information for solving the problem. However, missing interactions between two protein sequences in the database do not imply the non-existence of motif pairs, because the missing protein-protein interaction data might be due to a lack of experiments on these pairs of proteins.
A naive approach is to fix a particular protein and the group of proteins that are known to bind to this protein, and then identify the motif from this group of protein sequences. This can be done using standard motif discovery tools such as MEME [1] or Weeder [8]. One must then also find a motif pattern that can uniquely identify the particular protein, to form the motif pair. However, this method works only when the number of protein sequences that bind to the same protein is large, say > 4. In practice, this is usually not the case. In fact, even when a particular protein can bind to a group of many proteins, those bindings might be due to many factors, not just a single motif pair. The problem of finding a motif pair is then reduced to finding a sequence pattern that can uniquely identify that particular protein and also an over-represented sequence pattern in a subset (not necessarily all) of the proteins in the group. Since that particular protein might be uniquely identified by more than one pattern, and there can be many subsets of proteins in the group whose sequences have similar patterns, the number of possible motif pairs
can be large. So, it is impossible to determine the correct motif pair, if it exists, which initiates the interactions.
To handle the above problem, [9] proposed to take advantage of prior knowledge of protein groupings according to protein domains. Instead of considering the bindings between a particular sequence and a group of proteins, they consider the bindings between a group of protein sequences and another group of sequences with a particular domain, so as to increase the number of sequences in the instance. A modified Gibbs sampling algorithm was developed to identify the motif pair. This method works if we already know one of the motifs in the motif pair; otherwise, how to isolate two groups of proteins that are related to the same motif pair from the interaction database is non-trivial.
Recently, Tan et al. [10] introduced a method to discover motif pairs without knowledge of the motifs participating in the motif pair and without any prior knowledge of the protein groupings. The basic idea of their approach is as follows. Based on the input sequences, they generate all possible substring pairs of a certain length from any two interacting sequences. For each possible motif pair, they identify the two groups of proteins that contain an instance of the motifs. They compare the number of observed interactions between these two groups with the expected number of interactions using χ² testing. The motif pairs with the highest χ-scores (implying the observed number of interactions is much larger than the expected number) are reported. They developed two algorithms, D-MOTIF and D-STAR, for discovering binding motif pairs based on this idea. D-MOTIF can discover the motif pair with the highest χ-score, while D-STAR is a heuristic algorithm.
There are several problems with this approach. We found that the χ-score is not an adequate measure for motif pairs. Since the expected number of interactions decreases with the number of sequences that contain the two motifs, algorithms using χ² testing tend to discover binding motif pairs that occur only in a few sequences. For example, when there is only one sequence with motif M1 and only one sequence with motif M2, if there is an interaction between these two sequences, the score will be high. However, M1 and M2 are not statistically significant as they occur in one sequence
only.† This is also the reason why their approach requires two input minimum thresholds, for both the number of sequences containing each motif and the number of interactions between these sequences. Moreover, they assume that the interactions are uniformly distributed over the input sequences, which may not be correct, since the interaction experiments may be biased towards some sequences depending on the choices of the researchers. As a result, the χ-score may not be a good measure in this type of study.
Both algorithms, D-MOTIF and D-STAR, are impractical for large data sets. For D-MOTIF, as a minimum number of interactions of at least 3 is assumed, all possible motif pairs from each interaction triplet (that is, any three pairs of sequences that are known to bind) are considered. Based on these motif pairs, they isolate the two subsets of proteins that contain the motif for calculating the χ-score. In the worst case, D-MOTIF runs in O(m³(n(|Σ|−1)^d)⁶) time, where m is the number of interactions, n is the length of a sequence, and Σ is the alphabet (assuming that l and d are small). D-MOTIF takes a long running time and requires a huge amount of memory even for a data set of about a hundred sequences. On the other hand, D-STAR is a heuristic version of D-MOTIF and does not consider all possible motif pairs. In their study, they adopted the mismatch (l, d)-motif model (that is, the motif is of length l and the instances differ from the motif by d mismatches), based on the observation that if we consider all sequences containing a substring y with at most 2d mismatches from an instance x of a real motif M, we include all instances of M. They, therefore, only consider the substrings that occur in the input sequences to isolate the two subsets of proteins. However, they might include many noisy instances in the subsets. Although D-STAR can discover binding motif pairs from a data set with a hundred sequences in minutes, it takes days when the number of sequences is increased to a thousand. D-STAR runs in O(m²n² + mtn²) time, where t is the number of sequences.
† As an example, based on the same SH3 domains dataset used in [10], which consists of about 150 protein sequences and 230 interactions, instead of using their heuristic algorithm, we exhausted all possible substring pairs using the same set of parameters as in [10] and computed the χ-score of each pair. We found that all motif pairs similar to the ones reported by [11] are ranked 90 or below. In other words, the correct motif pair can only be found at rank 90 or below.
Our contributions: In this paper, we introduce a new scoring method that calculates the probability that the observed number of interactions is generated under a null hypothesis. If this probability is small, the null hypothesis is incorrect for the pair of motifs and the two motifs are likely to be a motif pair. This scoring method tends to discover motif pairs that occur in many sequences (instead of one sequence) with a large number of observed interactions. The scoring function resolves the problems of the χ-score and does not require any pre-set thresholds. Experimental results on real biological data show that our scoring method models binding motif pairs better than χ² testing. We propose to use the wildcard (l, d)-motif model and have developed an exact algorithm, FindMotif(l, d, r), to find the top r motif pairs with the highest scores, and a heuristic algorithm, MotifHeuristics(l, d, r), to find r motif pairs which are guaranteed to be locally optimal solutions. The exact algorithm runs in O(m²n²) time, while the heuristic version runs in O(rm²n) time, where r is the number of random seeds used in the algorithm. Usually l and d are small and r is around 200, so our exact and heuristic algorithms run faster than D-MOTIF and D-STAR, respectively. In practice, MotifHeuristics can discover motif pairs for more than 5000 protein sequences in only about 20 minutes. For the exact algorithm, it takes only about half an hour to handle a data set of about 150 sequences and 230 interactions. We have also evaluated MotifHeuristics using simulated datasets; the results show that MotifHeuristics is efficient with reasonably high sensitivity.
This paper is organized as follows. We describe our scoring method and the formal problem definition in Section 2. In Section 3, we describe our algorithms for discovering motif pairs. Experimental results on real biological data are shown in Section 4. Section 5 concludes the paper.
2. PROBLEM DEFINITION
2.1. Motif Representation
A protein is a sequence of amino acids which can be represented by a sequence of symbols in Σ = {'A'..'Z'} − {'B', 'J', 'O', 'U', 'X', 'Z'}. Proteins with similar function usually contain similar substrings. These substrings can be modeled by an abstract representation called a motif. In this paper, we define an (l, d)-motif to be a length-l string with d wildcard symbols, denoted by 'x', and l − d symbols in Σ. For example, "GxxPxNY" is a (7, 3)-motif. An (l, d)-motif M represents those length-l substrings α which are the same as M but with each wildcard symbol 'x' replaced by a symbol in Σ. Each such string α is called an instance of motif M. For example, "GACPQNY" is an instance of the motif "GxxPxNY".
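To make the definition concrete, the following short Python sketch (ours, not part of the original paper) checks whether a substring is an instance of a wildcard motif and collects S(M), the set of sequences containing an instance of a motif M:

```python
# A minimal sketch of the (l, d)-motif model described above:
# a length-l substring is an instance of a motif if it matches
# every non-wildcard position exactly.
SIGMA = set("ACDEFGHIKLMNPQRSTVWY")  # {'A'..'Z'} minus B, J, O, U, X, Z

def is_instance(substring: str, motif: str) -> bool:
    """True if `substring` is an instance of the wildcard motif."""
    if len(substring) != len(motif):
        return False
    return all(m == 'x' or m == s for s, m in zip(substring, motif))

def sequences_with_instance(sequences, motif):
    """S(M): indices of sequences containing at least one instance of M."""
    l = len(motif)
    return {i for i, seq in enumerate(sequences)
            if any(is_instance(seq[p:p + l], motif)
                   for p in range(len(seq) - l + 1))}

assert is_instance("GACPQNY", "GxxPxNY")  # the example from the text
```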
2.2. (l, d)-Motif Pair Finding Problem
Given a set S of t protein sequences and a set I = {{s_i, s_j} | s_i, s_j ∈ S} of m known interactions between sequences in S, we want to discover a pair of motifs M1 and M2 such that the sequences containing instances of M1 (denoted S(M1)) interact with the sequences containing instances of M2 (denoted S(M2)). Note that since many interactions between sequences in S are still unknown and there are no interactions between some pairs of sequences, there may not be interactions between all sequences in S(M1) and S(M2).
2.2.1. χ-score
Tan et al. [10] discovered motif pairs by comparing the observed number of interactions with the expected number of interactions for every motif pair. Given a motif pair M1 and M2, they calculated the expected number E(M1, M2) of interactions between sequences in S(M1) and S(M2) by assuming the m known interactions are uniformly distributed among the t input sequences. They compared E(M1, M2) with the observed number O(M1, M2) of interactions between sequences in S(M1) and S(M2) by χ² testing.
Tan et al. [10] discovered motif pairs by considering those pairs with high χ-scores. However, using the χ-score has two main weaknesses.
1. Large χ-score when S(M1) and S(M2) are small. If the sizes of S(M1) and S(M2) are small, since the value of E(M1, M2) is extremely small, even when there is only one interaction between sequences in S(M1) and S(M2), the χ-score can be very large. For example, in a database with 5000 sequences and 20000 interactions, the value of E(M1, M2) for a motif pair M1 and M2 with one sequence in S(M1) and S(M2) respectively is 0.0016. If there is an interaction between these two sequences, the χ-score will be 623. Tan et al. try to solve this problem by limiting the sizes of S(M1) and S(M2) to be bigger than a threshold. However, different thresholds should be used for different input data, and they tend to discover motif pairs with the sizes of S(M1) and S(M2) equal to the threshold.
2. Interactions are not uniformly distributed. Since biologists usually perform experiments on some special proteins, these proteins participate in more known interactions than other proteins; e.g. the YBL063W protein participates in 283 known interactions while the YJR091C protein participates in only 1 known interaction. Therefore, the assumption that the m known interactions are uniformly distributed among the t input sequences is incorrect. The method by Tan et al. might discover wrong motif pairs and at the same time might miss the correct ones.
2.2.2. p-score
Instead of using the χ-score as the scoring function, we calculate the probability that there are O(M1, M2) or more interactions between sequences in S(M1) and S(M2) based on a null hypothesis (described below). If this probability is small, the null hypothesis cannot model the motif pair M1 and M2 and the motif pair is statistically significant.
Given a motif M1, let I(M1) ⊆ I be the interactions involving sequences in S(M1) and T(M1) ⊆ S be the set of sequences that interact with sequences in S(M1). As the number of ways of distributing x objects onto y boxes ("onto" means at least one object per box) is $\binom{x-1}{y-1}$, and the null hypothesis assumes that the |I(M1)| interactions are uniformly distributed onto T(M1), the conditional probability that there are exactly i interactions between sequences in S(M1) and S(M2) given I(M1) and T(M1) can be calculated as follows (to be precise, only the values of |T(M1)|, |T(M1) ∩ S(M2)| and |T(M1) − S(M2)| are needed in the calculation):

$$P_i = P(O(M_1,M_2) = i \mid T(M_1), I(M_1), S(M_1), S(M_2)) = \frac{\binom{i-1}{|T(M_1)\cap S(M_2)|-1}\,\binom{|I(M_1)|-i-1}{|T(M_1)-S(M_2)|-1}}{\binom{|I(M_1)|-1}{|T(M_1)|-1}}$$

Note that in the numerator of the equation, the i interactions are distributed uniformly onto T(M1) ∩ S(M2) and the remaining |I(M1)| − i interactions onto T(M1) − S(M2). The conditional probability that there are O(M1, M2) or more interactions between sequences in S(M1) and S(M2) given I(M1) and T(M1) can be calculated by summing up all possible i from O(M1, M2) to U = min{|I(M1)| − |T(M1) − S(M2)|, |S(M1)| × |T(M1) ∩ S(M2)|}, where i is upper bounded by two cases: (1) when each sequence in T(M1) − S(M2) is involved in exactly one interaction, and (2) the maximum number of interactions between S(M1) and T(M1) ∩ S(M2). We have

$$P_1 = P(O(M_1, M_2) \ge i \mid S(M_1), T(M_1), I(M_1), S(M_2)) = \sum_{j=O(M_1,M_2)}^{U} P_j$$

Similarly, we can calculate the conditional probability P2 = P(O(M1, M2) ≥ i | S(M2), T(M2), I(M2), S(M1)) that there are i or more interactions between sequences in S(M1) and S(M2), given that the sequences in S(M2) participate in |I(M2)| interactions with sequences in T(M2). The p-score of a motif pair {M1, M2} can be represented by the conditional probability P0 = P(O(M1, M2) ≥ i | S(M1), T(M1), I(M1), S(M2), T(M2), I(M2)). However, P0 cannot be calculated easily. We estimate the value of P0 by the following equation:

$$P_0 = P_1 \cdot \frac{P(T(M_2), I(M_2) \mid O(M_1,M_2) \ge i,\ S(M_2), S(M_1))}{P(T(M_2), I(M_2) \mid S(M_2), S(M_1))}$$

When the values of P(T(M2), I(M2) | O(M1, M2) ≥ i, S(M2), S(M1)) and P(T(M2), I(M2) | S(M2), S(M1)) are the same, P0 will be exactly the same as P1. Let δ(M2) = (|I(M2)| / |T(M2)|) · |T(M2) ∩ S(M1)| be the expected number of interactions between S(M2) and T(M2) ∩ S(M1). When i / δ(M2) is small, the difference between P(T(M2), I(M2) | O(M1, M2) ≥ i, S(M2), S(M1)) and P(T(M2), I(M2) | S(M2), S(M1)) is small. Therefore, P0 can be approximated by P1 when i / δ(M2) is small. Similarly, P0 can be approximated by P2 when i / δ(M1) is small. Thus we compare the values of i / δ(M2) and i / δ(M1); we use P1 (P2) to approximate the p-score of (M1, M2) when i / δ(M2) (i / δ(M1)) is smaller. A motif pair M1 and M2 with a small p-score means that M1 (M2) has an unexpectedly large number of interactions with M2 (M1). Therefore, we discover motif pairs by searching for pairs of motifs M1 and M2 with small p-scores.
Using the p-score as the scoring function overcomes the two weaknesses of the χ-score. Firstly, a motif pair M1 and M2 with small size usually has a large p-score; e.g. when both S(M1) and S(M2) contain one sequence and there is an interaction between them, p-score(M1, M2) = 1. Therefore, by using the p-score as the scoring function, we tend to discover a motif pair M1 and M2 when there is an unexpectedly large number of interactions between large sets of sequences S(M1) and S(M2). Secondly, by using T(M1), T(M2), S(M1), S(M2) in calculating the p-score, we do not rely on the assumption that the interactions are uniformly distributed among all input sequences.
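The one-sided probability P1 can be computed directly from the four counts it depends on. The following Python sketch is our illustration (the function names are not from the paper); it assumes the reconstructed formula for P_i above, with |T(M1) ∩ S(M2)| ≥ 1 and |T(M1) − S(M2)| ≥ 1:

```python
from math import comb

def safe_comb(n, k):
    # comb is zero outside the valid range of the binomial coefficient
    return comb(n, k) if 0 <= k <= n else 0

def p_exact(i, nI, nT, nT_in_S2):
    """P_i: exactly i of the |I(M1)| interactions fall between S(M1)
    and T(M1) ∩ S(M2), under the uniform 'onto' null hypothesis."""
    nT_out_S2 = nT - nT_in_S2
    num = safe_comb(i - 1, nT_in_S2 - 1) * safe_comb(nI - i - 1, nT_out_S2 - 1)
    den = safe_comb(nI - 1, nT - 1)
    return num / den if den else 0.0

def p1_score(obs, nI, nT, nT_in_S2, nS1):
    """P1 = P(O(M1, M2) >= obs | S(M1), T(M1), I(M1), S(M2))."""
    upper = min(nI - (nT - nT_in_S2), nS1 * nT_in_S2)
    return sum(p_exact(i, nI, nT, nT_in_S2) for i in range(obs, upper + 1))
```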
3. ALGORITHM
3.1. Exact Algorithm
In this section, we propose an exact algorithm, FindMotif(l, d, r), to identify the top r (l, d)-motif pairs with the lowest p-scores. The algorithm is based on the idea of voting. Based on the scoring function proposed in Section 2, we require the following information to calculate the score of each possible motif pair. In the following, we assume that l and d are small, so $\binom{l}{d}$ is a constant.
1. For each motif M,
- N_S[M]: the number of sequences containing an instance of M, |S(M)|;
- N_I[M]: the total number of interactions for sequences in S(M), |I(M)|;
- N_T[M]: the number of sequences interacting with the sequences in S(M), |T(M)|.
2. For each pair of motifs M1 and M2,
- N_O[M1, M2]: the total number of interactions between sequences in S(M1) and those in S(M2), that is, the observed interactions, O(M1, M2);
- C[M1, M2]: the number of sequences in S(M2) that interact with sequences in S(M1), |T(M1) ∩ S(M2)|.
For each length-l substring s in the input sequences, we add a vote to N_S[M] for each motif M of which s is an instance.
Note that there are $\binom{l}{d}$ such M's. Each N_S[M] can get at most one vote from each input sequence. N_S[M] can be computed in O(tn) time, where t is the number of sequences and n is the length of each sequence.
To compute N_I, N_T, N_O, and C, we do the voting as follows. For each interaction {s_i, s_j} ∈ I, for every length-l substring x in s_i, we add a vote to N_I[M] and a vote to N_T[M], where x is an instance of M. Each N_I[M] can get at most one vote from each interaction. Each N_T[M] can get at most one vote for every sequence s_j. If s_i and s_j are two different sequences, then for every length-l substring y in s_j, we add a vote to N_I[M] if N_I[M] has not yet received a vote from the same interaction, and we add a vote to N_T[M], where y is an instance of M, if N_T[M] has not yet received a vote from s_i. For every two length-l substrings x and y (in s_i and s_j, respectively), we add a vote to N_O[M1, M2] and a vote to C[M1, M2] and C[M2, M1], where x and y are instances of M1 and M2 respectively. Each N_O[M1, M2] can get at most one vote from each interaction. Each C[M1, M2] can get at most one vote from each sequence s_j, and each C[M2, M1] at most one vote from each sequence s_i. This step takes O(mn²) time, where m is the total number of interactions.
Finally, we can pre-compute all possible values of $\binom{i}{j}$ for the different values of i, j to be used in the calculation of the score of a motif pair. The maximum value for i is the largest possible number of interactions, r, for a motif; it can be obtained from the N_I[M] table and is usually a lot smaller than m. Computing all these values takes only O(r²) time. Then, computing the score for a motif pair takes O(m) time, as it involves summing at most O(m) terms, each of which can be computed in constant time using the pre-computed binomial values. There are altogether $O(\binom{l}{d}^2 \cdot |\Sigma|^{2(l-d)})$ possible motif pairs, where Σ is the alphabet of amino acids. However, some of them may not have any instance in the input sequences, so we only need to compute the score for at most O(mn²) pairs, since we only need to consider those pairs of sequences which have an interaction. The overall time complexity is O(m²n²). The space complexity is $O(\binom{l}{d}^2 \cdot |\Sigma|^{2(l-d)})$. As l and d are usually small, they can be treated as constants. Theorem 1 follows.
Theorem 1: The time and space complexities of FindMotif(l, d, r) are O(m²n²) and O(|Σ|^{2(l−d)}), respectively.
For the SH3 domain dataset of 146 protein sequences and 233 interactions, the algorithm takes about half an hour with l = 8 and d = 5, which is a lot faster than D-MOTIF. We have also developed a heuristic algorithm that runs faster than FindMotif(l, d, r) and can handle the whole yeast dataset of about 5000 sequences with over 20,000 interactions in about 20 minutes.
Remarks: In practice, if the space requirement of $O(\binom{l}{d}^2 \cdot |\Sigma|^{2(l-d)})$ is too large, we can apply some simple tricks to reduce it. For example, we can first fix a set of positions for the wildcard characters and process the motifs with these positions as wildcards; the space requirement is then reduced to O(|Σ|^{2(l−d)}), and the procedure is repeated $\binom{l}{d}^2$ times. Another trick is to fix the first character, say 'A', of one motif and process only the motifs starting with that character; the space requirement is then reduced to O(|Σ|^{2(l−d)−1}).
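The voting step for the N_S table can be sketched as follows. This is a condensed Python reconstruction (not the authors' implementation); the other tables N_I, N_T, N_O and C are filled by analogous passes over the interactions, with `seen` sets enforcing the one-vote-per-sequence and one-vote-per-interaction rules described above:

```python
from itertools import combinations
from collections import defaultdict

def motifs_of(substring: str, d: int):
    """All (l, d)-motifs having `substring` as an instance
    (one motif per choice of d wildcard positions)."""
    l = len(substring)
    for pos in combinations(range(l), d):
        yield ''.join('x' if i in pos else c
                      for i, c in enumerate(substring))

def vote_ns(sequences, l, d):
    """N_S[M] = |S(M)|, with at most one vote per input sequence."""
    ns = defaultdict(int)
    for seq in sequences:
        seen = set()
        for p in range(len(seq) - l + 1):
            for m in motifs_of(seq[p:p + l], d):
                if m not in seen:
                    seen.add(m)
                    ns[m] += 1
    return ns
```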
3.2. Heuristics Algorithm
FindMotif finds the p-score of all motif pairs in the sequences that interact. As the motif pair space is very large and computing the p-score is time-consuming, FindMotif has a long running time when dealing with a moderately large dataset or when l is large. The heuristic algorithm MotifHeuristics improves on the running time of FindMotif by considering fewer motif pairs, thus reducing the number of times the p-score is calculated.
Instead of finding the p-score of all motif pairs that interact, MotifHeuristics starts with r random seed (l, d)-motifs that are found in the sequences. For each seed x, we find its optimal partner motif y such that the motif pair {x, y} has the lowest p-score, by voting similar to FindMotif. The optimal motif partners of the seeds then become the new seeds. In the next step, we go on to find the optimal partner motifs of all the new seeds. By doing this, we can obtain motif pairs with lower p-scores. This process is repeated until the p-score does not improve. The resulting motif pairs are locally optimal motif pairs. It is likely that at least one of the randomly generated seeds will converge to the optimal motif pair. A sketch of this refinement loop is given below.
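The following Python sketch shows the alternating local search structure only; `random_seed_motifs` and `best_partner` are hypothetical stand-ins for the voting machinery and are not names from the paper:

```python
def motif_heuristics(sequences, interactions, l, d, r,
                     random_seed_motifs, best_partner, max_iters=10):
    """Local search: each seed alternates with its optimal partner
    until the p-score (lower is better) stops improving."""
    results = []
    for seed in random_seed_motifs(sequences, l, d, r):
        x = seed
        y, score = best_partner(x, sequences, interactions)
        for _ in range(max_iters - 1):
            z, new_score = best_partner(y, sequences, interactions)
            if new_score >= score:          # p-score no longer improves
                break
            x, y, score = y, z, new_score   # the partner becomes the new seed
        results.append(((x, y), score))
    return sorted(results, key=lambda t: t[1])[:r]
```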
Table 1. Top 10 motif pairs reported by FindMotif on the SH3 dataset with l = 8, d = 5, r = 200.

M1         M2         |S(M1)|, |S(M2)|   O(M1,M2)   p-score
PxNxVxxx   LxxLxxSx   22, 69             80         4.78×10⁻¹⁴
LxPxxTxx   GxxPxxYx   29, 17             54         1.70×10⁻¹³
LSxSxxxx   PxNxVxxx   57, 22             68         1.78×10⁻¹³
LLxxLxxx   PxNxVxxx   60, 22             83         2.65×10⁻¹³
PxNxVxxx   SxSxIxxx   22, 58             82         4.40×10⁻¹³
SLxxKxxx   PxNxVxxx   47, 22             67         6.66×10⁻¹³
PxxPxRxx   GxxPxxYx   28, 17             57         2.24×10⁻¹²
SxIDxxxx   GxxPxxYx   36, 17             63         3.34×10⁻¹²
LxPxxTxx   AxxSxGxx   29, 23             52         4.46×10⁻¹²
GxxPxxYx   LxxLxxSx   17, 69             80         5.66×10⁻¹²

Table 2. Top 10 motif pairs reported by MotifHeuristics on the SH3 dataset with l = 8, d = 5, r = 200.

M1         M2         |S(M1)|, |S(M2)|   O(M1,M2)   p-score
LxxLxxSx   PxNxVxxx   69, 22             80         4.78×10⁻¹⁴
LxPxxTxx   GxxPxxYx   29, 17             54         1.70×10⁻¹³
GxxPxxYx   PxxPxRxx   17, 28             57         2.24×10⁻¹²
GxxPxNxx   PxxPxRxx   18, 28             54         1.04×10⁻¹¹
LxxSxKxx   TxxGxVxx   38, 15             38         1.12×10⁻¹¹
QSxxSxxx   SxxQxxIx   36, 24             40         1.36×10⁻¹¹
SxxSxSxx   ATxPxxxx   79, 18             76         1.87×10⁻¹¹
IxxTTxxx   KxxPExxx   21, 18             27         8.50×10⁻¹¹
PSxLxxxx   YxxDYxxx   47, 10             42         1.40×10⁻¹⁰
SxPxPxxx   AxAxYxxx   37, 12             44         3.20×10⁻¹⁰

Table 3. Top 10 motif pairs reported by MotifHeuristics on the yeast dataset with l = 8, d = 4, r = 200.

M1         M2         |S(M1)|, |S(M2)|   O(M1,M2)   p-score
GxxPxNxV   PxLPxRxx   32, 39             78         4.25×10⁻??
PPxPxRxx   GxxPxNYx   27, 19             72         9.12×10⁻??
RRxDxxQx   SSPxKxxx   11, 107            56         4.20×10⁻³²
GCxxAExx   SSxxSxxS   16, 508            128        1.17×10⁻³¹
LVxxFLxx   LxxSPxKx   77, 68             35         6.86×10⁻??
DTxGQxxx   LxxYIxIx   43, 31             38         1.38×10⁻??
GDGTxxxx   IWDxRxxx   40, 16             23         3.39×10⁻¹⁹
GSTGxxxx   AxxLxNSx   46, 70             38         4.28×10⁻¹⁹
IGxAIxxx   GxKTxKxx   59, 50             39         1.62×10⁻¹⁸
FGxxTxNx   ALRxLxxx   16, 103            43         1.75×10⁻¹⁸
By setting an appropriate value for r, the algorithm can finish within a reasonable time with high accuracy.
The time complexity of MotifHeuristics(l, d, r): let the algorithm run for k iterations for each seed. In each iteration, at most r seeds are given. The algorithm finds the optimal partner of the r seeds. Similar to the exact algorithm, this step is done by voting N_S, N_I, N_T, N_O and C. However, it takes only O(mnr) time, as one of the motifs (the seed) is known. As there are r seeds and O(mn) possible optimal partners for each seed, it takes O(rm²n) time to calculate the scores of all pairs. The overall time complexity is O(krm²n), as there are k iterations. In practice, the algorithm halts after around 10 iterations, or we can stop the execution after 10 iterations; therefore, k can be neglected. Again, we treat l and d as constants. Theorem 2 follows.
Theorem 2: The time and space complexities of MotifHeuristics(l, d, r) are O(rm²n) and O(r·|Σ|^{l−d}), respectively.
4. EXPERIMENTS
We have performed experiments to evaluate our scoring function and the performance of MotifHeuristics. We ran our programs on real biological data and verified the results against those obtained from biological experiments. We have also compared the performance of our algorithm with the heuristic algorithm D-STAR proposed by Tan et al. [10]. All the experiments were performed on a standalone computer with a 2.4 GHz Intel CPU and 4 GB memory. In each experiment, we used 200 seeds for MotifHeuristics and each seed was refined at most 10 times.
4.1. SH3 Domains Dataset
SH3 domains are similar amino acid segments that are found to bind the motifs "PxxP", "PxxPx[RK]" and "[RK]xxPxxP" [11]. It has been experimentally determined that the binding is due to the presence of the pattern "GxxPxNY" in the SH3 domain (PDB ID: 1AVZ). We tested whether FindMotif and MotifHeuristics are able to recover this motif pair "GxxPxNY" and "PxxP". The dataset was obtained from the Biomolecular Object Network Databank [13]. It contains 146 yeast proteins, including 23 that contain the SH3 domain, and 233 protein-protein interactions.
We tested FindMotif and MotifHeuristics with l = 8 and d = 5. The results are shown in Tables 1 and 2. FindMotif discovered the motif pair "GxxPxxY" and "PxxPxR" at rank 7, and "GxxPxxY" and "PxLP" at rank 13. MotifHeuristics also discovered a similar motif pair, at rank 3. Therefore, both algorithms could discover the known motif pair. Tan et al. stated that the heuristic D-STAR algorithm was able to discover the known motif pairs [10]. We exhausted all possible motif pairs using their model and parameters. We found that the first motif pair similar to "GxxPxNY" and "PxxP" was ranked 90, and there were 89 motif pairs having higher χ-scores than the known motif pair. This result suggests that D-STAR might miss motif pairs with good scores (the top 89 pairs). On the other hand, our scoring function is more robust than the χ-score proposed by Tan et al., as FindMotif considered all possible motif pairs and the known motif pair is ranked within the top ten results.
4.2. Yeast Dataset
To measure the efficiency and further verify the correctness of MotifHeuristics on a large dataset, we ran the algorithm on the yeast dataset, obtained by merging data from the MIPS and DIP databases. It includes 5246 yeast proteins and 21225 interactions. Since this dataset contains most of the protein sequences and interactions in the SH3 dataset, we expect our algorithm to also discover a motif pair similar to "GxxPxNY" and "PxxP". We used MotifHeuristics to discover motif pairs in this dataset with parameters l = 8, d = 4 and r = 200. The results are shown in Table 3. MotifHeuristics reported the motif pair "GxxPxNxV" and "PxLPxRxx" at rank 1. This shows that our scoring function performs well on both small and large datasets.
4.3. Running Time Comparison
The running times of the three algorithms on the above datasets are shown in Table 4. For the SH3 dataset, all algorithms could discover a motif pair similar to the known pair "GxxPxNY" and "PxxP". Since FindMotif guarantees finding the motif pair with the lowest p-score, it took the longest time (54 minutes) to finish.
Table 4. Comparison of the algorithms' running time.

Algorithm          SH3 dataset                     Yeast dataset
FindMotif          54 min (l = 8, d = 5)           did not finish in 5 days
MotifHeuristics    51 s (l = 8, d = 5, r = 200)    44 min (l = 8, d = 4, r = 200)
D-STAR             14 min                          did not finish in 5 days
For the two heuristic algorithms, D-STAR took 14 minutes while MotifHeuristics took only 51 seconds; MotifHeuristics thus had the shortest running time. For the yeast dataset, since the number of protein sequences and interactions was large, FindMotif and the heuristic algorithm D-STAR could not finish in 5 days. On the other hand, MotifHeuristics was able to discover the correct motif pair in only 44 minutes. So, MotifHeuristics is more scalable and efficient.
4.4. Simulated Data
We have further evaluated the sensitivity and efficiency of MotifHeuristics using simulated data. We generated 146 (the same size as the SH3 dataset) length-668 random protein sequences (each amino acid having equal probability of occurring). We randomly picked an (8, 5)-motif pair and planted s instances (i.e., |S(M1)| = |S(M2)| = s) of each motif in the sequences. 200 interactions were randomly assigned to the 146 sequences. An additional i (i.e., O(M1, M2)) interactions were assigned to the instances of the planted motif pair. We tried two settings: s = 10 and s = 20. For each setting, we studied the running time and the success rate of MotifHeuristics using different values of i. For each set of parameters s and i, 50 different data sets were generated. Note that we only consider the algorithm to be successful if the planted motif pair appears in the output at rank 1. The results are shown in Figure 1. When the number of interactions increases to about 45 and 75, respectively, for the cases of 10 and 20 planted motif instances, the success rate increases to more than 80%. We found that with fewer interactions, the p-value of the planted motif pair is usually larger than 1×10⁻¹¹, making it difficult to distinguish from noise. The results are consistent with the real dataset (see Table 2). A sketch of the data generator appears below.
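This Python sketch is our reconstruction of the generation procedure described above; parameter names and helper functions are ours, not from the paper:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet

def simulate(t=146, length=668, s=10, i=45, m_background=200, l=8, d=5):
    """Random sequences with a planted (l, d)-motif pair, s instances each,
    m_background random interactions plus i planted interactions."""
    seqs = [''.join(random.choice(AA) for _ in range(length)) for _ in range(t)]

    def random_motif():
        wild = set(random.sample(range(l), d))
        return ''.join('x' if j in wild else random.choice(AA) for j in range(l))

    def plant(motif, targets):
        for idx in targets:
            inst = ''.join(random.choice(AA) if c == 'x' else c for c in motif)
            p = random.randrange(length - l)
            seqs[idx] = seqs[idx][:p] + inst + seqs[idx][p + l:]

    m1, m2 = random_motif(), random_motif()
    g1, g2 = random.sample(range(t), s), random.sample(range(t), s)
    plant(m1, g1)
    plant(m2, g2)
    inter = {frozenset(random.sample(range(t), 2)) for _ in range(m_background)}
    inter |= {frozenset((random.choice(g1), random.choice(g2))) for _ in range(i)}
    return seqs, inter, (m1, m2)
```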
Fig. 1. Success rate against the number of planted interactions (l = 8, d = 5, r = 200).
The average running time for each dataset is about 1 minute. Overall, MotifHeuristics is fast with reasonably high sensitivity.
5. CONCLUSION
In this paper, we have proposed a new scoring function to evaluate whether a motif pair is significant based on given protein-protein interaction data. We have also developed an exact algorithm and a heuristic algorithm to solve the problem. We show that our scoring function is more accurate than the one used in [10] and that our algorithms are more efficient and scalable than existing algorithms. Possible future directions include the following. The scoring function we proposed is an approximation to the conditional probability we defined in Section 2; whether a more accurate scoring function exists is an interesting and important question. Although our algorithms can process a large dataset within a reasonable amount of time, a more efficient (in terms of time and space) algorithm is always desirable. Whether the current approach can be effectively applied to locate motif triplets is also a challenging extension.
Acknowledgments
This paper is supported by the RGC grant HKU 7120/06E.
References
1. Bailey, T., Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. ISMB 1994; 28-36.
2. Goehler, H. et al. Can we infer peptide recognition specificity mediated by SH3 domains? FEBS Lett 2002; 513(1): 38-44.
3. Hans, J., Brandt, W., Vogt, T. Site-directed mutagenesis and protein 3D-homology modelling suggest a catalytic mechanism for UDP-glucose-dependent betanidin 5-O-glucosyltransferase from Dorotheanthus bellidiformis. Plant J. 2004; 39(3): 319-333.
4. Hu, H. et al. A map of WW domain family interactions. Proteomics 2004; 4(3): 643-655.
5. Karkkainen, S., Hiipakka, M., Wang, J.H., Kleino, I., Vaha-Jaakkola, M., Renkema, G.H., Liss, M., Wagner, R., Saksela, K. Identification of preferred protein interactions by phage-display of the human Src homology-3 proteome. EMBO Rep 2006; 7(2): 186-191.
6. Leung, H., Chin, F. Discovering motifs with transcription factor domain knowledge. PSB 2007; 12: 472-483.
7. Leung, H., Chin, F. Finding motifs from all sequences with and without binding sites. Bioinformatics 2006; 22(18): 2217-2223.
8. Pavesi, G., Mereghetti, P., Zambelli, F., Stefani, M., Mauri, G., Pesole, G. MoD Tools: regulatory motif discovery in nucleotide sequences from co-regulated or homologous genes. Nucleic Acids Res 2006; 34: W566-W570.
9. Reiss, D. J., Schwikowski, B. Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 2004; 14(1): 55-67.
10. Tan, S.-H., Hugo, W., Sung, W.-K., Ng, S.-K. A correlated motif approach for finding short linear motifs from protein interaction networks. BMC Bioinformatics 2006; 7: 502.
11. Tong, A. H. Y. et al. A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 2002; 295(5553): 321-324.
12. Weidemann, A., Konig, G., Bunke, D., Fischer, P., Salbaum, M., Masters, C., Beyreuther, K. Identification, biogenesis, and localization of precursors of Alzheimer's disease A4 amyloid protein. Cell 1989; 57: 115-126.
13. Biomolecular Object Network Databank. http://bond.unleashedinformatics.com
14. Database of Interacting Proteins. http://dip.doe-mbi.ucla.edu/
A MARKOV MODEL BASED ANALYSIS OF STOCHASTIC BIOCHEMICAL SYSTEMS
Preetam Ghosh*, Samik Ghosh, Kalyan Basu and Sajal K. Das
Biological Networks Research Group, Dept. of Comp. Sc. & Engg., University of Texas at Arlington, TX 76010
Email: {ghosh, sghosh, basu, das}@cse.uta.edu
*Corresponding author.
The molecular networks regulating basic physiological processes in a cell are generally converted into rate equations, assuming the numbers of biochemical molecules to be deterministic variables. At steady state these rate equations give a set of differential equations that are solved using numerical methods. However, the stochastic cellular environment motivates us to propose a mathematical framework for analyzing such biochemical molecular networks. The stochastic simulators that solve a system of differential equations include this stochasticity in the model, but suffer from simulation stiffness and require huge computational overheads. This paper describes a new Markov chain based model to simulate such complex biological systems with reduced computation and memory overheads. The central idea is to transform the continuous-domain chemical master equation (CME) based method into a discrete domain of molecular states with corresponding state transition probabilities and times. Our methodology allows the basic optimization schemes devised for the CME and can also be extended to reduce the computational and memory overheads appreciably at the cost of accuracy. The simulation results for the standard Enzyme-Kinetics and Transcriptional Regulatory systems show promising correspondence with the CME based methods and point to the efficacy of our scheme.
1. INTRODUCTION
The research challenge of today is to develop a comprehensive modeling framework integrating molecular, genetic and pathway data for a quantitative understanding of the physiology and behavior of biological processes at multiple scales. The complexity of a biological process at the molecular level is enormous, due to the vast number of molecular state spaces possible in a cell and the large number of state transitions. Computational cell biology currently models the biological system as an aggregate functional state where the underlying molecular transitions are not captured. Hence, these models can only provide understanding for some specific problems at a functional level, but not at the molecular dynamics level. Spatio-temporal models capturing the temporal and spatial dynamics of biological processes at a molecular level can be classified as follows: (1) mesoscale dynamics, (2) cellular/organ-level stochastic simulation, and (3) rule based models.
Existing quantum mechanics and molecular dynamics based models are limited in scope, as they cannot handle the complexity of an entire cell or a complex pathway. The former captures the random environment of the cell at the electron level and is very useful for understanding the structure of macromolecules,
but can only handle about 1000 atoms. The molecular mechanics model uses force field methods to understand the function of the macromolecules. This model is used to study the binding site configurations for protein-protein or protein-DNA interactions and protein folding. Currently it can handle about 1 million atoms, and hence is not sufficient to model a cell or complex pathways. The models for mesoscale dynamics and cellular/organ-level stochastic simulation focus on a narrow range of biological components, such as the wave model for ventricular fibrillation in the human heart, the neural network signaling model to control the onset of sleep in humans, and simulation frameworks like E-Cell and Virtual Cell. Mesoscale dynamics deals with rate equation based kinetic models and uses continuous-time deterministic techniques. This model solves a complex set of differential equations corresponding to the chemical reactions of the pathways. Since a biological system involves a large number of such equations, the model can only solve a system of at most 1000 reactions. To address the observed stochasticity in a biological system [11, 12], Gillespie extended the rate based model to a stochastic simulation framework. This led to a few other variations such as Kitano's Cell Designer [10], DARPA's BioSpice, Cell Illustrator, etc. The computational overhead of this simulation
forced the use of approximation techniques to solve the rate equations by sacrificing accuracy, e.g. the Tau Leap algorithm. Gillespie's technique considers the biochemical system as a discrete Markov process but suffers from the following limitation:
• It assumes that a biological system only consists of different biochemical reactions. Hence, each reaction event is abstracted by the experimentally determined rate constant and cannot incorporate the pertinent details of that biological event. For example, ideally a protein-ligand docking event should incorporate some details of the protein/ligand docking site location, which is considered by our protein-ligand docking model presented in [24]. Because our models presented in [25, 23, 24] are parametric, we can easily estimate the kinetic parameters even in cases where such experimental data are not available.
Due to the large number of protein complexes in a cell, these stochastic simulation models lead to a combinatorial explosion in the number of reactions, thus making them unmanageable for complex metabolic and signaling pathway problems. The simulation model we propose here builds on the Gillespie technique and allows for many novel approximation techniques which cannot be implemented in the Gillespie simulation. Moreover, the flexibility of using different mathematical abstractions for different biological events makes our technique more attractive than the π-calculus [31, 32] modeling technique. Finally, the rule based simulation [13] models multi-cell interactions at a molecular level and addresses the more complex host-pathogen interactions. It ignores the stochastic nature of biological functions and considers a set of rules derived from pathways. In this paper, we convert the biological process into a stochastic network and solve it as a stochastic network analysis problem. Stochastic discrete event simulation is another way of addressing this problem, as we have described in [22].
2. STOCHASTIC BIOCHEMICAL SYSTEM ANALYSIS
In a stochastic biochemical system, the state of the system at any time is defined by the number of molecules of each type. The transition from one state to another is derived from the probability of the reactions at the current state, and the resulting next state
is the new molecular state. As the molecular reactions in a biological process occur due to the random collision of molecules, the state transition parameters are random and the state space is discrete. Let us assume that in a stochastic biochemical system there are M elementary (monomolecular or bimolecular) irreversible reaction channels, which react at random times. A monomolecular reaction converts a reactant molecule into one or more product molecules. A bimolecular reaction converts two reactant molecules into one or more product molecules. We can decompose a reaction channel that involves more than two reactant molecules into a cascade of elementary reaction channels, and model a reversible reaction channel by two irreversible reaction channels. The state of a stochastic biochemical system at time t is characterized by the M-dimensional random vector
$$Z(t) = [Z_1(t)\; Z_2(t)\; \ldots\; Z_M(t)]^T$$
where Z_m(t) = z_m if the mth reaction has occurred z_m times during the time interval [0, t), and T denotes vector or matrix transposition. The random variable Z_m(t) is referred to as the degree of advancement (DA) of the mth reaction [14]. Also, X_n(t) denotes the number of molecules of the nth reactant or product species present in the system at time t. Assuming N distinct species, we have
$$X(t) = [X_1(t)\; X_2(t)\; \ldots\; X_N(t)]^T$$
Given that the biochemical system is at state X(t) = x at time t, let q_m(x) be the number of all possible distinct combinations of the reactant molecules associated with the mth reaction channel when the system is at state x. Note that
$$q_m(x) = \begin{cases} x_i, & \text{for monomolecular reactions} \\ x_i(x_i - 1)/2, & \text{for bimolecular reactions with identical reactants} \\ x_i x_j, & \text{for bimolecular reactions with different reactants} \end{cases}$$
for some 1 ≤ i, j ≤ N, i ≠ j. Moreover, let c_m > 0 be the probability per unit time that a randomly chosen combination of reactant molecules will react through the mth reaction channel. This probability is known as the specific probability rate constant of the mth reaction. Then, the probability that one mth reaction will occur during a time interval [t, t + dt) will approximately be equal to r_m(x) dt, for a sufficiently small dt, where
$$r_m(x) = c_m q_m(x), \quad m \in \mathcal{M} = \{1, 2, \ldots, M\},$$
is known as the propensity function of the mth reaction channel [15, 16]. Note that, given the state Z(t) of the biochemical system at time t, we can uniquely determine the state X(t) of the system at time t as
$$X_n(t) = x_{0,n} + \sum_{m \in \mathcal{M}} s_{nm} Z_m(t)$$
where x_{0,n} is the initial number of molecules of the nth species present in the cell at time t = 0 and s_{nm} is the stoichiometric coefficient. This coefficient quantifies the change in the number of molecules of the nth molecular species caused by one occurrence of the mth reaction. The state Z(t) cannot be determined from X(t) in general, since there might be several states Z(t) that lead to the same state X(t). To distinguish Z(t) from X(t), all existing works on stochastic simulation refer to Z(t) as the hidden state and to X(t) as the observable state, and use a hidden Markov model to analyze the system. The discrete-valued random process
$$Z = \{Z(t),\ t \ge 0\}$$
characterizes the dynamic evolution of the hidden state of a biochemical system. This process is specified by the probability mass function (PMF)
$$P_Z(z; t) = \Pr[Z(t) = z \mid Z(0) = 0], \quad \text{for every } t \ge 0.$$
Simple probabilistic arguments show that P_Z(z; t) satisfies the following first-order differential equation [17]:
$$\frac{\partial P_Z(z;t)}{\partial t} = \sum_{m \in \mathcal{M}} \left[ r_m(x(z - e_m))\, P_Z(z - e_m; t) - r_m(x(z))\, P_Z(z; t) \right] \quad (1)$$
for t > 0, with initial condition P_Z(0; 0) = 1, where e_m is the mth column of the M × M identity matrix. This is the well-known forward Kolmogorov differential equation [18-20] governing the stochastic evolution of a continuous-time Markov chain. In computational biochemistry, Eqn. (1) is referred to as the chemical master equation (CME) [14]. It turns out that Z is a multivariate birth process [18, 20] and X is a multivariate birth-death process.

3. OUR MARKOV CHAIN FORMULATION
We replace the hidden Markov model by a Markov chain based approach to model a composite biochemical system. Note that the system only represents biochemical reactions or protein-ligand docking events in the cell. Thus, in the Markov chain, each state transition occurs due to one reaction/docking event. If multiple reaction/docking events are possible, then the state transitions can occur due to any one of these events, and hence we can have multiple transition paths to the next state. The states in the Markov chain are defined by the numbers of molecules of the different components in the biological system, i.e., X(t) = [X_1(t), X_2(t), ..., X_N(t)]. For example, consider the following biochemical system:
$$R1:\ X_1 + X_2 \rightarrow X_3; \qquad R2:\ X_2 + X_4 \rightarrow X_5$$
where X1, X2, X4 are proteins and X3, X5 denote the docked complexes. Then each state in the Markov chain will have 5 tuples corresponding to the numbers of molecules of these 5 components. The corresponding Markov chain with the possible state transitions is shown in Fig. 1. Note that each transition signifies either an R1 or an R2 type of event. Thus, the total number of edges coming out of each node is given by the possible number of reaction/docking events (and equivalently the number of differential equations) considered in the system.

3.1. The MFPT concept
Assuming first-order kinetics, the probability that a particle has reached the final state at some time t is given by P_f(t) = 1 − e^{−kt}, where k is the rate and P_f(t) is the probability of reaching a final state by time t. By running many independent simulations shorter than 1/k, we can estimate the cumulative distribution P_f(t) and fit the value of the rate k. The mean first passage time is the average time at which a particle reaches the final state for the first time, given that it is in an initial state at t = 0:
$$\mathrm{MFPT} = \int_{t=0}^{\infty} t \left( \frac{d P_f(t)}{dt} \right) dt = \frac{1}{k}$$
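As a small illustration of this relation (our example, not from the paper): for an exponential first-passage-time distribution, fitting P_f(t) = 1 − e^{−kt} to simulated passage times reduces to taking the reciprocal of the sample mean, and the MFPT is 1/k.

```python
import random

def estimate_rate(first_passage_times):
    """Fit k in P_f(t) = 1 - exp(-k t); for exponential samples the
    maximum-likelihood estimate is 1 / mean(t)."""
    mean = sum(first_passage_times) / len(first_passage_times)
    return 1.0 / mean

k_true = 0.5
samples = [random.expovariate(k_true) for _ in range(10000)]
k_hat = estimate_rate(samples)
print(f"fitted k = {k_hat:.3f}, MFPT = {1.0 / k_hat:.3f}")  # MFPT ~ 1/k = 2
```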
3.2. Computing the state transition probabilities and times
Note that computing the MFPT requires an estimate of each state transition probability along with the time taken for the transition. Because each
Fig. 1. Markov chain formulation with 3 molecules each of X1, X2, X4 and no X3, X5 molecules initially.
Fig. 2. A simple birth-death model for reversible reactions.
state transition signifies either a reaction or a docking, we can find the state transition probabilities and times from the batch models of the reaction [25] and docking [24] events, using concepts from collision theory. The batch model incorporates the number of molecules of each reactant present before the start of the reaction/docking events. This makes each state transition depend upon the current state that the system is in. Note that the batch model estimates the time of reaction/docking as a random variable following a Gamma distribution when few reactant molecules are present in the system. However, as the number of reactant molecules increases, the mean-to-standard-deviation ratio for the time becomes close to 1, signifying an exponential distribution. Also, note that [24] reports that the docking time is primarily affected by the collision theory component. Hence the batch models of [25] are also applicable to the docking events.
3.2.1. Monomolecular reactions
The time taken for monomolecular reactions can be computed simply from the experimentally determined reaction rate constant. Denoting the reaction rate constant by k_{R3}, the probability of a reaction of type R3 (denoted by P_{R3}) is given by:
$$R3:\ X_6 + X_7 \rightarrow X_8; \qquad P_{R3} = [X_6]\, k_{R3}\, \tau,$$
where [X_6] denotes the concentration of X6-type molecules and τ denotes an infinitesimally small time step (generally of the order of 10⁻⁹ secs). Note that this definition of the monomolecular reaction probability is exactly the same as that used for solving the CME, and can be defined as the probability of a reaction of type R3 occurring in time τ.
The time taken for completing R3 (denoted by T_{R3}) can be estimated from the rate constant as follows:
$$T_{R3} = \frac{1}{[X_6]\, k_{R3}}$$
In [25], we have shown that the reaction time is a random variable following an exponential distribution when there is a sufficient number of molecules in the system. Hence, we assume that the monomolecular reaction completion time also follows an exponential distribution with mean T_{R3}.

3.2.2. Bimolecular reactions
We use the batch model developed in [25] for computing the probability of reaction and the first and second moments of the reaction completion times. Considering reaction R1, the probability can be estimated as:
$$P_{R1} = \frac{n_1 n_2}{V}\, \pi r_{12}^2 \sqrt{\frac{8 k_B T}{\pi m_{12}}}\; e^{-E_{A12}/(k_B T)}\, \tau,$$
where n1, n2 are the numbers of X1 and X2 type molecules present in the cell, r12 is the collision radius computed as the sum of the radii of the X1 and X2 molecules (which are assumed to be spherical), m12 is the reduced mass computed as m12 = m1 m2 / (m1 + m2) (where m1, m2 are the masses in gm of the X1 and X2 type molecules), V is the cell volume, T is the temperature (in Kelvin), k_B is Boltzmann's constant = 1.381 × 10⁻²³ kg m²/s²/K/molecule, and E_{A12} is the activation energy required for reaction R1. T_{R1} denotes the mean of the reaction completion time, which is assumed to follow an exponential distribution. Note that the Gillespie simulator also considers the reaction time to be a random variable following the exponential distribution.
In [23], we have shown that the mean of the reaction time (T_{R1}) is actually equal to the time reported by the rate equation based model. Hence, denoting the rate of reaction R1 by k_{R1}, we have:
$$T_{R1} = \frac{1}{n_1 n_2 k_{R1}}$$
Hence the probability of reaction can also be computed if one does not know the activation energy for a specific reaction but the rate constant is known. As before, reactions involving multiple copies of any molecule type can be represented by a cascade of elementary reactions of the above types.
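A small Python sketch of these state transition parameters follows; the symbol names mirror the text, and the collision-theory expression for P_{R1} is our reconstruction from the definitions given above, not code from the paper:

```python
import math

KB = 1.381e-23  # Boltzmann's constant, kg m^2 / s^2 / K / molecule

def bimolecular_prob(n1, n2, r12, m12, V, T, Ea, tau):
    """P_R1: probability that one R1 reaction occurs in a small step tau
    (reconstructed collision-theory form; an assumption, see text)."""
    collision_freq = (n1 * n2 / V) * math.pi * r12**2 * \
                     math.sqrt(8 * KB * T / (math.pi * m12))
    return collision_freq * math.exp(-Ea / (KB * T)) * tau

def bimolecular_mean_time(n1, n2, k_r1):
    """T_R1 = 1 / (n1 * n2 * k_R1), when only the rate constant is known."""
    return 1.0 / (n1 * n2 * k_r1)
```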
3.2.3. Reversible reactions
The Gillespie simulator considers reversible reactions as two separate reactions. This increases the complexity of the system, as a larger number of reactions needs to be handled. Also, in our Markov chain based model, a reversible reaction will involve a double edge between any two nodes, making the MFPT computations difficult. Hence we approximately characterize reversible reactions using a simple birth-death model, as shown in Fig. 2. Let us denote the forward and backward transition probabilities between any two states S_i and S_j by a and b respectively. We need to compute the effective probability that the reaction proceeds in the forward direction, denoted by P_eff, such that the double edge can be replaced by a single edge driving the reaction in the forward direction with probability P_eff. However, the time for the forward reaction still remains the same and can be computed as above. The computation of P_eff will be different for the monomolecular and bimolecular reaction scenarios. In general, P_eff can be expressed by:
$$P_{eff} = P(S_i) \times a - P(S_j) \times b$$
where P(S_i) and P(S_j) are the probabilities of being in states S_i and S_j respectively. However, P(S_i) and P(S_j) do not simply depend on a and b, but also on the transition probabilities of the edges into and out of nodes S_i and S_j, making the P_eff estimation quite complicated. In the following, we show two approximate schemes for computing P_eff for monomolecular and bimolecular reactions.
Monomolecular reactions: Consider reversible reactions of type R1, i.e., X1 + X2 ↔ X3. In this case, the probabilities of the forward and backward reactions (a and b) can be computed as discussed before. We approximate P_eff as P_eff = a − b in such cases. Note that this approximation assumes that P(S_i) ≈ P(S_j) for all the reversible reactions in the system. While this is indeed a gross simplification of the reversible reaction kinetics, the results obtained show that it is not overly restrictive. Moreover, when a ≈ b, we assume that the reversible reaction attains equilibrium and make node S_i a sink, i.e., no further state transitions can originate from this node.
Bimolecular reactions: Consider reversible reactions of type R4 as follows:
$$R4:\ X_9 + X_{10} \leftrightarrow X_{11} + X_{12}$$
Here also we can use the above approximation P(S_i) ≈ P(S_j) and compute P_eff = a − b.
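The reduction is simple enough to state in a few lines of Python (our illustration of the rule just described):

```python
def effective_forward_prob(a: float, b: float, eps: float = 1e-9):
    """Replace the double edge (a forward, b backward) by a single
    forward edge with P_eff = a - b; when a and b are nearly equal the
    reaction is treated as being at equilibrium (node becomes a sink)."""
    if abs(a - b) < eps:
        return None  # equilibrium: make the node a sink
    return a - b
```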
3.3. Pruning the Markov Chain
As mentioned before, we will estimate the time taken to reach any node in the Markov chain by using the MFPT. Hence, we consider each node in the chain as a sink to compute its MFPT. Also, it has to be ensured that every node in the Markov chain is able to reach the sink. Otherwise, since these nodes will have an infinite mean first passage time, calculations done on the Markov chain will fail. We identify the nodes that can reach the sink by performing a depth first search from the sink over the incoming edges, and marking all nodes that are reachable. The nodes that were not marked can simply be deleted, thus ensuring that all nodes in the Markov chain can reach a node in the final state. Next, we normalize the probabilities on all the edges so that at each node the sum of the probabilities of all outgoing edges is one:
$$P_{ij}^{new} = \frac{P_{ij}}{\sum_{k} P_{ik}}$$
The probability on each edge equals the number of times that transition was made divided by the total number of transitions from that node.
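A Python sketch of this pruning and normalization step (our reconstruction; the data layout is assumed, not from the paper):

```python
def prune_and_normalize(edges, sinks):
    """edges: dict (u, v) -> transition count; sinks: set of sink nodes.
    Marks nodes that can reach a sink via DFS over incoming edges, drops
    the rest, then returns P_ij^new = count_ij / sum_k count_ik."""
    incoming = {}
    for (u, v) in edges:
        incoming.setdefault(v, []).append(u)
    reachable, stack = set(), list(sinks)
    while stack:
        v = stack.pop()
        if v not in reachable:
            reachable.add(v)
            stack.extend(incoming.get(v, []))
    kept = {(u, v): c for (u, v), c in edges.items()
            if u in reachable and v in reachable}
    totals = {}
    for (u, _), c in kept.items():
        totals[u] = totals.get(u, 0) + c
    return {(u, v): c / totals[u] for (u, v), c in kept.items()}
```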
3.4. Computing the total probability of reaching a final state
The Markov chain consists of a set of nodes and a set of transitions, or edges, between these nodes. Each edge has a probability associated with it, as well as
the time taken to traverse this edge. We define the P_sink of a node as the probability that the system, starting in the initial state, would reach the sink state before reaching the initial state again. Following [21], we will use the Markov chain to calculate the P_sink values. The P_sink can be defined conditionally based on the first transition made from the node as follows:
$$P_{sink}(node_i) = \sum_{transition(i,j)} P(transition(i,j)) \times P_{sink}(node_i \mid transition(i,j))$$
where the sum is over all possible transitions (which are mutually exclusive) from node_i. The possible transitions from node_i are simply all of the edges leading from node_i, and the probability of each of these transitions is the P_ij value defined previously. This satisfies the above condition. P_sink(node_i | transition(i, j)) is simply the P_sink of node_j, which results in the following equations:
$$P_{sink}(node_i) = \sum_{edge_{ij}} P_{ij}\, P_{sink}(node_j),$$
$$P_{sink}(node_i) = 1, \quad node_i \in sink,$$
$$P_{sink}(node_i) = 0, \quad node_i \in source.$$
Thus the probability of reaching any node in the chain can be estimated by a simple recursive procedure that traverses the chain. Note that in the worst case, the chain becomes a tree, where each node can lead to M different new nodes (M being the number of reactions considered). Hence the worst-case time complexity of traversing the chain is O(V + E) ≈ O(E), where V, E are the numbers of vertices and edges of the chain. This is because the number of edges is generally greater than the number of vertices in the chain; in the worst case we might have a tree where E = V − 1. Also, as the probability has to be computed for each node in the chain, we have an overall complexity of O(V·E).
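The recursion can be evaluated by memoized traversal, as in this Python sketch (our illustration; it assumes the pruned chain is acyclic — with cycles one would instead solve the corresponding linear system):

```python
from functools import lru_cache

def p_sink(out_edges, sinks, source):
    """out_edges: dict node -> list of (next_node, P_ij); returns the
    probability of reaching a sink before re-entering the source."""
    @lru_cache(maxsize=None)
    def p(i):
        if i in sinks:
            return 1.0
        if i == source:
            return 0.0  # returned to the initial state first
        return sum(pij * p(j) for j, pij in out_edges.get(i, ()))
    # expand the source once by hand (its boundary value is 0 on re-entry)
    return sum(pij * p(j) for j, pij in out_edges.get(source, ()))
```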
where the sum is over all possible transitions from nodez. The MFPT of nodei given that a transition to nodej was made, is the time taken to go from nodei to node, added to the MFPT from nodej:
C Pij(timei, + M F P T ( n o d e , ) )
MFPT(node,) =
edge,,
3.5. Computing the MFPT for reaching the final state

We define the mean first passage time (MFPT) of any node in the chain as the average time taken to reach that node (considered the sink) from the first node in the chain. The MFPT is defined conditionally based on the first transition made from any node:

MFPT(node_i) = Σ_{transition(i,j)} P(transition(i,j)) × MFPT(node_i | transition(i,j))        (2)

where the sum is over all edges leading from node_i. Also, we can define the initial conditions as follows:

MFPT(node_i) = ∞,  node_i ∉ sink
MFPT(node_i) = 0,  node_i ∈ sink

Note that time is a random variable, and hence cannot be added as shown in the equations above. We therefore need to compute a convolution of exponential distributions in place of a simple addition of this random variable. Equivalently, it should be understood that the MFPT is no longer fixed, but is itself a random variable. We need general expressions for the following two types of convolutions of exponential distributions:

(1) General expression for the (n+1)-fold convolution of exponential variables from an n-fold convolution, for the (time_ij + MFPT(node_j)) component of Eqn 2. If the n-fold convolution has the form

f_n = a_1^n e^{-t/T_1} + a_2^n e^{-t/T_2} + ... + a_n^n e^{-t/T_n}

then the (n+1)-fold convolution takes the form

f_{n+1} = a_1^{n+1} e^{-t/T_1} + a_2^{n+1} e^{-t/T_2} + ... + a_{n+1}^{n+1} e^{-t/T_{n+1}}

where T_1, T_2, ..., T_n denote the means of the reaction times of each edge of the n-fold convolution (convolving the times for n edges gives an n-fold convolution), and T_{n+1} = time_ij in the (time_ij + MFPT(node_j)) component of Eqn 2. While the above expression gives the general distribution of the (n+1)-fold convolution, the first and second moments can also be generically expressed as follows:

First Moment  = F^{n+1} = Σ_{i=1}^{n+1} a_i^{n+1} (T_i)^2
Second Moment = S^{n+1} = Σ_{i=1}^{n+1} 2 a_i^{n+1} (T_i)^3

After a few manipulations it can be shown that the first and second moments of this general distribution reduce to:

F^{n+1} = T_1 + T_2 + ... + T_{n+1};   S^{n+1} = S^n + 2 T_{n+1} Σ_{i=1}^{n+1} T_i;   S^1 = 2 T_1^2

(2) General expression for a convolution between an n-fold convolution (f_n) and an m-fold convolution (g_m) for the (Σ_{edge_ij}) component of Eqn 2. The resulting expression contains m + n terms in total, and the first and second moments of this general distribution can be computed in the same manner as before. Moreover, because of the simplified expression for the first moment of the MFPT, we can use the same expression as in Eqn 2 if one is only interested in the mean value of the MFPT itself. In the next section we report results based on this mean value of the MFPT distribution. However, it is also possible to compute the exact MFPT distribution of each node in the chain.

It should be noted that the above expressions for the general distribution of the MFPT and the corresponding first and second moments were derived assuming T_i ≠ T_j for all i, j. This will be true in most cases, as it is quite unlikely that the means of the reaction times are equal (the mean also depends on the concentrations of the reactant molecules, and most states in the chain will have different concentrations of the particular reactants of the specific reaction). However, in certain cases the mean reaction times might be equal, and we then need to add a small δ to make them different so that the above expressions remain valid. Consider a 2-fold convolution of exponentially distributed random variables with means T_1 and T_2. If T_1 = T_2, the general distribution takes the form (t / T_1^2) e^{-t/T_1}, and when T_1 ≠ T_2 it is of the form (e^{-t/T_1} − e^{-t/T_2}) / (T_1 − T_2). However, with δ = T_1 − T_2, we can show that the latter form converges to the former as δ → 0. Hence, the smaller the value of δ, the more precise the results obtained.
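For concreteness, here is a small Python sketch (our own illustration, not the authors' code) of the hypoexponential mixture coefficients and the reduced moments above. It assumes all means T_i are distinct, mirroring the δ-perturbation just described; all names are ours.

import numpy as np

def hypoexp_coeffs(means):
    """Coefficients a_i of the mixture density f(t) = sum_i a_i * exp(-t/T_i)
    for a sum of independent exponentials with distinct means T_i (the
    standard hypoexponential form). Ties should be perturbed by a small
    delta first, exactly as discussed in the text."""
    lam = 1.0 / np.asarray(means, dtype=float)
    a = np.empty_like(lam)
    for i, li in enumerate(lam):
        others = np.delete(lam, i)
        a[i] = li * np.prod(others / (others - li))
    return a

def reduced_moments(means):
    """First and second moments of the n-fold convolution in closed form:
    F^n = sum(T_i), S^n = sum(T_i^2) + (sum(T_i))^2."""
    T = np.asarray(means, dtype=float)
    first = T.sum()
    second = (T ** 2).sum() + first ** 2
    return first, second

The closed form in reduced_moments is equivalent to the recursion S^{n+1} = S^n + 2 T_{n+1} Σ T_i with S^1 = 2 T_1^2, which is why the mean of the MFPT can be propagated through the chain by simple addition even though the full distribution requires the convolution coefficients.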
3.6. Approximating the Markov Chain: Reducing complexity at the cost of accuracy

In most cases it is not possible to derive an analytical solution of the CME. The following approximation techniques have been proposed to reduce the complexity of the CME:

(1) Langevin approximation (LA) 16: A useful approximation to the CME is obtained by assuming that there exists a time step dt such that the following two conditions are satisfied: (i) changes in the hidden system states that occur during the time interval [t, t+dt) do not appreciably affect the propensity functions, and (ii) the expected number of occurrences of each reaction in the time interval [t, t+dt) is much larger than one. It can be shown that, under both conditions, the dynamic evolution of the hidden state process is governed by a simpler system of stochastic differential equations that can be solved by Monte Carlo estimates.

(2) Linear Noise approximation (LNA) 26, 27: Unfortunately, the LA method does not allow us to obtain an expression for the joint probability density function (PDF) of the hidden states. However, by using additional approximations, the hidden states can be characterized by a multivariate Gaussian PDF that can be solved numerically (e.g., by the standard Euler method) and is faster than the Monte Carlo method. However, both the LA and LNA methods require both conditions (shown above) to be satisfied simultaneously, which is not possible in most biological systems.

(3) Poisson approximation (PA) 28: A better approximation of the HMM is obtained by employing a time step dt that satisfies the first condition but not necessarily the second. Since reactions that occur during the time interval [kdt, (k+1)dt) will not appreciably change
the values of the propensity functions, these reactions will occur independently of each other. Moreover, the number of occurrences of the mth reaction during [kdt, (k+1)dt) is assumed to be a Poisson random variable.

(4) Mean-Field approximation (MFA) 29: The PA method does not allow us to derive an expression for the joint PMF of the hidden states. However, it is possible to approximately characterize the hidden states by a PMF through the dynamic evolution of the normal Gibbs distribution. This method is superior to the LNA method for three main reasons: it is based on the more accurate Poisson approximation, its approximation accuracy does not depend on the cellular volume, and it does not require linearization of the underlying propensity functions.
(5) Stochastic quasi-equilibrium approximation (SQEA) 30: Most often, reactions occur on vastly different time scales; e.g., the transcription and translation reactions are typically slow, whereas dimerization is fast. This means that transcription and translation may occur infrequently, whereas dimerization may occur numerous times between successive occurrences of the slow reactions. In such cases, the Gillespie algorithm spends most of its time simulating fast reaction events. It may, however, be less important to know the activity of the fast reactions in detail, since the system's dynamic evolution may be determined mostly by the activity of the slow reactions. Hence, it is possible to approximate the CME by one that involves only slow reactions.

In our Markov model formulation we do not have any hidden states, as the chain can be appropriately characterized by the numbers of the different molecule types present in the system (denoting the states of the chain), and each state transition is characterized by the corresponding reaction/docking events. Hence, most of the above techniques are not directly applicable to this formulation. However, we can employ the SQEA approach to substantially simplify the Markov chain (leaving fewer states), making the MFPT computations faster. In this case, the states of the Markov chain will have the same tuples
as before; however, the state transitions will be governed only by the slow reactions. During each state transition, the new state in the chain is computed depending on this slow reaction, and also by computing how many fast reactions can occur in that time and appropriately updating the molecule counts of the reactants in the fast reactions. In fact, this technique has a direct analogy to Gillespie's tau-leap algorithm, wherein we specify a certain time step Δt and compute how many reactions (both fast and slow) occur within that period. Thus we can compute the next state, and the Markov chain becomes a 1-dimensional chain, thereby greatly reducing the complexity. Also, the memory requirements for storing the Markov chain can be removed completely, as the MFPT can be computed online as the chain progresses in time.
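A minimal sketch of this online, tau-leap-style MFPT computation in Python (our own illustration, not the authors' code; the Poisson-count update follows Gillespie's tau-leap, and all names, signatures and the negativity guard are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def tau_leap_mfpt(x0, stoich, propensity, is_sink, dt=1e-3, max_steps=10**7):
    # x0: initial molecule counts; stoich: (n_reactions x n_species) state
    # changes; propensity(x): per-reaction rates; is_sink(x): final-state test.
    x, t = np.array(x0, dtype=int), 0.0
    for _ in range(max_steps):
        if is_sink(x):
            return t                              # online MFPT for this run
        k = rng.poisson(propensity(x) * dt)       # reaction counts in [t, t+dt)
        x = np.maximum(x + k @ stoich, 0)         # crude guard against negatives
        t += dt
    return np.inf                                 # sink not reached in budget

Because only the current state is kept, no storage of the chain is needed, which matches the memory argument above; averaging the returned times over many runs estimates the mean of the MFPT distribution.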
4. RESULTS AND ANALYSIS

4.1. Enzyme-Kinetics system

In this section, we present the results for the well-known enzyme kinetics system governed by the following three elementary reactions:

E.S → P + E,   E + S ⇌ E.S

The rate constant for the reversible reaction pair is set at 1 s⁻¹ and that for the first reaction at 0.1 s⁻¹.

Figs 3-5 show the molecular distributions of the product (P) molecules with time for different numbers of enzyme (E) and substrate (S) molecules. Note that it is possible to report the exact molecular distributions of any molecule type in the system using our approach. The time axis reports the mean value of the MFPT (which is also a random variable, as discussed earlier). Fig 8 compares the dependency of the mean number of P type molecules on time with that reported from an exact simulation of the CME (obtained from Monte Carlo simulation of the differential equations in the system). Our results compare very well with the exact simulation for low numbers of molecules in the system. With large numbers of enzyme molecules present, the reactions occur very fast and the Markov model formulation, being driven in discrete time, produces less accurate results. Nevertheless, it is computationally very fast and allows the study of more complicated systems (with large numbers of reactions and molecular types involved). Figs 6-7 plot the probability distributions of the product molecules. The different bars at each possible molecular count value of the P type molecules correspond to the probability of reaching different states (from the initial state) in the Markov model having that number of P type molecules (and different molecular count values for the other entities in the system). It is again possible to compute the
Fig. 3. Molecular distribution of P type molecules, with E=10, S=5.

Fig. 4. Molecular distribution of P type molecules, with E=10, S=100.

Fig. 5. Molecular distribution of P type molecules, with E=1000, S=100.

Fig. 6. Probability distribution of P type molecules, with E=10, S=5.

Fig. 7. Probability distribution of P type molecules, with E=10, S=100.

Fig. 8. Mean number of P type molecules: our model vs. exact simulation.

Fig. 9. Effects of the SQEA and tau-leaping approximations.

Fig. 10. Mean to standard deviation ratios of the molecular distribution of P type molecules with a constant number of enzyme molecules.
complete distribution (not just the first and second moments) of all the different molecule types in the system with our formulation. Fig 9 shows the effects of the SQEA (denoted "quasi approx") and tau-leap approximations on our Markov model. The reversible reactions are considered fast reactions in our analysis. As expected, the SQEA approach provides a very accurate approximation of the mean number of product molecules, whereas the tau-leap variation (with Δt = 10⁻³ s) provides the fastest (and most memory-efficient) solution at the cost of accuracy. Figs 10-11 plot the mean to standard deviation ratio of the molecular distribution of the product
molecules with varying numbers of substrate and enzyme molecules, respectively. With fewer substrates, the stochastic resonance is quite high in the system (as the ratio is less than 1). With higher numbers of substrates, the ratio saturates at 1.5, implying less stochasticity in the system. Also, the stochasticity does not depend much on the number of enzyme molecules in the system, as depicted in Fig 11. Thus from these plots we can infer that the stochastic resonance in the molecular distribution of the product molecules is primarily governed by the number of substrate molecules in the system.

4.2. Transcriptional Regulatory System

We next show the results for a simple transcriptional regulatory system as shown in Fig 12. Protein M, synthesized by transcription of a gene, dimerizes to the transcription factor D, which may bind to the gene's regulatory region at two binding sites, R1 and R2. The promoter coincides with R2. Binding of D at R1 activates transcription of M. However, binding of D at R2 excludes the RNA polymerase from binding at the gene's promoter, and in this case transcription is repressed. Fig 13 presents the terminology used for the different components of this example system, whereas Fig 14 shows the list of reactions involved along with their respective rate constants 29.

Fig. 12. A simple transcriptional regulatory system (transcription, dimerization and degradation events).

Fig. 13. Terminology for the Transcriptional Regulatory System.

  M        Protein (monomer)
  D        Transcription factor (dimer)
  RNA      mRNA
  DNA      DNA template free of dimers
  DNA.D    DNA template bound at R1
  DNA.2D   DNA template bound at R1 and R2
Fig. 14. Reactions Associated with the Transcriptional Regulatory System.

  Reaction                         Rate Constant
  1.  RNA → RNA + M                0.043 s⁻¹
  2.  M → ∅                        0.0007 s⁻¹
  3.  DNA.D → RNA + DNA.D          0.0715 s⁻¹
  4.  RNA → ∅                      0.0039 s⁻¹
  5.  DNA + D → DNA.D              0.02 s⁻¹
  6.  DNA.D → DNA + D              0.4791 s⁻¹
  7.  DNA.D + D → DNA.2D           0.002 s⁻¹
  8.  DNA.2D → DNA.D + D           0.8765 × 10⁻¹¹ s⁻¹
  9.  M + M → D                    0.083 s⁻¹
  10. D → M + M                    0.5 s⁻¹

In this system as well, we find very good agreement between the exact simulation results and those from our model. For the reversible reaction pairs {5, 6}, {7, 8} and {9, 10} we choose the forward reactions as 6, 7 and 10 respectively and drive the Markov chain formulation accordingly. The accuracy of our system suffers from this approximation (hence the difference from the exact simulation results). It should be noted that these results were generated for low numbers of the different molecule types in the system. As the number of molecules increases, the MFPT-based results deviate further from the exact simulation results because of the approximations. Thus, our model allows for a computationally efficient implementation of a complex biochemical system simulation, which can give accurate results when the numbers of molecules of the components in the system are small.
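For reference, here is a minimal Gillespie-SSA sketch of the Fig. 14 reaction set, of the kind used for the "exact simulation" comparison above. This is our own illustration, not the authors' code; it treats the printed rate constants directly as stochastic rate constants (the units of the bimolecular reactions are an assumption, since the table prints s⁻¹ throughout).

import numpy as np

rng = np.random.default_rng(1)

# Species order: [M, D, RNA, DNA, DNA.D, DNA.2D]; rates transcribed from Fig. 14.
rates = np.array([0.043, 0.0007, 0.0715, 0.0039, 0.02,
                  0.4791, 0.002, 0.8765e-11, 0.083, 0.5])

def propensities(x):
    M, D, RNA, DNA, DNAD, DNA2D = x
    return rates * np.array([
        RNA,              # 1: RNA -> RNA + M
        M,                # 2: M -> 0
        DNAD,             # 3: DNA.D -> RNA + DNA.D
        RNA,              # 4: RNA -> 0
        DNA * D,          # 5: DNA + D -> DNA.D
        DNAD,             # 6: DNA.D -> DNA + D
        DNAD * D,         # 7: DNA.D + D -> DNA.2D
        DNA2D,            # 8: DNA.2D -> DNA.D + D
        M * (M - 1) / 2,  # 9: M + M -> D
        D,                # 10: D -> M + M
    ])

# State change per reaction (rows) for [M, D, RNA, DNA, DNA.D, DNA.2D].
stoich = np.array([
    [ 1, 0, 0, 0, 0, 0], [-1, 0, 0, 0, 0, 0], [ 0, 0, 1, 0, 0, 0],
    [ 0, 0,-1, 0, 0, 0], [ 0,-1, 0,-1, 1, 0], [ 0, 1, 0, 1,-1, 0],
    [ 0,-1, 0, 0,-1, 1], [ 0, 1, 0, 0, 1,-1], [-2, 1, 0, 0, 0, 0],
    [ 2,-1, 0, 0, 0, 0],
])

def gillespie(x0, t_end):
    """One exact SSA trajectory of the Fig. 14 network."""
    x, t, traj = np.array(x0, dtype=int), 0.0, []
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 <= 0:
            break                                # no reaction can fire
        t += rng.exponential(1.0 / a0)           # time to next reaction
        j = rng.choice(len(a), p=a / a0)         # which reaction fires
        x = x + stoich[j]
        traj.append((t, x.copy()))
    return traj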
Fig. 15. Mean number of monomers: Exact Simulation vs. Our Model.

Fig. 16. Mean number of dimers: Exact Simulation vs. Our Model.

Fig. 17. Mean number of mRNA transcripts: Exact Simulation vs. Our Model.
5. DISCUSSION

Here we make some comments regarding both the differential equation based and our discrete random process based approaches to biological system modeling. The former is usually used to model the variations in the concentrations of biomolecules, whereas the latter models the variations in the numbers of biomolecules. As for any research problem for which there are a variety of feasible solutions, each of these approaches has its own pros and cons. For example, when the number of biomolecules is extremely large, it may not even be practical to use our discrete random process based model, for the following reasons: (1) the number of possible candidate states of a molecular entity, X(t) ∈ {0, 1, ..., maximum number of molecules}, can be too large to handle; (2) if a discretization strategy is used, the accuracy of the model could be compromised.

No matter which model is used, some of the parameters (e.g., kinetic parameters for the differential equation based models) need to be estimated. The parametric models we have introduced for biochemical reactions and docking can estimate these parameters theoretically and can be used once we have sufficient fidelity in these models. However, the Markov model based approach presented in this paper will work in both cases, i.e., with kinetic parameters estimated through controlled experiments or with the parametric models.

6. CONCLUSION AND FUTURE DIRECTIONS

We have introduced a Markov Chain based analysis technique as an alternative for complex biological process modeling. The main idea of this modeling is to transform the biological processes from a continuous deterministic process to a discrete random process. Because of its simplicity in comparison to numerically solving a large number of differential equations, our framework reduces the computational overhead and increases scalability considerably. We are currently working on a complex pathway model with many molecular types and large numbers of molecules of each type to estimate the computational complexity. The main benefit of this analysis is the ability to analyze the stochasticity of many reactions occurring together; current experimental methods are not able to capture this measurement at a molecular level without a special set-up. The challenge in the model proposed here is the optimization of the memory use and computational speed of the DFS and MFPT algorithms. Note that each node in the Markov chain has an out-degree of M, where M is the number of reactions/docking events considered in the system. Storing an arbitrary graph with a large number of nodes and high out-degree will run into memory problems. It is also important to find appropriate simplifications and data structures to speed up the process. Can the chain be converted into a tree structure by eliminating (adding) pseudo nodes (edges)? This would allow us to traverse the chain (during DFS
or MFPT computations) in O(log_M V) time. We have already stated that the tau-leap approximation on the chain reduces it to a 1-dimensional chain, on which the MFPT computations can be performed online. Also, can the tree structure be converted into a trie in which the chain is compressed optimally, thereby reducing the memory overheads? A complete cell model may not be feasible with this analysis due to the large number of molecules in the cell, but we expect that many complex biological systems can be modeled by this technique.

References
1. Making Sense of Complexity: Summary of the Workshop on Dynamical Modeling of Complex Biomedical Systems, (2002).
2. Endy, D., and Brent, R. Modeling cellular behavior. Nature, vol. 409, Jan 2001.
3. Loew, L. The Virtual Cell Project. 'In Silico' Simulation of Biological Processes (Novartis Foundation Symposium No. 247), Wiley, 207-221, 2002.
4. Tomita, M., et al. The E-CELL Project: Towards Integrative Simulation of Cellular Processes. New Generation Computing, (2000) 18(1): 1-12.
5. Gillespie, D. Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry, 81: 2340-2361.
6. Gillespie, D. Approximate accelerated stochastic simulation of chemically reacting systems. Journal of Chemical Physics, 115(4): 1716-1733.
7. Rathinam, M., Petzold, L., Gillespie, D. Stiffness in Stochastic Chemically Reacting Systems: The Implicit Tau-Leaping Method. Journal of Chemical Physics, 119(24), 12784-12794, 2003.
8. Cell Illustrator, www.fqspl.com.pl/life_science/cellillustrator/ci.htm
9. BioSpice: open-source biology, http://biospice.lbl.gov/home.html
10. CellDesigner: A modeling tool of biochemical networks, http://celldesigner.org/
11. McAdams, H., and Arkin, A. It is a noisy business! Genetic regulation at the nanomolar scale. Trends in Genetics, vol. 15, pp. 65-69, 1999.
12. Hasty, J., and Collins, J. Translating the Noise. Nature Genetics, 2002, 31, 13-14.
13. Meier-Schellersheim, M., and Mack, G. SIMMUNE, a tool for simulating and analyzing immune system behavior. CoRR cs.MA/9903017, (1999).
14. van Kampen, N. Stochastic Processes in Physics and Chemistry. Amsterdam: Elsevier, 1992.
15. Gillespie, D. A Rigorous Derivation of the Chemical Master Equation. Physica A, vol. 188, pp. 404-425, 1992.
16. Gillespie, D. The Chemical Langevin Equation. J. Chemical Physics, vol. 113, no. 1, pp. 297-306, 2000.
17. Haseltine, E., and Rawlings, J. Approximate Simulation of Coupled Fast and Slow Reactions for Stochastic Chemical Kinetics. J. Chemical Physics, 117(15), pp. 6959-6969, 2002.
18. Karlin, S., and Taylor, H. A First Course in Stochastic Processes. Second ed. San Diego, Calif.: Academic Press, 1975.
19. Karlin, S., and Taylor, H. A Second Course in Stochastic Processes. San Diego, Calif.: Academic Press, 1981.
20. Papoulis, A., and Pillai, S. Probability, Random Variables and Stochastic Processes. Fourth ed. New York: McGraw-Hill, 2002.
21. Singhal, N., et al. Error analysis and efficient sampling in Markovian state models for molecular dynamics. Jour. of Chem. Physics, 2005.
22. Ghosh, S., Ghosh, P., Basu, K., Das, S., and Daefler, S. iSimBioSys: A Discrete Event Simulation Platform for 'in silico' Study of Biological Systems. Proc. of 39th IEEE Annual Simulation Symposium, 2006, USA.
23. Ghosh, P., Ghosh, S., Basu, K., Das, S., and Daefler, S. An Analytical Model to Estimate the Time Taken for Cytoplasmic Reactions for Stochastic Simulation of Complex Biological Systems. Proc. of the 2nd IEEE Granular Computing Conference, 2006, USA.
24. Ghosh, P., Ghosh, S., Basu, K., Das, S., and Daefler, S. A stochastic model to estimate the time taken for Protein-Ligand Docking. 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), Sep. 2006, Canada.
25. Ghosh, P., Ghosh, S., Basu, K., Das, S., and Daefler, S. Stochastic Modeling of Cytoplasmic Reactions in Complex Biological Systems. 6th IEE International Conference on Computational Science and its Applications (ICCSA), May 8-11, 2006, Glasgow, UK.
26. Rao, C., Wolf, D., and Arkin, A. Control, Exploitation and Tolerance of Intracellular Noise. Nature, 420, pp. 231-237, 2002.
27. Raser, J., and O'Shea, E. Control of Stochasticity in Eukaryotic Gene Expression. Science, 304, pp. 1811-1814, 2004.
28. Cao, Y., Gillespie, D., and Petzold, L. Avoiding Negative Populations in Explicit Poisson Tau-Leaping. J. Chemical Physics, vol. 123, 054104, 2005.
29. Goutsias, J. A Hidden Markov Model for Transcriptional Regulation in Single Cells. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1), 2006.
30. Goutsias, J. Quasiequilibrium approximation of fast reaction kinetics in stochastic biochemical systems. J. Chemical Physics, vol. 122, 184102, 2005.
31. Regev, A., Silverman, W., and Shapiro, E. Representation and simulation of biochemical processes using the π-calculus process algebra. Proc. of the Pacific Symposium on Biocomputing (PSB 2001), 6: 459-470.
32. Regev, A., Silverman, W., and Shapiro, E. Representing biomolecular processes with computer process algebra: π-calculus programs of signal transduction pathways. Proc. of the Pacific Symposium on Biocomputing 2000, World Scientific Press, Singapore.
AN INFORMATION THEORETIC METHOD FOR RECONSTRUCTING LOCAL REGULATORY NETWORK MODULES FROM POLYMORPHIC SAMPLES
Manjunatha Jagalur, David Kulp*
Computational Biology Lab, University of Massachusetts Amherst, Amherst, MA 01002, USA
*Email: {manju,dkulp}@cs.umass.edu
*Corresponding author.
Statistical relations between genome-wide mRNA transcript levels have been successfully used to infer regulatory relations among genes; however, the most successful methods have relied on additional data and focused on small sub-networks of genes. Along these lines, we recently demonstrated a model for simultaneously incorporating micro-array expression data with whole-genome genotype marker data to identify causal pairwise relationships among genes. In this paper we extend this methodology to the principled construction of networks describing local regulatory modules. Our method is a two-step process: starting with a seed gene of interest, a Markov blanket over genotype and gene expression observations is inferred according to differential entropy estimation; a Bayes net is then constructed from the resulting variables with important biological constraints yielding causally correct relationships. We tested our method by simulating a regulatory network within the background of a real data set. We found that 45% of the genes in a regulatory module can be identified, and the relations among the genes can be recovered with moderately high accuracy (> 70%). Since sample size is a practical and economic limitation, we considered the impact of increasing the number of samples and found that recovery of true gene-gene relationships only doubled with ten times the number of samples, suggesting that useful networks can be achieved with current experimental designs, but that significant improvements are not expected without major increases in the number of samples. When we applied this method to an actual data set of 111 back-crossed mice we were able to recover local gene regulatory networks supported by the biological literature.
1. INTRODUCTION

Understanding the function of every gene and its role in the expression of a particular complex trait is one of the fundamental aims of genomics. The availability of genome-wide data has made it possible to tackle this problem from a systems biology perspective. Global putative gene regulatory networks have been constructed using mRNA abundance data collected through micro-array experiments. In some cases supplemental information like ChIP-chip binding data 13 and single or multiple gene perturbation data have been used to construct more robust networks. Recently there has been growing interest in a quantitative genetics strategy wherein, along with gene expression data, genetic marker data is used for constructing such networks 2, 33. In this strategy, crosses are made from inbred strains that differ in physical and genetic attributes. The resulting individuals can be considered the result of thousands of gene perturbations. Whole-genome markers are genotyped and the abundance of transcripts is measured for each individual. For example, Brem and colleagues used a cross of a wild strain of yeast and baker's yeast to create one of the first such data
sets 17. Schadt and colleagues have collected such data for mouse and maize 1. Figure 1(a) describes the data. Gene expression (T_i) represents transcript abundance. Discrete genotype values (M_k) for bi-allelic markers are measured at relatively uniform positions across the genome. For an F2 diploid cross, if the parent genotypes are AA and BB, then markers may take values AA, AB and BB. We assume alleles have an additive effect and represent genotypes as integers (0, 1, 2). The genotype of a gene (Q_j) is not directly measured, but can be estimated by maximum likelihood using the flanking marker genotypes and the genetic linkage distances to those markers (D_L and D_R in the figure) 35. Our aim is to find genetic and genomic factors (i.e. some subset of (T ∪ Q)) that affect a particular complex trait and to infer the relationships among these factors. We generally refer to (T ∪ Q) as an expression genetics data set. There have been several efforts to infer regulatory relationships among genes using expression genetics data sets. A key quantitative genetics concept in all these approaches is the quantitative trait locus (QTL), which refers to a region along a chromosome
Fig. 1. Expression genetics data description. (a) Gene G_j is located between markers M_k and M_{k+1}. Its genotype is Q_j and the amount of mRNA translated is T_j. Note that Q_j is unobserved and must be estimated. (b) The QTG model of a single regulator-target pair of genes (the regulator is gene j and the target is gene i). M_l and M_{l+1} are the flanking markers of gene i.
where the markers are significantly correlated with a measured trait. QTLs are determined using interval mapping 34 or other related methods 35. In our case, the gene expression level is the trait of interest, and QTLs of this kind are called eQTLs. Finding pairwise regulatory relations between a regulator gene and a target gene is a simpler problem than constructing more elaborate networks. Bing and Hoeschele chose a regulator gene that is maximally correlated to the target among those found in the target's eQTL 24. In our previous work, we generalized this idea by mapping eQTLs using a modified interval mapping model that simultaneously fit the joint contributions of both the genotype and the expression level of each candidate regulator along a chromosome. We called this model QTG 2. Its important new feature was the ability to capture the varying nature of regulation with respect to a regulator's genotype. Through this approach we could discover regulators that act as enhancers or repressors depending on their genotypes. (This is described in more detail in section 1.3.) Network structure prediction has also been attempted by 25 and 26. In these works the network is represented by a Bayesian network (BN) where the nodes are observed gene expression levels and the edges represent conditional dependencies, which are assumed to correspond to causal relationships. In both of these works, the key idea is to place strong priors over possible network structures according to the eQTLs associated with each gene. Li et al 25 selected candidate regulators with non-synonymous polymorphisms that are positioned within eQTLs. Later an exhaustive search over BN structures was performed to reconstruct a global regulatory network. Zhu et al 26 used a set of heuristics based on
the characteristics of eQTLs to determine probable edge direction and connectivity. In this paper, we present an improved BN reconstruction algorithm with the following major contributions:
- Regulatory modules, instead of global regulatory networks, are inferred, which mitigates some of the difficulties of BN structure inference when the sample size is small relative to the number of variables;
- Genotype values and expression levels are modeled together in a single BN, which provides simultaneous integration of data types and the identification of different kinds of regulatory control;
- Multiple genes and genetic effects are considered together, rather than a single gene or a single QTL;
- Gene "self effects" are included, which incorporates the often significant effect of cis-acting polymorphisms; and
- the interacting effect between genotype and expression level is modeled (the QTG model), which allows for complex regulatory behavior.
The rest of the paper is organized as follows. Subsections 1.1-1.2 present important concepts regarding Markov blankets and Bayesian networks. More details of the QTG model are described in 1.3. We present our new regulatory network inference method in section 2, describe our experiments in section 3.1 and end with a discussion and conclusion in section 4.
1.1. Markov Blanket

The Markov blanket of a variable X_s ∈ X is defined as the minimal set of variables MB ⊆ X − {X_s} that provides the maximum possible information about X_s. Knowing the values of variables outside of MB does not provide additional information. Formally,

∀ X_i ∈ X − MB − {X_s}:  P(X_s | MB, X_i) = P(X_s | MB)
In a Bayesian network, the Markov blanket is the union of parent, child and spouse (i.e. parents of children) nodes. In a gene regulatory network, the Markov blanket of a gene contains its regulators, targets and co-regulators. Thus, a Markov blanket of a gene of interest corresponds to the biological concept of a local gene regulatory module (figure 2). Recovering the Markov blanket from raw data is well studied in the context of feature selection 29, 32. Here we describe one particularly attractive approach 3.
Fig. 2. Example of a Markov blanket. Nodes marked gray belong to the Markov blanket of the node marked in green.

1.1.1. Incremental Association Markov Blanket

Incremental association Markov blanket (IAMB) is an information theoretic approach to infer a Markov blanket (MB) from data 3. This is a two-step algorithm. In the first step, nodes are added to an interim MB* based on a greedy search for variables that are not conditionally independent. Since it is a greedy algorithm, some nodes that should not be in the final MB might be present in MB*. These nodes are removed in the second step through an exhaustive search of all subsets of MB*. When the data set is faithful to the true distribution and the measure of conditional independence is accurate, this algorithm is guaranteed to give correct results. Usually conditional mutual information is used for measuring conditional independence, as in 3, 32. In practice the conditional independence test is deemed reliable only when the number of samples is at least five times the number of degrees of freedom. For discrete data this imposes a requirement of an exponential number of samples with respect to the number of variables in the conditioning set. However, when the data is continuous and Gaussian distributed, as assumed here, the number of required samples is only quadratic in the number of variables in the conditioning set.

Algorithm 1.1 IAMB algorithm
INPUT: Data: X = {X_1, X_2, ..., X_n}, Target: s, Threshold: θ
OUTPUT: Markov blanket: MB
 1: MB = ∅
 2: repeat
 3:   i = argmax_{X_i ∈ X − MB − {X_s}} MI(X_i; X_s | MB)
 4:   if MI(X_i; X_s | MB) > θ then
 5:     MB = MB ∪ {X_i}
 6:   end if
 7: until MB does not change
 8: repeat
 9:   i = argmin_{X_i ∈ MB} MI(X_i; X_s | MB − {X_i})
10:   if MI(X_i; X_s | MB − {X_i}) < θ then
11:     MB = MB − {X_i}
12:   end if
13: until MB does not change
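A compact Python rendering of Algorithm 1.1 may help; this is our own sketch of the two-phase IAMB loop, parameterized by any conditional mutual-information estimator mi (for example the Gaussian one developed in section 1.1.1 below). All names and the exact control flow are ours.

def iamb(variables, target, mi, theta):
    """Sketch of Algorithm 1.1 (IAMB). `variables` is a list of candidate
    variable ids, `target` the id of X_s, `mi(i, j, cond)` a conditional
    mutual-information estimator, and `theta` the inclusion threshold."""
    mb = set()
    # Forward (growing) phase: greedily add the most informative variable.
    changed = True
    while changed:
        changed = False
        cand = [v for v in variables if v != target and v not in mb]
        if not cand:
            break
        best = max(cand, key=lambda v: mi(v, target, sorted(mb)))
        if mi(best, target, sorted(mb)) > theta:
            mb.add(best)
            changed = True
    # Backward (shrinking) phase: remove false positives.
    changed = True
    while changed:
        changed = False
        for v in sorted(mb):
            if mi(v, target, sorted(mb - {v})) < theta:
                mb.remove(v)
                changed = True
                break
    return mb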
Conditional independence for continuous data can be computed using the differential entropies of the involved variables. Differential entropy is a relative measure that quantifies the amount of surprise (or information) in a continuous variable. It is equal to the expected log of the probability density:

h(X) = −∫ f(x) log f(x) dx

where f is the probability density function of x. For a multivariate Gaussian variable X = {X_1, X_2, ..., X_n} the differential entropy h(X) is equal to

h(X) = (1/2) log((2πe)^n |C|)

where C is the covariance matrix of X. Conditional relative entropy is defined as the amount of surprise in one variable when the conditioning variable is known:

h(X | Y) = h(X, Y) − h(Y)

Mutual information quantifies the amount of information that one random variable (X) contains about another variable (Y). It is equal to the difference between the amount of information in one of the variables (the entropy, h(X)) and the amount of information in it that is unexplained by the other variable (the conditional entropy, h(X | Y)). Conditioned on Z it is equal to:

MI(X; Y | Z) = h(X | Z) − h(X | Y, Z)
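Under the Gaussian assumption, these identities reduce conditional mutual information to covariance determinants. A minimal Python sketch (ours, not the authors' code; it assumes the columns of data are jointly Gaussian samples):

import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate Gaussian with covariance `cov`:
    h(X) = 0.5 * log((2*pi*e)^n * |C|)."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.log(((2 * np.pi * np.e) ** n) * np.linalg.det(cov))

def conditional_mi(data, x, y, z):
    """MI(X; Y | Z) = h(X,Z) - h(Z) + h(Y,Z) - h(X,Y,Z), estimated from the
    sample covariance of `data` (rows = samples, columns = variables).
    `x`, `y`, `z` are lists of column indices; `z` may be empty."""
    C = np.cov(data, rowvar=False)
    h = lambda idx: gaussian_entropy(C[np.ix_(idx, idx)]) if idx else 0.0
    return h(x + z) - h(z) + h(y + z) - h(x + y + z)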
1.2. Bayesian Networks

A Bayesian network (BN) is a minimal graphical representation of a joint probability distribution over a set of random variables 22. Each variable in a BN corresponds to a node and each dependency corresponds to an edge. Nodes are connected by directed edges and the resulting graph is a directed acyclic graph. The distribution of a variable depends conditionally only on its parents. Like Markov blanket selection, constructing Bayesian networks is a well-studied problem 22, 28. For a given network structure, the conditional probability distribution function of each variable can be calculated using maximum likelihood estimates. Using these functions, the posterior probability of the data can be calculated and a network can be scored. Let X = {X_1, X_2, ..., X_n} be the set of variables in the network. The posterior likelihood of an observation x is given by:

P(x) = ∏_{i=1}^{N} P(x_i | Pa(x_i), θ)

where Pa(x_i) is the set of parent nodes corresponding to node X_i and θ is the hyper-parameter set determining the conditional probability distribution. For a data set X = {x^1, x^2, ..., x^M} the posterior likelihood is given by:

P(X) = ∏_{m=1}^{M} ∏_{i=1}^{N} P(x_i^m | Pa(x_i^m), θ)

Log likelihood is used as the scoring function:

LL(X, θ) = ∑_{m=1}^{M} ∑_{i=1}^{N} log P(x_i^m | Pa(x_i^m), θ)

Since the hyper-parameter θ is estimated using a finite number of samples, it is always possible to increase the log likelihood of a graph by increasing its connectivity. This over-fitting phenomenon can be avoided by using a scoring scheme that takes connectivity into consideration. The Bayesian information criterion (BIC, also known as the Schwarz information criterion) is one such scheme:

Score_BIC(X, θ) = 2 LL(X, θ) − k log(M)
where k is the number of free parameters in θ. For linear Gaussian models k is equal to the total number of edges in the network. Given that the space of possible network structures is super-exponential in the number of nodes, an exhaustive search through all possible graphs is usually not feasible. Reasonable heuristics like node ordering 30 can be used when the number of samples is high and the number of variables is low, but those algorithms are infeasible when the number of dimensions is high and inaccurate when the number of samples is low. Another class of algorithms uses information theory to construct these networks. A polynomial time algorithm exists 28 when an oracle, which determines whether two variables are dependent conditioned on a set of variables, is available and the data is DAG-faithful. Such an oracle can be constructed by calculating conditional mutual information for the set of variables. But the calculation of mutual information can be problematic when the number of samples is low, just as with the Markov blanket algorithms mentioned above, and when the number of variables is high. In our method we overcome this limitation by restricting ourselves to building local networks around our gene of interest. As the number of genes in the regulatory neighborhood of a gene is usually low, we can keep our network searching problem tractable.
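Putting the scoring pieces together, here is a hedged Python sketch of the BIC score for a linear Gaussian network (our own illustration; the dictionary containers and the regression-based likelihood are implementation choices, not the paper's code).

import numpy as np

def bic_score(data, structure):
    """BIC network score for a linear Gaussian BN:
    Score_BIC = 2*LL - k*log(M), with k = number of edges.
    `data` maps variable name -> length-M sample vector;
    `structure` maps variable name -> list of parent names."""
    M = len(next(iter(data.values())))
    ll, k = 0.0, 0
    for var, parents in structure.items():
        y = np.asarray(data[var], dtype=float)
        X = np.column_stack([np.ones(M)] + [data[p] for p in parents])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = max(resid @ resid / M, 1e-12)     # Gaussian MLE variance
        ll += -0.5 * M * (np.log(2 * np.pi * sigma2) + 1.0)
        k += len(parents)                          # one free parameter per edge
    return 2.0 * ll - k * np.log(M)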
1.3. QTG Model

The conventional model for mapping linkage of loci to phenotypes is a linear model of the form
P(T_i | Q_j) = N(β_0 + β_1 Q_j, σ)

where T_i is the phenotype of interest (expression of a target gene) and Q_j are inferred genotypes of genes G_j along a chromosome. In 2, we suggested an alternative model that explicitly incorporates the genotype and expression level at gene G_j as well as the potential interacting effect of genotype and expression level, yielding

P(T_i | Q_j, T_j, θ) = N(β_0 + β_1 T_j + β_2 Q_j + β_3 T_j Q_j, σ)        (1)

where θ is the set of β and σ model parameters (Figure 1b). A scanning method, like conventional QTL mapping, can be used in which pairwise relationships are found by computing the log posterior odds for all G_j in the genome. Equation 1 has the advantage of capturing the types of dependency relationships shown in figure 3. However, the scanning method does not incorporate multi-locus regulatory control.

Fig. 3. Trans-acting effect as a function of regulator genotype allows for complex enhancer/suppressor relationships. Note that expression and genotype may be marginally independent of the target, but the regulatory relation can still be identified.
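To make Eq. (1) concrete, here is a small least-squares sketch of a QTG-style scan score (our own illustration; the paper computes log posterior odds, which we approximate here by a maximum-likelihood ratio under the Gaussian assumption).

import numpy as np

def qtg_scan_score(Ti, Tj, Qj):
    """Log-likelihood ratio of the QTG model (Eq. 1) against the
    intercept-only null, for one candidate regulator j and target i."""
    Ti, Tj, Qj = map(np.asarray, (Ti, Tj, Qj))
    n = len(Ti)
    X = np.column_stack([np.ones(n), Tj, Qj, Tj * Qj])   # Eq. 1 design matrix
    beta, *_ = np.linalg.lstsq(X, Ti, rcond=None)
    rss1 = np.sum((Ti - X @ beta) ** 2)                  # QTG model fit
    rss0 = np.sum((Ti - Ti.mean()) ** 2)                 # null model fit
    return 0.5 * n * np.log(rss0 / max(rss1, 1e-12))

The interaction column Tj * Qj is what lets the fitted trans-acting effect change with regulator genotype, i.e., the enhancer/suppressor behavior illustrated in figure 3.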
2. METHODS

Now we present an algorithm that finds the loci that are in the regulatory neighborhood of a gene of interest and reconstructs the corresponding partial network. The main advantage of this new method over our previous scanning method 2 is that we construct networks involving multiple genes to specifically model the joint distribution, whereas the previous approach could only identify putative pairwise relationships akin to a relevance network 37.

2.1. Mixed Type Bayesian Network Under Biological Constraints

We model a gene regulatory network as a highly constrained Bayesian network subject to the biological conditions graphically described in Figure 4. A "gene" is modeled as a meta-node, such that a node (G_a) consists of expression (T_a), genotype (Q_a) and interaction (T_a Q_a) variables (Figure 4a). Edges denote regulation between genes and are drawn from the regulator meta-node to a target meta-node. The kind of regulatory control between two genes depends on which terms in the meta-nodes are used (Figure 4b). Since genotypes represent independently random recombination events, edges are always directed away from genotype variables.
Fig. 4. (a) Elements of gene A. T_a is the expression, Q_a is the genotype and T_a Q_a is the interaction variable. (b) All edge types. Colors are used to visually code predicted networks (such as in figure 7). (c) Example of a gene-gene relationship with two edge types involved.
2.2. Markov Blanket Inference
Algorithm 2.1 for inferring a Markov blanket is very similar to the IAMB algorithm, with several domain-specific differences. The candidate variable set C consists of all gene expression values (T_i, 1 ≤ i ≤ n, where n is the number of genes), all marker genotypes (M_j, 1 ≤ j ≤ k, where k is the number of polymorphic markers) and approximate interacting terms estimated from the product of expression and flanking marker genotypes (where we write TQ_i^L to mean T_i M_{L(i)}, TQ_i^R to mean T_i M_{R(i)}, and M_{L(i)} and M_{R(i)} are the flanking left and right markers of gene G_i). In the forward step, based on conditional independence, variables from C are incrementally added to the Markov blanket MB, and in the backward step false positives are removed. A continuous form of conditional mutual information (as explained in section 1.1.1) is used as the measure of conditional independence. Variables are assumed to follow a multivariate Gaussian distribution. If we make the reasonable biological assumption that any gene has no more than about ten genes in its local regulatory network 38, then we require only about 100 samples to accurately calculate conditional mutual information.
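Building this candidate set is mechanical; a small Python sketch follows (names and array layout are ours, for illustration only, feeding an IAMB-style search such as Algorithm 2.1 below).

import numpy as np

def candidate_set(T, M, left, right):
    """Candidate variables of section 2.2: expression levels T
    (n_genes x n_samples), marker genotypes M (n_markers x n_samples),
    and interaction terms TQ_i^L = T_i * M_{L(i)}, TQ_i^R = T_i * M_{R(i)},
    where left[i]/right[i] index each gene's flanking markers.
    Returns a dict of variable name -> sample vector."""
    C = {f"T{i}": T[i] for i in range(T.shape[0])}
    C.update({f"M{j}": M[j] for j in range(M.shape[0])})
    for i in range(T.shape[0]):
        C[f"TQ{i}L"] = T[i] * M[left[i]]
        C[f"TQ{i}R"] = T[i] * M[right[i]]
    return C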
Algorithm 2.1 Inferring the Markov blanket of a gene. MI calculates the conditional mutual information as described in section 1.1.1. Functions max and min return the maximum/minimum element of the array and its index.
INPUT: Expression levels: T = {T_1, T_2, ..., T_n}, Marker genotypes: M = {M_1, M_2, ..., M_k}, Interaction terms: I = {TQ_1^L, TQ_1^R, ..., TQ_n^L, TQ_n^R}, Seed gene s, Threshold α
OUTPUT: Markov blanket MB ⊆ T ∪ M ∪ I
 1: MB = ∅
 2: C = (T ∪ M ∪ I) − {T_s, TQ_s^L, TQ_s^R}
 3: repeat
 4:   for C_i ∈ C do
 5:     score_i = MI(C_i; T_s | MB)
 6:   end for
 7:   [maxMI, maxi] = max(score)
 8:   if maxMI ≥ α then
 9:     MB = MB ∪ {C_maxi}
10:   end if
11: until maxMI < α
12: repeat
13:   for C_i ∈ MB do
14:     score_i = MI(C_i; T_s | MB − {C_i})
15:   end for
16:   [minMI, mini] = min(score)
17:   if minMI < α then
18:     MB = MB − {C_mini}
19:   end if
20: until minMI ≥ α
21: return MB
2.3. Gene regulatory network reconstruction

We use an incremental algorithm similar to 31 for constructing the local network for a seed gene s (Algorithm 2.2), given its Markov blanket MB_s. The novelty of our method is that we must simultaneously estimate the unobserved genotype values Q_i while constructing the graph edges. We begin with an MB_s that contains zero or more expression and genotype terms (e.g., T_i, TQ_i^L, etc.) for each gene G_i. We define the regulatory neighborhood of seed gene s as RN_s = MB_s ∪ {T_s}. For all genes with a flanking marker in MB_s we introduce the true but unobserved genotype Q_i and estimate its maximum likelihood value according to the distances to the flanking markers. Similarly we replace any TQ_i^L and TQ_i^R terms with TQ_i. Next, the variables in RN_s are consolidated into gene meta-nodes, such that all variables associated with gene G_j are grouped. Then, beginning with an empty graph, edges are added, removed, or reversed between variables in separate meta-nodes based on an increase in the network score. Unlike a conventional Bayes net construction, we explicitly consider combined genotype and expression effects, including interacting effects. These different kinds of regulatory effects are represented as different types of edges (figure 4b). The score is computed as the log of the joint probability with a Bayesian Information Criterion (BIC) penalty term to control for the complexity of the network. Finally, the Q_i terms are re-estimated based on the new graph structure (connected genes and flanking markers). With the new values of Q_i, a new graph structure is generated. This EM-like iterative process is repeated until convergence, which happens quickly in practice.
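For the flanking-marker genotype estimation used above (the EstGenotype step of Algorithm 2.2 below), here is a rough Python stand-in. It is our simplification of the interval-mapping maximum-likelihood estimate the paper cites (ref. 35), not the authors' exact procedure; distances are in Morgans and genotypes are coded 0/1/2.

import numpy as np

def est_genotype(m_left, m_right, d_left, d_right):
    """Interpolate a locus genotype from flanking marker genotypes,
    weighting each marker by the Haldane recombination fraction to the
    other side, so the closer marker dominates."""
    r_l = 0.5 * (1.0 - np.exp(-2.0 * d_left))   # recombination prob. to left marker
    r_r = 0.5 * (1.0 - np.exp(-2.0 * d_right))  # recombination prob. to right marker
    if r_l + r_r == 0.0:
        return 0.5 * (m_left + m_right)
    w_l = r_r / (r_l + r_r)                     # closer marker gets more weight
    return w_l * m_left + (1.0 - w_l) * m_right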
Algorithm 2.2 Algorithm for constructing a local regulatory network. EstGenotype estimates the genotype of a locus using the genotypes of the flanking markers and the distances to those markers. Score calculates the optimal score of a network using an EM strategy: in the expectation step all the Qs and TQs are estimated using the current value of the hyper-parameter set (Σ) and their priors; in the maximization step Σ is re-calculated using the re-estimated values of the Qs and TQs. AddScore is the score of the new network when an edge is added, reversed or removed; this function also checks the DAG consistency of the network and returns −∞ if it is violated. from can be any node, to must contain an expression term, and kind can be any edge type shown in figure 4 or "no edge" (used when an edge needs to be deleted).
INPUT: Markov blanket MB_s, Expression profiles: T = {T_1, T_2, ..., T_n}, Marker genotypes: M = {M_1, M_2, ..., M_k}, Interaction terms: I = {TQ_1^L, TQ_1^R, ..., TQ_n^L, TQ_n^R}, Seed gene s, Threshold β
OUTPUT: Local network BN_s
 1: RN_s = MB_s ∪ T_s
 2: for each gene i do
 3:   Q_i = EstGenotype(M_L(i), M_R(i), Location(i))
 4: end for
 5: for each gene i do
 6:   G_i = {T_i, TQ_i, Q_i}
 7: end for
 8: CG = {G_i | T_i ∈ RN_s ∨ T_iQ_i ∈ RN_s ∨ Q_i ∈ RN_s}
 9: BN_s = ∅
10: curMaxScore = Score(BN_s, CG)
11: while forever do
12:   {from, to, kind} = argmax_{fr,to,kd} AddScore(BN_s, {fr, to, kd}, CG)
13:   if AddScore(BN_s, {from, to, kind}, CG) − curMaxScore > β then
14:     if ∃ kind' s.t. {from, to, kind'} ∈ BN_s then
15:       BN_s = BN_s − {{from, to, kind'}}
16:     end if
17:     if ∃ kind' s.t. {to, from, kind'} ∈ BN_s then
18:       BN_s = BN_s − {{to, from, kind'}}
19:     end if
20:     BN_s = BN_s ∪ {{from, to, kind}}
21:   else
22:     return BN_s
23:   end if
24: end while

Purely genetic hyper-nodes are an interesting special case. In some cases a marker variable M_i might not have a gene in MB_s with which it can be grouped. In those cases a dummy gene hyper-node is created for this marker. These dummy genes are assigned a range of locations (determined using the locations of the markers M_{i−1} and M_{i+1} that flank M_i) instead of one exact location as with regular gene hyper-nodes. During the network optimization the exact location of this dummy gene is re-calibrated to maximize the score. This strategy allows us to detect genetic elements that are not associated with any of the known genes. Such effects include, for example, cis-acting QTLs and non-coding genes.

3. EXPERIMENTS AND RESULTS

Simulations were performed to test the fidelity of the model, to set appropriate threshold parameters, and to calculate the sample size needed to achieve good accuracy and recovery.

3.1. Simulations
Fig. 5 . Simulation Strategy. Black nodes were selected from the existing data and the red nodes were simulated using a linear Gaussian model.
Synthetic data was generated to test the viability of this approach. To keep the simulation as realistic as possible and to preserve the distribution of the real data, only a small set of simulated data was added to the existing data. Networks of various sizes were simulated. Importantly, parent and spouse genes were
not simulated, but selected from existing genes. Target genes and their children were simulated using a linear model with Gaussian noise. An example of such a simulated network is shown in Figure 5. The coefficients of this linear model were selected from a Gaussian distribution. To test the data requirement for sample sizes greater than the available 111 samples, we simulated additional expression values as Gaussian and genotypes from linkage probabilities. Results of these simulations are presented in Figure 6 for a 5-node network. (For network sizes greater than 5, accuracy did not decrease substantially and the number of recovered genes remained almost the same; data not shown.) Figure 6a describes the performance of the Markov blanket recovery. Each line in the figure corresponds to a sample size. The results suggest that this algorithm can recover parts of the network with high accuracy at useful recovery rates. For example, greater than 45% of genes in the true Markov blankets were recovered at an accuracy of about 75%. Reducing the threshold did not result in increased recovery but caused accuracy to drop substantially. When we increased the sample size to 1000 (ten times the currently available data) there was a marked improvement in recovery (> 75%) and accuracy (> 85%). Figure 6b describes the performance of network inference, i.e. edge prediction, over the Markov blanket variables. Considering only gene meta-node connectivity, the algorithm exceeded 90% accuracy and 90% recovery for the correct placement of edges. When the correct direction is also taken into account, accuracy of 85% could be achieved with recovery of about 85%. Edges of correct direction and correct edge type could be recovered with 70% accuracy and 70% recovery. Thus, a quite reasonable reconstruction of a network could be achieved, with a large majority of edges properly labeled and oriented. We found that a threshold of α = 0.1 on conditional mutual information and β = 50.0 for adding an edge in network reconstruction yielded the best results.
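The simulation scheme just described (real parents and spouses, linear Gaussian targets) can be sketched in a few lines of Python; the coefficient and noise scales here are illustrative assumptions, not the paper's settings.

import numpy as np

rng = np.random.default_rng(7)

def simulate_target(parent_expr, noise_sd=1.0):
    """Simulate one target gene from real parent expression profiles:
    coefficients drawn from a Gaussian, plus additive Gaussian noise.
    `parent_expr` is (n_parents x n_samples)."""
    coeffs = rng.normal(0.0, 1.0, size=parent_expr.shape[0])
    return coeffs @ parent_expr + rng.normal(0.0, noise_sd, parent_expr.shape[1])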
3.2. Biological Significance

For practical experimental results we used data collected by Schadt et al 1, consisting of gene expression profiles for 111 F2 mice derived from crossing C57BL/6J and DBA/2J. The dataset contains expression for 23,574 genes and genotypes for 134
markers spread over 19 chromosomes. We applied our algorithm to construct local networks seeded by 400 highly cited mouse genes in the PubMed database, under the assumption that well-annotated seeds are more useful when performing a manual, qualitative review of predicted regulatory networks. A simple analysis showed that in 69% of these networks the seed gene shared a common Gene Ontology annotation with at least one other gene in the network. Further, in 31% of the cases the seed gene shared annotation with two or more neighbors. Several of these networks are shown in Figure 7 along with biological interpretations and analysis. The inferred local regulatory network of Dlx2 is shown in figure 7a. Three of the genes in the network, Dlx2, Aebp1 and Dnmt3a, are known transcription factors. This indicates that these genes might be involved in a transcriptional cascade. The local regulatory network of Rela (figure 7b) contains Mapk1, and both of these are involved in organ morphogenesis. Rela seems to be regulating Usmg5, which is involved in skeletal muscle growth, suggesting a role for Rela in skeletal muscle growth. The inferred local regulatory network of Pcna (figure 7c) suggests that Pcna and Dmap1 might be co-regulating Prim1. This is interesting as these two genes are known to interact with similar domains 36. The local network of Fgfr2 (figure 7d) is interesting in many ways. Biologically this network makes sense, as there is reasonable functional overlap among the genes in the network. Fgfr2 and Ptk2 are involved in the regulation of the actin cytoskeleton. Fgfr2, Ptk2 and Gnaq are all nucleotide-binding proteins. This network is also interesting computationally, as we can predict the causality of this network even though there are no genetic variables. In this network all the genes are well correlated with the seed gene, but Ptk2 and Ppt are uncorrelated. This is the only network that is able to capture these informational dependencies accurately 39.
4. DISCUSSION
Expression genetics data has helped scientists to understand the genetics behind the expression of many simpler traits that are affected by very few genetic factors. Understanding the genetics behind more complex traits requires careful modeling of the interaction between the quantitative (gene expression) and qualitative (genotype) traits. We presented an extension of our QTG model
Fig. 6. A. Accuracy vs. recovery plot for classification of variables in the Markov blanket of a candidate seed gene. Different lines show results for different sample sizes. B. Accuracy vs. recovery plot for graph reconstruction using different graph evaluation criteria.
Fig. 7. Sample local regulatory networks. See text.
for analyzing regulation involving multiple genes as a directed acyclic graph. In this study we investigated the use of an information theoretic method for accurately constructing local gene regulatory networks from a seed gene. Our model allows the use of both expression and genotype in the same network, thereby exploiting the natural dependencies. The method combines conventional quantitative genetic mapping and model-based network inference in one unified algorithm, in contrast to approaches where genetic analysis is done first and its results are used to refine genomic study results. Our simulation results suggest that reasonably accurate small networks can be constructed using our approach. Importantly, we also found that small sample size is the most important limitation on the utility of these data sets. Our study suggests that an order-of-magnitude increase in the number of samples would go a long way toward identifying reliable and complete gene regulatory networks, but such large experiments are impractical in the near term. A brief analysis of the local networks constructed around some well-known genes suggests that our method is capable of recovering biologically relevant networks from expression genetics data. Most of the networks have edges between genes that are known to be functionally similar and/or active in the same cellular locations.
ACKNOWLEDGEMENT

We are thankful to Gary Churchill and Sharon Tsaih for their useful comments.
References
1. Schadt EE, Monks SA, et al. Genetics of gene expression surveyed in maize, mouse and man. Nature, Mar 20;422(6929), pp 297-302 (2003).
2. Kulp D, Jagalur M. Causal Inference of Regulator-Target Pairs by Gene Mapping of Expression Phenotypes. BMC Genomics 2006; 7:125 (2006).
3. I Tsamardinos, CF Aliferis, A Statnikov. Algorithms for Large Scale Markov Blanket Discovery. The 16th International FLAIRS Conference (2003).
4. N Friedman, M Linial, I Nachman, D Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, vol. 7, pp 601-620 (2000).
5. D Pe'er, A Regev, G Elidan, N Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics (2001).
6. Pe'er D, Regev A, Tanay A. Minreg: inferring an active regulator set. Bioinformatics 18, pp 1:S258-267 (2002).
7. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34(2), pp 166-176 (2003).
8. Nir Friedman. Inferring Cellular Networks Using Probabilistic Graphical Models. Science, Vol 303, Issue 5659, pp 799-805, 6 February (2004).
9. AJ Hartemink, DK Gifford, TS Jaakkola, RA Young. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pacific Symposium on Biocomputing (2001).
10. Smith VA, Jarvis ED, Hartemink AJ. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics (2002).
11. Perez-Enciso M, Toro MA, Tenenhaus M, Gianola D. Combining gene expression and molecular marker information for mapping complex trait genes: a simulation study. Genetics, 164(4), pp 1597-606 (2003).
12. Yu J, Smith VA, Wang PP, Hartemink AJ, Jarvis ED. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 12;20(18), pp 3594-603 (2004).
13. CH Yeang, T Jaakkola. Physical Network Models and Multi-source Data Integration. Proceedings of the seventh annual international conference on computational molecular biology, pp 312-321 (2003).
14. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing (2002).
15. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA, Gifford DK. Computational discovery of gene modules and regulatory networks. Nature Biotechnology, Nov;21(11), pp 1337-42 (2003).
16. A Battle, E Segal, D Koller. Probabilistic Discovery of Overlapping Cellular Processes and Their Regulation. Proceedings of the eighth annual international conference on computational molecular biology, pp 167-176 (2004).
17. RB Brem, G Yvert, R Clinton, L Kruglyak. Genetic dissection of transcriptional regulation in budding yeast. Science, 26;296(5568), pp 752-755 (2002).
18. EE Schadt, SA Monks, SH Friend. A new paradigm for drug discovery: integrating clinical, genetic, genomic and molecular phenotype data to identify drug targets. Biochemical Society Transactions, 31, pp 437-443 (2003).
19. Kraft P, Horvath S. The genetics of gene expression and gene mapping. Trends in Biotechnology, 21(9), pp 377-378 (2003).
20. Jansen RC, Nap JP. Regulating gene expression: surprises still in store. Trends in Genetics, 20(5), pp 223-225 (2004).
21. Doerge RW. Mapping and analysis of quantitative trait loci in experimental populations. Nature Reviews Genetics, 3, pp 43-52 (2002).
22. Heckerman D. A Tutorial on Learning With Bayesian Networks. Technical Report, Microsoft Research, MSR-TR-95-06 (1995).
23. Sen S, Churchill G. A statistical framework for quantitative trait mapping. Genetics 2001, 159:371-87.
24. Bing N, Hoeschele I. Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics 2005, 170(2): 533-42.
25. Li H, Lu L, Manly KF, Chesler EJ, Bao L, Wang J, Zhou M, Williams RW, Cui Y. Inferring gene transcriptional modulatory relations: a genetical genomics approach. Hum Mol Genet. 2005 May 1;14(9):1119-25.
26. Zhu J, Lum PY, Lamb J, GuhaThakurta D, Edwards SW, Thieringer R, Berger JP, Wu MS, Thompson J, Sachs AB, Schadt EE. An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenet Genome Res 2004, 105(2-4):363-74.
27. Kevin P. Murphy. The Bayes Net Toolbox for MATLAB. Computing Science and Statistics, vol 33.
28. Jie Cheng, Russell Greiner, Jonathan Kelly, David Bell, Weiru Liu. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, Vol. 137, no. 1-2, pp 43-90, May 2002.
29. Daphne Koller and Mehran Sahami. Toward Optimal Feature Selection. International Conference on Machine Learning 1996, 284-292.
30. Nir Friedman, Daphne Koller. Being Bayesian about Network Structure. Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference, 2000.
31. Greg Cooper and Edward Herskovits. A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning 1992, 9:309-347.
32. JM Pena, J Bjorkegren, J Tegner. Scalable, Efficient and Correct Learning of Markov Boundaries under the Faithfulness Assumption. Bioinformatics 2005, 21(Suppl 2):ii224-29.
33. EE Schadt and PY Lum. Reverse engineering gene networks to identify key drivers of complex disease phenotypes. Journal of Lipid Research, Vol. 47, 2601-2613, December 2006.
34. ES Lander and D Botstein. Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics, Vol 121, 185-199, 1989.
35. M Lynch and B Walsh. Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA, 1998.
36. JB Margot, AE Ehrenhofer-Murray and H Leonhardt. Interactions within the mammalian DNA methyltransferase family. BMC Molecular Biology 2003, 4:7.
37. AJ Butte, P Tamayo, D Slonim, TR Golub and IS Kohane. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. PNAS USA, 2000 Oct 24;97(22):12182-6.
38. RB Brem and L Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. PNAS USA, 2005 Feb 1;102(5):1572-1577.
39. J Pearl and TS Verma. A Theory of Inferred Causation. UCLA Cognitive Systems Laboratory, Technical Report (R-156).
USING DIRECTED INFORMATION TO BUILD BIOLOGICALLY RELEVANT INFLUENCE NETWORKS
Arvind Rao* and Alfred O. Hero, III
Electrical Engineering and Computer Science, Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
* Email: [ukarvind, hero]@umich.edu
*Corresponding author.
David J. States
Bioinformatics, Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
Email: [email protected]
James Douglas Engel
Cell and Developmental Biology, University of Michigan, Ann Arbor, MI 48109, USA
Email: [email protected]

The systematic inference of biologically relevant influence networks remains a challenging problem in computational biology. Even though the availability of high-throughput data has enabled the use of probabilistic models to infer the plausible structure of such networks, their faithfulness to the biology of the process is questionable. In this work, we propose a network inference methodology, based on the directed information (DTI) criterion, which incorporates the biology of transcription within the framework so as to enable experimentally verifiable inference. We use publicly available embryonic kidney and T-cell microarray datasets to demonstrate our results. We present two variants of network inference via DTI (supervised and unsupervised) and the inferred networks relevant to mammalian nephrogenesis as well as T-cell activation. We demonstrate the conformity of the obtained interactions with the literature, as well as a comparison with the coefficient of determination (COD) method. Apart from network inference, the proposed framework enables the exploration of specific interactions, not just those revealed by the data.
1. INTRODUCTION

Computational methods for inferring dependencies between genes [4, 13, 6] using probabilistic methods have been in use for quite some time now. However, the biological significance of these recovered networks has been a topic of debate, apart from the fact that such techniques mostly yield networks of significant influences as 'observed/inferred' from the underlying structure of the data. Alternatively, other biological data (sequence information) might suggest examining the probabilistic dependence of one gene on another gene through the transcription factor (TF) encoded by the first gene. What if we were interested in the transcriptional influences on a certain gene 'A' but our prospective network inference technique was unable to recover them? We propose a technique with an eye on two of these potential limitations: biological significance and influence between
'any' two variables of interest. Such an approach is increasingly necessary when we want to integrate and understand multiple sources of data (sequence, expression, etc.). The method that we propose builds on an information theoretic criterion referred to as the directed information (DTI). The DTI [5, 26] can be interpreted as a directed version of mutual information, a criterion used quite frequently in other related work [13]. It turns out, as we will demonstrate, that the DTI gives a sense of directional association for the principled discovery of biological influence networks. There are two main contributions of this work. Firstly, we present a short theoretical treatment of DTI and an approach to the supervised and unsupervised influence recovery problems, using microarray expression data. Secondly, we examine two
scenarios: the inference of large-scale gene influence networks (in mammalian nephrogenesis and T-cell development) as well as potential effector genes for Gata3 transcriptional regulation in distinct biological contexts. We find that this method outperforms other methods in several aspects and leads to the formulation of biologically relevant hypotheses that might aid subsequent experimental investigation.

2. GENE NETWORKS
Transcription is the process of generation of messenger RNA (mRNA) from the DNA template representing the gene. It is the intermediate step before the generation of functional protein from messenger RNA. During gene expression, transcription factor proteins are recruited at the proximal promoter of the gene as well as at distal sequence elements (enhancers/silencers), which can lie several hundreds of kilobases from the gene's transcriptional start site [21]. Since transcription factors are also proteins (or their activated forms), which are in turn encoded by other genes, we can consider the notion of an influence between a transcription factor gene and the target gene. Below (Fig. 1) we give a characterization of what we mean by transcriptional regulatory networks. As the name suggests, gene A is connected by a link to gene C if a product of gene A, say protein A, is involved in the transcriptional regulation of gene C. This might mean that protein A is involved in the formation of the complex which binds at the basal transcriptional machinery of gene C to drive gene C regulation.
Fig. 1. A transcriptional regulatory network in which genes A and B affect C. An example of C that we study here is the Gata3 gene.
As can be seen, the components of the transcription factor (TF) complex recruited at the gene promoter are the products of several genes. Therefore,
the incorrect inference of a transcriptional regulatory network can lead to false hypotheses about the actual set of genes affecting a target gene. Since biologists are increasingly relying on computational tools to guide experiment design, a principled approach to biologically relevant network inference can lead to significant savings in time and resources. In this paper we try to combine some of the other available biological data (protein-protein interaction data and phylogenetic conservation of binding sites across genomes) to build network topologies with a lower false positive rate of linkage.

3. PROBLEM SETUP

In this work, we also study the mechanism of gene regulation, with the Gata3 gene as an example. This gene has important roles in several processes in mammalian development [21], such as in the developing urogenital system (nephrogenesis), the central nervous system, and T-cell development. In order to find which TFs regulate the tissue-specific transcription of Gata3 (either at the promoter or at long-range regulatory elements), a commonly followed approach [11, 12] would be to look for phylogenetically conserved transcription factor binding sites (TFBS). The hypothesis underlying this strategy is that the interspecies conservation of a TFBS suggests a possibly functional binding of the TF at the motif (from evolutionary pressure for function). This work primarily addresses the following questions: Which transcription factors are potentially active at the target gene's promoter during its tissue-specific regulation? This question is primarily answered by examining the phylogenetically conserved TFBS at the promoter and asking if microarray data suggest the presence of an influence between the TF-encoding gene and the target gene (i.e. Gata3). This approach thus integrates sequence and expression information. Biologists are also interested in the network of relationships among genes expressed under a certain set of conditions, which calls for network inference procedures such as Bayesian networks [4], MI [13], etc. However, there has been a lack of a common framework for doing both supervised and unsupervised directed network inference within these
settings to detect non-linear gene-gene interactions. We present Directed Information as a potential solution to both these scenarios. Supervised network inference pertains to finding the strengths of directed relationships between two specific genes. Unsupervised network inference deals with finding the most probable network structure to explain the observed data (as in Bayesian structure learning using expression data). For illustration, we use the Gata3 example in the rest of this paper.
3.1. Phylogenetic Conservation of Binding Sites

As mentioned above, the mechanism of regulation of a target gene is via the binding site of the corresponding transcription factor (TF). It is believed that several TF binding motifs might have appeared over the evolutionary time period due to insertions, mutations, deletions, etc. in vertebrate genomes. However, if we are interested in the regulation of a process which is known to be similar between several organisms (say Human, Chimp, Mouse, Rat and Chicken), then we can look for the conservation of functional binding sites over all these genomes. This helps us isolate the functional binding sites, as opposed to those which might have arisen randomly. This, however, does not suggest that those other TF binding sites have no functional role. If we are interested in the mechanism of regulation of the Gata3 gene (which is known to be implicated in mammalian nephrogenesis), we examine its promoter region for phylogenetically conserved TFBS (Fig. 2). Such information can be obtained from most genome browsers [20]. We see that even for a fairly short stretch of sequence (1 kilobase) upstream of the gene, there are several conserved sequence elements which are potential TFBS (light grey regions in Fig. 2). To test their functional role in vivo or in vitro, it is necessary to select only a subset of these TFs, because testing all of them would place a great demand on resources and effort. Hence the genes encoding these conserved TFs are the ones that we examine for possible influence determination via expression-based influence metrics. If we are able to infer an influence between the TF-coding gene and the target gene at which its TF binds, then this reduces the number of candidates to be tested. To examine Gata3's role in kidney development, we use microarray expression data from public repositories of kidney microarray data (http://genet.chmcc.org, http://spring.imb.uq.edu.au/ and http://kidney.scgap.org/index.html).
Fig. 2. TFBS conservation between Human, Mouse and Rat, upstream (x-axis) of Gata3, from http://www.ecrbrowser.dcode.org.
Another source of side information which becomes extremely useful in such scenarios is the biophysics of transcriptional regulation: TFs binding at regulatory regions hardly do so alone, but simultaneously participate in several interactions with proximal elements. Hence the presence of conserved TFs which are known binding partners (identified from protein interaction databases) increases the likelihood of functionality of that TF in transcriptional regulation. Our approach thus integrates several aspects:

- Identifying if any of the genes influence a target gene by coding for a transcription factor binding at the site discovered from conservation studies. This directed influence is captured using an influence metric (like directed information).
- Using phylogenetic information and protein-protein interactions to infer which binding sites upstream of a target gene may be functional.
4. DTI FORMULATION

As alluded to above, there is a need for a viable influence metric that can find relationships between the TF "effector" gene (identified from phylogenetic
conservation) and the target gene (like Gata3). Several such metrics have been proposed, notably correlation, the coefficient of determination (COD), mutual information, etc. To alleviate the challenge of detecting non-linear gene interactions, an information theoretic measure like mutual information has been used to infer the conditional dependence among genes by exploring the structure of the joint distribution of the gene expression profiles [13]. However, the absence of a 'causal' (or directed dependence) information theoretic metric has hindered the utilization of the full potential of information theory. In this work, we examine the applicability of such a metric, the Directed Information criterion (DTI), to the explicit inference of gene influence. This will enable us to potentially discover any directed non-linear relationship between genes of interest. The DTI, which is a measure of the causal dependence between two $N$-length random processes $X = X^N$ and $Y = Y^N$, is given by [22]:

$$I(X^N \to Y^N) = \sum_{n=1}^{N} I(X^n; Y_n \mid Y^{n-1}) \qquad (1)$$
Here, $Y^n$ denotes $(Y_1, Y_2, \ldots, Y_n)$, i.e. a segment of the realization of a random sequence $Y$, and $I(X^N; Y^N)$ is the Shannon mutual information [28]. An interpretation of the above formulation for DTI is in order. To infer the notion of influence between two time series (mRNA expression data), we find the mutual information between the entire evolution of gene $X$ (up to the current instant $n$) and the current instant of $Y$ (i.e. $Y_n$), given the evolution of gene $Y$ up to the previous instant $n-1$ (i.e. $Y^{n-1}$). We do this for every instant $n \in \{1, 2, \ldots, N\}$ in the $N$-length expression time series. Thus, we find the influence relationship between genes $X$ and $Y$ at every instant during the evolution of their individual time series. As is well known, $I(X^N; Y^N) = H(X^N) - H(X^N \mid Y^N)$, with $H(X^N)$ and $H(X^N \mid Y^N)$ being the Shannon entropy of $X$ and the conditional entropy of $X$ given $Y$, respectively. Using this definition of mutual information, the Directed Information can be expressed in terms of individual and joint entropies of $X$ and $Y$. One way to estimate entropy is to use marginal and joint histograms, but there are problems both with computational complexity and with moderate sample size.
Especially in a microarray expression setting (where we have only a modest number of sample points per gene), it is useful to examine an alternative strategy for entropy estimation which uses a data-dependent binning approach. One such method to find the entropy of the random variables $X^N$ and $Y^N$ uses the Darbellay-Vajda algorithm [7]. In this approach, an adaptive partitioning of the observation space is used to estimate the probability densities as well as the entropies of the random variables. Briefly, the Darbellay-Vajda procedure for entropy estimation proceeds as follows (more details can be found in [23]). First, note that

$$I(X^N \to Y^N) = \sum_{n=1}^{N} \left[ H(X^n \mid Y^{n-1}) - H(X^n \mid Y^n) \right] = \sum_{n=1}^{N} \left[ I(X^n; Y^n) - I(X^n; Y^{n-1}) \right] \qquad (2)$$
For evaluating the DTI in expression (2), we need to evaluate the expressions $I(X^n; Y^n)$ and $I(X^n; Y^{n-1})$ in each term of the sum. For the evaluation of $I(X^n; Y^n)$, we have an $n$-dimensional list for $X^n$ and $Y^n$ respectively. Transform the vectors
$X^n = (X_1, X_2, \ldots, X_n)$ and $Y^n = (Y_1, Y_2, \ldots, Y_n)$ to $(U_i, V_i) = (j : X_{(j)} = X_i,\; k : Y_{(k)} = Y_i)$, $\forall\, 1 \le i \le n$, where $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ and $(Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)})$ are the rank-ordered versions of $(X_1, X_2, \ldots, X_n)$ and $(Y_1, Y_2, \ldots, Y_n)$. Thus the sample (observation) space $(U, V)$ is a 2D representation of the ranks of $X^n$ and $Y^n$. This is an ordinal sampling step. We note that $I(U, V) = I(X^n, Y^n)$. In the $U$-$V$ co-ordinate plane, a dyadic partitioning of the sample space is iteratively done until the sample distribution within each cell is not significantly different from random (i.e. conditionally independent). Once the sample distribution in a cell achieves independence, the cell is not split any further. Hence, if there are $K$ partitions in the observation space, and the $k$th cell has $n_k$ samples, the mutual information is estimated as

$$I_{U,V} = I(X^n, Y^n) = \sum_{k=1}^{K} \frac{n_k}{n} \log\!\left( \frac{n_k/n}{p_{k,x}\, q_{k,y}} \right),$$

where $p_{k,x}$ and $q_{k,y}$ are the marginal (empirical) probabilities of the $k$th cell along the $U$ and $V$ axes, so that $p_{k,x}\, q_{k,y}\, n^2$ is the 2D hypervolume of the $k$th cell. We note that the presence of biological/technical replicates (as are available in microarray data) creates many more samples from which to obtain entropy estimates.
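To make the ordinal sampling step concrete, the following Python sketch (ours, not the authors' implementation; the function names are hypothetical) maps paired samples to rank pairs and then estimates mutual information from a partition of the rank plane. For brevity it uses a fixed grid where the Darbellay-Vajda algorithm would adaptively refine only those cells whose samples still look statistically dependent.

```python
import numpy as np

def ordinal_sample(x, y):
    """Ordinal sampling: replace paired samples (x_i, y_i) by their rank
    pairs (U_i, V_i); mutual information is invariant under this map."""
    u = np.argsort(np.argsort(x)) + 1   # rank of each x_i (1-based)
    v = np.argsort(np.argsort(y)) + 1
    return u, v

def rank_grid_mi(x, y, bins=4):
    """MI estimate (nats) from a partition of the rank plane, summing
    (n_k/n) * log((n_k/n) / (p_k * q_k)) over the occupied cells."""
    u, v = ordinal_sample(x, y)
    counts, _, _ = np.histogram2d(u, v, bins=bins)
    n = counts.sum()
    p_uv = counts / n
    p_u = p_uv.sum(axis=1, keepdims=True)   # cell marginal along U
    p_v = p_uv.sum(axis=0, keepdims=True)   # cell marginal along V
    mask = p_uv > 0
    return float(np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u * p_v)[mask])))
```

Because ranks have near-uniform marginals, the cell marginals here reduce to simple bin-width fractions; the adaptive scheme earns its keep when the dependence is concentrated in small regions of the rank plane.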
To obtain the DTI between any two genes of interest ($X$ and $Y$) with $N$-length expression profiles $X^N$ and $Y^N$ respectively, we plug the information estimates ($I(X^n; Y^{n-1})$ and $I(X^n; Y^n)$) computed above into expression (2). However, it is preferable to have a normalized version of this metric (lying in $[0, 1]$) for comparing the strengths of relationships across gene pairs. It is also essential to consider a notion of significance of the obtained DTI measure. We thus perform bootstrapping of every estimate of the DTI, and if the value of the DTI is significant (at p-value 0.05), we accept the notion of an influence between genes $X$ and $Y$. Below (Sec. 7), we indicate the sequence of steps used to estimate the significance of an influence between Pax2 and Gata3. The steps for normalizing the DTI measure as well as estimating significance with respect to a null DTI distribution are given in the following sections.
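Assembling these pieces, a minimal illustrative sketch of equation (2) follows. It is our simplification rather than the authors' code: replicate expression series are quantile-binned, each prefix $X^n$ is collapsed to one discrete symbol, and a plug-in discrete MI estimator stands in for the adaptive-partitioning estimator above.

```python
import numpy as np

def quantile_bin(series, k=4):
    """Bin each time point of a (replicates x N) matrix into k quantile levels."""
    binned = np.empty(series.shape, dtype=int)
    for t in range(series.shape[1]):
        edges = np.quantile(series[:, t], np.linspace(0, 1, k + 1)[1:-1])
        binned[:, t] = np.digitize(series[:, t], edges)
    return binned

def prefix_symbols(binned, n):
    """Collapse the length-n prefix of each replicate row into one symbol.
    (Hashing tuples is a convenience; collisions are negligible here.)"""
    return np.array([hash(tuple(row[:n])) for row in binned])

def discrete_mi(a, b):
    """Plug-in mutual information (nats) between two discrete samples."""
    mi = 0.0
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == av) * np.mean(b == bv)))
    return mi

def directed_information(x, y, k=4):
    """I(X^N -> Y^N) = sum_n [ I(X^n; Y^n) - I(X^n; Y^{n-1}) ], eq. (2).
    x, y: (replicates x N) expression matrices for genes X and Y."""
    bx, by = quantile_bin(x, k), quantile_bin(y, k)
    dti = 0.0
    for n in range(1, x.shape[1] + 1):
        xn = prefix_symbols(bx, n)
        dti += discrete_mi(xn, prefix_symbols(by, n))
        if n > 1:
            dti -= discrete_mi(xn, prefix_symbols(by, n - 1))
    return dti
```

As the text notes, replicate measurements are what make these prefix-level MI terms estimable at all; with a single series per gene, the plug-in estimates would be degenerate.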
5. A NORMALIZED DTI MEASURE

In this section, we derive an expression for a 'normalized DTI coefficient'. This is useful for a meaningful comparison across different criteria during network inference. In this section, we use $X$, $Y$, $Z$ for $X^N$, $Y^N$ and $Z^N$ interchangeably, i.e. $X = X^N$, $Y = Y^N$, and $Z = Z^N$. By the definition of DTI, we can see that $0 \le I(X^N \to Y^N) \le I(X^N; Y^N) < \infty$. The normalized measure $\rho_{DTI}$ should map this large range ($[0, \infty)$) to $[0, 1]$. We recall that the multivariate canonical correlation is given by [24]: $\rho_{X^N;Y^N} = C_{X^N}^{-1/2} C_{X^N;Y^N} C_{Y^N}^{-1/2}$, and this is normalized, having eigenvalues between 0 and 1. We also recall that, under a Gaussian distribution on $X^N$ and $Y^N$, the joint entropy is $H(X^N; Y^N) = \frac{1}{2} \ln\!\left( (2\pi e)^{2N} |C_{X^N Y^N}| \right)$, where $|A|$ is the determinant of matrix $A$ and $C$ denotes the covariance matrix. Thus, for $I(X^N; Y^N) = H(X^N) + H(Y^N) - H(X^N; Y^N)$, the expression for mutual information, under jointly Gaussian assumptions on $X^N$ and $Y^N$, becomes

$$I(X; Y) = -\frac{1}{2} \ln\!\left( \frac{|C_{X^N Y^N}|}{|C_{X^N}|\,|C_{Y^N}|} \right) = -\frac{1}{2} \ln\!\left( 1 - \rho^2_{X^N;Y^N} \right).$$

Hence, a straightforward transformation is the normalized MI, $\rho_{MI} = \sqrt{1 - e^{-2 I(X;Y)}} = \sqrt{1 - e^{-2 \sum_{n=1}^{N} I(X^N; Y_n \mid Y^{n-1})}}$. A connection with [15] can thus be immediately seen. With this, $\rho_{MI}$ is normalized between $[0, 1]$ and gives a better absolute definition of dependency that does not depend on the unnormalized MI. We will use this definition of normalized information coefficients in the present set of simulation studies. For constructing a normalized version of the DTI, we can extend this approach, following [9]. Consider three random vectors $X$, $Y$ and $Z$, distributed as $\mathcal{N}(\mu_X, C_{XX})$, $\mathcal{N}(\mu_Y, C_{YY})$ and $\mathcal{N}(\mu_Z, C_{ZZ})$ respectively, with cross-covariances $C_{XY}$, $C_{YZ}$ and $C_{XZ}$. Their partial correlation $\delta_{YX|Z}$ is then given by

$$\delta_{YX|Z} = a_1^{-1/2}\, a_2\, a_3^{-1/2}, \quad \text{with} \quad a_1 = C_{YY} - C_{YZ} C_{ZZ}^{-1} C_{ZY}, \quad a_2 = C_{YX} - C_{YZ} C_{ZZ}^{-1} C_{ZX}, \quad a_3 = C_{XX} - C_{XZ} C_{ZZ}^{-1} C_{ZX}.$$

Recalling results for conditional Gaussian distributions, these can be denoted by $a_1 = C_{Y|Z}$, $a_2 = C_{YX|Z}$ and $a_3 = C_{X|Z}$. Thus, $\delta_{YX|Z} = C_{Y|Z}^{-1/2} C_{YX|Z} C_{X|Z}^{-1/2}$. Extending the above result from the mutual information to the directed information case, we have

$$\rho_{DTI} = \sqrt{1 - e^{-2 \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1})}}.$$

We recall the primary difference between MI and DTI (note the superscript on $X$):

MI: $I(X^N; Y^N) = \sum_{i=1}^{N} I(X^N; Y_i \mid Y^{i-1})$.
DTI: $I(X^N \to Y^N) = \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1})$.
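As a small illustration (ours; the function name is hypothetical), the normalization is a one-line transform that applies equally to an MI or a DTI value:

```python
import numpy as np

def rho_from_information(i_value):
    """Joe-style normalization mapping an information value in [0, inf)
    to [0, 1): applied to I(X;Y) it yields rho_MI, and applied to
    I(X^N -> Y^N) it yields rho_DTI."""
    return float(np.sqrt(1.0 - np.exp(-2.0 * i_value)))
```

Since the map $t \mapsto \sqrt{1 - e^{-2t}}$ is strictly increasing, ranking gene pairs by $\rho_{DTI}$ or by raw DTI gives the same order; the normalization matters when comparing against other $[0,1]$-valued criteria such as the COD.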
Having found the normalized DTI, we ask if the obtained DTI estimate is significant with respect to a 'null DTI distribution' obtained by random chance. This is addressed in the next two sections.

6. KERNEL DENSITY ESTIMATION (KDE)

The goal in density estimation is to find a probability function $\hat{f}(z)$ that approximates the underlying density $f(z)$ of the random variable $Z$. Under certain regularity conditions, the kernel density estimator $\hat{f}_h(z)$ at the point $z$ is given by $\hat{f}_h(z) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{z - z_i}{h} \right)$, with $n$ being the number of samples $z_1, z_2, \ldots, z_n$ from which the density
is to be estimated, and $h$ the bandwidth of the kernel $K(\cdot)$ used during density estimation. A kernel density estimator at $z$ works by weighting the samples (in $(z_1, z_2, \ldots, z_n)$) around $z$ by a kernel function (window) and counting the relative frequency of the weighted samples within the window width. As is clear from such a framework, the choice of kernel function $K(\cdot)$ and the bandwidth $h$ determines the fit of the density estimate. Some figures of merit to evaluate various kernels are the asymptotic mean integrated squared error (AMISE), bias-variance characteristics and region of support [8]. It is preferred that a kernel have a finite range of support, low AMISE and a favorable bias-variance tradeoff. The bias is reduced if the kernel bandwidth (region of support) is small, but the estimate then has higher variance because of the small effective sample size; for a larger bandwidth, this is reversed (i.e. large bias and smaller variance). Under these requirements, the Epanechnikov kernel has most of these desirable characteristics, i.e. a compact region of support, the lowest AMISE compared to other kernels, and a favorable bias-variance tradeoff [8]. The Epanechnikov kernel is given by

$$K(u) = \frac{3}{4} (1 - u^2)\, \mathbf{1}(|u| \le 1),$$

with $\mathbf{1}(\cdot)$ being the indicator function conveying a window spanning $[-1, 1]$ centered at 0. An optimal choice of the bandwidth is $h = 1.06 \times \hat{\sigma} \times n^{-1/5}$, following [14]. Here $\hat{\sigma}$ is the standard error from the bootstrap DTI samples $(z_1, z_2, \ldots, z_n)$. Hence the kernel density estimate for the bootstrapped DTI (with $n = 1000$ samples), $\tilde{I}_B(X^N \to Y^N)$, becomes $\hat{f}_h(z) = \frac{1}{nh} \sum_{i=1}^{n} \frac{3}{4}\left[ 1 - \left( \frac{z - z_i}{h} \right)^2 \right] \mathbf{1}\!\left( \left| \frac{z - z_i}{h} \right| \le 1 \right)$, with $h = 2.67\, \hat{\sigma}\, n^{-1/5}$ and $n = 1000$. We note that $\tilde{I}_B(X^N \to Y^N)$ is obtained by finding the DTI for each random permutation of the $X$, $Y$ time series, performing this permutation $B$ times, and obtaining an estimate of the density over these $B$ permutations.
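The following sketch (ours) is a direct transcription of these formulas; it uses the generic 1.06 plug-in constant, whereas the text quotes a kernel-specific constant for its bootstrap samples.

```python
import numpy as np

def epanechnikov_kde(z_grid, samples):
    """Kernel density estimate of the bootstrapped DTI distribution,
    evaluated at the points in z_grid, using the Epanechnikov kernel
    K(u) = 0.75*(1 - u^2) on |u| <= 1 and h = 1.06 * sigma * n^(-1/5)."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    h = 1.06 * samples.std(ddof=1) * n ** (-0.2)
    u = (np.asarray(z_grid, dtype=float)[:, None] - samples[None, :]) / h
    k = 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)
    return k.sum(axis=1) / (n * h)
```

For example, `epanechnikov_kde(np.linspace(0, 1, 200), boot_dtis)` traces the null density over the normalized DTI range.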
7. BOOTSTRAPPED CONFIDENCE INTERVALS

Since we do not know the true distribution of the DTI estimate, we find an approximate confidence interval for the DTI estimate $\hat{I}(X^N \to Y^N)$ using the bootstrap [19]. We denote the cumulative distribution function (over the bootstrap samples) of $\hat{I}(X^N \to Y^N)$ by $F_{\hat{I}(X^N \to Y^N)}(\tilde{I}_B(X^N \to Y^N))$ (Figure 3). Let the mean of the bootstrapped null distribution be $\bar{I}_B(X^N \to Y^N)$. We denote by $t_{1-\alpha}$ the $(1-\alpha)$ quantile of this distribution, i.e.

$$P\!\left( \left| \frac{\tilde{I}_B(X \to Y) - \bar{I}_B(X \to Y)}{\hat{\sigma}} \right| \le t_{1-\alpha} \right) = 1 - \alpha.$$

Since we need the real $\hat{I}(X^N \to Y^N)$ to be significant, we need $\hat{I}(X^N \to Y^N) \ge \bar{I}_B(X^N \to Y^N) + t_{1-\alpha} \times \hat{\sigma}$, with $\hat{\sigma}$ being the standard error of the bootstrapped distribution, $\hat{\sigma} = \sqrt{\frac{\sum_{b=1}^{B} [\tilde{I}_b(X^N \to Y^N) - \bar{I}_B(X^N \to Y^N)]^2}{B - 1}}$, where $B$ is the number of bootstrap samples. For the Pax2-Gata3 interaction, we show the kernel density estimate of the bootstrapped histogram using the Epanechnikov kernel (Fig. 3), as well as the position of the true DTI estimate in relation to the overall histogram. With the obtained kernel density estimate of the Pax2-Gata3 interaction, shown below, we can find significance values of the true DTI estimate in relation to the bootstrapped null distribution.

Fig. 3. Cumulative distribution function for bootstrapped $I(Pax2 \to Gata3)$. The true $I(Pax2 \to Gata3) = 0.9911$.
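The permutation null and the CI criterion above can be sketched as follows (our code; `dti_fn` would be any DTI estimator, e.g. the `directed_information` sketch from Section 4):

```python
import numpy as np

def permutation_null(x, y, dti_fn, B=1000, seed=0):
    """Null DTI distribution: permute the time axis of X (destroying any
    directed X -> Y structure) and recompute the DTI, B times."""
    rng = np.random.default_rng(seed)
    null = np.empty(B)
    for b in range(B):
        perm = rng.permutation(x.shape[-1])
        null[b] = dti_fn(x[..., perm], y)
    return null

def is_significant(dti_obs, null, alpha=0.05):
    """Accept an influence if the observed DTI exceeds the null mean plus
    t_{1-alpha} standard errors, as in the criterion above."""
    se = null.std(ddof=1)
    t = np.quantile(np.abs(null - null.mean()) / se, 1.0 - alpha)
    return dti_obs >= null.mean() + t * se
```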
8. SUMMARY OF ALGORITHM

We now present two versions of the DTI algorithm: one which involves the inference of a general influence network between all genes of interest (unsupervised-DTI), and another, a focused search for effector genes which influence one particular gene of interest (supervised-DTI). Our proposed approach for supervised-DTI is as follows (a code sketch follows the list):

- Identify the G key genes based on the required phenotypic characteristic using fold-change studies.
- Preprocess the gene expression profiles by normalization and cubic spline interpolation. We now assume that there are N points for each gene.
- Bin each of the expression profiles into K quantiles (here, we use K = 4), thus building a joint histogram. The granularity of sampling can be an issue during entropy estimation, hence the Darbellay-Vajda method can also be used here. We note that the presence of probe-level or sample replicates greatly enhances the accuracy of the entropy estimation step.
- For each pair of genes $A_i$ and $B$ among these G genes:
  - Look for a phylogenetically conserved binding site of the protein encoded by gene $A_i$ in the upstream region of gene B.
  - Find $DTI(A_i, B) = I(A_i^N \to B^N)$, and the normalized DTI from $A_i$ to B, $\rho_{DTI}(A_i, B) = \sqrt{1 - e^{-2 I(A_i^N \to B^N)}}$.
  - Bootstrapping over several permutations of the data points of $A_i$ and B yields a null distribution (using KDE) for $DTI(A_i, B)$.
  - If the true $DTI(A_i, B)$ is greater than the 95% upper limit of the confidence interval (CI) from this null histogram, infer a potential influence from $A_i$ to B. The value of the normalized DTI from $A_i$ to B gives the putative strength of the interaction/influence.
  - Every gene $A_i$ which potentially influences B is an 'affector'. This search is done for every gene $A_i$ among these G genes $(A_1, A_2, \ldots, A_G)$.
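A compact driver for the supervised variant, reusing the sketches above (`directed_information`, `permutation_null`, `is_significant`, `rho_from_information`); the data structures are our assumptions, not a published interface:

```python
def supervised_dti(expr, target, candidates, has_conserved_site,
                   alpha=0.05, B=1000):
    """Rank candidate effector genes A_i of a target gene (e.g. Gata3).
    expr: dict gene -> (replicates x N) expression matrix;
    has_conserved_site: dict gene -> bool from the phylogenetic TFBS screen."""
    y, hits = expr[target], []
    for a in candidates:
        if not has_conserved_site.get(a, False):
            continue                       # require sequence evidence first
        dti = directed_information(expr[a], y)
        null = permutation_null(expr[a], y, directed_information, B=B)
        if is_significant(dti, null, alpha):
            hits.append((a, rho_from_information(dti)))
    return sorted(hits, key=lambda g: -g[1])   # strongest affectors first
```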
We observe that both phylogenetic and expression information are inherently built into the influence network inference step above. For unsupervised DTI, we adapt the above approach for every pair of genes (A, B) in the list, noting that $DTI(A, B) \ne DTI(B, A)$. In this case we are not looking at any interaction in particular, but are interested in the entire influence network that can potentially be inferred from the given time-series expression data. The network adjacency matrix has entries depending on the direction of influence and is related to the strength of influence as well as the false discovery rate. We note that it is fairly simple to include some a priori biological knowledge (if a subset of upstream TFs at the promoter is already known, either experimentally or from other sources): a search among the binding partners of these known TFs can reduce the set of potential effectors and reduce the complexity of the unsupervised procedure. Another element that has been added is the control of the false discovery rate (FDR) [27] to screen each of the G(G-1) hypotheses (both directions) during network discovery amongst G genes (see the sketch at the end of this section).

Table 1. Comparison of various network inference methods.

Method   | Resolve cycles | Nonlinear framework | Search for interaction | Nonparametric framework
SSM [1]  | Y              | N                   | N                      | N
COD [3]  | N              | N                   | Y                      | N
GGM [6]  | N              | Y                   | N                      | N
DTI [5]  | Y              | Y                   | Y                      | Y

In Table 1 we compare the various contemporary methods of directed network inference. Recent literature has introduced several interesting approaches for directed network inference, such as graphical Gaussian models (GGMs), the coefficient of determination (COD) and state space models (SSMs). This comparison is based primarily on expectations of such inference procedures, namely that we would like any such metric/procedure to:

- Resolve cycles in recovered interactions.
- Be capable of resolving directional and potentially non-linear interactions. This is because interactions amongst genes involve non-linear kinetics.
- Be a non-parametric procedure, to avoid distributional assumptions (noise etc.).
- Be capable of recovering interactions that a biologist might be interested in. Rather than using a method that discovers only the interactions underlying the data, the biologist should be able to use prior knowledge (from the literature, perhaps). For example, a biologist can examine the strength and significance of a known interaction and use this as a basis for finding other such interactions.
From the above comparisons, we see that DTI is the only metric which can recover interactions under all these considerations.
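For the unsupervised variant's FDR screen, the Benjamini-Hochberg step-up procedure [27] mentioned above can be sketched as follows (our code, with hypothetical names):

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: given the G*(G-1) per-edge bootstrap p-values, return a
    boolean mask of the directed edges retained at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= q*i/m
        keep[order[:k + 1]] = True
    return keep
```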
9. RESULTS

In this section, we give some scenarios where DTI can complement existing bioinformatics strategies to answer several questions pertaining to transcriptional regulatory mechanisms. We address three different questions. To infer gene influence networks between genes that have a role in early kidney development and T-cell activation, we use unsupervised DTI with relevant microarray expression data, noting that these influence networks are not necessarily transcriptional regulatory networks. To find transcription factors that might be involved in the regulation of a target gene (like Gata3) at the promoter, a common approach is to first look for binding motif sequences phylogenetically conserved across related species. These species are selected based on whether the particular biological process is conserved in them. To add additional credence to the role of these conserved TFBSes, microarray expression can be integrated via supervised DTI to check for evidence of an influence between the TF-encoding gene and the target gene. Before proceeding, we examine the performance of this approach on synthetic data.

9.1. Synthetic Network

A synthetic network is constructed in the following fashion: we assume that there are two genes, $g_1$ and $g_3$, which drive the remaining genes of a seven-gene network. The evolution equations are as below:

$$g_{2,t} = \frac{1}{2} g_{1,t-1} + \frac{1}{3} g_{3,t-2} + g_{7,t-1}; \qquad g_{4,t} = g_{2,t-1}^{2} + g_{3,t-1}^{1/2}; \qquad g_{5,t} = g_{2,t-2} + g_{4,t-1}; \qquad g_{7,t} = \frac{1}{2} g_{4,t-1}^{1/3}.$$
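A minimal simulation of this network (ours; the text does not specify the drivers' dynamics or any noise model, so $g_1$, $g_3$ and the otherwise unspecified $g_6$ are modeled as positive noise around 1):

```python
import numpy as np

def simulate_network(T=100, seed=0):
    """Simulate the seven-gene synthetic network; row i holds gene g_{i+1}."""
    rng = np.random.default_rng(seed)
    g = np.abs(rng.normal(1.0, 0.1, size=(7, T)))   # drivers / initial values
    for t in range(2, T):
        g[1, t] = 0.5 * g[0, t-1] + g[2, t-2] / 3.0 + g[6, t-1]   # g2
        g[3, t] = g[1, t-1] ** 2 + np.sqrt(g[2, t-1])             # g4
        g[4, t] = g[1, t-2] + g[3, t-1]                           # g5
        g[6, t] = 0.5 * g[3, t-1] ** (1.0 / 3.0)                  # g7
    return g
```

Feeding pairs of rows into the DTI estimator should recover the directed, lagged and non-linear edges, including the $g_2 \to g_4 \to g_7 \to g_2$ cycle discussed below.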
For the purpose of comparison, we study the performance of the Coefficient of Determination (COD) approach for directed influence network determination. The COD allows the determination of association between two genes via an $R^2$ goodness-of-fit statistic. The methods of [3] are implemented on the time series data. Such a study is useful for determining the relative merits of each approach. We believe that no one procedure can work for every application, and the choice of an appropriate method should be governed by the biological question under investigation. Each of these methods uses some underlying assumptions, and if these are consistent with the question that we ask, then that method has utility.
Fig. 4. The synthetic network as recovered by (a) DTI and (b) COD.
As can be seen (Fig. 4), though COD can detect linear lag influences, the non-linear ones are missed. DTI detects these influences and almost exactly reproduces the synthetic network. Given the non-linear nature of transcriptional kinetics, this is essential for reliable network inference. DTI is also able to resolve loops and cycles ($g_3 \to \{g_2, g_4\} \to g_5$, and the cycle $g_2 \to g_4 \to g_7 \to g_2$). Based on these observations, we examine the networks inferred using DTI in both the supervised and unsupervised settings.

9.2. Directed Network Inference: Gata3 Regulation in Early Kidney Development

Biologists have an interest in influence networks that might be active during organ development. Advances in laser capture microdissection, coupled with those in microarray methodology, have enabled the investigation of temporal profiles of genes putatively involved in these embryonic processes. Forty-seven genes are expressed differentially between the ureteric bud and metanephric mesenchyme [25] and are putatively involved in bud branching during kidney development. The expression data [10] temporally profile kidney development from day 10.5 dpc to the neonate stage. The influence amongst these genes is
shown below (Fig. 5). Several of the presented interactions are biologically validated, but there is an interest in confirming the novel ones pointed out in the network. The annotations of some of these genes are given below (Table 2).
Fig. 5. Overall Influence network using DTI during early kidney development.
Some of the interactions that have been experimentally validated include the Rara-Mapk1 [18], Pax2-Gata3 [16] and Agtr-Pax2 [17] interactions. We note that this result illustrates the application of DTI for network inference in an unsupervised manner, i.e. discovering interactions revealed by the data rather than examining the strengths of interactions known a priori. Such a scenario will be explored later (Sec. 9.4). We note that though several interaction networks are recovered, we only show the largest network including Gata3, because this is the gene of interest in this study. An important shortcoming of most gene network inference approaches is that these relationships are detected based on mRNA expression levels alone. To understand these interactions with greater fidelity, there is a need to integrate other data sources corresponding to phosphorylation, dephosphorylation and other post-transcriptional/translational activities, including miRNA activity.

9.3. Directed Network Inference: T-cell Activation
To clarify the validity of the presented approach, we present a similar analysis on another data set, the T-cell expression data [1], in Fig. 6. This data set looks at the expression of various genes after T-cell activation using stimulation with the phorbol ester PMA and ionomycin. It contains the profiles of about 58 genes over 10 time points, with 44 (34+10) replicate measurements for each time point.
Fig. 6. DTI-based T-cell network.
Several of these interactions are confirmed in earlier studies [1, 29, 30, 31] and again point to the strength of DTI in recovering known interactions. The annotations of some of these genes are given in Table 3. We note that the network of Fig. 6 shows the largest influence network (containing Gata3) that can be recovered. Gata3 is involved in T-cell development as well as kidney development, and hence it is interesting to see networks relevant to each context in Figs. 5 and 6. Also, these 58 genes relevant to T-cell activation are very different from those for kidney development, with fairly low overlap. For example, this list does not include Pax2 (which is relevant in the kidney development data).

9.4. Phylogenetic conservation of TFBS effectors
A common approach to the determination of "functional" transcription factor binding sites in genomic regions is to look for motifs in regions conserved across various species. Here we focused on the interspecies conservation of TFBS (Fig. 2) in the Gata3 promoter to determine which of them might be related to the transcriptional regulation of Gata3. Such conservation across multiple species suggests selective evolutionary pressure on the region, with a potential relevance for function. As can be seen in Fig. 2, we examine the Gata3 gene promoter and find at least forty different transcription factors that could putatively bind at the promoter as part of the transcriptional complex. Some of these TFs, however, belong to the same family.
Table 2. Functional annotations (Entrez Gene) of some of the genes, including Gata2 and Gata3, during nephrogenesis.

Gene Symbol | Gene Name                                   | Possible Role in Nephrogenesis (Function)
Rara        | Retinoic Acid Receptor                      | crucial in early kidney development
Gata2       | GATA binding protein 2                      | several aspects of urogenital development
Gata3       | GATA binding protein 3                      | several aspects of urogenital development
Pax2        | Paired Homeobox-2                           | conversion of MM precursor cells to tubular epithelium
Lamc2       | Laminin                                     | cell adhesion molecule
Pgf         | Placental Growth Factor                     | arteriogenesis, growth factor activity during development
Col18a1     | collagen, type XVIII, alpha 1               | extracellular matrix structural constituent, cell adhesion
Agtrap      | Angiotensin II receptor-associated protein  | ureteric bud cell branching
Table 3. Functional annotations of some of the genes following T-cell activation.

Gene Symbol | Gene Name                                | Possible Role in T-cell Activation (Function)
Casp7       | Caspase 7                                | involved in apoptosis
JunD        | Jun D proto-oncogene                     | regulatory role in T lymphocyte proliferation and Th cell differentiation
CKR1        | Chemokine Receptor 1                     | negative regulator of the antiviral CD8+ T cell response
Il4r        | Interleukin 4 receptor                   | inhibits IL4-mediated cell proliferation
Mapk4       | Mitogen activated kinase 4               | signal transduction
AML1        | acute myeloid leukemia 1 (aml1 oncogene) | CD4 silencing during T-cell differentiation
Rb1         | Retinoblastoma 1                         | cell cycle control
Using supervised DTI, we examined the strength of influence from each of the TF-encoding genes ($A_i$) to Gata3, based on expression level [10, http://spring.imb.uq.edu.au/]. These "strength of influence" DTI values are first checked for significance at a p-value of 0.05 and then ranked from highest to lowest (noting that the objective is to maximize $I(A_i \to Gata3)$). Based on this ranking, we indicate some of the TFs that have the highest influence on Gata3 expression (Fig. 7). Obviously, this information is far from complete, because both the effectors and Gata3 are examined only at the mRNA level.
Fig. 7. Putative upstream TFs for the Gata3 gene, using DTI. The numbers in each TF oval represent the DTI rank of the respective TF.
Table 4 shows the embryonic kidney-specific expression of the TFs from Fig. 7. This is an independent annotation obtained from UNIPROT (http://expasy.org/sprot/).
Table 4. Functional annotations of some of the transcription factor genes putatively influencing Gata3 regulation in kidney.

Gene Symbol | Description                                | Expressed in Kidney
PPAR        | peroxisome proliferator-activated receptor | Y
Pax2        | Paired Homeobox-2                          | Y
HIF1        | Hypoxia-inducible factor 1                 | Y
SP1         | SP1 transcription factor                   | Y
GLI         | GLI-Kruppel family member                  | Y
EGR3        | early growth response 3                    | Y
To understand the notion of kidney-specific regulation of Gata3 expression by various transcription factors, we have integrated three different criteria. We expect that the TFs regulating expression would have an influence on Gata3 expression, be expressed in the kidney, and have a conserved binding site at the Gata3 promoter. This is clarified in part by Fig. 7 and Table 4. As an example, we see that the TFs Pax2, PPAR and SP1 have high influence via DTI and are expressed in the embryonic kidney (Table 4), apart from having conserved TFBS. This lends good computational evidence for the role of these TFs in Gata3 regulation, and presents a reasonable hypothesis worthy of experimental validation. As an additional step, we also examined the influence for another two TFs, STE12 and HP1, both of which have a high co-expression correlation with Gata3 as well as conserved TFBS in the promoter
region. The DTI criterion gave us no evidence of influence between these two TFs and Gata3's activity. We believe that this information, coupled with the present evidence concerning the non-kidney specificity of STE12 and HP1, presents some argument for the non-involvement of these TFs in the kidney-specific regulation of Gata3. Hopefully, these findings will guide a more focused experiment to identify the key TFs involved in Gata3 activity.
CONCLUSIONS

In this work, we have presented the notion of directed information (DTI) as a reliable criterion for the inference of influence in gene networks. After motivating the utility of DTI in discovering directed non-linear interactions, we present two variants of DTI that can be used depending on context. One version, unsupervised-DTI, like traditional network inference, enables the discovery of influences (regulatory or non-regulatory) among any given set of genes. The other version, supervised-DTI, aids the modeling of the strength of influence between two specific genes of interest, a question arising during the study of transcriptional influence. It is interesting that DTI enables the use of the same framework for both these purposes and is general enough to accommodate arbitrary lag, non-linearity, loops and direction. We see that the above combination of supervised and unsupervised variants makes them applicable to several important problems in bioinformatics (such as upstream TF discovery), some of which are presented in the Results section. The network inference approach can also allow the incorporation of additional biophysical knowledge, pertaining both to physical mechanisms and to protein interactions that exist during transcription. We point out that, given the diverse nature of biological data of varying throughput, one has to adopt an approach that integrates such data to make biologically relevant findings, and hence the DTI metric fits very naturally into such an integrative framework.
ACKNOWLEDGEMENTS

The authors gratefully acknowledge the support of the NIH under award 5R01-GM028896-21 (J.D.E). We would like to thank Prof. Sandeep Pradhan and Mr. Ramji Venkataramanan for useful discussions
on Directed information. We are also grateful to the reviewers for having helped us to improve the quality of the manuscript.
References
1. Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL, Falciani F, "Modeling T-cell activation using gene expression profiling and state-space models", Bioinformatics, 20(9), 1361-72, June 2004.
2. Stuart RO, Bush KT, Nigam SK, "Changes in gene expression patterns in the ureteric bud and metanephric mesenchyme in models of kidney development", Kidney International, 64(6), 1997-2008, December 2003.
3. Hashimoto RF, Kim S, Shmulevich I, Zhang W, Bittner ML, Dougherty ER, "Growing genetic regulatory networks from seed genes", Bioinformatics, 2004 May 22;20(8):1241-7.
4. Woolf PJ, Prudhomme W, Daheron L, Daley GQ, Lauffenburger DA, "Bayesian analysis of signaling networks governing embryonic stem cell fate decisions", Bioinformatics, 2005 Mar;21(6):741-53.
5. Rao A, Hero AO, States DJ, Engel JD, "Inference of biologically relevant Gene Influence Networks using the Directed Information Criterion", Proc. of the IEEE Conference on Acoustics, Speech and Signal Processing, 2006.
6. Opgen-Rhein R and Strimmer K, "Using regularized dynamic correlation to infer gene dependency networks from time-series microarray data", Proc. of the Fourth International Workshop on Computational Systems Biology, WCSB, 2006.
7. G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space", IEEE Trans. on Information Theory, vol. 45, pp. 1315-1321, May 1999.
8. Hastie T, Tibshirani R, The Elements of Statistical Learning, Springer, 2002.
9. Geweke J, "The Measurement of Linear Dependence and Feedback Between Multiple Time Series", Journal of the American Statistical Association, 1982, 77, 304-324. (With comments by E. Parzen, D. A. Pierce, W. Wei, and A. Zellner, and rejoinder.)
10. Challen G, Gardiner B, Caruana G, Kostoulias X, Martinez G, Crowe M, Taylor DF, Bertram J, Little M, Grimmond SM, "Temporal and spatial transcriptional programs in murine kidney development", Physiol Genomics, 2005 Oct 17;23(2):159-71.
11. Kreiman G, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes", Nucleic Acids Res, 2004 May 20;32(9):2889-900.
12. MacIsaac KD, Fraenkel E, "Practical strategies for discovering regulatory DNA sequence motifs", PLoS Comput Biol, 2006 Apr;2(4):e36.
13. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A, "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context", BMC Bioinformatics, 2006 Mar 20;7 Suppl 1:S7.
14. J. Ramsay, B. W. Silverman, Functional Data Analysis (Springer Series in Statistics), Springer, 1997.
15. H. Joe, "Relative entropy measures of multivariate dependence", J. Am. Statist. Assoc., 84:157-164, 1989.
16. Grote D, Souabni A, Busslinger M, Bouchard M, "Pax 2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney", Development, 2006 Jan;133(1):53-61.
17. Zhang SL, Moini B, Ingelfinger JR, "Angiotensin II increases Pax-2 expression in fetal kidney cells via the AT2 receptor", J Am Soc Nephrol, 2004 Jun;15(6):1452-65.
18. Balmer JE, Blomhoff R, "Gene expression regulation by retinoic acid", J. Lipid Res, 2002 Nov;43(11):1773-808.
19. Efron B, Tibshirani RJ, An Introduction to the Bootstrap (Monographs on Statistics and Applied Probability), Chapman & Hall/CRC, 1994.
20. I. Ovcharenko, M.A. Nobrega, G.G. Loots, and L. Stubbs, "ECR Browser: a tool for visualizing and accessing data from comparisons of multiple vertebrate genomes", Nucleic Acids Research, 32, W280-W286, 2004.
21. Khandekar M, Suzuki N, Lewton J, Yamamoto M, Engel JD, "Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system", Mol Cell Biol, 2004 Dec;24(23):10263-76.
22. J. Massey, "Causality, feedback and directed information", in Proc. 1990 Symp. Information Theory and Its Applications (ISITA-90), Waikiki, HI, Nov. 1990, pp. 303-305.
23. Hudson JE, "Signal Processing Using Mutual Information", Signal Processing Magazine, 23(6):50-54, Nov. 2006.
24. Gubner JA, Probability and Random Processes for Electrical and Computer Engineers, Cambridge, 2006.
25. Schwab K, Patterson LT, Aronow BJ, Luckas R, Liang HC, Potter SS, "A catalogue of gene expression in the developing kidney", Kidney Int, 2003 Nov;64(5):1588-604.
26. H. Marko, "The Bidirectional Communication Theory - A Generalization of Information Theory", IEEE Transactions on Communications, Vol. COM-21, pp. 1345-1351, 1973.
27. Benjamini Y and Hochberg Y, "Controlling the false discovery rate: A practical and powerful approach to multiple testing", J. Roy. Statist. Soc. Ser. B, 1995; 57:289-300.
28. Cover TM, Thomas JA, Elements of Information Theory, Wiley-Interscience, 1991.
29. Ezzat S, Mader R, Yu S, Ning T, Poussier P, Asa SL, "Ikaros integrates endocrine and immune system development", J Clin Invest, 2005 Apr;115(4):844-8.
30. Zhang DH, Yang L, and Ray A, "Differential responsiveness of the IL-5 and IL-4 genes to transcription factor GATA-3", J Immunol, 161:3817-3821, 1998.
31. Rogoff HA, Pickering MT, Frame FM, Debatis ME, Sanchez Y, Jones S, Kowalik TF, "Apoptosis associated with deregulated E2F activity is dependent on E2F1 and Atm/Nbs1/Chk2", Mol Cell Biol, 2004 Apr;24(7):2968-77.
DISCOVERING PROTEIN COMPLEXES IN DENSE RELIABLE NEIGHBORHOODS OF PROTEIN INTERACTION NETWORKS
Xiao-Li Li*
Knowledge Discovery Department, Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
* Email: [email protected]
Chuan-Sheng Foo
Computer Science Department, Stanford University, Stanford CA 94305-9025, USA
Email: csfoo@stanford.edu
See-Kiong Ng
Knowledge Discovery Department, Institute for Infocomm Research, Heng Mui Keng Terrace, 119613, Singapore
Email: [email protected]

Multiprotein complexes play central roles in many cellular pathways. Although many high-throughput experimental techniques have already enabled systematic screening of pairwise protein-protein interactions en masse, the amount of experimentally determined protein complex data has remained relatively lacking. As such, researchers have begun to exploit the vast amount of pairwise interaction data to help discover new protein complexes. However, mining for protein complexes in interaction networks is not an easy task because there are many data artefacts in the underlying protein-protein interaction data due to the limitations in current high-throughput screening methods. We propose a novel DECAFF (Dense-neighborhood Extraction using Connectivity and conFidence Features) algorithm to mine for dense and reliable subgraphs in protein interaction networks. Our method is devised to address two major limitations in current high-throughput protein interaction data, namely, incompleteness and high data noise. Experimental results with yeast protein interaction data show that the interaction subgraphs discovered by DECAFF matched significantly better with actual protein complexes than those from other existing approaches. Our results demonstrate that pairwise protein interaction networks can be effectively mined to discover new protein complexes, provided that the data artefacts in the underlying interaction data are adequately taken into account.
1. INTRODUCTION

Multiprotein complexes play central roles in many cellular pathways. Common examples include the ribosomes for protein biosynthesis, the proteasomes for breaking down proteins, and the nuclear pore complexes for regulating proteins passing through the nuclear membrane. Searching for protein complexes is therefore an important research focus in molecular and cell biology. However, while tens of thousands of pairwise protein-protein interactions have been detected by high-throughput experimental techniques (e.g. yeast two-hybrid), only a small subset of the many possible protein complexes has been experimentally determined [1]. Given that protein complexes are molecular aggregations of proteins assembled from multi-
*Corresponding author.
ple stable protein-protein interactions, researchers have recently begun to explore the possibility of exploiting the current abundant datasets of pairwise protein-protein interactions to help discover new protein complexes (see Section 2). In fact, it has been observed that densely connected regions in protein interaction graphs often correspond to actual protein complexes [2], suggesting that the identities of protein complexes can be revealed as tight-knitted subcommunities in protein-protein interaction maps. This has led to previous works that looked into the mining of cliques or other dense graphical subcomponents in the interaction graphs for putative complexes [4-7]. However, the protein interaction networks derived from current high-throughput screening meth-
ods are not an easy source for mining, as there are still many data artefacts in the underlying interaction data due to inherent experimental limitations. In fact, it has been repeatedly shown that the current protein interaction data is still incomplete and noisy [8], and it is important to take this into account when devising algorithms to mine protein interaction networks. For example, the use of cliques for detecting complexes would be too constraining and cannot provide satisfactory coverage. In this work, we propose a novel DECAFF (Dense-neighborhood Extraction using Connectivity and conFidence Features) algorithm that is devised to address two major limitations in current high-throughput protein interaction data, namely, incompleteness and high data noise. Unlike conventional methods, our DECAFF method specifically mines for maximal dense local neighborhoods (instead of cliques) and filters out unreliable protein complexes by estimating the reliability of each protein interaction in the network. Experimental results with yeast protein interaction data show that the interaction subgraphs discovered by DECAFF matched significantly better with actual protein complexes than those from other existing approaches. Our results confirm that there are indeed dense graphical subcomponents in the pairwise protein interaction networks that correspond to actual multiprotein complexes, and that we can exploit the interactome to help map the protein complexome more effectively by taking into account the data artefacts in the underlying protein interaction data.
2. RELATED WORKS

By modeling protein interaction data as a large undirected graph where the vertices represent unique proteins and edges denote interactions between two proteins, Ref. 2 was one of the first to reveal that protein complexes generally correspond to dense regions (highly interconnected subgraphs) in the protein interaction graphs. Ref. 3 then exploited this finding and used cliques (fully connected subgraphs) as a basis to detect protein complexes and functional modules in protein interaction networks. However, the use of cliques was too constraining given the incompleteness of the currently available interaction data; as a result, the method could detect only a few protein complexes. Bader then proposed a novel MCODE algorithm that discovered protein complexes based on the proteins' connectivity values in a protein interaction graph [4]. The algorithm first computes a vertex weighting from its neighbor density and then traverses outward from a seed protein with a high weighting value to recursively include neighboring vertices whose weights are above a given threshold. As the highly weighted vertices may not be highly connected to one another, this approach does not guarantee that the discovered regions are dense. As a result, not all the detected regions correspond to protein complexes. In fact, in the post-processing step of the MCODE algorithm, there was a need to filter for the so-called "2-core"s as an attempt to eliminate some obvious non-dense regions detected by the algorithm. Clustering algorithms have also been proposed to identify dense regions in a given graph by partitioning it into disjoint clusters. However, these general graph clustering algorithms cluster each vertex (protein) into one specific group, which makes them inappropriate for this biological application, as a protein is often involved in multiple complexes (i.e. clusters) [13]. Another clustering approach was proposed by Ref. 5, which used a restricted neighborhood search clustering algorithm (RNSC) to predict protein complexes by partitioning the protein-protein interaction network using a cost function. However, like many clustering algorithms, its results depend on the quality of the initial random seeds. In addition, there were relatively few complexes predicted by this algorithm, reflecting another limitation of clustering approaches. In our recent work [6], we proposed the LCMA algorithm (Local Clique Merging Algorithm) to mine dense subgraphs for protein complexes. Instead of adopting the over-constraining cliques as the basis for protein complexes, LCMA adopted a local clique merging method as an attempt to address the current incompleteness limitation of protein interaction data. Evaluation results showed that LCMA was better at detecting complexes than the full clique [3], MCODE [4] and RNSC [5] algorithms. However, LCMA also shares the same drawback as MCODE in that the graphical components detected by the algorithm are not guaranteed to be dense subgraphs. Most recently, Ref. 7 proposed an algorithm based on the assumption that two nodes that belong to the same cluster have more common neigh-
bors than two nodes that are not in the same cluster. Besides ensuring the high density (≥ 0.7) of a graph, their algorithm also keeps track of its cluster property, a numerical measure of whether a dense graph contains more than one dense component. If a graph has a low value for the cluster property, then it will be separated into multiple subgraphs. However, given the high proportion of noisy protein interactions (up to 50%) in current protein interaction networks, the formation of clusters will be greatly affected when the algorithm computes the cluster property. In this paper, we propose the DECAFF algorithm, which first mines local dense neighborhoods (in addition to local cliques) for each vertex (protein) and then merges these local neighborhoods according to their affinity to form maximal dense regions that correspond to possible protein complexes. In addition, given the potentially high false positive rate in the protein interaction data, DECAFF also filters away possible false protein complexes that have low reliability scores, ensuring that the proteins in the predicted protein complexes are connected by high-confidence protein interactions in the underlying network. The overall DECAFF algorithm is described in Section 3.3.

3. THE PROPOSED TECHNIQUES

Mathematically, a protein-protein interaction (PPI) network can be represented as a graph $G_{ppi} = (V_{ppi}, E_{ppi})$, where $V_{ppi}$ represents the set of the interacting proteins and $E_{ppi}$ denotes all the detected pairwise interactions between proteins from $V_{ppi}$. Our objective is to detect a set of subgraphs $C = \{ g = (V, E) \mid |V| \ge 3, V \subseteq V_{ppi}, E \subseteq E_{ppi} \}$, where each $g$ is a dense subgraph (possibly overlapping) in $G_{ppi}$ that may correspond to an actual multiprotein complex. Additionally, since many false positive protein interactions in $G_{ppi}$ may be assembled into false protein complexes, we also require that each detected dense graph $g$ has a high reliability score.
3.1. Mining for dense subgraphs

Let us first introduce the notion of the local neighborhood graph for each vertex:

Definition 3.1. The local neighborhood graph of a vertex $v_i \in V$ in $G = (V, E)$ is defined as $G_{v_i} = (V_{v_i}, E_{v_i})$, where

$$V_{v_i} = \{v_i\} \cup \{u \mid u \in V, \{u, v_i\} \in E\}, \quad E_{v_i} = \{\{v_j, v_k\} \mid \{v_j, v_k\} \in E,\; v_j, v_k \in V_{v_i}\}. \qquad (1)$$
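In code, Definition 3.1 is a two-liner over an adjacency-set representation (our sketch; the dict-of-sets encoding is an assumption):

```python
def local_neighborhood(graph, v):
    """Definition 3.1: the subgraph on v and its immediate neighbors.
    graph: dict mapping each protein to the set of its interaction partners."""
    nodes = {v} | graph[v]
    edges = {frozenset((a, b)) for a in nodes for b in graph[a] if b in nodes}
    return nodes, edges
```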
In other words, vertex $v_i$'s local neighborhood graph is the subgraph formed by $v_i$ and all its immediate neighbors, with the corresponding interactions in $G$. In this work, we have devised our algorithm to focus first on each vertex's local neighborhood graph in a bottom-up fashion, as it is impractical to directly detect dense subgraphs in a top-down fashion from $G_{ppi}$, which is usually a very large graph with thousands of vertices and tens of thousands of edges. Let us now define the notion of the density of a graph:
Definition 3.2. The density of a graph $g = (V, E)$ is defined as its clustering coefficient (cc) [12]:

$$cc(g) = \frac{2|E|}{|V| \left( |V| - 1 \right)}.$$

Note that $0 \le cc(g) \le 1$, since the maximum number of edges in an undirected graph $g = (V, E)$ is $|V| \cdot (|V| - 1)/2$. If $g$ is a clique, then $cc(g) = 1$, as it has the maximum number of edges. In this work, we detect putative protein complexes from dense subgraphs of $G_{ppi}$ instead of the conventional requirement for cliques. We define a dense graph as one in which the density is at least $\max(\delta, 0.5)$, where $\delta$ is a user-defined threshold to provide for more stringent conditions. The results reported in this paper are based on setting $\delta$ to 0.7, which is also the same setting used in the recent work by Ref. 7.
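The density test is then immediate (our sketch; the displayed cc formula is reconstructed from the surrounding text):

```python
def density(nodes, edges):
    """cc(g) = 2|E| / (|V| * (|V| - 1)); equals 1 exactly for a clique."""
    nv = len(nodes)
    return 2.0 * len(edges) / (nv * (nv - 1)) if nv > 1 else 0.0

def is_dense(nodes, edges, delta=0.7):
    """DECAFF's dense-subgraph test: density at least max(delta, 0.5)."""
    return density(nodes, edges) >= max(delta, 0.5)
```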
The following theorem indicates that we can adopt a bottom-up approach to discover dense subgraphs from the protein interaction network:

Theorem 1. Every dense neighborhood g in G_PPI can be assembled using only the dense neighborhoods of its inner vertices.

The formal proof of Theorem 1 can be found in Appendix A of the Supplementary Materials (available at http://www1.i2r.a-star.edu.sg/~xlli/csb-supp.pdf). Theorem 1 suggests a strategy of first finding the local dense neighborhoods for each vertex, and then obtaining larger dense neighborhoods by merging these dense sub-regions. As such, the DECAFF algorithm mines for dense subgraphs in two steps:
(1) First, we compute the local dense neighborhoods for all the vertices in the given interaction graph G_PPI. We use a local clique mining method to locate the local cliques, and then deploy a novel hub-removal technique to heuristically detect local dense subgraphs in each vertex's local neighborhood graph. Such systematic scanning of the local dense neighborhoods in the entire interaction graph allows DECAFF to discover most of the local dense regions, resulting in significantly higher recall than other algorithms (see Section 4).

(2) Then, we merge the extracted local dense neighborhoods to obtain maximal dense neighborhoods that correspond to larger complexes.
3.1.1. Mining for local dense subgraphs

Given that we already have an efficient method for discovering local cliques (Ref. 6), we first mine for each vertex's local cliques, and then expand the collection with other local dense subgraphs using a hub-removal procedure that we describe shortly. In this way, we ensure that both cliques and non-clique dense subgraphs are detected effectively.
Fig. 1. A local clique obtained from YBR112C’s local neighborhood graph
To detect local cliques, we adapt the method from the LCMA algorithm, which is essentially an elimination process in which the neighborhood vertices of a given vertex are iteratively removed, starting from the least connected vertex (the vertex with the lowest degree), to increase the overall density of the local neighborhood graph. The details of this
step can be found in Appendix B of the Supplementary Materials. Here, we show an example (Figure 1) of mining a local clique from the local neighborhood graph of the vertex (protein) YBR112C to illustrate how it works. In this case, the neighbors YIL061C, YDR043C, YGL035C, YMR240C, YCL067C, YLR176C and YCR084C were sequentially removed. This results in the final local dense neighborhood shown in the circled area of Figure 1, which is a clique d = (V, E) with V = {YBR112C, YDL005C, YOR174W, YGL025C} and density cc(d) = 1 (|V| = 4 and |E| = 6). Although the LCMA algorithm can obtain the local cliques, an actual protein complex may not be present as a fully connected subgraph in a protein interaction network, for various reasons previously discussed (e.g. the incompleteness of current protein interaction data). There are thus possibly many other dense but non-clique subgraphs for each vertex that could form parts of a target complex. In DECAFF, we devise a Hub Removal algorithm to efficiently detect multiple dense subgraphs with densities larger than the given threshold δ. In the hierarchical network model proposed by Ref. 14, a biological network is constructed from a small cluster of highly connected nodes by generating replicas of the network at each step and linking the external nodes of the replicated clusters to the central node of the old cluster. This construction procedure suggests a heuristic for recovering the smaller dense clusters in the network by reversing the process, which forms the basis for the Hub Removal algorithm. Basically, we start by removing the most highly connected node (the hub) and its corresponding edges from the network, and then recursively repeat this procedure on its connected components, until a dense cluster is recovered and the removed hub is re-inserted back into the cluster. A more detailed description of this algorithm can be found in Appendix B of the Supplementary Materials. Figure 2 shows the results of applying the Hub Removal algorithm to further discover dense subgraphs in the local neighborhood of the protein YBR112C. While the LCMA algorithm could only discover a single fully connected graph {YBR112C, YDL005C, YOR174W, YGL025C} in this neighborhood graph, our recursive Hub Removal algorithm is able to detect an additional 4 dense subgraphs: {YBR112C, YGL035C,
YMR240C}, {YBR112C, YCR084C, YLR176C}, {YBR112C, YCR084C, YCL067C} and {YBR112C, YDL005C, YOR174W, YGL025C, YCR084C}. Note that as this approach allows the discovery of multiple, possibly overlapping, dense neighborhoods for each vertex, it also allows for the possibility of a vertex (protein) participating in multiple complexes.
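A compact way to see this reversal heuristic is as a recursion over connected components. The sketch below is our reading of the procedure under stated assumptions (the exact re-insertion rule is in the paper's Appendix B); it reuses density() from the earlier sketch, and all names are ours.

```python
import networkx as nx

def hub_removal(g, delta=0.7, min_size=3, found=None):
    """Recursively peel off the highest-degree node (the hub); whenever a
    component plus the re-inserted hub is dense, record it as a local dense
    subgraph. This is a heuristic reading of the Hub Removal step."""
    if found is None:
        found = []
    if g.number_of_nodes() < min_size:
        return found
    if density(g) >= max(delta, 0.5):
        found.append(set(g.nodes()))
        return found
    hub = max(g.nodes(), key=g.degree)
    rest = g.subgraph(set(g.nodes()) - {hub})
    for comp in nx.connected_components(rest):
        candidate = g.subgraph(comp | {hub})  # re-insert the hub into this component
        if density(candidate) >= max(delta, 0.5) and len(candidate) >= min_size:
            found.append(set(candidate.nodes()))
        else:
            hub_removal(g.subgraph(comp), delta, min_size, found)
    return found
```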
Fig. 2. Multiple dense subgraphs obtained from YBR112C's local neighborhood graph
3.1.2. Merging for maximal dense neighborhoods

In an interaction graph with potentially incomplete interaction data, it is likely that a large protein complex is present in the PPI graph as a composite of multiple overlapping dense neighborhoods. In addition, there is also biological evidence that many complexes are formed from multiple substructures such as subcomplexes (Refs. 8, 15). We therefore adopt an additional step to merge the individual local dense neighborhoods (detected in Section 3.1.1) using a heuristic that assigns overlapping neighborhoods with comparable sizes a high affinity to be merged.
Definition 3.3. Neighborhood Affinity. Given two neighborhoods (subgraphs) A and B, we define the Neighborhood Affinity NA between them as

    NA(A, B) = |A ∩ B|² / (|A| · |B|)    (3)

Equation 3 quantifies the degree of similarity between neighborhoods. Note that if one neighborhood's size, e.g. |B|, is much bigger than |A|, then NA(A, B) will be small, since |A ∩ B|/|A| ≤ 1 and |A ∩ B| << |B|. Our heuristic is based on the hypothesis that if two neighborhoods have a larger intersection set and similar sizes, then they are more similar and have a larger affinity. The merging step takes the set of local dense neighborhoods LDN (comprising the local cliques output by the LCMA algorithm and the dense neighborhoods obtained from the Hub Removal algorithm) and tries to merge neighborhoods whose affinity values are greater than a threshold ω. The merging process is performed iteratively until the average density of the subgraphs in LDN starts to fall. The details of the algorithm are provided in Appendix B of the Supplementary Materials, which also contains a further illustrative example in Appendix C.
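In code, the affinity test and the merging loop look roughly as follows. This is a simplified sketch: the paper's actual stopping rule monitors the average density of LDN, which we approximate here with a fixed-point loop, and all names are ours.

```python
def neighborhood_affinity(A, B):
    """NA(A, B) = |A ∩ B|^2 / (|A| * |B|), Definition 3.3; A and B are sets."""
    i = len(A & B)
    return i * i / (len(A) * len(B))

def merge_neighborhoods(LDN, omega=0.3):
    """Repeatedly merge the first pair of neighborhoods with NA > omega
    until no pair qualifies (a simplification of the paper's criterion)."""
    LDN = [set(n) for n in LDN]
    merged = True
    while merged:
        merged = False
        for i in range(len(LDN)):
            for j in range(i + 1, len(LDN)):
                if neighborhood_affinity(LDN[i], LDN[j]) > omega:
                    LDN[i] |= LDN.pop(j)
                    merged = True
                    break
            if merged:
                break
    return LDN
```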
3.2. Filtering for reliable subgraphs

In the previous section, we took into consideration the possible incompleteness (missing interactions) of protein interaction datasets by mining for dense subgraphs and using a merging process to build up larger complexes. However, as it is also well known that many high-throughput protein interaction datasets contain a high rate of false positives (noisy interactions), our algorithm could be susceptible to false positive interactions, especially since we have employed a relatively relaxed graphical constraint to infer protein complexes. To minimize the false detection of complexes assembled from false positive interactions, we perform an additional filtering process on the detected subgraphs (i.e. complexes): we model the protein interaction network as a weighted graph in which each protein interaction (edge) is assigned a weight that corresponds to its reliability, and then filter out those detected dense subgraphs that consist of protein interactions with low reliability.

3.2.1. Computing reliability of protein interactions
We begin by assigning a prior reliability to each protein interaction using the approach proposed by Ref. 16. The method first computes a reliability score for each experimental source, since protein interactions discovered through different experimental sources may be of different quality. This score is computed using additional biological information on the proteins, and it is defined to be the fraction of inter-
action pairs from each source that share at least one function. Then, using the reliability score for each experimental source, the method estimates the prior reliability r_{u,v} for each individual protein-protein interaction (u, v) as follows (Refs. 17, 18):
Definition 3.4. The prior reliability of a protein-protein interaction pair r_{u,v} is defined as

    r_{u,v} = 1 − Π_{i ∈ ES_{u,v}} (1 − r_i)^{n_{i,u,v}}    (5)

where r_i is the reliability score of experimental source i, ES_{u,v} is the set of experimental sources in which the interaction (u, v) was observed, and n_{i,u,v} is the number of times that (u, v) was detected in experimental source i. The rule of thumb is that protein interactions discovered through multiple experiments tend to be more reliable.

Note that the reliability score r_{u,v} in Definition 3.4 only captures the confidence of the underlying data sources. To determine whether a specific interaction detected between a pair of proteins (u, v) is a reliable one, we also need to check whether the proteins u and v share a function (in this work we use the MIPS functional catalog, http://mips.gsf.de/desc/yeast/). Therefore, we compute a posterior reliability R_{u,v} for each protein interaction based on the following three cases. We use R to denote the event that the given interaction is a true interaction (i.e. it is reliable), S to denote the event that the proteins in the given interaction share a common function, D to denote the event that the proteins in the given interaction do not share a common function, and U to denote the event that either protein (or both proteins) has unknown functions.

Case 1: The two proteins share a common function. In this case, P(R|S), the probability that the interaction is true given that the proteins share a common function, can be written as:

    P(R|S) = P(S|R) · P(R) / P(S)

Note that P(R) is the prior reliability, i.e., P(R) = r_{u,v}. P(S), the probability that the two proteins have a common function, can be formulated as

    P(S) = P(S|R) · P(R) + P(S|¬R) · P(¬R)    (6)

Together, the above equations give the posterior reliability P(R|S) as long as we can estimate P(S|R) and P(S|¬R). In this paper, we estimate P(S|R) using a small-scale experimental data set ss from the DIP protein interaction set (http://dip.doe-mbi.ucla.edu/):

    P(S|R) = |{(p1, p2) ∈ ss | share(p1, p2)}| / |ss|    (7)

where share(p1, p2) denotes that proteins p1 and p2 share at least one function. To estimate P(S|¬R), we randomly selected 1 million protein pairs that are not present in current protein interaction datasets to form a non-reliable protein interaction set ns. Then, P(S|¬R) is estimated as follows:

    P(S|¬R) = |{(p1, p2) ∈ ns | share(p1, p2)}| / |ns|    (8)

Case 2: The two proteins do not share a common function. In this case, P(R|D) can be computed as:

    P(R|D) = P(D|R) · P(R) / P(D)    (9)

where P(D) and P(D|R) are computed using Equations 10 and 11 respectively:

    P(D) = 1 − P(S)    (10)

    P(D|R) = 1 − P(S|R)    (11)

Note that both P(S) in Equation 10 and P(S|R) in Equation 11 have already been computed in Equations 6 and 7 respectively.

Case 3: Either protein's function is unknown. In this case, we compute the posterior reliability P(R|U) given that either u or v (or both) has unknown function:

    P(R|U) = P(S) · P(R|S) + P(D) · P(R|D)    (12)

Again, all the terms on the right hand side of Equation 12 have already been computed in the previous cases. Given a protein interaction (u, v), its posterior reliability R_{u,v} can be obtained through the computation of P(R|S), P(R|D) or P(R|U), depending on
the available information about the functions of proteins u and v. Note that for those proteins with unknown function, it is also possible to predict their functions by utilizing the topological information of PPI networks and gene expression data (Refs. 16, 19).
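The three cases reduce to a handful of arithmetic steps once P(S|R) and P(S|¬R) have been estimated. Below is a minimal sketch; the function names and input encodings (a source-to-count map, function annotations as Python sets) are our own choices, not the paper's.

```python
def prior_reliability(source_counts, source_scores):
    """Eq. (5): r_uv = 1 - prod_i (1 - r_i)^n_i over the experimental sources;
    source_counts maps a source id i to the number of times it reported (u, v)."""
    miss = 1.0
    for i, n in source_counts.items():
        miss *= (1.0 - source_scores[i]) ** n
    return 1.0 - miss

def posterior_reliability(r_uv, p_s_r, p_s_nr, funcs_u, funcs_v):
    """Cases 1-3 of Section 3.2.1; funcs_u / funcs_v are (possibly empty)
    sets of functional annotations, empty meaning unknown function."""
    p_r = r_uv
    p_s = p_s_r * p_r + p_s_nr * (1.0 - p_r)        # Eq. (6)
    p_r_s = p_s_r * p_r / p_s                        # Bayes rule, Case 1
    p_d = 1.0 - p_s                                  # Eq. (10)
    p_r_d = (1.0 - p_s_r) * p_r / p_d                # Eqs. (9) and (11)
    if not funcs_u or not funcs_v:                   # Case 3: unknown function
        return p_s * p_r_s + p_d * p_r_d             # Eq. (12)
    return p_r_s if funcs_u & funcs_v else p_r_d     # Cases 1 and 2
```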
3.2.2. Computing reliability of detected complexes

In this work, we detect a putative multiprotein complex as a subgraph g = (V, E). We define its reliability as the average reliability score of all the protein interactions in E:
Definition 3.5. The reliability of a graph g = (V, E) is defined as:

    reliability(g) = (1 / |E|) · Σ_{(u,v) ∈ E} R_{u,v}    (13)
Suppose the mean and standard deviation of the reliability distribution are μ and σ respectively. A subgraph g of G_PPI is regarded as reliable if reliability(g) − μ ≥ max(0.5, γ) · σ, where γ is a user-defined threshold that provides for a more stringent reliability requirement if necessary: the bigger the value of γ, the more reliable the predicted complexes, since their constituent protein interactions are more reliable.
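A direct transcription of Definition 3.5 and the filtering rule follows, again as a hedged sketch; edge reliabilities R are assumed here to be keyed by frozenset pairs.

```python
def complex_reliability(nodes, G, R):
    """Definition 3.5: average posterior reliability over the edges of the
    subgraph of G induced by `nodes`."""
    edges = list(G.subgraph(nodes).edges())
    return sum(R[frozenset(e)] for e in edges) / len(edges)

def filter_reliable(complexes, G, R, gamma=0.95):
    """Keep complex c only if reliability(c) - mu >= max(0.5, gamma) * sigma."""
    scores = [complex_reliability(c, G, R) for c in complexes]
    mu = sum(scores) / len(scores)
    sigma = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [c for c, s in zip(complexes, scores)
            if s - mu >= max(0.5, gamma) * sigma]
```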
3.3. The overall DECAFF algorithm

The overall DECAFF algorithm is shown in Algorithm 1 as follows:
Algorithm 1. Overall DECAFF algorithm
1. Run the LCMA algorithm to detect the local cliques (stored in set LC) for each protein;
2. Run the Hub Removal algorithm to detect the local dense subgraphs (stored in set DS);
3. LDN = DS ∪ LC;
4. Run the merging algorithm to merge for maximal dense neighborhoods from LDN, which are stored in set C;
5. FOR each graph c ∈ C
6.     IF (reliability(c) − μ < max(0.5, γ) · σ)
7.         C = C − {c};
8.     ENDIF
9. ENDFOR

In Algorithm 1, we first compute the local dense neighborhoods for all the vertices in the given inter-
action graph G_PPI. In particular, step 1 employs a local clique mining method to locate the local cliques, and step 2 deploys a novel hub-removal technique to detect local dense subgraphs in each vertex's local neighborhood graph. Such systematic scanning of the local dense neighborhoods in the entire interaction graph allows DECAFF to discover most of the local dense regions (stored in LDN in step 3), resulting in significantly higher recall than other algorithms. Then, step 4 merges the local dense neighborhoods extracted in the first two steps to obtain maximal dense neighborhoods that correspond to larger complexes. Finally, in steps 5 to 9, we filter away from set C possible false protein complexes that have low reliability scores, ensuring that the proteins in the predicted protein complexes are connected by high-confidence protein interactions in the underlying network. The protein complexes in set C are output as the final predicted complexes.
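Putting the earlier sketches together, the whole pipeline becomes a short driver. This is only an orchestration sketch: local_cliques() below is a placeholder for the LCMA step, which is not reproduced here.

```python
def decaff(G, R, delta=0.7, omega=0.3, gamma=0.95):
    """End-to-end sketch of Algorithm 1, built from the helper functions
    sketched in the preceding sections."""
    LDN = []
    for v in G.nodes():
        g_v = local_neighborhood_graph(G, v)
        LDN.extend(local_cliques(g_v))          # step 1: LCMA (not shown here)
        LDN.extend(hub_removal(g_v, delta))     # step 2: local dense subgraphs
    C = merge_neighborhoods(LDN, omega)         # step 4: maximal neighborhoods
    return filter_reliable(C, G, R, gamma)      # steps 5-9: reliability filter
```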
4. EXPERIMENTS
For evaluation, we applied our DECAFF algorithm to three experimental protein-protein interaction datasets for yeast to facilitate comparisons with various current techniques. The first dataset was collected by Ref. 4. It was used by both the MCODE algorithm and the LCMA algorithm to mine protein complexes. The dataset was assembled from all machine-readable resources in 2003: Uetz (Ref. 20), Ito (Ref. 21), Drees (Ref. 22), Fromont-Racine (Ref. 23), Ho (Ref. 24), Gavin (Ref. 8), Tong (Ref. 2), Mewes (MIPS) (Ref. 25) and Costanzo (YPD) (Ref. 26). In total, it consists of 15,143 experimentally determined protein-protein interactions among 4,825 yeast proteins. The second protein interaction dataset was collected from the MIPS database and consists of 15,456 interactions (of which 12,526 are unique protein interactions) among 4,554 proteins. The data is publicly available from ftp://ftpmips.gsf.de/yeast/PPI/PPI-18052006.tab. It was used by Ref. 7 to mine for protein complexes. The third dataset was collected from BIOGRID and consists of 82,633 interactions (of which 51,105 are unique) among 5,299 proteins. The dataset was downloaded from http://www.thebiogrid.org (Ref. 27). BIOGRID is the most comprehensive of the three protein interaction datasets.
4.1. Reference complexes and evaluation metric
We evaluated the experimental results against a reference dataset of known yeast protein complexes retrieved from MIPS (ftp://ftpmips.gsf.de/yeast/). The protein complexes in this dataset have been curated from the biomedical literature. While it is probably one of the most comprehensive public datasets of yeast complexes available, it is by no means complete: there are still many yeast complexes that remain to be discovered (hence the motivation for this work). After filtering, we obtained a final set of 215 yeast complexes as our benchmark for evaluation. The biggest protein complex, the cytoplasmic ribosome, contains 81 proteins, while the average number of proteins in a complex is 6.38. For assessment, we used the same evaluation metric adopted by previous authors for evaluating the MCODE algorithm (Ref. 4), the LCMA algorithm (Ref. 6) and the Md Altaf algorithm (Ref. 7), whereby the neighborhood affinity NA (Definition 3.3) is used to determine matching between a predicted complex p ∈ P and a complex m ∈ MIPS. We consider the two complexes to match if NA(p, m) ≥ 0.2, which is the same threshold used in the MCODE, LCMA and Md Altaf algorithms. The set of true positives (TP) is therefore defined as TP = {p | NA(p, m) ≥ 0.2, p ∈ P, m ∈ MIPS}, while the set of false negatives (FN) is defined as FN = {m | ∀p ∈ P, NA(p, m) < 0.2, m ∈ MIPS}. The set of false positives (FP) is FP = P − TP, while the recall and precision are:
    R = |TP| / (|TP| + |FN|)    (14)

    P = |TP| / (|TP| + |FP|)    (15)

We use the F-measure, which is the harmonic mean of precision and recall, to evaluate the overall performance of the different techniques:

    F-measure = 2 · P · R / (P + R)    (16)
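These matching-based metrics are easy to get subtly wrong (the TP set counts predicted complexes while the FN set counts reference complexes), so a small sketch may help; complexes are represented here as plain sets of protein names, and the names are ours.

```python
def na(a, b):
    """Neighborhood affinity (Definition 3.3) between two protein sets."""
    i = len(a & b)
    return i * i / (len(a) * len(b))

def evaluate(predicted, reference, t=0.2):
    """Recall, precision and F-measure as in Eqs. (14)-(16): TP are predicted
    complexes matching some reference complex with NA >= t, FN are reference
    complexes matched by no prediction."""
    tp = sum(1 for p in predicted if any(na(p, m) >= t for m in reference))
    fn = sum(1 for m in reference if all(na(p, m) < t for p in predicted))
    fp = len(predicted) - tp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return recall, precision, 2 * precision * recall / (precision + recall)
```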
Note that with the evaluation metric defined above, multiple predicted complexes may correspond to a single reference complex (see the definition of TP). Recent work by Gavin et al. (Ref. 28) has shown that protein complexes have a modular structure, consisting of core proteins that are present in multiple complexes and attachment proteins that are present in only some of them. This modularity
of complexes may help to explain why multiple predicted complexes match a single benchmark complex: the same core proteins may be present in these complexes, albeit with different attachment proteins. It is also important to note that, as our reference complex set MIPS is by no means complete, some predicted complexes that are probably true complexes will be falsely regarded as false positives (FP). As such, the F-measures of the algorithms should be taken as comparative measures rather than at their absolute values.

4.2. Comparative results
We compared the performance of the DECAFF algorithm with current computational techniques, namely MCODE (Ref. 4), LCMA (Ref. 6) and the Md Altaf algorithm (Ref. 7). Note that results for the MCODE algorithm were only available on its own Bader protein interaction data, while results for the Md Altaf algorithm were only available on the MIPS protein interaction data. For a fair comparison, we therefore ran the LCMA and DECAFF algorithms on all three protein interaction datasets. Note that all the existing algorithms use the same MIPS complexes as a reference set.

Table 1. Overall performance of the MCODE, LCMA, Md Altaf and DECAFF algorithms.*
Method      Dataset    Recall    Precision    F-measure
MCODE       Bader      0.258     0.271        0.264
LCMA        Bader      0.787     0.275        0.408
DECAFF      Bader      0.883     0.392        0.543
Md Altaf    MIPS       0.601     0.111        0.188
LCMA        MIPS       0.725     0.301        0.425
DECAFF      MIPS       0.806     0.416        0.549
LCMA        BIOGRID    0.921     0.214        0.347
DECAFF      BIOGRID    0.955     0.435        0.597

* The comparison experiments were performed on the Bader and Hogue, MIPS and BIOGRID protein interaction data. For all three protein interaction datasets, ω = 0.30 and γ = 0.95 are used in the DECAFF algorithm; ω = 0.30 is also used in the LCMA algorithm.
Table 1 shows the overall comparison results of the different computational algorithms. Using the same Bader protein interaction data, DECAFF was able to predict 1,736 complexes, of which 681 matched 125 benchmark complexes. Overall, the F-measure of DECAFF on this dataset is 54.3%, which
is 27.9% and 13.5% higher than MCODE and LCMA respectively. Using the MIPS protein interaction data, DECAFF predicted 1,220 complexes, of which 508 matched 93 benchmark complexes. On this dataset, DECAFF obtained an F-measure of 54.9%, which is 36.1% and 12.4% higher than the Md Altaf algorithm and the LCMA algorithm respectively. Applying our DECAFF algorithm to the most comprehensive protein interaction data, BIOGRID, we managed to predict 2,840 complexes, of which 1,235 complexes matched 157 MIPS complexes. On this comprehensive dataset, DECAFF obtained an F-measure of 59.7%, which is 25.0% higher than the LCMA algorithm. In short, our DECAFF algorithm achieved precision and recall values that are significantly higher than all the other computational techniques on all three evaluation datasets.

4.3. Effect of the hub removal routine
First, recall that our algorithm detects dense subgraphs in addition to the local cliques for merging, using a novel hub removal routine to heuristically detect multiple dense subgraphs. To investigate the effect of using local dense neighborhoods instead of local cliques as a basis for complex mining, we re-ran our experiments with a version of DECAFF without the hub-removal routine. Interestingly, the precision of DECAFF without the hub-removal routine was similar or only slightly worse, whereas the recall decreased significantly, by 18.9%, 25.7% and 22.1% on the Bader, MIPS and BIOGRID interaction data respectively. This shows that, in addition to the local cliques, the less graphically stringent local dense neighborhoods in DECAFF are essential for effectively mining many more true protein complexes than clique-based methods.

4.4. Effect of parameters ω and γ
Next, note that the DECAFF algorithm employs two user-defined parameters, ω and γ, to control the merging process and to filter unreliable protein complexes respectively. We first investigated how the merging threshold ω affects the performance of the algorithm by running it with values of ω ranging from −1.0 to +1.0 in steps of 0.1, while keeping the filtering threshold fixed at γ = 0.95.
In all three protein interaction datasets, the effect of varying ω was similar. As ω initially increased, the resulting F-measure increased. However, increasing ω beyond 0.6 resulted in a decreased F-measure. A possible explanation is that more merging of the local dense neighborhoods takes place when ω < 0.6; when ω is increased beyond 0.6, the threshold becomes so strict that merging seldom takes place. However, when ω is set too low (i.e. ω < 0.15), any two local dense neighborhoods will be merged as long as they have at least one common protein. Such indiscriminate merging results in an increased number of false positives, which explains the lower F-measure values for the DECAFF algorithm at low ω values. We found that optimal values of ω for DECAFF with γ = 0.95 lie within the large range 0.15 < ω < 0.55; as such, selecting a suitable value of ω for good performance is not a problem. To study the effect of the other user-defined constraint, γ, which is used to filter unreliable protein complexes detected by DECAFF, we ran DECAFF with γ from −1.0 to +1.0, with ω = 0.3. Generally, increasing γ increased the performance of DECAFF on all three protein interaction networks, suggesting that complexes predicted from reliable protein interactions are more likely to be true complexes. When DECAFF is used with an extremely small γ such as −1.0, the filtering step is practically nonexistent; DECAFF performed worst without filtering, as the noisy protein interaction data significantly affects its accuracy. This indicates that the filtering step in DECAFF is also essential for good performance. Compared with γ = −1.0, DECAFF's precision with a high γ = 0.95 (used in this paper) increased by 9.0%, 6.6% and 19.2%, at a marginal sacrifice in recall of 3.7%, 3.2% and 3.5%, on the Bader, MIPS and BIOGRID protein interaction datasets respectively. This means that our reliability-based filtering strategy is very successful, since it keeps most of the true protein complexes (and protein interactions) while filtering away most of the false protein complexes. More detailed analyses of the effect of these parameters on the performance of DECAFF can be found in Appendix E of the Supplementary Materials.
4.5. Analysis of the predicted complexes
We also evaluated the statistical significance of the protein complexes predicted by DECAFF using p-values. Given a predicted complex with n proteins, the p-value computes the probability of observing k or more proteins from the complex by chance in a biological function shared by C proteins out of a total genome size of G proteins:
Definition 4.1. The p-value of a predicted complex is defined as:

    p-value = 1 − Σ_{i=0}^{k−1} [ C(C, i) · C(G − C, n − i) / C(G, n) ]

where C(·, ·) denotes the binomial coefficient.
In other words, the above p-value measures whether a predicted complex is enriched with proteins from a particular function beyond what would be expected by chance. Given that proteins in a protein complex assemble to perform common biological functions, they are expected to share common functions. As such, true protein complexes should have low p-values, indicating that their collective occurrence within the graphical subcomponents detected by DECAFF did not happen merely by chance. We evaluated the p-values for all the predicted complexes by incorporating a Bonferroni correction, and we found that the majority of our predicted complexes are statistically significant at the 0.01 significance level (typically, a cut-off α level of 0.01 for Bonferroni-corrected p-values is chosen such that p-values below the α level are deemed significant). Specifically, 1,729 out of the 1,737 predicted complexes (or 99.5%) detected in the Bader data, 1,205 out of the 1,221 predicted complexes (or 98.7%) detected in the MIPS data, and 2,828 out of the 2,841 predicted complexes detected in the BIOGRID data were deemed significant in terms of the above p-value.
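The enrichment test in Definition 4.1 is a hypergeometric tail probability; a minimal sketch, with a naive Bonferroni step and parameter names of our own choosing, follows.

```python
from math import comb

def complex_pvalue(k, n, C, G):
    """Definition 4.1: probability of observing k or more of a complex's n
    proteins annotated with a function shared by C of the G genome proteins."""
    return 1.0 - sum(comb(C, i) * comb(G - C, n - i)
                     for i in range(k)) / comb(G, n)

def bonferroni(pvalues):
    """Bonferroni correction: scale each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```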
Table 2 shows ten predicted complexes that have very small p-values (and are thus highly likely to be true protein complexes). In one of these predicted complexes (ID = 4), we found that 9 out of the 10 proteins matched exactly with a 9-protein complex in our MIPS protein complex benchmark. On further analysis, we found that the additional unmatched protein YKL138C-A in our predicted complex has recently been annotated as part of a DASH complex (Ref. 29). This indicates that our method is capable of detecting novel biological knowledge that is absent from the reference data. In fact, as seven out of these ten predicted complexes could be matched with our MIPS protein complex benchmark, we performed further analysis on the remaining three unmatched complexes (ID = 2, 9, 10) to see if they are actual novel protein complexes. Our literature search showed that for one of these unmatched predicted complexes (ID = 2), 19 out of its 20 protein members were actually part of the "U4/U6 x U5 tri-snRNP complex" (32 proteins) published by Ref. 30. The predicted 5-protein complex (ID = 9) that did not match any of our benchmark complexes was found to match 5 out of the 6 proteins of the "mannosyltransferase complex", a protein complex that is responsible for mannosyltransferase activity (Ref. 31). Finally, the third unmatched 5-protein complex (ID = 10) predicted by DECAFF was found to correspond directly to a "nuclear condensin complex", a multisubunit protein complex that plays a central role in the condensation of chromosomes that remain in the nucleus (Ref. 31). These results show that while some of our predicted complexes do not match any of our benchmark MIPS complexes (an incomplete reference set), many of them match very well with actual complexes published in the biological literature. Our predicted complexes with low p-values are thus likely to be true protein complexes. This is further supported by matching the predicted complexes against the known protein complexes in the BIND database (Ref. 32): more than half of the predicted complexes (673 out of 1,055) from the Bader protein interaction data that did not match any of our MIPS benchmark complexes matched BIND complexes. Similarly, 256 out of the 712 unmatched predicted complexes from the MIPS protein interaction dataset, and 825 out of the 1,605 unmatched predicted complexes from the BIOGRID protein interaction data, matched BIND complexes. We also investigated why a number of the reference protein complexes in our MIPS benchmark were not matched by any of the complexes predicted by DECAFF. Out of the 215 benchmark MIPS complexes, 157 were matched with a complex predicted by DECAFF using the most comprehensive
Table 2. Ten predicted complexes with different functions from the BIOGRID protein interaction data.

ID: complex ID; N: size of the complex; δ: density of the complex; p-value: corrected p-value of the complex; ω: similarity between the predicted complex and the MIPS benchmark; GO ID: the protein GO function ID; Function: the protein function with the lowest p-value; ORFs: the proteins' ORFs in the complex.
BIOGRID dataset. Of the 58 unmatched reference complexes, 31 appeared only as individual protein pairs in the BIOGRID interaction graph. Of the remaining 27 unpredicted reference protein complexes, 22 were undetected by DECAFF because they were present in the interaction graph as very sparsely connected subgraphs with a very low average density of 0.178; only 5 reference protein complexes were mistakenly filtered out because they were deemed unreliable. We can expect the performance of DECAFF to improve further with the availability of better PPI detection technologies that can generate more complete PPI data.
5. Conclusions

While much effort has been expended on charting the protein interactome, the map of the protein "complexome" has remained comparatively empty. In this paper, we have proposed a robust method for exploiting protein interaction networks to mine for new protein complexes.
Unlike other current computational techniques, our DECAFF algorithm attempts to identify dense and reliable graphical subcomponents in protein interaction networks that could correspond to actual multiprotein complexes. To address the possibility of missing interactions in the underlying interaction network, we have relaxed the graphical constraint from cliques to local dense neighborhoods. The use of local dense neighborhoods as a basis for mining the interaction graphs also guarantees that maximal dense neighborhoods can always be found under the merging operation (Theorem 1). As such, the main focus is to detect as many local dense graphs as possible to ensure coverage, and to ascertain the reliability of the component interactions as far as possible to ensure accuracy. For the former, we have employed a novel hub-removal procedure that can effectively mine multiple, possibly overlapping, local dense subgraphs for each protein (vertex). This process caters for the biological possibility of a protein participating in multiple protein
complexes. For the latter, we have devised a novel reliability measure to filter away potential false protein complexes, in order to address the possibility of false positives in the underlying protein interaction networks. We evaluated our DECAFF algorithm using three yeast protein interaction datasets and found that its performance is indeed significantly better than all the other existing computational techniques. Our current work has shown that both the network topological information and the interaction reliability information in the interaction map can be exploited together to help discover the underlying elements for mapping the complexome. Further refinement and use of the algorithm will be directed at mapping out the "protein complex interactome", by uncovering the interacting links between the complexes and the proteins as well as other biomolecules.

Acknowledgments
We thank the anonymous reviewers for their constructive reviews, and acknowledge the early contributions of Mr Soon-Heng Tan.

References
1. R. P. Sear, Physical Biology 1, 53 (2004).
2. A. H. Y. Tong, Science 295, 321 (Jan 2002).
3. V. Spirin and L. A. Mirny, Proc Natl Acad Sci U S A 100, 12123 (Oct 2003).
4. G. D. Bader and C. W. V. Hogue, BMC Bioinformatics 4, p. 2 (Jan 2003).
5. A. D. King, N. Przulj and I. Jurisica, Bioinformatics 20, 3013 (Nov 2004).
6. X.-L. Li, S.-H. Tan, C.-S. Foo and S.-K. Ng, Genome Informatics 16, 260 (Dec 2005).
7. M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa and S. Kanaya, BMC Bioinformatics 7, 207 (2006).
8. A.-C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A.-M. Michon, C.-M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M.-A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer and G. Superti-Furga, Nature 415, 141 (Jan 2002).
9. C. von Mering and Krause, Nature 417, 399 (May 2002).
10. E. Hartuv and R. Shamir, Information Processing Letters 76, 175 (Dec 2000).
11. S. van Dongen, Graph clustering by flow simulation, PhD thesis, University of Utrecht (May 2000).
12. D. J. Watts and S. H. Strogatz, Nature 393, 440 (Jun 1998).
13. G. Palla, I. Derényi, I. Farkas and T. Vicsek, Nature 435, 814 (Jun 2005).
14. E. Ravasz, A. Somera, D. Mongru, Z. Oltvai and A. Barabási, Science 297, 1551 (Aug 2002).
15. A.-C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. J. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M.-A. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A.-M. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. M. Rick, B. Kuster, P. Bork, R. B. Russell and G. Superti-Furga, Nature 440, 631 (Mar 2006).
16. E. Nabieva, K. Jim, A. Agarwal, B. Chazelle and M. Singh, Bioinformatics 21 Suppl 1, 302 (Jun 2005).
17. H. N. Chua, W. K. Sung and L. Wong, Bioinformatics 22, 1623 (Jul 2006).
18. M. A. Gilchrist, L. A. Salter and A. Wagner, Bioinformatics 20, 689 (Mar 2004).
19. X.-L. Li, Y.-C. Tan and S.-K. Ng, BMC Bioinformatics 7 Suppl 4, p. S23 (2006).
20. P. Uetz and L. Giot, Nature 403, 623 (Feb 2000).
21. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori and Y. Sakaki, Proc Natl Acad Sci U S A 98, 4569 (Apr 2001).
22. B. L. Drees and Sundin, J Cell Biol 154, 549 (2001).
23. M. Fromont-Racine and A. E. Mayes, Yeast 17, 95 (Jun 2000).
24. Y. Ho and Gruhler, Nature 415, 180 (Jan 2002).
25. H. W. Mewes and D. Frishman, Nucleic Acids Res 28, 37 (Jan 2000).
26. M. C. Costanzo and M. E. Crawford, Nucleic Acids Res 29, 75 (Jan 2001).
27. C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers, Nucleic Acids Res 34, 535 (Jan 2006).
28. A.-C. Gavin, P. Aloy, P. Grandi, R. Krause, M. Boesche, M. Marzioch, C. Rau, L. J. Jensen, S. Bastuck, B. Dumpelfeld, A. Edelmann, M.-A. Heurtier, V. Hoffman, C. Hoefert, K. Klein, M. Hudak, A.-M. Michon, M. Schelder, M. Schirle, M. Remor, T. Rudi, S. Hooper, A. Bauer, T. Bouwmeester, G. Casari, G. Drewes, G. Neubauer, J. M. Rick, B. Kuster, P. Bork, R. B. Russell and G. Superti-Furga, Nature 440, 631 (Mar 2006).
29. J. J. L. Miranda, P. D. Wulf, P. K. Sorger and S. C. Harrison, Nat Struct Mol Biol 12, 138 (2005).
30. S. W. Stevens and J. Abelson, Proc Natl Acad Sci U S A 96, 7226 (1999).
31. J. Jungmann, J. C. Rayner and S. Munro, J Biol Chem 274, 6579 (1999).
32. G. D. Bader, I. Donaldson, C. Wolting, B. F. Ouellette, T. Pawson and C. W. Hogue, Nucleic Acids Res 29, 242 (Jan 2001).
MINING MOLECULAR CONTEXTS OF CANCER VIA IN-SILICO CONDITIONING

Seungchan Kim* and Ina Sen
School of Computing and Informatics, Arizona State University, Tempe, Arizona 85281, USA
*Email: [email protected], [email protected]
Michael Bittner
Translational Genomics Research Institute, Phoenix, Arizona 85281, USA
Email: [email protected]

A cell maintains its specific status by tightly regulating a set of genes through various regulatory mechanisms. If aberrations force the cell to adjust its regulatory machinery away from the normal state, in order to reliably provide proliferative signals and abrogate normal safeguards, it must achieve a new regulatory state different from the normal one. Due to this tightly coordinated regulation, the expression of genes should show consistent patterns within a cellular context, for example a subtype of tumor, while the behavior of those genes outside the context becomes less consistent. Based on this hypothesis, we propose a method to identify genes whose expression pattern is significantly more consistent within a specific biological context, and also provide an algorithm to identify novel cellular contexts. The method was applied to previously published data sets to find possible novel biological contexts in conjunction with available clinical or drug sensitivity data. The software is currently written in Java and is available upon request from the corresponding author.*

* Corresponding author.
1. INTRODUCTION

The cellular system is very complex, arising from the interaction of many cellular components and processes. In order to maintain a specific state, the cell needs to tightly regulate various components using a host of regulatory mechanisms. A series of disruptions to the regulatory mechanisms erodes the normal controls over proliferation and produces a variety of other regulatory variations, leading the cell to assume a significantly different state from its prior normal state, such as cancer. To transition from normal to abnormal (e.g. healthy to tumor), the functioning of the regulatory mechanism of the cellular system must be altered in significant ways. Such a change would result in an alteration of the way in which the cellular system interprets and acts upon certain kinds of input; in other words, a change of cellular context. While the governing regulatory mechanism of the normal context is disturbed in cancer, the persistent growth of the cancer implies that these cells retain a complex, reliable regulatory system capable of maintaining the enormous order required for the cell to live. The tumor's new behaviors now require
a regulatory mechanism, possibly different from the regulatory mechanism that maintained the normal cell from which the tumor originated. While many association-based approaches have proven useful, one must look among all of the associated genes and attempt to group them on the basis of prior knowledge about the activities of the individual genes to identify particular processes. As a tool looks for more specific relationships among genes, it can find smaller groups of interacting genes, defined by the kinds of behaviors that arise from the way in which transcriptional regulation operates, improving the likelihood that such sets represent interpretable hypotheses. More intriguingly, when the contextual information is unknown a priori, which is not unusual, capturing this implicit situational information, i.e. the cellular context, based on observational evidence, and identifying genes with behavior specific to the context, is a critical step toward understanding the interactions among the participants and discovering its regulatory mechanism. Recently, Segal et al. developed algorithms employing similar concepts (Ref. 32) and applied them to a
Saccharomyces cerevisiae expression data set to identify regulatory modules and their condition-specific regulators from gene expression data (Ref. 31). They also applied the method to perform an integrated analysis of 1,975 published microarrays spanning 22 tumor types to develop cancer module maps (Ref. 30). This method starts from initial partitions generated by clustering and utilizes prior biological knowledge such as the Gene Ontology (Ref. 3), KEGG (Kyoto Encyclopedia of Genes and Genomes) (Ref. 20) and the Gene MicroArray Pathway Profiler, where available. Our method does not explicitly depend on such knowledge but solely depends on data. Biclustering and the Signature Algorithm are two other methods comparable to the proposed method, as they try to identify subsets of genes and samples. However, our method is inspired by the biologically interpretable master-slave model and has an inherent directionality in place, i.e. the influence of the master over the slave genes. Biclustering considers coherent gene-sample patterns but struggles with evaluating the separation between the identified biclusters, making its output not as easily interpretable. The signature algorithm, on the other hand, requires an initial seed gene list and builds the consistent condition list and gene list iteratively to identify transcription modules. The necessity of an initial gene list limits the exploratory power of this algorithm. As the algorithm proceeds, depending on the genes/conditions included in progressive iterations, it may converge to a separate module altogether, thereby losing the signal present in the initial list. Our method, the context miner, identifies each context with a corresponding master gene and set of samples, thereby ensuring the identification of a unique context, and evaluates its statistical significance. In the following sections, we first describe the algorithm to identify a set of genes that appear tightly regulated within known cellular contexts. We then describe a method to explore molecular and clinical patterns to identify all cellular contexts with consistent patterns that are statistically significantly different from the rest of the data set. Lastly, we present analyses of previously published data sets (melanoma gene expression profiles, and gene expression along with drug activity data for the NCI 60 cell lines) and conclude with some discussion.
2. METHODS
2.1. Identification of cellular contexts

It is assumed that when a cell maintains a specific cellular context, for example a phenotype, it tightly regulates a battery of genes, which would then show rather deterministic transcriptional activities. When the cell moves away from this cellular context or changes to a different cellular state, the behavior of the same set of genes will not appear as deterministic, since they now behave without control signals (intrinsic stochastic behavior) or each gene comes under the control of various other external controls. In this section, we first describe novel statistics to identify a set of genes with more deterministic transcriptional behavior within a given cellular context than outside the context. Once a set of genes is identified, we evaluate the statistical significance of such a finding. While the algorithms are described and applied in the context of transcriptional activities, we later explain how to use the proposed method for gene expression data integrated with other types of data, such as array-based comparative genomic hybridization (aCGH) data and clinical parameters such as drug sensitivity.

2.2. Consistency statistics: interference and crosstalk

To identify a set of genes with consistent transcriptional behavior within a specific cellular context, we need statistics to evaluate consistency and/or inconsistency within and outside the specified context. Here, we consider a context c to be given by specifying a subset of samples, S, assumed to share certain phenotypes resulting from being governed by common biological processes or regulatory mechanisms. We formulate the hypothesis as follows. Let us assume a cell can be in any of k different cellular statuses, C ∈ {c_1, c_2, ..., c_k}. In other words, specifying context c_j will partition the samples into two groups: one that reflects the cellular status defining the context and one that does not. For example, a clinical parameter such as drug responsiveness can be considered a conditioning factor, partitioning patients into two groups, one being responsive and the other not. Two statistical parameters, interference and crosstalk, can be used to determine whether a gene is being regulated within a given cellular context (Ref. 21). The interference(b) δ_k^(j) for a gene g_k given a cellular context c_j is the extent to which latent variables (external controls sensitive but not specific to the context) interfere with the regulatory signal from a master gene G^(j):

    δ_k^(j) = 1 − |{s ∈ S_j^(+) : g_k is ON in s}| / |S_j^(+)|    (1)
(b) 1 − δ_k^(j) has the same form of equation as the precision. However, the interference was motivated by biological insight about gene regulation and we keep the term as is.
and the crosstalk η_k^(j) is defined as the probability that the gene g_k is being regulated (by external control) when the cellular context is not c_j:

    η_k^(j) = |{s ∈ S_j^(−) : g_k is ON in s}| / |S_j^(−)|    (2)
The equations above can be modified to consider g_k = OFF as well. A gene is determined to be specific to the given context if both its interference and crosstalk are significantly low. Given a subtype of tumor, for example, we identify genes with significantly low interference and crosstalk as being tightly regulated within the tumor (see Figure 1). For example, we may want to find a set of genes that are consistently up-regulated (ON) only within a group of patients who respond to a therapy, but not outside. The interference and the crosstalk can be used to find such genes. This approach differs from a typical t-test, where a gene needs to be differentially expressed (ON in one group and OFF in the other group); the interference and crosstalk allow a certain level of up-regulation outside the context as long as it is not as consistent as within the context. Since both the interference and crosstalk parameters are estimated from the observations, the statistical significance of the estimated values should be considered in order to avoid likely false positives or false discoveries. Let N be the number of observations made, and assume that there are two different classes (different subtypes of disease or prognosis). For a gene, assume there are n0 OFF statuses and n1 (= N − n0) ON statuses overall in the observations. When we partition the N samples into two groups based on their class labels, n in the first class and N − n in the other, let l ≤ n1 ON samples be assigned to the first group. Let us denote the subset of samples associated with the first group by S^(+) and the other by S^(−). Using Eqs. (1-2), we estimate both interference and crosstalk, but we would like to know the probability of obtaining those values by chance given the observations. Since we partition the samples to acquire those numbers, this probability is the same as the probability that we partition the samples into exactly the same configuration given the parameters above, (N, n1, n, l), and this can be computed via the hypergeometric probability in Eq. 3:

    Pr(L = l; N, n1, n) = C(n1, l) · C(N − n1, n − l) / C(N, n)    (3)
We then define the probability, given (N, n1, n), that the gene is consistently expressed (ON) l times or more as:

    Pr(L ≥ l; N, n1, n) = Σ_{i=l}^{min(n, n1)} Pr(L = i; N, n1, n)    (4)
Figure 1. Context module: Group A indicates genes with both low crosstalk and low interference with statistical significance. B presents genes with low interference but high crosstalk, C genes with no statistical significance, and D genes with high interference or crosstalk. The set of genes identified in A and B is called the context module.
As more ONs are assigned to the class of our interest, both the interference and the crosstalk parameters decrease. Therefore, Pr(L ≥ l; N, n1, n) is the probability that those parameters are estimated at the same values or better. If this probability is very low, such as less than 0.05, it is rare to find the estimated values by chance; i.e., the finding is significantly different from what can be found by chance. For a given context c_j with corresponding subsets S_j^(+) and S_j^(−), and a gene g_k with parameters (N, n1^(k), n, l^(k)), we denote the probability Pr(L ≥ l^(k); N, n1^(k), n) by p_k^(j). The set of genes with low interference and crosstalk that are also statistically significant are identified as being specifically highly correlated within the given cellular context. The genes of interest to biologists focusing on subtypes of cancer are the ones with low interference (Eq. 1) within the subset of tissues from the corresponding subtype and low crosstalk (Eq. 2) outside the subset, with a low probability of finding such a gene by chance (Eq. 4).
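The consistency statistics and the hypergeometric significance test are short computations. The sketch below follows our reconstruction of Eqs. (1)-(4) above, with binary ON/OFF expression and the context given as a boolean sample mask; all names are illustrative.

```python
from math import comb

def interference_crosstalk(on, context):
    """Eqs. (1)-(2) as reconstructed above: `on` is a boolean vector saying
    whether the gene is ON in each sample, `context` marks the samples S+."""
    n = sum(context)
    l = sum(1 for o, c in zip(on, context) if c and o)
    on_outside = sum(1 for o, c in zip(on, context) if o and not c)
    delta = 1.0 - l / n                  # interference, Eq. (1)
    eta = on_outside / (len(on) - n)     # crosstalk, Eq. (2)
    return delta, eta

def p_consistent(N, n1, n, l):
    """Pr(L >= l; N, n1, n): hypergeometric tail of Eqs. (3)-(4)."""
    def p_eq(i):
        return comb(n1, i) * comb(N - n1, n - i) / comb(N, n)
    return sum(p_eq(i) for i in range(l, min(n, n1) + 1))
```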
2.3. Interrogating contexts via in-silico conditioning

In practice, explicit knowledge about contexts, such as clinical parameters, is often not known a priori. In this section, we describe a method to systematically identify possible molecular contexts. The premise is that each context is conditioned by a gene (a master) or by other external cellular conditioning parameters such as clinical parameters. The method interrogates each gene
or clinical parameter to ask whether one of its states, for example ON or OFF, could be interpreted as a conditioning factor. This is done by grouping the samples into two sets: the first with the samples where the conditioning parameter is set to a specific state, and the other with the rest of the samples. Then, by applying the method described in the previous section, we identify the genes that appear tightly regulated under such conditions. This is similar to biologists manipulating the status of a gene, or conditioning cells, to investigate its downstream effects. Biologists often use ectopic expression or gene silencing techniques such as RNA interference to increase or decrease expression levels, respectively. In our case, it is done computationally after the data is collected; thus, we call this in-silico conditioning. Each conditioning yields a subset of samples, i.e. a context, from which a set of genes that appear tightly regulated within the context is obtained. Depending on the number of samples and the number of genes, the context might be statistically insufficiently distinctive: a pattern of similar size in samples and genes might be found by chance. The next subsection addresses this issue.
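Operationally, the interrogation is a double loop over candidate conditioners and their states. A sketch under our assumptions follows, where context_module() stands in for the Section 2.2 filtering (applying the interference, crosstalk and p-value thresholds) and is not defined here.

```python
def insilico_conditioning(samples, conditioners, thresholds):
    """For each candidate conditioner (gene or clinical parameter) and each
    of its states, treat the samples in that state as a candidate context
    and collect the genes passing the consistency filters of Section 2.2.
    `samples` is a list of dicts mapping conditioner name -> state."""
    contexts = []
    for name in conditioners:
        states = {s[name] for s in samples}
        for state in states:
            mask = [s[name] == state for s in samples]
            module = context_module(samples, mask, thresholds)  # hypothetical helper
            if module:
                contexts.append((name, state, mask, module))
    return contexts
```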
2.4. Significance test for identified contexts

We assessed the probability of finding, by chance, a context pattern in which the same number of genes or more were tightly regulated across the same number of samples. Let (M, N) denote the data size, where M is the total number of genes and N is the number of samples in the data set. We also let m and n denote the number of co-regulated genes and the number of observations in an identified context, respectively. We estimated Pr(m′ ≥ m | n′ = n), the probability that a context regulates a number of genes larger than or equal to m, given the sub-sample size n. This probability was estimated via a re-sampling method. More specifically, we randomly split the given data set into two groups, one of sample size n (the context candidate) and the other of size N − n. We then applied the same statistics (Eqs. 1-2) to count the number of genes passing the same thresholds for interference (δ), crosstalk (η) and p-value (p). By repeating this procedure many times, we estimated Pr(m′ ≥ m | n′ = n). The accuracy of the estimation depends on the number of repetitions; in a typical setting, no fewer than 1,000 repetitions were required to provide a distribution with enough statistical power. Using this re-sampling-based approach, we could assess the statistical significance of identified contexts and consider only significant patterns for further analysis.
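A direct re-sampling estimator looks like the following sketch; count_passing() is a hypothetical helper that applies the δ, η and p thresholds of Section 2.2 to every gene for a candidate context.

```python
import random

def context_significance(data, n, m, thresholds, trials=1000):
    """Estimate Pr(m' >= m | n' = n) by repeatedly drawing a random set of n
    samples as a candidate context and counting the genes that pass the
    consistency thresholds; `data` is a gene-by-sample boolean matrix."""
    N = len(data[0])
    hits = 0
    for _ in range(trials):
        picked = set(random.sample(range(N), n))
        mask = [j in picked for j in range(N)]
        if count_passing(data, mask, thresholds) >= m:  # hypothetical helper
            hits += 1
    return hits / trials
```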
2.5. Data quantization

To use the proposed method, gene expression data needs to be quantized. If the data is pre-quantized by a sophisticated method such as that described in Chen et al. (Refs. 6, 7), we use it as is. If not, several other methods are available: fold-changes, heuristic-based and model-based approaches. While relevant, the discussion of the quantization issue is beyond the scope of this study; we therefore leave further discussion to those studies.
3. EVALUATION OF THE ALGORITHM

To evaluate the utility and performance of the proposed algorithm, we used simulation-based experiments. To generate the synthetic data, we started with a set of master-slave relations, each consisting of a master, a set of slaves, and a set of rules between the master and the slaves. We then added conditioning and crosstalk parameters to specify the strength of regulation from the master to the slaves, which added randomness to the relations. Each master-slave relation defined a cellular context. Since typical cells have many such cellular mechanisms operating simultaneously, we specified multiple cellular contexts, and samples were drawn, as measurements were made, from the set of cellular contexts. The process is illustrated in Figure 2. In Figure 2, the third sample row (red(c) (R), black (B), B, green (G), R, B, G, R, G, R, R, G, G, B, G, B) has been drawn from the same cellular context (first graph) as the first sample row (R, B, B, G, R, B, B, R, G, R, R, G, G, R, G, B), but because of the randomness introduced by the conditioning and crosstalk parameters, the two samples do not show identical gene expression profiles, which is also typical of real biological observations.

(c) "red" appears dark gray and "green" appears light gray, due to the conversion to grayscale to comply with the conference guidelines.

In the simulation, we used four different cellular contexts, with one master gene in each context, and 105 genes in each simulated data set.
Figure 2. Master-slave regulatory model used to simulate gene expression data: different numbers of samples are drawn from each context (three contexts).

The main focus of the simulation was to find the effect of the number of samples drawn for each context, and of the number of regulated genes in each context, on the performance of the algorithm. For each data set, four different contexts were sampled, with sampling rates of 5, 10, 15 and 25 observations. Different numbers of slaves were also tested for each context: 10, 15, 25 and 40 out of the 105 total genes. Once the data was generated, for each cellular context we tested how accurately the algorithm identified the master and slaves corresponding to that cellular context. For comparison, another statistics-based method (correlation) and an information-based method (mutual information) were also used to identify such
sets. These methods have been popularly used in microarray analyses, such as clustering, to identify co-regulated genes within the same cellular context. As measures of performance, we used false positives (FP; the number of genes identified as being regulated that in fact are not), false negatives (FN; the number of genes not identified as being regulated that in fact are), and total error (FP + FN). Each parameter combination was repeated 200 times to compute the average performance. The results are shown in Figure 3, which compares the performance of correlation and mutual information with the context miner in terms of error. As we can see, in all cases the proposed algorithm (context) outperformed the other algorithms (mutual information and correlation); there was not much difference between the other two algorithms in terms of performance. Figure 3(a) shows the effect of the number of regulated genes in each context. A significant portion of the total error comes from false negatives (FN). FN increases as the number of slaves increases, while false positives (FP) remain unchanged, which is somewhat expected. Figure 3(b) shows the impact of the size of the context (sample size). While FN remains relatively unchanged, FP decreases significantly as the sample size increases. The overall error for the proposed method also remains relatively low, at 4 to 8% for different numbers of slaves in each context and 5 to 9% for different sample sizes. Therefore, the simulation study shows that the proposed method can be effectively used to identify a set of genes that are specifically regulated within a specific cellular context.
Figure 3. Comparisons of the proposed method against other association-based methods show that the proposed method outperforms the others, both for an increasing number of regulated genes (slaves) (a) and for increasing context size (number of samples) (b).
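For readers who wish to reproduce a comparable setup, a toy generator in the spirit of the master-slave model described above might look as follows; the probabilities p_follow and p_background are illustrative parameters of ours, not values from the paper.

```python
import random

def simulate_samples(contexts, n_genes, samples_per_context,
                     p_follow=0.9, p_background=0.3):
    """Draw boolean expression samples from a set of master-slave contexts:
    within its own context a slave copies the master's ON state with
    probability p_follow; all other genes are ON with p_background."""
    data = []
    for (master, slaves), n_samples in zip(contexts, samples_per_context):
        for _ in range(n_samples):
            row = [random.random() < p_background for _ in range(n_genes)]
            row[master] = True                 # master is ON in its own context
            for s in slaves:
                row[s] = random.random() < p_follow
            data.append(row)
    return data
```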
4. RESULTS

To show a proof of principle, we applied the method to a previously published melanoma data set (Ref. 4) and to gene expression data with drug activity data for the NCI 60 cell lines (Ref. 29). The latter illustrates how multiple types of data (gene expression and drug activity data) can be combined in the analysis to identify interesting patterns of interaction not only among genes but also between genes and drugs.
4.1. Analysis of melanoma gene expression profile
The melanoma data set consists of 8,607 genes and 31 samples. After preliminary analysis and filtering according to the method described in the original paper, we extracted 587 genes, which were then used for this study. In the original melanoma study, there existed a set of very tightly clustered samples (the major cluster) with less motility and invasiveness. We therefore first used our method to identify a set of genes that displayed consistent expression in the major cluster. The top two genes identified in the original paper, WNT5A and MLANA, as well as SNCA and EDNRB, were found at the top of our list (data not shown). There were also some new genes identified with high consistency, which are interesting candidates for further study. Another finding in the original paper was the regulatory control exerted by Wnt5a when it is highly expressed. Accordingly, the samples conditioned by a high expression of WNT5A were found to be strongly associated with high expression of MLANA, DKK3, SERPINE1, MT1X, KAI1, BRD2, and TRAM1. The involvement of these genes in melanoma development is unknown, but the consistency of their transcriptional activities warrants further investigation. The regulation of MLANA by WNT5A, however, has recently been reported.34 To unravel novel molecular contexts related to melanoma, we applied the algorithm as described in Section 2. Using more than 10,000 re-samplings, we identified more than 100 contexts with p < 0.005. Table 1 lists the contexts with p < 1e-4. In Table 1, note that two distinct states of MLANA make up two different contexts: the first is when it is normal, with 17 genes (excluding itself) being regulated, and the other is when it is up-regulated, with 14 genes (excluding itself) regulated. The two contexts share only one regulated gene (MYLK) in common, but in distinct states. This confirms our assumption that a gene can be regulated by different regulators (masters) when the cellular context changes. Further investigation revealed that the more than 100 contexts identified with p < 0.005 can be grouped into fewer than 30 larger contexts with unique hierarchical structures (data not shown), upregulation of WNT5A and upregulation of JUN being among them.
Table 1. Cellular contexts (denoted by their masters) identified with statistical significance p < 1e-4.
Conditioner    State    m    n    Pr(m'>m | n'=n)
MLANA                   10   18   0.0000992
PLP1                    23    7   0.0000992
MLANA          +        21   15   0.0000993
FBN2                    13   12   0.0000994
MMP3                    12   14   0.0000994
TCEB3                   17    6   0.0000994
LOC646762                9   21   0.0000994
MYLK                    20   16   0.0000995
DUSP1                    7   37   0.0000995
MMP3                    16   11   0.0000995
IFIT1                   14   15   0.0000995
MBP                     15    8   0.0000995
EDNRB                    6   43   0.0000995
SNED1                    4   68   0.0000996
WNT5A                   24    4   0.0000997
Conditioners are the genes used to condition the context in silico. State indicates the expression state of the corresponding conditioner. m and n denote the number of samples in which the conditioner is kept at that state and the number of regulated genes (including the conditioner itself) within the context, respectively. The last column shows the probability that a context of equal or larger size could be identified by chance, estimated with a re-sampling method.
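The last column can be estimated with a generic Monte-Carlo scheme like the sketch below; draw_null_context is a hypothetical stand-in for the paper's context scoring applied to re-sampled data.

```python
def resampling_pvalue(observed_m, n, draw_null_context, n_perm=10000):
    """Estimate Pr(m' > m | n' = n): how often a chance context with the
    same number of regulated genes n spans more samples than observed.

    draw_null_context(n) re-samples the data and returns the size m' of
    the largest chance context containing n regulated genes.
    """
    exceed = sum(draw_null_context(n) > observed_m for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)   # pseudo-count avoids a zero p-value
```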
Upregulation of WNT5A is interesting because of its regulation of MLANA, as reported in Weeraratna et al.34 Upregulation of JUN also became interesting because of other supporting biological evidence from our biologists. These two contexts are exclusive, implying two distinct molecular contexts relevant to melanoma. Ongoing work to elucidate the effects of WNT5A in melanoma has revealed that high levels of WNT5A in melanocytes are associated with higher production of the cytokine IL6. Work from a number of laboratories shows that the transcription factor Mitf, which drives Melan-A transcription, is itself regulated by Pax3 and SOX10,25 and that this regulatory pathway can be inhibited in melanoma by IL6 stimulation, which affects Pax3.19 In the other context considered, the stimulation of transcription of DUSP1 by JUN is seen. DUSP1 expression is known to be upregulated by the onset of chronic proliferation,24 and is known to inactivate
MAPK through dephosphorylation.1 DUSP1 transcription is activated by a wide variety of stresses, and the exact transcription factors involved are not well worked out.22 It is likely that the high predictivity of DUSP1 transcription by increased JUN transcription arises from the simultaneous stimulation of both JUN and DUSP1 that occurs when cells go into chronic proliferative states. Interestingly, a recent study found potential diagnostic relevance of DUSP expression in tumors.2
4.2. Analysis of NCI 60 cell line gene expression and drug sensitivity data
We extended the concept of conditioning factors from genes alone to any elements that influence, regulate, or act specifically within the existing cellular state. Any such factor would also be bound by the constraints imposed by the cellular state. Applying our method to disparate data sets such as aCGH, gene expression and/or drug activity data allows us to uncover possible underlying patterns of inter- and intra-relationships among them. Using the NCI 60 data set, we show that the contexts identified can help guide further studies of drug effectiveness and mechanism of action.
4.2.1. Data preparation
To provide an example of the exploratory analyses possible with our method, we applied it to the NCI 60 drug data. The data set consisted of the drug activity data of 118
drugs and the gene expression data of 1,375 genes across the NCI-60 cell lines.29 The original paper related these data to sensitivity to therapy rather than to molecular consequences of therapy, as the gene expression patterns were determined in untreated cells. The drug activity was represented as a matrix of -log GI50 values, where GI50 is an indicator of the growth inhibition exerted by the compound on the cell line. The application of our method can be summarized in three steps: scaling the data to a comparable form (normalization), combining the scaled data, and applying our method to obtain contexts corresponding to the different conditioning factors. To scale the different data sources, the drug matrix was normalized by subtracting its row-wise mean and dividing by its row-wise standard deviation. For the gene expression matrix, no transformation was applied, the matrix being already normalized. Next, matrix entries were quantized on the basis of two-fold changes for statistical significance. Both quantized matrices were then combined into one and used as the input data for the context analysis. The overall process is captured in Figure 4.
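A minimal sketch of these three steps with NumPy; drug_activity and gene_expression are hypothetical arrays, and applying a fixed cutoff of 1.0 to the z-scored drug matrix is an assumption standing in for the paper's two-fold-change rule.

```python
import numpy as np

def rowwise_zscore(m):
    """Normalize each row (drug) to zero mean and unit standard deviation."""
    return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)

def ternary_quantize(m, cutoff=1.0):
    """Map values to +1 (up), -1 (down) or 0 (unchanged); on a log2 scale a
    two-fold change corresponds to |value| >= 1."""
    q = np.zeros(m.shape, dtype=int)
    q[m >= cutoff] = 1
    q[m <= -cutoff] = -1
    return q

drug = rowwise_zscore(drug_activity)          # 118 drugs x 60 cell lines
genes = gene_expression                       # already normalized, per the text
combined = np.vstack([ternary_quantize(drug), ternary_quantize(genes)])
```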
Figure 4. Combining genomic data and clinical parameters to identify cellular contexts. Drug activity data and gene expression data for the NCI 60 cell lines were discretized independently into ternary values and then combined into a single data set. Using each drug or gene as the conditioning factor, context analysis was carried out to obtain contexts focusing on drug-gene combinations.
4.2.2. Patterns of drug-gene interactions
The context analysis of the NCI 60 drug activity and gene expression data resulted in 4,153 contexts. Among those, we focused on the contexts that included at least one drug and one gene, which left 243 contexts. Of these, only 27 contexts were found to be statistically significant, with p-value less than 0.01; these are displayed in Table 2.

Table 2. Top 27 contexts identified from the combined drug and gene expression data, with statistical significance p < 0.01. [Table body flattened in the scan; only the conditioner column is recoverable.] Conditioner genes: PTK2, RAB7, GJA4, HEXB, MMP14, TWF1, CORO1A, TDG, GLUL, ISGF3G, KCNQ4, MYL3, W6-2139.1, MAPRE2, KLF6, IRX3, REEP5. Conditioner drugs: 7-Epi-10-deacetylbaccatin III [TU]; Camptothecin, 20-ester (S) [T1]; Camptothecin, 11-HOMe (RS) [T1]; Taxol analog [TU]. Conditioner disease: Leukemia.
[TU]: Tubulin-active antimitotic agent; [T1]: Topoisomerase I inhibitor.

Among these identified contexts, we proposed how the elements act on each other, based on domain knowledge, annotations and functionality. We observed that the majority of the contexts reflected patterns found in the original paper.29 For example, the two breast cancer cell lines positive for the oestrogen receptor, T-47D and MCF7, which clustered
together in the original paper, were also found to be grouped together in our analysis. The context identified showed higher activity of the drug 11-formyl Camptothecin (RS) than of its counterpart Camptothecin, 11-HOMe (RS). For the two cell lines MDA-MB-435 and MDA-N, there were two filtered contexts of interest. In the first context, with only these two cell lines grouped, the drug 7-Epi-10-deacetylbaccatin III (Taxol analog, NSC No.
656178), Paclitaxel, and other Taxol analog drugs whose mechanism of action is Tubulin-active antimitotic (TU) displayed highly active status. In the second context, conditioned by the gene RAB7, these two cell lines were grouped together with the melanoma cell lines MALME-3M, SK-MEL-5 and UACC-62. Interestingly, the drugs identified as consistent in this context were Cyclocytidine and Cytarabine (araC), which belong to the DNA synthesis inhibitor (Ds) mechanism. However, they did not display high activity across all these samples, while the Taxol analog drugs were highly active in the two breast cancer cell lines. In the original paper, the MDA-MB-435 and MDA-N cell lines clustered closely with the melanoma cell lines.29 The authors discussed that MDA-MB-435 and its ErbB2 transfectant MDA-N expressed a large number of genes characteristic of melanoma, and recent findings now group these two as a subtype of melanoma itself.26-28 However, the finding in our study may indicate that they still do not use the same mechanisms in drug response. In Table 2, many of the contexts include drugs that have different mechanisms of action. Every context depicts the common transcriptional activities of the given cell lines, for example, subtypes of cancers with shared transcriptional behavior. It is possible that, in order to stop proliferation of the cell, different points of the regulatory mechanisms present in cancer cells are targeted; depending on a drug's target point, a varying degree of potency would be established in arresting cancer development. Our initial purpose of attributing each drug to a particular mechanism seemed thwarted by the inclusion of drugs in multiple contexts, showing more than one type of mechanism active in each context. Considering the previous argument, we tried to improve the prediction of a drug's mechanism of action by finding the maximum overlap between the biological processes (GO terms) of the genes targeted by a drug with unknown action and those of drugs with known action. Greater overlap would imply similarity in mechanism of action. We tried assigning the mechanism of action of the drug Inosine-glycodialdehyde (Inox) by studying the other drugs in all contexts that include Inox. In the context conditioned by IRX3, Inox showed activity similar to that of 11-Formyl-20(RS)-Camptothecin, of mechanism T1, topoisomerase I inhibitor. In the context conditioned by the gene TWF1, it showed high activity along with the drugs Dichloroallyl-lawsone and Pyrazofurin of mechanism Rs, RNA synthesis inhibitor. This context consisted of the leukemia cell lines CCRF-CEM, K-562, MOLT-4, HL-60 and RPMI-8226.
For each drug we extracted the corresponding target genes from PubGene18 and ran the resulting lists through GoMiner.36 On matching the significant GO terms (p-value < 0.05), we found that, although there were fewer than 10 exact matches, the terms displayed more functional coherence with the Rs-mechanism-derived GO terms. For Inosine-glycodialdehyde, we found GO:0000122 and GO:0045892, which relate to negative regulation of transcription (from the RNA polymerase II promoter, and DNA-dependent). GO terms matching those from Pyrazofurin and Dichloroallyl-lawsone (Rs mechanism) included GO:0006220, GO:0009058, GO:0009165 and GO:0044249, related to nucleotide metabolism and biosynthesis. There was no significant GO term match between the terms derived from Inox and those from Camptothecin. Some contexts group different cell lines, possibly implying an underlying similarity in the regulatory mechanisms in place, irrespective of the tissue of origin. This allows identification of drugs that could be potent in these particular cancer subtypes, allowing us to span and target a greater range of cancer types with the same drug. Finding targeted mechanisms by concentrating on annotations such as GO terms would give greater power to our ability to prescribe effective drugs.
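The overlap idea can be sketched as below; the paper matched significant GO terms directly, while this sketch uses a Jaccard score as a simple stand-in, with inox_terms, rs_terms and t1_terms as hypothetical GO term sets.

```python
def go_overlap(terms_a, terms_b):
    """Jaccard overlap between two sets of significant GO terms;
    a larger overlap suggests a similar mechanism of action."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# GO term sets derived from the target genes of each drug (hypothetical).
mechanisms = {"Rs": rs_terms, "T1": t1_terms}
scores = {mech: go_overlap(inox_terms, t) for mech, t in mechanisms.items()}
predicted_mechanism = max(scores, key=scores.get)
```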
5. CONCLUSION
We propose a method to identify putative cellular contexts via in-silico conditioning which, if applied to the study of cancer, could lead to the discovery of disease subtypes that are not obvious at the histological level but are possibly explained at the molecular level and carry prognostic relevance. The method can be applied to experimental data from disparate sources to improve understanding of the multilayer interactivity of biological components and to help direct further studies. We applied this method to a public melanoma data set and to gene expression data combined with drug activity data for the NCI 60 cell lines. In the melanoma study, we identified distinctive transcriptional patterns, one of which may be of clinical importance. The contexts analyzed imply some concerted pattern amongst the different components, and integration of biological data may require prior knowledge to guide the combination and comprehension of the data. The current method is limited to a single conditioning parameter because its exhaustive search grows exponentially with the number of conditioning factors. Biological systems often require multivariate conditioning, and we are currently exploring extensions of the algorithm to address this.
Acknowledgments
We thank Dr. Haiyong Han (Translational Genomics Research Institute) for help interpreting the results on the NCI 60 cell line data, and Mr. Verdicchio and Ms. Ramesh for helping us with the data analysis. We also thank our reviewers for their invaluable feedback. This study is partially supported by an ASU/Mayo Collaboration Initiative Grant (#016851-001), NIH P01-CA27502-23, P01 CA109552-01A1 and U19 AI067773.
References
1. Alessi, D. R., C. Smythe, et al. The human CL100 gene encodes a Tyr/Thr-protein phosphatase which potently and specifically inactivates MAP kinase and suppresses its activation by oncogenic ras in Xenopus oocyte extracts. Oncogene 1993; 8(7): 2015-20.
2. Amit, I., A. Citri, et al. A module of negative feedback regulators defines growth factor signaling. Nat Genet 2007; accepted.
3. Ashburner, M., C. A. Ball, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000; 25(1): 25-9.
4. Bittner, M., P. Meltzer, et al. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 2000; 406(6795): 536-40.
5. Caplen, N. J., S. Parrish, et al. Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci U S A 2001; 98(17): 9742-7.
6. Chen, Y., E. R. Dougherty, et al. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 1997; 2: 364-74.
7. Chen, Y., V. Kamat, et al. Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics 2002; 18(9): 1207-15.
8. Cheng, Y. and G. M. Church. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000; 8: 93-103.
9. Chung, T.-H. and S. Kim. Quantization of global gene expression data. International Conference on Machine Learning and Applications (ICMLA) 2006, FL.
10. Davidson, E. H., J. P. Rast, et al. A provisional regulatory gene network for specification of endomesoderm in the sea urchin embryo. Dev Biol 2002; 246(1): 162-90.
11. Fire, A., S. Xu, et al. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998; 391(6669): 806-11.
12. Golub, T. R., D. K. Slonim, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286: 531-537.
13. Hahn, W. C., C. M. Counter, et al. Creation of human tumour cells with defined genetic elements. Nature 1999; 400(6743): 464-8.
14. Hahn, W. C. and R. A. Weinberg. Modelling the molecular circuitry of cancer. Nat Rev Cancer 2002; 2(5): 331-41.
15. Hanahan, D. and R. A. Weinberg. The hallmarks of cancer. Cell 2000; 100(1): 57-70.
16. Ihmels, J., S. Bergmann, et al. Defining transcription modules using large-scale gene expression data. Bioinformatics 2004; 20(13): 1993-2003.
17. Ihmels, J., G. Friedlander, et al. Revealing modular organization in the yeast transcriptional network. Nat Genet 2002; 31(4): 370-7.
18. Jenssen, T. K., A. Laegreid, et al. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001; 28(1): 21-8.
19. Kamaraju, A. K., C. Bertolotto, et al. Pax3 down-regulation and shut-off of melanogenesis in melanoma B16/F10.9 by interleukin-6 receptor signaling. J Biol Chem 2002; 277(17): 15132-41.
20. Kanehisa, M. and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000; 28(1): 27-30.
21. Kim, S., E. R. Dougherty, et al. Cellular contexts from gene expression profile. IEEE International Workshop on Genomic Signal Processing and Statistics 2005, Newport, RI.
22. Li, M., J. Y. Zhou, et al. The phosphatase MKP1 is a transcriptional target of p53 involved in cell cycle regulation. J Biol Chem 2003; 278(42): 41059-68.
23. Mousses, S., N. J. Caplen, et al. RNAi microarray analysis in cultured mammalian cells. Genome Res 2003; 13(10): 2341-7.
24. Noguchi, T., R. Metz, et al. Structure, mapping, and expression of erp, a growth factor-inducible gene encoding a nontransmembrane protein tyrosine phosphatase, and effect of ERP on cell growth. Mol Cell Biol 1993; 13(9): 5195-205.
25. Potterf, S. B., M. Furumura, et al. Transcription factor hierarchy in Waardenburg syndrome:
regulation of MITF expression by SOX10 and PAX3. Hum Genet 2000; 107(1): 1-6.
26. Rae, J. M., C. J. Creighton, et al. MDA-MB-435 cells are derived from M14 melanoma cells--a loss for breast cancer, but a boon for melanoma research. Breast Cancer Res Treat 2006.
27. Rae, J. M., S. J. Ramus, et al. Common origins of MDA-MB-435 cells from various sources with those shown to have melanoma properties. Clin Exp Metastasis 2004; 21(6): 543-52.
28. Ross, D. T., U. Scherf, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000; 24(3): 227-35.
29. Scherf, U., D. T. Ross, et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000; 24(3): 236-44.
30. Segal, E., N. Friedman, et al. A module map showing conditional activity of expression modules in cancer. Nat Genet 2004; 36(10): 1090-8.
31. Segal, E., M. Shapira, et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 2003; 34(2): 166-76.
32. Segal, E., H. Wang, et al. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 2003; 19 Suppl 1: i264-71.
33. Shmulevich, I. and W. Zhang. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 2002; 18(4): 555-65.
34. Weeraratna, A. T., Y. Jiang, et al. Wnt5a signaling directly affects cell motility and invasion of metastatic melanoma. Cancer Cell 2002; 1(3): 279-88.
35. Yeung, K. Y., C. Fraley, et al. Model-based clustering and data transformations for gene expression data. Bioinformatics 2001; 17(10): 977-87.
36. Zeeberg, B. R., W. Feng, et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003; 4(4): R28.
37. Zhou, X., X. Wang, et al. Binarization of microarray data on the basis of a mixture model. Mol Cancer Ther 2003; 2(7): 679-84.
PREDICTION OF TRANSCRIPTION START SITES BASED ON FEATURE SELECTION USING AMOSA

Xi Wang1,a, Sanghamitra Bandyopadhyay3,a, Zhenyu Xuan2, Xiaoyue Zhao2, Michael Q. Zhang2,1 and Xuegong Zhang1,*
1Bioinformatics Division, TNLIST and Dept. of Automation, Tsinghua Univ., Beijing 100084, China
2Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA
3Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
aThe first two authors are joint first authors.
*Corresponding author. Email: [email protected]

To understand the regulation of gene expression, the identification of transcription start sites (TSSs) is a primary and important step. With the aim of improving computational prediction accuracy, we focus on the most challenging task, i.e., identifying TSSs within 50 bp in non-CpG related promoter regions. Due to the diversity of non-CpG related promoters, a large number of features are extracted. Effective feature selection can minimize the noise, improve the prediction accuracy, and also discover biologically meaningful intrinsic properties. In this paper, a newly proposed multi-objective simulated annealing based optimization method, Archived Multi-Objective Simulated Annealing (AMOSA), is integrated with Linear Discriminant Analysis (LDA) to yield a combined feature selection and classification system. This system is found to be comparable to, and often better than, several existing methods in terms of different quantitative performance measures.
1. INTRODUCTION
It is known that the initiation of transcription of a gene is the important first step in gene expression. RNA polymerase II (Pol II) plays the key role during transcription and is recruited by other transcription factors (TFs) to the TSS within the preinitiation complex (PIC). Determining the location of TSSs has become crucial for mapping cis-regulatory elements and hence for further studying the mechanism of gene regulation. The core promoter region is centered around the TSS, within a length of ~100 bp, and the proximal promoter, which is also enriched in transcription factor binding sites (TFBSs), is located immediately upstream of the core promoter within several hundred base pairs. Given the relationship between promoters and TSSs, a promoter region must contain the information for Pol II to recognise TSSs; this information forms the basis for identifying the TSS in silico. Moreover, through computational modeling, important cis-regulatory element features may be identified. Good predictive features and accurate TSSs will help in forming testable hypotheses and designing targeted experiments. Predicting the TSS in silico is an old but still very challenging problem. A strategy was proposed by Zhang
in 19981: an initial identification of a promoter region approximately within a distance of 2 kb, followed by a more specific prediction method to locate the TSS within a 50 bp region. Many methods have been developed in the past decade2, generally belonging to two categories: the initial identification of a gross promoter, and the more specific prediction of the TSS. Recent studies have also demonstrated that, in vertebrates, one should treat the CpG related and the non-CpG related promoters (see the definition of non-CpG related promoters in the data section) separately for better TSS prediction, which is biologically sound and computationally feasible3. For CpG related promoters, TSS prediction is much easier and has largely been solved2. However, predicting TSSs for non-CpG related promoters remains challenging. In this paper we focus on predicting TSSs within 50 bp for non-CpG related promoters. Almost all the previous methods for TSS prediction have been summarized in recent reviews3. The main idea of those methods is to use characteristic features that can differentiate between a promoter region and a non-promoter region in classification tests. The resulting classifiers (or predictive models) are applied to new input DNA sequences for TSS prediction. However, Bajic et al5 describe detecting TSSs in non-
CpG related promoters as a bottleneck of current technology. The reason may be poor understanding of the transcriptional initiation mechanism and the diversity of non-CpG related promoters, especially tissue-specific promoters. Hence the features that could be used to distinguish promoter from non-promoter regions cannot be easily determined. To solve this problem, one strategy is to start with a large number of potential features and then select the most discriminative ones according to certain classification objectives. A good feature selection not only can improve the accuracy of the prediction, but also can reveal biologically meaningful features which may provide deeper biological insights. Feature selection is the process of selecting a subset of the available features such that some internal or external criterion is optimized. The purpose of this step is the following: building simpler and more comprehensible models, improving the performance of some subsequent machine learning algorithm, and helping to prepare, clean, and understand data. Different algorithms exist for performing feature selection. One important approach is to use an underlying search and optimization technique such as genetic algorithms6,7. However, it may often be difficult to evolve just a single criterion that is sufficient to capture the goodness of a selected subset of features. It may thus be more appropriate and natural to treat the problem of feature selection as one of multi-objective optimization. Such an approach is adopted in this article. A newly developed multi-objective simulated annealing algorithm called Archived Multi-Objective Simulated Annealing (AMOSA)9 is utilized for this purpose.
2. MATERIALS AND METHOD
2.1. AMOSA
Archived multi-objective simulated annealing (AMOSA)8,9 is a generalized version of the simulated annealing (SA) algorithm based on multi-objective optimization (MOO). MOO is applied when dealing with real-world problems where several objectives should be optimized simultaneously. In general, a MOO algorithm admits a set of solutions that are not dominated by any solution it has encountered, i.e., non-dominated solutions.10,11 During recent years, many multi-objective evolutionary algorithms,
such as Multi-Objective SA (MOSA), have been suggested to solve MOO problems.12 Simulated annealing (SA) is a search technique for solving difficult optimization problems, based on the principles of statistical mechanics13. Recently, SA has become very popular because not only can SA replace exhaustive search to save time and resources, but it also converges to the global optimum if annealed sufficiently slowly14. Although the single-objective version of SA is quite popular, its utility in the multi-objective case was limited because of its search-from-a-point nature. Recently Bandyopadhyay et al developed an efficient multi-objective version of SA called AMOSA9 that overcomes this limitation. AMOSA is utilized in this work for selecting features for the task of TSS prediction. The AMOSA algorithm incorporates the concept of an archive where the non-dominated solutions seen so far are stored. Two limits are kept on the size of the archive: a hard or strict limit denoted by HL, and a soft limit denoted by SL. The algorithm begins with the initialization of a number (γ × SL, 0 < γ ≤ 1) of solutions in the archive. At each iteration, a new solution (new-pt) is generated from the current solution (current-pt) by perturbation, and its acceptance is governed by the amount of domination between the two solutions,

Δdom(a, b) = prod over objectives i with f_i(a) ≠ f_i(b) of |f_i(a) − f_i(b)| / R_i,
where f_i(a) and f_i(b) are the i-th objective values of the two solutions and R_i is the corresponding range of the i-th objective function. Based on the domination status, different cases may arise, viz., accept (i) the new-pt, (ii) the current-pt,
or (iii) a solution from the archive. Again, in case of overflow of the archive, clustering is used to reduce its size to HL. The process is repeated iter times for each temperature, which is annealed with a cooling rate of α until a minimum temperature is reached.
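The acceptance step can be sketched as follows; the sigmoidal, temperature-scaled probability below is a simplification in the spirit of AMOSA, not its exact rule, and the maximization assumption is ours.

```python
import math
import random

def amount_of_domination(a, b, ranges):
    """Product of range-scaled objective differences over the objectives
    where the two solutions a and b differ."""
    dom = 1.0
    for fa, fb, r in zip(a, b, ranges):
        if fa != fb:
            dom *= abs(fa - fb) / r
    return dom

def accept_worse(new_pt, current_pt, ranges, temperature):
    """Accept a dominated new-pt with a probability that shrinks as the
    amount of domination grows and as the temperature is annealed down."""
    delta = amount_of_domination(current_pt, new_pt, ranges)
    return random.random() < 1.0 / (1.0 + math.exp(delta / temperature))
```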
2.2. Data
Since TSS prediction for non-CpG related promoters is still unsolved, our newly proposed prediction system is aimed at such TSSs. All the data used come from the work of Zhao et al17. We take promoter sequences as non-CpG related if the normalized CpG content of the 3 kb centered at the TSS is less than 0.3. All of the examples were taken from the Eukaryotic Promoter Database (EPD)18 and the Database of Transcription Start Sites (DBTSS)19, both of which have
relatively high quality annotation. EPD is based on experimentally determined TSSs, while DBTSS is based on full-length oligo-capped cDNA sequences. A total of 1,570 sequences containing non-CpG related promoters were selected, including 299 from EPD and 1,271 from DBTSS. These sequences are 10 kb in size and are centered at the annotated TSSs. Positive and negative samples are defined in order to train LDA classifiers (or prediction models) to accomplish the more specific prediction of TSSs. Within the known promoter regions, positive samples are the sequences around core promoters from site -250 to site 50, denoted [-250, 50] (for a gene, site 0 is not included), and negative ones are [-850, -550), [-550, -250), (50, 350] and (350, 650] (Fig. 1(a)). The negative samples thus divide into up- and downstream negative samples. For a test sequence with unknown TSS, we slide a 300 bp window in 1 bp steps to get samples of the same length as the training samples (Fig. 1(b)). Several features were extracted for all the samples, both the positive and negative samples in the training set and the samples in the test set.
Fig. 1. Samples and classifiers. (a) Preparation of the samples in the training set. The promoter region from 250 bp upstream to 50 bp downstream of the annotated TSS is taken as the positive sample, while the other upstream and downstream sequences are taken as negative samples. The upstream negative samples (U) are used with the positive samples (P) to train the classifier P vs. U, and the downstream negative samples (D) are used with the positive samples (P) to train the classifier P vs. D. (b) Classification of the samples in the test set. A window of 300 bp is scanned along the DNA sequence to be tested at a 1-bp step, forming the test sets. There are 2,401 samples from each sequence. The two classifiers (P vs. U and P vs. D) are applied to each of the samples and their outputs are combined, generating a prediction score at each position of the sequence. Post-processing is then used for making the final decisions based on the scores (see text and Figs. 2 and 3).
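A minimal sketch of the sample preparation just described, treating the intervals as half-open and a sequence as a Python string with the annotated TSS at index tss:

```python
def training_windows(seq, tss):
    """Cut the 300-bp positive window and the four negative windows
    around an annotated TSS (coordinates relative to the TSS)."""
    def cut(a, b):                      # TSS-relative half-open window [a, b)
        return seq[tss + a: tss + b]
    positive = cut(-250, 50)
    upstream = [cut(-850, -550), cut(-550, -250)]
    downstream = [cut(50, 350), cut(350, 650)]
    return positive, upstream, downstream

def test_windows(seq, width=300, step=1):
    """Slide a 300-bp window at 1-bp steps along a test sequence."""
    return [seq[i: i + width] for i in range(0, len(seq) - width + 1, step)]
```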
Before feature selection, there are 210+ numeric features (see Table 1 in Zhao et al's paper17). To make the analysis of the features easier, we categorize them as follows: (i) basic sequence features; (ii) mechanical properties; (iii) motif features. The basic sequence features include the scores of core promoter elements (TATA, Inr, CCAAT-box and GC-box), the frequencies of 1-mers or 2-mers related to C or/and G, and the scores from 3rd order Markov chain modeling. The mechanical properties capture the characteristics of the energy and flexibility profiles around the TSS, and the distance and correlation values are computed with different sequence lengths and smoothing window lengths. The motif features are generated by featuretab, part of the CREAD suite of sequence analysis tools20. The motif weight matrices are from TRANSFAC21, and the maximal scores of the weight matrices for TFBSs are used as the motif features. There are about 66 features in this category. If too few features are used to classify promoter and non-promoter regions and to predict TSSs, the predictive power may be very low. On the other hand, if the number of features is too large, the noise may go up and the predictive power would come down. Hence, feature selection is one of the most important steps of the whole TSS prediction system. The multi-objective optimization method AMOSA is implemented in our system for effective feature selection.
2.3. Classification Strategy
In our proposed TSS prediction system, we use Fisher's linear discriminant analysis (LDA) to build the basal classifiers. LDA, originally developed in 1936 by R. A. Fisher, is a classic classification method. The main idea of Fisher's linear discriminant is to project data, usually in a high-dimensional space, onto a direction such that the distance between the means of the two classes is maximized while the variance within each class is minimized. Thereafter, the classification becomes a one-dimensional problem and can be done with a proper threshold cut-off.22 As LDA considers all samples in the projection, it has been shown to be robust to noise and often performs well in many applications. LDA models are built with the features selected, and their performance is used as the guide in the feature selection procedure.
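As a sketch, using scikit-learn's LDA as a stand-in for the paper's implementation (X_pos, X_neg_up and X_test are hypothetical feature matrices), one P vs. U classifier could be built as:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stack positive and upstream-negative samples, restricted to a selected
# feature subset, and train one of the two basal classifiers (P vs. U).
X = np.vstack([X_pos, X_neg_up])
y = np.array([1] * len(X_pos) + [0] * len(X_neg_up))

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.decision_function(X_test)   # signed score for each 300-bp window
```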
2.4. Feature Selection with AMOSA
Among the 210+ features to be used for predicting TSSs in non-CpG related promoters, there might be some that contribute little to the classification but bring in more noise. In this article, a state of AMOSA denotes the features that are selected for classification. LDA classifiers are built with only the selected features. Three objectives, namely sensitivity (Sn), positive predictive value (PPV) and the Pearson correlation coefficient (CC), are used to evaluate the performance of the LDA classifiers with the selected features. They are computed using 10-fold cross validation as:
Sn = TP / (TP + FN)    (2)

PPV = TP / (TP + FP)    (3)

CC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))    (4)
where TP, TN, FN, and FP are the numbers of true positives, true negatives, false negatives, and false positives, respectively. We consider the three objectives equally important: Sn controls false negatives, PPV limits false positives, and CC balances the classification results. Traditional optimization methods, which can only optimize one objective, cannot deal with this problem; the multi-objective optimization method AMOSA is therefore implemented to solve the three-objective optimization problem. Since multi-objective optimization methods usually admit multiple solutions, we get several sets of selected features in each experiment.
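The three objectives are simple functions of the confusion counts; a direct transcription of Eqs. (2)-(4):

```python
import math

def objectives(tp, tn, fp, fn):
    """Sensitivity, positive predictive value and Pearson correlation
    coefficient, the three criteria AMOSA optimizes jointly."""
    sn = tp / (tp + fn)
    ppv = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, ppv, cc
```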
2.5. Prediction System
Our whole prediction system contains two phases. First, AMOSA is combined with LDA as a feature selection and classification system. Thereafter, post-processing is performed to integrate the classification scores and obtain the prediction results. Fig. 2 shows the flow chart of the whole prediction system, with its two phases: feature selection & classification, and post-processing. Note that there are two symmetrical parts in Phase I. This is because we train two types of models (P vs. U and P vs. D) using
Fig. 2. Flow chart of the prediction system (Phase I: feature selection and classification; Phase II: post-processing).
three categories of training samples: positives (P), upstream negatives (U) and downstream negatives (D), and thus classify the test samples twice. We take the left side as an example to explain the procedures in Phase I. First of all, we pre-select the features (removing those features that have almost the same values for both P and U samples). Then we come to the key step, where AMOSA is used to select the features and LDA to classify the samples. Let n denote the number of feature subsets output by AMOSA; correspondingly, n LDA classifiers are trained. For an input sequence, we slide a 300-bp-wide window in 1 bp steps to generate the test samples. After we use the n feature subsets and the n classifiers to classify the sequential
samples, n vectors of classification scores are output. Let l denote the length of the vectors. Treating the n vectors equally, we sum them up to get the summed scores S_i, i = 1, 2, ..., l, and then normalize them by:

S'_i = -1.0 + 2 (S_i − S_min) / (S_max − S_min)    (5)

where S_i, i = 1, 2, ..., l, denote the initial summed scores, S'_i, i = 1, 2, ..., l, the scores after normalization, and S_max and S_min are the maximum and minimum of all the initial summed scores, respectively. The right-hand side of
Phase I, the classifier for P vs. D, is implemented similarly. We consider the two vectors output by the two symmetrical classifiers to contribute equally to the prediction, and add the two scores. We use Eq. (6) to smooth the sum:

S''_i = [ sum_{j = max(1, i−50)}^{min(i+50, l)} S'_j exp(−(j − i)^2 / 5000) ] / [ sum_{j = max(1, i−50)}^{min(i+50, l)} exp(−(j − i)^2 / 5000) ]    (6)
where S'_j, j = 1, 2, ..., l, denote the scores before smoothing and S''_i, i = 1, 2, ..., l, denote the smoothed scores. We choose an RBF (Radial Basis Function) window with a width of ±50 rather than a flat window because the influence decays with distance. We cluster the sites with scores larger than a given threshold (thresholds can be chosen in [-0.2, 0.25]; the corresponding results with varying thresholds are shown in Fig. 4 as a PPV-Sn curve), and within each cluster the site with the maximum score is reported as the putative TSS (Fig. 3).
Fig. 3. Last steps of post-processing. (a) The vector of bars stands for the vector of scores, a high bar indicating a high score. A threshold is set. (b) The bars under the threshold are removed. (c) The remaining bars within a certain distance of each other are grouped into one cluster. (d) The site with the maximum score in each cluster is output as the putative TSS.
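A compact sketch of this post-processing pipeline (sum, rescale, RBF-smooth, threshold, cluster); summing before rescaling is a simplification of Eq. (5), which normalizes the same way up to a constant factor.

```python
import numpy as np

def postprocess(score_vectors, threshold=0.03, cluster_gap=2000):
    """Combine classifier score vectors and call putative TSS sites."""
    s = np.sum(score_vectors, axis=0)                     # sum the n vectors
    s = -1.0 + 2.0 * (s - s.min()) / (s.max() - s.min())  # rescale to [-1, 1]
    smoothed = np.empty_like(s)                           # RBF window, Eq. (6)
    for i in range(len(s)):
        j = np.arange(max(0, i - 50), min(i + 51, len(s)))
        w = np.exp(-((j - i) ** 2) / 5000.0)
        smoothed[i] = np.dot(s[j], w) / w.sum()
    above = np.flatnonzero(smoothed > threshold)          # threshold the sites
    tss, cluster = [], []
    for i in above:                                       # group nearby sites
        if cluster and i - cluster[-1] > cluster_gap:
            tss.append(max(cluster, key=lambda k: smoothed[k]))
            cluster = []
        cluster.append(i)
    if cluster:
        tss.append(max(cluster, key=lambda k: smoothed[k]))
    return tss
```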
2.6. Cross Validation
We use 5-fold cross validation to evaluate the performance of our proposed TSS prediction system. We divide the 1,271 sequences from DBTSS into 5 parts; each time, 4 of the five parts together with all 299 sequences from EPD form a training set, and the remaining part forms the corresponding test set. Thus, 5 pairs of training and test sets are prepared. For each sequence (10 kb in length, with the annotated TSS located at site 1) in a test set, we slide a window to get 300-bp samples from site -1200 to 1201. Therefore, there are 2,401 samples from each sequence in the test set. True positives (TP), false positives (FP) and false negatives (FN) are defined as follows. A putative TSS located within 50 bp of any annotated TSS is considered a TP, otherwise an FP. If there is no predicted TSS in the ±50 bp region of an annotated
TSS, this counts as one FN. The two important criteria, Sn and PPV, are as defined in Eqs. (2) and (3). Note that for one annotated TSS there is either a TP or an FN, and for each sequence in our data there is only one annotated TSS, so the sum of TP and FN equals the number of sequences. Eq. (2) can thus be simplified as:

Sn = TP / #sequences    (7)
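Per-sequence evaluation can be sketched as below; crediting at most one TP per annotated TSS (so that TP + FN sums to the number of sequences) is our reading of the counting rule, and all_predictions and all_tss are hypothetical containers.

```python
def evaluate_sequence(predicted, annotated_tss, tolerance=50):
    """Score one test sequence: a prediction within 50 bp of the annotated
    TSS yields the single TP; every prediction outside the window is an FP."""
    hits = [p for p in predicted if abs(p - annotated_tss) <= tolerance]
    tp = 1 if hits else 0
    fp = len(predicted) - len(hits)
    return tp, fp, 1 - tp                                  # (TP, FP, FN)

# Pooling over all test sequences gives Sn = TP / #sequences (Eq. 7).
results = [evaluate_sequence(p, t) for p, t in zip(all_predictions, all_tss)]
tp, fp = sum(r[0] for r in results), sum(r[1] for r in results)
sn, ppv = tp / len(results), tp / (tp + fp)
```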
2.7. Other TSS Prediction Methods
In the comparison study, we compare our newly proposed TSS prediction system with the three other most effective and publicly available methods5,17: McPromoter23, Eponine24, and CoreBoost17. McPromoter combines DNA sequence likelihoods and profiles of physical properties as features, and a neural network is used for predicting TSSs. Eponine uses a set of
weight matrices for extracting features, and applies a hybrid relevance vector machine method to build the prediction model. CoreBoost proposes a feature selection strategy that selects useful features among basic sequence information, mechanical properties and motif features, and then implements a boosting technique to predict TSSs. We obtained the McPromoter software from its authors and downloaded Eponine from its website25. We ran the programs of the three methods on our local computer.
3. RESULTS AND DISCUSSION
3.1. Performance of AMOSA
The performance of our system with AMOSA embedded is first compared with that without feature selection, i.e., using all available features for the prediction. Table 1 shows the TP, FP, Sn and PPV values of the two methods under different parameters. From Table 1, we can see that the system using all the features to predict TSSs does not perform as well as the one with AMOSA feature selection embedded, under the same parameters. We can therefore conclude that the AMOSA-based feature selection is effective, making the prediction system achieve higher Sn and PPV even while using fewer features.

Table 1. Prediction results with all features versus features selected using AMOSA.
Parametersa    Method         TP    FP     Sn(%)   PPV(%)
0.10/1000      All Features   283   887    22.3    24.2
               AMOSA          319   886    25.1    26.5
0.10/2000      All Features   281   832    22.1    25.2
               AMOSA          316   813    24.9    28.0
0.00/1000      All Features   304   1004   23.9    23.2
               AMOSA          335   1014   26.4    24.8
0.00/2000      All Features   303   939    23.8    24.4
               AMOSA          333   932    26.2    26.3

aThe parameters are the classification score threshold (e.g. 0.10) and the clustering distance (e.g. 1000).
3.2. Features for TSS Prediction
Among all the original features (more than 210), only about 90 features on average are selected each time during the 5-fold cross validation. It is not surprising that not all the selected features are the same for different training sets, but there are quite a few features that are selected almost every time. We count the number of times each feature is selected during the 5-fold cross validation experiment. Table 2 lists the top features selected for the models P vs. U and P vs. D separately, while Table 3 lists those selected for both models. From the tables, we can see that the known core promoter elements play great roles in the prediction, as their weight scores, such as TATA90, Inr90 and GCbox90, appear at the top of the rankings. Log-likelihood ratios from the 3rd order Markov chain (denoted MC), together with some energy/flexibility characteristics and motif features, are also among the top features. Moreover, the features selected for the two LDA classifiers (P vs. U and P vs. D) are different. For example, the weighted score of the 7-mer TATA box (denoted TATA-7) has a larger probability of appearing in the classifier P vs. U, while the weighted score based on Bucher26 for Inr (denoted Inr90) is more frequently selected in the classifier P vs. D. That is why we train the two classifiers (P vs. U and P vs. D) separately. The LOGOs for the motifs mentioned in Table 2 or 3 are shown in Table 4. As to the three categories of features, namely the basic sequence features, the mechanical properties, and the motif features, the proportions of features selected are not the same. We call the ratio of #(selected features) to #(total features) in each category the feature selection ratio. From Table 5, we can see that the selection ratio for the motif features is very low, even less than half of the other two. This indicates that the TFBS motif weight matrices seem to carry less information for predicting TSSs in non-CpG related promoters. However, the reason may also be that there are many different motifs playing different roles in different promoters (e.g. tissue specificity). If we grouped the motifs according to their functions, their contribution to the prediction might be larger and the performance might be further improved. Besides the motif features, there are also redundancies in the other two categories.

3.3. Comparison with Other Methods
We compare our system with three other effective and publicly available methods: McPromoter23, Eponine24, and CoreBoost17. Five-fold cross validation is used to evaluate the performance of our prediction system.
Table 2. Top selected features using AMOSA for models P vs. U and P vs. D. The total number of subsets of selected features for the model P vs. U in the 5-fold experiment is 43, while that for P vs. D is 46.

P vs. U               count   ratio%      P vs. D              count   ratio%
MC                    43      100.00      aveTATA.flex         46      100.00
corr.flex.150.1000    43      100.00      aveTSS.flex          46      100.00
corr.eng.500.250      43      100.00      Inr90                46      100.00
corr.eng.1000.1300    43      100.00      MC                   46      100.00
V$ELK1-02.pos         43      100.00      corr.flex.5.1300     46      100.00
V$HNF1-Q6.pos         43      100.00      corr.eng.5.500       46      100.00
V$MYC-Q2.pos          41      95.35       corr.eng.500.250     46      100.00
V$PAX6-01.pos         41      95.35       CCAAT90              45      97.83
aveTATA.flex          40      93.02       corr.eng.1000.1300   45      97.83
TSSdiffNew1.eng       39      90.70       V$CDPCR1-01.pos      44      95.65
eud.eng.5.250         39      90.70       TSSdiffNew2.eng      41      89.13
TATA90                38      88.37       TATACCAAT90          40      86.96
corr.flex.500.1000    38      88.37       aveTATA.eng          39      84.78
TATAdiffNew3.flex     37      86.05       TATA90               39      84.78
Density               37      86.05       TATAGCbox90.dist     39      84.78
eud.flex.1000.1000    37      86.05       mclstmc              39      84.78
Table 3. Top features selected using AMOSA in common. The total number of subsets of selected features for both models in the 5-fold experiment is 89.

Feature               P vs. U           P vs. D           Both
                      count   ratio%    count   ratio%    count   ratio%
corr.eng.500.250      43      100.00    46      100.00    89      100.00
MC                    43      100.00    46      100.00    89      100.00
corr.eng.1000.1300    43      100.00    45      97.83     88      98.88
aveTATA.flex          40      93.02     46      100.00    86      96.63
corr.flex.5.1300      35      81.40     46      100.00    81      91.01
V$ELK1-02.pos         43      100.00    36      78.26     79      88.76
TATA90                38      88.37     39      84.78     77      86.52
V$HNF1-Q6.pos         43      100.00    34      73.91     77      86.52
eud.eng.5.250         39      90.70     35      76.09     74      83.15
V$CDPCR1-01.pos       30      69.77     44      95.65     74      83.15
Inr90                 27      62.79     46      100.00    73      82.02
TSSdiffNew1.eng       39      90.70     34      73.91     73      82.02
V$PAX6-01.pos         41      95.35     32      69.57     73      82.02
Density               37      86.05     35      76.09     72      80.90

Fig. 4 depicts the plot of PPV vs. Sn to show the comparison results. The different points for one method correspond to different parameters. The asterisks and the circles are for Eponine and McPromoter, respectively. The solid and dashed curves are for CoreBoost. The squares and the triangles are for our system with different clustering distances, where the
different points with the same symbol correspond to different score thresholds, from -0.20 (bottom right) to 0.22 (top left). It is clear that our prediction system with clustering distance 2000 bp outperforms Eponine, McPromoter and CoreBoost. The score threshold 0.03, achieving 26.0% Sn and 26.5% PPV, is chosen as the default threshold in our prediction system.
Table 4. LOGOs of motifs listed in the top features. [Sequence logo images not reproduced.]

Motif           Note
V$MYC-Q2        P vs. U
V$NFY-Q6-01     P vs. U
V$PAX-Q6        P vs. U
V$CDPCR1-01     P vs. D
Table 5. Feature selection ratios for the different feature categories.

           sequence features     mechanical properties    motif features       total
           ave. #   ratio(%)     ave. #   ratio(%)        ave. #   ratio(%)    ave. #   ratio(%)
overall    33       100          114      100             66       100         213      100
P vs. U    15.4     46.1         55.2     48.4            15.4     23.3        86.0     40.4
P vs. D    16.6     50.3         57.6     50.5            13.6     20.6        87.8     41.2
Fig. 4. Positive predictive value vs. sensitivity. The asterisks are for Eponine with its default settings. The circles are for McPromoter with the default clustering distance of 2000 bp. The solid and dashed curves are for CoreBoost, with the solid curve for clustering scores within 500 bp and the dashed one for 2000 bp. The squares and triangles are for our system with AMOSA embedded, with clustering distances of 2000 bp and 1000 bp, respectively.
3.4. Discussion
In this paper, we have proposed a new system based on AMOSA feature selection to predict TSSs in non-CpG related human promoters. First, we generate features from sequence characteristics, mechanical properties and TFBS motif scores. Thereafter, we apply AMOSA to select features for training LDA models. Finally, we use the LDA classification scores, followed by post-processing, to predict TSSs. As a result, relatively higher prediction Sn and PPV are achieved compared with the other existing methods. It can be observed that the performance of all these methods still has a lot of room for improvement. This reflects the complexity of the problem and our insufficient understanding of the underlying biology. However, considering that we are trying to predict a single TSS at 50 bp resolution de novo from a long genomic DNA sequence, such moderate sensitivity and specificity are still welcome. The results can also be useful, in conjunction with other gene prediction tools, for helping biologists to prioritize their experimental targets. Further improvement will likely require more detailed information on chromatin state and on the tissue/stage-specificity of the promoter sequences.
Acknowledgments
This work is supported in part by NSFC grants 30625012 and 60540420569, the National Basic Research Program of China 2004CB518605, and the Changjiang Professorship Award of China. Additional support is provided by an award from the Dart Neurogenomics Alliance and by grant HG001696 from NIH.
References
1. Zhang MQ. Identification of human gene core promoters in silico. Genome Res 1998; 8: 319-326.
2. Werner T. The state of the art of mammalian promoter recognition. Brief Bioinform 2003; 4: 22-30.
3. Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet 2001; 29: 412-417.
4. Ioshikhes IP, Zhang MQ. Large-scale human promoter mapping using CpG islands. Nat Genet 2000; 26: 61-63.
5. Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004; 22: 1467-1473.
6. Oh IS, Lee JS, Moon BR. Hybrid Genetic Algorithms for Feature Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004; 26: 1424-1437.
7. Raymer ML, Punch WF, Goodman ED, Kuhn LA, Jain AK. Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation 2000; 4: 164-171.
8. Bandyopadhyay S, Saha S. Simultaneous Optimization of Multiple Objectives: New Methods and Applications. In the Proceedings of the Eighth International Conference on Humans and Computers 2005; 159-165.
9. Bandyopadhyay S, Saha S, Maulik U, Deb K. A Simulated Annealing Based Multi-objective Optimization Algorithm: AMOSA. IEEE Transactions on Evolutionary Computation 2007; submitted.
10. Coello CAC. A Comprehensive Survey of Evolutionary-Based Multiobjective Optimization Techniques. Knowledge and Information Systems 1999; 1: 129-156.
11. Deb K. Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley and Sons, Ltd., England. 2001.
12. Veldhuizen DAV, Lamont GB. Multiobjective Evolutionary Algorithms: Analyzing the State-of-the-Art. Evolutionary Computation 2000; 8: 125-147.
13. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH. Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics 1953; 21: 1087-1092.
14. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 1984; 6: 721-741.
15. Smith KI, Everson RM, Fieldsend JE, Murphy C, Misra R. Dominance-Based Multi-Objective Simulated Annealing. IEEE Trans. on Evolutionary Computation 2005; submitted.
16. Deb K, Pratap A, Agarwal S, Meyarivan T. A Fast Elitist Multi-Objective Genetic Algorithm: NSGA-II. IEEE Trans. on Evolutionary Computation 2002; 6: 182-197.
17. Zhao X, Xuan Z, Zhang MQ. Boosting with stumps for predicting transcription start sites. Genome Biol 2007; 8: R17.
18. Cavin PR, Junier T, Bucher P. The Eukaryotic Promoter Database EPD. Nucleic Acids Res 1998; 26: 353-357.
19. Suzuki Y, Yamashita R, Sugano S, Nakai K. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res 2004; 32: D78-D81.
20. The Comprehensive Regulatory Element Analysis and Discovery (CREAD) suite. http://rulai.cshl.edu/cread.
21. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Pruß M, Reuter I, Schacherer F. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 2000; 28: 316-319.
22. Duda RO, Hart PE, Stork DG. Pattern Classification (Second Edition). John Wiley & Sons, Inc., England. 2001: 117-121.
23. Ohler U, Niemann H, Liao G, Rubin GM. Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 2001; 17: S199-S206.
24. Down TA, Hubbard TJP. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res 2002; 12: 458-461.
25. Eponine. http://www.sanger.ac.uk/users/td2/eponine/.
26. Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 1990; 212: 563-578.
CLUSTERING OF MAIN ORTHOLOGS FOR MULTIPLE GENOMES
Zheng Fu* and Tao Jiang
Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
*Email: [email protected]

The identification of orthologous genes shared by multiple genomes is critical for both functional and evolutionary studies in comparative genomics. While in practice it is usually done by sequence similarity search and reconciled tree construction, recently a new combinatorial approach and a high-throughput system, MSOAR, for ortholog identification between closely related genomes based on genome rearrangement and gene duplication have been proposed11. MSOAR assumes that orthologous genes correspond to each other in the most parsimonious evolutionary scenario minimizing the number of genome rearrangement and (post-speciation) gene duplication events. However, the parsimony approach used by MSOAR limits it to pairwise genome comparisons. In this paper, we extend MSOAR to multiple (closely related) genomes and propose an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes. As a preliminary experiment, we apply MultiMSOAR to the rat, mouse and human genomes, and validate our results using gene annotations and gene function classifications in public databases. We further compare our results to the ortholog clusters predicted by MultiParanoid, which is an extension of the well-known program Inparanoid for pairwise genome comparisons. The comparison reveals that MultiMSOAR gives more detailed and accurate orthology information, since it can effectively distinguish main orthologs from inparalogs.
1. INTRODUCTION
According to the definition of Fitch, orthologs are genes that evolved by speciation, while paralogs are genes that evolved by duplication. Orthologs typically occupy the same functional niche in different species, whereas paralogs tend to evolve toward functional diversification. Hence, the identification of orthologous genes shared by multiple genomes is critical for both the functional and the evolutionary aspects of comparative genomics. Traditional ortholog identification methods can be categorized into two types. The first is sequence similarity-based methods, such as COG/KOG23,24, EGO15, Inparanoid/MultiParanoid19,1 and OrthoMCL17, just to name a few. The other is tree-based methods, including RAP6, TreeFam16, PhyOP12, Orthostrapper21, RIO26, OrthologID4, etc. The main assumption behind sequence similarity-based methods is that the evolutionary rates of all genes in a homologous family are equal, and thus the divergence time can be estimated by comparing the DNA or protein sequences of genes. However, incorrect ortholog assignments might be obtained if the real rates of evolution vary significantly between homologs, and methods that rely on sequence similarity alone are highly subject
to artificial association of slowly evolving paralogs and to erroneous exclusion of more rapidly evolving genes. Tree-based analysis is very intuitive and informative for ortholog identification, since it visually presents the history of a gene family7. Usually, orthologs and paralogs are identified via a reconciled tree, which is constructed to reconcile incongruent gene and species trees by taking gene duplication events into consideration. However, tree-based approaches critically rely on the correctness of the reconstructed gene and species trees. Moreover, reconstructing accurate gene trees on a genome-wide scale is very computation-intensive. Recently, a new combinatorial approach and a high-throughput system, MSOAR, for genome-wide ortholog identification between closely related genomes based on genome rearrangement and gene duplication were proposed11. MSOAR focuses on the assignment of a subtype of orthologs, called main orthologs, which are formed by the true exemplars from each pair of corresponding sets of inparalogous genesa, by computing the rearrangement/duplication distance between the two genomes. The assumption is that main orthologs correspond to each other in the most parsimonious evolutionary scenario involving genome rearrangement and
*Corresponding author.
a With respect to a certain speciation event, the inparalogous genes are those that were generated by post-speciation duplications 19.
(post-speciation) gene duplication events. Since the true exemplar gene of an inparalogous set is the direct descendant of the ancestral gene of the set, it best reflects the original position and function of the ancestral gene in the ancestral genome. Hence, a reliable assignment of main orthologs is an important step toward the general identification of orthologs. The extensive tests on simulated data and on the real human and mouse genomes in 11 demonstrate that MSOAR has a performance comparable to Inparanoid 19 and is able to find ortholog pairs that would be missed by Inparanoid (or any sequence similarity-based method). Moreover, its assignment result on the human and mouse genomes is well supported by the six methods listed on the HGNC Comparison of Orthology Predictions (HCOP) website (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl) 9, the Jackson Lab's human-mouse ortholog database 8, and the protein functions defined in the Protein Analysis Through Evolutionary Relationships (PANTHER) classification system (http://www.pantherdb.org/) 25. However, MSOAR requires the computation of the so-called RD distance (i.e., the genome rearrangement/duplication distance) between two given genomes 11, and is thus limited to pairwise comparisons. In this paper, we present a new method to cluster main orthologs shared by multiple genomes, by extending MSOAR to more than two genomes. Given a set of genomes, the new method, called MultiMSOAR, first applies MSOAR to each pair of input genomes, and then combines the pairwise ortholog assignment results consistently, taking into account the species phylogeny, to build main ortholog clusters for the whole set of input genomes. We validate the performance of MultiMSOAR by testing the method on the genomes of rat, mouse and human and by checking its predicted main ortholog clusters against gene annotations and functional classifications in public databases. We also compare our result to that of MultiParanoid 1, which is a single-linkage based ortholog clustering approach utilizing the pairwise ortholog clusters obtained by Inparanoid 19.
2. METHOD

Consider k closely related genomes G1, G2, ..., Gk, where k ≥ 3. Suppose that these k genomes are ordered according to their given (rooted) species tree in a post-order traversal fashion. For example, genomes G1 and G2 share a common ancestor denoted as A12, A12 and genome G3 share a common ancestor denoted as A123, and so on, until finally all the genomes share a common ancestor denoted as A12...k. That is, the genomes are phylogenetically ordered. MultiMSOAR first applies MSOAR to each pair Gi, Gj of the input genomes to obtain a set of putative main ortholog pairs for Gi, Gj. Then it constructs clusters of main orthologs for all the input genomes by combining the pairwise ortholog prediction results, resolving inconsistencies and taking into account the species tree and the possibility of gene loss.

2.1. Main ortholog clusters for three genomes
We first explain the idea of this method for the case of three genomes. Given three phylogenetically ordered genomes G1, G2 and G3, and the sets (or tables) of putative main ortholog pairs T(G1,G2), T(G1,G3), and T(G2,G3) obtained by applying MSOAR to genome pairs G1 and G2, G1 and G3, and G2 and G3, MultiMSOAR starts the construction of ortholog clusters by making every main ortholog pair in these three tables its own cluster. MultiMSOAR next merges clusters using the single linkage technique, i.e., two clusters are merged if and only if they share a common (main) orthologous gene. This procedure is repeated until no mergeable clusters exist. This first step is called cluster initiation, and the main ortholog clusters generated in this step are called the initial clusters (a union-find sketch of this step is given below). In the following, we deal with each initial cluster separately. We can use an undirected connected graph G(X, Y, Z) to describe the structure of an initial cluster, where X, Y, and Z are three disjoint vertex sets that contain the vertices representing genes from the three genomes involved in the initial cluster. In the graph G(X, Y, Z) (or simply G for brevity), the vertex set is X ∪ Y ∪ Z and an edge connects two vertices if they are assigned as a main ortholog pair by MSOAR in the pairwise comparisons, i.e., they form an entry in one of the main ortholog pair tables. Since main orthology is an inter-genome and one-to-one relationship, G is a tripartite graph and has four possible topologies, called triangle, 2-path, 3-path, and n-path, respectively (see Figure 1).
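The cluster-initiation step is a plain single-linkage merge, which is convenient to implement with a union-find (disjoint-set) structure. The sketch below is illustrative rather than MultiMSOAR's actual implementation; the input format (lists of gene-id pairs, one list per pairwise MSOAR table) is an assumption.

```python
class UnionFind:
    """Disjoint-set forest with path halving; adequate for this sketch."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def initial_clusters(pair_tables):
    """pair_tables: one list of (gene_a, gene_b) main ortholog pairs per
    genome pair, with globally unique gene ids such as ('rat', 'Chd5').
    Two clusters are merged whenever they share a gene (single linkage)."""
    uf = UnionFind()
    for table in pair_tables:
        for a, b in table:
            uf.union(a, b)
    clusters = {}
    for gene in list(uf.parent):
        clusters.setdefault(uf.find(gene), set()).add(gene)
    return list(clusters.values())
```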
We process these four topologies differently. In the case of a triangle, the corresponding cluster has three orthologous genes, one from each genome, forming exactly three pairs of main orthologs. Such a cluster is reported as a final main ortholog cluster because of the strong support from the pairwise comparisons. A 2-path topology describes the scenario in which a main ortholog pair was found in two of the genomes, but neither of these two genes has a main ortholog counterpart in the third genome. This main ortholog pair is also reported as a final main ortholog cluster. Moreover, if the main ortholog pair was found between G1 and G3 (or G2 and G3), a gene loss is also reported in G2 (or G1, respectively), since G1 and G2 are assumed to have diverged from a more recent speciation. Note that, if the main ortholog pair was found between G1 and G2, we need not report a gene loss event in G3. A 3-path topology is an acyclic path with three vertices, describing the scenario in which two main ortholog pairs were found that involve one gene from each genome and share a common gene, while the remaining two (unshared) genes were not found to form a main ortholog pair with any other gene. This 3-path topology indicates a possible main ortholog pair (the missing edge) that has been missed by MSOAR due to complications caused by multi-domain proteins or alternative splicing. Therefore, the three genes in such a 3-path initial cluster are reported as a final main ortholog cluster. Some real examples of gene losses and missing main ortholog pairs found by MultiMSOAR are given in Section 3. All other initial clusters have the n-path topology. An n-path can be a path or a cycle, as long as it involves more than three vertices. Such an initial cluster contains more than one gene from some genome, and the handling of such an initial cluster is nontrivial. In practice, the number of initial clusters with the n-path topology should usually be very small if the pairwise comparison results are reliable.
Fig. 1. Four possible topologies of the initial main ortholog clusters.
For example, the number of such initial clusters is 390 (or 2.64%), involving a total of 2688 (or 5.79%) genes from all three genomes, in the rat, mouse and human comparison discussed in the next section. For each initial cluster G(X, Y, Z) with the n-path topology, MultiMSOAR uses a heuristic algorithm, called NPathResolver, to divide the initial cluster into final main ortholog clusters, each with three orthologous genes, using a combinatorial optimization approach. The algorithm transforms G(X, Y, Z) into a complete weighted tripartite graph G(X, Y, Z, W) by adding dummy vertices and dummy edges (so that a perfect matching always exists), and then tries to find a perfect tripartite matching with the maximum weight. This tripartite matching problem is also called the maximum three-index assignment problem, which is known to be NP-hard 13. We employ the single-pass recursive heuristic proposed by Bandelt et al. 3, which can also be applied to the maximum multi-index assignment problem. The heuristic works as follows: (i) find a maximum weight bipartite matching M_XY between the vertex sets X and Y; (ii) let N = {n_xy | x ∈ X, y ∈ Y, (x, y) ∈ M_XY} be a new vertex set, and define the weight between vertices n_xy ∈ N and z ∈ Z as W(n_xy, z) = W(x, z) + W(y, z); (iii) find a maximum weight bipartite matching between the vertex sets N and Z. Note that a maximum weight bipartite matching can be computed by the classical Hungarian method 18 in cubic time. The weights W in G(X, Y, Z, W) are defined taking into account both sequence similarity and the main ortholog pair information from the pairwise comparisons by MSOAR, which is mostly based on gene location information:
    W(i, j) = { MAXWEIGHT            if Evalue(i, j) = 0 or (i, j) ∈ E(G)
              { −log(Evalue(i, j))   if 0 < Evalue(i, j) ≤ 1e−20          (1)
              { MINWEIGHT            otherwise
Here, Evalue(i, j) is obtained by an all-versus-all BLASTp comparison between each pair of genomes. (i, j) ∈ E(G) indicates that i and j were assigned as a main ortholog pair in the pairwise comparisons by MSOAR. MAXWEIGHT and MINWEIGHT are two constants, where MAXWEIGHT must be larger than the largest value of −log(Evalue(i, j)) and MINWEIGHT must be smaller than the smallest value of −log(Evalue(i, j)).
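Equation (1) transcribes directly into code. The constants below are placeholders that merely satisfy the stated inequalities; suitable values depend on the E-value range of the actual BLASTp runs.

```python
import math

MAXWEIGHT = 1e6    # placeholder: must exceed every -log(Evalue)
MINWEIGHT = -1e6   # placeholder: must be below every -log(Evalue)

def edge_weight(evalue, is_msoar_pair):
    """Weight W(i, j) of equation (1). `evalue` is the BLASTp E-value of
    genes i and j; `is_msoar_pair` is True when (i, j) is in E(G), i.e.
    MSOAR reported the pair as main orthologs."""
    if evalue == 0 or is_msoar_pair:
        return MAXWEIGHT
    if evalue <= 1e-20:
        return -math.log(evalue)
    return MINWEIGHT
```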
The algorithm obtains a set of triplets based on the final maximum weight matching. A triplet is reported as a main ortholog cluster if and only if its three vertices represent real genes. In other words, as long as a triplet contains at least one dummy vertex, all the genes in this triplet are regarded as inparalogs. The outline of algorithm NPathResolver is illustrated in Figure 2.

Algorithm NPathResolver(G(X, Y, Z))
1. Add dummy vertices and edges to obtain a complete weighted tripartite graph G(X, Y, Z, W)
2. Define the edge weight function W for G according to equation (1)
3. Compute a tripartite matching M(X, Y, Z) using the single-pass recursive heuristic
4. for each m ∈ M(X, Y, Z)
5.     if m contains no dummy vertices
6.     then output m as a final main ortholog cluster

Fig. 2. The heuristic algorithm to resolve initial clusters with the n-path topology.
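For concreteness, here is one way the single-pass recursive heuristic of step 3 could look in code, assuming the weights are stored in dense square matrices (dummy vertices already added). SciPy's linear_sum_assignment plays the role of the Hungarian method mentioned above.

```python
from scipy.optimize import linear_sum_assignment

def three_index_assignment(W_xy, W_xz, W_yz):
    """Single-pass heuristic for the maximum three-index assignment.
    W_xy[i, j] is the weight between x_i and y_j, and similarly for
    W_xz and W_yz; all three are n x n numpy arrays."""
    # Step (i): maximum weight bipartite matching between X and Y.
    x_idx, y_idx = linear_sum_assignment(W_xy, maximize=True)
    # Step (ii): collapse each matched pair (x, y) into a vertex n_xy
    # with weight W(n_xy, z) = W(x, z) + W(y, z).
    W_nz = W_xz[x_idx, :] + W_yz[y_idx, :]
    # Step (iii): maximum weight bipartite matching between N and Z.
    _, z_idx = linear_sum_assignment(W_nz, maximize=True)
    # The k-th triplet pairs x_{x_idx[k]}, y_{y_idx[k]} and z_{z_idx[k]}.
    return list(zip(x_idx, y_idx, z_idx))
```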
2.2. Extension to the comparison of more than three genomes
Now consider the case of k > 3 genomes G1, G2, ..., Gk. The initial clusters can be constructed in the same way as in the case of three genomes, using the single linkage clustering technique. Here, the graph G(V1, V2, ..., Vk) has k disjoint vertex sets, which correspond to the k genomes. Similar to the above, the initial clusters are classified into three possible topologies: the k-clique, the pseudo-clique, and the nontrivial case. A k-clique consists of k genes, one from each genome, that form exactly k(k − 1)/2 main ortholog pairs as found by the pairwise comparisons. Such a cluster is reported as a final main ortholog cluster. A pseudo-clique is a graph with m ≤ k vertices, each vertex from a different genome. If the pseudo-clique contains e edges, we use the parameter γ = 2e/(m(m − 1)) to measure its cliqueness (or edge density). When m and γ are greater than some user-defined thresholds, the corresponding initial cluster is reported as a final main ortholog cluster, and gene loss events are reported according to the species phylogeny (see the sketch below). In the nontrivial case, the initial cluster contains multiple genes from the same genome. A maximum weight k-partite matching is then used on G(V1, V2, ..., Vk) to distinguish main orthologs from inparalogs, similar to the above algorithm NPathResolver for three genomes. Note that the single-pass recursive heuristic for finding a maximum weight matching can be extended to k > 3 genomes in a straightforward way 3. Again, this approach will be quite effective since the number of nontrivial cases is expected to be very small.
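The pseudo-clique test described above reduces to a couple of arithmetic checks. A hypothetical sketch (the threshold values are illustrative, not the paper's):

```python
def passes_pseudo_clique_test(m, e, k, m_min=3, gamma_min=0.5):
    """Decide whether an initial cluster with one gene from each of
    m <= k genomes and e main ortholog edges is reported as a final
    cluster. m_min and gamma_min are the user-defined thresholds."""
    if m == k and e == k * (k - 1) // 2:
        return True                       # a k-clique is always reported
    gamma = 2.0 * e / (m * (m - 1))       # cliqueness (edge density)
    return m > m_min and gamma > gamma_min
```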
3. EXPERIMENTAL RESULTS

In order to test the performance of MultiMSOAR as a tool for clustering main orthologs shared by multiple genomes, we have applied it to three model genomes: rat (Rattus norvegicus), mouse (Mus musculus) and human (Homo sapiens). Gene positions, transcripts and translations were downloaded from the UCSC Genome Browser 14 website (http://genome.ucsc.edu). We use the canonical splice variants from the November 2004 update of the rat genome (UCSC rn4, Nov. 2004, version 3.4), the build 36 "essentially finished" assembly of the mouse genome (UCSC mm8, February 2006) and the build 36.1 finished human genome assembly (UCSC hg18, March 2006). There are 7066 protein sequences in the rat genome assembly rn4, 19199 sequences in the mouse genome assembly mm8 and 20161 sequences in the human genome assembly hg18. The pairwise main ortholog information was obtained by running MSOAR on each pair of the genomes. Specifically, there are 14306 main ortholog pairs reported between mouse and human, 6539 main ortholog pairs between mouse and rat, and 6347 main ortholog pairs between rat and human. MultiMSOAR identifies 14790 main ortholog clusters in total. We validate the predicted main ortholog clusters using the gene annotation information and functional classification in public databases below. We also compare the result of MultiMSOAR with that of MultiParanoid, which is an ortholog clustering method based solely on sequence similarity. The comparative study shows that the prediction result of MultiMSOAR largely agrees with that of MultiParanoid, but about 7.17% of MultiMSOAR's predicted main ortholog clusters properly refine their corresponding MultiParanoid clusters.
3.1. Validation using gene annotation

First, we use gene annotation information (in particular, gene symbols or names) to validate the main ortholog clusters found by MultiMSOAR. The hypothesis is that genes with identical symbols are most likely to be main orthologs, since a gene symbol usually conveys the character or function of the gene. We extracted the gene annotation information from UniProtKB/Swiss-Prot Release 52.1. Recall that MultiMSOAR output 14790 main ortholog clusters for rat, mouse and human, among which only 12598 clusters have complete annotations. Out of the 12598 main ortholog clusters, 10605 (84.18%) clusters are true positives (i.e., all the genes in the cluster have completely identical gene symbols). Among the 10605 true positives, 6176 clusters have size two and 4429 clusters have size three. Since there are 12455 assignable main ortholog clusters (i.e., the total number of clusters of genes with identical symbols), MultiMSOAR achieved a sensitivity of 85.15% for the rat, mouse and human comparison. The detailed results are summarized in Table 1.

Table 1. Validation of the main ortholog clusters found by MultiMSOAR using gene annotation.

                                        assignable   assigned   unknown   true positive
Main ortholog clusters of size two         7700        8610       1392        6176
Main ortholog clusters of size three       4755        6180        719        4429
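The two rates quoted above follow directly from the table; a quick arithmetic check:

```python
true_positives = 6176 + 4429      # = 10605 clusters with identical symbols
annotated      = 12598            # clusters with complete annotations
assignable     = 7700 + 4755     # = 12455 assignable clusters

print(round(100 * true_positives / annotated, 2))    # 84.18
print(round(100 * true_positives / assignable, 2))   # 85.15
```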
3.2. Validation using gene functions
Besides gene annotation, we also use gene functional classification to validate our clustering result. The PANTHER (Protein Analysis Through Evolutionary Relationships) classification system 25 is an online resource that classifies genes by their functions. It is based on a method that uses published experimental evidence or evolutionary relationships to predict function in the absence of direct experimental evidence. Proteins that belong to the same functional family and subfamily are assigned the same PANTHER ID. We examine the consistency between the main ortholog clusters output by MultiMSOAR and the PANTHER IDs of the involved genes. Out of the 14297 main ortholog clusters of rat, mouse and human found by MultiMSOAR with valid Entrez gene IDs, 11667 (or 81.6%) clusters consist of genes with the same PANTHER IDs, including 6703 clusters of size two and 4964 clusters of size three. This result demonstrates that the main ortholog clusters obtained by MultiMSOAR are very much in agreement with the gene functional classification provided by PANTHER.
3.3. Comparison with MultiParanoid

MultiParanoid is a genome-scale analysis program that clusters orthologs and inparalogs shared by multiple genomes 1. It is a straightforward extension of the well-known Inparanoid program 19, which identifies orthologs and inparalogs between a pair of genomes solely based on sequence similarity. To ensure a direct comparison between MultiMSOAR and MultiParanoid, we ran MultiParanoid on the same dataset (i.e., UCSC hg18, UCSC mm8, and UCSC rn4). Since MultiParanoid only reports clusters of co-orthologous genes and does not distinguish main orthologs from their inparalogs, the size of a MultiParanoid cluster might exceed three. After comparing with the MultiParanoid clusters, the main ortholog clusters identified by MultiMSOAR were divided into four categories: match, subset, absence, and mismatch. Among the 14790 main ortholog clusters generated by MultiMSOAR for rat, mouse and human, 13109 (or 89.12%) clusters found identical matches in MultiParanoid's output, 1054 (or 7.17%) clusters are contained in the corresponding MultiParanoid clusters as proper subsets, 297 (or 2.02%) clusters are absent from MultiParanoid's output (including those clusters that are proper supersets of some MultiParanoid clusters), and 330 (or 2.59%) clusters are mismatched, i.e., each of them partially overlaps with some MultiParanoid cluster. Note that, when a MultiMSOAR cluster C1 is properly contained in some MultiParanoid cluster C2, the additional elements in C2 are likely inparalogs (as identified by MultiMSOAR) rather than main orthologs, and thus C1 may represent a more accurate main ortholog cluster than C2. In other words, C1 can be viewed as a refinement of C2. The distribution of these four types of main ortholog clusters is illustrated in Figure 3. This comparison shows that the main ortholog clusters identified by MultiMSOAR are very consistent with the ortholog clusters generated by MultiParanoid. Furthermore, MultiMSOAR gives more detailed and accurate orthology information since it distinguishes main orthologs from inparalogs.
Fig. 3. Comparing the prediction results of MultiMSOAR and MultiParanoid. [Bar chart of the number of MultiMSOAR clusters in each of the four categories: Match, Subset, Absence, and Mismatch.]
Fig. 4. An example of gene loss and missing main ortholog pairs. In the figure, the rat, mouse and human chromosomal segments are ordered top down. Solid lines indicate main ortholog pairs found by the pairwise comparisons. Dashed lines indicate the missing orthology information identified by MultiMSOAR.
3.4. Examples of identified gene losses and main ortholog pairs missed in pairwise comparisons

As described above, by taking into account the species tree of the genomes under consideration, MultiMSOAR is able to identify possible gene losses. In the case of the rat, mouse and human comparison, if a main ortholog pair was found between mouse and human (or rat and human) without a corresponding orthologous gene found in rat (or mouse, respectively), a gene loss is reported in rat (or mouse, respectively), since mouse and rat were separated by a more recent speciation. Figure 4 shows a segment of rat chromosome 5 (169,624,099 - 169,349,727), a segment of mouse chromosome 4 (151,234,544 - 150,964,681) and a segment of human chromosome 1 (6,028,567 - 6,407,434). Based on the gene location information and gene sequence similarity, MultiMSOAR successfully identified 9 main ortholog clusters within these chromosome segments and a possible gene loss in rat
(i.e., chd5). Besides, a main ortholog pair between human and mouse (i.e., ESPN) missed by MSOAR in the pairwise comparisons was identified by MultiMSOAR. This pair of main orthologs was missed by MSOAR because their sequences match different segments of their orthologous gene in rat and thus have insufficient similarity between themselves. In the rat, mouse and human comparison, a total of 8286 genes were found by MultiMSOAR to have been lost, and 138 main ortholog pairs that were missed by MSOAR in the pairwise comparisons were imputed.

4. CONCLUDING REMARKS
The ortholog clustering method that we presented here extends the pairwise method MSOAR 11 and enables the identification of main ortholog clusters for multiple closely related genomes. Our preliminary experiment on a three-genome comparison demonstrates that our method is consistent with the gene annotation and functional classification information in public databases and with a published program in the literature. Some interesting future work includes more extensive testing on four or more genomes and more elaborate (and in-depth) handling of gene losses (e.g., using pseudogene information). We plan to make this program available as a public server in the near future.

ACKNOWLEDGMENT
This project is supported in part by NSF grant CCR-0309902, National Key Project for Basic Research (973) grant 2002CB512801, NSFC grant 60528001, and a Changjiang Visiting Professorship at Tsinghua University.

References

1. A Alexeyenko, I Tamas, G Liu, and E L L Sonnhammer. Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics, 22:9-15, 2006.
2. A Bairoch, R Apweiler, C H Wu, W C Barker, B Boeckmann, S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, et al. The universal protein resource (UniProt). Nucleic Acids Res., 33:D154-D159, 2005.
3. H Bandelt, Y Crama, and F Spieksma. Approximation algorithms for multi-dimensional assignment problems with decomposable costs. Discrete Applied Mathematics, 49:25-50, 1994.
4. J C Chiu, E K Lee, M G Egan, I N Sarkar, G M Coruzzi, and R DeSalle. OrthologID: automation of genome-scale ortholog identification within a parsimony framework. Bioinformatics, 22:699-707, 2006.
5. P Dehal and J Boore. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biology, 3:1700-1708, 2005.
6. J Dufayard, L Duret, S Penel, M Gouy, F Rechenmann, and G Perriere. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics, 21:2596-2603, 2005.
7. J A Eisen. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Research, 8:163-167, 1998.
8. J T Eppig, C J Bult, J A Kadin, J E Richardson, J A Blake, A Anagnostopoulos, R M Baldarelli, M Baya, J S Beal, S M Bello, et al. The mouse genome database (MGD): from genes to mice - a community resource for mouse biology. Nucleic Acids Res., 33:D471-D475, 2005.
9. T A Eyre, M W Wright, M J Lush, and E A Bruford. HCOP: a searchable database of human orthology predictions. Brief Bioinform., 2006.
10. W M Fitch. Distinguishing homologous from analogous proteins. Syst. Zool., 19:99-113, 1970.
11. Z Fu, X Chen, V Vacic, P Nan, Y Zhong, and T Jiang. A parsimony approach to genome-wide ortholog assignment. In Proceedings of the 10th Annual International Conference, RECOMB (Venice, Italy, April 2006), pages 578-594, 2006.
12. L Goodstadt and C Ponting. Phylogenetic reconstruction of orthology, paralogy and conserved synteny for dog and human. PLoS Biology, 1:e45, 2003.
13. V Kann. Maximum bounded 3-dimensional matching is MAX SNP-complete. Inform. Process. Lett., 37:27-35, 1991.
14. D Karolchik, R Baertsch, M Diekhans, T S Furey, A Hinrichs, Y T Lu, K M Roskin, M Schwartz, C W Sugnet, D J Thomas, et al. The UCSC genome browser database. Nucleic Acids Res., 31:51-54, 2003.
15. Y Lee, R Sultana, G Pertea, J Cho, S Karamycheva, J Tsai, B Parvizi, F Cheung, V Antonescu, J White, et al. Cross-referencing eukaryotic genomes: TIGR orthologous gene alignments (TOGA). Genome Research, 12:493-502, 2002.
16. H Li, A Coghlan, J Ruan, L J Coin, J Heriche, L Osmotherly, R Li, T Liu, Z Zhang, L Bolund, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res., 34:D572-D580, 2006.
17. L Li, C Stoeckert, and D Roos. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research, 13:2178-2189, 2003.
18. C H Papadimitriou and K Steiglitz. Combinatorial Optimization: Algorithms and Complexity. 2004.
19. M Remm, C E Storm, and E L Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology, 314:1041-1052, 2001.
20. D Sankoff. Genome rearrangement with gene families. Bioinformatics, 15:909-917, 1999.
21. C E Storm and E L Sonnhammer. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics, 18:92-99, 2002.
22. R L Tatusov, M Y Galperin, D A Natale, and E V Koonin. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res., 28:33-36, 2000.
23. R L Tatusov, E Koonin, and D J Lipman. A genomic perspective on protein families. Science, 278:631-637, 1997.
24. R L Tatusov, D A Natale, J D Jackson, A R Jacobs, B Kiryutin, E V Koonin, D M Krylov, R Mazumder, S L Mekhedov, A N Nikolskaya, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41-54, 2003.
25. P D Thomas, M J Campbell, A Kejariwal, H Mi, B Karlak, R Daverman, K Diemer, A Muruganujan, and A Narechania. PANTHER: a library of protein families and subfamilies indexed by function. Genome Research, 13:2129-2141, 2003.
26. C M Zmasek and S R Eddy. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics, 3:14, 2002.
DECONVOLUTING THE BAC-GENE RELATIONSHIPS USING A PHYSICAL MAP
1 Yonghui Wu, 1 Lan Liu, 2 Timothy J. Close, 1 Stefano Lonardi*

1 Department of Computer Science and Engineering, University of California, Riverside
2 Department of Botany & Plant Sciences, University of California, Riverside
* Email: [email protected]

Motivation: The deconvolution of the relationships between BAC clones and genes is a crucial step in the selective sequencing of the regions of interest in a genome. It usually requires combinatorial pooling of unique probes obtained from the genes (unigenes), and the screening of the BAC library using the pools in a hybridization experiment. Since several probes can hybridize to the same BAC, in order for the deconvolution to be achievable the pooling design has to be able to handle a large number of positives. As a consequence, smaller pools need to be designed, which in turn increases the number of hybridization experiments, possibly making the entire protocol unfeasible.
Results: We propose a new algorithm that is capable of producing high-accuracy deconvolution even in the presence of a weak pooling design, i.e., when pools are rather large. The algorithm compensates for the decrease of information in the hybridization data by taking advantage of a physical map of the BAC clones. We show that the right combination of combinatorial pooling and our algorithm not only dramatically reduces the number of pools required, but also successfully deconvolutes the BAC-gene relationships with almost perfect accuracy.
Availability: Software available on request from the first author.
1. INTRODUCTION

While the number of fully sequenced organisms is growing rather rapidly, some organisms are unlikely to be sequenced in the immediate future due to the large size and highly repetitive content of their genomes. Many organisms in this latter class are in the plant kingdom. For these cases, a feasible alternative strategy called reduced representation sequencing (see, e.g., 1) entails focusing only on the gene space, that is, the portion of the genome that is gene-rich. It is well known that in higher organisms genes are not uniformly distributed throughout the genome but instead tend to cluster into gene-rich regions of the chromosomes (see, e.g., 11, 9). In many cases, however, even this latter strategy is too expensive or laborious. The next level of reduced sequencing requires a specific list of genes of interest (e.g., abiotic stress-related or pathogen-responsive genes). The task then is to identify the portion of the genome that contains these genes of interest, e.g., by identifying the BAC clones (a) carrying those genes, and then to sequence solely these BAC clones. The identification of the BACs containing a specific set of genes is called deconvolution. More precisely,
the goal of BAC-gene deconvolution is to assign each gene (unigene (b)) to one or more BAC clones in the BAC library for that organism. Another important reason to deconvolute BAC-gene relationships is to place BAC clones (and all the genes that they contain) on the genetic linkage map (c), if such a map is available. This placement is possible when a gene within a BAC has been placed on the genetic linkage map. The assignment of genes to BAC clones can be accomplished experimentally by performing hybridization between a characteristic short sequence in the gene, called a probe, and the entire BAC library. The gene is then assigned to the set of BAC clones that are positive for that probe. In this paper we assume that each probe has the property that it will hybridize only to the BAC clones containing the original gene from which it was obtained. In other words, we assume that each probe is a unique signature for each gene (unigene). The uniqueness of each probe can be established with certainty only if we have complete knowledge of the full sequence of the genome of interest, but the problem of designing such probes is well-studied (see, e.g., 16, 17). Because our
*Corresponding author.
(a) BACs are artificial chromosome vectors derived from bacteria, used for cloning relatively large DNA fragments, typically in the range of 100K nucleotide bases.
(b) A unigene is obtained by assembling one or more ESTs. A unigene (if correctly assembled) is a portion of a gene transcript.
(c) A genetic linkage map is a linear map of the relative positions of genes along a chromosome, where distances are established by analyzing the frequency at which two gene loci become separated during chromosomal recombination.
assumption establishes a one-to-one correspondence between probes and unigenes, in the context of this paper the terms "probes" and "unigenes" are equivalent and can be used interchangeably. Figure 1 illustrates the BAC-unigene deconvolution problem. Each chromosome is "covered" by a set of BAC clones. The totality of the BAC clones constitutes the BAC library. A gene, represented by the arrow below the chromosome, can be covered by one or more unigenes. A typical unigene covers either the 3' or the 5' end of a gene but not the introns. Probes are designed from unigenes trying to avoid splicing sites; otherwise they would not hybridize to the corresponding BAC, which contains the introns as well. Due to the large numbers of BACs and probes, usually in the order of tens of thousands, it is not feasible to carry out a separate hybridization for each probe/unigene. Group testing is typically used to reduce the total number of hybridizations required. Here we assume that the probes are grouped into pools (d) and that the pools are used to screen the BAC library. The hybridization experiments are then carried out for each BAC and probe pool pair. The readout of one such experiment is positive if the BAC under consideration happens to contain a gene that matches one or more probes in the pool, and as a consequence the hybridization takes place. In Computer Science terms, one can think of the hybridization process as some form of approximate string matching between the probe and the BAC. So far the steps required for the deconvolution are (1) design a unique probe for each unigene, (2) group the probes into pools, and (3) screen the BAC library using the pools. The input to the deconvolution problem is the set of readouts of all these hybridizations between pools of probes and the BACs. Clearly, the smaller the size of each pool, the greater the number of experiments needed, and the "easier" the deconvolution. Vice versa, if the size of each pool is too large, then too many BACs will be positive, and the deconvolution may not be achievable at all.
In order to study this trade-off, the notion of decodability needs to be introduced. We say that a particular pooling design is d-decodable if the deconvolution can be achieved in the presence of d or fewer positives. In this specific context, if all BACs happen to hybridize to at most d probes and the pooling is d-decodable, the subset of positive probes can be unambiguously determined from the readouts of the experiments. However, if a BAC hybridizes to more than d probes, it may not be possible to achieve the deconvolution. If the goal is to resolve the BAC-probe/unigene relationships exactly, the pools would have to be designed with a decodability greater than or equal to the maximum number of probes that a BAC might possibly contain. As we will see later in the paper, in order for a pool design to have a high decodability the pools have to be rather small, which in turn can make the number of hybridization experiments prohibitively high. Our objective is to use a pooling design with low decodability (e.g., d = 1 or d = 2), which is fast and inexpensive, and to exploit additional information to achieve deconvolution. More specifically, we require some knowledge of the overlaps between the BAC clones, as provided by a physical map (e) of the genome of interest. The rest of the paper is organized as follows. In Section 2 we define the problem formally, and we describe three algorithms for solving the deconvolution. The first one uses solely the results from the hybridization, the second exploits the physical map, and the third is a variation of the second for the case when we have a "perfect" physical map. Although in practice it is unrealistic to assume a perfect map, this algorithm will be useful in the simulations to test the limits of our method. In that section we also formalize the deconvolution problem (with physical map) as a new optimization problem, and prove that it is NP-hard. The optimization problem is expressed in the form of an integer linear program, which is then relaxed to a linear program in order to be solved efficiently. To obtain the best integer solutions, several iterations of randomized rounding are applied to the solutions obtained from the linear program.
(d) Alternatively, one can pool BACs, but this will not change the nature of the problem.
(e) A physical map consists of a linearly ordered set of BAC clones encompassing the chromosomes. Physical maps can be generated by first digesting the BAC clones with restriction enzymes and then detecting overlaps between clones by matching the lengths of the fragments produced by the digestion.
Fig. 1. An illustration of a chromosome, BACs, genes, unigenes and probes. The goal of deconvolution is to find the set of BACs that each probe belongs to. [Schematic: a chromosome covered by overlapping BACs, with genes below the chromosome and the unigenes and probes derived from them.]
Our randomized algorithm can be proved to achieve a constant approximation ratio. In Section 3 we report experimental results on simulated hybridization data for the rice genome, and on real hybridization data for the genome of barley. The results show that our algorithm is capable of deconvoluting a much higher percentage of the BAC-gene relationships from the hybridization data than the naive basic approach. In particular, if given a high-quality physical map, we can solve almost 100% of the assignments with almost perfect accuracy (if the pooling is well designed).
2. METHODS

As mentioned above, the deconvolution process consists of two major steps. In the first, we use a low-decodability pooling design, screen the BAC library using these pools, and collect the hybridization data. Since the pooling has low decodability, the data obtained in the first step will deconvolute the BAC-unigene relationships only partially. In the second step, we exploit the physical map information to attempt to resolve the remaining ambiguities.
2.1. Problem Formulation

Let O be the set of all the unique probes obtained from the unigenes and let B be the set of all BAC clones. A pool p of probes is simply a subset of O, that is, p ⊆ O. The collection of all pools is denoted by P. The result of the hybridization of BAC b with pool p is captured by the binary function h(b, p). We have
h(b, p) = 1 if the readout is positive, and h(b, p) = 0 otherwise. If we assume no experimental errors, h(b, p) = 1 if and only if there exists at least one probe in the pool p that matches somewhere in BAC b. Given the values of h(b, p) for all pairs b, p, the deconvolution problem is to establish an assignment between the probes in O and the BAC clones in B in such a way that it satisfies the values of h.

2.2. Basic Deconvolution

The basic deconvolution is rather simple. First, we determine for each BAC b the set of probes that it cannot contain, which we denote by Ō_b. This set can be obtained as follows:
    Ō_b = {o ∈ O | ∃p ∈ P such that o ∈ p and h(b, p) = 0}
Next, for each pair (b, p) with a positive hybridization readout, we construct the pair (b, O_{b,p}), where O_{b,p} = p − Ō_b. The presence of the pair (b, O_{b,p}) means that BAC b has to contain at least one probe from the set O_{b,p}. The output of this first step is a list of such pairs, which we denote by R. For any pair (b, O_{b,p}) ∈ R such that |O_{b,p}| = 1, the relationship between BAC b and the only probe o ∈ O_{b,p} is exact, that is, b must contain o (or, alternatively, o is assigned to b). We call exact pairs those pairs in R for which |O_{b,p}| = 1. By definition, exact pairs can be resolved uniquely. However, since the decodability of the pool design is much lower than the number of positives that a BAC may contain, we expect the proportion of exact pairs to be rather small.
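The basic deconvolution step is straightforward to code. A minimal sketch, assuming the readouts h are available as a dictionary (the data-structure choices here are illustrative):

```python
def basic_deconvolution(bacs, pools, h):
    """bacs: iterable of BAC ids; pools: dict pool id -> set of probes;
    h[(b, p)]: 0/1 hybridization readout of BAC b with pool p.
    Returns the list R of non-exact constraints and the exact pairs."""
    R, exact = [], []
    for b in bacs:
        # O-bar_b: probes occurring in at least one negative pool for b
        forbidden = set()
        for p, probes in pools.items():
            if h[(b, p)] == 0:
                forbidden |= probes
        for p, probes in pools.items():
            if h[(b, p)] == 1:
                candidates = probes - forbidden    # the set O_{b,p}
                if len(candidates) == 1:
                    exact.append((b, candidates.pop()))
                else:
                    R.append((b, candidates))
    return R, exact
```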
In practice, the large majority of the pairs in R will be non-exact. For these latter pairs, the relationships between BACs and probes can be resolved only by employing additional information. Next, we present an algorithm that resolves the non-exact pairs using a physical map which may contain errors. Then, we present an algorithm for the case where a high-quality or near-perfect physical map is available.
2.3. Deconvolution Using an Imperfect Physical Map
For the purposes of the deconvolution, the essential information in the physical map is the set of overlaps between BAC clones. We will show how the knowledge of the overlaps between BACs can help resolve ambiguous (non-exact) BAC-probe relationships. The following two observations are the cornerstones of our algorithms. The first observation is that if BAC b1 has a large overlap with BAC b2, and at the same time probe o belongs to BAC b2, then probe o will belong to BAC b1 with high probability. The second observation is that since probes are carefully designed to be unique for a specific gene (unigene), the probability that they will match/hybridize at some location other than the location of the gene from which they were designed is very low. Therefore, if probe o belongs to BAC b1, and BAC b2 does not overlap BAC b1, then o will belong to BAC b2 with very low probability. In this section we assume that we are given a physical map that may contain errors. Such a map could have been obtained, for example, by running the FPC 10 software on some restriction fingerprinting data. The output of FPC is an assembly of the BAC clones into disjoint sets, called contigs. If two BAC clones belong to the same contig, they are very likely to overlap each other. On the other hand, if two BAC clones are from two different contigs, they are very unlikely to overlap. Formally, each contig is a subset of B. Let C denote the collection of contigs. The set C is a partition of B. Since probes are designed to hybridize to only one location throughout the genome, we expect that each of them
will belong to only one contig. The list of pairs in R can be interpreted as a list of constraints. A pair (b, O_{b,p}) ∈ R is satisfied if there exists a probe o ∈ O_{b,p} such that o is assigned to BAC b. Ideally, we would like to compute a BAC-probe assignment such that each probe is assigned to a set of BACs belonging to the same contig and all the constraints in R are satisfied. Due to the fact that the physical map is imperfect and that some of the probes may not match anywhere (f) or may match multiple locations (g) in the genome, we may not be able to satisfy all constraints. Therefore, a reasonable objective is to assign probes to BACs so that the number of satisfied constraints is maximized, subject to the restriction that each probe may only be assigned to one contig. We have now turned the deconvolution problem into an optimization problem, which is tackled in two phases. In phase I, we assign probes to contigs (not to BACs). The list of BAC constraints R is transformed into a new list of contig constraints R' of the same size, as follows. For each constraint (b, O_{b,p}) ∈ R we create the contig constraint (c, O_{b,p}), where c is the contig to which b belongs. A contig constraint (c, O_{b,p}) is satisfied if some probe o ∈ O_{b,p} is assigned to contig c. As said, the goal is to maximize the number of satisfied constraints in R' by assigning each probe to at most one contig. If a constraint in R' appears in multiple copies, that constraint contributes its multiplicity to the objective function when satisfied. In phase II, probes are assigned to BACs. Let o be a probe in O and assume that o was assigned to contig c ∈ C in phase I. In phase II, o is assigned to the following set of BACs:
    {b ∈ c | ∃(b, O_{b,p}) ∈ R such that o ∈ O_{b,p}}.    (1)
It can be easily verified that if the assignment of probes to contigs in phase I is optimal (i.e., it satisfies the maximum number of constraints from R'), the final assignment of probes to BACs using (1) is also optimal (i.e., it satisfies the maximum number of constraints from R). Since the final assignment can be easily obtained from an optimal solution to phase I, the rest of this
(f) This can happen, for example, if the probe happens to cross a splicing site.
(g) In practice it is impossible to guarantee the uniqueness of probes unless one knows the whole sequence of the genome and the hybridization is modeled perfectly in silico.
section will focus on solving the optimization problem in phase I. First we present a formal description of the problem.

MAXIMUM CONSTRAINT-SATISFYING PROBE-CONTIG ASSIGNMENT (MCSPCA)
Instance: A set of probes O, a set of contigs C and a list of constraints R', where each item of R' has the form (c, O_{b,p}), with c ∈ C and O_{b,p} ⊆ O.
Objective: Assign each probe to at most one contig in C such that the number of satisfied constraints in R' is maximized. A constraint (c, O_{b,p}) is satisfied if one or more of the probes from O_{b,p} is assigned to c.

Unfortunately but not surprisingly, the MCSPCA optimization problem is NP-hard. This can be proved by a reduction from the 3SAT problem.

Theorem 2.1. The MCSPCA problem is NP-hard.
Proof. Let us define a language problem L, which corresponds to a special case of the above optimization problem, as follows:

L = {(O, C, R') | there exists a many-to-one mapping from O to C such that all constraints in R' are satisfied}

where O, C and R' are defined as in the optimization problem above. We will prove that L is NP-complete by a reduction from 3SAT. As a result, the MCSPCA optimization problem is also NP-hard. Let φ = (V, S) be an instance of 3SAT, where V = {v1, v2, ..., vn} is the set of variables and S = {s1, s2, ..., sk} is the set of clauses of φ. We construct an instance ξ = (O, C, R') of L with the property that ξ ∈ L if and only if φ is satisfiable. The construction is as follows. O = {v1^t, v1^f, v2^t, v2^f, ..., vn^t, vn^f}, where vi^t and vi^f correspond to the "true" and "false" assignments to variable vi ∈ V, respectively. We set C = {c, v1, v2, ..., vn}, and R' consists of two parts, namely R'1 and R'2. R'1 is used to ensure the satisfiability of all the clauses in S, and R'2 is used to ensure that either vi^t or vi^f is used to satisfy the constraints, but not both of them. We set R'1 = {(c, {s1^1, s1^2, s1^3}), (c, {s2^1, s2^2, s2^3}), ..., (c, {sk^1, sk^2, sk^3})}, where (si^1, si^2, si^3) are the three literals corresponding to clause si ∈ S. For example, if s1 = (v1 ∨ ¬v2 ∨ v3), then the constraint (c, {v1^t, v2^f, v3^t}) is added to R'1. R'2 = {(v1, {v1^t, v1^f}), (v2, {v2^t, v2^f}), ..., (vn, {vn^t, vn^f})}. It can be verified that ξ ∈ L if and only if φ is satisfiable. □

2.3.1. Solving the MCSPCA problem via integer programming

The MCSPCA optimization problem can be solved with integer linear programming (ILP) as follows. Let X_{o,c} be the variable associated with the possible assignment of probe o to contig c, which is set to 1 if o is assigned to c and set to 0 otherwise. Let Y_q be a variable corresponding to the constraint q ∈ R', which is set to 1 if q is satisfied and set to 0 otherwise. The following integer program encodes the MCSPCA problem:

    Maximize    Σ_{q ∈ R'} Y_q
    Subject to  Σ_{c ∈ C} X_{o,c} ≤ 1            ∀o ∈ O
                X_{o,c} ∈ {0, 1}                 ∀o ∈ O, c ∈ C
                Y_q ≤ Σ_{o ∈ S} X_{o,c}          ∀q = (c, S) ∈ R'
                Y_q ∈ {0, 1}                     ∀q ∈ R'           (2)
The first constraint ensures that each probe can be assigned to at most one contig, whereas the third constraint makes sure that a constraint q = (c, S) ∈ R' is satisfied iff one or more probes from S is assigned to c. The integer program above cannot be solved optimally for real-world instances, because solving integer programs is very time-consuming and in the worst case takes exponential time. Faster, approximate algorithms must be obtained. Below we present one such fast approximate algorithm based on relaxation and randomized rounding.

2.3.2. Relaxation and randomized rounding
The integer linear program is first relaxed to the corresponding linear program (LP) by turning all {0, 1} constraints into [0, 1]. The linear program can then be solved optimally and efficiently. Let {X*_{o,c} | ∀o ∈ O, c ∈ C} and {Y*_q | ∀q ∈ R'} be the optimal solution of the linear program. In general, X*_{o,c} and Y*_q are fractional values between 0 and 1. X*_{o,c} is to be interpreted as the probability of assigning probe o to contig c. We apply the standard randomized rounding technique to convert the fractional solution into an integral solution as follows. For each probe o, let C_o = {c | c ∈ C, X*_{o,c} > 0}. If Σ_{c ∈ C_o} X*_{o,c} = 1, randomly assign o to one of the contigs in C_o according to the associated probabilities X*_{o,c}. If Σ_{c ∈ C_o} X*_{o,c} < 1, randomly assign o to one of the contigs in C_o according to the associated probabilities X*_{o,c}, or to none of the contigs with probability 1 − Σ_{c ∈ C_o} X*_{o,c}. The output of this rounding step is an assignment of each probe to at most one contig, as required. This rounding procedure can be applied multiple times; among the assignments produced by the individual rounding steps, the one that satisfies the maximum number of constraints is taken as the final solution.
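One rounding pass can be written in a few lines; the function below is an illustrative sketch (in practice it would be repeated K times, keeping the assignment that satisfies the most constraints):

```python
import random

def round_once(x_star, probes, contigs):
    """x_star[(o, c)] holds the fractional LP value X*_{o,c}.
    Returns a dict assigning each probe to at most one contig."""
    assignment = {}
    for o in probes:
        support = [c for c in contigs if x_star.get((o, c), 0.0) > 0.0]
        weights = [x_star[(o, c)] for c in support]
        slack = max(1.0 - sum(weights), 0.0)   # probability of no assignment
        choice = random.choices(support + [None], weights + [slack])[0]
        if choice is not None:
            assignment[o] = choice
    return assignment
```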
The following theorem shows that the randomized rounding step achieves a constant approximation ratio.

Theorem 2.2. The randomized MCSPCA algorithm achieves approximation ratio (1 − e^{−1}).

Proof. Let us still use {X*_{o,c} | ∀o ∈ O, c ∈ C} and {Y*_q | ∀q ∈ R'} to denote the optimal solution obtained by solving the linear program. Let OPT_f be the optimal value of the objective function of the linear program, that is, OPT_f = Σ_{q ∈ R'} Y*_q. Let I_q be an indicator random variable, which is set to 1 if the constraint q is satisfied under our randomized rounding step and to 0 otherwise. Let W denote the total number of satisfied constraints after the rounding step. Clearly, W = Σ_{q ∈ R'} I_q, and E(W) = Σ_{q ∈ R'} Prob(I_q = 1). Consider the following two cases:

(1) Y*_q = 1. Let q be of the form (c, S), where c ∈ C and S ⊆ O. In this case, Σ_{o ∈ S} X*_{o,c} ≥ 1. Then

    Prob(I_q = 1) = 1 − Prob(I_q = 0) = 1 − Π_{o ∈ S} (1 − X*_{o,c}) ≥ 1 − e^{−Σ_{o ∈ S} X*_{o,c}} ≥ 1 − e^{−1}.

So, Prob(I_q = 1)/Y*_q ≥ 1 − e^{−1}.

(2) Y*_q < 1. Let q be of the form (c, S), where c ∈ C and S ⊆ O. In this case, Y*_q = Σ_{o ∈ S} X*_{o,c}. Then

    Prob(I_q = 1) = 1 − Prob(I_q = 0) = 1 − Π_{o ∈ S} (1 − X*_{o,c}) ≥ 1 − e^{−Y*_q}.

So, Prob(I_q = 1)/Y*_q ≥ min_{0 < y ≤ 1} (1 − e^{−y})/y = 1 − e^{−1}.

Therefore, to sum up, E(W) ≥ (1 − e^{−1}) OPT_f. □

The algorithm can be de-randomized via the method of conditional expectation to achieve a deterministic performance guarantee; the de-randomization step follows the procedure in 13, page 132. We observe that the approximation algorithm we propose here is similar to the one for MAX-SAT 14. To summarize, a sketch of the MCSPCA algorithm is presented in Figure 2.

2.4. Deconvolution Using a Perfect Physical Map
As said, although it is not realistic to assume that a perfect or near-perfect physical map is available, this variation of the problem allows us to establish the limits on how many assignments we can correctly deconvolute from the hybridization data. This is particularly useful for simulations, to ensure that our algorithm can achieve good results if the input physical map is of good quality. If we are given a perfect (or near-perfect) physical map, the problem can be tackled from the "opposite" direction. Instead of trying to take advantage of the grouping of BACs into disjoint contigs, we partition each BAC into several pieces. We preprocess the physical map as follows. We align the BACs along the chromosome according to their relative positions on the physical map. Then, starting from the 5' end, we cut the chromosome at each location where a BAC either starts or ends. This process breaks the genome into at most 2n fragments, where n is the total number of BACs. Each fragment is covered by one or more BACs, and some fragments may be covered by exactly the same set of BACs. In the latter case, only one fragment is kept while the others are removed. At the end of this preprocessing phase, a set of fragments is produced where each fragment is covered by a distinct set of overlapping BACs (see the sketch below).
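This preprocessing is a standard interval-splitting computation. A minimal sketch, assuming each BAC is given by its (start, end) coordinates:

```python
def build_fragments(bac_intervals):
    """bac_intervals: dict BAC id -> (start, end) on the chromosome.
    Cuts the chromosome at every BAC boundary and keeps one fragment
    per distinct covering set B(f). Returns (fragment, B(f)) pairs."""
    cuts = sorted({x for s, e in bac_intervals.values() for x in (s, e)})
    seen, fragments = set(), []
    for left, right in zip(cuts, cuts[1:]):
        covering = frozenset(b for b, (s, e) in bac_intervals.items()
                             if s <= left and right <= e)
        if covering and covering not in seen:
            seen.add(covering)
            fragments.append(((left, right), covering))
    return fragments
```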
Algorithm MCSPCA(O, C, B, R)
0. Convert R to R'
1. Generate the integer program (2) from (O, C, R')
2. Solve the LP relaxation of the ILP in step 1, and obtain the optimal fractional solution {X*_{o,c}} and {Y*_q}
3. Apply K steps of randomized rounding and save the best solution:
   for each o ∈ O do
       C_o = {c | c ∈ C, X*_{o,c} > 0}
       Assign o to c ∈ C_o with probability X*_{o,c}, or to none of the contigs with probability 1 − Σ_{c ∈ C_o} X*_{o,c}
4. Further assign probes to BACs:
   if o is assigned to c in step 3, then assign o to the set of BACs {b ∈ c | ∃(b, O_{b,p}) ∈ R s.t. o ∈ O_{b,p}}

Fig. 2. Sketch of the two-phase deconvolution algorithm that exploits an imperfect physical map.
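Step 2 of the algorithm (the LP relaxation of the ILP (2)) can be assembled with any LP solver; the experiments in Section 3 use GLPK. Below is an equivalent sketch built with the PuLP modeling library, purely for illustration:

```python
import pulp

def solve_lp_relaxation(probes, contigs, constraints):
    """constraints: list of (c, S) pairs, with S a set of probe ids.
    Returns the fractional optimum {X*_{o,c}} and {Y*_q}."""
    lp = pulp.LpProblem("MCSPCA_relaxation", pulp.LpMaximize)
    X = {(o, c): pulp.LpVariable(f"x_{o}_{c}", 0, 1)
         for o in probes for c in contigs}
    Y = [pulp.LpVariable(f"y_{i}", 0, 1) for i in range(len(constraints))]
    lp += pulp.lpSum(Y)                                   # objective
    for o in probes:                                      # first constraint
        lp += pulp.lpSum(X[o, c] for c in contigs) <= 1
    for i, (c, S) in enumerate(constraints):              # third constraint
        lp += Y[i] <= pulp.lpSum(X[o, c] for o in S)
    lp.solve(pulp.PULP_CBC_CMD(msg=False))
    return ({key: var.value() for key, var in X.items()},
            [y.value() for y in Y])
```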
Let us denote the final set of fragments by F. Given a fragment f ∈ F and a BAC b ∈ B, we use B(f) to denote the set of BACs that cover f, and F(b) to denote the set of fragments that b covers. For the same reasons mentioned above, we expect that each probe will match its intended place in the genome and nowhere else. Our goal is to assign each probe to one fragment while at the same time maximizing the number of satisfied constraints in R. A constraint (b, O_{b,p}) ∈ R is satisfied if one or more probes from O_{b,p} is assigned to any of the fragments in the set F(b). Given an assignment between probes and fragments, the probe-BAC assignment can be easily obtained. Below is a formal statement of our new optimization problem.
MAXIMUM CONSTRAINT-SATISFYING PROBE-FRAGMENT ASSIGNMENT (MCSPFA)
Instance: A set of fragments F, a set of probes O, a set of BACs B and a list of constraints R = {(b, O_{b,p}) | b ∈ B, O_{b,p} ⊆ O}.
Objective: Assign each probe in O to at most one fragment in F, such that the number of satisfied constraints in R is maximized.

The MCSPFA problem is also NP-hard, since it is a special case of MCSPCA, obtained when all BACs in B are disjoint.

2.4.1. Solving the MCSPFA problem via integer programming
A variant of the integer program that we presented for MCSPCA can also solve this problem optimally. Let X_{o,f} be a variable associated with the possible assignment of probe o to fragment f, which is set to 1 if o is assigned to f, and to 0 otherwise. Let Y_q be defined in the same way as in the previous integer program. The integer linear program for MCSPFA follows:
    Maximize    Σ_{q ∈ R} Y_q
    Subject to  Σ_{f ∈ F} X_{o,f} ≤ 1                      ∀o ∈ O
                X_{o,f} ∈ {0, 1}                           ∀o ∈ O, f ∈ F
                Y_q ≤ Σ_{o ∈ S} Σ_{f ∈ F(b)} X_{o,f}       ∀q = (b, S) ∈ R
                Y_q ∈ {0, 1}                               ∀q ∈ R        (3)
The major difference between the ILP (3) and the ILP (2) is in the third constraint. The third constraint in the ILP above translates the fact that a constraint (b, S) ∈ R is satisfied if any probe in S is assigned to any fragment in F(b).

2.4.2. Relaxation, rounding and analysis
Following the same strategy used for the MCSPCA problem, the ILP is relaxed to its corresponding LP, and then the LP is solved optimally. Let {X*_{o,f} | ∀o ∈ O, f ∈ F} and {Y*_q | ∀q ∈ R} denote the optimal solution of the LP. The fractional solution is rounded to an integer solution by interpreting X*_{o,f} as the probability of assigning probe o to fragment f. Let OPT_f be the optimal value of the objective function in the LP, that is, OPT_f = Σ_{q ∈ R} Y*_q. Let I_q be the indicator random variable, which is 1 if the constraint q is satisfied under the above randomized rounding step, and 0 otherwise. Let W denote the total number of satisfied constraints after the rounding step. Clearly, W = Σ_{q ∈ R} I_q, and E(W) = Σ_{q ∈ R} Prob(I_q = 1). A similar analysis
to the one carried out for MCSPCA applies to MCSPFA as well, and as a consequence we can prove that E(W) ≥ (1 − e^{−1}) OPT_f. We can therefore claim the following theorem.

Theorem 2.3. The randomized MCSPFA algorithm achieves approximation ratio (1 − e^{−1}).
The pseudo-code of the MCSPFA algorithm is presented in Figure 3.
3. EXPERIMENTAL RESULTS

In order to evaluate the performance of our algorithms, we applied them to two datasets. The first one is partially simulated data on the rice genome, while the second is real-world data from hybridizations carried out in Prof. Close's lab at UC Riverside on the barley genome. Before delving into the experimental setup, we give a short description of the pooling design, which is relevant to the discussion.
3.1. Pooling Design

Pooling design (or group testing) is a well-studied problem in the scientific literature (see 2 and references therein). Traditionally, biologists use the rather rudimentary 2D or 3D grid design. There are, however, more sophisticated pooling strategies (see, e.g., 7, 3, 12). To the best of our knowledge, the shifted transversal design (STD) 12 is among the best choices due to its capability to handle multiple positives, and its flexibility and efficiency. STD pools are constructed in layers, where each layer consists of P pools, with P a prime number. Each layer constitutes a partition of the probes, and the larger the number of layers, the higher the decodability of the pooling. More specifically, let Γ be the smallest integer such that P^(Γ+1) is greater than or equal to the number of probes to be pooled, and let L be the number of layers. Then, the decodability of the pool set is equal to ⌊(L − 1)/Γ⌋. In order to increase the decodability by one, an additional ΓP pools (i.e., Γ layers) are needed. By a simple calculation, one can realize that the number of pools required to provide sufficient information for deconvolution (e.g., to be at least 10-decodable) for a real-world problem with 50,000 BACs and 50,000 unigenes is prohibitively high.
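The STD bookkeeping above is easy to verify in code; the parameters below are those of the four pool sets used in Section 3.2 (2,002 probes in probe set 1):

```python
def std_parameters(n_probes, P, L):
    """Return (Gamma, number of pools, decodability) for a shifted
    transversal design with prime P and L layers (assumes Gamma >= 1)."""
    gamma = 1
    while P ** (gamma + 1) < n_probes:
        gamma += 1
    return gamma, P * L, (L - 1) // gamma

for P, L in [(13, 3), (13, 5), (47, 2), (47, 3)]:
    print(P, L, std_parameters(2002, P, L))
# decodabilities come out as 1, 2, 1, 2, matching pool sets 1-4 of Section 3.2
```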
3.2. Experimental Results on the Rice Genome

The rice "virtual" BAC library used here is a subset of the Nipponbare BAC library, whose BAC end sequences are hosted at the Arizona Genomics Institute (AGI). The fingerprinting data for this library was also obtained from AGI. Our rice virtual library contains the subset of BACs in the original library whose location on the rice genome (Oryza sativa) can be uniquely identified and for which restriction fingerprinting data was available. The location of the BACs was determined by BLASTing the BAC end sequences against the rice genome. Since our rice virtual BAC library is based on real fingerprinting data (agarose gel), we expect the overlapping structure of the rice BACs in the physical map to be an accurate representation of what would happen in other organisms. Also, since we know the actual coordinates of the BACs on the rice genome, we can also produce a perfect physical map and test the maximum amount of deconvolution that can be extracted from the hybridization data. For the purposes of this simulation, we restricted our attention to chromosome 1 of rice, which includes 2,629 BACs. This set of BACs provides an 8.59X coverage of chromosome 1. We created a physical map of that chromosome by running FPC 10 on the fingerprinting data with the cutoff parameter set to 1e-15 (all other parameters were left at their defaults). FPC assembled the BACs into 347 contigs and 416 singletons. Not including the singletons, each contig contains on average about 6.4 BACs. Given that the fingerprinting data is noisy, the physical map assembled by FPC cannot be expected to be perfect. It is also well known that the order of the BACs within a contig is generally not reliable. We then obtained rice unigenes from NCBI (build #65) with the objective of designing the probes. First, we had to establish the subset of these unigenes that belong to chromosome 1. We BLASTed the unigenes against the rice genome, and we selected a subset of 2,301 unigenes for which we had high confidence to be located on chromosome 1. Then, we computed unique probes using OLIGOSPAWN 16, 17. This tool produces 36-nucleotide-long probes, each of which matches exactly the unigene it represents while at the same time not matching (even approximately) any other
Algorithm MCSPFA(O, F, B, Φ)
1. Generate the integer program (3) from (O, F, Φ)
2. Solve the LP relaxation of the ILP in step 1, and obtain the optimal fractional solution {X*_{o,f}} and {Y*_f}
3. Apply K steps of randomized rounding and save the best solution:
   for each o ∈ O do
       F_o = { f | f ∈ F, X*_{o,f} > 0 }
       assign o to f ∈ F_o with probability X*_{o,f}, or to none of the fragments with probability 1 − Σ_{f ∈ F_o} X*_{o,f}
4. Further assign probes to BACs: if o is assigned to f in step 3, then assign o to all the BACs b such that f ∈ F(b)

Fig. 3. Sketch of the two-phase deconvolution algorithm that exploits a perfect physical map.
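A minimal sketch of the per-probe randomized rounding in step 3 of the algorithm above (our own notation; the fractional LP solution is assumed to be given as a nested dictionary) could look as follows:

import random

# x_frac[o][f] holds the fractional LP value X*_{o,f} for probe o and
# fragment (contig) f. Each probe is assigned to at most one fragment.

def round_assignment(x_frac):
    assignment = {}
    for o, weights in x_frac.items():
        fragments = list(weights)
        probs = [weights[f] for f in fragments]
        leftover = max(0.0, 1.0 - sum(probs))  # probability of "no fragment"
        choice = random.choices(fragments + [None], probs + [leftover])[0]
        if choice is not None:
            assignment[o] = choice
    return assignment

# Step 3 repeats this K times and keeps the rounding that satisfies the
# most constraints; a driver loop would score each candidate assignment.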
This tool produces 36-nucleotide probes, each of which matches exactly the unigene it represents and at the same time does not match (even approximately) any other unigene in the dataset. OLIGOSPAWN successfully produced unique probes for 2,002 unigenes (out of 2,301). The remaining unigenes were not represented by a probe because no unique 36-mer could be found. This set of 2,002 probes is named probe set 1. Some of the probes in probe set 1, however, did not match anywhere in the genome. This can happen when the chosen probe crosses a splice site or when the unigene from which it was selected was misassembled. In probe set 1, 330 probes did not match anywhere on rice chromosome 1. The remaining 1,672 probes matched exactly once in rice chromosome 1. This latter set constitutes our probe set 2, which is a "cleaner" version of probe set 1. We observe that in order to clean the probe set in this way one has to have access to the whole genome, which is somewhat unrealistic in practice. Each BAC contains on average 5.8 probes and at most 20 probes. The probes were hybridized in silico to the BACs using the following criterion: a 36-nucleotide probe hybridizes to a BAC if they share a common (exact) substring of 30 nucleotides or more. The criterion was debated at length with the biologists in Prof. Close's lab and, among all suggestions, this one was chosen for its simplicity. Observe that the hybridization table is the only synthetic data in this dataset.
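A direct way to compute such a hybridization table (a sketch under the stated 30-mer criterion; the function and variable names are illustrative, not the authors' code) is to index the k-mers of each BAC once and test the probe's windows against that index:

def hybridizes(probe, bac_sequence, k=30):
    # Index every k-mer of the BAC, then test each k-mer of the probe.
    bac_kmers = {bac_sequence[i:i + k] for i in range(len(bac_sequence) - k + 1)}
    return any(probe[i:i + k] in bac_kmers
               for i in range(len(probe) - k + 1))

# For a 36-nt probe there are only 7 windows of length 30 to test, so the
# table over all probe/BAC pairs is cheap to build.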
The next step was to pool the probes for group testing. We followed the shifted transversal design pooling strategy [12] and designed four sets of pools for different choices of the pooling parameters P and L. Recall that in STD the number of pools is P·L. Pool set 1 is 1-decodable, obtained by choosing P = 13 and L = 3. Pool set 2 uses two extra layers (P = 13, L = 5), which increases the decodability by 1. For pool set 3, we chose P = 47 and L = 2. Pool set 3 is also 1-decodable, but since each pool contains a smaller number of probes, it deconvolutes the BAC-probe relationships better than pool set 1. Pool set 4 is constructed from pool set 3 by adding an additional layer (P = 47, L = 3); as a consequence, pool set 4 is 2-decodable. The four pooling designs were applied to probe set 1 and probe set 2; in total, we constructed 8 sets of pools. The hybridization tables between pools and BACs were formed for the 8 sets of pools. Then, the basic deconvolution step (described in Section 2.2) was carried out. This step produced a list of constraints, some of which were exact pairs and could be deconvoluted immediately. The set of BACs, the set of probes, the list of constraints from the previous step, and the physical map produced by FPC were then fed into MCSPCA. Our ILP-based algorithm produced a set of BAC-probe assignments, which was then merged with the exact pairs obtained by the basic deconvolution to produce the final assignment. Similarly, the set of BACs, the set of probes, the list of constraints from the basic deconvolution, and the perfect physical map were fed into MCSPFA. The assignment obtained by MCSPFA was also merged with the exact pairs to produce the final assignment. For both algorithms we used the GNU Linear Programming Kit (GLPK) to solve the linear programs. The linear programs are quite large: the number of variables ranged from 47,820 to 165,972, and the number of constraints ranged from 29,412 to 60,475. To evaluate the accuracy of our algorithms, we employ two performance metrics, namely recall and precision.
Table 1. Assignment accuracy of MCSPCA (with imperfect physical map) and MCSPFA (with perfect physical map) on probe set 1.

pooling          #pools   #true assigns   basic recall   MCSPCA recall   MCSPCA precision   MCSPFA recall   MCSPFA precision
P = 13, L = 3      39        14742           0.0103          0.199            0.2647            0.4857           0.4227
P = 13, L = 5      65        14742           0.2726          0.618            0.7668            0.9708           0.8511
P = 47, L = 2      94        14742           0.0173          0.4005           0.5236            0.8856           0.7626
P = 47, L = 3     141        14742           0.763           0.9069           0.9446            0.9991           0.9798
Table 2. Assignment accuracy of MCSPCA (with imperfect physical map) and MCSPFA (with perfect physical map) on probe set 2.

pooling          #pools   #true assigns   basic recall   MCSPCA recall   MCSPCA precision   MCSPFA recall   MCSPFA precision
P = 13, L = 3      39        14742           0.0121          0.2163           0.3182            0.625            0.6214
P = 13, L = 5      65        14742           0.3111          0.6488           0.8314            0.9984           0.9964
P = 47, L = 2      94        14742           0.0298          0.4348           0.6009            0.9971           0.9962
P = 47, L = 3     141        14742           0.8182          0.9285           0.9767            0.9995           0.9997

Table 3. Performance of the randomized rounding scheme on probe set 1.

pooling          #constraints   OPT_f (MCSPCA)   W (MCSPCA)   W/OPT_f (MCSPCA)   OPT_f (MCSPFA)   W (MCSPFA)   W/OPT_f (MCSPFA)
P = 13, L = 3       35071           28683           22615          0.7884             35033           30562          0.8724
P = 13, L = 5       58472           45220           41462          0.9169             58425           58277          0.9975
P = 47, L = 2       27591           21472           18633          0.8678             27567           26524          0.9622
P = 47, L = 3       41509           29458           29378          0.9973             41467           41467          1

Table 4. Performance of the randomized rounding scheme on probe set 2.

pooling          #constraints   OPT_f (MCSPCA)   W (MCSPCA)   W/OPT_f (MCSPCA)   OPT_f (MCSPFA)   W (MCSPFA)   W/OPT_f (MCSPFA)
P = 13, L = 3       35102           27161           21567          0.7940             35089           31183          0.8887
P = 13, L = 5       58210           42795           39934          0.9331             58179           58179          1
P = 47, L = 2       27739           20127           17990          0.8938             27711           27711          1
P = 47, L = 3       41378           28567           28532          0.9988             41378           41338          0.9990
Recall is defined as the number of correct assignments made by our algorithm divided by the total number of true assignments. Precision is defined as the number of correct assignments made by our algorithm divided by the total number of assignments our algorithm made. Tables 1 and 2 summarize the assignment accuracy of our algorithms. "Basic recall" is the recall of the basic deconvolution step (precision is not reported because it is always 100%). A few observations are in order. First, note that 2-decodable pooling designs achieve much better performance than 1-decodable pooling. Second, probe set 2 provides better quality data and, as a consequence, improves the deconvolution. However, even if we stick with the more realistic probe set 1 and the noisy physical map, our algorithm still achieves 91% recall and 94% precision for the pooling P = 47, L = 3. Even more impressive is the amount of additional deconvolution achieved for the other 2-decodable pooling when compared to the basic deconvolution. For example, for P = 13, L = 5 the pooling is composed of only 65 pools, and the basic deconvolution therefore achieves just 27% recall. Our algorithm, however, achieves 62% recall with 77% precision. The results for the perfect map show that our algorithm could potentially deconvolute all BAC-probe pairs with almost 100% precision (if the pooling is "powerful" enough). Finally, in order to show the effectiveness of our randomized rounding scheme, Tables 3 and 4 report the total number of constraints, the optimal value OPT_f of the LP, and the number W of satisfied constraints after rounding. Note that the rounding scheme does not significantly affect the value of the objective function (see the ratio W/OPT_f).
3.3. Experimental Results on the Barley Genome
The second dataset is related to the barley (Hordeum vulgare) project currently in progress at UC Riverside. The barley BAC library used is a Morex library covering 6.3 genome equivalents [15]. Selected BACs from the BAC library were fingerprinted using a technique that employs four different restriction enzymes, called high information content fingerprinting [6]. The physical map was constructed using FPC. The total number of BACs that were successfully fingerprinted and that are present in the physical map is 43,094. Among the BACs present in the physical map, about 20 have been fully sequenced. They will be used for validation of our algorithm. The barley unigenes were obtained by assembling the ESTs downloaded from the NCBI EST database. The unigene assembly contains in total 26,743 contigs and 27,121 singletons. About a dozen research groups around the world contributed hybridization data. Each group designed probes for certain genes of interest and performed the hybridization experiments. Since those efforts were not centrally coordinated, the probe design and the pool design were completely ad hoc. The length of the probes ranges from 36 nucleotides to a few hundred bases. The number of unigenes that each pool represents also ranges from one to a few hundred. We collected the data and transformed it into a list of constraints that we processed first with the basic deconvolution. Recall that if we obtain an exact pair, the assignment is immediate. But if a constraint is non-exact, we cannot conclude much, even if the size of CD_{b,p} is very small. Intuitively, however, those constraints for which |CD_{b,p}| is small are the most informative. In an attempt to filter out the noise and isolate the informative constraints, we selected only those for which |CD_{b,p}| ≤ 50. In total, 14,796 constraints were chosen. Then, we focused only on the 5,327 BACs and the 2,263 unigenes that were involved in this selected set of constraints. We then used our MCSPCA method on this reduced set (along with the barley physical map produced by FPC) and obtained 9,587 assignments. We cross-referenced these assignments to the small set of sequenced BACs and determined that six of them were in common with the 5,327 BACs we selected. Our algorithm assigned eight unigenes to those six BACs, and six of these assignments turned out to be correct by matching them to the sequences of the 20 known BACs.
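The filtering step described above is simple to express in code (a sketch with a hypothetical data layout, where each non-exact constraint couples a BAC b with its candidate probe set CD_{b,p}):

def informative_constraints(constraints, max_size=50):
    """constraints: iterable of (bac, candidate_probe_set) pairs.

    Keep only the constraints whose candidate set is small enough to be
    informative; the surviving BACs and probes define the reduced
    instance handed to MCSPCA together with the FPC physical map.
    """
    return [(b, cd) for b, cd in constraints if len(cd) <= max_size]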
4. CONCLUSIONS
In this paper, we proposed a new method to solve the BAC-gene deconvolution problem. Our method compensates for a weaker pooling design by exploiting a physical map. The deconvolution problem is formulated as a pair of combinatorial optimization problems, both of which are proved to be NP-complete. The combinatorial optimization problems are solved approximately via integer programming, followed by linear programming relaxation and then randomized rounding. Our experimental results on both real and simulated data show that our method is very accurate in determining the correct mapping between unigenes and BACs. The right combination of combinatorial pooling and our method not only can dramatically reduce the number of pools required, but also can deconvolute the BAC-gene relationships almost perfectly.

ACKNOWLEDGMENTS
This project was supported in part by NSF CAREER IIS-0447773, NIH LM008991-01, and NSF DBI-0321756. The authors thank Serdar Bozdag for providing the data for the simulation on the rice genome.

References
1. Barbazuk WB, Bedell JA, Rabinowicz PD. Reduced representation sequencing: a success in maize and a promise for other plant genomes. Bioessays 2005; 27: 839-848.
2. Du DZ, Hwang FK. Combinatorial Group Testing and Its Applications, 2nd edition. World Scientific
2000.
3. Dyachkov A, Hwang F, Macula A, et al. A construction of pooling designs with some happy surprises. Journal of Computational Biology 2005; 12: 1129-1136.
4. Flibotte S, Chiu R, Fjell C, et al. Automated ordering of fingerprinted clones. Bioinformatics 2004; 20: 1264-1271.
5. Goemans MX, Williamson DP. New 3/4-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics 1994; 7: 656-666.
6. Luo MC, Thomas C, You FM, et al. High-throughput fingerprinting of bacterial artificial chromosomes using the SNaPshot labeling kit and sizing of restriction fragments by capillary electrophoresis. Genomics 2003; 82: 378-389.
7. Macula AJ. A simple construction of d-disjunct matrices with certain constant weights. Discrete Mathematics 1996; 162: 311-312.
8. Sipser M. Introduction to the Theory of Computation. International Thomson Publishing 1996.
9. Sandhu D, Gill KS. Gene-containing regions of wheat and the other grass genomes. Plant Physiology 2002; 128: 803-811.
10. Soderlund C, Longden I, Mott R. FPC: a system for building contigs from restriction fingerprinted clones. Computer Applications in the Biosciences 1997; 13: 523-535.
11. Sumner AT, de la Torre J, Stuppia L. The distribution of genes on chromosomes: a cytological approach. Journal of Molecular Evolution 1993; 37: 117-122.
12. Thierry-Mieg N. A new pooling strategy for high-throughput screening: the shifted transversal design. BMC Bioinformatics 2006; 7: 28-37.
13. Vazirani VV. Approximation Algorithms. Springer 2001.
14. Yannakakis M. On the approximation of maximum satisfiability. SODA '92: Proceedings of the Third Annual ACM-SIAM Symposium on Discrete Algorithms 1992; 1-9.
15. Yu Y, Tomkins JP, Waugh R, et al. A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theoretical and Applied Genetics 2000; 101: 1093-1099.
16. Zheng J, Close TJ, Jiang T, Lonardi S. Efficient selection of unique and popular oligos for large EST databases. Bioinformatics 2004; 20: 2101-2112.
17. Zheng J, Svensson JT, Madishetty K, et al. OligoSpawn: a web-based tool for the design of overgo probes from large unigene databases. BMC Bioinformatics 2006; 7: 7-15.
A GRAMMAR BASED METHODOLOGY FOR STRUCTURAL MOTIF FINDING IN ncRNA DATABASE SEARCH
Daniel Quest*, William Tapprich†, Hesham Ali*

*College of Information Science and Technology, University of Nebraska at Omaha
†Department of Biology, University of Nebraska at Omaha
Omaha, NE 68182-0694, USA
E-mail: djquest@unmc.edu
In recent years, sequence database searching has been conducted through local alignment heuristics, pattern-matching, and comparison of short statistically significant patterns. While these approaches have unlocked many clues as to sequence relationships, they are limited in that they do not provide context-sensitive searching capabilities (e.g., considering pseudoknots, protein binding positions, and complementary base pairs). Stochastic grammars (hidden Markov models, HMMs, and stochastic context-free grammars, SCFGs) do allow for flexibility in terms of local context, but the context comes at the cost of increased computational complexity. In this paper we introduce a new grammar-based method for searching for RNA motifs that exist within a conserved RNA structure. Our method constrains computational complexity by using a chain of topology elements. Through the use of a case study we present the algorithmic approach and benchmark our approach against traditional methods.
1. INTRODUCTION
Functional non-coding RNA (ncRNA) has received great attention in recent years because of its diverse functional activities within the cell. An ncRNA forms a secondary structure that enables other molecules to interact with it and carry out functional activities. In many cases, molecules interact with conserved primary structure patterns or motifs, given that the ncRNA is in the correct secondary structure conformation. Because of this, the bioinformatics community has focused considerable energy on methods that predict ncRNA secondary structure and search for homologous structures within a sequence database (e.g., [1, 2, 3]). Currently there are many approaches to finding an RNA homolog. The first approach is to construct a structure for the sequence, and then use that structure to query a sequence database [1]. One option to construct the structure is to use sequence profiles of the RNA as it is conserved through evolution [4]. Another approach is to use a package such as Mfold [5] and chemical probing validation experiments to determine the RNA structure. As soon as the structure is determined, one can use pattern-matching software to find structural homologs within the RNA database. Pattern-matching software packages were first used to find homologous tRNAs [6, 7]. Over time, they have evolved to consider multiple different abstractions of the structural patterns. Pattern-matching
programs have evolved from regular expression tools to scripting languages capable of considering errors, non-Watson-Crick base pairs, complementary base pairing, and common structural profiles. Example programs include RNABob [8], RNAMOT [9], Palingol [10], and RNAMotif [11]. Although these methods are extremely powerful and fast, they require significant user expertise to obtain reliable profiles. In addition, they do not easily allow probabilistic scoring schemes to be integrated into them. This implies that these tools return all hits that are possible given our current understanding of the secondary structure. These tools do not rank profiles based on what is most likely to occur based on the phylogenetic relationships of the ncRNA. A second approach to finding an ncRNA homolog is to use a stochastic context-free grammar (SCFG) [12] to simultaneously align the primary sequence and the secondary structure. SCFGs have an advantage over pattern-matching programs in that they require less manual expertise and tuning to find accurate structural alignments once the global parameters are set. In practice, however, they are impractical because of their O(n^4) running time [1]. If pseudoknots are considered [13], the O(n^6) time complexity makes database searches impossible. To circumvent these obstacles, Weinberg and Ruzzo proposed an HMM filter that allows for faster ncRNA searches without loss of accuracy [2]. Recently, Zhang et al. proposed an additional filter [3] and a sequence filtering methodology
[14] for constructing fast SCFG searches without loss of accuracy. This capability allows us to construct queries over large datasets based on primary and secondary structure instead of primary sequence alone. However, implicit in the assumptions of these filtering techniques is the concept that scoring matrices are homogeneous across all putative alignment regions. In some cases, however, we have additional evidence that requires our scoring system to be heterogeneous. For example, when some of the bases have been biologically verified by chemical probing and other bases have not, we want a model with two classes: verified and not verified. We then wish to search for RNAs that have a conserved secondary structure subject to the constraint that all verified bases remain functional. In other words, we want the bases in the motif to remain functional, so they cannot participate in other base pair interactions in the folding of the molecule. This problem can be solved with pattern-matching programs, but search results suffer because errors are not scored in a probabilistic way. Consequently, for a short ncRNA, such a program can return a pattern that satisfies all constraints but is not closely related to known ncRNAs found in nature. This implies that the number of matches is dictated by the length of the database instead of the functional relationships inside the database. SCFGs can be modified to impose additional constraints through additional grammar rules, but this process is time-consuming. More importantly, changing the grammar and the parameters has the effect of also changing the relationships used to construct the filters that allow SCFGs to run in reasonable time. In this paper, we propose a new approach to search a sequence database for RNA structures that have known functional sites (motifs). Our approach uses the strategy of nested grammars to simultaneously integrate secondary structure, primary structure, and biologically verified constraints. We will show that our method is capable of finding significant substrings or motifs when pattern-matching approaches cannot, and that our method can serve as a reasonable second step for imposing constraints on putative hits from a filtered SCFG search (therefore avoiding the need to construct constraint-aware
filters). To illustrate our nested-grammar paradigm, we first show that a grammar with favorable runtime characteristics can be used as an approximation for a grammar with more complex runtime characteristics. In this way, a heuristic for a complex grammar G can be generated via a simple grammar G′. We also show that G′ can provide a solution within τ of G, where τ is an arbitrary error threshold. Finally, we illustrate our algorithm for evaluating G and G′ via an example and a case study.
2. PROBLEM DESCRIPTION AND METHODOLOGY
Our primary interest in this work is to search large databases for significant signals within a conserved two-dimensional structure. Given that we know some pattern or signal from biologically verified data, we wish to find two-dimensional structural homologs in a database subject to the constraint that the structural homolog must contain this signal. In this paper, we present a robust grammar-based approach for finding non-deterministic RNA structural motifs in a conserved secondary structure. Like the pattern-matching approaches, our approach allows the user to decide the level of flexibility of the constraints of the profile to search. Given reasonable constraints, the proposed approach also has a favorable computational complexity. The core idea is to define a primary grammar for running the nucleotide comparisons, and a secondary grammar to model the secondary structure relationships. This idea is similar to the idea independently developed in MilPat using constraint networks [15]. In MilPat, a constraint network is used to model secondary structure dependencies. Our approach, on the other hand, uses a secondary grammar to model constraints. The key advantage of using nested grammars is that all constraints can be integrated homogeneously using Bayes' rule. The direct impact is that we allow for mismatches, and thus our models can be entirely probabilistic in nature. Our nested grammar based method functions by considering two grammars: G and G′. The structural alignment grammar G represents the known constraints that exist in the molecule. The sequence alignment grammar G′ represents an ordered set of phylogenetic subsequences found in the structure. Elements of G′ may be scored with traditional
scoring matrices, or additional information from biological experiments can be added to score subsequences heterogeneously. We use a pairwise hidden Markov model (discussed below) to evaluate all possible alignment positions for subsequences in G′ and then combine evidence using the more robust SCFG grammar to select the subsequence alignments with the most supporting evidence and construct the grammar-to-sequence alignment. The next few paragraphs provide a background for our method. In sections 2.1 to 2.5 we detail the key components of the algorithm. The algorithm in its entirety is presented in section 3.
Consider a pattern P = {p_1, p_2, ..., p_m} that we wish to search for, and a sequence S such that |P| < |S| (|x| denotes the length of x). Our objective is to create the sequence S from P using production rules from the set: insert (I = -/b), delete (D = a/-), match (M = a/a), and mismatch (X = a/b), where a and b are any characters in the terminal alphabet Σ such that a ≠ b and - is a space. The production transcript T is an ordered list of production rules that produces S from P [16]. T is a generative regular probabilistic grammar (G) that, for each production rule, generates a pair of characters, one corresponding to P and one corresponding to S, such that both P and S are produced (a pairwise hidden Markov model, or PHMM). For example, if P = AAAC and S = TAGCC, we could construct both P and S with the production transcript T = {X, D, M, I, M, I}, as shown in figure 1.
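As a small illustration (a sketch in Python with our own function names, not the authors' code), the transcript above can be applied mechanically to recover the pairwise alignment of figure 1:

def apply_transcript(P, S, T):
    p_row, s_row = [], []
    i = j = 0
    for rule in T:
        if rule in ("M", "X"):      # match / mismatch: consume one of each
            p_row.append(P[i]); s_row.append(S[j]); i += 1; j += 1
        elif rule == "I":           # insert: emit a space in P's row
            p_row.append("-"); s_row.append(S[j]); j += 1
        elif rule == "D":           # delete: emit a space in S's row
            p_row.append(P[i]); s_row.append("-"); i += 1
    return "".join(p_row), "".join(s_row)

print(apply_transcript("AAAC", "TAGCC", ["X", "D", "M", "I", "M", "I"]))
# ('AAA-C-', 'T-AGCC')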
Fig. 1. A PHMM representation of a production transcript.

PHMMs have proven to be useful in global sequence alignments between two sequences. In order to cluster gaps or account for palindromic base pairing, the grammar needs to be extended to include both primitive production rules (I, D, M, X) and non-terminal production rules. A non-terminal production rule is a grammar production rule that may produce any other production rule (terminal or non-terminal, including itself) from a finite set of options. Non-terminal production rules exist to allow shortcuts in the alignment path that more closely approximate our biological problem. The classic example is one where gaps are clustered together to represent introns in an alignment between genomic DNA and messenger RNA. Such a grammar could be represented as follows:

G1:
    S → L S | T
    T → a/a T | a/b T | L | R | ε
    L → -/b L | T
    R → a/- R | T

In G1, production rules L and R represent the gaps in P and S respectively, and ε ends the production of both P and S. More recently, non-terminal production rules have been used to model structural parameters in RNA folding. Such folding parameters are modeled by production rules that produce two characters simultaneously. The resulting production rules represent a palindromic language. For example, we could extend our non-terminals to include production rules such as T → a/b T a′/b′, where the notation implies that a base-pairs with a′ in P and b base-pairs with b′ in S. A simple but effective grammar for RNA structure prediction was proposed by Knudsen and Hein in the PFold package [17]:

G2:
    S → L S | L
    L → a F a′ | a
    F → a F a′ | L S
Additional production rules allow models to more closely represent biological function. They also increase computational complexity, sometimes so much so that realistic models on large data sets cannot be computed in reasonable time even on a large cluster. Production rules that allow for non-regularity (i.e., both-sides emission) make database search intractable in practice without filters. To circumvent this problem, statistical techniques are used to infer where non-terminal operations can be applied. Traditionally, the approach has been to find some statistical properties of a dataset given a grammar G and then restrict the search based on those properties. In this work, we wish to show a method for
compiling evidence that can restrict the number of non-regular production rules to the areas of greatest interest, and therefore manage tractability, through a multi-level grammar strategy. In other words, given a grammar G that is difficult to compute, we wish to run a grammar G′ that approximates G to some threshold τ. Given those approximations, we then wish to bind the search for non-terminal operations that exist in G to regions generated by G′ that have the most evidence to support a non-terminal shortcut.

2.1. Transcript Evidence
Each of the production rules has an associated cost. To calculate the cost of a production transcript, we sum the costs of all production rules in that production transcript. Traditionally, alignment imposes few limitations on the costs chosen for each of the production rules, and drastically different alignment summaries can be obtained from different scoring schemes [18]. To combine multiple production transcripts in a logically consistent manner, we use a Bayesian method for scoring production transcripts. Imagine we draw production rules from an urn at random to construct our production transcript. At each draw, we are constrained by the pattern and the sequence, because P must produce S. If H_i is the hypothesis that a given production operation exists in the transcript at position i, and X_i is our prior information about all other possible production operations at position i (a position-specific scoring matrix), then we can relate our hypotheses by the inversion formula:

    P(H_i | P, S, X_i) = P(H_i | X_i) P(P, S | H_i, X_i) / P(P, S | X_i)    (1)
Axiomatically, if H′_i represents the hypothesis that any production operation other than H_i exists at position i in the transcript, then we can construct an identical equation for P(H′_i | P, S, X_i). If we take the log of the ratio of P(H_i | P, S, X_i) and P(H′_i | P, S, X_i), we obtain the evidence:
    e(H_i | P, S, X_i) = e(H_i | X_i) + log [ P(P, S | H_i, X_i) / P(P, S | H′_i, X_i) ]    (2)

In equation 2, e(H_i | X_i) represents our prior evidence for production rule H at position i based on our grammar production model. If this term is zero, it indicates
that we have no evidence supporting or refuting H. The evidence for an entire production transcript T can be calculated as:

    e(T | P, S, X) = Σ_i e(H_i | P, S, X_i)    (3)

The evidence in a production transcript depends only on our prior knowledge, stated explicitly in X. Given that N represents all possible production operations at position i in the transcript, SM is the substitution matrix based on sampling of production rules from ncRNA, and W represents the current production operation, we have a general formula for evaluating the evidence of a production rule versus all other production rules at the same position in the transcript:

    e(H_i = W | P, S, X_i = SM) = e(H_i = W | X_i = SM) + log [ P(P, S | H_i = W, X_i = SM) / P(P, S | H_i ∈ N∖{W}, X_i = SM) ]    (4)

At this point, we can integrate our chemical probing data or other biological evidence using the term e(H_i = W | X_i = SM).
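In code, the evidence bookkeeping of equations 2 and 3 is straightforward (a hedged sketch with our own names; the likelihood terms would come from the substitution matrix SM, and the prior from probing data or the grammar model):

import math

def evidence(prior_evidence, lik_H, lik_alternatives):
    """Equation 2: lik_H is P(P,S | H_i, X_i); lik_alternatives is the
    likelihood under all competing operations at position i."""
    return prior_evidence + math.log(lik_H / lik_alternatives)

def transcript_evidence(per_position):
    # Equation 3: the evidence of a transcript is the sum over positions;
    # per_position holds (prior, lik_H, lik_alternatives) triples.
    return sum(evidence(*args) for args in per_position)

# Chemical-probing support for a position enters through prior_evidence,
# i.e. the term e(H_i = W | X_i = SM) of equation 4.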
2.2. Scoring a Grammar
The objective now is to find a maximum production list amongst all possible production lists.
Definition 2.1. A maximum weighted grammar production list (MWGPL) is an ordered list of all allowable grammar production steps and their associated evidence such that the production list (1) produces S from P when the list of production rules is taken in order, and (2) has maximum total evidence over all paths that produce S from P.
If the MWGPL is known, then we can make a statement about how well the pattern and the sequence correspond to the model. A great deal of evidence is likely to imply that the model, the sequence, and the pattern agree, and that the sequence has the same characteristics as the pattern. This leads us to the three key considerations that are the subject of this work: (1) find a partition of S such that our algorithms can be run efficiently in practice with minimum loss to the quality of a query; (2) minimize the use of non-terminals but retain the benefits of non-terminal operations; and (3) integrate relationships in our data into the grammar model.
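For the primitive rules alone, an MWGPL can be computed with a standard alignment-style dynamic program (a sketch with illustrative, position-independent evidence values; the paper's position-specific evidence of equation 2 would replace these constants):

def mwgpl(P, S, ev_match=2.0, ev_mismatch=-1.0, ev_gap=-1.5):
    n, m = len(P), len(S)
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        E[i][0] = E[i - 1][0] + ev_gap            # D rules
    for j in range(1, m + 1):
        E[0][j] = E[0][j - 1] + ev_gap            # I rules
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = ev_match if P[i - 1] == S[j - 1] else ev_mismatch
            E[i][j] = max(E[i - 1][j - 1] + sub,  # M or X
                          E[i - 1][j] + ev_gap,   # D
                          E[i][j - 1] + ev_gap)   # I
    return E[n][m]   # total evidence of the best production list

print(mwgpl("AAAC", "TAGCC"))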
2.3. Optimization Through Nesting Grammars
To manage tractability of the evaluation polynomial, we would like to be able to partition S recursively as the query collects evidence towards the most likely propositions. To obtain some guarantee about the running time of our partition, we also want to select production operations that are consistent with runtime expectations. To do this, we define the notion of a topology element.
Definition 2.2. A topology element TE = {PS, R} is a grammar production rule set that contains a collection of patterns PS = {ps_1, ps_2, ...} and a set of allowed production rules R. A topology element must use the set of production rules R to produce S′, a subsequence of S. Each topology element must have one prior associated with all of the production rules in R. In a grammar G a topology element may be used only once. A topology element is also a grammar.
To evaluate the evidence that topology element TE produced S′, we use Equation 3, selecting ps_i from PS and selecting S′ in S such that the evidence is maximized. As each topology element may have more than one string, the production of S from the topology element series is constructed by picking the maximum weighted grammar production list over all topology alternatives in TE. Topology elements are produced through a grammar. The global grammar G contains productions for the topology grammar G′. A heuristic grammar HG for G approximates G by using production rules in G together with production rules from the simpler grammar G′. The topology element paradigm allows us to manage complexity by recursively defining partitions on S. Consequently, it allows us to restrict complexity by bounding non-terminal production rules to regions specified by the partition. A topology element serves as a heuristic to cut vertices in the production transcript graph so that the graph may be evaluated using divide and conquer (for a description of the relationship between edit transcripts and edit graphs, and how grammar production rules can be represented as both a graph and a sequence of rules, see [16, 12]). The topology element serves as evidence towards G with G′. In practice, we want to evaluate all topology elements that satisfy an evidence
threshold τ. If τ is large, the approximation for G will be inaccurate because G′ will miss many candidates, although the runtime will be favorable. If τ is small, the likelihood that we miss a production transcript lessens, but the computational cost of evaluating HG increases.
A topology element TE is evaluated with grammar G′. Evaluating TE with G′ inevitably reduces the correctness of the MWGPL for G, because some production rules in G do not exist in G′. There are two approaches to solving this problem: (1) allow grammar refinement, or (2) assume that higher-order relationships can be approximated, given enough alternatives in the data. Grammar refinement is a strategy where we may reexamine the production list for topology element TE produced by G′ and substitute production rules that exist in G (but not in G′) into the production list for TE. Using a grammar refinement strategy, we can guarantee that the MWGPL for HG has the same evidence as the MWGPL for G. The disadvantage of this approach is that in the worst case we will actually evaluate G. While there are many potential approaches to bound the number of refinements, we choose to save this for future work. Instead, we choose to focus on a data-driven approach. We assume that a relationship found in a higher-order grammar production can be discovered if we have enough supporting sequences in our model. As the number of sequences increases, the known alternatives for a topology element approach the real number of alternatives in the database. As an example, consider the sequence A = tgtCCCaTATAaGGGata, which we know can be partitioned into three consensus regions: TE1 = tata, TE2 = ccc, and TE3 = ggg. TE2 and TE3 are related because they form complementary base pairs. Here is an example topology-element-based grammar for A:
GT1:
    A → a/a A | a/b A | -/b A | a/- A | TE1 B
    B → a/a B | a/b B | -/b B | a/- B | TE2 C
    C → a/a C | a/b C | -/b C | a/- C | TE3 | ε
In GT1, insertion of TE1, TE2, and TE3 is done via calls to a simpler grammar GT1′. Insertion may allow grammar substitution operations that increase
evidence based on known topology relationships. For example, in the above grammar TE2 and TE3 are known to form complementary base pairs, so the regular match productions on the MWGPL of TE2 and the corresponding match productions on the MWGPL of TE3 can be substituted with palindromic grammar productions of the form M → a/b M a′/b′, representing complementary base pairs in GT1. Using this framework, we can pursue likely complementary base pairs without being forced to evaluate all possible complementary base pairs.

2.4. Non-terminal Grammar Operations
The goal of this section is to show how one grammar can be used to approximate another grammar. Consider the following grammar, G3, that produces a local alignment:

G3:
    L → -/b L | A
    A → a/a A | a/b A | -/b A | a/- A | R
    R → -/b R | ε
In this grammar, L and R are production rules that result in no evidence; we want a local alignment. Production rule A is where G3 collects evidence. If we use dynamic programming to build all maximal solutions, A requires O(|P| × |S|) time to evaluate the MWGPL. The bottleneck comes from the fact that at each position in the production list we have three choices: we may advance our position in S but not P, in P but not S, or in both P and S. Consider a prototype grammar, G4, that we wish to use to approximate G3:

G4:
    A → a/a A | a/b A | ε
Given an evidence cutoff τ, we can store all possible production transcripts with evidence over τ in O(|P| + |S|) space. Note that G4 only needs a starting position and an ending position: given these two positions, the evidence transcript is unique. This grammar can be evaluated in O(|P| + |S|). We can further increase speed by hashing all possible production transcripts resulting in a score of at least τ and indexing all instances that actually appear in producing S from P. To produce G3 with the transcript T from G4 we could use the following grammar G5:
G5:
    T → a/a T | a/b T | L | R | ε
    L → -/b L | T
    R → a/- R | T
G5 functions to stitch together elements with large amounts of supporting evidence from G4 and thus construct our approximation for G3. For each grammar alignment in G4 with evidence over τ, we construct the table for the grammar production list. As an example, consider a list lst = (A, B, C, ...) of non-overlapping components that produce S via G4. To stitch A, B, and C together using G5, we first make all the grammar productions in A, thus constructing the final row of the dynamic programming table from A. The final row in the table is then used as we make grammar rule productions in G5 until we reach position i in S, where i is the first position of B. We then make all production rules in B and again use the last row from B to continue making grammar productions using G5. This process continues until S is produced. If τ is large, then most of the computational time will be spent evaluating G5 instead of G4, and the computational time will approach that of G3. If τ is small, then computing time is dominated by G4 and our approximation algorithm will be nearly linear.
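The practical effect of G4 can be seen in a small sketch (our own simplification; because G4 admits no gaps, every candidate lies on a single diagonal and is fixed by its offset alone; the paper additionally hashes qualifying transcripts, which we omit here):

def ungapped_seeds(P, S, tau, ev_match=2.0, ev_mismatch=-1.0):
    """Enumerate high-evidence ungapped placements of P in S.

    A direct, unoptimized scan over offsets; the scores and threshold
    tau are illustrative placeholders for the evidence of equation 2.
    """
    seeds = []
    for offset in range(len(S) - len(P) + 1):     # one diagonal per offset
        score = sum(ev_match if p == s else ev_mismatch
                    for p, s in zip(P, S[offset:offset + len(P)]))
        if score >= tau:
            seeds.append((offset, score))
    return seeds

# G5 then stitches neighbouring seeds with insert/delete productions,
# running the full dynamic program only in the regions between seeds.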
2.5. Integrating Relationships in Data into Search
A topology element allows us to produce sequences instead of characters. Assume that we have a regular grammar that produces the hairpin in figure 2a. The grammar is divided into topology elements T1-T11. Figure 2b shows several example sequences that all contain the same hairpin; at the base of figure 2b are the totals of the number of bases at each position. T4 and T8 are highly conserved and easily detected. However, the signal for T4 and T8 has very little information content. We would like to use the surrounding elements to increase (or decrease) the evidence that we have supporting T4 and T8 as a real site. Simple graphical models such as an HMM will not be able to detect the site T4 + T8 because
the evidence found in the other sites is lost when you consider only the previous base. On the other hand, SCFGs will catch the base pairing relationships between elements T3, T9 and T5, T7, but they must check every possible base pair in the sequences to find the relationship. We would prefer a method that can detect the base pairing but is not forced to evaluate all possible pair alignments. Our approach is to use the phylogenetic relationships found in the data to add evidence toward base pairing. To construct our prior belief in the sequence relationships, we cluster all of the known sequences by constructing a phylogenetic tree, as shown in figure 2c. Then, to represent the complementary base pairs, we concatenate T5 and T7 and construct a second tree. The evidence that a relationship exists between S_i and S_j is the distance between S_i and S_j in the tree shown in figure 2c minus the distance between S_i and S_j in the tree shown in figure 2d. This is evaluated in the grammar when the term e(H_i | X_i) of Equation 2 sums evidence for topology elements.

Fig. 2. (a) A hairpin structure containing the loop E motif from coxsackievirus B3. (b) An illustrative example of sequences from the characteristic portions of the loop E motif. (c) A phylogenetic tree constructed from sequences S1-S8. (d) A phylogenetic tree constructed from T5 + T7. Note that we chose the partitions because of our chemical probing data.
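The tree-comparison evidence described above reduces to a difference of pairwise distances (a sketch with a hypothetical data layout: precomputed patristic distance matrices stand in for the real trees of figures 2c and 2d):

def pairing_evidence(dist_full, dist_paired, si, sj):
    """dist_*: dict mapping frozenset({si, sj}) -> tree distance.

    dist_full comes from the tree over the whole sequences (fig. 2c);
    dist_paired from the tree over the concatenated elements (fig. 2d).
    """
    key = frozenset((si, sj))
    return dist_full[key] - dist_paired[key]

# If concatenating T5+T7 pulls S_i and S_j closer together than the full
# sequences are, the difference is positive evidence for base pairing.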
3. ALGORITHM
To score a sequence, we first introduce a global grammar G for evaluating sequences in the database D. For example, to evaluate the structure of the hairpin in Figure 2 we can use the following grammar G:
G:
    D → E D | -/b D | a/- D | ε
    E → a/a E | a/b E | -/b E | a/- E | P1
    P1 → T3 P2 T9
    P2 → T4 P3 T8
    P3 → T5 T6 T7
Our algorithm contains two phases, top-down and bottom-up. Top-down refers to the phase where we traverse productions in G, generating instances of G′ that will produce S. Bottom-up is the procedure where instances of G′ are stitched together with operations in G into a production list that produces S. The transcript with the maximum evidence is selected as the approximation for the MWGPL of G. In this example, we assume that the grammars for evaluating T3, T5, T6, T7, and T9 are instances of the regular grammar G3. While we could choose to evaluate G3 with an approximation, as was done in the previ-
ous section, the overhead in RNA structure matching comes from the palindromic non-terminals. Because the partitions of S1-S8 in our example are such short sequences, evaluating G3 directly using the Gotoh [19] memory reduction method gives better alignments for small sequences. Because the sequences are short, the dynamic programming tables are also short and the overhead is low. T4 and T8 can be evaluated with G′ = G4 because they are absolutely conserved.
Fig. 3. A grammar for finding the loop E RNA motif.
The top-down algorithm for constructing G proceeds forward by first producing P1. P1 in turn produces T3, P2, and T9. Candidates for the MWGPL for G over T3 and T9 are computed via G′. All non-overlapping candidates that satisfy the condition that T3_k < T9_l and whose evidence e(T3_k) + e(T9_l) > τ are stored in a table (table_1). G then proceeds by producing P2, constrained such that T3_k < T4 < T8 < T9_l and the evidence for the transcript is greater than τ. Because we do not know which entries in table_1 exist in the MWGPL for G, we must store in table_2 all potential candidates for T4 and T8 that exist between S_{min(k)+1} and S_{max(l)-1}. Variables k and l represent indices in 0..|S|. Note that a potential candidate for a topology element such as T4 may not overlap with another candidate for the production list of T4, but it may overlap with a candidate from any other topology element (e.g., T5). In a similar way, G then produces T5, T6, and T7, and the non-self-overlapping production lists over τ are stored in table_3. Once the tables are created for all elements of G, we construct the MWGPL approximation for G by stitching candidate topology production lists together if they contain more evidence after being merged. Intuitively, the tables mark candidate positions for G that may exist on the MWGPL. Figure 3 illustrates this basic idea. On S, we have putative positions marked by the forward production operations on the grammar G.

score(S, t):
    l = []
    productionTranscript = []
    while productionTranscript.hasMoreWaysToMakeS():
        productionTranscript += makeProduction(G, S, t, l)
        if productionTranscript.produces(S):
            l += productionTranscript
            productionTranscript = []
    forAll i in l:
        G.computeEvidence(i)
        if i.isMaxEvidence(): return i

makeProduction(G, S, t, l):
    productionRule = FSA.DP(l)
    if productionSet.contains(TE):
        evaluateAndMark(TE) forAll TE > t
        BayesNet[l.index].add(productionRule)
    return productionRule
4. RESULTS
Nondeterministic structural motif finding is one of the outstanding problems in bioinformatics. The proposed method, the advanced grammar alignment search tool (AGAST), can be applied to find motifs in any biological sequence, including DNA, RNA, and protein. In this section, we assess the performance of the proposed method in finding loop E motifs in conserved secondary RNA structures. The loop E motif is a fold that organizes structure in hairpin loops and multi-helix junctions in many RNA molecules. The motif is prevalent in 16S and 23S ribosomal RNA and derives its name from its discovery in loop E of 5S rRNA [20]. The loop E motif is particularly significant in RNA structure because it uses a series of non-canonical base pairs to form a characteristic fold. This fold widens the major groove of the RNA helix and presents a cross-strand adenosine stack that serves as a recognition feature for RNA-protein and RNA-RNA interactions [21]. The presence of this motif in molecules as diverse as ribosomal RNA, potato tuber spindle viroid, RNase P RNA, and the hairpin ribozyme shows that the loop E motif is an important feature in RNA structure and function. Sequence comparison and chemical probing analysis have revealed a consensus pattern for the loop E motif. This pattern consists of a parallel purine-
purine pair (usually AA), a bulged nucleotide, a non-Watson-Crick UA that is absolutely conserved, and a purine-purine pair (AA or AG, but not GA). As a result of the non-canonical pairing, the motif generates a signature pattern of susceptibilities in chemical probing experiments [22]. We have identified the sequence pattern and the chemical probing pattern of the loop E motif in the coxsackievirus B3 (CVB3) genomic RNA. The general character of the absolutely conserved properties of this motif is a.ua.*gaa. The CVB3 loop E motif in the context of the surrounding RNA is shown in Figure 2. In this section, we compare the performance of our proposed method, AGAST, with RSEARCH and RNAMotif in finding the loop E motif. We selected RNAMotif because, of the pattern-matching tools, it is one of the most flexible and can be customized to our specific problem. We selected RSEARCH because it is guaranteed to give optimal results over other methods, since it computes all possible sequence-structure alignments. We did not use MilPat because its current release does not allow errors and is thus more constrained than both of these programs. The structure of the loop E motif must correspond exactly to the character of the sequence shown in figure 2. The character of the motif is a hairpin loop with the loop E sequence immediately flanked by paired regions. The loop E section must be absolutely conserved (with motif a.ua.*[ga]aa). Complementary base pairing flanking the loop E is responsible for maintaining the structure of the loop. The turn at the top of the loop may contain a large secondary structure (instead of a 4-base turn). To test the sensitivity and specificity of our approach, we collected a set of sequences from coxsackievirus B3 (CVB3) genomic RNA. We partitioned these sequences into two groups, testing and modeling. With the modeling sequences, we constructed a multiple sequence alignment as shown above by overlaying our chemical probing data, phylogenetic conservation, and possible folding conformations from Mfold. Then, for each sequence in the testing set, we constructed a false positive sequence using a third-order Markov chain (to preserve the motif but destroy the complementary base pairing required for secondary structure). For each of the sequences in the dataset we ran RSEARCH, RNAMotif, and AGAST. In the case of RNAMotif, we designed two
pattern-matching queries using the same information that we had in constructing the AGAST query. The first query, which we call RNAMotif-intuitive, constrains results such that they form a hairpin around the conserved loop E motif and that complementary base pairs exist in the hairpin both 5′ and 3′ of the motif. This query is based on our understanding that the motif can only be formed if there is significant stability provided by complementary base pairs both 5′ and 3′ of the motif. In our second RNAMotif query, which we call RNAMotif-permissive, we give RNAMotif the 5′ and 3′ regions surrounding the loop E motif. Because this query did not match any sequences in our test database, we gradually increased the error threshold in the regions 5′ and 3′ of the motif until we obtained matches. RSEARCH was provided only with the sequence from the hairpin, from which it builds a grammar. Each of these grammars was queried against our database. The results of this experiment are in Table 1. These results indicate that the traditional methods of SCFGs (RSEARCH) and expertly tuned queries (RNAMotif-permissive) remain the most sensitive methodologies when searching for a double-stranded RNA motif in a two-dimensional structure. However, this sensitivity comes at the cost of an increased number of false predictions. In sequences that have no conserved two-dimensional structure (the HMM-3 jumbled sequences), we found the false positive rates to be 1.22 and 1.29 for RSEARCH and RNAMotif-permissive respectively; both programs predicted more sites than there are sequences in the generated database. The RNAMotif-intuitive query was able to substantially reduce the number of false predictions, but it was far too restrictive, eliminating 85% of true positives. We believe that our approach has significant promise because it was capable of maintaining relatively high sensitivity while increasing specificity to the same level as our intuitive description of the motif. Moreover, upon closer investigation, we realized that the false positives found in our real database were all phylogenetically diverse from the instances we had in our training set (all forming in different clades from our training sequences). This indicates that our approach may perform better with a representative sequence from each clade in the phylogenetic tree, but finding such representatives in a new domain remains a challenging problem.
Table 1. Comparison of RSEARCH, RNAMotif, and AGAST in finding the loop E motif.

Tool                   True Positives      Time
RSEARCH                      55          963m53.2s
RNAMotif-intuitive            8            89.65s
RNAMotif-permissive          55            15.2s
AGAST                        50            68.36s
Among the other methods, none produces acceptable values for both sensitivity and specificity in our problem domain. AGAST, on the other hand, had over 90% for both parameters. Another experiment was conducted to search a larger dataset to find additional unknown instances of the loop E motif. Because ribosomal RNA sequences are known to contain the loop E, we generated a data set of all ribosomal RNA by parsing species-specific data files (e.g., gbpri1) from GenBank release 143 for all files with ribosomal RNA. We found 176,371 rRNA records using this method. We shuffled each of the sequences in the database using a Markov chain of order three and ran our algorithm on both the ribosomal RNA sequences and the shuffled sequences. Figure 4 shows the distribution of scores for records from the two sets.
Fig. 4. Grammar scores (max{evidence} − evidence found) versus number of records for finding the loop E RNA motif.
In this experiment, there is a significant difference between the sequences generated by the Markov chain that contain the sequence a.ua.*[ga]aa and the rRNA sequences. Also, of the top 10 records returned in our search, we were able to verify that 9 of the records contain the loop E motif; the remaining record is unknown. We are currently working to verify whether this final sequence also contains the loop E motif. This demonstrates that our tool can find
motifs within structural homologs even in a large database.
5. CONCLUSIONS
In this paper, we have proposed a grammar-based method built on constructing graphical models that relate subsequences instead of forcing the evaluation of individual characters. We have used this method to find the loop E structural motif inside ncRNA with conserved secondary structure. Our results show that our method produced the best sensitivity/specificity combination among the tested methods for the problem domain. It may also serve as a strong complement to current methods in accelerating ncRNA homology detection, because it can be more specific than SCFGs in the case where we have additional information about interior structural motifs. We believe that well-structured data relationships can play a key role in difficult problems such as motif searching. We also believe topology models are very general and could be used in modeling and searching for complex patterns in DNA or proteins. We believe that this work points to the need for more general approaches to automatically generate RNA database queries, especially queries where some possible structures can be eliminated from the SCFG on the basis of biological evidence. Our method would serve well for building filters that can be combined with existing methods such as FastR for increased specificity in selecting structures from the SCFG with conserved structural motifs.
ACKNOWLEDGMENTS
We would like to thank Mohammad Shafiullah for help on scaling the code to large architectures, and Laura A. Quest, Brad Friedman, and Mark Pauley for help with the manuscript. This research project was made possible by NSF grant number EPS-0091900 and NIH grant number P20 RR16469 from the INBRE Program of the National Center for Research Resources.
References
1. Klein RJ, Eddy SR. RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, September 2003; vol. 4.
2. Weinberg Z, Ruzzo WL. Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics, August 2004; vol. 20 Suppl 1.
3. Zhang S, Haas B, Eskin E, Bafna V. Searching genomes for noncoding RNA using FastR. IEEE/ACM Trans Comput Biol Bioinform, 2005; vol. 2, no. 4: 366-379.
4. Rivas E, Klein RJ, Jones TA, Eddy SR. Computational identification of noncoding RNAs in E. coli by comparative genomics. Curr Biol, September 2001; vol. 11, no. 17: 1369-1373.
5. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res, July 2003; vol. 31, no. 13: 3406-3415.
6. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res, March 1997; vol. 25, no. 5: 955-964.
7. Fichant GA, Burks C. Identifying potential tRNA genes in genomic DNA sequences. J Mol Biol, August 1991; vol. 220, no. 3: 659-671.
8. Eddy SR. RNABob: a program to search for RNA secondary structure motifs in sequence databases. http://selab.wustl.edu/cgibin/selab.pl?mode=software
9. Laferrière A, Gautheret D, Cedergren R. An RNA pattern matching program with enhanced performance and portability. Comput Appl Biosci, April 1994; vol. 10, no. 2: 211-212.
10. Billoud B, Kontic M, Viari A. Palingol: a declarative programming language to describe nucleic acids secondary structures and to scan sequence databases. Nucleic Acids Res, April 1996; vol. 24, no. 8: 1395-1403.
11. Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R. RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res, November 2001; vol. 29, no. 22: 4724-4735.
12. Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, London, 1998.
13. Rivas E, Eddy SR. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol, February 1999; vol. 285, no. 5: 2053-2068.
14. Zhang S, Borovok I, Aharonowitz Y, Sharan R, Bafna V. A sequence-based filtering method for ncRNA identification and its application to searching for riboswitch elements. Bioinformatics, July 2006; vol. 22, no. 14.
15. Thébault P, de Givry S, Schiex T, Gaspin C. Searching RNA motifs and their intermolecular contacts with constraint networks. Bioinformatics, July 2006.
16. Gusfield D. Algorithms on Strings, Trees and Sequences. Cambridge University Press, London, 1999.
17. Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res, July 2003; vol. 31, no. 13: 3423-3428.
18. Dewey CN, Huggins PM, Woods K, Sturmfels B, Pachter L. Parametric alignment of Drosophila genomes. PLoS Computational Biology, June 2006; vol. 2, no. 6: e73.
19. Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol, December 1982; vol. 162, no. 3: 705-708.
20. Branch AD, Benenfeld BJ, Robertson HD. Ultraviolet light-induced crosslinking reveals a unique region of local tertiary structure in potato spindle tuber viroid and HeLa 5S RNA. PNAS, October 1985; vol. 82, no. 19: 6590-6594.
21. Correll CC, Wool IG, Munishkin A. The two faces of the Escherichia coli 23S rRNA sarcin/ricin domain: the structure at 1.11 Å resolution. J Mol Biol, September 1999; vol. 292, no. 2: 275-287.
22. Leontis NB, Westhof E. A common motif organizes the structure of multi-helix loops in 16S and 23S ribosomal RNAs. J Mol Biol, October 1998; vol. 283, no. 3: 571-583.
IEM: AN ALGORITHM FOR ITERATIVE ENHANCEMENT OF MOTIFS USING COMPARATIVE GENOMICS DATA

Erliang Zeng¹, Kalai Mathee², and Giri Narasimhan¹,*
¹Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, Florida, 33199, USA, and ²Department of Biological Sciences, Florida International University, Miami, Florida, 33199, USA.

Understanding gene regulation is a key step to investigating gene functions and their relationships. Many algorithms have been developed to discover transcription factor binding sites (TFBS); they are predominantly located in upstream regions of genes and contribute to transcription regulation if they are bound by a specific transcription factor. However, traditional methods focusing on finding motifs have shortcomings, which can be overcome by using comparative genomics data that is now increasingly available. Traditional methods to score motifs also have their limitations. In this paper, we propose a new algorithm called IEM to refine motifs using comparative genomics data. We show the effectiveness of our techniques with several data sets. Two sets of experiments were performed with comparative genomics data on five strains of P. aeruginosa. One set of experiments was performed with similar data on four species of yeast. The weighted conservation score proposed in this paper is an improvement over existing motif scores.
Keyword: Comparative Genomics, Motif, EM algorithm
1. INTRODUCTION

Gene expression is a fundamental biological process. The first step in this process, called transcription, transmits genetic information from DNA to messenger RNA (mRNA). A transcription factor (TF) is a protein that regulates the transcription of a gene by interacting with specific short DNA sequences, often located in the upstream region of the regulated genes. Such short DNA sequences are called transcription factor binding sites (TFBS) or regulatory elements. The regulatory elements can be described as sequence signatures and will be referred to in this paper as motifs. One TF can regulate a large set of genes, and a single gene may be regulated by a combination of several TFs. The upstream region of each gene regulated by the same TF must have at least one binding site specific for that particular TF. These binding sites must be specific enough that the TF can "recognize" them and bind to them. However, it is well known that different sites bound by the same TF are not necessarily identical. The computational challenge is to find these sites and to succinctly and accurately describe all such binding sites.
*To whom correspondence should be addressed.
The simplest way to describe a binding site is to write down its consensus sequence. However, this is very imprecise and does not do justice to the complexity of the sequence signature. A sequence alignment of all known binding sites captures its complexity, but is not succinct enough. A logo format (Schneider and Stephens 1990; Crooks, Hon et al. 2004) is succinct enough, but is merely visual. The appropriate description is a profile, which is also referred to as a position-specific scoring matrix (PSSM) or a position weight matrix (PWM) (Werner 1999; Stormo 2000). A profile is a 4 x K matrix (K is the length of the binding site) whose entries give a measure of the preference of a base appearing at any given position. Examples of sophisticated algorithms to identify TF binding sites include MEME (Bailey and Elkan 1994), AlignACE (Hertz and Stormo 1999), BioProspector (Liu, Brutlag et al. 2001), MDscan (Liu, Brutlag et al. 2002), YMF (Sinha and Tompa 2003), Weeder (Pavesi, Mereghetti et al. 2004), and many more. All these methods attempt to find sequence signatures that are significantly overrepresented in the upstream regions of a given gene set (typically a cluster of co-regulated genes from analyzing microarray data, or a gene set inferred
from a ChIP-Chip experiment) when compared to an appropriately chosen background. Despite the successful application of the algorithms listed above, each of them has certain limitations (Hu, Li et al. 2005; Tompa, Li et al. 2005; GuhaThakurta 2006; MacIsaac and Fraenkel 2006; Sandve and Drablos 2006). First, all these methods are prone to predicting a large number of motifs, many of which are false positives, partly because TFs show remarkable flexibility in the binding sites they can potentially bind to. Second, all these methods report statistically overrepresented motifs. However, statistical significance of motifs need not be synonymous with biological relevance. Binding of TFs to their binding sites is a complex process and may be assisted or hindered by many other unexplained factors. Comparative genomics data is a promising new source of information that can help to improve motif prediction. With the availability of an increasing number of whole-genome sequences of evolutionarily related genomes, it is practical to incorporate comparative genomics data into the motif discovery process. The basic assumption is that transcription factors and transcriptional mechanisms involved in fundamental cellular processes are likely to be conserved among evolutionarily related genomes. Consequently, the binding sites for such TFs are also likely to be conserved. Therefore, the availability of comparative genomics data is likely to provide additional support to the predictions of binding sites. The simplest way to deal with data on additional genomes is to pool together the upstream regions of all available genomes and to apply traditional motif detection methods. However, this is not an optimal utilization of the comparative genomics data. The "phylogenetic footprinting" strategy is a sophisticated method used to find motifs that are conserved for a particular gene across related organisms (Blanchette and Tompa 2002). Several subtle approaches such as PhyloCon (Wang and Stormo 2003), orthoMEME (Prakash, Blanchette et al. 2004), CompareProspector (Liu, Liu et al. 2004), EMnEM (Moses, Chiang et al. 2004), PhyME (Sinha, Blanchette et al. 2004), and PhyloGibbs (Siddharthan, Siggia et al. 2005) were developed recently to solve this problem. In these approaches, either an EM-based algorithm, a greedy algorithm, or a Gibbs sampling strategy is applied to optimize an objective function while taking the phylogenetic relationships into account. The main problem with these methods is that phylogenetic relationships are often not easy to infer and not very reliable. Also, any motif that is unique to particular genomes, or that lies in the upstream regions of genes with no orthologs
in some related genomes, will not be detected. Most of the above methods also need an alignment of the input sequences. Like phylogenetic relationships, alignments are also often unreliable. Inaccurate alignments (or phylogenies) lead to errors in the profile matrices, and ultimately in the motif prediction. Another challenge in motif prediction is to develop scoring functions that reflect biological significance. Several popular scoring functions include IC (information content), MAP, the Group Specificity score, LLBG (least likely under the background model), and the Bayesian scoring function. However, as explained earlier, algorithms that use these scoring schemes end up with a large number of false positives in their predictions. When dealing with multiple genomes, the degree of conservation of the 'hits' of a profile across the many genomes can be used as a crude surrogate for the significance of the motif. However, this metric has its shortcomings. In this paper, we propose a metric to measure such biological significance. We also propose a new algorithm called IEM (Iteratively Enhancing Motif Discovery). IEM is an iterative version of an earlier algorithm called EMR (Enhancing Motif Refinement) (Zeng and Narasimhan 2007). It differs from other earlier approaches in that no attempt is made to perform de novo detection of motifs (although that would be easy to incorporate). Instead, comparative genomics data is used to "enhance" any given motif. These motifs may have been discovered by other computational methods, or may have been identified by laboratory techniques. Thus our method leverages the best known motif discovery methods, or utilizes the (potentially incomplete) knowledge of previous studies, while incorporating newly available comparative genomics data. The research described here is significant for the following reasons. First, there is a clear need to reduce the number of false positives predicted by traditional tools. Second, our method can make use of partial information (on one or more binding sites), which may be available as a result of biological experiments. Third, with the availability of high-throughput gene expression techniques like microarrays and ChIP-Chip experiments, it is possible to obtain sets of co-expressed genes involved in the same metabolic pathway (and, therefore, potentially co-regulated). Finally, our results show that the IEM algorithm is able to overcome the shortcomings of previous methods and to effectively utilize any available comparative genomics data.
2. METHODS

2.1. Algorithm

The IEM algorithm takes as input an "unrefined" motif for a given genome Γ1 (called the reference genome); this motif may have been generated using any reasonable existing motif detection method. Alternatively, the input could be a known binding site, or a crude approximation based loosely on some experiments. Using one or more additional genomes Γ2 (referred to as the related genomes), and the corresponding orthology information between Γ1 and Γ2, the algorithm returns an enhanced motif. The refinement procedure is EM-based, as described below in Section 2.1.3.

2.1.1. Basic Expectation Maximization (EM) Algorithm

Since our algorithm is EM-based, we first present an adaptation of the classical EM algorithm (Dempster, Laird et al. 1977) for ab initio motif discovery (Lawrence and Reilly 1990). Motif prediction can be thought of as a parameter estimation process for a mixture model: (1) a model for the motif and (2) a model for the background. Roughly speaking, the algorithm can be described as follows. In the (Expectation) E-step, for every site, the likelihood that it belongs to either model of the mixture is computed. In the (Maximization) M-step, a set of parameters (i.e., the entries of the profile) for the individual models (motif model and background model) is recomputed using the likelihood values computed in the E-step as weights in the calculation. Upon convergence, we end up with two models: one for the motif and one for the background. We randomly initialize parameters for the motif model (by randomly choosing the locations of the binding sites), and then the E-step and M-step are iterated until convergence.

2.1.2. Improvements in MEME
The original version of EM as proposed by Lawrence and Reilly (Lawrence and Reilly 1990) suffers from several limitations. For example, it does not state how to choose a starting point. It assumes that each sequence in the dataset contains exactly one occurrence of the motif; it also assumes that there is only one instance of the motif in each upstream region and does not attempt to find multiple instances. Bailey and Elkan proposed a modified EM method called MEME to eliminate these limitations (Bailey and Elkan 1994). Their method uses subsequences from the input as random starting points. The method allows multiple instances of a motif in one upstream region. Furthermore, once the algorithm converges upon a motif, that motif is eliminated from consideration and the algorithm restarts to look for other motifs. MEME works reasonably well on many data sets and is widely used. However, it has shortcomings. First, even though it chooses a starting point from among the subsequences of the input sequences, it may not converge upon a desired motif. Thus, it is not suitable for finding motifs for which we may know partial information. Second, the only way it can deal with comparative genomics data is by merely pooling the input sequences from multiple genomes. However, as mentioned before, this leaves the comparative genomics data underutilized. Our proposed IEM method considers comparative genomics data in a "dual" manner.

2.1.3. IEM Algorithm

The IEM algorithm is described below in Figure 1. Assume the input consists of a profile M1 = [m_ij], which is a 4 x K matrix, where K is the length of the motif and m_ij is the entry in the ith row and jth column of M1. Let the indicator variable matrix be defined as Z = (z_pq), where z_pq = 1 if an instance of the motif starts at the pth position in the upstream region of the qth gene, and 0 otherwise. These indicator variables approximate the probability that a specific site (i.e., the sequence starting at the pth position in the upstream region of the qth gene) is a binding site according to the profile matrix. The IEM algorithm iteratively estimates the indicator variable matrix Z1 and the profile matrix M1 in the reference genome, and the indicator variable matrix Z2 and the profile M2 in the related genomes. The estimation process is similar to that in MEME (Bailey and Elkan 1994). However, in IEM a dual-step estimation is applied by incorporating comparative genomics data. Given the indicator variables z_pq in one data source (either the reference genome or the related genomes) and a motif model (i.e., profile matrix) M for the entire data set (merged from M1 and M2), we can calculate the probability of observing a given upstream region U_q as follows:
P(U_q | Z_q, M) = ∏_{p : z_pq = 1} ∏_{j=1}^{k} m_{u_{p+j-1}, j} × ∏_{i ∉ motif sites} m_{u_i, 0}    (1)

where m_{a,0} is the background frequency for base a, m_{a,j} is the frequency of base a at position j in the motif model, k is the motif length, n is the number of 1's in Z_q, and l is the length of the upstream sequence (so the background product runs over the l − nk positions not covered by a motif occurrence). Then, by Bayes' rule, we can calculate the probability that the site at position p in upstream region q is a binding site as follows:

z_pq = P(U_q | z_pq = 1, M) P(z_pq = 1) / Σ_{p'} P(U_q | z_p'q = 1, M) P(z_p'q = 1)    (2)
Intuitively, the IEM algorithm tries to refine a motif in each iteration in two successive EM steps. In each step, it computes the likelihood of each site in one data set over a model M (not merely M1 or M2), which is arrived at by the previous maximization step applied over all the data sets. Comin et al. reported a subtle motif discovery method using a similar two-step strategy (Comin and Parida 2007). The differences are twofold: first, we incorporate comparative genomics data, and second, we use profiles instead of consensus sequences to represent the motifs.

Input:
a) Profile M1, motif length K, and associated gene set G1 from genome Γ1
b) Upstream sequences of the ORFs in G1
c) Additional genome(s) Γ2, and the orthology map for all the genomes
d) Upstream sequences of the ORFs in G2, the orthologs of G1 in Γ2
Output: Refined motif weight matrix Mr
Algorithm:
    Estimate Z2 in G2 from M1.
    while (not converged) do
        Re-estimate M2 in G2 from Z2.
        M = merge(M1, M2)
        Re-estimate Z1 in G1 from M.
        Re-estimate M1 in G1 from Z1.
        M = merge(M1, M2)
        Re-estimate Z2 in G2 from M.
    endwhile
    Return Mr.

Figure 1. The IEM algorithm.
In summary, the IEM algorithm iteratively performs the following 4 steps:
1. In the first E-step, the probabilities that each site in the reference genome belongs to the profile M1 are computed using formula (2).
2. In the first M-step, the new profile M1 is estimated using every (indicated) binding site in the reference genome (i.e., weighted with Z1). Profile M is updated using the new sites.
3. In the second E-step, the probabilities that each site in the related genomes belongs to the profile M2 are computed using formula (2).
4. In the second M-step, the new profile M2 is estimated using every (indicated) binding site in the related genomes (i.e., weighted with Z2). Profile M is updated using the new sites.
The "merge" operation mentioned in the algorithm is achieved by creating the profile matrix from the instances of the sites with indicator value 1 from all the genomes. Note that a generalization of the merging step is possible, where the sites are weighted by the probability of each site belonging to a model (i.e., its score against the profile).
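For concreteness, the following minimal Python sketch mirrors the control flow of Figure 1 under simplifying assumptions (uniform background, one motif occurrence per upstream region, a fixed iteration count instead of a convergence test, and profile averaging as a crude stand-in for the site-pooling merge); all names are ours, not the authors' implementation:

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def site_probs(seqs, M, bg=0.25):
    # E-step: for every length-K window, the normalized likelihood that it
    # is a motif occurrence under profile M rather than a uniform
    # background (one occurrence per region, for simplicity).
    K = M.shape[1]
    Z = []
    for s in seqs:
        liks = []
        for p in range(len(s) - K + 1):
            lik = 1.0
            for j in range(K):
                lik *= M[BASES[s[p + j]], j] / bg
            liks.append(lik)
        total = sum(liks)
        Z.append([x / total if total > 0 else 0.0 for x in liks])
    return Z

def reestimate(seqs, Z, K, pseudo=0.1):
    # M-step: profile entries are pseudo-counted, Z-weighted base counts
    # at each motif position, normalized column-wise.
    M = np.full((4, K), pseudo)
    for s, zs in zip(seqs, Z):
        for p, w in enumerate(zs):
            for j in range(K):
                M[BASES[s[p + j]], j] += w
    return M / M.sum(axis=0)

def merge(M1, M2):
    # Stand-in for the paper's merge, which rebuilds M from the indicated
    # sites of all genomes; averaging the two profiles is a crude proxy.
    return (M1 + M2) / 2.0

def iem(M1, G1, G2, iters=50):
    # Dual EM loop of Figure 1: alternate E/M steps between the reference
    # genome (G1) and the related genomes (G2).
    K = M1.shape[1]
    Z2 = site_probs(G2, M1)            # prime related genomes with input motif
    for _ in range(iters):
        M2 = reestimate(G2, Z2, K)     # M-step in the related genomes
        M = merge(M1, M2)
        Z1 = site_probs(G1, M)         # E-step in the reference genome
        M1 = reestimate(G1, Z1, K)     # M-step in the reference genome
        M = merge(M1, M2)
        Z2 = site_probs(G2, M)         # E-step in the related genomes
    return merge(M1, M2)               # refined motif Mr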
2.2. Evaluation Approaches

Evaluation of the IEM algorithm is a nontrivial task because very little experimentally verified data is available. Even the available experimentally verified data often provides only partial information. In one of the experiments described below, we consider the critical regulation activities in the arginine metabolic pathways of the bacterium P. aeruginosa (PAO1). We show that our algorithm, with the help of the complete genomes of five strains of P. aeruginosa, produces refined motifs with improved accuracy (see the Results section for details). The performance in such cases can be measured in terms of true positives and false positives against the available partial information. Here the true positive measure indicates the number of known binding sites that are predicted, while the false positives are the number of known non-binding sites that are predicted. In another experiment, where no experimentally verified data was available, we propose two approaches to evaluate our results. One approach is to investigate the functional enrichment of the genes whose upstream regions have a predicted binding site. Using gene ontology analysis, we observed that the terms that were enriched were closely related to what is known about the regulator. Another approach is to compute meaningful measures of motif scores. Traditional ones such as the MAP and IC scores are not well suited for comparative genomics data. A better approach is to use scores based on how well the predicted binding site is conserved across all the genomes under consideration. The simplest measure along these lines is what we will refer to as the conservation score. It is the average number of genomes in which any given predicted binding site occurs simultaneously in the upstream sequences of orthologous genes. This value ranges between 0 and m, where m is the number of genomes (besides the reference genome) being analyzed. Such a measure was proposed earlier (Gertz, Riles et al. 2005). Let m denote the number of genomes (besides the reference genome) being considered. Let n be the total number of genes in the reference genome whose upstream sequence has at least one
predicted site of the motif, and let s_i be the number of genomes in which the ortholog of gene i contains a site in its upstream region. Then the conservation score S is defined as:

S = (1/n) · Σ_{i=1}^{n} s_i    (3)

The weakness of this conservation score is that it does not account for some key facts. In the following discussion, let A and B be two predicted motifs with the same conservation score, i.e., the same average number of hits per genome. 1. If A has more instances than B in which s_i equals m, it should be considered more significant. 2. If A has more hits than B in the reference genome, then it should be considered more significant. To overcome the above disadvantages, we propose a new score, which we refer to as the weighted conservation score. It is given as:

S_c = log(mn) · (Σ_{i=1}^{m} n_i w_i) / (n · Σ_{i=1}^{m} w_i),  with w_i > w_{i-1} for all i    (4)

where m is the number of genomes being considered, n is the number of genes in the reference genome whose upstream regions contain at least one instance of the predicted motif, n_i is the number of genes for which the corresponding ortholog contains at least one instance of the motif in its upstream region in exactly i genomes, and w_i is a suitable weight constant satisfying w_i > w_{i-1} for all i, implying that if a motif instance occurs in more orthologs then it should be weighted higher. w_i is chosen to be i in the following example.

We highlight the differences between the conservation score and the weighted conservation score using simple examples. In Figure 2, motifs A and B have the same conservation score. Unlike motif B, motif A has instances across all related genomes in the upstream regions of three orthologous gene sets. We argue that motif A is more conserved than motif B. The weighted conservation score reflects this intuition. Motif C, with the same conservation score as motif D, has more instances in the reference genome, which may indicate a more important biological role. The weighted conservation score rewards motifs A and C.

Figure 2. Examples highlighting the differences between the conservation score, S, and the weighted conservation score, S_c.
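A small runnable sketch of both scores, using the reconstruction of formula (4) shown above (function names are ours; the default weights w_i = i follow the text's example, and s is the list of s_i values for the n genes, each between 0 and m):

import math

def conservation_score(s):
    # Formula (3): average number of related genomes in which the
    # ortholog of gene i carries a predicted site.
    return sum(s) / len(s)

def weighted_conservation_score(s, m, w=None):
    # Formula (4): n_i counts genes whose site is conserved in exactly i
    # genomes; weights must increase with i (w_i = i by default).
    n = len(s)
    if w is None:
        w = {i: i for i in range(1, m + 1)}
    counts = {i: 0 for i in range(1, m + 1)}
    for si in s:
        if si >= 1:
            counts[si] += 1
    num = sum(counts[i] * w[i] for i in range(1, m + 1))
    den = n * sum(w[i] for i in range(1, m + 1))
    return math.log(m * n) * num / den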
3. RESULTS

Metabolic pathways have been widely studied. They can be extremely complex and may involve large numbers of genes. Often every path in the network involves one or more TFs and the genes regulated by them. However, only a few of the genes and TFs in a pathway may have been identified, and even fewer of the TF binding sites may be known. A useful problem is to identify the genes, the TFs, and their binding sites specifically involved in a given pathway. Starting from one or two experimentally verified binding sites, can we predict the rest of the relevant binding sites of the genes in the pathway? Furthermore, can we identify such a gene set? We will show that our IEM algorithm can help to address these questions. In order to evaluate our results, we used a well-studied pathway, the arginine metabolic pathway in P. aeruginosa, as an example. It is already known that P. aeruginosa possesses four different pathways for the utilization of arginine (Lu, Yang et al. 2004): the arginine deiminase (ADI) pathway, the arginine succinyltransferase (AST) pathway, the arginine decarboxylase (ADC) pathway, and the arginine dehydrogenase (ADH) pathway. Under anaerobic conditions, arginine can be used as a direct source of ATP via the ADI pathway. ArgR is a TF in the ADH pathway. Lu et al. used microarray experiments to identify candidate genes for the ArgR regulon (Lu, Yang et al. 2004). It was reported that ArgR regulated 37 genes (28 induced and 9 repressed) from 17 operons. Eighteen of the 28 arginine-inducible genes are in 4 transcriptional units that have been reported previously as members of the ArgR regulon (Itoh 1997; Park, Lu et al. 1997; Nishijyo, Park et al. 1998; Lu, Winteler et al. 1999; Lu and Abdelal 2001; Hashim, Kwon et al. 2004). Lu et al. also identified several new ArgR regulon members among these 37 genes and verified them by wet-lab experiments. Since the ArgR system is well studied, we used it to test the IEM algorithm.

3.1. Results on the Arginine Metabolic Pathway Study
3.1.1. Arginine pathway data set

Upstream regions of the 17 transcriptional units (operons) were obtained for five strains of P. aeruginosa (PAO1, PA14, PACS2, PA2192, and C3719). We also included 6 genes involved in the ADC and ADH pathways that were known not to bind ArgR.
3.1.2. Prediction comparison procedure

To show the power of our technique, we assumed for our experiments that we know only one (randomly chosen) instance of a binding site for ArgR. We used a subset of the operons mentioned above (12 out of the 17 from the ADI pathways and all 6 from the ADC/ADH pathways). We then set out to see if the algorithm was successful in locating the previously known binding sites in the remaining 5 operons. On average, the refined motif missed 1.2 of the 5 known binding sites. We applied MEME, AlignACE, and IEM to the same data set. The results were compared for an experiment with data from two genomes (PAO1 and PA14) and another experiment with data from five genomes (PAO1, PA14, PACS2, PA2192, and C3719). The idea was to get a sense of how much the comparative genomics data helped in the task. MEME and AlignACE were applied to the pooled data. For IEM, the initial profile was created using the motif instance. The frequency of the base from the consensus sequence was set at 0.7, and the frequencies of the other bases were set at 0.1. Each of the three programs was run 10 times on the data set introduced earlier. We counted the number of true predictions (TP, true positives), the number of false predictions (FP, false positives), and the motif scores IC (information content), MAP (maximum a posteriori probability), and the weighted conservation score S_c.

3.1.3. Arginine pathway prediction comparison results

Tables 1 and 2 present the results from the two experiments (the two-genome case vs. the five-genome case) over the 10 runs. The three columns present the results from the three programs. In cases where a motif was reported, the number of TPs and FPs, along with the three measures of motif quality, are reported. The IEM algorithm finds the ArgR binding motif in every instance. In the experiments involving two genomes, the motif scores (using the MAP, IC, and S_c measures) are comparable to those reported using MEME or AlignACE. However, when the five-genome data was used, the scores using the IEM algorithm were markedly superior to those of the other two methods (when they were reported).

3.2. Results on AmpR
In this section, we discuss our experiments with the IEM algorithm applied to data from experiments on the transcription factor AmpR in P. aeruginosa. AmpR was recently reported to be a global transcription factor that regulates the expression of many virulence factors (Kong, Jayawardena et al. 2005). To better understand the regulon of AmpR, the consensus sequence (5'-TCTGCTGCAAATTT-3') of the AmpR binding sites in C. freundii and E. cloacae was used by Kong et al. to find an exactly conserved sequence site within the upstream region of ampC in PAO1 (Kong, Jayawardena et al. 2005). They also analyzed the upstream regions of all the genes putatively regulated by AmpR in the hope of finding a potential AmpR binding site. Tools such as MEME and AlignACE failed to find anything resembling the binding site from the upstream region of ampC. The IEM algorithm was then applied using the consensus sequence mentioned above, a handcrafted list of 10 genes possibly regulated by AmpR, and newly available comparative genomics data sets from four closely related strains of Pseudomonas (PA14, PA2192, PACS2, and C3719). As mentioned in the previous section, a crude motif profile was constructed
based on the consensus sequence. The results before and after applying the IEM algorithm are shown in Table 3.
The refined motif showed improved scores according to three different motif scores. After refinement, we found that the putative AmpR binding site appears in only 3 of the 10 genes mentioned above (lasA, lasR, and ampC), across all five strains of P. aeruginosa. Support for these 3 predictions was obtained using lacZ fusions in the Mathee lab. Further experimental verification is needed, and work is underway in the Mathee lab. We conjecture that the remaining 7 genes are only indirectly regulated by AmpR. We then used the refined motif to scan the entire PAO1 genome for instances of the motif in upstream regions. Based on the likelihood value calculated with formula (2), we ranked the "hits", chose the top 150 genes, and followed up with a gene function enrichment analysis. See Table 4 for the results. The term with the top hit, i.e., the lowest P-value, was "periplasmic space". This is considered significant because ampR is known to be involved in cell-wall recycling. A similar search with the motif before refinement did not find this GO term.
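A genome-wide scan of this kind can be sketched as follows (names are illustrative; the score is the same motif-vs-background likelihood ratio that underlies formula (2)):

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def rank_genes_by_motif(upstreams, M, bg=0.25, top=150):
    # upstreams: {gene: upstream sequence}; M: 4 x K profile (rows A,C,G,T).
    # Each gene is scored by its best window's likelihood ratio, then the
    # top-ranking genes are returned for enrichment analysis.
    K = len(M[0])
    best = {}
    for gene, s in upstreams.items():
        hi = 0.0
        for p in range(len(s) - K + 1):
            lik = 1.0
            for j in range(K):
                lik *= M[BASES[s[p + j]]][j] / bg
            hi = max(hi, lik)
        best[gene] = hi
    return sorted(best, key=best.get, reverse=True)[:top]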
Table 1. Motif predicted by IEM, MEME, and AlignACE using data on 2 strains of P. aeruginosa (PAO1 and PA14).
Table 3. Characteristics of the motif before and after refinement.
Table 2. Motif predicted by IEM, MEME, and AlignACE using data on 5 strains of P. aeruginosa (PAO1, PA14, PA2192, PACS2, and C3719).

3.3. Results on Whole Genomic Data
Next we discuss our experiments with yeast data sets. Recently, Kellis et al. compared five yeast species to identify regulatory elements in the entire genome by searching for conserved segments across the different yeast species (Kellis, Patterson et al. 2003). They developed a motif score called MCS (Motif Conservation Score) to measure the conservation ratio of a motif compared to random patterns of the same length and degeneracy (Kellis, Patterson et al. 2003). A list of 72 full motifs having an MCS of at least 4 was reported. These 72 predicted motifs showed strong overlap with 28 of the 33 known motifs in yeast. However, the motifs used in that paper were represented using generalized consensus sequences (i.e., using IUPAC codes to represent nucleotide degeneracy) instead of the more powerful profile matrix. We set out to consider whether the IEM algorithm could improve the predictions from that work. Starting from the results of Kellis et al., we used IEM to refine each of the 72 motifs mentioned above.
Data from four yeast genomes (S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus) were used. Complete results on the refined motifs are available at our supplementary results website: http://biorg.cs.fiu.edu/IEM/. Below we show some of the highlights in Table 5. In each case, the number of hits went down after the refinement.
Table 5. Results of motif refinement for the yeast data set. For each of the five motifs, the upper row is the consensus sequence from Kellis et al., while the lower row is the result after refinement by the IEM algorithm.

Table 4. GO enrichment analysis for the AmpR experiments.
4. DISCUSSION AND CONCLUSION

In this paper we propose a new algorithm to refine motifs with the help of comparative genomics data. The algorithm incorporates an improved scoring scheme that is sensitive to hits in the related genomes. The algorithm is inspired by the technique of "co-training" from the field of data mining, where lessons learned from one data source are iteratively used to model the situation for another data source. The results show clear improvements in the quality of the output motifs. The IEM algorithm does have its own shortcomings, which we continue to address. First, it does not attempt to change the length of the motif from that of the initial motif it started with. Second, it works best if the genomes considered are very closely related, and it is useful in cases where the phylogenetic relationships between the genomes are not known. If phylogenetic information is available, then the algorithm can be modified to factor this in, along the lines of several previous algorithms.
ACKNOWLEDGMENTS

The work of GN was supported in part by a grant from NIH under NIH/NIGMS S06 GM008205. We thank Camilo Valdes for helping us compile the upstream sequence data for the five strains of P. aeruginosa and for his help with Figure 2 in the paper.
REFERENCES
1. Bailey, T. L. and C. Elkan (1994). "Fitting a mixture model by expectation maximization to discover motifs in biopolymers." Proc Int Conf Intell Syst Mol Biol 2: 28-36.
2. Blanchette, M. and M. Tompa (2002). "Discovery of regulatory elements by a computational method for phylogenetic footprinting." Genome Res 12(5): 739-48.
3. Comin, M. and L. Parida (2007). "Subtle motif discovery for detection of DNA regulatory sites." Asia-Pacific Bioinformatics Conference (APBC 2007), Hong Kong.
4. Crooks, G. E., G. Hon, et al. (2004). "WebLogo: a sequence logo generator." Genome Res 14(6): 1188-1190.
5. Dempster, A. P., N. M. Laird, et al. (1977). "Maximum likelihood estimation from incomplete data via the EM algorithm." J R Statist Soc B 39: 1-38.
6. Gertz, J., L. Riles, et al. (2005). "Discovery, validation, and genetic dissection of transcription factor binding sites by comparative and functional genomics." Genome Res 15(8): 1145-52.
7. GuhaThakurta, D. (2006). "Computational identification of transcriptional regulatory elements in DNA sequence." Nucl Acids Res 34(12): 3585-98.
8. Hashim, S., D. H. Kwon, et al. (2004). "The arginine regulatory protein mediates repression by arginine of the operons encoding glutamate synthase and anabolic glutamate dehydrogenase in Pseudomonas aeruginosa." J Bacteriol 186(12): 3848-54.
9. Hertz, G. Z. and G. D. Stormo (1999). "Identifying DNA and protein patterns with statistically significant alignments of multiple sequences." Bioinformatics 15(7-8): 563-77.
10. Hu, J., B. Li, et al. (2005). "Limitations and potentials of current motif discovery algorithms." Nucl Acids Res 33(15): 4899-4913.
11. Itoh, Y. (1997). "Cloning and characterization of the aru genes encoding enzymes of the catabolic arginine succinyltransferase pathway in Pseudomonas aeruginosa." J Bacteriol 179(23): 7280-90.
12. Kellis, M., N. Patterson, et al. (2003). "Sequencing and comparison of yeast species to identify genes and regulatory elements." Nature 423(6937): 241-54.
13. Kong, K. F., S. R. Jayawardena, et al. (2005). "Pseudomonas aeruginosa AmpR is a global transcriptional factor that regulates expression of AmpC and PoxB beta-lactamases, proteases, quorum sensing, and other virulence factors." Antimicrob Agents Chemother 49(11): 4567-75.
14. Lawrence, C. E. and A. A. Reilly (1990). "An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences." Proteins 7(1): 41-51.
15. Liu, X., D. L. Brutlag, et al. (2001). "BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes." Pac Symp Biocomput: 127-38.
16. Liu, X. S., D. L. Brutlag, et al. (2002). "An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments." Nat Biotechnol 20(8): 835-9.
17. Liu, Y., X. S. Liu, et al. (2004). "Eukaryotic regulatory element conservation analysis and identification using comparative genomics." Genome Res 14(3): 451-8.
18. Lu, C. D. and A. T. Abdelal (2001). "The gdhB gene of Pseudomonas aeruginosa encodes an arginine-inducible NAD(+)-dependent glutamate dehydrogenase which is subject to allosteric regulation." J Bacteriol 183(2): 490-9.
19. Lu, C. D., H. Winteler, et al. (1999). "The ArgR regulatory protein, a helper to the anaerobic regulator ANR during transcriptional activation of the arcD promoter in Pseudomonas aeruginosa." J Bacteriol 181(8): 2459-64.
20. Lu, C. D., Z. Yang, et al. (2004). "Transcriptome analysis of the ArgR regulon in Pseudomonas aeruginosa." J Bacteriol 186(12): 3855-61.
21. MacIsaac, K. D. and E. Fraenkel (2006). "Practical strategies for discovering regulatory DNA sequence motifs." PLoS Comput Biol 2(4): e36.
22. Moses, A. M., D. Y. Chiang, et al. (2004). "Phylogenetic motif detection by expectation-maximization on evolutionary mixtures." Pac Symp Biocomput: 324-35.
23. Nishijyo, T., S. M. Park, et al. (1998). "Molecular characterization and regulation of an operon encoding a system for transport of arginine and ornithine and the ArgR regulatory protein in Pseudomonas aeruginosa." J Bacteriol 180(21): 5559-66.
24. Park, S. M., C. D. Lu, et al. (1997). "Cloning and characterization of argR, a gene that participates in regulation of arginine biosynthesis and catabolism in Pseudomonas aeruginosa PAO1." J Bacteriol 179(17): 5300-8.
25. Pavesi, G., P. Mereghetti, et al. (2004). "Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes." Nucl Acids Res 32(Web Server issue): W199-203.
26. Prakash, A., M. Blanchette, et al. (2004). "Motif discovery in heterogeneous sequence data." Pac Symp Biocomput: 348-59.
27. Sandve, G. K. and F. Drablos (2006). "A survey of motif discovery methods in an integrated framework." Biol Direct 1: 11.
28. Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display consensus sequences." Nucl Acids Res 18(20): 6097-100.
29. Siddharthan, R., E. D. Siggia, et al. (2005). "PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny." PLoS Comput Biol 1(7): e67.
30. Sinha, S., M. Blanchette, et al. (2004). "PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences." BMC Bioinformatics 5: 170.
31. Sinha, S. and M. Tompa (2003). "YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation." Nucl Acids Res 31(13): 3586-8.
32. Stormo, G. D. (2000). "DNA binding sites: representation and discovery." Bioinformatics 16(1): 16-23.
33. Tompa, M., N. Li, et al. (2005). "Assessing computational tools for the discovery of transcription factor binding sites." Nat Biotechnol 23(1): 137-44.
34. Wang, T. and G. D. Stormo (2003). "Combining phylogenetic data with co-regulated genes to identify regulatory motifs." Bioinformatics 19(18): 2369-80.
35. Werner, T. (1999). "Models for prediction and recognition of eukaryotic promoters." Mamm Genome 10(2): 168-75.
36. Zeng, E. and G. Narasimhan (2007). "Enhancing motif refinement by incorporating comparative genomics data." Proc of the Int Symp on Bioinformatics Research and Applications (ISBRA), Lecture Notes in Computer Science, Vol. 4463, Springer, pp. 329-337.
MANGO: A NEW APPROACH TO MULTIPLE SEQUENCE ALIGNMENT
Zefeng Zhang and Hao Lin
Computational Biology Research Group, Division of Intelligent Software Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Email: {zhangzf, linhao}@ict.ac.cn

Ming Li*
David R. Cheriton School of Computer Science, University of Waterloo, Ont. N2L 3G1, Canada
*Email: [email protected]

Multiple sequence alignment is a classical and challenging task in biological sequence analysis. The problem is NP-hard. Full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes the information of all sequences as a whole, avoiding the problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0, and Kalign 2.0. MANGO is available at http://www.bioinfo.org.cn/mango/
1. Introduction

Multiple sequence alignment is a basic and essential step of many sequence analysis methods (Ref. 6). For example, multiple sequence alignment is used in phylogenetic inference, RNA structure analysis, homology search, non-coding RNA (ncRNA) detection, and motif finding. For recent reviews in this area, see Refs. 35 and 13. Finding the optimal alignment (under the SP score with affine gap penalty) for multiple sequences has been shown to be NP-hard (Ref. 44). A trivial solution by dynamic programming takes O(n^k) time for k sequences, each of length n. Under moderate assumptions^a, the problem has a polynomial time approximation scheme (PTAS) (Ref. 28). However, this PTAS remains a theoretical solution, since it has a high polynomial power related to the error rate. With the rapid growth of molecular sequences, the problem becomes more prominent. Thus many modern alignment programs resort to heuristics to reduce the computational cost
while sacrificing accuracy. The prevailing strategy is the progressive alignment method (Refs. 14, 41), implemented in the popular ClustalW software (Ref. 41), as well as in the more recent multiple sequence alignment programs MUSCLE (Ref. 11), T-Coffee (Ref. 34), MAFFT (Ref. 19), and Progressive-POA (Ref. 17), to name a few. The idea behind progressive alignment is to build the multiple sequence alignment on the basis of pairwise alignments under the guidance of an evolutionary tree. A distance matrix is computed from the similarities of sequence pairs, according to which a phylogenetic tree is built (Ref. 38). The multiple alignment is then constructed by aligning two sequences or alignment profiles along the phylogenetic tree. In this way, the progressive alignment method avoids the exponential search. For a large number of sequences, the distance matrix calculation can be slow, and the optimal phylogenetic tree construction itself, under the usual assumptions of parsimony, max likelihood, or max number of quartets, is NP-hard anyway. After all, sometimes the purpose of doing multiple sequence alignment is
*Corresponding author.
^a For example, when the average number of gaps per sequence is a constant, the problem has a PTAS.
to construct a phylogenetic tree itself. Then, if we didn't believe the initial phylogeny (constructed to do the multiple sequence alignment), why should we believe the phylogeny that is constructed based on a multiple alignment which in turn is based on the untrusted phylogeny? Again, heuristics have been used to accelerate phylogenetic tree construction. Pairwise similarity is estimated using fast k-mer counting in MUSCLE, and a similar strategy can be seen in the fast version of ClustalW. However, in spite of its most attractive virtue, its speed, the progressive approach (adding sequences greedily to form the multiple sequence alignment) is born with the well-known pitfall that errors introduced in early stages cannot be corrected later (the so-called 'once a gap, always a gap'). Many efforts have been made to remedy this drawback and enhance the accuracy of the final alignment. MUSCLE adds tree refinement after the progressive alignment stage to tune the result. T-Coffee uses a consistency-based score reflecting global information to assist pairwise alignment during the progressive alignment process. PROBCONS adopts a probabilistic consistency-based approach. All of them do achieve better accuracy. There are two alternatives to progressive approaches. One is simultaneous alignment of all sequences by standard dynamic programming (DP). Two packages, MSA (Ref. 26) and DCA (Ref. 39), follow this idea. However, algorithms in this category do not scale up because of their heavy computational costs. The other alternative is the iterative strategies (Refs. 16, 18, 25). Starting from an initial alignment, these methods iteratively tune the alignment until there is no improvement in the objective function. These iterative strategies require a good initial alignment as a starting point; otherwise the iterative process will be time-consuming or will easily fall into local optima. We now introduce the core idea of MANGO. Given a set of sequences, how do we really judge an alignment? Do we really care about aligning a non-homologous region well? No. What we really care about is the aligner's ability to put similar regions together, including the distant homologous regions. Some similar regions are shared by most of (if not all) the sequences, while others may be shared by only a few sequences. Gaps are inserted to get an alignment which lines up the similar regions properly. With this simple observation, we describe a new paradigm for multiple sequence alignment. Our new
algorithm uses the novel idea of optimized spaced seeds, introduced by Ma, Tromp, and Li (Ref. 31) initially for pairwise alignment, to find similar regions and bind them together via sophisticated algorithms (which are of theoretical interest on their own), and then refine the alignment. Note that similar approaches (Ref. 32) using consecutive or non-optimized k-mers (with some gaps) might have been used in some programs to some degree; however, without optimized spaced seeds, such approaches cannot achieve sensitivity and specificity as high as in MANGO. Also note that these optimized spaced seeds are not dependent on any data; they are independently optimized as in Refs. 31, 29 and 9. Our new algorithm requires neither the slow global multiple alignment nor the inaccurate progressive local pairwise alignment. The optimized spaced seeds have shown their advantages over traditional consecutive seeds for pairwise alignment in the PatternHunter software (Ref. 31). The idea has since been adopted by most modern homology search software, including BLAST (MegaBLAST). It has been shown (Refs. 31, 22, 7, 30) that optimized spaced seeds can achieve much better sensitivity and specificity than consecutive seeds. After that, the idea of a single optimized spaced seed was extended to multiple optimized seeds in the PatternHunter II program (Refs. 29, 40), vector seeds (Ref. 3), and neighbor seeds; multiple seeds were studied for even better sensitivity and specificity. One multiple genomic DNA alignment program (Ref. 4) uses optimized spaced seeds to find hits between two sequences. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. MANGO uses multiple optimized spaced seeds to catch similar regions (hits) quickly with high sensitivity and specificity. A scoring scheme is designed to encode global similarity information in the hits. Hits with scores beyond a threshold are arranged carefully to form parts of the alignment. Under the constraint of these hits, banded dynamic programming is carried out to achieve a global solution. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW, MUSCLE, MAFFT, ProbConsRNA, Dialign, DIALIGN-T, T-Coffee, POA, and Kalign. The experiments were performed only on
nucleotide sequences for the purpose of justifying our new approach. For multiple protein sequence alignment, other factors, such as the similarity scoring schema and secondary structure information, affect the alignment quality a lot, and hence would potentially blur the comparative results on the effectiveness of the optimized spaced seed approach.
2. Method

The work flow of MANGO is given in Fig. 1. Our strategy contains three stages. After any stage, MANGO can be stopped, and it will output the alignment constructed so far. In stage one, template construction, MANGO locates super motifs in the input sequences and builds a skeletal alignment by pasting each sequence onto the template exposed by the motifs. In stage two, hit binding, MANGO first sorts the hits according to their agreements among themselves and then tries to bind the hits one by one into the skeletal alignment. Iterative refinement is then carried out in stage three to produce the final alignment, where MANGO picks out one sequence at a time and aligns it to the current alignment of the rest of the sequences.
Fig. 1. Three stages of MANGO: template construction, hit binding and iterative refinement, producing the skeletal alignment, the initial alignment and the final alignment, respectively.

2.1. Seeds selection

Following the original notation of Ref. 31, we denote a spaced seed by a binary string. A "1" in the spaced seed means it requires a match at that position, and a "0" indicates a "don't care" position not requiring a match. The length of the seed is the length of the binary string, and the weight of the seed is the number of 1's in the seed. For reasons why optimized spaced seeds are much better than the consecutive BLAST-type seeds, please see Refs. 31 or 30, which point to many more recent theoretical studies. We have used eight highly independent spaced seeds^b generated from the parent seed 1110110010110101111, with seed weight 13 and seed lengths ranging from 19 to 23, from Ref. 9. In addition, 82 single optimal spaced seeds, with weights ranging from 9 to 16 and lengths ranging from 9 to 22, are optimized against a 64-length IID region of similarity level 0.7, using the dynamic programming algorithm described in Ref. 22. These 82 seeds are sorted by decreasing weight, to prefer specificity over sensitivity. We used the first seed to locate the super motifs and construct the profile template; then we applied the other seeds one by one. For each seed match (namely, a hit), we call the matched fragment a k-mer. The k-mer has the same length as the seed.

^b 11100110110010101111, 1101110110000110100111, 1011110010110111011, 11001110000010110101111, 10110111010110001111, 10101010110010100101111, 1110110001111101101, 11001110110010010001111.
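A minimal sketch of how a spaced seed extracts k-mers and produces hits between two sequences (function names are ours; the seed matching in MANGO itself is of course far more engineered):

from collections import defaultdict

def seed_kmers(seq, seed):
    # Extract, at every offset, the characters under the '1' positions of
    # the spaced seed; two windows hit each other iff these keys are equal.
    care = [i for i, c in enumerate(seed) if c == "1"]
    L = len(seed)
    index = defaultdict(list)
    for p in range(len(seq) - L + 1):
        key = "".join(seq[p + i] for i in care)
        index[key].append(p)
    return index

def hits(seq_a, seq_b, seed):
    # All position pairs where the two sequences match under the seed.
    ia, ib = seed_kmers(seq_a, seed), seed_kmers(seq_b, seed)
    return [(p, q) for k, ps in ia.items() if k in ib
                   for p in ps for q in ib[k]]

# Example with one of the eight seeds listed in footnote b:
SEED = "11100110110010101111"
print(hits("ACGTACGTACGTACGTACGTACGT", "TTACGTACGTACGTACGTACGTT", SEED))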
2.2. Stage one: constructing profile template

If a piece of sequence segment is shared by a considerably large portion of the input sequences, we call it a super motif. The super motifs reflect the conserved parts among the sequences and are very likely to appear in the final alignment; hence pinning them down will give guidance to the whole alignment process. MANGO uses an optimized spaced seed to detect super motifs; thus they need not be identical (due to the "don't care" positions of the spaced seed), as long as they have high enough similarity to be caught by the spaced seed. The detection is performed as follows. Highly frequent k-mers (currently defined as 25% of the input sequences having the k-mer) extracted by the spaced seed are lined up according to their relative appearance in each sequence. Then MANGO determines the overlapping portion (see the middle of Fig. 2 for a demonstration of overlap) of adjacent k-mers, by searching for their existing overlapping parts in each sequence. In this way, a super motif is represented by a series of overlapping k-mers, and all those k-mers are concatenated together to form the profile template. After that, MANGO aligns each sequence to the template. As Fig. 2 indicates, the highly frequent k-mers residing inside each sequence are directed (aligned) to the corresponding positions in the profile template, thus producing a skeletal alignment.
Fig. 2. MANGO locates super motifs by highly frequent k-mers (shaded boxes) and constructs the profile template by concatenating them. The k-mers inside each sequence are aligned to the profile template, producing a skeletal alignment.
The impact of this stage is twofold: the profile template provides anchors and constraints for the later alignment process, and wiping out the highly frequent k-mers greatly reduces the number of hits to be considered in stage two, hit binding.
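The detection of highly frequent k-mers can be sketched as follows (a hypothetical helper; the subsequent overlap merging and concatenation into the template are omitted):

def super_motif_kmers(seqs, seed, min_frac=0.25):
    # A seed-masked window is "highly frequent" if it occurs in at least
    # min_frac of the input sequences (25% in the text).
    care = [i for i, c in enumerate(seed) if c == "1"]
    L = len(seed)
    seen = {}
    for s in seqs:
        keys = {"".join(s[p + i] for i in care)
                for p in range(len(s) - L + 1)}
        for k in keys:
            seen[k] = seen.get(k, 0) + 1
    cutoff = min_frac * len(seqs)
    return {k for k, c in seen.items() if c >= cutoff}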
2.3. Stage two: binding the hits

2.3.1. Vote among hits

After getting rid of the super motifs, less frequent k-mers extracted by single or multiple spaced seeds will generate a set of hits among the sequences. Hits may conflict with each other, and MANGO tries to select a good compatible subset of them. Since the consistency relationships among those hits reflect the global similarity of the input sequences, MANGO encodes global information into each hit by assigning it a priority score, which is voted on by the other hits to agree or disagree that this hit should appear in the final alignment. The consistency and inconsistency relationships of the hits are illustrated in Fig. 3, corresponding to "yes" votes (positive score) and "no" votes (negative score), respectively. Assume that hit_i and hit_j occur between the same two sequences. Let S(i, j) be the vote score for hit_i by hit_j, which is calculated as follows: (1) if hit_i and hit_j are incompatible (either they are order incompatible as in Fig. 3(1.a), or they overlap but their nucleotide mapping order is inconsistent as in Fig. 3(1.b)), they cannot appear in the same alignment simultaneously. Hence hit_j will vote against hit_i, and S(i, j) = -W_disagree. (2) If hit_i and hit_j are order compatible as Fig. 3(2.a) indicates, the appearance of hit_j in a certain alignment will encourage hit_i to appear too, and S(i, j) = W_agree. (3) If hit_i and hit_j overlap and their nucleotide mapping order is consistent as Fig. 3(2.b) indicates, MANGO further considers the size of their overlapping region. Define the overlapping ratio between them as Q = overlap_size / (2l - overlap_size), where l is the spaced seed (hit) length. Then S(i, j) = Q * W_overlap-high + (1 - Q) * W_overlap-low.

We also have indirect votes from k-mers on other sequences. If hit_j votes for or against hit_i (as in Fig. 3(3.a) and (3.b)), those k-mers identical to the k-mer of hit_j on other sequences (C in Fig. 3(3)) will increase the power of the vote, since the occurrence of C through hit_j enhances the probability that hit_i appears, or does not appear, in the final correct alignment.
Fig. 3. To uncover global similarities, MANGO assigns hit_i a priority score, which is voted on by other hits: (1) "no" vote: hit_i and hit_j are order incompatible in 1.a, or they have inconsistent nucleotide mapping order in 1.b; (2) "yes" vote: hit_i and hit_j are order compatible in 2.a, or they have consistent nucleotide mapping order in 2.b; (3) indirect vote from k-mer C, which increases the voting power of hit_j on hit_i.
Let N(hit) be the number of sequences that have the same k-mer as that inside the hit. The priority score assigned to hit_i is then calculated as Σ_j S(i, j) * (1 + (N(j) - 2) * W_indirect). The voting results are collected and the hits are sorted according to their scores. Low-scoring hits (probably random hits) are removed, and the remaining hits are considered as candidates for the next stage, hit binding.
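A small sketch of the voting scheme for hits between one pair of sequences (the weight constants and the overlap geometry below are illustrative simplifications; each hit is represented by its two start positions):

W_AGREE, W_DISAGREE = 1.0, 2.0
W_OV_HIGH, W_OV_LOW, W_INDIRECT = 2.0, 1.0, 0.5

def vote(hit_i, hit_j, l):
    # hit = (p, q): start positions of the seed match on the two sequences.
    (pi, qi), (pj, qj) = hit_i, hit_j
    overlap = max(0, l - max(abs(pi - pj), abs(qi - qj)))
    if overlap > 0:
        if pi - qi == pj - qj:                 # consistent mapping (Fig. 3, 2.b)
            Q = overlap / (2 * l - overlap)
            return Q * W_OV_HIGH + (1 - Q) * W_OV_LOW
        return -W_DISAGREE                     # overlap incompatible (1.b)
    if (pi < pj) == (qi < qj):                 # order compatible (2.a)
        return W_AGREE
    return -W_DISAGREE                         # order incompatible (1.a)

def priority(i, hits, l, N):
    # N[j] = number of sequences sharing hit j's k-mer (indirect votes).
    return sum(vote(hits[i], hits[j], l) * (1 + (N[j] - 2) * W_INDIRECT)
               for j in range(len(hits)) if j != i)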
2.3.2. Bind hits greedily

In the second part of stage two, hit binding, MANGO tries to bind each hit (highest score first) into the skeletal alignment generated in stage one, by greedily checking the hit candidates one by one. To bind a hit is to align all corresponding sequence letter pairs along the hit and fix their positions (once aligned, they remain aligned). What we want is to arrange the relative positions of the hits carrying similarity information. Thus, a natural way is to formulate an alignment solution as a directed acyclic graph (DAG), by viewing aligned nucleotides as one vertex. We found that Ref. 25 employs a similar DAG formulation.
Fig. 4. By viewing aligned nucleotides as one vertex, an alignment can be formulated as a DAG.

Fig. 5. (1) The success of binding a hit ((s1, t1), (s2, t2), ..., (sl, tl)) requires that there is no directed path between the vertices of any pair (si, ti), 1 ≤ i ≤ l. (2) We check this by expanding the left predecessors (grey area) of (si, ti), bypassing those already expanded by (s_{i-1}, t_{i-1}) (dark grey area).
Given N sequences, each with L_i nucleotides, we denote the jth nucleotide of sequence S_i as a vertex S_{i,j}. Directed edges are linked from S_{i,j} to S_{i,j+1}, for 0 ≤ j < L_i - 1. Thus, the initial graph G = (V, E) has Σ_i L_i vertices and Σ_i (L_i - 1) edges. Let reach(s, t) be a predicate which is true iff s ≠ t and there is a directed path from s to t in the DAG. If we merge two vertices x and y when reach(x, y) ∨ reach(y, x) is true, then the resulting graph is no longer a DAG, due to the cycle introduced. So a hit ((s1, t1), (s2, t2), ..., (sl, tl)) can be bound into the alignment DAG if and only if ¬reach(si, ti) ∧ ¬reach(ti, si), for 1 ≤ i ≤ l. If these criteria are satisfied, MANGO binds the hit to the DAG (updates the DAG by merging the corresponding vertices); otherwise, the hit is discarded.

Now we describe the process of binding a hit candidate. As indicated in Fig. 5(1), we need to check that there is no directed path between the two vertices of any vertex pair, and on success, we update the DAG. An intuitive way is to maintain a transitive closure matrix, that is, to record the reachability of each pair of vertices. This strategy has constant query time and an update time of O(l * n), where n is the number of vertices. Neither the space complexity of O(n^2) nor the update complexity of O(l * n) is acceptable, as n is too large and the update operation is quite frequent. To our knowledge, the best reachability algorithm that does not explicitly maintain a transitive closure matrix on general directed graphs is due to Ref. 37. It provides a nice fully dynamic algorithm (supporting queries, edge insertions and deletions) with query time O(n) and amortized update time O(m √n log n), where m is the number of edges. The best result on DAGs appears in Ref. 36, with query time O(n / log n) and amortized update time O(m). Neither is suitable for our problem, with its frequent updates from vertex merging. Thus we design an algorithm supporting queries and vertex merging on DAGs, with nearly constant update (merging) time and O(n) query time in the worst case, for a vector of an arbitrary number of adjacent vertex pairs. Let L(s) = {i | reach(i, s), i ∈ V} and R(s) = {i | reach(s, i), i ∈ V}. We call L(s) the (left) predecessor set of s and R(s) the (right) successor set of s. We know that:
Lemma 2.1. reach(si, ti) ⟹ R(si) ∩ L(ti) ≠ ∅.

Corollary 2.1. A hit ((s1, t1), (s2, t2), ..., (sl, tl)) can be bound into the alignment DAG iff L(si) ∩ R(ti) = ∅ ∧ R(si) ∩ L(ti) = ∅, for 1 ≤ i ≤ l.

The reachability of the vertex pairs in a hit is not independent, due to:

Lemma 2.2. L(si) ⊆ L(s_{i+1}) and R(ti) ⊇ R(t_{i+1}), for 1 ≤ i < l.
By Corollary 2.1 and Lemma 2.2, we can check the reachability of vertex pairs in an incremental way. After we have checked that L(si) ∩ R(ti) = ∅ ∧ R(si) ∩ L(ti) = ∅, for the vertex pair (s_{i+1}, t_{i+1}) we only need to check that (L(s_{i+1}) \ L(si)) ∩ R(t_{i+1}) = ∅ ∧ R(s_{i+1}) ∩ (L(t_{i+1}) \ L(ti)) = ∅. So for each given vertex pair, we check the reachability between them by expanding the left predecessors of the two vertices, searching for any vertex located to the right (marked black in Fig. 5(2)) along the two sequences, while bypassing those vertices already expanded for prior pairs. This can be done in O(n) time in the worst case for an arbitrary number of vertex pairs, because no vertex is expanded twice during the checking of all the vertex pairs of a hit. Nearly constant update (merging) time, O(l * α(n)) where α(·) is the inverse Ackermann function, is achieved if for each vertex we maintain a list of its incoming vertices and record the list's head and tail pointers. To merge two vertices, we alias them by the Union-Find algorithm and merge their incoming lists. At this point, MANGO applies the seeds sequentially, with seeds of higher weight used first. That is, for each given seed, MANGO searches for hits, calculates their voting scores, and finds the best way to bind them. This strategy is temporarily adopted mainly for memory efficiency. In the future, MANGO will include an option to use several seeds in each step. This will significantly increase memory usage; however, it has the potential to further increase sensitivity and specificity.
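To make the binding machinery concrete, here is a compact, runnable sketch (class and method names are ours). For clarity it re-runs a full backward search for each vertex pair instead of the incremental expansion-with-bypassing described above, so it illustrates the acyclicity test and the Union-Find vertex merging rather than the paper's exact complexity:

class AlignmentDAG:
    def __init__(self, seqs):
        # One vertex (i, j) per nucleotide; edges link consecutive positions.
        self.parent = {}          # union-find aliasing of merged vertices
        self.preds = {}           # incoming neighbours (left predecessors)
        for i, s in enumerate(seqs):
            for j in range(len(s)):
                v = (i, j)
                self.parent[v] = v
                self.preds[v] = {(i, j - 1)} if j > 0 else set()

    def find(self, v):
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def ancestors(self, v):
        # All vertices with a directed path to v (backward search).
        seen, stack = set(), [self.find(v)]
        while stack:
            u = stack.pop()
            for w in self.preds[u]:
                w = self.find(w)
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def can_bind(self, pairs):
        # A hit may be bound iff neither vertex of any pair reaches the other.
        for s, t in pairs:
            s, t = self.find(s), self.find(t)
            if s == t:
                continue
            if s in self.ancestors(t) or t in self.ancestors(s):
                return False
        return True

    def bind(self, pairs):
        if not self.can_bind(pairs):
            return False
        for s, t in pairs:
            s, t = self.find(s), self.find(t)
            if s == t:
                continue
            self.parent[t] = s                      # alias t into s
            self.preds[s] |= self.preds.pop(t)      # merge incoming lists
        return True

# Example: bind a 2-pair hit between sequence 0 and sequence 1.
# dag = AlignmentDAG(["ACGT", "AGT"])
# dag.bind([((0, 1), (1, 1)), ((0, 2), (1, 2))])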
It should be noted that, in the above two steps, similar regions are arranged carefully to form a basic alignment without considering any gap penalty.

2.4. Stage three: iterative refinement
After stage two of hit binding, the input sequence letters are tightly bound into a DAG, and the final stage of MANGO tries to refine this alignment with respect to the topological structure of the DAG. Each time, one sequence is selected and the corresponding DAG vertices (black nodes in Fig 6) are picked out to form one input. The remaining vertices of the DAG form the other input, and MANGO performs dynamic programming (DP) using a heuristic strategy similar to that of Ref. 20. The user can also choose to perform an optimal alignment search with the affine gap penalty described in Ref. 21, which we modified to fit our alignment setting. The scoring scheme we use is the simplest sum-of-pairs (SP) score, with matching nucleotide pairs scoring 1 and mismatching pairs scoring 0. The gap open and gap extension penalties are -1.53 and -0.23, respectively, the same as those used in MAFFT.
Fig. 6. (1) In the refinement stage, MANGO picks out one sequence at a time and aligns it to the other sequences based on the current alignment. (2) The topological order of the DAG (for example, vertex a must appear before vertex b) helps to narrow down the search space (the grey area is skipped).
However, the vertex order in the DAG can help to narrow down the search space of the DP. This can be seen from Fig 6. The DAG requires vertex b to appear after vertex a in any valid alignment, so a portion (shaded area in Fig 6(2)) of the quadratic search space is cut off. In practice, the DP is performed in a banded search space with almost linear time and space; a simplified sketch of such a banded DP follows. The idea of Iterative POA (Ref. 25) is similar to our refinement step: it extracts one sequence and aligns it to the rest of the DAG by DP, slightly differing from the constrained DP we revised from Ref. 21.
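This is not MANGO's implementation, only a simplified sketch of a banded DP under the scoring stated above (match 1, mismatch 0): only cells within a fixed band around the diagonal are filled, giving near-linear time and space when the band is narrow. For brevity, a single linear gap penalty stands in for the affine penalties used by the paper; the band parameter and all names are our own choices, and the score is -inf if the band cannot cover the length difference of the two inputs.

```python
def banded_align_score(x, y, band, match=1.0, mismatch=0.0, gap=-0.53):
    NEG = float("-inf")
    prev = [NEG] * (len(y) + 1)
    prev[0] = 0.0
    for j in range(1, min(band, len(y)) + 1):   # first row inside the band
        prev[j] = prev[j - 1] + gap
    for i in range(1, len(x) + 1):
        cur = [NEG] * (len(y) + 1)
        if i <= band:                           # column 0 still inside the band
            cur[0] = prev[0] + gap
        lo, hi = max(1, i - band), min(len(y), i + band)
        for j in range(lo, hi + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,       # (mis)match
                         prev[j] + gap,         # gap in y
                         cur[j - 1] + gap)      # gap in x
        prev = cur
    return prev[len(y)]
```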
However, without the guidance of bound hits, a local optimum is easily reached in such an iterative scheme.

The above three stages constitute the core algorithms of MANGO. Finding the profile template and constructing a skeletal alignment is the fastest stage; binding hits is made efficient by the reachability detection algorithm described above, and experiments (see below) show that this stage is responsible for most of the alignment accuracy; the refinement stage is the slowest, but it can uncover similarities that escaped detection by the optimal spaced seeds.

3. Results
This section assesses the performance of MANGO on three large alignment benchmarks of small subunit (SSU) ribosomal RNA (16S rRNA) for phylogenetically representative sets of prokaryotes, mitochondria and eukaryotes, respectively (version 9.36) (Ref. 5). These alignments are hand-curated by the Ribosomal Database Project (RDP-II) at Michigan State University. All leading multiple sequence alignment programs (in their most recent published versions) that are able to do large-scale multiple sequence alignment are compared with MANGO: ClustalW 1.83 (Ref. 41), the most popular software; MUSCLE 3.6 (Ref. 11), a fast program aiming at both speed and accuracy; MAFFT 5.861 (Ref. 19), with many recent improvements over previous versions; Dialign 2.2.1 (Ref. 32); DIALIGN-T 0.2.1 (Ref. 2), an improved version of Dialign; T-Coffee 4.85 (Ref. 34), well known for its accuracy; POA 2.0 (Refs. 25, 17), based on the partial order graph formulation; and Kalign 2.0 (Ref. 24). Two versions of ClustalW: ClustalW-fast (with the -quicktree parameter) and ClustalW (full version); two versions of MUSCLE: MUSCLE-fast (with the -maxiters 1 -diags parameters) and MUSCLE (full version); two versions of MAFFT: MAFFT-fast (FFT-NS-1 with the -retree 1 -maxiterate 0 parameters, the fastest version) and MAFFT (L-INS-i with the -localpair -maxiterate 1000 parameters, the most accurate version); and two versions of Dialign: Dialign-fast (with the -o -ds parameters to speed up DNA alignment) and Dialign (full version) are tested to compare speed, memory usage and alignment accuracy. All experiments were carried out on a PC with an Intel Celeron 2.0 GHz processor and 1 GB of main memory, and all data sets and experimental results are available on the MANGO website. ProbConsRNA (Ref. 10) has SP and PS scores comparable to MAFFT on the three data sets; however, its running time is too long, so we include the MAFFT result only.

3.1. Measurement
Alignment accuracy is measured by the SP and PS scores (Ref. 11). Let A be the alignment generated by the program and let R be the (supposedly correct, or in our case human-curated) reference alignment. The SP score (also known as the Q score and the SPS score, Ref. 43) is defined as the number of correctly aligned nucleotide pairs in A divided by the total number of nucleotide pairs in R. The PS score (Ref. 11) is defined as the number of correctly aligned nucleotide pairs in A divided by the total number of nucleotide pairs in A. Thus, the SP score measures the sensitivity of the alignment A, and the PS score measures its specificity. All these scores were calculated by the program bali_score from Ref. 43, which removes noisy columns containing mostly gaps from alignments. A toy version of this scoring is sketched below.
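The following sketch implements the two definitions directly; it is not bali_score, whose gap-rich-column filtering this toy version omits. An alignment is represented as a list of equal-length gapped strings, and a "pair" is two residues aligned in the same column.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """alignment: list of rows (gapped strings of equal length)."""
    pairs = set()
    pos = [0] * len(alignment)           # per-sequence residue counters
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                present.append((s, pos[s]))
                pos[s] += 1
        for a, b in combinations(present, 2):
            pairs.add((a, b))            # residues of a and b share a column
    return pairs

def sp_ps_scores(test, reference):
    t, r = aligned_pairs(test), aligned_pairs(reference)
    correct = len(t & r)
    # SP = sensitivity (denominator: pairs in R); PS = specificity (pairs in A).
    return correct / len(r), correct / len(t)
```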
3.2. MANGO versions

Time and accuracy vary depending on whether we use the first eight seeds or all ninety seeds. Time is also reduced if we remove the refinement stage. We therefore provide experimental results for all four versions of MANGO: MANGO8, which uses eight neighbor seeds without the refinement stage; MANGO8-r, which adds the refinement stage to MANGO8; MANGO90, which uses all 90 seeds without refinement; and MANGO90-r, which adds the refinement stage to MANGO90.
3.3. The 16S SSU rRNA data set

The 16S SSU rRNA benchmark has three data sets, for prokaryotes, mitochondria and eukaryotes, respectively. The first data set (prokaryotes) has 218 sequences with an average sequence length of 1487 bp; the second (mitochondrial) has 76 sequences with an average length of 1075 bp; and the third (eukaryotes) has 140 sequences with an average length of 1823 bp. The mitochondrial data set is the smallest but has the lowest similarity among the three.
3.4. The assessment

Tables 1 to 3 present the experimental results on the data sets. We make a few observations below:
• On all three data sets, MANGO90 is simultaneously more sensitive (SP score), more specific (PS score) and much faster than ClustalW, ClustalW-fast, MUSCLE, iterative POA and progressive POA.

• The sensitivity of MUSCLE-fast is too low. MANGO8 simultaneously has higher sensitivity, higher specificity and higher speed than MUSCLE-fast.

• Compared with MUSCLE and ClustalW (the two most trusted programs), MANGO90-r achieves both higher SP and higher PS scores, in half of MUSCLE's or ClustalW's time, on all three data sets.

• Except for a few unbalanced cases (where DIALIGN-T has very low sensitivity but higher specificity), on all data sets Dialign, DIALIGN-T and Dialign-fast run many times slower than any version of MANGO, with lower specificity and significantly lower sensitivity; the specificity of these programs is nevertheless generally good.

• Although Kalign runs quite fast, it has both low sensitivity and low specificity.

• On all three data sets, MANGO has slightly lower sensitivity (SP score) but higher specificity (PS score) than the MAFFT full version, and MANGO (MANGO90-r), at similar sensitivity and specificity, is at least 5 times faster than the MAFFT full version in all cases.

• On the low-similarity mitochondrial data set, all programs perform very poorly except MANGO and MAFFT: MAFFT has the highest SP score, while MANGO has the highest PS score together with the second-highest SP score. Here MANGO90 finished in 37 seconds versus 14 minutes for MAFFT.

• T-Coffee failed in two cases. For the case it eventually finished (mitochondrial), after 37 hours, its sensitivity and specificity were both inferior to those of MANGO90-r, which finished in less than 3 minutes.

• MANGO seeds were trained universally, as in PatternHunter (Refs. 31, 29, 9), not tuned for these data sets. MANGO's strategy is more suitable for medium-scale input, such as data sets with more than 20 sequences. Because MANGO does not invest heavily in the refinement stage, it does not perform as well as some of the specialized methods, such as ClustalW, on the popular nucleotide MSA benchmark BRAliBase (Ref. 15), in which each reference alignment is generally short and composed of a small number of sequences.
MANGO is also able to handle extremely large data sets. In addition to these three data sets, we have used MANGO to align over 20,000 reads from some repeat regions for the purpose of sequence assembly, which was one of our original motivations for this research. As a word of caution, higher SP or PS scores do not necessarily imply that an alignment is better biologically. What we hope to demonstrate with these experiments is that our new approach helps to capture the global features of an alignment (reflected by the PS score). We believe that capturing sufficiently many aligned parts with high confidence is probably more important than aligning more pairs in a random region.
We also carried out experiments on the multiple alignment of Alu sequences in the human genome. Due to space limitations, further experimental results will be presented in the full version of this paper.
Table 1. Performance evaluation on the prokaryotes 16S SSU rRNA benchmark data set. (a)

data set: prokaryotes

program          SP      PS      time(s)   mem(M)
ClustalW-fast    0.937   0.929       324        4
ClustalW         0.913   0.932      6380        4
MUSCLE-fast      0.925   0.928       145      216
MUSCLE           0.937   0.928      1427      216
MAFFT-fast       0.931   0.930        42      137
MAFFT            0.949   0.942      6006      149
Dialign-fast     0.924   0.934    123083      409
Dialign          0.921   0.928    202903      401
DIALIGN-T        0.770   0.952     20719      201
T-Coffee           -       -      failed        -
POA-iter         0.881   0.901      1106       48
POA-prog         0.899   0.927      8495       71
Kalign           0.926   0.914        73        6
MANGO8           0.929   0.942       121       49
MANGO8-r         0.944   0.939       660       49
MANGO90          0.943   0.945       299       49
MANGO90-r        0.944   0.943       592       49

(a) The SP and PS scores for each method on the three 16S SSU rRNA benchmark data sets are listed, together with CPU time in seconds and memory usage in megabytes. In the original, the top three SP and PS scores are marked in bold.
Table 2. Performance evaluation on the mitochondrial 16S SSU rRNA benchmark data set.

data set: mitochondrial

program          SP      PS      time(s)   mem(M)
ClustalW-fast    0.442   0.408        63        3
ClustalW         0.478   0.469       242        3
MUSCLE-fast      0.357   0.364        30       74
MUSCLE           0.442   0.421       333       74
MAFFT-fast       0.591   0.558        33      147
MAFFT            0.734   0.691       829      129
Dialign-fast     0.538   0.777      1070       34
Dialign          0.588   0.763      3024       37
DIALIGN-T        0.209   0.000      1202       26
T-Coffee         0.594   0.740    135063     1000
POA-iter         0.316   0.328       234       30
POA-prog         0.485   0.494       581       10
Kalign           0.375   0.330        12        4
MANGO8           0.530   0.859         4       10
MANGO8-r         0.595   0.739       589       89
MANGO90          0.590   0.878        37       10
MANGO90-r        0.596   0.836       155       42

The inconsistency between the SP and PS scores of DIALIGN-T is due to the calculation process of bali_score, which removes columns containing mostly gaps.
Table 3. Performance evaluation on the eukaryotes 16S SSU rRNA benchmark data set.

data set: eukaryotes

program          SP      PS      time(s)   mem(M)
ClustalW-fast    0.874   0.844       348        5
ClustalW         0.844   0.866      4013        5
MUSCLE-fast      0.588   0.675       148      186
MUSCLE           0.863   0.836      2638      186
MAFFT-fast       0.887   0.874        81      192
MAFFT            0.905   0.880      3428      147
Dialign-fast     0.821   0.891     28604      205
Dialign          0.840   0.894     53791      205
DIALIGN-T        0.761   0.936     15527      112
T-Coffee           -       -      failed        -
POA-iter         0.753   0.796      1011       69
POA-prog         0.796   0.822      5238       76
Kalign           0.869   0.837        71        6
MANGO8           0.861   0.904        75       28
MANGO8-r         0.896   0.880      2281       72
MANGO90          0.887   0.911       305       29
MANGO90-r        0.890   0.896       661       41
4. Discussion

In this paper, we have presented a new approach to multiple sequence alignment, orthogonal to existing approaches. The necessary algorithms were developed for managing hits efficiently, getting rid of false hits, and combining the remaining hits. To demonstrate our methodology, we have implemented MANGO for the multiple alignment of nucleotide sequences. Our new approach builds the alignment "region by region", unlike progressive alignment methods, which build the alignment sequence by sequence. In this way, we align sequences simultaneously, avoiding the inherent "once a gap, always a gap" phenomenon of progressive approaches. Additionally, the new paradigm has two further advantages. First, prior biological knowledge can easily be added to the skeletal alignment as user-defined constraints/bindings. Second, input sequences from diverging families can cause problems for many global multiple sequence alignment programs; in such cases, local multiple sequence alignment programs (e.g., Ref. 27) are preferable. This problem is handled naturally by our method: since we build the skeletal alignment and arrange hits step by step on the basis of similar regions, sequences with different conserved regions do not interfere with one another as much as in progressive alignment approaches. Hits produced by spaced seeds are the cornerstone of our method, and the alignment accuracy rests on the correctness of hit binding. Although our voting strategy eliminates most of the false hits, choosing a better set of multiple spaced seeds with higher sensitivity and specificity could further enhance accuracy and reduce running time. So far, the refinement stage does not contribute much to the final result, but we believe that changing the scoring scheme in this stage will improve the result further. Although we have focused on nucleotide sequences in this paper for the sake of a clean methodological comparison, our method can be extended to protein sequences by designing multiple seeds for protein sequences, redefining seed hits on protein sequences in a way similar to protein pairwise alignment (Ref. 23), and then fusing secondary structure and profile information into hit finding. This project is underway.

Acknowledgement
We thank Bin Ma for providing the neighbor seeds from Ref. 9, and Dan Brown, Dongbo Bu and Yu Lin for discussions on multiple sequence alignment.

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology 215, 403-410.
2. Amarendran, R.S., Jan, W.M., Michael, K., Burkhard, M. (2005) DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 6, 66.
3. Brejová, B., Brown, D., Vinar, T. (2005) Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences 70, 364-380.
4. Brown, D.G., Hudek, A.K. (2004) New algorithms for multiple DNA sequence alignment. Algorithms in Bioinformatics, 4th International Workshop (WABI), 314-325.
5. Cannone, J.J., Subramanian, S., Schnare, M.N., Collet, J.R., D'Souza, L.M., Du, Y., Feng, B., Lin, N., Madabusi, L.V., Muller, K.M., Pande, N., Shang, Z., Yu, N., Gutell, R.R. (2002) The Comparative RNA Web (CRW) Site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BioMed Central Bioinformatics 3, 2.
6. Carrillo, H. & Lipman, D.J. (1988) The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073-1082.
7. Choi, K.P. & Zhang, L.X. (2004) Sensitivity analysis and efficient method for identifying optimal spaced seeds. Journal of Computer and System Sciences 68, 22-40.
8. Csürös, M. & Ma, B. (2005) Rapid homology search with two-stage extension and daughter seeds. COCOON 2005, 104-114.
9. Csürös, M. & Ma, B. (2006) Rapid homology search with neighbor seeds. Algorithmica, to appear.
10. Do, C.B., Mahabhashyam, M.S.P., Brudno, M., Batzoglou, S. (2005) ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 15, 330-340.
11. Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792-1797.
12. Edgar, R.C. & Sjolander, K. (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 20, 1301-1308.
13. Edgar, R.C. & Batzoglou, S. (2006) Multiple sequence alignment. Current Opinion in Structural Biology 16, 368-373.
14. Feng, D.F. & Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351-360.
15. Gardner, P.P., Wilm, A., Washietl, S. (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res. 33, 2433-2439.
16. Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264, 823-838.
17. Grasso, C. & Lee, C. (2004) Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20, 1546-1556.
18. Katoh, K., Misawa, K., Kuma, K., Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30, 3059-3066.
19. Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33, 511-518.
20. Kececioglu, J. & Zhang, W. (1998) Aligning alignments. Proceedings of the 9th Annual Symposium on Combinatorial Pattern Matching, 189-208.
21. Kececioglu, J. & Starrett, D. (2004) Aligning alignments exactly. Proceedings of the 8th ACM Conference on Research in Computational Molecular Biology (RECOMB), 85-96.
22. Keich, U., Li, M., Ma, B., Tromp, J. (2004) On spaced seeds for similarity search. Discrete Applied Mathematics 138, 253-263.
23. Kisman, D., Li, M., Ma, B., Wang, L. (2005) tPatternHunter: gapped, fast and sensitive translated homology search. Bioinformatics 21(4), 542-544.
24. Lassmann, T. & Sonnhammer, E.L.L. (2005) Kalign - an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6, 298.
25. Lee, C., Grasso, C., Sharlow, M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18(3), 452-464.
26. Lipman, D.J., Altschul, S.F. & Kececioglu, J.D. (1989) A tool for multiple sequence alignment. Proc. Natl Acad. Sci. USA 86, 4412-4415.
27. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. & Wootton, J.C. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208-214.
28. Li, M., Ma, B., Wang, L. (2000) Near optimal alignment within a band in polynomial time. Proc. 32nd ACM Symp. Theory of Computing (STOC'00), Portland, Oregon, 425-434.
29. Li, M., Ma, B., Kisman, D., Tromp, J. (2004) PatternHunter II: highly sensitive and fast homology search. Journal of Bioinformatics and Computational Biology 2, 411-439.
30. Li, M., Ma, B., Zhang, L. (2006) Superiority and complexity of the spaced seeds. Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2006), 444-453.
31. Ma, B., Tromp, J., Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440-445.
32. Morgenstern, B. (1999) Dialign2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211-218.
33. Notredame, C. & Higgins, D.G. (1996) SAGA: sequence alignment by genetic algorithm. Nucl. Acids Res. 24, 1515-1524.
34. Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205-217.
35. Notredame, C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3, 131-144.
36. Roditty, L. & Zwick, U. (2002) Improved dynamic reachability algorithms for directed graphs. Proceedings of FOCS'02, 679-689.
37. Roditty, L. & Zwick, U. (2004) A fully dynamic reachability algorithm for directed graphs with an almost linear update time. Proc. of 36th STOC, 184-191.
38. Saitou, N. & Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406-425.
39. Stoye, J., Moulton, V. & Dress, A.W. (1997) DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13, 625-626.
40. Sun, Y. & Buhler, J. (2004) Designing multiple simultaneous seeds for DNA similarity search. Proc. 8th Annual International Conference on Computational Molecular Biology (RECOMB), 76-84.
41. Thompson, J., Higgins, D. & Gibson, T. (1994) ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl. Acids Res. 22, 4673-4690.
42. Thompson, J., Plewniak, F. & Poch, O. (1999) BaliBase: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics 15, 87-88.
43. Thompson, J.D., Plewniak, F. & Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucl. Acids Res. 27, 2682-2690.
44. Wang, L. & Jiang, T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337-348.
LEARNING POSITION WEIGHT MATRICES FROM SEQUENCE AND EXPRESSION DATA
Xin Chen* and Lingqiong Guo
School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore
*Email: [email protected], [email protected]

Zhaocheng Fan
Department of Computer Science and Technology, Tsinghua University, Beijing, China
Email: [email protected]

Tao Jiang
Department of Computer Science and Engineering, University of California at Riverside, USA
Email: [email protected]

Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and discovering the binding targets of TFs. Given a set of binding sites bound by a TF, the learning problem can be formulated as a straightforward maximum likelihood problem, namely, finding a PWM such that the likelihood of the observed binding sites is maximized; it is usually solved by counting the base frequencies at each position of the aligned binding sequences. In this paper, we study the question of accurately learning a PWM from both binding site sequences and gene expression (or ChIP-chip) data. We revise the above maximum likelihood framework to take into account the given gene expression or ChIP-chip data. More specifically, we attempt to find a PWM such that the likelihood of simultaneously observing both the binding sequences and the associated gene expression (or ChIP-chip) values is maximized, using the sequence weighting scheme introduced in our recent work. We have incorporated this new approach for estimating PWMs into the popular motif finding program AlignACE. The modified program, called W-AlignACE, is compared with three other programs (AlignACE, MDscan, and MotifRegressor) on a variety of datasets, including simulated data, publicly available mRNA expression data, and ChIP-chip data. These large-scale tests demonstrate that W-AlignACE is an effective tool for discovering TF binding sites from gene expression or ChIP-chip data and, in particular, has the ability to find very weak motifs.
1. INTRODUCTION

The discovery of regulatory motifs in DNA sequences is very important in systems biology, as it is the first step towards understanding the mechanisms that regulate the expression of genes. It is well known that direct experimental determination of transcription factor (TF) binding motifs is not practical or efficient in many biological systems. However, recent advances in high-throughput biotechnology, such as cDNA microarrays and chromatin immunoprecipitation (ChIP), offer a chance to discover de novo binding motifs at very low cost. Taking advantage of these new technologies,
*Corresponding author.
at least three computational strategies for motif discovery have been proposed in the literature; we summarize them briefly in Figure 1. Although tremendous efforts have been made in the past decade, motif finding still remains a great challenge. Regulatory motifs (or TF binding sites) are often modeled by position weight matrices (also known as position-specific scoring matrices), probabilistic models that characterize the DNA binding preferences of a TF. Therefore, learning an accurate position weight matrix plays a key role not only in modeling a TF but also in distinguishing its true binding sites from spurious sites. This is particularly valuable for those motif discovery algorithms
Fig. 1. Three different computational strategies for motif discovery. (1) The primary strategy involves two separate stages. In the first stage, we find clusters of genes sharing similar expression patterns. A general-purpose clustering algorithm can be employed here, for example, K-means or self-organizing maps (SOM). Genes in each cluster are potentially co-regulated by a common TF. In the second stage, we search for short sequence patterns enriched in the promoter regions of the genes in each individual cluster. Essentially, finding enriched patterns is the problem of multiple local sequence alignment. Widely used algorithms include CONSENSUS (Ref. 15), MEME (Ref. 1), and AlignACE (Ref. 13). Note that the presence or absence of a promoter motif has no influence on which cluster a gene is assigned to under this strategy. Furthermore, clustering is itself a well-known difficult problem. When noise is introduced into a cluster (through spurious correlation or imperfect clustering algorithms), the desired motif pattern may not be enriched enough to be detected in the second stage. (2) The second strategy attempts to integrate the two separate processes of the primary strategy. Its basic idea is to find clusters of genes that have both a coherent expression pattern and an enriched motif pattern (Ref. 12). Arguably, the presence of motif patterns can play an important role in determining the overall clustering of expression patterns; the integrated strategy may therefore significantly increase the chance of finding the true motif pattern. KIMONO is such a program, in which gene clustering and pattern detection are integrated in a way that allows the output of pattern detection to feed into the clustering algorithm, and vice versa, at each iteration of the two processes. However, KIMONO is so computationally intensive that it has yet to be tested on real biological data; its advantage on real biological data is therefore still unknown (Ref. 12). (3) The third strategy bypasses the gene clustering required by the previous two strategies and explores a way of fitting motifs to expression directly (Ref. 3). It aims to find motif patterns that have a strong correlation with gene expression via significance testing, since biologically meaningful motifs should be those most capable of explaining the variation of expression data over a large number of genes. From this point of view, it is a more promising strategy for motif discovery than the previous two. A typical implementation of this strategy first enumerates candidate words and then uses regression to assess how well a word fits the expression data, as exemplified in REDUCE and GEMP. Note that these algorithms are usually computationally intensive due to the exhaustive word enumeration.
which rely heavily on position weight matrices, for instance, MEME and the Gibbs sampler (Refs. 2, 16). A position weight matrix (PWM) is generally learned from a collection of aligned DNA binding sites that are likely to be bound by a common TF. Theoretically, this can be formulated as a maximum likelihood problem: finding a PWM such that the likelihood of the observed set of binding sites is maximized. To solve the problem, one may assume that
the binding sites are independent random observations from a product multinomial distribution, from which it follows that each entry of the PWM is proportional to the observed count of the corresponding nucleotide at the corresponding position. This is precisely the method commonly used to compute a PWM from a collection of binding sites. Learning from DNA binding sequences alone, however, might not be sufficient to find a PWM that accurately models a TF. An improvement could be made by taking the evolutionary history into account, as shown in PhyME (Ref. 27) and PhyloGibbs (Ref. 28). More recently, with the advent of DNA microarray and ChIP technologies, gene expression (or ChIP-chip) data has proven to be particularly valuable for motif discovery, as it represents an observable effect resulting directly from the binding of TFs. As illustrated in Figure 1, regression-based methods find motifs by correlating putative binding sites with expression data (Refs. 3, 5, 18, 14). However, none of them take advantage of expression/ChIP data explicitly to estimate an accurate PWM. In our recent study (Ref. 6), a sequence weighting scheme was proposed to estimate a PWM by explicitly taking gene expression variations (or binding ratios) into account. We then incorporated it into the basic Gibbs sampling algorithm for motif discovery (Ref. 16), but with the limitation that it could only run in the site sampling mode (i.e., assuming exactly one binding site per sequence). Our preliminary experiments showed that sequence weighting was a quite effective approach to the estimation of PWMs, since it helped find weak motifs in two datasets (for the TFs GAL4 and STE12) that were missed by the original (basic) Gibbs sampler. However, our recent tests on 40 ChIP-chip datasets from Ref. 14 indicate that the approach still has large room for improvement, since the sequence-weighted Gibbs sampler misses many of the motifs found by AlignACE, a more modern Gibbs sampling algorithm (Refs. 13, 25). In this paper, we continue the development of the sequence weighting approach and present several further improved motif discovery results. First, we extend the maximum likelihood problem naturally to find a PWM such that the likelihood of observing the combination of binding sites and expression data is maximized. This extension provides a theoretical foundation for the sequence weighting scheme, which was missing in Ref. 6. Since binding sites inducing dramatic fold changes in expression (or showing strong binding ratios in ChIP experiments) are more likely to represent the true motif (Ref. 17), the sequence weighting scheme offers an approximate yet reasonably good solution to the new maximum likelihood problem at very low computational cost. Second, we incorporate the sequence weighting scheme into the modern Gibbs sampling program AlignACE (Refs. 13, 25). The modified program is called W-AlignACE.
Different from our previous implementation of sequence weighting in Ref. 6, W-AlignACE is able to run in the motif sampling mode; in other words, it allows zero or multiple binding sites to occur in a promoter sequence. Third, we conduct large-scale tests on two high-throughput datasets, comprising gene expression and ChIP-chip data, and compare the results of W-AlignACE with those obtained from AlignACE, MDscan (Ref. 17), and MotifRegressor (Ref. 7). Our results demonstrate that W-AlignACE performed the best in all tests and was able to find very weak motifs, such as those for DIG1 and GAL4, which were missed by the other three programs. The remainder of the paper is organized as follows. We first formulate a maximum likelihood problem for learning PWMs jointly from sequence and expression data in Section 2. Our experiments on simulated data are presented in Section 3.1, and experiments on large real biological datasets, including mRNA expression and ChIP-chip data, are presented in Section 3.2. Finally, some concluding remarks are given in Section 4.
2. LEARNING POSITION WEIGHT MATRICES
We consider how to estimate a PWM from binding sequences alone and from both binding sequences and expression/ChIP-chip data, separately.

2.1. Learning PWMs from sequences
As mentioned before, a PWM $\Theta$ is often used to characterize the nucleotide frequencies at each position of a binding site, where $\Theta = (\theta_1, \ldots, \theta_J)$ and $\theta_j = (\theta_{A,j}, \theta_{C,j}, \theta_{G,j}, \theta_{T,j})^T$ represents the probabilities of observing the four nucleotides A, C, G, and T at the $j$-th position of a binding site, such that $\theta_{A,j} + \theta_{C,j} + \theta_{G,j} + \theta_{T,j} = 1$ for each $j$, $1 \le j \le J$. In general, $\Theta$ is assumed to follow a product Dirichlet distribution (Refs. 19, 20). Hence, the prior distribution on $\Theta$ is
$$\pi(\Theta) = \pi_1(\theta_1) \cdots \pi_J(\theta_J),$$
where each $\pi_j(\theta_j)$ is a Dirichlet distribution $\mathrm{Dir}(1, 1, 1, 1)$. A PWM can be estimated from a collection of DNA sequences $R = (R_1, \ldots, R_n)$ that correspond to aligned binding sites of a TF, where $R_i = (r_{i1} r_{i2} \cdots r_{iJ})$ represents the $i$-th binding site, for each $i = 1, \ldots, n$. These binding sites are assumed (Refs. 19, 20)
to be independent random observations from a product multinomial distribution with parameter $\Theta$; that is, the $r_{ij}$'s are mutually independent and take, for example, the nucleotide A with probability $\theta_{A,j}$. It thus follows that the posterior distribution of $\Theta$ is also a product of independent Dirichlet distributions,
$$\pi(\Theta \mid R) = \prod_{j=1}^{J} \mathrm{Dir}(c_{A,j}+1,\ c_{C,j}+1,\ c_{G,j}+1,\ c_{T,j}+1),$$
where $c_{A,j}$, for example, is the count of nucleotide A among all the $j$-th bases of the binding sites in $R$. Further, by maximizing the likelihood of $\Theta$, i.e., $\pi(R \mid \Theta)$, we have
$$\theta_{A,j} \propto c_{A,j}+1, \quad \theta_{C,j} \propto c_{C,j}+1, \quad \theta_{G,j} \propto c_{G,j}+1, \quad \theta_{T,j} \propto c_{T,j}+1.$$
That is, the probability of observing nucleotide A (C, G, or T) at position $j$ of a binding site is proportional to the count of nucleotide A (C, G, or T) among all the $j$-th positions of the binding sites in $R$. (a) Indeed, this is exactly the method commonly used to estimate a PWM $\Theta$ for a TF, given a collection of its binding sites. Consequently, the conditional predictive distribution of a DNA sequence $B = (b_1 \cdots b_J)$ is
$$\pi(B \mid \Theta) = \prod_{j=1}^{J} \theta_{b_j, j}.$$
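The estimate just derived reduces to counting with a pseudocount of one. The sketch below implements it directly; the representation of a PWM as a list of per-position dictionaries is our own choice.

```python
def estimate_pwm(sites):
    """Standard (unweighted) PWM estimate: theta_{b,j} proportional to
    count_{b,j} + 1, where the +1 comes from the Dir(1,1,1,1) prior.
    sites: list of equal-length binding-site strings over ACGT."""
    J = len(sites[0])
    pwm = []
    for j in range(J):
        counts = {b: 1.0 for b in "ACGT"}     # pseudocount from the prior
        for site in sites:
            counts[site[j]] += 1.0
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

def site_probability(pwm, b_seq):
    """Conditional predictive probability pi(B | Theta)."""
    p = 1.0
    for j, b in enumerate(b_seq):
        p *= pwm[j][b]
    return p
```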
2.2. Learning PWMs from both sequences and expression
We propose a new approach to learning PWMs from the combination of sequence and expression data; the method can easily be extended to ChIP-chip data. Let $E = (E_1, \ldots, E_n)$ denote the fold changes in mRNA expression of the downstream genes, where $E_i$ is associated with the binding site $R_i$. (b) We want to find a PWM $\Theta$ such that its likelihood $\pi(R, E \mid \Theta)$ is maximized; that is, $\Theta$ can best "explain" both the sequence and expression data simultaneously. The hope is that this newly formulated problem will result in a PWM with significantly improved discriminative power. Finding the maximum likelihood estimate under $\pi(\Theta \mid R, E)$, however, is expected to be very hard, as it is conditioned on two disparate types of data whose exact quantitative correlation is not yet completely clear. Note that expression fold changes are assumed to be induced as a result of the binding between DNA sequences and a TF. A linear correlation between sequence and expression, i.e., assuming additivity of the binding sites' contributions to expression, has been used in several existing methods for motif site prediction (Refs. 3, 7), most of which employ the third strategy discussed earlier in Figure 1. For the sake of a simple argument, the expression (log fold change) is assumed to be proportional to the conditional predictive distribution of its corresponding sequence, that is,
$$\log E_i \propto \pi(R_i \mid \Theta), \quad \text{for each } i,\ 1 \le i \le n,$$
or, for short, $\log E \propto \pi(R \mid \Theta)$. Therefore, we can reduce the maximum likelihood problem for $\pi(R, E \mid \Theta)$ to the problem of finding a PWM $\Theta$ such that sequence $R$ fits expression $\log E$ best under linear correlation. A natural way to solve such a fitting problem is via an EM-like iteration, i.e., starting with an initial PWM and then refining it iteratively (Ref. 14). However, such an iterative process is generally very time-consuming. Moreover, it is clearly infeasible to incorporate such a process into a Gibbs sampling algorithm, which is an iterative algorithm by itself (Ref. 19). In order to approximate $\Theta$ with an effective algorithm, we assume that the posterior distribution $\pi(\Theta \mid R, E)$ is a product of independent Dirichlet distributions, as $\pi(\Theta \mid R)$ is, but with different parameters; that is,
$$\pi(\Theta \mid R, E) = \prod_{j=1}^{J} \mathrm{Dir}(\tilde{c}_{A,j}+1,\ \tilde{c}_{C,j}+1,\ \tilde{c}_{G,j}+1,\ \tilde{c}_{T,j}+1),$$
where $\tilde{c}_{A,j}$, for example, is the count of nucleotide A weighted by $\log E$ among all the $j$-th bases of the binding sites in $R$. In other words,
$$\tilde{c}_{A,j} = \sum_{i=1}^{n} \log E_i \cdot \delta(r_{ij}, A),$$
where
$$\delta(r_{ij}, A) = \begin{cases} 1, & \text{if } r_{ij} = A \\ 0, & \text{otherwise.} \end{cases}$$
(a) The additive term of 1 in the above formulas is due to the prior distribution $\pi_j(\theta_j)$.
(b) Note that multiple binding sites may share the same downstream gene and thus its associated log fold change value.
We can see that the above setting of the parameters is partially justified by the biological observation that binding sites inducing big fold changes in expression are more likely to represent a true motif (Ref. 17). It follows that the desired PWM will be
$$\theta_{A,j} \propto \tilde{c}_{A,j}+1, \quad \theta_{C,j} \propto \tilde{c}_{C,j}+1, \quad \theta_{G,j} \propto \tilde{c}_{G,j}+1, \quad \theta_{T,j} \propto \tilde{c}_{T,j}+1.$$
Similarly, the conditional predictive distribution of a DNA sequence $B = (b_1 \cdots b_J)$ is
$$\pi(B \mid \Theta, E) = \prod_{j=1}^{J} \theta_{b_j, j}.$$
Consequently, the new approach to the learning of PWMs is indeed realized via the sequence weighting scheme recently proposed in Ref. 6. Note that $\pi(B \mid \Theta, E)$ coincides with $\pi(B \mid \Theta)$ if every binding site induces the same fold change in gene expression. Figure 2 illustrates a simple example that clearly demonstrates the advantage of our new approach of learning PWMs from both sequence and expression data. Gibbs sampling is known to be a very effective strategy for motif discovery. Its basic idea is to construct a Markov chain of a random variable $X$ with $\pi(X)$ as its equilibrium distribution; for details on Gibbs sampling algorithms, the reader is referred to Refs. 19 and 20. The new predictive distribution $\pi(B \mid \Theta, E)$ above can be used, in place of $\pi(B \mid \Theta)$, to implement a collapsed Gibbs sampling algorithm (Refs. 19, 20). In particular, we have incorporated this method of computing PWMs into a powerful Gibbs sampling program, AlignACE (for Aligns Nucleic Acid Conserved Elements, Refs. 13, 25). The modified program is called W-AlignACE and is available at http://www.ntu.edu.sg/home/ChenXin/Gibbs.
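The following is a minimal sketch of the weighted estimate, mirroring the weighted counts defined above: each site's contribution to the position counts is scaled by the log fold change of its downstream gene. It is a sketch under our own naming and data-layout choices, not the W-AlignACE implementation.

```python
import math

def estimate_weighted_pwm(sites, fold_changes):
    """Sequence-weighted PWM estimate: theta_{b,j} proportional to the
    log-fold-change-weighted count of b at position j, plus 1 (prior).
    sites: equal-length strings over ACGT; fold_changes: E_i > 1 per site."""
    J = len(sites[0])
    pwm = []
    for j in range(J):
        wc = {b: 1.0 for b in "ACGT"}            # +1 from the Dirichlet prior
        for site, E in zip(sites, fold_changes):
            wc[site[j]] += math.log(E)           # weight by log fold change
        total = sum(wc.values())
        pwm.append({b: c / total for b, c in wc.items()})
    return pwm
```

If all fold changes are equal, the weighted counts are just a rescaling of the plain counts, so the weighted PWM essentially recovers the unweighted one, consistent with the remark above.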
2.3. Quality measures of putative motifs

Putative motifs are generally scored and ranked before they are reported, because only the top few motifs undergo further investigation in practice. Therefore, a metric is needed to measure the goodness of putative motifs. Indeed, the chosen metric plays an important role in the success of motif discovery: an inappropriate metric might lower the rank of a bona fide motif so that it is unlikely to be discovered.
Information content is often used to measure the degree of nucleotide conservation in a motif given a probabilistic model $\Theta$. For position $j$ it is defined as (Ref. 15)
$$IC_j = \sum_{b \in \{A, C, G, T\}} \theta_{b,j} \log_2 \frac{\theta_{b,j}}{\theta_{b,0}},$$
where $\theta_0 = (\theta_{A,0}, \theta_{C,0}, \theta_{G,0}, \theta_{T,0})^T$ gives the nucleotide frequencies in the background sequence, which sum to one. The logarithm is taken with base two to express the information content in bits. If each residue is equally probable in the background sequence, then the information content can be as large as 2, representing the most conserved motif. Note, however, that a highly conserved motif may not be statistically significant relative to the expectation for its random occurrences in the promoter sequences under consideration. Figure 2 shows an example where sequence weighting might improve the information content of a PWM, although this is not necessarily always the case. The MAP score is the metric for motif strength used by AlignACE to judge the different motifs sampled during the course of the algorithm (Ref. 13). It is calculated for a motif by taking into account factors such as the number of aligned binding sites, the number of promoter sequences, the degree of nucleotide conservation, and the distribution of information-rich positions. It is therefore believed to be a more sensitive measure for assessing different motifs, in particular those having different widths and/or different numbers of aligned binding sites. Another alternative is to measure the statistical significance of the correlation between putative motifs and gene expression. For example, the p-values from multiple linear regression are employed in REDUCE, and also in MotifRegressor, to rank putative motifs. Such a metric takes into account the variation of the gene expression data, and is thus more plausible from the biological perspective. Note, however, that the presence of a few spurious binding sites may reduce the significance value dramatically; it is therefore not a robust metric.
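A direct implementation of the per-column information content is a few lines; the dictionary representation of a PWM column matches the sketches above.

```python
import math

def information_content(column, background=None):
    """Information content (in bits) of one PWM column against background
    frequencies; defaults to the uniform background, for which 2 bits is
    the maximum (a fully conserved position)."""
    background = background or {b: 0.25 for b in "ACGT"}
    return sum(p * math.log2(p / background[b])
               for b, p in column.items() if p > 0)

# Example: a fully conserved column carries 2 bits under a uniform background.
assert abs(information_content({'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0}) - 2.0) < 1e-9
```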
2.4. Performance evaluation of putative motifs

To show the predictive ability of a motif discovery approach, we need an accurate yet feasible method to evaluate putative motifs. The most accurate method
Fig. 2. Estimating PWMs. (a) A collection of four aligned DNA sequences bound by a TF, with the logarithmic fold changes in expression of their corresponding downstream genes listed in the first column. (b) The PWM learned from the sequences alone; its information content (see Section 2.3 for the definition) is 1.44 bits. (c) The PWM learned from both sequences and expression; its information content improves to 1.53 bits, indicating the higher binding specificity of the motif. For instance, the TF is shown to bind nucleotide G more preferentially than C at the fourth position, although both have the same observed counts in the sequences. Indeed, this can be justified by the fact that nucleotide G occurs at the fourth position of the sequences that induce large fold changes in expression.
is clearly to directly verify whether the putative binding sites are true or not. This requires that the bona fide binding sites be already known before the evaluation, which, however, is not the case for most biological datasets; the use of this method is therefore limited to simulation experiments (Ref. 24). The second method is to compare the PWM of a putative motif with that of the true one. The true PWMs used for evaluation should correctly reflect the binding preferences of TFs. However, not many true motif PWMs have been found and made available in public databases. For instance, of the 40 motifs that we study below, only 9 have PWMs in the TRANSFAC database (Ref. 22). Furthermore, these PWMs might not be considered true, for at least two reasons. First, they are derived from as few as eight binding sequences. Second, the computational method for learning a true PWM from binding sequences is questionable (see Figure 2). These reasons discourage us from using PWMs as a benchmark for reliable performance evaluation, particularly at a large scale. The third choice is to consider the consensus pattern of a putative motif. A consensus pattern is generally described using IUPAC ambiguity codes, and is hence a rougher (but more robust) representation of TF binding preference than its corresponding PWM. In the IUPAC code of a motif, the letters A, C, G, and T indicate the most conserved region of a consensus pattern, which we refer to as the core of the consensus pattern. Note that the core is the most informative part of a consensus. To compare motifs, a putative motif is usually considered true if its consensus core matches that of the true motif (i.e., the weak regions of the consensus pattern are ignored). Such a comparison is sensitive neither to spurious binding sites nor to the scarcity of binding sites, as the previous method using PWMs is. Based on these observations, we compare consensus cores in the performance evaluation of our predicted motifs in this study; a sketch of this matching is given below.
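The sketch below is our reading of the evaluation just described, not the authors' script: a predicted motif is considered correct if the annotated core occurs somewhere inside it, with ambiguous IUPAC positions in the annotation accepting any base they denote. The IUPAC table is standard.

```python
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'K': 'GT', 'M': 'AC', 'S': 'CG', 'W': 'AT',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT', 'n': 'ACGT',
}

def core_matches(predicted, annotated_core):
    """True if annotated_core occurs in predicted (width differences ignored)."""
    w = len(annotated_core)
    for start in range(len(predicted) - w + 1):
        window = predicted[start:start + w]
        if all(p in IUPAC.get(a, "ACGT") for p, a in zip(window, annotated_core)):
            return True
    return False

# Example: the annotated YAP1 consensus GCTKACTAA matches GCTTACTAAT.
assert core_matches("GCTTACTAAT", "GCTKACTAA")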
3. EXPERIMENTAL RESULTS

In this section, we present our test results of W-AlignACE on both simulated and real datasets. Note that the evaluation method proposed in Ref. 30 is not applicable here, because W-AlignACE requires ChIP-chip or expression data in addition to promoter sequences.
3.1. Simulated data

We first perform tests on randomly generated sequence data, with artificially planted motif instances, to gain insight into the algorithm's idealized performance under controlled conditions. Here, we generate more complicated simulated data than those used in many other studies (Refs. 7, 17), in the hope of exploring in depth how a PWM learned from sequence and expression affects the performance of motif finding algorithms. The data generating procedure is summarized as follows; a sketch of the expression-assignment step appears after this list. (1) Manually create a motif consensus sequence consisting of a specified number of nucleotides. In our experiments, we consider three motif widths, J = 6, 8, 10, reflecting different levels of difficulty for motif finding. (2) Randomly generate 100 promoter sequences of 600 bases each. (3) A PWM $\Theta$ of size 4 x 10 is randomly generated according to the motif consensus and a given value of its information content.
(4) Randomly generate 60 motif occurrences (i.e., binding sites) according to the motif probabilistic model given by the PWM $\Theta$. Among the 100 promoter sequences, we do not plant any binding sites in the bottom 50. That is, the 60 binding sites are planted in the top 50 promoter sequences at random positions by replacing segments of the same width. Because the planted positions are randomly selected, some of the top 50 promoter sequences may not contain any binding sites, while others may contain multiple sites; therefore, the total number of promoter sequences without any planted binding sites may exceed 50. (5) The hyperbolic tangent function, a scaled and biased logistic function, has been used in several studies to model the relation between sequence and expression (Refs. 7, 14). Similarly, we use it to estimate the expression values hypothetically induced by the planted binding sites. For a promoter sequence S planted with m binding sites $R = (R_1, \ldots, R_m)$, where $R_i = (r_{i1} r_{i2} \cdots r_{iJ})$ is the $i$-th binding site in S, its expression value is set by a scaled and biased hyperbolic tangent of the total log-odds score of its binding sites, which starts with computing the log-odds between the posterior probability of the binding sites and a background probability of nucleotides. Note that the maximum expression value assigned this way can be close to 4. For sequences with no binding sites planted, the expression values are set randomly in the interval [1, a), simulating a commonly occurring situation in microarray experiments where some genes may not have any binding site of the TF under investigation in their promoter regions, but are nevertheless more or less expressed (possibly due to the binding of other TFs).
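Because the formula itself did not survive in the source, the sketch below is only illustrative: the 1 + 1.5 * (1 + tanh(.)) scaling that keeps bound promoters in (1, 4), and the [1, 2) interval for unbound promoters, are our assumptions; only the overall recipe (sum the log-odds of the planted sites and pass the total through a scaled, biased hyperbolic tangent) follows the text.

```python
import math, random

def simulated_expression(planted_sites, pwm, background=0.25):
    """planted_sites: list of site strings planted in one promoter;
    pwm: list of per-position base-probability dicts (as sketched earlier)."""
    if not planted_sites:
        return random.uniform(1.0, 2.0)   # unbound promoter; endpoint assumed
    log_odds = sum(
        sum(math.log(pwm[j][b] / background) for j, b in enumerate(site))
        for site in planted_sites)
    # Scaled and biased tanh; constants assumed so the maximum is close to 4.
    return 1.0 + 1.5 * (1.0 + math.tanh(log_odds))
```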
Compared with assigning equal expression values, random expression values make it harder for W-AlignACE to find a correct motif, but they apparently have no effect on AlignACE. For each motif width, ten test datasets are generated with varying degrees of conservation, giving a total of 30 datasets. Each dataset has 100 promoter sequences, each of which is assigned an expression value as described above. We run both AlignACE and W-AlignACE on the data, and then compare their predicted motifs with the planted motif. A predicted motif is considered true if it has the same consensus core as the planted motif. The results are summarized in Table 1. We can see that W-AlignACE is able to find more true motifs than AlignACE and that, in most cases, the true motif is ranked first among the reported motifs when sorted by MAP score.
3.2. Real data

Due to the stochastic nature of Gibbs sampling, we run both AlignACE and W-AlignACE five times on each dataset with different random seeds. MDscan (Ref. 17) and MotifRegressor (Ref. 7), in contrast, are run only once per dataset, because they are deterministic algorithms. Predicted motifs are sorted using their respective sorting schemes (e.g., the MAP score for AlignACE), and only the top four are reported, since the remaining motifs (ranked after the fourth) are generally too insignificant to be considered true. In order to evaluate our method, we retrieve the consensus pattern for each motif from the Saccharomyces Genome Database (see http://www.yeastgenome.org/) (c) and compare it with the motifs found by MDscan, MotifRegressor, AlignACE, and W-AlignACE, respectively. In our experiments, no prior knowledge of the true motifs is assumed; therefore, all program parameters are set to their default values. (d) For instance, the default number of columns to align is set to 10. Working with default values is indeed a common practice, especially when the discovery of novel motifs is intended.
(c) Some motif consensi in the Saccharomyces Genome Database were obtained from putative binding sites that have not been verified experimentally. Therefore, caution must be taken when using them as benchmark data.
(d) MotifRegressor requires as many as 17 input parameters, for which we chose a typical setting (i.e., their default values are generally preferred). The specific command line used to run MotifRegressor is "MotifRegressor MRexpression.txt MRsequences.txt yeast.int 1 1 2 1 1 2.0 1.5 5.0 5.0 10 10 50 30 MRoutput.txt". For a detailed explanation, please refer to the documentation of MotifRegressor.
Table 1. Test results on the 30 simulated datasets. For each motif width, we performed the test on ten PWMs with varying information contents ("-" means the planted motif was not found).

Motif width   Information content             AlignACE (rank if found)   W-AlignACE (rank if found)
J = 6         0.65, 0.74, 0.77, 0.81, 0.88    -, -, -, -, -              -, -, 3, -, -
J = 6         0.91, 0.98, 1.01, 1.01, 1.18    -, -, -, -, -              1, -, 1, -, 1
J = 8         0.61, 0.71, 0.72, 0.88, 0.91    -, -, -, -, -              -, -, -, -, 1
J = 8         0.96, 1.02, 1.04, 1.08, 1.17    -, -, -, -, 1              2, -, 1, 1, 1
J = 10        0.63, 0.74, 0.79, 0.82, 0.93    -, -, -, -, 1              -, -, 2, 1, 1
J = 10        0.98, 1.01, 1.03, 1.03, 1.03    1, -, 1, -, 1              1, 1, 1, 1, 1
3.2.1. mRNA expression data

We applied our algorithm to the publicly available yeast dataset from microarray experiments on the environmental stress response (Ref. 11). A sample of the 100 genes most induced by YAP1 overexpression is used here to demonstrate the advantage of the new learning approach in motif discovery. The log fold changes of these genes in mRNA expression range from 1.04 to 3.55. YAP1 is a transcriptional activator required for oxidative stress tolerance, and is known to recognize the DNA sequence TTACTAA (Ref. 10) or, with higher binding specificity, the sequence GCTTACTAA, as annotated (e) in the Saccharomyces Genome Database (http://db.yeastgenome.org/cgi-bin/locus.pl?locus=YAP1). Our experimental results show that AlignACE failed to report any motif containing the consensus pattern TTACTAA of the YAP1 motif among the top four motifs in any run. Instead, W-AlignACE successfully found the known YAP1 motif GCTTACTAAT and ranked it second (MAP score: 126.68). A closer examination of all the putative motifs revealed that AlignACE reported a weak pattern, GATTAGTAAT, ranked 12th (MAP score: 10.09) in one run, and GCTTAGTAAT, ranked 13th (MAP score: 9.41), in another run. Although both contain the reverse complement of TTACTAA, neither exactly matches GCTTACTAA, the YAP1 motif annotated in the Saccharomyces Genome Database. Note that the second weak pattern above differs from the YAP1 motif by only one base, at the sixth position, if we ignore the difference
in motif width. MDscan reported the pattern GATTACTAAT as its top-ranked motif, which differs from the YAP1 motif by one base at the second position. MotifRegressor did not perform better than MDscan; it reported GATTACTAAT as its second motif. These results give a solid example where W-AlignACE is more accurate than AlignACE, MDscan, and MotifRegressor.

Table 2. Test results on the publicly available dataset from the yeast environmental stress response microarray experiment. Note that only W-AlignACE discovered the YAP1 motif consensus in the Saccharomyces Genome Database without any mismatch.

method            motif found   rank
SGD annotation    GCTTACTAA       -
W-AlignACE        GCTTACTAAT      2
AlignACE          GATTAGTAAT     12
AlignACE          GCTTAGTAAT     13
MDscan            GATTACTAAT      1
MotifRegressor    GATTACTAAT      2
3.2.2. ChIP-chip data

We further apply our algorithm to the ChIP-chip data reported in Ref. 21. Recall that a ChIP-chip experiment uses chromatin immunoprecipitation (ChIP), followed by the detection of enriched fragments by DNA microarray hybridization, to determine the genomic binding locations of transcription factors. Although the data are still noisy, they are the best genome-wide data on in vivo TF-DNA binding localization so far (Ref. 14). Forty datasets, each containing the genes targeted by one TF, were obtained using a ChIP-chip p-value of 0.001 as the cutoff in the study of Ref. 14, and are publicly available at http://biogibbs.stanford.edu/~hong2004/MotifBooster/. The sizes of these datasets range from 25 up to 176 genes. For each
(e) The annotated consensus is indeed GCTKACTAA using IUPAC ambiguity codes, where K represents the base G or T.
gene, its promoter sequence is taken up to 800 bp upstream, but without overlapping the previous gene. As mentioned earlier, we use the consensus cores annotated in the Saccharomyces Genome Database as the benchmark, and compare them with the putative motifs reported by MDscan, MotifRegressor, AlignACE, and W-AlignACE. To evaluate our method, we search for putative motifs with consensus cores matching the annotated ones (while ignoring differences in motif width), and consider them correct. Table 3 summarizes all the true motifs found for the forty TFs under investigation. At first glance, it is already very encouraging to see that W-AlignACE successfully found the correct motifs for three TFs: DIG1, GAL4, and NDD1. This is especially interesting since these three TFs were observed in Ref. 14 to be among the nine TFs (the other six being GAT3, GCR2, IME4, IXR1, PHO4, and ROX1) whose correct motifs are hard to find. (f) Compared with the other three programs (MDscan, MotifRegressor, and AlignACE), W-AlignACE in general performed strongly. It found correct motif patterns for all the datasets that AlignACE was able to solve, and also for six additional datasets (ACE2, DIG1, GAL4, HAP4, STE12, SWI5). We further notice that, in most cases, W-AlignACE reported a PWM with a much higher MAP score than AlignACE when a correct motif was found by both; when a spurious motif was reported, however, the MAP scores estimated by the two programs are comparable. For instance, both AlignACE and W-AlignACE found the correct consensus pattern nCGTnnnnAGTGAT for ABF1; its MAP score is 351.866 as estimated by AlignACE, much lower than the 436.877 estimated by W-AlignACE. In contrast, both programs also reported an obviously spurious motif in the top four, GAAAAAAAAA; its MAP scores are 176.129 and 165.632 as given by AlignACE and W-AlignACE, respectively. All of the above shows that the new PWM learning approach via sequence weighting can increase the signal-to-noise ratio of a correct motif, but not of a spurious one. Therefore, it may have a profound impact on the success of computational motif discovery, because it not only
increases the chance of find correct motifs, but also enhances our confidence about the predicted motifs. This is further demonstrated by the following case studies. Note that, the full test results are available at http://www.ntu.edu.sg/home/ChenXin/Gibbs, and so is the program W-AlignACE. ACE2 is a TF that activates the transcription of genes expressed in the G1 phase of the cell cycle '. Its ChIP-chip data in our study consists of 46 target genes. W-AlignACE successfully discovered the correct ACE2 motif, and ranked it the first with the highest MAP score 127.571. AlignACE did report the ACE2 motif in one of its runs but with a very low ranking of 9 (MAP score: 22.2304). In contrast, GAAAAAAAAA is the top motif found by AlignACE, having the MAP score as high as 104.081. Figure 3 depicts the distributions of sonie motifs in the promoter sequences, from which we can see that functional binding sites are more likely to occur in the promoter sequences having higher ChIPchip scores. This observation is precisely the basis of W-AlignACE and why it performs better than AlignACE. Also note that, both MDscan and MotifRegressor failed to report any motifs resembling the correct ACE2 motif. GAL4 is among the most characterized transcriptional activators, which activates genes necessary for galactose metabolism 26. In our previous study 6 , we incorporated the sequence weighting scheme into the basic Gibbs sampling algorithm from 16, which was only allowed to run in the site sampling mode ( i e . , assuming that exactly one binding motif occurs in each input promoter sequence), and tested it successfully on a small ChIP-chip data from the genome-wide location analysis 2 6 , which contains only 10 target genes. The current dataset from l 4 contains 25 target genes. When run on this larger dataset, our previous algorithm failed to find any motifs resembling the correct GAL4 motif (mostly likely because it was limited to the site sampling mode and could not properly handle multiple/zero occurrences of the correct motif). Indeed, GAL4 is a well-known motif that is too weak to be easily detected 14, partly because there is a 11base gap ( L e . , degenerate region) in the middle of
f Further notice that four of the six TFs mentioned above, GATS, GCR2, IME4, and IXR1, do not have motif consensi annotated in the Saccharomyces Genome Database. Therefore, their motifs found by W-AlignACE are not evaluated here, and could still be true motifs.
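Returning to the PWM learning approach discussed above: the following is a minimal sketch, not the authors' implementation, of how a PWM changes when each aligned site is counted in proportion to a weight derived from its promoter's ChIP-chip score. All function and variable names are illustrative.

```python
import numpy as np

BASES = "ACGT"

def weighted_pwm(sites, weights, pseudocount=0.5):
    """Estimate a PWM from aligned binding-site sequences, letting each
    site contribute in proportion to its ChIP-chip-derived weight."""
    width = len(sites[0])
    counts = np.full((width, len(BASES)), pseudocount)  # smoothed counts
    for site, w in zip(sites, weights):
        for pos, base in enumerate(site):
            counts[pos, BASES.index(base)] += w         # weighted count
    return counts / counts.sum(axis=1, keepdims=True)   # row-normalize

# Toy run: sites from high-scoring promoters dominate the PWM, so a true
# motif stands out over a spurious poly-A pattern from low scorers.
pwm = weighted_pwm(["ACGTGA", "ACGTGT", "AAAAAA"], [5.0, 4.0, 0.5])
print(pwm.round(2))
```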
Table 3. Experimental results on 40 ChIP-chip datasets. The highlighted rows indicate TFs for which W-AlignACE was able to find the correct motifs but AlignACE failed. The TFs with asterisks do not have motif consensus patterns annotated in the Saccharomyces Genome Database.
[Table 3 body: per-TF consensus patterns and MAP scores reported by AlignACE and W-AlignACE; the column-wise data are too garbled in the source to reconstruct. Legible TF names include CAD1, CBF1, CIN5, DAL81, GCN4, IXR1*, MBP1, MCM1, NDD1, NRG1, PDR1, PHD1, PHO4, RAP1, REB1, RLM1, ROX1, SKN7, SMP1, SUM1, SWI4, SWI6, YAP1, YAP5, and YAP6*.]
[Figure 3, partial: panels ACE2 (GAACCAGCAA) and ACE2 (GAAAAAAAAA); y-axis: number of sites.]
Fig. 3. The distributions of ChIP-chip scores and occurrences of the binding sites of three TFs, ACE2, GAL4 and STE12. The top right panel depicts the distribution for a spurious motif ranked first by AlignACE with MAP score 104.081, and the other three panels correspond to three correct motifs, all ranked first by W-AlignACE with MAP scores 127.571, 184.307, and 358.174, respectively. We can see that the correct motifs occur more frequently in promoter sequences with high scores than in those with low scores. This property generally does not hold for spurious motifs, whose occurrences are not expected to have any correlation with ChIP-chip scores or expression values.
its consensus pattern, i.e., CGGnnnnnnnnnnnCCG. Therefore, the new dataset for GAL4 presents a new challenge for computational motif discovery methods. W-AlignACE once again performed remarkably better than AlignACE. It ranked the correct GAL4 motif first with MAP score 184.307. In contrast, AlignACE failed to find the correct GAL4 motif, and neither did MDscan or MotifRegressor. A closer examination of the GAL4 dataset reveals that only 6 of the 25 genes have promoter sequences containing the exact consensus pattern (see Figure 3). Furthermore, these six genes are all among the top if we sort all genes in the dataset by their ChIP-chip scores.g This might explain the failure of AlignACE and the success of W-AlignACE on the GAL4 dataset. MDscan failed perhaps because it was not optimized for finding gapped motifs. STE12 is a DNA-bound protein that directly controls the expression of genes in the response of haploid yeast to mating pheromones 26. The ChIP-chip dataset from 14 consists of 54 pheromone-induced genes in yeast likely to be directly regulated by STE12. This dataset is also much larger than the dataset consisting of 29 genes used in our previous study 6. W-AlignACE once again found the correct motif and ranked it first with MAP score 358.174. In contrast, AlignACE ranked the correct motif only fourteenth, with a much lower MAP score of 49.0173. This is not surprising, because once again most of the occurrences of the correct motif are located in the promoter regions of genes having high ChIP-chip scores, as shown in Figure 3. In conclusion, the sequence weighting scheme that learns PWMs from both sequence and expression data could indeed boost AlignACE's ability to pick correct motifs from sequences with a noisy background. It is interesting to note that MotifRegressor performed much worse than MDscan in this test, although the former uses the latter as a feature extraction tool to find candidate motifs.h This could be due to several factors. First, the cutoff used by MotifRegressor on the significance of linear regression might be too strict. Second, the true motifs are too weak as evaluated by MotifRegressor based on the signifi-
cance of linear regression (e.g., due to the presence of spurious binding sequences). Third, the parameter setting that MotifRegressor applied to MDscan did not work as well as the default one, which we used to test MDscan. Last, the parameters that we set for MotifRegressor might not be optimal either.
4. DISCUSSION AND FUTURE RESEARCH
Learning an accurate PWM from a collection of aligned binding site sequences is a delicate problem that plays an important role in modeling a TF. In this paper, we tackled this problem by proposing a new approach to learning PWMs jointly from sequence and expression data. We believe that this approach could be a very useful enhancement to many of the motif discovery programs that are based on PWMs, such as Gibbs sampling and MEME. Our preliminary experiments on Gibbs sampling support this belief, and demonstrate that W-AlignACE is a very effective tool for biologists to computationally discover TF binding motifs when gene expression or ChIP-chip data are given. The W-AlignACE web service is provided at http://www1.spms.ntu.edu.sg/~chenxin/W-AlignACE. Our future work includes a more delicate/theoretical treatment of multiple motif occurrences, and the treatment of multiple-experiment expression data (which are usually time series data) and cooperative motifs (or cis-regulatory modules).
ACKNOWLEDGMENTS
XC's research is supported by a start-up fund from NTU. TJ's research is supported by NSF grant CCF-0309902, NIH grant LM008991-01, NSFC grant 60528001, National Key Project for Basic Research (973) grant 2002CB512801, and a Changjiang Visiting Professorship at Tsinghua University. References 1. Y. Barash, G. Bejerano, and N. Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. Algorithms in Bioinformatics: Proc. First International Workshop, number 2149 in LNCS, pp. 278-293, 2001.
g Unfortunately, there is no GAL4 binding site upstream of the top gene, which actually presents more of a challenge to W-AlignACE than to AlignACE for discovering the correct motif. h More precisely, the current implementation of MotifRegressor uses MDmodule, instead of MDscan, as a feature extraction tool. MDmodule is a modified version of MDscan.
2. T. Bailey and C. Elkan. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 28-36, 1994. 3. H. Bussemaker, H. Li, and E.D. Siggia. Regulatory element detection using correlation with expression. Nat. Genet., 27, 167-171, 2001. 4. J. Cherry, C. Ball, et al. Genetic and physical maps of Saccharomyces cerevisiae. Nature, 387, 67-73, 1997. 5. D. Chiang, P. Brown, and M. Eisen. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17, S49-S55, 2001. 6. X. Chen and T. Jiang. An improved Gibbs sampling method for motif discovery via sequence weighting. Proc. of Computational Systems Bioinformatics, 239-247, 2006. 7. E. Conlon, X. Liu, J. Lieb, and J. Liu. Integrating regulatory motif discovery and genome-wide expression analysis. PNAS, 100, 3339-3344, 2003. 8. P. Dohrmann, G. Butler, K. Tamai, S. Dorland, J. Greene, D. Thiele, and D. Stillman. Parallel pathways of gene regulation: homologous regulators SWI5 and ACE2 differentially control transcription of HO and chitinase. Genes Dev., 6, 93-104, 1992. 9. J. Dolan, C. Kirkman, and S. Fields. The yeast STE12 protein binds to the DNA sequence mediating pheromone induction. Proc. Natl. Acad. Sci. USA, 86, 5703-5707, 1989. 10. L. Fernandes, C. Rodrigues-Pousada, and K. Struhl. Yap, a novel family of eight bZIP proteins in Saccharomyces cerevisiae with distinct biological functions. Mol. Cell. Biol., 17, 6982-6993, 1997. 11. A. Gasch, P. Spellman, C. Kao, O. Carmel-Harel, M. Eisen, G. Storz, D. Botstein, P. Brown. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241-4257, 2000. 12. I. Holmes and W. Bruno. Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. of ISMB, 202-210, 2000. 13. J. Hughes, P. Estep, S. Tavazoie, and G. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205-1214, 2000. 14. P. Hong, X. Liu, Q. Zhou, X. Lu, J. Liu, and W. Wong. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics, 21, 2636-2643, 2005. 15. G. Hertz and G. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 563-577, 1999.
16. C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, J. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208-214, 1993. 17. X. Liu, D. Brutlag, and J. Liu. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology, 20, 835-839, 2002. 18. H. Leung, F. Chin, S. Yiu, R. Rosenfeld, and W. Tsang. Finding motifs with insufficient number of strong binding sites. Journal of Computational Biology, 12, 686-701, 2005. 19. J. Liu. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89, 958-966, 1994. 20. J. Liu, A. Neuwald, and C. Lawrence. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. Journal of the American Statistical Association, 90, 1156-1170, 1995. 21. T. Lee, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799-804, 2002. 22. V. Matys, E. Fricke, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374-378, 2003. 23. A. Neuwald, J. Liu, and C. Lawrence. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci., 4, 1618-1632, 1995. 24. P. Pevzner and S. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 269-278, 2000. 25. F. Roth, J. Hughes, P. Estep, and G. Church. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotech., 16, 939-945, 1998. 26. B. Ren, et al. Genome-wide location and function of DNA binding proteins. Science, 290, 2306-2309, 2000. 27. S. Sinha, M. Blanchette, and M. Tompa. PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics, 5:170, 2004. 28. R. Siddharthan, E. Siggia, E. van Nimwegen. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Computational Biology, 1, e67, 0534-0555, 2005. 29. E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19:i273-i282, 2003. 30. M. Tompa, N. Li, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology, 23, 2005.
Structural Bioinformatics
EFFECTIVE LABELING OF MOLECULAR SURFACE POINTS FOR CAVITY DETECTION AND LOCATION OF PUTATIVE BINDING SITES
Mary Ellen Bock
Dept. of Statistics, Purdue University, 150 N. University Street, West Lafayette, IN 47907-2067, USA. E-mail: [email protected]
Claudio Garutti
Dept. of Information Engineering, University of Padova, Via Gradenigo 6a, 35131 Padova, Italy. E-mail: [email protected]
Concettina Guerra
Dept. of Information Engineering, University of Padova, Via Gradenigo 6a, 35131 Padova, Italy
College of Computing, Georgia Institute of Technology, 801 Atlantic, Atlanta, GA, USA. E-mail: [email protected]
We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.
Keywords: protein surface comparison; spin-images; binding sites; cavity detection; drug design
1. INTRODUCTION
The automatic recognition of regions of biological interest, such as binding sites, on protein surfaces is a critical task in function determination and drug design. The number of protein structures available is increasing, while the assessment of the function of a protein binding site involves time-demanding experimentation with ligands. To this end, any tool that can provide function-related information, such as putative binding sites, to direct the experimental phase is welcome. Cavity detection is often the first step for functional analysis, since binding sites in proteins usually lie in cavities. In our work, we represent a protein surface using spin-images and, based on this representation, use a labeling of surface points that is effective in finding cavities and binding sites. Our approach is simple and fast, purely geometric with no dependence on physico-chemical properties. It examines a subset of surface points, generally less than half of the original points, that are likely to lie on cavities. Those are the points, labeled blocked, whose normal intersects the protein surface at some other
point. For each blocked point, the procedure generates a trial sphere and constrains the radius of the sphere so that it does not penetrate any neighboring atom, using the values of the spin-image. The clusters of overlapping spheres correspond to surface cavities. One use of the method is to compare a cavity from one protein to a cavity in another protein. The comparison method based on spin-images, introduced for protein surface comparison,1,2 can be adapted to find a surface region in one cavity that is geometrically similar to a surface region in the other cavity. Such a finding would be an indication that the two regions likely bind to a common ligand. Typically, the surface region that constitutes the binding site of a ligand in a cavity is only a small part of the total surface area of the cavity, and the volume of the cavity is much larger than needed to accommodate the ligand. One extension of the comparison of cavities in proteins is to compare cavities found in two different chains of the same protein. Once again, similar surface regions within the two cavities may indicate binding sites for the same lig-
and on the two chains. We tested our cavity detection procedure with a non-redundant set of 244 protein structures previously defined.3 The results that we obtain on the dataset using only geometric criteria are comparable to those of the SURFNET-ConSurf method,7 which adds information on conserved residues to its surface pocket predictor. The combined use of the cavity detection and cavity comparison procedures was benchmarked on several pairs of proteins used in the molecular recognition method based on spin-images.1,2 For the analysis of the results, we used the measure of coverage of the binding site. We observed that the new combined approach achieves better results in terms of coverage of the binding site w.r.t. the comparison performed on whole surfaces. Not surprisingly, it drastically improves the execution times needed for discovering similar regions on entire protein surfaces.2 If we restrict the analysis to cavities, the execution times are reduced from 1-2 hours down to a few minutes or even seconds. The paper is organized as follows. Sec. 2 presents a short survey of the existing methods for cavity delineation and binding site recognition. In Sec. 3 we review the spin-image representation of a protein surface and discuss a labeling of the protein surface points that is useful in the identification and characterization of protein cavities. Sec. 4 presents a new method for cavity detection and its use in the recognition of similar regions on protein surfaces. We provide experimental results in Sec. 5 and conclusions in Sec. 6.
2. PREVIOUS WORK
Our work is a combination of a method to detect cavities on protein surfaces and a method to compare the cavities from two distinct protein surfaces to locate common putative binding sites. Thus we review the methods for locating cavities and the methods to find similarities between protein surfaces.
2.1. Methods to detect cavities
Several methods and procedures exist to detect protein cavities, either internal to a molecule or external on a protein surface.4-10 Some methods concern themselves primarily with the visualization of molecular surface cavities rather than with their analysis.
The methods can also be applied to delineating gap regions between two molecules, for instance an enzyme and an inhibitor. It has been observed that external surface cavities are more difficult to delineate and depict because of the difficulty of knowing "how far in the open space to extend the groove region". Cavity detection algorithms are often based on fitting probe spheres into the spaces between the atoms. In the DOCK6 algorithm, for each pair i, j of surface points, a sphere is generated tangent to the surface at i and j and with center on the surface normal at i. Then the cluster program of the DOCK suite performs a clustering of the obtained spheres. Finally, geometric values of the resulting clusters, such as volume and depth, are determined. In many cases, the largest cluster is the ligand binding site of the molecule. The program SURFNET5 for visualizing molecular surfaces builds a sphere for each pair of nearby atoms with the center halfway between the two atoms and then adjusts the radius if it clashes with any neighboring atom. The predicted cleft volume is in many cases much larger than the ligand that occupies it. A trimming procedure called SURFNET-ConSurf7 reduces the size of the clefts generated by SURFNET by cutting away regions distant from highly conserved residues. In the POCKET8 program, trial spheres are placed on a regular three-dimensional grid and their radii are reduced in size until no neighboring atom penetrates the sphere.
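As a rough illustration of the grid-based strategy just described, the sketch below places trial spheres on a regular grid and shrinks each radius until no atom penetrates it. All names and parameter values are hypothetical, and the original POCKET program applies further filtering not shown here.

```python
import numpy as np

def grid_trial_spheres(centers, radii, step=1.0, r_init=5.0, r_min=1.2):
    """Place trial spheres on a regular 3D grid; shrink each radius until
    no atom penetrates the sphere, and keep spheres larger than r_min."""
    lo, hi = centers.min(axis=0) - r_init, centers.max(axis=0) + r_init
    grid = np.mgrid[lo[0]:hi[0]:step, lo[1]:hi[1]:step, lo[2]:hi[2]:step]
    spheres = []
    for p in grid.reshape(3, -1).T:
        # Largest radius keeping every atom outside the trial sphere.
        r = min(r_init, (np.linalg.norm(centers - p, axis=1) - radii).min())
        if r >= r_min:
            spheres.append((p, r))
    return spheres
```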
2.2. Recognition of binding sites
Much work has been done on the recognition of binding sites using various approaches based on different protein representations and matching strategies. Three recognition problems are generally addressed: 1) the comparison of known binding sites to determine their degree of similarity, 2) the search for a given binding site in a set of complete protein structures, 3) the search for putative binding sites of a given protein in a set of known binding sites. In SiteEngine,11 all three problems are considered and extensive experimentation is conducted for each. Recognition is obtained by hashing triangles of points and their associated physico-chemical properties and by application of a clever scoring mechanism. A method for binding pocket comparison and clustering has been proposed,12 based on a protein shape representa-
tion in terms of spherical harmonic coefficients. This method is interesting and fast; however, as pointed out by the authors, it requires a registration phase, to align the two shapes, that is not always very reliable. A geometric hashing approach has been used13 to compare and cluster phosphate binding sites in protein-nucleotide complexes, leading to the identification of 10 clusters. These are the structural P-loop, the di-nucleotide binding motif [FAD/NAD(P) binding and Rossmann-like fold], and the FAD binding motif. A cavity-aware match technique14 uses C-spheres to represent active clefts which must remain vacant for ligand binding. The technique reduces the number of false positives while maintaining most of the true positive matches found with identical motifs lacking C-spheres. A different instance of the comparison problem is when two complete protein surfaces are compared to discover their most similar regions. The adaptation of this method1,2 to surface cavities will be discussed in this paper.
3. SURFACE CHARACTERIZATION
3.1. Spin-image representation of protein surfaces
We represent the molecular surface as a collection of spin-images, each of them associated to a surface point with its normal. Surface points are generated using Connolly's molecular representation.26 Spin-images are semi-local shape descriptors used mostly in the area of computer vision for 3D model retrieval and registration.27 A spin-image provides a high-dimensional description of the appearance of a 3D object in a local reference system. It is a histogram of quantized surface point locations in a local coordinate system associated to a 3D point on the surface and to its normal. Spin-images are discriminative (and as such can be used for recognition), easy to compute, and invariant under rigid transformations. For a surface point P with normal n, let (P, n) be the coordinate system with origin in P and axis n. In this system, every surface point Q is represented by two coordinates (α, β), where α is the perpendicular distance of Q to n, and β the signed perpendicular distance of Q to the plane T through P perpendicular to n. The spin-image is a two-dimensional histogram of the quantized coordinates (α, β) of the surface points. The image pixels have size equal to 1 Å in our application.
Fig. 1. Histogram of the number of blocked points on protein surfaces and binding sites.
A spin-image is rotation invariant since all points on a ring centered on the normal n have the same coordinates. The spin-image dimensions depend on the point P, its tangent plane T, and the normal n to T. The number of columns depends on the maximum distance α_max from n of other points on the surface of the object. Let h be the number of rows and k the number of columns of the spin-image. If β_T = β_max − β_min, then h = ⌈β_T/ε⌉ and k = ⌈α_max/ε⌉, where ε is the pixel size.
3.2. Characterizing cavities in terms of blocked points
We label surface points as blocked or unblocked depending on the shape of their spin-images. A surface point P with normal n is labeled blocked if n intersects the surface at any other point lying above the tangent plane T at P perpendicular to n; otherwise it is labeled unblocked. To label a point, only the first column of its spin-image needs to be examined: if it contains a non-zero pixel with positive β, then the point is blocked; otherwise it is unblocked. Crucial to our cavity detection procedure is the identification of blocked points on the protein surface. Typically, the number of blocked points on a protein surface is smaller than that of unblocked points, i.e., of points whose normal does not intersect the surface at any other point. Not surprisingly, the opposite is true for points of the binding sites. In Fig. 1 we show the statistics of blocked points of proteins and binding sites (the proteins are taken from a non-redundant dataset3 that will be discussed
266 in more detail later). For most proteins, less than 50% of the surface points are blocked, while for the majority of the binding sites, more than 70% of points are blocked. For example, out of 5039 Connolly's points of protein lnsf (D2 Hexamerization domain of NEthylmaleimide sensitive factor) 1800 are blocked, i.e. approximately 35% of the total. For the binding site of lnsf with ligand ATP, the percentage of blocked points goes up to 74%. As another example, protein lmjh, an hypothetical protein binding ATP, has an even higher percentage of blocked points on the binding site, i.e. above 80%. Furthermore, blocked points are strongly present in cavities, especially in internal cavities. In fact, if a cavity is internal, then the normals at all points of the cavity intersect the protein at some other points of the cavity. If a cavity is external, there might be few unblocked points at the bottom of the cavity. Thus, for cavity detection, we restrict our analysis to blocked points. The identification of blocked points can be done very easily once the spin-images of surface points have been constructed. If the first column (corresponding to 0 5 (Y < E ) of a spin-image contains a non-zero pixel with positive 0,then the point is blocked, otherwise is unblocked. Here we are assuming that the normal n intersects the surface at some other point Q if n is within E distance from Q, where E is the spin-image pixel size.
4. METHODS
4.1. Cavity detection
Our approach to delineating surface cavities considers only blocked points. For each blocked point, it builds the largest sphere that can fit at that point; then it determines the cavities as clusters of overlapping spheres. Given a blocked point P with normal n and spin-image spin(P), the associated sphere is obtained from the biggest (discrete) semi-circle in spin(P), tangent to the cell at 0 and containing only empty cells of spin(P). Due to the cylindrical symmetry of spin-images, the semi-circle of spin(P) corresponds to a sphere in 3-D. Defining the sphere starting from the spin-image allows fast construction of the spheres. For a blocked point, we find the sphere as follows. We consider the horizontal profile of a spin-image as
Fig. 2. Determination of the sphere using spin-image horizontal profile.
a one-dimensional array with length Z + 1, where Z is a count of the number of successive zero elements along column 0 (corresponding to 0 ≤ α < ε) of the spin-image for β ≥ 0, starting at β = 0. The ith element of the vector is given by the number of contiguous zero elements in row i of the spin-image starting at column 0 and ending at the first non-zero cell along row i. Z is a constraint on the largest possible diameter of a sphere that can touch the protein surface at the blocked point (we have assumed ε equal to 1 Å). The particular values of the elements of the profile further constrain the largest diameter of such a sphere. To calculate the largest possible radius of the sphere, LPR, we initially set the variable R equal to Z/2. As we observe the values of the horizontal profile starting at position 1, no constraint is imposed if the value is greater than the current value of R. The smallest position j such that the vector value at the jth position is smaller than the current value of R gives the first constraint upon the LPR, and this must be calculated. For i positive, a value of i in position i is a constraint of radius i on LPR. More generally, it can be easily shown that a value i at position j is a constraint of c on LPR, where c = (i² + j²)/2j if i ≥ j, and c = (i² + (j − 1)²)/2(j − 1) otherwise. If c is less than R, then R is set to c. For successive positions in the horizontal profile, this computation is repeated if the profile value is smaller than R (see the sketch after the procedure below). Fig. 2 shows an example of determination of the sphere using the spin-image horizontal profile. For a molecule with a set B of blocked points, we generate spheres only for the subset B′ of points of B with a Z value below a given threshold (10 Å in our tests). Blocked points with larger Z values are not typical of cavities, since they can also be found
at the top of a region if their normal intersects the surface at a faraway region. Our overall approach is simple and fast. The time required to generate all spheres is O(b × d), where b is the number of considered blocked points, typically much smaller than the number m of all surface points, and d = 10 is the maximum Z value of the spin-images. If we take into account the preprocessing phase needed to create m spin-images, the overall time complexity of our procedure becomes O(m × max{m, D} + b × d), where D is the size of the spin-image. This represents a computational advantage with respect to methods for cavity detection that generate m² trial spheres, one for each pair of surface points, and check the non-penetration of other surface points into each sphere, obtaining an overall time complexity of O(m³). Notice that the complexities of both approaches can be improved by the use of clever techniques for neighbor-finding operations. In our approach, these could lead to a faster creation of spin-images, if only local points are chosen to contribute to the construction of the spin-image of a given point. In the other approaches, fast neighbor-finding operations could speed up the check of the non-penetration constraint. Once all spheres of blocked points are obtained, those with LPR below a certain threshold (1 Å in our experiments) are removed so that small gaps between atoms are not considered. From the remaining spheres, a clustering procedure determines collections of interpenetrating spheres corresponding to the points of the surface cavities. The clusters are identified as the connected components of the undirected graph G = (V, E), in which the vertices are the blocked points, and an edge connects two vertices if their spheres overlap. The overall procedure is outlined below.
PROCEDURE: Cavity Detection
(1) For a given protein surface, determine the set of blocked points B and its subset B′ consisting of points with Z less than a predefined threshold ThZ = 10.
(2) For each point b of B′, build the sphere touching the surface at b from its spin-image profile, as described above.
(3) Prune the set B′ by removing all points whose sphere radius r < 1 Å.
(4) Find the connected components G1, ..., Gc of G using breadth-first search.
The vertices of each connected component of G form a cluster corresponding to a surface cavity.
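The LPR computation itself is a short loop over the horizontal profile. The following is a minimal sketch of the constraint formula given above; the j = 1 guard is our own illustrative choice, not spelled out in the text.

```python
def largest_possible_radius(profile):
    """LPR at a blocked point from the horizontal profile of its
    spin-image (Sec. 4.1): profile[j] is the number of contiguous empty
    cells along row j starting at column 0, and Z = len(profile) - 1."""
    Z = len(profile) - 1
    R = Z / 2.0                       # the zero run along n bounds the diameter
    for j in range(1, len(profile)):
        v = profile[j]
        if v >= R:                    # distant obstacles impose no constraint
            continue
        if v >= j:
            c = (v * v + j * j) / (2.0 * j)
        elif j > 1:
            c = (v * v + (j - 1) ** 2) / (2.0 * (j - 1))
        else:                         # obstacle adjacent to P at row 1: our
            c = 0.5                   # own guard, not spelled out in the text
        R = min(R, c)
    return R
```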
Note that point density has an impact on the choice of the parameters. In our work, we generated one point per square angstrom. The threshold values for ThZ and r were assessed by performing cavity detection on 30 random proteins from the dataset3 using different values of the parameters.
4.2. Finding similar binding sites on two proteins
We now give an outline of our overall approach for detecting similar binding sites on two protein surfaces.
(1) Build the spin-image representation of the surface points of the two proteins.
(2) For each protein, find the surface cavities based on the spin-image profiles of blocked points and select the largest cavity(ies).
(3) Compare pairs of cavities, one per protein, by identifying and grouping sets of corresponding points based on the correlation of their associated spin-images. Return the regions on the two cavities that are most similar.
Steps 1 and 2 have been described in the previous sections. For comparing pairs of cavities in step 3 we use an adaptation of the recognition method based on spin-images,1,2 here referred to as MolLoc, that allows the discovery of similar regions on protein surfaces. MolLoc takes as input a pair of proteins and finds the regions on the two surfaces that most resemble each other. Basically, for two given proteins g and g′, MolLoc builds individual point correspondences (Q, Q′), Q ∈ g and Q′ ∈ g′, if their spin-images have a high correlation value. A high correlation value is taken as an indication of structural similarity of the local regions surrounding the two points and contributing to the spin-images. Once point correspondences are identified, they are clustered into groups of consistent correspondences. The consistency criterion is purely geometric and enforces the rigidity constraint of three-dimensional objects. It states that the angles between normals at two surface points on one protein and the distances between the two points must be preserved between the corresponding points of the other protein.
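A minimal sketch of how such correspondences might be proposed follows; the real MolLoc clustering and consistency filtering are more involved, and all names here are illustrative.

```python
import numpy as np

def correspondences(spins_a, spins_b, min_corr=0.7):
    """Propose point correspondences between two cavities: pair (i, j)
    whenever the Pearson correlation of the two (equally sized, flattened)
    spin-images exceeds min_corr. Geometric consistency filtering would
    then prune these pairs, as described in the text."""
    pairs = []
    for i, sa in enumerate(spins_a):
        a = sa.ravel() - sa.mean()
        na = np.linalg.norm(a)
        for j, sb in enumerate(spins_b):
            b = sb.ravel() - sb.mean()
            nb = np.linalg.norm(b)
            if na > 0 and nb > 0 and float(a @ b) / (na * nb) >= min_corr:
                pairs.append((i, j))
    return pairs
```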
Although effective in identifying surface similarity, MolLoc suffers from high computational complexity. For a pair of large proteins, the execution time can be up to two hours. A number of heuristics have been proposed to cope with this problem. One heuristic consists of mapping surface points into cells of a 3D grid, and restricting the matching procedure to points contained in pairs of grid cells and in their neighboring cells. We use the same basic matching procedure for comparing two surface cavities, obtaining execution times on the order of minutes or even seconds. No mapping of points into a 3D grid is necessary, which is also instrumental in producing more accurate results.
5. DATA AND RESULTS
5.1. Cavity detection
We conducted experiments for cavity detection on a dataset of 244 proteins previously defined.3 The protein structures are taken from the PDB. Of these proteins, 112 are enzymes (45.9%), 129 non-enzymes (52.9%), and three "hypothetical" (1.2%) proteins, according to PDBsum28 and UniProt.29 These PDB entries contained 464 ligands not covalently bound to the protein, and for each protein-ligand complex there is a binding site. The binding sites of these complexes are determined in the following way. For a ligand binding to a protein, the binding site consists of the atoms of the protein that (i) are closer than a given threshold (5 Å in our experiments) to at least one atom of the ligand, and (ii) have at least one surface point that is blocked by the ligand. A surface point is said to be blocked by the ligand if its normal intersects (is close to) at least one atom of the ligand. The surface points and their normals are generated using Connolly's program.26 The obtained binding sites are generally identical (or very similar) to those derived with the CSU software30 that analyzes the interatomic contacts in protein complexes. The ligands in the data set form a very heterogeneous set, including sugars, co-factors, substrate analogs, peptides, etc. They also show great variability in the size and shape of their binding sites. The number of atoms in the binding sites varies from 3 to 141: the binding site of ligand NAG-21 in the complex 1o7d has only 3 atoms, and that of ligand CDN in the complex 1nek has 141 atoms.
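Criterion (i) above is straightforward to express with a spatial index. A sketch, with hypothetical names and scipy assumed, and with criterion (ii) omitted:

```python
import numpy as np
from scipy.spatial import cKDTree

def binding_site_atoms(protein_xyz, ligand_xyz, cutoff=5.0):
    """Criterion (i): indices of protein atoms closer than `cutoff`
    angstroms to at least one ligand atom. Criterion (ii), requiring a
    surface point blocked by the ligand, is omitted from this sketch."""
    dists, _ = cKDTree(ligand_xyz).query(protein_xyz, k=1)
    return np.where(dists < cutoff)[0]
```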
Fig. 3. The figure plots the number of atoms of the binding sites versus the number of atoms of the ligands for all 244 proteins of the dataset. The dotted line is the least-squares line.
Although there is a correlation between the number of atoms of the binding sites and of the ligands, as shown in Fig. 3, the binding sites of the same ligand with different proteins may vary significantly in size. For example, the binding sites of ligand MPD in protein complexes 1d3c, 1h6g, 1hty, 1i78, 1lvo, 1nvm, 1000, 1srq consist of a number of atoms ranging from 3 to 28. A ligand can have more than one binding site with the same protein, and these binding sites can also vary considerably in size. The ligand UPL (unknown branched fragment of phospholipid) has 27 binding sites on the same protein (1lsh), of which the smallest has only 4 atoms, while the biggest has 56 atoms. The ligand of the dataset that shows the largest variability is FAD (flavin-adenine dinucleotide), where the biggest of its 11 binding sites has 114 atoms and the smallest has just 10 atoms. Our cavity detection algorithm was run on the whole data set of 244 proteins. For each protein, it returned all cavities with more than a threshold number of atoms, ranked according to the number of atoms they contain. Thus rank one identifies the largest cavity, rank two the second largest cavity, and so on. This number is taken as an approximate measure of the extension of the cavity. The number of cavities found on a protein varies considerably, depending on the size of the protein and its shape. In analyzing our solutions, we use the measure of coverage of the residues (atoms) of the binding site, i.e., the percentage of residues (atoms) of the binding site found in the cavity. A residue belongs to a cavity if at least one of the surface points close to it belongs to the cavity. If the binding site of a ligand is known, we call
best-coverage cavity the cavity with the biggest coverage (in terms of atoms) of the binding site. In discussing our results, we consider only the bestcoverage cavity for each complex of the dataset, and refer to it simply as cavity in the following. Fig. 4(a) shows the distribution of ranks of bestcoverage cavities (those containing the ligand). Of the 464 binding sites, 224 are in the largest cavity. As shown in Fig. 4(b), the values of coverage of residues of the binding sites are generally very good, with the majority of cavities achieving a coverage above 90%. This is true also for the coverage of the atoms of the binding site, even though such values are generally lower than those obtained for residues. The results of our procedure for the whole dataset are available at h t t p : //www .unipd. it/ ~garuttic/cavity/cavities07.xml . Fig. 4(a) shows the distribution of the best-coverage cavities according to their rank. It can be seen that in most cases our method identifies the binding site in the biggest cavity. Moreover, according to Fig. 4(b), we
can infer that most of the time the binding site is completely included in the cavity. In Tab. 1 we show the top 20 cavities according to their values of coverage. The values of coverage for SURFNET-ConSurf are not reported in this and in the other tables because they are not available. Thus, all these cavities tightly include the binding site, and in the first seven cases they coincide with it. It can be seen that, for these 20 entries, we locate the binding site in one of the four biggest cavities in 14 cases out of 20, which is competitive with the 8 out of 20 of SURFNET-ConSurf. Moreover, in all the entries but one, our procedure finds that the best-coverage cavity has rank less than or equal to that of SURFNET-ConSurf. The only exception is for protein 1p60 with ligand HPY-411, but it can be noted that this protein has several cavities with similar dimensions, and thus the ranking can differ significantly even between similar algorithms. Tabs. 2 and 3 show the top 20 cavities according to their size, defined as cavity volume and number of atoms of the cavity, respectively. The results of Tab. 2 do not show any significant differences between the two methods, since all the cavities but two have rank one and big size in both methods. The two exceptions are complex 1ei6 with ligand PPF-412 (chain D), and complex 1r72 with ligand NAD-5. In the first case we find a small cavity that completely includes the binding site, which can be considered an improvement with respect to the big cavity found with SURFNET-ConSurf, while in the second case the small cavity found has a 25% coverage on a binding site of 8 atoms and thus contains only two atoms of the binding site. The results of Tab. 3 show the biggest cavities that we find. They all have rank one, high coverage, and a considerable number of atoms (more than 600). Also the cavities found with SURFNET-ConSurf have big size, but eight of them have rank higher than one, which suggests that these cavities are smaller than ours. This analysis suggests that the results that we obtain are close to those of SURFNET-ConSurf, with a fast and still accurate geometrical method, without including any information about residue conservation. From the analysis of the results, we can observe that for ligands with a large number of atoms in contact, our procedure identifies the binding site in a cavity with rank lower than four in most cases; otherwise it tends to find the binding site in a smaller cavity with rank larger than four (see
Table 1. The 20 cavities with the best values of coverage found by our procedure. PdbID is the ID of the complex in the PDB. Chain is the chain used in the experiment. Rank is the identifier of the cavity of the protein with the best coverage of the binding site. Cov and # Atoms of the cavity refer to the best-coverage cavity. Cov is the coverage expressed in terms of atoms. # Atoms of the b.s., # Atoms of the ligand and Name of the ligand refer to the ligand as indicated in the PDB. Ligand name is expressed in the format resname:chain:seqnumber.
PdbID  Chain  Rank  Rank SURFNET-ConSurf  Cov   #Atoms b.s.  #Atoms cavity  Cavity Vol in SURFNET-ConSurf (Å³)  #Atoms ligand  Ligand
1ejj   A      4     >4   1.00  24  24   NA    11  3PG::601
1fw9   A      2     4    1.00  25  25   189   10  PHB::199
1h2r   SL     >4    >4   1.00  16  16   NA    8   NFE::1004
1i9g   A      3     >4   1.00  25  25   NA    8   FS4::201
1p60   AB     2     >4   1.00  18  18   NA    8   HPY::410
1p60   AB     2     1    1.00  18  18   279   8   HPY::411
1qft   A      2     >4   1.00  27  27   NA    8   HSM::173
1otw   AB     >4    >4   1.00  42  46   NA    24  PQQ::501
1p0z   A      2     4    1.00  38  42   366   13  FLC::1632
1o7d   ABCDE  >4    >4   1.00  26  29   NA    8   TRS:A:2
1otw   AB     >4    >4   1.00  42  48   NA    24  PQQ::500
1lrh   AD     3     >4   1.00  37  44   NA    14  NLA::8190
1lrh   AD     >4    >4   1.00  35  42   NA    14  NLA::5190
1r91   A      2     2    1.00  29  40   292   8   BET::1001
1i9g   A      1     2    1.00  62  90   1141  27  SAM::301
1l5j   A      >4    >4   1.00  25  37   NA    7   F3S::868
1d15   AB     3     4    1.00  63  95   748   26  SAH::1699
1hnn   A      1     3    1.00  63  101  358   26  SAH::2001
1us5   A      1     >4   1.00  29  48   NA    10  GLU:A:1315
1o0r   A      1     1    1.00  72  120  1284  36  GDU::404
Fig. 5(a) and Fig. 5(b)). Consider the case of ligand MPD (2-METHYL-2,4-PENTANEDIOL) binding to 14 chains of 8 different proteins. When the binding site is large, as in the complex 1srq where it consists of 89 atoms, it is found in the cavity with rank one; by contrast, in the complex 1d3c with only 12 atoms in contact, the binding site is found in the cavity ranked 14. Among the 210 cavities with rank one, 142 have a binding site with more than 40 atoms (see Fig. 5(a)). There are a few ligands for which the binding sites are approximately of the same size. An example is ligand ATP, whose binding sites have about 40 atoms and are, in all cases, contained in the top cavity, with rank one. Fig. 5 shows the distribution of binding sites (ligands) by cavity rank and number of atoms of the binding site (ligand). The bigger the number of atoms of the binding site, the better the rank of the corresponding cavity. In fact, of 88 binding sites that have less than 20 atoms, only 17 binding sites lie in the biggest cavity, 5 in the second biggest cavity, two in the third and one in the fourth, while 63 binding sites are located in a cavity smaller than the fourth. The results improve as the number of atoms of the
binding site increases. Thus 64% of the binding sites that have 20 or more atoms but less than 40 lie in one of the four biggest cavities, and this percentage increases to 88% for the binding sites that have 40 or more atoms but less than 60, and to 95% for the binding sites that have 60 or more atoms but less than 80. Finally, all but four of the 29 binding sites that have 80 or more atoms but less than 100 lie in one of the three biggest cavities, and all 14 binding sites that have 100 atoms or more lie in the biggest cavity. Fig. 5(b) shows analogous results for the ligands. The biggest cavity does not contain any binding site in 80 of the 244 proteins considered in the experiments. For example, 1b11 (feline immunodeficiency virus protease complexed with TL-3-093) has a binding site with ligand INT in the cavity with rank two, while the cavity with rank one does not contain any ligand (see Fig. 6(a)). The ligands are located close to β-sheets 53-57, 62-68, 89-92 and 37-39, while the biggest cavity extends from the N-terminal valine to residue 114 close to the C-terminal methionine, including residue 108 of alpha-helix 104-110. Also the ligand C8E in the complex 1bxw is not located in the largest cavity (see Fig. 6(b)). The largest cavity is
Table 2. The 20 cavities with the biggest cavity volume according to SURFNET-ConSurf. PdbID is the ID of the complex in the PDB. Chain is the chain used in the experiment. Rank is the identifier of the cavity of the protein with the best coverage of the binding site. Cov and # Atoms of the cavity refer to the best-coverage cavity. Cov is the coverage expressed in terms of atoms. # Atoms of the b.s., # Atoms of the ligand and Name of the ligand refer to the ligand as indicated in the PDB. Ligand name is expressed in the format resname:chain:seqnumber.
PdbID  Chain  Rank  Rank SURFNET-ConSurf  Cov   #Atoms b.s.  #Atoms cavity  Cavity Vol in SURFNET-ConSurf (Å³)  #Atoms ligand  Ligand
1n35   A      1  1  0.92  49  759   19763  28  CH1::1291
1n35   A      1  1  0.89  45  759   19763  28  CH1::1295
1n35   A      1  1  0.81  31  759   19763  28  CH1::1294
1l3i   ABCD   1  1  0.97  62  1080  12820  26  SAH::803
1l3i   ABCD   1  1  0.97  58  1080  12820  26  SAH::802
1l3i   ABCD   1  1  0.96  57  1080  12820  26  SAH::804
1l3i   ABCD   1  1  0.93  57  1080  12820  26  SAH::801
1p91   AB     1  1  0.97  67  632   10221  27  SAM::1401
1p91   AB     1  1  0.95  63  632   10221  27  SAM::2401
1f48   A      1  1  0.96  57  689   8993   27  ADP::590
1f48   A      1  1  0.90  50  689   8993   27  ADP::591
1sr9   AB     1  1  1.00  30  412   8477   8   KIV::701
1itw   A      1  1  0.96  27  325   8213   13  ICI:A:743
1jvl   AB     1  1  0.95  62  1499  6810   39  UD1::901
1ei6   AD     4  1  1.00  24  63    6643   7   PPF:D:412
1p0h   A      1  1  0.71  76  247   6351   48  COA::601
1eyr   AB     1  1  0.89  47  380   6322   50  CDP::1001
1eyr   AB     1  1  0.81  47  380   6322   50  CDP::2001
1r72   AB     3  1  0.25  8   32    6224   44  NAD::5
1ueu   A      1  1  0.88  48  239   5745   29  CTP::501
at the bottom of a ,&barrel, while the ligand sticks outside from the center of the barrel and does not have a geometrically tight binding with the protein. In both cases our biggest cavities coincide with those found by the CASTp server ( h t t p ://sts .bioengr . uic. edu/castp), which is also based on geometric criteria only?)" 5.2. Finding similar b i n d i n g sites o n
We benchmarked our method on several pairs of proteins or chains from another representative set.11 The set includes 46 proteins, 12 proteins with a chain binding to ATP and 10 with a chain binding to other adenine-containing ligands. Other proteins are from diverse functional families that can bind estradiol, equilin and retinoic acid. Other protein families from the set are: HIV-1, anhydrase, antibiotics, fatty acid-binding proteins, chorismate mutases and serine proteases. In analyzing our solutions, we use the measure of coverage, i.e., the percentage of residues of the binding site found in the solution, and of accuracy, i.e., the percentage of residues in the solution that belong to the active site. A residue be-
longs to a solution if at least one of the surface points close to it belongs to the solution. We performed comparisons of a query protein or chain surface with other proteins of the data set of 46 proteins or chains to retrieve those with a high score when matched with the query. The score of a comparison is defined as the number of correspondences between points on the pair of matching regions identified on two cavities. We also compute the root mean square deviation (rmsd) of the rigid transformation that best aligns the corresponding points in the pair of regions for the two surfaces. The results shown here are obtained using the catalytic subunit of cAMP-dependent protein kinase (PDB code 1atp, chain E) as the query protein. This chain binds ATP. As already observed in the previous section, the ATP binding pockets in different proteins show great structural variability, although their size in terms of number of atoms/residues is about the same. In Tab. 4 we show the values of coverage and accuracy obtained when comparing the cavity with rank one of 1atp with those of proteins 1phk, 1csn, 1mjh, 1hck and 1nsf. For the same pairs of proteins, we also show the values of coverage of the binding
Table 3. The 20 cavities with the biggest number of cavity atoms according to our procedure. PdbID is the ID of the complex in the PDB. Chain is the chain used in the experiment. Rank is the identifier of the cavity of the protein with the best coverage of the binding site. Cov and # Atoms of the cavity refer to the best-coverage cavity. Cov is the coverage expressed in terms of atoms. # Atoms of the b.s., # Atoms of the ligand and Name of the ligand refer to the ligand as indicated in the PDB. Ligand name is expressed in the format resname:chain:seqnumber.
PdbID  Chain  Rank  Rank SURFNET-ConSurf  Cov   #Atoms b.s.  #Atoms cavity  Cavity Vol in SURFNET-ConSurf (Å³)  #Atoms ligand  Ligand
1jvl   AB     1  1  0.95  62   1499  6810   39  UD1::901
1jvl   AB     1  3  0.97  60   1499  3746   39  UD1::902
1l3i   ABCD   1  1  0.97  62   1080  12820  26  SAH::803
1l3i   ABCD   1  1  0.97  58   1080  12820  26  SAH::802
1l3i   ABCD   1  1  0.96  57   1080  12820  26  SAH::804
1l3i   ABCD   1  1  0.93  57   1080  12820  26  SAH::801
1m98   AB     1  3  1.00  103  775   1334   42  HEQ::351
1m98   AB     1  2  0.98  105  775   1311   42  HEQ::350
1m98   AB     1  2  0.74  35   775   1311   23  SUC::401
1nek   ABCD   1  3  0.88  141  766   2211   77  CDN::308
1nek   ABCD   1  3  0.85  52   766   2211   23  UQ2::306
1n35   A      1  1  0.92  49   759   19763  28  CH1::1291
1n35   A      1  1  0.89  45   759   19763  28  CH1::1295
1n35   A      1  1  0.81  31   759   19763  28  CH1::1294
1lvo   AB     1  3  0.86  28   714   1426   8   MPD::4002
1lvo   AB     1  4  0.85  27   714   892    8   MPD::4001
1f48   A      1  1  0.96  57   689   8993   27  ADP::590
1f48   A      1  1  0.90  50   689   8993   27  ADP::591
1f2u   ABCD   1  1  1.00  72   670   3526   31  ATP:A:901
1f2u   ABCD   1  1  0.99  69   670   3526   31  ATP:C:901
Table 4. Comparison of 1atp (cAMP-dependent protein kinase) with 1phk (subunit of glycogen phosphorylase kinase), 1csn (casein kinase-1), 1mjh:B ("hypothetical" protein MJ0577), 1hck (cyclin-dependent PK) and 1nsf (hexamerization domain of N-ethylmaleimide-sensitive fusion protein).
PdbID   # residues in binding site  Coverage MolLoc2  Coverage Cavity comparison  Accuracy Cavity comparison
1atp    23  78%  91%  80%
1phk    26  69%  90%  76%
1atp    23  70%  78%  75%
1csn    26  62%  80%  91%
1atp    23  26%  34%  100%
1mjh:B  25  24%  32%  88%
1atp    23  39%  56%  92%
1hck    24  42%  58%  87%
1atp    23  43%  60%  93%
1nsf    23  35%  43%  76%
site obtained by the comparison method based on spin-images,2 here referred to as MolLoc. We do not report the accuracy values for MolLoc; although the solution regions had a significant overlap with the binding sites, they spanned areas much larger than the binding sites. Indeed, the goal of MolLoc
was to identify similar regions on protein surfaces, not to find binding sites. For the proteins 1atp and 1csn, which both bind the ligand ATP, the two most similar regions on each protein are part of the binding site, and this also explains the high values of coverage for MolLoc. In both proteins, the binding
Fig. 5. Distribution of binding sites (ligands) by cavity rank and number of atoms of the binding site (ligand).
sites are located in the top cavity. The new method improves on coverage while at the same time obtaining good accuracy for all pairwise comparisons. The execution time is drastically reduced w.r.t. MolLoc: while MolLoc took about two hours to execute, the new method took less than two minutes. There are cases where we cannot expect our algorithm to identify the common regions that correspond to the active sites on a pair of cavities. However, if a large cavity is broken into several smaller cavities by physico-chemical considerations about binding sites, then one runs the risk of losing part of the binding site, which will make it harder to identify common binding sites when comparing cavities in two proteins. From the observations in the previous section about the difference in size of different binding sites for the same ligand, it is evident that any matching procedure based on purely geometric criteria will fail to recognize binding sites in those cases.
Fig. 6. Proteins 1b11 (6(a)) and 1bxw (6(b)). The biggest cavities are displayed in spacefill.
6. CONCLUSIONS
We have presented a method for binding site recognition that is effective and fast. It uses only geometric criteria and a description of the protein surfaces by means of a collection of two-dimensional arrays, the spin-images, each describing the spatial arrangement of the protein surface points in the vicinity of a given surface point. As mentioned, there are several cases where our recognition procedure fails to identify the correct binding sites. When a ligand binds different proteins at sites that vary significantly in size and shape, most existing approaches are inadequate to identify the binding location. The problem is further complicated by the simultaneous presence of several ligands within the same cavity. We think our work can contribute one more step towards the solution of the problem when only geometric features are considered.
REFERENCES
1. M. E. Bock et al., Proc. Combinatorial Pattern Matching CPM 2005, 417-428 (2005). 2. M. E. Bock et al., J. Comp. Biol. 14(3), in press (2007). 3. F. Glaser et al., Comput. Syst. Bioinformatics Conf. 62, 479-488 (2006). 4. G. P. Brady Jr and P. F. Stouten, J. Computer Aided Mol. Des. 14, 383-401 (2000). 5. R. A. Laskowski, J. Mol. Graph. 13, 323-330 (1995). 6. I. D. Kuntz et al., J. Mol. Biol. 161(2), 269-288 (1982). 7. R. A. Laskowski et al., J. Mol. Biol. 351, 614-626 (2005). 8. D. G. Levitt and L. J. Banaszak, J. Mol. Graphics 10, 229-234 (1992). 9. J. Liang et al., Proteins 33, 1-17 (1998). 10. J. Liang et al., Proteins 33, 18-29 (1998). 11. A. Shulman-Peleg et al., J. Mol. Biol. 339, 607-633 (2004). 12. R. J. Morris et al., Bioinformatics 21(10), 2347-2355 (2005). 13. A. Brakoulias and R. M. Jackson, Proteins 56, 250-260 (2004). 14. B. Y. Chen et al., Comput. Syst. Bioinformatics Conf., 311-323 (2006).
15. J. A. Barker and J. M. Thornton, Bioinformatics 13, 1644-1649 (2003). 16. T. A. Binkowski et al., J. Mol. Biol. 332, 505-526 (2003). 17. T. A. Binkowski et al., Prot. Sci. 14, 2972-2981 (2005). 18. N. Kinoshita et al., J. Struct. Funct. Genomics 2, 9-22 (2001). 19. G. Kleywegt, J. Mol. Biol. 285, 1887-1897 (1999). 20. N. Kobayashi and N. Go, J. Mol. Biol. 26, 135-144 (1997). 21. Y. Y. Kuttner et al., Proteins: Struct. Funct. Bioinf. 52, 400-411 (2003). 22. L. Lo Conte et al., J. Mol. Biol. 285, 1021-1031 (1999). 23. R. Najmanovich et al., Bioinformatics 23(2), 104-109 (2007). 24. A. Via et al., J. Mol. Biol. 57, 1970-1977 (2000). 25. H. Yao et al., J. Mol. Biol. 326, 255-261 (2003). 26. M. L. Connolly, J. Appl. Cryst. 16, 548-558 (1983). 27. A. E. Johnson and M. Hebert, IEEE Trans. Patt. Anal. Machine Intell. 21(5), 433-449 (1999). 28. R. A. Laskowski, Nucleic Acids Res. 29, 221-222 (2001). 29. The UniProt Consortium, Nucleic Acids Res. 35, D193-197 (2007). 30. V. Sobolev et al., Bioinformatics 15, 327-332 (1999). 31. H. M. Berman et al., Nucl. Acids Res. 28, 235-242 (2000).
EXTRACTION, QUANTIFICATION AND VISUALIZATION OF PROTEIN POCKETS
Xiaoyu Zhang*
Department of Computer Science, California State University San Marcos, San Marcos, CA 92096. *Email: [email protected]
Chandrajit Bajaj
Department of Computer Science, University of Texas at Austin, Austin, TX 78712. Email: [email protected]
Molecular surfaces of proteins and other biomolecules, while modeled as smooth analytic interfaces separating the molecule from solvent, often contain a number of pockets, holes and interconnected tunnels with many openings (mouths), aka molecular features in contact with the solvent. Several of these molecular features are biochemically significant, as pockets are often active sites for ligand binding or enzymatic reactions, and tunnels are often solvent ion conductance zones. Since pockets, holes and tunnels share similar surface features vis-à-vis their openings (mouths), we shall sometimes refer to these molecular features collectively as generalized pockets or pockets. In this paper we focus on elucidating all these pocket features of a protein (from its atomistic description) via a simple and practical geometric algorithm. We use a two-step level set marching method to compute a volumetric pocket function φ_P(x) as the result of an outward and backward propagation. The regions inside pockets can be represented as φ_P(x) > 0, and pocket boundaries are computed as the level set φ_P(x) = t, where t > 0 is a small number. The pocket function φ_P(x) can be computed efficiently by fast distance transforms. This volumetric representation allows pockets to be analyzed quantitatively and visualized with various techniques. Such feature analysis and quantitative visualization are also generalizable to many other classes of smooth and analytic free-form surfaces or interface boundaries.
1. INTRODUCTION

Molecular surfaces are solvent contact interfaces between the strongly covalently bonded atoms of the molecule and the ionic solvent environment, which is mostly water. Molecular surfaces often contain a number of pockets, holes, and interconnected tunnels with many openings (mouths), i.e. molecular features in contact with the solvent. Several of these molecular features are biochemically significant, as pockets are often active sites for ligand binding or enzymatic reactions [7], and tunnels are often solvent ion conductance zones [45]. Since pockets, holes, and tunnels share similar surface features vis-à-vis their openings (mouths), we shall sometimes refer to these molecular features collectively as generalized pockets or pockets. The surface of a protein can be represented as a closed compact surface S in ℝ³, with the closed interior V as the region bounded by S. It is important
*Corresponding author.
to correctly identify the main biophysical features of S in our protein model, such as its "pockets", "tunnels", and "voids", so that they can be used for quantitative scoring of binding affinities and other biochemical reactions. In this paper we present a simple and fast geometric algorithm for extracting pockets of any closed compact smooth surface, particularly the complicated solvent contact surfaces of proteins. We use a two-step level-set marching method, first outward from the original protein surface S and then backward from a topologically simple enclosing shell obtained as a result of the first marching. The pockets are extracted as the regions outside S that are not reached by the backward propagation, as illustrated in Figure 1. The result of the outward and backward propagation is represented as a 3D volumetric "pocket function" φ_P(x). The pockets in S can be represented implicitly as the regions φ_P(x) > ε, where ε is a small constant. This volumetric representation
Fig. 1. (a) Outward propagation from S to the shell T that has a simple topology. (b) Backward propagation from T to the final front F. Pockets are extracted as the yellow regions between F and S.
of pockets is very convenient, since it allows us to compute the pockets' mouth surfaces as the contour set φ_P(x) = ε, quantify the pockets' volumetric properties, and visualize them with various volume visualization techniques. We present some relevant background and related work in the next section, and then describe the details of our pocket extraction algorithm in Section 3. In Section 4 we discuss our implementation of the algorithm and compare its results with prior published work.
2. BACKGROUND AND RELATED WORK

The description of protein surfaces is important in the analysis of protein-protein and protein-ligand interactions. Several computational models of molecular surfaces for proteins have been used in the past. The van der Waals surface [10] is the boundary of the union of balls, where each atom is represented as a spherical solid ball. Other popular models include the solvent accessible surface [38] and the solvent contact surface [14]. More recent work [2, 13, 39, 48] shows how to extract triangular meshes of smooth molecular surfaces. Smooth (and analytic) molecular surfaces can also be modeled as the level set of a volumetric function, e.g. the level set E(x) = 1 of the electron density function E(x) in ℝ³ [9, 18, 25]. An isotropic Gaussian kernel function

$$G_i(x) = \exp\left( b_i \frac{\|x - c_i\|^2}{r_i^2} - b_i \right), \quad x \in \mathbb{R}^3,$$

approximates the electron density distribution of a single atom around the atomic center. The decay constant for the Gaussian kernel is denoted by b_i, while c_i is the center of the i-th atom and r_i is the van der Waals radius of the atom. The electron density field E for a molecule is obtained by summing the individual atomic density distributions over all its atoms,

$$E(x) = \sum_i G_i(x).$$
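As an illustration (not taken from the paper), the summed-Gaussian density is straightforward to evaluate on a regular grid; the following is a minimal sketch in which the grid geometry, the atom arrays, and the decay constant b are caller-supplied assumptions:

```python
# Sketch: evaluate the summed-Gaussian electron density E(x) on a regular
# grid; the molecular surface is then the level set E(x) = 1. The decay
# constant b is negative, so each kernel decays away from its atom center
# and equals 1 at the van der Waals radius.
import numpy as np

def electron_density(centers, radii, b, grid_shape, origin, spacing):
    """centers: (n, 3) atom centers; radii: (n,) van der Waals radii."""
    axes = [origin[k] + spacing * np.arange(grid_shape[k]) for k in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (gx, gy, gz, 3)
    E = np.zeros(grid_shape)
    for c, r in zip(centers, radii):
        d2 = np.sum((grid - c) ** 2, axis=-1)
        E += np.exp(b * d2 / r**2 - b)  # G_i(x) = exp(b ||x - c_i||^2 / r_i^2 - b)
    return E
```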
Shape properties such as normal, Gaussian, and mean curvatures can be computed and displayed for molecular surfaces [3, 18]. However, we are often more interested in the shapes of active sites (pockets) than in the overall properties of protein surfaces. Several pocket extraction methods have been developed and published. Delaney [15] uses cellular logic operations on grid points in a spirit similar to our two-step marching algorithm, but its results are very rough approximations that are difficult to visualize and analyze further. Edelsbrunner et al. [19] compute pockets for molecular surfaces based on the union-of-balls model using Delaunay triangulations and alpha shapes. The Delaunay triangulation D_B (and its dual Voronoi diagram) is first constructed for the set B of atomic centers [19]. A flow relation can then be defined for two Delaunay tetrahedra τ ∈ D_B and σ ∈ D_B if they share a common plane and the dual Voronoi vertex of τ lies on a different side of the plane from σ. If τ < σ, τ is called a predecessor of σ and σ a successor of τ. A tetrahedron flows to infinity if its dual Voronoi vertex is outside D_B or its successor flows to infinity. The alpha shape A_B ⊂ D_B at α = 0 is the subcomplex of D_B contained in the union of balls. Pockets P are defined
in [19] as the set of Delaunay tetrahedra that do not flow to infinity and do not belong to the alpha shape A_B, i.e.

$$P \subset D_B - A_B.$$

The alpha-shape based algorithm was implemented and tested for a number of sample proteins [34]. One shortcoming of this method is that the alpha-shape representations of the pockets are usually not smooth. Our algorithm in this paper represents the pockets with a smooth volumetric function, from which smooth pocket surfaces can be computed. Furthermore, our approach works for any representation of protein surfaces, whether based on smoothed unions of balls [2, 3, 14, 38] or on the volumetric model [9, 18]. Our feature analysis algorithm is also clearly generalizable to many other classes of smooth and analytic free-form surfaces. Complicated shapes are often captured via volumetric functions coupled with morphological operations on those functions. In 2D range images, Krishnapuram and Gupta [30] use dilation and erosion operations to detect and classify edges; Gil and Kimmel [24] discuss algorithms for computing one-dimensional dilation and erosion operators. In addition to the extraction of polygonal surfaces from volumetric functions, 3D polygonal models are also converted into volumetric representations and then modified, repaired, and simplified using morphological operations [21, 36]. A problem related to finding pockets in molecular surfaces is shape segmentation, which has been studied using different geometric and topological structures such as shock graphs [40], medial axes [32], skeletons [44], Reeb graphs [26], and others [27, 33, 35]. A notable approach is based on Morse theory, which segments the domain manifold M into stable (unstable) manifolds [16] or Morse-Smale cells [20] of the critical points of a Morse function. The Morse function commonly used for shape segmentation is the distance function to a set of discrete points P [16, 19, 23]:

$$h(x) = \min_{p \in P} \|x - p\|.$$
Again the Delaunay triangulation (and the dual Voronoi decomposition) can be computed for the points in P. The critical points of h are the intersections of Delaunay elements with their Voronoi complements. The stable manifolds of the critical points of the distance function to a set of discrete points are called the flow complex in [23], which is homotopy equivalent to the alpha shape [17]. The stable manifolds of the maxima have the same dimension as the manifold M and give a segmentation of M. It is possible to consider the pocket extraction problem as the segmentation of the complementary space outside the surface S. However, a large number of points is necessary to sample a complex surface, and the correspondingly large number of maxima and stable manifolds would segment the space into many small pieces that have no direct correspondence to the pockets.
3. ALGORITHMS

In this section, we first present the two-step marching algorithm for computing the pocket function φ_P(x) in Section 3.1. Section 3.2 describes the method for computing the signed distance function (SDF), which is based on fast distance transforms and is used in the computation of the pocket function. Section 3.3 discusses the quantitative analysis and visualization of protein pockets.
3.1. Pocket Extraction

Consider a closed compact surface S, e.g. the green inner curve in Figure 1. We use a two-step marching (fill and removal) strategy to extract pockets in S. First we fill all pockets, voids, and depressions on S by marching outward from S. As shown in Figure 1(a), the front propagates outward from the surface S to a final shell surface T. During the marching the topology of the propagation front changes; for example, the topology of the front R in Figure 1 is different from that of S and T. T is chosen to be a propagated front at a distance t far enough away from S that T has the simple topology of a sphere and the topology would not change with further propagation. The exact value of the distance t from S to T is not significant in our pocket extraction algorithm. For a typical protein, we choose t as the larger of 40 Å and twice the largest dimension of the protein. In the subsequent removal step, the front is propagated backward from the shell T towards the original surface S. The distance of the backward marching is also t, so that the front is not allowed to penetrate S and stops when it touches S. Notice the outward
marching in the fill step is irreversible, and the final front of the backward marching cannot extend into the depressed regions of the surface S. Therefore, in our algorithm, pockets are defined as the regions between the final front F of the backward propagation and the original surface S. The shaded (yellow) area in Figure 1(b) illustrates a 2D example pocket found by using this fill and removal strategy. This definition intuitively captures the main characteristics of protein pockets. We now also give a more mathematical definition. Starting from the initial surface S, the outward propagation front moves along its normal directions at a speed v. The marching front R(t) at time t can be determined according to the level set method [41], i.e. R(t) is the zero level set of a function φ(x, t) satisfying the evolution equation:
$$\phi_t + v|\nabla \phi| = 0 \qquad (1)$$

with initial condition φ(x, t = 0) = d(x), where d(x) is the signed distance function (SDF) from S. The SDF d(x) is positive/negative when x is outside/inside the surface S, and the marching front R(t) is the level set φ(x, t) = 0. If the speed v = 1 is constant, as we assume in our two-step marching algorithm, the marching front R(t) at time (distance) t is simply the level set

$$d(x) = t. \qquad (2)$$

Assume we already have an efficient algorithm to compute the signed distance function of a closed compact surface, which will be discussed in Section 3.2. We present here the algorithm for computing the volumetric pocket function φ_P(x) that represents the pockets in the protein surface.

(1) Compute the signed distance function d_S(x) from the original surface S.
(2) Extract the shell surface T as the level set d_S(x) = t, where the distance t > 0 is large enough that T has a simple sphere topology. As mentioned earlier, the exact value of t is not significant in the algorithm.
(3) Compute the signed distance function d_T(x) from the surface T, where the sign of d_T(x) is inverted, i.e. d_T(x) > 0 if x is inside T and d_T(x) < 0 if x is outside T.
(4) Construct the volumetric pocket function φ_P(x) as

$$\phi_P(x) = \min(d_S(x), d_T(x) - t), \qquad (3)$$

where d_S(x) and d_T(x) are the distance functions computed in steps 1 and 3.

Notice that φ_P(x) > 0 only for points outside S that are not reachable by backward propagation from T, i.e. points in pockets, tunnels, etc. The bounding surfaces of pockets are then extracted as the level set φ_P(x) = ε, where a small number ε > 0 is used to account for the size of solvent atoms. For example, we usually choose ε to be between 1 and 1.5 Å. This pocket extraction algorithm is simple, flexible, and robust. It works for any closed surface in ℝⁿ. In particular, it works for any molecular surface description: union of balls, solvent accessible surface, or contours of electron density functions. Figure 2 shows the successful extraction of two tunnels in an "8" shape.
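As a concrete illustration, steps (1)-(4) can be realized directly with Euclidean distance transforms. The sketch below assumes a boolean voxel mask of the molecule interior is already available (e.g. from scan-converting S) and uses scipy's distance transform in place of a hand-rolled one; it is a minimal reading of Eq. (3), not the authors' TexMol implementation:

```python
# Sketch of the pocket function phi_P of Eq. (3) on a voxel grid with unit
# spacing; `inside` is a boolean mask of the region bounded by S.
import numpy as np
from scipy.ndimage import distance_transform_edt

def pocket_function(inside, t=40.0):
    # Step 1: signed distance d_S, positive outside S, negative inside.
    d_S = distance_transform_edt(~inside) - distance_transform_edt(inside)
    # Steps 2-3: the shell T is the level set d_S = t; compute d_T with
    # inverted sign (positive inside T, negative outside).
    outside_T = d_S > t
    d_T = distance_transform_edt(~outside_T) - distance_transform_edt(outside_T)
    # Step 4: Eq. (3). Voxels with phi_P > eps lie inside generalized pockets.
    return np.minimum(d_S, d_T - t)
```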
Fig. 2. (a) The original "8" shape in white and the final shell surface from the outward propagation in dark red. (b) The "8" shape and tunnel mouths shown in green. (c) The bounding surfaces of the two tunnels extracted as a level set of the pocket function. (d) A slice of the pocket function of the "8" shape, where the white circles are the cross-sections of the tunnel surfaces.
Pocket Mouth. In many applications we wish to find a pocket's "mouth", the bounding surface that separates the inside of a pocket from the outside region. The number of surface openings (mouths) m of a pocket (or tunnel) classifies its type:

- void if m = 0
- normal pocket if m = 1
- hole or simple tunnel (simple connector) if m = 2
- arbitrary tunnel (multiple connector) if m ≥ 3
The above pocket function can be used to easily obtain any pocket's mouths. The bounding surfaces of a pocket consist of its mouths and surface patches that are coincident with the original surface S. In other words, pocket mouths are the patches of the final backward propagation front F that do not match the original surface S, as illustrated in Figure 1(b). Therefore, in our algorithm pocket mouths are determined as the portions of the level set φ_P(x) = ε satisfying the condition d_S(x) > ε. To demonstrate the effectiveness of our algorithm, we select a random protein, the "Bacteriochlorophyll Containing Protein" (PDB ID: 3BCL), from the Protein Data Bank (PDB) [6]. This protein has a very complex molecular surface and contains one large binding site in the middle and some small pockets on its surface, as shown in Figure 3(a). Figure 3(b) shows, as a color map, a slice of the SDF of the complex molecular surface of the protein. The cross-section of the protein surface is displayed in Figure 3(b) as white curves, on which the SDF d(x) = 0. The large tunnel in the middle is clearly visible, along with several small surface pockets and internal voids. The pocket function of the protein (3BCL) is computed using the algorithm described in Section 3.1, and the corresponding slice of the pocket function is shown in Figure 3(c), in which the cross-section of the pocket bounding surface is displayed as white curves. Finally, in Figure 3(d), we superimpose the pocket surface on the molecular surface, where pocket mouths are extracted and drawn as yellow line segments. The result matches very well with our intuition of pockets and their mouths. One can see that our pocket extraction algorithm has almost perfectly located all pockets and holes in the molecular surface.
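A minimal sketch of this mouth test, using the `phi_P` and `d_S` grids from the sketch above; the use of scikit-image's marching cubes is an assumption, not the paper's own surface extraction:

```python
# Sketch: extract the pocket bounding surface phi_P = eps and mark the mouth
# patches, i.e. the parts of that level set with d_S > eps (the parts not
# coincident with the protein surface S).
import numpy as np
from skimage.measure import marching_cubes

def pocket_boundary_with_mouths(phi_P, d_S, eps=1.5):
    verts, faces, _, _ = marching_cubes(phi_P, level=eps)
    # Nearest-voxel lookup of d_S at each vertex is enough for marking.
    idx = np.clip(np.round(verts).astype(int), 0, np.array(phi_P.shape) - 1)
    d_at_verts = d_S[idx[:, 0], idx[:, 1], idx[:, 2]]
    mouth_faces = faces[np.all(d_at_verts[faces] > eps, axis=1)]
    return verts, faces, mouth_faces
```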
Fig. 3. Example slices of the pocket function and extracted pocket surfaces for the "Bacteriochlorophyll Containing Protein" (PDB ID: 3BCL). (a) The protein surface and the big tunnel in the middle. (b) A cross-section of the protein surface shown as white curves on the color map of the SDF. (c) A slice of the pocket function shown as a color map, with the pocket boundaries shown as curves. (d) The pockets in (c) superimposed onto the protein surface in (b).

3.2. Signed Distance Functions

Efficient and stable computation of the signed distance function (SDF) d(x) plays a critical role in the pocket extraction algorithm described in Section 3.1. A number of SDF algorithms have been developed in recent years. In this section, we present a method for computing the SDFs d_S(x) and d_T(x) based on fast distance transforms [22]. Other stable SDF algorithms may also be applied, for example SDF algorithms using graphics hardware [42] for better speedup. Given a 2D/3D binary image as input, its distance transform calculates the shortest distance from each pixel (voxel) to the nearest non-zero pixel (voxel). The distance transform computation is very efficient and can be done in time linear in the number of pixels (voxels). We extend the distance transform to compute the SDF of any closed compact surface. Considering a closed compact surface S embedded in a regular grid, we define a grid point p as a near point, highlighted in Figure 4(a), if at least one cell containing p intersects S. Otherwise p is considered a far point. The signed distance function d_S(x) to the surface S is computed as follows:

(1) A binary image I_0 is constructed by setting the values of near points to 1 and far points to 0.
(2) We compute the distance transform of the binary image I_0. In particular, for each far point p its closest near point c_p is recorded. We call c_p the near cousin of p. The time for this step is linear in the number of grid points.
(3) For each near point q, its shortest distance d_S(q) to S is computed, and the sign of d_S(q) is set positive/negative if q is outside/inside S. The point q̂ on S nearest to q is also recorded. To determine whether q is outside or inside S, we assume that S has been decomposed into simplices, e.g. triangles in 3D, with normal vectors always pointing towards the outside of S. In ℝ³, the nearest point q̂ may be inside a triangle, on a triangle edge, or on a triangle vertex. If q̂ belongs to only one simplex t ∈ S, i.e. q̂ is within the interior of t, then q is outside if (q − q̂) · n̂_t > 0, where n̂_t is the normal vector of t. But this dot-product criterion fails if q̂ is a shared point of two or more simplices, i.e. q̂ is on a corner or edge of S. In this case, we use a ray-shooting method to find the closest simplex to q. We cast a ray R_q from q through an interior point of a simplex containing q̂ and compute the intersection points between R_q and the other simplices sharing the same q̂, as illustrated in Figure 4(b). The first simplex t_0 intersected by R_q is chosen, and the sign of d_S(q) is set to the sign of (q − q̂) · n̂_{t_0}.
(4) The SDF d_S(p) of a far point p has the same sign as that of its near cousin c_p. The magnitude of d_S(p) is evaluated as |p − ĉ_p|, where ĉ_p ∈ S is the nearest point to c_p on S computed in step 3.

Fig. 4. (a) Near points are highlighted in orange. (b) Examples of determining whether a near point is inside/outside the surface S, where q̂₁ is contained in only one simplex but q̂₂ is shared by two simplices. The dashed line is a ray from q̂₂ to find the nearest intersecting simplex.

We state two propositions about the signed distance function d_S(p) computed by the above algorithm.

Prop 3.1. The sign of the SDF d_S(p) is correctly set for every far point p.

Proof. We prove this by contradiction. The sign of d_S(p) of the far point p is the same as that of its closest near point c_p. Suppose p is outside S but its near cousin c_p is inside S. Follow the path from p to c_p consisting of three segments along the x, y, and z axes. The last outside point on the path must be a near point and is closer to p than c_p. This contradicts the definition of c_p as the closest near point to p. The same argument holds if p is inside S. □

Prop 3.2. The error of d_S(p) is not cumulative and is bounded by the same order as the grid cell side δ.

Proof. Clearly the magnitude of d_S(p) is larger than the distance |d(p, c_p)| from p to its near cousin c_p and less than |d(p, c_p)| + |d(c_p, ĉ_p)|. The distance |d(c_p, ĉ_p)| from the near point c_p to the closest point ĉ_p on S is of order O(δ). Thus we have the inequality

$$|d(p, c_p)| \le |d_S(p)| \le |d(p, c_p)| + |d(c_p, \hat{c}_p)| = |d(p, c_p)| + O(\delta).$$

Therefore the error between d_S(p) and its approximation d(p, c_p) + d(c_p, ĉ_p) is bounded by O(δ). □

Since the SDF always has the correct sign and bounded error, the above algorithm is very robust. It is also very efficient and works even for highly complicated protein surfaces. The running time of each step of the algorithm is O(N), linear in the number of grid points N, except for step (3). In the worst case, step (3) has computational complexity O(s · N_n), where s is the number of simplices in the surface S and N_n is the number of near points.
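For the common case in step (3), where the nearest surface point q̂ lies in the interior of a single triangle, the dot-product sign test is a one-liner; the fragment below is a sketch with hypothetical inputs (the ray-shooting fallback for shared edges and vertices is omitted):

```python
# Sketch of the step (3) sign test: p is the query grid point, q_hat its
# nearest point on S, and n the outward unit normal of the triangle
# containing q_hat.
import numpy as np

def signed_distance_at_near_point(p, q_hat, n):
    dist = np.linalg.norm(p - q_hat)
    return dist if np.dot(p - q_hat, n) > 0 else -dist
```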
Fig. 5. The molecular surface and pockets of HIV-1 protease visualized using combined surface and volume rendering.
However, we use a spatial decomposition of the regular grid to limit the search for the nearest point of q to a small subset of the simplices on S. On average, then, the complexity of step (3) is O(N_n), which makes the computation of the SDF efficient even for the highly complicated protein surfaces in our experiments.

3.3. Quantitative Analysis and Visualization

Representing pockets as a volumetric function φ_P(x) allows for a number of different ways to visualize and analyze the pocket structures quantitatively.

Visualization. As the pocket function φ_P(x) is a 3D
volumetric scalar function, we can visualize it using various volume visualization techniques, e.g. ray-cast or texture-based volume rendering and isosurface rendering. As an example, Figure 5 displays the HIV-1 protease (PDB ID: 1HOS), which is important for the maturation of the HIV-1 virus. An inhibitor can bind in a tunnel of the HIV-1 protease, as shown in Figure 5(a). We compute the pocket function of the HIV-1 protease and successfully extract the binding site as the large pocket region of the function. Figure 5(b) renders the pocket function using 3D texture-based volume rendering combined with the protein surface to illustrate the overall distribution of the pocket regions. Figure 5(c) displays the bounding surfaces of the four largest pockets of the HIV-1 protease, one of which is on the other side of the protein and invisible from this view. The ligand binding tunnel is extracted as the pocket (tunnel) with the largest volume. The visualizations were performed using the surface and volume rendering capabilities of TexMol [1].
Quantitative Analysis. Based on the volumetric pocket function φ_P(x), we can extract the bounding surfaces of all pockets, tunnels, and voids in a protein at once as the level set φ_P(x) = ε. Quantitative measures like the volume and surface area of each pocket can be computed from the pocket function by summing up the contributions from the individual cells that belong completely or partially to the pocket. If the 3D domain is decomposed into simplices, the contribution from each simplex to the volume or surface area of the level set φ_P(x) = ε can be represented as a B-spline function of the variable ε, and the total measure is the sum of all non-zero B-splines [4]. Additional geometric and shape properties can also be computed for protein pockets based on the pocket function φ_P(x), for example curvature distributions [14, 43], shape histograms [29, 37], coefficients of volumetric function expansions [28], and shape contexts [5]. These shape properties of protein pockets may be used for building a database of protein pocket structures, applicable to the problem of ligand binding [31]. An affine-invariant method for comparing protein structures is described in [47], using multi-resolution dual contour trees (MDCT) of molecular shape functions, e.g. solvent accessibility, combined with geometric, topological, and electrostatic potential properties. We think pocket functions would better capture the most important features of protein shapes and provide more accurate comparison and classification. The contour tree (CT) is an affine-invariant data structure that captures the topological structure of the level sets of a volumetric function F(x) [11], and it may also be used for volumetric function matching and protein docking.
Fig. 6. (a) The contour tree of the pocket function for the "Bacteriochlorophyll A Protein" (3BCL). (b), (c), and (d): the DCTs of the pocket function at three different resolutions of 16, 4, and 1 intervals.
Each node of the CT corresponds to a critical point of the function, and each arc corresponds to a contour class connecting two critical points. A contour class is a maximal set of continuous contours that do not contain any critical points. If we cut the CT at the isovalue w, the number of connected contours of the level set F(x) = w is equal to the number of intersections (cuts) with the CT. In the case of the pocket function φ_P(x), the number of cuts to the CT of the pocket function at ε > 0 is the number of continuous surface patches bounding the pockets, i.e. the number of separate pockets. Our pocket algorithm works for general 3D surface models, e.g. the tunnels in the "8" shape shown in Figure 2, and for complex protein surfaces. This method is very sensitive: for a complicated surface such as a molecular surface computed from electron density functions, the pocket function will capture the large binding sites as well as even small depressions and voids in the surface. This distinctive feature offers both opportunities and challenges in protein shape and structure analysis. For example, the CT of the pocket function φ_P(x) for the "Bacteriochlorophyll A Protein" (3BCL) is shown in Figure 6(a). Since we are only interested in the pocket regions where φ_P(x) > 0, the CT has been truncated to remove the uninteresting part with φ_P(x) < 0. However, it is still very complex and contains 2,063 nodes (critical points). A cut at ε > 0 would yield a large number of individual pockets, many of which are very small and of little importance. Furthermore, the critical points and the structure of the CT are sensitive to noise in the data.
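For a voxelized pocket function, the same pocket count, and the volume-based pruning discussed next, can be approximated directly by connected-component labelling; a minimal sketch, with the 1% volume threshold used later in Section 4 as an illustrative default:

```python
# Sketch: count separate pockets as connected components of phi_P > eps and
# drop components whose volume falls under a fraction of the total pocket
# volume (cf. the DCT pruning described below).
import numpy as np
from scipy.ndimage import label

def count_and_prune_pockets(phi_P, eps=1.5, min_fraction=0.01, voxel_volume=1.0):
    labels, n_pockets = label(phi_P > eps)
    volumes = np.bincount(labels.ravel())[1:] * voxel_volume  # skip background 0
    return n_pockets, volumes[volumes >= min_fraction * volumes.sum()]
```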
Pocket Filtering. We need to simplify the pocket function and/or the corresponding CT in order to focus on the major pockets and filter out small ones. Carr et al. [12] describe a method of simplifying isosurfaces by tagging CT edges with geometric information and suppressing contours of small geometric measure. While this approach can be applied to pocket functions to pick out the major pockets, the CT itself is not simplified, and it is very hard to compare the CTs of two protein pocket functions. Another way to simplify is to construct a simplified data structure for the volumetric function, e.g. the dual contour tree (DCT) introduced in [46, 47]. A DCT studies properties of interval volumes within a specific range of a scalar function and is constructed by partitioning the arcs of a CT into sets of connected segments, each of which corresponds to a connected interval volume of the function domain [47]. These interval volumes represent connected regions whose function values are within a specific range; each node of a DCT is such a connected interval volume. In the case of a pocket function φ_P(x), the interval volumes within the range [ε, max(φ_P(x))] are the 3D regions inside the pockets, where ε > 0 is a value chosen in consideration of the solvent size. The relevant range [ε, max(φ_P(x))] of the pocket function is divided into a number of smaller intervals to get a higher-resolution and more complete representation of the underlying pocket function. Figure 6(b) shows the DCT constructed from the CT in Figure 6(a) by dividing the functional range [ε, max(φ_P(x))] into 16 intervals. Each node in Figure 6(b) represents a connected volume within a certain functional interval. For each node of the DCT, geometric and topological
properties of the corresponding interval volume are computed. We refer to [46, 47] for details of constructing the DCT and computing the volume and other attributes of the DCT nodes. Because protein surfaces are highly complicated, they usually contain many small pockets and voids. On the other hand, biologically important active/binding pockets must be large enough to hold the solvated ligand. We can thus remove the very small pockets from further consideration by pruning the DCT nodes whose volumes fall under a given threshold. The DCT in Figure 6(b) has been simplified by pruning. The pruning process can be facilitated by merging functional ranges and constructing DCTs of coarser resolutions. Figures 6(c) and (d) show the DCTs of the same protein with four range intervals and one range interval, respectively. A node in a lower-resolution DCT is merged from multiple child nodes of the higher-resolution DCT; pruning a lower-resolution DCT node removes all its child nodes as well. In the single-range DCT in Figure 6(d), only two nodes are left after pruning, one of which contains more than 94% of the total pocket volume and corresponds to the large binding site in the middle of the "Bacteriochlorophyll A Protein" (3BCL).

4. IMPLEMENTATION AND EXAMPLES
We have implemented both the pocket extraction and the SDF computation algorithms in C++ and encapsulated them in our freely available TexMol software [1]. Our implementation is portable across multiple computing platforms. The implementation is very robust and efficient, and can compute pockets for complicated molecular surfaces with many thousands of atoms in a few seconds. Excluding the time for extracting the original molecular surfaces, Table 1 shows the computation time, without optimization, on a DELL laptop with a 1.6 GHz processor and 1 GB of memory for the "8" shape and two proteins, the "Bacteriochlorophyll A Protein" (3BCL) and the "Hydrolase" (1C2B), downloaded from the PDB.
Table 1. Computational time for some examples. T1, T2, and T3 are the times for computing d_S(x), d_T(x), and the pocket function, respectively.

data                                      tri#      T1 (s)   T2 (s)   T3 (s)   total (s)
"8" shape                                 1,536     2.1      5.45     0.33     7.88
"Bacteriochlorophyll A Protein" (3BCL)    275,456   10.25    6.38     0.33     16.96
"Hydrolase" (1C2B)                        268,876   9.92     5.63     0.45     16.00

The "Hydrolase" (PDB ID: 1C2B) is a protein complex
containing four similar subunits. We choose the dimensions of the regular grid as 128 × 128 × 128 as a balance between accuracy and memory requirements. The smaller the grid cell size, the more accurate the SDF computation; however, this also requires more memory and longer computation time, because the distance transform time is linear in the number of grid points. Our experiments show that a 128 × 128 × 128 grid is sufficient for extracting protein pockets. For example, when we increase the grid resolution for the protein 3BCL from 128 × 128 × 128 to 196 × 196 × 196, we still get the same set of pockets, and the volume of the largest pocket changes by less than 5%. In Table 1, T1 is the time for computing the SDF d_S(x) for the original surface S, T2 is the time for extracting the shell surface T and computing the SDF d_T(x), and T3 is the time for constructing the pocket function φ_P(x) and extracting the pockets. 3BCL and 1C2B have longer T1 than the "8" shape because they have more simplices (triangles) in the original surface. All three data sets have similar times for T2 and T3, which are proportional to the grid dimensions. We compared our results to the alpha-shape based "CAST" algorithm [8, 34], using the sample protein list given in [34]. CAST uses the union-of-balls model of proteins. It gets the atomic radius for each atom from a PDB file, computes the three-dimensional weighted Delaunay triangulation, and then computes the alpha shape and the volumes and areas of the pockets. The pockets are visualized in CAST by displaying the protein residues around the pocket in a different color [8], as shown in Figure 7(c). In contrast to the CAST algorithm, our pocket algorithm can use any molecular surface model, including the union-of-balls model. In our implementation, we model molecular surfaces as smooth level sets of the electron density functions E(x), which are computed as the summation of the Gaussian kernel density functions of all atoms contained in the corresponding PDB files. The molecular surface S,
as mentioned before, is extracted as the level set E(x) = 1. The pocket function φ_P(x) is computed as described in Section 3.1. As mentioned earlier, we can apply more flexible and powerful visualization to the pocket function than to the results of CAST. We can actually extract and visualize the pocket volumes themselves, as shown in Figures 7(a) and (b), which can further be compared with the shapes of binding ligands. In our analysis, we use multi-resolution DCTs to prune geometrically insignificant nodes and to remove small pocket components. We set a conservative threshold of 1% of the total pocket volume, since pockets below this threshold are simply too small to be binding sites. A pocket or void is discarded if its volume is below the threshold. For most proteins under consideration, there is a pocket with dominant volume corresponding to the binding site. In Table 2, we present the number of pockets after our pruning step, the volume of the largest pocket, and the percentage of its volume over the total pocket volume. In the computation we chose ε = 1.5 Å for extracting pockets as the level set φ_P(x) = ε. The results are compared to those from CASTp [8]. Although we used a different molecular surface model from the "CAST" algorithm, our quantitative results correlate well with the results from CAST [34] and with experiments. For example, the Bacteriochlorophyll A Protein (3BCL) is shown in Table 2 to contain a large pocket (tunnel) comprising 94.4% of the total pocket volume, consistent with the experimental binding site and the result in CAST [34]. However, the values of the pocket volumes do not match exactly. This may be due to two major reasons: first, the two algorithms use different protein models for pocket extraction; second, pockets are segmented differently in the two methods. For example, "porin" (2por) has 45 pockets in CAST compared to 2 in our algorithm. We believe our algorithm is quantitatively more accurate according to the visualizations. Our definition of pockets is mathematically rigorous, and the extraction algorithm has been shown to be visually correct, as in Figures 3 and 5.
Fig. 7. Visualizations for the pockets of the "Bacteriochlorophyll A Protein" (3BCL). (a) and (b) show the big central pocket (tunnel) extracted using our algorithm, with and without the protein surface. (c) is the visualization from CASTp [8].

Table 2. Pocket statistics of some sample proteins.

Protein name                     PDB ID   # of pockets   largest pocket (Å³)   percentage   CAST largest pocket (Å³)
staphylococcal nuclease          1snc     7              538.4                 72.7%        757.9
HIV-1 protease                   1hvi     8              879                   62.8%        1446.8
endonuclease                     2abk     4              2129                  83.9%        886.1
acetylcholinesterase             1ack     8              1591                  35.4%        1812.3
porin                            2por     2              6508.8                95.2%        2306.9
ribonuclease                     1rob     3              842.3                 78.8%        477.5
thioredoxin reductase            1tde     5              3985                  89%          2993.2
NADH peroxidase                  2npx     3              6759                  92.3%        2846.1
bacteriochlorophyll A protein    3bcl     2              7742                  94.4%        11063
glycogen phosphorylase           1gpd     9              7168                  63.9%        9059.6
porcine pancreatic elastase      3est     9              1532                  62.3%        741.8
elastase with TFA-Lys-Pro-ISO    1ela     10             696                   37.6%        304.3
FKBP-FK506                       1fkf     5              446                   84.3%        292.9

5. CONCLUSION

In this paper we present a simple and practical geometric algorithm to compute pockets of any closed
compact surface, particularly complicated molecular surfaces. The pockets are represented as a volumetric pocket function, which has the advantage of allowing a wide range of quantitative analysis and visualization. A further advantage of our method lies in its generality and applicability to any definition of molecular surfaces. We also present an efficient volumetric signed distance function calculation, necessary for computing the pocket function. Additionally, we combine quantitative analysis with DCTs to filter insignificant features from molecular surfaces. The combined set of algorithms provides an efficient and robust way to extract complementary-space features from very complex protein surfaces, and from other free-form surfaces as well. The results of our implementation capture all the protein pockets and correlate well with experiments and prior pocket extraction algorithms.

ACKNOWLEDGEMENTS
We thank Dr. Bong-Soo Sohn for several helpful discussions related to pocket computations. This research was supported in part by NSF grants EIA-0325550, CNS-0540033, and NIH grants P20-RR020647, R01-GM074258, R01-GM073087, R01-EB004873. The TexMol software can be freely downloaded from http://www.ices.utexas.edu/CCV/software/
References
1. Bajaj, C., Djeu, P., Siddavanahalli, V., and Thane, A. TexMol: interactive visual exploration of large flexible multi-component molecular complexes. Proc. of the Annual IEEE Visualization Conference (2004), 243-250.
2. Bajaj, C., Lee, H. Y., Merkert, R., and Pascucci, V. NURBS based B-rep models from macromolecules and their properties. In Proceedings of the Fourth Symposium on Solid Modeling and Applications, Atlanta, Georgia (1997), C. Hoffmann and W. Bronsvoort, Eds., ACM Press, pp. 217-228.
3. Bajaj, C., and Siddavanahalli, V. An adaptive grid based method for computing molecular surfaces and properties. ICES Technical Report TR-06-57, 2006.
4. Bajaj, C. L., Pascucci, V., and Schikore, D. R. The contour spectrum. In IEEE Visualization Conference (1997), pp. 167-173.
5. Belongie, S., Malik, J., and Puzicha, J. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 4 (2002), 509-522.
6. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. The Protein Data Bank. Nucleic Acids Res 28, 1 (2000), 235-242.
7. Berman, J. Structural properties of acetylcholinesterase from eel electric tissue and bovine erythrocyte membranes. Biochem 12 (1973), 1710.
8. Binkowski, T. A., Naghibzadeh, S., and Liang, J. CASTp: Computed Atlas of Surface Topography of proteins. Nucl. Acids Res. 31, 13 (2003), 3352-3355.
9. Blinn, J. F. A generalization of algebraic surface drawing. ACM Transactions on Graphics (1982).
10. Branden, C., and Tooze, J. Introduction to Protein Structure: Second Edition. Garland, New York, 1999.
11. Carr, H., Snoeyink, J., and Axen, U. Computing contour trees in all dimensions. Computational Geometry: Theory and Applications 24, 2 (2003), 75-94.
12. Carr, H., Snoeyink, J., and van de Panne, M. Simplifying flexible isosurfaces using local geometric measures. In VIS '04: Proceedings of the conference on Visualization '04 (Washington, DC, USA, 2004), IEEE Computer Society, pp. 497-504.
13. Cheng, H.-L., and Shi, X. Quality mesh generation for molecular skin surfaces using restricted union of balls. In IEEE Visualization 2005 (2005), pp. 399-405.
14. Connolly, M. L. Analytical molecular surface calculation. Applied Crystallography 16 (1983), 548-558.
15. Delaney, J. S. Finding and filling protein cavities using cellular logic operations. J. Mol. Graph. 10, 3 (1992), 174-177.
16. Dey, T. K., Giesen, J., and Goswami, S. Shape segmentation and matching with flow discretization. In Workshop on Algorithms and Data Structures (2003).
17. Dey, T. K., Giesen, J., and John, M. Alpha-shapes and flow shapes are homotopy equivalent. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing (2003), pp. 493-502.
18. Duncan, B. S., and Olson, A. J. Shape analysis of molecular surfaces. Biopolymers 33 (1993), 231-238.
19. Edelsbrunner, H., Facello, M., and Liang, J. On the definition and the construction of pockets in macromolecules. Discrete Applied Mathematics (1998), 83-102.
20. Edelsbrunner, H., Harer, J., Natarajan, V., and Pascucci, V. Morse-Smale complexes for piecewise linear 3-manifolds. In Proceedings of the nineteenth annual symposium on Computational geometry (2003), pp. 361-370.
21. El-Sana, J., and Varshney, A. Topology simplification for polygonal virtual environments. IEEE Transactions on Visualization and Computer Graphics 4, 2 (1998), 133-144.
22. Felzenszwalb, P. F., and Huttenlocher, D. P. Distance transforms for sampled functions. Tech. Rep. TR2004-1963, Cornell University, 2004.
23. Giesen, J., and John, M. The flow complex: a data structure for geometric modeling. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms (2003), pp. 285-294.
24. Gil, J. Y., and Kimmel, R. Efficient dilation, erosion, opening, and closing algorithms. IEEE Trans. Pattern Anal. Mach. Intell. 24, 12 (2002), 1606-1617.
25. Grant, J., and Pickup, B. A Gaussian description of molecular shape. Journal of Physical Chemistry 99 (1995), 3503-3510.
26. Hilaga, M., Shinagawa, Y., Kohmura, T., and Kunii, T. Topology matching for fully automatic similarity estimation of 3D shapes. In Siggraph 2001 (Los Angeles, USA, 2001), pp. 203-212.
27. Katz, S., and Tal, A. Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Transactions on Graphics 22, 3 (2003), 954-961.
28. Kazhdan, M., Funkhouser, T., and Rusinkiewicz, S. Rotation invariant spherical harmonic representation of 3D shape descriptors. In Proceedings of the Eurographics/ACM SIGGRAPH symposium on Geometry processing (2003), Eurographics Association, pp. 156-164.
29. Kazhdan, M., Funkhouser, T., and Rusinkiewicz, S. Shape matching and anisotropy. ACM Trans. Graph. 23, 3 (2004), 623-629.
30. Krishnapuram, R., and Gupta, S. Morphological methods for detection and classification of edges in range images. Journal of Mathematical Imaging and Vision 2, 4 (Dec. 1992), 351-375.
31. Lee, C., and Varshney, A. Computing and Displaying Intermolecular Negative Volume for Docking. Springer Berlin Heidelberg, 2006.
32. Leymarie, F. F., and Kimia, B. B. The shock scaffold for representing 3D shape. In Proceedings of the 4th International Workshop on Visual Form, Lecture Notes in Computer Science. Springer-Verlag, 2001, pp. 216-228.
33. Li, X., Woon, T. W., Tan, T. S., and Huang, Z. Decomposing polygon meshes for interactive applications. In Proceedings of the 2001 symposium on Interactive 3D graphics (2001), pp. 35-42.
34. Liang, J., Edelsbrunner, H., and Woodward, C. Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Sci 7, 9 (1998), 1884-1897.
35. Mangan, A. P., and Whitaker, R. T. Partitioning 3D surface meshes using watershed segmentation. IEEE Transactions on Visualization and Computer Graphics 5, 4 (1999), 308-321.
36. Nooruddin, F. S., and Turk, G. Simplification and repair of polygonal models using volumetric techniques. IEEE Transactions on Visualization and Computer Graphics 9, 2 (2003), 191-205.
37. Osada, R., Funkhouser, T., Chazelle, B., and Dobkin, D. Matching 3D models with shape distributions. In Proceedings of the International Conference on Shape Modeling & Applications (2001), IEEE Computer Society, p. 154.
38. Richards, F. M. Areas, volumes, packing and protein structure. Annu Rev Biophys Bioeng 6 (1977), 151-176.
39. Sanner, M., Olson, A., and Spehner, J.-C. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38, 3 (March 1996), 305-320.
40. Sebastian, T. B., Klein, P. N., and Kimia, B. B. Recognition of shapes by editing shock graphs. In International Conference on Computer Vision (2001), pp. 755-762.
41. Sethian, J. A. A fast marching level set method for monotonically advancing fronts. PNAS 93, 4 (1996), 1591-1595.
42. Sigg, C., Peikert, R., and Gross, M. Signed distance transform using graphics hardware. In IEEE Vis2003 (2003).
43. Sonthi, R., Kunjur, G., and Gadh, R. Shape feature determination using the curvature region representation. In Proceedings of the fourth ACM symposium on Solid modeling and applications (1997), ACM Press, pp. 285-296.
44. Sundar, H., Silver, D., Gagvani, N., and Dickinson, S. Skeleton based shape matching and retrieval. In Shape Modelling and Applications Conference (May 2003).
45. Unwin, N. Refined structure of the nicotinic acetylcholine receptor at 4 Å resolution. J. Mol. Biol. 346 (2005), 967-989.
46. Zhang, X., Bajaj, C., and Baker, N. Affine invariant comparison of molecular shapes with properties. Tech. rep., University of Texas at Austin, 2005.
47. Zhang, X., Bajaj, C. L., Kwon, B., Dolinsky, T. J., Nielsen, J. E., and Baker, N. A. Application of new multi-resolution methods for the comparison of biomolecular electrostatic properties in the absence of global structural similarity. SIAM Multiscale Modeling and Simulation 5 (2006), 1196-1213.
48. Zhang, Y., Xu, G., and Bajaj, C. Quality meshing of implicit solvation models of biomolecular structures. 510-530.
UNCOVERING THE STRUCTURAL BASIS OF PROTEIN INTERACTIONS WITH EFFICIENT CLUSTERING OF 3-D INTERACTION INTERFACES
Z. Aung*
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613 and School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
*Email: [email protected]
S.-H. Tan
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Department of Molecular and Medical Genetics, University of Toronto, Toronto, Ontario, Canada†, and Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada†
Email: [email protected]
S.-K. Ng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
Email: [email protected]

K.-L. Tan
School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543
Email: [email protected]

The biological mechanisms with which proteins interact with one another are best revealed by studying the structural interfaces between interacting proteins. Protein-protein interfaces can be extracted from 3-D structural data of protein complexes and then clustered to derive biological insights. However, conventional protein interface clustering methods lack computational scalability and statistical support. In this work, we present a new method named "PPiClust" to systematically encode, cluster and analyze similar 3-D interface patterns in protein complexes efficiently. Experimental results showed that our method is effective in discovering visually consistent and statistically significant clusters of interfaces, and at the same time sufficiently time-efficient to be performed on a single computer. The interface clusters are also useful for uncovering the structural basis of protein interactions. Analysis of the resulting interface clusters revealed groups of structurally diverse proteins having similar interface patterns. We also found, in some of the interface clusters, the presence of well-known linear binding motifs which were non-contiguous in the primary sequences. These results suggest that PPiClust can discover not only statistically significant but also biologically significant protein interface clusters from protein complex structural data.

*Corresponding author. †Present affiliations.
1. INTRODUCTION

Proteins and their molecular interactions with one another are essential for many different biological activities in the cell. Unlike DNA, a protein is composed of a sequence of amino acid (AA) residues folded into a three-dimensional (3-D) form. It is widely understood that the 3-D structure of a protein, rather than its AA sequence, is the key determinant of its biological function. A substantial amount of research work on understanding the mechanisms of protein-protein interactions (PPIs) from the primary sequences of proteins has already been reported; see, for example, Refs. 1, 2. In comparison, there has been a relatively limited amount of work done based on 3-D structures. In this paper, we will study the interactions between proteins in terms of their 3-D protein-protein interfaces. These are regions in 3-D protein complexes that consist of interacting residues belonging to two different chains that are in spatial vicinity. The interface residues are known to be highly conserved.2 Identifying and understanding the underlying mechanisms of these interface clusters can lead to important biological insights that can be useful for applications
such as drug design. For example, inhibiting protein-protein interactions with small molecules can be difficult due to the generally large size of proteins and the potential lack of cavities. However, targeting the most critical residues, such as those in the binding interfaces, may lead to improved inhibition of these interactions. Clustering the 3-D protein interaction interfaces can help uncover the structural basis of protein interactions. However, this is a computationally taxing task. In this paper, we propose a novel scheme named PPiClust (Protein-Protein interface Clusterer) for efficiently representing and clustering large numbers of protein-protein interfaces of protein complexes. PPiClust can discover statistically significant interface clusters efficiently. Unlike existing approaches, our method employs a novel built-in statistical analysis mechanism to quantitatively assure the quality of the resultant clusters. We will also demonstrate that our method is computationally efficient in that we can generate the protein interface clusters within a reasonably short time on a single PC. Finally, we will demonstrate that many of the protein interface clusters discovered by PPiClust are also biologically significant and are useful for uncovering the structural basis of protein interactions. For example, we will show that our method can discover numerous remarkably similar interface structures belonging to protein complexes from different structural fold types, demonstrating that it is biologically possible for globally distinct protein structures to associate in similar ways at the interaction interface level. We will also show, for the first time, that the interacting residue sequences in some of the interface clusters match numerous well-known linear binding motifs. This is an interesting discovery because many of these linear-motif interacting residue sequences were actually non-contiguous in the corresponding primary sequences. This means that residues from different parts of a protein can come together spatially to potentially mimic the functions of linear motifs. It also suggests that many linear motifs could be more prevalent than expected, since linear motifs are currently accounted for based only on their presence in the primary sequences. In fact, given that current linear motifs have been discovered using the primary sequences,2 our PPiClust could
form an alternative framework to facilitate the discovery of more novel linear motifs in the structural space.

2. RELATED WORKS

Structural studies of interfaces have focused either on the effective characterization of interfaces, such as Refs. 3, 4, 5, or on the quantitative comparison and clustering of interfaces, such as Refs. 6, 7, 8, 9. As our objective is to uncover the structural basis of protein interactions in terms of the 3-D interaction interfaces, we will focus on the quantitative comparison and clustering of interfaces in this paper.
Structural studies of interfaces have been focused on either the effective characterization of the interfaces, such as Ref. 3, 4, 5, or the quantitative comparison and clustering of interfaces, such as Ref. 6, 7, 8, 9. As our objective is to uncover the structural basis of protein interactions in terms of the 3-D interaction interfaces, we will focus on the quantitative comparison and clustering of interfaces in this paper. Clustering of protein-protein interfaces have typically been done based on the backbone of C, atoms from interacting residues and their neighboring residues. For example, both Tsai et aL6 and Keskin et aL7 used a geometric hashing-based algorithm to compare and cluster backbones of C, atoms from protein-protein interfaces. A heuristic and iterative clustering algorithm with gradual relaxation of similarity score was employed in each iteration. Other popular protein interface clustering algorithms such as 121-SiteEngine (Interface-toInterface Site Engine)' regarded an interface as a set of interacting triangles (I-triangles) that consists of a triplet of functional groups (pseudo-centers) in one chain that formed 3 interactions with the other chain, while PIBASE' clusters interfaces between protein domains rather than between entire protein chains. In this work, we also group the interfaces derived from the protein complexes in PDB (Proteii? Data Bank)" into clusters based on the quantitative comparison of their structural similarities. However, we differ from existing works as follows: All the existing interface clustering methods lacked an important feature that systemat,ically ensures the statistical significance of the interface clusters that they generate. The significance of the clusters were typically validated a posteriori, usually rather unsystematically by visual inspection or ad-hoc biological analyses on a few sample clusters to suggest the usefulness of the methods. Recently, the importance of proper statistical validation for biological data clustering was highlighted by Hand1 et al." Here, we couple our clustering approach with a built-in statistical analysis feature so that the interface
clusters generated by our algorithm are also statistically significant (in addition to the conventional visual and biological verifications). All the existing methods employed time-consuming comparison techniques to measure the similarity of the interfaces, resulting in unscalable approaches. For instance, I2I-SiteEngine took an average of 26 seconds for each pairwise interface comparison, and required a total of 5,574,861 such comparisons in its entire clustering process.8 This means it would require about 1,677 days (over 4 years) to carry out the clustering process on a single PC. (In fact, I2I-SiteEngine was implemented on a cluster of PC workstations, and it took about 1 month of processing time.12) Here, by using a novel algorithmic scheme, we are able to perform interaction interface clustering in a much more time-efficient manner while at the same time maintaining the statistical quality of the clusters generated.

3. DATA REPRESENTATION
Many biological processes in the cell involve the formation of protein complexes, which are molecular aggregations of numerous proteins in stable protein-protein interactions. The interacting proteins can be collectively crystallized and their 3-D structure determined as a single group. Such structural information is usually deposited as a single entity into the PDB10 database and given a unique PDB ID. The member proteins of a protein complex are called protein chains, or simply chains. Within a particular complex, each chain is assigned a unique chain ID. A pair of protein chains that are directly interacting with each other form an interface region through which they interact spatially. A residue from a protein chain is considered to be part of an interface if it has at least one counterpart residue from the other chain with the distance between their nearest atoms less than or equal to 5 Å.6 For example, in Figure 1, the protein complex gamma delta resolvase is designated with the PDB ID 2rsl. It has three protein chains, which are assigned the chain IDs A, B, and C. In this complex, there are direct interactions between chains A and B, and also between chains B and C. The interface for each interacting protein pair is highlighted in the figure. The interface for chains A and
B is denoted as 2rslAB, and that for chains B and C as 2rslBC.6
Fig. 1. The protein complex gamma delta resolvase (PDB ID 2rsl) with three protein chains A, B and C.
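The 5 Å nearest-atom criterion above is easy to state in code; the sketch below assumes each chain is given as a list of per-residue atom-coordinate arrays (a hypothetical input format) and uses a k-d tree for the nearest-atom queries:

```python
# Sketch: indices of chain-A residues on the A-B interface (a residue is on
# the interface if some partner-chain atom lies within 5 Angstroms of one of
# its atoms). Run with the chains swapped to get the chain-B side.
import numpy as np
from scipy.spatial import cKDTree

def interface_residues(chain_a, chain_b, cutoff=5.0):
    tree_b = cKDTree(np.vstack(chain_b))  # all atoms of chain B
    return [i for i, atoms in enumerate(chain_a)
            if tree_b.query(atoms, k=1)[0].min() <= cutoff]
```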
The residues that constitute an interface are not always sequential in nature, according to the observations in Ref. 6. It is therefore inadequate to represent the interfaces by the literal sequences of the constituent residues from the N-terminus to the C-terminus of the chains. Furthermore, to overcome the weakness of those methods6,13 that handle the two interface fragments separately, we need to find a better way to encode an interface as a single entity such that processing it is equivalent to processing its two constituent interface fragments simultaneously.
Fig. 2. An example protein complex with chains A and B; the principal component of each chain's interface fragment is also indicated. The dotted lines mean that two residues are in contact (i.e. the distance of their nearest atoms is ≤ 5 Å).
To do so, we encode interfaces as interface matrices as follows. For the two interface fragments of an interface, we first derive their respective principal component vectors by means of principal component analysis.14 We then arrange the residues in each interface fragment by their positions along its principal component vector, as shown in Figure 2. The interface fragment for chain A is an ordered set of 9 residues, {r_1, r_2, r_3, r_4, r_5, r_6, r_10, r_11, r_12}, whereas that for chain B is an ordered set of 8 residues, {s_1, s_2, s_3, s_17, s_16, s_10, s_11, s_12}.
An interface matrix encoding of each protein interface can then be obtained by storing the pairwise distances between the centers of residues, each from an interface fragment, in a matrix that effectively captures the "interface pattern" of the interface fragments. The interface matrix for the interacting chains A and B in the above example is the 9 x 8 matrix:

    | d(r1, s1)   d(r1, s2)   d(r1, s3)   d(r1, s17)  ...  d(r1, s12)  |
    | d(r2, s1)   d(r2, s2)   d(r2, s3)   d(r2, s17)  ...  d(r2, s12)  |
    | d(r3, s1)   d(r3, s2)   d(r3, s3)   d(r3, s17)  ...  d(r3, s12)  |
    | d(r4, s1)   d(r4, s2)   d(r4, s3)   d(r4, s17)  ...  d(r4, s12)  |
    |    ...         ...         ...         ...      ...     ...      |
    | d(r12, s1)  d(r12, s2)  d(r12, s3)  d(r12, s17) ...  d(r12, s12) |

where d(·,·) is the Euclidean spatial distance between the centers of two given residues.
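A minimal sketch of this encoding, again assuming residue-center coordinates for two fragments that are already ordered as above (illustrative names, not from the paper):

    import numpy as np

    def interface_matrix(frag_a, frag_b):
        """Pairwise Euclidean distances between residue centers: rows follow
        chain A's ordered fragment, columns chain B's ordered fragment."""
        diff = frag_a[:, None, :] - frag_b[None, :, :]  # (m, k, 3)
        return np.sqrt((diff ** 2).sum(axis=-1))        # (m, k) matrix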
4. CLUSTERING METHOD
There are four major steps in our proposed PPiClust method for discovering the significant clusters of similar protein-protein interfaces:
(1) Extracting representative interfaces. First, we extract representative interfaces from the 3-D protein complexes in PDB and encode them as interface matrices;
(2) Generating interface feature vectors. We then generate feature vectors for the representative interface matrices extracted;
(3) Clustering. Clustering is then performed on the interface feature vectors to discover groupings of the protein interfaces; and
(4) Statistical validation. Finally, we quantitatively ascertain the statistical quality of the interface clusters generated.

4.1. Extracting Representative Interfaces
First, we extracted a set of representative protein-protein interfaces from the protein complexes. We used the 3-D structural data of protein complexes from PDB. After removing the irrelevant structures, such as single chains, low-resolution models, etc., we obtained a data set of 17,300 protein chains belonging to 5,503 protein complexes. We then extracted protein-protein interfaces from the interacting protein pairs of the protein complexes. From the 5,503 complexes, we obtained 17,012 interfaces. After pruning away the interfaces with too few (less than 10) or too many (more than 200) interacting residues on each side, 11,558 interfaces were left. Some of these interfaces may be redundant. Two interfaces are considered redundant if both of their corresponding chains are sequentially homologous (with more than 30% sequence identity, determined using BLASTClust, which is part of the BLAST suite (Ref. 15)). Using this criterion, we identified groups of redundant interfaces. For each such group, we chose the one with the best resolution and the largest interface size as the representative interface. After this process, we ended up with 1,445 representative interfaces for further analysis. The interfaces were then encoded into interface matrices as described in Section 3.

4.2. Feature Vector Generation
Our objective is to group similar interface matrices into their respective clusters. To do so, we need to be able to compare the interface matrices and determine their similarity values quantitatively. The DALI method (Ref. 16) was previously used to align 2-D distance matrices derived from individual 3-D protein structures. Unfortunately, we cannot employ DALI here because it is a time-consuming pairwise alignment method, and it would take a very long time (several months on a stand-alone PC) to align 1,445 interface matrices all-against-all for our systematic analysis. As such, we devise a new scheme for encoding the interface matrices so that they can be compared efficiently and effectively. We opt for a scheme in which we represent each interface matrix as a multi-dimensional feature vector based on the frequencies of the "local features" exhibited in the interface matrix. Such a frequency-based approach has been used extensively in various histogram methods in image processing (Ref. 17). It has also been used in structural bioinformatics, particularly for protein fold classification (Refs. 18, 19).
Fig. 3. Generating feature vectors from representative interface matrices. Representative sub-matrices for each representative interface matrix are shown in gray.
We can view an interface matrix as a set of 6 x 6 overlapping sub-matrices (Refs. 16, 20). Our basic idea is to represent an interface matrix as a "bit-vector" in which each bit corresponds to the presence or absence of a single type of sub-matrix among those that constitute the whole interface matrix. However, there are over one million distinct sub-matrices across all 1,445 interface matrices. If we used them all, the resultant bit-vector would be too long. To reduce the number of sub-matrix types, we group the similar ones together and select a representative sub-matrix from each group. This is done by two rounds of nearest-neighbor clustering (Ref. 21) and medoid selection. The process of generating feature vectors from the representative interface matrices is outlined in Figure 3. Using this method, we finally came up with 409 feature sub-matrices. Using the 409 feature sub-matrices, we can now systematically encode each interface matrix as a feature vector. Basically, it is the frequency profile of the sub-matrix features in the interface matrix, with the dimension of the frequency vector being equal to the number of feature sub-matrices.
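The frequency-profile encoding can be sketched as follows. The sketch assumes the 409 representative sub-matrices are available as a (409, 6, 6) array, and the assignment details are illustrative rather than the paper's exact procedure:

    import numpy as np

    def feature_vector(iface_mat, rep_submats, window=6):
        """Count, for each representative sub-matrix, how many overlapping
        6x6 sub-matrices of the interface matrix are nearest to it."""
        counts = np.zeros(len(rep_submats))
        flat_reps = rep_submats.reshape(len(rep_submats), -1)
        rows, cols = iface_mat.shape
        for i in range(rows - window + 1):
            for j in range(cols - window + 1):
                sub = iface_mat[i:i + window, j:j + window].ravel()
                nearest = np.argmin(((flat_reps - sub) ** 2).sum(axis=1))
                counts[nearest] += 1
        return counts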
4.3. Clustering

What we have done in the previous steps is to reduce the 3-D structural information of protein interfaces into 2-D interface matrices, and then into 1-D feature vectors. We are now ready to cluster the extracted protein interfaces. For any two feature vectors FV_i and FV_j, we measure their feature vector distance with the inverse cosine distance function (Ref. 17), defined as:

    df(FV_i, FV_j) = 1 - (FV_i · FV_j) / (||FV_i|| ||FV_j||)        (1)
where (· · ·) is the dot product between two vectors, and || · || is the norm of a vector. While df(·,·) is a non-metric distance (it violates the triangle inequality), it is well suited for reflecting human perceptions of similarity and dissimilarity (Ref. 17). Note that we have also tested our system with the metric Euclidean distance function, but this confirmed that the inverse cosine distance function is indeed superior. In this study, we discover the clusters of interface feature vectors for the 1,445 interfaces by employing the nearest-neighbor clustering algorithm (Ref. 21) using the distance function df(·,·) and a distance threshold dft. (We will discuss the effects of different dft values in the next sub-section.) In the clustering process, each object (a feature vector in our case) in the input data set is allocated to the cluster in which its nearest neighbor exists and in which all the other existing cluster members are also near enough to it, with regard to the given distance threshold dft. If we cannot detect such a cluster, we create a new cluster with this object as its first member.
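The distance function and the clustering rule just described can be sketched as follows (a simplified reading of the procedure, not the paper's exact implementation):

    import numpy as np

    def df(u, v):
        """Inverse cosine distance of Eq. (1): 1 minus cosine similarity."""
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def nn_cluster(vectors, dft):
        """Nearest-neighbor clustering: a vector joins the cluster holding
        its nearest neighbor, provided every member of that cluster lies
        within dft of it; otherwise it starts a new cluster."""
        clusters = []
        for v in vectors:
            best, best_d = None, float("inf")
            for c in clusters:
                d = min(df(v, m) for m in c)  # distance to nearest member
                if d < best_d and all(df(v, m) <= dft for m in c):
                    best, best_d = c, d
            if best is not None and best_d <= dft:
                best.append(v)
            else:
                clusters.append([v])
        return clusters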
4.4. Statistical Validation

During the clustering, we also conducted a statistical test called silhouette analysis (Ref. 22) to quantitatively ascertain the quality of the interface clusters that were discovered. The silhouette width s(i) of an object i is defined as:

    s(i) = (b(i) - a(i)) / max(a(i), b(i))        (2)

where a(i) is the average distance of i to all other objects in its own cluster, and b(i) is the average distance of i to all objects in its nearest neighboring cluster. The silhouette width of an object is between -1 (the worst case) and +1 (the ideal case). The average silhouette width S̄ of a clustering scheme is the average of the silhouette widths of all the members in all clusters. The larger the value of S̄, the better the clustering scheme. Figure 4 shows the effect of varying the distance threshold dft on the average silhouette width S̄ of the clustering scheme. Lower dft values give higher quality clusters (i.e., higher S̄ values) than larger dft values. However, lower dft values also generate many more useless singleton interface clusters. As such, we have to make a decision based on the trade-off between the S̄ value and the coverage of non-singleton clusters. We use here a criterion of having at least half of the total number of interfaces covered by non-singleton clusters. This can be attained by setting dft = 0.35, which covered 50.6% of the interfaces. This corresponds to an S̄ value of 0.85 if all the clusters are taken into account, and a value of 0.58 if we consider only the non-singleton clusters. As explained in Ref. 23, a clustering is considered reasonably good if its S̄ is between 0.51 and 0.7. In this way, our method ensures the statistical quality of the interface clusters generated.
Fig. 4. Effect of the feature vector distance threshold (dft) on clustering.
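A direct transcription of the silhouette computation, assuming the clusters and the distance function df from above (singletons are assigned a width of 0 here, one common convention):

    import numpy as np

    def average_silhouette(clusters, dist):
        """Average silhouette width over all objects in all clusters."""
        widths = []
        for ci, c in enumerate(clusters):
            for x in c:
                rest = [y for y in c if y is not x]
                if not rest:  # singleton cluster
                    widths.append(0.0)
                    continue
                a = np.mean([dist(x, y) for y in rest])
                # b: average distance to the nearest neighboring cluster
                b = min(np.mean([dist(x, y) for y in other])
                        for cj, other in enumerate(clusters) if cj != ci)
                widths.append((b - a) / max(a, b))
        return float(np.mean(widths))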
5. RESULTS

We implemented PPiClust on a stand-alone PC with two Pentium IV 3.0 GHz CPUs and 1 GB of main memory. The resultant clusters are presented on the webpage http://www1.i2r.a-star.edu.sg/~azeyar/genesis/PPiClust/. In our current study on 1,445 representative interfaces from 5,503 protein complexes, our method was time-efficient, with a total processing time of only about 8 hours. This is much faster than the other interface clustering methods, such as Refs. 6-8, which are too slow to be practically run on a single PC.
5.1. Visual Verification
As a preliminary analysis, we inspected the visual quality of the interface clusters. Figure 5 shows sample interfaces observed in some of the clusters. The interfaces are represented as interface matrices, depicted in the figure as gray-scale images. Darker tones indicate closer residue-residue distances in an interface, while lighter tones depict larger distances. We observe that the interfaces belonging to the same cluster generally look similar.
Fig. 5. Examples of similar interface patterns (represented as interface matrices) belonging to various interface clusters: (a)-(d) thin diagonals, (e)-(h) thick diagonals, (i)-(l) horizontal ripples, (m)-(p) vertical ripples, and (q)-(t) sparse patterns.
For example, interface patterns (a)-(d) belong to a particular cluster with the characteristic appearance of a thin diagonal, interface patterns (e)-(h) belong to another cluster with the common appearance of a thick diagonal, and so on. Next, we further analyzed the resulting protein interface clusters to see whether our method can generate not only statistically significant but also biologically interesting clusters. In particular, we investigated whether the clusters contained non-trivial discoveries, such as similar interface patterns from structurally diverse proteins, and whether well-known linear binding motifs were also found in the resulting protein interface clusters.
5.2. Structural Diversity of Interfaces’ Parent Chains
Despite the built-in statistical assurance in PPiClust, it is still plausible that the resultant clusters contain protein interfaces whose parent protein chains are all structurally similar. Discovering such interface clusters would not be very biologically significant, since the interacting interfaces from structurally similar parent chains are expected to be clustered together. What would be more interesting is the discovery of clusters that contain similar interfaces whose parent chains are structurally quite different. Using our method, we have found a surprisingly large number of such interfaces in the clusters of interaction interfaces. Let us systematically determine whether the interface clusters generated by our method contain mostly interfaces from structurally diverse parent chains. We can measure the diversity of a given interface cluster C with its Fold pair-based Shannon entropy (Ref. 24) as follows:

    Ent(C) = - Σ_{i=1}^{k} p_i log2 p_i        (3)
where k is the total number of distinct parent Fold pairs to which the interfaces in cluster C belong, and p_i is the proportion of C belonging to a particular Fold pair i. When a cluster is totally homogeneous (i.e., all interfaces in the cluster belong to a single Fold pair), its entropy value is 0. On the other hand, if a cluster is totally diverse (i.e., each member interface belongs to a Fold pair distinct from the others), its entropy value is log2 n, where n is the number of members in the cluster. Figure 6 shows the average entropy values for the different cluster sizes. We also show two reference curves in the figure, for the ideal (zero) and the maximum entropy (log2 n) cases. Observe that the entropy values for the interface clusters found by our method are indeed generally close to the maximum values. Thus, we can infer that our method has detected mostly biologically interesting clusters of structurally similar interfaces belonging to structurally diverse parent proteins.
Fig. 6. Average entropies for different cluster sizes.
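Equation (3) is straightforward to compute. A small Python sketch, where each cluster member is represented by its parent Fold pair label (e.g. "b.21-b.1"):

    import math
    from collections import Counter

    def cluster_entropy(fold_pairs):
        """Fold pair-based Shannon entropy of one cluster, Eq. (3)."""
        n = len(fold_pairs)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(fold_pairs).values())

    # A totally homogeneous cluster gives 0; a totally diverse cluster of
    # n members gives log2(n), e.g. cluster_entropy(["a", "b", "c", "d"]) == 2.0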
Overall, the average entropy of the clusters is 1.37. The result indicates that similar interfaces are indeed mediating interactions among diverse structural folds. These interface clusters represent favorable binding structural scaffolds that have been reused in nature for interactions. They are thus useful for understanding the underlying structural basis for proteins to interact with each other, such as identifying putative binding sites on proteins of known structures (Ref. 25). The interfaces could also facilitate studies on the critical residues (Ref. 4) and the motifs (Ref. 26) important for the stability of protein-protein interactions. Figure 7 shows the interface 1kacAB of protein complex 1kac (the λ repressor C-terminal domain octamer) and 1mbxCA of protein complex 1mbx (ClpSN with transition metal ion bound). According to the SCOP structural classification system (Ref. 27), 1kacA belongs to Fold b.21, 1kacB to Fold b.1, 1mbxC to Fold d.45, and 1mbxA to Fold a.174. In other words, 1kacAB belongs to the parent Fold pair b.21-b.1, and 1mbxCA belongs to the parent Fold pair d.45-a.174. Thus, while the interface structures of 1kacAB and 1mbxCA are quite similar, their parent chain structures are very different. This finding enables us to further investigate the possible functional similarity of 1kacAB and 1mbxCA, even though their global structures bear no significant resemblance to each other. In fact, as we will discuss in the next section, we found an important linear motif KPxx[QK] (ELM ID: LIG-SH3-4) commonly embedded in both of them.

5.3. Occurrences of Important Biological Motifs

We also observed that the discovered interfaces tend to be compact; each interface fragment contains an average of 30.81 residues. This has biological significance, as it implies that the provision of a large complementary surface between two structures is not an essential prerequisite for interactions. In fact, it is likely that the interactions are mediated by short residue fragments or motifs on these compact interfaces. Biologists have recently discovered that there are small contiguous sequence segments of 3-10 residues that play critical roles in many protein interactions, post-translational modifications, and trafficking (Ref. 2). In fact, it is estimated that 15%-40% of interactions may be mediated by a short linear motif (commonly expressed as a regular expression) in one of the binding partners (Ref. 28). To further assess the biological significance of the derived clusters, we also attempted to identify linear binding motifs (Ref. 2) in them. For each cluster generated by our method, we derive two sets of interface residue sequences that are sequential in 3-D space after the principal component analysis transformation. Note that these interface residues may not be contiguous in terms of their primary sequences. To detect whether occurrences of important biological motifs can be found in our interface clusters, we attempted to match a set of linear binding motifs extracted from the biomedical literature and the ELM database (Ref. 29) against the interface sequences derived above. The most significant matches are listed in Table 1. For example, the common AxxxA helix-helix interaction motif (Ref. 30), where x denotes any AA, was repeatedly detected in our derived sequences. The popular PxxP binding motif (Ref. 31) found in various signaling pathways was also detected in one of our clusters.
Fig. 7. Similar interfaces in different protein complexes.
Table 1. Significant matches between known linear binding motifs and clusters of interface sequences. Motifs are expressed as regular expressions, where "x" represents any AA. For matched interface sequences, the chain ID and the corresponding AA numberings are given. The odds ratio is calculated as O/E, where O is the observed occurrence of the linear motif in the cluster and E is the occurrence of the linear motif expected by chance in the cluster.
Linear Binding Motif | Matched Interface Sequences | Odds Ratio | References
KPxx[QK]             | (not recovered)             | (not recovered) | (not recovered)
RGD                  | (not recovered)             | 64.94           | LIG-RGD (Ref. 29)
PxxP                 | (not recovered)             | 27.56           | (Ref. 33)
(not recovered)      | (not recovered)             | 19.10           | LIG-TRAF2-1 (Ref. 29)
AxxxA                | (not recovered)             | (not recovered) | (not recovered)
Interestingly, on visual inspection, we found many cases in which the interface residue sequences that matched the known linear binding motifs are themselves non-sequential in their primary sequences. This is rather intriguing, because linear sequence motifs have traditionally been assumed to occur as contiguous sequence segments; yet, we have found in our interface clusters numerous instances in which residues from different parts of a protein chain come together spatially to mimic known linear binding motifs. For example, Figure 8 shows two interface residue sequences in one cluster that come together spatially to re-assemble the KPxx[QK] linear motif (ELM ID: LIG-SH3-4). Figure 9 shows another example of sequentially discontinuous interface residues re-assembling another known linear motif, RxLx[EQ] (Ref. 32). In this example, both sets of residues cooperate to form a similar interface that interacts with an α-helix.
Currently, only 200 linear binding motifs, out of the few thousand speculated to exist, are known (Ref. 2); there might also be many important biological motifs that are sequentially discontinuous and have yet to be detected. We have shown here that it is possible to relate the protein interface clusters to biologically important motifs by adopting a principal component analysis transformation of the residues at interaction interfaces for linear binding motif discovery. Our efficient PPiClust method could thus form an alternative framework to facilitate the discovery of novel linear motifs in the as yet unexplored structural space.
Fig. 8. Conservation of motif KPxx[QK] in a particular cluster.
Fig. 9. Conservation of motif RxLx[EQ] in a particular cluster.
These intriguing examples discovered in our interface clusters suggest that the folding of protein chains can bring residues together to yield interaction sequence motifs. This would imply that many reported biologically important linear motifs could occur more frequently than expected, as we have yet to take into account the possibility of sequentially discontinuous occurrences. For example, the RxLx[EQ] motif, which has been attributed to the virulence of the malarial parasite P. falciparum in humans, was found in 250 to 350 of the parasite's proteins by primary sequence matching (Ref. 32). Based on what we have observed in this work, the actual number of proteins containing this motif could be higher.
6. CONCLUSIONS

In this paper, we have proposed a novel interaction interface clustering scheme named PPiClust (Protein-Protein interface Clusterer) to extract statistically significant and biologically interesting clusters of protein interfaces from 3-D protein complex structural data. As we have taken care to encode the 3-D structural patterns of interfaces as compact 1-D feature vectors, the proposed method is also time-efficient; the total time taken for the whole process is about 8 hours on a stand-alone PC. This is important, as most other methods cannot scale up to mine the increasingly available structural information. Our analysis of the resultant interaction interface clusters revealed that the structurally similar interfaces in our clusters can belong to parent proteins that have very diverse structural folds. This suggests the possibility of similar protein functions among proteins with different structural fold types, an observation that was also made in other existing works (Refs. 6-8). More interestingly, our analysis also revealed that many highly conserved linear binding motifs of well-known biological function can be detected in the interface clusters generated by our method. This includes sequentially discontinuous occurrences of the motifs, suggesting that residues from different parts of a protein can come together spatially to mimic the functions of linear motifs. In fact, there might still be important biological motifs, spatially conserved but sequentially discontinuous, that have yet to be detected. Our efficient PPiClust method can thus enable the exploration of the as yet unexplored structural space to uncover the structural basis of protein interactions.
References
1. Ng SK, Zhang Z, Tan SH. Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003; 19: 923-929.
2. Neduva V, Russell RB. Linear motifs: evolutionary interaction switches. FEBS Lett 2005; 579: 3342-3345.
3. Lo Conte L, Chothia C, Janin J. The atomic structure of protein-protein recognition sites. J Mol Biol 1999; 285: 2177-2198.
4. Hu Z, Ma B, Wolfson H, Nussinov R. Conservation of polar residues as hot spots at protein interfaces. Prot Struct Funct Genet 2000; 39: 331-342.
5. Teyra J, Doms A, Schroeder M, Pisabarro MT. SCOWLP: a web-based database for detailed characterization and visualization of protein interfaces. BMC Bioinformatics 2006; 7: 104.
6. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R. A dataset of protein-protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 1996; 260: 604-620.
7. Keskin O, Tsai CJ, Wolfson H, Nussinov R. A new, structurally nonredundant, diverse data set of protein-protein interfaces and its implications. Protein Sci 2004; 13: 1043-1055.
8. Mintz S, Shulman-Peleg A, Wolfson HJ, Nussinov R. Generation and analysis of a protein-protein interface data set with similar chemical and spatial patterns of interactions. Prot Struct Funct Bioinfo 2005; 61: 6-20.
9. Davis FP, Sali A. PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics 2005; 21: 1901-1907.
10. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000; 28: 235-242.
11. Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics 2005; 21: 3201-3212.
12. Shulman-Peleg A. Personal communication, 2005.
13. Shulman-Peleg A, Nussinov R, Wolfson HJ. Recognition of functional sites in protein structures. J Mol Biol 2004; 339: 607-633.
14. Murtagh F, Heck A. Multivariate Data Analysis. Kluwer Academic, 1987.
15. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215: 403-410.
16. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993; 233: 123-138.
17. Zachary J, Iyengar SS, Barhen J. Content-based image retrieval and information theory: a general approach. J Amer Soc Info Sci Tech 2001; 52: 840-852.
18. Carugo O, Pongor S. Protein fold similarity estimated by a probabilistic approach based on C(alpha)-C(alpha) distance comparison. J Mol Biol 2002; 315: 887-898.
19. Choi IG, Kwon J, Kim SH. Local feature frequency profile: a method to measure structural similarity in proteins. Proc Natl Acad Sci USA 2004; 101: 3797-3802.
20. Aung Z, Fu W, Tan KL. An efficient index-based protein structure database searching method. In: Proc 8th Intl Conf Database Systems for Advanced Applications (DASFAA'03) 2003; pp. 311-318.
21. Dunham MH. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2003.
22. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 1990.
23. http://www.unesco.org/webworld/idams/advguide/Chapt7-1-1.htm
24. Shannon CE. A mathematical theory of communication. Bell Sys Tech J 1948; 27: 379-423.
25. Pieper U, Eswar N, Braberg H, et al. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 2004; 32: D217-222.
26. Tsai CJ, Xu D, Nussinov R. Structural motifs at protein-protein interfaces: protein cores versus two-state and three-state model complexes. Protein Sci 1997; 6: 1793-1805.
27. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995; 247: 536-540.
28. Ceol A, Chatr-aryamontri A, Santonico E, Sacco R, Castagnoli L, Cesareni G. DOMINO: a database of domain-peptide interactions. Nucleic Acids Res 2006; 35: D557-560.
29. Puntervoll P, Linding R, Gemund C, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 2003; 31: 3625-3630.
30. Kleiger G, Grothe R, Mallick P, Eisenberg D. GXXXG and AXXXA: common alpha-helical interaction motifs in proteins, particularly in extremophiles. Biochemistry 2002; 41: 5990-5997.
31. Dombrosky-Ferlan P, Grishin A, Botelho RJ, Sampson M, Wang L, Rudert WA, Grinstein S, Corey SJ. Felic (CIP4b), a novel binding partner with the Src kinase Lyn and Cdc42, localizes to the phagocytic cup. Blood 2003; 101: 2804-2809.
32. Przyborski J, Lanzer M. Parasitology: the malarial secretome. Science 2004; 306: 1897-1898.
33. Ravi Chandra B, Gowthaman R, Raj Akhouri R, Gupta D, Sharma A. Distribution of proline-rich (PxxP) motifs in distinct proteomes: functional and therapeutic implications for malaria and tuberculosis. Protein Eng Des Sel 2004; 17: 175-182.
ENHANCED PARTIAL ORDER CURVE COMPARISON OVER MULTIPLE PROTEIN FOLDING TRAJECTORIES
Hong Sun*, Hakan Ferhatosmanoglu, Motonori Ota†, Yusu Wang
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
†Global Scientific Information and Computing Center, Tokyo Institute of Technology, O-okayama, Meguro-ku, Tokyo 152-8550, Japan
Email: {sunh, hakan, yusu}@cse.ohio-state.edu, [email protected]

Understanding how proteins fold is essential to our quest to discover how life works at the molecular level. Current computational power enables researchers to produce a huge amount of folding simulation data. Hence there is a pressing need to be able to interpret such data and identify novel folding features in it. In this paper, we model each folding trajectory as a multi-dimensional curve. We then develop an effective multiple curve comparison (MCC) algorithm, called the enhanced partial order (EPO) algorithm, to extract features from a set of diverse folding trajectories, including both successful and unsuccessful simulation runs. Our EPO algorithm addresses several new challenges presented by comparing high dimensional curves coming from folding trajectories. A detailed case study applying our algorithm to the mini-protein Trp-cage (Ref. 24) demonstrates that our algorithm can detect similarities at a rather low level and extract biologically meaningful folding events.

Keywords: EPO, trajectory, alignment, protein folding, high dimension
1. INTRODUCTION

Proteins are the main agents in cells. From a chemical point of view, a protein molecule is a linear sequence of amino acids. This linear sequence, under appropriate physicochemical conditions, folds rapidly into a unique native structure. Understanding the folding process is of paramount importance, especially since its outcome, namely the three-dimensional protein structure, to a large extent decides the functionality of the molecule. Hence a lot of research has been devoted to investigating the kinetics of protein folding. In particular, modern (parallel) computation power makes it possible to perform large-scale folding simulations. As a result, interpreting the huge amount of simulation data obtained becomes a crucial issue. Given the highly stochastic nature of protein motion, the study of protein folding usually relies on an ensemble of folding simulations including both successful and unsuccessful runs, i.e., trajectories that do or do not include a sequence of conformations leading to a near-native conformation. Given such a diverse data set, scientists wish to answer questions such as: what causes the folding process to fall into different outcomes, and what common properties are shared by the successful runs but not the unsuccessful ones?

*Corresponding author.
To this end, it is highly desirable to be able to compare multiple folding trajectories and extract useful information from them. In this paper, we model each protein folding trajectory as a high dimensional curve, and then present a novel multiple curve comparison (MCC) algorithm to identify critical information from a set of trajectory curves in an automatic manner. In particular, we focus on the geometry of protein chain conformations throughout the folding process, and convert each conformation into a high dimensional point. The goal is to extract lists of ordered events common to successful runs but not to unsuccessful ones, such as discovering that a conformation B is always formed after A and followed by a conformation C before reaching a successful folding conformation. (Conformations A, B, and C may not be consecutive.) To this end, we develop an effective new multiple curve comparison algorithm, called the enhanced partial order (EPO) algorithm, to capture similarities and dissimilarities between a set of input folding trajectories. The EPO algorithm is developed over the concept of the POA (partial order alignment) algorithm (Refs. 9, 17), but is greatly improved and extended
in several aspects, especially in its sensitivity in detecting low levels of similarity and its capability of handling high dimensional curves. Applying it to the folding trajectories of the mini-protein Trp-cage (Ref. 24) shows that our algorithm is able to automatically detect important critical folding events that were observed earlier (Ref. 28) by biological methods. We remark that the EPO algorithm is general, and can be applied to multiple protein structure comparison as well.
2. RELATED AND NEW WORK

2.1. Related Work

Previously, folding simulation analysis has been performed mainly for testing various protein folding models (Refs. 18, 33), such as the folding pathway model and the funnel model, and/or for studying energetic aspects of folding kinetics (Refs. 1, 5, 19). The geometric shapes of the conformations involved in folding trajectories have not been widely explored (Refs. 6, 14, 28), despite their important role in folding. A particularly interesting work in this direction is by Ota et al. (Ref. 28), who provide a quite detailed study of the folding trajectories of the mini-protein Trp-cage using a phylogenetic tree combined with expert knowledge. In general, however, an automatic tool to facilitate folding simulation analysis at large scales is still missing. This paper provides an important step towards this goal by modeling folding trajectories as curves and using a new multiple curve comparison (MCC) algorithm to detect critical folding events.

The closest relative of our MCC problem in computational biology is the multiple structure alignment (MSTA) problem, which aims at aligning a family of protein structures, each modeled as a three-dimensional polygonal curve representing its backbone. MSTA is a very hard problem. In fact, even the pairwise comparison problem of aligning two structures A and B is believed to be NP-hard, since one has to simultaneously optimize both the correspondence between A and B and the relative transformation of one structure with respect to the other. Numerous heuristic-based algorithms have been developed in practice for this fundamental problem (Refs. 7, 8, 11, 30, 32). If we have a set of k > 0 structures, then even the problem of aligning them optimally without considering transformations becomes intractable: it takes Ω(n^k) time using the standard dynamic programming algorithm, where n is the size of each protein involved.

In practice, progressive methods are widely used to attack the MSTA problem (Ref. 21). For example, given a set of structures, many approaches start with a seed structure and then progressively align the remaining structures onto it one by one (Refs. 3, 20, 25, 26, 35). A consensus or core structure is typically built throughout, to maintain the common substructures among the proteins that are already aligned. At each round, only a pairwise structure comparison is usually performed, to align the current consensus with a new structure.
Obviously, the above progressive MSTA framework is a greedy approach. Its performance depends on the underlying pairwise comparison method used, the order in which structures are progressively aligned, and the consensus structure maintained. Various heuristics have been exploited to find a good order for the progressive alignments. Note that this order can also be guided by a tree instead of a linear sequence, which removes the need to choose a seed structure. The progressive procedure may also be iterated several times to locally refine the multiple structure alignment.
2.2. Our Results

There are two main differences between the MCC problem we are interested in and the traditional MSTA problem. In the case of protein structures, it is usually explicitly or implicitly assumed that (the majority of) the input proteins belong to one family^a, or at least share some relations. As such, one can expect that some consensus of the family should exist. However, in our case, the set of curves comes from a set of simulations including both successful and unsuccessful runs, and we wish to classify this diverse set of curves and capture common features within as well as across its sub-families. Secondly, and more importantly, the level of similarity existing in these folding trajectories is usually much lower than that in a family of related proteins.
"How to classify a set of input structures into different families is a related problem, and many such classifications exist
12, 2 2 , 2 7 .
Fig. 1. Aligning five trajectories (IDs 1 to 5) using (a) a linear graph, and (b) a partial order graph. Symbols in the circles are the node IDs and numbers on edges are trajectory IDs. Note that the linear alignment in (a) will not be able to record the partial similarity between curves 3 and 4, which is maintained in (b) (i.e., node d).
Hence we aim at an algorithm with high sensitivity, which is able to detect small-scale partial similarity. In this paper, we propose and develop a sensitive MCC algorithm, called the EPO (enhanced partial order) algorithm, to compare a set of diverse high dimensional curves. Our algorithm follows a framework similar to the POA algorithm (Refs. 17, 35), encoding the similarities of aligned curves in a partial order graph, instead of in the linear structure used by many traditional MSTA algorithms. This has the advantage that, besides similarities among all curves, similarities among a subset of the input curves can also be encoded in this graph. See Figure 1 for an example, where nodes in both graphs represent a group of aligned points from the input curves. For the more important problem of sensitivity, we observe that, being a greedy approach, the progressive MSTA framework tends to be inherently insensitive to low levels of similarity: if one early local decision is wrong, it may completely miss a small-scale partial similarity. To improve this aspect of the performance of the progressive framework, we first propose a novel two-level scoring function to measure similarity, which, together with a clustering idea, greatly enhances the quality of the local pairwise alignment produced at each round. We then develop an effective merging step to post-process the obtained alignments. This step helps to reassemble vertices from input curves that should be matched together, but were scattered across several sub-clusters in the alignments due to earlier non-optimal
decisions. Both techniques are general and can be used to improve the performance of many existing MSTA algorithms. Experimental results show that our MCC algorithm is highly sensitive and able to classify input curves. We also demonstrate the power of our tool in mining critical events from protein folding trajectories using a detailed case study of the mini-protein Trp-cage. Although our EPO algorithm is developed for comparing folding trajectories, the algorithm is general and can be applied to other domains as well, such as protein structures or pedestrian trajectories extracted from surveillance videos (Ref. 34). EPO fits especially well in those applications where the level of similarity is low.
3. METHOD

In this section, we describe our EPO algorithm for comparing a set of possibly high dimensional general curves. If we are given a set of protein folding data, we first convert each folding trajectory to a high dimensional curve. In particular, a folding trajectory is a sequence of conformations (structures) of a protein chain, representing different states of this protein during the simulation of its folding process. We represent each conformation using the distance map between its alpha-carbon atoms^b so that it is invariant under rigid transformations. For example, if a protein contains n amino acids, then its distance map is an n x n matrix M where M[i][j] equals the distance between the ith and jth alpha-carbon atoms along the protein backbone. This matrix can then be considered as a point in n^2 dimensions.
"One can also encode the side-chain information into the high dimensional curves, or map a substructure into a high dimensional point.
This way, we map each trajectory of m conformations to a curve in R^(n^2) with m vertices. In the remaining part of this paper, we use the terms trajectories and curves interchangeably.

3.1. Notations and Algorithm Overview

Before we formally define the MCC problem, we introduce some necessary notation. Given a set of elements V = {v_1, ..., v_l}, a relation ≺ over V is transitive if v_i ≺ v_j and v_j ≺ v_k imply that v_i ≺ v_k. In this paper, we also refer to v_i ≺ v_j as a partial order constraint. A partial order graph (POG) G = (V, E) is a directed acyclic graph with V = {v_1, ..., v_l}, where v_i ≺ v_j if there is an edge (v_i, v_j). Note that by the transitivity of this relation, two nodes may have a partial order constraint even when there is no edge between them in G. Let R be the set of partial order constraints induced by G. We say that V is a partial order list w.r.t. G if for any v_i ≺ v_j ∈ R, we have that i < j. In other words, the linear order in V is a total order satisfying all partial order constraints induced from G. See Figure 2 for an example.
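The conversion described at the start of this section can be sketched as follows (illustrative only; the paper's actual preprocessing may differ in details):

    import numpy as np

    def conformation_to_point(ca_coords):
        """Flatten the n x n alpha-carbon distance map of one conformation
        into a point in R^(n^2); invariant under rigid transformations."""
        diff = ca_coords[:, None, :] - ca_coords[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).ravel()

    def trajectory_to_curve(conformations):
        """A trajectory of m conformations becomes a curve of m vertices."""
        return np.stack([conformation_to_point(c) for c in conformations])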
Fig. 2. A POG G of 5 nodes. Note that there is a partial order constraint a ≺ d even though there is no edge between them. Both {a, b, c, d, e} and {a, c, b, d, e} are valid partial order lists w.r.t. G.
Let T = {T_1, ..., T_N} be a set of N trajectories in R^d, where each trajectory T_i is an ordered sequence of n points p_1^i, ..., p_n^i.^c The goal of the MCC algorithm is to find aligned sub-sequences from T. More formally, an aligned node o is a collection of vertices from the T_i's, with at most one point from each T_i. Given a 3-tuple (T, τ, ε), where τ and ε are input thresholds, an alignment of T is a POG G with a corresponding set of partial order constraints R and a partial order list of aligned nodes O = {o_1, ..., o_L} such that the following three criteria are satisfied:
C1. |o_k| ≥ τ, for any k ∈ [1, L];
C2. for any p_j^i, p_j'^i' ∈ o_k, ||p_j^i - p_j'^i'|| ≤ ε;
C3. if p_j^i ∈ o_k1 and p_j'^i ∈ o_k2 with o_k1 ≺ o_k2, then j < j'.
(C1) indicates that the number of vertices of input curves aligned to each aligned node is greater than a size threshold τ, and (C2) requires that these aligned points are tightly clustered together (i.e., the diameter is bounded by a distance threshold ε). (C3) enforces that points in different aligned nodes still maintain their partial order along their respective trajectories. Our goal is to maximize L, the size of such an alignment O. See Figure 3(b) for an example of an alignment graph.
rithm has two stages (see Figure 3): (Sl) initial POG construction stage and (S2) merging stage. The first stage generates an initial alignment for 7,encoded in a POG G. The procedure has the same framework as the POA algorithm, but its performance, especially when the similarity is low, is significantly improved, via the use of a clustering preprocessing step and a new two-level scoring function. In the second stage, we develop a novel and effective procedure to merge nodes from G to produce aligned nodes with large size, and output a better final alignment G*. Below, we describe each stage in detail. 3.2. Initial POG Construction
Standard dynamic programming (DP) (Refs. 23, 31) is an ideal method for pairwise comparison between sequences. It produces the optimal alignment between two sequences with respect to a given scoring function. One can perform multiple sequence alignment progressively based on this DP pairwise comparison method. Roughly speaking, in the ith round of the algorithm, the alignment of the first i - 1 sequences is represented by a consensus sequence. The algorithm then updates this consensus by aligning it with the ith sequence S_i using the standard DP algorithm. Information from S_i that is not aligned to the consensus sequence is essentially lost. See Figure 1(a). The partial order alignment (POA) algorithm (Ref. 17) greatly alleviates this problem by encoding the consensus in a POG instead of a linear sequence (see Figure 1(b)).
^c For simplicity, we assume without loss of generality that all T_i's have the same length n.
Fig. 3. Symbols inside the circles are the node IDs. The table associated with each node encodes the set of points aligned to it. In particular, each row represents a point with its trajectory ID (T column) and its index along the trajectory (S column). In (a), a POG is initialized by the trajectory T_1. An example of a POG after aligning a few trajectories is shown in (b). Note that a new node/branch is created when a point cannot be aligned to any existing node. For example, node e is created when p_3^2 (i.e., the 3rd point of T_2) is inserted. (c) shows the POG after merging point p_2^1 from node b into node e, constrained by the distance threshold ε.
In particular, the alignment of S_1, ..., S_{i-1} is encoded in a partial order graph G_i, which is then updated by aligning it with S_i. The alignment between G_i and S_i can still be achieved by a DP algorithm. The main difference is that in this DP procedure, to find the optimal score of aligning a node u ∈ G_i and an element s ∈ S_i, one has to inspect the alignment between all parents of u in G_i and the parent of s in S_i. The POA algorithm reduces the influence of the order of the sequences aligned, and is able to capture alignments between a subset of sequences. More details of the POA algorithm can be found in Refs. 17 and 35.
’.
In our case, each trajectory is mapped to an ordered sequence of points (i.e., a polygonal curve), and a similar algorithm can be applied to our trajectory data, where instead of the usual 1-D sequences, we now have d-D sequences.^d Below we explain the two main differences between our EPO algorithm and the POA algorithm.

^d Since each point corresponds to the distance map of a conformation, no transformation is needed when comparing such curves.
3.2.1. Size of POG
The first problem with the current POA algorithm is that the size of the POG maintained expands quickly when the level of similarity is low. For example, suppose we are updating the current POG G_i to G_{i+1} by aligning it with a new curve T_i. If a point p ∈ T_i cannot be aligned to any node in G_i, then it creates a new node in G_{i+1}, as this node may potentially be aligned later with the remaining curves. Consequently, if the similarity is sparse, many new nodes are created without producing significantly aligned nodes later, and the size of the POG
increases rapidly. This induces high computational complexity. To address this problem, our algorithm first preprocesses all points from the input curves T by clustering them into groups (Ref. 13), the diameter of each of which is smaller than a user-defined threshold, fixed as the distance threshold ε in our experiments. We keep only those clusters whose size is greater than a certain threshold (τ/2 in our experiments), and collect their centers in C = {c_1, ..., c_r}, which we refer to as the set of canonical cluster centers. Intuitively, C provides a synopsis of the input curves and represents potentially aligned nodes. If, in the process of aligning T_i with G_i, a point p ∈ T_i is not aligned to any node in G_i, then we insert a new node in G_{i+1} only if p is within ε of some canonical center from C. If p is far from all the canonical cluster centers, then there is little chance that p can form a significant alignment with points from later curves, as that would have implied that p belongs to a dense cluster. The set of canonical cluster centers will also contribute to the scoring function described below.
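The preprocessing step can be sketched with a simple leader-style pass standing in for the clustering of Ref. 13 (a simplification, not the paper's exact method):

    import numpy as np

    def canonical_centers(points, eps, tau):
        """Greedily group points so each group's diameter stays below eps,
        keep the dense groups (size > tau/2), and return their centroids."""
        groups = []  # (leader, members) pairs
        for p in points:
            for leader, members in groups:
                # closeness to the leader bounds the group diameter by eps
                if np.linalg.norm(p - leader) <= eps / 2:
                    members.append(p)
                    break
            else:
                groups.append((p, [p]))
        return [np.mean(members, axis=0)
                for leader, members in groups if len(members) > tau / 2]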
3.2.2. Scoring Function
The choice of the scoring function used when aligning G_i = (V_i, E_i) with T_i is in general a crucial aspect of an alignment algorithm. Given a point p ∈ T_i and a node o ∈ G_i, let δ(o, p) be the similarity between p and o, the definition of which will be described shortly. The score of aligning p with o is usually
defined as:

    Score(o, p) = max{ max_{(o',o) ∈ E_i} (Score(o', q) + δ(o, p)),
                       max_{(o',o) ∈ E_i} Score(o', p),
                       Score(o, q) }                                    (1)
where q is the predecessor (i.e., parent) of the point p along T_i, and o' ranges over all predecessors of o in the POG G_i. It is easy to verify that such scores can be computed by a dynamic programming procedure, owing to the inherent order existing in both the trajectory and the POG. A common way to define δ(o, p), the similarity between o and p, is as follows. Assume that each node o is associated with a node center ω(o) representing all the points aligned to this node. Then δ(o, p) is defined in terms of the distance ||p - ω(o)||, being positive when p lies within distance ε of ω(o) and zero otherwise.
An alternative way to view this is that each node o has an influence region of radius ε around its center. A point p can be aligned to a node o only if it lies within the influence region of o. In order to align as many points as possible, it is intuitively desirable that the influence regions of the nodes in the current POG cover as much space as possible. Natural choices for the node center ω(o) of o include using a canonical cluster center computed earlier, or the center of the minimum enclosing ball of the points already aligned to this node (or some weighted variant of it). The advantage of the former is that canonical cluster centers tend to spread apart, which helps to increase coverage. Furthermore, the canonical cluster centers serve as good candidates for node centers because we already know that there are many points around them. The disadvantage is that it does not consider the distribution of points aligned to the node. See Figure 4, where without considering the distribution of points aligned to o_a and o_b, the new point p will be aligned to o_b even though o_a is a better choice. Using the center of the minimum enclosing ball alleviates this problem. However, such centers depend heavily on the order of the curves aligned, and the influence regions of nodes produced this way tend to overlap much more than when using the canonical cluster centers. We combine the advantages of both approaches into the following two-level scoring function for measuring similarities.

More specifically, for a node o, let q be the first point aligned to this node. This means that at the time we examined q, it could not be aligned to any existing node in the POG. Let c_k ∈ C be the nearest canonical cluster center of q; recall that the node o was created because ||q - c_k|| ≤ ε. We add c_k as a point aligned to this node, and at any time, the center of the minimum enclosing ball of the currently aligned points, including c_k, is used as the node center ω(o). Now let D(o) be the diameter of the points currently aligned to o. We define:

    δ(o, p) = 2ε   if ||p - ω(o)|| < D(o)
            = ε    else if ||p - ω(o)|| < ε                             (2)
            = 0    otherwise
In other words, the new scoring function encourages centering points around previously computed cluster centers, thus tending to reduce overlaps between the influence regions of different nodes. Furthermore, it gives a higher similarity score to points that are more tightly grouped together with those already aligned at the current node, addressing the problem shown in Figure 4. Our experimental tests have shown that this two-level scoring function significantly outperforms the ones using either only the canonical centers or only the centers of minimum enclosing balls. We remark that it is possible to use variants of the above two-level scoring function, such as making it continuous (instead of a step function). We choose the current form for its simplicity. Furthermore, experiments show that there is only a marginal difference if we use the continuous version.
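Equation (2) translates directly into code. A minimal sketch, assuming the node center ω(o) and diameter D(o) are tracked as described above:

    import numpy as np

    def delta(p, node_center, node_diameter, eps):
        """Two-level similarity of Eq. (2): highest reward when p falls
        within the diameter of the points already aligned to the node,
        a smaller reward anywhere inside the node's influence region."""
        d = np.linalg.norm(p - node_center)
        if d < node_diameter:
            return 2 * eps
        if d < eps:
            return eps
        return 0.0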
Fig. 4. Empty and solid points are aligned to the nodes o_a and o_b, respectively. For a new point p (the star), although it is closer to ω(o_b), it is better grouped with the points aligned to o_a. Hence ideally, it should be aligned to o_a instead of o_b.
3.3. Merging Stage

In the first stage, we applied a progressive method to align each trajectory onto an alignment graph one by one. In the ith iteration, a point from T_i is either aligned to the best matched node in the current POG G_i, or a new node is created containing this point and the corresponding canonical cluster center. After processing all of the N trajectories in order, we return the final POG G = G_N. In the second stage of our EPO algorithm, we further improve the quality of the alignment in G using a novel merging process. Given the greedy nature of the POA algorithm, the alignment obtained in G is not optimal and depends on the alignment order. Furthermore, given that the influence regions of different nodes may overlap, no matter how we improve the scoring function, it is sometimes simply ambiguous to decide locally where to align a new point, and a wrong decision may have grave consequences later.
Fig. 5. Empty and solid points are aligned to the nodes o_a and o_b, respectively, while the points in the dotted region should be grouped together.
For example, see Figure 5, where the set of points P (enclosed in the dotted circle) should have been aligned to one node. However, suppose the nodes o_a and o_b already exist before any point in P is inserted. Then, as points from P come in, it is rather likely that they are distributed evenly into both o_a and o_b. This problem becomes much more severe in higher dimensions, where P can be distributed among several nodes whose centers are well separated around P, but whose influence regions still cover some points from P (the number of such regions grows exponentially w.r.t. the dimension d). Hence, instead of being captured in one heavily aligned node, P is broken into several nodes of small size. Our experimental tests confirm that this happens rather commonly in the POA algorithm.
To address this problem, we propose a novel post-processing of G. The goal is to merge qualified points from neighboring, less-aligned nodes to augment more heavily loaded nodes. In particular, the following two invariants are maintained during the merging process:
(I1) At any time, the diameter of the target node is still bounded by the distance threshold ε;
(I2) The partial order constraints induced by the POG are always consistent with the order of points along each trajectory.

The second criterion means that at any time in the POG G', if p ∈ o_1, q ∈ o_2, p, q ∈ T_i, and p precedes q along the trajectory T_i, then either o_1 ≺ o_2, or there is no partial order relation between them. In other words, the resulting POG still corresponds to a valid alignment of T with respect to the same thresholds. As an example, see Figure 3, where the point p_2^1 (i.e., the second point of T_1) in node b in (b) is moved to node e in (c). Note that the graph is also updated to reflect the change (the dashed edge in (c)), in order to maintain the invariants (I1) and (I2). When all points aligned to a node o are merged into other nodes (i.e., o becomes empty), we delete o, and its successors in the POG then become the successors of its parent.
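A sketch of the invariant check behind mergeOK() in Algorithm 3.1 below; the POG interface used here (diameter_with, precedes, trajectory_points) is hypothetical scaffolding, not the paper's API:

    def merge_ok(point, dst, pog, eps):
        """Can `point` move into node `dst` while keeping (I1) and (I2)?"""
        # (I1): the augmented target node must still have diameter <= eps.
        if pog.diameter_with(dst, point) > eps:
            return False
        # (I2): for every other point q on the same trajectory, the partial
        # order between dst and q's node must agree with the order of the
        # two points along that trajectory.
        for q, node_q in pog.trajectory_points(point.traj_id):
            if q.index < point.index and pog.precedes(dst, node_q):
                return False
            if q.index > point.index and pog.precedes(node_q, dst):
                return False
        return True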
Algorithm 3.1: MergingProcess(G = {o_1, ..., o_m, ...}, with |o_m| ≥ |o_{m+1}|)

    while significant progress is made
        for each o_m ∈ G in increasing order of m
            for each neighbor o_n with |o_n| < |o_m|
                for each t ∈ o_n
                    if mergeOK() == true
                        then merge t: o_n → o_m

    \\ mergeOK() checks whether the two invariants can be maintained
    \\ if the candidate merging operation is performed

Fig. 6. The merging algorithm.
A high-level pseudocode of the merging process is shown in Figure 6. It augments better aligned nodes from the current POG G by processing the nodes with larger size first. We perform this procedure a few times, until there is no significant increase in the quality of the resulting alignment. In practice, to speed up the algorithm, we merge neighbors into a node o only if its size is greater than some threshold, as otherwise there is a low probability that o will become a heavy node later.

4. EXPERIMENTAL RESULTS
In this section, we report a systematic performance study on a biological dataset that contains 200 molecular dynamics simulations. The experiments achieve the following goals. First, we show that the quality of the alignments produced by our EPO algorithm is significantly better than that of the original POA algorithm. Second, we demonstrate the effectiveness of our algorithm by applying it to real protein simulation data and obtaining biologically meaningful results that are consistent with previous discoveries (Ref. 28). The algorithm is implemented in Java.

4.1. Background of Dataset
Our input dataset includes 200 simulated folding trajectories for a particular protein called Trp-cage. The dataset is provided by Ota's lab (Ref. 28). The folding simulations were performed at 325 K using the AMBER99 force field with a small modification and the generalized Born implicit solvent model. Trp-cage (see Figure 7) is a mini-protein consisting of 20 amino acids. It has been widely used for folding studies because of its short, simple sequence and its quick folding kinetics. Following the definition from Ref. 15, a successful folding event has to satisfy the following two criteria:

• The RMSD of a conformation from the native NMR structure (Ref. 24) is less than 2 Å.
• A subsequence of such near-native conformations holds for at least 200 ps.
In Ref. 28, 58 successful folding trajectories reaching successful folding events are identified, and each trajectory includes 101 successive conformations sampled at 20 ps intervals. Furthermore, there are two crucial observations in Ref. 28 that we will examine in our experiments. First, before moving to the native conformation, a "ring" sub-structure (see Figure 7) has to be formed. Second, the distinction between native and pseudo-native conformations relies heavily on the side-chain positions of the "ring" sub-structure. Ref. 28 obtained the above results by aligning each pair of trajectories first and then applying a neighbor-joining method to group similar trajectories together. However, this semi-automatic approach requires dedicated expert knowledge. The following experiments, applied to the same dataset, show that our EPO algorithm can automatically detect the above folding events with little prior knowledge.
Fig. 7. NMR structure of the Trp-cage protein (PDB ID 1L2Y). Labels mark amino acids (AAs). AA2 to AA7 roughly form an alpha-helix. AA2 to AA19 form a ring-type structure; in particular, AA2 to AA5 and AA16 to AA19 form the "neck" of this ring.
4.2. Experimental Setting
In order to be consistent with the results from 28, we select all 58 successful folding trajectories and call this set SuccData. We also randomly select 58 unsuccessful folding trajectories, each containing 101 conformations, and collect them in a set called FailData. The union of the successful and unsuccessful data is referred to as MixData. We set the distance threshold ε = 1 Å and τ = 40 in the following experiments, unless specified otherwise.

4.3. Investigation on Entire Protein Structure
In the first set of experiments, we convert each conformation to a high-dimensional point based on the distance matrix between all of its alpha-carbon atoms. Figure 8 compares the quality of the alignments of the SuccData produced by the POA algorithm, our EPO algorithm without the merging procedure (EPO-NoMerge), and the full EPO algorithm. It shows the number of aligned nodes (y-axis) versus the size of the aligned nodes (x-axis). Note that EPO-NoMerge is essentially POA with a clustering preprocessing and the new two-level scoring function.
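The conversion from a conformation to a point is straightforward: the pairwise alpha-carbon distances are flattened into a single vector. Below is a minimal numpy sketch, assuming the conformation is given as an n x 3 coordinate array; the function name is ours, for illustration only.

    import numpy as np

    def conformation_to_point(ca_coords):
        """Map an (n, 3) array of alpha-carbon coordinates to a point in
        n*n-dimensional space via the flattened pairwise distance matrix."""
        diff = ca_coords[:, None, :] - ca_coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # (n, n) distance matrix
        return dist.ravel()                        # high-dimensional point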
Fig. 8. Distribution of aligned nodes produced by the EPO algorithm, EPO-NoMerge (i.e., the first stage of the EPO algorithm), and the traditional POA algorithm. The histogram shows the number of aligned nodes (y-axis) versus the size of the aligned nodes (x-axis).
The similarity level between these trajectories is low (i.e., the number of aligned nodes with large size is small). It is clear from this histogram that our EPO algorithm significantly outperforms the other two by producing more aligned nodes with large sizes. The comparison between EPO and EPO-NoMerge demonstrates the effectiveness of our merging procedure, and the fact that EPO-NoMerge is better than POA shows that the two-level scoring function as well as the clustering preprocessing greatly enhance the performance. We have also performed experiments showing that, compared to the POA algorithm, EPO-NoMerge is much less sensitive to the order in which curves are aligned; we omit those results due to lack of space. Comparing the three algorithms over the MixData produces similar results, and the majority of points aligned to heavy nodes (i.e., |o| ≥ 40) are from successful runs. We also observe that most of the heavily aligned nodes are close to the end of the trajectories for the SuccData. In fact, many aligned points have conformation IDs around and greater than 90, which is indeed the time when the folding starts to stabilize. More specifically, consider the set of aligned nodes of size greater than 40 for the SuccData. Among all points aligned to these nodes, 67.2% have a conformation ID greater than 90, and 24.4% have an ID between 80 and 90. This implies that our algorithm has the potential to detect the stabilization of successful folding events in an automatic manner. It also implies that using the entire protein structure may be too coarse to detect critical folding events, as they are usually induced by small key motifs. In what follows, we map only a substructure of the input protein into a high-dimensional point and provide a more detailed analysis of this folding data.

4.4. Investigation on Substructures
It is usually believed that certain critical motifs play important roles in stabilizing the whole structure during the folding process 18, 33. We wish to have a tool that can identify such critical motifs (substructures) automatically. We define a candidate motif to be two subchains of Trp-cage, each of length 4. These two pieces induce a sub-window in the distance map of each conformation of the protein. We further require that the number of contacts in this sub-window, with respect to the distance map of the native structure, is at least 4, where a contact corresponds to two alpha-carbon atoms within distance 6 Å. For each conformation of a candidate motif^e, we consider the distance matrix between its alpha-carbon atoms as before, and convert the folding trajectory of this motif into a curve in the 4 × 4 = 16 dimensional space. In order to be more discriminative, we also introduce a side-chain weighting factor α, ranging from 0 to 1, to include side-chain information when comparing two high-dimensional points; α = 0 means that side-chain information is completely ignored. We perform our EPO algorithm on both the SuccData and the MixData, and two motifs especially stand out.
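The candidate-motif filter described above can be expressed compactly. The sketch below, assuming the native distance map is available as an n x n numpy array, enumerates pairs of length-4 subchains and counts native contacts in the induced 4 x 4 sub-window; the names and the non-overlap condition are our illustrative assumptions.

    import numpy as np

    def candidate_motifs(native_dist, L=4, contact_cutoff=6.0, min_contacts=4):
        """Yield (i, j) index pairs of two length-L subchains whose induced
        sub-window of the native distance map has >= min_contacts contacts."""
        n = native_dist.shape[0]
        for i in range(n - L + 1):
            for j in range(i + L, n - L + 1):      # disjoint subchains
                window = native_dist[i:i + L, j:j + L]
                if (window < contact_cutoff).sum() >= min_contacts:
                    yield (i, j)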
4.4.1. Alpha-Helix Substructure

The first motif corresponds to an alpha-helix substructure. In Figure 7, the successive amino acids No. 2-7 form an alpha-helix, a simple, self-contained secondary structure element (SSE) 24. From the results returned by our EPO algorithm, we note
^e Roughly speaking, for every conformation, we also record for each residue the relative position of the centroid of its side-chain with respect to its alpha-carbon atom. This provides another high-dimensional point that we call a side-chain point. The distance between two conformations then incorporates the distance between their side-chain points, weighted by the side-chain weighting factor.
that this alpha-helix is formed rather early and consistently in both successful and unsuccessful runs. Once formed, it remains stable. This is consistent with the common understanding that, owing to its chemical properties, the alpha-helix is a stable secondary structure that can form quickly. Hence the formation of the alpha-helix cannot be used to differentiate successful runs from unsuccessful ones.

4.4.2. Ring Substructure
The second motif corresponds to the neck of a ring structure. In particular, it consists of the subchains of amino acids No. 2-5 and No. 16-19. The following results demonstrate that EPO can automatically not only find but also track the formation of such a fingerprint substructure (critical motif).

Table 1. EPO on the ring structure (MixData). Columns 1-3 show the size of an aligned node (i.e., the number of points aligned to this node) from MixData, SuccData, and FailData, respectively. Column 4 shows the diameter of this node (note that the distance threshold ε = 1 Å means that the diameter of a node can be up to 2 Å).

    |o| (MixData)   SuccData   FailData   D(o) (Å)
    49              22         27         1.852
    45              17         28         1.798
    41              12         29         2.189
    40               9         31         1.447
    48              31         16         1.761
    40              32          8         1.322
    47              13         34         1.133
    42               7         35         1.923
    44               8         36         0.873
    49               7         42         1.428
    54               6         48         1.020
    59              50          9         1.294
    60              51          9         0.932
    56              52          4         1.255
    62              56          6         1.782
    62              58          4         1.503
First, we observe from Table 1 that when applying the EPO algorithm to the MixData (with the side-chain weighting factor α = 0.9), significant alignments involve mainly trajectories from SuccData. For example, the last row of Table 1 shows that among the 62 points (from 62 trajectories) aligned to a particular node, 58 are from SuccData and the remaining 4 are from FailData. Hence this motif is potentially critical to the success of the folding of Trp-cage. It also suggests that we can automatically
classify the MixData into SuccData and FailData with few false positives based on this ring-neck motif, whereas previously the classification in the input data was obtained by a few expert-defined rules. Second, when the side-chain weighting factor α = 0.9, we note that 49.6% of the significant aligned nodes are formed before conformation ID 85 (compare with the results from Section 4.3). For example, there are two aligned nodes from the successful runs where 80% of the points (i.e., trajectories) aligned to them have a conformation ID between 75 and 85. This implies that the complete formation of this ring-neck usually immediately precedes the stabilization of the folding structure (which occurs at roughly conformation ID 90 for successful trajectories). If we reduce the side-chain weighting factor α to 0.5, we naturally find more aligned nodes. In particular, besides the cluster with conformation IDs around 80, we observe more significant clusters involving conformations with IDs from 50 to 70. By comparing the conformations of the ring-neck motif in these clusters with those in the aligned nodes around ID 80, we found that the backbone structures are rather similar, but the side-chains have different orientations. In other words, the shape of the ring-neck motif is first stabilized by the backbone structure, and then the side-chains gradually move into the right positions. There are a few trajectories where the side-chains eventually move to the mirror image of their correct positions, leading to pseudo-native conformations that can only be detected when considering the side-chains. The above results are consistent with the results from 28, where such a ring-shaped substructure was discovered semi-automatically by pairwise structure comparisons together with expert knowledge.
4.5. Timing of EPO

The above experiments were performed on a Windows XP machine with a 1.5 GHz CPU and 512 MB of memory. The running time of the experiments on the entire protein structure is about 30 minutes, and that on the small motifs is about 20 minutes. We believe that the running time of the current algorithm can be significantly improved by optimizing our code. The merging stage is the most time-consuming component and takes about three quarters of the total time.
5. CONCLUSIONS

In this paper we proposed and developed EPO, an effective multiple curve comparison method, to analyze protein folding trajectories. Our new method greatly improves on the performance of the POA algorithm by using a clustering preprocessing, a more discriminative two-level scoring function, and a novel merging post-processing procedure. It can detect low levels of similarity among input curves. We demonstrated the effectiveness of our method by applying it to a set of simulated folding trajectories of the mini-protein Trp-cage.

Currently, we have only experimented with the EPO algorithm on a mini-protein (Trp-cage). One immediate question is to understand the scalability of the EPO algorithm for larger proteins or longer trajectories. In particular, a larger protein means a curve of higher dimension. From our current experiments, the EPO algorithm seems to scale linearly with the dimension. Furthermore, in practice, it is likely that we would only run the algorithm on small motifs. For longer trajectories, our algorithm seems to scale quadratically. However, further experiments are necessary to investigate these scalability issues.

There is some previous work that analyzes protein folding trajectories by collecting various statistics on measures such as the contact number (i.e., the number of native contacts) of each conformation along a trajectory and the URMS distance between a conformation and the native structure 14. One way to view this is that a trajectory is mapped into time-series data representing the evolution of, say, the number of native contacts, which can be considered a one-dimensional curve. In this regard, we can use our EPO algorithm to analyze a collection of such curves induced by one measure. In general, there may be multiple measures, geometric or physico-chemical, that a user may wish to inspect. Hence it is highly desirable to build a framework for analyzing folding trajectories that can incorporate these multiple measures and that also allows new properties to be added easily. This is one important future direction for us.
Finally, we remark that compared to other multiple curve alignment algorithms, our algorithm is especially effective at capturing low levels of similarity. As such, another important future direction is to apply similar techniques to classifying protein structures, as well as to extracting structural motifs from a family of proteins that may not share high global similarity.
ACKNOWLEDGMENT

The work was supported in part by the Department of Energy (DOE) under grant DE-FG02-06ER25735 and in part by the National Science Foundation under grant IIS-0546713.

References
1. V. I. Abkevich, A. M. Gutin, and E. I. Shakhnovich. Specific nucleus as the transition state for protein folding: evidence from the lattice model. Biochemistry, 33:10026-10036, 1994.
2. J. M. Borreguero, F. Ding, S. V. Buldyrev, H. E. Stanley, and N. V. Dokholyan. Multiple folding pathways of the SH3 domain. ArXiv Physics e-prints, May 2003.
3. L. P. Chew and K. Kedem. Finding the consensus shape for a protein family (extended abstract).
4. F. Chiti, N. Taddei, P. M. White, M. Bucciantini, F. Magherini, M. Stefani, and C. M. Dobson. Mutational analysis of acylphosphatase suggests the importance of topology and contact order in protein folding. Nature Struct. Biol., 6:1005-1009, 1999.
5. N. V. Dokholyan, S. V. Buldyrev, H. E. Stanley, and E. I. Shakhnovich. Molecular dynamics studies of folding of a protein-like model. Fold. Design, 3:577-587, 1998.
6. R. Du, V. S. Pande, A. Y. Grosberg, T. Tanaka, and E. Shakhnovich. On the role of conformational geometry in protein folding. Journal of Chemical Physics, 111:10375-10380, Dec. 1999.
7. M. Gerstein and M. Levitt. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Science, 7:445-456, 1998.
8. J. F. Gibrat, T. Madej, and S. H. Bryant. Surprising similarities in structure comparison. Curr. Opin. Struct. Biol., 6(3):377-385, 1996.
9. C. Grasso and C. Lee. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics, 20(10):1546-1556, 2004.
10. C. Guda, S. Lu, E. D. Scheeff, P. E. Bourne, and I. N. Shindyalov. CE-MC: a multiple protein structure alignment server. Nucleic Acids Research, 32:W100-W103, 2004.
11. L. Holm and C. Sander. Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233:123-138, 1993.
12. L. Holm and C. Sander. Dali/FSSP classification of three-dimensional protein folds. Nucleic Acids Res., 25(1):231-234, 1997.
13. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264-323, 1999.
14. K. Kedem, L. Chew, and R. Elber. Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins: Structure, Function and Genetics, 37:554-564, 1999.
15. R. Koike, K. Kinoshita, and A. Kidera. Ring and zipper formation is the key to understanding the structural variety in all-beta proteins. FEBS Letters, 533:9-13, 2003.
16. E. Krissinel and K. Henrick. Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Cryst., D60:2256-2268, 2004.
17. C. Lee, C. Grasso, and M. Sharlow. Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3):452-464, 2002.
18. C. Levinthal. Are there pathways for protein folding? J. Chim. Phys., 65:44-45, 1968.
19. S. W. Lockless and R. Ranganathan. Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286(5438):295-299, October 1999.
20. D. Lupyan, A. Leo-Macias, and A. R. Ortiz. A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics, 21(15):3255-3263, 2005.
21. M. J. Sutcliffe, I. Haneef, D. Carney, and T. L. Blundell. Knowledge-based modelling of homologous proteins, part I: three-dimensional frameworks derived from the simultaneous superposition of multiple structures. Protein Engineering, 1(5):377-384, 1987.
22. A. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-540, 1995.
23. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453, 1970.
24. J. Neidigh, R. Fesinmeyer, and N. Andersen. Mini-proteins Trp the light fantastic (PDB ID 1L2Y). Nat. Struct. Biol., 9(6):425-430, June 2002.
25. M. E. Ochagavia and S. Wodak. Progressive combinatorial algorithm for multiple structural alignments: application to distantly related proteins. Proteins, 55:436-454, 2004.
26. C. A. Orengo. CORA: topological fingerprints for protein structural families. Protein Science, 8:699-715, 1999.
27. C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M. Thornton. CATH: a hierarchic classification of protein domain structures. Structure, 5(8):1093-1108, 1997.
28. M. Ota, M. Ikeguchi, and A. Kidera. Phylogeny of protein-folding trajectories reveals a unique pathway to native structure. PNAS, 101(51):17658-17663, December 2004.
29. E. Sandelin. Extracting multiple structural alignments from pairwise alignments: a comparison of a rigorous and a heuristic approach. Bioinformatics, 21(7):1002-1009, 2005.
30. I. N. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11(9):739-747, 1998.
31. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195-197, 1981.
32. W. Taylor and C. Orengo. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol., 266:617-635, 1996.
33. P. Wolynes, J. Onuchic, and D. Thirumalai. Navigating the folding routes. Science, 267:1619-1620, 1995.
34. Y. Caspi and M. Irani. Spatio-temporal alignment of sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(11):1409-1424, 2002.
35. Y. Ye and A. Godzik. Multiple flexible structure alignment using partial order graphs. Bioinformatics, 21(10):2362-2369, 2005.
fRMSDPred: PREDICTING LOCAL RMSD BETWEEN STRUCTURAL FRAGMENTS USING SEQUENCE INFORMATION
Huzefa Rangwala* and George Karypis
Computer Science & Engineering, University of Minnesota, Minneapolis, MN 55455
Email: rangwala, [email protected]
*Corresponding author.
Abstract: The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this paper focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared to the profile-to-profile scoring schemes.

Keywords: structure prediction, comparative modeling, machine learning, classification, regression
1. INTRODUCTION

Over the years, several computational methodologies have been developed for determining the 3D structure of a protein (target) from its linear chain of amino acid residues 23, 31. Among them, approaches based on comparative modeling 27 are the most widely used and have been shown to produce some of the best predictions when the target has some degree of homology with proteins of known 3D structure (templates) 3, 42. The key idea behind comparative modeling approaches is to align the sequence of the target to the sequence of one or more template proteins and then construct the target's structure from the structure of the template(s) using the alignment(s) as a reference. Thus, the construction of high-quality target-template alignments plays a critical role in the overall effectiveness of the method, as it is used both to select the suitable template(s) and to build good reference alignments. The overall performance of comparative modeling approaches will be significantly improved if the target-template alignment, constructed by considering sequence and sequence-derived information, is as close as possible to the structure-based alignment between the two proteins.

The development of increasingly more sensitive target-template alignment algorithms 25 that incorporate profiles, profile-to-profile scoring functions 5, 17, and predicted secondary structure information 13, 24 has contributed to the continuous success of comparative modeling 37, 38. The dynamic-programming-based algorithms 19, 33 used in target-template alignment are also used by many methods to align a pair of protein structures. However, the key difference between these two problem settings is that, while the target-template alignment methods score a pair of aligned residues using sequence-derived information, the structure alignment methods use information derived from the structure of the protein. For example, structure alignment methods like CE 30 and MUSTANG 15 score a pair of residues by considering how well fixed-length fragments (i.e., short contiguous backbone segments) centered around each residue align with each other. This score is usually computed as the root mean squared deviation (RMSD) of the optimal superimposition of the two fragments. In this paper, motivated by the alignment requirements of comparative modeling approaches and the operational characteristics of protein structure alignment algorithms, we focus on the problem of estimating the RMSD value of a pair of protein fragments by considering
only sequence-derived information. Besides its direct application to target-template alignment, accurate estimation of these fragment-level RMSD values can also be used to solve a number of other problems related to protein structure prediction, such as identifying the best template by assessing the quality of target-template alignments and identifying high-quality segments of an alignment.

We present algorithms to solve the fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates sequence-derived information in the form of position-specific profiles and predicted secondary structure 14. This information is effectively encoded in fixed-length feature vectors. We develop and test novel second-order pairwise exponential kernel functions designed to capture the conserved signals of a pair of local windows centered at each of the residues, and we use a fusion-kernel-based approach to incorporate the profile- and secondary-structure-based information.

An extensive experimental evaluation of the algorithms and their parameter space is performed using a dataset of residue-pairs derived from optimal sequence-based local alignments of known protein structures. Our experimental results show that there is a high correlation (0.681-0.768) between the estimated and actual fragment-level RMSD scores. Moreover, the performance of our algorithms is considerably better than that obtained by state-of-the-art profile-to-profile scoring schemes when used to solve the fragment-level RMSD prediction problems.

The rest of the paper is organized as follows. Section 2 provides key definitions and notations used throughout the paper. Section 3 formally defines the fragment-level RMSD prediction and classification problems and describes their applications. Section 4 describes the prediction methods that we developed. Section 5 describes the datasets and the various computational tools used in this paper. Section 6 presents a comprehensive experimental evaluation of the methods developed. Section 7 summarizes some of the related research in this area. Finally, Section 8 summarizes the work and provides some concluding remarks.
2. DEFINITIONS AND NOTATIONS

Throughout the paper we will use X and Y to denote proteins, x_i to denote the i-th residue of X, and π(x_i, y_j) to denote the residue-pair formed by residues x_i and y_j. Given a protein X of length n and a user-specified parameter w, we define wmer(x_i) to be the (2w+1)-length contiguous subsequence of X centered at position i (w < i ≤ n - w). Similarly, given a user-specified parameter v, we define vfrag(x_i) to be the (2v+1)-length contiguous substructure of X centered at position i (v < i ≤ n - v). These substructures are commonly referred to as fragments 30, 15. Without loss of generality, we represent the structure of a protein using the C_α atoms of its backbone. The wmers and vfrags are fixed-length windows that are used to capture information about the sequence and structure around a particular sequence position, respectively.

Given a residue-pair π(x_i, y_j), we define fRMSD(x_i, y_j) to be the structural similarity score between vfrag(x_i) and vfrag(y_j). This score is computed as the root mean square deviation between the pair of substructures after optimal superimposition. A residue-pair π(x_i, y_j) will be called reliable if its fRMSD is below a certain value (i.e., there is a good structural superimposition of the corresponding substructures). Finally, we will use the notation ⟨a, b⟩ to denote the dot-product operation between vectors a and b.
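For concreteness, fRMSD can be computed with the standard Kabsch superposition. The numpy sketch below is our illustration rather than the authors' code: it optimally superimposes two C-alpha fragments of equal length and returns their RMSD.

    import numpy as np

    def frmsd(frag_a, frag_b):
        """RMSD of two (m, 3) C-alpha fragments after optimal superimposition
        (Kabsch algorithm): center both, then rotate one onto the other."""
        a = frag_a - frag_a.mean(axis=0)
        b = frag_b - frag_b.mean(axis=0)
        u, s, vt = np.linalg.svd(a.T @ b)
        d = np.sign(np.linalg.det(u @ vt))     # guard against reflections
        rot = u @ np.diag([1.0, 1.0, d]) @ vt
        diff = a @ rot - b
        return np.sqrt((diff ** 2).sum() / len(a))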
3. PROBLEM STATEMENT

The work in this paper is focused on solving the following two problems related to predicting the local structural similarity of residue-pairs.

Definition 3.1 (fRMSD Estimation Problem) Given a residue-pair π(x_i, y_j), estimate the fRMSD(x_i, y_j) score by considering information derived from the amino acid sequences of X and Y.

Definition 3.2 (Reliability Prediction Problem) Given a residue-pair π(x_i, y_j), determine whether it is reliable
or not by considering only information derived from the amino acid sequences of X and Y.

It is easy to see that the reliability prediction problem is a special case of the fRMSD estimation problem. As such, it may be easier to develop effective solution methods for it, which is why we treat it as a separate problem in this paper.

The effective solution of these two problems has four major applications to protein structure prediction. First, given an existing alignment between a (target) protein and a template, a prediction of the fRMSD scores of the aligned residue-pairs (or their reliability) can be used to assess the quality of the alignment and potentially select among different alignments and/or different templates. Second, fRMSD scores (or reliability assessments) can
be used to analyze different protein-template alignments in order to identify high-quality moderate-length fragments. These fragments can then be used by fragment-assembly-based protein structure prediction methods like TASSER 41 and ROSETTA 26 to construct the structure of a protein. Third, since residue-pairs with low fRMSD scores are good candidates for alignment, the predicted fRMSD scores can be used to construct a position-to-position scoring matrix between all pairs of residues in a protein and a template. This scoring matrix can then be used by an alignment algorithm to compute a high-quality alignment for structure prediction via comparative modeling. Essentially, this alignment scheme uses predicted fRMSD scores in an attempt to mimic the approach used by various structural alignment methods 15, 30. Fourth, the fRMSD scores (or reliability assessments) can be used as input to other prediction tasks such as remote homology prediction and/or fold recognition.

In this paper we study and evaluate the feasibility of solving the fRMSD estimation and reliability prediction problems for residue-pairs that are derived from optimal local sequence alignments. As a result, our evaluation focuses on the first two applications discussed in the previous paragraph (assessment of target-template alignments and identification of high-confidence alignment regions). However, the methods developed can also be used to address the other two applications.
4. METHODS

We approach the problems of distinguishing reliable/unreliable residue-pairs and estimating their fRMSD scores within a supervised machine learning framework, and we use support vector machines (SVMs) 36 to solve them. Given a set of positive residue-pairs A+ (i.e., reliable) and a set of negative residue-pairs A- (i.e., unreliable), the task of support vector classification is to learn a function f(π) of the form

    f(π) = Σ_{π_i ∈ A+} λ_i^+ K(π, π_i) - Σ_{π_i ∈ A-} λ_i^- K(π, π_i),    (1)

where λ_i^+ and λ_i^- are non-negative weights that are computed during training by maximizing a quadratic objective function, and K(·, ·) is the kernel function designed to capture the similarity between pairs of residue-pairs. Having learned the function f(π), a new residue-pair π is predicted to be positive or negative depending on whether f(π) is positive or negative. The value of f(π) also signifies the tendency of π to belong to the positive or negative class, and it can be used to obtain a meaningful ranking of a set of residue-pairs.

We use error-insensitive support vector regression (ε-SVR) 34, 36 for learning a function f(π) to predict the fRMSD(π) scores. Given a set of training instances (π_i, fRMSD(π_i)), the ε-SVR aims to learn a function of the form

    f(π) = Σ_{π_i ∈ A+} α_i^+ K(π, π_i) - Σ_{π_i ∈ A-} α_i^- K(π, π_i),    (2)

where A+ contains the residue-pairs for which fRMSD(π_i) - f(π_i) > ε, A- contains the residue-pairs for which fRMSD(π_i) - f(π_i) < -ε, and α_i^+ and α_i^- are non-negative weights that are computed during training by maximizing a quadratic objective function. The objective of the maximization is to determine the flattest f(π) in the feature space and to minimize the estimation errors for the instances in A+ ∪ A-. Hence, instances whose estimation error satisfies |f(π_i) - fRMSD(π_i)| < ε are neglected. The parameter ε controls the width of the regression deviation, or tube.

In the current work we focused on several key considerations while setting up the classification and regression problems. In particular, we explored different types of sequence information associated with the residue-pairs, developed efficient ways of encoding this information into fixed-length feature vectors, and designed sensitive kernel functions to capture the similarity between pairs of residues in the feature spaces.
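As a sketch of how such a model can be trained, the following uses scikit-learn's SVR with a precomputed kernel matrix. scikit-learn, the random data, and the placeholder linear kernel are our stand-ins for illustration; the paper itself uses SVMlight (see Section 5.6) with the nsoe kernel of Section 4.3.

    import numpy as np
    from sklearn.svm import SVR

    # Toy stand-ins: feature vectors of residue-pairs and their fRMSD scores.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 16))
    y_train = rng.uniform(0.0, 3.0, size=100)

    def kernel_matrix(A, B):
        # Placeholder linear kernel; Section 4.3 defines the kernel used.
        return A @ B.T

    svr = SVR(kernel="precomputed", C=0.1, epsilon=0.1)
    svr.fit(kernel_matrix(X_train, X_train), y_train)
    pred = svr.predict(kernel_matrix(X_train[:5], X_train))  # estimated fRMSDs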
4.1. Sequence-based Information

For a given protein X, we encode the sequence information using profiles and predicted secondary structure.

4.1.1. Profile Information

The profile of a protein X is derived by computing a multiple sequence alignment of X with a set of sequences {Y_1, ..., Y_m} that have a statistically significant sequence similarity with X (i.e., they are sequence homologs). The profile of a sequence X of length n is represented by two n × 20 matrices, namely the position-specific scoring matrix P_X and the position-specific frequency matrix F_X. Matrix P can be generated directly by running PSI-BLAST, whereas matrix F consists of the frequencies used by PSI-BLAST to derive P. These frequencies, referred to as the target frequencies 18, consist of both the sequence-weighted observed frequencies (also referred to as effective frequencies 18) and the BLOSUM62-derived pseudocounts. Further, each row of the matrix is normalized to one.
4.1.2. Predicted Secondary Structure Information

For a sequence X of length n, we predict the secondary structure and generate a position-specific secondary structure matrix S_X of size n × 3. The (i, j) entry of this matrix represents the strength of the amino acid residue at position i to be in state j, where j ∈ {0, 1, 2} corresponds to the three secondary structure elements: alpha helices (H), beta sheets (E), and coil regions (C).
4.2. Coding Schemes

The input to our prediction algorithms is the set of wmer-pairs associated with each residue-pair π(x_i, y_j). The input feature space is derived using various combinations of the elements in the P and S matrices that are associated with the subsequences wmer(x_i) and wmer(y_j). For the rest of this paper, we will use P_X(i-w ... i+w) to denote the (2w+1) rows of matrix P_X corresponding to wmer(x_i). A similar notation will be used for matrix S.

4.2.1. Concatenation Coding Scheme

For a given residue-pair π(x_i, y_j), the feature vector of the concatenation coding scheme is obtained by first linearizing the matrices P_X(i-w ... i+w) and P_Y(j-w ... j+w) and then concatenating the resulting vectors. This leads to feature vectors of length 2 × (2w+1) × 20. A similar representation is derived for matrix S, leading to feature vectors of length 2 × (2w+1) × 3.

The concatenation coding scheme is order dependent, as the representations for π(x_i, y_j) and π(y_j, x_i) are not equivalent. We call the feature representations obtained by the two concatenation orders the forward (frwd) and reverse (rvsd) representations. Note that we use the terms forward and reverse only for illustrative purposes, as there is no way to assign a fixed ordering to the residues of a residue-pair; this is the source of the problem in the first place. We explored two different ways of addressing this order dependency. In the first approach, we trained up to ten models with random use of the forward and reverse representations for the various instances. The final classification and regression results were determined by averaging the results produced by each of the ten different models. In the second approach, we built only one model based on the forward representation of the residue-pairs. However, during model application, we classified/regressed both the forward and reverse representations of a residue-pair and used the average of the SVM/ε-SVR outputs as the final classification/regression result. We denote this averaging method by avg.
4.2.2. Pairwise Coding Scheme

For a given residue-pair π(x_i, y_j), the pairwise coding scheme generates a feature vector by linearizing the matrix formed by an element-wise product between P_X(i-w ... i+w) and P_Y(j-w ... j+w). The length of this vector is (2w+1) × 20, and the scheme is order independent. If we denote the element-wise product operation by ⊗, then the element-wise product matrix is given by

    P_X(i-w ... i+w) ⊗ P_Y(j-w ... j+w).    (3)

A similar approach is used to obtain the pairwise coding scheme for matrix S, leading to feature vectors of length (2w+1) × 3.
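The two coding schemes reduce to simple array operations. Below is an illustrative numpy sketch (function names are ours), assuming the (2w+1)-row profile windows have already been extracted as numpy arrays.

    import numpy as np

    def concat_features(Px_win, Py_win):
        """Concatenation scheme: linearize both (2w+1, 20) windows and join
        them; order dependent, length 2*(2w+1)*20."""
        return np.concatenate([Px_win.ravel(), Py_win.ravel()])

    def pairwise_features(Px_win, Py_win):
        """Pairwise scheme: element-wise product of the two windows;
        order independent, length (2w+1)*20."""
        return (Px_win * Py_win).ravel()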
4.3. Kernel Functions

The general structure of the kernel function that we use for capturing the similarity between a pair of residue-pairs π(x_i, y_j) and π'(x'_i', y'_j') is given by

    K^cs(π, π') = exp(1 + K_2^cs(π, π') / √(K_2^cs(π, π) K_2^cs(π', π'))),    (4)

where K_2^cs(π, π') is given by

    K_2^cs(π, π') = K_1^cs(π, π') + (K_1^cs(π, π'))²,    (5)

and K_1^cs(π, π') is a kernel function that depends on the choice of the particular coding scheme (cs). For the concatenation coding scheme using matrix P (i.e., cs = Pconc), K_1^cs(π, π') is given by

    K_1^Pconc(π, π') = Σ_{k=-w}^{+w} ⟨P_X(i+k), P_X'(i'+k)⟩ + Σ_{k=-w}^{+w} ⟨P_Y(j+k), P_Y'(j'+k)⟩.    (6)

For the pairwise coding scheme using matrix P (i.e., cs = Ppair), K_1^cs(π, π') is given by

    K_1^Ppair(π, π') = Σ_{k=-w}^{+w} ⟨P_X(i+k) ⊗ P_Y(j+k), P_X'(i'+k) ⊗ P_Y'(j'+k)⟩.    (7)

Similar kernel functions can be derived using matrix S for both the pairwise and the concatenation coding schemes. We will denote these coding schemes by Spair and Sconc, respectively.

Since the overall structure of the kernel that we use (Equations 4 and 5) is that of a normalized second-order exponential function, we will refer to it as nsoe. The second-order component of Equation 5 allows the nsoe kernel to capture pairwise dependencies among the residues at various positions within each wmer, and we found that this leads to better results than the linear function. This observation is also supported by earlier research on secondary structure prediction 14. In addition, the nsoe kernel's exponential function allows it to capture non-linear relationships within the data, just like kernels based on the Gaussian and radial basis functions 36.
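The nsoe construction is easy to express in code. The sketch below, our illustration under the same definitions, composes Equations 4 and 5; the base kernel is passed in, here a plain dot product matching the concatenation scheme of Equation 6.

    import numpy as np

    def k1_dot(u, v):
        # Base kernel for the concatenation scheme: the sum of window
        # dot-products equals a dot product of the concatenated vectors.
        return float(np.dot(u, v))

    def nsoe(u, v, k1=k1_dot):
        """Normalized second-order exponential kernel (Eqs. 4 and 5)."""
        k2 = lambda a, b: k1(a, b) + k1(a, b) ** 2
        return np.exp(1.0 + k2(u, v) / np.sqrt(k2(u, u) * k2(v, v)))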
4.3.1. Fusion Kernels

We also developed a set of kernel functions that incorporate both profile and secondary structure information, using an approach motivated by fusion kernels 34. Specifically, we constructed a new kernel function as the unweighted sum of the nsoe kernel functions for the profile and the secondary structure information. For example, the concatenation-based fusion kernel function is given by

    K^(P+S)conc(π, π') = K^Pconc(π, π') + K^Sconc(π, π').    (8)

A similar kernel function can be defined for the pairwise coding scheme as well; we will denote the pairwise-based fusion kernel by K^(P+S)pair(π, π'). Note that since these fusion kernels are linear combinations of valid kernels, they are also admissible kernels.
5. MATERIALS

5.1. Datasets

We evaluated the classification and regression performance of the various kernels on a set of protein pairs used in a previous study for learning a profile-to-profile scoring function. These pairs of proteins were derived from the SCOP 1.57 database, classes a-e, with no two protein domains sharing greater than 75% sequence identity. The dataset comprises 473 protein pairs belonging to the same family, 433 pairs belonging to the same superfamily but not the same family, and 422 pairs belonging to the same fold but not the same superfamily. For each protein pair, we used the alignment produced by the Smith-Waterman 33 algorithm to generate the aligned residue-pairs that were used to train and test the various algorithms. These alignments were computed using the sensitive PICASSO 8, 18 profile-to-profile scoring function. For each aligned residue-pair π(x_i, y_j), we computed its fRMSD(x_i, y_j) score by considering fragments of length seven (i.e., we optimally superimposed vfrags with v = 3).

For the fRMSD estimation problem, we used the entire set of aligned residue-pairs and their corresponding fRMSD scores for training and testing the ε-SVR-based regression algorithms. For the reliability prediction problem, we used the aligned residue-pairs to
construct two different classification datasets, which will be referred to as easy and hard. The positive class (i.e., reliable residue-pairs) for both datasets contains all residue-pairs whose fRMSD score is less than 0.75 Å. However, the datasets differ in how the negative class (i.e., unreliable residue-pairs) is defined. For the hard problem, the negative class consists of all residue-pairs that are not part of the positive class (i.e., those with an fRMSD score greater than or equal to 0.75 Å), whereas for the easy problem, the negative class consists only of those residue-pairs whose fRMSD score is greater than 2.5 Å. Thus, the easy dataset contains classes that are well separated in terms of the fRMSD scores of their residue-pairs, and as such it represents a somewhat easier learning problem. Both datasets are available at the supplementary website for this paper (http://bioinfo.cs.umn.edu/supplements/fRMSDPred).

We perform a detailed analysis using different subsets of the datasets to train and test the performance of the models. Specifically, we train four models using (i) protein pairs sharing the same SCOP family, (ii) protein pairs sharing the same superfamily but not the family, (iii) protein pairs sharing the same fold but not the superfamily, and (iv) protein pairs from all three levels. These four models are denoted by fam, suf, fold, and all. We also report performance numbers by splitting the test set into the aforementioned four levels. These subsets allow us to evaluate the performance of the schemes for different levels of sequence similarity.
5.2. Profile Generation

To generate the profile matrices P and F, we ran PSI-BLAST with the following parameters: blastpgp -j 5 -e 0.01 -h 0.01. PSI-BLAST was run against NCBI's nr database, downloaded in November 2004, which contained 2,171,938 sequences.
5.3. Secondary Structure Prediction

We use the state-of-the-art secondary structure prediction server YASSPP 14 (default parameters) to generate the S matrix. The values of the S matrix are the outputs of the three one-versus-rest SVM classifiers trained for each of the secondary structure elements.
5.4. Evaluation Methodology

We use a five-fold cross-validation framework to evaluate the performance of the various classifiers and regression models. To prevent unwanted biases, we restrict all residue-pairs involving a particular protein to belong solely to the training or the testing dataset. We measure the quality of the methods using the standard receiver operating characteristic (ROC) scores and the ROC5 scores averaged across every protein pair. The ROC score is the normalized area under the curve that plots the true positives against the false positives for different classification thresholds. The ROC_n score is the area under the ROC curve up to the first n false positives. We compute the ROC and ROC5 numbers for every protein pair and report the average results across all the pairs and cross-validation steps. We selected ROC5 because each individual ROC-based evaluation is performed on a per protein-pair basis, which, on average, involves one to two hundred residue-pairs. The regression performance is assessed by computing the standard Pearson correlation coefficient (CC) between the predicted and observed fRMSD values for every protein pair. The results reported are averaged across the different pairs and cross-validation steps.
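A small sketch of the ROC_n computation (here n = 5), under our reading of the definition above; this is illustrative, not the evaluation code used in the paper.

    def roc_n(labels_sorted, n=5):
        """labels_sorted: true 0/1 labels sorted by decreasing predicted score.
        Area under the ROC curve up to the first n false positives, normalized
        so that a perfect ranking scores 1.0 (assumes >= n negatives)."""
        tp = fp = area = 0
        for label in labels_sorted:
            if label == 1:
                tp += 1
            else:
                fp += 1
                area += tp        # rectangle added by one false positive
                if fp == n:
                    break
        total_pos = sum(labels_sorted)
        return area / (n * total_pos) if total_pos else 0.0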
5.5. Profile-to-Profile Scoring Schemes

To assess the effectiveness of our supervised learning algorithms, we compare their performance against that obtained by using two profile-to-profile scoring schemes to solve the same problems. Specifically, we use the profile-to-profile scoring schemes to compute the similarity between the aligned residue-pairs, summed over the length of their wmers. To assess how well these scores correlate with the fRMSD score of each residue-pair, we compute their correlation coefficients. Note that since residue-pairs with high similarity scores are expected to have low fRMSD scores, good values for these correlation coefficients will be close to -1. Similarly, for the reliability prediction problem, we sort the residue-pairs in decreasing similarity-score order and assess the performance by computing ROC and ROC5 scores.

The two profile-to-profile scoring schemes that we used are based on the dot-product and the PICASSO score, both of which are used extensively and have been shown to produce good results 18. The dot-product similarity score is defined for both the profile- and the secondary-structure-based information, whereas the PICASSO score is defined only for the profile-based information. The profile-based dot-product similarity score between residues x_i and y_j is given by ⟨P_X(i), P_Y(j)⟩. Similarly, the secondary-structure-based dot-product similarity score is given by ⟨S_X(i), S_Y(j)⟩. The PICASSO similarity score 8, 18 between residues x_i and y_j uses both the P and F matrices and is given by ⟨F_X(i), P_Y(j)⟩ + ⟨F_Y(j), P_X(i)⟩. We will use Pdotp, Sdotp, and PFpic to denote these three similarity scores, respectively.
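These three scores are plain dot products over matrix rows. A numpy sketch, with P, S, and F being the matrices defined in Sections 4.1 and 5.5 (the function names are ours):

    import numpy as np

    def dotp_profile(PX, PY, i, j):
        return float(np.dot(PX[i], PY[j]))       # Pdotp

    def dotp_ss(SX, SY, i, j):
        return float(np.dot(SX[i], SY[j]))       # Sdotp

    def picasso(FX, PX, FY, PY, i, j):
        # PFpic: target frequencies of one residue scored against the
        # position-specific scores of the other, in both directions.
        return float(np.dot(FX[i], PY[j]) + np.dot(FY[j], PX[i]))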
5.6. Support Vector Machines

The classification and regression are done using the publicly available support vector machine tool SVMlight 29, which implements an efficient soft-margin optimization algorithm. The performance of SVM and ε-SVR depends on the parameter that controls the trade-off between the margin and the misclassification cost (the "C" parameter). In addition, the performance of ε-SVR also depends on the value of the deviation parameter ε. We performed a limited number of experiments to determine good values for these parameters. These experiments showed that C = 0.1 and ε = 0.1 achieved consistently good performance, and these values were used for all the reported results.
6. RESULTS

We have performed a comprehensive study evaluating the classification and regression performance of the various information sources, coding schemes, and kernel functions (Section 4), and we compare it against the performance achieved by the profile-to-profile scoring schemes (Section 5.5). We performed a number of experiments using different-length wmers for both the SVM/ε-SVR- and profile-to-profile-based schemes. These experiments showed that the supervised learning schemes achieved the best results when 5 ≤ w ≤ 7, whereas in the case of the profile-to-profile scoring schemes, the best-performing value of w depended on the particular scoring scheme. For these reasons, for all the SVM/ε-SVR-based schemes we only report results for w = 6, whereas for the profile-to-profile schemes we report results for the values of w that achieved the best performance.
6.1. Order Dependency in the Concatenation Coding Scheme

Section 4.2.1 described two different schemes for addressing the order dependency of the concatenation coding scheme. Our experiments with these approaches showed that both achieved comparable results. For this reason, and due to space constraints, in this section we only present results for the second approach (i.e., averaging the SVM/ε-SVR prediction values of the forward and reverse representations). These results are shown in Table 1, which reports the classification and regression performance achieved by the concatenation-based fusion kernel for the two representations and their average. These results show that there exists a difference in the performance achieved by the forward and reverse representations. Depending on the protein set used to train and/or test the model, these differences can be non-trivial. For example, for models trained on the fold and all protein sets, the performance achieved by the reverse representation is considerably higher than that achieved by the forward representation. However, these results also show that by averaging the predictions of these two representations, we are able to achieve the best results (or close to them). In many cases, the averaging scheme achieves up to a 1% improvement over either the forward or reverse representations for both the classification and the regression problem. For this reason, throughout the rest of this study we only report the results obtained using the averaging scheme for the concatenation-based coding schemes.

Table 1.: Comparing the classification and regression performance of the various concatenation-based kernels due to order dependency.

    Scheme                  | EASY ROC5   ROC   | HARD ROC5   ROC   | EST CC
    (P+S)conc-fam (frwd)    | 0.802       0.937 | 0.666       0.903 | 0.693
    (P+S)conc-fam (rvsd)    | 0.803       0.937 | 0.664       0.902 | 0.693
    (P+S)conc-fam (avg)     | 0.817       0.941 | 0.673       0.906 | 0.700
    (P+S)conc-suf (frwd)    | 0.822       0.938 | 0.653       0.898 | 0.687
    (P+S)conc-suf (rvsd)    | 0.821       0.938 | 0.651       0.899 | 0.688
    (P+S)conc-suf (avg)     | 0.827       0.940 | 0.659       0.902 | 0.694
    (P+S)conc-fold (frwd)   | 0.785       0.918 | 0.618       0.872 | 0.660
    (P+S)conc-fold (rvsd)   | 0.800       0.922 | 0.638       0.881 | 0.663
    (P+S)conc-fold (avg)    | 0.796       0.922 | 0.637       0.882 | 0.667
    (P+S)conc-all (frwd)    | 0.839       0.948 | 0.680       0.909 | 0.717
    (P+S)conc-all (rvsd)    | 0.853       0.950 | 0.692       0.913 | 0.721
    (P+S)conc-all (avg)     | 0.853       0.952 | 0.693       0.913 | 0.725

The test set consisted of proteins from the all set, whereas the training set used the all, fam, suf, or fold set. The frwd and rvsd notations indicate the concatenation orders of the two wmers, whereas avg denotes the scheme that uses the average output of both. EST denotes the fRMSD estimation results obtained using regression. The numbers in bold show the best performing schemes for each of the sub-tables.

6.2. RBF versus NSOE Kernel Functions
Table 2.: Comparing the performance of the rbf and nsoe kernel functions.

    Scheme            | EASY ROC5   ROC   | HARD ROC5   ROC   | EST CC
    Pconc-all (rbf)   | 0.728       0.910 | 0.572       0.865 | 0.537
    Pconc-all (nsoe)  | 0.750       0.918 | 0.598       0.875 | 0.566
    Ppair-all (rbf)   | 0.708       0.900 | 0.550       0.854 | 0.528
    Ppair-all (nsoe)  | 0.723       0.905 | 0.559       0.856 | 0.534

The test and training sets consisted of proteins from the all set. EST denotes the fRMSD estimation results obtained using regression. The numbers in bold show the best performing schemes for each of the sub-tables.
Table 2 compares the classification and regression performance achieved by the standard rbf kernel against that achieved by the normalized second-order exponential kernel (nsoe) described in Section 4.3. These results are reported only for the concatenation and pairwise coding schemes that use profile information. The rbf results were obtained after normalizing the feature vectors to unit length, as this produced substantially better results than the unnormalized representation. These results show that the performance achieved by the nsoe kernel is consistently 3% to 5% better than that achieved by the rbf kernel for both the classification and regression problems. The key difference between the two kernels is that in the nsoe kernel the even-ordered terms are weighted more highly in the expansion of the infinite exponential series than in the rbf kernel. As discussed in Section 4.3, this allows the nsoe kernel function to better capture the pairwise dependencies that exist at different positions of each wmer.
6.3. Input Information and Coding Schemes

Table 3 compares how the features derived from the profiles and the predicted secondary structure impact the performance achieved for the reliability prediction problem. The table presents results for the SVM-based schemes using the concatenation and pairwise coding schemes, as well as results obtained by the dot-product-based profile-to-profile scoring scheme (see Section 5.5 for a discussion of how these scoring schemes
Table 3.: Classification performance of the individual kernels for both the easy and hard datasets.

    EASY:
    Scheme      | fam ROC5   ROC   | suf ROC5   ROC   | fold ROC5   ROC
    Pdotp (6)   | 0.673      0.826 | 0.496      0.803 | 0.341       0.717
    Sdotp (3)   | 0.642      0.786 | 0.680      0.884 | 0.706       0.901
    Pconc-all   | 0.817      0.919 | 0.716      0.917 | 0.712       0.918
    Sconc-all   | 0.790      0.908 | 0.794      0.939 | 0.823       0.951
    Ppair-all   | 0.784      0.902 | 0.699      0.909 | 0.679       0.905
    Spair-all   | 0.676      0.837 | 0.690      0.909 | 0.727       0.922

    HARD:
    Scheme      | fam ROC5   ROC   | suf ROC5   ROC   | fold ROC5   ROC
    Pdotp (6)   | 0.470      0.753 | 0.315      0.698 | 0.236       0.646
    Sdotp (3)   | 0.466      0.771 | 0.503      0.856 | 0.567       0.885
    Pconc-all   | 0.621      0.867 | 0.574      0.880 | 0.590       0.882
    Sconc-all   | 0.615      0.865 | 0.631      0.913 | 0.695       0.923
    Ppair-all   | 0.588      0.849 | 0.509      0.853 | 0.572       0.868
    Spair-all   | 0.486      0.803 | 0.548      0.880 | 0.636       0.895

The test sets consisted of proteins from the fam, suf, and fold sets, whereas the training set used the all set. The numbers in parentheses for the profile-to-profile scoring schemes indicate the value of w for the wmers that were used. The numbers in bold show the best performing schemes for each of the sub-tables.
were used to solve the reliability prediction problem). Analyzing these results across the different SCOP-derived test sets, we can see that protein profiles lead to better performance for the family-derived set, whereas secondary structure information does better for the superfamily- and fold-derived sets. The performance improvements achieved by the secondary-structure-based schemes are usually much greater than the improvements achieved by the profile-based schemes. Moreover, the relative performance gap between the secondary-structure- and profile-based schemes increases as we move from the superfamily- to the fold-derived set. This holds for both the easy and hard datasets and for both the kernel-based methods and the profile-to-profile-based scoring scheme. These results show that profiles are more important for protein-pairs that are similar (as is the case in the family-derived set), whereas secondary-structure information becomes increasingly more important as the sequence similarity between the protein-pairs decreases (as is the case in the superfamily- and fold-derived sets).

Analyzing the performance achieved by the different coding schemes, we can see that concatenation performs uniformly better than pairwise. As measured by ROC5, the concatenation scheme achieves 4% to 15% better performance than the corresponding pairwise-based schemes. However, both schemes perform considerably better than the profile-to-profile-based scheme; these performance advantages range from 11% to 30% (as measured by ROC5).
6.4. Fusion Kernels

6.4.1. Reliability Prediction Problem

Table 4 shows the performance achieved by the fusion kernels on the reliability prediction problem for both the easy and hard datasets. For comparison purposes, this table also shows the best results that were obtained by using the profile-to-profile-based schemes to solve the reliability prediction problem. Specifically, we present dot-product-based results that score each wmer as the sum of its profile and secondary-structure information ((P+S)dotp), and results that score each wmer as the sum of its PICASSO score and a secondary-structure-based dot-product score (PFpic+Sdotp).

From these results we can see that the SVM-based schemes, regardless of their coding schemes, consistently outperform the profile-to-profile scoring schemes. In particular, comparing the best results obtained by the concatenation scheme against those obtained by the PFpic+Sdotp scheme (i.e., the entries in bold), we see that the former achieves 18% to 24% higher ROC5 scores for the easy dataset. Moreover, the performance advantage becomes greater for the hard dataset, ranging between 31% and 36%. Comparing the performance achieved by the fusion kernels with that achieved by the nsoe kernels (Table 3), we can see that by combining both profile and secondary structure information we achieve an ROC5 improvement between 3.5% and 10.8%. These performance improvements are consistent across the different test sets (fam, suf, and fold) and datasets (hard and easy).

Comparing the performance achieved by the models trained on different protein subsets, we can see that the best performance is generally achieved by models trained on protein pairs from all three levels of the SCOP hierarchy (i.e., trained using the all set). However, these results also show an interesting trend involving the set of fold-derived protein-pairs. For this set, the best (or close to best) classification performance is achieved by models trained on fold-derived protein-pairs. This holds for both the concatenation and pairwise coding schemes
Table 4.: Classification performance of the fusion kernels for the easy and hard datasets.

    EASY:
    Scheme            | all ROC5   ROC   | fam ROC5   ROC   | suf ROC5   ROC   | fold ROC5   ROC
    (P+S)dotp (6)     | 0.523      0.794 | 0.679      0.831 | 0.511      0.814 | 0.359       0.733
    PFpic+Sdotp (2)   | 0.719      0.891 | 0.733      0.865 | 0.720      0.911 | 0.701       0.901
    (P+S)conc-fam     | 0.817      0.941 | 0.829      0.929 | 0.811      0.948 | 0.808       0.949
    (P+S)conc-suf     | 0.827      0.940 | 0.820      0.918 | 0.821      0.948 | 0.841       0.957
    (P+S)conc-fold    | 0.796      0.922 | 0.751      0.874 | 0.778      0.931 | 0.863       0.967
    (P+S)conc-all     | 0.853      0.952 | 0.846      0.936 | 0.841      0.956 | 0.873       0.967
    (P+S)pair-fam     | 0.783      0.925 | 0.797      0.909 | 0.786      0.939 | 0.762       0.930
    (P+S)pair-suf     | 0.810      0.932 | 0.805      0.907 | 0.818      0.945 | 0.808       0.947
    (P+S)pair-fold    | 0.805      0.923 | 0.765      0.879 | 0.799      0.937 | 0.855       0.959
    (P+S)pair-all     | 0.832      0.942 | 0.823      0.920 | 0.825      0.949 | 0.850       0.958

    HARD:
    Scheme            | all ROC5   ROC   | fam ROC5   ROC   | suf ROC5   ROC   | fold ROC5   ROC
    (P+S)dotp (6)     | 0.365      0.716 | 0.474      0.758 | 0.328      0.710 | 0.249       0.663
    PFpic+Sdotp (2)   | 0.526      0.850 | 0.535      0.820 | 0.498      0.864 | 0.543       0.878
    (P+S)conc-fam     | 0.673      0.906 | 0.652      0.879 | 0.662      0.921 | 0.714       0.927
    (P+S)conc-suf     | 0.659      0.902 | 0.610      0.866 | 0.676      0.925 | 0.711       0.929
    (P+S)conc-fold    | 0.637      0.882 | 0.557      0.822 | 0.635      0.903 | 0.753       0.944
    (P+S)conc-all     | 0.693      0.913 | 0.665      0.886 | 0.679      0.926 | 0.747       0.939
    (P+S)pair-fam     | 0.640      0.888 | 0.621      0.863 | 0.627      0.899 | 0.681       0.911
    (P+S)pair-suf     | 0.652      0.890 | 0.619      0.859 | 0.653      0.904 | 0.698       0.919
    (P+S)pair-fold    | 0.644      0.882 | 0.576      0.837 | 0.636      0.894 | 0.751       0.936
    (P+S)pair-all     | 0.668      0.897 | 0.634      0.867 | 0.650      0.907 | 0.734       0.930

The test and training sets consisted of proteins from the all, fam, suf, and fold sets. The numbers in parentheses for the profile-to-profile scoring schemes indicate the value of w for the wmers that were used. The numbers in bold show the best performing schemes for the kernel-based and the profile-to-profile scoring-based schemes. The underlined results show the cases where the pairwise coding scheme performs better than the concatenation coding scheme.
and the easy and hard datasets. These results indicate that training a model using residue-pairs with high-to-moderate sequence similarity (as is the case with the fam- and suf-derived sets) does not perform very well for predicting reliable residue-pairs that have low or no sequence similarity (as is the case with the fold-derived set). Finally, as was the case with the nsoe kernels, the concatenation coding schemes tend to outperform the pairwise schemes for the fusion kernels as well. However, the advantage of the concatenation coding scheme is not uniform, and there are certain training and test set combinations for which the pairwise scheme does better. These cases correspond to the underlined entries in Table 4.
6.4.2. fRMSD Estimation Problem

Table 5 shows the performance achieved by ε-SVR for solving the fRMSD estimation problem as measured by the correlation coefficient between the observed and predicted fRMSD values. We report results for the fusion kernels and the Σpic+Sdotp profile-to-profile scoring scheme. Note that, as discussed in Section 5.5, the scores computed by Σpic+Sdotp should be negatively correlated with the fRMSD; thus, negative correlations represent good estimations. From these results we can see that, as was the case with the reliability prediction problem, the ε-SVR-based methods consistently outperform the profile-to-profile scoring scheme across the different combinations of training and testing sets. The (P+S)conc models achieve an improvement over Σpic+Sdotp that ranges from 21% to 23.2%. The performance difference between the two schemes can also be seen in Figures 1 and 2, which plot the actual fRMSD scores against the estimated fRMSD scores of (P+S)conc-all and the Σpic+Sdotp similarity scores, respectively. Comparing the two figures we can see that the fRMSD estimations produced by the ε-SVR-based scheme are significantly better correlated with the actual fRMSD scores than those produced by Σpic+Sdotp. Finally, in agreement with the earlier results, the concatenation coding scheme performs better than the pairwise scheme. The only exceptions are the models trained on the fold-derived set, for which the pairwise scheme does better when tested on the all- and fam-derived sets (underlined entries in Table 5).
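To make this setup concrete, the following sketch shows how an ε-SVR could be trained for fRMSD estimation and evaluated by the correlation coefficient. This is an illustrative sketch only, not the paper's implementation: it uses scikit-learn's generic RBF kernel in place of the fusion kernels, and assumes X_train/X_test hold fixed-length encodings of aligned residue-pairs with y_train/y_test their actual fRMSD values.

    # A minimal sketch of epsilon-SVR training and evaluation for the
    # fRMSD estimation problem; the RBF kernel stands in for the paper's
    # fusion kernels, and the inputs are assumptions, not from the paper.
    from sklearn.svm import SVR
    from scipy.stats import pearsonr

    def train_frmsd_estimator(X_train, y_train, eps=0.1, C=1.0):
        model = SVR(kernel="rbf", epsilon=eps, C=C)
        model.fit(X_train, y_train)
        return model

    def evaluate(model, X_test, y_test):
        # The paper reports the correlation coefficient between observed
        # and predicted fRMSD values; for the SVR, higher is better.
        y_pred = model.predict(X_test)
        r, _ = pearsonr(y_test, y_pred)
        return r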
Table 5. Regression performance of the fusion kernels on the hard dataset.
Scheme            all      fam      suf      fold
Σpic+Sdotp (3)    -0.590   -0.550   -0.611   -0.625
(P+S)conc-fam      0.700    0.662    0.720    0.736
(P+S)conc-suf      0.694    0.612    0.739    0.764
(P+S)conc-fold     0.667    0.557    0.719    0.770
(P+S)conc-all      0.725    0.681    0.744    0.768
(P+S)pair-fam      0.676    0.639    0.695    0.708
(P+S)pair-suf      0.672    0.610    0.705    0.727
(P+S)pair-fold     0.676    0.639    0.695    0.708
(P+S)pair-all      0.694    0.645    0.712    0.746

The test and training sets consisted of proteins from the all, fam, suf, and fold sets. The number in parentheses for the profile-to-profile scoring scheme indicates the value of w for the w-mer that was used. Good correlation coefficient values will be negative for the profile-to-profile scoring scheme and positive for the kernel-based schemes. The numbers in bold show the best performing schemes. The underlined results show the cases where the pairwise coding scheme performs better than the concatenation coding scheme.
7. RELATED RESEARCH

The problem of determining the reliability of residue-pairs has been visited before in several different settings. ProfNet 21, 20 uses artificial neural networks to learn a scoring function to align a pair of protein sequences. In essence, ProfNet aims to differentiate related and unrelated residue-pairs and also estimate the RMSD score between these residue-pairs using profile information. Protein pairs are aligned using STRUCTAL 6, residue-pairs within 3Å of each other are considered to be related, and unrelated residue-pairs are selected randomly from protein pairs known to be in different folds. A major difference between our methods and ProfNet is in the definition of reliable/unreliable residue-pairs and in how the RMSD score between residue-pairs is measured. As discussed in Section 2, we measure the structural similarity of two residues (fRMSD) by looking at how well their vfrags structurally align with each other. However, ProfNet only considers the proximity of two residues within the context of their global structural alignment. As such, two residues can have a very low RMSD and still correspond to fragments whose structure is substantially different. This fundamental difference makes direct comparisons between the results impossible. The other major differences lie in the development of order-independent coding schemes and the use of information from a set of neighboring residues by using a w-mer size greater than zero.
The task of aligning a pair of sequences has also been cast as a problem of learning parameters (gap opening, gap extension, and position-independent substitution matrix) within the framework of discriminatory learning 11, 40 and of setting up optimization parameters for an inverse learning problem 35. Recently, pair conditional random fields were also used to learn a probabilistic model for estimating the alignment parameters (i.e., gap and substitution costs) 4.
8. CONCLUSION AND FUTURE WORK

In this paper we defined the fRMSD estimation and the reliability prediction problems to capture local structural similarity using only sequence-derived information. We developed a machine-learning approach for solving these problems by using a second-order exponential kernel function to encode profile and predicted secondary structure information into a kernel fusion framework. Our results showed that the fRMSD values of aligned residue-pairs can be predicted at a good level of accuracy. We believe that this lays the foundation for using estimated fRMSD values to evaluate the quality of target-template alignments and refine them.
" I
1 ! " #$%'
5
2 3 4 I * + , ~. I - / 01 ' I 2 " $ 3 4 i%Y&as "
Fig. 1. Scatter plot for test protein-pairs at all levels between estimated and actual fRMSD scores. The color coding represents the approximate density of points plotted in a fixed normalized area.
O,"-r-QP
AC AB
A' A'
A& A%
AS A#
.I .I
I,
I,,
# < I,I
,-ins&+,;
1*+,~?1/*+,~.1012+'/30
Fig. 2. Scatter plot for test protein-pairs at all levels between profile-to-profile scores and actual fRMSD scores. The color coding represents the approximate density of points plotted in a fixed normalized area.
ACKNOWLEDGEMENTS

We would like to express our deepest thanks to Professor Arne Elofsson and Dr. Tomas Ohlson for helping us with datasets for the study. This work was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.
References
1. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403-410, 1990.
2. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402, 1997.
3. Helen M. Berman, T. N. Bhat, Philip E. Bourne, Zukang Feng, Gary Gilliland, Helge Weissig, and John Westbrook. The Protein Data Bank and the challenge of structural genomics. Nature Structural Biology, 7:957-959, November 2000.
4. C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: Discriminative training for protein sequence alignment. In Proceedings of the Tenth Annual International Conference on Computational Molecular Biology (RECOMB), 2006.
5. R. Edgar and K. Sjolander. A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20(8):1301-1308, 2004.
6. M. Gerstein and M. Levitt. Comprehensive assessment of automatic structural alignment against a manual standard, the SCOP classification of proteins. Protein Science, 7:445-456, 1998.
7. M. Gribskov and N. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computational Chemistry, 20:25-33, 1996.
8. A. Heger and L. Holm. PICASSO: generating a covering set of protein family profiles. Bioinformatics, 17(3):272-279, 2001.
9. S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915-10919, 1992.
10. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proc. of the European Conference on Machine Learning, 1998.
11. T. Joachims, T. Galor, and R. Elber. Learning to align sequences: A maximum-margin approach. New Algorithms for Macromolecular Simulation, 49, 2005.
12. D. T. Jones. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. Journal of Molecular Biology, 287:797-815, 1999.
13. D. T. Jones, W. R. Taylor, and J. M. Thornton. A new approach to protein fold recognition. Nature, 358:86-89, 1992.
14. George Karypis. YASSPP: Better kernels and coding schemes lead to improvements in SVM-based secondary structure prediction. Proteins: Structure, Function and Bioinformatics, 64(3):575-586, 2006.
15. A. S. Konagurthu, J. C. Whisstock, P. J. Stuckey, and A. M. Lesk. MUSTANG: a multiple structural alignment algorithm. Proteins: Structure, Function, and Bioinformatics, 64(3):559-574, 2006.
16. G. R. G. Lanckriet, T. D. Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626-2635, 2004.
17. M. Marti-Renom, M. Madhusudhan, and A. Sali. Alignment of protein sequences by their profiles. Protein Science, 13:1071-1087, 2004.
18. D. Mittelman, R. Sadreyev, and N. Grishin. Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics, 19(12):1531-1539, 2003.
19. S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453, 1970.
20. T. Ohlson, V. Aggarwal, A. Elofsson, and R. MacCallum. Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps. BMC Bioinformatics, 7(357), 2006.
21. T. Ohlson and A. Elofsson. ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics, 6(253), 2005.
22. William R. Pearson and David J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85:2444-2448, 1988.
23. J. Pillardy, C. Czaplewski, A. Liwo, J. Lee, D. R. Ripoll, R. Kazmierkiewicz, S. Oldziej, W. J. Wedemeyer, K. D. Gibson, Y. A. Arnautova, J. Saunders, Y. J. Ye, and H. A. Scheraga. Recent improvements in prediction of protein structure by global optimization of a potential energy function. PNAS USA, 98(5):2329-2333, 2001.
24. J. Qiu and R. Elber. SSALN: An alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins: Structure, Function, and Bioinformatics, 62(4):881-891, 2006.
25. H. Rangwala and G. Karypis. Incremental window-based protein sequence alignment algorithms. Bioinformatics, 23(2):e17-23, 2007.
26. C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker. Protein structure prediction using Rosetta. Methods in Enzymology, 383:66-93, 2004.
27. R. Sanchez and A. Sali. Advances in comparative protein-structure modelling. Current Opinion in Structural Biology, 7(2):206-214, 1997.
28. T. Schwede, J. Kopp, N. Guex, and M. C. Peitsch. SWISS-MODEL: An automated protein homology-modeling server. Nucleic Acids Research, 31(13):3381-3385, 2003.
29. B. Schölkopf, C. Burges, and A. Smola, editors. Making Large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
30. I. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11:739-747, 1998.
31. K. T. Simons, C. Strauss, and D. Baker. Prospects for ab initio protein structural genomics. Journal of Molecular Biology, 306(5):1191-1199, 2001.
32. J. Skolnick and D. Kihara. Defrosting the frozen approximation: PROSPECTOR - a new approach to threading. Proteins: Structure, Function and Genetics, 42(3):319-331, 2001.
33. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
34. A. Smola and B. Schölkopf. A tutorial on support vector regression. NeuroCOLT2, NC2-TR-1998-030, 1998.
35. F. Sun, D. Fernandez-Baca, and W. Yu. Inverse parametric sequence alignment. Proceedings of the International Computing and Combinatorics Conference (COCOON), 2002.
36. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
37. C. Venclovas. Comparative modeling in CASP5: Progress is evident, but alignment errors remain a significant hindrance. Proteins: Structure, Function, and Genetics, 53:380-388, 2003.
38. C. Venclovas and M. Margelevicius. Comparative modeling in CASP6 using consensus approach to template selection, sequence-structure alignment, and structure assessment. Proteins: Structure, Function, and Bioinformatics, 61(S7):99-105, 2005.
39. G. Wang and R. L. Dunbrack Jr. Scoring profile-to-profile sequence alignments. Protein Science, 13:1612-1626, 2004.
40. C. Yu, T. Joachims, R. Elber, and J. Pillardy. Support vector training of protein alignment models. To appear in Proceedings of the Eleventh International Conference on Research in Computational Molecular Biology (RECOMB), 2007.
41. Y. Zhang, A. J. Arakaki, and J. Skolnick. TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins: Structure, Function, and Bioinformatics, 61(S7):91-98, 2005.
42. Y. Zhang and J. Skolnick. The protein structure prediction problem could be solved using the current PDB library. PNAS USA, 102(4):1029-1034, 2005.
CONSENSUS CONTACT PREDICTION BY LINEAR PROGRAMMING
Xin Gao1,*, Dongbo Bu1,2,*, Shuai Cheng Li1, Ming Li1,†, and Jinbo Xu3,‡

1 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 100080
3 Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago, IL 60637
Email: {xdgao, dbu, scli, mli}@cs.uwaterloo.ca, [email protected]

* The first two authors contributed equally to the paper.
† To whom correspondence should be addressed.
‡ To whom correspondence should be addressed.
Protein inter-residue contacts are of great use for protein structure determination or prediction. Recent CASP events have shown that a few accurately predicted contacts can help improve both the computational efficiency and the prediction accuracy of ab initio folding methods. This paper develops an integer linear programming (ILP) method for consensus-based contact prediction. In contrast to the simple "majority voting" method, which assumes that all the individual servers are equal and independent, our method evaluates their correlations using the maximum likelihood method and constructs some latent independent servers using the principal component analysis technique. Then, we use an integer linear programming model to assign weights to these latent servers in order to maximize the deviation between the correct contacts and the incorrect ones; our consensus prediction server is the weighted combination of these latent servers. In addition to the consensus information, our method also uses server-independent correlated mutation (CM) as one of the prediction features. Experimental results demonstrate that our contact prediction server performs better than the "majority voting" method. The accuracy of our method for the top L/5 contacts on CASP7 targets is 73.41%, which is much higher than previously reported studies. On the 16 free modeling (FM) targets, our method achieves an accuracy of 37.21%.
Keywords: residue-residue contact prediction, consensus, principal component analysis, integer linear programming, latent server.
1. INTRODUCTION

Computational protein structure prediction has made great progress in the last three decades 1, 2. Recent CASPs (Critical Assessment of Structure Prediction) 3-8 have demonstrated that accurately predicted contacts can provide very important information for protein structure prediction methods. Rosetta 9-11 performs impressively on recent CASPs. Misura et al. 12 further modified the Rosetta free modeling protocol to encode residue-residue contact information. Experimental results demonstrate that by using spatial constraints extracted from homologous structures, not only is the running time shortened, but the prediction accuracy is also improved. For some concrete cases, the models built by Rosetta are more accurate than their templates on aligned regions, which was rarely seen before. Zhang-server, a refined version of TASSER 13, ranked number 1 among all the automatic servers in CASP7. CASP7 evaluation shows that iteratively running the TASSER simulation for two rounds, using contact constraints at the second round, greatly improves the prediction accuracy.

For a protein of length L, the contact map of this protein is an L x L matrix A, in which A[i][j] is set to 1 if residues i and j are in contact, and 0 otherwise. Commonly, two residues are considered to be in physical contact if the spatial distance between their Cβ atoms (Cα atom for Glycine) is less than some threshold value.
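As a concrete reading of this definition, a contact map could be computed from per-residue coordinates as sketched below. The 8Å threshold and the Cβ/Cα convention follow the text; the numpy-based layout, and the minimum sequence separation of 6 used later in the paper, are stated assumptions.

    # A minimal sketch of the contact-map definition: coords is assumed
    # to be an (L, 3) array with one representative atom per residue
    # (Cbeta, or Calpha for Glycine).
    import numpy as np

    def contact_map(coords, threshold=8.0, min_separation=6):
        L = coords.shape[0]
        diff = coords[:, None, :] - coords[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances
        A = (dist < threshold).astype(int)
        # Mask near-diagonal pairs; the paper later requires residues
        # to be at least 6 apart in the sequence.
        idx = np.arange(L)
        sep = np.abs(idx[:, None] - idx[None, :])
        A[sep < min_separation] = 0
        return A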
1.1. Related Work

There are four commonly acknowledged contact prediction assessment criteria: accuracy, coverage, improvement over random, and Xd 7, 8, 14. Among them, accuracy is the most important measurement. The protein residue-residue contact map was first studied by 15-18 to calculate mean force potentials.
Göbel et al. 19 formally proposed the problem of residue-residue contact prediction, and showed that correlated mutation (CM) is useful information for predicting inter-residue contacts. Different correlated mutation calculation methods have been carefully studied since then 20-23. While CM performs well for local contact prediction, which is usually taken to mean two in-contact residues within 6 amino acids of each other on the protein sequence, it usually fails on non-local contacts. Therefore, other information, such as evolutionary information and secondary structure information, has been applied to improve the performance of contact prediction methods 24-30. In 24, Fariselli et al. encoded four kinds of features into a neural network based server (CORNET): 1) CM, 2) evolutionary information, 3) sequence conservation, and 4) predicted secondary structure. They defined two residues to be in contact if the Euclidean distance between the coordinates of their Cβ atoms (Cα atom for Glycine) is smaller than 8Å. To fairly test the performance of CORNET, they further required that the sequence separation between residues is at least 7, which can eliminate the influence of local α-helical residue-residue contacts. CORNET has an average accuracy of 0.21, which is higher than any previously reported result. Other features have been well studied since then 25, 26. However, the reported accuracy has not been improved much. PROFcon 30, one of the top three contact prediction servers in CASP6 7, encodes alignment information into its neural network model, such as solvent accessibility and secondary structure over regions between two residues, as well as the average properties of the entire protein. PROFcon performs impressively on short proteins and alpha/beta proteins, on which the accuracy is over 30%.

Different from those machine learning based methods, which encode CM information and other sequence-related and alignment-related information, there are some studies trying to predict residue-residue contacts from other perspectives. Bystroff and his colleagues 31, 32 took the folding pathway into consideration, and predicted residue contacts by HMMSTR 33, a hidden Markov model for local sequence-structure correlation. MacCallum 34 first pre-processed the sequence profile generated by PSI-BLAST 35. Then Self-Organizing Maps (SOMs) were applied to reduce the high dimension of the profile data to 3D SOM grids. When converted into RGB codes, contacting β-strands usually have correlated colors.

To sum up, previous studies have drawn the following conclusions: 1) Correlated mutation information is an influential factor in contact prediction, while solely encoding CM is not good enough for predicting contacts; 2) Other information, such as secondary structure and solvent accessibility, can help improve the accuracy; 3) Contacts predicted by top protein structure prediction servers are comparable to, or even a bit better than, those predicted by contact predictors.
1.2. Our Contributions

To take advantage of useful information from the above conclusions, we propose a consensus residue-residue contact prediction method. Our consensus method assigns a confidence score to each contact from all contacts predicted by individual protein structure prediction servers, while also taking CM information into consideration. The intuition behind our method is that top models generated by protein structure prediction servers are usually the results of optimization on global energy and structures. Thus, encoding such information can help to select conserved contacts and long-ranged contacts. Different from traditional consensus methods, which are widely used in protein fold recognition, our method aims to identify correctly predicted contacts even if the majority of servers votes against them. We have observed from recent CASP results that correlation exists among different servers on contacts determined by predicted 3D models, because of the similar tools used by these servers, such as PSI-BLAST and PSIPRED 36. The correlated servers sometimes make a native contact receive less support than some incorrect ones. Our consensus method aims to reduce the impact caused by server correlation. The outline of our consensus method is as follows:

- A maximum likelihood (ML) method is applied to measure the correlation coefficient between any two servers.
- Principal component analysis (PCA) is employed to extract new independent latent servers.
- An integer linear programming (ILP) method is then used to assign a weight to each latent server, by maximizing the confidence score difference between native contacts and incorrect ones.
- CM is also considered to be a latent server which assigns a probability to each contact.

This results in a consensus contact predictor that accurately assigns confidence scores to all contacts extracted from the initial models. The rest of this paper is organized as follows: Section 2 presents some preliminaries. In Section 3, we describe our new consensus method. Section 4 shows and analyzes experimental results on the CASP7 data set. In Section 5, we discuss the potential applications and the future development of our method. Finally, Section 6 draws some conclusions.

2. PRELIMINARIES
In this paper, a model refers to a protein structure outputted by a protein structure prediction server. In contrast to a human expert, a server refers to an automated system which predicts a set of structures for a given amino acid sequence, known as the target. Two residues are in contact if the distance between their Cβ atoms (Cα atom for Glycine) is smaller than 8Å and they are at least 6 residues apart in the sequence. We call a contact a native contact if the two residues are indeed in contact in the native structure of the target. Given a model and a target, the contact accuracy of this model is calculated as the number of native contacts extracted from this model divided by the total number of contacts of this model, while the contact coverage of this model is defined to be the number of native contacts extracted from this model divided by the total number of native contacts. Since contacts extracted from protein structure prediction servers do not have confidence scores, we randomly choose a number of contacts to do statistics, for example, L, L/5, or all, where L is the length of the target protein.

Given a target t_l, 1 ≤ l ≤ t, a server S_i, 1 ≤ i ≤ u, outputs a set of models. The contacts determined by these models are extracted and considered as contact candidates, denoted as C_{i,l} = {c_{i,l,q}, 1 ≤ q ≤ n_{i,l}}, where n_{i,l} is the number of contacts produced by server S_i for target t_l. The set of contact candidates for target t_l is denoted as C_l = ∪_i C_{i,l}. A consensus server aims to assign a confidence score to each candidate.
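The contact accuracy and coverage just defined translate directly into set operations; a minimal sketch, assuming contacts are represented as sets of (i, j) residue-index pairs, follows.

    # Contact accuracy and coverage as defined above; `predicted` and
    # `native` are assumed to be sets of (i, j) residue-index pairs.
    def contact_accuracy(predicted, native):
        return len(predicted & native) / len(predicted) if predicted else 0.0

    def contact_coverage(predicted, native):
        return len(predicted & native) / len(native) if native else 0.0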
This paper is based on the following two assumptions:

- Server S_i generates its predictions based on a confidence measure. That is, for each contact c ∈ C_l, S_i has a confidence s_{i,c,l} that c appears in the native structure. Since the initial confidence score is unavailable, we simply approximate it by the number of models containing this contact divided by the total number of models generated by the server on this target (see the sketch following this list).
- There are some implicit latent independent servers H_j, 1 ≤ j ≤ v, dominating the explicit servers S_i. Given a target t_l, H_j assigns a value h_{j,c,l}, c ∈ C_l, as the confidence that c is a native contact.
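The confidence approximation in the first assumption reduces to counting; a minimal sketch, assuming each model is given as a set of contact pairs, is shown below.

    # Approximate s_{i,c,l}: the confidence of a server in contact c is
    # the fraction of its models (each a set of contact pairs) containing c.
    def server_confidences(models):
        counts = {}
        for model in models:
            for c in model:
                counts[c] = counts.get(c, 0) + 1
        n = len(models)
        return {c: k / n for c, k in counts.items()}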
Identifying the latent independent servers is essential to reduce the negative effects of server correlations and to reduce the dimensionality of the search space, as the number of latent servers is expected to be smaller than the number of original servers. After deriving the latent servers, we can design a new and more accurate prediction server S*, as an optimal linear combination of the latent servers, which for each target t_l assigns a confidence score to each contact candidate c ∈ C_l as follows:

    s*_{c,l} = Σ_{j=1}^{v} λ_j h_{j,c,l}    (1)

where λ_j is the weight of latent server H_j.
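Eq. 1 itself is a single weighted sum per candidate; a minimal sketch, assuming each latent server's confidences are stored in a dict keyed by contact, follows.

    # Score each candidate contact by Eq. 1: a weighted sum of the
    # latent-server confidences h[j][c] with weights lam[j].
    def consensus_scores(h, lam):
        candidates = set().union(*(hj.keys() for hj in h))
        return {c: sum(l * hj.get(c, 0.0) for l, hj in zip(lam, h))
                for c in candidates}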
3. METHODS

The basic idea of our method is to reduce the negative effects caused by the correlations among prediction servers. We first employ the maximum likelihood technique to estimate the server correlations; then adopt the factor analysis technique to uncover the latent servers; and finally design a mixed integer linear programming method to derive the optimal weights for the latent independent servers.
3.1. Maximum Likelihood Estimation of Server Correlations

Let O_{i,j,l} denote the overlap set of C_{i,l} and C_{j,l}, i.e., O_{i,j,l} = C_{i,l} ∩ C_{j,l}, and let o_{i,j,l} = |O_{i,j,l}|. For a given target, let p_{i,j} be the probability that a contact returned by S_i is the same as one returned by server S_j. Under a reasonable assumption that the targets t_l, 1 ≤ l ≤ t, are mutually independent, the likelihood that server S_i, 1 ≤ i ≤ u, generates the contacts c_{i,l,q}, 1 ≤ q ≤ n_{i,l}, is:

    L(p_{i,j}) = Π_{l=1}^{t} p_{i,j}^{o_{i,j,l}} (1 - p_{i,j})^{n_{i,l} - o_{i,j,l}}    (2)

Therefore, the maximum likelihood estimation of p_{i,j} can be calculated as follows:

    p_{i,j} = Σ_{l=1}^{t} o_{i,j,l} / Σ_{l=1}^{t} n_{i,l}    (3)

In the rest of this paper, we use P to denote the matrix P = [p_{i,j}]_{u×u}.
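A sketch of this estimation, together with the eigen-decomposition used in the next subsection to uncover the latent servers, is given below; the array representation of the overlap counts is an assumption, and Eq. 3 itself is a simple ratio of counts.

    # Eq. 3 and the latent-server weight vectors of Eq. 5.
    # overlap[i, j, l] = o_{i,j,l}; counts[i, l] = n_{i,l} (assumed inputs).
    import numpy as np

    def correlation_matrix(overlap, counts):
        u = overlap.shape[0]
        P = np.zeros((u, u))
        for i in range(u):
            for j in range(u):
                # MLE of p_{i,j}: total overlap over total contacts of S_i.
                P[i, j] = overlap[i, j, :].sum() / counts[i, :].sum()
        return P

    def latent_server_weights(P):
        # The eigenvectors of P^T P give the weight vectors w_j of Eq. 5;
        # eigh applies because P^T P is symmetric.
        eigvals, eigvecs = np.linalg.eigh(P.T @ P)
        order = np.argsort(eigvals)[::-1]   # largest eigenvalue first
        return eigvals[order], eigvecs[:, order].T   # rows are the w_j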
3.2. Uncovering the Latent Servers

For a target t_l, let s_{i,c,l} and h_{j,c,l} be the confidence that contact c is chosen as one of the prediction results by server S_i and H_j, respectively. Since the latent servers are mutually independent, it is reasonable to assume that s_{i,c,l} is a linear combination of h_{j,c,l}, 1 ≤ j ≤ v:

    s_{i,l} = Σ_{j=1}^{v} λ_{i,j} h_{j,l}    (4)

where s_{i,l} = <s_{i,1,l}, s_{i,2,l}, ..., s_{i,|C_l|,l}>, 1 ≤ i ≤ u, and h_{j,l} = <h_{j,1,l}, h_{j,2,l}, ..., h_{j,|C_l|,l}>, 1 ≤ j ≤ v. Here, λ_{i,j} is the weight, and a larger λ_{i,j} implies a higher chance that server S_i adopts contacts reported by H_j. From the correlation matrix of the prediction servers S_i, the factor analysis technique is employed to derive λ_{i,j} and h_{j,l}; that is, h_{j,l} can be represented as a linear combination of the s_{i,l} as follows:

    h_{j,l} = Σ_{i=1}^{u} w_{j,i} s_{i,l}    (5)

where <w_{j,1}, w_{j,2}, ..., w_{j,u}> is an eigenvector of P^T P.

3.3. ILP Model to Weigh Latent Servers

After deriving the latent servers H_j (1 ≤ j ≤ v), we can construct a new server S*, as an optimal linear combination of the latent servers. For each target t_l, it assigns each contact candidate c ∈ C_l a score as in Eq. 1. To determine a reasonable setting of the coefficients λ_j, a training process is conducted on a training data set D = {<t_l, C_l^+, C_l^->, 1 ≤ l ≤ |D|}, where t_l ∈ T is a target, C_l^+ ⊆ C_l denotes the set consisting of native contacts, and C_l^- ⊆ C_l denotes the incorrect contact set. The learning process attempts to maximize the number of contacts that are correctly identified by S*. More specifically, for each target t_l in the training data set, a score is assigned to each contact candidate by S*. A good contact predictor should assign native contacts higher scores than incorrect ones. The larger the gap between the scores of native contacts and incorrect ones, the more robust this new prediction approach is. In practice, the "soft margin" idea is adopted to take outliers into account; that is, allowing errors on some samples, we maximize the number of native contacts with a score higher than the incorrect ones by at least a threshold.

In our integer linear programming formulation, we employ two types of indicator variables. Let x_{p,q} be an integer variable such that x_{p,q} = 0 if and only if contact p is given a score higher than q by at least ε. Here, ε is a parameter used as the lower bound of the gap between the score of a native contact and incorrect ones. Similarly, let y_{p,l} denote whether p has a score greater than all the incorrect contacts in C_l^-. Formally, the learning task can be formulated into an ILP problem as follows:
    maximize   Σ_{l=1}^{|D|} Σ_{p ∈ C_l^+} y_{p,l}    (6)

    subject to
    Σ_{j=1}^{v} λ_j h_{j,p,l} - Σ_{j=1}^{v} λ_j h_{j,q,l} ≥ ε - (1 + ε) x_{p,q},  for all l, p ∈ C_l^+, q ∈ C_l^-    (7)
    y_{p,l} ≤ 1 - x_{p,q},  for all l, p ∈ C_l^+, q ∈ C_l^-    (8)
    Σ_{j=1}^{v} λ_j = 1,  λ_j ≥ 0,  1 ≤ j ≤ v    (9)
    x_{p,q} ∈ {0,1},  y_{p,l} ∈ {0,1}    (10)

For constraint 7, it is easy to see that Σ_{j=1}^{v} λ_j h_{j,p,l} - Σ_{j=1}^{v} λ_j h_{j,q,l} ≥ -1. Thus, this constraint forces x_{p,q} to be 1 if the difference between the scores assigned to p and q is smaller than ε. If p has a score not higher than all the incorrect contacts, constraint 8 will force y_{p,l} to be 0. Constraint 9 normalizes the coefficient settings, and constraint 10 restricts x_{p,q} and y_{p,l} to be either 0 or 1. The objective function is the number of native contacts with a score higher than all the incorrect contacts.

3.4. A New Prediction Server

Now, we wrap up everything to obtain a new prediction server. Given a target t*, each server S_i produces a set of contact candidates, C_i*. The set of all candidates is denoted as C* = ∪_i C_i*. For each contact candidate c ∈ C*, the latent probability h*_{j,c} = Σ_{i=1}^{u} w_{j,i} s*_{i,c}, 1 ≤ j ≤ v, is derived from Eq. 5. Then, the consensus server produces a score for each contact candidate based on Eq. 1, and picks the top scored ones as the final predictions.
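To make the training step concrete, a toy version of this ILP for a single target can be written with an off-the-shelf MILP modeling library; the sketch below uses PuLP (an assumption — any MILP solver would do) and mirrors constraints 7-10, with the big-M constant 1 + ε justified by the bound discussed above.

    # A toy version of the ILP of Section 3.3 for one target. h[j] is a
    # dict mapping each contact to latent server j's confidence; pos and
    # neg are the native and incorrect candidate sets.
    import pulp

    def train_weights(h, pos, neg, eps=0.1):
        v = len(h)
        pos, neg = list(pos), list(neg)
        prob = pulp.LpProblem("consensus_weights", pulp.LpMaximize)
        lam = [pulp.LpVariable(f"lam_{j}", lowBound=0) for j in range(v)]
        x = {(a, b): pulp.LpVariable(f"x_{a}_{b}", cat="Binary")
             for a in range(len(pos)) for b in range(len(neg))}
        y = [pulp.LpVariable(f"y_{a}", cat="Binary") for a in range(len(pos))]
        prob += pulp.lpSum(y)                     # objective (6)
        for a, p in enumerate(pos):
            for b, q in enumerate(neg):
                gap = pulp.lpSum(lam[j] * (h[j][p] - h[j][q]) for j in range(v))
                prob += gap >= eps - (1 + eps) * x[(a, b)]   # constraint (7)
                prob += y[a] <= 1 - x[(a, b)]                # constraint (8)
        prob += pulp.lpSum(lam) == 1              # constraint (9)
        prob.solve()
        return [pulp.value(l) for l in lam]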
4. EXPERIMENTAL RESULTS

4.1. Data Set
Server Selection. To fairly evaluate the performance of our consensus method, we chose six automated individual protein structure prediction servers, each of which is a comparative modeling method. These servers are FOLDpro 37, mGenThreader 38, 39, RAPTOR 40, 41, FUGUE3 42, SAM-T02 43, and SPARK3 44. Although there are some fragment assembly based servers with higher overall performance on protein 3D structure prediction than these six servers, such as Rosetta and Zhang-server, we didn't choose them because their assembling process directly uses the results of some contact prediction methods.
Training and Test Data. The biennial CASP has provided us a comprehensive and objective data set. We chose CASP7 targets and models generated by those six servers as our training and test data. For each server on a target, the top 5 models are considered. All server models were downloaded from the CASP7 website, except for mGenThreader, which did not participate in CASP7. We submitted CASP7 targets to the
mGenThreader web server and downloaded the models from there. Eighty-nine CASP7 target proteins had their native structures published after CASP7, while 104 protein sequences were released as targets. We removed redundancy at the 40% sequence identity level using CD-HIT 45, which results in a data set with 88 target proteins. Only T0346 is removed, because it shares 71% sequence identity with T0290. We further removed three targets (T0287, T0334, and T0385) from our data set because there are some errors in the models generated by some of the six individual servers. To do cross validation, we randomly divided the 85 target proteins into four sets with sizes 22, 21, 20, and 22, respectively. If one target belongs to a set, then all of its models and contacts are in this set.
Data Set Statistics. We compared the performance of the six individual servers from the contact prediction accuracy and coverage point of view. The prediction accuracy of a server is calculated as the number of correctly predicted contacts divided by the total number of contacts predicted by this server, while the coverage of a server is calculated as the number of correctly predicted contacts divided by the total number of real contacts in the native structure. For each server on each target, the best model among the top 5 models generated by this server in terms of contact accuracy is chosen. If the number of contacts generated by a model is less than L/5, both the accuracy and the coverage for this model are set to 0. As shown in Table 1, the average accuracy among all contacts determined by the best model ranges from 43% to 53%, with the SAM-T02 server having the highest accuracy. The server "Overall" in Table 1 means the server which contains all contacts determined by the best six models generated by those six servers. The average accuracy of the server "Overall" is very low (12.30%) compared to the average accuracy of any individual server. Recalling the way the accuracy is calculated, the server "Overall" always contains many more correctly predicted contacts than any individual server does. Therefore, the low accuracy of the server "Overall" implies that the incorrectly predicted contacts generated by these individual servers differ from each other in most cases, which means a consensus method can probably be applied to differentiate correctly predicted contacts from incorrectly predicted ones. It can also be seen that the average coverage
of these six servers ranges from 36% to 51%, with RAPTOR having the highest coverage. However, when combining these six servers together, the average coverage of the server "Overall" is very high (about 80%). This means some correctly predicted contacts are only supported by a small number of individual servers, while different servers can predict a common subset of correctly predicted contacts. Note that to fairly evaluate the contact prediction ability of a protein structure prediction server, both accuracy and coverage should be considered. For example, SAM-T02 generates the highest accuracy among the six individual servers. However, its coverage is low (37.1%). This reveals that SAM-T02 tends to generate protein structure models which contain only a small number of contacts, most of which are conserved contacts.
Table 1. The average and deviation of contact accuracy and coverage of the best model among the top 5 models generated by different individual servers on CASP7 targets.

Server         Ave Accu  Dev Accu  Ave Cov  Dev Cov
FOLDpro        0.4511    0.0818    0.4836   0.0928
mGenThreader   0.4317    0.0659    0.4480   0.0851
RAPTOR         0.4843    0.0664    0.5221   0.0697
FUGUE3         0.4630    0.0793    0.3667   0.0554
SAM-T02        0.5331    0.0651    0.3710   0.0551
SPARK3         0.4793    0.0731    0.5118   0.0764
Overall        0.1230    0.0072    0.8028   0.0233
4.2. Server Correlations and Latent Servers
We further studied the correlations among the six individual servers, and derived the relationship among the individual servers and the latent ones. Table 2 shows the correlations among the six individual servers, calculated according to Eq. 3. Note that the matrix is not symmetric, because o_{i,j,l} is not always equal to o_{j,i,l}. As shown in Table 2, the correlation between two servers ranges from 0.25 to 0.59, which implies that some servers are more closely correlated than others in terms of contact prediction. Thus, traditional linear-regression-based consensus methods, which simply apply the "majority voting" rule and assume the error is under a normal distribution, will fail when correct contacts are not supported by a majority of the individual servers.

Table 2. Correlations among the six individual servers, calculated according to Eq. 3. FDP: FOLDpro, MGTH: mGenThreader, RAP: RAPTOR, FUG: FUGUE3, SAM: SAM-T02, SPK: SPARK3.

Server   FDP    MGTH   RAP    FUG    SAM    SPK
FDP      1      0.344  0.426  0.250  0.296  0.410
MGTH     0.347  1      0.418  0.263  0.295  0.413
RAP      0.428  0.414  1      0.296  0.346  0.514
FUG      0.345  0.348  0.402  1      0.365  0.398
SAM      0.502  0.500  0.593  0.466  1      0.593
SPK      0.403  0.407  0.500  0.293  0.336  1

We then derived the relationship between the latent servers and the individual ones. As shown in Table 3, different latent independent servers represent different individual servers; for example, H1 represents the common characteristics shared by these individual servers, because the weights of H1 on these individual servers are about the same; H2 differentiates FUGUE3 from the other servers; H3 represents FOLDpro by a large positive weight, and represents mGenThreader by a large negative weight. Based on the eigenvalues, H4 was eliminated, since the eigenvalue for H4 is much smaller than the others. Thus, H4 can be considered random noise.

Table 3. Relationship among the six individual servers and latent servers. FDP: FOLDpro, MGTH: mGenThreader, RAP: RAPTOR, FUG: FUGUE3, SAM: SAM-T02, SPK: SPARK3.

Server   H1     H2      H3      H4      H5      H6
FDP      0.371  -0.351  0.655   -0.549  0.014   -0.081
MGTH     0.372  -0.258  -0.752  -0.477  -0.004  -0.016
RAP      0.418  -0.225  0.035   0.364   0.265   0.755
FUG      0.373  0.821   0.039   -0.218  0.369   0.012
SAM      0.490  0.202   0.034   0.227   -0.814  -0.036
SPK      0.410  -0.207  -0.023  0.487   0.359   -0.649
We derived the optimal weights for the latent servers by cross validation on the four sets. Correlated mutation is considered to be another independent latent server, because it provides a target sequence-related probability for each contact candidate. CM is calculated as previously described in 19, 23. Each time we trained our ILP model on three of these four sets, and obtained a set of optimal weights, based on which a new prediction server is derived, named S1*, S2*, S3*, and S4*, respectively. In this paper, by saying server S*, we mean server Si* on test set i (i = 1, 2, 3, 4). Table 4 shows the linear combination representation of S* on the individual servers and correlated mutation. We can see that the four sets of weights are very similar. Note here that a negative weight implies that the corresponding server's contribution has been over-expressed by other individual servers which have correlation with it.

Table 4. The linear combination representation of S* on the six individual servers and correlated mutation. FDP: FOLDpro, MGTH: mGenThreader, RAP: RAPTOR, FUG: FUGUE3, SAM: SAM-T02, SPK: SPARK3.

S     FDP    MGTH    RAP    FUG    SAM    SPK    CM
S1*   0.292  -0.283  1.272  1.470  0.230  0.618  0.300
S2*   0.305  -0.274  1.346  1.346  0.217  0.578  0.370
S3*   0.383  -0.290  1.373  1.357  0.141  0.650  0.280
S4*   0.287  -0.440  1.292  1.386  0.123  0.558  0.230

4.3. CASP7 Evaluation

We first assessed our consensus server S* by Receiver Operating Characteristic (ROC) plots. ROC curves provide an intuitive way to examine the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified. Fig 1 shows the performance comparison in terms of contact prediction for server S* and the six individual servers on the four test sets determined by our cross validation.

As shown in Fig 1, server S* performs better than any individual server on all four test sets. For each server, the performance on test set 1 is slightly better than that on the other three test sets, which means test set 1 is the easiest test set among the four. RAPTOR performs better than the other individual servers on the first three test sets, while SPARK3 has the best performance on test set 4. There are clear performance differences between server S* and the best individual server on test sets 1, 2, and 4 when the false positive rate is below 0.3. However, the difference is not obvious on those three test sets when the false positive rate is larger than 0.3. For test set 3, the hardest test set, the performance of S* is much better than any individual server all the time. It is also noticeable that the curve of S* is much smoother than those of the individual servers.

We further evaluated the performance of S* from the average accuracy point of view. Table 5 shows the average accuracy and deviation of S* and the "majority voting" server on the four test sets when different numbers of top contacts are considered. Recall that S* generates a confidence score for each contact candidate; we can easily take the top contacts for each target after sorting all candidates by their confidence scores. We implemented the "majority voting" server as follows: For each contact candidate of a target generated by the best models of the six individual servers, the best model of each individual server votes "Yes" (denoted as 1) or "No" (denoted as 0) to this candidate. The numbers of supporting servers for all candidates are then calculated and sorted, and different numbers of top candidates are taken. The accuracy is calculated on the top candidate sets.

As shown in Table 5, the average accuracy increases when the number of top contacts decreases, except for server S* on test set 1, on which the accuracy for the top L/10 contacts is slightly lower than that for the top L/5 contacts. This is possible because L/10 is usually a small number (20-30 for most cases), and a few incorrectly predicted top contacts will influence the total accuracy significantly. The overall accuracy of S* on all four test sets is at least 62%, and is always higher than the "majority voting" server. For the top L/5 contacts, the accuracy of S* is 73.41%, which is about 5% higher than "majority voting".

We drew Fig 2 to examine the prediction accuracy for the top L/5 contacts of S* on each CASP7 target. It can be seen that the accuracy is higher than 80% on most targets. In fact, among the total 85 targets, S* has accuracy 100% on 13 targets, above 90% on 38 targets, and above 80% on 57 targets, while the accuracy is below 40% for only 16 targets. Note that there are two targets, T0309 (a free modeling target) and T0335 (a template based modeling target), on which S* has accuracy 0. We carefully looked into these two targets. Both targets are very short. The target sequences published by CASP7 for T0309 and T0335 have lengths 76 and 85, respectively. However, the experimentally determined lengths used by CASP7 to evaluate these two targets are only 62 and 36, respectively, which means some parts of the targets are not experimentally determinable or not accurate enough. Thus, L/5 is only 12 and 7 for these two targets. Besides, all six individual servers did poorly on contact prediction on them, which means we only have a few correct candidates among a large number of incorrect ones. This can explain the failure of S* on T0309 and T0335.
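The "majority voting" baseline described above is easy to express in code; a minimal sketch, assuming each server contributes the contact set of its best model, follows.

    # Majority voting: each server's best model votes 1 for every contact
    # it contains; candidates are ranked by number of supporting servers.
    def majority_voting(best_model_contacts, top_k):
        votes = {}
        for contacts in best_model_contacts:   # one set of pairs per server
            for c in contacts:
                votes[c] = votes.get(c, 0) + 1
        ranked = sorted(votes, key=votes.get, reverse=True)
        return ranked[:top_k]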
Fig. 1. Performance comparison using ROC plots in terms of contact prediction for S* (thick solid line), FOLDpro (thick dotted line), mGenThreader (thin dash-dot line), RAPTOR (thin dotted line), FUGUE3 (thick dashed line), SAM-T02 (thin solid line), and SPARK3 (thin dashed line). Panels: Predicting Contacts on Test Sets 1-4; x-axis: False Positive Rate.
modeling), according to Zhang’s assessment 46. The numbers of easy, medium and hard targets are 23, 46, and 16; respectively. Table 6 shows the average accuracy and deviation of s*, individual servers, and “majority voting” method. As shown in Table 6, for easy and medium targets, the accuracy of S* on top L / 5 contacts is 93.71% and 75.85%, respectively, and much higher than the best individual server, where the improvement is at least 17% for each case. However, for hard targets, the accuracy of S* is only 3% higher than SAM-T02, while at least 19% higher than the best of the rest servers. We examined the models generated by SAM-T02.
They are sometimes much shorter than target proteins, and usually contain a very small set of contacts. However, the percentage of native contacts in this set is usually high. On the other hand, server S* always performs better than “majority voting” server on easy, medium, and hard targets, while the improvements are about 2%, 4%, and 12%, respectively. This makes sense because for easy targets, individual servers usually do well, which means for a contact candidate, the more servers support it, the more likely it is correct. However, this rule doesn’t always work on medium and hard targets. Thus, our consensus method does much better than “majority
Fig. 2. Prediction accuracy for the top L/5 contacts of S* on each CASP7 target.
Table 6. Average accuracy and deviation of S*, individual servers, and "majority voting" server on easy, medium, and hard target sets.

Server Name                  Easy Targets Accu.
Top L of S*                  0.8957
Top L/2 of S*                0.9359
Top L/5 of S*                0.9371
Top L/10 of S*               0.9564
FOLDpro                      0.7697
mGenThreader                 0.6783
RAPTOR                       0.7534
FUGUE3                       0.7481
SAM-T02                      0.7538
SPARK3                       0.7621
Top L/5 of Majority Voting   0.9240
5. DISCUSSIONS

The experimental results have demonstrated that by encoding global energy and structure information from another perspective, consensus methods can identify native contacts well. We did not directly compare our method to other contact predictors on exactly the same data set with the same contact definition, since such data is not available. It is widely acknowledged that the CASP data set is objective and comprehensive. Thus, it can be expected that our method would perform much better than other predictors on the same data set, because our method achieves an average accuracy of 73.41% on the CASP7 data set, compared to the generally 30% accuracy of other predictors on data sets with similar difficulty levels to CASP.
One drawback of our method is that it is a selection-only consensus method. If all individual servers generate models with very few native contacts, our method will fail simply because there is nothing correct to select. We tried to mitigate this drawback by using a server-independent feature, CM, to introduce some contact candidates which are not predicted by any individual server. However, CM itself is not strong enough to find native contacts. Thus, future work will combine more server-independent features to introduce native contact candidates even if all individual servers fail to do so. On the other hand, a possibly better measure for consensus contact prediction methods is to require the methods to predict all the native contacts inside the input candidate set instead of predicting a fixed-size contact set. In this way, if all individual servers fail to predict any native contacts, and the consensus
method also returns 0 contacts, the accuracy will be 100%, which makes more sense than 0% under the current evaluation criteria.

A potential application of our contact prediction method is to provide highly conserved constraints for protein structure prediction or refinement methods. Recent CASPs show that fragment assembly based methods usually perform better than traditional comparative modeling methods, because instead of assuming there are known templates in the database which have similar structures to the targets, fragment assembly based methods only require some substructures with similar structures to some regions in templates. However, fragment assembly based methods usually suffer from huge search spaces. Our consensus method has an average accuracy of 73.41% on the top L/5 contacts, while for most cases the accuracy is higher than 80%. Thus, our method can provide a reasonable number of highly conserved contacts for the assembly step to significantly reduce the search space. On the other hand, if all the individual servers we used predict the structure for a target protein extremely well or extremely poorly, our consensus method will probably only be able to improve the assembly speed, rather than the accuracy. In the former case, since almost all contact candidates provided by these individual servers are correct ones, our method can only reduce the total number of well-conserved contacts and thus improve the speed of the assembly step. In the latter case, since there are almost no correct contact candidates for our method to choose, the assembly accuracy can hardly benefit from our results. However, in all other cases, contacts provided by our method will greatly help the assembly process. The reason is that our method can generate a small number of highly conserved contacts. Considering only a small number of contacts will reduce the assembly search space, and thus increase the speed. Moreover, experimental results have demonstrated that our method can generate contacts with higher accuracy than both contact predictors and protein structure prediction methods. This can reduce the risk of generating models with incorrect contacts, which will reduce the risk of selecting incorrect models from the final assembly decoy set, and thus will greatly increase the overall assembly accuracy.
6. CONCLUSIONS

In this paper, we proposed a linear programming based consensus contact prediction method. Experimental results show that this method performs well, especially on easy and medium targets. The accuracy of our method is higher than any previously reported studies.
ACKNOWLEDGEMENT

This work is supported by NSERC Grant OGP0046506, and NSF of China Grant 60496324.

References
1. Y. Xu, D. Xu, and J. Liang. Computational Methods for Protein Structure Prediction and Modeling, 1st ed. Springer 2007.
2. Y. Xu, D. Xu, and J. Liang. Computational Methods for Protein Structure Prediction and Modeling, 2nd ed. Springer 2007.
3. J. Moult, T. Hubbard, K. Fidelis, and J. Pedersen. Critical assessment of methods of protein structure prediction (CASP): round III. Proteins 1999; 37: 2-6.
4. J. Moult, K. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins 2001; 45: 2-7.
5. J. Moult, K. Fidelis, A. Zemla, and T. Hubbard. Critical assessment of methods of protein structure prediction (CASP): round V. Proteins 2003; 53: 334-339.
6. J. Moult, K. Fidelis, B. Rost, T. Hubbard, and A. Tramontano. Critical assessment of methods of protein structure prediction (CASP): round 6. Proteins 2005; 61: 3-7.
7. O. Grana, D. Baker, R.M. MacCallum, J. Meiler, M. Punta, B. Rost, M.L. Tress, and A. Valencia. CASP6 assessment of contact prediction. Proteins 2005; 61: 214-224.
8. N. Clarke, A. Valencia, J.M.G. Izarzugaza, M.L. Tress, and O. Grana. CASP7 assessment of contact prediction. CASP7 presentation, November 2006.
9. D. Chivian, D.E. Kim, L. Malmstrom, P. Bradley, T. Robertson, P. Murphy, C.E. Strauss, R. Bonneau, C.A. Rohl, and D. Baker. Automated prediction of CASP-5 structures using the Robetta server. Proteins 2003; 53(S6): 524-533.
10. D.E. Kim, D. Chivian, and D. Baker. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Research 2004; 32: 526-531.
11. D. Chivian, D.E. Kim, L. Malmstrom, J. Schonbrun, C. Rohl, and D. Baker. Prediction of CASP6 structures using automated Robetta protocols. Proteins 2005; 61(S7): 157-166.
12. K.M.S. Misura, D. Chivian, C.A. Rohl, D.E. Kim, and D. Baker. Physically realistic homology models built with Rosetta can be more accurate than their templates. PNAS 2006; 103: 5361-5366.
13. Y. Zhang, A. Arakaki, and J. Skolnick. TASSER: An automated method for the prediction of protein tertiary structures in CASP6. Proteins 2005; 61(S7): 91-98.
14. M. Tress, A. Valencia, I. Ezkurdia, G. Lopez, and O. Grana. CASP6 assessment of contact prediction. CASP6 presentation, December 2004.
15. S. Miyazawa and R.L. Jernigan. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 1985; 18: 534-552.
16. M.J. Sippl. Calculation of conformational ensembles from potentials of mean force. J. Mol. Biol. 1990; 213: 859-883.
17. T. Grossman, R.M. Farber, and A.S. Lapedes. Neural net representations of empirical protein potentials. In Intelligent Systems in Molecular Biology 1995; 154-161.
18. E.S. Huang, S. Subbiah, and M. Levitt. Recognizing native folds by the arrangement of hydrophobic and polar residues. J. Mol. Biol. 1995; 249: 493-507.
19. U. Gobel, C. Sander, R. Schneider, and A. Valencia. Correlated mutations and residue contacts in proteins. Proteins: Structure, Function and Genetics 1994; 18: 309-317.
20. I.N. Shindyalov, N.A. Kolchanov, and C. Sander. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Prot. Engng. 1994; 7: 349-358.
21. W.R. Taylor and K. Hatrick. Compensating changes in protein multiple sequence alignments. Prot. Engng. 1994; 7: 341-348.
22. D.J. Thomas, G. Casari, and C. Sander. The prediction of protein contacts from multiple sequence alignments. Prot. Engng. 1996; 9: 941-948.
23. O. Olmea and A. Valencia. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold Des. 1997; 2: S25-32.
24. P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins 2001; 5: 157-162.
25. P. Fariselli, O. Olmea, A. Valencia, and R. Casadio. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering 2001; 14(11): 835-843.
26. M.S. Singer, G. Vriend, and R.P. Bywater. Prediction of protein residue contacts with a PDB-derived likelihood matrix. Prot. Engng. 2002; 15: 721-725.
27. G. Pollastri and P. Baldi. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics 2002; 18: S62-70.
28. Y. Zhao and G. Karypis. Prediction of contact maps using Support Vector Machines. In 3rd IEEE International Conference on Bioinformatics and Bioengineering (BIBE) 2003; 26-33.
29. N. Hamilton, K. Burrage, M.A. Ragan, and T. Huber. Protein contact prediction using patterns of correlation. Proteins: Structure, Function, and Bioinformatics 2004; 56: 679-684.
30. M. Punta and B. Rost. PROFcon: novel prediction of long-range contacts. Bioinformatics 2005; 21: 2960-2968.
31. M.J. Zaki, S. Jin, and C. Bystroff. Mining residue contacts in proteins using local structure predictions. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics 2003; 33: 789-801.
32. Y. Shao and C. Bystroff. Predicting interresidue contacts using templates and pathways. Proteins 2003; 53: 497-502.
33. C. Bystroff, V. Thorsson, and D. Baker. HMMSTR: A hidden Markov model for local sequence-structure correlations in proteins. J. Mol. Biol. 2000; 301: 173-190.
34. R.M. MacCallum. Striped sheets and protein contact prediction. Bioinformatics 2004; 20: 224-231.
35. S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 1997; 25: 3389-3402.
36. D.T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999; 292: 195-202.
37. J. Cheng and P. Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006; 22: 1456-1463.
38. D.T. Jones. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 1999; 287: 797-815.
39. L.J. McGuffin and D.T. Jones. Improvement of the GenTHREADER method for genomic fold recognition. Bioinformatics 2003; 19: 874-881.
40. J. Xu, M. Li, G. Lin, D. Kim, and Y. Xu. Protein threading by linear programming. In Proceedings of the Pacific Symposium on Biocomputing (PSB) 2003; 264-275.
41. J. Xu. Protein fold recognition by predicted alignment accuracy. ACM/IEEE Transactions on Computational Biology and Bioinformatics 2005; 2(2): 157-165.
42. J. Shi, T.L. Blundell, and K. Mizuguchi. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 2001; 310(1): 243-257.
43. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov Models for detecting remote protein homologies. Bioinformatics 1998; 14(10): 846-856.
44. H. Zhou and Y. Zhou. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 2005; 58: 321-328.
45. W. Li and A. Godzik. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006; 22: 1658-1659.
46. http://zhang.bioinformatics.ku.edu/casp7/.
IMPROVEMENT IN PROTEIN SEQUENCE-STRUCTURE ALIGNMENT USING INSERTION/DELETION FREQUENCY ARRAYS
Kyle Ellrott, Jun-tao Guo, Victor Olman, Ying Xu.
Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, The University of Georgia, Athens, Georgia 30602

As a protein evolves, not every part of the amino acid sequence has an equal probability of being deleted or of allowing insertions, because not every amino acid plays an equally important role in maintaining the protein structure. However, the most prevalent models in fold recognition methods treat every amino acid deletion and insertion as equally probable events. We have analyzed the alignment patterns of homologous and analogous sequences to determine patterns of insertions and deletions, and used that information to determine the statistics of insertions and deletions for the different amino acids of a target sequence. We define these patterns as Insertion/Deletion (Indel) Frequency Arrays (IFA). By applying IFA to the protein threading problem, we have been able to improve alignment accuracy, especially for proteins with low sequence identity.

Contact: [email protected]
1. INTRODUCTION

Protein threading, a technique for sequence-structure alignment, has played a key role in predicting protein structures in the past decade. Most of the details in a threading model deal with how well an amino acid from a target sequence is aligned to a particular residue position on a known protein structure. For example, the energy functions used in our threading program PROSPECT include mutation, singleton, secondary structure match, and two-body interaction energies 9. These energy functions primarily concentrate on the positive space of the alignment, i.e. rewarding amino acid alignment. Deletion penalties, on the other hand, are a set of terms that describe how to penalize an alignment when gaps are introduced.

There are two primary changes during protein evolution: mutation and insertion/deletion. A mutation event is the result of changing one amino acid to another and is evaluated by mutation energy matrices, such as PAM 3 and BLOSUM 7. The other event in protein evolution is the insertion and deletion of amino acids. These evolutionary changes are evaluated with gap penalty models in protein threading algorithms. While several gap penalty models have been proposed, the most widely used is the simple affine model 17. In this model there is a large penalty for opening a gap, or starting a deletion, and a smaller constant penalty for continuing that insertion/deletion. This can be viewed as a simple linear function, G = W_open + W_const * len, where len is the length of the gap, W_open is the penalty for opening a gap, and every residue deleted is penalized by a constant penalty W_const. This simple linear function can be easily implemented in a dynamic programming based alignment program such as the Smith-Waterman method, with a running time of O(NM), where N is the length of the target sequence and M is the length of the structural template.

In addition to this linear penalty model, more sophisticated methods have been used. These typically attempt to formulate the penalty as non-linear functions 16, 6, or use monotonic functions to avoid over-penalizing large gaps 13. However, these non-linear gap functions cannot be optimized using traditional Smith-Waterman, and require more advanced algorithms for sequence-structure alignment optimization 4, 11. Nonetheless, within the framework of the Smith-Waterman algorithm, it is possible to use a non-linear gap function if the function depends only on local sequence/structure alignments. The penalty of a gap can be made dependent on the probability of an amino acid being deleted or having an insertion next to it. Under these conditions a set of locally optimal decisions can still be aggregated to achieve global optimality, so dynamic programming can still be used.
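To make the contrast concrete, the sketch below (ours, not from the paper; the weights and the indel_freq array are illustrative) compares the affine gap cost with a position-specific one of the kind this paper develops:

W_OPEN, W_CONST = 10.0, 0.5  # illustrative weights

def affine_gap_cost(length):
    # Affine model: G = W_open + W_const * len, identical for every gap.
    return W_OPEN + W_CONST * length

def positional_gap_cost(positions, indel_freq):
    # Position-specific model: each deleted residue i is charged according
    # to how rarely it is deleted among homologs (1 - frequency), so the
    # penalty depends only on local information.
    return W_OPEN + sum(W_CONST * (1.0 - indel_freq[i]) for i in positions)

Because each residue's charge depends only on its own position, the per-step penalty remains locally computable, which is what keeps the dynamic programming recurrence valid.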
To find useful statistical information about deletion and insertion probabilities, we use a technique similar to the one for generating Position Specific Score Matrices (PSSM) 1. PSSMs have had a significant impact on secondary structure prediction and protein fold recognition 21. A PSSM is generated by finding homologous sequences in a non-redundant (NR) sequence database and aligning those sequences; the amino acid mutation patterns are used to create residue-specific replacement scores. The patterns of insertions and deletions can be studied in a similar way. Using statistical analysis of alignments from a 'PsiBlast' search against the NR database, we can construct penalty functions that are based on insertion/deletion patterns specific to a protein family and to different portions of the sequence. These scores are not simply based on a global constant: for every residue, the percentage of times that it is deleted, or has an insertion before or after it, can be measured against a known sequence database. We call this information the Indel Frequency Arrays (IFA). It should be pointed out that this type of energy, unlike some of the other previously mentioned gap models, depends only on the local sequence alignment and thus can be run in the same computational time as the Smith-Waterman algorithm. Similar position-specific gap penalties have been suggested previously, such as the work by Lesk et al. 10. However, their work was based on different scoring values for differently assigned secondary structure values, and was not specific to protein families.
2. METHODS

Alignment Strategy

We use our threading program, PROSPECT, as an example, in which the optimal alignment is calculated by finding an alignment A with the total alignment score E_tot defined as:

E_tot = min_A ( W_mut E_mut(A) + W_singleton E_singleton(A) + W_secstruct E_secstruct(A) + W_open E_open(A) + W_const E_const(A) )    (1)

where E_mut is the mutation energy, E_singleton is the singleton energy, and E_secstruct is the secondary structure match energy; E_open and E_const represent the two aspects of the affine gap function, the gap opening penalty and the constant deletion penalty for each residue removed; and the set of Ws represents the weight parameters.

Optimal alignments can be found with dynamic programming by iteratively solving for the values S_{i,j}, with i and j both going from zero to the length of the target (l) and template (m), respectively. S_{i,j} is the score of the sub-alignment of the target residues 0..i against the template residues 0..j, so the total alignment score is E_tot = S_{l,m}. The value of S_{i,j} can be iteratively calculated with the formula:

S_{i,j} = min { S_{i-1,j-1} + E_{i,j}      (match)
                S_{i-1,j}   + INS(i,j)     (insertion; (i,j) added to I)
                S_{i,j-1}   + DEL(i,j)     (deletion; (i,j) added to D) }    (2)

The energy E_{i,j} for aligning a target position i to a template position j is defined as:

E_{i,j} = W_mut E_mut(i,j) + W_singleton E_singleton(i,j) + W_secstruct E_secstruct(i,j)    (3)

In the original model, the INS and DEL values were calculated as:

INS(i,j) = W_const E_const,                     if (i-1, j) ∈ I
         = W_open E_open + W_const E_const,     if (i-1, j) ∉ I    (4)
where I is the set of insertion operations, and D is the set of deletion operations. These equations refer to operations on the template, i.e. an insertion operation is an insertion on the template.
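The following sketch (our illustration, with the INS/DEL terms simplified to per-position costs) shows how the recurrence of Equations 2-4 can be filled in by dynamic programming, using three states so that the gap-opening term of Equation 4 is charged once per gap run:

import numpy as np

def threading_score(E, ins_cost, del_cost, open_cost):
    # E[i][j]: match energy of Eq. (3) for target residue i vs template
    # position j. ins_cost[i] / del_cost[j] stand in for the per-position
    # INS/DEL terms; open_cost is the W_open * E_open term of Eq. (4).
    n, m = E.shape
    INF = float("inf")
    M = np.full((n + 1, m + 1), INF)  # best score ending in a match
    I = np.full((n + 1, m + 1), INF)  # best score ending inside an insertion
    D = np.full((n + 1, m + 1), INF)  # best score ending inside a deletion
    M[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i, j] = min(M[i-1, j-1], I[i-1, j-1], D[i-1, j-1]) + E[i-1, j-1]
            if i > 0:  # target residue i-1 unmatched (insertion on template)
                I[i, j] = min(I[i-1, j] + ins_cost[i-1],
                              min(M[i-1, j], D[i-1, j]) + open_cost + ins_cost[i-1])
            if j > 0:  # template position j-1 unmatched (deletion)
                D[i, j] = min(D[i, j-1] + del_cost[j-1],
                              min(M[i, j-1], I[i, j-1]) + open_cost + del_cost[j-1])
    return min(M[n, m], I[n, m], D[n, m])  # E_tot = S_{l,m}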
New Gap Energy Model

Deletion is the inverse operation of insertion: a deletion in the target is equivalent to an insertion in the template, and vice versa. However, the two operations are handled differently. A deletion occurs at a specific point, while an insertion occurs between two residues. Our study has shown that deletion and insertion probabilities are not equally distributed across the entire sequence, and are unlikely to be similar for different protein sequences.
Fig. 1. IFA information associated with the SCOP identifier 'd1qhoa2'. The top graph is the IFA information for deleting any of the residues. The bottom graph represents the IFA information for having an insertion before a given residue. A value of zero means that the indel can occur without penalty.
A sample distribution can be seen in Figure 1. The probability of deleting a residue is not directly related to the probability of inserting a residue immediately before or after it, suggesting that a single deletion operation could actually encompass four different energies: insertion and deletion for both sequences being aligned. As a result, the penalty previously described as E_const, which was a constant penalty for every insertion/deletion, can be replaced with a set of four features: E^t_ins, E^t_del, E^q_ins, and E^q_del, where t stands for template and q stands for query or target. In order to retain the W coefficients, each of the new penalty values is scaled between 0 and 1 and multiplied by W_const. Under this model we redefine the INS(i,j) and DEL(i,j) functions as in Equations 7 and 8.

Calculating Indel Profiles
The insertion/deletion profiles used to create the IFA are derived from alignments using 'PsiBlast' against the NR database. Deletion statistics are determined by observing the number of times a residue is deleted, or has a residue inserted before it. In order to easily access indel information, we translate the standard two-line text alignment returned by 'PsiBlast' into a number array, as shown in Figure 2. In this format, the array a holds, for each amino acid in the target sequence, the index of the aligned subject-sequence residue: if a[i] = j, then the i-th amino acid in the target sequence is aligned to the j-th amino acid in the template. Positions that are not aligned to any residue are represented by '-1'.

After the above conversion, finding the insertions and deletions becomes a matter of scanning one array rather than parsing two text strings. To find the deletions, one simply scans the a array looking for the '-1' entries, which represent the non-aligned residues. To find the insertions, one looks at the non-'-1' entries: for the i-th residue that is followed by the next non-'-1' residue j, if a[i] + 1 ≠ a[j] then there is a gap. If the first non-'-1' entry is not '1', there is a pre-sequence insertion. Similarly, if the last non-'-1' entry is not aligned to the last residue of the subject, there is a post-sequence insertion. This information is then summed for each individual residue position and divided by the number of aligned sequences, giving the percentage of times a residue is deleted or allows an insertion. These arrays of percentages are referred to as B_ins and B_del. They are then formulated as energy functions as follows: E_ins(i) = 1 - B_ins(i) and E_del(i) = 1 - B_del(i).
Target:        -ABC--DEFGI----
Template:      AABCAA--FGIZZZZ
TargetAlign:   2, 3, 4, -1, -1, 7, 8, 9
TemplateAlign: -1, 1, 2, 3, -1, -1, 6, 7, 8, -1, -1, -1, -1

Fig. 2. This example shows the conversion from the sequence-based representation of alignments, as typically output by Blast (top), to the easier-to-use array format (bottom). Each position holds the index of the aligned residue in the opposite sequence; deletions are represented as a -1.
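A minimal sketch of this bookkeeping (our code; function names are illustrative, and pre-/post-sequence insertions are omitted for brevity):

def alignment_to_index_array(target, subject):
    # Translate one two-line gapped alignment (as in Fig. 2) into the array
    # a, where a[i] is the 1-based subject index aligned to target residue i,
    # or -1 if that target residue is deleted (aligned to a gap).
    a, sub_pos = [], 0
    for t_char, s_char in zip(target, subject):
        if s_char != '-':
            sub_pos += 1
        if t_char != '-':
            a.append(sub_pos if s_char != '-' else -1)
    return a

def update_counts(a, del_count, ins_count):
    # '-1' entries mark deleted target residues; a jump between consecutive
    # aligned indices marks an insertion before the later residue.
    for i, v in enumerate(a):
        if v == -1:
            del_count[i] += 1
    aligned = [(i, v) for i, v in enumerate(a) if v != -1]
    for (i, vi), (j, vj) in zip(aligned, aligned[1:]):
        if vi + 1 != vj:
            ins_count[j] += 1

Dividing del_count and ins_count by the number of aligned sequences gives the B_del and B_ins frequency arrays, from which E_del(i) = 1 - B_del(i) and E_ins(i) = 1 - B_ins(i).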
Training and Testing
We use two methods to evaluate the alignment performance improvements that the IFA method provides. First, we compared the alignment results with the output from FAST 22, a structural comparison program. We chose FAST because of its efficient and accurate performance: FAST can correctly align 96% of the residue pairs in the aligned regions of the 1033 protein alignments in the HOMSTRAD database 12, 22. As is common practice, an alignment is considered correct if the residue was aligned within 4 residues of the FAST-based structure-structure alignment position 18. The reported percentage accuracy is the percentage of residues placed within 4 residues of the correct position, out of the total possible residue placements. The second method of evaluation is the MAMMOTH program 15. MAMMOTH determines the statistical significance of the backbone structure created by predictive tools against the actual backbone structure of the target. We report the -ln(E) score, for which a value greater than 4 is statistically significant.

Our training set is comprised of 300 SCOP 14 domain entries from the ASTRAL 25% list 2, in which no entry has higher than 25% sequence identity to another. Based on the results from FAST, the average sequence identity for the aligned pairs of this dataset is 9.5%. Using the SCOP identifiers, we then compiled a list of all proteins that occurred in the same folds, superfamilies, and families. To find the optimal set of weights for each gap penalty permutation, we applied 10 cycles of the Violated Inequality Minimization (VIM) 23 method of optimization. The training set was used to find a set of optimal weights for our original threading approach, which uses the traditional affine gap penalty; the same set of weights was used in the variable deletion model. For testing we used a set of 724 proteins, also derived from ASTRAL 25%, that did not overlap with the training set.
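A sketch of the accuracy criterion described above, using the index-array representation of Figure 2 (our code; residues unaligned in the FAST reference are simply not scored):

def alignment_accuracy(predicted, reference, tolerance=4):
    # Fraction of reference-aligned residues placed by the predicted
    # alignment within `tolerance` residues of the FAST-based position.
    # Both alignments use the index-array form of Fig. 2 (-1 = unaligned).
    correct = total = 0
    for pred, ref in zip(predicted, reference):
        if ref == -1:
            continue  # no reference position; not scored
        total += 1
        if pred != -1 and abs(pred - ref) <= tolerance:
            correct += 1
    return correct / total if total else 0.0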
We used the same SCOP table to determine relationships; however, this time the relationships were filtered by the FAST SN score, which determines the significance of the structural alignment created by FAST. Pairs with scores lower than 2 were removed, so that bad alignments would not create noise when analyzing the performance of threading results against the structural alignments. This left a testing set comprised of 3058 pair relationships: 344 family, 1265 superfamily, and 1449 fold level relationships.

Statistical Analysis
We describe our statistical model for comparing the two methods as an experiment with a multinomial distribution having 3 possible outcomes: Method 1 (the IFA method) is better than Method 2 (the original method) (probability P_{+1}), the two methods are equal in their power (probability P_0), and Method 2 is better than Method 1 (probability P_{-1}), with P_0 + P_{-1} + P_{+1} = 1. Our goal is to check the hypothesis H_0: P_{+1} = P_{-1} against the alternative H_1: P_{+1} > P_{-1}. For hypothesis checking we use Pearson's chi-square test, i.e.,

chi^2 = N [ (K_{+1}/N - P_{+1})^2 / P_{+1} + (K_{-1}/N - P_{-1})^2 / P_{-1} + (K_0/N - P_0)^2 / P_0 ]

where N is the number of comparisons, K_{+1} the number of times Method 1 worked better, and K_{-1} the number of times Method 2 worked better. If H_0 is true then P_{+1} = P_{-1} = 0.5 (1 - P_0), and replacing P_0 with its maximum likelihood estimator K_0/N, we get

chi^2 = N [ (K_{+1}/N - p)^2 / p + (K_{-1}/N - p)^2 / p ],    p = 0.5 (1 - K_0/N).

The asymptotic distribution of the chi^2 in our case is a chi-square distribution with one degree of freedom.
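A numeric sketch of this test (our code; SciPy is assumed only for the chi-square tail probability):

from scipy.stats import chi2

def compare_methods(k_plus, k_zero, k_minus):
    # Pearson chi-square test of H0: P_{+1} = P_{-1}, with P_0 replaced by
    # its maximum likelihood estimate K_0 / N; one degree of freedom.
    n = k_plus + k_zero + k_minus
    p = 0.5 * (1.0 - k_zero / n)  # common value of P_{+1}, P_{-1} under H0
    stat = n * ((k_plus / n - p) ** 2 / p + (k_minus / n - p) ** 2 / p)
    return chi2.sf(stat, df=1)  # p-value

# With the FAST-based counts reported in the Results section:
# compare_methods(1464, 882, 712) yields a vanishingly small p-value.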
3. RESULTS

Comparison with Constant Gap Penalty

The overall average improvement in alignment accuracy at different alignment levels is shown in Table 1.
Table 1. A comparison using different gap functions.

SCOP Alignment    Fast: Original / IFA / Increase    MAMMOTH: Original / IFA / Increase    Set Size
Fold              42.6 / 46.2 / 8.5%                 11.5 / 13.2 / 14.8%                   1449
SuperFamily       55.3 / 57.1 / 3.3%                 13.9 / 15.2 / 9.4%                    1265
Family            70.6 / 71.4 / 1.1%                 14.5 / 15.5 / 6.9%                    344
The more evolutionarily distant two proteins are, the bigger the improvement of the IFA method. At the fold level, alignment accuracy increases from 50% to 55%; once two proteins are in the same family, the improvement decreases to 2.5%. A similar trend is observed if we separate protein pairs by their percentage of aligned amino acids: the lower the identity, the larger the improvement that the IFA model provides. This trend can also be seen in Figure 3. The top section of the figure represents binned averages across different levels of sequence identity; the bottom section represents a segmented linear regression.
Fig. 3. The binned averages and multi-segment linear regressions of alignment accuracy for the two methods. The dotted line represents the IFA model. As the sequence identity of the aligned structures decreases, the contribution of the IFA model becomes more valuable.
A specific example of alignment improvement, shown in Figure 4, demonstrates the potential benefits of this information source. Both proteins are classified as 'winged helix' DNA-binding domains. The original model over-compensated for C-terminal deletions, causing the second helix to be aligned to the location where the first helix should be aligned; this cascaded into a series of mid-sequence deletions that smeared two helices together. Our model increases the alignment accuracy from 26.5% to 73%.

In order to show that these improvements in alignment accuracy were not the product of random statistical fluctuations, we also analyzed the comparative performance of the two methods on the same alignment pairs. We counted the number of times the new model led to an improvement in alignment accuracy, shown in Table 2, and calculated the statistical significance of these counts. Using the statistical analysis described in the Methods section, we can calculate the p-value of the hypothesis that the improvements are random. For the testing set with FAST-based alignments, we get the values K_{+1} = 1464, K_0 = 882, and K_{-1} = 712, which for the first statistical testing method leads to a vanishingly small p-value (3.55 x 10^-k; the exponent is illegible in the source). This shows that the difference in performance of the two methods cannot be explained by pure chance, indicating the superiority of Method 1, our new IFA method.
Improving Fold Recognition
In many studies, fold recognition techniques are trained to achieve the optimal alignment accuracy. This is based on the assumption that the closer two proteins are in terms of their fold families, the more amino acids are likely to be correctly aligned; therefore, the better the predicted alignment accuracy, the better the fold recognition method is likely to be. When training machine learning techniques, previous research has used a vector to represent a set of features from an alignment, such as the values of the different energy terms, and trained them for fold recognition using the alignment accuracy as a measure. Previous research has used machine learning techniques such as neural networks 20, SVMs 18, and gradient boosting based functions 8. For these techniques, the more correlated a given feature is to the alignment accuracy, the easier it will be to train the regression function. Therefore, we can predict the benefit a feature will have for fold recognition by measuring its correlation with the alignment accuracy.

We test our new energy functions by comparing the correlation coefficients of E_const, E^q_del, E^t_del, E^t_ins, and E^q_ins with the alignment accuracy. We compared the correlation coefficients for each of the energies on two different threading sets in Table 3: first, the threading set comprised of alignments created with E_const, and second, the set of alignments optimized using the set of variable gap penalties. The combined energy is E_var = E^q_del + E^t_ins + E^q_ins + E^t_del. The variable deletion energies, once combined, are very well correlated with alignment accuracy, indicating a good ability to distinguish correct alignments, and the ability increases even more once they are used to optimize the alignment, as in the variable deletion threading set. As we can see in Table 3, individually each of the separate variable deletion penalties is not as correlated as E_const. However, once they are summed together and used to optimize the alignment, their correlation increases greatly. This makes sense, because each of the variable deletion penalties is only one quarter of the total deletion energy. The increase in correlation from -0.18, using the original model for alignment and fold recognition, to -0.27 with the new model should correspond to a greater ability to differentiate correct folds from incorrect folds.
Fig. 4. This example is for the alignment between SCOP domains 'd1flza1' and 'd1ucra_'. Each block represents an assigned secondary structure element.
Table 2. The side-by-side comparison: the number of times each method had a better score, and the statistical significance of that ratio (columns: Level, IFA, Original, Tie, Test 1, Test 2, Test 3). The p-value is the probability that the improvement is caused by chance; the lower, the better. (Most data cells are illegible in the source; the surviving fragments include Family-level counts of 78 and 26 and p-value mantissas of 2.94 and 1.29.)

Table 3. The correlation coefficients of the gap energies, as applied to the two threading method results. (Data cells illegible in the source; the text reports an increase in correlation from -0.18 with the original model to -0.27 with the new model.)
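The correlation screening used here can be sketched as follows (our code; the feature names and vectors are illustrative):

import numpy as np

def feature_correlations(features, accuracy):
    # Pearson correlation of each candidate energy feature with alignment
    # accuracy over a set of alignments; `features` maps a name to a vector
    # with one entry per alignment. Since lower energies should accompany
    # higher accuracy, a more negative coefficient indicates a better feature.
    return {name: float(np.corrcoef(vals, accuracy)[0, 1])
            for name, vals in features.items()}

# The combined variable-deletion energy is the sum of the four terms:
# E_var = E_del_query + E_ins_template + E_ins_query + E_del_template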
4. DISCUSSION

We have demonstrated our new deletion model in the context of protein threading. This new energy seems to work best for distantly related threading targets. The variable gap penalty of Madhusudhan et al. 11 concentrated on the performance improvements of variable gap penalties for protein alignments with sequence identity spanning the range of 20-40%. Our profile-based variable deletion energy has its best improvements in the low homology range, from 2-15% sequence identity, the so-called 'twilight zone', where both fold recognition and threading alignment accuracies are in desperate need of improvement. As seen in Figure 3, the lower the sequence identity, the more the IFA can improve the accuracy of sequence alignment. At higher sequence identity levels, where the variable deletion penalty starts to lose some of its advantage, it does not cause an increase in false positives, so it can be used safely regardless of the level of homology. We have shown that our energy fits within the Smith-Waterman alignment framework, but it is also theoretically possible to incorporate it into the algorithmic methods suggested by Madhusudhan et al. 11. Not only do Indel Frequency Arrays improve alignment accuracy, we have also shown that the improvement is statistically significant. Further study, with the application of SVMs, neural networks, or gradient boosting methods, will be needed to see whether the increased alignment accuracy correlation coefficient we detected translates to better fold recognition.
5. CONCLUSION

We have shown that there is a large amount of information inherent in the insertions and deletions that occur during protein evolution. This information can be determined by analyzing sequence alignments with homologous sequences. Once applied, this technique can improve protein threading alignment accuracy. We have shown that this information can be applied to the Smith-Waterman sequence alignment algorithm without added complexity. These energies can also be added to more complex methods, such as integer programming 19, 5.
ACKNOWLEDGMENTS

The work is, in part, supported by the National Science Foundation (DBI-0354771/ITR-IIS-0407204/CCF-0621700) and by a Distinguished Cancer Scholar grant from the Georgia Cancer Coalition.
References
1. S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389-3402, 1997.
2. J. Chandonia, G. Hon, N. Walker, L. Lo Conte, P. Koehl, M. Levitt, and S.E. Brenner. The ASTRAL compendium in 2004. Nucleic Acids Research, 32:D189-D192, 2004.
3. M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt. A model for evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345-352, 1978.
4. T.G. Dewey. A sequence alignment algorithm with an arbitrary gap penalty function. Journal of Computational Biology, 8(2):177-190, 2001.
5. K. Ellrott, J.-t. Guo, V. Olman, and Y. Xu. A generalized threading model using integer programming with secondary structure element deletion. Genome Informatics, 17(2), 2006.
6. N.C. Goonesekere and B. Lee. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Research, 32(9):2838-2843, 2004.
7. S. Henikoff and J.G. Henikoff. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA, 89(22):10915-10919, 1992.
8. F. Jiao, J. Xu, L. Yu, and D. Schuurmans. Protein fold recognition using the gradient boost algorithm. In Computational Systems Bioinformatics Conference, 2006.
9. D. Kim, D. Xu, J.T. Guo, K. Ellrott, and Y. Xu. PROSPECT II: protein structure prediction program for genome-scale applications. Protein Eng, 16(9):641-650, 2003.
10. A.M. Lesk, M. Levitt, and C. Chothia. Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. Protein Engineering, 1(1):77-78, 1986.
11. M. Madhusudhan, M.A. Marti-Renom, R. Sanchez, and A. Sali. Variable gap penalty for protein sequence-structure alignment. Protein Engineering, Design, and Selection, 19(3):129-133, 2006.
12. K. Mizuguchi, C. Deane, T. Blundell, and J. Overington. HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science, 7:2469-2471, 1998.
13. R. Mott. Local sequence alignments with monotonic gap penalties. Bioinformatics, 15(6):455-462, 1999.
14. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247(4):536-540, 1995.
15. A. Ortiz, C. Strauss, and O. Olmea. MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci, 11(11):2606-2621, 2002.
16. B. Qian and R.A. Goldstein. Distribution of indel lengths. Proteins: Structure, Function, and Genetics, 45:102-104, 2001.
17. J.G. Reich, H. Drabsch, and A. Daumler. On the statistical assessment of similarities in DNA sequences. Nucleic Acids Res, 12(13):5529-5543, July 1984.
18. J. Xu. Fold recognition by predicted alignment accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(2):157-165, 2005.
19. J. Xu, M. Li, D. Kim, and Y. Xu. RAPTOR: optimal protein threading by linear programming. J. Bioinform. Comput. Biol., 1(1):95-117, 2003.
20. Y. Xu, D. Xu, and V. Olman. A practical method for interpretation of threading scores: an application of neural networks. Statistica Sinica, 12:159-177, 2002.
21. G. Yona and M. Levitt. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol, 315(5):1257-1275, 2002.
22. J. Zhu and Z. Weng. FAST: a novel protein structure alignment algorithm. Proteins, 58:618-627, 2005.
23. A. Zien, R. Zimmer, and T. Lengauer. A simple iterative approach to parameter optimization. In RECOMB Proceedings 2000, pages 318-327, 2000.
COMPOSITE MOTIFS INTEGRATING MULTIPLE PROTEIN STRUCTURES INCREASE SENSITIVITY FOR FUNCTION PREDICTION
Brian Y. Chen, Drew H. Bryant, Amanda E. Cruess, Joseph H. Bylund, Viacheslav Y. Fofanov, David M. Kristensen, Marek Kimmel, Olivier Lichtarge, Lydia E. Kavraki*

Department of Computer Science, Department of Bioengineering, Department of Ecology and Evolutionary Biology, and Department of Statistics, Rice University, Houston, TX 77005, USA

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

The study of disease often hinges on the biological function of proteins, but determining protein function is a difficult experimental process. To minimize duplicated effort, algorithms for function prediction seek characteristics indicative of possible protein function. One approach is to identify substructural matches of geometric and chemical similarity between motifs representing known active sites and target protein structures with unknown function. In earlier work, statistically significant matches of certain effective motifs have identified functionally related active sites. Effective motifs must be carefully designed to maintain similarity to functionally related sites (sensitivity) and avoid incidental similarities to functionally unrelated protein geometry (specificity). Existing motif design techniques use the geometry of a single protein structure. Poor selection of this structure can limit motif effectiveness if the selected functional site lacks similarity to functionally related sites. To address this problem, this paper presents composite motifs, which combine structures of functionally related active sites to potentially increase sensitivity. Our experimentation compares the effectiveness of composite motifs with simple motifs designed from single protein structures. On six distinct families of functionally related proteins, leave-one-out testing showed that composite motifs had sensitivity comparable to the most sensitive of all simple motifs and specificity comparable to the average simple motif. On our data set, we observed that composite motifs simultaneously capture variations in active site conformation, diminish the problem of selecting motif structures, and enable the fusion of protein structures from diverse data sources.
1. INTRODUCTION

Developing an improved understanding of biological systems, the molecular basis of disease, and the design of novel and effective drugs are important efforts which could be enhanced by a broader understanding of the biological function of proteins. However, elucidating protein function is an expensive and time-consuming experimental process, depending on the insight of experienced investigators and expensive laboratory equipment. To support and accelerate this cause, computational techniques for protein function prediction have been developed to gather evidence suggesting hypothetical functions of target proteins. This paper focuses on one family of function prediction techniques that we call motif matching algorithms, such as Match Augmentation (MA) 8, Jess, PINTS 32, and pvSOAR 3, among many others.

* Corresponding Author: [email protected]
The evidence gathered by motif matching algorithms consists of instances of geometric and chemical similarity, called matches, between motif structures, representing sites of known biological function, and substructures of target proteins, for which functional information is unavailable. In the past, matches with statistically significant geometric and chemical similarity have identified targets with sites functionally similar to the motif 3, 8, 32, suggesting that matches may provide meaningful evidence of similar function.

One major challenge confronting the motif matching strategy is the fact that motifs are imperfect markers of function. While motifs are generally designed to represent a known active site, the geometric form and chemical composition of active site characteristics can drastically affect the number of matching functionally related targets (motif sensitivity), as well as the number of unintended matches to unrelated sites (motif specificity). Effective motifs, which are both sensitive and specific, are
critical for a successful application of motif matching, but difficult to design. For this reason, motif refinement towards heightened sensitivity and specificity is a critical open problem. This paper contributes one practical method for motif refinement.

Motif refinement strategies in earlier work 9, 10, 31 implement analyses which ultimately select geometric components for motifs from only one protein structure. We refer to these motifs as Simple Motifs. In response, this paper asks whether Composite Motifs, which combine the geometry of several active site structures, could better capture the natural variability inherent in functionally related active sites. We also asked if the design of motifs based on multiple protein structures could escape the potentially negative effects of using simple motifs. This paper proposes two specific types of composite motifs, averaged motifs and centered motifs, which are constructed from a multiple structural alignment of related active sites.

Beginning with a data set of 6 distinct families of functionally related proteins, we conducted a series of leave-one-out experiments to test the sensitivity and specificity of averaged and centered motifs. In comparison to all possible simple motifs from the same family, averaged and centered motifs performed with high sensitivity and average specificity, while simple motifs exhibited wildly varying sensitivity and specificity, demonstrating that composite motifs diminish the need to select individual motifs. Furthermore, the high sensitivity of averaged motifs also demonstrates that composite motifs can better capture geometric variations within a family of related sites.

This paper does not argue that composite motifs are a solution to the difficult problem of motif design. Rather, we propose that composite motifs are one method for achieving effective motifs which could complement existing strategies for motif refinement, such as MULTIBIND 31, Geometric Sieving 9, Cavity Scaling 10, and SURFNET-Consurf 17. Composite motifs contribute to the study of motif refinement with three unique strengths. First, composite motifs capture variations in active site conformations which are not apparent in any individual protein structure; improved representation of active site conformations can enhance motif effectiveness. Second, composite motifs eliminate the problem of selecting an individual protein structure, sidestepping the risk of selecting ineffective simple
motifs. Finally, composite motifs provide a novel opportunity for the integration of protein structures from novel sources. Since the effectiveness of the motif is based on the geometry of a potentially large set of protein structures, alternative sources of protein structure data, such as snapshots from molecular dynamics simulations and NMR data, could be incorporated into the design of composite motifs. Composite motifs are a first step towards the synthesis of multiple protein structures for improved function prediction.
2. RELATED WORK

The application of motif matching to protein function prediction is affected by at least three distinct subproblems:

(1) selecting a functional site representation,
(2) designing a matching algorithm, and
(3) filtering biologically irrelevant matches.
This paper describes composite motifs, which contribute to the first subproblem. However, a complete demonstration of the effectiveness of composite motifs, in the context of function prediction, also requires solutions to the other two subproblems. This section explains existing approaches to all three subproblems in relation to our contributions.

2.1. Related Work in Motif Design
The design of effective motifs is a two-stage problem requiring a computational representation of protein structure, or motif type, and the choice of specific active site elements to include, the motif design. Motif types in earlier work can be loosely classified into two classes: point-based motifs and volume-based motifs. Point-based motifs have used points in space to represent atom coordinates 8, 30, 32, points on the solvent accessible surface 20, and chemical binding patterns 31. These motif points can be labeled with atomic and residue identity 8, 30, 32, electrostatic potential 20, and evolutionary significance and variation, among many other chemical and biological properties. Labeling motif points allows additional chemical and biological knowledge to be mapped onto an otherwise purely geometric comparison process, increasing the relevance of the motif type.
Volume-based motifs use spheres 10, 35, grids 23, 24, and other geometric representations, such as alpha shapes 3, to represent active clefts and cavities in protein structures. Rather than directly representing atomic structure, volumetric motifs represent volumes that can be functionally significant, such as ligand or cofactor binding sites. While volume-based motifs are not always labeled, some techniques which apply volume-based motifs also integrate sequence analysis and point-based comparison with volumetric comparisons.

Once the motif type is chosen, given a specific active site to represent, a specific motif design must be established for the active site. For point-based motifs, this can involve the selection of the atoms thought to be most closely involved with the function of the protein. In the past, functionally documented amino acids from the literature, databases of catalytic sites, and evolutionarily significant amino acids have been used to design point-based motifs. Volumetric motifs have been designed by identifying statistically significant cavities and indentations on protein surfaces.

Given the active site to be represented, recent results suggest that a selection of amino acids can then be refined for geometric and chemical comparison. For example, identifying geometrically conserved binding patterns common among several functionally related active sites 31 could yield additional matches to functionally related proteins. Motifs can be refined to be geometrically unique, recurring rarely among functionally unrelated proteins 9. Point-based motifs can also be augmented with volumetric data to eliminate matches lacking functionally significant cavities 10. Volumetric motifs have been refined by identifying indentations on the protein surface that are distant from evolutionarily significant amino acids 17. In addition, high-impact volumes within surface clefts, which seem to be essential for functionally related matches, can be automatically identified to refine cavity-aware motifs 10.

This paper provides a unique approach to the refinement of point-based motifs. While other motif refinement techniques focus on the selection of amino acids or integrate additional data, this paper improves on existing motif designs by incorporating the geometry of other protein structures containing similar active sites. In our experimentation,
we asked if this approach would yield motifs that more closely resemble the population of structures with functionally related active sites. The possibility of integrating multiple protein structures yields the first technique, to our knowledge, where motifs can contain geometric information not taken directly from a single protein structure. Our approach is most related to techniques designed to represent a range of protein structures, such as hinge-bending point-based motifs 30 and motifs representing conserved binding patterns 31. Hinge-bending motifs can represent multiple protein structures, but only capture structures implied by the range of hinge motions, which can differ from the population of proteins containing similar functional sites. In comparison, the composite motifs studied in this work are built explicitly from populations of protein structures with similar functional sites. Motifs representing conserved binding patterns represent the largest common set of motif points between a set of functionally similar active sites, but the largest common set of motif points may not include functionally significant motif points with geometric variations in active site conformations. In contrast, our techniques for generating composite motifs, described in Section 3, can represent a geometric consensus among these variations.
2.2. Earlier Motif Matching Algorithms

Motif matching algorithms are designed for compatibility and efficiency with a specific motif type. In addition to full structure alignment methods such as DALI, which could be applied to the motif matching problem, motif matching algorithms for point-based motifs include Geometric Hashing 36, JESS, PINTS 32, and Match Augmentation (MA) 8, 10. One unique advantage of composite motifs is that they are point-based motifs assembled in a novel manner but remain compatible with existing point-based motif matching algorithms. Motif matching algorithms are also designed for compatibility with volume-based motifs, such as pvSOAR 3. A wide range of function prediction and analysis techniques using volume-based analysis examine a single protein structure to identify characteristics consistent with an active site. Among many, SCREEN 26 identifies cavities which are likely to be drug binding sites, SURFNET-Consurf 17 seeks evolutionarily significant catalytic sites, and CASTp analyzes cavities on the protein surface and identifies those likely to have biological activity.
2.3. Statistical Models for Motif Matching
Having found a set of matches using a motif matching algorithm, the final subproblem for function prediction via motif matching is to eliminate matches which are unlikely to have any biological relevance. In several approaches to motif matching, statistical models have been developed which model the degree of geometric and chemical similarity observed in matches with functionally related proteins. In comparison to a baseline degree of similarity observed in matches at random, matches to functionally related proteins exhibit statistically significant geometric and chemical similarity. The statistical models employed by PINTS 32, JESS, and MA 8 have been shown to be capable of identifying functionally related active sites. Statistical models can be used to assign p-values to a given match. The p-value estimates the probability of observing another target, selected at random, with greater geometric and chemical similarity than the target identified with the given match. Thus, a match is statistically significant if the p-value falls below a given significance threshold α.

2.4. The MASH pipeline

In earlier work 11, we developed the MASH software pipeline, which contains a matching algorithm and a statistical model for identifying matches to point-based motifs. Because of its availability and compatibility with composite motifs, we use MASH to benchmark the effectiveness of composite motifs in our experimentation. As input, MASH takes a simple or composite motif, a target protein structure, and a reference set of protein structures. Using MA 8, MASH computes a match m between the motif and the target, as well as a match between the motif and each member of the reference set. Then, applying our statistical model 8, MASH uses these matches to assign a p-value to m. The output of MASH is the match m and its p-value. If p < α, then we say that the match m is statistically significant, and a positive prediction of functional similarity; otherwise, we say that m is statistically insignificant, and a negative prediction of functional similarity. In our experimentation, we use MASH to evaluate composite motifs and to run control experiments on simple motifs.
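The significance decision can be sketched as follows (our illustration; an empirical-tail p-value is shown as a stand-in for the parametric model of Ref. 8):

def empirical_p_value(match_score, reference_scores):
    # Illustrative empirical form of the p-value: the fraction of reference
    # structures that match the motif at least as well as the target, with
    # lower scores (e.g., LRMSD) taken as better. MASH's actual model
    # (Ref. 8) is more sophisticated; this is a stand-in.
    better = sum(1 for s in reference_scores if s <= match_score)
    return better / len(reference_scores)

def predict_functional_similarity(match_score, reference_scores, alpha=0.01):
    # Positive prediction of functional similarity iff p < alpha.
    return empirical_p_value(match_score, reference_scores) < alpha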
3. GENERATING COMPOSITE MOTIFS
In our experimentation, we asked if composite motifs represent geometric variations in functionally related active sites better than simple motifs. For this reason, we detail both simple and composite motifs here.

3.1. Simple Motifs
Derived originally from a single protein structure P_0, a simple motif p_0 is composed of l points in space p_(0,0), p_(0,1), ..., p_(0,l), where the coordinates of each p_(0,i) are derived from an atom in P_0. Each motif point p_(0,i) is also labeled with biological and chemical information. Initially, each motif point is identified with its atom type and amino acid type within P_0. Each motif point also bears a ranking r(p_(0,i)), which is associated with the functional importance of the motif point. The matching algorithm used in this paper, MA, is capable of prioritizing its search for motifs in order of functional importance. Finally, each motif point also contains a list of associated amino acids l(p_(0,i)), called alternate labels, which represent acceptable substitutions in matching target amino acids. This permits our motifs to represent amino acid substitutions in major evolutionary divergences 8, 25 or variations between distinct but chemically related amino acids.
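A minimal data-structure sketch of a motif point as just defined (our code; the field names are ours):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MotifPoint:
    # One labeled point of a simple motif (Section 3.1); a simple motif is
    # an ordered list of these, all derived from one structure P_0.
    coord: Tuple[float, float, float]  # taken from an atom of P_0
    atom_type: str                     # e.g. an alpha carbon
    residue: str                       # amino acid type within P_0
    rank: int                          # functional importance (ET ranking)
    alternate_labels: List[str] = field(default_factory=list)  # allowed substitutions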
3.2. Composite Motifs

Composite motifs are point-based motifs whose motif points are positioned by the geometric consensus of related active site structures. This paper presents averaged and centered motifs, two examples of composite motifs designed from related active sites. In the design of composite motifs, we begin with a set of k protein structures P_0, P_1, ..., P_k, where each P_i contains a functionally related active site, defined as an individual motif p_i = {p_(i,0), ..., p_(i,n)} with exactly n motif points. Given that these motifs are functionally related, we list the motif points in p_0, p_1, ..., p_k in such an order that for any i, 0 ≤ i ≤ n, the motif points p_(0,i), p_(1,i),
Fig. 1. Composite motif construction begins with the multiple structural alignment of the individual motifs p_0, p_1, etc., yielding clusters of correlated points in the final alignment. We describe this iterative alignment process in Section 3.2.
..., p_(k,i) are functionally identical. Furthermore, for any i, 0 ≤ i ≤ n, the motif points p_(0,i), p_(1,i), ..., p_(k,i) are assigned the same ranking and the same alternate labels.

Using a method from Ref. 34, we first compute a multiple structural alignment of the individual motifs, as depicted in Figure 1. This is accomplished by first computing a least RMSD (LRMSD) alignment of each p_i to an arbitrarily selected p_j. In each alignment between one p_i and p_j, p_(i,0) is correlated to p_(j,0), p_(i,1) is correlated to p_(j,1), etc., resulting in a cluster containing all p_(i,0), a cluster containing all p_(i,1), and so on. We compute a centroid for each cluster, and refer to the centroids as C_0, C_1, ..., C_l. In the next iteration, we align each p_i to this set of centroids, instead of the arbitrarily selected individual motif, and recompute the centroids for the new multiple structural alignment. Repeated iterations converge rapidly to a single multiple structural alignment 34, with centroids C_0, C_1, ..., C_l. A completed alignment of amino acids used in our experimentation appears in Figure 3. Once the multiple structural alignment is complete, we use the newly aligned formation of structures to finalize averaged and centered motifs.
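A compact sketch of this iteration (our code, using the standard Kabsch SVD superposition for the LRMSD step):

import numpy as np

def lrmsd_superpose(A, B):
    # Kabsch superposition: rotate and translate point set A (n x 3) onto B,
    # minimizing the RMSD between corresponding points.
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return (A - cA) @ R.T + cB

def iterative_multiple_alignment(motifs, iters=10):
    # Align every motif to the current centroids, recompute the centroids,
    # and repeat; the first pass uses an arbitrarily selected motif as the
    # reference, as described in Section 3.2.
    centroids = motifs[0].copy()
    for _ in range(iters):
        aligned = [lrmsd_superpose(m, centroids) for m in motifs]
        centroids = np.mean(aligned, axis=0)  # C_0 ... C_l, one per cluster
    return aligned, centroids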
3.2.1. Averaged Motifs

Averaged motifs use C_0, C_1, ..., C_l as the coordinates of their motif points, as demonstrated in Figure 2. Once we have the coordinates of the averaged motif points, the labels, ranking, and alternate labels, being identical in each of p_0, p_1, ..., p_k, are applied respectively to each of C_0, C_1, ..., C_l, completing an averaged motif.
3.2.2. Centered Motifs

Centered motifs are initially generated with the same iterative multiple structural alignment. However, once the alignment is complete, the smallest sphere containing each cluster of correlated motif points is computed, and the center of that sphere is used as the composite motif point, as demonstrated in Figure 2. Again, the labels, ranking, and alternate labels are mapped to each of these points.
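A sketch of the two point choices (our code; the smallest-sphere center is approximated with a simple iterative scheme, since the paper does not specify which exact algorithm was used):

import numpy as np

def smallest_sphere_center(points, iters=200):
    # Approximate center of the smallest sphere enclosing `points` (n x 3)
    # by repeatedly stepping toward the farthest point with a shrinking
    # step size (the Badoiu-Clarkson core-set scheme).
    c = points.mean(axis=0)
    for k in range(1, iters + 1):
        farthest = points[np.argmax(np.linalg.norm(points - c, axis=1))]
        c = c + (farthest - c) / (k + 1)
    return c

# Averaged motif point: cluster.mean(axis=0) (the centroid C_i)
# Centered motif point: smallest_sphere_center(cluster)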
3.2.3. Advantages of Composite Motifs

We designed composite motifs to represent variations in active site structures, to diminish the need to select individual structures for simple motifs, and to promote the fusion of protein structures from varying data sources. Towards the first goal, averaged and centered motifs select points in space to represent the variation exhibited by each motif point. This straightforward approach is strongly applicable to the natural variability of protein structures, under the assumption that geometric identity implies functional similarity.
(An LRMSD alignment of two sets of points A and B rotates and translates A to the position where the root mean squared deviation (RMSD) between A and B is minimized.)
Generating a single composite motif that represents a set of related sites also reduces the problem of selecting a single protein structure to represent the entire set. In our experimentation, we will test the degree to which composite motifs can identify functionally related proteins, in comparison to simple motifs based on individual related sites. One concern we had was that some sites might be overrepresented in the family of protein structures, thereby affecting motif points in averaged motifs. Since structural overrepresentation is inevitable, due to the fact that structures are unavailable for all proteins, we designed centered motifs, to use the geometric position of the overall cluster (the smallest surrounding sphere) for motif points. Composite motifs have the distinctive characteristic that protein structure data from many sources could be fused in a single representation. As the availability of protein structures and functional annotations accelerates, composite motifs could provide a useful method for applying additional knowledge towards function prediction. In particular, because hundreds of protein structures can be integrated into composite motifs, additional sources of data, such as snapshots from molecular dynamics simulations and models from structure prediction techniques, could be integrated to counterbalance experimental biases inherent in existing structures and further expand the set of structural variations represented by composite motifs.
4. EX P ER IMENTAT I0N
In controlled experimentation, we compared the effectiveness of simple motifs, averaged motifs, and centered motifs. First, we identified six protein fam-
ilies which contained many distinct protein structures with functionally related active sites. Treating these classifications as a gold standard for functional similarity, we used each family to generate averaged and centered motifs on a leave-one-out basis. Finally, we tested the effectiveness of these averaged and centered motifs to identify statistically significant matches with the left out structure, in comparison to simple motifs. 4.1. Protein Families
The six families of proteins used in this work are taken from the Enzyme Classification (EC) specified by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology 14, which, although imperfect, is standard and useful for our purposes. In each family, we required one primary structure, with functional amino acids documented in the literature, as well as at least 10 other non-mutant protein structures (although EC families with more structures were preferred), all with resolution below 3 Å. The next six paragraphs describe the functionally documented amino acids from each primary structure. We refer to each EC family (bolded below) using the PDB code (also bolded) of its primary structure.
1a3h/3.2.1.4 Bacillus agaradhaerens endoglucanase is a cellulase and belongs to EC family 3.2.1.4. Five points were selected for this motif, including tryptophan 262, which exists in an orientation that allows it to interact with substrate; tryptophan 178, an invariant residue in the subfamily 5-2 enzymes that is part of the aglycon binding sites; and histidine 206, which may play an important role in catalysis, perhaps as part of substrate binding 13.
[Figure: multiple structural alignment of the peroxidase active site, EC family 1.11.1.7; 28 related structures; max. pairwise RMSD 4.17; avg. pairwise RMSD 1.35; primary structure 1ARU.]

Fig. 3. Multiple structural alignment of peroxidase active sites in EC family 1.11.1.7. The substructures aligned in this image demonstrate the distinct geometric variability of related sites in each EC family. Structural differences between sites in each structure are apparent in both sidechain conformations and alpha carbon (spheres, in this image) positions. Some families, such as 1a3h, were distinctly more variable, while others, such as 1did, exhibited less variability.
Glutamic acids 139 and 228 were also included, being the catalytic acid/base and the enzymatic nucleophile, respectively 13.
1aru/1.11.1.7 Peroxidase from the fungus Arthromyces ramosus is a heme protein belonging to EC family 1.11.1.7. Five points were selected for this motif, including histidine 184, which binds the heme iron 16, and the distal arginine (Arg 52 in this structure 16), which has been proposed to play a role in substrate binding and in stabilization of the product of the first step of the enzyme reaction 33. Also included was histidine 56, which is suggested to be responsible for proton translocation in the hydrogen peroxide substrate and has been shown to undergo conformational change in complexes with both cyanide and triiodide 16. Asparagine 93 and glutamic acid 87 form a hydrogen bond network with histidine 56 16.

1asy/6.1.1.12 Aspartyl-tRNA synthetase is a dimeric aminoacyl-tRNA synthetase responsible for the translation of genetic information and belongs to EC family 6.1.1.12. Eight points were selected for this motif. Serine 329 is part of a loop that interacts with
the discriminator base G73 and the first base pair of the stem of the tRNA molecule; serine 423 and lysine 428 are the endpoints of a segment that interacts with the phosphate groups of A72 and G73; and lysine 293 is the only residue making direct contact with a tRNA molecule bound to the other monomer. Arginines 325 and 531 are involved in binding the ATP substrate, bonded to the α-phosphate and γ-phosphate, respectively 5, while aspartic acid 342 plays a role in binding the amino groups of the aspartic acid substrate. Proline 273 has been confirmed to be essential for dimerization 15, and enzymatic activity has been shown to decrease markedly when this residue is substituted.
1did/5.3.1.5 D-xylose isomerase, belonging to EC family 5.3.1.5, converts xylose to xylulose, as in the conversion of glucose to fructose. Six points were selected for this motif. It has been proposed that aspartic acid 56 polarizes and activates histidine 53, which acts as a base to catalyze ring opening, and that lysine 182 aids in isomerization, while tryptophan 136 and phenylalanines 93 and 25 form a completely hydrophobic environment in which the hydride shift occurs 12.
Family    Min (Å)     Max (Å)     Avg (Å)     # of Structs.
1asy      0.072773    3.034972    1.947437    14
1did      0.000272    0.820726    0.251243    93
1k55      0.018086    7.134243    3.790644    181
1rx7      0.007937    5.299205    1.514413    132
1a3h      0.000383    5.754516    2.429289    119
1aru      0.00021     4.169486    1.346931    28

Fig. 4. A summary of the variations in geometric similarity between all pairs of simple motifs used in experimentation, as well as the number of structures in each family. Families are denoted by the PDB code of their primary structure.
1k55/3.5.2.6 Class D β-lactamase, a member of EC family 3.5.2.6, is responsible for the hydrolysis of β-lactam antibiotics, and as a result it is one of the causes of bacterial resistance to this group of antibiotics 18. Eight points were selected for this motif. Serines 67 and 115 and lysine 205 are among the residues active in catalysis, while phenylalanines 69 and 120, valine 117, tryptophan 154, and leucine 155 create a hydrophobic pocket within the active site 18.

1rx7/1.5.1.3 Dihydrofolate reductase, belonging to EC family 1.5.1.3 and required for normal metabolism in prokaryotic and eukaryotic cells, is an enzyme that catalyzes the NADPH-dependent reduction of 7,8-dihydrofolate to 5,6,7,8-tetrahydrofolate 28. Seven points were chosen for this motif. Histidine 45 creates an ionic interaction with the pyrophosphate moiety of the NADP+ coenzyme and makes a bifurcated hydrogen bond with two oxygens of the ADP group. Glycine 96 also makes such a hydrogen bond with two oxygens of the ADP 5'-phosphate. Aspartic acid is the single polar residue in the folate binding cleft and participates in the catalyzed reduction of 7,8-dihydrofolate in two ways: by indirect protonation of N5 and by the precise positioning of the dihydropteridine ring through H-bonding. Phenylalanine 31 forms a rigid ceiling to the pteridine binding site, which appears to be important for catalysis. Isoleucine 50 is among the residues that create a hydrophobic pocket surrounding the folate tail. Finally, glycine 15 is part of a group of amino acids that functions as a lid controlling the entry and exit of ligands into the enzyme, and tryptophan 22 is involved in the slow, rate-limiting release of product 28.
4.2. Motifs used in Experimentation
Simple Motifs. From every structure in every family, we created one simple motif as a control set for our experimentation. Creating a simple motif for the primary structure in each family was accomplished by running the Evolutionary Trace (ET) 25 to identify alternate labels and a ranking of evolutionary significance (see Section 3.1) for all functionally documented amino acids. The geometric positions of the alpha carbons in functionally documented amino acids, coupled with the alternate labels and ranking provided by ET, complete a primary motif for each family. Creating a simple motif for all non-primary structures in each family is substantially more difficult, because functional documentation was not available for many non-primary structures. For this reason, we applied MA to search for the primary motif in the other structures of each protein family, identifying a set of similar sites. In each structure, we use the most geometrically similar site as the simple motif. The lack of functional documentation in many of the non-primary structures of each family leaves few alternative methods for discovering similar sites, but regardless of which site is used, MA is no substitute for functional documentation. Existing alternative methods, such as sequence comparison and other structure comparison algorithms, do not provide any improved guarantees of identifying cognate active sites. A similar approach for identifying related sites was implemented in the Catalytic Site Atlas 27, which uses sequence analysis to relate functionally documented amino acids to similar amino acids in proteins of related function. Sequence analysis does not guarantee functional similarity, but significantly widens the range of similar active sites. In order to minimize any bias introduced by MA, we used very broad geometric thresholds when searching for similar sites. We used MA to consider all similar sites with matching alpha carbons as distant as 10 Å in the LRMSD alignment, while searching for the site with smallest LRMSD.
Geometric thresholds used by MA do not appear to have significantly biased the set of simple motifs. As documented in Figure 4, we measured the degree of pairwise geometric similarity between the simple motifs of each family, and observed notable geometric variation in all families except 1did.

In our experimentation, a statistically significant match between a simple motif and a structure in the same family is called a true positive (TP) match, and a statistically significant match to a structure outside the family is a false positive (FP) match. A statistically insignificant match to a structure inside the family is a false negative (FN), and a statistically insignificant match to a structure outside the family is called a true negative (TN).

Composite Motifs. For each family of k simple motifs, we also created k averaged and k centered motifs in a leave-one-out manner. This is accomplished by identifying the k - 1 simple motifs that are not left out, and using them as individual motifs in the construction of an averaged or a centered motif, as described in Section 3.2. Assembling simple motifs in this way creates a test set where each composite motif can be tested against the left-out structure. For each leave-one-out motif generated, if the left-out member of the protein family has a statistically significant match, then we call this match a TP; if the match to the left-out structure is not statistically significant, we call it a FN. FPs and TNs are counted in the same way as for simple motifs.

4.3. Experimental Protocol

For every simple and composite motif, we computed matches between the motif and every member of the associated protein family. We also computed matches between the motif and 5000 randomly sampled structures from the PDB, to represent a set of functionally unrelated proteins. We then assessed the statistical significance of each match computed, and counted the number of TPs, FPs, TNs, and FNs for all motifs (a sketch of this tally appears below). Given greater computing time, the set of randomly sampled PDB structures could be expanded further. However, in earlier work 8, we observed that sampling 5% of the PDB (5000 is more than 5%) can reasonably represent the geometric composition of the proteins in the PDB. For this reason, sampling 5000 functionally unrelated proteins was deemed sufficient to simulate the number of FP matches observed in general conditions. Overall, approximately 4054 distributed CPU hours were spent gathering these matches.
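Once matches and their P-values are available, the TP/FP/TN/FN bookkeeping above reduces to a simple tally. The sketch below illustrates it; the match lists and the 0.05 significance cutoff are hypothetical stand-ins, not the paper's actual data structures or threshold.

```python
# Illustrative tally of TP/FN (family members) and FP/TN (background),
# mirroring the definitions in Section 4.3. Inputs are lists of match
# P-values; the cutoff alpha is a hypothetical placeholder.
def evaluate_motif(family_pvalues, background_pvalues, alpha=0.05):
    tp = sum(1 for p in family_pvalues if p < alpha)
    fn = len(family_pvalues) - tp
    fp = sum(1 for p in background_pvalues if p < alpha)
    tn = len(background_pvalues) - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "sensitivity": tp / len(family_pvalues),
            "specificity": tn / len(background_pvalues)}

# Example: P-values for 27 family members and 5000 sampled PDB chains.
# counts = evaluate_motif(family_pvalues, background_pvalues)
```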
4.4. Implementation Specifics

This work uses a snapshot of the PDB database from 09.14.2006. Structures with multiple chains were divided into separate structures, producing 93582 structures. While separating chains might block the identification of matches to active sites that span multiple chains, re-integration of separate chains might yield errors which lead to chemically impossible protein structures. None of the motifs used in this experimentation spans separate chains. (A sketch of this chain-splitting step appears below.) Composite motifs were computed using C/C++ code developed on an Athlon XP 2600+ with 1 GB of RAM, running Debian Linux. Computing averaged and centered motifs, described in Section 3, takes approximately 10-15 seconds on this machine. P-values and matches were computed using distributed MASH on a 28-chassis Cray XD1 with 672 2.2 GHz AMD Opteron cores.
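A minimal sketch of the chain-splitting step, assuming Biopython as the tool (the paper does not state which software was used):

```python
# Split each chain of a PDB entry into its own structure file, without
# re-integrating chains (which could yield chemically impossible
# structures). Biopython is an assumed tool here, not the authors' code.
from Bio.PDB import PDBParser, PDBIO, Select

class OneChain(Select):
    def __init__(self, chain_id):
        self.chain_id = chain_id
    def accept_chain(self, chain):
        return chain.id == self.chain_id

parser = PDBParser(QUIET=True)
structure = parser.get_structure("1aru", "1aru.pdb")
io = PDBIO()
for chain in structure[0]:          # iterate chains of the first model
    io.set_structure(structure)
    io.save(f"1aru_{chain.id}.pdb", select=OneChain(chain.id))
```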
4.5. Averaged and Centered Motifs are Sensitive and Specific

We compared the sensitivity and specificity of averaged and centered motifs to those of every possible simple motif in each protein family. Observed sensitivity is plotted in Figure 5. The horizontal axis represents each family of EC proteins, denoted by its primary structure. The vertical axis represents sensitivity: the proportion of TP matches observed relative to the number of proteins in the protein family. The black brackets, each having three hash marks, signify the minimum, mean, and maximum number of TP matches identified by simple motifs in the EC class. Every simple motif in the family corresponding to 1did matched all members of the family. The dark grey line represents the number of TP matches identified by centered motifs, and the light grey line represents the number of TP matches identified by averaged motifs. Averaged motifs were among the most sensitive of all individual matches. One family of protein structures, 1did, demonstrated very low structural variability. This is consistent with the observation from Figure 4 that simple motifs in 1did expressed little geometric variability. As a result, composite motifs generated from this family also performed perfectly.
[Fig. 5 chart: "TP Match Comparison between Individual and Composite Motifs". Lines plot Max., Min., and Avg. Simple TPs together with Averaged TPs and Centered TPs over the families 1asy, 1did, 1k55, 1rx7, 1a3h, and 1aru; horizontal axis: Motifs / Protein Family.]
Fig. 5. A comparison of TP matches found by composite motifs, relative to TP matches found by simple motifs from the same family. On the vertical axis, we normalize the total proportion of TP matches for each family; a value of 1.0 demonstrates that the motif identified statistically significant matches to all structures in its EC family. On the horizontal axis, we chart the protein families studied in this work. The vertical black bars indicate the maximum, minimum, and average number of TP matches identified by single-structure motifs from each EC family. It is apparent, with the exception of 1did, that single-structure motifs can fall within a wide range of sensitivity. The dark and light grey lines signify the number of TP matches identified by centered and averaged motifs, respectively. Composite motifs, especially averaged motifs, are significantly more sensitive than most simple motifs on almost all protein families studied.
Among individual motifs, sensitivity fluctuates significantly. For example, in the family of 1rx7, some individual motifs identify matches with only 2 out of the 136 remaining members of the family, while other individual motifs identify as many as 112. In the family of 1aru, some individual motifs identify matches with only 11 out of the 27 remaining members of the family, while others identify as many as 24. The choice of individual structure for motif design thus puts the sensitivity and specificity of the resulting motif at significant risk. In comparison, the sensitivity of averaged motifs was consistently greater than the mean sensitivity of individual motifs, and the sensitivity of centered motifs was similar to that mean as well. With the exception of averaged motifs for 1a3h, composite motifs in general did not outperform all individual motifs. This demonstrates that composite motifs largely avoid the problem of selecting individual motifs, and that averaged motifs can achieve very high sensitivity. We measured specificity in Figure 6. The horizontal axis again corresponds to each family of EC
proteins, and the vertical axis corresponds to the number of FP matches, from the random sample of 5000 PDB proteins, observed for each motif. We report the number of FPs observed, instead of specificity, because there are so many more unrelated proteins than functionally related proteins that specificity is almost always close to 99%; reporting the number of FPs makes the results easier to interpret. The black brackets correspond to the highest, lowest, and mean number of FP matches to each individual motif. The dark grey and the light grey lines correspond to the number of FP matches to centered and averaged motifs, respectively. The mean number of FP matches observed with simple motifs was very similar to the number of FP matches observed with centered and averaged motifs. The number of FPs observed can fluctuate significantly among individual motifs. In 1a3h, some individual motifs identify 123 FP matches, whereas others identify only 41. In other families, specificity did not fluctuate as much, such as in 1rx7, where individual motifs identified between 38 and 57 FP matches.
[Fig. 6 chart: "FP Match Comparison between Individual and Composite Motifs", over the families 1asy, 1did, 1rx7, 1k55, 1a3h, and 1aru; horizontal axis: Motifs / Protein Family.]
Fig. 6. A comparison of FP matches found by composite motifs, relative to FP matches found by simple motifs from the same family. On the vertical axis, we plot the number of FP matches observed. On the horizontal axis, we chart the protein families studied in this work. The vertical black bars again indicate the maximum, minimum, and average number of FP matches identified by single-structure motifs from each EC family. The dark and light grey lines signify the number of FP matches identified by centered and averaged motifs, respectively. With one exception, composite motifs tend to identify an average number of FP matches, in comparison to single-structure motifs, demonstrating that composite motifs are not an additional source of prediction error.
In comparison, averaged and centered motifs almost always identified an average number of FP matches. Composite motifs appear to avoid the high false positive rates which can occur with individual motifs, again reducing the problem of selecting individual protein structures.
5. CONCLUSIONS

We have described composite motifs, a unique approach to motif refinement. Overall, composite motifs achieve sensitivity among the most sensitive individual motifs, while maintaining average specificity and eliminating the problem of accidentally selecting an ineffective simple motif. On 6 families of functionally related proteins, our experimentation demonstrates, on a small scale, that composite motifs can capture variations in active site conformations. We observed that averaged motifs performed with sensitivity comparable to the most sensitive simple motifs, and that centered motifs performed with sensitivity typical of the average simple motif. While increasing sensitivity, averaged and centered motifs tended to identify FP matches typical of the average simple motif. We also observed that simple motifs had sensitivity and specificity falling in a very wide range. Selecting any individual structure for the design of a motif risks the selection of an insensitive or nonspecific simple motif. In our experimentation, we observed that composite motifs may diminish this problem, because no selection needs to be made, and because they performed with high sensitivity and average specificity. As the availability of protein structures and functional annotations accelerates, we feel that composite motifs will become increasingly applicable for effective annotation of protein structures and for the integration of additional types of structural information from diverse data sources.

ACKNOWLEDGEMENTS

This work is supported in part by grants from NSF DBI-0547695 and NSF DBI-0318415 through a subcontract from the Baylor College of Medicine. Additional support is gratefully acknowledged from training fellowships from the W.M. Keck Center for Interdisciplinary Training (NLM Grant No. 5T15LM07093) to B.C.; from March of Dimes Grant FY03-93 to O.L.; from a Sloan Fellowship to L.K.; and from a VIGRE Training in Bioinformatics Grant from NSF DMS 0240058 to V.F. Experiments were run on equipment funded by NSF CNS-0523908 and NSF CNS-042119, Rice University, and a partnership with AMD and Cray. D.B. has been partially supported by the W.M. Keck Undergraduate Research Training Program and by the Brown School of Engineering at Rice University.
References

1. Barker JA, Thornton JM. An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics 2003;19:1644-1649.
2. Binkowski TA, Naghibzadeh S, Liang J. CASTp: Computed Atlas of Surface Topography of proteins. Nucl. Acid. Res. 2003;31:3352-55.
3. Binkowski TA, Freeman P, Liang J. pvSOAR: detecting similar surface patterns of pocket and void surfaces of amino acid residues on proteins. Nucl. Acid. Res. 2004;32:W555-8.
4. Bystroff C, Oatley SJ, Kraut J. Crystal structures of Escherichia coli dihydrofolate reductase: the NADP+ holoenzyme and the folate-NADP+ ternary complex. Substrate binding and a model for the transition state. Biochemistry 1990;29:3263-3277.
5. Cavarelli J, Rees B, Thierry JC, Moras D. Yeast aspartyl-tRNA synthetase: a structural view of the aminoacylation reaction. Biochimie 1993;75:1117-1123.
6. Cavarelli J, Rees B, Ruff M, Thierry JC, Moras D. Yeast tRNA(Asp) recognition by its cognate class II aminoacyl-tRNA synthetase. Nature 1993;362:181-184.
7. Cavarelli J, Eriani G, Rees B, Ruff M, Boeglin M, Mitschler A, Martin F, Gangloff J, Thierry JC, Moras D. The active site of yeast aspartyl-tRNA synthetase: structural and functional aspects of the aminoacylation reaction. EMBO J. 1994;13:327-337.
8. Chen BY, Fofanov VY, Kristensen DM, Kimmel M, Lichtarge O, Kavraki LE. Algorithms for structural comparison and statistical analysis of 3D protein motifs. Proceedings of Pacific Symposium on Biocomputing 2005. 2005:334-45.
9. Chen BY, Fofanov VY, Bryant DH, Dodson BD, Kristensen DM, Lisewski AM, Kimmel M, Lichtarge O, Kavraki LE. Geometric sieving: automated distributed optimization of 3D motifs for protein function prediction. Proceedings of The Tenth Annual International Conference on Computational Molecular Biology (RECOMB 2006). 2006:500-15.
10. Chen BY, Bryant DH, Fofanov VY, Kristensen DM, Cruess AE, Kimmel M, Lichtarge O, Kavraki LE. Cavity-aware motifs reduce false positives in protein function prediction. Proceedings of the 2006 IEEE Computational Systems Bioinformatics Conference (CSB 2006). 2006:311-23.
11. Chen BY. Geometry-based Methods for Protein Function Prediction. PhD thesis, Rice University 2006.
12. Collyer CA, Blow DM. Observations of reaction intermediates and the mechanism of aldose-ketose interconversion by D-xylose isomerase. Proc. Natl. Acad. Sci. 1990;87:1362-1366.
13. Davies GJ, Dauter M, Brzozowski AM, Bjornvad ME, Andersen KV, Schulein M. Structure of the Bacillus agaradherans family 5 endoglucanase at 1.6 Å and its cellobiose complex at 2.0 Å resolution. Biochemistry 1998;37:1926-1932.
14. International Union of Biochemistry, Nomenclature Committee. Enzyme Nomenclature. Academic Press: San Diego, California, 1992.
15. Eriani G, Cavarelli J, Martin F, Dirheimer G, Moras D, Gangloff J. Role of dimerization in yeast aspartyl-tRNA synthetase and importance of the class II invariant proline. Proc. Natl. Acad. Sci. USA 1993;90:10816-10820.
16. Fukuyama K, Kunishima N, Amada F, Kubota T, Matsubara H. Crystal structures of cyanide- and triiodide-bound forms of Arthromyces ramosus peroxidase at different pH values: perturbations of active site residues and their implication in enzyme catalysis. J. Biol. Chem. 1995;270:21884-21892.
17. Glaser F, Morris RJ, Najmanovich RJ, Laskowski RA, Thornton JM. A method for localizing ligand binding pockets in protein structures. Proteins 2006;62:479-88.
18. Golemi D, Maveyraud L, Vakulenko S, Tranier S, Ishiwata A, Kotra LP, Samama J-P, Mobashery S. The first structural and mechanistic insights for class D β-lactamases: evidence for a novel catalytic process for turnover of β-lactam antibiotics. J. Am. Chem. Soc. 2000;122:6132-6133.
19. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 1993;233:123-138.
20. Kinoshita K, Nakamura H. Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Science 2003;12:1589-1595.
21. Kristensen DM, Chen BY, Fofanov VY, Ward RM, Lisewski AM, Kimmel M, Kavraki LE, Lichtarge O. Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity. Protein Science 2006;15:1530-6.
22. Kunishima N, Fukuyama K, Matsubara H, Hatanaka H, Shibano Y, Amachi T. Crystal structure of the fungal peroxidase from Arthromyces ramosus at 1.9 Å resolution. Structural comparison with the lignin and cytochrome c peroxidases. J. Mol. Biol. 1994;235:331-344.
23. Laskowski RA. SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. Journal of Molecular Graphics 1995;13:321-330.
24. Levitt DG, Banaszak LJ. POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. Journal of Molecular Graphics 1992;10:229-34.
25. Mihalek I, Res I, Lichtarge O. A family of evolution-entropy hybrid methods for ranking of protein residues by importance. J. Mol. Biol. 2004;336:1265-82.
26. Nayal M, Honig B. On the nature of cavities on protein surfaces: application to the identification of drug-binding sites. Proteins 2006;63:892-906.
27. Porter CT, Bartlett GJ, Thornton JM. The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research 2004;32:D129-D133.
28. Reyes VM, Sawaya MR, Brown KA, Kraut J. Isomorphous crystal structures of Escherichia coli dihydrofolate reductase complexed with folate, 5-deazafolate, and 5,10-dideazatetrahydrofolate: mechanistic implications. Biochemistry 1995;34:2710-2723.
29. Russell RB. Detection of protein three-dimensional side chain patterns. New examples of convergent evolution. J. Mol. Biol. 1998;279:1211-27.
30. Shatsky M, Nussinov R, Wolfson HJ. FlexProt: alignment of flexible protein structures without a predefinition of hinge regions. Journal of Computational Biology 2004;11:83-106.
31. Shatsky M, Shulman-Peleg A, Nussinov R, Wolfson HJ. The multiple common point set problem and its application to molecule binding pattern detection. J. Comp. Biol. 2006;13:407-28.
32. Stark A, Sunyaev S, Russell RB. A model for statistical significance of local similarities in structure. J. Mol. Biol. 2003;326:1307-1316.
33. Vitello LB, Erman JE, Miller MA, Wang J, Kraut J. Effect of arginine-48 replacement on the reaction between cytochrome c peroxidase and hydrogen peroxide. Biochemistry 1993;32:9807-9818.
34. Wang X, Snoeyink J. Multiple structure alignment by optimal RMSD implies that the average structure is a consensus. Proceedings of Computational Systems Bioinformatics 2006 (CSB2006). 2006.
35. Williams MA, Goodfellow JM, Thornton JM. Buried waters and internal cavities in monomeric proteins. Protein Science 1994;3:1224-35.
36. Wolfson HJ, Rigoutsos I. Geometric hashing: an overview. IEEE Comp. Sci. Eng. 1997;4:10-21.
Ontology, Database and Text Mining
AN ACTIVE VISUAL SEARCH INTERFACE FOR MEDLINE

Weijian Xuan1, Manhong Dai1, Barbara Mirel2, Justin Wilson1, Brian Athey1, Stanley J. Watson1, Fan Meng1,*

1 Molecular and Behavioral Neuroscience Institute and Department of Psychiatry, 2 National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, MI 48109, USA. *Email: [email protected]

Searching the Medline database is almost a daily necessity for many biomedical researchers. However, available Medline search solutions are mainly designed for the quick retrieval of a small set of most relevant documents. Because of this search model, they are not suitable for the large-scale exploration of literature and the underlying biomedical conceptual relationships, which are common tasks in the age of high throughput experimental data analysis and cross-discipline research. We are developing a new Medline exploration approach that incorporates interactive visualization together with powerful grouping, summary, sorting, and active external content retrieval functions. Our solution, PubViz, is based on the FLEX platform designed for interactive web applications, and its prototype is publicly available at: http://brainarray.mbni.med.umich.edu/Brainarray/DataMining/PubViz.
1. INTRODUCTION

Understanding the biomedical significance of high throughput data, such as those from microarray gene expression analysis, genome-wide SNP genotyping, and biomedical images from in situ hybridization or MRI, is a major challenge in the postgenomic era. Researchers often have to examine large bodies of literature in unfamiliar fields for new insights in their area of interest. Unfortunately, prevailing Medline search approaches were largely designed for the efficient retrieval of a small number of records rather than an in-depth exploration of a large body of literature. The inherent limitations in the prevailing search engines such as Entrez and Google Scholar prevent them from being effective large-scale literature exploration tools. Firstly, the widely-used Medline search methods rely heavily on a step-wise narrowing of search scope, but such an approach does not work well for exploring new territories. This is because employing sensible filtering criteria to investigate all potentially relevant topics often requires good background knowledge. For example, in microarray gene expression analysis, researchers frequently have to deal with lists of genes that are not known to be associated with the biological processes in which they are interested. Researchers have to utilize other intermediate concepts to establish the indirect link between gene lists and specific biological processes. Identifying these intermediate concepts through literature searches with existing methods, however, is very difficult. Medline searches using a list of such gene names often lead to hundreds or even
thousands of Medline records. Few options are available for identifying potentially relevant topics or novel conceptual relationships other than going through the retrieved records one-by-one in the prevailing Medline search systems. It would be ideal to have a flexible overview function that can summarize the search results based on criteria from different biomedical concept categories, such as protein (gene product) interactions, pathways or cellular processes, anatomical location, and known high-level biological processes or diseases. Besides increasing search efficiency, the grouping and summarization of search results using different criteria will provide many additional biomedical concepts that can potentially link unfamiliar gene names to the targeted pathophysiological processes. This ability to view the summaries of a large record set from different angles has several benefits. It exceeds even the current advance of using a single grouping criterion, such as MeSH term-based grouping, something that we implemented in our GeneInfoMiner 20 and that significantly improves Medline search efficiency. Additionally, examining search results from different viewpoints will stimulate new ideas. Systematic mapping of search results to different concept categories, such as interacting proteins, pathways, and anatomical locations, is also likely to be more comprehensive than what a researcher can think of at a given moment, prompting him or her to examine a problem from more aspects than would otherwise be considered.
Secondly, it will be a boon to researchers if they can see contextual similarity relationships among different records instead of the linear lists presented by PubMed or Google Scholar. Yet even when retrieved records are mapped to MeSH terms and presented in tabular format, as we do in our GeneInfoMiner 20, information about the similarity of record sets associated with different MeSH terms is not shown. Such similarity information between different groups of records is very useful for revealing novel conceptual relationships as well as for increasing the accuracy of document retrieval. Some applications have tackled the difficulty of apprehending the pairwise similarity relationships between different records or record sets through various visualization techniques 4,7,11,13,15,17,21. The desktop application RefViz is a noteworthy example. It uses a "galaxy" view for exploring large Medline record sets after similarity-based clustering 19. Yet web implementations of such similarity-based Medline record overviews have encountered obstacles due to the CPU-intensive nature of clustering Medline record sets. As we describe later, modern video hardware can be used to speed up similarity and clustering calculations significantly. Some new-generation Medline search solutions such as ALIBABA and botXminer have begun to overcome these obstacles by using network graphs to display the biomedical conceptual relationships extracted from different Medline records 2,13,15. They aim to enable researchers to grasp the complex biomedical conceptual relationships in search results at a glance. While these pioneering works produce impressive graphics and potentially increase literature exploration efficiency, the usefulness of these tools is severely constrained by the poor performance of conceptual relationship extraction by existing techniques. While there are a number of reasonable solutions for identifying named entities in specific categories, such as gene and protein names in biomedical literature, none of the existing conceptual relationship extraction methods can deal with content from the full Medline database in a satisfactory manner. As a result, the reliability of such conceptual relationships in such networks, particularly those involving indirect relationships, is questionable under many situations 16. In addition, the apprehension of conceptual networks quickly becomes very difficult as the number of concepts increases beyond 50 or so. A
conceptual network with more than several dozen elements, and whose membership and relationships alter upon every new query, does not encourage confidence in its typical users. Another challenge is that merely utilizing information within the boundary of the Medline database in Medline explorations is far from sufficient. Data and knowledge residing outside of the Medline database are critical to an understanding of the full implication of search results as well as to the development of new ideas for subsequent searches. Because of this, some of the existing Medline search engines, such as Entrez and PubGene, as well as the ALIBABA 15 and botXminer 13 mentioned previously, add hyperlinks to biomedical concepts in the search results to facilitate the further exploration and understanding of the related concepts. This hyperlink approach significantly improves the exploration of Medline search results. But relying solely on concept-associated hyperlinks has four shortcomings. 1) Low retrieval efficiency: users have to click hyperlinks one-by-one in order to investigate related external information. It would be ideal to have a mechanism that grabs the related information automatically from external databases and presents it together with Medline search results. 2) Separation of related information: because the related information can only be retrieved by clicking a hyperlink underlying a concept in the search results, it is hard to investigate similar external information together in the development of new ideas. For example, if three types of hyperlinks (Entrez Gene, Allen Brain Map, and dbSNP) are provided for each gene name in the Medline search results, it would be better to present external information in the same category, or to group hyperlinks for the same type of external data together, for a given set of Medline search results. 3) Inability to map search results to external data for an effective overview: a hyperlink only provides a point-to-point association, not summary information about all the search results with regard to an external data source. For example, although it is fairly straightforward to add pathway links to individual gene names in search results, it would be more useful to map the search results to known pathways and to present an overview of search results based on pathways. This way a researcher can easily learn how each pathway is related to the search results, based on the number of genes or small molecules associated with Medline records in individual
pathways. It will also naturally suggest additional search terms using biologically meaningful relationships presented in the same pathway. 4) Failure to incorporate external information automatically in a search: although it is always possible to refine an existing search by utilizing new information obtained from external database links, this approach does not fundamentally change the prevailing mode of Medline record retrieval, which relies mainly on information within the Medline database, such as MeSH terms, query-record and record-record similarity, or (in Google Scholar) citation/hit rate. Useful knowledge in external databases, such as molecular pathways, neuroanatomical connections between different brain regions, and genomic location and linkage disequilibrium relationships among genes, SNPs, STS/microsatellite markers, and cytobands, is not utilized in retrieving related Medline records. Despite its extensiveness, Medline contains only a small fraction of biomedical knowledge, with the rest distributed across full-text papers, textbooks, and expert-curated databases. Consequently, Medline searches would be significantly improved if external data and knowledge were effectively utilized in the search algorithm rather than just using hyperlinks to access external information and then refining queries. For the above reasons, new approaches for improving Medline exploration are needed. One such approach is to automatically acquire relevant external information, present it in meaningful ways, and provide efficient functions for data exploration. In terms of pulling in information, for example, a researcher may be interested in learning the potential biological significance of gene lists derived from a microarray study. In such a case, whenever a list of genes is used as input for a Medline search, it will be very useful to use this gene list to automatically obtain relevant information from other databases. Such information includes genes that directly interact with the query gene list, SNPs and genetic markers in the vicinity of the query genes, anatomical locations of transcripts from the query genes, related pathways and gene ontology categories, etc. Because the majority of such information is not in the Medline database, presenting these data in an organized manner together with the Medline search results should help to increase the thoroughness as well as the efficiency of Medline searches. For example, having a list of
interacting genes, pathways, or in situ hybridization images of the query genes readily available during Medline exploration will be very useful for suggesting new gene, anatomical, and functional search terms for refining Medline searches about unfamiliar genes. Although it is not possible to bring all related external information to the same screen, allowing users to visit external databases through links presented in predefined external information categories will help researchers investigate different functional aspects of query genes. Without doubt, since different researchers have different external information requirements, a good solution must provide good extensibility for users to configure search and display functions for their favorite data sources. While the active "pulling" and organized presentation of external information for refining searches will be helpful, a more comprehensive solution should also incorporate external information in determining the similarity between individual biomedical concepts, as well as between Medline records, in the document retrieval algorithm mentioned previously. This will involve the generation of distance matrices and weights for different categories of biomedical concepts during Medline record similarity calculation. Conceivably, different similarity matrices are preferable when the focus of the search differs, such as looking for Medline records related to in situ or immunohistochemistry studies of genes vs. identifying Medline records related to genetic markers associated with genes. Finally, just jamming a computer screen with Medline records and external information from various sources will most likely reduce the efficiency of Medline exploration. As mentioned earlier, graphic presentations of results are suitable for large amounts of data, and many applications successfully present diverse data from divergent sources in graphic forms that fit the type of data (e.g. networks for associations). These presentations also give mouse-over or right-click information on data points. But assuring that presentations are effective in interactive visualizations for users' purposes is the Achilles' heel for many web-based bioinformatics applications. There often is a large gap between making an application merely "usable" and making it really "useful." A good application requires careful user requirement analysis and interface design. For example, easy access to information that users deem
relevant requires not just that it be available on a screen, but that the content be laid out and arranged in different categories that are meaningful to users. Each category of data, moreover, should be presented in an appropriate mode and style, such as visualized networks arranged and perceptually encoded to highlight contextual similarity among a reasonable number of Medline records. Other modes of presentation that match users' needs include pathway overlay diagrams to show overlays of Medline records with different elements in a pathway, an expandable ontological tree for exploring functionally or structurally related terms in the vicinity of retrieved Medline records, and links to in situ hybridization images. Additionally, an effective interface should allow users to switch efficiently to different views of the same set of retrieved Medline records, with graphic views clear enough to be remembered without image overload. In this way, researchers can easily examine the search results from different perspectives, together with different types of external information. Most importantly, since the display must support dynamic inquiry and not just a retrieved "fact answer", intuitive visual data exploration functions such as select/unselect, summary, forming new queries, saving results, etc., must be incorporated for effective mining of the data set. A history of the data exploration process is also very useful, since Medline exploration usually is more complicated than simple Medline record retrieval and users often need to go back to previous steps for additional exploration. Based on the rationales presented above, we started to develop a new Medline search interface that aims at facilitating the interactive exploration of Medline utilizing information from external databases such as the Michigan Molecular Interaction database (MiMI) 6, KEGG 9, and the Allen Brain Atlas 12. Different from classical search engines designed for the most efficient and accurate record retrieval, our solution is mainly targeted at the understanding of high throughput biomedical data, where researchers often need to venture into unfamiliar territories for new insights into specific pathophysiological processes. Our prototype, PubViz, is still a work in progress, but the prototype with 5000 bipolar-related Medline records as the test data set is accessible at: http://brainarray.mbni.med.umich.edu/Brainarray/DataMining/PubViz.
In this manuscript, we first present the technical aspects of our solution in the Material and Methods section. The Results section focuses on some of the novel data display and interactive search functions in PubViz, using real-world examples. Issues encountered in our prototype, and functions we hope to include soon, are described in the Discussion section.

2. MATERIAL AND METHODS
2.1. System Design Overview

PubViz consists of four major components: 1) A search component: enables users to search for biomedical literature using a series of flexible criteria, e.g. gene ID, MeSH concept, keyword, and their combinations. 2) A process component: retrieves pre-annotated literature (e.g. from the gene/protein name tagger and UMLS concept matcher we developed). It also filters or expands the result set. The PubViz search interface communicates with backend process functions through extensive web services. 3) An exploration component: presents processed search results in an intuitive and interactive fashion. For example, in the citation view, gene view, and MeSH view, related literature is presented in network graphs. Each node in the graph represents an entity or a concept. The connections between pairs of nodes are calculated using the similarity algorithms we describe below. Essentially, this component generates overviews of the data set from different perspectives. 4) An analysis component: integrates various scaffolds to help users understand the literature set better. These analytic supports include, for example, topic grouping/sorting functions, visual exploration capabilities, extensive external links, data visualization, and dynamic filtering functions. PubViz is developed on Adobe's latest Flex 2.0 platform, which provides efficient development tools and components that allow us to build highly interactive user interfaces with high efficiency (http://www.adobe.com/products/flex).
2.2. PubViz Web Services

Since utilizing external information in Medline exploration is a critical design goal of PubViz, but most external data sources are updated frequently, we decided to use web services and HTTP services to collect external data on-the-fly during Medline search.
Table 1. PubViz web service and HTTP service examples.

Web Service              Service Description
GeneSearch               retrieves Medline abstracts related to specific genes
DiseaseSearch            retrieves Medline citations related to specific diseases
MeSHSearch               retrieves most relevant records for specific MeSH terms
CitationSearch           retrieves Medline citations, including title, abstract, MeSH, journal, etc.
GraphLayout              provides a wrapper service for graph layout using GraphViz
MeSHProfiling            returns significant MeSH terms that differentiate the current subset from the whole Medline
GeneMapping              returns identified genes in a given Medline abstract and maps them to Entrez Gene database IDs
GeneticMarkerMapping     returns genetic markers based on linkage disequilibrium criteria
UMLSMapping              returns extracted UMLS concepts in Medline abstracts
DiseaseAssociation       returns identified diseases and mapped OMIM IDs if possible
PathwayQuery             retrieves genes and related citations for specific pathways
ProteinInteraction       retrieves interacting proteins of given protein(s)
This way we do not need to synchronize the content of many external databases with the PubViz database. Currently PubViz has a variety of web services (using SOAP) and HTTP services. These services not only perform data retrieval but also integrate a substantial amount of text mining results from our group. They can be easily integrated into other literature mining systems and will be publicly available for developers to access programmatically (see the PubViz site). Table 1 lists some of the web/HTTP services currently used in PubViz.
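As a hypothetical illustration of how a client might consume one of these HTTP services, consider the sketch below; the endpoint URL, parameter names, and response fields are invented for the example (the actual interfaces are documented on the PubViz site):

```python
# Hypothetical client call to a PubViz-style HTTP service.
# URL, parameters, and JSON fields are illustrative placeholders only.
import requests

resp = requests.get(
    "http://example.org/pubviz/GeneSearch",   # placeholder endpoint
    params={"gene_id": "627", "max_records": 50},
    timeout=30,
)
resp.raise_for_status()
for citation in resp.json():
    print(citation.get("pmid"), citation.get("title"))
```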
2.3. Similarity Measures

PubViz can easily adopt any similarity measure based on annotations in specific concept categories such as gene, anatomical structure, and disease. Here we provide two examples of similarity measures currently used in PubViz.

Medline document similarity based on MeSH annotation: There are around 11 MeSH concepts assigned to each Medline abstract by expert curators. MeSH annotation provides a quick way of calculating pairwise Medline record similarity. The cosine similarity $\cos(D_i, D_j)$ is computed by taking the dot product of the two bags of MeSH topic vectors $f_{D_i}$ and $f_{D_j}$:

$$\cos(D_i, D_j) = \frac{f_{D_i} \cdot f_{D_j}}{\lVert f_{D_i} \rVert \, \lVert f_{D_j} \rVert} \qquad (1)$$
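A small illustration of Eq. (1), treating each document as a bag of MeSH headings (the term lists below are toy inputs, not real annotations):

```python
# Cosine similarity between two documents represented as bags of
# MeSH terms (Eq. 1). Inputs here are illustrative examples only.
import math
from collections import Counter

def mesh_cosine(mesh_a, mesh_b):
    fa, fb = Counter(mesh_a), Counter(mesh_b)
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(mesh_cosine(["Bipolar Disorder", "Adult", "Male"],
                  ["Bipolar Disorder", "Schizophrenia", "Adult"]))
```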
Use of MeSH term relatedness for term expansion: Since many high-level, non-specific MeSH terms are commonly assigned to biomedical papers, simply counting the co-occurrence of MeSH terms for the purpose of term expansion during a Medline query does not work well. To calculate the relatedness score between a given MeSH term and another MeSH term, we weight each co-occurring MeSH term by its inverse document frequency:

$$\mathrm{Score}_i = (\text{number of co-occurrences}) \times \log(D / D_i) \qquad (2)$$
where D is the total number of documents in the corpus, and D_i is the total number of documents annotated with MeSH term M_i. As an example, for the MeSH term "Prefrontal Cortex" the most related terms we find are "Bipolar Disorder" and "Schizophrenia", instead of the more frequently co-occurring terms "Adult" and "Male".
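Eq. (2) can be sketched as follows; the co-occurrence counts and corpus statistics below are hypothetical inputs chosen only to show how the IDF weighting demotes generic terms:

```python
# Relatedness of a given MeSH term to each co-occurring term i (Eq. 2):
# Score_i = (co-occurrence count) * log(D / D_i). Numbers are toy values.
import math

def relatedness_scores(cooccur_counts, doc_freq, total_docs):
    return {term: n * math.log(total_docs / doc_freq[term])
            for term, n in cooccur_counts.items()}

scores = relatedness_scores(
    cooccur_counts={"Bipolar Disorder": 120, "Adult": 900},
    doc_freq={"Bipolar Disorder": 30_000, "Adult": 8_000_000},
    total_docs=16_000_000,
)
# The high-IDF term "Bipolar Disorder" outranks the generic term
# "Adult" despite having far fewer raw co-occurrences.
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```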
2.4. High Performance Similarity Calculation and Clustering Using the 8800GTX

A major challenge in implementing on-the-fly Medline record similarity and clustering calculation for graphic display in a web application is its CPU-intensive nature. Fortunately, graphics processing units (GPUs) provide an inherently parallel platform suited for various distance calculation and clustering problems. The release of the Compute Unified Device Architecture (CUDA) platform for the NVIDIA GeForce 8XXX graphics cards eases the task of implementing distance and clustering algorithms by presenting the graphics card as a multi-threaded coprocessor 14. Our initial tests show that the above MeSH term-based similarity and clustering calculation for 200-1000 Medline records can be reduced to several seconds just by using one NVIDIA GeForce 8800 GTS card, which costs only around $550. As a result, interactive visualization of similarity calculation results in web applications for a decent Medline record set size is now a reality with a low-cost computer cluster.
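A rough sketch of offloading the all-pairs similarity computation to the GPU follows; CuPy is used here as an assumed convenience layer standing in for the authors' hand-written CUDA kernels, which are not shown in the paper:

```python
# All-pairs cosine similarity over MeSH count vectors on the GPU.
# CuPy is an assumed stand-in for the authors' CUDA implementation.
import cupy as cp

def pairwise_cosine_gpu(counts):
    """counts: (n_docs, n_terms) array of MeSH term counts."""
    x = cp.asarray(counts, dtype=cp.float32)
    norms = cp.linalg.norm(x, axis=1, keepdims=True)
    x = x / cp.maximum(norms, 1e-12)   # row-normalize each document
    return x @ x.T                     # (n_docs, n_docs) similarities

# e.g. sim = pairwise_cosine_gpu(mesh_count_matrix)  # a few hundred docs
```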
3. RESULTS

Since PubViz is still a work in progress, the current fully functional web prototype uses a small corpus for more efficient prototyping. The corpus contains 5000 bipolar-related Medline abstracts that are tagged by our own gene/protein name, genetic marker, and UMLS concept taggers. Full Medline record access is expected to be ready in July 2007. Here we describe the main features of PubViz and some functions that scaffold users' literature searches. More functions are being incorporated into PubViz.
3.1. PubViz User Interface

PubViz aims to bring the richness and usability of good desktop applications to the web-based environment (Fig. 1). PubViz is a purely online application which requires no installation and no local upgrades or maintenance. Since PubViz runs on the Flash virtual machine, it does not have any compatibility issues across different browsers. PubViz uses tabbed layouts across the top (view tabs), right-hand side (panel tabs), and bottom (data tabs) to enable access to diverse data and panels.
Fig. 1. PubViz interface layout
It provides visual querying through brushing (direct interaction with data on the graphic itself) and through sliders and other visual controls. Graphs on different tabs are dynamically linked. Users can resize the panels to gain space for interesting views or relevant charts. Every table in the PubViz application is sortable on any column. Users can also reorder and resize table columns, or even hide some fields.

3.2. Active Retrieval of External Information

When users perform a Medline search, PubViz actively retrieves the relevant external information through the web services described in Section 2.2. For example, if users specify one or more genes in their search, PubViz will automatically use the genes tagged in the returned Medline records to query the Michigan Molecular Interaction (MiMI) database and pull in MiMI results for interacting proteins related to these genes 6. Users can thus quickly review the genes that are related to their initial search criteria, without having to perform separate searches in the MiMI database (Fig. 2a). Similarly, PubViz maps genes in returned Medline records to the Allen Brain Atlas so that users can obtain related in situ hybridization images just by opening the PubViz data tab and clicking links that PubViz has automatically generated and displayed in the "In Situ Image" tab (Fig. 2b).

3.3. Data Visualization

PubViz has built-in visualization tools, such as charting functions, to help users quickly gain insights into the returned record set (see Fig. 1). To support users in finding relationships among retrieved results, PubViz presents multiple interactive and interlinked charts for the same Medline record set. By analyzing data from different perspectives - tables for details, graphs for inferring meaning from structure - users are able to discern patterns and relationships that would not jump out at them from long-scrolling tabular displays alone. Moreover, they can interactively explore these patterns by selecting data of interest and having all the views automatically update for that selection. Dynamic updates across charts based on selecting and filtering enable in-depth exploration from different perspectives. This technologically achieved dynamic linking - critical for exploratory analysis - is now being extensively assessed in terms of usability issues, as is support for visual memory across views. We include three examples here to demonstrate the variety of visualization options supported in PubViz. These examples do not follow one sample case through all three views. Rather, for the purpose of depicting rich enough views for readers to get an idea of what each sample view and data tab displays, each example is a self-contained instance of querying and retrieved data.
[Fig. 2a. Direct interacting proteins from MiMI]
[Fig. 2b. Allen Brain Atlas mapping]
Timeline view: In the first example (Fig. 3), a user wants to see articles related to bipolar disease published between 1966 and 2006, and clicks on the top "Chartview" tab to begin the search. The user enters the keyword "bipolar" into the search field, sets the time range accordingly, and clicks "Search". PubViz returns a list of diseases under the time range slider. The user
clicks on one of these - "Bipolar Affective Disorder" - and a line graph is displayed, showing the number of articles published (y-axis) each year (x-axis). Users can issue another search by simply clicking on a data point in the graph, and the search results are reflected in other data and view tabs as well.
[Fig. 3. Timeline View]
Pathway view: Fig. 4 shows the results that occur when a user opens the pathway search panel tab in order to explore functional aspects of genes of interest. On the side panel, the user enters the Gene ID of the target gene, and PubViz maps the Medline search results onto the KEGG pathway database. PubViz retrieves pathways from KEGG containing genes related to the user query and, as with retrieved results in the previous example, lists them under the searched Gene ID. The user simply needs to click on the pathway of interest, and the KEGG pathway is displayed in the Pathway View window. By being able to see other genes in the pathway during this literature search, without having to leave the exploration environment, the user can immediately implement new ideas for additional Medline exploration.

MeSH tree view: This final example draws on the pre-calculated MeSH concept similarity we described in Section 2.3. Working now from the MeSH side panel tab (depicted in Fig. 5 separately from the rest of the screen only to show it more clearly here), the user enters a MeSH term, and PubViz returns the top N (specified by the user) related concepts as well as the underlying Medline citations, which are not shown here but detailed in the lower MeSH Profiling data tab. They are sorted by year or other user-chosen criteria. Additionally, MeSH connections are drawn on a MeSH View canvas, also not depicted here but placed in the center where views for all tabbed displays reside. This example shows how
users get multiple views alongside each other - a tree hierarchy, detailed data on citations, and networks based on similarity relationships. When users search for MeSH terms or click on a particular MeSH term in the MeSH search panel, PubViz shows the MeSH topic in a hierarchical MeSH tree view (Fig. 5) for users to review or to issue further search requests.
3.4. Search History Tracking

Interactive search empowers users to search using a combination of criteria. Meanwhile, it also raises the question of how users can track their exploration history. From the side-panel User Profile tab, PubViz allows users to set up individual accounts, and it automatically records their use history, including parameter settings (Fig. 6). Users can therefore quickly replicate or continue their previous analyses, or download/upload data or literature sets.

4. DISCUSSION
In summary, PubViz is designed to be an efficient Medline literature exploration interface for understanding the biological implications of high throughput data. While some of the existing Medline search solutions provide gene- or gene-ontology-centered graphic layouts 3,7,10,13,15, few of them provide powerful interactive visual Medline exploration.
[Fig. 5. MeSH search panel]
[Fig. 6. User history]
PubViz is distinct in providing a combination of multiple topic-centered graph layouts and powerful data charting tools for efficient and flexible visual Medline exploration. In PubViz, switching between charts and the underlying tabulated data requires just a single mouse click. These visualization tools enable users to grasp the "big picture" of their query results, something they cannot do from results displayed in tabular form. Moreover, by combining graphic displays with capabilities for interactively selecting and filtering data, drilling down to details, and sorting, these visual data exploration tools enable users to quickly recognize and uncover hidden patterns and data of interest in the Medline database. We hope the ability to translate data patterns into insights makes PubViz a highly effective literature exploration tool. The FLEX 2.0 technology offers the possibility of rich internet applications with performance approaching typical desktop programs. In traditional HTML-based web applications, in contrast, when a user clicks on a responsive element on a web page, usually the whole page is resent from the server and then refreshed on the client side, which delays the response. In PubViz, however, all web services are invoked using asynchronous calls, and most interactions on retrieved data are handled by PubViz on the client side. This enables us to write functions that handle user interactions on complex graphs and issue more dynamic search requests, helping users drill deeper, navigate faster, and understand better. Without doubt, our prototype only provides a framework that can be greatly improved upon. Besides various new functions, a key issue we need to work on is the usability of PubViz. At this moment, the organization and presentation of data and functions is certainly not optimal. We plan to conduct systematic usability studies to improve the PubViz interface and make it an efficient tool for literature exploration. A key capability we want to add to PubViz is the use of external knowledge and information to improve Medline record retrieval. Essentially, external knowledge can be used to modify the semantic distance between different concepts, and thus change the similarity among different documents as well as the similarity between query terms and Medline records. For example, the use of external protein-protein interaction data can make the similarity of gene/protein names dependent on how the corresponding proteins interact with each other rather
than treating each gene/protein name as independent of the others. If we assign high similarity to gene/proteins that have direct interactions with each other and incorporate such similarity information in the Medline record retrieval process, we will be able to obtain Medline records not only containing the query gene/protein names, but also those containing gene/protein names that directly interact with the query gene/protein names (a sketch of this idea follows at the end of this section). Besides protein interaction information, we hope to include different areas of knowledge and experimental data - such as linkage disequilibrium relationships among cytobands, SNPs, STS/microsatellite markers, and genes; co-regulated genes from microarray studies; and neuroanatomical circuits described in textbooks - in the PubViz system. As a result, PubViz can greatly increase the efficiency of Medline data exploration in unfamiliar territories. Since it is impossible for a single group to incorporate all potentially useful external knowledge sources or display functions in PubViz, we plan to build standard interfaces that allow interested researchers to add their own concept similarity matrices, web services for collecting external information, and visualization tools. In the long run, we hope PubViz will become a highly extensible system for exploring Medline and other free-text databases.
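One way to realize the interaction-based similarity idea is sketched below, under the assumption that term similarity is stored as a symmetric matrix and protein interactions arrive as ID pairs; both structures, and the boost value, are hypothetical illustrations rather than the planned PubViz design:

```python
# Boost the semantic similarity of gene/protein terms that are known
# to interact, so retrieval can surface records mentioning partners
# of the query gene. Data structures and values are illustrative only.
def boost_with_interactions(term_sim, interactions, boost=0.8):
    """term_sim: dict[(termA, termB)] -> float; interactions: ID pairs."""
    for a, b in interactions:
        key = tuple(sorted((a, b)))
        term_sim[key] = max(term_sim.get(key, 0.0), boost)
    return term_sim

sim = boost_with_interactions({}, [("BDNF", "NTRK2")])
# A query for BDNF can now also rank records mentioning NTRK2 highly.
```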
Acknowledgements

W. Xuan, M. Dai, S. J. Watson and F. Meng are members of the Pritzker Neuropsychiatric Disorders Research Consortium, which is supported by the Pritzker Neuropsychiatric Disorders Research Fund L.L.C. This work is also partly supported by the National Center for Integrated Biomedical Informatics through NIH grant 1U54DA021519-01A1 to the University of Michigan.
References

1. Chang, J.T. et al. (2004) GAPSCORE: finding gene and protein names one word at a time. Bioinformatics 20(2), 216-225.
2. Chen, H. and Sharp, B.M. (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5, 147.
3. Doms, A. and Schroeder, M. (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res 33 (Web Server issue), W783-786.
4. Hoffmann, R. and Valencia, A. (2004) A gene network for navigating the literature. Nat Genet 36(7), 664.
5. Homayouni, R. et al. (2005) Gene clustering by latent semantic indexing of MEDLINE abstracts. Bioinformatics 21(1), 104-115.
6. Jayapandian, M. et al. (2007) Michigan Molecular Interactions (MiMI): putting the jigsaw puzzle together. Nucleic Acids Res 35 (Database issue), D566-571.
7. Jenssen, T.K. et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28(1), 21-28.
8. Johnson, H.L. et al. (2005) Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. In Proceedings of the Pacific Symposium on Biocomputing (PSB) 2006.
9. Kanehisa, M. (2002) The KEGG database. Novartis Found Symp 247, 91-101.
10. Landauer, T.K. et al. (2004) From paragraph to graph: latent semantic analysis for information visualization. Proc Natl Acad Sci U S A 101 Suppl 1, 5214-5219.
11. Lin, S.M. et al. (2004) MedlineR: an open source library in R for Medline literature data mining. Bioinformatics 20(18), 3659-3661.
12. McCarthy, M. (2006) Allen Brain Atlas maps 21,000 genes of the mouse brain. Lancet Neurol 5(11), 907-908.
13. Mudunuri, U. et al. (2006) botXminer: mining biomedical literature with a new web-based application. Nucleic Acids Res 34 (Web Server issue), W748-752.
14. Nvidia. (2007) Nvidia CUDA: Compute Unified Device Architecture. NVIDIA CUDA Programming Guide 0.8, http://developer.download.nvidia.com/compute/cuda/
15. Plake, C. et al. (2006) ALIBABA: PubMed as a graph. Bioinformatics 22(19), 2444-2445.
16. Rinaldi, F. et al. (2007) Mining of relations between proteins over biomedical scientific literature using a deep-linguistic approach. Artificial Intelligence in Medicine 39(2), 127-136.
17. Sharma, P. et al. (2006) Mining literature for a comprehensive pathway analysis: a case study for retrieval of homocysteine related genes for genetic and epigenetic studies. Lipids Health Dis 5, 1.
18. Tanabe, L. and Wilbur, W.J. (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18(8), 1124-1132.
19. ThomsonResearchSoft. (2005) RefViz. http://www.refviz.com/
20. Xuan, W. et al. (2005) GeneInfoMiner - a web server for exploring biomedical literature using batch sequence ID. Bioinformatics 21(16), 3452-3453.
21. Yuryev, A. et al. (2006) Automatic pathway building in biological association networks. BMC Bioinformatics 7, 171.
RULE-BASED HUMAN GENE NORMALIZATION IN BIOMEDICAL TEXT WITH CONFIDENCE ESTIMATION
William W. Lau and Calvin A. Johnson*
Center for Information Technology, National Institutes of Health Bethesda, MD 20892-5624 *Email:[email protected]
Kevin G. Becker
Research Resources Branch, National Institute on Aging
330 Cassell Drive, Baltimore, MD 21224
Email: [email protected]

The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for "down-stream" text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step includes pattern matching for gene symbols and an approximate term searching technique for gene names. Next, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier is selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized to the training data.
1. BACKGROUND

Identification of gene and protein mentions is arguably one of the most difficult named entity recognition (NER) tasks in the life sciences domain because of the irregularity and ambiguity of gene nomenclature [1]. A majority of genes can be referred to by more than one name and symbol. Some of these terms are common in the English language and some are even shared by two or more genes, of the same and/or different species [2]. An evaluation conducted for the BioThesaurus found that a gene/protein has an average of 3.53 synonyms, and that the same term is associated with 2.31 different concepts on average [3]. In searching for records on a particular gene, most search engines, including PubMed, perform only basic keyword matching. This leads to a substantial number of false positives and false negatives, making it difficult for users to locate the information that is truly useful to them.
A number of systems have been developed in the past few years to address the problem of gene recognition. The techniques fall into two broad categories, machine-learning and rule-based methods, which vary in their degree of reliance on dictionaries, statistics, linguistics, and heuristics [1]. Machine-learning approaches, including hidden Markov models [4] and support vector machines [5], are very scalable. However, these techniques are very sensitive to the selection of features [6] and their results are difficult to interpret. In rule-based approaches, hand-crafted rules for specific datasets are derived by experts with domain knowledge. These rules are often implemented as regular expression statements. Although this approach can be quite labor-intensive, rule-based systems are often superior in handling genes that do not appear in training data. Traditionally, the more human intervention there is in a system, the better the system performs [7]. As annotated data sets become more readily available and learning techniques become more sophisticated, this trend may change in the near future.
Many popular tools, such as ABNER [8] and GAPSCORE [9], address the problem of NER without uniquely identifying the entities being mentioned. However, the ability to accurately associate these text mentions with specific entries in biomedical databases is of great value to "down-stream" text mining applications, e.g. document classification and knowledge discovery. The next step beyond gene mention tagging is gene normalization, a procedure in which each gene occurrence in the text is mapped to a unique gene identifier [10]. In the case of mentions associated with multiple identifiers, additional steps have to be taken to select the correct identifier among all the candidates. To study associations between genes using information in the literature, Jenssen et al. [11] used simple string matching for gene recognition. Up to 40% of the associations were incorrect, due to these problems in normalization: symbols shared by several genes, syntactic variations of the terms, and insufficient synonym lists. Thus, a more sophisticated gene normalization technique is required. Various competitions on text mining have been held in the past to create a platform where different text mining approaches can be compared objectively using common standards and evaluation criteria. BioCreAtIvE is one of several competitions specifically tailored to the biological domain. The first evaluation was held in 2003, and attracted 27 participants from around the world. We entered Task 2 of the second BioCreAtIvE challenge [12]. The objective of this task was to return the EntrezGene identifiers corresponding to the human genes and direct gene products appearing in a set of MEDLINE abstracts annotated by researchers at the European Bioinformatics Institute. Our gene normalization algorithm is a prototype component of the PubMatrix system [13], a text mining tool for genetic association studies. The advent of high-throughput microarray analysis has made it possible to measure the expression of thousands of genes and proteins simultaneously. However, the large volumes of data being generated create a huge challenge for scientists seeking to effectively interpret and evaluate their results. PubMatrix, among others, can be used to systematically identify associations between sets of genes and diseases using information available in the MEDLINE literature. The assumption is that if the co-
occurrence frequency between a gene and a disease is statistically significant, they probably have an underlying biological relationship. The PubMatrix system thereby helps researchers to validate their experimental results and to select a manageable set of promising genes for further analysis. Since simple string matching of the genes has yielded poor performance in other studies, we developed the gene normalization algorithm to help improve the accuracy of the PubMatrix results. Our system is essentially a rule-based system utilizing information from knowledge bases, statistical analysis, and empirical evidence. Section 2 is an extension of our paper submitted to the BioCreAtIvE Workshop [14]. This section describes our gene normalization algorithm, in particular the metric we use to estimate the confidence level of a match. In Section 3, our experimental results on the BioCreAtIvE data are presented. We conclude in Section 4 by discussing performance issues and the significance of each component of the confidence measure.

Table 1. Regular expression rules applied to gene symbol pattern matching to account for several syntactic variations commonly encountered in the literature.
Rule                                                                  Example
Interchange of Roman and Arabic numerals                              GAL4 → GAL IV
Interchange of dashes and spaces                                      NKG2-E → NKG2 E
Allow a dash or space in front of a numeral                           NAT2 → NAT-2
Allow an optional 's' at the end of a symbol                          EST → ESTs
Allow an optional 'h' at the beginning of a symbol                    BIF → hBIF
Allow for case differences if the symbol has more than two characters RAC1 → Rac1
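As an illustration of how rules like those in Table 1 can be compiled into a single search pattern for one symbol, consider the following sketch; it implements only a subset of the rules (the dash/space, plural 's', leading 'h', and case rules), and the actual implementation in the system is more involved:

import re

def symbol_pattern(symbol):
    """Build a regex accepting common syntactic variants of a gene symbol:
    an optional dash or space before a trailing numeral, an optional
    plural 's', an optional leading 'h', and case-insensitive matching
    for symbols longer than two characters."""
    m = re.match(r"^(.*?)(\d+)$", symbol)
    if m:
        stem, num = m.groups()
        core = re.escape(stem) + r"[-\s]?" + num
    else:
        core = re.escape(symbol)
    flags = re.IGNORECASE if len(symbol) > 2 else 0
    return re.compile(r"\bh?" + core + r"s?\b", flags)

pat = symbol_pattern("NAT2")
print(bool(pat.search("the NAT-2 polymorphism")))   # True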
2. IMPLEMENTATION
2.1. Identification of Gene Mentions

The algorithm detects the occurrence of gene mentions by matching the input text against the EntrezGene dictionary from the National Library of Medicine. The procedure effectively combines the tasks of gene detection and gene identifier lookup. Different approaches are used in the detection of gene symbols (including "Other Aliases" in the EntrezGene database)
and gene names (including "Other Designations"). Gene-symbol tagging is based on pattern matching. For each symbol in the knowledge base, a set of regular expression rules, as shown in Table 1, is applied to evaluate every string separated by spaces and punctuation symbols. For the official symbols, we also generate new symbols by expanding the associated Greek letters into their full names, e.g. "CHKB" to "CHK beta" and "beta CHK". For gene names, an approximate term matching technique has been employed. After breaking a gene name into individual words or tokens, each token is searched against the text using rules similar to those for gene symbol matching. Subsequently, the phrase containing the most tokens is identified. This phrase is conditionally accepted if the ratio, r_m, between the number of tokens in the mention candidate and the total number of tokens to be matched is higher than a threshold (0.7 in our submissions). However, the candidate has to include specific tokens, as measured by the number of citations containing those tokens (if a token's frequency of occurrence is low, it is too important to be ignored). The system also maintains lists of allowed and prohibited missing words. If a word in the prohibited list, e.g. "receptor", is missing from the phrase, the candidate is rejected. On the other hand, if a word in the allowed list, such as "type" or "subunit", is missing from the candidate, the algorithm calculates r_m as if the word were not in the gene name. As an illustration, consider the gene "angiotensin II receptor, type 1," which consists of five tokens. The term "angiotensin II type 1" has an r_m of 0.8, but is rejected because "receptor" is missing. On the other hand, the term "angiotensin II receptor alpha" has an r_m of 1.0. In addition, another rule is that candidates are allowed to contain at most two extra words between any two tokens, as long as the words are frequently found in the biomedical literature. Besides the names that are already in the knowledge base, additional synonyms are generated by replacing common chemical names with their abbreviations. For example, "acetyl-CoA carboxylase beta" is created from "acyl-Coenzyme A carboxylase beta". This approximate matching technique, which is similar to that proposed by Hanisch et al. [15], can accommodate typical variations of gene name mentions, such as word reordering, found in the literature.
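The token-ratio test above can be sketched as follows; the matching is simplified to exact word overlap, and the allowed/prohibited lists shown are illustrative excerpts rather than the full lists maintained by the system:

def name_match_ratio(gene_name, phrase,
                     allowed_missing=frozenset({"type", "subunit"}),
                     prohibited_missing=frozenset({"receptor"})):
    """Score a candidate phrase for a gene name by the ratio r_m of
    matched tokens to required tokens. A candidate missing a prohibited
    word is rejected outright (returned as 0.0 here); allowed missing
    words are treated as if they were not in the gene name."""
    name_tokens = gene_name.lower().replace(",", "").split()
    phrase_tokens = set(phrase.lower().split())
    missing = [t for t in name_tokens if t not in phrase_tokens]
    if any(t in prohibited_missing for t in missing):
        return 0.0
    required = [t for t in name_tokens
                if t in phrase_tokens or t not in allowed_missing]
    matched = [t for t in required if t in phrase_tokens]
    return len(matched) / len(required) if required else 0.0

name = "angiotensin II receptor, type 1"
print(name_match_ratio(name, "angiotensin II type 1"))                   # 0.0: "receptor" missing
print(name_match_ratio(name, "the angiotensin II receptor type 1 gene")) # 1.0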
2.2. Confidence Measure of Gene Mention Candidates
After a gene mention is detected, the algorithm calculates a confidence score using several statistical and heuristic measures. The three most novel features used in our submissions were coverage, inverse distance, and uniqueness.

2.2.1. Coverage

The calculation of the coverage score is quite different for gene names and gene symbols. The score for symbols, $w_s$ (Eq. (1)), increases with the character length $L$ of the term extracted from the text and is scaled by a factor $s$ defined as

$$ s = \begin{cases} k_1 + (1 - k_1)\,e^{1-L} & \text{if the candidate is enclosed} \\ 1 & \text{otherwise} \end{cases} $$

where $0 \le k_1 \le 1$ is a parameter (set to 0.8). The intuition is that the more characters the symbol has, the less likely it is that the term is used for anything other than representing the gene. If the term is enclosed by brackets, i.e. ([{ }]), the gene name is probably mentioned in the text as well and the score should be scaled accordingly. For gene names, the coverage score is a weighted average of two ratios, $r_L$ and $r_m$, where $r_L$ is the ratio of the character length of the candidate string to that of the corresponding name in the knowledge base:

$$ w_N = k_2\, r_L + (1 - k_2)\, r_m \tag{2} $$

where $0 \le k_2 \le 1$ is a parameter (set to 0.5). The score is further reduced when the occurrence frequency $f_m$ of the least common missing word (for missing words not in the allowed list) falls below a minimum occurrence frequency threshold (set to 20,000). In addition to character length, the coverage metric for gene names thus takes into account how many words are matched as well as the specificity of the words missing from the mention.
2.2.2. Inverse Distance

For gene symbols, inverse distance is based on the edit distance, $d_L$, of the candidate term to the formal reference in the database. The score, $\delta_s$, is defined as

$$ \delta_s = 1 - k_5\, d_L, \qquad k_5 = \frac{1 - s}{L} \tag{3} $$

It takes into consideration variations in capitalization and ordering, and any omissions or additions of punctuation and spaces. The closer the mention matches the actual symbol, the higher the score. For gene names, since syntactic variations are common, Eq. (3) is modified by factoring in the token ratio $r_m$:

$$ \delta_N = r_m \left(1 - k_5\, d_L\right) \tag{4} $$

2.2.3. Uniqueness

Uniqueness is an estimate of the probability that the candidate refers to something other than the gene in question. If the mention has a very high frequency of occurrence in the literature, the score is reduced accordingly, because frequently occurring terms may have multiple meanings beyond being gene references. For gene names, the uniqueness score, $\rho_N$ (Eq. (5)), has two components, one influenced by the size of the population, $T$, and the other by a user-defined frequency threshold, $f_{max}$, which limits the maximum number of documents the term can appear in (set to 40,000 in our experiments); $k_3$ and $k_4$ are parameters (set to 0.5 and 10, respectively), and $f$ is the number of documents containing the term. The population we use in our system is the entire collection of MEDLINE citations. The formulation of the uniqueness score for gene symbols is the same as Eq. (5), except that the score is further multiplied by the scaling factor $s$.

2.2.4. Discrete Features

We have identified three additional features that can assist the algorithm in selecting the correct identifier in cases of ambiguity. First, if more than one unique mention of a gene is extracted from the text (e.g. both its name and its symbol), our confidence that the correct identifier is selected increases. This feature is referred to as number of mentions. In addition, many genes in the EntrezGene knowledge base have not been approved by the HUGO Gene Nomenclature Committee. We believe that the references for these genes are unstable and that few articles on these genes have been written. Therefore, in the official status feature, preference is given to genes that have been approved. A related feature is mention type. A recent study [16] suggests that scientists do not usually follow standard nomenclatures. Suspecting that there exists some degree of correlation, we take into consideration whether the mention is an officially approved term. We also incorporate a boosting factor into the confidence measure to reward or punish a candidate when there is a contextual clue in the citation suggesting whether the mention actually refers to a gene. For example, if the text contains the chromosome location or accession numbers of the gene, its score is boosted. If the mention is preceded or followed by supporting modifiers, such as "gene" and "encode", we have a much higher level of confidence that the mention is a true positive. On the contrary, if counter-indicators, such as "test" and "cell line", appear adjacent to the candidate, the mention is penalized by inverting the boosting factor. Therefore, in addition to the allowed and prohibited missing word lists, we also maintain a list of indicator terms and a list of counter-indicator terms. Whereas all the other factors are combined linearly to compute the final score, the boosting factor is applied last as an exponent. The final confidence score for a mention is calculated as

$$ S(\mathbf{w}) = \left( \sum_{i=1}^{n} w_i\, c_i \right)^{b} \tag{6} $$

where $b$ is the boosting factor, $n$ is the number of features not considered in the boosting, and $w_i$ and $c_i$ are the weight and sub-score for feature $i$, respectively. Consequently, a list of gene mentions with their associated identifiers and confidence scores is created
for each citation. An acceptance threshold can be applied to improve precision. If a gene has more than one unique mention in the text, the maximum score is used.
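A sketch of how the sub-scores combine per Eq. (6) follows; the feature values and the direction in which the boosting exponent is applied (1/b to reward, b to punish) are illustrative assumptions, since the text only states that the factor is applied as an exponent and inverted for counter-indicators:

def confidence(sub_scores, weights, boost=1.0, supportive=None):
    """Final confidence per Eq. (6): a weighted sum of feature sub-scores
    raised to a boosting exponent. Assumed here: supporting context uses
    exponent 1/boost (raising a score in (0, 1)), while a counter-indicator
    uses exponent boost (lowering it)."""
    linear = sum(w * c for w, c in zip(weights, sub_scores))
    if supportive is True:
        return linear ** (1.0 / boost)
    if supportive is False:
        return linear ** boost
    return linear

scores  = [0.90, 0.80, 0.70, 0.95, 1.00, 1.00]   # c_1..c_6, made-up values
weights = [0.15, 0.23, 0.23, 0.23, 0.10, 0.02]   # w_1..w_6, cf. Table 2
print(confidence(scores, weights, boost=1.25, supportive=True))
print(confidence(scores, weights, boost=1.25, supportive=False))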
2.3. Overlapping of Gene Mention Boundaries
When a string is associated with more than one gene identifier, the algorithm needs to determine which gene the authors actually intended. The disambiguation procedure is as follows. First, if a mention appears entirely within another, longer mention (Fig. 1a), the algorithm removes the shorter mention if it does not appear anywhere else by itself in the text. If some words of a mention overlap with another mention (Fig. 1b), or if two mentions share the exact same term (Fig. 1c), the one with the lower score is removed. If the scores of two conflicting candidates are equal, their uniqueness scores are both reduced by half; the effect of this operation is that if the mentions are weak in the first place, they can both be eliminated with a smaller threshold. If a candidate has more than one form of occurrence, e.g. both the symbol and the name were detected, the highest score is considered. Moreover, if two genes are adjacent to each other without being separated by any punctuation (Fig. 1d), we remove either the first mention or the mention with the lower score.

Fig. 1. Four cases of boundary conflicts: (a) "Interleukin 1 receptor"; (b) "glucocorticoid-induced protein kinase X-linked" overlapping "X-linked surface antigen 3"; (c) "NIP1"; (d) "nicotinic acetylcholine receptor gene CHRNA10". When a mention is completely covered by another mention (a), the shorter mention is taken out of the gene list. The confidence score is used to determine which mention is more probable in cases (b), (c), and (d). For (d), if the score is the same for both mentions, the first mention is removed.
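The disambiguation rules of Fig. 1 can be sketched as below; mention records are simplified to spans with scores, and the halving of tied uniqueness scores is omitted for brevity:

def resolve_overlaps(mentions, text):
    """Resolve boundary conflicts among candidates given as
    (start, end, term, score), processed best-score-first. A candidate
    overlapping an already kept, higher-scoring mention is dropped,
    except that a fully contained mention survives when its term also
    occurs by itself elsewhere in the text (case (a))."""
    kept = []
    for start, end, term, score in sorted(mentions, key=lambda m: -m[3]):
        contained = any(s <= start and end <= e for s, e, _, _ in kept)
        overlaps = any(s < end and start < e for s, e, _, _ in kept)
        if (contained and text.count(term) > 1) or not overlaps:
            kept.append((start, end, term, score))
    return sorted(kept)

text = "the interleukin 1 receptor gene regulates interleukin 1 levels"
mentions = [(4, 26, "interleukin 1 receptor", 0.9),
            (4, 17, "interleukin 1", 0.6)]
print(resolve_overlaps(mentions, text))   # both kept: the short term recurs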
3. EVALUATION

We evaluated our gene normalization system by finding a (locally) optimized set of weights w_train on a training data set, testing the performance of the system using w_train on a testing data set, and then cross-validating the performance by training on the testing data set to generate a set of weights w_test, which was evaluated on the training data set. The training and test data sets were those provided by the BioCreAtIvE II gene normalization task. These data sets comprise 286 and 262 documents, respectively. The results of the optimization process are summarized in Tables 2 and 3. Table 2 provides the values of the original weights w_0 we used in the competition as well as the tuned weights w_train and w_test. Table 3 gives the results obtained from running the optimized weights through the data set on which they were trained as well as on the other data set. The maximum F-score and area under the recall-precision curve (AUC), obtained by testing w_train on the testing data set, were found to be 0.7622 and 0.7554 respectively. With the original weights, prior to optimization, these values were 0.7523 and 0.7423 respectively. To generate w_train and w_test, the set of starting weights w_0 was first obtained through empirical evidence and knowledge gained through the experience of developing the system. A good starting point for the optimizer was then found by manually exploring the energy landscape of the maximum F-score and AUC. A set of weights was then selected from the trial set which we felt could be considered "close to the maximum." These weights were entered as a starting point to the Nelder-Mead simplex method [17], an unconstrained derivative-free method that can find a local maximum via a geometric process involving reflection, contraction, and expansion. Although it has poor theoretical properties, the Nelder-Mead method is surprisingly robust for objective functions that are not analytical. Although we are using an unconstrained optimizer, our problem is actually constrained, namely
$$ \max_{\mathbf{w}} \left( F_{\max}(\mathbf{w}) + AUC(\mathbf{w}) \right) \tag{7} $$

where $F_{\max}(\mathbf{w})$ is the maximum F-score obtained over a set of thresholds and $AUC(\mathbf{w})$ is the AUC over that same set of thresholds. In the results that we report, we used a threshold interval of 0.01, or 100 estimates when the maximum threshold is 1. As is clear from Table 3, we obtained an optimal solution well within the bound constraints. We also pondered imposing an equality constraint w_1 + w_2 + ... + w_6 = 1 to enforce the idea that the maximum threshold must be 1 and that the result is properly "scaled." Doing so would have necessitated a genuinely constrained optimizer. Rather than facing these complications, we adjusted the method to allow
for arbitrary thresholds. As a precaution, we used a starting point that was normalized according to the equality constraint.

Table 2. Weights obtained through the optimization process (w_train and w_test) as well as the starting weights w_0. The actual values of the weights w_1 through w_6 are the products of the values shown and the denormalization factor.

Feature                  Weight   w_0      w_train   w_test
Mention type             w_1      0.1500   0.1224    0.1389
Coverage                 w_2      0.2333   0.2082    0.1743
Inverse distance         w_3      0.2333   0.2744    0.2414
Uniqueness               w_4      0.2333   0.1961    0.2357
Number of mentions       w_5      0.1000   0.0580    0.0629
Official status          w_6      0.0200   0.1407    0.1467
Boosting factor          w_7      1.2500   1.4514    1.4402
Denormalization factor            1.0000   1.0207    0.9699

Table 3. Results of running the normalization system on the training and testing data provided for the BioCreAtIvE II gene normalization task. The combination measure is equal to half the F-score plus half the area under the recall-precision curve (AUC).

Test on        Measure         w_0      w_train   w_test
Training set   Max. F-score    0.7703   0.7757    0.7733
               AUC             0.7516   0.7586    0.7593
               Combination     0.7609   0.7671    0.7663
Testing set    Max. F-score    0.7523   0.7622    0.7677
               AUC             0.7423   0.7485    0.7546
               Combination     0.7473   0.7554    0.7611

Figs. 2 and 3 plot F_max and the AUC, respectively, versus feature weight values for the first six features, i.e. w_1 through w_6. These figures each contain six plots corresponding to the six features. In each plot, only the corresponding weight is allowed to vary through the range 0 to 1 while the other weights are held at their w_train values. Since the results were obtained by testing the training-data-optimized weights against the test data, it is not surprising that there exist solutions on the test data with greater maxima than our solution. Despite this, we feel that our solution fared well on a foreign data set. Our explorations did reveal a somewhat difficult energy landscape with multiple maxima. Not surprisingly, the AUC curve is smoother than the maximum F-score. In Fig. 4, the effect of the boosting factor w_7 is demonstrated by plotting the maximum F-score and AUC versus w_7 while the other weights are held at their w_train values. In Fig. 5, the recall-precision curve is plotted for weights set to w_0 as well as w_train.

Fig. 2. Maximum F-score for the first six features (mention type, coverage, inverse distance, uniqueness, number of mentions, official status) versus variations in the weight of the corresponding feature while the other weights were set to the w_train values. The markers on the lower right indicate the w_train values; the horizontal lines are the F-score at w_train. Results obtained in tests against the testing set.

Fig. 3. AUC for the first six features versus variations in the weight of the corresponding feature while the other weights were set to the w_train values. The markers on the lower right indicate the w_train values.
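The tuning loop can be sketched with SciPy's implementation of the Nelder-Mead method; evaluate_weights below is a stand-in for running the normalizer over a data set, with an invented smooth surrogate so the example executes:

import numpy as np
from scipy.optimize import minimize

def evaluate_weights(w):
    """Placeholder for scoring a weight vector: a real implementation
    would run the normalizer, sweep the acceptance threshold in 0.01
    steps, and return (maximum F-score, AUC)."""
    target = np.array([0.12, 0.21, 0.27, 0.20, 0.06, 0.14, 1.45])
    quality = 0.78 - 0.1 * float(np.sum((w - target) ** 2))
    return quality, quality - 0.01

def objective(w):
    f_max, auc = evaluate_weights(w)
    return -(f_max + auc)               # minimize the negative of Eq. (7)

w0 = np.array([0.15, 0.2333, 0.2333, 0.2333, 0.10, 0.02, 1.25])
w0[:6] = w0[:6] / w0[:6].sum()          # normalize per the equality constraint
result = minimize(objective, w0, method="Nelder-Mead")
print(result.x, -result.fun)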
4. DISCUSSION

We have developed a gene normalization algorithm that separates the task into two processes. First, the
algorithm searches for possible gene mentions with the goal of high recall. Different techniques are applied to the search for gene symbols and gene names, although both rely on the use of dictionaries and rules. The rules are important, as they handle many syntactic variations that are commonly encountered in gene nomenclature. Since most gene names are phrases rather than single words, an approximate term matching technique is employed to also account for differences in word ordering and word choice. The second process of the algorithm attempts to improve precision by measuring the level of confidence of each match and filtering out those mentions that have a low confidence score. The confidence score is derived from a set of quantitative measures leveraging the statistical, morphological, and contextual information available to the system. In addition to indicating whether a term actually refers to a gene or not, these measures provide a means for the system to disambiguate mentions to which multiple genes are mapped.

Fig. 4. Effect of the boosting factor: F-measure and AUC versus the boosting factor w_7 while the other weights were set to w_train. Results obtained against the testing set.
Using the BioCreAtIvE datasets for evaluation of the algorithm, the best F-score we achieved on the test data was 0.7622 when the feature weights were optimized with the training data. Without the thresholding process, the gene tagging component alone could attain an F-score of 0.647 with a recall of 0.869. Recall at this step essentially limits the recall obtainable in the thresholding process. A majority of the undetected mentions have complex syntax not being
handled by the rules we defined. Table 4 provides some examples of challenging cases that contributed to the false negative counts in the tagging process. Nevertheless, many genes are referred to in the text both by their name and symbol. The undetected mentions thus have a smaller impact on the recall performance. Figs. 2 and 3 show the individual contribution of each internal feature we measure in the confidence score. We call these internal features because the scores are computed out-of-context, based solely on the evidence presented by the mentions themselves. The only exception is the scaling factor s on gene symbols, which is influenced by whether the symbol is extracted from text enclosed by a set of brackets. We can observe from the figures that all six features are useful for the gene normalization task because their optimal weights are all greater than zero. As the weight of a feature increases, the feature becomes more dominant in determining the final confidence score. Inverse distance and uniqueness are the only features that produced better results (on AUC) or only slightly degraded results (on F-score) from zero weight to a weight of 1. All the other features posted worse performance when they became dominant. Although the best performance is achieved using a combination of these features, our observation suggests that inverse distance and uniqueness have good enough discriminatory power to estimate the level of confidence by themselves when other information is not available. In addition to the internal features, several contextual factors are used to determine whether the confidence score is boosted or not. Since the boosting factor is applied as an exponent, the effect is non-linear. Boosting exerts most of its influence on mentions for which the internal features may be ineffective. When a gene is mentioned for the first time in the text, the authors often specify that the entity of interest is a gene, especially when the gene is ambiguous or not very well known. Boosting is useful, as illustrated in Fig. 4. However, sometimes a wrong mention can be boosted. Moreover, when counter-indicators are detected, the boosting factor is inverted and the score is thus reduced. It can be argued that the punishing factor should be made more severe in order to successfully remove those mentions that have high scores but actually refer to something else. Features for confidence measure. In contrast with the other features, the effect of coverage, inverse
distance, and uniqueness is clearly pivotal, as there is significant performance improvement from zero weight to their optimal settings. It can be argued that uniqueness is the most important feature in our evaluation. Lack of this feature would result in severe degradation of performance, most noticeable in the AUC. Uniqueness is a statistical measure built on the assumption that gene mentions should have a low frequency of occurrence. This is a good assumption in most cases. However, it breaks down for legitimate genes that actually appear frequently in the literature (e.g. Interleukin 1) and for relatively rare terms with multiple meanings, only one of which is a gene reference. For example, "ADA" can stand for the American Diabetes Association or the gene adenosine deaminase. Our solution to the second issue is to look at whether a symbol is mentioned within a set of brackets. If it is, the presence of the gene name becomes a determining factor. We found this contextual feature to be very helpful for improving precision. Another important feature is the inverse distance, a dictionary-based measure that calculates the similarity between the candidate mention and the corresponding gene term in the database. Currently, the character is the basic unit in the calculation of edit distance. For names, the effect of changing the word order is subject to the length of the words; it may be more appropriate to use the word as the unit of measurement. Coverage is mostly a heuristic measure in which we assume longer mentions are more likely to be true. Although it is a very good measure, performance degraded when it became a dominant factor, suggesting that length alone is not reliable. Comparison to other gene normalization tools. A number of gene tagging tools are freely available to the community, but to our knowledge, no standalone gene normalization systems have been made publicly accessible. No comparison is made between our tool and ABNER or GAPSCORE because the task of these tools (i.e. NER) is different from ours (i.e. normalization) and such a comparison would not be particularly meaningful. In the second BioCreAtIvE challenge, 20 teams entered the gene normalization task [12]. Many teams followed the same general approaches we employed. Several participants built upon "off-the-shelf" gene tagging tools. The best F-score from each team ranged from 0.394 to 0.810, with a median of 0.731. The highest recall and precision achieved were 0.833 and 0.841, respectively. The difference in performance is primarily due to the way filtering of candidates, including disambiguation, was performed. Some relied on pruning of the lexicon and some implemented rules of various degrees of sophistication to reduce false positives. Nevertheless, the results of the top scoring teams, including ours, are comparable. It is important to note that the recall of 0.869 at a precision of 0.515, which we achieved after the first step of the process, is advantageous when high recall is required. Another benefit of our system is that each mention is associated with a confidence score. This feature affords users the ability to choose a suitable balance between recall and precision.

Table 4. Examples of false negative cases that the algorithm was not able to detect at all.
Description        Examples
Range              ORP-1 to ORP-6
Ambiguity          p32
Choice of words    IFN-induced protein of 10 kDa
Boundary           Protein kinase C alpha, epsilon, and zeta isoforms

Fig. 5. Recall versus precision as tested on the test data with the original weights (w_0) and the optimized weights w_train.
5. CONCLUSION

We have developed a gene normalization algorithm that relies heavily on rules combining statistics and heuristics. The confidence measure provides a means to quantify the degree of conformance to these rules and allows users to choose the proper compromise between recall and precision based on the situation. In our evaluation, only basic knowledge about the genes was used to disambiguate mentions with multiple mappings. A majority of candidates that mapped to more than one gene identifier actually referred to gene families. For future work, information about gene families and the association of various terms can be applied for more sophisticated filtering. Part-of-speech tagging may also help to discern mention boundaries and improve system efficiency by considering only noun phrases.
Acknowledgments

This research was supported by the Intramural Research Program of the National Institutes of Health, Center for Information Technology. We appreciate the contributions of Alex Wang and Jigar Shah.

References
1. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002; 18: 1124-1132.
2. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet. 2006; 7: 119-129.
3. Liu H, Hu ZZ, Torii M, Wu C, Friedman C. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006; 13: 497-507.
4. Zhou G, Zhang J, Su J, Shen D, Tan C. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics. 2004; 20: 1178-1190.
5. Hakenberg J, Bickel S, Plake C, et al. Systematic feature evaluation for gene name recognition. BMC Bioinformatics. 2005; 6 Suppl 1: S9.
6. Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform. 2005; 6: 357-369.
7. Dickman S. Tough mining: The challenges of searching the scientific literature. PLoS Biol. 2003; 1: E48.
8. Settles B. ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005; 21: 3191-3192.
9. Chang JT, Schutze H, Altman RB. GAPSCORE: Finding gene and protein names one word at a time. Bioinformatics. 2004; 20: 216-225.
10. Hirschman L, Colosimo M, Morgan A, Yeh A. Overview of BioCreAtIvE task 1B: Normalized gene lists. BMC Bioinformatics. 2005; 6 Suppl 1: S11.
11. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001; 28: 21-28.
12. Morgan A, Hirschman L. Overview of BioCreative II gene normalization. Proc of the Second BioCreative Challenge Evaluation Workshop. 2007.
13. Becker KG, Hosack DA, Dennis G Jr, et al. PubMatrix: A tool for multiplex literature mining. BMC Bioinformatics. 2003; 4: 61.
14. Lau W, Johnson C. Rule-based gene normalization with a statistical and heuristic confidence measure. Proc of the Second BioCreative Challenge Evaluation Workshop. 2007.
15. Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J. ProMiner: Rule-based protein and gene entity recognition. BMC Bioinformatics. 2005; 6 Suppl 1: S14.
16. Tamames J, Valencia A. The success (or not) of HUGO nomenclature. Genome Biol. 2006; 7: 402.
17. Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965; 7: 308-313.
CBioC: BEYOND A PROTOTYPE FOR COLLABORATIVE ANNOTATION OF MOLECULAR INTERACTIONS FROM THE LITERATURE

C. Baral, G. Gonzalez, A. Gitter, C. Teegarden, and A. Zeigler
School of Computing and Informatics, Arizona State University
Tempe, AZ 85281, USA
Email: chitta@asu.edu, ggonzalez@asu.edu

G. Joshi-Tope
Northeast Biosciences, Inc.
New York, NY, USA

In molecular biology research, looking for information on a particular entity such as a gene or a protein may lead to thousands of articles, making it impossible for a researcher to individually read these articles or even just their abstracts. Thus, there is a need to curate the literature to extract various nuggets of knowledge, such as an interaction between two proteins, and store them in a database. However, the body of existing biomedical articles is growing at a very fast rate, making it impossible to curate them manually. The alternative approach of using computers for automatic extraction has problems with accuracy. We propose to leverage the advantages of both techniques, extracting binary relationships between biological entities automatically from the biomedical literature and providing a platform that allows community collaboration in the annotation of the extracted relationships. Thus, the community of researchers that writes and reads the biomedical texts can use the server to search our database of extracted facts, and as an easy-to-use web platform to annotate facts relevant to them. We presented a preliminary prototype as a proof of concept earlier [1]. This paper presents the working implementation, available for download at http://www.cbioc.org as a browser plug-in for both Internet Explorer and Firefox. This current version has been available since June of 2006, and has over 160 registered users from around the world. Aside from its use as an annotation tool, data from CBioC has also been used in computational methods with encouraging results [2].
1. INTRODUCTION

There are about 15 million abstracts currently indexed in PubMed, with anywhere between 300,000 and 500,000 [3] being added each year. To illustrate the problem, consider the following example. A search for the gene TNF alpha in PubMed yields 74,430 articles (as of March 2007) and 6,193 review articles. Refining the search to TNF alpha and inflammation reduces this number to 15,126 regular articles and 1,757 review articles, still too many for a researcher to review. It would be significantly easier if he or she had access to a database that stores relevant nuggets of knowledge, such as the relationships between genes and biological processes. The problem of constructing such a database has been recognized as one that needs to be solved to move forward into the great challenges of science for this century [4]. Currently, two approaches are used to extract such facts from biomedical publications: (i) human curation and (ii) the development and use of automated information extraction systems. However, the constantly increasing number of articles and the complexity inherent to their annotation result in data sources that are continuously outdated. For example, GeneRIF (Gene Reference Into Function) was started in 2002, yet it covers only about 1.7% of the genes in Entrez [5] and 25% of human genes. Automatic extraction and annotation seems a natural way to overcome the limitations of manual curation, and a lot of work has been done in this area, including the automatic extraction of genes and gene products [6], protein-protein interactions [7], relationships between genes and biological functions [8], and genes and diseases [9], among others. However, the reliability of the extracted information varies greatly and thus discourages biologists from using it for their research. CBioC represents a new approach to the problem through mass collaboration, where the community of researchers that writes and reads the biomedical texts is able to contribute to the curation process, dictating the pace at which it is done. Automated text extraction is used
Figure 1. CBioC automatically launches the interaction web band at the bottom of the main window when a user visits PubMed and displays the facts available for it. If the abstract has not been processed, extraction occurs "on the fly". The left image corresponds to the interactions display, allowing the user to tab through the different types of relationships (protein/protein, gene/disease, and gene/bioprocess). The right image shows the simplest annotation mechanism (a yes/no vote on "Correctly Extracted") and the agreement level (% Approval). Users may also modify and add interactions.
as a starting point to bootstrap the database, but then it is up to researchers to improve upon the extracted data through modifications, additions of missed facts, and voting on the accuracy of extraction. It runs as a web browser extension and allows unobtrusive use of the system during the regular course of research, allowing the natural "checks and balances" of community consensus to take hold to resolve inconsistencies when possible, or to point out disagreements and controversial findings. Although most of the data in CBioC currently comes from automatic extraction, users have contributed over 500 interactions, which are currently being evaluated. This shows how, with CBioC, small or large groups of researchers can easily annotate articles and find facts of interest to them.
2. METHODS

CBioC is available for both Internet Explorer and Firefox, for PCs, Macs, and Linux machines. Once installed, CBioC runs unobtrusively; when one visits the Entrez (PubMed) web site, CBioC automatically opens within a "web band" at the bottom of the main browser window. Users who do not wish to install the plug-in can get similar functionality by logging in from our home page. CBioC uses a modified version of the extraction system IntEx [7], which uses Natural Language Processing methods to extract protein-protein interactions, gene-disease relations, and gene-bioprocess relations.

2.1. Usage
Figure 2. A CBioC search provides a simple way to browse through interactions involving a particular gene or the list of genes involved in a disease or biological process.
Consider a variation of the research scenario introduced before. A PubMed search for "TNF alpha atherosclerosis" returns over 900 abstracts. One of the abstracts (PMID 16814297) reports that TNF-alpha modulates MCP-1, a common alias of CCL2. Expression of CCL2 has been found to be increased in cardiovascular diseases and is of high interest as a biomarker of atherosclerosis [10]. However, as of March 2007, none of the public curated databases had captured this important interaction, and any researcher who missed the article will probably not learn about it. CCL2 is involved in immunoregulatory and inflammatory processes [11]. Thus, the fact that TNF-alpha modulates CCL2, as reported in the article (and supported by others, such as PMID 9920834), is significant and important for assessing the relevance of TNF alpha with respect to atherosclerosis and for any systems biology simulations. Relying solely on curated data could leave this piece of information out. Consider the same scenario, but with CBioC installed. The user could start with a TNF alpha search in CBioC (see Figure 2). Quickly scrolling down through the listed interactions gives the researcher a general idea of the known relevant associated genes, even though some of them might not be accurate. If MCP-1 catches the researcher's attention, the rest of the interactions in that abstract can be quickly displayed by clicking on the PMID of the interaction of interest among the search results. A list of other articles that report the same interaction can be viewed by clicking on the "Related Articles" link.
2.2. Functionality

2.2.1. Displaying data

When one searches the PubMed database and displays a particular abstract, CBioC automatically displays the interactions found for the abstract. If the abstract has not been processed by CBioC before, the extraction system runs "on the fly". CBioC also displays interactions found for the article in publicly accessible databases.

2.2.2. Searching

As a registered CBioC user, one can search the CBioC database for all facts related to a particular protein, gene, disease, or interaction word by simply typing the relevant term in the Search box within the CBioC web band. CBioC automatically expands a search term with known synonyms of the term. One can also display the facts available for a set of abstracts by typing a comma-separated list of their PMIDs in the search box. The search box also lets one see all the facts we have from a particular database by typing its name, such as "BIND" or "MINT".

2.2.3. Modifying and adding
Registered CBioC users can vote on the accuracy of an extraction, modify the interactions, or add interactions that the extraction system missed. If an interaction seems correctly extracted, one can click the "Yes" button to approve it. Otherwise, one can vote "No" or modify the data by clicking "Modify". If "Modify" is clicked, the data fields open up for editing. The user's screen id will be displayed in the "Source" column from then on, with the previous data stored and accessible via the "History" link. The modified information is then subject to community vote. Similarly, if an interaction is present in the abstract or in the full article but was missed by the extraction system, it can be entered in the last row.
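A toy illustration of this annotate-and-vote bookkeeping follows; the record fields and function names are invented for the example and do not reflect CBioC's actual schema:

def approval(votes):
    """Percentage of "yes" votes for an extracted fact; the web band
    displays this as the '% Approval' agreement level."""
    yes = sum(1 for v in votes if v == "yes")
    return 100.0 * yes / len(votes) if votes else 0.0

history = []    # earlier versions stay reachable, as via the "History" link

def modify(fact, user, new_fields):
    """A modification archives the old record, re-attributes the fact to
    the editing user, and resets the community vote."""
    history.append(dict(fact))
    fact.update(new_fields)
    fact["source"] = user
    fact["votes"] = []
    return fact

fact = {"entity1": "TNF-alpha", "relation": "modulates", "entity2": "MCP-1",
        "source": "IntEx", "votes": ["yes", "yes", "no"]}
print(approval(fact["votes"]))            # about 66.7: two of three approve
modify(fact, "user42", {"entity2": "CCL2"})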
3. RESULTS AND DISCUSSION

Although the CBioC system has moved well beyond its prototype stage, it is still considered a "beta" system and new features are being added. It is, however, functional. To date, over 4.5 million abstracts have been pre-processed, and CBioC does dynamic ("on the fly") extraction when a user views an abstract that has not been pre-processed. This is an important feature that gives users total control over which abstracts are processed. Additionally, we have incorporated interactions from BIND, GRID, MINT, DIP and IntAct. A total of 261 distinct users have downloaded the CBioC plug-in, with 161 of them becoming registered users since June 2006, when CBioC was mentioned in Science Magazine's NetWatch [12]. Partial statistics for those who chose a personal title (such as "Doctor", "Professor", or "Researcher") during the registration process show our users include 53 doctors, 30 researchers, 17 professors, 8 post-docs and 40 students. Actions of registered users are tracked, and have so far yielded a total of over 500 curated interactions (either added, modified, or approved through a "yes" vote). This, of course, is added to the more than 1.5 million relationships automatically extracted from text. As a point of comparison, at the time of its publication, IntAct [13] had 2,200 interactions, most of them from high-throughput experiments (not curated). Two years after its conception, MINT [14] had 2,500 curated mammalian interactions, and was the largest publicly available dataset of curated entries at the time. It will be interesting to see how many curated interactions CBioC will have when it reaches the 2-year mark in June of 2008. Table 1 shows statistics about content and user actions. About 55% of the votes confirm that the automatic extraction is correct ("yes" votes), an indicator of the extraction system's precision. This use of community validation is another area to explore as value added by the CBioC platform. Aside from the web interface, data from CBioC has also been used in computational methods with encouraging results [15]. We presented a computational method [15] to uncover possible gene-disease relationships that are not directly stated in an abstract or were missed by the initial mining of the literature. Ranked lists of genes obtained from the method reach a precision of 98% for the top 50 genes, and up to 92% for the top 200 genes, in contrast to about 70% accuracy for simple co-occurrence searches.
Table 1. CBioC statistics. The left table details the type of information stored in the CBioC database, accessible via term searches or by PMID. The right table details the number of actions by registered users (as of March 2007). Actions by non-registered users are not tracked. IntAct interactions are being updated, with over 130,000 becoming available soon. Users (excluding the development team) include 53 doctors, 30 researchers, 17 professors, 8 postdocs and 40 students.

Integrated Data
Total Processed Abstracts     1,618,878
BIND Interactions             114,685
MINT Interactions
GRID Interactions             58,467
DIP Interactions              52,070
IntAct Interactions
Total Gene/Disease            301,547
Total Gene/Bio-Process        251,233
Total Protein/Protein         972,769

Actions by registered users
Add interaction               163
Modify interaction            133
Rate interaction              71
Search                        6,734
View (article)                51,721
Vote (total)                  370
Vote (yes)                    207

References
1. Baral C, et al. Collaborative curation of data from bio-medical texts and abstracts and its integration. In Data Integration in the Life Sciences, 309-312 (Lecture Notes in Computer Science, San Diego, CA, 2005).
2. Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature. In Pacific Symposium on Biocomputing (Maui, Hawaii, 2007).
3. Soteriades ES, Falagas ME. Comparison of amount of biomedical research originating from the European Union and the United States. BMJ 331, 192-194 (2005).
4. Emmott S. Towards 2020 Science: a report. In Towards 2020 Science Workshop (ed. Cambridge MR) (2006).
5. Lu Z, Cohen KB, Hunter L. Finding GeneRIFs via Gene Ontology annotations. In Pacific Symposium on Biocomputing Vol. 11, 52-63 (World Scientific Publishing Co. Pte. Ltd., Maui, Hawaii, USA, 2006).
6. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics 18, 1124-1132 (2002).
7. Ahmed ST, Chidambaram D, Davulcu H, Baral C. IntEx: A syntactic role driven protein-protein interaction extractor for bio-medical text. In BioLINK: Linking Literature, Information and Knowledge for Biology (Detroit, Michigan, 2005).
8. Koike A, Niwa Y, Takagi T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 21, 1227-1236 (2005).
9. Chun H-W, et al. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. In Pacific Symposium on Biocomputing Vol. 11, 4-15 (2006).
10. Herder C, et al. Chemokines and incident coronary heart disease: Results from the MONICA/KORA Augsburg case-cohort study, 1984-2002. Arterioscler Thromb Vasc Biol (2006).
11. Entrez Gene entry for CCL2 (GeneID: 6347).
12. Leslie M. NetWatch - Software: Annotate while you read. Science 312, 1721 (2006).
13. Hermjakob H, et al. IntAct: an open source molecular interaction database. Nucl. Acids Res. 32, D452-455 (2004).
14. Ceol A, et al. The (new) MINT database. In BITS 2004 (Padova, Italy, 2004).
15. Gonzalez G, Uribe JC, Tari L, Brophy C, Baral C. Mining gene-disease relationships from biomedical literature: Incorporating interactions, connectivity, confidence, and context measures. In Pacific Symposium on Biocomputing (Maui, Hawaii, 2007).
SUPERCOMPUTING WITH TOYS: HARNESSING THE POWER OF THE NVIDIA 8800GTX AND PLAYSTATION 3 FOR BIOINFORMATICS PROBLEMS

Justin Wilson, Manhong Dai, Elvis Jakupovic, Stanley Watson and Fan Meng*
Molecular and Behavioral Neuroscience Institute and Department of Psychiatry, University of Michigan, Ann Arbor, MI 48109, United States of America
* Email: mengf@umich.edu

Modern video cards and game consoles typically have much better performance-to-price ratios than general purpose CPUs. The parallel processing capabilities of game hardware are well-suited for high-throughput biomedical data analysis. Our initial results suggest that game hardware is a cost-effective platform for some computationally demanding bioinformatics problems.
*Corresponding author
graphics problem in order to use a GPU for general computation. However, the recent release of the Compute Unified Device Architecture (CUDA) by NVIDIA has circumvented this problem and greatly facilitated developing software for NVIDIA GPUs7. In addition, the highly acclaimed Cell Broadband Engine (CBE) in the PS3, can be programmed using C instead of assembly with the free IBM Cell SDK'. Furthermore, third party vendors such as PeakStreamg and RapidMind" allow the same program to be compiled and automatically optimized without modification for different multi-core platforms, thus greatly shortening the development cycle for different parallel computing platforms. The computationally-intense nature of highthroughput data analysis led us to examine the possibility of utilizing game hardware to speed-up several common algorithms. Our results are very encouraging and we believe game hardware is an effective platform for many bioinformatics problems.
2. MATERIALS A N D M E T H O D S Single or multiple CPU tests were performed on an 8x Opteron 865 (dual core) sever with 64G PC2700 memory running Fedora Core 2. GPU tests were performed on a 2x Opteron 275 (dual core) server with 4G of memory and a BFG GeForce 8800GTX with a core frequency of 600 MHz. The PS3 used in this project was a 60 GB version. The complier used for single and multiple Opteron core implementations was GCC 3.3.3. CUDA 0.8 and IBM Cell SDK
2.0 were used for the 8800GTX and PS3 programs, respectively. We used RapidMind version 2.0 beta 3 and followed its write-once, run-anywhere paradigm for each platform. See our webpage (footnote a) for the details of our tests.
3. RESULTS

3.1. 8800GTX and CBE vs. 1x and 16x CPU
Table 1 summarizes the performance of a single-precision vector calculation when using the native development environments for the 8800GTX and CBE as well as the RapidMind platform. The calculation is described by

$$ \vec{a} + \sum_{n=1}^{N} \vec{b} \circ \vec{b} $$

where $\circ$ is an element-wise division operator, $\vec{a}$ and $\vec{b}$ are vectors of 9,437,184 elements, and $N$ is the number of repeated $\vec{b} \circ \vec{b}$ calculations. The column headings of Table 1 are as follows: "I" is the number of times the calculation was performed, "N" is the number of repeated $\vec{b} \circ \vec{b}$ calculations, "1x" represents a single CPU, "16x" represents 16 CPUs, "GPU" represents the 8800GTX, "PS3" represents the CBE, and "RM" represents the designated hardware under the control of RapidMind.

Table 1. Vector multiplication/division performance on different platforms (seconds).
I     N    1x     16x   GPU    GPU RM  PS3    PS3 RM
10    500  426.6  32.3  2.2    2.5     96.4   574.6
1     500  42.7   3.3   0.3    1.5     9.7    559.5
10    100  79.1   6.2   1.3    1.9     19.6   11.5
1     100  7.9    0.6   0.2    0.7     2.0    6.1
100   10   56.7   7.1   11.7   13.2    21.4   21.0
10    10   5.7    0.7   1.2    1.9     2.2    2.6
1000  1    59.7   30.6  116.2  127.9   41.3   17.0
100   1    6.0    4.1   11.7   13.4    4.4    2.2
1000  0    19.7   15.0  106.0  127.8   20.0
100   0    2.0    1.6   10.2   13.2    2.3
Due to its physical design, game hardware does not provide an advantage for operations involving a large number of memory reads and writes (lower half of Table 1). When a small number of memory operations (low iteration counts) is combined with CPU-intensive operations (high calculation counts), the PS3 is more than 4 times the speed of a single Opteron 865 core. Most strikingly, a single NVIDIA 8800GTX is about 200 times faster than a single Opteron 865 core and more than 10 times faster than our 16-core (8x2) server. These results should be interpreted with the understanding that these numbers represent the upper limit of game hardware performance, since the entire problem resided in the main memory of each device and there were no conditional statements. Executables generated by RapidMind showed similar performance improvements on the 8800GTX when compared to executables generated using CUDA. The version we tried lacked optimization support for the CBE, but RapidMind has promised such optimization in future versions [10]. Regardless, the ability to use the same source code for different multi-core platforms should significantly help the adoption of game hardware.

a. http://wiki.mbni.med.umich.edu/wiki/index.php/Toycomputing
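The structure of the benchmark can be sketched on the CPU side as follows; the kernel is paraphrased from the formula above, and the vector size is reduced so the sketch runs quickly:

import numpy as np

def vector_test(iterations, n_repeats, size=9_437_184 // 64):
    """Per iteration, accumulate n_repeats element-wise division passes
    over two large single-precision vectors. Many repeats with few
    iterations is compute-bound (where the GPU and CBE excel);
    n_repeats == 0 is pure memory traffic (where they do not)."""
    a = np.random.rand(size).astype(np.float32)
    b = np.random.rand(size).astype(np.float32) + 1.0   # keep divisors nonzero
    for _ in range(iterations):
        r = a.copy()
        for _ in range(n_repeats):
            r += b / b               # the repeated element-wise division
    return r

r = vector_test(iterations=10, n_repeats=100)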
3.2. Clustering Algorithms

Clustering is one of the most widely used approaches in bioinformatics. However, clustering algorithms are CPU intensive, and a speedup would benefit problems ranging from gene expression analysis to document mining. A full clustering algorithm usually has two main components: determining the similarity of various samples (vectors) through a distance measure, and the classification of samples into different groups through a clustering method12. We decided to implement two distance calculation methods, Euclidean and B-spline-based mutual information13, and two clustering methods, single-link hierarchical clustering and centroid k-means clustering, for an 8800GTX, and to investigate their performance under various conditions. These implementations have been used to generate similarity matrices and cluster documents from the MEDLINE database represented by MeSH term vectors, as well as gene expression values from U133A GeneChips. GPUs are best suited for parallel data processing with a high ratio of arithmetic to I/O and a minimal amount of conditional instructions. Memory reads and writes between the host computer and GPU should be minimized. Data should be aligned
in memory, and memory access patterns should be sequential and regular. A good strategy for designing algorithms for the GPU is to examine the data dependency between the stages of an algorithm and have a kernel for each stage. Furthermore, having each thread or each block compute one independent element of the output of a stage automatically eliminates the need for synchronization between blocks.

Using these rules yields a distance matrix calculation kernel where each element in the distance matrix is computed by one block. First the vectors are copied to the device and aligned in memory. Each thread then computes the difference between two elements of two vectors and accumulates the results until both vectors are exhausted. Then the shared memory between the threads can be utilized to sum up the contribution of each thread. Finally, the computed value is written to the distance matrix.

Finding the minimum in a distance matrix and updating values according to the Lance-Williams formula are both activities in hierarchical clustering that can be parallelized. Finding the minimum is similar to computing a distance matrix, except that the location of the minimum must be remembered. Updating the distance matrix can also be performed in parallel, because only the rows and columns containing the two merged elements need to be updated. Consequently, one thread can process each column in the matrix.

The above techniques are also used in the k-means algorithm. The only seemingly difficult issue is adding up the vectors to calculate the new cluster centers. Since the GPU lacks atomic operations, having different blocks update the centers at the same time will not work correctly. However, by having each thread compute one element of one new cluster center, we circumvent the need for atomic operations. We also minimize the number of memory reads by using the assignment matrix.

The computational speedup for calculating Euclidean distance matrices and mutual information matrices is presented in figure 1. The legend shows the number of elements in each vector and the type of calculation (“D” for distance, “B” for B-spline). The B-spline mutual information algorithm was configured to use 10 bins and a spline order of 3 (Ref. 13). As expected, the B-spline mutual information matrix shows better GPU acceleration due to its higher arithmetic to I/O ratio. The figure also shows that
it may not be worthwhile to perform small Euclidean distance calculations with a GPU since most of the processing time will be spent on memory operations.
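The block-per-element strategy described above can be sketched compactly. The following is a minimal illustration (not the authors' code) written against the numba.cuda Python bindings rather than native CUDA C; the kernel name, thread count, and data sizes are our assumptions.

```python
import numpy as np
from numba import cuda, float32

THREADS = 128  # threads per block; each block computes one matrix element

@cuda.jit
def euclidean_kernel(vecs, dist):
    i = cuda.blockIdx.x          # row of the distance matrix
    j = cuda.blockIdx.y          # column of the distance matrix
    t = cuda.threadIdx.x
    partial = cuda.shared.array(THREADS, float32)

    # Each thread accumulates squared differences over a strided slice.
    acc = 0.0
    for d in range(t, vecs.shape[1], THREADS):
        diff = vecs[i, d] - vecs[j, d]
        acc += diff * diff
    partial[t] = acc
    cuda.syncthreads()

    # Tree reduction in shared memory sums the per-thread contributions.
    s = THREADS // 2
    while s > 0:
        if t < s:
            partial[t] += partial[t + s]
        cuda.syncthreads()
        s //= 2
    if t == 0:
        dist[i, j] = partial[0] ** 0.5

# Hypothetical usage: n vectors of dimension 512, one block per (i, j) pair.
n, dim = 1024, 512
vecs = cuda.to_device(np.random.rand(n, dim).astype(np.float32))
dist = cuda.device_array((n, n), dtype=np.float32)
euclidean_kernel[(n, n), THREADS](vecs, dist)
```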
[Fig. 1. Similarity matrix calculation speedup (y-axis) vs. number of vectors, 256–8192 (x-axis); legend: 512D, 2048D, 8192D, 512B, 2048B, 8192B.]
[Fig. 2. Clustering speedup (y-axis) vs. number of vectors, 256–8192 (x-axis); legend: 512H, 2048H, 8192H, 512K, 2048K, 8192K.]
The computational speedup for hierarchical clustering (“H”, including the initial distance calculation) and k-means clustering (“K”) is presented in figure 2. For k-means, the number of iterations was fixed and the number of clusters was 4. As expected, both figures show that the speedup is strongly related to the dimensionality of the vectors to be classified, because the elements of a data point can usually be operated on in parallel.
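As an aside, the atomic-free center update described above reduces to one thread per (cluster, dimension) output element, so no two threads ever write the same cell. A hedged sketch, again via numba.cuda (all names are ours):

```python
from numba import cuda

@cuda.jit
def update_centers(vecs, assign, centers):
    # One thread computes one element of one new cluster center.
    c, d = cuda.grid(2)
    if c < centers.shape[0] and d < centers.shape[1]:
        total, count = 0.0, 0
        for i in range(vecs.shape[0]):
            if assign[i] == c:   # membership from the assignment vector
                total += vecs[i, d]
                count += 1
        if count > 0:
            centers[c, d] = total / count

# Hypothetical launch: a 2-D grid covering (k clusters) x (dim) elements,
# e.g. update_centers[(k, dim), (1, 1)](vecs, assign, centers)
```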
3.3. Monte Carlo Permutation

Permutation is widely used in statistical analysis but is often the most time-consuming step in genome-wide data analysis. Table 2 compares the performance of an efficient Monte Carlo permutation procedure14 for correlation calculation on different platforms, using expression values for 4096 genes from 7226 U133A GeneChips deposited in the NCBI GEO database. It is obvious from the table that the 8800GTX can drastically speed up Monte Carlo permutations without expanding an existing cluster, given an open PCIe slot and an adequate power supply.

Table 2. Monte Carlo permutation on game hardware (seconds)

    Number | CPU (1x) | GPU   | PS3
    1      | 258.77   | 11.12 | 57.33
    2      | 517.50   | 21.99 | 114.36
    4      | 1035.01  | 43.75 | 228.60
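For readers unfamiliar with the technique, the following is a minimal sketch (our illustration, not the paper's implementation) of a Monte Carlo permutation test for a correlation: shuffle one variable, recompute the statistic, and count how often the permuted value is at least as extreme as the observed one.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=10000, rng=None):
    """Two-sided Monte Carlo permutation p-value for corr(x, y)."""
    rng = rng or np.random.default_rng()
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_perm):
        if abs(np.corrcoef(rng.permutation(x), y)[0, 1]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one to avoid zero p-values
```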
4. DISCUSSION
Although we have only just started developing with game hardware, our results suggest that the NVIDIA GeForce 8800GTX is a very attractive co-processor, capable of increasing single-precision floating-point calculation speed by more than one order of magnitude in clustering and Monte Carlo permutation procedures. It is likely that many other data-parallel bioinformatics algorithms, particularly those related to high-throughput genome-wide data analyses, will benefit from a port to game hardware.
Acknowledgments
The authors are members of the Pritzker Neuropsychiatric Disorders Research Consortium, which is supported by the Pritzker Neuropsychiatric Disorders Research Fund L.L.C. This work is also supported in part by the National Center for Integrated Biomedical Informatics through NIH grant 1U54DA021519-01A1 to the University of Michigan.

References
1. Angel E, Baxter B, Bolz J, Buck I, Carr N, Coombe, et al. http://www.gpgpu.org/ 2007.
2. Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P. ACM Transactions on Graphics 2004; 23(3): 777-786.
3. Owens JD, Luebke D, Govindaraju N, Harris M, Krüger J, Lefohn AE, Purcell TJ. Computer Graphics Forum 2007; 26(1): 80-113.
4. Mueller F. http://moss.csc.ncsu.edu/mueller/cluster/ps3/ 2007.
5. Charalambous M, Trancoso P, Stamatakis A. Proceedings of the 10th Panhellenic Conference on Informatics (PCI 2005), Springer LNCS 2005; 415-425.
6. Folding@Home. http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats 2007.
7. NVIDIA. http://developer.nvidia.com/object/cuda.html 2007.
8. IBM. http://www.alphaworks.ibm.com/tech/cellsw 2006.
9. PeakStream. http://www.peakstreaminc.com/product/overview/ 2006.
10. RapidMind. http://www.rapidmind.net/ 2006.
11. RapidMind. http://www.rapidmind.net/pdfs/RapidMindCellPorting.pdf 2006.
12. Wikipedia. http://en.wikipedia.org/wiki/Data-clustering 2007.
13. Daub CO, Steuer R, Selbig J, Kloska S. BMC Bioinformatics 2004; 5: 118.
14. Lin DY. Bioinformatics 2005; 21(6): 781-787.
EXACT AND HEURISTIC ALGORITHMS FOR WEIGHTED CLUSTER EDITING
Sven Rahmann, Tobias Wittkop, Jan Baumbach, and Marcel Martin
Computational Methods for Emerging Technologies group, Genome Informatics, Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany
Address correspondence to: [email protected]

Anke Truß and Sebastian Böcker
Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, D-07743 Jena, Germany

Clustering objects according to given similarity or distance values is a ubiquitous problem in computational biology with diverse applications, e.g., in defining families of orthologous genes, or in the analysis of microarray experiments. While there exists a plenitude of methods, many of them produce clusterings that can be further improved. “Cleaning up” initial clusterings can be formalized as projecting a graph onto the space of transitive graphs; it is also known as the cluster editing or cluster partitioning problem in the literature. In contrast to previous work on cluster editing, we allow arbitrary weights on the similarity graph. To solve the so-defined weighted transitive graph projection problem, we present (1) the first exact fixed-parameter algorithm, (2) a polynomial-time greedy algorithm that returns the optimal result on a well-defined subset of “close-to-transitive” graphs and works heuristically on other graphs, and (3) a fast heuristic that uses ideas similar to those from the Fruchterman-Reingold graph layout algorithm. We compare quality and running times of these algorithms on both artificial graphs and protein similarity graphs derived from the 66 organisms of the COG dataset.
1. INTRODUCTION
The following problem arises frequently in clustering applications: Given a set of objects V and a similarity or distance measure for each unordered pair {u, v} of objects, we want to partition V into disjoint clusters. A common strategy is to choose a similarity threshold and construct the corresponding threshold graph: The objects constitute the nodes of the graph, and an edge is drawn between u and v if their similarity exceeds (distance falls below) the given threshold. In this case, u and v are called “similar”, which we write as u ~ v. However, the resulting graph need not be transitive, meaning that u ~ v and v ~ w do not necessarily imply u ~ w. We wish to “clean up” such a preliminary clustering with as few edge changes as possible. Formal definitions are given below.
The similarity graph. We write V for the set of objects to be clustered; these are the vertices or nodes of the graph. We write (V choose k) for the set of k-element subsets of V, and use uv as shorthand for an unordered pair {u, v} ∈ (V choose 2). We assume the availability of a symmetric similarity function s : (V choose 2) → ℝ such that u and v are similar, u ~ v, if and only if s(u, v) := s(uv) > 0. The edge set of the similarity graph is E := {uv : u ~ v}. Note that the similarity of an object to itself is not and need not be defined here. For any set F ⊆ (V choose 2), we define s(F) := Σ_{uv∈F} s(u, v). A perfect clustering is characterized by the condition that the graph G = (V, E) is transitive, defined by any of the following equivalent conditions:

(1) For each triple uvw ∈ (V choose 3), the implication uv ∈ E and vw ∈ E ⟹ uw ∈ E holds.
(2) G contains no induced paths of length 2, i.e., for each triple uvw ∈ (V choose 3), we have |E ∩ {uv, vw, uw}| ≠ 2.
(3) G is a disjoint union of cliques (i.e., of complete graphs).
Our goal is to edit a given graph G = (V, E) by removing and adding edges in such a way that it becomes transitive. Each operation incurs a nonnegative cost: If uv ∈ E, the edge removal cost of uv is s(u, v). If uv ∉ E, the edge addition cost of uv is −s(u, v). Note the following subtlety: If s(u, v) = 0, then initially uv ∉ E, but it costs nothing to add this edge. The cost to transform the initial graph G = (V, E) into a graph G′ = (V, E′) with a different edge set E′ is consequently defined as

    cost(G → G′) := s(E \ E′) − s(E′ \ E).

Problem statement. The weighted transitive graph projection problem (WTGPP) is defined as follows: Given a similarity function s : (V choose 2) → ℝ and the weighted undirected graph G = (V, E, s) with E := {uv : s(uv) > 0}, compute S(G) := min{cost(G → G′) : G′ transitive} and find one or all transitive G* with cost(G → G*) = S(G). Such G* are called best transitive approximations to G, or transitive projections, or least-cost cluster edits of G. We also call this problem the weighted cluster editing problem.
Previous work and results. The unweighted version of this problem, where s(u, v) ∈ {+1, −1} and cost(G → G′) = |E \ E′| + |E′ \ E| = |E Δ E′|, has been extensively studied and is also known as cluster editing in the literature. The first study that we are aware of goes back to Zahn18 in 1964 and solves the problem on specially structured graphs (2-level hierarchies). On the negative side, the problem has been proven NP-hard in general at least twice independently4, 16. On the positive side, fixed-parameter tractability (FPT) of the unweighted cluster editing problem, using the minimum number of edge changes as parameter k, is well studied. Gramm et al.9 give a simple algorithm with running time O(3^k + |V|^3) and, by applying a refined branching strategy, improve the time complexity to O(2.27^k + |V|^3). Recent experiments by Dehne et al.3 suggest that the O(2.27^k + |V|^3) algorithm is indeed faster than the O(3^k + |V|^3) algorithm in practice. In theory, the best known algorithm8 for the problem has running time O(1.92^k + |V|^3), but this algorithm uses very complicated branching rules (137 initial cases) and has never been implemented. Damaschke2 shows how to enumerate all optimal solutions.

Unfortunately, it is also known that almost all graphs are almost maximally far away from transitivity in the following sense, as shown by Moon12. Let G_n be the set of all graphs on n vertices. Note that each G = (V, E) ∈ G_n satisfies S(G) ≤ (n choose 2)/2, because if |E| ≤ (n choose 2)/2, we can remove all edges and obtain the transitive empty graph, and if |E| ≥ (n choose 2)/2, we can add all missing edges and obtain the transitive complete graph. Now define the class G_{n,ε} of graphs that are “far away” from transitivity in the sense that S(G) ≥ (1 − ε)·(n choose 2)/2. Then for any ε > 0, this class contains asymptotically almost all graphs, i.e., |G_{n,ε}|/|G_n| → 1 as n → ∞. Nevertheless, the FPT results are important in practice, because we can expect the preliminary clusterings that we obtain from real-world datasets to be “almost transitive” already. To our knowledge, the WTGPP has not been subject to fixed-parameter approaches until now. Grötschel and Wakabayashi10 formulated it as an integer linear program and gave a cutting plane algorithm, but apparently it has not been tried on large instances.

Our contributions. We present the first fixed-parameter algorithm for the WTGPP in Section 2, which is an extension of the FP algorithm from Ref. 9 for the unweighted case. Assuming |s(uv)| ≥ 1 for all uv ∈ (V choose 2), our algorithm checks in time O(3^k + |V|^3) if there exists a transitive projection of cost at most k. In fact, the running time of our algorithm is much better in practice. We also approach the problem from another end and present a new O(|V|^3 + |V||E| + |E|^2) time greedy algorithm that provably returns the correct transitive projection on a well-defined class of graphs that are “not too far away” from transitivity, but may return suboptimal solutions for other graphs (Section 3). In Section 4, we present a fast heuristic based on ideas from graph layouting. In practice, its running time is O(|V|^2), while a worst-case analysis gives O(|V|^4) for cases that do not seem to occur in practice. Although this paper focuses on the new algorithms, we also present applications to simulated graphs and to protein similarity graphs derived from BLAST scores on the 66 organisms of the COG dataset17; these appear in Section 5. A concluding discussion is found in Section 6.
Preliminaries. Without loss of generality, the vertex set will be denoted by V := {1, …, n}. The input to all algorithms in subsequent sections is the (symmetric) similarity function s : (V choose 2) → ℝ and the initial edge set E := {ij : s(ij) > 0}. We set m := |E|. Without loss of generality, we may assume that the input graph consists of a single connected component. If not, we can treat each connected component separately, because an optimal solution will never join separate components; this is easily proved by contradiction.

The output of each algorithm is an edge set E* and a cost c* := cost(G → (V, E*)). We say that an algorithm correctly solves instance G = (V, E, s) if c* = S(G). Let N(v) := {u : s(uv) > 0} ⊆ V denote the set of neighbors of v. We call N∩(u, v) := N(u) ∩ N(v) the common neighbors of u and v, and NΔ(u, v) := (N(u) Δ N(v)) \ {u, v} their non-common neighbors; here A Δ B is the symmetric set difference of sets A and B. Let C(G) be the set of all conflict triples, i.e., triples uvw ∈ (V choose 3) that induce a path of length two: C(G) := {uvw ∈ (V choose 3) : |E ∩ {uv, vw, uw}| = 2}. As noted, G is transitive if and only if C(G) = ∅.
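To make the notation concrete, here is a brief Python sketch (our illustration; we assume the similarity function is given as a symmetric matrix s, with s[u][v] > 0 meaning uv ∈ E):

```python
from itertools import combinations

def conflict_triples(s, n):
    """Enumerate C(G): triples uvw inducing a path of length two."""
    for u, v, w in combinations(range(n), 3):
        if (s[u][v] > 0) + (s[v][w] > 0) + (s[u][w] > 0) == 2:
            yield (u, v, w)

def is_transitive(s, n):
    """G is transitive iff C(G) is empty."""
    return next(conflict_triples(s, n), None) is None
```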
2. FIXED-PARAMETER ALGORITHM

Fixed-parameter algorithmics was introduced by Downey and Fellows in the late nineties5. It enables us to find exact solutions for several NP-hard problems. The basic idea is to choose a parameter for a given problem such that the problem is solvable in polynomial time when the parameter is fixed. A problem is fixed-parameter tractable with respect to the given parameter if there exists an algorithm which solves the problem in a running time of O(f(k)·|I|^c), where f is a function dependent only on the parameter k, |I| is the size of the input, and c is a constant. See Ref. 13 for a recent overview of fixed-parameter algorithms.

In the following, we propose a fixed-parameter algorithm for the WTGPP parameterized with the (real-valued) cost k of an optimal solution. Given an instance of the problem and fixed k, the algorithm is guaranteed to find an optimal solution with cost at most k or to report that no such solution exists. The algorithm roughly adopts the branching strategy and data reduction rules of the O(3^k + |V|^3) algorithm from Ref. 9 and runs in time O(3^k + |V|^3) if every edge deletion or insertion has a cost of at least 1 (if not, costs may be scaled up to fulfill this requirement). While our algorithm accepts any positive real numbers as input, minimum edit costs are required to achieve a provable running time, because there can be no fixed-parameter algorithm solving the problem with arbitrarily small weights unless P=NP. Our algorithm requires a cost parameter k. So in order to find an optimal solution, i.e., the smallest k for which a G* with cost(G → G*) ≤ k exists, we call the algorithm repeatedly, starting with k = 1.
If we do not find a solution with this value, we increase k by 1, call the algorithm again, and so forth. Note that for every k, we have to traverse the complete search tree and find the best solution with cost ≤ k, if any. The overall structure of the algorithm is recursive. In the beginning, we start with the full input graph and the given parameter k. Given G and k ≥ 0, we first call the data reduction procedure described below. Then we pick a conflict triple uvw ∈ C(G) and repair it in each possible way by recursively branching into three sub-problems. In order to ensure that the sub-problems do not overlap, we will in the process set some nonexistent edges to “forbidden” (so we can never add them) and some existent edges to “permanent” (so they cannot be removed). Initially, no edge carries such a label.

Data reduction. The following operations reduce the problem size. They are performed initially and for every sub-problem.
Remove cliques: Identify connected components and remove all components that are cliques from the input graph. The algorithm can be called separately for each component.

Check for unaffordable edge modifications: For each uv ∈ (V choose 2), we calculate lower bounds icf(uv) and icp(uv) on the cost induced by setting uv to forbidden or permanent, respectively. When setting uv to forbidden, we state that u and v should be in different components and therefore should have no common neighbors. Conversely, setting uv to permanent means getting rid of all non-common neighbors. Lower bounds on the induced costs are obtained as

    icf(uv) = Σ_{w ∈ N∩(u,v)} min{s(uw), s(vw)};
    icp(uv) = Σ_{w ∈ NΔ(u,v)} min{|s(uw)|, |s(vw)|}.
We maintain lists in which these costs are sorted by size, and we update these lists every time an edit operation is carried out. Data reduction now works as follows: (a) For all uv where icf(uv) > k (i.e., which cannot be forbidden): Insert uv if necessary, and set uv to “permanent”. (b) For all uv where icp(uv) > k (i.e., which cannot be made permanent): Delete uv if necessary, and set uv to “forbidden”.
If there is a pair uv such that both icp(uv) > k and icf(uv) > k, the (sub-)problem instance is not solvable with parameter k.

Merge vertices incident to permanent edges: As soon as we set an edge uv to permanent, it is obvious that u and v must be in the same clique in each solution found in this branch of the algorithm. In this case we merge u and v, creating a new vertex x. Note that if w is a neighbor both of u and of v, we create a new edge xw whose deletion costs as much as the deletion of both uw and vw. If w is neither a neighbor of u nor of v, we calculate the insertion cost of the nonexistent edge xw analogously. In case w is a neighbor of either u or v but not both, uvw is a conflict triple, and we have to decide whether we delete the edge connecting w with u or v, or we insert the nonexistent edge. By summing the similarities (one of which is negative) to calculate the respective value for xw, we carry out the cheaper operation and maintain the possibility to edit xw later. Thus, we merge u and v into a new vertex x as follows: For each vertex w ∈ V \ {u, v}, set s(xw) ← s(uw) + s(vw). Let k ← k − icp(uv), and delete u and v from the graph.
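A minimal sketch of this merge (our paraphrase, not the authors' C++ code): similarities to every other vertex are simply added, so a later edit of xw implicitly performs the cheaper of the two underlying edits on uw and vw.

```python
def merge_vertices(s, u, v):
    """Merge u and v (joined by a permanent edge) into one vertex.
    s is a symmetric list-of-lists similarity matrix; the merged vertex is
    appended last, and s(xw) = s(uw) + s(vw) as described above."""
    keep = [w for w in range(len(s)) if w not in (u, v)]
    merged_row = [s[u][w] + s[v][w] for w in keep]
    t = [[s[a][b] for b in keep] for a in keep]
    for row, val in zip(t, merged_row):
        row.append(val)
    t.append(merged_row + [0.0])   # self-similarity is undefined; use 0
    return t
```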
Branching strategy. After data reduction, let uvw ∈ C(G) be a conflict triple, where u is the vertex of degree two and v, w are the leaves. We recursively branch into three cases:

(1) Insert the missing edge vw, and set all edges uv, uw, vw to “permanent”.
(2) Delete edge uv, and set the remaining edge uw to “permanent” and the absent edges uv and vw to “forbidden”.
(3) Delete edge uw, and set it to “forbidden” (do not set the other edge labels).

In each branch, we lower k by the insertion or deletion cost required for the executed operation. If this would lead to k < 0, we skip this branch. This branching strategy gives us a search tree of size O(3^k), but usually much smaller in practice.

Time complexity analysis. If we set an edge to forbidden or permanent, this can reduce the parameter k, because we have to delete or insert an edge. This, in turn, may trigger other edges to be forbidden or permanent. We can show that the running time for merging two vertices is O(|V|^2), and the total running time for data reduction of an arbitrary input graph is O(|V|^3). A detailed proof is deferred to a full journal version of this paper. If every edge deletion or insertion has a cost of at least 1, then we can show that our data reduction results in a problem kernel with at most 2k^2 + k vertices. For weighted cluster editing, this would result in a total running time of O(3^k · k^4 + |V|^3). We use interleaving14 by performing data reduction repeatedly during the course of the search-tree algorithm whenever possible. This reduces the total running time to O(3^k + |V|^3). We stress that the faster O(2.27^k + |V|^3) algorithm of Gramm et al.9 for the unweighted case cannot be used to solve the WTGPP, because its branching strategy is based on an observation that does not hold for weighted graphs (Lemma 5 in Ref. 9). We are currently working on adapting this branching strategy to the weighted case.
3. GREEDY HEURISTIC

As in the fixed-parameter algorithm, all conflict triples uvw ∈ C(G) must be repaired to make G transitive. A repair consists of either removing one of the two existing edges or adding the missing edge. Observe that the hard part is to correctly “guess” the set of edges to remove. Thereafter, the edge insertions can easily be found by transitive closure, that is, by adding those edges required to make each connected component a clique. Our idea is to define a function that scores edge removals and then let the algorithm greedily delete the highest-scoring edge in each step, until further deletions do not improve the solution.

Scoring edges. We define G's deviation from transitivity D(G) as

    D(G) := Σ_{uvw ∈ C(G)} min{|s(uv)|, |s(vw)|, |s(uw)|}.   (1)

We can now score edge removals: Let uv be an edge in G = (V, E, s). Removing it yields G′_uv := (V, E \ {uv}, s′), where s′(xy) = s(xy), except s′(uv) = −∞ (“forbidden”). We call

    Δ_uv(G) := D(G) − D(G′_uv) − s(uv)   (2)

the transitivity improvement of edge uv. The term −s(uv) penalizes the edge removal.
Algorithm. In addition to the main algorithm, the greedy heuristic consists of two auxiliary functions, which we describe first.

Algorithm REMOVE-CULPRIT(G) returns the highest-scoring edge argmax_{uv∈E} Δ_uv(G) and removes it from G. There are m edges; computing each Δ_uv(G) can be done in O(n), since only triples containing uv need to be considered. Thus, the runtime of the first invocation is O(mn). Subsequent invocations need only O(m + n) time: O(n) to update scores for edges around the deleted edge, and O(m) to find the maximum score.

Algorithm TRANSITIVE-CLOSURE-COST(G) assumes G is connected; it returns the total cost of all edge additions required for a transitive closure of G, Σ_{uv ∈ (V choose 2)} max{−s(uv), 0}, in time O(n^2).
Algorithm GREEDY-HEURISTIC(G) is the main algorithm. It returns a pair (deletions, cost), where deletions is the list of edges to be removed from G and cost is the total cost of all edit operations (both removals and additions). Remember that G is connected.

(1) cost ← TRANSITIVE-CLOSURE-COST(G).
(2) If cost = 0, return an empty list and cost 0.
(3) Set deletions ← empty list; delcost ← 0.
(4) Repeat the following steps until G consists of two connected components G1 and G2:
    (a) uv ← REMOVE-CULPRIT(G)
    (b) append uv to deletions
    (c) increase delcost by s(uv)
(5) Adjust deletions such that it only includes edges that contribute to the cut between G1 and G2. Adjust delcost accordingly, and re-add incorrect edges to G1 and G2.
(6) Solve the problem recursively for G1 and G2, as long as there is a chance for a better solution:
    If delcost ≥ cost, return (empty list, cost).
    (list1, cost1) ← GREEDY-HEURISTIC(G1).
    If delcost + cost1 ≥ cost, return (empty list, cost).
    (list2, cost2) ← GREEDY-HEURISTIC(G2).
    If delcost + cost1 + cost2 ≥ cost, return (empty list, cost).
(7) Append list1 and list2 to deletions. Return (deletions, delcost + cost1 + cost2).
If the “safety net” in step 5 is never invoked, GREEDY-HEURISTIC deletes each of the m edges at most once across all recursions. After each deletion, both determining connected components and REMOVE-CULPRIT require O(m + n) time. Also, TRANSITIVE-CLOSURE-COST takes O(n^2) time for each cut, of which there are at most n − 1. Thus, the runtime is O(m(m + n) + n^3).
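For illustration, here is a naive Python sketch of the scoring and culprit removal (our code, not the paper's implementation; it recomputes D(G) from scratch instead of maintaining the O(m + n) incremental updates, and it reuses conflict_triples from the sketch at the end of Section 1):

```python
def deviation(s, n):
    """D(G), Eq. (1): sum of the cheapest repair over all conflict triples."""
    return sum(min(abs(s[u][v]), abs(s[v][w]), abs(s[u][w]))
               for u, v, w in conflict_triples(s, n))

def remove_culprit(s, n):
    """Delete and return the edge with the highest transitivity improvement."""
    best, best_score = None, float("-inf")
    d_now = deviation(s, n)
    for u in range(n):
        for v in range(u + 1, n):
            if s[u][v] <= 0:
                continue
            keep = s[u][v]
            s[u][v] = s[v][u] = float("-inf")       # tentatively forbid uv
            score = d_now - deviation(s, n) - keep  # Eq. (2)
            s[u][v] = s[v][u] = keep                # restore
            if score > best_score:
                best, best_score = (u, v), score
    u, v = best
    cost = s[u][v]
    s[u][v] = s[v][u] = float("-inf")               # actually remove uv
    return best, cost
```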
Correctness of the greedy heuristic for special graphs. We show that the greedy heuristic correctly computes the transitive projection of certain classes of graphs in the unweighted case, where s(i, j) ∈ {+1, −1}. Here Eq. (1) becomes D(G) = |C(G)| and Eq. (2) becomes Δ_uv(G) = |C(G)| − |C(G′_uv)| − 1 = |NΔ(u, v)| − |N∩(u, v)| − 1, since triples not containing edge uv cancel out. Let T be an unweighted transitive graph consisting of r cliques C_1, …, C_r with n_i := |C_i|. Graph G is obtained from T by edge modifications. Let δ_v be the number of v-incident edges deleted from T, and ι_v the number of v-incident edges added to T, to obtain G.

Lemma. (1) Let uv ∈ E(G) ∩ E(T) be an intra-cluster edge of C_i. Then Δ_uv(G) ≤ 2δ_u + 2δ_v + ι_u + ι_v − n_i + 1. (2) Let xy ∈ E(G) \ E(T) be an inter-cluster edge between C_i ∋ x and C_j ∋ y. Then Δ_xy(G) ≥ n_i + n_j − (δ_x + δ_y + 2ι_x + 2ι_y) + 1.

Proof. We count the common and non-common neighbors. (1) There are no non-common neighbors of uv in T, and each edge deletion or insertion incident to u or v creates at most one. Therefore |NΔ(u, v)| ≤ δ_u + δ_v + ι_u + ι_v. There are n_i − 2 common neighbors of uv in T, and each edge deletion incident to u or v removes at most one. Thus |N∩(u, v)| ≥ n_i − 2 − (δ_u + δ_v). (2) After inserting xy into T, this edge has (n_i − 1) + (n_j − 1) non-common neighbors. Each deletion incident to x or y decreases this number, and each of the ι_x − 1 plus ι_y − 1 additional insertions incident to x or y might also decrease this number. Thus |NΔ(x, y)| ≥ n_i + n_j − (δ_x + δ_y + ι_x + ι_y). On the other hand, each insertion can also create a common neighbor; thus |N∩(x, y)| ≤ ι_x + ι_y − 2.
Theorem. GREEDY-HEURISTIC(G) recovers the original transitive graph T if the following assumption holds: For each vertex from any C_i in T, at most 2n_i/9 edges to vertices in other clusters are added, and at most 2n_i/9 of the edges to vertices in the same cluster are removed, to obtain G.

Proof. We show that Δ_e(G) > Δ_f(G) for any inter-cluster edge e and intra-cluster edge f. Assume that e = xy lies between C_i and C_j, and that f = uv lies in C_i. Using the Lemma and the 2/9-assumption, Δ_e(G) − Δ_f(G) ≥ 2n_i − (δ_x + 2ι_x + 2δ_u + 2δ_v + ι_u + ι_v) + n_j − (δ_y + 2ι_y) ≥ n_j/3 > 0, as all n_i-terms cancel out. Therefore, GREEDY-HEURISTIC will always remove inter-cluster edges first. This also shows that the “safety net” (step 5) of the algorithm is unnecessary here.
4. LAYOUT-BASED HEURISTIC

Our final heuristic is based on physical intuition and is motivated by graph layouting, initially introduced by Fruchterman and Reingold6. It has later been extended and used for the visualization of structural and functional relationships in biological networks, e.g., in BioLayout7. The main idea of these layout algorithms is to arrange all nodes on a 2-dimensional plane to fit esthetic criteria (such as even node distribution in a frame and reflection of inherent symmetries). The graph's nodes are interpreted as magnets (or electrical charges of the same kind), and edges are replaced by rubber bands to form a physical system. The nodes are initially placed randomly or in a circle, for example, and then left to the forces of the system, so that the magnetic repulsion and the bands' attraction forces move the system to a minimal energy state. While a physical system provides the motivation for these algorithms, in the actual implementation the nodes need not move according to exact physical laws.

We have adapted and extended these ideas: The layout of the graph is used to partition it into disjoint connected components. Our algorithm proceeds in three phases: (1) layout, (2) partitioning, and (3) postprocessing.

Layout phase. The goal is to find a position pos[i] = (pos[i]_1, pos[i]_2) ∈ ℝ^2 for each node 1 ≤ i ≤ n, starting with a circular layout of radius ρ_0 (a user-defined parameter) around the origin. We define the distance d(i, j) of nodes i and j as their Euclidean distance in the layout:

    d(i, j) := (Σ_{l=1}^{2} (pos[i]_l − pos[j]_l)^2)^{1/2}.
For a user-defined number R of iterations, we compute the displacement of each node and update the position pos[i] of each node i accordingly. We have allowed ourselves some freedom in deriving a good displacement vector. In particular, we do not compute forces, accelerations, and velocities of points but, for simplicity's sake, directly apply a displacement vector to a node once it has been computed according to the rules below. In this sense, the physical system described above serves only as a motivation, not as a model, for the algorithm.

In round r ∈ {1, …, R}, we compute the displacement of node i as follows. For each node j ≠ i with s(ij) > 0, we move i in the direction of j (the unit vector of this direction is (pos[j] − pos[i])/d(i, j)) by an amount of f_att · F_att(d(i, j)) · s(ij). Here F_att(d) is a strictly increasing function of the distance (we use F_att(d) := log(d + 1)), and f_att > 0 is a user-defined scaling factor for attraction. Conversely, for each node j ≠ i with s(ij) < 0, we move i away from j by an amount of f_rep · F_rep(d(i, j)) · |s(ij)|, where F_rep(d) := 1/F_att(d) is strictly decreasing, and f_rep > 0 is another scaling factor. Finally, the magnitude of the displacement vector is cut off at a maximal value M(r) that depends on the iteration r: We use M(r) = n · M_0 · (1/(r + 1)) to obtain increasingly small displacements in later iterations. Again, M_0 > 0 is a user-defined parameter. After the displacement of a node i has been computed, the node is immediately moved, before the displacement of node i + 1 is computed. While this does not agree with the physical model, we have found that it speeds up convergence of the layout and saves memory for the displacement vectors. After all nodes have been moved, the next iteration starts. The layout phase obviously runs in O(R · n^2) time. The actions of the algorithm are visualized in Figure 1.

For the cluster editing problem based on protein sequence similarities, we use the following parameters: number of iterations R = 186, initial circular layout radius ρ_0 = 200, repulsion scaling factor f_rep = 1.687, attraction scaling factor f_att = 1.245, and M_0 = 633. The best parameter constellation is (more or less) specific to the concrete problem and has been obtained by an evolutionary training procedure using the cost function as the quality function. It is included in our implementation to enable the user to perform parameter calibration for arbitrary applications.
[Fig. 1. Layout of a graph with 41 nodes after (A) 3, (B) 10, and (C) 90 iterations.]
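The following is a sketch of one displacement round (our code, not the authors' Java implementation); the parameter defaults follow the values quoted above, and the exact decay schedule of M(r) follows our reading of the text.

```python
import math
import numpy as np

def layout_round(pos, s, r, f_att=1.245, f_rep=1.687, m0=633.0):
    """One displacement round; pos is an (n, 2) float array, s a similarity
    matrix with positive entries for edges and negative for non-edges."""
    n = len(pos)
    cap = n * m0 / (r + 1)                        # displacement cut-off M(r)
    for i in range(n):
        disp = np.zeros(2)
        for j in range(n):
            if j == i:
                continue
            d = max(np.linalg.norm(pos[j] - pos[i]), 1e-9)
            unit = (pos[j] - pos[i]) / d
            f = max(math.log(d + 1.0), 1e-9)      # F_att(d); F_rep = 1/F_att
            if s[i][j] > 0:
                disp += unit * f_att * f * s[i][j]               # rubber-band pull
            elif s[i][j] < 0:
                disp -= unit * f_rep * (1.0 / f) * abs(s[i][j])  # magnet push
        norm = np.linalg.norm(disp)
        if norm > cap:
            disp *= cap / norm                    # cap the displacement
        pos[i] += disp                            # node moves immediately
    return pos
```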
Partitioning phase. The nodes' positions after R rounds are used to partition the graph geometrically. Given a distance parameter δ, we single-linkage cluster all nodes, meaning that nodes i and j belong to the same cluster if there exist nodes i = i_0, i_1, …, i_K = j such that d(i_{k−1}, i_k) ≤ δ for all k = 1, …, K. We determine cost(G → G*_δ) for the so-defined transitive graph G*_δ. To find a good G*_δ, we start with a small distance parameter δ_init := 0 and increase it (δ ← δ + σ) by a growing step size σ: Initially σ ← σ_init := 0.01; subsequently σ ← σ · f_σ with factor f_σ := 1.1. This continues until δ ≥ δ_max := 300. The best value for δ, along with its cost, is remembered. Obviously, the time complexity of the partitioning phase is O(D · n^2), where D is the number of different values for δ.
Postprocessing. The geometric single-linkage clustering is further improved by postprocessing, which takes O(n^4) time in the worst case, but this almost never happens in practice. Effectively, the overall running time is O(n^2). The two postprocessing steps are:

(1) For each pair of clusters, we check if joining them into a single cluster decreases the overall cost and perform this operation if appropriate. During this step, we especially reduce the number of erroneous singleton nodes. This happens in arbitrary but deterministic order, as long as merging a pair of clusters results in an improvement.
(2) For each node i and cluster C with i ∉ C, we check if moving i to C decreases the overall cost and perform this operation if appropriate. We also repeat this step as long as further improvements result.
5. RESULTS

We implemented the FP algorithm in C++, the greedy heuristic in Python, and the layout-based heuristic in Java. While with modern Java virtual machines the running times of Java programs are comparable to those of C++ programs, there is a higher start-up cost, which especially hurts performance on small problem instances. Python running times are about 10 times slower than those of a comparable C++ implementation. This should be kept in mind when comparing the running times of our implementations. All measurements were taken on a SunFire 880 with 900-MHz UltraSPARC III+ processors and 32 GB of RAM.
Table 1. Results on artificial graphs with different numbers of nodes n, resulting in different ranges of edge numbers m. For each n ≤ 50, ten random instances were generated. For each n ≥ 60, where the FP algorithm did not finish in reasonable time, only five instances were generated. Costs and running times are averages over these 10 resp. 5 instances. The (Diff.) columns show the relative cost difference against the optimal solution returned by FP, where possible. Abbreviations: FP: fixed-parameter algorithm; Greedy: greedy heuristic; Layout: layout-based heuristic.

    n   | m           | FP cost | Greedy cost (Diff.) | Layout cost (Diff.) | FP time [s] | Greedy time [s] | Layout time [s]
    10  | [11,30]     | 95.75   | 96.17 (+0%)   | 95.75 (+0%)   | 0.035    | 0.242   | 0.845
    20  | [65,165]    | 301.89  | 305.22 (+1%)  | 301.89 (+0%)  | 1.407    | 0.152   | 0.538
    30  | [138,296]   | 671.25  | 671.51 (+0%)  | 671.25 (+0%)  | 2.756    | 1.157   | 1.876
    40  | [251,533]   | 1238.3  | 1238.31 (+0%) | 1238.31 (+0%) | 72.109   | 3.167   | 2.816
    50  | [402,821]   | 1859.99 | 1859.99 (+0%) | 1859.99 (+0%) | 2204.862 | 3.353   | 8.315
    60  | [515,1252]  | –       | 2742.3 (–)    | 2742.3 (–)    | –        | 19.198  | 3.972
    70  | [694,1911]  | –       | 3608.54 (–)   | 3609.48 (–)   | –        | 58.124  | 4.358
    80  | [1141,2094] | –       | 4729.52 (–)   | 4722.08 (–)   | –        | 69.056  | 4.698
    90  | [1248,2969] | –       | 6106.56 (–)   | 6106.56 (–)   | –        | 128.986 | 5.384
    100 | [1711,3157] | –       | 7494.36 (–)   | 7494.36 (–)   | –        | 207.958 | 5.464
Artificial graphs. We generate random artificial graphs as follows. Given the number of nodes n, we randomly select an integer k ∈ [1, n] and define the corresponding nodes to be a cluster. We proceed in the same way with the remaining n − k nodes until no nodes are left. This gives us a random number of clusters of random sizes. Then the similarities of objects within a cluster are drawn from a Gaussian distribution N(μ_in, σ_in^2); they are positive on average, but negative with some probability. Similarities of objects in different clusters are conversely drawn from a Gaussian distribution N(μ_ex, σ_ex^2), which leads to negative values on average. If the parameters are chosen carefully, this construction leads to “almost transitive” graphs. For our experiments, we choose μ_in = 21, μ_ex = −21, σ_in = σ_ex = 20, so that the probability of seeing an undesired or missing edge is about 0.147 per node pair.

Table 1 shows the results. We see that the FP algorithm is the fastest one for small graphs, but reaches its limits above 50 nodes. On the other hand, the greedy and layout-based heuristics perform almost as well while requiring significantly less time. The layout-based heuristic is much faster on large components, but first requires a good choice of parameters, as discussed in Section 4.
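A sketch of this generator, under the stated parameters (our code; function and variable names are ours):

```python
import numpy as np

def random_instance(n, mu_in=21.0, mu_ex=-21.0, sigma=20.0, rng=None):
    """Return cluster labels and a symmetric similarity matrix drawn from
    N(mu_in, sigma^2) within clusters and N(mu_ex, sigma^2) across them."""
    rng = rng or np.random.default_rng()
    labels = np.empty(n, dtype=int)
    start, remaining, cluster = 0, n, 0
    while remaining > 0:
        k = rng.integers(1, remaining + 1)   # random cluster size
        labels[start:start + k] = cluster
        start, remaining, cluster = start + k, remaining - k, cluster + 1
    s = rng.normal(mu_ex, sigma, size=(n, n))
    same = labels[:, None] == labels[None, :]
    s[same] = rng.normal(mu_in, sigma, size=same.sum())
    s = np.triu(s, 1)
    return labels, s + s.T                   # symmetric, zero diagonal
```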
Protein similarity graph from the COG dataset. We test the algorithms on the 66 organisms of the COG dataset17 from http://www.ncbi.nlm.nih.gov/COG/, i.e., on the protein sequences from ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva/. We define the similarity score of two proteins as follows: First let s(u → v) := Σ_{H ∈ H(u→v)} [−log_10 E(H)] − 2·(|H(u → v)| − 1). Here H(u → v) denotes the set of high-scoring pairs (HSPs), with E-value better than a fixed cutoff, returned when BLASTing u against v. We subtract a penalty of 2 score points for each HSP beyond the highest-scoring one. We similarly define the score s(v → u) by BLASTing v against u. Finally we define the symmetric similarity score s(u, v) := min{s(u → v), s(v → u)} − T, where we use a threshold of T = 10, corresponding to an E-value of 10^{−10}. The resulting similarity matrix defines a graph of 42563 (trivially transitive) connected components of size 1 and 2, and 8037 larger components, 3964 of which are not transitive; these are the input to our algorithms. Figure 2 shows a histogram of initial component sizes |V|. There are 70 intransitive components with |V| > 200 that are not shown in the histogram, the largest of size 8836.

As all three algorithms perform well on very small components (which could be solved by exhaustive enumeration), we now restrict our attention to the 1243 components with |V| ≥ 20. For each instance and each algorithm, we limit computation time to 48 hours; thus we could find the exact FP solution for 825 of the 1243 components in the allotted time. Figure 3 (left) shows the relative cost of the solutions found by Greedy and Layout in comparison to the optimal one found by FP for these 825 components. Both heuristics work quite well: in 635 out of 825 cases, Greedy returns the optimal solution, and in 811 out of 825 cases, Layout returns the optimal solution. This behavior is relatively independent of the size or complexity of the graph (shown on the x-axis).
[Fig. 2. Initial distribution of component sizes |V| for the complete COG dataset in the range 3 ≤ |V| < 20 (left) and 20 ≤ |V| ≤ 200 (right). Cyan (lower bars): number of non-transitive components. Magenta (upper bars): number of transitive components.]
The solution returned by Layout deviates in only two cases by more than 5% from the optimal solution; this happens in 95 cases with Greedy, whose maximal deviation is about 50% in rare cases. Figure 3 (right) visualizes the running times of the different algorithms against component complexity for all 1243 components. It is evident that the FP algorithm is fastest for small components, but quickly hits a wall for larger ones. Greedy is quickest for medium-sized components, but its running time grows faster with graph complexity than that of Layout, which is the only feasible algorithm for the largest components.
6. DISCUSSION AND CONCLUSION

We have put forward three algorithms for weighted transitive graph projection, or weighted cluster editing, that cover the whole spectrum from an exact fixed-parameter algorithm to pure heuristics. If graphs that arise from “real” data are not far away from transitivity (in contrast to random graphs, which are highly intransitive with high probability according to Moon's result12), we can find the optimal solution to the WTGPP with an FP algorithm in reasonable time for medium-sized components, and close-to-optimal solutions with well-engineered heuristics in guaranteed polynomial time. The FP and the Greedy algorithm complement each other well: the former guarantees the exact solution (and runs quickly for almost transitive graphs); the latter always runs in polynomial time and guarantees an optimal solution for close-to-transitive graphs. The Layout heuristic works very well in practice, but has no provable guarantees.

Our study shows that real protein similarity graphs are indeed close to transitive, and the three algorithms perform quite well on these WTGPP instances. Although not in the scope of this paper, the WTGPP has numerous potential applications to be investigated. Here we merely used the COG dataset as a comparative illustration of the respective capabilities of our three algorithms. Applications naturally arise in delineating gene and protein families11, 17 (which in turn can be used as a preprocessing method for gene cluster discovery15) and in the discovery of structure in protein complexes or of communities in social or biological networks. To further understand and improve the FP algorithm, it is of interest to systematically compare the branching strategy of our FP algorithm with that of a general ILP solver using the cutting plane algorithm of Ref. 10, which so far has not been attempted on large components.
Acknowledgments and availability. Tobias Wittkop is supported by the DFG GK Bioinformatik. Jan Baumbach is supported by the International NRW Graduate School in Bioinformatics and Genome Research. The fixed-parameter algorithm was implemented and engineered by Sebastian Briesemeister. We thank M. Madan Babu for many constructive comments, and Andreas Dress for pointing out the work of Grötschel and Wakabayashi. Supplementary material and source code are available at http://gi.cebitec.uni-bielefeld.de/transitivegraphprojection/.
[Fig. 3. Left: Relative cost differences of the solutions in percent (y-axis) found by Greedy and Layout in comparison to the exact fixed-parameter (FP) algorithm. Only those components whose exact solution could be computed in less than 48 hours are shown. For both Greedy and Layout, in the majority of cases the optimal solution is found. Note that the x-axis, which shows the component complexity (we use |V|·|E|), is logarithmic. Right: Running times of FP, Greedy, and Layout against component complexity. Both axes are logarithmic.]
References
1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389-3402, 1997.
2. P. Damaschke. On the fixed-parameter enumerability of cluster editing. In D. Kratsch, editor, Proc. of International Workshop on Graph Theoretic Concepts in Computer Science (WG 2005), volume 3787 of LNCS, pages 283-294. Springer, 2005.
3. F. Dehne, M. A. Langston, X. Luo, S. Pitre, P. Shaw, and Y. Zhang. The cluster editing problem: Implementations and experiments. In Proc. of International Workshop on Parameterized and Exact Computation (IWPEC 2006), volume 4169 of LNCS, pages 13-24. Springer, 2006.
4. S. Delvaux and L. Horsten. On best transitive approximations to simple graphs. Acta Informatica, 40(9):637-655, 2004.
5. R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999.
6. T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software - Practice and Experience, 21(11):1129-1164, 1991.
7. L. Goldovsky, I. Cases, A. J. Enright, and C. A. Ouzounis. BioLayout(Java): versatile network visualisation of structural and functional relationships. Applied Bioinformatics, 4(1):71-74, 2005.
8. J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Automated generation of search tree algorithms for hard graph modification problems. Algorithmica, 39(4):321-347, 2004.
9. J. Gramm, J. Guo, F. Hüffner, and R. Niedermeier. Graph-modeled data clustering: Exact algorithms for clique generation. Theor. Comput. Syst., 38(4):373-392, 2005.
10. M. Grötschel and Y. Wakabayashi. A cutting plane algorithm for a clustering problem. Mathematical Programming, Series B, 45:59-96, 1989.
11. A. Krause, J. Stoye, and M. Vingron. Large scale hierarchical clustering of protein sequences. BMC Bioinformatics, 6:15, 2005.
12. J. W. Moon. A note on approximating symmetric relations by equivalence classes. SIAM Journal of Applied Mathematics, 14(2):226-227, 1966.
13. R. Niedermeier. Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.
14. R. Niedermeier and P. Rossmanith. A general method to speed up fixed-parameter-tractable algorithms. Inform. Process. Lett., 73:125-129, 2000.
15. S. Rahmann and G. W. Klau. Integer linear programs for discovering approximate gene clusters. In P. Bucher and B. Moret, editors, Proceedings of the 6th Workshop on Algorithms in Bioinformatics (WABI), volume 4175 of LNBI, pages 298-309. Springer, 2006.
16. R. Shamir, R. Sharan, and D. Tsur. Cluster graph modification problems. Discrete Applied Mathematics, 144:173-182, 2004.
17. R. L. Tatusov, N. D. Fedorova, J. D. Jackson, A. R. Jacobs, B. Kiryutin, E. V. Koonin, D. M. Krylov, R. Mazumder, S. L. Mekhedov, A. N. Nikolskaya, B. S. Rao, S. Smirnov, A. V. Sverdlov, S. Vasudevan, Y. I. Wolf, J. J. Yin, and D. A. Natale. The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4:41, 2003.
18. C. T. Zahn Jr. Approximating symmetric relations by equivalence relations. Journal of the Society of Industrial and Applied Mathematics, 12(4):840-847, 1964.
METHODS FOR EFFECTIVE VIRTUAL SCREENING AND SCAFFOLD-HOPPING IN CHEMICAL COMPOUNDS
Nikil Wale* and George Karypis

Department of Computer Science, University of Minnesota, Twin Cities
*Email: [email protected], [email protected]

Ian A. Watson

Eli Lilly and Company, Lilly Research Labs, Indianapolis
Email: [email protected]

Methods that can screen large databases to retrieve a structurally diverse set of compounds with desirable bioactivity properties are critical in the drug discovery and development process. This paper presents a set of such methods, which are designed to find compounds that are structurally different from a certain query compound while retaining its bioactivity properties (scaffold hops). These methods utilize various indirect ways of measuring the similarity between the query and a compound that take into account additional information beyond their structure-based similarities. Two sets of techniques are presented that capture these indirect similarities, using approaches based on automatic relevance feedback and on analyzing the similarity network formed by the query and the database compounds. Experimental evaluation shows that many of these methods substantially outperform previously developed approaches, both in terms of their ability to identify structurally diverse active compounds and active compounds in general.
1. INTRODUCTION

Discovery, design, and development of new drugs is an expensive and challenging process. Any new drug should not only produce the desired response to the disease but should do so with minimal side effects. One of the key steps in the drug design process is the identification of the chemical compounds (hit compounds, or just hits) that display the desired and reproducible activity against the specific biomolecular target23. This represents a significant hurdle in the early stages of drug discovery. A popular approach for finding these hits is to use a compound known to possess some of the desired activity properties as a reference and identify other compounds from a large compound database that have a similar structure. This is nothing more than a ranked retrieval using the reference compound as a query. This approach relies on the well-known fact that compounds sharing key structural features will most likely have similar activity against a biomolecular target. This is referred to as the structure-activity relationship (SAR). The similarity between the compounds is usually computed by first representing their molecular graph as a vector in a particular descriptor-space and then using a variety of vector-based methods to compute their similarity8. However, the task of identifying hit compounds is complicated by the fact that the query might have undesirable properties such as toxicity, bad ADME (absorption, distribution, metabolism, and excretion) properties, or may be promiscuous17, 26. These properties will also be shared by most of the highest-ranked compounds, as they will correspond to very similar structures. In order to overcome this problem, it is important to rank highly as many chemical compounds as possible that not only show the desired activity for the biomolecular target but also have different structures (come from diverse chemical classes or chemotypes). Finding novel chemotypes using the information of already known bioactive small molecules is termed scaffold-hopping17, 32. In this paper we address the problem of scaffold-hopping by developing a set of techniques that measure the similarity between the query and a compound taking into account additional information
beyond their structure-based similarities. These indirect ways of measuring similarity enable the retrieval of compounds that are structurally different from the query but at the same time possess the desired bioactivity properties. We present two sets of techniques to capture such indirect similarities. The first set contains techniques that are based on automatic relevance feedback, whereas the second set derives the indirect similarities by analyzing the similarity network formed by the query and the database compounds. Both sets of techniques operate on the descriptor-space representation of the compounds and are independent of the selected descriptor-space. We experimentally evaluate the performance of these methods using three different descriptor-spaces and six different datasets. Our results show that most of these methods are quite effective in improving the scaffold-hopping performance over standard ranked-retrieval. Among them, the methods based on the similarity network perform the best and substantially outperform previously developed scaffold-hopping schemes. Moreover, even though these methods were created to improve the scaffold-hopping performance, our results show that many of them are quite effective in improving the ranked-retrieval performance as well.

The rest of the paper is organized as follows. Section 2 describes the problems addressed in this paper. Section 3 introduces the definitions and notations used in this paper. Section 4 introduces the various descriptor-spaces for this problem. Section 5 describes the methods developed in this paper. Section 6 gives an overview of the related work in this field. Section 7 describes the materials used in our experimental methodology. Section 8 compares and discusses the results obtained. Finally, Section 8.2 summarizes the results of this paper.
2. PROBLEM STATEMENT AND MOTIVATION

The ranked-retrieval and the scaffold-hopping problems that we consider in this paper are defined as follows:

Definition 2.1 (Ranked-Retrieval Problem) Given
a query compound, rank the compounds in the database based on how similar they are to the query in terms of their bioactivity.
Definition 2.2 (Scaffold-Hopping Problem) Given
a query compound and a parameter k, retrieve the k compounds that are similar to the query in terms of their bioactivity but whose structure is as dissimilar as possible to that of the query.

The solution to the ranked-retrieval problem relies on the well-known fact that the chemical structure of a compound relates to its activity (SAR). As such, effective solutions can be devised that rank the compounds in the database based on how structurally similar they are to the query. However, for scaffold-hopping, the compounds retrieved must be structurally sufficiently similar to possess similar bioactivity but at the same time must be structurally dissimilar enough to be a novel chemotype. This is a much harder problem than simple ranked-retrieval, as it has the additional constraint of maximizing dissimilarity, which runs counter to SAR. Methods that have the ability to rank higher the compounds that are structurally different (different chemotypes) have advantages over methods that do not. They improve the odds of finding a compound that is not only active for a biomolecular target but also has all the other desired properties (non-toxicity, good ADME properties, target specificity, etc.8, 17) that the reference structure and compounds with similar structures might not possess. One such compound is then more likely to become a true drug candidate. Furthermore, scaffold-hopping is also important from the point of view of un-patented chemical space. Many important lead compounds and drug candidates have already been patented. In order to find new therapies and offer alternative treatments, it is important for a pharmaceutical company to discover novel leads away from the existing patented chemical space. Methods that perform scaffold-hopping can achieve those objectives.
3. DEFINITIONS AND NOTATIONS

Throughout the paper we will use D to denote a database of chemical compounds, q to denote a query compound, and c to denote a chemical compound present in the database. Given two compounds c_i and c_j, we will use sim(c_i, c_j) to denote their (direct) similarity, which is computed with respect to their descriptor-space representation by a suitable similarity measure. Given a compound c_i and a set of compounds A, we will use sim(c_i, A) to denote the average pairwise similarity between c_i and all the compounds in A. Given a query compound q, a database D, and a parameter k, we define top-k to be the k compounds in D that are most similar to q. Given a compound c, a set of compounds A, and a similarity measure, its k-nearest-neighbor list contains the k compounds in A that are most similar to c.

Finally, throughout the paper we will refer to the task of retrieving active compounds as ranked-retrieval and the task of retrieving scaffold-hops as scaffold-hopping.

4. DESCRIPTOR SPACES FOR RANKED-RETRIEVAL
The similarity between chemical compounds is usually computed by first transforming them into a suitable descriptor-space representation 8 , ’. A number of different approaches have been developed to represent each compound by a set of descriptors. These descriptors can be based on physiochemical properties as well as topological and geometric substructures (fragments) 31, 1, 3 , 12, 2 5 , 18,29 In this study we use three descriptor-spaces that have been shown to be very effective in the context of ranked-retrieval and/or scaffold-hopping. These descriptor-spaces are the graph fragments (GF) 29, extended connectivity fingerprints (ECFP) 25, ’*, and the extended reduced graph (ErG) descriptors 2 7 . GF is a 2D topology-based descriptor-space 29 that is based on all the graph fragments of a molecular graph up to a predefined size. ECFP is also a 2D topological descriptor-space and many flavors of these descriptors have been described by several authors 18. The idea behind this descriptor-space is to capture the topology around each atom in the form of shells whose radius (number of bonds) ranges from 1 to I , where 1 is a user defined parameter. We use the ECZ3 variation of ECFP in which each atom is assigned a label corresponding to its atomic number and the maximum shell radius is set to three. Both extended connectivity fingerprints (ECFP) and GF have been shown to be highly effective for the ranked-retrieval of chemical compounds l8, ”. 251
Extended reduced graph descriptors (ErG) is a pharmacophoric descriptor-space. A pharmacophore is defined as a critical 3D or 2D arrangement of molecular fragments forming a necessary but not sufficient condition for biological activity. The descriptors that rely only on 2D information are called 2D pharmacophoric descriptors, whereas descriptors that utilize 3D information are called 3D pharmacophoric descriptors. ErG is a 2D pharmacophoric descriptor-space that combines the reduced graphs 15, 14 and binding property pairs 22 to generate a pharmacophoric descriptor-space. A detailed description of the generation of these pharmacophoric descriptors can be found in 27.
5. METHODS
In order to improve the scaffold-hopping performance we developed a set of techniques that measure the similarity between the query and a compound by taking into account additional information beyond their descriptor-space-based representation. These methods are motivated by the observation that if a query compound q is structurally similar to a database compound c_i, and c_i is structurally similar to another database compound c_j, then q and c_j could be considered as being similar or related even though they may have zero or very low direct similarity. This indirect way of measuring similarity can enable the retrieval of compounds that are structurally different from the query but at the same time, due to associativity, possess the same bioactivity properties as the query. We developed two sets of techniques to capture such indirect similarities that were inspired by research in the fields of information retrieval and social network analysis. The first set contains techniques that use various forms of automatic relevance feedback to identify a set of compounds to be used for creating an indirect similarity measure, whereas the second set derives the indirect similarities by analyzing the network formed by a k-nearest-neighbor graph representation of the query and the database compounds. Both of these sets of techniques operate on the descriptor-space representation of the compounds and are independent of the selected descriptor-space.
5.1. Relevance-Feedback-based Methods

5.1.1. Top-k Weighting

This approach, which is based on the Rocchio 24 scheme for automatic relevance feedback, first retrieves the top-k compounds for a given query q and then uses these compounds to derive an indirect similarity between q and each of the compounds in the database. Specifically, if A is the initial set of top-k compounds, the new similarity, sim_A(q, c), between q and a compound c is given by

    sim_A(q, c) = α · sim(q, c) + (1 − α) · sim(c, A),    (1)

where 0 ≤ α ≤ 1 is a user-specified parameter that controls the degree to which the new similarity is affected by the compounds in A. We will refer to this method as TOPKAVG. The motivation behind this approach is that for reasonably small values of k, the set A will contain a relatively large number of active compounds. Thus, by modifying the similarity between q and a compound c to also include how similar c is to the compounds in A, we obtain a similarity measure that is reinforced by A's active compounds. This enables the retrieval of active compounds that are similar to the compounds present in A even if their similarity to the query is not very high, thus enabling scaffold-hopping.
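A minimal sketch of TOPKAVG under these definitions, reusing the top_k and sim_to_set helpers from the sketch in Section 3; the function name and the default α = 0.5 (the value used in Section 7.3) are our choices, not part of the original description.

    def topk_avg_rank(sim, q, D, k, alpha=0.5):
        """Rank database D by Equation (1):
        sim_A(q, c) = alpha * sim(q, c) + (1 - alpha) * sim(c, A),
        where A is the top-k set for the query q."""
        A = top_k(sim, q, D, k)
        def new_sim(c):
            return alpha * sim(q, c) + (1 - alpha) * sim_to_set(sim, c, A)
        return sorted(D, key=new_sim, reverse=True)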
5.1.2. Cluster Weighting

This method is similar in spirit to TOPKAVG, but employs a clustering-based approach to identify the set of compounds to use for automatic relevance feedback. We will refer to this scheme as CLUSTWT; it consists of the following four steps. First, it finds the top-k most similar compounds to a query q. Second, it clusters these compounds into l = k/m sets {S_1, ..., S_l}, each of size m (assuming that k is a multiple of m). Third, it selects among these sets the set S* that has the highest similarity to the query. Fourth, it uses Equation 1 to re-rank all the compounds in the database using S* as the relevance feedback set (i.e., A = S*). The clustering is computed using a fixed-capacity heuristic min-cut partitioning algorithm on the complete weighted graph whose nodes are the k compounds and whose edge-weights are the similarities between them 21, 20. Consequently, the inter-cluster compound-to-compound similarities are explicitly minimized, leading to clusters in which the intra-cluster similarities are implicitly maximized (i.e., each cluster will end up containing similar compounds). By using for relevance feedback the set S*, which contains compounds that are most similar to the query, CLUSTWT selects the cluster that will most likely have a large number of active compounds. This is similar in spirit to the method that TOPKAVG uses to select its own relevance feedback set A. However, since S* contains compounds that are also very similar to each other, the number of active compounds that it contains will tend to be higher than that contained in A (assuming that both A and S* have the same size). In fact, S* has already incorporated some form of automatic relevance feedback, since all pairwise similarities between its compounds were taken into account during the clustering process. The fact that objects that are relevant to a query tend to cluster together is well known within the document retrieval community and is usually referred to as the clustering hypothesis 16.
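The following sketch walks through the four CLUSTWT steps. The paper clusters the top-k compounds with a fixed-capacity min-cut partitioner (hMETIS, see Section 7.3); since that component is external, a trivial chunking of the similarity-sorted top-k list stands in for it here, purely to keep the sketch self-contained.

    def clust_wt_rank(sim, q, D, k=200, m=25, alpha=0.5):
        """Sketch of CLUSTWT; the chunking in step 2 is a stand-in for
        the fixed-capacity min-cut partitioning used in the paper."""
        A = top_k(sim, q, D, k)                           # step 1: top-k set
        clusters = [A[i:i + m] for i in range(0, k, m)]   # step 2: l = k/m clusters (placeholder)
        S_star = max(clusters, key=lambda S: sim_to_set(sim, q, S))  # step 3: best cluster
        def new_sim(c):                                   # step 4: Equation (1) with A = S*
            return alpha * sim(q, c) + (1 - alpha) * sim_to_set(sim, c, S_star)
        return sorted(D, key=new_sim, reverse=True)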
5.1.3. Sum-based Search

The performance of TOPKAVG and CLUSTWT depends on selecting a reasonable value for the size of the set used to provide automatic relevance feedback. If that set is too small, it may not incorporate a sufficiently large number of active compounds and thus lead to limited (if any) performance improvements, whereas if the set is too large, it may degrade the performance by incorporating a relatively large number of inactive compounds. Unfortunately, our initial experiments showed that the right size of the relevance feedback set is dataset dependent. Motivated by this observation we developed a scheme for automatic relevance feedback which, instead of using a fixed number of compounds, does so in a progressive fashion. Specifically, if A is the set of compounds that have been retrieved thus far, then the compound selected next, c_next, is the one that has the highest average similarity to the set A ∪ {q}. That is,

    c_next = argmax_{c_i ∈ D−A} { sim(c_i, A ∪ {q}) }.    (2)

This compound is added to A and the overall process is repeated until the desired number of compounds is retrieved or all the compounds in D have been ranked. Thus, in this scheme, as soon as a compound is retrieved it is used to expand the set of compounds used to provide relevance feedback. We will refer to this method as BESTSUMDESCSIM.
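A sketch of the progressive sum-based search of Equation (2); the retrieval loop and names are ours, and sim_to_set is the average-similarity helper from the earlier sketch.

    def best_sum_desc_sim(sim, q, D, n_retrieve):
        """Progressive relevance feedback (Equation 2): repeatedly
        retrieve the compound with the highest average similarity
        to A ∪ {q}; each retrieved compound immediately expands A."""
        A, ranking = [], []
        remaining = list(D)
        while remaining and len(ranking) < n_retrieve:
            c_next = max(remaining,
                         key=lambda c: sim_to_set(sim, c, A + [q]))
            remaining.remove(c_next)
            A.append(c_next)
            ranking.append(c_next)
        return ranking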
5.1.4. Max-based Search

A common characteristic of the three schemes described so far is that the final ranking of each compound is computed by taking into account all the similarities between the compound and the compounds in the relevance feedback set. Since the compounds in the relevance feedback set will tend to be structurally similar to the query compound (with CLUSTWT potentially being an exception), this approach is rather conservative in its attempt to identify active compounds that are structurally different from the query (i.e., scaffold-hops). To overcome this problem, we developed a best-search scheme that is based on the BESTSUMDESCSIM approach but, instead of selecting the next compound based on its average similarity to A ∪ {q}, it selects the compound that is the most similar to one of the compounds in A ∪ {q}. That is, the next compound is given by

    c_next = argmax_{c_i ∈ D−A} { max_{c_j ∈ A ∪ {q}} sim(c_i, c_j) }.    (3)

In this approach, if a compound c_j other than q has the highest similarity to some compound c_i in the database, c_i is chosen as c_next and added to A irrespective of its similarity to q. Thus, the query-to-compound similarity is not necessarily included in every iteration as in the other schemes, allowing BESTMAXDESCSIM to identify compounds that are structurally different from the query. We will refer to this scheme as BESTMAXDESCSIM.
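The max-based variant differs from the previous sketch only in the selection criterion, replacing the average similarity to A ∪ {q} with the maximum over its members (Equation 3):

    def best_max_desc_sim(sim, q, D, n_retrieve):
        """Max-based search (Equation 3): the next compound is the one
        most similar to ANY single member of A ∪ {q}, so the
        query-to-compound similarity need not dominate every step."""
        A, ranking = [], []
        remaining = list(D)
        while remaining and len(ranking) < n_retrieve:
            c_next = max(remaining,
                         key=lambda c: max(sim(c, cj) for cj in A + [q]))
            remaining.remove(c_next)
            A.append(c_next)
            ranking.append(c_next)
        return ranking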
5.2. Nearest-Neighbor Graph-based Methods

These methods, motivated by the field of social (relational) network analysis, determine the similarity between a pair of compounds by taking into account any other compounds that are very similar to either or both of them. Thus, the similarity depends on the structure of the network formed by all highly similar pairs of compounds. The network linking the database compounds with each other and with the query is determined
by using a k-nearest-neighbor (NG) and a k-mutual-nearest-neighbor (MG) graph. Both of these graphs contain a node for each of the compounds as well as a node for the query. However, they differ in the set of edges that they contain. In the k-nearest-neighbor graph there is an edge between a pair of nodes corresponding to compounds c_i and c_j if c_i is in the k-nearest-neighbor list of c_j or vice versa. In the k-mutual-nearest-neighbor graph, an edge exists only when c_i is in the k-nearest-neighbor list of c_j and c_j is in the k-nearest-neighbor list of c_i. As a result of these definitions, each node in NG will be connected to at least k other nodes (assuming that each compound has a non-zero similarity to at least k other compounds), whereas in MG, each node will be connected to at most k other nodes. Since the neighbors of each compound in these graphs correspond to some of its most structurally similar compounds and due to the relation between structure and activity, each pair of adjacent compounds will tend to have similar activity. Thus, these graphs can be considered as the network structures for capturing bioactivity relations. A number of different approaches have been developed for determining the similarity between nodes in social networks that take into account various topological characteristics of the underlying graphs 13. In our work, we determine the similarity between a pair of nodes as a function of the intersection of their adjacency lists, which takes into account all two-edge paths connecting these nodes. Specifically, the similarity between c_i and c_j with respect to graph G is given by 28

    sim_G(c_i, c_j) = |adj_G(c_i) ∩ adj_G(c_j)| / |adj_G(c_i) ∪ adj_G(c_j)|,    (4)
where adj_G(c_i) and adj_G(c_j) are the adjacency lists of c_i and c_j in G, respectively. This measure assigns a high similarity value to a pair of compounds if both are very similar to a large set of common compounds. Since a pair of active compounds will be more similar to other active compounds than an active-inactive pair, their similarity according to Equation 4 will be high. Also, since Equation 4 can potentially assign a high similarity value to a pair of compounds even if their direct similarity is very low (as long as they have a large number of common neighbors), it facilitates scaffold-hopping.
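A sketch of the graph construction and of the indirect similarity; the Tanimoto-style normalization of the adjacency-list overlap follows our reconstruction of Equation (4), compounds are assumed to be hashable objects, and k_nearest_neighbors is the helper from the sketch in Section 3.

    def build_graphs(sim, compounds, k):
        """Adjacency sets of the k-nearest-neighbor (NG) and
        k-mutual-nearest-neighbor (MG) graphs."""
        knn = {c: set(k_nearest_neighbors(sim, c, compounds, k))
               for c in compounds}
        NG = {c: {c2 for c2 in compounds if c2 is not c and
                  (c2 in knn[c] or c in knn[c2])} for c in compounds}
        MG = {c: {c2 for c2 in compounds if c2 is not c and
                  c2 in knn[c] and c in knn[c2]} for c in compounds}
        return NG, MG

    def sim_graph(G, ci, cj):
        """Equation (4): indirect similarity from the overlap of the
        adjacency lists of ci and cj in graph G."""
        inter = len(G[ci] & G[cj])
        union = len(G[ci] | G[cj])
        return inter / union if union else 0.0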
For each of the NG and MG graphs we developed two retrieval schemes that use Equation 4 as the similarity measure in the sum- and max-based search strategies presented in Equations 2 and 3. For example, in the case of the NG graph and the sum-based search strategy, the next compound c_next to be retrieved is given by

    c_next = argmax_{c_i ∈ D−A} { sim_NG(c_i, A ∪ {q}) },    (5)

where sim_NG(c_i, A ∪ {q}) is the average pairwise similarity between c_i and the compounds in A ∪ {q} computed using Equation 4 for the NG graph. The equations for the other schemes are derived in a similar fashion. We will refer to these four schemes as BESTSUMNG, BESTMAXNG, BESTSUMMG, and BESTMAXMG, respectively.

6. RELATED WORK

Many methods have been proposed for ranked-retrieval and scaffold-hopping. These can be divided into two groups. The first contains methods that rely on better designed descriptor-space representations, whereas the second contains methods that are not specific to any descriptor-space representation but utilize different search strategies to improve the overall performance. Among the first set of methods, 2D descriptors such as path-based fingerprints 4, dictionary-based keys 3 and, more recently, extended connectivity fingerprints (ECFP) 18 and graph fragments (GF) 29 have all been successfully applied to the retrieval problem. Pharmacophore-based descriptors such as ErG 27 have been shown to outperform simple 2D topology-based descriptors for scaffold-hopping 33. Lastly, descriptors based on the 3D structure or conformations of the molecule have also been applied successfully for scaffold-hopping 33, 26. The second set of methods includes the turbo search schemes (TURBOSUMFUSION and TURBOMAXFUSION) 17 and the structural unit analysis based techniques 32, all of which utilize relevance feedback ideas. These have been shown to be effective for both scaffold-hopping and ranked-retrieval. The turbo search techniques operate as follows. Given a query q, they start by retrieving the top-k compounds from the database. Let A be the (k + 1)-size set that contains q and the top-k compounds. For each compound c ∈ A, all the compounds in the database are ranked in decreasing order based on their similarity to c, leading to k + 1 ranked lists. These lists are used to obtain the final similarity of each compound with respect to the initial query. In particular, in TURBOMAXFUSION, the similarity between q and a compound c is equal to the similarity corresponding to the maximum ranking of c in the k + 1 lists, whereas in TURBOSUMFUSION, the similarity is equal to the sum of all the similarities in these rankings. Similar methods based on consensus scoring, rank averaging, and voting have been investigated in 33. The TURBOSUMFUSION approach is similar to that of TOPKAVG described in Section 5.1.1 as it utilizes a relevance feedback mechanism to re-rank a database with respect to a query. However, the TURBOSUMFUSION approach treats every compound in the top-k set as equally important along with the query, whereas in TOPKAVG, each compound in A is given a weight of (1 − α)/|A| relative to the weight α of q.
7. MATERIALS
7.1. Datasets

We used datasets that contain compounds that bind to six different biomolecular targets: COX2 (cyclooxygenase 2), CDK2 (cyclin-dependent kinase 2), FXa (coagulation factor Xa), PDE5 (phosphodiesterase 5), A1A (alpha-1A adrenoceptor), and MAO (monoamine oxidase). Each of these sets represents a different activity class. The datasets for the first five targets are obtained from 5, 19. The entire set consists of 2142 compounds and there are 50 active compounds for each one of the targets (250 in total). The rest of the compounds are "decoys" (inactive) obtained from the National Cancer Institute diversity set. For each target, we constructed a dataset that contains its 50 active compounds and all the decoys. These datasets are termed COX2, CDK2, PDE5, FXa and A1A. The dataset of the sixth target was derived from 11, 29 and, after removing compounds with impossible Kekule forms and valence errors, it contains 1458 compounds. The compounds in this dataset have been categorized into four different classes, 0, 1, 2, and 3, based on their levels of activity, with 0 indicating no activity. For our experiments we treat all the compounds that have a non-zero activity level (268 compounds) as active.
7.2. Definition of Scaffold-Hopping Compounds
The molecular scaffold is a widely cited concept and is used to evaluate the performance of a method with respect to its scaffold-hopping ability. However, the definition of a scaffold-hop is highly subjective, with numerous papers using different criteria to define what constitutes a scaffold-hop 17, 32, 10. In this paper we use an objective way of defining which compounds can be considered as scaffold-hops by using an approach that directly relies on the scaffold-hopping problem definition (Section 3). In particular, for a given query q, the active compounds are ranked based on their structural similarity to q, and the lowest 50% of them are defined to be the scaffold-hops for q. Thus, this approach identifies a set of scaffold-hopping compounds that are specific to each query and represent the 50% most dissimilar active compounds to the query. We use the 2048-bit path-based fingerprint generated by ChemAxon's screen program 4 for measuring the structural similarity between a query and an active compound. These fingerprints are well-designed to capture the structural similarity between two compounds 27, 33.
7.3. Experimental Methodology
All the experiments were performed on dual core AMD Opterons with 4 GB of memory. We used the descriptor-spaces GF, ECZ3, and ErG (described in Section 4) for evaluating the methods introduced in this paper. Each method is tested against six datasets (Section 7.1) using three different descriptor-spaces (Section 4), leading to a total of 18 different combinations of datasets and descriptor-spaces. We will refer to them as 18 different problems. We use the Tanimoto similarity 8, 31 for all direct similarity calculations. The Tanimoto similarity function is given by 30
    sim(c_i, c_j) = (Σ_{k=1}^{n} c_ik · c_jk) / (Σ_{k=1}^{n} c_ik² + Σ_{k=1}^{n} c_jk² − Σ_{k=1}^{n} c_ik · c_jk),    (6)

where c_ik and c_jk are the values for the kth dimension in the n-dimensional descriptor-space representation of the compounds c_i and c_j, respectively. This similarity function was selected because it has been shown to be an effective way of measuring the similarity between chemical compounds 30, 31 for ranked-retrieval and is the most widely-used similarity function in cheminformatics.

For each dataset we used each of its active compounds as a query and evaluated the extent to which the various methods lead to effective retrieval of the other active compounds and scaffold-hops. For CLUSTWT we used hMETIS 21, 20 to perform the clustering into fixed-size clusters. We varied the parameter values for the methods described in Section 5 and obtained results by averaging over four different sets of values. For TOPKAVG, which depends on the number of compounds k used in relevance feedback, we used k = 5, 10, 15, and 20. For CLUSTWT, which depends on the cluster size m and the number of compounds k on which the clustering was performed, we used m = 25 and 40 and k = 200 and 400. For CLUSTWT and TOPKAVG, which have α as a parameter, we use a value of 0.5. These parameter values were selected because they gave the best results in our experiments. For the nearest-neighbor methods, which depend on the number of neighbors, we used k = 4, 6, 8, and 10 for the BESTSUMNG and BESTMAXNG schemes, and k = 12, 16, 20, and 24 for the BESTSUMMG and BESTMAXMG schemes. These values were chosen because they gave good results. Moreover, for NG a value of k less than 4 leads to graphs with many connected components, whereas for MG this threshold is 12. Hence, we decided not to use values below these thresholds. Note that the threshold for NG is less than that of MG because the criterion for an edge to exist between two nodes of the neighborhood graph is stricter for MG as opposed to NG (Section 5.2). We also compared our schemes against TURBOMAXFUSION and TURBOSUMFUSION 17. For both of these methods, we used k = 5, 10, 15, and 20. These values gave the best results, and the results degraded as k was further increased.

7.4. Standard Retrieval

For each problem, we obtain a baseline performance by ranking all the compounds with respect to each active compound using the Tanimoto similarity 8, 30, 31. We call this Standard Retrieval and denote it by STDRET.
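For completeness, Equation (6) and the STDRET baseline in code form; the vector representation (one count per descriptor-space dimension) is assumed, and the function names are ours.

    def tanimoto(ci, cj):
        """Equation (6) for descriptor-count vectors ci, cj of equal length."""
        dot = sum(a * b for a, b in zip(ci, cj))
        denom = sum(a * a for a in ci) + sum(b * b for b in cj) - dot
        return dot / denom if denom else 0.0

    def standard_retrieval(q, D):
        """STDRET baseline: rank D by direct Tanimoto similarity to q."""
        return sorted(D, key=lambda c: tanimoto(q, c), reverse=True)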
7.5. Performance Assessment Measures
We measure ranked-retrieval and scaffold-hopping performance using uninterpolated precision 16. This is calculated as follows. For each active compound that appears in the top 50 retrieved compounds we compute the precision value. For ranked-retrieval this is defined as the ratio of the number of actives retrieved over the number of compounds retrieved thus far. For scaffold-hopping it is defined as the number of scaffold-hops retrieved over the number of compounds retrieved thus far. For both ranked-retrieval and scaffold-hopping we sum all the precision values and normalize them by dividing by 50. This is called the total uninterpolated precision for a query. Similar values are obtained for all the queries of a dataset, and the total uninterpolated precision is the average of all these values. Note that the total uninterpolated precision captures the number of active compounds (scaffold-hops) for each query as well as the position (rank) information of the actives (scaffold-hops).

To compare the ranked-retrieval or scaffold-hopping performance of two methods, we evaluate their relative performance over all the 18 problems. This is achieved as follows. Let r_i and q_i represent the ranked-retrieval or scaffold-hopping performance achieved by methods r and q on the ith problem, respectively. We calculate the log-ratio, log_2(r_i / q_i), for every problem and take the average of these values. We call this quantity the Average Relative Performance or ARP of r with respect to q. On average, if the ARP is less than zero, r performs worse than q, whereas if the ARP is greater than zero, r performs better than q. Note that the reason we use log-ratios as opposed to simple ratios is that the distribution of the ratio of two random variables is not symmetric, whereas their log-ratios are approximately normally distributed. This allows us to compute their average and compare the methods in an unbiased way. We also assess whether the ARP for a given pair of methods is statistically significant using the student's t-test 7, which is well-suited to assess the statistical significance of a sample of values drawn from a normal distribution. The null hypothesis being tested here is that the log-ratios are centered around a mean of zero.
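A sketch of both performance measures; the function names are ours. The base of the logarithm is not stated explicitly in the text; base 2 is assumed here because it is consistent with the percentage improvements quoted in Section 8.

    import math

    def total_uninterpolated_precision(ranking, relevant, cutoff=50):
        """Sum of the precision values at each relevant item (active or
        scaffold-hop) within the top `cutoff` positions, divided by
        `cutoff`."""
        total, hits = 0.0, 0
        for i, c in enumerate(ranking[:cutoff], start=1):
            if c in relevant:
                hits += 1
                total += hits / i
        return total / cutoff

    def average_relative_performance(r_vals, q_vals):
        """ARP of method r w.r.t. method q: mean of log2(r_i / q_i)
        over the 18 dataset/descriptor-space problems."""
        return sum(math.log2(r / q)
                   for r, q in zip(r_vals, q_vals)) / len(r_vals)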
8. RESULTS

8.1. Overall Performance Assessment

Tables 1 and 2 compare the performance of all the methods in a pairwise fashion for scaffold-hopping and ranked-retrieval, respectively. In each of these tables we present two statistics. The first is the ARP of the row method (r) with respect to the column method (q) as described in Section 7.5. The second statistic, shown immediately below the ARP value in parenthesis, is its p-value obtained from the student's t-test. Note that for the remainder of this section we will define the ARP of two methods to be statistically significant if p ≤ 0.01. The rest of this section highlights some of the key observations that can be made by analyzing the results in these tables.

8.1.1. Performance of Relevance Feedback Methods
Comparing the performance of the four relevance-feedback-based methods described in Section 5.1 against STDRET, we see that all of them lead to better scaffold-hopping results. Among them, the results achieved by CLUSTWT and BESTSUMDESCSIM are 63% and 94% better than STDRET, respectively, and these improvements are statistically significant. However, all four of these methods achieve somewhat worse ranked-retrieval performance (3% to 15%). Moreover, these differences are statistically significant for BESTSUMDESCSIM and BESTMAXDESCSIM. Comparing the four methods against TURBOSUMFUSION and TURBOMAXFUSION, we observe that the relative performance of most of these methods varies, with some methods doing better for scaffold-hopping and others doing better for ranked-retrieval. However, with the exception of TOPKAVG, which is statistically better than the two fusion-based schemes for ranked-retrieval, all other differences are not statistically significant. Comparing the four relevance-feedback-based methods against each other, we see that most of them perform the same for both scaffold-hopping and ranked-retrieval, and whatever differences exist are not statistically significant. Despite this, the average performance of BESTSUMDESCSIM is better than BESTMAXDESCSIM, indicating that the sum-based search strategy leads to better results. The
results also show that CLUSTWT is better than TOPKAVG for scaffold-hopping and that this difference is statistically significant.
8.1.2. Performance of Nearest-Neighbor Graph-Based Methods
Comparing the performance of the nearest-neighbor methods, we observe that all of these schemes show good performance for scaffold-hopping as well as ranked-retrieval. Among them, the best performing method is BESTSUMNG. It achieves the best balance between ranked-retrieval and scaffold-hopping performance. Furthermore, similar to the relevance-feedback-based methods, the sum-based search methods outperform the corresponding max-based methods. However, these differences are not statistically significant. The results also show that the nearest-neighbor methods perform significantly better than all the other methods for scaffold-hopping, and most of these differences are statistically significant (BESTSUMDESCSIM and BESTMAXDESCSIM are the two exceptions). In particular, the performance of the nearest-neighbor methods is 62% to 300% better than STDRET and the fusion-based methods and 46% to 244% better than the relevance-feedback-based methods. The nearest-neighbor methods also achieve better performance than all of the methods for ranked-retrieval, although most of these differences are not statistically significant. BESTSUMNG is a clear exception, as its ranked-retrieval performance is also significantly and statistically better than all the other non-graph-based techniques. For example, compared to the fusion-based techniques, its ranked-retrieval performance is 62% to 209% better.

8.2. Performance of Descriptor-Spaces and Datasets

Our discussion so far focused on evaluating the average performance of the different methods across the various descriptor-space representations and datasets. In this section we analyze the performance of the methods on the individual descriptor-spaces and datasets. We limit our evaluation to only the CLUSTWT and BESTSUMNG methods, as these methods achieve the best scaffold-hopping and ranked-retrieval performance among the relevance-feedback- and graph-based methods, respectively. The results of these evaluations are shown in Figures 1 and 2, which compare the performance of STDRET against CLUSTWT and BESTSUMNG, respectively. In these figures, the left y-axis represents uninterpolated precision values for ranked-retrieval, whereas the right y-axis represents uninterpolated precision values for scaffold-hopping. For CLUSTWT and BESTSUMNG we also show error bars that correspond to the standard deviation of the results obtained for the four sets of parameter values used for these schemes. These results show that for scaffold-hopping, CLUSTWT outperforms STDRET in most dataset and descriptor-space combinations. However, the actual performance gains are dataset and descriptor-space dependent. For example, CLUSTWT achieves significant gains on the A1A and FXa datasets for the ErG and ECZ3 descriptor-spaces, whereas the gains for the other datasets and/or descriptor-spaces are not as dramatic. In terms of ranked-retrieval performance, these results show that in the case of the GF descriptor-space, CLUSTWT performs consistently better than STDRET across all datasets. However, CLUSTWT's ranked-retrieval performance for the other two descriptor-spaces is somewhat mixed. Finally, the results in Figure 2 show that for scaffold-hopping, BESTSUMNG performs consistently better than STDRET for all the descriptor-space and dataset combinations. However, similarly to CLUSTWT, the actual gains are dataset and descriptor-space dependent. For example, the gains are particularly high for the FXa, A1A, and COX2 datasets and for the ErG descriptor-space. Similar trends can be observed with the ranked-retrieval results, with BESTSUMNG outperforming STDRET. Moreover, the performance gains achieved on some problems by BESTSUMNG are usually much higher than the performance degradations on others.
9. CONCLUSION
In this paper we introduced a number of methods based on relevance feedback and social (relational) network analysis to improve scaffold-hopping and ranked-retrieval. Our results showed that among these methods, the ones based on social network analysis consistently and substantially outperform the standard retrieval as well as previously introduced methods for these problems.
Table 1: Performance for Scaffold-Hopping.

Column order: STDRET, TURBOSUMFUSION, TURBOMAXFUSION, TOPKAVG, CLUSTWT, BESTSUMDESCSIM, BESTMAXDESCSIM, BESTSUMNG, BESTMAXNG, BESTSUMMG.

TURBOSUMFUSION:  0.44 (0.031)
TURBOMAXFUSION:  0.82 (0.006), 0.38 (0.073)
TOPKAVG:         0.31 (0.127), -0.13 (0.024), -0.51 (0.013)
CLUSTWT:         0.71 (0.007), 0.26 (0.029), -0.11 (0.467), 0.40 (0.001)
BESTSUMDESCSIM:  0.96 (0.002), 0.52 (0.068), 0.14 (0.547), 0.65 (0.032), 0.25 (0.316)
BESTMAXDESCSIM:  0.89 (0.024), 0.44 (0.298), 0.07 (0.835), 0.57 (0.177), 0.18 (0.645), -0.07 (0.754)
BESTSUMNG:       1.51 (0.000), 1.07 (0.000), 0.69 (0.002), 1.20 (0.000), 0.80 (0.000), 0.55 (0.038), 0.62 (0.109)
BESTMAXNG:       1.52 (0.000), 1.07 (0.000), 0.70 (0.005), 1.20 (0.000), 0.81 (0.000), 0.56 (0.064), 0.63 (0.140), 0.01 (0.947)
BESTSUMMG:       1.60 (0.000), 1.16 (0.000), 0.78 (0.001), 1.29 (0.000), 0.90 (0.000), 0.65 (0.039), 0.72 (0.053), 0.10 (0.577), 0.09 (0.620)
BESTMAXMG:       1.59 (0.000), 1.15 (0.000), 0.77 (0.000), 1.28 (0.000), 0.88 (0.000), 0.63 (0.038), 0.70 (0.071), 0.08 (0.579), 0.08 (0.614), -0.01 (0.886)

Each entry is the average of the log ratios of the uninterpolated precision of the row method to the column method over the 18 problems; the number in parenthesis is the p-value obtained from the student's t-test for that entry. Only the lower triangle is shown; the ARP of the column method with respect to the row method is the negation of the tabulated value, with the same p-value.
"
mStdRet (Hits) EClustWt (Hits)
0 StdRet (Scanolds) 0.50 -
PCIuStWt (Scaffolds)
,E 0.40 -
--
0 10
- 0.08
,5 H z"E
.o 0
t
P ~
m 0 06
030
'g 0
1
v
' 0
0.20
0.04
0 to
0.02
0 00
0 00
ErG
ECZ3
2
GF
Fig. 1.: STDRETversus CLUSTWT. ACKNOWLEDGEMENTS This work was supported by NSF EIA-9986042, ACI0133464,IIS-0431135, NIH R L M O O ~ ~the ~~A,
High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota.
Table 2: Performance for Ranked-Retrieval.

Column order: STDRET, TURBOSUMFUSION, TURBOMAXFUSION, TOPKAVG, CLUSTWT, BESTSUMDESCSIM, BESTMAXDESCSIM, BESTSUMNG, BESTMAXNG, BESTSUMMG.

TURBOSUMFUSION:  -0.14 (0.019)
TURBOMAXFUSION:  -0.21 (0.002), -0.07 (0.156)
TOPKAVG:         -0.04 (0.332), 0.10 (0.001), 0.17 (0.001)
CLUSTWT:         -0.06 (0.415), 0.08 (0.113), 0.15 (0.101), -0.02 (0.725)
BESTSUMDESCSIM:  -0.17 (0.009), -0.03 (0.502), 0.04 (0.426), -0.13 (0.017), -0.11 (0.168)
BESTMAXDESCSIM:  -0.27 (0.002), -0.13 (0.137), -0.06 (0.419), -0.23 (0.016), -0.21 (0.071), -0.10 (0.121)
BESTSUMNG:        0.25 (0.015), 0.39 (0.001), 0.46 (0.001), 0.29 (0.004), 0.31 (0.009), 0.42 (0.001), 0.52 (0.001)
BESTMAXNG:        0.12 (0.179), 0.26 (0.003), 0.33 (0.002), 0.16 (0.054), 0.18 (0.027), 0.29 (0.004), 0.39 (0.002), -0.13 (0.148)
BESTSUMMG:        0.18 (0.151), 0.32 (0.016), 0.39 (0.013), 0.22 (0.080), 0.24 (0.047), 0.35 (0.021), 0.45 (0.008), -0.07 (0.519), 0.06 (0.484)
BESTMAXMG:        0.08 (0.434), 0.22 (0.037), 0.29 (0.028), 0.12 (0.226), 0.14 (0.158), 0.25 (0.051), 0.35 (0.019), -0.17 (0.079), -0.04 (0.591), -0.10 (0.036)

Each entry is the average of the log ratios of the uninterpolated precision of the row method to the column method over the 18 problems; the number in parenthesis is the p-value obtained from the student's t-test for that entry. Only the lower triangle is shown; the ARP of the column method with respect to the row method is the negation of the tabulated value, with the same p-value.
Fig. 2: STDRET versus BESTSUMNG. (Bar chart of uninterpolated precision for hits, left y-axis, and scaffolds, right y-axis, for StdRet and BestSumNG across the datasets and the ErG, ECZ3, and GF descriptor-spaces.)

References

1. http://www.daylight.com. Daylight Inc.
2. http://www.digitalchemistry.co.uk/. Digital Chemistry Inc.
3. http://www.mdl.com. MDL Information Systems Inc.
4. www.chemaxon.com. ChemAxon Inc.
5. www.cheminformatics.org. Cheminformatics.
6. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
7. J. M. Bland. An Introduction to Medical Statistics. 2nd edn. Oxford University Press, 1995.
8. H. J. Bohm and G. Schneider. Virtual Screening for Bioactive Molecules. Wiley-VCH, 2000.
9. Gianpaolo Bravi, Emanuela Gancia, Darren Green, V. S. Hann, and M. Mike. Modelling structure-activity relationship. Virtual Screening for Bioactive Molecules, 2000.
10. N. Brown and E. Jacoby. On scaffolds and hopping in medicinal chemistry. Mini Rev Medicinal Chemistry, 6(11):1217-1229, 2006.
11. R. Brown and Y. Martin. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Info. Model., 36(1):576-584, 1996.
12. Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis. Frequent substructure-based approaches for classifying chemical compounds. IEEE TKDE, 17(8):1036-1050, 2005.
13. F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random walk computation of similarities between nodes of a graph with application to collaborative filtering. IEEE TKDE, 19(3):355-369, 2007.
14. V. J. Gillet, P. Willet, and J. Bradshaw. Similarity searching using reduced graphs. J. Chem. Inf. Comput. Sci., 43:338-345, 2003.
15. G. Harper, G. S. Bravi, S. D. Pickett, J. Hussain, and D. V. Green. The reduced graph descriptor in virtual screening and data-driven clustering of high-throughput screening data. J. Chem. Info. Model., 44(6):45-56, 2004.
16. Marti Hearst and Jan Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. ACM/SIGIR, 1996.
17. J. Hert, P. Willet, and D. Wilton. New methods for ligand based virtual screening: Use of data fusion and machine learning to enhance the effectiveness of similarity searching. J. Chem. Info. Model., (46):462-470, 2006.
18. J. Hert, P. Willet, D. Wilton, P. Acklin, K. Azzaoui, E. Jacoby, and A. Schuffenhauer. Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Organic and Biomolecular Chemistry, 2:3256-3266, 2004.
19. Robert N. Jorissen and Michael K. Gibson. Virtual screening of molecular databases using support vector machines. J. Chem. Info. Model., 45(3):549-561, 2005.
20. George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Multilevel hypergraph partitioning: Applications in VLSI domain. Design and Automation Conference, pages 526-529, 1997.
21. George Karypis and Vipin Kumar. Multilevel k-way hypergraph partitioning. Design and Automation Conference, pages 343-348, 1999.
22. S. K. Kearsley, S. Sallamack, E. M. Fluder, J. D. Andose, R. T. Mosley, and R. P. Sheridan. Chemical similarity using physiochemical property descriptors. J. Chem. Inf. Comput. Sci., 36:118-127, 1996.
23. Andrew R. Leach. Molecular Modeling: Principles and Applications. Prentice Hall, Englewood Cliffs, NJ, 2001.
24. J. J. Rocchio. Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Chapter 14, 1971.
25. D. Rogers, R. Brown, and M. Hahn. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening. J. Biomolecular Screening, 10(7):682-686, 2005.
26. Jamal C. Saeh, Paul D. Lyne, Bryan K. Takasaki, and David A. Cosgrove. Lead hopping using SVM and 3D pharmacophore fingerprints. J. Chem. Info. Model., 45:1122-1133, 2005.
27. Nikolaus Stiefl, Ian A. Watson, Knut Baumann, and Andrea Zaliani. ErG: 2D pharmacophore descriptions for scaffold hopping. J. Chem. Info. Model., 46:208-220, 2006.
28. B. Teufel and S. Schmidt. Full text retrieval based on syntactic similarities. Information Systems, 31(1), 1988.
29. Nikil Wale and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. International Conference on Data Mining (ICDM), 2006.
30. Martin Whittle, Valerie J. Gillet, and Peter Willett. Enhancing the effectiveness of virtual screening by fusing nearest neighbor lists: A comparison of similarity coefficients. J. Chem. Info. Model., 44:1840-1848, 2004.
31. Peter Willett. Chemical similarity searching. J. Chem. Info. Model., 38(6):983-996, 1998.
32. P. N. Wolohan, L. B. Akella, R. J. Dorfman, P. G. Nell, S. M. Mundt, and R. D. Clark. Structural units analysis identifies lead series and facilitates scaffold hopping in combinatorial chemistry. J. Chem. Inf. Comput. Sci., 46:1188-1193, 2005.
33. Qiang Zhang and Ingo Muegge. Scaffold hopping through virtual screening using 2D and 3D similarity descriptors: Ranking, voting and consensus scoring. J. Chem. Info. Model., 49:1536-1548, 2006.
Transcriptomics and Phylogeny
IMPROVING THE DESIGN OF GENECHIP ARRAYS BY COMBINING PLACEMENT AND EMBEDDING
Sérgio A. de Carvalho Jr.* and Sven Rahmann
Computational Methods for Emerging Technologies (COMET), Genome Informatics, Technische Fakultät, Bielefeld University, D-33594 Bielefeld, Germany; DFG GK Bioinformatik and Institute for Bioinformatics, CeBiTec, Bielefeld University
Email: {Sergio.Carvalho, Sven.Rahmann}@cebitec.uni-bielefeld.de

The microarray layout problem is a generalization of the border length minimization problem and asks to distribute oligonucleotide probes on a microarray and to determine their embeddings in the deposition sequence in such a way that the overall quality of the resulting synthesized probes is maximized. Because of its inherent computational complexity, it is traditionally attacked in several phases: partitioning, placement, and re-embedding. We present the first algorithm, Greedy+, that combines placement and embedding and results in improved layouts in terms of border length and conflict index (a more realistic measure of probe quality), both on arrays of random probes and on existing Affymetrix GeneChip® arrays. We also present a large-scale study on how the layouts of GeneChip arrays have improved over time, and show how Greedy+ can further improve layout quality by as much as 8% in terms of border length and 34% in terms of conflict index.
1. INTRODUCTION

Microarrays are a ubiquitous tool in molecular biology with a wide range of applications on a whole-genome scale, including high-throughput gene expression analysis, genotyping, and resequencing. This article is about improving the design of high-density oligonucleotide microarrays, sometimes called DNA chips. This type of microarray consists of relatively short DNA probes (20-30-mers) synthesized at specific locations, called features or spots, of a solid surface, usually built by light-directed combinatorial chemistry, nucleotide-by-nucleotide. For example, Affymetrix GeneChip® arrays have up to 1.3 million spots on a fused silica substrate measuring a little over 1 cm². The spots are as narrow as 5 µm (0.005 mm), and are arranged in a regularly-spaced rectangular grid. GeneChip arrays are produced with techniques derived from micro-electronics and integrated circuit fabrication. Probes are usually 25 bases long and are synthesized on the chip, in parallel, in a series of repetitive steps. Each step appends the same kind of nucleotide to probes of selected regions of the chip. The sequence of nucleotides added in each step is called the deposition sequence. The selection of which probes receive the nucleotide is achieved with the help of photolithographic masks 3. The quartz wafer of a GeneChip
*Corresponding author
array is initially coated with a chemical compound topped with a light-sensitive protecting group that is removed when exposed to ultraviolet light, activating the compound for chemical coupling. A mask is used to direct light and remove the protecting groups of only those positions that should receive the nucleotide of a particular synthesis step. A solution containing adenine (A), thymine (T), cytosine (C) or guanine (G) is then flushed over the chip surface, but the chemical coupling occurs only in those positions that have been previously deprotected. Each coupled nucleotide also bears another protecting group so that the process can be repeated until all probes have been fully synthesized. An alternative method of in situ synthesis uses an array of miniature mirrors to direct or deflect the incidence of light on the chip. Regardless of which method is used to direct light, it is possible that some probes are accidentally activated for chemical coupling because of light diffraction, scattering or internal reflection on the chip surface. The unwanted illumination introduces unexpected nucleotides that change the probe sequences, significantly reducing their chances of successful hybridization with their targets, and increasing the risk of cross-hybridization with unintended targets. This problem can be (and has been) alleviated by
improving the production process, which however is expensive. Here, we are interested in computational methods that re-arrange the probes on the chip in such a way that the problem is minimized. Note that the problem of unintended illumination primarily occurs near the borders between masked and unmasked spots (in the case of maskless synthesis, between a spot that is receiving light and a spot that is not); we thus speak of a border conflict. By carefully designing the arrangement of the probes on the chip and their embeddings (the sequences of masked and unmasked steps used to synthesize each probe), it is possible to reduce the risk of unintended illumination. The problem has received some attention in the past, mostly by Hannenhalli et al. 4, Kahng et al. 6-8, and ourselves 1, 2. In this paper, we put forward a new idea: We efficiently combine probe placement with probe embedding in a single algorithm; previously, these tasks have been done in separate phases. We also present a large-scale layout-quality study on several old and recent GeneChip arrays and propose alternative layouts with reduced conflicts. In the next section, we state the microarray layout problem formally and define two different objective functions to be minimized. Section 3 contains our study of GeneChip arrays and shows how their layouts can be improved. Section 4 explains our new Greedy+ algorithm that achieves these improvements. Since Greedy+ builds on previous work, we briefly review the relevant details in Section 4.1 before presenting Greedy+ in Section 4.2 and results on chips with random probes in Section 4.3. Section 5 contains a concluding discussion. Supplementary material is available at http://gi.cebitec.uni-bielefeld.de/comet/chiplayout/affy/.
2. THE MICROARRAY LAYOUT PROBLEM

Data. The data for the microarray layout problem (MLP) consists of:

- a set of probes P = {p_1, p_2, ..., p_n}, where each p_k ∈ {A, C, G, T}* with 1 ≤ k ≤ n is produced by a series of T synthesis steps. Frequently, but not necessarily, all probes have the same length ℓ;
- a geometry of spots, or sites, S = {s_1, s_2, ..., s_m}, where each spot s accommodates many copies of a unique probe p_k ∈ P. Each probe is synthesized at a unique spot, hence there is a one-to-one assignment between probes and spots (if we assume that there are as many spots as probes, i.e., m = n). Some microarrays may have complex physical structures, but we assume that the spots are arranged in a rectangular grid;
- the nucleotide deposition sequence N = N_1 N_2 ... N_T corresponding to the sequence of nucleotides added at each synthesis step. It is a supersequence of all p ∈ P and often a repeated permutation of the alphabet Σ = {A, C, G, T}, mainly because of its regular structure and because such sequences maximize the number of distinct subsequences.

Each synthesis step t uses a mask M_t to induce the addition of a particular nucleotide N_t ∈ Σ to a subset of P (Figure 1). A probe may be embedded within N in several ways. An embedding of p_k is a T-tuple ε_k = (ε_{k,1}, ε_{k,2}, ..., ε_{k,T}) in which ε_{k,t} = 1 if probe p_k receives nucleotide N_t (at step t), and 0 otherwise. In particular, a left-most embedding is an embedding in which the bases are added as early as possible (as in ε_1 in Figure 1). Finding good embeddings is part of the problem.

Problem statement. Given P, S, and N as specified above, the MLP asks to specify a chip layout (λ, ε) that consists of

- a bijective assignment λ: S → {1, ..., n} that specifies a probe index λ(s) for each spot s (meaning that p_{λ(s)} will be synthesized at s),
- an assignment ε: {1, ..., n} → {0, 1}^T specifying an embedding ε_k = (ε_{k,1}, ..., ε_{k,T}) for each probe index k, such that N[ε_k] := (N_t)_{t: ε_{k,t} = 1} = p_k,

such that a given penalty function is minimized. We now describe two such penalty functions: total border length and total conflict index.

Objective functions. The total border length B(λ, ε) of a chip layout (λ, ε) was first introduced by Hannenhalli et al. 4, who defined the border length
Fig. 1. Synthesis of a hypothetical 3 × 3 chip with photolithographic masks. Left: chip layout with 3-mer probe sequences. Center: deposition sequence with 2.5 cycles (delimited with dashed lines) and probe embeddings. Right: first six masks (masks 7 to 10 not shown).
B_t(λ, ε) of a mask M_t as the number of borders separating masked and unmasked spots at synthesis step t. Then B(λ, ε) = Σ_{t=1}^{T} B_t(λ, ε). As an example, the six masks shown in Figure 1 have B_1 = 4, B_2 = 3, B_3 = 5, B_4 = 4, B_5 = 8 and B_6 = 9. The total border length of that layout is 52 (masks M_7 to M_10 are not shown). Note that B(λ, ε) can be expressed with the Hamming distance between embeddings of probes at adjacent spots: Let H_ε(k, k') be the number of synthesis steps in which the embeddings ε_k and ε_{k'} differ. Then B(λ, ε) = Σ_{s, s' adjacent} H_ε(λ(s), λ(s')).

Ideally, all probes should have roughly the same risk of being damaged by unintended illumination, so that all hybridization signals are affected in approximately the same way. Total border length treats every conflict in the same way, which is reasonable without further information. However, it has been suggested previously 7 that stray light might activate not only adjacent neighbors but also spots that lie as far as three cells away from the targeted spot, and that imperfections produced in the middle of a probe are more harmful than in its extremities. Therefore, as in Ref. 1, we define the total conflict index of a layout as C(λ, ε) := Σ_s C(s), where C(s) ≡ C(s; λ, ε) is the conflict index of a spot s defined as:

    C(s) := Σ_{t=1}^{T} ( [ε_{λ(s),t} = 0] · w(ε_{λ(s)}, t) · Σ_{s': neighbor of s} [ε_{λ(s'),t} = 1] · γ(s, s') ).    (1)
The indicator functions [·] ensure that there is a conflict at s during step t if and only if s is masked (ε_{λ(s),t} = 0) and a neighbor s' is unmasked (ε_{λ(s'),t} = 1). The function γ(s, s') is a "closeness" measure between s and s', defined as γ(s, s') := (d(s, s'))^{-2}, where d(s, s') is the Euclidean distance between the spots s and s'. Note that, in (1), s' ranges over all neighboring spots that are at most three cells away from s. The position-dependent weighting function w(ε, t) accounts for the significance of the location inside the probe sequence where the undesired nucleotide is introduced in case of accidental illumination. It increases exponentially with the distance δ(ε, t) of the synthesized nucleotide from the probe's closer end, as motivated by thermodynamic considerations: w(ε, t) := c · exp(θ · δ(ε, t)), where c > 0 and θ > 0 are constants. The parameter θ controls how steeply the exponential weighting function rises towards the middle of the probe. In our experiments, we use probes of length ℓ = 25, and parameters θ = 5/ℓ and c = 1/exp(θ).
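A sketch of the conflict index of Equation (1), assuming spots indexed by (row, column) grid coordinates and embeddings stored as 0/1 tuples of length T; the definition of δ(ε, t) as the number of synthesized bases between step t and the probe's closer end is our reading of the text, and neighbors is a caller-supplied function returning the spots at most three cells away.

    import math

    THETA = 5.0 / 25.0          # theta = 5 / probe length
    C0 = 1.0 / math.exp(THETA)  # c = 1 / exp(theta)

    def delta(eps, t):
        """Distance of step t from the probe's closer end: the smaller of
        the number of bases synthesized before and after step t."""
        return min(sum(eps[:t]), sum(eps[t + 1:]))

    def w(eps, t):
        return C0 * math.exp(THETA * delta(eps, t))

    def gamma(s, s2):
        d2 = (s[0] - s2[0]) ** 2 + (s[1] - s2[1]) ** 2
        return 1.0 / d2          # (Euclidean distance)^-2

    def conflict_index(s, emb, neighbors, T):
        """C(s): sum over steps t where s is masked and a neighbor s'
        is unmasked, weighted by w and gamma (Equation 1)."""
        total = 0.0
        for t in range(T):
            if emb[s][t] == 0:
                for s2 in neighbors(s):
                    if emb[s2][t] == 1:
                        total += w(emb[s], t) * gamma(s, s2)
        return total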
Problem variants and per-chip measures. We consider two variants of the MLP:
BLM Border Length Minimization (BLM) means that the objective is to minimize B(λ, ε).

CIM Conflict Index Minimization (CIM) means that the objective is to minimize C(λ, ε), which depends on the weighting functions γ and w and their parameters, which we choose as described above.

In either case, we can measure both B(λ, ε) and C(λ, ε). Naturally, after BLM, B(λ, ε) will be low,
whereas C(λ, ε) may be relatively large; the converse holds after CIM. In order to better compare chips of different size, we introduce normalized versions of these quantities.
NBL If the chip is a rectangular grid with n_r rows and n_c columns, the number of internal borders is n_b = n_r(n_c − 1) + n_c(n_r − 1) ≈ 2 n_r n_c = 2n, and we call B(λ, ε)/n_b the normalized border length (NBL). We may also refer to the NBL of a particular mask M_t as B_t/n_b.

ABC Real arrays have a significant number of empty spots (as much as 11.94% on the Affymetrix Chicken Genome array). To better compare chips with different amounts of empty spots we use the average number of border conflicts per probe (ABC), defined as B(λ, ε)/|P|. We roughly have ABC = 2 · NBL if |S| ≈ |P|. The ABC of a particular mask M_t is B_t/|P|.

ACI We define the average conflict index (ACI) of a layout as C(λ, ε)/|P|.
3. ANALYSIS OF GENECHIP ARRAYS

We obtained the specification of several GeneChip arrays containing the list of probe sequences and their positions on the chip from Affymetrix's web site^a. We make a few assumptions because some details such as the deposition sequence used to synthesize the probes, the probe embeddings, and the contents of "special" spots are not publicly available (some of the special spots contain quality control probes used to detect failures during the production of the chip). Not knowing the contents of these special spots barely interferes with our analysis because, in all arrays we examined, they amount to at most 1.22% of the total number of spots. It has been reported that a fixed 74-step deposition sequence is used by Affymetrix 7. All GeneChip arrays we analyzed, regardless of their size, can be synthesized in N = (TGCA)^18 TG, i.e., 18.5 cycles of TGCA, and a shorter deposition sequence is indeed unlikely. This suggests that only subsequences of this particular deposition sequence can be used as probes on Affymetrix chips. In principle, this should not be a problem as this sequence covers about 98.45% of all 25-mers 9.

Probes of GeneChip arrays appear in pairs: the perfect match (PM) probe, which perfectly matches its target sequence, and the mismatch (MM) probe, which is used to quantify cross-hybridizations and unpredictable background signal variations. The MM probe is a copy of the PM probe except for the middle base (position 13 of the 25-mer), which is exchanged with its Watson-Crick complement. The layout of a GeneChip alternates rows of PM probes with rows of MM probes in such a way that the probes of a pair are always adjacent on the chip. Moreover, PM and MM probes are pair-wise left-most embedded. Informally, a pair-wise left-most embedding is obtained from left-most embeddings by shifting the second half of one embedding to the right until the two embeddings are "aligned" in the synthesis steps that follow the mismatched middle bases. This approach reduces border conflicts between the probes of a pair, but it leaves a conflict in the steps that add the middle bases. The fact that probes must appear in pairs restricts even more which sequences can be used as probes on GeneChip arrays because both PM and MM probes must "fit" in the deposition sequence.
Results. Figure 2 shows the ABC for each masking step of three GeneChip arrays (Yeast, Human and E. coli). We assume that the probes are pair-wise left-most embedded in N = (TGCA)^18 TG, and we consider all spots whose contents are not available as empty spots. In all chips we analyzed, the ABC is higher in the steps that add the middle bases, a result of placing PM and MM probes in adjacent spots. The Yeast Genome S98 array has the worst layout in terms of border conflicts, and most of the earlier GeneChip arrays such as the E. coli Antisense Genome have similar levels of conflicts. The layout of the Human Genome U95A2 array has significantly fewer border conflicts than the Yeast array, suggesting that it was designed with a better placement strategy. The curve of the E. coli Genome 2.0 array, with very low levels of conflicts in the first 10 masks, is typical of the latest generation of GeneChip arrays, including the Chicken Genome and the Wheat Genome (one of the largest GeneChip arrays currently available, with 1164 × 1164 spots), which suggests yet another placement strategy. Table 1 shows summary statistics on several
^a http://www.affymetrix.com/support/technical/byproduct.affx?cat=arrays
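Left-most embeddings, which the above analysis assumes, can be computed greedily; the following sketch also shows the 74-step deposition sequence reported for Affymetrix arrays. The function name is ours.

    def leftmost_embedding(probe, N):
        """Left-most embedding of `probe` in deposition sequence N:
        greedily take the earliest step that matches each base.
        Returns a 0/1 list of length len(N), or None if the probe
        does not fit (i.e., is not a subsequence of N)."""
        eps = [0] * len(N)
        t = 0
        for base in probe:
            while t < len(N) and N[t] != base:
                t += 1
            if t == len(N):
                return None
            eps[t] = 1
            t += 1
        return eps

    N = "TGCA" * 18 + "TG"   # 18.5 cycles of TGCA: 74 synthesis steps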
Fig. 2. Average number of border conflicts per probe (scale on the left y-axis) of selected GeneChip arrays: Yeast Genome S98, Human Genome U95A2, and E. coli Genome 2.0. The histogram shows the number of middle bases added per synthesis step on the E. coli 2.0 chip (scale on the right y-axis).
commercially available arrays. The layout of the Human Genome U95A2 array is one of the best in terms of NBL and the best in terms of ACI. This, however, has more to do with empty spots than with the placement strategy, as this chip has about 1.83% empty spots that are evenly distributed on the chip surface. In contrast, the Chicken Genome array has an exceptionally high percentage of empty spots (11.94%) that contribute to its low NBL but not equally to a low ABC in comparison with the Human Genome array, because the empty spots are concentrated in the lower part of the chip (figures illustrating the distribution of empty spots on these chips are available on the supplementary web page). GeneChip arrays exhibit relatively low levels of NBL and ABC when compared to layouts produced by the best algorithms for arrays of random probes of similar dimensions (see next section). This can be explained by the fact that each probe has a nearly identical copy next to it. However, they have relatively high ACIs because the conflicts are concentrated on the synthesis steps of the middle bases, which are expensive in the conflict index model.

Design improvements. We used our new algorithm Greedy+ with different parameters Q, and the Sequential 8 re-embedding algorithm (see Section 4 for explanations; in general, larger Q gives better layouts, but also increases the running time), to create alternative layouts for two of the latest generation of GeneChip arrays: E. coli Genome 2.0 and Wheat
Genome. Greedy+ was modified to avoid placing probes on special spots or empty spots that we believe might have a function on the chip. For each chip we separately ran both BLM and CIM versions of the algorithms. The main difference between our layouts and the original ones is that we do not require the arrays to alternate rows of PM and MM probes; hence, probes of a pair are not necessarily placed on adjacent spots. This is especially helpful for CIM since it avoids conflicts in the middle bases. With BLM, we observe that Greedy+ places between 90.7% and 95.2% of the PM probes adjacent to their corresponding MM probes. With CIM, this rate drops to between 12.9% and 21.3%. Figure 3 shows the NBL for each masking step of the layout produced by Greedy+ and Sequential for the E. coli Genome 2.0 array in comparison with the original Affymetrix layout. It can be clearly seen that the CIM variant of our algorithm greatly reduces the number of border conflicts in the middle synthesis steps, where conflicts are expensive. In the BLM variant, the conflicts are distributed more evenly across all synthesis steps. To compare the new layout algorithm with re-embedding only, we also show the result of running a pair-wise version of Sequential on the original layout (this version ensures that the embeddings of PM-MM pairs remain pair-wise "aligned"). The total NBL and ACI values of these layouts are also shown in Table 2, together with several layouts for the Wheat Genome array.
Table 1. Average number of border conflicts per probe (ABC), normalized border length (NBL) and average conflict index (ACI) of selected GeneChip arrays. The dimension of the chip, the percentage of spots with unknown content and the percentage of empty spots are also shown.

GeneChip Array            Dimension     Unknown   Empty     ABC       NBL       ACI
Yeast Genome S98          534 x 534     1.22%     1.70%     44.8168   21.7945   669.0663
E. coli Antisense Genome  544 x 544     1.17%     3.12%     43.3345   20.7772   663.7353
Human Genome U95A2        640 x 640     0.96%     1.83%     28.2489   13.7517   510.3418
E. coli Genome 2.0        478 x 478     1.08%     0.46%     29.2038   14.4079   550.2014
Chicken Genome            984 x 984     0.46%     11.94%    28.2087   12.3680   540.5022
Wheat Genome              1164 x 1164   0.38%     0.08%     27.6569   13.7771   539.9632
for the Wheat Genome array. Greedy+ with Q = 10K produces a layout with 8.10% fewer border conflicts than the original layout for the E. coli array (13.2406 versus 14.4079) in 218.3 minutes. With Q = 2K, the improvement is almost as good (7.15%), but requires only 46.9 minutes. For the larger Wheat array, Greedy+ with Q = 2K generates a layout with 7.36% fewer border conflicts than the original layout (12.7622 versus 13.7771). In terms of CIM, our results show that Greedy+ can improve the quality of GeneChip arrays by as much as 34.31% (from 550.2014 to 361.4418 for the E. coli array).
4. ALGORITHMS
Traditionally, the MLP has been attacked heuristically in two phases, as exact solutions are computationally infeasible. First, an initial embedding of the probes is fixed and an arrangement of these embeddings on the chip with minimum conflicts is sought. This is usually referred to as the placement phase. Placement algorithms typically assume that an initial embedding of the probes is given (which can be a left-most or otherwise pre-computed embedding), and do not change the given embeddings. Second, a post-placement optimization phase re-embeds the probes considering their location on the chip, in such a way that the conflicts with neighboring spots are further reduced. For superlinear placement algorithms, the chip is often partitioned into smaller sub-regions before the placement phase in order to reduce running times, especially on larger chips. We briefly review the best known placement and re-embedding principles and then present a new algorithm, Greedy+, the first one to combine placement and embedding into a single phase. In addition to the results presented in the previous
section, we show in Section 4.3 that Greedy+ compares favorably to the best known placement strategy (Row-Epitaxial). Partitioning algorithms such as Centroid-based Quadrisection8 and Pivot Partitioning1 are not discussed.

4.1. Review of Existing Placement and Re-Embedding Strategies

Placement. The following elements of placement strategies have proven successful in practice for large-scale chips.
Initial ordering The probe sequences (or their binary embeddings) are initially ordered, either lexicographically7, which is easy, or to minimize the sum of distances of consecutive probes, which leads to an instance of the NP-hard traveling salesman problem (TSP) that is then solved heuristically4.
k-threading The sequence of ordered probes is threaded onto the chip. This can happen row-by-row, where the first row is filled left-to-right, the second one right-to-left, and so on. This leads to an arrangement where consecutive probes in the same row have few border conflicts, but probes in the same column may have a significant number of conflicts. An alternative is provided by k-threading4, in which the right-to-left and left-to-right steps are interspersed with alternating upward and downward movements over k sites. Row-by-row threading can be seen as k-threading with k = 0.
Iterative refinement The Row-Epitaxial7 algorithm refines an existing layout as follows: Spots are re-considered in a pre-defined order, from top to bottom, left to right. For each spot s, a user-defined number Q of probe candidates below and to the right of s is considered for an
Table 2. Normalized border length (NBL) and average conflict index (ACI) of layouts for the E. coli 2.0 and Wheat GeneChip arrays. Greedy+ used k-threading with k = 5 for BLM and k = 0 for CIM. Running times in minutes include placement and two passes of re-embedding with Sequential.

Array        Layout                                           NBL       ACI        Time
E. coli 2.0  Affymetrix with pair-wise left-most              14.4079   550.2014   -
             Affymetrix after "pair-aware" Sequential (BLM)   13.5005   541.0954   -
             Greedy+ with Q = 2K and Sequential (BLM)         13.3774   529.8129   46.9
             Greedy+ with Q = 10K and Sequential (BLM)        13.2406   515.5917   218.3
             Greedy+ with Q = 2K and Sequential (CIM)         17.6935   394.9905   54.9
             Greedy+ with Q = 10K and Sequential (CIM)        17.5575   361.4418   225.7
Wheat        Affymetrix with pair-wise left-most              13.7771   539.9632   -
             Affymetrix after "pair-aware" Sequential (BLM)   12.9151   531.2692   -
             Greedy+ with Q = 2K and Sequential (BLM)         12.7622   519.0869   279.2
             Greedy+ with Q = 5K and Sequential (BLM)         12.6670   511.7193   676.0
             Greedy+ with Q = 2K and Sequential (CIM)         17.1047   387.8430   322.7
             Greedy+ with Q = 5K and Sequential (CIM)         17.1144   366.6045   704.7
exchange with the probe p at s. Probe p is then swapped with the probe that generates the minimum number of border conflicts between s and its left and top neighbors. In the experiments conducted by Kahng et al.7, Row-Epitaxial was the best large-scale placement algorithm for the BLM problem. We have adapted Row-Epitaxial to CIM by choosing the probe candidate that minimizes the sum of conflict indices in a region around s, restricted to those neighboring spots that have already been refilled.
Re-embedding. Most current re-embedding strategies are based on the Optimum Single Probe Embedding algorithm (OSPE; see below) first introduced by Kahng et al.6 and differ mainly in the order in which the spots are considered. Some of the proposed strategies are Chessboard, Greedy and Batched Greedy6, and Sequential2. The Sequential strategy proceeds spot by spot, from top to bottom, left to right, re-embedding each probe optimally with regard to its neighbors using OSPE. Once the end of the array is reached, it is restarted at the top left corner of the array for the next iteration, until a locally optimal solution is found, or until improvements drop below a given threshold, or until a given number of passes have been executed. Sequential is not only the simplest but also the fastest and most effective known strategy2. Therefore, we skip the discussion of other strategies.
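To make the threading patterns discussed above concrete, the following sketch enumerates the spot coordinates visited by k-threading. It is a hypothetical helper reflecting one plausible reading of the description (horizontal bands of height k + 1 traversed in a vertical zigzag while advancing column by column), not code from the paper; with k = 0 it degenerates to ordinary serpentine row-by-row threading.

```python
def k_threading_order(nrows, ncols, k):
    """Yield spot coordinates (row, col) in a k-threading pattern.

    One plausible reading of k-threading: the chip is cut into
    horizontal bands of height k + 1; inside a band the path snakes
    down and up over the band's rows while advancing one column at a
    time, and successive bands alternate their horizontal direction.
    k = 0 reduces to plain serpentine (row-by-row) threading.
    """
    band_h = k + 1
    for b, top in enumerate(range(0, nrows, band_h)):
        rows = range(top, min(top + band_h, nrows))
        cols = range(ncols) if b % 2 == 0 else range(ncols - 1, -1, -1)
        for j, col in enumerate(cols):
            for row in (rows if j % 2 == 0 else reversed(rows)):
                yield (row, col)
```

A placement algorithm then assigns consecutive probes of the sorted list to the spots in exactly this visiting order.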
OSPE is a dynamic programming algorithm (a variant of global sequence alignment) that computes an optimum embedding of a single probe p (of length l) at a given spot s into the deposition sequence N (of length T) with respect to p's neighbors, whose embeddings are considered as fixed. The algorithm was originally developed for BLM, but a more general form designed for conflict index minimization (CIM) was given by de Carvalho Jr. and Rahmann2. OSPE fills an (l + 1) x (T + 1) dynamic programming matrix D, where D[i, t] is defined as the minimum cost of an embedding of p_{1..i} into N_{1..t} for 0 ≤ i ≤ l, 0 ≤ t ≤ T. The cost is the sum of conflicts induced by the embedding of p_{1..i} on its neighbors (when s is unmasked and a neighbor is masked), plus the conflicts suffered by p_{1..i} because of the embeddings of its neighbors (when s is masked and a neighbor is unmasked). The basic recurrence is

$$D[i,t] \;=\; \begin{cases} \min\bigl\{\, D[i,t-1] + M_{i,t},\; D[i-1,t-1] + U_t \,\bigr\} & \text{if } p_i = N_t,\\ D[i,t-1] + M_{i,t} & \text{otherwise.} \end{cases}$$
In accordance with the conflict index model, the additional costs U_t (incurred at masked neighbors when s is unmasked, only possible if p_i = N_t) and M_{i,t} (incurred at masked s because of unmasked neighbors) are

$$U_t := \sum_{s'\,:\,\text{neighbor of } s} \mathbf{1}\{\mathcal{E}(s')_t = 0\} \cdot \omega(\mathcal{E}(s'), t) \cdot \gamma(s', s),$$

$$M_{i,t} := c \cdot \exp\bigl(\theta \cdot (1 + \min\{i,\, l - i\})\bigr) \cdot \sum_{s'\,:\,\text{neighbor of } s} \mathbf{1}\{\mathcal{E}(s')_t = 1\} \cdot \gamma(s', s),$$

where E(s') denotes the (fixed) embedding of the probe placed at the neighboring spot s'.

Fig. 3. NBL for each masking step of the original Affymetrix layout for the E. coli 2.0 GeneChip (pair-wise left-most embeddings) compared with alternative layouts produced by Greedy+ (with Q = 10K) and Sequential, in both BLM and CIM variants. The layout resulting from running pair-wise Sequential on the original layout is also shown.
The initialization is given by D[0,0] = 0, D[i,0] = ∞ for 0 < i ≤ l, and D[0,t] = D[0,t-1] + M_{0,t} for 0 < t ≤ T.
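The recurrence and initialization translate directly into a small dynamic program. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: the neighbor-dependent costs are passed in as a precomputed vector U (indexed by synthesis step) and matrix M (indexed by bases synthesized and step), since neither depends on the candidate probe itself.

```python
import math

def ospe(p, N, U, M):
    """Optimum Single Probe Embedding (a hypothetical sketch).

    p : probe sequence of length l; N : deposition sequence of length T.
    U[t]    : cost of leaving spot s unmasked at step t (1 <= t <= T).
    M[i][t] : cost of masking s at step t when i bases of p are already
              synthesized (shape (l+1) x (T+1)).
    D[i][t] is the cheapest embedding of the prefix p[0:i] into N[0:t];
    the function returns the optimum cost D[l][T].
    """
    l, T = len(p), len(N)
    D = [[math.inf] * (T + 1) for _ in range(l + 1)]
    D[0][0] = 0.0
    for t in range(1, T + 1):          # top row: s stays masked throughout
        D[0][t] = D[0][t - 1] + M[0][t]
    for i in range(1, l + 1):
        for t in range(1, T + 1):
            best = D[i][t - 1] + M[i][t]            # mask s at step t
            if p[i - 1] == N[t - 1]:                # may synthesize p_i now
                best = min(best, D[i - 1][t - 1] + U[t])
            D[i][t] = best
    return D[l][T]
```

Tracing back through D recovers the optimal embedding itself, exactly as in global sequence alignment.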
4.2. Greedy+: Merging Placement and Embedding
The problem with the "place first, then re-embed" approach is that once the placement is fixed, there is usually little freedom for optimization by re-embedding the probes. Better results should be obtained when the placement and embedding phases are considered simultaneously instead of separately. However, because of the generally high number of possible embeddings of each probe, it is a challenge to design algorithms that efficiently use the additional freedom and run reasonably fast in practice. In this section, we propose Greedy+, the first placement algorithm that simultaneously places and embeds the probes. After the user has chosen two parameters Q and k, the overall strategy is as follows.

(1) Sort the probes lexicographically and store them, in sorted order, in a doubly linked list L.
(2) Place a randomly selected probe p at the first spot, using any reasonable embedding.
(3) Remove p from L, but remember its former position.
(4) For each following spot s of the array in a k-threading pattern:
    (a) For each of the Q probe candidates q closest to p's former position in L: compute q's optimal embedding with respect to the already-filled neighbors of s by temporarily placing q at s and using OSPE. Denote the best cost for q by c(q). Keep track of the minimum cost c* = min_q c(q) and the corresponding best probe candidate q*.
    (b) Place q* at s with its optimal embedding.
    (c) Set p ← q*. Remove p from L, but remember its former position.
(5) Optionally, run Sequential re-embedding over the whole array.

Compared to Row-Epitaxial, Greedy+ clearly spends more time evaluating each probe candidate. For this reason, we must use lower numbers Q of candidates per spot to achieve a running time comparable to Row-Epitaxial.
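The five steps above can be condensed into a short program. The toy sketch below is a schematic reconstruction under simplifying assumptions that are ours, not the paper's: row-by-row threading (k = 0), a plain Python list with index bookkeeping instead of a doubly linked list, and Hamming distance to the already-placed left/top neighbors as a stand-in for the full OSPE embedding cost.

```python
import random

def hamming(a, b):
    # stand-in for the real OSPE cost between a candidate and a neighbor
    return sum(x != y for x, y in zip(a, b))

def greedy_plus_toy(probes, nrows, ncols, Q):
    """Schematic reconstruction of the Greedy+ loop (steps 1-5 above)."""
    L = sorted(probes)                             # step 1
    grid = [[None] * ncols for _ in range(nrows)]
    pos = 0                                        # former list position of p
    grid[0][0] = L.pop(pos)                        # steps 2-3
    spots = [(r, c) for r in range(nrows) for c in range(ncols)][1:]
    for (r, c) in spots:                           # step 4, k = 0 threading
        if not L:
            break
        lo = max(0, min(pos - Q // 2, len(L) - Q))
        cands = range(lo, min(lo + Q, len(L)))     # Q candidates around pos

        def cost(idx):                             # step 4a (simplified)
            q = L[idx]
            left = hamming(q, grid[r][c - 1]) if c > 0 else 0
            top = hamming(q, grid[r - 1][c]) if r > 0 else 0
            return left + top

        best = min(cands, key=cost)
        pos = best                                 # step 4c bookkeeping
        grid[r][c] = L.pop(best)                   # step 4b
    return grid                                    # step 5: optional Sequential pass

# Example: place 16 random 8-mers on a 4 x 4 toy chip with Q = 3.
random.seed(0)
probes = ["".join(random.choice("ACGT") for _ in range(8)) for _ in range(16)]
layout = greedy_plus_toy(probes, 4, 4, Q=3)
```

In the real algorithm, step 4a would call OSPE on each candidate; the lexicographic neighborhood in L is what makes the prefix-reuse trick described below effective.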
Table 3. Normalized border length (NBL) of layouts produced by Row-Epitaxial and Greedy+ with border length minimization (both using 0-threading) on random chips in approximately the same time (running times in minutes include two passes of Sequential re-embedding optimization). The relative difference in NBL and time between the two approaches is shown in percentage.

             Row-Epitaxial and Sequential   Greedy+ and Sequential    Relative
Dim.         Q        NBL       Time        Q      NBL       Time     NBL       Time
300 x 300    10000    18.0524   4.3         300    17.9807   4.2      -0.40%    -1.24%
             20000    17.9430   9.5         700    17.6746   9.2      -1.50%    -2.85%
500 x 500    10000    17.3584   16.0        450    17.2216   16.0     -0.79%    -0.40%
             20000    17.2502   34.7        950    16.9382   30.4     -1.81%    -12.51%
800 x 800    10000    16.7176   45.6        500    16.6549   41.7     -0.38%    -8.51%
             20000    16.6012   100.1       1130   16.3175   97.7     -1.71%    -2.41%
Three observations significantly reduce the time spent with OSPE computations when several probe candidates q are considered in succession for filling the same spot. (1) The U_t and M_{i,t} costs of OSPE need to be computed only once for a given spot s, since they do not depend on the probe sequence placed at s: U_t depends solely on the existing neighbors of s, whereas M_{i,t} depends on the neighbors of s and on the number i of bases already appended to q at synthesis step t (if all probes have the same length l, then c and θ are constants). (2) Once we know that there exists some q that can be placed at s with cost K, we can stop the OSPE computation for other candidates as soon as all values in a row of the OSPE matrix D are greater than or equal to K. (3) If two candidates q and q' share a common prefix of length r, rows 0 through r of D are identical for q and q', so we can skip the re-computation. In order to fully exploit this fact, we examine the probes in lexicographical order so that we maximize the length of the common prefixes between two consecutive probe candidates. For this reason, Greedy+ uses the doubly-linked list L to maintain the probes in lexicographical order.
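Observation (2) drops into the OSPE sketch given earlier with a few extra lines. Since all costs are non-negative and every entry of a row is derived from the same or the previous row, once an entire row reaches the best cost seen so far, no later row can beat it; the hedged sketch below (our own illustration, not the paper's code) therefore abandons the candidate at that point.

```python
import math

def ospe_bounded(p, N, U, M, best_so_far):
    """OSPE with the early cut-off of observation (2): give up on this
    candidate as soon as every entry of the current DP row is already
    >= best_so_far (a sketch; argument conventions as in ospe())."""
    l, T = len(p), len(N)
    prev = [math.inf] * (T + 1)
    prev[0] = 0.0
    for t in range(1, T + 1):
        prev[t] = prev[t - 1] + M[0][t]
    for i in range(1, l + 1):
        row = [math.inf] * (T + 1)
        for t in range(1, T + 1):
            row[t] = row[t - 1] + M[i][t]
            if p[i - 1] == N[t - 1]:
                row[t] = min(row[t], prev[t - 1] + U[t])
        if min(row) >= best_so_far:      # costs only grow from here on
            return math.inf              # prune this candidate
        prev = row
    return prev[T]
```

Observation (3) is complementary: when candidates are taken in lexicographic order, the rows computed for a shared prefix can simply be kept from one candidate to the next.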
4.3. Results on Chips with Random Probes
We compare the layouts produced by Row-Epitaxial and Greedy+ when both algorithms are given approximately the same amount of time (the parameter Q is chosen differently for both algorithms so that the running times are comparable). For this experiment we use probes of length l = 25, i.i.d. randomly
generated and left-most embedded in the standard Affymetrix deposition sequence (all results are averages over a set of ten arrays). Note that, although we use Affymetrix's deposition sequence, the probes on these arrays do not appear in pairs. For Row-Epitaxial, an initial placement is constructed by threading a lexicographically sorted list of probes using 0-threading, i.e., row-by-row. To be fair, since Row-Epitaxial is a traditional placement algorithm that does not change the probe embeddings, we need to compare the layouts obtained by both algorithms after a re-embedding phase. For this task we use the Sequential algorithm, performing two passes of re-embedding optimization. The results are shown in Tables 3 (NBL after BLM) and 4 (ACI after CIM). For BLM, Greedy+ produces significantly better results in less time while looking at fewer probe candidates. For CIM, Greedy+ produces better layouts in approximately the same amount of time (or less), except for the smallest chips: On 300 x 300 arrays, Row-Epitaxial produces layouts with lower ACIs, but it quickly reaches its limit in terms of probe candidates per spot. Greedy+ examines fewer probe candidates to achieve similar results, and thus has a greater potential for producing better layouts. For instance, the largest value of Q for Row-Epitaxial on 300 x 300 chips (Q = 90000) produces a layout with 402.5457 ACI. Greedy+ produces a better layout (401.8089 ACI) already with Q = 5500 (although that takes more time than Row-Epitaxial with Q = 90000). Our results also suggest that the larger the value of Q, the greater the advantage of Greedy+. In further experiments (details not shown), Row-Epitaxial often produces the best results (for both
Table 4. Average conflict index (ACI) of layouts produced by Row-Epitaxial and Greedy+ with conflict index minimization (both using 0-threading) on random chips in approximately the same time (running times in minutes include two passes of Sequential re-embedding optimization).

             Row-Epitaxial and Sequential   Greedy+ and Sequential     Relative
Dim.         Q        ACI        Time       Q       ACI        Time    ACI       Time
300 x 300    10000    440.2397   12.3       900     442.8057   12.4    +0.58%    +0.42%
             20000    423.4236   21.2       1900    423.9464   21.3    +0.12%    +0.60%
             90000    402.5457   50.6       5500    401.8089   53.9    -0.18%    +6.65%
500 x 500    10000    434.9764   38.3       1050    432.9102   38.1    -0.48%    -0.48%
             20000    417.8499   68.7       2150    414.2703   66.2    -0.86%    -3.67%
800 x 800    10000    428.6301   106.6      1150    424.7285   104.3   -0.91%    -2.12%
             20000    412.4495   187.9      2400    405.6095   184.4   -1.66%    -1.90%
BLM and CIM) with k = 0, although the best initial layouts are frequently produced with high values of k (e.g., k = 4), contradicting the results of Hannenhalli et al.4. Greedy+ consistently achieves the best results with k = 0 for CIM, and with surprisingly high values of k (e.g., k = 14) for BLM. The results shown in Tables 3 and 4 use k = 0 (row-by-row threading), so the advantage of Greedy+ over Row-Epitaxial, in terms of BLM, is even greater in many cases.
5. DISCUSSION
We have presented a large-scale study on the layout of GeneChip arrays. Our analysis suggests that placing perfect match (PM) and mismatch (MM) probes on adjacent spots is responsible for the low border length of GeneChip arrays. However, this has the disadvantage of concentrating the conflicts on those synthesis steps that add the middle bases, precisely where an unintentionally added nucleotide results in the highest damage to the probes. Our results indicate that, if PM and MM probes are not regularly placed in alternating rows, the average conflict index (ACI) may be reduced by as much as 34%. However, other desired properties might be lost, e.g., the correlation of PM and MM signals due to spatial effects. We remark that several researchers in the past have proposed to ignore the MM signals altogether5. Of course, the exact numbers (such as the 34% above) depend on the parameters of the conflict index model, which are subject to debate. However, changing them does not qualitatively change the results: In fact, our estimate of the relative importance of the middle bases for the integrity of the probes is rather conservative. We have also proposed the first layout algo-
rithm, Greedy+, that combines the previously separate phases of placement and embedding. Also, in contrast to most previous work, we use two models to evaluate layout quality: border length minimization and conflict index minimization. For fair comparisons, we have adapted the existing methods to the conflict index model. As evidenced by the results in Section 4.3, Greedy+ is the best placement strategy for border length minimization. It is also the best for conflict index minimization, except for the smaller chips and when running time is limited. In fact, the advantage of Greedy+ becomes more apparent for larger chips and greater numbers of candidates per spot. This makes Greedy+ an ideal candidate for truly large designs. It should also be noted that Greedy+ outperforms previous algorithms regardless of how PM and MM probes are placed on the chip, as can be seen in the results with random chips (where there are no probe pairs).

Acknowledgments and availability. We thank Jens Stoye for many helpful discussions, Robert Giegerich for additional funding, and the reviewers for helping to improve the presentation. Supplementary material is available online at http://gi.cebitec.uni-bielefeld.de/comet/chiplayout/affy/.

References
1. S. A. de Carvalho Jr. and S. Rahmann. Improving the layout of oligonucleotide microarrays: Pivot Partitioning. In P. Bucher et al., editors, Proceedings of the 6th Workshop on Algorithms in Bioinformatics, volume 4175 of Lecture Notes in Computer Science, pages 321-332. Springer, 2006.
2. S. A. de Carvalho Jr. and S. Rahmann. Microarray layout as a quadratic assignment problem. In D. Huson et al., editors, Proceedings of the German Conference on Bioinformatics, volume P-83 of Lecture Notes in Informatics (LNI), pages 11-20. Gesellschaft für Informatik, 2006.
3. S. P. Fodor, J. L. Read, M. C. Pirrung, L. Stryer, A. T. Lu, and D. Solas. Light-directed, spatially addressable parallel chemical synthesis. Science, 251(4995):767-773, 1991.
4. S. Hannenhalli, E. Hubbell, R. Lipshutz, and P. A. Pevzner. Combinatorial algorithms for design of DNA arrays. Advances in Biochemical Engineering/Biotechnology, 77:1-19, 2002.
5. R. A. Irizarry, B. M. Bolstad, F. Collin, L. M. Cope, B. Hobbs, and T. P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4):e15, Feb 2003.
6. A. Kahng, I. Mandoiu, P. Pevzner, S. Reda, and A. Zelikovsky. Border length minimization in DNA array design. In R. Guigó et al., editors, Algorithms in Bioinformatics (Proceedings of WABI), volume 2452 of Lecture Notes in Computer Science, pages 435-448. Springer, 2002.
7. A. B. Kahng, I. Mandoiu, P. Pevzner, S. Reda, and A. Zelikovsky. Engineering a scalable placement heuristic for DNA probe arrays. In Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB), pages 148-156. ACM Press, 2003.
8. A. B. Kahng, I. Mandoiu, S. Reda, X. Xu, and A. Z. Zelikovsky. Evaluation of placement techniques for DNA probe array layout. In Proceedings of the 2003 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 262-269. IEEE Computer Society, 2003.
9. S. Rahmann. Subsequence combinatorics and applications to microarray production, DNA sequencing and chaining algorithms. In M. Lewenstein et al., editors, Combinatorial Pattern Matching (CPM), volume 4009 of LNCS, pages 153-164, 2006.
10. S. Singh-Gasson, R. D. Green, Y. Yue, C. Nelson, F. Blattner, M. R. Sussman, and F. Cerrina. Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array. Nat Biotechnol, 17(10):974-978, Oct 1999.
MODELING SPECIES-GENES DATA FOR EFFICIENT PHYLOGENETIC INFERENCE
Wenyuan Li and Ying Liu*
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, U.S.A.
* Email: [email protected] (* Corresponding author.)

In recent years, biclique methods have been proposed to construct phylogenetic trees. One of the key steps of these methods is to find complete sub-matrices (without missing entries) from a species-genes data matrix. To enumerate all complete sub-matrices, Sanderson et al.17 described an exact algorithm whose running time is exponential. Furthermore, it generates a large number of complete sub-matrices, many of which may not be used for tree reconstruction. Further investigating and understanding the characteristics of species-genes data may be helpful for discovering complete sub-matrices. Therefore, in this paper, we focus on quantitatively studying and understanding the characteristics of species-genes data, which can be used to guide new algorithm design for efficient phylogenetic inference. In this paper, a mathematical model is constructed to simulate real species-genes data. The results indicate that the sequence-availability probability distributions follow power law, which leads to the skewness and sparseness of the real species-genes data. Moreover, a special structure, called the "ladder structure", is discovered in the real species-genes data. This ladder structure is used to identify complete sub-matrices and, more importantly, to reveal overlapping relationships among complete sub-matrices. To discover the distinct ladder structure in real species-genes data, we propose an efficient evolutionary dynamical system, called "generalized replicator dynamics". Two species-genes data sets from green plants are used to illustrate the effectiveness of our model. Empirical study has shown that our model is effective and efficient in understanding species-genes data for phylogenetic inference.
1. INTRODUCTION
Phylogenetic inference can be defined as the process of determining estimated evolutionary history by analysis of a given data set 18. The evolutionary history of genes and species can be described by a phylogenetic tree. It is widely accepted that amino acid and/or DNA sequences produce a tree closest to the true tree. As the amount of molecular sequence data available rapidly increases, it has spurred a number of phylogenetic analyses across the tree of life 16. In general, the data prepared for phylogenetic analysis is in the form of a species-genes matrix, where genes refer to any set of homologous sequences, whether protein coding or not. As the species-genes matrix indicates whether there exist sequences for any species and gene, it is also called a sequence availability matrix. Ideally, this matrix is complete, which means that every species has been sequenced for every gene in the matrix. However, as pointed out in 17, a few species have been sequenced for many genes; a few genes have been sequenced for many species; but most of the potential data available for phylogenetic purposes is still missing. Therefore, species-genes matrices derived from the available sequence data are "sparse" and "uneven" 14, 11, 17. The sparseness and skewness of species-genes data have posed serious challenges for the available phylogenetic methods and strategies of constructing trees 16, 19. Recent studies have shown that concatenating multiple sequences from the same species can improve the accuracy of phylogenetic inference 11. Given a large species-genes matrix, Sanderson et al. (2003) developed an exact algorithm to find all complete sub-matrices (without missing data). Once all the complete sub-matrices are discovered, an effective strategy for constructing a phylogenetic tree is to concatenate the sequences of all the genes in the complete sub-matrix. The whole process is illustrated in Fig. 1. In this process, an important step is the discovery of all complete sub-matrices. In graph theory, the species-genes matrix W = (w_ij)_{m x n} (as in the right part of Fig. 1) can be represented as a bipartite graph B(S, G, W), in which there are two vertex sets S = {s_1, . . . , s_m} (s_i rep-
resents the i-th species) and G = {g_1, . . . , g_n} (g_j denotes the j-th gene), and the edge between s_i and g_j exists if w_ij = 1 (the gene g_j is sequenced for the species s_i) and does not exist if w_ij = 0. Therefore, a complete sub-matrix of W corresponds to a complete subgraph of B, typically called a 'biclique' in graph theory^a. Therefore, in a biclique, all genes are sequenced for all species. In essence, the discovery of all complete sub-matrices is equivalent to an NP-complete graph problem, known as "biclique enumeration" 17, 16. The running time of the "biclique enumeration" algorithm proposed by Sanderson et al. (2003) is exponential and may take a long time to analyze large data sets. Furthermore, it generates a large number of bicliques, many of which may not be used for tree reconstruction. Hence, it is time-consuming for phylogeneticists to determine which bicliques can build meaningful phylogenetic trees. Although the sparseness and skewness of species-genes data are a curse for biclique enumeration, they may become a blessing for phylogenetic inference if we study and take advantage of them. Therefore, in this paper, we focus on quantitatively studying and understanding the characteristics of real species-genes data, which can be used to guide new algorithm design for efficient phylogenetic analysis. We firstly construct a mathematical model to simulate real species-genes data. Then some underlying and special features or structures, such as the "ladder structure", can be discovered in real species-genes data through the model. This ladder structure can be used to identify complete sub-matrices, and more importantly, to reveal the distinct overlapping relationships among complete sub-matrices. Finally, to discover the ladder structure in real data, we propose an efficient evolutionary dynamical system, called "generalized replicator dynamics". The rest of this paper is organized as follows: we firstly propose a model in Section 2 to study the species-genes data. Based on this model, characteristics of the real-world species-genes data can be quantitatively investigated. Two conclusions are drawn when we use this model to analyze two real species-genes data sets collected from green plants. In Section 3, we formulate the discovery of the ladder
structure as a maximization problem. To approach this problem, in Section 4, we generalize a well-known population dynamics from evolutionary biology, replicator dynamics, to the general matrix, i.e., the species-genes data matrix, for efficiently estimating our model 9, 10. We call this new dynamics "generalized replicator dynamics". Empirical results in Section 5 show our model can effectively and efficiently build phylogenetic trees by estimating its distributions. Finally, conclusions and future works are presented in Section 6.
2. MODEL OF SPECIES-GENES DATA
Before introducing the model of species-genes data, we firstly review two characteristics recently observed by phylogeneticists 17, 16:
Sparse and uneven sequence availability distribution: as shown in Fig. 2(d), which is an excerpt from 16, and Fig. 2(e), these two matrices are very sparse and uneven. "Many sequences are available for a few species and a few heavily sampled genes are available for many species". Moreover, these two figures show "the most heavily sampled corner of the species-genes matrix and the remainder of the matrix is even more sparse" 5.
Many overlaps of bicliques: as observed and reported in 17, "many of bicliques overlap, and for any given biclique there are generally bicliques which have either slightly more species and slightly fewer genes, or slightly more genes and fewer species".
The first characteristic is an empirical observation of a global data distribution in the whole species-genes data, and the second one indicates relationships among bicliques. However, they provide only a qualitative and rough view of the species-genes data and thus may not provide further useful guidance and insights for data analysis algorithm design. Therefore, we build a model to quantitatively analyze species-genes data and advance our understanding of species-genes data from qualitative observations to quantitative investigations.
^a In the rest of the paper, for simplicity, we use the term 'biclique' to denote 'complete sub-matrix'.
As a few species are sequenced for many genes and a few genes are sequenced for many species in the real-world species-genes data, in the model, we assign each species s_i ∈ S a Sequence-Availability (shortly denoted as SA) probability value p_i^S and each gene g_j ∈ G an SA probability value p_j^G as well. The greater the sequence-availability probability of s_i or g_j is, the larger the number of genes or species the corresponding species s_i or gene g_j is sequenced for. Therefore, two sets of SA probabilities are used in the model: one is the species SA probability distribution p^S = (p_1^S, p_2^S, . . . , p_m^S), and the other is the genes SA probability distribution p^G = (p_1^G, p_2^G, . . . , p_n^G), where m is the number of species (m = |S|), n is the number of genes (n = |G|), and p_i^S, p_j^G ∈ [0,1]. Then the species-genes data matrix W = (w_ij) (or sequence availability matrix) can be simulated from the SA probability distributions p^S and p^G of m species and n genes by the model described as follows.

Model of species-genes data (or sequence availability data)
Input: species SA distribution p^S and genes SA distribution p^G;
Output: species-genes data W_{m x n}.
1. generate a uniformly random value p ∈ [0,1];
2. the sequence is available for the species s_i and the gene g_j (i.e., w_ij is set to 1) if p ≤ p_i^S or p ≤ p_j^G; otherwise, the sequence is missing for s_i and g_j (i.e., w_ij is set to 0);
3. iterate steps 1 and 2 for all species and genes.
Fig. 1. Construction process of phylogenetic trees from species-genes data: from a list of genes and the species for which sequences of those genes are available (at left), to the species-genes matrix (at right), complete sub-matrix (or biclique) discovery, sequence concatenation in a biclique, and phylogenetic tree construction.
In this model, the larger p_i^S and p_j^G are, the more likely the species s_i is sequenced for the gene g_j (i.e., the corresponding w_ij is 1). Therefore, the SA probability of a species (or gene) determines whether this species (or gene) is sequenced more or less. With this model, we are interested in quantitatively answering two questions that are already partially and qualitatively answered in the observed characteristics mentioned above by Sanderson et al. (2003): (i) "What are the sequence availability distributions of species and genes, p^S and p^G?" We already know they are skewed from the first characteristic aforementioned, and we further want to know "how skewed are they?" (ii) "What are the relationships of bicliques?" We already know many bicliques overlap in the way described in the second characteristic aforementioned, but we further want to know "what are the structures of these overlapping bicliques and to what extent do they overlap?"
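As a sanity check of the model, generating synthetic data from given SA distributions takes only a few lines of code. The sketch below is our own transcription of the model box above (with the availability test reconstructed as "p ≤ p_i^S or p ≤ p_j^G", i.e., w_ij = 1 with probability max(p_i^S, p_j^G)); the power-law-shaped choice of SA values is a hypothetical example, not a fit to real data.

```python
import random

def simulate_species_genes(ps, pg):
    """Simulate a species-genes (sequence availability) matrix W from
    the SA probabilities ps (per species) and pg (per gene), following
    the model box above: one fresh uniform draw p per (i, j) pair, and
    w_ij = 1 iff p <= ps[i] or p <= pg[j]."""
    return [[1 if random.random() <= max(psi, pgj) else 0 for pgj in pg]
            for psi in ps]

# Hypothetical power-law-shaped SA values: the i-th ranked species/gene
# gets an SA probability proportional to (i + 1)^(-1) (cf. footnote c).
m, n = 20, 20
ps = [(i + 1) ** -1.0 for i in range(m)]
pg = [(j + 1) ** -1.0 for j in range(n)]
W = simulate_species_genes(ps, pg)
```

Sorting the rows and columns of W by decreasing SA probability then makes the structure discussed below visible in the left-top corner.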
Therefore, to answer the first question, a useful method is to try different statistical distributions as the SA distributions of species and genes, i.e., p^S and p^G in the model, and to compare the simulated species-genes data with the real data in terms of both matrix similarity and sequence numbers' distributions^b. After answering the first question by determining which statistical distribution the real species-genes data follows, we can simulate species-genes data using the model, and then more insights and understanding of species-genes data for the second question may be obtained from the study of the simulated species-genes data. In the following, we report our results regarding the above two questions one by one. Conclusion 1: sequence availability probability distributions of species and genes follow power law. To answer the first question and determine what the SA probability distributions in real data are, we utilized three types of statistical distributions: (a) uniform, (b) normal and (c) power
^b The sequence number of a species (or gene) is the total number of sequences available for this species (or gene) across all genes (or species). Specifically, the sequence number d(s_i) of the i-th species s_i is d(s_i) = Σ_{j=1}^n w_ij. Similarly, the sequence number d(g_j) of the j-th gene g_j is d(g_j) = Σ_{i=1}^m w_ij.
law^c. Their skewness increases from type (a) to type (c). In this paper, we simply let p^S = p^G for emphasizing only the difference between distribution types. In practice, to make the simulated data closer to the real data, different parameters of the distribution can be tested for p^S and p^G. The corresponding simulated species-genes matrices of the distributions (a), (b) and (c) are shown in Fig. 2(a), 2(b) and 2(c), respectively. For comparison, two real-world species-genes data matrices are also drawn in Fig. 2(d) and 2(e). All matrices in Fig. 2 are rearranged by the decreasing order of sequence numbers of species and genes. It can be clearly observed that, with the increase of the distribution's skewness from type (a) to type (c), the first characteristic aforementioned becomes more and more obvious, e.g., the matrices become more and more sparse and skewed. In addition to comparing simulated and real data by matrix similarity, we can also compare their sequence numbers' distributions of species and genes. They are plotted in the form of log-log cumulative distributions^d. We found that in the two real data sets shown in Fig. 2(d) and 2(e), species and genes' sequence numbers' distributions in log-log form are roughly straight lines and thus follow power law. When observing the three simulated data sets from type (a) to type (c), their sequence numbers' distributions evolve from curves to lines. Therefore, the data simulated by the model with statistical distribution type (c) is much closer to real species-genes data. Hence we draw the conclusion that sequence availability probability distributions of species and genes in real data follow statistical distribution type (c) - power law. Although this is not a surprising result, it gives us an idea of how skewed real species-genes data is. Furthermore, it provides us a way to study the structures and properties of real species-genes data. Next, we employ this model with the power law SA distributions of species and genes to study the second question. Conclusion 2: the most distinct overlapping structure of bicliques is a ladder structure. Based on Conclusion 1, we used our model
with the power law SA distributions of species and genes to generate a species-genes data matrix with 20 species and 20 genes. After rearranging the simulated matrix in the decreasing order of p^S and p^G as shown in Fig. 3, a distinct structure is revealed in the left-top corner of W. Bicliques are easily identified, and the overlapping relationships among bicliques are also clearly shown in Fig. 3. Each box framed by dotted lines in Fig. 3 is a biclique. Therefore, this distinct structure is useful not only for locating bicliques, but also for intuitively revealing the overlapping relationships among these bicliques. In this structure, bicliques overlap in the way that confirms what Sanderson et al. (2003) described: "many of bicliques overlap, and for any given biclique there are generally bicliques which have either slightly more species and slightly fewer genes, or slightly more genes and fewer species". As this structure is like a ladder, we call it the "ladder structure". It can also be found that the sequence availability probability actually plays the role of measuring the contribution of each species and gene to the ladder structure. Therefore, although some species or genes have small sequence numbers, their sequence availability probabilities are very high. For example, the 10th species has only 4 sequences available, much fewer than the last (20th) species with 8 sequences. Hence, the ladder structure can be identified in the left-top corner of the species-genes matrix rearranged by the decreasing order of the estimated p^{S*} and p^{G*}, not by the decreasing order of sequence numbers' distributions. As this small simulated species-genes data in Fig. 3 is generated by the same model with the same SA distributions of species and genes and parameters as the large simulated data in Fig. 2(c), it is reasonable to infer that, in real data, "the most distinct overlapping structure of bicliques is the ladder structure" and "the distinct ladder structure in real data can be discovered by the estimated SA distributions p^{S*} and p^{G*}". From the above model analysis of species-genes data, two conclusions are useful for effective and efficient phylogenetic tree inference. Especially, the lad-
^c The power law distribution follows the rule P(x) = a·x^{-β}. It can be seen as a straight line on a log-log figure. For more detail refer to http://en.wikipedia.org/wiki/Power_law.
^d For a species (or gene) sequence number k, the number of species (or genes) that have sequence number higher than k.
(a) Uniform   (b) Normal (μ = 0, σ = 1)   (c) Power law (α = 1, β = 1)
(d) GenBank data
( e ) SwissProt data
Fig. 2. Comparison of matrices and sequence numbers' distributions among real and simulated species-genes data. In the species-genes matrices, a dot indicates the existence of a sequence for that species and gene. Species are sorted vertically by their number of sequences, and genes are sorted horizontally by the number of taxa for which they have been sequenced. Sequence numbers' log-log cumulative distributions of species and genes are plotted at the bottom of each species-genes data matrix. In each plot, the x-axis is the sequence number k of a species (or gene), and the y-axis is the number of species (or genes) that have sequence number higher than k. (a), (b), and (c): data simulated by the model with uniform, normal and power law distributions of sequence-availability probabilities of species and genes. (d) GenBank data: the most-represented species (Arabidopsis thaliana) is at the top, and the most heavily sequenced gene (rbcL) is on the left. (e) SwissProt data. Images in (d) and (e) show the most heavily skewed corner of the species-genes matrix; the remainder of the matrix is very sparse.
der structure in the second conclusion, which was not discovered before, can be helpful for biclique discovery and tree reconstruction. Once the ladder structure is discovered, bicliques can be easily identified (as illustrated in Fig. 3). In the rest of the paper, we will focus on discovering the distinct ladder structure in real species-genes data by estimating its SA distributions.
3. PROBLEM OF ESTIMATING SA PROBABILITY DISTRIBUTIONS
According to the model introduced above, the estimated sequence-availability distributions of species and genes are key to the discovery of the ladder structure. To estimate the SA distributions in the real species-genes data, we introduce an availability probability p_ij = p_i^S p_j^G for the sequence existence in the i-th species s_i and the j-th gene g_j. Then a sequence availability probability matrix P = (p_ij)_{m x n} can be constructed. To measure how well this sequence availability probability matrix P approximates the actual sequence availability matrix (real data W), we introduce a function called the "Accumulated Probability Function of Sequence Availability", denoted as P(p^S, p^G, W) = Σ_{i=1}^m Σ_{j=1}^n w_ij p_i^S p_j^G, to accumulate the availability probabilities of the sequences existing in the real species-genes data W. Intuitively, the p^{S*} and p^{G*} that maximize P(p^S, p^G, W) will make the
matrix P approximate the matrix W to the maximal extent. Therefore, by maximizing the function P(p^S, p^G, W), we can obtain the estimated p^{S*} and p^{G*} for the real species-genes data W. Then the distinct ladder structure hidden in the real species-genes data W can be discovered in the left-top corner of W reordered by the decreasing orders of p^{S*} and p^{G*}. In practice, to limit the value range of the function P(p^S, p^G, W), constraints on p^S and p^G are added to the problem formulation. The constraints are only the normalization of p^S and p^G and thus will not affect the discovery of the ladder structure. It is formally expressed as follows:

$$\arg\max_{\mathbf{p}^S \in \Delta_+^m,\; \mathbf{p}^G \in \Delta_+^n} \mathcal{P}(\mathbf{p}^S, \mathbf{p}^G, W) \qquad (1)$$

where $\Delta_+^n = \{\mathbf{x} \in \mathbb{R}^n : \sum_{i=1}^n x_i = 1,\ x_i \ge 0\ (i = 1, \ldots, n)\}$ denotes a superplane in n-dimensional non-negative vector space. In the next section, we will propose an efficient algorithm to maximize P(p^S, p^G, W).
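For reference, the objective being maximized in Eq. (1) is a single weighted sum, transcribed directly below (Python, our notation):

```python
def accumulated_probability(W, ps, pg):
    """Accumulated Probability Function of Sequence Availability:
    P(ps, pg, W) = sum over i, j of w_ij * ps[i] * pg[j]."""
    return sum(w * psi * pgj
               for row, psi in zip(W, ps)
               for w, pgj in zip(row, pg))
```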
4. ALGORITHM OF ESTIMATING SA PROBABILITY DISTRIBUTIONS: GENERALIZED REPLICATOR DYNAMICS
In this section, to simplify the notation, we replace p^S with x and p^G with y. In the case of a symmetric
matrix, let p^S = p^G = x. As proved in 9, in the case that W is symmetric, replicator dynamics is able to approximate the maximization problem of P(x, x, W) in Eq.(1). In this section, we firstly introduce replicator dynamics and then propose a novel discrete dynamical system, which generalizes replicator dynamics from symmetric matrices to general matrices. Therefore, this new dynamical system is called "Generalized Replicator Dynamics" (shortly denoted as GRD). As GRD is developed based on all the evolutionary concepts of replicator dynamics, e.g., the natural selection model, GRD can efficiently approximate the maximization problem of P(x, y, W) in Eq.(1). Like replicator dynamics, we also provide a concrete proof to guarantee the optimization ability of GRD.

Fig. 3. Ladder structure in a small simulated species-genes matrix, a miniature of species-genes data. Its SA probability distributions of species and genes follow the power law distribution with α = 1 and β = 1. A black dot indicates a non-zero value in both matrices. It can be seen that the ladder structure (framed by solid thick grey lines) exists in the simulated species-genes data, and bicliques (framed by lines of different types) in this structure overlap each other. The 10th species (4 sequences available) and 20th species (8 sequences available) are labeled in the left part of the figure; rows and columns are arranged in decreasing order of the SA probabilities p^{S*} and p^{G*}.

4.1. Replicator Dynamics
Replicator dynamics is one of the population dynamical methods, which is also a kind of discrete dynamical system. It was first introduced and studied in evolutionary game theory to model the evolution of animal behavior 10. Motivated by population evolution, the idea of replicator dynamics has been independently studied in many fields, such as population genetics 4, mathematical ecology 3, and computer vision 13. Replicator dynamics is based on the classical selection model, which studies the effect of selection upon a population. The differential viabilities of the genotypes are the key of selection. Consider a single chromosomal locus with n alleles A_1, . . . , A_n. Let x_1^{(t)}, . . . , x_n^{(t)} denote the gene frequencies at the mating stage in the parental generation (the t-th generation). The assumption of random mating leads to x_i^{(t)} x_j^{(t)} for the probability that a zygote carries the gene pair (A_i, A_j). Let w_ij be the probability that an (A_i, A_j)-individual survives to adult age. Since the gene pairs (A_i, A_j) and (A_j, A_i) belong to the same genotype, the selective value w_ij ≥ 0 and w_ij = w_ji. The selection matrix W = (w_ij)_{n x n} is therefore symmetric. If N is the number of zygotes in the new generation, the (t+1)-th generation, then x_i^{(t)} x_j^{(t)} N of them carry the gene pair (A_i, A_j), of which w_ij x_i^{(t)} x_j^{(t)} N survive to adulthood. Therefore, the total number of individuals reaching the mating stage is Σ_{i,j=1}^n w_ij x_i^{(t)} x_j^{(t)} N. Let f_ij denote the frequency of the gene pair (A_i, A_j) in the adult stage of the (t+1)-th generation; we can obtain

$$f_{ij} = \frac{w_{ij}\, x_i^{(t)} x_j^{(t)}}{\sum_{k,l=1}^{n} w_{kl}\, x_k^{(t)} x_l^{(t)}}. \qquad (2)$$

Since x_i^{(t+1)} is the frequency of the allele A_i in the adult stage of the (t+1)-th generation, we have x_i^{(t+1)} = Σ_{j=1}^n f_ij. This leads to the relation

$$x_i^{(t+1)} = x_i^{(t)}\, \frac{\sum_{j=1}^{n} w_{ij}\, x_j^{(t)}}{\sum_{k,l=1}^{n} w_{kl}\, x_k^{(t)} x_l^{(t)}}. \qquad (3)$$

Eq.(3) is the selection model. It can be rewritten in matrix form as follows:

$$x_i^{(t+1)} = x_i^{(t)}\, \frac{(W\mathbf{x}^{(t)})_i}{\mathbf{x}^{(t)T} W \mathbf{x}^{(t)}}, \qquad (4)$$

where (W x^{(t)})_i denotes the i-th component of the vector W x^{(t)}, and the state of the gene pool of the t-th generation is given by the vector x^{(t)} = (x_1^{(t)}, . . . , x_n^{(t)})^T of gene frequencies. x^{(t)} has non-negative components summing up to one, and belongs to the simplex Δ_+^n. Eq.(4) describes the action of selection from one generation to the next, and therefore the map sending x^{(t)} to x^{(t+1)} defines a discrete dynamical system on the space Δ_+^n, called Replicator Dynamics.
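A single generation of Eq. (4) is a one-liner per component; the minimal sketch below (a toy illustration in our notation) makes the normalization by the average fitness explicit:

```python
def replicator_step(W, x):
    """One iteration of Eq. (4): x_i <- x_i * (W x)_i / (x^T W x),
    for a non-negative symmetric matrix W and a frequency vector x
    on the simplex."""
    Wx = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    mean_fitness = sum(xi * v for xi, v in zip(x, Wx))
    return [xi * v / mean_fitness for xi, v in zip(x, Wx)]
```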
Fig. 4. Alleles A_i or B_j as vertices and their mating survival probabilities w_ij as edge weights in (a) replicator dynamics and (b) generalized replicator dynamics.
Definition 4.1 (Replicator Dynamics). Let W_{n x n} be a non-negative symmetric matrix. Given the vector x^{(t)} = (x_1^{(t)}, . . . , x_n^{(t)})^T ∈ Δ_+^n being the status of the system in the t-th iteration, we define the dynamical system as Eq.(4).
Since the selection model from evolutionary biology defines a discrete dynamical system, replicator dynamics, we are interested in its stationary states and optimization ability. Before that, we first introduce the average fitness of the population.
Definition 4.2 (Average Fitness of Population in Selection Model). Given x_i^{(t)} x_j^{(t)} the frequency of the zygote of (A_i, A_j) and the selective value w_ij the probability that it survives to adult age, we define Σ_{i,j=1}^n w_ij x_i^{(t)} x_j^{(t)} as the average fitness (or average selective value) of the population in the t-th generation. The average fitness can be written in matrix form as P(x^{(t)}, x^{(t)}, W) = x^{(t)T} W x^{(t)}.
The fundamental theorem of natural selection tells us that under the selection model, the average fitness increases from generation to generation. Refer to 9, 10 for a detailed proof of this theorem.
Theorem 4.1 (Fundamental Theorem of Natural Selection by Replicator Dynamics). For the replicator dynamics given by Eq.(4), the average fitness P(x^{(t)}, x^{(t)}, W) increases with the generation t in the sense that

$$\mathcal{P}(\mathbf{x}^{(t+1)}, \mathbf{x}^{(t+1)}, W) \ge \mathcal{P}(\mathbf{x}^{(t)}, \mathbf{x}^{(t)}, W), \qquad (5)$$

with equality if and only if x^{(t)} is an equilibrium point x*.
4.2. Generalized Replicator Dynamics
The selection model above is based on the selection matrix W_{n x n} that describes the survival probability of the zygotes of any two alleles (A_i, A_j). Therefore, W is symmetric and is the adjacency matrix of a weighted graph G(A, W), whose vertex set A is the alleles and whose edge weights are the w_ij in W. This weighted graph is shown in Fig. 4(a). In this section, we generalize the replicator dynamics to a more general selection matrix W_{m x n} that denotes the survival probability of the zygotes of any two alleles (A_i, B_j) from allele types A and B. We suppose that there are two types (or sets) of alleles A = {A_1, . . . , A_m} and B = {B_1, . . . , B_n}. There are restrictions of mating in these two types of alleles: mating can only happen between different types of alleles. For example, the allele A_i can mate with any B-type allele B_j, but never with any other A-type allele. Therefore, the selection matrix W_{m x n} and the two sets of alleles A and B form a bipartite graph, as shown in Fig. 4(b). Let x_1^{(t)}, . . . , x_m^{(t)} denote the gene frequencies of the A-type alleles A_1, . . . , A_m, and y_1^{(t)}, . . . , y_n^{(t)} the gene frequencies of the B-type alleles B_1, . . . , B_n, at the mating stage in the parental generation (the t-th generation). The assumption of random mating leads to x_i^{(t)} y_j^{(t)} for the probability that a zygote carries the gene pair (A_i, B_j). If N is the number of zygotes in the new generation, the (t+1)-th generation, then x_i^{(t)} y_j^{(t)} N of them carry the gene pair (A_i, B_j), of which w_ij x_i^{(t)} y_j^{(t)} N survive to adulthood. Therefore, the total number of individuals reaching the mating stage is Σ_{i=1}^m Σ_{j=1}^n w_ij x_i^{(t)} y_j^{(t)} N. Let f_ij denote the frequency of the gene pair (A_i, B_j) in the adult stage of the (t+1)-th generation; we can obtain

$$f_{ij} = \frac{w_{ij}\, x_i^{(t)} y_j^{(t)}}{\sum_{k=1}^{m}\sum_{l=1}^{n} w_{kl}\, x_k^{(t)} y_l^{(t)}}. \qquad (6)$$

Since x_i^{(t+1)} is the frequency of the allele A_i in the adult stage of the (t+1)-th generation, we have x_i^{(t+1)} = Σ_{j=1}^n f_ij. This leads to the relation

$$x_i^{(t+1)} = x_i^{(t)}\, \frac{\sum_{j=1}^{n} w_{ij}\, y_j^{(t)}}{\sum_{k=1}^{m}\sum_{l=1}^{n} w_{kl}\, x_k^{(t)} y_l^{(t)}}.$$
It can be rewritten in matrix form as follows:

$$x_i^{(t+1)} = x_i^{(t)}\, \frac{(W\mathbf{y}^{(t)})_i}{\mathbf{x}^{(t)T} W \mathbf{y}^{(t)}}, \quad i = 1, \ldots, m. \qquad (7)$$

For the B-type alleles, since y_j^{(t+1)} is the frequency of the allele B_j in the adult stage of the (t+1)-th generation, we have y_j^{(t+1)} = Σ_{i=1}^m f'_ij, where f'_ij is computed according to Eq.(6) by substituting x_i^{(t)} with x_i^{(t+1)}. This leads to the relation

$$y_j^{(t+1)} = y_j^{(t)}\, \frac{\sum_{i=1}^{m} w_{ij}\, x_i^{(t+1)}}{\sum_{k=1}^{m}\sum_{l=1}^{n} w_{kl}\, x_k^{(t+1)} y_l^{(t)}}.$$

Its matrix form is

$$y_j^{(t+1)} = y_j^{(t)}\, \frac{(W^T \mathbf{x}^{(t+1)})_j}{\mathbf{y}^{(t)T} W^T \mathbf{x}^{(t+1)}}, \quad j = 1, \ldots, n. \qquad (8)$$

The state of the gene pool of the t-th generation is given by the vector x^{(t)} = (x_1^{(t)}, . . . , x_m^{(t)})^T of gene frequencies in the A-type alleles and the vector y^{(t)} = (y_1^{(t)}, . . . , y_n^{(t)})^T of gene frequencies in the B-type alleles. x^{(t)} and y^{(t)} have non-negative components summing up to one, and belong to the simplexes Δ_+^m and Δ_+^n respectively. Eq.(7) and Eq.(8) are the generalized selection model for the two types of alleles A and B. It describes the action of selection between two types of alleles from one generation to the next, and therefore the map sending x^{(t)} and y^{(t)} to x^{(t+1)} and y^{(t+1)} defines a discrete dynamical system on the spaces Δ_+^m and Δ_+^n, called Generalized Replicator Dynamics (GRD).

Definition 4.3 (Generalized Replicator Dynamics). Let W_{m x n} be a non-negative matrix. Given the vector x^{(t)} = (x_1^{(t)}, . . . , x_m^{(t)})^T ∈ Δ_+^m and the vector y^{(t)} = (y_1^{(t)}, . . . , y_n^{(t)})^T ∈ Δ_+^n being the status of the system in the t-th iteration, we define the discrete dynamical system as Eq.(7) and Eq.(8).

Correspondingly, we studied the fixed points and optimization ability of the generalized replicator dynamics. Next, the average fitness of the population and the fundamental theorem of natural selection in the generalized selection model are given.

Definition 4.4 (Average Fitness of Population in Generalized Selection Model). Given x_i^{(t)} y_j^{(t)} the frequency of the zygote of (A_i, B_j) and the selective value w_ij the probability that it survives to adult age, we define Σ_{i=1}^m Σ_{j=1}^n w_ij x_i^{(t)} y_j^{(t)} as the average fitness (or average selective value) of the population in the t-th generation. The average fitness in matrix form is P(x^{(t)}, y^{(t)}, W) = x^{(t)T} W y^{(t)} = y^{(t)T} W^T x^{(t)}, and is therefore the same as the accumulated probability function introduced in Section 3 for a bipartite graph G(A, B, W), where A and B are the two sets of alleles representing the vertices.

Theorem 4.2 (Fundamental Theorem of Natural Selection by Generalized Replicator Dynamics). For the generalized replicator dynamics given by Eq.(7) and Eq.(8), the average fitness P(x^{(t)}, y^{(t)}, W) increases with the generation t in the sense that

$$\mathcal{P}(\mathbf{x}^{(t+1)}, \mathbf{y}^{(t+1)}, W) \ge \mathcal{P}(\mathbf{x}^{(t)}, \mathbf{y}^{(t)}, W), \qquad (9)$$

with equality if and only if x^{(t)} and y^{(t)} are two equilibrium points x* and y* respectively.

Proof. See Appendix A.

If we let W be symmetric, x and y are associated with the same set of vertices and thus equal each other. Hence Eq.(7) and Eq.(8) reduce to Eq.(4), and therefore replicator dynamics becomes a special instance of the generalized replicator dynamics. In practice, about 50 iterations are enough for the generalized replicator dynamics to converge. Therefore, its computational complexity is O(k(2h + m + n)), where k is the number of iterations, and h, m and n are the number of non-zeros, rows and columns in W respectively. Ignoring k, the final complexity is O(2h + m + n). Therefore, the generalized replicator dynamics is very efficient.
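Putting Eq. (7) and Eq. (8) together gives a complete estimation procedure for p^{S*} and p^{G*}. The sketch below follows the remark that about 50 iterations suffice; the dense list-of-lists representation and the tolerance-based early exit are our own simplifications (the authors' MATLAB implementation presumably exploits the sparsity of W to reach the O(k(2h + m + n)) bound).

```python
def generalized_replicator_dynamics(W, iters=50, tol=1e-9):
    """Estimate ps* and pg* by iterating Eq. (7) and Eq. (8).

    A sketch under stated assumptions: W is an m x n non-negative
    matrix (list of lists) with no all-zero row or column, so the
    normalizing denominators stay positive.
    """
    m, n = len(W), len(W[0])
    x = [1.0 / m] * m                    # uniform start on the simplex
    y = [1.0 / n] * n
    fitness = 0.0
    for _ in range(iters):
        # Eq. (7): x_i <- x_i (W y)_i / (x^T W y)
        Wy = [sum(w * yj for w, yj in zip(row, y)) for row in W]
        xWy = sum(xi * v for xi, v in zip(x, Wy))
        x = [xi * v / xWy for xi, v in zip(x, Wy)]
        # Eq. (8): y_j <- y_j (W^T x)_j / (y^T W^T x), with the new x
        WTx = [sum(W[i][j] * x[i] for i in range(m)) for j in range(n)]
        yWTx = sum(yj * v for yj, v in zip(y, WTx))
        y = [yj * v / yWTx for yj, v in zip(y, WTx)]
        # By Theorem 4.2 the average fitness P(x, y, W) never decreases.
        new_fitness = sum(x[i] * W[i][j] * y[j]
                          for i in range(m) for j in range(n))
        if new_fitness - fitness < tol:
            break
        fitness = new_fitness
    return x, y                          # estimated ps* and pg*
```

Sorting the species by decreasing x and the genes by decreasing y, and permuting W accordingly, then collects the ladder structure (if present) in the left-top corner of the reordered matrix.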
5. EMPIRICAL STUDY
To test if GRD is able to estimate the sequence-availability distributions for discovering the target pattern we proposed, the ladder structure, we apply GRD to two data sets collected from GenBank and Swiss-Prot respectively, which were published in 5^e. Their species-genes data matrices are shown in Fig. 2(d) and 2(e). To validate the effectiveness of phylogenetic inference from the ladder structure, bicliques

^e Both are available at http://ginger.ucdavis.edu/sandlab/www-data
(a) the left-top corner of the reordered species-genes matrix.
(b) biclique 11 x 51, frame with dash-dot lines in (a)
(c) biclique 20 x 21, frame with solid lines in (a)
(d) biclique 33 x 5, frame with dashed lines in (a)
Fig. 5. GenBank: the phylogenetic trees of selected bicliques obtained from the 100 x 100 submatrix computed by the generalized replicator dynamics. The numbers on the branches of the phylogenetic trees are bootstrap support values. A black dot in (a) indicates a non-zero value in the matrix. The ladder structure (framed by solid thick grey lines) in (a) can be clearly seen.
in this overlapping structure are manually selected for investigation. This process has been illustrated in Fig. 1. In detail, given a biclique, following the steps described in 5, the sequences in the biclique are first concatenated and aligned using CLUSTALW with default options. Protein parsimony is used to construct trees, then bootstrap analysis (500 replicates) is applied to assess the reliability of the trees, and finally the consensus tree is the eventual output. We implemented the generalized replicator dynamics in MATLAB, and all the experiments were performed on a computer system with a Pentium 4 CPU at 1.80GHz and 512MB of RAM. GenBank data is extracted from the GenBank database. It contains 16,348 species and 59,144 genes. According to the result published in 5, there are 5587 bicliques with at least four species and two genes in the data. However, in this published result, the most distinct overlapping relationship (i.e., the ladder structure) among the 5587 bicliques is not revealed. It took GRD less than 1 second to
discover the distinct ladder structure from the GenBank data, while it took the biclique enumeration algorithm more than 900 seconds to find the bicliques. After obtaining the estimated SA distributions of species and genes, i.e., p^{S*} and p^{G*}, we reorder the species-genes data matrix by the decreasing order of p^{S*} and p^{G*}. According to our analysis in Section 2, the most distinct ladder structure is collected in the left-top corner of the reordered matrix. Therefore, as the original matrix is very large, we only show the first 100 species and 100 genes in the left-top corner of the reordered matrix in Fig. 5(a). From this figure, a distinct ladder structure can be clearly seen, and a lot of bicliques overlap in this ladder structure. Among these overlapping bicliques, we select three types of bicliques with different sizes: few species and many genes, balanced numbers of species and genes, and many species and few genes. They are 11 x 51, 20 x 21 and 33 x 5, and are framed in different types of dotted lines in Fig. 5(a). Their corresponding phylogenetic trees are also shown in Fig. 5(b-d). When
(a) the left-top corner of the reordered species-genes matrix.
(b) biclique 8 x 62, frame with dash-dot lines in (a)
(c) biclique 13 x 17, frame with solid lines in (a)
(d) biclique 25 x 7, frame with dashed lines in (a)
Fig. 6. SwissProt: the phylogenetic trees of selected bicliques obtained from the 100 x 100 submatrix computed by the generalized replicator dynamics. The numbers on the branches of the phylogenetic trees are bootstrap support values. A black dot in (a) indicates a non-zero value in the matrix. The ladder structure (framed by solid thick grey lines) in (a) can be clearly seen.
investigating these three trees, we found that the phylogenetic tree in Fig. 5(c) provides strong support for major clades within green plants, such as eudicots, flowering plants, seed plants, and vascular plants. In contrast, the trees in Fig. 5(b) and 5(d) are not so informative. Phylogenetic trees from other bicliques with balanced numbers of species and genes in the same ladder structure also give results similar to those shown in Fig. 5(c). We found that the relative positions of different organisms on these trees are not affected. In other words, organisms within the same major clade are closely clustered together, e.g., Rosids, Asterids, Monocots, Conifers, Ferns and Green algae. This indicates that bicliques from the distinct ladder structure keep the inference of phylogenetic trees stable. This result shows that the distinct ladder structure is easy to discover and useful for phylogenetic inference. Furthermore, our model can not only locate many overlapping bicliques efficiently and effectively, but also reveal that these overlapping bicliques keep a similar and stable phylogenetic structure, which provides more useful information to
phylogeneticists for comparing and evaluating phylogenetic trees from species-genes data. This is something the biclique enumeration method cannot provide. The SwissProt data set is extracted from the Swiss-Prot database. It contains 7449 species and 64,712 genes. On this data set we obtained results similar to those for the GenBank data; they are shown in Fig. 6. As with the GenBank data, a ladder structure can be discovered in the SwissProt data, as shown in Fig. 6(a). Similarly, three bicliques are easily selected for building phylogenetic trees; these trees are presented in Fig. 6(b-d). The results on the SwissProt data further verify the conclusions about the effectiveness and efficiency of our model.
6. CONCLUSIONS AND FUTURE WORK
To better infer the evolutionary history of species, we build a model to understand and analyze species-genes data, also called sequence availability data. As previous works on species-genes data provide only a qualitative and rough view of this type of data, in this paper we built a model to analyze
it in a quantitative way. Through this model, two conclusions are obtained: (1) It is the skewness of the sequence-availability probability distributions of species and genes that contributes to the sparseness and skewness of real-world species-genes data; further, the sequence-availability probability distributions of species and genes follow a power law. (2) By estimating the sequence-availability probability distributions of species and genes in real data, a distinct ladder structure is discovered, that is, an overlapping structure of bicliques. To estimate the sequence-availability probability distributions of species and genes in real data and thereby find the distinct ladder structure, we proposed a novel evolutionary dynamical system, called "generalized replicator dynamics", which generalizes the popular biological system of replicator dynamics. It is based on the fundamental theorem of natural selection, and it converges to and approximates the solution of the maximization problem we formulated for the model estimation. We have conducted experiments on two species-genes data sets, and the results have shown the effectiveness of our model in understanding and analyzing species-genes data for efficient phylogenetic inference. There are a number of avenues for future work. Because there is more than one ladder structure in real data, algorithms based on the model need to be developed to find the other ladder structures. Beyond the algorithmic aspect, more experiments with our model on other species-genes data sets are needed to observe further properties and characteristics of the data for effective phylogenetic inference.
References
1. Alexe G., Alexe S., Crama Y., Foldes S., Hammer P.L., Simeone B. Consensus algorithms for the generation of all maximal bicliques. Discrete Applied Mathematics 2004; 145(1): 11-21.
2. Bapteste E., Brinkmann H., Lee J.A., Moore D.V., Sensen C.W., Gordon P., Duruflé L., Gaasterland T., Lopez P., Muller M., Philippe H. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 2002; 99(3): 1414-1419.
3. Baum L.E., Eagon J.A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc. 1967; 73: 360-363.
4. Crow J.F., Kimura M. An Introduction to Population Genetics Theory. Harper & Row, New York, 1970.
5. Driskell A.C., Ané C., Burleigh J.G., McMahon M.M., O'Meara B.C., Sanderson M.J. Prospects for building the tree of life from large sequence databases. Science 2004; 306: 1172-1174.
6. Felsenstein J. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods in Enzymology 1996; 266: 419-427.
7. Felsenstein J. Inferring Phylogenies. Sinauer Press, 2003.
8. Hershkovitz M.A., Leipe D.D. Phylogenetic analysis. In: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, pages 189-230. Wiley Interscience, New York, 1998.
9. Hofbauer J., Sigmund K. The Theory of Evolution and Dynamical Systems. Cambridge University Press, 1988.
10. Hofbauer J., Sigmund K. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.
11. Murphy W.J., Eizirik E., Johnson W.E., Zhang Y.P., Ryder O.A., O'Brien S.J. Molecular phylogenetics and the origins of placental mammals. Nature 2001; 409: 614-618.
12. Page R.D.M., Holmes E.C. Molecular Evolution: a Phylogenetic Approach. Blackwell, 1998.
13. Pelillo M. The dynamics of nonlinear relaxation labeling processes. J. Math. Imaging Vision 1997; 7(4): 309-323.
14. Qiu Y.L., Bernasconi-Quadroni F., Soltis D.E., Soltis P.S., Zanis M., Zimmer E.A., Chen Z., Savolainen V., Chase M.W. The earliest angiosperms: evidence from mitochondrial, plastid and nuclear genomes. Nature 1999; 402(6760): 404-407.
15. Russo C.A.M., Takezaki N., Nei M. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 1996; 13(3): 525-536.
16. Sanderson M.J., Driskell A.C. The challenge of constructing large phylogenetic trees. TRENDS in Plant Science 2003; 8(8): 374-379.
17. Sanderson M.J., Driskell A.C., Ree R.H., Eulenstein O., Langley S. Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Mol. Biol. Evol. 2003; 20(7): 1036-1042.
18. Swofford D.L., Olsen G.J., Waddell P.J., Hillis D.M. Phylogenetic inference. In: Molecular Systematics, pages 407-514. Sinauer Associates, Sunderland, Massachusetts, 2nd edition, 1996.
19. Wiens J.J. Missing data and the design of phylogenetic analyses. Journal of Biomedical Informatics 2006; 39(1): 34-42.
APPENDIX A: PROOF OF THEOREM 4.2
For simplicity, Eq. (7) and Eq. (8) are rewritten as
$$x'_i = \frac{x_i (Wy)_i}{\sum_{k=1}^{n} x_k (Wy)_k}, \qquad (10)$$
$$y'_j = \frac{y_j (W^T x')_j}{\sum_{k=1}^{n} y_k (W^T x')_k}, \qquad (11)$$
where $x$ and $x'$ represent $x(t)$ and $x(t+1)$, respectively, and similarly for $y$ and $y'$. Correspondingly, Eq. (9) is rewritten as
$$P(x', y', W) \ge P(x, y, W). \qquad (12)$$
It is clearly seen that $P(x, y, W) = x^T W y = y^T W^T x = \sum_{ij} x_i w_{ij} y_j$. We will prove the following two inequalities step by step:
$$L_W(x', y') \ge L_W(x', y), \qquad (13)$$
$$L_W(x', y) \ge L_W(x, y). \qquad (14)$$

Proof of Inequality (13). Since we assume $x'^T W y > 0$, we have to show that
$$(x'^T W y)(x'^T W y') \ge (x'^T W y)^2. \qquad (15)$$
Clearly,
$$(x'^T W y)(x'^T W y') = (x'^T W y) \sum_{ij} x'_i w_{ij} y'_j. \qquad (16)$$
On replacing $y'_j$ by the expression in Eq. (11), we obtain
$$(x'^T W y)(x'^T W y') = \sum_j y_j \bigl((W^T x')_j\bigr)^2. \qquad (17)$$
Here, we use the Cauchy-Schwarz-Bunyakovsky inequality
$$\Bigl(\sum_j a_j b_j\Bigr)^2 \le \Bigl(\sum_j a_j^2\Bigr)\Bigl(\sum_j b_j^2\Bigr), \qquad (18)$$
with equality iff there is some value $c$ such that $a_j = c\,b_j$ for all $j$. By (18), taking $a_j = \sqrt{y_j}\,(W^T x')_j$ and $b_j = \sqrt{y_j}$ and using $\sum_j y_j = 1$, we obtain
$$(x'^T W y)^2 = \Bigl(\sum_j y_j (W^T x')_j\Bigr)^2 \le \Bigl(\sum_j y_j \bigl((W^T x')_j\bigr)^2\Bigr)\Bigl(\sum_j y_j\Bigr) = \sum_j y_j \bigl((W^T x')_j\bigr)^2. \qquad (19)$$
Combining Eq. (17) and Inequality (19), we prove Inequality (15). If $L_W(x', y') = L_W(x', y)$ in Inequality (13), the last estimate must be an equality, i.e., there must be a value $c$ such that $(W^T x')_j = c$ in Eq. (19) for all $j$. This means that $x'$ is an equilibrium.

Proof of Inequality (14). Similarly, we can follow the proof methodology of Inequality (13) to prove
$$(x^T W y)(x'^T W y) \ge (x^T W y)^2. \qquad (20)$$
If $L_W(x', y) = L_W(x, y)$ in Inequality (14), there must be a value $d$ such that $(Wy)_i = d$ for all $i$. This means that $y$ is an equilibrium.
RECONCILIATION WITH NON-BINARY SPECIES TREES
B. Vernot, M. Stolzer, A. Goldman, D. Durand*
Departments of Biological Sciences and Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
Email: {bvernot,mstolzer,aiton,durand}@{cs,andrew,cs,cs}.cmu.edu

Reconciliation is the process of resolving disagreement between gene and species trees, by invoking gene duplications and losses to explain topological incongruence. The resulting inferred duplication histories are a valuable source of information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. Reconciliation for binary trees is a tractable and well studied problem. However, a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and deep coalescence. We present the first formal algorithm for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Using a space-efficient mapping from gene to species tree, our algorithm infers the minimum number of duplications and losses in O(|V_G| · (k_S + h_S)) time, where |V_G| is the number of nodes in the gene tree, h_S is the height of the species tree, and k_S is the width of its largest multifurcation. We also present a dynamic programming algorithm for a combined loss model, in which losses in sibling species may be represented as a single loss in the common ancestor. Our algorithms have been implemented in NOTUNG, a robust, production quality tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.
Keywords: Reconciliation, non-binary species trees, polytomy, gene duplication, gene loss, lineage sorting, deep coalescence
1. INTRODUCTION
Reconciliation is the process of constructing a mapping between a gene family tree and a species tree. Under a model of duplication-loss parsimony, this mapping can be used to infer duplications and losses in the history of the gene family, as well as the species lineages in which these events occurred. Reconciliation is an essential method in molecular phylogenetics that is widely used in evolutionary applications in medicine, developmental biology and plant science. Reconciliation, following phylogeny reconstruction, is the most reliable approach for identifying orthologs for use in function prediction, gene annotation, planning experiments in model organisms, and identifying drug targets (e.g., Refs. 5, 40). Reconciliation is used to correlate specific duplications with the emergence of novel cellular functions or morphological features, providing clues to the functions of newly discovered genes (e.g., Refs. 8, 47). Minimizing duplications and losses provides a basis for rooting an unrooted tree7 and for selecting alternate gene or species tree topologies. Reconciliation is also the kernel of a related, but more
*Corresponding Author. [email protected]
complex, problem: inferring a species tree from many gene trees.21 High-throughput reconciliation tools are required for automated construction of databases of molecular phylogenies.19,33,37 Reconciliation of binary trees is a well-studied problem, and a number of software packages for this problem are available.10,28,29,37,49,50 However, standard reconciliation will not produce correct results when applied to a non-binary species tree. Discordance between a binary gene tree and a binary species tree is always evidence of gene duplication. In contrast, when the species tree is non-binary, two different processes can cause discordance between gene and species trees: gene duplication and incomplete lineage sorting. Since these are different biological phenomena with different consequences for the interpretation of phylogenetic studies, it is essential to distinguish between discordances that must be due to duplication (required duplications) and discordances that could be due to either duplication or incomplete lineage sorting (conditional duplications). Standard binary reconciliation cannot make this distinction. As the tree of life project gains momentum, it is
becoming evident that this is not a minor problem relegated to a few obscure species lineages. Rather, sixty-four percent of branch points in the NCBI taxonomy,46 one of the most widely used databases of species phylogenies, have more than two children. In addition, a number of well-documented analyses of simultaneous divergences have been reported.18,24,34,39 Reconciliation methods for non-binary trees are urgently needed. To our knowledge, no formal algorithms for reconciliation of non-binary species trees have been published.
Our contributions: To address this need, we present novel algorithms to find the minimum number of duplications and losses when reconciling a binary gene tree, T_G = (V_G, E_G), with a non-binary species tree, T_S = (V_S, E_S). We construct a mapping from nodes in T_G to sets of nodes in T_S that allows us to test efficiently whether a discordance at a given node is a conditional or required duplication. Our mapping is space efficient; the maximum size of the set labeling any node in T_G is O(k_S), where k_S is the maximum outdegree in T_S. Using this mapping, we present an efficient algorithm for reconciling a binary gene tree with a non-binary species tree under duplication-loss parsimony. Our algorithm infers all conditional and required duplications in O(|V_G| · (k_S + h_S)) time, where h_S is the height of T_S. We also present algorithms to infer the minimum number of gene losses. We estimate the time at which each loss occurred by assigning it to an edge in T_G. For binary species trees, this assignment is unambiguously determined by the reconciliation and is easily calculated. However, the lack of resolution in a non-binary tree leads to uncertainty about exactly when a loss occurred. Parsimony provides a principled basis for reducing this uncertainty: we assign losses in such a way that the total number of losses is minimized. We propose two minimization criteria and provide algorithms to infer losses for each one. The first criterion is based on the assumption that all losses were independent events (explicit losses). We infer the minimum set of explicit losses, again in O(|V_G| · (k_S + h_S)) time. The second criterion is motivated by the observation that, under certain circumstances, losses in sibling species can be explained by a single loss in their common ancestor. Under this assumption, we can reduce the number of losses by combining losses that share a parent, whenever possible. We present a dynamic program that minimizes the
number of combined losses, by considering all combinations of possible edge assignments for each loss to maximize the opportunities to combine losses. The worst-case time complexity of this algorithm is exponential, but its performance is good on real trees from typical data sets, as we demonstrate empirically. Our algorithms have been implemented in NOTUNG, a software package that takes trees in the widely used Newick format as input, permitting interoperability with a wide range of phylogeny reconstruction packages. The resulting reconciliation can be viewed and manipulated using an interactive graphical user interface. A batch processing interface for automated analysis of large phylogenetic data sets is also available. Our software is implemented in Java and runs on Windows, Unix and Mac OS X. It is freely available at http://www.cs.cmu.edu/~durand/Notung.
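Because NOTUNG consumes standard Newick, any phylogeny package that writes Newick can feed it. As a small, self-contained illustration (ours, not part of NOTUNG; Biopython is just one library that parses the format, and the tree string is hypothetical):

from io import StringIO
from Bio import Phylo

# Parse a Newick gene tree; reading from a file path works the same way.
newick = "((gene1_speciesA,gene1_speciesB),gene1_speciesC);"
gene_tree = Phylo.read(StringIO(newick), "newick")
Phylo.draw_ascii(gene_tree)               # quick textual inspection
for leaf in gene_tree.get_terminals():    # leaf names carry the species tag
    print(leaf.name)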
Roadmap: In the next section, we introduce notation, review the standard algorithm for reconciliation of binary trees, and summarize previous work on binary reconciliation. In Section 3, we review the relevant models of non-binary gene and species trees in the molecular evolution literature and give formal definitions for required and conditional duplications based on this foundation. Next we present our non-binary reconciliation algorithms. Duplications are discussed in Section 4. In Section 5, we present algorithms for inferring the minimum number of explicit and combined losses and compare our work to related work by other authors. In Section 6, we demonstrate the utility of our methods with analyses of real data sets using our software. In the conclusion, we discuss probabilistic approaches to reconciliation and describe directions for future work.

2. NOTATION AND BINARY RECONCILIATION

In this section, we introduce notation and review the standard reconciliation algorithm for binary trees. Let T_i = (V_i, E_i) be a rooted tree, where V_i is the set of nodes in T_i and E_i is the set of edges. L(T_i) is the leaf set of T_i, and L(v) refers to the leaf set of the subtree rooted at v ∈ V_i. C(v) and p(v) refer to the children and parent of v, respectively. If v is binary, r(v) and l(v) denote the right and left children of v. A non-binary node in a tree is referred to as an unresolved node or polytomy. A group of taxa is monophyletic if it corresponds to an ancestor and
all of its descendants. The root node of T_i is root(T_i). If u ∈ V_i lies on the path from v to root(T_i), we say that u ≥_i v. We follow the computer science convention, in which the root is at the top of the tree, the leaves are at the bottom, and p(g) is above g. In tree figures, g-s denotes a gene that is sampled from species s. The objective of reconciliation is to identify gene duplications and losses by fitting a gene tree to a species tree.12,30,36 Let T_G be a binary gene tree and T_S be a binary species tree such that the genes in L(T_G) were sampled from the species in L(T_S). A mapping M(·) is constructed from each node g ∈ V_G to a target node s ∈ V_S. If g ∈ L(T_G), M(g) is the species from which g was sampled. Otherwise, M(g) is the least common ancestor (LCA) of the target nodes of its children, i.e., M(g) = LCA(M(l(g)), M(r(g))). Using this mapping, g ∈ T_G is a duplication if a child of g maps to the same lineage as g, i.e., if M(g) = M(l(g)) and/or M(g) = M(r(g)). Otherwise, g is a speciation. A duplication at g indicates that a duplication occurred within the lineage leading to the ancestral species, M(g), and that two copies of the gene must have been present in M(g). In Fig. 1, for example, although the gene tree contains one gene sampled from each species, the topological disagreement between gene and species trees implies a duplication. Both nodes 1 and 2 in the gene tree map to a. Since M(1) = M(2), a duplication at node 1 is inferred. Losses can also be reconstructed from the mapping, M(·), as described in Ref. 10. We refer to this algorithm as LCA reconciliation in order to distinguish it from the new reconciliation algorithms proposed for non-binary species trees in the next section.
Fig. 1. Least Common Ancestor reconciliation. (a) Binary species tree. (b) Binary gene tree, reconciled with species tree (a). The black square indicates a duplication.
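The LCA mapping and duplication test follow directly from the definitions above. The sketch below is our own minimal restatement in Python (not NOTUNG code): trees are plain dictionaries, and the species-tree LCA is computed naively by walking ancestor paths. The example topology is a small hypothetical case in the spirit of Fig. 1, not the figure itself.

def ancestors(node, parent):
    # Path from node up to the root, inclusive.
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(a, b, parent):
    # Naive LCA: first node of b's ancestor path that is also above a.
    above_a = set(ancestors(a, parent))
    for n in ancestors(b, parent):
        if n in above_a:
            return n

def lca_reconcile(g, children, leaf_species, s_parent, M, dups):
    # Postorder LCA reconciliation of a binary gene tree.
    if g not in children:                 # leaf: the species it was sampled from
        M[g] = leaf_species[g]
        return
    l, r = children[g]
    lca_reconcile(l, children, leaf_species, s_parent, M, dups)
    lca_reconcile(r, children, leaf_species, s_parent, M, dups)
    M[g] = lca(M[l], M[r], s_parent)
    if M[g] in (M[l], M[r]):              # a child maps to the same lineage
        dups.add(g)

# Species tree ((A,B)a,C)root; gene tree ((g-A,g-C)2,g-B)1.
s_parent = {"A": "a", "B": "a", "a": "root", "C": "root"}
children = {1: (2, "g-B"), 2: ("g-A", "g-C")}
leaf_species = {"g-A": "A", "g-B": "B", "g-C": "C"}
M, dups = {}, set()
lca_reconcile(1, children, leaf_species, s_parent, M, dups)
assert dups == {1}    # the discordance forces a duplication at node 1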
Variants of LCA reconciliation have been proposed by numerous authors, and several software packages for analyzing gene duplication histories have been developed.10,28,29,37,49 A related problem, inferring the optimal species tree from multiple conflicting gene trees, has been studied extensively13,14,25,27,42,48 for various optimization criteria11 and has been shown to be NP-hard.21
3. MODELS FOR NON-BINARY SPECIES TREES
Least Common Ancestor reconciliation is based on the assumption that disagreement between a gene tree and a species tree indicates that one or more gene duplications must have occurred. In this section, we show that when the species tree is non-binary, this assumption is no longer warranted. A polytomy may represent the simultaneous divergence of all descendants (a hard polytomy22). It may also indicate that the true binary branching process cannot be reconstructed (a soft polytomy22); this often occurs when a sequence of binary divisions proceeds in close succession and the time between these events is insufficient to accumulate informative variation. Since the true branching pattern in a gene tree is always binary,16 a polytomy in T_G can only represent uncertainty. In contrast, since a species tree represents the evolution of a population of organisms, a polytomy in the species tree may represent either simultaneous divergence or uncertainty. Simultaneous divergences of three or more lineages may result from the isolation of subpopulations within a widespread species by sudden meteorological or geological events, or from rapid expansion of the population into open territory, resulting in reproductive isolation. Examples of simultaneous divergences in nature include Anolis lizards,18 modern birds in the order Neoaves,34 macaque monkeys,15,24 auklets,45 and African cichlid fishes.39
Fig. 2. Gene trees evolving within the same species polytomy can have different binary topologies.
Because the species tree represents a population with genetic diversity, gene trees with different binary branching processes can be consistent with a species polytomy.20,23 If k or more alleles are present in the population when k lineages separate, a different allele may fix in each lineage. The resulting gene tree will be binary and will reflect the order in which new alleles arose in the ancestral population. This process is called incomplete lineage sorting.16,23,31,43,44 When the time of separation of lineages in T_G predates the time of speciation, the divergence is called a deep coalescence. As shown in Fig. 2, all possible binary gene trees with k leaves can occur in a k-tomy in the species tree. Deep coalescence can also occur when two or more binary speciation events occur in rapid succession. If the time between subsequent speciations is shorter than the fixation time, more than one allele will still be present at the time of the second speciation (see, for example, Ref. 35). The probability of this occurring increases with shorter branch lengths and larger effective population sizes. When there is simultaneous or rapid divergence in the species tree, the challenge is to determine whether disagreement between a gene and species tree indicates a deep coalescence or a duplication. Some incongruences can only be explained by a duplication. Obviously, a duplication must have occurred in any gene family that has two or more members in the same species. Even when no contemporary species contains more than one family member, there are cases where topological disagreement can only be explained by a duplication. For example, the incongruence between the species tree in Fig. 3(a) and the gene tree in Fig. 3(b) can only be explained by a duplication at node 2. This can be seen in Fig. 3(c), which shows the gene tree embedded in the species tree. Two copies of the gene are present in the ancestral species β, indicating that a duplication must have occurred. We refer to cases where a duplication must have occurred as required duplications. In other cases, it is not possible to determine whether the disagreement is due to a deep coalescence or duplication. For example, node 1 in Fig. 3(c) is associated with a deep coalescence rather than a duplication; however, it is possible that this node could also have been a duplication. These instances will be referred to as conditional duplications. Next, we formally characterize the properties of gene and species trees that determine when a duplication is required. For each polytomy s ∈ V_S, let H(s) be the set of all possible binary subtrees whose leaves are the children of s. Formally, given the k-tomy s ∈ V_S, let H(s) = {T_i | L(T_i) = C(s)}. For example, if node s is the trichotomy α in Fig. 3(a), then H(s) =
{(A, (B, β)), (B, (A, β)), (β, (A, B))}. Let H*(T_S) be the set of all possible binary trees obtained by replacing each polytomy s_i ∈ V_S with each tree T_ij ∈ H(s_i). Note that, for every T' ∈ H*(T_S), every node s ∈ T_S corresponds to a node in T'; however, T' will also contain nodes that do not correspond to any node in T_S. The cardinality of H*(T_S) is ∏_{s_i ∈ T_S} |H(s_i)|, which is equivalent to ∏_i n_i, where n_i = (2k_i − 3)!/(2^{k_i−2}(k_i − 2)!) and k_i = |C(s_i)|. If T_S is binary, then H*(T_S) = {T_S}. We now use H*(T_S) to characterize formally the properties of the gene and species tree that determine when a duplication is required. When reconciling T_G with every T' ∈ H*(T_S), if g ∈ V_G is a duplication in every reconciliation, then a duplication must have occurred. If at least one, but not all, reconciliations indicate a duplication at g, then a deep coalescence may have occurred. Formally:
Definition 3.1. ∀T' ∈ H*(T_S), reconcile T_G with T'. Given g ∈ V_G \ root(T_G): p(g) is a required duplication if ∀T' ∈ H*(T_S), M(g) = M(p(g)); p(g) is a conditional duplication if ∃T' ∈ H*(T_S) s.t. M(g) = M(p(g)) and p(g) is not a required duplication.

In the example in Fig. 3, H*(T_S) = {(A, (B, (C, D))), (B, (A, (C, D))), ((A, B), (C, D))}.
From Definition 3.1, node 2 is a required duplication, since for every T' ∈ H*(T_S), M(2) = M(3). Node 1 is a conditional duplication, since there is a binary resolution in H*(T_S), namely (A, (B, (C, D))), that does not lead to disagreement at 1.
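The count n_i above is the double factorial (2k − 3)!!, and H(s) can be enumerated directly for small polytomies. A brief sketch (ours, for illustration only):

from math import factorial

def n_resolutions(k):
    # Number of rooted binary trees on k labeled leaves: (2k-3)!!.
    return factorial(2 * k - 3) // (2 ** (k - 2) * factorial(k - 2))

def resolutions(leaves):
    # Enumerate H(s): all rooted binary trees over the given leaves,
    # represented as nested tuples.
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    for t in resolutions(rest):
        yield from insert_everywhere(t, first)

def insert_everywhere(tree, leaf):
    # Attach `leaf` on every edge of `tree`, including above the root.
    yield (leaf, tree)
    if isinstance(tree, tuple):
        l, r = tree
        for t in insert_everywhere(l, leaf):
            yield (t, r)
        for t in insert_everywhere(r, leaf):
            yield (l, t)

# The trichotomy (A, B, beta) has exactly three resolutions, as in the text.
trees = list(resolutions(("A", "B", "beta")))
assert len(trees) == n_resolutions(3) == 3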
4. IDENTIFYING DUPLICATIONS
The goal of reconciliation with non-binary species trees is to determine whether a given node is a required duplication, a conditional duplication, or a speciation. Definition 3.1 provides a formal basis for such a test, but cannot be the basis of an efficient algorithm, since H*(T_S) grows superexponentially with the size and number of polytomies in T_S. While LCA reconciliation is able to identify all conditional and required duplications, it cannot distinguish between the two. For example, in the gene tree in Fig. 3(b), nodes 1, 2, and 3 are all mapped to α, indicating a duplication at nodes 1 and 2 under Least Common
Fig. 3. (a) A species tree with a polytomy at α. (b) A gene tree reconciled with species tree (a). (c) The gene tree (b) embedded in species tree (a). Black squares indicate duplications. Losses are not represented in these trees.
Ancestor reconciliation. While this inference is correct for node 2, it incorrectly infers a duplication at 1. Why is Least Common Ancestor reconciliation unable to distinguish between required and conditional duplications? M(g) = s implies that the ancestral gene g was present in the ancestral population, s. If T_S is binary, then the descendants of g were also present in all descendants^a of s. This, in turn, implies that when M(p(g)) = M(g), both g and its parent were present in the same species, s, indicating that a duplication occurred at p(g). This reasoning is the basis of Least Common Ancestor reconciliation. However, when T_S is non-binary, it is not necessarily true that the descendants of g were present in all descendants of M(g). For example, in Fig. 3, M(2) = α, but no descendant of 2 is present in species A, which is a descendant of α. This is due to the deep coalescence at node 1 in T_G. As this example shows, M(·) does not contain the information required to infer the set of nodes in T_S in which the descendants of g must have been present. In order to distinguish between conditional and required duplications, we need a new mapping from T_G to T_S that allows us to determine when more than one descendant of g was present in some descendant of M(g). We propose a mapping in which nodes in T_G are mapped to sets of nodes in T_S. A naive solution would be to decorate each node in T_G with all of the nodes in T_S in which the descendants of g were present, i.e., with all nodes in the subtree rooted at M(g). However, this creates a problem of efficiency: the size of the sets labeling the nodes in the gene tree grows with the height of the tree and can contain as many as O(|V_S|) elements. Fortunately, it is sufficient to store the roots of the subtrees in which descendants of g must have been present. The
mapping presented in Definition 4.1 is sufficiently informative to identify required and conditional duplications. Moreover, the size of this mapping at any given node is bounded by the size of the largest polytomy in Ts.
Definition 4.1. Define N̂ : V_G \ root(T_G) → 2^{V_S} to be N̂(g) = {M(g)} if M(p(g)) ∈ L(T_S). Otherwise,

N̂(g) = {h | h ∈ C(M(p(g))) ∧ ∃v ∈ L(g) s.t. h ≥_S M(v)}.
For a given gene, g, in species s = M(g), N̂(g) is the set of species that are children of the parent of s in which descendants of g must be present. For example, in Fig. 3(b), N̂(3) = {B, β} because M(p(3)) = M(2) = α, and descendants of 3 were present in B and β, which are in C(α). Note that N̂ is defined on every node in T_G except the root. N̂ provides an efficient and accurate test for required duplications:
Theorem 4.1. A node g is a required duplication iff N̂(l(g)) ∩ N̂(r(g)) ≠ ∅.

Proof. ⇒ If N̂(l(g)) and N̂(r(g)) intersect, then they share at least one element, x. Thus, for all T' ∈ H*(T_S), both M(l(g)) and M(r(g)) must be x or ancestors of x, and thus both lie on the path from x to M(g). This requires that either M(l(g)) = M(r(g)) or one is a descendant of the other, and g meets the duplication criterion for binary gene and species trees. ⇐ We need to show that whenever N̂(r(g)) ∩ N̂(l(g)) = ∅, there is always at least one element of H*(T_S) that does not imply a duplication at g. Any T' ∈ H*(T_S) that has all members of N̂(l(g)) in the left subtree of M(g)
^a Unless g was lost. However, in that case we would need to infer a loss in a descendant of s.
and all members of N̂(r(g)) in the right subtree of M(g) will meet this criterion. □
Corollary 4.1. Node g ∈ V_G is a conditional duplication if g is not a required duplication and if M(g) = M(l(g)) and/or M(g) = M(r(g)).

Using Definition 4.1 we can correctly classify the nodes in Fig. 3(b). Since N̂(3) = {B, β} and N̂(g2-D) = {β}, their intersection is {β}, indicating the presence of two genes in the ancestral taxon β. Therefore, a duplication is inferred at node 2. However, N̂(g1-A) = {A} and N̂(2) = {B, β}. N̂(g1-A) ∩ N̂(2) = ∅, correctly implying that no duplication is required at gene node 1. N̂(g) is calculated easily with a postorder traversal of T_G and is a function of the union of N̂(l(g)) and N̂(r(g)). An additional climbing step must be taken to ensure that the set N̂(g) is composed only of children of M(p(g)). The climb procedure, given in Alg. 5.1, also prevents |N̂(·)| from growing larger than k_S. Alg. 5.1 infers a required duplication at g if the intersection of the sets N̂(l(g)) and N̂(r(g)) is non-empty. Conditional duplications are inferred using Least Common Ancestor reconciliation. N̂(·) can also be used to infer duplications when the species tree is binary, and produces the same results as Least Common Ancestor reconciliation. Pseudocode for this algorithm, which also calculates explicit losses, is given in Alg. 5.1 in the next section.
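Theorem 4.1 reduces the required-duplication test to a set intersection. The following is our own naive restatement for clarity (the efficient version, with the climb step and loss bookkeeping, is the authors' Alg. 5.1 below); it recomputes N̂ from the leaf sets rather than bottom-up, and all dictionaries describe the Fig. 3 example.

def n_hat(c, g, M, leaves_under, s_children, s_ancestors):
    # Definition 4.1: the children h of M(g) such that some gene leaf
    # below c was sampled from a species in the subtree rooted at h.
    target = M[g]
    if target not in s_children:       # M(g) is a species-tree leaf
        return {M[c]}
    return {h for h in s_children[target]
            if any(h in s_ancestors[M[v]] for v in leaves_under[c])}

def required_duplication(g, children, M, leaves_under, s_children, s_ancestors):
    # Theorem 4.1: required duplication iff the child sets intersect.
    l, r = children[g]
    return bool(n_hat(l, g, M, leaves_under, s_children, s_ancestors) &
                n_hat(r, g, M, leaves_under, s_children, s_ancestors))

# Fig. 3: species tree (A, B, (C, D)beta)alpha with a polytomy at alpha;
# gene tree (g1-A, ((g-B, g-C)3, g2-D)2)1.
s_children = {"alpha": ["A", "B", "beta"], "beta": ["C", "D"]}
s_ancestors = {"A": {"A", "alpha"}, "B": {"B", "alpha"},
               "C": {"C", "beta", "alpha"}, "D": {"D", "beta", "alpha"},
               "beta": {"beta", "alpha"}, "alpha": {"alpha"}}
children = {1: ("g1-A", 2), 2: (3, "g2-D"), 3: ("g-B", "g-C")}
M = {1: "alpha", 2: "alpha", 3: "alpha",
     "g1-A": "A", "g2-D": "D", "g-B": "B", "g-C": "C"}
leaves_under = {1: ["g1-A", "g-B", "g-C", "g2-D"], 2: ["g-B", "g-C", "g2-D"],
                3: ["g-B", "g-C"], "g1-A": ["g1-A"], "g2-D": ["g2-D"],
                "g-B": ["g-B"], "g-C": ["g-C"]}
assert required_duplication(2, children, M, leaves_under, s_children, s_ancestors)
assert not required_duplication(1, children, M, leaves_under, s_children, s_ancestors)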
Theorem 4.2. Equivalence with LCA Reconciliation for Binary Species Trees. Let T_G be a binary gene tree reconciled with a binary species tree. Then N̂(r(g)) ∩ N̂(l(g)) ≠ ∅ iff M(r(g)) = M(g) and/or M(l(g)) = M(g).

Proof. ⇒ This follows directly from Theorem 4.1. ⇐ Suppose that there exists h ∈ C(g) such that M(h) = M(g), but N̂(r(g)) ∩ N̂(l(g)) = ∅. There are two cases: either N̂(l(g)) and N̂(r(g)) are both equal to {M(g)}, or N̂(l(g)) and N̂(r(g)) both contain children of M(g). In the former case, N̂(l(g)) and N̂(r(g)) are not disjoint, leading to a contradiction. In the latter case, N̂(l(g)) and N̂(r(g)) both contain children of M(g). Since M(g) has only two children and the sets are disjoint, one set must contain the right child of M(g) and the other the left child. Then, M(r(g)) and M(l(g)) are not in the same lineage and there is no child h ∈ C(g) such that M(h) = M(g), leading to a contradiction. □
5. IDENTIFYING LOSS NODES
In this section, we discuss how to infer gene losses when reconciling with non-binary species trees. For a given gene tree, T_G, and species tree, T_S, we report the total number of losses in T_G and the timing of individual losses. We designate the time period when a loss occurred by assigning it to an edge in T_G. When binary gene trees are reconciled with binary species trees under a parsimony model, the placement of each loss is unambiguous, and can be determined efficiently from M(·). For example, in the gene tree in Fig. 4(b), three losses have occurred in the contemporary species A, C and D. Note that the two losses in C and D can be explained by the loss of a single ancestral gene in the ancestral species γ. In general, given a set of monophyletic losses in a reconciled gene tree, it is more parsimonious to infer a single loss at the root of the corresponding clade in T_S.
Fig. 4. Losses in Least Common Ancestor reconciliation. (a) A binary species tree. (b) A binary gene tree reconciled with species tree (a).
In contrast to the binary case, when the species tree is non-binary, it is not always possible to determine exactly where a loss occurred. In Fig. 5(b), a descendant of node 2 is present in A and D but absent from B and C. This suggests losses in the lineages leading to the contemporary B and C species. However, for each loss it is not possible to determine whether the lost gene diverged before or after the divergence at node 2 and, if the latter, whether it was more closely related to g2-A or g2-D. For example, for the gene lost in B, this leads to three possible loss scenarios: (g2-A, (lost-B, g2-D)), ((g2-A, lost-B), g2-D), or (lost-B, (g2-A, g2-D)). Note that each placement of lost-B in this tree implies a different element of H(α).
Fig. 5. (a) A non-binary species tree. (b) A gene tree reconciled with species tree (a). Losses are not shown.
For non-binary reconciliation, we combine any set of losses that potentially form a monophyletic subtree; i.e., if a set of losses located on the same edge in T_G corresponds to a monophyletic subtree of some T' ∈ H*(T_S), they can be replaced with the single loss of an ancestral gene. A simple test is that if several lost genes (or roots of lost subtrees) on the same edge all map to siblings of the same polytomy, they can be combined. Consider the losses in B and C in Fig. 5. If lost-B and lost-C are placed on the same edge, we can infer a single loss in their putative common ancestor. For example, if both are placed on the edge between 1 and 2, then they correspond to the monophyletic subtree ((A, D), (B, C)) ∈ H*(T_S). Thus, the losses in B and C can be combined, yielding a single loss for this reconciliation. If the losses are placed on different edges, they cannot be combined; two losses will be inferred. In the parsimony context, our goal is to assign loss nodes to edges in T_G so as to minimize the total number of inferred losses. Since only losses on the same edge can be combined, the choice of placement influences the total number of losses. With this in mind, we propose two minimization strategies. The first strategy assigns losses to edges so as to minimize the total number of uncombined (or explicit) losses, and then combines losses wherever possible. The second strategy considers all possible assignments of losses to edges in T_G and selects the assignment that minimizes the number of combined losses. Although each minimization strategy restricts the number of possible placements, it is still possible to have more than one optimal placement of losses. We present algorithms for both minimization strategies, described below.
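The sibling test in the paragraph above is easy to state in code. A minimal sketch (ours, not the NOTUNG implementation): group the lost species assigned to one gene-tree edge by their species-tree parent, and merge any group of two or more children of the same polytomy into a single combined loss.

from collections import defaultdict

def combine_losses_on_edge(lost, s_parent):
    # Lost species on one edge that are children of the same polytomy can
    # be monophyletic in some resolution, so each group of two or more is
    # replaced by one combined loss (represented here as a tuple).
    by_parent = defaultdict(list)
    for sp in lost:
        by_parent[s_parent[sp]].append(sp)
    return [tuple(group) if len(group) > 1 else group[0]
            for group in by_parent.values()]

# Fig. 5: lost-B and lost-C placed on the same edge yield a single loss.
s_parent = {"A": "alpha", "B": "alpha", "C": "alpha", "D": "alpha"}
assert combine_losses_on_edge(["B", "C"], s_parent) == [("B", "C")]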
5.1. Minimizing Explicit Losses
Minimization of explicit losses is straightforward because each loss can be assigned to an edge in T_G independently. In addition to N̂(·), we introduce a second mapping, N(·), that allows us to infer losses by taking the difference between N(·) and N̂(·).
Definition 5.1. Define N : V_G → 2^{V_S} to be N(g) = {M(g)} if M(g) ∈ L(T_S). Otherwise,

N(g) = {h | h ∈ C(M(g)) ∧ ∃v ∈ L(g) s.t. h ≥_S M(v)}.
N(g) is the set of children of M(g) such that the descendants of g were present in the descendants of each element in N(g). Just as N̂(g) is a subset of the children of M(p(g)), N(g) is a subset of the children of M(g). Unlike N̂(g), N(g) is defined for root(T_G). For a given edge e = (g, p(g)), we use N(p(g)) and N̂(g) to infer explicit losses on e. As described in Alg. 5.1, we make a single postorder traversal of T_G, in which we calculate N(·) and N̂(·), and infer the explicit losses associated with each edge by applying the following four tests. Each test corresponds to one of the four situations that can incur a loss on e.

(1) Binary Duplication Losses: A duplication node and its children should be mapped to the same node in the species tree; otherwise, a loss has occurred. Formally, if p(g) is a required duplication and M(p(g)) is binary, then if M(p(g)) ≠ M(g), the species in N(p(g)) \ N̂(g) are lost at e (lines 18-21 in Alg. 5.1).

(2) Skipped Species Losses: If a node and its parent do not correspond to child and parent nodes in the species tree, the intervening species must have been lost. If M(g) ≠ M(p(g)) and p(M(g)) ≠ M(p(g)), we climb in T_S from M(g) to M(p(g)) and infer a loss at every skipped species (lines 29-31 in Alg. 5.1).

(3) Polytomy Duplication Losses: This is analogous to test (1). If M(p(g)) is a polytomy and p(g) is a required duplication, then the species in N(p(g)) \ N̂(g) are lost at e (lines 18-21 in Alg. 5.1).

(4) Polytomy Speciation Losses: If a node g maps to a polytomy and maps to a different species than its parent, then all children of M(g) should contain a descendant of g. Otherwise, a loss has occurred. Formally, if M(p(g)) ≠ M(g) and M(g) is a polytomy, the species in C(M(g)) \ N(g) are lost at e (lines 22-24 in Alg. 5.1).

We demonstrate these four tests on the right subtree of Fig. 6(c), which has been labeled with the minimum number of explicit losses. (We do not discuss losses in the left subtree.) The following losses occur in this subtree: On the edge between nodes 1 and 3, a Binary Duplication
Loss in N(1) \ N̂(3) = A and a Polytomy Speciation Loss in C(β) \ N(3) = B. On the edge between 3 and 4, a Polytomy Duplication Loss in N(3) \ N̂(4) = D. On the edge between nodes 4 and g4-E, a Skipped Species Loss in F. On the edge between nodes 3 and 5, a Polytomy Duplication Loss in C. Finally, on the edge between 5 and g5-F, a Skipped Species Loss in E. The rules described above minimize explicit losses by assigning each loss as close to root(T_G) as possible. Any other placement might move the loss below a duplication, which would require two losses, one in each subtree of the duplication node. Once explicit losses are identified, those losses which form a clade in some T' ∈ H*(T_S) and are placed on the same edge in T_G can be combined.
Algorithm 5.1. reconcile( g )
1   if ( g.isLeaf() )
2       M(g) = species of g
3       N(g) = {M(g)}
4       return
5   // INTERNAL NODE CASE
6   // descend first
7   reconcile( l(g) ); reconcile( r(g) )
8   M(g) = LCA( M(l(g)), M(r(g)) )
9   calculateRequiredDuplication( g )
10  if ( g != Required Duplication )
11      if ( M(g) == M(l(g)) || M(g) == M(r(g)) )
12          g is Conditional Duplication

calculateRequiredDuplication( g )
13  N̂(l(g)) = climb( l(g), g )
14  N̂(r(g)) = climb( r(g), g )
15  N(g) = N̂(l(g)) ∪ N̂(r(g))
16  if ( N̂(l(g)) ∩ N̂(r(g)) != ∅ )
17      g is Required Duplication
18      // duplication losses for left child
19      Losses( l(g) ) += N(g) \ N̂(l(g))
20      // duplication losses for right child
21      Losses( r(g) ) += N(g) \ N̂(r(g))

climb( c, g )
22  // polytomy speciation losses
23  if ( M(c) ∉ L(T_S) && M(c) != M(g) )
24      Losses( c ) += C(M(c)) \ N(c)
25  SpeciesNode x = M(c)
26  if ( x == M(g) )
27      return N(c)
28  while ( p(x) != M(g) )
29      // skipped losses
30      if ( p(x) != M(c) )
31          Losses( c ) += Siblings( x )
32      // climb
33      x = p(x)
34  return {x}
Theorem 5.1. Alg. 5.1 computes required and conditional duplications in O(|V_G| · (k_S + h_S)) time, where k_S is the outdegree of the largest polytomy in T_S, and h_S is the height of T_S.

Proof. At every internal node g ∈ V_G, N(g) is initialized with N̂(l(g)) ∪ N̂(r(g)). |N̂(·)| is bounded by k_S. Using a suitable data structure, this step can be achieved in O(log(k_S)) time per node. The climb routine is applied to every node in T_G. For any given path from l ∈ L(T_G) to r = root(T_G), we will climb in total from M(l) to M(r). Thus the total cost of calls to climb is O(|V_G| · h_S). Using fast Least Common Ancestor queries, M(·) can be calculated in O(|V_G|) time for the entire tree.3 Once M(·) has been calculated, testing for conditional duplications takes constant time per node. Testing for required duplication requires calculation of the intersection of N̂(l(g)) and N̂(r(g)). This operation takes O(k_S) per node. Combining these, the total running time is O(|V_G| · (k_S + h_S)).
5.2. Minimizing Combined Losses
In the previous section, we presented a low time complexity algorithm to infer an optimal assignment of explicit losses. Although there may be more than one minimum cost assignment, Alg. 5.1 finds only one, since it obtains a minimum cost assignment by placing losses as close to root(T_G) as possible. Next we present an algorithm that considers all possible loss assignments to obtain the minimum number of combined losses, which is always less than or equal to the minimum number of explicit losses. Unlike Alg. 5.1, this algorithm can find all possible optimal assignments, but with increased computational complexity: the worst case running time is exponential in k_S. However, the implementation is fast in practice because k_S is typically small and pathological cases rarely occur. Further speedups can be obtained by memoization. The basic approach to minimizing combined losses is illustrated using Fig. 6(c). Only losses associated with polytomies, that is, losses inferred by rules 3 and 4 from Section 5.1, can be associated with more than one edge in the tree. Consider the polytomy losses lost-B, lost-C and lost-D in the right subtree. Because B, C and D are siblings in Fig. 6(a), they can be combined if they can be placed on the same edge in the gene tree. Note that lost-B can be moved below node 3, resulting in two copies of lost-B, one in each subtree of 3. One copy
Fig. 6. (a) A species tree with a polytomy, β. (b) A hypothetical gene tree that has been reconciled with the species tree in (a). This gene tree is annotated with the mappings M(·), N(·) and N̂(·). Losses are not represented in this tree. (c) The gene tree (b) annotated with one possible optimal placement of explicit losses. Losses in the left subtree are not displayed. (d) The gene tree (b) annotated with one possible optimal placement of combined polytomy losses. Losses in the left subtree are not displayed.
of lost-B can be combined with lost-C, while the other can be combined with lost-D. This new placement of losses reduces the number of inferred losses from six to five, as shown in Fig. 6(d). These combined losses are represented by lost-D,B and lost-C,B. Each polytomy loss is associated with a particular Polytomy Connected Component (PCC) in the gene tree, defined as follows: A node v ∈ V_G is the top node of a distinct PCC if M(v) is a polytomy and (v = root(T_G) ∨ M(v) ≠ M(p(v))). Let X be the set of nodes that contains v and all g ∈ V_G such that g <_G v and M(p(g)) = M(v). Then Y = {(x, p(x)) | x ∈ X \ root(T_G)} is the PCC with top node v. Note that Y is a contiguous set of edges in T_G. If a loss is inferred at some edge e ∈ Y, it can be placed on e or on any edge below e in Y. For example, the gene tree in Fig. 6(b) has one PCC. Its top node is node 3, and it contains the nodes {3, 4, 5, g4-C, g4-E, g5-D, g5-F}. Alg. 5.2 uses a dynamic program to find the optimal placement for all polytomy losses in a PCC such that they can be combined to minimize the number of losses in the gene tree. To obtain the minimum number of combined losses, we first use a modification of Alg. 5.1 to calculate N̂(·) and N(·), and to infer binary losses with rules 1 and 2 (but not 3 and 4) from Section 5.1. Combined losses are then inferred by calling Alg. 5.2 on the top node of each PCC in the reconciled gene tree. In Alg. 5.2, a postorder traversal of a PCC is used to calculate the minimum cost for any combination of species which could be lost at or below a given node. This cost is stored in a global table Γ. The loss assignment associated with each cost entry in Γ is stored in Υ. For a node at the bottom of a PCC, the cost is 1 if one or more
species must be lost, and 0 if no species are lost. The cost at an internal node g is the sum of the costs for its children, plus the potential cost of a loss at g. After calculating Γ and Υ, an optimal loss assignment is selected using a preorder traversal of the PCC. Alg. 5.2 calculates a single optimal loss assignment for one PCC. The general algorithm returns all optimal assignments. The worst case running time is O(|V_G| · 2^{3 k_S}). The exponential term is due to the enumeration of the power set of C(p) \ N̂(g) in ProcessComponent, followed by the nested enumeration of two additional power sets in CalculateCost. The running time on real data sets is reasonable because these sets are typically small, and enumerating the smallest elements of the power set first allows reuse of intermediate results.
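Identifying the PCCs on which the dynamic program runs is straightforward from the definition above. A minimal sketch (ours, not the NOTUNG implementation), using the node names of Fig. 6(b):

def pcc_top_nodes(nodes, parent, M, is_polytomy, root):
    # A node v is the top of a PCC iff M(v) is a polytomy and v is the
    # root or v's parent maps to a different species-tree node.
    return [v for v in nodes
            if is_polytomy(M[v]) and (v == root or M[v] != M[parent[v]])]

def pcc_nodes(top, children, M):
    # Collect X: the top node plus every descendant whose parent maps to
    # the same polytomy; the walk stops once the mapping leaves it.
    X, stack = [top], [top]
    while stack:
        g = stack.pop()
        if M[g] != M[top]:
            continue               # g left the polytomy: descend no further
        for c in children.get(g, ()):
            X.append(c)            # c qualifies because M(p(c)) == M(top)
            stack.append(c)
    return X

children = {3: (4, 5), 4: ("g4-C", "g4-E"), 5: ("g5-D", "g5-F")}
M = {3: "beta", 4: "beta", 5: "beta",
     "g4-C": "C", "g4-E": "E", "g5-D": "D", "g5-F": "F"}
assert set(pcc_nodes(3, children, M)) == {3, 4, 5, "g4-C", "g4-E", "g5-D", "g5-F"}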
Algorithm 5.2. CombinedLosses( g )
1   Global component_root = g
2   // compute costs
3   ProcessComponent( g, M(g) )
4   // add losses using costs
5   AddCombinedLosses( g, C(M(g)) \ N(g) )

N(g)
6   if ( g == component_root )
7       return N(g)
8   else return N̂(g)

ProcessComponent( g, p )
10  if ( g == component_root || M(g) == M(p(g)) )
11      ProcessComponent( l(g), p )
12      ProcessComponent( r(g), p )
13  else
14      component_leaf = true
15  foreach f ∈ pow( C(p) \ N̂(g) )
16      if ( component_leaf && f == ∅ )
17          Γ(g, f) = 0
18      else if ( component_leaf )
19          Γ(g, f) = 1
AddCombinedLosses( g, N_in )
41  // lose genes at the edge above g
42  if ( g == component_root || M(g) == M(p(g)) )
43      Losses( g ) += N_in \ ( Υ_l(g, N_in) ∪ Υ_r(g, N_in) )
44      AddCombinedLosses( l(g), Υ_l(g, N_in) )
45      AddCombinedLosses( r(g), Υ_r(g, N_in) )
46  else
47      Losses( g ) += N_in \ N(g)
5.3. Related work
To our knowledge, the results above are the first formal algorithms for reconciliation with non-binary species trees. Our approach is similar in flavor to a recent algorithm to root and correct an unrooted gene tree, given a rooted species tree,4 in that both algorithms use set-based mappings. We propose two such mappings, N and N̂, of size bounded above by k_S. The M-mapping in Ref. 4 is equivalent to N. There is no equivalent to N̂. Instead, they use a set Z that is O(|V_S|) and resembles the naive solution proposed and rejected in Section 4. Although the mappings in both papers are similar, the goals and the algorithmic results differ.
6. EMPIRICAL RESULTS
We have implemented the algorithms described above in a new version of our software tool, NOTUNG, and tested it on several data sets. First, we confirmed that our non-binary algorithms perform identically to Least Common Ancestor reconciliation when applied to binary trees. We tested the new algorithms in NOTUNG on a benchmark of 15 well-studied binary trees5,17,32,38 and verified that the results were the same as those generated by the binary version of NOTUNG, as well as those of the original authors. Second, we considered performance. The worst case running time of Alg. 5.2 is exponential in k_S. The performance of our preliminary implementation is reasonable for species trees with k_S ≤ 12. We tested NOTUNG on full trees in TreeFam 3.019 with a species tree obtained from the NCBI taxonomy.46 The time required to reconcile the 1173 gene trees that correspond to species trees with k_S ≤ 12 was 2'07" for the explicit loss model and 5'21" for the combined loss model on a 3.2 GHz OptiPlex GX620 computer. One tree from this data set corresponded to a species tree with a polytomy of size 15 and was not included. Finally, we reconciled binary gene trees with three species trees that contain well-studied polytomies.18,34,45 All three studies applied statistical tests to verify that the species tree of interest contained a true polytomy. We constructed gene trees for sequence families drawn from each species data set and reconciled them with the non-binary species trees, which were transcribed directly from the source articles. Table 1 shows the number of leaves (ℓ) in each tree, the size of the maximum polytomy (k_S) in each species tree, the number of duplications obtained by binary reconciliation (B), the number of required duplications predicted by our algorithm (R), and the number of combined losses. The globin tree has three equally parsimonious loss assignments. All others have one. As predicted, binary reconciliation substantially overestimates required duplications.
Table 1. Empirical Results

Gene family          Tree ℓ   k_S   Dupl.s B   R   Losses
Neoaves34
  cytochrome B         12      9        4      0      0
  globin               10      -       17      4      7
Auklets45
  cox1                  5      5        1      0      0
  cytochrome B          5      -        2      0      0
  NADH-6                5      -        2      0      0
Anolis18
  NADH-2               50      6       13      7     17
7. DISCUSSION
In this work, we have presented novel algorithms for the reconciliation of binary gene trees with non-binary species trees, founded on current theories of deep coalescence and incomplete lineage sorting.16,23,31,43,44 Our algorithms are both space and time efficient. They have been implemented in a new version of our software tool, NOTUNG. To our knowledge, these are the first formal algorithms for non-binary trees. Our algorithms are of immediate use to researchers using phylogenetic analysis in a broad range of biological endeavors and are promising for further algorithmic development. Our definitions of required and conditional duplications and the mappings N and N̂ provide a foundation for probabilistic models of non-binary reconciliation. Such models would complement the parsimony framework presented here. Probabilistic approaches,1,2 which assume homogeneous rates, are appropriate for data sets in which duplication and loss are neutral, stochastic processes. Parsimony is better suited to data sets in which duplication and loss are rare due to selective pressure. A probabilistic framework provides a natural setting for incorporating sequence data directly into the reconciliation process, but has the disadvantage that it is both computation and data intensive. A complete phylogenetic toolkit should include both approaches. Several other problems remain for future work. Our approach assumes that all binary resolutions of a polytomy are equally likely. A more general approach would include models that deviate from a uniform distribution. In addition, non-binary tree models that include horizontal gene transfer as well as gene duplication and loss are needed. Finally, reconciliation of non-binary gene trees with (1) binary and (2) non-binary species trees should also be investigated. Solutions to the former have been suggested;6,10 our solution10 also has been implemented in NOTUNG. Berglund-Sonnhammer et al.4 proposed a particular formulation of the latter problem and showed that it leads to an NP-complete subproblem. The hardness of the general problem remains open, and formal algorithms (or approximation algorithms) are needed. With the availability of sequences from many closely related genomes, it is increasingly apparent that the histories of individual genes differ and that discordance between gene and species trees is common. Software tools that are sufficiently flexible to handle this situation are needed.35 The work presented here offers this flexibility.
ACKNOWLEDGMENTS
We thank S. Hartmann, T. Hladish, R. Hoberman, H. B. Nicholas, Jr., M. Sanderson, T. Vision, R. Schwartz, and S. Sridhar for helpful discussions, and R. Cintron, H. B. Nicholas, Jr., and J. Nam for providing phylogenetic trees for the experimental analysis. This work was supported by NIH grant 1 K22 HG 02451-01 and a David and Lucille Packard Foundation fellowship.
References 1. L. Arvestad, A. Berglund, J. Lagergren, and B. Sennblad. Bayesian genekpecies tree reconciliation and orthology analysis using MCMC. Bioinformatics, 19 Suppl 1:i7-15, 2003. 2. L. Arvestad, A. Berglund, J. Lagergren, and B. Sennblad. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In RECOMB, pages 326-335, 2004. 3. M. Bender and M. Farach-Colton. Least common ancestors revisited. In Latin '00, pages 88-94, 2000. 4. A. Berglund, P. Steffansson, M. Betts, and D. Liberles. Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol, 63(2):240250, Aug 2006. 5. R. Bourgon, M. Delorenzi, T. Sargeant, A. Hodder, B. Crabb, and T. Speed. The serine repeat antigen SERA gene family phylogeny in Plasmodium: the impact of GC content and reconciliation of gene and species trees. Mol Biol Evol, 2l(11):2161-2l71, Nov 2004. 6. W. Chang. Reconciling gene tree with apparent polytomies. Master's thesis, Iowa State University, Ames, IA, 2005. 7. K. Chen, D. Durand, and M. Farach-Colton. Notung: A program for dating gene duplications and optimizing gene family trees. J Comput B i d , 7(3/4):429-447, 2000. 8. J. P. Demuth, T. De Bie, J. E. Stajich, N. Cristianini, and M. W. Hahn. The evolution of mammalian gene families. PLoS ONE, l:e85,2006. 9. J. Dufayard, L. Duret, S. Penel, M. Gouy, F. Rechenmann, and G. Perriere. Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics, 21(11):2596-2603, Jun 2005. 10. D. Durand, B. Halldorsson, and B. Vemot. A hybrid micromacroevolutionary approach to gene tree reconstruction. J Comput Biol, 13(2):320-335, 2006. A preliminary version appeared in RECOMB 2005, LNBI 3500, Springer Verlag, 250-264. 11. 0. Eulenstein, B. Mirkin, and M. Vingron. Duplicationbased measures of difference between gene and species trees. J Comput Biol, 5:135-148, 1998. 12. M. Goodman, J. Czelusniak, G. Moore, A. RomeroHerrera, and G. Matsuda. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by
452
13. 14.
15.
16.
17.
18.
19.
20.
21. 22. 23. 24.
25.
26.
27.
28. 29.
30.
31. 32.
33.
cladograms constructed from globin sequences. Syst Zool, 28~132-163, 1979. R. Guigo, I. Muchnik, and T. Smith. Reconstruction of ancient phylogenies. Mol Phylogenet Evol, 6: 189-213, 1996. M. Hallett and J. Lagergren. New algorithms for the duplication-loss model. In RECOMB, pages 138-146, 2000. G. Hoelzer and D. Melnick. Patterns of speciation and limits to phylogenetics resolution. Trend Ecol Evol, 9(33):104-107, 1994. R. Hudson. Gene genealogies and the coalescent process. In Oxford surveys in evolutionary biology, volume 7, pages 1-44. Oxford University Press, 1990. A. Hughes. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol Biol Evol, 15(7):854-70, 1998. T. Jackman, A. Larson, K. De Queiroz, and J. Losos. Phylogenetic relationships and tempo of early diversification in Anolis lizards. Syst. Biol., 48(2):254-285, 1999. H. Li et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res, 34(Database issue):572-580, Jan 2006. J. Lyons-Weiler and M. Milinkovitch. A phylogenetic approach to the problem of differential lineage sorting. Mol Biol Evol, 14(9):968-975, 1997. B. Ma, M. Li, and L. Zhang. From gene trees to species trees. SIAM J. on Comput., 2000. W. Maddison. Reconstructing character evolution on polytomous cladograms. Cladistics, 5:365-377, 1989. W. Maddison. Gene trees in species trees. Syst. Biol., 46( 3) 1523-536, 1997. D. Melnick, G. Hoelzer, R. Absher, and M. Ashley. mtDNA diversity in Rhesus monkeys reveals overestimates of divergence time and paraphyly with neighboring species. Mol Biol Evol, 10(2):282-295, Mar 1993. B. Mirkin, I. Muchnik, and T. Smith. A biologically consistent model for comparing molecular phylogenies. J Comput Biol, 2:493-507, 1995. J. Nam and N. Masatoshi. Evolutionary change in the numbers of homebox genes in bilateral animals. Mol Biol Evol, 22( 12):2386-2394, 2005. R. Page. Maps between trees and cladistic analysis of historical associations among genes, organisms and areas. Syst Zool, 43158-77, 1994. R. Page. GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinj, 14(9):819-20, 1998. R. Page and M. Charleston. From gene to organismal phylogeny: Reconciled trees and the gene treekpecies tree problem. Mol Phylogenet Evol, 7:23 1-240, 1997. R. Page and E. Holmes. Molecular Evolution: A phylogenetic approach. Blackwell Science, 1998. P. Pamilo and M. Nei. Relationships between gene trees and species trees. Mol Biol Evol, 5(5):568-583, 1988. M. Pebusque, F. Coulier, D. Birnbaum, and P. Pontarotti. Ancient large-scale genome duplications: phylogenetic and linkage analyses shed light on chordate genome evolution. Mol Biol Evol, 15(9):1145-1 159, 1998. G. Perrikre, L. Duret, and M. Gouy. HOBACGEN:
34. S. Poe and A. Chubb. Birds in a bush: five genes indicate explosive evolution of avian orders. Evolution, 58(2):404-415, 2004.
35. D. Pollard, V. Iyer, A. Moses, and M. Eisen. Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet, 2(10):e173, Oct 2006.
36. F. Ronquist. Parsimony analysis of coevolving species associations. In Roderic D. M. Page, editor, Tangled Trees: Phylogeny, Cospeciation and Coevolution, chapter 2, pages 22-64. Univ of Chicago Press, 2002.
37. C. Roth, M. Betts, P. Steffansson, G. Szlensminde, and D. Liberles. The adaptive evolution database (TAED): a phylogeny based tool for comparative genomics. Nucleic Acids Res, 33:D495-D497, 2005.
38. I. Ruvinsky and L. Silver. Newly identified paralogous groups on mouse chromosomes 5 and 11 reveal the age of a T-box cluster duplication. Genomics, 40:262-266, 1997.
39. W. Salzburger, A. Meyer, S. Baric, E. Verheyen, and C. Sturmbauer. Phylogeny of the Lake Tanganyika cichlid species flock and its relationship to the Central and East African haplochromine cichlid fish faunas. Syst Biol, 51(1):113-135, Feb 2002.
40. D. Searls. Pharmacophylogenomics: genes, evolution and drug targets. Nat Rev Drug Discov, 2(8):613-623, 2003.
41. J. Slowinski and R. Page. How should species phylogenies be inferred from sequence data? Syst Biol, 48(4):814-825, Dec 1999.
42. U. Stege. Gene trees and species trees: The gene-duplication problem is fixed-parameter tractable. In WADS, LNCS 1663, pages 288-293, 1999.
43. F. Tajima. Evolutionary relationship of DNA sequences in finite populations. Genetics, 105(2):437-460, Oct 1983.
44. N. Takahata and M. Nei. Gene genealogy and variance of interpopulational nucleotide differences. Genetics, 110(2):325-344, Jun 1985.
45. H. Walsh, M. Kidd, T. Moum, and V. Friesen. Polytomies and the power of phylogenetic inference. Evolution, 53(3):932-937, 1999.
46. D. Wheeler et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, 33(Database issue):D39-D45, Jan 2005.
47. D. Wheeler, R. Hope, S. Cooper, G. Dolman, G. Webb, C. Bottema, A. Gooley, M. Goodman, and R. Holland. An orphaned mammalian beta-globin gene of ancient evolutionary origin. Proc Natl Acad Sci U S A, 98(3):1101-1106, Jan 2001.
48. L. Zhang. On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J Comput Biol, 4:177-188, 1997.
49. C. Zmasek and S. Eddy. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17(4):383-384, Apr 2001.
50. C. Zmasek and S. Eddy. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17(9):821-828, Sep 2001.
AUTHOR INDEX

Ali, Hesham 215
Athey, Brian 359
Aung, Zeyar 287
Bailey-Kellogg, Chris 31
Bajaj, Chandrajit 275
Bandyopadhyay, Sanghamitra 183
Baral, Chitta 381
Basu, Kalyan 121
Baumbach, Jan 391
Becker, Kevin G. 371
Bittner, Michael 169
Bock, Mary Ellen 263
Böcker, Sebastian 391
Bryant, Drew H. 343
Bu, Dongbo 323
Bylund, Joseph H. 343
Chen, Brian Y. 343
Chen, Xin 249
Chin, Francis Y.L. 111
Chua, Hon Nian 97
Close, Timothy J. 203
Cruess, Amanda E. 343
Dai, Manhong 359, 387
Dalton, Stephen 79
Das, Sajal K. 121
de Carvalho Jr., Sergio Anibal 417
Durand, Dannie 441
Ellrott, Kyle 335
Engel, James Douglas 145
Fan, Zhaocheng 249
Ferhatosmanoglu, Hakan 299
Fofanov, Viacheslav Y. 343
Foo, Chuan-Sheng 157
Friedman, Alan M. 31
Fu, Zheng 195
Gao, Xin 323
Garutti, Claudio 263
Ghosh, Preetam 121
Ghosh, Samik 121
Gitter, Anthony 381
Goldman, Aiton 441
Gonzalez, Graciela 381
Guerra, Concettina 263
Guo, Jun-Tao 335
Guo, Lingqiong 249
Han, Xiaoxu 55
Hero III, Alfred O. 145
Husmeier, Dirk 85
Hwa, Terry 3
Jagalur, Manjunatha 133
Jakupovic, Elvis 387
Jiang, Tao 67, 195, 249
Johnson, Calvin A. 371
Joshi-Tope, G. 381
Karypis, George 311, 403
Kavraki, Lydia E. 343
Kim, Seungchan 169
Kimmel, Marek 343
Kristensen, David M. 343
Kulp, David 133
Lau, William W. 371
Lee, Timothy 41
Leong, Hon Wai 19, 97
Leslie, Christina 9
Leung, Henry C.M. 111
Levine, Mike 5
Li, Ming 237, 323
Li, Shuai Cheng 323
Li, Wenyuan 429
Li, Xiao-Li 157
Lichtarge, Olivier 343
Lin, Hao 237
Liu, Huiqing 79
Liu, Lan 67, 203
Liu, Ying 429
Lonardi, Stefano 67, 203
Macher, Bruce 41
Martin, Marcel 391
Mathee, Kalai 227
Meng, Fan 359, 387
Mirel, Barbara 359
Narasimhan, Giri 227
Ng, See-Kiong 157, 287
Ning, Kang 19, 97
Olman, Victor 335
Ota, Motonori 299
Papatsenko, Dmitri 5
Quest, Daniel 215
Rahmann, Sven 391, 417
Rangwala, Huzefa 311
Rao, Arvind 145
Sen, Ina 169
Singh, Rahul 41
Siu, M.H. 111
Sjölander, Kimmen 11
States, David J. 145
Stolzer, Maureen 441
Sun, Hong 299
Sung, Ken W.K. 111
Sung, Wing-Kin 97
Tan, Kian-Lee 287
Tan, Soon-Heng 287
Tapprich, William 215
Tegarden, Craig 381
Truß, Anke 391
Venkatesan, Kavitha 13
Vernot, Benjamin 441
Wale, Nikil 403
Walhout, A.J. Marian 15
Wang, Xi 183
Wang, Yusu 299
Watson, Ian A. 403
Watson, Stanley 387
Watson, Stanley J. 359
Werhli, Adriano V. 85
Wilson, Justin 359, 387
Wittkop, Tobias 391
Wong, Limsoon 97
Wu, Yonghui 67, 203
Xu, Jinbo 323
Xu, Ying 79, 335
Xuan, Weijian 359
Xuan, Zhenyu 183
Ye, Xiaoduan 31
Yen, Ten-Yang 41
Yiu, S.M. 111
Young, Rick 5
Zeigler, Amanda 381
Zeitlinger, Julia 5
Zeng, Erliang 227
Zhang, Michael Q. 183
Zhang, Xiaoyu 275
Zhang, Xuegong 183
Zhang, Zefeng 237
Zhao, Xiaoyue 183
Zheng, Wei 31
Zinzen, Rob 5