PACIFIC SYMPOSIUM ON BIOCOMPUTING
Russ B. Altman, A. Keith Dunker, Lawrence Hunter, Tiffany Murray & Teri E. Klein
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2007
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2007 Maui, Hawaii 3-7 January 2007
Edited by Russ B. Altman Stanford University, USA
A. Keith Dunker Indiana University, USA
Lawrence Hunter University of Colorado Health Sciences Center, USA
Tiffany Murray Stanford University, USA
Teri E. Klein Stanford University, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
BIOCOMPUTING 2007 Proceedings of the Pacific Symposium Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-270-417-5
Printed in Singapore by Mainland Press
PACIFIC SYMPOSIUM ON BIOCOMPUTING 2007

Biomedical computing has become a key component in the biomedical research infrastructure. In 2004 and 2005, the U.S. National Institutes of Health established seven National Centers for Biomedical Computation, focusing on a wide range of application areas and enabling technologies, including simulation, systems biology, clinical genomics, imaging, ontologies and others (see http://www.bisti.nih.gov/ncbc/). The goal of these centers is to help seed an information infrastructure to support biomedical research. The Pacific Symposium on Biocomputing (PSB) presented critical early sessions in most of the areas covered by these National Centers, and we are proud to continue the tradition of helping to define new areas of focus within biomedical computation.

Once again, we are fortunate to host two outstanding keynote speakers. Dr. Elizabeth Blackburn, Professor of Biology and Physiology in the Department of Biochemistry and Biophysics at the University of California, San Francisco, will speak on "Interactions among telomeres, telomerase, and signaling pathways." Her work has advanced our understanding of the overall organization and control of chromosomal dynamics. Our keynote speaker in the area of Ethical, Legal and Social implications of technology will be Marc Rotenberg, Executive Director of the Electronic Privacy Information Center (EPIC) in Washington, D.C. He will speak on "Data mining and privacy: the role of public policy." Many biomedical computation professionals have grappled, and continue to grapple, with privacy issues as interest in mining human genotype-phenotype data collections has increased.

PSB has a history of providing early sessions focusing on hot new areas in biomedical computation. These sessions are often conceived during the previous PSB meeting, as trends and new results are pondered and discussed. Very often, new sessions are led by new faculty members trying to define a scientific niche and bring together leaders in the emerging areas. We are proud that many areas in biocomputing received their first significant focused attention at PSB. If you have an idea for a new session, we, the organizers, are available to talk with you, either at the meeting or later by e-mail.

Again, the diligence and efforts of a dedicated group of researchers have led to an outstanding set of sessions, with associated introductory tutorials. These organizers provide the scientific core of PSB, and their sessions are as follows:
Indra Neil Sarkar
Biodiversity Informatics: Managing Knowledge Beyond Humans and Model Organisms

Bobbie-Jo Webb-Robertson & Bill Cannon
Computational Proteomics: High-throughput Analysis for Systems Biology

Martha Bulyk, Ernest Fraenkel, Alexander Hartemink, & Gary Stormo
DNA-Protein Interactions and Gene Regulation: Integrating Structure, Sequence and Function

Russ Greiner & David Wishart
Computational Approaches to Metabolomics

Pierre Zweigenbaum, Dina Demner-Fushman, Kevin Bretonnel Cohen, & Hong Yu
New Frontiers in Biomedical Text Mining

Maricel Kann, Yanay Ofran, Marco Punta, & Predrag Radivojac
Protein Interactions in Disease

In addition to the sessions and survey tutorials, this year's program includes two in-depth tutorials. The presenters and titles of these tutorials are:

Giselle M. Knudsen, Reza A. Ghiladi, & D. Rey Banatao
Integration Between Experimental and Computational Biology for Studying Protein Function

Michael A. Province & Ingrid B. Borecki
Searching for the Mountains of the Moon: Genome Wide Association Studies of Complex Traits
We thank the Department of Energy and the National Institutes of Health for their continuing support of this meeting. Their support provides travel grants to many of the participants. Applied Biosystems and the International Society for Computational Biology continue to sponsor PSB, and as a result, we are able to provide additional travel grants to many meeting participants.
We would like to acknowledge the many busy researchers who reviewed the submitted manuscripts on a very tight schedule. The partial list following this preface does not include many who wished to remain anonymous, and of course we apologize to any who may have been left out by mistake.

Aloha!

Russ B. Altman
Departments of Genetics & Bioengineering, Stanford University

A. Keith Dunker
Department of Biochemistry and Molecular Biology, Indiana University School of Medicine

Lawrence Hunter
Department of Pharmacology, University of Colorado Health Sciences Center

Teri E. Klein
Department of Genetics, Stanford University
Pacific Symposium on Biocomputing Co-Chairs September 28, 2006
Thanks to the reviewers.. Finally, we wish to thank the scores of reviewers. PSB requires that every paper in this volume be reviewed by at least three independent referees. Since there is a large volume of submitted papers, paper reviews require a great deal of work from many people. We are grateful to all of you listed below and to anyone whose name we may have accidentally omitted or who wished to remain anonymous. Joshua Adkins Eugene Agichtein Gelio Alves Sophia Ananiadou Alan Aronson Ken Baclawski Joel Bader Breck Baldwin Ziv Bar-Joseph Serafim Batzoglou Asa Ben-Hur Sabine Bergler Olivier Bodenreider Alvis Brazma Kevin Bretonnel Yana Bromberg Harmen Bussemaker Andrea Califano Bob Carpenter Michele Cascella Saikat Chakrabarti Shih-Fu Chang Pierre Chaurand Ting Chen Hsinchun Chen Nawei Chen Praveen Cherukuri Wei Chu James Cimino Aaron Cohen Nigel Collier Matteo Dal Peraro
Vlado Dancik Rina Das Tjil De Bie Dina DemnerFushman Rob DeSalle Luis DeSilva Diego Di Bernardo Chuong Do Michel Dumontier Mary G. Egan Roman Eisner Emilio Espisitio Mark Fasnacht Oliver Fiehn Alessandro Flammini Fabian Fontaine Lynne Fox Ari Frank Kristofer Franzen Tema Fridman Carol Friedman Robert Futrelle Feng Gao Adam Godzik Roy Goodacre Michael Grusak Melissa A. Haendel Henk Harkema Marti Hearst P. Bryan Heidorn Bill Hersh
Lynette Hirschman Terence Hwa Sven Hyberts Lilia Iakoucheva Navdeep Jaitly Helen Jenkins Kent Johnson Andrew Joyce James Kadin Martin R. Kalfatovic Manpreet S. Katari Sun Kim Oliver King Tanja Kortemme Harri Lahdesmaki Ney Lemke Gondy Leroy Christina Leslie Li Liao John C. Lindon Chunmei Liu Yves Lussier Hongwu Ma Kenzie Maclsaac Tom Madej Ana Maguitman Askenazi Manor Costas Maranas Leonardo Marino John Markley Pedro Mendes Ivana Mihalek
Leonid Mirny Joyce Mitchell Matthew Monroe Sean Mooney Rafael Najmanovich Preslav Nakov Leelavati Narlikar Adeline Nazarenko Jack Newton William Noble Christopher Oehmen Christopher Oldfield Zoltan Oltvai Matej Oresic Bernhard Palsson Chrysanthi Paranavitana Matteo Pellegrini Aloysius Phillips Paul J. Planet Christian Posse Natasa Przulj Teresa Przytycka Bin Qian Weijun Qian Arun Ramani Kathryn Rankin Andreas Rechtsteiner Haluk Resat Tom Rindflesch Martin Ringwald Elizabeth Rogers Pedro Romero Graciela Rosemblat Andrea Rossi Erik Rytting Jasmin Saric Indra Neil Sarkar Yutaka Sasaki Tetsuya Sato
Santiago Schnell Rob Schumaker Robert D. Sedgewick Eran Segal Kia Sepassi Anuj Shah Paul Shapshak Hagit Shatkay Mark Siddall Mona Singh Mudita Singhal Saurabh Sinha Thereza Amelia Soares Bruno Sobral Ray Sommorjai Orkun Soyer Irina Spasic Padmini Srinivasan Paul Stothard Eric Strittmatter Shamil Sunyaev Silpa Suthram Lorrie Tanabe Haixu Tang Igor Tetko Jun'ichi Tsujii Peter Uetz Vladimir Uversky Vladimir Vacic Alfonso Valencia Karin Verspoor Mark Viant K. Vijay-Shanker Hans Vogel Slobodan Vucetic Alessandro Vullo Wyeth Wasserman Bonnie Webber Aalim Weljie
John Wilbur Kazimierz O. Wrzeszczynski Dong Xu Yoshihiro Yamanishi Yuzhen Ye Hong Yu Peng Yue Pierre Zweigenbaum
CONTENTS Preface
v
PROTEIN INTERACTIONS AND DISEASE Session Introduction Maricel Kann, Yanay Ofran, Marco Punta, and Predrag Radivojac
1
Graph Kernels for Disease Outcome Prediction from Protein-Protein Interaction Networks Karsten M. Borgwardt, Hans-Peter Kriegel, S.V.N. Vishwanathan, and Nicol N. Schraudolph
4
Chalkboard: Ontology-Based Pathway Modeling and Qualitative Inference of Disease Mechanisms Daniel L. Cook, Jesse C. Wiley, and John H. Gennari
16
Mining Gene-Disease Relationships from Biomedical Literature Weighting Protein-Protein Interactions and Connectivity Measures Graciela Gonzalez, Juan C. Uribe, Luis Tari, Colleen Brophy, and Chitta Baral
28
Predicting Structure and Dynamics of Loosely-Ordered Protein Complexes: Influenza Hemagglutinin Fusion Peptide Peter M. Kasson and Vijay S. Pande
40
Protein Interactions and Disease Phenotypes in the ABC Transporter Superfamily Libusha Kelly, Rachel Karchin, and Andrej Sali
51
LTHREADER: Prediction of Ligand-Receptor Interactions Using Localized Threading Vinay Pulim, Jadwiga Bienkowska, and Bonnie Berger
64
Discovery of Protein Interaction Networks Shared by Diseases Lee Sam, Yang Liu, Jianrong Li, Carol Friedman, and Yves A. Lussier
76
An Iterative Algorithm for Metabolic Network-Based Drug Target Identification Padmavati Sridhar, Tamer Kahveci, and Sanjay Ranka
88
Transcriptional Interactions During Smallpox Infection and Identification of Early Infection Biomarkers Willy A. Valdivia-Granda, Maricel G. Kann, and Jose Malaga
100
COMPUTATIONAL APPROACHES TO METABOLOMICS Session Introduction David S. Wishart and Russell Greiner
112
Leveraging Latent Information in NMR Spectra for Robust Predictive Models David Chang, Aalim Weljie, and Jack Newton
115
Bioinformatics Data Profiling Tools: A Prelude to Metabolic Profiling Natarajan Ganesan, Bala Kalyanasundaram, and Mahe Velauthapillai
127
Comparative QSAR Analysis of Bacterial, Fungal, Plant and Human Metabolites Emre Karakoc, S. Cenk Sahinalp, and Artem Cherkasov
133
BioSpider: A Web Server for Automating Metabolome Annotations Craig Knox, Savita Shrivastava, Paul Stothard, Roman Eisner, and David S. Wishart
145
New Bioinformatics Resources for Metabolomics John L. Markley, Mark E. Anderson, Qiu Cui, Hamid R. Eghbalnia, Ian A. Lewis, Adrian D. Hegeman, Jing Li, Christopher F. Schulte, Michael R. Sussman, William M. Westler, Eldon L. Ulrich, and Zsolt Zolnai
157
Setup X — A Public Study Design Database for Metabolomic Projects Martin Scholz and Oliver Fiehn
169
Comparative Metabolomics of Breast Cancer Chen Yang, Adam D. Richardson, Jeffrey W. Smith, and Andrei Osterman
181
Metabolic Flux Profiling of Reaction Modules in Liver Drug Transformation Jeongah Yoon and Kyongbum Lee
193
NEW FRONTIERS IN BIOMEDICAL TEXT MINING Session Introduction Pierre Zweigenbaum, Dina Demner-Fushman, Hong Yu, and K. Bretonnel Cohen
205
Extracting Semantic Predications from Medline Citations for Pharmacogenomics Caroline B. Ahlers, Marcelo Fiszman, Dina Demner-Fushman, François-Michel Lang, and Thomas C. Rindflesch
209
Annotating Genes Using Textual Patterns Ali Cakmak and Gultekin Ozsoyoglu
221
A Fault Model for Ontology Mapping, Alignment, and Linking Systems Helen L. Johnson, K. Bretonnel Cohen, and Lawrence Hunter
233
Integrating Natural Language Processing with Flybase Curation Nikiforos Karamanis, Ian Lewin, Ruth Seal, Rachel Drysdale, and Edward Briscoe
245
A Stacked Graphical Model for Associating Sub-Images with Sub-Captions Zhenzhen Kou, William W. Cohen, and Robert F. Murphy
257
GeneRIF Quality Assurance as Summary Revision Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter
269
Evaluating the Automatic Mapping of Human Gene and Protein Mentions to Unique Identifiers Alexander A. Morgan, Benjamin Wellner, Jeffrey B. Colombe, Robert Arens, Marc E. Colosimo, and Lynette Hirschman
281
Multiple Approaches to Fine-Grained Indexing of the Biomedical Literature Aurelie Neveol, Sonya E. Shooshan, Susanne M. Humphrey, Thomas C. Rindflesch, and Alan R. Aronson
292
Mining Patents Using Molecular Similarity Search James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, and Patricia Ordonez
304
Discovering Implicit Associations Between Genes and Hereditary Diseases Kazuhiro Seki and Javed Mostafa
316
A Cognitive Evaluation of Four Online Search Engines for Answering Definitional Questions Posed by Physicians Hong Yu and David Kaufman
328
BIODIVERSITY INFORMATICS: MANAGING KNOWLEDGE BEYOND HUMANS AND MODEL ORGANISMS Session Introduction Indra Neil Sarkar
340
Biomediator Data Integration and Inference for Functional Annotation of Anonymous Sequences Eithon Cadag, Brent Louie, Peter J. Myler, and Peter Tarczy-Hornoch
343
Absent Sequences: Nullomers and Primes Greg Hampikian and Tim Andersen
355
An Anatomical Ontology for Amphibians Anne M. Maglia, Jennifer L. Leopold, L. Analia Pugener, and Susan Gauch
367
Recommending Pathway Genes Using a Compendium of Clustering Solutions David M. Ng, Marcos H. Woehrmann, and Joshua M. Stuart
379
Semi-Automated XML Markup of Biosystematic Legacy Literature with the Goldengate Editor Guido Sautter, Klemens Böhm, and Donat Agosti
391
COMPUTATIONAL PROTEOMICS: HIGH-THROUGHPUT ANALYSIS FOR SYSTEMS BIOLOGY Session Introduction William Cannon and Bobbie-Jo Webb-Robertson
403
Advancement in Protein Inference from Shotgun Proteomics Using Peptide Detectability Pedro Alves, Randy J. Arnold, Milos V. Novotny, Predrag Radivojac, James P. Reilly, and Haixu Tang
409
Mining Tandem Mass Spectral Data to Develop a More Accurate Mass Error Model for Peptide Identification Yan Fu, Wen Gao, Simin He, Ruixiang Sun, Hu Zhou, and Rong Zeng
421
Assessing and Combining Reliability of Protein Interaction Sources Sonia Leach, Aaron Gabow, Lawrence Hunter, and Debra S. Goldberg
433
Probabilistic Modeling of Systematic Errors in Two-Hybrid Experiments David Sontag, Rohit Singh, and Bonnie Berger
445
Prospective Exploration of Biochemical Tissue Composition via Imaging Mass Spectrometry Guided by Principal Component Analysis Raf Van de Plas, Fabian Ojeda, Maarten Demi, Ludo Van Den Bosch, Bart De Moor, and Etienne Waelkens
458
DNA-PROTEIN INTERACTIONS: INTEGRATING STRUCTURE, SEQUENCE, AND FUNCTION Session Introduction Martha L. Bulyk, Alexander J. Hartemink, Ernest Fraenkel, and Gary Stormo
470
Discovering Motifs With Transcription Factor Domain Knowledge Henry C.M. Leung, Francis Y.L. Chin, and Bethany M.Y. Chan
472
Ab initio Prediction of Transcription Factor Binding Sites L. Angela Liu and Joel S. Bader
484
Comparative Pathway Annotation with Protein-DNA Interaction and Operon Information via Graph Tree Decomposition Jizhen Zhao, Dongsheng Che, and Liming Cai
496
PROTEIN INTERACTIONS AND DISEASE MARICEL KANN National Center for Biotechnology Information, NIH Bethesda, MD 20894, U.S.A. YANAY OFRAN Department of Biochemistry & Molecular Biophysics, Columbia University New York, NY 10032, U.S.A. MARCO PUNTA Department of Biochemistry & Molecular Biophysics, Columbia University New York, NY 10032, U.S.A. PREDRAG RADIVOJAC School of Informatics, Indiana University Bloomington, IN 47408, U.S.A.
In 2003, the US National Human Genome Research Institute (NHGRI) articulated grand challenges for the genomics community in which the translation of genome-based knowledge into disease understanding, diagnostics, prognostics, drug response and clinical therapy is one of the three fundamental directions ("genomics to biology," "genomics to health" and "genomics to society").1 At the same time the National Institutes of Health (NIH) laid out a similar roadmap for biomedical sciences.2 Both the NHGRI grand challenges and the NIH roadmap recognized bioinformatics as an integral part of the future of life sciences. While this recognition is gratifying for the bioinformatics community, its task now is to answer the challenge of making a direct impact on medical science and benefiting human health. Innovative use of informatics in the "translation from bench to bedside" has become a key goal for bioinformaticians. In 2005, the Pacific Symposium on Biocomputing (PSB) first solicited papers related to one aspect of this challenge, protein interactions and disease, which directly addresses computational approaches in the search for the molecular basis of disease. The goal of the session was to bring together scientists interested in both bioinformatics and medical sciences to present their research progress. The session generated great interest, resulting in a number of high-quality papers and testable hypotheses regarding the involvement of proteins in various disease pathways. This year, the papers accepted for the session on Protein Interactions and Disease at PSB 2007 follow the same trend.
The first group of papers explored structural aspects of protein-protein interactions. Kelly et al. study ABC transporter proteins, which are involved in substrate transport through the membrane. By investigating intra-transporter domain interfaces they conclude that nucleotide-binding interfaces are more conserved than those of transmembrane domains. Disease-related mutations were mapped onto these interfaces. Pulim et al. developed a novel threading algorithm that predicts interactions between receptors (membrane proteins) and ligands. The method was tested on cytokines, proteins implicated in intercellular communication and immune system response. Novel candidate interactions, which may be implicated in disease, were predicted. Kasson and Pande use molecular dynamics to address high-order molecular organization in cell membranes. A large number of molecular dynamics trajectories provided clues into structural aspects of the insertion of the roughly 20-residue fusion peptide into a cell membrane by the hemagglutinin trimer of the influenza virus. The authors explain effects of mutations that preserve the peptide's monomeric structure but incur loss of viral infectivity.

The second group of studies focused on analysis of protein interaction networks. Sam et al. investigate molecular factors responsible for diseases with different causes but similar phenotypes and postulate that some are related to breakdowns in shared protein-protein interaction networks. A statistical method is proposed to identify protein networks shared by diseases. Sridhar et al. developed an efficient algorithm for perturbing metabolic networks in order to stop the production of target compounds while minimizing unwanted effects. The algorithm is aimed at drug development, where the toxicity of the drug should be reduced. Borgwardt et al. were interested in predicting clinical outcome by combining microarray and protein-protein interaction data. They use graph kernels as a measure of similarity between graphs and develop methods to improve their scalability to large graphs. Support vector machines were used to predict disease outcome. Gonzalez et al. extracted a large number of gene-disease relationships by parsing literature and mapping them to the known protein-protein interaction networks. They propose a method for ranking proteins for their involvement in disease. The method was tested on atherosclerosis. Valdivia-Granda et al. devised a method to integrate protein-protein interaction data, along with other genomic annotation features, with microarray data. They applied it to microarray data from a study of non-human primates infected with variola and identified early infection biomarkers. The study was complemented with a comparative protein domain analysis between host and pathogen. This work contributes to the understanding of the mechanisms of infectivity and disease, and suggests potential therapeutic targets. Finally, Cook et al. worked on a novel ontology of biochemical pathways. They present Chalkboard, a tool for
building and visualizing biochemical pathways. Chalkboard can be used interactively and is capable of making inferences.

Acknowledgements

The session co-chairs would like to thank numerous reviewers for their help in selecting the best papers among many excellent submissions.

References
1. Collins FS, Green ED, Guttmacher AE, Guyer MS. A vision for the future of genomics research. Nature 2003; 422(6934):835.
2. Zerhouni E. The NIH roadmap. Science 2003; 302(5642):63.
GRAPH KERNELS FOR DISEASE OUTCOME PREDICTION FROM PROTEIN-PROTEIN INTERACTION NETWORKS
KARSTEN M. BORGWARDT AND HANS-PETER KRIEGEL
Institute for Computer Science, Ludwig-Maximilians-University Munich, Oettingenstr. 67, 80538 Munich, Germany
E-mail: [email protected], [email protected]

S.V.N. VISHWANATHAN AND NICOL N. SCHRAUDOLPH
Statistical Machine Learning Program, National ICT Australia, Canberra, 0200 ACT, Australia
E-mail: SVN.[email protected], [email protected]
It is widely believed that comparing discrepancies in the protein-protein interaction (PPI) networks of individuals will become an important tool in understanding and preventing diseases. Currently PPI networks for individuals are not available, but gene expression data is becoming easier to obtain and allows us to represent individuals by a co-integrated gene expression/protein interaction network. Two major problems hamper the application of graph kernels - state-of-the-art methods for whole-graph comparison - to compare PPI networks. First, these methods do not scale to graphs of the size of a PPI network. Second, missing edges in these interaction networks are biologically relevant for detecting discrepancies, yet these methods do not take this into account. In this article we present graph kernels for biological network comparison that are fast to compute and take into account missing interactions. We evaluate their practical performance on two datasets of co-integrated gene expression/PPI networks.
1. Introduction

An important goal of research on protein interactions is to identify relevant interactions that are involved in disease outbreak and progression. Measuring discrepancies between protein-protein interaction (PPI) networks of healthy and ill patients is a promising approach to this problem. Unfortunately,
establishing individual networks is beyond the current scope of technology. Co-integrated gene expression/PPI networks, however, offer an attractive alternative to study the impact of protein interactions on disease. But, researchers in this area are often faced with a computationally challenging problem: how to measure similarity between large interaction networks? Moreover, biologically relevant information can be gleaned both from the presence and absence of interactions. How does one make use of this domain knowledge? The aim of this paper is to answer both these questions systematically.
1.1. Interaction Networks are Graphs
We begin our study by observing that interaction networks are graphs, where each node represents a protein and each edge represents the presence of an interaction. Conventionally there are two ways of measuring similarity between graphs. One approach is to perform a pairwise comparison of the nodes and/or edges in two networks, and calculate an overall similarity score for the two networks from the similarity of their components. This approach takes time quadratic in the number of nodes and edges, and is thus computationally feasible even for large graphs. However, this strategy is flawed in that it completely neglects the structure of the networks, treating them as sets of nodes and edges instead of graphs. A more principled alternative would be to deem two networks similar if they share many common substructures, or more technically, if they share many common subgraphs. To compute this, however, we would have to solve the so-called subgraph isomorphism problem, which is known to be NP-complete, i.e., the computational cost of this problem increases exponentially with problem size, seriously limiting this approach to very small networks [1]. Many heuristics have been developed to speed up subgraph isomorphism by using special canonical labelings of the graphs; none of them, however, can avoid an exponential worst-case computation time. Graph kernels as a measure of similarity on graphs offer an attractive middle ground: they can be computed in polynomial time, yet they compare non-trivial substructures of graphs. In spite of these attractive properties, existing graph kernels neither scale to large interaction networks nor address the issue of missing interactions. In this paper, we present fast algorithms for computing graph kernels which scale to large networks. Simultaneously, by using a complement graph - a graph
made up of all the nodes and the missing edges in the original graph - we address the issue of missing interactions in a principled manner.

Outline. The remainder of this article is structured as follows. In Section 2, we will review existing graph kernels, and illustrate the problems encountered when applying graph kernels to large networks. In Section 3, we will present algorithms for speeding up graph kernel computation, and in Section 4, we will define graph kernels that take into account missing interactions as well. In our experiments (see Section 5), we employ our fast and enhanced graph kernels for disease outcome prediction, before concluding with an outlook and discussion.

2. Review of Existing Graph Kernels

Existing graph kernels can be viewed as a special case of R-Convolution kernels proposed by Haussler [2]. The basic idea here is to decompose the graph into smaller substructures, and build the kernel based on similarities between the decomposed substructures. Different kernels mainly differ in the way they decompose the graph for comparison and the similarity measure they use to compare the decomposed substructures. Random walk kernels are based on a simple idea: given a pair of graphs, decompose them into paths obtained by performing a random walk, and count the number of matching walks [3-5]. Various incarnations of these kernels use different methods to compute similarities between walks. For instance, Gärtner et al. [4] count the number of nodes in the random walk which have the same label. They also include a decay factor to ensure convergence. Borgwardt et al. [3], on the other hand, use a kernel defined on nodes and edges in order to compute similarity between random walks. Although derived using a completely different motivation, it was recently shown by Vishwanathan et al. [6] that the marginalized graph kernels of Kashima et al. [5] are also essentially random walk kernels. Mahé et al. [7] extend the marginalized graph kernels in two ways. They enrich the labels by using the so-called Morgan index, and modify the kernel definition to prevent tottering, i.e., similar smaller substructures from generating high similarity scores. Both these extensions are particularly relevant for chemoinformatics applications. Other decompositions of graphs, which are well suited for particular application domains, include subtrees [8], molecular fingerprints based on various types of depth first searches [9], and structural elements like rings, functional groups and so on [10].
While many domain-specific variants of graph kernels yield state-of-the-art performance, they are plagued by computational issues when used to compare large graphs like those frequently found in PPI networks. This is mainly due to the fact that the kernel computation algorithms typically scale as $O(n^6)$ or worse. Practical applications therefore either compute the kernel approximately or make unrealistic sparsity assumptions on the input graphs. In contrast, in the next section, we discuss three efficient methods for computing random walk graph kernels which are both theoretically sound and practically efficient.

3. Fast Random Walk Kernels

In this section we briefly describe a unifying framework for random walk kernels, and present fast algorithms for their computation. We warn the biologically motivated reader that this section is rather technical. But the algorithms presented below allow us to efficiently compute kernels on large graphs, and hence are crucial building blocks of our classifier for disease outcome prediction.

3.1. Notation
A graph $G(V, E)$ consists of an ordered and finite set of $n$ vertices $V$ denoted by $\{v_1, v_2, \ldots, v_n\}$, and a finite set of edges $E \subseteq V \times V$. $G$ is said to be undirected if $(v_i, v_j) \in E \Leftrightarrow (v_j, v_i) \in E$ for all edges. The unnormalized adjacency matrix of $G$ is an $n \times n$ real matrix $P$ with $P_{ij} = 1$ if $(v_i, v_j) \in E$, and 0 otherwise. If $G$ is weighted then $P$ can contain non-negative entries other than zeros and ones, i.e., $P_{ij} \in (0, \infty)$ if $(v_i, v_j) \in E$ and zero otherwise. Let $D$ be an $n \times n$ diagonal matrix with entries $D_{ii} = \sum_j P_{ij}$. The matrix $A := P D^{-1}$ is called the normalized adjacency matrix, or simply adjacency matrix. A walk $w$ on $G$ is a sequence of indices $w_1, w_2, \ldots, w_{t+1}$ where $(v_{w_i}, v_{w_{i+1}}) \in E$ for all $1 \leq i \leq t$. The length of a walk is equal to the number of edges encountered during the walk (here: $t$). A random walk is a walk where $\Pr(w_{i+1} \mid w_1, \ldots, w_i) = A_{w_i, w_{i+1}}$, i.e., the probability at $w_i$ of picking $w_{i+1}$ next is directly proportional to the weight of the edge. Let $\mathcal{X}$ be a set of labels which includes the special label $\epsilon$. An edge-labeled graph $G$ is associated with a label matrix $L \in \mathcal{X}^{n \times n}$, such that $L_{ij} = \epsilon$ iff $(v_i, v_j) \notin E$. In other words, only those edges which are present in the graph get a non-$\epsilon$ label. Let $\mathcal{H}$ be the RKHS endowed with the kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, and let $\phi : \mathcal{X} \to \mathcal{H}$ denote the corresponding feature map
which maps $\epsilon$ to the zero element of $\mathcal{H}$. We use $\Phi(L)$ to denote the feature matrix of $G$. For ease of exposition we do not consider labels on vertices here, though our results hold for that case as well.
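As a concrete illustration of this notation, the following short Python sketch builds the unnormalized matrix $P$ and the normalized adjacency matrix $A := P D^{-1}$ for a toy undirected graph; the edge list is a hypothetical example, not data from the paper.

    import numpy as np

    # Toy undirected graph on n = 4 vertices (hypothetical edge list).
    n = 4
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

    # Unnormalized adjacency matrix P: P[i, j] = 1 iff (v_i, v_j) is an edge.
    P = np.zeros((n, n))
    for i, j in edges:
        P[i, j] = P[j, i] = 1.0  # undirected, so P is symmetric

    # D is diagonal with the row sums of P; A := P D^{-1} is the normalized
    # adjacency matrix, so for symmetric P each column of A sums to one.
    D_inv = np.diag(1.0 / P.sum(axis=1))
    A = P @ D_inv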
3.2. Product Graphs
Given two graphs $G(V, E)$ and $G'(V', E')$, the product graph $G_\times(V_\times, E_\times)$ is a graph with $nn'$ vertices, each representing a pair of vertices from $G$ and $G'$, respectively. An edge exists in $E_\times$ iff the corresponding vertices are adjacent in both $G$ and $G'$. Thus
$$V_\times = \{ (v_i, v'_{i'}) : v_i \in V \wedge v'_{i'} \in V' \}, \tag{1}$$
$$E_\times = \{ ((v_i, v'_{i'}), (v_j, v'_{j'})) : (v_i, v_j) \in E \wedge (v'_{i'}, v'_{j'}) \in E' \}. \tag{2}$$
If $A$ and $A'$ are the adjacency matrices of $G$ and $G'$, respectively, the adjacency matrix of the product graph $G_\times$ is given by $A_\times := A \otimes A'$, where $\otimes$ represents the Kronecker product of matrices. If $G$ and $G'$ are edge-labeled, we can associate a weight matrix $W_\times \in \mathbb{R}^{nn' \times nn'}$ with $G_\times$ defined as $W_\times = \Phi(L) \otimes \Phi(L')$. Recall that $\Phi(L)$ and $\Phi(L')$ are matrices defined in an RKHS. Hence we use a slightly extended version of the Kronecker product and define the $(in+j, i'n'+j')$-th entry of $W_\times$ as $\kappa(L_{ij}, L'_{i'j'})$. As a consequence of the definition of $\Phi(L)$ and $\Phi(L')$, the entries of $W_\times$ are non-zero only if the corresponding edges exist in the product graph. We assume that $\mathcal{H} = \mathbb{R}^d$ endowed with the usual dot product, and that there are $d$ distinct edge labels $\{1, 2, \ldots, d\}$. Moreover we let $\kappa$ be a delta kernel, i.e., its value between any two edges is one iff the labels on the edges match, and zero otherwise. Let ${}^{l}\!A$ denote the adjacency matrix of the graph filtered by the label $l$, i.e., ${}^{l}\!A_{ij} = A_{ij}$ if $L_{ij} = l$ and zero otherwise. Some simple algebra (omitted for the sake of brevity) shows that the weight matrix of the product graph can be written as
$$W_\times = \sum_{l=1}^{d} {}^{l}\!A \otimes {}^{l}\!A'. \tag{3}$$
Let $p$ and $p'$ denote initial probability distributions over vertices of $G$ and $G'$. Then the initial probability distribution $p_\times$ of the product graph is $p_\times := p \otimes p'$. Likewise, if $q$ and $q'$ denote stopping probabilities (i.e., the probability that a random walk ends at a given vertex), the stopping probability $q_\times$ of the product graph is $q_\times := q \otimes q'$.
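For the unlabeled case the weight matrix reduces to the Kronecker product of the two normalized adjacency matrices, which is easy to sketch in a few lines of Python; the two toy graphs and the uniform initial and stopping distributions below are assumptions made purely for illustration.

    import numpy as np

    def normalized_adjacency(P):
        """A := P D^{-1}, with D the diagonal matrix of row sums of P."""
        return P @ np.diag(1.0 / P.sum(axis=1))

    # Two toy undirected graphs (hypothetical adjacency patterns).
    P1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
    P2 = np.array([[0, 1, 0, 1], [1, 0, 1, 0],
                   [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
    A1, A2 = normalized_adjacency(P1), normalized_adjacency(P2)

    # Product graph adjacency A_x = A (Kronecker) A'; for unlabeled graphs
    # this plays the role of the weight matrix W_x in equation (3).
    A_x = np.kron(A1, A2)

    # Uniform initial and stopping distributions: p_x = p (Kronecker) p', etc.
    p_x = np.kron(np.full(3, 1 / 3), np.full(4, 1 / 4))
    q_x = np.kron(np.full(3, 1 / 3), np.full(4, 1 / 4))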
3.3. Kernel Definition
An edge exists in the product graph if, and only if, an edge exists in both $G$ and $G'$. Therefore, performing a simultaneous random walk on $G$ and $G'$ is equivalent to performing a random walk on the product graph [11]. Given the weight matrix $W_\times$, initial and stopping probability distributions $p_\times$ and $q_\times$, and an appropriately chosen discrete measure $\mu$, we can define a random walk kernel on $G$ and $G'$ as
$$k(G, G') := \sum_{k=0}^{\infty} \mu(k)\, q_\times^\top W_\times^{k}\, p_\times. \tag{4}$$
A popular choice to ensure convergence of (4) is to assume $\mu(k) = \lambda^k$ for some $\lambda > 0$. If $\lambda$ is sufficiently small$^{a}$, then (4) is well defined, and we can write
$$k(G, G') = \sum_{k} \lambda^k q_\times^\top W_\times^{k} p_\times = q_\times^\top (\mathrm{I} - \lambda W_\times)^{-1} p_\times, \tag{5}$$
where $\mathrm{I}$ denotes the identity matrix. It can be shown (see Vishwanathan et al. [6]) that the marginalized graph kernels of Kashima et al. [5] as well as the geometric graph kernels of Gärtner et al. [4] are special cases of (5).
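Continuing the toy setup above, equation (5) can be evaluated directly for small product graphs by solving a single linear system instead of forming the inverse explicitly; the damping value lam is a hypothetical choice, assumed small enough for convergence.

    import numpy as np

    def random_walk_kernel_direct(W_x, p_x, q_x, lam=0.01):
        """k(G, G') = q_x^T (I - lam * W_x)^{-1} p_x, computed by a linear solve.
        This is the 'direct' approach; its cost grows like (nn')^3."""
        m = W_x.shape[0]
        return q_x @ np.linalg.solve(np.eye(m) - lam * W_x, p_x)

    # Example call, reusing A_x, p_x, q_x from the product-graph sketch above:
    # k = random_walk_kernel_direct(A_x, p_x, q_x)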
3.4. Fast Computation
Direct computation of (5) is prohibitively expensive since it involves the inversion of an $nn' \times nn'$ matrix, which scales as $O(n^6)$. We now outline three efficient schemes whose worst case computational complexity is lower, and whose practical performance as measured by our experiments is up to three orders of magnitude faster. Vishwanathan et al. [6] contains more technical and algorithmic details.

3.4.1. Sylvester Equation Methods

Consider the following equation, commonly known as the Sylvester or Lyapunov equation:
$$X = S X T + X_0. \tag{6}$$
Here, $S, T, X_0 \in \mathbb{R}^{n \times n}$ are given and we need to solve for $X \in \mathbb{R}^{n \times n}$. These equations can be readily solved in $O(n^3)$ time with freely available code [12], e.g., Matlab's dlyap method.
$^{a}$The values of $\lambda$ which ensure convergence depend on the spectrum of $W_\times$.
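To make the role of equation (6) concrete, the sketch below solves it by brute-force vectorization, using the identity $\mathrm{vec}(SXT) = (T^\top \otimes S)\,\mathrm{vec}(X)$; this naive solver costs $O(n^6)$ and is only meant to illustrate the equation, not to replace the $O(n^3)$ routines (e.g., dlyap) mentioned above. The random test matrices are invented for the example.

    import numpy as np

    def solve_sylvester_naive(S, T, X0):
        """Solve X = S X T + X0 by vectorization (illustration only).
        vec() stacks columns, so vec(S X T) = (T^T kron S) vec(X)."""
        n = S.shape[0]
        M = np.eye(n * n) - np.kron(T.T, S)
        x = np.linalg.solve(M, X0.flatten(order="F"))
        return x.reshape((n, n), order="F")

    # Quick check on small random matrices, scaled so a solution exists.
    rng = np.random.default_rng(0)
    S, T, X0 = 0.3 * rng.random((4, 4)), 0.3 * rng.random((4, 4)), rng.random((4, 4))
    X = solve_sylvester_naive(S, T, X0)
    assert np.allclose(X, S @ X @ T + X0)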
It can be shown that if the weight matrix $W_\times$ can be written as (3), then the problem of computing the graph kernel (5) can be reduced to the problem of solving the following generalized Sylvester equation:
$$X = \sum_{i} \lambda\, {}^{i}\!A'\, X\, {}^{i}\!A^{\top} + X_0, \tag{7}$$
where $\mathrm{vec}(X_0) = p_\times$, with $\mathrm{vec}(\cdot)$ being the function that flattens a matrix by vertically concatenating its columns.

3.4.2. Conjugate Gradient Methods

Given a matrix $M$ and a vector $b$, conjugate gradient (CG) methods solve the system of equations $Mx = b$ efficiently [13]. They are particularly efficient if the matrix is rank deficient, or has a small effective rank, i.e., number of distinct eigenvalues. Furthermore, if computing matrix-vector products is cheap the CG solver can be sped up significantly [13]. The graph kernel (5) can be computed by a two-step procedure: First we solve the linear system
$$(\mathrm{I} - \lambda W_\times)\, x = p_\times \tag{8}$$
for $x$, then we compute $q_\times^\top x$. By using extensions of tensor calculus rules to RKHS, one can compute $W_\times r$ for an arbitrary vector $r$ rather efficiently, which in turn can be used to speed up the CG solver.

3.4.3. Fixed-Point Iterations

Fixed-point methods begin by rewriting (8) as
$$x = p_\times + \lambda W_\times x. \tag{9}$$
Now, solving for $x$ is equivalent to finding a fixed point of the above iteration [13]. Letting $x_t$ denote the value of $x$ at iteration $t$, we set $x_0 := p_\times$, then compute
$$x_{t+1} = p_\times + \lambda W_\times x_t \tag{10}$$
repeatedly until $\|x_{t+1} - x_t\| < \varepsilon$, where $\|\cdot\|$ denotes the Euclidean norm and $\varepsilon$ some pre-defined tolerance. Observe that each iteration of (10) involves computation of the matrix-vector product $W_\times x_t$, and hence the extensions of tensor calculus to RKHS mentioned previously can again be used to speed up the computation.
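The key to both the CG and fixed-point speed-ups is that $W_\times x$ can be applied without ever forming the $nn' \times nn'$ matrix; for the unlabeled case this follows from the identity $(A \otimes A')\,\mathrm{vec}(X) = \mathrm{vec}(A' X A^\top)$. The sketch below implements the fixed-point iteration (10) in that spirit; the damping value and tolerance are illustrative assumptions, and labeled graphs would need one such product per edge label.

    import numpy as np

    def fixed_point_kernel(A, A_prime, p, p_prime, q, q_prime,
                           lam=0.01, tol=1e-10, max_iter=1000):
        """Random walk kernel (5) via the fixed-point iteration (10).
        Uses (A kron A') x = vec(A' X A^T) so the nn' x nn' product matrix
        is never formed explicitly (unlabeled graphs assumed)."""
        n, n_prime = A.shape[0], A_prime.shape[0]
        p_x, q_x = np.kron(p, p_prime), np.kron(q, q_prime)
        x = p_x.copy()
        for _ in range(max_iter):
            X = x.reshape((n_prime, n), order="F")         # "un-vec", column-major
            Wx_x = (A_prime @ X @ A.T).flatten(order="F")  # W_x applied to x
            x_new = p_x + lam * Wx_x
            if np.linalg.norm(x_new - x) < tol:            # Euclidean norm test
                x = x_new
                break
            x = x_new
        return q_x @ x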
4. Composite Graph Kernel

The presence of an edge in a graph signifies interactions between the end nodes. In many applications these interactions are significant. For instance, in chemoinformatics the presence of an edge indicates the presence of a chemical bond between two atoms. In the case of the PPI networks, the presence of an edge indicates that the corresponding proteins interact. But, when studying protein interactions in disease, not just the presence but also the absence of interactions is significant. Existing graph kernels (e.g. (5)) cannot take this into account. We propose to modify the existing kernels to take this information into account. Key to our exposition is the notion of a complement graph which we define below. Suppose $G(V, E)$ is a graph with vertex set $V$ and edge set $E$. Then, its complement $\bar{G}(V, \bar{E})$ is a graph with the same vertex set $V$, but with a different edge set $\bar{E} := V \times V \setminus E$. In other words, the complement graph is made up of all the edges missing from the original graph. Using the concept of a complement graph we can now define a composite graph kernel as follows:
$$k_{\mathrm{comp}}(G, G') = k(G, G') + k(\bar{G}, \bar{G}'). \tag{11}$$
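A minimal sketch of the composite kernel (11), assuming some base graph kernel function that accepts two adjacency matrices; whether self-loops belong to the complement edge set is left implicit in the definition above, and this sketch excludes them.

    import numpy as np

    def complement_adjacency(P):
        """Complement graph: same vertices, exactly the missing edges.
        Self-loops are excluded here (an assumption about V x V \\ E)."""
        C = 1.0 - (P > 0)
        np.fill_diagonal(C, 0.0)
        return C

    def composite_kernel(P, P_prime, base_kernel):
        """k_comp(G, G') = k(G, G') + k(G-bar, G'-bar), as in equation (11)."""
        return (base_kernel(P, P_prime)
                + base_kernel(complement_adjacency(P),
                              complement_adjacency(P_prime)))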
Although this kernel seems simple-minded at first, it is in fact rather useful. To see this consider the product graph $\bar{G}_\times(\bar{V}_\times, \bar{E}_\times)$ of the complement graphs $\bar{G}$ and $\bar{G}'$. An edge exists in this graph if, and only if, the corresponding edge is absent in both $G$ and $G'$. In other words, this graph characterizes all the missing interactions which are absent in both the PPI networks. As demonstrated by our experiments, this insight leads to gains in performance when comparing co-integrated gene expression/protein interaction networks.

5. Experiments

The aim of our experiments is to predict disease outcome (dead or alive) of cancer patients using a combination of human PPI data and clinical gene expression data.

Leukemia data. For our first experiment, we employed a dataset of 119 microarrays of leukemia patients from Bullinger et al. [14], and co-integrated the expression profiles of these patients and known human PPI from Rual et al. [15]. This approach of co-integrating PPI and gene expression data is built on the assumption that genes with similar gene expression levels
are translated into proteins that are more likely to interact. Recent studies confirm that this assumption holds significantly more often for co-expressed than for random pairs of proteins [16, 17]. Specifically, we transformed a patient's gene expression profile into a graph as follows: A node is created for every protein which participates in a protein interaction, and whose corresponding gene expression level was measured on this patient's microarray. We connect two proteins in this graph by an edge if Rual et al. [15] list these proteins as interacting, and both genes are up- or downregulated with respect to a certain reference measurement. We found 2167 proteins from Rual et al. [15] for which Bullinger et al. [14] report gene expression levels. The CPU runtimes of our CG, fixed-point, and Sylvester equation approaches to graph kernel computation (as described in Section 3) on the 119 patients modeled as graphs are contrasted with that of the direct approach in Table 1. 50 of the 119 patients survived leukemia. Using the computed kernel and a support vector machine (SVM) we tried to predict the survivors, in the first variant by using a vanilla graph kernel (5), and in the second variant by using the composite graph kernel (11). The average prediction accuracy obtained by performing 10-fold cross validation (with 10 repetitions) is reported in Table 2 for both approaches.
Table 1. Average time (in seconds) taken by different methods to compute the graph kernel on protein interaction networks.

                          Leukemia dataset          Breast cancer dataset
Computation approach      Vanilla    Composite      Vanilla    Composite
Direct                    8,123      18,749         14,476     30,285
Sylvester                 723        1,541          1,221      2,751
CG                        6          13             13         28
Fixed-point               4          7              8          17
Breast cancer data. This dataset consists of the microarrays of 78 breast cancer patients, of which 44 survived the disease [18]. When generating co-integrated graphs, we found 2429 proteins from Rual et al. [15] for which van 't Veer et al. [18] measure gene expression. As before, we report runtimes for kernel computations and accuracy levels for different variants of graph kernels in Table 1 and Table 2, respectively.
Table 2. Prediction accuracy (and standard deviation) of vanilla and composite graph kernels, averaged over 10 repetitions of 10-fold cross validation.

Graph kernel variant      Leukemia dataset      Breast cancer dataset
Vanilla                   59.17 (2.49)          56.41 (2.12)
Composite                 63.33 (1.76)          61.54 (1.54)
Results. On both datasets, our approaches to fast graph kernel computation lead to a gain in speed of up to three orders of magnitude. The composite graph kernel outperforms the vanilla graph kernel in accuracy in both experiments, with an increase in prediction accuracy of around 4-5%. The vanilla random walk kernel suffers from its inability to measure network discrepancies, the simplicity of the graph model employed, and the fact that only a small minority of genes could be mapped to interacting proteins; due to these problems, its accuracy is close to that of a random classifier. But, since the composite kernel also models the missing interactions, even a simple model is able to capture relevant biological information, which in turn leads to better classification accuracy on these challenging datasets [19].

6. Outlook and Discussion

Two major stumbling blocks prevented the analysis of large datasets of PPI networks by modeling them as graphs. First, the scalability of the similarity metric to large graphs. Second, the inability to effectively use biological domain knowledge, viz. the presence or absence of certain protein interactions. In this article, we addressed both these issues by extending graph kernels to make them applicable to PPI networks. We sped up the computation of graph kernels by up to three orders of magnitude, without resorting to heuristics or approximations, thus making them practical for large problems. By using a composite kernel, we are able to model both the presence and absence of interactions between proteins. This leads to noteworthy improvements in classification accuracies in our experiments on disease outcome prediction for cancer patients. With both these features, graph complement comparison and scalability, we have laid the foundation for future application of graph kernels in research on PPI networks in particular, and proteomics in general. While we have proposed approaches to overcome general problems, future studies will look at even more refined graph kernels that will further increase
prediction accuracy for specific tasks.

Acknowledgments

The authors thank Matthias Siebert for preprocessing the network data. National ICT Australia is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia's Ability and the ICT Center of Excellence program. This work is supported by the IST Program of the European Community, under the Pascal Network of Excellence, IST-2002-506778, and by the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 031U112F within the BFAM (Bioinformatics for the Functional Analysis of Mammalian Genomes) project, part of the German Genome Analysis Network (NGFN).

References
1. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Series of Books in Mathematical Sciences. W. H. Freeman, 1979.
2. D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
3. K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H. P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(Suppl 1):i47-i56, 2005.
4. T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Schölkopf and M. K. Warmuth, editors, Proc. Annual Conf. Computational Learning Theory. Springer, 2003.
5. H. Kashima, K. Tsuda, and A. Inokuchi. Kernels on graphs. In K. Tsuda, B. Schölkopf, and J. Vert, editors, Kernels and Bioinformatics, Cambridge, MA, 2004. MIT Press.
6. S. V. N. Vishwanathan, K. Borgwardt, and N. N. Schraudolph. Fast computation of graph kernels. Technical report, NICTA, 2006.
7. P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
8. J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. Technical report, First International Workshop on Mining Graphs, Trees and Sequences (held with ECML/PKDD'03), 2003.
9. L. Ralaivola, S. J. Swamidass, H. Saigo, and P. Baldi. Graph kernels for chemical informatics. Neural Networks, 18(8):1093-1110, 2005.
10. H. Fröhlich, J. K. Wegner, F. Sieker, and A. Zell. Kernel functions for attributed molecular graphs - a new similarity-based approach to ADME prediction in classification and regression. QSAR and Combinatorial Science, 25(4):317-326, 2006.
11. F. Harary. Graph Theory. Addison-Wesley, Reading, MA, 1969.
12. J. D. Gardiner, A. L. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223-231, 1992.
13. J. Nocedal and S. J. Wright. Numerical Optimization. Springer Series in Operations Research. Springer, 1999.
14. L. Bullinger, K. Döhner, E. Bair, S. Fröhling, R. F. Schlenk, R. Tibshirani, H. Döhner, and J. R. Pollack. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N Engl J Med, 350(16):1605-1616, 2004.
15. J. F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173-1178, 2005.
16. N. Bhardwaj and H. Lu. Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics, 21(11):2730-2738, 2005.
17. H. B. Fraser, A. E. Hirsh, D. P. Wall, and M. B. Eisen. Coevolution of gene expression among interacting proteins. Proc Natl Acad Sci U S A, 101(24):9033-9038, 2004.
18. L. J. van 't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530-536, 2002.
19. P. Warnat, R. Eils, and B. Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6:265, 2005.
CHALKBOARD: ONTOLOGY-BASED PATHWAY MODELING AND QUALITATIVE INFERENCE OF DISEASE MECHANISMS

COOK, DANIEL L.1, WILEY, JESSE C.2, & GENNARI, JOHN H.3
1 Physiology/Biophysics, 2 Comparative Medicine, 3 Biomedical & Health Informatics, University of Washington, Seattle, WA, 98195, USA
We introduce Chalkboard, a prototype tool for representing and displaying cell-signaling pathway knowledge, for carrying out simple qualitative reasoning over these pathways, and for generating quantitative biosimulation code. The design of Chalkboard has been driven by the need to quickly model and visualize alternative hypotheses about uncertain pathway knowledge. Chalkboard allows biologists to test in silico the implications of various hypotheses. To fulfill this need, Chalkboard includes (1) a rich ontology of pathway entities and interactions, which is ultimately informed by the basic chemistry and physics among molecules, and (2) a form of qualitative reasoning that computes causal chains and feedback loops within the network of entities and reactions. We demonstrate Chalkboard's capabilities in the domain of APP proteolysis, a pathway that plays a key role in the pathogenesis of Alzheimer's disease. In this pathway (as is common), information is incomplete and parts of the pathways are conjectural, rather than experimentally verified. With Chalkboard, we can carry out in silico perturbation experiments and explore the consequences of different conjectural connections and relationships in the network. We believe that pathway reasoning capabilities and in silico experiments will become a critical component of the hypothesis generation phase of modern biological research.
Motivation

Molecular biologists must understand how biochemical reactions trigger downstream events leading to particular pathologies or phenotypes, yet our signaling pathway knowledge is incomplete and volatile. Given a flood of high-throughput data, biologists increasingly depend on a myriad of well-organized, easily accessed data repositories [1-3] which, however, only provide the building blocks for generating and testing competing hypothetical pathway models of phenotypic expression. In this paper, we describe a candidate tool, Chalkboard, that allows biologists to easily build, revise and reason about pathway knowledge based on an ontology-based representation of the underlying chemistry and biophysics of pathway participants and reactions. Following Davis et al., we recognize that a knowledge representation is both a declarative language that captures knowledge (such as an ontology for pathway representation) and an inference method that operates on the model [4]. The declarative language that allows one to state facts must be linked to the inference method that allows one to answer questions about those facts.
Thus, Chalkboard is a hypothesis-generation tool designed for ease of use that allows researchers to easily explore the behavior of hypothetical pathways in order to better direct in vitro or in vivo research. Chalkboard perturbation experiments graphically display the consequences of molecular activities and pathway links as a tool to identify downstream effects and inconsistencies with current knowledge. Furthermore, as quantitative pathway data become available, Chalkboard can automatically generate a set of quantitative differential equations in the JSim mathematical modeling language [5]. We demonstrate these capabilities by representing and testing a pathway model of the amyloid precursor protein (APP) processing pathway, which is at the core of the "amyloid hypothesis" of Alzheimer's Disease pathogenesis [6], as described in Section 4.
Overview of Chalkboard

Chalkboard is so named to emphasize its key features: First, one can create and modify models easily. Second, the system is designed for hypothesis generation and laboratory brainstorming: sharing, developing, and communicating hypothetical models with others. This contrasts with systems designed as repositories of consensus or authoritative models or datasets. Beyond a physical chalkboard, however, our Chalkboard models can be probed in silico to test ideas and predict outcomes as a guide to hypothesis generation.

1.1. Representing and visualizing biomolecules and their interactions

Chalkboard's representation of biomolecules, events and interactions is based on the BioD biological description language [7], which has evolved into an ontology organized around three major classes: Entity, Action, and Functional attribute. The Entity class represents basic cell biological entities such as compartments (Compartment; e.g., intracellular space, intranuclear space), molecules (Molecule, e.g., a protein or polynucleotide), and the functional domains of molecules (Functional sites, e.g., binding sites, catalytic sites). Chalkboard enforces rules for composing complex cell biological structures. For example, Compartments may be nested within Compartments but not within a Molecule; a Functional site can be a part of a Molecule but not vice versa. Chalkboard implements a number of primitive Actions to represent functional interactions between Entities. Chemical flows represent a variety of chemical processes such as Bind reactions (dimerization) and Transporter flow (across a Compartment boundary). Actions can be modulated (e.g., activated or inhibited) to represent the complex cell signaling logic. We also include "wildcard" classes (Wildcard producer, Wildcard producer flow, Wildcard change action) for representing entities and actions whose physical basis is unknown. Functional attributes represent the state attributes of Entities (e.g., Concentration) and Actions (e.g., Rate
of a reaction) and provide the computational basis for qualitative reasoning (Chalkboard's inference method) and biosimulation code generation.

Figure 1. Annotated screenshot of the Chalkboard modeling environment showing the tool palette (left side) and a simple signaling cascade model.

Our emphasis to date has been to create an ontology and computational system that (a) is based on formal views of anatomy and physiology [8, 9], (b) is sufficient to carry out qualitative inference (see Section 2.2) and (c) can automatically generate mathematical biosimulation code as warranted by available data (see Section 2.3). As described in Section 4, our ontology will conform to standards for sharing pathway and biological knowledge [2, 10] while retaining its inference capability. Figure 1 shows Chalkboard's graphical model editing environment that includes a model-drawing area and a tool palette for: a Cursor, a PathTrace tool (see Section 3.2), tools for installing Entities, and an Action tool for linking Entities. Model building is simplified because Entities and Actions are implemented as "smart" objects that enforce entity-composition and action-linking rules. For example, we built the simple signaling cascade in Figure 1 in steps: (1) Create and name 3 molecules with the "Molecule" tool. (2) Add to these molecules two Binding sites, a Phosphorylation site ("P-site") and a Kinase site. (3) Use the "Action" tool to install a Bind action, an Activate action, and a Phosphorylate action. Chalkboard's context-sensitive linking automatically
installs the correct Action for the Entities being linked. Even the limited set of primitive Actions and Entities in Chalkboard's current ontology provides a rich and flexible vocabulary for creating models of complex cell biological systems. For example, activation actions can be either inhibitory (open arrowhead) or excitatory (solid arrowhead). In Section 4, we explore a biological example that includes proteolytic reactions. A minimal sketch of the composition-rule idea follows the figure caption below.

Figure 2. Chalkboard PathTracing applied to a simple metabolic model. Panel A: "A" produces "B", which dissociates into "C" and "D". D is transformed into "E" while C binds to a site on A that inhibits B production. The D-to-E transformation is activated by C via a Wildcard change action (solid-headed, single-weight arrow). Panel B: With PathTrace, the user clicks on B and drags up to increment the amount of B. This increment propagates through the pathway and feeds back negatively on itself (a red side-arrow; positive feedback displays a green side-arrow). The change of D amount is ambiguous (yellow oval) because the increment of C due to the increment of B is counteracted by the activation of D transformation into E. Panel C: The Wildcard activation of the D-to-E reaction has been clamped (the red slash sign; equivalent to a "functional knockout") so that the change in D amount is no longer ambiguous.
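The entity-composition rules described above can be pictured with a short Python sketch; the class names and checking function are illustrative inventions, not Chalkboard's actual implementation.

    class Entity:
        def __init__(self, name):
            self.name, self.parts = name, []

    class Compartment(Entity): pass
    class Molecule(Entity): pass
    class FunctionalSite(Entity): pass

    # Composition rules: which entity types may be nested inside which.
    ALLOWED_PARTS = {
        Compartment: (Compartment, Molecule),  # compartments nest compartments and molecules
        Molecule: (FunctionalSite,),           # sites belong to molecules, not vice versa
        FunctionalSite: (),
    }

    def add_part(parent, child):
        """Attach child to parent only if the composition rules allow it."""
        if not isinstance(child, ALLOWED_PARTS[type(parent)]):
            raise ValueError(f"{type(child).__name__} cannot be part of "
                             f"{type(parent).__name__}")
        parent.parts.append(child)

    cell = Compartment("intracellular space")
    app = Molecule("APP")
    add_part(cell, app)                       # allowed
    add_part(app, FunctionalSite("P-site"))   # allowed
    # add_part(app, Compartment("nucleus"))   # would raise: not allowed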
1.2. Inference using PathTracing

Chalkboard's PathTrace tool allows researchers to carry out exploratory thought experiments in silico using an inference method that simulates qualitative responses to small perturbations of the system. Qualitative responses are displayed with 3 values (Figure 2): increase (upward green arrow), decrease (downward red arrow) or ambiguous (yellow oval). PathTracing also detects feedback loops as well as the effects of in silico "functional knockouts" by "clamping" an Entity or an Action. PathTracing has three user-selectable modes: 1) Find all consequences of the perturbation of an index Entity or Action (as shown below). 2) Find only those feedback loops originating at an index Entity or Action ("Feedback only" mode; not shown). 3) Find only those pathways by which an index node affects any other preselected node (the "A-to-B" mode; not shown).

1.3. Architecture for PathTracing and biosimulation code generation

The computational architecture that underlies PathTracing can also be used to generate differential-equation biosimulation code. Chalkboard Entities and Actions are endowed with Functional attributes (FA) that represent the values of their physical properties. For example, a molecule has a single FA, its amount (amt): how much of the molecule exists in the system. Functional sites have three properties: amount (assumed equal to the amount of the site's parent molecule), activity (the fraction of sites in an active state), and availability (the amount of active sites). Binding sites are specialized with two additional FAs: occupancy (the fraction of sites occupied by ligand) and bound amount (the amount of occupied sites). As each Entity and Action is installed in a model, its corresponding FAs are created and linked via Operators, directed arcs that represent how each FA value depends (either directly or inversely) upon the values of other FAs. The resulting Inference network (e.g., Figure 3) is the basis for both PathTracing inference and biosimulation code generation. PathTracing is accomplished by propagating tokens through the Inference network, each delivering an incremental or decremental perturbation from one FA node to another. Incoming perturbations are stored and displayed as up- or down-arrows (Figure 2). A subsequent perturbation with a polarity opposite to a stored perturbation displays a yellow oval. At Inference network bifurcations, tokens are cloned and launched into outgoing arcs (as at a Bind site amt). At network convergences (e.g., at a Binding action's Jnet), if an incoming perturbation replicates a prior perturbation then the incoming token is terminated, because cloning it would simply replicate prior network traversals. To detect loops, each token enlists an identifier for each traversed FA node so that if it detects itself it declares a feedback loop and terminates the
token. Feedback loops are characterized as positive or negative according to the net polarity of the perturbations in the token's list. Tokens are also terminated when they reach nodes with no outgoing arcs (as at each occ in Figure 3). Chalkboard reuses the Inference network to automatically generate JSim [5] mathematical biosimulation code (not shown) that includes: (a) system state variables (one for each FA value) with default units, (b) algebraic or differential equations for each Operator (e.g., a rate equation), and (c) Operator equation parameters (e.g., reaction rate constants). The JSim system interprets Chalkboard-generated code, while parameter values are set by users at runtime.

Figure 3. (Top) The ligand and kinase model from Figure 2. (Bottom) The Inference network (not visualized in the Chalkboard user interface). Functional attributes (FAs) for each Entity and Action are represented and linked by a network of Operators (white circles with mathematical symbols) and arcs (arrows) that represent the directed dependencies of attribute values on each other. PathTracing displays one "main" FA for each Entity or Action (bold frames). FAs in this model include: amt = amount of a Molecule or Site (molarity or concentration); act = activity of a Site (percent or fraction); avl = available amount = act x amt (molarity or concentration); occ = occupancy of a Bind site (percent or fraction); bnd = bound amount of a Bind site (molarity or concentration); Jnet = chemical flow rate of a reaction (moles/s or concentration/s); Del = change of a Site attribute (percent or fraction); mod = modulator of an action (percent or fraction).
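The token-propagation scheme just described can be made concrete with a small example. The following is a minimal Python sketch, not Chalkboard's implementation: the dependency graph, node names, and termination rules are simplified assumptions chosen only to illustrate sign propagation, ambiguity marking, and feedback-loop detection.

```python
# Minimal sketch of qualitative perturbation propagation in the spirit of
# PathTracing. This is NOT Chalkboard's code; the graph encoding and the
# traversal rules are simplifying assumptions for illustration only.

def path_trace(edges, start, polarity=+1):
    """edges: dict mapping node -> list of (neighbor, sign), sign = +1 for a
    direct dependency and -1 for an inverse one.
    Returns (effects, loops): effects maps node -> +1, -1, or 0 (ambiguous);
    loops is a list of (cycle_nodes, net_sign) for detected feedback loops."""
    effects = {start: polarity}
    loops = []

    def launch(node, sign, visited):
        for nbr, edge_sign in edges.get(node, []):
            new_sign = sign * edge_sign
            if nbr in visited:                      # token re-enters its own path
                cycle = visited[visited.index(nbr):] + [nbr]
                loops.append((cycle, new_sign))     # net polarity of the loop
                continue                            # terminate this token
            prev = effects.get(nbr)
            if prev is None:
                effects[nbr] = new_sign
            elif prev != 0 and prev != new_sign:
                effects[nbr] = 0                    # conflicting signs -> ambiguous
            elif prev == new_sign:
                continue                            # replicated perturbation: stop token
            launch(nbr, new_sign, visited + [nbr])

    launch(start, polarity, [start])
    return effects, loops

# Toy model loosely following Figure 2: B -> C and D; C inhibits B production;
# C also activates the D -> E transformation (which consumes D and raises E).
edges = {
    "B": [("C", +1), ("D", +1)],
    "C": [("B", -1), ("D", -1), ("E", +1)],
    "D": [("E", +1)],
}
effects, loops = path_trace(edges, "B")
print(effects)   # D comes out ambiguous (0), mirroring the yellow oval in Figure 2
print(loops)     # the B -> C -| B cycle is reported with a negative net sign
```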
Figure 4. A view of APP proteolysis within Chalkboard in which the action between LRP and the proteolysis by BACE is clamped. Under this condition, if more LRP is bound to Fe65, or if more LRP is available, then amyloid production decreases.
3. A Chalkboard model of APP processing

Alzheimer's Disease is a pervasive neurodegenerative disorder associated with aging and characterized by neurofibrillary tangles and diffuse cortical plaques [6] whose primary constituent is a small peptide derived from the β-amyloid precursor protein (APP) [11]. The primary theory of Alzheimer's Disease etiology is the "amyloid hypothesis", by which elevated levels of β-amyloid production result in neuronal degeneration, cortical plaques, cognitive dementia, and ultimately death [6]. Effective therapy requires that scientists understand the complex events of APP proteolysis in both normal and pathologic situations. APP is a single-pass transmembrane protein that is sequentially cleaved by proteolytic enzymes into peptides (yellow and blue, respectively, in Figure 4). Primary cleavage occurs in the luminal/extracellular domain at the α-secretase cleavage site by metalloproteinases such as TACE [12], or at the β-secretase cleavage site by the atypical aspartyl protease BACE [13]. Subsequently, the remaining carboxy-terminal fragments of APP (C99 and C83 in Figure 4) are
cleaved by the heterotetrameric γ-secretase complex [14]. Cleavage of APP at the α- and γ-secretase sites (left-hand side of Figure 4) liberates the APP extracellular domain (APPsα), the p3 peptide, and the APP intracellular domain CTFγ (also called AICD) [15]. Alternatively, cleavage of APP at the β- and γ-secretase sites (right side of Figure 4) generates a soluble extracellular domain (APPsβ), an intracellular domain CTFγ, and the amyloid β peptide [15]. CTFγ plays an important role in transcription. In particular, the heterotrimeric APP-CTFγ/Fe65/Tip60 complex functions as a nuclear-targeted transcriptional regulator [16, 17]. It is currently unclear, however, how the CTFγ/Fe65/Tip60 complex affects neuronal survival [18, 19]. Furthermore, APP proteolysis by the γ-secretase complex may be regulated by the APP-associated factor LRP [20] via Fe65 [21] and may also involve the stimulation of either α-secretase or β-secretase cleavage [20, 22]. To test these possibilities, we have included the LRP/Fe65 binding in our Chalkboard model (Figure 4) and included LRP activation of both BACE and TACE proteolysis. We then clamped the effect of LRP on β-secretase cleavage (the red slash sign) to show that the downstream effect is to decrease amyloid production. The inherent complexity of the interactions among APP, the proteolytic processing enzymes, and the associated binding proteins is an arena in which a detailed modeling system such as Chalkboard would be extremely helpful. Potentially, Chalkboard could provide valuable insights and predictions about both mechanisms of action and potential experimental manipulations that could guide the development of effective therapeutic approaches to treating AD.
4. Discussion and related work

Chalkboard is an ontology-based computational tool that provides a graphical language and model-editing environment for representing biomolecular pathway models; these models can be analyzed qualitatively with the built-in PathTracing tool (Section 2.2) and quantitatively by exporting simulation code (Section 2.3) to the JSim simulation system. As such, Chalkboard relates to several threads of computational research that deserve in-depth discussion beyond the scope of this paper. Here, however, we emphasize Chalkboard's relation to three areas of pathway-informatics research: ontology research, qualitative inference, and quantitative analysis. We also address the trade-offs between scalability and the rich biochemical representation we employ in Chalkboard.

4.1. Ontology-based representations of biomolecular pathways

The Chalkboard ontology continues to evolve from the BioD biological description language [7] concurrently with biomolecular pathway ontologies including BioPAX [2], PATIKA [23], CellDesigner [24], and others. As expected, there is
considerable representational overlap that should, with community effort, be resolvable into a high-level ontology or, at least, an alignment between related ontologies. We are committed to such efforts, as advocated by others [10, 25]. We note, however, important representational differences, particularly in the modeling of molecular "states". Many ontologies consider different states of a physical entity (e.g., a molecule) to be separate entities (e.g., a molecule, its phosphorylated form, and its active form). Chalkboard takes an "object-oriented" view in which a single entity Molecule can have Functional sites as parts and each part can have an independent operational state, so that the state of a Molecule is specified by the values of its own Functional attributes plus the FAs of its parts (e.g., Occupied, Active, etc.). We adopt the Functional attribute approach because it maps well to both qualitative and quantitative analyses (Section 2.3). Furthermore, we suggest that the Functional attribute approach generalizes readily to other biophysical domains such as membrane biophysics (e.g., membrane potentials, conductances, and currents), structural mechanics (e.g., elastance), and fluid flow (e.g., diffusive or bulk flows). We see this generalizability as a prerequisite for the integration of pathway knowledge and analysis into multiscale (molecules, cells, organs, organ systems, etc.), multidomain (biochemistry, biophysics, mechanics) models.

4.2. Qualitative inference and quantitative analysis

Qualitative reasoning tools in biological research have been driven by the scarcity and high cost of the quantitative datasets required for quantitative modeling. However, many representational schemes do not, as yet, support qualitative inference (e.g., BioPAX [2], CellDesigner [24]), and those that do rely on graph-theoretic query methods (e.g., PATIKA [23]) or rule-based reasoning (e.g., BioCyc [3]) over state-based models. Chalkboard's qualitative inference is based more directly on the principles of quantitative modeling: it tracks the propagation of (small) perturbations through a network of essentially quantitative relationships. The benefits of coupling graphical representations to the computational analysis of biological systems have long been recognized, resulting in a variety of implementations including our own KineCyte [26], which integrates graphical modeling with biosimulation. Chalkboard, however, relies on existing simulation engines to interpret automatically generated simulation code (currently we use JSim, but we intend to support CellML [27] and SBML [28]). Although other molecular pathway representations (e.g., PATIKA, CellDesigner) may have sufficient rigor and expressiveness to export simulation code, to our knowledge this is not yet available for existing simulation languages [29].
4.3. Scalability and representational richness

We recognize trade-offs between Chalkboard's semantically rich graphical view of biological pathways and the less rich but more scalable representations used by applications such as Cytoscape [30]. We believe that scientists need both sorts of tools: although Cytoscape is appropriate for coarse-grained visualization of large networks, only tools like Chalkboard, which use richer representations, can capture notions such as competitive binding and cooperative and anti-cooperative effects. We recognize that Chalkboard will not be the only tool used by a researcher, and thus we have designed the system to export its models in a sharable format. Chalkboard models are saved in an XML text file that represents all model entities, model actions, and their linkages in a form that can be read and parsed by other applications. Our plans more specifically include inter-operating with the BioPAX standard [2] (as much as possible, given the differences in modeling), as well as exporting to CellML and SBML for simulation code.
5. Summary

We have argued that modern pathway researchers need tools for building and reasoning about causal models based on an inference method. Chalkboard is one prototype system that fills this need. The key characteristics of Chalkboard are: (1) the use of an expressive ontology of Entities, Actions, and Functional attributes to model pathways based on the physics and biochemistry of inter- and intra-molecular interactions; and (2) the ability to carry out high-level symbolic qualitative inference (PathTracing) and to generate quantitative (JSim) simulation code, which allows users to avoid two pitfalls: being tied to quantitative models whose utility and relevance are limited by the (typical) lack of quantitative data, and over-simplified biochemical representations whose fidelity to actual biochemical processes is limited. We have introduced the Chalkboard modeling environment and demonstrated its use in analyzing a cell-signaling pathway with important scientific and clinical implications. The design of effective therapeutics requires a rigorous understanding of how modulation of a particular molecular entity would affect a distributed signaling system. As Chalkboard is designed to assess exactly this issue, we suggest that Chalkboard modeling could facilitate the identification of appropriate pharmacogenetic therapeutic targets in Alzheimer's Disease and other human pathologies.
References
1. Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucleic Acids Research 2005;33:D428-D432.
2. BioPAX - Biological Pathways Exchange Language, Level 2. http://www.biopax.org.
3. Karp PD, Ouzounis CA, Moore-Kochlacs C, et al. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res 2005;33(19):6083-9.
4. Davis R, Shrobe H, Szolovits P. What is a knowledge representation? AI Magazine 1993;Spring:17-33.
5. National Simulation Resource. http://nsr.bioeng.washington.edu/PLN
6. Hardy J, Selkoe DJ. The amyloid hypothesis of Alzheimer's disease: progress and problems on the road to therapeutics. Science 2002;297(5580):353-6.
7. Cook DL, Farley JF, Tapscott SJ. A basis for a visual language for describing, archiving and analyzing functional models of complex biological systems. Genome Biol 2001;2(4):RESEARCH0012.
8. Cook DL, Mejino JLV, Rosse C. The Foundational Model of Anatomy: a template for the symbolic representation of multi-scale physiological functions. Medinfo 2005;12.
9. Rosse C, Mejino JLV. A Reference Ontology for Bioinformatics: The Foundational Model of Anatomy. Journal of Biomedical Informatics 2003;36:478-500.
10. Open Biomedical Ontologies. http://obo.sourceforge.net
11. Glenner GG, Wong CW, Quaranta V, Eanes ED. The amyloid deposits in Alzheimer's disease: their nature and pathogenesis. Appl Pathol 1984;2(6):357-69.
12. Buxbaum JD, Liu KN, Luo Y, et al. Evidence that tumor necrosis factor alpha converting enzyme is involved in regulated alpha-secretase cleavage of the Alzheimer amyloid protein precursor. J Biol Chem 1998;273(43):27765-7.
13. Vassar R, Bennett BD, Babu-Khan S, et al. Beta-secretase cleavage of Alzheimer's amyloid precursor protein by the transmembrane aspartic protease BACE. Science 1999;286(5440):735-41.
14. De Strooper B. Aph-1, Pen-2, and Nicastrin with Presenilin generate an active gamma-Secretase complex. Neuron 2003;38(1):9-12.
15. Selkoe DJ. Alzheimer's disease: genes, proteins, and therapy. Physiol Rev 2001;81(2):741-66.
16. Cao X, Sudhof TC. A transcriptionally [correction of transcriptively] active complex of APP with Fe65 and histone acetyltransferase Tip60. Science 2001;293(5527):115-20.
17. Baek SH, Ohgi KA, Rose DW, et al. Exchange of N-CoR corepressor and Tip60 coactivator complexes links gene expression by NF-kappaB and beta-amyloid precursor protein. Cell 2002;110(1):55-67.
18. Kinoshita A, Whelan CM, Berezovska O, Hyman BT. The gamma secretase-generated carboxyl-terminal domain of the amyloid precursor protein induces apoptosis via Tip60 in H4 cells. J Biol Chem 2002;277(32):28530-6.
19. Sastre M, Steiner H, Fuchs K, et al. Presenilin-dependent gamma-secretase processing of beta-amyloid precursor protein at a site corresponding to the S3 cleavage of Notch. EMBO Rep 2001;2(9):835-41.
20. Pietrzik CU, Busse T, Merriam DE, et al. The cytoplasmic domain of the LDL receptor-related protein regulates multiple steps in APP processing. EMBO J 2002;21(21):5691-700.
21. Pietrzik CU, Yoon IS, Jaeger S, et al. FE65 constitutes the functional link between the low-density lipoprotein receptor-related protein and the amyloid precursor protein. J Neurosci 2004;24(17):4259-65.
22. Yoon IS, Pietrzik CU, Kang DE, Koo EH. Sequences from the low density lipoprotein receptor-related protein (LRP) cytoplasmic domain enhance amyloid beta protein production via the beta-secretase pathway without altering amyloid precursor protein/LRP nuclear signaling. J Biol Chem 2005;280(20):20140-7.
23. Demir E, Babur O, Dogrusoz U, et al. An ontology for collaborative construction and analysis of cellular pathways. Bioinformatics 2004;20(3):349-56.
24. Kitano H, Funahashi A, Matsuoka Y, Oda K. Using process diagrams for the graphical representation of biological networks. Nat Biotechnol 2005;23(8):961-6.
25. Stromback L, Lambrix P. Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics 2005;21(24):4401-4407.
26. Cook DL, Gerber AN, Tapscott SJ. Modeling stochastic gene expression: implications for haploinsufficiency. Proc Natl Acad Sci USA 1998;95(26):15641-6.
27. CellML. http://www.cellml.org
28. Hucka M, Finney A, Sauro HM, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 2003;19(4):524-31.
29. National Simulation Resource. http://nsr.bioeng.washington.edu/PLN.
30. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003;13(11):2498-504.
MINING GENE-DISEASE RELATIONSHIPS FROM BIOMEDICAL LITERATURE: WEIGHTING PROTEIN-PROTEIN INTERACTIONS AND CONNECTIVITY MEASURES

GRACIELA GONZALEZ, JUAN C. URIBE, LUIS TARI, COLLEEN BROPHY, CHITTA BARAL

Department of Biomedical Informatics, Ira A. Fulton School of Engineering; Computer Science and Engineering Department, Ira A. Fulton School of Engineering; Center for Metabolic Biology; Department of Kinesiology, Arizona State University, Tempe, Arizona 85281, USA

Motivation: The promises of post-genome-era disease-related discoveries and advances have yet to be fully realized, with many opportunities for discovery hiding in the millions of biomedical papers published since. Public databases give access to data extracted from the literature by teams of experts, but their coverage is often limited and lags behind recent discoveries. We present a computational method that combines data extracted from the literature with data from curated sources in order to uncover possible gene-disease relationships that are not directly stated or were missed by the initial mining.
Method: An initial set of genes and proteins is obtained from gene-disease relationships extracted from PubMed abstracts using natural language processing. Interactions involving the corresponding proteins are similarly extracted and integrated with interactions from curated databases (such as BIND and DIP), assigning a confidence measure to each interaction depending on its source. The augmented list of genes and gene products is then ranked by combining two scores: one that reflects the strength of the relationship with the initial set of genes and incorporates user-defined weights, and another that reflects the importance of the gene in maintaining the connectivity of the network. We applied the method to atherosclerosis to assess its effectiveness.
Results: Top-ranked proteins from the method are related to atherosclerosis with accuracy between 0.85 and 1.00 for the top 20 and between 0.64 and 0.80 for the top 90 if duplicates are ignored, with 45% of the top 20 and 75% of the top 90 derived by the method rather than extracted from text. Thus, even though the initial gene set and interactions were automatically extracted from text (and are subject to the imprecision of automatic extraction), their use for further hypothesis generation is valuable given adequate computational analysis.
1. Introduction
Post-genome-project data and techniques available to the research community have exponentially increased the capacity of researchers to conduct experiments and publish results. The resulting deluge of biomedical literature, however, has reached a point that exceeds the capacity of any researcher to process and assimilate, making it difficult to realize the full benefit of these findings. From 1994 to 2004, close to 3 million biomedical articles were published by US and European researchers [1]. This publication rate has resulted in approximately 16 million publications currently indexed in PubMed.
Figure 1. Overview and data flow of the computational method presented here to mine the biomedical literature for genes potentially related to a specific disease.
Efforts have been made to extract data from articles and abstracts. For example, Entrez's OMIM [2] has summaries of published work that relate genes to diseases. However, it covers only about 20% of the human genes in the Entrez Gene database. A similar initiative for gene function annotation, GeneRIF (Gene Reference Into Function), was started in 2002, but it covers only about 1.7% of all the genes in Entrez and 25% of human genes [3]. New findings usually take a long time to be reflected in curated sources such as these, and any computational method that relies solely on them will necessarily have its hands tied. To fill this void, the Collaborative Bio Curation (CBioC) project [4, 5] was started to bring together nuggets of information automatically extracted from the published biomedical literature and the intellectual power of a social network of researchers, who can rate the accuracy of the extraction. Extracted facts include protein-protein interactions, gene-disease relationships, and gene-bioprocess relationships. This paper describes a computational method that uses extracted facts from the CBioC database and integrates them with curated sources to find a set of proteins potentially related to a target disease, ranking them so that existing knowledge (known gene-disease relationships and curated protein-protein interactions) is balanced with the potential impact of new information (protein-protein interactions extracted from the literature) and the researcher's intuition. An assessment of the method through a study of atherosclerosis is also described and reported in the Results section. This balance of different factors, notably a network-connectivity impact measure for each gene, among others, marks the difference between our approach and others such as MedGene [6] and the method in [7]. The scope of the initial gene-disease data also differs, as does the level of user interaction,
which is very limited in other approaches. A comparative view of these efforts is presented in the Related Work section and in Section 2.4. The resulting ranked list of genes and gene products can provide the basis for further focused experiments to investigate the genetic determinants of any disease. On top of helping to find gene-disease relationships that were not discovered in the information-extraction step (false negatives), this focused analysis could uncover as-yet unexplored genetic linkages and provide insight into specific genetic and proteomic pathways related to any disease, as our study of atherosclerosis will show. The method is implemented in Java using SQL to access the CBioC database (which is stored as a MySQL database). On-demand runs can be requested by contacting the authors. A web-based interface to the software is in development. The remaining sections cover the computational method, the results of applying the method to the study of atherosclerosis, and a comparison with related work.
2. The computational method
The computational method presented here takes a four-step approach to the task of finding and ranking genes and gene products related to a given disease, relying not only on automatic computation but also allowing (not requiring) user input at different levels. The method can be summarized as follows:
1. Obtain a list of genes or gene products known to be involved with the target disease from the CBioC [5] database.
2. Apply heuristics to unify variants of extracted names, and use HUGO [8] to normalize both the set obtained in the previous step and the names stored in CBioC. This will be referred to as the initial set.
3. Apply nearest-neighbor expansion to the initial set to build a protein interaction network using data from the CBioC database and curated databases, and analyze the connectivity of the network. The genes and proteins in this network (derived from the interactions) form the extended set.
4. Apply a heuristic scoring formula to the extended set to predict the proteins most likely related to the disease. One part of the formula measures the number of interactions of each gene in the extended set with proteins in the initial set, incorporating contextual information if indicated by the user. The second part measures the role of the protein in the connectivity of the protein network, since high degrees of local network interconnectivity can identify sets of functionally related proteins [9, 10].
Researchers can focus the analysis through different interventions. Figure 1 shows the data and process flow of the method. Each step is detailed next.
2.1. Initial set of disease-related genes and gene products

The initial set of genes and gene products of interest is obtained by querying the CBioC database using the disease of interest and any variants or synonyms of its name. CBioC uses a natural language processing extraction system, IntEx [11], which is based on the identification of syntactic roles such as subjects, objects, verbs, and modifiers. English grammar dependencies reported by Link Grammar [12] are used to identify the roles and transform complex sentences of interest into triplets of the form (Entity1, interaction, Entity2). We extended IntEx to extract not only protein-protein interactions but also gene-disease relationships, using MeSH [13] terms under the disease category to recognize them in the abstracts. Even though the natural language processing approach allows for more precise extraction than co-occurrence [11], the gene-disease relationships and protein-protein interactions extracted directly from the literature are not perfect. In fact, IntEx reports a 65.7% precision in extracted interactions [11]. Thus, there will be genes and gene products in the initial set that are not related to the disease (false positives), just as there will be others that are not retrieved even though they are related (false negatives). The protein interaction network analysis and the incorporation of protein-protein interactions from curated sources help assuage the impact of these problems. Also, users might filter the initial set to narrow the focus to a particular set of genes and gene products.

2.2. Unifying extracted gene and protein names

One of the challenges of using data extracted directly from biomedical texts is the great variety of names used for the same entity: one gene or gene product might appear under different synonyms and variants. For example, HNF4A might appear as hepatocyte nuclear factor 4 alpha or any of a number of aliases (such as HNF4, MODY, TCF, or TCF14), or variants of any of these, such as HNF4-alpha or HNF 4A. An additional problem is that the triplets in the CBioC database sometimes include modifiers that were in the same noun phrase or modifying phrase, such as "HNF4A protein" or "HNF4A mutation". It was necessary to unify (normalize) the names so that, when the protein network is built, all the interactions of the same protein are clustered into a single node. A naive normalization algorithm was applied to entries in the CBioC database to eliminate non-essential words (such as "protein" or "mutation" at the end of a name), in order to then find the official abbreviation in the HUGO [8] database.
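As an illustration of this normalization step, the following is a minimal sketch that assumes a toy alias table in place of the full HUGO database; the stop-word list and aliases are hypothetical, not CBioC's actual rules.

```python
# Minimal sketch of gene/protein name unification. The suffix list and the
# alias table are illustrative assumptions, not the real HUGO data.

NON_ESSENTIAL = {"protein", "gene", "mutation", "receptor"}   # assumed stop-words

HUGO_ALIASES = {                      # toy stand-in for HUGO official symbols
    "hnf4": "HNF4A", "hnf4a": "HNF4A", "hnf4 alpha": "HNF4A",
    "hepatocyte nuclear factor 4 alpha": "HNF4A",
    "mody": "HNF4A", "tcf14": "HNF4A",
}

def normalize_name(raw):
    """Strip trailing modifiers and punctuation, then look up the official symbol."""
    tokens = raw.lower().replace("-", " ").split()
    while tokens and tokens[-1] in NON_ESSENTIAL:
        tokens.pop()                                  # drop "HNF4A protein" -> "HNF4A"
    key = " ".join(tokens)
    return HUGO_ALIASES.get(key, key.upper())         # fall back to the cleaned name

print(normalize_name("HNF4-alpha protein"))    # -> HNF4A
print(normalize_name("TCF14 mutation"))        # -> HNF4A
```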
2.3. Build the protein network

The CBioC database is queried for any and all interactions involving the genes and gene products in the initial set. On top of the extracted interactions, CBioC integrates interaction data from BIND [14], MINT [15], DIP [16], IntAct [17], and BioGRID [18]. A nearest-neighbor algorithm is run to build a protein interaction network, noting the confidence level for each interaction as follows:
1. If the interaction comes from any of the curated sources, its confidence level is noted as 1.
2. If the interaction comes from CBioC and it has received "Yes" votes from the community of users, its confidence level is noted as 0.65 plus 0.07 for each "Yes" vote, up to 1. CBioC counts only one vote per user per fact.
3. If the interaction comes from CBioC and has not been rated by any user, its confidence level is given as 0.65 (the measured precision of IntEx [11]).

2.4. Rank the genes and gene products in the expanded network

To rank the genes in the resulting set, we score each gene or gene product based on the number and confidence levels of its interactions with proteins in the initial set, and combine this measure with another that reflects how relevant the protein is for maintaining the connectivity of the protein network. Both measures are important. The first helps discover the most active proteins with respect to the disease (high precision), preferring interactions with the highest confidence level (high fidelity), while the second finds those that could potentially play a crucial role in a pathway related to the disease or that are very likely related to the known (extracted) genes, as high degrees of local network interconnectivity can identify sets of functionally related proteins [9, 10]. The first score also incorporates user-defined weights. For example, given interactions as triplets (Entity1, interaction-term, Entity2), users might indicate that interactions that include "phosphorylates" as an interaction term should be given greater weight. Let us assume for now that no user weights are defined. We use a variation of the formula given in [7] for this level, removing a bias towards the initial set that the formula in [7] suffers from. Let
- A be the extended set of proteins (initial set plus interactions);
- N(i) be the set of proteins in the initial set interacting with protein i;
- p(i,j) be the confidence level of the interaction between proteins i and j;
- N(i,j) = 1 if protein i ∈ A and j ∈ N(i), and 0 otherwise.
Then a score t_i is assigned to each protein i by applying Eq. (1):

t_i = u_i^2 * |N(i)|    (1)

u_i = ( Σ_{j ∈ N(i)} p(i,j) ) / |N(i)|    (2)

In Eq. (2), u_i is the average confidence level of the interactions involving i. Equation (1) results from expanding the formula used in [7], noting that in [7], N(i) is the set of proteins interacting with protein i and N(i,j) = 1 if j ∈ N(i) ∩ A:
t_i = exp( k * ln( Σ_{j ∈ N(i)∩A} p(i,j) ) − (k − 1) * ln |N(i) ∩ A| )
    = ( Σ_{j ∈ N(i)∩A} p(i,j) ) * ( ( Σ_{j ∈ N(i)∩A} p(i,j) ) / |N(i) ∩ A| )^(k−1)    (3)
From the last expression, using k = 2 as in [7] and noting that |N(i) ∩ A| = |N(i)| (since only interactions in A are included), we get Eq. (1). Since u_i ≤ 1 remains relatively constant, the score t_i mainly depends on the number of interactions for i. By the definition of N(i) in the method presented here, only interactions with proteins in the initial set are counted, whereas all of the interactions loaded for i are counted in [7]. Thus, since only interactions that have at least one member in the initial set are loaded, all of the interactions involving the proteins that belong to the initial set will be counted in [7], compared to only a small fraction of the interactions for the proteins not in the initial set. Naturally, this leads to a scoring bias in [7], with proteins in the initial set having a larger score than all other proteins. This explains why only one of the top-ranked proteins in their final ranking was "novel" (not in the initial set; "derived" in this study). Figure 2 illustrates the problem by comparing the score given by each method, assuming a confidence level of 1 for all interactions. The method in [7] (column (a)) would score g2, g3, and g4 the highest, while the method presented here (column (b)) rates g4 and g7 equally high. Thus, we count only interactions with proteins in the initial set, which "evens out" the playing field for the proteins added later: if they interact with a good number of proteins in the initial set, their score will go up. Aside from the purely mathematical, this change makes biological sense: if a relationship has been reported between a gene and a disease, other proteins that are highly connected to the known facts might be important pieces in a pathway. The fairness of the formula is evident in our evaluation study (Section 3), where 45% of the top 20 ranked proteins in our resulting list were derived from interactions, compared to 1 in 20 in [7].
Figure 2. Simplified comparative scoring over the initial and extended sets (proteins g1-g9), assuming average confidence = 1, using scoring as in Eq. (1) but counting (a) all interactions involving protein i, as in [7], and (b) only interactions with proteins in the initial set, as done in the method presented here (before normalizing over 100), with the corresponding (c) connectivity score (an innovative aspect of this method).

Protein | (a) | (b) | (c)
g1 | 1 | 1 | 0
g2 | 3 | 1 | 2/8
g3 | 3 | 1 | 4/8
g4 | 3 | 2 | 2/8
g5 | 1 | 1 | 0
g6 | 1 | 1 | 0
g7 | 2 | 2 | 3/8
g8 | 1 | 1 | 0
g9 | 1 | 1 | 0
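For concreteness, a small sketch of the interaction score of Eqs. (1)-(2), together with the confidence rules of Section 2.3, is given below; the data structures are assumptions for illustration, not the actual CBioC schema.

```python
# Hedged sketch of the interaction score (Eqs. 1-2) and the confidence rules of
# Section 2.3. Data structures are illustrative assumptions.

def confidence(source, yes_votes=0):
    """Curated interactions get 1.0; CBioC extractions start at the measured
    IntEx precision (0.65) and gain 0.07 per community 'Yes' vote, capped at 1."""
    if source == "curated":
        return 1.0
    return min(1.0, 0.65 + 0.07 * yes_votes)

def interaction_score(protein, interactions, initial_set):
    """t_i = u_i^2 * |N(i)|, with N(i) the initial-set partners of `protein`
    and u_i the average confidence of those interactions (Eqs. 1-2)."""
    partners = [(j, p) for j, p in interactions.get(protein, []) if j in initial_set]
    if not partners:
        return 0.0
    u = sum(p for _, p in partners) / len(partners)
    return u ** 2 * len(partners)

# Toy data in the spirit of Figure 2: g7 interacts with two initial-set proteins.
initial_set = {"g2", "g3", "g4"}
interactions = {"g7": [("g3", 1.0), ("g4", 1.0), ("g8", confidence("cbioc"))]}
print(interaction_score("g7", interactions, initial_set))   # -> 2.0, as in column (b)
```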
Further mathematical manipulations at this level include applying a user-defined bias to certain proteins, as explained before, and normalization over 100 to reflect the relative ranking (thus, the interaction score t_i of protein i indicates the percentage of proteins with an interaction score less than or equal to that of i). The second level of scoring involves evaluating the role of the gene in the overall connectivity of the protein network. It has been demonstrated that high degrees of local network interconnectivity can identify sets of functionally related proteins [9, 10]. The statistical validity of using network connectivity measures for sets of interacting proteins in this way has already been established [7, 9, 10, 19]. Here, this concept is applied to assessing the importance of a protein by measuring how much the connectivity of the network would be affected if the protein were removed. To formulate this precisely, let
- the path between two proteins p_1 and p_n be the set {p_1, p_2, ..., p_{k-1}, p_k, ..., p_n} such that for n > 2, p_{k-1} interacts with p_k for every k = 2...n;
- a set of interactions be called a network;
- the largest connected sub-network in a network be the largest subset of interactions from it that forms a path;
- the connectivity index (aka index of aggregation) of a network N, C(N), be the ratio of the size of the largest connected sub-network of N to the size of N.
The connectivity score for a given gene or gene product i is then given by Eq. (4):

connectivity_score_i = C(N) − C(N\i)    (4)
In Eq. (4), N\i stands for the network N with protein i removed. This score is then combined with the interaction score t_i given by Eq. (1) using Eqs. (5) and (6):

combined_score_i = t_i^s    (5)

s = 1 + (w * connectivity_score_i)    (6)

The exponential combination of the scores was preferred over a linear one because the connectivity score is very small (less than 0.01 in most cases). The constant w is used to adjust the weight of the connectivity score in the overall ranking.
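The connectivity and combined scores of Eqs. (4)-(6) can be sketched as follows. For simplicity, the connectivity index is computed here over proteins (nodes) in the largest connected component rather than over interaction sets as in the definition above; the graph and numbers are illustrative only.

```python
# Hedged sketch of the connectivity score (Eq. 4) and combined score (Eqs. 5-6).
# Simple BFS over an undirected interaction graph; not the authors' code.

from collections import defaultdict, deque

def connectivity_index(edges, nodes):
    """C(N): size of the largest connected component divided by |nodes|."""
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b); adj[b].add(a)
    seen, largest = set(), 0
    for start in nodes:
        if start in seen:
            continue
        comp, queue = 0, deque([start])
        seen.add(start)
        while queue:
            n = queue.popleft(); comp += 1
            for m in adj[n] - seen:
                seen.add(m); queue.append(m)
        largest = max(largest, comp)
    return largest / len(nodes) if nodes else 0.0

def combined_score(t_i, edges, nodes, protein, w=4):
    conn = connectivity_index(edges, nodes) - connectivity_index(edges, nodes - {protein})
    return t_i ** (1 + w * conn)                      # Eqs. (5)-(6)

nodes = {"g1", "g4", "g7", "g8"}
edges = [("g1", "g4"), ("g4", "g7"), ("g7", "g8")]
print(combined_score(2.0, edges, nodes, "g7"))        # removing g7 splits the chain
```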
The study of atherosclerosis presented here uses w = 4 to achieve an approximately even split between the extracted and derived proteins among the top 20. The combined score allows distinctions among genes such as g4 and g7 in Figure 2, where the connectivity score (column (c)) breaks the tie and favors g7, since removing g7 would disconnect the network.
3. Results
According to the American Heart Association, more than 71 million American adults have one or more types of cardiovascular disease. Cardiovascular disease is the underlying cause of death in 37% and a contributing cause in 58% of all deaths in the United States, and it claims more lives than the next four leading causes of death combined [20]. Atherosclerosis is the deadliest form of cardiovascular disease and accounts for nearly three-fourths of these deaths. Atherosclerosis results from a complex process involving endothelial dysfunction, inflammation, and dyslipidemia (a process called atherogenesis). The process leads to the accumulation of lipid and extracellular matrix proteins in the intima of arteries. Even though the genetic basis of atherosclerosis is not completely understood [21], a number of genes have been associated with atherosclerosis, and gene expression profiling of atherosclerosis has been used to identify relevant genes and pathways [22]. Our tool allows the incorporation of published data into these experiments and could help form new hypotheses. There were 98 genes and gene products in the initial set from the CBioC database, resulting in 9963 genes in the extended set. Coverage was calculated with respect to OMIM [2] at 0.70 (with 73 out of the 104 genes listed in OMIM included in the extended set), using edit distance <= 1 to match (i.e., one or no characters were dropped or added to declare a match). We researched the evidence supporting the relationship of the top-ranked genes to atherosclerosis, annotating each gene according to the findings, as described in Table 1. The accuracy statistics for the top n proteins appear in Table 2, with the definitions and formulas used for all measures in Table 1. Table 3 presents the details for the top 20 unique proteins. Annotations for the top 90 unique proteins are available at http://www.cbioc.org. Those marked "Found", like TNF alpha, Angiotensin II, IL1, Collagen, and PLAT, were verified by direct PubMed searches. Consider TNF alpha: among over 800 hits, PMID 16718633 reports the contribution of TNF-alpha, TGF-beta, and IL-6 gene expression to systemic inflammation in atherosclerosis. TNF alpha is also mapped to GO term 0008289, "lipid binding activity" [23]. However, these proteins are not mentioned in OMIM, and in some cases not in Entrez Gene either, as being related to atherosclerosis, and they would have been missed by relying only on the information in these sources.
Table 1. Definitions used in assessing the computational method (TP = true positives, FN = false negatives, FP = false positives).

Extracted: protein belongs to the initial set
Derived: protein belongs to the extended set
Known: protein is among those reported in OMIM as related to the disease
Found: protein found in the literature as related to the disease
Suspect: protein likely related to the disease based on its function or interactions
Some support: protein found to be related, small number of supporting articles
Not Found: protein not found to be related to the disease
Not a gene: extracted entity does not refer to a gene or protein
Duplicate: synonym or variant of a previously listed protein
strict_TP: known + found
relaxed_TP: known + found + some support + suspect
Coverage: TP / (TP + FN) = known / all OMIM genes related to disease
Accuracy_stp: TP / (TP + FP) = strict_TP / (extracted + derived - duplicates)
Accuracy_rtp: TP / (TP + FP) = relaxed_TP / (extracted + derived - duplicates)
Accuracy_stp w/ dups: TP / (TP + FP) = strict_TP / (extracted + derived)
Accuracy_rtp w/ dups: TP / (TP + FP) = relaxed_TP / (extracted + derived)
Proteins marked "Suspect" are those Table 2. Performance measures for the top n proteins. See definitions in Table 1. for which there are threads in the literature n = 27 n= 123 that suggest they might be involved in the Unique proteins 20 90 Extracted 12 31 disease due to their interactions or function, Derived 15 92 but no direct report linking the two was % derived 56% 75% found. For example, for PRKCG, PMID 9 Known (in OMIM) 16 Found (in literature) 8 42 10617676 states: "the signaling pathway of Some support 1 6 protein kinase C is known to play a role in Suspect 2 8 mediating the action of cytokines". Other Not Found 0 16 Not a gene 0 2 cytokines, such as IL1 and IL6 have strong Duplicate 7 33 evidence of linkage to atherosclerosis [24]. Coverage wrt OMIM 0.09 0.15 For ERVK2 (HERV), PMID 11672541 Accuracy_stp 0.85 0.64 states that it "may cause type I diabetes by Accuracy_rtp 1.00 0.80 Accuracy_stp-w/ dups 0.63 0.47 activating autoreactive T cells", and that Accuracy_rtp-w/ dups 0.74 0.59 "endogenous retroviral (HERV) superantigens induced via IFN-alpha by viral infections is a novel mechanism through which environmental factors may cause disease in genetically susceptible individuals." In turn, PMID 16973967 states that "Adaptive immunity, in particular T cells, is highly involved in atherogenesis", relating T cells to the disease. Other articles support this idea. Overall, the top genes identified fit into categories underlying pathogenetic mechanisms of atherosclerosis: insulin resistance (insulin, ALB, ERVK2), lipids (APOB, APOE, HDL and LDL), inflammation (IL6, TNFa, IL1 -cytokines-), hypercoagulability (Fibrinogen) and endothelial injury (NOS, and ICAM).
Table 3. Top genes and gene products, ranked by combined score, using w = 4. Duplicates due to name variants are not shown.

Protein | Type | Interaction score | Connectivity score | Combined score | Evidence
INSULIN | extracted | 100.0 | 0.2149 | 5239.1 | Known
ALB | extracted | 60.0 | 0.0375 | 110.9 | Known
APOE | extracted | 65.0 | 0.0314 | 109.8 | Known
FIBRINOGEN | extracted | 52.5 | 0.0334 | 89.1 | Found
ICAM 1 | extracted | 42.5 | 0.0341 | 70.9 | Known
IL6 | extracted | 40.0 | 0.0315 | 63.7 | Known
HDL | extracted | 52.5 | 0.0116 | 63.1 | Found
TNF ALPHA | derived | 62.5 | 0.0001 | 62.6 | Found
LDL | extracted | 50.0 | 0.0120 | 60.3 | Known
NOS | extracted | 37.5 | 0.0304 | 58.3 | Found
APOB | extracted | 45.0 | 0.0116 | 53.7 | Known
ERVK2 | derived | 50.0 | 0.0001 | 50.1 | Suspect
ANGIOTENSIN II | derived | 50.0 | 0.0001 | 50.1 | Found
IL1 | derived | 50.0 | 0.0001 | 50.1 | Found
PRKCG | derived | 47.5 | 0.0001 | 47.6 | Suspect
COLLAGEN | derived | 45.0 | 0.0001 | 45.1 | Found
TAT | derived | 44.5 | 0.0001 | 44.6 | Some support
VWF | extracted | 32.5 | 0.0217 | 44.0 | Known
PLAT | derived | 42.5 | 0.0001 | 42.6 | Found
LIPOPROTEIN L | derived | 40.0 | 0.0002 | 40.1 | Known
4. Related Work
The closest approaches to the one presented here are MedGene [6] and the method in [7]. MedGene uses published literature to extract gene-disease passages but does not then expand the initial list; it ranks the passages using purely statistical methods related to co-occurrence in the text, without a biological basis. Its extraction tool uses co-occurrence rather than NLP, and user intervention is limited to choosing among different statistical ranking formulas. Aside from the differences discussed in the previous section, the method in [7] uses an initial gene list from OMIM expanded with interactions from the Online Predicted Human Interaction Database (OPHID). Even though network connectivity is used to show the statistical validity of that method, its scoring formula does not account for it.
5. Conclusion and future work
The method presented here makes innovative use of a combination of important measures, namely number of interactions and connectivity impact, to rank a list of proteins mined from the biomedical literature as related to a disease. It can be a valuable tool in the analysis and exploration of proteins and pathways that relate to a disease. The resulting ranked list of genes and gene products can provide the basis for further focused experiments to investigate the genetic determinants of diseases, as the atherosclerosis study presented here showed.
Focused analysis helps uncover false negatives and can potentially call attention to as-yet unexplored genetic linkages and to descriptive research on the disease, such as chromosomal aberrations and specific genetic mutations and amplifications that play a role in the disease, which can then be investigated through wet-lab experiments. Future work includes improvement of the normalization module to reduce duplicates, since duplication pollutes the connectivity measures by generating unnecessary nodes, as well as tuning of the formula and methodology with other diseases to potentially incorporate other measures. A web-based interface to allow the public to use the tool is in development.

References
1. Soteriades, E.S. and M.E. Falagas, Comparison of amount of biomedical research originating from the European Union and the United States. BMJ: British Medical Journal, 2005. 331(7510): p. 192-194.
2. Online Mendelian Inheritance in Man, OMIM (TM). 2000, Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD) and National Center for Biotechnology Information, NLM (Bethesda, MD).
3. Lu, Z., K.B. Cohen, and L. Hunter. Finding GeneRIFs via Gene Ontology Annotations, in Pacific Symposium on Biocomputing. 2006. Maui, Hawaii, USA: World Scientific Publishing Co. Pte. Ltd.
4. Baral, C., H. Davulcu, M. Nakamura, P. Singh, L. Tari, and L. Yu, Collaborative Curation of Data from Bio-medical Texts and Abstracts and Its Integration. Lecture Notes in Computer Science, 2005. 309-312.
5. Baral, C., H. Davulcu, G. Gonzalez, G. Joshi-Topee, M. Nakamura, P. Singh, L. Tari, and L. Yu, CBioC: Web-based Collaborative Curation of Molecular Interaction Data from Biomedical Literature, in Genetics Society of America 1st Biocurator Meeting. 2005: Pacific Grove, CA.
6. Hu Y, H.L., Weng H, Zuo D, Rivera M, Richardson A, LaBaer J., Analysis of genomic and proteomic data using advanced literature mining. Journal of Proteome Research, 2003. 2(4): p. 405-412.
7. Chen, J.Y., C. Shen, and A.Y. Sivachenko. Mining Alzheimer Disease Relevant Proteins from Integrated Protein Interactome Data, in Pacific Symposium on Biocomputing. 2006. Maui, Hawaii, USA: World Scientific Publishing Co. Pte. Ltd.
8. HUGO Gene Nomenclature Committee Database. Available from: http://www.gene.ucl.ac.uk/nomenclature/.
9. Rives, A.W. and T. Galitski, Modular organization of cellular networks. PNAS, 2003. 100(3): p. 1128-1133.
10. LaCount, D.J., M. Vignali, R. Chettier, A. Phansalkar, R. Bell, J.R. Hesselberth, L.W. Schoenfeld, I. Ota, S. Sahasrabudhe, C. Kurschner, S. Fields, and R.E. Hughes, A protein interaction network of the malaria parasite Plasmodium falciparum. Nature, 2005. 438(7064): p. 103-107.
11. Ahmed, S.T., D. Chidambaram, H. Davulcu, and C. Baral, IntEx: A Syntactic Role Driven Protein-Protein Interaction Extractor for Bio-Medical Text, in BioLINK SIG (Biolink 2005). 2005: Detroit, Michigan.
12. Sleator, D. and D. Temperley, Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies, 1993.
13. Kostoff, R.N., J.A. Block, J.A. Stump, and K.M. Pfeil, Information content in Medline record fields. International Journal of Medical Informatics, 2004. 73(6): p. 515-527.
14. Bader, G., Betel, D., Hogue, C., BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 2003. 31: p. 248-250.
15. Zanzoni, A., L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich, and G. Cesareni, MINT: a Molecular INTeraction database. FEBS Letters, 2002. 513(1): p. 135-140.
16. Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S., Eisenberg, D., DIP: The Database of Interacting Proteins. A research tool for studying cellular networks of protein interactions. NAR, 2002. 30: p. 303-5.
17. Hermjakob, H., L. Montecchi-Palazzi, C. Lewington, S. Mudali, S. Kerrien, S. Orchard, M. Vingron, B. Roechert, P. Roepstorff, A. Valencia, H. Margalit, J. Armstrong, A. Bairoch, G. Cesareni, D. Sherman, and R. Apweiler, IntAct: an open source molecular interaction database. Nucl. Acids Res., 2004. 32(suppl_1): p. D452-455.
18. Stark, C., B.-J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Tyers, BioGRID: a general repository for interaction datasets. Nucl. Acids Res., 2006. 34(suppl_1): p. D535-539.
19. Ewens, W. and G. Grant, Statistical Methods in Bioinformatics: An Introduction. Springer; 1st edition (April 20, 2001).
20. Thom, T., N. Haase, W. Rosamond, V.J. Howard, J. Rumsfeld, T. Manolio, Z.-J. Zheng, K. Flegal, C. O'Donnell, S. Kittner, D. Lloyd-Jones, D.C. Goff, Jr., Y. Hong, R. Adams, G. Friday, K. Furie, P. Gorelick, B. Kissela, J. Marler, J. Meigs, V. Roger, S. Sidney, P. Sorlie, J. Steinberger, S. Wasserthiel-Smoller, M. Wilson, and P. Wolf, Members of the Statistics Committee and Stroke Statistics Subcommittee, Heart Disease and Stroke Statistics-2006 Update: A Report From the American Heart Association Statistics Committee and Stroke Statistics Subcommittee. Circulation, 2006. 113(6): p. e85-151.
21. Lusis, A.J., Atherosclerosis. Nature, 2000. 407(6801): p. 233-241.
22. Bijnens, A.P.J.J., E. Lutgens, T. Ayoubi, J. Kuiper, A.J. Horrevoets, and M.J.A.P. Daemen, Genome-Wide Expression Studies of Atherosclerosis: Critical Issues in Methodology, Analysis, Interpretation of Transcriptomics Data. Arterioscler Thromb Vasc Biol, 2006. 26(6): p. 1226-1235.
23. PubGene. Available from: http://www.pubgene.org.
24. Tedgui, A. and Z. Mallat, Cytokines in Atherosclerosis: Pathogenic and Regulatory Pathways. Physiol. Rev., 2006. 86(2): p. 515-581.
PREDICTING STRUCTURE AND DYNAMICS OF LOOSELY-ORDERED PROTEIN COMPLEXES: INFLUENZA HEMAGGLUTININ FUSION PEPTIDE

PETER M. KASSON
Medical Scientist Training Program, Stanford University, Stanford, CA 94305, USA

VIJAY S. PANDE
Department of Chemistry, Stanford University, Stanford, CA 94305, USA
Transient and low-affinity protein complexes pose a challenge to existing experimental methods and traditional computational techniques for structural determination. One example of such a disordered complex is that formed by trimers of influenza virus fusion peptide inserted into a host cell membrane. This fusion peptide is responsible for mediating viral infection, and spectroscopic data suggest that the peptide forms loose multimeric associations that are important for viral infectivity. We have developed an ensemble simulation technique that harnesses >1000 molecular dynamics trajectories to build a structural model for the arrangement of fusion peptide trimers. We predict a trimer structure in which the fusion peptides are packed into proximity while maintaining their monomeric structure. Our model helps to explain the effects of several mutations to the fusion peptide that destroy viral infectivity but do not measurably alter peptide monomer structure. This approach also serves as a general model for addressing the challenging problem of higher-order protein organization in cell membranes.
1. Introduction

Seasonal influenza infection is responsible for an estimated 41,000 deaths each year in the United States [2], and an avian influenza pandemic is projected to cause illness in 90 million Americans, with up to 1.9 million deaths [3]. Viral infection of host cells is mediated by a trimeric hemagglutinin protein, which inserts an approximately 20-residue N-terminal fusion peptide into the target membrane. Many mutations to the fusion peptide destroy viral infectivity, demonstrating the importance of understanding the fusogenic activity of hemagglutinin and enabling the design of novel antiviral drugs. The hemagglutinin ectodomain is a tightly structured trimer [5], which enforces a high local concentration of fusion peptide. This trimeric structure of hemagglutinin and additional fluorescence quenching experiments [6] support peptide complex formation, but infrared spectroscopy data suggest the fusion peptide does not strongly self-associate in lipid bilayers [7, 8]. In combination,
these data suggest a model of a loose complex of fusion peptide trimers in which complex formation is not associated with a detectable change in peptide conformation. While the available spectroscopic data show the individual monomers to be structured, the relationship of the monomers within the complex is not well defined.
Figure 1. Cut-away view of influenza virus hemagglutinin. Rendered in (a) are three copies of the influenza fusion peptide, which are inserted in the target cell membrane and connected via flexible linkers to the hemagglutinin ectodomain rendered in (b). This complex is anchored in the viral membrane by three copies of the C-terminal transmembrane domain rendered in (c). This model was constructed using crystal structures of the ectodomain [1], NMR structures of the fusion peptide [4], and a homology model of the transmembrane domain.
This loosely-ordered nature of influenza fusion peptide complexes is representative of a large class of membrane signaling processes that are mediated by loose or transient interactions. Compared to more traditional multimeric complexes formed in solution, these complexes are much more challenging to probe using traditional methods for experimental structure
determination or computational structure prediction. Molecular dynamics simulation can yield physically-based models of protein behavior, but such approaches have typically been limited to exploring only a few molecular trajectories. Additionally, previous molecular dynamics simulations of influenza fusion peptides [9-11] have considered only monomers rather than trimers. To overcome these obstacles, we have developed ensemble molecular dynamics methodology for membrane protein simulation, a robust method that computes thousands of separate simulations to yield statistically accurate prediction of transient complexes or slow-timescale processes. To achieve calculations on this scale, we have employed worldwide distributed computing via the Folding@Home project [12]. In this report, we predict the structure and dynamic movement of influenza fusion peptide trimers via ensemble molecular dynamics simulation, thus demonstrating a powerful new approach for examining protein complexes and protein signaling in lipid membranes.
2. Methods

Molecular dynamics calculations were performed using GROMACS [13] running under the Folding@Home distributed-computing architecture [12]. Fusion peptide coordinates were taken from the NMR structure [4], and POPC bilayer coordinates were used as previously reported [14]. Peptide conformations were generated as in Fig. 2, with the central Glu11 residues placed at a radial distance of 19 Å. Peptides were inserted in the bilayer by deleting any lipids with phosphate head groups within 7 Å of the peptide backbone and then equilibrating the bilayer with the peptide conformation fixed before beginning simulation. Simulations were performed in a periodic box under NPT conditions with semi-isotropic pressure coupling and in explicit TIP3P water with 150 mM sodium chloride. The GROMOS87 united-atom force field was used with modifications for lipid parameters from Berger [15]. Each nanosecond of a single simulation trajectory requires approximately 4 days of compute time on a 2.8 GHz Pentium 4.
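The insertion step described above (removing lipids whose phosphate head groups fall within 7 Å of the peptide backbone) can be sketched as a simple distance filter. The snippet below uses plain NumPy, ignores periodic boundary conditions, and assumes coordinate arrays and residue ids are supplied by whatever trajectory and topology tooling is in use.

```python
# Hedged sketch of the lipid-removal step used when inserting the peptides.
# Coordinates and residue ids are assumed inputs; PBC effects are ignored.

import numpy as np

def lipids_to_delete(backbone_xyz, phosphate_xyz, phosphate_resids, cutoff=7.0):
    """backbone_xyz: (N,3) peptide backbone coordinates (Angstrom);
    phosphate_xyz: (M,3) lipid phosphate positions; phosphate_resids: (M,) lipid
    residue ids. Returns the set of lipid residues to delete."""
    # pairwise distances between every phosphate and every backbone atom
    d = np.linalg.norm(phosphate_xyz[:, None, :] - backbone_xyz[None, :, :], axis=-1)
    clashing = d.min(axis=1) < cutoff
    return set(phosphate_resids[clashing].tolist())

# Toy example: one lipid sits ~1.5 A from the peptide, another ~18.5 A away.
backbone = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
phosphates = np.array([[3.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
resids = np.array([101, 102])
print(lipids_to_delete(backbone, phosphates, resids))   # -> {101}
```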
Figure 2. Starting conformations for fusion peptide trimers. The 19 starting conformations were generated by placing the peptides in a radially symmetric orientation and then generating all possible combinations of 90-degree rotations around a bilayer normal centered at Glu11 of each peptide, eliminating degenerate conformations resulting from the symmetry of identical monomers. The trimers were then placed in a POPC bilayer and simulated with explicit water and ions. Boxes denote the four most stable conformations, with the most stable in black and the following three in gray. (See Fig. 4 for analysis.)
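A simple way to enumerate such starting orientations is sketched below: each peptide receives a rotation about the bilayer normal drawn from {0°, 90°, 180°, 270°}, and triples that differ only by a cyclic relabeling of the identical monomers are discarded. This reduction is a simplifying assumption for illustration; it yields 24 combinations, whereas the authors report 19 conformations after their own degeneracy elimination.

```python
# Hedged sketch of enumerating trimer starting orientations. Only cyclic
# relabeling of identical monomers is removed here; this is an illustrative
# simplification, not the authors' exact symmetry reduction.

from itertools import product

def unique_trimer_orientations(angles=(0, 90, 180, 270)):
    seen, unique = set(), []
    for combo in product(angles, repeat=3):
        # canonical representative under cyclic permutation of the three monomers
        canon = min(combo[i:] + combo[:i] for i in range(3))
        if canon not in seen:
            seen.add(canon)
            unique.append(canon)
    return unique

orientations = unique_trimer_orientations()
print(len(orientations))   # 24 under cyclic symmetry alone
```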
3. Results

Using ensemble molecular dynamics, we have developed a structural and dynamic model for the interaction of hemagglutinin fusion peptide trimers. We incorporate known structural information regarding the peptide monomer conformations with a broad sampling approach to create a large ensemble of simulations. This ensemble allows us to predict the conformational dynamics of the relatively disordered trimeric fusion peptide complex. We use
experimentally-measured bilayer insertion depths for the peptide [4] and the NMR structure of the peptide in micelles [4] to specify the initial conformation of each monomer. All possible rotational arrangements of the peptides within the bilayer were sampled at 90-degree increments (Fig. 2). This broad array of starting conformations allows us to test the relative stability of different trimer conformations in an efficient manner. We report the results of >1000 separate molecular dynamics simulations of up to 25 ns in length and an additional 100 simulations of fusion peptide monomers for comparison. Use of this ensemble technique allows us to predict the conformational dynamics and relative stability of fusion peptide trimeric complexes.
3.1. Conformational change of peptide monomers

We assess peptide conformational change by measuring the root mean squared deviation (RMSD) of the protein backbone from the starting configuration at nanosecond intervals. The resulting data (Fig. 3a) show that the conformational fluctuation of peptides within the trimeric complex matches that of monomeric peptides to within error, suggesting that trimer formation does not involve a significant ordering of the internal conformation of the fusion peptide. RMSD values for the individual starting conformations are identical to within statistical error (data not shown). These findings are in good agreement with experimental infrared spectroscopic measurements [7] that show no conformational change from the monomeric NMR structure of the fusion peptide in micelles to the membrane-inserted form. Experimental measurements of short synthetic peptides containing fusion peptide sequences also suggest that the conformation of membrane-inserted peptides is primarily alpha-helical but that non-inserted beta-sheet aggregates may form at the lipid-solvent interface [7]. In our simulations, peptides remain membrane-inserted and no relevant beta-sheet conformation is detected.
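A per-frame measure of this kind can be computed with a standard Kabsch alignment followed by an RMSD over the alpha-carbons; the NumPy sketch below is illustrative and is not the authors' analysis code.

```python
# Hedged sketch of rigid-body (Kabsch) alignment followed by backbone RMSD.
# Coordinate arrays are assumed to come from the trajectory.

import numpy as np

def aligned_rmsd(ref, mobile):
    """ref, mobile: (N,3) matched alpha-carbon coordinates."""
    ref_c = ref - ref.mean(axis=0)
    mob_c = mobile - mobile.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the covariance matrix
    U, S, Vt = np.linalg.svd(mob_c.T @ ref_c)
    sign = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, sign])                 # avoid improper rotations
    R = U @ D @ Vt
    diff = mob_c @ R - ref_c
    return np.sqrt((diff ** 2).sum() / len(ref))

# Example: a rotated copy of the reference has RMSD ~0 after alignment.
ref = np.random.default_rng(0).normal(size=(20, 3))
theta = np.radians(30)
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
print(round(aligned_rmsd(ref, ref @ rot.T), 6))   # -> 0.0
```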
Figure 3. Conformational flexibility and translational motion of peptides. Conformational stability (a) is assessed for each peptide in the trimer by performing a rigid-body alignment to the starting conformation at each time point and then measuring the RMSD between the backbone α-carbons. Average RMSD values are plotted here for 1230 separate molecular dynamics trajectories of the trimers (black) and overlaid with analogous values for monomeric fusion peptides (gray). Error bars represent one standard deviation of the mean. Translational motion (b) is assessed via the mean squared deviation of the center of mass of each peptide from its starting position. Plotted values are the mean over each monomer in 1230 separate molecular dynamics trajectories of peptide trimers (black) and analogous values calculated for 100 trajectories of peptide monomers (gray).
3.2. Translational movement In addition to conformational changes, peptides may undergo both translational and rotational movements. We measure translational movements via mean squared displacement (MSD) of the peptide center of mass as a function of time. Average MSD values are plotted in Fig. 3b. Since peptides in a complex undergo constrained rather than Brownian diffusion, standard methods for estimating diffusion coefficients do not apply. However, a linear fit of MSD versus time provides a good means to compare peptide movements in trimeric complexes to movements of peptide monomers. In our simulations, individual peptides in trimers move approximately 4-fold more slowly than monomeric peptides in lipid bilayers (0.1 nm²/s versus 0.4 nm²/s), consistent with the diffusional constraint posed by other members of the trimer.
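A minimal sketch of this translational analysis follows (again not the authors' code). It assumes peptide centre-of-mass positions sampled at 1 ns intervals are available as (n_frames x 3) arrays, computes the mean squared displacement over lag times, and compares slopes of linear fits as a relative mobility measure, as in the text.

```python
import numpy as np

def msd_curve(com, max_lag):
    """Mean squared displacement of a centre-of-mass track (n_frames x 3)
    for lag times 1..max_lag frames."""
    return np.array([np.mean(np.sum((com[lag:] - com[:-lag]) ** 2, axis=1))
                     for lag in range(1, max_lag + 1)])

# Placeholder random-walk tracks standing in for trimer and monomer peptides.
rng = np.random.default_rng(0)
trimer_com = np.cumsum(rng.normal(scale=0.05, size=(26, 3)), axis=0)
monomer_com = np.cumsum(rng.normal(scale=0.10, size=(26, 3)), axis=0)

lags_ns = np.arange(1, 21)                    # 1 ns sampling interval assumed
slope_trimer = np.polyfit(lags_ns, msd_curve(trimer_com, 20), 1)[0]
slope_monomer = np.polyfit(lags_ns, msd_curve(monomer_com, 20), 1)[0]
print(f"monomer/trimer mobility ratio: {slope_monomer / slope_trimer:.1f}")
```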
Figure 4. Rotational autocorrelation functions for fusion peptide trimers. Plotted are mean rotational autocorrelation functions for all monomers, with each subplot corresponding to the analogous trimer starting configuration in Fig. 2. Dotted lines represent mean values, while solid lines represent 80% confidence intervals. Correlation functions are calculated as C(t) = <r(0) · r(t)>, where r(t) is the unit vector corresponding to the first principal axis of the protein molecule at time t, and are averaged over each of ~70 separate simulations per starting configuration.
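A minimal sketch of this autocorrelation calculation follows; it is an illustration only, assuming hypothetical coordinate arrays and taking r(t) as the first principal axis of the peptide coordinates in each stored frame.

```python
import numpy as np

def principal_axis(coords):
    """Unit vector along the first principal axis of an N x 3 coordinate set."""
    centered = coords - coords.mean(axis=0)
    evals, evecs = np.linalg.eigh(centered.T @ centered)
    return evecs[:, np.argmax(evals)]

def axis_series(frames):
    """Principal axes for a trajectory, with signs made consistent frame to frame
    (eigenvectors are only defined up to a sign)."""
    axes = np.array([principal_axis(f) for f in frames])
    for i in range(1, len(axes)):
        if np.dot(axes[i], axes[i - 1]) < 0:
            axes[i] = -axes[i]
    return axes

def rotational_acf(axes):
    """C(t) = r(0) . r(t) for one trajectory; averaging over the ~70 trajectories
    per starting configuration would be done outside this function."""
    return axes @ axes[0]

frames = np.random.rand(26, 60, 3)            # placeholder frames for one monomer
acf = rotational_acf(axis_series(frames))
```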
3.3. Orientation of the trimeric complex To assess the rotational dynamics of peptide monomers within the fusion peptide complex, angular correlation functions were measured for each monomer at each starting configuration. Angular (or rotational) correlation functions measure the average "decay" of a population of peptide trimers from a single given starting configuration into other configurations. They therefore provide a good metric for relative conformational stability, and indeed the results of these measurements (Fig. 4) show significant differences in stability between starting conformations (p-values for rotational correlation functions of the four most stable configurations after 15 ns of simulation are 0.004, 0.001, 0.007, and 0.009 calculated using the Wilcoxon rank sum test with Bonferroni multiple-hypothesis correction). Notably, the most stable rotational arrangements are not radially symmetric; instead, they feature two monomers facing end-in (with a slight but not statistically significant preference for "nose-
to-tail" over "nose-to-nose") and a third aligned sideways. This protein packing arrangement is unusual in that homotrimeric soluble proteins typically form radially symmetric complexes. We hypothesize that our predicted trimer conformation may be driven by lipid energetics: favored rotational states allow trimer packing such that the protein-lipid interface is reduced and fewer lipids are "confined" by protein helices on multiple sides. Local ordering of lipid bilayers by protein helices has been reported in the experimental literature [16], and the most favored packing arrangements we observe are likely to have increased lipid entropy. According to this model, membrane-embedded peptides exert a local ordering effect on adjacent lipid molecules. We postulate that a more compact packing arrangement that has a smaller peptide-lipid interfacial area would cause less ordering of the bilayer lipids and thus have a smaller entropic penalty. The asymmetric packing arrangements we simulate (Fig. 5) also help to explain functional mutation data on influenza hemagglutinin. Mutants to the Nterminus of the fusion peptide G1S and G1V prevent membrane fusion in cellbased assays [17, 18] but have near-identical conformations as determined by NMR in micelles [19]. Each stable trimer conformation that we predict has at least one N-terminus packed towards the center of the complex, such that substitution of the terminal glycine for a bulkier amino acid could disrupt trimer packing. We hypothesize that such a disruption could interfere with the fusogenic effects of the peptide complex on lipid membranes and potentially inhibit fusion via that mechanism. This would explain how point mutations that do not alter monomeric peptide structure can destroy viral infectivity.
Figure 5. Representative structure of fusion peptide trimer after 20 ns simulation. The structure shown is from the most stable trimer orientation. The peptides are rendered in space-filling form, and the surrounding lipids (light gray) are rendered in stick form, showing the packing of protein and lipid. The N-termini of the peptides are marked with arrows.
4. Conclusions Many protein interactions critical for disease processes occur within loosely ordered membrane complexes. Because of their conformational diversity and the complicating presence of the lipid bilayer, these complexes pose a challenge for experimental structure determination and computational prediction. In this report, we introduce a general means of studying loosely ordered complexes via ensemble molecular dynamics and apply it to influenza virus fusion peptides. Using distributed-computing techniques, we are able to calculate conformational stability for a broadly-sampled ensemble of candidate conformations, generating a robust structural model for protein-protein and protein-lipid interactions in fusion peptide trimers. Experimental mutants of influenza fusion peptides have
been identified that have unchanged monomeric structure but lack fusogenic capability, thus destroying viral infectivity. The structural understanding generated by our computational approach aids the interpretation of such mutational data, suggesting a means by which such mutants may cause altered protein-protein packing among fusion peptide trimers. In addition, we suggest a novel paradigm for protein complex structure, in which lipid entropy plays an important role in determining the structure of loosely ordered complexes within lipid membranes.
Acknowledgments The authors would like to thank K. Branson for helpful discussions and Folding@Home volunteers worldwide.
References
1. Bullough, P.A., F.M. Hughson, J.J. Skehel, and D.C. Wiley, Structure of influenza haemagglutinin at the pH of membrane fusion. Nature, 1994. 371(6492): p. 37-43.
2. Dushoff, J., J.B. Plotkin, C. Viboud, D.J. Earn, and L. Simonsen, Mortality due to influenza in the United States—an annualized regression approach using multiple-cause mortality data. Am J Epidemiol, 2006. 163(2): p. 181-7.
3. Pandemic Influenza Plan, U.S. Department of Health and Human Services, Editor. 2005.
4. Han, X., J.H. Bushweller, D.S. Cafiso, and L.K. Tamm, Membrane structure and fusion-triggering conformational change of the fusion domain from influenza hemagglutinin. Nat Struct Biol, 2001. 8(8): p. 715-20.
5. Wilson, I.A., J.J. Skehel, and D.C. Wiley, Structure of the haemagglutinin membrane glycoprotein of influenza virus at 3 Å resolution. Nature, 1981. 289(5796): p. 366-73.
6. Cheng, S.F., A.B. Kantchev, and D.K. Chang, Fluorescence evidence for a loose self-assembly of the fusion peptide of influenza virus HA2 in the lipid bilayer. Mol Membr Biol, 2003. 20(4): p. 345-51.
7. Han, X. and L.K. Tamm, pH-dependent self-association of influenza hemagglutinin fusion peptides in lipid bilayers. J Mol Biol, 2000. 304(5): p. 953-65.
8. Haque, M.E., V. Koppaka, P.H. Axelsen, and B.R. Lentz, Properties and structures of the influenza and HIV fusion peptides on lipid membranes: implications for a role in fusion. Biophys J, 2005. 89(5): p. 3183-94.
9. Vaccaro, L., K.J. Cross, J. Kleinjung, S.K. Straus, D.J. Thomas, S.A. Wharton, J.J. Skehel, and F. Fraternali, Plasticity of influenza haemagglutinin fusion peptides and their interaction with lipid bilayers. Biophys J, 2005. 88(1): p. 25-36.
10. Huang, Q., C.L. Chen, and A. Herrmann, Bilayer conformation of fusion peptide of influenza virus hemagglutinin: a molecular dynamics simulation study. Biophys J, 2004. 87(1): p. 14-22.
11. Lague, P., B. Roux, and R.W. Pastor, Molecular dynamics simulations of the influenza hemagglutinin fusion peptide in micelles and bilayers: conformational analysis of peptide and lipids. J Mol Biol, 2005. 354(5): p. 1129-41.
12. Shirts, M. and V.S. Pande, Computing - Screen savers of the world unite! Science, 2000. 290(5498): p. 1903-1904.
13. Van der Spoel, D., E. Lindahl, B. Hess, G. Groenhof, A.E. Mark, and H.J.C. Berendsen, GROMACS: Fast, flexible, and free. Journal of Computational Chemistry, 2005. 26(16): p. 1701-1718.
14. Tieleman, D.P., L.R. Forrest, M.S. Sansom, and H.J. Berendsen, Lipid properties and the orientation of aromatic residues in OmpF, influenza M2, and alamethicin systems: molecular dynamics simulations. Biochemistry, 1998. 37(50): p. 17554-61.
15. Berger, O., O. Edholm, and F. Jahnig, Molecular dynamics simulations of a fluid bilayer of dipalmitoylphosphatidylcholine at full hydration, constant pressure, and constant temperature. Biophys J, 1997. 72(5): p. 2002-13.
16. Killian, J.A., Hydrophobic mismatch between proteins and lipids in membranes. Biochim Biophys Acta, 1998. 1376(3): p. 401-15.
17. Gray, C., S.A. Tatulian, S.A. Wharton, and L.K. Tamm, Effect of the N-terminal glycine on the secondary structure, orientation, and interaction of the influenza hemagglutinin fusion peptide with lipid bilayers. Biophys J, 1996. 70(5): p. 2275-86.
18. Qiao, H., R.T. Armstrong, G.B. Melikyan, F.S. Cohen, and J.M. White, A specific point mutant at position 1 of the influenza hemagglutinin fusion peptide displays a hemifusion phenotype. Mol Biol Cell, 1999. 10(8): p. 2759-69.
19. Li, Y., X. Han, A.L. Lai, J.H. Bushweller, D.S. Cafiso, and L.K. Tamm, Membrane structures of the hemifusion-inducing fusion peptide mutant G1S and the fusion-blocking mutant G1V of influenza virus hemagglutinin suggest a mechanism for pore opening in membrane fusion. J Virol, 2005. 79(18): p. 12065-76.
PROTEIN INTERACTIONS AND DISEASE PHENOTYPES IN THE ABC TRANSPORTER SUPERFAMILY
LIBUSHA KELLY 1,2,3,*, RACHEL KARCHIN 2,3 AND ANDREJ SALI 2,3
1 Program in Biological and Medical Informatics, 2 Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and 3 California Institute for Quantitative Biomedical Research, University of California at San Francisco, QB3 at Mission Bay, Office 503B, 1700 4th Street, San Francisco, CA 94158, USA.
* Corresponding author: [email protected]
ABC transporter proteins couple the energy of ATP binding and hydrolysis to substrate transport across a membrane. In humans, clinical studies have implicated mutations in 19 of the 48 known ABC transporters in diseases such as cystic fibrosis and adrenoleukodystrophy. Although divergent in sequence space, the overall topology of these proteins, consisting of two transmembrane domains and two ATP-binding cassettes, is likely to be conserved across diverse organisms. We examine known intra-transporter domain interfaces using crystallographic structures of isolated and complexed domains in ABC transporter proteins and find that the nucleotide binding domain interfaces are better conserved than interfaces at the transmembrane domains. We then apply this analysis to identify known disease-associated point and deletion mutants for which disruption of domain-domain interfaces might indicate the mechanism of disease. Finally, we suggest a possible interaction site based on conservation of sequence and disease-association of point mutants.
1. Introduction ATP-binding cassette (ABC) transporters are membrane-spanning proteins that transport a wide variety of small molecule substrates and ions across cell membranes. Examples include the multidrug transporter P-gp, associated with drug resistance phenotypes in cancer therapy [1], and the cystic fibrosis transmembrane conductance regulator (CFTR) that transports chloride ions [2]. In Escherichia coli, the vitamin B12 transporter BtuCD is also an ABC transporter [3], further underscoring the diversity of molecules transported by these proteins. In humans, the ABC transporters are divided into seven families, labeled ABCA through ABCG. There are half transporters, such as the breast cancer resistance protein BCRP (ABCG2), that consist of one nucleotide-binding domain (NBD) and one transmembrane domain (TMD), and whole transporters, such as the sulfonylurea transporter (ABCC8), that consist of two NBDs and two TMDs [4]. In bacteria and archaea, the NBD and TMD domains are frequently encoded by separate genes and the corresponding proteins must associate for
proper function. One such example is the vitamin B12 transporter BtuCD in E. coli, in which the two BtuC proteins and two BtuD proteins associate for transport [3]. Because there are no complete, high-resolution structures of eukaryotic ABC transporters, it is not known how similar their structures and mechanisms are to those of their bacterial and archaeal homologs. However, the striking sequence conservation of domains (e.g., the motif conservation and sequence identity between NBDs of diverse organisms) suggests that, despite differences in gene organization, human ABC transporters are likely to have a quaternary structure similar to those observed in bacteria and archaea [5]. Four crystal structures of NBD dimers (PDB IDs 1L2T, 1XEF, 1L7V and 1Q12) all have a structurally similar NBD/NBD interface, with the Walker A phosphate binding loop of one NBD appearing directly across the interface from the highly conserved 'signature' motif of the opposite NBD [6, 7, 3, 8]. The Cα RMSD (computed with MODELLER's salign feature [17]) between the structures is between 1.7 and 2.7 Å, further demonstrating that the NBD/NBD interface is well conserved among different ABC transporters. An unanswered question about ABC transporter associations is whether the "two NBD / two TMD" model can also include higher-order oligomeric states [9, 10]. ABC transporters are also known to interact with a number of other membrane and soluble proteins. The sulfonylurea transporters (ABCC8 and 9) interact with inwardly rectifying (Kir) potassium channels to form ATP-sensitive potassium channels that modulate the electrical activity in cells [11]. The CFTR protein is known to interact with PDZ domains and likely has other binding partners, including adrenergic receptors [12]. Because of the lack of high-resolution structural data, the nature of these interactions at the amino acid residue level is not known. Point mutations at interfaces can affect the function of ABC transporters in several ways. First, the mutant might destabilize domain folding or association during folding and prevent proper maturation of the protein. A medically relevant example is the deletion mutant ΔF508 in CFTR, which is the most common cause of cystic fibrosis. This mutation leads to an immature, lower molecular weight form of the protein that is retained in the endoplasmic reticulum and degraded, which leads to a lack of functional transporters localized to the membrane [2]. Second, the mutant might interfere with the function of an intact transporter by affecting ATP binding and hydrolysis. Third, the mutant might affect allosteric interactions between the domains that are required for substrate binding and transport. Given the importance of intra- and inter-protein interactions in the ABC transporters, coupled with the large body of data on disease-associated
mutations, we examined known domain interactions in high-resolution crystal structures and used our analysis to suggest possible interface-related mechanisms of human disease. We comprehensively map known disease mutations onto putative nucleotide binding domain interface sites in human ABC transporters. The putative interfacial residues identified in this study can be used to focus efforts in the biochemical identification of functionally important residues.
2. Methods 2.1. Sequence collection and generation of multiple sequence alignments There are 35 structures of ABC transporter proteins in the PDB as of July, 2006. We selected six for study based on the following considerations: diversity of organism representation, diversity of transporter family representation, structural resolution, and completeness of structure as defined by the number of domains crystallized. We selected four structures with definable interfaces. Two are complete ABC transporters with two NBDs and two TMDs: the structure of the vitamin B12 transporter from E. coli [3] and the MsbA lipid A exporter from Salmonella typhimurium [13]. Two structures are NBD dimers, one from Methanococcus jannaschii, and the other from E. coli [6, 7]. The final two structures are human transporter NBD monomers [14, 15]. Sequences homologous to each of the proteins in Table 1 were culled by iteratively searching the Uniprot database [16] using the build_profile module of MODELLER [17] with a threshold e-value of 0.01. Build_profile is an iterative database searching method that uses dynamic programming for aligning profiles against sequences and an empirical definition of statistical significance based on the scores collected during the scan of the database [18]. Such automated alignments were also generated for each of the 76 nucleotide binding domains in the 48 human transporters. We used diverse, superfamily-level multiple sequence alignments to examine patterns of conservation across the whole family of ABC transporters rather than dividing the transporters by subfamily.
Table 1. ABC transporter structures used in analysis.

Gene [PDB ID] | Organism | Resolution (Å) | Description
CFTR [1XMI] | H. sapiens | 2.25 | Monomeric NBD1 of the cystic fibrosis transmembrane conductance regulator [15]
BtuCD [1L7V] | E. coli | 3.20 | Complete structure (two TMDs and two NBDs) of the vitamin B12 transporter. Both the TMD and NBD were used in the analysis [3]
MJ0796 [1L2T] | M. jannaschii | 1.90 | Dimeric structure of two NBDs, unknown substrate [6]
HlyB [1XEF] | E. coli | 2.50 | Dimeric structure of the NBDs of the alpha-hemolysin transporter [7]
Tap1 [1JJ7] | H. sapiens | 2.40 | Monomeric NBD of the peptide transporter Tap1 [14]
MsbA [1Z2R] | S. typhimurium | 4.20 | Complete structure of the lipid A exporter MsbA, which is homologous to human multidrug resistance transporters [13]
The CFTR, BtuC (TMD), BtuD (NBD), MJ0796, HlyB, Tap1, MsbA (TMD) and MsbA (NBD) alignments contained 36,199, 5,444, 36,608, 43,981, 44,134, 28,251, 6,172 and 45,368 sequences, respectively. The alignments are available at http://salilab.org/~libusha/psb2007. 2.2. Evolutionary conservation: sequence weights and residue position entropies We use Shannon entropy to measure the evolutionary conservation at each position (column) in our multiple sequence alignments [19]. Henikoff weighting [20] was used to ensure that entropy calculations were not skewed by large numbers of highly similar sequences, which are common in alignments of ABC transporters and which can be a problem with large automatically generated alignments in general. In Henikoff weighting, each column is given an initial weight of 1, which is divided equally between distinct amino acid residues in the column. Within a column, the weight for each amino acid residue is divided equally by the number of times it occurs. Finally, the weight of any given sequence is the sum of the weights of all of the amino acid residues in the sequence. A Shannon entropy:
H = -Σ_{aa=1..20} P_aa log2 P_aa        (1)

was calculated for each column, where P_aa is defined as:

P_aa = S_aa / Σ_{aa'=1..20} S_aa'

where S_aa = Σ_{i=1..n_aa} w(i), w(i) is the sequence weight, and n_aa is the number of amino acid residues of a particular type seen in the column. Because of the minus sign in Equation (1), lower numbers indicate greater evolutionary conservation. A MATLAB implementation of Henikoff weighting and the sequence weighting-based Shannon entropy calculation are available on request. 2.3. Interface definition Domain interfaces were defined according to PiBase, a database of domain interactions from x-ray crystal structures in the PDB that uses a 5.5 Å cutoff for heavy atom interatomic distances to define residues at an interface [21]. The functional unit of ABC transporters is two transmembrane domains (TMD) complexed with two nucleotide-binding domains (NBD) [3, 13, 5]. For the complete ABC transporter structure BtuCD, we define three interfaces: NBD/TMD, NBD/NBD and TMD/TMD. For the dimeric structures we define only the NBD/NBD interfaces. The structure of each domain was aligned to the BtuCD structure (for TMD/NBD interactions) and the MJ0796 structure (for NBD/NBD interactions) with the salign routine in MODELLER [17]. All residues that aligned to interface residues in BtuCD or MJ0796 were predicted to also be interface residues. 2.4. Homology transfer annotations We used the multiple sequence alignments to predict the locations of interface amino acid residues in each of the human ABC transporter NBDs. We assume that if a residue aligns to a known interface residue in the MJ0796 structure (for NBD/NBD alignments) or the BtuCD structure (for NBD/TMD alignments), it is also at an interface in homologous family members. Residues that aligned to defined interface residues (Interface definition) were examined for disease associations as annotated in the VARIANT records of the Uniprot database [16]. 2.5. Surface conservation We used the molecular graphics visualization program Chimera [22] to identify sites of putative binding interactions that have not yet been functionally characterized, by locating surface regions of medium to high conservation (excluding defined interface sites). Medium conservation is defined as no greater than half of the highest column entropy found in a given alignment (Figure 2B).
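A minimal sketch of the column conservation calculation described above (not the released MATLAB code) is shown here: Henikoff position-based weights are accumulated per sequence and then used to weight the per-column amino acid frequencies entering the Shannon entropy. Gap handling is omitted for brevity, which is an assumption not addressed in the text.

```python
import math
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights for a list of equal-length aligned sequences.
    Each column contributes 1/(r * count) to each sequence, where r is the number
    of distinct residue types in the column and count is how often that sequence's
    residue occurs there."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        r = len(counts)
        for i, aa in enumerate(column):
            weights[i] += 1.0 / (r * counts[aa])
    return weights

def column_entropy(column, weights):
    """Sequence-weighted Shannon entropy H = -sum_aa P_aa log2 P_aa for one column."""
    s_aa = Counter()
    for aa, w in zip(column, weights):
        s_aa[aa] += w                          # S_aa: summed sequence weights per residue type
    total = sum(s_aa.values())
    return -sum((v / total) * math.log2(v / total) for v in s_aa.values())

aln = ["GKSTLL", "GKSTIL", "GRSTLL", "AKSSLM"]  # toy alignment
w = henikoff_weights(aln)
entropies = [column_entropy(col, w) for col in zip(*aln)]
```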
Figure 1. ABC transporter interdomain interfaces. Interfaces are defined according to PiBase [21] and mapped onto the representative MJ0796 ABC transporter nucleotide binding domain (NBD) dimer structure from M. jannaschii (PDB ID: 1L2T) [6].
Figure 2. Disease-associated residues at putative ABC transporter interfaces.
(A). A close-up of the first nucleotide binding domain of the human CFTR (PDB ID: 1XMI) [15]. Interface residues were defined using homology transfer annotation based on the structure of an NBD dimer from M. jannaschii (PDB ID: 1L2T) [6] and are shown in gold. Residues with known cystic fibrosis associations at the NBD/NBD interface are shown in black (Table 2). An N-terminal helix in the CFTR structure is hidden to show the complete interface as defined by the 1L2T structure. (B). The exposed, non-NBD surface of the 1L2T structure. Residue positions in yellow have entropies of no more than 2.1 bits. Residue positions in black are associated with cystic fibrosis (CFTR), adrenoleukodystrophy (ALD) and high-density lipoprotein deficiency type 2 (ABCA1).
3. Results 3.1. Differential conservation of interfaces in ABC transporters We examined evolutionary conservation at the amino acid residue level for three different interfaces in ABC transporter structures (Interface definition).
• ATP &md;ti3 s«« ' - NSO/NBD = NBDflioATP
4
4
8
3.S0-
+
••*>•
4 TMD/TMO a Ransom
3.0G2.83a,eo~ 1.50-
*
.
*
A
&
*
»
*
t,oa~
-*"'?—•"•"
117V 1ZJR tXEF 1L2T 1JJ7 tXMl StructwTe
Figure 3. Evolutionary conservation at binding and interface sites in six ABC transporter structures. The ATP binding site, which forms part of the interface between the nucleotide binding domains (NBDs) has the lowest entropy due to highly conserved residues in the Walker A, B and 'signature' motifs. The NBD/NBD interface is well conserved even when the ATP-binding residues are removed from consideration (NBD/noATP). The TMD interfaces, both with the cognate NBD and the cognate TMD (only definable for 1Z2R and 1L7V) are not highly conserved.
We were only able to define the TMD/TMD interface for the two complete structures, 1Z2R and 1L7V. We found that the NBD/NBD interface was consistently more conserved than either the NBD/TMD or the TMD/TMD interface (Figure 3, Discussion). 3.2. Disease associated mutations at ABC transporter interfaces We found a total of 68 disease-associated positions at PiBase-defined interfaces in 10 transporters (Table 2, Figure 2A). Of these positions, 65 were single residue mutations and three were deletions. Thirty-eight mutations were at the NBD/NBD interface and 30 at the NBD/TMD interface. We also found conserved surface residues that included two positions associated with disease in several ABC transporters. These residues correspond to the 1L2T residues 1, 2, 31, 60, 164, and 213. There are 587 total known disease mutations in the 10 transporters, of which 504 are found in ABCC7, ABCD1 or ABCA4 [16].
Table 2. Disease associated mutations at putative ABC transporter interfaces. Human protein residues that aligned with the NBD/NBD or NBD/TMD interface were examined for disease association using Uniprot [16]. The two interfaces overlap by two residues.

Transporter [Disease(s)] | NBD/NBD | NBD/TMD
ABCA1 [High density lipoprotein deficiency type 2] | N935S | -
ABCA3 [Respiratory distress syndrome] | N568D | -
ABCA4 [Stargardt disease (STGD), Fundus flavimaculatus (FFM), Age-related macular degeneration 2 (ARMD2)] | R943W (STGD/FFM), N965S (STGD), S1063P (STGD), E1087D/K (STGD), G1091E (FFM), G1975R (STGD), E2096K (STGD), H2128R (STGD) | L1014R (STGD), T1019A (STGD), K1031E (STGD), E1036K (STGD), V1072A (STGD), L2027F (STGD/FFM), R2030Q (STGD/FFM), L2035P (STGD)
ABCA12 [Lamellar ichthyosis] | N1380S, G1381E, E1539K | Q1382R, ΔM1393
ABCC2 [Dubin-Johnson syndrome] | R768W | -
ABCC6 [Autosomal recessive pseudoxanthoma elasticum] | T1301I, G1302R, Q1347H | R1314Q, Q1347H, D1361N
ABCC7/CFTR [Cystic fibrosis] | G458V, S549I/N/R, G551S, R553Q, D579G, G1244E | S492F, E504Q, ΔF507, ΔF508, W1282R, R1283M, F1286S, N1303H, R1392H, R1419C
ABCC8 [Persistent hyperinsulinemic hypoglycemia of infancy] | G715V, V1359M, G1377R, G1380S, R1435Q, E1505K | R1435Q
ABCD1 [Adrenoleukodystrophy] | G507V, S552P, S606P/L, G608D, E609G/K, E630G, S633I, V635M, S636I | P543L, S552P, Q556R, P560R, M566K
ABCG5 [Sitosterolemia] | - | E146Q
4. Discussion We have comprehensively mapped known disease-associated mutations to putative interfaces and found that 68 disease-associated positions in 10 transporters fall at putative interfaces. This indicates that a majority of disease-associated ABC transporters (10/17) have mutations at interface regions. Single residue point mutations were the most common and accounted for 65 of the disease-associated positions; the other three were single residue deletions. Thirty-eight mutations were at the NBD/NBD interface and 30 at the NBD/TMD interface. We hypothesize that many disease-associated mutations involving ABC transporters may be due to disruption of domain-domain binding interactions. Proper function of ABC transporters involves cycles of substrate binding and release, which are currently thought to be governed by an 'ATP switch'-type mechanism with ATP binding and hydrolysis causing formation and dissociation of an NBD/NBD dimer. The switch, between open and closed dimer states, causes conformational changes in the TMDs that enable substrate transport [5]. While large conformational changes have been seen in mammalian ABC transporters using electron microscopy [23, 24], knowledge of the specific residue interactions at both the NBD/NBD interface and the TMD/NBD interface that are involved in these changes in human transporters is lacking. As noted earlier, interface mutations can disrupt ABC transporter domain interactions in several ways: by interfering with ATP binding or hydrolysis, by destabilizing or preventing proper folding and association of the domains, or by interfering with allosteric communication between domains that is suggested by the large conformational changes seen during the transport cycle. Defining residues at these interfaces is useful to experimentalists interested in examining specific residue interactions that stabilize or abrogate interface interactions. For example, in CFTR, a hypothesized hydrogen bond between R555 in NBD1 and T1246 in NBD2 stabilizes the open, chloride-transporting state of the protein [27]. The high conservation of the NBD/NBD interface at the superfamily level suggests that there are likely additional residue interactions that stabilize dimer formation and facilitate transport. The relative lack of conservation at the TMD/NBD interface and the large number of disease-associated mutations at this interface (Table 2) might indicate that NBD/TMD mutations lead to defects in folding and maturation rather than directly affecting the function of a properly processed, intact transporter. The ΔF508 mutant falls at the TMD/NBD interface and leads to an immature protein that is tagged for degradation and does not localize properly to the cell
membrane [2]. A recent study showed that mutating the analogous residue in P-glycoprotein (MDR1), Y409, also led to an immature form of the protein with an altered NBD/NBD interface. This observation indicates a misfolded protein with improper or incomplete domain associations [25]. Another predicted TMD/NBD interface mutant, R1435Q in ABCC8, could not form functional KATP channels and showed 10-fold reduced expression compared to wild-type ABCC8. Either protein instability or defective transport to the cell membrane could cause this phenotype [26]. Alternatively, the lack of conservation at this interface might suggest that TMD/NBD interactions are subfamily specific, in contrast to the overall high conservation of residues at the NBD/NBD interface. Given the 30 disease mutants at this interface, the lack of conservation at the NBD/TMD interface does not indicate that this region is unimportant for ABC transporter function. However, it suggests that instead of a larger conserved interaction footprint as seen in the NBD/NBD interface, perhaps a small number of conserved residues form the necessary contacts for communication between the domains. In the TMD of BtuCD, A221 in the L2-loop is one of only three moderately to highly conserved residues in the TMD. In MsbA, the residues G122 and E208 are well conserved and contribute to the TMD/NBD interface. We also suggest a possible interaction site distinct from those observed in the crystallographic structures, based on conserved surface residues and disease-association in human ABC transporters (Figure 2B). Surface residues not at defined interfaces are generally not well conserved in our analysis (Appendix Figure A1). However, a moderately to highly conserved region on the surface of the 1L2T structure includes the aligned human mutations D1099Y in ABCA1, associated with high density lipoprotein deficiency type 2; D614G in CFTR, associated with cystic fibrosis; and T668I in ABCD1, associated with adrenoleukodystrophy. Observing three different transporters with disease-associated mutations at the same solvent-exposed position suggests that this position is conserved for a functional reason. If the residues indeed form part of an interaction site with an unknown partner, that partner might also be conserved in multiple transporters. Alternatively, these residues could indicate a region that stabilizes oligomerization of complete ABC transporters. This example also demonstrates the utility of homology transfer annotation for locating functionally important residues. There is little experimental data available defining the specific effect of disease-associated mutations on ABC transporters. A recent review noted that the majority of CFTR mutants have not been experimentally characterized [2].
The difficulty of working with these large membrane proteins underscores the need for computational analysis that provides hypotheses for the mechanism of domain interactions in ABC transporters that can be verified experimentally. We used this analysis to prioritize residues selected to experimentally probe domain interactions in the human multidrug ABC transporter P-gp. We will apply our method to new ABC transporter structures as they become available, and we intend to explore using other measures of residue conservation, including determining site-specific mutation rates and locating coevolving residues, in the future [28, 29].
Acknowledgments We thank John Chodera (UCSF) and our reviewers for comments on the manuscript as well as Dr. Deanna Kroetz, Dr. Kathy Giacomini, Jason Gow and Marco Sorani (UCSF) for helpful discussions about membrane transporters. This work is supported by NIH (F32 GM-072403-02, U01 GM61390, U54 GM074929-G1, R01 GM54762), the Burroughs Wellcome foundation, the Sandler Family Supporting Foundation, SUN, IBM, and Intel.
Appendix
Figure A.1. Representative sequence conservation at putative ABC transporter interfaces. Residue conservation was mapped onto the structure of an NBD dimer from M. jannaschii (PDB ID: 1L2T) [6]. Conservation is colored from black to white, with black indicating high conservation and white indicating low conservation. The TMD/NBD interface region (left panel), defined by alignment to the BtuCD structure, is circled and shows low conservation. The NBD interface is visible as a curve of high conservation extending from one ATP molecule (shown in stick) to the other. The right panel is rotated 180 degrees horizontally and shows some solvent-exposed regions of higher conservation.
References
1. S. V. Ambudkar, et al., Annu. Rev. Pharmacol. Toxicol., 39, 361 (1999).
2. D. Gadsby, P. Vergani, and L. Csanady, The ABC protein turned chloride channel whose failure causes cystic fibrosis. Nature, 440, 477 (2006).
3. K. Locher, A. Lee, and D. Rees, The E. coli BtuCD Structure: A Framework for ABC Transporter Architecture and Mechanism. Science, 296, 1091 (2002).
4. M. Dean and T. Annilo, Evolution of the ATP-binding cassette (ABC) transporter superfamily in vertebrates. Annu. Rev. Genomics Hum. Genet., 6, 123 (2005).
5. C. F. Higgins and K. J. Linton, The ATP switch model for ABC transporters. Nat. Struct. Mol. Biol., 11, 918 (2004).
6. P. C. Smith, et al., ATP binding to the motor domain from an ABC transporter drives formation of a nucleotide sandwich dimer. Mol. Cell, 10, 139 (2002).
7. J. Zaitseva, et al., H662 is the linchpin of ATP hydrolysis in the nucleotide-binding domain of the ABC transporter HlyB. The EMBO Journal, aop (2005).
8. J. Chen, G. Lu, J. Lin, A. L. Davidson and F. A. Quiocho, A tweezers-like motion of the ATP-binding cassette dimer in an ABC transport cycle. Mol. Cell, 12, 651 (2003).
9. A. Bhatia, H. J. Schafer and C. A. Hrycyna, Oligomerization of the human ABC transporter ABCG2: evaluation of the native protein and chimeric dimers. Biochemistry, 44, 10893 (2005).
10. J. Xu, Y. Liu, Y. Yang, S. Bates and J. T. Zhang, Characterization of oligomeric human half-ABC transporter ATP-binding cassette G2. J. Biol. Chem., 279, 19781 (2004).
11. C. Nichols, KATP channels as molecular sensors of cellular metabolism. Nature, 440, 470 (2006).
12. C. Li and A. Naren, Macromolecular complexes of cystic fibrosis transmembrane conductance regulator and its interacting partners. Pharmacology & Therapeutics, 108, 208 (2005).
13. L. Reyes and G. Chang, Structure of the ABC transporter MsbA in complex with ADP-vanadate and lipopolysaccharide. Science, 308, 1028 (2005).
14. R. Gaudet and D. C. Wiley, Structure of the ABC ATPase domain of human TAP1, the transporter associated with antigen processing. EMBO J., 20, 4964 (2001).
15. H. A. Lewis, et al., Structure of nucleotide-binding domain 1 of the cystic fibrosis transmembrane conductance regulator. EMBO J., 23, 282 (2004).
16. A. Bairoch, et al., The Universal Protein Resource (UniProt). Nucleic Acids Res., 33 (2005).
17. U. Pieper, et al., MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res., 34 (2006).
18. S. F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389 (1997).
19. C. E. Shannon, A mathematical theory of communication. Bell Sys. Tech. J., 27, 379 (1948).
20. S. Henikoff and J. Henikoff, Position-based sequence weights. Journal of Molecular Biology, 243, 574 (1994).
21. F. P. Davis and A. Sali, PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics, 21, 1901 (2005).
22. E. F. Pettersen, et al., UCSF Chimera—a visualization system for exploratory research and analysis. J. Comput. Chem., 25, 1605 (2004).
23. M. F. Rosenberg, et al., Purification and crystallization of the cystic fibrosis transmembrane conductance regulator (CFTR). J. Biol. Chem., 279, 39051 (2004).
24. M. F. Rosenberg, et al., Repacking of the transmembrane domains of P-glycoprotein during the transport ATPase cycle. EMBO J., 20, 5615-5625 (2001).
25. T. W. Loo, M. C. Bartlett, and D. M. Clarke, Processing mutations located throughout the human multidrug resistance P-glycoprotein disrupt interactions between the nucleotide binding domains. J. Biol. Chem., 279, 38395 (2004).
26. Y. Tanizawa, et al., Genetic analysis of Japanese patients with persistent hyperinsulinemic hypoglycemia of infancy: nucleotide-binding fold-2 mutation impairs cooperative binding of adenine nucleotides to sulfonylurea receptor 1. Diabetes, 49, 114 (2000).
27. P. Vergani, et al., CFTR channel opening by ATP-driven tight dimerization of its nucleotide-binding domains. Nature, 433, 7028 (2005).
28. Z. Yang and S. Kumar, Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol Biol Evol, 13, 650-9 (1996).
29. S. W. Lockless and R. Ranganathan, Evolutionarily conserved pathways of energetic connectivity in protein families. Science, 286, 295-9 (1999).
LTHREADER: PREDICTION OF LIGAND-RECEPTOR INTERACTIONS USING LOCALIZED THREADING
JADWIGA BIENKOWSKA 1,2,* BONNIE BERGER 1,*
1 Computer Science and Artificial Intelligence Laboratory, MIT; 2 Biomedical Engineering Dept, Boston University. * Corresponding author
Identification of ligand-receptor interactions is important for drug design and treatment of diseases. Difficulties in detecting these interactions using high-throughput experimental techniques motivate the development of computational prediction methods. We propose a novel threading algorithm, LTHREADER, which generates accurate local sequence-structure alignments and integrates statistical and energy scores to predict interactions within ligand-receptor families. LTHREADER uses a profile of secondary structure and solvent accessibility predictions with residue contact maps to guide and constrain alignments. Using a decision tree classifier and low-throughput experimental data for training, it combines information inferred from statistical interaction potentials, energy functions, correlated mutations and conserved residue pairs to predict likely interactions. The significance of predicted interactions is evaluated using the scores for randomized binding surfaces within each family. We apply our method to cytokines, which play a central role in the development of many diseases including cancer and inflammatory and autoimmune disorders. We tested our approach on two representatives from different structural classes (all-alpha and all-beta proteins) of cytokines. In comparison with the state-of-the-art threader RAPTOR, LTHREADER generates on average 20% more accurate alignments of interacting residues. Furthermore, in cross-validation tests, LTHREADER correctly predicts experimentally confirmed interactions for a common binding mode within the 4-helical long chain cytokine family with 75% sensitivity and 86% specificity. For the TNF-like family our method achieves 70% sensitivity with 55% specificity. This is a dramatic improvement over existing methods. Moreover, LTHREADER predicts several novel potential ligand-receptor cytokine interactions. Supplementary website: http://theory.csail.mit.edu/~vinyAthreader
1. Introduction
Proteins are essential for the proper operation of living cells and viruses, performing a wide variety of functions. Most often, they do so by interacting with other proteins. The study of these interactions is extremely important, as many diseases can be traced to undesirable or malfunctioning protein-protein interactions (PPIs). Currently, methods exist for predicting PPIs that have achieved some degree of success, relying mostly on data obtained from high-throughput (HTP) experiments such as yeast two-hybrid screens. Receptors are proteins embedded within the cell membrane. Interactions with their extra-cellular ligands occupy a central role in inter-cellular signaling and biological processes that lead to the development and progression of many diseases. Of particular importance to human diseases are cytokines. Cytokine
interactions with their receptors are responsible for innate and adaptive immunity, hematopoiesis and cell proliferation. The etiology of cancer and autoimmune disorders can be attributed in part to cytokine signaling through their receptors. For example, the long-chain 4-helical bundle cytokines erythropoietin and human growth hormone are already used for the treatment of cancer and growth disorders. Many other therapies altering cytokine-receptor interactions are in clinical development [1]. However, ligand-receptor (L-R) interactions are much more difficult to predict than general PPIs, and methods that work well for PPIs often fail when applied to L-R binding pairs. In particular, the lack of high-throughput experimental data for these interactions makes it difficult to apply existing prediction methods that depend on this information (see Related Work). We consider the problem of predicting whether a ligand and receptor interact, given only their sequence information and several confirmed L-R PPIs among members of the same structural SCOP family [2]. Even when one or more complex structures are available within an L-R family it is often a challenge to effectively use this information to predict interactions among other members of the family. One reason is the difficulty in identifying the interacting residues that are common among distant family members. The conformational differences that often occur at the interface of bound proteins make such identification non-obvious. Our approach is to thread the sequences onto the binding interface of a solved L-R complex and to evaluate the complementarity of the resulting surface. In so doing, we face four challenges: (1) identifying the residues at the binding interface that are common to an L-R family; (2) threading the query sequences onto the binding interface; (3) scoring the resulting threaded sequences in order to differentiate between binding and non-binding partners; and (4) evaluating the significance of the predicted interaction scores. Related Work. Many computational approaches have been applied to prediction of PPIs, such as: threading of structural complexes [3] and scoring them with statistical potentials [4]; correlated mutations [5-8]; and docking methods using physical force fields [9, 10]. However, the performance of all of these methods is highly dependent on the accuracy of the alignment to the structural template, and thus for distantly related proteins is more prone to errors. For example, the PPI predictor InterPreTS [11] cannot find a confident match for any of the sequences from the cytokine families that we consider. Integrative machine learning methods also have been applied to prediction of PPIs and networks [12, 13]. Many of these approaches rely on HTP experimental PPI data itself as a predictor and this information is scarce for L-R pairs.
Contributions. This paper proposes a novel threading algorithm, LTHREADER, which first incorporates secondary structure (SS) and relative solvent accessibility (RSA) predictions with residue contact maps to guide and constrain alignments. While existing threading algorithms (e.g. RAPTOR [3]) are not so successful at aligning interacting residues in sequences with low homology [15], LTHREADER achieves much higher accuracy (see Section 3.1). Given interaction data from gold-standard low-throughput experiments, LTHREADER predicts L-R interactions using statistical and energy scores. We apply our algorithm to the cytokines, performing significantly better than existing in silico methods (see Section 3). We investigate two structurally distinct cytokine families: 4-helical bundle cytokines and the TNF-like family belonging to the all-beta structural class. Cytokine interactions with receptors are particularly difficult to predict because they display a high level of structural similarity but almost no sequence similarity, preventing the effective use of simple homology-based methods or general threading techniques. Furthermore, little experimental interaction data exists for cytokine interactions, and the structures for only a few cytokine-receptor complexes have been determined. Therefore, accurate prediction of cytokine interactions is a good indicator of the success we can achieve with our algorithm. Finally, our method predicts previously undocumented cytokine interactions which may have implications for diseases. We evaluate the significance of our predictions by comparing them to those of randomized interaction surfaces.
2. Algorithm
Overview. LTHREADER threads two given protein sequences onto a representative template complex in order to determine and score the putative interaction surface. Our interaction prediction algorithm is divided into three stages (Figure 1). In the first stage (Figure 1, Stage 1), from the set (at least two) of template complexes, we determine the residues that are most likely to be involved with L-R binding. We do this by generating a multiple alignment of clusters of interacting residues from each complex and determining the positions that are most conserved. We build a generalized profile for each position in the alignment of interacting residues [16]. In the second stage (Stage 2), the profile is used to identify the most likely location of interacting residues in the query sequences. The locations of the interacting residues in the query sequences define the putative interaction surface. In the third stage, this surface is scored using several methods and an interaction prediction is made using a decision tree classifier (Stage 3). The significance of the classification is then evaluated by
67
estimating the probability of predicting an interaction between the L-R pair using a randomized interaction surface.
Figure 1: Schematic of LTHREADER. In Stage 3, CM is the compensatory mutation score, SP the statistical potential score, FF the force field score, and CR the conserved residue score.
2.1. Stage 1: Generation of Localized Profiles for Interaction Cores In this stage, we assume that if a set of ligands and receptors have similar structures and binding orientation, then their corresponding interface surfaces will have good alignment. We first examine the L-R pairs that have solved structures for their bound complex and align the ligand and receptor structures separately using POSA [17]. Then, clusters of interacting residues are identified within these complexes and mapped to their corresponding ligand and receptor sequences, thus delimiting core regions of interaction within each sequence. Given a set (minimum two) of complexes, the positions of the cores are then optimized to ensure that the locations of the interactions contained in the clusters overlap as much as possible between complexes. Finally, generalized profiles are computed for each residue in the core regions of all pairs of L-R sequences. Clustering of Residue Interactions. For two interacting domains in a complex structure we define the interface residues as those in contact with residues from the other domain. We define two residues to be in contact if the distance between any two of their heavy atoms is less than 4.5 Å. This cutoff is the same as that used by Lu et al. [4] to determine statistical potentials for contacting residues. We define a contact map as a matrix C such that c_ij = 1 if the i-th residue of the ligand and the j-th residue of the receptor interact, and c_ij = 0 if they do not. Given a contact map C, we group together clusters of interacting pairs (non-zero entries of C) by using a simple index-based distance function to determine inclusion. The distance between two interacting pairs {i1, j1} and {i2, j2} in C, where i1, j1 are the ligand and receptor indices respectively for the first interacting pair, and i2 and j2 for the second pair, is defined as follows:

dist({i1, j1}, {i2, j2}) = sqrt((i1 - i2)^2 + (j1 - j2)^2) / (c_{i1,j1} · c_{i2,j2}),

which indicates infinite distance when any two residues do not interact.
Interacting residue pairs that are separated by a distance, dist, of less than three are considered members of the same cluster. A cluster in contact map C implies a corresponding sub-matrix whose non-zero entries are members of that cluster. Note that cluster edges delimit a contiguous sequence stretch on both the ligand and receptor sequences, referred to as a core (see Figure 2). Thus we can define a notation for indexing a cluster by the index of its corresponding cores in the ligand and receptor. Given contact map C, we denote C^{k,l} as the sub-matrix containing the cluster indexed by the k-th core in the ligand and the l-th core in the receptor. The size and position of C^{k,l} within C can vary as long as the requirement that only one cluster can be contained within C^{k,l} is not violated.
Figure 2: An illustration of how ligand (red) and receptor (blue) cores are derived from a clustering of interactions within the interaction map (at right). The yellow dots correspond to interacting residues and the green dots in the interaction map indicate an interaction. A black line in the cartoon on the left denotes that an interaction occurs between the residues at its endpoints.
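A minimal sketch of the contact-map construction and cluster grouping described above follows; it is an illustration under stated assumptions, not the LTHREADER source. Residue contacts use the 4.5 Å heavy-atom cutoff, and interacting pairs are merged into clusters by single-linkage under the index-based distance, whose exact form is reconstructed above.

```python
import numpy as np

def contact_map(lig_residues, rec_residues, cutoff=4.5):
    """C[i, j] = 1 if any heavy atom of ligand residue i lies within 'cutoff'
    angstroms of any heavy atom of receptor residue j.
    lig_residues / rec_residues: lists of (n_heavy_atoms x 3) coordinate arrays."""
    C = np.zeros((len(lig_residues), len(rec_residues)), dtype=int)
    for i, la in enumerate(lig_residues):
        for j, ra in enumerate(rec_residues):
            dists = np.linalg.norm(la[:, None, :] - ra[None, :, :], axis=-1)
            C[i, j] = int(dists.min() < cutoff)
    return C

def cluster_contacts(C, max_dist=3.0):
    """Group interacting pairs (non-zero entries of C) into clusters: two pairs
    belong to the same cluster when their index distance is below max_dist
    (single-linkage merging)."""
    pairs = [tuple(p) for p in zip(*np.nonzero(C))]
    clusters = []
    for p in pairs:
        close = [c for c in clusters
                 if any(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 < max_dist
                        for q in c)]
        merged = [p] + [q for c in close for q in c]
        clusters = [c for c in clusters if c not in close] + [merged]
    return clusters
```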
Alignment of Clusters for a Pair of Ligand-Receptor Complexes. The next step of our algorithm optimizes the length and location of cores within a pair of L-R complexes so that the similarity score of corresponding clusters is maximized. Let C be the contact map for the first complex, and D be the contact map for the second complex. Let m be the number of cores in the ligands for both complexes, and n be the number of cores in the receptors for both complexes. Let C^{k,l} refer to the k,l-th cluster in C, and D^{k,l} refer to the corresponding k,l-th cluster in D. We set the height and width of both sub-matrices to the maximum of the height and width of each sub-matrix. (Note that this accounts for the rare case when two clusters in one complex map to a single larger cluster in another.) The precise alignment of the interaction cores is the goal of the following optimization procedure. For the k,l-th cluster we fix the starting position of C^{k,l}, but allow the starting position of D^{k,l} to vary. Let D^{k,l}_{p,q} be equal to D^{k,l} offset by p along the first dimension of D and offset by q along the second dimension. Our goal then is to maximize the objective function

f(p_1, ..., p_m, q_1, ..., q_n) = Σ_{k=1..m} Σ_{l=1..n} sim(C^{k,l}, D^{k,l}_{p_k,q_l})

subject to the following constraints: -4 ≤ p_1, ..., p_m ≤ 4 and -4 ≤ q_1, ..., q_n ≤ 4. sim(A, B) is the measure of similarity between matrices A and B (both of dimension m x n) and is defined by the sum of all entries in the Hadamard product of the two matrices: sim(A, B) = Σ_{i,j} a_ij · b_ij. Since there are only a few cores in the ligand and receptor (fewer than 5 in most cases), we use a brute-force iteration over all possible values of the offset variables p, q in order to maximize f. Multiple Alignment of Interaction Cores. The above method allows us to find the location of cores in the ligand and receptor sequences that maximizes the overlap of interacting residues between a pair of complexes. For more than two complexes in the training set, we extend the pairwise-alignment method in a way that optimizes their multiple alignment using a variant of the neighbor-joining method of Saitou and Nei [18]. At each step of the neighbor-joining procedure, we create a new contact matrix from the union of the Hadamard products of the contact matrices from each complex. The final result is a contact matrix representing the interaction surface common to all complexes. From the multiple alignment of core regions, we construct a generalized profile from relative solvent accessibility (RSA), secondary structure (SS) and sequence at each interaction core position. RSA and SS values are calculated using DSSP [19]. 2.2. Stage 2: Threading of Query Sequences onto the Template In this stage we determine which residues in the query sequence pair would be part of the putative interaction surface by threading their sequences onto a template complex. To do this, we devise a localized threading algorithm that aligns sequences to the generalized profile of the interaction cores. In order to reduce errors, we first limit the search space to the region in the query sequence most likely to contain the core. In the template structure, the interaction cores are localized to specific regions on the sequence delimited by the secondary structure (SS) elements: α-helices (H), β-strands (B) and loops (L). Aligning the predicted SS elements (using SABLE [20]) to the template structure elements identifies the likely positions of interaction cores. Specifically, the alignment of secondary structure tags, where tag = (HLHLBLB...) and the score for a match is 1 and for a mismatch -1, between the template and the predicted SS determines the position of the interaction cores in the query sequence. Next, we predict RSAs for residues in the query ligand and receptor sequences using SABLE. Finally, the generalized profile of the core calculated in the previous stage is used to search the query sequence using the predicted
RSAs and SSs [16]. The search is performed by sliding a window, of length equal to that of the core, along the query sequence. The position, p, at which the window best matches the profile defines the location of the putative core. We search for interaction cores (ICs) within five residues before and after a predicted SS element that contains the core to account for SS prediction errors. We define p_s and p_e to be the start and end position, respectively, of a predicted SS element within the query sequence. We compute p, the position of the predicted IC within the query sequence restricted to positions between p_s - 5 and p_e + 5, as follows:

p = argmax_{p in [p_s - 5, p_e + 5]} Σ_{i=1..N} [ (1/T) Σ_{t=1..T} SEQ(aa_{i+p}, aac^t_i) + δ(ss_{i+p}, ssc_i) - |sa_{i+p} - μ_i| / σ_i ]
where aa_{i+p} is the amino acid, ss_{i+p} is the predicted SS and sa_{i+p} is the RSA of the residue at position i+p in the query sequence. μ_i and σ_i are the mean and standard deviation, respectively, of the RSA at position i within the IC multiple alignment, ssc_i is the SS of the core position, and aac^t_i is the amino acid from the template complex structure t. δ is an indicator function for equality. N is the length of the IC multiple alignment profile and T is the total number of complex structures used as templates. For the sequence similarity matrix, SEQ, we use BLOSUM62 [21]. We have adopted the relative weights of different score contributions, sequence (SEQ) versus structure (SS and RSA), as previously determined by others [16]. 2.3. Stage 3: Scoring the Interaction Surface After the interaction surface is determined for the L-R complex, it is scored using the CM, SP, FF and CR algorithms. The scores are then normalized and integrated using a decision tree classifier. Each is described in detail below. Correlated Mutations (CM). In order to calculate this score, we first obtain a multiple sequence alignment (MSA) for each L-R sequence S_L, S_R from a set of orthologous species common to both the ligand and receptor. We then compute the Pearson correlation between positions i and j in S_L and S_R respectively, as in [7]. Since we are interested in evaluating the likelihood of interaction, we only sum the correlation scores over all pairs (i, j) within S_L and S_R that are within the putative interaction surface (based on the threading results from stage 2). We assign this sum to the score CM. Statistical Potentials (SP). For each residue pair located in the interaction surface, we assign a pairwise-potential energy from Lu et al. [4]. This energy is statistically derived from a set of known pairwise interactions between residues in solved structures. To compute the SP score, we sum the potentials corresponding to all interacting residue pairs.
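A minimal sketch of the correlated-mutation (CM) score over interface pairs is shown below. The paper follows [7] for how correlations are computed from alignment columns; the simple hydropathy encoding used here is purely an illustrative assumption, not the authors' method.

```python
import numpy as np

# Kyte-Doolittle hydropathy values, used only to turn alignment columns into numbers
# for this illustration; the paper's column encoding (following [7]) may differ.
HYDROPATHY = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
              "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
              "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
              "Y": -1.3, "V": 4.2}

def encode_column(msa, pos):
    """Numeric vector for alignment column 'pos' (one value per species)."""
    return np.array([HYDROPATHY.get(seq[pos], 0.0) for seq in msa])

def cm_score(lig_msa, rec_msa, interface_pairs):
    """Sum of Pearson correlations between ligand and receptor columns over all
    putative interface pairs (i, j); both MSAs must list the same species in
    the same order."""
    total = 0.0
    for i, j in interface_pairs:
        x, y = encode_column(lig_msa, i), encode_column(rec_msa, j)
        if x.std() > 0 and y.std() > 0:        # skip fully conserved columns
            total += np.corrcoef(x, y)[0, 1]
    return total
```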
AMBER Force-Fields (FF). This score is equal to the calculated energy of the putative interface surface within the threaded complex. We use the SCWRL 3.0 side-chain packing program [22] to first determine the coordinates for all the side-chain atoms in the ligand and receptor. Second, we fix atom positions for all residues that do not belong to the interface. Third, allowing flexibility of the interacting residues, we perform 20 steps of conjugate gradient minimization using the molecular dynamics package BALL [23]. The energy values typically reach a stable minimum after a few steps of minimization. As the last step we compute the energy, FF, of the interface surface by applying BALL's AMBER force-field function.

Conserved Residues (CR). This is a sequence-based scoring method for determining whether the conservation across species of the interacting residues in the threaded complex plays a predictive role. It is based on the assumption that residues that are contained within an interaction region are less likely to mutate than those outside of the region [24]. We compute the percentage conservation at each residue position within the ligand and receptor from an MSA. The percentages are averaged over all residues within the putative interaction surface and assigned to the final CR score.

Normalization of Scores. The examination of the raw scores of the interaction surface showed that for some receptors the scores are consistently high across all putative ligands. In order to put scores across all receptors and ligands on the same scale, we introduce the following formula to determine new normalized values for the scores. For each pair of ligand L and receptor R from the family we have the raw score S(L,R) calculated by one of the above methods, S ∈ {CM, SP, FF, CR}. The normalized scores are then given by:
||S(L,R)|| = S(L,R) / [ ( Σ_{R'} S(L,R') ) · ( Σ_{L'} S(L',R) ) ]

where the sums run over all receptors R' and all ligands L' in the family.
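A minimal sketch of this normalization, assuming the raw scores of one scoring method are collected in a ligands-by-receptors NumPy matrix; the matrix layout and function name are illustrative, not the authors' implementation:

```python
import numpy as np

def normalize_scores(S):
    """Normalize a raw score matrix S (rows = ligands, columns = receptors):
    each entry is divided by the product of its ligand-row sum and its
    receptor-column sum, putting all ligand-receptor pairs on a common scale."""
    row_sums = S.sum(axis=1, keepdims=True)   # sum over receptors R' for each ligand L
    col_sums = S.sum(axis=0, keepdims=True)   # sum over ligands L' for each receptor R
    return S / (row_sums * col_sums)
```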
Decision Tree Classifier. For classification purposes we associate with the pair L and R a vector of scores S_LR = (s1, ..., s4) that are generated from each of the scoring methods described in stage 3 (when applied to L and R). We then use experimentally determined positive and negative interactions to train a decision tree DT. DT is then used to classify each pair based on S_LR. We used decision trees because they provide a very intuitive understanding of the contributions and relative strengths of the different scoring variables used for prediction.

Significance of the Classifier Predictions. In order to estimate the significance of the predicted interaction for any L-R pair we have implemented the following probabilistic procedure. From all ligands and receptors within a family we create pools of ligand, P_L = ∪_{l ∈ family} r_l, and receptor, P_R = ∪_{r ∈ family} r_r, residues
where r_l and r_r belong to the putative binding interface as defined in Section 2.2. For each L-R pair we generate 100 randomized interaction surfaces by grafting onto the interaction cores amino acids randomly drawn from the pools P_L and P_R. We then score and classify them to determine f, the frequency at which the randomized surfaces are predicted to interact. 1 − f is the significance of the predicted interactions within the L-R family for the non-randomized surfaces.
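A minimal sketch of this significance estimate; `score_and_classify` is a placeholder standing in for stages 2-3 plus the trained decision tree, and the core and pool representations are illustrative rather than the original data structures:

```python
import random

def interaction_significance(ligand_core, receptor_core, ligand_pool, receptor_pool,
                             score_and_classify, n_trials=100):
    """Estimate significance of a predicted L-R interaction: graft residues drawn
    at random from the family-wide pools onto the interaction cores, re-score and
    re-classify each randomized surface, and return 1 - f, where f is the fraction
    of randomized surfaces predicted to interact."""
    hits = 0
    for _ in range(n_trials):
        rand_ligand = [random.choice(ligand_pool) for _ in ligand_core]
        rand_receptor = [random.choice(receptor_pool) for _ in receptor_core]
        if score_and_classify(rand_ligand, rand_receptor):
            hits += 1
    f = hits / n_trials
    return 1.0 - f
```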
3. Results
We applied LTHREADER to cytokine-receptor interactions belonging to all-alpha and all-beta structural families. There are over 100 cytokines and a comparable number of corresponding receptors identified in the human genome. The interactions among cytokines and their receptors play a central role in the etiology of many human diseases and have been the focus of many investigations [1]. Cytokines are a challenging test case for our algorithm due to their low level of sequence similarity and the unavailability of high-throughput PPI data.

Datasets. We chose a subset of cytokines that contained the most solved complexes and had substantial experimental interaction data: the hematopoietins from the SCOP family long-chain 4-helical bundle and TNF-like all-beta cytokines, and their corresponding receptor families. In the 4-helical bundle family we focused on a receptor binding site (site II) that is common to all cytokines and is the major determinant of binding. The 4-helical bundle cytokine data set includes 12 ligands and 7 receptors. Our set of template cytokine-receptor complexes consisted of the following structures from the Protein Data Bank (PDB) [25]: 1cd9, 1cn4, 1hwg, 1pvh, and 1p9m. Our gold-standard positive interaction set was obtained from the KEGG database (http://www.genome.ad.jp/kegg). The training set consisted of 12 positive interactions derived from low-throughput experiments and 14 putative negative interactions based on membership in different subfamilies. In the TNF-like family we focused on the 90's loop binding site on the receptor common to known structural complexes [26]. The TNF-like cytokine dataset includes 13 ligands and 21 receptors. Our template complexes consisted of the five PDB structures 1d0g, 1oqd, 1oqe, 1xu1 and 1xu2. The gold-standard positive and negative interaction set was taken from the results of the flow-cytometry assays reported in [27]. The training set consisted of 20 positive and 20 negative interactions determined experimentally. For both families, the set of non-interacting pairs was chosen from the same ligands and receptors as those in the set of known interacting pairs to ensure that the classifier distinguishes complementarity of the interfaces rather than their
composition. The detailed list of ligands and receptors in our data sets is available at the supplementary website.

3.1. Alignment of Interacting Residues
We applied LTHREADER (Sections 2.1 and 2.2) to the 4-helical bundle and TNF-like cytokine datasets. Due to the high sequence similarity and low loop length variability of the 4-helical bundle receptors, the main challenge in this case was accurately aligning the ligands. In the case of the TNF-like cytokines, identifying the location of interaction cores in the receptors was more difficult. When threading the low-similarity cytokine sequences onto the template, we achieved significantly better results with LTHREADER than RAPTOR, despite the fact that RAPTOR uses the same structural templates and SS and RSA information. Table 1 shows how successful each algorithm was at identifying the locations of interacting residues. We see that even with low sequence similarity (between 15 to 25%), LTHREADER performs well at identifying interacting residues while RAPTOR struggles. This was not surprising as RAPTOR's accuracy, like most general threaders, decreases as the sequence similarity to the template decreases [15].

Table 1: Comparison of threading accuracy between the RAPTOR and LTHREADER algorithms. We threaded L-R pairs onto other known template complexes and determined accuracy by the percentage of positively identified interactions out of all interacting pairs in the query complex.
Cytokine Family     | % similarity to templates | % accuracy (RAPTOR) | % accuracy (LTHREADER)
4-Helical Ligands   | 21                        | 35                  | 56
TNF-like Receptors  | 43                        | 35                  | 63
3.2. Prediction of Ligand-Receptor Interactions As expected, the combination of standalone scoring methods results in higher prediction accuracy than the individual scoring methods, even when the latter are given correct alignments of the interaction surface. In order to measure the improvement of the integrated solution over the individual scoring methods, we compare the sensitivity and specificity of each one to that of the integrated solution using 1-fold cross-validation (see Table 2). While the integrated solution had comparable specificity to the single-score-based methods, it had higher sensitivity for the 4-helical bundle and TNF-like cytokines (75% and 70% respectively). For 4-helical bundles, the predicted interactions have a significance of 0.62 and for TNF-like cytokines, 0.81, also higher than standalone methods. Individual L-R scores are available at the supplementary website.
We verified that normalizing scores for the interaction surface greatly improved the performance of the method for both the individual and the combined scores (see supplementary website). Table 2: Comparison of single and combined scoring methods using 1-fold cross validation on experimentally confirmed binding and non-binding pairs of ligands and receptors. See Section 2.3 for definitions of the CM, SP, FF and CR scoring methods. Cytokine Family Scoring Function FF CM SP CR Combined 4-Helical Bundles 50 Sensitivity (%) 58 67 33 75 50 64 Specificity (%) 93 100 86 0.32 0.45 0.62 Significance 0.40 0.55 TNF-Like 30 55 Sensitivity (%) 10 30 70 Specificity (%) 35 30 35 70 55 Significance 0.64 0.28 0.46 0.81 0.35
3.3. Novel Predictions
LTHREADER identified previously unidentified cytokine-receptor pairs as likely binding partners. These are osm-lepr, il6-ghr, epo-csf3r, cntf-lepr, lif-prlr out of the 4-helical bundle family and tnfsf1-tnfrsf11a, tnfsf1-tnfrsf11b, tnfsf4-tnfrsf6b, tnfsf4-tnfrsf12a, tnfsf10-tnfrsf1a, tnfsf10-tnfrsf1b, tnfsf13-tnfrsf6b, tnfsf15-tnfrsf1b out of the TNF-like family (see supplementary website for abbreviations).

4. Conclusions
We have shown that more accurate localized threading, combined with the integration of several existing methods for L-R interaction prediction, can greatly improve accuracy. The strength of our method comes, in part, from leveraging a novel threading algorithm that circumvents low sequence similarity. By integrating the high-quality threading results with various kinds of statistical and physical interaction-prediction methods we can achieve high prediction accuracy and statistical significance. However, the success of our approach depends on the availability of structural templates and orthologous sequences. This method helps fill a void in predicting traditionally challenging L-R interactions. We hope to further improve the prediction accuracy with a new scoring function that utilizes randomized surfaces to better separate signal from noise. Given the improved alignments, we hope that LTHREADER will enhance structure-based PPI predictors [13] by refining the homology models of the interaction regions. We are currently applying LTHREADER to other L-R families and to PPIs in general, and will make the program available to the broader community.
Acknowledgements. Thanks to Andrew Macdonnell, Rohit Singh, and Jinbo Xu for helpful discussions and computational assistance.

References
1. Pestka, S., C.D. Krause, and M.R. Walter, Interferons, interferon-like cytokines, and their receptors. Immunol Rev, 2004. 202: p. 8-32.
2. Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-40.
3. Xu, J., et al., RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol, 2003. 1(1): p. 95-117.
4. Lu, H., L. Lu, and J. Skolnick, Development of unified statistical potentials describing protein-protein interactions. Biophys J, 2003. 84(3): p. 1895-901.
5. Goh, C.S., et al., Co-evolution of proteins with their interaction partners. J Mol Biol, 2000. 299(2): p. 283-93.
6. Olmea, O., B. Rost, and A. Valencia, Effective use of sequence correlation and conservation in fold recognition. J Mol Biol, 1999. 293(5): p. 1221-39.
7. Pazos, F., et al., Correlated mutations contain information about protein-protein interaction. J Mol Biol, 1997. 271(4): p. 511-23.
8. Tan, S.H., Z. Zhang, and S.K. Ng, ADVICE. Nucleic Acids Res, 2004. 32: p. W69-72.
9. Janin, J., Assessing predictions of protein-protein interaction: the CAPRI experiment. Protein Sci, 2005. 14(2): p. 278-83.
10. Summa, C.M., M. Levitt, and W.F. Degrado, An Atomic Environment Potential for use in Protein Structure Prediction. J Mol Biol, 2005. 352(4): p. 986-1001.
11. Aloy, P. and R.B. Russell, InterPreTS: protein interaction prediction through tertiary structure. Bioinformatics, 2003. 19(1): p. 161-2.
12. Qi, Y., J. Klein-Seetharaman, and Z. Bar-Joseph, Random forest similarity for protein-protein interaction prediction from multiple sources. Pac Symp Biocomput, 2005: p. 531-42.
13. Singh, R., J. Xu, and B. Berger, Struct2Net: Integrating Structure into Protein-Protein Interaction Prediction. Pac Symp Biocomput, 2006: p. 403-414.
14. Lin, N., et al., Information assessment on predicting protein-protein interactions. BMC Bioinformatics, 2004. 5: p. 154.
15. Xu, J., et al., Protein threading by linear programming. Pac Symp Biocomput, 2003: p. 264-75.
16. Przybylski, D. and B. Rost, Improving fold recognition... J Mol Biol, 2004. 341(1): p. 255-69.
17. Ye, Y. and A. Godzik, Multiple flexible structure alignment using partial order graphs. Bioinformatics, 2005. 21(10): p. 2362-9.
18. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
19. Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers, 1983. 22: p. 2577-2637.
20. Adamczak, R., A. Porollo, and J. Meller, Combining prediction of secondary structure and solvent accessibility in proteins. Proteins, 2005. 59(3): p. 467-75.
21. Henikoff, S. and J.G. Henikoff, Performance evaluation of amino acid substitution matrices. Proteins, 1993. 17(1): p. 49-61.
22. Canutescu, A.A., A.A. Shelenkov, and R.L. Dunbrack, Jr., A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci, 2003. 12(9): p. 2001-14.
23. Kohlbacher, O. and H.P. Lenhof, BALL. Bioinformatics, 2000. 16(9): p. 815-24.
24. Caffrey, D.R., et al., Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci, 2004. 13(1): p. 190-202.
25. Berman, H.M., et al., The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 2002. 58(Pt 6 No 1): p. 899-907.
26. Hymowitz, S.G., et al., Triggering cell death: the crystal structure of Apo2L/TRAIL in a complex with death receptor 5. Mol Cell, 1999. 4(4): p. 563-71.
27. Bossen, C., et al., Interactions of tumor necrosis factor (TNF) and TNF receptor family members in the mouse and human. J Biol Chem, 2006. 281(20): p. 13964-71.
DISCOVERY OF PROTEIN INTERACTION NETWORKS SHARED BY DISEASES*
LEE SAM§,1, YANG LIU§,1, JIANRONG LI§,1, CAROL FRIEDMAN*,2, YVES A. LUSSIER*,1,2
1- Center for Biomedical Informatics and Section of Genetic Medicine, Dept. of Medicine; The University of Chicago, IL, 60637
2- Department of Biomedical Informatics; Columbia University, NY, NY 10032
The study of protein-protein interactions is essential to define the molecular networks that contribute to maintaining homeostasis of an organism's body functions. Disruptions in protein interaction networks have been shown to result in diseases in both humans and animals. Monogenic diseases disrupting biochemical pathways, such as hereditary coagulopathies (e.g. hemophilia), have provided deep insight into the biochemical pathways of acquired coagulopathies of complex diseases. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia. Similarly, more complex diseases such as different cancers have been shown to result from malfunctions of common protein pathways. In order to discover, in high throughput, the molecular underpinnings of poorly characterized diseases, we present a statistical method to identify shared protein interaction network(s) between diseases. Integrating (i) a protein interaction network with (ii) disease-to-protein relationships derived from mining Gene Ontology annotations and the biomedical literature with natural language understanding (PhenoGO), we identified protein-protein interactions that were associated with pairs of diseases and calculated the statistical significance of the occurrence of interactions in the protein interaction knowledgebase. Significant correlations between diseases and shared protein networks were identified and evaluated in this study, demonstrating the high precision of the approach and correct non-trivial predictions, signifying the potential for discovery. In conclusion, we demonstrate that the associations between diseases are directly correlated to their underlying protein-protein interaction networks, possibly providing insight into the underlying molecular mechanisms of phenotypes and biological processes disrupted in related diseases.
1. Introduction and Related Work
Currently, common diseases are mainly defined by their clinical appearance, with little reference to their molecular mechanism. For example, syndromes are defined in medicine as a set of phenotypes which, occurring together, serve to define a trait or disease. These phenotypes overlap in the case of many syndromes. This overlap brought about the concept of 'syndrome families' through consideration of the commonality of features shared between diseases [1]. Conceptually, what we have learned about 2000 human single gene diseases with a defined genetic phenotype is that each monogenic disease has a specified collection of specific phenotypic features. For example, hemophilias with
* Corresponding authors
§ These authors have contributed equally to the work
# This study was supported in part by NIH/NLM grants 1K22 LM008308-01, R01 LM007659, R01 LM008635, and the National Center for the Multi-scale Analysis of Genomic and Cellular Networks (U54CA12185201A1)
deficiencies in coagulation factors, otherwise called hereditary coagulopathies, are single gene diseases with clear Mendelian inheritance that have provided significant insight into the biochemical pathways of acquired coagulopathies. Indeed, a variety of complex liver diseases can lead to decreased synthesis of the same set of coagulation factors as in hemophilia, leading to the same disease phenotype despite very different causes. In some cases, the clustering of syndromes into these families in combination with genetic insights has led to the discovery that what were often thought to be two different disorders were really variable expressions of the same disorder [2-4]. Conversely, it has long been known that mutations at different loci can lead to the same genetic disease [5]. It has also been hypothesized that this genetic heterogeneity has its roots at the protein interaction level, suggesting that other genes associated with the phenotype also have some functional role [6]. Therefore, it is plausible to theorize that phenotypic overlap of diseases may reflect, at multiple biological scales, the relationships and functional properties of shared underlying molecular networks. As signal transduction pathways are less understood than biochemical pathways, protein-protein interaction networks provide unique opportunities for exploring the signaling pathways of diseases. The shift in focus to systems biology has resulted in an increased interest in biological pathways and protein-protein interaction networks. As a result, large-scale knowledge bases representing them are being rapidly developed [7-14]. These resources enable us to study complex biological problems using high throughput computational tools. While there is a wealth of protein-disease relationships in the published literature and a number of readily computable protein-protein interaction resources, there has been a paucity of work relating diseases using protein interactions from this kind of knowledge. Making use of these networks is a relatively new challenge in the field. Network-based analyses have been developed with a number of goals in mind [15], including protein function prediction [16], identification of functional modules [17], interaction prediction [18-21], and the study of network structure and evolution [22-26]. To explore the possibility of using protein-protein interaction networks to identify correlations between diseases, we hypothesize that protein-protein interactions shared by two or more diseases can be accurately identified in a protein interaction network by integrating knowledge from the literature and using statistical methods.

1.1 Related Work
The method reported in this paper utilizes the PhenoGO database [25][www.phenoGO.org], which provides protein-GO-phenotype relations, and the human-curated Reactome knowledgebase [8], which provides protein interactions, to link protein-protein interactions with diseases. The recently developed PhenoGO database provides phenotypic context to protein-GO annotations; for example, lymphoid tissue (a phenotype) is linked to interleukin 2 receptor (a protein) and interleukin 2 receptor
activity (a GO concept). It augments Gene Ontology (GO) annotations [27] by extracting protein-GO-phenotype relations from the literature using MeSH terms [28] and a natural language processing (NLP) system, BioMedLEE, combined with the PhenOS phenotype ontology organizer system. The phenotypic information, including diseases, tissues, and organs, is encoded into Unified Medical Language System (UMLS) codes as well as other ontological coding systems. PhenoGO was evaluated for anatomical and cellular context in mice, demonstrating a recall of 92% and a precision of 91% [29]. PhenoGO has since been extended to comprise over 523,000 unique entries associating disease phenotypes, ontological concepts, and proteins. In total, PhenoGO now contains data from 8,509 distinct PubMed articles, representing 7,016 distinct proteins classified under 3,214 distinct GO concepts in 3,102 distinct diseases. From a random sample of 120 protein-disease-GO ternary annotations, precision was estimated at 85%, and recall at 76% [unpublished result].
[Figure 1 flowchart: PhenoGO data filter (SQL; homo sapiens subset, >4 associated genes only) -> Reactome filter (SQL; 'Direct Complex' and 'Reaction' types only) -> MySQL join: generate gene pairs for each disease pair (M) and identify common gene pairs (m) through repeated join operations -> Perl hypergeometric calculator -> correlated disease pairs.]
Figure 1. Method to correlate human diseases based on their underlying protein interactions. M and m refer to parameters of the hypergeometric calculations as described in equation 1.
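The pipeline in Figure 1 can be sketched end to end as follows. This is only an illustration, assuming the PhenoGO and Reactome extracts have been loaded into pandas DataFrames with the column names shown; the original implementation used SQL filters, MySQL joins and Perl as indicated in the figure:

```python
import pandas as pd
from itertools import combinations

# Assumed inputs (column names are illustrative, not the actual schemas):
#   phenogo:  DataFrame ['disease', 'protein']      (human subset, >4 proteins per disease)
#   reactome: DataFrame ['protein_a', 'protein_b']  (direct complex / reaction pairs only)

def shared_pairs(phenogo, reactome):
    """For every pair of diseases, count protein pairs (one protein from each
    disease) that are either identical or interact directly in Reactome."""
    interactions = {frozenset(p) for p in
                    reactome[['protein_a', 'protein_b']].itertuples(index=False)}
    by_disease = phenogo.groupby('disease')['protein'].apply(set)
    counts = {}
    for d1, d2 in combinations(by_disease.index, 2):
        shared = len(by_disease[d1] & by_disease[d2])          # identity relationships
        interacting = sum(1 for a in by_disease[d1] for b in by_disease[d2]
                          if a != b and frozenset((a, b)) in interactions)
        counts[(d1, d2)] = shared + interacting
    return counts
```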
2. Methods
In order to identify associations between diseases by mapping their respective protein interaction networks with statistical significance values, we took the following steps. An overview of the process is pictured in Figure 1. Extraction of human protein-disease relationships was achieved through Structured Query Language (SQL) querying of the PhenoGO database. We extracted all UMLS-coded diseases classified under the "Disease" semantic type hierarchy along with their associated proteins. In this study, we chose to be conservative and only extracted diseases associated with more than 4 proteins, both to avoid errors stemming from mis-assignment in PhenoGO and to reduce spurious predictions in the next step from the
hypergeometric distribution, because a single error has a proportionally larger statistical impact on a smaller sample of proteins in the statistical method that follows (equation 1). These UMLS-coded terms fall under the UMLS semantic types 'Congenital Abnormality', 'Disease or Syndrome', 'Experimental Model of Disease', 'Anatomical Abnormality', and 'Neoplastic Process'. The resultant set consists of 154 diseases and their 1,931 associated proteins (http://phenos.bsd.uchicago.edu/PSB2007/).

Integration and Discovery. The second step is to correlate diseases with their underlying protein-protein interaction networks using a statistical approach. In this study, we used the Reactome protein interaction dataset [8] to define the underlying topological networks associated with these diseases. The common proteins between disease-associated proteins in PhenoGO and proteins in the Reactome were identified by using the identifiers in UniProt [30]. The Reactome data set defines four distinct types of reactions: 1) neighboring reactions, which define interactions that occur consecutively; 2) indirect complexes, which define interactions that involve subcomplex interaction, but not direct binding/interaction; 3) direct complexes, defining protein-protein complexes; and 4) reactions, representing situations where the two proteins participate in the same reaction [8]. The Reactome dataset was normalized to a set of paired Swiss-Prot accession numbers, and filtered to remove neighboring reactions and indirect complexes, leaving only entries for binary interactions and direct complexes. This data set contains 20,317 distinct interactions corresponding to 1,140 distinct proteins. From the 154 diseases, we generated combinations of pairs of diseases, and for each pair of diseases, proteins in both diseases were also paired for all potential combinations. These protein pairs were then cross-referenced with our filtered Reactome data set to determine if they participated in reactions or formed direct complexes with one another. There are two basic types of relationships used in calculations in our methods. These relationships correspond to the two scenarios we considered to determine whether two diseases share interaction networks: 1) an identity relationship where common proteins are shared by two diseases, and 2) direct interactions between protein A in one disease and protein B in the other disease. As related diseases can share both types of relations, and due to the requirements of the hypergeometric distribution, we consider both in the underlying protein-protein interaction network in diseases. Based on this, we calculated the correlations between all possible pairs of diseases by applying the hypergeometric distribution function to identify significantly correlated diseases (equation 1), with adjustments for multiple a posteriori comparisons (equation 2), as shown below:

P(i >= m | N, M, n) = Σ_{i=m}^{n} [ C(M, i) · C(N−M, n−i) / C(N, n) ]    (Equation 1)

where C(a, b) denotes the binomial coefficient "a choose b". In equation 1, 'N' represents the total number of all pair combinations between proteins of any two diseases in the experiment that includes the possibility of sharing the
same proteins (identical protein pair between two diseases); 'M' is the sum of the number of observed distinct pairs of interacting proteins that exist in the Reactome database for all the diseases in the experiment (direct interactions only); 'n' is the putative total number of pairs of proteins that could exist in a pair of diseases; and 'm' is the sum of the observed number of common proteins shared between two specific diseases and the number of distinct pairs of interacting proteins observed in the Reactome database for these two specific diseases (M ∩ n). This measure gives a p-value which is then adjusted for multiple comparisons with the Dunn-Sidak method (a derivative of the Bonferroni method):

p' = 1 − (1 − p)^r    (Equation 2)
In equation 2, p' and p represent the corrected and uncorrected p-values, respectively, and r represents the number of independent comparisons, which is the number of disease pairs (r = 11,703) used in the study. These corrected p-values are then thresholded at p < 0.05 to determine the final set of significantly correlated disease-disease relationships. Multiple diseases and genes sharing the same PubMed IDs can artificially boost the statistical significance of these disease pairs; therefore, relationships mapping to more than 2 overlapping PubMed IDs were removed to reduce this artifact. A total of 11,703 disease pairs passed the filter out of 11,780 candidates. 77 combinations had more than two PMID overlaps and were filtered out as a result of this process. An example of values used for the calculation is described in the results section.

Evaluations. Two evaluations were conducted. The first one, a quantitative evaluation, was designed to control for the error rate in either assigning a protein-disease relationship in PhenoGO or a protein-protein interaction in Reactome. It consisted of establishing the reliability of the predictions when noise was introduced into the integrated database network (10% more protein-protein interactions in the same set of diseases), as sketched below. The second one, a qualitative evaluation, consisted of carefully examining the discovered protein interactions shared by two diseases and identifying references in the scientific literature that validate the phenotypic overlap and potentially the protein interactions.
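A minimal sketch of the noise injection used in the quantitative evaluation, assuming the interaction network is stored as a set of unordered protein pairs; the function name and data layout are illustrative:

```python
import random

def add_noise(interactions, proteins, fraction=0.10, seed=0):
    """Return a copy of the interaction set with `fraction` extra random
    'false positive' edges between proteins (proteins: list of identifiers).
    The analysis is then re-run on the noisy network to check robustness."""
    rng = random.Random(seed)
    noisy = set(interactions)
    n_extra = int(round(fraction * len(interactions)))
    while len(noisy) < len(interactions) + n_extra:
        a, b = rng.sample(proteins, 2)
        noisy.add(frozenset((a, b)))   # skip duplicates automatically
    return noisy
```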
[Figure 2 bar chart: "Distribution of potential protein interactions in pairs of diseases"; x-axis: potential protein pairs, binned as 1-5, 6-10, 11-15, 16-20, 21-30, 31-50, 51+; y-axis: number of disease pairs (0 to ~7000).]
Figure 2. Distribution of the number of disease pairs from PhenoGO according to the number of possible protein interactions observed between their proteins in the Reactome.
3. Results
In this study, we examined a subset of PhenoGO pertaining to human diseases in order to identify relationships between these diseases according to the criteria described in the methods. This filtering resulted in a set of 154 diseases and their 1,931 associated proteins. The intersection between the proteins of the Reactome and those of PhenoGO further reduced the set of proteins to 286. The number of candidate proteins per disease was greatly reduced by the need to be present in the Reactome dataset, and therefore the totals are smaller than observed in the PhenoGO database alone. We lose approximately 70% of the proteins in this process due to the limited content of the Reactome. In order to identify relationships between these diseases, we analyzed their underlying protein-protein interaction maps by applying a statistical method (details of equations in the Methods section). Of the 154 selected diseases, there are (285*286/2 + 286) = 41,041 distinct combinations of protein pairs and identical protein overlaps (term N, equation 1) possible for all possible disease pairs, of which only 4,857 exist in the Reactome (term M, equation 1). Figure 2 summarizes the distribution of protein-protein pairs per combination of diseases in our set. In ~60% of the 11,703 disease pairs under consideration, the number of potential protein-protein interactions is five or less (no significant predictions came from this category), and about 40% of them have more than five interactions. We then proceeded to calculate the correlation between groups of pairs of interacting proteins associated with every pair of diseases according to equations 1 and 2 (file available at http://phenos.bsd.uchicago.edu/PSB2007/). Based on the correlations of the shared protein interacting pairs between diseases, we identified 10 pairs of diseases that are significantly correlated due to their shared proteins and protein-protein interactions out of the 11,703 disease pairs examined in this study (Table 1).

Quantitative Evaluation. We added 2,031 "false positive" interactions between random nodes in the network to evaluate the robustness of the method to 10% noise in the network. We found that even with the introduced noise, none of the p-values in the top 10 entries changed. We also attempted adding 10% noise (46 "false" interactions) in just the 286 proteins under study, which changed the p-values of the top 10 entries, but left their rank order relatively intact (results available at http://phenos.bsd.uchicago.edu/PSB2007/).

Qualitative evaluation. The top ranked disease pairs are shown in Table 1, all of which have a significant adjusted p-value of less than 5%. The last column of Table 1 provides strong scientific evidence in support of the predictions. We have manually examined all the significant disease pairs, and confirmed their correlations in the literature, demonstrating that our method can successfully predict non-trivial correlations between different diseases. Among these pairs of diseases, Cockayne Syndrome (CS) and Xeroderma Pigmentosum (XP) provide a very interesting example of how two diseases are correlated through their protein-protein interaction networks. Xeroderma Pigmentosum is a disorder conferring susceptibility of the skin to ultraviolet radiation,
[Figure 3 diagram (protein interactions between Xeroderma Pigmentosum and Cockayne Syndrome). Proteins shown include: cellular tumor antigen p53 (P04637), TFIIH basal transcription factor complex helicase subunits (P18074, P19447), DNA excision repair proteins ERCC-5 (P28715), ERCC-6 (Q03468) and ERCC-8 (Q13216), xeroderma pigmentosum group A- and group C-complementing proteins (P23025, Q01631), DNA repair protein RAD51 homolog 1, DNA damage-binding proteins 1 and 2 (Q16531, Q92466), and DNA repair endonuclease XPF (Q92889).]
Figure 3: Protein interactions between Xeroderma Pigmentosum and Cockayne Syndrome. Shared proteins between the two diseases are represented (top), proteins in the Cockayne syndrome (right), and those in the Xeroderma Pigmentosum (left). Protein interactions between the two diseases are linked by lines.
Table 1. Top ranked significantly correlated diseases. Columns: UMLS ID, Disease 1; UMLS ID, Disease 2; P-PI (#); p-value; Corrected p-value; Ref.
C0009207
Cockayne Syndrome
C0043346
Xeroderma Pigmentosum
38
7.3e-22
8.5e-18
[31]
C0043346
Xeroderma Pigmentosum
C0085390
Li-Fraumeni Syndrome
24
6.7e-11
4.9e-06
[32]
C0007001
Carbohydrate Metabolism, Inborn Errors
C0009404
Colorectal Neoplasms
C0950123
Genetic Diseases, Inborn
5.0e-05
[33]
C0085390
Li-Fraumeni Syndrome
C0009207
Cockayne Syndrome
C0009404
Colorectal Neoplasms
C0015625
C0009404
Colorectal Neoplasms
C0024141 C0024314 C0024314
Lupus Erythematosus, Systemic (LES) Lymphoproliferative Disorders (LD) LD
Amino Acid Metabolism,
Fanconi's Anemia
[9]
Polycystic Kidney, Autosomal Dominant
[21]
C0004364
Autoimmune Diseases
4
9.3e-05
9.9e-01
*
C0004364
Autoimmune Diseases
6
1.3e-04
9.9e-01
[34]
6
1.3e-04
9.9e-01
[35]
C0024141
" self-evident relations between disease pairs.
LES
Ref = references confirming the predictions; P-PI (#) = number of protein-protein interactions.
due to deficiencies in one of the XPA-XPG complementation group genes involved in nucleotide excision repair [36]. Similarly, Cockayne Syndrome involves deficiencies in the transcription-coupled repair genes ERCC6 and ERCC8, leading to a number of conditions including abnormal sensitivity to sunlight. As shown in Figure 3, there are 27 direct protein-protein interactions and 5 common proteins (term m = 27+5, equation 1) that are shared by these two diseases. A total of 66 potential combinations of protein-protein interaction pairs (term n, equation 1) can be formed between the 11 proteins of XP and the 6 proteins of CS. As shown in Figure 3 and described in Table 2, we find that most proteins in the common networks between the two diseases are related to DNA repair processes, namely Global Genomic Nucleotide Excision Repair (NER) and Transcription-coupled NER. The Global Genomic NER repairs lesions from non-transcribed regions of the genome, a process independent of transcription, and the Transcription-coupled NER repairs UV-induced damage in the transcribed strands of active genes. Both Cockayne syndrome and Xeroderma Pigmentosum are associated with these processes, suggesting that defects in the repair of DNA damage are the cause of the diseases, as indicated in the literature [36]. Our computational approach allows us to quickly identify the shared networks between these two diseases, demonstrating that the method we used is able to identify the underlying molecular basis shared by these diseases. In some cases, disruptions in any of the proteins or genes lying on a pathway can lead to a disease phenotype. This is the case with both Xeroderma Pigmentosum and Cockayne syndrome. At a higher classification level, these two diseases are a result of deficiencies in the DNA repair pathway, a class also shared with Li-Fraumeni Syndrome [37]. Though these three single gene diseases have a known initial molecular cause, how this cause is related to DNA repair pathways and whether the diseases share the same pathway or related disjoint pathways may be poorly understood. In another example, Fanconi's Anemia (FA) is a hereditary DNA-repair deficiency characterized by hypersensitivity to DNA damaging agents. This disorder is caused by a mutation in any one of the genes in the Fanconi's Anemia complementation group: FANCA, FANCB, FANCC, FANCD1, FANCD2, FANCE, FANCF, FANCG, FANCJ, FANCL, or FANCM [38-40]. Its phenotype is complex and includes anemia, several congenital malformations, and a strong predisposition to cancers [38, 39]. Kutler et al. (2003) analyzed clinical data from 754 FA patients from North America enrolled in the International Fanconi Anemia Registry, of whom 173 (23%) had a total of 199 neoplasms (28 distinct types of cancers) [9]. Among 14 potential protein interactions between Fanconi's Anemia and Colorectal Neoplasms, 8 were found to exist in the Reactome. An evaluation of the relationship between the generality of a disease class (based on graph-theoretic distance from the "MeSH Descriptor" node in the UMLS) and the number of proteins annotated to it found no correlation (available at http://phenos.bsd.uchicago.edu/PSB2007/).
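To make Equations 1 and 2 concrete, the calculation can be sketched with SciPy's hypergeometric distribution, plugging in the counts quoted above for the XP-CS pair (N = 41,041; M = 4,857; n = 66; m = 32; r = 11,703). Whether this exactly reproduces the p-values reported in Table 1 depends on the precise counts the authors used; the sketch only illustrates the form of the calculation:

```python
import math
from scipy.stats import hypergeom

def disease_pair_pvalue(N, M, n, m, r):
    """Equation 1: upper tail P(X >= m) of a hypergeometric distribution with
    population size N, M 'successes' (interacting pairs in Reactome) and n draws
    (candidate pairs for this disease pair).  Equation 2: Dunn-Sidak correction
    over r disease-pair comparisons, written in a numerically stable form."""
    p = hypergeom.sf(m - 1, N, M, n)                  # P(X >= m)
    p_corrected = -math.expm1(r * math.log1p(-p))     # 1 - (1 - p)**r
    return p, p_corrected

# Counts quoted in the text for the Xeroderma Pigmentosum / Cockayne Syndrome pair:
p, p_adj = disease_pair_pvalue(N=41041, M=4857, n=66, m=32, r=11703)
```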
4. Discussion
The protein-protein interaction network constructed by the Reactome dataset provides us with a framework for structuring the knowledge of human diseases, which enables an objective approach to examine the molecular underpinnings of diseases in the context of their known molecular interactions on a genomic scale. This method not only allows us to conduct high-throughput computational analysis of the relations between diseases, but also reveals the underlying molecular relationships between diseases. Furthermore, new relationships between well-known diseases and new diseases could be revealed based on their overlapping molecular networks. Although many diseases have been associated with their genetic and proteomic underpinnings, little research has been focused on bridging the gap between protein interactions and the relationships between diseases. Phenotype clustering methods achieve this to some extent. For example, Brunner and van Driel used a text mining approach based on MeSH terms as keywords over the OMIM database [6] to cluster similar disease phenotypes. Our implementation of the hypergeometric distribution significantly differs from its common use in bioinformatics. Other authors have used this distribution in large scale gene expression studies to identify "over-represented" gene classes (e.g. Gene Ontology classes) and find systemic patterns [41]. This classical implementation would be efficient in recognizing overlapping proteins or proteins sharing annotated pathways in GO, but would not recognize novel protein interactions based on newly discovered or predicted protein interactions. In contrast, we focused on protein interactions and thus counted the protein pairs rather than the genes' assignments to categories. The proposed analytical approach could scale up in two ways. First, we could extend it to proteins interacting indirectly through a pathway rather than directly interacting in the Reactome (through additional join operations in the database in order to determine those interacting with one or more intermediate proteins in pathways). In doing this, the Bonferroni-type adjustment would have to be replaced with a data-derived control for multiple comparisons, such as bootstrap or permutation resampling, in order to interpret the results. A second, probably more useful way in which this analysis can scale up is its use with the rapidly expanding number of protein-interaction databases, many of which are not publicly available. The subset of the PhenoGO database used in this study can readily be reused in a similar manner over another protein interaction database containing more genes and provide other specialized predictions.

Limitations. One question about the use of this technique is its reliability when conditions change. Since we used well established statistics and one of the most severe multiple comparison criteria for controlling for false predictions, we believe this method is robust. As this technique relies on integrating accurate protein-protein interactions with accurate gene-disease associations, and both of these datasets likely contained at least 10% false positive relationships, we conducted an evaluation adding false relationships in
the network and confirmed that the identified disease pairs sharing protein networks were reliable in spite of the noise. Nonetheless, this approach remains limited by the quality of the underlying protein networks and the accuracy of protein-disease mapping. Currently, the protein-protein interaction network is still at an early developmental stage. In this study, we extracted 1,931 proteins from 154 diseases, of which only 288 proteins exist in the Reactome dataset that contains 1,140 proteins. Therefore, the interaction network we used to correlate relationships between diseases is relatively small. Certainly, as bioinformatics databases become larger and more accurate, this discovery method could become a valuable tool to identify relationships between diseases.

Future studies. We intend to explore a permutation-based resampling in order to unveil additional valid relationships. A resampling-based approach would help determine the optimal relationship between quantity and quality in the dataset. We also plan to significantly extend the protein-disease associations by mining additional genetic datasets. Besides using the Reactome, we also demonstrated we could use DIP [7], although it is smaller than Reactome [results not shown]. Since the UMLS is used to encode the diseases, we plan to compare related diseases and their associated protein-protein interactions in order to establish the molecular basis of disease relationships in ontologies.

5. Conclusion
We developed and evaluated an automatic system to predict protein interactions shared by two or more diseases. It augments current protein interaction networks by integrating literature-based knowledge of protein-disease associations and systematically identifying the statistically significant Protein Interactions of Diseases (PID). Results demonstrated that the PID system provides accurate predictions and is scalable in a number of dimensions: (i) it enables high throughput predictions, and (ii) it scales across different protein-interaction datasets. Beyond direct protein-protein interactions, it also provides the theoretical framework to compare shared pathways between diseases. In the future, this framework could be applied to more complex diseases to determine if their shared phenotypes are a result of shared molecular mechanisms and pathways.

References
1. Pinsky, L., The polythetic (phenotypic community) system of classifying human malformation syndromes. Birth Defects Orig Artic Ser, 1977. 13(3A): p. 13-30.
2. Bertola, D.R., C.A. Kim, L.M. Albano, H. Scheffer, R. Meijer and H. van Bokhoven, Molecular evidence that AEC syndrome and Rapp-Hodgkin syndrome are variable expression of a single genetic disorder. Clin Genet, 2004. 66(1): p. 79-80.
3. Sorasio, L., G.B. Ferrero, E. Garelli, G. Brunello, C. Martano, A. Carando, et al. Eur J Med Genet, 2006.
4. Zenteno, J.C., C. Venegas and S. Kofman-Alfaro, Evidence that AEC syndrome and Bowen-Armstrong syndrome are variable expressions of the same disease. Pediatr Dermatol, 1999. 16(2): p. 103-7.
5. Morton, N.E., Am J Hum Genet, 1956. 8(2): p. 80-96.
6. Brunner, H.G. and M.A. van Driel, Nat Rev Genet, 2004. 5(7): p. 545-51.
7. Xenarios, I., D.W. Rice, L. Salwinski, M.K. Baron, E.M. Marcotte and D. Eisenberg, Nucleic Acids Res, 2000. 28(1): p. 289-91.
8. Joshi-Tope, G., M. Gillespie, I. Vastrik, P. D'Eustachio, E. Schmidt, B. de Bono, et al., Reactome: a knowledgebase of biological pathways. Nucleic Acids Res, 2005. 33(Database issue): p. D428-32.
9. Kutler, D.I., B. Singh, J. Satagopan, S.D. Batish, M. Berwick, P.F. Giampietro, et al., A 20-year perspective on the International Fanconi Anemia Registry (IFAR). Blood, 2003. 101(4): p. 1249-56.
10. Ogata, H., S. Goto, W. Fujibuchi and M. Kanehisa, Computation with the KEGG pathway database. Biosystems, 1998. 47(1-2): p. 119-28.
11. Ogata, H., S. Goto, K. Sato, W. Fujibuchi, H. Bono and M. Kanehisa, KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 1999. 27(1): p. 29-34.
12. Bader, G.D., I. Donaldson, C. Wolting, B.F. Ouellette, T. Pawson and C.W. Hogue, Nucleic Acids Res, 2001. 29(1): p. 242-5.
13. Zanzoni, A., L. Montecchi-Palazzi, M. Quondam, G. Ausiello, M. Helmer-Citterich and G. Cesareni, FEBS Lett, 2002. 513(1): p. 135-40.
14. Breitkreutz, B.J., C. Stark and M. Tyers, Genome Biol, 2002. 3(12): p. PREPRINT0013.
15. Sharan, R. and T. Ideker, Modeling cellular machinery through biological network comparison. Nat Biotechnol, 2006. 24(4): p. 427-33.
16. Nabieva, E., K. Jim, A. Agarwal, B. Chazelle and M. Singh, Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 2005. 21 Suppl 1: p. i302-10.
17. Lubovac, Z., J. Gamalielsson and B. Olsson, Combining functional and topological properties to identify core modules in protein interaction networks. Proteins, 2006.
18. Rhodes, D.R., S.A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette, S. Kalyana-Sundaram, et al., Nat Biotechnol, 2005. 23(8): p. 951-9.
19. Jansen, R., H. Yu, D. Greenbaum, Y. Kluger, N.J. Krogan, S. Chung, et al., A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 2003. 302(5644): p. 449-53.
20. Lee, I., S.V. Date, A.T. Adai and E.M. Marcotte, A probabilistic functional network of yeast genes. Science, 2004. 306(5701): p. 1555-8.
21. Wong, S.L., L.V. Zhang, A.H. Tong, Z. Li, D.S. Goldberg, O.D. King, et al., Proc Natl Acad Sci U S A, 2004. 101(44): p. 15682-7.
22. Berg, J., M. Lassig and A. Wagner, Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications. BMC Evol Biol, 2004. 4(1): p. 51.
23. Rzhetsky, A. and S.M. Gomez, Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics, 2001. 17(10): p. 988-96.
24. Barabasi, A.L. and R. Albert, Emergence of scaling in random networks. Science, 1999. 286(5439): p. 509-12.
25. Wagner, A. and D.A. Fell, The small world inside large metabolic networks. Proc Biol Sci, 2001. 268(1478): p. 1803-10.
26. Eisenberg, E. and E.Y. Levanon, Preferential attachment in the protein network evolution. Phys Rev Lett, 2003. 91(13): p. 138701.
27. Ashburner, M., C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, et al., The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
28. National Library of Medicine. Medical Subject Headings (MeSH®) Fact Sheet. 27 May 2005. http://www.nlm.nih.gov/pubs/factsheets/mesh.html.
29. Lussier, Y., D. Rappaport, T. Borlawsky and C. Friedman, PhenoGO: a Multistrategy Language Processing System Assigning Phenotypic Context to Gene Ontology Annotations, in Pacific Symposium on Biocomputing. 2006. Maui, HI, USA.
30. Bairoch, A., R. Apweiler, C.H. Wu, W.C. Barker, B. Boeckmann, S. Ferro, et al., Nucleic Acids Res, 2005. 33(Database issue): p. D154-9.
31. Online Mendelian Inheritance in Man, MIM Number: Excision-Repair, Complementing Defective, In Chinese Hamster, 5; ERCC5 (#133530); last edited 11/29/2005: http://www.ncbi.nlm.nih.gov/omim/
32. Hoogervorst, E.M., C.T. van Oostrom, R.B. Beems, J. van Benthem, S. Gielis, J.P. Vermeulen, et al., Cancer Res, 2004. 64(15): p. 5118-26.
33. Online Mendelian Inheritance in Man, MIM Number: Colorectal Cancer (#114500); last edited 5/17/2006: http://www.ncbi.nlm.nih.gov/omim/
34. Worth, A., A.J. Thrasher and H.B. Gaspar, Br J Haematol, 2006. 133(2): p. 124-40.
35. Blanco, R., B. McLaren, B. Davis, P. Steele and R. Smith, Systemic lupus erythematosus-associated lymphoproliferative disorder: report of a case and discussion in light of the literature. Hum Pathol, 1997. 28(8): p. 980-5.
36. Spivak, G., The many faces of Cockayne syndrome. Proc Natl Acad Sci U S A, 2004. 101(43): p. 15273-4.
37. Hanawalt, P.C., Subpathways of nucleotide excision repair and their regulation. Oncogene, 2002. 21(58): p. 8949-56.
38. Cotran, R.S., V. Kumar and T. Collins, Robbins Pathologic Basis of Diseases. 1999: p. 169, 296.
39. Online Mendelian Inheritance in Man, MIM Number: Fanconi Anemia (#227650); last edited 3/15/2006: http://www.ncbi.nlm.nih.gov/omim/
40. Joenje, H. and K.J. Patel, The emerging genetic and molecular basis of Fanconi anaemia. Nat Rev Genet, 2001. 2(6): p. 446-57.
41. Martin, D., C. Brun, E. Remy, P. Mouren, D. Thieffry and B. Jacq, GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol, 2004. 5(12): p. R101.
AN ITERATIVE ALGORITHM FOR METABOLIC NETWORK-BASED DRUG TARGET IDENTIFICATION*
PADMAVATI SRIDHAR, TAMER KAHVECI AND SANJAY RANKA
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, 32611
E-mail: {psridhar, tamer, ranka}@cise.ufl.edu
Post-genomic advances in bioinformatics have refined drug-design strategies by focusing on the reduction of serious side-effects through the identification of enzymatic targets. We consider the problem of identifying the enzymes (i.e., drug targets) whose inhibition will stop the production of a given target set of compounds, while eliminating a minimal number of non-target compounds. An exhaustive evaluation of all possible enzyme combinations to find the optimal solution subset may become computationally infeasible for very large metabolic networks. We propose a scalable iterative algorithm which computes a sub-optimal solution within reasonable time-bounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It evaluates immediate precursors of the target compounds and iteratively moves backwards to identify the enzymes whose inhibition will stop the production of the target compounds while incurring minimum side-effects. We show that our algorithm converges to a sub-optimal solution within a finite number of such iterations. Our experiments on the E. coli metabolic network show that the average accuracy of our method deviates from that of the exhaustive search by only 0.02%. Our iterative algorithm is highly scalable. It can solve the problem for the entire metabolic network of Escherichia coli in less than 10 seconds.
1. Introduction
Traditional drug development approaches focused more on the efficacy of drugs than on their toxicity (untoward side effects). Lack of predictive models that account for the complexity of the inter-relationships between the metabolic processes often leads to drug development failures. Toxicity and/or lack of efficacy can result if metabolic network components other than the intended target are affected. This is well illustrated by the example of the recent failure of Tolcapone (a new drug developed for Parkinson's disease) due to observed hepatic toxicity in some patients 9. Post-genomic drug research focuses more on the identification of specific biological targets (gene products, such as enzymes or proteins) for drugs, which can be manipulated to produce the desired effect (of curing a disease) with minimum disruptive side-effects 20,24. The main phases in such a drug development strategy are target identification, validation and lead inhibitor identification 7.
*Work supported partially by ORAU (Award no: 00060845). The work of Sanjay Ranka is supported in part by the National Science Foundation under Grant ITR 0325459. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Enzymes catalyze reactions, which produce metabolites (compounds) in the metabolic networks of organisms. Enzyme malfunctions that result in the accumulation of certain compounds may result in diseases. We term such compounds Target Compounds and the remaining compounds Non-Target Compounds. For instance, the malfunction of the enzyme phenylalanine hydroxylase causes buildup of the amino acid phenylalanine, resulting in phenylketonuria 23, a disease that causes mental retardation. Hence, it is intuitive to identify the optimal enzyme set that can be manipulated by drugs to prevent the excess production of target compounds, with minimal side-effects. We term the side-effects of inhibiting a certain enzyme combination the damage caused to the metabolic network. Formally, we define the damage of inhibiting an enzyme as the number of non-target compounds whose production is stopped by the inhibition of that specific enzyme. In our earlier work 22, we developed a graph model for metabolic networks based on the boolean network model 21. In our model, R, C, and E denote the set of reactions, compounds, and enzymes respectively. The node set consists of all the members of R ∪ C ∪ E. A node is labeled as reaction, compound, or enzyme based on the entity it refers to. Edges represent the interactions in the network. A directed edge from vertex x to vertex y is drawn if one of the following three conditions holds: (1) x represents an enzyme that catalyzes the reaction represented by y. (2) x corresponds to a reactant for the reaction represented by y. (3) x represents a reaction that produces the compound mapped to y. We assume that the inputs to all reactions and compounds are already present in the network and that there are no external inputs. Figure 1(a) illustrates a small hypothetical metabolic network. A directed edge from an enzyme to a reaction implies that the enzyme catalyzes that reaction. For instance, E1 catalyzes R1 and R2. A directed edge from a compound to a reaction implies that the compound is a reactant (input compound). A directed edge from a reaction to a compound implies that the compound is a product (output compound). In this figure, C1 is the target compound (i.e., the production of C1 should be stopped). In order to stop its production, we have to prevent R1 from taking place. This can be accomplished in two ways: (1) By disrupting one of its catalyzing enzymes (E1 in this case). Figure 1(b) shows the effects of disrupting E1. The resulting damage is calculated as the number of non-target compounds whose production is stopped. Since the production of C2, C3 and C4 is stopped, the damage due to the disruption of E1 is 3. (2) By stopping the production of one of its reactant compounds (C5 in this case). To stop the production of C5, we need to recursively look for the enzyme combination which is indirectly responsible for its production (E2 and E3). The combined damage of E2 and E3 is 1. Thus, the production of the target compound can be stopped by manipulating either E1 or a combination of E2 and E3. The optimal solution is the enzyme combination whose disruption has the minimum damage on the network (E2 and E3 in this case). Problem: Given a large metabolic network and a set of target compounds, we con-
"-1
W
Figure 1. (a) A graph constructed for a metabolic network with four reactions R1, ..., R4, three enzymes E1, E2 and E3, and five compounds C1, ..., C5. Circles, rectangles, and triangles denote compounds, reactions, and enzymes respectively. Here, C1 (shown by a double circle) is the target compound. (b) Effect of inhibiting E1. Dotted lines indicate the subgraph removed due to the inhibition of enzyme E1.
sider the problem of identifying the set of enzymes whose inhibition eliminates all the target compounds and inflicts minimum damage on the rest of the network. Evaluating all enzyme combinations is not feasible for metabolic networks with a large number of enzymes. This is because the number of enzyme combinations, i.e., 2^|E| − 1, increases exponentially with the number of enzymes. Efficient and precise heuristics are needed to tackle this problem. Note that different enzymes and compounds may have varying levels of importance in the metabolic network. Our model simplistically considers all the enzymes and compounds to be of equal importance. It can be extended by assigning weights to enzymes and compounds based on their roles in the network. However, we do not discuss these extensions in this paper. Contribution: In this paper, we develop a scalable iterative algorithm as an approximation to the optimal enzyme combination detection problem. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It starts by finding the damage incurred due to the removal of each reaction or compound by evaluating its immediate precursors. It then iteratively improves the damage by considering the damage computed for the immediate precursors. It converges when the damage values cannot be improved any further. We prove that the number of iterations is at most the number of reactions on the longest path from any enzyme to the target compounds in the underlying pathway. To the best of our knowledge, this is the first polynomial time solution for a metabolic-network based drug target identification problem. The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed iterative algorithm for determining the enzyme combination whose inhibition achieves the desired effect of inhibiting the production of target compounds. Section 4 presents a theoretical analysis of the algorithm. Section 5 discusses experimental results. Section 6 concludes the paper.
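A minimal sketch of the damage computation on the toy network of Figure 1. The propagation rules (a reaction stops when any of its catalyzing enzymes is inhibited or when any reactant is no longer produced; a compound disappears only when all reactions producing it stop) follow the text above, but the exact wiring of the toy network is inferred from the example and is therefore a guess:

```python
def damage(inhibited_enzymes, reactions, target_compounds):
    """Damage of inhibiting a set of enzymes: number of NON-target compounds whose
    production stops.  `reactions` maps a reaction id to its catalysing 'enzymes',
    its 'reactants' and its 'products' (all sets)."""
    producers = {}
    for rid, r in reactions.items():
        for c in r['products']:
            producers.setdefault(c, set()).add(rid)
    inhibited = set(inhibited_enzymes)
    stopped_reactions, stopped_compounds = set(), set()
    changed = True
    while changed:                       # propagate the cascade to a fixed point
        changed = False
        for rid, r in reactions.items():
            if rid in stopped_reactions:
                continue
            enzyme_hit = bool(r['enzymes'] & inhibited)                  # any catalyzer inhibited
            reactant_gone = any(c in stopped_compounds for c in r['reactants'])
            if enzyme_hit or reactant_gone:
                stopped_reactions.add(rid)
                changed = True
        for c, prods in producers.items():
            if c not in stopped_compounds and prods <= stopped_reactions:
                stopped_compounds.add(c)
                changed = True
    return len(stopped_compounds - set(target_compounds))

# Hypothetical encoding of the toy network in Figure 1 (wiring inferred from the text):
toy = {
    'R1': {'enzymes': {'E1'}, 'reactants': {'C5'}, 'products': {'C1', 'C2'}},
    'R2': {'enzymes': {'E1'}, 'reactants': set(),  'products': {'C2', 'C3', 'C4'}},
    'R3': {'enzymes': {'E2'}, 'reactants': set(),  'products': {'C5'}},
    'R4': {'enzymes': {'E3'}, 'reactants': set(),  'products': {'C5'}},
}
print(damage({'E1'}, toy, {'C1'}))        # 3, as in the example above
print(damage({'E2', 'E3'}, toy, {'C1'}))  # 1, as in the example above
```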
2. Related work
Traditional pharmacological drug discovery approaches involve the incorporation of a large number of hypothetical targets into cell-based assays and automated high-throughput screening (HTS) of vast chemical compound libraries 7. Post-genomic advances in bioinformatics have fostered the development of rational drug-design strategies that seek to reduce serious side-effects 8,4,3. This era has brought about the concept of reverse pharmacology, in which the first step is the identification of protein targets that may be critical intervention points in a disease process 24,20,1. Since this method is driven by the mechanics of the disease, it is expected to be more efficient than the classical approach 24. Rapid identification of enzyme (or protein) targets requires a thorough understanding of the underlying metabolic network of the organism affected by a disease. The availability of fully sequenced genomes has enabled researchers to integrate the available genomic information to reconstruct and study metabolic networks 17. These studies have revealed important properties of metabolic networks 10,2,15. The potential of an enzyme to be an effective drug target is considered to be related to its essentiality in the corresponding metabolic network 13. Lemke et al. proposed the measure enzyme damage as an indicator of enzyme essentiality 14,16. Recently, a computational approach to prioritize potential drug targets for antimalarial drugs was developed 18. A choke-point analysis of P. falciparum was performed to identify essential enzymes which are potential drug targets. The possibility of using enzyme inhibitors as antiparasitic drugs is being investigated through stoichiometric analysis of the metabolic networks of parasites 5,6. These studies show the effectiveness of computational techniques in reverse pharmacology. A combination of gene-knockout and microarray time-course data was used to study the effects of a chemical compound on a gene network 12. An investigation of metabolite essentiality was carried out with the help of stoichiometric analysis 11. These approaches underline the importance of studying the role of compounds (metabolites) during the pursuit of computational solutions to pharmacological problems.
3. Iterative algorithm
In this section, we develop a scalable iterative algorithm that quickly finds a sub-optimal solution to the enzyme-target identification problem. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backwards from the target compounds. We evaluate the immediate precursors of the target compounds and iteratively move backwards to identify the enzymes whose inhibition will stop the production of the target compounds while incurring minimum damage. Our algorithm consists of an initialization step followed by iterations, until a convergence criterion is satisfied. Let E, R and C denote the sets of enzymes, reactions and compounds of the metabolic network respectively. Let |E|, |R| and |C| denote the number of enzymes, reactions and compounds respectively.
The primary data structures are three vectors, namely an enzyme vector VE = [e1, e2, ..., e|E|], a reaction vector VR = [r1, r2, ..., r|R|], and a compound vector VC = [c1, c2, ..., c|C|]. Each value ei in VE denotes the damage of inhibiting enzyme Ei ∈ E. Each value rj in VR denotes the damage incurred by stopping the reaction Rj ∈ R. Each value ck in VC denotes the damage incurred by stopping the production of the compound Ck ∈ C.
Initialization: Here, we describe the initialization of the vectors VE, VR, and VC. We initialize VE first, VR second, and VC last.
Enzyme vector: The damage ei, for all i, 1 ≤ i ≤ |E|, is computed as the number of non-target compounds whose production stops after inhibiting Ei. We find the number of such compounds by doing a breadth-first traversal of the metabolic network starting from Ei. We calculate the damage ei associated with every enzyme Ei ∈ E and store it at position i in the enzyme vector VE.
Reaction vector: The damage rj is computed as the minimum of the damages of the enzymes that catalyze Rj, for all j, 1 ≤ j ≤ |R|. In other words, let Eπ1, Eπ2, ..., Eπk be the enzymes that catalyze Rj. We compute the damage rj as rj = min{eπ1, ..., eπk}. This computation is intuitive since a reaction can be disrupted by inhibiting any of its catalyzers. We calculate rj for every reaction Rj ∈ R and store it at position j in the reaction vector VR. Let E(Rj) denote the set of enzymes that produced the damage rj. Along with rj, we also store E(Rj). Note that in our model we do not consider back-up enzyme activities, for simplicity.
Compound vector: The damage ck, for all k, 1 ≤ k ≤ |C|, is computed by considering the reactions that produce Ck. Let Rπ1, Rπ2, ..., Rπj be the reactions that produce Ck. We first compute a set of enzymes E(Ck) for Ck as E(Ck) = E(Rπ1) ∪ E(Rπ2) ∪ ... ∪ E(Rπj). We then compute the damage value ck as the number of non-target compounds that are deleted after the inhibition of all the enzymes in E(Ck). This computation is based on the observation that a compound disappears from the system only if all the reactions that produce it stop. We calculate ck for every compound Ck ∈ C and store it at position k in the compound vector VC. Along with ck, we also store E(Ck).
Column I0 in Table 1 shows the initialization of the vectors for the network in Figure 1. The damage e1 of E1 is three, as inhibiting E1 stops the production of three non-target compounds C2, C3 and C4. Since the disruption of E2 or E3 alone does not stop the production of any non-target compound, their damage values are zero. Hence, VE = [3, 0, 0]. The damage values for reactions are computed as the minimum over their catalyzers (r1 = r2 = e1, r3 = e2 and r4 = e3). Hence, VR = [3, 3, 0, 0]. The damage values for compounds are computed from the reactions that produce them. For instance, R1 and R2 produce C2, and E(R1) = E(R2) = {E1}; therefore, c2 = e1. Similarly, c5 is equal to the damage of inhibiting the set E(R3) ∪ E(R4) = {E2, E3}. Thus, c5 = 1.
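The sketch below is one way to compute the damage of inhibiting a set of enzymes under the model above: a reaction stops if any of its catalyzing enzymes is inhibited or any of its input compounds is lost, and a compound is lost only when all of its producing reactions stop. The dictionary-of-sets encoding is an assumption made for illustration, not the authors' data structure.

```python
def damage(removed_enzymes, catalysts, reactants, producers, targets):
    """Number of non-target compounds whose production stops when the enzymes in
    `removed_enzymes` are inhibited.
      catalysts[r] -> set of enzymes catalyzing reaction r
      reactants[r] -> set of input compounds of reaction r
      producers[c] -> set of reactions producing compound c
    """
    stopped_r, stopped_c = set(), set()
    changed = True
    while changed:                       # propagate removals to a fixed point
        changed = False
        for r in catalysts:
            if r not in stopped_r and (catalysts[r] & removed_enzymes
                                       or reactants[r] & stopped_c):
                stopped_r.add(r)
                changed = True
        for c in producers:
            if c not in stopped_c and producers[c] and producers[c] <= stopped_r:
                stopped_c.add(c)
                changed = True
    return len(stopped_c - set(targets))

# For the network of Figure 1, damage({'E1'}, ...) would evaluate to 3 and
# damage({'E2', 'E3'}, ...) to 1, matching the values discussed in the text.
```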
Table 1. Iterative steps: I0 is the initialization step; I1 and I2 are the iterations. VR and VC represent the damage values of reactions and compounds respectively, computed at each iteration. VE = [3, 0, 0] in all iterations.
Iteration    VR              VC
I0           [3, 3, 0, 0]    [3, 3, 3, 3, 1]
I1           [1, 3, 0, 0]    [1, 3, 3, 3, 1]
I2           [1, 3, 0, 0]    [1, 3, 3, 3, 1]
Iterative steps: We iteratively refine the damage values in vectors VR and VC in a number of steps. At each iteration, the values are updated by considering the damage of the precursors of the precursors. Thus, at the nth iteration, the precursors from which a reaction or a compound is reachable on a path of length up to n are considered. We define the length of a path on the graph constructed for a metabolic network as the number of reactions on that path (see Definition 4.2). There is no need to update VE since the enzymes are not affected by the reactions or the compounds. Next, we describe the actions taken to update VR and VC at each iteration. We later discuss the stopping criterion for the iterations.
Reaction vector: Let Cπ1, Cπ2, ..., Cπt be the compounds that are input to Rj. We update the damage rj as rj = min{rj, min{cπ1, ..., cπt}}. The first term of the min function denotes the damage value calculated for Rj during the previous iteration. The second term provides the damage of the input compound with the minimum damage found in the previous iteration. This computation is intuitive since a reaction can be disrupted by stopping the production of any of its input compounds. The damages of all the input compounds were already computed in the previous iteration (say the (n - 1)th iteration). Therefore, at iteration n, the second term of the min function considers the impact of the reactions and compounds that are away from Rj by n edges in the graph for the metabolic network. Let E(Rj) denote the set that contains the enzymes that produced the new damage rj. Along with rj, we also store E(Rj). We update all rj ∈ VR using the same strategy. Note that the values rj can be updated in any order, i.e., the result does not depend on the order in which they are updated.
Compound vector: The damage ck, for all k, 1 ≤ k ≤ |C|, is updated by considering the damage computed for Ck in the previous iteration and the damages of the reactions that produce Ck. Let Rπ1, Rπ2, ..., Rπj be the reactions that produce Ck. We first compute a set of enzymes as E(Rπ1) ∪ E(Rπ2) ∪ ... ∪ E(Rπj). Here, E(Rπt), 1 ≤ t ≤ j, is the set of enzymes computed for Rπt after the reaction vector is updated in the current iteration. We then update the damage value ck as ck = min{ck, damage(E(Rπ1) ∪ ... ∪ E(Rπj))}. The first term here denotes the damage value computed for Ck in the previous iteration. The second term shows the damage computed for all the precursor reactions in the current step. Along with ck, we also store E(Ck), the set of enzymes which provides the current minimum damage ck.
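A compact rendering of these update rules is sketched below. It reuses the dictionary-of-sets encoding and the damage() function from the earlier sketch; names and conventions are illustrative only and are not the authors' code.

```python
def iterative_refinement(enzyme_damage, catalysts, reactants, producers, targets):
    """enzyme_damage maps each enzyme to its initial damage (the VE vector)."""
    INF = float("inf")
    # Initialization of VR: reaction damage = min over its catalyzing enzymes.
    r_dmg, r_set = {}, {}
    for r, enz in catalysts.items():
        best = min(enz, key=lambda e: enzyme_damage[e])   # assumes >= 1 catalyzer
        r_dmg[r], r_set[r] = enzyme_damage[best], {best}
    # Initialization of VC: compound damage = damage of the union of E(R) sets.
    c_dmg, c_set = {}, {}
    for c, prods in producers.items():
        c_set[c] = set().union(*(r_set[r] for r in prods)) if prods else set()
        c_dmg[c] = (damage(c_set[c], catalysts, reactants, producers, targets)
                    if prods else INF)
    changed = True
    while changed:                                        # stop when nothing improves
        changed = False
        for r, ins in reactants.items():                  # reaction update rule
            for c in ins:
                if c_dmg.get(c, INF) < r_dmg[r]:
                    r_dmg[r], r_set[r] = c_dmg[c], set(c_set[c])
                    changed = True
        for c, prods in producers.items():                # compound update rule
            if not prods:
                continue
            union = set().union(*(r_set[r] for r in prods))
            d = damage(union, catalysts, reactants, producers, targets)
            if d < c_dmg[c]:
                c_dmg[c], c_set[c] = d, union
                changed = True
    return c_dmg, c_set
```

The enzyme set stored for a target compound at convergence is the reported solution.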
Condition for convergence: At each iteration, each value in VR and VC either remains the same or decreases by an integer amount. This is because a min function is applied to update each value as the minimum of the current value and a function of its precursors. Therefore, the values in VR and VC do not increase. Furthermore, a damage value is always an integer since it denotes the number of deleted non-target compounds. We stop our iterative refinement when the vectors VR and VC do not change in two consecutive iterations. This is justified because, if these two vectors remain the same after an iteration, the damage values in VR and VC cannot be minimized any further using our refinement strategy.
Columns I1 and I2 in Table 1 show the iterative steps that update the values of the vectors VR and VC. In I1, we compute the damage r1 for R1 as the minimum of its current damage (three) and the damage of its precursor compound, c5 = 1. Hence, r1 is updated to 1 and its associated enzyme set is changed to {E2, E3}. The other values in VR remain the same. When we compute the values for VC, c1 is updated to 1, as its new associated enzyme set is {E2, E3} and the damage of inhibiting both E2 and E3 together is 1. Hence, VR = [1, 3, 0, 0] and VC = [1, 3, 3, 3, 1]. In I2, we find that the values in VR and VC do not change any more. Hence, we stop our iterative refinement and report the enzyme combination {E2, E3} as the iterative solution for stopping the production of the target compound C1.
Complexity analysis: Space complexity: The number of elements in the reaction and compound vectors is |R| + |C|. For each element, we store an associated set of enzymes. Hence, the space complexity is O((|R| + |C|) · |E|). Time complexity: The number of iterations of the algorithm is O(|R|) (see Section 4). The computational time per iteration is O(G · (|R| + |C|)), where G is the size of the graph. Hence, the time complexity is O(|R| · G · (|R| + |C|)).
4. Maximum number of iterations
In this section, we present a theoretical analysis of our proposed algorithm. We show that the number of iterations needed for the method to converge is finite, because the number of iterations depends on the length of the longest non-self-intersecting path (see the definitions below) from any enzyme to a reaction or compound.
Definition 4.1. In a given metabolic network, a non-self-intersecting path is a path that visits any vertex on the path exactly once.
For simplicity, we will use the term path instead of non-self-intersecting path in the rest of this section.
Definition 4.2. In a given metabolic network, the length of a path from an enzyme Ei to a reaction Rj or compound Ck is defined as the number of unique reactions on that path.
Note that the reaction Rj is counted as one of the unique reactions on the path from enzyme Ei to Rj.
Definition 4.3. In a given metabolic network, the preceding path of a reaction Rj (or a compound Ck) is defined as the length of the longest path from any enzyme in that network to Rj (or Ck).
Theorem 4.1. Let VE = [e1, e2, ..., e|E|], VR = [r1, r2, ..., r|R|], and VC = [c1, c2, ..., c|C|] be the enzyme, reaction and compound vectors respectively (see Section 3). Let n be the length of the longest path (see Definitions 4.1 and 4.2) from any enzyme Ei to a reaction Rj (or a compound Ck). The value rj (or ck) remains constant after at most n iterations.
Proof: We prove this theorem by induction on the number of reactions on the longest path (see Definitions 4.1 and 4.2) from any enzyme Ei, corresponding to ei ∈ VE, to Ck.
Basis: The basis is the case when the longest path from an enzyme Ei is of length 1 (i.e., the path consists of exactly one reaction). Let Rj be such a reaction. This implies that there is no other reaction on a path from any Ei to Rj. As a result, the value rj remains constant after initialization. Let Ck be a compound such that there is at most one reaction on any path from an enzyme to Ck. Let Rπ1, Rπ2, ..., Rπj be the reactions that produce Ck. Because of our assumption, there is no precursor reaction to any of these reactions; otherwise, the length of the longest path would be greater than one. Therefore, the values rπ1, rπ2, ..., rπj and the sets E(Rπ1), E(Rπ2), ..., E(Rπj) do not change after initialization. The value ck is computed as the damage of E(Ck) = E(Rπ1) ∪ E(Rπ2) ∪ ... ∪ E(Rπj). Thus, ck remains unchanged after initialization and the algorithm terminates after the first iteration.
Inductive step: Assume that the theorem is true for reactions and compounds that have a preceding path with at most n - 1 reactions. Now, we will prove the theorem for reactions and compounds that have a preceding path with n reactions. Assume that Rj and Ck denote such a reaction and a compound. We will prove the theorem for each one separately.
Proof for Rj: Let Cπ1, Cπ2, ..., Cπt be the compounds that are input to Rj. The preceding path length of each of these input compounds, say Cπs, is at most n; otherwise, the preceding path length of Rj would be greater than n. Case 1: If the preceding path length of Cπs is less than n, by our induction hypothesis cπs would remain constant after the (n - 1)th iteration. Thus, the input compound Cπs will not change the value of rj after the nth iteration. Case 2: If the preceding path length of Cπs is n, then Rj is one of the reactions on this path. In other words, Cπs and Rj are on a cycle of length n; otherwise, the preceding path length of Rj would be greater than n. Recall that at each iteration, the algorithm considers a new reaction or compound on the preceding path, starting from the closest one. Thus, at the nth iteration of the computation of rj, the algorithm completes the cycle and considers Rj. This, however, will not modify rj, because the value of rj monotonically decreases (or remains the same) at each iteration. Thus, the initial damage value computed from Rj is guaranteed to be no
better than rj after n - 1 iterations. We conclude that rj will remain unchanged after the nth iteration.
Proof for Ck: Let Rπ1, Rπ2, ..., Rπj be the reactions that produce Ck. The preceding path length of each of these reactions, say Rπs, is at most n; otherwise, the preceding path length of Ck would be greater than n. Case 1: If the preceding path length of Rπs is less than n, by our induction hypothesis rπs would remain constant after the (n - 1)th iteration. Thus, the reaction Rπs will not change the value of ck after the nth iteration. Case 2: If the preceding path length of Rπs is n, then from our earlier discussion in the proof for Rj, rπs remains unchanged after the nth iteration. Therefore, Rπs will not change the value of ck after the nth iteration. Hence, by induction, Theorem 4.1 holds.
5. Experimental results
We evaluate our proposed iterative algorithm using the following three criteria. Execution time: the total time (in milliseconds) taken by the method to finish execution and report whether a feasible solution is identified or not. Number of iterations: the number of iterations performed by the method to arrive at a steady-state solution. Average damage: the average number of non-target compounds that are eliminated when the enzymes in the result set are inhibited.
We extracted the metabolic network information of Escherichia coli (E. coli) from KEGG 19 (ftp://ftp.genome.jp/pub/kegg/pathways/eco/). The metabolic network in KEGG has been hierarchically classified into smaller networks according to their functionality. We performed experiments at different levels of the hierarchy of the metabolic network and on the entire metabolic network, which is an aggregation of all the functional subnetworks. We devised a uniform labeling scheme for the networks based on the number of enzymes. According to this scheme, a network label begins with 'N' and is followed by the number of enzymes in the network. For instance, 'N20' indicates a network with 20 enzymes. Table 2 shows the metabolic networks chosen, along with their identifiers and the number of compounds (C), reactions (R) and edges (Ed). The edges represent the interactions in the network. For each network, we constructed query sets of sizes one, two and four target compounds by randomly choosing compounds from that network. Each query set contains 10 queries. We implemented the proposed iterative algorithm and an exhaustive search algorithm which determines the optimal enzyme combination to eliminate the given set of target compounds with minimum damage. We implemented the algorithms in Java. We ran our experiments on an Intel Pentium 4 processor with a 2.8 GHz clock speed and 1 GB of main memory, running the Linux operating system.
Table 2. Metabolic networks from KEGG with identifier (Id). C, R and Ed denote the number of compounds, reactions and edges (interactions) respectively.
Id     Metabolic Network                  C     R     Ed
N08    Polyketide biosynthesis            11    11    33
N13    Xenobiotics biodegradation         47    58    187
N14    Citrate or TCA cycle               21    35    125
N17    Galactose                          38    50    172
N20    Pentose phosphate                  26    37    129
N22    Glycan biosynthesis                54    51    171
N24    Glycerolipid                       32    49    160
N28    Glycine, serine and threonine      36    46    151
N32    Pyruvate                           21    51    163
N42    Other amino acid                   69    63    208
N48    Lipid                              134   196   654
N52    Purine                             67    128   404
N59    Energy                             72    82    268
N71    Nucleotide                         102   217   684
N96    Vitamins and cofactors             145   175   550
N170   Amino acid                         54    378   1210
N180   Carbohydrate                       247   501   1659
N537   Entire network                     988   1790  5833
Table 3. Comparison of average damage values of solutions determined by the iterative algorithm versus the exhaustive search algorithm.
Pathway Id           N14    N17    N20    N24    N28    N32
Iterative damage     2.51   8.73   1.63   3.39   1.47   0.59
Exhaustive damage    2.51   8.73   1.63   3.17   1.47   0.59
Figure 2. Evaluation of the iterative algorithm. (a) Average execution time in milliseconds. (b) Average number of iterations. Both panels are plotted against the pathway identifier.
Evaluation of accuracy: Table 3 shows the comparison of the average damage values of the solutions computed by the iterative algorithm versus the exhaustive search algorithm. We have shown the results only up to N32, as the exhaustive search algorithm took longer than one day to finish even for N32. We can see that the damage values of our method exactly match the damage values of the exhaustive search for all the networks except N24. For N24, the average damage differs from the exhaustive solution by only 0.02%. This shows that the iterative algorithm is a good approximation of the exhaustive search algorithm, which computes an optimal solution. The slight deviation in damage is the trade-off for achieving the scalability of the iterative algorithm (described next).
Evaluation of scalability: Figure 2(a) plots the average execution time of our
iterative method for increasing sizes of metabolic networks. The running time increases slowly with the network size. As the number of enzymes increases from 8 to 537, the running time increases from roughly 1 to 10 seconds. The largest network, N537, consists of 537 enzymes, and hence an exhaustive evaluation would inspect 2^537 - 1 combinations, which is computationally infeasible. Thus, our results show that the iterative method scales well for networks of increasing sizes. This property makes our method an important tool for identifying the right enzyme combination for eliminating target compounds, especially for those networks for which an exhaustive search is not feasible. Figure 2(b) shows a plot of the average number of iterations for increasing sizes of metabolic networks. The iterative method reaches a steady state within 10 iterations in all cases. The parameters (see Table 2) that influence the number of iterations are the numbers of enzymes, compounds and reactions, and especially the number of interactions in the network (represented by edges in the network graph). A larger number of interactions increases the number of iterations considerably, as can be seen for networks N22, N48, N96 and N537, where the number of iterations is greater than 5. This shows that, in addition to the number of enzymes, the number of compounds and reactions in the network and their interactions also play a significant role in determining the number of iterations. Our results show that the iterative algorithm can reliably reach a steady state and terminate for networks as large as the entire metabolic network of E. coli.
6. Conclusion
Efficient computational strategies are needed to identify the enzymes (i.e., drug targets) whose inhibition will achieve the required effect of eliminating a given target set of compounds while incurring minimal side-effects. An exhaustive evaluation of all possible enzyme combinations to find the optimal subset is computationally infeasible for large metabolic networks. We proposed a scalable iterative algorithm which computes a sub-optimal solution to this problem within reasonable time bounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. We evaluated the immediate precursors of a target compound and iteratively moved backwards to identify the enzymes whose inhibition stopped the production of the target compound while incurring minimum damage. We showed that our method converges within a finite number of such iterations. In our experiments on the E. coli metabolic network, the accuracy of a solution computed by the iterative algorithm deviated from that found by an exhaustive search by only 0.02%. Our iterative algorithm is highly scalable: it solved the problem for even the entire metabolic network of E. coli in less than 10 seconds.
References
1. 'Proteome Mining' can zero in on Drug Targets. Duke University Medical News, Aug 2004.
2. M. Arita. The metabolic world of Escherichia coli is not small. PNAS, 101(6):1543-7,
2004.
3. S. Broder and J. C. Venter. Sequencing the Entire Genomes of Free-Living Organisms: The Foundation of Pharmacology in the New Millennium. Annual Review of Pharmacology and Toxicology, 40:97-132, Apr 2000.
4. S. K. Chanda and J. S. Caldwell. Fulfilling the promise: Drug discovery in the postgenomic era. Drug Discovery Today, 8(4):168-174, Feb 2003.
5. A. Cornish-Bowden. Why is uncompetitive inhibition so rare? FEBS Letters, 203(1):3-6, Jul 1986.
6. A. Cornish-Bowden and J. S. Hofmeyr. The Role of Stoichiometric Analysis in Studies of Metabolism: An Example. Journal of Theoretical Biology, 216:179-191, May 2002.
7. J. Drews. Drug Discovery: A Historical Perspective. Science, 287(5460):1960-1964, Mar 2000.
8. Davidov et al. Advancing drug discovery through systems biology. Drug Discovery Today, 8(4):175-183, Feb 2003.
9. Deane et al. Catechol-O-methyltransferase inhibitors versus active comparators for levodopa-induced complications in Parkinson's disease. Cochrane Database of Systematic Reviews, 4, 2004.
10. Hatzimanikatis et al. Metabolic networks: enzyme function and metabolite structure. Current Opinion in Structural Biology, (14):300-306, 2004.
11. Imielinski et al. Investigating metabolite essentiality through genome scale analysis of E. coli production capabilities. Bioinformatics, Jan 2005.
12. Imoto et al. Computational Strategy for Discovering Druggable Gene Networks from Genome-Wide RNA Expression Profiles. In PSB 2006 Online Proceedings, 2006.
13. Jeong et al. Prediction of Protein Essentiality Based on Genomic Data. ComPlexUs, 1:19-28, 2003.
14. Lemke et al. Essentiality and damage in metabolic networks. Bioinformatics, 20(1):115-119, Jan 2004.
15. Ma et al. Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics, 20(12):1870-6, 2004.
16. Mombach et al. Bioinformatics analysis of mycoplasma metabolism: Important enzymes, metabolic similarities, and redundancy. Computers in Biology and Medicine, 2005.
17. Teichmann et al. The Evolution and Structural Anatomy of the Small Molecule Metabolic Pathways in Escherichia coli. JMB, 311:693-708, 2001.
18. Yeh et al. Computational Analysis of Plasmodium falciparum Metabolism: Organizing Genomic Information to Facilitate Drug Discovery. Genome Research, 14:917-924, 2004.
19. M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28(1):27-30, Jan 2000.
20. C. Smith. Hitting the target. Nature, 422:341-347, Mar 2003.
21. R. Somogyi and C. A. Sniegoski. Modeling the complexity of genetic networks: Understanding multi-gene and pleiotropic regulation. Complexity, 1:45-63, 1996.
22. P. Sridhar, T. Kahveci, and S. Ranka. Opmet: A metabolic network-based algorithm for optimal drug target identification. Technical report, CISE Department, University of Florida, Sep 2006.
23. R. Surtees and N. Blau. The neurochemistry of phenylketonuria. European Journal of Pediatrics, 159:109-13, 2000.
24. T. Takenaka. Classical vs reverse pharmacology in drug discovery. BJU International, 88(2):7-10, Sep 2001.
TRANSCRIPTIONAL INTERACTIONS DURING SMALLPOX INFECTION AND IDENTIFICATION OF EARLY INFECTION BIOMARKERS*
WILLY A. VALDIVIA-GRANDA
Orion Integrated Biosciences Inc., 265 Centre Ave. Suite 1R, New Rochelle, NY 10805, USA. Email: willy.valdivia@orionbiosciences.com
MARICEL G. KANN
National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA. Email: kann@mail.nih.gov
JOSE MALAGA
Orion Integrated Biosciences Inc. Email: jose.malaga@orionbiosciences.com
Smallpox is a deadly disease that can be intentionally reintroduced into the human population as a bioweapon. While host gene expression microarray profiling can be used to detect infection, the analysis of this information using unsupervised and supervised classification techniques can produce contradictory results. Here, we present a novel computational approach to incorporate molecular genome annotation features that are key for identifying early infection biomarkers (EIBs). Our analysis identified 58 EIBs expressed in peripheral blood mononuclear cells (PBMCs) collected from 21 cynomolgus macaques (Macaca fascicularis) infected with two variola strains via aerosol and intravenous exposure. The level of expression of these EIBs was correlated with disease progression and severity. No overlap between the EIB co-expression and the protein interaction data reported in public databases was found. This suggests that a pathogen-specific re-organization of the gene expression and protein interaction networks occurs during infection. To identify potential genome-wide protein interactions between variola and humans, we performed a protein domain analysis of all smallpox and human proteins. We found that only 55 of the 161 protein domains in smallpox are also present in the human genome. These co-occurring domains are mostly represented in proteins involved in blood coagulation, complement activation, angiogenesis, inflammation, and hormone transport. Several of these proteins are within the EIB category and suggest potential new targets for the development of therapeutic countermeasures.
* Correspondence should be addressed to: [email protected]
1. INTRODUCTION
The virus that causes smallpox, known as variola major, belongs to the genus Orthopoxvirus within the family Poxviridae. During 1967, the year the smallpox global eradication program began, an estimated 10 to 15 million smallpox cases occurred in 43 countries and the disease caused the death of 2 million people annually (1). Intensive vaccination programs led in 1979 to the eradication of the disease. Since then, vaccination has ceased, and levels of immunity have dropped dramatically (2). In recent years there has been increasing concern that this virus could be used as a bioweapon (3, 4). In the very early stages of viral infection and during the progression of the disease, a series of physiological and molecular changes, including differential gene expression, occur in the host. This information can be used to identify biomarkers correlated with the presence or absence of a specific pathogen, the prognosis of the disease, or the efficacy of vaccines and drug therapies. Since microarrays can measure whole-genome gene expression profiles, the use of peripheral blood mononuclear cells (PBMCs) can allow the identification of pathogen-specific biomarkers before clinical symptoms appear. While the collection of PBMCs is a minimally invasive method which facilitates the assessment of host responses to infection, doubts about their usefulness persist. These revolve around two strong arguments. First, expression signals might come from a minority of cells within the bloodstream; thus, expression might be a secondary consequence rather than a primary effect of viral infection. Second, the PBMC population is not in a homogeneous biological state; therefore, there is an inherent biological noise which could make the data impossible to reproduce. Rubins et al. (5) used cDNA microarrays to measure the expression changes occurring in PBMCs collected from blood of cynomolgus macaques infected with two strains of variola by aerosol and intravenous exposure. Clustering analyses revealed that variola infection induced the expression of genes involved in cell cycle and proliferation, DNA replication, and chromosome segregation. These transcriptional changes were attributed to the fact that poxviruses encode homologues of the mammalian epidermal growth factor (EGF) that bind ErbB protein family members, which are potent stimulators of cell proliferation. However, the conclusions of Rubins et al. (5) were limited by the ability of unsupervised microarray data analysis algorithms, such as clustering, to detect true gene product interactions (6, 7). This is relevant because an increasing amount of data suggests that proteins involved in the regulation of cellular events resulting from viral infections are organized in a modular fashion rather than in a particular class or cluster (8-10). While some microarray data analysis tools use gene ontologies to increase the performance of the classification of gene expression data (11, 12), these methods
incorporate molecular annotation only after the classification of the gene expression values. However, many human genes have little or no functional annotation, or they have multiple molecular functions that can change with database update versions. Therefore, the identification of biomarkers is challenging because it is not possible to quantify the contribution of the molecular annotation to the overall classification process. To address these limitations, and to gain a better understanding of the molecular complexity arising during host-pathogen interactions, we developed a new method for microarray data classification and for the discovery of early infection biomarkers (EIBs). Our approach incorporates different molecular biological datasets and narrows the set of attributes required for the classification process. This information is represented as transcriptional networks where genes associated with early viral infection events and disease severity are selected. These interactions were overlapped with physical protein-protein interaction data reported in the scientific literature. To complement these analyses, and to identify possible human receptors used by smallpox during cellular entry, replication, assembly, and budding (13, 14), we identified all protein domains (from the PFAM protein domain database (15)) within the 197 smallpox proteins that are also present within human proteins. The results of our analysis provide new insights into receptor co-evolution and suggest potential therapeutic targets that might diminish the lethal manifestations of smallpox.
2. METHODS
2.1. Transcriptional Network Reconstruction
We used the microarray gene expression data from the experiments by Rubins et al. (5). This information consists of the molecular profiles collected from PBMCs of 21 male cynomolgus macaques (Macaca fascicularis) exposed to two variola strains (India-7124 and Harper-99) via subcutaneous injections (5 × 10^8 plaque-forming units, p.f.u.) and aerosol exposure (10^9 p.f.u.). For the analysis of these data, we developed an algorithm to identify genes responding similarly to the viral challenge across different exposed animals. We then proceeded to identify infection-specific genes corresponding to a particular time-point after the inoculation (16). As shown in Figure 1, our implementation consists of two main steps: first, a nearest neighbor voting (NNV) classification including gene expression values and gene annotation features, where the best attributes associated with a particular transcriptional network are selected (17); second, a genetic algorithm (GA) optimization using as fitness function the trade-off between the false negative and false positive rates over every possible cut-off, represented by the area under the receiver operating characteristic (ROC) curve (17).
Pred(G) = Im(G) + Sim(G)    (1.1)
Im(G) = W_L(G) + W_I(G) + W_A(G)    (1.2)
Sim(G) = Σ_{S ∈ trnSet} Σ_{f ∈ features} W_tf · Match_f(G, S) + Im(G)    (1.3)
Equation 1.1 defines the function used for predictor voting (Pred) of specific transcriptional interactions, estimated as the sum of the importance of a given gene G (Im(G)) and the similarity of attributes (Sim(G)) of its gene neighbors. The importance of gene G is given by Equation 1.2 and is based on the weights for scoring the gene's cellular compartment localization (W_L), its number of interactions with other genes (W_I), and its number of attributes (W_A). Considering that there are multiple attributes to select, we optimized the weight space (W_tf) used in Equation 1.3 by scoring the best combination of weights with a standard genetic algorithm (GA) matching each of the features (f) voted as important. This approach selects the best and/or fittest solution and allows only the higher scores to proceed in the form of transcriptional interactions. The ROC value of the prediction is used as the fitness evaluator. Depending on the fitness value, random mutation is used occasionally to change or optimize an existing solution. For the visualization of the final transcriptional interactions we calculated the probability (p < 0.01) of the network composition defined by the hypergeometric distribution, as shown in Equation 2:
p = (r choose y)(N - r choose n - y) / (N choose n)    (2)
where N is the total number of elements represented in our dataset (~18,000), r is the total number of those that are part of a transcriptional network, n is the number of differentially expressed genes that belong to the transcriptional network, and y is the number of differentially expressed genes that are members of a smallpox day-specific event. This information is represented as transcriptional networks at the cellular localization, molecular function and biological process levels.
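Such a hypergeometric probability can be computed with a standard statistical routine. The sketch below uses SciPy and the cumulative upper tail, which is one common convention for this kind of enrichment test; the exact form used by the authors may differ, and the example numbers are hypothetical.

```python
from scipy.stats import hypergeom

def network_enrichment_p(N, r, n, y):
    """P(X >= y) for a hypergeometric variable X: y differentially expressed
    genes falling inside a transcriptional network of r genes, given n
    differentially expressed genes among N elements in total."""
    return hypergeom.sf(y - 1, N, r, n)

# Hypothetical example: ~18,000 elements, a 120-gene network,
# 400 differentially expressed genes, 15 of them inside the network.
p = network_enrichment_p(18000, 120, 400, 15)
print(p)
```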
Fig. 1. Overall implementation of the computational analysis of the microarray data, the EIB identification and the transcriptional network reconstruction. First, we implemented a genomic catalog where each gene annotation feature (e.g. GO, KEGG) is joined with gene expression values. Then, the nearest neighbor voting (NNV) algorithm selects gene transcriptional interactions based on a genetic algorithm (GA) using the ROC curve as fitness function.
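The sketch below illustrates the flavor of this GA step: candidate weight vectors over the joined expression/annotation features are scored by the ROC AUC of the resulting linear score, the fittest half is kept, and random mutation generates new candidates. The feature matrix, labels, and parameter values are hypothetical; this is not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evolve_feature_weights(X, y, pop_size=30, generations=200, seed=0):
    """X: (samples x features) matrix of joined expression/annotation attributes;
    y: binary labels (e.g., infected vs. control samples)."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(pop_size, X.shape[1]))
    for _ in range(generations):
        fitness = np.array([roc_auc_score(y, X @ w) for w in pop])
        elite = pop[np.argsort(fitness)[-(pop_size // 2):]]        # keep the fittest half
        mutants = elite + rng.normal(0.0, 0.05, size=elite.shape)  # occasional random mutation
        pop = np.vstack([elite, mutants])
    fitness = np.array([roc_auc_score(y, X @ w) for w in pop])
    return pop[np.argmax(fitness)]
```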
2.2. Overlapping of Gene Expression Networks and PPIs
After the formation of specific gene expression transcriptional interactions, we used the Information Hyperlinked over Proteins (iHOP) database (18, 19) and the Human Protein Reference Database (HPRD) (20) to identify and retrieve physical and in-vitro protein interactions reported in the scientific literature.
2.3. Smallpox-Human Protein Domain Analysis
To determine the level of co-occurrence of protein domains in smallpox and humans, we used HMMer version 2.3.2 to query the PFAM protein domain database release 19 (15) against the 197 proteins coded by the smallpox genome and all the human proteins from GenBank (21). All PFAM domains with statistically significant hits (E-value cutoff < 1e-03) to the smallpox proteins were used to create the smallpox-pfam protein domain database. The same procedure was applied to the set of human proteins to derive the human-pfam database. From the comparison of the smallpox-pfam and human-pfam databases we derived a set of protein domains that co-occur in both organisms. We then constructed a protein interaction network based on the protein domains present in variola and humans. In such a network, the nodes represent all human and variola proteins, and an edge between two proteins (one from each organism) is drawn when they share at least one domain
among them. All proteins without any edges were excluded from the analysis, resulting in a network containing only proteins with shared domains.
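As an illustration of this step, the sketch below derives the co-occurring domains and the shared-domain edges from two hypothetical dictionaries mapping each protein to its significant PFAM domains (e.g., parsed beforehand from HMMer output); it is a simplification, not the authors' pipeline.

```python
def shared_domain_network(smallpox_domains, human_domains):
    """smallpox_domains / human_domains: dicts of protein -> set of PFAM ids."""
    pox_all = set().union(*smallpox_domains.values())
    hum_all = set().union(*human_domains.values())
    co_occurring = pox_all & hum_all          # e.g., 55 of 161 domains in the paper
    edges = [(vp, hp)
             for vp, vd in smallpox_domains.items()
             for hp, hd in human_domains.items()
             if vd & hd]                      # edge when the two proteins share a domain
    # Proteins that appear in no edge are excluded from the final network.
    connected = {p for edge in edges for p in edge}
    return co_occurring, edges, connected
```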
3. RESULTS
3.1. PBMC Gene Patterns as Early Infection Biomarkers
Our re-analysis of more than 5.5 million data points, including 18,000 human genes, identified a transcriptional network that represents early infection biomarkers (EIBs) with gene profile patterns similar across the animals used in this study (Table 1). The level of expression of these EIBs coincided with disease severity (Figure 2). This transcriptional network is composed of 58 gene functions, 23 of them representing membrane receptors, signal transduction, and cell differentiation pathways involved in cell-to-cell communication, DNA binding and repair, as well as immune responses (Figure 3).
Table 1. List of main human genes and clones considered as smallpox EIBs.
CLONE ID          GENE NAME                                                      SYMBOL
IMAGE:82734       Acyl-CoA synthetase long-chain family member 1                 ACSL1
IMAGE:2014138     Acyl-CoA synthetase long-chain family member 1                 ACSL1
IMAGE:1271662     Alkaline phosphatase, liver/bone/kidney                        ALPL
IMAGE:2020917     Arachidonate 5-lipoxygenase                                    ALOX5
IMAGE:67759       Arachidonate 5-lipoxygenase-activating protein                 ALOX5AP
IMAGE:186945      Ataxia telangiectasia                                          ATM
IMAGE:730433      Bactericidal/permeability-increasing protein                   BPI
IMAGE:1881943     Carcinoembryonic antigen-related cell adhesion molecule 1      CEACAM1
IMAGE:2248876     Cathepsin G                                                    CTSG
IMAGE:1551030     Cytidine deaminase                                             CDA
IMAGE:1552797     Cytidine deaminase                                             CDA
IMAGE:814655      Dehydrogenase/reductase (SDR family) member 9                  DHRS9
IMAGE:1914863     Dysferlin, limb girdle muscular dystrophy                      DYSF
IMAGE:1881815     Ectonucleoside triphosphate diphosphohydrolase 3               ENTPD3
IMAGE:711918      Glutaminyl-peptide cyclotransferase (glutaminyl cyclase)       QPCT
IMAGE:684912      Grancalcin, EF-hand calcium binding protein                    GCA
IMAGE:2508044     Haptoglobin                                                    HP
IMAGE:564325      In multiple clusters                                           —
IMAGE:1837472     Interleukin 13                                                 IL13
IMAGE:741497      Lipocalin 2 (oncogene 24p3)                                    LCN2
IMAGE:223176      MAX dimerization protein 1                                     MAD
IMAGE:505873      Phosphorylase, glycogen; liver                                 PYGL
IMAGE:2435216     Phosphorylase, glycogen; liver (Hers disease)                  PYGL
IMAGE:200409      Poly (ADP-ribose) polymerase family, member 1                  PARP1
IMAGE:429029      Protein phosphatase 2, regulatory subunit, delta isoform       PPP2R5D
IMAGE:430169      Transcribed locus                                              —
IMAGE:42402       Zinc finger protein 276 homolog (mouse)                        ZFP276
[Figure 2 shows heatmap panels of EIB expression levels for groups labeled: Harper-99 / India-7124, aerosol, 0 day; Harper-99 / India-7124, aerosol, survival 35-42 days; Harper-99 / India-7124, aerosol, dead within 5-11 days; Harper-99 / India-7124, IV-aerosol, dead within 4 days.]
Figure 2. Gene expression patterns of EIBs associated with different stages of disease severity, uniformly co-expressed across 21 cynomolgus macaques (Macaca fascicularis). Darker intensities depict higher gene expression and disease severity.
[Figure 3 is a network diagram of molecular function terms; legible node labels include oxidoreductase activity, signal transducer activity, cytokine binding, cytokine receptor binding, interleukin-13 receptor binding, and growth factor binding.]
Figure 3. Molecular function representation of transcriptional networks considered as EIBs.
3.2. Protein-Protein and Gene Expression Interactions
We expected to find a small number of genes reported as physically interacting proteins to be co-expressed in the gene expression matrix; however, we found no correlation between the transcriptional and proteomic levels (Figures 4a and 4b). Even when we retrieved the expression of genes reported to be physically interacting (Figure 4b), their gene expression profiles were down-regulated in both control and treated animals and did not provide any discriminative information (data not shown).
Figure 4a. Gene transcriptional co-expression. Figure 4b. Protein-protein physical interaction data.
3.3. Protein Domain Comparison between Human and Smallpox
Our analysis uncovered 161 PFAM domains present in smallpox proteins. Of those, only 55 are also present in humans. The protein domain network displays a scale-free structure (Figure 5). In humans, these domains participate in biological processes such as blood coagulation, complement activation, fibrinolysis, angiogenesis, inflammation, tumor suppression, and hormone transport. For example, the immunoglobulin V-set domain family was represented in 339 different human proteins, including several T-cell receptors such as CD2, CD4, CD80, and CD86. Also, 24 human proteins contain the TNFR_c6 PFAM domain, which provides a co-stimulatory signal for T and B cell activation and is involved in T cell development.
[Figure 5 is a bar chart; the y-axis (0-750) gives the number of protein hits per PFAM domain.]
Fig. 5. Top PFAM motif hit distribution in humans and variola major virus.
4. DISCUSSION
Until this report, the analysis of microarray data was performed using either clustering or classification algorithms. These approaches did not consider the biological annotation available for each gene, or incorporated this information only after the analysis (10, 22). Instead, our analysis determined simultaneously the importance of each gene and its likelihood of interaction based on the gene expression values as well as on different annotation features available in several databases. In this work, we identified a specific set of genes with a transcriptional state (on/off) associated with the health condition of the animals exposed to variola. These patterns are independent of the strain or the type of exposure (aerosol or intravenous) used in these experiments (5, 16). Despite our success, we believe that in order to truly identify EIBs it is necessary to collect transcriptional information as early as 6 hours after pathogen exposure. Nonetheless, our comparison of the smallpox EIB transcriptional network against other experiments using RNA viral infections suggests that the EIBs reported here are specific for smallpox exposure (data not shown).
Detailed analysis of the molecular function of each EIB revealed their participation in multiple biological processes. For example, the protein encoded by the ataxia-telangiectasia mutated (ATM) gene is a phosphatidylinositol 3-kinase involved in cancer predisposition and radiation sensitivity (23, 24). This gene regulates DNA repair, apoptosis, the cell cycle and toll-like receptor signaling (25). The protein encoded by the carcinoembryonic antigen-related cell adhesion molecule 1 precursor (CEACAM1) gene is a cell-cell adhesion molecule found in leukocytes, epithelia, endothelia, neutrophils, monocytes, macrophages, and B and T lymphocytes. This protein has roles in the differentiation and arrangement of tissue three-dimensional structure, angiogenesis, apoptosis, tumor suppression, metastasis, and the modulation of innate and adaptive immune responses. Haptoglobin (Hp) is a regulator of the Th1/Th2 balance and exhibits a dose-dependent inhibitory effect on human T lymphocyte release of the Th2 cytokines (IL-4, IL-5, IL-10 and IL-13). These distinct biological roles, and the fact that there is no correlation between EIB co-expression and the protein interaction data reported in the literature, point to two main modes of host-viral interaction. First, since smallpox can manipulate immune response mechanisms (26, 27), it is plausible that the host activates alternative viral defense responses, including the EIBs reported here. Second, it is possible that smallpox proteins regulate the expression of the EIBs and use these protein products to complete specific biological processes. Since the level of up-regulation of EIBs was related to disease severity, and many of these human genes are involved in DNA repair and carcinogenesis (a feature known to be used by variola viruses), it is more likely that the pathogen takes advantage of these proteins to complete key biological processes. Protein motif profiling of the complete human and smallpox genomes reveals that 111 PFAM domains are specific for smallpox and are present mostly in one viral protein. In addition, 55 domains present in smallpox are also shared among different human protein families involved in key regulatory processes such as interferon inhibition, blood coagulation, inflammation, tumor suppression and T-cell differentiation. Several of these domains were present in our EIBs, thus suggesting that smallpox regulates host gene expression and that virus protein-host protein interactions might result in better viral cellular entry, replication and budding. It is also plausible that smallpox proteins block human proteins involved in immune responses.
5. CONCLUSIONS
We presented a new computational analysis utilizing microarray gene expression data and molecular annotation to identify infection biomarkers potentially specific to variola major. Results of the weight optimization of the associated features can be useful in giving researchers an indication of what defines a particular transcriptional network during pathogen infection. Overall, our
approach uncovered a set of genes associated with disease severity and progression, independent of the variola strain or the type of exposure used for the challenge. The profiling of PFAM domains also pointed to the possibility of variola utilizing specific host gene products to complete key biological processes. These results have important implications for diagnostics, vaccine efficacy assessment, and the development of therapeutic countermeasures.
6. ACKNOWLEDGMENTS
M.G.K. would like to thank the intramural research program of the National Institutes of Health for its support of this work.
7. REFERENCES
1. Mack, T. (2003) N Engl J Med 348, 460-3.
2. Henderson, D. A., Inglesby, T. V., Bartlett, J. G., Ascher, M. S., Eitzen, E., Jahrling, P. B., Hauer, J., Layton, M., McDade, J., Osterholm, M. T., O'Toole, T., Parker, G., Perl, T., Russell, P. K. & Tonat, K. (1999) JAMA 281, 2127-37.
3. Berche, P. (2001) Trends Microbiol 9, 15-8.
4. Schatzmayr, H. G. (2001) Cad Saude Publica 17, 1525-30.
5. Rubins, K. H., Hensley, L. E., Jahrling, P. B., Whitney, A. R., Geisbert, T. W., Huggins, J. W., Owen, A., Leduc, J. W., Brown, P. O. & Relman, D. A. (2004) Proc Natl Acad Sci U S A 101, 15190-5.
6. Azuaje, F. (2003) Brief Bioinform 4, 31-42.
7. Futschik, M. E. & Carlisle, B. (2005) J Bioinform Comput Biol 3, 965-88.
8. Uetz, P., Dong, Y. A., Zeretzke, C., Atzler, C., Baiker, A., Berger, B., Rajagopala, S., Roupelieva, M., Rose, D., Fossum, E. & Haas, J. (2005) Science.
9. Jensen, L. J., Saric, J. & Bork, P. (2006) Nat Rev Genet 7, 119-29.
10. Valdivia-Granda, W. A. (2003) Strategies for Clustering, Classifying, Integrating, Standardizing and Visualizing Microarray Gene Expression Data (Kluwer Academic Publishers, New York).
11. Al-Shahrour, F., Diaz-Uriarte, R. & Dopazo, J. (2004) Bioinformatics 20, 578-80.
12. Khatri, P., Desai, V., Tarca, A. L., Sellamuthu, S., Wildman, D. E., Romero, R. & Draghici, S. (2006) Nucleic Acids Res 34, W626-31.
13. Burnett, J. C., Henchal, E. A., Schmaljohn, A. L. & Bavari, S. (2005) Nat Rev Drug Discov 4, 281-97.
14. Buller, R. M. & Palumbo, G. J. (1991) Microbiol Rev 55, 80-122.
15. Finn, R. D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L. & Bateman, A. (2006) Nucleic Acids Res 34, D247-51.
16. Jahrling, P. B., Hensley, L. E., Martinez, M. J., Leduc, J. W., Rubins, K. H., Relman, D. A. & Huggins, J. W. (2004) Proc Natl Acad Sci U S A 101, 15196-200.
17. Perera, A. D., A., Kotala, P., Valdivia-Granda, W. A. & Perrizo, W. (2003) SIGKDD Explorations 4, 108-109.
18. Hoffmann, R. & Valencia, A. (2005) Bioinformatics 21 Suppl 2, ii252-ii258.
19. Hoffmann, R., Krallinger, M., Andres, E., Tamames, J., Blaschke, C. & Valencia, A. (2005) Sci STKE 2005, pe21.
20. Mishra, G. R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P., Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T. M., Menon, S., Hanumanthu, G., Gupta, M., Upendran, S., Gupta, S., Mahesh, M., Jacob, B., Mathew, P., Chatterjee, P., Arun, K. S., Sharma, S., Chandrika, K. N., Deshpande, N., Palvankar, K., Raghavnath, R., Krishnakanth, R., Karathia, H., Rekha, B., Nayak, R., Vishnupriya, G., Kumar, H. G., Nagini, M., Kumar, G. S., Jose, R., Deepthi, P., Mohan, S. S., Gandhi, T. K., Harsha, H. C., Deshpande, K. S., Sarker, M., Prasad, T. S. & Pandey, A. (2006) Nucleic Acids Res 34, D411-4.
21. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J. & Wheeler, D. L. (2006) Nucleic Acids Res 34, D16-20.
22. Hsiao, A., Ideker, T., Olefsky, J. M. & Subramaniam, S. (2005) Nucleic Acids Res 33, W627-32.
23. Cortes, M. L., Oehmig, A., Perry, K. F., Sanford, J. D. & Breakefield, X. O. (2006) Neuroscience.
24. Boutell, C. & Everett, R. D. (2004) J Virol 78, 8068-77.
25. Murr, R., Loizou, J. I., Yang, Y. G., Cuenin, C., Li, H., Wang, Z. Q. & Herceg, Z. (2006) Nat Cell Biol 8, 91-9.
26. Shchelkunov, S. N. (2003) Mol Biol (Mosk) 37, 41-53.
27. Jackson, S. S., Ilyinskii, P., Philippon, V., Gritz, L., Yafal, A. G., Zinnack, K., Beaudry, K. R., Manson, K. H., Lifton, M. A., Kuroda, M. J., Letvin, N. L., Mazzara, G. P. & Panicali, D. L. (2005) J Virol 79, 6554-9.
COMPUTATIONAL APPROACHES TO METABOLOMICS: AN INTRODUCTION
DAVID S. WISHART
Depts. of Computing Science and Biological Sciences, University of Alberta, 2-21 Athabasca Hall, Edmonton, AB, T6G 2E8, Canada
RUSSELL GREINER
Dept. of Computing Science, University of Alberta, 2-21 Athabasca Hall, Edmonton, AB, T6G 2E8, Canada
1. Session Background and Motivation
This marks the first time that the Pacific Symposium on Biocomputing has hosted a session specifically devoted to the emerging computational needs of metabolomics. Metabolomics, or metabonomics as it is sometimes called, is a relatively new field of "omics" research concerned with the high-throughput identification and quantification of the small molecule metabolites in the metabolome (i.e., the complete complement of all small molecule metabolites found in a specific cell, organ or organism). It is a close counterpart to the genome, the transcriptome and the proteome. Together these four "omes" constitute the building blocks of systems biology. Even though metabolomics is primarily concerned with tracking and identifying chemicals as opposed to genes or proteins, it still shares many of the same computational needs with genomics, proteomics and transcriptomics. For instance, just like the other "omics" fields, metabolomics needs electronically accessible and searchable databases, it needs software to handle or process data from various high-throughput instruments such as NMR spectrometers or mass spectrometers, it needs laboratory information management systems (LIMS) to manage the data, and it needs software tools to predict or find information about metabolite properties, pathways, relationships or functions. These computational needs are just beginning to be addressed by members of the metabolomics community. As a result, we believed that a PSB session devoted to this topic could address a number of important issues concerning both the emerging computational needs and the nascent computational trends in metabolomics. This year we solicited papers that focused specifically on describing novel methods for the acquisition, management and analysis of metabolomic data. We were particularly interested in papers that covered one of the five following topics: 1) metabolomics databases; 2) metabolomics LIMS; 3) spectral analysis tools for metabolomics; 4) medical or applied metabolomics;
and 5) metabolic data mining. In total there were 15 papers submitted to this session, with 8 papers accepted (5 for oral presentation). The papers we received covered an enormous range of metabolomics and computational topics, and most were of very high quality. All papers underwent rigorous peer review with up to 4 reviewers for each paper. We are particularly grateful to Lori Querengesser, who helped coordinate the review process, and to the 31 expert reviewers who provided thoughtful and informative comments for each paper.
Session Summary
Among the papers accepted for publication in this year's proceedings are two manuscripts concerned with metabolomic databases and metabolomic laboratory information management systems (LIMS). The paper by Markley et al describes a suite of databases and LIM systems developed at the University of Wisconsin (Madison) including: 1) the BioMagResBank metabolomics database - which contains experimental NMR data on -250 metabolites; 2) the Madison Metabolomics Consortium Database (MMCD) database - which contains literature derived data on -10,000 Arabadopsis-related metabolites and 3) SESAME - a LIM system designed to handle the unique and diverse data needs of metabolomics researchers. Nicely complementing this work by the Madison group is the paper by Scholz and Fiehn, which describes a general izable metabolomics LIM system called SETUPX. This XML-based system supports nearly every aspect of metabolomic-based lab workflow management, data entry and data processing. It also makes use of publicly available taxonomic and ontology repositories to ensure data integrity and logical consistency. These and other features will likely make SETUPX a gold standard for the design and implementation of other metabolomics LIMS. Another area of active research in computational metabolomics is data mining and automated data retrieval. This year's proceedings includes two such papers. The manuscript by Knox et al describes a metabolome annotation tool, called BioSpider, that specifically seeks out chemical and biological data through text mining and data extraction. It access snippets of data from dozens of websites and electronic databases and assembles them into comprehensive (80+ data fields) "metabo-cards". Given the dearth of electronically accessible data about most metabolites, this tool will likely be very popular within the metabolomics community. A related manuscript by Ganesan et al. provides a nice overview and a critical assessment of various data harvesting and data profiling tools in genomics and proteomics and their potential applications to metabolic profiling. Two papers were also accepted into this year's proceedings which describe and assess the application of existing metabolomic software to realworld medical problems. The paper by Yoon et al describes a novel and very sophisticated approach to using metabolic flux profiling to characterize the metabolic fate of the anti-diabetic drug, troglitazone, in the liver. The authors
use experimental data combined with flux balance analysis and metabolic reaction network analysis to unearth some of the underlying reasons for this drug's hepatotoxicity. This provides a superb example of the potential impact that computational metabolomics could have in the area of pharmacology and pharmacotoxicity. A related paper by Yang et al. describes the application of targeted metabolite identification and ¹³C isotopomer analysis towards distinguishing between cells derived from normal and cancerous breast tissue. Using existing computational tools, the authors identified a number of important changes in metabolic pathways and the redistribution of metabolic fluxes that may point to new diagnostic and therapeutic targets. The paper by Chang et al. describes the development and comparison of a targeted profiling technique that allows 1D ¹H NMR spectra of metabolite mixtures (i.e. blood, urine, tissue extracts) to be analyzed and the compounds within the mixtures to be identified and quantified. The method uses a library of pure metabolite spectra that can be used to fit experimentally collected NMR spectra. The authors compare this targeted profiling method to other spectral analysis tools and demonstrate that their method is generally more robust and often more useful for metabolite analysis. Finally, the paper by Karakoc et al. addresses an interesting question about what chemical characteristics distinguish plant or animal metabolites from bacterial or fungal metabolites. This is an important question as many drugs and drug precursors are derived from plants, bacteria and fungi. Likewise, with the advent of high-throughput metabolomics, many metabolites are being identified but their biological origins are unclear or unknown. The authors of this paper use classical cheminformatic techniques (QSAR, clustering, neural network analysis) to show that plant, fungal, bacterial and mammalian metabolites do have characteristic features that can distinguish one from another. Overall, the manuscripts appearing in this year's proceedings provide a nice cross-section of the activities and advances being made in computational metabolomics. No doubt, as the field matures, the focus will shift from developing tools and platforms for metabolite data extraction and storage to the development of software to aid in the interpretation of that data. Over time, we are hopeful that metabolomic studies will become more integrated with genomic or proteomic studies and that software tools will eventually be developed to aid scientists in the prediction of the physiological or metabolic consequences of drugs, foods or genetic lesions.
Acknowledgments We would like to thank Genome Alberta, a division of Genome Canada, for their financial support and technical assistance in making this session possible.
LEVERAGING LATENT INFORMATION IN NMR SPECTRA FOR ROBUST PREDICTIVE MODELS

DAVID CHANG 1,2, AALIM WELJIE 1,3, JACK NEWTON 1

1 Chenomx Inc., Suite 800, 10050 112 Street, Edmonton, Alberta, Canada
2 Department of Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, Canada
3 Metabolomics Research Centre, University of Calgary, Calgary, Alberta, Canada
A significant challenge in metabolomics experiments is extracting biologically meaningful data from complex spectral information. In this paper we compare two techniques for representing 1D NMR spectra: "Spectral Binning" and "Targeted Profiling". We use simulated 1D NMR spectra with specific characteristics to assess the quality of predictive multivariate statistical models built using both data representations. We also assess the effect of different variable scaling techniques on the two data representations. We demonstrate that models built using Targeted Profiling are not only more interpretable than Spectral Binning models, but are also more robust with respect to compound overlap and variability in solution conditions (such as pH and ionic strength). Our findings from the synthetic dataset were validated using a real-world dataset.
1. Introduction

Nuclear Magnetic Resonance (NMR) spectroscopy is a widely used tool in the rapidly growing field of metabolomics, where the measurement of small molecule metabolites provides a chemical "snapshot" of an organism's metabolic state [1]. NMR is inherently quantitative and non-selective; thus a wealth of chemical information can be extracted from a single NMR spectrum. Metabolomics studies often couple NMR spectral data with principal component analysis (PCA) and other pattern recognition techniques to uncover meaningful patterns in data sets [2]. Long-term goals of such computational model building include automation of data analysis as part of an integrated diagnostics platform [3] and personalized therapies [4]. Building statistical models from NMR spectra can be problematic, however, as spectral distortions present potentially confounding artifacts to techniques such as PCA [5, 6].
These distortions have an origin in the hardware [7], the type and nature of the sample, and choice of acquisition and processing parameters [8]. For example, pre- and post-processing algorithms and the signal-to-noise (S/N) in the time domain impact data quality. Metabolite signals in complex mixtures often span several orders of magnitude, thus requiring a significant dynamic range in the receiver. Furthermore, aqueous samples such as urine or plasma require suppression of the water solvent peak which is 7-8x more concentrated than the metabolites of interest, resulting in distortions of the baseline and intensity of metabolite signals. Metabolites' resonance frequencies, lineshapes, and linewidths will vary between samples within an NMR metabolomics dataset irrespective of hardware considerations. Factors influencing these chemical modulations include sample pH, ionic composition, and inter-metabolite interactions [9]. As a result, statistical analyses require some form of pre-processing or data reduction to ensure that the variables of interest are representative of the underlying chemical data [10]. In this paper, the impact of spectral distortion on the quality of predictive statistical models built upon two alternative representations of NMR data is assessed. A simulated dataset is used to model various types of spectral distortion in a systematic manner, and two techniques for dimensionality reduction, spectral binning and targeted profiling, are used to represent these simulated spectra. The results are assessed using the regression/classification extension of PCA, partial least squares for discriminant analysis (PLS-DA) [11]. We validate our findings using a real-world data set of rat-brain extracts. 2. NMR Data Representations An NMR spectrum is a linear combination of characteristic signals for each compound that is present in a given sample. As the concentration of a particular compound changes, the characteristic signal for that compound responds in a linear fashion. Thus, an NMR spectrum can be viewed from a theoretical perspective as follows:
d_obs = c · [a ⊗ s] + u + n        (1)

(dimensions: [1×n] = [1×k] · [k×n] + [1×n] + [1×n])
where d_obs is a [1×n] vector of the observed NMR data, c is a [1×k] vector representing the concentrations of k known compounds in the mixture, and s represents a [k×n] matrix of the spectral signatures present in the solution. a is a spectrum calibration function that is applied to each row of s to account for changes in the sample's pH, ionic strength, etc. The term u represents unknown contributions to the signal from unknown metabolites, lipoproteins, or any other contributions to the signal that are not explicitly modeled using s. Finally, n is the noise introduced into the observed spectrum by the NMR hardware and processing algorithms. 2.1. Spectral Binning Spectral binning [2] is a widely-used technique where the spectrum is subdivided into a number of regions, and the total area within each bin is used as an abstracted representation of the original spectrum. The area encapsulated by a bin would ideally capture all of the area associated with a given resonance across all spectra in the dataset, thereby mitigating the effect of minor peak shift and line width variations for a compound across samples. A typical 64k NMR spectrum would be reduced using bin widths of 0.04 ppm, resulting in ~250 bin integral values. Spectral binning is agnostic of the underlying generative model described in Equation 1; however, it is commonly used due to the ease of implementation and complete spectral coverage. 2.2. Targeted Profiling Targeted profiling [8] is a technique that leverages a reference spectral database to directly recover the concentration vector c from Equation 1, which is then used as the input to pattern recognition techniques such as PCA or PLS-DA. Targeted profiling can be viewed as a method of recovering the latent variables, in the form of underlying metabolite concentrations, that generated the observed spectral data. Because of its reliance on a spectral database s, targeted profiling does not directly model or deal with the unknown term u in Equation 1. Since u may contain potentially
important latent chemical information, it can be calculated directly as the residual from Equation 1, and spectral database-agnostic techniques such as spectral binning can be applied to u for subsequent analysis.
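For illustration only, the sketch below (our own code, not the authors') contrasts the two representations of Equation 1: fixed-width bin integration for spectral binning, and a plain least-squares fit of a library of pure-compound signatures for targeted profiling. The bin width matches the 0.04 ppm value quoted above; in practice targeted profiling fits pH-adjusted reference signatures rather than performing a single linear solve.

import numpy as np

def spectral_binning(ppm, intensities, bin_width=0.04):
    # integrate the spectrum over fixed-width ppm bins (Section 2.1)
    edges = np.arange(ppm.min(), ppm.max() + bin_width, bin_width)
    labels = np.digitize(ppm, edges)
    return np.array([intensities[labels == i].sum() for i in range(1, len(edges))])

def targeted_profiling(intensities, reference_spectra):
    # recover the concentration vector c in d_obs = c · [a ⊗ s] + u + n by an
    # ordinary least-squares fit of pure-compound signatures (a simplification
    # of Section 2.2: no pH adjustment of the library is attempted here)
    c, *_ = np.linalg.lstsq(reference_spectra.T, intensities, rcond=None)
    return c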
3. Methods
3.1. Synthetic Study

Several synthetic data sets were generated with specific characteristics to simulate, in a systematically controlled manner, some of the key challenges inherent in working with NMR data. The data for the synthetic study were generated using Chenomx NMR Suite 4.5 (Chenomx Inc., Edmonton, Alberta, Canada) compound database entries. Varying mixtures of twenty compounds, with the addition of DSS at 0.5 mM, were simulated. Compound concentrations for the following compounds were sampled randomly from a normal distribution: 2-oxoglutarate, acetate, acetone, alanine, betaine, carnitine, citrate, creatine, dimethylamine, fumarate, glucose, lactate, maleate, myo-inositol, taurine, tryptophan, tyrosine, urea, u-methylhistidine and x-methylhistidine. Biologically viable population statistics of mean and standard deviation were used for each compound [Chang, Rankin, McGeer, Shah, Marrie, and Slupsky, submitted] and these concentrations remained fixed from simulation to simulation. Random uncorrelated noise was added to each spectrum in the frequency domain. Each spectrum was generated to have an equivalent amount of noise, with an approximate signal-to-noise ratio (SNR) of 100:1. The effect of pH variability was simulated by randomly varying compound resonance frequencies within an empirically validated range. This range reflects the compound's NMR frequency response to pH levels ranging from pH 4 to 9, as determined from pH curves of pure reference spectra. The magnitude of this range was controlled to test the effects of pH variation via a transform fraction parameter. A fraction of 1.0 allowed clusters to be transformed over the entire pH 4 to 9 range, while a fraction of 0.1
would allow for clusters to be transformed over 10% of the range, centered at pH 7.0. The actual pH range that this represents will be different for each compound depending on the relative pH sensitivity of the compound near pH 7.0. In order to generate two classes of spectra, the population statistics of one or more metabolites were changed for each simulation. The parameters used in each simulation are outlined in Table 1.

Table 1. Simulation parameters for the synthetic study.

Simulation 1: Number of files: 200 (100 of each class); SNR: 100; Transform fraction: 0.1; Group 1 citrate/tryptophan mean ± stdev (µmol): 2318 ± 1496 / 5 ± 2; Group 2 citrate/tryptophan mean ± stdev (µmol): 1031 ± 945 / 10 ± 2.
Simulation 2: Number of files: 200 (100 of each class); SNR: 100; Transform fraction: 0.1; Group 1 maleate mean ± stdev (µmol): 30 ± 15; Group 2 maleate mean ± stdev (µmol): 60 ± 20.
Simulation 3: Number of files: 200 (100 of each class); SNR: 100; Transform fraction: 1; Group 1 citrate/tryptophan mean ± stdev (µmol): 2318 ± 1496 / 5 ± 2; Group 2 citrate/tryptophan mean ± stdev (µmol): 1031 ± 945 / 10 ± 2.
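As a hedged illustration of the simulation protocol, the following sketch (assumed Lorentzian lineshapes and an arbitrary linewidth; not the authors' generator) produces one synthetic class member: a concentration drawn from a normal distribution, peak positions shifted within a fraction of their pH-sensitive range, and frequency-domain noise at roughly 100:1 SNR.

import numpy as np

rng = np.random.default_rng(0)
ppm = np.linspace(0.0, 10.0, 2 ** 14)            # simulated chemical-shift axis

def lorentzian(axis, centre, width=0.005):
    return (width / 2) ** 2 / ((axis - centre) ** 2 + (width / 2) ** 2)

def simulate_spectrum(peaks, conc_mean, conc_sd, ph_shift_range, transform_fraction=0.1):
    conc = max(rng.normal(conc_mean, conc_sd), 0.0)      # concentration draw
    spectrum = np.zeros_like(ppm)
    for centre in peaks:
        # shift each resonance within a fraction of its pH-sensitive range
        shift = rng.uniform(-0.5, 0.5) * transform_fraction * ph_shift_range
        spectrum += conc * lorentzian(ppm, centre + shift)
    noise_sd = spectrum.max() / 100.0                    # roughly 100:1 SNR
    return spectrum + rng.normal(0.0, noise_sd, ppm.size)

# hypothetical citrate-like pattern with two resonances near 2.5-2.7 ppm
example = simulate_spectrum([2.54, 2.67], conc_mean=2318, conc_sd=1496, ph_shift_range=0.1)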
3.2. Rat Brain Extracts

This real-world dataset is based on a previously published [12] dataset and was kindly provided by Dr. Brent McGrath and Dr. Peter Silverstone (Department of Psychiatry, University of Alberta). Twelve adult male Sprague-Dawley rat brains were dissected into frontal cortex (fcx), temporal cortex (tcx), occipital cortex (ocx) and hippocampus (hipp) regions according to stereotaxic demarcation [12]. For spectral binning, bin widths of 0.04 ppm were used, with the following dark regions defined: DSS (the internal standard): -0.1 to 0.1 ppm and 0.6-0.7 ppm; methanol (a byproduct of the extraction process): 3.33-3.37 ppm; water: 4.5-5.5 ppm; imidazole (the pH indicator): 7.13-7.5 and 7.82-8.68 ppm. The following compounds were identified and quantified using the targeted profiling technique [8] as implemented in Chenomx NMR Suite 4.5: 4-aminobutyrate, acetate, adenosine, alanine, aspartate, betaine, choline, citrate, creatine, creatinine, formate, fumarate, glutamate, glutamine, glycerol, glycine, hypoxanthine, isoleucine,
lactate, leucine, lysine, methanol, N-acetylaspartate, serine, succinate, taurine, threonine, tyrosine, valine, xanthine, and myo-inositol.

3.3. Multivariate Statistical Modeling

All multivariate modeling was performed using SIMCA-P+ 11.0 from Umetrics Inc. Permutation tests were performed using 100 permutations. R²X and R²Y are calculated as the fraction of the sum of squares of all X and Y that the model can explain using the latent variables. Q² is the fraction of the total variation in Y that can be predicted by the model via seven-fold cross-validation.

4. Results

4.1. Synthetic Data

By systematically varying key properties of the synthetic data sets, several aspects of building statistical models on NMR data representations were assessed. The first issue assessed was the effect of noise on the spectra. Specifically, noise was added to the spectrum to see how robust both the spectral binning and targeted profiling methods were at recovering the latent information in the data in the presence of noise. What was observed was that if the noise was completely uncorrelated, then both methods are very robust to varying noise levels. (Data are available in the supplementary materials.) The next issue we examined was the choice of variable scaling and normalization methods, since this can have a large impact on the quality of results obtained from multivariate statistical methods such as PLS-DA. Normalization for all spectral binning data was to the total area of the NMR spectrum. No normalization was necessary for the targeted profiling results, since direct quantification can be obtained with the addition of an internal standard. Both the spectral binning data and the targeted profiling data were mean centered and were scaled using unit variance (UV) or Pareto scaling. UV scaling involves weighting each of the variables by the variables' group standard deviation, and has the advantage of not biasing statistical models towards large concentration
compounds or high area bins. Pareto scaling involves weighting each of the variables by the variables' group variance, which minimizes the impact of noise. Data from simulation #1 were used to evaluate the effects of these two scaling procedures. This simulation encoded class differentiation through citrate, present at relatively high concentrations, and tryptophan, present at relatively low concentrations. Figure 1a demonstrates that PLS-DA on UV-scaled data can recover differences in both tryptophan and citrate, while the loadings plot of Pareto-scaled data (Figure 1b) is only able to distinguish the intense citrate signal. UV scaling was superior to Pareto scaling in recovering a model that accurately reflected the variables of interest (both low- and high-concentration metabolites) for both targeted profiling and spectral binning data. Overlap of NMR resonances from different metabolites is another issue hampering the analysis of complex biofluid spectra. Further complications arise from compound overlap with dominant peaks such as urea, where low intensity peaks are often lost in traditional analyses due to the overwhelming magnitude of the urea signal. Simulation #2 generated a dataset in which a single metabolite, maleate, differentiates the two classes and overlaps with the high concentration urea signal, which varies randomly (i.e. urea does not encode class discrimination). Figure 2 shows the scores, loadings, and permutation tests for the spectral binning and targeted profiling methods. One can see from the loadings plot in Figure 2b that targeted profiling identifies maleate as a significant metabolite even under severe overlap conditions, while spectral binning, shown in Figure 2a, fails. Spectral binning is also prone to generating highly overfit models, as shown by the permutation test in Figure 2, whereas targeted profiling models show no signs of overfitting. Permutation tests help assess overfitting by randomly permuting class labels and refitting a new model with the same number of components as the original model. An overfit model will have R² and Q² values similar to those of the randomly permuted data. Well-fit models will have R² and Q² values that are always higher than those of the permuted data.
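A hedged sketch of the permutation test just described, with scikit-learn's PLSRegression standing in for PLS-DA (class labels coded 0/1); the Q² estimate via cross-validated predictions is our illustration, not SIMCA's exact implementation.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def q2_score(X, y, n_components=2, cv=7):
    # Q2 from seven-fold cross-validated predictions
    y_pred = cross_val_predict(PLSRegression(n_components=n_components), X, y, cv=cv)
    press = np.sum((y - y_pred.ravel()) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def permutation_test(X, y, n_permutations=100, **kwargs):
    observed = q2_score(X, y, **kwargs)
    rng = np.random.default_rng(1)
    permuted = np.array([q2_score(X, rng.permutation(y), **kwargs)
                         for _ in range(n_permutations)])
    # a well-fit model keeps its observed Q2 above every permuted Q2
    return observed, permuted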
Figure 1. PLS-DA models (scores plot left, loadings plot right) of targeted profiling data using a) unit variance scaling b) Pareto scaling.
Figure 2. PLS-DA models (scores plot left, loadings plot center, permutation plot right) for a) spectral binning and b) targeted profiling methods under conditions of large overlap.
Sample matrix conditions such as pH and ionic strength can have profound effects on metabolites' NMR resonance frequencies. These shifts can directly influence the quality of the models that are generated using NMR data, and were modeled with simulation #3. Both spectral binning and targeted profiling gave rise to models that were able to separate the data in the latent variable space. However, the quality of the model generated with the spectral binning data was low and resulted in overfitting, as shown in the permutation plots (Supplementary Figures). This is due to the large number of variable weights used in the loadings: a large number of variables share similar weights because the same significant resonances migrate over adjacent bins due to pH/ionic strength variation. Models built on targeted profiling data, which accounts for the shifts in resonance locations directly in the modeling process, are able to separate the two groups and do not overfit the data. The final effect studied was the impact of limited sample size on predictive capacity, a typical problem in metabolomics studies. The effect of sample size was shown using a subset from Simulation #3. The size of the dataset was reduced from 100 to 20 samples in each class. Even with a limited sample size, the targeted profiling approach resulted in well-fit PLS-DA models, as assessed by the permutation tests. While the descriptive features of tryptophan and citrate are not as clearly distinguished in the loadings plot, the permutation plot indicates that even with a small number of samples the data are not overfit. The results for spectral binning, however, are quite deceptive: the PLS-DA model shows very good separation of classes in the scores plot, yet the model generated has an extremely high degree of overfitting - the majority of the randomly permuted models generate Q² values higher than that of the non-permuted model (Supplementary Figures).

4.2. Rat Brain Extract

The rat brain extract dataset is a real-world dataset that exhibits many of the phenomena we have seen in the synthetic dataset. The spectra contain noise, have metabolite resonances that shift due to pH, and have low-concentration metabolites that are important in
differentiating the different brain regions, thus making it a suitable model dataset to validate our findings from the synthetic dataset. This dataset was acquired at high resolution (800 MHz) and contains ~30 NMR-visible compounds. We did not find that the choice of variable scaling affected the quality of the generated models for this dataset. We therefore used unit variance scaling for the results shown below. We found that spectral binning generated a model with lower predictive accuracy than targeted profiling data: Q² for spectral binning was 0.468, whereas Q² for targeted profiling was 0.522. As in our synthetic dataset, we found that spectral binning-based results were prone to overfitting. To test for overfitting, we randomly permuted the class labels for the PLS-DA analysis 100 times. With the spectral binning dataset, we found that some of the models generated with random permutations of the data had higher Q² and R² values than the non-permuted data. This is illustrated in Figure 3a. Internal validation of the model based on the targeted profiling representation of the NMR data does not exhibit any characteristics of an overfit model, as shown in Figure 3b. The targeted profiling representation uses only 27 variables to represent the latent information in the dataset, thereby restricting the degrees of freedom available in the construction of a model and reducing the capacity of the model to overfit the data.
Figure 3. a) Internal validation of spectral binning, showing clear evidence of overfitting, with random permutations of the data generating better R² and Q² values than the non-permuted data. b) Internal validation of targeted profiling, showing a clear decrease in performance on permuted data.
5. Conclusion

We have demonstrated how the inherent properties of NMR spectroscopy can impact the predictive ability of models built upon spectral binning and targeted profiling representations of NMR data by using a novel method for synthetically generating NMR spectra. The quality of the predictive models built was quantitatively assessed, as was the relative robustness of the two methods. Under the experimental design chosen, both methods are very robust with respect to noise. In contrast, variable scaling methods can affect both the quality and interpretability of the models generated. We found that for targeted profiling data, unit variance scaling generates a more robust data representation. Targeted profiling was also found to be an effective dimensionality reduction technique that, overall, is more robust with respect to spectral distortions and high dynamic range metabolites than spectral binning, and is less prone to overfitting than spectral binning models. These findings were validated on a real-world dataset of rat-brain extracts consisting of ~30 NMR-detectable metabolites, in which statistical models based on a targeted profiling representation of the data were less prone to overfitting. Spectral binning is a common method for data reduction due to the speed of analysis, while current targeted profiling implementations require interactive input and are relatively time-intensive. While the rat-brain extract study represents a relatively simple dataset, targeted profiling has successfully been applied to extensive studies of serum [Weljie, Dowlatabadi, Miller, Vogel, Jirik, submitted] and urine [Chang, Rankin, McGeer, Shah, Marrie, and Slupsky, submitted]. As increasingly automated methods for quantitative profiling of NMR data become available, we expect database-driven targeted profiling to become the data-reduction method of choice.

6. Supplementary Information

Supplementary Figures and Data are available at http://www.chenomx.com/publications/PSB2007
References
[1] J. C. Lindon, E. Holmes and J. K. Nicholson, Anal. Chem. 75, 384A (2003)
[2] E. Holmes, H. Antti, Analyst 127, 1549 (2002)
[3] D. S. Wishart, L. M. M. Querengesser, B. A. Lefebvre, N. A. Epstein, R. Greiner and J. B. Newton, Clinical Chemistry 47, 1918 (2001)
[4] T. A. Clayton, J. C. Lindon, O. Cloarec, H. Antti, C. Charuel, G. Hanton, J. P. Provost, J. L. Le Net, D. Baker, R. J. Walley, J. R. Everett and J. K. Nicholson, Nature 440, 1073 (2006)
[5] M. Defernez, I. J. Colquhoun, Phytochemistry 62, 1009 (2003)
[6] S. Halouska, R. Powers, J. Magn. Reson. 178, 88 (2006)
[7] R. Siuda, G. Balcerowska and D. Aberdam, Chemometrics Intell. Lab. Systems 40, 193 (1998)
[8] A. M. Weljie, J. Newton, P. Mercier, E. Carlson and C. M. Slupsky, Anal. Chem. 78, 4430 (2006)
[9] J. C. Lindon, J. K. Nicholson, E. Holmes and J. R. Everett, Concepts in Magnetic Resonance 12, 289 (2000)
[10] B. J. Webb-Robertson, D. F. Lowry, K. H. Jarman, S. J. Harbo, Q. R. Meng, A. F. Fuciarelli, J. G. Pounds and K. M. Lee, J. Pharm. Biomed. Anal. 39, 830 (2005)
[11] Umetrics AB, Multi- and Megavariate Data Analysis: Principles and Applications, Umea (2001)
[12] B. M. McGrath, A. J. Greenshaw, R. McKay, A. M. Weljie, C. M. Slupsky and P. H. Silverstone, Int. J. Neurosci. (In Press) (2006)
BIOINFORMATICS DATA PROFILING TOOLS: A PRELUDE TO METABOLIC PROFILING
NATARAJAN GANESAN, BALA KALYANASUNDARAM, MAHE VELAUTHAPILLAI
Department of Computer Science, Georgetown University, 3900 Reservoir Rd NW, Washington DC 20057, USA
The term metabolic profiling is often used to denote the systematic characterization of the unique biochemical trails or fingerprints left behind by cellular processes. Advances in computational biosciences are often invaluable in dealing with the huge amount of raw data generated from the countless biochemical intermediates that flood the cell at any given time. As a prelude to metabolic profiling, it is essential to completely profile and compile all related information about the genetic and proteomic data. Profiling tools in bioinformatics refer to software (web-based or downloadable) that compiles all related information in a single user interface. Generally, these interfaces take a query such as a DNA, RNA, or protein sequence or keyword, and search one or more databases for information related to that sequence. Summaries and aggregate results are provided in a single standardized format that would otherwise have required visits to many smaller sites or direct literature searches to compile. In other words, they are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases.
1.1. Contents

Introduction and usage; Keyword based profilers; Sequence data based profilers; Microarray analysis tools; Future growth and directions; References and External links.

1.2. Introduction and usage

The "post-genomics" era has given rise to a range of web-based tools and software to compile, organize, and deliver large amounts of primary sequence information, as well as protein structures, gene annotations, sequence alignments, and other common bioinformatics tasks. In general, there exist three types of databases and service providers. The first one includes the popular public-domain or open-access databases supported by funding and grants, such as
NCBI, ExPASy, Ensembl, and PDB. The second one includes smaller or more specific databases organized and compiled by individual research groups. Examples include the Yeast Genome Database and the RNA database. The third and final one includes private corporate or institutional databases that require payment or institutional affiliation to access. Typical scenarios of a profiling approach become relevant, particularly, in the cases of the first two groups, where researchers commonly wish to combine information derived from several sources about a single query or target sequence. For example, users might use the sequence alignment and search tool BLAST to identify homologs of their gene of interest in other species,
Figure 1. A typical example of a keyword profiling tool - Entrez.
and then use these results to locate a solved protein structure for one of the homologs. Similarly, they might also want to know the likely secondary structure of the mRNA encoding the gene of interest, or whether a company sells a DNA construct containing the gene. Sequence profiling tools serve to automate and integrate the process of seeking such disparate information by rendering the process of searching several different external databases transparent to the user. Data profiling assumes particular significance in the burgeoning field of metabonomics, which is soon likely to become a huge warehouse of selectively processed data. Seamless integration of all relevant data should then become the watchword defining the future of metabonomics. Many public databases are already extensively interlinked so that complementary information in another database is easily accessible; for example, GenBank and the PDB are closely intertwined. However, specialized tools organized and hosted by specific research groups can be difficult to integrate into this linkage effort because they are narrowly focused, are frequently modified, or use custom versions of common file formats. Advantages of sequence profiling tools include 1) the ability to use several of these specialized tools in a single query and present the output with a common interface, 2) the ability to direct the output of one set of tools or database searches into the input of another, and 3) the capacity to distribute hosting and compilation obligations across a network of research groups and institutions rather than a single centralized repository.
1.3. Keyword based profilers

Profiling tools based on keyword searches are essentially 'search engines' that are highly specialized for bioinformatics work, thereby eliminating the clutter of irrelevant or non-scholarly hits that might occur with a traditional search engine like Google. Most keyword-based profiling tools allow flexible types of keyword input: accession numbers from indexed databases as well as traditional keyword descriptors.

Figure 2. The network of the Bioinformatic Harvester.

For example, the NCBI search engine Entrez segregates its hits by category, so that users looking for protein structure information can screen out sequences with no corresponding structure, while users interested in perusing the literature on a subject can view abstracts of papers published in scholarly journals without distraction from gene or sequence results. The PubMed biosciences literature database is a popular tool for literature searches but now competes with the more general Google Scholar [2,3]. Keyword-based data aggregation services like the Bioinformatic Harvester provide reports from a variety of third-party servers in an as-is format, so that users need not visit the website or install the software for each individual component service. This is particularly invaluable given the rapid emergence of various sites providing different sequence analysis and manipulation tools. Another aggregative web portal, the Human Protein Reference Database (HPRD), contains manually annotated and curated entries for human proteins. The information provided is thus both selective and comprehensive, and the query format is flexible and intuitive. The pros of developing manually curated databases include presentation of proofread material and the concept of 'molecule authorities' taking responsibility for specific proteins [4,5]. However, the cons are that they are typically slower to update and may not contain very new or disputed data.

Figure 3. The HPRD is manually curated.
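For illustration, a keyword query of the kind such profilers issue can be sent to NCBI's public E-utilities service; the query term and the bare-bones result handling below are our own sketch, not part of any specific profiling tool.

import urllib.parse
import urllib.request

def entrez_search(term, db="pubmed", retmax=5):
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = urllib.parse.urlencode({"db": db, "term": term, "retmax": retmax})
    with urllib.request.urlopen(f"{base}?{params}") as response:
        return response.read().decode()   # XML listing the matching record IDs

print(entrez_search("citrate synthase"))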
1.4. Sequence data based profilers

A typical sequence profiling tool carries this further by using an actual DNA, RNA, or protein sequence as an input and allows the user to visit different web-based analysis tools to obtain the information desired. Such tools are also commonly supplied with commercial laboratory equipment like gene sequencers, or sometimes sold as software applications for molecular biology.

Figure 4. Display of sequence profiling features on a SEQUEROME browser.

In another public-database example, the BLAST sequence search report from NCBI provides a link from its alignment report to other relevant information in its own databases, if such specific information exists. For example, a retrieved record that contains a human sequence will carry a separate link that connects to its location on a human genome map; a record that contains a sequence for which a 3-D structure has been solved would carry a link that connects it to its structure database. SEQUEROME, a public service tool, links the entire BLAST report to many third-party servers/sites that provide highly specific services for sequence manipulation, such as restriction enzyme maps, open reading frame analyses for nucleotide sequences, and secondary structure prediction. The tool provides the added advantage of a tabbed browsing interface to track user operations and thus carry a project to its completion within one browser interface. The consequent evolution of such profilers would thus include the ability to customize and automate the processing of sets of sequence data. Though sequence-based profilers are currently few and far between, their key role will become evident when huge amounts of sequence data need to be cross-processed across portals and domains.
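As an illustration of the BLAST step that such profilers build on, the sketch below submits a query through Biopython's NCBI web-BLAST wrapper. Biopython is our assumption (not mentioned in the text), the sequence is a placeholder, and the downstream linking to third-party services that SEQUEROME performs is not shown.

from Bio.Blast import NCBIWWW, NCBIXML

query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"     # placeholder protein fragment
handle = NCBIWWW.qblast("blastp", "nr", query)    # submit the search to NCBI
record = NCBIXML.read(handle)
for alignment in record.alignments[:3]:           # top hits and their E-values
    print(alignment.title, alignment.hsps[0].expect)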
1.5. Microarray data profiling

Figure 5. Example of an approximately 40,000 probe spotted oligo microarray with enlarged inset to show detail.

Specialized software tools for statistical analysis to determine the extent of over- or under-expression of a gene in a microarray experiment relative to a reference state have also been developed to aid in identifying genes or gene sets
associated with particular phenotypes. One such method of analysis, known as Gene Set Enrichment Analysis (GSEA), uses a Kolmogorov-Smirnov-style statistic to identify groups of genes that are regulated together [1]. This third-party statistics package offers the user information on the genes or gene sets of interest, including links to entries in databases such as NCBI's GenBank and curated databases such as Biocarta and Gene Ontology.
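The toy example below illustrates the Kolmogorov-Smirnov flavour of such an analysis by comparing the scores of genes inside a set with those outside it; it is a simplification, not the GSEA implementation, and the gene names and scores are invented.

import numpy as np
from scipy.stats import ks_2samp

scores = {"geneA": 2.1, "geneB": -0.3, "geneC": 1.7, "geneD": 0.2, "geneE": -1.5}
gene_set = {"geneA", "geneC"}

in_set = np.array([s for g, s in scores.items() if g in gene_set])
out_set = np.array([s for g, s in scores.items() if g not in gene_set])
statistic, p_value = ks_2samp(in_set, out_set)   # larger statistic = stronger shift
print(statistic, p_value)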
1.6. Future growth and directions

The proliferation of diverse bioinformatics tools for genomic and proteomic analysis has led to great advances in helping researchers identify and categorize genes of interest. However, this proliferation can also complicate the user interfaces for advanced users, while confusing and frustrating new users. This is so at a time when metabonomics is beginning to spread its wings and branch out into a new area. The importance of such profiling tools is thus likely to expand very rapidly into other areas like metabolite profiling and the modeling of dynamic living systems. This will occur as researchers and investigators begin to directly upload their raw information into better interfaces for more comprehensive profiles. Future tools are thus likely to evolve into interfaces that seamlessly integrate information from different user-defined portals/services, e.g. mass spectrometric/NMR databases. They are also likely to allow inputs from images (even sounds) of experimental data. For example, a researcher might want to upload the 3D image of a molecule and look for possible targets and genes. The possibilities are endless and the bio-maze is just beginning to get interconnected. Tools like SEQUEROME [6] are a step in this direction. As the information pyramid continues to grow and re-assemble itself, new generations of single-interface systems like those described above are bound to spawn a range of similar tools that address the specific needs of different research groups.

References
1. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102(43):15545-50.
2. Biomedical language processing: what's beyond PubMed? Mol Cell. 2006 Mar 3;21(5):589-94.
3. Google versus PubMed. Ann R Coll Surg Engl. 2005 Nov;87(6):491-2.
4. 'Harvester': a fast meta search engine of human protein resources. Bioinformatics. 2004 Aug 12;20(12):1962-3.
5. 'Harvester': a fast meta search engine of human protein resources. Bioinformatics. 2004 Aug 12;20(12):1962-3.
6. Web-based interface facilitating sequence-to-structure analysis of BLAST alignment reports. Biotechniques. 2005 Aug;39(2):186.
COMPARATIVE QSAR ANALYSIS OF BACTERIAL, FUNGAL, PLANT AND HUMAN METABOLITES

EMRE KARAKOC
School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6,
Canada.
S. CENK SAHINALP School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6,
Canada.
ARTEM CHERKASOV
Division of Infectious Diseases, Faculty of Medicine, University of British Columbia, 2733 Heather Street, Vancouver, BC, V5Z 3J5, Canada.
Several QSAR models have been developed using a linear optimization approach that enabled distinguishing metabolic substances isolated from human, bacterial, plant and fungal cells. Seven binary classifiers based on a k-Nearest Neighbors method have been created using a variety of 'inductive' and traditional QSAR descriptors that allowed up to 95% accurate recognition of the studied groups of chemical substances. The conducted comparative QSAR analysis based on the above-mentioned linear optimization approach helped to identify the extent of overlaps between the groups of compounds, such as cross-recognition of fungal and bacterial metabolites and association between fungal and plant substances. Human metabolites exhibited very different QSAR behavior in chemical space and demonstrated no significant overlap with bacterial-, fungal-, and plant-derived molecules. When the developed QSAR models were applied to collections of conventional human therapeutics and antimicrobials, it was observed that the first group of substances demonstrates the strongest association with human metabolites, while the second group exhibits a tendency toward 'bacterial metabolite-like' behavior. We speculate that the established 'drugs - human metabolites' and 'antimicrobials - bacterial metabolites' associations result from strict bioavailability requirements imposed on conventional therapeutic substances, which further supports their metabolite-like properties. It is anticipated that the study may bring additional insight into QSAR determinants for human-, bacterial-, fungal- and plant metabolites and may help rationalize the design and discovery of novel bioactive substances with improved, metabolite-like properties.
1. Introduction
In a series of previous works we reported the use of our own 'inductive' and conventional 2D and 3D QSAR descriptors for creating binary QSAR models capable of recognizing various groups of substances including antimicrobial
molecules and peptides [1,2], steroid-like compounds [3], human therapeutics, drug-like chemicals [4] as well as bacterial and human metabolites [4,5]. These binary QSAR classifiers allowed defining certain structural determinants of the studied groups and provided important insights into their positioning in chemical space. Thus, the developed QSAR models could demonstrate immanent similarity between conventional antimicrobials and native bacterial metabolites and have been suggested as prospective tools for 'in silico' antibiotic discovery. In the current study we applied a similar QSAR approach to a broader spectrum of bacterial-, human-, plant- and fungal metabolites that have been explored for mutual overlaps as well as for possible associations with classes of conventional human therapeutics, antimicrobials and biologically neutral drug-like chemicals.
2. Materials and Methods
2.1. Molecular Datasets

The dataset of antimicrobial compounds has been assembled from several public resources including the ChemIDPlus service [6], the Journal of Antibiotics database [7] and the literature [8-10]. The conventional drug molecules covering a broad range of therapeutic activities have all been identified from the Merck Index Database [11]. The structures of bacterial-, plant- and fungal metabolites have been obtained from the AnalytiCon-Discovery company [12]. Drug-like substances used in the study have been selected from the Assinex Gold collection [13]. Structures of human metabolites have been obtained from the Metabolomics database [14]. The non-redundancy of the resulting dataset, containing 519 Antimicrobials, 958 Drugs, 1202 Drug-like substances with no known therapeutic effects, together with 1102 Human-, 551 Bacterial-, 2351 Plant- and 825 Fungal metabolites, has been ensured through the SMILES records and by descriptor-based clustering. All molecular structures have been further optimized with the MMFF94 force field [15] using the MOE modeling package [16].

2.2. QSAR Descriptors

The optimized structures of the 7508 compounds have been used for calculating 26 non-cross-correlating 'inductive' QSAR descriptors [1-5] and 33 conventional QSAR parameters (the corresponding descriptions can be found in the Appendix). The resulting set of 59 QSAR descriptors computed for the 7508
studied compounds has been normalized to the [0,1] range. The normalized values have then been used to generate QSAR models distinguishing all four types of natural metabolites under study.

2.3. Mathematical Approach

For the purpose of distinguishing the four types of the studied metabolite substances based on descriptors we have utilized the k-Nearest Neighbors (k-NN) classification approach. This method requires the definition of a distance D(S, R) between any pair of molecules S and R in the d-dimensional descriptor space. According to the QSAR formalism, such a distance measure should reflect functional association and/or chemical similarity between the molecules. Thus, the k-NN approach allows descriptor-based clustering of chemical compounds according to already known biological activity and can be used to classify an untested chemical substance by its proximity to established clusters. Given a distance function D(·), searching for the k-Nearest Neighbors of an untested chemical substance requires comparing its distances to the tested compounds, which is computationally costly. The efficiency of the classification process can be improved by efficient data structures developed for metric spaces. Thus, a metric distance function is used for our clustering of tested compounds. A distance measure D(·) forms a metric if the following conditions are satisfied: D(S, S) = 0 (a point has distance 0 to itself); D(S, R) = D(R, S) (distance is symmetric); and D(S, R) ≤ D(S, Q) + D(Q, R) (distance satisfies the triangle inequality). The distance measures satisfying the above conditions include the Hamming distance (i.e. L1): Σ_{k=1..d} |S_k − R_k|; the Euclidean distance (i.e. L2): (Σ_{k=1..d} (S_k − R_k)²)^{1/2}; and the maximum over dimensions (i.e. L∞): max_{1≤k≤d} |S_k − R_k|; among others.
In the current work we utilized the weighted Hamming distance Σ_{k=1..d} σ_k |S_k − R_k| (with each σ_k > 0), which allows differentiating the relevance of the various QSAR descriptors for a given activity. The weighted Hamming distance representation allows establishing optimal σ_k values that maximize the separation of the active T^A = {T^A_1, T^A_2, ..., T^A_m} and inactive T^I = {T^I_1, T^I_2, ..., T^I_n} elements in the training set T = T^A ∪ T^I. We utilized a linear programming approach to minimize the average within-class weighted distance, such that the following three conditions are satisfied: for every active element T^A_i ∈ T^A, (1/m²) Σ_j Σ_{k=1..d} σ_k |T^A_i[k] − T^A_j[k]| ≤ (1/(m·n)) Σ_j Σ_{k=1..d} σ_k |T^A_i[k] − T^I_j[k]|; 0 ≤ σ_k ≤ 1 for every k; and Σ_k σ_k ≤ C, where C is a user-defined constant. The aim of the clustering is to determine the best descriptor space in which the average distance among compounds reflects their functional similarity. Although the sensitivity and specificity can be improved using more restrictive constraints, the optimization may end up with an infeasible or over-trained solution. In order to avoid infeasible solutions and overtraining, the average distance constraints are used. Another important factor for the clustering is the size of the training data. The accuracy of the clustering improves logarithmically with increasing training-set size. According to our observations, the ideal training dataset is 90% of the whole dataset. More details on the adopted k-NN procedure can be found in [5, 17]. It should be outlined that the described mathematical procedure not only maximizes the average distance between active and inactive elements of the training set, but also aims to minimize the average within-the-class distance and, therefore, tends to condense activity-clusters.
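A minimal sketch of classification with the weighted Hamming distance defined above; a brute-force nearest-neighbour search stands in for the vantage-point tree used by the authors, and the descriptor weights sigma, which would in practice come from the linear program, are simply assumed here.

import numpy as np

def weighted_hamming(s, r, sigma):
    # weighted L1 distance over normalized descriptors
    return np.sum(sigma * np.abs(s - r))

def nearest_neighbour_label(query, train_X, train_y, sigma):
    distances = [weighted_hamming(query, x, sigma) for x in train_X]
    return train_y[int(np.argmin(distances))]

# toy data: three training compounds described by four [0,1]-normalized descriptors
train_X = np.array([[0.1, 0.9, 0.3, 0.5], [0.8, 0.2, 0.7, 0.1], [0.2, 0.8, 0.4, 0.6]])
train_y = np.array([1, 0, 1])                  # 1 = active (metabolite class), 0 = inactive
sigma = np.array([1.0, 0.5, 0.2, 0.8])         # descriptor weights (assumed, not from the LP)
print(nearest_neighbour_label(np.array([0.15, 0.85, 0.35, 0.55]), train_X, train_y, sigma))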
3. Results and Discussion
The defined clusters of chemical compounds of interest in d-dimensional descriptor space can then be used to characterize unknown entries (molecules) by projecting their QSAR parameters into the descriptor space (Figure 1).
Figure 1. Projection of an unknown compound (green point) onto the chemical space where active compounds (red points) have been separated from inactive ones (blue) using the k-NN algorithm.
In particular, an untested compound (green point on Figure 1) can be associated with a certain pre-defined activity cluster by considering the affiliations of its k-nearest neighbors. In the current study we considered assigning the tested compound to the cluster of its nearest neighbor. The linear optimization instance for determining the distance function is obtained from the input compound dataset as described above and represented using the MPS format. This linear programming instance is solved using the linear programming solver CPLEX [18]. The data structure for searching the nearest neighbor of a query point was the SC-Vantage Point Tree that we developed earlier [19]. All programs are implemented using the standard C/C++ libraries in a UNIX environment. We tested the applicability of the above-described approach for creating binary QSAR classifiers that operate on 59 'inductive' and conventional QSAR variables. The combined molecular dataset consisting of 1202 drug-like chemicals, 958 conventional drugs of various types, 519 specific antimicrobials, as well as 551 Bacterial-, 2351 Plant-, 825 Fungal- and 1102 Human Metabolites (with 59 normalized QSAR descriptors assigned to each entry) has been used to create k-NN based QSAR models. Although seven new QSAR models were developed (one for each type of chemical compound) based on our combined dataset, we concentrate only on the following categories.

3.1. Plant Metabolites

To create a binary QSAR model accurately distinguishing plant metabolites from other types of chemical substances, we considered a set of 2351 natural compounds characterized from plant isolates by the AnalytiCon-Discovery company [12]. As the negative control for such a model we considered a combination of 1477 conventional human therapeutic substances (including 519 antibacterials), 1202 biologically inactive chemicals (that nonetheless satisfy Lipinski's drug-likeness rule) and 2478 metabolic substances participating in human, bacterial and fungal biological pathways. We included drugs and drug-like molecules in the negative control to ensure that the desired QSAR model for plant metabolites would not simply be biased toward drug-like structures. On the other hand, the presence of other types of native metabolites in the negative control aimed to ensure that the QSAR approach would not generally recognize any metabolic substance. To develop the k-NN based binary QSAR model (yes/no) for plant metabolites we assigned a bioactivity value of 1.0 (dependent variable) to the 2351 plant substances and treated them as actives. All the remaining 5157 molecules
have been assigned null activities as the k-NN algorithm attempted to separate them from the plant metabolites.

3.2. Fungal Metabolites

In this case, four k-NN QSAR models have been trained to separate 825 fungal metabolites (assigned 1.0 activity values) from the rest of the compounds, which have been considered as inactive, with assigned 0.0 dependent variables.

3.3. Bacterial Metabolites

To study this system, 551 bacterial metabolites from the AnalytiCon-Discovery collection have been assigned a 1.0 activity value and the remaining 6957 general drugs, drug-likes, antimicrobials, fungal, plant and human metabolites have all been considered as a negative control and assigned a null dependent variable.

3.4. Human Metabolites

The dataset of 1102 chemical substances involved in chemical reactions taking place in the human body has recently been catalogued by the group of Prof. Wishart at the University of Alberta. These molecules have been incorporated into the larger metabolomics database and have been made available through the web: http://www.metabolomics.ca/. Thus, we attempted developing a 'Human-Metabolite-Likeness' QSAR model, hoping that the corresponding QSAR classifiers may become useful tools for assessing potential human therapeutics. We trained the k-NN approach to recognize 1104 human metabolites among the 7508 compounds under study.

3.5. QSAR Modeling

All four classification systems 3.1-3.4 have been investigated using a 10-fold cross-validation approach. In particular, within the four classification systems for Bacterial-, Fungal-, Plant- and Human metabolites, all 7508 substances have been separated into active and inactive components according to the protocols described above, and then have been separated into ten 90%-10% training/testing sets (where the training sets do not overlap), keeping the ratio of active and inactive entries constant. At the next step 59 normalized QSAR descriptors have been used as independent variables to train the k-NN based models. The four classification systems 3.1-3.4 have been independently processed within the k-NN training procedure, as described in the previous section, and the performance of the resulting QSAR models has been assessed by the combined
True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) predictions on the testing sets. The corresponding parameters have then been transformed into the Sensitivity, Specificity and Accuracy values that can be found in Table 1.

Table 1. Cross-validation confusion matrices and accuracy parameters for the developed binary QSAR classifiers for Bacterial, Human, Fungal and Plant metabolites.

Model                   TP     FP     TN     FN     SEN     SPE     ACC
Bacterial Metabolites   298    303    6654   253    0.541   0.956   0.926
Human Metabolites       856    291    6115   246    0.777   0.955   0.928
Fungal Metabolites      498    322    6361   327    0.604   0.952   0.914
Plant Metabolites       2179   211    4946   172    0.927   0.959   0.949
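The accuracy parameters in Table 1 follow directly from the confusion-matrix counts; the small sketch below checks this for the Bacterial Metabolites row.

def sen_spe_acc(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

print(sen_spe_acc(298, 303, 6654, 253))   # ~ (0.541, 0.956, 0.926), the Bacterial row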
The data in Table 1 illustrate that the method of k-Nearest Neighbors allowed generally accurate separation of actives and inactives in all four systems 3.1-3.4. Thus, the use of 59 'inductive' and conventional QSAR descriptors allowed almost 95% accurate recognition of plant metabolites, followed by 92.8%, 92.6% and 91.4% accuracy estimated for Human-, Bacterial- and Fungal- metabolites respectively. These results confirm the good predictive power of 'inductive' QSAR descriptors, which has been previously attributed to the fact that they cover a broad range of properties of bound atoms and molecules related to their size, polarizability, electronegativity, electronic and steric interactions and, thus, can adequately capture structural determinants of intra- and inter-molecular interactions [1-5].

3.6. Cross recognition and Similarity between Metabolites, Antibiotics, Drugs and Drug-like Substances

Notably, with the exception of the model for classification of plant metabolites, all other QSAR approaches produced a non-negligible number of false positive predictions (see Table 1) determined by overlaps between the studied groups of compounds. To further investigate the extent of cross-recognition between the four groups of native metabolites we re-trained 3-NN models 3.1-3.4 leaving one of the activity groups out of consideration and then applied the developed models to the excluded set. The resulting numbers of positive predictions have been collected into Table 2 and transformed into the corresponding fractions of Antimicrobials, Drugs, Drug-likes, Bacterial-, Plant-, Fungal- and Human metabolites that have been recognized by the 'non-self' QSAR models.
Table 2. Fractions (%) of the studied groups of compounds recognized as false positive predictions by the four developed QSAR models.

                        Bacterial      Human          Fungi          Plant
                        Metabolites    Metabolites    Metabolites    Metabolites
Antibacterials          3.5            1.7            3.1            2.9
Drugs                   2.0            2.2            0.4            1.4
Chemicals               0.5            0.6            0.8            0.4
Bacterial Metabolites   -              0.7            31.4           0.9
Human Metabolites       1.4            -              7.4            8.0
Fungi Metabolites       22.3           3.8            -              11.3
Plant Metabolites       0.3            1.4            6.0            -
These numbers reflect a profound similarity between Fungal and Bacterial metabolites as well as between Fungal and Plant metabolites (interestingly, no significant overlaps have been established for Plant and Bacterial substances). Human metabolites demonstrated no significant cross-recognition with other natural compounds, which confirms the previously reported stand-alone nature of this class of substances. When the developed QSAR 'metabolite-likeness' models have been applied to the groups of conventional human therapeutics, antibacterials and inactive drug-like chemicals, some interesting overlaps have been found between antibacterials and bacterial metabolites as well as between drugs and human metabolites (see the upper part of Table 2). A more detailed analysis of the substances recognized by the 'human metabolite-likeness' classifier demonstrates that the largest portion of the corresponding false positive predictions originated from fungal metabolite substances (likely reflecting the strong resemblance between fungal and human cellular composition), followed by natural molecules of plant origin and bacterial metabolites (as illustrated by Figure 2).
Figure 2. Composition of false positives produced by the QSAR model for Human metabolites.
Nonetheless, general overlap of Human metabolites with other studied groups of molecules is very limited. To illustrate positioning of Human metabolites against other groups in the chemical space we projected the
corresponding entries onto three Principal Components derived from the 59 used QSAR descriptors (Figure 3).
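The projection behind Figure 3 can be sketched as follows; scikit-learn is our assumption and the descriptor matrix is a random placeholder standing in for the real 7508 × 59 table of normalized descriptors.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(2).random((7508, 59))   # placeholder for the descriptor matrix
pcs = PCA(n_components=3).fit_transform(X)        # coordinates used for the 3-D plot
print(pcs.shape)                                  # (7508, 3)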
Figure 3. Separation of Human metabolites (Green) from other groups of the studied compounds in three-dimensional space formed by 3 Principal Components derived from the 59 used QSAR descriptors. The color coding of points corresponds to the following scheme: Red: Plant Metabolites. Orange: Fungal Metabolites. Pink: Bacterial Metabolites. Blue: Drugs. The chart demonstrates that the descriptors computed for human metabolites determine a certain overlap between Human metabolites and other substances, but their overall positioning in the chemical space is quite distinct and rather sparse compared to the other clusters, which are much more compact. This, likely, can be attributed to the most diverse nature of the substances involved in human chemical pathways.
4. Conclusions
To summarize the results of the previous sections, it is possible to conclude that antimicrobials, conventional therapeutics, inactive chemicals, as well as plant, fungal and bacterial metabolites are organized into rather compact and distinguished clusters in QSAR descriptor space, which makes it possible to distinguish these types of chemicals with binary QSAR models. Fungal metabolites demonstrate a rather significant mutual overlap with Bacterial substances and some degree of resemblance with Plant derivatives. When we utilized the k-Nearest Neighbors algorithm for the purpose of recognizing the four groups of metabolic substances, it allowed their generally acceptable separation. When the developed four 'metabolite-likeness' models were applied to conventional human therapeutics and specific antimicrobial substances, the former demonstrated the strongest association with human metabolites, while the latter demonstrated a tendency toward 'bacterial metabolite-like' behavior. It is possible to speculate that the established 'drugs-human metabolites' and 'antimicrobials-bacterial metabolites' associations result from strict bioavailability requirements imposed on therapeutics which, in a way, enforce their metabolite-like properties. The overall results of the conducted comparative QSAR analysis bring more insight into the nature and structural determinants of the studied classes of chemical substances and, if necessary, can help rationalize the design and discovery of novel antimicrobials and human therapeutics with metabolite-like chemical profiles.

Acknowledgments

The authors thank Dr. David Wishart (University of Alberta) for providing us with the Database of Human Metabolites.

Appendix

'Inductive' (from 1 to 26) and conventional QSAR parameters (from 27 to 59) used for creating binary QSAR models 3.1-3.4: Average_EO_Neg, Average_EO_Pos, Average_Hardness, Average_Neg_Charge, Average_Neg_Hardness, Average_Pos_Charge, Average_Softness, EO_Equalized, Global_Softness, Hardness_of_Most_Neg, Hardness_of_Most_Pos, Largest_Neg_Hardness, Largest_Neg_Softness, Largest_Pos_Hardness, Largest_Rs_i_mol, Most_Neg_Rs_mol_i, Most_Neg_Sigma_i_mol, Most_Neg_Sigma_mol_i, Most_Pos_Charge, Most_Pos_Rs_i_mol, Most_Pos_Sigma_i_mol, Most_Pos_Sigma_mol_i,
Softness_of_Most_Pos, Sum_Hardness, Sum_Neg_Hardness, Total_Neg_Softness, b_double, b_rotN, b_rotR, b_triple, chiral, rings, a_nN, a_nO, a_nS, FCharge, lip_don, KierFlex, a_base, vsa_acc, vsa_acid, vsa_base, vsa_don, density, logP(o/w), a_ICM, chi1v_C, chiral_u, balabanJ, logS, ASA, ASA+, ASA-, ASA_H, ASA_P, CASA+, CASA-, DASA, DCASA.
For more details on 'inductive' parameters see references [1-5], while the conventional QSAR parameters used can be accessed through the MOE program [16].
References
1. A. Cherkasov, Curr. Comp.-Aided Drug Design 1, 21 (2005).
2. A. Cherkasov and B. Jankovic, Molecules 9, 1034 (2004).
3. A. Cherkasov, Z. Shi, M. Fallahi and G.L. Hammond, J. Med. Chem. 48, 3203 (2005).
4. A. Cherkasov, J. Chem. Inf. Model. 46, 1214 (2006).
5. E. Karakoc, S.C. Sahinalp and A. Cherkasov, J. Chem. Inf. Model. 46, in press (2006).
6. ChemIDplus database: http://chem.sis.nlm.nih.gov/chemidplus/, May 2006.
7. Journal of Antibiotics database: http://www.nih.go.jp/~jun/NADB/byname.html, May 2006.
8. F. Tomas-Vert, F. Perez-Gimenez, M.T. Salabert-Salvador, F.J. Garcia-March and J. Jaen-Oltra, J. Molec. Struct. (Theochem) 504, 249 (2000).
9. M.T.D. Cronin, A.O. Aprula, J.C. Dearden, J.C. Duffy, T.I. Netzeva, H. Patel, P.H. Rowe, T.W. Schultz, A.P. Worth, K. Voutzoulidis and G. Schuurmann, J. Chem. Inf. Comp. Sci. 42, 869 (2002).
10. M. Murcia-Soler, F. Perez-Gimenez, F.J. Garcia-March, M.T. Salabert-Salvador, W. Diaz-Villanueva, M.J. Castro-Bleda and A. Villanueva-Pareja, J. Chem. Inf. Comput. Sci. 44, 1031 (2004).
11. The Merck Index 13.4 CD-ROM Edition, CambridgeSoft, Cambridge, MA, 2004.
12. Analyticon Discovery Company: www.ac-discovery.com, May 2006.
13. Assinex Gold Collection, Assinex Ltd., Moscow, 2004.
14. Human Metabolome Database: http://redpoll.pharmacy.ualberta.ca/~aguo/www_hmdb_ca/HMDB/, May 2006.
15. T.A. Halgren, J. Comp. Chem. 17, 490 (1996).
16. Molecular Operating Environment (MOE), 2005, Chemical Computing Group Inc., Montreal, Canada.
17. E. Karakoc, A. Cherkasov and S.C. Sahinalp, Bioinformatics, in press (2006).
18. CPLEX: High-performance software for mathematical programming, http://www.ilog.com/products/cplex/, May 2006.
19. M. Tasan, J. Macker, M. Ozsoyoglu and S. Cenk Sahinalp, Distance Based Indexing for Sequence Proximity Search, IEEE Data Engineering Conference ICDE'03, Bangalore, India (2003).
BIOSPIDER: A WEB SERVER FOR AUTOMATING METABOLOME ANNOTATIONS
CRAIG KNOX, SAVITA SHRIVASTAVA, PAUL STOTHARD, ROMAN EISNER, DAVID S. WISHART Department of Computing Science, University of Alberta, Edmonton, AB T6G-2E8
Canada
One of the growing challenges in life science research lies in finding useful, descriptive or quantitative data about newly reported biomolecules (genes, proteins, metabolites and drugs). An even greater challenge is finding information that connects these genes, proteins, drugs or metabolites to each other. Much of this information is scattered through hundreds of different databases, abstracts or books and almost none of it is particularly well integrated. While some efforts are being undertaken at the NCBI and EBI to integrate many different databases together, this still falls short of the goal of having some kind of human-readable synopsis that summarizes the state of knowledge about a given biomolecule - especially small molecules. To address this shortfall, we have developed BioSpider. BioSpider is essentially an automated report generator designed specifically to tabulate and summarize data on biomolecules - both large and small. Specifically, BioSpider allows users to type in almost any kind of biological or chemical identifier (protein/gene name, sequence, accession number, chemical name, brand name, SMILES string, InChI string, CAS number, etc.) and it returns an in-depth synoptic report (~3-30 pages in length) about that biomolecule and any other biomolecule it may target. This summary includes physico-chemical parameters, images, models, data files, descriptions and predictions concerning the query molecule. BioSpider uses a web-crawler to scan through dozens of public databases and employs a variety of specially developed text mining tools and locally developed prediction tools to find, extract and assemble data for its reports. Because of its breadth, depth and comprehensiveness, we believe BioSpider will prove to be a particularly valuable tool for researchers in metabolomics. BioSpider is available at: www.biospider.ca
1.
Introduction
Over the past decade we have experienced an explosion in the breadth and depth of information available, through the internet, on biomolecules. From protein databases such as the PDB [1] and Swiss-Prot [18] to small molecule databases such as PubChem (http://pubchem.ncbi.nlm.nih.gov/), KEGG [2], and ChEBI (http://www.ebi.ac.uk/chebi/), the internet is awash in valuable chemical and biological data. Unfortunately, despite the abundance of this data, there is still a need for new tools and databases to connect chemical data (small, biologically active molecules such as drugs and metabolites) to biological data (biologically active targets such as proteins, RNA and DNA), and vice versa. Without this linkage, clinically important or pharmaceutically relevant information is often lost.
To address this issue we have developed an integrated cheminformatics/bioinformatics reporting system called BioSpider. Specifically, BioSpider is a web-based search tool that was created to scan the web and to automatically find, extract and assemble quantitative data about small molecules (drugs and metabolites) and their large molecule targets. BioSpider can be used both as a research tool and as a database annotation tool to assemble fully integrated drug, metabolite or protein databases. So far as we are aware, BioSpider appears to be a unique application. It is essentially a hybrid of a web-based genome annotation tool, such as BASYS [3], and a text mining system such as MedMiner [4]. Text mining tools such as MedMiner, iHOP [5], MedGene [6] and LitMiner [7] exploit the information contained within the PubMed database. These web servers also support more sophisticated text and phrase searching, phrase selection and relevance filtering using specially built synonym lists and thesauruses. However, these text mining tools were designed specifically to extract information only from PubMed abstracts as opposed to other database resources. In other words, MedMiner, MedGene and iHOP do not search, display, integrate or link to external molecular database information (i.e. GenBank, OMIM [8], PDB, SwissProt, PharmGKB [9], DrugBank [10], PubChem, etc.) or to other data on the web. This database or web-based information-extraction feature is what is unique about BioSpider.
2. Application Description
2.1. Functionality
Fundamentally, BioSpider is a highly sophisticated web spider or web crawler. Spiders are software tools that browse the web in an automated manner and keep copies of the relevant information from the visited pages in their databases. However, BioSpider is more than just a web spider. It is also an interactive text mining tool that contains several predictive bioinformatic and cheminformatic programs, all of which are available through a simple and intuitive web interface. Typically, a BioSpider session involves a user submitting a query about one or more biological molecules of interest through its web interface, waiting a few minutes and then viewing the results in a synoptic table. This hyperlinked table typically contains more than 80 data fields covering all aspects of the physico-chemical, biochemical, genetic and physiological information about the query compound. Users may query BioSpider with either small molecules (drugs or metabolites) or large molecules (human proteins). The queries can be in almost any form, including chemical names, CAS numbers, SMILES strings [11], InChI identifiers, MOL files or PubChem IDs (for small molecules), or protein names and/or Swiss-Prot IDs (for macromolecules). In extracting the data and assembling its tabular reports, BioSpider employs several robust data-gathering techniques based on screen-scraping, text-mining, and various modeling or predictive algorithms.
If a BioSpider query is made for a small molecule, the program will perform a three-stage search involving: 1) Compound Annotation; 2) Target Protein/Enzyme Prediction and 3) Target Protein/Enzyme Annotation (see below for more details). If a BioSpider query is made for a large molecule (a protein), the program will perform a complete protein annotation. BioSpider always follows a defined search path (outlined in Figure 1, and explained in detail below), extracting a large variety of different data fields for both chemicals and proteins (shown in Table 1). In addition, BioSpider includes a built-in referencing application that maintains the source for each piece of data obtained. Thus, if BioSpider obtains the PubChem ID for a compound using KEGG, a reference "Source: KEGG" is added to the reference table for the PubChem ID.
Figure 1 - Simplified overview of a BioSpider search
(1) Obtain Chemical Information: CAS, IUPAC Name, Synonyms, Melting Point, etc.
(2) Predict Drug Targets or Metabolizing Enzymes
(3) For each predicted Drug Target or Metabolizing Enzyme, obtain protein information including sequence information, description, SNPs, etc.
Table 1 - Summary of some of the fields obtained by BioSpider
Drug or Compound Information: Generic Name; Brand Names/Synonyms; IUPAC Name; Chemical Structure/Sequence; Chemical Formula; PubChem/ChEBI/KEGG Links; SwissProt/GenBank Links; FDA/MSDS/RxList Links; Molecular Weight; Melting Point; Water Solubility; pKa or pI; LogP or Hydrophobicity; NMR/Mass Spectra; MOL/SDF Text Files; Drug Indication; Drug Pharmacology; Drug Mechanism of Action; Drug Biotransformation/Absorption; Drug Patient/Physician Information; Drug Toxicity
Drug Target or Receptor Information: Name; Synonyms; Protein Sequence; Number of Residues; Molecular Weight; pI; Gene Ontology; General Function; Specific Function; Pathways; Reactions; Pfam Domains; Signal Sequences; Transmembrane Regions; Essentiality; GenBank Protein ID; SwissProt ID; PDB ID; Cellular Location; DNA Sequence; Chromosome Location
Step 1: Compound Annotation
Compound annotation involves extracting or calculating data about small molecule compounds (metabolites and drugs). This includes data such as common names, synonyms, chemical descriptions/applications, IUPAC names, chemical formulas, chemical taxonomies, molecular weights, solubilities, melting or boiling points, pKa, LogP's, state(s), MSDS sheets, chemical structures (MOL, SDF and PDB files), chemical structure images (thumbnail and full-size PNG), SMILES strings, InChI identifiers, MS and NMR spectra, and a variety of database links (PubChem, KEGG, ChEBI). The extraction of this data involves accessing, screen scraping and text mining ~30 well-known databases (KEGG, PubChem), calling a number of predictive programs (for calculating MW, solubility) and running a number of file conversion scripts and figure generation routines via CORINA [12], Checkmol (http://merian.pch.univie.ac.at/~nhaider/cheminf/cmmm.html) and other in-house methods. The methods used to extract and generate these data are designed to be called independently, but they are also "aware" of certain data dependencies. For instance, if a user only wanted an SDF file for a compound, they would simply call a single method: get_value('sdf_file'). There is no need to explicitly call methods that might contain the prerequisite information for getting an SDF file. Likewise, if BioSpider needs a PubChem ID to grab an SDF file, it will obtain it automatically, and, consequently, if the PubChem ID requires a KEGG ID, BioSpider will then jump ahead to try and get the KEGG ID automatically.
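This dependency-aware lookup, together with the "Source:" referencing described earlier, can be pictured with a small sketch. BioSpider itself is implemented in Perl; the Python below, with its invented field names and dependency table, only illustrates the idea and is not the authors' code:

# Hypothetical illustration of dependency-aware field retrieval with an audit trail.
FIELD_DEPENDENCIES = {
    "sdf_file": ["pubchem_id"],    # an SDF file is fetched via the PubChem ID
    "pubchem_id": ["kegg_id"],     # a PubChem ID may in turn be resolved from a KEGG ID
    "kegg_id": [],
}

class CompoundAnnotator:
    def __init__(self, query_name):
        self.query_name = query_name
        self.values = {}        # cache of already-resolved fields
        self.references = {}    # audit trail: where each value came from

    def get_value(self, field):
        # Resolve a field, transparently fetching any prerequisite fields first.
        if field in self.values:
            return self.values[field]
        for prerequisite in FIELD_DEPENDENCIES.get(field, []):
            self.get_value(prerequisite)
        value, source = self._fetch(field)   # stands in for screen-scraping or a predictor
        self.values[field] = value
        self.references[field] = "Source: " + source
        return value

    def _fetch(self, field):
        # Placeholder for the real data-gathering code.
        return "<%s for %s>" % (field, self.query_name), "KEGG"

annotator = CompoundAnnotator("biotin")
annotator.get_value("sdf_file")   # also resolves kegg_id and pubchem_id automatically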
Step 2: Target/Enzyme Prediction
Target/enzyme prediction involves taking the small-molecule query and identifying the enzymes likely to be targeted or involved in the metabolism of that compound. This process involves looking for metabolite-protein or drug-protein associations in several well-known databases including SwissProt, PubMed, DrugBank and KEGG. The script begins by constructing a collection of query objects from the supplied compound information. Each query object contains the name and synonyms for a single compound, as well as any similar but unwanted terms. For example, a query object for the small molecule compound "pyridoxal" would contain the term "pyridoxal phosphatase" as an unwanted term, since the latter name is for an enzyme. The list of unwanted or excluded terms for small molecule compounds is assembled from a list of the names and synonyms of all human proteins. These unwanted terms are identified automatically by testing for cases where one term represents a subset of another. Users can also include their own "exclusion" terms in BioSpider's advanced search interface. The name and synonyms from a query object are then submitted using WWW agents or public APIs to a variety of abstract and protein sequence databases, including Swiss-Prot, PubMed, and KEGG. The name and synonyms are each submitted separately, rather than as a single query, since queries consisting of multiple synonyms typically produce many irrelevant results. The relevance of each of the returned records is measured by counting the number of occurrences of the compound name and synonyms, as well as the number of occurrences of the unwanted terms. Records containing only the desired terms are given a "good" rating, while those containing some unwanted terms are given a "questionable" rating. Records containing only unwanted terms are discarded. The records are then sorted based on their qualitative score. BioSpider supports both automated and semi-automated identification. For automated identification, only the highest scoring hits (no unwanted terms, hits to more than one database) are selected. In the semi-automated mode, the results are presented to a curator who must approve the selection. To assist with the decision, each of the entries in the document is hyperlinked to the complete database record so that the curator can quickly assess the quality of the results. Note that metabolites and drugs often interact with more than one enzyme or protein target.
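The scoring rule just described (count occurrences of wanted versus unwanted terms, then rate each record) is straightforward to express in code. The sketch below is illustrative only and is not the BioSpider implementation:

def score_record(text, wanted_terms, unwanted_terms):
    # Rate one returned database or abstract record as described above.
    text = text.lower()
    wanted = sum(text.count(term.lower()) for term in wanted_terms)
    unwanted = sum(text.count(term.lower()) for term in unwanted_terms)
    if wanted and not unwanted:
        return "good", wanted
    if wanted and unwanted:
        return "questionable", wanted - unwanted
    return "discard", 0

# Example using the pyridoxal query object described above:
rating, score = score_record(
    "Pyridoxal is phosphorylated by pyridoxal kinase in human erythrocytes.",
    wanted_terms=["pyridoxal"],
    unwanted_terms=["pyridoxal phosphatase"],
)
print(rating, score)   # good 2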
Step 3: Target/Enzyme Annotation
Target/enzyme annotation involves extracting or calculating data about the proteins that were identified in Step 2. This includes data such as protein name, gene name, synonyms, protein sequence, gene sequence, GO classifications, general function, specific function, PFAM [13] sequences, secondary structure, molecular weight, subcellular location, gene locus, SNPs and a variety of database links (SwissProt, KEGG, GenBank). Approximately 30 annotation sub-fields are determined for each drug target and/or metabolizing enzyme. The BioSpider protein annotation program is based on previously published annotation tools developed in our lab, including BacMap [14], BASYS and CCDB [15]. The Swiss-Prot and KEGG databases are searched initially to retrieve protein and gene names, protein synonyms, protein sequences, specific and general functions, signal peptides, transmembrane regions and subcellular locations. If any annotation field is not retrieved from the abovementioned databases, then either alternate databases are searched or internally developed/installed programs are used. For example, if transmembrane regions are not annotated in the Swiss-Prot entry, then a locally installed transmembrane prediction program called TMHMM (http://www.cbs.dtu.dk/services/TMHMM/) is used to predict the transmembrane regions. This protein annotation tool also coordinates the updating of fields that are calculated from the contents of other fields, such as molecular weight and isoelectric point. The program also retrieves chromosome location, locus location and SNP information from GeneCards [16] on the basis of the gene name. BLAST searches are also performed against the PDB database to identify structural homologues. Depending upon the sequence similarity between the query protein sequence and a sequence represented in the PDB database, a program called HOMODELLER (X. Dong, unpublished data) may generate a homology model for the protein sequence.
2.2. Implementation
The BioSpider backend is a fully object-oriented Perl application, making it robust and portable. The frontend (website, shown in Figure 2) utilizes Perl CGI scripts which generate valid XHTML and CSS. BioSpider uses a relational database (MySQL 5) to store data as it runs. As BioSpider identifies and extracts different pieces of information, it stores the data in the database. To facilitate this storage process, a module called a "DataBean" is used to store and retrieve the desired information from/to the database. This approach was chosen for 3 reasons: 1) it provides an "audit-trail" in terms of the results obtained, 2) it provides a complete search result history, enabling the easy addition of "saved-searches" to the website, and 3) it reduces memory load as the application is running. A screenshot of the BioSpider website is shown in Figure 2.
Figure 2 - A screen shot montage of BioSpider
3.
Validation, Comparison and Limitations
Text mining and data extraction tools can be prone to a variety of problems, many of which may lead to nonsensical results. To avoid these problems BioSpider performs a number of self-validation or "sanity checks" on specific data extracted from the web. For example, when searching for compound synonym names, BioSpider will check that the PubChem substance page related to that synonym contains the original search name or original CAS number within the HTML for that page. This simple validation procedure can often remove bogus synonyms obtained from different websites. Other forms of such small-scale validation or sanity checks include a CAS number validation method, whereby the CAS number check-digit is used to validate the entire CAS number (CAS numbers use a checksum in which the digits preceding the check-digit are multiplied by 1, 2, 3, etc. from right to left, the products are summed, and the sum modulo 10 must equal the check-digit).
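The CAS checksum rule can be written out in a few lines. This sketch assumes the standard interpretation of the rule (the final digit is the check digit; the preceding digits, read right to left, are weighted 1, 2, 3, and so on) and is not taken from the BioSpider source:

def is_valid_cas(cas):
    # e.g. "58-08-2" (caffeine); returns True if the check digit is consistent.
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    digits = "".join(parts)
    body, check_digit = digits[:-1], int(digits[-1])
    weighted_sum = sum((i + 1) * int(d) for i, d in enumerate(reversed(body)))
    return weighted_sum % 10 == check_digit

print(is_valid_cas("58-08-2"))   # True
print(is_valid_cas("58-08-3"))   # False (corrupted check digit)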
Since the majority of the information obtained by BioSpider is screen-scraped from several websites, it is also important to validate the accessibility of these websites as well as the HTML formatting. Since screen-scraping requires one to parse the HTML, BioSpider must assume the HTML from a given website follows a specific format. Unfortunately, this HTML formatting is not static, and changes over time as websites add new features or alter the design layout. For this reason, BioSpider contains an HTML validator application, designed to detect changes in the HTML formatting for all the web resources that BioSpider searches. To achieve this, an initial search was performed and saved using BioSpider for 10 pre-selected compounds, whereby the results from each of the fields were manually validated. This validation application performs a search on these 10 pre-selected compounds weekly (as a cron job). The results of this weekly search are compared to the original results, and if there is any difference, a full report is generated and emailed to the BioSpider administrator. The assessment of any text mining or report generating program is difficult. Typically one must assess these kinds of tools using three criteria: 1) accuracy; 2) completeness; and 3) time savings. In terms of accuracy, the results produced are heavily dependent on the quality of the resources being accessed. Obviously, if the reference data are flawed or contradictory, the results from a BioSpider search will be flawed or contradictory. To avoid these problems every effort has been made to use only high-accuracy, well curated databases as BioSpider's primary reference sources (KEGG, SwissProt, PubChem, DrugBank, Wikipedia, etc.). As a result, perhaps the most common "detectable" errors made by BioSpider pertain to text parsing issues (with compound descriptions), but these appear to be relatively minor. The second most common error pertains to errors of omission (missing data that could be found by a human expert looking through the web or other references). In addition to these potential programmatic errors, the performance of BioSpider can be compromised by incorrect human input, such as a misspelled compound name, SMILES string or CAS number, or the submission of an erroneous MOL or SDF file. It can also be compromised by errors or omissions in the databases and websites that it searches. Some consistency or quality control checks are employed by the program to look for nomenclature or physical property disagreements, but these may not always work. BioSpider will fail to produce results for newly discovered compounds as well as compounds that lack any substantive electronic or web-accessible annotation. During real-world tests with up to 15 BioSpider users working simultaneously for 5-7 hours at a time, we typically find no more than two or three errors being reported. This would translate to 1 error for every 15,000 annotation fields, depending on the type of query used. The number of errors returned is highest when searching using a name or synonym, as it is difficult to ascertain correctness. Errors are much less likely when using a search that permits a direct mapping between a compound and the source websites used by BioSpider. It is thus recommended that users search by structure (InChI, SDF/MOL, SMILES) or unique database ID (PubChem ID, KEGG ID) first, resorting to CAS number or name only when necessary. Despite this high level of accuracy, we strongly suggest that every BioSpider annotation should be looked over quickly to see if any nonsensical or inconsistent information has been collected in its annotation process. Usually these errors are quite obvious. In terms of errors of omission, typically a human expert can almost always find data for 1 or 2 fields that were not annotated by BioSpider - however this search may take 30 to 45 minutes of intensive manual searching or reading. During the annotation of the HMDB and DrugBank, BioSpider was used to annotate thousands of metabolites, food additives and drugs. During this process, it was noted that BioSpider was able to obtain at least some information about query compounds 91% of the time. The cases where no information was returned from BioSpider often involved compounds for which a simple web search would return no results. This again spotlights one of the limitations of the BioSpider approach - its performance is directly proportional to the "web presence" of the query compound. Perhaps the most important contribution of BioSpider for annotation lies in the time savings it offers. Comparisons between BioSpider and skilled human annotators indicate that BioSpider can accelerate annotations by a factor of 40 to 50 over skilled human annotators. In order to test this time-saving factor, 3 skilled volunteers were used. Each volunteer was given 3 compounds to annotate (2-Ketobutyric acid, Chenodeoxycholic acid disulfate and alpha-D-glucose) and the fields to fill in for each compound. Each volunteer was asked to search for all associated enzymes, but only asked to annotate a single enzyme by hand. The data obtained by the volunteers were then compared to the results produced by BioSpider.
These tests indicated that the time taken to annotate the chemical fields averaged 40 minutes, and 45 minutes for the biological fields, with a range between 22 and 64 minutes. The time taken by BioSpider was typically 5 minutes. In other words, to fill out a complete set of BioSpider data on a given small molecule (say biotin) using manual typing and manual searches typically takes a skilled individual approximately 3 hours. Using BioSpider this can take as little as 2 minutes. Additionally, the quality of data gathered by BioSpider matched the human annotation for almost all of the fields. Indeed, it was often the case that the volunteer would give up on certain fields (PubChem substance IDs, OMIM IDs, etc.) long before completion. In terms of real-world experience, BioSpider has been used in several projects, including DrugBank and HMDB (www.hmdb.ca). It has undergone full stress testing during several "annotation workshops" with up to 50 instances of BioSpider running concurrently. BioSpider has also recently been integrated into a LIMS system (MetaboLIMS - http://www.hmdb.ca/labm/). This allows users to produce a side-by-side comparison of the data obtained using BioSpider and the data collected manually by a team of expert curators. Overall, BioSpider has undergone hundreds of hours of real-life testing, making it stable and relatively bug-free.
4.
Conclusion
BioSpider is a unique application, designed to fill in the gap between chemical (small-molecule) and biological (target/enzyme) information. It contains many advanced predictive algorithms and screen-scraping tools made interactively accessible via an easy-to-use web front-end. As mentioned previously, we have already reaped significant benefits from earlier versions of BioSpider in our efforts to prepare and validate a number of large chemical or metabolite databases such as DrugBank and HMDB. It is our hope that by offering the latest version of BioSpider to the public (and the metabolomics community in particular) its utility may be enjoyed by others as well.
5.
Acknowledgments
The Human Metabolome Project is supported by Genome Alberta, in part through Genome Canada.
References
1. Sussman, J.L., Lin, D., Jiang, J., Manning, N.O., Prilusky, J., Ritter, O. and Abola, E.E. 1998. Protein Data Bank (PDB): a database of 3D structural information of biological macromolecules. Acta Cryst. D54:1078-1084.
2. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y. and Hattori, M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32(Database issue):D277-280.
3. Van Domselaar, G.H., Stothard, P., Shrivastava, S., Cruz, J.A., Guo, A., Dong, X., Lu, P., Szafron, D., Greiner, R. and Wishart, D.S. 2005. BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res. 33(Web Server issue):W455-459.
4. Tanabe, L., Scherf, U., Smith, L.H., Lee, J.K., Hunter, L. and Weinstein, J.N. 1999. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques 27:1210-1217.
5. Hoffmann, R. and Valencia, A. 2005. Implementing the iHOP concept for navigation of biomedical literature. Bioinformatics 21 Suppl 2:ii252-ii258.
6. Hu, Y., Hines, L.M., Weng, H., Zuo, D., Rivera, M., Richardson, A. and LaBaer, J. 2003. Analysis of genomic and proteomic data using advanced literature mining. J. Proteome Res. 2(4):405-412.
7. Maier, H., Dohr, S., Grote, K., O'Keeffe, S., Werner, T., Hrabe de Angelis, M. and Schneider, R. 2005. LitMiner and WikiGene: identifying problem-related key players of gene regulation using publication abstracts. Nucleic Acids Res. 33(Web Server issue):W779-782.
8. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A. and McKusick, V.A. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33(Database issue):D514-517.
9. Hewett, M., Oliver, D.E., Rubin, D.L., Easton, K.L., Stuart, J.M., Altman, R.B. and Klein, T.E. 2002. PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res. 30:163-165.
10. Wishart, D.S., Knox, C., Guo, A., Shrivastava, S., Hassanali, M., Stothard, P. and Woolsey, J. 2006. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34(Database issue):D668-672.
11. Weininger, D. 1988. SMILES 1. Introduction and encoding rules. J. Chem. Inf. Comput. Sci. 28:31-38.
12. Gasteiger, J., Sadowski, J., Schuur, J., Selzer, P., Steinhauer, L. and Steinhauer, V. 1996. Chemical information in 3D space. J. Chem. Inf. Comput. Sci. 36:1030-1037.
13. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., Studholme, D.J., Yeats, C. and Eddy, S.R. 2004. The Pfam protein families database. Nucleic Acids Res. 32:D138-141.
14. Stothard, P., Van Domselaar, G., Shrivastava, S., Guo, A., O'Neill, B., Cruz, J., Ellison, M. and Wishart, D.S. 2005. BacMap: an interactive picture atlas of annotated bacterial genomes. Nucleic Acids Res. 33(Database issue):D317-320.
15. Sundararaj, S., Guo, A., Habibi-Nazhad, B., Rouani, M., Stothard, P., Ellison, M. and Wishart, D.S. 2004. The CyberCell Database (CCDB): a comprehensive, self-updating, relational database to coordinate and facilitate in silico modeling of Escherichia coli. Nucleic Acids Res. 32(Database issue):D293-295.
16. Rebhan, M., Chalifa-Caspi, V., Prilusky, J. and Lancet, D. 1998. GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14:656-664.
17. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
18. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. 2005. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33(Database issue):D154-159.
19. Brooksbank, C., Cameron, G. and Thornton, J. 2005. The European Bioinformatics Institute's data resources: towards systems biology. Nucleic Acids Res. 33(Database issue):D46-53.
20. Chen, X., Ji, Z.L. and Chen, Y.Z. 2002. TTD: Therapeutic Target Database. Nucleic Acids Res. 30:412-415.
21. Halgren, T.A., Murphy, R.B., Friesner, R.A., Beard, H.S., Frye, L.L., Pollard, W.T. and Banks, J.L. 2004. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 47:1750-1759.
22. Hatfield, C.L., May, S.K. and Markoff, J.S. 1999. Quality of consumer drug information provided by four Web sites. Am. J. Health Syst. Pharm. 56:2308-2311.
23. Hulo, N., Sigrist, C.J., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P. and Bairoch, A. 2004. Recent improvements to the PROSITE database. Nucleic Acids Res. 32:D134-137.
24. Kramer, B., Rarey, M. and Lengauer, T. 1997. CASP2 experiences with docking flexible ligands using FlexX. Proteins Suppl 1:221-225.
25. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305:567-580.
26. McGuffin, L.J., Bryson, K. and Jones, D.T. 2000. The PSIPRED protein structure prediction server. Bioinformatics 16:404-405.
27. Montgomerie, S., Sundararaj, S., Gallin, W.J. and Wishart, D.S. 2006. Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics 7:301-312.
28. Orth, A.P., Batalov, S., Perrone, M. and Chanda, S.K. 2004. The promise of genomics to identify novel therapeutic targets. Expert Opin. Ther. Targets 8:587-596.
29. Sadowski, J. and Gasteiger, J. 1993. From atoms and bonds to three-dimensional atomic coordinates: automatic model builders. Chem. Rev. 93:2567-2581.
30. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Helmberg, W., Kenton, D.L., Khovayko, O., Lipman, D.J., Madden, T.L., Maglott, D.R., Ostell, J., Pontius, J.U., Pruitt, K.D., Schuler, G.D., Schriml, L.M., Sequeira, E., Sherry, S.T., Sirotkin, K., Starchenko, G., Suzek, T.O., Tatusov, R., Tatusova, T.A., Wagner, L. and Yaschenko, E. 2005. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 33(Database issue):D39-45.
31. Willard, L., Ranjan, A., Zhang, H., Monzavi, H., Boyko, R.F., Sykes, B.D. and Wishart, D.S. 2003. VADAR: a web server for quantitative evaluation of protein structure quality. Nucleic Acids Res. 31:3316-3319.
NEW BIOINFORMATICS RESOURCES FOR METABOLOMICS JOHN L. MARKLEY, MARK E. ANDERSON, QIU CUI, HAMID R. EGHBALNIA,* IAN A. LEWIS, ADRIAN D. HEGEMAN, JING LI, CHRISTOPHER F. SCHULTE, MICHAEL R. SUSSMAN, WILLIAM M. WESTLER, ELDON L. ULRICH, ZSOLT ZOLNAI Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Drive, Madison, Wisconsin 53706, USA
We recently developed two databases and a laboratory information system as resources for the metabolomics community. These tools are freely available and are intended to ease data analysis in both MS and NMR based metabolomics studies. The first database is a metabolomics extension to the BioMagResBank (BMRB, http://www.bmrb.wisc.edu), which currently contains experimental spectral data on over 270 pure compounds. Each small molecule entry consists of five or six one- and two-dimensional NMR data sets, along with information about the source of the compound, solution conditions, data collection protocol and the NMR pulse sequences. Users have free access to peak lists, spectra, and original time-domain data. The BMRB database can be queried by name, monoisotopic mass and chemical shift. We are currently developing a deposition tool that will enable people in the community to add their own data to this resource. Our second database, the Madison Metabolomics Consortium Database (MMCD, available from http://mmcd.nmrfam.wisc.edu/), is a hub for information on over 10,000 metabolites. These data were collected from a variety of sites with an emphasis on metabolites found in Arabidopsis. The MMC database supports extensive search functions and allows users to make bulk queries using experimental MS and/or NMR data. In addition to these databases, we have developed a new module for the Sesame laboratory information management system (http://www.sesame.wisc.edu) that captures all of the experimental protocols, background information, and experimental data associated with metabolomics samples. Sesame was designed to help coordinate research efforts in laboratories with high sample throughput and multiple investigators and to track all of the actions that have taken place in a particular study.
1.
Introduction
The metabolome can be defined as the complete inventory of small molecules present in an organism. Its composition depends on the biological fluid or tissue studied and the state of the organism (health, disease, environmental challenge, etc). Metabolomics is the study of the metabolome, usually as a high-throughput activity with the goal of discovering correlations between metabolite levels and the state of the organism. Metabolomics holds a place in systems biology
*Also Department of Mathematics, University of Wisconsin-Madison.
alongside genomics, transcriptomics, and proteomics as an approach to modeling and understanding reaction networks in cells [1-4]. Mass spectrometry (MS) and nuclear magnetic resonance (NMR) are the analytical techniques used in the majority of metabolomics studies [5, 6]. Although MS and NMR suffer from some well documented technical limitations [7], both of these tools are of clear utility to modern metabolomics [8]. MS is now capable of detecting molecules at concentrations as low as 10⁻¹⁸ molar, and high-field NMR can efficiently differentiate between molecules that are as similar in structure as glucose and galactose. Despite the availability of these impressive analytical tools, determining the molecular composition of complex mixtures is one of the most difficult tasks in metabolomics. One reason for this difficulty is a lack of publicly available tools for comparing experimental data with the existing literature on the masses and chemical shifts of common metabolites. We recently developed two databases of biologically relevant small molecules as practical tools for MS- and NMR-based research. The first of these databases is a metabolomics extension to the existing Biological Magnetic Resonance Data Bank (BioMagResBank, BMRB). The BMRB database contains experimental NMR data from over 270 pure compounds collected under standardized conditions. The peak lists, processed spectra, and raw time-domain data are freely available at http://www.bmrb.wisc.edu. Although the initial data were collected by the Madison Metabolomics Consortium (MMC), several groups in the metabolomics community have expressed interest in submitting data. We are currently developing a deposition tool that will facilitate these submissions and are encouraging others to submit their data. Our second free resource, the Madison Metabolomics Consortium Database (MMCD, available at www.nmrfam.wisc.edu), acts as a hub for information on biologically relevant small molecules. The MMCD contains the molecular structure, monoisotopic masses, predicted chemical shifts and links for more than 10,000 small molecules. The interface supports single and batch-mode searches by name, molecular structure, NMR chemical shifts, monoisotopic mass, plus various miscellaneous parameters. The MMCD is intended to be a practical tool to aid in identifying metabolites present in complex mixtures. Another impediment in metabolomics research is the complex logistics associated with coordinating multiple investigators in studies with large numbers of samples. To address this problem, we have created a metabolomics module for our Sesame laboratory information management system (LIMS) [9].
We designed Sesame to capture the complete range of experimental protocols, background information, and experimental data associated with samples. The system allows users to define the actions and protocols to be tracked and supports bar-coded samples. Sesame is freely available at http://www.sesame.wisc.edu. In this paper we discuss the construction and mechanics of these resources as well as the details of our experimental designs and the sources we have drawn upon in developing these tools.
2.
Data Model for Metabolomics
The Metabolomics Standards Initiative recently recommended that metabolomics studies should report the details of study design, metadata, experimental, analytical, data processing, and statistical techniques used [10]. Capturing these details is imperative, because they can play a major role in data interpretation [11-13]. As a result, informatics resources need to be built on a data model that can capture all of the relevant information while maintaining sufficient flexibility for future development and integration into other resources [14]. To meet this challenge, the Madison Metabolomics Consortium has adopted the Self-defining Text Archival and Retrieval (STAR) file format [15-17] for storing and disseminating data. A STAR file is a flat text file with a simple format and an extensible Data Definition Language (DDL). Data are stored as tag-value pairs, and loop constructs resemble data tables. The STAR DDL is inherently a database schema that can be mapped one-to-one to a relational database model. Translating between STAR and other exchange file formats, such as XML, is a straightforward process.
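A minimal, invented example may help make the tag-value idea concrete. The tag names below are made up for illustration and are not actual NMR-STAR tags:

star_text = """
_Compound.Name                glucose
_Compound.Formula             C6H12O6
_Compound.Monoisotopic_mass   180.0634
"""

def parse_tag_values(text):
    # Collect tag -> value pairs from simple (non-loop) STAR-style lines.
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("_"):
            tag, value = line.split(None, 1)
            entry[tag] = value
    return entry

print(parse_tag_values(star_text)["_Compound.Monoisotopic_mass"])   # 180.0634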
The STAR DDL used in our metabolomics resources was adapted from the existing data dictionary developed by the BMRB (NMR-STAR) for their work on NMR spectroscopic data of biological macromolecules and ligands. To describe the data for metabolic standard compounds, we used a subset of the NMR-STAR dictionary suitable for data from small molecules and extended the dictionary to include MS information. The information defined includes a complete compound chemical description (atoms, bonds, charge, etc.), nomenclature (including InChI and SMILES codes and synonyms), monoisotopic masses, links to databases through accession codes (PubChem, KEGG, CAS, and others), and additional information. Descriptions are provided for the NMR and mass spectrometers and chromatographic systems used in data collection. Information on the sample contents and sample conditions is captured. Details of the NMR and mass spectrometry experiments can be included. For NMR, pointers to the raw NMR spectral data and the acquisition and processing parameters, experimental spectral peak parameters (peak chemical shifts, coupling constants, line widths, assigned chemical shifts, etc.), chemical shift referencing methods, theoretical chemical shift assignments and details of the calculation methods are described. For MS, the chromatographic retention times for the compound(s) of interest and standards are defined, as well as the m/z values and intensities and pointers to the raw data files. The metabolite data dictionary is now being used to construct files containing all of the above information for the growing list of standard metabolic compounds analyzed by our consortium. The populated metabolite STAR files and the raw NMR and MS data files (instrumental binary formats) are being made freely available on the World Wide Web. The BMRB provides tools for converting NMR-STAR files into a relational database and XML files.
3.
Metabolite Database at BMRB
3.1. Approach
The metabolomics community would clearly benefit from an extensive, freely accessible spectral library of metabolite standards collected under standardized conditions. Although the METLIN database serves this role for the MS community (http://metlin.scripps.edu/about.php), most current NMR resources have limitations in that they do not provide original spectral data (Sadtler Index [18], NMRShiftDB [19], the NMR metabolomics database of Linkoping (MDL, http://www.liu.se/hu/mdl/main/)), contain data that were collected under non-standardized conditions ([19], MDL), or do not make their data freely available (AMIX/SBASE, http://bruker-biospin.de). To our knowledge, the Human Metabolome Project (http://www.hmdb.ca/) is the only NMR resource, apart from BMRB, without these limitations. The current sparse coverage of NMR metabolomics resources stems in part from the high investment required to compile a comprehensive library of biologically relevant small molecules under standardized conditions. Our solution is to provide at BMRB a well-defined, curated platform that will allow the deposition of data from multiple research groups and free access to all.
3.2. Rationale for Metabolomics at BMRB
The BMRB is a logical host for a metabolomics spectral library because of its history as a worldwide repository for biological macromolecule NMR data [20-22]. BMRB is a public domain service and is a member of the Worldwide Protein Data Bank. Along with its home office in Madison, Wisconsin, BMRB has mirror sites in Osaka, Japan and Florence, Italy. BMRB is funded by the National Library of Medicine, U.S. National Institutes of Health, and its activities are monitored by an international advisory board. BMRB data are well archived with daily onsite tape backups and offsite third-party data backup.
3.3. Data Collection and Organization
Currently, the BMRB metabolomics archive contains experimental NMR data for more than 270 compounds collected by the Madison Metabolomics Consortium. Entries contain NMR time-domain data, peak lists, processed spectra, and data acquisition and processing files for one-dimensional (1H, 13C, 13C DEPT 90°, and 13C DEPT 135°) and two-dimensional (1H-1H TOCSY and 1H-13C HSQC) NMR experiments. A BMRB entry represents a set of either experimental or theoretical data reported for a metabolic compound, mixture of compounds, or experimental sample by a depositor. Entries are further distinguished by the experimental method used (NMR or MS). Separate prefixes on entries serve to discriminate between experimental data (bmse-) and theoretical calculations (bmst-). As described above, the metadata describing the chemical compounds and experimental details and the quantitative data extracted from experiments or theoretical calculations for a unique entry are archived in NMR-STAR formatted text files. On the BMRB ftp site (ftp://ftp.bmrb.wisc.edu/pub/metabolomics), directories are defined for each compound or non-interconverting form of a compound (i.e., L-amino acids). Subdirectories for NMR, MS, and literature data are listed under each compound directory. All data associated with a BMRB experimental or theoretical entry are grouped together in a subdirectory, with the BMRB identifier located under the directory named for the compound studied and the appropriate subdirectory (NMR or MS). Data for compounds that form racemic mixtures in solution (e.g., many sugars) are grouped under a generic compound name.
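The layout just described (a directory per compound, an NMR or MS subdirectory, and an entry subdirectory named by its bmse- or bmst- identifier) can be summarized with a small helper. The compound directory and entry identifier in the example are invented, and this is not a BMRB-provided tool:

BASE = "ftp://ftp.bmrb.wisc.edu/pub/metabolomics"

def entry_path(compound_dir, method, entry_id):
    # e.g. entry_path("L_glutamic_acid", "NMR", "bmse000042"); both names are hypothetical.
    assert method in ("NMR", "MS", "literature")
    assert entry_id.startswith(("bmse", "bmst"))   # experimental vs. theoretical entries
    return "%s/%s/%s/%s" % (BASE, compound_dir, method, entry_id)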
BMRB has developed internal tools to coordinately view spectra, peak lists, and the molecular structure; these tools are used to review deposited data for quality assurance purposes. However, the depositor is ultimately responsible for the data submitted, and user feedback is the best defense against erroneous data on a public database. Users who encounter questionable data are encouraged to contact [email protected]. Questionable data will be reviewed and corrected if possible; otherwise they may be removed from the site.
3.4. Presentation and Website Design
The BMRB metabolomics website has been developed to meet needs expressed by many of its users. The layout and usage of the metabolomics web pages have had several public incarnations and will probably undergo more as the site matures and grows. The first page a visitor sees contains a two-paragraph introduction to the field and a collection of Internet links to a few important small molecule sites; a more complete listing of metabolomics websites is accessed from a link in the sidebar. The information contained in these websites and databases is complementary to that collected by BMRB. The Standard Compounds page (Figure 1) provides the means for searching for metabolites of interest. For each compound archived, an individual summary page (Figure 2) is created dynamically from the collection of files located in the standard substance sub-directory associated with that compound. A basic chemical description is provided from information BMRB collects from PubChem at the National Institutes of Health, National Center for Biotechnology Information, National Library of Medicine (http://www.ncbi.nlm.nih.gov/). A two-dimensional stick drawing is created. Three-dimensional '.mol' files are generated from the two-dimensional '.sdf' files obtained from PubChem, and these are displayed using Jmol. Links are created to one or more PubChem entries and to the KEGG entry if available. Synonym information and various nomenclature descriptions such as InChI codes, IUPAC names, and SMILES strings are given.
Figure 1. Metabolomics standard substances page in the BMRB website (panel title: "Data Available for These Standard Substances").
The use of dynamic information presentation techniques allows BMRB to create tools that search through the data or calculate answers according to specific user input. NMR data can be displayed in a variety of ways: as a collection of spectra, as a spectrum along with its peak list, or simply as a single spectrum of interest. Links allow the user to access the time-domain or processed data by FTP.
Figure 2. Portion of the substance summary page in the BMRB metabolomics website for N-acetyl-D-glucosamine-6-phosphate.
The Peak Query tool allows the user to enter a list of peaks in one- or two-dimensional format with tolerances and retrieve a list of compounds with matching signals. Historically, multiple approaches have been taken to name or identify small molecules in a standard, unique manner. For this reason, BMRB provides the ability to search on common names, InChI codes, IUPAC names, SMILES strings, and various database identifiers. BMRB allows users to select categories for searching. For example, a user wishing to see all entries for molecules containing nitrogen would search for 'N' with only the chemical formula category selected. But to find molecules with similar substructures, the InChI or SMILES searches would be the approach to use. A number of users have requested that the data in the BMRB archive be available through bulk transactions. To accommodate this, we have taken the data from the ftp repository and collated them into a collection of tar (tape archive) files that can be easily downloaded from the ftp site.
3.5. Prospects
Over the past year, BMRB, with assistance from a grant from the NIH Roadmap Initiative, has developed a usable and maintainable metabolomics resource for the research community. A core set of metabolite data from pure compounds has been deposited, and additional data sets are being solicited from the community.
From this beginning, it is clear that enhancements and implementations are needed. A standardized protocol for archiving and presenting mass spectrometry data must be developed. A standard deposition system needs to be completed and brought online, and more efficient methods for validating deposited data need to be put in place. The STAR file DDL needs to be finalized. In addition, it will be useful to develop models and tools that enable users to explore dynamic and parametric relationships among sets of deposited or experimental data.
4. Madison Metabolomics Consortium Database (MMCD)
4.1. Overview
The purpose of the MMCD is to be a resource for MS- and NMR-based metabolomics by providing tools for metabolite identification and characterization. The MMCD was created initially for in-house use, but was later released to the public, hosted on the NMRFAM website (http://www.nmrfam.wisc.edu). The core design features of the MMCD have been practicality and efficiency. The MMCD collects, organizes, and edits information about metabolites from a number of sites, including BMRB. As a starting set, information for 10,912 biologically relevant metabolites was compiled from the KEGG and AraCyc pathway databases, and this list of compounds was supplemented by chemical information obtained from ChemIDplus at the National Library of Medicine Specialized Information Services (http://chem.sis.nlm.nih.gov/chemidplus/) and PubChem. Empirical chemical shifts were predicted for each of the compounds with ChemDraw software. The database also contains theoretical and experimental chemical shifts from BMRB and other sources.
4.2. Data Content
Table 1. Three categories of data in the MMCD.
1. Data related to NMR spectroscopy: experimental data collected under standard conditions; literature data (from NMRShiftDB, etc.); chemical shifts from theoretical calculations (by Gaussian 03); empirically predicted chemical shifts.
2. Data related to mass spectrometry: isotopomer masses for 12C14N / 13C14N / 12C15N / 13C15N; LC-MS data collected under defined conditions.
3. Links to chemical and biological informatics databases: e.g., PubChem, ChemIDplus, KEGG, ChEBI, HMDB and NMRShiftDB databases/websites.
Metabolomics studies require substantial informatics support to identify compounds in complex mixtures. The MMCD was designed to link three categories of data (Table 1). The flexible design of MMCD allows the database to modify both its content and informatics tools depending on demands of the metabolomics community. Each compound in the database can be characterized by more than 50 data elements. The numbers of compounds with various types of associated data are given in Table 2.
Table 2. Current contents of the MMCD.
Total compounds: 10,912
Compounds with experimental NMR data: 324
Compounds with theoretical NMR data: 150
Compounds with NMR data from the literature: 1,000
Compounds with empirical NMR data: 10,912
4.3. Query Engine
The MMCD has a flexible and efficient query system that allows searches by text, molecular structure, NMR parameters, mass spectrometry parameters (mass, retention time), and miscellaneous other criteria. With its WYSIWYG (what you see is what you get) interface, the user can combine up to five different types of search criteria. The query engine can also be used in batch mode for high-throughput searching. Clicking on a bar (Figure 3) activates the corresponding search section. Multiple search sections can be activated and queried together as a logical 'AND' relationship. Pushing the "reset" button clears all previous input and restores all sections to their original status. The "Text-based Search" section (Figure 4) is equipped with a flexible, ambiguous search engine for names or synonyms. Names recognized include those in: CAS, KEGG, CQ_ID, Exp_NMR, PubChem_SID, ChemIDplus_ID, and CHEBI_ID. Synonyms include: Common Name, IUPAC name, Beilstein Handbook Reference ID, EINECS ID, NSC ID, and CCRIS ID. When "Ambiguous search" is checked (Figure 4), the search will consider as a hit any synonym included as part of the very flexible input format.
Figure 3. Search criteria in the MMCD (search sections: Text-based Search, Structure-based Search, NMR-based Search, Mass-based Search, Miscellanea).
Wildcards '*' can be used. Two or more names can be submitted separated by ';', e.g. glucose; D-glucose. Such entries are processed according to a logical 'OR' relationship. Salt information is discarded: e.g., gluconic acid sodium salt is considered as gluconic acid. Both *ic acid and *ate yield the same result, e.g., 'D-gluconic acid' is equal to 'D-gluconate'. Clicking on the "Batch mode" bar (Figure 4) enables input from a file. The user needs to identify the file, set the index of the query item (usually beginning from 1), and the separator type. Clicking again on the "batch mode" title switches the interface back to the normal search mode. The other search modules ("Search by Structure", "NMR-Based Search", "Mass-Based Search", and "Miscellanea") have similar functionalities. As noted above, it is possible to combine searches over different criteria. For example, one can search on a particular molecular formula ("Structure-based Search") with specified chemical shift values ("NMR-based Search") and tolerances and limit the search to metabolites believed to be associated with Arabidopsis ("Miscellanea").
Figure 4. Text-based searching of the MMCD.
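A name query of the kind just described (several names joined by ';' as a logical OR, salt information ignored, '-ic acid' and '-ate' treated as equivalent) can be illustrated with a short sketch. The normalization rules and example names below are deliberately simplified assumptions, not the MMCD implementation:

import re

def normalize(name):
    name = name.strip().lower()
    name = re.sub(r"\s+(sodium|potassium|calcium)\s+salt$", "", name)   # discard salt information
    name = re.sub(r"ic acid$", "ate", name)    # treat '-ic acid' and '-ate' as equivalent
    return name

def matches(query, compound_synonyms):
    # The query may hold several names separated by ';' (logical OR).
    targets = {normalize(s) for s in compound_synonyms}
    return any(normalize(q) in targets for q in query.split(";"))

synonyms = ["D-gluconic acid", "gluconic acid sodium salt"]
print(matches("D-gluconate; glucose", synonyms))   # True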
Metabolomics Module for the Sesame Laboratory Information Management System
A metabolomics "Module" for the Sesame laboratory information management system (LIMS) has been developed and is being used to organize the activities of the M M C . Sesame is a platform-independent, specialized LIMS written in Java with C O R B A used as the middleware. The R D B M S used is Oracle or PostgreSQL, available on multiple platforms. All Sesame Modules contain tools and techniques for collaborative analysis, access, and visualization of data. All Sesame modules are available to the public and are "open source". Sesame is a web-based system that is accessible to team members over the Web. Access is customizable and is password protected. All data are secure and backed up. The Sesame module for metabolomics, called "Lamp", consists of Small Molecule, Detailed Small Molecule, Sample, Mass Sample, N M R Experiment, Software, Hardware, Vendor and Citation Views and different Lab and System Resources (Lab Protocol, Type, Status, Lock Solvent, Internal Reference Compound, Mass Spectrum Type, Mass Spectrometer, etc.). Views operate on various kinds of data, and facilitate data capture, editing, processing, analysis,
167 retrieval, or report generation. The data records from different Views can be linked to each other: for example, a Small Molecule entry is linked to a Sample, which is linked to a Mass Sample, NMR Experiment, and a Vendor. Correspondingly, an NMR Experiment is linked to Hardware and Software, etc. Every View contains general and view-specific fields. General fields include the lab name, the date the record was created, the date it was last modified, the lab protocol, user label, information, actions, linked items, attached files, and attached images. View-specific fields include the sample location (room, freezer, tower, box, position in box), description of the observed sample (constituent, name, concentration, concentration unit, isotopic labeling), lock solvent, ionic strength, pH, molecular weight, mass spectrum type, buffer and salt concentration, measured mass, clean-up steps, NMR spectrometer and probe, NMR experiment type, vendor name, address and contact info, etc. The Detailed Small Molecule View contains all the information loaded from different data sources and public databases: e.g., calculated chemical shifts, a two-dimensional image of the molecule drawn from a mol file (if it exists), linked items (samples, mass samples, NMR experiments, vendors, etc.). The Small Molecule View is designed to display results of different queries in a compound fashion. The columns are subsets of the Detailed Small Molecule View: Sesame id, PubChem id, name, formula, weight, 2D image, etc. The data in different views can be queried based on different ids, content, location, status, type, lab protocol used, actions performed, etc. The Lamp module supports queries and reports and export functions. All standard compounds and experimental samples can be bar-coded and logged into the system. This makes it possible to track the origin, location, amount, and history of each. Experimental data in Sesame can be associated with protocols entered in the system. This allows results on a given tissue to be associated with defined protocols (e.g., for extraction or fractionation). Acknowledgments Supported by NIH grant R21 DK070297; I.A.L. is the recipient of a fellowship from the NHGRI 1T32HG002760; NMR data were collected at the National Magnetic Resonance Facility at Madison (NMRFAM) funded by NIH grants (P41 RR02301 and P41 GM GM66326); metabolite standards data are archived at the Biological Magnetic Resonance Data Bank (BMRB), which is supported by a grant from the National Library of Medicine (P41 LM05799).
SETUPX - A PUBLIC STUDY DESIGN DATABASE FOR METABOLOMIC PROJECTS

MARTIN SCHOLZ, OLIVER FIEHN*

University of California, Davis Genome Center, 451 E. Health Sci. Drive, Davis, California 95616, USA
Metabolomic databases are useless without accurate description of the biological study design and accompanying metadata reporting on the laboratory workflow from sample preparation to data processing. Here we report on the implementation of a database system that enables investigators to detail and set up a biological experiment, and that also steers laboratory workflows by direct access to the data acquisition instrument. SetupX utilizes orthogonal biological parameters such as genotype, organ, and treatment(s) for delineating the dimensions of a study which define the number of classes under investigation. Publicly available taxonomic and ontology repositories are utilized to ensure data integrity and logic consistency of class designs. Class descriptions are subsequently employed to schedule and randomize data acquisitions, and to deploy metabolite annotations carried out by the seamlessly integrated mass spectrometry database, BinBase. Annotated result data files are housed by SetupX for downloads and queries. Currently, 39 users have generated 48 studies, some of which are made public.
1. Metabolomic DBs require metadata on study designs

Metabolomic data can only be interpreted against background information on the experimental design, the biological parameters that were studied, and the details of data acquisition and data processing. Metabolites, unlike proteins, genes or RNA molecules, do not commonly carry specific information content. Instead, the role of metabolites in biological processes needs to be unravelled from their changes in levels, turnover rates and location in response to influence factors such as perturbation of the genetic constitution or external stress treatments. Generally, cellular and organismal responses to such perturbations comprise many metabolic events. Only comparisons across a variety of biological studies and many different perturbation factors enable researchers to distinguish specific from unspecific effects, and therefore to precisely define the meaning of metabolomic changes. Currently, no public metabolome database comprises
* Work supported by grant R01 ES013932 of the NIEHS/National Institutes of Health.
such wealth of information on the actual conditions under which biological studies were carried out [1]. Here we present a solution for setting up metabolomic experiments, SetupX^a, comprising a description of the biological study design, management of the experimental lifecycle, and a public database for metabolomic studies. The primary objective of this system is to capture the most relevant biological metadata for a study and to give users easy access for uploading, downloading and querying such information. Secondary to its function as an experimental metadata repository, SetupX directs metabolite profiling data acquisition at the laboratory gas chromatograph-time of flight mass spectrometer and provides an overview of scheduled experiments and data acquisition status. It serves as the central interface for data processing tasks to the BinBase mass spectrometry database and for keeping result files for download. SetupX therefore presents a fully functional and public database system integrating metabolomic workflows from conceptual design through laboratory practice to steering data processing tasks and result queries. We here detail the computational aspects of SetupX for reuse of metabolomic data sets for statistical analysis and cross-study investigations. Its functionality compels researchers to carefully design and completely document biological studies.
2. Conceptualized Schema

SetupX has been developed over the past three years. Partly as a result of work for the UK Food Standards Agency^b, the first version of SetupX was based on the general 'Architecture for a metabolomics experiment' schema (ArMet) [2], which broadly classified the overall workflow and data facts into nine larger modules and the relationships between these. In a similar manner to the later concept of MIAMET [3] (the minimal information on a metabolomic experiment), ArMet demands a description of the BioSource, the object and materialization of a biological study design. However, the internal structure and the required ontologies supporting such BioSource descriptions remained vague and subject to the implementation of community-specific versions of ArMet. A similar vagueness of conceptual clarity and descriptive stringency was found in related omics areas, namely MAGE-ML [4] and proteomics database efforts. Most of the existing experimental design descriptors focused on the data
a SetupX [http://setupx.fiehnlab.ucdavis.edu/ml/]
b Food Standards Agency: Safety Assessment of Genetically Modified Foods Research Programme (G02) [http://www.food.gov.uk/science/research/researchinfo/foodcomponentsresearch/novelfoodsresearch/g02programme/]
acquisition and processing parts rather than on the biological side of studies, which are indeed harder to describe and conceptualize. Promising efforts were presented by DOME, a database system for functional genomics and systems biology covering various omic techniques^c. Its database schema embarks on a thorough description of biological metadata defined by users; however, the underlying schema is still not universal enough to capture the breadth of study design in biology. The SetupX design therefore emphasizes the description of the BioSource and only demands pointers to documents for the actual chemical processes used in sample handling and chemical preparations. Major efforts were reported by the SMRS group^d, which focused on biomedical and toxicological studies and whose work is now continued in the efforts of the Metabolomics Society. Since SetupX was planned to house a very large set of different biological studies, spanning many disciplines from plant biology to clinical research, one of the most important tasks was to keep the schema adaptable to practical experiences and to ongoing discussions in each community with respect to the organization and prerequisites of consistent and complete study design descriptions. At the same time, SetupX had to be flexible enough to account for a large variety of BioSources (spatial and genotypic descriptions of the physical objects that undergo metabolomic investigations, including their growth history) and treatments of these (experimental alterations of impact parameters influencing the metabolic states of BioSources). Consequently, SetupX utilizes a stringent
Figure 1. UML diagram of the experimental design (entities include Experiment, Dimension, BioSource, GenoType, Organ, Diet, Dose, Timeline, TimePoint and MachineRun).
c DOME [http://mendes.vbi.vt.edu/tiki-index.php?page=DOME]
d Standard Metabolic Reporting Structure [http://www.smrsgroup.org/documents/SMRS_policy_draft_v2.3.pdf]
schema that conceptualizes these two orthogonal properties of any BioSource: its physical object (including past environmental factors) and any process or parameter that investigators intentionally manipulated to enforce metabolic responses, the treatment (including time course and dose descriptors). Therefore, BioSources for metabolomic studies require complete descriptions of the object as well as of any manipulation of the object that distinguishes it from related objects in the same study. Hence, BioSource and treatment define the most important vectors that span the dimensions of a metabolomic study, called classes. In principle, further dimensions can be spanned by using different chemical treatments or data acquisition methods, but mostly this is not intended in metabolomic studies. The number of these dimensions is not limited and varies according to the experimental design of each study. Objects that cannot be distinguished by any of the vectors are called bioreplicates; they belong to the same class yet have unique object identifiers. Often, metabolomic data of these classes are later compared by statistical means in order to unravel metabolic effects that distinguish the classes. However, classes may also be combined into super-classes if certain distinguishing dimensions are deemed by investigators to be less important. BioSources and treatments can thus describe any biological experiment and are not constrained to a certain experimental condition. The demand for such flexibility created a challenge: to develop SetupX while still giving users easy means of populating the database. SetupX meets this challenge by spanning the dimensions on the fly while users enter information that classifies distinct BioSource or treatment parameters. For example, each genotype, each organ, each cell type or each difference in age, sex or past growth location defines classes ('BioSource'), as does any intentionally altered parameter such as nutritional regimen, chemical elicitors or time lines that are imposed onto the
Figure 2. Three-dimensional metabolomic study comprising 18 classes (2 genotypes x 3 organs x 3 time points, left panel), of which some classes may be void of bioreplicates and deselected in SetupX (right panel).
BioSources as part of the study ('treatment'). Figure 1 shows a simplified UML diagram of such an experimental design including the dimensions. Every class has a relation to one of the instances of variation of each dimension. Such a study design can be conceptualized as a cube if the BioSource is distinguished by two vectors (genotype and organ) and the treatment by one dimension (time), as shown in Figure 2, left panel. The vector space represents the classes in the experiment: each possible combination of one variation per dimension represents one class, shown in the image as a single cube. The maximal number of classes n spanned by d dimensions is simply the product of the numbers of variations per dimension, n = \prod_{i=1}^{d} v_i, where v_i is the number of variations along dimension i. For specific studies, not every class may be populated by bioreplicates, i.e. not all dimensions may apply to all classes. For example, mouse organs such as liver or kidney are usually studied after animals are sacrificed, whereas body fluids can be taken along treatment dimensions. In SetupX, users can therefore deselect classes that are void of bioreplicates. Multidimensional designs cannot be easily displayed to users; study designs in SetupX are therefore visualized as a table which can handle as many dimensions as needed (Fig. 3). Deselected classes are represented as grey shaded boxes.
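To make this class arithmetic concrete, the following minimal sketch (in Python) enumerates classes as the Cartesian product of the dimension values from the example in Figure 2 and lets empty classes be deselected. The dimension names and values are illustrative only and do not reflect SetupX's internal data model.

from itertools import product

# Hypothetical dimensions mirroring the example in Figure 2:
# 2 genotypes x 3 organs x 3 time points = 18 classes.
dimensions = {
    "genotype": ["wild type", "mutant"],
    "organ": ["leaf", "root", "stem"],
    "time point": ["day 1", "day 3", "day 7"],
}

# n = product of the number of variations per dimension.
n_classes = 1
for values in dimensions.values():
    n_classes *= len(values)

# Span the class vector space on the fly, as SetupX does while users
# enter BioSource and treatment parameters.
classes = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]
assert len(classes) == n_classes  # 18

# Classes without bioreplicates can be deselected (cf. Figure 2, right panel).
deselected = [c for c in classes if c["organ"] == "root" and c["time point"] == "day 7"]
active_classes = [c for c in classes if c not in deselected]
print(n_classes, len(active_classes))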
Figure 3. SetupX view of a four-dimensional study.
3. Customization of dimensions by ontologies

There is no consensus in biology on the minimal, the necessary (required), the optimal or the maximal number of parameters that describe an acceptable study. Journals usually declare that materials and methods must be described in sufficient detail to understand and repeat a scientific report, but do not detail the parameters. Curation of reports is consequently performed in the peer-review process, which lacks consistent guidelines and thus leads to frustration among authors and reviewers. Database designs lack even this peer-review process and must rely on automatic consistency checks. SetupX utilizes consistency checks on the level of dictionaries (spelling and minimal word/letter counts), controlled vocabularies and ontologies that define the parameter space for selecting dimensions. By using ontologies we can map the real name (for example, of an organ) to a unique identifier taken from the ontology and thus enable queries that compare different objects by using unique identifiers instead of strings labelling the information. Hence, queries are independent of the use of synonyms and may span different levels of abstraction. SetupX is equipped with a connector to OBO ontologies. Consistency is checked by relating a specific ontology repository to each input field in the front end; the check in the current version is a simple lookup of whether the term entered by a user is defined in the related ontology. The ontologies currently used in SetupX are the plant-based structure ontology from the Plant Ontology Consortium, the Arabidopsis development ontology from the Arabidopsis Information Resource (TAIR) and the human developmental anatomy ontology from the Medical Research Council Human Genetics Unit, Edinburgh, U.K. In addition, SetupX has built-in validations that work as a spelling check: a service from Google returns the number of results found for a value; if this number is high enough the value is accepted, and it is rejected if the returned number is lower than a defined minimum. Such simple validity checks can be assigned to any of the input fields in the system in order to prevent 'dummy' entries. For example, selection of BioSource 'human' together with a plant organ 'leaf' is disabled by SetupX using the powerful NCBI species taxonomy [5] for species definitions, which informs the (subsequent) definition of organs that are selected for metabolomic studies. Use of the NCBI taxonomy enables queries for synonyms or generalized terms such as the genus 'rat' for any of the 23 rat species that are currently defined at NCBI. Organ selections subsequently depend on the species under study. A good example is the definition of organs and their relationships given in PlantOntology [6], which can directly be utilized within the SetupX schema.
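The consistency checks described above reduce to a lookup against a controlled vocabulary plus a hit-count threshold for free-text values. The Python sketch below illustrates the idea; the simplified OBO parsing and the hit_count placeholder are our assumptions, not the actual SetupX connector or the Google service interface.

def load_obo_terms(path):
    """Collect term names and synonyms from an OBO file into a set (simplified parser)."""
    terms = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("name:"):
                terms.add(line.split("name:", 1)[1].strip().lower())
            elif line.startswith("synonym:"):
                # synonym lines look like: synonym: "leaf lamina" EXACT []
                parts = line.split('"')
                if len(parts) >= 2:
                    terms.add(parts[1].lower())
    return terms

def is_valid_term(value, ontology_terms):
    """Accept a field value only if it is defined in the related ontology."""
    return value.strip().lower() in ontology_terms

def passes_spell_check(value, hit_count, min_hits=100):
    """Reject likely 'dummy' entries; hit_count stands in for a web hit-count service."""
    return hit_count(value) >= min_hits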
However, such dependencies cause practical problems for clarity in use. For example, if two different species were selected in a single metabolomic study (e.g. 'human' and 'soybean'), different subsequent views would result, asking the user for input of specific parameters depending on the different biological ontologies that exist for these species. Whereas such different views can in principle be implemented, actual user access demonstrated that biologists might easily get confused by the number of required input parameters. Instead, SetupX enforces splitting such a study into two independent experiments that can later be combined at the level of result downloads for data processing tasks and statistical comparisons. In addition to community-accepted ontologies on the organ and genotype level, we have hard-coded parameters further describing dimensions such as 'past growth conditions'. The current version of SetupX supports parameter definitions for humans, animals, plants and microorganisms. This separation is obviously not scientifically exhaustive, but it is instrumental for adjusting user interfaces for parameter input and keeping logic consistency of the database. Customized sets of (required or optional) parameter inputs were realized by using the taxonomic tree structure, navigating from the species node upwards to the first node that classifies all underlying nodes (see the sketch after the footnote below). For example, if a selected species belongs to the plant kingdom, growth conditions on light, humidity, temperature, soil, location, developmental stage and others are requested, complying with the draft document of minimal reporting standards for describing a plant metabolomic experiment which was recently released by the Metabolomics Society^e. For the species 'human', a different set of parameters is requested, such as gender, age, body mass index and others. Depending on the actual study, however, certain parameters need to be detailed to understand the study design, and some of these parameters may only be released long after a metabolomic experiment is finished, such as 'survival rate' for cancer studies. Hence, maintenance of logic consistency of such a database is an ongoing challenge due to the huge number of parameters and study types that may influence metabolic phenotypes. As journals do, SetupX asks study investigators to detail as many parameters as possible but does not comprise many required fields. Instead, documents detailing further parameters for a given study may be uploaded by investigators; such documents will inform the further development of SetupX hard-coded parameter fields.
e Metabolomics Standards Initiative [http://msi-workgroups.sourceforge.net/]
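As referenced above, the parameter-set selection can be pictured as walking up the taxonomic tree from the selected species to the first node that carries an input template. The toy parent map and template names in this sketch are made up for illustration; they are not the SetupX implementation.

# Toy fragment of a taxonomy: child -> parent.
PARENTS = {
    "Arabidopsis thaliana": "Viridiplantae",
    "Solanum tuberosum": "Viridiplantae",
    "Homo sapiens": "Mammalia",
    "Mus musculus": "Mammalia",
    "Viridiplantae": "root",
    "Mammalia": "root",
}

# Hypothetical parameter templates attached to higher taxonomic nodes.
TEMPLATES = {
    "Viridiplantae": ["light", "humidity", "temperature", "soil", "developmental stage"],
    "Mammalia": ["gender", "age", "body mass index"],
    "root": ["growth location"],
}

def parameter_template(species):
    """Walk from the species node towards the root until a parameter template is found."""
    node = species
    while node not in TEMPLATES:
        node = PARENTS[node]  # raises KeyError for unknown species
    return TEMPLATES[node]

print(parameter_template("Solanum tuberosum"))   # plant growth parameters
print(parameter_template("Homo sapiens"))        # human subject parameters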
4. Study design classes steer laboratory workflows

A given metabolomic study may comprise many classes and even more bioreplicates. Based on experience and simple statistical considerations, the minimal number of bioreplicates populating a class is set to six in SetupX, whereas the optimal number of bioreplicates per class depends on a power analysis that takes the natural variability of metabolite levels into account. This variability is much higher in uncontrolled situations such as human (cohort) studies than under controlled laboratory conditions utilizing near-homozygous genotypes and specific nutritional regimes. Consequently, small metabolomic studies typically comprise some 48 bioreplicates, whereas larger studies easily contain hundreds, sometimes thousands of bioreplicates. The largest study included in SetupX is a project on 12 potato genotypes x 4 field trial growth locations, each class populated with 28-30 bioreplicates, which totals more than 1,300 samples.

Figure 4. SetupX data acquisition task download.

This study was funded by the British Food Standards Agency in 2003 in order to test substantial equivalence of genetically modified and classically bred potato tubers; result data sets for a field trial using the same experimental design (but under different environmental conditions, for year 2001) have previously been published. The typical cycle time for an individual sample per metabolomic data acquisition is about 30 min, or about 40 samples per day plus quality control samples. Data acquisition instruments show drifts in sensitivity and resolution, especially mass spectrometry based technology platforms. In order not to bias statistical analyses or the metabolomic data structure by non-biological factors such as machine drift, bioreplicates (classes) need to be randomized across the whole data acquisition sequence. In addition, each sample needs to unambiguously match the unique bioreplicate identifier in SetupX. Laboratory staff downloads the randomization schemata, sample pre-treatment methods
(such as 'extraction protocols') and data acquisition method parameters (such as 'split ratio' or 'detector voltage') directly from SetupX, thereby limiting systematic or gross errors such as misspelling of file identifiers or misplacing samples. Importantly, use of different methods for sample pretreatment or data acquisition routines also generates new dimensions for class definitions. Consequently, sample preparation and instrumentation parameter differences are treated in the same manner as differences between biological parameters (BioSource or treatment dimensions). A square root blocking schema randomizes samples across data acquisition schedules, adding quality controls and blank control samples as a mandatory part of the overall study (Fig. 4). Two partnering laboratories currently use SetupX, the UC Davis Genome Center metabolomics research laboratory and the metabolomics core laboratory, each with different laboratory staff and data acquisition machines. Raw data result files are processed and exported by post-acquisition macros. SetupX web interfaces enable investigators to keep track of the acquisition status by automatically checking for result file outputs and by reading the machine-generated log files. Figure 5 shows personalized tracking information for three different experiments, two of which were completed while one displayed a 56% completion status. SetupX is based on a modular design and can easily adapt to laboratory environments other than the current use of Leco gas chromatography/time-of-flight mass spectrometers. Once complete, initial mass spectral result files are scheduled using a further SetupX GUI for users (or laboratory staff) to start subsequent data processing and metabolite annotation using the seamlessly integrated, but independent, BinBase database [7]. BinBase receives information about the samples, including the class structure which is essential for the calculation, and starts an automatic
Figure 5. Tracking of the data acquisition status for three experiments.
annotation. Filtered and annotated result data sets are reported back from BinBase and uploaded to SetupX, from which investigators or the public can query both experimental metadata and metabolomic results. SetupX protects access rights to specific experiments at different access levels, from reading (level 20) to modification (level 40), download (level 45) and experiment deletion rights (level 85). Only a few studies are currently publicly available, depending on publication of the major conclusions in peer-reviewed journals. Currently, SetupX details 48 studies comprising 4,500 samples, with access rights for 39 users.
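The square-root blocking and randomization described in this section can be sketched as follows: bioreplicates are shuffled, split into roughly sqrt(n) blocks, and each block is book-ended with quality-control and blank samples. Block sizes, identifiers and QC placement here are illustrative assumptions, not the actual SetupX schedule format.

import math
import random

def acquisition_sequence(bioreplicate_ids, seed=0):
    """Randomize bioreplicates into ~sqrt(n) blocks and add QC/blank samples per block.

    A rough sketch of square-root blocking; the real SetupX schedule and its
    identifiers are not reproduced here.
    """
    rng = random.Random(seed)
    ids = list(bioreplicate_ids)
    rng.shuffle(ids)

    n_blocks = max(1, round(math.sqrt(len(ids))))
    block_size = math.ceil(len(ids) / n_blocks)

    sequence = []
    for b in range(n_blocks):
        block = ids[b * block_size:(b + 1) * block_size]
        if not block:
            continue
        sequence.append(f"QC_{b + 1}")      # quality control at the start of each block
        sequence.extend(block)
        sequence.append(f"BLANK_{b + 1}")   # blank control at the end of each block
    return sequence

print(acquisition_sequence([f"sample_{i:03d}" for i in range(1, 49)]))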
5. Implementation

SetupX has been developed as a server-side application in the J2EE^f framework using a relational database management system and can therefore be installed on any certified J2EE application server. The flexibility of the system posed challenges for implementing the front end, because the underlying schema may be subject to changes with subsequent needs for front end adaptations. Therefore the user front end is generated using a combination of Java Server Pages (JSP) and Java Servlets. Attempts to implement a user-friendly functionality for capturing experimental designs by generating the front end based on an XML schema^g, as had been exemplified previously for PEDRo, a proteomics experiment database [8], were unsuccessful: the PEDRo approach inhibits easy customization of front ends that are intuitive for first-time users. Most connections from SetupX to different services and databases are implemented using SOAP Web Services^h. Web Services enable communication between systems independently of their implementation and programming language by using XML as a self-describing exchange format. Most importantly, SetupX is integrated with the metabolic annotation database BinBase by using a WebService as the interaction technology. Figure 6 demonstrates how the system communicates with other services. Each user request invokes several queries on the NCBI database. However, response times of the NCBI service are too slow to be used in a live system; therefore, the connection, which was implemented as a WebService (Entrez^i), was replaced by mirroring the whole NCBI taxonomy database locally, with weekly downloads of updates. Based on these experiences, other resources such as the description of plant organs in Plantontology.org were handled in the same manner by installing local copies.

Figure 6. Communication via WebServices between SetupX and other services.

f Java 2 Platform, Enterprise Edition [http://java.sun.com/javaee/]
g W3C XML Schema [http://www.w3.org/XML/Schema]
h W3C Web Services [http://www.w3.org/2002/ws/]
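A local mirror of the kind described above can be as simple as loading the NCBI taxonomy names dump into an in-memory index that is refreshed weekly. The sketch below assumes the standard names.dmp dump format (fields separated by tab-pipe-tab); the file location and refresh policy are assumptions, and the actual SetupX mirroring code is not shown here.

def load_ncbi_names(path="names.dmp"):
    """Build a name -> tax_id lookup from a locally mirrored NCBI names.dmp file.

    names.dmp fields are separated by '\t|\t': tax_id | name | unique name | name class.
    Scientific names and synonyms are both indexed, so queries are robust to synonyms.
    """
    index = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\t|\n").split("\t|\t")
            if len(fields) < 4:
                continue
            tax_id, name, _unique, name_class = fields[:4]
            if name_class in ("scientific name", "synonym", "genbank common name"):
                index[name.lower()] = int(tax_id)
    return index

# Example (assumes a local, weekly-updated copy of names.dmp):
# taxa = load_ncbi_names("/data/mirror/ncbi/names.dmp")
# taxa["homo sapiens"]  -> 9606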
6. Conclusions and Challenges

SetupX presents a conceptually clear and pragmatic implementation of a database solution for setting up and describing metabolomic studies, from study design to laboratory workflow and result data housing. Most importantly, it encompasses the biological metadata needed to enable the public to reuse metabolomic data sets and to gradually learn more about specific versus unspecific metabolic responses to study parameters like 'abiotic stress'. SetupX empowers queries across studies such as 'Which experiments are present for a certain species?' or 'Download all data corresponding to plant leaf studies.' Data can also be queried from the perspective of metabolites, such as 'Report all data referring to a specific compound.' Obviously, such queries yield more interesting results with a growing number of studies stored in the system. However, community efforts such as the Metabolomics Standards Initiative are needed to further define minimal (required) and optimal (best practice) reporting standards. The database schema employed here can easily be replaced by a different schema or mapped onto other schemas once consensus formats are established by biological communities. Such consensus schemas will truly enable the exchange of studies just by the transfer of a file, not by parsing the contents of scientific reports or by repeating studies. The difficult part here is to convince biologists to undertake the effort to carefully populate the study design databases. We envision an intelligent import of experimental metadata from a range of typical document types such as Excel sheets by automatically analyzing the document data structure and recreating a blueprint of that particular study in SetupX.

i Entrez [http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]
The variability of biological study designs is so great that it is hard to imagine devising a perfect way to painlessly parse all relevant biological metadata from documentation held by investigators. Two examples may highlight these challenges: (a) Pharmacological studies frequently involve rodents; however, the genetic repertoire of standard laboratory mouse strains is not reflected by NCBI species codes but differs between individual laboratories or suppliers, based on complex progeny and breeding schemata. (b) Clinical studies challenge the presented study design in yet another, very different way. Classes may be compiled from a variety of patient (or volunteer) data, owing to the individuality of every human subject, which reflects a unique genotypic, phenotypic and societal context. In addition, a number of diseases are too rare to collect a high enough number of specimens for thorough statistical treatment. Despite great efforts to match patient and control subjects, metadata acquired in follow-up studies often justify regrouping subjects into different study design classes or carrying out other kinds of statistical analyses. It is therefore very hard to accurately represent the wealth of clinical patient metadata that could potentially impact metabolic phenotypes while simultaneously keeping strict patient privacy.
References

1. Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmuller E, Dormann P, Weckwerth W, Steinhauser D, et al. Bioinformatics 21:1635-1638 (2005).
2. Jenkins H, Hardy N, Beckmann M, Draper J, Smith A, Taylor J, Fiehn O, Goodacre R, Bino R, et al. Nature Biotechnology 22:1601-1606 (2004).
3. Bino RJ, Hall RD, Fiehn O, et al. Trends Plant Sci 9:418-425 (2004).
4. Spellman P, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, et al. Genome Biol 3:research0046.1-0046.9 (2002).
5. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Nucleic Acids Res 34:D173-D180 (2006).
6. Jaiswal P, Avraham S, Ilic K, Kellogg E, McCouch S, Pujar A, Reiser L, et al. Comp Func Gen 6:388-397 (2006).
7. Fiehn O, Wohlgemuth G, Scholz M. Lect Notes Bioinformatics 3615:224-239 (2005).
8. Garwood K, McLaughlin T, Garwood C, et al. BMC Genomics 5:68 (2004).
COMPARATIVE METABOLOMICS OF BREAST CANCER*

CHEN YANG†, ADAM D. RICHARDSON†, JEFFREY W. SMITH, and ANDREI OSTERMAN

The Burnham Institute for Medical Research, La Jolla, California 92037, USA

Comparative metabolic profiling of cancerous and normal cells improves our understanding of the fundamental mechanisms of tumorigenesis and opens new opportunities in target and drug discovery. Here we report a novel methodology of comparative metabolome analysis integrating information about both the metabolite pools and the fluxes associated with a large number of key metabolic pathways in model cancer and normal cell lines. The data were acquired using [U-13C]glucose labeling followed by two-dimensional NMR and GC-MS techniques and analyzed using an isotopomer modeling approach. Significant differences revealed between breast cancer and normal human mammary epithelial cell lines are consistent with previously reported phenomena such as upregulation of fatty acid synthesis. Additional changes established for the first time in this study expand a remarkable picture of global metabolic rewiring associated with tumorigenesis and point to new potential diagnostic and therapeutic targets.
1. Introduction

Since the completion of the human genome, the main thrust of functional genomics has been to establish the link between gene/protein expression profiles and cellular phenotype in normal and disease states, most notably in cancer. Remarkable advancements in transcriptomics and proteomics technologies (1,2) have led to the identification of novel therapeutic targets as well as tumor subtypes and biomarkers (3-6). Nevertheless, these technologies, taken alone, fall short of reflecting the entire picture of cellular networks and pathways. A direct assessment of a large number of intermediary metabolites and metabolic activities (metabolomics) is emerging as a powerful complementary approach for identifying pathways that are perturbed in a given pathology. Metabolic profiling is of special importance in cancer biology due to the profound changes in central metabolism associated with many tumors, as established by early biochemical studies (7) and recently confirmed by functional genomics techniques. Nevertheless, our current knowledge of the molecular processes associated with these metabolic changes is quite incomplete.
* This work was partially supported by a California Breast Cancer Research Program Fellowship to CY and by the Burnham Institute Cancer Center Support Grant (2 P30 CA30199-24, PI R. Abraham).
† Both authors contributed equally to this work.
Despite the recent progress in metabolomics technology, its applications in the field of human biology are still limited, mostly due to many technical challenges. Among the most obvious are a limited availability of biological material, insufficient sensitivity and resolution of existing protocols, incomplete reference data (e.g., for NMR peak assignment) and the lack of an established computational modeling framework. In this study we addressed some of these problems by combining [U-13C]glucose labeling with two-dimensional NMR and GC-MS techniques to assess simultaneously the metabolite pools and fluxes associated with several interrelated metabolic pathways in human cells. We were able to assign two-dimensional NMR signals for 24 intermediary metabolites representing a substantial fraction of central metabolism. The acquired data were analyzed using an isotopomer model derived from a reconstruction of an extensive metabolic network. We applied this approach to the comparative analysis of breast cancer and normal human mammary epithelial cell lines. An isotopomer model was developed for a metabolic network including the reactions of central carbon, fatty acid, and amino acid metabolism. We chose this metabolic network because it is the central backbone of metabolism, providing energy, cofactor regeneration, and building blocks for cell synthesis. Moreover, cancer cells have been reported to display different activities of some of these pathways. We determined the active pathways and the flux distribution in this metabolic network. The observed pattern of metabolic changes is consistent with earlier observations of metabolic shifts in tumors (7), validating the developed methodology. A number of newly established changes in metabolic fluxes and pools provided us with new insights into potential diagnostic markers and therapeutic targets.

2. Materials and Methods

2.1. Experimental techniques

Experimental procedures are only briefly introduced in this subsection. The details are provided in the Supplementary On-line Materials (SOM, available at http://www.burnham.org/labs/osterman/).

Cell lines and cultivation. Human cell lines used in this study were: MCF-10A (ATCC), derived from normal mammary epithelial cells, and MDA-MB-435 (NCI), a highly metastatic mammary epithelial cancer cell line. Cultivation, metabolic labeling with 20% [U-13C]glucose and harvesting were performed as described in the SOM.
Gas chromatography-mass spectrometry (GC-MS) was used to analyze fatty acids. Samples from ~5x10^7 cells were prepared as described (8) and analyzed on a Trace GC/Trace MS Plus system (see SOM for details). The mass isotopomer distribution, corrected for natural abundance (9), was used to assess de novo fatty acid synthesis as described (8). Nuclear magnetic resonance (NMR) was used to analyze a mixture of methanol- and water-soluble metabolites extracted from ~2.5x10^8 labeled cells as described in the SOM. Two-dimensional [13C,1H] HSQC spectra were acquired using a Bruker Avance 500 NMR spectrometer. The 13C-13C scalar coupling fine structures were extracted from the cross sections taken along the 13C axis in a HSQC spectrum using the Bruker XWINNMR software. The concentrations of metabolites were determined by integrating the cross peaks in the HSQC spectra using the NMRPipe (10) and Sparky (http://www.cgl.ucsf.edu/home/sparky/) software packages, comparing with the integral of the resonance peaks of L-methionine, which was treated as an internal standard, and normalizing to the amount of total cellular protein.
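Expressed as a formula (with I denoting cross-peak integrals and m_prot the amount of total cellular protein; these symbols are ours, introduced only to summarize the normalization just described):

\[ c_{\text{rel}}(\text{metabolite}) \;=\; \frac{I_{\text{metabolite}}}{I_{\text{Met}}\; m_{\text{prot}}} \]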
2.2. 13C Isotopomer Model

A mathematical model describing the 13C isotopomer distribution of metabolites in human cells fed with 13C-glucose was developed. It was used for the determination of metabolic fluxes from the 13C-multiplet patterns of metabolites in HSQC spectra. The considered metabolic network included glycolysis, the pentose phosphate pathway (PPP), the tricarboxylic acid (TCA) cycle, the anaplerotic reaction, and the biosynthetic pathways of fatty acids and non-essential amino acids. The fraction of de novo synthesis of fatty acids was determined based on the mass isotopomer distribution measured by GC-MS as described (8). The fluxes through the other pathways are derived as follows.

Contribution of PPP to pyruvate/alanine formation. The assessment of PPP activity relies on the analysis of the 13C multiplets of alanine C2. The observed relative multiplet intensities were transformed to the relative abundances of intact carbon fragments (11). According to the carbon rearrangements in the PPP, three pentose molecules (C1-C2-C3-C4-C5 backbone) yield five pyruvate molecules. Three of them retain an intact C1-C2-C3 fragment, while two molecules carry only a C2-C3 fragment of the original backbone. The latter fraction, denoted f_{2,3}(Ala-C2), is the main contributor to the isotopomer population of [2,3-13C2]alanine, as assessed by the relative intensity of the doublet (1JCC = 35 Hz) of alanine C2. The total fraction of pyruvate derived from PPP can be estimated as (5/2) f_{2,3}(Ala-C2).

TCA cycle and anaplerotic flux. [U-13C]pyruvate enters the TCA cycle either by pyruvate dehydrogenase oxidation or by the anaplerotic reaction of
pyruvate carboxylase. The first process generates [4,5-13C2]α-ketoglutarate via [1,2-13C2]acetyl-CoA. Since the intracellular α-ketoglutarate concentration is too low to be detected by NMR, its labeling state was assessed via glutamate, an abundant metabolite in rapid exchange with α-ketoglutarate. The isotopomer population of [4,5-13C2]glutamate reflects the flux through pyruvate dehydrogenase, which equals the TCA cycle (citrate synthase) flux, provided the acetyl-CoA synthetase flux is zero. The second process is expected to yield a distinct labeling pattern represented by [1,2,3-13C3] and [2,3-13C2] glutamate. This pattern reflects the formation of [1,2,3-13C3] and [2,3,4-13C3] oxaloacetate due to the pyruvate carboxylase reaction followed by the reversible interconversion between the asymmetric oxaloacetate and symmetric succinate (or fumarate). The relative activity of pyruvate carboxylase versus pyruvate dehydrogenase (v_PC / v_PDH) was calculated from the 13C multiplet components of glutamate at C3 and C4 using Eq. 1:

\[ \frac{v_{PC}}{v_{PDH}} = \frac{d(\text{Glu-C3})}{d^*(\text{Glu-C4}) + q(\text{Glu-C4})} \qquad (1) \]

where d(Glu-C3) is the contribution of the doublet (1JCC = 35 Hz) to the multiplets of glutamate C3, while d*(Glu-C4) and q(Glu-C4) are the relative contributions of the doublet with a 1JCC coupling constant of 55 Hz and of the quartet to the glutamate C4 multiplets, respectively.

Non-essential amino acid biosynthesis. We investigated the activities of the biosynthetic pathways of cysteine, glutamate, glutamine, glycine, and proline. The equation used for glycine biosynthesis is derived as follows, and similarly for the other amino acids. Glycine can be synthesized via serine from the glycolytic intermediate 3-phosphoglycerate, or obtained directly from media components. 3-Phosphoglycerate has the same labeling pattern as pyruvate. Thus we obtain the isotopomer balance equation (2):

\[ P_{\text{Gly-C2}}\left(\frac{s}{s+d}\right)_{\text{Gly-C2}} = X^{syn}\, P_{\text{Ala-C2}}\left(\frac{s+d}{s+d+d^*+q}\right)_{\text{Ala-C2}} + \left(1-X^{syn}\right) P_n \qquad (2) \]

where X^{syn} is the fraction of glycine derived from glycolysis; s, d, d*, and q correspond to the relative intensities of the singlet, doublet, doublet with a larger coupling constant, and quartet, respectively; P_n is the natural 13C abundance (P_n = 0.011); and P_{Gly-C2} and P_{Ala-C2} are the specific enrichments of glycine C2 and alanine C2. P_{Gly-C2} can be calculated from X^{syn}, P_n and P_{Ala-C2} using the relation P_{Gly-C2} = X^{syn} P_{Ala-C2} + (1 - X^{syn}) P_n. Therefore X^{syn} can be derived from the analysis of the 13C multiplets of alanine C2 and glycine C2 using Eq. 3:

\[ X^{syn} = \frac{P_n\left[1-\left(\frac{s}{s+d}\right)_{\text{Gly-C2}}\right]}{P_{\text{Ala-C2}}\left[\left(\frac{s}{s+d}\right)_{\text{Gly-C2}}-\left(\frac{s+d}{s+d+d^*+q}\right)_{\text{Ala-C2}}\right]+P_n\left[1-\left(\frac{s}{s+d}\right)_{\text{Gly-C2}}\right]} \qquad (3) \]
3. Results

3.1. NMR Spectral Assignment

Fig. 1 shows a typical two-dimensional [13C,1H] HSQC spectrum of metabolites extracted from the human breast cancer cells. The assignment of 13C-1H cross peaks for various metabolites was made by comparing the carbon and proton chemical shifts with literature values (12-17), with spectra of pure compounds, and by spiking the samples. Overall, 24 metabolites could be unambiguously assigned. The details of the peak assignments and the reference summary Table S1 of characteristic chemical shifts are provided in the SOM.
Fig. 1. A typical two-dimensional [13C,1H] HSQC spectrum of the metabolites extracted from breast cancer cells. Abbreviations for the assigned peaks are as in Table S1.
3.2. Metabolic Fluxes

A comparison of the relative intensities of 13C-13C scalar coupling multiplet components of various metabolites extracted from [U-13C]glucose-labeled
MCF-10A and MDA-MB-435 cells is shown in Table 1. These data were used in the 13C isotopomer model to determine the metabolic fluxes or flux ratios through individual pathways including glycolysis, the PPP, the TCA cycle and anaplerotic reaction, and the fatty acid and amino acid biosynthetic pathways (Fig. 2).
Table 1. Relative intensities of 13C multiplet components of metabolites extracted from MCF-10A and MDA-MB-435 cells grown on [U-13C]glucose (a)

Carbon position | Isotopomer population | Multiplet | MCF-10A | MDA-MB-435
Alanine-C2 | 2-13C | s | 0.27 | 0.16
Alanine-C2 | 2,3-13C2 | d | 0.01 | 0.11
Alanine-C2 | 1,2-13C2 | d* | 0.01 | 0.01
Alanine-C2 | 1,2,3-13C3 | q | 0.71 | 0.72
Alanine-C3 | 3-13C | s | 0.28 | 0.17
Alanine-C3 | 2,3-13C2 | d | 0.72 | 0.83
Lactate-C3 | 3-13C | s | 0.16 | 0.20
Lactate-C3 | 2,3-13C2 | d | 0.84 | 0.80
Acetyl-CoA (GlcNAc/GalNAc)-C2 | 2-13C | s | 0.29 | 0.14
Acetyl-CoA (GlcNAc/GalNAc)-C2 | 1,2-13C2 | d | 0.71 | 0.86
Glutamine-C4 | 4-13C | s | 0.50 | (b)
Glutamine-C4 | 3,4-13C2 | d | 0.01 | (b)
Glutamine-C4 | 4,5-13C2 | d* | 0.48 | (b)
Glutamine-C4 | 3,4,5-13C3 | q | 0.01 | (b)
Glutamate-C3 | 3-13C | s | 0.73 | 0.72
Glutamate-C3 | 2,3-13C2/3,4-13C2 | d | 0.27 | 0.27
Glutamate-C3 | 2,3,4-13C3 | t | 0 | 0.01
Glutamate-C4 | 4-13C | s | 0.50 | 0.30
Glutamate-C4 | 3,4-13C2 | d | 0.01 | 0.01
Glutamate-C4 | 4,5-13C2 | d* | 0.48 | 0.66
Glutamate-C4 | 3,4,5-13C3 | q | 0.01 | 0.03
Glu (GSH)-C3 | 3-13C | s | 0.67 | 0.71
Glu (GSH)-C3 | 2,3-13C2/3,4-13C2 | d | 0.32 | 0.28
Glu (GSH)-C3 | 2,3,4-13C3 | t | 0.01 | 0.01
Glu (GSH)-C4 | 4-13C | s | 0.24 | 0.13
Glu (GSH)-C4 | 3,4-13C2 | d | 0.02 | 0.02
Glu (GSH)-C4 | 4,5-13C2 | d* | 0.70 | 0.73
Glu (GSH)-C4 | 3,4,5-13C3 | q | 0.04 | 0.12
Gly (GSH)-C2 | 2-13C | s | 0.88 | 0.27
Gly (GSH)-C2 | 1,2-13C2 | d | 0.12 | 0.73
Glycine-C2 | 2-13C | s | 0.86 | 0.27
Glycine-C2 | 1,2-13C2 | d | 0.14 | 0.73
Proline-C4 | 4-13C | s | 1.00 | 0.25
Proline-C4 | 4,5-13C2 | d | 0.00 | 0.71
Proline-C4 | 3,4,5-13C3 | t | 0.00 | 0.04
Proline-C5 | 5-13C | s | 1.00 | 0.25
Proline-C5 | 4,5-13C2 | d | 0.00 | 0.75

(a) s, singlet; d, doublet (1JCC ~35 Hz); d*, doublet split by a large coupling constant (1JCC ~60 Hz); t, triplet; q, quartet.
(b) Resonance of glutamine C4 is below the detection level in the MDA-MB-435 cells.
The relative activity of PPP versus glycolysis was determined based on the analysis of 13C multiplets of alanine C2 as described above. The contribution of
the signature doublet (1JCC = 35 Hz) to the multiplets of alanine C2 is very small in MCF-10A but significant in MDA-MB-435 cells (Table 1), suggesting that the relative contribution of the PPP to the production of pyruvate is substantially higher in the malignant cells (28%) than in the nonmalignant cells (~2%), where the bulk of pyruvate stems from glycolysis (Fig. 2). The increased use of the PPP enables the MDA-MB-435 cells not only to supply more ribose for nucleic acid synthesis, but also to recruit more NADPH reducing power for fatty acid synthesis. Indeed, the GC-MS analysis performed in this study revealed that 47% of palmitate is newly synthesized from glucose in MDA-MB-435 cells (Fig. 2), in correlation with the observed increase in PPP flux. The de novo synthesized fractions of palmitoleate, stearate, and oleate are 37%, 35%, and 18%, respectively. This is in marked contrast with almost no de novo fatty acid synthesis in MCF-10A cells, as evidenced by the lack of 13C tracer accumulation in palmitate, palmitoleate, stearate or oleate.
Fig. 2. Metabolic fluxes in MCF-10A and MDA-MB-435 cells (mean ± s.d.; n = 4). Bars show pyruvate from the PP pathway, fatty acid synthesized from glucose, the contribution of anaplerosis to the TCA cycle, and glycine and proline synthesized from glucose.
The relative fluxes through pyruvate carboxylase and pyruvate dehydrogenase were estimated from the analysis of glutamate labeling. The major isotopomer populations of 4,5-13C2 of glutamate and of the γ-glutamyl moiety of glutathione indicated that these carbon atoms are derived from [1,2-13C2]acetyl-CoA (Table 1). The isotopomer ratio of acetyl-CoA C2, 1,2-13C2 / 2-13C1, which can be assessed via the acetyl moiety of GlcNAc or GalNAc, is 2.5 for MCF-10A and 6.1 for MDA-MB-435. Whereas these ratios are similar to the isotopomer ratios of (4,5-13C2 + 3,4,5-13C3) / (4-13C1 + 3,4-13C2) of glutathione C4
(2.8 for MCF-10A and 5.7 for MDA-MB-435), they are markedly different from the glutamate C4 ratios (0.96 for MCF-10A and 2.2 for MDA-MB-435). This indicates that the C4 and C5 in the γ-glutamyl moiety of glutathione are solely derived from acetyl-CoA, whereas glutamate is likely subject to isotopic dilution originating from a non-enriched carbon source (e.g., glutamine). Therefore, the isotopomer distribution of the γ-glutamyl moiety of glutathione was used to determine the relative activity of the anaplerotic reaction versus the TCA cycle. The observed flux ratio of the pyruvate carboxylase reaction over the TCA cycle is slightly decreased in MDA-MB-435 compared to MCF-10A cells (Fig. 2). Analysis of the 13C labeling patterns of the nonessential amino acids allowed us to determine the activity of the respective biosynthetic pathways. Using the 13C isotopomer model, we found that cysteine is obtained directly from media components, and that the activity of glutamate and glutamine biosynthesis is not changed significantly between MCF-10A and MDA-MB-435 cells (data not shown). Interestingly, MCF-10A cells do not utilize glucose for synthesis of glycine and proline, whereas these amino acids are actively synthesized from glucose in MDA-MB-435 cells (Fig. 2).

3.3. Metabolite Pools

We used the 2D NMR data from the same labeling experiments to determine and compare the concentrations of the unambiguously assigned metabolites (Table 2). Quantitation of metabolites with natural isotope abundance yields directly the total metabolite concentrations. At the same time, the differences observed for biosynthetically labeled metabolites may originate from changes in pool sizes as well as from differences in 13C enrichment. In many cases these effects can be decoupled, as illustrated below. Comparison of the MCF-10A and MDA-MB-435 cell lines revealed significant changes in the pool sizes of many metabolites. For example, the malignant cells exhibited significantly increased glutathione, m-inositol, and creatine concentrations and decreased isoleucine, leucine, valine, and taurine concentrations. The phosphocholine level is higher, whereas free choline and glycerophosphocholine were below the detection level in MDA-MB-435. The observed 12-fold increase in the C2 and C3 peaks of succinate cannot be explained solely by the 13C enrichment, which could account for only ~12% of the overall increase. The latter estimate is based on the labeling pattern of α-ketoglutarate deduced from the observed ~1.3-fold 13C enrichment at the C3 and C4 of the γ-glutamyl moiety of glutathione. Therefore, the total pool size of succinate was significantly increased in MDA-MB-435 cells. A similar approach allowed us to establish a substantial increase in the total pool size of GlcNAc or GalNAc and a decrease in those of alanine, glutamine, and glycine (Fig. 3).
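The decoupling argument for succinate can be made explicit with a back-of-the-envelope calculation, under the assumption that an HSQC peak intensity factorizes into a pool-size term and a positional 13C-enrichment term:

\[ \frac{\text{pool}_{\text{MDA-MB-435}}}{\text{pool}_{\text{MCF-10A}}} \;\approx\; \frac{\text{peak ratio}}{\text{enrichment ratio}} \;\approx\; \frac{12.3}{1.3} \;\approx\; 9.5 , \]

i.e., even after correcting for the higher enrichment, the succinate pool itself increases by roughly an order of magnitude.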
Table 2. Comparison of metabolite concentrations in MCF-10A and MDA-MB-435 cells (a)

Metabolites (b) | Ratio MDA-MB-435 / MCF-10A
Arginine | 0.98 ± 0.15
GSH | 1.59 ± 0.08
Isoleucine | 0.27 ± 0.04
Leucine | 0.48 ± 0.05
Lysine | 0.74 ± 0.16
Valine | 0.26 ± 0.03
m-Inositol | 1.75 ± 0.10
Free choline | < 0.25
Phosphocholine | 1.72 ± 0.09
Glycerophosphocholine | < 0.10
Total choline | 1.39 ± 0.17
Phosphocholine / glycerophosphocholine | > 17.2
Creatine | 1.74 ± 0.08
Taurine | 0.25 ± 0.03

Metabolites labeled at specific positions (c) | Ratio MDA-MB-435 / MCF-10A
Alanine C2 | 1.02 ± 0.05
Alanine C3 | 1.00 ± 0.05
Glutamine C4 | < 0.05
Glutamate C3 | 1.56 ± 0.22
Glutamate C4 | 2.09 ± 0.31
Glu (GSH) C3 | 2.17 ± 0.16
Glu (GSH) C4 | 1.94 ± 0.15
Gly (GSH) C2 | 6.54 ± 0.98
Glycine C2 | 1.15 ± 0.06
Proline C4 | 8.57 ± 2.53
Proline C5 | 10.9 ± 3.3
Lactate C2 | 0.67 ± 0.17
Succinate C2/C3 | > 12.3
GlcNAc / GalNAc C2 | > 14.7
UDP-GlcNAc / UDP-GalNAc C2 | 2.56 ± 0.64
UTP / UDP C1 | 3.38 ± 0.53

(a) Relative amounts of the various compounds were obtained by normalizing peaks to the internal reference standard, and further normalized per 1 mg of total protein (mean ± s.d.; n = 4).
(b) Quantitation of metabolites with natural isotope abundance (a direct measure of metabolite concentrations).
(c) Differences observed for biosynthetically labeled metabolites may reflect both a 13C enrichment and a change in the total pool size.
4. Discussion

The key aspects of the metabolomics methodology used in this study were: 1. A comparative approach was applied to assess metabolic changes in a model system of the highly metastatic cell line MDA-MB-435 versus the immortalized nontumorigenic cell line MCF-10A. 2. [U-13C]glucose labeling followed by high-resolution 2D NMR spectroscopy allowed us to monitor twenty-four intracellular metabolites (Tables 1 and 2) in addition to the fatty acids analyzed by GC-MS. 3. An extensive 13C isotopomer model was developed to determine and compare fluxes through the key central metabolic pathways including glycolysis, the PPP, the TCA cycle and anaplerotic reactions, and the biosynthetic pathways of fatty acids and non-essential amino acids (Fig. 2). 4. A combination of fluxes with individual metabolite pools within a single metabolic reconstruction framework expanded our ability to interpret the underlying metabolic transitions (Fig. 3). Although most of the individual components of this approach have been previously described, to our knowledge this is the first study in which a combination of these techniques was systematically applied to the metabolomics of
cancer. Although comprehensive isotopomer models are widely used in microbial systems (18,19), only a few models have been described for human cells (20-29). Most of these models were restricted to relatively narrow metabolic subnetworks (20-25) or based on the labeling data for one (i.e., glutamate (25,26)) or a few individual metabolites (27-29). Due to the higher sensitivity of the HSQC method compared to regular 13C-NMR, we were able to decrease the amount of cells required for the analysis. The increased signal dispersion in 2D spectra allowed us to analyze a wide range of metabolites without prior separation.
Fig. 3. Metabolic profile changes in breast tumors compared with normal human mammary epithelial cells. The arrows represent the fluxes, normalized to the glucose uptake rate. The boldface arrows indicate the fluxes that are significantly upregulated. The pool sizes of the boxed metabolites are directly assessed by [13C,1H] HSQC. Metabolites are colored if their concentrations are increased (black), decreased (white), or not changed (gray). G6P, glucose-6-phosphate; R5P, ribose-5-phosphate; GAP, glyceraldehyde-3-phosphate; 3-PG, 3-phosphoglycerate. See other abbreviations in Table S1 given in the SOM.
An integration of fluxes and pool sizes acquired within a single experiment
gives a more detailed fingerprint of the phenotype compared to conventional approaches based on one parameter. Although fluxes provide a direct measure of metabolic activities pointing to potential targets, they can usually be obtained only for a subset of central metabolic pathways. Metabolite pools can be readily assessed for both central and peripheral metabolites; while providing only indirect evidence of metabolic activities, they can be used as biomarkers. We observed a sharp increase in the metabolic activity of several pathways in cancer cells (Figs. 2 and 3). Some of these observations, such as the upregulation of the PPP and of fatty acid synthesis, are consistent with previous reports (30,31), providing a validation of the approach. An increase in other fluxes, e.g., the synthesis of glycine and proline, is reported here for the first time. Possible implications of these changes in establishing and maintaining a breast cancer phenotype are yet to be explored. Some of the observed changes in metabolite pools can be readily interpreted in the context of the respective fluxes. For example, the pools of all monitored amino acids decreased or remained largely unchanged in cancer cells, despite the established upregulation of some of the respective biosynthetic pathways (Fig. 3). This is consistent with accelerated consumption of amino acids for protein synthesis. At the same time, the pool of glutathione (GSH in Fig. 3), which is not consumed at the same rate, increased in keeping with the increased synthetic flux. Overproduction of GSH in tumors may reflect increased resistance towards oxidative stress (32). We observed significant alterations in the pools of several peripheral metabolites (e.g., creatine and taurine), whose metabolism may not be easily assessed via flux measurements. Therefore, the results obtained in this study, in addition to validating the approach, provide new information about metabolic aspects of tumorigenesis and can aid the identification of new diagnostic and therapeutic targets. The presented approach constitutes a promising analytical tool to screen different metabolic phenotypes in a variety of cell types and pathological conditions.
REFERENCES

1. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. O., and Herskowitz, I. (1998) Science 282, 699-705
2. Klose, J., Nock, C., Herrmann, M., Stuhler, K., Marcus, K., Bluggel, M., Krause, E., Schalkwyk, L. C., Rastan, S., Brown, S. D., Bussow, K., Himmelbauer, H., and Lehrach, H. (2002) Nat Genet 30, 385-393
3. Voss, T., Ahorn, H., Haberl, P., Dohner, H., and Wilgenbus, K. (2001) Int J Cancer 91, 180-186
4. Moch, H., Schraml, P., Bubendorf, L., Mirlacher, M., Kononen, J., Gasser, T., Mihatsch, M. J., Kallioniemi, O. P., and Sauter, G. (1999) Am J Pathol 154, 981-986
5. Celis, J. E., Celis, P., Ostergaard, M., Basse, B., Lauridsen, J. B., Ratz, G., Rasmussen, H. H., Orntoft, T. F., Hein, B., Wolf, H., and Celis, A. (1999) Cancer Res 59, 3003-3009
6. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999) Science 286, 531-537
7. Dang, C. V., Lewis, B. C., Dolde, C., Dang, G., and Shim, H. (1997) J Bioenerg Biomembr 29, 345-354
8. Lee, W. N., Bassilian, S., Guo, Z., Schoeller, D., Edmond, J., Bergner, E. A., and Byerley, L. O. (1994) Am J Physiol 266, E372-383
9. Wittmann, C., and Heinzle, E. (1999) Biotechnol Bioeng 62, 739-750
10. Delaglio, F., Grzesiek, S., Vuister, G. W., Zhu, G., Pfeifer, J., and Bax, A. (1995) J Biomol NMR 6, 277-293
11. Szyperski, T. (1995) Eur J Biochem 232, 433-448
12. Gribbestad, I. S., Petersen, S. B., Fjosne, H. E., Kvinnsland, S., and Krane, J. (1994) NMR Biomed 7, 181-194
13. Gribbestad, I. S., Sitter, B., Lundgren, S., Krane, J., and Axelson, D. (1999) Anticancer Res 19, 1737-1746
14. Pal, K., Sharma, U., Gupta, D. K., Pratap, A., and Jagannathan, N. R. (2005) Spine 30, E68-72
15. Patel, A. B., Srivastava, S., Phadke, R. S., and Govil, G. (1999) Anal Biochem 266, 205-215
16. Sharma, U., Atri, S., Sharma, M. C., Sarkar, C., and Jagannathan, N. R. (2003) NMR Biomed 16, 213-223
17. Sharma, U., Mehta, A., Seenu, V., and Jagannathan, N. R. (2004) Magn Reson Imaging 22, 697-706
18. Dauner, M., Bailey, J. E., and Sauer, U. (2001) Biotechnol Bioeng 76, 144-156
19. Schmidt, K., Nielsen, J., and Villadsen, J. (1999) J Biotechnol 71, 175-189
20. Fernandez, C. A., and Des Rosiers, C. (1995) J Biol Chem 270, 10037-10042
21. Lapidot, A., and Gopher, A. (1994) J Biol Chem 269, 27198-27208
22. Jeffrey, F. M., Storey, C. J., Sherry, A. D., and Malloy, C. R. (1996) Am J Physiol 271, E788-799
23. Malloy, C. R., Sherry, A. D., and Jeffrey, F. M. (1988) J Biol Chem 263, 6964-6971
24. Vercoutere, B., Durozard, D., Baverel, G., and Martin, G. (2004) Biochem J 378, 485-495
25. Lu, D., Mulder, H., Zhao, P., Burgess, S. C., Jensen, M. V., Kamzolova, S., Newgard, C. B., and Sherry, A. D. (2002) Proc Natl Acad Sci USA 99, 2708-2713
26. Cline, G. W., Lepine, R. L., Papas, K. K., Kibbey, R. G., and Shulman, G. I. (2004) J Biol Chem 279, 44370-44375
27. Boren, J., Cascante, M., Marin, S., Comin-Anduix, B., Centelles, J. J., Lim, S., Bassilian, S., Ahmed, S., Lee, W. N., and Boros, L. G. (2001) J Biol Chem 276, 37747-37753
28. Boren, J., Lee, W. N., Bassilian, S., Centelles, J. J., Lim, S., Ahmed, S., Boros, L. G., and Cascante, M. (2003) J Biol Chem 278, 28395-28402
29. Portais, J. C., Schuster, R., Merle, M., and Canioni, P. (1993) Eur J Biochem 217, 457-468
30. Boros, L. G., Cascante, M., and Lee, W. N. (2002) Drug Discov Today 7, 364-372
31. Baron, A., Migita, T., Tang, D., and Loda, M. (2004) J Cell Biochem 91, 47-53
32. Meister, A. (1991) Pharmacol Ther 51, 155-194
METABOLIC FLUX PROFILING OF REACTION MODULES IN LIVER DRUG TRANSFORMATION

JEONGAH YOON, KYONGBUM LEE
Department of Chemical & Biological Engineering, Tufts University, 4 Colby Street, Medford, MA, 02155, USA

With appropriate models, the metabolic profile of a biological system may be interrogated to obtain both significant discriminatory markers and mechanistic insight into the observed phenotype. One promising application is the analysis of drug toxicity, where a single chemical triggers multiple responses across cellular metabolism. Here, we describe a modeling framework whereby metabolite measurements are used to investigate the interactions between specialized cell functions through a metabolic reaction network. As a model system, we studied the hepatic transformation of troglitazone (TGZ), an antidiabetic drug withdrawn due to idiosyncratic hepatotoxicity. Results point to a well-defined TGZ transformation module that connects to other major pathways in the hepatocyte via amino acids and their derivatives. The quantitative significance of these connections depended on the nutritional state and the availability of the sulfur-containing amino acids.
1. Introduction
Metabolites are intermediates of essential biochemical pathways that convert nutrient fuel to energy, maintain cellular homeostasis, eliminate harmful chemicals, and provide building blocks for biosynthesis. Many metabolites are in free exchange with the extracellular medium, and may be used to obtain quantitative estimates of biochemical pathway activities in intact cells. In recent years, metabolite measurement arrays, or metabolic profiles, in conjunction with appropriate models, have been used for a variety of applications, e.g. comparisons of plant phenotypes [1], elucidation of new gene functions [2], and discovery of disease biomarkers [3]. Another promising application is the study of drug-mediated toxicity in specialized metabolic organs such as the liver. One approach to identifying drug toxicity markers has been to extract characteristic fingerprints by applying pattern recognition techniques to 'metabonomic' data obtained through nuclear magnetic resonance (NMR) spectroscopy [4]. An alternative and complementary approach is to build structured network models applicable to metabolomic data. These models could be used, for example, to globally characterize the effects of drug chemicals across cell metabolism, and thereby identify potential metabolic burdens; to associate adverse events, such as the formation of a harmful derivative, with
specific marker metabolites; and to formulate hypotheses on the mechanisms of drug toxicity. Here, we describe a modeling framework for characterizing the modularity of specific reaction clusters, in this case xenobiotic transformation. At its core, this framework consists of an algorithm for top-down partitioning of directed graphs with non-uniform edge weight distributions. The core algorithm is further augmented with metabolic flux profiling and stoichiometric vector space analysis. Thus, our modeling framework is well-suited for leveraging advances in both analytical technologies and biological informatics, especially genome annotation and pathway database construction [5]. As a model system, we considered the metabolic network of the liver, which is the major site of xenobiotic transformation in the body. Representative metabolic profile data were obtained for cultured rat or human hepatocytes from prior work [6, 7]. The model xenobiotic was troglitazone (TGZ), an anti-diabetic drug that has recently been withdrawn due to idiosyncratic liver toxicity [8]. The exact mechanisms of toxicity remain unknown, but could involve the formation of harmful derivatives through metabolic activation, cellular energy depletion via mitochondrial membrane damage [9], or other metabolic burdens such as oxidative stress [10]. In this work, we utilize our modularity analysis model to characterize the connections between the reactions of known TGZ conjugates and the major pathways of liver cellular metabolism. This type of analysis should complement more detailed studies on the roles of specific conjugation enzymes by identifying their interdependence with other major components of the cellular metabolic network. In the case of TGZ transformation, our results indicate that key connectors are sulfur-containing amino acids and their derivatives.
2. Methods
2.1. Liver metabolic network Stoichiometric models of liver central carbon metabolism were constructed as follows. First, a list of enzyme-mediated reactions was collected from an annotated genome database [11]. Second, stoichiometric information was added for each of the collected enzymes by cross-referencing their common names and enzyme commission (EC) numbers using the KEGG database [12]. Third, biochemistry textbooks and the published literature [13] were consulted to build organ (liver) and nutritional state (fed or fasted) specific models. Net flux directions of reversible or reciprocally regulated pathways were set based on the nutritional state. These models were rendered into compound, directed graphs, visualized using the MATLAB (MathWorks, Natick, MA) Bioinformatics
toolbox, and corrected for missing steps and nonsensical dead ends. Reversible reactions flanked by irreversible reactions were assigned directionality so as to ensure unidirectional metabolic flux between the flanking reactions. The pathway memberships and other dimensional characteristics are summarized for each of the two models in Table 1*.

Table 1. Pathway memberships of the fed- and fasted-state liver models. The models draw on the following pathways: alcohol metabolism, amino acid metabolism, bile acid synthesis, cholesterol synthesis, gluconeogenesis, glycogen synthesis, glycolysis, ketone body metabolism, lipogenesis, lipolysis/β-oxidation, oxidative phosphorylation, PPP, TCA cycle, TGZ metabolism, and the urea cycle.
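As a rough illustration of the model assembly just described, the sketch below builds a small stoichiometric matrix and a directed reactant-product graph from a toy reaction list; the reactions shown and the use of networkx are illustrative assumptions, not the authors' MATLAB implementation or the full liver models.

```python
# Illustrative sketch only: build a stoichiometric matrix (S) and a directed
# reactant->product graph from a toy reaction list (not the actual models).
import numpy as np
import networkx as nx

reactions = {
    "GSH_synthesis": {"GLU": -1, "CYS": -1, "GLY": -1, "GSH": 1},
    "TGZ_GSH_conjugation": {"TGZ": -1, "GSH": -1, "TGZ-GSH": 1},
}

metabolites = sorted({m for coeffs in reactions.values() for m in coeffs})
S = np.zeros((len(metabolites), len(reactions)))          # M x R matrix
for j, coeffs in enumerate(reactions.values()):
    for met, c in coeffs.items():
        S[metabolites.index(met), j] = c

G = nx.DiGraph()                                          # reactant -> product edges
for rxn, coeffs in reactions.items():
    for reactant in (m for m, c in coeffs.items() if c < 0):
        for product in (m for m, c in coeffs.items() if c > 0):
            G.add_edge(reactant, product, reaction=rxn)

print(metabolites)
print(S)
print(sorted(G.edges()))
```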
2.2. TGZ metabolism
The base models were augmented with TGZ conjugation reactions identified in the literature. Upon entry into the hepatocyte, TGZ is almost entirely transformed into one of its four main conjugate forms [14]: TGZ-sulfate (TGZ-S), TGZ-quinone (TGZ-Q), TGZ-glucuronide (TGZ-G), and TGZ-glutathione (TGZ-GSH). Extension of the liver models with these derivatives added 10 new intermediates and 14 reactions.

2.3. Data sets
Inputs to the flux calculations were external flux measurements (rates of metabolite uptake or output) taken from previously published work. These studies profiled the metabolism of cultured hepatocytes under medium conditions that set up either a fed or a fasted state. All data sets included time series measurements on glucose, lactate, ketone bodies, ammonia, and the naturally occurring amino acids. The number of measured metabolites was 25.

* Complete model details, including reaction stoichiometry, the identities of balanced metabolites, and thermodynamic reaction parameters are available upon request to the authors.
Summary descriptions of the experimental settings are shown in Table 2. A representative mean value for the TGZ uptake rate was estimated based on a study involving primary hepatocytes obtained from human donors [15].

Table 2. Metabolite data sets used for flux estimation
Fed-state model: cultured rat hepatocytes; DMEM with high (4.5 g/L) glucose; supplemented with amino acids; insulin; reference [6].
Fasted-state model: cultured HepG2 cells (spent medium); DMEM with low (1.0 g/L) glucose; dexamethasone; reference [7].
2.4. Flux calculation

2.4.1 Metabolic Flux Analysis (MFA)
Intracellular fluxes were calculated using an optimization-based approach as described previously [16]. Briefly, a non-linear, constrained optimization problem was set up as follows:

Minimize:   Σ_k (v_k − v_k^obs)²,   k ∈ {external fluxes}     (1)

Subject to:
S · v = 0     (2)
G · v ≤ 0     (3)

where the objective function minimizes the sum squared error between experimentally observed (v_k^obs) and predicted external fluxes (v_k). Eq. (2) expresses the balances around intracellular metabolites using an M×R stoichiometric matrix S and an R×1 flux vector v. The number of balanced metabolites (M) and reactions (R) were 37 and 64 for the fasted-state and 60 and 102 for the fed-state model. Inequality (3) expresses constraints derived from the Second Law. To account for biochemical coupling between energetically favorable and unfavorable reactions, the thermodynamic constraints were applied to pathways, as opposed to individual reactions. Stoichiometrically balanced pathways were enumerated using the elementary flux mode (EFM) algorithm [17]. The output of the EFM analysis was collected into a P×R pathway matrix E, where P was the number of pathways (190 and 237 for the fasted- and fed-state model, respectively). To formulate the pathway ΔG (ΔG°PATH) constraint matrix G, we first collected the reaction ΔGs into an R×1 vector Δg and then performed element-by-element multiplications with each of the P (R-dimensional) rows of E:

G_ij = E_ij · Δg_j,   ∀i ∈ {1...P}, ∀j ∈ {1...R}     (4)
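A minimal sketch of the constrained optimization in Eqs. (1)-(3) is shown below; the stoichiometric and constraint matrices, the measured flux indices, and the measured values are toy placeholders, not the fed- or fasted-state liver models.

```python
# Sketch of the MFA problem in Eqs. (1)-(3): minimize the squared error between
# predicted and measured external fluxes, subject to S.v = 0 and G.v <= 0.
# S, G, ext_idx, and v_obs are toy placeholders.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

S = np.array([[1.0, -1.0, 0.0],        # one toy balanced-metabolite chain
              [0.0, 1.0, -1.0]])
G = np.array([[0.0, 0.0, -1.0]])       # toy pathway free-energy constraint
ext_idx = np.array([0, 2])             # reactions with measured external fluxes
v_obs = np.array([1.0, 0.9])           # measured uptake/output rates

def sse(v):
    # Eq. (1): sum of squared deviations over the measured external fluxes
    return np.sum((v[ext_idx] - v_obs) ** 2)

res = minimize(
    sse,
    x0=np.ones(S.shape[1]),
    method="trust-constr",
    constraints=[LinearConstraint(S, 0.0, 0.0),        # Eq. (2): S.v = 0
                 LinearConstraint(G, -np.inf, 0.0)],   # Eq. (3): G.v <= 0
)
print(res.x, res.fun)
```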
2.4.2 Flux Balance Analysis (FBA)
We also simulated flux distributions that maximized the formation of the key liver anti-oxidant glutathione (GSH), which in vitro studies had shown to play a critical role in the detoxification of TGZ and other drugs in the liver [10]. The simulations were performed using linear programming with maximization of the GSH synthesis step (v_GSH) as the objective. The equality and inequality constraints were identical to the above MFA problem. The measured external fluxes were used as upper and lower bound constraints. To prevent over-constraining, we specified five of the 25 measured metabolites as major carbon and nitrogen sources/sinks. The final form of the FBA problem was:

Maximize:   v_GSH     (5)

Subject to:
S · v = 0     (2)
G · v ≤ 0     (3)
0.5 · v_i,meas ≤ v_i ≤ 2 · v_i,meas     (6)
where v_i refers to the measured rates of uptake, accumulation, or output of glucose, triglyceride, glutamine, urea or TGZ.
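A minimal sketch of the linear program in Eqs. (5), (2), (3) and (6), again with placeholder matrices and measured values rather than the actual liver models:

```python
# Sketch of the FBA problem: maximize the GSH synthesis flux subject to mass
# balance, pathway free-energy constraints, and bounds of 0.5-2x the measured
# value on selected external fluxes.  All inputs are toy placeholders.
import numpy as np
from scipy.optimize import linprog

S = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
G = np.array([[0.0, 0.0, -1.0]])
n_rxn = S.shape[1]
gsh_idx = 2                              # assumed index of the GSH synthesis step

c = np.zeros(n_rxn)
c[gsh_idx] = -1.0                        # linprog minimizes, so negate to maximize

bounds = [(0.0, None)] * n_rxn           # irreversible fluxes by default
for i, v_meas in {0: 1.0}.items():       # Eq. (6) for one measured flux (toy value)
    bounds[i] = (0.5 * v_meas, 2.0 * v_meas)

res = linprog(c, A_ub=G, b_ub=np.zeros(len(G)),
              A_eq=S, b_eq=np.zeros(len(S)), bounds=bounds)
print(res.x, -res.fun)
```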
2.5. Modularity analysis
Analysis of reaction modules was performed using an algorithm for top-down decomposition of directed graphs. Details of the algorithm have been described elsewhere [18]. The algorithm consists of the following two steps, which are iteratively applied until all edges in the graph have been removed (a brief illustrative sketch follows the list).

1. Shortest paths through the network are calculated using Dijkstra's algorithm. This calculation critically depends on the edge-weight matrix, which specifies the relative adjustments of reactant-product node pair distances based on the activity of the intervening reaction. Here, a node pair distance was inversely scaled with the connecting reaction activity as measured by its steady-state flux.

2. The edge-betweenness centrality index is calculated for all edges. Edge-betweenness centrality refers to the frequency with which an edge lies on the shortest paths between all pairs of vertices. The edges with the highest betweenness values are most likely to lie between sub-graphs, rather than inside a sub-graph [19]. Successive removal of edges with the highest edge-betweenness values will eventually isolate sub-graphs consisting of vertices that share connections only with other vertices in the same sub-graph. The edge-betweenness centrality values were calculated using a newly developed method [18] based on an algorithm for vertex betweenness centrality calculation of large, sparse networks [20].
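An illustrative sketch of this iterative partitioning on a toy flux-weighted graph (the graph, flux values, and stopping rule are assumptions; the authors' implementation [18] is not reproduced here):

```python
# Sketch of the two-step partitioning: edge distances are set inversely
# proportional to flux, edge-betweenness is computed over the Dijkstra
# shortest paths, and the highest-betweenness edge is removed each iteration.
import networkx as nx

edges_with_flux = [("A", "B", 2.0), ("B", "C", 2.0), ("C", "A", 1.5),
                   ("C", "D", 0.1), ("D", "E", 1.0), ("E", "F", 1.0),
                   ("F", "D", 0.8)]
G = nx.DiGraph()
for u, v, flux in edges_with_flux:
    G.add_edge(u, v, distance=1.0 / flux)    # high flux -> short distance

iteration = 0
while G.number_of_edges() > 0:
    iteration += 1
    eb = nx.edge_betweenness_centrality(G, weight="distance")
    u, v = max(eb, key=eb.get)               # step 2: highest-betweenness edge
    G.remove_edge(u, v)
    modules = [sorted(m) for m in nx.weakly_connected_components(G)]
    print(f"iteration {iteration}: removed {u}->{v}, modules = {modules}")
```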
2.6. Projection and Match Scores
The biological significance of a partition (iteration) was assessed by mapping the modules to stoichiometrically feasible pathways (defined by the EFMs).

Projection score - Each sub-graph was transformed into a 1×R binary reaction composition vector (RCV), where R is the total number of reactions included in the network. An element was set to 1 if both the reactants and products of the corresponding reaction were present as nodes in the module; otherwise, an element was set to 0. The EFM vectors (rows of E) were also transformed into 1×R binary pathway inventory vectors (PIVs) by replacing all non-zero entries with one. A projection score was computed for every pair-wise combination of a binary module vector and each of the fed- or fasted-state model PIVs as follows:

PS^k_ij = (RCV^k_i · PIV_j) / N^k_i,   i = 1, 2, ..., L_k,   j = 1, 2, ..., m     (7)

where PS^k_ij, RCV^k_i, and N^k_i were, respectively, the projection score, reaction composition vector, and number of nodes of module i at iteration number k, L_k was the number of modules at k, PIV_j was the jth PIV, and m was the total number of EFMs. The overall projection score of an iteration number k was calculated by averaging the 'best match' projection scores of this iteration:

PS^k = Σ_i max_j(PS^k_ij) / L_k     (8)

Match score - Several cases were noted where the projection score identified more than one 'best match' PIV for a given module. In these cases, a 1×R consensus pathway fragment (CPF) vector was formed for each module i of iteration number k as the smallest common reaction set of the best match PIVs. The similarity between a module and its CPF was assessed by a match score:

MS^k_i = (R − W^k_i) / R,   i = 1, 2, ..., L_k     (9)

where MS^k_i was the match score of module i at iteration number k, and W^k_i was the number of mismatches between the module RCV^k_i and the corresponding CPF. A mismatch occurs if a reaction is found in the module, but not the fragment vector, or if a reaction is found in the fragment, but not the module vector. The overall match score of an iteration was calculated as a simple average of the individual module match scores:

MS^k = Σ_i MS^k_i / L_k     (10)
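A small sketch of the projection score computation in Eqs. (7)-(8), using toy binary vectors; the match score of Eqs. (9)-(10) would be computed analogously from the consensus pathway fragments.

```python
# Sketch of the projection score, Eqs. (7)-(8): binary module reaction
# composition vectors (RCV) are projected onto binary pathway inventory
# vectors (PIV); the overall score averages each module's best match.
# The vectors and node counts below are toy values.
import numpy as np

RCV = np.array([[1, 1, 0, 0],      # L_k modules x R reactions
                [0, 0, 1, 1]])
PIV = np.array([[1, 1, 1, 0],      # m pathways (EFMs) x R reactions
                [0, 0, 1, 1],
                [1, 0, 0, 1]])
N = np.array([3, 2])               # number of nodes in each module

PS = (RCV @ PIV.T) / N[:, None]    # Eq. (7): PS_ij = RCV_i . PIV_j / N_i
PS_k = PS.max(axis=1).mean()       # Eq. (8): mean best-match score at iteration k
print(PS)
print(PS_k)
```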
3. Results
3.1. Flux distribution
The predicted distributions of fluxes through the reactions of TGZ metabolism (Table 3) were only in partial agreement with the experimentally determined proportions of the derivatives reported in the literature. Radioactive tracer studies in animal models have shown that the major derivative is TGZ-S [21], accounting for up to 70 % of all conjugated forms [22]. For the fed-state model, the calculated distribution was, in decreasing order, 39 % TGZ-Q, 33 % TGZ-G, 15 % TGZ-GSH, and 13 % TGZ-S. The distribution calculated for the fasted-state model was 37 % TGZ-Q, 37 % TGZ-G, and 26 % TGZ-S. Thus, both models predicted the sulfate conjugate to be a minor component, contrary to the animal studies. On the other hand, the GSH conjugate was correctly predicted to be a minor derivative [14].

Table 3. Reactions of TGZ metabolism

                                                      ΔG          Flux, umol/10^6 cells/day
Reaction stoichiometry                                (kcal/mol)  Fed +TGZ  Fed Max GSH  Fasted +TGZ  Fasted Max GSH
Cysteine + O2 + α-ketoglutarate ->
  Pyruvate + SO3^2- + Glutamate                       -125.6      0.06      0.32         0.12         0.42
Cysteine -> Pyruvate + NH4^+ + HS^-                   -38         0.00      0.00         0.00         0.00
HS^- + 2 Glutathione + 2 O2 -> GSSG + HSO3^- + H2O    -648.5      0.00      0.00         0.00         0.00
TGZ uptake                                            0           0.46      0.91         0.46         0.92
Glutamate + Cysteine + Glycine -> Glutathione         132.4       0.07      0.59         0.00         0.50
TGZ + Glutathione -> TGZ-GSH                          -7.5        0.07      0.59         0.00         0.50
TGZ + SO3^2- -> TGZ-Sulfate                           19          0.06      0.32         0.12         0.42
TGZ + HSO3^- -> TGZ-Sulfate                           28.5        0.00      0.00         0.00         0.00
TGZ -> TGZ-Quinone                                    -31.1       0.18      0.00         0.17         0.00
TGZ -> TGZ-Glucuronide                                -197.5      0.15      0.00         0.17         0.00
TGZ-GSH secretion                                     0           0.07      0.59         0.00         0.50
TGZ-Sulfate secretion                                 0           0.06      0.32         0.12         0.42
TGZ-Glucuronide secretion                             0           0.15      0.00         0.17         0.00
TGZ-Quinone secretion                                 0           0.18      0.00         0.17         0.00

Measured inputs are shown in bold. +TGZ: flux distribution calculated by MFA with total drug uptake set to 0.46 umol/10^6 cells/day. Max GSH: flux distribution calculated by FBA with upper and lower bounds on glucose, TG, GLN, urea, and TGZ.
Interestingly, the two models predicted qualitatively similar trends despite their significantly different compositions and measured inputs, suggesting that there were a limited number of actively engaged connections between TGZ transformation and the other metabolic pathways. The major quantitative difference involved the contribution of the GSH conjugate. Thus, we next
examined the effect of increasing the availability of this conjugation substrate by simulating flux distributions that maximized GSH synthesis under the same stoichiometric and thermodynamic constraints applied to the MFA problems. To obtain flux values numerically compatible with the MFA results, we also assigned upper and lower bounds to the major carbon and nitrogen sinks and sources based on their respective measured external flux values. As expected, the flux through the GSH synthesis step (v_GSH) increased significantly for both the fed- and fasted-state models (in umol/10^6 cells/day) from 0.07 to 0.59 and from 0 to 0.50, respectively, when the maximization objective was paired with no direct constraints on the uptake or output of the amino acid reactants. The only indirect constraint on GLU was applied through the upper and lower bounds on GLN (0.75 and 3 umol/10^6 cells/day, respectively), which were not approached. However, the higher v_GSH flux for the fed-state model suggests a positive correlation with GLN uptake, which was significantly higher for the fed-state model. The predicted distributions of conjugation reaction fluxes were 65 % TGZ-GSH and 35 % TGZ-S for the fed-state model and 54 % TGZ-GSH and 46 % TGZ-S for the fasted-state model. Both models predicted zero fluxes for the formation of the glucuronide and quinone conjugates, suggesting that the distribution of the TGZ derivatives may be dramatically altered by the availability of GSH, which in turn is influenced by the medium supply of its constituent amino acids. The increase in TGZ-GSH was accompanied by an increase in TGZ-S formation, likely because the cysteine component of GSH also acts as a source of sulfate (HSO3^- and SO3^2-), which drives the formation of TGZ-S. Cysteine, as well as its sulfate derivatives, interacts with other intermediates of central carbon metabolism. These interactions have been further characterized through modularity analysis.

3.2. Reaction modules
To characterize the interconnections between TGZ derivatives and other major liver metabolites, we applied a partition algorithm to directed graph representations of the various network models with and without edge-weights. The left-hand panels of Fig. 1 show the optimal partitions of the fed-state model without an edge-weight matrix (a), with an edge-weight matrix derived from MFA (c), and with an edge-weight matrix derived from FBA (e). Figs. 1b, 1d, and 1f show the corresponding partitions of the fasted-state model. Optimality was evaluated based on the projection and match scores (see Methods, Fig. 2). For both the fed- and fasted-state models, the inclusion of reaction flux, or connection activity, significantly influenced their modularity. When only connectivity was considered, the (unweighted) fed-state network was optimally partitioned at iteration number 34 (Fig. 1a). Three modules were generated.
Figure 1. Optimal partitions of the liver network models. Left- and right-hand column panels show fed- and fasted-state models, respectively. Partition without flux weights (a, b), with flux weights (c, d), and with flux weights maximizing GSH (e, f). Arrows indicate carbon flow between modules as determined from the partition of the previous iteration.
The smallest module consisted of two metabolites in lipid synthesis (palmitate, PAL, and triglyceride, TG). The largest module included all other metabolites with the exception of TGZ and its direct derivatives, which constituted the remaining third module. When an edge-weight matrix was applied with MFA-derived fluxes, the optimal partition was reached at iteration 8 (Fig. 1c). Four modules (consisting of at least two connected nodes) were found. The smallest module consisted of metabolites in the urea cycle. A second module consisted of lipid synthesis and PPP metabolites. A third module consisted of the TCA cycle metabolites. The largest module included TGZ, its direct derivatives, and the intermediates of amino acid and pyruvate metabolism. When a different edge-weight matrix was used with a flux distribution corresponding to maximal GSH synthesis, the optimal partition (reached at iteration 8) consisted of three modules (Fig. 1e). The two smaller modules were identical to the two smallest modules of the partition in Fig. 1c. The third module essentially combined the larger two modules of Fig. 1c, with connections through the reactions in and around the urea and TCA cycles.
Fig. 2. Mean projection and match score plots for the fed-state model partitions. Legends refer to flux distribution used to form the edge-weight matrix. For both series of partitions, the optimal iteration was set at 8, which corresponds to the first significant rise in the two scores.
The modularity of the fasted-state model was also significantly influenced by the connection activity (flux) data. Without an edge-weight matrix, the net effect of the edge removals was to reduce the network graph size (Fig. 1b). Application of the MFA-derived fluxes as edge-weights generated an optimal partition with two modules at iteration 15 (Fig. 1d). Similar to Fig. 1a, TGZ and its derivatives formed a separate module. However, this module lacked TGZ-GSH, presumably because the fasted-state model calculated zero flux for GSH synthesis. Unlike the fed-state partition (Fig. 1c), the TGZ module did not connect directly to the other metabolic pathways. Direct connections remained absent when the GSH-maximizing flux distribution was used to form the edge-weight matrix (Fig. 1f). The major effect was to isolate a small module consisting of urea cycle metabolites from the largest reaction module. As expected from the results of Table 3, TGZ-G and TGZ-Q were eliminated from the TGZ module, and replaced with TGZ-GSH. Together, Figs. 1c-d suggest that the nutritional state of the liver directly impacts the connections between
reactions of TGZ transformation and the other major pathways of liver metabolism. Moreover, a comparison of the partitions in Figs. 1c and 1e indicated that conjugation substrate availability, in this case GSH, influences the extent of integration between these reaction modules.
4. Discussion
In this paper, we examined the interactions between the specialized reactions of TGZ transformation and the network of major metabolic reactions in hepatocytes. Using prior data, flux distributions were simulated that were in partial agreement with experimental observations on the relative distributions of various TGZ conjugates. With only the total TGZ clearance rate as input, TGZ-GSH was correctly predicted as a minor derivative, but the contribution of TGZ-S was significantly under-estimated, suggesting that additional measurements on the conjugation reactions are needed to improve the flux calculations. Nevertheless, we noted several useful outcomes. First, the thermodynamic constraints allowed convergent solutions to be found with relatively small numbers of measured inputs. Second, we avoided potential pitfalls of individual reaction-based inequality constraints. For example, flux calculations correctly predicted significant net production of TGZ-S in all cases, even though the individual reaction ΔGs of the final synthesis steps were positive (Table 3). These results directly reflect the energetic coupling between sequential reaction steps as specified by the EFM calculations. Third, the EFMs generated for the flux calculations provided an inventory of stoichiometrically and energetically feasible reaction routes of the model networks. A major obstacle to applying the EFM analysis to larger, e.g. genome-scale, networks is its computational intractability. One way to address this issue is to solve for a partial set of EFMs by eliminating high-degree currency metabolites. Many currency metabolites cannot be accurately measured or balanced, and thus are frequently not included in the stoichiometric constraints, but they form metabolic cycles that significantly expand the EFM solution space. In this work, ATP, CO2 and O2 were not balanced, and the EFM calculations became NP-hard problems. The EFMs and the calculated flux distributions were ultimately used to examine the modularity of TGZ metabolism across different nutritional states and levels of conjugation substrate availability. While the connections between the immediate reactions of TGZ metabolism were well-conserved across these different conditions, connections to other major pathways varied. In the fasted state, interactions between the main carbon network and the TGZ module were limited, regardless of the GSH level. In contrast, a number of active connections were found for the fed state. These connections mainly involved the
sulfur-containing amino acid cysteine (CYS) and its immediate reaction partners. The liberation of the sulfide moiety from CYS requires complete degradation of the amino acid via transamination reactions, which involve other high-degree metabolites such as GLU and α-ketoglutarate. Along with glycine, GLU and CYS make up GSH, which also interacts with the TGZ module as a conjugation substrate. Taken together, our findings suggest that the availability of common medium nutrients could significantly influence the formation of drug derivatives. Prospectively, metabolic profile-based studies on drug reaction modules could be used to analyze drug transformation under varying metabolic states, which in turn could facilitate the development of effective nutritional approaches for managing drug toxicity [10].

Acknowledgements
We thank Dr. Anselm Blumer in the Department of Computer Science at Tufts University for his help in implementing the edge-betweenness centrality algorithm. This work was in part funded by NIH grant 1-R21DK67228 to KL.

References
1. O. Fiehn et al., Nat Biotechnol 18, 1157 (2000).
2. R. N. Trethewey, Curr Opin Biotechnol 12, 135 (2001).
3. J. L. Griffin et al., Anal Biochem 293, 16 (2001).
4. J. K. Nicholson et al., Nat Rev Drug Discov 1, 153 (2002).
5. M. Kanehisa et al., Nucleic Acids Res 34, D354 (2006).
6. C. Chan et al., Metab Eng 5, 1 (2003).
7. R. P. Nolan, M.S., Tufts University (2005).
8. E. A. Gale, Lancet 357, 1870 (2001).
9. Y. Masubuchi et al., Toxicology 222, 233 (2006).
10. S. Tafazoli et al., Drug Metab Rev 37, 311 (2005).
11. H. Ma, A. P. Zeng, Bioinformatics 19, 270 (2003).
12. M. Kanehisa, S. Goto, Nucleic Acids Res 28, 27 (2000).
13. I. M. Arias, J. L. Boyer, The liver: biology and pathobiology, 4th ed. (2001).
14. M. T. Smith, Chem Res Toxicol 16, 679 (2003).
15. N. J. Hewitt et al., Chem Biol Interact 142, 73 (2002).
16. R. P. Nolan et al., Metab Eng 8, 30 (2006).
17. S. Schuster et al., Nat Biotechnol 18, 326 (2000).
18. J. Yoon et al., Bioinformatics (2006).
19. M. E. Newman, M. Girvan, Phys Rev E Stat Nonlin Soft Matter Phys 69, 026113 (2004).
20. U. Brandes, J Math Sociol 25, 163 (2001).
21. K. Kawai et al., Xenobiotica 30, 707 (2000).
22. S. Prabhu et al., Chem Biol Interact 142, 83 (2002).
NEW FRONTIERS IN BIOMEDICAL TEXT MINING
PIERRE ZWEIGENBAUM, DINA DEMNER-FUSHMAN, HONG YU, AND K. BRETONNEL COHEN

1. Introduction
To paraphrase Gildea and Jurafsky [7], the past few years have been exhilarating ones for biomedical language processing. In less than a decade, we have seen an amazing increase in activity in text mining in the genomic domain [20]. The first textbook on biomedical text mining with a strong genomics focus appeared in 2005 [3]. The following year saw the establishment of a national center for text mining under the leadership of committed members of the BioNLP world [2], and two shared tasks [10,9] have led to the creation of new datasets and a very large community. These years have included considerable progress in some areas. The TREC Genomics track has brought an unprecedented amount of attention to the domain of biomedical information retrieval [8] and related tasks such as document classification [5] and question-answering, and the BioCreative shared task did the same for genomic named entity recognition, entity normalization, and information extraction [10]. Recent meetings have pushed the focus of biomedical NLP into new areas. A session at the Pacific Symposium on Biocomputing (PSB) 2006 [6] focussed on systems that linked multiple biological data sources, and the BioNLP'06 meeting [20] focussed on deeper semantic relations. However, there remain many application areas and approaches in which there is still an enormous amount of work to be done. In an attempt to facilitate movement of the field in those directions, the Call for Papers for this year's PSB natural language processing session was written to address some of the potential "New Frontiers" in biomedical text mining. We solicited work in these specific areas:
• Question-answering
• Summarization
• Mining data from full text, including figures and tables
• Coreference resolution
• User-driven systems
• Evaluation

31 submissions were received. Each paper received four reviews by a program committee composed of biomedical language processing specialists from North America, Europe, and Asia. Eleven papers were selected for publication. The papers published here present an interesting window on the nature of the frontier, both in terms of how far it has advanced, and in terms of which of its borders it will be difficult to cross. One paper addresses the topic of summarization. Lu et al. [14] use summary revision techniques to address quality assurance issues in GeneRIFs. Two papers extend the reach of biomedical text mining from the abstracts that have been the input to most BioNLP systems to date, towards mining the information present in full-text journal articles. Kou et al. [13] introduce a method for matching the labels of sub-figures with sentences in the paper. Seki and Mostafa [19] explore the use of full text in discovering information not explicitly stated in the text. Two papers address the all-too-often-neglected issue of the usability and utility of text mining systems. Karamanis et al. [12] present an unusual attempt to evaluate the usability of a system built for model organism database curators. Much of the work in biomedical language processing in recent years has assumed the model organism database curator as its user, so usability studies are well-motivated. Yu and Kaufman [22] examine the usability of four different biomedical question-answering systems. Two papers fit clearly into the domain of evaluation. Morgan et al. [15] describe the design of a shared evaluation, and also give valuable baseline data for the entity normalization task. Johnson et al. [11] describe a fault model for evaluating ontology matching, alignment, and linking systems. Four papers addressed more traditional application types, but at a deeper level of semantic sophistication than most past work in their areas. Two papers dealt with the topic of relation extraction. Ahlers et al. [1] tackle an application area—information extraction—that has been a common topic of previous work in this domain, but does so at an unusual level of semantic sophistication. Cakmak and Ozsoyoglu [4] deal with the difficult problem of Gene Ontology concept assignment to genes. Finally, two papers focus on the well-known task of document indexing, but at unusual levels of refinement. Neveol et al. [16] extract MeSH subheadings and pair them with the appropriate primary heading, introducing an element of context that is lacking in most other work in BioNLP. Rhodes et al. [18]
describe a methodology for indexing documents based on the structure of chemicals that are mentioned within them. So, we see papers in some of the traditional application areas, but at increased levels of sophistication; we see papers in the areas of summarization, full text, user-driven work, and evaluation; but no papers in the areas of coreference resolution or question-answering. What might explain these gaps? One possibility is the shortage of publicly available datasets for system building and evaluation. Although there has been substantial annotation work done in the area of coreference in the molecular biology domain [21,17], only a single biomedical corpus with coreference annotation is currently freely available [17]. Similarly, although the situation will be different a year from now due to the efforts of the TREC Genomics track, there are currently no datasets freely available for the biomedical question-answering task.
2. Acknowledgments
K. Bretonnel Cohen's participation in this work was supported by NIH grant R01-LM008111 to Lawrence Hunter.

References
1. Caroline B. Ahlers, Marcelo Fiszman, Dina Demner-Fushman, Francois-Michel Lang, and Thomas C. Rindflesch. Extracting semantic predications from MEDLINE citations for pharmacogenomics. In Pacific Symposium on Biocomputing, 2007.
2. Sophia Ananiadou, Julia Chruszcz, John Keane, John McNaught, and Paul Watry. The National Centre for Text Mining: aims and objectives. Ariadne, 42, 2005.
3. Sophia Ananiadou and John McNaught. Text mining for biology and biomedicine. Artech House Publishers, 2005.
4. Ali Cakmak and Gultekin Ozsoyoglu. Annotating genes by mining PubMed. In Pacific Symposium on Biocomputing, 2007.
5. Aaron M. Cohen and William R. Hersh. The TREC 2004 genomics track categorization task: classifying full text biomedical documents. Journal of Biomedical Discovery and Collaboration, 1(4), 2006.
6. K. Bretonnel Cohen, Olivier Bodenreider, and Lynette Hirschman. Linking biomedical information through text mining: session introduction. In Pacific Symposium on Biocomputing, pages 1-3, 2006.
7. Daniel Gildea and Daniel Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245-288, 2002.
8. William R. Hersh, Ravi Teja Bhupatiraju, Laura Ross, Phoebe Roberts, Aaron M. Cohen, and Dale F. Kraemer. Enhancing access to the Bibliome: the TREC 2004 Genomics track. Journal of Biomedical Discovery and Collaboration, 2006.
9. William R. Hersh, Aaron M. Cohen, Jianji Yang, Ravi Teja Bhupatiraju, Phoebe Roberts, and Marti Hearst. TREC 2005 Genomics track overview. In Proceedings of the 14th Text Retrieval Conference. National Institute of Standards and Technology, 2005.
10. Lynette Hirschman, Alexander Yeh, Christian Blaschke, and Alfonso Valencia. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, 2005.
11. Helen L. Johnson, K. Bretonnel Cohen, and Lawrence Hunter. A fault model for ontology mapping, alignment, and linking systems. In Pacific Symposium on Biocomputing, 2007.
12. Nikiforos Karamanis, Ian Lewin, Ruth Seal, Rachel Drysdale, and Edward J. Briscoe. Integrating natural language processing with FlyBase curation. In Pacific Symposium on Biocomputing, 2007.
13. Zhenzhen Kou, William W. Cohen, and Robert F. Murphy. A stacked graphical model for associating information from text and images in figures. In Pacific Symposium on Biocomputing, 2007.
14. Zhiyong Lu, K. Bretonnel Cohen, and Lawrence Hunter. GeneRIF quality assurance as summary revision. In Pacific Symposium on Biocomputing, 2007.
15. Alexander A. Morgan, Benjamin Wellner, Jeffrey B. Colombe, Robert Arens, Marc E. Colosimo, and Lynette Hirschman. Evaluating human gene and protein mention normalization to unique identifiers. In Pacific Symposium on Biocomputing, 2007.
16. Aurelie Neveol, Sonya E. Shooshan, Susanne M. Humphrey, Thomas C. Rindflesch, and Alan R. Aronson. Multiple approaches to fine indexing of the biomedical literature. In Pacific Symposium on Biocomputing, 2007.
17. J. Pustejovsky, J. Castano, R. Sauri, J. Zhang, and W. Luo. Medstract: creating large-scale information servers for biomedical libraries. In Natural language processing in the biomedical domain, pages 85-92. Association for Computational Linguistics, 2002.
18. James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, and Patricia Ordonez. Mining patents using molecular similarity search. In Pacific Symposium on Biocomputing, 2007.
19. Kazuhiro Seki and Javed Mostafa. Discovering implicit associations between genes and hereditary diseases. In Pacific Symposium on Biocomputing, 2007.
20. Karin Verspoor, K. Bretonnel Cohen, Inderjeet Mani, and Benjamin Goertzel. Introduction to BioNLP'06. In Linking natural language processing and biology: towards deeper biological literature analysis, pages iii-iv. Association for Computational Linguistics, 2006.
21. Xiaofeng Yang, Guodong Zhou, Jian Su, and Chew Lim Tan. Improving noun phrase coreference resolution by matching strings. In IJCNLP04, pages 326-333, 2004.
22. Hong Yu and David Kaufman. A cognitive evaluation of four online search engines for answering definitional questions posed by physicians. In Pacific Symposium on Biocomputing, 2007.
EXTRACTING SEMANTIC PREDICATIONS FROM MEDLINE CITATIONS FOR PHARMACOGENOMICS

CAROLINE B. AHLERS,1 MARCELO FISZMAN,2 DINA DEMNER-FUSHMAN,1 FRANCOIS-MICHEL LANG,1 THOMAS C. RINDFLESCH1
1 Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, Maryland 20894, USA
2 The University of Tennessee, Graduate School of Medicine, Knoxville, Tennessee 37920, USA
We describe a natural language processing system (Enhanced SemRep) to identify core assertions on pharmacogenomics in Medline citations. Extracted information is represented as semantic predications covering a range of relations relevant to this domain. The specific relations addressed by the system provide greater precision than that achievable with methods that rely on entity co-occurrence. The development of Enhanced SemRep is based on the adaptation of an existing system and crucially depends on domain knowledge in the Unified Medical Language System. We provide a preliminary evaluation (55% recall and 73% precision) and discuss the potential of this system in assisting both clinical practice and scientific investigation.
1. Introduction
We discuss the development of a natural language processing (NLP) system to identify and extract a range of semantic predications (or relations) from Medline citations on pharmacogenomics. Core research in this field investigates the interaction of genes and their products with therapeutic substances. Discoveries hold considerable promise for treatment of disease [1], as clinical successes, notably in oncology, demonstrate. For example, Gleevec is a first-line therapy for chronic myelogenous leukemia, as it attacks the mutant BCR-ABL fusion tyrosine kinase in cancer cells, leaving healthy cells largely unharmed [2]. Automatic methods, including NLP, are increasingly used as important aspects of the research process in biomedicine [3,4,5,6]. Current NLP for pharmacogenomics concentrates on co-occurrence information without specifying exact relations [7]. We are developing a system (called Enhanced SemRep in this paper) which complements that approach by representing assertions in text as semantic predications. For example, the predications in (2) are extracted from the sentence in (1).

1) These findings therefore demonstrate that dexamethasone is a potent inducer of multidrug resistance-associated protein expression in rat
hepatocytes through a mechanism that seems not to involve the classical glucocorticoid receptor pathway.

2) Dexamethasone STIMULATES Multidrug Resistance-Associated Proteins
   Dexamethasone NEG_INTERACTS_WITH Glucocorticoid receptor
   Multidrug Resistance-Associated Proteins PART_OF Rats
   Hepatocytes PART_OF Rats

Enhanced SemRep is based on two existing systems: SemRep [8,9] and SemGen [10,11]. SemRep extracts semantic predications from clinical text, and SemGen was developed from SemRep to identify etiologic relations between genetic phenomena and diseases. Several aspects of these programs were combined and modified to identify a range of relations referring to genes, drugs, diseases, and population groups. The enhanced system extracts pharmacogenomic information down to the gene level, without identifying more specific genetic phenomena, such as mutations (e.g., CYP2C9*3), single nucleotide polymorphisms (e.g., C2850T), and haplotype information. In this paper we describe the major issues involved in developing Enhanced SemRep for pharmacogenomics.
2. Background
2.1. Natural Language Processing for Biomedicine
Several NLP systems identify relations in biomedical text. Due to the complexity of natural language, they often target particular semantic relations. In order to achieve high recall, some methods rely mainly on co-occurrence of entities in text (e.g. Yen et al. [12] for gene-disease relations). Some approaches use machine learning techniques to identify relations, for example Chun et al. [13] for gene-disease relations. Syntactic templates and shallow parsing are also used, by Blaschke et al. [14] for protein interactions, Rindflesch et al. [15] for binding, and Leroy et al. [16] for a variety of relations. Friedman et al. [17] use extensive linguistic processing for relations on molecular pathways, while Lussier et al. [18] use a similar approach to identify phenotypic context for genetic phenomena. In pharmacogenomics, methods for extracting drug-gene relations have been developed, based on co-occurrence of drug and gene names in a sentence [19, 7]. The system described in [19] is limited to cancer research, while Chang et al. [7] use machine learning to assign drug-gene co-occurrences to one of several broad relations, such as genotype, clinical outcome, or pharmacokinetics. The system we present here (Enhanced SemRep) addresses a
wide range of syntactic structures and specific semantic relations pertinent to pharmacogenomics, such as STIMULATES, DISRUPTS, and CAUSES. We first describe the structure of the domain knowledge in the Unified Medical Language System (UMLS) [20], upon which the system crucially depends.

2.2. The Unified Medical Language System
The Metathesaurus and the Semantic Network are components of the UMLS representing structured biomedical domain knowledge. In the current (2006AB) release, the Metathesaurus contains more than a million concepts. Editors combine terms from constituent sources having similar meaning into a concept, which is also assigned a semantic type, as in (3).

3) Concept: fever; Synonyms: pyrexia, febrile, and hyperthermia; Semantic Type: 'Finding'

The Semantic Network is an upper level ontology of medicine. Its core structure consists of two hierarchies (entities and events) of 135 semantic types, which represent the organization of phenomena in the medical domain.

4) Entity
     Physical Object
       Anatomical Structure
         Fully Formed Anatomical Structure
           Gene or Genome

Semantic types serve as arguments of "ontological" predications that represent allowable relationships between classes of concepts in the medical domain. The predicates in these predications are drawn from 54 semantic relations. Some examples are given in (5).

5) 'Gene or Genome' PART_OF 'Cell'
   'Pharmacologic Substance' INTERACTS_WITH 'Enzyme'
   'Disease or Syndrome' CO-OCCURS_WITH 'Neoplastic Process'

Semantic interpretation depends on matching asserted semantic predications to ontological semantic predications, and the current version of SemRep depends on the unedited version of the UMLS Semantic Network for this matching. One of the major efforts in the development of Enhanced SemRep was to edit the Semantic Network for application in pharmacogenomics.

2.3. SemRep and SemGen
SemRep: SemRep [8,9] is a rule-based symbolic natural language processing system developed to extract semantic predications from Medline citations on clinical medicine. As the first step in semantic interpretation, SemRep produces
an underspecified (or shallow) syntactic analysis based on the SPECIALIST Lexicon [21] and the MedPost part-of-speech tagger [22]. The most important aspect of this processing is the identification of simple noun phrases. In the next step, these are mapped to concepts in the Metathesaurus using MetaMap [23]. The structure in (7) illustrates syntactic analysis with Metathesaurus concepts and semantic types (abbreviated) for the sentence in (6).

6) Phenytoin induced gingival hyperplasia

7) [[head(noun(phenytoin)), metaconc('Phenytoin':[orch,phsu])],
    [verb(induced)],
    [head(noun('gingival hyperplasia')), metaconc('Gingival Hyperplasia':[dsyn])]]

The structure in (7) serves as the basis for the final phase in constructing a semantic predication. During this phase, SemRep relies on "indicator" rules which map syntactic elements (such as verbs and nominalizations) to predicates in the Semantic Network, such as TREATS, CAUSES, and LOCATION_OF. Argument identification rules (which take into account coordination, relativization, and negation) then find syntactically allowable noun phrases to serve as arguments for indicators. If an indicator and the noun phrases serving as its syntactic arguments are to be interpreted as a semantic predication, the following condition must be met: the semantic types of the Metathesaurus concepts for the noun phrases must match the semantic types serving as arguments of the indicated predicate in the Semantic Network. For example, in (7) the indicator induced maps to the Semantic Network relation in (8).

8) 'Pharmacological Substance' CAUSES 'Disease or Syndrome'

The concepts corresponding to the noun phrases phenytoin and gingival hyperplasia can serve as arguments because their semantic types ('Pharmacological Substance' (phsu) and 'Disease or Syndrome' (dsyn)) match those in the Semantic Network relation. In the final interpretation (9), the Metathesaurus concepts from the noun phrases are substituted for the semantic types in the Semantic Network relation.

9) Phenytoin CAUSES Gingival Hyperplasia

SemGen: SemGen [10,11] was adapted from SemRep in order to identify semantic predications on the genetic etiology of disease. The main consideration in creating SemGen was the identification of gene and protein names as well as related genomic phenomena. For this SemGen relies on ABGene [24], in addition to MetaMap and the Metathesaurus. Since the UMLS Semantic Network does not cover molecular genetics, ontological semantic relations for this domain were created for SemGen. The allowable relations were defined in two classes: gene-disease interactions (ASSOCIATED_WITH, PREDISPOSE, and CAUSE) and gene-gene interactions (INHIBIT, STIMULATE, and INTERACTS_WITH).
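A highly simplified sketch of the matching step described above, using tiny stand-in dictionaries in place of the UMLS Metathesaurus and Semantic Network; the function and dictionary names are illustrative assumptions, not SemRep's actual data structures.

```python
# Sketch of indicator-rule matching: an indicator maps to a predicate, and the
# predication is accepted only if the arguments' semantic types instantiate an
# ontological predication.  Dictionaries are tiny illustrative stand-ins.
INDICATORS = {"induced": "CAUSES"}
ONTOLOGY = {("Pharmacologic Substance", "CAUSES", "Disease or Syndrome")}
CONCEPTS = {"phenytoin": ("Phenytoin", "Pharmacologic Substance"),
            "gingival hyperplasia": ("Gingival Hyperplasia", "Disease or Syndrome")}

def interpret(subject_phrase, indicator, object_phrase):
    """Return a semantic predication if the semantic types license it, else None."""
    predicate = INDICATORS.get(indicator)
    subj = CONCEPTS.get(subject_phrase)
    obj = CONCEPTS.get(object_phrase)
    if predicate and subj and obj and (subj[1], predicate, obj[1]) in ONTOLOGY:
        return (subj[0], predicate, obj[0])
    return None

print(interpret("phenytoin", "induced", "gingival hyperplasia"))
# -> ('Phenytoin', 'CAUSES', 'Gingival Hyperplasia')
```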
3. Methods
The development of Enhanced SemRep for pharmacogenomics began with scrutiny of the pharmacogenomics literature to identify relevant predications not identified by either SemRep or SemGen. Approximately 1000 Medline citations were retrieved with queries containing drug and gene names. From these, 400 sentences were selected as containing assertions most crucial to pharmacogenomics, including genetic (gene-disease), genomic (gene-gene), and pharmacogenomic (drug-gene, drug-genome) relations; in addition, relations between genes and population groups, relations between diseases and population groups, and pharmacological relations (drug-disease, drug-pharmacological effect, drug-drug) were scrutinized. Examples of relevant assertions include:

10) N-acetyltransferase 2 plays an important role in Alzheimer's Disease. (gene-disease)
    Ticlopidine is a potent inhibitor for CYP2C19. (drug-gene)
    Gefitinib and erlotinib for tumors with epidermal growth factor receptor (EGFR) mutations or increased EGFR gene copy numbers. (drug-gene)
    The CHF patients with the VDR FF genotype have higher rates of bone loss. (gene-disease and gene-process)

After processing these 400 sentences with SemRep, errors were analyzed and categorized for etiology. It was determined that the majority of errors were missed predications that could be accounted for under three broad categories: a) the Semantic Network, b) errors in argument identification due to "empty" heads, and c) gene name identification. For Enhanced SemRep, gene name identification was addressed by adding ABGene [24] to the machinery provided by MetaMap and the Metathesaurus. The other classes of errors required more extensive modifications.

3.1. Modification of the Semantic Network for Enhanced SemRep
The UMLS Semantic Network was substantially modified in Enhanced SemRep. New ontological semantic predications were added and the definitions of others were modified. In order to accommodate semantic relations crucial to pharmacogenomics, semantic types stipulated as arguments of ontological semantic predications were reorganized into groups reflecting major categories in this field.

Semantic Types: Semantic groups have been defined to organize the finer grained UMLS semantic types into broader semantic categories relevant to the clinical domain [25]. For Enhanced SemRep, five semantic groups (Substance, Anatomy, Living Being, Process, and Pathology) were defined to permit
systematic and comprehensive treatment of arguments in predications relevant to pharmacogenomics. These semantic groups are used to stipulate allowable arguments of the ontological semantic predications defined for each domain. Each group for pharmacogenomics is defined as:

11) Substance: 'Amino Acid, Peptide, or Protein', 'Antibiotic', 'Biologically Active Substance', 'Carbohydrate', 'Chemical', 'Eicosanoid', 'Element, Ion, or Isotope', 'Enzyme', 'Gene or Genome', 'Hazardous or Poisonous Substance', 'Hormone', 'Immunologic Factor', 'Inorganic Chemical', 'Lipid', 'Neuroreactive Substance or Biogenic Amine', 'Nucleotide Sequence', 'Organic Chemical', 'Organophosphorous Compound', 'Pharmacologic Substance', 'Receptor', 'Steroid', 'Vitamin'

12) Anatomy: 'Anatomical Structure', 'Body Part, Organ, or Organ Component', 'Cell', 'Cell Component', 'Embryonic Structure', 'Fully Formed Anatomical Structure', 'Gene or Genome', 'Neoplastic Process', 'Tissue'

13) Living Being: 'Animal', 'Archaeon', 'Bacterium', 'Fungus', 'Human', 'Invertebrate', 'Mammal', 'Organism', 'Vertebrate', 'Virus'

14) Process: 'Acquired Abnormality', 'Anatomical Abnormality', 'Cell Function', 'Cell or Molecular Dysfunction', 'Congenital Abnormality', 'Disease or Syndrome', 'Finding', 'Injury or Poisoning', 'Laboratory Test Result', 'Organism Function', 'Pathologic Function', 'Physiologic Function', 'Sign or Symptom'

15) Pathology: 'Acquired Abnormality', 'Anatomical Abnormality', 'Cell or Molecular Dysfunction', 'Congenital Abnormality', 'Disease or Syndrome', 'Injury or Poisoning', 'Mental or Behavioral Disorder', 'Pathologic Function', 'Sign or Symptom'

In addition to grouping semantic types, the semantic types assigned to two classes of Metathesaurus concepts were manipulated to handle the following generalizations.

16) Proteins are also genes. Concepts assigned the semantic type 'Amino Acid, Peptide, or Protein' are also assigned the semantic type 'Gene or Genome' ("Cytochrome P-450 CYP2E1" now has 'Gene or Genome' in addition to 'Amino Acid, Peptide, or Protein').

17) Group members are human. Concepts assigned the semantic type 'Group' (or its descendants) are also assigned the semantic type 'Human' ("Child" now has 'Human' in addition to 'Age Group').

Predications: Predications for the pharmacogenomics domain were defined in the following categories (18-23). Ontological predications are defined by specifying allowable arguments, that is, semantic types in the stipulated semantic
groups. The predications in (18-23) constitute a type of schema [26] for representing pharmacogenomic information.

18) Genetic Etiology: {Substance} ASSOCIATED_WITH OR PREDISPOSES OR CAUSES {Pathology}

19) Substance Relations: {Substance} INTERACTS_WITH OR INHIBITS OR STIMULATES {Substance}

20) Pharmacological Effects: {Substance} AFFECTS OR DISRUPTS OR AUGMENTS {Anatomy OR Process}

21) Clinical Actions:
    {Substance} ADMINISTERED_TO {Living Being}
    {Process} MANIFESTATION_OF {Process}
    {Substance} TREATS {Living Being OR Pathology}

22) Organism Characteristics:
    {Anatomy OR Living Being} LOCATION_OF {Substance}
    {Anatomy} PART_OF {Anatomy OR Living Being}
    {Process} PROCESS_OF {Living Being}

23) Co-existence:
    {Substance} CO-EXISTS_WITH {Substance}
    {Process} CO-EXISTS_WITH {Process}

3.2. Empty Heads
"Empty" heads [27,28] are a pervasive phenomenon in pharmacogenomics text. An example is variants in (24).

24) We saw differential activation of CYP2C9 variants by dapsone.

Nearly 80% of the 400 sentences in the training set contain at least one empty head. These structures impede the process of semantic interpretation. In SemRep, the semantic type of the Metathesaurus concept corresponding to the head of a noun phrase qualifies that noun phrase for use as an argument. For example, from (24) we want to use the noun phrase CYP2C9 variants as an argument of STIMULATES, which requires that the semantic type of its object be a member of the Substance group. However, the semantic type of the head concept "Variant" is 'Qualitative Concept'. As has been noted (e.g. [28]), such words are not really empty (in the sense of having no semantic content). A complete interpretation would take the meaning of empty heads into account. However, that is beyond the present capabilities of the Enhanced SemRep system. It is possible to get a partial interpretation of structures containing this phenomenon by ignoring the empty head [27].
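A minimal sketch of this empty-head skipping is given below; the term list is drawn from the categories enumerated in the next paragraph, while the function name and tokenized input are illustrative assumptions rather than Enhanced SemRep's actual implementation.

```python
# Sketch of empty-head handling: if the rightmost token of a noun phrase is a
# semantically "empty" head, it is hidden and the token to its left is treated
# as the head instead.
EMPTY_HEADS = {"allele", "mutation", "polymorphism", "variant",
               "concentration", "levels", "synthesis", "expression", "metabolism"}

def effective_head(noun_phrase_tokens):
    tokens = list(noun_phrase_tokens)
    while tokens:
        t = tokens[-1].lower()
        if t in EMPTY_HEADS or t.rstrip("s") in EMPTY_HEADS:
            tokens.pop()                     # hide the empty head
        else:
            break
    return tokens[-1] if tokens else None

print(effective_head(["CYP2C9", "variants"]))   # -> 'CYP2C9'
```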
We enumerated several categories of terms which we identified as semantically empty heads. These include general terms for genetic and genomic phenomena (allele, mutation, polymorphism, and variant), measurements (concentration, levels), and processes (synthesis, expression, metabolism). During processing in Enhanced SemRep, words from these lists that have been labeled as heads are hidden and the word to their left is relabeled as head. After this processing, CYP2C9 becomes the head (with semantic type 'Gene or Genome', a member of the Substance group) in CYP2C9 variants above, thus qualifying as an argument of STIMULATES.

3.3. Evaluation
Enhanced SemRep was tested for recall and precision using a gold standard of 300 sentences randomly generated from the set of 36,577 sentences containing drug and gene co-occurrences found on the Web site [29] referenced by Chang and Altman [7]. These sentences were annotated by three physicians (CBA, DD-F, MF) for the predications discussed in the Methods section. That is, we did not mark up all assertions in the sentences, only those representing a predication defined in Enhanced SemRep. A total of 850 predications were assigned to these 300 sentences by the annotators.
4. Results
Enhanced SemRep generated 623 predications from the 300 sentences in the test collection. Of these, 455 were true positives, 168 were false positives, and 375 were false negatives, reflecting recall of 55% (95% confidence interval 49% to 61%) and precision of 73% (95% confidence interval 65% to 81%). We also calculated results for the groups of predications defined in categories (18-22) above. Recall and precision for the predications in the five categories are: Genetic Etiology (ASSOCIATED_WITH, CAUSES, PREDISPOSES): recall 74%, precision 74%; Substance Relations (INTERACTS_WITH, INHIBITS, STIMULATES): recall 50%, precision 73%; Pharmacological Effects (AFFECTS, DISRUPTS, AUGMENTS): recall 41%, precision 68%; Clinical Actions (ADMINISTERED_TO, MANIFESTATION_OF, TREATS): recall 54%, precision 84%; Organism Characteristics (LOCATION_OF, PART_OF, PROCESS_OF): recall 63%, precision 71%.
5. Discussion
5.1. Error Analysis
We assessed the etiology of errors separately for recall and precision. In considering both false negatives and false positives for Enhanced SemRep, the etiology of error was almost exclusively due to characteristics in SemRep before
enhancement, not to changes introduced for Enhanced SemRep. Word sense ambiguity was responsible for almost a third (28%) of all errors. For example, in interpreting (25), inhibition was wrongly mapped to the Metathesaurus concept "Psychological Inhibition," thus allowing the system to generate the false positive "CYP2C19 AFFECTS Psychological Inhibition."

25) Ticlopidine inhibition of phenytoin metabolism mediated by potent inhibition of CYP2C19.

Difficulty in processing coordinate structures caused more than a third (35%) of the false negatives seen in our evaluation. For example, in processing (26), although Enhanced SemRep identified the predication "Fluorouracil INTERACTS_WITH DPYD gene," it missed "mercaptopurine INTERACTS_WITH thiopurine methyltransferase."

26) The cytotoxic activities of mercaptopurine and fluorouracil are regulated by thiopurine methyltransferase (TPMT) and dihydropyrimidine dehydrogenase (DPD), respectively.

5.2. Processing Medline citations on CYP2D6
We processed 2849 Medline citations containing variant forms of CYP2D6 with Enhanced SemRep, which produced 36,804 predications, 22,199 of which were unique. 5219 total and 2310 unique predications contained CYP2D6 as an argument, with the remaining predications representing assertions about other genes, drugs, and diseases. The 5219 total predications containing CYP2D6 were analyzed according to two predication categories (Genetic Etiology and Substance Relations), and the results were compared with relations listed for this gene on the PharmGKB Web site [30].

Genetic Etiology: 267 total predications represented CYP2D6 as an etiologic agent (CAUSES, PREDISPOSES, or ASSOCIATED_WITH) for a disease. The most frequent of these are the following: Parkinson's disease (35 occurrences), carcinoma of the lung (21), tardive dyskinesia (15), Alzheimer's disease (9), bladder carcinoma (8). All of the above relations were judged to be true positives. Only carcinoma of the lung occurs in PharmGKB. Of the 4 PharmGKB CYP2D6-disease relations not obtained by SemRep (hepatitis C, ovarian carcinoma, pain, and bradycardia), two were found not to contain the disease name in the referenced citation (ovarian carcinoma and pain).

Substance Relations: Enhanced SemRep retrieved 1128 total predications involving CYP2D6 and a drug. Sixty-nine drugs occurred 3 or more times in those predications. Forty-one of the 69 were in PharmGKB and 28 were not. Sixty-eight were true positives. For example, the following drugs (all true positives) were interpreted by Enhanced SemRep as inhibiting CYP2D6:
quinidine (45 occurrences in 1128 predications with CYP2D6), paroxetine (34), fluoxetine (27), fluvoxamine (8), sertraline (8). Quinidine and sertraline are not in PharmGKB. SemRep also retrieved predications that the following drugs (all true positives) interact with CYP2D6: bufuralol (27), antipsychotic agents (25), dextromethorphan (21), venlafaxine (19), debrisoquin (18). Bufuralol is not in PharmGKB. The PharmGKB relations SemRep failed to capture were CYP2D6 interactions with cocaine, levomepromazine, maprotiline, trazodone, and yohimbine. Two of these entries (levomepromazine and maprotiline) were found not to be based on the content of Medline citations.
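To illustrate how such frequency summaries and PharmGKB comparisons can be derived from the predication output, here is a minimal sketch; the predication tuples and the PharmGKB disease list are hypothetical placeholders rather than the actual data.

```python
from collections import Counter

# Hypothetical (subject, predicate, object) predications, in the style of
# Enhanced SemRep output for the CYP2D6 citation set.
predications = [
    ("CYP2D6", "ASSOCIATED_WITH", "Parkinson Disease"),
    ("CYP2D6", "PREDISPOSES", "Carcinoma of lung"),
    # ...
]

ETIOLOGY_PREDICATES = {"CAUSES", "PREDISPOSES", "ASSOCIATED_WITH"}

# Rank the diseases asserted to have CYP2D6 as an etiologic agent.
etiology = Counter(obj for subj, pred, obj in predications
                   if subj == "CYP2D6" and pred in ETIOLOGY_PREDICATES)

# Compare against an external reference list (illustrative subset only).
pharmgkb_diseases = {"Carcinoma of lung", "Hepatitis C", "Pain"}
novel_relations = set(etiology) - pharmgkb_diseases    # found by SemRep only
missed_relations = pharmgkb_diseases - set(etiology)   # listed by PharmGKB only
```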
6. Conclusion
We discuss the adaptation of an existing NLP system for application in the pharmacogenomics domain. The major changes for developing Enhanced SemRep from SemRep involved modifying the semantic space stipulated by the UMLS Semantic Network. The output of Enhanced SemRep is in the form of semantic predications that represent assertions from Medline citations expressing a range of specific relations in pharmacogenomics. The information provided by Enhanced SemRep has the potential to contribute to systems that go beyond traditional information retrieval to support advanced information management applications for pharmacogenomics research and clinical care. In the future we intend to adapt the summarization and visualization techniques developed for clinical text [31] to the pharmacogenomic predications generated by Enhanced SemRep.

Acknowledgments

This study was supported in part by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine. The first author was supported by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an inter-agency agreement between the U.S. Department of Energy and the National Library of Medicine.

References

1. Halapi E, Hakonarson H. Advances in the development of genetic markers for the diagnosis of disease and drug response. Expert Rev Mol Diagn. 2002 Sep;2(5):411-21.
2. Druker BJ, Talpaz M, Resta DJ, et al. Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med. 2001 Apr 5;344(14):1031-7.
3. Yandell MD, Majoros WH. Genomics and natural language processing. Nature Reviews Genetics 2002;3(8):601-10.
4. Cohen KB, Hunter L. Natural language processing and systems biology. In Dubitzky and Pereira, Artificial intelligence methods and tools for systems biology. Springer Verlag, 2004.
5. Hirschman L, Park JC, Tsujii J, Wong L, Wu CH. Accomplishments and challenges in literature data mining for biology. Bioinformatics 2002;18(12):1553-61.
6. Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 2006;7:119-29.
7. Chang JT, Altman RB. Extracting and characterizing gene-drug relationships from the literature. Pharmacogenetics. 2004 Sep;14(9):577-86.
8. Rindflesch TC, Fiszman M. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J Biomed Inform. 2003 Dec;36(6):462-77.
9. Rindflesch TC, Fiszman M, Libbus B. Semantic interpretation for the biomedical research literature. In Chen, Fuller, Hersh, and Friedman, Medical informatics: Knowledge management and data mining in biomedicine. Springer, 2005, pp. 399-422.
10. Rindflesch TC, Libbus B, Hristovski D, Aronson AR, Kilicoglu H. Semantic relations asserting the etiology of genetic diseases. AMIA Annu Symp Proc. 2003:554-8.
11. Masseroli M, Kilicoglu H, Lang FM, Rindflesch TC. Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease. BMC Bioinformatics 2006 Jun 8;7(1):291.
12. Yen YT, Chen B, Chiu HW, Lee YC, Li YC, Hsu CY. Developing an NLP and IR-based algorithm for analyzing gene-disease relationships. Methods Inf Med. 2006;45(3):321-9.
13. Chun HW, Tsuruoka Y, Kim J-D, Shiba R, Nagata N, Hishiki T, Tsujii J. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. Pac Symp Biocomput. 2006:4-15.
14. Blaschke C, Andrade MA, Ouzounis C, Valencia A. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology. Edited by Lengauer T, Schneider R, Bork P, Brutlag DL, Glasgow JI, Mewes H-W, Zimmer R. San Francisco, CA: Morgan Kaufmann Publishers, Inc; 1999:60-67.
15. Rindflesch TC, Rajan JV, Hunter L. Extracting molecular binding relationships from biomedical text. Proceedings of ANLP-NAACL 2000:188-95. Association for Computational Linguistics.
16. Leroy G, Chen H, Martinez JD. A shallow parser based on closed-class words to capture relations in biomedical text. J Biomed Inform. 2003;36(3):145-158.
17. Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics 2001;17 Suppl 1:S74-S82.
18. Lussier YA, Borlawsky T, Rappaport D, Liu Y, Friedman C. PhenoGO: assigning phenotypic context to Gene Ontology annotations with natural language processing. Pac Symp Biocomput. 2006:64-75.
19. Rindflesch TC, Tanabe L, Weinstein JN, Hunter L. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput. 2000:517-528.
20. Humphreys BL, Lindberg DA, Schoolman HM, Barnett GO. The Unified Medical Language System: An informatics research collaboration. J Am Med Inform Assoc 1998 Jan-Feb;5(1):1-11.
21. McCray AT, Srinivasan S, Browne AC. Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care. 1994:235-9.
22. Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for biomedical text. Bioinformatics. 2004;20(14):2320-1.
23. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. Proc AMIA Symp. 2001:17-21.
24. Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics. 2002;18(8):1124-32.
25. McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Medinfo 2001;10(Pt 1):216-20.
26. Friedman C, Borlawsky T, Shagina L, Xing HR, Lussier YA. Bio-ontology and text: bridging the modeling gap. Bioinformatics. 2006 Jul 26.
27. Chodorow MS, Byrd RI, Heidorn GE. Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the ACL, 1985, pp. 299-304.
28. Guthrie L, Slater BM, Wilks Y, Bruce R. Is there content in empty heads? Proceedings of the 13th Conference on Computational Linguistics. 1990;v3:
138-143.
29. http://bionlp.stanford.edu/genedrug/
30. Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE. PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res. 2002 Jan 1;30(1):163-5.
31. Fiszman M, Rindflesch TC, Kilicoglu H. Abstraction summarization for managing the biomedical research literature. Proc HLT-NAACL Workshop on Computational Lexical Semantics, 2004.
ANNOTATING GENES USING TEXTUAL PATTERNS

ALI CAKMAK        GULTEKIN OZSOYOGLU

Department of Electrical Engineering and Computer Science
Case Western Reserve University, Cleveland, OH 44106, USA
{ali.cakmak, [email protected]
Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot catch up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system to automatically infer new GO annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) based on the extracted terms, constructs textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through "pattern crosswalks", (iv) employs semantic pattern matching, rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision level of 78% at the 57% recall level.
1. Introduction
In this paper, we present GEANN (Gene Annotator), a system to automatically infer new Gene Ontology (GO) annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. Currently, annotations for GO, a controlled term vocabulary describing the central attributes of genes [1], are most reliably done manually by experts who read the literature and decide about appropriate annotations. This approach is slow and costly. Compounding the problem is the rate of increase in the amount of available biological literature: at the present time, about 223,000 new genomics papers (that contain at least one of the words "gene", "protein" or "rna", and were published in 2005) per year are added to PubMed [3], far outstripping the capabilities of a manual annotation effort. Hence, effective computational tools are needed to automate the annotation of genes with GO terms. Currently, many genes possibly lack appropriate GO annotations even though there may be sufficient annotation evidence in a scientific paper. We have observed that, as of Jan. 2006, only a small portion of the papers in
PubMed has been referred to in support of gene annotations (i.e., 0.9% of 3 million PubMed genomics papers with abstracts). We give an example.

Example. The following is an excerpt from an abstract [18] which discusses experiments indicating the translation repressor activity (GO:0030371) of the gene p97. However, presently gene p97 does not have the translation repressor activity annotation. "...experiments show that p97 suppresses both cap-dependent and independent translation ... expression of p97 reduces overall protein synthesis ... results suggest that p97 functions as a general repressor of translation by forming ...".

GEANN can be used to (i) discover new GO annotations for a gene, and/or (ii) increase the annotation strength of existing GO annotations by locating additional paper evidence. We are currently integrating GEANN into PathCase [2], a system of web-based tools for metabolic pathways, in order to allow users to discover new GO annotations. In general, GEANN is designed to:
• facilitate and expedite the curation process in GO, and
• extract explicit information about a gene that is implicitly present in text.

GEANN uses paper abstracts, and utilizes textual pattern extraction techniques to discover GO annotations automatically. GEANN's methodology is to (i) extract textual elements identifying a GO term, (ii) construct patterns with reliability scores, conveying the semantics of how confidently a pattern represents a GO term, (iii) extend the pattern set with longer ones via "crosswalks", (iv) apply semantic pattern matching techniques using WordNet, and (v) annotate genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. In experiments, GEANN produced, on average, 78% precision at 57% recall. This level of performance is significantly better than that of the existing systems described in the literature, which are compared in sections 5.2.3 and 6.

Overview: The GEANN implementation has two phases, namely, the training and the annotation phases. The goal of the training phase is to construct a set of patterns that characterize a variety of indicators for the existence of a GO annotation. As the training data, annotation evidence papers [1] are used. The first step in the training phase is the tagging of genes in the papers. Then, significant terms/phrases that differentially appear in the training set are extracted. Next, patterns are constructed based on (i) the significant terms/phrases, and (ii) the terms surrounding significant terms. Finally, each pattern is assigned a reliability score. The annotation discovery phase looks for possible matches to the patterns in paper abstracts. Next, GEANN computes a matching score which indicates the strength of the prediction. Finally, GEANN determines the gene to be associated with the pattern match. At the end, new annotation predictions are ordered by their scores, and presented to the user.
The extracted patterns are flexible in that they match a set of phrases with close meanings. GEANN employs WordNet [5] to deduce the semantic closeness of words in patterns. WordNet is an online lexical reference system in which nouns, verbs, adjectives and adverbs are grouped into synonym sets, and these synonym sets are hierarchically organized through various relationships.

The paper is organized as follows. In section 2, we elaborate on significant term discovery and pattern construction. Sections 3 and 4 discuss pattern matching and the scoring scheme, respectively. Section 5 summarizes the experimental results. In section 6 (and 5.2.3), we compare GEANN to other similar, competing systems.
2. Pattern Construction
In GEANN, the identifying elements of a GO concept are the representations of the concept in textual data, and the terms surrounding the identifying elements are considered auxiliary descriptors of the GO concept. A pattern is an abstraction which encapsulates the identifying elements and the auxiliary descriptors together in a structured manner. More specifically, a pattern is organized as a 3-tuple: {LEFT} <MIDDLE> {RIGHT}, where each element corresponds to a set (bag) of words. The <MIDDLE> element is an ordered sequence of significant terms (identifying elements); the {LEFT} and {RIGHT} elements correspond to word sets that appear around the significant terms (auxiliary descriptors). The number of terms in the left and the right elements is adjusted by a window size. Each word or phrase in the significant term set is assigned to be the middle element of a newly created pattern template. A pattern is an instance of a pattern template; one template may lead to several patterns with a common middle element, but (possibly) different left or right elements. We give an example.

Example. Two of the patterns that are created from the pattern template {LEFT} <rna polymerase ii> {RIGHT} are listed below, where rna polymerase ii is found to be a significant term within the context of the GO concept positive transcription elongation factor with a window size of three. The {LEFT} and {RIGHT} tuples are instantiated from the surrounding words that appear before or after the significant term in the text.

{increase catalytic rate} <rna polymerase ii> {transcription suppressing transient}
{proteins regulation transcription} <rna polymerase ii> {initiated search proteins}

Patterns are contiguous blocks, that is, no gap is allowed between the tuples in a pattern. Each tuple is a bag of words, which are tokens delimited by white space characters. Since stop words are eliminated in the preprocessing stage, the patterns do not include words like "the", "of", etc.
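As a concrete illustration of this structure, the sketch below shows one way to represent a pattern and to instantiate it from a stop-word-free token stream with a window size of three; the class and function names are ours, not GEANN's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pattern:
    left: frozenset      # auxiliary descriptors preceding the significant term
    middle: tuple        # ordered significant term(s), the identifying elements
    right: frozenset     # auxiliary descriptors following the significant term

def instantiate(tokens, start, end, window=3):
    """Build a Pattern around tokens[start:end], the matched significant term."""
    return Pattern(
        left=frozenset(tokens[max(0, start - window):start]),
        middle=tuple(tokens[start:end]),
        right=frozenset(tokens[end:end + window]),
    )

tokens = ("increase catalytic rate rna polymerase ii "
          "transcription suppressing transient").split()
p = instantiate(tokens, 3, 6)   # middle = ('rna', 'polymerase', 'ii')
```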
2.1. Locating Significant Terms and Phrases

Some words or phrases appearing frequently in the abstracts provide evidence for annotations by a specific GO term. For instance, RNA polymerase II, which performs elongation of RNA in eukaryotes, appears in almost all abstracts associated with the GO term "positive transcription elongation factor activity". Hence, intuitively, such frequent term occurrences should be marked as indicators of a possible annotation. In order to avoid marking word(s) common to almost all abstracts (e.g., "cell"), the document frequency of a significant term is required to be below a certain threshold (10% in our case). The words that constitute the name of a GO term are by default considered significant terms. Frequent phrases are constructed out of frequent terms through a procedure similar to the Apriori algorithm [9]. First, individual frequent terms are obtained using the IDF (inverse document frequency [4]) indices. Then, frequent phrases are obtained by recursively combining individual frequent terms/phrases, provided that the constructed phrase is also frequent. In order to obtain significant terms, one can use various methods, from random-walk networks to correlation mining [9]. Since the training set for each GO term is usually not large, and to keep the methodology simple, we use frequency information to determine the significant terms.

2.2. Pattern Crosswalks

Extended patterns are constructed by virtually walking from one pattern to another. The goal is to create larger patterns that can eliminate false GO annotation predictions, and boost the true candidates. Based on the type of the walk, GEANN creates two different extended patterns: (i) side-joined, and (ii) middle-joined patterns.

Transitive Crosswalk: Given a pattern pair P1 = {left1} <middle1> {right1} and P2 = {left2} <middle2> {right2}, if {right1} = {left2}, then patterns P1 and P2 are merged into a 5-tuple side-joined (SJ) pattern P3 = {left1} <middle1> {right1} <middle2> {right2}. Next, we give an example of an SJ pattern that is created for the GO term positive transcription elongation factor. Example.
P1 = {factor increase catalytic} <…> {RNA polymerase II}
P2 = {RNA polymerase II} <elongation factor> {[ge]}
[SJ Pattern] P3 = {factor increase catalytic} <…> {RNA polymerase II} <elongation factor> {[ge]}
SJ patterns are helpful in detecting consecutive pattern matches that partially overlap. If there exist two consecutive regular pattern matches, then such a match should be evaluated differently than two separate matches of regular patterns, as it may provide stronger evidence for the
existence of a possible GO annotation in the match region. Note that pattern merging through crosswalks is performed among the patterns of the same GO concept.

Middle Crosswalk: Based on partial overlap between the middle and side (right or left) tuples of patterns, we construct the second type of extended pattern. Given the same pattern pair P1 and P2 as above, the patterns can be merged into a 4-tuple middle-joined (MJ) pattern if at least one of the following cases holds:
a. Right middle walk: {right1} ∩ <middle2> ≠ ∅ and <middle1> ∩ {left2} = ∅
b. Left middle walk: <middle1> ∩ {left2} ≠ ∅ and {right1} ∩ <middle2> = ∅
c. Middle walk: <middle1> ∩ {left2} ≠ ∅ and {right1} ∩ <middle2> ≠ ∅

MJ patterns have two middle tuples. For case (a), the first middle tuple is the intersection of the {right1} and <middle2> tuples. Case (b) is handled similarly. As for case (c), the first and the second middle tuples are subsets of <middle1> and <middle2>. Below, we give an example of MJ pattern construction for the GO term positive transcription elongation factor.

Example. (Middle-joined pattern construction)
P1 = {[ge] facilitates chromatin} <…> {chromatin-specific elongation factor}
P2 = {classic inhibitor transcription} <elongation rna polymerase ii> {pol II}
[MJ Pattern] P3 = {[ge] facilitates chromatin} <elongation> {pol II}
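Reusing the Pattern sketch shown earlier, the following fragment shows one way to test whether two patterns of the same GO concept can be merged by a crosswalk. It only classifies the walk type, since the construction of the second middle tuple is not fully spelled out for every case above.

```python
def crosswalk_type(p1, p2):
    """Return the kind of extended pattern p1 and p2 can form, or None."""
    if p1.right == p2.left:
        return "SJ"                       # transitive crosswalk (side-joined)
    right_mid = bool(p1.right & set(p2.middle))
    mid_left = bool(set(p1.middle) & p2.left)
    if right_mid and not mid_left:
        return "MJ-right"                 # right middle walk
    if mid_left and not right_mid:
        return "MJ-left"                  # left middle walk
    if right_mid and mid_left:
        return "MJ-middle"                # middle walk
    return None
```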
Like SJ patterns, MJ patterns capture consecutive pattern matches in textual data. In particular, MJ patterns detect partial information that may not be recognized otherwise, since we enforce the full matching of middle tuple(s) to locate a pattern match, which is discussed next.

3. Handling Pattern Matches

Since the middle tuples of a pattern are composed of significant terms, the condition for a pattern match is that the middle tuple of the pattern should be completely included in the text. For the matching of the left and the right tuples, GEANN employs semantic matching. We illustrate with an example.

Example. Given a pattern "{increase catalytic rate} <…> {RNA polymerase II}", we want to be able to detect phrases which give the sense that "transcription elongation" is positively affected. Through semantic matching, phrases like "stimulates rate of transcription elongation" or "facilitates transcription elongation" are also matched to the pattern.

GEANN first checks if an exact match is possible between the left/right tuples of the pattern and the surrounding words of the matching phrase. Otherwise, GEANN employs WordNet [5] to check if they have similar meanings, using an open source library [22] as the access interface to WordNet. First, a semantic similarity matrix R[m, n], containing each pair of words, is
built, where R[i, j] is the semantic similarity between the most appropriate sense of the word at position i of phrase X and the word at position j of phrase Y. The most appropriate sense of a word is found through a sense disambiguation process: given a word w, each sense of the word is compared against the senses of the surrounding words, and the sense of w with the highest similarity to the surrounding words is selected as the most appropriate sense. To compute semantic similarity, we adopt a simple approach: the semantic similarity between word senses w1 and w2 is inversely proportional to the length of the path between the senses in WordNet. The problem of computing the semantic similarity between two sets of words X and Y is considered as the problem of computing a maximum total matching weight of a bipartite graph [7], where X and Y are two sets of disjoint nodes (i.e., words in our case). The Hungarian Method [7] is used to solve this problem, where R[i, j] is the weight of the edge from i to j. Finally, each individual pattern match is scored based on (i) the score of the pattern itself, and (ii) the semantic similarity computed using WordNet.

Having located a match, the next step is to decide on the gene that is associated with the match. To this end, two main issues are resolved: (i) detecting gene names in the text, and (ii) determining the gene to be annotated among possible candidates. For the first task, we utilized a decent biological named entity tagger, called Abner [20]. For the second task of locating the gene to be annotated, GEANN first looks into the sentence containing the match, and locates the genes that are positioned before/after the matching region in the sentence, or else in the previous sentence, and so on. The confidence of the annotation decays as the distance from the gene to the matching phrase increases. For more details, please see [14].

4. Pattern Evaluation and Scoring

4.1. Scoring Regular Patterns

Each constructed pattern is assigned a score conveying the semantics of how confidently the pattern represents a GO term. GEANN uses several heuristics for the final score of a pattern based on the structural properties of its middle tuple.
i) Source of Middle Tuple [MT]: Patterns whose middle tuples consist entirely of words from the GO term name get a higher score than those with middle tuples constructed from the frequent terms.
ii) Type of Individual Terms in the Middle Tuple [TT]: The contribution of each word from the GO term name changes according to (a) its selectivity, i.e., the occurrence frequency of the word among all GO term names, and (b) the position of the word in the GO term name, based on the observation that words in a GO term name get more specific from right to left [21].
iii) Frequency of the Phrase in the Middle Tuple [PC]: A pattern's score is inversely proportional to the frequency of the middle tuple throughout the papers in the database.
iv) Term-Wise Paper Frequency of the Middle Tuple [PP]: Patterns with middle tuples which are highly frequent in the GO term's paper set get higher scores.

Based on the reasoning summarized above, GEANN uses the following heuristic score function:

PatternScore = (MT + TT + PP) * log(1/PC)

4.2. Scoring Extended Patterns

(a) Scoring SJ Patterns: SJ patterns serve to capture consecutive pattern matches. Our scoring scheme differentiates between two-consecutive and two-single pattern matches: consecutive pattern matches contribute to the final score proportionally to some exponent of the sum of the pattern scores (after experimenting with different values of exponents in the extended pattern score functions for the highest accuracy, for the experimental results section, j and k were set to 2 and 1.5, respectively). This way, GEANN can assign considerably higher scores to consecutive pattern matches, which are considered much stronger indicators for an annotation than two individual pattern matches.

Score(SJ Pattern) = ( Score(Pattern1) + Score(Pattern2) )^j

(b) Scoring MJ Patterns: Consistent with the construction process, the score computation for MJ patterns is more complex than for SJ patterns.

Score(Middle-joined Pattern) = ( DegreeOfOverlap1 * Score(Pattern1) + DegreeOfOverlap2 * Score(Pattern2) )^k

where DegreeOfOverlap1 (DegreeOfOverlap2) represents the proportion of the middle tuple of pattern1 (pattern2) that is included in the left tuple of pattern2 (right tuple of pattern1). In addition, GEANN considers the preservation of word order, represented by the PositionalDecayCoefficient. The degree of overlap is computed by:

DegreeOfOverlap = PositionalDecayCoefficient * OverlapFrequency

The positional decay coefficient is computed according to the alignment of the left or the right middle tuple of a pattern with the middle tuple of the other pattern. If a matching word is in the same position in both tuples, then the positional score of the word is 1; otherwise, it is 0.75.

PositionalDecayCoefficient = ( Σ_{w ∈ Overlap} PosScore(w) ) / Size(Overlap)
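The score functions above can be stated compactly in code. The sketch below mirrors the formulas; how the MT, TT, and PP components are quantified is not fully specified here, so they are simply passed in as numbers.

```python
from math import log

def pattern_score(mt, tt, pp, pc):
    # PatternScore = (MT + TT + PP) * log(1 / PC); rarer middle tuples score higher.
    return (mt + tt + pp) * log(1.0 / pc)

def sj_score(score1, score2, j=2):
    return (score1 + score2) ** j

def positional_decay(overlap, pos_in_tuple1, pos_in_tuple2):
    # 1.0 when a shared word keeps its position in both tuples, 0.75 otherwise.
    scores = [1.0 if pos_in_tuple1[w] == pos_in_tuple2[w] else 0.75 for w in overlap]
    return sum(scores) / len(overlap)

def mj_score(score1, score2, degree_of_overlap1, degree_of_overlap2, k=1.5):
    return (degree_of_overlap1 * score1 + degree_of_overlap2 * score2) ** k
```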
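For the semantic matching step of section 3, a rough equivalent can be assembled from off-the-shelf components: NLTK's WordNet interface for path-based similarity and SciPy's Hungarian-method solver for the maximum-weight bipartite matching. This sketch takes the best-scoring sense pair instead of running the context-based sense disambiguation described above, and the normalization of the final score is our own choice.

```python
from itertools import product
import numpy as np
from nltk.corpus import wordnet as wn
from scipy.optimize import linear_sum_assignment

def word_similarity(w1, w2):
    # Path-based similarity over the best-matching sense pair.
    best = 0.0
    for s1, s2 in product(wn.synsets(w1), wn.synsets(w2)):
        sim = s1.path_similarity(s2)
        if sim is not None and sim > best:
            best = sim
    return best

def phrase_similarity(x_words, y_words):
    # Build the similarity matrix R and solve the assignment problem
    # (SciPy minimizes cost, so the matrix is negated).
    R = np.array([[word_similarity(x, y) for y in y_words] for x in x_words])
    rows, cols = linear_sum_assignment(-R)
    return R[rows, cols].sum() / max(len(x_words), len(y_words))
```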
5. Experimental Results

5.1. Data Set

In order to evaluate the performance of GEANN, we performed experiments on annotating genes in NCBI's GenBank with selected GO terms. A subset of PubMed abstracts was stored in a database. The experimental subset consisted of evidence papers cited by GO annotations, and reference papers that were cited for the genes maintained by GenBank. This corpus of around 150,000 papers was used to approximate the word frequencies in the actual PubMed dataset. As part of pre-processing, abstracts/titles of papers were tokenized, stop words were removed, and inverse document indices were constructed for each token. GEANN was evaluated on a set of 40 GO terms (24 terms from the biological process, 12 terms from the molecular function, and 4 terms from the cellular component subontology). Our decision on which terms to choose for the performance assessment was shaped by the choices made in two previous studies [16, 17] for comparison purposes. For a complete list of GO terms used in the experiments, see [14]. The evidence papers that are referenced from at least one of the test GO terms are used for testing patterns. In total, 4694 evidence paper abstracts were used to annotate 4982 genes, where on average each GO term has 120 evidence papers and 127 genes.
5.2. Experiments
Our experiments are based on a precision-recall analysis of the predicted annotation set. We use the k-fold cross validation scheme [9] (k=10 in our case). Precision is the ratio of the number of genes that are correctly predicted to the number of all genes predicted by GEANN. Recall is the fraction of the correctly predicted genes in the whole set of genes that are known to be annotated with the GO term being studied. Genes that are annotated by GEANN, and yet do not have a corresponding entry in GenBank, are ignored as there is no way to check their correctness. Additionally, GEANN uses the following heuristics.

Heuristic 1 (Shared Gene Synonyms): If at least one of the genes matching the annotated symbol has the annotation with the target GO term, then the prediction is considered a true positive.

Heuristic 2 (Incorporating the GO Hierarchy): A given GO term G also annotates all the genes that are annotated by any of its descendants (true-path rule).

5.2.1. Overall Performance

For this experiment, predicted annotations were ordered by their confidence scores. Precision and recall values were computed by considering the top k
predictions; k was increased by 1 at each step until either all the annotations for a GO term were located, or all the candidates in the predicted set were processed.
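The top-k evaluation can be expressed as a small routine that sweeps k over the ranked prediction list; the real evaluation additionally applies Heuristics 1 and 2 (shared synonyms and the true-path rule) when deciding whether a predicted gene counts as correct, which this sketch omits. The F-value used later in section 5.2.3 is included for completeness.

```python
def precision_recall_at_k(ranked_genes, true_genes, k):
    top_k = set(ranked_genes[:k])
    tp = len(top_k & true_genes)
    precision = tp / len(top_k) if top_k else 0.0
    recall = tp / len(true_genes) if true_genes else 0.0
    return precision, recall

def f_value(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Sweep k = 1, 2, ... over predictions ordered by confidence score:
# curve = [precision_recall_at_k(ranked, gold, k) for k in range(1, len(ranked) + 1)]
```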
Figure 1: Overall system performance and approximate error due to the NET (precision/recall vs. result set size). Figure 2: Annotation accuracy across different subontologies in GO (precision/recall vs. result set size).

Observation 1: From Figure 1, which presents the average precision/recall values, GEANN yields 78% precision (the top-most line) at 46% recall (the bottom-most line). The association of a pattern to a gene relies on the accurate tagging of genes in the text. However, named entity taggers (NETs) are still far from being perfect (ABNER has 77% recall, 68% precision). It may be quite difficult to exactly quantify NET errors. Thus, we took a minimalist approach, and attempted to compute the rate of error that is guaranteed to be due to the fault of the NET.

Heuristic 4 (Tagger Error Approximation): If none of the synonyms of a gene has been recognized by the tagger in any of the papers which are associated with the target GO term G, then we label the gene as a tagger-missed gene.

Observation 2: After eliminating tagger-missed genes, the recall of GEANN increases to 57% from 46% at the precision level of 78% (the middle line in Figure 1). Note that the actual error rate of the NET, in practice, may be much more than what is estimated above. In addition, eliminating tagger-missed genes does not affect the precision. Thus, precision is plotted only once.

5.2.2. Accuracy across Different Subontologies
In experiment 2, the same steps as in experiment 1 were repeated, but average accuracy values were computed within the individual subontologies. Figure 2 plots precision/recall values of the different subontologies of GO (MF: Molecular Function, BP: Biological Process, CC: Cellular Component).

Observation 3: GEANN has the best precision for CC, where the precision reaches 85% at 52% recall, while MF yields the highest recall (58% at 75% precision).

Observation 4: CC almost always provides the best precision values because the variety of the words used to describe cellular locations may be much lower. However, CC has the lowest recall (52%), as the cellular location is well known for certain genomic entities and hence is not stated explicitly in the text as much as MF or BP annotations.
Observation 5: Higher recall in MF is expected as, in general, the emphasis in a biomedical paper is on the functionality of a gene, where the process or the cellular location information is usually provided as a secondary trait of the entity.
5.2.3. Comparative Performance Analysis with Other Systems

Raychaudhuri et al. [16] and Izumitani et al. [17] built paper classifiers to label genes with GO terms through the classification of papers. Both works assume that a gene is a priori associated with several papers. This is a strong assumption in that, if the experts are to invest sufficient time to read and associate a set of papers with a gene, then they can probably annotate the gene with the appropriate GO terms. Second, since both of the systems work at the document level, no direct evidence phrases are extracted from the text. Third, the classifiers employed by these studies need large training paper sets. In contrast, GEANN does not require a gene to be associated with any set of papers. Moreover, GEANN can also provide specific match phrases as evidence rather than the whole document. Fourth, GEANN handles the reconciliation of two different genomic databases whereas those studies have no such consideration.

Izumitani et al. compare their system to Raychaudhuri et al.'s study for 12 GO terms. Our comparative analysis is also confined to this set of GO terms. Among these GO terms, five of them (Ion homeostasis, Membrane fusion, Metabolism, Sporulation) either have no or very few annotations in Genbank to perform 10-fold cross validation, and one of the test terms (Biogenesis) has recently become obsolete (i.e., removed from GO). Therefore, here we present comparative results for the remaining 6 GO terms. Table 1 provides the overall F-values [9] while Table 2 provides F-values in terms of the subontologies. The F-value is the harmonic mean of precision and recall, computed as (2*Recall*Precision)/(Recall+Precision).

Table 1: Comparing F-values against Izumitani et al. and Raychaudhuri et al.

  GO category   GEANN   Izumitani et al.   Raychaudhuri et al.
                                           Top1    Top2    Top3
  GO:0006914    0.85    0.78               0.83    0.66    0.38
  GO:0007155    0.66    0.51               0.19    0.19    0.13
  GO:0007165    0.75    0.76               0.41    0.30    0.21
  GO:0006950    0.69    0.65               0.41    0.27    0.24
  GO:0006810    0.72    0.83               0.56    0.55    0.49
  GO:0008219    0.75    0.58               0.07    0.06    0.02
  Average       0.74    0.69               0.40    0.33    0.25

Table 2: Comparing F-values for GO subontologies

  Subontology           GEANN   Izumitani et al.
  Biological Process    0.66    0.60
  Molecular Function    0.66    0.72
  Cellular Location     0.64    0.58
  Average               0.66    0.63

Observation 6: Although GEANN does not rely on the strong assumption that genes need to be associated with a set of papers, and provides annotation predictions at a finer granularity with much smaller training data, it is still comparable to or better than the other systems in terms of accuracy.
5.2.4. Contributions of Extended Patterns

Finally, we evaluated the effects of extended patterns. The experiments were conducted by first utilizing extended patterns, and then without using extended patterns.

Observation 7: The use of extended patterns improves the precision by as much as 6.3% (GO:0005198). However, as the average improvement is quite small (0.2%), we conclude that the contribution of the extended patterns is unpredictable. We observe that extended patterns have a localized effect which does not necessarily apply in every case. Furthermore, since we only use paper abstracts, it is not very likely to find long descriptions that match the extended patterns.
6. Related Work

The second task of the BioCreAtIvE challenge involves extracting the annotation phrases given a paper and a protein. Most of the evaluated systems had low precision (46% for the best performing system) [15]. We are planning to participate in this assessment challenge in the near future. Raychaudhuri et al. [16] and Izumitani et al. [17] classify the documents, hence the genes that are associated with the documents, into GO terms. As discussed above, even though GEANN is more flexible in terms of its assumptions, its performance is still comparable to these systems. Koike et al. [19] employ actor-object relationships from the NLP perspective. This system is optimized for the biological process subontology, and it requires human input and manually created patterns. Fleischman and Hovy [8] present a supervised learning method which is similar to our flexible pattern approach in that it uses WordNet. However, we use significant terms to construct additional patterns so that we can locate additional semantic structures, while their work only considers the target instance as the base of its patterns. Riloff [10] proposes a technique to extract patterns that ignores the semantic side of the patterns. In addition, the patterns are strict in that they require word-by-word exact matching. Brin's DIPRE [11] takes an initial set of seed elements as input, and uses the seed set to extract patterns by analyzing the occurrences of seed instances in web documents. SNOWBALL [12] extends DIPRE's pattern extraction system by introducing the use of named-entity tags. Etzioni et al. developed a web information extraction system, KnowItAll [13], to automate the discovery of large collections of facts in web pages, which assumes redundancy of information on the web.

7. Conclusions and Future Work

In this paper, we have explored a new methodology to automatically infer new GO annotations for genes and gene products from biomedical paper abstracts. We have developed GEANN, which utilizes existing annotation information to
construct textual extraction patterns characterizing an annotation with a specific GO concept. Exploring the accuracy of different semantic similarity measures for WordNet, disambiguation of genes that share a synonym, and determining scoring weight parameters experimentally are among the future tasks.

Acknowledgments

This research is supported in part by the NSF award DBI-0218061, a grant from the Charles B. Wang Foundation, and a Microsoft equipment grant.

References

1. The Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32, D258-D261, 2004.
2. PathCase, available at http://nashua.case.edu/pathways
3. PubMed, available at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
4. Salton, G. Automatic Text Processing. Addison-Wesley, 1989.
5. Fellbaum, C. An Electronic Lexical Database. Cambridge, MA. MIT Press, 1998.
6. Mann, G. Fine-Grained Proper Noun Ontologies for Question Answering. SemaNet, 2002.
7. Lovasz, L. Matching Theory. North-Holland, New York, 1986.
8. Fleischman, M., Hovy, E. Fine Grained Classification of Named Entities. COLING 2002.
9. Han, J., Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
10. Riloff, E. Automatically Generating Extraction Patterns from Untagged Text. AAAI/IAAI, 1996.
11. Brin, S. Extracting Patterns and Relations from the World Wide Web. WebDB 1998.
12. Agichtein, E., Gravano, L. Snowball: Extracting Relations from Large Plain-Text Collections. ACM DL 2000.
13. Etzioni, O. et al. Web-Scale Information Extraction in KnowItAll. WWW 2004.
14. Extended version of the paper available at: http://cakmak.case.edu/TechReports/GEANNExtended.pdf
15. Blaschke, C., Leon, E.A., Krallinger, M., Valencia, A. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics. 2005.
16. Raychaudhuri, S. et al. Associating genes with Gene Ontology codes using a maximum entropy analysis of biomedical literature. Genome Res., 12(1):203-214.
17. Izumitani, T. et al. Assigning Gene Ontology Categories (GO) to Yeast Genes Using Text-Based Supervised Learning Methods. CSB 2004.
18. Imataka, H., Olsen, H., Sonenberg, N. A new translational regulator with homology to eukaryotic translation initiation factor 4G. EMBO J. 1997.
19. Koike, A., Niwa, Y., Takagi, T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2005.
20. Settles, B. ABNER: An open source tool for automatically tagging genes, proteins, and other entity names in text. Bioinformatics, 2005.
21. Ogren, P. et al. The Compositional Structure of Gene Ontology Terms. PSB 2004.
22. WordNet Semantic Similarity Open Source Library, http://www.codeproject.com/useritems/semanticsimilaritywordnet.asp
A FAULT MODEL FOR ONTOLOGY MAPPING, ALIGNMENT, AND LINKING SYSTEMS

HELEN L. JOHNSON, K. BRETONNEL COHEN, AND LAWRENCE HUNTER

Center for Computational Pharmacology, School of Medicine, University of Colorado
Aurora, CO, 80045 USA
E-mail: {Helen.Johnson, Kevin.Cohen, Larry.Hunter}@uchsc.edu

There has been much work devoted to the mapping, alignment, and linking of ontologies (MALO), but little has been published about how to evaluate systems that do this. A fault model for conducting fine-grained evaluations of MALO systems is proposed, and its application to the system described in Johnson et al. [15] is illustrated. Two judges categorized errors according to the model, and inter-judge agreement was calculated by error category. Overall inter-judge agreement was 98% after dispute resolution, suggesting that the model is consistently applicable. The results of applying the model to the system described in [15] reveal the reason for a puzzling set of results in that paper, and also suggest a number of avenues and techniques for improving the state of the art in MALO, including the development of biomedical domain-specific language processing tools, filtering of high-frequency matching results, and word sense disambiguation.
1. Introduction

The mapping, alignment, and/or linking of ontologies (MALO) has been an area of active research in recent years [4,28]. Much of that work has been groundbreaking, and has therefore been characterized by the lack of standardized evaluation metrics that is typical of exploratory work in a novel domain. In particular, this work has generally reported coarse metrics, accompanied by small numbers of error exemplars. However, in similar NLP domains, finer-grained analyses provide system builders with insight into how to improve their systems, and users with information that is crucial for interpreting their results [23,14,8]. MALO is a critical aspect of the National Center for Biomedical Ontology/Open Biomedical Ontologies strategy of constructing multiple orthogonal ontologies, but such endeavors have proven surprisingly difficult: Table 1 shows the results of a representative linking system, whose correctness ranged as low as 60.8% overall when aligning the BRENDA Tissue ontology with the Gene Ontology [15]. This paper proposes a fault model for evaluating lexical techniques in MALO systems, and applies it to the output of the system described in
Johnson et al. [15]. The resulting analysis illuminates reasons for differences in performance of both the lexical linking techniques and the ontologies used. We suggest concrete methods for correcting errors and advancing the state of the art in the mapping, alignment, and/or linking of ontologies. Because many techniques used in MALO are also applied in text categorization and information retrieval, the findings are also useful to researchers in those areas.

Previous lexical ontology integration research deals with false positive error analysis by briefly mentioning causes of those errors, as well as some illustrative examples, but provides no further analysis. Bodenreider et al. mention some false positive alignments but offer no evaluations [3]. Burgun et al. assert that including synonyms of under three characters, substring matching, and case-insensitive matching are contributors to false positive rates and thus are not used in their linking system [5]. They report that term polysemy from different ontologies contributes to false positive rates, but do not explain the magnitude of the problem. Zhang et al. report a multi-part alignment system but do not discuss errors from the lexical system at all [29]. Lambrix et al. report precision from 0.285-0.875 on a small test set for their merging system, SAMBO, which uses n-grams, edit distance, WordNet, and string matching. WordNet polysemy and the n-gram matching method apparently produce 12.5% and 24.3% false positive rates, respectively [17,16]. Lambrix and Tan state that the same alignment systems produce different results depending on the ontology used; they give numbers of wrong suggestions but little analysis [18]. For a linking system that matches entities with and without normalization of punctuation, capitalization, stop words, and genitive markers, Sarkar et al. report, without examples, a 4-5% false positive rate [26]. Luger et al. present a structurally verified lexical mapping system in which contradictory mappings occur at certain thresholds, but no examples or analyses are given [20]. Mork et al. introduce an alignment system with a lexical component but do not detail its performance [22]. Johnson et al. provide error counts sorted by search type and ontology but provide no further analysis [15]. Their system's performance for matching BRENDA terms to GO is particularly puzzling because correctness rates of up to 100% are seen with some ontologies, but correctness for matching BRENDA is as low as 7% (see Table 1).

There has been no comprehensive evaluation of errors in lexical MALO systems. This leaves unaddressed a number of questions with real consequences for MALO system builders: What types of errors contribute to reduced performance? How much do they contribute to error rates? Are there scalable techniques for reducing errors without adversely impacting recall? Here we address these questions by proposing a fault model for false-positive errors in MALO systems, providing an evaluation of the errors produced by a biomedical ontology linking system, and suggesting
Table 1. Correctness rates for the ontology linking system described in Johnson et al. (2006). The three OBO ontologies listed in the left column were linked to the GO via the three lexical methods in the right columns.

  Ontology    Overall   Exact             Synonyms           Stemming
  ChEBI       84.2%     98.3% (650/661)   60.0% (180/300)    73.5% (147/200)
  Cell Type   92.9%     99.3% (431/434)   73.0% (65/89)      83.8% (88/105)
  BRENDA      60.8%     84.5% (169/200)   76.0% (152/200)    11.0% (22/200)
methods to reduce errors in MALO.

2. Methods

2.1. The ontology linking method in Johnson et al. (2006)
Since understanding the methodology employed in Johnson et al. is important to understanding the analysis of its errors, we review that methodology briefly here. Their system models inter-ontology relationship detection as an information retrieval task, where a relationship is defined as any direct or indirect association between two ontological concepts. Three OBO ontologies' terms (BRENDA Tissue, ChEBI, and Cell Type) are searched for in GO terms [9,27,11,1]. Three types of searches are performed: (a) exact match to the OBO term, (b) the OBO term and its synonyms, and (c) the stemmed OBO term. The stemmer used in (c) was an implementation of the Porter stemmer provided with the Lucene IR library [13,25]. Besides stemming, this implementation also reduces characters to lower case, tokenizes on whitespace, punctuation and digits (removing the latter two), and removes a set of general English stop words. The output of the system is pairs of concepts: one GO concept and one OBO concept. To determine the correctness of the proposed relationships, a random sample of the output (2,389 pairs) was evaluated by two domain experts who answered the question: Is this OBO term the concept that is being referred to in this GO term/definition? Inter-annotator agreement after dispute resolution was 98.2% (393/400). The experts deemed 481 relations to be incorrect, making for an overall estimated system error rate of 20%. All of the system outputs (correct, incorrect, and unjudged) were made publicly available at compbio.uchsc.edu/dependencies.

2.2. The fault model
In software testing, a fault model is an explicit hypothesis about potential sources of errors in a system [2,8]. We propose a fault model, comprising three broad classes of errors (see Table 2), for the lexical components of MALO systems. The three classes of errors are distinguished by whether they are due to inherent properties of the ontologies themselves, are due to the processing techniques that the system builders apply, or are due to
including inappropriate metadata in the data that is considered for locating relationships. The three broad classes are further divided into more specific error types, as described below.

Errors in the lexical ambiguity class arise because of the inherent polysemy of terms in multiple ontologies (and in natural language in general) and from ambiguous abbreviations (typically listed as synonyms in an ontology). Errors in the text processing class come from manipulations performed by the system, such as the removal of punctuation, digits, or stop words, or from stemming. Errors in metadata matching occur when elements in one ontology match metadata in another ontology, e.g. references to sources that are found at the end of GO definitions.

To evaluate whether or not the fault model is consistently applicable, two authors independently classified the 481 incorrect relationships from the Johnson et al. system into nine fine-grained error categories (the seven categories in the model proposed here, plus two additional categories, discussed below, that were rejected). The model allows for the assignment of multiple categories to a single output. For instance, the judges determined that CH:29356 oxide(2-) erroneously matched GO:0019417 sulfur oxidation due to both character removal during tokenization ((2-) was deleted) and to stemming (the remaining oxide and oxidation both stemmed to oxid). Detailed explanations of the seven error categories, along with examples of each, are given below.(a)

(a) In all paired concepts in our examples, BTO = BRENDA Tissue Ontology, CH = ChEBI Ontology, CL = Cell Type Ontology, and GO = Gene Ontology. Underlining indicates the portion of GO and OBO text that matches, thereby causing the linking system to propose that a relationship exists between the pair.

3. Results

Table 2 displays the counts and percentages of each type of error, with inter-judge agreement (IJA) for each category. Section 3.1 discusses inter-judge agreement and the implications that low IJA has for the fault model. Sections 3.2-3.3 explain and exemplify the categories of the fault model, and section 3.4 describes the distribution of error types across orthogonal ontologies.
agreement
Inter-judge agreement with respect to the seven final error categories in the fault model is shown in Table 2. Overall IJA was 95% before dispute resolution and 99% after resolution. In the 1% of cases where the judges did not agree after resolution, the judge who was most familiar with the data assigned the categories. The initial fault model had two error categories that were eliminated from the final model because of low IJA. The first category, tokenization, had an abysmal 27% agreement rate even after dispute resolution. The second eliminated category, general English polysemy, had 80% a
pre-resolution agreement and 94% post-resolution agreement, with only 10 total errors assigned to this category. Both judges felt that all errors in this category could justifiably be assigned to the biological polysemy category; therefore, this category is not included in the final fault model.

Table 2. The fault model and results of its application to Johnson et al.'s erroneous outputs. The rows in bold are the subtotaled percentages of the broad categories of errors in relation to all errors. The non-bolded rows indicate the percentages of the subtypes of errors in relation to the broad category that they belong to. The counts for the subtypes of text processing errors exceed the total text processing count because multiple types of text processing errors can contribute to one erroneously matched relationship.

  Type of error                Percent   Count       IJA (pre-resolution)   IJA (post-resolution)
  Lexical ambiguity errors
    biological polysemy        56%       (105/186)   86%                    98%
    ambiguous abbreviation     44%       (81/186)    96%                    99%
  Lexical Ambiguity Total      38%       (186/481)
  Text processing errors
    stemming                   6%        (29/449)    100%                   100%
    digit removal              51%       (231/449)   100%                   100%
    punctuation removal        27%       (123/449)   100%                   100%
    stop word removal          14%       (65/449)    99%                    100%
  Text Processing Total        60%       (290/481)
  Matched Metadata Total       1%        (5/481)     100%                   100%
  Total                        99%       (481/481)   95%                    99%

3.2. Lexical ambiguity errors
Lexical ambiguity refers to words that denote more than one concept. It is a serious issue when looking for relationships between domain-distinct ontologies [10:1429]. Lexical ambiguity accounted for 38% of all errors.

Biological polysemy occurs when a term that is present in two ontologies denotes distinct biological concepts. It accounted for 56% of all lexical ambiguity errors. Examples of biological polysemy include (1-3) below. Example (1) shows a polysemous string that is present in two ontologies.

(1) BTO:0000280  cone
    def: A mass of ovule-bearing or pollen-bearing scales or bracts in trees of the pine family or in cycads that are arranged usually on a somewhat elongated axis.
    GO:0042676  cone cell fate commitment
    def: The process by which a cell becomes committed to become a cone cell.
OBO terms have synonyms, some of which polysemously denote concepts that are more general than the OBO term itself, and hence match GO concepts that are not the same as the OBO term. Examples (2) and (3) show lexical ambiguity arising because of the OBO synonyms.
(2) BTO:0000131  blood plasma
    synonym: plasma
    def: The fluid portion of the blood in which the particulate components are suspended.
    GO:0046759  lytic plasma membrane viral budding
    def: A form of viral release in which the nucleocapsid evaginates from the host nuclear membrane system, resulting in envelopment of the virus and cell lysis.

(3) CH:17997  dinitrogen
    synonym: nitrogen
    GO:0035243  protein-arginine omega-N symmetric methyltransferase activity
    def: ... Methylation is on the terminal nitrogen (omega nitrogen) ...
Example (4) shows that, by the same synonymy mechanism, terms from different taxa match erroneously.

(4) CL:0000338  neuroblast (sensu Nematoda and Protostomia)
    synonym: neuroblast
    GO:0043350  neuroblast proliferation (sensu Vertebrata)
Ambiguous abbreviation errors happen when an abbreviation in one ontology matches text in another that does not denote the same concept. The ambiguity of abbreviations is a well-known problem in biomedical text [7,6]. In the output of [15] it is the cause of 43% of all lexical ambiguity errors. The chemical ontology includes many one- and two-character symbols for elements (e.g. C for carbon, T for thymine, As for arsenic, and At for astatine). Some abbreviations are overloaded even within the chemical domain. For example, in ChEBI C is listed as a synonym for three chemical entities besides carbon, viz. L-cysteine, L-cysteine residue, and cytosine. So, single-character symbols match many GO terms, but with a high error rate. Examples (5) and (6) illustrate such errors.

(5) CH:17821  thymine
    synonym: T
    GO:0043377  negative regulation of CD8-positive T cell differentiation
One- and two-character abbreviations sometimes also match closed-class or function words, such as a or in, as illustrated in example (6).

(6) CH:30430  indium
    synonym: In
    GO:0046465  dolichyl diphosphate metabolism
    def: ... In eukaryotes, these function as carriers of ...

3.3. Text processing errors
As previously mentioned, Johnson et al.'s system uses a stemmer that requires lower-case text input. The system performs this transformation with a Lucene analyzer that splits tokens on non-alphabetic characters, then removes digits and punctuation, and removes stop words. This transformed text is then sent to the stemmer. Example (7) illustrates a ChEBI term and a GO term, and the search and match strings that are produced by the stemming device.
(7)               Original text          Tokenized/stemmed text
    CH:32443      L-cysteinate(2-)       l cystein
    GO:0018118    peptidyl-L-cysteine    peptidyl l cystein ...
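The transformation described above, and the collisions it produces, can be reproduced approximately with a few lines of Python; the stop-word list here is an illustrative subset, and NLTK's Porter stemmer stands in for the Lucene implementation, so outputs may differ in edge cases.

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "of", "in", "a", "an", "and", "or", "to"}  # illustrative subset
stemmer = PorterStemmer()

def analyze(term):
    # Lower-case, split on non-alphabetic characters (discarding digits and
    # punctuation), drop stop words, then Porter-stem each surviving token.
    tokens = re.split(r"[^a-z]+", term.lower())
    return [stemmer.stem(t) for t in tokens if t and t not in STOP_WORDS]

print(analyze("L-cysteinate(2-)"))      # ['l', 'cystein']
print(analyze("peptidyl-L-cysteine"))   # ['peptidyl', 'l', 'cystein']
# The BRENDA cell-line problem: digit removal collapses distinct terms.
print(analyze("T-24 cell") == analyze("T cell"))   # True
```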
Errors arise from the removal of digits and punctuation, the removal of stop words, and the stemming process itself (see Table 2). These are illustrated in examples (8-16). Few errors resulting from text processing can be attributed to a single mechanism. Digit removal is the largest contributor among the text processing error types, constituting 51% of the errors. Punctuation removal is responsible for 27% of the errors. These are illustrated in examples (8-10).

(8) CL:0000624  CD4 positive T cell
    GO:0043378  positive regulation of CD8-positive T cell differentiation

(9) CH:20400   4-hydroxybutanal
    GO:0004409  homoaconitate hydratase activity
    def: Catalysis of the reaction: 2-hydroxybutane-1,2,4-tri ...

(10) CH:30509   carbon (1+)
     GO:0018492  carbon-monoxide dehydrogenase (acceptor) activity
Six percent of the errors involve the stemming mechanism. (This is somewhat surprising, since the Porter stemmer has been independently characterized as being only moderately aggressive [12].)

Table 3. Counts of correct and incorrect relationships that resulted after the stemming mechanism was applied.

  Matches     -al   -ate   -ation   -e   -ed   -ic   -ing   -ize   -ous   -s
  Correct     19    1      2        12   0     11    0      0      2      157
  Incorrect   1     17     3        26   3     2     4      1      0      39
Of the 580 evaluated relationships that were processed by the stemming mechanism in the original linking system, 43% (253/580) match because of the stemming applied. Of those, 73% (185/253) are correct relationships; 27% (68/253) are incorrect. Table 3 displays a list of all suffixes that were removed during stemming and the counts of how many times their removal resulted in a correct or an incorrect match. Examples (11-13) display errors due to stemming:

(11) CH:25741   oxides
     GO:0016623  oxidoreductase activity, acting on the aldehyde or oxo ...
     def: Catalysis of an oxidation-reduction (redox) reaction ...

(12) CH:25382   monocarboxylates
     GO:0015718  monocarboxylic acid transport
     def: The directed movement of monocarboxylic acids into ...

(13) CH:32530   histidinate(2-)
     GO:0019558  histidine catabolism to 2-oxoglutarate
While stemming works most of the time to improve recall (the count of correct matches in Table 3 is more than double the count of incorrect matches, 204 versus 96), an analysis of the errors shows that in this data there is a subset of suffixes that do not stem well from biomedical terms, at least in these domains. Removal of -e results in incorrect matches far more often than it results in correct matches, and removal of -ate almost never results in a correct match. These findings illustrate the need for a domain-specific stemmer for biomedical text. Finally, stop word removal contributed 14% of the error rate. Examples like (14-16) are characteristic:

(14)
CL 0000197: receptor cell
GO 0030152: bacteriocin biosynthesis, def: "... at specific receptors on the cell surface"

(15)
CH 25051: lipid As
GO 0046834: lipid phosphorylation

(16)
CH 29155: His-tRNA(His)
GO 0050562: lysine-tRNA(Pyl) ligase activity
3.4. Applying the fault model to orthogonal ontologies
The fault model that this paper proposes explains the patterns observed in the Johnson et al. work. They report an uneven distribution of accuracy rates across the ontologies (see Table 1); Table 4 shows that this corresponds to an uneven distribution of the error types across ontologies. Most striking is that ChEBI is especially prone to ambiguous abbreviation errors, which were entirely absent with the other two ontologies. BRENDA is prone to deletion-related errors; in fact, over half of the errors in the text processing error category are due to a specific type of term in BRENDA (169/290). These terms have the structure X cell, where X is any combination of capital letters, digits, and punctuation, such as B5/589 cell, T-24 cell, and 697 cell. The search strings rendered from these after the deletions (B cell, T cell, and cell, respectively) match promiscuously to GO (see Figure 1). Biological polysemy errors are a problem in all three ontologies. Sixty-four percent of the errors for Cell Type were related to polysemy, 20% in BRENDA, and 12% in ChEBI. Dealing with word sense disambiguation could yield a huge improvement in performance for these ontologies. None of this error type distribution is apparent from the original data reported in [15], and all of it suggests specific ways of addressing the errors in aligning these ontologies with GO.

4. Fault-driven analysis suggests techniques for improving MALO

Part of the value of the fault model is that it suggests scalable methods for reducing the false positive error rate in MALO without adversely affecting recall. We describe some of them here.
Table 4. Distribution of error types across ontologies.

Ontology    Biological  Abbreviation  Deletion of:               Stemming  Totals
            polysemy    ambiguity     digit   punct.  stopword
BRENDA      84          0             187     89      54         2         416
Cell Type   29          0             9       0       7          0         45
ChEBI       26          81            35      34      4          27        207
Figure 1. A few terms from BRENDA caused a large number of errors. (The plot shows the number of false-positive matches per term; the labelled outliers include the BRENDA terms BY-2 cell, blood plasma, and T-84 cell; x-axis: Number of Terms.)
4.1. Error reduction techniques related to text processing
Johnson et al. reported exceptionally low accuracy for BRENDA relationships based on stemming: only 7-15% correctness. Our investigation suggests that this low accuracy is due to a misapplication of an out-of-the-box Lucene implementation of the Porter stemmer: it deletes all digits, which occur in BRENDA cell line names, leading to many false-positive matches against GO concepts containing the word cell. Similarly, bad matches between ChEBI chemicals and the GO (73-74% correctness rate) occur because of digit and punctuation removal. This suggests that a simple change to the text processing procedures could lower the error rate dramatically.
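One such change is a tokenizer that keeps digits instead of deleting them; the sketch below is a hypothetical illustration, not the system's actual Lucene analyzer.

```python
import re

def tokenize_deleting_digits(term):
    # behaviour analogous to the original analyzer: digits and punctuation vanish
    return [t for t in re.split(r"[^a-z]+", term.lower()) if t]

def tokenize_keeping_digits(term):
    # proposed change: keep alphanumeric runs, so cell line names stay distinctive
    return [t for t in re.split(r"[^a-z0-9]+", term.lower()) if t]

print(tokenize_deleting_digits("T-24 cell"))  # ['t', 'cell'] -- collides with GO terms containing "cell"
print(tokenize_keeping_digits("T-24 cell"))   # ['t', '24', 'cell'] -- no longer a spurious match
```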
4.2. Error reduction techniques related to ambiguity
For ontologies with error patterns like ChEBI and BRENDA, excluding synonyms shorter than three characters would be beneficial. For example, Bodenreider and Burgun excluded synonyms shorter than three characters [5]. Length-based filtering of search candidates has been found useful for other tasks in this domain, such as entity identification and normalization of Drosophila genes in text [21]. Numerous techniques have been proposed for resolving word sense ambiguities [24]. The OBO definitions may prove to be useful resources for knowledge-based ontology term disambiguation [19].
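A length-based synonym filter of this kind is trivial to implement; the following sketch is illustrative only, with an assumed three-character threshold and made-up synonym lists.

```python
def filter_short_synonyms(synonyms, min_length=3):
    """Drop synonyms shorter than min_length characters (e.g. 'C', 'T', 'As'),
    which are the main source of ambiguous-abbreviation matches."""
    return [s for s in synonyms if len(s) >= min_length]

chebi_synonyms = ["carbon", "C", "L-cysteine", "thymine", "T", "arsenic", "As"]
print(filter_short_synonyms(chebi_synonyms))
# ['carbon', 'L-cysteine', 'thymine', 'arsenic']
```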
4.3. Error reduction by filtering high-error contributors
The Zipf-like distribution of error counts across terms (see Figure 1) suggests that filtering a small number of terms would have a beneficial effect on the error rates due to both text processing and ambiguity-related errors. This filtering could be carried out in post-processing, by setting a threshold for matching frequency or for matching rank. Alternatively, it could be carried out in a pre-processing step by including high-frequency tokens in the stop list. This analysis would need to be done on an ontology-by-ontology basis, but neither method requires expert knowledge to execute the filtering process. As an example of the first procedure, removing the top contributors to false-positive matches in each ontology would yield the results in Table 5.

Table 5. Effect of filtering high-frequency match terms.

Ontology    Terms removed                                   Increase in correctness   Decrease in matches
BRENDA      697 cell, BY-2 cell, blood plasma, T-84 cell    27%                       41%
Cell Type   band form neutrophil, neuroblast                4%                        3%
ChEBI       iodine, L-isoleucine residue, groups            2%                        2%
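A post-processing filter of the kind described could look like the following sketch; the match-count threshold is an assumed example value, not one derived from the data above.

```python
from collections import Counter

def filter_promiscuous_terms(matches, max_matches_per_term=10):
    """matches: list of (source_term, go_term) pairs proposed by the linker.
    Terms whose match count exceeds the threshold (the Zipf-like tail in
    Figure 1) are dropped wholesale; no expert knowledge is required."""
    counts = Counter(src for src, _ in matches)
    return [(src, go) for src, go in matches if counts[src] <= max_matches_per_term]
```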
5. Conclusion

The analysis presented in this paper supports the hypotheses that it is possible to build a principled, data-driven fault model for MALO systems; that the model proposed can be applied consistently; that such a model reveals previously unknown sources of system errors; and that it can lead directly to concrete suggestions for improving the state of the art in ontology alignment. Although the fault model was applied to the output of only one linking system, that system included linking data between four orthogonal ontologies. The model proved effective at elucidating the distinct causes of errors in linking the different ontologies, as well as the puzzling case of BRENDA. A weakness of the model is that it addresses only false-positive errors; evaluating failures of recall is a thorny problem that deserves further attention. Based on the descriptions of systems and false positive outputs of related work, it seems that the fault model presented in this work could be applied to the output of many other systems, including at least [3,5,16,17,18,26,20,22,29]. Note that in the data that was examined in this paper, the distribution of error types was quite different across not just lexical techniques, but across ontologies as well. This reminds us that specific categories in the model may not be represented in the output of
all systems applied to all possible pairs of ontologies, and that there may be other categories of errors that were not reflected in the data that was available to us. For example, the authors of the papers cited above have reported errors due to case folding, spelling normalization, and word order alternations that were not detected in the output of Johnson et al.'s system. However, the methodology that the present paper illustrates, i.e., combining the software testing technique of fault modelling with an awareness of linguistic factors, should be equally applicable to any lexically-based MALO system. Many of the systems mentioned in this paper also employ structural techniques for MALO. These techniques are complementary to, not competitive with, lexical ones. The lexical techniques can be evaluated independently of the structural ones; a similar combination of the software testing approach with awareness of ontological/structural issues may be applicable to structural techniques. We suggest that the quality of future publications in MALO can be improved by discussing error analyses with reference to this model or very similar ones derived via the same techniques.

6. Acknowledgments
The authors gratefully acknowledge the insightful comments of the three anonymous PSB reviewers, and thank Michael Bada for helpful discussion and Todd A. Gibson and Sonia Leach for editorial assistance. This work was supported by NIH grant R01-LM008111 (LH).

References
1. J. Bard, S. Y. Rhee, and M. Ashburner. An ontology for cell types. Genome Biol, 6(2), 2005.
2. R. V. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. Addison-Wesley Professional, October 1999.
3. O. Bodenreider, T. F. Hayamizu, M. Ringwald, S. De Coronado, and S. Zhang. Of mice and men: aligning mouse and human anatomies. AMIA Annu Symp Proc, pages 61-65, 2005.
4. O. Bodenreider, J. A. Mitchell, and A. T. McCray. Biomedical ontologies: Session introduction. In Pac Symp Biocomput, 2003, 2004, 2005.
5. A. Burgun and O. Bodenreider. An ontology of chemical entities helps identify dependence relations among Gene Ontology terms. In Proc SMBM, 2005.
6. J. Chang and H. Schütze. Abbreviations in biomedical text. In S. Ananiadou and J. McNaught, editors, Text mining for biology and biomedicine, pages 99-119. Artech House, 2006.
7. J. T. Chang, H. Schütze, and R. B. Altman. Creating an online dictionary of abbreviations from MEDLINE. J Am Med Inform Assoc, 9(6):612-620, 2002.
8. K. B. Cohen, L. Tanabe, S. Kinoshita, and L. Hunter. A resource for constructing customized test suites for molecular biology entity identification systems. BioLINK 2004, pages 1-8, 2004.
9. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet, 25(1):25-29, 2000.
10. The Gene Ontology Consortium. Creating the Gene Ontology resource: design and implementation. Genome Research, 11:1425-1433, 2001.
11. K. Degtyarenko. Chemical vocabularies and ontologies for bioinformatics. In Proc 2003 Intl Chem Info Conf, 2003.
12. D. Harman. How effective is suffixing? J Am Soc Info Sci, 42(1):7-15, 1991.
13. E. Hatcher and O. Gospodnetic. Lucene in Action (In Action series). Manning Publications, 2004.
14. L. Hirschman and I. Mani. Evaluation. In R. Mitkov, editor, Oxford handbook of computational linguistics, pages 414-429. Oxford University Press, 2003.
15. H. L. Johnson, K. B. Cohen, W. A. Baumgartner, Z. Lu, M. Bada, T. Kester, H. Kim, and L. Hunter. Evaluation of lexical methods for detecting relationships between concepts from multiple ontologies. Pac Symp Biocomput, pages 28-39, 2006.
16. P. Lambrix and A. Edberg. Evaluation of ontology merging tools in bioinformatics. Pac Symp Biocomput, pages 589-600, 2003.
17. P. Lambrix, A. Edberg, C. Manis, and H. Tan. Merging DAML+OIL bio-ontologies. In Description Logics, 2003.
18. P. Lambrix and H. Tan. A framework for aligning ontologies. In PPSWR, pages 17-31, 2005.
19. M. Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In SIGDOC '86: Proceedings of the 5th annual international conference on systems documentation, pages 24-26, New York, NY, USA, 1986. ACM Press.
20. S. Luger, S. Aitken, and B. Webber. Automated terminological and structural analysis of human-mouse anatomical ontology mappings. BMC Bioinformatics, 6(Suppl. 3), 2005.
21. A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh, and J. B. Colombe. Gene name identification and normalization using a model organism database. J Biomedical Informatics, 37(6):396-410, 2004.
22. P. Mork, R. Pottinger, and P. A. Bernstein. Challenges in precisely aligning models of human anatomy using generic schema matching. MedInfo, 11(Pt 1):401-405, 2004.
23. S. Oepen, K. Netter, and J. Klein. TSNLP - Test suites for natural language processing. In Linguistic Databases. CSLI Publications, 1998.
24. T. Pedersen and R. Mihalcea. Advances in word sense disambiguation. In Tutorial, Conf of ACL, 2005.
25. M. Porter. An algorithm for suffix stripping. Program, 14:130-137, 1980.
26. I. N. Sarkar, M. N. Cantor, R. Gelman, F. Hartel, and Y. A. Lussier. Linking biomedical language information and knowledge resources: GO and UMLS. Pac Symp Biocomput, pages 439-450, 2003.
27. I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res, 32(Database issue), 2004.
28. P. Shvaiko and J. Euzenat. A survey of schema-based matching approaches. Journal on Data Semantics, 4, 2005.
29. S. Zhang and O. Bodenreider. Aligning representations of anatomy using lexical and structural methods. AMIA Annu Symp Proc, pages 753-757, 2003.
INTEGRATING NATURAL LANGUAGE PROCESSING WITH FLYBASE CURATION
NIKIFOROS KARAMANIS*, IAN LEWIN*, RUTH SEAL†, RACHEL DRYSDALE† AND EDWARD BRISCOE*
Computer Laboratory* and Department of Genetics†, University of Cambridge
E-mail for correspondence: [email protected]

Applying Natural Language Processing techniques to biomedical text as a potential aid to curation has become the focus of intensive research. However, developing integrated systems which address the curators' real-world needs has been studied less rigorously. This paper addresses this question and presents generic tools developed to assist FlyBase curators. We discuss how they have been integrated into the curation workflow and present initial evidence about their effectiveness.
1. Introduction

The number of papers published each year in fields such as biomedicine is increasing exponentially [1,2]. This growth in literature makes it hard for researchers to keep track of information, so progress often relies on the work of professional curators. These are specialised scientists trained to identify and extract prespecified information from a paper to populate a database. Although there is already a substantial literature on applying Natural Language Processing (NLP) techniques to the biomedical domain, how the output of an NLP system can be utilised by the intended user has not been studied as extensively [1]. This paper discusses an application developed under a user-centered approach which presents the curators with the output of several NLP processes to help them work more efficiently. In the next section we discuss how observing curators at work motivates our basic design criteria. Then, we present the tool and provide an overview of the NLP processes behind it as well as of the customised curation editor we developed following the same principles. Finally, we discuss how these applications have been incorporated into the curation workflow and present a preliminary study on their effectiveness.

*William Gates Building, Cambridge, CB3 0FD, UK. †Downing Site, Cambridge, CB2 3EH, UK.
Figure 1. (A) Overview of the curation information flow (paper, curator, controlled vocabulary and ontologies, customised editor, post-processor, new record, database). (B) Gene and allele proformae.
2. The FlyBase curation paradigm

The tools presented in this paper have been developed under an approach which actively involves the potential user and consists of iterative cycles of (a) design, (b) system development, and (c) feedback and redesign [3]. The intended users of the system are the members of the FlyBase curation team in Cambridge (currently seven curators). FlyBase (www.flybase.org) is a widely used database of genomic research on the fruit fly. It has been updated with newly curated information since 1992 by teams located in Harvard, Indiana and Berkeley, as well as the Cambridge group. Although the curation paradigm followed by FlyBase is not the only one, it is based on practices developed through years of experience and has been adopted by other curation groups. FlyBase curation is based on a watchlist of around 35 journals. Each curator routinely selects a journal from the list and inspects its latest issue to identify which papers to curate. Curation takes place on a paper-by-paper basis (as opposed to gene-by-gene or topic-by-topic). A simplified view of the curation information flow is shown in Figure 1A. A standard UNIX editor with some customised functions is used to produce a record for each paper. The record consists of several proformae (Figure 1B), one for each significant gene or allele discussed in the paper. Each proforma is made of 33 fields (not all of which are always filled): some fields require rephrasing, paraphrasing and/or summarisation while others record very specific facts using terms from ontologies or a controlled vocabulary. In addition to interacting with the paper, typically viewed in printed form or loaded into a PDF viewer, the curator also needs to access the database
to fill in some fields. This is done via several task-specific scripts which search the database, e.g. for a gene name or a citation identifier. After the record has been completed, it is post-processed automatically to check for inconsistencies and technical errors. Once these have been corrected, it is uploaded to the database. Given that extant information retrieval systems such as MedMiner [4] or Textpresso [5] are devised to support the topic-by-topic curation model in other domains, FlyBase curators are in need of additional technology tailored to their curation paradigm and domain. In order to identify users' requirements more precisely, several observations of curation took place, focussing on the various ways in which the curators interact with the paper: some curators skim through the whole paper first (often highlighting certain phrases with their marker) and then re-read it more thoroughly. Others start curation from a specific section (not necessarily the abstract or the introduction) and then move to another section in search of additional information about a specific concept. The "find" function of the PDF viewer is often used to search for multiple occurrences of the same term. Irrespective of the adopted heuristics, all curators agreed that identifying the sections of the text which contain information relevant to the proforma fields is laborious and time-consuming. Current NLP technology identifies domain-specific names of genes and alleles, as well as relations between them, relatively reliably. However, providing the curator simply with the typical output of several NLP modules is not going to be particularly helpful [1]. Hence, one of our primary aims is to design and implement a system which will not only utilise the underlying NLP processes but also enable the curators to interact with the text efficiently to accurately access segments which contain potentially useful information. Crucially, this is different from providing them with automatically filled information extraction templates and asking them to go back to the text and confirm their validity. This would shift their responsibility to verifying the quality of the NLP output. Instead, we want to develop a system in which the curators maintain the initiative following their preferred style but are usefully assisted by software adapted to their work practices. Records are highly structured documents, so we additionally aimed to develop, using the same design principles, an enhanced editing tool sensitive to this structure in order to speed up navigation within a record too. This paper presents the tools we developed based on these premises. We anticipate that our work will be of interest to other curation groups following the paper-by-paper curation paradigm.
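As a rough sketch only (not FlyBase's actual record format), a per-paper record of proformae could be represented as follows; the field codes follow Figure 1B, and the identifier shown is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Proforma:
    kind: str                                      # "gene" or "allele"
    fields: dict = field(default_factory=dict)     # e.g. {"G1a": "zen"}

@dataclass
class Record:
    paper_id: str                                  # one record per curated paper
    proformae: list = field(default_factory=list)  # one proforma per significant gene/allele

record = Record(paper_id="example-paper")          # hypothetical identifier
record.proformae.append(Proforma(kind="gene", fields={"G1a": "zen"}))
```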
3. PaperBrowser

PaperBrowser (a "rich content" browser built on top of the Mozilla Gecko engine and JREX; see www.mozilla.org for more details) presents the curator with an enhanced display of the text in which words automatically recognised as gene names are highlighted in a coloured font (Figure 4A). It enables the curators to quickly scan the whole text by scrolling up and down while their attention is directed to the highlighted names. PaperBrowser is equipped with two navigation panes, called PaperView and EntitiesView, that are organised in terms of the document structure and possible relations between noun phrases, both of which are useful cues for curation [2]. PaperView lists gene names such as "zen" in the order in which they appear in each section (Figure 4B). EntitiesView (Figure 4C) lists groups of words (noun phrases) automatically recognised as referring to the same gene or to a biologically related entity such as "the zen cDNA". The panes are meant not only to provide the curator with an overview of the gene names and the related noun phrases in the paper but also to support focused extraction of information, e.g. when the curator is looking for a gene name in a specific section or tries to locate a noun phrase referring to a certain gene product. Clicking on a node in either PaperView or EntitiesView redirects the text window to the paragraph that contains the corresponding gene name or noun phrase, which is now highlighted in a different colour. The same colour is used to highlight the other noun phrases listed together with the clicked node in EntitiesView. In this way the selected node and all related noun phrases become more visible in the text. The interface allows the curators to mark a text segment as "read" by crossing it out (which is useful when they want to distinguish between the text they have read and what they still need to curate). A "find" function supporting case-sensitive and wrapped search is implemented too. The "Tokens to verify" tab is used to collect feedback about the gene name recogniser in a non-intrusive manner. This tab presents the curator with a short list of words (currently just 10 per paper) for which the recogniser is uncertain whether they are gene names or not. Each name in the list is hyperlinked to the text, allowing the curator to examine it in its context and decide whether it should be marked as a gene or not (by clicking on the corresponding button). Active learning [6] is then used to improve the recogniser's performance on the basis of the collected data.
Figure 2. Paper processing pipeline. (The paper is converted to XML and then to FBXML, i.e. XML containing our own "added value" markup such as anaphoric dependencies, which is what PaperBrowser displays.)
4. Paper Processing Pipeline

In this section we discuss the technology used to produce the XML-based format which is displayed by PaperBrowser. This is a non-trivial task requiring the integration of several components, each addressing different but often inter-related problems, into a unified system. The pipeline in Figure 2 was implemented since it was unclear whether integrating these modules could be readily done within an existing platform such as GATE [7]. The input to the pipeline is the paper in PDF, which is currently the only "standard electronic format" in which all relevant papers are available. This needs to be translated to a format that can be utilised by the deployed NLP modules, but since current PDF-to-text processors are not aware of the typesetting of each journal, text in two columns, footnotes, headers and figure captions tends to be dispersed and mixed up during the conversion. This problem is addressed by the Document Parsing module, which is based on existing software for optical character recognition (OCR) enhanced by templates for deriving the structure of the document [8]. Its output is in a general XML format defined to represent scientific papers. By contrast to standard PDF-to-text processors, the module preserves significant formatting information such as characters in italics and superscripts that may indicate the mention of a gene or an allele respectively. The initial XML is then fed to a module that implements a machine-learning paradigm extending the approach in [9] to identify gene names in the text [10], a task known as Named Entity Recognition (NER). (The NER module may also be fed with papers in XML available from certain publishers.) Then, the RASP parser [11] is employed to identify the boundaries of the noun phrase (NP) around each gene name and its grammatical relations with other NPs in the text.
Table 1. Performance of the modules for Document Parsing, Named Entity Recognition and Anaphora Resolution.

Module                      Recall    Precision    F-score
Named Entity Recognition    82.2%     83.4%        82.8%
Anaphora resolution         75.6%     77.5%        76.5%
Document Parsing            96.2%     97.5%        96.8%
This information is combined with features derived from an ontology to resolve the anaphoric dependencies between NPs [12]. For instance, in the following excerpt: "... is encoded by the gene male specific lethal-1 ... the MSL-1 protein localizes to several sites ... male animals die when they are mutant for msl-1 ...", the NER system recognises "male specific lethal-1" as a gene name. Additionally, the anaphora resolution module identifies the NP "the gene male specific lethal-1" as referring to the same entity as the NP "msl-1" and as being related to the NP "the MSL-1 protein". A version of the paper in FBXML (i.e. our customised XML format) is the result of the whole process and is what PaperBrowser displays. The PaperView navigation pane makes use of the output of the NER system and information about the structure of the paper, while EntitiesView utilises the output of the anaphora resolution module as well. Images, which are very hard to handle by most text processing systems [2] but are particularly important to curators (see next section), are displayed in an extra window (together with their captions, which are displayed in the text too) since trying to incorporate them into the running text was too complex given the information preserved in the OCR output. Following the standard evaluation methodology in NLP, we used collections of texts annotated by domain experts to assess the performance of the NER [10] and the anaphora resolution [12] modules in terms of Recall (correct system responses divided by all human-annotated responses), Precision (correct system responses divided by all system responses) and their harmonic mean (F-score). Both modules achieve state-of-the-art results compared to semi-supervised approaches with similar architectures. The same measures were used to evaluate the document parsing module on an appropriately annotated corpus [8]. Table 1 summarises the results of these evaluations. Earlier versions of the NER and anaphora resolution modules are discussed in [13].
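A possible (assumed) representation of this output, sufficient to drive an EntitiesView-style listing, is sketched below; it is not the actual FBXML mark-up.

```python
# Entity groups for the excerpt above: noun phrases that refer to the same gene,
# plus biologically related phrases, are grouped under the gene name.
entity_groups = {
    "msl-1": {
        "coreferent": ["the gene male specific lethal-1", "msl-1"],
        "related":    ["the MSL-1 protein"],
    }
}

def entities_view(groups):
    """Flatten the groups into the kind of list shown in the EntitiesView pane."""
    for gene, mentions in groups.items():
        for phrase in mentions["coreferent"] + mentions["related"]:
            yield gene, phrase

for gene, phrase in entities_view(entity_groups):
    print(gene, "<-", phrase)
```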
5. ProformaEditor

In order to further support the curation process, we implemented an editing tool called ProformaEditor (Figure 4D). ProformaEditor supports all general and customised functionalities of the editor that it is meant to replace, such as: (a) copying text between fields and from/to other applications such as PaperBrowser, (b) finding and replacing text (enabling case-sensitive search and a replace-all option), (c) inserting an empty proforma, the fields of which can then be completed by the curator, and (d) introducing predefined text (corresponding to FlyBase's controlled vocabulary) to certain fields by choosing from the "ShortCuts" menu. Additionally, ProformaEditor visualises the structure of the record as a tree, enabling the curator to navigate to a proforma by clicking on the corresponding node. Moreover, the fields of subsequent proformae are displayed in different colours so that they can be distinguished more easily. Since the curators do not store pointers to a passage that supports a field entry, finding evidence for that entry in the paper based on what has been recorded in the field is extremely difficult [2]. We address this problem by logging the curator's pasting actions to collect information which will enable us to further enhance the underlying NLP technology, such as: (a) where the pasted text is located in the paper, (b) which field it is pasted to, (c) whether it contains words recognised as gene names or related NPs, and (d) to what extent it is subsequently post-edited by the curator. This data collection also takes place without interfering with curation.

6. Integrating the tools into FlyBase's workflow

After some in-house testing, a curator was asked to produce records for 12 papers from two journals using a prototype version of the tools to which she was exposed for the first time (Curation01). Curation01 initiated our attempt to integrate the tools into FlyBase's workflow. This integration requires substantial effort and often needs to address low-level software engineering issues [14]. Thus, our aims were quite modest: (a) recording potential usability problems and (b) ensuring that the tools do not impede the curator from completing a record in the way that she had been used to. ProformaEditor was judged to be valuable, although a few enhancements were identified, such as the introduction of the "find and replace" function and the "ShortCuts" menu that the curators had in their old editor. Compared to that editor, the curator regarded the visualisation of the record structure as a very useful additional feature. PaperBrowser was tested less extensively during Curation01 due to the
loss of the images during the PDF-to-XML process, which was felt by the curator to be a significant impediment. Although the focus of the project is on text processing, the pipeline and PaperBrowser were adjusted accordingly to display this information. A second curation exercise (Curation02) followed, in which the same curator produced records for 9 additional papers using the revised tools. This time the curator was asked to base the curation entirely on the text as displayed in PaperBrowser and advise the developers of any problems. Soon after Curation02, the curator also produced records for 28 other papers from several journals (Curation03) using ProformaEditor but not PaperBrowser, since these papers had not been processed by the pipeline. Like every other record produced by FlyBase curators, the outputs of all three exercises were successfully post-processed and used to populate the database. Overall, the curator did not consider that the tools have a negative impact on task completion. ProformaEditor became the curator's editor of choice after Curation03 and has been used almost daily since then. The feedback on PaperBrowser included several cases in which identifying passages that provide information about certain genes as well as their variants, products and phenotypes using PaperView and/or EntitiesView was considered to be more helpful than looking at the PDF viewer or a printout. Since the prototype tools were found to be deployable within FlyBase's workflow, we concluded that the aims of this phase had been met. However, the development effort has not been completed, since the curator also noticed that the displayed text carries over errors made by the pipeline modules and pointed out a number of usability problems, on the basis of which a list of prioritised enhancements was compiled. The shortlisted improvements of PaperBrowser include: (a) making tables and captions more easily identifiable, (b) flagging clicked nodes in the navigation panes, and (c) saving text marked-as-read before exiting. We also intend to boost the performance of the pipeline modules using the curator's feedback and equip ProformaEditor with new pasting functionalities which will incorporate FlyBase's term normalisation conventions.

7. A pilot study on usability

This section presents an initial attempt to estimate the curator's performance in each exercise. To the best of our knowledge, although preliminary, this is the first study of this kind relating to scientific article curation. Although the standard NLP metrics in Table 1 do not capture how useful a system actually is in the workplace [1], coming up with a quantitative
measure to assess the curator's performance is not straightforward either. At this stage we decided to use a gross measure by logging the time it took for the curator to complete a record during each curation exercise. This time was divided by the number of proformae in each record to produce an estimate of "curation time per proforma". The data were analysed following the procedure in [15]. Two outliers were identified during the initial exploration of the data and excluded from subsequent analysis.(e) The average time per proforma for each curation exercise using the remaining datapoints is shown in Figure 3A. A one-way ANOVA returned a relatively low probability (F(2,44) = 2.350, p=0.107) and was followed by planned pairwise comparisons between the conditions using the independent-samples two-tailed t-test. Curation01 took approximately 3 minutes and 30 seconds longer than Curation02, which suggests that revising the tools increased the curator's efficiency. This difference is marginally significant (t(44)=2.151, p=0.037), providing preliminary evidence in favour of this hypothesis. Comparing Curation03 with the other conditions suggests that the tools do not impede the curator's performance. In fact, Curation01 took on average about 2 minutes longer than Curation03 (the main difference between them being the use of the revised ProformaEditor during Curation03). The planned comparison shows a trend towards improving curation efficiency with the later version of the tool (t(44)=1.442, p=0.156), although it does not provide conclusive evidence in favour of this hypothesis. The main difference between Curation02 and Curation03 is viewing the paper exclusively on PaperBrowser in Curation02 (as opposed to no use of this tool at all in Curation03).(f) Completing a proforma using PaperBrowser is on average more than one minute and thirty seconds faster. Although the planned comparison shows that the difference is not significant (t(44)=1.1712, p=0.248), this result again indicates that the tool does not have a negative impact on curation.
(e) The first outlier corresponds to the first record ever produced by the curator. This happened while a member of the development team was assisting her with the use of the tools and recording her comments (which arguably delayed the curation process significantly). The logfile for the second outlier, which was part of Curation03, included long periods during which the curator did not interact with ProformaEditor. (f) The version of ProformaEditor was the same in both cases but the curator was more familiar with it during Curation03.
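For illustration, the sketch below computes per-proforma times and one of the planned pairwise comparisons with SciPy; the numbers shown are made up, not the logged curation data.

```python
from scipy import stats

def time_per_proforma(record_seconds, proforma_counts):
    """Divide the total time for each record by its number of proformae."""
    return [t / n for t, n in zip(record_seconds, proforma_counts)]

# Illustrative numbers only -- the real per-record logs are not reproduced here.
curation01 = time_per_proforma([1200.0, 950.0, 1100.0], [2, 1, 2])
curation02 = time_per_proforma([800.0, 880.0, 700.0], [2, 2, 2])

# Independent-samples two-tailed t-test, as in the planned pairwise comparisons.
t, p = stats.ttest_ind(curation01, curation02)
print(f"t = {t:.3f}, p = {p:.3f}")
```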
Figure 3. Results of pilot study on usability.

              (A) Time per proforma              (B) Time per completed field
              Average             St. dev.       Average             St. dev.      Papers
Curation01    631.64s (10m 32s)   192.21s        132.90s (2m 13s)    33.50s        11
Curation02    424.21s (7m 04s)    157.04s        104.67s (1m 45s)    41.47s        9
Curation03    520.95s (8m 41s)    236.91s        123.20s (2m 03s)    52.35s        27
Additional analysis using a more fine-grained estimate of "curation time per completed field" (computed by dividing the total time per record by the number of completed fields) showed the same trends (Figure 3B). However, the ANOVA suggested that the differences were not significant (F(2,44)=0.925, p=0.404), which is probably due to this measure ignoring the time spent on non-editing actions. Overall, this preliminary study provides some evidence that the current versions of ProformaEditor and PaperBrowser are more helpful than the initial prototypes and do not impede curation. These results concur with the curator's informal feedback. They also meet our main aim at this stage, which was to integrate the tools within the existing curation workflow. Clearly, more detailed and better controlled studies are necessary to assess the potential usefulness of the tools, building on the encouraging trends revealed in this pilot. Devising these studies is part of our ongoing work, aiming to collect data from more than one curator. Similarly to the pilot, we will attempt to compare different versions of the tools which will be developed to address the compiled shortlist of usability issues. We are also interested in measuring variables other than efficiency, such as accuracy and agreement between curators. In our other work, we are currently exploiting the curator's feedback for the active learning experiments. We also intend to analyse the data collected in the logstore in order to build associations between proforma fields and larger text spans, aiming to be able to automatically identify and highlight such passages in subsequent versions of PaperBrowser.

Acknowledgments

This work takes place within the BBSRC-funded Flyslip project (grant No 38688).
We are grateful to Florian Wolf and Chihiro Yamada for their insights and contributions in earlier stages of the project. PaperBrowser and ProformaEditor are implemented in Java and will be available through the project's webpage at: www.cl.cam.ac.uk/users/av308/Project_Index/index.html
References
1. A. M. Cohen and W. R. Hersh. A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1):57-71 (2005).
2. A. S. Yeh, L. Hirschman and A. A. Morgan. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics 19(suppl. 1):i331-i339 (2003).
3. J. Preece, Y. Rogers and H. Sharp. Interaction design: beyond human-computer interaction. John Wiley and Sons (2002).
4. L. Tanabe, U. Scherf, L. H. Smith, J. K. Lee, L. Hunter and J. N. Weinstein. MedMiner: an internet text-mining tool for biomedical information with application to gene expression profiling. BioTechniques 27(6):1210-1217 (1999).
5. H. M. Mueller, E. E. Kenny and P. W. Sternberg. Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biology 2(11):e309 (2004).
6. D. A. Cohn, Z. Ghahramani and M. I. Jordan. Active learning with statistical models. In G. Tesauro, D. Touretzky and J. Alspector (eds), Advances in Neural Information Processing, vol. 7, 707-712 (1995).
7. H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. Proceedings of ACL 2002, 168-175 (2002).
8. B. Hollingsworth, I. Lewin and D. Tidhar. Retrieving hierarchical text structure from typeset scientific articles: A prerequisite for e-Science text mining. Proceedings of the 4th UK e-science all hands meeting, 267-273 (2005).
9. A. A. Morgan, L. Hirschman, M. Colosimo, A. S. Yeh and J. B. Colombe. Gene name identification and normalization using a model organism database. J. of Biomedical Informatics 37(6):396-410 (2004).
10. A. Vlachos and C. Gasperin. Bootstrapping and evaluating NER in the biomedical domain. Proceedings of BioNLP 2006, 138-145 (2006).
11. E. Briscoe, J. Carroll and R. Watson. The second release of the RASP system. Proceedings of ACL-COLING 2006, 77-80 (2006).
12. C. Gasperin. Semi-supervised anaphora resolution in biomedical texts. Proceedings of BioNLP 2006, 96-103 (2006).
13. A. Vlachos, C. Gasperin, I. Lewin and E. J. Briscoe. Bootstrapping the recognition and anaphoric linking of named entities in Drosophila articles. Proceedings of PSB 2006, 100-111 (2006).
14. C. Barclay, S. Boisen, C. Hyde and R. Weischedel. The Hookah information extraction system. Proceedings of Workshop on TIPSTER II, 79-82 (1996).
15. D. S. Moore and G. S. McCabe. Introduction to the practice of statistics, 713-747. Freeman and Co (1989).
Figure 4. (A) Automatically recognised gene-names highlighted in PaperBrowser. Navigating through the paper using: (B) PaperView and (C) EntitiesView. (D) Editing a record with ProformaEditor.
A STACKED GRAPHICAL MODEL FOR ASSOCIATING SUB-IMAGES WITH SUB-CAPTIONS
ZHENZHEN KOU, WILLIAM W. COHEN, AND ROBERT F. MURPHY
Machine Learning Department, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
E-mail: [email protected], [email protected], [email protected]
There is extensive interest in mining data from full text. We have built a system called SLIF (for Subcellular Location Image Finder), which extracts information on one particular aspect of biology from a combination of text and images in journal articles. Associating the information from the text and image requires matching sub-figures with the sentences in the text. We introduce a stacked graphical model, a meta-learning scheme to augment a base learner by expanding features based on related instances, to match the labels of sub-figures with labels of sentences. The experimental results show a significant improvement in the matching accuracy of the stacked graphical model (81.3%) as compared with a relational dependency network (70.8%) or the current algorithm in SLIF (64.3%).
1. Introduction

The vast size of the biological literature and the knowledge contained therein makes it essential to organize and summarize pertinent scientific results. Biological literature mining has been increasingly studied to extract information from huge amounts of biological articles [1-3]. Most of the existing IE systems are limited to extracting information only from text. Recently there has been great interest in mining from both text and image. Yu and Lee [4] designed BioEx, which analyses abstract sentences to retrieve the image in an article. Rafkind et al. [5] explored the classification of general bioscience images into generic categories based on features from both text (image caption) and image. Shatkay et al. [6] described a method to obtain features from images to categorize biomedical documents. We have built a system called SLIF [7,8] (for Subcellular Location Image Finder) that extracts information about protein subcellular locations from both text and images. SLIF analyzes figures in biological papers, which include both images and captions. In SLIF, a large corpus of articles is fully analyzed
Figure 1. A figure caption pair reproduced from the biomedical literature. (The reproduced caption begins: "Fig. 5. Double immunofluorescence confocal microscopy using mouse mAb against cPABP and affinity-purified rabbit antibodies against mrnp 41 ...")
and the results of analysis steps are stored in an SQL database as traceable assertions. An interface to the database (http://slif.cbi.cmu.edu) has been designed such that images and text of interest can be retrieved and presented to users [7]. In a system mining both text and images, associating the information from the text and the image is very challenging since usually there are multiple sub-figures in a figure and we must match sub-figures with the sentences in the text. In the initial version of SLIF, we extracted the labels for the sub-figures and sentences separately and matched them by finding equal-value pairs. This naive matching approach ignores much context information (i.e., the labels for sub-figures are usually a sequence of letters, and people assign labels in a particular order rather than randomly), and could only achieve a matching accuracy of 64.3%. To obtain a satisfactory matching accuracy the naive approach requires high-accuracy image analysis and text analysis to get the labels. However, extracting labels from images is non-trivial. Inferring the label sequences and improving image processing allowed us to increase the F1 for panel label extraction to 78% [9]. In this paper, we introduce a stacked graphical model to match the labels of sub-figures with labels of sentences. The stacked model can take advantage of the context information and achieves an 81.3% accuracy. In the following, we give a brief review of SLIF in Section 2. Section 3 describes the stacked model used for the matching. Section 4 summarizes the experimental results and Section 5 concludes the paper.

2. SLIF Overview

SLIF applies both image analysis and text interpretation to figures. Figure 1 is a typical figure that SLIF can analyse; it is reproduced from the article "mRNA binding protein mrnp 41 localizes to both nucleus and cytoplasm", by Doris Kraemer and Günter Blobel, Cell Biology Vol. 94, pp. 9119-9124, August 1997.
Figure 2. Overview of the image and text processing steps in SLIF.
Figure 2 shows an overview of the steps in the SLIF system, with references to publications in which they are described in more detail. Image processing includes several steps. Decomposing images into panels: for images containing multiple panels, the individual panels are recovered from the image. Identifying fluorescence microscope images: panels are classified as to whether they are fluorescence microscope images, so that appropriate image processing steps can be performed. Image preprocessing and feature computations: firstly, the annotations such as labels, arrows and indicators of scale contained within the image are detected, analyzed, and then removed from the image. In this step, panel labels are recognized by Optical Character Recognition (OCR). Panel labels are textual labels which appear as annotations to images, for example, "a" and "b" printed in panels in Figure 1. Recognizing panel labels is very challenging. Even after careful image pre-processing and enhancement the F1 accuracy is only about 75%. The OCR results are used as candidate panel labels, and after filtering candidates an F1 accuracy of 78% is obtained [9]. Secondly, the scale bar is extracted, and finally subcellular location features (SLFs) are produced and the localization pattern of each cell is determined. Caption processing is done as follows. Entity name extraction: in the current version of SLIF we use an extractor trained on conditional random fields [10] and an extractor trained on Dictionary-HMMs [11] to extract the protein name. The cell name is extracted using hand-coded rules. Image
260
pointer extraction. The linkage between the panels and the text of captions is usually based on textual labels which appear as annotations to the images (i.e., panel labels), and which are also interspersed with the caption text. We call these textual labels appearing in text image pointers, for example, "(a)" and "(b)" in the caption in Figure 1. In our analysis, image pointers are classified into four categories according to their linguistic function: Bullet-style image pointers, NP-style image pointers, Citation-style image pointers, and other12. The image-pointer extraction and classification steps are done via a machine learning method 12 . Entity to image pointer alignment. The scope of an image pointer is the section of text (sub-caption) that should be associated with it. The scope is determined by the class assigned to an image pointer. 12 3. A Stacked Model to Map Panel Labels to Image Pointers 3.1. Stacked Graphical Models for
Classification
Stacked graphical models are a meta-learning scheme to do collective classification13, in which a base learner is augmented by expanding one instance's features with predictions on other related instances. Stacked graphical models work well on predicting labels for relational data with graphical structures (Kou and Cohen, in preparation). The inference converges much faster than the traditional Gibbs sampling method and it has been shown empirically that one iteration of stacking is able to achieve good performance on many tasks. The disadvantage of stacking is that it requires more training time to achieve faster testing inference. Figure 3 shows the inference and learning methods for stacked graphical models. In a stacked graphical model, the relational template C finds the related instances. For instance Xi, C{xi) retrieves the indices i\,...,ii of instances xix, ...,xiL that are related to xt. Given predictions y for a set of instances x, C(xi,y) returns the predictions on the related instances, i.e., The idea of stacking is to take advantage of the dependencies among instances, or the relevance between inter-related tasks. In our application in this paper, we conjecture that panel label extraction and image pointer extraction are inter-related, and design a stacked model that combines them. 3.2. A Stacked Model for
Mapping
In the previous version of SLIF, we map panel labels to image pointers by finding the equal-value pair. Below we apply the idea of stacked graphical
261
• Parameters: a relational template C and a cross-validation parameter J.
• Learning algorithm: given a training set D = {(x, y)} and a base learner A:
  - Learn the local model, i.e., when k = 0: return f^0 = A(D^0). Note that D^0 = D, x^0 = x, and y^0 = y.
  - Learn the stacked models, for k = 1...K:
    (1) Construct cross-validated predictions y^{k-1} for x in D as follows:
        (a) Split D^{k-1} into J equal-sized disjoint subsets D_1^{k-1}, ..., D_J^{k-1}.
        (b) For j = 1...J, let f_j^{k-1} = A(D^{k-1} - D_j^{k-1}).
        (c) For x in D_j, y^{k-1} = f_j^{k-1}(x^{k-1}).
    (2) Construct an extended dataset D^k = (x^k, y) by converting each instance x_i to x_i^k as follows: x_i^k = (x_i, C(x_i, y^{k-1})), where C(x_i, y^{k-1}) returns the predictions on the instances related to x_i, so that x_i^k = (x_i, y_{i_1}^{k-1}, ..., y_{i_L}^{k-1}).
    (3) Return f^k = A(D^k).
• Inference algorithm: given x:
  (1) y^0 = f^0(x). For k = 1...K:
  (2) Carry out step (2) above to produce x^k.
  (3) y^k = f^k(x^k).
  Return y^K.

Figure 3. Stacked graphical learning and inference.
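The procedure in Figure 3 can be realised in a few lines. The sketch below is an assumed illustration only (not the authors' implementation), using scikit-learn's logistic regression as a stand-in for the MaxEnt base learner, NumPy arrays for X and y, a user-supplied `expand` function as the relational template C, and a single stacking iteration (K = 1) as used in the experiments.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def fit_stacked(X, y, expand, n_splits=5):
    """One iteration of stacked learning (K = 1).
    expand(X, y_hat) must implement the relational template C: it returns X
    with each row augmented by the predictions on that row's related instances."""
    base = LogisticRegression(max_iter=1000)
    f0 = clone(base).fit(X, y)                       # local model
    y_cv = np.empty(len(y), dtype=float)
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        fold = clone(base).fit(X[train_idx], y[train_idx])
        y_cv[test_idx] = fold.predict(X[test_idx])   # cross-validated predictions
    f1 = clone(base).fit(expand(X, y_cv), y)         # stacked model on the extended data
    return f0, f1

def predict_stacked(f0, f1, X, expand):
    y0 = f0.predict(X)
    return f1.predict(expand(X, y0))
```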
models to map the panel labels and image pointers. In SLIF the image pointer finding was done as follows. Most image pointers are parenthesized, and relatively short. We thus hand-coded an extractor that finds all parenthesized expressions that are (a) less than 15 characters long and (b) do not contain a nested parenthesized expression, and replaces X-Y constructs with the equivalent complete sequence. (E.g., constructs like "B-D" are replaced with "B,C,D".) We call the image pointers extracted by this hand-coded approach candidate image pointers. The hand-coded extractor has high recall but only moderate precision.
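As an illustration only (not the SLIF code), a regular-expression version of such a hand-coded candidate extractor might look as follows; the exact length rule and the range-expansion details are assumptions.

```python
import re

def candidate_image_pointers(caption):
    """Find parenthesized expressions with no nested parentheses and fewer
    than 15 characters inside the parentheses, expanding X-Y constructs
    such as "B-D" to "B,C,D"."""
    pointers = []
    for inner in re.findall(r"\(([^()]{1,14})\)", caption):
        expanded = re.sub(
            r"\b([A-Za-z])-([A-Za-z])\b",
            lambda m: ",".join(chr(c) for c in range(ord(m.group(1)), ord(m.group(2)) + 1)),
            inner,
        )
        pointers.append(expanded)
    return pointers

print(candidate_image_pointers("rabbit anti-mrnp 41 antibodies (a) and (b-d)"))
# ['a', 'b,c,d']
```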
Using a classifier trained with machine learning approaches, we then classify the candidate image pointers as bullet-style, citation-style, NP-style, or other. Image pointers classified as "other" are discarded, which compensates for the relatively low precision of the hand-coded extractor [12]. In SLIF the panel label extraction was done as follows. Image processing techniques and OCR techniques are applied to find the labels printed within the panel. That is, firstly candidate text regions are computed via image processing techniques, and OCR is run on these candidate regions to get candidate panel labels. This approach has a relatively high precision yet low recall. We call the panel labels recognized by image processing and OCR candidate panel labels. A strategy based on grid analysis (a procedure which analyzes how many panels there are in a figure and finds out how the panels are arranged) is applied to the candidate panel labels to get a better accuracy [9]. The match between panel labels and image pointers can be formulated as a classification problem. We construct a set of pairs <o_i, p_j> for all candidate panel labels o_i and candidate image pointers p_j from the same figure. That is, for a panel with l_i representing the real label, o_i representing the panel label recognized by OCR, and the p_j's representing the image pointers in the same figure, we construct a set of pairs <o_i, p_j>. We label the pair <o_i, p_j> as positive only if l_i = p_j, and otherwise as negative. For example, in Figure 1, the real label l_i for panel a is "a". If OCR recognizes o_i = "a" and the image pointers for the figure are "a" and "b", we construct two pairs: <a, a>, labelled as positive, and <a, b>, labelled as negative. Note that the pair is labelled according to the real label and the image pointers. If OCR recognizes o_i incorrectly for panel a in Figure 1, for example o_i = "o", we have two pairs: <o, a>, labelled as positive, and <o, b>, labelled as negative. We design features based on the o_i's and p_j's. For a base feature set, there are three binary features: one boolean value indicating whether o_i = p_j, one boolean value indicating whether o_{i_left} = p_j - 1 or o_{i_upper} = p_j - 1, and another boolean value indicating whether o_{i_right} = p_j + 1 or o_{i_down} = p_j + 1, where i_left is the index of the panel to the left of panel i in the same row, i_upper is the index of the panel above panel i in the same column, p_j + 1 is the letter after p_j, and p_j - 1 is the letter before p_j. This feature set takes advantage of the context information by comparing o_{i_left} to p_j - 1 and so on. The second and third features capture the first-order dependency.
Figure 4. Second-order dependency. (The figure shows nine panels arranged in a 3x3 grid and labelled a-i in row-major order.)
That is, if the neighboring panel (an adjacent panel in the same row or the same column) is recognized as the corresponding "adjacent" letter, there is a higher chance that o_i is equal to p_j. In the inference step for the base learner in the stacked model, if a pair <o_i, p_j> is predicted as positive, we set the value of o_i to p_j, since empirically the image pointer extraction has a higher accuracy than the panel label recognition. That is, the predicted value ô_i is p_j for a positive pair, and ô_i remains o_i for a negative pair. After obtaining ô_i, we recalculate the features by comparing the ô_i's and p_j's. We call the procedure of predicting <o_i, p_j>, updating ô_i, and re-calculating features "stacking". We choose MaxEnt as the base learner to classify <o_i, p_j>, and in our experiments we implement one iteration of stacking. Besides the basic features, we also include another feature that captures the "second-order context", i.e., it considers the spatial dependency among all the "sibling" panels, even though they are not adjacent. In general the arrangement of labels might be complex: labels may appear outside panels, or several panels may share one label. However, in the majority of cases, panels are grouped into grids, each panel has its own label, and labels are assigned to panels either in column-major or row-major order. The "panels" shown in Figure 4 are typical of this case. For such cases, we analyze the locations of the panels in the figure and reconstruct this grid, i.e., the number of total columns and rows, and also determine the row and column position of each panel. We compute the second-order feature as follows: for a panel located at row r and column c with label o, as long as there is a panel located at row r' and column c' with label o' (r' ≠ r and c' ≠ c) and, according to either row-major order or column-major order, the label assigned to panel (r', c') is o' given that the label for panel (r, c) is o, we assign 1 to the second-order feature. For example, in Figure 4, recognizing the panel label "a" at row 1, column 1 would help to recognize "e" at row 2, column 2 and "h" at row 3, column 2. With the first-order features and second-order features, the chance increases that a missing or mis-recognized label is matched to an image pointer.
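For illustration, a minimal sketch of the base feature set is given below; the neighbour-lookup arguments are assumed helpers, and the letter arithmetic only covers simple single-letter labels.

```python
def base_features(o_i, p_j, left=None, upper=None, right=None, down=None):
    """Three boolean features for a candidate pair <o_i, p_j>:
    exact agreement, plus the two first-order neighbourhood agreements.
    left/upper/right/down are the OCR labels of the adjacent panels
    (None when there is no such panel)."""
    prev_letter = chr(ord(p_j) - 1)
    next_letter = chr(ord(p_j) + 1)
    return [
        o_i == p_j,                    # f1: o_i equals p_j
        prev_letter in (left, upper),  # f2: left or upper panel carries p_j - 1
        next_letter in (right, down),  # f3: right or lower panel carries p_j + 1
    ]

# Panel "a" in Figure 1: OCR mis-read it as "o", the image pointer is "a",
# and (assuming) the panel to its right was read as "b".
print(base_features("o", "a", right="b"))   # [False, False, True]
```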
4. Experiments
4.1. Dataset
To evaluate the stacked model for panel label and image pointer matching, we collected a dataset of 200 figures which includes 1070 sub-figures. This is a random subsample of a larger set of papers from the Proceedings of the National Academy of Sciences. Our current approach can only analyse labels contained within panels (internal labels) due to limitations of the image processing stage; therefore, our dataset contains only figures with internal labels. Though our dataset does not cover all the cases, panels with internal labels are the vast majority in our corpus. We hand-labeled all the image pointers in the caption and the label for each panel. The match between image pointers and panels was also assigned manually.
4.2. Baseline algorithms
The approaches used to find the candidate image pointers and panel labels have been described in Section 3.2. In this paper, we take the hand-coded approach and the machine learning approach12 as the baseline algorithms for image pointer extraction. The OCR-based approach and the grid analysis approach9 are the baseline algorithms for panel label extraction. We also compare the stacked model to relational dependency networks (RDNs).14 RDNs are an undirected graphical model for relational data. Given a set of entities and the links between them, an RDN defines a full joint probability distribution over the attributes of the entities. Attributes of an object can depend probabilistically on other attributes of the object, as well as on attributes of objects in its relational neighborhood. We build an RDN model as shown in Figure 5. In the RDN model there are two types of entities, image pointer and panel label. For an image pointer, the attribute p_j is the value of the candidate image pointer, and o_i is the candidate panel label. p_true and o_true are the true values to be predicted. The linkages L_pre and L_next capture the dependency among the sequence of image pointers: L_pre points to the previous letter and L_next points to the successive letter. P_left, P_right, P_upper, and P_down point to the panels in the left, right, upper, and down directions respectively. The RDN model takes the candidate image pointers and panel labels as input and predicts their true values.
Figure 5. An RDN model.
The match between the panel label and the image pointer is done by finding the equal-value pair.
4.3. Experimental Results
We used 5-fold cross validation to evaluate the performance of the stacked graphical model for image pointer to panel label matching. The evaluation is reported in two ways: the performance on the matching, and the performance on image pointer and panel label extraction. The matching is the "real" problem, i.e., what we really care about are the matches, not getting the labels correct in themselves. Evaluation on the image pointer and panel label extraction is a secondary check on the learning technique. Table 1 shows the accuracy of image pointer to panel label matching. For the baseline algorithms, the match was done by finding the equal-value pair. Baseline algorithm 1 compares the candidate image pointers to the candidate panel labels. Baseline algorithm 2 compares the image pointers extracted by the learning approach to the panel labels obtained after grid analysis. The stacked graphical model takes the same input as Baseline algorithm 2, i.e., the candidate image pointers extracted by the hand-coded algorithm and the candidate panel labels obtained by OCR. We observe that the stacked graphical model improves the accuracy of matching. Both the first-order and the second-order dependency help to achieve a better performance. RDN also achieved a better performance than the two baseline algorithms. Our stacked model achieves a better performance than RDN, because in stacking the dependency is captured and indicated "strongly" by the way we design the features.
Table 1. Accuracy of image pointer to panel label matching.
Method                                              Image pointer to panel label matching
Baseline algorithm 1                                48.7%
Baseline algorithm 2 (current algorithm in SLIF)    64.3%
RDN                                                 70.8%
Stacked model (first-order)                         75.1%
Stacked model (second-order)                        81.3%
Table 2. Performance on image pointer extraction and panel label extraction.
Method                                         Image pointer extraction    Panel label extraction
Baseline algorithm 1                           60.9%                       52.3%
Baseline algorithm 2                           89.7%                       65.7%
RDN                                            85.2%                       73.6%
Stacked model with first order dependency      (unchanged)                 77.8%
Stacked model with second order dependency     (unchanged)                 83.1%
That is, the stacked model can treat the matching as a binary classification of <o_i, p_j> and capture the first-order and second-order dependencies directly through our feature definitions. In RDNs, however, the data must be formulated as types of entities described with attributes, and the dependency is modeled with links among attributes. Though RDNs can model the dependency among the data, the matching problem is decomposed into a multi-class classification problem and a matching procedure. Besides that, the second-order dependency cannot be modeled explicitly in the RDN. Table 2 shows the performance on the sub-tasks of image pointer extraction and panel label extraction. The results are reported as F1 measures. Since in the stacked model we update the value of o_i and set it to p_j when a match is found, stacking also improves the accuracy of panel label extraction. The accuracy of image pointer extraction remains the same since we do not update the value of p_j. Baseline algorithm 1 is the approach of finding candidate image pointers or candidate panel labels. Baseline algorithm 2 is the learning approach for image pointer extraction and the grid analysis strategy for panel label extraction. The inputs for the stacked graphical model are the candidate image pointers and candidate panel labels. We observe that by updating the value of o_i, we can achieve a better performance on panel label extraction, i.e., provide more "accurate" features for stacking.
Figure 6. Cases where current algorithms fail: (a) a hard case for OCR; (b) a hard case for the stacked model.
RDN also helps to improve the performance, yet the best performance is obtained via stacking.
4.4. Error Analysis
As mentioned in Section 2, OCR on panel labels is very challenging, and baseline algorithm 1 suffers from low recall. Most errors occur when not enough o_i's are recognized by the baseline algorithm to provide information about the first-order and second-order dependencies. Figure 6(a) shows a case where the current OCR fails. Figure 6(b) shows a case where there is not enough contextual information to determine the label for the upper-left panel.

5. Conclusions

In this paper we briefly reviewed the SLIF system, which extracts information on one particular aspect of biology from a combination of text and images in journal articles. In such a system, associating the information from the text and image requires matching sub-figures in a figure with the sentences in the text. We used a stacked graphical model to match the labels of sub-figures with labels in sentences. The experimental results show that the stacked graphical model can take advantage of the context information and achieve a significant improvement in matching accuracy compared with a relational dependency network or the current algorithm in SLIF. In addition to accomplishing the matching at a higher accuracy, the stacked model helps to improve the performance of finding labels for sub-figures as well.
The idea of stacking is to take advantage of the context information, or the relevance between inter-related tasks. Future work will focus on applying stacked models to more tasks in SLIF, such as protein name extraction.

Acknowledgments

The work was supported by research grant 017396 from the Commonwealth of Pennsylvania Department of Health, NIH grants K25 DA017357 and R01 GM078622, and grants from the Information Processing Technology Office (IPTO) of the Defense Advanced Research Projects Agency (DARPA).

References
1. B. de Bruijn and J. Martin, Getting to the (c)ore of knowledge: mining biomedical literature. Int. J. Med. Inf., 67(2002), 7-18.
2. M. Krallinger and A. Valencia, Text-mining and information-retrieval services for molecular biology. Genome Biology 2005, 6:224.
3. L. Hunter and K. B. Cohen, Biomedical language processing: what's beyond PubMed? Molecular Cell 21(2006), 589-594.
4. H. Yu and M. Lee, Accessing Bioscience Images from Abstract Sentences. Bioinformatics 2006, 22(14), 547-556.
5. B. Rafkind, M. Lee, S. F. Chang, and H. Yu, Exploring text and image features to classify images in bioscience literature. Proceedings of BioNLP 2006, 73-80.
6. H. Shatkay, N. Chen, and D. Blostein, Integrating Image Data into Biomedical Text Categorization. Bioinformatics 2006, 22(14), 446-453.
7. R. F. Murphy, Z. Kou, J. Hua, M. Joffe, and W. W. Cohen, Extracting and Structuring Subcellular Location Information from On-line Journal Articles: The Subcellular Location Image Finder. Proceedings of KSCE 2004, 109-114.
8. R. F. Murphy, M. Velliste, J. Yao, and G. Porreca, Searching Online Journals for Fluorescence Microscope Images Depicting Protein Subcellular Locations. Proceedings of BIBE 2001, 119-128.
9. Z. Kou, W. W. Cohen, and R. F. Murphy, Extracting Information from Text and Images for Location Proteomics. Proceedings of BIOKDD 2003, 2-9.
10. R. McDonald and F. Pereira, Identifying Gene and Protein Mentions in Text Using Conditional Random Fields. BMC Bioinformatics, 6(Suppl 1):S6, May 2005.
11. Z. Kou, W. W. Cohen, and R. F. Murphy, High-Recall Protein Entity Recognition Using a Dictionary. Bioinformatics 2005, 21(Suppl 1), 266-273.
12. W. W. Cohen, R. Wang and R. F. Murphy, Understanding Captions in Biomedical Publications. Proceedings of KDD 2003, 499-504.
13. B. Taskar, P. Abbeel and D. Koller, Discriminative probabilistic models for relational data. Proceedings of UAI 2002, 485-492.
14. D. Jensen and J. Neville, Dependency Networks for Relational Data. Proceedings of ICDM 2004, 170-177.
GeneRIF QUALITY ASSURANCE AS SUMMARY REVISION
ZHIYONG LU, K. BRETONNEL COHEN, AND LAWRENCE HUNTER
Center for Computational Pharmacology, University of Colorado Health Sciences Center, Aurora, CO, 80045, USA
E-mail: {Zhiyong.Lu, Kevin.Cohen, Larry.Hunter}@uchsc.edu
Like the primary scientific literature, GeneRIFs exhibit both growth and obsolescence. NLM's control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolete data: GeneRIFs are removed from the database when they are found to be of low quality. However, the rapid and extensive growth of Entrez Gene makes manual location of low-quality GeneRIFs problematic. This paper presents a system that takes advantage of the summary-like quality of GeneRIFs to detect low-quality GeneRIFs via a summary revision approach, achieving precision of 89% and recall of 77%. Aspects of the system have been adopted by NLM as a quality assurance mechanism.
1. Introduction

In April 2002, the National Library of Medicine (NLM) began an initiative to link published data to Entrez Gene entries via Gene References Into Function, or GeneRIFs. GeneRIFs consist of an Entrez Gene ID, a short text (under 255 characters), and the PubMed identifier (PMID) of the publication that provides evidence for the assertion in that text. The extent of NLM's commitment to this effort can be seen in the growth of the number of GeneRIFs currently found in Entrez Gene: there are 157,280 GeneRIFs assigned to 29,297 distinct genes (Entrez Gene entries) in 571 species as of June 2006. As we will demonstrate below, the need has arisen for a quality control mechanism for this important resource. GeneRIFs can be viewed as a type of low-compression, single-document, extractive, informative, topic-focussed summary [15]. This suggests the hypothesis that methods for improving the quality of summaries can be useful for improving the quality of GeneRIFs. In this work, we evaluate an approach to GeneRIF quality assurance based on a revision model, using three distinct methods. In one, we examined the recall of the system, using the set of all GeneRIFs that were withdrawn by the NLM indexers over a fixed period of time as a gold standard. In another, we performed a coarse assessment of the precision of the system by submitting
system outputs to NLM. The third involved a fine-grained evaluation of precision by manual judging of 105 system outputs. 1.1. A fault model for GeneRIFs Binder (1999) describes the fault model—an explicit hypothesis about potential sources of errors in a system [3], Viewing GeneRIFs as summaries suggests a set of related potential sources of errors. This set includes all sources of error associated with extractive summarization (discussed in detail in [16]). It also includes deviations from the NLM's guidelines for GeneRIF production—both explicit (such as definitions of scope and intended content) and tacit (such as the presumed requirement that they not contain spelling errors). Since the inception of the GeneRIF initiative, it has been clear that a quality control mechanism for GeneRIFs would be needed. One mechanism for implementing quality control has been via submitting individual suggestions for corrections or updates via a form on the Entrez Gene web site. As the size of the set of extant annotations has grown—today there are over 150,000 GeneRIFs—it has become clear that high-throughput, semi-automatable mechanisms will be needed, as well—over 300 GeneRIFs were withdrawn by NLM indexers just in the six months from June to December 2005, and data that we present below indicates that as many as 2,923 GeneRIFs currently in the collection are substandard. GeneRIFs can be unsatisfactory for a variety of reasons: • Being associated with a discontinued Entrez Gene entry • Containing errors, whether minor—of spelling or punctuation—or major, i.e. with respect to content • Being based only on computational data—the NLM indexing protocol dictates that GeneRIFs based solely on computational analyses are not in scope [7] • Being redundant • Not being informative—GeneRIFs should not merely indicate what a publication is about, but rather should communicate actual information • Not being about gene function This paper describes a system for detecting GeneRIFs with those characteristics. We begin with a corpus-based study of GeneRIFs for which we have thirdparty confirmation that they were substandard, based on their having been withdrawn by the NLM indexers. We then propose a variety of methods for detecting substandard GeneRIFs, and describe the results of an intrinsic evaluation of the methods against a gold standard, an internal evaluation by the system builders,
271 and an external evaluation by the NLM staff. In this work, we evaluate an approach to GeneRIF quality assurance based on a summary revision model. In summarization, revision is the process of changing a previously produced summary. [16] discusses several aspects of revision. As he points out (citing [5]), human summarizers perform a considerable amount of revision, addressing issues of semantic content (e.g., replacing pronouns with their antecedents) and of form (e.g., repairing punctuation). Revision is also an important component of automatic summarization systems, and in particular, of systems that produce extractive summaries, of which GeneRIFs are a clear example. (Extractive summaries are produced by "cutting-and-pasting" text from the original, and it has been repeatedly observed that most GeneRIFs are direct extracts from the title or abstract of a paper ([2,9,12,15]). This suggests using a "revision system" to detect GeneRIFs that should be withdrawn.
2. Related Work GeneRIFs were first characterized and analyzed in [17]. They presented the number of GeneRIFs produced and species covered based on the LocusLink revision of February 13, 2003, and introduced the prototype GeneRIF Automated Alerts System (GRAAS) for alerting researchers about literature on gene products. Summarization in general has attracted a considerable amount of attention from the biomedical language processing community. Most of this work has focussed specifically on medical text—see [1] for a comprehensive review. More recently, computational biologists have begun to develop summarization systems targeting the genomics and molecular biology domains [14,15]. GeneRIFs in particular have attracted considerable attention in the biomedical natural language processing community. The secondary task of the TREC Genomics Track in 2003 was to reproduce GeneRIFs from MEDLINE records [9]. 24 groups participated in this shared task. More recently, [15] presented a system that can automatically suggest a sentence from a PubMed/MEDLINE abstract as a candidate GeneRIF by exploiting an Entrez Gene entry's Gene Ontology annotations, along with location features and cue words. The system can significantly increase the number of GeneRIF annotations in Entrez Gene, and it produces qualitatively more useful GeneRIFs than previous methods. In molecular biology, GeneRIFs have recently been incorporated into the MILANO microarray data analysis tool. The system builders evaluated MILANO with respect to its ability to analyze a large list of genes that were affected by overexpression of p53, and found that a number of benefits accrued specifically from the system's use of GeneRIFs rather than PubMed as its literature source, including a reduction in the number of irrelevant
Table 1. GeneRIF statistics from 2000 to 2006. The second row shows the annual increase in new GeneRIFs. The third row shows the number of new species for the new GeneRIFs. The fourth row is the number of genes that gained GeneRIF assignments in the year listed in the first row. Note that although the gene indexing project was officially started by the NLM in 2002, the first set of GeneRIFs was created in 2000.
Year            2000    2001    2002     2003     2004     2005     2006 (a)    Sum
New GeneRIFs      47     617   15,960   37,366   35,887   45,875    21,628   157,280
New Species        3       1        2        3      130      341        91       571
New Genes         34     529    6,061    6,832    5,113    7,769     2,959    29,297
(a) From January 2006 to June 2006.
results and a dramatic reduction in search time [19]. The amount of attention that GeneRIFs are attracting from such diverse scientific communities, including not only bioscientists, but natural language processing specialists as well, underscores the importance of ensuring the quality of the GeneRIFs stored in Entrez Gene. 3. A corpus of withdrawn GeneRIFs The remarkable increase in the total number of GeneRIFs each year (shown in Table 1) comes despite the fact that some GeneRIFs have been removed internally by the NLM. We compared the GeneRIF collection of June 2005 against that of December 2005 and found that a total of 319 GeneRIFs were withdrawn during that period. These withdrawn GeneRIFs are a valuable source of data for understanding the NLM's model of what makes a GeneRIF bad. Our analyses are based on the GeneRIF files downloaded from the NCBI ftp siteb at three times over the course of a one-year period (June 2005, December 2005, and June 2006). The data and results discussed in this paper are available at a supplementary website0.
3.1. Characteristics of the withdrawn GeneRIFs We examined these withdrawn GeneRIFs, and determined that four reasons accounted for the withdrawal of most of them (see Figure 1). 1. Attachment to a temporary identifier: GeneRIFs can only be attached to existing Entrez Gene entries. Existing Entrez Gene entries have unique identifiers. New entries that are not yet integrated into the database are assigned a temporary identifier (the string NEWENTRY), and all annotations that are associated with them are provisional, including GeneRIFs. GeneRIFs associated with these temporary IDs are often withdrawn. Also, when the temporary identifier becomes a
(b) ftp://ftp.ncbi.nlm.nih.gov/gene
(c) http://compbio.uchsc.edu/Hunter_lab/Zhiyong/psb2007
Figure 1. Distribution of reasons for GeneRIF withdrawal from June to December 2005. (Pie chart legend: Attached to NEWENTRY, 39%; Computational methods, 37%; Grammar (misspellings and punctuation), 14%; Miscellaneous corrections, 6%; Unknown, 4%.)
obsolete, the GeneRIFs that were formerly attached to it are removed (and transferred to the new ID). 39% (123/319) of the withdrawn GeneRIFs were removed via one of these mechanisms. 2. Based solely on computational analyses: The NLM indexing protocol dictates that GeneRIFs based solely on computational analyses are not in scope. 37% (117/319) of the withdrawn GeneRIFs were removed because they came from articles whose results were based purely on computational methods (e.g., by prediction techniques) rather than traditional laboratory experiments. 3. Typographic and spelling errors: Typographic errors are not uncommon in the withdrawn GeneRIFs. They include misspellings and extraneous punctuation. 14% (46/319) of the withdrawn GeneRIFs contained errors of this type (41 misspellings and 5 punctuation errors). 4. Miscellaneous errors: 6% (20/319) of the withdrawn GeneRIFs were removed for other reasons. Some included the authors' names at the end, e.g., Cloning and expression ofZAK, a mixed lineage kinase-like protein containing a leucine-zipper and a sterile-alpha motif. Liu TC, etc. Others were updated by adding new gene names or modifying existing ones. For example, the NLM replaced POPC with POMC in Mesothelioma cell were found to express mRNAfor [POPC]... for the gene POMC (GenelD: 5443). 5. Unknown reasons: we were unable to identify the cause of withdrawal for the remaining 4% (13/319) of the withdrawn GeneRIFs. These findings suggest that it is possible to develop automated methods for detecting substandard GeneRIFs.
4. System and Method We developed a system containing seven modules, each of which addresses either the error categories described in Section 3.1 or the content-based problems described in Section 1.1 (e.g. redundancy, or not being about gene function).
Table 2. A total of 2,923 suspicious GeneRIFs found in the June 2006 data. See Sections 4.5-4.7 for the explanations of categories 5-7.
No. | Category | GeneRIFs | GeneRIF example
1 | Discontinued | 202 | GeneID 6841: SVS1 seems to be found only in rodents and does not exist in humans
2 | Misspellings | 1,754 | GeneID 64919: CTIP2 mediates transcriptional repression with SIRT1 in mammmalian cells
3 | Punctuation | 505 | GeneID 7124: ). TNF-alpha promoter polymorphisms are associated with severe, but not less severe, silicosis in this population.
4 | Computational results | 19 | GeneID 313129: characterization of rat Ankrd6 gene in silico; PMID 15657854: Identification and characterization of rat Ankrd6 gene in silico
5 | Similar GeneRIFs | 209 | GeneID 3937: two GeneRIFs for the same gene differ in the gene name in the parenthesis; Shb links SLP-76 and Vav with the CD3 complex in Jurkat T cells (SLP-76)
6 | One-to-many | 67 | A single GeneRIF text ("identification, cloning and expression") is linked to two GeneIDs (217214 and 1484476) and two PMIDs (12049647, 15490124)
7 | Length Constraint | 167 | GeneID 3952: review; GeneID 135: molecular model; GeneID 81657: protein subunit function
4.1. Finding discontinued GeneRIFs

Discontinued GeneRIFs are detected by examining the gene history file from the NCBI's ftp site, which includes information about GeneIDs that are no longer current, and then searching for GeneRIFs that are still associated with the discontinued GeneIDs.

4.2. Finding GeneRIFs with spelling errors

Spelling error detection has been extensively studied for general English (see [13]), as well as in biomedical text (e.g. [20]). It is especially challenging for applications like this one, since gene names have notoriously low coverage in many publicly available resources and exhibit considerable variability, both in text [10] and in databases [4,6]. In the work reported here, we utilized the Google spell-checking API (d). Since Google allows ordinary users only 1,000 automated queries a day, it was not practical to use it to check all of the 4 million words in the current set of GeneRIFs. To reduce the size of the input set for the spellchecker, we used it only to check tokens that did not contain upper-case letters or punctuation (on the assumption that tokens with such characters are likely to be gene names or domain-specific terms) and that occurred five or fewer times in the current set of GeneRIFs (on the assumption that spelling errors are likely to be rare). (See Table 3 for the actual distributions of non-word spelling errors across unigram frequencies in the full June 2006 collection of GeneRIFs, which supports this assumption. We manually examined a small sample of these to ensure that they were actual errors.) A sketch of this token filter is given after Table 3.
(d) http://www.google.com/apis/
Table 3. Distribution of non-word spelling errors across unigram counts.
Word Frequency      1      2     3     4     5
Spelling Errors     1,348  268   84    34    20
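The following is a minimal, hypothetical sketch of the candidate-selection step described in Section 4.2. The tokenization and the exact filtering rules are illustrative stand-ins rather than the authors' code, and the Google spell-checking API itself is not called here.

```python
import re
from collections import Counter

def spellcheck_candidates(generif_texts, max_freq=5):
    """Select rare, lower-case, punctuation-free tokens as candidates
    for an external spell-checker (e.g. the Google spelling API)."""
    tokens = []
    for text in generif_texts:
        tokens.extend(text.split())
    counts = Counter(tokens)
    candidates = set()
    for tok, n in counts.items():
        if n > max_freq:
            continue                 # frequent tokens are assumed correct
        if re.search(r"[A-Z]", tok):
            continue                 # likely a gene symbol or acronym
        if re.search(r"[^a-z]", tok):
            continue                 # punctuation/digits: domain-specific
        candidates.add(tok)
    return candidates

# Example: 'mammmalian' is rare and lower-case, so it would be checked;
# 'CTIP2' contains upper-case letters and is skipped.
print(spellcheck_candidates(
    ["CTIP2 mediates transcriptional repression in mammmalian cells"]))
```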
4.3. Finding GeneRIFs with punctuation errors

Examination of the 319 withdrawn GeneRIFs showed that punctuation errors most often appeared at the left and right edges of GeneRIFs, e.g. the extra parenthesis and period in "). TNF-alpha promoter polymorphisms are associated with severe, but not less severe, silicosis in this population." (GeneID:7124), or the terminal comma in "Heart graft rejection biopsies have elevated FLIP mRNA expression levels," (GeneID:8837). We used regular expressions (listed on the supplementary web site) to detect punctuation errors; an illustrative sketch appears below, after Section 4.5.1.

4.4. Finding GeneRIFs based solely on computational methods

Articles describing work that is based solely on computational methods commonly use words or phrases such as in silico or bioinformatics in their titles and/or abstracts. We searched explicitly for GeneRIFs based solely on computational methods by searching for those two keywords within the GeneRIFs themselves, as well as in the titles of the corresponding papers. GeneRIFs based solely on computational methods were incidentally also sometimes uncovered by the "one-to-many" heuristic (described below).

4.5. Finding similar GeneRIFs

We used two methods to discover GeneRIFs that were similar to other GeneRIFs associated with the same gene. The intuitions behind this are that similar GeneRIFs may be redundant, and that similar GeneRIFs may not be informative. The two methods involved finding GeneRIFs that are substrings of other GeneRIFs, and calculating Dice coefficients.

4.5.1. Finding substrings

We found GeneRIFs that are proper substrings of other GeneRIFs using Oracle.
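The actual regular expressions are listed on the authors' supplementary web site; the patterns below are therefore only a hedged illustration of the edge-punctuation check from Section 4.3, together with the Dice similarity named in Section 4.5 (computed here over whitespace tokens, with the 0.8 threshold given in Section 4.5.2).

```python
import re

# Illustrative edge-punctuation patterns (the real list is on the
# paper's supplementary site): stray punctuation at either edge.
EDGE_PUNCT = [
    re.compile(r"^[\s\)\].,;:]+"),   # e.g. "). TNF-alpha promoter ..."
    re.compile(r"[\s\(\[,;:]+$"),    # e.g. "... expression levels,"
]

def has_edge_punctuation_error(generif_text):
    return any(p.search(generif_text) for p in EDGE_PUNCT)

def dice(a_text, b_text):
    """Dice coefficient over whitespace tokens of two GeneRIF texts."""
    a, b = set(a_text.lower().split()), set(b_text.lower().split())
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def similar(a_text, b_text, threshold=0.8):
    return dice(a_text, b_text) >= threshold
```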
4.5.2. Calculating Dice coeffi cients We calculated Dice coefficients using the usual formula ([11]:202), and set our threshold for similarity at > 0.8. 4.6. Detecting one-to-many mappings We used a simple hash table to detect one-to-many mappings of GeneRIF texts to publications (see category 6 in Table 2). We anticipated that this would address the detection of GeneRIF texts that were not informative. (It turned out to find more serious errors, as well—see the Discussion section.) 4.7. Length constraints We tokenized all GeneRIFs on whitespace and noted all GeneRIFs that were three or fewer tokens in length. The intuition here is that very short GeneRIFs are more likely to be indicative summaries, which give the reader some indication of whether or not they might be interested in reading the corresponding document, but are not actually informative [16]—for example, the single-word text Review— and therefore are out of scope, per the NLM guidelines. 5. Results 5.1. Evaluating recall against the set of withdrawn GeneRIFs To test our system, we first applied our system to the withdrawn GeneRIFs described in Section 3. GeneRIFs that are associated with temporary IDs are still in the curation process, so we did not attempt to deal with them, and they were excluded from the recall evaluation. To ensure a stringent evaluation with the remaining 196 withdrawn GeneRIFs, we included the ones in the miscellaneous and unknown categories. The system identified 151/196 of the withdrawn GeneRIFs, for a recall of 77% as shown in Table 4. The system successfully identified 115/117 of the GeneRIFs that were based on solely computational results. It missed two because we limited our algorithm to searching only GeneRIFs and the corresponding titles, but the evidence for the computational status of those two is actually located in their abstracts. For the typographic error category, the system correctly identified 33/41 spelling errors and 3/6 punctuation errors. It missed several spelling errors because we did not check words containing uppercase letters. For example, it missed the misspellings Muttant (Mutant), MMP-lo (MMP-10), and Frame-schift (Frame-shift). It missed punctuation errors that were not at the edges of the GeneRIF, e.g. the missing space after the semicolon in RE-
Table 4. Recall on the set of withdrawn GeneRIFs. Only the 196 non-temporary GeneRIFs were included in this experiment. Although we did not attempt to detect GeneRIFs that were withdrawn for miscellaneous or unknown reasons, we included them in the recall calculation.
Category               Total    True Positive    False Negative    Recall
Computational methods    117              115                 2       98%
Misspellings              41               33                 8       80%
Punctuation                5                3                 2       60%
Miscellaneous             20                0                20        0
Unknown                   13                0                13        0
Sum                      196              151                45       77%
VIEW:Association of expression ... and the missing space after the comma in ...lymphocytes,suggesting a role for trkB...
5.2. 3rd-party evaluation of precision The preceding experiment allowed us to evaluate the system's recall, but provided no assessment of precision. To do this, we applied the system to the entire June 2006 set of GeneRIFs. The system identified 2,923 of the 157,280 GeneRIFs in that data set as being bad. Table 2 shows the distribution of the suspicious GeneRIFs across the seven error categories. We then sent a sample of those GeneRIFs to NLM, along with an explanation of how the sample had been generated, and a request that they be manually evaluated. Rather than evaluate the individual submissions, NLM responded by internally adopting the error categories that we suggested and implementing a number of aspects of our system into their own quality control process, as well as using some of our specific examples to train the indexing staff regarding what is "in scope" for GeneRIFs (Donna Maglott, personal communication).
5.3. In-house evaluation of precision

We constructed a stratified sample of system outputs by selecting the first fifteen unique outputs from each category. Two authors then independently judged whether each output GeneRIF should, in fact, be revised. Our inter-judge agreement was 100%, suggesting that the error categories are consistently applicable. We applied the most stringent possible scoring by counting any GeneRIF that either judge thought was incorrectly rejected by the system as a false positive. Table 5 gives the precision scores for each category.
Table 5. Precision on the stratified sample. For each error category, a random list of 15 GeneRIFs was independently examined by the two judges.
No.   Category                 True Positive    False Positive    Precision
1     Discontinued                        15                 0         100%
2     Misspellings                        15                 0         100%
3     Punctuation                         13                 2        86.7%
4     Computational methods               15                 0         100%
5     Similar GeneRIFs                    15                 0         100%
6     One-to-many                         15                 0         100%
7     Length constraint                    5                10        33.3%
8     Overall                             93                12        88.6%
6. Discussion and Conclusion The kinds of revisions carried out by human summarizers cover a wide range of levels of linguistic depth, from correcting typographic and spelling errors ([16]:37, citing [5]) to addressing issues of coherence requiring sophisticated awareness of discourse structure, syntactic structure, and anaphora and ellipsis ([ 16]:78—81, citing [18]). Automatic summary revision systems that are far more linguistically ambitious than the methods that we describe here have certainly been built; the various methods and heuristics that are described in this paper may seem simplistic, and even trivial. However, a number of the GeneRIFs that the system discovered were erroneous in ways that were far more serious than might be suspected from the nature of the heuristic that uncovered them. For example, of the fifteen outputs in the stratified sample that were suggested by the one-to-many text-to-PMID measure (category 6 in Table 2), six turned out to be cases where the GeneRIF text did not reflect the contents of the article at all. The articles in question were relevant to the Entrez Gene entry itself, but the GeneRIF text corresponded to only one of the two articles' contents, presumably due to a cutand-paste error on the part of the indexer (specifically, pasting the same text string twice). Similarly, as trivial as the "extra punctuation" measure might seem, in one of the fifteen cases the extra punctuation reflected a truncated gene symbol (sir-2.1 became -2.1). This is a case of erroneous content, and not of an inconsequential typographic error. The word length constraint, simple as it is, uncovered a GeneRIF that consisted entirely of the URL of a web site offering Hmong language lessons—perhaps not as dangerous as an incorrect characterization of the contents of a PubMed-indexed paper, but quite possibly a symptom of an as-yetunexploited potential for abuse of the Entrez Gene resource. The precision of the length constraint was quite low. Preliminary error analysis suggests that it could be increased substantially by applying simple language models to differentiate GeneRIFs that are perfectly good indicative summaries, but
279 poor informative summaries, such as REVIEW or 3D model (which were judged as true positives by the judges) from GeneRIFs that simply happen to be brief, but are still informative, such as regulates cell cycle or Interacts with SOCS-1 (both of which were judged as false positives by the judges). Our assessment of the current set of GeneRIFs suggests that about 2,900 GeneRIFs are in need of retraction or revision. GeneRIFs exhibit the two of the four characteristics of the primary scientific literature described in [8]: growth, and obsolescence. (They directly address the problem of fragmentation, or spreading of information across many journals and articles, by aggregating data around a single Entrez Gene entry; linkage is the only characteristic of the primary literature that they do not exhibit.) Happily, NLM control over the contents of the Entrez Gene database provides a mechanism for dealing with obsolescence: GeneRIFs actually are removed from circulation when found to be of low quality. We propose here a data-driven model of GeneRIF errors, and describe several techniques, modelled as automation of a variety of tasks performed by human summarizers as part of the summary revision process, for finding erroneous GeneRIFs. Though we do not claim that it advances the boundaries of summarization research in any major way, it is notable that even these simple summary revision techniques are robust enough that they are now being employed by NLM: versions of the punctuation, "similar GeneRIF," and length constraint (specifically, single words) have been added to the indexing workflow. Previous work on GeneRIFs has focussed on quantity—this paper is a step towards assessing, and improving, GeneRIF quality. NLM has implemented some of the aspects of our system, and has already corrected a number of the examples of substandard GeneRIFs that are cited here. 7. Acknowledgments This work was supported by NIH grant R01-LM008111 (LH). We thank Donna Maglott and Alan R. Aronson for their discussions of, comments on, and support for this work, and the individual NLM indexers who responded to our change suggestions and emails. Lynne Fox provided helpful criticism. We also thank Anna Lindemann for proofreading the manuscript. References 1. S. Afantenos, V. Karkaletsis, and P. Stamatopoulos. Summarization from medical documents: a survey. Artifi cial Intelligence in Medicine, 33(2):157-77; Feb 2005. Review 2. G. Bhalotia, P. I. Nakov, A. S. Schwartz and M. A. Hearst. Biotext report for the TREC 2003 genomics track. In Proceedings of The Twelfth Text REtrieval Conference, page 612,2003.
280 3. R. V. Binder. Testing Object-Oriented Systems: Models, Patterns, and Tools. AddisonWesley Professional, 1999. 4. K. B. Cohen, A. E. Dolbey, G. K. Acquaah-Mensah, and L. Hunter. Contrast and variability in gene names. In Proceedings of ACL Workshop on Natural Language Processing in the Biomedical Domain, pages 14-20. Association for Computational Linguistics. 5. E. T. Cremmins. The Art of Abstracting, 2nd edition. Information Resources Press, 1996. 6. H. Fang, K. Murphy, Y. Jin, J. S. Kim, and P. S. White. Human gene name normalization using text matching with automatically extracted synonym dictionaries. In Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology, pages 41-48. Association for Computational Linguistics. 7. GeneRIF: http://www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html 8. W. Hersh. Information Retrieval: a Health and Biomedical Perspective, 2nd edition. Springer-Verlag, 2006. 9. W. Hersh and R.T. Bhupatiraju. TREC genomics track overview. In Proceedings of The Twelfth Text REtrieval Conference, page 14, 2003. 10. L. Hirschman, M. Colosimo, A. Morgan, and A. Yeh. Overview of BioCreative Task IB: normalized gene lists. BMC Bioinformatics 6(Suppl. 1):S11, 2005. 11. P. Jackson and I. Moulinier. Natural Language Processing for Online Applications: Text Retrieval, Extraction, and Categorization. John Benjamins Publishing Co., 2002. 12. B. Jelier, M. Schwartzuemie, C. van der Fijk, M. Weeber, E. van Mulligen and B. Schijvenaars. Searching for GeneRIFs: concept-based query expansion and Bayes classifi cation. In Proceedings of The Twelfth Text REtrieval Conference, page 225, 2003. 13. D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, January 2000. 14. X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai and B. Schatz. Automatically generating gene summaries from biomedical literature. In Proceedings of Pacifi c Symposium on Biocomputing, pages 40-51, 2006. 15. Z. Lu, K. B. Cohen and L. Hunter. Finding GeneRIFs via Gene Ontology annotations. In Proceedings of Pacifi c Symposium on Biocomputing, pages 52-63, 2006. 16. I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001. 17. J. A. Mitchell, A. R. Aronson, J. G. Mork, L. C. Folk, S. M. Humphrey and J. M. Ward. Gene indexing: characterization and analysis of NLM's GeneRIFs. In Proceedings of AMI A 2003 Symposium, pages 460-464, 2003. 18. H. Nanba and M. Okumura. Producing more readable extracts by revising them. In Proceedings of the 18th International Congress on Computational Linguistics (COLING-2000), pages 1071-1075. 19. R. Rubinstein and I. Simon. MILANO - custom annotation of microarray results using automatic literature searches. BMC Bioinformatics, 6:12, 2005. 20. P. Ruch, R. Baud and A. Geissbuhler. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artifi cial Intelligence in Medicine, 29(2): 169-84, 2003.
EVALUATING THE AUTOMATIC MAPPING OF HUMAN GENE AND PROTEIN MENTIONS TO UNIQUE IDENTIFIERS
ALEXANDER A. MORGAN (1), BENJAMIN WELLNER (2), JEFFREY B. COLOMBE, ROBERT ARENS (3), MARC E. COLOSIMO, LYNETTE HIRSCHMAN
MITRE Corporation, 202 Burlington Road, Bedford, MA, 01730, USA
Email: [email protected]; [email protected]
We have developed a challenge task for the second BioCreAtlvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.
1. Background
The first Critical Assessment of Information Extraction in Biology's (BioCreAtIvE) Task 1B involved linking mentions of model organism genes and proteins in MEDLINE abstracts to their corresponding identifiers in three different model organism databases (MGD, SGD, and FlyBase). The task is described in some detail in [1], along with descriptions of many different approaches to the task in the same journal issue. There has been quite a bit of past work associating text mentions of human genes and proteins with unique identifiers, including the early work by Cohen et al. [2] and the AZURE system [3]. Very recently, Fang et al. [4] reported excellent results on a data set they created using one hundred MEDLINE abstracts. This widespread community interest in the issue and our experience with the first BioCreAtIvE motivated us to prepare another evaluation task for inclusion in the second BioCreAtIvE [5]. This task will require systems to link mentions of human genes and proteins with their corresponding EntrezGene (LocusLink) identifiers. We hope that researchers in this area can use this data set to compare techniques and gauge performance gains. It can also be used to address issues in the general portability of normalization techniques and to investigate the relationships between co-mentioned genes and proteins.
(1) Currently at Stanford Biomedical Informatics, Stanford University
(2) Also, the Department of Computer Science, Brandeis University
(3) Currently at the Department of Computer Science, University of Iowa
2. Task Definition
The most important part of evaluating system performance is, of course, a very careful definition of the task. The original Task IB required each system to provide a list of all the model organism database identifiers for the species-specific (mouse, fly or yeast) genes and gene products mentioned in a MEDLINE abstract. There are a number of possible uses for such a system, such as improved document retrieval for specific genes, data mining over gene/protein co-mentions, or direct support of relation extraction (e.g., protein-protein interaction) and/or attribute assignment (e.g., assignment of Gene Ontology annotations). The latter might be immediately useful to researchers attempting to analyze high throughput experiments, performing whole genome or comparative genomics analyses, or data-mining for relationship discovery, all of which require links to the unique identifiers. Our initial investigations into a human gene/protein task suggested that UniProt identifiers [6] might be a good target to which we might normalize mentions of human proteins and their coding genes, and we hoped that this might bring the task into closer alignment with other efforts such as BioCreAtlvE I Task 2 [7] which required associating GO codes with human proteins identified through protein identifiers. UniProt provides a unified set of protein identifiers and represents a great leap forward for bioinformatics research, but it contains many redundancies: different fragments of the same polypeptide, polypeptide sequences derived from the same gene that differ in non-synonymous polymorphisms, and alternate transcripts from the same gene all may have separate entries and unique identifiers. We eventually settled on EntrezGene identifiers as unique target identifiers, despite incomplete mappings of UniProt to EntrezGene identifiers and what can be a complex many-to-many (e.g. alternate transcripts and gene duplications) relationship between genes and proteins. As described in [8], our annotation viewed genes and their products as equivalent because experience has found their typical usage interchangeable and/or indistinguishable. This is, of course, a simplification for purposes of evaluation; we recognize that this distinction is important in other cases. A significant difference between the normalized gene list task (BioCreAtlvE Task IB) and general entity normalization/grounding is that each gene list is associated with the abstract as a whole, whereas general entity grounding requires
283 the annotation of each mention in the text. The advantage of the "gene list" approach is that it avoids the issue of how to delimit the boundaries when annotating gene and protein mentions [9]. This becomes more of a problem in normalization when mentions are elided under various forms of conjunction. For example, it is difficult to identify the boundaries for the names of the different forms of PKC in "PKC isoforms alpha, delta, epsilon and zeta". Then there is the more difficult example of ellipsis: "AKR1C1-AKR1C4". Clearly AKR1C2 and AKR1C3 are being included in this mention, and functional information extracted about that group should include them. Fang et al. [4] excluded these cases from consideration, but we feel that these are important instances that need to be annotated and normalized. Equally difficult is the large gray area in gene and protein nomenclature between a description and a name and the related question of what should be tagged. The text "Among the various proteins which are induced when human cells are treated with interferon, a predominant protein of unknown function, with molecular mass 56 kDa, has been observed" mentions the protein also known as "interferoninduced protein 56", but the text describes the entity rather than using the listed name derived from this description. Our compromise was to keep the gene list task, but to provide a richer data set that associates at least one text string with each entry in the gene list, a significant addition over the first BioCreAtlvE Task 1B. Polysemy in gene and protein names creates additional complexity, both within and between organisms [10]. Determination of the gene or protein being described may require the interpretation of the whole abstract - or several genes may be described with one "family name" term (see the Discussion section for further exploration of this issue). The particular species can be intentionally under-specified when the text is meant to refer to all the orthologues in relevant species, but in other cases, a name is meant to be highly species specific. For example: "Anoxia activates AMP-activated protein kinase (AMPK), resulting in the inhibition of biosynthetic pathways to conserve ATP. In anoxic rat hepatocytes or in hepatocytes treated with 5-aminoimidazole-4-carboxamide (AICA) riboside, AMPK was activated and protein synthesis was inhibited." The mention of the properties of AMPK in the first sentence is meant to be general and to include activity in humans, but the subsequent experimental evidence is, of course, in rats.
3. Corpus Construction
3.1. Abstract Collection

To identify a collection of abstracts with a high likelihood of mentions of human genes and proteins, we obtained the gene_association.goa_human file [11] on 10 October 2005. This provided us with 11,073 PubMed identifiers for journal articles likely to have mentions of human genes and proteins. We obtained abstracts for 10,730 of these. The file gene2pubmed obtained from NCBI [12] on 21 October 2005 was used, along with the GO annotations, to create the automatic/noisy annotations in the 5,000 abstracts set aside as a noisy training set as described in [8]. This is further described in the Evaluation of Noisy Training Data section. We selected our abstracts for hand annotation from the 5,730 remaining abstracts.

3.2. Lexicon Creation

The basic gene symbol and gene name information corresponding to each human EntrezGene identifier was taken from the gene_info file from NCBI [12]. This was merged with name, gene and synonym entries taken from UniProt [6]. Suffixes containing "HUMAN", "1_HUMAN", "H_HUMAN", "protein", "precursor", "antigen" were stripped from the terms and added to the lexicon as separate terms in addition to the original term. (A small sketch of this suffix handling appears after Table 1.) HGNC [13] symbol, name, and alias entries were also added. We identified the phrases most repeated across identifiers and those that had numerous matches in the 5000 abstracts of noisy training data; we then used these to create a short (381 term) list to remove the most common terms that were unlikely to be gene or protein names but which had entered the lexicon as full synonyms. Examples of entries in this list are "recessive", "neural", "Zeta", "liver", "glycine", and "mediator". This list is available from the CVS archive [5]. This left us with a lexicon of 32,975 distinct EntrezGene identifiers linked to a total of 163,478 unique terms. The majority of identifiers have more than one term attached (average 5.5), although 8,385 had only one. For example, identifier 1001 has the following synonyms: "PCAD; CDHP; CDH3; cadherin 3, type 1, P-cadherin (placental); HJMD". It is important to note that many of these terms are unlikely to be used as mentions in abstracts for the given proteins and genes. Many of the terms/synonyms were not unique among the identifiers, with the terms often being shared across a handful of identifiers (Table 1). Sometimes this reflects noise inherited from the source databases; the most egregious example is "hypothetical" which shows up as a name for 89 genes. Similarly, "human" (alone) shows up 15 times, "g protein coupled receptor" 12 times, and "seven transmembrane helix receptor" 30 times.
Each normalized (Section 4) phrase included as a synonym in this relatively noisy lexicon is linked to an average of 1.1 different unique identifiers, although 80% of phrases link to only one identifier. These synonyms average 16.5 characters in length if whitespace is removed.

Table 1. Lexicon statistics
Unique Gene IDs                      32,975
Unique Un-Normalized Terms          177,200
Unique Normalized Terms             163,478
Avg Term Length (Characters)          16.51
Avg Gene Identifiers per Term          1.12
Avg Term Length (Words)                2.17
Avg Terms per Identifier               5.55
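As a rough illustration of the suffix handling described in Section 3.2, the snippet below strips the listed suffixes from a synonym and keeps both forms. The suffix list comes from the text, but the function itself and the example entry are hypothetical reconstructions, not the authors' code.

```python
# Illustrative suffix stripping for lexicon terms (Section 3.2).
SUFFIXES = ["HUMAN", "1_HUMAN", "H_HUMAN", "protein", "precursor", "antigen"]

def expand_term(term):
    """Return the original term plus any suffix-stripped variants."""
    variants = {term}
    for suffix in SUFFIXES:
        if term.endswith(suffix):
            stripped = term[: -len(suffix)].rstrip("_- ")
            if stripped:
                variants.add(stripped)
    return variants

# Hypothetical UniProt-style entry name; both forms enter the lexicon.
print(expand_term("CDH3_HUMAN"))   # {'CDH3_HUMAN', 'CDH3'}
```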
3.3. Annotation Tool and Annotation Process We developed a simple annotation tool using dynamic webpages with PHP and MySQL to support the creation of the normalized gene lists and extraction of the associated mention excerpts from the text. Annotators could annotate via their own web browsers. We could also make rapid changes to the interface as soon as they were requested without needing to update anything but the scripts on the server. The simple annotation guidelines and the PHP scripts used for the annotation are available for download from the Sourceforge CVS archive [5], The interface presented the plain text of the title and abstract to the annotators, along with suggested annotations (based on the automatic/noisy process). Using these resources, annotators had to provide the EntrezGene identifiers and supporting text for all mentions of human genes and proteins. All annotations then went through a review process to examine abstracts marked with comments and to merge the differences between annotators before inclusion in the gold standard set. A total of 300 abstracts were annotated for the freely distributed training set, although 19 were removed for a variety of reasons, such as, having mentions which could not be normalized to EntrezGene, leaving 281 for distribution. The annotators found of an average of 2.27 different human genes mentioned per abstract. We have annotated another -263 for use as an evaluation set. We plan to correct errors in these annotations based on pooling of the participants' submissions, as was done in the previous BioCreAtlvE [8]. The Sourceforge CVS archive will allow us to track corrections to these datasets [5].
3.4. Inter-annotator Agreement
We studied the agreement between different annotators on the same abstracts. The annotation was done by three annotators (two with PhD's in biological sciences, one with an MS; none are specialists in human biology, but all had previous experience in annotation). There was one annotator (primary) who did annotations for all abstracts. Our first pass of agreement studies was done on the first abstracts in the training set and was done mostly to check our annotation guidelines. Two annotators annotated the same 30 abstracts. There were 71 annotations (same EntrezGene identifiers for the abstract) in common and 7 differences (91% agreement). A second agreement experiment was performed with 26 new abstracts. There was only 87% agreement, but all disagreements were missed mentions or incorrect normalizations by the non-primary annotator. Unfortunately, these small sample sizes can only be suggestive of the overall level of agreement.
4. Characterizing the Data
In order to better characterize the properties of this dataset and task, we performed some baseline experiments, described below, to generate the list of EntrezGene identifiers for each abstract using the lexicon. We evaluated this using simple match against the gold standard annotations. For matching the terms from the lexicon, we ignored case and any punctuation or internal whitespace in the terms matched to the lexicon, but required match of start and end token boundaries as described in [14]. (A sketch of this normalization appears after Table 2.)

Table 2. Properties of the Data
Experiment                     True Positive    False Positive    False Negative    Precision    Recall
Noisy Training Data Quality              348                49               292        0.877     0.544
Coverage of Lexicon                      530             7,941               110        0.063     0.828
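A minimal sketch of the matching normalization just described: case, punctuation, and internal whitespace are ignored, while start and end token boundaries are respected. The tokenization, the term-length cap, and the toy lexicon are illustrative assumptions, not the evaluation code used in the paper.

```python
import re

def normalize(term):
    """Lower-case and drop punctuation/whitespace, as described above."""
    return re.sub(r"[\W_]+", "", term.lower())

def find_gene_ids(abstract_text, lexicon):
    """lexicon: dict mapping normalized term -> set of EntrezGene IDs.
    Matches must start and end on token boundaries of the abstract."""
    tokens = abstract_text.split()
    found = set()
    max_len = 8                      # assumed cap on term length in tokens
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            key = normalize(" ".join(tokens[i:j]))
            if key in lexicon:
                found |= lexicon[key]
    return found

# Toy usage with a hypothetical two-entry lexicon; note that the
# paraphrase "kappa opioid receptor" fails to match the lexicon form,
# while the exact acronym "KOR" succeeds, as discussed in Section 4.2.
lex = {normalize("opioid receptor, kappa 1"): {4986},
       normalize("KOR"): {4986}}
print(find_gene_ids("The kappa opioid receptor (KOR) was cloned.", lex))
```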
4.1. Evaluation of Noisy (Automatically Generated) Training Data

We wanted to estimate the quality of the noisy training data and to evaluate our assumption that the document level annotations from the gene2pubmed file were indicative of a high likelihood of the mention of those genes in the abstract. To do this, we evaluated the gene lists derived from the gene2pubmed file (automatic/noisy data process) against those derived from human annotation (see Table 2). However, many genes may be mentioned in the abstract and paper but may not be included in the gene2pubmed file, causing our noisy training data to systematically underreport
287 genes mentioned, and we estimate from this result that only half of all genes mentioned are included in the automatic/noisy data annotations (recall 0.544). 4.2. Evaluating the Coverage of the Lexicon We also evaluated the coverage of the lexicon by using it to do simple pattern matching. This mirrors some of our early experiments in developing normalized gene lists for Drosophila melanogaster [15]. Our goal was to estimate a recall ceiling on performance for systems requiring exact match to the lexicon. The recall of 0.828 clearly shows the limits of the simple lexicon (Table 2). This demonstrates the need to extend exact lexical match beyond such simple rules as ignoring case, punctuation and white space. In some cases, very small affixes (e.g. h-, -p, -like), either in the lexicon or the text, caused a failure to match. There were numerous cases of acronyms, often embedded in longer terms, which caused problems ("actinin-1" vs. "ACTN1" or "GlyR alpha 1" vs. "Glycine receptor alpha-1 chain precursor" or "GLRA1"). The various modifiers indicating subtypes were a serious problem, e.g. "collagen, type V, alpha 1"; modifiers such as "class II", "beta subtype", "type 1", and "mu 1" varied in orthography and placement, and the modifier " 1 " is often optional. Conjunctions such as "freacl-freac7" are particularly costly from an evaluation perspective since it can count as several false negatives at once. There was a considerable amount of name paraphrase (see Discussion section), involving word ordering and term substitutions or insertions and deletions. This arises because the long phrases in the lexicon are often more descriptive than nominal, although the associated acronyms can give some indication as to how a mention might actually occur in text. For example, the text contains "kappa opioid receptor", whereas the lexicon contains "KOR" and "opioid receptor, kappa 1"). Lan Aronson has investigated these issues in term variation while mapping concepts to text extensively [16]. Interestingly, self-embedded terms (e.g. "insulin-like growth factor-1 (IGF-I) receptor") seem to be a relatively rare problem at the level of the whole abstract. As expected, the precision based on lexical pattern matching (Table 2, row 2) was very low due to false positive matches of terms in the lexicon against common English terms, ambiguous acronyms, and so forth. 4.3. Biological Context of Co-Mentioned Genes and Proteins As an example of how this dataset might be used outside of the evaluation, we looked at the biological relationships between genes and proteins which are mentioned together in the same abstracts. Our experience annotating the abstracts
indicated that genes or proteins are typically co-mentioned because of sequence homology and/or some functional relationship (e.g., interaction), although cell markers (e.g., CD4) may be mentioned in a variety of contexts. Many sophisticated techniques have arisen for comparing genes based on functional annotations and sequence, but for this initial analysis we intentionally used something naive and simple. We computed two different similarity measurements for each pair of genes mentioned together in our dataset. For the sequence similarity computation, we used BioPython's pairwise2 function [17]: pairwise2.align.globalxs(seq1, seq2, -1, -.1, penalize_end_gaps=0, score_only=1). For the sequence, we used the longest protein RefSeq for each gene. For a measure based on functional annotations, we computed the Jaccard set similarity (1 - Tanimoto distance) for the set of all GO annotations for each gene:

Set Similarity = |S1 ∩ S2| / (|S1| + |S2| - |S1 ∩ S2|)
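As a rough sketch of these two computations (assuming Biopython is available and that protein sequences and GO annotation sets have already been retrieved; the variable names and example identifiers are hypothetical, not part of the original pipeline):

```python
from Bio import pairwise2

def sequence_similarity(seq1, seq2):
    # Global alignment with identity scoring, gap open -1, gap extend -0.1,
    # unpenalized end gaps; score_only=1 returns just the alignment score.
    return pairwise2.align.globalxs(seq1, seq2, -1, -0.1,
                                    penalize_end_gaps=0, score_only=1)

def go_set_similarity(go_terms1, go_terms2):
    # Jaccard set similarity (= Tanimoto coefficient) between two sets of GO codes.
    s1, s2 = set(go_terms1), set(go_terms2)
    if not s1 and not s2:
        return 0.0
    overlap = len(s1 & s2)
    return overlap / (len(s1) + len(s2) - overlap)

# Hypothetical usage with pre-fetched data:
# seq_a, seq_b = refseq_protein["BRCA1"], refseq_protein["BRCA2"]
# print(sequence_similarity(seq_a, seq_b))
# print(go_set_similarity(go_annotations["BRCA1"], go_annotations["BRCA2"]))
```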
We excluded all GO codes that had an accompanying qualifier, which for human genes is restricted to "contributes_to", "colocalizes_with", and "NOT". This GO-derived similarity measure is a poor one for many reasons, including mixing experimental and homology-based GO codes, ignoring the structure of GO, and ignoring the fact that the three main hierarchies are very different. Figure 1 shows the result of computing these similarity measures for the 737 pairs of genes that are co-mentioned in our hand annotated training set and for 1,630 pairs of randomly selected genes which are explicitly not co-mentioned. Of the 737 co-mentioned pairs, 100 have both similarity measures above 0.3, while none of the 1,630 non co-mentioned pairs do. This suggests that in the context of the evaluation, even simple biological knowledge may be helpful in such tasks as disambiguation (dealing with polysemy) for normalization, or in ascertaining whether co-mention suggests functional and/or physical interaction or simply homology. It is hoped that this dataset can encourage greater exploration of the use of biological knowledge to improve text mining.

Figure 1: Biological similarity between co-mentioned genes vs. not co-mentioned genes. A) Co-mentioned;
B) NOT Co-mentioned. (Scatter plots; x-axis: GO Similarity, 0.0-1.0.)
5. Discussion
It is interesting to compare this new corpus with Task 1B of BioCreAtIvE 1 for insights into the portability of normalization techniques. One set of measures in Table 3 seems to indicate that human may be easier than mouse: it has over twice the number of terms for each identifier, many fewer unique identifier targets, and only slightly more ambiguity.

Table 3: A comparison of gene mention normalization

                                            Human     Mouse     Yeast      Fly
Noisy data recall                            0.54      0.55      0.86     0.81
Noisy data approach precision                0.86      0.99      0.99     0.86
Max recall approach recall                   0.83      0.83      0.93     0.85
Max recall approach precision                0.06      0.19      0.33     0.07
Average synonym length in words              2.17      2.77      1.00     1.47
Number of unique IDs                       32,975    52,494     7,928   27,749
Average # synonyms/identifier                5.55      2.48      1.86     2.94
Average # identifiers/synonym (ambiguity)    1.12      1.02      1.01     1.09
BioCreAtIvE 1 max submitted F-measure           -      0.79      0.92     0.82

However, this does not really represent how the terms in the lexicon map to the text. The synonyms in the model organism databases are drawn from text, whereas the lexicon that we created for human genes includes database identifiers or descriptive forms that have very little overlap with actual text mentions. This overestimates the number of useful term variants in the lexicon and probably underestimates ambiguity in practice. The effects of polysemy/ambiguity in gene/protein mention identification are discussed in detail in [10]. An important contrast between human and mouse nomenclature on the one hand, and yeast and fly on the other, is that the nomenclature is often much more descriptive than nominal, as mentioned in the Task Definition section. In Drosophila, the gene rather whimsically named "Son of sevenless" ("Sos") is named just that. It would never be called "child of sevenless" or "Sevenless' son". However, the names of human genes may vary quite a bit. The Alzheimer's disease related "APP" gene is generally known as "beta-amyloid precursor protein", although "beta-amyloid precursor polypeptide" may be used as well. Many other equivalent transformations are also acceptable, such as "amyloid beta-protein precursor" and "betaAPP". In general, any semantically equivalent description of the gene or protein may be used as a name. However, the regularity of the allowed transformations suggests that it might be possible to design or automatically learn transformation rules to permit better matching, something investigated by past researchers [18]. As Vlachos et al. observed [19], in biomedical text there is a high occurrence of families of genes and proteins being mentioned by a single term, such as: "Mxi1
belongs to the Mad (Mxi1) family of proteins, which function as potent antagonists of Myc oncoproteins". In future work in biomedical entity normalization, we suggest that normalizing entity mentions to family mentions may be an effective way to support other biomedical text mining tasks. Possibly the protein families in InterPro [6] could be used as normalization targets for mentions of families. For example, the mention of "Myc oncoproteins" could link to InterPro:IPR002418. This would enable information extraction systems that extract facts (relations, attributes) on gene families to attach those properties to all family members.
6. Conclusion
In summary, we have described the motivation and development of a dataset for evaluating the automatic mapping of mentions of human genes/proteins to unique identifiers, which will be used as part of the second BioCreAtIvE. We have elucidated some of the properties of this data set, and made some suggestions about how it may be used in conjunction with biological knowledge to investigate the properties of co-mentioned genes and proteins. Anonymized submissions by evaluation participants, along with the evaluation set gold standard annotations, will be made publicly available [5] after the workshop, tentatively scheduled for the spring of 2007.

7. References

1. Hirschman, L., et al., Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005. 6 Suppl 1: p. S11.
2. Cohen, K.B., et al., Contrast and variability in gene names. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pp. 14-20. Association for Computational Linguistics. 2002.
3. Podowski, R.M., et al., AZuRE, a scalable system for automated term disambiguation of gene and protein names. Proc IEEE Comput Syst Bioinform Conf, 2004: p. 415-24.
4. Fang, H., et al., Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries. In Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. 2006, Association for Computational Linguistics: New York, New York. p. 41-48.
5. http://biocreative.sourceforge.net/, BioCreAtIvE 2 Homepage.
6. Wu, C.H., et al., The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res, 2006. 34(Database issue): p. D187-91.
7. Blaschke, C., et al., Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 2005. 6 Suppl 1: p. S16.
8. Colosimo, M.E., et al., Data preparation and interannotator agreement: BioCreAtIvE Task 1B. BMC Bioinformatics, 2005. 6 Suppl 1: p. S12.
9. Tsai, R.T., et al., Various criteria in the evaluation of biomedical named entity recognition. BMC Bioinformatics, 2006. 7: p. 92.
10. Tuason, O., et al., Biological nomenclatures: a source of lexical knowledge and ambiguity. Pac Symp Biocomput, 2004: p. 238-49.
11. http://www.geneontology.org/, The Gene Ontology.
12. ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, NCBI Gene FTP site.
13. Wain, H.M., et al., Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res, 2004. 32(Database issue): p. D255-7.
14. Wellner, B., Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. 2005, Association for Computational Linguistics: Detroit. p. 1-8.
15. Morgan, A.A., et al., Gene name identification and normalization using a model organism database. J Biomed Inform, 2004. 37(6): p. 396-410.
16. Aronson, A.R., The effect of textual variation on concept based information retrieval. Proc AMIA Annu Fall Symp, 1996: p. 373-7.
17. http://biopython.org, BioPython Website.
18. Hanisch, D., et al., Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput, 2003: p. 403-14.
19. Vlachos, A., et al., Bootstrapping the Recognition and Anaphoric Linking of Named Entities in Drosophila Articles. Pac Symp Biocomput, 2006. 11: p. 100-111.
MULTIPLE APPROACHES TO FINE-GRAINED INDEXING OF THE BIOMEDICAL LITERATURE

AURELIE NEVEOL 1,2, SONYA E. SHOOSHAN 1, SUSANNE M. HUMPHREY 1, THOMAS C. RINDFLESCH 1, ALAN R. ARONSON 1

1 National Library of Medicine, NIH, Bethesda, MD 20894, USA
2 Equipe CISMeF, Rouen, France
The number of articles in the MEDLINE database is expected to increase tremendously in the coming years. To ensure that all these documents are indexed with continuing high quality, it is necessary to develop tools and methods that help the indexers in their daily task. We present three methods addressing a novel aspect of automatic indexing of the biomedical literature, namely producing MeSH main heading/subheading pair recommendations. The methods (dictionary-based, post-processing rules and Natural Language Processing rules) are described and evaluated on a genetics-related corpus. The best overall performance is obtained for the subheading genetics (70% precision and 17% recall with post-processing rules, 48% precision and 37% recall with the dictionary-based method). Future work will address extending this work to all MeSH subheadings and a more thorough study of method combination.
1. Introduction
1.1. Indexing the biomedical literature

To ensure efficient retrieval of the ever-increasing number of articles in the U.S. National Library of Medicine's (NLM's) MEDLINE® database, these documents must be systematically stored and indexed. In MEDLINE, the subject matter of articles is described with a list of descriptors selected from NLM's Medical Subject Headings (MeSH®). MeSH contains about 24,000 main headings covering specific concepts in the biomedical domain such as diseases, body parts, etc. It also contains 83 subheadings that denote broad areas in biomedicine such as immunology or genetics. Subheadings can be coordinated with a main heading in order to refer to a concept in a more specific way. NLM indexers select for each article an average of ten to twelve MeSH main headings (e.g., Williams Syndrome) or main heading/subheading pairs (e.g., Williams Syndrome/genetics). The indexing task is time consuming and requires skilled, trained individuals. In order to assist indexers in their daily practice, the NLM's Indexing Initiative [1] has investigated automatic indexing methods, which led to the development of the Medical Text Indexer (MTI) [2]. MTI is a software tool producing indexing recommendations in the form of a list of stand-alone main
headings (i.e. not associated with subheadings) shown on request to the indexers while they work on a record in the MEDLINE Data Creation and Maintenance System (DCMS). Other work on the automatic assignment of MeSH descriptors to medical texts in English has also focused on stand-alone main headings [3-4]. While the indexing resulting from some of these automatic systems has been shown to approach human indexing performance as measured by retrieval [5], there is a need for automatic means to provide finer-grained indexing recommendations, namely main heading/subheading pairs in addition to stand-alone main headings. In fact, there are both theoretical and practical reasons for this effort. From a theoretical point of view, the MeSH indexing manual [6] states that indexers must choose descriptors that reflect the content of an article by first selecting correct main headings and second by attaching the appropriate subheadings. Consequently, selecting an isolated main heading where a main heading/subheading pair should have been assigned is, strictly speaking, erroneous, or at best incomplete. On the practical side, indexers do use both main headings and main heading/subheading pairs when indexing a document. Therefore, stand-alone main heading recommendations, while useful, will always need to be completed by attaching subheadings where appropriate. The task of assigning MeSH descriptors to a document can be viewed as a multi-class classification problem where each document will be assigned several "classes" in the form of MeSH descriptors. When assigning MeSH main headings [4, 7] the scale of the classification problem is 23,883. Now, if one attempts to assign MeSH main heading/subheading pairs, the number of classes increases to 534,981. Many machine learning methods perform very well on binary classes but prove more difficult to apply successfully on larger scale problems. As regards MeSH main heading classification, the hierarchical relationships between the classes have been used to reduce the complexity of the problem [4, 7]. Previous work on producing automatic MeSH pair recommendations that relied on dictionary and rule-based methods seemed promising [10]. For these reasons, we are investigating similar methods here.

1.2. Genetics literature

Following the rapid developments of genetics research in the past twenty years, the volume of genetics-related literature has grown accordingly. While genetics
literature represented about 6% of MEDLINE records for the year 1985*, it represents over 19% of MEDLINE records for 2005†. In this context, it seems that providing fine-grained indexing recommendations for genetics literature is particularly important, as it will impact a significant portion of the biomedical literature. Therefore, we have elected to concentrate our effort in this subdomain for our preliminary work investigating automatic methods of providing MeSH pair indexing recommendations. This led us to focus on the subheadings genetics, immunology and metabolism, which were found to be prevalent in the MeSH indexing of our genetics test corpus (see section 2.4).

1.3. Objective and approach

This paper presents the various methods we investigated to automatically identify MeSH main heading/subheading pairs from the text (title and abstract) of articles to be indexed for MEDLINE. The ultimate goal of this research is to add subheading-related features to DCMS when displaying recommendations to NLM indexers, in order to save time during the indexing process. A previous study of MTI usability showed that the possibility of selecting recommendations from a pick list saved look-up and typing time [8]. The ideal time-saving mechanism for subheading attachment would be to include relevant pairs in the current list of main headings available for selection. However, this solution is only viable if the precision of such recommendations is sufficiently high. The possible obstacle that we foresee to including pair recommendations in the current pick list is that high precision for pair recommendations might be difficult to achieve without any human input throughout the process. Work in the area of computer-assisted translation [9] has shown the usefulness of interactive systems in the context of highly demanding cognitive tasks such as translation or indexing. For this reason, we are considering the possibility of either dynamically showing related pair recommendations once the indexer selects a main heading for the record, or highlighting the most likely subheadings for the current record when indexers are viewing the list of allowable subheadings for a given main heading that they selected. The remainder of this paper will address the difficult task of producing the recommendations themselves.

* 19,348 citations retrieved by the query genetics AND 1985 [dcom] AND MEDLINE [sb] compared to 313,638 records retrieved by the query 1985 [dcom] AND MEDLINE [sb] on 07/12/06.
† 114,530 citations retrieved by the query genetics AND 2005 [dcom] AND MEDLINE [sb] compared to 598,217 records retrieved by the query 2005 [dcom] AND MEDLINE [sb] on 07/12/06.
2. Material and methods
In this section, we describe the three methods we investigated to identify main heading/subheading pairs from medical text. We also introduce the genetics corpus we used to evaluate the methods.

2.1. Baseline dictionary-based method

The first method we considered consists of identifying main headings and subheadings separately for a given document and then attempting to pair them. Main headings are retrieved with the Medical Text Indexer [2] and subheadings are retrieved by looking up words from the title and abstract in a manually built dictionary in which each entry contains a subheading and a corresponding term or expression that is likely to represent the subheading in text. These terms are mainly derived from inflectional and derivational forms of the subheadings. They were obtained manually and tested on a general training corpus composed of a random 3% selection of MEDLINE 2004. Candidate terms were added to the dictionary if they benefited the method's performance on the training corpus. For example, gene, genes, genetic, genetics, genetical, genome and genomes are terms corresponding to /genetics. The dictionary contains 227 entries for all 83 subheadings, including 10 for /genetics. To obtain the pairs, the subheadings retrieved by the dictionary are coordinated with the main headings retrieved, if applicable. For each main heading, MeSH defines a set of subheadings called "applicable qualifiers" that can be coordinated with it (e.g. /genetics is applicable to Carcinoma, Renal Cell but not Odds Ratio). In the dictionary method, all the legal pairs that can be assembled from the sets of main headings and subheadings retrieved are recommended. For example, two occurrences of the dictionary entry genes were found in the abstract of MEDLINE record 15319295, which means that /genetics was identified for this record. Attempts were made to attach /genetics to each of the twelve main headings recommended by MTI for this record, including Carcinoma, Renal Cell and Odds Ratio. The pair Carcinoma, Renal Cell/genetics was recommended because /genetics is an allowable qualifier for Carcinoma, Renal Cell. However, /genetics is not an allowable qualifier for Odds Ratio; therefore no other pair recommendation was made.
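A minimal sketch of this pairing step (the dictionary contents, the MTI output, and the table of allowable qualifiers are all assumed to be available as plain Python data structures; the names below are hypothetical and not the authors' implementation):

```python
def recommend_pairs(text, subheading_dict, mti_main_headings, allowable_qualifiers):
    """Dictionary-based pair recommendation: look up subheading trigger terms
    in the title/abstract, then coordinate them with MTI main headings."""
    tokens = set(text.lower().split())
    # Subheadings whose trigger terms (e.g. "gene", "genes" -> genetics) occur in the text
    found_subheadings = {sh for sh, triggers in subheading_dict.items()
                         if tokens & triggers}
    pairs = []
    for mh in mti_main_headings:
        for sh in found_subheadings:
            # Only recommend legal pairs, i.e. sh is an allowable qualifier for mh
            if sh in allowable_qualifiers.get(mh, set()):
                pairs.append((mh, sh))
    return pairs

# Hypothetical usage:
# subheading_dict = {"genetics": {"gene", "genes", "genetic", "genetics", "genome"}}
# allowable_qualifiers = {"Carcinoma, Renal Cell": {"genetics"}, "Odds Ratio": set()}
# recommend_pairs(abstract_text, subheading_dict,
#                 ["Carcinoma, Renal Cell", "Odds Ratio"], allowable_qualifiers)
```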
2.2. Indexing rules

The two methods detailed in this section are based on indexing practice, sometimes expressed in MeSH annotations. In previous work on the indexing of medical texts in French [10], indexing rules were derived from interviews with indexers. Similar rules were also available in the MedIndEx knowledge base
[11]. To build the sets of rules used here, we adapted existing rules [10-11] and manually created new rules. The rules were divided into two groups.

Post-processing rules

Post-processing (PP) rules build on a pre-existing set of indexing terms (i.e., the main heading recommendations from MTI), and enrich it by expanding on the underlying concepts denoted by the indexing terms within that set. Twenty-nine of these rules are currently implemented for /genetics (as well as 11 for /immunology and 8 for /metabolism). Rules that were created in addition to the existing rules from MedIndEx and the French system (such as the example shown in Figure 1) were evaluated using MEDLINE data. Specifically, we computed an estimated precision equal to the number of citations indexed with the trigger terms and the recommended pair over the number of citations indexed with the trigger terms*. Only rules with an estimated precision over 0.6 were considered for inclusion in the rule sets. According to the sample rule shown in Figure 1, a pair recommendation shall be triggered by existing MTI recommendations including the main heading Mutation as well as a <DISEASE>§ term. Since Mutation is a genetics concept, an inference is made that /genetics should be attached to the disease main heading. For example, both main headings Mutation and Pancreatic Neoplasms are recommended by MTI for the MEDLINE record 14726700. As Pancreatic Neoplasms is a disease term, the rule will be applied and the pair Pancreatic Neoplasms/genetics will be recommended.

If the main heading Mutation and a <DISEASE> term appear in the indexing recommendations then the pair <DISEASE>/genetics should also be used.
Figure 1. Sample post-processing rule for the subheading genetics
* For the sample rule shown in Figure 1, the estimated precision was 0.67. (On 09/06/06, the query mutation [mh] AND (diseases category/genetics [mh] OR mental disorders/genetics [mh]) retrieved 144,698 citations while mutation [mh] AND (diseases category [mh] OR mental disorders [mh]) retrieved 216,749 citations.)
§ DISEASE refers to any phrase that points to a MeSH main heading belonging to the diseases or mental disorders categories.
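One way to picture a post-processing rule of this kind (purely illustrative; the category lookup and the rule encoding are assumptions, not the authors' implementation):

```python
def apply_pp_genetics_rule(mti_headings, mesh_categories):
    """Sketch of the Figure 1 rule: if Mutation and a disease-category main heading
    are both recommended, also recommend <disease heading>/genetics."""
    pairs = []
    if "Mutation" in mti_headings:
        for mh in mti_headings:
            # mesh_categories maps a main heading to its MeSH tree category
            if mesh_categories.get(mh) in ("diseases", "mental disorders"):
                pairs.append((mh, "genetics"))
    return pairs

# Hypothetical usage (cf. MEDLINE record 14726700):
# apply_pp_genetics_rule(["Mutation", "Pancreatic Neoplasms"],
#                        {"Pancreatic Neoplasms": "diseases"})
# -> [("Pancreatic Neoplasms", "genetics")]
```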
Natural Language Processing rules

Natural Language Processing (NLP) rules use cues from the title or abstract of an article to infer pair recommendations. A sample NLP rule is shown in Figure 2. In the original French system, this type of rule was implemented by a set of transducers that exploited information on each term's semantic category (DISEASE, etc.) stored in an integrated electronic MeSH dictionary. Although very efficient, this method is also heavily language-dependent. For English, such advanced linguistic analysis of medical corpora is performed by NLM's SemRep [12], a tool that is able to identify interactions between medical entities based on domain knowledge from the Unified Medical Language System® (UMLS®).

If a phrase such as "<GENE**> is associated with <DISEASE>" appears in text then the pair <DISEASE>/genetics should also be used.
Figure 2. Sample Natural Language Processing rule for the subheading genetics
Specifically, SemRep retrieves UMLS triplets composed of two concepts from the UMLS Metathesaurus® together with their respective UMLS Semantic Types (STs) and the relation between them, according to the UMLS Semantic Network. Hence, phrases corresponding to the pattern of the sample rule presented in Figure 2 would be extracted by SemRep as the triplet (gngm ASSOCIATED_WITH dsyn), where "gngm" denotes the ST "Gene or Genome" and "dsyn" denotes the ST "Disease or Syndrome". We can infer from this that there is an equivalence between the semantic triplet (gngm ASSOCIATED_WITH dsyn) and the MeSH pair <DISEASE>/genetics, where "dsyn" and <DISEASE> refer to the same entity. In this way, the NLP rules were used to obtain a set of equivalencies between these UMLS triplets and MeSH pairs. Subsequently, a restrict-to-MeSH algorithm [13] was used to translate UMLS concepts to their MeSH equivalents. For example, the phrase "Association of a haplotype of matrix metalloproteinase (MMP)-1 and MMP-3 polymorphisms with renal cell carcinoma" occurring in the MEDLINE record 15319295 was annotated by SemRep with the triplet (gngm ASSOCIATED_WITH neop††), where the "Gene or Genome" was MMP and the "Neoplastic Process" ("neop") was Renal Cell Carcinoma. The latter UMLS concept can be restricted to its MeSH equivalent Carcinoma, Renal Cell, and the pair Carcinoma, Renal Cell/genetics is then recommended for the indexing.

** GENE refers to any phrase that points to a MeSH main heading belonging to the GENE sub-hierarchy within the GENETIC STRUCTURES hierarchy.
†† In the Semantic Types hierarchy, "neop" is a descendant of "dsyn". By inheritance, rules that apply to a given Semantic Type also apply to its descendants.
In the context of the genetics domain, we also use triplets retrieved by SemGen [14], a variant of SemRep specifically adapted to the identification of Gene-Gene and Gene-Disease interactions.

2.3. Combination of methods

In an attempt to assess the complementarity of the methods, we also evaluated the recommendations provided by any two methods. The combination consisted in examining all the recommendations obtained from two methods, and selecting only the concurring ones, if any. For example, the pairs Ascomycota/genetics, Capsid Proteins/genetics, RNA Viruses/genetics and Totivirus/genetics were recommended by the post-processing rules method for citation 15845253, while Viruses/genetics, RNA Viruses/genetics and Totivirus/genetics were recommended by the NLP rules for the same citation. Only the common pairs RNA Viruses/genetics and Totivirus/genetics are selected by the combination of the two methods. In this case, the two pairs selected by combination were used to index the document in MEDLINE. Two of the three discarded pairs (Ascomycota/genetics and Viruses/genetics) were not used by the indexers, while the other one (Capsid Proteins/genetics) was.

2.4. Test corpus

All three methods (baseline dictionary-based, PP rules, NLP rules) were tested on a corpus composed of genetics-related articles selected from all citations indexed for MEDLINE in 2005. In order to avoid bias, the selection was not directly based on whether the articles were indexed with the subheading genetics. Instead we applied NLM's Journal Descriptor Indexing tool, which categorized the citations according to Journal Descriptors and also according to Semantic Types [15]. This categorization provided an indication of the biomedical disciplines discussed in the articles. For our genetics-related corpus, we selected citations that met either of these criteria:
• "Genetics" or "Genetics, Medical" were among the top six Journal Descriptors
• "genf" (Gene Function) or "gngm" (Gene or Genome) were among the top six Semantic Types
A total of 84,080 citations were collected and used to test the methods presented above. At least one of the subheadings genetics, immunology and metabolism appears in 53,903 of the corpus citations.
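To make the intersection step of Section 2.3 concrete, a minimal sketch using the example recommendations quoted above (the tuple encoding of pairs is an assumption):

```python
def combine(recs_a, recs_b):
    """Intersection-based combination of two methods' pair recommendations (Section 2.3)."""
    return sorted(set(recs_a) & set(recs_b))

# Example from citation 15845253 in the text:
pp_recs = [("Ascomycota", "genetics"), ("Capsid Proteins", "genetics"),
           ("RNA Viruses", "genetics"), ("Totivirus", "genetics")]
nlp_recs = [("Viruses", "genetics"), ("RNA Viruses", "genetics"),
            ("Totivirus", "genetics")]
print(combine(pp_recs, nlp_recs))
# [('RNA Viruses', 'genetics'), ('Totivirus', 'genetics')]
```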
3. Results
3.1. Independent methods

Table 1 shows the performance of the methods of pair recommendation presented in section 2. For each method, we detail the results obtained for /genetics, /immunology and /metabolism. We also indicate the overall figures (All) for the total number of recommendations obtained (Nb_rec), the total number of citations impacted (Nb_cit), the number of recommendations that were selected by MEDLINE indexers (Nb_rec+), the precision (PREC) and the recall (REC). Precision corresponds to the number of recommendations that were actually used by MEDLINE indexers over the total number of recommendations provided by the methods. Recall corresponds to the number of recommendations that were used by the indexers over the total number of pairs that were used by the indexers.

Table 1. Performance of MeSH pair recommendation
Method              Nb_rec    Nb_rec+   Nb_cit    PREC    REC
Dictionary (GE)     97,553    46,804    29,632    0.48    0.3663
Dictionary (IM)      6,691     2,326     1,629    0.35    0.1095
Dictionary (ME)      5,317     2,166     1,577    0.41    0.0200
Dictionary (All)   109,561    51,296    31,476    0.47    0.1993
PP (GE)             31,164    21,752    16,441    0.70    0.1703
PP (IM)              1,451     1,048     1,027    0.72    0.0493
PP (ME)             25,823    13,578    10,391    0.53    0.1253
PP (All)            58,438    36,378    23,184    0.62    0.1413
NLP (GE)             2,480     1,566     2,327    0.63    0.0123
NLP (IM)                97        26        91    0.27    0.0012
NLP (ME)                21         3        17    0.33    0.0000
NLP (All)            2,598     1,605     2,435    0.62    0.0062

3.2. Combinations

Table 2. Cross precision of MeSH pair recommendation methods

Method        Dictionary    PP      NLP
Dictionary      0.47        0.73    0.75
PP              0.73        0.62    0.87
NLP             0.75        0.87    0.62
Table 2 shows the precision and Table 3 shows the recall obtained when the methods are combined two by two (bold figures on the diagonal reflect the performance of the methods considered independently, as presented in Table 1). Table 3. Cross recall of MeSH pair recommendation methods
Method        Dictionary    PP        NLP
Dictionary      0.1993      0.0498    0.0055
PP              0.0498      0.1413    0.0028
NLP             0.0055      0.0028    0.0062

4. Discussion
4.1. General

The performance of each method can vary considerably depending on the subheading it is applied to. Moreover, the global performance of all three methods seems higher for /genetics than /metabolism or /immunology. This may be explained by the fact that genetics is a more circumscribed domain than metabolism and immunology. The best overall precision is obtained with the post-processing rules, and the best overall recall is obtained with the dictionary method. Similar observations could be made on a general training corpus, where the scope of the methods was mostly limited to the genetics-related articles.

4.2. Error analysis

To gain a better understanding of the results and how they might be improved, we have analyzed a number of recommendations that were made which were inconsistent with our reference (MEDLINE indexing) and therefore counted as errors. Table 4 presents a few characteristic cases. Most errors fall into these categories:
• Recommendation seems to be relevant
• Recommendation corresponds to a concept not substantively discussed
• Recommendation is incorrect
Especially with the NLP rules, there seem to be more cases where the recommendations address a relevant topic that is not discussed substantively in the article (e.g. PMID 15659801 in table 4). Sometimes, however, as shown in the example of PMID 15638374 in table 4, the concept denoted by the recommended pair seems relevant but not indexed. The added value of our tool could include reducing the number of similar omissions in the future. Most "incorrect" recommendations come from the dictionary method, which is the most simplistic. Another common source of errors is the case exemplified
with PMID 15574482 in table 4 where a given post-processing rule can apply to several main headings, but only one of the candidates is relevant for subheading attachment. This situation was particularly prevalent with /metabolism and resulted in a significantly lower precision for this subheading, compared to /immunology and /genetics.

Table 4. Analysis of sample erroneous pair recommendations

PMID 15574482
  Recommendations: Seeds/GE, Seedling/GE, Oryza sativa/GE**
  Method: PP: if MH Plants, Genetically Modified and a <PLANT> appear in the indexing, the pair <PLANT>/genetics should be used.
  Error interpretation: Three plants were discussed and the rule only applied to one, Oryza sativa, which was more specific (however, there is no direct ancestor-descendant relationship between the terms).

PMID 15638374
  Recommendations: Phyllodes Tumor/GE
  Method: NLP: The text "The aim of the study was an evaluation of PCNA and Ki-67 expression in the stromal component of fibroepithelial tumours."§§ was interpreted by SemRep as "gngm LOCATION_OF neop", which translates into Phyllodes Tumor/genetics.
  Error interpretation: The recommended pair seems relevant for the article, although it doesn't appear in the MEDLINE indexing.

PMID 15659801
  Recommendations: Liver Neoplasms/GE
  Method: Dictionary: The phrase "... gene expression in liver tumors ..." contains the dictionary entry "gene", related to /genetics, which is an allowable qualifier for Liver Neoplasms, retrieved by MTI.
  Error interpretation: The concept is not substantively discussed in the article.

** In this case, three pairs were recommended when applying the rule and only one of them (Oryza sativa/GE) was correct.
§§ The original phrase was edited to enhance legibility in the table.

Error analysis can point to changes that should be made in the rules or formal concept description. Links between concepts in the case of PMID 15574482 in table 4 would make it possible to consider a filtering according to main heading specificity. For example, if the fact that Oryza sativa is a more specific term than either seeds or seedling were available, one might consider
enforcing a rule stating that subheadings should only be attached to the most specific term when several terms belonging to the same hierarchy are candidates for attachment.

4.3. Complementarity of the methods

The overlap in recommendations is not significant. As a result, using different methods will help cover more citations and increase the overall recall. However, the gain in precision obtained when combining several methods is offset by a significant loss in recall. In fact, most of the recommendations resulting from the combination of methods concern the subheading genetics, especially where the NLP method is one of the combined methods. To overcome this problem we could consider the performance of post-processing rules and Natural Language Processing rules independently (e.g., there are 29 PP rules for /genetics). Rules that achieve high precision individually may be used as such.

5. Conclusion and Future Work
We have presented three methods to provide MeSH main heading/subheading pair recommendations for indexing the biomedical literature. These methods were applied to a genetics-related corpus to provide recommendations including the subheadings genetics, immunology and metabolism. Although performance may vary considerably depending on the subheading and the method used, the results are encouraging and seem to indicate that some useful pair recommendations could be used in indexing in the near future. In future work, we plan to expand the set of PP and NLP rules to cover all 83 MeSH subheadings. Investigating statistical methods to provide pair recommendations will also be considered. For example, in the specific field of genetics, links between MEDLINE and other Entrez databases such as Gene could be exploited. Based on the results from the combination of methods, more elaborate combination techniques will be studied in order to lessen the decrease in recall. Finer combinations at the rule level may be considered, as well as other factors such as the influence of the specific genetics corpus we used. Finally, a qualitative evaluation of this work will be sought from the indexers at NLM.

Acknowledgments

This research was supported in part by an appointment of A. Neveol to the Lister Hill Center Fellows Program sponsored by the National Library of Medicine and administered by the Oak Ridge Institute for Science and Education, and in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. The authors would like to thank Halil Kilicoglu
for his help in the use of SemRep/SemGen and James G. Mork for his help in the use of MTI (Medical Text Indexer) during the experiments.

References

1. AR. Aronson, O. Bodenreider, HF. Chang, SM. Humphrey, JG. Mork, SJ. Nelson, TC. Rindflesch and WJ. Wilbur. "The NLM Indexing Initiative". Proc AMIA Symp. 17-21 (2000).
2. AR. Aronson, JG. Mork, GW. Gay, SM. Humphrey, WJ. Rogers. "The NLM Indexing Initiative's Medical Text Indexer". Proc. Medinfo. 268-72 (2004).
3. P. Ruch, R. Baud, A. Geissbühler. "Learning-free Text Categorization". LNAI, 199-204 (2003).
4. L. Cai and T. Hofmann. "Hierarchical document categorization with support vector machines". Proc. CIKM. 396-402 (2004).
5. W. Kim, AR. Aronson and WJ. Wilbur. "Automatic MeSH term assignment and quality assessment". Proc AMIA Symp. 319-23 (2001).
6. http://www.nlm.nih.gov/mesh/indman/chapter_19.html (visited on 05/23/06).
7. M. Ruiz and P. Srinivasan. "Hierarchical neural networks for text categorization". Proc. SIGIR. 281-282 (1999).
8. C. Gay. "A MEDLINE Indexing Experiment Using Terms Suggested by MTI". National Library of Medicine Internal Report (2002).
9. P. Langlais, G. Lapalme and M. Loranger. "Transtype: Development-Evaluation Cycles to Boost Translator's Productivity". Machine Translation 15, 77-98 (2002).
10. A. Neveol, A. Rogozan, SJ. Darmoni. "Automatic indexing of online health resources for a French quality controlled gateway." Inf. Process. Manage. 42, 695-709 (2006).
11. SM. Humphrey. "Indexing biomedical documents: from thesaural to knowledge-based retrieval systems". Artif Intell Med. 4, 343-371 (1992).
12. TC. Rindflesch and M. Fiszman. "The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text". J Biomed Inform. 36(6), 462-77 (2003).
13. O. Bodenreider, SJ. Nelson, WT. Hole, and HF. Chang. "Beyond synonymy: exploiting the UMLS semantics in mapping vocabularies." Proc AMIA Symp. 815-9 (1998).
14. TC. Rindflesch, B. Libbus, D. Hristovski, AR. Aronson, H. Kilicoglu. "Semantic relations asserting the etiology of genetic diseases." Proc AMIA Symp. (2003).
15. SM. Humphrey. "Automatic indexing of documents from journal descriptors: a preliminary investigation." J Am Soc Inf Sci Technol. 50(8), 661-674 (1999).
MINING PATENTS USING MOLECULAR SIMILARITY SEARCH

JAMES RHODES 1, STEPHEN BOYER 1, JEFFREY KREULEN 1, YING CHEN 1, PATRICIA ORDONEZ 2

1 IBM, Almaden Services Research, San Jose, CA 95120, USA, www.ibm.com
E-mail: jjrhodes, sboyer, kreulen, [email protected]
2 E-mail: [email protected]

Text analytics is becoming an increasingly important tool used in biomedical research. While advances continue to be made in the core algorithms for entity identification and relation extraction, a need for practical applications of these technologies arises. We developed a system that allows users to explore the US Patent corpus using molecular information. The core of our system contains three main technologies: a high-performing chemical annotator which identifies chemical terms and converts them to structures, a similarity search engine based on the emerging IUPAC International Chemical Identifier (InChI) standard, and a set of on-demand data mining tools. By leveraging this technology we were able to rapidly identify and index 3,623,248 unique chemical structures from 4,375,036 US Patents and Patent Applications. Using this system a user may go to a web page, draw a molecule, search for related Intellectual Property (IP) and analyze the results. Our results prove that this is a far more effective way of identifying IP than traditional keyword-based approaches.

Keywords: Chemical Similarity; Data Mining; Patents; Search Engine; InChI
1. Introduction

The US Patent corpus is an invaluable resource for any scientist with a need for prior art knowledge. Since patents need to clearly document all aspects of an invention, they contain a plethora of information. Unfortunately, much of this information is buried within pages upon pages of legal verbiage. Additionally, current search applications are designed around keyword queries, which prove ineffective when searching for chemically related information. Consider the drug discovery problem of finding a replacement molecule
for fluoro alkane sulfonic acid (CF3CF2SO3H). This molecule appears in everyday products like Scotchgard®, floor wax, Teflon®, and in electronic chip manufacturing materials like photoresists. The problem is that this molecule is a bioaccumulator and is a potential carcinogen (substance that causes cancer). Furthermore, it has made its way through the food chain, and can now be found in polar bears and penguins. Companies are proactively trying to replace this acid with other more environmentally friendly molecules. The sulfonic acid fragment, SO3H, is the critically necessary element. The harmful fragment is anything that looks like CF3(CF2)n. The problem then is to find molecules that have the SO3H fragment, and perhaps a benzene ring, which would allow the synthetic chemist to replace an alkyl group with something that accounts for the electron-withdrawing property of CF3CF2. The chemist would like to look for a candidate molecule based on its similarity to the molecular formula of the fragment, or the structure of the benzene, or some weighted combination of both. It is quite possible that the needed information exists in the literature already, but may be costly and time consuming to discover. A system that would allow users to search and analyze documents, such as patents, at the molecular level could be a tremendously useful tool for biomedical research. In this paper we describe a system that leverages text mining techniques to annotate and index chemical entities, provide graphical document searching and discover biomedical/molecular relationships on demand. We prove the viability of such a system by indexing and analyzing the entire US Patent corpus from 1976-2005, and we present comparative results between molecular searching and traditional keyword-based approaches.

2. Extracting Chemicals

The first step in the process is to extract chemical compounds from the Patent corpus. We developed two annotators which automatically parsed text and extracted potential chemical compounds. All of the potential chemicals were then fed through a name-to-structure program such as the Name=Struct® program from CambridgeSoft Corporation. Name=Struct makes no value judgments, focusing only on providing a structure that the name accurately describes.1 The output of Name=Struct in our system is a connection table. Using the openly available InChI code,10 these connection tables are converted into InChI strings. Due to the page limits, this paper focuses on the similarity search technology. We have built a machine learning and dictionary based chemical annotator that can extract chemical names out of text and convert them
into structures. The similarity search capability is built on top of such annotation results, but is not tied to any specific underlying annotator implementation.

3. Indexing

As the new IUPAC International Chemical Identifier (InChI) standard continues to emerge, there is an increasing need to use the InChI codes beyond that of compound identification. Given our background in text analytics, we reduced the problem down to finding similar compounds based on the textual representation of the structure. Our experiments focused on the use of the InChI's as a method for identifying similar compounds. Using our annotators we were able to extract 3,623,248 unique InChI's from the US Patent database (1976-2005) and Patent Applications (2001-2005). From this collection of InChI's an index was constructed using text mining techniques. We employed a traditional vector space model14 as our underlying data structure.

3.1. Vector Representation
InChI's are unique for each molecule and they consist of multiple layers that describe different aspects of the molecule, as depicted in Figure 1. The first three layers (formula, connection and hydrogen) are considered the main layers (see 15) and are the layers we used for our experiments. Using the main layers, we extracted unique features from a collection of InChI codes.

Fig. 1. A compound and its InChI description: caffeine, InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
We defined features as one to three unique character phrases in the connection and hydrogen layers and unique atoms or symbols in the formula layer. Features from each layer are preceded by a layer identifier. For the connection and hydrogen layers, features for an InChI i with characters c_j can be defined as the unique terms c_j, c_j c_{j+1}, and c_j c_{j+1} c_{j+2}. These terms are added to the overall set of terms T, which includes the unique c_j from the formula layer. Given a collection of InChI's U with terms T_j, each InChI is represented by the vector

I_i = (d_i1, d_i2, ..., d_ij)

where d_ij represents the frequency of the jth term in the InChI. For example, the two InChI's InChI=1/H2O/h1H2 and InChI=1/N2O/c1-2-3 would produce the features H, O, h1, h1H, h1H2, hH, hH2, h2, N, c1, c1-, c1-2, c-, c-2, c-2-, c2, c2-, c2-3, c-3, c3, with the vector representations {2,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0} for water and {0,1,0,0,0,0,0,0,2,1,1,1,2,1,1,1,1,1,1,1} for nitrous oxide. In our experiments, the formula, connection and hydrogen layers produced 963, 69,334 and 55,256 features respectively. This makes the combined dimensionality of the dataset T = 125,553. Feature values are always nonnegative integers. To take into account the frequency of features when computing the similarity distance calculation, we represented the vectors in unary notation, where each of the three feature spaces is expanded by the maximum value of a feature in that space. This causes the dimensionality to explode to 31,288,976 features and the sparsity increases proportionally. Of course, this unary representation is implicit and need not be implemented explicitly. Each InChI is processed by building for it three vectors which are then added to the respective vector space model. The results are three vector space models of size 309MB, 950MB and 503MB for the formula (F1), connection (F2) and hydrogen (F3) layers. Each vector space model Fj defines a distance function Dj by taking the Tanimoto19 coefficient between the corresponding vectors. Consequently, for every two molecules x and y there are 3 distances defined between them, namely D1(x,y), D2(x,y) and D3(x,y).
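A small sketch of this feature extraction (a simplified reading of the description above, not the authors' code; the layer parsing and the layer prefixes are deliberately naive):

```python
import re
from collections import Counter

def inchi_features(inchi):
    """Character n-gram features (n = 1..3) from the three main InChI layers,
    each feature tagged with a layer identifier."""
    layers = inchi.split("/")                      # e.g. ["InChI=1", "H2O", "h1H2"]
    formula = layers[1] if len(layers) > 1 else ""
    connection = next((l[1:] for l in layers[2:] if l.startswith("c")), "")
    hydrogen = next((l[1:] for l in layers[2:] if l.startswith("h")), "")

    feats = Counter()
    # Formula layer: one feature per atom symbol, weighted by its count
    for atom, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        feats[atom] += int(count) if count else 1
    # Connection and hydrogen layers: 1- to 3-character phrases, prefixed by layer
    for prefix, layer in (("c", connection), ("h", hydrogen)):
        for n in (1, 2, 3):
            for i in range(len(layer) - n + 1):
                feats[prefix + layer[i:i + n]] += 1
    return feats

# The two toy molecules from the text:
print(inchi_features("InChI=1/H2O/h1H2"))    # H counted twice, O once, plus h-layer n-grams
print(inchi_features("InChI=1/N2O/c1-2-3"))  # N counted twice, O once, plus c-layer n-grams
```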
3.2. Index Implementation

For indexing of the vector space models we implemented the Locality Sensitive Hashing (LSH) technique of Indyk and Motwani.9 A major benefit of the algorithm is the relative size of the index compared to the overall vector space. In our implementation the objects (and their feature vectors) do not need to be replicated. Vectors are computed for each InChI and stored only in a single repository. Each index maintains a selection of k vector positions and a standard hash function for producing actual bucket numbers. The buckets themselves are individual files on the file system, and they contain pointers to (or serial numbers of) vectors in the aforementioned single repository. This allows both the entire index as well as each bucket to remain small. This implementation is of course useful because this single large repository still fits in our computer's main memory (RAM). During index creation, not all hash buckets are populated. Additionally, the number of data points per hash bucket may also vary quite a bit. In our implementation, buckets were limited to a maximum of B = 1000. The end result is an LSH index Lj for each of the 3 layers of the InChI.
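A toy illustration of the bucketing idea described above (a simplified reading: the choice of k positions and the in-memory bucket store are assumptions; the paper's actual index keeps bucket files on disk):

```python
import random

class LSHIndex:
    """Toy LSH index: hash each sparse vector by the values it takes at k
    randomly chosen feature positions; similar vectors tend to collide."""
    def __init__(self, dimensionality, k=16, seed=0):
        rng = random.Random(seed)
        self.positions = [rng.randrange(dimensionality) for _ in range(k)]
        self.buckets = {}              # bucket key -> list of vector ids

    def _key(self, vector):
        # vector is a dict {feature_index: count}; absent positions count as 0
        return hash(tuple(vector.get(p, 0) for p in self.positions))

    def add(self, vector_id, vector):
        self.buckets.setdefault(self._key(vector), []).append(vector_id)

    def candidates(self, query_vector):
        return self.buckets.get(self._key(query_vector), [])
```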
Processing
For each query molecule Q, vectors dj are created from each vector space model Fj. Each vector is then processed by the LSH index Lj which corresponds to a given layer. The LSH index provides a list of potential candidate Ci which are then evaluated against the query vectors using the Tanimoto Coefficient. The total similarity for each candidate C; is computed by Si =
3 l
~
3
3
(1)
where n is the total number of vector space models. The Tanimoto Coefficient has been widely used as an effective measure of intermolecular similarity in both the clustering and searching of databases.6 While Willett et al.19 discuss six different coefficients for chemical similarity, we found that the Tanimoto Coefficient was the most widely recognized calculation with our users. The results are then aggregated so each vector with the same S is merged into a set of synonyms. By dereferencing the vectors to the InChI's they represent, and further dereferencing the InChI to the original text within the corpus, a list of the top K matching chemical names and the respective documents that contain those names is returned.
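A compact sketch of this scoring step (assuming per-layer sparse count vectors as produced above; candidate retrieval and synonym grouping are omitted, and the continuous form of the Tanimoto coefficient is an assumption):

```python
def tanimoto(v1, v2):
    """Tanimoto coefficient between two sparse count vectors (dicts)."""
    dot = sum(c * v2.get(f, 0) for f, c in v1.items())
    n1 = sum(c * c for c in v1.values())
    n2 = sum(c * c for c in v2.values())
    denom = n1 + n2 - dot
    return dot / denom if denom else 0.0

def total_similarity(query_layers, candidate_layers):
    """Equation (1): average the per-layer Tanimoto similarities."""
    n = len(query_layers)
    return sum(tanimoto(q, c) for q, c in zip(query_layers, candidate_layers)) / n
```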
4. Experimental Results

In order to explain the experimental results, an overview of the application as it is currently implemented is required. We will conclude with a full description of the experimental process and its results.

4.1. Graphical Similarity Search
To use the Chemical Search Engine, a user may either draw the chemical structure of the molecule to be searched, enter an InChI or SMILES which represents the molecule into a text field, or open a file which stores a SMILES or InChI value in the corresponding field. The engine converts the query into an InChI and returns a listing of molecules and their similarity values. Beside the molecule image is its similarity to the search molecule entered, its IUPAC name, an expandable list of synonyms, and the number of patents that were found containing that molecule, as seen in Fig. 2. Not surprisingly, for a query of a sketch of caffeine, the engine returned over 8,500 patents that contained a molecule with a similarity of 1.0, meaning that there was an exact match, and over 52 synonyms for that molecule. Six molecules with a similarity above .8 were rendered. For the experimental results, the canonical SMILES for the tested drug in the PubChem database was entered into the text field.
Fig. 2. Search results
4.2. Molecular Networks
In the upper right hand corner of the results page, the user may click on three different links to view selected molecules and their patents either as a graph using Graph Results, as a listing of hyper-linked patents with View Patents, or as an analysis of claims with Claim Analysis. In this section, we will describe and illustrate the usefulness of the Graph Results page and, in the following, the Claim Analysis. The value of a graphical representation of the selected molecules and their corresponding patents is most evident if we select the molecules with similar affinities to caffeine, but not exact matches to caffeine. The graph in Fig. 3 shows the four molecules with the closest similarity to caffeine less than 1. In the graph, the search node is fixed as the center node and molecular representations of the other nodes surround it. In the future, the graph will also display each molecule's similarity to the search node as indicated by the thickness of its edge to the center (search) node. When the user rolls over the center node, the comment "Search Node" is displayed, whereas for the other nodes the name of the molecule is displayed. Note that some of the same molecules have different names. The leaf nodes are the patents and patent applications associated with each molecule. If double-clicked, the node will launch a browser window displaying the corresponding patent or application. A mouseover of these nodes will render the name of the assignee of the document. The nodes are color-coded by assignees. A researcher may use this graph to view which molecules are most like the search node and, of those molecules, which have the greatest number of patents associated with them. It is also very useful for determining which assignees have the greatest number of patents for a particular molecular structure.
4.3. Affinity Analysis
The Claim Analysis page examines the claims of the patents associated with the selected molecules on the previous page to determine which medical conditions were found in the greatest number of patents. The more patents that mention a particular condition, the higher the condition's affinity to the molecule. Notice in Fig. 4, that for caffeine, migraine and headache have a high affinity, nausea and anxiety a moderate one, and burns and cough a low affinity.
Fig. 3. Graph of selected molecules
The conditions were derived from a dictionary of proteins, diseases, and biomarkers. A dictionary-based annotator annotates the full text of the selected patents in real time to extract the meaningful terms. A Chi-squared test was used, referencing the number of patents that contained the conditions, to determine the affinity between the molecules and the conditions. On expanding a condition in the Claim Analysis page, a listing of the patents mentioning the condition in its text is rendered. The patent names are links to the actual patents. Thus, a researcher looking to patent a drug may do a search on the molecule and uncover what other uses the molecule has been patented for before. Such data may also serve to discover unexpected side effects or complications of a drug for the purposes of testing its safety.

4.4. Results
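One plausible form of this affinity test (the 2×2 contingency layout and the counts below are assumptions, since the paper does not spell out the exact statistic; SciPy is used only for illustration):

```python
from scipy.stats import chi2_contingency

def condition_affinity(n_with_both, n_molecule_only, n_condition_only, n_neither):
    """Chi-squared statistic for association between a molecule's patents and a
    condition, from a 2x2 contingency table of patent counts."""
    table = [[n_with_both, n_molecule_only],
             [n_condition_only, n_neither]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value

# Hypothetical counts: of 9,000 corpus patents, 8,500 mention the molecule,
# 300 mention "migraine", and 260 mention both.
print(condition_affinity(260, 8240, 40, 460))
```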
To evaluate the engine's effectiveness, we used a listing of the top 50 brand-name drugs prescribed in 2004 as provided by Humana.8
Fig. 4. Claims analysis of selected molecules
We acquired a canonical SMILES value associated with each of the 25 top prescribed drugs from the PubChem database.7 PubChem could not provide the SMILES for two of the drugs, Yasmin 28 and Ortho Evra. If more than one molecule was returned from the database, we used the canonical SMILES value of the first one listed, except in the case of three of the drugs, Toprol XL, Premarin, and Plavix. In these cases, we used the SMILES string that returned the greatest number of matches when we performed a search on the chemical search engine. With the generic name of the drug, we performed a search on one of the most sophisticated patent databases known, Delphion, using a boolean query that examined the abstracts, titles, claims, and descriptions of patents from January 1, 1976 to December 31, 2005 for the name. The results can be seen in Fig. 5. On acquiring the 25 drug names, the first obstacle was that 2 of the drugs could not be found in the PubChem database, so that the canonical SMILES for these drugs could not be determined. Out of the 23 drugs that remained, our results indicate that for 19 of them more patents associated with the drug were found on our system than on Delphion. In the instances where the engine found more matches, the number of matches that it found was in some cases up to 10 times more, because the search was based on
the molecular structure of the match and not on the generic name. The number of times that a text-based search outperformed the molecular search may be attributed to a mis-selection of the SMILES string from the PubChem database. Thus, one of the greatest limitations of the chemical search engine is finding an accurate SMILES string for a given drug. Nevertheless, our experimental results demonstrate the enormous potential of being able to search the patent database based on a molecular structure.
Fig. 5. A graph comparing the results of searching for the top 25 drugs listed by Humana8 on the Chemical Search Engine using a molecular search and on Delphion performing a text search of the compound's name. (Bar chart titled "Search Results"; legend: Delphion Keyword Search, IBM Chemsearch; x-axis: Top 25 Drugs; y-axis up to 4,500.)
5. Conclusion

We developed a practical system which leverages text analytics for indexing, searching and analyzing documents based on molecular information. Our results demonstrate that graphical structure search is a far more effective way to explore a document corpus than traditional keyword-based queries when searching for biomedical-related literature. The system is flexible and may be expanded to include other data sources besides Patents. These additional data sources would allow for meta-data information to
be tied to Patents through chemical annotations. Future versions may allow researchers to explore data sets based on chemical properties such as toxicity or molecular weight. In addition to discovering literature for an exact match, this tool can be used for identifying practical applications of a compound or possible negative side effects by examining the literature surrounding similar compounds.
References

1. J. Brecher. Name=Struct: A practical approach to the sorry state of real-life chemical nomenclature. Journal of Chemical Information and Computer Science, 39:943-950, 1999.
2. A. Dalby, J. G. Nourse, W. D. Hounshell, A. K. I. Gushurst, D. L. Grier, B. A. Leland, and J. Laufer. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. Journal of Chemical Information and Computer Science, 32(3):244-255, 1992.
3. Daylight Chemical Information Systems, Inc. Daylight Theory: Fingerprints, 2005. http://www.daylight.com/dayhtml/doc/theory/theory.finger.html.
4. Daylight Chemical Information Systems, Inc. Daylight Cheminformatics SMILES, 2006. http://daylight.com/smiles.
5. GNU FDL. Open Babel, 2006. http://openbabel.sourceforge.net.
6. D. Flower. On the properties of bit string-based measures of chemical similarity. Journal of Chemical Information and Computer Science, 38(3):379-386, 1998.
7. National Center for Biotechnology Information. PubChem, 2006. http://pubchem.ncbi.nlm.nih.gov/search.
8. Humana. Top 50 brand-name drugs prescribed, 2005. http://apps.humana.com/prescription_benefits_and_services/incl_des/Top50BrandDrugs.pdf.
9. P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, pages 604-613, May 1998.
10. IUPAC. The IUPAC International Chemical Identifier (InChI TM), 2005. http://www.iupac.org/inchi.
11. Stefan Kramer, Luc De Raedt, and Christoph Helma. Molecular feature mining in HIV data. In KDD '01: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, 2001.
12. Elsevier MDL. CTfile formats, 2005. http://www.mdl.com/downloads/public/ctfile/ctfile.pdf.
13. Elsevier MDL. MDL ISIS/Base, 2006. http://www.mdli.com/support/knowledgebase/faqs/faq_ib_22.jsp.
14. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613-620, 1975.
15. S. E. Stein, S. R. Heller, and D. Tchekhovskoi. An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier. In Proceedings of the 2003 International Chemical Information Conference (Nimes), 2003.
16. Murray-Rust Research Group, The University of Cambridge. The Unofficial InChI FAQ, 2006. http://wwmm.ch.cam.ac.uk/inchifaq/.
17. D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Science, 28(1):31-36, 1988.
18. D. Weininger, A. Weininger, and J. L. Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Science, 29(2):97-101, 1989.
19. P. Willett, J. M. Barnard, and G. M. Downs. Chemical similarity searching. Journal of Chemical Information and Computer Science, 38(6):983-996, 1998.
DISCOVERING IMPLICIT ASSOCIATIONS BETWEEN GENES AND HEREDITARY DISEASES
KAZUHIRO SEKI
Graduate School of Science and Technology, Kobe University
1-1 Rokkodai, Nada, Kobe 657-8501, Japan
E-mail: [email protected]

JAVED MOSTAFA
Laboratory of Applied Informatics Research, Indiana University
1320 E. 10th St., LI 011, Bloomington, Indiana 47405-3907
E-mail: [email protected]
We propose an approach to predicting implicit gene-disease associations based on the inference network, whereby genes and diseases are represented as nodes and are connected via two types of intermediate nodes: gene functions and phenotypes. To estimate the probabilities involved in the model, two learning schemes are compared: one baseline using co-annotations of keywords and the other taking advantage of free text. Additionally, we explore the use of domain ontologies to compensate for data sparseness and examine the impact of full-text documents. The validity of the proposed framework is demonstrated on benchmark data sets created from real-world data.
1. Introduction

The ever-growing volume of textual data makes it increasingly difficult to effectively utilize all the information relevant to our interests. For example, Medline, the most comprehensive bibliographic database in the life sciences, currently indexes approximately 5,000 peer-reviewed journals and contains over 17 million articles. The number of articles is increasing rapidly, by 1,500-3,000 per day. Given the substantial volume of the publications, it is crucial to develop intelligent information processing techniques, such as information retrieval (IR), information extraction (IE), and text data mining (TDM), that could help us manage the information overload. In contrast to IR and IE, which deal with information explicitly stated in documents, TDM aims to discover heretofore unknown knowledge through an automatic analysis of textual data.1 A pioneering work in TDM (or literature-based discovery) was conducted by Swanson in the 1980's. He argued that there were
two premises logically connected but whose connection had been unnoticed due to overwhelming publications and/or over-specialization. For instance, given two premises A → B and B → C, one could deduce a possible relation A → C. To prove the idea, he manually analyzed a number of articles and identified logical connections implying a hypothesis that fish oil was effective for clinical treatment of Raynaud's disease.2 The hypothesis was later supported by experimental evidence. Based on his original work, Swanson and other researchers have developed computer programs to aid hypothesis discovery (e.g., see Refs. 3 and 4). Despite the prolonged efforts, however, the research in literature-based discovery can be seen to be at an early stage of development in terms of the models, approaches, and evaluation methodologies. Most of the previous work was largely heuristic, without a formal model, and its evaluation was limited to only a small number of hypotheses that Swanson had proposed. This study is also motivated by Swanson's work and attempts to advance the research in literature-based discovery. Specifically, we will examine the effectiveness of the models and techniques developed for IR, the benefit of free- and full-text data, and the use of domain ontologies for more robust system predictions. Focusing on associations between genes and hereditary diseases, we develop a discovery framework adapting the inference network model5 in IR, and we conduct various evaluative experiments on realistic benchmark data.
2. Task Definition

Among the many types of information that are of potential interest to biomedical researchers, this study targets associations between genes and hereditary diseases as a test bed. Gene-disease associations are the links between genetic variants and the diseases whose susceptibility those variants influence. For example, BRCA1 is a human gene encoding a protein that suppresses tumor formation. A mutation of this gene increases the risk of breast cancer. Identification of these genetic associations has tremendous importance for prevention, prediction, and treatment of diseases. In this context, predicting or ranking candidate genes for a given disease is crucial to select the more plausible ones for genetic association studies. Focusing on gene-disease associations, we assume a disease name and known causative genes, if any, as system input. In addition, a target region in the human genome may be specified to limit the search space. Given such input, we attempt to predict an (unknown) causative gene and produce a ranked list of candidate genes.
3. Proposed Approach

Focusing on gene-disease associations, we explored the use of a formal IR model, specifically the inference network,5 for this related but different problem targeting implicit associations. The following details the proposed model and how to estimate the probabilities involved in the model.

3.1. Inference Network for Gene-Disease Associations

In the original IR model, a user query and documents are represented as nodes in a network and are connected via intermediate nodes representing keywords that compose the query and documents. To adapt the model to represent gene-disease associations, we treat the disease as a query and genes as documents and use two types of intermediate nodes: gene functions and phenotypes, which characterize genes and the disease, respectively (Fig. 1). An advantage of using this particular IR model is that it is essentially capable of incorporating multiple intermediate nodes. Other popular IR models, such as the vector space models, are not easily applicable as they are not designed to have different sets of concepts to represent documents and queries.
[Figure: four node layers connected by directed arcs: Mutated genes → Gene functions (GO terms) → Phenotypes (MeSH C terms) → Disease.]
Figure 1. Inference network for gene-disease associations.
The network consists of four types of nodes: genes (g), gene functions (f) represented by Gene Ontology (GO) terms,a phenotypes (p) represented by MeSH C terms,b and disease (d). Each gene node g represents a gene and corresponds to the event that the gene is found in the search for the causative genes underlying d. Each gene function node f represents a function of gene products.
a http://www.geneontology.org
b http://www.nlm.nih.gov/mesh
There are directed arcs from genes to functions, representing that instantiating a gene increases the belief in its functions. Likewise, each phenotype node p represents a phenotype of d and corresponds to the event that the phenotype is observed. The belief in p is dependent on the belief in the f's since phenotypes are (partly) determined by gene functions. Finally, observing certain phenotypes increases the belief in d. As described in the following, the associations between genes and gene functions (g → f) are obtained from an existing database, Entrez Gene,c whereas both the associations between gene functions and phenotypes (f → p) and the associations between phenotypes and disease (p → d) are derived from the biomedical literature. Given the inference network model, disease-causing genes can be predicted based on the probability defined below.
$$P(d \mid G) \;=\; \sum_i \sum_j P(d \mid \bar{p}_i)\, P(\bar{p}_i \mid \bar{f}_j)\, P(\bar{f}_j \mid G) \qquad (1)$$
Equation (1) quantifies how much a set of candidate genes, G, increases the belief in the development of disease d. In the equation, $\bar{p}_i$ (or $\bar{f}_j$) is defined as a vector of random variables with the i-th (or j-th) element being positive (1) and all others negative (0). By applying Bayes' theorem and some independence assumptions discussed later, we derive
$$P(d \mid G) \;\propto\; \sum_i \sum_j \Bigl[\, P(d \mid \bar{p}_i) \times \frac{P(f_j \mid p_i)}{P(f_j \mid \bar{p}_i)} \times F(p_i) \times F(f_j) \times P(f_j \mid G) \,\Bigr] \qquad (2)$$

where F(p_i) and F(f_j) are main-effect terms that depend only on p_i and f_j, respectively.
The first factor of the right-hand side of Eq. (2) represents the interaction between disease d and phenotype p_i, and the second factor represents the interaction between p_i and gene function f_j, which is equivalent to the odds ratio of P(f_j|p_i) and P(f_j|p̄_i). The third and fourth factors are functions of p_i and f_j, respectively, representing their main effects. The last factor takes either 0 or 1, indicating whether f_j is a function of any gene in G under consideration. The inference network described above assumes independence among phenotypes, among gene functions, and among genes. We assert, however, that the effects of such associations are minimal in the proposed model. Although there may be strong associations among phenotypes (e.g., phenotype p_x is often observed with phenotype p_y), the model does not intend to capture those associations.
c http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene
That is, phenotypes are attributes of the disease in question, and we only need to know those that are frequently observed with disease d so as to characterize d. The same applies to gene functions; they are only attributes of the genes to be examined and are simply used as features to represent the genes under consideration.

3.2. Probability Estimation

3.2.1. Conditional Probabilities P(p|d)

Probability P(p|d) can be interpreted as a degree of belief that phenotype p is observed when disease d has developed. To estimate the probability, we take advantage of the literature data. Briefly, given a disease name d, a Medline search is conducted to retrieve articles relevant to d and, within the retrieved articles, we identify phenotypes (MeSH C terms) strongly associated with the disease based on chi-square statistics. Given disease d and phenotype p, the chi-square statistic is computed as
$$\chi^2(d, p) \;=\; \frac{N\,(n_{11}\,n_{22} - n_{21}\,n_{12})^2}{(n_{11}+n_{21})(n_{12}+n_{22})(n_{11}+n_{12})(n_{21}+n_{22})} \qquad (4)$$
where N is the total number of articles in Medline, n_11 is the number of articles assigned p and included in the retrieved set (denoted as R), n_22 is the number of articles not assigned p and not included in R, n_21 is the number of articles not assigned p and included in R, and n_12 is the number of articles assigned p and not in R. The resulting chi-square statistics are normalized by the maximum to treat them as probabilities P(p|d).

3.2.2. Conditional Probabilities P(f|p)

Probability P(f|p) indicates the degree of belief that gene function f underlies phenotype p. For probability estimation, this study adopts a framework similar to the one proposed by Perez-Iratxeta et al.6 Unlike them, however, this study focuses on the use of textual data and domain ontologies and investigates their effects for literature-based discovery. As training data, our framework uses Medline records that are assigned any MeSH C terms and are cross-referenced from any gene entry in Entrez Gene. For each such record, we can obtain a set of phenotypes (the assigned MeSH C terms) and a set of gene functions (GO terms) associated with the cross-referencing gene from Entrez Gene. Considering the fact that the phenotypes and gene functions are associated with the same Medline record, it is likely that some of the phenotypes and gene functions are associated. A question is, however, which phenotypes and functions are associated and how strong those associations are.
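The following is a minimal sketch of the phenotype-selection statistic of Section 3.2.1 (Eq. (4)). It is an illustration only, not the authors' implementation; the article counts and MeSH terms are hypothetical placeholders.

```python
# Hedged sketch of Eq. (4): 2x2 chi-square between "article assigned MeSH C term p"
# and "article retrieved for disease d", normalized by the maximum to act as P(p|d).

def chi_square(n11, n12, n21, n22):
    """n11: assigned p and retrieved; n12: assigned p, not retrieved;
    n21: not assigned p, retrieved; n22: neither."""
    n = n11 + n12 + n21 + n22
    denom = (n11 + n21) * (n12 + n22) * (n11 + n12) * (n21 + n22)
    return 0.0 if denom == 0 else n * (n11 * n22 - n21 * n12) ** 2 / denom

def phenotype_probs(counts):
    """Normalize chi-square scores by their maximum so they can be used as P(p|d)."""
    scores = {p: chi_square(*c) for p, c in counts.items()}
    top = max(scores.values()) or 1.0
    return {p: s / top for p, s in scores.items()}

# counts[p] = (n11, n12, n21, n22) per candidate MeSH C term (hypothetical numbers)
print(phenotype_probs({"Breast Neoplasms": (120, 4000, 80, 500000),
                       "Fever": (5, 9000, 195, 490000)}))
```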
We estimate those possible associations using two different schemes: SchemeK and SchemeT. SchemeK simply assumes a link between every pair of the phenotypes and gene functions with equal strength, whereas SchemeT seeks evidence in the textual portion of the Medline record, i.e., the title and abstract, to better estimate the strength of the associations. Essentially, SchemeT searches for co-occurrences of gene functions (GO terms) and phenotypes (MeSH terms) in a sliding window, assuming that associated concepts tend to co-occur more often in the same context than unassociated ones. However, a problem of SchemeT is that gene functions and phenotypes are descriptive by nature and may not be expressed in concise GO and MeSH terms. In fact, Schuemie et al.7 analyzed 1,834 articles and reported that less than 30% of the MeSH terms assigned to an article actually appear in its abstract and that only 50% appear even in its full text. It suggests that relying on mere occurrences of MeSH terms would fail to capture many true associations. To deal with the problem, we apply the idea of query expansion, a technique used in IR to enrich a query by adding related terms. If GO and MeSH terms are somehow expanded, there is more chance that they could co-occur in text. For this purpose, we use the definitions (or scope notes) of GO and MeSH terms and identify representative terms by inverse document frequencies (IDF), which have long been used in IR to quantify the specificity of terms in a given document collection. We treat term definitions as documents and define the IDF for term t as log(N/Freq(t)), where N denotes the total number of MeSH C (or GO) terms and Freq(·) denotes the number of MeSH C (or GO) terms whose definitions contain term t. Only the terms with high IDF values are used as the proxy terms to represent the starting concept, i.e., gene function or phenotype. Each co-occurrence of the two sets of proxy terms (one representing a gene function and the other representing a phenotype) can be seen as evidence that supports the association between the gene function and phenotype, increasing the strength of their association. We define the increased strength by the product of the term weights, w, for the two co-occurring proxy terms. Then, the strength of the association between gene function f and phenotype p within article a, denoted as S(f, p, a), can be defined as the sum of the increases for all co-occurrences of the proxy terms in a. That is,
$$S(f, p, a) \;=\; \sum_{(t_f,\, t_p,\, a)} \frac{w(t_f) \cdot w(t_p)}{|Proxy(f)| \cdot |Proxy(p)|} \qquad (5)$$
where t_f and t_p denote any terms in the proxy terms for f and p, respectively, and (t_f, t_p, a) denotes a co-occurrence of t_f and t_p within a. The product of the term weights is normalized by the proxy sizes, |Proxy(·)|, to eliminate the effect of different proxy sizes. As the term weight w, this study used the TF·IDF weighting
scheme. For term t_p, for instance, we define TF(t_p) as 1 + log Freq(t_p, Def(p)), where Def(p) denotes p's definition and Freq(t_p, Def(p)) denotes the number of occurrences of t_p in Def(p). The association scores, S(f, p, a), are computed for each cross-reference (a pair of Medline record and gene) by either SchemeK or SchemeT and are accumulated over all articles to estimate the associations between f's and p's, denoted as S(f, p). Based on the associations, we define probability P(f|p) as
S(f, p) / Σ_p S(f, p).

A possible shortcoming of the approach described above is that the obtained associations S(f, p) are symmetric despite the fact that the network presented in Fig. 1 is directional. However, since it is known that an organism's genotype (in part) determines its phenotype, and not the other way around, we assumed that the estimated associations between gene functions and phenotypes are directed from the former to the latter.

3.2.3. Enhancing Probability Estimates P(f|p) by Domain Ontologies

The proposed framework may not be able to establish true associations between gene functions and phenotypes for various reasons; for example, the amount of training data may be insufficient. Those true associations may be uncovered using the structure of MeSH and/or GO. MeSH and GO have a hierarchical structure,d and terms located nearby in the hierarchy are semantically close to each other. Taking advantage of these semantic relations, we enhance the learned probabilities P(f|p) as follows. Let us denote by A the matrix whose element a_ij is the probability estimate P(f_j|p_i) and by A' the updated or enhanced matrix. Then, A' is formalized as A' = W_p A W_f, where W_p denotes an n×n matrix with element w_p(i, j) indicating the proportion of a probability to be transmitted from phenotype p_j to p_i. Similarly, W_f is an m×m matrix with w_f(i, j) indicating the proportion transmitted from gene function f_i to f_j. This study experimentally uses only direct child-to-parent and parent-to-child relations and defines w_p(i, j) as

$$w_p(i, j) \;=\; \begin{cases} 1 & \text{if } i = j \\ \dfrac{1}{\#\,\text{children of } p_j} & \text{if } p_i \text{ is a child of } p_j \\ \dfrac{1}{\#\,\text{parents of } p_j} & \text{if } p_i \text{ is a parent of } p_j \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

d To be precise, GO's structure is a directed acyclic graph, allowing multiple parents.
Equation (6) means that the amount of probability is split equally among the children (or parents). Similarly, w_f(i, j) is defined by replacing i and j in the right-hand side of Eq. (6). Note that the enhancement process can be iteratively applied to take advantage of more distant relationships than children/parents.

4. Evaluation

To evaluate the validity of the proposed approach, we implemented a prototype system and conducted various experiments on benchmark data sets created from the genetic association database (GAD).e GAD is a manually-curated archive of human genetic studies, containing pairs of gene and disease that are known to have causative relations.

4.1. Creation of Benchmark Data

For evaluation, benchmark data sets were created as follows using the real-world data obtained from GAD.

(1) Associate each gene-disease pair with the publication date of the article from which the entry was created. The date can be seen as the time when the causative relation became public knowledge.
(2) Group gene-disease pairs based on disease names. As GAD deals with complex diseases, a disease may be paired with multiple genes.
(3) For each pair of a disease and its causative genes,
    (a) Identify the gene whose relation to the disease was most recently reported based on the publication date. If the date is on or after 7/1/2003, the gene will be used as the target (i.e., new knowledge), and the disease and the rest of the causative genes will be used as system input (i.e., old knowledge).
    (b) Remove the most recently reported gene from the set of causative genes and repeat the previous step (3a).

The separation of the data by publication dates ensures that the training phase does not use new knowledge, in order to simulate gene-disease association discovery. The particular date was arbitrarily chosen by considering the size of the resulting data and the available resources for training. Table 1 shows the number of gene-disease associations in the resulting test data, categorized under the six disease classes defined in GAD. In the following experiments, the cancer class was used for system development and parameter tuning.
e http://geneticassociationdb.nih.gov
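Returning briefly to the enhancement step of Section 3.2.3, the following is a minimal sketch of A' = W_p A W_f with W built according to Eq. (6). It is an illustration only; the toy hierarchies and probability matrix are hypothetical, not values from the actual MeSH/GO data.

```python
import numpy as np

# Hedged sketch of the ontology-based enhancement (Section 3.2.3, Eq. (6)):
# each W spreads probability mass equally over direct children and parents.

def transmission_matrix(terms, parents):
    """Build W for one ontology: w[i, j] is the share passed from term j to term i."""
    children = {t: [c for c in terms if t in parents.get(c, [])] for t in terms}
    w = np.eye(len(terms))
    for j, tj in enumerate(terms):
        for i, ti in enumerate(terms):
            if i == j:
                continue
            if ti in children[tj]:                 # ti is a child of tj
                w[i, j] = 1.0 / len(children[tj])
            elif ti in parents.get(tj, []):        # ti is a parent of tj
                w[i, j] = 1.0 / len(parents[tj])
    return w

# A[i, j] = P(f_j | p_i); rows = phenotypes, columns = gene functions (toy values).
phen, func = ["p_root", "p_child"], ["f_root", "f_child"]
A = np.array([[0.8, 0.0],
              [0.0, 0.6]])
Wp = transmission_matrix(phen, {"p_child": ["p_root"]})
Wf = transmission_matrix(func, {"f_child": ["f_root"]})
A_enhanced = Wp @ A @ Wf    # one iteration; the process can be applied repeatedly
print(A_enhanced)
```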
Table 1. Number of gene-disease associations in the benchmark data.

Cancer   Cardiovascular   Immune   Metabolic   Psych   Unknown   Total
  45           36            61        23        12       80      257
4.2. Experimental Setup

Given the input (disease name d, known causative genes, and a target region), the system computes the probability P(d|G) as in Eq. (2) for each candidate gene g located in the target region, where G is the set of the known causative genes plus g. The candidate genes are then output in decreasing order of their probabilities as the system output. As the evaluation metric, we use the area under the ROC curve (AUC) for its attractive properties as compared to the F-score measure (see Ref. 8 for more details). ROC curves are a two-dimensional measure of system performance, with the x axis being the false positive proportion (FPP) and the y axis being the true positive proportion (TPP). TPP is defined as TP/(TP+FN), and FPP as FP/(FP+TN), where TP, FP, FN, and TN denote the number of true positives, false positives, false negatives, and true negatives, respectively. AUC takes a value between 0 and 1, with 1 being the best. Intuitively, AUC indicates the probability that a gene randomly picked from the positive set is scored more highly by a system than one from the negative set. For data sets, this study used a subset of the Medline data provided for the TREC Genomics Track 2004.9 The data consist of the records created between the years 1994 and 2003, which account for around one-third of the entire Medline database. Within these data, 29,158 cross-references (pairs of Medline record and gene) were identified as the training data such that they satisfied all of the following conditions: 1) Medline records are assigned one or more MeSH C terms to be used as phenotypes, 2) Medline records are cross-referenced from Entrez Gene to obtain gene functions, 3) cross-references are not from the target genes, to avoid using possible direct evidence, and 4) Medline records have publication dates before 7/1/2003, to avoid using new knowledge. Using the cross-references and the test data in the cancer class, several parameters were empirically determined for each scheme, including the number of Medline articles used as the source of phenotypes (n_m), the threshold for chi-square statistics to determine phenotypes (t_c), the threshold for IDF to determine proxy terms (t_i), and the window size for co-occurrences (w_s). For SchemeT, they were set as n_m=700, t_c=2.0, t_i=5.0, and w_s=10 (words) by testing a number of combinations of their possible values.
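A minimal sketch of this ranking-and-evaluation step is given below. All probability tables, gene names, and function sets are hypothetical stand-ins for the quantities estimated in Section 3.2; the code is an illustration of the procedure, not the authors' implementation.

```python
# Hedged sketch: score candidate gene sets with the factors of Eq. (2), rank the
# candidates, and compute AUC for the ranked list against the known positives.

def score(candidate, known_genes, p_d_given_p, odds_f_given_p, p_main, f_main, gene_functions):
    """Approximate P(d|G), up to proportionality, for G = known_genes + {candidate}."""
    G = set(known_genes) | {candidate}
    functions_of_G = set().union(*(gene_functions.get(g, set()) for g in G))
    total = 0.0
    for p in p_d_given_p:                    # phenotypes characterizing the disease
        for f in f_main:                     # gene functions (GO terms)
            indicator = 1.0 if f in functions_of_G else 0.0   # last factor of Eq. (2)
            total += (p_d_given_p[p] * odds_f_given_p.get((f, p), 1.0)
                      * p_main[p] * f_main[f] * indicator)
    return total

def rank_candidates(candidates, **tables):
    """Return candidates sorted by decreasing score, as the system output."""
    return sorted(((g, score(g, **tables)) for g in candidates), key=lambda x: -x[1])

def auc(ranked, positives):
    """AUC: probability that a positive gene is ranked above a negative one."""
    pos = [g for g, _ in ranked if g in positives]
    neg = [g for g, _ in ranked if g not in positives]
    if not pos or not neg:
        return 0.5
    wins = sum(1 for i, (g, _) in enumerate(ranked) if g in positives
               for h, _ in ranked[i + 1:] if h not in positives)
    return wins / (len(pos) * len(neg))
```

In use, one would call rank_candidates once per disease over the genes in the target region and then average auc over all diseases in a class, mirroring the per-class AUC figures reported below.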
4.3. Results

4.3.1. Overall Performance

With the best parameter settings learned in the cancer class, the system was applied to all the other classes. Table 2 shows the system performance in AUC.

Table 2. System performance in AUC for each disease class. The figures in the parentheses indicate the percent increase/decrease relative to SchemeK.

Scheme    Cardiovascular   Immune          Metabolic       Psych           Unknown         Overall
SchemeK   0.677            0.686           0.684           0.514           0.703           0.682
SchemeT   0.737 (8.9%)     0.668 (-2.6%)   0.623 (-9.0%)   0.667 (29.8%)   0.786 (11.7%)   0.713 (4.6%)
Both SchemeK and SchemeT achieved significantly higher AUC than 0.5 (i.e., random guessing), indicating the validity of the general framework adapting the inference network to this particular problem. Comparing the two schemes, SchemeT does not always outperform SchemeK but, overall, AUC improved by 4.6%. The result suggests the advantage of using textual data to acquire more precise associations between concepts. Incidentally, without the proxy terms described in Section 3.2.2, the overall AUC of SchemeT decreased to 0.682 (not shown in Tab. 2), verifying their effectiveness.

4.3.2. Impact of Full-Text Articles

This section reports preliminary experiments examining the impact of full-text articles for literature-based discovery. Since full-text articles provide more comprehensive information than abstracts, they are thought to be beneficial in the proposed framework. We used the full-text collection from the TREC Genomics Track 2004,9 which contains 11,880 full-text articles. However, the conditions described in Section 4.2 inevitably decreased the number of usable articles to 679. We conducted comparative experiments using these full-text articles and only the corresponding 679 abstracts in estimating P(f|p) for a fair comparison. Note that, due to the small data size, these results cannot be directly compared to those reported above. Table 3 summarizes the results obtained based on only titles and abstracts ("Abs") and complete full-text articles ("Full") using SchemeT. Examining each disease class, it is observed that the use of full-text articles led to a large improvement over using abstracts, except for the immune class. Overall, the improvement achieved by full texts is 5.1%, indicating the potential advantage of full-text articles.
Table 3. System performance in AUC based on 679 articles. The figures in the parentheses indicate the percent increase/decrease relative to Abs.

Text   Cardiovascular   Immune          Metabolic       Psych           Unknown        Overall
Abs    0.652            0.612           0.566           0.623           0.693          0.643
Full   0.737 (13.0%)    0.590 (-3.6%)   0.640 (13.0%)   0.724 (16.2%)   0.731 (5.5%)   0.676 (5.1%)
4.3.3. Enhancing Probability Estimates by Domain Ontologies

In order to examine the effectiveness of the use of domain ontologies for enhancing P(f|p), we applied the proposed method to SchemeT in Tab. 2 and to Full in Tab. 3. (Note that Full is also based on SchemeT for estimating P(f|p) but uses full-text articles instead of abstracts.) Figure 2 summarizes the results for different numbers of iterations, where the left and right plots correspond to SchemeT and Full, respectively. Incidentally, we used only child-to-parent relations in the GO hierarchy for this experiment, as this yielded the best results in the cancer class.
[Figure: two bar plots of AUC per disease class (Card, Imm, Meta, Psy, Unkw, All); left panel: SchemeT; right panel: Full (SchemeT with 679 full-text articles).]
Figure 2. System performance after enhancing associations using GO parent-to-child relations. Three bars in each disease class correspond to the number of iterations of enhancement.
For SchemeT, the effects were less consistent across the classes and, overall, the improvement was small. For Full, on the other hand, we observed clearer improvement except for two classes, Cardiovascular and Psych, and the overall AUC improved by 4.0%. The difference is presumably due to the fact that the associations learned by Full are sparser than those learned by SchemeT, as the amount of training data for Full was limited in this experiment. The enhancement was intended to uncover missed associations and thus worked favorably for Full.

5. Conclusion

This study was motivated by Swanson's work in literature-based discovery and investigated the application of IR models and techniques in conjunction with the
use of domain-specific resources, such as gene databases and ontologies. The key findings of the present work are that a) the consideration of textual information improved system prediction by 4.6% in AUC over simply relying on co-annotations of keywords, b) using full text improved overall AUC by 5.1% as compared to using only abstracts, and c) the hierarchical structure of GO could be leveraged to enhance probability estimates, especially those learned from small training data. Moreover, we created realistic benchmark data, where old and new knowledge were carefully separated to simulate gene-disease association discovery. For future work, we plan to investigate the use of semantic distance10 in propagating the probabilities P(f|p). In addition, we would like to compare the proposed framework with the previous work (e.g., Ref. 6) and with other IR models having one intermediate layer between genes and disease, so as to study the characteristics of our model.

Acknowledgments

Dr. Mostafa was funded through NSF grant #0549313.

References
1. M. A. Hearst. Untangling text data mining. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 3-10, 1999.
2. D.R. Swanson. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30(1):7-18, 1986.
3. P. Srinivasan. Text mining: generating hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5):396-413, 2004.
4. M. Weeber, R. Vos, H. Klein, L. Berg, R. Aronson, and G. Molema. Generating hypotheses by discovering implicit associations in the literature: A case report of a search for new potential therapeutic uses for thalidomide. Journal of the American Medical Informatics Association, 10(3):252-259, 2003.
5. H. Turtle and W.B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222, 1991.
6. C. Perez-Iratxeta, M. Wjst, P. Bork, and M. Andrade. G2D: a tool for mining genes associated with disease. BMC Genetics, 6(1):45, 2005.
7. M.J. Schuemie, M. Weeber, B.J.A. Schijvenaars, E.M. van Mulligen, C.C. van der Eijk, R. Jelier, B. Mons, and J.A. Kors. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20(16):2597-2604, 2004.
8. T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Laboratories, 2004.
9. W. Hersh, R.T. Bhupatiraju, L. Ross, A.M. Cohen, and D.F. Kraemer. TREC 2004 genomics track overview. In Proceedings of the 13th Text REtrieval Conference, 2004.
10. P.W. Lord, R.D. Stevens, A. Brass, and C.A. Goble. Semantic similarity measures as tools for exploring the gene ontology. Pacific Symposium on Biocomputing, 2003.
A COGNITIVE EVALUATION OF FOUR ONLINE SEARCH ENGINES FOR ANSWERING DEFINITIONAL QUESTIONS POSED BY PHYSICIANS

HONG YU
University of Wisconsin-Milwaukee, Department of Health Sciences, 2400 E. Hartford Avenue, PO Box 413, Milwaukee, WI 53210, USA

DAVID KAUFMAN
Columbia University, Department of Biomedical Informatics, 622 West 168th Street, VC-5, New York, NY 10032, USA
The Internet is having a profound impact on physicians' medical decision making. One recent survey of 277 physicians showed that 72% of physicians regularly used the Internet to research medical information and 51% admitted that information from web sites influenced their clinical decisions. This paper describes the first cognitive evaluation of four state-of-the-art Internet search engines: Google (i.e., Google and Scholar.Google), MedQA, Onelook, and PubMed for answering definitional questions (i.e., questions with the format of "What is X?") posed by physicians. Onelook is a portal for online definitions, and MedQA is a question answering system that automatically generates short texts to answer specific biomedical questions. Our evaluation criteria include quality of answer, ease of use, time spent, and number of actions taken. Our results show that MedQA outperforms Onelook and PubMed in most of the criteria, and that MedQA surpasses Google in time spent and number of actions, two important efficiency criteria. They also show that Google is the best system for quality of answer and ease of use. We conclude that Google is an effective search engine for medical definitions, and that MedQA exceeds the other search engines in that it provides users with direct answers to their questions, while the users of the other search engines have to visit several sites before finding all of the pertinent information.
1 Introduction
The Internet offers widespread access to health and science information. Although there have been many concerns about quality, due to variations in accuracy, completeness, and consistency (1-10), the Internet is having a profound impact on both patients' access to healthcare information (11, 12) and physicians' medical decision making (13). One recent survey of 277 physicians showed that 72% of physicians regularly used the Internet to research medical information and 51% declared that the Internet influenced their healthcare decisions (13). The Internet may satisfy physicians' information needs in two ways. First, it is well-reported that physicians often have questions when caring for their patients (14); the Internet incorporates a vast amount of healthcare and scientific information, which may provide an excellent resource to answer their questions. Although the quality of the information is still in dispute, studies found that the Internet has increased in quality over the years (15). In certain domains, the information presented on the Internet was evaluated to be accurate (16). Secondly, the Internet provides different publicly available search engines and information retrieval systems (e.g., Google and PubMed) that may allow
physicians to efficiently access information. Efficiency is extremely important to physicians, as studies found that physicians spent on average two minutes or less seeking an answer to a question, and that if a search took longer, it was likely to be abandoned (14, 17-19). In this study, we report a cognitive evaluation comparing a special-purpose biomedical search engine, MedQA, with three state-of-the-art search engines, with the goal of identifying an optimal system that best suits physicians' information needs. Specifically, we asked physicians to evaluate Google, MedQA, Onelook, and PubMed for answering definitional questions (i.e., questions with the format of "What is X?"). Google is a popular online search engine (4) and was evaluated to be the best web-search engine for answering medical questions (18). Google offers a wide range of resources and special-purpose search engines such as Google Scholar. Subjects were free to use any of the Google tools to conduct their searches. OneLook (http://www.onelook.com/) is a portal for numerous online dictionaries, including several medical editions (e.g., Dorland's). A recent study suggested that domain portals were most efficient for accessing healthcare information (20). MedQA automatically analyzes thousands of documents to generate a coherent paragraph-level text that specifically answers an ad-hoc medical question (21). PubMed is frequently searched by physicians in clinical settings (22). Our work is related to the work of Berkowitz (2002) (23), in which 14 search engines (e.g., Google and PubMed) were evaluated for answering clinical questions. In that study, quality of answer and the overall time spent obtaining an answer were measured. The results showed that Google performed poorly in quality of answer because many of the answers were from consumer-oriented sites and therefore did not incorporate the information physicians needed, and that PubMed required a longer time to obtain an answer. The limitations of Berkowitz's study include that it did not measure cognitive aspects, including interpretation and analysis of the number of actions involved in identifying answers. Additionally, all the evaluation was performed subjectively by the author (i.e., Berkowitz) of the article. Our study is based on a randomized controlled cognitive evaluation by four physicians who are not the authors of this article. Additionally, a unique feature of our study is that we provide an evaluation of an advanced biomedical question answering system, and we compare it to three other state-of-the-art information retrieval systems.
[Figure: MedQA pipeline from Questions to Answer: Question Classification → Document Retrieval (MEDLINE, WWW) → Answer Extraction → Summarization → Answer Formulation → Answer Presentation.]
Figure 1: MedQA system architecture

2 MedQA
MedQA is a question answering system that automatically analyzes thousands of documents (both the Web documents and MEDLINE abstracts) to generate a short text
to answer definitional questions (21). In summary, MedQA takes in a question posed by either a physician or a biomedical researcher. It automatically classifies the posed question into a question type for which a specific answer strategy is developed (24, 25). Noun phrases are extracted from the question to be the query terms. Document Retrieval applies the query terms to retrieve documents from either the World-Wide-Web or locally-indexed literature resources. Answer Extraction automatically identifies the sentences that provide answers to questions. Text Summarization condenses the text by removing redundant sentences. Answer Formulation generates a coherent summary. The summary is then presented to the user who posed the question. Figure 1 shows the architecture of MedQA, and Figure 2 shows MedQA's output for the question "What is vestibulitis?" Most of the evaluation work on question answering systems (26) focuses on information retrieval metrics. Given a text corpus and the answer for a question, the evaluation task is to measure how correctly the text answer is extracted from the corpus. None of the studies, to our knowledge, apply cognitive methods to evaluate human-computer interaction, to measure the efficacy, accuracy, and perceived ease of use of a question answering system, and to compare a question answering system to other information systems such as information retrieval systems.
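The following is a schematic sketch of a definitional question-answering pipeline of the kind just described. The stage functions are placeholders, not MedQA's actual classifiers, retrieval back-ends, or summarizer; the retrieved sentences are hypothetical examples.

```python
import re

# Hedged sketch of a MedQA-style pipeline: classify -> retrieve -> extract -> summarize.

def classify(question: str) -> str:
    """Map a question to a question type with an associated answer strategy."""
    return "definitional" if re.match(r"(?i)what (is|are)\b", question) else "other"

def query_terms(question: str) -> list:
    """Crude noun-phrase stand-in: strip the question frame, keep content words."""
    body = re.sub(r"(?i)^what (is|are)\s+|\?$", "", question)
    return body.split()

def retrieve(terms):            # would query MEDLINE / the Web in a real system
    return ["Vulvar vestibulitis is a syndrome of ...", "An unrelated sentence."]

def extract_answers(docs, terms):
    return [s for s in docs if any(t.lower() in s.lower() for t in terms)]

def summarize(sentences):       # redundancy removal and ordering in a real system
    return " ".join(dict.fromkeys(sentences))

def answer(question: str) -> str:
    if classify(question) != "definitional":
        return "Unsupported question type."
    terms = query_terms(question)
    return summarize(extract_answers(retrieve(terms), terms))

print(answer("What is vestibulitis?"))
```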
3 Cognitive Evaluation Methods
We designed a randomized controlled cognitive evaluation in order to assess the efficacy, accuracy, and perceived ease of use of Google, MedQA, OneLook, and PubMed. The study was approved by the Columbia University Institutional Review Board.
3.1 Question Selection
We manually examined the total of 4,653 questions1 posed by physicians at various clinical settings (14, 27-29) and found a total of 138 definitional questions.2 We observed that the definitional questions generally fell into several categories, including Disease or Syndrome, Drug, Anatomy and Physiology, and Diagnostic Guideline. In order to maximize the evaluation coverage, we attempted to select questions that cover most of the categories. After a preliminary examination, we found that many questions did not yield answers from two or more of the systems to be evaluated. For example, the question "what is proshield?" did not yield a meaningful answer from three systems (MedQA, OneLook, and PubMed). The objective was to compare different systems, and unanswerable questions present a problem for the analyses because they render such comparisons impossible. On the other hand, screening the questions with the four systems themselves might introduce bias and a selective exclusion process. We therefore employed an independent information retrieval
1 The question collection is freely accessible at http://clinques.nlm.nih.gov/
2 All 138 definitional questions are listed at http://www.dbmi.columbia.edu/~yuh9001/research/definitional_questions.htm.
system, BrainBoost,3 which is a web-based question answering engine that accepts natural language queries. BrainBoost was presented with questions randomly selected from the categories of definitional questions, and the first twelve questions that returned an answer were included in the study. The task was performed by an unbiased assistant who was not privy to the reasons for doing the search. The 12 selected questions are shown in bold at http://www.dbmi.columbia.edu/~yuh9001/research/definitional_questions.htm.
Figure 2: MedQA's output for the question "What is vestibulitis?" The output displays an online definition that comes from Dorland's Illustrated Medical Dictionary, a summary that incorporates definitional sentences extracted from different PubMed records, and an "other relevant sentences" section that incorporates additional relevant sentences. The parenthetical expression incorporates the last name of the first author and the year of the publication (e.g., (Sackett 2001)); the expression links to the PubMed records from which the preceding sentences are extracted.
3.2 Subjects and Procedure
Four physicians (three females and one male, ages 30s-50s) who were trainees at the Department of Biomedical Informatics, Columbia University, volunteered to participate in the study. All four physicians have experience using information systems. Each physician was presented with all 12 questions selected for inclusion. For each question, the subjects were asked to evaluate two systems in succession, and the order of the two systems was counterbalanced. Each subject posed six questions to each of the four systems. The four subjects therefore posed a total of 96 questions (12 x 4 x 2). All evaluation studies were conducted in May 2006.
3 http://www.brainboost.com/
After consenting to participate in the study, participants were given written instructions on how to perform the task. They were presented with each question on a cue card and asked to find the text that best answered the question. The order in which questions were presented was randomized. The card also indicated the two systems to be used and their sequence. Once the text was located, they were asked to copy and paste it into a Word document. They were free to continue to search and paste text into the document until they were satisfied that they had found the best answer possible. There was a time limit of 5 minutes for each question/system event. We chose 5 minutes as a cutoff because a previous study found that Internet users successfully found health information to answer questions in an average of 5 minutes (30). Participants were asked to think aloud during the entire process. After completing each question evaluation comparing the two systems, they were asked to respond to two Likert questions: 1) rate the quality of the answer and 2) rate the ease of use of the system. We employed a five-point rating scale from the poorest (1) to the best (5). We used the Morae usability software system to record the screen activities and to audio-record a subject's comments for the entire session. Morae provides a video of all screen activity and logs a wide range of events and system interactions, including mouse clicks, text entries, web-page changes, and Windows dialogue events. It also provides the analyst with the capability to timestamp, code, and categorize a range of video events.
Analysis
On the basis of a cognitive task analysis (Kaufman et al, 2003; Elhadad, 2005), we identified goals and actions common to all systems. Table 1 shows a list of actions we defined. We also noted system responses (e.g., what was displayed after executing a search), analyzed comments thematically and measured the response times. The protocols were coded by both authors. The total coding time for four subjects was about 30 hours. Table 1: Actions used to answer questions. Enter Query: Entering a search term in the search text box provided by the system. Find Document: An action that involves >10s of time spent examining the retrieved list of documents (e.g., Web documents or PubMed abstracts). Query Modification: An action that involves modification of the existing query or user-interface (e.g., change from Google to Scholar.Google). Read Document: An action that involves a subject to spend >10s to read the selected document. Scroll-Down Document: Scroll down a document to search for the answer. Search Text Box: A subject applies the "Find" function to locate relevant text. Select Document: A subject selects and opens a document to examine whether the answer appears in the document Select Linkout: An action that involves selecting another link from the selected document. Select Text as Answer: A subject selects the text as the answer to a question.
4 Evaluation Results
In the following section, we present results of the cognitive evaluation. The first part of this section illustrates the processes of question-answering. We also show the coding process used to characterize participants' actions. The second part of this section focuses on a quantitative comparison of the four systems. We include both objective measures
such as actions and response latency, and subjective measures, namely, participants' ratings of the quality of answers as well as their ease of use.

4.1 Illustrations
The following two coding excerpts illustrate the process of question-answering on two pairs of systems, PubMed and MedQA, and OneLook and Google. The excerpts are representative of task performance. The subject was an experienced physician with a master's in informatics who was well-versed in performing medical information-seeking tasks.

Excerpt 1: PubMed and MedQA
The subject had completed five questions and was a little more than forty minutes into the session. The question in this excerpt was "What is vestibulitis?" The systems used to find the answer were PubMed and MedQA, respectively. The entire segment lasted 6 minutes, of which 4:25 was used to search PubMed and 1:11 to search MedQA.

44:23 ACTION (Enter Query-PubMed): vestibulitis
44:34 SYSTEM RESPONSE: 251 MEDLINE records returned
44:51 (User) COMMENT: OK, I definitely got some answers that do not apply at all... I have no idea why the first set of returns are coming back with psychological problems, but maybe not true, as a physician just makes assumption of that ENT would be returned, but if I am gynecologist, that probably is what I am looking for. Vulvar vestibulitis, I have no idea what it is. I guess I will go find out because I do not know.
334 GENERAL COMMENT: I would say that PubMed again all the information was there but was not held in a useful fashion and I need to search all and I have to filter myself...and quality of answer was OK and ease of use is poor because I need to go through everything. MedQA quality of answer is excellent and ease of use is excellent, I do not need to do anything.
Excerpt 2—QneLook and Google The subject had completed nine questions and was a little more than one hour and half into the session. The current question answered was "What is gemfibrozil?" The systems used to find the answer are OneLook and Google, respectively. The entire segment was 5:08 minutes, of which 1:44 is used to search the OneLook system and 2:46 to search Google. 1:31:08 ACTION (ENTER QUERY-ONELOOK): gemfibrozil COMMENT: I know I am looking into medication, Gemfibrozil, I know that I have the advantage of what I am looking for. 31:32 SYSTEM RESPONSE: 4 matching dictionaries in General and 4 matching dictionaries in Medicine COMMENT: So I get of course a General definition and Medicine related match. I will go my favorite Wikipedia first 31:51 SYSTEM RESPONSE Web Page Changes--Wiki... COMMENT: it returns out-links... COMMENT: Unfortunately, the Wikipedia isn't so good because it gives me more or less an outline of a whole set of other links that I would have to go find in order to get specific information. I am going back from Wikipedia and go to Medical online dictionaries, I am going to try Online Medical Dictionary first. 32:20 SYSTEM RESPONSE Web Page Changes -Online Medical Dictionary COMMENT: I got absolutely useless information. I am going to Stedman's and Stedman's is not working, I found it out before. I go to Dorland's, Dorland's Medical Dictionary... 32:30 SYSTEM RESPONSE Web Page Changes - Dorland's Medical Dictionary 32:52: ACTION SELECT TEXT AS ANSWER COMMENT: I get gemfibrozil ... which is medication used to lower serum lipid level by decreasing triglyceride, it is just one line definition. I would say that it is probably acceptable, but if I have spent the time with the Wikipedia following the out-links, I probably would be able to find more information. 33:30 ACTION (ENTER QUERY-SCHOLAR.GOOGLE) : gemfibrozil COMMENT Now I am going to Scholar.Google 33:43 SYSTEM RESPONSE Web Page Changes - Google returned three article links 33:58 ACTION SELECT DOCUMENT (a full-text article) 34:10 ACTION READ DOCUMENT COMMENT: On my first look on the medication... 34:32 ACTION SELECT TEXT AS ANSWER COMMENT: I get quite a good description of the effects of new medication along with ... 34:40 ACTION PULLUP PDF FILE 34:48 ACTION SCROLL-DOWN DOCUMENT 34:55 ACTION SELECT TEXT AS ANSWER COMMENT: looks great.-.along with appropriate bibliography...With Google, with Google again, I got lucky, find an article very quickly, given me the best information about the medication. 35:50 ACTION (ENTER QUERY-GOOGLE) : gembibrozil COMMENT: let's see what happened if I go Google itself as appose to Google Scholar. 36:05 SYSTEM RESPONSE Web Page Changes - (Google returns 1,330,000 hits) COMMENT: I got Medicine.com dictionary 36:16 ACTION SELECT TEXT AS ANSWER
COMMENT: I got some very good information... which is more an overview, put gemfibrozil in the context with other medications for lowering serum lipid levels, so I would get a more understanding from this perspective and therefore Google general as oppose to Google.Scholar is actually a better choice as the Google search engine.

GENERAL COMMENT: For this study, Onelook I would say, was able to give me the definition which was OK in terms of quality, ease of use was poor because either that a lot of out-links are not working, or that the out-links link to useless information. Google in this instance the quality of answer is definitely good, excellent, and ease of use in this instance, again is excellent, right answer comes from the top.
The two excerpts show that the pattern of actions employed by participants reflects the nature of the interactions supported by each system. For example, subjects would iteratively search PubMed until they found a satisfactory answer. As a consequence, they would examine multiple documents (necessitating find-link and Linkout actions), only a few of which were relevant. The subjects typically searched for full-text articles via the Linkout actions. The iterative nature of the search was also evidenced by the number of actions pertaining to query modification, searching the text box, and document selection. Table 2 lists a summary of the comments made by subjects throughout the evaluation. Our results show that Google received more favorable comments than complaints. Both MedQA and OneLook received some good comments and some complaints. PubMed was generally criticized and was not given any favorable comments.

Table 2: A summary of comments on the different systems (D for disadvantages and A for advantages).
Google (D): retrieves back a lot of links (to the question "What is cubital tunnel syndrome?"). Most of the links seem to relate to individual cases of the diseases, not necessarily definitions.
Google (D): One needs to search and evaluate the definitions in Google.
Google (A): retrieves both patient (Google) and physician-centric (Scholar.Google) information.
Google (A): Scholar.Google is much faster because it is the second link, while in PubMed the evaluator has to search through a lot of other articles.
MedQA (D): needs to type in 'What is' versus a direct query.
MedQA (D): takes a considerably longer time to respond than the other systems.
MedQA (A): returns all the context that the evaluator would otherwise have to search for manually. It is only one step and gets exactly what is needed.
MedQA (A): gives an answer (to the question "What is Popper?") that Onelook did not, which is that the drug is injectable, which is important to know for a physician.
Onelook (D): pulls all links. It lets the user guess which link contains a comprehensive answer. Sometimes, the links are broken. It is a matter of luck to get to the right links.
Onelook (D): answer quality is poor. It has a terrible user interface. It shows two ugly photos.
Onelook (A): definition has more content than PubMed.
PubMed (D): is not a good resource for definitions.
PubMed (D): is not useful. It takes forever to find information.
4.2 Quantitative Evaluation
The results show that the subjects did not find answers to one question in Google ("Dawn's phenomenon"), 3 questions in Onelook ("epididymis appendix", "heel pain syndrome", and "Ottawa knee rules"), 3 questions in MedQA ("epididymis appendix", "Ottawa knee rules," and "paregoric"), and 2 questions in PubMed ("epididymis appendix" and "paregoric"). Both MedQA and Onelook acknowledged "no results found" and returned no answers when such an event occurred, while both PubMed and Google
returned a list of documents even if a subject could not identify the definitions from the documents within the 5-minute time limit. We observed that none of the subjects used Google:Definition as the service to identify definitions; instead, they applied the query terms in either Google or Scholar.Google. We also observed that subjects gave the poorest score (i.e., 1) for quality of answer when MedQA or OneLook returned no answers, and a better score (i.e., 2-3) when a search engine (e.g., Google or PubMed) returned a list of documents, even if the subject could not find any answers in the documents within the 5-minute time limit. Subjects commented that even documents that do not contain answers frequently provided some knowledge about the answers. For example, subjects found that "popper" is a drug although no detailed definition was found. In contrast, the subjects typically gave a good score for ease of use when MedQA and OneLook returned no answers. Table 3 presents descriptive statistics of the subjective and objective measures. In general, Google was the preferred system, as reflected in both the quality of answer and ease of use ratings. MedQA achieved the second highest ratings in both measures. OneLook received the lowest ratings for quality of answer, and PubMed was rated the worst in terms of ease of use. If we excluded the poor scores given when MedQA did not return any answer, the quality of answer for MedQA went up to 4.5.

Table 3: Average score and (standard deviation) of quality of answer and ease of use, and average time spent (in seconds) and actions taken.

                     Google        MedQA         Onelook       PubMed
Time Spent           69.6 (6.9)    59.1 (57.7)   83.1 (63.6)   182.2 (85.8)
Number of Actions    4.4 (3.0)     2.1 (2.0)     6.5 (7.7)     10.3 (5.7)
Quality of Answer    4.90 (0.15)   2.92 (0.24)   2.77 (0.08)   2.92 (0.88)
Ease of Use          4.75 (0.29)   4.0 (0.24)    3.9 (0.32)    2.36 (0.88)
While the processing time to obtain an answer was almost instantaneous for Google, Onelook, and PubMed, the average time spent by MedQA to obtain an answer to the 10 answerable questions was 15.8±7.1 seconds. MedQA was nevertheless the fastest system on average for a subject to obtain the definition. For measuring the average time spent, we excluded the cases in which MedQA and Onelook returned no answer. The subjects, on average, spent more time searching PubMed than any of the other systems. In fact, the average PubMed search required more than three times the amount of time required to search MedQA. This is at least partly due to the complexity of the interaction, which is borne out by the fact that participants needed more than 10 actions in using PubMed to answer a question, whereas they required only 2 actions on average when they used MedQA. PubMed provides a range of affordances (e.g., limits, MeSH) that supports iterative searching. Although this is a powerful tool, it also increases the complexity of the task and the user's cognitive load. MedQA offers the simplest mode of interaction because it eliminates several of the steps (e.g., uploading documents, searching text, and selectively accessing relevant information in documents) involved in searching for information. The results of the commercial search engines, Google and Onelook, fell in between MedQA and PubMed. However, as evidenced by the high standard deviations, there was significant variability between questions.

5. Discussion
The evaluation results show that Google was the best system for quality of answer (4.90) and ease of use (4.75); recall that the highest possible score for both criteria was 5. The results indicate that Internet resources incorporate reliable medical definitions, and that Google allows subjects to readily access those reliable definitions. This is in contrast to numerous other studies that found Internet information to be of poor quality in the medical context (1-10). However, there are significant differences between our study and the others. First, our study evaluated a more general type of question, namely definitional questions, while the other works examined more specific medical questions (e.g., "What further testing should be ordered for an asymptomatic solitary thyroid nodule, with normal TFTs?" in (23)). Secondly, physicians in our study rated Google highly if they found answers on some sites, even when other sites did not provide answers to the questions. In other studies, precision (i.e., the number of hits that provide answers divided by the total retrieved top N hits) plays an important role in measuring quality. For example, one study (31) concluded that Google hits were of poor quality because only one link out of five contained relevant information. Lastly, in our study the quality of answer was estimated by aggregating information from multiple Web pages. Other studies evaluated the quality of each Web page for answering a specific question; such an evaluation will certainly lead to a much poorer rating of the Internet, because one evaluation study (32) concluded that information is typically scattered across multiple sites: most Web pages cover a topic either in depth or in breadth, and few Web sites combine both depth and breadth. Our results show that OneLook came in third on most of the evaluation criteria. We observed that the evaluators frequently expressed frustration with failed out-links and with nonspecific, general definitions that are of little value to physicians. We show that PubMed performed worst on almost all criteria. Unlike Google, which assigns weights to the returned documents, PubMed returns a list of documents in reverse chronological order, in which the most recent publications appear first. The most relevant documents in PubMed may never appear at the top, and therefore it usually takes a user significant time to identify answers. Previous research showed that it took an average of more than 30 minutes for a healthcare provider to search for an answer in PubMed, which meant that "information seeking was practical only 'after hours' and not in the clinical setting" (22). Finally, we found that MedQA in general outperformed all search engines except for Google. In addition, MedQA outperformed Google in time spent and number of actions, two important efficiency criteria for obtaining an answer. Although it took less than a second for Google to retrieve a list of relevant documents based on a query keyword, and it took an average of 16 seconds for MedQA to generate a summary, the average time spent for a subject to identify a definition was 59.1±57.7 seconds with MedQA, which was faster than the 69.6±6.9 seconds with Google. This is due to the fact that information is scattered across the web (32). A subject typically needs to visit multiple web pages for an
answer. One can never be certain when a link will lead to useful information. This is a relative disadvantage for Google as compared to MedQA.

6. Conclusion
We evaluated four search engines, namely Google, MedQA, OneLook, and PubMed, for the quality and efficiency with which they answer definitional questions posed by physicians. We found that Google was in general the preferred system and that PubMed performed poorly. We also found that MedQA was the best in terms of time spent and number of actions needed to answer a question. It would be ideal if a powerful search engine such as Google could be integrated with an advanced question-answering system to yield timely and precise responses addressing users' specific information needs. Although we are encouraged by the findings, this research is best viewed as formative. The conclusions are limited by a number of factors, including the fact that only four physicians participated in the evaluation. Future research would need to include a larger and more diverse sample of clinicians with different levels of domain expertise and degrees of familiarity with information retrieval systems. In this study, we introduced a novel cognitive method for the in-depth study of the question answering process. The method would have to be validated in different contexts. Finally, the scope of the system (answering definitional questions) is rather narrow at this point, and we would want to conduct similar comparisons with different question types. In general, the results of this work suggest that MedQA presents a promising approach for clinical information access.

Acknowledgement: We thank three anonymous reviewers for valuable comments.

References
1. Purcell G. The quality of health information on the internet. BMJ. 2002;324(7337):557-8.
2. Jadad AR, Gagliardi A. Rating health information on the Internet: navigating to knowledge or to Babel? JAMA. 1998 Feb 25;279(8):611-4.
3. Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor - Let the reader and viewer beware. JAMA. 1997 Apr 16;277(15):1244-5.
4. Glennie E, Kirby A. The career of radiography: information on the web. Journal of Diagnostic Radiography and Imaging. 2006;6:25-33.
5. Childs S. Judging the quality of internet-based health information. Performance Measurement and Metrics. 2005;6(2):80-96.
6. Griffiths K, Christensen H. Quality of web based information on treatment of depression: Cross sectional survey. BMJ. 2000;321:1511-15.
7. Cline RJ, Haynes KM. Consumer health information seeking on the Internet: the state of the art. Health Educ Res. 2001 Dec;16(6):671-92.
8. Benigeri M, Pluye P. Shortcomings of health information on the Internet. Health Promot Int. 2003 Dec;18(4):381-6.
9. Wyatt J. Commentary: measuring quality and impact of the WWW. BMJ. 1997;314:1879.
10. McClung HJ, Murray RD, Heitlinger LA. The Internet as a source for current patient information. Pediatrics. 1998 Jun;101(6):E2.
11. Sacchetti P, Zvara P, Plante MK. The Internet and patient education - resources and their reliability: focus on a select urologic topic. Urology. 1999 Jun;53(6):1117-20.
12. Gemmell J, Bell G, Lueder R, Drucker S, Wong C. MyLifeBits: fulfilling the Memex vision. Proceedings of the 10th ACM international conference on Multimedia; France, pp. 235-8.
13. Podichetty V, Booher J, Whitfield M, Biscup R. Assessment of internet use and effects among healthcare professionals: a cross sectional survey. Postgrad Med J. 2006;82:274-9.
14. Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, et al. Analysis of questions asked by family doctors regarding patient care. BMJ. 1999 Aug 7;319(7206):358-61.
15. Pandolfini C, Bonati M. Follow up of quality of public oriented health information on the world wide web: systematic re-evaluation. BMJ. 2002 Mar 9;324(7337):582-3.
16. Sandvik H. Health information and interaction on the internet: a survey of female urinary incontinence. BMJ. 1999 Jul 3;319(7201):29-32.
17. Alper B, Stevermer J, White D, Ewigman B. Answering family physicians' clinical questions using electronic medical databases. J Fam Pract. 2001;50(11):960-5.
18. Jacquemart P, Zweigenbaum P. Towards a medical question-answering system: a feasibility study. Stud Health Technol Inform. 2003;95:463-8.
19. Takeshita H, Davis D, Straus S. Clinical evidence at the point of care in acute medicine: a handheld usability case study. Proceedings of the human factors and ergonomics society 46th annual meeting; 2002. p. 1409-13.
20. Bhavnani S, Bichakjian C, Johnson T, Little R, Peck F, Schwartz J, et al. Strategy Hubs: Domain Portals to help Find Comprehensive Information. JASIST. 2006;57(1):4-24.
21. Lee M, Cimino J, Zhu H, Sable C, Shanker V, Ely J, et al. Beyond information retrieval - Medical question answering. AMIA. Washington DC, USA; 2006.
22. Hersh WR, Crabtree MK, Hickam DH, Sacherek L, Friedman CP, Tidmarsh P, et al. Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions. J Am Med Inform Assoc. 2002;9(3):283-93.
23. Berkowitz L. Review and Evaluation of Internet-based Clinical Reference Tools for Physicians: UpToDate; 2002.
24. Yu H, Sable C. Being Erlang Shen: Identifying answerable questions. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence on Knowledge and Reasoning for Answering Questions; 2005.
25. Yu H, Sable C, Zhu H. Classifying Medical Questions based on an Evidence Taxonomy. Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains; 2005.
26. Voorhees E, Tice D. The TREC-8 question answering track evaluation. TREC; 2000.
27. Ely JW, Osheroff JA, Ferguson KJ, Chambliss ML, Vinson DC, Moore JL. Lifelong self-directed learning using a computer database of clinical questions. J Fam Pract. 1997 Nov;45(5):382-8.
28. Ely JW, Osheroff JA, Chambliss ML, Ebell MH, Rosenbaum ME. Answering physicians' clinical questions: obstacles and potential solutions. J Am Med Inform Assoc. 2005 Mar-Apr;12(2):217-24.
29. D'Alessandro DM, Kreiter CD, Peterson MW. An evaluation of information-seeking behaviors of general pediatricians. Pediatrics. 2004 Jan;113(1 Pt 1):64-9.
30. Eysenbach G, Kohler C. How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews. BMJ. 2002 Mar 9;324(7337):573-7.
31. Berland GK, Elliott MN, Morales LS, Algazy JI, Kravitz RL, Broder MS, et al.
Health information on the Internet: accessibility, quality, and readability in English and Spanish. JAMA. 2001 May 23-30;285(20):2612-21.
32. Bhavnani S. Why is it difficult to find comprehensive information? Implications of information scatter for search and design: Research Articles. Journal of the American Society for Information Science and Technology. 2005;56(9):989-1003.
BIODIVERSITY INFORMATICS: MANAGING KNOWLEDGE BEYOND HUMANS AND MODEL ORGANISMS
INDRA NEIL SARKAR
Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA 02543, USA
E-mail: [email protected]
In the biomedical domain, researchers strive to organize and describe organisms within the context of health and epidemiology. In the biodiversity domain, researchers seek to understand how organisms relate to one another, either historically through evolution or spatially and geographically in the environment. Currently, there is limited cross-communication between these domains. As a result, valuable knowledge that could inform studies in either domain often goes unnoticed. Biodiversity knowledge has long been a valuable source for many biomedical advances [1]. Before the creation of synthetic compounds, medicinal compounds originated solely from natural plant and animal extracts. Although estimates of the number of organisms on Earth range from 10 to 100 million [2], biomedical research primarily focuses on only a fraction of these as "model" organisms [3]. Furthermore, much knowledge may be lost to the biomedical community because many organisms are described only within biodiversity resources. These studies can form the basis for research on evolution, speciation, and distribution, and also provide an important baseline for studies not only of conservation but also of emerging diseases. The integration of biodiversity knowledge from museum collections, for example, has provided significant insights into the etiology and distribution of diseases such as hantavirus [4]. Understanding the etiology of diseases and their host epidemiology may also further the development of vaccinations and treatments that can help prevent epidemics or pandemics, such as the looming threat of avian flu [5]. For emergent diseases (e.g., malaria), biomedical researchers traditionally focus on a limited number of species (e.g., four species of Plasmodium [6]). This represents very little in terms of the phylogenetic diversity of diseases that are known to infect numerous other organisms (e.g., malaria affects birds, lizards, and other mammals [7-12]). The incorporation of biodiversity knowledge in the context of biomedical advances may lead to breakthrough therapies for many of the diseases that still plague human society [13, 14]. The genomic revolution has resulted in a deluge of sequence data and derivative knowledge (e.g., protein structure prediction and gene expression experiments), which are the predominant data types in biomedical research.
INS is funded in part by NSF-IIS-0241229, NSF-BDI-0421604, and the D.A.B. Lindberg Research Fellowship from the Medical Library Association.
Three of the papers in this session examine how these data can be integrated, annotated, and interpreted in light of greater taxonomic sampling. First, Cadag et al. propose a semi-automated framework that uses a federated approach to incorporate relevant knowledge from heterogeneous resources to assist with gene annotation. Next, Ng et al. demonstrate a prototype application to suggest annotations for genes that are involved with biological pathways across a range of organisms. Finally, Hampikian and Andersen examine the existence and utility of gene sequence regions that are not present in organisms across the tree of life. The final two papers in this session consider how sequence and sequence-derived information can be complemented with biodiversity data (e.g., morphological, ecological, and temporal data). First, Maglia et al. propose a framework to incorporate existing ontologies and their structures towards the development of an ontology for amphibian morphology. Sautter et al. then describe a system to semi-automatically organize knowledge that is embedded in literature resources. There has been considerable discussion in both the scientific [15-18] and popular media [5, 19] with regard to the need for methods and tools to organize and integrate biodiversity and biomedical knowledge within the context of legacy, existing, and newly generated data. Significant infrastructural and methodological advancements are needed to incorporate knowledge from both biomedical and biodiversity resources. To this end, it is hoped that the papers that follow will spark synergistic activities that benefit both the biomedical and biodiversity communities.

References
1. W. E. Muller, R. Batel, H. C. Schroder, and I. M. Muller, "Traditional and Modern Biomedical Prospecting: Part I - the History: Sustainable Exploitation of Biodiversity (Sponges and Invertebrates) in the Adriatic Sea in Rovinj (Croatia)," Evid Based Complement Alternat Med, vol. 1, pp. 71-82, 2004.
2. E. O. Wilson, "The encyclopedia of life," Trends in Ecology and Evolution, vol. 18, pp. 77-80, 2003.
3. B. L. Umminger, "Unconventional organisms as models in biological research," J Exp Zool Suppl, vol. 4, pp. 2-5, 1990.
4. T. L. Yates, J. N. Mills, C. A. Parmenter, T. G. Ksiazek, R. R. Parmenter, J. R. Vande-Castle, C. H. Calisher, S. T. Nichol, K. D. Abbott, J. C. Young, M. L. Morrison, B. J. Beaty, J. L. Dunnum, R. J. Baker, J. Salazar-Bravo, and C. J. Peters, "The Ecology and Evolutionary History of an Emergent Disease: Hantavirus Pulmonary Syndrome," BioScience, vol. 52, pp. 989-998, 2002.
5. D. G. McNeil, "Hitting the Flu at Its Source, Before It Hits Us," in New York Times. New York, 2005.
6. M. T. Makler, C. J. Palmer, and A. L. Ager, "A review of practical techniques for the diagnosis of malaria," Ann Trop Med Parasitol, vol. 92, pp. 419-33, 1998.
7. C. T. Atkinson, K. L. Woods, R. J. Dusek, L. S. Sileo, and W. M. Iko, "Wildlife disease and conservation in Hawaii: pathogenicity of avian malaria (Plasmodium relictum) in experimentally infected iiwi (Vestiaria coccinea)," Parasitology, vol. 111 Suppl, pp. S59-69, 1995.
8. P. C. Garnham, "Recent research on malaria in mammals excluding man," Adv Parasitol, vol. 11, pp. 603-30, 1973.
9. B. Mons and R. E. Sinden, "Laboratory models for research in vivo and in vitro on malaria parasites of mammals: Current status," Parasitol Today, vol. 6, pp. 3-7, 1990.
10. R. E. Ricklefs and S. M. Fallon, "Diversification and host switching in avian malaria parasites," Proc Biol Sci, vol. 269, pp. 885-92, 2002.
11. J. J. Schall, "Virulence of lizard malaria: the evolutionary ecology of an ancient parasite-host association," Parasitology, vol. 100 Suppl, pp. S35-52, 1990.
12. J. J. Schall, "Lizards infected with malaria: physiological and behavioral consequences," Science, vol. 217, pp. 1057-9, 1982.
13. L. A. Basso, L. H. da Silva, A. G. Fett-Neto, W. F. de Azevedo, Jr., S. Moreira Ide, M. S. Palma, J. B. Calixto, S. Astolfi Filho, R. R. dos Santos, M. B. Soares, and D. S. Santos, "The use of biodiversity as source of new chemical entities against defined molecular targets for treatment of malaria, tuberculosis, and T-cell mediated diseases - a review," Mem Inst Oswaldo Cruz, vol. 100, pp. 475-506, 2005.
14. F. Pelaez, "The historical delivery of antibiotics from microbial natural products - Can history repeat?," Biochem Pharmacol, 2005.
15. D. Agosti, "Biodiversity data are out of local taxonomists' reach," Nature, vol. 439, pp. 392, 2006.
16. J. Soberon and A. T. Peterson, "Biodiversity informatics: managing and applying primary biodiversity data," Philos Trans R Soc Lond B Biol Sci, vol. 359, pp. 689-98, 2004.
17. S. Blackmore, "Environment. Biodiversity update - progress in taxonomy," Science, vol. 298, pp. 365, 2002.
18. F. A. Bisby, "The quiet revolution: biodiversity informatics and the internet," Science, vol. 289, pp. 2309-12, 2000.
19. "Today we have naming of parts: a global registry of animal species could shake up taxonomy," in Economist, 2006.
BIOMEDIATOR DATA INTEGRATION AND INFERENCE FOR FUNCTIONAL ANNOTATION OF ANONYMOUS SEQUENCES
EITHON CADAG, BRENT LOUIE, PETER J. MYLER, PETER TARCZY-HORNOCH
Depts. of Medical Education and Biomedical Informatics, Pathobiology, Pediatrics, and Computer Science and Engineering, University of Washington, Seattle, WA, USA
Seattle Biomedical Research Institute, Seattle, WA, USA
Corresponding author: [email protected]
Scientists working on genomics projects are often faced with the difficult task of sifting through large amounts of biological information dispersed across various online data sources that are relevant to their area or organism of research. Gene annotation, the process of identifying the functional role of a possible gene, in particular has become increasingly more time-consuming and laborious to conduct as more genomes are sequenced and the number of candidate genes continues to increase at a near-exponential pace; genes are left un-annotated, or worse, incorrectly annotated. Many groups have attempted to address the annotation backlog through automated annotation systems that are geared toward specific organisms, and which may thus not possess the necessary flexibility and scalability to annotate other genomes. In this paper, we present a method and framework which attempts to address problems inherent in manual and automatic annotation by coupling a data integration system, BioMediator, to an inference engine with the aim of elucidating functional annotations. The framework and heuristics developed are not specific to any particular genome. We validated the method with a set of randomly-selected annotated sequences from a variety of organisms. Preliminary results show that the hybrid data integration and inference approach generates functional annotations that are as good as or better than "gold standard" annotations ~80% of the time.
1. Introduction
The increasing rate of genomic discovery has left biologists with an overwhelming number of new and tentatively novel genes to examine. One of the first steps in scrutinizing a new genome is to annotate its genes with biochemical characteristics, cellular localization, and other functional properties to quickly identify targets of interest for further study. The re-visitation of "hypothetical" proteins using multiple updated molecular databases can reveal valuable biological information as well. It is estimated that between 25% and 66% of genes, depending on the organism, are annotated as "hypothetical" [1]. Annotation, however, is often a slow and laborious process, and the complete annotation of even a modestly-sized genome can take a small team of skilled annotators years to finish. Even with a large group of scientists the task remains non-trivial; collaborating scientists working on Drosophila
melanogaster organized a two-week "jamboree" to accomplish functional annotation [2]. Coupled with the necessity to maintain currency as sequence information is revised and molecular reference databases are updated, annotation becomes a Sisyphean effort. Much of the challenge involved in annotating genes stems from scientists needing to consult various molecular databases to ensure complete and thorough annotations. Online data sources such as those furnished by the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute and many more made freely available by other researchers have become invaluable in helping annotators assign putative functions to genes based on computational results. The nature of how biologic information is stored, i.e. in separate, heterogeneous data sources, dictates that data integration is the first step in gene annotation [3]. Information regarding functional properties of genes is fragmented across various online databases which were developed independently and do not inherently interoperate. To annotate genes, biologists must manually query many individual data sources. Considerable research has been done investigating automated methods of annotation, which in addition to alleviating manual efforts have the capability of querying and analyzing a far larger volume of information. While many of the automated annotation systems created thus far are very effective and successful at generating annotations, most are meant as one-off solutions for specific organisms or sets of organisms, or utilize only a select number of databases and analyses around which the annotation process is tailored; data integration is frequently ad hoc. As the number of molecular databases increases, scalable automated annotation systems will become increasingly necessary. In this paper we present and evaluate a hybrid approach that addresses both the data integration and analytical needs of gene annotation. Recognizing that an effective annotation system must first be an effective data integration system and that biological expertise is indispensable in developing accurate annotations, we incorporated a robust inference engine on top of an already-existing data integration platform, BioMediator. We identified several promising online biologic databases based on the processes used for model and non-model genome annotation projects and formulated a set of pilot heuristics for the inference engine which would reason over database query results and draw conclusions toward the annotations for submitted sequences.
NCBI: http://www.ncbi.nlm.nih.gov; Wellcome Trust Sanger Institute: http://www.sanger.ac.uk; BioMediator: http://www.biomediator.org
To evaluate our methodology, 116 annotated genes were selected randomly from GenBank [4] as a sample set. These genes were re-annotated using our BioMediator-based approach, and our computational annotations were compared to the actual annotations as listed in GenBank. Relying on manual inspection to resolve ambiguity, we found that our automated method yielded functional annotations as good as or better than the listed annotation for 78% of the sample.
2. Related Work
Automated gene annotation is a well-studied subfield of bioinformatics, and many projects have arisen out of the need for expedient gene annotation. Most automated annotation systems rely on a pipeline-based approach [5-7], whereby data is transformed or analyzed step-wise to reach a predicted function. Often the data sources used for the pipeline are replications of publicly available online databases and are housed in local data warehouses. Kasukawa et al., for instance, relied on a custom annotation pipeline with a well-defined control structure to generate first-pass annotations for the mouse genome, and provided an interface for human curation and modification of automated annotations [5]; Potter et al. included a protein annotation pipeline in the Ensembl analysis pipeline [7], which assigns InterPro [8] domains to putative proteins after the gene identification stage, derived from species-specific curated data. Marrying inference to gene annotation systems has also received research attention. Similar to MAGPIE [9], which uses PROLOG to reason over analytical results, FIGENIX uses the Java-based JLog to enact intelligent reasoning over specific portions of its annotation pipelines [6]. Like most other automated annotation systems, FIGENIX uses a data warehouse approach to storing information. In contrast to the automated annotation methods already mentioned, our approach uses a federated database system. The system does not store information locally; rather, queries are sent to the sources, normalized, cleansed and then analyzed in real time, providing a small client-side footprint. This has the advantage of always providing up-to-date data, addressing a limitation of the aforementioned warehousing approaches [10]. Additionally, data integration is accomplished with the use of a mediated schema, which provides the necessary semantic linkage between data sources as well as a common ontology for the development of general heuristics that are not specific to any single data source or genome. Because of the multi-tiered architecture used for the data integration process, new data sources can be readily added and incorporated into the mediated schema with minimal overhead cost and without the large increase in system complexity such a change might provoke in a pipeline-based system.
And, unlike other systems that rely on inference in annotation, the reasoning system is not restricted by an algorithmic pipeline, and is free to enact rules at arbitrary points in the data gathering process.
3. Methods
3.1. Identifying Annotation Sources and Heuristics
To test our combined method of data integration and inference, it was first necessary to select a set of data sources as well as initial logical inference rules to reason over returned information. We created a list of online databases and other resources for use in the process of functionally annotating genomes, derived from methods used for the annotation of a set of organisms at the Seattle Biomedical Research Institute (SBRI). Scientists from SBRI participate in an international effort to sequence and annotate the genomes of three disease-causing parasites, Leishmania major, Trypanosoma brucei and Trypanosoma cruzi [11]. While the three genomes share considerable sequence similarity, most of the genes have little homology with genes in other species; approximately 66% of their genomes are annotated as "hypothetical". Additionally, we attempted to emulate the annotation of Haemophilus influenzae, the first non-viral genome to be completely sequenced. As such, H. influenzae is a far more studied genome than any of the Trypanosomatids. Our experience with these genomes provided the data sources and annotation processes on which we based our system. Understanding annotation processes for a set of non-model genomes and a model genome gave us interesting results. Many of the data sources relied on by scientists for annotating the Trypanosomatids were based on computational analyses, and with the aid of Perl scripts, submission to multiple analytical services was done in parallel. Parsing through and drawing knowledge from the information, however, was a manual endeavor. Annotators for H. influenzae, while also employing some computational services, primarily NCBI's BLAST [12] and domain searches, relied more heavily on literature searches and some species-specific databases. From the sources used by scientists for the aforementioned genomes, a subset was selected to act as the data sources for the evaluation of our automated annotation system: the NCBI BLAST database [12], the NCBI Conserved Domain Database (CDD) [13], Wellcome Trust Sanger's Pfam database [14], the PROSITE database [15], Fred Hutchinson Cancer Research Center's BLOCKS database [16] and the ProDom database [17]. Information on how to apply expert knowledge on returned data was also elicited from scientists, and provided the basis for initial logical inference rules. For example, heuristics provided by one scientist working on the
Trypanosomatid genomes noted that, in examining BLAST scores, it was not necessarily preferable to use the top-scoring results, because the best BLAST hits are not always the closest relation to the sequence in question [18].

3.2. Data Integration for Annotation with BioMediator
The BioMediator data integration system is the querying, retrieval and normalization platform for our automated annotation method. Developed at the University of Washington, BioMediator is a general-purpose biologic data integration system whose adaptability to various biomedical domains has been demonstrated in the past by providing a data integration platform for linking expression array data with analytics software and uniting disparate neuroscience databases to identify locations in the cortex related to language processing [19-21]. A federated system that queries sources in real time, BioMediator relies on a multi-tiered architecture whose core is a mediated schema that translates data from heterogeneous data sources into entity instances from the schema, thus collecting all query results under a single semantic framework (see Figure 1).
Figure 1. Diagram of BioMediator's architecture; data comes from sources (far right, F) via wrappers (E), which serialize the data to schema-mapped XML (A) via the metawrapper layer (C, D); it is then sent to the BioMediator query processor (B) and interface (G). Original image adapted from .
We manually created a mediated schema for generalized, non-genome-specific functional annotation using the Protégé ontology editor, along with wrappers to serialize data from the sources and source-to-schema mappings. During the evaluation of our annotation system, the schema contained 57 entities to represent data across genomic databases (e.g. 'Protein', 'ProteinDatabaseHit') and 55 binary relationships between those entities (such as 'ProteinHasProteinDatabaseHit' to describe a protein homology relationship).
Protégé ontology editor: http://protege.stanford.edu
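As a purely hypothetical illustration (these classes are not BioMediator's API, and the names are our own), the sketch below shows one way instances of mediated-schema entities such as 'Protein' and 'ProteinDatabaseHit', joined by relationships such as 'ProteinHasProteinDatabaseHit', could be held in memory once a source wrapper has translated its native records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a toy mediated-schema instance graph, not BioMediator code.
class EntityInstance {
    final String entityType;                          // e.g. "Protein", "ProteinDatabaseHit"
    final Map<String, String> attributes = new HashMap<>();
    EntityInstance(String entityType) { this.entityType = entityType; }
}

class Relationship {
    final String relationType;                        // e.g. "ProteinHasProteinDatabaseHit"
    final EntityInstance source, target;
    Relationship(String relationType, EntityInstance source, EntityInstance target) {
        this.relationType = relationType; this.source = source; this.target = target;
    }
}

class MediatedResultSet {
    final List<EntityInstance> entities = new ArrayList<>();
    final List<Relationship> relationships = new ArrayList<>();

    // A wrapper for a source (e.g. a BLAST service) would translate its native
    // records into schema entities before adding them here.
    void addHomologyHit(EntityInstance queryProtein, String description, double expectValue) {
        EntityInstance hit = new EntityInstance("ProteinDatabaseHit");
        hit.attributes.put("description", description);
        hit.attributes.put("evalue", Double.toString(expectValue));
        entities.add(hit);
        relationships.add(new Relationship("ProteinHasProteinDatabaseHit", queryProtein, hit));
    }
}
```

Because every source's results are expressed with the same entity and relationship vocabulary, downstream rules can be written once against the schema rather than once per source.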
3.3. Heuristics for Anonymous Sequence Annotation
Utilizing BioMediator's plug-in architecture, we added the Java Expert System Shell (Jess) rule engine [22] to BioMediator, giving us the capability to formulate flexible sets of rules against mediated result sets. Unlike other previous annotation systems that employ rule engines to manage pipelines or make decisions based on analyses, our approach to integrating Jess into BioMediator does not compartmentalize the scope of the rule engine by limiting when or where the rules may fire; the Jess component is free to enact rules over any data as it enters the system piecemeal, after all data is loaded in aggregate, or any combination thereof, and treats all received data as part of the working memory. As a result of our approach, rules are applied in a consistent fashion for all annotations. For our evaluation, three classes of rules were created to emulate as closely as possible some of the annotation processes used by the genome annotators at SBRI. A total of 16 rules were developed for the pilot evaluation of our system (see Figure 2 for a rule example).

(defrule evalue-threshold-homologs
  (threshold (type evalue) (max ?M) (db ?D))
  ?F <- (homolog-pattern (evalue ?X&:(>= ?X ?M))
                         (db ?B&:(eq ?B ?D))
                         (property ?P))
  =>
  (delete-reason ?P ?F "High expect value"))

Figure 2. Example rule that prunes homologs from the result set that do not pass a threshold. Specific thresholds for individual databases may be optionally set, and the final line above saves the reason for the removal of the record.
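The file name, threshold values, and fact strings below are illustrative assumptions (the rules file would also need to define the threshold and homolog-pattern templates and the delete-reason function); this is only a minimal sketch of how a Jess rule base like the one in Figure 2 could be loaded and driven from Java through the standard jess.Rete entry points, not a description of how BioMediator actually wires in the engine.

```java
import jess.JessException;
import jess.Rete;

// Sketch only: loads a hypothetical rules file and asserts a few facts by hand.
public class AnnotationRuleRunner {
    public static void main(String[] args) throws JessException {
        Rete engine = new Rete();
        engine.batch("annotation_rules.clp");   // filtering, evidence-building and selection rules

        // Facts can be asserted piecemeal as query results arrive from the integration layer;
        // the engine may fire rules at any point rather than at fixed pipeline stages.
        engine.assertString("(threshold (type evalue) (max 1.0e-5) (db \"blast\"))");
        engine.assertString("(homolog-pattern (evalue 0.02) (db \"blast\") (property \"unnamed protein\"))");

        engine.run();                           // fire whatever rules currently match working memory
    }
}
```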
3.3.1 Filtering Rules
Filtering rules are heuristics that were limited to strictly ruling out possible annotations or other relevant data from further use by the inference engine. Rules that examined quantitative values against a minimum threshold, for instance, fell under this classification. Also, based on techniques utilized by the FANTOM2 annotation pipeline [5], a filtering rule for the perceived quality of information was created: 12 regular expressions whose patterns indicate a possibly uninformative annotation were used. For example, homologous proteins that contained "unnamed" in their functional annotation were removed from further consideration. Data classified as removed do not leave the working memory; rather, they are restructured so that the reason for their removal is noted, and they can be retrieved again if need be.
Supplementary material on rules is available at: http://www.biomediator.org/publications
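The paper does not list its 12 expressions, so the patterns below are placeholders; the sketch only illustrates the general shape of an uninformative-annotation filter of the kind described in 3.3.1.

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: a few illustrative patterns standing in for the 12 used by the system.
public class AnnotationQualityFilter {
    private static final List<Pattern> UNINFORMATIVE = List.of(
            Pattern.compile("(?i)\\bunnamed\\b"),
            Pattern.compile("(?i)\\bunknown\\b"),
            Pattern.compile("(?i)\\bhypothetical protein\\b"));

    /** Returns true if a homolog's functional annotation looks uninformative. */
    public static boolean shouldFilter(String annotation) {
        return UNINFORMATIVE.stream().anyMatch(p -> p.matcher(annotation).find());
    }

    public static void main(String[] args) {
        System.out.println(shouldFilter("unnamed protein product"));   // true
        System.out.println(shouldFilter("ribosomal protein L34"));     // false
    }
}
```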
3.3.2 Evidence-Building Rules
The second class of rules uses returned information to increment the evidence levels of tentative annotations. Homologous proteins enter the system as working memory with a low evidence level. As evidence is found to support a protein annotation (e.g. corroborating domains or a large number of similar protein annotations returned), its evidence level is increased. This rule is analogous to the confidence classification system used by scientists annotating the H. influenzae genome at SBRI, with an ordinal scale to represent the level of evidence. Domains that recur in working memory multiple times, for example, may have their evidence level increased, as their likelihood of being associated with a target sequence is improved; likewise, functional annotations that are correlated with domain support will also reflect an increase in evidence. Because our initial annotation system does not yet make use of formal biomedical vocabularies, such as the Gene Ontology [23] (GO), and there is no universally-accepted nomenclature in practice for all genomic databases, we establish correlations between the text of functional annotations provided by our data sources using a modified edit distance algorithm. Consider two strings, k and l, with lengths m and n respectively; a matrix G of (m + 1) x (n + 1) is created, where row 0 is initialized to 0...n, and column 0 is initialized to 0...m. The remaining positions in the matrix, G(i,j), are computed by:

G(i,j) = \min\bigl( G(i-1,j) + 1,\; G(i,j-1) + 1,\; G(i-1,j-1) + c \bigr),
\qquad c = \begin{cases} 0, & \mathrm{char}(k,i) = \mathrm{char}(l,j) \\ 1, & \text{otherwise} \end{cases}
(Eq. 1)
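The method below is a direct transcription of Eq. 1 into Java (the class and method names are ours, not part of the published system); it also applies the normalization by the length of the longer string that is described in the sentence following this block.

```java
// Sketch of the modified edit-distance phrase similarity from Eq. 1:
// G(m,n) divided by the length of the longer string (lower values mean more similar).
public class PhraseSimilarity {
    public static double similarity(String k, String l) {
        int m = k.length(), n = l.length();
        int[][] g = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) g[i][0] = i;          // column 0 initialized to 0..m
        for (int j = 0; j <= n; j++) g[0][j] = j;          // row 0 initialized to 0..n
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int c = (k.charAt(i - 1) == l.charAt(j - 1)) ? 0 : 1;
                g[i][j] = Math.min(Math.min(g[i - 1][j] + 1, g[i][j - 1] + 1),
                                   g[i - 1][j - 1] + c);
            }
        }
        int q = Math.max(m, n);                            // length of the longer of the two strings
        return q == 0 ? 0.0 : (double) g[m][n] / q;
    }

    public static void main(String[] args) {
        System.out.println(similarity("GTP-binding protein RAB4",
                                      "ras-related GTP-binding protein 4b"));
    }
}
```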
The value given by G(m,n) / q is the phrase similarity measure we use between k and l, where q is the length of the longer of the two strings. In our annotation system, various evidence-building rules invoke this string-comparing algorithm, such as when protein homologies share similarly-phrased annotations.

3.3.3 Annotation Selection Rules
The third class of rules in our initial rule base selects likely functional annotations from the working memory, based on evidence levels. All possible annotations are stratified by their level of evidence; related annotations are percolated to the top of the list if they appear repeatedly, and the highest-level annotation with the greatest amount of evidence is provided as the automated functional annotation, though the remaining possible annotations are available for viewing as well. If no annotation is available at any evidence level, the default "hypothetical" result is presented.
Note: char(k,i) represents the ith character in the string k.
3.4. Evaluation
To evaluate the efficacy of our BioMediator-based automated annotation system, we randomly selected 116 genes from a local copy of the GenBank database from April 2006 [4]. The GenBank annotations for the 116 genes served as our "gold standard". We parsed out species names so that results from the source organism could be excluded from query returns; protein sequences from 58 bacteria, 31 eukaryotes, three viruses and one archaeon were represented. Once the genes were annotated by our system, the automated annotation and the actual annotation were compared, and individual automated annotation results were scored as incorrect, correct but inferior to actual, same as actual, or superior to actual. This quaternary scoring rubric was adapted to adjust for the known danger of outdated or incorrect GenBank annotations [24]. We used two measures in scoring, specificity and utility. Specificity refers to the level of granularity and precision provided in the annotation; e.g., "peptidase" would be a less specific annotation in comparison to "lysosomal cysteine-type endopeptidase", provided both are correct. Utility was used as a measure to compare how informative annotations are based on their textual content. An annotation that is based on a GO term, for example, would be considered more informative than one that uses idiosyncratic nomenclature. In cases where the automated annotation did not match the actual annotation, we used manual annotation methods and referred our findings to a domain expert for final scoring.
4. Results of Automated Annotation Using Inference
Our evaluation showed that the automated annotations had specificity at the same level as, or better than, the GenBank annotations 78% of the time. Additionally, the automated annotation was equal to or more informative than the GenBank annotation in 85% of the sample genes. As putative genes from non-model organisms are generally less likely to register sequence similarity hits in databases than well-studied model organisms, we also compared the system's performance along a model- and non-model organism stratification as determined by the NCBI Model Organisms Guide [25] (see Table 1). Of the 116 automated annotations generated, seven were deemed to be incorrect when compared to the GenBank annotations. Upon manual inspection, the reasons for the system assigning incorrect annotations were attributable to either a) the genes having short sequences, which were subsequently expunged by expect-value rules, or b) pertinent returned information originating from the organism from which the sequence was taken, which was thus pruned out.
Table 1. Results of automated annotation in comparison to GenBank annotations.

      | Non-model Organisms (n=60)                       | Model Organisms (n=56)
      | Wrong    | Worse      | Same       | Better      | Wrong    | Worse     | Same       | Better
Spec. | 5 (8.3%) | 10 (16.7%) | 37 (61.7%) | 8 (13.3%)   | 2 (3.6%) | 8 (14.3%) | 30 (53.6%) | 16 (28.6%)
Util. | 5 (8.3%) | 4 (6.7%)   | 42 (70.0%) | 9 (15.0%)   | 2 (3.6%) | 6 (10.7%) | 41 (73.2%) | 7 (12.5%)
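As a cross-check, the aggregate figures quoted above follow directly from the counts in Table 1:

Specificity, same or better: (37 + 8) + (30 + 16) = 91 of 116 annotations, approximately 78%.
Utility, same or better: (42 + 9) + (41 + 7) = 99 of 116 annotations, approximately 85%.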
Individual results varied in quality and nomenclature. The databases we relied on as sources did not share a common terminology, so semantically equivalent, though syntactically different, annotations were commonplace. In some cases, lower evidence levels provided better annotations than those at higher evidence levels, though we used the highest evidence level presented in scoring. In seven cases, the automated annotation system presented a function for a gene for which GenBank records show either none or list "hypothetical". Manual annotation indicated that there was evidence in four of the seven to suggest that the automated annotation was correct; for the remaining three annotations some evidence suggested their correctness, though their true annotation remained relatively ambiguous (see Table 2 for example results).

Table 2. Selected automated annotation results juxtaposed with actual annotations from GenBank, with notes.

Automated Annotation | Actual Annotation | Notes
Hypothetical | Ribosomal protein L34 | Sequence was small; relevant entries removed by expect-value rules
Anion exchange transporter | SLC26A5 protein | Automated is less specific but more informative
Nicotinic acetylcholine receptor alpha4 subunit | Unnamed protein product | Evidence for automated is very convincing; affirmed with manual inspection
ABC-type uncharacterized transport system, ATPase component | COG4619: ABC-type uncharacterized transport system, ATPase component | Automated and actual match, controlled vocabulary used
GTP-binding protein RAB4 | PREDICTED: similar to ras-related GTP-binding protein 4b | Annotations are essentially the same, but varying naming conventions used
5. Discussion
The framework and methodology on which we base our approach to gene annotation differ from previous automated gene annotation solutions. By using BioMediator as a data integration platform to handle sequence queries and retrieve results, we avoid the overhead involved in maintaining large repositories that replicate already-available data sources; responsibility for updating the data sources we use falls on the originators of the source data itself, which generally removes users of our system from most maintenance tasks. Because of the system's relatively small memory and processor footprint, it can be used on the desktop computers of annotating scientists. BioMediator's tiered architecture also allows us to add and remove sources with relative ease, and without the effort often necessary in warehouse systems, where database schemata and workflows may need to be altered considerably as data sources and tasks change over time. Scientists researching a novel genome, for example, could map any local in-house databases to the databases linked to BioMediator, thereby rapidly integrating their species-specific data with any sources already supported by BioMediator. Also, building the inference system around the schema rather than around individual sources afforded us a method of quickly developing annotation rules without having to address each data source individually. Inference rules are also a natural, transparent way of capturing annotator knowledge. Once the rules were conceived, the development time in Jess was rapid. It is important to note, though, that our results were obtained using a set of rules that were not tuned or optimized, and thus we expect results will improve as rules are refined based on feedback from annotators. The scalability and flexibility of our approach, however, did come at a cost, as online data sources do experience downtime. While testing the system, one of our sources was unavailable for several hours. Theoretically, we hope that by utilizing many more sources in the future that have partial redundancy, the loss of any single source may be somewhat offset. Still, as a federated data system, our ability to retrieve data is subject to the real-time availability of the data sources. An important handicap was that we did not rely on a structured ontology such as GO for our initial evaluation. While the schema we utilized was ontology-based, none of the sources we relied on used any controlled vocabulary on a consistent basis. Phylogenetic information was not represented in our evaluation, and could have provided valuable data relating evolutionary linkage to target sequences. Despite these shortcomings, the initial evaluation of our annotation system and methodology gives encouraging results; the efficacy of our approach is comparable to that of a previously evaluated species-specific and pipeline-based automated annotation system (75.1-78.6% estimated accuracy for FANTOM2 [5]), with the additional benefits of being non-specific to any genome and having an architecture oriented toward scalability.
6. Conclusion
The growing size, disparity and heterogeneity of biologic data, and the necessity for expert curation in determining protein functions for the myriad of newly sequenced genomes, mean that an automated annotation system that can address future gene annotation requirements must be both a robust data integration platform and a powerful expertise-based system. In this paper, we have presented a technique and framework that couples these two important tasks in gene annotation into a cohesive platform, and evaluated its performance. Future iterations of the system will annotate genes using a controlled vocabulary with the addition of data sources such as InterPro, which regularly and consistently include GO terms in their records. While our initial system relies on online databases, incorporating analytical services such as transmembrane-locating or phylogeny-inferring software into the schema and developing rules to take advantage of such information would be a valuable addition. Alteration of current rules will also improve our annotation capabilities, such as a dynamically-determined threshold to account for sequences of variable length. Additionally, in the future, we hope to evaluate our system against more ongoing genome annotation projects, to compare automated annotation results with further manually-created annotations. The true test of our system would be to annotate a novel genome in parallel with expert scientists.

Acknowledgements
This work is supported by NHGRI grant R01HG02288 and the National Library of Medicine training grant T15LM07442. The authors of this paper would like to acknowledge Elizabeth Worthey and Alice Erwin for lending their knowledge of annotation to our research, as well as Ron Shaker, Janos Barberos and Dhileep Sivam for their technical assistance.

References
1. Worthey, E., Myler, P., Protozoan genomes: gene identification and annotation. International Journal for Parasitology, 2005(35): p. 495-512.
2. Adams, M., Celniker, S., et al., The Genome Sequence of Drosophila melanogaster. Science, 2000. 287(5461): p. 2185-2195.
3. Garrels, J.I., Yeast genomic databases and the challenge of the post-genomic era. Functional & Integrative Genomics, 2002. 2(4-5): p. 212-237.
4. GenBank. 2006 [cited April 2006]; Available from: http://www.ncbi.nlm.nih.gov/Genbank/
5. Kasukawa, T., Furuno, M., et al., Development and Evaluation of an Automated Annotation Pipeline and cDNA Annotation System. Genome Research, 2003. 13.
6. Gouret, P., Vitiello, V., et al., FIGENIX: Intelligent automation of genomic annotation: expertise integration in a new software platform. BMC Bioinformatics, 2005. 6.
7. Potter, S., Clarke, L., et al., The Ensembl Analysis Pipeline. Genome Research, 2004. 14.
8. Apweiler, R., Attwood, T., et al., The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research, 2001. 29(1): p. 37-40.
9. Gaasterland, T., Sensen, C., MAGPIE: automated genome interpretation. Trends in Genetics, 1996. 12(2): p. 76-78.
10. Louie, B., Mork, P., et al., Data Integration and Genomic Medicine. Journal of Biomedical Informatics, 2006.
11. El-Sayed, N., Myler, P., et al., Comparative Genomics of Trypanosomatid Parasitic Protozoa. Science, 2005. 309(5733): p. 404-409.
12. Altschul, S., Gish, W., et al., Basic Local Alignment Search Tool. Journal of Molecular Biology, 1990. 215.
13. Marchler-Bauer, A., Anderson, J., et al., CDD: a Conserved Domain Database for protein classification. Nucleic Acids Research, 2005. 33(D): p. 192-196.
14. Bateman, A., Coin, L., et al., The Pfam protein families database. Nucleic Acids Research, 2004. 32(D).
15. Hulo, N., Bairoch, A., et al., The PROSITE database. Nucleic Acids Research, 2006. 34(D): p. 227-230.
16. Henikoff, S., Henikoff, J., Protein family classification based on searching a database of blocks. Genomics, 1994. 19(1): p. 97-107.
17. Corpet, F., Gouzy, J., et al., The ProDom database of protein domain families. Nucleic Acids Research, 1998. 26(1): p. 323-326.
18. Koski, L., Golding, B., The Closest BLAST Hit Is Often Not the Nearest Neighbor. Journal of Molecular Evolution, 2001. 52: p. 540-542.
19. Donelson, L., Tarczy-Hornoch, P., et al., The BioMediator System as a Data Integration Tool to Answer Diverse Biologic Queries. Proceedings of MedInfo, IMIA, 2004.
20. Wang, K., Tarczy-Hornoch, P., et al., BioMediator Data Integration: Beyond Genomics to Neuroscience Data, in American Medical Informatics Association 2005 Symposium Proceedings. 2005.
21. Mei, H., Tarczy-Hornoch, P., et al., Expression Array Annotation Using the BioMediator Biological Data Integration System and the BioConductor Analytic Platform, in American Medical Informatics Association 2003 Symposium. 2003.
22. Jess, the Rule Engine for the Java Platform. 2006 [cited 2006]; Available from: http://herzberg.ca.sandia.gov/jess/
23. Ashburner, M., Ball, C., et al., Gene ontology: tool for the unification of biology. Nature Genetics, 2000. 25(1): p. 25-29.
24. Harris, J., Can you bank on GenBank? Trends in Ecology and Evolution, 2003. 18(7): p. 317-319.
25. National Center for Biotechnology Information, Model Organisms Guide. 2006 [cited June 2006]; Available from: http://www.ncbi.nih.gov/About/model/index.html
ABSENT SEQUENCES: NULLOMERS AND PRIMES
GREG HAMPIKIAN, Biology, Boise State University, 1910 N University Drive, Boise, Idaho 83725, USA
TIM ANDERSEN, Computer Science, Boise State University, 1910 N University Drive, Boise, Idaho 83725, USA
We describe a new publicly available algorithm for identifying absent sequences, and demonstrate its use by listing the smallest oligomers not found in the human genome (human "nullomers"), and those not found in any reported genome or GenBank sequence ("primes"). These absent sequences define the maximum set of potentially lethal oligomers. They also provide a rational basis for choosing artificial DNA sequences for molecular barcodes, show promise for species identification and environmental characterization based on absence, and identify potential targets for therapeutic intervention and suicide markers.
1. Introduction
As large scale DNA sequencing becomes routine, the universal questions that can be addressed become more interesting. Our work focuses on identifying and characterizing absent sequences in publicly available databases. Through this we are attempting to discover the constraints on natural DNA and protein sequences, and to develop new tools for identification and analysis of populations. We term the short sequences that do not occur in a particular species "nullomers," and those that have not been found in nature at all "primes." The primes are the smallest members of the potential artificial DNA lexicon. This paper reports the results of our initial efforts to determine and map sets of nullomer and prime sequences in order to demonstrate the algorithm and explore the utility of absent sequence analysis. It is well known that the number of possible DNA sequences is an exponentially increasing function of sequence length, and is equal to 4^n, where n is the sequence length. This means that any attempt to assemble the complete set of unused sequences is hopeless. We have developed an approach that examines the minimum-length sequences that are absent. These absent oligomers (nullomers and primes) occur at the boundary between the sets of natural and potentially unused sequences, and in part can be utilized to delineate the two sets [15]. By identifying the boundary nullomers surrounding the various branches
of the phylogenetic tree of life, we hope to produce a map of the negative sequence space around each group. While the nullomer and prime sets will shrink as more sequences are reported, the mechanisms of mutation allow for rational predictions to be made about sequence evolution based on the accumulated nullomer data. The excluded sequences can be used for a number of purposes, including:
1. Molecular bar codes
2. Species identification
3. Sequence specification for RNAi, PCR primers, and gene chips
4. Database verification and harmonization
5. Drug target identification
6. Suicide targets for recalling or eliminating genetically engineered organisms
7. Pesticide/antibiotic development
8. Environmental monitoring
9. Evolution studies
Our ultimate goal in studying nullomers is to model and predict which biosequences (DNA, RNA and amino acid) are unlikely to be found in the biosphere. If "forbidden" sequences can be identified and confirmed through bioassays, this information will be foundational to understanding the basic rules governing sequence evolution. The insights gained could also greatly improve the theoretical foundation for comparative genomics, and provide an important conceptual framework for genetic engineering using artificial sequences.
2. Background
A naive assumption of early genomic analysis was that sequence distribution over large genomes would approximate randomness. That is, a 6-base sequence would be found on average every 4^6, or 4096, bases. These types of assumptions were used for such calculations as the number of expected restriction enzyme recognition sites in a genome. But even early studies of genome organization using thermal melting and gradient centrifugation [8,13] showed that there is great non-uniformity in genomic sequences, particularly in warm-blooded vertebrates. What has emerged from many subsequent genome studies is a striking non-random distribution of certain large and short sequence motifs. Many of the described irregularities concern functional units of sequences. For example, AGA codons are rare in bacterial genes, and when artificially substituted for synonymous codons they often have lethal consequences. This is believed to be due to ribosome stalling and the consequent early termination of protein synthesis. The reason for this effect is that while the codon chart tells us
that AGA is one of the codons for the amino acid arginine, most bacteria preferentially use CGA to code for arginine. Even though the bacteria have the requisite tRNAs to use an AGA codon, these tRNAs are in such low concentration that the ribosome complex is destabilized while waiting for the tRNA to load an arginine [6]. Examples of such "codon biases" have been seen in all species sequenced to date [20], and are a good example of the constraints on sequence evolution based on progenitor biases. In eukaryotes too, many genomic features have been identified which skew the distribution of very short sequence motifs. For example, one of the authors (GH) was involved in research that examined the role of GG sequences in oxidative damage to DNA. It was found that when oxidizing agents captured electrons from DNA, the electron holes were transferred along the DNA until they reached a GG sequence, where they induced strand breakage [12]. Subsequent studies have borne out our hypothesis that GGG stretches are rare in coding regions, and other researchers have shown that "sentinel GGG" motifs found in non-coding introns serve as sacrificial sinks for oxidative damage [11]. Statistical studies using the autocorrelation function of Bernaola-Galvan (2002) have shown that the human genome contains areas with GC-rich isochores displaying long-range correlations and scale invariance. Other studies have shown long-range correlations between sequence motifs and regularly spaced structural features of the genome such as nucleosome binding sites [2,21]. All of these studies demonstrate what we would expect for a highly ordered information processing system: it is highly organized, non-random, and constrained by many factors, including the architecture of its storage and processing systems. Thus, even though DNA is passed on through dynamic evolving systems, there are still limits on its content, and some of these limits exist within large species groups. For example, any limits imposed by nucleosomal organization are applicable to all eukaryotic organisms, while bacteria, which lack nucleosomal structure, are immune to these constraints. This suggests one obvious use for our nullomer approach: the identification of molecular therapeutic targets that are present in the pathogen and absent in the host, or vice versa. Other constraints may be universal, since all organisms share a presumed origin, and many components of DNA function are highly conserved. By examining universally absent sequences (primes), we hope to discover insights into the most conserved mechanisms of molecular biology: inviolable rules which preclude these prime sequences. Interestingly, the vast majority of bio-sequence analysis has ignored the exploration of absent sequences, instead focusing entirely on sequences that are either very rare or very common. Some work has been done to characterize the expected number of missing words in a random text [19]; however, the primary focus
of this research was the application of the result to the construction of pseudorandom number generators. One group has discussed the "absence versus presence" of short DNA sequences for the sake of identifying species [10], and another group has examined absent protein sequences [18]; but our approach is unique in that we are studying the set of smallest absent sequences (nullomers and primes) in order to discover basic rules of sequence evolution, and then apply this understanding for practical purposes such as drug development and the development of a DNA tagging system. Our research stems from one of the primary assumptions of genomic analysis, that over- and under-represented sequences are more likely to be interesting. While our work focuses on the novel area of absent oligomers, the general determination of over- and under-represented sequences has received a great deal of attention [3-5,14,16,17,22]. For example, Nicodeme [16] developed a fast statistical approximation method for determining motif probabilities and demonstrated that over- and under-representation of protein motifs can be a good indicator of functional importance [17]. Stefanov [22] introduced a computationally tractable approach for determining the expected inter-site distance between pattern occurrences in strings generated by a Markov model. Bourdon and Vallee [5] and Flajolet [7] extended techniques to determine the likelihood and frequency of sequence motifs to generalized patterns, in particular patterns where the gap lengths between elements of the pattern in a random text are both bounded and unbounded. Amir et al. [1] generalize the notion of string matching further, developing statistical analysis techniques for a string matching approach they term structural matching. With this approach, the exact text of the strings is not important; rather, two strings are considered to match if some generalized relation between the two strings is satisfied.
3. Counting Sequences
We have developed a set of software utilities for counting sequences in a variety of sequence data. The main software package that we have created is SeqCount. This program has two primary functions. First, it counts the frequency of occurrence of all possible short sequences up to a user-given maximum length in a set of sequence data and then writes this frequency count information to a file. Second, SeqCount determines the set of sequences that do not occur (nullomers) and writes these sequences to an additional set of files, one file for each sequence length being examined. The algorithm used for counting sequences is shown in Figure 1. The computational complexity of the algorithm is O(mn), where m is the maximum sequence length and n is the amount of DNA being processed. The algorithm
can calculate the frequency of DNA sequences up to length 13 for the human genome (3 billion bases) in approximately 25 minutes on a single processor machine. The parallel version of the algorithm can process the human genome in less than 1 minute. A single pass through the entire set of DNA data downloaded from the NCBI web site takes approximately 12 hours. In addition to SeqCount, we have created a number of secondary support tools for manipulating and understanding the data output by SeqCount. These support tools are available in both C and Java versions. Also, we have created a web-based interface to some of the data that we have generated with SeqCount. In particular, one can access the sequence counts and nullomer sets for several species for sequences up to length 13.

Figure 1. Algorithm for counting sequences:
1. Set the maximum sequence length under consideration (n) and the strand of DNA to examine.
2. Beginning with the 1st position, for each position in the strand of DNA being examined:
   a. Increment the count for the n-length sequence of nucleotides found at the current position.
3. After step 2 has finished:
   a. Process the initial counts for the n-length sequences to determine the counts for the complementary strand.
   b. Re-process the final n-length counts to determine the counts for all sequences of length n-1 through 1.

Following is the full list of software packages and support tools that are available:

• SeqCount: Given a set of genomic data in binary format, counts the total number of all sequences up to a user-determined length. The counts are saved in a single file. Additionally, if any sequences within the length given are not found, these sequences are output to a set of nullomer files (one file for each nullomer length).
• GBK2Bin: Given a set of files in GenBank format, this program converts the files to a binary format wherein each DNA nucleotide is encoded as a 2-bit value. A single file is created for each contiguous sequence of DNA found in the GenBank files, with the file name encoding the location of the sequence.
• CountNulls: Counts the number of nullomers in a nullomer file and prints the result.
• Char2Null: Converts any set of carriage-return-delimited sequences encoded in ASCII format to the nullomer file format. This utility is typically used to take the piped output from DiffNulls, IntNulls, UnionNulls, or ViewNulls and convert the ASCII-based output of these tools to binary format.
• DiffNulls: Takes as input two or more nullomer files and prints to the screen the set difference of the first nullomer file minus the union of the rest of the nullomer files.
• IntNulls: Takes as input two or more nullomer files and prints to the screen the set intersection of the nullomer files.
• UnionNulls: Takes as input two or more nullomer files and prints to the screen the set union of the nullomer files.
• ViewNulls: Takes as input one nullomer file and prints to the screen, in ASCII format, the nullomers contained in the file.
SeqCount processes sequence data in a single pass, and has been optimized for speed of processing. SeqCount can be executed in either parallel mode on a Beowulf cluster or in sequential mode on a single workstation. In sequential mode the program is limited to counting sequences up to length 13. When the program is executed in parallel mode and the user requests the program to count sequences of length greater than 13, the program evenly divides the sequence space up amongst the available processors, and each process is then responsible for counting sequences that occur within its assigned sequence space. At the end of processing, the counts from each process are collected and written to a file as in the sequential version. The software packages, documentation, and web-based interface can be freely accessed at: http://trac.boisestate.edu/bioinformatics/nullomers.
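As a rough illustration of the counting scheme in Figure 1, the sketch below counts every k-mer up to a maximum length on both strands and then enumerates the absent words. It is a simplified single-machine sketch in Python, not the SeqCount implementation; handling of sequence ends and of non-ACGT characters is glossed over.

from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def count_kmers(seq: str, max_len: int) -> dict[int, Counter]:
    """Count every k-mer (k = 1..max_len) on both strands of `seq`."""
    counts = {k: Counter() for k in range(1, max_len + 1)}
    # One pass over the forward strand at the maximum length, as in Figure 1.
    for i in range(len(seq) - max_len + 1):
        counts[max_len][seq[i:i + max_len]] += 1
    # Add the complementary strand by reverse-complementing the counted words.
    for word, n in list(counts[max_len].items()):
        counts[max_len][word.translate(COMPLEMENT)[::-1]] += n
    # Derive shorter lengths from the max-length counts (prefix trimming);
    # a production implementation would also handle the few words at the
    # very end of the sequence that this shortcut misses.
    for k in range(max_len - 1, 0, -1):
        for word, n in counts[k + 1].items():
            counts[k][word[:k]] += n
    return counts

def nullomers(counts: Counter, k: int) -> list[str]:
    """Enumerate k-mers that never occur (absent words)."""
    return ["".join(p) for p in product("ACGT", repeat=k)
            if "".join(p) not in counts]

c = count_kmers("ACGTACGTTTGACCA", 3)
print(sorted(nullomers(c[3], 3))[:5])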
4. Results
We have downloaded the entire sequence database from the NCBI web site and used our algorithms to determine the nullomer sequences for several fully sequenced organisms: chimpanzee, human, etc. These results are given in section 4.1. We have also processed all of the data in the entire DNA sequence database and determined the "prime" DNA sequences (sequences that do not occur in any of the data), and these results are given in section 4.3. In addition, we have processed the entire protein database and also give these results in section 4.3.
4.1. Nullomers - fully sequenced organisms

Table 1 gives the number of DNA nullomers found at lengths 8 through 13 for several different organisms. The results for bacteria, fungi, and yeast are across all sequenced organisms.

Table 1. Number of DNA nullomers at sequence lengths 8 through 13 (a few sparse entries at lengths 8-10 could not be recovered from the original table).

Organism     8   9   10      11        12         13
arabid                       23646     1167012    20237388
bacteria                               541        562870
c elegans                    7686      1152038    23339534
chicken                      590       131515     4722702
chimp                        136       45938      2426474
cow                          96        45060      2432554
dog                          40        25217      1868964
fruitfly                     206       221616     12399300
human                        80        39852      2232448
mouse                        178       54383      2625646
rat                          50        30708      1933220
zebrafish                    2         15561      2469558
Table 2 shows how the nullomer sets of each of the organisms given in Table 1 intersect with each other. The names of the organisms are listed in the first column. The 2nd through 4th columns show the actual size of each intersection for lengths 11 through 13. The 5th through 7th columns show the expected size (with the assumption that each set was independently and randomly generated), and the 8th through 10th columns give the ratio of actual/expected. For the ratio, numbers greater than 1 indicate the degree to which the intersection is larger than expected. The results are sorted in descending order on the ratio value at length 12.

Table 2. Intersection of human nullomers with the nullomers of other organisms.

             Actual size                Expected size                       Ratio (actual/expected)
             11     12      13          11         12        13             11         12         13
chimp        28     19581   1521778     0.002594   109.1195  80719.25       10794.16   179.4455   18.85273
dog          0      4963    731372      0.000763   59.89956  62173.08       0          82.85536   11.76348
rat          8      5975    734566      0.000954   72.94269  64310.63       8388.608   81.91363   11.42216
cow          0      7314    886544      0.001831   107.0339  80921.51       0          68.33348   10.9556
mouse        2      8765    927076      0.003395   129.1794  87344.92       589.0876   67.85136   10.61397
chicken      4      10946   1162632     0.011253   312.396   157105.7       355.4495   35.03886   7.400316
zebrafish    0      1080    504532      3.81E-05   36.96304  82152.48       0          29.21837   6.141409
fruitfly     0      2122    761094      0.003929   526.4187  412476         0          4.031012   1.845184
arabid       0      9521    1325550     0.451012   2772.079  673218.3       0          3.434607   1.968975
c elegans    0      8378    1273344     0.146599   2736.51   776414.5       0          3.061564   1.640031
bacteria     0      0       24242       0          1.285072  18724.47       0          0          1.294669
Human and chimp have the greatest intersection between their absent sequences, and mammals in general show a much stronger intersection with human than the other listed organisms. While this is intuitively satisfying, further studies will be required to demonstrate if nullomer sets can be used to corroborate phylogenetic relationships among species.
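The "expected size" column of Table 2 treats each organism's nullomer set as an independent random subset of the 4^n possible n-mers; under that reading, the expected overlap of two sets A and B is |A| x |B| / 4^n. The short Python check below is our interpretation of the stated independence assumption, not a formula spelled out in the text, using the length-11 counts from Table 1 (human = 80, chimp = 136).

def expected_intersection(size_a: int, size_b: int, n: int) -> float:
    # Expected overlap of two independent random subsets of the 4**n n-mers.
    return size_a * size_b / 4 ** n

print(expected_intersection(80, 136, 11))        # ~0.002594, as in Table 2 (chimp)
print(28 / expected_intersection(80, 136, 11))   # ratio ~10794, as in Table 2 (chimp)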
4.2. Human Genome nullomers

Other researchers have reported absent sequences as a part of large-scale analysis9; however, as far as we know this is the first publication of an actual list of human nullomers. Our results also differ from earlier reports of 44 absent 11-mers, in that we have found 43 sequences and their complements which are not found in the two published human genomes (Table 3). Of these sequences, 4 11-mers and their complements currently have no sequence match in any reported human sequence in GenBank as determined by BLAST.

Table 3. Human nullomers at length 11. The recovered BLAST match counts against reported human sequences are four 0s and twelve 2s for the first group, and thirteen 3s, four 4s, and one each of 5, 6, 7, and 20 for the second group.

First group:
cgctcgacgta  gtccgagcgta  cgacgaacggt  ccgatacgtcg  tacgcgcgaca  cgcgacgcata  tcggtacgcta  tcgcgaccgta
cgatcgtgcga  cgcgtatcggt  cgtcgctcgaa  tcgcgcgaata  tcgacgcgata  atcgtcgacga  ctacgcgtcga  cgtatacgcga
cgattacgcga  cgattcggcga  cgacgtaccgt  cgacgaacgag  cgcgtaatacg  cgcgctatacg

Second group:
cgcgcataata  cgacggacgta  cgaatcgcgta  cggtcgtacga  gcgcgtaccga  cgcgtaatcga  cgtcgttcgac  ccgtcgaacgc
acgcgcgatat  cgaacggtcgt  cgcgtaacgcg  ccgaatacgcg  catatcgcgcg  cgcgacgttaa  gcgcgacgtta  ccgacgatcgt
ccgttacgtcg  ccgcgcgatat  ccgacgatcga  cgaccgatacg  cgaatcgacga
We are presently searching the available single nucleotide polymorphism (SNP) databases to determine which, if any, of the nullomers are associated with known SNPs.

4.3. Primes - all sequence data

We have also used our algorithms to process the entire DNA sequence database available from NCBI, and found that length 15 is the shortest length at which primes (absent sequences) are found. At this length there are 60370 primes that are not found in any of the DNA sequence data. These sequences can be referenced through our web site at http://trac.boisestate.edu/bioinformatics/nullomers.
We have also processed all available protein sequences, and identified 1799 primes of length 5. It should be noted that this number is significantly less than the 12,080 "zero count pentats" that were reported by Otaki et al. in 200418. In that paper, the researchers cloned 6 of their zero count pentats and showed that they were not lethal when expressed in E. coli. But we found (using our algorithm) that 5 of the 6 "zero-count" oligomers are actually presently listed in GenBank. This discrepancy is likely due to the addition of new protein data at NCBI since the zero count search was performed in September of 2003. This demonstrates the need for continued processing of this data, and the utility of our web-available program for conducting immediate absent sequence inventories. We believe that the approach taken by Otaki et al.18 is a valuable first step in examining the potential lethality of absent sequences. As the number of such sequences shrinks, and large scale expression projects become more routine, the fitness effects of nullomers and primes can be studied more systematically.

The fact that the nucleotide and amino acid primes both presently correspond to a length of 5 amino acids (15 nucleotide bases in the DNA database, and 5 amino acids in the protein database) is coincidental. We examined all possible coding sequences for the 1799 length-5 protein primes, and did not find any intersection with the DNA primes at length 15. The nucleotide sequences include coding and non-coding DNA, while the protein database has only expressed (and hypothetically expressed) sequences. Thus it is likely that most nucleotide sequences representing codons for absent amino acid sequences are found only in non-coding regions of DNA. We are presently exploring the intersection of amino acid nullomers and DNA nullomers in coding regions, and will report those results separately. On average, the protein primes had about half as many possible DNA coding sequences as expected for peptides of their length, which indicates that the set of protein primes is biased towards those protein sequences that have fewer DNA coding options. We found 5 protein primes that have a single DNA coding sequence: MWMWW, MWWWW, WMMWM, WMWWW, and WWMMW. We then performed a BLAST search for short, exact matches to each of these DNA coding sequences and examined the results. Each DNA sequence yielded a number of exact matches. Most of these matches were in intron-specific regions; however, several of the matches occurred in putative coding regions. We are currently working to resolve each of these database discrepancies. The identification of apparent discrepancies between protein and nucleotide primes in coding regions demonstrates the utility of the nullomer approach as a tool for harmonizing the various biomolecular databases.
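The check described above, taking a peptide and enumerating every DNA sequence that could encode it, amounts to back-translation through the codon table. A minimal Python sketch follows; the reverse codon table is deliberately abbreviated to the residues used in the example, and MWMWW is one of the single-coding-sequence primes listed above.

from itertools import product

# Reverse codon table, abbreviated to a few residues for this example;
# a full back-translation would include all 20 amino acids.
CODONS = {
    "M": ["ATG"],
    "W": ["TGG"],
    "C": ["TGT", "TGC"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def coding_sequences(peptide: str):
    """Yield every DNA sequence that encodes `peptide`."""
    for combo in product(*(CODONS[aa] for aa in peptide)):
        yield "".join(combo)

# MWMWW has exactly one coding sequence, so only a single 15-mer needs to
# be checked against the DNA prime list.
print(list(coding_sequences("MWMWW")))          # ['ATGTGGATGTGGTGG']
print(len(list(coding_sequences("MLCKW"))))     # 1*6*2*2*1 = 24 candidates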
5. Conclusion and Discussion
We have developed a series of tools for the identification and study of absent sequences. Using these tools we have made publicly available the full set of amino acid and nucleotide primes (the shortest sequences not found in their respective databases). In order to allow creative extensions of our approach, the software packages, documentation, and web-based interface can be freely accessed at: http://trac.boisestate.edu/bioinformatics/nullomers. In this paper we demonstrate some of the uses of these tools, and the elegance of the nullomer approach.

It should be noted that nullomer searches have corollaries in the natural world, most notably in the development of the human immune system. During embryonic development a large variety of antigen-recognizing cells are generated by the random rearrangement of DNA cassettes coding for the "variable" segments of antibody producing cells. This DNA shuffling results in the incredible diversity of immune cells which produce molecular soldiers that each recognize a single small oligomer (peptides, lipids, sugars, or nucleotides). This army is reduced by a colossal "deselection" in the embryonic thymus. Here, any immune cell which finds its target among the "self" molecules is culled from the army. In essence, what is left is a sentinel army of nullomer hunters. They recognize and destroy only absent oligomer sequences. When an adult immune cell detects its particular nullomer, it is stimulated to reproduce, and sometimes to hypermutate in order to recognize related nullomers. Thus the natural defense system of the body is based on recognizing nullomers, and anticipating oligomers that may arise from them. This type of approach would be useful in any intelligent response to novel biological threats, natural or man-made. For example, nullomer detection in environmental samples could indicate the introduction of novel natural or engineered species. The rapid response to such a potential threat should include the generation of agents to detect and possibly incapacitate related novel molecules.

The absent sequences that we report here represent the largest possible set of artificial oligomers. Within this dynamic, shrinking set will be found all lethal oligomers, if any exist. These small molecules may prove to be powerful bioactive compounds which act in a species-specific or group-specific manner. Within the set of primes, there is even the possibility of a pan-lethal agent which could function as a sterilant, or suicide gene for therapeutic and biocontrol applications. We have also shown that nullomer searches can be used to assess the harmony of molecular databases (nucleotide and protein), and to identify potential therapeutic targets that exist in a pathogenic species but not its host. The nullomer approach may also be useful for studying genome relationships, in that the absent oligomers (nullomers) are more similar in closely related species than in those more distantly related. Finally, it is easy to construct artificial tags of DNA or amino acids that
have not been reported in GenBank. But identifying the smallest oligomers that have not been found in a species or group of species provides the first rational basis for the construction of an artificial DNA lexicon. By devising tags based on nullomers and primes, more efficient and elegant artificial sequences can be constructed. These sequences can be used to identify artificial constructs, tag them with identifying characteristics, or even code for suicide genes in order to "recall" a genetically engineered product.

Acknowledgements: The authors wish to thank the following people for their help: Dr. Amit Jain and Ben Noland for assistance with the Beowulf cluster and initial algorithms; Barry Hall, Jim Smith and Ken Cornell for comments and criticism about the nullomer approach; Jim Munger for his encouragement and support; and the anonymous reviewers who provided valuable feedback.

References
1. Amir, A., Cole, R., Hariharan, R., Lewenstein, M., & Porat, E. (2003). Overlap Matching. Inf. Comput. 181(1), 57-74.
2. Audit, B., Vaillant, C., Arneodo, A., d'Aubenton-Carafa, Y., Thermes, C. (2002). Long-range correlations between DNA bending sites: relation to the structure and dynamics of nucleosomes. J Mol Biol 316(4):903-18.
3. Apostolico, A., Bock, M., and Lonardi, S. (2002). Monotony of Surprise and Large-Scale Quest for Unusual Words. Proceedings of the Sixth Annual International Conference on Computational Biology, pp. 22-31.
4. Apostolico, A., Gong, F., and Lonardi, S. (2004). Verbumculus and the Discovery of Unusual Words. Journal of Computer Science and Technology, vol. 19, no. 1, pp. 22-41.
5. Bourdon, J. & Vallee, B. (2002). Generalized Pattern Matching Statistics. In Mathematics and Computer Science II, Versailles, 249-265.
6. Cruz-Vera, L.R., Magos-Castro, M.A., Zamora-Romo, E., Guarneros, G. (2004). Ribosome stalling and peptidyl-tRNA drop-off during translational delay at AGA codons. Nucleic Acids Res 32(15):4462-8.
7. Flajolet, P., Guivarc'h, Y., Szpankowski, W., & Vallee, B. (2001). Hidden Pattern Statistics. ICALP 2001, 152-165.
8. Filipski, J. (1987). Correlation between molecular clock ticking, codon usage, fidelity of DNA repair, chromosome banding and chromatin compactness in germline cells. FEBS Lett. 217: 184-186.
9. Fofanov, Y., Luo, Y., Katili, C., Wang, J., Belosludtsev, Y., Powdrill, T., Fofanov, V., Li, T-B., Chumakov, S., Pettitt, B.M. (2003). How independent are the appearances of n-mers in different genomes? Bioinformatics, vol. 20, no. 15, pp. 2421-2428.
10. Fofanov, V., Fofanov, Y., Pettitt, B. (2002). Counting array algorithms for the problem of finding appearances of all possible patterns of size n in a sequence. In The 2002 Bioinformatics Symposium, Keck/GCC
Bioinformatics Consortium, p. 14. W.M. Keck Center for Computational and Structural Biology, Houston, Texas.
11. Friedman, K., Heller, A. (2001). On the Non-Uniform Distribution of Guanine in Introns of Human Genes: Possible Protection of Exons against Oxidation by Proximal Intron Poly-G Sequences. J. Phys. Chem. B 105(47), 11859-11865.
12. Henderson, P.T., Jones, D., Hampikian, G., Kan, Y., Schuster, G.B. (1999). Long distance charge transport in DNA: the phonon-assisted polaron-like hopping mechanism. Proc. Natl. Acad. Sci. USA 96:8353-8358.
13. Inman, R.B. (1966). A denaturation map of the lambda phage DNA molecule determined by electron microscopy. J. Mol. Biol. 18: 464-476.
14. Leung, M. Y., Marsh, G. M., and Speed, T. P. (1996). Over- and underrepresentation of short DNA words in herpesvirus genomes. J. Comput. Biol. 3, 345-360.
15. Mitchell, T. (1997). Machine Learning. New York: McGraw Hill.
16. Nicodeme, P. (2001). Fast approximate motif statistics. Journal of Computational Biology, 8(3), 234-248.
17. Nicodeme, P., Doerks, T., & Vingron, M. (2002). Proteome Analysis Based on Motif Statistics. Bioinformatics, vol. 18, 161-171.
18. Otaki, J., Ienaka, S., Gotoh, T., and Yamamoto, H. (2005). Availability of short amino acid sequences in proteins. Protein Science, 14:617-625.
19. Rahmann, S. & Rivals, E. (2000). Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts. CPM 2000, 375-387.
20. dos Reis, M., Savva, R. & Wernisch, L. (2004). Solving the riddle of codon usage preferences: a test for translational selection. Nucleic Acids Res. 32: 5036-5044.
21. Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I. K., Wang, J. Z., Widom, J. (2006). A Genomic Code for Nucleosome Positioning. Nature 442(7104):772-8.
22. Stefanov, V. (2003). The intersite distances between pattern occurrences in strings generated by general discrete- and continuous-time models: an algorithmic approach. Journal of Applied Probability 40(4), 881-892.
AN ANATOMICAL ONTOLOGY FOR AMPHIBIANS* ANNE M. MAGLIA Department of Biological Sciences, University of Missouri-Rolla, Rolla, MO 65409, USA
105 Schrenk Hall
JENNIFER L. LEOPOLD Department of Computer Science, University of Missouri-Rolla, 317 Computer Science Rolla, MO 65409, USA L. ANALIA PUGENER Department of Biological Sciences, University of Missouri-Rolla, Rolla, MO 65409, USA
105 Schrenk Hall
SUSAN GAUCH Department of Electrical Engineering & Computer Science, The University of Kansas Lawrence, KS 66045, USA
Herein, we describe our ongoing efforts to develop a robust ontology for amphibian anatomy that accommodates the diversity of anatomical structures present in the group. We discuss the design and implementation of the project, current resolutions to issues we have encountered, and future enhancements to the ontology. We also comment on future efforts to integrate other data sets via this amphibian anatomical ontology.
1. Introduction

1.1. The Need for an Amphibian Anatomical Ontology

Studies of gene expression, molecular markers, and developmental biology are advancing our knowledge of the morphogenetic and evolutionary processes that lead to disease, physiological responses, adaptation, and phylogenetic diversity. Results from these studies promise both to enhance our quality of life and reveal the complex connection between genotype and phenotype. But to understand fully the results, we must have a detailed understanding of the anatomy of organisms.
* This work is partially supported by NSF grant DBI-0445752.
Unfortunately, the lack of terminological standardization for the anatomy of most organisms limits our ability to compare results across taxa, and thus has restricted the applicability of many embryological and gene expression experiments. The scientific community is well aware of this problem. In the hopes of facilitating the integration of genetic, embryological, and morphological studies, several groups are developing anatomical ontologies for certain model species (e.g., mouse, zebrafish). Further demonstrating the importance of anatomical ontologies was the recent National Center for Biomedical Ontology-sponsored workshop1 to bring researchers together to discuss issues associated with developing anatomical ontologies. The need for terminological standardization of anatomy is particularly pressing in amphibian morphological research. Amphibians are commonly used for gene expression and embryological studies, yet the three amphibian orders, Salientia (frogs and toads), Caudata (salamanders and newts), and Gymnophiona (caecilians), are so morphologically distinct that studies of one order are rarely applied to another. As a consequence, morphological and developmental studies of frogs, salamanders, and caecilians are conducted by disassociated research groups, resulting in three different amphibian anatomical lexicons. Language inconsistencies confuse our understanding of homology, and thus, our ability to use morphology to understand the phylogeny and biodiversity among the orders. In addition, disparate anatomical lexicons limit our abilities to conduct comparative anatomical research, while hindering the integration of morphological, genomic, and embryological data. There are several challenges to developing an ontology for amphibian anatomy. First, the separate anatomical lexicons must be reconciled. Second, there are over 6,000 species of amphibians for which the anatomical terminology must be resolved. Although much of the terminology is similar across species, among-species variation will lead to a much larger ontology than those developed for a single model species. Third, because of anatomical diversity among amphibian orders, homologies of some structures are unknown; therefore, assigning terminological standards to them may be problematic. These challenges can be overcome by forging a partnership between the amphibian morphological community and the power of information extraction technology. Herein, we describe our ongoing efforts to develop a robust ontology for amphibian anatomy. We discuss the design and implementation of the project,
NCBO Anatomy Ontology Workshop: http://www.bioontology.org/wiki/index.php/Anatomy_Ontology_Workshop
resolutions to date for issues that we have encountered, and future enhancements and modifications to the ontology. In addition, we comment on future plans to integrate other data sets via the amphibian anatomical ontology.

1.2. Prior Work in Biological Ontologies

As stated in [1], "ontologies are becoming popular largely due to what they promise: a shared and common understanding of a domain that can be communicated between people and application systems." The importance of ontologies has not been lost in the biological community, a research domain that is notorious for its complex form and semantics, and one that will benefit tremendously from data integration and analysis [2]. Perhaps the best known of the biological ontologies is the Gene Ontology (GO), which began in the late 1990's as a collaboration among three model-organism databases (FlyBase, the Saccharomyces Genome Database, and the Mouse Genome Database), but has grown to include many other genomic databases. The biomedical research community has made significant strides in developing medical and clinical ontologies. One of the most extensive projects is the U.S. National Library of Medicine's Unified Medical Language System (UMLS), a comprehensive knowledge-representation system that includes data sources and software tools (e.g., the Metathesaurus, the Semantic Network, and the Specialist Lexicon) that facilitate information retrieval, natural language processing, and other vocabulary services for biomedical research data. As an extension to the UMLS, the Digital Anatomist Foundational Model (FMA), an ontology of human anatomical relationships, was developed as part of the Digital Anatomist project [3]. Both GO and UMLS have proved to be extremely valuable for several widely-used applications (e.g., PubMed, Swiss-Prot). Some bio-ontology projects have begun integrating genomic and anatomical information for model species (e.g., the Zebrafish Information Network (ZFIN), The Jackson Laboratory's Mouse Anatomical Dictionary project, and the FlyBase list of Anatomy and Development terms).
Gene Ontology: http://www.geneontology.org
FlyBase: http://flybase.bio.indiana.edu
Saccharomyces Genome Database: http://www.yeastgenome.org
Mouse Genome Database: http://www.informatics.jax.org
UMLS: http://www.nlm.nih.gov/research/umls
PubMed: http://www.pubmed.gov
Swiss-Prot: http://www.ebi.ac.uk/swissprot
ZFIN: http://zfin.org
Mouse Anatomical Dictionary: http://www.informatics.jax.org/searches/anatdict_form.shtml
Unfortunately, some of these anatomical ontologies have restrictions that prevent their application to other organisms. For example, often there is a narrow set of relations, such as is-part-of and develops-from, terms that limit the options for describing the inter- and intra-relationships of anatomical parts. This limitation of concepts and properties also limits their use for phylogenetic and comparative anatomical analyses.

2. Methodological Considerations and Ontology Construction

The architecture of an ontology typically is sufficiently complex to require a considerable amount of manual effort. As such, the development of an ontology usually is carried out by experts in the knowledge domain. Based on [4], the process of constructing an ontology can be represented by the following steps:
1. Determine the boundaries of the ontology.
2. Consider reusing (parts of) existing ontologies.
3. Enumerate all the concepts to include.
4. Define an appropriate taxonomy to describe concepts, properties and relationships.
5. Define properties of the concepts.
6. Define facets of the concepts such as cardinality, required values, etc.
7. Define instances.
8. Check the consistency of the ontology.
Using the Protege-OWL editor [4], we developed an ontology in OWL DL for amphibian morphology that was consistent with the recommendations outlined in the Suggested Upper Merged Ontology (SUMO) [5]. In accordance with the list above, we first determined that the boundary for the ontology should include all anatomical physical, self-connected objects for all amphibians (i.e., frogs, toads, salamanders, newts, and caecilians). We evaluated a number of existing sources for reuse, including the SUMO mapping of WordNet [6], the Unified Medical Language System (UMLS), and several species-specific anatomical ontologies (e.g., the Jackson Laboratory's Mouse Anatomical Dictionary, the Anatomical Dictionary, and the ZFIN
FlyBase anatomy and development terms: http://flybase.bio.indiana.edu/cgi-bin/fbcvq.html?start
No abstract concepts were defined in the amphibian morphology ontology. Furthermore, each concept in the ontology is considered a self-connected object whose parts are all mediately or immediately connected with one another, and no collection concepts have been defined at this time. No process concepts are currently included in the ontology; however, such an extension may be added in the future to represent functional and physiological knowledge. See [5] for a more detailed discussion of these SUMO top-level ontological categories.
Anatomical Dictionary: http://www.dinosauria.com/dml/anatomy.htm
Anatomical Ontology of the Zebrafish). The SUMO mapping of WordNet provides basic descriptions of terms, and although we were able to identify a few concepts applicable to amphibian morphology, the terminology is too general to be useful for this project. The UMLS is an extensive biomedical ontology containing numerous concepts and relationships. However, our initial attempts to incorporate the UMLS terminology into the amphibian morphological ontology proved to be difficult because: 1) UMLS contains numerous concepts that are not relevant to the amphibian anatomical lexicon and, 2) those concepts that are relevant are not detailed enough for our needs. We also experimented with using an approach similar to the Foundational Model of Anatomy. Interestingly, the top-level organization of this ontology is based on abstract geometric concepts and relationships (e.g., spaces, points, adjacency, direction, etc.). Although such conceptual organization facilitates spatial queries at different levels of complexity, we felt that, for our initial efforts, a top-level organization based on anatomical systems was more consistent with facilitating comparisons among amphibian taxa. Of the species-specific anatomical ontologies, the ZFIN Zebrafish Anatomical Dictionary is most in line with the goals of our project. (It is important to note that at the time of this writing no information was publicly available about the dictionary of embryological anatomy of Xenopus (African clawed frog); thus, we could not evaluate the appropriateness of the contents of that knowledge base. When it becomes available, we plan to explore the integration of the dictionary with our amphibian anatomical ontology.) We adopted relevant concepts, hierarchy, and relationships from ZFIN as an initial framework for the amphibian morphological ontology. Subsequent modifications and enhancements to our knowledge base, including the addition of concepts and properties and the identification of instances, were made by manually mining literature sources [e.g., 7, 8, 9, 10]. Finally, the consistency of the ontology was evaluated through tools provided in the Protege-OWL ontology builder. End-user evaluations of the usability and usefulness of the ontology are planned (see Section 3.3).

3. The Amphibian Anatomical Ontology

3.1. The Semantic Network

The amphibian anatomy semantic network currently consists of 212 semantic concepts and 58 relationships. Each concept is given a textual definition, adopted from ZFIN (where appropriate) or manually mined from the literature. Properties in the ontology are symmetric (e.g., is-fused-to), inverse (e.g., forms
vs. is-formed-from), functional and inverse functional (e.g., is-defined-as vs. is-the-definition-of), or transitive (e.g., is-part-of). A partial view of the concept hierarchy and properties for the amphibian anatomical ontology (as displayed in Protege) is shown in Figure 1.

3.2. Challenges and Current Solutions

Because of the broad range of organisms and morphologies included in our amphibian anatomical ontology, we faced several challenges in its development. For example, we were required to represent anatomical diversity in a logical and meaningful manner within the terminological and hierarchical framework of the ontology. To do this, we included taxonomic (i.e., Linnaean nomenclature) references as concepts in the ontology. In this way, we were able to designate the range of an instance of a concept as a given taxonomic group. This method also provided us with a way of referencing homologous and partially homologous structures, while allowing the community to continue to use commonly-accepted terminology (e.g., the orbitosphenoid in salamanders is homologous to the sphenethmoid in frogs). An additional challenge arose from the need to include developmental stages in the ontology. Most ontologies that include development information are created specifically for that purpose, and often do not include information about adult anatomy (let alone anatomical diversity among groups). To overcome this challenge, we took an approach similar to the one above and included developmental stages as classes. As such, we could designate the range of a concept as an instance of a particular developmental stage.

3.3. Planned Modifications and Enhancements

As is the case with most biological ontologies (e.g., Gene Ontology, Plant Ontology), the current ontology of amphibian anatomy can be considered a partonomy, because it uses both is-a and part-of relationships in the hierarchical foundation. Although the use of part-of relationships appears to be a logical representation of biological hierarchy, as shown by [11], the inclusion of part-of relationships in the hierarchy of a structural ontology can result in inconsistencies and multiple inheritances that are illogical, and can limit the mapping of an ontology into other such systems.
http://www.plantontology.org/docs/otherdocs/poc_project.html
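The property types used in the ontology (symmetric, inverse, functional, and transitive) correspond directly to OWL constructs. As a purely illustrative aside, and not a description of the project's actual files, the sketch below declares a small fragment of such an ontology in Python with the rdflib library; the namespace and the particular class and property names are assumptions drawn loosely from Figure 1.

from rdflib import Graph, Namespace, RDF, RDFS, OWL

AMPH = Namespace("http://example.org/amphibanat#")  # hypothetical namespace
g = Graph()
g.bind("amph", AMPH)

# A small slice of the class hierarchy (names taken from Figure 1).
for cls in (AMPH.Skeletal_system, AMPH.Cranial_skeleton, AMPH.Chondrocranium):
    g.add((cls, RDF.type, OWL.Class))
g.add((AMPH.Cranial_skeleton, RDFS.subClassOf, AMPH.Skeletal_system))
g.add((AMPH.Chondrocranium, RDFS.subClassOf, AMPH.Cranial_skeleton))

# is-part-of: transitive; is-fused-to: symmetric; forms / is-formed-from: inverses.
g.add((AMPH.is_part_of, RDF.type, OWL.ObjectProperty))
g.add((AMPH.is_part_of, RDF.type, OWL.TransitiveProperty))
g.add((AMPH.is_fused_to, RDF.type, OWL.ObjectProperty))
g.add((AMPH.is_fused_to, RDF.type, OWL.SymmetricProperty))
g.add((AMPH.forms, RDF.type, OWL.ObjectProperty))
g.add((AMPH.is_formed_from, RDF.type, OWL.ObjectProperty))
g.add((AMPH.forms, OWL.inverseOf, AMPH.is_formed_from))

print(g.serialize(format="turtle"))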
[Screenshot: the Protege class browser, showing part of the hierarchy (Skeletal_system > Axial_skeleton > Cranial_skeleton > Chondrocranium, Dermatocranium, Neurocranium, Splanchnocranium; Forelimb_skeleton; Hindlimb_skeleton; Digestive_system; Embryonic_structures; Integument; Muscular_system) and object properties including ARTICULATES_WITH, CONTAINS, FORMS, INVESTS, and LOCATED_IN, each ranging over Anatomical_system.]
Figure 1. Protege-OWL editor screen shot of a portion of the class hierarchy and properties associated with the amphibian anatomical ontology.
At the recent NCBO workshop on anatomical ontologies, it was resolved that a Common Anatomy Reference Ontology (CARO), based on the Foundational Model of Anatomy [3], would be developed to facilitate the integration of anatomical ontologies representing various model organisms. Because the top-down foundational model of CARO is based on sound principles of ontology design, and is explicitly designed to accommodate anatomical diversity, we plan to adopt the CARO model in future implementations of the amphibian anatomical ontology. In addition, our current practice of including developmental and taxonomic information in the anatomical ontology presents logical inconsistencies. Although the CARO model explicitly excludes developmental and taxonomic information from the ontology, it does include plans to map concepts to other ontologies that do include such data.
NCBO Anatomy Ontology Workshop: http://www.bioontology.org/wiki/index.php/Anatomy_Ontology_Workshop
Therefore, by adopting the CARO model, future implementations of the amphibian anatomical ontology will be logically sound while accommodating biodiversity and developmental information.

3.4. Software-Based Ontology Enrichment

Although we have developed the hierarchical class structure for the amphibian anatomy ontology, we have not yet fully instantiated those classes, nor all of the properties associated with the classes. We plan to enrich the amphibian anatomical ontology by using information extraction (IE) technology to mine the amphibian anatomical literature. We currently are developing software to extract elements relevant to anatomy from electronic media, based on previous work by [12]. By combining pattern-based extraction methods with statistical natural language processing algorithms, the software identifies and weights the most important elements. It will be trained using an initial set of values taken from a portion of the current ontology, with focus on extracting highly weighted, domain-specific terminology (e.g., nouns and noun phrases), important term relationships (e.g., terms related by domain-specific cue words), and inter-concept relationships (most likely indicated by verbs connecting terms specific to two or more concepts). Our ultimate goal is to provide a software system that can adapt any existing ontology automatically by mining concepts from the literature, extend the ontology by adding related concepts to those that are over-represented in the literature, and remove unused concepts. We assume that a concept with many instances in the literature probably is under-represented in the ontology. Through a series of subdivisions of the largest concepts, the ontologies can be supplemented to include new subconcepts with increased specificity, thereby providing a better match to the contents of the representative literature. In addition, machine-learning techniques can be employed on documents that contain few or no instances of concepts in the ontology in order to identify new concepts that might be missing from the ontology. Relationships between these new concepts and the existing concepts can then be inferred using IE techniques. By experimenting with the size of the initial seed ontology that is adapted, we will be able to evaluate the effect of the amount of information provided and the quality of the automatically generated ontology. As a proof-of-concept, we will seed the learning system with our current amphibian anatomical ontology and allow it to grow. We will also seed it with subsets of the amphibian morphological ontology and evaluate how much the automatically adapted ontology matches the current one, and how well each performs on extrinsic and
intrinsic benchmarks (e.g., similarity to a community-accepted ontology, similarity to concepts represented in a literature subset). We will also investigate which information sources produce the best ontologies. In the best case scenario, an entire ontology could be created from an initial root concept; however, we do not believe this to be probable. The automatic ontology is likely to contain only the simplest inter-concept relationships, e.g., is-a or has-a or more-general/more-specific. The resulting ontology will be evaluated empirically using benchmarks and by evaluation from the user community.

3.5. Community Curation of the Ontology

Because a knowledge management system can only function satisfactorily if it is integrated into the organization in which it is used [13], it is imperative that the expert user community be highly involved in this project. As discussed in [14], the use of a collaborative ontology builder (COB) environment is vital to properly support the following tasks:
1. Knowledge integration.
2. Concurrence management.
3. Semantic consistency maintenance.
4. Privilege management (i.e., to ensure accuracy of the ontology based on a user's expertise, authority, and responsibility for different parts of the ontology).
5. History maintenance.
The collaborative environment for the Gene Ontology is based on a concurrent versions system (CVS), with a request tracking system hosted on SourceForge, and communication between users and curators facilitated via email lists. However, as observed in [15], such an environment suffers from the following drawbacks:
1. Absence of a principled mechanism to ensure curator privilege assignments, and organization of the ontology into smaller manageable units.
2. Risk of inconsistency from unintended couplings and over-writing.
3. Lack of support for restricting editing to only part of the ontology (i.e., a curator has to download the entire ontology before editing, and then submit the entire ontology after editing).
4. Expensive history maintenance (i.e., even a minor edit to the ontology causes the entire file to be replicated in its entirety).
5. The inability to grant different levels of privileges to different types of users, subsequently limiting community participation.
http://www.geneontology.org/GO.contents.curator.guides.shtml
For the amphibian anatomical ontology we are currently using the same CVS protocol employed by the Gene Ontology. However, we are in the process of evaluating alternatives such as the use of COB Editor, a recently developed COB software system that overcomes many of the aforementioned problems, and has been used successfully for the Animal Trait Ontology (ATO). To facilitate the use of our amphibian anatomical ontology, we are developing a Web site (www.amphibanat.org) that includes documentation on all aspects of the project such as a monthly newsletter and answers to frequently asked questions, discussion boards, links to contacts and mailing lists, and downloadable tools for using the ontology including a Java-based API and a user interface for searching, browsing, and navigating the ontology.

4. Knowledge Integration via the Anatomical Ontology

A long-term goal of this project is to integrate the amphibian anatomical ontology knowledge base with systematic, biodiversity, embryological and genomic resources. Interoperability with other media resources has been considered in the design and implementation of our knowledge base. We currently are developing a Java-based API with several functions, including: searching for a particular term, iterating through all terms related to a specific term, finding citations for literature associated with the use of a term, etc. By standardizing amphibian anatomical terminology and providing a platform- and implementation-independent API to access the ontology from existing applications, we are developing a means to facilitate integration of phylogenetic, anatomical, embryological, and gene expression data. We plan to demonstrate the usefulness of this API by integrating it with the querying facilities of MorphologyNet, a digital library of 3D visualizations of anatomy developed by Leopold and Maglia [16]. The MorphologyNet query interface (which is currently being developed, and will be available in November 2006) allows searching for morphological images by any combination of: taxonomic reference, anatomical reference, developmental stage, accession number, and contributor name. Integrating the amphibian
COB Editor: http://sourceforge.net/projects/cob
Animal Trait Ontology: http://www.animalgenome.org/bioinfo/projects/ATO
MorphologyNet: http://www.morphologynet.org
anatomical ontology with the MorphologyNet database will provide users with the option of searching for anatomical structures using the controlled vocabulary, retrieving images using synonym matching, or accessing images that are hierarchically related to the search term in the ontology.

5. Summary

The amphibian anatomical ontology provides a terminological and hierarchical framework for amphibian anatomy. By standardizing the lexicon used for diverse biological studies related to anatomy (e.g., gene expression, embryology, systematics), we hope to facilitate the integration of anatomical data representing all orders of amphibians, thus enhancing our knowledge of amphibian biology and diversity. Our ontology will provide a powerful tool that will facilitate cross-database querying and foster consistent use of vocabularies in the annotation of amphibian morphology. Thus, it could allow a morphologist to determine the preferred name for a given anatomical structure, an evolutionary biologist to find similar morphological structures of phylogenetic significance present across different species, or an embryologist to study and compare how gene expression guides the development of embryos in different taxonomic groups. By using good practices of ontology development, we hope to integrate the amphibian anatomical ontology with many different types of Internet-distributed databases (including anatomical ontologies representing other organisms), thus helping to realize fully a Semantic Web within the biological domain.

References
1. Davies, J., Fensel, D., and Van Harmelen, F. (eds.). Towards the Semantic Web: Ontology-driven Knowledge Management. John Wiley & Sons, West Sussex, England (2004)
2. Soldatova, L., and King, R. Are the current ontologies in biology good ontologies? Nature Biotech. 23:1095-1098 (2005)
3. Rosse, C., Shapiro, L. G., Brinkley, J. F. The Digital Anatomist Foundational Model: principles for defining and structuring its concept domain. Proc AMIA Symp 820-4 (1998)
4. Knublauch, H., Fergerson, R. W., Noy, N. F., Musen, M. A. The Protege OWL Plugin: An Open Development Environment for Semantic Web Applications. Third International Semantic Web Conference - ISWC 2004, Hiroshima, Japan (2004)
5. Pease, A., Niles, I., and Li, J. The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications. In
Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web, Edmonton, Canada (2002)
6. Fellbaum, C. WordNet: An Electronic Lexical Database (Language, Speech, and Communication). MIT Press, Boston. 423 pp. (1998)
7. Duellman, W. E. and Trueb, L. Biology of Amphibians. Johns Hopkins University Press, Baltimore (1994)
8. Maglia, A. M., Pugener, L. A., and Trueb, L. Comparative Development of Anurans: Using Phylogeny to Understand Ontogeny. Amer. Zool., 41:538-551 (2001)
9. Reilly, S. M. and Altig, R. Cranial ontogeny in Siren intermedia (Caudata: Sirenidae): Paedomorphic, metamorphic, and novel patterns of heterochrony. Copeia 1996: 29-41 (1996)
10. Maglia, A. M. and Pugener, L. A. Skeletal development and adult osteology of Bombina orientalis (Anura: Bombinatoridae). Herpetologica 54: 344-363 (1998)
11. Smith, B., and Rosse, C. The Role of Foundational Relations in the Alignment of Biomedical Ontologies. 11th World Congress on Medical Informatics (MEDINFO) (2004)
12. Gauch, S., and Chandramouli, A. Semi-Automatic Update of Existing Taxonomy using Text Mining. 5th International Conference on Ecological Informatics (ISEI5) (2006)
13. Antoniou, G., and Van Harmelen, F. A Semantic Web Primer. MIT Press, Boston (2004)
14. Bao, J., Hu, Z., Caragea, D., Reecy, J., and Honavar, V. A Tool for Collaborative Construction of Large Biological Ontologies. Fourth International Workshop on Biological Data Management - DEXA 2006, Krakow, Poland (2006)
15. Leopold, J., Maglia, A., and Hoeft, T. Interactive Anatomy Online: The MorphologyNet Digital Library of Anatomy. IEEE Potentials 24(2):39-41 (2005)
RECOMMENDING PATHWAY GENES USING A COMPENDIUM OF CLUSTERING SOLUTIONS
DAVID M. NG+, MARCOS H. WOEHRMANN+, AND JOSHUA M. STUART*
Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA
+ Equal coauthorship. * Email: [email protected]
A common approach for identifying pathways from gene expression data is to cluster the genes without using prior information about a pathway, which often identifies only the dominant coexpression groups. Recommender systems are well-suited for using the known genes of a pathway to identify the appropriate experiments for predicting new members. However, existing systems, such as the GeneRecommender, ignore how genes naturally group together within specific experiments. We present a collaborative filtering approach which uses the pattern of how genes cluster together in different experiments to recommend new genes in a pathway. Clusters are first identified within a single experiment series. Informative clusters, in which the user-supplied query genes appear together, are identified. New genes that cluster with the known genes, in a significant fraction of the informative clusters, are recommended. We implemented a prototype of our system and measured its performance on hundreds of pathways. We find that our method performs as well as an established approach while significantly increasing the speed and scalability of searching large datasets. [Supplemental material is available online at sysbio.soe.ucsc.edu/cluegene/psb07.]
1. Introduction

We developed an approach that efficiently searches the growing body of functional genomics data for new genes that act in a pathway of interest. For many pathways, the cell must coordinate the expression of the participating genes so that their products are present at the same time and place. The functional similarity of these genes may be detectable in gene expression data, if the context under which the pathway is activated has been assayed. As the results of DNA microarray studies continue to be contributed to public repositories such as the Gene Expression Omnibus1, the chance that such a context exists in the database becomes more likely.
* Corresponding author.
However, finding this context among the many irrelevant experiments can be as challenging as finding a needle in a haystack. Existing recommendation systems for gene pathway discovery, such as the GeneRecommender 2 and the Signature Algorithm 3 , have shown promise for finding genes of related function. However, these approaches do not take advantage of the natural clustering of genes in different experiment series. Rather than using pre-existing clusters, they build a cluster around the given query genes using microarray hybridizations under which the query genes are most strongly up- or down-regulated. Because of this, they can miss correlations present across multiple hybridizations where the absolute levels of the query genes are different but where their expression changes are still highly similar. In addition, these approaches can be computationally intensive since the algorithms must compute the correlation of every gene compared to the input query set. Therefore, we expect these approaches to scale poorly as the number of microarray hybridizations increases. The task of identifying new genes that act in a pathway is analogous to the task of making product recommendations for customers of online stores. We have developed a collaborative filtering-based gene recommendation system 4 , ClueGene, which uses pre-computed clusters of genes to recommend new genes for a query pathway. In online shopping, recommendations for additional purchases are based on the contents of a customer's shopping cart and on the purchasing history of previous customers 5 . In gene pathway prediction, recommendations for additional genes in a pathway are based on the known genes of the pathway (the query genes) and on clusters of coexpressed genes computed from experimental data. The ClueGene system precomputes clustering solutions for multiple data sets and stores each identified cluster in a database referred to as the Cluster Compendium (see Figure 1). Storing clusters provides a more compact representation of gene regulation groups compared to storing the entire set of microarray results. Given a query, consisting of a set of genes, the ClueGene recommender algorithm scans the Cluster Compendium for clusters containing a significant proportion of the query genes. It scores each gene in the genome using a weighted vote across these clusters. ClueGene returns its recommendation as a list of all the genes in the genome ranked by score. Using a method analogous to e-commerce recommender systems, ClueGene quickly and accurately predicts members for a large number of pathways. We conclude that collaborative filtering approaches may provide an efficient and accurate methodology for scanning large amounts of functional
Figure 1. ClueGene method overview. A. Datasets from the Stanford Microarray Database and Gene Expression Omnibus are collected. B. Clusters are derived from each dataset by forming a network from all significant pairwise Pearson correlations, from which dense subnetworks are identified with MODES. C. The grouping of clusters by dataset is maintained in the Cluster Compendium. D. A list of genes is supplied as the input query. E. All genes are scored according to their degree of co-clustering with the query. F. The top-scoring genes are returned as the recommendations.
genomics data to predict gene function.

2. Construction of the Cluster Compendium

To test the ClueGene system, we clustered 44 different Saccharomyces cerevisiae experiment series collected from the Stanford Microarray Database6 and the Gene Expression Omnibus. The datasets represent a diverse collection of experiments ranging from perturbations such as various stresses and deletions to normal conditions such as cycling cells and regulation of general transcription. To increase the diversity of clusters in our search, we included two datasets in which the binding of specific transcription factors was assayed with genome-wide chromatin immunoprecipitation7,8. For a full list of the datasets and their references, please see Supplemental Table 1. To find clusters of coregulated genes within each data series, we first constructed a coexpression network and then identified clusters as dense subnetworks in the network. A coexpression network was constructed by connecting any two genes whose Pearson correlation was equal to or greater than four standard deviations above what was expected by chance (based on randomly permuting gene vectors). The MODES (Mining Overlapping DEnse Subgraphs) algorithm9 was used to identify highly-connected sets of genes in the resulting network. MODES clusters genes into overlapping subsets, allowing a gene to belong to multiple clusters. In total, 6900 clusters from the 44 datasets were identified and loaded into a yeast Cluster Compendium from which recommendations could be computed.
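As a rough sketch of the network-construction step just described, the Python fragment below connects gene pairs whose Pearson correlation exceeds a permutation-derived threshold of four standard deviations above the random expectation. The permutation scheme (per-gene shuffling, 100 permutations) and the function name are illustrative assumptions; the subsequent dense-subgraph mining with MODES is a separate step and is not sketched here.

import numpy as np

def coexpression_edges(expr: np.ndarray, n_perm: int = 100, n_sd: float = 4.0):
    """Connect gene pairs whose Pearson correlation exceeds the mean
    permutation correlation by n_sd standard deviations.

    expr: genes x experiments matrix for a single dataset.
    """
    rng = np.random.default_rng(0)
    corr = np.corrcoef(expr)                       # observed gene-gene correlations
    # Null distribution: correlations after independently permuting each gene.
    null = []
    for _ in range(n_perm):
        shuffled = np.apply_along_axis(rng.permutation, 1, expr)
        c = np.corrcoef(shuffled)
        null.extend(c[np.triu_indices_from(c, k=1)])
    threshold = np.mean(null) + n_sd * np.std(null)
    genes = corr.shape[0]
    return [(i, j) for i in range(genes) for j in range(i + 1, genes)
            if corr[i, j] >= threshold]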
Clusters derived from the gene expression study represent sets of genes whose relative changes in expression across a single dataset are highly similar. Clusters derived from the chromatin-immunoprecipitation experiments represent sets of genes that are bound by a common set of transcription factors. Thus, clusters from both types of dataset group genes according to shared regulatory information. Note that the clustering step does not depend on a particular query and therefore was pre-computed.

3. Scoring Genes Based on Co-clustering with a Pathway

ClueGene is given a set of genes, called the query, Q, that are thought to be functionally related. It then scores each gene, g, in the genome, G, based on how often the gene appears in clusters with the genes in Q. We define a function that assigns higher scores to genes that appear in clusters containing a high proportion of query genes. Let D be a set of clustering solutions where each element of D is a set of clusters. Define Ngd to be the number of clusters in data set d that contain g and at least one gene from Q. The co-clustering score C(g) of gene g ∈ G is:
C(g) = \sum_{d \in D} \frac{1}{N_{gd}} \sum_{c \in d} I(g \in c) \, \frac{|c \cap Q|}{|c|}
where I is the indicator function that returns 1 if its argument is true and 0 otherwise.a The intuition underlying the choice of scoring function is to identify genes that occur in small and specific clusters with the query genes. If g belongs to a large cluster that also happens to have several of the query genes, this observation is down-weighted because the co-occurrence of gene g with the query may arise by chance if the cluster is large enough. On the other hand, if g belongs to a small cluster that also contains several of the query genes, this observation receives a high weight because the co-occurrence is less likely to be serendipitous. Dividing by Ngd corrects for the number of clusters that a gene appears in. Without this correction, high scores could be assigned to genes that are "central" in the coexpression network simply because they appear in several clusters. Note that one could also consider including an additional normalization term to correct for missing data.b
a The time complexity of ClueGene is O(|D|), where |D| is the number of data sets. The time complexity of GeneRecommender is O(|D| × e), where e is the average number of experiments per data set. We expect e to grow over time as high-throughput techniques become less costly and more common. For details of the scoring algorithm and time complexity analysis see Supplemental Appendix A.
b Dividing by Mg, the number of datasets in which g appears, would allow genes with differing amounts of missing data to be directly compared. We found that dividing by Mg had little effect on our results, presumably because the yeast data contains very little missing data. However, we suggest including a division by Mg if applied to other species in which more missing data is expected.
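A compact Python rendering of this score, written from the description above rather than from the authors' code (the data layout and the example gene and cluster assignments are hypothetical), is:

from collections import defaultdict

def cluegene_scores(compendium, query):
    """Score every gene by its co-clustering with the query genes.

    compendium: {dataset_name: [set_of_genes_in_cluster, ...]}
    query: set of known pathway genes (Q).
    Clusters that contain g and at least one query gene contribute
    |c & Q| / |c|, averaged per dataset over the N_gd such clusters.
    """
    scores = defaultdict(float)
    for clusters in compendium.values():
        contrib = defaultdict(float)   # per-dataset sum over informative clusters
        n_gd = defaultdict(int)        # N_gd: informative clusters containing g
        for c in clusters:
            overlap = len(c & query)
            if overlap == 0:
                continue
            weight = overlap / len(c)  # high for small, query-rich clusters
            for g in c:
                contrib[g] += weight
                n_gd[g] += 1
        for g, total in contrib.items():
            scores[g] += total / n_gd[g]
    return dict(scores)

# Illustrative toy compendium (cluster memberships are made up).
compendium = {"heat_shock": [{"HSP104", "HSP82", "SSA1"}, {"CDC28", "CLN2"}],
              "cell_cycle": [{"CDC28", "CLN2", "CLB5", "SSA1"}]}
print(cluegene_scores(compendium, query={"HSP104", "HSP82"}))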
383
are "central" in the coexpression network simply because they appear in several clusters. Note that one could also consider including an additional normalization term to correct for missing data. b 4. Results on Positive Control Pathways To estimate the accuracy of a search, either from ClueGene or the GeneRecommender, we used a leave-half-out strategy. Half of the original genes in the pathway were used as the query to search for the remaining half. We refer to the withheld members as the expected set of genes. We obtain a conservative estimate of the accuracy of the search by using only the ranks of the expected genes while ignoring the ranks of the query genes. A single leave-half-out search results in a list of genes, sorted by their co-clustering scores, C(g). At a given score cutoff z, the precision and the recall of the search are measured. Expected genes with scores of at least z are considered to be recommended, while the rest are not. The precision is defined to be p/n where p is the number of expected genes with scores of at least z, and n is the number of total genes with scores of at least z. The recall is defined to be p/t where t is the total number of expected genes, and p is the same as before. Sweeping through a range of cutoff levels produces various precision levels as a function of recall. For positive control testing, we selected four functionally-related groups of genes defined by KEGG 1 0 : the Cell Cycle category—containing genes involved in the actuation and regulation of the cell cycle, the Oxidative Phosphorylation category—containing genes involved in the final stage of cellular respiration, the Proteasome category—containing genes encoding subunits of the 26S or 19S components of the proteasome, and the Ribosome category—containing genes that encode subunits of the small and large cytosolic ribosome. These pathways were previously shown in Stuart et al. n to contain genes with highly correlated expression profiles conserved across multiple species. As a negative control, we created four sets of genes selected at random from the entire yeast genome; these random sets contained 10, 25, 50, and 100 genes. ClueGene and GeneRecommender were both run on the posib
Dividing by Mg, the number of datasets in which g appears, would allow genes with differing amounts of missing data to be directly compared. We found that dividing by Mg had little effect on our results, presumably because the yeast data contains very little missing data. However, we suggest including a division by Mg if applied to other species in which more missing data is expected.
384
tive control pathways and the randomly constructed sets of genes. Figure 2 shows the precision-recall curves for the Cell Cycle, Oxidative Phosphorylation, Proteasome, and Ribosome categories.
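A minimal sketch of the co-clustering score defined in Section 3, assuming the Cluster Compendium is held in memory as a list of datasets, each a list of gene-name sets; this is illustrative only and not the published implementation.

```python
def clue_gene_scores(compendium, query, genome):
    """Score every gene by how often it co-clusters with the query genes.

    compendium: list of datasets; each dataset is a list of clusters (sets of gene names).
    query:      set of query gene names (Q).
    genome:     iterable of all gene names to score (G).
    """
    query = set(query)
    scores = {g: 0.0 for g in genome}
    for dataset in compendium:
        for g in scores:
            # N_gd: clusters in this dataset containing g and at least one query gene.
            hits = [c for c in dataset if g in c and query & c]
            if not hits:
                continue
            # Each co-occurring cluster contributes the fraction of its members that
            # are query genes, so large clusters are down-weighted; dividing by N_gd
            # corrects for genes that simply appear in many clusters.
            scores[g] += sum(len(query & c) / len(c) for c in hits) / len(hits)
    return scores
```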
[Figure 2 panels: precision-recall curves (Recall on the x-axis) for the Oxidative Phosphorylation and Cell Cycle categories, comparing ClueGene and GeneRecommender; see caption below.]
Figure 2. Estimates of the precision at various levels of recall for the four test pathways. Black lines, accuracies for ClueGene; gray lines, accuracies for GeneRecommender. Error bars show ±1 standard deviation from 10 leave-half-out runs. A. Ribosomal subunits. B. Oxidative phosphorylation. C. Proteasomal subunits. D. Cell cycle related genes.
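The leave-half-out evaluation and the precision-recall area under the curve used in this section can be sketched as follows; the function names and the use of numpy's trapezoid routine are ours, not the paper's supplement.

```python
import numpy as np

def precision_recall(scores, expected, query):
    """Precision and recall over score cutoffs, ignoring the query genes themselves.

    scores:   dict mapping gene -> co-clustering score C(g)
    expected: set of withheld pathway members to recover
    query:    set of genes used as the search input (excluded from evaluation)
    """
    ranked = sorted((g for g in scores if g not in query),
                    key=lambda g: scores[g], reverse=True)
    precision, recall, hits = [], [], 0
    for n, gene in enumerate(ranked, start=1):
        if gene in expected:
            hits += 1
        precision.append(hits / n)            # p / n at this cutoff
        recall.append(hits / len(expected))   # p / t
    return recall, precision

def pr_auc(recall, precision):
    """Area under the precision-recall curve by the trapezoid rule."""
    return float(np.trapz(precision, recall))
```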
As expected, ClueGene and GeneRecommender perform equally well for the Ribosome and Oxidative Phosphorylation categories. Neither ClueGene nor GeneRecommender is able to identify members of the Cell Cycle category. Since the KEGG Cell Cycle category contains genes that act at different stages of the cell cycle, we tested whether the performance could be improved by dividing up the category into gene groups that are known to act at the same phase. However, we found that all subsets of the Cell Cycle category, corresponding to different phases of the cell cycle, also performed poorly on the searches (see Supplemental Table 3). This suggests that the yeast Cluster Compendium does not contain informative clusters for identifying genes involved in this process. ClueGene appears to perform better than the GeneRecommender on predicting subunits of the proteasome. To assess whether the difference between the performance of the two methods was significant, we measured the area under the curve (AUC) of the precision-recall plot to summarize
the overall performance of a search method.ᶜ The average and standard deviation across ten leave-half-out tests were calculated. Table 1 summarizes the results of testing on the four positive control pathways.
Table 1. Control Test Results.

KEGG category               Category size   ClueGene AUC   GeneRecommender AUC   CG Randomᵃ AUC   z-scoreᵇ
Cell Cycle                  87              0.0373         0.0444                0.0082           -0.5139
Oxidative Phosphorylation   64              0.3941         0.3058                0.0057            0.4549
Proteasome                  32              0.7149         0.5631                0.0028            0.2827
Ribosome                    147             0.8579         0.7942                0.0147            0.2387

Note: ᵃ ClueGene run on random pathways of size 10, 25, 50 and 100; the AUCs of three runs were averaged for each size; reported values derived by linear interpolation. ᵇ Mann-Whitney z-score.
The AUCs matched our intuitive sense of the performance as observed in Figure 2. In addition, the results on the negative controls yielded AUCs expected from uniformly distributed ranks (Table 1 shows the negative control results for ClueGene only; the results on negative controls were nearly identical for GeneRecommender). The AUC can be used as a single measure to evaluate a large collection of pathways to identify those pathways associated with high ClueGene performance. To detect pathways with significant accuracy, one could perform a z-test between the AUCs obtained on the random controls and those obtained for a specific pathway. Our focus, however, was to identify any pathways for which the two search algorithms produced significantly different precision levels. To test whether the search results were statistically comparable or different for a particular pathway, a Mann-Whitney test was performed to compare the ranks of the expected genes returned by ClueGene to those returned by GeneRecommender.ᵈ We used the z-score returned by the
ᶜ The AUC was estimated using the trapezoid rule, commonly used in discrete integration. The final AUC was normalized by dividing by the theoretical maximum. ᵈ For a pathway of size 2t, let the ranks assigned by ClueGene to the t expected genes be X1, X2, ..., Xt, and the ranks assigned by GeneRecommender be Y1, Y2, ..., Yt. X and Y were combined into a single vector W and sorted. The Mann-Whitney statistic was computed as U = t² + 0.5t(t + 1) − R, where R is the sum of the new ranks of X in W.
Mann-Whitney test as a measure of the difference in prediction accuracy between the two search engines. z-scores larger than 2 indicated that ClueGene found a significantly more accurate result than GeneRecommender. Conversely, z-scores less than −2 indicated that the GeneRecommender performed more accurately for a pathway than ClueGene. For each pathway, we calculated the Mann-Whitney z-score for each of the 10 different leave-half-out tests. We reported the median z-score from these 10 runs. Note that this is equivalent to calling a difference between the two methods significant if a majority of the leave-half-out tests yield significantly different rankings. For the positive control pathways, we found that ClueGene and GeneRecommender returned statistically similar results. For example, even though the AUC is higher for ClueGene (0.71) than for the GeneRecommender (0.56) for the Proteasome, the rankings assigned to the expected genes were not found to be significantly different (0.28 standard deviations) (Table 1). Thus, the ClueGene search engine was found to perform as accurately as the GeneRecommender using the Mann-Whitney test for the four positive control pathways. To gauge the general performance of ClueGene, we next set out to test it on a large set of pathways.

5. Results on Diverse Pathways

To compare the performance of ClueGene and GeneRecommender, the accuracy of each method was measured on 1441 functionally-related groups of genes defined by Gene Ontology¹², 80 defined by KEGG¹⁰, and 180 defined by MIPS¹³, for a total of 1701 pathways (see Supplemental Table 2 for the complete results). Figure 3A shows the distribution of AUCs computed for ClueGene and GeneRecommender. The Mann-Whitney z-scores were centered on 1 (Figure 3A). This was surprising because it indicated that, in general, ClueGene had higher, although not significantly higher, performance across the pathways compared to the GeneRecommender. Few extreme z-scores were observed, indicating the two methods perform comparably on the set of pathways. We collected five pathways with the highest and five with the lowest Mann-Whitney z-scores (see Table 2). The results indicate the ClueGene method performed better for pathways specific to energy generation. For example, ClueGene obtained significantly better rankings for GO categories

ᵉ A z-score was computed as z = (U − t²/2) / (t√((2t + 1)/12)).
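A plain sketch of the statistic described in footnotes d and e, assuming ties are negligible; a library routine such as scipy.stats.mannwhitneyu computes the same U statistic, and the function name here is ours.

```python
import math

def mann_whitney_z(clue_ranks, gr_ranks):
    """z-score comparing the ranks ClueGene and GeneRecommender assign to the
    same t expected genes (smaller ranks = better; positive z favors ClueGene)."""
    t = len(clue_ranks)
    assert len(gr_ranks) == t
    combined = sorted(clue_ranks + gr_ranks)
    # Rank (1-based) of each ClueGene value within the combined, sorted vector W.
    # Ties take the first matching position in this simplified sketch.
    new_ranks = [combined.index(x) + 1 for x in clue_ranks]
    u = t * t + 0.5 * t * (t + 1) - sum(new_ranks)   # Mann-Whitney statistic
    mean_u = 0.5 * t * t                             # null mean
    sd_u = t * math.sqrt((2 * t + 1) / 12.0)         # null standard deviation
    return (u - mean_u) / sd_u
```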
[Figure 3 plot: per-pathway precision for ClueGene plotted against GeneRecommender precision (x-axis); the GO category DNA transposition is among the labeled points.]
Figure 3. Performance comparison of ClueGene to GeneRecommender on diverse pathways. A. Distribution of AUCs for ClueGene and GeneRecommender on a non-redundant set of pathways for which at least one of the methods had an AUC of 0.20 or better. B. Each pathway's precision at the 50% recall rate is plotted for ClueGene against GeneRecommender. Open diamonds, pathways with absolute Mann-Whitney z-scores less than 2; black diamonds, pathways with absolute z-scores of at least 2.
Table 2. Selected Pathways with Extreme Mann-Whitney z-scores.

Top ClueGene Categories                                        z-scoreᵃ   Top GeneRecommender Categories   z-score
membrane-bound organelle (GO)                                  2.8        protein binding (GO)             -5.1
carboxylic acid metabolism (GO)                                2.6        DNA recombination (GO)           -4.2
respiratory chain complex III (GO)                             2.6        DNA metabolism (GO)              -3.7
oxidoreductase activity, acting on heme group of donors (GO)   2.6        DNA transposition (GO)           -2.9
aminoacyl-tRNA-synthetases (MIPS)                              2.2        nucleic acid binding (GO)        -2.9

Note: ᵃ Mann-Whitney-derived z-score. Higher z-scores indicate ClueGene ranked query genes toward the top compared to GeneRecommender.
carboxylic acid metabolism and respiratory chain complex III. The GeneRecommender outperformed ClueGene on pathways directly involved in the generation and manipulation of DNA (e.g. the GO categories DNA recombination and DNA metabolism). The ClueGene algorithm had a higher precision for several pathways, including the MIPS aminoacyl tRNA synthetase category (Figure 3B), but the significance compared to GeneRecommender was borderline. To identify datasets that contributed significantly to the high-ranking of the top-scoring genes, for the 25 highest scoring genes we summed the contributions from each dataset. This assigns each dataset a
score relative to its contribution. In the case of the aminoacyl tRNA synthetases, ClueGene was able to find a significant coregulation pattern in the datasets of Brem et al. and Yvert et al. (see Supplemental Table 1 for the references). We plotted the expression levels of the query genes for a subset of the conditions (see Supplemental Figure 1). Visual inspection of the expression levels reveals that, while the shapes of the expression profiles of the aminoacyl tRNA synthetases change in a coordinated fashion, their absolute levels of expression are very low. The GeneRecommender therefore assigned these experiments low scores and consequently missed the coordinated expression changes of this group of functionally-related genes. The GeneRecommender had high accuracy on the GO DNA Transposition category, and identified a significant coexpression pattern within the dataset published by Hughes et al.¹⁴. The transposons had extremely high levels of expression with very little variance across this dataset (data not shown). The shapes of the expression profiles relative to each other look dissimilar. Thus, clustering based on centered Pearson correlation fails to capture the pattern of coregulation of the transposons in this dataset.
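The transposon failure mode can be reproduced in miniature: two profiles that sit at a high, nearly flat expression level are strongly "coexpressed" in absolute terms, yet centered Pearson correlation, which the Cluster Compendium is built on, is dominated by the residual noise. The numbers below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
base = 1000.0                            # very high absolute expression level
a = base + rng.normal(0, 1, size=20)     # tiny, uncorrelated fluctuations
b = base + rng.normal(0, 1, size=20)

# Centered Pearson correlation subtracts each profile's mean, so the shared high
# baseline contributes nothing and only the uncorrelated noise is compared.
print(np.corrcoef(a, b)[0, 1])           # near 0 despite near-identical profiles
```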
6. Discussion

We have found that a collaborative-filtering-based strategy for predicting new members of a pathway from gene expression data gains speed and scalability without sacrificing search performance. The ClueGene search engine uses pre-computed clustering solutions to identify patterns of coregulation between novel and known genes of a pathway. In general, the ClueGene search engine performed comparably to, and in some cases better than, the GeneRecommender on a diverse collection of categories from MIPS, Gene Ontology, and KEGG. The current implementation of ClueGene has several limitations. For example, we only considered positive correlation in our search for related genes. In the future, we plan to test the hypothesis that including anticorrelation can improve pathway prediction. We will build a new Cluster Compendium using absolute Pearson correlation. ClueGene could use these new clusters either alone or together with the original clusters. By measuring AUCs associated with each Cluster Compendium, the search engine could identify which compendium is more predictive for a specific pathway. ClueGene could be used to predict the function of unknown proteins. A single gene of unknown function could be supplied as the query and its function inferred from the functions of known genes that sort to the top of
the search result. In our study, we focused on assessing the performance of the search algorithm for identifying known genes of well-defined pathways. To facilitate these additional uses, we have made the source code available from our website. ClueGene was designed to extend to a diversity of organisms and datasets. The advantage of using clusters rather than the primary data is that ClueGene avoids the problem of having to normalize across different microarray platforms. The approach can accommodate new datasets in an online fashion: when a new dataset is available, clusters can be identified and added to the compendium from which updated recommendations can be made. Generalizing over species and data types will broaden the range of genetic processes that can be searched. The speed of ClueGene enables it to be applied to a large number of datasets for which the application of the GeneRecommender would be prohibitively slow or whose datasets are too large to load into computer memory. Extending the Cluster Compendium to additional organisms will allow searches to be performed in organisms where predicted gene sequences (but possibly not the entire genome sequence) are available. ClueGene could be generalized by extending the cluster database to additional organisms, as well as by developing a method that identifies patterns of conservation in gene co-clustering. Genes recommended from coexpression clusters in several organisms may correspond to either ancient members of the pathway or to more newly evolved participants. For example, a gene may co-cluster with a pathway in humans and mice, but not in non-mammals. The co-clustering pattern in this case suggests the gene was recruited into the pathway sometime after the mammals diverged from the other animals. Finding such a pattern indicates the gene's regulation program (cis- or trans-acting regulatory factors) may have undergone recent adaptations rather than a slower fine-tuning over a larger evolutionary period, which may provide clues about the gene's function. We envision computing a co-clustering score at each node in the phylogenetic tree that relates the set of organisms for which a Cluster Compendium is available. In this way, the ClueGene algorithm could make explicit use of the complementary information present in experimental data collected from a variety of organisms.

References

1. Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R. NCBI GEO: mining millions of expression profiles—database and tools. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D562-D566.
2. Owen AB, Stuart J, Mach K, Villeneuve AM, Kim S. A Gene Recommender Algorithm to Identify Coexpressed Genes in C. elegans. Genome Research 2003;13:1828-1837.
3. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N. Revealing modular organization in the yeast transcriptional network. Nature Genetics 2002 Aug;31(4):370-377.
4. Breese JS, Heckerman D, Kadie C. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence 1998 July:43-52.
5. Linden G, Smith B, York J. Amazon.com Recommendations: Item-to-item Collaborative Filtering. IEEE Internet Computing 2003 Jan/Feb;7(1):76-80.
6. Ball CA, Awad IA, Demeter J, Gollub J, Hebert JM, Hernandez-Boussard T, Jin H, Matese JC, Nitzberg M, Wymore F, Zachariah ZK, Brown PO, Sherlock G. The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 2005 Jan 1;33(1):D580-2.
7. Lieb JD, Liu X, Botstein D, Brown PO. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat Genet 2001 Aug;28(4):327-334.
8. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001;409(6819):533-8.
9. Hu H, Yan X, Huang Y, Han J, Zhou XJ. Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics 2005;21 Suppl 1:i213-i221.
10. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354-D357.
11. Stuart JM, Segal E, Koller D, Kim S. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003;302:249-55.
12. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 2000;25:25-29.
13. Guldener U, Munsterkotter M, Kastenmuller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martinez J, Perez-Ortin JE, Michael H, Kaps A, Talla E, Dujon B, Andre B, Souciet JL, De Montigny J, Bon E, Gaillardin C, Mewes HW. CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D364-8.
14. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH. Functional discovery via a compendium of expression profiles. Cell 2000 Jul 7;102(1):109-26.
SEMI-AUTOMATED XML MARKUP OF BIOSYSTEMATIC LEGACY LITERATURE WITH THE GOLDENGATE EDITOR*
GUIDO SAUTTER, KLEMENS BOHM
Department of Computer Science, Universitat Karlsruhe (TH), Am Fasanengarten 5, 76128 Karlsruhe, Germany
DONAT AGOSTI
Division of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024-5192, and Naturmuseum der Burgergemeinde Bern, 3005 Bern, Switzerland
Today, digitization of legacy literature is a big issue. This also applies to the domain of biosystematics, where this process has just started. Digitized biosystematics literature requires a very precise and fine-grained markup in order to be useful for detailed search, data linkage and mining. However, manual markup at the sentence level and below is cumbersome and time-consuming. In this paper, we present and evaluate the GoldenGATE editor, which is designed for the special needs of marking up OCR output with XML. It is built in order to support the user in this process as far as possible: its functionality ranges from easy, intuitive tagging through markup conversion to dynamic binding of configurable plug-ins provided by third parties. Our evaluation shows that marking up an OCR document using GoldenGATE is three to four times faster than with an off-the-shelf XML editor like XMLSpy. Using domain-specific NLP-based plug-ins, the speedup is even larger.
1. Introduction
Today, there are model organisms with huge bodies of literature increasingly digitally available. The descriptions of the remaining 1.5 million known species alone, however, are scattered in an estimated 10 to 100 million pages of printed record in thousands of journals and books. This biosystematics literature is in a unique situation within the entire body of literature. Large parts of it are in a highly standardized structure, e.g. treatments or keys. They contain information in a very concise form, which is not available anywhere else. Its main sections comprise the descriptions or treatments of the species (including character × species data matrices, images and distribution records), tools for identification (keys), and phylogenies. A description of a species is comparable to its DNA sequence only at a higher organizational level (11). Therefore this body of literature offers a unique chance for data extraction and mining for
* Work supported by grant BIB47 of the Deutsche Forschungsgemeinschaft and grant GRT963055 of the National Science Foundation.
biomedical and life sciences when it is transformed into a machine-readable form. Since all the data in a publication belongs to a particular species or higher-level taxon, the insertion of markup identifying the taxonomic names and descriptions transforms a text document into a record of a biodiversity database. The taxon name serves as the unique identifier. Some of these databases are currently being implemented, e.g., ispecies.org, INOTAXA, or Encyclopedia of Life (12). They target the integration of existing biodiversity data from different sources. This in turn allows mining and linking data from currently disconnected data sources such as genomics, behavior or distribution data. Figure 1a shows an example document as OCR produces it (idealized, no errors). Figure 1b shows the same document after the markup process. Obviously, only the markup enables a machine to read the information about which species was collected in which location.
Ectatomma tuberculatum Olivier.- Sangre Grande, (R. Thaxter)
Ectatomma ruidum Roger.- Port of Spain, (R. Thaxter)
Figure 1a. A legacy document as OCR output (idealized, no character misrecognitions).
[Figure 1b content: the same two records with <taxonName> markup around the scientific names.]
Figure 1b. The same legacy document after the markup process.
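As a purely illustrative sketch of what inserting such markup can look like, the following uses a naive regular expression to wrap genus-plus-epithet pairs in a tag; the pattern and the tag name are ours, and real tools such as TaxonGrab or FAT are far more careful.

```python
import re

# Toy taxon-name tagger: capitalized genus followed by a lowercase epithet.
TAXON = re.compile(r"\b([A-Z][a-z]+ [a-z]+)\b")

def tag_taxa(line):
    return TAXON.sub(r"<taxonName>\1</taxonName>", line)

print(tag_taxa("Ectatomma tuberculatum Olivier.- Sangre Grande, (R. Thaxter)"))
# -> <taxonName>Ectatomma tuberculatum</taxonName> Olivier.- Sangre Grande, (R. Thaxter)
```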
A first approach towards digital availability and integration is the community-wide initiative to scan and OCR the biosystematics literature deposited at the main US and UK natural history museums (Biodiversity Heritage Library: www.bhl.si.edu), along with growing taxon-based (15) as well as commercial initiatives (16). This results in PDF and raw OCR-ed documents. The first step towards machine readability is a cleanup of OCR errors and artifacts originating from the print layout. The second step is the insertion of structural markup into the documents, often including a cleanup of structural XML markup inserted by the OCR. Further steps towards full machine readability of this body add semantic markup on different levels of detail, e.g., treatments and scientific names. Currently, the tools for all these steps are vanilla XML editors. The only automation support for this process so far has been tools to find and extract scientific names (TaxonGrab (10), FAT (2), and FindIT (13)). These tools provide good results, but are hard to apply when using a common XML
editor. In this paper, we analyze the requirements on editors intended to support all the steps from OCR output to full machine readability: OCR cleanup, structural markup, NLP-based scientific name extraction, markup of the treatments. In particular, we focus on possible automations of the markup process. This is in order to reduce user effort as far as possible. A major difficulty is the integration of manual text editing and NLP. This is because the former works on characters, while the latter usually works on sequences of tokens, i.e., regards words as the atomic units of a text. Finally, we present the GoldenGATE editor, which we have built to implement this difficult integration. Our evaluation shows that the markup process is three to four times faster if the user can make use of such a tight integration. The rest of this paper is organized as follows: Section 2 presents existing editors and NLP tools, which can be useful for automated detail-level markup. Section 3 analyzes the markup process and specifies the requirements on an editing tool supporting this process. Section 4 presents the GoldenGATE editor, which is intended to comply with these requirements. Section 5 features experimental results, which demonstrate the feasibility of our new tool. In Section 6, we conclude with a discussion and an outlook on future work.

2. Related Work
In this section, we give an overview of existing tools and editors, which might be useful for the markup of legacy literature. We also point to some freely available NLP tools and libraries, which can be helpful for automating the markup process as far as possible. While manual markup is not desirable, fully automated markup relying solely on NLP is not feasible either: First, the markup accuracy required is higher than the 95-98% provided by up-to-date NLP tools. Second, applying a sequence of such tools, which perform different parts of the markup process and build on the results of each other, is likely to result in a summation of the errors. Think of a noun-phrase chunker which builds on the output of a part-of-speech tagger. If the latter produces erroneous tags, the former is likely to produce erroneous output as well. A sequence of five such tools, for instance, is likely to have an accuracy of around 0.98⁵ ≈ 90%, which is less than required. Thus, there is a need for manual correction after each automated markup step (i.e., the application of one automated tool). This in turn requires an editor tightly integrating NLP-based automated markup functionality and manual editing and tagging.
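The error-compounding argument can be made concrete with a two-line calculation; the 98% per-step accuracy is simply the figure quoted above.

```python
per_step_accuracy = 0.98
steps = 5
# Five chained 98%-accurate tools leave roughly 10% of items mishandled overall,
# which motivates manual correction after each automated step.
print(per_step_accuracy ** steps)   # ~0.904
```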
2.1. Editors
Before we analyze the markup process, we discuss existing editors, which form its current basis, and are widely used for handling textual data in general. General-purpose text editors like UltraEdit (5) are powerful editors for all kinds of text-based data, e.g., plain text, XML, or programming languages. Many of these editors natively provide syntax highlighting for XML and for common programming and script languages. Some also support recording macros for frequently used editing steps, and for including external components. On the other hand, most text editors are general-purpose so that they do not provide any special support for XML markup of text, e.g., some sort of support for inserting XML tags. This is cumbersome and unnecessary, since the functional parts of the XML tags could be inserted automatically. XML editors, like the widely used XMLSpy (3) and Oxygen (4), are built to support handling existing XML data: DTDs and XML schemas, the XPath and XQuery query languages, XSLT, etc. They also provide some automation in creating new tags. But with them, the natural sequence of actions seems to be building the structure first and then inserting the content. Inserting tags in existing content is not exactly supported well. Besides the query and transformation languages included, XML editors rarely provide mechanisms for automated changes to the content or markup. They are not designed to apply NLP because it is not a usual part of XML data handling.

2.2. NLP Tools

In the following, we describe some NLP tools which might be useful for automating the detail-level markup process of legacy literature. The OpenNLP (6) project hosts a variety of smaller, mostly open-source projects related to the development of NLP tools. The tools provided are heterogeneous with regard to purpose, programming platform, and quality. Among others, the functionality provided comprises text tokenization, part-of-speech tagging, noun-phrase and verb-phrase chunking, named entity recognition, and semantic parsing. The former four build on each other and form the basis for the latter two. The latter are interesting for the detail-level automated markup, e.g., recognition and tagging of collecting sites, i.e., the location where a particular specimen has been collected. GATE (7) is an NLP suite developed by the University of Sheffield. It offers functions comparable to OpenNLP, but also provides more complex processing, e.g., for co-reference resolution, and can produce XML output. The GATE suite also includes Apache Lucene (14) for information-retrieval purposes and a GUI for visualization. It is relatively easy to develop new
components and include them in the processing pipeline. On the other hand, the purpose of GATE is NLP research and automated evaluation rather than document markup and management. While an AnnotationDiff tool is provided for computing f-Scores from the results achieved with test corpora or evaluation tasks, GATE lacks any facility for editing the text and the markup of the documents manually. LingPipe (8) is a professional NLP suite developed by alias-i. Apart from tokenization, which works with rules, almost all the analysis functions are based on statistical models (Hidden Markov Models (9), in particular). This implies the need for training with pre-annotated data. Once a model is generated from the training data, it can be applied to annotate documents. The basic idea of LingPipe is that it can generate and apply models for a wide range of purposes. Ready-to-use models are available for part-of-speech tagging and named entity recognition. While LingPipe provides powerful NLP functionality, there is no user interface. Thus, it has to be included in another program to be accessible in ways different from the command line. 3.
Requirements Analysis
In this section, we analyze the conversion process of digitized legacy literature: from OCR output to valid XML documents with detailed markup both regarding structure and 'semantically' important parts. Given this, we summarize the requirements on an editor that supports the user during this process. We then discuss some additional aspects influencing the design of an editor for markup of legacy literature.

3.1. The Markup Process

Figure 2 visualizes the digitization and markup process from printed legacy documents to XML data, as perceived by individuals involved in the process: First, a user will scan and OCR-process the printed documents. Even the best and best-trained OCR software does not achieve 100% accuracy, and old or low-quality source documents result in even lower quality. This means that the output text will contain a significant number of misspellings and other character-recognition errors. With regard to later application of NLP tools, these errors are a serious problem. Named entity recognizers, for instance, often make heavy use of gazetteers, which will not be useful in the presence of misspellings. Misplaced punctuation marks are likely to disturb a tokenizer or sentence splitter. Further problems arise from the page layout of the printed original, which can include footnotes, captions, page numbers, page border decorations, and so
on. The OCR is likely to mix these parts up with the main text so that the resulting text is inconsistent.
Figure 2. The process of marking up a legacy document
Consequently, if we want to apply NLP tools for automating parts of the markup, the process starts with the correction of OCR errors and layout artifacts. This also induces structural cleanups like re-concatenating hyphenated words, correcting paragraph borders, or moving captions to the end of the paragraph enclosing them. An editor should support a user in these actions as far as possible. Re-concatenating hyphenated words, for instance, is a cumbersome task if it is done one by one manually, but it automates easily. Punctuation and capitalization in turn allow checking the structure of paragraphs automatically, leaving only the ambiguous cases for manual intervention. Once the structural integrity of a document is restored, the next steps include the markup of semantic units, e.g., individual treatments, and meaningful parts, e.g., taxonomic names and collecting locations. Since the latter are often locations, their markup can be automated: NLP provides powerful recognition algorithms like the one presented in (1), which can be applied for the markup of collecting sites. (2) presents an approach for high-accuracy taxonomic name extraction. Its output can serve as the basis for automated detail-level markup. Consequently, an editor intended for semantic markup of legacy documents should allow for integration of existing NLP tools. It should also provide lightweight interfaces for including additional tools so that the editor is easy to extend according to the particular automation needs of the user. Further, for similar documents, a user is likely to apply the same choice of automated tools in the same order, thus defining a sequence. For easier use of such a sequence, it is desirable to access it as one tool. Despite all possible automations, manual editing is indispensable because NLP rarely achieves 100% accuracy. This is especially important where one NLP component builds on the output of previous ones: Erroneous input is likely to induce faulty conclusions, and the errors typically add up. Consequently, an environment supporting automated NLP-based markup also has to provide facilities for manual editing of both the text and the markup.
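The re-concatenation of hyphenated words mentioned above as an easily automated cleanup might look like the following naive sketch; a real implementation would consult a word list so legitimate hyphens are preserved.

```python
import re

# Join words split by a hyphen at a line break, e.g. "expres-\nsion" -> "expression".
HYPHEN_BREAK = re.compile(r"([a-z])-\s*\n\s*([a-z])")

def dehyphenate(text):
    # Naive: joins every lowercase-to-lowercase break; checking a dictionary first
    # would keep genuine hyphenated compounds (e.g. "mass-to-charge") intact.
    return HYPHEN_BREAK.sub(r"\1\2", text)

print(dehyphenate("mining millions of expres-\nsion profiles"))
# -> mining millions of expression profiles
```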
3.2. Requirements

Summarizing the transformation process, an editor intended for the XML markup of digitized legacy literature has to comply with the following requirements in order to assist its users as well as possible:
• Automation support for structural cleanup of documents,
• Easy manual editing of both text and markup,
• NLP support for automated markup,
• A lightweight interface for developing and including new NLP tools, according to the special needs of a specific application,
• Integrated access to sequences of tools.
3.3. Additional Aspects In addition to the key features listed above, there are some other aspects worth consideration. Different OCR tools provide different types of additional information in their output. While some simply produce plain text, others insert generic XML tags, and yet others provide HTML formatted documents. Consequently, an editor should be able to make use of any formatting contained in a document, to unify it for subsequent steps, and still provide good automation if there is no formatting at all initially. During the editing process, it is desirable to use some unified, generic markup, which does not depend on an application-specific XML schema. This is because, first, it is not feasible to develop generic NLP tools dependent on a specific XML name space. Second, it may be desirable to transform the completed documents into a variety of application-specific XML schemas. Since different schemas provide different types and levels of detail markup, a direct inter-schema transformation may not be possible in the general case. If the input schema provides no markup for locations, for instance, a schema-transformation tool cannot introduce such markup. Thus, an editor needs to support different XML schemas, rather than only a specific one. 4.
The GoldenGATE Editor
In this section, we present and describe the GoldenGATE editor, which we have built to comply with the requirements identified in the previous section. It combines automation-assisted markup and text editing with external NLP tools. 4.1. The Document Editor In the GoldenGATE main window, each document has its own editor tab. We refer to such a tab as a document editor. The markup of legacy literature
includes editing both the XML markup and the document text. Experience with standard XML editors like XMLSpy shows that editing documents is cumbersome if there are too many XML tags. This is because the tags are in the way of a concise view on the textual content. On the other hand, editing XML markup is unnecessarily cumbersome in an editor that supports plain text editing as well. This is because the XML tags have to be inserted in the text character by character. An editor in turn could automatically produce the functional parts of the tags, i.e., the XML-specific characters around the element names and the attribute values. Consequently, the editing functions for XML markup on the one hand and for textual content on the other hand are distributed to two different editor views in GoldenGATE. Thus, a document editor has three sub-tabs: an annotation editor, a plain-text editor, and, in addition, an XML view, which provides no editing functionality, but a view on the document as nicely indented and laid-out XML. The annotation editor (Figure 3) provides automation assistance for manual XML tagging. The buttons on the left invoke recently used functions, and user-defined ones, i.e., Macros. The checkboxes on the right allow showing and hiding individual tags by name. The annotation editor uses a token-based data model, which treats a word as an atomic unit, for two reasons: First, a user will insert XML tags between words in the vast majority of cases. Second, almost all NLP tools work on tokens rather than on characters. Consequently, an interaction that is based on selected text, e.g., enclosing a passage of text in a new tag, will automatically affect complete words, even if only a part of them is selected.
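The snapping behaviour just described, where a partial word selection expands to whole tokens, can be sketched as follows; this is illustrative only, since GoldenGATE itself is a Java application with a richer data model.

```python
def snap_to_tokens(text, start, end):
    """Expand a character selection [start, end) to whole whitespace-delimited tokens."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1                     # grow left to the beginning of the word
    while end < len(text) and not text[end].isspace():
        end += 1                       # grow right to the end of the word
    return text[start:end]

print(snap_to_tokens("Ectatomma tuberculatum Olivier", 3, 14))
# selecting "atomma tube" yields the whole tokens "Ectatomma tuberculatum"
```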
[Figure 3 screenshot: the annotation editor displaying a marked-up specimen description, with custom function buttons (e.g., Paragraph), token-editing commands (Edit, Copy, Cut, Paste and Delete Tokens; Swap Clipboard), and checkboxes for showing or hiding tags such as head and sentence.]
Figure 3. The annotation editor.
Enclosing a sequence of words in an XML tag is easy and intuitive in the annotation editor: First, select the words to enclose in the new tag. Second, right
click and select Annotation in the context menu. This will open a prompt for entering the tag name. In addition, the context menu offers the most recently used tag names for instant selection. The annotation editor also provides functions for joining and splitting tags. This is helpful for, e.g., correcting structural markup generated by OCR software. In addition, it contains functionality for automatically tagging paragraphs, and for automated cleanup of their inner structure, including the re-concatenation of hyphenated words. The text editor provides editing functionality known from standard text processing tools. Because editing textual content is cumbersome if it is spread out between several XML tags, the text editor hides all tags to provide a concise view on the document content.

4.2. Integration of External NLP Tools

As discussed in Section 2, NLP provides powerful tools for extracting meaningful phrases and word sequences from text, which are well suited for detail-level markup of legacy documents. The GoldenGATE editor provides a lightweight interface for integrating external NLP tools. To integrate an individual NLP component into the GoldenGATE editor, this component needs to implement the so-called Analyzer interface, or it needs an encapsulating wrapper, which implements this interface. The responsibility of the wrapper is to translate the token-based data model of the annotation editor to the data model used by the NLP component. Since most NLP tools work on token arrays representing the tokens as Strings, this tends to be straightforward. The wrapper also has to translate the output of the NLP component back into the data model of the annotation editor. Since most NLP components use arrays of Strings to mark the extracted parts, this tends to be straightforward as well. A complete suite of NLP tools can bind to GoldenGATE via a wrapper factory, which implements another interface. The task of such a factory is to wrap the NLP tools, so that the individual tools need not be wrapped manually. The most feasible way of using this binding method is to provide a factory wrapping a common super class of a set of NLP tools. Once a tool is wrapped to implement the interface, it can simply be packed into a jar file along with the wrapper. The jar file is subsequently stored in a defined location where GoldenGATE will automatically detect and include the new tool once it is restarted. In addition, a user can trigger the search for new tools manually.

4.3. Sequencing of Tools - Pipelines

As mentioned in Section 3, a user is likely to apply the same NLP tools to similar documents in the same order. Thus it is desirable to access such a
sequence as one tool. GATE (7) uses a mechanism called Pipelines for this purpose. It bundles a sequence of tools and applies them in a specific order. Borrowing this idea, we have integrated Pipelines in the GoldenGATE editor. A GoldenGATE Pipeline is a sequence of external or built-in tools. When it executes, it invokes these tools one after another. In contrast to GATE, users may configure a GoldenGATE Pipeline to display the documents for manual correction after they have been processed by each tool and before the next one is applied. This prevents the propagation of errors from one tool to subsequent ones.

4.4. Additional Functionality

In this section, we describe some additional features of the GoldenGATE editor. Macros allow adding custom functions to the annotation editor. A macro combines first marking up the selection with a predefined XML tag, and then applying some automated processing to the content of the newly created tag. Two possible applications of this mechanism are (a) marking up a paragraph and applying structural normalization afterwards, or (b) marking up a treatment and then applying a tool that creates the internal markup of the treatment automatically, e.g., the 'nomenclature' and 'materials examined' domains. Lists provide a straightforward way of applying gazetteers. GoldenGATE natively provides some lists (stop words, for instance), and new ones are easily added: GoldenGATE loads them from files or URLs, or extracts them from the documents, making use of existing markup. Besides Macros and Lists, GoldenGATE provides a variety of further features, which we cannot describe here due to space limitations. These features include tagging functions based on regular expressions, or processing an entire folder of documents automatically.
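To convey the wrapper idea described in Section 4.2, the following is an illustrative sketch only: GoldenGATE is a Java application and its actual Analyzer interface differs, so the class names, method signature, and the gazetteer-based tagger here are all invented for the example.

```python
class Analyzer:
    """Illustrative stand-in for a token-based analysis interface."""
    def process(self, tokens):
        """Return a list of (start, end, tag) annotations over the token array."""
        raise NotImplementedError

class GazetteerTaggerWrapper(Analyzer):
    """Adapts a plain gazetteer lookup (a list of known names) to the interface."""
    def __init__(self, names, tag):
        self.names = {tuple(n.split()) for n in names}
        self.tag = tag

    def process(self, tokens):
        annotations = []
        for i in range(len(tokens)):
            # Try spans of up to four tokens starting at position i.
            for j in range(i + 1, min(i + 4, len(tokens)) + 1):
                if tuple(tokens[i:j]) in self.names:
                    annotations.append((i, j, self.tag))
        return annotations

wrapper = GazetteerTaggerWrapper(["Port of Spain"], "collectingSite")
print(wrapper.process("collected at Port of Spain in 1929".split()))
# -> [(2, 5, 'collectingSite')]
```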
5. Preliminary Evaluation
This section features a preliminary evaluation of the GoldenGATE editor, to quantify its benefit in marking up legacy literature. 5.1. Experimental Setup Our evaluation criterion is the time it takes a user to mark up a document, starting with the unmodified output of an OCR tool. Our evaluation is as follows: A domain expert carries out the task, once using GoldenGATE, and another time using the 2005 version of the XML editor XMLSpy, to obtain a reference point. We stress that our test person is not a computer scientist. He has practical experience with vanilla XML editors, notably XMLSpy. As test documents, we use three issues of the American Museum Novitates (No 349
(1929), 1396 (1949), and 2257 (1966)), a life science periodical issued by the American Museum of Natural History. The original hardcopies have been scanned and OCR-processed with Abbyy FineReader. The output contains HTML markup regarding fonts and structure, but a significant part of the latter is erroneous. The documents comprise a total of 24 pages (8 pages each) and 12 treatments (4 treatments each). The markup created during the test includes structural and semantic markup. The latter comprises treatments, taxonomic names, and collection events.

5.2. Results

Table 1 displays the time needed for the markup process with the two tools under evaluation. It turns out that GoldenGATE benefits significantly from the token-based, semi-automated data handling of the annotation editor. The NLP tools applied to the fine-grained markup of important parts (e.g., collecting sites) result in an even larger difference.

Table 1. Markup time from OCR output to fully marked-up document (in minutes; time spent on the individual test documents in brackets).

Markup step         XMLSpy             GoldenGATE
OCR errors          9 (3, 4, 2)        4 (2, 1, 1)
Structure cleanup   4 (2, 1, 1)        19 (7, 5, 7)
Treatments          84 (29, 31, 24)    7 (2, 3, 2)
Tax. names          28 (10, 10, 8)     5 (1, 2, 2)
Coll. events        24 (8, 9, 7)       9 (4, 2, 3)
Total               149 (52, 55, 42)   44 (16, 13, 15)
The only step that takes longer in GoldenGATE is the structural cleanup. This is due to the different approaches of the two editors: XMLSpy requires first removing all the HTML and then inserting new structural markup, while GoldenGATE transforms and corrects the existing tags. Thus the structural cleanup becomes an integral part of the structural markup. Summing up the structural cleanup and markup steps, GoldenGATE (19 + 7 = 26 minutes) is more than three times faster than XMLSpy (4 + 84 = 88 minutes).
6. Discussion
In this paper, we have introduced the GoldenGATE editor, the first editor specifically designed and built for the digitization of legacy biosystematics literature. It supports all the steps from OCR output to full machine readability: OCR cleanup, semi-automated markup (both structural and semantic), including the detection of treatment boundaries and the markup of the internal structure of treatments. It allows the application of automated external markup tools, like TaxonGrab (10), FAT (2), or FindIT (13) for the markup of scientific names.
Our evaluation has shown that the GoldenGATE editor simplifies and accelerates the markup process significantly. This advantage results from both the semi-automated, token-wise XML editing and the integration of existing NLP tools for automated detail-level markup. We plan to include more functionality in the future, like better support for OCR error correction. We also intend to develop additional NLP tools for more markup automation on the detail level. This includes, for instance, automated detection and markup of morphological characters and states. The current version of GoldenGATE is available at http://idaho.ipd.uni-karlsruhe.de/GoldenGATE/.

Acknowledgements

We thank our US collaborators Terry Catapano, Bryan Heidorn, Bob Morris, Neil Sarkar, and Christie Stephenson for their interest in our work and their valuable suggestions.

References
1. A. Mikheev, M. Moens, C. Grover, Named Entity Recognition without Gazetteers, in Proceedings of EACL, Bergen, 1999
2. G. Sautter, K. Bohm, The Difficulties of Taxonomic Name Extraction and a Solution, in Proceedings of BioNLP, New York, 2006
3. Altova GmbH, www.altova.com
4. Oxygen, www.oxygenxml.com
5. IDM Computer Solutions Inc., www.ultraedit.com
6. The OpenNLP project, www.opennlp.org
7. GATE, General Architecture for Text Engineering, gate.ac.uk
8. LingPipe, www.alias-i.com/lingpipe
9. L. Rabiner, B. Juang, An Introduction to Hidden Markov Models, in IEEE ASSP Magazine, Jan 1986, Volume 3, Issue 1, pp 4-16
10. D. Koning, N. Sarkar, T. Moritz, TaxonGrab: Extracting Taxonomic Names from Text, Biodiversity Informatics, 2005
11. D. Agosti, Encyclopedia of life: should species description equal gene sequence?, Trends Ecol. Evol. 18: 273, 2003
12. E. Wilson, The encyclopedia of life, Trends Ecol. Evol. 18: 77-80, 2003
13. FindIT, names.mbl.edu/tools/recognize
14. Apache Lucene, lucene.apache.org/java/docs
15. D. Butler, Mashups mix data into global service. Nature 439: 6-7, 2006
16. E. Gillingham, Blackwell launches 3000 years of digitized journal backfiles. Blackwell Publishing Journal News 15, July 2006.
COMPUTATIONAL PROTEOMICS: HIGH-THROUGHPUT ANALYSIS FOR SYSTEMS BIOLOGY
WILLIAM CANNON
Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA
BOBBIE-JO WEBB-ROBERTSON
Computational Biology & Bioinformatics, Pacific Northwest National Laboratory, Richland, WA 99352, USA
High-throughput proteomics is a rapidly developing field that offers the global profiling of proteins from a biological system. These high-throughput technological advances are fueling a revolution in biology, enabling analyses at the scale of entire systems (e.g., whole cells, tumors, or environmental communities). However, simply identifying the proteins in a cell is insufficient for understanding the underlying complexity and operating mechanisms of the overall system. Systems level investigations generating large-scale global data are relying more and more on computational analyses, especially in the field of proteomics.
1. Introduction
Proteomics is a rapidly advancing field offering a new perspective on biological systems. As proteins are the action molecules of life, discovering their function, expression levels and interactions is essential to understanding biology at a systems level. The experimental approaches to performing these tasks in a high-throughput (HTP) manner vary from evaluating small fragments of peptides using tandem mass spectrometry (MS/MS), to two-hybrid and affinity-based pull-down assays using intact proteins to identify interactions. No matter the approach, proteomics is revolutionizing the way we study biological systems, and will ultimately lead to advancements in identification and treatment of disease as well as provide a more fundamental understanding of biological systems. The challenges, however, are amazingly diverse, ranging from understanding statistical models of error in the experimental processes through categorization of tissue types. The papers presented in this session are a representative snapshot of this broad field of research that spans scale and scientific disciplines.
[Figure 1 diagram: Proteins → enzymatic digest → Peptides → HPLC separation → MS profile → acquired MS/MS spectra, compared by sequence database searching against theoretical MS/MS spectra to yield peptide and protein identifications.]
Figure 1. A typical MS proteomics process from protein isolation through peptide identification. Proteins are first isolated from other cellular components (top left) and then cleaved into peptides by enzymatic digestion (top middle). The peptides are partially separated using chromatography (top right) and then further separated by mass-to-charge ratios in the first stage of mass spectrometry (center). In tandem mass spectrometry, the isolated peptide is then collisionally activated, causing it to fragment into pieces. The mass-to-charge ratio of each fragment is measured (bottom right), and this fragmentation pattern is compared to model spectra (bottom left) for the peptide that are derived from training data or expert opinion. The peptide with a model spectrum that best matches the experimental spectrum is a potential match.
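A minimal sketch of the peak-matching step the caption describes: count how many peaks of a theoretical fragmentation spectrum are matched by an acquired spectrum within a mass tolerance. The m/z values and tolerance below are made up, and scoring in real database-search tools is far more elaborate.

```python
def shared_peak_count(experimental, theoretical, tol=0.5):
    """Number of theoretical fragment m/z values matched by an experimental peak
    within +/- tol (a crude stand-in for real database-search scoring)."""
    matched = 0
    for mz in theoretical:
        # A binary search would be faster; a linear scan keeps the sketch short.
        if any(abs(mz - peak) <= tol for peak in experimental):
            matched += 1
    return matched

acquired = [175.1, 276.2, 393.2, 504.3]
candidate = [175.119, 276.155, 393.211, 508.9]
print(shared_peak_count(acquired, candidate))   # 3 of 4 theoretical peaks matched
```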
1.1. HTP Mass Spectrometry The application of high-throughput (HTP) liquid-based separations and mass spectrometry (MS) to global profiling of proteins is providing an essential component to the challenge of understanding biology at a systems level (1). Figure 1 depicts a typical MS-based proteomics analysis that is performed in many laboratories. Enzymatic digestion of proteins extracted from cells results in the lysis at defined locations in the proteins producing peptides of predictable length (when derived from a known protein sequence). Reversed phase high performance liquid chromatography is used to partially separate the peptides in the solution. The eluting peak consists of a population of peptides which are analyzed by a mass spectrometer interfaced with the chromatography system. The electrospray process aerosolizes and ionizes the peptides into the gas phase and the charged particles are propelled into the mass spectrometer for analysis. The mass spectrometer scans the population of eluting ions, measures the mass to
charge ratio, and in the case of tandem mass spectrometry proceeds to the fragmentation step. This step consists of the capture of all ions in a narrow mass-to-charge range in the ion trap of the mass spectrometer; the peptides are then vibrationally excited by collision with an inert gas. The peptides then fragment at labile bonds and a mass spectrum of the peptide fragments is obtained. Because the peptides tend to fragment into recognizable patterns, the identity of the peptide can frequently be determined from this spectrum.

1.2. HTP Yeast-Two-Hybrid

In contrast to the destructive technique of HTP MS-based approaches, two-hybrid assays (2) are used for assessing protein-protein interactions in live cells. A typical implementation of the two-hybrid assay involves the attachment of bait and prey proteins to separated binding and activating domains of a transcription factor, typically GAL4, that controls the production of a reporter protein. In principle, if the bait and prey interact then the modularized domains of the transcription factor are brought together and the newly combined transcription factor can both bind DNA and activate the gene coding for the reporter protein. Conversely, if the reporter protein is present in the assay, then it is presumed that the bait-prey pair interact. False negatives can occur for several reasons, such as when the covalent attachment of the transcription factor domain to the bait protein interferes with interaction with the prey protein. Likewise, false positives can occur if adaptive mutations or auto-activation result in expression of the reporter protein regardless of interaction between the bait and prey proteins. While issues regarding the interference of the binding of the bait to the prey due to blocking by the transcription factor modules may represent a random, independent source of errors, auto-activation is a systematic error affecting the entire screening process. Two-hybrid methods are most frequently used in genetically-tractable organisms such as yeast, C. elegans, and Drosophila. Recent development of bacterial two-hybrid systems may eventually result in the expansion of this method to many other genomes.
2. Challenges
2.1. Accuracy of Peptide Identification

Intrinsic to the MS-based proteomics measurement process is the comparison of MS/MS fragmentation patterns to model fragmentation patterns derived from the predicted peptides of a sequenced genome, which provides the basic peptide identification upon which all other evaluations are based (3-5). The common approach is to search the experimental spectra against a database of computationally generated model spectra representing the constituent peptides of the entire genome. The computational peptide
identification process measures how well the mass peaks in an experimental spectrum match those in the model spectrum of a candidate peptide (3, 5-7). However, these database search routines are known to return both correct identifications against the experimental spectra and a similar number of false positives. Considerable work on the data analysis front is still required (8). The false positive problem of MS-based proteomics is largely due to the introduction of many sources of errors through the entire experimental and identification stages. Since the experimental observation of peaks introduces a mass error, a mass error distribution is often used in this matching process (9-11), i.e., the peaks of the experimental and theoretical spectra are not expected to match up exactly. However, in most computational identification methods to date these error models have followed simple statistical distributions. In fact, by far the most widely used distribution is the uniform distribution. Fu et al.ᵃ present improved estimations of mass error distributions that can be incorporated into the identification process. Another major technical challenge lies downstream from the database search routines. These database search procedures typically return several metrics associated with the match, creating a challenge in separating true from false identifications. To counter this problem there has been a fair level of effort placed on the development of probability-based scores (12-15). These statistical metrics have alleviated some problems with false positives, but make uniform assumptions about the identifiability of each peptide. That is, it is generally assumed that all peptides are equally detectable. Alves et al.ᵇ revisit the protein inference problem accounting for this assumption on peptide detectability.

2.2. Network Inference

Rapid advances are currently being made in the determination of protein networks (16-19). Typical approaches use either a two-hybrid screening of an entire genome, or affinity purifications to pull down a pre-selected bait protein and the prey proteins that interact, directly or indirectly. This information can be used to construct a protein interaction network in which all discovered interactions are laid out. A common challenge in both affinity-based methods and two-hybrid screens is that of estimating and reducing error rates. The paper

ᵃ Y. Fu, W. Gao, S. He, R. Sun, H. Zhou and R. Zeng. Mining tandem mass spectral data for more accurate mass error model for peptide identification. PSB 2007.
ᵇ P. Alves, R.J. Arnold, M.V. Novotny, P. Radivojac, J.P. Reilly and H. Tang. The protein inference problem in shotgun proteomics - revisited. PSB 2007.
by Sontag et al.c describes a novel approach to estimating errors in two-hybrid experiments which could also be adopted for affinity purifications. Appropriately modeling this error allows better use of the data, leading to better identification of interacting proteins. Due to the error in the experimental system, it is common to utilize multiple sources of diverse information in defining protein interactions, for example, cell location or sequence motif composition. However, a common challenge in integrating all this information with the experimental data is that each source has varying levels of reliability. Computing reliability metrics for multiple sources of information to infer networks is the topic of Leach et al.d
3. Final Thoughts
Ultimately, a major motivation for investments in the development of proteomics and systems biology is to develop advanced methods for disease diagnosis, for understanding disease processes, and for remedies. Each experimental approach offers a unique view; for example, unlike MS/MS or two-hybrid approaches, MALDI-based Imaging MS (IMS) offers an approach to study the spatial distribution of biomolecules, such as proteins, in tissue. However, similar to other imaging-based methods, classes in the data associated with the tissue must be identified in order to differentiate diseased tissue. A principal component analysis based approach is taken for IMS data in Van de Plas et al.e The field of HTP proteomics is becoming a central component enabling systems-level analyses. The level of complexity from inception of the experimental technique through the data analysis and modeling is incredible. As seen in this session, research is taking place at levels ranging from errors within a single mass spectrum to the study of entire tissues. Proteomics is likely to play a central role in understanding protein function and complex biological systems, leading to a new revolution in advanced targeted therapeutics to treat disease.
Acknowledgments
The session organizers would like to express deep gratitude to the anonymous referees who together volunteered countless hours to provide key feedback to make this session successful. The session chairs were supported through funding provided by the U.S. Department of Energy Office of Advanced Scientific Computing Research under contract 47901, and by the Office of
c D. Sontag, R. Singh and B. Berger. Probabilistic modeling of systematic errors in two-hybrid experiments. PSB 2007.
d S.M. Leach, A. Gabow, L. Hunter and D. Goldberg. Assessing and combining reliability of protein interaction sources. PSB 2007.
e R. Van de Plas, F. Ojeda, M. Dewil, L. Van Den Bosch, B. De Moor and E. Waelkens. Prospective exploration of biochemical tissue composition via imaging mass spectrometry guided by principal component analysis. PSB 2007.
Biological and Environmental Research under contract 43930, as well as PNNL laboratory directed research and development funds.
References
1. Nesvizhskii, A. I. & Aebersold, R. (2005) Mol Cell Proteomics 4, 1419-40.
2. Fields, S. & Song, O. (1989) Nature 340, 245-6.
3. Cannon, W. R., Jarman, K. H., Webb-Robertson, B. J., Baxter, D. J., Oehmen, C. S., Jarman, K. D., Heredia-Langner, A., Auberry, K. J. & Anderson, G. A. (2005) J Proteome Res 4, 1687-1698.
4. Pappin, D., Rahman, D., Hansen, H., Bartlet-Jones, M., Jeffery, W. & Bleasby, A. (1996) Mass Spectrom. Biol. Sci., 135-150.
5. Yates, J. R., Eng, J. K., McCormack, A. L. & Schieltz, D. (1995) Anal Chem 67, 1426-1436.
6. Eng, J. K., McCormack, A. L. & Yates, J. R. (1994) Journal of the American Society for Mass Spectrometry 5, 976-989.
7. Mann, M. & Wilm, M. (1994) Analytical Chemistry 66, 4390-9.
8. Patterson, S. D. (2003) Nat Biotechnol 21, 221-222.
9. Dancik, V., Addona, T. A., Clauser, K. R., Vath, J. E. & Pevzner, P. A. (1999) J Comput Biol 6, 327-42.
10. Fenyo, D., Qin, J. & Chait, B. T. (1998) Electrophoresis 19, 998-1005.
11. Frank, A. & Pevzner, P. (2005) Anal Chem 77, 964-73.
12. Anderson, D. C., Li, W., Payan, D. G. & Noble, W. S. (2003) J Proteome Res 2, 137-46.
13. Keller, A., Nesvizhskii, A. I., Kolker, E. & Aebersold, R. (2002) Anal Chem 74, 5383-5392.
14. Moore, R. E., Young, M. K. & Lee, T. D. (2002) J Am Soc Mass Spectrom 13, 378-386.
15. Strittmatter, E. F., Kangas, L. J., Petritis, K., Mottaz, H. M., Anderson, G. A., Shen, Y., Jacobs, J. M., Camp, D. G., 2nd & Smith, R. D. (2004) J Proteome Res 3, 760-769.
16. Bader, J. S. (2003) Bioinformatics 19, 1869-74.
17. Butland, G., et al. (2005) Nature 433, 531-7.
18. Giot, L., et al. (2003) Science 302, 1727-36.
19. Ho, Y., et al. (2002) Nature 415, 180-3.
ADVANCEMENT IN PROTEIN INFERENCE FROM SHOTGUN PROTEOMICS USING PEPTIDE DETECTABILITY PEDRO ALVES, 1 RANDY J. ARNOLD, 2 MILOS V. NOVOTNY, 2 PREDRAG RADIVOJAC, 1 JAMES P. REILLY, 2 HAIXU TANG 1 ' 3 * 1) School of Informatics, 2) Department of Chemistry, 3) Center for Genomics and Bioinformatics, Department of Biology Indiana University, Bloomington, U.S.A. A major challenge in shotgun proteomics has been the assignment of identified peptides to the proteins from which they originate, referred to as the protein inference problem. Redundant and homologous protein sequences present a challenge in being correctly identified, as a set of peptides may in many cases represent multiple proteins. One simple solution to this problem is the assignment of the smallest number of proteins that explains the identified peptides. However, it is not certain that a natural system should be accurately represented using this minimalist approach. In this paper, we propose a reformulation of the protein inference problem by utilizing the recently introduced concept of peptide detectability. We also propose a heuristic algorithm to solve this problem and evaluate its performance on synthetic and real proteomics data. In comparison to a greedy implementation of the minimum protein set algorithm, our solution that incorporates peptide detectability performs favorably.
1. Introduction
Shotgun proteomics refers to the use of bottom-up proteomics techniques in which the protein content in a biological sample mixture is digested prior to separation and mass spectrometry analysis.1-3 Typically, liquid chromatography (LC) is coupled with tandem mass spectrometry (MS/MS), resulting in high-throughput peptide analysis. The MS/MS spectra are searched against a protein database to identify peptides in the sample. Currently, Sequest4 and Mascot5 are the most frequently used computer programs for conducting peptide identification, both comparing experimental MS/MS spectra with in silico spectra generated from the peptide sequences in a database. Compared to top-down proteomics techniques, shotgun proteomics avoids the modest separation efficiency and poor mass spectral sensitivity associated with intact protein analysis, but it also encounters a new problem in data analysis, that of determining the set of proteins present in the sample based on the peptide identification results. At a
a To whom all correspondence should be addressed; Email: [email protected].
first glance, this problem seems trivial. It may be concluded that a protein is present in the sample if and only if at least one of its peptides is identified. This conclusion is true, however, only when each identified peptide is unique, i.e. when it belongs to only one protein. If some peptides are degenerate,6 i.e. shared by two or more proteins in the database, determining which of these proteins exist in the sample has multiple possible solutions. Indeed, tryptic peptides are frequently degenerate, especially for the proteome samples of vertebrates, which, due to recent gene duplications, often have a large number of paralogs. In addition, alternative splicing in higher eukaryotes results in many identical protein subsequences. The following example illustrates the extent of peptide degeneracy in a real proteomics experiment. Of the 693 identified peptides from a real rat sample used in this study (see sections 3-4 for details), 296 were unique and 397 were degenerate when searched against the full proteome of R. norvegicus. These peptides can be assigned to a total of 805 proteins, of which only 149 proteins could be assigned based on the 296 unique peptides. Nesvizhskii and colleagues first formalized this challenge in shotgun proteomics data analysis. They formulated the protein inference problem and proposed as a solution the minimum number of proteins containing the set of identified peptides.6,7 Other methods assign the unique peptides first, and then use statistical methods6 to assign the degenerate peptides based on the likelihood of each putative protein already identified. As a result, if two proteins share some common tryptic peptides, the presence of each protein can be decided using this method only if there exists at least one identified unique peptide in one of the proteins. The degenerate peptides will most likely be assigned to the longer protein, because the shorter proteins may not contain any unique peptide (e.g. see Fig. 2 in reference 7). In this paper, we revisit the protein inference problem based on the recently proposed concept of peptide detectability.8 The detectability of a peptide is defined as the probability of observing it in a standard proteomics experiment. We proposed that detectability is an intrinsic property of a peptide, completely determined by its sequence and its parent protein. We also showed that the peptide detectability can be estimated from its parent protein's primary structure using a machine learning approach.8 The introduction of peptide detectability provides a new approach to protein inference, in which not only identified peptides but also those that are missed (not identified) are important for the overall outcome. Figure 1 illustrates the advantage of the new idea. Assume A and B are two proteins sharing 3 degenerate tryptic peptides (a, b, and c, shaded). Each protein in Fig. 1 also has unique tryptic peptides (d, e and f, g, h, i, respectively, white). According to the original formulation of the protein inference problem, the identities of A and B cannot be determined since the only identified peptides
are degenerate.7 However, if all the tryptic peptides are ranked in each protein according to their detectabilities (Fig. 1), we may infer that protein A is more likely to be present in the sample than protein B. This is because if B were present we would probably have observed peptides f-i along with peptides a-c, which all have lower detectabilities than either f, g, h, or i. On the other hand, if protein A is present, we may still miss peptides d and e, which have lower detectabilities than peptides a-c, especially if A is at relatively low abundance.8 In summary, peptide detectability and its correlation with protein abundance provide a means of inferring the likelihood of identifying a peptide relative to all other peptides in the same parent protein. This idea can then be used to distinguish between proteins that share tryptic peptides based on a probabilistic framework. Based on this simple principle, we propose a reformulation of the protein inference problem so as to exploit the information about computed peptide detectabilities. We also propose a tractable heuristic algorithm to solve this problem. The results of our study show that this algorithm produces reliable and less ambiguous protein identities. These encouraging results demonstrate that peptide detectability can be useful not only for label-free protein quantification, but also for protein identification that is based on identified peptides.8,9
Figure 1. Detectability plot of a hypothetical protein A, broken up into tryptic peptides a-e, and protein B, containing peptides a-c and f-i. Assume that peptides a-c are identified by the peptide identification software (shaded). Peptides in each protein are sorted according to their detectability. The example shows the intuition for tie breaking in the proposed protein inference problem. Peptides a-c are more likely to be observed in protein A than d-e, while they are less likely to be observed than peptides f-i in protein B. Thus, protein A is more likely to be present in the sample than B. Note that the detectability for the same peptide within different proteins is not necessarily identical, due to the influence of neighboring regions in its parent proteins.
2. Problem Formulation
Consider a set of proteins $\mathcal{P} = \{P_1, P_2, \ldots, P_N\}$ such that each protein $P_j$ consists of a set of tryptic peptides $\{p_j^i\}$, $i = 1, 2, \ldots, n_j$, where $n_j$ is the number of peptides in $\{p_j^i\}$. Suppose that $\mathcal{F} = \{f_1, f_2, \ldots, f_M\}$ is the set of peptides identified by some database search tool and that $\mathcal{F} \subseteq \bigcup \{p_j^i\}$. Finally, assume each peptide $p_j^i$ has a computed detectability $D(p_j^i)$, for $j = 1, 2, \ldots, N$ and $i = 1, 2, \ldots, n_j$. We use $\mathcal{D}$ to denote the set of all detectabilities $D(p_j^i)$, for each $i$ and $j$. The goal of a protein inference algorithm is to assign every peptide from $\mathcal{F}$ to a subset of proteins from $\mathcal{P}$ which are actually present in the sample. We call this assignment the correct peptide assignment. However, because in a real proteomics experiment the identity of the proteins in the sample is unknown, it is difficult to formulate a fitness function that equates optimal and correct solutions. Thus, the protein inference problem can be redefined as finding an algorithm and a fitness function which result in the peptide-to-protein assignments that are most probable, given that the detectability for each peptide is accurately computed. In a practical setting, the algorithm's optimality can be further traded for its robustness and tractability. If all peptides in $\mathcal{F}$ are required to be assigned to at least one protein, the choice of the likelihood function does not affect the assignment of unique (non-degenerate) peptides in $\bigcup \{p_j^i\}$. On the other hand, the tie resolution for degenerate peptides will depend on all the other peptides that can be assigned to their parent proteins, and on their detectabilities. In order to formalize our approach we proceed with the following definitions.
Definition 1. Suppose that the peptide-to-protein assignment is known. A peptide $p_j^i \in \{p_j^i\}$ is considered assigned to $P_j$ if and only if $p_j^i \in \mathcal{F}$ and $D(p_j^i) \geq M_j$. Then, $M_j \in \mathcal{D}$ is called the Minimum Detectability of Assigned Peptides (MDAP) of protein $P_j$.
Definition 2. A set of MDAPs $\{M_j\}_{j = 1, 2, \ldots, N}$ is acceptable if for each $f \in \mathcal{F}$ there exists $P_j$ such that $D(f) \geq M_j$. Thus, any acceptable MDAP set will result in an assignment of identified peptides that guarantees that every identified peptide is assigned to at least one protein.
Definition 3. A peptide $p_j^i$ is missed if $p_j^i \notin \mathcal{F}$ and $D(p_j^i) \geq M_j$. Note that, due to the connection between peptide detectability and protein amount in the sample, peptides whose detectabilities are below $M_j$ are not considered missed.
We can now formulate the protein inference problem as follows.
Minimum missed peptide problem. Given $N$ proteins, each consisting of $n_j$ tryptic peptides, and a set of identified peptides $\mathcal{F}$, find an acceptable set of MDAPs, $\{M_j\}_{j = 1, 2, \ldots, N}$, which results in a minimum number of missed peptides.
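To make the definitions above concrete, the following minimal Python sketch (not taken from the paper; the container shapes and function names are illustrative assumptions) counts missed peptides for a candidate MDAP assignment and checks whether the assignment is acceptable in the sense of Definition 2. It assumes detectabilities are supplied per (protein, peptide) pair, for example from the predictor described in Section 3.2.

def missed_peptides(proteins, detect, identified, mdap):
    """Count peptides that are missed under a candidate MDAP assignment.

    proteins:   dict protein -> list of peptide sequences
    detect:     dict (protein, peptide) -> detectability D(p) in [0, 1]
    identified: set of identified peptide sequences (the set F)
    mdap:       dict protein -> MDAP M_j (use float('inf') for absent proteins)
    """
    missed = 0
    for prot, peptides in proteins.items():
        for pep in peptides:
            if pep not in identified and detect[(prot, pep)] >= mdap[prot]:
                missed += 1
    return missed

def is_acceptable(proteins, detect, identified, mdap):
    """Every identified peptide must be assignable to at least one protein."""
    return all(
        any(pep in peps and detect[(prot, pep)] >= mdap[prot]
            for prot, peps in proteins.items())
        for pep in identified
    )

The minimum missed peptide problem then asks for an acceptable mdap dictionary that minimizes missed_peptides.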
If a protein does not exist in the sample, the MDAP $M_j$ needs to be assigned a value greater than the maximum detectability observed in $P_j$; that is, if protein $P_j$ is not present in the sample, we set $M_j$ to a maximum MDAP ($= \infty$). Hence, only proteins whose $M_j < 1$ are considered identified. Note that in nearly all practical cases the maximum MDAP can be set to 1, except when there is a peptide in $\bigcup \{p_j^i\}$ whose $D(p_j^i) = 1$. The relationship between the minimum missed peptide problem and the original minimum protein set problem becomes evident in the following theorem.
Theorem 1. The minimum missed peptide problem is NP-hard.
Proof: The minimum missed peptide problem can be reduced to the set-covering problem10 by setting $D(p_j^i) = 0$ for each $i, j$ and adding a non-existing peptide with detectability of 1 to each protein. Minimizing the number of missed peptides now minimizes the number of covering subsets (proteins) in the solution set. □
3. Materials and Methods
3.1. Data
The data used in this paper were obtained from three different sources. Our first two datasets were generated using mixtures of model proteins; therefore, we know the proteins in these two samples. The first set was generated as a standard protein mixture consisting of 12 model proteins and 23 model peptides mixed at similar concentrations, from 73 to 713 nM for proteins and from 50 to 1800 nM for peptides.11 This data set was made available to us by the authors. The second data set, from a mixture of twelve standard proteins8 (mixture B), was prepared at 1 μM of final digestion solution for each protein except human hemoglobin, which was at 2 μM, combined with buffer, reduced, alkylated, and digested overnight with trypsin. Peptides were separated by a nano-flow reversed-phase liquid chromatography gradient and analyzed by mass spectrometry and tandem mass spectrometry in a Thermo Electron (San Jose, CA) LTQ linear ion trap mass spectrometer. The third sample was generated using a complex proteome sample from R. norvegicus. Rat brain hippocampus samples were homogenized and separated by sedimentation in a centrifuge to produce four fractions enriched in nuclei, mitochondria, microsomes (remaining organelles), and the cytosol. Each subcellular fraction was subjected to proteolytic digestion with trypsin and analyzed by reversed-phase capillary LC tandem mass spectrometry using a 3-D ion trap (ThermoFinnigan LCQ Deca XP). Searches versus either the Swiss-Prot or the
IPI rat database were performed for fully tryptic peptides using Mascot5 with a minimum score of 40, allowing for N-terminal protein acetylation and methionine oxidation.
3.2. Prediction of peptide detectability
As mentioned above, the probability that a peptide will be identified in a standardized proteomics experiment is referred to as the peptide detectability.8 Using machine learning approaches, we previously provided evidence that peptide detectability can be predicted solely from the amino acid sequence of its parent protein. We constructed a set of 175 features describing the peptide sequence itself as well as the regions up- or downstream from the peptide. An ensemble of neural networks was then trained and evaluated. We estimated its balanced-sample accuracy at about 70% across training and test sets obtained from several independent proteomics studies. The usefulness of the learned peptide detectabilities was demonstrated on the problem of label-free protein quantification, where the detectability of a peptide showed negative correlation with the abundance of its parent protein.
3.3. Solving the minimum missed peptide problem
We propose a simple greedy algorithm to solve the minimum missed peptide problem. It assigns identified peptides to proteins in the order of their detectabilities and does not change the peptide assignments once they are made. The algorithm assigns the peptide with the lowest detectability first (denoted the Lowest-Detectability-First Algorithm, LDFA). The pseudocode for the LDFA is presented in Fig. 2. We assume the detectabilities of a single peptide across different parent proteins are close enough not to affect the relative order of each such peptide in its parent protein if a representative detectability is selected. Thus, all identified peptides can be sorted consistently based on their detectabilities. For comparison with LDFA, we also implement a greedy solution to the minimum protein set algorithm (GMPSA), which can be formulated as a set-covering problem10 with very little modification.
4. Results
We compared the performance of the LDFA and GMPSA. First, we used identified peptides from a synthetic sample, mixture B,8 and Swiss-Prot as a reference database to conduct a controlled protein inference experiment. The advantage of this evaluation for quantifying the performance of the algorithm is that all proteins present in the sample are known. The sample mixture B contained 12 proteins corresponding to the 93 peptides identified in the experiment.
Algorithm: Lowest-detectability first algorithm (LDFA)
  Assign all unique peptides in F and remove them from F
  M_j = infinity for all j
  while F is not empty:
    Choose f in F with the lowest detectability
    for each protein j containing f:
      Compute the number of missed peptides, assuming M_j = D(f)
    Select the protein j with the minimum number of missed peptides
    Set M_j = D(f)
    Remove from F all peptides belonging to protein j
Figure 2. Pseudocode for the LDFA solution to the minimum missed peptide problem.
Out of 176,470 proteins from Swiss-Prot, 494 proteins (including the 12 proteins from the mixture) were identified as containing at least one identified peptide. The LDFA identified 12 proteins in the sample, 11 of them correctly. Of the 11 proteins that were correctly assigned, in only one instance could the algorithm not distinguish between the correct protein and one of its close homologs. We refer to this situation as a tie. Each tie is resolved by a random selection. The same data were tested using the GMPSA, which simply tries to explain the identified peptides with the smallest possible number of proteins. GMPSA also identified 12 proteins as the total number of proteins in the sample; however, it suffered in accuracy. For 5 out of the 12 proteins, the GMPSA could not distinguish between the correct proteins and their homologs. Since in each step the GMPSA considers only the number of identified peptides per protein, it is much more likely to encounter ties than the LDFA. As shown in Fig. 1, the GMPSA does not have a means of differentiating between proteins containing no unique identified peptides and the same number of degenerate peptides. In practice, these result in ties involving more homologs than with the LDFA, thus reducing the chance of selecting the correct protein. An example of such a tie involves protein HBB_HUMAN. LDFA found two possible solutions (HBB_HUMAN and HBB_GORGO), resulting in a 50% chance of a correct selection. On the other hand, the GMPSA selected between four different proteins (HBB_HUMAN, HBB_HAPGR, HBB_HYLLA and HBB_PANPO), resulting in a 25% chance of a correct prediction. Furthermore, the smaller average number of proteins per tie encountered by LDFA is advantageous for reporting results of identification. To avoid information leak in calculating peptide detectabilities, the training set for the predictor was constructed from a different synthetic dataset.11
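A small Python sketch of the greedy pass in Figure 2 follows. It is one interpretation of the pseudocode, not the authors' implementation: the data structures, the use of the minimum over parent proteins as the representative detectability, and the way unique peptides seed the MDAPs are assumptions made for illustration.

import math

def ldfa(proteins, detect, identified):
    """Greedy lowest-detectability-first assignment (sketch of Fig. 2).

    proteins:   dict protein -> set of peptide sequences
    detect:     dict (protein, peptide) -> predicted detectability
    identified: set of identified peptide sequences
    Returns dict protein -> MDAP for the proteins selected by the greedy pass.
    """
    parents = {pep: [pr for pr, peps in proteins.items() if pep in peps]
               for pep in identified}
    mdap = {pr: math.inf for pr in proteins}
    remaining = set()
    for pep, prs in parents.items():
        if len(prs) == 1:                       # unique peptides are assigned outright
            mdap[prs[0]] = min(mdap[prs[0]], detect[(prs[0], pep)])
        else:
            remaining.add(pep)

    def missed(pr, threshold):
        # Unidentified peptides of pr with detectability >= threshold count as missed.
        return sum(1 for p in proteins[pr]
                   if p not in identified and detect[(pr, p)] >= threshold)

    while remaining:
        # Representative detectability: the peptide's lowest value over its parents.
        f = min(remaining, key=lambda p: min(detect[(pr, p)] for pr in parents[p]))
        best = min(parents[f], key=lambda pr: missed(pr, detect[(pr, f)]))
        mdap[best] = min(mdap[best], detect[(best, f)])
        remaining -= proteins[best]             # all peptides of the chosen protein are covered
    return {pr: m for pr, m in mdap.items() if m != math.inf}

Ties between parent proteins with the same missed-peptide count would be broken arbitrarily by min() here; the paper resolves such ties by random selection.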
The one protein that was not identified correctly by the LDFA, bovine RNase A, was assigned to a close homolog from one of 7 organisms (69.4% average sequence identity) chosen at random. This assignment was made with a single identified peptide. Furthermore, the sequence for bovine RNase A in the Swiss-Prot database includes the 26-amino acid signal peptide that is not actually present in the sample. Since LDFA takes into consideration the detectabilities of both identified and unidentified peptides, the presence of the signal peptide in the database hinders the assignment of bovine RNase A. After the signal peptide is removed, the sequence identity compared to all seven sequences that match the identified peptide is 84.0%. In comparison, the GMPSA randomly selects among 20 proteins from Swiss-Prot sharing the identified peptide. Another experiment was performed on a biological sample from R. norvegicus, in which the correct proteins were not known. The identified peptides in the sample (693 in total) were searched against an IPI (http://ncbi.nlm.nih.gov) database and were found in 805 proteins. These are the proteins that may potentially be present in the sample. Table 1 shows the distribution of these peptides contained by different numbers of proteins. In this experiment, about 60% of the identified peptides (397 out of 693) are degenerate peptides, i.e. contained by two or more proteins. The two algorithms described above, LDFA and GMPSA, were run on this set.
Table 1. Distribution of identified peptides contained by different numbers of proteins in a R. norvegicus proteome analysis.
No. proteins    1      2-5    6-10    11-20    >20
No. peptides    296    330    43      16       8
Mascot had originally assigned 301 proteins in this sample; LDFA assigned 275 proteins and GMPSA assigned 247 proteins. Taking into consideration all unique peptides from the rat sample, only 149 proteins could be assigned by at least one unique peptide. Thus, any other protein to be assigned by any of the three methods would have to rely solely on degenerate peptides. Due to the prevalence of ties, GMPSA was run 30 times. Only 153 proteins were consistently assigned in all runs. Out of 430 proteins assigned over all GMPSA runs, 229 were assigned less than 50% of the time. Since the correct proteins in this sample were not known, the accuracy of the LDFA and GMPSA could not be quantified as on the synthetic data. Instead, a different approach was taken in which protein distinguishability was measured. Figure 3 shows, in grey, all pairs of the 805 identified proteins that shared at least one identified peptide. The y-axis corresponds to the percentage of sequence identity, while the x-axis represents the length of one of the proteins
in the pair.
Figure 3. A pairwise comparison of all proteins in the IPI rat database in which proteins share at least one identified peptide. The grey dots indicate all pairs, while the black triangles indicate pairs where the algorithm made a random selection between the two proteins, for a) LDFA and b) GMPSA.
Figure 3a shows, in black, all pairs of proteins that share at least one identified peptide and that the LDFA could not distinguish. This means that at one point during the execution the LDFA had to randomly select between those two proteins and that at the completion of the algorithm one of the proteins is not present in the final solution. Figure 3b shows the equivalent plot for the GMPSA. In a single run of each algorithm, there were 94 indistinguishable pairs for the LDFA and 2,346 indistinguishable pairs for the GMPSA.
Figure 4. Detectability plot of a hypothetical protein consisting of 8 tryptic peptides from two shotgun proteomics experiments. Peptides that are identified are represented in grey. In experiment 1, MDIP is obtained as the detectability of an identified peptide that maximizes the true positive (100%) and true negative (100%) rates. In experiment 2, the maximum true positive rate is 75%, while the true negative rate is 100%.
Interestingly, the total number of proteins that were excluded from the final solution at random was 69 and 188 for the LDFA and GMPSA, respectively.
5. Discussion
In a previous study we defined the Minimum acceptable Detectability of Identified Peptides (MDIP) as the detectability of an identified peptide that maximizes the average of the true positive and true negative rates for an identified protein. We also showed that the MDIP of a protein is correlated with its abundance in the sample. The relationship between MDIP and MDAP is shown in Fig. 4, where the identified and non-identified peptides are shown for the same protein under two different experiments. While MDAP is always the lowest detectability of an identified peptide in a protein, MDIP is influenced by non-identified peptides as well. Ideally, as in the left part of Fig. 4, peptides are consecutively identified according to their decreasing detectabilities (starting from the top one), thus giving MDIP = MDAP. Non-identified peptides in the right part of Fig. 4 allow a discrepancy between these two quantities, which we believe will be useful for the advancement of label-free protein quantification. One challenge in correctly interpreting shotgun proteomics data involves assigning identified peptides to the proteins from which they originate.1,3,7,12-15
When the same peptide can be assigned to multiple proteins, this task - referred to as the protein inference problem - is non-trivial. Here we address this problem by utilizing the concept of peptide detectability - the probability that a peptide will be identified in a shotgun proteomics experiment based on inherent properties of the peptide and its surroundings within a protein. Previous work has shown that the rules governing peptide detectability can be learned using a machine learning approach and that a peptide's detectability depends on its source protein concentration.8 In cases where a peptide sequence is found in multiple protein sequences, knowledge of the detectabilities of both the identified peptides (similar sequences in the multiple proteins) and the unidentified peptides (some of which will differ in the multiple proteins) can be used to discern between assignments that would not otherwise be distinguishable. The results shown here for 693 peptides identified from a rat brain sample indicate that 247 proteins can be assigned using a greedy algorithm for the minimum protein coverage formulation, but 94 (38%) of these are selected randomly. When peptide detectability is incorporated into the assignment algorithm, 275 proteins are assigned and only 51 (19%) of these are ambiguous. While the accuracy of this approach is difficult to test on a real proteomics data set, it is clear that the ability to distinguish potential peptide-to-protein assignments offers a significant advance in addressing the protein inference problem. In a typical shotgun proteomics experiment, less than 10% of identified tryptic peptides contain missed cleavages. Currently, we are not able to predict the detectabilities of these peptides because of the lack of training data. As a result, missed-cleavage peptides are neglected in protein inference even if they are identified. We aim to incorporate this prediction in the future. In this study, the identified peptides are determined based on a threshold Mascot score of 40, consistent with the condition used to generate the identified peptides in the dataset for training the detectability predictor.8 If a different threshold is used, the predicted detectability may be different. The effects of threshold selection and other conditions used in peptide identification on the detectability prediction and protein inference will be explored in the future.
Acknowledgements
The authors wish to acknowledge the Office of the Vice President for Research for a Faculty Research Support Grant to RJA, PR, & HT. RJA and MVN acknowledge support from the 21st Century Fund (State of Indiana). JPR wishes to acknowledge support from NSF grant CHE-0518234.
References
1. Aebersold, R. & Mann, M. (2003). Mass spectrometry-based proteomics. Nature 422, 198-207.
2. McDonald, W. H. & Yates, J. R., III (2003). Shotgun proteomics: integrating technologies to answer biological questions. Curr Opin Mol Ther 5, 302-309.
3. Kislinger, T. & Emili, A. (2003). Going global: protein expression profiling using shotgun mass spectrometry. Curr Opin Mol Ther 5, 285-293.
4. Yates, J. R., Eng, J. K., McCormack, A. L. & Schieltz, D. (1995). Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal Chem 67, 1426-1436.
5. Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551-67.
6. Nesvizhskii, A. I., Keller, A., Kolker, E. & Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75, 4646-4658.
7. Nesvizhskii, A. I. & Aebersold, R. (2005). Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 4, 1419-1440.
8. Tang, H., Arnold, R. J., Alves, P., Xun, Z., Clemmer, D. E., Novotny, M. V., Reilly, J. P. & Radivojac, P. (2006). A computational approach toward label-free protein quantification using predicted peptide detectability. Bioinformatics 22, (in press).
9. Gao, J., Opiteck, G. J., Friedrichs, M., Dongre, A. R. & Hefta, S. A. (2003). Changes in the protein expression of yeast as a function of carbon source. J Proteome Res, 643-649.
10. Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. (2001). Introduction to Algorithms, 2nd edit., MIT Press, Cambridge, MA, U.S.A.
11. Purvine, S., Picone, A. F. & Kolker, E. (2004). Standard mixtures for proteome studies. Omics 8, 79-92.
12. Carr, S., Aebersold, R., Baldwin, M., Burlingame, A., Clauser, K. & Nesvizhskii, A. (2004). The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data. Mol Cell Proteomics 3, 531-3.
13. Rappsilber, J. & Mann, M. (2002). What does it mean to identify a protein in proteomics? Trends Biochem Sci 27, 74-8.
14. Resing, K. A., Meyer-Arendt, K., Mendoza, A. M., Aveline-Wolf, L. D., Jonscher, K. R., Pierce, K. G., Old, W. M., Cheung, H. T., Russell, S., Wattawa, J. L., Goehle, G. R., Knight, R. D. & Ahn, N. G. (2004). Improving reproducibility and sensitivity in identifying human proteins by shotgun proteomics. Anal Chem 76, 3556-68.
15. Yang, X., Dondeti, V., Dezube, R., Maynard, D. M., Geer, L. Y., Epstein, J., Chen, X., Markey, S. P. & Kowalak, J. A. (2004). DBParser: web-based software for shotgun proteomic data analyses. J Proteome Res 3, 1002-8.
MINING TANDEM MASS SPECTRAL DATA TO DEVELOP A MORE ACCURATE MASS ERROR MODEL FOR PEPTIDE IDENTIFICATION
YAN FU1,2,†, WEN GAO1, SIMIN HE1, RUIXIANG SUN1, HU ZHOU3, RONG ZENG3
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2 Graduate University of Chinese Academy of Sciences, Beijing, China, 3 Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
The assumption on the mass error distribution of fragment ions plays a crucial role in peptide identification by tandem mass spectra. Previous mass error models are the simplistic uniform or normal distribution with empirically set parameter values. In this paper, we propose a more accurate mass error model, namely conditional normal model, and an iterative parameter learning algorithm. The new model is based on two important observations on the mass error distribution, i.e. the linearity between the mean of mass error and the ion mass, and the log-log linearity between the standard deviation of mass error and the peak intensity. To our knowledge, the latter quantitative relationship has never been reported before. Experimental results demonstrate the effectiveness of our approach in accurately quantifying the mass error distribution and the ability of the new model to improve the accuracy of peptide identification.
1.
Introduction
Tandem mass spectrometry is playing an increasingly important role in current proteomics research [1]. In an experiment of tandem mass spectrometry, peptides digested from a protein mixture are first protonated and isolated according to their mass-to-charge ratios. Peptide ions of a specific mass-to-charge ratio then undergo low-energy collision-induced dissociation to break into fragment ions. Fragment ions are detected and their masses (or rather mass-to-charge ratios) and intensities are recorded. The ion intensity at one mass value forms an observed mass peak. All the mass peaks corresponding to the detected fragment ions of the same peptide constitute the experimental tandem mass spectrum of this peptide. To identify the peptide corresponding to a tandem mass spectrum, database searching is the most widely used approach. Popular database searching tools are SEQUEST [2] and Mascot [3]. Another approach is de novo sequencing, e.g. the Lutefisk [4], PEAKS [5] and PepNovo [6] algorithms. A third approach is the sequence tag query, e.g. the pioneering work by Mann and Wilm [7], and the recent GutenTag [8] and Popitam [9] algorithms.
† To whom correspondence should be addressed. E-mail: [email protected].
A key ingredient of peptide identification algorithms is the scoring function that measures the likelihood of a candidate peptide producing the experimental spectrum. In a peptide-scoring algorithm, observed mass peaks in the experimental spectrum are matched to the fragment ions predicted from a candidate peptide according to their mass values. Due to the imprecision of mass measurement, an error window on mass values is commonly used to tolerate mass match errors in a certain range. The error window plays a very important role in peptide-scoring algorithms. An error window inconsistent with the actual mass error distribution can lead to increased random matches or reduced true matches, thus degrading the performance of a peptide-scoring algorithm. Moreover, in the de novo or sequence tag approach to peptide identification, the allowed maximal mass error can greatly affect the number of candidate peptides or sequence tags. Ion trap mass spectrometers have been quite attractive in proteomics research, due to their relatively high sensitivity and low cost. However, compared to higher-resolution mass spectra, such as Q-TOF spectra, the mass error of ion trap spectra is in general much larger and is less exploited in the computational proteomics area. Therefore, this paper focuses on ion trap spectra. The mass error models assumed for ion trap spectra in current peptide identification algorithms are quite simple. The most common assumption is that the mass error is uniformly distributed within a ±ε error window around the theoretical mass value [2, 3, 6, 10-18]. For ion trap spectra, the width of the error window ε is often empirically set to 0.5 u, e.g. [6, 11, 12]. Another assumption is the normal distribution of mass error [19-22]. Previous mass error models used in ion-trap spectra analysis and peptide identification algorithms can be characterized as follows: 1. The mass error is centered at zero; 2. The mass error is independent of both the mass and the intensity of fragment ions; 3. The parameters in the mass error distribution are empirically set. A notable exception to (1) and (3) is the recent work due to Wan and Chen [21], in which the mean and standard deviation of the normally distributed mass errors are learned from training data. However, all existing error models have assumed so far that all mass errors in a given dataset of spectra come from an identical distribution, regardless of ion masses and intensities. Although peptide-identification tools based on these simple error models often work well, a large proportion (eighty to ninety percent) of spectra cannot be successfully interpreted in current proteomic experiments, due to either known or unknown reasons. A mass error model lacking sufficient accuracy is likely responsible for some of these un-interpreted spectra.
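As a point of reference, the following Python sketch shows the kind of fixed-window matching that such simple error models imply; it is illustrative only (the function name and the 0.5 u default are assumptions based on the common setting mentioned above), and is not the matching routine of any of the cited tools.

def match_fragments(observed_peaks, predicted_masses, tol=0.5):
    """observed_peaks: list of (mz, intensity); predicted_masses: list of m/z values.
    For each predicted fragment, return the most intense peak within +/- tol,
    or None if no peak falls inside the window."""
    matches = []
    for m in predicted_masses:
        candidates = [(mz, inten) for mz, inten in observed_peaks
                      if abs(mz - m) <= tol]
        matches.append(max(candidates, key=lambda p: p[1]) if candidates else None)
    return matches

Every fragment ion gets the same window here, which is exactly the assumption that the conditional model developed below relaxes.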
In this paper, we statistically investigate the distribution of mass errors of singly charged fragment ions in ion trap tandem mass spectra. By visualizing mass errors in various ways, we first illustrate that there is a linear correlation between the mass error and the ion mass, and there is an approximate log-log linearity between the standard deviation (SD) of mass error and the peak intensity. To our knowledge, the latter quantitative relationship has never been reported in the literature. Based on these observations, we model the mass error of a fragment ion by a conditional normal distribution, whose mean and SD are the functions of ion mass and peak intensity, respectively. We also propose an iterative algorithm, named PMED, to accurately estimate the parameter values in the conditional mean and SD functions. Experimental results demonstrate that the PMED algorithm converges very fast and the learned parameter values match real data very well. Experiment also shows that the new mass error model can considerably improve the accuracy of peptide identification. The rest of the paper is organized as follows. Section 2 describes the used datasets of tandem mass spectra. In Section 3, we first qualitatively illustrate the distribution trends of mass errors and then propose the conditional normal model of mass error. The iterative parameter learning algorithm, PMED, is presented in Section 4. Section 5 gives experimental results. We finally conclude the paper and point out future work in Section 6. 2.
Datasets
We have analyzed several datasets of ion trap mass spectra. However, due to limited space, we report results on our own dataset in this paper. Results on several published datasets [23-26] are given in Supplementary Information online (http://www.jdl.ac.cn/user/yfu/pmed/index.html). The steps to generate our dataset (denoted the SIBS dataset) are briefly described below. A total of 300 μg of protein sample from whole-cell lysate of mouse liver was digested with trypsin. Five LC-MS/MS runs were performed on the digested mixture with a linear ion trap (Thermo Finnigan, San Jose, CA) using different concentrations in salt steps. The mass spectrometer was set so that one full MS scan was followed by ten MS/MS scans on the ten most intense ions from the MS spectrum. The acquired spectra were searched against the mouse database (SwissProt) using the SEQUEST program. The resulting assignments of database peptides to experimental spectra were filtered according to their Xcorr and DeltCn scores (Xcorr > 1.9 and 2.2 for [M+1] and [M+2] spectra, respectively, and DeltCn > 0.1). In addition, to reduce duplicate peptides, only the spectrum with the largest Xcorr was retained among a certain number of consecutive MS/MS scans on the same peptide ion. This finally resulted in a
total of 1,505 [M+1] and [M+2] spectra with high-confidence peptide assignments. [M+3] spectra were not included, since doubly charged fragment ions are often dominant in these spectra while our analysis focuses on the mass error of singly charged fragment ions.
3. Conditional Distribution of Mass Error
Our purpose is to study how the mass errors are distributed. Especially, we are interested in whether the mass error correlates with the ion mass and the peak intensity.
3.1. Visualization of Mass Error Distribution
To visualize and analyze the mass error, the mass peak produced by each expected fragment ion must be identified in advance. To this end, we first use a common strategy to match observed peaks to expected fragment ions - the most intense peak within the error window of ±ε around the theoretical mass value of an expected fragment ion is assigned to this fragment ion. This criterion for determining peak-ion matches certainly lacks accuracy, since the error window is set empirically and is fixed for all the fragment ions, regardless of their masses and intensities. Fortunately, we find that the training data obtained with the above criterion are already adequate for the qualitative analysis of mass error at this stage. In the next section, we will develop an iterative learning algorithm, based on the observations in this section, to quantify the mass error distribution and revise the criteria for determining peak-ion matches. We use monoisotopic masses of amino acid residues to calculate the theoretical mass and set ε to 0.5 u. Without loss of generality, we illustrate the analysis results only for y ions in this paper. Results for other fragment ion types, e.g. b ions, show a similar trend to y ions and are not given. Figure 1 gives the frequency histogram of mass errors. It shows that the mass error has a bell-shaped distribution. A similar trend is also observed on other datasets (See Figures S1-1, S2-1, S3-1 and S4-1 in Supplementary Information). In addition, depending on the instrument calibration, the center of the mass error distribution may deviate from zero (See Figure S3-1 in Supplementary Information for an example). This is called systematic error. Figure 2 plots all the mass errors of y ions against their corresponding ion masses. From Figure 2, we can see that the mass errors display a trend of descending linearly with increasing ion masses, although, at a given value of ion mass, the mass errors spread fairly broadly. For well calibrated instruments, such a phenomenon may not be apparent (See Figure S4-2 in Supplementary Information for an example). However, we did observe the linear relationship
between the mass error and the ion mass on several real datasets (See Figures S1-2, S2-2 and S3-2 in Supplementary Information). This relationship has rarely been taken into account by peptide identification algorithms.
Figure 1. Frequency histogram of mass errors.
Figure 2. Mass errors versus ion masses.
Figure 3. Mass errors versus the logarithms of their corresponding peak intensities.
Figure 4. Log-log plot of the standard deviation (SD) of mass errors versus the peak intensity.
Figure 3 plots all the mass errors of y ions against the logarithms of their corresponding peak intensities. Raw intensities are used here. It shows that the mass error distribution is dramatically correlated with the peak intensity. The more intense the peaks are, the more concentrated the mass errors tend to be. This is intuitively understandable - the more ions are detected, the more accurate the measured mass value should be. Further analysis reveals that the logarithm of the standard deviation (SD) of mass errors goes down approximately linearly as the logarithm of the peak intensity increases (see Figure 4). The data in Figure 4 are obtained by grouping the intensities into a number of bins and calculating the mass error SD of fragment ions falling in each bin (sampling equal number of data points from each bin results in the same phenomenon). On other datasets, a similar trend is also observed (See Figures Sl-3, Sl-4, S2-3, S2-4, S3-3, S3-4, S4-3 and S4-4 in Supplementary Information).
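A brief sketch of the binning behind Figure 4 is given below, assuming arrays of per-match mass errors and raw peak intensities are available; the bin count and the use of a least-squares line fit are illustrative choices, not procedures taken from the paper.

import numpy as np

def loglog_sd_fit(mass_errors, intensities, n_bins=20):
    """Group matches by log-intensity, compute the mass-error SD per bin,
    and fit a line to ln(SD) versus ln(intensity)."""
    log_i = np.log(np.asarray(intensities, dtype=float))
    errs = np.asarray(mass_errors, dtype=float)
    edges = np.linspace(log_i.min(), log_i.max(), n_bins + 1)
    centers, log_sds = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (log_i >= lo) & (log_i < hi)
        if in_bin.sum() > 1:                      # need at least 2 points for an SD
            centers.append(0.5 * (lo + hi))
            log_sds.append(np.log(errs[in_bin].std()))
    slope, intercept = np.polyfit(centers, log_sds, 1)
    return slope, intercept                       # ln(SD) ~ slope*ln(I) + intercept

An approximately linear fit with negative slope corresponds to the trend visible in Figure 4: more intense peaks have more tightly concentrated mass errors.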
To the best of our knowledge, the above quantitative relationship between the mass error and the peak intensity has never been discussed in the literature. It provides an important new constraint for determining peak-ion matches; that is, a scaled rather than fixed error window should be used for peaks of different intensities.
3.2. Conditional Normal Model of Mass Error
Based on the above observations, we model the mass error $E_f$ of a fragment ion $f$ as a random variable following a normal distribution, whose mean and SD are determined by the theoretical mass value $M(f)$ and the observed raw peak intensity $I(f)$, respectively; that is,

$E_f \sim \mathcal{N}\big(\mu(M(f)),\ \sigma^2(I(f))\big)$,   (1)

where

$\mu(M(f)) = u \cdot M(f) + v$,   (2)
$\sigma(I(f)) = b \cdot I(f)^{a}$,   (3)

and u, v, a and b are parameters to be determined. The conditional mean and SD functions (2) and (3) directly follow from the observations in Section 3.1 (Figures 2 and 4). Notice that by fixing the value of parameter u (or a) at zero, the mass error mean (or SD) becomes unconditional on the ion mass (or peak intensity), which leads to variant formats of the conditional normal model.
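The following Python sketch evaluates the density implied by Equations (1)-(3); it is a minimal illustration, and the parameter values in the usage example are placeholders rather than values learned in this paper.

import math

def error_density(delta, ion_mass, peak_intensity, u, v, a, b):
    """Density of a mass error `delta` for a fragment ion of theoretical mass
    `ion_mass` matched to a peak of raw intensity `peak_intensity`."""
    mean = u * ion_mass + v                 # Eq. (2): mean is linear in the ion mass
    sd = b * peak_intensity ** a            # Eq. (3): SD is a power law in the intensity
    z = (delta - mean) / sd
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sd)

# With u = a = 0 the model reduces to an unconditional normal distribution
# with mean v and standard deviation b; the values below are placeholders.
d = error_density(delta=0.12, ion_mass=800.0, peak_intensity=1.5e4,
                  u=-1e-4, v=0.05, a=-0.2, b=0.3)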
4. Iterative Parameter Learning Algorithm
Generally speaking, given a dataset of spectra of known peptide sequences, the parameter values in the conditional mean and SD functions can be roughly learned from the mass-error data generated in Section 3. However, since such training data are derived from a less accurate peak-ion matching criterion, the accuracy of parameter estimation could be accordingly affected. This problem may become significantly serious when actual mass errors are distributed out of the expected range. To overcome this difficulty, we develop an iterative algorithm for more accurate parameter estimation. Intuitively, the algorithm, which we name PMED (Peaks' Mass Error Model), for learning the values of the parameters u, v, a and b is performed in the following iterative manner. In one step, according to the mass error distribution determined by the current parameter values, probable peak-ion matches are selected to generate a training dataset. In another step, the parameters are re-estimated on this training dataset. These two steps are carried out alternately until the learned parameter values do not change any more. By this iterative procedure, the learned parameter values are expected not to be sensitive to the prior assumption on the mass error distribution.
Let $\{\langle S_1, P_1\rangle, \langle S_2, P_2\rangle, \ldots, \langle S_N, P_N\rangle\}$ denote a set of tandem mass spectra labeled with corresponding peptide sequences. A spectrum $S_i$ is a set of peaks $\{s_{i1}, s_{i2}, \ldots, s_{im_i}\}$, each associated with a mass value $M(s_{ij})$ and an intensity value $I(s_{ij})$. A peptide $P_i$ is a sequence of amino acid residues. Let $\{f_{i1}, f_{i2}, \ldots, f_{ik_i}\}$ denote the set of expected fragment ions of peptide $P_i$, each associated with a theoretical mass value $M(f_{ik})$.
Algorithm PMED
Input: A set of tandem mass spectra labeled with peptide sequences, $\{\langle S_1, P_1\rangle, \langle S_2, P_2\rangle, \ldots, \langle S_N, P_N\rangle\}$.
Output: Estimated parameter values of the mass error distribution, i.e. u, v, a and b in Equations (2) and (3).
Step 1. Initialize the values of u, v, a and b in the conditional mean and SD functions (Equations (2) and (3)), according to the prior knowledge about the mass error distribution.
Step 2. For each possible combination of i, j and k, compute the z-score $z_{ijk}$ of the mass error of fragment ion $f_{ik}$, under the assumption that $f_{ik}$ produced peak $s_{ij}$, based on the current values of u, v, a and b:

$z_{ijk} = \dfrac{\big(M(s_{ij}) - M(f_{ik})\big) - \mu\big(M(f_{ik})\big)}{\sigma\big(I(s_{ij})\big)}$,   (4)

where $\mu(M(f))$ and $\sigma(I(s))$ are as defined in Equations (2) and (3), respectively.
Step 3. Generate the training dataset D by selecting those peak-ion matches whose absolute z-scores are smaller than a given threshold $z_t$:

$D = \{(s_{ij}, f_{ik}) : |z_{ijk}| < z_t\}$.   (5)

Step 4. Update the values of u, v, a and b with the maximum likelihood (ML) estimates for them based on the training dataset D; that is,

$(u, v, a, b) \leftarrow \arg\max \prod_{(s_{ij}, f_{ik}) \in D} p_{ijk}$,   (6)

where

$p_{ijk} = \dfrac{1}{\sqrt{2\pi}\,\sigma\big(I(s_{ij})\big)} \exp\!\left(-\dfrac{\big[\big(M(s_{ij}) - M(f_{ik})\big) - \mu\big(M(f_{ik})\big)\big]^2}{2\,\sigma^2\big(I(s_{ij})\big)}\right)$.   (7)

Step 5. Terminate the algorithm and return the learned values of u, v, a and b if they remain stable; otherwise, go to Step 2.
Results and Discussions
The results given in this section are obtained on the SIBS dataset described in Section 2. Those obtained on other datasets are presented in Supplementary Information online. 5.1. Parameter Learning The values of w, v and a are initialized to zero, and the value of b is initialized to 0.1. These initialized parameter values reflect the weakest prior assumptions about the mass error distribution - centered at zero (v=0), independent of the ion mass (w=0), and independent of peak intensity (a=0). Such assumptions are most common in current peptide identification algorithms. Figure 5 depicts the learned values for w, v, a and b against the number of iterations. The learning process converges after four iterations and takes less than one minute. We made small changes to the initialized parameter values and found that the learning results are not sensitive to the initialized values. The learned results are also quite stable, when the z-score threshold z, in Equation (5) is set to about three.
0
2 4 Number of Iterations
6
Figure. 5. Learned parameter values versus the number of iterations
429
It is also shown in Figure 5 that the learned parameter values after the first iteration, which correspond to the direct ML estimates on the initial training data, are significantly different from the finally learned values. This justifies the necessity of the PMED algorithm. Figures 6 and 7 plot the learned conditional mass error mean and SD respectively. We can see that the learned results of the PMED algorithm are quite consistent with real data. In the case that the parameters u and/or a are fixed at zero, other parameters can also be accurately learned (results not given).
Ion Mass
Figure. 6. Learned conditional mean (dashed line) of the mass error distribution, plotted together with mass error data (dots)
ln(Peak Intensity)
Figure. 7. Learned conditional standard deviation (dashed curve) of the mass error distribution, plotted together with residuals (dots) of mass errors from the learned mean.
Due to the differences in instrument type, setup and calibration, the learned parameter values vary with instruments (See Figures SI-6, SI-7, S21-6, S2-7, S3-6, S3-7, S4-6 and S4-7 in Supplementary Information). 5.2. Application to Peptide Identification To test the usefulness of our proposed conditional normal model of mass error for improving the accuracy of peptide identification, we use a simple peptidescoring function defined on mass match errors. Given a mass error model, the score of a candidate peptide is the sum of probability densities of all mass match errors. This scoring function is in fact a weighted version of the SPC (Shared Peak Counts) with each peak-ion match weighted by the probability density of the corresponding mass match error. Notice that the high intensity of a matched peak does not necessarily mean a high score. In fact, the situation can be the contrary if the mass match error (residual from learned mean) is large. This is illustrated in Figure 8. Several mass error models are compared, including the uniform distribution, the normal distribution, and several variants of conditional normal distribution. Parameter values in each model are either set empirically or learned from data.
430
In the latter case, five-fold cross validation is used for performance evaluation. Further, when parameters are learned from data, they may be either conditional or fixed.
- with a high peak intensity - with a low peak intensity o 4 CO
2 0
-0.5
0 0.5 Mass Error Residual
Figure. 8. The score of a match is determined by both the mass match error and the peak intensity.
Spectra are searched using the pFind program [17, 27, 28] against a large database containing 127,432 protein sequences (SwissProt database of all species entries, appended with the peptide sequences of test spectra). For simplicity, b and y ion series are predicted. Trypsin is used for theoretical digestion with up to two missed cleavage sites allowed. Table 1 compares the search results on the SIBS dataset using the above defined peptide-scoring function equipped with various mass error models. Percentages of spectra with the correct peptide sequence ranked top one and top ten are used to measure the identification accuracy. Table 1. Comparison of search results with various mass error models Mass error model
Parameters H (mean)
CT(SD) or e (width)
Topl(%)'
Topl0(%) 2
95.9 87.0 90.1 75.6 78.8 60.0 c r = 0.3/z, 73.2 90.4 (7= 0.5/z, 97.3 87.1 (7= 0.7/z, 98.3 90.2 Normal Fixed/learned 90.2 98.3 Fixed/learned 99.7 Conditional 97.8 Fixed/learned 91.6 98.4 Conditional Conditional 98.1 99.7 Percentage of spectra with the correct peptide sequence ranked top one. Percentage of spectra with the correct peptide sequence ranked top ten. Uniform
0
£•=0.3 £•=0.5 £=0.7
It is s h o w n in T a b l e 1 that c o m p a r e d t o the unconditional uniform and n o r m a l m o d e l s , either the introduction of conditional m e a n or the introduction of
431 condition SD can independently increase the identification accuracy. The improvement caused by the introduction of conditional SD is particularly significant - seven percentage points in Topi performance. The best results are obtained with the fully conditional normal model. On the ToplO performance, the conditional normal model is also superior to unconditional models. On other datasets, the increases are remarkable too (See Tables SI, S2, S3 and S4 in Supplementary Information). 6.
Conclusions
The proposed mass error model and the associated parameter learning algorithm provide an automated method for quantifying the mass error distribution of fragment ions in ion trap tandem mass spectra. Compared to previous mass error models, the new model has several advantages: 1. the systematic error of mass measurement is taken into account; 2. fragment ions of different masses and intensities have different mass error distributions; 3. parameters can be automatically learned from data. Experiments demonstrated the effectiveness of the parameter learning algorithm and the usefulness of the new mass error model for peptide identification. In the future, we expect to develop more sophisticated peptide scoring functions to take full advantage of the new mass error model. The analysis in this paper is limited to singly charged fragment ions. Due to the disturbance of isotopic peaks in the low-resolution ion trap spectra, the mass errors of doubly charged fragment ions are more complex; their analysis is more challenging and will be our future work.
Acknowledgments
This work was supported by the National Key Basic R&D Program of China (Grant No. 2002CB713807) and the National Key Technologies R&D Program of China (Grant No. 2004BA711A21). We gratefully acknowledge Dr. Andrew Keller from the Institute for Systems Biology for valuable comments on an early version of the paper. We also thank Prof. Runsheng Chen, Dr. Dongbo Bu, Jingfen Zhang, Quanhu Sheng, Jie Dai and many others from the Chinese Academy of Sciences for helpful discussions.
References
1. R. Aebersold and M. Mann, Nature 422, 198 (2003).
2. J. K. Eng, A. L. McCormack and J. R. Yates, III, J. Am. Soc. Mass. Spectrom. 5, 976 (1994).
3. D. N. Perkins, D. J. Pappin, D. M. Creasy and J. S. Cottrell, Electrophoresis 20, 3551 (1999).
4. J. A. Taylor and R. S. Johnson, Anal. Chem. 73, 2594 (2001).
5. B. Ma, K. Z. Zhang, C. Hendrie, C. Z. Liang, M. Li, A. Doherty-Kirby and G. Lajoie, Rapid Commun. Mass Spectrom. 17, 2337 (2003).
6. A. Frank and P. Pevzner, Anal. Chem. 77, 964 (2005).
7. M. Mann and M. Wilm, Anal. Chem. 66, 4390 (1994).
8. D. L. Tabb, A. Saraf and J. R. Yates, III, Anal. Chem. 75, 6415 (2003).
9. P. Hernandez, R. Gras, J. Frey and R. D. Appel, Proteomics 3, 870 (2003).
10. D. Fenyo, J. Qin and B. T. Chait, Electrophoresis 19, 998 (1998).
11. V. Dancik, T. A. Addona, K. R. Clauser, J. E. Vath and P. A. Pevzner, J. Comput. Biol. 6, 327 (1999).
12. M. Havilio, Y. Haddad and Z. Smilansky, Anal. Chem. 75, 435 (2003).
13. R. G. Sadygov and J. R. Yates, III, Anal. Chem. 75, 3792 (2003).
14. J. Colinge, A. Masselot, M. Giron, T. Dessingy and J. Magnin, Proteomics 3, 1454 (2003).
15. D. L. Tabb, L. L. Smith, L. A. Breci, V. H. Wysocki, D. Lin and J. R. Yates, III, Anal. Chem. 75, 1155 (2003).
16. J. E. Elias, F. D. Gibbons, O. D. King, F. P. Roth and S. P. Gygi, Nat. Biotechnol. 22, 214 (2004).
17. Y. Fu, Q. Yang, R. Sun, D. Li, R. Zeng, C. X. Ling and W. Gao, Bioinformatics 20, 1948 (2004).
18. M. Bern and D. Goldberg, The Ninth Annual International Conference on Research in Computational Molecular Biology 357 (2005).
19. V. Bafna and N. Edwards, Bioinformatics 17, S13 (2001).
20. N. Zhang, R. Aebersold and B. Schwikowski, Proteomics 2, 1406 (2002).
21. Y. Wan and T. Chen, The Ninth Annual International Conference on Research in Computational Molecular Biology 342 (2005).
22. J. H. Oh and J. Gao, The 5th IEEE Symposium on Bioinformatics and Bioengineering 161 (2005).
23. A. Keller, S. Purvine, A. I. Nesvizhskii, S. Stolyar, D. R. Goodlett and E. Kolker, Omics 6, 207 (2002).
24. V. Mayya, K. Rezaul, Y. Cong and D. Han, Molecular & Cellular Proteomics 4, 214 (2005).
25. J. Peng, J. E. Elias, C. C. Thoreen, L. J. Licklider and S. P. Gygi, J. Proteome Res. 2, 43 (2003).
26. J. T. Prince, M. W. Carlson, R. Wang, P. Lu and E. M. Marcotte, Nat. Biotechnol. 22, 471 (2004).
27. D. Li, Y. Fu, R. Sun, C. Ling, Y. Wei, H. Zhou, R. Zeng, Q. Yang, S. He and W. Gao, Bioinformatics 21, 3049 (2005).
28. J. Zhang, W. Gao, J. Cai, S. He, R. Zeng and R. Chen, IEEE/ACM T. Comp. Biol. Bioinfo. 2, 217 (2005).
ASSESSING AND COMBINING RELIABILITY OF PROTEIN INTERACTION SOURCES
SONIA LEACH, AARON GABOW, AND LAWRENCE HUNTER
University of Colorado at Denver and Health Sciences Center, Aurora, CO 80045, USA
E-mail: {Sonia.Leach, Aaron.Gabow, Larry.Hunter}@uchsc.edu
DEBRA S. GOLDBERG
University of Colorado at Boulder, Boulder, CO 80309, USA
E-mail: [email protected]
Integrating diverse sources of interaction information to create protein networks requires strategies sensitive to differences in accuracy and coverage of each source. Previous integration approaches calculate reliabilities of protein interaction information sources based on congruity to a designated 'gold standard.' In this paper, we provide a comparison of the two most popular existing approaches and propose a novel alternative for assessing reliabilities which does not require a gold standard. We identify a new method for combining the resultant reliabilities and compare it against an existing method. Further, we propose an extrinsic approach to evaluation of reliability estimates, considering their influence on the downstream tasks of inferring protein function and learning regulatory networks from expression data. Results using this evaluation method show 1) our method for reliability estimation is an attractive alternative to those requiring a gold standard and 2) the new method for combining reliabilities is less sensitive to noise in reliability assignments than the similar existing technique.
1. Introduction
The recent availability of high-throughput proteomics data has allowed genome-wide construction of network models of relationships among proteins1-12. These networks are used in such downstream tasks as inferring protein function, identifying potential protein complexes, or interpreting gene expression data. In the majority of cases, however, the notion of protein interaction is biased to mean physical interaction. As argued in Lee et al.9, the term 'interaction' should instead encompass any type of evidence linking pairs of genes, whether it be physical, functional, genetic,
biochemical, evolutionary, or computational. Integrated networks from diverse sources give biologists more insight into their data when these networks are later used for analysis tasks, since each interaction data type offers an alternative view of the relationships which exist among genes. The major challenge of integration has been that each individual source varies in terms of accuracy and coverage over the domain. Thus, estimating confidence of a particular interaction must account for the number and reliability of the specific sources contributing evidence for that interaction11. For example, a high-throughput method offers evidence of interaction genome-wide, yet with many false positives, so interactions supported solely by that method may be suspect a priori. One answer is to favor interactions supported by multiple data sources. Even then, high confidence in an interaction is not guaranteed since the individual reliabilities of the sources supporting that interaction may be low. As such, a large body of literature is dedicated to estimating error rates of the individual experiment types2-12. However, nearly all existing methods quantify reliability with respect to a 'gold standard,' such as the percentage of interacting pairs suggested by the source known to have the same function. We argue that requiring a gold standard for reliability assessments is disadvantageous for a number of reasons. First, the choice of gold standard depends on the task for which a protein interaction network is used and therefore cannot easily generalize to other tasks without recomputation. For example, using correlated gene expression profiles as a gold standard for reliability assessment may not be appropriate for the task of predicting physical interactions. Also, the need to reserve an information source as a gold standard decreases the amount of data providing evidence of interaction. This point becomes critical in less well-studied organisms where there may be few sources of information about interaction, let alone enough information for a gold standard. In this paper, we present a new method for assigning reliability to individual data types which does not rely on a gold standard. Rather, our method leverages the relative frequency of overlap between the sources, rating each source by its average agreement with the consensus. Our method is not biased toward sources representing the same notion of interaction found in a predefined gold standard and as such, is generally applicable
Note here we are quantifying reliability of evidence for a particular interaction, not building a general predictor of interaction; thus some issues, such as missing data for predictive attributes, are not present in our application.
for use in any downstream task. We compare our method to two popular existing reliability assignment techniques3,9-12. We then consider strategies for creating a weighted protein interaction network by probabilistic integration of the individual data sources and their assigned reliabilities. For integration of reliabilities, we use one technique appearing in the biological literature11,12 as well as a new alternative we identify as applicable from the statistics literature13. We propose an extrinsic evaluation strategy which measures performance of each reliability assessment and integration method combination on two downstream tasks: inferring protein function and learning regulatory networks. Our results show that our proposed reliability assessment method is a viable alternative to previous methods and the alternative integration method is less sensitive to incorrect reliability assessments than the existing method.
2. Data Sources and Reliability Assignments
Protein interaction networks represent proteins as nodes and integrate interaction sources to identify connections between them. We focus here on techniques which 1) quantify the accuracy of sources indicating that interaction and 2) combine their reliabilities to obtain a consensus reliability for each interaction. Before addressing the first task, we must first identify a set of genes and a set of interaction sources. To create comparable datasets, we use the 6760 annotated orfs in yeast and choose 6760 mouse genes randomly among those having information in at least three interaction sources. With this strategy we obtain similar coverage, where 50% of our mouse genes have a known pathway, compared to 80% with known function in yeast. We use these sets of 6760 genes for all reported results. We consider two types of data sources which provide positive assertions of relationships: explicit sources which indicate interaction between genes directly, or implicit sources from which relationships can be derived by noting when two genes are assigned the same category by the source. For example, yeast two-hybrid assays are an explicit measure of physical interaction while presence of identical sequence motifs implies genes may have related functions or regulators. Though implicit sources individually may be poor indicators of interaction, accumulation of evidence from several implicit sources may reliably indicate interaction in the absence of
The day of submission we identified a similar comparison paper by Suthram et al.14 We do not include logistic regression approaches6,7 since each group uses very different attribute sets. This measure has identical use by independent research groups.
more explicit information. This point becomes critical in less well-studied organisms where indirect information may be easier to obtain. For explicit sources, we include those which experimentally measure physical, biochemical, and genetic interactions15-20 as well as those which computationally predict gene neighborhoods, gene fusion events or conserved phylogenetic profiles21. In an effort to create independent indicators, we categorize information from these sources by type, e.g. yeast two-hybrid or gene fusion. Reliabilities are then calculated for each type (denoted with capital letters, e.g. Y2H and GENEFUSE). As some indication of the diversity of types and their coverage, for yeast, we have 21 distinct types with between 208 (GENETIC) and 26k (HMS-PCI) interactions. For mouse, we use 11 types with between 1 (ELISA) and 2546 (IMMUNOPRECIP) interactions. As implicit interaction sources, we use information about literature references22, sequence motifs23-26, protein categories23, protein complexes23, cell phase27, phenotypes23,22, essentiality23, cellular location22,23,28, molecular function23,28, and biological process or pathway28-30. Considering each separately for the 6760 gene sets, we obtain eight implicit yeast sources with between 39k (COPROTCATEGORY) and 179k (COESSENTIAL) interactions. In mouse, we have six sources with between 30k (COLITERATURE) and 1.2M (GO:COMPONENT) interactions. Having identified our data sources and interaction types, our first task is to assign a reliability score to each type which reflects our confidence in its information. For example, we might assign a low score to a high-throughput explicit type like yeast two-hybrid or an implicit type like co-location, yet assign a high score to a low-throughput assay like x-ray crystallography. Many methods exist for estimating reliability where the accuracy of a source is quantified by agreement with a gold standard, such as correlated gene expression, or shared protein complex membership, function or cellular location. We consider two of the most prevalent techniques which rely on a gold standard and propose a third which does not. The first measure calculates reliability rE as the proportion of pairs from a source E with a known shared designation according to the gold standard, relative to all pairs annotated by the source11,12. We denote this measure PropGS. For example, we might count the proportion of interacting pairs suggested by the source with the same function. The second measure of reliability calculates the log likelihood of pairs sharing a designation according
Note, here we do not use any type of interaction evidence which is measured experimentally in another species and transferred to the species of interest using orthologs.
to the gold standard (denoted LogLikGS)3,9,10: rE = log[Pr(L|E) · ¬Pr(L) / (¬Pr(L|E) · Pr(L))]. We say two proteins are linked (L) if they share the same designation in a gold standard. Then Pr(L) is the prior expectation of linkage, while ¬Pr(L) is the prior expectation of non-linkage. The values Pr(L|E) and ¬Pr(L|E) represent the analogous expectations calculated only among interactions offered by the data source E. Note that PropGS computes Pr(L|E). In both of these methods, the gold standard is predetermined and held in reserve, apart from the other information sources. We propose a third alternative, not utilizing a gold standard, which relies instead on the average agreement of a source with the overall consensus offered by all sources, denoted Cons. Let ne be the number of sources indicating an interaction (edge) e between a given pair of proteins. We calculate reliability as rE = (1/|E|) Σe∈E ne, where |E| is the total number of interactions offered by source E. Applicability of this measure assumes relative sparsity for a good proportion of the sources. Reliability of sources which assert many edges, such as the implicit or high-throughput sources, will then be automatically discounted, since many of those edges will not have further support among all experts; the average will thus be taken over many small values of ne. Unlike the previous alternatives, Cons favors a more diverse notion of interaction since it does not penalize sources whose interaction type differs from that of the gold standard. Given the set of reliability estimates for each interaction information source, our second task is to create a combined reliability score for each interaction based on the individual reliabilities of the sources contributing information for that interaction. We consider two probabilistic approaches to combining source reliabilities, where interactions are events and information sources are experts. Both assume independence of experts, which we address by separating information by experiment type. The first is a noisy-OR model (NoisyOR), used by several groups11,12, which interprets rE as the probability of interaction according to expert E and calculates the consensus unreliability of the experts: Pr(e) = 1 − ΠE (1 − rE), with the product over experts E contributing edge e. The second is a well-known result in statistics for computing consensus likelihoods from a collection of experts, the Linear Opinion Pool (LinOP)13: Pr(e) = ΣE αE PrE(e). The αE are nonnegative expert weights that sum to 1. Our rE correspond roughly to αE, since PrE(e) = 1 for each edge offered by expert E and no information otherwise; we must renormalize over the applicable experts per edge. To our knowledge, this is the first formal identification of LinOP from the statistics community for use in biological problems.
Assessing reliability using consensus has precedence in medical decision making31. We use r'E = rE / (maxF{rF} + 1) to create a valid probability value.
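A minimal sketch of the two gold-standard-based estimates follows, under the assumption that edge sets and the gold-standard set of 'linked' pairs are represented as Python sets of frozenset pairs; the function and variable names are illustrative, not taken from the authors' implementation.

```python
import math

def prop_gs(source_edges, linked):
    """PropGS: fraction of a source's pairs sharing a gold-standard designation."""
    hits = sum(1 for e in source_edges if e in linked)
    return hits / len(source_edges)

def loglik_gs(source_edges, linked, all_pairs):
    """LogLikGS: log odds of linkage among the source's pairs versus the prior odds.

    Assumes non-degenerate counts (neither probability is exactly 0 or 1)."""
    p_l = len(linked) / len(all_pairs)                                   # Pr(L)
    p_l_e = sum(1 for e in source_edges if e in linked) / len(source_edges)  # Pr(L|E)
    return math.log((p_l_e * (1 - p_l)) / ((1 - p_l_e) * p_l))
```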
3. Application to Biological Tasks
We use PropGS, LogLikGS and Cons to assign reliabilities to interaction sources, then use LinOP and NoisyOR with each assignment to obtain a confidence measure Pr(e) for each interaction e. We compare the various weighting strategies on two tasks which can make use of weighted interactions: inferring protein function and learning regulatory networks from gene expression data. In each task, we evaluate 1) whether incorporating reliability of interaction sources helps, 2) which reliability assignment method is better, and 3) which reliability combination method is better.
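The gold-standard-free Cons estimate and the two combination rules can be sketched as follows. The dictionary layout (source name mapped to a set of frozenset edges) is an assumption made for illustration, and the LinOP weighting shown is one plausible reading of the per-edge renormalization described above.

```python
from collections import Counter

def cons_reliability(sources):
    """Cons: for each source E, the mean over its edges of the number of
    sources n_e asserting that edge (no gold standard needed)."""
    support = Counter(e for edges in sources.values() for e in edges)
    return {name: sum(support[e] for e in edges) / len(edges)
            for name, edges in sources.items()}

def noisy_or(sources, r_prob):
    """Pr(e) = 1 - prod_E (1 - r_E) over the experts E asserting edge e."""
    scores = {}
    for name, edges in sources.items():
        for e in edges:
            scores[e] = 1.0 - (1.0 - scores.get(e, 0.0)) * (1.0 - r_prob[name])
    return scores

def lin_op(sources, r_prob):
    """Linear Opinion Pool: Pr(e) = sum_E alpha_E Pr_E(e), here with weights
    proportional to r_E and Pr_E(e) = 1 for edges the expert asserts."""
    total = sum(r_prob.values())
    scores = {}
    for name, edges in sources.items():
        for e in edges:
            scores[e] = scores.get(e, 0.0) + r_prob[name] / total
    return scores

# Usage sketch: r = cons_reliability(sources); map r into (0, 1) following the
# footnoted construction, r_prob = {s: r[s] / (max(r.values()) + 1) for s in r},
# before calling noisy_or or lin_op.
```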
3.1. Protein Function Prediction
Protein function prediction methods include machine learning32 and graph-theoretic methods12. Since most machine learning methods do not use pairwise information, we only consider graph-theoretic approaches. The popular Majority1 algorithm is based on 'guilt by association', whereby an unknown protein is assigned the (weighted) majority function of its neighbors in a protein network. As a baseline, Unweighted refers to the use of uniform edge weights while Weighted refers to the use of Pr(e) as edge weights. One disadvantage of Majority is revealed when a node is connected to many proteins with unknown function. The FunctionalFlow12 algorithm overcomes this difficulty by considering a larger neighborhood around a node. For each of the yeast and mouse genomes, we obtain a set of reliability assignments rE by applying PropGS, LogLikGS and Cons to the available set of interaction types, excluding implicit types based on function or pathway assignments, such as GO:FUNCTION. We use the MIPS Functional Catalog23 for yeast and KEGG Pathways29 for mouse as gold standards in PropGS and LogLikGS, as well as gold standards for evaluating performance (described below). For yeast, we find an extremely high correlation between the set of reliabilities assigned by PropGS and LogLikGS (r = 0.97), with slightly lower correlation to Cons at r = 0.870 and r = 0.874, respectively. For mouse, the corresponding values were r = 0.75, r = 0.38, and r = 0.64. Given a set of reliabilities rE, we obtain consensus interaction probabilities Pr(e) using LinOP and NoisyOR for use as edge weights in the function prediction algorithms Weighted Majority and FunctionalFlow.
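A compact sketch of the weighted guilt-by-association rule used by Majority follows; the graph and annotation containers are illustrative assumptions, and FunctionalFlow, which propagates scores over larger neighborhoods, is not reproduced here.

```python
from collections import defaultdict

def weighted_majority(protein, neighbors, annotations, edge_weight):
    """Assign the (weighted) majority function of a protein's annotated neighbors.

    neighbors[p]        : iterable of proteins adjacent to p in the network
    annotations[q]      : set of known functions for protein q (may be empty)
    edge_weight[(p, q)] : Pr(e) for the edge between p and q
    """
    votes = defaultdict(float)
    for q in neighbors[protein]:
        for func in annotations.get(q, set()):
            votes[func] += edge_weight[(protein, q)]
    return max(votes, key=votes.get) if votes else None
```

Setting every edge weight to 1 recovers the Unweighted baseline; using Pr(e) gives the Weighted variant evaluated below.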
Figure 1. ROC analysis on a Fixed Topology of 126k edges: (a) Yeast Majority, (b) Yeast FunctionalFlow, (c) Mouse Majority, (d) Mouse FunctionalFlow. Each panel plots correct predictions (TP) against proteins predicted incorrectly (FP) for Unweighted Majority and for the Cons, LogLikGS and PropGS assignments combined with LinOP or NoisyOR.
We use two-fold cross-validation in which functions (pathways) for half of the proteins in the graph are hidden and then predicted from function (pathway) assignments to the other half. Correctness of multiple function (pathway) predictions is decided by majority. As done in Nabieva et al.12, we calculate a modified ROC curve, showing the number of incorrect predictions (FP) versus the number of correct predictions (TP) as the prediction score threshold varies. Figure 1 considers the relative performance of each Pr(e) assignment to edges for a fixed topology, allowing us to evaluate weighting strategies on the same set of edges. Figures 1(a)-(b) show results in yeast using the prediction algorithms Majority and FunctionalFlow, respectively, while Figures 1(c)-(d) show the equivalent in mouse. A fixed topology in each organism is generated by choosing 126k edges at random among those supported by more than one interaction expert. We choose 126k to make the graph size
tractable for multiple runs, while requiring support from multiple experts counteracts the effect of promiscuous experts, like COESSENTIAL, which asserts > 50% of the total possible edges in yeast. To answer the first question of the value of using weights based on relative reliability of experts, we compare the baseline Unweighted Majority curves in Figure 1 to the weighted alternatives. For most of the range of FP, Unweighted Majority identifies fewer TP than the weighted methods. The exceptions are the graphs for FunctionalFlow, where PropGS LinOP and the Cons variants in yeast (Fig. 1(b)) and the LogLikGS variants in mouse (Fig. 1(d)) sometimes perform worse than Unweighted Majority. This can occur when a larger neighborhood involves related but not identical functions, in which case the actual value of the edge weight determines the quality of the weighting strategies - the topic of our second evaluation question.
Figure 2. ROC analysis using a Fixed Probability Threshold: (a) Yeast Majority (Pr(e) > 0.54), (b) Yeast FunctionalFlow, (c) Mouse Majority (Pr(e) > 0.22), (d) Mouse FunctionalFlow. Legends give the resulting graph sizes: in yeast, Cons LinOP 21k, Cons NoisyOR 339k, LogLikGS LinOP 3k, LogLikGS NoisyOR 52k, PropGS LinOP 4k and PropGS NoisyOR 50k edges; in mouse, Cons LinOP 204k, Cons NoisyOR 213k, LogLikGS LinOP 306k, LogLikGS NoisyOR 306k, PropGS LinOP 511 and PropGS NoisyOR 5k.
The LogLikGS reliability assignments show the best performance in yeast, yet the worst performance in mouse. Since LogLikGS is similar to PropGS corrected for background linkage distributions, their relative performance suggests this may be due to different background distributions (a prior expectation of linkage Pr(L) of 0.15 in yeast versus 0.01 in mouse). In fact, we found that the numerical edge weights were nearly identical for both methods in yeast, while in mouse, LogLikGS edge weights were generally twice the value of PropGS weights. Also, the yeast graph has a maximum of 77 neighbors while mouse has a maximum of 348, an enormous difference in size of neighborhoods which, together with a difference in weightings, allows FunctionalFlow to propagate a lot more noisy predictions. The Cons variants perform the best overall in mouse, suggesting better overlap of information from sources in mouse compared to those in yeast, even though mouse has fewer sources. Even in yeast, the Cons results, which do not use a function/pathway-based gold standard, are comparable to PropGS, which does. In fact, these results suggest capturing a more diverse notion of 'interaction' using Cons still proves successful for the task of function prediction. Together, these results suggest Cons is a valuable alternative to LogLikGS and PropGS in less-studied organisms, where including diverse types of interaction information is critical. For the third question of whether to use NoisyOR or LinOP to combine source reliabilities, the NoisyOR variants invariably have slightly higher performance than LinOP. For a given interaction, the value assigned by NoisyOR will be greater than by LinOP given the same set of reliability assignments to sources. In this task, this bias causes NoisyOR to make the same prediction as LinOP but at a higher threshold, accounting for the slight vertical shift between the two curves. The effect of this shift in distribution is the subject of the next figure. The effect of different edge distributions Pr(e) can be seen by fixing a probability threshold and allowing only edges which exceed the threshold. Results using Pr(e) > 0.54 in yeast and Pr(e) > 0.22 in mouse are shown in Figure 2 (legends indicate graph size per method). Shorter curves mean fewer predictions were made, a comment on the connectivity. As noted above, the LinOP variants will include fewer edges than NoisyOR for a given threshold, though here we see little performance difference between the two for all methods, except Cons (Fig. 2). This difference arises due to the large size of Cons NoisyOR (339k edges) versus the others (mean 26k) in combination with the neighborhood-based FunctionalFlow; for sparse graphs the immediate neighborhood is equivalent to the extended neighborhood,
making FunctionalFlow nearly equivalent to Majority. In mouse, LinOP and NoisyOR yield similar graph sizes, so we do not see this effect repeated. Again, Cons performs strongly in mouse, suggesting this non-gold-standard-based approach will be valuable in less well-studied organisms.
3.2. Learning Regulatory Networks
Bayesian networks (BN) are a popular modelling formalism for learning regulatory networks from gene expression data (see Pe'er et al.33 for an excellent example). A BN has two components: a directed acyclic graph (DAG) capturing dependencies between variables, and a set of conditional probability distributions (CPDs) local to each node. Nodes represent expression values, arcs represent potential regulatory relationships, and the CPDs quantify those relationships. Algorithms to learn BNs from data can use prior knowledge about the probability of arcs, such as our Pr(e). Learning performs an iterative search starting from an initial graph, exploring the space of DAGs by removing or adding a single edge, choosing the best-scoring model among these one-arc changes, and terminating when no further improvement in score can be made. Each candidate model is scored with respect to the log-likelihood (LL) of the data, e.g. how well the CPDs capture dependencies inherent in the expression data. To evaluate the quality of a search, we obtain a single performance measure as follows. Given a starting model, we obtain an LL-trace of the best model chosen at each iteration and average the trace over all iterations. We repeat this process for a set of starting models sampled from some distribution, and average the average LL-trace over all models. Starting models are sampled either from an informed structural prior (our Pr(e)), or an uninformed prior which asserts uniform probability over edges. A high average LL-trace value for a given prior indicates that searches using that prior consistently explore high-scoring models. Using the yeast genome, as before we create informed structural priors Pr(e) using all interaction sources (including functional/pathway sources) together with the Cons, PropGS and LogLikGS methods to assign reliabilities (again, KEGG is the gold standard for the latter two) and the LinOP and NoisyOR methods to combine reliabilities. We learn Bayesian networks for 50 genes using an expression dataset covering 1783 yeast microarray experiments (see refs. in Tanay et al.34). We also create priors using edge reliabilities calculated by other groups, namely STRING11 (a PropGS NoisyOR (on
experts different than ours) for predicting protein complexes) and MAGIC35 (a hand-crafted BN for predicting function). Both use expression data as experts. As baselines, we include a uniform reliability assignment over experts (Unif5) and two random reliability assignments (Rand1 and Rand2). Figure 3 shows the LL-trace averages, scaled to give Uninformed the value x = 0. The worst overall performance by Uninformed demonstrates the value of using priors based on weighted reliabilities. The poor performance of the remaining baseline variants demonstrates the effect of neglecting to assign (Unif) or incorrectly assigning (Rand) reliability to interaction sources. Note NoisyOR performs worse than LinOP for the baseline priors, yet performs better for the non-baseline variants. This repeats the effect seen in the function prediction task, where NoisyOR assigns higher values than LinOP. Here, the performance difference indicates that LinOP is more robust to errors in reliability assignment than NoisyOR. The strength of STRING, LogLikGS and MAGIC is due in part to having few high probabilities and many low probabilities in the corresponding Pr(e), in contrast with the more evenly distributed Pr(e) for the other methods. Such conservatism allows the Bayesian learner to strongly preserve only the highest confidence edges while remaining flexible for the others. Performance of the Cons variants is comparable to PropGS for this task as well, demonstrating the utility of our method, which does not require a gold standard.
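The evaluation measure behind Figure 3 can be summarized as in the sketch below: for each sampled starting structure, a greedy search records the log-likelihood of the best model at every iteration, the trace is averaged over iterations, and these averages are averaged over starting models. The callables sample_initial_dag, best_one_arc_change and log_likelihood are hypothetical placeholders for whatever BN learner is used.

```python
def average_ll_trace(prior, data, n_starts, sample_initial_dag,
                     best_one_arc_change, log_likelihood):
    """Average over starting models of the per-search average log-likelihood
    of the best model chosen at each greedy iteration."""
    per_start = []
    for _ in range(n_starts):
        model = sample_initial_dag(prior)        # sample a start from the structural prior
        trace = [log_likelihood(model, data)]
        while True:
            candidate, ll = best_one_arc_change(model, data)
            if ll <= trace[-1]:                  # no one-arc change improves the score
                break
            model = candidate
            trace.append(ll)
        per_start.append(sum(trace) / len(trace))
    return sum(per_start) / len(per_start)
```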
Figure 3. Average of the log-likelihood trace over all iterations, shown for priors built from STRING, LogOdds NoisyOR, LogOdds LinOP, MAGIC, PropGS NoisyOR, Cons NoisyOR, Cons LinOP, PropGS LinOP, Rand2 LinOP, Rand1 LinOP, Unif5 NoisyOR, Unif5 LinOP, Rand2 NoisyOR, Rand1 NoisyOR, and Uninformed.
4. Conclusions
Our results show that the Cons method for assigning reliability to interaction sources is an attractive alternative to existing methods and has the added advantage of not requiring a gold standard for assessment. In the task of predicting protein function, we demonstrated the effectiveness of using weighting strategies, where Cons proved competitive against other methods
which have an unfair advantage of using the same gold standard used for evaluation. For the task involving regulatory networks, we showed that learning greatly benefits from correctly informed estimates of reliability. Again, Cons was comparable to the other methods. We introduced LinOP as an alternative method for combining reliabilities and demonstrated its performance to be comparable to NoisyOR in most tasks and more robust to errors in others.
References
1. B. Schwikowski et al., Nature Biotech. 18, 1257 (2000).
2. H. Hishigaki et al., Yeast 18, 523 (2001).
3. A. M. Edwards et al., Trends Genet. 18, 529 (2002).
4. C. M. Deane et al., Mol. Cell Proteomics 1, 349 (2002).
5. E. Sprinzak et al., J. Mol. Biol. 327, 919 (2003).
6. J. S. Bader et al., Nature Biotech. 22, 78 (2004).
7. Y. Qi et al., NIPS Workshop on Comp. Bio. and Anal. of Het. Data (2005).
8. S. Asthana et al., Genome Res. 14, 1170 (2004).
9. I. Lee et al., Science 306, 1555 (2004).
10. D. R. Rhodes et al., Nature Biotech. 23, 951 (2005).
11. C. von Mering et al., Nucl. Acids Res. 33, D433 (2005).
12. E. Nabieva et al., Bioinformatics 21, i302 (2005).
13. C. Genest and J. V. Zidek, Statistical Science 1, 114 (1986).
14. S. Suthram et al., BMC Bioinformatics 7, 360 (2006).
15. I. Xenarios et al., Nuc. Acids Res. 30, 303 (2002).
16. G. Bader et al., Nuc. Acids Res. 29, 242 (2001).
17. H. Hermjakob et al., Nuc. Acids Res. 32, D452 (2004).
18. C. Stark et al., Nuc. Acids Res. 34, D545 (2006).
19. T. I. Lee et al., Science 298, 799 (2002).
20. E. Wingender et al., Nuc. Acids Res. 28, 316 (2000).
21. J. C. Mellor et al., Nuc. Acids Res. 30, 306 (2002).
22. J. T. Eppig et al., Nuc. Acids Res. 33, D471 (2005).
23. H. W. Mewes et al., Nuc. Acids Res. 30, 31 (2002).
24. N. Hulo et al., Nuc. Acids Res. 32, 134 (2004).
25. A. Bateman et al., Nuc. Acids Res. 32, D138 (2004).
26. N. J. Mulder et al., Nuc. Acids Res. 33, D201 (2005).
27. P. T. Spellman et al., Mol. Bio. Cell 9, 3273 (1998).
28. M. Ashburner et al., Nature Genet. 25, 25 (2000).
29. M. Kanehisa et al., Nuc. Acids Res. 34, D354 (2006).
30. K. D. Dahlquist et al., Nature Genet. 31, 19 (2002).
31. S. C. Weller and N. C. Mann, Medical Decision Making 17, 71 (1997).
32. G. R. G. Lanckriet et al., PSB 9, 300 (2004).
33. D. Pe'er et al., Bioinformatics 17 Suppl. 1, S215 (2001).
34. A. Tanay et al., Molecular Systems Biology (2005).
35. O. G. Troyanskaya et al., PNAS 100, 8348 (2003).
PROBABILISTIC MODELING OF SYSTEMATIC ERRORS IN TWO-HYBRID EXPERIMENTS
DAVID SONTAG*, ROHIT SINGH*, BONNIE BERGER†‡
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge MA 02139
E-mail: {dsontag, rsingh, bab}@mit.edu
We describe a novel probabilistic approach to estimating errors in two-hybrid (2H) experiments. Such experiments are frequently used to elucidate protein-protein interaction networks in a high-throughput fashion; however, a significant challenge with these is their relatively high error rate, specifically, a high false-positive rate. We describe a comprehensive error model for 2H data, accounting for both random and systematic errors. The latter arise from limitations of the 2H experimental protocol: in theory, the reporting mechanism of a 2H experiment should be activated if and only if the two proteins being tested truly interact; in practice, even in the absence of a true interaction, it may be activated by some proteins - either by themselves or through promiscuous interaction with other proteins. We describe a probabilistic relational model that explicitly models the above phenomenon and use Markov Chain Monte Carlo (MCMC) algorithms to compute both the probability of an observed 2H interaction being true as well as the probability of individual proteins being self-activating/promiscuous. This is the first approach that explicitly models systematic errors in protein-protein interaction data; in contrast, previous work on this topic has modeled errors as being independent and random. By explicitly modeling the sources of noise in 2H systems, we find that we are better able to make use of the available experimental data. In comparison with Bader et al.'s method for estimating confidence in 2H predicted interactions, the proposed method performed 5-10% better overall, and in particular regimes improved prediction accuracy by as much as 76%.
Supplementary Information: http://theory.csail.mit.edu/probmod2H
1. Introduction
The fundamental goal of systems biology is to understand how the various components of the cellular machinery interact with each other and the environment. In pursuit of this goal, experiments for elucidating protein-protein interactions (PPI) have proven to be one of the most powerful tools available. Genome-wide, high-throughput PPI experiments have started to
* These authors contributed equally to the work.
† Corresponding author.
‡ Also in the MIT Dept. of Mathematics.
provide data that has already been used for a variety of tasks: for predicting the function of uncharacterized proteins; for analyzing the relative importance of proteins in signaling pathways; and for new perspectives in comparative genomics, by cross-species comparisons of interaction patterns. Unfortunately, the quality of currently available PPI data is unsatisfactory, which limits its usefulness to some degree. Thus, techniques that enhance the availability of high-quality PPI data are of value. In this paper, we aim to improve the quality of experimentally available PPI data by identifying erroneous datapoints from PPI experiments. We attempt to move beyond current one-size-fits-all error models that ignore the experimental source of a PPI datapoint; instead, we argue that a better error model will also have components tailored to account for the systematic errors of specific experimental protocols. This may help achieve higher sensitivity without sacrificing specificity. This motivated us to design an error model tailored to one of the most commonly used PPI experimental protocols. We specifically focus on data from two-hybrid (2H) experiments6,4, which are one of the most popular high-throughput methods to elucidate protein-protein interaction. Data from 2H experiments forms the majority of the known PPI data for many species: D. melanogaster, C. elegans, H. sapiens, etc. However, currently available 2H data also has unacceptably high false-positive rates: von Mering et al. estimate that more than 50% of 2H interactions are spurious11. These high rates of error seriously hamper the ability to perform analyses of the PPI data. As such, we believe an error model that performs better than existing models, even if it is tailored to 2H data, is of significant practical value, and may also serve as an example for the development of error models for other biological experiments. Ideally, the reporting mechanism in a 2H experiment is activated if and only if the pair of proteins being tested truly interact. As in most experimental protocols, there are various sources of random noise. However, there are also systematic, repeatable errors in the data, originating from limitations in the 2H protocol. In particular, there exist proteins that are disproportionately prone to be part of false-positive observations (Fig. 1). It is thought that these proteins either activate the reporting mechanism by themselves or promiscuously bind with many other proteins in the particular setup (promiscuous binding is an experimental artifact; it does not imply a true interaction under plausible biological conditions). Contributions: The key contribution of this paper is a comprehensive error model for 2H experiments, accounting for both random as well as systematic errors, which is guided by insights into the systematic errors of the 2H experimental protocol.
Figure 1: The origin of systematic errors in 2H data. The cartoons demonstrate the mechanism of 2H experiments. Protein A is fused to the DNA binding domain of a particular transcription factor, while protein B is fused to the activation domain of that transcription factor. If A and B physically interact then the combined influence of their respective enhancers results in the activation of the reporter gene. Systematic errors in such experiments may arise: false negatives occur when two proteins which interact in vivo fail to activate the reporter gene under experimental conditions. False positives may occur due to proteins which trigger the reporting mechanism of the system, either by themselves (self-activation) or by spurious interaction with other proteins (promiscuity). Spurious interaction can occur when a protein is grossly over-expressed. In the figure, protein A in the lower right panel is such a protein: it may either promiscuously bind with B or activate the reporting mechanism even in the absence of B.
We believe this is the first model to account for both sources of error in a principled manner; in contrast, previous work on estimating error in PPI data has assumed that the error in 2H experiments (as in other experiments) is independent and random. Another contribution of the paper is estimates of proteins especially likely to be self-activating/promiscuous (see Supp. Info.). Such estimates of "problem proteins" may enable the design of 2H experimental protocols which have lower error rates. We use the framework of Bayesian networks to encode our assumption that a 2H interaction is likely to be observed if the corresponding protein pair truly interacts or if either of the proteins is self-activating/promiscuous. The Bayesian framework allows us to represent the inherent uncertainty and the relationship between promiscuity of proteins, true interactions and observed 2H data, while using all the available data to simultaneously learn the model parameters and predict the interactions. We use a Markov Chain Monte Carlo (MCMC) algorithm to do approximate probabilistic inference in our models, jointly inferring both desired sets of quantities: the probability of interaction, and the propensity of a protein for self-activation/promiscuity. We show how to integrate our error model into the two most common
probabilistic models used for combining PPI experimental data, and show that our error model can significantly improve the accuracy of PPI prediction. Related Work: With data from the first genome-wide 2H experiments (Ito et al.6, Uetz et al.4), there came the realization that 2H experiments may have significant systematic errors. Vidalain et al. have identified the presence of self-activators as one of the sources of such errors, and described some changes in the experimental setup to reduce the problem10. Our work aims to provide a parallel, computational model of the problem, allowing post facto filtering of data, even if the original experiment retained the errors. The usefulness of such an approach was recently demonstrated by Sun et al.2 (to reconstruct transcriptional regulatory networks). Previous computational methods of modeling systematic errors in PPI data can be broadly classified into two categories. The first class of methods5,11,8 exploits the observation that if two very different experimental setups (e.g. 2H and Co-IP) observe a physical interaction, then the interaction is likely to be true. This is a reasonable assumption to make because the systematic errors of two different experimental setups are likely to be independent. However, this approach requires multiple costly and time-consuming genome-wide PPI experiments, and may still result in missed interactions, since the experiments have high false-negative rates. Many of these approaches also integrate non-PPI functional genomic information, such as co-expression, co-localization, and Gene Ontology functional annotation. The second class of methods is based on the topological properties of the PPI networks. Bader et al.1, in their pioneering work, used the number of 2H interactions per protein as a negative predictor of whether two proteins truly interact. Since the prior probability of any interaction is small, disproportionately many 2H interactions involving a particular protein could possibly be explained by it being self-activating or promiscuous. However, such an approach is unable to make fine-grained distinctions: an interaction involving a high-degree protein need not be incorrect, especially if there is support for it from other experiments. Furthermore, the high degree of a promiscuous protein in one experiment (e.g. Ito et al.'s) should not penalize interactions involving that protein observed in another experiment (e.g. Uetz et al.'s) if the errors are mostly independent (e.g. they use different reporters). Our proposed probabilistic models solve all of these problems.
2. Data Sets
One difficulty with validating any PPI prediction method is that we must have a gold standard from which to say whether two proteins interact or do
not interact. We constructed a gold standard data set of protein-protein interactions in S. cerevisiae (yeast) from which we could validate our methods. Our gold standard test set is an updated version of Bader et al.'s data. Bader et al.'s data consisted of all published interactions found by 2H experiments; data from experiments by Uetz et al.4 (the UETZ2H data set) and Ito et al.6 (the ITO2H data set) comprised the bulk of the data set. They also included as possible protein interactions all protein pairs that were of distance at most two in the 2H network. Bader et al. then used published Co-Immunoprecipitation (Co-IP) data to give labels to these purported interactions. When two proteins were found in a bait-hit or hit-hit interaction in Co-IP, they were labeled as having a true interaction. When two proteins were very far apart in the Co-IP network (distance larger than three), they were labeled as not interacting. We updated Bader et al.'s data to include all published 2H interactions through February 2006, getting our data from the MIPS7 database. We added, for the purposes of evaluation, recently published yeast Co-IP data from Krogan et al.3. This allowed us to significantly increase the number of labeled true and false interactions in our data set. Since the goal of our algorithms is to model the systematic errors in large-scale 2H experiments, we evaluated our models' performance on the test data where at least one of UETZ2H or ITO2H indicated an interaction. We were left with 397 positive examples, 2298 negative examples, and 2366 unlabeled interactions. We randomly chose 397 of the 2298 negative examples to be part of our test set. For all of the experiments we performed 4-fold cross-validation on the test set, hiding one fourth of the labels while using the remaining labeled data during inference.
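The labeling rule described above can be sketched against a Co-IP graph; networkx is used here only for shortest-path distance, and the container names are illustrative assumptions rather than the authors' data structures.

```python
import networkx as nx

def label_pair(coip_graph, p1, p2, copurified):
    """Label a candidate 2H interaction using Co-IP evidence.

    copurified holds bait-hit or hit-hit pairs as frozensets;
    coip_graph is the Co-IP network."""
    if frozenset((p1, p2)) in copurified:
        return 1                                  # true interaction
    try:
        d = nx.shortest_path_length(coip_graph, p1, p2)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        d = float("inf")
    return 0 if d > 3 else None                   # far apart: negative; otherwise unlabeled
```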
3. Probabilistic Models
We show how to integrate our model of systematic errors into the two most common probabilistic models used for PPI prediction. Our first model is complementary to the relational probabilistic model proposed by Jaimovich et al.8, and can be easily integrated into their approach. Our second model is an extension of Bader et al.'s, and will form the basis of our comparison. Our models also adjust to varying error rates in different experiments. For instance, while we account for random noise and false negatives in our error model for both UETZ2H and ITO2H, we only model self-activation/promiscuity for ITO2H observations. The UETZ2H data set was smaller and included only one protein with degree larger than 20; ITO2H had 36 proteins with degree larger than 30, including one with degree as high as 285. Thus, while modeling promiscuity made a big difference for the ITO2H
data, it did not significantly affect our results on the UETZ2H data.
3.1. Generative model
We begin with a simplified model of PPI interaction (Fig. 2). We represent the uncertainty about a protein interaction as an indicator random variable Xij, which is 1 if proteins i and j truly interact, and 0 otherwise. For each experiment, we construct corresponding random variables (RVs) indicating if i and j have been observed to interact under that experiment. Thus, Uij is the observed random variable (RV) representing the observation from UETZ2H, and Iij is the observed RV representing the observation from ITO2H. The arrow from Xij to Iij indicates the dependency of Iij on Xij. The box surrounding the three RVs indicates that this template of three RVs is repeated for all i, j = 1, ..., N (i.e. all pairs of proteins), where N is the number of proteins. In all models of this type, the Iij RVs are assumed to be independent of one another. If an experiment provides extra information about each observation, the model can be correspondingly enriched. For instance, for each of their observed interactions Ito et al. provide the number of times the interaction was discovered (called the number of IST hits). Rather than making Iij binary, we have it equal the number of IST hits, or 3 if IST > 3. We will refer to the portion of ITO2H observations with IST > 3 as ITOCORE. The model is called "generative" because the ground truth about the interaction, Xij, generates the observations in the 2H experiments, Iij and Uij. To our knowledge, all previous generative models of experimental interactions made the assumption that Iij depended only on Xij. They allowed for false positives by saying that Pr(Iij > 0 | Xij = 0) = δfp, where δfp is a parameter of their model. Similarly, they allowed for false negatives by saying that Pr(Iij = 0 | Xij = 1) = δfn, for another parameter δfn. However, these models are missing much of the picture. For example, many experiments have particular difficulty testing the interactions of proteins along the membrane. For these proteins, δfn should be significantly higher. In the 2H experiment, for interactions that involve self-activating/promiscuous proteins, δfp will be significantly higher. In Fig. 3, we propose a novel probabilistic model in which the self-activating/promiscuous tendencies of particular proteins are explicitly modeled. The latent Bernoulli RV Fk is 1 if protein k is believed to be promiscuous or self-activating. In the context of our data set, this RV applies specifically to the ITO2H data; if self-activation/promiscuity in multiple experiments
Clear nodes are unobserved (latent) RVs, and shaded nodes are observed RVs.
Figure 2: Generative model.
Figure 3: Generative model, with noise variables.
is to be modeled, we may introduce multiple such variables Fk,H (for protein k and experiment H). The Iij RV thus depends on Fi and Fj. Intuitively, Iij will be > 0 if either Xij = 1 or Fi = 1. As we show later in the Results section, this model of noise is significantly more powerful than the earlier model, because it allows for the "explaining away" of false positives in ITO2H. Furthermore, it allows evidence from data sets other than ITO2H to influence (through the Xij RVs) the determination of the Fk RVs. We also added the latent variables OUij and OIij, which will be 1 if the Uetz et al. and Ito et al. experiments, respectively, have the capacity to observe a possible interaction between proteins i and j. These RVs act to explain away the false negatives in UETZ2H and ITO2H. We believe that these RVs will be particularly useful for species where we have relatively little PPI data. The distributions in these models all have Dirichlet priors (θ) with associated hyperparameters α (see Supp. Info. for more details). There are many advantages to using the generative model described in this section. First, it can easily handle missing data without adding complexity to the inference procedure. This is important when integrating additional experimental data into the model. Suppose, for example, that we use gene expression correlation as an additional signal of protein interaction, by introducing new RVs Eij (indicating coexpression of genes i and j) and corresponding edges Xij -> Eij. If, for a pair of proteins, the coexpression data is unavailable, we simply omit the corresponding Eij from this model. In Bader et al.'s model, and the second model that we propose below, we would need to integrate over possible values of the missing datapoint, a potentially complicated task. Second, this generative model can be easily extended: e.g., we could easily combine this model with Jaimovich et al.'s in order to model the common occurrence of transitive closure in PPIs.
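A small sketch of the conditional distribution implied by the noise variables follows: an observation is likely to be positive if the pair truly interacts or if either protein is self-activating/promiscuous. The specific probability values below are illustrative placeholders, not the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_observation(x_ij, f_i, f_j, delta_fn=0.7, delta_fp=0.02, p_noise=0.9):
    """Sample a binary 2H observation I_ij given the true interaction X_ij and
    the promiscuity indicators F_i, F_j (illustrative parameter values)."""
    if f_i or f_j:
        p = p_noise           # self-activating/promiscuous protein: likely spurious signal
    elif x_ij:
        p = 1.0 - delta_fn    # true interaction, allowing for false negatives
    else:
        p = delta_fp          # random false positive
    return int(rng.random() < p)
```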
Figure 4: Bader et al.'s logistic regression model (BaderLR).
Figure 5: Our Bayesian logistic model, with noise variables (BayesLR).
3.2. Bayesian logistic model
In Fig. 4 we show Bader et al.'s model (BADERLR); it includes three new variables in addition to the RVs already mentioned, whose values are pre-calculated using the 2H network. Two of these encode topological information: variable Aij is the number of adjacent proteins in common between i and j, and variable Dij is ln(di + 1) + ln(dj + 1), where di is the degree of protein i. Variable Lij is an indicator variable for whether this protein interaction has been observed in any low-throughput experiments. In Bader et al.'s model, ICij is an indicator variable representing whether the interaction between proteins i and j was in the ITOCORE data set (IST > 3). Xij's conditional distribution is given by the logistic function:
Pr(Xij = 1) = 1 / (1 + exp(woffset + Uij wU + ICij wI + Lij wL + Aij wA + Dij wD))
The weights w are discriminatively learned using the Iterative Re-weighted Least Squares (IRLS) algorithm, which requires that all of the above quantities are observed in the training data. In Fig. 5 we propose a new model (BAYESLR), with two significant differences. First, we no longer use the two proteins' degree, Dij, and instead integrate our noise model in the form of the Fk random variables. Second, instead of learning the model using IRLS, we assign the weights uninformative priors and do inference via Markov Chain Monte Carlo (MCMC). This will be necessary because Xij will have an unobserved parent, Ĩij. The new RV Ĩij will be 1 when the Ito et al. experiment should be considered for predicting Xij. Intuitively, its value should be (Iij > 0) ∧ ¬(Fi ∨ Fj). However, to allow greater flexibility, we give the conditional distribution for Ĩij a Dirich-
let prior, resulting in a noisy version of the above logical expression. The RVs Oij are not needed in this logistic model because the parameterization of the Xij conditional distribution induces a type of noisy-OR distribution in the posterior. Thus, logistic models can easily handle false negatives. Because we wanted to highlight the advantages of modeling the experimental noise, we omitted Aij (one-hop) from both models, BAYESLR and BADERLR. The one-hop signal, gene expression, co-localization, etc. can easily be added to any of the models to improve their prediction ability.
3.3. Inference
As is common in probabilistic relational models, the parameters for the conditional distributions of each RV are shared across all of their instances. For example, in the generative model, the prior probability Pr(Xij = 1) is the same for all i and j. With the exception of Xij in BAYESLR, we gave all the distributions a Dirichlet prior. In BAYESLR, the conditional distribution of Xij is the logistic function, and its weights are given Gaussian priors with mean μx = 0 and variance σx = 0.01. Note that by specifying these hyperparameters (e.g. μx, σx), we never need to do learning of the parameters (i.e., weights). Given the relational nature of our data, and the relatively small amount of it, we think that this Bayesian approach is well-suited. We prevent the models from growing too large by only including protein pairs where at least one experiment hinted at an interaction. We used BUGS9 to do inference via Gibbs sampling. We ran 12 MCMC chains for 6000 samples each, from which we computed the desired marginal posterior probabilities. The process is simple enough that someone without much knowledge of machine learning could take our probabilistic models (which we provide in the Supplementary Information) and use them to interpret the results of their 2H experiments. We also tried using loopy belief propagation instead of MCMC to do approximate inference in the generative model of Fig. 3. These results (see Supp. Info.) were very similar, showing that we are likely not being hurt by our choice of approximate inference method. Furthermore, our implementation of the inference algorithm (in Java) takes only seconds to run, and would easily scale to larger problems.
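The deterministic core of a BAYESLR-style likelihood can be sketched as below, assuming fixed weights for clarity (in the paper the weights are sampled by MCMC) and using a hard version of the Ito trust indicator rather than its Dirichlet-smoothed form; the weight values in the usage comment are the average MAP weights reported in the Results section.

```python
import math

def prob_interaction(u_ij, i_ij, l_ij, f_i, f_j, w):
    """Pr(X_ij = 1) under a BayesLR-style logistic model with noise variables."""
    # Trust the Ito observation only if neither protein looks promiscuous
    # (hard version of the smoothed logical expression in Section 3.2).
    i_tilde = 1 if (i_ij > 0 and not (f_i or f_j)) else 0
    # Sign convention follows the logistic form given in Section 3.2, so
    # evidence variables carry negative weights.
    z = w["offset"] + u_ij * w["u"] + i_tilde * w["i"] + l_ij * w["l"]
    return 1.0 / (1.0 + math.exp(z))

# Illustrative use with the average MAP weights reported later:
w_map = {"u": -2.32, "l": -10.85, "i": -4.26, "offset": 7.34}
print(prob_interaction(1, 3, 0, False, False, w_map))
```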
4. Results
We compared the proposed Bayesian logistic model (BAYESLR) with the model based on Bader et al.'s work (BADERLR). Both models were trained and tested on the new, updated version of Bader et al.'s gold standard data set. We show in Fig. 6 that BAYESLR achieves 5-10% higher accuracy
at most points along the ROC curve. We then checked to see that the improvement was really coming from the noise model, and not just from our use of unlabeled data and MCMC. We tried using a modified BAYESLR model (called Bayesian Bader) which has Dij RVs instead of the noise model, and which uses ITOCORE instead of ITO2H. As expected, it performed the same as BADERLR. We also tried modifying this model to use ITO2H, and found that the resulting performance was much worse. Investigating this further, we found that the average maximum a posteriori (MAP) weights for BAYESLR were {wU = -2.32, wL = -10.85, wI = -4.26, and woffset = 7.34}. The weight corresponding to ITO2H is almost double the weight for UETZ2H. Interestingly, this is a similar ratio of weights as would be learned had we only used the ITOCORE data set, as in BADERLR. In the last of the above-mentioned experiments, the MAP weight for ITO2H was far smaller than the weight for UETZ2H, which indicates that UETZ2H was a stronger signal than ITO2H. Overall, these experiments demonstrate that we can get significantly better performance using data with many false positives (ITO2H) and a statistical model of the noise than by using prefiltered data (ITOCORE) and no noise model. In all regimes of the ROC curve, BAYESLR performs at least as well as BADERLR; in some, it performs significantly better (Fig. 8). The examples that follow demonstrate the weaknesses inherent in BADERLR and show how the proposed model BAYESLR solves these problems. When IRLS learns the weight for the degree variable (in BADERLR), it must trade off having too high a weight, which would cause other features to be ignored, and having too low a weight, which would insufficiently penalize the false positives caused by self-activation/promiscuity. In BADERLR, a high degree Dij penalizes positive predictors from all the experiments (Uij, Iij, Lij). However, the degree of a protein in a particular experiment (say, Ito et al.'s) only gives information about self-activation/promiscuity of the protein in that experiment. Thus, if a protein has a high degree in one experiment, even if that experiment did not predict an interaction (involving some other protein), the degree will negatively affect any predictions made by other experiments on that protein. Our proposed models solve this problem by giving every experiment a different noise model, and by having each noise model be conditionally independent given the Xij variables. Thus, we get the desired property that noise in one experiment should not affect the influence of other experiments on the Xij variables. Fig. 8(a) illustrates this by showing the prediction accuracy for the test points where Dij > 4 and Uij = 1 or Lij = 1 (called the 'medium' degree range).
Figure 6: Comparison of logistic models.
Figure 7: Comparison of generative models.
Figure 8: Examples of regimes where the noise model is particularly helpful. In parentheses we give the number of test cases that fall into each category. Panels: (a) medium degree and UETZ2H = 1 or L_ij = 1 (20 neg and 115 pos); (b) high degree (107 neg and 50 pos); (c) no ITOCORE (312 neg and 211 pos); (d) no signal (286 neg and 04 pos).
When the degree of a protein is very high, BADERLR will always classify interactions involving it as false positives. Fig. 8(b) shows the setting of D_ij > 6 (recall that D_ij is on a log scale, and is the sum for both proteins). With a false positive rate of less than 1%, BADERLR detects
42% of the true interactions, while BAYESLR detects 74% of the true interactions, a 76% improvement. Bader et al. found that they got better performance by using only a subset (where IST >= 3) of the interactions in ITO2H. Our noise model allows us to make use of all of the predicted interactions, without hurting our overall results. As a result, our predictions for the protein pairs where Bader et al.'s model ignored ITO2H's interactions (i.e. IST < 3) are considerably more accurate. This is illustrated in Fig. 8(c). Finally, we show in Fig. 8(d) that at the very extreme, when neither ITOCORE, nor the low-throughput 2H experiments (L_ij), nor UETZ2H showed an interaction, we can still make meaningful predictions, using a combination of the noise model and the observed interactions in Ito et al. where IST < 3. We next compared the various generative models, with the results shown in Fig. 7. Naively implementing the generative model of Fig. 2, using an indicator variable for whether the interaction was observed in ITO2H, results in the worst performance. Changing the indicator variable to a discretized IST count significantly improves performance. Using our noise model (i.e. the model from Fig. 3) provides further improvements, especially in the lower left corner, where the previous two had performed poorly. However, if we remove the noise model and instead pre-filter the data as Bader et al. did, using an indicator variable for whether IST >= 3 in ITO2H, we can get almost as good performance using the simple generative model of Fig. 2. The noise model still does better in the upper half of the ROC curve, which is arguably where it matters the most. It is also interesting that our noise model is able to recover the accuracy of the hand-filtered IST >= 3 criterion. We then applied the BAYESLR model to the full data set to identify proteins in the ITO2H data which are likely to be self-activating/promiscuous (see Supplementary Information). As expected, most proteins with high degree in ITO2H (e.g. YPR086W, degree 99) had a high probability of being self-activating/promiscuous. However, three of the proteins with high degree (YER022W, degree 98; YGL127C, degree 68; and YGR218W, degree 34) had very low probabilities. These differences in promiscuity estimates make sense: for example, there were no positive labeled 2H examples involving YPR086W, while there were five involving YER022W. This propagation of information is precisely what we hoped to capture by using our Bayesian framework. When applying this model to new species where no labeled data is available, the inclusion of additional signals (e.g. co-expression) should result in the same effect. (Note that when no labeled data is available, it might be helpful to fix the model parameters to their MAP values from experiments on related species.)
5. Conclusion
In this paper, we have presented a principled approach to modeling the random and systematic sources of error in two-hybrid experiments, and showed how to integrate our noise models into the two most common probabilistic models for integrating PPI data. Comparisons with previous work demonstrate that explicit modeling of the sources of error can improve protein-protein interaction prediction, making better use of experimental data. Future work could involve discriminative training of the generative models, investigation of systematic sources of noise in other biological experiments such as Co-IP, and applying noise models to the Markov networks of Jaimovich et al. and possibly even in a first-order probabilistic model, where more intricate properties of proteins can be described and jointly predicted.
Acknowledgments: The authors thank Chris Bakal, Leslie Kaelbling, Luke Zettlemoyer, Dan Roy, and Tommi Jaakkola for useful comments. D.S. and R.S. were partially supported by an NSF Graduate Research Fellowship and NSF grant ITR (ASE + NIH)-(dms)-0428715, respectively.
References
1. J.S. Bader et al. Gaining confidence in high-throughput protein interaction networks. Nat Biotech, 22(1):78-85, January 2004.
2. N. Sun et al. Bayesian error analysis model for reconstructing transcriptional regulatory networks. PNAS, 103(21):7988-7993, 2006.
3. Nevan J. Krogan et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 2006.
4. Peter Uetz et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403(6770):623-627, February 2000.
5. R. Jansen et al. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449-453, October 2003.
6. Takashi Ito et al. A comprehensive two-hybrid analysis to explore the yeast protein interactome. PNAS, 98(8):4569-4574, 2001.
7. U. Guldener et al. CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Research, 33(Database issue):D364-8, 2005.
8. Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein-protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2):145-164, 2006.
9. D. J. Lunn, A. Thomas, N. G. Best, and D. J. Spiegelhalter. WinBUGS - a Bayesian modelling framework: concepts, structure and extensibility. Statistics and Computing, 10:321-333, 2000.
10. Pierre-Olivier Vidalain, Mike Boxem, Hui Ge, Siming Li, and Marc Vidal. Increasing specificity in high-throughput yeast two-hybrid experiments. Methods, 32:363-370, 2004.
11. Christian von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399-403, 2002.
PROSPECTIVE EXPLORATION OF BIOCHEMICAL TISSUE COMPOSITION VIA IMAGING MASS SPECTROMETRY GUIDED BY PRINCIPAL COMPONENT ANALYSIS
Raf Van de Plas 1,3,+, Fabian Ojeda 1,3, Maarten Dewil 5, Ludo Van Den Bosch 5, Bart De Moor 1,3 and Etienne Waelkens 2,3,4
1 Katholieke Universiteit Leuven, Department of Electrical Engineering (ESAT), SCD-SISTA (BIOI), Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium.
2 Katholieke Universiteit Leuven, Department of Molecular Cell Biology, Afd. Biochemie, O & N, Herestraat 49 - bus 901, B-3000 Leuven, Belgium.
3 Katholieke Universiteit Leuven, ProMeta, Interfaculty Centre for Proteomics and Metabolomics, O & N 2, Herestraat 49, B-3000 Leuven, Belgium.
4 Katholieke Universiteit Leuven, BioMacS, Interfaculty Centre for Biomacromolecular Structure, IRC KUL Campus Kortrijk, E. Sabbelaan 53, B-8500 Kortrijk, Belgium.
5 Katholieke Universiteit Leuven, Department of Neurosciences, Neurobiology, O & N 2, Herestraat 49, B-3000 Leuven, Belgium.
MALDI-based Imaging Mass Spectrometry (IMS) is an analytical technique that provides the opportunity to study the spatial distribution of biomolecules including proteins and peptides in organic tissue. IMS measures a large collection of mass spectra spread out over an organic tissue section and retains the absolute spatial location of these measurements for analysis and imaging. The classical approach to IMS imaging, producing univariate ion images, is not well suited as a first step in a prospective study where no a priori molecular target mass can be formulated. The main reasons for this are the size and the multivariate nature of IMS data. In this paper we describe the use of principal component analysis as a multivariate pre-analysis tool, to identify the major spatial and mass-related trends in the data and to guide further analysis downstream. First, a conceptual overview of principal component analysis for IMS is given. Then, we demonstrate the approach on an IMS data set collected from a transversal section of the spinal cord of a standard control rat.
Keywords: principal component analysis; imaging mass spectrometry; proteomics; bioinformatics; rat nerve tissue.
+ To whom correspondence should be addressed: raf.vandeplas@esat.kuleuven.be
1. Introduction
Mass spectrometry allows one to very accurately measure the molecular masses found in an unknown sample. It has become one of the primary analytical instruments in proteomics and peptidomics research, which are the studies of, respectively, proteins and peptides within the scope of an organism, tissue, cell, or organelle and under a set of known physiological and environmental conditions [1]. Most proteomics and peptidomics studies, however, disregard the exact spatial origin of a sample within tissue, focusing solely on identification and quantitation. A number of studies [2-6] have demonstrated that incorporating spatial information into the analysis can provide further insight into biological processes. The study of the spatial distribution of biomolecules in organic tissue requires that an explicit link is preserved between proteomics/peptidomics-oriented mass spectral measurements and their exact spatial origin within an organic tissue section. For this purpose we employ a relatively new technology, termed laser-based or MALDI-based imaging mass spectrometry.
1.1. MALDI-based Imaging Mass Spectrometry
MALDI-based Imaging Mass Spectrometry** (IMS) is a technology that uses the molecular specificity and sensitivity of normal mass spectrometry to collect a direct spatial mapping of biomolecules (or rather their ions) in tissue sections. It allows for massive multiplexing of followed molecules (covering an entire mass range) and does not require complex chemistry or an a priori target molecule, as is the case with complementary technologies such as immunochemistry and fluorescence microscopy. IMS has been successfully used in a number of pioneering studies that mainly focused on mammalian tissue [3,4]. The wet-lab side of the procedure consists of cutting an organic tissue section, mounting it on a MALDI target plate, applying an appropriate chemical matrix solution, and performing a MALDI mass spectral measurement at each grid point of a virtual array that has been superimposed on the tissue section. The result is an array of spots or 'pixels' covering the tissue section, with a mass spectrum linked to each individual pixel. Figure 1 gives a schematic overview of the wet-lab and in silico steps involved with performing IMS on the cross-section of spinal cord nerve tissue. Typically, the data generated by an IMS experiment populates a mathematical space that has two spatial dimensions (the x- and y-dimension) and the mass-over-charge dimension (m/z).
** MALDI stands for 'matrix-assisted laser desorption ionization' and refers to a particular mass spectrometry ionization method which is well suited for the study of larger biomolecules such as proteins. It involves firing a controlled laser shot at the sample embedded in a crystalline chemical matrix solution on the target plate.
Fig. 1. Overview of the imaging mass spectrometry experiment. The tissue slice creation, the mounting on the target plate, and the application of an appropriate matrix solution are wet-lab steps. The mass spectral measurements, the data collection, and the translation into peaklisted mass spectra take place inside the mass spectrometer. The array of peaklisted mass spectra forms the starting point for an in silico analysis.
It can be represented as a three-way array or tensor, with an x-, a y-, and an m/z-dimension.
1.2. Ion Images
A common approach taken in IMS-oriented studies 3 " 5 is to generate ion images from the IMS data tensor. These images are a false color visualisation of the spatial distribution of peak height for a particular m/z-window. They are called ion images because they show the spatial spread of a particular peak's height over the tissue, and because a mass spectral peak represents the amount of a particular ion that was measured. This ion can be the molecular ion, or a charged fragment of the original molecule. Prom a mathematical standpoint, an ion image can be seen as a cross-section of the d a t a tensor at a particular mass (or m/z). Four examples of such ion images are shown in Fig. 2, 3, 4, and 5, which were generated from the data set of rat spinal cord tissue which is further discussed in section 2.3. Ion images are a univariate approach to IMS imaging where one particular feature per pixel is picked for analysis and visualisation. This is very informative when the goal is to follow the spatial distribution of a particular molecule and you know beforehand which particular m/z-value is relevant to the study.
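As a concrete illustration of the ion-image idea, the short sketch below (with assumed array names, sizes, and m/z-bin indices) slices a stand-in (x, y, m/z) tensor over one m/z-window to obtain a per-pixel intensity map.

import numpy as np

rng = np.random.default_rng(1)
I, J, K = 31, 42, 7451                 # raster size and number of m/z-bins (as in the case study)
D = rng.random((I, J, K))              # stand-in for per-pixel peak intensities

def ion_image(tensor, k_low, k_high):
    """Sum intensities over the m/z-bins [k_low, k_high) for every pixel."""
    return tensor[:, :, k_low:k_high].sum(axis=2)

img = ion_image(D, 1200, 1210)         # hypothetical window around one peak
print(img.shape)                       # (31, 42): one intensity value per pixel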
[Figures 2-5: ion images at four different m/z-values from the rat spinal cord data set.]
However, ion images are much less suited for prospective studies where no a priori hypothesis of a target molecule or mass is formulated. In this kind of high-throughput discovery use of IMS one can potentially extract from the IMS tensor as many different ion images as there are m/z-bins available, and this number can easily run into the thousands (depending on the extent of the mass range that was scanned by the mass spectrometer). As an example, in our rat spinal cord data set this means 7451 distinct ion images, of which just four are shown in Fig. 2, 3, 4, and 5. Acquiring an overview and identifying the ion images that show meaningful spatial variation from this set of thousands is a nontrivial task, and does not lend itself well to human execution. This is why we employ multivariate data analysis methods, such as the principal component analysis discussed in this paper, to perform a preliminary exploration of the data tensor in order to identify spatial and mass trends that merit further investigation. The insights delivered by such a preliminary multivariate analysis can serve as a guide for further investigation using more traditional approaches such as the ion images. As shown in section 2.3, the PCA results can even be used to discriminate between biologically relevant chemical zones in the tissue on the basis of their mass spectral footprint.
2. Principal Component Analysis (PCA) as a prospective guide to IMS data
In this section we investigate the use of principal component analysis (PCA) as a guide for the prospective exploration of data coming out of an IMS experiment. The goal is to use the PCA results as a first stepping stone towards more elaborate multivariate analysis of IMS data. In section 2.1 we first discuss the general idea behind the PCA technique, followed by a treatise on the specific use of PCA in an IMS context in section 2.2. In the case study discussed in section 2.3, we apply the technique to real data from an IMS experiment where a rat nerve tissue section was imaged.
2.1. Principal Component Analysis
Principal component analysis is a latent variable* data analysis technique, widely employed in many areas for uses such as dimensionality reduction [7]. It was mentioned in a MALDI-IMS context by McCombie et al. [8] for the purpose of dimensionality reduction and denoising. In this study the method is used for trend detection in both the mass and the image domain. Before formulating a definition of PCA, it is necessary to explain some aspects of the concept of the rank of a matrix. The rank of a matrix M is the maximum number of linearly independent rows (or columns) of M. This means that the rank of M is the smallest number of outer products of vectors that can be used to reproduce the matrix M exactly. Another definition is that the rank of M is equal to the number of nonzero eigenvalues of M^T M. A matrix of rank 1 can therefore be completely represented as the outer product of two vectors, while a matrix of rank 5 requires the sum of (at least) five such outer products for it to be completely reconstructed. PCA is a decomposition of a matrix X, of size N x K and with a certain rank, into matrices of rank 1, designated F_a:

X = \sum_{a=1}^{A} F_a    (1)

Given the definition of rank, it is evident that the smallest value of A for which this equation still holds is equal to the rank of the matrix X. The matrices F_a have the same size as X (N x K), but as they are rank 1 they can be replaced by the outer product of two vectors s_a (N x 1) and l_a (K x 1) in equation 1:

X = \sum_{a=1}^{A} F_a = \sum_{a=1}^{A} s_a l_a^T = S L^T.    (2)
*A latent variable is a variable which we do not observe directly, but its existence can be inferred from the observed variables.
Fig. 6. Graphical representation of matrix decomposition and reconstruction using principal component analysis. The upper half shows the decomposition of data matrix X into A principal components, or the outer product of score matrix S (grouping score vectors s_1 to s_A) and loading matrix L (grouping loading vectors l_1 to l_A). The lower half depicts selecting the principal components with the largest contribution of variance (or information) in order to study the uncorrelated trends found in X.
The vectors** s_a are generally called the score vectors of the decomposition, while the vectors l_a are named loading vectors. Each pair of vectors s_a and l_a can be designated a principal component of matrix X and has a particular coefficient connected to it (stemming from the eigenvalue of the underlying X^T X), which indicates the relative amount of variance of X explained by its particular principal component. By further utilizing matrix notation, PCA can be written more concisely as the decomposition of the data matrix X into the single product of a matrix S of size N x A, holding all the score vectors, and a matrix L of size K x A, holding all the loading vectors. In order to complete the decomposition with a minimum value for A, the matrices F_a and their composing vectors s_a and l_a are necessarily maximally uncorrelated. A schematic overview of these steps is available in Fig. 6. Using the coefficients (actually eigenvalues) connected to them, one can order the various principal components according to their contribution in terms of variance of X explained and information contained. A number of uses follow from this notion, such as dimensionality reduction and denoising [7], but we focus specifically on the
**In this paper we follow the convention of representing a vector as a column vector unless explicitly transposed.
use of trend detection, in which the principal components with the largest contribution characterize the major uncorrelated trends underlying the data.
2.2. PCA Applied to IMS Data
In applying PCA to IMS data, our primary goal is to identify the major uncorrelated trends that can be found in the spatial domain (x and y) as well as in the mass domain (m/z). These trends, which tell us which pixels or m/z-bins behave similarly or dissimilarly, can be used as a guide in exploring these often complex and very large data sets, and to avoid the proverbial 'drowning in information' that can be experienced when classical ion images are employed without a prior hypothesis of target mass. As mentioned in section 1, the data measured during an IMS experiment can be stored as an array of order 3, or tensor, D with two spatial dimensions (x and y) and one m/z dimension (cfr. the abscissa in a mass spectrum). Each scalar value d_ijk in the tensor represents the absolute intensity of a particular mass peak at a certain x-position i, a certain y-position j, and measured at a certain m/z-bin k (with i = 1, ..., I, j = 1, ..., J, and k = 1, ..., K). One way of applying PCA to an IMS tensor D is to refold the tensor into an array of order 2, or matrix, D, to fit the expression shown in equation 2. This refolding process is done by reordering all discrete spatial positions in the x- and y-dimensions, or 'pixels' if you will, into one long vector holding I.J elements. The result is a matrix D of size (I.J) x K, holding all information contained within the original tensor D. Applying PCA in the manner discussed in section 2.1 delivers a score matrix S of size (I.J) x A, a loading matrix L of size K x A, and a vector of eigenvalues indicating each principal component's variance contribution. Based on the amount of variance explained, we can now identify and take a closer look at the most important principal components. A single principal component is characterized by one score vector and one loading vector. The score vector is of size (I.J) x 1 and does not easily allow for direct exploration. However, a reordering operation that reverses the effect of the operation performed to go from the tensor D to the matrix D allows us to refold this vector of size (I.J) x 1 to the image space defined by the two spatial dimensions x and y, resulting in an image matrix of size I x J. In their image form the score vectors deliver a more human-interpretable view on the underlying spatial correlations. This type of image gives us an idea of which pixels, or laser spots, have a similar mass spectral footprint when all m/z-bins are taken into account (note the difference with univariate ion images). The corresponding loading vector of size K x 1 does not require a refolding operation as it can be expressed directly in the m/z-domain. A visualisation of the loading vector gives us an indication of which m/z-bins behave similarly within the context of one principal component.
It is necessary to mention here that the above linkup of score vectors with the image domain and loading vectors with the mass domain is based on the assumption that PCA is performed on a data matrix where the rows represent pixels and the columns represent m/z-bins. This assumption would be in line with the convention of an objects x features data matrix used in most PCA literature. However, when there are more features available than objects, which is usually the case for IMS (e.g. 7451 versus 1302 in the spinal cord data set), it is more computationally efficient to use the transpose of D instead [7]. The results of this more 'economic' PCA are identical to the ones from the procedure described earlier, with the only difference being that the loading vectors are now linked to the image domain and the score vectors to the mass domain. This economized PCA was used in the case study of section 2.3.
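The refolding and decomposition described in sections 2.1-2.2 can be sketched in a few lines. This is an illustrative stand-in, not the authors' in-house software: the data are random, the 10%-of-maximum total-ion-current cut-off is an assumption about how the preprocessing threshold is applied, and the SVD route is used because it yields scores, loadings, and variance contributions directly without forming either covariance matrix, which is in the spirit of the economized variant.

import numpy as np

rng = np.random.default_rng(2)
I, J, K = 31, 42, 500                  # small K here just to keep the demo fast
D = rng.random((I, J, K))

X = D.reshape(I * J, K)                # refold: one row per pixel, one column per m/z-bin
tic = X.sum(axis=1)
keep = tic >= 0.10 * tic.max()         # crude total-ion-current threshold (assumption: 10% of maximum)
Xc = X[keep] - X[keep].mean(axis=0)    # mean-center the retained pixels

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                         # one row per kept pixel, one column per principal component
loadings = Vt.T                        # one row per m/z-bin, one column per principal component
explained = s ** 2 / np.sum(s ** 2)
print("variance explained by the first three PCs:", np.round(explained[:3], 3))

# A score column can be refolded back onto the image grid for visual inspection
# (only for the pixels that passed the threshold).
pc1_image = np.full(I * J, np.nan)
pc1_image[keep] = scores[:, 0]
pc1_image = pc1_image.reshape(I, J)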
2.3. Case Study: Rat Spinal Cord Nerve Tissue
In this section we demonstrate the use of PCA in an IMS context by applying it to the IMS measurement of rat nerve tissue. For reference, Fig. 7 shows a microscopic image of a nerve tissue section taken from the same animal as the one used in this case study. Figure 7's tissue slice has undergone histological staining to bring out the gray/white tissue differentiation, which is not visually apparent in untreated samples, but which does show up in the PCA analysis performed in this section.
Materials and methods. The tissue section (15 micrometers thick) was taken from a transversal section of the spinal cord of a standard control rat. The recorded mass range extended from m/z 5000 to 12000 and alpha-cyano-4-hydroxycinnamic acid (7 mg/ml, in acetonitrile 50%, 0.05% TFA) was used as a chemical matrix. A MALDI mass spectral measurement was performed on each grid point of a virtual raster of size 31x42 that was superimposed on the tissue section with an interspot distance of 100 micrometers in both the x- and y-directions. The mass spectrometer that was used is the ABI 4800 MALDI TOF/TOF Analyzer from Applied Biosystems Inc. in linear mode. The data collection in the mass spectrometer was guided by the 4000 Series Imaging module, available at http://www.maldi-msi.org. Processing was done using in-house developed software.
Preprocessing. As Fig. 2, 3, 4, and 5 show, the IMS raster was slightly off center with regard to the tissue section, resulting in a tissue-free area in the bottom right corner (shown in purple). To avoid these empty measurements consuming variance and influencing the PCA results, we disregarded them when their total ion current fell below a 10% threshold.
Analysis results. We applied PCA via singular value decomposition of the covariance matrix of the data matrix, using the economized version of PCA
Fig. 7. Microscopic image of a transversal section of rat spinal cord, histologically stained to show the butterfly-shaped central area known as the Substantia grisea (grey matter), surrounded by white matter nerve tissue.
Fig. 8. Percentage of variance explained by each principal component (PC). This graph only shows PCs with a contribution larger than 0.1%. The total number of PCs is actually given by min(N, K), but most have a negligible or zero contribution.
mentioned in section 2.2. This means that the loading vectors are represented in the image domain, while the score vectors are shown in the mass domain. When interpreting these visualisations, the relative differences in value are important, not the absolute value or sign (e.g. see the first score vector in Fig. 10). In the image domain (Fig. 9) low valued areas in blue are discriminated from high valued pixels in red, indicating zones within which the mass spectral footprint (and the underlying chemical composition) is similar. In the mass domain (Fig. 10) m/z-bins carrying similar values correlate strongly in behavior across the tissue (the peaks vary together), and can be discriminated from bins with dissimilar values. The bar plot in Fig. 8 shows us the relative amount of information contained in each principal component (PC) (above a 0.1% cut-off). It is apparent that the first PC is very prominent with more than 90% variance explained. This means that the spatial and mass-related correlations connected to the first PC can be considered as the primary structure found in the chemical composition of the tissue slice. Secondary and tertiary uncorrelated trends are also apparent as the second and third PC still hold a non-negligible amount of information. However, from the fourth PC onwards the contributions become less influential, tending towards noise in the data. Therefore, we will focus on the first three loading and score vectors shown in Fig. 9 and 10. The strong reduction in complexity indicates a large amount of correlation in the spatial domain (indicating region formation) and the mass domain (indicating ions, or m/z-bins, behaving similarly; e.g. by coming from the same parent molecule). The primary trend, characterized by the first loading vector in Fig. 9 and the first score vector in Fig. 10, shows that a butterfly-shaped region in the center
Fig. 9. The first four loading vectors. Folded back into the image domain, they show correlations at the pixel level.
of the tissue has a dissimilar chemical composition from the areas surrounding it (blue vs. red). This area correlates strongly in location and shape with the anatomical region called the Substantia grisea (grey matter), surrounded by white matter nerve tissue (also visible in Fig. 7). The first score vector shows that all m/z-bins have negative values with differing relative amounts, indicating that the spatial discrimination between the grey and white matter areas is mainly explained through quantitative differences in the chemistry, rather than qualitative ions showing up or disappearing. Also notice the two characteristic peaks at m/z 5484 and 8564, which show up consistently across the nerve tissue but whose relative quantity can be employed as a mass marker for grey matter. The second trend differentiates the blue region of tissue at the top of the raster from the red/yellow area at the center. When studying the second score vector it becomes clear that the differences between these areas are mainly caused by the dense peak area between m/z 5000 and m/z 8000 showing up more prominently while the peaks at m/z 5484 and 8564 lose intensity. One has to bear in mind that this secondary trend only accounts for some 4% of the chemical variation across the slice. With a contribution coefficient of 1 to 2%, the third loading vector differentiates between a ventral and dorsal area in the tissue. This third trend
Fig. 10. The first four score vectors. Represented in the mass domain, they show correlations between m/z-bins.
correlates strongly with the biochemical differences in ventral/dorsal cellular composition of the spinal cord. In the mass domain of the third score vector we see that this ventral/dorsal difference is mainly due to the 5484-peak showing up and the 8564-peak diminishing in one area while the reverse happens in the other area. In the primary and secondary trends the differences were mainly quantitative in nature. However, in this third trend we see an example of a more qualitative difference with the presence of a particular ion characterizing a particular area in the tissue. The fourth loading and score vectors are shown for completeness, but it is evident that spatially the correlations are less localized and structured, tending towards spread-out noise. In the mass domain we see differentiation between m/z-bins which are close together. This is rather unlikely to be structured given that there are isotopic and other ties between m/z-values this close together, which further seems to indicate that from this trend onwards we are dealing with modeled noise. In summary, the PCA-results tell us that in this particular IMS data set the chemical composition is dominated by the difference between grey matter nerve tissue and white matter, and two quantitative ion markers for these areas are observed at m/z 5484 and 8564. In addition to that, a ventral/dorsal difference was measured which can be related to known ventral/dorsal differences in the spinal cord.
3. Conclusions
We described a procedure for using PCA in an IMS context as an instrument for guiding prospective analysis of the chemical tissue composition. In the spatial domain it can show the human observer which regions have a particular mass spectral footprint, and it can differentiate these from other areas in the tissue slice without the need to perform invasive chemistry on the sample as is the case with e.g. histological staining. In the mass domain, specific molecular masses responsible for these differences (in m/z-form) are identified, and lend themselves to further downstream analysis using, for example, ion images. The case study on rat nerve tissue demonstrated these uses by delineating grey matter from white matter and by identifying two mass markers that can be used to differentiate between these zones. It also illustrates how, in addition to the visual aspect of differentiating zones in the tissue, IMS as a technology permits a direct measurement of the chemical reality responsible for these differing areas, in the form of molecular masses. The use of PCA as described in this paper is but a first step towards a more insightful interrogation of IMS data. A thorough investigation of the influence of factors such as preprocessing of the data and robustness of the method will be required before it can be established as a firm first analysis step. Also, comparisons with other multivariate techniques, such as independent component analysis, are currently under way and will prove to be an interesting research avenue.
4. Acknowledgements
Research supported by Research Council KUL: GOA AMBioRICS, CoE EF/05/007 SymBioSys, IDO, several PhD/postdoc & fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects G.0407.02, G.0413.03, G.0388.03, G.0229.03, G.0241.04, G.0499.04, G.0232.05, G.0318.05, G.0553.06, G.0302.07, G.0129.00, research communities (ICCoS, ANMMM, MLDM); IWT: PhD Grants, GBOU-McKnow-E, GBOU-SQUAD, GBOU-ANA, TAD-BioScope-IT, Silicos; Belgian Federal Science Policy Office: IUAP P5/22; EU-RTD: ERNSI; FP6-NoE; FP6-IP, FP6-MC-EST; ProMeta, BioMacS.
References
1. R. Aebersold and M. Mann, Nature 422, 198 (Mar 2003).
2. W.-K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman and E. K. O'Shea, Nature 425, 686 (Oct 2003).
3. M. Stoeckli, T. B. Farmer and R. M. Caprioli, J Am Soc Mass Spectrom 10, 67 (Jan 1999).
4. M. Stoeckli, P. Chaurand, D. E. Hallahan and R. M. Caprioli, Nat Med 7, 493 (Apr 2001).
5. H. Nygren, P. Malmberg, C. Kriegeskotte and H. F. Arlinghaus, FEBS Lett 566, 291 (May 2004).
6. R. M. A. Heeren, Proteomics 5, 4316 (Nov 2005).
7. I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1986).
8. G. McCombie, D. Staab, M. Stoeckli and R. Knochenmuss, Anal Chem 77, 6118 (Oct 2005).
DNA-PROTEIN INTERACTIONS: INTEGRATING STRUCTURE, SEQUENCE, AND FUNCTION
MARTHA L. BULYK, Brigham & Women's Hospital and Harvard Medical School, Boston, MA 02115
ALEXANDER J. HARTEMINK, Duke University, Durham, NC 27708
ERNEST FRAENKEL, Massachusetts Institute of Technology, Cambridge, MA 02139
GARY STORMO, Washington University, St. Louis, MO 63108
Recent technological advances have enabled many different types of data to be collected at a genome-wide and proteome-wide scale, including DNA sequence from various genomes, gene expression data, protein-protein and protein-ligand interactions, and protein-DNA binding data. In addition, efforts in structural biology are yielding structural data on proteins, protein complexes, and protein-ligand interactions. These data provide us for the first time with the opportunity to integrate data from functional studies with those from structural studies, in order to understand how the biophysical aspects of protein-DNA interactions affect their function. Indeed, much recent work has been devoted to analyzing these data for various focused aspects of this purpose, such as either the regulatory aspects of protein-DNA interactions or the structural aspects of protein-DNA recognition. However, very few studies have integrated these various types of data in order to bridge this divide. This nascent research area builds on recent developments in diverse areas including DNA motif discovery, modeling of transcriptional regulatory networks, multiple sequence alignments, structural genomics, and structural and evolutionary studies of proteins and DNA. While each of these specific aspects of protein-DNA interactions has been studied previously, these different aspects have just recently begun to be considered together. This session focuses on methods that bridge structure, sequence, and function to infer previously undiscovered associations between these different aspects of protein-DNA interactions.
Methods that employ structure, sequence, and function have several key advantages. First, structural data alone often do not permit the inference of biological function. Second, experimental genomic datasets often contain errors arising from imperfections in the applied technology. Third, functional studies typically do not connect function to structure. Indeed, there has been only a small amount of work that addresses how to take advantage of these currently separate areas of research on protein-DNA interactions. We anticipate that combining these different types of data will allow us to identify essential biological associations, and ultimately to model and predict these interactions. We accepted three papers for this new session. In the first paper, Liu and Bader predict transcription factor binding sites by calculating binding free energies, starting with the 3D structure of another protein-DNA complex from the same structural class of protein. They apply their approach to homeodomain and bZIP proteins, but the method could be applied to proteins of other structural classes as well. In the second paper, Leung and Chin present an algorithm for improved motif discovery that takes advantage of pattern characteristics of different transcription factor binding site motif classes. In the third paper, Zhao and colleagues develop an approach to predict the biological pathway in a target genome that is orthologous to that in a query genome by considering the protein-DNA interactions and operon structures of the pathway genes. Further progress in these areas may further improve the ability to predict interactions between transcription factors and their DNA binding sites. In addition, there remain numerous other challenges in this nascent research area aside from those addressed in the accepted papers for this session. Future work may address questions such as: Do certain types of domains of DNA binding proteins confer particular biophysical properties, either in terms of kinetics or ligand specificity? Has there been an evolutionary selection for the usage of certain structural classes of DNA binding proteins in particular types of biological pathways? How are affinities of protein-DNA interactions tied to function? We believe that as more types of data become widely available, integrative approaches will likely produce great insights into the biophysical, evolutionary, and functional aspects of this important class of biomolecular interactions.
Acknowledgments
We are grateful to those who submitted manuscripts for this session, and we thank the numerous reviewers for their valuable expertise and time in the peer review process.
DISCOVERING MOTIFS WITH TRANSCRIPTION FACTOR DOMAIN KNOWLEDGE*
HENRY C.M. LEUNG, FRANCIS Y.L. CHIN, BETHANY M.Y. CHAN
Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong, China
We introduce a new motif-discovery algorithm, DIMDom, which exploits two additional kinds of information not commonly exploited: (a) the characteristic pattern of binding site classes, where class is determined based on biological information about transcription factor domains, and (b) posterior probabilities of these classes. We compared the performance of DIMDom with MEME on all the transcription factors of Drosophila with at least one known binding site in the TRANSFAC database and found that DIMDom outperformed MEME with 2.5 times the number of successes and 1.5 times the accuracy in finding binding sites and motifs.
1. Introduction
One important problem in bioinformatics is understanding how genes cooperate to perform functions. Related to this is the subproblem of discovering motifs. The context behind the motif discovering problem is the following. Gene expression is the process whereby a gene is decoded to form an mRNA sequence which is then used to produce the corresponding protein sequence. In order to start the gene expression process, a molecule called a transcription factor will bind to a short substring, called a binding site, in the promoter region of the gene. A transcription factor can bind to several binding sites in the promoter regions of different genes to make these genes co-express, and such binding sites should have common patterns. The motif discovering problem is to discover the common patterns, or motifs, from a set of promoter regions without knowing the positions of the binding sites. However, many motifs in real biological data cannot be discovered by existing algorithms because the existing models [3, 8, 12, 13, 20] that represent motifs might not be able to capture the different pattern variations of the binding sites. PSSM (Position Specific Scoring Matrix) [2, 4, 6, 7, 10, 11, 14] is the most common motif representation. It uses a 4 x l matrix of real numbers to represent a length-l motif. The j-th column of 4 numbers gives us the probability,
* The research was supported in part by the RGC grant HKU 7120/06E.
respectively, that symbol 'A', 'C', 'G' or 'T' occupies the j-th position of the motif. The goal is to discover the optimal motif matrix which maximizes the likelihood of the input sequences being generated according to the matrix. Existing algorithms assume the prior probability of each matrix being chosen to generate the input sequences is the same. However, this assumption is not correct in real biological data. Transcription factors mainly bind to the binding sites by substructures called active binding domains (in short, domains), e.g. zinc finger [23], leucine zipper [16] and homeodomain [19]. Although the binding sites of transcription factors with the same domain do not necessarily have the same patterns, they should share some common characteristics [18]. For example, binding sites of zinc fingers usually contain the nucleotide 'G' regularly and binding sites of homeodomains usually contain the "TAAT" substring. If we know which domains of the transcription factors contact the binding sites, we can improve the accuracy of existing motif discovering algorithms by adding constraints on the motifs [5, 15, 21]. For some motif classes, it might be possible to find the motif by considering only substrings in the DNA sequences with certain characteristics as candidates for binding sites. However, we usually do not know which transcription factors or, more specifically, which domains of the transcription factors contact the binding sites. The approach of searching for substrings with characteristics of each possible motif class is not only time-consuming, but may even fail to find the hidden motif because of the following two weaknesses of this approach. Firstly, the number of wrongly predicted binding sites might be large, e.g. many substrings in the input sequences with pattern [CG]..[CG]..[CG] are not binding sites of a motif in Class I (to be introduced in Section 2). Secondly, some binding sites of a motif in a particular class may not have the corresponding characteristics exactly, e.g. a binding site of a motif in Class IV may contain the pattern TGA.*TGA instead of TGA.*TCA. A natural question is: can we improve the performance of the motif discovering problem by knowing only the characteristics of each possible motif class? Narlikar et al. [17] trained on 3847 binding sites in the TRANSFAC database and defined three motif classifiers using 1387 features. Each motif classifier can represent the common features for binding sites in the corresponding motif class precisely. However, the definition of the motif classifiers highly depends on a large set of training binding sites and may not capture the real common features of binding sites in the motif class. Xing and Karp [24] used a similar method by training 271 motif matrices in the TRANSFAC database, which represent about 2000 binding sites. In this paper, we model the common features of different motif classes with far fewer parameters than the above methods (Section 2). Our algorithm DIMDom (Section 3), which stands for Discovering Motifs with DOMain
knowledge, discovers motifs by an EM approach: the expectation step finds over-represented patterns in the DNA sequence, while the maximization step, based on the motif matrix with the maximum log likelihood, guesses the class of the binding site patterns according to posterior probabilities and then modifies the motif matrix according to the class guessed. Besides getting more accurate motifs, the binding sites with domain knowledge can converge to the real solution (motif) more quickly as shown in the experiments (Section 4) on real biological data when compared with the popular algorithm MEME.
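For readers unfamiliar with the PSSM representation mentioned above, the toy sketch below builds a 4 x l probability matrix (numbers invented, not taken from TRANSFAC) and scores a candidate site with a log-likelihood ratio against a uniform background; this is background material, not part of DIMDom itself.

import numpy as np

ALPHABET = "ACGT"
# Columns roughly favouring the consensus "TAAT" (illustrative values only).
pssm = np.array([
    [0.05, 0.80, 0.80, 0.05],   # A
    [0.05, 0.05, 0.05, 0.05],   # C
    [0.05, 0.05, 0.05, 0.05],   # G
    [0.85, 0.10, 0.10, 0.85],   # T
])
background = np.array([0.25, 0.25, 0.25, 0.25])

def site_score(site):
    idx = [ALPHABET.index(c) for c in site]
    log_motif = sum(np.log(pssm[a, j]) for j, a in enumerate(idx))
    log_bg = sum(np.log(background[a]) for a in idx)
    return log_motif - log_bg

print(site_score("TAAT"))   # high: looks like an instance of the motif
print(site_score("GCCG"))   # low: looks like background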
2. Our Model
The input sequences can be broken up into length-l (overlapping) substrings X = {X_1, X_2, ..., X_w} and each substring in X either belongs to a background (non-motif) substring with a prior probability lambda_b or belongs to an instance of the hidden motif M with a prior probability 1 - lambda_b. In particular, Z = (Z_1, Z_2, ..., Z_w) is the missing data that determines whether X_i is generated according to the background probability B (Z_i = 1) or the hidden matrix M (Z_i = 0). The likelihood of some particular B, M, lambda_b being the hidden parameters of the finite mixture model [2] is defined as
L(B, M, \lambda_b \mid X, Z) = P(X, Z \mid B, M, \lambda_b) = \prod_{i=1}^{w} \Big[ \lambda_b \prod_{j=1}^{l} B(X_i[j]) \Big]^{Z_i} \Big[ (1 - \lambda_b) \prod_{j=1}^{l} M(X_i[j], j) \Big]^{1 - Z_i}    (1)
The goal of many existing algorithms [2, 4, 10] is to discover the B, M, lambda_b with the maximum likelihood (or log likelihood). Transcription factors are protein sequences with different three-dimensional structures. They have different substructures, or domains, for recognizing and binding to specific binding sites. The binding affinity of a transcription factor depends on whether the binding sites have certain DNA patterns that match the domains of the transcription factor. For example, basic helix-loop-helix proteins usually bind to strings with the pattern "CA..TG" [1]. Other examples can be found in [16, 19, 23, 25]. Narlikar and Hartemink [18] analyzed 3847 published binding sites. They found that these binding sites can be classified into six groups with different occurrence counts. These counts represent the prior probabilities as shown in Table 1. For example, the probability P_m(2) that the hidden matrix is in Class II (Cys4) is approximately 734/3847. Based on this observation, we introduce the Bayesian Mixture Model to describe these uneven probabilities.
Table 1. The six classes of binding site patterns.
Class name                           Characteristics                       Count
I. Cys2His2 (zinc-coordinating)      G..G | G..G..G | [CG]..[CG]..[CG]     776
II. Cys4 (zinc-coordinating)         AGGTCA | TGACCT                       734
III. bHLH (basic domain)             CA..TG                                182
IV. bZip (basic domain)              TGA.*TCA                              1353
V. Forkhead (helix-turn-helix)       no characteristics                    281
VI. Homeodomain (helix-turn-helix)   TAAT | ATTA                           621
Total                                                                      3847
"." means any nucleotide. ".*" means zero or more nucleotides. "[ ]" means one of the nucleotides in the bracket. "|" means or.
2.1. Bayesian Mixture Model
Each substring in X is assumed either generated according to a background probability B = (b(A), b(C), b(G), b(T)) or a hidden matrix M. However, the prior probability of each matrix being the hidden matrix is not the same. A motif class g, g = 1, ..., 6, is randomly chosen according to a probability distribution P_m = {P_m(g)} where \sum_{g=1}^{6} P_m(g) = 1. Once a motif class is chosen, a probability matrix is picked, with equal probability, from the chosen class as the hidden matrix. The goal of the motif discovering problem is to discover motif M and other parameters with maximum likelihood with respect to the given X and P_m. Given the joint distribution of the substring X, the missing data Z, the hidden motif M and the motif class g, the likelihood of some particular B, lambda_b, P_m being the hidden parameters of the Bayesian mixture model is defined as
L(B, \lambda_b, P_m \mid X, Z, M, g) = P(X, Z, M, g \mid B, \lambda_b, P_m)
= P(X, Z \mid M, g, B, \lambda_b, P_m) P(M, g \mid B, \lambda_b, P_m)
= P(X, Z \mid B, M, \lambda_b) P(M, g \mid P_m)
= L(B, M, \lambda_b \mid X, Z) P(M \mid g, P_m) P(g \mid P_m)
= L(B, M, \lambda_b \mid X, Z) P(M \mid g) P_m(g)    (2)
Therefore, the likelihood L(B, lambda_b, P_m | X, Z, M, g) is equal to L(B, M, lambda_b | X, Z) times the term P(M | g) P_m(g), which is the probability of class g being chosen and matrix M being picked from class g.
2.2. Characteristics of the Motif Classes
Each motif class can be characterized by a regular expression as shown in Table 1. A matrix for a particular motif class should contain a 4 x l' sub-matrix M', where l' <= l, which satisfies the restriction stated by the regular expression. Note that a probability matrix can belong to more than one motif class. Each symbol 'A', 'C', 'G', 'T' in the regular expression means the entries M'(A, j), M'(C, j), M'(G, j) or M'(T, j) of the corresponding j-th column of the sub-matrix M' are larger than some predefined threshold beta, 0.25 < beta < 1. For example, the regular expression "CA..TG" in Class III means all matrices in
Figure 1. Graphical representation of all possible column vectors (mu(A), mu(C), mu(G), mu(T)) of a probability matrix.
Figure 2. Graphical representation of all possible column vectors with mu(T) > beta.
Class III must contain a 4 x 6 sub-matrix M' such that M'(C, 1) > beta, M'(A, 2) > beta, M'(T, 5) > beta and M'(G, 6) > beta. Since Class V has no characteristics, we assume all matrices belong to Class V, i.e. the regular expression is ".*". Since the size of the sample space for each motif class is not the same, the likelihood of a particular class g given a matrix M, i.e. P(M | g = k), k = 1, ..., 6, is not the same for different motif classes. In order to compare (without finding their exact values) the likelihood of different motif classes when given a matrix, we consider a 4 x 1 column vector CV = (mu(A), mu(C), mu(G), mu(T)) in a probability matrix. Since 0 <= mu(A), mu(C), mu(G), mu(T) <= 1 and mu(A) + mu(C) + mu(G) + mu(T) = 1, the sample space of CV can be represented by the set of points in the tetrahedron shown in Figure 1 [10]. The four corners of the tetrahedron at (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1) represent the four nucleotides A, C, G and T. Without loss of generality, let CV be the first column of a 4 x 4 matrix with the pattern "TAAT" in motif Class VI (Table 1), in which case mu(T) > beta. To illustrate the idea, let us consider two classes of motif. In Class V a column vector CV is randomly picked from all possible column vectors, whereas in Class VI a column vector CV is randomly picked from all column vectors with mu(T) > beta. As the size of the sample space for column vectors with mu(T) > beta, i.e. the tetrahedron shown in Figure 2, is (1 - beta)^3 of the size of the sample space for arbitrary column vectors, i.e. the whole tetrahedron, the conditional probability P(CV | g = 6) is 1/(1 - beta)^3 times higher than the conditional probability P(CV | g = 5). Similarly, we may compare the conditional probability of a particular matrix M' being picked given that it is from Class V (all probability matrices) and the conditional probability of another matrix M being picked given that it is from one of the remaining classes. For example, assume l = 4 and beta = 0.8. The conditional probability P(M | g = 6) that a particular 4 x 4 matrix M in Class VI is picked from all length-4 matrices in Class VI is 1/(2(1 - 0.8)^(3x4)), about 1.2 x 10^8, times larger than the conditional probability P(M' | g = 5) that another matrix M'
is picked from all length-4 matrices in Class V. Note that, if M' does not belong to Class VI, P(M' | g = 6) = 0. When the motif length l is not exactly 4, care should be taken not to double count those matrices with more than one sub-matrix satisfying the requirement (by using the Inclusion and Exclusion Principle).
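The factor quoted above can be checked numerically; the assumptions here are exactly the ones stated in the text (beta = 0.8, four constrained columns, two admissible patterns for Class VI).

# Each constrained column shrinks the sample space by (1 - beta)^3, and the
# factor 2 accounts for the two Class VI patterns (TAAT or ATTA).
beta = 0.8
factor = 1.0 / (2 * (1 - beta) ** (3 * 4))
print(f"{factor:.3g}")   # ~1.22e+08, i.e. about 1.2 x 10^8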
3. DIMDom Algorithm
DIMDom, which stands for Discovering Motifs with DOMain knowledge, uses the expectation maximization (EM) approach to discover the motif matrix from the input sequences. In the expectation step (E-step), based on the current estimates of the parameters M, B, lambda_b and g, the DIMDom algorithm calculates the expected log likelihood log L(B, lambda_b, P_m | X, Z, M, g) over the conditional probability distribution of the missing data Z from the input sequences X. In the maximization step (M-step), the DIMDom algorithm calculates a new set of parameters M, B, lambda_b and g based on the newly estimated Z to maximize the log likelihood. These two steps are iterated in order to obtain a probability matrix with larger log likelihood. In order to discover the probability matrix with maximum log likelihood (instead of a local maximum), the DIMDom algorithm repeats the EM steps with different seed matrices.
3.1. Expectation step
Given a fixed probability matrix M^(0), the background probability B^(0), prior probability lambda_b^(0) and the motif class g^(0), the expected log likelihood is

E(\log L(B, \lambda_b, P_m \mid X, Z, M, g))
= \sum_{i=1}^{w} \Big\{ Z_i^{(0)} \Big[ \log(\lambda_b) + \sum_{j=1}^{l} \log(B(X_i[j])) \Big] + (1 - Z_i^{(0)}) \Big[ \log(1 - \lambda_b) + \sum_{j=1}^{l} \log(M(X_i[j], j)) \Big] \Big\} + \log(P(M \mid g) P_m(g))    (3), from (1) and (2)

= \sum_{i=1}^{w} Z_i^{(0)} \sum_{j=1}^{l} \log(B(X_i[j])) + \sum_{i=1}^{w} (1 - Z_i^{(0)}) \sum_{j=1}^{l} \log(M(X_i[j], j)) + \log(P(M \mid g) P_m(g)) + \sum_{i=1}^{w} \Big\{ Z_i^{(0)} \log(\lambda_b) + (1 - Z_i^{(0)}) \log(1 - \lambda_b) \Big\}    (4)
478
Z,(0)=
;
tZ
.
(5)
Therefore, we can calculate the expected log likelihood and the expected 2?0) from X, M^\ &°W0) and g(0) by Equations (4) and (5). 3.2. Maximization
step
Based on Equation (4), we can calculate the parameters M*0, B°\lb0) and g(1) to maximize the expected log likelihood. Afc(1) is involved in the last term in Equation (4) only and the expected log likelihood will be maximized when A 0 ) = Sr=i(2;l0) / w). B0) is involved in the first term in Equation (4) which will be maximized when
IEz«» «*,[;] = a) b«xr=—^^h a'=A,C,G,T .=1 j=\
where a can be A, C, G or T, and \{s) = 1 if and only if the proposition s is true and l(s) = 0 otherwise. M^l) and g(1) are involved in the second term in Equation (4). In order to find the probability matrix Af*0 and the motif class g(l), we assign M 0 ' and g(1> to be the probability matrix for each motif class that maximizes the expected log likelihood. Consider g(I) = 5, Equation (4) will be maximized (by considering a Lagrange Multiplier of each column vector of M') when
J(l-Z<0,)I(X,U] = a) M'(a,j) =
&—
(6)
X Xa-z^Kx^]^') ff'=A,C.G.T M
When g(1) = 1, 2, 3, 4 or 6, the matrix M' calculated in Equation (6) will maximize the log likelihood if M' belongs to the corresponding class. However, when M' does not belong to the corresponding class, we have to test all the boundary matrices (by considering a Lagrange Multiplier of each column vector of M' for the boundary e.g. M'(AJ) =/?) in each class, which are closest to M'. For example, when we are considering g0) = 6 (Class VI) and the matrix M' does not contain any 4 x 4 sub-matrix satisfying either TAAT or ATT A, we consider the 2(1 - 4 + 1) boundary matrices of M' in Class VI as follows. For
479 each starting position; = 1 , . . . , / — 4 + 1 , consider the 4 x 4 sub-matrix Msub of M' formed by columns j to j + 4 - 1 of M'. If Msub does not satisfy ATTA because some entries in Msub are less than /?, we set these entries to /? and decrease the values of the rest entries proportionally. When /3 = 0.8, we will modify the following sub-matrix Msub ^0.03 0.03 0.04 0.9
0.8 0.1 0.1 0
0.3 0.4 0.1 0.2
0.1 ^ 0.05 0.05 to 0.8
0.03 0.03 0.04 0.9
0.8 0.8 0.1 0.1 0.4x0.2/0.7 0.05 0.1 0.1x0.2/0.7 0.05 0 0.2x0.2/0.7 0.8
to form a boundary matrix of M'. We can prove that either matrix M' or one of its boundary matrices in each motif class can maximize the expected log likelihood when Z*0) is fixed. Thus, we can set M 0 ' to be the matrix with the largest expected log likelihood. We can repeat the E-step and M-step for a fixed number (10 is used in our experiments) of times to find the motif matrix with maximum expected log likelihood locally. 3.3. Seed
Matrices
In order to initiate the EM-step, we should have a set of seed matrices M®\ background probability B(0\ prior probability Afc(0) and motif class g(0). Similar to Bailey and Elkan [2], when the motif length / is short, we convert each length-/ DNA sequence S into a seed matrix A/*0' by setting m
M m
(a(a J) i) = l0-7 ' [0.1
a =
S[j]
a*S[j]
However, when the motif length / is long, as the number of seeds increases exponentially with /, it is impossible to try all seeds. Fortunately, real biological motifs usually contain a conserved region in the center (column vector with one or two entries having high probabilities) or conserved regions at two ends. Instead of considering all A1 seeds, we consider all length-/' seeds where /' < / and extend these length-/' seeds to length-/ by adding column vectors with all entries equal to 0.25 at both ends to represent motifs with a conserved region in the center. Similarly, we construct a seed with all entries equal to 0.25 at the center to represent motifs with conserved regions at both ends. Apart from A/0), we set the background probability B(0> to be the occurrence probability of each nucleotide in the input sequence fi<0)(ar) = (ir=iS';=iI(A',[;] = cr))/(w/) . We also set the prior probability 1 - Xb(0) of a substring being an instance of the motif to be the number of input sequences over w (we assume each input sequence contains one instance of the motif) and set the
motif class g^(0) = 5, which means that there is no restriction on the motif matrix.

Table 2. Experimental results on real biological data for transcription factors of Drosophila, for output with 1 and 30 (in brackets) predicted motif(s) per data set. For each transcription factor (Ac, adf-1, AP-1, AS-CT3, Bcd, Bfactor, CF1, Ci, D_MEF2, Dl, DREF, Dri, DTF-1, E74A, EcR, Elf-1, En, Exd, Ftz, FTZ-F1, GAGA, GCM, H, Hb, HSTF, Kr, Sc, Sn, Su_Hw, TAB, TBP, Tn, Ttk69k, Ubx_a, Zen-1, Zen-2, Zeste, Zeste_b), the table lists the motif length, the published motif class, the motif class predicted by DIMDom, and the scores of DIMDom restricted to class V only, of DIMDom, and of MEME. The average scores are 0.0998 (0.2761) for DIMDom with class V only, 0.2501 (0.4471) for DIMDom, and 0.1925 (0.3141) for MEME.
4. Experimental Results

We have implemented DIMDom using C++ and have compared its performance
with that of the popular motif discovery algorithm MEME [2], which is also based on an EM approach, on real biological motifs from the TRANSFAC database (http://www.gene-regulation.com). For each transcription factor with at least one known binding site in fruit fly (Drosophila), we searched for all genes regulated by that transcription factor and used the 450 bp (base pairs) upstream and 50 bp downstream of the transcriptional start site of these genes as the input sequences. We set l' = 8 when constructing seed matrices and considered a substring X_i as a binding site if 1 − Z_i > 0.9, i.e. with 90% confidence. Higher thresholds, such as 0.95 and 0.99, failed to give satisfactory results because the number of predicted binding sites decreased sharply to almost zero. A score for each predicted motif is defined as

  score = |predicted sites ∩ published sites| / |predicted sites ∪ published sites|.

A published binding site is correctly predicted if it overlaps with at least one predicted binding site. The score lies in the range [0, 1]: when all the published binding sites are correctly predicted without any mis-prediction, score = 1; when no published binding site is predicted correctly, score = 0. The value of the threshold β used in calculating the probability P(M | g) was determined by performing tests on another set of real data, from the SCPD database (http://rulai.cshl.edu/SCPD/) for yeast (Saccharomyces cerevisiae). DIMDom had the highest average score when β = 0.9. A smaller value of β did not give better performance because the values of log(P(M | g)) were then similar for the different motif classes; as a result, DIMDom could not take much advantage of the different motif classes, and motifs from class V were predicted most of the time. Table 2 shows the performance of MEME [2] and DIMDom for two types of output, only one predicted motif and 30 predicted motifs (from now on, all results related to outputs with 30 predicted motifs will be parenthesised). In order to have a fair comparison, we ignored the known prior probabilities of the different motif classes and set them all equal. We also ran a version of DIMDom which considers only class V (the basic EM algorithm) so as to illustrate the improvement in performance gained by introducing the knowledge of different motif classes. It is not surprising to find that MEME (with average score 0.1925 (0.3141)) performed better than the basic EM algorithm (with average score 0.0998 (0.2761)). However, after introducing the five motif classes, DIMDom (with average score 0.2501 (0.4471)) outperformed MEME when the same set of parameters were
used. Note that DIMDom was about 1.5 times more accurate than MEME when 30 predicted motifs were output. Among the 47 data sets, both DIMDom and MEME failed to predict any published binding sites in 19 (9) data sets; DIMDom had a better performance (higher score) for 17.5 (27.5) data sets while MEME had a better performance for only 10.5 (10.5) data sets. When the output has 30 predicted motifs, DIMDom outperformed MEME by a factor of 2.5 in the number of successes. In 5.5 of the 10.5 cases where MEME did better than DIMDom, MEME predicted only 1 or 2 of many not-so-similar binding sites because of the high threshold (0.9) used by DIMDom. Even with a simple description of motif classes, DIMDom correctly predicted the motif class in 9 (12) out of 21 (25) instances. We expect better prediction results if more parameters are used to describe motif classes [17]. However, more training data are needed for tuning these parameters.
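The score used throughout Section 4 is the Jaccard index of the predicted and published binding-site sets. A minimal sketch is given below; it treats sites as exact identifiers, whereas the evaluation above counts a published site as correctly predicted if it merely overlaps a predicted site, so the overlap matching is not reproduced here.

```python
def motif_score(predicted_sites, published_sites):
    """|predicted ∩ published| / |predicted ∪ published|, in [0, 1]."""
    predicted, published = set(predicted_sites), set(published_sites)
    union = predicted | published
    return len(predicted & published) / len(union) if union else 0.0
```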
5. Conclusion

We have incorporated biological information, in terms of prior probabilities and pattern characteristics of possible motif classes, into the EM algorithm for discovering motifs and binding sites of transcription factors. Our algorithm DIMDom was shown to have better performance than the popular software MEME. DIMDom will have potentially even better performance if more motif classes are known and included in the algorithm. Like many motif discovery algorithms, DIMDom will work without the length of the motif being given. When the length of the motif is specified, DIMDom will certainly have better performance than when the length is not given and the likelihoods of motifs of different lengths must be compared.
References
1. W. Atchley and W. Fitch, Proc. Natl. Acad. Sci., 94, 5172-5176 (1997).
2. T. Bailey and C. Elkan, ISMB, 28-36 (1994).
3. F. Chin, H. Leung, S.M. Yau, T.W. Lam, R. Rosenfeld, W.W. Tsang, D. Smith and Y. Jiang, RECOMB04, 125-132 (2004).
4. E. Eskin, RECOMB04, 115-124 (2004).
5. S. Keles, M. Lann, S. Dudoit, B. Xing and M. Eisen, Statistical Applications in Genetics and Molecular Biology, 2, Article 5 (2003).
6. C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald and J. Wootton, Science, 262, 208-214 (1993).
7. C. Lawrence and A. Reilly, Proteins: Structure, Function and Genetics, 7, 41-51 (1990).
8. H. Leung and F. Chin, JBCB, 4, 43-58 (2006).
9. H. Leung and F. Chin, WABI, 264-275 (2005).
10. H. Leung and F. Chin, Bioinformatics, 22(supp 2), ii86-ii92 (2005).
11. H. Leung and F. Chin, Bioinformatics (to appear).
12. H. Leung, F. Chin, S.M. Yiu, R. Rosenfeld and W.W. Tsang, JCB, 12(6), 686-701 (2005).
13. M. Li, B. Ma and L. Wang, Journal of Computer and System Sciences, 65, 73-96 (2002).
14. J.S. Liu, A.F. Neuwald and C.E. Lawrence, Journal of the American Statistical Association, 432, 1156-1170 (1995).
15. K. MacIsaac, D. Gordon, L. Nekludova, D. Odom, J. Schreiber, D. Gifford, R. Young and E. Fraenkel, Bioinformatics, 22(4), 423-429 (2006).
16. N.J. Mulder et al., Nucleic Acids Res., 31, 315-318 (2003).
17. L. Narlikar, R. Gordan, U. Ohler and A. Hartemink, Bioinformatics, 22(14), e384-e392 (2006).
18. L. Narlikar and A. Hartemink, Bioinformatics, 22(2), 157-163 (2006).
19. C. Pabo and R. Sauer, Annu. Rev. Biochem., 61, 1053-1095 (1992).
20. P. Pevzner and S.H. Sze, ISMB, 269-278 (2000).
21. A. Sandelin and W. Wasserman, JMB, 338, 207-215 (2004).
22. S. Sinha and M. Tompa, BIBE, 214-220 (2003).
23. S. Wolfe, L. Nekludova and C.O. Pabo, Annu. Rev. Biomol. Struct., 3, 183-212 (2000).
24. E. Xing and R. Karp, Natl. Acad. Sci., 101, 10523-10528 (2004).
25. J. Zilliacus, A.P. Wright, D.J. Carlstedt and J.A. Gustafsson, Mol. Endocrinol., 9, 389-400 (1995).
AB INITIO PREDICTION OF TRANSCRIPTION FACTOR BINDING SITES

L. ANGELA LIU and JOEL S. BADER*
Department of Biomedical Engineering and High-Throughput Biology Center, Johns Hopkins University, Baltimore, MD 21218, USA
* E-mail: [email protected]

Transcription factors are DNA-binding proteins that control gene transcription by binding specific short DNA sequences. Experiments that identify transcription factor binding sites are often laborious and expensive, and the binding sites of many transcription factors remain unknown. We present a computational scheme to predict the binding sites directly from transcription factor sequence using all-atom molecular simulations. This method is a computational counterpart to recent high-throughput experimental technologies that identify transcription factor binding sites (ChIP-chip and protein-dsDNA binding microarrays). The only requirement of our method is an accurate 3D structural model of a transcription factor-DNA complex. We apply free energy calculations by thermodynamic integration to compute the change in binding energy of the complex due to a single base pair mutation. By calculating the binding free energy differences for all possible single mutations, we construct a position weight matrix for the predicted binding sites that can be directly compared with experimental data. As water-bridged hydrogen bonds between the transcription factor and DNA often contribute to the binding specificity, we include explicit solvent in our simulations. We present successful predictions for the yeast MAT-a2 homeodomain and GCN4 bZIP proteins. Water-bridged hydrogen bonds are found to be more prevalent than direct protein-DNA hydrogen bonds at the binding interfaces, indicating why empirical potentials with implicit water may be less successful in predicting binding. Our methodology can be applied to a variety of DNA-binding proteins.

Keywords: transcription factor binding sites; free energy; position weight matrix; hydrogen bond
1. Introduction

Transcription factors (TFs) are proteins that exert control over gene expression by recognizing and binding short DNA sequences (<6 base pairs,
roughly the width of the major groove). 1-5 The experimental methods used to identify these binding sites, 6 ' 7 including SELEX 8 and the recent highthroughput experiments (ChlP-chip 9 and protein-dsDNA binding microarrays 10 ), are often labor-intensive and expensive. Due to the complex molecular recognition mechanism between protein and DNA, there is no simple one-to-one code for protein-DNA recognition, 11,12 which makes theoretical predictions of TF binding sites challenging. As a result, the binding sites for many TFs are still unknown. There is consequently great interest in methods with the potential to determine binding preferences purely from sequence and 3D structure of a protein-DNA complex, with several methods using energy-motivated scoring functions to compute the possible TF binding sites. 13-17 Position weight matrices are typically generated using the energy differences among different DNA sequences under an additive approximation. 17-19 Good results have been obtained for some families of TFs, such as zinc finger proteins. 14 Several limitations remain, however. First, proteins that require water-bridged contacts with the DNA are poorly modeled by empirical, implicit solvent energy functions. Second, minimized energies often include only enthalpic and no entropic effects. Third, protein and DNA backbones are fixed to favor the conformation of the native DNA sequence, leading to a bias in the computed position weight matrix. In this work, we present a computational approach that overcomes the above limitations. Our approach uses molecular dynamics simulation and thermodynamic integration 20 ' 21 to calculate binding free energy differences. The only requirement of the method is a starting 3D structural model of the protein-DNA complex, which can be obtained from X-ray/NMR determination or homology modeling. Our method is complementary to experimental methods such as protein binding microarrays and ChlP-chip. Our approach studies the actual binding free energy of TF-DNA complexes and includes entropic effects by exploring the entire energy surface. Therefore, not only can we produce position weight matrices for binding site representation, but our binding free energy differences can also be directly compared with experimental measurements. Another advantage of our work is that we include explicit solvent molecules (counterions and water) in our simulation, whose importance has been reviewed. 22-24 Our work investigates the role of water in the TF-DNA recognition and binding specificity and accounts for the dynamics of water-bridged contacts. Furthermore, the intrinsic flexibility of protein and DNA backbones is explored in our simulation, which allows weak binders to be discovered. Finally, our
method can be modified to estimate non-additivity among DNA base pairs.

2. Model systems

We select homeodomain and bZIP proteins as our model systems in this study. These families are abundant in eukaryotic genomes. Except for a few members of these families, binding sites have not been well-characterized. The 3D structures of homeodomain and bZIP proteins and their DNA-binding interfaces are highly conserved, making high quality homology modeling possible. Homeodomains contain ~60 amino acid residues that form three α-helices with a hydrophobic core in the middle. The third helix is often referred to as the "recognition helix" as it binds the DNA in the major groove and forms most of the base-specific contacts. The first 5 residues at the N-terminus of the protein bind the DNA in the minor groove and also form a few base-specific contacts. A typical basic region leucine zipper protein (bZIP) is ~60 amino acids long and forms a nearly-straight α-helix when bound to DNA.25 The bZIP domain is composed of two relatively independent regions: the "leucine zipper region" is a dimerization region that helps stabilize the protein secondary structure, and the "basic region" contacts the DNA major groove and determines the DNA-binding specificity. The yeast mating-type protein a2 (homeodomain) and the yeast general control protein GCN4 (bZIP) are studied in this work due to the availability of their experimental structures and binding sites for comparison and verification. Although the interactions among different monomers of TFs are important in exerting combinatorial controls over gene expression, only the monomers are considered in this work. The crystal structure of MAT-a2 (PDB:1APL) contains two identical binding sites for two isolated monomers of MAT-a2. One site is chosen in our modeling. The bZIP proteins normally bind to the DNA as homodimers or heterodimers. Since the binding interface between the basic region and the DNA is highly conserved,25-27 we select only the half-site of GCN4 in our study and do not model the dimerization region. Homeodomain and bZIP monomers typically contact 4-6 DNA base pairs. We include a 10-base pair DNA duplex in our simulation. The DNA sequences are the same as in the corresponding crystal structures of the MAT-a2 protein and the GCN4 protein. The consensus binding site sequence for the MAT-a2 protein is TTACA.28 The consensus binding site sequence for the GCN4 protein is aTGA[C|G] for its monomer.25,29 The lowercase "a" represents weak selection at that position and the last position can be a C or G.
3. Methods

3.1. Molecular dynamics simulation and free energy calculation
Figure 1 illustrates the theoretical foundation of this work.

  DNA (aq) + protein (aq) → DNA-protein (aq)          ΔG
  DNA' (aq) + protein (aq) → DNA'-protein (aq)         ΔG'
  ΔΔG = ΔG' − ΔG = ΔG_comp − ΔG_DNA

Fig. 1. Thermodynamic cycle used in the relative binding free energy calculation. The vertical legs of the cycle are the mutation of the free DNA (ΔG_DNA) and of the protein-DNA complex (ΔG_comp).

The binding
free energies of a protein with two different DNA sequences can be measured experimentally. The first horizontal reaction contains the native DNA and TF-DNA complex, whereas the second horizontal reaction contains the mutant DNA and its complex. In computations, it is relatively easy to calculate the free energy change caused by a mutation in the DNA sequence, indicated by the vertical reactions in the figure. The difference in binding free energy between the two experimental measurements, ΔG' − ΔG, is identical to the computational free energy difference, ΔG_comp − ΔG_DNA. This difference, ΔΔG, will be referred to as the relative binding free energy in this paper. More detailed theoretical background can be found in Refs. 20 and 21. The molecular simulation package CHARMM30 was used to carry out the molecular dynamics simulation, and its BLOCK module was used for free energy calculations. We first established well-equilibrated native protein-DNA complex and DNA-duplex configurations using molecular dynamics simulation. Missing hydrogen atoms were added to the crystal structures of MAT-a2 (PDB:1APL) and GCN4 (PDB:1YSA). Charges of the titratable amino acid residues were assigned to their values at neutral pH. TIP3P water molecules were added and periodic boundary conditions were applied. Counterions (Na+) were introduced to neutralize the system using the random water-replacement routine developed by Rick Venable.31 The CHARMM27 force field was used. The positions of the ions and water molecules were minimized, followed by full minimizations of the entire system using the adopted basis Newton-Raphson method. The non-bonded cutoff radius was 14 Å. The system was then heated to 300 K and
equilibrated for 1.5 ns in the NPT ensemble using a 1 fs time step. The final configurations contained about 7000 water molecules and 25000 atoms for both the MAT-a2 and GCN4 protein-DNA complexes. The protein-DNA complex and the DNA duplex were simulated separately. From the equilibrated native configurations, we used a house-built program to replace each native base pair by multi-copy base pairs.32,33 In this multi-copy approach, multiple base pairs are superimposed and their contributions to the total energy or force function are scaled by coupling parameters. In this paper, all multi-copy base pairs are a superposition of two physical base pairs; therefore, there are 6 possible multi-copy base pairs at one position. The standard base geometry34 was used to build a library of multi-copy base pair equilibrium geometries. Three consecutive rotations were applied to align the multi-copy base with the native base to preserve the orientation with respect to the rest of the DNA duplex. The structure with the multi-copy base pair was minimized first to remove possible bad contacts caused by the introduction of the multi-copy base. It was then heated to 350 K and equilibrated for 15 ps. This heating step helps move the conformation away from the native structure's local minima and may improve sampling of the glassy waters at the protein-DNA interface. The system was then cooled to 300 K and equilibrated for 65 ps. A 100 ps production run was done, during which the trajectory was saved every 0.5 ps. The simulation was done in the NVT ensemble using the same periodic boundary conditions as in the fully-equilibrated native structure. The free energy analysis on the production trajectory is outlined below. Thermodynamic integration20,21 was used to calculate the free energy change for mutating the original base pair into another possible base pair in the multi-copy base pair. The linear coupling scheme in the coupling parameter λ was used in BLOCK for the energy function of the multi-copy structures, which allows analytical solution of the free energy gradient. Typically, multiple values of λ are required for the integration. From preliminary calculations, we found that the free energy gradient was approximately linear with respect to λ for multi-copy base pairs. Therefore, we used a mid-point approximation (λ = 0.5) for computational saving. The binding free energy difference decomposes into separate contributions from DNA, protein, and solvent (ions and water) using the same
notation as Fig. 1:

  ΔΔG_total = ΔG_comp − ΔG_DNA = ΔΔG_internal + ΔΔG_external        (1)
  ΔG_comp = ΔG^c_prot + ΔG^c_solvent + ΔG^c_DNA
  ΔG_DNA = ΔG^f_solvent + ΔG^f_DNA
  ΔΔG_internal = ΔG^c_DNA − ΔG^f_DNA
  ΔΔG_external = ΔG^c_prot + ΔG^c_solvent − ΔG^f_solvent,
where the superscripts c and f represent the protein-DNA complex and the free DNA duplex, respectively. For homeodomains, the contribution of the N-terminus to the binding free energy difference was also calculated, using ΔΔG_Nterm = ΔG^c_Nterm − 0, where the zero represents the corresponding ΔG term in the DNA duplex. The binding free energy differences in Eq. (1) are converted into Boltzmann factors and position weight matrices as in Ref. 15 using the additive approximation. These matrices are converted into sequence logos35 using WEBLOGO.36 For the TFs considered in this work (Sec. 2), the DNAs remain relatively undeformed upon TF binding, which may make the additive approximation accurate.14
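As an illustration of this conversion (a generic sketch, not the authors' code), the following turns a table of relative binding free energies ΔΔG(b, j), with ΔΔG = 0 for the native base, into column-normalized Boltzmann weights, i.e. a position weight matrix under the additive approximation; the value of kT and the example numbers are assumptions.

```python
import math

def pwm_from_ddg(ddg, kT=0.593):
    """ddg: list of dicts, one per base pair position, mapping base -> ΔΔG
    (same units as kT; 0.0 for the native base). Returns Boltzmann-weighted
    probabilities, column by column."""
    pwm = []
    for col in ddg:
        weights = {b: math.exp(-g / kT) for b, g in col.items()}
        z = sum(weights.values())
        pwm.append({b: w / z for b, w in weights.items()})
    return pwm

# Illustrative two-position example (values are made up):
print(pwm_from_ddg([{"A": 0.0, "C": 1.2, "G": 2.0, "T": 0.8},
                    {"A": 1.5, "C": 0.0, "G": 1.1, "T": 2.3}]))
```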
3.2. Hydrogen bond analysis

The native protein-DNA complex and DNA-duplex trajectories were further analyzed to explore the role of water in the binding specificity. CHARMM's HBOND module was used to determine whether a hydrogen bond (H-bond) exists in a given frame of the trajectory. A distance cutoff of 2.4 Å was used as the maximum H-bond length (between acceptor and donor hydrogen), with no angle cutoff. A house-built program was then used to calculate the lifetime histograms for all occurrences of H-bonds. A 2 ps resolution was used, such that any breakage of an H-bond shorter than 2 ps is ignored.37 The existence of a direct or a water-bridged H-bond between the protein and DNA at each base pair position was also calculated. H-bonds formed by the N-terminal residues of MAT-a2 were considered separately from the rest of the protein.

4. Results and Discussions

Using the methods outlined in Sec. 3, the predicted sequence logos for the free energy terms in Eq. (1) are shown in Fig. 2. Our prediction of MAT-a2 achieves excellent agreement for all 5 positions in the "TTACA" consensus
sequence. This agreement verifies that the mid-point approximation for thermodynamic integration (Sec. 3) is valid for this TF.

Fig. 2. Predicted sequence logos and experimental logos for the yeast proteins MAT-a2 (homeodomain) and GCN4 (bZIP). The base pair positions that have base-specific contacts as either direct or water-bridged H-bonds between the protein and the DNA bases are shown. The total, internal, external, and N-terminal (for MAT-a2) logos are listed in that order. Logos generated from both TRANSFAC38 and primary experimental publications are listed at the bottom. For MAT-a2, the TRANSFAC logo is for the heterodimer MAT-a1/MAT-a2,39 and the literature logo is for the heterotetramer MAT-a2/MCM1.28 For GCN4, the TRANSFAC logo is based on sequences obtained from 4 rounds of affinity column selection and PCR amplification;40 the literature logo is based on sequences of 15 promoter regions of GCN4 targets from DNA site protection experiments.41 These two logos were obtained by converting the experimental dimer binding sequences into 2 half-site monomer binding sequences to facilitate comparison with the computational predictions.

The N-terminus is
responsible for the first two positions in the "TTACA" consensus sequence. A reduced model that considers only the "recognition helix" may fail to identify these positions. The DNA internal energies contribute largely to all five positions. Our GCN4 prediction agrees with the experimental binding sites at 4 out of 5 positions of the aTGA[C|G] consensus sequence, whereas the last position is variable in the experimental sites. The external free
491
energies are largely responsible for these positions. The information content of our prediction agrees well with the literature logo, which considered both strong and weak binding sequences. 41 The TRANSFAC logo shows higher information content, possibly because it was constructed from only the strongest binding sequences. 40 The lifetime histograms of different types of H-bonds for MAT-a2 are shown in Fig. 3. The lifetimes for H-bonds between the DNA-duplex and DNA base
DNA backbone
50 100 150 200 250 300 I ' I ' I ' I ' I ' I
0 I
50 100 150 200 250 ' I ' I ' I ' I ' I
300
i
• 1
13 ps
0.1 g
•
|
'°P
0.1 0.01
0.01
IWUJl
0.001 0.0O01 1 "S
k
o.i
? oa
0.01 0.001 0.0001
Mi
1
0.01
Q
0.001
z
0.0001 1 •0.1 •0.01
l
6
• 0.001
PS
• 0.0001 • 1
4ps
1 0.1 s <
0.001
•0.1 •0.01 • 0.001
0.0001
i
50
100
'
i
150
•
i
200
'
i
250
'
i
300
"
i
0
•' ii •' ii ' ii 50
100
150
• 0.0001 •
i
200
•
t
250
•
r
300
H-bond lifetime (ps) Fig. 3. Histograms of H-bond lifetimes for yeast MAT-a2 homeodomain protein during a 600 ps simulation. The top, middle, and bottom panels represent the direct proteinDNA H-bonds, the water-bridged protein-DNA H-bonds, and the H-bonds between DNA and water, respectively. The left and right panels represent the H-bonds formed by the DNA bases and the DNA backbone, respectively. The insets of the panels show the average lifetimes.
water are similar to a previous simulation study, 37 although the average lifetime is slightly shorter. The histograms for GCN4 (not shown) are similar except for slightly longer average lifetimes for the direct and water-bridged H-bonds. Since the binding specificity of a TF arises primarily from contacts made with the DNA bases, we now examine the left panels of Fig. 3 further. There are 3 long-lived (> 100 ps) direct protein-DNA H-bonds for MAT-a2 during a 600 ps equilibration. Two of them are between the recognition helix and the major groove bases, which are also found in the crystal structure. 28 One
H-bond is between the N-terminal tyrosine and an adenine base in the minor groove, which is not present in the crystal structure since the tyrosine side chain was not resolved. For GCN4, all long-lived direct H-bonds are also observed in the crystal structure.25 Both the MAT-a2 and GCN4 binding interfaces are highly hydrated in the simulations, with the MAT-a2 interface more hydrated than GCN4 (data not shown). Figure 4 shows the H-bond existence time-series for the native MAT-a2-DNA complex. Base pair positions 1, 9, and 10 have rare occurrences of H-bonds and thus are not shown.

Fig. 4. Time-series for direct (black dotted lines) and water-bridged (gray solid lines) H-bonds between the yeast MAT-a2 homeodomain protein and its native DNA base atoms during a 600 ps simulation. Base pair positions 2 to 8 are plotted. The two possible states being plotted are (i) having at least one direct or water-bridged H-bond (the spikes or steps), and (ii) no direct or bridged H-bond (the base lines). Co-occurring H-bonds at each base pair position are plotted as one H-bond. There are two panels for each base pair position; the upper and lower panels contain the time-series for H-bonds formed by the N-terminus and by the rest of the protein with the DNA bases, respectively. The percentages of time that H-bonds exist are listed on the right hand side of the corresponding data series. For example, at base pair 2, 34% of the time during the equilibration there is at least one direct H-bond between the N-terminus and the DNA base atoms, whereas 60% of the time there is at least one bridged H-bond between the N-terminus and the DNA base atoms.

Figures 3 and 4 demonstrate that the water-bridged H-bonds are highly dynamic, with H-bonds breaking and forming constantly. This is important because it indicates that the 100 ps production runs for the multi-copy structures provide adequate sampling of the bridging water. Figure 4 shows that bridged H-bonds form a large and extensive contact network at the protein-DNA binding interface that is more prevalent than the direct protein-DNA H-bond network.42 As a result, the binding specificity arises exclusively from water-bridged H-bonds at base pair positions 3, 7, and 8 for MAT-a2 and at base pair positions 3 and 6 for GCN4 (data not shown), respectively. These results indicate that water-bridged H-bonds contribute more to the binding affinity and specificity than direct H-bonds in these TF-DNA complexes.

5. Conclusion

We present here an all-atom molecular simulation and free energy calculation method that calculates the TF binding sites based on a 3D structural model of the protein-DNA complex. Explicit water molecules are included and are found to form a dynamic and more prevalent H-bond network than direct protein-DNA H-bonds. The predicted position weight matrices of MAT-a2 and the half-site of GCN4 agree well with the experimental binding sites. We are currently carrying out the following studies that will help establish the scope and limitations of our method. First, we are analyzing the hydration dynamics at the binding interface for multi-copy trajectories. These results serve to evaluate the efficiency of the configurational sampling of our simulation protocol. Second, we are implementing multi-copy base pairs using the most recent AMBER force field43,44 to investigate the sensitivity of our method to the force field parameters. Preliminary studies of homeodomain and bZIP proteins have shown that high quality homology modeling is possible (RMSD smaller than or about 1 Å can be obtained for many family members). We are currently investigating the effects of starting 3D
structural models on the TF binding site predictions. Finally, sensitivity of the results to the starting DNA sequence is also being considered. Our method is computationally intensive. The prediction of binding sites for one TF requires ~400 CPU-days on a 3.0 GHz Intel processor, which is about $1500 considering a 3-year lifespan for a CPU. We are currently
developing and testing multiple-multi-copy methods in which two or more base pairs are both multi-copies. These calculations can improve the computational efficiency of our method. Furthermore, the free energy analysis of such structures helps quantify the correlation among the base pairs and provides an estimation of error for the additive approximation.

Acknowledgements

LAL acknowledges funding from the Department of Energy (DEFG0204ER25626). JSB acknowledges funding from NSF CAREER 0546446, NIH/NCRR U54RR020839, and the Whitaker Foundation. We acknowledge a starter grant and an MRAC grant of computer time from the Pittsburgh Supercomputer Center, MCB060010P, MCB060033P, and MCB060056N.

References
1. C. O. Pabo and R. T. Sauer, Annu Rev Biochem 53, 293 (1984).
2. C. O. Pabo and R. T. Sauer, Annu Rev Biochem 61, 1053 (1992).
3. G. Patikoglou and S. K. Burley, Annu Rev Biophys Biomol Struct 26, 289 (1997).
4. N. M. Luscombe, S. E. Austin, H. M. Berman and J. M. Thornton, Genome Biol 1, p. REVIEWS001 (2000).
5. M. D. Biggin, Nat Genet 28, 303 (2001).
6. M. L. Bulyk, Genome Biol 5, p. 201 (2003).
7. G. D. Stormo and D. S. Fields, Trends Biochem Sci 23, 109 (1998).
8. C. Tuerk and L. Gold, Science 249, 505 (1990).
9. B. Ren, F. Robert, J. J. Wyrick, O. Aparicio, E. G. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, E. Kanin, T. L. Volkert, C. J. Wilson, S. P. Bell and R. A. Young, Science 290, 2306 (2000).
10. S. Mukherjee, M. F. Berger, G. Jona, X. S. Wang, D. Muzzey, M. Snyder, R. A. Young and M. L. Bulyk, Nat Genet 36, 1331 (2004).
11. P. V. Benos, A. S. Lapedes and G. D. Stormo, Bioessays 24, 466 (2002).
12. A. Sarai and H. Kono, Annu Rev Biophys Biomol Struct 34, 379 (2005).
13. J. Ashworth, J. J. Havranek, C. M. Duarte, D. Sussman, R. J. Monnat, B. L. Stoddard and D. Baker, Nature 441, 656 (2006).
14. G. Paillard and R. Lavery, Structure (Camb) 12, 113 (2004).
15. A. V. Morozov, J. J. Havranek, D. Baker and E. D. Siggia, Nucleic Acids Res 33, 5781 (2005).
16. R. G. Endres, T. C. Schulthess and N. S. Wingreen, Proteins 57, 262 (2004).
17. R. A. O'Flanagan, G. Paillard, R. Lavery and A. M. Sengupta, Bioinformatics 21, 2254 (2005).
18. M. L. Bulyk, P. L. Johnson and G. M. Church, Nucleic Acids Res 30, 1255 (2002).
19. P. V. Benos, M. L. Bulyk and G. D. Stormo, Nucleic Acids Res 30, 4442 (2002).
20. P. M. King, Free energy via molecular simulation: a primer, in Computational Simulation of Biomolecular Systems: Theoretical and Experimental Applications, eds. W. van Gunsteren, P. K. Weiner and A. Wilkinson (ESCOM, Leiden, 1993), p. 267.
21. L. Andrew, Molecular Modelling: Principles and Applications, 2nd edn. (Prentice Hall, 2001).
22. T. E. Cheatham 3rd and P. A. Kollman, Annu Rev Phys Chem 51, 435 (2000).
23. J. W. Schwabe, Curr Opin Struct Biol 7, 126 (1997).
24. C. Wolberger, Curr Opin Struct Biol 6, 62 (1996).
25. T. E. Ellenberger, C. J. Brandl, K. Struhl and S. C. Harrison, Cell 71, 1223 (1992).
26. C. R. Vinson, P. B. Sigler and S. L. McKnight, Science 246, 911 (1989).
27. T. W. Siggers, A. Silkov and B. Honig, J. Mol. Biol. 345, 1027 (2005).
28. C. Wolberger, A. K. Vershon, B. Liu, A. D. Johnson and C. O. Pabo, Cell 67, 517 (1991).
29. W. Keller, P. Konig and T. J. Richmond, J Mol Biol 254, 657 (1995).
30. B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan and M. Karplus, J. Comput. Chem. 4, 187 (1983).
31. R. Venable, Ion addition by selective water replacement, http://www.charmm.org/ubbthreads/ubbthreads.php?Cat=0, CHARMM Community Script Archive.
32. I. Lafontaine and R. Lavery, Biopolymers 56, 292 (2000).
33. I. Lafontaine and R. Lavery, Biophys J 79, 680 (2000).
34. W. K. Olson, M. Bansal, S. K. Burley, R. E. Dickerson, M. Gerstein, S. C. Harvey, U. Heinemann, X. J. Lu, S. Neidle, Z. Shakked, H. Sklenar, M. Suzuki, C. S. Tung, E. Westhof, C. Wolberger and H. M. Berman, J Mol Biol 313, 229 (2001).
35. T. D. Schneider and R. M. Stephens, Nucleic Acids Res 18, 6097 (1990).
36. G. E. Crooks, G. Hon, J. M. Chandonia and S. E. Brenner, Genome Res 14, 1188 (2004).
37. A. M. Bonvin, M. Sunnerhagen, G. Otting and W. F. van Gunsteren, J Mol Biol 282, 859 (1998).
38. E. Wingender, X. Chen, E. Fricke, R. Geffers, R. Hehl, I. Liebich, M. Krull, V. Matys, H. Michael, R. Ohnhauser, M. Pruss, F. Schacherer, S. Thiele and S. Urbach, Nucleic Acids Res 29, 281 (2001).
39. C. Goutte and A. D. Johnson, Embo J 13, 1434 (1994).
40. A. R. Oliphant, C. J. Brandl and K. Struhl, Mol. Cell. Biol. 9, 2944 (1989).
41. K. Arndt and G. R. Fink, Proc Natl Acad Sci U S A 83, 8516 (1986).
42. M. Billeter, P. Guntert, P. Luginbuhl and K. Wuthrich, Cell 85, 1057 (1996).
43. D. A. Case, T. E. Cheatham 3rd, T. Darden, H. Gohlke, R. Luo, K. M. Merz, A. Onufriev, C. Simmerling, B. Wang and R. J. Woods, J Comput Chem 26, 1668 (2005).
44. T. E. Cheatham 3rd and M. A. Young, Biopolymers 56, 232 (2000).
COMPARATIVE PATHWAY ANNOTATION WITH PROTEIN-DNA INTERACTION AND OPERON INFORMATION VIA GRAPH TREE DECOMPOSITION
JIZHEN ZHAO, DONGSHENG CHE AND LIMING CAI
Department of Computer Science, University of Georgia, Athens, GA 30602, USA. Email: {jizhen, che, cai}@cs.uga.edu, Fax: (706) 542-2966.
Template-based comparative analysis is a viable approach to the prediction and annotation of pathways in genomes. Methods based solely on sequence similarity may not be effective enough; functional and structural information such as protein-DNA interactions and operons can prove useful in improving the prediction accuracy. In this paper, we present a novel approach to predicting pathways by seeking high overall sequence similarity, functional and structural consistency between the predicted pathways and their templates. In particular, the prediction problem is formulated into finding the maximum independent set (MIS) in the graph constructed based on operon or interaction structures as well as homologous relationships of the involved genes. On such graphs, the MIS problem is solved efficiently via non-trivial tree decomposition of the graphs. The developed algorithm is evaluated based on the annotation of 40 pathways in Escherichia coli (E. coli) K12 using those in Bacillus subtilis (B. subtilis) 168 as templates. It demonstrates overall accuracy that outperforms those of the methods based solely on sequence similarity or using structural information of the genome with integer programming.
Keywords: pathway annotation, pathway prediction, protein-DNA interaction, operon, tree decomposition, independent set

1. Introduction

A challenge in the post-genomic era is the understanding at different levels of the genomes that have been sequenced.12 Many efforts have been made in gene finding and assigning predicted or determined functionalities to found genes. However, higher order functional analysis of organisms from their genomic information remains in demand. Assigning a biological pathway to a set of genes, known as pathway annotation, is one such analysis, which is essential to understanding cellular processes and organism behaviors in a larger context.14 Biological pathways could be determined experimentally but this is usually expensive and laborious. As
more and more genomes are sequenced and annotated, it is feasible to employ comparative genomic analysis in pathway prediction and annotation at the genome scale. Based on an annotated pathway from one genome as template, a pathway for a target genome can be predicted by identifying a set of orthologues based on sequence similarity to the genes in the template pathway. A naive approach for orthology assigning is choosing the best BLAST hit for each gene (BH). A more often used technique is reciprocal BLAST search, called bidirectional best-hit (BBH),8 where gene pairs are regarded as orthologues if they are the best hits in both directions of the search. However, these and other sequence similarity based approaches share the same limitation:7 the best hits may not necessarily be the optimal orthologues, thus compromising the prediction accuracy. It is observed that homology relationships exist not only at the sequence level, but also at functional and structural levels,15 e.g., those of operon structures and protein-DNA interactions such as transcriptional regulation patterns of some genes by transcriptional factors (TFs). Recently, substantial operon and transcriptional regulation information have been curated from the scientific literature for a number of genomes.6,11 Computational methods10,11,15 have also been developed to predict operon structures and co-regulated genes. The structural information about transcriptional regulation patterns that is needed may be gathered in a number of ways, although it may not necessarily be complete or extremely accurate. By considering such high level information among genes along with the sequence similarity, it becomes possible to improve the pathway prediction accuracy. However, the optimal prediction of pathways at the genome scale becomes a difficult combinatorial optimization problem if sophisticated structural information is incorporated. PMAP7 is an existing method that overcomes the difficulty by incorporating partial structural information (i.e., structural information of the target genome only) with integer programming. In this paper, a novel approach is introduced based on integrating data in sequence similarity, experimentally confirmed or predicted operons, transcriptional regulations, as well as available functional information of related genes, in both the template pathway and the target genome. The new approach has led to an efficient graph-theoretic algorithm called TdPATH for pathway prediction. Algorithm TdPATH predicts a pathway in a target genome based on a template pathway by identifying an orthologous gene in the target genome for each gene in the template pathway, such that the overall sequence and structural similarity between the template and the predicted path-
ways achieves the highest. In particular, homologes for each gene in the template pathway are first identified by the BLAST search.1 Functional information is then used to filter out genes unlikely to be orthologues. The structural information is used to further constrain the orthology assignment. One of the homologes is eventually chosen to be the ortholog for the gene. The pathway prediction is formulated into the maximum independent set (MIS) problem and the maximum CLIQUE problem by taking protein-DNA interaction constraints and operon constraints, respectively. Because both problems are computationally intractable, we solve them efficiently with non-trivial techniques based on tree decompositions of the graphs constructed from the structural constraints. Our algorithm TdPATH has been implemented and its effectiveness is evaluated against BH, BBH and PMAP in the annotation of 40 pathways of E. coli K12 using the corresponding pathways of B. subtilis 168 as templates. The results showed that overall, in terms of the accuracy of the prediction, TdPATH outperforms BH and BBH, which are based solely on sequence similarity, as well as PMAP, which uses partial structural information. In terms of average running time to predict a pathway, it outperforms PMAP. Algorithm TdPATH is dynamic programming based on tree decomposition techniques. The running time of the algorithm is dominated by the function 2^t·n, where t is the tree width of the underlying graphs of n vertices constructed from the structural constraints. In particular, the statistics on the tree width of these graphs show that about 87% of the graphs have tree width at most 5, while 94% have tree width at most 8. Therefore, the tree decomposition based algorithm for pathway prediction is both theoretically and practically more efficient than the integer programming based algorithm PMAP.

2. Methods and Algorithm

2.1. Problem formulation
A pathway is defined as a set of molecules (genes, RNAs, proteins, or small molecules) connected by links that represent physical or functional interactions. It can be reduced to a set of genes that code related functional proteins. An operon is a set of genes transcribed under the control of an operator gene. Genes that encode transcriptional factors are called tf genes. In the work described in this paper, a known pathway in one genome is used as a template to predict a similar pathway in a target genome. That is, for every gene in the template pathway, we identify some gene in target
genome as its ortholog if there is one, under the constraints of protein-DNA interaction (i.e., transcriptional regulation) and operon information. The problem of predicting pathways is defined as:

INPUT: a template pathway model P = (A_P, R_P, O_P) and a target genome T, where A_P is a set of genes in P, R_P is a set of relationships between tf genes and genes regulated by the corresponding tf gene products, and O_P is a set of operons;

OUTPUT: a pathway Q = (A_Q, R_Q, O_Q) for T and an orthology mapping π : A_Q → A_P such that the overall sequence similarity between all genes in pathway Q and their corresponding orthologues in the template P, as well as the consistency of the operon and regulation structures between pathways P and Q, is as high as possible.

2.2. The methods
Our approach consists of the following steps: (1) For every gene in the template pathway P, find a set of homologes in the target genome T with BLAST; (2) Remove from the homologes genes unlikely to be orthologues to the corresponding gene in the template P. This is done based on functional information, e.g., Cluster of Orthologous Groups (COG) 16 , which is available. In particular, genes that are unlikely orthologous would have different COG numbers. (3) Obtain protein-DNA interactions and operon structures for the homologous genes in the template pathway and target genome from related databases 6 ' n , literatures or computational tools 10>15. (4) Exactly one of the homologous genes is eventually assigned as the ortholog for the corresponding gene in the template P. This is done based on the constraints by the protein-DNA interaction and operon information (for any gene that is not covered by the structural information due to the incomplete data or other reasons, we simply assign the best BLAST hit as the ortholog). Such an orthology mapping or assignment essentially should yield a predicted pathway that has overall high sequence similarity and structural consistency with the template pathway. By incorporating sophisticated structural information, the pathway prediction problem may become computationally intractable. We describe in the following in detail how an efficient algorithm can be obtained to find
the orthology mapping between the template pathway and the one to be predicted. We consider in two separate steps structural constraints with protein-DNA interactions and those with operons.

2.2.1. Constraints with protein-DNA interactions
We use available protein-DNA interaction information, i.e. the transcriptional regulation information, to constrain the orthology assignment. This is to identify orthologs with regulation structures consistent with the corresponding genes in the template pathway. Thinking of genes as vertices and relations among the genes as edges, the template pathway and the corresponding homologs in the target genome can naturally be formed into two graphs. The problem can thus be converted to finding an optimal common subgraph of these two graphs, which is in turn formulated as a maximum independent set (MIS) problem. Details are given below. For convenience, in this paper we call a regulon a gene encoding a transcription factor together with all the genes regulated by that factor.
Figure 1. Constraints with transcriptional regulations. (a) Regulation graph G1 for the template pathway. A directed edge points from a tf gene to a gene regulated by the corresponding TF, a solid edge connects two genes regulated by the same TF, and a dashed edge connects two genes belonging to different regulons. (b) Regulation graph G2 for the homologous genes in the target genome, constructed in a similar way to (a). (c) Merged graph G from G1 and G2. Each node is a pair of homologous genes.
(1) A regulation graph G1 = (V1, E1) is built for the template pathway P, where the vertex set V1 represents all genes in the template pathway P, and the edge set E1 contains three types of edges: an edge of type-1 connects a tf gene and every gene regulated by the corresponding product; an edge of type-2 connects two genes regulated by the same tf gene product; and edges of type-3 connect two genes from different regulons if they are not yet connected (Figure 1(a)).
(2) A regulation graph G2 = (V2, E2) is built for the target genome in the similar way, where V2 represents the homologous genes in the target genome (Figure 1(b)).
(3) Graphs G1 and G2 are merged into a single graph G = (V, E) such that V contains a vertex [i, j] if and only if i ∈ V1 and j ∈ V2 are two homologous genes. A weight is assigned to vertex [i, j] according to the BLAST score between genes i and j. Add an edge ([i, j], [i', j']) if either (a) i = i' or j = j' but not both, or (b) the edges (i, i') ∈ E1 and (j, j') ∈ E2 are not of the same type (Figure 1(c)).
(4) Then the independent set in the graph G with the maximum weight corresponds to the desired orthology mapping that achieves the maximum sequence similarity and regulation consistency. This assigns one unique orthologous gene in the template pathway to each gene in the pathway to be predicted, as long as they are covered by the known protein-DNA interaction structures; a sketch of this construction is given below.
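A sketch of the merged-graph construction in steps (3) and (4); the input data structures (a homolog table keyed by template gene, a BLAST-score table, and edge-type maps for G1 and G2) are assumptions for illustration, and a pair with an edge in only one of the two graphs is treated as inconsistent.

```python
def build_merged_graph(homologs, blast_score, edge_type1, edge_type2):
    """homologs: dict template gene i -> list of candidate target genes j.
    edge_type1[(i, i')] / edge_type2[(j, j')]: edge type (1, 2 or 3) in G1 / G2.
    Returns vertices [i, j] with weights and the conflict edges of G."""
    vertices = [(i, j) for i, js in homologs.items() for j in js]
    weight = {(i, j): blast_score[(i, j)] for i, j in vertices}
    edges = set()
    for a, (i, j) in enumerate(vertices):
        for (k, l) in vertices[a + 1:]:
            if (i == k) != (j == l):                      # rule (a): share one end only
                edges.add(((i, j), (k, l)))
            elif i != k and j != l:
                t1 = edge_type1.get((i, k)) or edge_type1.get((k, i))
                t2 = edge_type2.get((j, l)) or edge_type2.get((l, j))
                if t1 != t2:                              # rule (b): inconsistent edge types
                    edges.add(((i, j), (k, l)))
    return vertices, weight, edges
```

A maximum-weight independent set of the returned graph then yields the orthology assignment described in step (4).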
2.2.2. Constraints with operon structures

We now describe how to use confirmed or predicted operon information to further constrain the orthology assignment. This step applies to the genes that have not been covered by protein-DNA interaction structures.
Figure 2. Constraints with operon information. See description for details. A dashed line connects two homologes. (a) Setting the weight for an operon. (b) A pair of partially conserved operons in the template pathway and target genome. (c) A mapping graph formed according to (b). (d) An operon that appears only in the target genome. (e) The mapping graph formed according to (d).
We first assign to each gene i a weight w_i. w_i is set according to the average of its BLAST scores with its top m (say, 5) homologes. The weight of an operon o is set as 0.5 (n − 1) Σ_{i∈o} w_i / n, where n is the number of genes in the operon (Figure 2(a)). The factor 0.5 allows an operon in one genome to contribute only 50% and a conserved operon in the other genome to contribute the other 50%. We use the term n − 1 in the formula because we want to exclude from consideration operons that have only one gene, since they do not introduce structural information. We then sort the operons in non-decreasing order of their sizes and use the following greedy iterative process to constrain the orthology mapping as long as there is an unexamined operon. Repeat the following 4 steps:
(1) Select the largest unexamined operon and consider the related homologes in the other genome as well as the available operon structures in them;
(2) Build a mapping graph Gm = (Vm, Em) (Figure 2(b)-(e)), where Vm contains the following two types of vertices: an operon vertex represents each of the involved operons and a mapping vertex [i, j] represents each pair of homologous genes i and j. The edge set Em contains three types of edges: an edge connects every pair of mapping vertices ([i, j], [k, l]) if i ≠ k and j ≠ l, an edge connects an operon node and a mapping node if one of the two genes in the mapping node belongs to the operon, and an edge connects every pair of involved operons between the target genome and the template pathway;
(3) Find the maximum clique C on Gm;
(4) Remove the template genes that appear in the mapping nodes of C and their homologes. Remove an operon if all genes in it have been removed. If only a subset of the genes in an operon have been removed, leave the remaining genes as a reduced operon. Re-sort the remaining operons.
By this formulation, an edge in the graph Gm denotes a consistent relationship between the two nodes connected by it. A maximum clique denotes a set of consistent operon and mapping nodes that have the maximum total weight and thus can infer an optimal mapping. Note that an operon in one genome could have zero or more, complete or partial, conserved operons in another genome.10 If it has one or more (Figure 2(b)), the constraint can be obtained from both of the genomes and is thus called a two side con-
straint. The procedure can find the orthology mapping that maximizes the sequence similarity and the operon structural consistency. Otherwise, it is called a one-side constraint (Figure 2(d)), and the procedure can find the orthology mapping that minimizes the number of involved operons.
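A small sketch of the operon weighting used in this step, assuming the gene weights w_i (average BLAST scores of the top homologes) are already available:

```python
def operon_weight(operon_genes, gene_weight):
    """0.5 * (n - 1) * mean(w_i) for an operon with n genes; single-gene
    operons therefore get weight 0 and contribute no structural signal."""
    n = len(operon_genes)
    if n == 0:
        return 0.0
    mean_w = sum(gene_weight[g] for g in operon_genes) / n
    return 0.5 * (n - 1) * mean_w

# Example matching Figure 2(a): a three-gene operon.
print(operon_weight(["g1", "g2", "g3"], {"g1": 1.0, "g2": 2.0, "g3": 3.0}))
# 0.5 * 2 * 2.0 = 2.0
```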
2.3. Tree decomposition based algorithm
Based on Section 2.2, constraining the orthology mapping with protein-DNA interactions and with operon structures can be reduced to the problems of maximum independent set (MIS) and maximum clique (CLIQUE) on graphs formulated from the structural constraints. Both problems are in general computationally intractable; any naive optimization algorithm would be very inefficient considering that the pathway prediction is at the genome scale. Our algorithmic techniques are based on graph tree decomposition. A tree decomposition13 of a graph provides a topological view on the graph, and the tree width measures how much the graph is tree-like. Informally, in a tree decomposition, vertices from the original graph are grouped into a number of possibly intersecting bags; the bags topologically form a tree relationship. Shared vertices among intersecting bags form graph separators; efficient dynamic programming traversal over the graph is possible when all the bags are (i.e., the tree width is) of small size.3 In general, the graphs formulated from protein-DNA interactions and operon structures have small tree width. We employ the standard tree decomposition-based dynamic programming algorithm3 to solve MIS and CLIQUE problems on graphs of small tree width. On graphs with larger tree width, especially on dense graphs, our approach applies the tree decomposition algorithm on the complement of the graph instead. The running time of the algorithms is O(2^t·n), where t and n are respectively the tree width and the number of vertices in the graph. Such a running time is scalable to larger pathways. Due to the space limitation, we omit the formal definition of tree decomposition and the dynamic programming algorithm. Instead, we refer the reader to 3 for details. We need to point out that finding the optimal tree decomposition (i.e., the one with the smallest tree width) is NP-hard.2 We use a simple, fast approximation algorithm, greedy fill-in,4 to produce a tree decomposition for the given graph. The approximated tree width t may affect the running time of the pathway prediction but not its accuracy.
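The greedy fill-in heuristic mentioned above can be sketched as follows: vertices are eliminated in an order that locally adds the fewest fill-in edges, and the largest eliminated neighbourhood gives an upper bound on the tree width (the bags of a decomposition can be read off the same elimination process). This is a generic illustration rather than the authors' implementation.

```python
from itertools import combinations

def greedy_fillin_width(adj):
    """adj: dict vertex -> set of neighbours (undirected graph). Returns an
    elimination order and an upper bound on the tree width."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        # pick the vertex whose elimination creates the fewest fill-in edges
        def fill_edges(v):
            return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])
        v = min(adj, key=fill_edges)
        nbrs = adj[v]
        width = max(width, len(nbrs))             # bag {v} ∪ N(v) has size |N(v)| + 1
        for a, b in combinations(nbrs, 2):        # make the neighbourhood a clique
            adj[a].add(b)
            adj[b].add(a)
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order, width
```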
3. Evaluation Results

We evaluated TdPATH against BH, BBH and PMAP by using 40 known pathways in B. subtilis 168 from the KEGG pathway database5 as templates (Table 1) to infer the corresponding pathways in E. coli K12. For TdPATH, the operon structures are predicted according to the method used in 10 and experimentally confirmed transcriptional regulation information is taken from 6 for B. subtilis 168 and from 11 for E. coli K12. For PMAP, predicted operon and regulon information is obtained according to the method used in 7. Both TdPATH and PMAP include the COG filtering.

Table 1.
Template pathways of B. subtilis 168, taken from KEGG pathway database.
bsu00040 bsu00471 bsu00660 bsu00930 bsu03060 bsu00520
bsu00100 bsu00480 bsu00720 bsu00950 bsu00220 bsu00920
bsu00130 bsu00511 bsu00730 bsu01031 bsu00450 bsu03010
bsu00190 bsu00530 bsu00750 bsu01032 bsu00770 bsu00240
bsu00193 bsu00531 bsu00760 bsu02040 bsu00780 bsu00400
bsu00401 bsu00602 bsu00900 bsu03020 bsu01053
bsu00430 bsu00604 bsu00903 bsu03030 bsu02030
We evaluated the accuracy of the algorithms. The accuracy was measured as the arithmetic mean of sensitivity and specificity. Let K be the real target pathway and H be the set of homologous genes found by BLAST according to the corresponding template pathway. Let R be the size of K ∩ H, i.e. the number of genes common to both the real target pathway and the candidate orthologues. We use this number as the number of real genes when calculating sensitivity and specificity, because that is the maximum number of genes a sequence-based method can predict correctly. Since BH (or BBH) can be considered a subroutine of PMAP and TdPATH, we only evaluated efficiency for PMAP and TdPATH. Running times from reading the inputs to outputting the predicted pathway were collected. For TdPATH, we also collected data on the tree width of the tree decompositions of the constructed graphs or their complement graphs. For all of the algorithms, the program NCBI blastp1 was used for the BLAST search and the E-value threshold was set to 10^-6. The experiments ran on a PC with a 2.8 GHz Intel(R) Pentium 4 processor and 1 GB RAM, running RedHat Enterprise Linux version 4 AS. Running times were measured using the "time" function. The testing results are summarized in Table 2. On average, TdPATH has an accuracy of 0.88, which is better than those of the other algorithms. We give two examples here to show that the improvement is good for small as well as large pathways. One is the nicotinate and nicotinamide metabolism, which has 13 genes in B. subtilis 168 while 16
genes in E. coli K12. The prediction accuracy of TdPATH is 0.9, better than the 0.79, 0.83 and 0.79 of BH, BBH and PMAP, respectively. The other is the pyrimidine metabolism pathway, which has 53 genes in B. subtilis 168 and 58 in E. coli K12. TdPATH has a prediction accuracy of 0.82, better than the 0.79, 0.80 and 0.79 of BH, BBH and PMAP, respectively. PMAP has the second highest accuracy, which suggests that prediction accuracy can be improved even when structural information is incorporated only partially.

Table 2. Evaluation results. T: time (in seconds), A: accuracy ((sensitivity + specificity)/2).

      BH A   BBH A   PMAP A   PMAP T   TdPATH A   TdPATH T
min   0.33   0.45    0.33     12.8     0.50       1.2
max   1.00   1.00    1.00     27.3     1.00       33.3
ave   0.84   0.85    0.86     16.4     0.88       11.5
In terms of efficiency, TdPATH takes 11.5 seconds on average to predict a pathway, slightly better than the 16.4 seconds of PMAP. The tree width distribution is shown in Figure 3. On average, the tree width of the tree decompositions on the constructed graphs or their complement graphs is 3; 87% of them have tree width at most 5, and 94% at most 8. Since the theoretical running time of the tree decomposition-based method for finding a maximum independent set is O(2^t n) (where t is the tree width), these tree width statistics indicate that our algorithm is efficient in most cases.
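As a rough, back-of-the-envelope illustration (our own, not from the paper) of why these tree widths keep the dynamic programming tractable, the snippet below tabulates the worst-case number of per-bag subsets, 2^(t+1), for a few illustrative widths; the standard bound counts one table entry per subset of each bag.

```python
# Worst-case DP table size per bag for a given tree width t
# (a bag has at most t + 1 vertices, hence at most 2^(t+1) subsets).
for t in (3, 5, 8, 16):
    print(f"tree width {t:2d}: at most {2 ** (t + 1):6d} subsets per bag")
```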
[Figure 3: histogram of tree widths (x-axis: tree width, roughly 2-16; y-axis: count).]
Figure 3. Distribution of the tree width of the tree decompositions on the constructed graphs or their complement graphs.
4. Discussion and Conclusion

We have presented our work on utilizing functional information and structural information, including protein-DNA interactions and operon structures, in comparative-analysis-based pathway prediction and annotation. The structural information used to constrain the orthology assignment between the template pathway and the one to be predicted appears to be critical for improving prediction accuracy. The goal was to make the sequence similarity and the structural consistency between the template and the predicted pathways as high as possible. Technically, the problem was formulated as finding a maximum independent set on the graphs constructed from the structural constraints. Our algorithm, based on a non-trivial tree decomposition, coped well with the computational intractability and ran very efficiently. Evaluations on real pathway prediction for E. coli also showed the effectiveness of this approach. It can also utilize incomplete data and tolerate some noise in the data.

The tree decomposition-based algorithm is sophisticated yet practically efficient. Simpler algorithms are possible if only functional information and sequence similarity are considered. However, computationally incorporating structural information such as protein-DNA interactions and operons into optimal pathway prediction appears to be inherently difficult; naive optimization algorithms may not scale to larger pathways at the genome scale. In addition to computational efficiency, our graph-theoretic approach makes it possible to incorporate further information, such as gene fusion and protein-protein interactions 12, to improve accuracy, simply because such information can also be represented as graphs. On the other hand, when a template pathway is not well conserved in the target genome, the method may fail to predict the pathway correctly. Multiple templates could be used to remedy this problem, since the conserved information in different templates can complement each other. We are working on building profiles from multiple template pathways and using them for pathway prediction.
References

1. S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., 25, 3389-3402, 1997.
2. H. L. Bodlaender, "Classes of graphs with bounded tree-width", Tech. Rep. RUU-CS-86-22, Dept. of Computer Science, Utrecht University, the Netherlands, 1986.
3. H. L. Bodlaender, "Dynamic programming algorithms on graphs with bounded tree-width", in Proceedings of the 15th International Colloquium on Automata, Languages and Programming, Lecture Notes in Computer Science, 317, 105-119, Springer-Verlag, 1987.
4. I. V. Hicks, A. M. C. A. Koster, E. Kolotoglu, "Branch and tree decomposition techniques for discrete optimization", in Tutorials in Operations Research: INFORMS - New Orleans, 2005.
5. M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, M. Hirakawa, "From genomics to chemical genomics: new developments in KEGG", Nucleic Acids Res., 34, D354-D357, 2006.
6. Y. Makita, M. Nakao, N. Ogasawara, K. Nakai, "DBTBS: database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics", Nucleic Acids Res., 32, D75-77, 2004.
7. F. Mao, Z. Su, V. Olman, P. Dam, Z. Liu, Y. Xu, "Mapping of orthologous genes in the context of biological pathways: An application of integer programming", PNAS, 108 (1), 129-134, 2006.
8. D. W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 516-517, 2000.
9. R. Nielsen, "Comparative genomics: Difference of expression", Nature, 440, 161, 2006.
10. M. N. Price, K. H. Huang, E. J. Alm, A. P. Arkin, "A novel method for accurate operon predictions in all sequenced prokaryotes", Nucleic Acids Res., 33, 880-892, 2005.
11. H. Salgado, S. Gama-Castro, M. Peralta-Gil, et al., "RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions", Nucleic Acids Res., 34, D394-D397, 2006.
12. J. L. Reed, I. Famili, I. Thiele, B. O. Palsson, "Towards multidimensional genome annotation", Nature Reviews Genetics, 7, 130-141, 2006.
13. N. Robertson and P. D. Seymour, "Graph minors. II. Algorithmic aspects of tree-width", J. Algorithms, 7, 309-322, 1986.
14. P. Romero, J. Wagg, M. L. Green, D. Kaiser, M. Krummenacker, P. D. Karp, "Computational prediction of human metabolic pathways from the complete human genome", Genome Biology, 6, R2, 2004.
15. Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, Y. Xu, "Computational Inference of Regulatory Pathways in Microbes: an Application to Phosphorus Assimilation Pathways in Synechococcus sp. WH8102", Genome Informatics, 14, 3-13, 2003.
16. R. L. Tatusov, E. V. Koonin, D. J. Lipman, "A Genomic Perspective on Protein Families", Science, 278 (5338), 631-637, 1997.
PACIFIC SYMPOSIUM ON
BIOCOMPUTING 2007 The Pacific Symposium on Biocomputing (PSB) 2007 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2007 will be held January 3-7, 2007 at the Grand Wailea, Maui. Tutorials will be offered prior to the start of the conference. PSB 2007 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.
The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's "hot topics." In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.
World Scientific
ISBN 981 270 417 5
www.worldscientific.com