between concepts P (the part) and W (the whole) from a relation, where the concept Part of W reifies, i.e., embeds in its name, the PART-OF relationship to W, and the equivalent concepts and relations (e.g., where Part of W embeds a reified PART-OF relationship). (2) If a relation to be modified is specific to the base relations (e.g., 54.92% in FMA as shown in Table 3), find all relations inferable from this relation (or using it for inference) and check their validity. (3) If a relation to be modified is represented explicitly and inferable (e.g., 38.83% in FMA as shown in Table 3), identify all relations from which this relation can be inferred, and check their validity. Detecting inconsistency. Both FMA and GALEN were found to contain a small number of hierarchical cycles, resulting from either reflexive or circular hierarchical relations. Cycles may be found among the relations explicitly represented (e.g.,
Joint>, where the concept Component of Joint reifies a specialized PART-OF relationship. Examples of augmentation based on nominal modification and prepositional attachment include
Identifying the origin of semantic relations
Semantic relations may be acquired by several methods. They can be explicitly represented, added by complementation, generated by augmentation, or generated by inference. The former two categories constitute explicit knowledge (i.e., the base semantic relations in this study) and the latter two implicit knowledge. In other words, each method produces a set of semantic relations. Augmentation relies solely on concept names, so only one set of augmented relations obtains. In contrast, inference can be applied to the base relations only, to the augmented relations only, or to both, resulting in three distinguishable sets of inferred relations. The five sets of semantic relations studied are: B (base semantic relations), A (augmented semantic relations), I_B (inferred semantic relations based on the base relations alone), I_A (inferred semantic relations based on the augmented relations alone), and I_B∪A (inferred semantic relations based on the base and augmented relations). Depending on which method (or methods) can generate it, each semantic relation belongs to at least one and at most five of the sets B, A, I_B, I_A, and I_B∪A. When a relation can be generated by several methods, it is common to the corresponding sets of relations and, thus, belongs to the intersection of these sets. We use the intersection of sets as a unique identifier for the origin of a relation, hereafter referred to as its source. For example, the source (B ∩ A ∩ I_B∪A ∩ I_A) identifies the relations common to the sets B, A, I_B∪A, and I_A, but absent from I_B. More concretely, the semantic relation
Prostate> (i.e., in I_B∪A); and cannot be inferred solely from base relations using our inference rules (i.e., not in I_B).
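The source bookkeeping described above can be sketched in a few lines: each acquisition method yields a set of relations, and the source of a relation is exactly the combination of methods that can generate it. The relation triples and the contents of the five sets below are invented placeholders, not actual FMA or GALEN content.

```python
# Sketch of the "source" of a relation: the exact set of methods
# (B, A, I_B, I_A, I_BuA) whose relation sets contain it.

def source_of(relation, method_sets):
    """Return the frozenset of method names whose sets contain relation."""
    return frozenset(name for name, rels in method_sets.items()
                     if relation in rels)

# Toy method sets; triples are invented, not real FMA/GALEN relations.
methods = {
    "B":    {("head", "PART-OF", "body"), ("hand", "PART-OF", "arm")},
    "A":    {("head", "PART-OF", "body")},
    "IBuA": {("head", "PART-OF", "body"), ("finger", "PART-OF", "arm")},
    "IB":   {("finger", "PART-OF", "arm")},
    "IA":   set(),
}

# A relation in B, A, and I_BuA, but absent from I_B and I_A:
print(source_of(("head", "PART-OF", "body"), methods))
```

Relations sharing the same frozenset fall into the same disjoint subset, which is how the partition into sources in the next section can be computed.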
4
4.1
Results
Number of semantic relations acquired
The number of semantic relations acquired from FMA and GALEN is presented in Table 1. The base semantic relations include the relations explicitly represented and those added by complementation, as described earlier. The implicit relations are generated by augmentation and inference. Because semantic relations may be acquired by several methods, the total number of unique semantic relations is slightly less than the sum of the number of relations in the four subcategories listed.
Table 1. Number of semantic relations acquired from FMA and GALEN
4.2
Origin of the semantic relations acquired
From the perspective of the semantic relations, the source of a relation represents the method (or methods) by which this relation can be generated. From the five individual methods studied in this paper (B, A, I_B, I_A, and I_B∪A), nineteen sources in FMA and sixteen in GALEN were found to partition the total set of relations into disjoint subsets. To each subset corresponds a combination of methods by which the relations in the subset can be generated. As shown in Figure 1, four sources contribute the vast majority of relations in both FMA (about 95%) and GALEN (nearly 99%). These sources are: (I_B∪A ∩ I_B), (A ∩ I_B∪A), (B), and (B ∩ I_B∪A ∩ I_B). The number and percentage of relations coming from each source for FMA and GALEN are presented in Table 2. For example, 105,084 relations in FMA can be generated by both A (augmentation) and I_B∪A (inference based on the base and augmented relations), but not by the other three methods. As shown in the table next to the label (A ∩ I_B∪A), these 105,084 relations are represented by two gray slots (in columns A and I_B∪A) and white
slots in the other three columns. Note that row (A) represents the relations that can only be generated by augmentation, while a gray slot in column A identifies the relations that may be generated by augmentation.
Table 2. Source of the semantic relations acquired from FMA and GALEN
Figure 1. Contribution of the top four sources of relations in FMA and GALEN
4.3
Base semantic relations
The base semantic relations come from all sources involving B, i.e., not only the row (B) in Table 2, but all ten rows marked in grey in column B, including, for example, (B ∩ I_B∪A). While some of these relations are only present in the base, some of them may also be augmentable, inferable, or both. The proportion of base relations in each of these categories in FMA and GALEN is shown in Table 3.
[Table 3 values only partially recovered: 6.74 %, 2.68 %]
Table 3. The base semantic relations
4.4
Augmented semantic relations
The augmented semantic relations come from all sources involving A, i.e., not only the row (A) in Table 2, but all ten rows marked in grey in column A, including, for example, (A ∩ I_B∪A). While some of these relations can be generated only by augmentation, some of them may also be present in the base, be inferable, or both. The proportion of augmented relations for each of these categories in FMA and GALEN is shown in Table 4.
Augmented semantic relations        FMA (N=392,314)   GALEN (N=32,922)
Can only be augmented               24.52 %           13.02 %
Also present in the base            11.12 %           20.50 %
Also inferable                      65.14 %           68.28 %
(Both in the base and inferable)     0.78 %            1.80 %

Table 4. The augmented semantic relations

4.5
Inferred semantic relations
The inferred semantic relations come from all sources involving I_B∪A, I_B, or I_A, i.e., not only the rows (I_B∪A), (I_B), and (I_A) in Table 2, but all rows except (B), (A), and (B ∩ A). These rows are all marked in grey in column I_B∪A, I_B, or I_A, and include, for example, (I_B∪A ∩ I_A). While some of these relations can be generated only by inference, some of them may also be present in the base, be augmentable, or both. The
proportion of inferred relations for each of these categories in FMA and GALEN is shown in Table 5.
Inferred semantic relations          FMA (N=11,896,508)   GALEN (N=4,356,244)
Can only be inferred                 95.77 %              98.86 %
Also present in the base              2.11 %               0.64 %
Also augmentable                      2.15 %               0.52 %
(Both in the base and augmentable)    0.03 %               0.02 %

Table 5. The inferred semantic relations
The last row in Tables 3, 4, and 5 corresponds in all three cases to relations which are present in the base and are also augmentable and inferable (3,082 in FMA and 590 in GALEN). These relations correspond to the following four rows in Table 2: (B ∩ A ∩ I_B∪A), (B ∩ A ∩ I_B∪A ∩ I_B), (B ∩ A ∩ I_B∪A ∩ I_A), and (B ∩ A ∩ I_B∪A ∩ I_B ∩ I_A).
5
5.1
Discussion
Specificity and common features of the various methods generating relations
Each method provides specific relations. With the exception of I_B and I_A, each method contributes specific relations, i.e., relations that could not be generated by other methods. By definition, I_B∪A includes both I_B and I_A, i.e., every relation in I_B or I_A is also in I_B∪A. However, as reflected by the two non-empty sets (I_B∪A ∩ I_B) and (I_B∪A ∩ I_A), not every relation generated by I_B can also be generated by I_A, and vice versa. The largest proportion of specific relations is associated with inference (more than 95% of the relations inferred from FMA and GALEN can be generated only by inference). The base relations represent the second pool of specific relations (the proportion of base relations which cannot be generated by augmentation or inference is nearly 55% in FMA and 86% in GALEN). Many relations can be generated by more than one method. Many relations generated by augmentation (11% in FMA and 20% in GALEN) and, to a lesser extent, by inference (2.1% in FMA and 0.6% in GALEN) are also present in the base, i.e., explicitly represented in most cases. There is also a significant overlap between the relations generated by augmentation and by inference, especially when examined from the perspective of augmented relations (about two thirds of augmented relations can also be inferred). Finally, a few hundred relations can be generated by all the methods under investigation. These relations, B ∩ A ∩ I_B∪A ∩ I_B ∩ I_A, are present in the base, augmentable, and inferable from both the base and augmented relations. Examples of such relations include
5.2
Applications
5.2.1
Ontology auditing, validation, and maintenance
This study showed that the relations represented in ontologies - explicitly or not - may be redundant. When relations can be acquired by several different methods (e.g., explicitly represented and inferable from a combination of other relations), the relations in the ontology are no longer independent of each other. Redundancy may have beneficial effects for users of the ontology, such as providing direct links between important concepts. However, the dependence among equivalent relations or combinations thereof is rarely explicit. Therefore, there is a chance that, over time, one relation is modified without the dependent relations being modified accordingly, leading to inconsistency. Recognizing redundancy. Using techniques such as augmentation and inference, we showed that it is possible to identify relations which can be generated by more than one method, i.e., redundant relations. The percentage of redundant relations can be used as an indicator for auditing ontologies. A small percentage is likely to be associated with consistency and ease of maintenance, but the ontology may be more difficult for humans to use without the help of an inference engine.
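The redundancy indicator described above amounts to counting relations that appear in more than one method's set. A minimal sketch, with invented relation identifiers and counts rather than real ontology data:

```python
# Sketch of the redundancy percentage used as an auditing indicator:
# a relation is redundant when more than one method can generate it.

def redundancy_percentage(method_sets):
    """Percentage of relations generated by more than one method."""
    all_relations = set().union(*method_sets.values())
    redundant = [r for r in all_relations
                 if sum(r in s for s in method_sets.values()) > 1]
    return 100.0 * len(redundant) / len(all_relations)

# Toy sets: r2 is in all three, r3 is in two, so 2 of 5 are redundant.
methods = {
    "base":      {"r1", "r2", "r3"},
    "augmented": {"r2", "r4"},
    "inferred":  {"r2", "r3", "r5"},
}
print(round(redundancy_percentage(methods), 1))  # -> 40.0
```

In an audit, a low percentage would suggest the ontology's relations are largely independent, matching the interpretation given in the text.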
Identifying dependence among relations. An ontology in which dependence among equivalent relations is explicit would be easier to maintain in a consistent state. For example, the following guidelines, inspired by the two ontologies of anatomy under investigation, could be adopted: (1) If a relation to be modified is represented explicitly and augmentable (6.74% in FMA as shown in Table 3), modify the explicit representation (e.g.,
5.2.2
Integration of multiple ontologies
Facilitating comparisons across ontologies. The ontologies to be integrated may use different modeling conventions, resulting not only in different relations being represented, but also in different ways to represent the same relations. In both cases, integration is facilitated by forcing all relations to be explicitly represented. This enables comparisons across systems based on simple matches among
5.3
Advantages and limitations of this approach
Formalism. While other ontology tools (e.g., [6, 7]) require OKBC-compliance, the approach described in this paper is not tied to a particular formalism. FMA is a frame-based system and GALEN is based on description logics (DL). One requirement is to extract hierarchical relations from the system (e.g., superclass-subclass). The other requirement is to augment knowledge using linguistic clues in concept names. This presupposes the existence of concept names and is therefore not applicable to some 3,000 anonymous concepts in GALEN. Of note, the relations resulting from applying inference rules to hierarchical relations would certainly have been generated by a reasoner in a DL-based system. By generating these relations independently of such a system, however, our method is applicable to ontologies represented in other formalisms as well. Domain. As a method for auditing ontologies (see section 5.2.1), this approach can be used with any ontology, as long as the requirements mentioned above are met. In its application to integrating multiple ontologies (section 5.2.2), this method requires that the ontologies to be integrated be of the same domain or, at least, have a significant overlap, as is the case with FMA and GALEN. Our method has in common with other alignment methods (e.g., [17]) that it intersects the content of several ontologies. However, we take advantage of techniques such as augmentation and inference, described in this paper and quantified for the FMA-GALEN alignment, to maximize the intersection. Validation. One limitation of this study is that no validation of the relations generated has been performed yet. However, some elements of validation are built into the method. Redundant relations are likely to be valid, as are the relations represented in several ontologies. Finally, relations resulting from inference mechanisms should generally be valid.
The evaluation provided by this method is essentially quantitative, resulting from auditing the ontology automatically. For this reason, our method can be seen as complementary to a qualitative analysis of taxonomic relationships (e.g., [18]), which requires extensive manual work. Acknowledgements The research was supported in part by an appointment to the National Library of Medicine Research Participation Program administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and the National Library of Medicine. Thanks for their support and encouragement to Cornelius Rosse, José Mejino, and Kurt Richards for FMA and Alan Rector, Jeremy Rogers, and Angus Roberts for GALEN. Thanks also to Pieter Zanstra at Kermanog for providing us with an extended license for the GALEN server.
References
1. Corcho O, Fernandez-Lopez M, Gomez-Perez A. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data & Knowledge Engineering 2003;46(1):41-64
2. Duc HN. Resource-bounded reasoning about knowledge [PhD Thesis]: University of Leipzig; 2001
3. Sima J, Cervenka J. Neural knowledge processing in expert systems. In: Cloete I, Zurada JM, editors. Knowledge-based neurocomputing. Cambridge, Mass.: MIT Press; 2000. p. 419-466
4. Zhang S, Bodenreider O. Aligning representations of anatomy using lexical and structural methods. Proc AMIA Symp 2003:(to appear)
5. Baader F, Horrocks I, Sattler U. Description logics as ontology languages for the Semantic Web. In: Hutter D, Stephan W, editors. Festschrift in honor of Jörg Siekmann: Springer; 2003. p. (to appear)
6. Noy NF, Musen MA. PROMPT: algorithm and tool for automated ontology merging and alignment. Proc of AAAI 2000:450-455
7. McGuinness DL, Fikes R, Rice J, Wilder S. The Chimaera ontology environment. Proc of AAAI 2000:1123-1124
8. Reed SL, Lenat D. Mapping ontologies into Cyc. Proc of AAAI 2002. http://citeseer.nj.nec.com/509738.html
9. Bailin SC, Truszkowski W. Ontology negotiation as a basis for opportunistic cooperation between intelligent information agents. In: Cooperative Information Agents V, Proceedings; 2001. p. 223-228
10. Uschold M, Gruninger M. Creating semantically integrated communities on the world wide web. Proc International Workshop on the Semantic Web 2002. http://semanticweb2002.aifb.uni-karlsruhe.de/USCHOLD-Hawaii-InvitedTalk2002.pdf
11. Rosse C, Mejino JL, Modayur BR, Jakobovits R, Hinshaw KP, Brinkley JF. Motivation and organizational principles for anatomical knowledge representation: the digital anatomist symbolic knowledge base. J Am Med Inform Assoc 1998;5(1):17-40
12. Noy NF, Musen MA, Mejino JL, Rosse C. Pushing the envelope: challenges in a frame-based representation of human anatomy: Technical Report of Stanford Medical Informatics; 2002. Report No.: SMI-2002-0925
13. Rector AL, Bechhofer S, Goble CA, Horrocks I, Nowlan WA, Solomon WD. The GRAIL concept modelling language for medical terminology. Artif Intell Med 1997;9(2):139-71
14. Rogers J, Rector A. GALEN's model of parts and wholes: experience and comparisons. Proc AMIA Symp 2000:714-8
15. Schulz S. Bidirectional mereological reasoning in anatomical knowledge bases. Proc AMIA Symp 2001:607-11
16. Bodenreider O. Circular hierarchical relationships in the UMLS: etiology, diagnosis, treatment, complications and prevention. Proc AMIA Symp 2001:57-61
17. Wiederhold G. An algebra for ontology composition. Proceedings of the 1994 Monterey Workshop on Formal Methods 1994:56-61
18. Welty C, Guarino N. Supporting ontological analysis of taxonomic relationships. Data & Knowledge Engineering 2001;39(1):51-74
JOINT LEARNING FROM MULTIPLE TYPES OF GENOMIC DATA

A.J. HARTEMINK
Department of Computer Science and Center for Bioinformatics and Computational Biology, Duke University, Box 90129, Durham, NC 27708-0129
[email protected]

E. SEGAL
Department of Computer Science, Stanford University, Stanford, CA 94303
eran@cs.stanford.edu
Recent technological advances enable us to collect many different types of data at a genome-wide scale, including DNA sequences, gene and protein expression measurements, protein-protein interactions, protein structural information, and protein-DNA binding data. These data provide us with a means to begin elucidating the large-scale modular organization of the cell. Indeed, much recent work has been devoted to the analysis of these data for this purpose. However, most of this work has been devoted to the analysis of a single type of data at a time, using other types of data only for validation. In contrast, results jointly learned from more than one type of data are likely to lead to new insights that might not be as readily available from analyzing one type of data in isolation. For instance, experimental genomic datasets often contain errors arising from imperfections in the applied technology. Thus, some of the findings of methods that analyze a single type of data may be erroneous. If we assume that technological errors across different genomic datasets are largely independent, then the probability of error in results that are supported by two different types of data is dramatically reduced. The Joint Learning from Multiple Types of Genomic Data session at PSB 2004 was created to provide a forum for novel methods that use more than one type of data in their analysis and do so jointly. Our goal in organizing this new session at PSB is two-fold: first, we hope to encourage the computational biology community to develop methods that are capable of integrating the large number of different types of data that are becoming increasingly available; second, we
hope to stimulate the discovery of new biological insights that would be difficult or impossible to identify in the analysis of only single types of data. Based on the number of excellent papers submitted, the session has clearly tapped into a growing interest in such joint methods. Because of this large number of quality submissions, we were able to accept nine papers for publication. Interestingly, almost every one is different from the others in terms of the types of data used and the goal of the study. Some examples include: combining sequences from multiple organisms, or combining phylogenetic trees with sequence, for the task of detecting cis-regulatory motifs; combining gene expression and sequence for detecting operon structure; combining protein sequences with tertiary structural information for classifying proteins; combining protein-protein interaction data with gene expression for learning regulatory networks; and combining text from the literature with protein sequences for discovering functional domains in proteins. The methods employed for the joint learning were also very diverse, and included probabilistic methods, support vector machines, and methods from combinatorial optimization. Taken together, these papers represent a fairly thorough cross-section of the most promising directions in this field. As more types of data become widely available, it is our belief that these kinds of unified approaches are likely to produce great insights into the complex biological systems that we are trying to better understand. The session co-chairs are grateful to those who submitted papers to the session for their contributions in advancing the field of joint learning, and especially grateful to those who reviewed submissions for their contributions in selecting the most outstanding papers to present this year, which was a challenging task given the large number of excellent submissions.
ProGreSS: SIMULTANEOUS SEARCHING OF PROTEIN DATABASES BY SEQUENCE AND STRUCTURE a

A. BHATTACHARYA, T. CAN, T. KAHVECI, A. K. SINGH, Y.-F. WANG
Department of Computer Science, University of California, Santa Barbara, CA 93106
{arnab, tcan, tamer, ambuj, yfwang}@cs.ucsb.edu
Abstract
We consider the problem of similarity searches on protein databases based on both sequence and structure information simultaneously. Our program extracts feature vectors from both the sequence and structure components of the proteins. These feature vectors are then combined and indexed using a novel multi-dimensional index structure. For a given query, we employ this index structure to find candidate matches from the database. We develop a new method for computing the statistical significance of these candidates. The candidates with high significance are then aligned to the query protein using the Smith-Waterman technique to find the optimal alignment. The experimental results show that our method can classify up to 97% of the superfamilies and up to 100% of the classes correctly according to the SCOP classification. Our method is up to 37 times faster than CTSS, a recent structure search technique, combined with the Smith-Waterman technique for sequences.
1 Introduction
The industrialization of molecular biology research has resulted in an explosion of bioinformatics data (DNA and protein sequences, protein structures, gene expression data and genome pathways). Each of these data types presents a different kind of information about the functions of the genes and the interactions between them. Most of the earlier work focuses on only one type of data, since each type of data has a different representation and the notion of similarity varies for each data type. Combined learning from multiple types of data will help biologists achieve more precise results for several reasons: a) The probability of having false positive results due to errors in data generation decreases, since it is less likely for the same error to appear in all the datasets. b) More than one aspect of the biological objects can be captured simultaneously.
1.1 Problem definition
In this paper, we consider the problem of joint similarity searches on protein sequences and structures. A protein is represented as an ordered list of amino acids, where each amino acid has a sequence and a structure component (the terms amino acid and residue are used interchangeably). The sequence component of an amino acid is its residue name, indicated by a one-letter code from a 20-letter alphabet. The structure component consists of the Secondary Structure Element (SSE) type of that residue (α-helix, β-sheet, or turn), and a 3-D vector which shows the position of its carbon-alpha (Cα) atom.

a Work supported partially by NSF under grants EIA-0080134, IIS-9877142, DBI-0213903, and IRI-9908441.

1.2 Related work
It has been one of the most important goals in molecular biology to elucidate the relationship among the sequence, structure and function of proteins. A handful of algorithms and tools have been developed to analyze sequence and structure similarities between proteins. These methods usually focus on either sequence (Smith-Waterman (SW) 6, BLASTP 5, PSI-BLAST 7) or structure information (VAST 8, DALI 9, CE 10, PSI 11, CTSS 12) for finding similarities between different proteins. On the other hand, a few tools have been developed to provide integrated environments for analyzing sequence and structure information together. Protein Analyst 13, 3DinSight 14, and the integrated tools by Huang et al. 15 are among those tools. They provide a combination of separate (but cooperating) programs for integration of sequence and structure analysis under a single working environment. The components of these systems are usually run one after another, with one's results being the input to the other. Although these tools provide integration of multiple types of data, they perform search on only one type of data at a time. We believe that integration of multiple data sources at the indexing and search level would provide more precise and efficient tools.
1.3 An overview of our method
We extract a number of feature vectors from the sequence and structure components of each protein in the database by sliding a window. Each feature vector maps to a point in a multi-dimensional space. Thus, a protein is represented by a number of points. This multi-dimensional space consists of orthogonal dimensions for sequence and structure. Later, we partition the space with the help of a grid and index these points using Minimum Bounding Rectangles (MBRs). Given a query, our search method runs in three phases: Phase 1 (index search): Feature vectors (i.e., points) are extracted from the query protein. For each of these query points, all the database points that are within ε_q and ε_t distance along the sequence and the structure dimensions are found using the index structure. Each such point casts a vote for the protein to which it belongs, as in geometric hashing 16. Phase 2 (statistical significance): For each database protein, a statistical significance value is computed based on the votes it obtained in Phase 1 and its length. Phase 3 (post-processing): The top c proteins of highest significance are selected, where c is a predefined cutoff. The optimal pairwise alignments of these c proteins to the query protein are then computed using the SW technique. Finally, the Cα atoms of
the matching residues are superposed using the least-squares method by Arun et al. 17 to find the optimal RMSD (Root Mean Square Distance). We name our method ProGreSS (Protein Grep by Sequence and Structure) since it enables queries based on sequence and structure simultaneously. The rest of the paper is organized as follows. Section 2 discusses our index structure for proteins. Section 3 explains our search algorithm. Section 4 presents the experimental results. We end with a brief discussion in Section 5.
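The superposition step of Phase 3 can be illustrated with a least-squares rigid alignment in the spirit of Arun et al.'s SVD method. This is a sketch with toy coordinates, not the authors' implementation; the matched point sets are centered, an optimal rotation is recovered by SVD, and the RMSD of the aligned points is reported.

```python
import numpy as np

def optimal_rmsd(P, Q):
    """Optimal RMSD after least-squares rigid superposition.
    P, Q: (n, 3) arrays of matched atom positions."""
    P = P - P.mean(axis=0)                      # center both point sets
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation P -> Q
    diff = (P @ R.T) - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Toy coordinates: Q is P rotated about the z axis, so RMSD is ~0.
P = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
theta = np.pi / 3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Q = P @ Rz.T
print(optimal_rmsd(P, Q))   # a pure rotation: RMSD near machine epsilon
```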
2 Feature vectors and index construction
In this section, we develop new methods to extract features for protein structures and sequences. Feature vectors for structures are computed as the curvature and torsion values of the residues in a sliding window. Curvature and torsion values provide a necessary and sufficient condition for the isomorphism of two space curves 12. For a detailed explanation of how curvature and torsion are computed, refer to CTSS 12. Feature vectors for sequences are computed using a sliding window and a score matrix that defines the similarity between all the amino acids. We also propose a novel index structure to provide efficient access to these features.
2.1
Feature vectors for structure
We slide a window of a prespecified size, w, on the proteins (i.e., each positioning of the window contains w consecutive residues). We will discuss the choice of w later. Figure 1(a) depicts two positionings of the window. For a given window, the curvature and torsion values for each residue in that window are computed. The resulting vector contains 2w values, since two values are stored per residue in the window. This vector maps to a point in a 2w-dimensional space. Having a large number of dimensions increases the cost of computing the similarity 18 and the cost of storing the vectors. Therefore, we reduce the number of dimensions to a smaller number, d_t, using the Haar wavelet transformation, at the cost of reduced precision (see 19 for details on the Haar transformation). We use d_t = 2 in our experiments. The transformed vector is normalized to the [0,1]^d_t space. Along with each feature vector, we also store the SSE types of the residues. As w increases, the feature vector contains information about the correlation between a larger number of residues. Thus the similarity between two feature vectors implies longer matches. On the other hand, very large values of w may cause false dismissals, since shorter matches may be discarded due to their neighboring residues. We set w = 3 for our experiments.
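As an illustration of this dimension-reduction step, the following sketch flattens a window of (curvature, torsion) pairs into a 2w-dimensional vector and shrinks it with a Haar-style averaging cascade before normalizing into [0,1]^d_t. The numeric values, the padding rule for odd lengths, and the normalization bounds are all assumptions for illustration, not the paper's exact transform.

```python
import math

def haar_reduce(values, d):
    """Repeatedly replace adjacent pairs by their scaled averages
    (the Haar approximation coefficients) until d values remain."""
    v = list(values)
    while len(v) > d:
        if len(v) % 2:                 # pad odd-length input (assumption)
            v.append(v[-1])
        v = [(a + b) / math.sqrt(2) for a, b in zip(v[0::2], v[1::2])]
    return v

def normalize(v, lo, hi):
    """Map values into [0, 1] given assumed global bounds lo and hi."""
    return [(x - lo) / (hi - lo) for x in v]

# Toy window of w = 3 residues: (curvature, torsion) per residue.
window = [(0.2, 1.1), (0.3, 0.9), (0.25, 1.0)]
flat = [x for pair in window for x in pair]       # 2w = 6 values
print(normalize(haar_reduce(flat, 2), lo=0.0, hi=2.0))
```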
2.2 Feature vectors for sequence
The similarity between two amino acids of protein sequences is usually defined using score matrices (e.g., PAM and BLOSUM). A score matrix consists of 20 rows and columns, one for each amino acid. The entries of a score matrix denote the score for aligning a pair of residues. If two amino acids are similar, then the score for that pair is large; otherwise it is small.
Given a score matrix M, we call each row of M the score vector of the amino acid corresponding to that row. Thus, each entry of this vector shows the similarity of that amino acid to one of the 20 possible amino acids. We define the distance between two amino acids as the Euclidean distance between their score vectors. This is justified because, if the score of aligning two amino acids x and y is high in a score matrix, then they are similar. Therefore, if x is similar (or dissimilar) to another amino acid z, then y is also similar (or dissimilar) to z. Similar to protein structures, we extract feature vectors for protein sequences by sliding a window of length w (see Figure 1(b) for w = 3). Each positioning of the window contains w amino acids. We append the score vectors of these amino acids in the same order as they appear in the window to obtain a vector of size 20w. This vector maps to a point in a 20w-dimensional space. Since the number of dimensions is too large for efficient indexing even for small values of w, we reduce the number of dimensions to a smaller number, d_q, using Haar wavelets. Similar to the structure component, we recommend d_q = 2 for an optimal quality/time trade-off. The resulting vector is then normalized to the [0,1]^d_q space. We again choose w = 3.
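The score-vector distance can be sketched directly. The 3-letter matrix below is a made-up stand-in for a real 20x20 PAM or BLOSUM matrix, so the window vectors have length 3w rather than the 20w of the paper.

```python
import math

# Toy "score matrix": each row is an amino acid's score vector.
# Values are invented, not real substitution scores.
SCORE = {
    "A": [4, -1, 0],
    "R": [-1, 5, 1],
    "N": [0, 1, 6],
}

def aa_distance(x, y):
    """Euclidean distance between the score vectors of two amino acids."""
    return math.dist(SCORE[x], SCORE[y])

def window_vector(seq):
    """Append the score vectors of a window's residues, in order
    (length 20w with a real matrix; 3w with this toy one)."""
    return [v for aa in seq for v in SCORE[aa]]

print(aa_distance("A", "A"))   # -> 0.0 (identical residues)
print(window_vector("ARN"))    # 3 residues x 3-dim toy rows = 9 values
```

With a real matrix, this window vector is exactly what the Haar reduction of the previous subsection would then shrink to d_q dimensions.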
2.3 Indexing feature vectors
So far we have discussed how to extract feature vectors for the structure and sequence components of the proteins separately. In this section, we will discuss how to build an index structure on these feature vectors. In order to search the protein database based on both sequence and structure, we need to combine the feature vectors for these two components. Since the same window size is used for both components, every positioning of the window produces one d_t-dimensional feature vector for its structure component and one d_q-dimensional feature vector for its sequence component. We append these two vectors to obtain a single (d_t + d_q)-dimensional vector. The resulting vector is called the combined feature vector. Since the entries of each of the feature vectors are normalized to the [0,1] interval, the combined feature vector resides in a [0,1]^(d_t+d_q) search space. The index structure is built by first partitioning the search space into η equal pieces along each dimension. The resulting grid contains η^(d_t+d_q) cells of length 1/η along each dimension. We will discuss the choice of η in Section 3.1. Once the space is partitioned, a window of length w is slid on each protein in the database. For each positioning of the window, the combined feature vector is computed. Each such vector maps to a point p in one of the cells of the grid. For each such point, we check whether that cell is empty. If it is empty, we construct an MBR that contains only p. Otherwise, we find the MBR B in that cell whose volume becomes the smallest after extending it to contain p. If the volume of B, after its expansion, is less than a precomputed volume threshold, V, then we extend B and insert p into B; otherwise we create a new MBR that covers only p. V affects only the performance, not the quality of the search. We chose V = (1/2η)^(d_t+d_q) experimentally. Figure 2 presents
/* Let D be a dataset that contains proteins, w be the window size, and V be the volume cutoff. */
Procedure CreateIndex(D, w, V)
  for each protein x ∈ D
    for each positioning of window of length w
      p := combined feature vector for current window;
      C := cell that contains p;
      if C = ∅ then
        B.Lower := p; B.Higher := p;
        Insert B into C;
      else
        B := argmin_{B∈C} {volume(B ∪ p)};
        if volume(B ∪ p) ≤ V then
          B := B ∪ p;
        else
          B.Lower := p; B.Higher := p;
          Insert B into C;
        endif
      endif
    endfor
  endfor
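In Python, the construction in Figure 2 might look roughly as follows. This is a sketch, not the published implementation: the `MBR` class, the cell-coordinate computation, and the protein-id bookkeeping are our own assumptions.

```python
from collections import defaultdict
from math import prod

class MBR:
    """Minimum bounding rectangle over points, tagged with protein ids."""
    def __init__(self, p, pid):
        self.lower, self.upper = list(p), list(p)
        self.entries = [(tuple(p), pid)]
    def volume_with(self, p):
        """Volume of this MBR after extending it to contain point p."""
        return prod(max(u, x) - min(l, x)
                    for l, u, x in zip(self.lower, self.upper, p))
    def add(self, p, pid):
        self.lower = [min(l, x) for l, x in zip(self.lower, p)]
        self.upper = [max(u, x) for u, x in zip(self.upper, p)]
        self.entries.append((tuple(p), pid))

def create_index(points_by_protein, eta, V):
    """points_by_protein: {protein id: [combined feature vectors in [0,1]^d]}.
    Each point goes to its grid cell; it joins the MBR whose expanded volume
    is smallest if that volume stays <= V, otherwise it starts a new MBR."""
    grid = defaultdict(list)             # cell coordinates -> list of MBRs
    for pid, points in points_by_protein.items():
        for p in points:
            cell = tuple(min(int(x * eta), eta - 1) for x in p)
            boxes = grid[cell]
            best = min(boxes, key=lambda b: b.volume_with(p), default=None)
            if best is not None and best.volume_with(p) <= V:
                best.add(p, pid)
            else:
                boxes.append(MBR(p, pid))
    return grid
```

The grid itself is just a dictionary keyed by cell coordinates, so only non-empty cells consume memory.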
Figure 1: Feature vectors for (a) protein structure, and (b) protein sequence.
Figure 2: Algorithm for building the index structure.
the algorithm that constructs the index structure. Figure 3 depicts a layout of a 2-D search space and the MBRs built on the data points for η = 4. Here, dt = dq = 1.
3 Query method Given a query <Q, εq, εt, τ>, where Q is a query protein, εq ∈ [0,1] and εt ∈ [0,1] are the distance thresholds for sequence and structure respectively, and τ is the boolean value regarding the use of SSE information, our search algorithm runs in three phases: 1) index search, 2) statistical significance computation, and 3) post-processing. In this section, we will discuss each of these phases. We will assume that the index structure is built using a user-specified score matrix for sequence (e.g., PAM or BLOSUM), and w for the window size. 3.1 Index search
Each residue of the query protein Q consists of a sequence component and a structure component. We extract combined feature vectors from Q by sliding a window of length w on it. Each of these combined feature vectors defines a query point in the search space. Figure 4 shows a sample query point in a 2-D search space, where the horizontal axis is the structure dimension and the vertical axis is the sequence dimension. In this figure, the search space is split into 16 cells numbered from 0 to 15. The query point falls into cell 10. We want to find the database points that are within
Figure 3: A layout of the MBRs and data points on the search space for η = 4 in 2-D.
Figure 4: A sample query point and its query box for η = 4 in 2-D.
an εt distance along the structure dimensions and an εq distance along the sequence dimensions from the query point. In Figure 4, we are interested in the points in the shaded region. Note that if τ = true, then we only consider the database points that have the same SSE type as the query point. For each query point, we construct a query box by extending it by εt and by εq in both directions along the structure and the sequence dimensions respectively (see Figure 4). Next, we find the cells in the search space that overlap the query box. If a cell does not overlap the given query box, then it is guaranteed that it does not contain any database points that are in the query box. A cell can overlap a query box in two ways: 1) it is contained in the query box (e.g., cell 10 in Figure 4), or 2) it partially overlaps the query box (e.g., cells 5, 6, 7, 9, 11, 13, 14, and 15 in Figure 4). 1) If a cell is contained in the query box, all the points in that cell are guaranteed to overlap the query box. Therefore, for each data point in that cell, we add a vote to the database protein that contains it (if τ = true, the vote is added only for the points that have the same SSE type as the query point). 2) If a cell partially overlaps the query box, then we check all the MBRs in that cell. If an MBR is contained in the query box (e.g., the MBR in cell 10), each point in that MBR contributes a vote. If an MBR partially overlaps the query box (e.g., the MBR in cell 15), then we find the points in that MBR that are in the query box to find the votes. If an MBR does not overlap the query box (e.g., the MBR in cell 6), we ignore all the points in that MBR. This method is more precise than geometric hashing [16], because for a given query point it inspects the neighboring cells in addition to the cell into which that query point falls. The number of partitions η in the search space affects the run time of the index search.
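A stripped-down version of this voting scheme can be sketched as follows. This is our own simplification: it stores raw points per cell and omits both the MBR pruning layer and the SSE-type filter described above.

```python
import itertools
from collections import Counter

def search(grid, eta, query_points, eps):
    """grid: {cell tuple: [(point, protein_id), ...]} over [0,1]^d with eta
    cells per dimension. Each query point is extended by eps[i] along
    dimension i into a query box; only cells overlapping the box are
    visited, and every database point inside the box adds one vote to
    the protein that contains it."""
    votes = Counter()
    for q in query_points:
        lo = [x - e for x, e in zip(q, eps)]
        hi = [x + e for x, e in zip(q, eps)]
        # grid cells whose index range intersects the query box
        ranges = [range(max(0, int(l * eta)), min(eta - 1, int(h * eta)) + 1)
                  for l, h in zip(lo, hi)]
        for cell in itertools.product(*ranges):
            for p, pid in grid.get(cell, []):
                if all(l <= x <= h for l, x, h in zip(lo, p, hi)):
                    votes[pid] += 1
    return votes
```

In the full method, cells and MBRs wholly contained in the query box contribute all of their points without the per-point test.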
As η decreases, each cell contains more MBRs. Therefore, if a query box partially overlaps a cell, then more MBRs need to be tested for intersection with the
query box, thus increasing the search time. On the other hand, having too many partitions has two disadvantages: 1) most of the cells will be sparse or empty, incurring space cost; 2) the volume of the boxes will be very small since each cell will get smaller. This increases the total number of MBRs, and hence the number of MBRs for the intersection test. From our experiments we recommend η = 10 for optimal results. 3.2 Statistical significance computation Once the index structure is searched, we obtain a number of votes for each protein in the database. The total number of votes for a protein x shows the number of query points that are close to x's points. We define the p-value of a match as the unexpectedness of that result. Smaller p-values imply better matches. Definition 1 Given a protein x with n points in the index structure and v votes for a given query, the p-value of x for that query is defined as the probability of having at least v votes for a randomly generated protein with n points in the search space. Next, we discuss the computation of p-values. Consider a protein in the database that is represented in the search space using n points (n = length of protein - window size + 1). Let the protein receive v votes for a given query. Let X be a random variable representing the number of query boxes that overlap with a randomly selected point in the search space. Let μX and σX² be the mean and the variance of X. The total number of query boxes that overlap with n randomly selected points can be computed as Xn = X + X + ... + X (exactly n X's). Since the X's are independent and identically distributed random variables, using the Central Limit Theorem, one can show that Xn is normally distributed with mean μXn = n · μX and variance σ²Xn = n · σX². Thus, if μX and σX² are known, one can compute the distribution of Xn using a normal distribution. Since the protein has v votes, its p-value can be computed as P(Xn ≥ v).
The computation of p-values requires the values of μX and σX². The distribution of X depends on the distribution of query points, and the distance thresholds εq and εt. We compute the values of μX and σX² by generating a large number of random points in the search space and counting the number of query boxes that each overlaps. In our experiments, we generate 10,000 random points for this estimation.
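The estimation and the normal-tail p-value might be sketched as follows. Assumptions of ours, not the authors' code: query boxes are given as per-dimension (low, high) intervals, and the Gaussian tail is evaluated with the complementary error function.

```python
import math
import random

def estimate_moments(query_boxes, trials=10000, dim=2, seed=0):
    """Monte Carlo estimate of the mean and variance of X, the number of
    query boxes overlapping a uniformly random point in [0,1]^dim.
    Each box is a list of (low, high) intervals, one per dimension."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        p = [rng.random() for _ in range(dim)]
        counts.append(sum(all(l <= x <= h for x, (l, h) in zip(p, box))
                          for box in query_boxes))
    mu = sum(counts) / trials
    var = sum((c - mu) ** 2 for c in counts) / trials
    return mu, var

def p_value(v, n, mu, var):
    """P(Xn >= v), where Xn ~ Normal(n*mu, n*var) by the Central Limit
    Theorem; smaller values indicate a more surprising (better) match."""
    if var == 0:
        return 1.0 if v <= n * mu else 0.0
    z = (v - n * mu) / math.sqrt(n * var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Because μX and σX² depend only on the query boxes, they are estimated once per query and reused for every database protein.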
3.3 Post-processing After the statistical significances of all the proteins are computed, the top c proteins with the highest significance are selected as candidates for post-processing, where c is a predefined cutoff. The purpose of post-processing is to find the optimal alignment between the query protein and the most promising proteins. Let q be the query protein. For every protein x in the candidate set, post-processing runs in two steps: Step 1: We build a |x| × |q| score matrix, Mstr, for the structure component, where |x| and |q| are the number of residues in x and q, as follows: For each residue in x and q, we construct a 2-D vector from its curvature and torsion. Each entry of Mstr is then computed as the negative of the Euclidean distance between the <curvature, torsion>-vectors of the corresponding residues. For the sequence component, we create another |x| × |q| score matrix, Mseq, such that for all i, j the entry Mseq[i,j] is equal to the score of aligning the ith letter of x with the jth letter of q in the underlying score matrix (e.g., BLOSUM62). The matrices Mseq and Mstr are normalized and a combined score matrix Mcom = (1 - εt) · Mstr + (1 - εq) · Mseq is computed. Here, the weights (1 - εt) and (1 - εq) represent the importance that the user gives to each of the components. The optimal alignment between x and q is then found by running the Smith-Waterman dynamic programming using Mcom. Step 2: The alignment obtained in Step 1 defines a one-to-one mapping between a subset of residues of x and q, and is optimal with respect to Mcom. Finally, we find the 3-D rotation and translation of x that gives the minimum RMSD to q by using a least-squares fitting method [17].
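A minimal Smith-Waterman recurrence over a precomputed residue-residue score matrix such as Mcom can be sketched as below. This is our own illustration with a simple linear gap penalty, which the paper does not specify, and it returns only the best local score rather than the traceback.

```python
import numpy as np

def smith_waterman(M, gap=-1.0):
    """Best local alignment score over an |x| x |q| score matrix M
    (e.g., Mcom = (1 - eps_t) * Mstr + (1 - eps_q) * Mseq)."""
    n, m = M.shape
    H = np.zeros((n + 1, m + 1))
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + M[i - 1, j - 1],  # match/mismatch
                          H[i - 1, j] + gap,                   # gap in q
                          H[i, j - 1] + gap)                   # gap in x
            best = max(best, H[i, j])
    return best
```

Recovering the aligned residue pairs requires the usual traceback from the cell holding the best score.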
4 Experimental evaluation We used single-domain chains in our experiments. We downloaded all the protein chains in PDB (http://www.rcsb.org/pdb) that contain only one domain according to the VAST [8] and SCOP [20] classifications. We only considered proteins that are members of one of the following SCOP classes: all α, all β, α+β and α/β. We identified the superfamilies (according to the SCOP classification) that have at least 10 representatives in this dataset. There are 181 such superfamilies. We created a database D of size 1810 proteins by including 10 proteins from each of these superfamilies. We formed a query set, DQ, by choosing a random chain from each of the 181 superfamilies in D. DQ is large enough to sample D since it contains one protein from each superfamily. We ran a number of experiments on these sets to test the quality and the performance of ProGreSS. The tests were run on a computer with two AMD Athlon MP 1600+ processors with 2 GB of RAM, running Linux 2.4.19. In the rest of this section, we use w for the window size, c for the cutoff, εt and εq for the structure and sequence distance thresholds, τ for the SSE type match choice, and η for the number of partitions. We employ the BLOSUM62 score matrix for sequences in all of our experiments. The numbers of dimensions dq and dt for sequence and structure are both set to 2. 4.1
Quality test Our first experiment set inspects the effect of various indexing and search parameters on the quality of our index search results. We classify a given query protein into one of the superfamilies and classes using the c best seeds as follows. The logarithms of the p-values of the matches in the top c results in each superfamily are accumulated. The query protein is classified into the superfamily that has the largest magnitude of this sum. We use the same technique to classify the query protein into one of the four SCOP classes: all α, all β, α+β and α/β. Since the queries are selected from the database, in order to be fair, we do not take into account the query protein itself if it is among the top c results. We will only report the results for τ = true, since it usually produced slightly better results than τ = false.
Figure 5: Percentage of query proteins correctly classified for different values of c.
Figure 6: Percentage of query proteins correctly classified for different values of the distance threshold when εt = εq.
Figure 5 shows the percentage of query proteins correctly classified to classes (CL) and superfamilies (SF) for different values of c, where εt = εq = 0.01 and 0.02, and w = 3. In all these experiments, we obtained the best results for c = 2 and 3. We achieved up to 96% and 94% correct classification for classes and superfamilies respectively. As c increases, our method starts retrieving proteins from other classes and superfamilies. We set c = 3 for the rest of the experiments. Figure 6 plots the percentage of correctly classified proteins for varying distance thresholds when εt = εq and w = 3. The purpose of this experiment is to understand what a good distance threshold should be when sequence and structure have equal importance. The graph shows that the accuracy of ProGreSS increases when the distance threshold increases from 0.005 to 0.01. At εt = εq = 0.01, ProGreSS achieves 96% and 94% correct classification for classes and superfamilies. As the distance threshold increases, ProGreSS starts retrieving distant proteins and its accuracy drops. Figure 7 shows the percentage of correctly classified superfamilies for different values of εt when εq is fixed and for different values of εq when εt is fixed, for w = 3. This experiment shows the effect of the distance threshold for each of the structure and sequence components separately. When εq is fixed, as εt decreases, the classification quality of ProGreSS increases. This implies that our method can find better results when the distance threshold is small. The highest accuracy obtained is 62%. For εq = 1.0 (i.e., when the sequence component is ignored), ProGreSS performs the worst. This is an important result since it shows that searches based on structure alone would incur more false positives than searches based on both sequence and structure. When εt is fixed, as εq decreases, ProGreSS classifies more proteins correctly. In this case, 94% of the proteins are correctly classified into their superfamilies.
Our method performs the worst when εt = 1.0. This result leads to two important conclusions: 1) Searching by sequence information alone is worse than searching based on
Figure 7: Percentage of query proteins correctly classified for different values of εt (εq) when εq (εt) is fixed.
Figure 8: Percentage of query proteins correctly classified for different values of w.
sequence and structure simultaneously. 2) For purposes of classification, our extraction of feature vectors for sequence is better than that for structure. Figure 8 plots the effect of window size on the classification quality of ProGreSS. The best results are achieved at w = 3. At this window size, ProGreSS can classify 100% and 97% of the classes and superfamilies correctly. ProGreSS performs worse for smaller window sizes since correlations between consecutive residues are not reflected in the index structure. As w becomes larger than 3, ProGreSS starts to miss some of the good results since shorter local matches are not preserved for large w. Finally, Figure 9 compares the accuracy of our technique with CTSS, a recent algorithm that considers structure alone. We show the number of correct proteins (those from the same superfamily as the query protein) for different values of c. CTSS finds 3 out of 10 correct proteins in the first 100 candidates. On the other hand, our method finds the same number of proteins within the first 4 candidates. 4.2 Performance test In this experiment set we compare the performance of our method to CTSS. In order to have fair results, we run CTSS in two phases: 1) the top c candidates are found using the original CTSS code and each candidate is aligned to the query by using SW based on its structure score matrix. 2) The optimal sequence alignments of all the database proteins to the query are determined using SW alignment. For CTSS and ProGreSS, we choose c = 100 and 4 respectively. This is because the quality of their candidates is similar for these values of c (see Figure 9). We run queries for all of the 181 proteins and align only the candidate proteins to each of the query proteins. Figure 10 shows the average time spent by CTSS and our method. The run times for CTSS and SW are 38 and 18 seconds respectively. The graph for CTSS+SW is flat since these methods are independent of η.
ProGreSS runs faster than CTSS+SW for all values of η. For η = 10, ProGreSS runs in only 1.5 seconds (i.e., 37 times faster
Figure 9: Number of proteins found from the same superfamily as the query protein for ProGreSS and CTSS for different values of c.
Figure 10: Comparison of running times of ProGreSS and CTSS+SW.
than CTSS+SW). As η gets smaller, ProGreSS runs slower. This is because when a query box partially overlaps a cell, more MBRs are tested for intersection. As η becomes larger than 10, the performance of ProGreSS drops since the total number of MBRs in the index structure increases. 5 Discussion In this paper, we considered the problem of joint similarity searches on protein sequences and structures. We proposed a sliding-window-based method to extract feature vectors on the sequence and structure components of the proteins. Each feature vector is mapped to a point in a multi-dimensional space. We developed a novel index structure by partitioning the space with the help of a grid, and clustering these points using Minimum Bounding Rectangles (MBRs). For each database protein, our search method finds the number of its feature vectors that are similar to the feature vectors of a given query. We also proposed a new statistical method to compute the significance of the results found at the index search phase. The results are sorted according to their significance and the most promising results are aligned using the Smith-Waterman (SW) method [4] and the least-squares method by Arun et al. [17] to find the optimal alignment. According to the experimental results on a set of representative query proteins, ProGreSS classified all of the classes and 97% of the superfamilies correctly. Our method ran 37 times faster than CTSS, a recent structure search technique, combined with the SW technique for sequences. Combined learning from multiple data sources is an important research problem since each data source provides a correlated yet different type of information about the protein. ProGreSS provides the user wide flexibility in search parameters to assign weights to each of these data types. We believe that the methods discussed in this
paper are an important step toward better understanding the functions of proteins, and will be widely applicable in the area of proteomics. In the future, we would like to include other features into our index structure such as expression arrays and pathways.
References 1. T. C. Wood and W. R. Pearson. Evolution of Protein Sequences and Structures. J. of Mol. Biol., 291(4):977-995, 1999. 2. H. Hegyi and M. Gerstein. The Relationship between Protein Structure and Function: a Comprehensive Survey with Application to the Yeast Genome. J. of Mol. Biol., 288(1):147-164, 1999. 3. J. M. Sauder, J. W. Arthur, and R. L. Dunbrack Jr. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Structure, Function, and Genetics, 40(1):6-22, 2000. 4. T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. of Molecular Biology, March 1981. 5. S. Altschul, W. Gish, W. Miller, E. W. Meyers, and D. J. Lipman. Basic local alignment search tool. J. Molecular Biology, 215(3):403-410, 1990. 6. W. Gish and D.J. States. Identification of protein coding regions by database similarity search. Nature Genet., pages 266-272, 1993. 7. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25(17):3389-3402, 1997. 8. T. Madej, J.-F. Gibrat, and S.H. Bryant. Threading a database of protein cores. Proteins, 23:356-369, 1995. 9. L. Holm and C. Sander. Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233:123-138, 1993. 10. H.N. Shindyalov and P.E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11(9):739-747, 1998. 11. O. Camoglu, T. Kahveci, and A. K. Singh. Towards Index-based Similarity Search for Protein Structure Databases. In CSB, 2003. 12. T. Can and Y.F. Wang. CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features. In CSB, 2003. 13. M. A. S. Saqi, D. L. Wild, and M. J. Hartshorn.
Protein Analyst - a distributed object environment for protein sequence and structure analysis. Bioinformatics, 15:521-522, 1999. 14. J. An, T. Nakama, Y. Kubota, H. Wako, and A. Sarai. Construction of an Integrated Environment for Sequence, Structure, Property and Function Analysis of Proteins. Genome Informatics, 10:229-230, 1999. 15. C. C. Huang, W. R. Novak, P. C. Babbitt, A. I. Jewett, T. E. Ferrin, and T. E. Klein. Integrated Tools for Structural and Sequence Alignment and Analysis. In PSB, pages 227-238, 2000. 16. H.J. Wolfson and I. Rigoutsos. Geometric hashing: An introduction. IEEE Computational Science & Engineering, pages 10-21, Oct-Dec 1997. 17. K.S. Arun, T.S. Huang, and S.D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):698-700, September 1987. 18. K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is "Nearest Neighbor" Meaningful? In ICDT, pages 217-235, 1999. 19. R.M. Rao and A.S. Bopardikar. Wavelet Transforms: Introduction to Theory and Applications. Addison Wesley, 1998. 20. A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247:536-540, 1995.
PREDICTING THE OPERON STRUCTURE OF BACILLUS SUBTILIS USING OPERON LENGTH, INTERGENE DISTANCE, AND GENE EXPRESSION INFORMATION M.J.L. DE HOON¹, S. IMOTO¹, K. KOBAYASHI², N. OGASAWARA², S. MIYANO¹ ¹Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan ²Graduate School of Biological Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
We predict the operon structure of the Bacillus subtilis genome using the average operon length, the distance between genes in base pairs, and the similarity in gene expression measured in time-course and gene disruptant experiments. By expressing the operon prediction for each method as a Bayesian probability, we are able to combine the four prediction methods into a Bayesian classifier in a statistically rigorous manner. The discriminant value for the Bayesian classifier can be chosen by considering the associated cost of misclassifying an operon or a non-operon gene pair. For equal costs, an overall accuracy of 88.7% was found in a leave-one-out analysis for the joint Bayesian classifier, whereas the individual information sources yielded accuracies of 58.1%, 83.1%, 77.3%, and 71.8% respectively. The predicted operon structure based on the joint Bayesian classifier is available from the DBTBS database (http://dbtbs.hgc.jp).
1 Introduction
In prokaryotes, open reading frames (ORFs) belonging to the same operon are transcribed together into a single mRNA molecule. To understand gene regulation in prokaryotic organisms, as a first step it is important to determine the operon structure of their genomes. In addition, as genes in the same operon are likely to be functionally related, the inferred operon structure may reveal the role of currently unknown genes. The distance between two adjacent genes on the same strand of DNA tends to be shorter if they belong to the same operon, and longer if they belong to different operons. Using a list of experimentally known operons, we can determine the discriminant value of the intergenic distance at which an adjacent gene pair is more likely to be an operon pair than a non-operon pair. For the Escherichia coli genome, operon pair predictions using the intergenic distance information were 82% accurate [1,2]. An alternative method of operon prediction is based on gene expression measurements. Using cDNA microarray technology, the expression levels can be measured simultaneously for all genes in the genome by measuring the corresponding mRNA concentrations. In time-course gene expression experiments, the expression levels are measured at several time points following a change in the environment of the organism, such as an increase in the temperature or the salt concentration. In gene disruptant experiments, the steady-state gene expression levels are measured for an organism in which the expression of a specific gene has been disrupted. As genes belonging to the same operon are transcribed into a single mRNA molecule, the degree of similarity in the gene expression profiles of two adjacent genes can be used to assess the likelihood that the gene pair belongs to the same operon. When applied to operon prediction in Escherichia coli using a collection of 72 cDNA microarray experiments to calculate the similarity in gene expression, a sensitivity of 82% was found [3]. Sabatti et al. postulated that gene experiments that perturb a large number of genes offer more information for operon prediction than confined perturbations. Time-course gene expression data may therefore be more suitable for operon prediction than gene disruptant expression data, as changes in the environment of an organism in a time-course experiment are likely to affect a larger number of genes than the disruption of a single gene in a gene disruptant experiment. In practice, the distribution functions of both the intergenic distance and the measured similarity in gene expression exhibit a large degree of overlap for operon gene pairs and non-operon gene pairs, and the choice between operon and non-operon may become ambiguous. The reliability of operon prediction can be improved by considering the intergenic distance and the similarity in gene expression together in a Bayesian posterior probability, which resulted in a sensitivity of 88% for the Escherichia coli genome [3]. For these predictions, a constant (uninformative) prior was used.
To find the true Bayesian posterior probability, we would have to consider the relative abundance of operon pairs in comparison to non-operon pairs. This will give us a base-line rate of finding operon gene pairs among the adjacent gene pairs, depending on the average number of genes per operon. Within a Bayesian framework, we can consider this rate as the prior probability of a gene pair to belong to the same operon, while the intergenic distance and gene expression information are used to calculate the Bayesian posterior probability. As on average an operon in Bacillus subtilis contains more than two genes, there are more operon gene pairs than non-operon gene pairs. Including the prior probability will therefore lead to a more accurate prediction for operon pairs, a less accurate prediction for non-operon pairs, and a higher overall prediction accuracy. To guard against a less accurate prediction for non-operon pairs, we can consider the relative cost of misclassification as an operon pair compared to the cost of misclassification as a non-operon pair. For example, if we want to
verify experimentally the operon boundaries by considering all candidate non-operon gene pairs, the cost of misclassifying a non-operon pair as an operon pair would be relatively high, and we might consider classifying a gene pair as a tentative operon pair even if the posterior probability is somewhat lower than 50%. Here, we use the combination of intergenic distance and similarity in gene expression from 99 gene disruptant experiments and 75 time-course expression measurements to predict the operon structure in Bacillus subtilis. From a list of 635 known operons we found 582 operon pairs and 91 non-operon pairs. Using these data, we predicted the operon structure of Bacillus subtilis, and assessed the overall prediction accuracy and the relative contributions of operon length, intergenic distance, and expression information.

2 Operon structure predictors

2.1 Operon length
Table 1 shows the distribution of the operon length based on our list of 635 known operons. To infer a base-line rate for adjacent gene pairs to belong to the same operon, we would like to fit a statistical model to these measured operon lengths. The simplest statistical model consistent with the data is the geometric distribution:

Pr[operon contains n genes] = p^(n-1) (1 - p)    (1)
Accordingly, we regard operons as being produced by a Bernoulli process with probability p, as shown in Figure 1. A Bernoulli process is the discrete equivalent of a Poisson process, and is the only discrete distribution without memory. Biologically, it means that a priori there is a probability p for each intergenic region to contain a terminator sequence to mark the end of an operon, independent of its length. Using Eq. 1, we can calculate the probability p from the average operon length n̄ as p = (n̄ - 1)/n̄ (Eq. 2), where n̄ = 2.39 is determined from Table 1, leading to a prior probability p = 0.581 of finding an operon pair. Figure 2 shows the distribution of the measured operon lengths, as well as the geometric distribution fitted to it. Note that except for singletons, any known operon will contribute to the set of known operon pairs, while non-operon pairs can only be found if two adjacent operons both happen to be known. Estimating p directly from the number of known operon pairs and known non-operon pairs would therefore lead to a severely biased estimate.
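The moment-matching step of Eq. 2 is a one-liner; the sketch below (our own illustration) computes the prior from a length histogram such as the one in Table 1:

```python
def operon_prior(length_counts):
    """Fit the geometric model Pr[n genes] = p^(n-1) * (1 - p) by moment
    matching: p = (nbar - 1) / nbar, where nbar is the mean operon length.
    length_counts: {operon length in genes: number of operons}."""
    total = sum(length_counts.values())
    nbar = sum(n * c for n, c in length_counts.items()) / total
    return (nbar - 1.0) / nbar
```

With nbar = 2.39, this yields p = 1.39/2.39 ≈ 0.581, the prior quoted above.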
Table 1: Number of genes per operon, calculated from the list of 635 known operons.
Figure 1: The distribution of the operon length can be described in terms of a Bernoulli process with probability p.
Figure 2: The distribution of the operon length, as determined from the list of 635 known operons.
Figure 3: The distribution function of the distance in base pairs between adjacent genes for operon pairs and non-operon pairs.
2.2 Intergenic distance Using the list of known operon and non-operon pairs, we estimated the probability density distribution of the distance between the genes, measured in base pairs, using an estimation procedure based on the Epanechnikov kernel. As some genes partially overlap each other, the intergenic distance is allowed to be negative. Figure 3 shows the inferred probability distribution for operon pairs and non-operon pairs. Whereas the intergenic distance on average is considerably smaller for operon pairs than for non-operon pairs, there is a substantial overlap between the two distribution functions, highlighting the need for additional predictors to distinguish operon pairs from non-operon pairs. 2.3 Gene expression data
As genes that belong to the same operon are transcribed into a single mRNA molecule, we expect their measured expression profiles to be highly similar. In cluster analysis, the Pearson correlation and the Euclidean distance are commonly used to assess the similarity in gene expression profiles. In operon prediction from gene expression data, the Pearson correlation is typically used. However, the theory of discriminant analysis suggests that the Euclidean
Table 2: The time points at which expression measurements were made for the eight time-course experiments of Bacillus subtilis.

Experiment                                  | Measurement time points in minutes
Cold shock                                  | 0, 5, 10, 30, 60, 120
Competence                                  | 0, 60, 120, 180, 240, 300, 360
Glucose, glutamine added during sporulation | 0, 60, 120, 180, 240, 300
Glucose limitation                          | 0, 60, 125, 180, 240
Heat shock                                  | 0, 5, 10, 30, 60
Increased amino-acid availability           | 0, 30, 60, 120, 210, 300, 420, 540
Phosphate, glucose starvation               | 0, 60, 120, 180, 240, 300, 360, 420
Phosphate limitation                        | 0, 55, 115, 175, 235, 295
Salt stress                                 | 0, 5, 10, 30, 60
Sporulation                                 | 0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420, 450, 480, 510, 540
distance would be optimal, given that the expression profiles of gene pairs in the same operon are equal rather than merely correlated. Here, we will apply both the Euclidean distance and the Pearson correlation to evaluate their effectiveness in separating operon pairs from non-operon pairs. We consider the gene expression data measured at 75 time points in total in eight time-course experiments, described in Table 2, together with 99 gene disruptant experiments, listed in Table 3. Genes with more than 50% missing data were removed for the leave-one-out analysis described below. Furthermore, in each disruptant experiment the measured expression levels for the disrupted gene were marked as missing. Global normalization was applied to the remaining genes. Figures 4 and 5 show the distribution functions of the Pearson correlation and the Euclidean distance for known operon and non-operon gene pairs. To guarantee that the probability density function vanishes for distances less than zero, a mirroring technique was used in which the negative of each data point was added to the data set. The probability density function estimated from the padded data set was subsequently multiplied by two and set to zero for negative distances. For the Pearson correlation r, the same mirroring technique was used at r = 1; at r = -1, no mirroring was needed as both probability density functions were already zero. Both figures show a considerable amount of overlap between the distribution functions for operon pairs and non-operon pairs, although the Pearson correlation achieves a slightly better separation.
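A sketch of the Epanechnikov estimate with the mirroring correction described above (our own illustration; the paper does not specify its bandwidth choice, so h is left as a free parameter):

```python
def epanechnikov_kde(data, h):
    """Density estimate with the Epanechnikov kernel
    K(u) = 0.75 * (1 - u^2) for |u| <= 1, bandwidth h."""
    n = len(data)
    def f(x):
        s = 0.0
        for d in data:
            u = (x - d) / h
            if abs(u) <= 1.0:
                s += 0.75 * (1.0 - u * u)
        return s / (n * h)
    return f

def mirrored_kde(data, h, boundary=0.0):
    """Boundary correction by mirroring: add the reflection of every point
    about the boundary, estimate on the padded sample, then double the
    estimate on the valid side and set it to zero beyond the boundary
    (here: Euclidean distances cannot be negative)."""
    padded = list(data) + [2.0 * boundary - d for d in data]
    g = epanechnikov_kde(padded, h)
    return lambda x: 2.0 * g(x) if x >= boundary else 0.0
```

The same construction applies at r = 1 for the Pearson correlation, with the reflection taken about 1 instead of 0.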
Table 3: Disrupted gene in each experiment. The genes degU, sigF, sigW, and veg were each disrupted in two experiments, as indicated here.
abh, abrB, acoR, ahrC, alsR, ansR, araR, azlB, ccpA, citR, citT, codY, comA, comK, cspB, ctsR, degU, degU, deoR, gerE, glcR, glcT, glnR, gntR, gutR, hpr, hrcA, hutP, iolR, lacR, levR, lexA, lmrA, lrpA, lrpC, mtrB, paiA, paiB, phoP, purR, pyrR, rocR, sacT, senS, sigB, sigD, sigE, sigF, sigF, sigG, sigL, sigV, sigW, sigW, sigX, sigY, sigZ, sinR, soj, splA, spo0A, spo0J, spoIIIC, spoIIID, spoVT, tenA, tnrA, treR, veg, veg, xylR, ybbH, ybfA, ycsO, ydbG, yesS, yhdM, yhjM, yjmH, ykoZ, ykuM, yotL, yqhN, ytzE, yyaG, Y9kL, YPG

2.4 Bayesian classifier
From the estimated distribution functions f_OP(d), f_NOP(d) of the intergenic distance d for known operon pairs (OP) and known non-operon pairs (NOP), and the estimated distribution functions g_OP(D), g_NOP(D) of the dissimilarity D between two expression profiles, we construct the joint Bayesian classifier

P(OP | d, D) = p f_OP(d) g_OP(D) / [p f_OP(d) g_OP(D) + (1 - p) f_NOP(d) g_NOP(D)].   (3)
With the prior probability p calculated from the average operon length (Eq. 2), the joint Bayesian classifier is equal to the posterior probability of finding an operon pair. The prediction accuracy will be higher for operon pairs than for non-operon pairs, due to the former being more abundant than the latter in the Bacillus subtilis genome, as parameterized by p. With the uninformative prior (p = 1/2) proposed previously, Eq. 3 is no longer the true Bayesian posterior probability. The uninformative prior leads to an equal accuracy for operon and non-operon pairs, but to a lower overall accuracy. Usually, a gene pair is predicted to belong to the same operon if the posterior probability is more than 1/2, and to different operons if the posterior probability is less than 1/2. Instead, we propose to classify a gene pair as an operon pair if the posterior probability surpasses a certain discriminant value p_D, which is not necessarily equal to 0.5. This allows us to tune the relative accuracy of finding operon pairs or non-operon pairs by choosing the parameter p_D appropriately, depending on how the operon predictions will be used. For example, for terminator sequence prediction we may want to include all gene pairs that have a posterior probability of 30% or more of being a non-operon pair (p_D = 0.7), as requiring a posterior probability of 50% will cause us to miss many potential terminator sequences.

Figure 4: The probability density function of the measured Euclidean distance between the expression log-ratios for known operon and known non-operon gene pairs, as calculated from the combined gene disruptant and time-course gene expression data.
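The posterior defined by this classifier is a two-line computation; the sketch below uses invented density values for a single gene pair (the function name and numbers are ours, not the paper's):

```python
def joint_posterior(f_op_d, f_nop_d, g_op_D, g_nop_D, p):
    """Posterior probability that a gene pair is an operon pair, combining
    the intergenic-distance densities f and the expression-dissimilarity
    densities g under a naive independence assumption."""
    num = p * f_op_d * g_op_D
    den = num + (1.0 - p) * f_nop_d * g_nop_D
    return num / den

# illustrative density values for a single gene pair (made up)
post = joint_posterior(f_op_d=0.8, f_nop_d=0.1, g_op_D=0.6, g_nop_D=0.3, p=0.581)
is_operon_pair = post > 0.7   # a stricter discriminant, p_D = 0.7
```

Raising or lowering the 0.7 threshold is exactly the p_D tuning described above.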
3 Prediction accuracy
The operon prediction accuracy was assessed using a leave-one-out analysis, in which each of the known operon or non-operon pairs was consecutively ignored in the learning phase, followed by a prediction of the operon status of the gene pair that was left out. Using only the operon length information, the Bayesian classifier reduces to the prior probability for all gene pairs. Consequently, all gene pairs are predicted to be operon pairs, resulting in a 100% prediction accuracy for operon pairs, a 0% accuracy for non-operon pairs, and a 58.1% overall prediction accuracy, corresponding to the prior probability p. Table 4 shows the accuracy of predictions based on the intergenic distance, the gene expression data, and on the joint Bayesian classifier, using a discriminant p_D = 0.5 for the posterior probability. The intergenic distance, at an accuracy of 83.1%, is a somewhat more reliable predictor of the operon structure than the gene expression data, which yielded an accuracy of 79.9%.

Figure 5: The distribution of the measured Pearson correlation between the expression log-ratios for known operon and known non-operon gene pairs, as calculated from the combined gene disruptant and time-course gene expression data.

As expected, the joint Bayesian classifier surpasses each of the separate predictors, reaching an accuracy of 88.7%. Here, the similarity in the gene expression profiles was assessed using the Pearson correlation r by defining D = 1 - r. The Euclidean distance yielded a marginally lower prediction accuracy of 88.6% for the joint Bayesian classifier. The time-course gene expression data achieved a better prediction accuracy (77.3%, based on 75 expression measurements) than the gene disruptant experiments (71.8%, based on 99 expression measurements). This is consistent with the conjecture by Sabatti et al.3 that gene expression experiments affecting a large number of genes are more suitable for operon prediction. The combined expression data of the time-course and the gene disruptant experiments achieved an improved prediction accuracy of 79.9%. As in this analysis the cost of misclassifying an operon pair is regarded to be equal to the cost of misclassifying a non-operon pair, the discriminant value for the posterior probability was chosen to be 50%. The prediction accuracy of non-operon pairs can be improved at the expense of a less accurate prediction for operon pairs by increasing the discriminant value p_D, and vice versa. Figure 6 shows the prediction accuracy of the joint Bayesian classifier as a function of the discriminant probability p_D. The optimal overall accuracy is achieved for a discriminant probability less than 0.5, which reflects the fact
Table 4: The accuracy of operon prediction.

Predictor                     Operon pairs   Non-operon pairs   Overall accuracy
Intergenic distance           82.1%          89.0%              83.1%
Gene expression, overall      80.1%          79.1%              79.9%
Time-course experiments       76.8%          80.2%              77.3%
Gene disruptant experiments   69.9%          83.5%              71.8%
Joint Bayesian classifier     88.8%          87.9%              88.7%
that operon pairs are more abundant than non-operon pairs in the Bacillus subtilis genome. Next, we used the joint Bayesian classifier to predict the operon structure of the complete Bacillus subtilis genome, using the Pearson correlation to assess the similarity in the expression profiles. The predicted operon structure is available from the DBTBS database5 in terms of the posterior probability, enabling users to assess the reliability of each prediction, as well as to choose the discriminant value p_D corresponding to their interests. In addition to the predictors described above, we examined the viability of determining the operon structure by finding the σA transcription factor binding site and the terminator sequence motif. For all regions between adjacent gene pairs on the same strand of DNA, we calculated the motif score using the Position Specific Score Matrix for the σA binding site.5 The terminator sequence motif was predicted using dtp, a prediction tool for finding rho-independent transcription terminators.12 Neither of these predictors produced a clear distinction between operon pairs and non-operon pairs, and they were therefore not included in the joint Bayesian classifier. Note that in both cases the aim of the predictor is to find where a motif is located in a given sequence segment, rather than whether a given sequence segment contains the motif. It may therefore be possible to construct better sequence analysis tools for the specific task of operon structure prediction.

4 Conclusion
We predicted the operon structure of the Bacillus subtilis genome by combining operon length, intergenic distance, and gene disruptant and time-course gene expression experiments at an estimated overall accuracy of almost 89%. The intergenic distance information was the most accurate single predictor (83.1%), followed by the time-course gene expression data (77.3%) and the
Figure 6: The prediction accuracy as a function of the choice for the discriminant probability p_D. A large value of p_D corresponds to a high cost of misclassifying a non-operon gene pair. (The plot shows prediction accuracy against the discriminant value p_D for the posterior probability, from 0.0 to 1.0, with separate curves for operon pairs and non-operon pairs.)
gene disruptant data (71.8%). The average operon length was considered in order to determine the baseline probability of finding an operon pair. The distribution of the operon length was modeled by a geometric distribution, which means that a priori there is an equal probability of finding a terminator sequence between any pair of adjacent genes, irrespective of the length of the operons in which those genes are located. The predicted operon structure is available from the DBTBS database.5 In the leave-one-out analysis, we found that assessing the expression similarity using the Euclidean distance does not yield a better separation between operon and non-operon pairs than the Pearson correlation. This is somewhat surprising from the viewpoint of discriminant analysis. The superior results of the Pearson correlation may be due to the error structure in gene expression measurements, or to hitherto unexplained dependencies in the expression level of two adjacent genes in different operons. Similarity measures may exist that are even more suitable for operon prediction than the Pearson correlation.
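Under the geometric operon-length model described above, the prior follows in one line: if a terminator follows each gene independently with probability 1/L, where L is the mean operon length, then an adjacent same-strand pair stays in the same operon with probability 1 - 1/L. A sketch (the helper name and the mean-length value of 2.4 are ours, chosen only so that the prior lands near the reported 58.1%):

```python
def operon_pair_prior(mean_operon_length):
    """Prior probability that two adjacent same-strand genes lie in the
    same operon, assuming a geometric operon-length distribution: a
    terminator follows each gene with probability q = 1/L, so an
    adjacent pair stays together with probability 1 - q."""
    return 1.0 - 1.0 / mean_operon_length

p = operon_pair_prior(2.4)   # illustrative mean operon length
```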
Acknowledgments

We would like to thank Yuko Makita and Mitsuteru Nakao of the University of Tokyo for assisting us with the σA and terminator sequence motif prediction.
References

1. G. Moreno-Hagelsieb and J. Collado-Vides. A powerful non-homology method for the prediction of operons in prokaryotes. In Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology (ISMB 2002), Bioinformatics Supplement 1, pages S329-S336, 2002.
2. H. Salgado, G. Moreno-Hagelsieb, T.F. Smith, and J. Collado-Vides. Operons in Escherichia coli: Genomic analyses and predictions. Proc. Natl. Acad. Sci. USA, 97:6652-6657, 2000.
3. C. Sabatti, L. Rohlin, M.-K. Oh, and J.C. Liao. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res., 30:2886-2893, 2002.
4. S. Okuda, S. Kawashima, and M. Kanehisa. Database of operons in Bacillus subtilis. In Genome Informatics, volume 13, pages 496-497, 2002.
5. Y. Makita and K. Nakai. DBTBS: Database of transcriptional regulation in Bacillus subtilis and its contribution to comparative genomics. Nucleic Acids Research, submitted, 2003. http://dbtbs.hgc.jp.
6. A.L. Sonenshein, J.A. Hoch, and R. Losick. Bacillus subtilis and its closest relatives: From genes to cells. ASM Press, Washington, DC, 2001.
7. J.H. Zar. Biostatistical Analysis. Prentice-Hall, London, 4th edition, 1999.
8. B.W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
9. M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95:14863-14868, 1998.
10. D.K. Slonim, P. Tamayo, J.P. Mesirov, T. Golub, and E.S. Lander. Class prediction and discovery using gene expression data. In RECOMB 2000, pages 263-272, 2000.
11. M.S. Bartlett and N.W. Please. Discrimination in the case of zero mean differences. Biometrika, 50:17-21, 1963.
12. T. Yada, M. Nakao, Y. Totoki, and K. Nakai. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics, 15(12):987-993, 1999.
COMBINING TEXT MINING AND SEQUENCE ANALYSIS TO DISCOVER PROTEIN FUNCTIONAL REGIONS

E. ESKIN
School of Computer Science and Engineering, Hebrew University
[email protected]. ac.il
E. AGICHTEIN
Department of Computer Science, Columbia University
[email protected]. edu

Recently presented protein sequence classification models can identify relevant regions of the sequence. This observation has many potential applications to detecting functional regions of proteins. However, identifying such sequence regions automatically is difficult in practice, as relatively few types of information have enough annotated sequences to perform this analysis. Our approach addresses this data scarcity problem by combining text and sequence analysis. First, we train a text classifier over the explicit textual annotations available for some of the sequences in the dataset, and use the trained classifier to predict the class for the rest of the unlabeled sequences. We then train a joint sequence text classifier over the text contained in the functional annotations of the sequences, and the actual sequences, in this larger, automatically extended dataset. Finally, we project the classifier onto the original sequences to determine the relevant regions of the sequences. We demonstrate the effectiveness of our approach by predicting protein sub-cellular localization and determining localization specific functional regions of these proteins.
1 Introduction
Supervised learning techniques over sequences have had a tremendous amount of success in modeling proteins. Some of the most widely used methods are Hidden Markov Models (HMMs) to model protein families1,2,3 and neural network techniques for predicting secondary structure4. Recently, a new class of models, which use margin learning algorithms such as the Support Vector Machine (SVM) algorithm, has been applied to modeling protein families. These models include the spectrum kernel5 and mismatch kernel6, which have been shown to be competitive with state-of-the-art methods for protein family classification. These methods represent sequences as collections of short substrings of length k, or k-mers. One property of these classifiers is that we can examine the trained models generated by these methods and discover which k-mers are the most important for discriminating between the
classes. By projecting these k-mers onto the original sequences, we can discover which regions of the protein specifically correspond to the class and potentially discover the relevant functional region of the protein. In a recent paper, it has been shown that some of the k-mers with the highest weights in a protein family classification model correspond to known motifs of the protein family7. This technique is general in that it can be applied to determine the relevant functional region of a set of proteins given a set of example proteins, by creating a data set where the examples of the class of proteins are positive training examples and a sampling of other proteins are negative examples. However, despite the large size of protein databases and the large amount of annotated proteins, very few types of information are sufficiently annotated to generate a large enough training set of proteins to perform this analysis. For example, consider the sub-cellular localization of proteins. Only a very small fraction of the database, 15%, is annotated with sub-cellular localization, despite the fact that 35% of the database is annotated with functional annotation which corresponds to localization. If we can somehow use the functional annotation as a proxy for localization information, we can then apply our analysis to identify the regions of the proteins that are specific to each sub-cellular location. In their recent work, Nair and Rost8 defined a method for inferring localization information from the functional annotation which greatly influenced our work. In this paper, we introduce a framework that combines text mining over database annotations with sequence learning to both classify proteins and determine the functional regions specific to the classes. Our framework is designed specifically for the case when we are given a relatively small set of example sequences compared to a much larger amount of text-annotated, yet unlabeled, sequences.
Our framework learns how the text is correlated with the labels and jointly learns over sequences and text of both the example (labeled) and unlabeled (yet annotated) examples. The output of the learning is a sequence classifier which can be used to identify the regions in the proteins specific to the class. We demonstrate our method with a proof of concept application to identify regions correlated to sub-cellular localization. We choose sub-cellular location as the proof of concept application because two recent works by Nair and Rost show that functional annotations of proteins correlate with localization and localization can be inferred from sequences. Using the small set of labeled examples and sequences as a seed, we train a text classifier to predict the sub-cellular localization based on the functional annotations, similar to the approach presented in Nair and Rost, 2002. This effectively augments the seed set of labeled sequences with a larger set of sequences with predicted localizations. We then jointly learn a sequence and text classifier over the extended dataset.
Figure 1: Framework for Extending and Combining Textual Annotations with Sequence Data. (Step 1: extend the training set by exploiting text annotations. Step 2: exploit both text and sequence information in the extended training set.)
This is similar to the work by Nair and Rost, 2002, where they showed that sequence homology can be used to predict localization. Finally, we then use the sequence model to identify the localization specific regions of the proteins. Preliminary analysis of the regions shows that some correspond with known biological sites, such as DNA-binding sites for the nuclear proteins.
2 Methods

2.1 Framework Overview
The framework for discovering functional regions of proteins given a set of examples of the protein consists of several steps, as shown in Figure 1. First, we create a seed dataset which consists of the labeled proteins as positive training examples and a sampling of other proteins as the negative examples. Using this seed set, we train a text classifier over the annotations of the sequences. Then, using the text classifier, we predict over the database additional sequences which correspond to the class. Using this extended dataset, we train a joint sequence and text classifier. By projecting the classifier onto our original sequences, we can identify which regions of the protein have a high positive weight with respect to the class corresponding to the example proteins and are likely candidates for the relevant functional region of the protein. The input to our framework is a set of examples of the proteins, and the output is a joint text sequence classifier for predicting other examples of that protein and predictions for regions in the original proteins which correspond to the common function of the example set of proteins.
2.2 Extending the Seed Dataset

A significant problem in machine learning is the scarcity of training data. Insufficient training data often prevents machine learning techniques from achieving acceptable accuracy. In this section we present an application of text classification that allows us to automatically construct a comprehensive training set by starting with the initial smaller seed set of labeled sequences. Combining labeled and unlabeled examples is a topic that has been thoroughly studied in the machine learning community (see, e.g., Blum and Mitchell, 199810 and Tsuda et al.11 and the references therein for a starting point). The simple approach that we describe below was sufficient for our task, and we plan to explore more sophisticated approaches in our future work. To extend the training set, we exploit the large amount of textual information often associated with a sequence. For example, SWISS-PROT12 provides rich textual annotations for each entry in the database. Unfortunately, these annotations are difficult to compile and maintain, and as a result important information is often missing for many entries (e.g., the localization information). However, we can sometimes deduce this missing information from the textual annotations that happen to be present for a database entry. This general approach was presented in Nair and Rost8. The predictions for the unknown sequences rely on some form of classifying the textual annotations. After training over a number of labeled training examples, text classifiers can successfully predict the correct class of unlabeled texts. We represent the text using a bag of words model, where each text annotation is mapped to a vector containing the frequency of each word. As the actual classifier, we use RIPPER13, a state of the art text classification system. RIPPER operates by learning rules to describe the text in the training examples, and then applies these rules to predict the appropriate classification of new, unlabeled texts.

2.3 Training a Joint Sequence Text Classifier
Each protein record consists of the sequence and the text from its functional annotation. We construct a classifier to predict members of the class of proteins corresponding to the example proteins by learning from both the text and the sequences. In order to learn from the text and sequences jointly, we use a kernel framework. Both sequences and text are mapped to points in a feature space, which is a high dimensional vector space. A kernel for both sequences and text allows us to efficiently compute inner products between points in the space. Using this kernel, we apply the SVM algorithm to train. The kernel, described below, is constructed in such a way as to take into account interactions between the text and sequences during the learning, which results in a true joint sequence text classifier.
Text Kernel

The feature space for the text portion of a protein record uses the bag of words representation described above. The feature space corresponding to the kernel is a very high dimensional space where each dimension corresponds to a specific word. Each word w corresponds to a vector in the space, φ_T(w), where the value of the vector is 1 for the word's dimension and 0 for all other dimensions. A text string x is mapped to a vector which is the sum of the vectors corresponding to the words in the text, φ_T(x) = Σ_{w∈x} φ_T(w). Although the dimension of the feature space is very large, the vectors corresponding to the text strings are very sparse. We can take advantage of this to compute inner products between points in the feature space very efficiently. For two text annotations x and y, we denote the text kernel to be

K_T(x, y) = φ_T(x) · φ_T(y).
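In code, the bag-of-words map and the text kernel reduce to a sparse dot product of word counts. A minimal sketch (function names and the example annotations are ours):

```python
from collections import Counter

def phi_T(text):
    """Bag-of-words feature map: a sparse vector of word frequencies."""
    return Counter(text.lower().split())

def K_T(x, y):
    """Text kernel: inner product of two sparse word-count vectors.
    Only words occurring in both annotations contribute."""
    fx, fy = phi_T(x), phi_T(y)
    return sum(count * fy[word] for word, count in fx.items())

a = "nuclear protein binds DNA"
b = "DNA binding nuclear localization"
k = K_T(a, b)   # shared words "nuclear" and "dna" contribute 1 each
```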
Sequence Kernel
Sequences are also represented as points in a high dimensional feature space. Sequences are represented as a collection of their substrings of a fixed length k (or k-mers), obtained by sliding a window of length k across the length of the sequence. The simplest sequence feature space contains a dimension for each possible k-mer, for a total dimension of 20^k. For a k-mer a, the image of the k-mer in the sequence feature space, φ_S(a), has the value 1 for the k-mer a and the value 0 for the other dimensions. The image of a sequence x is the sum of the images of its k-mers, φ_S(x) = Σ_{a∈x} φ_S(a). This sequence representation is equivalent to the k-spectrum kernel5. An advantage of this representation is that we can compute kernels or inner products of points in the feature space very efficiently using a trie data structure5. In practice, because of mutations in the sequences, exactly matching k-mers between sequences are very rare. In order to more effectively model biological sequences, we use the sparse kernel sequence representation that allows for approximate matching. The sparse kernel is similar in flavor to the mismatch kernel and is fully described elsewhere14,15. Consider two sequences of length k, a and b. Each sequence consists of a single substring. The sparse kernel defines a mapping into a feature space which has the following property

φ(a) · φ(b) = α^{d_H(a,b)},   (1)
where d_H(a, b) is the Hamming distance between substrings a and b, and 0 < α < 1 is a parameter in the construction of the mapping. If the two substrings are identical, then the Hamming distance is zero and the substrings contribute 1 to the inner product of the sequences, exactly as in the spectrum kernel. However, if the Hamming distance is greater than zero, the similarity is reduced by a factor of α for every mismatch. Details of the sparse kernel implementation are described elsewhere14,15.

Combining Text and Sequences

We can use the framework of kernels to define a feature space which allows for interactions between sequences and text annotations. In our approach, we use a very simple method for combining the text and sequence classifiers. There exists a vast literature in machine learning on alternative techniques for this problem. We now define our combined kernel

K_C(x, y) = K_T(x, y) + K_S(x, y) + (K_T(x, y) + K_S(x, y))^2.

The first two terms effectively include the two feature spaces of text and sequences. The third term is a degree two polynomial kernel over the sum of the two kernels. If we explicitly determine the feature map for the combined kernel, the third term would include features for all pairs of sequences and words. Since the classifier trains over this space, it effectively learns from both sequence and text and the interactions between them.
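A sketch of the combined kernel on toy inputs, using exact-matching k-spectrum counts in place of the sparse kernel for brevity; here K_C takes the two already-computed kernel values as arguments (all names, sequences, and numbers are ours):

```python
from collections import Counter

def spectrum(seq, k=3):
    """k-spectrum feature map: counts of all length-k substrings."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def K_S(x, y, k=3):
    fx, fy = spectrum(x, k), spectrum(y, k)
    return sum(count * fy[kmer] for kmer, count in fx.items())

def K_C(kt, ks):
    """Combined kernel: the two base kernels plus a degree-two
    polynomial term over their sum, which implicitly adds features
    for all pairs of words and k-mers."""
    s = kt + ks
    return s + s ** 2

ks = K_S("MKVLAA", "KVLAAG")   # shared 3-mers: KVL, VLA, LAA
kc = K_C(2, ks)                # with a text-kernel value of 2
```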
Support Vector Machines

Support Vector Machines (SVMs) are a type of supervised learning algorithm first introduced by Vapnik16. Given a set of labeled training vectors (positive and negative input examples), SVMs learn a linear decision boundary to discriminate between the two classes. The result is a linear classification rule that can be used to classify new test examples. Suppose our training set consists of labeled input vectors (x_i, y_i), i = 1...m, where x_i ∈ R^n and y_i ∈ {±1}. We can specify a linear classification rule f by a pair (w, b), where w ∈ R^n and b ∈ R, via

f(x) = w · x + b,   (2)

where a point x is classified as positive (negative) if f(x) > 0 (f(x) < 0). Such a classification rule corresponds to a linear (hyperplane) decision boundary between positive and negative points. The SVM algorithm computes a hyperplane that satisfies a trade-off between maximizing the geometric margin, which is the distance between positive and negative labeled points, and the training errors. A key feature of any SVM optimization problem is that it is equivalent to solving a dual quadratic programming problem that depends only on the inner products x_i · x_j of the training vectors, which allows for the application of kernel techniques. For example, by replacing x_i · x_j by K_C(x_i, x_j) in the dual problem, we can use SVMs in our combined text sequence feature space.
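The point of the dual formulation is that the learner touches the data only through the Gram matrix. The sketch below substitutes a kernel perceptron for the SVM optimizer, since it exhibits exactly the same inner-products-only dependence while keeping the code short (all names and toy data are ours; any kernel, including K_C, could supply the Gram matrix):

```python
import numpy as np

def train_kernel_perceptron(K, y, epochs=10):
    """Dual learner: like the SVM dual, it sees the data only through
    the Gram matrix K[i, j] = k(x_i, x_j)."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            if np.sign(np.dot(alpha * y, K[:, i])) != y[i]:
                alpha[i] += 1.0   # dual update on a mistake
    return alpha

def decide(alpha, y, k_col):
    """Classify a point given its kernel values against the training set."""
    return np.sign(np.dot(alpha * y, k_col))

X = np.array([[0.0], [0.2], [2.0], [2.2]])
y = np.array([-1, -1, 1, 1])
K = (X @ X.T + 1.0) ** 2   # toy polynomial kernel as a stand-in
alpha = train_kernel_perceptron(K, y)
```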
2.4 Predicting Relevant Functional Regions

Once a SVM is trained over a set of data, the classifier is represented in its dual form as a set of support vector weights s_i, one for each training example x_i. The form of the SVM classifier is

f(x) = Σ_i s_i K(x, x_i),   (3)

which can be represented in the primal form as

f(x) = φ(x) · Σ_i s_i φ(x_i) = φ(x) · w,   (4)

where w = Σ_i s_i φ(x_i) is the SVM hyperplane. By explicitly computing φ(x_i) we can compute w directly. In the case of sequences, this can be efficiently implemented using the same data structures used for computing kernels between sequences14. We are interested in the sequence-only portion of the feature space. For the sequence portion, w has a weight for every possible k-mer. The score can be interpreted as a measure of how discriminative the k-mer is with respect to the classifier. High positive scores correspond to k-mers that tend to occur in the example set and not in other proteins. We define the score for a region on the protein as the sum of the k-mer scores contained in the region. If a region score is above a threshold, we predict that the region is a potential functional region associated with the example proteins.
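The region scoring just described can be sketched directly from the primal weights. Everything below is invented for illustration: the support coefficients are made up, and the lysine/arginine-rich toy sequence merely mimics a nuclear localization signal.

```python
from collections import Counter

def kmer_weights(support_seqs, support_coeffs, k=3):
    """Sequence portion of the primal weight vector:
    w[kmer] = sum_i s_i * (count of kmer in support sequence x_i)."""
    w = Counter()
    for seq, s in zip(support_seqs, support_coeffs):
        for i in range(len(seq) - k + 1):
            w[seq[i:i + k]] += s
    return w

def region_scores(seq, w, window=5, k=3):
    """Score each window as the sum of its k-mer weights; windows above
    a threshold are candidate functional regions."""
    kmer_score = [w[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    width = window - k + 1
    return [sum(kmer_score[i:i + width])
            for i in range(len(seq) - window + 1)]

w = kmer_weights(["AKRKRKA", "GGAGGAG"], [1.0, -1.0])
scores = region_scores("MAAKRKRKAAM", w)
best = max(range(len(scores)), key=scores.__getitem__)   # window start
```

The highest-scoring window picks out the basic-residue stretch, which is the behavior the projection step relies on.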
3 Results for Protein Localization
We evaluate our framework in three ways. First we measure the accuracy of extending the set of labeled examples. Second, we evaluate the joint text sequence classifier over 20% of the annotated localization data. This data was held out of the training in all steps of the framework. We evaluate the accuracy of predicting localization from the functional class over this data. We also evaluate the joint sequence text classifier over this data and compare it to a text only and sequence only classifier. Finally, we perform a preliminary
Localization    Annotated count   Predicted count   Precision   Recall
cytoplasm       4,976             28,318            0.869       0.705
nuclear         3,843             10,504            0.940       0.790
mitoch          1,925             6,996             0.823       0.656
chloroplast     1,693             3,414             0.869       0.705
extracel        755               7,724             0.728       0.474
endoplas        655               2,742             0.696       0.538
perox           174               810               0.442       0.217
golgi           160               914               0.805       0.559
lyso            167               1,004             0.654       0.530
vacuolar        97                470               0.579       0.112
Total           14,454            62,806

Figure 2: (a) Explicitly annotated localization, and localization predicted based on textual information, for SWISS-PROT 4.0 (the table above). (b) Precision vs. recall of the text classifier using keywords only, vs. field-specific text annotations, vs. all available text annotations. (The plot in (b) spans precision 65-100% and recall 20-80%.)
analysis of the predictions of the regions in the proteins relevant to localization. We specifically examine nuclear localization signals, since many of these are well known and there are readily available databases which we can use to verify our predictions.
3.1 Data Description

We use SWISS-PROT 4.0,12 a large database of sequences and associated textual annotations. In this proof of concept application, we focus on the specific task of inferring sub-cellular localization. A fraction of the sequences in SWISS-PROT have associated annotations that explicitly state their sub-cellular localization. We report the number of sequences with explicitly annotated localization of each type in Figure 2(a). As we can see, out of more than 100,000 entries in SWISS-PROT, less than 15% have explicit localization information.
3.2 Increasing the Set of Localization Annotated Sequences

We can increase the amount of information available to a learner by augmenting the explicitly labeled examples with unlabeled data. Useful information relevant to localization is often contained within unlabeled text annotations. By learning to recognize the textual annotations associated with localizations, we can assign localization labels to the unlabeled text annotated sequences. This general approach for predicting localization of unlabeled, but annotated, sequences is presented in Nair and Rost8. In their approach, the training focuses on detecting a set of discriminating keywords. If such a keyword is present, the sequence is predicted to belong to the appropriate class. In
this work we used RIPPER13, a rule-based classifier, to learn rules to predict the localization of a SWISS-PROT entry based on textual annotations. The classifier was trained over the 14,454 explicitly annotated sequences. The derived rules were then used to predict the localization of the remaining (unlabeled) SWISS-PROT entries. The approach described in Nair and Rost, 20028 focuses on carefully selected and assigned keyword annotations, and does not consider the unstructured annotations that are often available for the sequences. Text classification systems such as RIPPER implement sophisticated feature selection algorithms, and can be trained over the noisy, but potentially informative, unstructured data. To evaluate this hypothesis, we varied the types of textual annotations available to the classifier. We compared the quality of prediction based only on the keywords information, as used in Nair and Rost8, to the prediction accuracy achieved by considering other text fields, such as descriptions, and finally with using all of the available textual annotations for the sequence. The experimental results for varying the type of textual annotation are reported in Figure 2(b). While the specific evaluation setup and methodology that we used is slightly different from the evaluation of Nair and Rost for the same task, the overall results for keywords-based classification appear comparable. As we can see in Figure 2(b), considering all of the available textual annotations significantly increases both the recall and the precision of predicting the localization of unknown sequences. For example, at the precision level of 80%, using all of the text annotations allows RIPPER to achieve significantly higher recall. Therefore, for the remainder of this paper our text classifier considers all of the textual annotations that are available for each SWISS-PROT entry. The counts of the automatically predicted SWISS-PROT entries are reported in Figure 2(a).
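Rule-based text classification of the kind RIPPER performs can be sketched with a few ordered keyword rules. The rules below are hand-written stand-ins for learned ones, and all names and annotations are ours, purely for illustration:

```python
def predict_localization(annotation, rules):
    """Apply ordered keyword rules RIPPER-style: the first rule whose
    keywords all occur in the annotation fires; otherwise abstain."""
    words = set(annotation.lower().split())
    for keywords, label in rules:
        if set(keywords) <= words:
            return label
    return None

# illustrative rules, not the ones RIPPER actually learned
rules = [(["nuclear"], "nuclear"),
         (["secreted"], "extracel"),
         (["mitochondrial", "matrix"], "mitoch")]

lab = predict_localization("putative nuclear DNA-binding protein", rules)
```

A real learned rule set would be induced from the 14,454 annotated examples rather than written by hand.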
We also report the precision and recall of the classifier, evaluated over the held-out data using cross-validation. These accuracy figures serve as an estimate of the accuracy, or the "quality", of the resulting extended training set. Note that while the text classifier introduces some noise into the training set, the extended training set, at over 62,000 examples, is significantly larger than the original training set. This extended, automatically labeled training set can now be used to train a better joint text and sequence classifier.
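The keyword-based label-extension idea can be sketched as follows. This is an illustrative toy sketch, not the actual RIPPER system: the rule-learning criterion, the `min_ratio` threshold, and the toy annotations are all assumptions.

```python
def learn_keywords(labeled, min_ratio=2.0):
    """For each class, pick keywords that appear much more often inside the
    class than outside it. `labeled` maps an entry id to (keywords, label)."""
    from collections import Counter, defaultdict
    in_class, out_class = defaultdict(Counter), defaultdict(Counter)
    labels = {lab for _, lab in labeled.values()}
    for kws, lab in labeled.values():
        for kw in kws:
            for l in labels:
                (in_class if l == lab else out_class)[l][kw] += 1
    return {l: {kw for kw, c in in_class[l].items()
                if c >= min_ratio * (out_class[l][kw] + 1)}
            for l in labels}

def predict(rules, keywords):
    """Assign the first class whose discriminating keyword set is hit."""
    for label, kws in rules.items():
        if kws & set(keywords):
            return label
    return None

# Hypothetical toy annotations (not real SWISS-PROT entries).
annotations = {
    "Q00001": (("nucleus", "dna-binding"), "nuclear"),
    "Q00002": (("nucleus",), "nuclear"),
    "Q00003": (("signal", "membrane"), "extracellular"),
    "Q00004": (("signal", "secreted"), "extracellular"),
}
rules = learn_keywords(annotations)
assert predict(rules, ("nucleus", "rna-binding")) == "nuclear"
```

Real rule induction (RIPPER) learns ordered condition lists rather than single-keyword tests, but the labeling flow is the same: train on the annotated entries, then label the rest.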
3.3 Evaluating the Joint Text-Sequence Classifier Over the extended data described in Section 3.2, we performed experiments to measure the improvement of the classifier when considering text and sequences
Table 1: Comparison of the text-only classifier, the sequence-only classifier, and the joint classifier for each localization category. Each classifier is evaluated by computing the ROC50 score.

| Localization Category | Text Classifier | Sequence Classifier | Joint Classifier |
| cytoplasm | 0.91 | 0.86 | 0.93 |
| nuclear | 0.94 | 0.91 | 0.97 |
| mitoch | 0.96 | 0.91 | 0.99 |
| chloroplast | 0.96 | 0.96 | 0.96 |
| extracel | 0.92 | 0.93 | 0.95 |
| endoplas | 0.89 | 0.94 | 0.96 |
| perox | 0.93 | 0.88 | 0.95 |
| golgi | 0.91 | 0.83 | 0.93 |
| lyso | 0.93 | 0.99 | 0.99 |
| vacuolar | 0.94 | 0.94 | 0.94 |
together. We ran three experiments by leaving out 20% of the original annotated sequence data as a test set and using the remaining data as a training set. We trained three models on the training set: a text-only classifier, a sequence-only classifier, and a joint sequence-text classifier. For all three classifiers, we used the SVM algorithm, the only difference being the choice of kernel. The text classifier uses the text kernel K_T(x, y), the sequence classifier uses the sequence kernel K_S(x, y), and the combined classifier uses the kernel K_C(x, y). For each class, we used all of the members of the class as positive examples and a sampling of the remaining classes as negative examples. For each of the classes of localization data, we report the classifiers' performance over the test data in Table 1. We use ROC50 scores to compare the performance of the different methods. The ROC50 score is the area under the receiver operating characteristic curve, the plot of true positives as a function of false positives, up to the first 50 false positives [17]. A score of 1 indicates perfect separation of positives from negatives, whereas a score of 0 indicates that none of the top 50 sequences (or text annotations) selected by the algorithm were positives.
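The ROC50 score can be computed directly from its definition; this sketch is illustrative (the normalization convention when fewer than 50 false positives exist is an assumption):

```python
def roc50_score(scores, labels, max_fp=50):
    """Area under the ROC curve up to the first `max_fp` false positives,
    normalized so that perfect separation scores 1.0 and a ranking whose
    top hits are all negatives scores 0.0."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for l in labels if l)
    tp = fp = area = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp        # one unit-width column per false positive seen
            if fp == max_fp:
                break
    if fp == 0 or n_pos == 0:
        return 1.0 if n_pos else 0.0
    return area / (fp * n_pos)

# Perfect ranking: every positive outranks every negative.
assert roc50_score([0.9, 0.8, 0.2, 0.1], [True, True, False, False]) == 1.0
# Worst ranking: the top-ranked items are all negatives.
assert roc50_score([0.9, 0.8, 0.2, 0.1], [False, False, True, True]) == 0.0
```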
3.4 Identifying Regions Relevant to Localization We made predictions for regions correlated with localization using the method described in Section 2.4. Of all the localization signals, nuclear localization signals are the best characterized and have a searchable database of signals, the NLS database [18], so we restricted our evaluation to these signals. We examined the 20 highest-scoring non-overlapping regions, compared them to the NLS database, and found 8 common signals. Table 2 shows the eight predicted regions and the corresponding entries from the NLS database.
Table 2: Eight predicted regions corresponding to nuclear localization and the corresponding entries from the NLS database. The signal entry is a database signal that is close to the predicted signal. The origin describes whether the signal was experimentally verified or predicted according to the database, and the reference is the corresponding reference for the predicted signals. References: (A) Bouvier, D., Badacci, G., Mol. Biol. Cell, 1995, 6, 1697-705. (B) Youssoufian, H. et al., Blood Cells Mol. Dis., 1999, 25, 305-9. (C) Truant, R., Cullen, B.R., Mol. Cell. Biol., 1998, 18, 1449-1458.

| Predicted Region | NLS Signal | Origin |
| KKKKKKK | | |
| RKRKK | RfCRKK | (A) |
| KKEKKEKKDKKEKKEKKEKKDKKEKKEKKEKK | KKEKKKSKK | (B) |
| GGGTGGTGTGTGGG | RGGRGRGRG | predicted |
| QRFTQRGGGAVGKNRRGGRGGNRGGRNNNSTR | GGGxxxKNRRxxxxxxRGGRN | (C) |
| EVLKVQKRRIYD | [FL]KxxKRR | predicted |
| LSGGTPKRCLDLSNLS | T[PLV]KRC | predicted |

4 Discussion
We have presented a framework for combining textual annotations and sequence data to determine the relevant functional regions of a set of example proteins. Since a set of examples large enough to perform this kind of analysis is often difficult to obtain, we use a general approach that extends the original training set by exploiting textual annotations. This results in a significantly larger set of labeled examples. We can then train a joint text and sequence classifier over the extended training set, and subsequently project the classifier onto the original sequences to identify the relevant regions. We have shown how we can recover nuclear localization signals using this analysis. The framework takes advantage of recent sequence classification models which are based on analysis of subsequences of the protein and which, for each position in the sequence, can determine how relevant that position is to predicting the class. We have applied the framework to sub-cellular localization of proteins. We plan to explore alternative ways of combining textual and sequence information using our general approach, as well as a more thorough analysis of the localization predictions. We also plan to apply our framework to determine relevant regions for other properties of proteins.

References
1. S. R. Eddy. Multiple alignment using hidden Markov models. In C. Rawlings, editor, Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pages 114-120. AAAI Press, 1995.
2. A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501-1531, 1994.
3. K. Karplus, C. Barrett, and R. Hughey. Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10):846-856, 1998.
4. B. Rost and C. Sander. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19(1):55-72, 1994.
5. C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing (PSB), Kaua'i, Hawaii, 2002.
6. C. Leslie, E. Eskin, and W. S. Noble. Mismatch string kernels for SVM protein classification. In Proceedings of Advances in Neural Information Processing Systems 15 (NIPS), 2002.
7. C. Leslie, E. Eskin, A. Cohen, and W. S. Noble. Mismatch string kernels for SVM protein classification. Technical report, Columbia University, 2003.
8. R. Nair and B. Rost. Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18 Suppl 1:S78-S86, Jul 2002.
9. R. Nair and B. Rost. Sequence conserved for subcellular localization. Protein Sci., 11(12):2836-47, Dec. 2002.
10. A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory, 1998.
11. K. Tsuda, S. Akaho, and K. Asai. The EM algorithm for kernel matrix completion with auxiliary data. Journal of Machine Learning Research, 4:67-81, 2003.
12. A. Bairoch and R. Apweiler. The SWISS-PROT protein sequence database: its relevance to human molecular medical research. J. Mol. Med., 75:312-316, 1997.
13. W. W. Cohen. Fast effective rule induction. In International Conference on Machine Learning, 1995.
14. E. Eskin and S. Snir. A biologically motivated sequence embedding into Euclidean space. Technical report, Hebrew University, 2003.
15. E. Eskin, W. S. Noble, Y. Singer, and S. Snir. A unified approach for sequence prediction using sparse sequence models. Technical report, Hebrew University, 2003.
16. V. N. Vapnik. Statistical Learning Theory. Springer, 1998.
17. M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25-33, 1996.
18. R. Nair, P. Carter, and B. Rost. NLSdb: database of nuclear localization signals. Nucleic Acids Research, 31(1):397-9, Jan 2003.
KERNEL-BASED DATA FUSION AND ITS APPLICATION TO PROTEIN FUNCTION PREDICTION IN YEAST

G. R. G. LANCKRIET
Division of Electrical Engineering, University of California, Berkeley

M. DENG
Department of Biological Sciences, University of Southern California

N. CRISTIANINI
Department of Statistics, University of California, Davis

M. I. JORDAN
Division of Computer Science, Department of Statistics, University of California, Berkeley

W. S. NOBLE
Department of Genome Sciences, University of Washington
Abstract Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously-described Markov random field method, and better than the SVM trained on any single type of data.
Online supplement at noble.gs.washington.edu/yeast

1 Introduction Much research in computational biology involves drawing statistically sound inferences from collections of data. For example, the function of an unannotated protein sequence can be predicted based on an observed similarity between that protein sequence and the sequence of a protein of known function. Related methodologies involve inferring related functions of two proteins if they occur in fused form in some other organism, if they co-occur in multiple
species, if their corresponding mRNAs share similar expression patterns, or if the proteins interact with one another. It seems natural that, while all such data sets contain important pieces of information about each gene or protein, the comparison and fusion of these data should produce a much more sophisticated picture of the relations among proteins, and a more detailed representation of each protein. This fused representation can then be exploited by machine learning algorithms. Combining information from different sources contributes to forming a complete picture of the relations between the different components of a genome. This paper presents a computational and statistical framework for integrating heterogeneous descriptions of the same set of genes, proteins or other entities. The approach relies on the use of kernel-based statistical learning methods that have already proven to be very useful tools in bioinformatics [1]. These methods represent the data by means of a kernel function, which defines similarities between pairs of genes, proteins, etc. Such similarities can be quite complex relations, implicitly capturing aspects of the underlying biological machinery. One reason for the success of kernel methods is that the kernel function takes relationships that are implicit in the data and makes them explicit, so that it is easier to detect patterns. Each kernel function thus extracts a specific type of information from a given data set, thereby providing a partial description or view of the data. Our goal is to find a kernel that best represents all of the information available for a given statistical learning task.
Given many partial descriptions of the data, we solve the mathematical problem of combining them using a convex optimization method known as semidefinite programming (SDP) [2]. This SDP-based approach [3] yields a general methodology for combining many partial descriptions of data that is statistically sound, as well as computationally efficient and robust. In order to demonstrate the feasibility of these methods, we address the problem of predicting the functions of yeast proteins. Following the experimental paradigm of Deng et al. [4], we use a collection of five publicly available data sets to learn to recognize 13 broad functional categories of yeast proteins. We demonstrate that incorporating knowledge derived from amino acid sequences, protein complex data, gene expression data and known protein-protein interactions significantly improves classification performance relative to our method trained on any single type of data, and relative to a previously described method based on a Markov random field model [4].
2 Related Work
Considerable work has been devoted to the problem of automatically integrating genomic datasets, leveraging the interactions and correlations between them to obtain more refined and higher-level information. Previous research in this field can be divided into three classes of methods. The first class treats each data type independently. Inferences are made separately from each data type, and an inference is deemed correct if the various data agree. This type of analysis has been used to validate, for example, gene expression and protein-protein interaction data [5,6,7], to validate protein-protein interactions predicted using five different methods [8], and to infer protein function [9]. A slightly more complex approach combines multiple data sets using intersections and unions of the overlapping sets of predictions [10]. The second formalism to represent heterogeneous data is to extract binary relations between genes from each data source, and represent them as graphs. As an example, sequence similarity, protein-protein interaction, gene co-expression or closeness in a metabolic pathway can be used to define binary relations between genes. Several groups have attempted to compare the resulting gene graphs using graph algorithms [11,12], in particular to extract clusters of genes that share similarities with respect to different sorts of data. The third class of techniques uses statistical methods to combine heterogeneous data. For example, Holmes and Bruno use a joint likelihood model to combine gene expression and upstream sequence data for finding significant gene clusters [13]. Similarly, Deng et al.
use a maximum likelihood method to predict protein-protein interactions and protein function from three types of data [14]. Alternatively, protein localization can be predicted by converting each data source into a conditional probabilistic model and integrating via Bayesian calculus [15]. The general formalism of graphical models, which includes Bayesian networks and Markov random fields as special cases, provides a systematic methodology for building such integrated probabilistic models. As an instance of this methodology, Deng et al. developed a Markov random field model to predict yeast protein function [4]. They found that the use of different sources of information indeed improved prediction accuracy when compared to using only one type of data. This paper describes a fourth type of data fusion technique, also statistical, but of a more nonparametric and discriminative flavor. The method, described in detail below, consists of representing each type of data independently as a matrix of kernel similarity values. These kernel matrices are then combined to make overall predictions. An early example of this approach, based on fixed sums of kernel matrices, showed that combinations of kernels can yield improved gene classification performance in yeast, relative to learning from a single kernel matrix [16]. The current work takes this methodology further: we use a weighted linear combination of kernels, and demonstrate how to estimate the kernel weights from the data. This yields not only predictions that reflect contributions from multiple data sources, but also an indication of the relative importance of these sources. The graphical model formalism, as exemplified by the Markov random field model of Deng et al., has several advantages in the biological setting. In particular, prior knowledge can be readily incorporated into such models, with standard Bayesian inference algorithms available to combine such knowledge with data. Moreover, the models are flexible, accommodating a variety of data types and providing a modular approach to combining multiple data sources. Classical discriminative statistical approaches, on the other hand, can provide superior performance in simple situations, by focusing explicitly on the boundary between classes, but tend to be significantly less flexible and less able to incorporate prior knowledge. As we discuss in this paper, however, recent developments in kernel methods have yielded a general class of discriminative methods that readily accommodate non-standard data types (such as strings, trees and graphs), allow prior knowledge to be brought to bear, and provide general machinery for combining multiple data sources.

3 Methods and Approach
Kernel Methods. Kernel methods work by embedding data items (genes, proteins, etc.) into a vector space F, called a feature space, and searching for linear relations in such a space. This embedding is defined implicitly, by specifying an inner product for the feature space via a positive semidefinite kernel function: K(x1, x2) = ⟨Φ(x1), Φ(x2)⟩, where Φ(x1) and Φ(x2) are the embeddings of data items x1 and x2. Note that if all we require in order to find those linear relations are inner products, then we do not need to have an explicit representation of the mapping Φ, nor do we even need to know the nature of the feature space. It suffices to be able to evaluate the kernel function, which is often much easier than computing the coordinates of the points explicitly. Evaluating the kernel on all pairs of data points yields a symmetric, positive semidefinite matrix K known as the kernel matrix, which can be regarded as a matrix of generalized similarity measures among the data points. The kernel-based binary classification algorithm that we use in this paper, the 1-norm soft margin support vector machine [17,18], forms a linear discriminant boundary in feature space F, f(x) = wᵀΦ(x) + b, where w ∈ F and b ∈ R.
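To make the kernel-matrix construction concrete, here is a toy sketch. The Gaussian kernel, its width convention, and the sample points are illustrative assumptions, not the paper's data:

```python
import math

def gaussian_kernel(x, z, sigma=0.5):
    """K(x, z) = exp(-||x - z||^2 / (2*sigma)), one common choice of
    positive semidefinite kernel (the width convention is an assumption)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma))

def kernel_matrix(points, k):
    """Evaluate the kernel on all pairs, yielding the symmetric matrix K."""
    return [[k(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # toy data items
K = kernel_matrix(points, gaussian_kernel)
# K is symmetric with unit diagonal (each item is maximally similar to itself).
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(3) for j in range(3))
assert all(abs(K[i][i] - 1.0) < 1e-12 for i in range(3))
```

The point of the sketch is that only `gaussian_kernel` ever touches the coordinates; everything downstream sees the matrix K alone.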
Given a labelled sample S_n = {(x_1, y_1), ..., (x_n, y_n)}, w and b are optimized to maximize the distance ("margin") between the positive and negative class, allowing misclassifications (hence "soft margin"):

    min_{w,b,ξ}   wᵀw + C ∑_{i=1}^{n} ξ_i
    subject to    y_i (wᵀΦ(x_i) + b) ≥ 1 − ξ_i,   i = 1, ..., n        (1)
                  ξ_i ≥ 0,   i = 1, ..., n

where C is a regularization parameter, trading off error against margin. By considering the corresponding dual problem of (1), one can prove [18] that the weight vector can be expressed as w = ∑_{i=1}^{n} α_i y_i Φ(x_i), where the support values α_i are solutions of the following dual quadratic program (QP):

    max_α   2αᵀe − αᵀ diag(y) K diag(y) α   :   C ≥ α ≥ 0,   αᵀy = 0.

An unlabelled data item x_new can subsequently be classified by computing the linear function

    f(x_new) = wᵀΦ(x_new) + b = ∑_{i=1}^{n} α_i y_i K(x_i, x_new) + b.

If f(x_new) is positive, then we classify x_new as belonging to class +1; otherwise, we classify x_new as belonging to class −1.
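As a concrete toy instance (hand-constructed, not from the paper), consider two points x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1 under the linear kernel. By symmetry the optimal support values are alpha1 = alpha2 = 0.5 with b = 0, and the discriminant can be evaluated through kernel calls alone, without ever forming w explicitly:

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

# Hand-solved toy problem: optimal alpha = (0.5, 0.5), b = 0 by symmetry.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1.0, -1.0]
alpha = [0.5, 0.5]
b = 0.0

def f(x_new):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, evaluated via kernels only."""
    return sum(a_i * y_i * linear_kernel(x_i, x_new)
               for a_i, y_i, x_i in zip(alpha, y, X)) + b

assert f((2.0, 3.0)) > 0    # classified as +1
assert f((-0.5, 1.0)) < 0   # classified as -1
```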
Kernel Methods for Data Fusion. Given multiple related data sets (e.g., gene expression, protein sequence, and protein-protein interaction data), each kernel function produces, for the yeast genome, a square matrix in which each entry encodes a particular notion of similarity of one yeast protein to another. Implicitly, each matrix also defines an embedding of the proteins in a feature space. Thus, the kernel representation casts heterogeneous data (variable-length amino acid strings, real-valued gene expression data, a graph of protein-protein interactions) into the common format of kernel matrices. The kernel formalism also allows these various matrices to be combined. Basic algebraic operations such as addition, multiplication and exponentiation preserve the key property of positive semidefiniteness, and thus allow a simple but powerful algebra of kernels [19]. For example, given two kernels K_1 and K_2, inducing the embeddings Φ_1(x) and Φ_2(x), respectively, it is possible to define the kernel K = K_1 + K_2, inducing the embedding Φ(x) = [Φ_1(x), Φ_2(x)]. Of even greater interest, we can consider parameterized combinations of kernels. In this paper, given a set of kernels {K_1, K_2, ..., K_m}, we will form the linear combination

    K = ∑_{i=1}^{m} μ_i K_i.        (2)
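Forming such a weighted combination is elementwise over the kernel matrices; a toy sketch (the 2x2 matrices and weights are hypothetical):

```python
def combine_kernels(kernels, weights):
    """K = sum_i mu_i * K_i, elementwise over same-sized square matrices."""
    n = len(kernels[0])
    return [[sum(mu * K[r][c] for mu, K in zip(weights, kernels))
             for c in range(n)] for r in range(n)]

# Two toy PSD kernel matrices, standing in for two data sources.
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K1, K2], [0.75, 0.25])
# For a 2x2 symmetric matrix, PSD <=> non-negative diagonal and determinant.
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
assert K[0][0] >= 0 and det >= 0
```

Non-negative weights keep the combination positive semidefinite, which is what lets K serve as a kernel matrix in its own right.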
As we have discussed, fitting an SVM to a single data source involves solving a QP based on the kernel matrix and the labels. We have shown that it is possible to extend this optimization problem not only to find optimal linear discriminant boundaries but also to find optimal values of the coefficients μ_i in (2) for problems involving multiple kernels [3]. In the case of the 1-norm soft margin SVM, we want to minimize the same cost function (1), now with respect to both the discriminant boundary and the μ_i. Again considering the Lagrangian dual problem, it turns out that the problem of finding optimal μ_i and α_i reduces to a convex optimization problem known as a semidefinite program (SDP), whose constraints include

    K = ∑_{i=1}^{m} μ_i K_i ⪰ 0,   trace(∑_{i=1}^{m} μ_i K_i) = c,        (3)

where c is a constant. SDP can be viewed as a generalization of linear programming, where scalar linear inequality constraints are replaced by more general linear matrix inequalities (LMIs): F(x) ⪰ 0, meaning that the matrix F has to be in the cone of positive semidefinite matrices, as a function of the decision variables x. Note that the first LMI constraint in (3), K = ∑_{i=1}^{m} μ_i K_i ⪰ 0, emerges very naturally, because the optimal kernel matrix must indeed come from the cone of positive semidefinite matrices. Linear programs and semidefinite programs are both instances of convex optimization problems, and both can be solved via efficient interior-point algorithms [2]. In this paper, the weights μ_i are constrained to be non-negative and the K_i are positive semidefinite and normalized ([K_i]_{jj} = 1) by construction; thus K ⪰ 0 is automatically satisfied. In that case, one can prove [3] that the SDP (3) can be cast as a quadratically constrained quadratic program (QCQP), which
Table 1: Functional categories. The table lists the 13 CYGD functional classifications used in these experiments. The class listed as "others" is a combination of four smaller classes: (1) cellular communication/signal transduction mechanism, (2) protein activity regulation, (3) protein with binding function or cofactor requirement (structural or catalytic) and (4) transposable elements, viral and plasmid proteins.

| # | Category | Size |
| 1 | metabolism | 1048 |
| 2 | energy | 242 |
| 3 | cell cycle & DNA processing | 600 |
| 4 | transcription | 753 |
| 5 | protein synthesis | 335 |
| 6 | protein fate | 578 |
| 7 | cellular transp. & transp. mech. | 479 |
| 8 | cell rescue, defense & virulence | 264 |
| 9 | interaction w/ cell. envt. | 193 |
| 10 | cell fate | 411 |
| 11 | control of cell. organization | 192 |
| 12 | transport facilitation | 306 |
| 13 | others | 81 |
improves the efficiency of the computation:

    max_{α,t}    2αᵀe − ct
    subject to   t ≥ (1/n) αᵀ diag(y) K_i diag(y) α,   i = 1, ..., m        (4)
                 αᵀy = 0,
                 C ≥ α ≥ 0.

Thus, by solving a QCQP, we are able to find an adaptive combination of kernel matrices, and thus an adaptive combination of heterogeneous information sources, that solves our classification problem. The output of our procedure is a set of weights μ_i and a discriminant function based on these weights. We obtain a classification decision that merges information encoded in the various kernel matrices, and we obtain weights μ_i that reflect the relative importance of these information sources.
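The structure of the QCQP can be illustrated numerically (toy kernels, labels, and a fixed candidate alpha, all hypothetical; the 1/n scaling follows the constraint as printed): for a given alpha, the smallest feasible t is the largest per-kernel quadratic form, so the kernels whose constraints are active are the ones that shape the solution.

```python
def quad_form(alpha, y, K):
    """(1/n) * alpha^T diag(y) K diag(y) alpha for one candidate kernel K."""
    n = len(alpha)
    v = [a * yi for a, yi in zip(alpha, y)]   # diag(y) applied to alpha
    return sum(v[r] * K[r][c] * v[c]
               for r in range(n) for c in range(n)) / n

alpha = [0.5, 0.5]            # hypothetical candidate dual variables
y = [1.0, -1.0]
K1 = [[1.0, 0.2], [0.2, 1.0]]
K2 = [[1.0, 0.8], [0.8, 1.0]]
# Smallest t satisfying every per-kernel constraint is the max quadratic form.
t_min = max(quad_form(alpha, y, K) for K in (K1, K2))
c = 2.0
objective = 2 * sum(alpha) - c * t_min    # the QCQP objective 2*alpha^T e - c*t
```

Here K1 produces the larger quadratic form, so its constraint is the binding one at this candidate alpha.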
4 Experimental Design
In order to test our kernel-based approach, we follow the experimental paradigm of Deng et al. [4]. The task is predicting functional classifications associated with yeast proteins, and we use as a gold standard the functional catalogue provided by the MIPS Comprehensive Yeast Genome Database (CYGD, mips.gsf.de/proj/yeast). The top-level categories in the functional hierarchy produce 13 classes (see Table 1). These 13 classes contain 3588 proteins; the remaining yeast proteins have uncertain function and are therefore not used in evaluating the classifier. Because a given protein can belong to several functional classes, we cast the prediction problem as 13 binary classification tasks, one for each functional class.
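Casting the multi-label problem as one-vs-rest binary tasks can be sketched as follows (the ORF names and class assignments here are hypothetical):

```python
def one_vs_rest_labels(annotations, classes):
    """For each functional class, build a +1/-1 labeling over all proteins.
    `annotations` maps a protein to the set of classes it belongs to."""
    return {c: {p: (1 if c in cls else -1)
                for p, cls in annotations.items()}
            for c in classes}

# Hypothetical multi-label annotations: a protein may sit in several classes.
annotations = {"YAL001C": {"transcription"},
               "YBR045W": {"metabolism", "energy"}}
tasks = one_vs_rest_labels(annotations,
                           ["metabolism", "energy", "transcription"])
assert tasks["energy"]["YBR045W"] == 1
assert tasks["energy"]["YAL001C"] == -1
```

Each of the 13 resulting labelings is then handed to its own SVM.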
The primary input to the classification algorithm is a collection of kernel matrices representing different types of data. In order to compare the SDP/SVM approach to the MRF method of Deng et al., we perform two variants of the experiment: one in which the five kernels are restricted to contain precisely the same binary information as used by the MRF method, and a second experiment in which two of the kernels use richer representations and a sixth kernel is added. For the first kernel, the domain structure of each protein is summarized using the mapping provided by SwissProt v7.5 (us.expasy.org/sprot) from protein sequences to Pfam domains (pfam.wustl.edu). Each protein is characterized by a 4950-bit vector, in which each bit represents the presence or absence of one Pfam domain. The kernel function K_Pfam is simply the inner product applied to these vectors. This bit vector representation was used by the MRF method. In the second experiment, the domain representation is enriched by adding additional domains (Pfam 9.0 contains 5724 domains) and by replacing the binary scoring with log E-values derived by comparing the HMMs with a given protein using the HMMER software toolkit (hmmer.wustl.edu). Three kernels are derived from CYGD information regarding three different types of protein interactions: protein-protein interactions, genetic interactions, and co-participation in a protein complex, as determined by tandem affinity purification (TAP). All three data sets can be represented as graphs, with proteins as nodes and interactions as edges. Kondor and Lafferty [20] propose a general method for establishing similarities among the nodes of a graph, based on a random walk on the graph. This method efficiently accounts for all possible paths connecting two nodes, and for the lengths of those paths. Nodes that are connected by shorter paths or by many paths are considered more similar. The resulting diffusion kernel generates three interaction kernel matrices, K_Gen, K_Phys, and K_TAP. A diffusion constant τ controls the rate of diffusion through the network [20]. For K_Gen and K_Phys, τ = 5, and for K_TAP, τ = 1. The fifth kernel is generated using 77 cell cycle gene expression measurements per gene [21]. Two genes with similar expression profiles are likely to have similar functions; accordingly, Deng et al. convert the expression matrix to a square binary matrix in which a 1 indicates that the corresponding pair of expression profiles exhibits a Pearson correlation greater than 0.8. We use this matrix to form a diffusion kernel K_Exp. In the second experiment, a Gaussian kernel is defined directly on the expression profiles: for expression profiles x and z, the kernel is K(x, z) = exp(−||x − z||²/2σ) with width σ = 0.5. In the second experiment, we construct one additional kernel matrix by applying the Smith-Waterman pairwise sequence comparison algorithm [22] to
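A diffusion kernel on a small interaction graph can be sketched in pure Python (the 3-protein graph is hypothetical, and the matrix exponential uses a truncated Taylor series, which is adequate for a toy matrix). Following the Kondor-Lafferty construction, K = exp(τH) with H = A − D, the negative graph Laplacian:

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(H, terms=30):
    """exp(H) via truncated Taylor series; adequate for small toy matrices."""
    n = len(H)
    K = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    P = [row[:] for row in K]
    fact = 1.0
    for t in range(1, terms):
        P = matmul(P, H)       # P = H^t
        fact *= t
        K = [[K[i][j] + P[i][j] / fact for j in range(n)] for i in range(n)]
    return K

# Hypothetical interaction graph on 3 proteins: edges (0,1) and (1,2).
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
deg = [sum(row) for row in A]
tau = 1.0
H = [[tau * (A[i][j] - (deg[i] if i == j else 0)) for j in range(3)]
     for i in range(3)]
K = expm(H)
# Nodes joined by shorter paths come out more similar: 0-1 are adjacent,
# while 0-2 are two hops apart, yet still positively related.
assert K[0][1] > K[0][2] > 0
```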
Figure 1: Classification performance for the 13 functional classes. The height of each bar is proportional to the ROC score. The standard deviation across the 15 experiments is usually 0.01 or smaller (see supplement), so most of the depicted differences are significant. Black bars correspond to the MRF method of Deng et al.; gray bars correspond to the SDP/SVM method using five kernels computed on binary data, and white bars correspond to the SDP/SVM using the enriched Pfam kernel and replacing the expression kernel with the SW kernel.
the yeast protein sequences. Each protein is represented as a vector of Smith-Waterman log E-values, computed with respect to all 6355 yeast genes. The kernel matrix K_SW is computed using an inner product applied to pairs of these vectors. This matrix is complementary to the Pfam domain matrix, capturing sequence similarities among yeast genes, rather than similarities with respect to the Pfam database. Each algorithm's performance is measured by performing 5-fold cross-validation three times. For a given split, we evaluate each classifier by reporting the receiver operating characteristic (ROC) score on the test set. The ROC score is the area under a curve that plots true positive rate as a function of false positive rate for differing classification thresholds [23]. For each classification, we measure 15 ROC scores (three 5-fold splits), which allows us to estimate the variance of the score.

5 Results
The experimental results are summarized in Figure 1. The figure shows that, for each of the 13 classifications, the ROC score of the SDP/SVM method is better than that of the MRF method. Overall, the mean ROC improves from 0.715 to 0.854. The improvement is consistent and statistically significant across all 13 classes. An additional improvement, though not as large, is gained by replacing the expression and Pfam kernels with their enriched versions (see supplement). The most improvement is offered by using the enriched Pfam kernel and replacing the expression kernel with the Smith-Waterman kernel. The resulting mean ROC is 0.870. Again, the improvement occurs in every class, although some class-specific differences are not statistically significant.

Table 2: Kernel weights and ROC scores for the transport facilitation class. The table shows, for both experiments, the mean weight associated with each kernel, as well as the ROC score resulting from learning the classification using only that kernel. The final row lists the mean ROC score using all kernels.

| Kernel | Binary data: Weight | Binary data: ROC | Enriched kernels: Weight | Enriched kernels: ROC |
| K_Pfam | 2.21 | 0.9331 | 1.58 | 0.9461 |
| K_Gen | 0.18 | 0.6093 | 0.21 | 0.6093 |
| K_Phys | 0.94 | 0.6655 | 1.01 | 0.6655 |
| K_TAP | 0.74 | 0.6499 | 0.49 | 0.6499 |
| K_Exp | 0.93 | 0.5457 | - | 0.7126 |
| K_SW | - | - | 1.72 | 0.9180 |
| all | - | 0.9674 | - | 0.9733 |

Table 2 provides detailed results for a single functional classification, the transport facilitation class. The weight assigned to each kernel indicates the importance that the SDP/SVM procedure assigns to that kernel. The Pfam and Smith-Waterman kernels yield the largest weights, as well as the largest individual ROC scores. Results for the other twelve classifications are similar (see supplement).

6 Discussion
We have described a general method for combining heterogeneous genome-wide data sets in the setting of kernel-based statistical learning algorithms, and we have demonstrated an application of this method to the problem of predicting the function of yeast proteins. The resulting SDP/SVM algorithm yields significant improvement relative to an SVM trained from any single data type, and relative to a previously proposed graphical model approach for fusing heterogeneous genomic data. Kernel-based statistical learning methods have a number of general virtues as tools for biological data analysis. First, the kernel framework accommodates non-vector data types such as strings, trees and graphs. Second, kernels provide significant opportunities for the incorporation of specific biological knowledge, as we have seen with the Pfam kernel, and unlabelled data, as in the diffusion
and Smith-Waterman kernels. Third, the growing suite of kernel-based data analysis algorithms requires only that data be reduced to a kernel matrix; this creates opportunities for standardization. Finally, as we have shown here, the reduction of heterogeneous data types to the common format of kernel matrices allows the development of general tools for combining multiple data types. Kernel matrices are required only to respect the constraint of positive semidefiniteness, and thus the powerful technique of semidefinite programming can be exploited to derive general procedures for combining data of heterogeneous format and origin.
Acknowledgements
WSN is supported by a Sloan Foundation Research Fellowship and
by National Science Foundation grants DBI-0078523 and ISI-0093302. MIJ and GL acknowledge support from ONR MURI N00014-00-1-0637 and NSF grant IIS-9988642.
References
1. B. Scholkopf, K. Tsuda and J.-P. Vert. Support vector machine applications in computational biology. MIT Press, Cambridge, MA, 2004.
2. L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.
3. G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. In Proc 19th Int Conf Machine Learning, pp. 323-330, 2002.
4. M. Deng, T. Chen, and F. Sun. An integrated probabilistic model for functional prediction of proteins. Proc 7th Int Conf Comp Mol Biol, pp. 95-103, 2003.
5. H. Ge, Z. Liu, G. Church, and M. Vidal. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics, 29:482-486, 2001.
6. A. Grigoriev. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucl Acids Res, 29:3513-3519, 2001.
7. R. Mrowka, W. Liebermeister, and D. Holste. Does mapping reveal correlation between gene expression and protein-protein interaction? Nature Genetics, 33:15-16, 2003.
8. C. von Mering, R. Krause, B. Snel et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417:399-403, 2002.
9. E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402(6757):83-86, 1999.
10. R. Jansen, N. Lan, J. Qian, and M. Gerstein. Integration of genomic datasets to predict protein complexes in yeast. Journal of Structural and Functional Genomics, 2:71-81, 2002.
11. A. Nakaya, S. Goto, and M. Kanehisa. Extraction of correlated gene clusters by multiple graph comparison. In Genome Informatics 2001, pp. 44-53, 2001.
12. A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18:S136-S144, 2002.
13. I. Holmes and W. J. Bruno. Finding regulatory elements using joint likelihoods for sequence and expression profile data. In Proc Int Sys Mol Biol, pp. 202-210, 2000.
14. M. Deng, F. Sun, and T. Chen. Assessment of the reliability of protein-protein interactions and protein function prediction. In Proc Pac Symp Biocomputing, pp. 140-151, 2003.
15. A. Drawid and M. Gerstein. A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol, 301:1059-1075, 2000.
16. P. Pavlidis, J. Weston, J. Cai, and W. N. Grundy. Gene functional classification from heterogeneous data. In Proc 5th Int Conf Comp Mol Biol, pp. 242-248, 2001.
17. B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Computational Learning Theory, pp. 144-152, 1992.
18. B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
19. C. Berg, C. J. Christensen, and P. Ressel. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer, New York, NY, 1984.
20. R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proc Int Conf Machine Learning, pp. 315-322, 2002.
21. P. T. Spellman, G. Sherlock, M. Q. Zhang et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9:3273-3297, 1998.
22. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147(1):195-197, 1981.
23. J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982.
DISCOVERY OF BINDING MOTIF PAIRS FROM PROTEIN COMPLEX STRUCTURAL DATA AND PROTEIN INTERACTION SEQUENCE DATA
H. LI, J. LI, S. H. TAN, S.-K. NG
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
School of Computing, National University of Singapore, Singapore 119260
Email: {haiquan,jinyan,soonheng,skng}@i2r.a-star.edu.sg
Abstract
Unravelling the underlying mechanisms of protein interactions requires knowledge about the interactions' binding sites. In this paper, we use a novel concept, binding motif pairs, to describe binding sites. A binding motif pair consists of two motifs, each derived from one side of the binding protein sequences. The discovery is a directed approach that uses a combination of two data sources: 3-D structures of protein complexes and sequences of interacting proteins. We first extract maximal contact segment pairs from the protein complexes' structural data. We then use these segment pairs as templates to sub-group the interacting protein sequence dataset, and conduct an iterative refinement to derive significant binding motif pairs. This combination approach is efficient in handling large datasets of protein interactions. From a dataset of 78,390 protein interactions, we have discovered 896 significant binding motif pairs. The discovered motif pairs include many novel motif pairs as well as motifs that agree well with experimentally validated patterns in the literature.
1 Introduction
Protein-protein interactions play a crucial role in the operation of many key biological functions such as inter-cellular communication, signal transduction, and regulation of gene expression. Unravelling the underlying mechanisms of these interactions will provide invaluable knowledge that could lead to the discovery of new drugs and better treatments for many human diseases. Physically, protein interactions are mediated by short sequences of residues that form the contact interfaces between two interacting proteins, often referred to as their binding sites. Though many experimental methods [2] and computational methods have been developed to detect protein interactions with increasing levels of accuracy, few methods can
aTo whom correspondence should be addressed.
pinpoint the specific residues in the proteins that are involved in the interactions. Such information is necessary for the interaction data to be directly useful for drug discovery. To determine the binding sites between interacting proteins, experimental methods usually include mutagenesis studies and phage display [3], which are tedious and time-consuming. Computational methods often include docking approaches and domain-domain interaction approaches. The docking approach is based on the analysis of bound protein structures. The use of this approach is very limited, mainly because resolved structures of proteins are often not available due to the limitations in scalability and coverage of current protein structure determination technologies. The domain-domain interaction approach assumes that protein interactions are determined by the interactions between domains and aims to identify the interactions only among predefined domains [5,6]. However, some domains may not directly determine the interactions, but only function as determinants of protein folding. Even when domains are involved in protein interactions, not all of their residues are contained in the binding sites and contribute to the role of the interactions. In this work, we study the problem of binding sites at the residue level rather than at the domain level. Our basic idea is that correlated sequence motif pairs determine the interactions. A similar concept, correlated sequence-signature pairs, was first proposed by Sprinzak and Margalit [4], expressed in terms of domain pairs. We focus on efficient in silico discovery of our motif pairs from multiple data sources about protein interactions. Ideally, such interacting motif pairs should be discovered from protein complex structural data. However, as discussed above, the availability of such data is very limited. Alternatively, interacting motif pairs may be discovered by analyzing their co-occurrence rates in interacting protein pairs' sequences.
However, as high-throughput detection technologies such as two-hybrid screening [7,8] can rapidly generate large datasets of experimentally determined protein interactions, the search space on the associated protein sequences is enormous. The high false positive rates observed in high-throughput protein interaction data could also diminish the biological significance of motif pairs detected solely from protein interaction sequences. To address these issues in mining motif pairs, we propose a joint approach that makes use of the two available types of interaction data: (1) the limited structural data of protein complexes, which provide exact information on inter-protein contact sites, and (2) the abundantly available interacting protein sequence pairs from high-throughput interaction detection experiments. The structural data of protein complexes are carefully mined for contact residues; these are then computationally extended into the so-called maximal contact segment pairs, which we will define later. The complexes' maximal segment pairs are then deployed to seed the discovery of motif pairs from large sequence datasets of interacting proteins, followed by an iterative refinement procedure to ensure the significance of the derived motif pairs. This combined directed approach reduces the formidable search space of interacting protein sequences while providing some biological support for the motifs discovered. Indeed, many of our motif pairs discovered this way can be confirmed by biological patterns reported in the literature, as we will show later. We present the overall picture of our method in Section 2. In Sections 3 and 4, we describe new algorithms to discover maximal contact segment pairs from protein complex data, and then to discover binding motif pairs from interacting protein sequence data. Results showing the effectiveness and significance of this joint approach are presented in Section 5. Finally, we conclude and discuss possible future work in Section 6.
2 Overview of Our Method and Data Used
A key idea in our proposed method for discovering significant binding motif pairs is the detection of maximal contact segment pairs between two proteins residing in a complex. First, all possible pairs of spatially contacting residues are determined from the 3-D structure data of a protein complex. These contact points are then extended to capture as many continuous binding residues along the two proteins as possible, deriving the maximal contact segment pairs. Computationally, the derivation of maximal contact segment pairs is a challenging problem. In Section 3, we will describe an algorithm to discover them efficiently. Our objective is to discover significant binding motif pairs from protein-protein interaction sequence datasets. Using the maximal contact segment pairs that we have discovered from the protein complex structural data, we cluster the interacting protein sequence data into sub-groups, each corresponding to one maximal contact segment pair. Then, from each sub-group, we use a new motif discovery algorithm and an iterative optimization refinement algorithm to discover a binding motif pair. To assess the significance of binding motif pairs in the refinement procedure, we define a measure called emerging significance, which is similar to the concept of emerging patterns [9]. This measure is based on both positive and negative interaction datasets: a pattern or motif pair is said to have a high emerging significance if it has a high frequency in the positive dataset but a relatively low frequency in the negative dataset. The iterative refinement is terminated when the motif pairs reach an optimized level of emerging significance. The protein complex dataset used in this study is a non-redundant subset from PDB where the maximum pairwise sequence identity is 30% and only structures with a resolution of 2.0 Å or better are included. The set
used was generated on 9th June 2003 and contained 1533 entries, each with at least 2 chains. As mentioned, our emerging significance approach requires the use of both positive and negative instances of pairwise protein-protein interactions. For positive protein-protein interaction sequence data, we used the data by von Mering et al. [1]. This dataset covers almost all the interaction data generated by experimental methods and in silico methods for yeast proteins. In total, there are 78,390 non-redundant interactions in this dataset. However, there are currently no large datasets of experimentally validated negative interactions. As such, we generated a putative negative interaction dataset by treating any possible protein pair in yeast that does not occur in the positive dataset as a negative interaction. As our emerging significance measure only requires that the detected patterns have relatively lower frequency in the negative datasets, the effect of potential false negative interactions in this putative negative dataset is minimal.
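The putative negative set described here can be sketched as rejection sampling over protein pairs. Sampling a fixed number k of negatives, rather than enumerating every non-interacting pair, is our own simplification to keep the sketch small; the function and parameter names are illustrative.

```python
import random

def putative_negatives(proteins, positives, k, seed=0):
    """Build a putative negative interaction set by sampling protein
    pairs that are absent from the positive interaction dataset.
    proteins: list of protein ids; positives: iterable of known
    interacting pairs; k: number of negatives to draw."""
    pos = {frozenset(p) for p in positives}
    rng = random.Random(seed)
    out = set()
    while len(out) < k:
        a, b = rng.sample(proteins, 2)   # two distinct proteins
        pair = tuple(sorted((a, b)))
        if frozenset(pair) not in pos:   # reject known interactions
            out.add(pair)
    return sorted(out)
```

In practice k would be chosen to balance the positive set; false negatives matter little here because the significance measure only compares relative frequencies.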
3 Discovering Maximal Contact Segment Pairs from Protein Complexes

3.1 Preprocessing: Compute Contact Sites

Given a pair of proteins in a complex, a contact site is an elemental pair of two residues or atoms, each coming from one of the two proteins, that are close enough in space. A protein complex usually consists of multiple proteins, so in this study we consider all pairs of proteins in a protein complex to obtain all contact sites in this step. We define a contact site mathematically as follows: Suppose two proteins with 3-D structural coordinates in (x, y, z), La = {(ai, xai, yai, zai), i = 1...m} and Lb = {(bj, xbj, ybj, zbj), j = 1...n}. The pair (ai, bj) is a contact site if dist(ai, bj) ≤ ε, where ai and bj are the atom ids in the proteins La and Lb respectively, and ε is an empirical threshold for the Euclidean distance function dist(., .). Such a pair is denoted Contact(ai, bj), or equivalently Contact(bj, ai). Note that a contact site at the atom level directly implies a contact site at the residue level, because each atom is a part of a unique residue. Hereafter, we will discuss contact sites only at the residue level. Since two residues are said to be in contact if one of the atoms in one residue is in contact with an atom in the other residue, it is possible for a residue to be in contact with multiple residues.
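This preprocessing step can be sketched directly. The (residue_id, x, y, z) tuples below are a hypothetical flattening of the complex's coordinate data, not the paper's data format; the 5 Å threshold is the one reported in Section 5.

```python
import math

EPSILON = 5.0  # the distance threshold, in Angstroms

def contact_sites(atoms_a, atoms_b, eps=EPSILON):
    """Residue-level contact sites between two chains of a complex.

    atoms_a / atoms_b: lists of (residue_id, x, y, z) tuples.
    Two residues are in contact if any pair of their atoms lies
    within eps of each other; an atom-level contact thus directly
    yields a residue-level one."""
    contacts = set()
    for ra, xa, ya, za in atoms_a:
        for rb, xb, yb, zb in atoms_b:
            if math.dist((xa, ya, za), (xb, yb, zb)) <= eps:
                contacts.add((ra, rb))
    return contacts
```

For instance, contact_sites([(1, 0.0, 0.0, 0.0), (2, 10.0, 0.0, 0.0)], [(7, 3.0, 0.0, 0.0)]) yields {(1, 7)}: residue 1 lies 3 Å from residue 7, while residue 2 lies 7 Å away.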
3.2 Extract Contact Segment Pairs

Next, we extend the concept of contact sites to the concept of contact segment pairs, aiming to search for large areas of contact sites in a pair of
Figure 1: An illustration of contact segment pairs in a pair of interacting proteins A and B. Here, protein A is said to be the opposite protein of B, and vice versa.
binding proteins. Figure 1 shows our idea, depicting a typical scenario where segments of residues in one protein are continuously in contact with segments of residues in the other protein. As an illustration, the segment [a10, a15] in protein A of Figure 1 is in contact with the segment [b21, b27] in protein B. That is, they are a contact segment pair. But the segment [a30, a40] in protein A and the segment [b21, b27] in protein B are collectively not a contact segment pair. Formally, the definition is: A contact segment pair is a segment pair ([ai1, ai2], [bj1, bj2]) satisfying: for every ai ∈ [ai1, ai2] there exists bj ∈ [bj1, bj2] such that (ai, bj) is a contact site, and symmetrically for every bj ∈ [bj1, bj2] there exists such an ai, where ai1, ai2, bj1, bj2 are residue ids in the two proteins La and Lb. Such a pair of segments is sometimes denoted Contact([ai1, ai2], [bj1, bj2]). A maximal contact segment pair is then defined as a contact segment pair such that no other contact segment pair can contain both segments of this contact pair. In this paper, we are interested in the following problem:
Definition 1 (Maximal Contact Segment Pairs Problem): Given a pair of binding proteins La and Lb, and C = {(ai, bj) | Contact(ai, bj) with respect to the two proteins La and Lb}, find all possible maximal contact segment pairs from C whose segment lengths are longer than a threshold.
A naive approach to solving this problem would require testing all possible segment pairs. Suppose two proteins La and Lb have m and n residues respectively; then La and Lb have m^2 and n^2 possible segments respectively. For each combination, O(mn) time would be required for the check, so the total time complexity of such a naive approach is O(m^3 * n^3) per pair of proteins in each complex. This is very expensive, particularly when the protein complexes are large and there are hundreds or thousands of protein complexes to be examined. We present a more efficient method to discover maximal contact segment pairs here. Observe that each residue may be in contact with multiple
residues in the opposite protein (see Figure 1). We introduce a concept named coverage to capture this phenomenon; it will be shown later that this is a useful concept for improving the efficiency of our discovery algorithm. The coverage of a residue ai, denoted Cov(ai), is the set of all residues in the opposite protein that are in contact with this residue, namely Cov(ai) = {bj | (ai, bj) ∈ C}. The coverage of a segment [ai1, ai2], denoted Cov([ai1, ai2]), is the union of the coverages of all its residues, namely Cov([ai1, ai2]) = ∪_{ai ∈ [ai1, ai2]} Cov(ai). The following proposition is useful in our algorithm to discover maximal contact segment pairs efficiently.

Proposition 1: A segment pair ([ai1, ai2], [bj1, bj2]) is a contact segment pair iff the coverage of each of the two segments contains the other segment, i.e. Contact([ai1, ai2], [bj1, bj2]) ⇔ (Cov([ai1, ai2]) ⊇ [bj1, bj2]) ∧ (Cov([bj1, bj2]) ⊇ [ai1, ai2]).

Proof: (⇒) We use contradiction. Suppose Cov([ai1, ai2]) ⊇ [bj1, bj2] is not true; then there exists a bj ∈ [bj1, bj2] with bj ∉ Cov([ai1, ai2]). This means there is no ai ∈ [ai1, ai2] in contact with bj, which contradicts the assumption that the pair is a contact segment pair. Therefore, Cov([ai1, ai2]) ⊇ [bj1, bj2]. We can prove Cov([bj1, bj2]) ⊇ [ai1, ai2] in a symmetrical manner. (⇐) If Cov([ai1, ai2]) ⊇ [bj1, bj2], this means that for each bj ∈ [bj1, bj2] there exists at least one contact site in [ai1, ai2]. Similarly, the residues in the other segment have the same property.

Our algorithm is a top-down recursive algorithm. At the initial step, each entire protein in a pair is treated as a segment. A series of recursive breaking-down steps is then performed to output maximal contact segment pairs, using the above proposition to determine when to break a segment down into several smaller segments and when to terminate producing a new candidate segment pair. The details of our algorithm are as follows:

Input: Two proteins La = {(ai, xai, yai, zai), i = 1...m} and Lb = {(bj, xbj, ybj, zbj), j = 1...n}, two special segments [a1, am] and [b1, bn], and C = {(ai, bj) | Contact(ai, bj), 1 ≤ i ≤ m, 1 ≤ j ≤ n}.
Output: A set of maximal contact segment pairs.
Preparation Step: Compute Cov(ai) and Cov(bj) for all 1 ≤ i ≤ m, 1 ≤ j ≤ n.
Initialization Step: Put the initial segment pair ([a1, am], [b1, bn]) into the candidate list.
repeat
    Segment Coverage Step: Remove the first segment pair from the candidate list, denoted ([xi1, xi2], [yj1, yj2]); compute the coverage Cov([xi1, xi2]) ∩ [yj1, yj2].
    Splitting Step:
    if (Cov([xi1, xi2]) ∩ [yj1, yj2]) == [yj1, yj2] then
        if (Cov([yj1, yj2]) ∩ [xi1, xi2]) == [xi1, xi2] then
            Output the segment pair.
        else
            Add ([yj1, yj2], [xi1, xi2]) into the candidate list.
        end if
    else
        Split Cov([xi1, xi2]) ∩ [yj1, yj2] into w continuous subsegments, denoted [yk2t-1, yk2t], t = 1...w; put each segment pair ([yk2t-1, yk2t], [xi1, xi2]), t = 1...w, into the candidate list.
    end if
until The candidate list is empty.
A detailed example can be found in this paper's supplementary information [10].
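The algorithm above can be sketched in Python. This is our own simplified rendering of the breaking-down scheme (residues as plain integer ids, the candidate list as a deque, and only a min_len filter rather than a full maximality post-check), not the authors' implementation.

```python
from collections import deque

def maximal_segment_pairs(contacts, m, n, min_len=1):
    """Top-down search for mutually covering contact segment pairs.

    contacts: the set C of residue-level contact sites (i, j), with
    residues numbered 1..m in protein A and 1..n in protein B."""
    cov_a = {i: {j for (x, j) in contacts if x == i} for i in range(1, m + 1)}
    cov_b = {j: {i for (i, y) in contacts if y == j} for j in range(1, n + 1)}

    def cov(table, seg):
        s = set()
        for r in range(seg[0], seg[1] + 1):
            s |= table[r]
        return s

    def runs(ids):
        # Split a set of residue ids into maximal continuous segments.
        out, cur = [], None
        for r in sorted(ids):
            if cur is not None and r == cur[1] + 1:
                cur[1] = r
            else:
                cur = [r, r]
                out.append(cur)
        return [tuple(c) for c in out]

    results = set()
    todo = deque([("A", (1, m), (1, n))])  # (side of first segment, segment, opposite)
    while todo:
        side, x, y = todo.popleft()
        cx, cy = (cov_a, cov_b) if side == "A" else (cov_b, cov_a)
        other = "B" if side == "A" else "A"
        full_y = set(range(y[0], y[1] + 1))
        hit = cov(cx, x) & full_y
        if hit == full_y:                                 # Cov(x) covers all of y
            if cov(cy, y) >= set(range(x[0], x[1] + 1)):  # mutual coverage holds
                a, b = (x, y) if side == "A" else (y, x)
                if min(a[1] - a[0], b[1] - b[0]) + 1 >= min_len:
                    results.add((a, b))
            else:
                todo.append((other, y, x))                # re-examine from the other side
        else:
            for seg in runs(hit):                         # split y and recurse
                todo.append((other, seg, x))
    return results
```

For the fully contacting diagonal {(1, 1), (2, 2), (3, 3)} on two 3-residue chains, the single pair ((1, 3), (1, 3)) is returned; dropping the middle contact splits it into ((1, 1), (1, 1)) and ((3, 3), (3, 3)).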
4 Discovering Binding Motif Pairs from Interacting Protein Sequence Pairs
Next, we describe how to discover binding motif pairs from protein interaction sequence data using the maximal contact segment pairs detected from protein complexes.
4.1 Seeded Sub-grouping and Consensus Motif Discovery

We use each of the discovered maximal contact segment pairs as a seed to sub-group the interaction sequence pairs, such that all the interaction pairs that contain the contact segment pair are grouped together. We then conduct a consensus motif discovery in each of the sub-groups of protein interaction sequences. First, let us give the following two definitions:
Contain: Suppose a sequence S = s1 s2 ... su and a segment P = p1 p2 ... pv. S contains P, denoted Contain(S, P), if LocalAlignment(S, P) ≥ λ, where λ is an empirical threshold.
Cluster of a Contact Segment Pair: Given an interaction dataset D consisting of n sequence pairs, denoted D = {(Si^1, Si^2), 1 ≤ i ≤ n}, and a segment pair P = (P1, P2), the cluster of this segment pair with respect to D, denoted CD(P), is
{(Si^1, Si^2) | (Si^1, Si^2) ∈ D, Contain(Si^1, P1) and Contain(Si^2, P2)}
∪ {(Si^1, Si^2) | (Si^1, Si^2) ∈ D, Contain(Si^2, P1) and Contain(Si^1, P2)}
With this way of sub-grouping the interaction dataset, the resulting clusters of different segment pairs may overlap with one another. Biologically, this is important because one protein may be involved in interactions with different proteins at different locations.
Given the cluster of a contact segment pair, our subsequent step is to find two consensus motifs, one from all the left-side sequences of the protein sequence pairs in the cluster, and the other from all the right-side sequences. On each side, we align all the sequences according to the best alignments with respect to the corresponding segment (P1 or P2 in this case). We used the score matrix developed by Azarya-Sprinzak et al. [11] for the local alignment [12], since structure is preserved for residue pairs that have high scores in the matrix. To obtain the consensus motif from each side of these alignments, every column in the alignment is examined as follows: if the occurrence of a residue in this column is above the stated threshold, we include it in the consensus motif; if there is no such residue, we treat this column as a wildcard. It is also possible to use alternative methods such as EMOTIF [13] to find the consensus motifs. These two consensus motifs form a binding motif pair. Note that we derive this binding motif pair starting from one contact segment pair. So, given a set of maximal contact segment pairs discovered from the protein complex dataset, we can obtain a set of binding motif pairs by going through all these maximal contact segment pairs on the interacting protein sequence datasets.
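The column-by-column consensus scan can be sketched as follows. The 60% column threshold here is an illustrative assumption of ours, not the paper's setting, and the input is assumed to be already aligned to equal length.

```python
def consensus_motif(aligned, threshold=0.6, wildcard="*"):
    """Column-by-column consensus over equal-length aligned sequences:
    a residue whose column frequency reaches `threshold` enters the
    consensus; a column with no such residue becomes a wildcard."""
    motif = []
    for col in zip(*aligned):
        best = max(set(col), key=col.count)  # most frequent residue in the column
        motif.append(best if col.count(best) / len(col) >= threshold else wildcard)
    return "".join(motif)
```

For example, consensus_motif(["PADLS", "PVDLS", "PKDLS"]) returns "P*DLS", the same shape as the motif shown in Table 1.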
4.2 Iterative Refinement
Next, we perform an iterative refinement on the binding motif pairs discovered in the last subsection. The purpose is to optimize these binding motif pairs. Given a binding motif pair Q, our refinement algorithm uses Q to sub-group the interacting protein sequence dataset and generates a new binding motif pair Q' (using exact match instead of local alignment here), as discussed in the last subsection but replacing the maximal contact segment pair P with Q. Iteratively, the algorithm repeats the procedure, using Q' as Q, until Q' reaches an optimized state. The stopping criterion used here is based on a concept of emerging significance of consensus motifs. Recall that we have established two protein sequence pair datasets: the interaction dataset (also called the positive dataset) and the negative dataset. So far, we have used only the positive dataset in generating the consensus motifs. To measure the emerging significance of a pair of consensus motifs, we make use of both the positive and negative datasets. If a motif pair is significant, it is reasonable to expect the pair to occur in the positive dataset much more frequently than in the negative dataset. We give the definitions for emerging significance below:
Frequency of a motif pair with respect to a dataset: Suppose we have a dataset D consisting of sequence pairs D = {(Si^1, Si^2), 1 ≤ i ≤ n}; the frequency of a motif pair P = (P1, P2) with respect to D is defined as: Freq(P, D) = |CD(P)| / |D|, where CD(P) is the cluster of P with respect to D as defined above.
Significant motif pairs: Suppose we have a positive dataset DPos and a negative dataset DNeg. A motif pair P is significant if: ratio(P, DPos, DNeg) = Freq(P, DPos) / Freq(P, DNeg) ≥ r, where r is a threshold. We also call ratio(P, DPos, DNeg) the emerging significance of P.
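Under these definitions, the emerging-significance computation reduces to two frequency counts. In the sketch below, the pluggable `contains` predicate and the small floor that guards against a zero negative-set frequency are our own simplifications, not details from the paper.

```python
def freq(motif_pair, dataset, contains):
    """Fraction of sequence pairs in `dataset` matched by the motif
    pair in either orientation, mirroring the cluster definition.
    `contains(seq, motif)` is a pluggable matcher (exact matching
    during refinement, local alignment during seeding)."""
    p1, p2 = motif_pair
    hits = sum(1 for s1, s2 in dataset
               if (contains(s1, p1) and contains(s2, p2))
               or (contains(s1, p2) and contains(s2, p1)))
    return hits / len(dataset)

def emerging_significance(motif_pair, d_pos, d_neg, contains, floor=1e-9):
    """ratio(P, DPos, DNeg); `floor` avoids division by zero when the
    pair never occurs in the negative dataset."""
    return freq(motif_pair, d_pos, contains) / max(freq(motif_pair, d_neg, contains), floor)
```

With a plain substring matcher, contains = lambda s, m: m in s, a pair such as ("PVDLS", "GVFS") that occurs in every positive pair but in no negative pair gets a very large ratio and passes any reasonable threshold r.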
4.3 Time Complexity of the Method

The time complexity for sub-grouping based on the segment pairs is O((|DPos| + |DNeg|) * |CP|) because of the use of local alignment, where CP represents the set of maximal contact segment pairs. The number of binding motif pairs is O(|CP|) in the case of using our column-by-column consensus algorithm. The time used to compute the clusters for motif pairs in each pass is linear if the suffix tree approach [14] is applied to conduct the exact match for regular patterns. The complexity of computing a consensus motif pair from a cluster is also linear. Suppose there are at most K passes before the algorithm terminates and the number of motif pairs is NCP; then the time complexity for the refinement of the motif pairs is O(((|DPos| + |DNeg|) * NCP + |CP|) * K). In total, the time complexity for this step is O((|DPos| + |DNeg|) * (|CP| + NCP * K) + |CP| * K).
5 Implementation and Results
In the initial step of computing contact sites from the protein complex data, we set the threshold ε to 5 Å. More than 56% of the complexes were found to contain at least one contact site. We also set 4 as the threshold on segment length. We found 1403 maximal segment pairs from this complex dataset. For sub-grouping the interaction dataset using the maximal segment pairs, a threshold must be set in the Contain operation. Instead of setting λ to a constant, it is more reasonable to set the threshold strictly for short segments but loosely for long segments. The actual parameters used in our experiment are provided in our supplementary information [10]. Our refinement procedure was performed for 7 iterative passes, after which all the motif pairs became stable. We found a total of 896 motif pairs to be significant when the emerging significance threshold r was set to 2. The detailed distribution of emerging significance values can again be found in our supplementary information [10]. All our source codes of the algorithms were run on a Pentium 4 PC with a 2.4 GHz CPU and 256 MB RAM. Most of the time (around 12 hours) was spent sub-grouping the interaction sequence data using the maximal contact segment pairs. The mining of all the maximal segment pairs was very fast, taking only 50 seconds. The refinement algorithm was also fast, taking about 1 hour. Note that this time cost is acceptable considering the enormity of the problem space. Although the objective is to discover novel motif pairs, to evaluate the biological significance of the motif pairs found by our algorithms, it is important to verify that some of the discovered motifs agree well with experimentally validated patterns in the literature. However, most publications on the experimental discovery of binding motifs only report a single motif on one side rather than a pair of binding motifs. As such, we can only confirm the coincidence of individual motifs in our motif pairs with the reported binding motifs found by traditional experimental methods. For example, for the mutagenesis method, we used the key words 'binding motif OR site AND mutagenesis' to search all biomedical abstracts in PubMed at NCBI. 202 motifs were found, of which 91 are compatible with at least one of our motifs and 58 are highly similar to ours. We show the first 5 matches in Table 1. A similar comparison with the phage display method is provided in our supplementary information [10].

Table 1: Motif coincidence with the mutagenesis method.
Reported motifs: ALETS, PVDLS, LLDLL, PIDLSLKP
Our motif: P*DLS
PubMed IDs: 11435317, 11373277, 11451993, 10748065, 11062046
Table 2 illustrates how we can compare motif pairs using the individual binding motifs reported in the literature. As an example, we use the binding consensus sequences in the list compiled by Kay et al. [15] for various proteins by phage display. First, we identify the individual motifs in our population of discovered motif pairs that match closely with a binding consensus sequence in the compiled list. Then, for each such matched motif, we verify whether the motif on the other side of the corresponding motif pair is found in proteins known to bind to the particular consensus sequence. In Table 2, we list six example binding consensus sequences from the list compiled by Kay et al. [15] in the first column. In the second column, we list the individual matched motifs from our population of discovered motif pairs; we arbitrarily assign these motifs as the "left motifs". In the third column, we show the motifs on the other sides (the "right motifs") of the matched motif pairs. Since these right motifs are also found in the proteins (shown in the fourth column) reported to bind to the corresponding consensus sequence, the motif pairs
can be considered to be biologically verified. More examples are detailed on our website [10].

Table 2: Motif pair coincidence between our motif pairs and peptide-protein binding pairs.

Consensus Sequence | Left Motif               | Right Motif   | Binding Protein
P*LP*[KR]          | P[EK]*P                  | GV[FI]S       | CRK A
P*LP*[KR]          | P[ILV][FIL]PG            | P[ILV][FIL]PG | CRK A
P*LP*[KR]          | P[ILV][FL]PG             | P[ILV][FIL]PG | CRK A
                   | [RKH]PP[AILVP]P[AILVP]KP | AAS[FI]       | Cortactin
                   | P[IV][EP][IV]A           | GV[FI]S       | Synaptojanin I
                   | RLP*LP                   | PL[DP]PL      | Shank
6 Conclusion and Further Work
The mining of binding motif pairs from protein interaction data is important for extracting knowledge that can lead to the discovery of new drugs. Most of the work reported in the literature deals only with finding individual binding motifs rather than pairs of interacting motifs. Since motif pairs, unlike single binding motifs, can provide better information for understanding the interactions between proteins, we studied the problem of finding binding motif pairs from large protein interaction datasets. Our approach combines the mining of large protein interaction sequence datasets with the use of smaller protein complex structural datasets to direct the search. For mining protein complex structural data, we have formulated the detection of maximal contact segment pairs as a novel computational search and optimization problem, and we have provided an efficient algorithm for it. The maximal contact segment pairs derived can then be deployed as seeds for sub-grouping the vast dataset of interacting protein sequence pairs, so that motif discovery algorithms can be directed to find the motif pairs within sub-groups. By iteratively applying this technique, we refine these motif pairs until they reach a satisfactory level of emerging significance. The results have shown that our combination approach is efficient and effective in finding biologically significant binding motif pairs. Many of the motif pairs that we have discovered coincided well with known motif pairs independently discovered by experimental methods. However, this directed approach depends heavily on the protein complex data source. As the current complex dataset is very limited, our approach may miss many other important motif pairs. On the other hand, it is worthwhile to improve our approach for discovering more significant binding motif pairs.
For example, in our current definition of contact segment pairs, each residue in one segment is strictly required to have at least one contact residue in the other segment. Biologically, contact segment pairs are still valid even if a few residues in the segments are not in contact.
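The strict contact requirement just described can be stated as a small predicate. This is an illustrative sketch only; the function name, the list-of-residue-indices encoding of a segment, and the set-of-index-pairs contact map are assumptions for exposition, not the paper's data structures:

```python
def is_contact_segment_pair(seg_a, seg_b, contacts):
    """Strict contact condition from the text: every residue in each segment
    must be in contact with at least one residue of the other segment.
    seg_a, seg_b: lists of residue indices; contacts: set of (i, j) pairs."""
    def touches(i, other):
        return any((i, j) in contacts or (j, i) in contacts for j in other)
    return all(touches(i, seg_b) for i in seg_a) and \
           all(touches(j, seg_a) for j in seg_b)
```

Relaxing the condition to tolerate a few uncontacted residues amounts to replacing the `all(...)` checks with a count threshold, which is exactly the change that would invalidate the top-down recursion discussed below.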
Computationally, however, our top-down recursive algorithm for finding maximal contact segment pairs will no longer be valid without this constraint. Therefore, one future research direction will be to explore relaxing this constraint while retaining the efficiency of the algorithm.
PHYLOGENETIC MOTIF DETECTION BY EXPECTATION-MAXIMIZATION ON EVOLUTIONARY MIXTURES

A.M. MOSES
Graduate Group in Biophysics and Center for Integrative Genomics, University of California, Berkeley
Email: amoses@ocf.berkeley.edu

D.Y. CHIANG
Department of Molecular and Cell Biology, University of California, Berkeley
Email: dchiang@ocf.berkeley.edu

M.B. EISEN
Department of Genome Sciences, Lawrence Berkeley Lab and Department of Molecular and Cell Biology, University of California, Berkeley
Email: [email protected]
1 Abstract
The preferential conservation of transcription factor binding sites implies that non-coding sequence data from related species will prove a powerful asset to motif discovery. We present a unified probabilistic framework for motif discovery that incorporates evolutionary information. We treat aligned DNA sequence as a mixture of evolutionary models, for motif and background, and, following the example of the MEME program, provide an algorithm to estimate the parameters by Expectation-Maximization. We examine a variety of evolutionary models and show that our approach can take advantage of phylogenetic information to avoid false positives and discover motifs upstream of groups of characterized target genes. We compare our method to traditional motif finding on conserved regions only. An implementation will be made available at http://rana.lbl.gov.
2 Introduction
A wide range of biological processes involve the activity of sequence-specific DNA binding proteins, and an understanding of these processes requires the accurate elucidation of these proteins' binding specificities. The functional binding sites for a given protein are rarely identical, with most proteins binding to families of related sequences collectively referred to as their 'motif' [1]. Although experimental methods exist to identify sequences bound by a specific protein, they have not been widely applied, and computational approaches [2,3,4] to 'motif discovery' have proven to be a useful alternative. For example, the program MEME [5] models a collection of sequences as a mixture of multinomial models for motif and background and uses an Expectation-Maximization (EM) algorithm to estimate the parameters.
Because functional binding sites are evolutionarily constrained, their preferential conservation relative to background sequence has proven a useful approach for their identification [6]. With the availability of complete genomes for closely related species, e.g., [7], it is possible to incorporate an understanding of binding site evolution into motif discovery as well. At present, few motif discovery methods simultaneously take advantage of both the statistical enrichment of motifs and the preferential conservation of the sequences that match them. One recent study [7] enumerated spaced hexamers that were both preferentially conserved (in multiple sequence alignments) and statistically enriched. Another method, FootPrinter [8], identifies sequences (with mismatches) with few changes over an evolutionary tree. Neither of these methods, however, makes use of an explicit probabilistic model. Here we present a unified probabilistic framework that combines the mixture models of MEME with probabilistic models of evolution, and can thus be viewed as an evolutionary extension of MEME. These evolutionary models (used in the maximum likelihood estimation of phylogeny [9]) consider observed sequences to have been generated by a continuous time Markov substitution process from unobserved ancestral sequences, and can accurately model the complicated statistical relationship between sequences that have diverged along a tree from a common ancestor. Our approach considers observed sequences to have been generated from ancestral sequences that are two component mixtures of motif and background, each with their own evolutionary model. The value of varying evolutionary models has been realized in other contexts as well, e.g., [10], and such models have been successfully trained using EM [11]. A mixture of evolutionary models has been used previously to identify slowly evolving non-coding sequences [12], and this work can equally be regarded as an extension of that approach.
Given a set of aligned sequences, we use an EM algorithm to obtain the maximum likelihood estimates of the motif matrix and a corresponding evolutionary model.
3 Methods
3.1 Probabilistic model
We first describe the probabilistic framework used to model aligned non-coding sequences. We employ a mixture model, which can be written generically as
p(\mathrm{data}) = \sum_{\mathrm{models}} p(\mathrm{model})\, p(\mathrm{data} \mid \mathrm{model})

where p(x) is the probability density function for the random variable x. The sum over models indicates that the data is distributed as some mixture of component models, where the prior, p(model), is the mixing proportion. For simplicity, we first address the case of pair-wise sequence alignments.
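The generic mixture form above can be sketched in a few lines. The `components` encoding (a list of prior/density pairs) is an illustrative assumption for exposition, not the paper's implementation:

```python
def mixture_likelihood(x, components):
    """p(x) = sum over models of p(model) * p(x | model).
    components: list of (prior, density) pairs; the densities here are
    placeholders for the motif/background models of the text."""
    return sum(prior * density(x) for prior, density in components)
```

With priors 0.3 and 0.7 and constant densities 0.5 and 0.1, the mixture likelihood is 0.3*0.5 + 0.7*0.1 = 0.22, illustrating how the prior acts as a mixing proportion.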
Given some motif size, w, we treat the entire alignment as a series of alignments of length w, each of which may be an instance of the motif or a piece of background sequence. We denote the pair of aligned sequences as X and Y, and represent the ith position in a sequence as a vector of length 4 (one entry for each of A, C, G, T), with X_{ib} = 1 if the bth base is observed, and 0 otherwise. We denote the unobserved ancestral sequence, A, similarly, except that the values of A_{ib} are not observed. For a series of alignments of total length N, the likelihood, L, is given by

L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i)

where the m_i are unobserved indicator variables indexing the component models; in our case m is either motif or background. Generically, we let p(m_i) be the prior probability for each component. We incorporate the sequence specificity of the motif by letting the prior probabilities of observing each base in the ancestral sequence, p(A_{kb} \mid m_i), be the frequency of each base at each position in the motif (the frequency matrix). We write p(A_{kb} \mid m_i) = f_{mkb}, such that if m is motif, f_{mkb} gives the probability of observing the bth base at the (k-i)th position. For the background model we use the average base frequencies for each alignment, and assume that they are independent of position. This allows us to run our algorithm on several alignments simultaneously [13]; the densities are therefore conditioned on the alignment as well, but we omit this here for notational clarity. Finally, noting that the two sequences descended independently from the ancestor, we can write p(X_k, Y_k \mid A_{kb}, m_i) = p(X_k \mid A_{kb}, m_i)\, p(Y_k \mid A_{kb}, m_i), where p(X_k \mid A_{kb}, m_i) is the probability of the residue X_k given that the ancestral sequence, A, was base b at that position: a substitution matrix for each component model. For simplicity we use the Jukes-Cantor [14] substitution matrix, which is, in our notation,

p(X_k \mid A_{kb}, m_i) = \tfrac{1}{4}\left(1 + 3 e^{-\alpha_{k-i}}\right) \text{ if } X_k = b, \qquad \tfrac{1}{4}\left(1 - e^{-\alpha_{k-i}}\right) \text{ otherwise,}
where \alpha_{k-i} is the rate parameter at position k-i. It is here that we incorporate differences in evolution between the motif and background by specifying different substitution matrices for each component. For example, if we set \alpha_{k-i} smaller for the motif than for background, the motif
evolves at a slower rate than the background: it is conserved. We test a variety of different substitution models for the motif and summarize the implications for motif discovery in the Gcn4p targets (see Results). Unfortunately, as the dependence of these models on the equilibrium frequencies becomes more complicated, deriving ML estimators for the parameters becomes more difficult, and more general optimization methods may be necessary. Once again, we can allow each alignment its own background rate [13] and express the motif rate as a proportion of background.
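Under the single-rate-parameter Jukes-Cantor form quoted above, the substitution matrix can be sketched directly; the function name and nested-list layout are illustrative assumptions:

```python
import math

def jukes_cantor(alpha):
    """4x4 Jukes-Cantor substitution matrix for rate parameter alpha:
    P[a][b] = p(observed base b | ancestral base a), matching the
    single-parameter form quoted in the text."""
    same = 0.25 * (1.0 + 3.0 * math.exp(-alpha))
    diff = 0.25 * (1.0 - math.exp(-alpha))
    return [[same if a == b else diff for b in range(4)] for a in range(4)]
```

At alpha = 0 the matrix is the identity (no change), and as alpha grows every entry approaches 1/4 (complete saturation), which is why a smaller motif rate models conservation.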
3.2 An EM algorithm to train parameters
Following the example of the MEME program [5], which uses an EM algorithm (an iterative optimization scheme guaranteed to find local maxima in the likelihood) to fit mixtures to unrelated sequences, we now derive an EM algorithm to train the parameters of the model described above. We write the 'expected complete log likelihood' [15]
\langle \ln L_c \rangle = \sum_{i=0}^{N-w} \left[ \sum_{m_i} \langle m_i \rangle \ln p(m_i) + \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} m_i \rangle \left( \ln p(X_k \mid A_{kb}, m_i) + \ln p(Y_k \mid A_{kb}, m_i) + \ln f_{mkb} \right) \right]

where ln denotes the natural logarithm, and maximize by setting the derivatives with respect to the parameters to zero at each iteration. Setting \partial \langle \ln L_c \rangle / \partial \alpha_m = 0 and solving gives

\hat{\alpha}_m = -\ln\!\left( \frac{3 - R_m}{3\,(1 + R_m)} \right)

where R_m is the ratio of expected changed to identical residues under each model, and is given by

R_m = \frac{ \sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left( 2 - X_{kb} - Y_{kb} \right) }{ \sum_{i=0}^{N-w} \langle m_i \rangle \sum_{k=i}^{i+w-1} \sum_{b=0}^{3} \langle A_{kb} \rangle \left( X_{kb} + Y_{kb} \right) }

for all k in the case of a constant rate across the motif. The sufficient statistics \langle A_{kb} m_i \rangle and \langle m_i \rangle are derived by applying Bayes' theorem and are computed using the values of the parameters from the previous iteration; for example,

\langle m_i \rangle = \frac{ p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m_i)\, p(A_{kb} \mid m_i) }{ \sum_{m'} p(m') \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(X_k, Y_k \mid A_{kb}, m')\, p(A_{kb} \mid m') }
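The E-step computation of the responsibilities ⟨m_i⟩ for the pairwise model can be sketched as follows. This is a minimal sketch under stated assumptions: per-position rates are collapsed to a single alpha per component, and all names and the window encoding are illustrative, not EMnEM's internals:

```python
import math

def window_likelihood(x, y, freqs, alpha):
    """p(window | component) = prod_k sum_b p(x_k|b) p(y_k|b) p(b), with a
    single Jukes-Cantor rate alpha for the component.  x and y are lists of
    base indices (0..3); freqs[k][b] is the ancestral base distribution."""
    same = 0.25 * (1.0 + 3.0 * math.exp(-alpha))
    diff = 0.25 * (1.0 - math.exp(-alpha))
    like = 1.0
    for k, (xk, yk) in enumerate(zip(x, y)):
        like *= sum((same if xk == b else diff) * (same if yk == b else diff)
                    * freqs[k][b] for b in range(4))
    return like

def posterior_motif(x, y, motif_f, bg_f, a_motif, a_bg, prior_motif):
    """E-step responsibility <m_i>: Bayes' theorem over the two components."""
    lm = prior_motif * window_likelihood(x, y, motif_f, a_motif)
    lb = (1.0 - prior_motif) * window_likelihood(x, y, bg_f, a_bg)
    return lm / (lm + lb)
```

A conserved window matching the motif consensus receives a high responsibility even under a small prior, while a mismatching window falls back toward the background component.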
In order to extend these results beyond pair-wise alignments, we can simply replace the two sequences X and Y with the probability of the entire tree below, conditioned on having observed base b in the ancestral sequence. The likelihood becomes

L = \prod_{i=0}^{N-w} \sum_{m_i} p(m_i) \prod_{k=i}^{i+w-1} \sum_{b=0}^{3} p(\mathrm{tree} \mid A_{kb})\, p(A_{kb} \mid m_i)

where the p(tree | A_kb) are computed using the 'pruning' algorithm [9]. Of course, a tree topology is needed in these cases; we used the accepted topology for the sensu stricto Saccharomyces [7] and computed for each alignment the maximum likelihood branch lengths using the PAML package [16].
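Felsenstein's pruning recursion, which supplies p(tree | A_kb) above, can be sketched as follows. The nested-tuple tree encoding and the Jukes-Cantor branch model are illustrative assumptions, not EMnEM's data structures:

```python
import math

def jc_prob(alpha, anc, obs):
    # Jukes-Cantor probability of observing base obs given ancestor anc
    e = math.exp(-alpha)
    return 0.25 * (1 + 3 * e) if anc == obs else 0.25 * (1 - e)

def prune(node):
    """Felsenstein pruning: return L[b] = p(observed leaves below node | ancestral base b).
    A node is an int (observed leaf base, 0..3) or a pair of (child, branch_rate) tuples."""
    if isinstance(node, int):
        return [1.0 if b == node else 0.0 for b in range(4)]
    (left, a_l), (right, a_r) = node
    Ll, Lr = prune(left), prune(right)
    def down(L, a):
        return [sum(jc_prob(a, b, c) * L[c] for c in range(4)) for b in range(4)]
    Dl, Dr = down(Ll, a_l), down(Lr, a_r)
    return [Dl[b] * Dr[b] for b in range(4)]
```

Summing L[b] against the ancestral base distribution p(A_kb | m_i) then yields the per-position term in the likelihood above. The recursion is linear in the number of sequences, which is the cost noted in the time-complexity discussion below.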
3.3 Implementation
We implemented a C++ program (EMnEM: Expectation-Maximization on Evolutionary Mixtures) to execute the algorithm described above, with the following extensions. Because instances of a motif may occur on either strand of DNA sequence, we also treat the strand of each occurrence as a hidden variable, and sum over the two possible orientations. In addition, because the mixture model treats each position in the alignment independently, we down-weight overlapping matches by limiting the total expected number of matches in any window of 2w to be less than one. Finally, because EM is guaranteed only to converge to a local optimum in the likelihood, we need to initialize the model in the region of the likelihood space where we believe the global optimum lies. Similar to the strategy used in the MEME program [5], we initialize the motif matrix with the reconstructed ancestral sequence of length w at each position in the alignments, and perform the full EM starting with the sequence at the position that had the greatest likelihood. EMnEM will be made available at http://rana.lbl.gov.
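The strand summation described above can be sketched as follows; `window_like` stands in for any per-window likelihood (such as the mixture components above), and the 50/50 strand prior is an illustrative assumption:

```python
def strand_sum(window_like, x, y, prior_fwd=0.5):
    """Treat the strand of a motif occurrence as a hidden variable: sum the
    window likelihood over both orientations of the aligned pair (x, y).
    window_like is any function of two base-index lists (0=A,1=C,2=G,3=T)."""
    comp = {0: 3, 1: 2, 2: 1, 3: 0}          # A<->T, C<->G complement
    rc = lambda s: [comp[b] for b in reversed(s)]
    return prior_fwd * window_like(x, y) + (1 - prior_fwd) * window_like(rc(x), rc(y))
```

For a palindromic window the two orientation terms coincide, so the summation leaves the likelihood unchanged, as expected.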
3.4 Time complexity
The time complexity of the EM algorithm is linear in the total length of the data, and the initialization heuristic we have implemented is quadratic in the length. Interestingly, because our algorithm runs on aligned sequences, relative to MEME, which treats sequences independently, the total length is reduced by a factor of 1/S, where S is the number of sequences in the alignment. Usually, we lose this factor in each iteration when calculating p(tree | A_kb) using the 'pruning' algorithm [9], as it is linear in S. We note, however, that for evolutionary models (e.g., Jukes-Cantor) where p(tree | A_kb) is independent of p(A_kb | m_i), we may learn the matrix without re-estimating the sufficient statistics ⟨A_kb⟩ (the reconstructed ancestral sequence) at each iteration. In these cases the complexity of EMnEM will indeed be linear in the length of the aligned sequence, a considerable speedup, especially in the quadratic initialization step.
4 Results and Discussion
4.1 A test case from the budding yeasts
In order to compare our algorithm under various evolutionary models as well as to other motif discovery strategies, we chose to compare all methods on a single test case: the upstream regions from 5 sensu stricto Saccharomyces species (S. bayanus, S. cerevisiae, S. kudriavzevii, S. mikatae, and S. paradoxus) of 9 known Gcn4p targets that are listed in SCPD [17]. In order to control for variability in alignment quality at different evolutionary distances, we made multiple alignments of all available upstream regions using T-Coffee [18] and then extracted the appropriate sequences for any subset of the species. The Gcn4p targets from SCPD are a good set on which to test our method because there are a relatively high number of characterized sites in these promoters. In addition, the upstream regions of these genes contain stretches of poly T, which are not known to be binding sites. As a result, MEME ('tcm' model, w = 10) assigns a lower (better) e-value to a 'polyT' motif (e=2.7e-03) than to the known Gcn4p motif (e=1.6e06) when run on the S. cerevisiae upstream regions. Because this is typical of the types of false positives that motif finding algorithms produce, we use as an indicator of the success of our method the log ratio of the likelihood of the evolutionary mixture model using the real Gcn4p matrix to that using the polyT matrix. If this indicator is greater than zero, i.e., the real motif has a greater likelihood than the false positive, it should be returned as the top motif.
4.2 Incorporating a model of motif evolution can eliminate false positives
In order to explore the effects of incorporating models of motif evolution into motif detection, we tested several evolutionary models. In particular we were interested in the effect of incorporating evolutionary rate, as real motifs evolve more slowly than surrounding sequences. Using alignments of S. cerevisiae and S. mikatae, we calculated the log ratio of the likelihood using the real Gcn4p matrix to the likelihood using the polyT matrix with Jukes-Cantor substitution under several assumptions about the rate of evolution in the motif (Figure 1). Interestingly, slower evolution in the motif, either 1/4 or 0.03 (the ML estimate) times the background rate, is enough to assign a higher likelihood to the Gcn4p motif and thus eliminate the false positive. We tried two additional evolutionary models, in which the rate of substitution at each position depends on the frequency matrix. In the Felsenstein '81 model (F81) the different types of changes occur at different rates, but the overall rate at each position is constant, while the Halpern-Bruno model (HB) assumes there is purifying selection at each position and can account for positional variation in overall rate [19,20]. In each case, these more realistic models further favored the Gcn4p matrix over the polyT.
Figure 1. Effect of models for motif evolution on motif detection. Plotted is the log ratio of the likelihood using the Gcn4p matrix to the likelihood using the polyT matrix under various evolutionary models in alignments of S. cerevisiae to S. mikatae. Models that allow the motif to evolve more slowly than background, JC (0.25), JC (ML) and JC (HB), and models in which the rates of evolution take into account the deviation from equilibrium base frequencies, F81 and JC (HB), assign higher likelihood to the Gcn4p matrix. Also plotted is the negative log ratio of the e-values from MEME ('tcm' model, w = 10). JC are Jukes-Cantor models with rate parameter equal to background (bg), 1/4 of background (0.25), or set to the maximum-likelihood estimate below background (ML).
4.3 Success of motif discovery is dependent on evolutionary distance
In order to test the generality of the results achieved for the S. cerevisiae-S. mikatae alignments, we calculated the log ratio of the likelihood of the evolutionary mixture using the real Gcn4p matrix to the polyT matrix over a range of evolutionary distances and rates of evolution (Figure 2, filled symbols). At closer distances, more of the data is redundant, while over longer
comparisons, conserved sequences should stand out more against the background. Indeed, at the distance of S. cerevisiae to S. paradoxus (~0.13 substitutions per site), the likelihood of polyT is greater, while at the distance of S. cerevisiae, S. mikatae, and S. paradoxus (~0.31 subs. per site) the Gcn4p matrix is favored. Interestingly, this is true regardless of the rate of evolution assumed for the motif. While at all evolutionary distances slow evolution favors the Gcn4p matrix more than when the motif evolves at the background rate, the effect of including slower evolution is smaller than the effect of varying the evolutionary distance. Only at the borderline distance of S. cerevisiae to S. mikatae (~0.25 subs. per site) do the models perform differently. We also ran MEME (with the 'tcm' model, w set at 10) on all the sequences (from all genes and all species) and calculated the negative log ratio of the MEME e-values for the two motifs (Figure 2, heavy trace). MEME treats all the sequences independently, and continues to assign the polyT matrix a lower e-value over all the evolutionary distances. At least for this case, it seems more important to accurately model the phylogenetic relationships between the sequences (i.e., using a tree) than to accurately model the evolution within the motif.
Figure 2. Effect of evolutionary distance on motif detection. Log ratio of the likelihood using the Gcn4p matrix to the likelihood using the polyT matrix for alignments that span increasing evolutionary distance. At distances greater than S. cerevisiae to S. mikatae the evolutionary mixture assigns the Gcn4p matrix a greater likelihood whether the rate of evolution in the motif is equal to, 1/2, 1/4 or 1/8 of the background rate (diamonds, squares, triangles and circles, respectively). Also plotted are negative log ratios of the MEME e-values for the Gcn4p to polyT, using the entire sequences, or prefiltering alignments for 20 base pair windows of at least 70% or 50% identity to a reference genome (heavy, lighter and lightest traces, respectively).
4.4 The unified framework is preferable to using evolutionary information separately
In order to compare our method, which incorporates evolutionary information directly into motif discovery, to approaches that use such information separately, we scanned the alignments at each evolutionary distance and removed regions that were less than 50% or 70% identical to a reference genome in a 20 base pair window. This allows MEME, which does not take phylogenetic information into account, to focus on the conserved regions. We ran MEME and computed the negative log ratio of the e-values for the Gcn4p matrix and the polyT matrix. While in both cases there were distances where the real motif was favored (Figure 2, lighter traces), the effect of the filtering was not consistent. At
Binding factor | Target genes | EMnEM rank | MEME rank | Motif
Rox1p (+) | HEM13, RTT101 | | 2 | TCTATTGTTC
| ERG2, ERG3, ERG9, UPC2 | | | TCTAAACGAA
Rfx1p (++) | RNR2, RNR3, RNR4, RFX1 | | |
Gcr1p (+) | CDC19, PGK1, TPI1, ENO1, ENO2 | 1 | |
Aro80p (++) | ARO80, ARO9, ARO10 | 1 | |
Yap1p (++) | TRR1, TRX2, GSH1 | | | -
Zap1p (++) | FET4 | 3 | | GTTGCCAGAC

Table 1. Motif discovery using EMnEM and MEME. The EMnEM program was run using the Jukes-Cantor model for motif evolution with the rate set to 1/4 background (JC 0.25) on S. cerevisiae-S. mikatae alignments in each case. For cases where EMnEM ranked the motif higher, the consensus sequence and a plot of the information content is shown. MEME was run on the unaligned sequences from both species simultaneously. Target genes are from SCPD [17] (+) or YPD [21] (++). - indicates that a plausible motif was not found.
distances too close, not enough is filtered out, and the polyT is still preferred, while at distances too far, real instances of the motif no longer pass the cutoff and the real motif is no longer recovered (Figure 2, lighter traces). Thus, while incorporating evolutionary information separately can help recover the real motif, it depends critically on the choice of percent identity cutoff.
4.5 Examples of other discovered motifs
We ran both our program and MEME on the upstream regions of target genes of some transcription factors with few characterized targets and/or poorly defined motifs. In several cases, for a given motif size, our algorithm ranked a plausible motif first, while MEME ranked a polyT motif first (see Table 1).
5 Conclusions and future directions
We have provided an evolutionary mixture model for transcription factor binding sites in aligned sequences, and a motif finding algorithm based on this framework. We believe that our approach has many advantages over current methods: it produces probabilistic models of motifs, can be applied directly to multiple or pair-wise alignments, and can be applied simultaneously at multiple loci. Our method should be applicable to any group of species whose intergenic regions can be aligned, though because alignments may not be possible at large evolutionary distances, our reliance on them is a disadvantage of our method relative to FootPrinter [8]. It is not difficult to conceive of extending this framework to unaligned sequences by treating the alignment as a hidden variable as well; unfortunately, the space of multiple alignments is large, and improved optimization methods would certainly be needed. In addition to motif discovery, our probabilistic framework is also applicable to binding site identification. Current methods that search genome sequence for matches to motifs are also plagued by false positives, but optimally combining sequence specificity and evolutionary constraint may lead to considerable improvement.
Acknowledgements
We thank Dr. Audrey Gasch, Emily Hare and Dan Pollard for comments on the manuscript. MBE is a Pew Scholar in the Biomedical Sciences. This work was conducted under US Department of Energy contract No. DE-AC03-76SF00098.
References
1. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000 Jan;16(1):16-23.
2. Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A. 1989 Feb;86(4):1183-7.
3. Lawrence CE, Reilly AA. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins. 1990;7(1):41-51.
4. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993 Oct 8;262(5131):208-14.
5. Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.
6. Hardison R. Conserved noncoding sequences are reliable guides to regulatory elements. Trends in Genetics. 2000 Sep;16(9):369-372.
7. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003 May 15;423(6937):241-54.
8. Blanchette M, Schwikowski B, Tompa M. Algorithms for phylogenetic footprinting. J Comput Biol. 2002;9(2):211-23.
9. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17(6):368-76.
10. Ng PC, Henikoff JG, Henikoff S. PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics. 2000 Sep;16(9):760-6. Erratum in: Bioinformatics. 2001 Mar;17(3):290.
11. Holmes I, Rubin GM. An expectation maximization algorithm for training hidden substitution models. J Mol Biol. 2002 Apr 12;317(5):753-64.
12. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003 Feb 28;299(5611):1391-4.
13. Yang Z. Maximum likelihood models for combined analyses of multiple sequence data. J Mol Evol. 1996;42:587-596.
14. Yang Z, Goldman N, Friday AE. Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol Biol Evol. 1994;11:316-324.
15. Jordan MI. An Introduction to Probabilistic Graphical Models, in preparation.
16. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13(5):555-556.
17. Zhu J, Zhang MQ. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics. 1999 Jul-Aug;15(7-8):607-611.
18. Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17.
19. Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998 Jul;15(7):910-917.
20. Moses AM, Chiang DY, Kellis M, Lander ES, Eisen MB. Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evol Biol. 2003;3:18.
21. Hodges PE, Payne WE, Garrels JI. The Yeast Protein Database (YPD): a curated proteome database for Saccharomyces cerevisiae. Nucleic Acids Res. 1998 Jan 1;26(1):68-72.
USING PROTEIN-PROTEIN INTERACTIONS FOR REFINING GENE NETWORKS ESTIMATED FROM MICROARRAY DATA BY BAYESIAN NETWORKS

N. NARIAI, S. KIM, S. IMOTO, S. MIYANO
Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan

We propose a statistical method to estimate gene networks from DNA microarray data and protein-protein interactions. Because physical interactions between proteins or multiprotein complexes are likely to regulate biological processes, using only mRNA expression data is not sufficient for estimating a gene network accurately. Our method adds knowledge about protein-protein interactions to the estimation method of gene networks under a Bayesian statistical framework. In the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. We show the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae cell cycle data. The proposed method improves the accuracy of the estimated gene networks, and successfully identifies some biological facts.
1 Introduction
The complete DNA sequences of many organisms, such as yeast, mouse, and human, have recently become available. Genome sequences specify the gene expressions that produce proteins of living cells, but how the biological system as a whole really works is still unknown. Currently, a large number of gene expression data and protein-protein (p-p) interaction data have been collected from high-throughput analyses, and estimating gene networks from these data has become an important topic in systems biology. Several methods have been proposed for estimating gene networks from microarray data by using Boolean networks, differential equation models, and Bayesian networks. However, using only microarray data is not sufficient for estimating gene networks accurately, because the information contained in microarray data is limited by the number of arrays, their quality, noise and experimental errors. Therefore, the use of other biological knowledge together with microarray data is a key for extracting more reliable information. Hartemink et al. noticed this idea previously and proposed a method to use localization data combined with microarray data for estimating a gene network. There are other works combining microarray data with biological knowledge, such as DNA sequences of promoter elements and transcriptional bindings of regulators.
In this paper, we propose a statistical method for estimating gene networks from microarray data and p-p interactions by using a Bayesian network model. We extract 9,030 physical interactions from the MIPS database to add knowledge about p-p interactions to the estimation method of gene networks. If multiple genes form a protein complex, then it is natural to treat them as one variable in the estimated gene network. In the estimated gene network, a protein complex is modeled as a virtual node based on principal component analysis. That is, the protein complexes are dynamically found and modeled by the proposed method while we estimate a gene network. Previously, Segal et al. proposed a method for identifying pathways from microarray data and p-p interaction data. A different point of our method is that we model protein complexes directly in the Bayesian network model, aimed at refining the estimated gene network. Also, our method can decide whether to make a protein complex based on our criterion. We evaluate our method through the analysis of Saccharomyces cerevisiae cell cycle gene expression data. First, we estimated three gene networks: by microarray data alone, by p-p interactions alone, and by our method. Then, we compared them with the gene network compiled by KEGG for evaluation. We successfully show that the accuracy of the estimated gene network is improved by our approach. Second, among 350 cell cycle related genes, we found 34 gene pairs as protein complexes. In reality, most of them are likely to form protein complexes considering biological databases and existing literature. Third, we show an example of using additional information, "phase", together with the microarray data and p-p interactions for estimating a more meaningful gene network.

2 Bayesian Network Model with Protein Complex
Bayesian networks (BNs) are a type of graphical model that represents relationships between variables. That is, for each variable there is a probability distribution function whose definition depends on the edges leading into the variable. A BN is a directed acyclic graph (DAG) encoding the Markov assumption that each variable is independent of its non-descendants, given just its parents. In the context of BNs, a gene is regarded as a random variable and shown as a node in the graph, and the relationship between a gene and its parents is represented by the conditional probability. Thus, the joint probability of all genes can be decomposed as the product of the conditional probabilities. Suppose that we have n sets of microarray data \{x_1, \dots, x_n\} of p genes. A BN model is then written as

f(x_{i1}, \dots, x_{ip} \mid \theta_G) = \prod_{j=1}^{p} f_j(x_{ij} \mid p_{ij}, \theta_j),

where p_{ij} is the parent observation vector of the jth gene (gene_j) measured by the ith array. For example, if gene_2 and gene_3 are parents of gene_1, we set p_{i1} = (x_{i2}, x_{i3})^T. If we ignore the information of p-p interactions, the relationship between x_{ij} and p_{ij} can be modeled by using a nonparametric additive regression
    x_ij = Σ_k m_jk(p_ij^(k)) + ε_ij,

where p_ij^(k) is the kth element of p_ij, m_jk is a regression function, and ε_ij is a random variable with a normal distribution with mean 0 and variance σ_j^2. When a gene is regulated by a protein complex, it is natural to consider the protein complex as a direct parent. Therefore, we consider the use of virtual nodes corresponding to protein complexes in the BN model. Concretely, if gene_2 and gene_3 form a protein complex and regulate gene_1, we construct a new variable "complex23" from the expression data of gene_2 and gene_3. In the BN model, we then consider the relation "complex23 -> gene_1" instead of "gene_2 -> gene_1 <- gene_3". If genes form a protein complex, it is expected that there will be a relatively high correlation among the expression values of those genes. For constructing a new variable representing a protein complex, therefore, we use principal component analysis^17 (PCA). By using PCA, we can reduce the dimension of the data with the least loss of information. Suppose that genes gene_1 to gene_d form a protein complex and that the d-dimensional vector a_1^[1-d] is the first eigenvector of the matrix

    (1/n) Σ_{i=1}^{n} (x_i^[1-d] - x̄^[1-d])(x_i^[1-d] - x̄^[1-d])^T,

with x_i^[1-d] = (x_i1, ..., x_id)^T and x̄^[1-d] = Σ_i x_i^[1-d] / n. Here x^T is the transpose of x. The ith observation of the protein complex is then obtained by

    c_i = a_1^[1-d]T (x_i^[1-d] - x̄^[1-d]).

In such a case, we use the regression function m_j[1-d](c_i) instead of the additive regression m_j1(x_i1) + ... + m_jd(x_id). Figure 1 shows an example of modeling a protein complex. SPC97 and SPC98 form a protein complex. The solid line is the first principal component and the observations of the protein complex are obtained by projecting the expression data onto this line.
This model can be viewed as an extension of principal component regression^2, in which we choose whether to form protein complexes based on a criterion that evaluates the goodness of the BN model as a gene network.
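The virtual-node construction described above can be sketched as follows. This is a minimal sketch using numpy; the function name `complex_profile` is hypothetical and not from the paper:

```python
import numpy as np

def complex_profile(X):
    """Collapse the expression profiles of genes in a putative complex
    into one virtual-node profile via the first principal component.

    X : (n_arrays, d_genes) expression matrix for the d member genes.
    Returns (c, rate): the n-vector of virtual-node observations and
    the contribution rate of the first principal component.
    """
    Xc = X - X.mean(axis=0)              # centre each gene's profile
    cov = Xc.T @ Xc / X.shape[0]         # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    a1 = vecs[:, -1]                     # first eigenvector a_1
    c = Xc @ a1                          # projections c_i = a_1^T (x_i - xbar)
    rate = vals[-1] / vals.sum()         # variance explained by PC1
    return c, rate
```

The returned rate corresponds to the "rate" column of Table 2: a pair whose profiles are nearly collinear keeps almost all its information in the single virtual node.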
Figure 1: An example of modeling a protein complex by using principal component analysis. The scatter plot of SPC97 and SPC98, and the first principal component are shown.
3
Criterion and Algorithm for Estimating a Gene Network
From a Bayesian statistical viewpoint, we can choose the graph structure by maximizing the posterior probability of the graph G,

    π(G | X) ∝ π(G) ∫ Π_{i=1}^{n} f(x_i1, ..., x_ip | θ_G) π(θ_G | λ) dθ_G,   (2)
where π(G) is a prior probability of the graph G, π(θ_G | λ) is the prior distribution on the parameter θ_G, and λ is the hyperparameter vector. The marginal likelihood measures the closeness between the microarray data and the graph G. We add the knowledge about p-p interactions into π(G). Following the result of Imoto et al.^15, we can model the knowledge about p-p interactions as a prior probability of the graph G by using the Gibbs distribution^10. Let U_ij be the interaction energy of the edge from gene_i to gene_j, categorized into two values H_1 and H_2 (H_1 < H_2). If there is a p-p interaction between gene_i and gene_j, we set U_ij = U_ji = H_1. The total energy of the graph G can then be defined as

    E(G) = Σ_{(i,j)∈G} U_ij,

where the sum is taken over the existing edges in the graph G. The probability π(G) is naturally modeled by the Gibbs distribution of the form π(G) = Z^{-1} exp{-ζE(G)}, where ζ (> 0) is an inverse temperature and Z is the partition function given by Z = Σ_{G'∈𝒢} exp{-ζE(G')}. Here 𝒢 is the set of possible graphs. By replacing ζH_1 and ζH_2 with ζ_1 and ζ_2, respectively, the prior probability π(G) is specified by ζ_1 and ζ_2. Hence, we have

    π(G) = Z^{-1} Π_{(i,j)∈G} exp(-ζ_{a(i,j)}),  with a(i,j) = k for U_ij = H_k.
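The prior just described can be sketched in a few lines. Since the partition function Z sums over all possible graphs, the sketch below returns the unnormalized log prior, which suffices for comparing candidate graphs because Z cancels; function and variable names are hypothetical:

```python
def log_prior_unnormalized(edges, ppi, zeta1, zeta2):
    """Unnormalized log prior, i.e. log pi(G) + log Z, for a graph G.

    edges : iterable of directed edges (i, j) present in G
    ppi   : set of frozensets {i, j} with a known p-p interaction
    zeta1, zeta2 : edge penalties with / without p-p support
                   (zeta1 < zeta2; the paper uses 0.5 and 25.0)
    """
    # each edge contributes -zeta_{a(i,j)}; supported edges cost less
    return -sum(zeta1 if frozenset(e) in ppi else zeta2 for e in edges)
```

Because a supported edge is penalized by ζ_1 rather than ζ_2, graphs containing known p-p interactions receive higher prior probability, which is exactly how the interaction data biases the network search.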
For computing the marginal likelihood represented by the integration in (2), we used the Laplace approximation for integrals^{6,19,33}; the result was shown by Imoto et al.^14. Hence, we have a Bayesian information criterion, named BNRC (Bayesian network and Nonparametric Regression Criterion), for evaluating networks; in it,

    J_λ(θ̂_G) = -∂²{l_λ(θ_G | X)} / ∂θ_G ∂θ_G^T,

and θ̂_G is the mode of l_λ(θ_G | X). We can choose the graph structure as the minimizer of BNRC. Based on the BN model with protein complexes and the information criterion described above, we naturally obtain the following greedy hill-climbing algorithm for finding and modeling protein complexes and estimating a gene network:
Step 1. For gene_i, perform whichever of the four procedures "add a parent", "remove a parent", "reverse the parent-child relationship", and "none" gives the lowest BNRC score. If a directed cycle is formed, cancel the operation.

Step 2. If "add a parent" was performed in Step 1, go to Step 3. Otherwise, go to Step 6.

Step 3. If the relation between gene_i and the added gene (denoted gene_(i)) is listed in the p-p interactions, go to Step 4. Otherwise, go to Step 6.

Step 4. Construct a protein complex from the expression values of gene_i and gene_(i) based on principal component analysis.

Step 5. If the protein complex works better than using only gene_i or gene_(i) as a parent of each child of gene_i or gene_(i), use this protein complex in the estimated network. Otherwise, ignore this protein complex.

Step 6. If the BNRC score is unchanged, the learning is finished. Otherwise, go to Step 1 and continue the greedy hill-climbing.
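The greedy search of Steps 1-6 can be sketched as follows. This is a simplified sketch: the "reverse" operation and the protein-complex test of Steps 3-5 are omitted, and a generic `score` callback stands in for the local BNRC computation; all names are hypothetical:

```python
def greedy_search(n_genes, score):
    """Greedy hill-climbing over parent sets (Steps 1, 2 and 6,
    simplified: 'reverse' and the complex-merging test are omitted).

    score(j, parents) -> BNRC-like local score to MINIMIZE for gene j.
    Returns a dict mapping each gene to its learned set of parents.
    """
    parents = {j: set() for j in range(n_genes)}

    def creates_cycle(child, parent):
        # adding parent -> child forms a cycle iff child is already
        # an ancestor of parent (walk upward along parent links)
        stack, seen = [parent], set()
        while stack:
            g = stack.pop()
            if g == child:
                return True
            if g not in seen:
                seen.add(g)
                stack.extend(parents[g])
        return False

    improved = True
    while improved:                       # Step 6: stop when no move helps
        improved = False
        for j in range(n_genes):          # Step 1: best of add/remove/none
            best = score(j, parents[j])
            best_set = set(parents[j])
            for k in range(n_genes):
                if k == j:
                    continue
                cand = set(parents[j])
                if k in cand:
                    cand.discard(k)       # "remove a parent"
                elif creates_cycle(j, k):
                    continue              # cancel cycle-forming operation
                else:
                    cand.add(k)           # "add a parent"
                s = score(j, cand)
                if s < best:
                    best, best_set, improved = s, cand, True
            parents[j] = best_set
    return parents
```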
Table 1: Comparison result of the cell cycle pathway in KEGG. "agree", "reverse", "false negative" and "false positive" edges are counted by comparing the estimated networks with the KEGG pathway. Note that edges among protein complexes are not counted in this table.

    edge type        microarray data only   p-p interactions only    our method
    agree                   4                19 (directions unknown)      16
    reverse                 2                -                             4
    false negative         20                26                           18
    false positive         55                11                           14

4
Computational Experiments
We apply our method to the Saccharomyces cerevisiae cell cycle microarray data^31 and the 9,030 p-p interactions extracted from the MIPS database^21. For the prior probability π(G) given in Section 3, we chose ζ_1 = 0.5 and ζ_2 = 25.0 experimentally; this is the point where the maximum number of protein complexes is observed in the estimated gene networks. When we used a larger ζ_1 and a smaller ζ_2, the p-p interactions did not contribute to the gene network refinement. On the other hand, when we used a smaller ζ_1 and a larger ζ_2, the resulting network reflected the p-p interactions too strongly.

4.1
Cell Cycle Pathway in KEGG
For evaluating the accuracy of the estimated gene networks, we chose 99 genes from the KEGG pathway database entry for the Saccharomyces cerevisiae cell cycle^18. In this analysis, we focus on how the accuracy of the estimated network increases when the information on p-p interactions is added. We estimated three gene networks: by using only microarray data, by using only p-p interactions, and by using the proposed method. We then compared them with the gene network compiled by KEGG for evaluation. Table 1 summarizes the result of the comparison among the three networks. Note that in this table edges among protein complexes are not counted, because these edges should not be considered "gene regulation" in the gene network. By comparing the network estimated by microarray data alone with the network estimated by our method, we can immediately see that the number of edges that agree with the KEGG pathway, denoted as "agree", increases substantially when p-p interactions are added to the microarray data. We can also observe that the proposed method drastically reduces the number of false positive edges. By comparing the network estimated by p-p interactions alone with the network
Figure 2: Cell cycle gene network estimated by our method
estimated by our method, we can find that several false negative edges of p-p interactions are newly estimated by adding microarray data, though the number of agree edges is almost the same. As for false positive edges, we could not observe apparent improvements by adding microarray data. Figure 2 shows a part of the estimated gene network based on the proposed method. We can find that the proposed method succeeded in finding APC (Anaphase Promoting Complex), MCM (Mini-Chromosome Maintenance) complex, and clb5-cdc28p complex. 4.2
Gene Network with 350 Cell Cycle Genes
For evaluating our method in the sense of modeling protein complexes, we chose 350 genes from the MIPS functional category "mitotic cell cycle and cell cycle control", and searched for protein complexes while learning gene networks. We found the 34 candidate protein complexes listed in Table 2. Among these 34 candidates, 22 pairs are also listed in the MIPS complex catalogue, and six pairs are supported by the existing literature.
Table 2: Detected protein complexes among 350 cell cycle genes. "rate" means the contribution rate of the 1st principal component of the two genes, and "eval." means the evaluation of the result. "O" shows that the MIPS protein complexes catalogue contains the pair as a protein complex. "Δ" shows that while the MIPS catalogue does not contain the pair, the existing literature supports it. "?" shows that the result has not been reported yet.

    gene A   gene B     rate   eval.   annotation
    RSC6     RSC8       0.91   O       RSC complex
    MCM5     MCM7       0.89   O       MCM complex
    SPC97    SPC98      0.80   O       gamma-tubulin complex
    CIK1     KAR3       0.70   O       kinesin-related motor proteins
    CLB5     CDC28      0.69   O       clb5-cdc28p complex
    GIM3     PAC10      0.67   O       gim complex
    SKP1     CDC53      0.66   O       SCF complex
    CDC11    CDC12      0.80   O       septin filaments
    CDC3     SHS1       0.55   O       septin filaments
    CDC10    SHS1       0.54   O       septin filaments
    APC1     APC10      0.75   O       APC complex
    APC4     CDC23      0.74   O       APC complex
    APC4     APC11      0.73   O       APC complex
    APC10    APC11      0.72   O       APC complex
    APC5     APC10      0.71   O       APC complex
    APC1     CDC23      0.66   O       APC complex
    APC2     CDC16      0.66   O       APC complex
    APC5     CDC16      0.66   O       APC complex
    APC1     CDC26      0.64   O       APC complex
    APC2     APC5       0.63   O       APC complex
    APC3     CDC16      0.63   O       APC complex
    APC11    CDC26      0.55   O       APC complex
    SMC1     SMC3       0.84   Δ       cohesin complex^11
    SCC3     SMC3       0.63   Δ       cohesin complex^11
    BIM1     TUB1       0.69   Δ       tubulin complex^35
    CLN2     CDC53      0.64   Δ       G1/S transition^34
    CKS1     CDC28      0.57   Δ       cyclin-dependent kinase^24
    HSL7     SWE1       0.55   Δ       septin assembly checkpoint^5
    RAD23    RPT6       0.82   ?       proteasome
    NUF2     NUM1       0.80   ?       nuclear migration
    NUF1     SPC97      0.79   ?       nuclear migration
    NUF2     SMC1       0.77   ?       nuclear migration
    CBF2     YGR179C    0.65   ?       centromere/kinetochore-associated
    CDC24    SWE1       0.55   ?       serine/threonine protein kinase
Although the six pairs denoted "?" in Table 2 have not been reported, our results suggest that each pair forms a protein complex. For example, RAD23 and RPT6 may form a protein complex that is involved in proteasome activity. Similarly, NUF2 and NUM1 may work together in nuclear migration. There are 309 p-p interactions among the 350 cell cycle related genes, of which only 119 are in fact protein complex related. These results suggest that our method successfully models protein complexes and finds biologically plausible ones.
4.3 Using Phase Information together with Microarrays and P-P Interactions

In this section, we show a case that uses additional information, the cell cycle "phase", together with the microarray data and p-p interactions. It is known that the cyclins "CLN1 and CLN2", "CLB5 and CLB6", and "CLB1 and CLB2" are activated in the G1/S, S, and M phases, respectively. Before estimating a gene network, we choose phase-specific genes whose expression levels are highly correlated with each cyclin listed above. We collected 33 genes whose correlation with one of the cyclins is greater than 0.75. Also, we selected 93 genes that show p-p interactions with those 33 genes and the six cyclins. That is, in this analysis, we focus on the gene network with 132 genes. Figure 3 shows the expression patterns of the genes divided into three groups by the correlations and p-p interactions. First, we estimate a gene network for each phase, i.e., the G1/S, S, and M phases. We then combine those three networks and obtain the final network shown in Figure 4. Genes on the dotted line are selected as members of both phases, i.e., YOX1 belongs to the G1/S phase and also the S phase. In this analysis, we can find biologically important genes, such as HCM1, FKH2, and ACE2. These genes are known transcription factors, and FKH2 was reported^36 as a regulator of CLB2, SWI5, and HST3. Although the KEGG pathway does not include those genes, we succeeded in finding these important relationships based on our approach.

5
Discussion
In this paper we proposed a statistical method for estimating gene networks by combining microarray gene expression data and p-p interactions. We also proposed a method for modeling protein complexes in the estimated gene network by using principal component analysis. An advantage of our method is that not only p-p interactions, but also protein complexes are naturally modeled under a Bayesian statistical framework. By adding p-p interaction data into our Bayesian network estimation method, we successfully estimated the gene
Figure 3: Gene expression profiles that belong to (Top) the G1/S phase, (Middle) the S phase, and (Bottom) the M phase.
Figure 4: Cell cycle gene network estimated by using “phase” information together with microarray data and p-p interactions.
network more accurately than by using only microarray data. We also observed that protein complexes were correctly found and modeled while learning gene networks. We consider the following topics as future work. First, our greedy algorithm currently only merges protein pairs based on PCA; modeling larger protein complexes in the gene network will be an important problem. Second, as real biological processes are often condition specific, it is important to take "conditions" or "environments" into account. Third, in the last experiment, we showed an example in which we added additional information, the cell cycle "phase", to the microarray data and p-p interaction data, and estimated a gene network based on those three types of data. We expect that estimating an accurate gene network by using further genomic data, including DNA-protein interactions, binding site information, and so on, will give us more meaningful information about biological processes. We would like to investigate these topics in future papers.
Acknowledgements

The authors would like to thank the three referees for their helpful comments and suggestions.
References

1. T. Akutsu, S. Miyano and S. Kuhara, Pac. Symp. Biocomput., 4, 17 (1999).
2. S. Chatterjee and B. Price, John Wiley and Sons (1977).
3. T. Chen, H. L. He and G. M. Church, Pac. Symp. Biocomput., 4, 29 (1999).
4. R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart and R. W. Davis, Molecular Cell, 2, 65 (1998).
5. V. J. Cid, M. J. Shulewitz, K. L. McDonald and J. Thorner, Mol. Biol. Cell, 12, 1645 (2001).
6. A. C. Davison, Biometrika, 73, 323 (1986).
7. M. J. L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara and S. Miyano, Pac. Symp. Biocomput., 8, 17 (2003).
8. N. Friedman and M. Goldszmidt, in M. I. Jordan, ed., Kluwer Academic Publishers, 421 (1998).
9. N. Friedman, M. Linial, I. Nachman and D. Pe'er, J. Comp. Biol., 7, 601 (2000).
10. S. Geman and D. Geman, IEEE TPAMI, 6, 721 (1984).
11. C. H. Haering, J. Lowe, A. Hochwagen and K. Nasmyth, Molecular Cell, 9, 773 (2002).
12. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola and R. A. Young, Pac. Symp. Biocomput., 6, 422 (2001).
13. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola and R. A. Young, Pac. Symp. Biocomput., 7, 437 (2002).
14. S. Imoto, T. Goto and S. Miyano, Pac. Symp. Biocomput., 7, 175 (2002).
15. S. Imoto, T. Higuchi, T. Goto, K. Tashiro, S. Kuhara and S. Miyano, Proc. 2nd IEEE Computer Society Bioinformatics Conference, 104 (2003).
16. S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara and S. Miyano, J. Bioinformatics and Comp. Biol., 1(2), 231 (2003).
17. I. J. Jolliffe, Springer-Verlag, New York (1986).
18. M. Kanehisa, S. Goto, S. Kawashima and A. Nakaya, Nucleic Acids Res., 30, 42 (2002).
19. S. Konishi, T. Ando and S. Imoto, Biometrika (2003), in press.
20. H. J. McBride, Y. Yu and D. J. Stillman, J. Biol. Chem., 274, 21029 (1999).
21. H. W. Mewes, D. Frishman, U. Gueldener, G. Mannhaupt, K. Mayer, M. Mokrejs, B. Morgenstern, M. Muensterkoetter, S. Rudd and B. Weil, Nucleic Acids Res., 30(1), 31 (2002).
22. D. Pe'er, A. Regev, G. Elidan and N. Friedman, Bioinformatics, 17, S1 (2001).
23. Y. Pilpel, P. Sudarsanam and G. M. Church, Nature Genetics, 29, 153 (2001).
24. G. J. Reynard, W. Reynolds, R. Verma and R. J. Deshaies, Mol. Cell. Biol., 20, 5858 (2000).
25. K. Schwartz, K. Richards and D. Botstein, Mol. Biol. Cell, 8, 2677 (1997).
26. E. Segal, Y. Barash, I. Simon, N. Friedman and D. Koller, RECOMB, 273 (2002).
27. E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller and N. Friedman, Nature Genetics, 34(2), 166 (2003).
28. E. Segal, H. Wang and D. Koller, Bioinformatics, 19, S264 (ISMB 2003).
29. E. Segal, R. Yelensky and D. Koller, Bioinformatics, 19, S273 (ISMB 2003).
30. I. Shmulevich, E. R. Dougherty, S. Kim and W. Zhang, Bioinformatics, 18, 261 (2002).
31. P. Spellman, G. Sherlock, M. Zhang, V. Iyer, K. Anders, M. Eisen, P. Brown, D. Botstein and B. Futcher, Mol. Biol. Cell, 9, 3273 (1998).
32. Y. Tamada, S. Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara and S. Miyano, Bioinformatics (ECCB 2003), in press.
33. L. Tierney and J. B. Kadane, J. Amer. Statist. Assoc., 81, 82 (1986).
34. A. R. Willems, S. Lanker, E. E. Patton, K. L. Craig, T. F. Nason, N. Mathias, R. Kobayashi, C. Wittenberg and M. Tyers, Cell, 86, 453 (1996).
35. G. Zhu and T. N. Davis, Biochim. Biophys. Acta, 1448(2), 236 (1998).
36. G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis and B. Futcher, Nature, 406, 90 (2000).
MOTIF DISCOVERY IN HETEROGENEOUS SEQUENCE DATA
A. PRAKASH
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350 U.S.A.

M. BLANCHETTE
School of Computer Science
McGill University
Montreal, Quebec, Canada H3A 2A7

S. SINHA
Center for Studies in Physics and Biology
The Rockefeller University
New York, NY 10021 U.S.A.

M. TOMPA
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350 U.S.A.
Abstract This paper introduces the first integrated algorithm designed to discover novel motifs in heterogeneous sequence data, which is comprised of coregulated genes from a single genome together with the orthologs of these genes from other genomes. Results are presented for regulons in yeasts, worms, and mammals.
1
Regulatory Elements and Sequence Sources
An important and challenging question facing biologists is to understand the varied and complex mechanisms that regulate gene expression: how, when, in what cells, and at what rate is a given gene turned on and off? This paper focuses on one important aspect of this challenge, the discovery of novel binding sites in DNA (also called regulatory elements) for the proteins involved in such gene regulation. This is an important first step in determining which proteins regulate the gene and how.
Until now, nearly all regulatory element discovery algorithms have focused on what will be called homogeneous data sources, in which all the sequence data is of the same type (see Section 1.1). This paper introduces the first integrated algorithm designed to exploit the richer potential of heterogeneous sequence data, which is comprised of coregulated genes from a single genome together with the orthologs of these genes from other genomes.

1.1 Regulatory Elements from Homogeneous Data
A number of algorithms have been proposed for the discovery of novel regulatory elements in nucleotide sequences. Most of these try to deduce the regulatory elements by considering the regulatory regions of several (putatively) coregulated genes from a single genome. Such algorithms search for overrepresented motifs in this collection of regulatory regions, these motifs being good candidates for regulatory elements. Some examples of this approach include Bailey and Elkan^1, Brazma et al.^2, Buhler and Tompa^3, Hertz and Stormo^4, Hughes et al.^5, Lawrence et al.^6, Lawrence and Reilly^7, Rigoutsos and Floratos^8, Rocke and Tompa^9, Sinha and Tompa^10, van Helden et al.^11, and Workman and Stormo^12. An orthogonal approach deduces regulatory elements by considering orthologous regulatory regions of a single gene from multiple species. This approach has been used in phylogenetic footprinting (Tagle et al.^13, Loots et al.^14) and phylogenetic shadowing (Boffelli et al.^15). The simple premise underlying these comparative approaches is that selective pressure causes functional elements to evolve at a slower rate than nonfunctional sequences. This means that unusually well conserved sites among a set of orthologous regulatory regions are good candidates for functional regulatory elements. The standard method that has been used for phylogenetic footprinting is to construct a global multiple alignment of the orthologous regulatory sequences using a tool such as CLUSTAL W (Thompson et al.^16), and then identify well conserved regions in the alignment. An algorithm designed specifically for phylogenetic footprinting without resorting to global alignment has been developed by Blanchette et al.^{17,18}.
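The post-alignment step described above, scoring windows of a multiple alignment for conservation, can be sketched as follows. This is a naive illustration of the idea, not the algorithm of Blanchette et al.; the function name is hypothetical and a gapless alignment is assumed:

```python
def conserved_windows(aligned, width, min_identity):
    """Score each width-column window of a gapless multiple alignment
    by the fraction of columns in which all sequences agree, and
    return the start positions of windows at or above min_identity.

    aligned : list of equal-length strings (one per species)
    """
    length = len(aligned[0])
    # column is "identical" when every sequence shows the same residue
    identical = [len(set(col)) == 1 for col in zip(*aligned)]
    hits = []
    for start in range(length - width + 1):
        frac = sum(identical[start:start + width]) / width
        if frac >= min_identity:
            hits.append(start)
    return hits
```

Windows passing the threshold are the "unusually well conserved sites" that become candidate regulatory elements.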
As more related genomes are sequenced and our understanding of regulatory relationships among genes improves, we will find ourselves in a situation with richer data sources than in the past. Namely, the data to be analyzed will often be heterogeneous: a collection of coregulated genes from one genome together with their orthologous genes in several related genomes. There is an obvious advantage to considering heterogeneous
data when it is available: namely, motifs may not be detectable when one considers only the coregulated regions from one genome or only the orthologous regions of one gene (McGuire et al.^19, Wang and Stormo^20). The most obvious way to handle heterogeneous data is to treat all the regulatory regions identically: pool all the input sequences, and search for overrepresented motifs. This is precisely what was done in studies by Gelfand et al.^21 and McGuire et al.^19. There are several reasons why treating the heterogeneous data homogeneously in this way discards valuable information that may be necessary for accurate prediction of regulatory elements:

1. This method ignores the phylogeny underlying the data so that, for example, similar sequences from a subset of closely related species will have an unduly high weight in the choice of motifs predicted.

2. Phylogenetic studies such as that of Lane et al.^22 show that instances of orthologous regulatory elements, because they evolved from a common ancestral sequence, tend to be better conserved than instances across coregulated genes of the same genome. By pooling all the sequences, this distinction is lost.
3. Perhaps most importantly, the number of occurrences of a given regulatory element will vary greatly across putatively coregulated genes: some regulatory regions will contain no occurrences, while others will contain multiple occurrences. This variance in number should be much smaller across orthologous genes, again because they evolved from a single ancestral sequence. By pooling all the sequences, this distinction too is lost.
Another method for exploiting heterogeneous data involves two separate passes. For instance, Wasserman et al.^23, Kellis et al.^24, Cliften et al.^25, and Wang and Stormo^20 search for well conserved motifs across the orthologous genes and then, among these, search for overrepresented motifs. GuhaThakurta et al.^26 do the opposite, searching for overrepresented motifs in one species and eliminating those that are not well conserved in the orthologs. In both cases, the first pass acts as a filter before the second pass is performed, and a drawback is that the true motif may be filtered out because it is not conserved well enough in the dimension of the first pass. In other words, these algorithms do not integrate all the available information from the very beginning. In this paper we propose the first algorithm that uses heterogeneous sequence data in an integrated manner. We focus on the 2-species case for concreteness and efficiency, but also because of its timeliness for the study of regulons in important sequenced pairs such as human/mouse, fruitfly/mosquito, and C. elegans/C. briggsae.
2
Expectation-Maximization for Heterogeneous Data
The Expectation-Maximization algorithm of MEME is very well suited to the discovery of regulatory elements in single-species regulons. We have generalized MEME's framework and algorithm so that it is suited to the two-species heterogeneous data problem. We call the new algorithm OrthoMEME. The inputs to OrthoMEME are sequences X_1, X_2, ..., X_n, Y_1, Y_2, ..., Y_n, where X_1, X_2, ..., X_n are the regulatory regions of n genes from species X, and Y_i is X_i's orthologous sequence from species Y. For ease of discussion we will assume that the motif width W is fixed but, like MEME, OrthoMEME iterates over different values of W and chooses the best result. Also like MEME, OrthoMEME can be run in any of three modes: OOPS (One Occurrence Per Sequence), ZOOPS (Zero or One Occurrence Per Sequence), or TCM (zero or more occurrences per sequence). TCM mode is particularly appropriate for most regulatory element problems. In the heterogeneous data setting, a motif occurrence in sequence i means an occurrence in X_i and an orthologous occurrence in Y_i. That is, even in TCM mode every motif occurrence consists of an orthologous pair. Accordingly, the hidden random variables are Z_isjk, defined to be 1 if there are orthologous motif occurrences that begin at position j of X_i and position k of Y_i, both occurrences in orientation s (either + or -), and 0 otherwise. (An underlying assumption is that sequences outside motif occurrences are drawn from the background distributions and, in particular, are not orthologous. This is in general untrue, but for sufficiently diverged sequences the resulting inaccuracy should be minimal.) OrthoMEME's objective is to maximize the expected log likelihood of the model, divided by the motif width, given the input sequences and hidden variables.

The model parameters specify how well conserved the motif is among the sequences of species X (parameter θ, a position weight matrix), and how well conserved orthologous pairs of motif instances are (parameter η, a vector of 4 x 4 transition probability matrices). More specifically,

    θ_jr = Pr(residue r in the background distribution),        if j = 0,
    θ_jr = Pr(residue r at position j of X's occurrences),      if 1 ≤ j ≤ W,

    η_jrs = Pr(at position j of the motif, residue r of X maps to residue s of Y).

There is also a corresponding parameter θ'_0r that specifies the background distribution in species Y. In ZOOPS and TCM modes, there is an additional parameter λ that specifies the expected frequency of motif occurrences. Let φ be a vector containing all the model parameters.
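To make the roles of θ and η concrete, here is a sketch of how an orthologous occurrence pair could be scored against the background model under these parameters. The scoring function and its data layout are illustrative assumptions, not code from OrthoMEME:

```python
import math

# Hypothetical containers for the parameters described above:
# theta[j][r] for j = 1..W, with theta[0][r] the background in species X;
# theta_bg_y[r] the background in species Y; and eta[j][r][s] the
# per-position probability that residue r of X maps to residue s of Y.

def pair_log_odds(x_site, y_site, theta, theta_bg_y, eta):
    """Log-odds of an orthologous occurrence pair of width W against
    the background: the X occurrence is scored by the position weight
    matrix, and the Y occurrence by how its residues map from X's."""
    score = 0.0
    for j, (r, s) in enumerate(zip(x_site, y_site), start=1):
        score += math.log(theta[j][r] / theta[0][r])      # motif vs background in X
        score += math.log(eta[j][r][s] / theta_bg_y[s])   # mapping vs background in Y
    return score
```

A positive score indicates the pair is better explained by the motif-plus-mapping model than by the two background distributions.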
In classic expectation-maximization fashion, OrthoMEME alternates between E-steps (which update the expected values of the hidden variables) and M-steps (which update the model parameters). More specifically, the E-step computes E(Z_isjk | X_i, Y_i, φ), where φ consists of the values of the model parameters computed in the previous M-step. The M-step finds the values of the model parameters φ that maximize the log likelihood of the model, given the input sequences and the expected values of Z_isjk computed in the previous E-step. The formulas for these steps depend on the mode (OOPS, ZOOPS, TCM). For simplicity, we present only the formulas for OOPS mode. Let X_isp be the residue present at position p of strand s in sequence X_i, and let m be the length of each input sequence. The E-step for OOPS mode normalizes the joint likelihoods over all candidate occurrence pairs:

    E(Z_isjk | X_i, Y_i, φ) = Pr(X_i, Y_i | Z_isjk = 1, φ) / Σ_{s',j',k'} Pr(X_i, Y_i | Z_is'j'k' = 1, φ),

where Pr(X_i, Y_i | Z_isjk = 1, φ) scores positions p ∈ {j, ..., j+W-1} of X_i under θ, the paired positions p ∈ {k, ..., k+W-1} of Y_i under η, and all remaining positions p = 1, ..., m under the background distributions.
0 is updated as in MEME. Each E-step and M-step runs in time O(nm2W),since the number of hidden variables is 2nm2. This causes the algorithm to run slowly when the input sequences are long, which is an aspect of the algorithm that we are striving t o improve. MEME’s running time per step is O(nmW). The algorithm needs a measure t o compare solutions found, in order t o choose the best motif among all those found from different initial
353 values of 4 and different choices of motif width W . Unlike MEME, OrthoMEME compares solutions on the basis of the expected log likelihood of the model, divided by the motif width, given the input sequences and hidden variables. That is, it uses the very evaluation function that it is optimizing. (MEME instead uses the p-value of the relative entropy of the motif instances predicted.) There is an interesting algorithmic problem that arises only in the TCM mode of OrthoMEME and not at all in MEME. In order to produce actual motif occurrences from the final values Z i s j k of 4), OrthoMEME must choose 0 or more good ortholE(Zi,jk 1 X i , ogous pairs (jl,k l ) , ( j 2 , k2), . . . for each value of i. These pairs should represent nonoverlapping occurrences whose order is conserved between the two species, that is, j h W 5 j h + l and k h W 5 k h + l , for all h. For each value of i, OrthoMEME does this by retaining only those pairs ( j ,k ) such that z i s j k exceeds a threshold, and then using dynamic programming (quite similar to that for optimal alignment) to choose those pairs that represent nonoverlapping occurrences with conserved order and maximum total value of z i s j k .
x,
+
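The chaining step just described can be sketched as a quadratic-time dynamic program over the threshold-passing pairs (a simplified sketch; `best_pair_chain` is a hypothetical name):

```python
def best_pair_chain(pairs, width):
    """Choose a maximum-weight chain of orthologous occurrence pairs.

    pairs : list of (j, k, weight), where weight is the posterior z
            value of the pair after thresholding.
    A chain requires j_h + width <= j_{h+1} and k_h + width <= k_{h+1},
    i.e. nonoverlapping occurrences whose order is conserved between
    the two species.  Runs in O(len(pairs)^2), in the spirit of (but
    simpler than) the alignment-style dynamic program in the paper.
    """
    pairs = sorted(pairs)
    if not pairs:
        return []
    n = len(pairs)
    best = [p[2] for p in pairs]          # best chain weight ending at h
    prev = [-1] * n
    for h in range(n):
        jh, kh, wh = pairs[h]
        for g in range(h):
            jg, kg, _ = pairs[g]
            if jg + width <= jh and kg + width <= kh and best[g] + wh > best[h]:
                best[h] = best[g] + wh
                prev[h] = g
    # trace back from the best endpoint
    h = max(range(n), key=best.__getitem__)
    chain = []
    while h != -1:
        chain.append(pairs[h][:2])
        h = prev[h]
    return chain[::-1]
```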
3
Experimental Results
OrthoMEME is implemented and we intend to make it publicly available. This section reports initial results of OrthoMEME on several heterogeneous data sets. All MEME and OrthoMEME motifs discussed below were among the top 3 motifs reported on those input sequences. Tables 1-3 show the predictions of OrthoMEME on yeast regulons from Saccharomyces cerewisiae and their orthologs in Saccharomyces bayanus. The S. cerewisiae target genes and binding sites for these transcription factors come from SCPD 27 The homogeneous S. cerevisiae data sets of Tables 1 and 2 are known to be particularly difficult: the motif discovery tools YMF l o , MEME ', and AlignACE all failed t o find the known transcription factor binding sites in these S. cerewisiae regulons (Sinha and Tompa2'). Table 1 shows OrthoMEME's predictions on the genes known to be regulated by HAP2;HAP3;HAP4. There are 5 known binding sites contained in 4 target genes. MEME predicted only 1 of these binding sites (whether run on just S. cerewisiae sequences or on the pooled sequences of both species), whereas OrthoMEME predicted 3 using the same parameters. In this and all subsequent tables, the underlined portions of the predicted motif occurrences are the subsequences that overlap the known binding sites. Table 2 shows OrthoMEME's predictions on the genes known to be regulated by UASCAR. There are 4 known binding sites contained in 3 target genes, all 4 of which are predicted by OrthoMEME. MEME pre~
354 Table 1: HAP2;HAP3;HAP4 predicted motif, OOPS mode, sequence length 600. The column labeled “Mut” shows the number of mismatches between the orthologous motif occurrences. The underlined portions of the motif occurrences are the subseauences that overlar, the known binding sites. OrthoMEME missed one occurrence in each of S P R J and 6 Y C l . Source: SCPD 2 7 .
(Table 1 body: rows for the genes CYC1, SPRJ, and QCR8 and their S. bayanus orthologs; the position, instance, and mutation columns were not recoverable.)
Table 2: UASCAR predicted motif, TCM mode, sequence length 300. OrthoMEME missed no occurrences. Source: SCPD [27].
Gene     Pos    Instance (S. cerevisiae)   Pos    Instance (S. bayanus)   Mut
CAR2     -218   CTCTGTTAAC                 -222   CTCTGTTAAC              0
CAR2     -154   TGCCCTTGCC                 -153   TGCCCTTGCC              0
ARG5,6   -114   TTCCATTAGG                 -122   TTCCATTAGG              0
CAR1     -169   TTCACTTAGC                 -176   TTCACTTAGC              0
ARG5,6   -52    TGCCTTTAGT                 -56    TGCCTTTAGT              0
ARG5,6   -286   TTCACTTAAA                 -294   TTCACTTAAG              1
CAR2     -189   TGCCGTTAGC                 -193   TGCCGTTAGC              0
CAR2     -252   TTGCGTGTGG                 -257   TTGCGTGCGG              1
ARG5,6   -224   ATGACTCAGT                 -228   ATGACTCAGT              0
CAR1     -209   TGCCATTAGC                 -216   TGCCGTTAGC              1
CAR1     -232   TGCCCTTCGC                 -239   TGCCCTTGGC              1
CAR1     -86    TTCTCTTCTC                 -73    TTCTCCTCTC              1
dicted none of these binding sites when run on the S. cerevisiae sequences alone, and all 4 when run on the pooled sequences of both species. Table 3 summarizes the performance of OrthoMEME on some less difficult yeast regulons [28]. On all three regulons OrthoMEME had few true negatives. On the SCB and PDR3 regulons, OrthoMEME's number of false positives was comparable to that of MEME. On the MCB regulon, OrthoMEME had many more false positives than MEME, but many fewer true negatives to compensate. Tables 4 and 5 give examples of OrthoMEME run on heterogeneous human/mouse data. Table 4 shows target genes of the human transcription factor SRF together with their mouse orthologs. TRANSFAC [29] reports one known binding site in each of these 4 regulatory sequences.
Table 3: Summary of other yeast regulons, S. cerevisiae vs. S. bayanus, TCM mode, sequence length 1000. Column headings: "genes", the number of target genes in the regulon; "known", the number of known S. cerevisiae binding sites in these target genes; "MEME, S. cer.", MEME run on the S. cerevisiae sequences; "MEME, pooled", MEME run on the pooled sequences of both species; "FP", the number of false positives (predictions that were not binding sites); "TN", the number of true negatives (binding sites that were not predicted). Source: SCPD [27].
                         OrthoMEME     MEME, S. cer.   MEME, pooled
factor   genes   known   FP     TN     FP     TN       FP     TN
SCB      3       11      6      2      8      2        13     5
MCB      5       11      10     1      5      7        6      1
PDR3     4       8       7      2      6      1        4      13
Table 4: SRF predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed one occurrence in each of B-ACT and apoE. Source: TRANSFAC [29].
Gene     Str   Pos    Instance (H. sapiens)   Pos    Instance (M. musculus)   Mut
B-ACT    +     -73    CCTTTTATGG              -65    CCTTTTATGG               0
c-fos    ?     -314   CCTAATATGG              -459   CCTAATATGG               0
apoE     -     -43    CCAATTATAG              -855   CCAATTATAG               0
CA-ACT   -     -850   CCTTATTTGG              -111   CCTTATTTGG               0
OrthoMEME predicted 2 of these 4 known binding sites. MEME, using the same parameters, found none of them, whether run on just the human sequences or on the pooled human and mouse sequences. Table 5 shows target genes of the human transcription factor NF-κB together with their mouse orthologs. TRANSFAC [29] reports 11 known binding sites in these 10 genes. Because OrthoMEME was run in OOPS mode, it missed one of the two occurrences in IL-2. It also missed the known occurrences in SELE and IL-2Rα. MEME, using the same parameters, performed as well on this regulon. Table 6 shows an example of OrthoMEME's predictions on a worm regulon. This is a collection of Caenorhabditis elegans genes regulated by the transcription factor DAF-19 (Swoboda et al. [30]), together with orthologs from Caenorhabditis briggsae. Each regulatory region in C. elegans is known to contain one instance of the "x-box", which is the binding site of DAF-19. OrthoMEME predicted all five of the documented x-boxes [30], as did MEME. (The full x-box has width 14 bp, of which OrthoMEME omitted the somewhat less conserved first 4 bp.)
Table 5: NF-κB predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed one occurrence in each of SELE, IL-2Rα, and IL-2. Source: TRANSFAC [29].
Gene     Pos    Instance (H. sapiens)   Pos    Instance (M. musculus)   Mut
SELE     -285   CCCGGGAATATCCAC         -262   TCTGGGAATATCCAC          2
ICAM-1   -228   CTCCGGAATTTCCAA         -250   TCTAGGAATTTCCAA          4
GRO-γ    -160   TCCGGGAATTTCCCT         -140   TCCGGGAATTTCCCT          0
GRO-α    -160   TCCGGGAATTTCCCT         -140   TCCGGGAATTTCCCT          0
IL-2Rα   -306   TGCGGTAATTTTTCA         -276   TGCGGTAATTTTTCA          0
GRO-β    -156   TCCGGGAATTTCCCT         -146   TCAGGGAATTTCCCT          1
TNF-β    -274   CCTGGGGGCTTCCCC         -251   CCTGGGGGCTTCCCC          0
IL-6     -139   TGTGGGATTTTCCCA         -125   TGTGGGATTTTCCCA          0
IFN-β    -140   CAGAGGAATTTCCCA         -137   CAGAGGAATTTCCCA          0
IL-2     -255   AGAGGGATTTCACCT         -257   AGAGGGATTTCACCT          0
Table 6: DAF-19 predicted motif, OOPS mode, sequence length 1000. OrthoMEME missed no occurrences. Source: Swoboda et al. [30].
Gene      Pos    Instance (C. elegans)   Pos    Instance (C. briggsae)   Mut
che-2     -126   TCATGGTGAC              -178   CCATGGCAAC               3
osm-1     -86    CCATGGTAGC              -79    CCATGGCAAC               2
f02d8.3   -79    CCATGGAAAC              -93    CCATGGAAAC               0
osm-6     -100   CTATGGTAAC              -764   CGATGACAAA               4
daf-19    -109   CCATGGAAAC              -243   CTTTGGCAAA               4
Conclusion
As more genomes are sequenced and our understanding of regulatory relationships among genes improves, algorithms for motif discovery from the rich source of heterogeneous sequence data will become prevalent. We have introduced the first algorithm to deal with heterogeneous data sources in a truly integrated manner, using all the data from the onset of analysis. We are still in the early stages of experimenting with the implementation and its parameters. There is much room for improved prediction accuracy, and we are optimistic that, with more experience, we will consistently be able to solve problems with OrthoMEME that cannot be solved from homogeneous data alone. There is a reasonably straightforward extension to K > 2 species in which the transition matrices ν_j are replaced by rate matrices and one assumes that the phylogeny and its branch lengths are given. For this
extension the running time would be O(nm^K W), which is prohibitive. We are working on faster algorithms for this case and also the important case K = 2. For the case K = 2, it seems important to have a better understanding of how evolutionary distance between the species affects OrthoMEME's accuracy.

Acknowledgments

Peter Swoboda provided us with the C. elegans DAF-19 data set, and Phil Green and Joe Felsenstein made helpful suggestions. This material is based upon work supported in part by the Howard Hughes Medical Institute, by the National Science Foundation under grants DBI-9974498 and DBI-0218798, and by the National Institutes of Health under grant R01 HG02602.

References

1. Timothy L. Bailey and Charles Elkan. The value of prior knowledge in discovering motifs with MEME. In Proceedings of the Third
International Conference on Intelligent Systems for Molecular Biology, pages 21-29, Menlo Park, CA, 1995. AAAI Press.
2. Alvis Brazma, Inge Jonassen, Jaak Vilo, and Esko Ukkonen. Predicting gene regulatory elements in silico on a genomic scale. Genome Research, 15:1202-1215, 1998.
3. Jeremy Buhler and Martin Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225-242, 2002.
4. Gerald Z. Hertz and Gary D. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7/8):563-577, July/August 1999.
5. J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. Journal of Molecular Biology, 296:1205-1214, 2000.
6. Charles E. Lawrence, Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208-214, 8 October 1993.
7. Charles E. Lawrence and Andrew A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics, 7:41-51, 1990.
8. Isidore Rigoutsos and Aris Floratos. Motif discovery without alignment or enumeration. In RECOMB 98: Proceedings of the Second
Annual International Conference on Computational Molecular Biology, pages 221-227, New York, NY, March 1998.
9. Emily Rocke and Martin Tompa. An algorithm for finding novel gapped motifs in DNA sequences. In RECOMB 98: Proceedings of the Second Annual International Conference on Computational Molecular Biology, pages 228-233, New York, NY, March 1998.
10. Saurabh Sinha and Martin Tompa. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30(24):5549-5560, December 2002.
11. J. van Helden, A. Rios, and J. Collado-Vides. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Research, 28:1808-1818, 2000.
12. C. T. Workman and G. D. Stormo. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In Pacific Symposium on Biocomputing, pages 464-475, Honolulu, Hawaii, January 2000.
13. D. A. Tagle, B. F. Koop, M. Goodman, J. L. Slightom, D. L. Hess, and R. T. Jones. Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus): nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. Journal of Molecular Biology, 203:439-455, 1988.
14. Gabriela G. Loots, Ivan Ovcharenko, Lior Pachter, Inna Dubchak, and Edward M. Rubin. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Research, 12:832-839, May 2002.
15. Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, Lior Pachter, and Edward M. Rubin. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299(5611):1391-1394, February 2003.
16. J. D. Thompson, D. G. Higgins, and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680, 1994.
17.
Mathieu Blanchette, Benno Schwikowski, and Martin Tompa. Algorithms for phylogenetic footprinting. Journal of Computational Biology, 9(2):211-223, 2002.
18. Mathieu Blanchette and Martin Tompa. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Research, 12(5):739-748, May 2002.
19. Abigail Manson McGuire, Jason D. Hughes, and George M. Church. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Research, 10:744-757, 2000.
20. Ting Wang and Gary D. Stormo. Combining phylogenetic data
with coregulated genes to identify regulatory motifs. Bioinformatics, 2003. To appear.
21. M. S. Gelfand, E. V. Koonin, and A. A. Mironov. Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Research, 28(3):695-705, 2000.
22. Robert P. Lane, Tyler Cutforth, Janet Young, Maria Athanasiou, Cynthia Friedman, Lee Rowen, Glen Evans, Richard Axel, Leroy Hood, and Barbara J. Trask. Genomic analysis of orthologous mouse and human olfactory receptor loci. Proceedings of the National Academy of Sciences USA, 98(13):7390-7395, June 19, 2001.
23. Wyeth W. Wasserman, Michael Palumbo, William Thompson, James W. Fickett, and Charles E. Lawrence. Human-mouse genome comparisons to locate regulatory sites. Nature Genetics, 26:225-228, October 2000.
24. Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423:241-254, May 2003.
25. Paul Cliften, Priya Sudarsanam, Ashwin Desikan, Lucinda Fulton, Bob Fulton, John Majors, Robert Waterston, Barak A. Cohen, and Mark Johnston. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science, 301:71-76, 2003.
26. Debraj GuhaThakurta, Lisanne Palomar, Gary D. Stormo, Pat Tedesco, Thomas E. Johnson, David W. Walker, Gordon Lithgow, Stuart Kim, and Christopher D. Link. Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods. Genome Research, 12:701-712, 2002.
27. Jian Zhu and Michael Q. Zhang. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15(7/8):607-611, July/August 1999. http://cgsigma.cshl.org/jian/.
28. Saurabh Sinha and Martin Tompa. Performance comparison of algorithms for finding transcription factor binding sites. In 3rd IEEE Symposium on Bioinformatics and Bioengineering, pages 214-220.
IEEE Computer Society, March 2003.
29. E. Wingender, P. Dietze, H. Karas, and R. Knuppel. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Research, 24(1):238-241, 1996. http://transfac.gbf-braunschweig.de/TRANSFAC/.
30. Peter Swoboda, Haskell T. Adler, and James H. Thomas. The RFX-type transcription factor DAF-19 regulates sensory neuron cilium formation in C. elegans. Molecular Cell, 5:411-421, March 2000.
NEGATIVE INFORMATION FOR MOTIF DISCOVERY
K. T. TAKUSAGAWA
D. K. GIFFORD
kenta@mit.edu, gifford@mit.edu
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA

We discuss a method of combining genome-wide transcription factor binding data, gene expression data, and genome sequence data for the purpose of motif discovery in S. cerevisiae. Within the word-counting algorithmic approach to motif discovery, we present a method of incorporating information from negative intergenic regions, where a transcription factor is thought not to bind, and a statistical significance measure which accounts for intergenic regions of different lengths. Our results demonstrate that our method performs slightly better than other motif discovery algorithms. Finally, we present significant potential new motifs discovered by the algorithm.
1 Introduction
In the field of computational biology, motif discovery is one tool for unraveling the transcriptional regulatory network of an organism. The underlying model assumes that a transcription factor binds to a specific short sequence ("a motif") in an intergenic region near a gene the factor regulates. With the recent availability of many genome-wide data sets, we can predict certain motifs by computational methods rather than laborious experimentation. Such computational techniques rely on fusing genome sequence data with other data sets. In this paper, we discover motifs by fusing sequence data with transcription factor binding data and gene expression data. Chromatin immunoprecipitation (ChIP) microarray experiments can determine where in the genome a particular transcription factor binds, to a resolution of a single intergenic region (usually 500-2000 bp) [8]. The GRAM algorithm [2] combines such genome-wide location information with gene-expression experiments. The algorithm discovers additional intergenic regions that are likely bound by the transcription factor but did not cause a strong signal in the ChIP experiment. For motif discovery, intergenic regions are partitioned into two categories: those to which the transcription factor is thought to bind (according to raw ChIP experiments or after incorporating additional information via an algorithm like GRAM) and those to which it does not bind. We will refer to the bound sequences as the "positive intergenic sequences" and those not bound as the "negative intergenic sequences".
If an algorithm were only to use the positive sequences for motif discovery, then it would likely discover many false motifs. Such false motifs are caused by sequences which appear frequently in all the intergenic sequences of a genome. In S. cerevisiae, two prominent simple examples of such sequences are poly-A (long strings of consecutive adenine nucleotides) and poly-CA (long strings of alternating cytosine and adenine nucleotides). Fortunately, fusing binding data with the complete sequencing of the S. cerevisiae genome provides us with a conceptually simple method of discovering a transcription factor's motif: find a sequence which is present in the positive sequences and not present in the negative sequences. However, because of experimental noise and variability of binding by a transcription factor, we expect to find occasional examples of the correct motif in the negative sequences, so we instead seek a motif that is significantly over-represented in the positive intergenic sequences when compared with the negative intergenic sequences.

1.1 Related work
There have been many past efforts to use negative intergenic sequences to derive a statistical test. The very popular "Random Sequence Null Hypothesis" (so named in Barash et al. [3]) uses the negative sequences to discover the parameters of an n-th order background Markov model (n = 0 and n = 3 are popular). This approach greatly dilutes the information content of the negative intergenic sequences, and especially loses information about false motifs whose length is greater than the order of the Markov model. In contrast, the approach pursued in this paper is similar to Vilo et al. [11] and Barash et al. [3]. Vilo et al. cluster genes by their expression profiles and seek to discover motifs within each cluster. Their test for significance compares the total occurrences of a potential motif in all intergenic sequences to the within-cluster count. Their significance test compares a statistic against a binomial distribution. Barash et al. describe an alternative to the "Random Sequence Null Hypothesis", namely a "Random Selection Null Hypothesis". They perform a calculation similar to that of Vilo et al., but compare against a hyper-geometric distribution. (The difference appears to be the assumption of whether motif-containing sequences are selected "with replacement" or "without replacement" from all the sequences.) A somewhat different approach is described by Sinha [9], who shows how to view motif discovery as a feature selection problem for classification. Sinha's algorithm requires the input of positive and negative intergenic sequences.
Sinha generates the negative examples (intergenic sequences) artificially using a Markov model, but the framework presented in the paper could easily use actual
negative intergenic sequences from ChIP experiments. This paper makes the following two contributions to the field. First, we describe a modification to the statistical methods of Vilo et al. and Barash et al. which allows for intergenic sequences with different lengths. Second, we apply our motif discovery method and statistical test to transcription factor binding data from ChIP microarray experiments. The papers cited above were published before ChIP data were available; therefore the authors used clustered gene-expression data for groups of genes thought to be regulated by a common transcription factor. Recently, other researchers have taken techniques similar to those described in this paper and fused them with other data sets. Kellis et al. [6] incorporate conservation information from different yeast species. Gordon et al. [5] incorporate structural data about the transcription factor and its likely binding domain.
2 Methods
We perform motif discovery in the framework of word-counting. This framework exhaustively enumerates a class of potential motifs (or words) and scores each word for its likelihood of being a true motif. We searched for potential motifs of width 7 with up to 2 wildcard elements among the 7 positions. The wildcard elements permitted were the double-degenerate nucleotides (IUPAC codes M, R, W, S, Y, K) and the quadruple-degenerate "gap" nucleotide (IUPAC code N). For each potential motif m, we determine in which positive sequences and in which negative sequences m occurs. We then determine if m occurs in the positive sequences more often than would be expected by chance. We must therefore first define a null hypothesis of what in fact is expected by chance. Biologically, the null hypothesis corresponds to the situation that m is not the motif for the transcription factor. To be able to statistically reject the null hypothesis, we must quantify what we would expect to see if the null hypothesis were true. We will present two different null hypotheses, the latter of which incorporates sequence lengths as additional information to the statistical measure. Computational constraints determined the limits of width 7 and 2 wildcards. At those limits, a search for a transcription factor's motif (within approximately 3 Mbase of S. cerevisiae sequence) took approximately 20 minutes on a 1.6 GHz Athlon system. The running time scales exponentially with re-
spect to the width and number of allowed wildcards. As an aside, we note that this exponential increase could be addressed in future investigations in two ways. For slightly wider motifs or more wildcards, more computing power can be applied: the algorithm parallelizes trivially by having different processors examine separate regions of the search space. Beyond that, if one wanted to discover long motifs, one can use the short motifs discovered by exhaustive search as starting points for an expectation-maximization-type algorithm, as done by Barash et al. [3] and Gordon et al. [5].
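The word-counting search described above can be sketched as follows, assuming a simple string representation of degenerate words; the function names are ours, not from the paper.

```python
from itertools import combinations, product

# IUPAC code -> set of matching nucleotides
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
         "N": "ACGT"}
WILDCARDS = "MRWSYKN"

def enumerate_words(width=7, max_wild=2):
    """Yield every candidate motif of the given width with at most
    max_wild wildcard positions (the six double-degenerate IUPAC codes
    plus the fully degenerate N)."""
    for n_wild in range(max_wild + 1):
        for positions in combinations(range(width), n_wild):
            fixed = [p for p in range(width) if p not in positions]
            for wild in product(WILDCARDS, repeat=n_wild):
                for bases in product("ACGT", repeat=len(fixed)):
                    word = [""] * width
                    for p, w in zip(positions, wild):
                        word[p] = w
                    for p, b in zip(fixed, bases):
                        word[p] = b
                    yield "".join(word)

def occurs(word, seq):
    """True if the degenerate word matches anywhere in seq."""
    w = len(word)
    return any(all(seq[i + p] in IUPAC[c] for p, c in enumerate(word))
               for i in range(len(seq) - w + 1))
```

At width 7 with up to 2 wildcards this enumerates roughly 1.3 million candidate words, which is consistent with the exponential scaling in width and wildcard count noted above.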
2.1 Sequences chosen with uniform probability

The two null hypotheses are instances of the "Random Selection Null Hypothesis" of Barash et al. [3], which states that when the null hypothesis is true (i.e., the motif is incorrect), the positive sequences are "randomly selected" from among all the intergenic sequences, without any correlation or bias toward sequences containing the incorrect motif. (One can visualize a transcription factor as the "hand" which randomly selects from an urn of intergenic sequences.) For their model, "randomly selected" means "all sequences are equally likely to be chosen without replacement". For this definition of "randomly selected", they give a formula for the probability that m occurs in k sequences by chance alone:
    P_hyper(k) = C(K, k) C(N - K, n - k) / C(N, n)    (1)

where C(a, b) denotes the binomial coefficient, n is the number of positive sequences, N is the total number of sequences (positive and negative), and K is the number of sequences in which the word m occurs. The above formula is the hyper-geometric probability distribution. Using this formula we can calculate a p-value that the null hypothesis is true. The p-value sums the tail of the probability distribution for k' >= k:

    p = sum_{k'=k}^{min(n,K)} P_hyper(k')    (2)
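The hypergeometric test and its tail-sum p-value can be computed directly with exact integer arithmetic; a minimal sketch (function name ours):

```python
from math import comb

def hypergeometric_pvalue(k, n, K, N):
    """P(at least k of the n positive sequences contain the word), under
    the null hypothesis that the n positives are drawn uniformly without
    replacement from all N sequences, of which K contain the word."""
    total = comb(N, n)
    return sum(comb(K, kp) * comb(N - K, n - kp)
               for kp in range(k, min(n, K) + 1)) / total

# e.g. 5 positives out of 100 sequences, word present in 10 sequences
# overall, all 5 positives containing it:
# hypergeometric_pvalue(5, 5, 10, 100) == comb(10, 5) / comb(100, 5)
```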
2.2 Sequences chosen by length

Instead of "all sequences equally likely" as the behavior under the null hypothesis, we propose the null hypothesis that:

    Sequences will be selected (without replacement) with probability proportional to the sequence's length.
Figure 1: Distribution of intergenic sequence lengths in S. cerevisiae.
The motivation for this alternative stems from the fact that sequences from the ChIP experiments are of different lengths (Figure 1). The modification is plausible: given no other knowledge about the transcription factor, a longer sequence is more likely to contain the transcription factor's true motif. Let A_L be the bag (multi-set) of all sequence lengths, and K_L be the sub-bag of the lengths of the sequences in which the word m occurs. (Thus |A_L| = N and |K_L| = K.) We use bags to allow for distinct sequences which happen to have the same length. Having defined the null hypothesis, we can define the probability of it being true as the probability that k or more of the sequences in which the word occurs are selected. Because computing this probability exactly is computationally prohibitive, we instead compute an approximation: instead of selecting sequences without replacement, we select sequences with replacement. The probability of selecting exactly k sequences is binomial:

    P_binom(k) = C(n, k) r^k (1 - r)^{n-k}    (3)
where r is the proportion of total sequences (weighted by lengths) containing the word, that is, r = (sum of the lengths in K_L) / (sum of the lengths in A_L).
To calculate the p-value that the null hypothesis is true, we reuse equation 2, substituting P_binom for P_hyper.
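A sketch of the length-weighted binomial test; the function name and the representation of the bags K_L and A_L as plain lists of lengths are our assumptions.

```python
from math import comb

def length_binomial_pvalue(k, n, K_L, A_L):
    """Approximate p-value under the length-weighted null hypothesis:
    each of the n positive sequences is drawn (here, with replacement)
    with probability proportional to its length.  K_L and A_L are the
    bags of lengths of the word-containing sequences and of all
    sequences, so r is the length-weighted fraction containing the
    word.  The p-value is the binomial tail for k' >= k."""
    r = sum(K_L) / sum(A_L)
    return sum(comb(n, kp) * r**kp * (1 - r)**(n - kp)
               for kp in range(k, n + 1))
```

For example, with two positive sequences, where the word-containing sequences account for half the total length (r = 0.5), the probability of seeing the word in at least one positive sequence by chance is 0.75.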
Table 1: Consensus sequences

TF      Consensus           TF      Consensus
ABF1    TCRNNNNNNACG        CBF1    RTCACRTG
GAL4    CGGNNNNNNNNNNNCCG   GCN4    TGACTCA
GCR1    CTTTCC              HAP2    CCAATNA
HAP3    CCAATNA             HAP4    CCAATNA
HSF1    GAANNTTTCNNGAA      INO2    ATGTGAAA
MATa1   TGATGTANNT          MCM1    CCNNNWRGG
MIG1    WWWWSYGGGG          PHO4    CACGTG
RAP1    RMACCCANNCAYY       REB1    CGGGTRR
STE12   TGAAACA             SWI4    CACGAAA
SWI6    CACGAAA             YAP1    TTACTAA
3 Results and Discussion

The results and discussion are organized into the following sections. §3.1 validates the algorithm by attempting to replicate known motifs. §3.2 presents potential new motifs discovered by the algorithm. Finally, §3.3 discusses ideas for future work.
3.1 Validation

This section measures and compares the algorithm's motif discovery performance. For an absolute measure, the algorithm was run on binding data for transcription factors whose motifs were previously discovered and confirmed biologically. For a comparative measure, the same data were analyzed with the motif discovery programs MEME and MDscan [7]. The algorithm was also run on differently processed binding data for each transcription factor to determine the effect of the type of binding data on motif discovery.

Program parameters
MDscan was run through the web interface with the following parameters:

- Motif width: 7
- Number of top sequences to look for candidate motifs: 10
- Number of candidate motifs for scanning the rest of the sequences: 20
- Report the top 10 final motifs found
- Precomputed genome background model: S. cerevisiae intergenic
MEME was run with the command-line parameters -dna -w 7 -nmotifs 10 -revcomp -bfile $MEME/tests/yeast.nc.6.freq. The parameters direct MEME to attempt to discover 10 motifs of width 7 on either strand using the pre-computed order-6 Markov background model of the yeast non-coding regions.
Binding data

Three different sets of positive sequences were used. That is, three different methods were used to determine which sequences are bound by a transcription factor. The first two are a simple p-value threshold on the ChIP experiment [8] (not related to the p-values calculated in the statistical tests of Section 2). The last uses the GRAM gene modules described in Bar-Joseph et al. [2], which fuse both binding data and expression level data.

1. Bound intergenic regions, cutoff p-value 0.001
2. Bound intergenic regions, cutoff p-value 0.0001
3. GRAM gene modules under YPD

To score the performance of this paper's algorithm, MEME, and MDscan, the discovered motifs were compared against the consensus sequences for transcription factors (Table 1), which were gathered from the TRANSFAC database. We score the closeness of a discovered motif to the consensus using a Euclidean distance metric described in the thesis version of this paper [10]. The threshold of correctness was chosen "by eye" to be a value for which discovered motifs below the threshold seemed close to consensus motifs. The threshold was loose enough that a motif is scored "correct" even when the discovered motif spans only half of a wide gapped motif (for example ABF1 or GAL4). We report the number of times the most statistically significant discovered motif was correct, and the number of times a correct motif was found somewhere in the top 10 significant motifs. This paper's algorithm only reported motifs with significance greater than 10^{-10}, so sometimes no motifs were found. Table 2 gives the number of correct motifs found by the algorithm and other motif-discovery algorithms on different data sets. We can make the following observations:

- The best performance was this paper's algorithm using binding data with threshold p-value 0.001.
Table 2: Verified consensus motifs

Algorithm    Data set    Choose from   Number correct (out of 20)
This paper   p=0.001     Top 10        14
MDscan       p=0.001     Top 10        12
MEME         p=0.001     Top 10        10
This paper   p=0.001     Top 1         10
MDscan       p=0.001     Top 1         9
MEME         p=0.001     Top 1         0
This paper   GRAM        Top 10        12
This paper   GRAM        Top 1         9
This paper   p=0.0001    Top 10        12
MEME         p=0.0001    Top 10        12
This paper   p=0.0001    Top 1         9
- Choosing a more rigorous threshold for the binding data, namely 0.0001, resulted in slightly poorer performance, most likely because of insufficient positive intergenic sequences for a significant result.
- Incorporating gene expression information with the GRAM modules algorithm caused the algorithm to perform slightly more poorly than using the raw binding data. However, the modules result did find 2 correct motifs that the raw binding data did not (at the cost of failing to find 4 others).
- The algorithm finds slightly more correct motifs than MEME or MDscan.
3.2 New motifs

Tables 3 and 4 give the top-scoring motifs for some transcription factors not listed in Table 1. These are candidates for further investigation. The positive sequences used for the tables were the bound sequences at p-value 0.001. From discussion with a colleague [4], we note that the motifs for CIN5, GAT3, GLN3, IME4, YAP5, and YAP6 are probably not correct, while those for BAS1, FKH1, FKH2, INO4, and SUM1 are consistent with what is known about the transcription factors.
Results on shuffled data

To judge the background level of motifs, the algorithm was also run on random sets of intergenic sequences. Ideally, these runs should produce no significant
Table 3: Top-scoring motifs discovered for transcription factors not in Table 1, with binomial significance greater than 10^{-10}. The significance values are log10 of the p-value. The gap wildcard is denoted by a dot.

TF     Condition   Motif       Binomial   Hypergeometric
BAS1   YPD         LiAtiYtiG   -10.99     -14.71
CIN5   YPD         TAYGSAA     -10.86     -19.67
FHL1   Rapamycin   CC.TACA     -27.28     -39.88
FHL1   YPD         CC.TACA     -35.12     -50.61
FKH1   YPD         GTAAACA     -10.85     -14.72
FKH2   YPD         GTSAACA     -12.16     -18.49
GAT3   YPD         CYGACGC     -15.90     -21.14
GLN3   Rapamycin   C.GCGGA     -11.46     -16.65
IME4   YPD         CACACAC     -12.16     -15.22
INO4   YPD         CATGTGA     -12.14     -14.36
MBP1   YPD         GACGCG?     -20.14     -25.40
MET4   Rapamycin   ATTCGGC     -10.25     -13.13
MET4   YPD         CtCGTGA     -10.78     -13.08
Table 4: Top-scoring motifs (continued from Table 3)

TF     Condition   Motif      Binomial   Hypergeometric
NRG1   YPD         CTGC?T"G   -11.65     -19.00
PHD1   YPD         AT"GCAC.   -10.86     -20.01
RGM1   YPD         CCC$CGA    -12.91     -15.94
STB1   YPD         CGCGAAA    -12.36     ?
SUM1   YPD         G$CAC$A    -11.38     -17.18
YAP5   YPD         ACGCGCP    -11.94     -16.98
YAP6   YPD         gGGCACO    -11.44     -18.78
motifs. Twenty-five random trials were run for each of 20, 40, 80, 120, and 160 randomly chosen S. cerevisiae intergenic sequences (for a total of 125 trials). Five of the 125 experiments discovered a total of 11 motifs with binomial p-values less than 10^{-4}, with the most significant motif having significance 10^{-4.7}. These falsely significant motifs were more likely to be found when there were fewer positive sequences, as 8 of the 11 motifs were found in data sets with 20 positive sequences. In the course of the 125 trials, over 70 million hypotheses (i.e., candidate motifs) are tested, so it is reasonable to see a few false positives with significance higher than 10^{-4}.

3.3 Future work
The statistical test developed in Section 2 can make use of more information for a better measure of significance. In §2.2 we defined the null hypothesis behavior "random selection" to be selection with probability proportional to length. A straightforward modification would be to instead use the number of different subsequences of a sequence as its probability (appropriately normalized). As an extreme example, consider a very long sequence consisting of a repeat of a single nucleotide. While long, such a sequence offers few possibilities of where a transcription factor might bind. Such a long repetitive
sequence ought to be selected with low probability. Continuing in this manner, other biological prior knowledge can be incorporated into the prior probability that a sequence is selected. Such knowledge might involve the location of the sequence on the chromosome, knowledge about the gene which the sequence precedes, or other genetic markers. Biologically, we must question the assumption of independence (modulo choosing without replacement) between the n = IPI random selections from A. For example, it would be reasonable to hypothesize that if two sequences are very similar, they would likely both be selected, or neither. Not only can we incorporate biologically relevant information into the prior probability of the binding, but we can also try to incorporate more information about the binding event itself. Currently, the algorithm only makes use of the binary presence (“yes” or “no”) of words in sequences. It could, for example, incorporate the following features: 0 0
- Number of occurrences of the word in the sequence
- Position of the occurrence(s) with respect to the start of transcription or other genetic markers in the sequence
- Strand of the occurrence of the word
- p-value of the binding event.
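The modified null model suggested above (selection probability proportional to the number of distinct subsequences rather than to raw length) could be sketched as follows; the k-mer length and the simple normalization are illustrative assumptions:

```python
def distinct_kmers(seq: str, k: int = 8) -> int:
    """Number of distinct length-k subsequences (k-mers) in seq."""
    if len(seq) < k:
        return 0
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def selection_probs(seqs, k: int = 8):
    """Null-model selection probability for each sequence, proportional
    to its distinct k-mer count rather than its raw length."""
    weights = [distinct_kmers(s, k) for s in seqs]
    total = sum(weights)
    return [w / total for w in weights]

# A long homopolymer run contains only one distinct 8-mer, so it is
# selected with far lower probability than a diverse sequence:
probs = selection_probs(["A" * 100, "ACGTACGGTTACGATCGATGCAT" * 5])
```

This captures the intuition in the text: a long repetitive sequence offers few distinct binding possibilities and so should carry little weight under the null.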
Beyond yeast, of course, are the many organisms whose genomes have been recently sequenced, including human. It will be only a matter of time before ChIP and other genome-scale location experiments are performed on those organisms. We expect that to do worthwhile motif discovery on larger and more complicated genomes, careful attention will have to be paid to the statistical issues and improvements mentioned above.
Acknowledgements

Special thanks to Richard A. Young, D. Benjamin Gordon, and Ziv Bar-Joseph for help with the data sources used in this project. K.T.T. was supported by an NDSEG/ASEE Graduate Fellowship.
INTRODUCTION TO INFORMATICS APPLICATIONS IN STRUCTURAL GENOMICS

S.D. MOONEY
Stanford Medical Informatics, Department of Genetics, Stanford University, Stanford, CA 94305

P.E. BOURNE
The San Diego Supercomputer Center, The University of California San Diego, San Diego, CA 92093
P.C. BABBITT
Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA 94143
1. Structural Genomics
Structural genomics initiatives aim to determine all of the naturally evolved macromolecular scaffolds of proteins, RNA, and DNA. In this introduction, we introduce several recent advances in the computational methods that support structural genomics. These include improvements at all levels of structure analysis, from fold identification of a target sequence and structure prediction, to structure evaluation and classification. The reader is referred to Goldsmith-Fischman and Honig [1] for a thorough treatment of computational methods in structural genomics and to Bourne, et al. in this volume for the status of target structure determination. Improvements in computational methods for structural genomics are facilitating the identification of new, previously uncharacterized targets with novel fold classifications and predicted functions. These computational methods support the structural genomics pipeline by identifying targets, storing assay data, and analyzing results in a statistically sound manner. The six papers presented here address many aspects of this diverse topic.

One of the primary ways of identifying the function of an unknown structure is to identify its most similar structural neighbors. These "nearest neighbor" structural classification methods have proven to be powerful tools for identifying unknown
function. For example, the Structural Classification of Proteins project, SCOP, is an effort to classify all protein domains. SCOP classification is performed both through human intervention and through automated methods. Therefore, the challenge for fully automated computational methods is to correctly classify protein domains and to produce results similar to those of methods or databases that rely on human annotation. In this volume, Huan, et al. apply an information-theoretic approach to identify coherent subgraphs in graphs that represent protein structures. They test their method on several families and find that their classifications correlate well with SCOP.

Another challenge for computational structural bioinformatics methods is macromolecular structure prediction. A common approach to predicting the structure of an amino acid sequence is to apply comparative modeling methods, by modeling an unknown sequence upon a structure having a similar sequence. Comparative modeling is often performed in a four-step process: fold identification, threading, model building, and evaluation with refinement of the structure. Fold identification and threading remain significant challenges, since a target sequence may have little sequence similarity to any known scaffold. This volume presents two papers aimed at improving the identification of the appropriate fold for a target protein sequence through experimental intervention. First, Potluri et al. present a method for discriminating well-predicted structures from poorly predicted ones using chemical cross-linking data. Second, Qu et al. present a method for identifying the fold of a sequence using the NMR technique of residual dipolar coupling. Their program, RDCthread, identifies structural homologs of a target protein using RDC data and secondary structure prediction. Although most structural genomics techniques aim at studying protein structures, similar techniques have been applied to RNA structure prediction.
For a review of structure prediction techniques as applied to RNA structure, see Schuster, et al. [2]. In this volume, Nebel presents a method for identifying good predictions of RNA secondary structure, thereby improving secondary structure prediction overall.

Finally, one of the most exciting activities in structural genomics is studying the many structures that are now stored in public databases such as the Protein Data Bank (PDB) [3]. Peng, et al. apply contrast classifiers to explore bias in the PDB. When they compared the distributions of proteins in SWISS-PROT and the PDB, they found that transmembrane, signal, disordered, and low-complexity regions are poorly represented in the PDB. They reason that contrast classifiers can be used to select important targets for structural genomics initiatives.
Successes in structural genomics initiatives continue to be accompanied by the development of computational methods that apply sophisticated analyses from such diverse fields as information theory and clustering, together with novel experimental techniques. As a result, novel structures continue to be added to our structural repertoire, giving new biological insight in this post-genomic era.

Acknowledgements
We would like to thank Giselle Knudsen for her advice in preparing this document.
References
1. Goldsmith-Fischman S and Honig B (2003) "Structural genomics: computational methods for structure analysis" Protein Science 12(9):1813-21.
2. Schuster P, Stadler PF, and Renner A (1997) "RNA structures and folding: from conventional to new issues in structure predictions" Current Opinion in Structural Biology 7(2):229-35.
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) "The Protein Data Bank" Nucleic Acids Research 28(1):235-42.
THE STATUS OF STRUCTURAL GENOMICS DEFINED THROUGH THE ANALYSIS OF CURRENT TARGETS AND STRUCTURES
P.E. BOURNE, C.K.J. ALLERSTON, W. KREBS, W. LI, and I.N. SHINDYALOV
The San Diego Supercomputer Center, The University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA

A. GODZIK, I. FRIEDBERG, and T. LIU
The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037 USA

D. WILD and S. HWANG
The Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711 USA
Z. GHAHRAMANI
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK

L. CHEN and J. WESTBROOK
Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854 USA
Structural genomics - large-scale macromolecular 3-dimensional structure determination - is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.org) reports this progress through the status of each protein sequence (target) under consideration by the major structural genomics centers worldwide. Hence, TargetDB provides a unique opportunity to analyze the potential impact that this major initiative provides to scientists interested in the sequence-structure-function-disease paradigm. Here we report such an analysis with a focus on: (i) temporal characteristics - how is the project doing and what can we expect in the future? (ii) target characteristics - what are the predicted functions of the proteins targeted by structural genomics and how biased is the target set when compared to the PDB and to predictions across complete genomes? (iii) structures solved - what are the characteristics of structures solved thus far and what do they contribute? The analysis required a more extensive database of structure predictions using different methods integrated with data from other sources. This database, associated tools, and related data sources are available from http://spam.sdsc.edu.
1 Introduction

Structural genomics has been heralded as the follow-on to the human genome project. This is interpreted to mean a large-scale project, with scientific,
engineering, and technological components and with the potential to have a large impact on the life sciences. Whereas the goals of the human genome project were relatively well defined - sequence the 3 billion nucleotides comprising the human genome and define all open reading frames - the goals advanced for structural genomics are more diverse (http://www.nigms.nih.gov/news/meetings/hinxton.html) [1]. For instance, some of the NIH P50 structural genomics centers have focused on all of the protein structures in a given genome - A. thaliana, T. maritima, and M. tuberculosis are examples under scrutiny. Other groups have focused on obtaining sufficient coverage of fold space [2] to facilitate accurate homology modeling of the majority of proteins of biological interest (see http://spam.sdsc.edu/sgtdb for a description of the focus of each center). Since structure has already taught us so much about biological function when undertaken as a functionally driven initiative, undertaking structure determination in a broader genomic sense will likely also bring significant new understanding of living systems. Further, it will likely lead to advances in the process of structure determination, whether by X-ray crystallography or NMR.

With such diversity of deliverables and with some projects now well established, an obvious question is: how are we doing? This paper addresses this question. The question has been addressed before in the context of new folds and functions and has proven to be somewhat controversial. An initial report in Science [3] implied that the number of structures produced as of November 2002 was minimal. A response from the US Northeast Structural Genomics Consortium (NESG) [4] indicated that it was early in the process and that the absolute number of structures produced may not be the best measure; rather, the value of those structures is more to the point.
NESG indicated that a structure containing a novel fold would indeed provide a new template from which many sequences could be related and hence was a significant contribution. It is not our intent here to join this argument but to simply point readers at some quantitative data and suggest how the process might proceed in the future and the challenges it provides to the bioinformatics community.
2 Methods

An important feature of structural genomics, laid out by the NIH as part of the awards made to the pilot centers engaged in this high-throughput structure determination, was the importance of reporting their progress on a regular basis. The 16 pilot centers in the US and worldwide do this by way of weekly updates made available through their individual centers and collated by the Protein Data Bank (PDB) into what is known as the target database (TargetDB; http://targetdb.pdb.org)
[5]. The contents of the target database are also available as an XML file. This file was used to create a local database from which the results presented here are derived. This database is available at http://spam.sdsc.edu/sgtdb. Fold prediction is based on three existing methodologies, FFAS [6], iGAP [7], and Bayesian networks [8], which are fully described elsewhere. Prediction of all open reading frames from complete proteomes uses the iGAP methodology and is part of the Encyclopedia of Life (EOL; http://eol.sdsc.edu) project.
3 Results
3.1 Progress

In the past year (May 1, 2002 - May 31, 2003), 314 structures resulting from structural genomics were reported by TargetDB. During the same period, a total of 3324 structures were deposited with the PDB. Thus structural genomics is currently contributing approximately 10% of structures to the field of structural biology. The number of structures at each stage in the pipeline is shown in Figure 1.
Figure 1 Structural Genomics Targets at Different Stages of Solution (April 1, 2003)
Slightly less than 50% of targets are selected for scrutiny. From these a high percentage can be expressed, but the number purified and crystallized drops off dramatically, indicating these steps continue to register low success rates and should be a focus of renewed efforts. Of those that crystallize, the majority find their way into the PDB.

Is the percentage of structures determined by structural genomics likely to increase in the near future? To address this question requires that we look for temporal trends in the data. This is possible since TargetDB is updated each week, so the mean time that an active target spends at each step in the structure determination pipeline can be assessed. These results are shown in Figure 2. It should be noted that not all of the centers reporting weekly status update their internal status-tracking data with the same frequency. Consequently, the interval assessment here must be interpreted with care.
Figure 2 Mean Time of Targets at Each Structure Determination Step
For targets that make it to the next step, the data indicates that there is no specific bottleneck at this point, but rather a balance between the time taken at each structure determination step. Without a significant bottleneck the prospects for improving the rate of structure determination would seem good, particularly as the early stages of the project have included a significant engineering component for some projects. However, a final answer to the question will come from further review of TargetDB in the next two years.
3.2 Target characteristics
The characteristics of targets being attempted by individual structural genomics groups are highly variable (see http://spam.sdsc.edu/sgtdb for a synopsis of the activities of each individual group). Groups are focusing on one or more of the following: complete proteomes, pathways and diseases, new folds, new technologies, and specific structures. Thus the relative number of active targets from each group is meaningless, and no attempt is made here to compare groups; rather, the characteristics of the targets as a whole are considered.

A review of the over 30,000 targets in the database (April 1, 2003) indicates a 13% redundancy at the 100% sequence identity level and 38% redundancy at the 30% sequence identity level. This implies that either individual groups are operating without regard for other groups, or there is interest in the same targets by different groups, perhaps indicating some important functional significance for a particular target. These data could be probed further to ascertain (if possible from sequence alone) the functional significance of these hotly contested targets. It should be noted that there is a temporal aspect to these target data. When a target was selected, which may be up to three years ago, the level of redundancy with respect to NR may have been significantly different, so these data need to be interpreted with care.

A review of each group's targets indicates that there is a significant level of redundancy within a group's targets (Figure 3). In some cases this reflects the redundancy in the complete proteome under study; in other cases, perhaps a desire to solve multiple instances of an important structure that, based on sequence identity, are known to have the same fold.
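Redundancy figures of this kind can be computed by greedy, CD-HIT-style clustering at an identity threshold. The sketch below uses a naive ungapped identity measure for illustration; a real analysis would use alignment-based identity (e.g. BLAST or CD-HIT):

```python
def identity(a: str, b: str) -> float:
    """Crude percent identity: matching positions over the shorter length.
    (Real pipelines would align the sequences first.)"""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(a[i] == b[i] for i in range(n)) / n

def redundancy(seqs, threshold: float) -> float:
    """Fraction of sequences within `threshold` identity of an earlier
    representative (greedy clustering in input order)."""
    reps = []
    redundant = 0
    for s in seqs:
        if any(identity(s, r) >= threshold for r in reps):
            redundant += 1
        else:
            reps.append(s)
    return redundant / len(seqs)
```

Running this at thresholds of 1.0 and 0.3 would correspond to the 100% and 30% identity levels quoted for the target set.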
Figure 3 Sequence Redundancy within each Group's Targets (redundancy at the 30% identity level plotted against each lab's main stated goals: Fold Space, Complete Genome(s), Specific Proteins, Technology Driven)
3.3 Structure characteristics

Are there any specific characteristics of the novel folds in the structures determined by the structural genomics initiative? How do these differ from the general population in the PDB, and why? In short, what is novel in the structures being determined by structural genomics, and how do they aid us by increasing our understanding of living systems and/or enabling more rapid structure determination or modeling? An analysis of the former is provided by [9]. Here we focus on the characteristics important to bioinformatics, specifically fold and function, which can be used in further analysis, for example, in homology modeling. An analysis of the new folds as defined by SCOP is given in Table 1.
Table 1 New Folds Resulting from Structural Genomics

Period            | Total New Folds | New Folds from Structural Genomics
Oct 2001-Mar 2002 | 48              | 1. YchN-like (c.144); 2. Hypothetical protein MTH777 (c.115); 3. alpha/beta knot (c.116); 4. Archaeosine tRNA-guanine transglycosylase, C-terminal additional domains (e.36); 5. YebC-like (e.39)
Apr 2002-Sep 2002 | 27              | 1. DsrC, the gamma subunit of dissimilatory sulfite reductase (d.203); 2. Ribosome binding protein Y (d.204); 3. Hypothetical protein MTH637 (d.206); 4. Thymidylate synthase-complementing protein Thy1 (d.207); 5. MTH1598-like (d.208)
Oct 2002-Mar 2003 | 64              | 1. S13-like H2TH domain (a.156); 2. C-terminal domain of DFF45/ICAD (a.164); 3. BEACH domain (a.169); 4. Viral chemokine binding protein m3 (b.116); 5. Obg-fold (b.117); 6. N-terminal domain of MutM-like DNA repair proteins (b.113); 7. Putative glycerate kinase (c.118); 8. DegV-like (c.119); 9. YbaB-like (d.222); 10. SufE (d.224); 11. Replication modulator SeqA, C-terminal DNA-binding domain (d.228)
In the first reporting period the number of new folds reported by structural genomics was approximately 10% of the total number reported (5 out of 48), a result proportional to the percentage of structures coming from structural genomics. In the second and third periods this jumped to 18% (5 out of 27) and 17% (11 out of 64), respectively, indicating that the goal of new fold discovery may be being met, given that only 10% of structures overall are coming from structural genomics. However, the sample of new folds is small, and hence we will need to wait for additional time periods and review this trend again.
A review of the sequences of solved structures against the non-redundant protein sequence database (NR), ordered in bins of expectation value (E-value), is given in Figure 4.
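The E-value screen underlying such a figure can be sketched as a simple log10 binning; the 10^-3 novelty threshold follows the text, while the bin width and function interface are assumptions:

```python
import math
from collections import Counter

def evalue_bins(evalues):
    """Histogram of BLAST E-values in log10 bins; hits with E >= 1e-3
    are flagged as lacking guaranteed sequence homology (and hence as
    candidates for possible new functions)."""
    bins = Counter(math.floor(math.log10(e)) for e in evalues)
    possibly_novel = sum(1 for e in evalues if e >= 1e-3)
    return bins, possibly_novel
```

Applied to the 314 solved structures, this kind of screen yields the "approximately 70 with E-value of 10^-3 or higher" group discussed below the figure.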
Figure 4 Likely Uniqueness of New Targets

Approximately 70 of a total of 314 structures have an E-value of 10^-3 or higher and represent a group for which sequence homology is not guaranteed; hence they represent possible new functions (assuming functions were correctly assigned to sequences in NR). Again, the above is only an indicator of the situation. A better analysis would require comparison against NR at the time the structure was solved or released.

What of the overall distribution of folds represented by TargetDB? Figure 5 shows the distribution of folds derived by FFAS [6], iGAP [7], and Bayesian networks [8]. The level of reliability is not considered; only possible predictions are represented. Both FFAS and iGAP provided predictions for nearly all targets, Bayesian networks for about 10%, based on a smaller template library. Not only does this highlight internal consistency between the methods of prediction, it also indicates differences. The distribution of major folds seems consistent with the distribution of associated biological functions in living systems. For example, it is known that P-loop-containing protein families are very prevalent in nature.
Figure 5 Predicted Folds from TargetDB (SCOP Fold Distribution): 1=FFAS; 2=iGAP; 3=Bayesian Networks

This relationship is probed further in Figure 6. Fold predictions are made for all open reading frames in a variety of organisms as well as the PDB and TargetDB.
Figure 6 SCOP Fold Distributions in Several Model Organisms, PDB and TargetDB
A question that can be posed from these data is: how biased are the distributions of folds in TargetDB relative to those from specific target organisms and the PDB? Intuitively one would expect the PDB to be biased towards proteins that are (a) likely to be crystallized easily, (b) smaller proteins amenable to NMR, or (c) over-represented by particular classes of proteins since they represent drug targets or functionally important proteins. Conversely, TargetDB would be somewhat closer to what is found in nature, as whole genomes are being attempted. Having said that, at this stage of structural genomics projects may be going for the low-hanging fruit, and hence it may be too early to make such a comparison. It should also be noted that there is an undetermined bias in these data, and hence they should be considered cautiously. The bias arises in that predictions are done with a mix of fold prediction and homology modeling, and in both cases there is a bias towards known folds. Nevertheless, expected trends do occur.

Immunoglobulin-like beta sandwiches (b.1) are over-represented in the PDB and under-represented in TargetDB. This would suggest they have proven particularly amenable to crystallization and represent a sequence-rich fold class that matches many of the targets; if new folds are an aim, groups will likely discount a large number of such targets, hence the under-representation in TargetDB. The same argument can be made for TIM barrels (c.1). The empirical rule that emerges from these and other fold classes is that a class that is over-represented in the PDB is under-represented in TargetDB. RNA/DNA-binding 3-helical bundles (a.4) appear to be over-represented in TargetDB relative to what appears in the PDB and several model organisms. The same is true of P-loop-containing nucleotide triphosphate hydrolases, perhaps a reflection of their role as drug targets. S-adenosyl-L-methionine-dependent methyltransferases also appear over-represented in TargetDB.
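The over/under-representation argument can be made concrete as a per-fold enrichment ratio between two fold distributions. The fold counts below are purely hypothetical, chosen only to illustrate the computation:

```python
def enrichment(counts_a: dict, counts_b: dict) -> dict:
    """Ratio of each fold's relative frequency in distribution A to its
    relative frequency in distribution B (>1 = over-represented in A,
    <1 = under-represented in A)."""
    na, nb = sum(counts_a.values()), sum(counts_b.values())
    folds = set(counts_a) & set(counts_b)
    return {f: (counts_a[f] / na) / (counts_b[f] / nb) for f in folds}

# Hypothetical counts: immunoglobulin-like beta sandwiches (b.1) heavy
# in the PDB, RNA/DNA-binding 3-helical bundles (a.4) heavy in TargetDB.
pdb = {"b.1": 400, "c.1": 300, "a.4": 100}
targetdb = {"b.1": 100, "c.1": 150, "a.4": 250}
ratios = enrichment(pdb, targetdb)
```

Using relative frequencies rather than raw counts keeps the comparison meaningful when the two databases differ greatly in size.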
4 Discussion
Structural genomics is a large science project involving multidisciplinary teams seeking to increase the number of macromolecular structures. From this process comes new understanding of living systems, derived from functional inference from structure, and improved methodologies. Improved methodologies range from new engineering practices, which speed the structure determination process, to an increased number of known folds, which improves our ability to provide realistic models of proteins of unknown structure.

A unique aspect of structural genomics is a weekly report by all groups engaged in this activity. Thus for the first time we are in a position to monitor quantitatively the scientific progress of a major scientific project. This progress is in the form of the status in the structure determination process of protein sequence targets. This status terminates at the point the structure enters the PDB, and hence structures completed by structural genomics can be compared against structures
derived from conventional, functionally driven structure determination experiments. Targets which have not yet been solved can be predicted with a variety of existing structure prediction methods. Taking existing unsolved targets, solved structures, and predicted structures of the targets, a picture of the progress of structural genomics begins to emerge. Here we have reported on that picture.

The percentage of structures being contributed by structural genomics is approximately 10% at this time. The time to solution ranges from three to eighteen months, with a peak in the 8-10 month range (data not shown). Data are not available for how this compares to conventional structure determination, but it is estimated to be of a similar order. At this time structural genomics would seem to be contributing twice the number of new folds as conventional structure determination, but the numbers are too small to be considered statistically significant. An argument has been made that structural genomics might contribute fewer new folds than one might anticipate, since the emphasis will be on determining the maximum number of structures. Maximizing numbers implies taking what crystallizes easily, and this could be construed as favoring those structures that fall in the subset of folds most amenable to crystallization. Conversely, a functionally driven initiative on a single target might expend more time and energy performing experiments that would result in the crystallization of a less amenable fold not pursued by structural genomics. This type of conjecture will become testable as the number of structures increases. We will continue to process TargetDB and report our findings through the Web site at http://spam.sdsc.edu/sgtdb.
Acknowledgments

This work is supported by the National Institutes of Health grant number 1P01GM63208-01.
References

1. S.E. Brenner and M. Levitt. Expectations from Structural Genomics. Protein Sci 9(1), 197 (2000).
2. E. Portugaly and M. Linial. Estimating the Probability for a Protein to have a New Fold: A Statistical Computational Model. Proc Natl Acad Sci USA 97(10), 5161 (2000).
3. R.F. Service. Tapping DNA for Structures Produces a Trickle. Science 298, 948 (2002).
4. M. Gerstein et al. Structural Genomics: Current Progress. Science 299, 1663 (2003).
5. J. Westbrook et al. The Protein Data Bank and Structural Genomics. Nucleic Acids Research 31(1), 489 (2003).
6. L. Rychlewski, L. Jaroszewski, W. Li, and A. Godzik. Comparison of Sequence Profiles. Strategies for Structural Predictions using Sequence Information. Protein Science 9, 232 (2000).
7. W.W. Li, G.B. Quinn, N.N. Alexandrov, P.E. Bourne, and I.N. Shindyalov. Proteins of Arabidopsis (PAT) database: A Resource for Comparative Proteomics. Genome Biology, In Press (2003).
8. A. Raval, Z. Ghahramani, and D.L. Wild. A Bayesian Network Model for Protein Fold and Remote Homologue Recognition. Bioinformatics 18(6), 788 (2002).
9. C. Zhang and S.-H. Kim. Overview of Structural Genomics: From Structure to Function. Current Opinion in Chemical Biology 7, 28 (2003).
PROTEIN STRUCTURE AND FOLD PREDICTION USING TREE-AUGMENTED NAIVE BAYESIAN CLASSIFIER

A. CHINNASAMY, W.K. SUNG
(arun, ksung)@comp.nus.edu.sg, Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543.
A. MITTAL
ankush@bits-pilani.ac.in, Department of Computer Science, Birla Institute of Technology and Science, Pilani, India.

For determining the structure class and fold class of a protein, computer-based techniques have become essential considering the large volume of data. Several techniques based on sequence similarity, neural networks, SVMs, etc. have been applied. This paper presents a framework using Tree-Augmented Networks (TAN), based on the theory of learning Bayesian networks but with less restrictive assumptions than the naive Bayesian networks. In order to enhance TAN's performance, pre-processing of data is done by feature discretization and post-processing is done by using a Mean Probability Voting (MPV) scheme. The advantage of using the Bayesian approach over other learning methods is that the network structure is intuitive. In addition, one can read off the TAN structure probabilities to determine the significance of each feature (say, hydrophobicity) for each class, which helps to further understand the mystery of protein structure. Experimental results and comparison with other works over two databases show the effectiveness of our TAN-based framework. The idea is implemented as the BAYESPROT web server and it is available at http://wwwappn.comp.nus.edu.sg/~bioinfo/bayesprot/Default.htm.
1 Introduction
In proteomics, finding the structure and the fold of a protein is very important since it helps to understand the functions and the catalytic and structural roles of proteins. Protein structure can be determined experimentally by X-ray diffraction and NMR techniques. These methods are expensive, tedious, labor intensive, and have their own limitations. This leads to research in predicting the protein folding pattern given only its primary structure. This computational approach to protein structure prediction can be classified into two general types.
1. Homology methods: (a) Sequence similarity methods: these are based on the observation that two proteins have very similar structure if their sequences have high homology [3]. (b) Threading methods: these predict the structure of a protein sequence by aligning it with a known structure [12].
2. Discriminative methods: these extract some general "rules" from the known protein structures and apply the "rules" to a new protein sequence to make the prediction [16].

Sequence similarity has its limitation, as it applies only to those sequences which are similar in terms of both sequence and structure. Several discriminative methods based on statistical techniques, neural networks, and SVMs have been applied in the past. The main difficulty in applying learning (discriminative) methods is that the folding prediction becomes less accurate with an increasing number of classes. This study hopes to solve these issues using the Bayesian classifier framework. The Bayesian classifier is theoretically the best classifier provided the underlying distribution functions are well estimated [7]. However, the Bayesian classifier requires prior knowledge of many probabilities. This paper designs a framework called BAYESPROT, with discretization of the feature space and a Tree-Augmented Network (TAN) Bayesian classifier as its foundation, to address the problem of structure and fold classification. In addition, a Mean Probability Voting (MPV) method is employed to improve the performance.

For the prediction in this paper, we use the protein classification of the SCOP [22] database; that is, proteins are classified in a hierarchical order of structures, folds, superfamilies, and families. Since finding the structural and the fold class is most significant, in this paper we apply our classification system to classify a protein into different structural and fold classes.
2 Review
Recently, machine learning tools have been widely used for classification based on tertiary super classes. These methods are denoted discriminative methods or data mining approaches. Since no direct relationship between sequence and structure is derived, much attention has been paid to statistical and machine learning techniques that classify proteins using feature vector representations of the available knowledge. Dubchak et al. (1995, 1999) [5,6] conducted classification studies based on neural networks. Ding and Dubchak (2001) [4] classified proteins into 27 fold classes using SVMs and neural networks based on three
multi-classification methods (OvO, uOvO, AvA) and concluded that the performance of SVMs is better than that of neural networks. Their study introduced SVMs to the protein classification problem. The accuracy measurement in their method assumes that a prediction is partially correct when ties exist (we instead count such predictions as failures). Their method also uses a large number of classifiers. Cai et al. (2001) [19] used SVMs to classify proteins into the four major protein classes and compared the results with a component-coupled neural network. Edler et al. (2001) conducted a statistical study based on logistic regression, additive models, and projection pursuit for protein fold prediction with a dataset containing 268 proteins. Markowetz et al. (2003) used Gaussian and various polynomial kernels with SVMs and showed that their approach performed better than earlier work. From all these studies it is evident that, among the prediction methods, SVMs perform best. Although most recent work has shown that SVMs have good generalization properties and statistically outperform neural network methods for protein fold prediction, SVM methods are reported to produce a high number of false positives. Besides, the number of binary classifiers is large and the computational time for SVM training is high when the number of classes is large. It has also been shown that SVM performance varies with the dimensionality of the feature vector, so SVM methods may require feature selection. Therefore, alternative learning methods are sought that might not share these shortcomings.
3 Overview of BAYESPROT
Figure 1 shows an overview of the BAYESPROT system. Given a database of several million protein sequences, their attributes are extracted and transformed into features, namely composition (20 features), secondary structure (21), hydrophobicity (21), polarity (21), polarizability (21), and van der Waals volume (21). After feature vector extraction, the feature values are discretized into four discrete states by the frequency discretization method. Three separate TAN Bayesian classifiers are constructed using the full concatenated feature vector (126), the composition feature vector (20), and the secondary structure feature vector (21), respectively. Previous research and our experiments suggest that, among all the attributes, composition and secondary structure are the most important for protein structure prediction. Hence we construct separate TAN classifiers for composition and secondary structure, and chose only these two additional classifiers to reduce complexity. Next, MPV is employed to predict the structural class. A similar procedure classifies the fold class, as shown in Figure 1.

Figure 1: Architecture of BAYESPROT
4 Dataset and feature vector representation
We used the datasets referred to in two prominent recent works: Ding and Dubchak (2001) and Markowetz et al. (2003). A summary of the two datasets (Dataset I and Dataset II) is tabulated in Table 2.
4.1 Dataset I
Dataset I was originally built for the study of Ding and Dubchak (2001) and later used by Markowetz et al. (2003). Both studies confirm that the dataset is reasonable, as it is based on the PDB-select sets, in which no two proteins share more than 35% sequence identity for sequences longer than 80 residues. Dataset I is available at http://www.nersc.gov/~cding/protein/.
4.2 Dataset II

Dataset II was built from the Database for expected Fold-Classes (DEF) for a statistical study. Markowetz et al. (2003) used this dataset and concluded that SVMs were better than the previous statistical approaches. Dataset II is available at http://www.dkfz.de/biostatistics/protein/gsme97.html.
4.3 Feature Vectors or Global Descriptors of Amino Acid Sequence
To apply machine learning algorithms, we have to turn amino acid sequences of heterogeneous length into feature vectors of homogeneous length. The feature vector construction is based on physical and stereochemical properties of amino acids; the method was used and explained in previous studies. Each protein sequence is represented by a set of six attribute feature vectors. The composition feature vector of length 20, which lists the proportion of each of the 20 amino acids, is constructed in a straightforward manner. Apart from composition, the other attributes used are predicted secondary structure, polarity, polarizability, hydrophobicity and van der Waals volume. Except for composition, the feature vectors for these five attributes are constructed in two steps.
Step 1: For each attribute, the twenty amino acids are divided into three groups (see Table 1). For each protein sequence, every amino acid is replaced by the index 1, 2, or 3 according to its group. For example, the protein sequence KLLSHCLLVTLAAHLPAEFTPAV is replaced by 13322333323222322132232 under the hydrophobicity grouping of amino acids (see Table 1).

Step 2: For each converted sequence from Step 1, three descriptors are calculated: "composition" (C), "transition" (T), and "distribution" (D), defined as follows.

Composition: the composition of each group is C_i = (n_i / L) × 100, where C_i is the percent composition of group i, n_i is the total number of group-i residues in the sequence, and L is the length of the sequence.

Transition: the transition T_ij is the percent frequency with which a group-i residue is followed by a group-j residue, or a group-j residue by a group-i residue, where i, j take the values 1, 2 and 3.

Distribution: the distribution descriptor D consists of five numbers for each of the three groups: the fractions of the entire sequence at which the first residue of a given group is located, and at which 25%, 50%, 75%, and 100% of the residues of that group are contained.

Each attribute's feature vector therefore contains 21 features: 3 composition features, 3 transition features and 5 × 3 distribution features. The full feature vector of length 126 is constructed by concatenating the five attribute vectors of length 21 each (5 × 21 = 105), the amino acid composition vector of length 20, and the sequence length (1 feature).

Table 1: Amino acid attributes and corresponding groups.

Attribute              Group 1                Group 2               Group 3
Secondary structure    Helix                  Strand                Coil
Hydrophobicity         Polar:                 Neutral:              Hydrophobic:
                       R,K,E,D,Q,N            G,A,S,T,P,H,Y         C,V,L,I,M,F,W
Van der Waals volume   (0-2.78)               (2.95-4.0)            (4.43-8.08)
                       G,A,S,C,T,P,D          N,V,E,Q,I,L           M,H,K,F,R,Y,W
Polarity               (4.9-6.2)              (8.0-9.2)             (10.4-13.0)
                       L,I,F,W,C,M,V,Y        P,A,T,G,S             H,Q,R,K,N,E,D
Polarizability         (0-0.108)              (0.128-0.186)         (0.219-0.409)
                       G,A,S,D,T              C,P,N,V,E,Q,I,L       K,M,H,F,R,Y,W
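As a concrete illustration, the composition/transition/distribution (CTD) descriptors of Step 2 can be sketched as follows for a single attribute (here the hydrophobicity grouping of Table 1). The function and variable names are ours, not from the original implementation, and edge-case conventions (e.g. rounding of the distribution quantiles) are assumptions:

```python
# Sketch of the 21 CTD features for one attribute, assuming the
# hydrophobicity grouping of Table 1. Illustrative only.

HYDROPHOBICITY_GROUPS = {
    **{aa: 1 for aa in "RKEDQN"},      # group 1: polar
    **{aa: 2 for aa in "GASTPHY"},     # group 2: neutral
    **{aa: 3 for aa in "CVLIMFW"},     # group 3: hydrophobic
}

def ctd_features(seq, groups=HYDROPHOBICITY_GROUPS):
    """Return 3 composition + 3 transition + 15 distribution features."""
    idx = [groups[aa] for aa in seq]
    L = len(idx)
    # Composition: percent of residues in each group.
    comp = [100.0 * idx.count(g) / L for g in (1, 2, 3)]
    # Transition: percent frequency of adjacent pairs between groups i and j.
    pairs = list(zip(idx, idx[1:]))
    trans = []
    for gi, gj in ((1, 2), (1, 3), (2, 3)):
        n = sum(1 for a, b in pairs if {a, b} == {gi, gj})
        trans.append(100.0 * n / (L - 1))
    # Distribution: sequence fractions at which the first residue and
    # 25%, 50%, 75%, 100% of each group's residues are located.
    dist = []
    for g in (1, 2, 3):
        positions = [i + 1 for i, v in enumerate(idx) if v == g]
        n = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, int(round(frac * n))) if n else 0
            dist.append(100.0 * positions[k - 1] / L if n else 0.0)
    return comp + trans + dist

features = ctd_features("KLLSHCLLVTLAAHLPAEFTPAV")
```

The three composition features necessarily sum to 100%, which gives a quick sanity check on any implementation of this descriptor.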
5 Our Framework
5.1 Discretization
In our datasets, the feature values are continuous. Although the Bayesian classifier supports both continuous and discrete probability distributions, we found experimentally that continuous distributions were not suitable for these datasets. Therefore, we pre-processed the data by converting the continuous attribute values into discrete ones. One popular and simple discretization approach is range discretization. However, under range discretization some of the partitions become over-populated while others remain empty, leading to poor discretization. To avoid this problem, we employ frequency-based discretization, which partitions each attribute into intervals containing approximately the same number of instances. Frequency-based discretization was tried with 3, 4, 5, 7 and 10 intervals; by experiment, 4 intervals yielded the best classification performance and was chosen.
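A minimal sketch of the equal-frequency discretization described above, assuming four intervals and percentile-based cut points (the paper does not publish its implementation, so the function name and details below are ours):

```python
import numpy as np

def equal_frequency_discretize(values, n_bins=4):
    """Map continuous values to bins 0..n_bins-1 with ~equal occupancy."""
    values = np.asarray(values, dtype=float)
    # Cut points at the 25th/50th/75th percentiles for n_bins = 4.
    quantiles = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.digitize(values, quantiles)

x = [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0, 10.0]
bins = equal_frequency_discretize(x, n_bins=4)
# Each of the four bins receives two of the eight values, whereas
# range (equal-width) discretization would put seven of them in one bin.
```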
5.2 TAN Bayesian Classifier
Bayesian networks are directed acyclic graphs that combine statistics and graph theory to represent conditional independencies [10]. A directed edge A → B indicates a causal relationship (A causes B), and thus Bayesian
networks are quite intuitive. Optimal classifications can be achieved by reasoning about these probabilities together with the observed data. Classification is done by applying Bayes' rule to compute the probability of a class C given a particular instance of attributes A_1, ..., A_n, and then predicting the class with the highest probability. The structural relationship among the attributes is important for a Bayesian network classifier to construct the relationships among the various nodes. However, no clear structural relationship is known at present, owing to the nature of the problem, and structural learning is not possible with the present database. Therefore, we chose the TAN Bayesian classifier [13,15] rather than a general Bayesian network classifier, as it is more relevant to the problem given the feature vector properties and relations. The TAN Bayesian classifier is an extension of the naive Bayesian classifier. As in the naive Bayesian classifier, TAN consists of a class node connected to all child nodes, each representing a feature; in addition, each child node may have at most one other feature node as a parent. An attractive property of the TAN Bayesian classifier is that it learns the probabilities from the data in polynomial time. For our case, we create a TAN Bayesian classifier with a class node representing the protein structure/fold classes, connected to 126 child nodes for the 126 features. In addition, it is assumed that composition node C_i has a structural relationship with C_{i+1}, and that each attribute's percent composition and each distribution vector have structural relationships. Three TAN Bayes classifiers were constructed, for the concatenated feature vector of length 126, the composition feature vector of length 20 and the secondary structure feature vector of length 21, respectively. The TAN Bayes classifier is defined by the following equation, where α is a normalization constant:

P(Class | A_1, ..., A_n) = α P(Class) ∏_{i=1}^{n} P(A_i | parents(A_i))    (1)
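Equation (1) with the assumed chain structure over features can be sketched as a small discrete classifier. Everything below (function names, Laplace smoothing, the toy data) is illustrative, not the paper's implementation; a fixed chain P(A_i | Class, A_{i-1}) stands in for the TAN tree, mirroring the paper's assumption that node C_i depends on its neighbor:

```python
import numpy as np

def fit_chain_tan(X, y, n_states=4, n_classes=2, alpha=1.0):
    """Estimate P(Class) and chain CPTs by counting with Laplace smoothing."""
    n, d = X.shape
    prior = np.bincount(y, minlength=n_classes) + alpha
    prior = prior / prior.sum()
    # cpts[i][c, p, s] = P(A_i = s | Class = c, A_{i-1} = p); A_0 has no feature parent.
    cpts = []
    for i in range(d):
        if i == 0:
            counts = np.full((n_classes, 1, n_states), alpha)
            for xi, c in zip(X[:, 0], y):
                counts[c, 0, xi] += 1
        else:
            counts = np.full((n_classes, n_states, n_states), alpha)
            for row, c in zip(X, y):
                counts[c, row[i - 1], row[i]] += 1
        cpts.append(counts / counts.sum(axis=2, keepdims=True))
    return prior, cpts

def predict_proba(x, prior, cpts):
    """Apply Eq. (1): P(Class|A) ∝ P(Class) * prod_i P(A_i | parents(A_i))."""
    logp = np.log(prior)
    for c in range(len(prior)):
        for i, cpt in enumerate(cpts):
            parent = 0 if i == 0 else x[i - 1]
            logp[c] += np.log(cpt[c, parent, x[i]])
    p = np.exp(logp - logp.max())
    return p / p.sum()          # the normalization constant alpha of Eq. (1)

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 5))     # toy discretized features
y = (X[:, 0] > 1).astype(int)             # toy class depending on feature 0
prior, cpts = fit_chain_tan(X, y)
p = predict_proba(np.array([3, 0, 0, 0, 0]), prior, cpts)
```

Because the CPTs are learned by simple counting over a complete dataset with a known structure, training is a single pass over the data, which is the polynomial-time property the text refers to.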
5.3 Mean Probability Voting
Let P_i, PC_i and PS_i, for i = 1, 2, ..., k, be the marginal probabilities from the TAN Bayesian classifiers that use the length-126 concatenated feature vectors, the length-20 composition feature vectors and the length-21 secondary structure feature vectors, respectively, where k is the number of classes. The mean probability MP_i, for i = 1, 2, ..., k, is calculated by averaging P_i, PC_i and PS_i. The structural/fold class is predicted by selecting the class with the highest mean probability (MP). It is accepted from previous studies that composition and secondary structure are important
in determining protein structure, and in our experiments voting increased the accuracy by around 4%.

Figure 2: TAN Bayesian Classifier

Table 2: Structural and Fold Classification Results of BAYESPROT.

                     No. of    Proteins in  Proteins in  Test         Cross Validation
Dataset              Classes   Train Data   Test Data    Accuracy(%)  Accuracy(%)
Structural Classes
  Dataset I          5         313          385          80.52        83.09
  Dataset II         4         143          125          77.6         79.85
Fold Classes
  Dataset I          27        313          385          58.18        59.77
  Dataset II         42        143          125          74.40        75.75

6 Experimental Results
6.1 Results

Both structural and fold classification were carried out using BAYESPROT on Dataset I and Dataset II. Table 2 summarizes the results for both datasets. The classifiers were evaluated on an independent test dataset and by 10-fold cross-validation. In Dataset I, the 27 fold classes are drawn from the structural classes α, β, α/β, α+β and small, and in Dataset II, the 42 fold classes are drawn from the structural classes α, β, α/β and α+β.
For the Dataset I structural classes, the confusion matrix is shown in Table 3, and the sensitivity and specificity for the five structural classes are shown in Table 4. Except for the α+β super class, all super classes are predicted with sensitivity greater than 70%. From the confusion matrix for the structural classifier it is evident that a significant number of proteins of the α+β class are misclassified into the α and β classes; similarly, some β class proteins are misclassified as α/β. The specificity of each
Table 3: Confusion Matrix for Super Classifier (Dataset I)

Table 4: Sensitivity and Specificity for each class (Dataset I)

Class                       Sensitivity (%)   Specificity (%)
α                           80.33             94.44
β                           77.78             92.16
α/β                         91.03             90.83
α+β                         40.00             96.29
Small                       88.89             99.72
Average over fold classes   50.89             61.76
structural class is very high compared to its sensitivity. The confusion matrix and individual accuracy tables for the Dataset II structural classes are available at http://www.comp.nus.edu.sg/~bioinfo/bayesprot/results.htm. From the experiments, it can be concluded that on Dataset I BAYESPROT classified six fold classes with accuracy greater than 60% and predicted 15 fold classes with accuracy greater than 50%. The average specificity over the 27 fold classes is 61.76%, which is higher than the average sensitivity of 50.89%. Confusion matrices and detailed results for the 27 and 42 fold classes are available at http://www.comp.nus.edu.sg/~bioinfo/bayesprot/results.htm.
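The per-class sensitivity and specificity reported in Table 4 can be computed from a confusion matrix as follows; the matrix below is a made-up toy example, not the paper's data:

```python
import numpy as np

def sensitivity_specificity(cm):
    """Per-class sensitivity and specificity from a confusion matrix.

    Rows are actual classes, columns are predicted classes, values are counts.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # actual class i, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # other classes predicted as class i
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

cm = [[50, 5],
      [10, 35]]                       # toy 2-class confusion matrix
sens, spec = sensitivity_specificity(cm)
```

Specificity pools all other classes as negatives, which is why, as in Table 4, it tends to be much higher than sensitivity when there are many classes.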
7 Analysis and Discussions

7.1 Dataset I: Comparison with Ding and Dubchak (2001)
In their study, Ding and Dubchak used One-versus-Others (OvO), unique One-versus-Others (uOvO) and All-versus-All (AvA) methods for multi-class classification, with binary SVMs or neural networks as building blocks. Table 5 summarizes the results of BAYESPROT and SVM [4] on the 27 fold classes. In the 10-fold cross-validation study, BAYESPROT achieves an accuracy of 59.77%, which is 31.57% higher (in relative terms) than the SVM AvA method. The number
Table 5: Comparative Results of BAYESPROT and SVM with Dataset I

                        Test Dataset                                Cross Validation
Method        BAYESPROT  SVM OvO   SVM uOvO  SVM AvA     BAYESPROT   SVM AvA
Accuracy (%)  58.8       41.8      45.2      56.0        59.77       45.4
Classifiers   3 TAN      168       2457      2106        30 TAN      84,240
              Bayes      binary    binary    binary      Bayes       binary
                         SVM       SVM       SVM                     SVM
of classifiers used in this cross-validation study is 10 × 3 = 30 TAN Bayesian classifiers, substantially fewer than in SVM AvA, where 84,240 binary SVM classifiers were employed. It is important to note that the accuracy measurements used in our study and in theirs differ in how the number of correctly classified proteins is counted. In their method, if the voting outputs for the three top classes C1, C2 and C3 are 2, 2 and 1 respectively, and the correct class is C2, the number of correctly predicted proteins is incremented by 0.5. Our work considers such a case a misclassification and does not increment the number of true positives. Thus, the superiority of the BAYESPROT method over SVM can be observed. Another consideration is that the number of classifiers used by the SVM and neural network approaches is much higher than in BAYESPROT. The learning complexity of SVM depends on the number of iterations and is in many cases quite high, whereas in BAYESPROT, since the dataset is complete and the structure is known, the time required to learn the parameters is very small. In addition, the number of classifiers used in the Bayesian network approach is substantially smaller than for SVM, as can be seen in Table 5.

7.2 Dataset II: Comparison with Markowetz et al. (2003)
Dataset II consists of 42 fold classes, 143 training proteins and 125 test proteins. In their study, an OvO SVM multi-class method was employed, achieving a best accuracy among the various kernels of 76.8% on the test dataset and 70.9% under cross-validation. Table 6 summarizes the BAYESPROT and SVM results. Unlike Dataset I, the proteins in Dataset II are spread thinly across the classes: of the 42 fold classes, 36 have at most 4 proteins in the training dataset.
Table 6: Comparative Results of BAYESPROT and SVM with Dataset II (accuracy and number of classifiers for BAYESPROT and the SVM kernel methods, on the test set and under cross-validation).
7.3 Effects of a Large Number of Training Samples

Cross-validation is a method to estimate the generalization error of a given model. We conducted a 10-fold cross-validation study to estimate the generalization error and to compare with the previous SVM methods. From Table 5 and Table 6, it is evident that under cross-validation over Dataset I and Dataset II the accuracy of BAYESPROT increases while the accuracy of the SVM methods decreases.

7.4 Interpreting the Classification Results
Analyzing the classification results is very important for solving biological problems. Biologists need to know the confidence level of the classes output by the classifiers for further analysis. Understanding the marginal differences between the top predicted classes is also important in further confirming the structural class of a protein. Our classification approach supports this type of interpretation, as it gives a probability for each class. Such interpretation is not possible with neural networks and is difficult with SVMs: neural networks contain many hidden nodes and the final output is based on a threshold value, and in SVMs, because the number of classifiers is high, reading the distances between the hyperplanes and the classes is very difficult.

8 Conclusions and future work
In this paper, we presented a framework based on TAN and a voting method that is shown to perform better than SVM in most cases. Since the network structure and the probabilities are well understood, the BAYESPROT framework also has several theoretical advantages relevant to biology researchers, and is thus a better tool for analyzing protein sequences. Further research is being carried out on incorporating network structures better than TAN to improve the performance.
References
1. A. Mittal et al., SPIE Conf. on Applications of Artificial Neural Networks in Image Processing VI, USA, 97-107, (2001).
2. A. Mittal and L.-F. Cheong, IEEE Transactions on Knowledge & Data Engineering, vol 15, no 4, (2003).
3. D. W. Mount, Cold Spring Harbor Laboratory Press, (2001).
4. C. H. Q. Ding and I. Dubchak, Bioinformatics, 17(4):349-358, (2001).
5. I. Dubchak, I. Muchnik, C. Mayor, I. Dralyuk and S.-H. Kim, Proteins, 35(4):401-7, (1999).
6. I. Dubchak, I. Muchnik, S. R. Holbrook and S.-H. Kim, Proc. Natl. Acad. Sci. USA, 92, 8700-8704, (1995).
7. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, John Wiley & Sons, (2001).
8. L. Edler et al., Math. and Computer Modelling 33, 1401-1417, (2001).
9. F. Markowetz, L. Edler and M. Vingron, Biometrical Journal 45(3), 377-389, (2003).
10. F. V. Jensen, Springer-Verlag, New York, (2001).
11. G. H. John and P. Langley, In Proc. of the 11th Conf. on Uncertainty in AI, Montreal, Quebec, Morgan Kaufmann, pp. 338-345, (1995).
12. D. Jones, W. Taylor and J. Thornton, Nature, 358:86-89, (1992).
13. N. Friedman et al., Machine Learning 29(2-3):131-163, (1997).
14. P. Domingos and M. Pazzani, Machine Learning, 29:103-130, (1997).
15. P. Langley et al., In Proc. of the 10th Natl. Conference on AI, pages 223-228, AAAI Press and MIT Press, (1992).
16. P. Wang and D. Zhang, 14th IEEE Int. Conf. on Tools with AI, November, pp. 252-257, (2002).
17. R. Collobert and S. Bengio, J. of Machine Learning Research, vol 1, pages 143-160, (2001).
18. M. J. Sippl and H. Flockner, Structure 4, 15-19, (1996).
19. Y.-D. Cai, X.-J. Liu, X.-B. Xu and G.-P. Zhou, BMC Bioinformatics 2:3, (2001).
20. J. Grassmann, M. Reczko, S. Suhai and L. Edler, In Proc. Int. Conf. Intell. Syst. Mol. Biol. (ISMB 1999), pp. 106-112, (1999).
21. J. R. Bock and D. A. Gough, Bioinformatics vol 17(5), 455-460, (2001).
22. A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, J. Mol. Biol. 247, 536-540, (1995).
CLUSTERING PROTEIN SEQUENCE AND STRUCTURE SPACE WITH INFINITE GAUSSIAN MIXTURE MODELS

A. DUBEY, S. HWANG, C. RANGEL
Keck Graduate Institute, 535 Watson Drive, Claremont CA 91711, USA

C. E. RASMUSSEN
Max Planck Institute for Biological Cybernetics, Spemann Strasse 38, 72076 Tübingen, Germany

Z. GHAHRAMANI
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London, WC1N 3AR, UK
D. L. WILD
Keck Graduate Institute, 535 Watson Drive, Claremont CA 91711, USA

Abstract

We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementary web site containing larger versions of the figures is available at http://public.kgi.edu/~wild/PSB04/index.html
1 Introduction
The clustering of protein sequences into families and superfamilies is a common approach both for comparative genomics and for the prediction of protein function. With the advent of structural genomics projects, the clustering of protein sequences with those of known structure has also
been proposed as a method of target selection for structure determination. Newly determined protein structures must then be classified, both to assess their novelty and, in the case of proteins of unknown function, as a first step in functional annotation. Most methods for clustering protein sequences begin with an all-against-all pairwise similarity search and use the pairwise score as a measure of the similarity of the two sequences. A variety of approaches have been described for constructing clusters from these scores: GENERAGE uses recursive single-linkage hierarchical clustering, and PROTOMAP constructs hierarchical clusters in a similar manner but using the means of all pairwise scores. SYSTERS uses heuristics derived from set-theoretic considerations to obtain a set of disjoint clusters. Abascal and Valencia describe a method for clustering protein families which uses the Ncut algorithm derived from graph theory. All these methods rely on setting some score threshold to distinguish members of a particular cluster from non-members, making the determination of the number of clusters arbitrary and subjective. Approaches based on single-linkage hierarchical clustering can give results which are highly dependent on small changes to the data (such as adding or removing a single sequence). Moreover, non-probabilistic approaches do not provide a measure of uncertainty about the clustering, and make it difficult to compute the predictive quality of the clustering and to compare clusterings based on different model assumptions (e.g. numbers of clusters, shapes of clusters, etc.). Krogh et al. provided an alternative probabilistic approach which used hidden Markov models (HMMs) to cluster protein sequences from the globin family into subfamilies. They fit a mixture of HMMs (which is itself a special kind of HMM) using maximum likelihood methods.
The results of these experiments were promising for this particular example, yielding clusters that correspond to known globin subfamilies. Little work has followed up on this area. Methods for automatically clustering sequences into hypothesized classes will be increasingly useful as the amounts of sequence and structural data continue to grow. An important issue that must be addressed in any clustering method is the question of how many clusters to use. Bayesian statistics can provide a solution to model selection questions of this kind. Within the Bayesian framework, an elegant alternative approach is to assume that the data was in fact generated from an infinite number of Gaussian clusters. Any actual clusters in the protein sequence data will surely not be Gaussian distributed (we discuss below how one can derive vectorial representations of sequences so that questions about Gaussianity are well-defined). Infinite mixtures are a sensible way to capture the fact that we do not really believe that protein sequence data is well modeled by a finite number of Gaussians. An infinite Gaussian
mixture model can readily model a finite number of non-Gaussian clusters. Finally, in an infinite Gaussian mixture model there is no need to make arbitrary choices about how many clusters there are in the data; nevertheless, after modeling one can ask questions such as how probable it is that two protein sequences or structures belong to the same cluster. We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies etc., based on the theory of infinite mixtures [8]. This theory is based on the observation that the mathematical limit of an infinite number of components in an ordinary finite mixture model (i.e. clustering model) corresponds to a Dirichlet process prior. Such a prior allows the data itself to dictate how many mixture components are required to model it: a diverse family may require several components, whereas a simpler family may require only one. Although in theory the infinite mixture has an infinite number of parameters, it is, surprisingly, possible to sample from these infinite mixture models efficiently, since only the parameters of a few of the components need to be represented. The theory of infinite mixture models is laid out by Rasmussen [8], who showed that the procedure works effectively with mixtures of Gaussians. It has since been applied to the clustering of gene expression profiles by Medvedovic and Sivaganesan [11].
2 Infinite Gaussian Mixture Models
One commonly used computational method of non-hierarchical clustering, based on measuring the Euclidean distance between feature vectors, is the k-means algorithm. However, the k-means algorithm is inadequate for describing clusters of unequal size or shape. A generalization of k-means can be derived from the theory of maximum likelihood estimation of Gaussian mixture models. In a Gaussian mixture model, the data (e.g. features of protein sequences or gene expression profiles, arranged into p-dimensional vectors y) is assumed to have been generated from a finite number k of Gaussians, P(y) = Σ_{j=1}^{k} φ_j P_j(y), where φ_j is the mixing proportion for cluster j (the fraction of the population belonging to cluster j; Σ_j φ_j = 1, φ_j ≥ 0) and P_j(y) is a multivariate Gaussian distribution with mean μ_j and covariance matrix Σ_j. The clusters can be found by fitting the maximum likelihood Gaussian mixture model as a function of the set of parameters θ = {φ_j, μ_j, Σ_j}_{j=1}^{k} using the EM algorithm [12]. Euclidean distance corresponds to assuming that the Σ_j are all equal multiples of the identity matrix. Starting from a finite mixture model, we define a prior over the mixing proportion parameters φ. The natural conjugate prior for
mixing proportions is the symmetric Dirichlet distribution:

P(φ_1, ..., φ_k | α) = Γ(α) / Γ(α/k)^k ∏_{j=1}^{k} φ_j^{α/k − 1},

where α controls the distribution of the prior weight assigned to each cluster, and Γ is the gamma function. We then explicitly include indicator variables c_i for each data point (i.e. protein sequence), which can take on integer values c_i = j, j ∈ {1, ..., k}, corresponding to the hypothesis that data point i belongs to cluster j. Under the mixture model, by definition, the prior probability is proportional to the mixing proportion: P(c_i = j | φ) = φ_j. A key observation is that we can compute the conditional probability of one indicator variable given the settings of all the other indicator variables, after integrating over all possible settings of the mixing proportion parameters:
P(c_i = j | c_{−i}, α) = (n_{−i,j} + α/k) / (n − 1 + α),    (1)

where c_{−i} is the setting of all indicator variables except the ith, n is the total number of data points, and n_{−i,j} is the number of data points belonging to class j, not including i. By Bayes' rule,

P(φ | c_{−i}, α) ∝ P(φ | α) P(c_{−i} | φ),    (2)

which is also a Dirichlet distribution, making it possible to perform the above integral analytically. We can now take the limit of k going to infinity, obtaining a Dirichlet process with differing conditional probabilities for clusters with and without data. For clusters where n_{−i,j} > 0:

P(c_i = j | c_{−i}, α) = n_{−i,j} / (n − 1 + α);

for all other clusters combined:

P(c_i ≠ c_{i′} for all i′ ≠ i | c_{−i}, α) = α / (n − 1 + α).

This shows that the probabilities are proportional to the occupation numbers n_{−i,j}. Using these conditional probabilities one can Gibbs sample the indicator variables efficiently, even though the model has infinitely many Gaussian clusters. Having integrated out the mixing proportions, one can also Gibbs sample all of the remaining parameters of the model, i.e. {μ_j, Σ_j}. The details of these procedures can be found in Rasmussen (2000) [8]. We have used infinite Gaussian mixtures to model protein sequence data with the intention of answering queries of the kind: what is the probability that two proteins belong to the same cluster? Unlike previous methods based on a single clustering of the data, this approach computes this probability while taking into account all sources of model uncertainty (including the number and locations of the clusters). We use the probability p_ij that two proteins i and j belong to the same cluster in the infinite mixture model as a measure of the similarity of these protein sequences. Conversely, 1 − p_ij defines a dissimilarity measure
which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering (see Figure 3). We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures and G-protein coupled receptor sequences.
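The Dirichlet-process conditionals above can be sketched as a single Gibbs sweep over the indicator variables. For brevity the likelihood terms P(y_i | μ_j, Σ_j) are omitted, so this samples from the prior only: an occupied cluster j is chosen with probability proportional to n_{−i,j}, a new cluster with probability proportional to α. Names and the toy initialization are ours:

```python
import numpy as np

def gibbs_sweep_prior(c, alpha, rng):
    """One Gibbs sweep over indicators under the Dirichlet-process prior."""
    c = list(c)
    for i in range(len(c)):
        # Occupation numbers n_{-i,j}, excluding point i itself.
        counts = {}
        for j, cj in enumerate(c):
            if j != i:
                counts[cj] = counts.get(cj, 0) + 1
        # Existing clusters plus one brand-new cluster label.
        labels = list(counts) + [max(c) + 1]
        weights = np.array([counts[l] for l in counts] + [alpha], dtype=float)
        c[i] = labels[rng.choice(len(labels), p=weights / weights.sum())]
    return c

rng = np.random.default_rng(1)
c = gibbs_sweep_prior([0, 0, 1, 1, 2], alpha=1.0, rng=rng)
```

Only the occupied clusters are ever represented explicitly, which is why sampling is feasible even though the model nominally has infinitely many components.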
3 Methods
To be able to cluster protein sequences, we need to obtain a vector representation of each protein in a suitable metric space. We use the Fisher score vector representation described by Jaakkola et al. [13], which provides an appropriate measure of similarity between sequences. The Fisher score vector for a particular protein X is obtained by evaluating the derivative of the log-likelihood with respect to the vector of parameters θ of a hidden Markov model (HMM) trained on the set of protein sequences: U_X = ∇_θ log P(X | θ). Each component of the vector U_X is the derivative of the log-likelihood for the sequence X with respect to a particular parameter (the emission probabilities of the HMM). In the work described below, we first train an HMM on the set of protein sequences of interest and then calculate the Fisher score vector as described above. In the case of sequences of known structure, we use the Bayesian network model of Raval et al. [14], which can be thought of as an extension of a hidden Markov model that incorporates multiple observations of primary sequence, secondary structure and residue solvent accessibility, the latter calculated from the three-dimensional coordinates by the DSSP method of Kabsch and Sander [15]. For all data sets, the dimensionality of the Fisher score vector was then reduced by principal components analysis, and we used this reduced-dimension vector as the input vector y to the infinite Gaussian mixture model. We used the first 10 principal components, which captured most of the variance in the U_X vectors. The mixture model was initialized with all data belonging to a single Gaussian, and a large number of Gibbs sampling sweeps were performed, updating all variables and parameters, i.e. {{μ_j, Σ_j}, {c_i}, α}, in turn by sampling from the conditional distributions derived in the previous sections and described in more detail in Rasmussen (2000) [8].
We typically run the chain for 110,000 iterations, discarding the initial 11,000 steps as “burn-in” and keeping every 1000th step after that, generating 100 roughly independent samples from the posterior distribution.
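The burn-in and thinning schedule just described can be sketched as an indexing operation over the chain's sweep indices (the counts below follow the text; with this schedule 99 samples are retained, i.e. roughly 100):

```python
# Sketch of the burn-in/thinning schedule described above: 110,000 Gibbs
# sweeps, discard the first 11,000 as burn-in, keep every 1000th thereafter.
n_iter, burn_in, thin = 110_000, 11_000, 1_000
kept = list(range(n_iter))[burn_in::thin]
print(len(kept), kept[0], kept[-1])  # 99 retained samples (roughly 100)
```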
4 Results
4.1 Globin Sequences
The mixture of HMMs method of Krogh et al. [5] discovered 7 clusters in a set of 628 globin sequences, corresponding to: 1. Class 1, 233 sequences: principally all α, with a few ζ (an α-type chain of mammalian embryonic hemoglobin), π/π′ (the counterpart of the α chain in major early embryonic hemoglobin P), and θ-1 chains (early erythrocyte α-like). 2. Class 2, 232 sequences: almost all β, with a few δ (β-like), ε (a β-type chain found in early embryos), γ (which combines with two α chains in fetal hemoglobin F), ρ (a major early embryonic β-type chain) and η chains (an embryonic β-type chain).
3. Class 3, 71 sequences: myoglobins.
4. Class 4, 58 sequences: the 13 highest scoring in this cluster were leghemoglobins. This class contained a variety of sequences, including 3 non-globins in the original data set. 5. Class 5, 19 sequences: midge globins. 6. Class 6, 8 sequences: globins from agnatha (jawless fish). 7. Class 7, 7 sequences: varied. Our results, using an updated version of the same data set (630 globin sequences, distributed with the HMMER2 software package), are shown in Figure 1. In this plot we show the number of times, out of 100 samples, that the indicator variables for two sequences were equal. As shown above, this may be interpreted as the probability p_ij that two proteins i and j belong to the same cluster. It is evident that our model has discovered a larger number of clusters than the method of Krogh et al. [5]. The granularity of this clustering is determined by the data and not by some user-defined threshold. Large solid blocks of color along the diagonal correspond to homogeneous clusters. Note that in our method, sequences may belong to more than one cluster with a defined probability: off-diagonal elements indicate 'cross-clustering'. For comparison, we also clustered the sequences using BLASTCLUST, which clusters the sequences according to a sequence identity threshold and a single linkage algorithm. With a 90% sequence identity threshold, 261 clusters were obtained. The first large homogeneous cluster in Figure 1 (bottom right hand corner) comprises 37 hemoglobin β sequences plus two δ sequences (HBD_COLPO and HBD_PANTR). Although a number of these sequences are contained within the same cluster in the BLASTCLUST output, indicating that they have > 90% sequence identity, we note that the clusters are by no means identical.
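The co-clustering probability p_ij described above (the fraction of posterior samples in which the indicator variables of proteins i and j agree) can be sketched as follows; the sample indicator vectors are made up for illustration:

```python
# Sketch: estimate p_ij, the probability that proteins i and j share a
# cluster, as the fraction of posterior samples in which their cluster
# indicator variables are equal. `samples` is a hypothetical list of
# per-sweep indicator vectors (one integer cluster label per protein).
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 1],
]
n = len(samples[0])
p = [[sum(s[i] == s[j] for s in samples) / len(samples) for j in range(n)]
     for i in range(n)]
print(p[0][1])  # proteins 0 and 1 share a cluster in 3 of 4 samples -> 0.75
```

Plotting the matrix p as a gray-scale image yields a figure of the kind shown in Figure 1, with homogeneous clusters appearing as solid blocks along the diagonal.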
The BLASTCLUST cluster containing many of these hemoglobin β sequences also contains 8 hemoglobin δ sequences and one hemoglobin β-2 chain (HBB2_PANLE). Figure 1 indicates that all sequences within this cluster also 'cross-cluster' with another group of β sequences with a probability of around 20-30%. The next cluster from the bottom right (Figure 1) contains all α sequences and cross-clusters with another group of α sequences with a probability of around 40-50%. Although a detailed analysis of these results is beyond the scope of this paper, we identify at least 11 distinct α and 13 distinct β clusters (plus some additional smaller ones). Although some of the variant sequences cluster with α and β sequences, we identify a number of clusters composed only of variant sequences: 3 clusters comprising only γ, ε and θ sequences, one cluster of δ and one cluster of ζ sequences. We identify 3 distinct clusters of leghemoglobins and 1 cluster of midge hemoglobins (6 sequences), a small cluster of fish hemoglobins and a small cluster comprising clam and earthworm sequences. Myoglobins, which Krogh et al. (1994) found in one cluster, form 10 distinct clusters, mainly comprising proteins from related species. BLASTCLUST groups these into 6 clusters plus 9 singletons at a 90% identity threshold. We identify only 11 singletons (proteins which never cluster with another), none of which are myoglobins. The largest cluster comprises 40 hemoglobin β sequences.
Figure 1: Clustering of the 630 globin sequences. The gray scale indicates the number of times, out of 100 samples, that the indicator variables for two sequences were equal, or the probability that two sequences belong to the same cluster
These results indicate that our method is capable of producing biologically meaningful results and correctly classifies the main globin subfamilies. In addition, it provides a finer level of clustering within these subfamilies than either the use of BLAST alignments and sequence identity or the method of Krogh et al. [5].
4.2 Globin Sequences of Known Structure
For this experiment we obtained globin sequences from the Structural Classification of Proteins (SCOP) database [16] using the ASTRAL resource^b. Sequences with > 95% sequence identity were excluded, leaving 91 proteins. According to the SCOP classification, these comprised representatives of 4 globin structural subfamilies (a.1.1.1: truncated hemoglobins (4 sequences); a.1.1.2: glycera globins, myoglobins, hemoglobin I, flavohemoglobins, leghemoglobins, hemoglobin α and β chains; a.1.1.3: phycocyanins, allophycocyanins, phycoerythrins; and a.1.1.4: nerve tissue mini-hemoglobin (1 sequence)). The sequences were clustered using feature vectors derived from two models: a sequence-only HMM and a Bayesian network model (structural HMM). The results are shown in Figure 2 and Figure 3. The results from the sequence-only clustering (Figure 2, left) show a similar pattern to those obtained with the 630 globin sequences. Fairly homogeneous clusters are mainly composed of related sequences, e.g. β hemoglobin chains, α hemoglobin chains, myoglobins, phycocyanin α and β, phycoerythrin α and β and allophycocyanin α and β chains (which all form separate clusters). Glycera globins form a separate cluster, as do leghemoglobins. Three or four heterogeneous (loosely associated) clusters are observed, which include truncated hemoglobins, hemoglobin I's, dehaloperoxidase etc. The results from the model which includes secondary structure and residue accessibility information show fewer clusters; 12 in all, plus two singletons (dehaloperoxidase and pig roundworm hemoglobin, domain 1) (Figure 2, right). Again α and β hemoglobin chains form distinct and fairly homogeneous clusters, as do the myoglobins, with the exception of 1MYT (a myoglobin which lacks the D helix), which clusters more strongly with β hemoglobins, as well as weakly with the myoglobin cluster, and 1MBA (a mollusc myoglobin), which clusters with clam hemoglobins and glycera globins from bloodworms.
Phycocyanins, allophycocyanins and phycoerythrins (which are all classified by SCOP into the same subfamily a.1.1.3) form two distinct large joint clusters. Within these clusters one can detect subfamilies corresponding to the allophycocyanins, phycoerythrins and phycocyanins, which cluster amongst themselves with a higher probability. Leghemoglobins cluster strongly with a single non-symbiotic plant hemoglobin from rice, and weakly with a clam hemoglobin I. Truncated hemoglobins, which SCOP classifies into a different subfamily (a.1.1.1), form two distinct clusters, and the sole member of subfamily a.1.1.4 (nerve tissue mini-hemoglobin) clusters with 1CH4 (chimeric synthetic hemoglobin beta-alpha). In comparison, 13 clusters are produced with BLASTCLUST only at a 29% sequence
^b http://astral.stanford.edu
identity threshold or lower. These comprise a single cluster for a.1.1.1, nine separate clusters for a.1.1.2 (including 4 singletons), a single cluster for a.1.1.3 and a singleton for a.1.1.4. Our results, which do not require a predefined threshold to be specified, reflect the underlying SCOP classifications, but the biologically meaningful sub-clusters also suggest that a further level of subfamily subdivision is possible.
Figure 2: Clustering of the 91 SCOP globin sequences: left, by sequence information only; right, with the inclusion of structural information. Sequence labels on the y-axis are ordered optimally for each plot.
Figure 3: Dendrogram representation of the clustering of the 91 SCOP globin sequences shown in Figure 2: left, by sequence information only; right, with the inclusion of structural information.
4.3 G-Protein Coupled Receptors (GPCRs)
According to the GPCRDB classification system [17], the G-protein coupled receptor (GPCR) superfamily is classified into 5 major classes: Class A (related to rhodopsin and adrenergic receptors), Class B (related to
calcitonin and PTH/PTHrP receptors), Class C (related to metabotropic receptors), Class D (related to pheromone receptors) and Class E (related to cAMP receptors). The classes share 20% sequence identity over predicted transmembrane helices [17]. Each class is further divided into level 1 subfamilies (e.g. Amine, Peptide, Opsin etc. for Class A) and further into level 2 subfamilies (Muscarinic, Histamine, Serotonin etc. for the Amine subfamily). A number of putative GPCRs have no identified natural ligand and are dubbed 'orphan' receptors. The sequence diversity of the GPCR classes makes subfamily classification a challenging problem. The problem of recognizing GPCR subfamilies is compounded by the fact that the subfamily classifications in GPCRDB are defined chemically (that is, according to the differential binding of ligands to the receptors) and not necessarily by either sequence similarity or the post ligand-receptor binding pathways. A number of other authors have described computational approaches to classifying GPCRs. Karchin et al. [18] trained 2-class support vector machines (SVMs) using Fisher score vectors derived from HMMs [13]. Joost and Methner [19] used a phylogenetic tree constructed by neighbor joining with bootstrapping. Lapinsh et al. [20] translated amino acid sequences into vectors based on the physicochemical properties of the amino acids and used an autocross-covariance transformation followed by principal components analysis (PCA) to classify GPCRs. For our experiments, sequences were obtained from the GPCRDB database [17]^c. Because of the smaller number of sequences in Classes B-E, we have focussed our analysis on Class A sequences. Our dataset comprised 946 sequences, of which 303 were "orphan" receptors, with no family classification. A portion of the clustering results using the infinite Gaussian mixture model are shown in Figure 4.
Because of the sequence diversity of this superfamily, a larger number of smaller clusters are evident around the diagonal than were observed with the globin sequences. Most of the homogeneous clusters (solid color) comprise sequences from the same subfamily (level 3 in the GPCRDB hierarchy), and appear to be orthologs of the same protein from related species. Whilst a detailed analysis of these is beyond the scope of the present paper, as an illustration we note that the largest cluster (bottom right hand corner) comprises Rhodopsin (Rhodopsin Vertebrate type 1) sequences from mammals and reptiles (plus lamprey), whilst the second cluster is composed entirely of fish Rhodopsins. Some unexpected associations also appear. Although in some cases our results indicate assignments for certain orphan receptors which agree with those of the authors cited above, in other cases our predictions are novel. A detailed analysis of these will be published in an extended version of this paper.
^c http://www.gpcr.org
Figure 4: Part of the clustering of the GPCR Class A sequences.
5 Discussion
The consistency of the clusters we obtain with a well-annotated superfamily of protein sequences such as the globins gives us confidence that our method is producing biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. Homogeneous clusters tend to consist of orthologs of the same protein, and paralogs appear to be separated into distinct clusters. This pattern appears to be repeated in our clustering of the GPCR sequences, with the potential of providing functional annotations for certain orphan receptors. Whilst some of these agree with predictions derived from neighbor-joining phylogenetic trees and principal component analysis, a number are novel. In all cases, our method provides a finer level of granularity than the method of Lapinsh et al. [20], clustering orphan receptors with members of particular GPCRDB subfamilies, rather than a broad family classification. With the inclusion of secondary structure and residue solvent accessibility information in the HMM on which our method is based, the clustering of the SCOP globin sequences changes from a large number of small clusters of functionally related sequences to a smaller number of clusters, in which the members of the SCOP globin families are clearly separated. However, once again we achieve an even finer level of classification, clearly separating α, β and myoglobins, as well as other members of SCOP class a.1.1.2. This suggests that our method also has the potential to provide a novel automated method for the structural classification of proteins. In order to achieve a large scale clustering of sequence or structure space we will investigate the use of Fisher scores obtained from a "mixture model" which combines individual models for different superfamilies as described in [14].
Acknowledgments
This work was supported by the National Institutes of Health (NIH) under Grant Number 1 PO1 GM63208. CER was supported by the German Research Council (DFG) through grant RA 1030/1.
References
1. A.J. Enright and C.A. Ouzounis, Bioinformatics 16, 451-457 (2000)
2. G. Yona, N. Linial and M. Linial, Proteins 37, 360-378 (1999)
3. A. Krause and M. Vingron, Bioinformatics 14, 430-438 (1998)
4. F. Abascal and A. Valencia, Bioinformatics 18, 908-921 (2002)
5. A. Krogh, M. Brown, I.S. Mian, K. Sjolander and D. Haussler, J. Mol. Biol. 235, 1501-1531 (1994)
6. Y. Barash and N. Friedman, J. Comput. Biol. 9, 161-191 (2002)
7. S. Richardson and P. Green, J. Roy. Stat. Soc. B 59, 731-792 (1997)
8. C.E. Rasmussen, in Advances in Neural Information Processing Systems 12, ed. S.A. Solla, T.K. Leen and K.-R. Muller (MIT Press, 2000)
9. C.E. Antoniak, Annals of Statistics 2, 1152-1174 (1974)
10. R.M. Neal, J. Comp. and Graphical Statistics 9, 249-265 (2000)
11. M. Medvedovic and S. Sivaganesan, Bioinformatics 18, 1194-1206 (2002)
12. G. McLachlan and D. Peel, Finite Mixture Models (Wiley, New York, 2000)
13. T. Jaakkola, M. Diekhans and D. Haussler, J. Comput. Biol. 7, 95-114 (2000)
14. A. Raval, Z. Ghahramani and D.L. Wild, Bioinformatics 18, 788-801 (2002)
15. W. Kabsch and C. Sander, Biopolymers 22, 2577-2637 (1983)
16. A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, J. Mol. Biol. 247, 536-540 (1995)
17. F. Horn, J. Weare, M.W. Beukers, S. Hoersch, A. Bairoch, W. Chen, O. Edvardsen, F. Campagne and G. Vriend, Nucleic Acids Res. 26, 277-281 (1998)
18. R. Karchin, K. Karplus and D. Haussler, Bioinformatics 18, 147-159 (2002)
19. P. Joost and A. Methner, Genome Biol. 3, RESEARCH0063 (2002)
20. M. Lapinsh, A. Gutcaits, P. Prusis, C. Post, T. Lundstedt and J.E. Wikberg, Protein Sci. 11, 795-805 (2002)
ACCURATE CLASSIFICATION OF PROTEIN STRUCTURAL FAMILIES USING COHERENT SUBGRAPH ANALYSIS
J. HUAN¹, W. WANG¹, A. WASHINGTON¹, J. PRINS¹, R. SHAH², A. TROPSHA²
¹Department of Computer Science, ²The Laboratory for Molecular Modeling, Division of Medicinal Chemistry and Natural Products, School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599
Protein structural annotation and classification is an important problem in bioinformatics. We report on the development of an efficient subgraph mining technique and its application to finding characteristic substructural patterns within protein structural families. In our method, protein structures are represented by graphs where the nodes are residues and the edges connect residues found within a certain distance of each other. Application of subgraph mining to proteins is challenging for a number of reasons: (1) protein graphs are large and complex, (2) current protein databases are large and continue to grow rapidly, and (3) only a small fraction of the frequent subgraphs among the huge pool of all possible subgraphs could be significant in the context of protein classification. To address these challenges, we have developed an information theoretic model called coherent subgraph mining. From information theory, the entropy of a random variable X measures the information content carried by X, and the mutual information (MI) between two random variables X and Y measures the correlation between X and Y. We define a subgraph X as coherent if it is strongly correlated with every sufficiently large sub-subgraph Y embedded in it. Based on the MI metric, we have designed a search scheme that reports only coherent subgraphs. To determine the significance of coherent protein subgraphs, we have conducted an experimental study in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database (Murzin et al., 1995). The Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme. We find that this approach identifies spatial motifs unique to individual SCOP families and affords excellent discrimination between families.
1 Introduction
1.1 Spatial Motif Discovery in Proteins
Recurring substructures in proteins reveal important information about protein structure and function. For instance, common structural fragments may represent fixed 3D arrangements of residues that correspond to active sites or other functionally relevant features such as Prosite patterns (Hofmann et al., 1999). Understanding recurring substructures in proteins aids in protein classification (Chakraborty et al. 1999), function prediction (Fischer et al. 1994), and folding (Kleywegt 1999).
Many computational methods have been proposed to find motifs in proteins. Multiple sequence alignments of proteins with similar structural domains (Henikoff et al., 1999) can be used to provide information about possible common substructures, in the hope that conserved sequence patterns in a group of homologous proteins may have similar 3D arrangements. This method generally does not work well for proteins with low sequence similarity: structurally similar proteins can have sequence identities below 10%, far too low to propose any structural similarity on the basis of sequence comparison (Orengo & Taylor, 1996). Several research groups have addressed the problem of finding spatial motifs by using computational geometry/computer vision approaches. From the geometric point of view, a protein can be modeled as a set of points in R^3, and the problem of (pairwise) spatial motif finding can be formalized as that of finding the Largest Common Point (LCP) set (Akutsu et al. 1997). Many variations of this problem have been explored, including the approximate LCP problem (Chakraborty et al. 1999, Indyk et al. 1999) and LCP-α (finding a sufficiently large common point set of two sets of points, but not necessarily the maximal one) (Finn et al. 1997). Applying frequent subgraph mining techniques to find patterns in a group of proteins is a non-trivial task. The total number of frequent subgraphs for a set of graphs grows exponentially as the average graph size increases, as graphs become denser, as the number of node and edge labels decreases, and as the size of the recurring subgraphs increases (Huan et al. 2003). For instance, for a moderate protein dataset (about 100 proteins with an average of 200 residues per protein), the total number of frequent subgraphs can be extremely high (>> one million).
Since the underlying operation of subgraph isomorphism testing is NP-complete, it is critical to minimize the number of frequent subgraphs that must be analyzed. In order to apply the graph-based spatial motif identification method to proteins, we have developed a novel information theoretic model called coherent subgraphs. A graph G is coherent if it is strongly correlated with every sufficiently large subgraph embedded in it. As discussed in the following parts of this report, coherent subgraphs capture discriminative features and afford high accuracy of protein structural classification.
1.2 Related Work
Finding patterns in graphs has long been a topic of interest in the data mining/machine learning community. For instance, Inductive Logic Programming (ILP) has been widely used to find patterns in graph datasets (Dehaspe 1998). However, ILP is not designed for large databases. Other methods have focused on approximation techniques such as SUBDUE (Holder 1994) or heuristics such as greedy algorithms (Yoshida and Motoda, 1995).
Several algorithms have been developed in the data mining community to find all frequent subgraphs of a group of general graphs (Kuramochi and Karypis 2001, Yan and Han 2002, Huan et al. 2003). These techniques have been successfully applied in cheminformatics, where compounds are modeled by undirected graphs. Recurring substructures in a group of chemicals with similar activity are identified by finding frequent subgraphs in their related graphical representations. The recurring substructures can implicate chemical features responsible for compounds' biological activities (Deshpande et al. 2002). Recent subgraph mining algorithms can be roughly classified into two categories. Algorithms in the first category use a level-wise search scheme like Apriori (Agrawal and Srikant, 1994) to enumerate the recurring subgraphs. Examples of such algorithms include AGM (Inokuchi et al. 2000) and FSG (Kuramochi and Karypis 2001). Instead of performing a level-wise search, algorithms in the second category use a depth-first enumeration of frequent subgraphs (Yan and Han 2002, Huan et al. 2003). A depth-first search usually has better memory utilization and thus better performance. As reported by Yan and Han (2002), a depth-first search can outperform FSG, the current state-of-the-art level-wise search scheme, by an order of magnitude overall. All of the above methods rely on a single threshold to qualify interesting patterns. Herein, we propose the coherent subgraph model, which uses a statistical metric to qualify interesting patterns. This leads to computationally more efficient and more accurate classification. The remainder of the paper is organized as follows. Section 2 presents a formal basis for the coherent subgraph mining problem.
This includes the definition of the labeled graph and labeled graph database (Section 2.1), the canonical representation of graphs (Section 2.2), the coherent subgraph mining problem, and our algorithm for efficient coherent subgraph mining (Section 2.3). Section 3 presents the results of an experimental study to classify protein structural families using the coherent subgraph mining approach and a case study of identifying fingerprints in the family of serine proteases. Finally, Section 4 summarizes our conclusions and discusses future challenges.
2 Methodology
2.1 Labeled Graph
We define a labeled graph G as a four-element tuple G = (V, E, Σ, l), where V is the set of nodes of G and E ⊆ V × V is the set of undirected edges of G. Σ is a set of labels, and the labeling function l: V ∪ E → Σ maps nodes and edges in G to their labels. The same label may appear on multiple nodes or on multiple edges, but we require that the set of edge labels and the set of node labels are disjoint. For our purposes we assume that there is a total order ≥ associated with the label set Σ.
A labeled graph G = (V, E, Σ, l) is isomorphic to another graph G' = (V', E', Σ', l') iff there is a bijection f: V → V' such that ∀ u ∈ V, l(u) = l'(f(u)), and ∀ u, v ∈ V, ((u, v) ∈ E ⇔ (f(u), f(v)) ∈ E') ∧ (l(u, v) = l'(f(u), f(v))). The bijection f denotes an isomorphism between G and G'. A labeled graph G = (V, E, Σ, l) is an induced subgraph of a graph G' = (V', E', Σ', l') iff V ⊆ V', E ⊆ E', ∀ u, v ∈ V, ((u, v) ∈ E' ⇒ (u, v) ∈ E), ∀ u ∈ V, l(u) = l'(u), and ∀ (u, v) ∈ E, l(u, v) = l'(u, v). A labeled graph G is induced subgraph isomorphic to a labeled graph G', denoted by G ⊑ G', iff there exists an induced subgraph G'' of G' such that G is isomorphic to G''. Examples of labeled graphs, induced subgraph isomorphism, and frequent induced subgraphs are presented in Figure 1.
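The induced subgraph isomorphism test defined above can be sketched by brute force for tiny graphs; the graphs and labels below are hypothetical (not the exact graphs of Figure 1), and the factorial-time search is illustrative only, since the test is NP-complete in general.

```python
from itertools import permutations

# A graph is (node_labels, edge_labels): node_labels maps node id -> label,
# edge_labels maps frozenset({u, v}) -> label. Absent keys mean "no edge".

def induced_subgraph_isomorphic(g, gp):
    """Brute-force test of G ⊑ G' (G induced-subgraph-isomorphic to G')."""
    nodes, edges = g
    nodes_p, edges_p = gp
    for mapping in permutations(nodes_p, len(nodes)):
        f = dict(zip(nodes, mapping))
        # Node labels must agree under the candidate mapping f.
        if any(nodes[u] != nodes_p[f[u]] for u in nodes):
            continue
        # Induced: edge presence AND edge labels must agree for every pair.
        pairs = [(u, v) for u in nodes for v in nodes if u < v]
        if all(edges.get(frozenset((u, v))) ==
               edges_p.get(frozenset((f[u], f[v]))) for u, v in pairs):
            return True
    return False

# Hypothetical example graphs with the paper's label alphabet.
P = ({"p1": "a", "p2": "b", "p3": "c"},
     {frozenset(("p1", "p2")): "y", frozenset(("p1", "p3")): "x"})
Q = ({"q1": "a", "q2": "b"}, {frozenset(("q1", "q2")): "y"})
print(induced_subgraph_isomorphic(Q, P))  # True
```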
Figure 1. (a) Examples of three labeled graphs (referred to as a graph database) and an induced subgraph isomorphism. The labels of the nodes are specified within the circles and the labels of the edges are specified along the edges. We assume the order a > b > c > d > x > y > 0 throughout this paper. The mapping q1 → p2, q2 → p1, q3 → p3 represents an induced subgraph isomorphism from graph Q to P. (b) All the frequent induced subgraphs with minSupport set to 2/3 for the graph database presented in (a).
Given a set of graphs GD (referred to as a graph database), the support of a graph G, denoted by sup_G, is defined as the fraction of graphs in GD which embed the subgraph G. Given a threshold t (0 < t ≤ 1) (denoted minSupport), we define G to be frequent iff sup_G is at least t. All the frequent induced subgraphs in the graph database GD presented in Figure 1(a) (with minSupport 2/3) are presented in Figure 1(b).
Throughout this paper, we use the term subgraph to denote an induced subgraph unless stated otherwise.
2.2 Canonical Representation of Graphs
We represent every graph G by an adjacency matrix M. Slightly differently from the adjacency matrix used for an unlabeled graph (Cormen et al., 2001), every diagonal entry of M represents a node in G and is filled with the label of the node. Every off-diagonal entry corresponds to a pair of nodes, and is filled with the edge label if there is an edge between these two nodes in G, or with zero if there is no edge. Given an n × n adjacency matrix M of a graph with n nodes, we define the code of M, denoted code(M), as the sequence of lower triangular entries of M (including the diagonal entries) in the order M_{1,1} M_{2,1} M_{2,2} ... M_{n,1} M_{n,2} ... M_{n,n-1} M_{n,n}, where M_{i,j} represents the entry at the ith row and jth column of M. The standard lexicographic ordering of sequences defines a total order on codes. For example, code "ayb" is greater than code "byb" since the first symbol in string "ayb" is greater than the first symbol in string "byb" (we use the order a > b > c > d > x > y > 0). For a graph G, we define the Canonical Adjacency Matrix (CAM) of G as the adjacency matrix that produces the maximal code among all adjacency matrices of G. Interested readers might verify that the adjacency matrix M1 in Figure 2 is the CAM of the graph P shown in Figure 1.
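A brute-force sketch of code(M) and CAM computation for a tiny labeled graph, under the label order assumed above; the example graph and its labels are hypothetical, and the factorial-time search over node permutations is only feasible for toy inputs.

```python
from itertools import permutations

# Label order assumed in the text: a > b > c > d > x > y > 0.
ORDER = {c: i for i, c in enumerate("0yxdcba")}  # '0' smallest, 'a' largest

def code(matrix):
    """code(M): lower-triangular entries, row by row, diagonal included."""
    n = len(matrix)
    return "".join(matrix[i][j] for i in range(n) for j in range(i + 1))

def cam_code(node_labels, edge_labels):
    """Maximal code over all adjacency matrices (node orderings) of G."""
    n = len(node_labels)
    best = None
    for perm in permutations(range(n)):
        m = [["0"] * n for _ in range(n)]
        for i in range(n):
            m[i][i] = node_labels[perm[i]]
        for (i, j), lab in edge_labels.items():
            a, b = perm.index(i), perm.index(j)
            m[a][b] = m[b][a] = lab
        c = code(m)
        if best is None or [ORDER[ch] for ch in c] > [ORDER[ch] for ch in best]:
            best = c
    return best

# Hypothetical 3-node graph: node labels a, b, c; edges a-y-b and a-x-c.
print(cam_code(["a", "b", "c"], {(0, 1): "y", (0, 2): "x"}))  # "axcy0b"
```

Because x > y in the assumed order, the canonical matrix places the x-labeled edge earlier in the lower triangle, giving the code "axcy0b" rather than "aybx0c".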
Figure 2. Three examples of adjacency matrices. After applying the total ordering, we have code(M1) = "aybyxb0yxc00y0d" > code(M2) = "aybyxb00yd0yx0c" > code(M3) = "bxby0dxy0cyy00a".
Given an n × n matrix N and an m × m matrix M, we define N to be the maximal proper submatrix (MP submatrix for short) of M iff n = m − 1 and N_{i,j} = M_{i,j} (0 < i, j ≤ n). One of the nice properties of the canonical form we use (as compared to the one used in Inokuchi et al. 2000 and Kuramochi et al. 2001) is that, given a graph database GD, all the frequent subgraphs (represented by their CAMs) can be organized as a rooted tree. This tree is referred to as the CAM Tree of GD and is formally described as follows:
- The root of the tree is the empty matrix;
- Each node in the tree is a distinct frequent connected subgraph of GD, represented by its CAM;
- For a given non-root node (with CAM M), its parent is the graph represented by the MP submatrix of M.
Figure 3. Tree organization of all the frequent subgraphs of the graph database shown in Figure 1 (a)
2.3 Finding Patterns from a Labeled Graph Database
As mentioned earlier, the subgraph mining of protein databases presents a significant challenge because protein graphs are large and dense, resulting in an overwhelmingly large number of possible subgraphs (Huan et al. 2003). In order to select important features from the huge list of subgraphs, we have proposed a subgraph mining model based on mutual information, as explained below.
2.3.1 Mutual Information and Coherent Induced Subgraphs
We define a random variable X_G for a subgraph G in a graph database GD as follows: X_G = 1 with probability sup_G, and X_G = 0 with probability 1 − sup_G. Given a graph G and its subgraph G', we define the mutual information I(G, G') as follows:
I(G, G') = Σ_{X_G, X_G'} P(X_G, X_G') log ( P(X_G, X_G') / (P(X_G) P(X_G')) ),
where P(X_G, X_G') is the (empirical) joint probability distribution of (X_G, X_G'), which is defined as follows: P(X_G, X_G') = sup_G if X_G = 1 and X_G' = 1; 0 if X_G = 1 and X_G' = 0; sup_G' − sup_G if X_G = 0 and X_G' = 1; and 1 − sup_G' otherwise.
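Because every occurrence of G contains its subgraph G', the joint distribution above depends only on the two supports, so I(G, G') can be computed directly from them. A sketch (using base-2 logarithms, a choice the text does not fix):

```python
import math

def mutual_information(sup_g, sup_gp):
    """I(G, G') from sup_G and sup_G', where G' is a subgraph of G,
    so sup_gp >= sup_g and the (X_G=1, X_G'=0) cell has probability 0."""
    joint = {
        (1, 1): sup_g,
        (1, 0): 0.0,
        (0, 1): sup_gp - sup_g,
        (0, 0): 1.0 - sup_gp,
    }
    px = {1: sup_g, 0: 1.0 - sup_g}
    py = {1: sup_gp, 0: 1.0 - sup_gp}
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# If G' occurs exactly where G occurs, the two indicators are identical:
print(round(mutual_information(0.5, 0.5), 4))  # 1.0 bit
```

When sup_G' > sup_G the correlation weakens and I(G, G') drops toward 0, which is exactly what the k-coherence threshold t screens for.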
Given a threshold t (t > 0) and a positive integer k, a graph G is k-coherent iff ∀ G' ⊆ G with |G'| ≥ k, I(G, G') ≥ t, where |G'| denotes the number of nodes in G'. The Coherent Subgraph Mining problem is to find all the k-coherent subgraphs in a graph database, given a mutual information threshold t (t > 0) and a positive integer k. Our algorithm for mining coherent subgraphs relies on the following two well-known properties (Tan et al. 2002):
Theorem. For graphs P ⊑ Q ⊑ G, we have the inequalities I(P, G) ≤ I(P, Q) and I(P, G) ≤ I(Q, G).
The first inequality implies that every subgraph (with size ≥ k) of a k-coherent graph is itself k-coherent. This property enables us to integrate the k-coherent subgraph search into tree-based subgraph enumeration using available enumeration techniques (Yan and Han 2002, Huan et al. 2003). The second inequality suggests that, in order to tell whether a graph G is k-coherent or not, we only need to check all k-node subgraphs of G. This simplifies the search. In the following section, we discuss how to enumerate all connected induced subgraphs from a graph database. This work is based on the algebraic graph framework (Huan et al. 2003) for enumerating all subgraphs (not just induced subgraphs) from a graph database.
2.3.2 Coherent Subgraph Mining Algorithm

CSM
  input: a graph database GD, a mutual information threshold t (0 < t ≤ 1) and a positive integer k
  output: the set S of all of GD's coherent induced subgraphs
  P ← {all coherent subgraphs with size k in GD}
  S ← ∅
  CSM-Explore(P, S, t, k)

CSM-Explore
  input: a CAM list P, a mutual information threshold t (0 < t ≤ 1), a positive integer k, and a set S of coherent connected subgraphs' CAMs
  output: the set S containing the CAMs of all coherent subgraphs found so far
  for each X ∈ P:
    S ← S ∪ {X}
    C ← {Y | Y is a CAM and X is the MP submatrix of Y}
    remove non-k-coherent element(s) from C
    CSM-Explore(C, S, t, k)
  end
3 Experimental Study
3.1 Implementation and Test Platform
The coherent subgraph mining algorithm is implemented in the C++ programming language and compiled using g++ with -O3 optimization. The tests were performed using a single processor of a 2.0 GHz Pentium PC with 2 GB memory, running RedHat Linux 7.3. We used LIBSVM for protein family classification (further discussed in Section 3.4); the LIBSVM executable was downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
3.2 Protein Representation as a Labeled Graph
We model a protein by an undirected graph in which each node corresponds to an amino acid residue in the protein, with the residue type as the label of the node. We introduce a "peptide" edge between two residues X and Y if there is a peptide bond between X and Y, and a "proximity" edge if the distance between the two associated Cα atoms of X and Y is below a certain threshold (10 Å in our study) and there is no peptide bond between X and Y.^a
3.3 Datasets and Coherent Subgraph Mining
Three protein families from the SCOP database (Murzin et al., 1995) were used to evaluate the performance of the proposed algorithm under a binary (pairwise) classification scheme. SCOP is a domain expert maintained database, which hierarchically classifies proteins at five levels: Class, Fold, Superfamily, Family and individual proteins. The SCOP families included the Nuclear receptor ligand-binding domain (NRLB) family from the all-alpha proteins class, the Prokaryotic serine protease (PSP) family from the all-beta proteins class, and the Eukaryotic serine protease (ESP) family from the same class. Three datasets for the pairwise comparison and classification of the above families were then constructed: C1, including the NRLB and PSP families; C2, including the ESP and PSP families; and C3, including both eukaryotic and prokaryotic serine proteases (SP) and a random selection of 50 unrelated proteins (RP).
All the proteins were selected from the culled PDB list (http://www.fccc.edu/research/labs/dunbrack/pisces/culledpdb.html) with less than 60% sequence homology (resolution = 3.0, R factor = 1.0) in order to remove redundant sequences from the datasets. These three datasets are further summarized in Table 1. For each of the datasets, we ran the coherent subgraph identification algorithm. Thresholds ranging from 0.5 to 0.25 were tested; however, we only
Note that this graph representation provides a lot of flexibility for future studies, e.g. using a smaller number of residue classes or using additional edge labels.
report the results with threshold 0.3, which gave the best classification accuracy in our experiments.
3.4 Pair-wise Protein Classification Using Support Vector Machines (SVM)
Given a total of n coherent subgraphs f1, f2, ..., fn, we represent each protein G in a dataset as an n-element vector V = (v1, v2, ..., vn) in the feature space, where vi is the total number of distinct occurrences of the subgraph fi in G (zero if not present). We build the classification models using the SVM method (Vapnik 1998). There are several advantages of using SVM for the classification task in our context: 1) SVM is designed to handle sparse high-dimensional datasets (there are many features in the dataset and each feature may only occur in a small set of samples), and 2) there is a set of kernel functions (such as linear, polynomial and radial basis) we could choose from, depending on the properties of the dataset. Table 1 summarizes the results of the three classification experiments and the average five-fold cross-validation total classification accuracy [i.e., (TP + TN)/N, where TP stands for true positives, TN stands for true negatives, and N is the total number of testing samples]. In order to address the problem of possible over-fitting in the training phase, we created artificial datasets with exactly the same attributes but randomly permuted class labels. This is typically referred to as the Y-randomization test. The classification accuracy for the randomized datasets was significantly lower than for the original datasets (data not shown) and hence we concluded that there is no evidence of over-fitting in our models.
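The classification setup can be sketched as follows. This is our own illustration on fabricated toy data, using scikit-learn's SVC (which wraps LIBSVM) rather than the libsvm command-line tools the paper used; each row of the feature matrix stands for a protein and each column for the occurrence count of one mined subgraph.

```python
# Sketch of the classification setup described above, on fabricated toy counts.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_proteins, n_features = 40, 200

# Toy counts: class-1 proteins tend to contain the first 20 subgraphs.
X = rng.poisson(0.2, size=(n_proteins, n_features)).astype(float)
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :20] += rng.poisson(2.0, size=(20, 20))

clf = SVC(kernel="linear", C=1.0)              # C-SVM with linear kernel
acc = cross_val_score(clf, X, y, cv=5).mean()  # five-fold cross-validation

# Y-randomization control: permute labels and expect near-chance accuracy.
y_perm = rng.permutation(y)
acc_perm = cross_val_score(clf, X, y_perm, cv=5).mean()
print(acc, acc_perm)
```

On such clearly separable toy data the cross-validated accuracy is near 1, while the permuted-label control drops toward chance, mirroring the Y-randomization argument above.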
Dataset   Class A   Total # Proteins   Class B   Total # Proteins   Features   Time (sec.)   Accuracy (%)
C1        PSP       9                  NRLB      13                 40274      240           96
C2        PSP       9                  ESP       35                 34697      450           93
C3        SP        44                 RP        50                 42265      872           95
Table 1. Accuracy of classification tasks C1, C2, C3. We used the C-SVM classification model with the linear kernel and left the other parameters at their default values. The first columns give basic information about each dataset. SP - serine proteases; PSP - prokaryotic SP; ESP - eukaryotic SP; NRLB - nuclear receptor ligand-binding proteins; RP - random proteins. The Features column records the total number of features mined by CSM and the Time column records how much CPU time was spent on the mining task. The last column gives the five-fold cross-validation accuracy.
3.5 Identification of Fingerprints for the Serine Protease Family Features found for the task C3 in Table 1 were analyzed to test the ability of the CSM method to identify recurrent sequence-structure motifs common to particular protein families; we used serine proteases as a test case. For every coherent subgraph, we can easily define an underlying elementary sequence motif similar to Prosite patterns as:
M = { AAp, d1, AAq, d2, AAr, d3, AAs }, where AA is the residue type, p, q, r and s are residue numbers in a protein sequence, and d1 = q-p-1, d2 = r-q-1, d3 = s-r-1, i.e., the sequence separation distances. We selected a subset of discriminative features from the mined features such that every feature occurs in at least 80% of the proteins in the SP family and in less than 10% of the proteins of the RP family. For each occurrence of such features, the sequence distances were analyzed. Features with conserved sequence separation were used to generate consensus sequence motifs. We found that some of our spatial motifs correspond to serine protease sequence signatures from the Prosite database. An example (G1) of such a spatial motif and its corresponding sequence motif C-x(12)-A-x-H-C (where x is any residue and the number in parentheses is the length of the sequence separation) is shown in Fig. 4. This example demonstrates that the spatial motifs found by subgraph mining can capture features that correspond to motifs with known utility in identifying protein families. The spatial motif G2, which was also highly discriminative, occurs in SP proteins at a variety of positions, with varying separations between the residues. Such patterns seem to defy a sequence-level description, raising the possibility that spatial motifs can capture features beyond those described at the sequence level.
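Turning one occurrence of a four-residue spatial motif into the elementary sequence pattern M can be sketched as follows (an illustrative construction; the `occurrence` format, residue type plus sequence position in order, is our assumption):

```python
# Sketch of turning one occurrence of a four-residue spatial motif into the
# elementary sequence pattern M defined above. `occurrence` is a hypothetical
# format: the (residue_type, position) pairs in sequence order.

def sequence_motif(occurrence):
    """Return residue types and separation distances d_i = next - prev - 1."""
    types = [t for t, _ in occurrence]
    pos = [i for _, i in occurrence]
    seps = [pos[k + 1] - pos[k] - 1 for k in range(len(pos) - 1)]
    return types, seps

def prosite_like(types, seps):
    """Render as a Prosite-style string, e.g. C-x(12)-A-x-H-C."""
    parts = [types[0]]
    for t, d in zip(types[1:], seps):
        if d == 1:
            parts.append("x")
        elif d > 1:
            parts.append("x(%d)" % d)
        parts.append(t)
    return "-".join(parts)

# The G1 example from the text: C, A, H, C with separations 12, 1, 0.
occ = [("C", 10), ("A", 23), ("H", 25), ("C", 26)]
types, seps = sequence_motif(occ)
print(prosite_like(types, seps))   # -> C-x(12)-A-x-H-C
```

The residue positions here are invented; any occurrence with the same separations yields the same pattern, which is exactly what "conserved sequence separation" exploits.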
Figure 4: Two discriminative features that appear very frequently in the SP family but are infrequent in the RP family. Left: the graphical representation of the two subgraphs (with the residue type specified within each circle). A dotted line represents a proximity edge and a solid line represents a peptide edge. Right: the 3D occurrences of G1 (right) and G2 (left) within the backbone of one of the serine proteases, Human Kallikrein 6 (Hk6).
4
Conclusions and Future Work
We have developed a novel coherent subgraph mining approach and applied it to the problem of protein structural annotation and classification. As a proof of concept, characteristic subgraphs have been identified for three protein families from the SCOP database, i.e., eukaryotic and prokaryotic serine proteases and nuclear receptor ligand-binding proteins. Using a Support Vector Machine binary
classification algorithm, we have demonstrated that coherent subgraphs can serve as unique structural family identifiers that discriminate one family from another with high accuracy. We have also shown that some of the subgraphs can be transformed into sequence patterns similar to Prosite motifs, allowing their use in the annotation of protein sequences. The coherent subgraph mining method advanced in this paper affords a novel automated approach to protein structural classification and annotation, including possible annotation of orphan protein structures and sequences resulting from genome sequencing projects. We are currently expanding our research to include all protein structural families and to employ multi-family classification algorithms to afford a global classification of the entire protein databank.
Acknowledgments The authors would like to thank Prof. Jack Snoeyink and Deepak Bandyopadhyay for many helpful discussions.
References
1. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. of the 20th Int. Conf. on Very Large Databases (VLDB), 487-499 (1994)
2. T. Akutsu, H. Tamaki and T. Tokuyama, "Distribution of distances and triangles in a point set and algorithms for computing the largest common point sets", Proc. 13th Annual ACM Symp. on Computational Geometry, 314-323 (1997)
3. S. Chakraborty and S. Biswas, "Approximation Algorithms for 3-D Common Substructure Identification in Drug and Protein Molecules", Workshop on Algorithms and Data Structures, 253-264 (1999)
4. T. H. Cormen, C. E. Leiserson and R. L. Rivest, Introduction to Algorithms, (MIT Press, 2001)
5. L. Dehaspe, H. Toivonen and R. D. King, "Finding frequent substructures in chemical compounds", Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, 30-36 (1998)
6. M. Deshpande, M. Kuramochi and G. Karypis, "Frequent Sub-Structure-Based Approaches for Classifying Chemical Compounds", Proc. of the 8th International Conference on Knowledge Discovery and Data Mining (2002)
7. P. W. Finn, L. E. Kavraki, J. Latombe, R. Motwani, C. R. Shelton, S. Venkatasubramanian and A. Yao, "RAPID: Randomized Pharmacophore Identification for Drug Design", Symposium on Computational Geometry, 324-333 (1997)
8. D. Fischer, H. Wolfson, S. L. Lin and R. Nussinov, "Three-dimensional, sequence order-independent structural comparison of a serine protease against the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding", Protein Sci. 3, 769-778 (1994)
9. S. Henikoff, J. Henikoff and S. Pietrokovski, "Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations", Bioinformatics, 15(6):471-9 (1999)
10. K. Hofmann, P. Bucher, L. Falquet and A. Bairoch, "The PROSITE database, its status in 1999", Nucleic Acids Res. 27(1):215-9 (1999)
11. L. B. Holder, D. J. Cook and S. Djoko, "Substructure discovery in the SUBDUE system", Proc. AAAI'94 Workshop on Knowledge Discovery in Databases, 169-180 (1994)
12. J. Huan, W. Wang and J. Prins, "Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism", Proc. of the 3rd International Conference on Data Mining (2003)
13. P. Indyk, R. Motwani and S. Venkatasubramanian, "Geometric Matching Under Noise: Combinatorial Bounds and Algorithms", ACM Symposium on Discrete Algorithms (1999)
14. A. Inokuchi, T. Washio and H. Motoda, "An Apriori-based algorithm for mining frequent substructures from graph data", Proc. of the 4th European Conf. on Principles and Practices of Knowledge Discovery in Databases, 13-23 (2000)
15. G. J. Kleywegt, "Recognition of spatial motifs in protein structures", J. Mol. Biol. 285(4):1887-97 (1999)
16. M. Kuramochi and G. Karypis, "Frequent subgraph discovery", Proc. of the 1st International Conference on Data Mining (2001)
17. A. G. Murzin, S. E. Brenner, T. Hubbard and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures", J. Mol. Biol. 247, 536-540 (1995)
18. C. A. Orengo and W. R. Taylor, "SSAP: Sequential Structure Alignment Program for Protein Structure Comparison", Methods in Enzymol. 266: 617-643 (1996)
19. P. Tan, V. Kumar and J. Srivastava, "Selecting the right interestingness measure for association patterns", Proc. of the Eighth ACM International Conference on Knowledge Discovery and Data Mining (2002)
20. V. Vapnik, Statistical Learning Theory, (John Wiley, 1998)
21. X. Yan and J. Han, "gSpan: Graph-based substructure pattern mining", Proc. of the 2nd International Conference on Data Mining (2002)
22. K. Yoshida and H. Motoda, "CLIP: Concept learning from inference patterns", Artificial Intelligence, 75(1):63-92 (1995)
IDENTIFYING GOOD PREDICTIONS OF RNA SECONDARY STRUCTURE
M. E. NEBEL
Johann Wolfgang Goethe-Universität, Institut für Informatik, 60325 Frankfurt am Main, Germany
Abstract
Predicting the secondary structure of RNA molecules from the knowledge of the primary structure (the sequence of bases) is still a challenging task. There are algorithms that provide good results, e.g. based on the search for an energetically optimal configuration. However, the output of such algorithms does not always give the real folding of the molecule, and therefore a feature to judge the reliability of the prediction would be appreciated. In this paper we present results on the expected structural behavior of LSU rRNA derived using a stochastic context-free grammar and generating functions. We show how these results can be used to judge the predictions made for LSU rRNA by any algorithm. In this way it is possible to identify those predictions which are close to the natural folding of the molecule with a probability of success of 97%.
1
Introduction and Basic Definitions
A ribonucleic acid (RNA) molecule consists of a chain of nucleotides (of which there are four different types). Each nucleotide consists of a base, a phosphate group and a sugar group. The various types of nucleotides differ only in the base involved; there are four choices for the base, namely adenine (A), cytosine (C), guanine (G) and uracil (U). The specific sequence of the bases along the chain is called the primary structure of the molecule. It is usually modeled as a word over the alphabet {A, C, G, U}. Through the creation of hydrogen bonds, the complementary bases A and U (resp. C and G) form stable base pairs with each other. Additionally, there is the weaker G-U pair, where the bases bind in a skewed fashion. Due to these base pairs, the linear chain is folded into a three-dimensional conformation called the tertiary structure of the molecule. For some types of RNA molecules, like transfer RNA, the tertiary structure is highly connected with the function of the molecule. Since experimental approaches which allow the discovery of the tertiary structure are quite expensive, biologists are looking for methods to predict the tertiary structure from the knowledge of the primary structure. It is common practice to consider the simplified secondary structure of the molecule, where we restrict the possible base pairs such
that only planar structures occur. So far, several algorithms for the prediction of secondary structures using rather different ideas have been presented. However, the output of such algorithms cannot be assumed to be error-free, so they might predict a wrong folding of a molecule. A tool to quantify the reliability of a prediction would therefore be helpful. In this paper we propose to use a statistical filter which compares structural parameters of the predicted molecule with those of an expected molecule of the same type and the same size (number of nucleotides/bases), and we show that such a filter offers good results. In the literature one finds many different results dealing with the expected structure of RNA molecules. Waterman [6] gave the first formal framework for secondary structures. Later on, some authors considered the combinatorial and the Bernoulli model of RNA secondary structures (where the molecule is modeled as a certain kind of planar graph) and derived numerous results such as the average size and number of hairpins and bulges, the number of ladders, the expected order of a structure and its distribution, or the distribution of unpaired bases (see [8,9,10,11]). In [11] it was pointed out (by comparison to real-world data) that both models are rather unrealistic and thus the corresponding results can hardly be used for our purposes. In this paper we sketch one possible way to construct a realistic model for RNA secondary structures which allows us to derive the corresponding expectations, variances and all other higher moments to be used according to our ideas. In the rest of this paper we assume that the reader is familiar with the basic notions of formal language theory such as context-free grammars, derivation trees, etc. A helpful introduction to the theory can be found in [12].
We refer to [13] for a related introduction. Besides modeling a secondary structure as a planar graph, a slightly different approach is to model it using stochastic context-free grammars, as proposed in [14]. A stochastic context-free grammar (SCFG) is a 5-tuple G = (I, T, R, S, P), where I (resp. T) is an alphabet (finite set) of intermediate (resp. terminal) symbols (I and T are disjoint), S ∈ I is a distinguished intermediate symbol called the axiom, R ⊆ I × (I ∪ T)* is a finite set of production rules and P is a mapping from R to [0,1] such that each rule f ∈ R is equipped with a probability p_f := P(f). The probabilities are chosen in such a way that for all A ∈ I the equality ∑_{f ∈ R} p_f δ_{Q(f),A} = 1 holds. Here δ is Kronecker's delta and Q(f) denotes the source of the production f, i.e. the first component A of a production rule (A, α) ∈ R. In the sequel we will write p_f : A → α instead of f = (A, α) ∈ R, p_f = P(f). In information theory SCFGs were introduced as a device for producing a language together with a corresponding
probability distribution (see e.g. [15,16]). Words are generated in the same way as for usual context-free grammars; the product of the probabilities of the production rules used provides the probability of the generated word. Note that this does not always provide a probability distribution for the language. However, there are sufficient conditions which allow us to check whether or not a given grammar provides a distribution. At first, researchers were interested in parameters like the moments of the word and derivation lengths [17] or the moments of certain subwords. Furthermore, they looked for the existence of standard forms for SCFGs, such as Chomsky normal form or Greibach normal form, in order to simplify proofs. Some authors used the ideas of Schützenberger [20] to translate the corresponding grammars into probability generating functions to derive their results. However, languages resp. grammars were not used to model any sort of combinatorial object besides languages themselves, and therefore the question of how to determine probabilities was not asked. In computational biology, SCFGs are used as a model for RNA secondary structures. In contrast to information theory, not only the words generated by the grammar are used, but also the corresponding derivation trees are taken into consideration: a word generated by the grammar is identified with the primary structure of an RNA molecule, and its derivation tree is considered as the related secondary structure. Note that there exists a one-to-one correspondence between the planar graphs used by Waterman as a model for RNA secondary structures and a certain kind of unary/binary trees (see e.g. [10]). Thus the major impact of using SCFGs is given by the way in which probabilities are generated. Since a single primary structure can have numerous secondary structures, an ambiguous SCFG is the right choice. The probabilities of such a grammar can be trained from a database.
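The SCFG definition above can be made concrete with a small sketch (an illustrative data structure, not code from the paper): productions are stored per intermediate symbol, the normalization condition that the probabilities of all rules with the same source sum to 1 is checked, and the probability of a word is the product of the probabilities of the rules used in its derivation.

```python
from math import isclose

# Minimal sketch of an SCFG as defined above: each production A -> alpha
# carries a probability, and for every intermediate symbol A the
# probabilities of all productions with source A must sum to 1.
# The toy grammar here generates runs of '|' (a single-stranded region).

grammar = {
    # source symbol: list of (right-hand side, probability)
    "C": [(("C", "|"), 0.8), ((), 0.2)],  # C -> C| with 0.8, C -> eps with 0.2
}

def is_normalized(g):
    return all(isclose(sum(p for _, p in rules), 1.0) for rules in g.values())

def word_probability(derivation):
    """Probability of a word = product of probabilities of the rules used."""
    prob = 1.0
    for p in derivation:
        prob *= p
    return prob

# Deriving "||" uses C -> C| twice and C -> eps once: 0.8 * 0.8 * 0.2
print(is_normalized(grammar), word_probability([0.8, 0.8, 0.2]))
```

As the text notes, the per-symbol normalization alone does not guarantee that the word probabilities sum to 1 over the whole language; this sketch only checks the local condition.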
The algorithms applied for this purpose are generalizations of the forward/backward algorithm used in the context of hidden Markov models [2,21] and are also applied in linguistics, where one usually works with ambiguous grammars, too. At the end of the training, the most probable derivation tree of a primary structure in the database equals the secondary structure given by the database. Applications were found in the prediction of RNA secondary structure, where the most probable derivation tree is assumed to be the secondary structure belonging to the primary structure processed by the algorithm. So far, no one has used these grammars to derive structural results, which in the case of an ambiguous grammar is obvious, since it is impossible to find any sense in such results. In Section 2 we provide the link between SCFGs and the mathematical research on RNA. We use non-ambiguous stochastic context-free grammars to model the secondary structures. This is done by disregarding the primary structure and representing the secondary structure as a certain kind of Motzkin language (i.e. a language over the alphabet {(, ), |} which encodes unary/binary trees equivalent to the secondary structure), which now is the language generated by the grammar. After training the SCFG, it is used to derive probability generating functions which enable us to draw quantitative conclusions about the expected shape of RNA secondary structures. These results will be the basis for our quantitative judgement of predictions. In order to train the grammar we derived a database of Motzkin words which correspond one-to-one to the secondary structures contained in the databases of Wuyts et al. We have also used the databases of Brown for RNase P sequences and of Sprinzl et al. for tRNA molecules; the corresponding results are not reported here due to lack of space.
2
The Expected Structure of rRNA Molecules
In this section we present our results concerning the expected structure of rRNA molecules, with only a few comments on how they were derived; technical details can be found in [25]. As described in the first section, we used an SCFG whose probabilities were trained on all entries of the database of Wuyts et al. in order to derive our results. This grammar can easily be translated into an equivalent probability generating function according to the ideas of Schützenberger [20]. From those generating functions we derived expected values for structural parameters of large subunit (LSU) ribosomal RNA molecules, such as the average number and length of hairpin-loops or the average degree of multiloops. The corresponding formulae are presented in Table 1, where each parameter is presented together with its expected asymptotic behavior, i.e. its expected behavior within a large (number of nucleotides) molecule. Note that we have investigated all the different substructures which must be distinguished in order to determine the total free energy of a molecule, which is necessary e.g. for certain prediction algorithms. Compared to all previous attempts to describe the structure of RNA quantitatively (see for instance [6,9,10,11,26]), the results presented here are the most realistic ones. This is in line with the positive experience of Knudsen et al. and of Eddy et al. with respect to the prediction of secondary structures based on trained SCFGs (resp. covariance models). The results in Table 1 should be considered as the structural behavior of an RNA molecule folded with respect to its energetic optimum. Therefore, they are of interest in themselves; for the first time we get some (mathematical) insight into how real secondary structures behave.
Besides the application which is the subject of this paper, the realistic modeling of the secondary structures gives rise to further applications like the following. First, we can use our results to provide bounds for the running time of algorithms working on secondary structures as their input. Second, when predicting a
Table 1: Expectations for different parameters of large subunit ribosomal RNA secondary structures. In all cases n is used to represent the total size of the molecule.

Parameter                                            Expectation
Number of hairpins                                   0.0226n
Length of a hairpin-loop                             7.3766
Number of bulges                                     0.0095n
Length of a bulge                                    1.5949
Number of ladders                                    0.0593n
Length of a ladder (counting the number of pairs)    4.1887
Number of interior loops                             0.0164n
Length of a single loop within an interior loop      3.8935
Number of multiloops                                 0.0106n
Degree of a multiloop                                4.1311
Length of a single loop within a multiloop           4.3686
Number of single stranded regions                    18.1679
Length of a single stranded region                   18.1353
secondary structure, our results may provide initial values for loop lengths etc. when searching for an optimal configuration, so that a faster convergence should be expected. We used the following grammar to derive the results in Table 1 (all capital letters are intermediate symbols):
f1 = S → SAC,    f2 = S → C,      f3 = C → C|,     f4 = C → ε,      f5 = A → (L),
f6 = L → (L),    f7 = L → M,      f8 = L → I,      f9 = L → |H,     f10 = L → (L)B|,
f11 = L → |B(L), f12 = B → B|,    f13 = B → ε,     f14 = H → H|,    f15 = H → ε,
f16 = I → |J(L)K|, f17 = J → J|,  f18 = J → ε,     f19 = K → K|,    f20 = K → ε,
f21 = M → U(L)U(L)N, f22 = N → U(L)N, f23 = N → U, f24 = U → U|,    f25 = U → ε.
The idea behind the grammar is the following: starting at the axiom S, a sentential form of the pattern CACAC...AC is generated, where each A stands for the starting point of a folded region and C represents a single stranded region. Applying production A → (L) produces the foundation of the folded region. From there the process has different choices. It may continue building up a ladder by applying L → (L). It might introduce a multiloop by the application of L → M or an interior loop by the application of L → I. A
Table 2: The probabilities for the productions of our grammar obtained from its training on a database of large subunit ribosomal RNA secondary structures
rule f   prob. p_f     rule f   prob. p_f     rule f   prob. p_f
f1       0.8628        f2       0.1372        f3       0.9477
f4       0.0523        f5       1.0000        f6       0.7612
f7       0.0402        f8       0.0662        f9       0.0941
f10      0.0207        f11      0.0176        f12      0.3730
f13      0.6270        f14      0.8644        f15      0.1356
f16      1.0000        f17      0.7401        f18      0.2599
f19      0.7461        f20      0.2539        f21      1.0000
f22      0.5149        f23      0.4851        f24      0.8137
f25      0.1863
hairpin-loop is produced by L → |H. Additionally, the grammar may introduce a bulge by the productions L → (L)B| resp. L → |B(L), where the two productions distinguish between a bulge at the 3' resp. 5' strand of the corresponding ladder. An interior loop is generated by the production I → |J(L)K|, where J and K are used to produce the loops. The multiloop is generated by the productions M → U(L)U(L)N, N → U(L)N and N → U, i.e. we have at least three single stranded regions represented by U; by additional applications of the production N → U(L)N the degree of the multiloop can be increased. The other production rules are used to generate unpaired regions in different contexts. We used different intermediate symbols in all cases because otherwise we would get an averaged length of the different regions instead of a distinguished length for each of the substructures considered. We first had to determine the probabilities for this grammar in order to derive the results in Table 1. We used a special parsing algorithm with all entries of the database as the input. Table 2 presents the resulting probabilities. Then the grammar was translated into a probability generating function from which our expectations were derived using Newton's polygon method and singularity analysis (details can be found in [25]). Table 3 compares the expected values according to our formulae to statistics computed from the database (archaea and bacteria data only). For this purpose we have set the parameter n to the average length of the structures used to compute the statistics. We observe that most parameters are described pretty well by our formulae (the root mean square deviation of the statistics compared to our formulae is 3.5260...), so it makes sense to use them according to our ideas.
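As a sanity check (our own illustration, not the paper's generating-function derivation), several constant entries of Table 1 can be recovered directly from the trained probabilities: a symbol with rules X → X| (probability p) and X → ε emits p/(1−p) unpaired bases on average, and a ladder grown by L → (L) with probability p6 has expected length 1/(1−p6).

```python
# Sanity check (an illustration, not the paper's derivation): simple geometric
# expectations recover several entries of Table 1 from the probabilities in
# Table 2, up to rounding of the published values.

p6 = 0.7612    # L -> (L)   (ladder growth)
p12 = 0.3730   # B -> B|    (bulge growth; L -> (L)B| contributes one '|')
p14 = 0.8644   # H -> H|    (hairpin-loop growth; L -> |H contributes one '|')

ladder_len = 1 / (1 - p6)            # approx. 4.1887 in Table 1
bulge_len = 1 + p12 / (1 - p12)      # approx. 1.5949 in Table 1
hairpin_len = 1 + p14 / (1 - p14)    # approx. 7.3766 in Table 1

print(round(ladder_len, 4), round(bulge_len, 4), round(hairpin_len, 4))
```

The n-dependent entries of Table 1 cannot be read off this simply; they are where the generating-function machinery is actually needed.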
Table 3: The average values computed statistically from the database compared to the values implied by the corresponding formulae in Table 1. All values were rounded to the second decimal place.

Parameter                                  Statistics   Formula   Quotient
number of hairpins                         51.76        52.02     99.49%
length of a hairpin-loop                   7.43         7.38      100.70%
number of bulges                           20.94        21.87     95.78%
length of a bulge                          1.59         1.59      99.88%
number of ladders                          130.94       136.50    95.92%
length of a ladder                         4.18         4.19      99.85%
number of interior loops                   36.25        37.75     96.02%
length of single loop in interior loop     3.89         3.89      99.98%
number of multiloops                       21.98        24.40     90.10%
degree of a multiloop                      4.06         4.13      98.31%
length of single loop in multiloop         4.80         4.37      109.96%
number of single stranded regions          7.44         18.17     40.97%
length of single stranded regions          15.62        18.14     86.15%

3
Identifying Good Predictions
In order to see whether or not our expectations for certain structural parameters of RNA secondary structure can be used for identifying good or bad predictions, we proceeded in the following way. First we used the RNAstructure software by Mathews, Zuker and Turner (version 3.71) in order to obtain predicted secondary structures for all sequences for archaea and bacteria in the database of Wuyts et al.; the default settings of the program were used. We decided to use those parameters for the judgement of the predictions for which, according to Table 3, the relative error of the value of the formula compared to the statistics computed from the database is at most 2%. Then the quality of the predictions was quantified as follows: for every prediction generated (for some sequences the software provides several predictions) we computed the number of hairpins x1, the average length of a hairpin-loop x2, the average length of a bulge x3, the average length of a ladder x4, the average length of a single loop in an interior loop x5 and the average degree of a multiloop x6. Furthermore we computed the corresponding values yi from our formulae, 1 ≤ i ≤ 6, setting n to the length of the sequence under consideration. Let z := (|x1 - y1|, ..., |x6 - y6|) denote the vector of the differences of these values (|.| denoting the absolute value) and let Z denote the set of all vectors z obtained by considering all predicted structures. In order to
endow every parameter with the same weight, every z ∈ Z was normalized by dividing each component by the maximal observed value for that component in Z. Finally, assuming that the resulting vectors are denoted by (v1, v2, ..., v6), the corresponding structure was ranked by
√( ∑_{1 ≤ i ≤ 6} v_i² )        (1)
Squares were used to amplify differences. This ranking must be considered as the distance of the structure under investigation to some sort of consensus structure implicitly provided by the expected values presented in Section 2. Therefore a small rank should imply a good prediction, while high ranks should disclose bad results of the prediction algorithm. In order to see whether this worked, we needed some notion of the similarity of structures. We chose the most simple but also most stringent one: two structures (the predicted structure and the corresponding structure in the database of Wuyts et al.) are compared position by position (using the ct-files), counting the number of bases which are bound to exactly the same counterpart in both files. This total number is divided by the length of the related primary structure. We call the resulting percentage the matching rate; a matching rate of 70% or larger is assumed to be a successful prediction. For the data of archaea and bacteria considered in our experiments^a, all structures with a matching rate greater than or equal to 70% were rated 3.54... or less. Additionally, only about 2.56% of all predictions had a rank of 3.54... or less, so that a rank of 3.54 or less implies a successful prediction with a probability close to 97%. Assuming a linear dependence between the matching rate of the predictions and the rank according to (1), an ideal ranking would possess a correlation coefficient of -1 when comparing the two. However, in our case we observed a correlation coefficient of -0.3645235338. Furthermore, when looking at the quantile-quantile plot which compares the distributions of ranking and matching rates, as shown in Figure 1, we observe a poor behavior especially for predictions with a matching rate between 55% and 65%. Note that an ideal ranking would result in a linear (diagonal) plot.
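The ranking pipeline just described can be sketched as follows (our own illustration with invented difference vectors; the function name is ours): take the absolute differences z_i = |x_i - y_i|, normalize each component by its maximum over all predictions, and rank by the root of the sum of squares.

```python
# Sketch of the ranking described above: normalize each component of the
# difference vectors by its maximum over all predictions, then take the
# square root of the sum of squares over the chosen components.
from math import sqrt

def ranks(z_vectors, components=None):
    """z_vectors: list of difference vectors z = (|x1-y1|, ..., |x6-y6|).
    components: 0-based indices entering the sum; None means all six."""
    dims = range(len(z_vectors[0]))
    maxima = [max(z[i] for z in z_vectors) or 1.0 for i in dims]
    idx = list(dims) if components is None else components
    out = []
    for z in z_vectors:
        v = [z[i] / maxima[i] for i in dims]      # per-component normalization
        out.append(sqrt(sum(v[i] ** 2 for i in idx)))
    return out

zs = [(2.0, 0.5, 1.0, 0.2, 0.1, 0.3),   # three hypothetical predictions
      (0.4, 0.1, 0.2, 0.1, 0.0, 0.1),
      (4.0, 1.0, 2.0, 0.4, 0.2, 0.6)]
r_all = ranks(zs)                           # ranking (1): all six parameters
r_sub = ranks(zs, components=[0, 2, 3, 4])  # 0-based analogue of i in {1,3,4,5}
print(r_all, r_sub)
```

The `components` argument anticipates the restricted ranking discussed below, where only the parameters positively correlated with the rank are kept.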
^a Note that the data of archaea and bacteria used for our experiments is a subset of the data used to train the grammar. However, since the grammar was trained on the entire database it was also trained on other families of rRNA, and thus good results with respect to our task should result from some sort of generalization.

Searching for an explanation of this rather poor correlation, we took a look at the correlations between the overall ranking according to (1) and the values of the different v_i, 1 ≤ i ≤ 6. The results can be found in Table 4.

Table 4: The correlation of a single v_i to the overall rank. Within the table each v_i is identified by the name of its associated parameter.

Parameter                                  Correlation
number of hairpins                         0.6575498439
length of a hairpin-loop                   -0.3432207906
length of a bulge                          0.4460590292
length of a ladder                         0.2158570276
length of single loop in interior loop     0.3850727833
degree of a multiloop                      -0.0844724840

One immediately notices that the (expected) length of a hairpin-loop and the (expected) degree of the multiloops are negatively correlated with the rank, i.e. they have a counterproductive effect on our ranking. Therefore we ran a second set of experiments, now using

√( ∑_{i ∈ {1,3,4,5}} v_i² )        (2)
as the rank of the prediction. The new filter assigns a rank of at most 1.87... to those predictions that have a matching rate of 70% or larger. Again, only about 2.56% of all predictions were ranked 1.87... or less; thus the new filter works with the same accuracy as the former one. But now we observe a correlation coefficient of -0.4745120689. Additionally, the quantile-quantile plot shown in Figure 2 is much closer to the diagonal, thus giving rise to a better judgement of the predictions, particularly for predictions with a matching rate between 55% and 65%. Note that the number of hairpins is the only parameter used in (2) which depends on the size of the structures and thus needs our methods based on SCFGs to be derived. All the other parameters could have been determined by simple statistical methods only. However, omitting v1 from the computations results in a worse accuracy and in a poor correlation coefficient of -0.24249...

4
4 Possible Improvements
Certainly the results reported in the previous section are only a first step towards a precise judgement of an algorithmic prediction of RNA secondary structure. However, the author believes this first step to be promising. There is potential for improving our approach in many directions. First, one might consider additional parameters, e.g., the order of a secondary structure introduced by Waterman [6]. In contrast to the parameters considered here, the order does not only take care of small parts of a secondary structure but it is
Figure 1: The quantile-quantile plot of the ranking according to (1) compared to the matching rate of the predicted secondary structures.
Figure 2: The quantile-quantile plot of the ranking according to (2) compared to the matching rate of the predicted secondary structures.
a sort of global parameter considering the balanced nesting depth of hairpins. Mathematical results for the expected order of a secondary structure which fit pretty well with the real-world behavior can be found in [11]. Second, it can be helpful to give different weights to the different parameters used when computing the rank of a structure. For instance, it seems reasonable to give a higher weight to those parameters which have a smaller (relative) variance than others, since these parameters must be assumed to be conserved more strongly; a deviating behavior with respect to them is therefore more unlikely than a deviating behavior with respect to the others. So far, the author has not been able to gather experience in this field, but it is a starting point for further research.

5 Conclusions
In this paper we have shown how results for the expected structural behavior of RNA secondary structures can be used in order to judge the quality of a prediction made by any algorithm. First experiences were gained by considering large subunit ribosomal RNA molecules. To judge a single predicted structure S it is necessary to compute the length n of the corresponding primary structure and the values observed within S for the four parameters attached to the v_i in (2). Then it is possible to compute the rank of S which, according to our experiments, provides information on the quality (matching rate)
of the prediction with high probability. The methods presented in [25], which were used to derive the key results for our methodology, i.e., expected values for structural parameters within a realistic model for the molecules, are not restricted to this family of RNA. So they might be used for other kinds of RNA as well. Furthermore, it should be possible to implement a corresponding set of routines using a computer algebra system like Maple such that the expectations needed in order to judge predictions for other kinds of RNA can be computed automatically. As a consequence, the ideas presented in this article may lead to the development of a new kind of software tool which supports the automated prediction of secondary structure with a posteriori information on the quality of the results. In the long run, these ideas might be transferred to other areas of structural genomics, e.g., the prediction of the three-dimensional structure of proteins.

Acknowledgements
I wish to thank Matthias Rupp for his support in writing the programs for the statistical analysis presented in Section 3 and for helpful suggestions.

References
1. S. R. Eddy and R. Durbin, Nucleic Acids Res. 22 (1994), 2079-2088.
2. B. Knudsen and J. Hein, Bioinformatics 15 (1999), 446-454.
3. R. Nussinov, G. Pieczenik, J. R. Griggs and D. J. Kleitman, SIAM Journal on Applied Mathematics 35 (1978), 68-82.
4. J. M. Pipas and J. E. McMahon, Proceedings of the National Academy of Sciences 72 (1975), 2017-2021.
5. D. Sankoff, Tenth Numerical Taxonomy Conference, Kansas, 1976.
6. M. S. Waterman, Advances in Mathematics Supplementary Studies 1 (1978), 167-212.
7. M. Zuker and P. Stiegler, Nucleic Acids Res. 9 (1981), 133-148.
8. W. Fontana, D. A. M. Konings, P. F. Stadler and P. Schuster, Biopolymers 33 (1993), 1389-1404.
9. I. L. Hofacker, P. Schuster and P. F. Stadler, Discrete Applied Mathematics 88 (1998), 207-237.
10. M. E. Nebel, Journal of Computational Biology 9 (2002), 541-573.
11. M. E. Nebel, Bulletin of Mathematical Biology, to appear.
12. J. E. Hopcroft, R. Motwani and J. D. Ullman, Addison Wesley, 2001.
13. D. Sankoff and J. Kruskal, CSLI Publications, 1999.
14. Y. Sakakibara, M. Brown, R. Hughey, I. S. Mian, K. Sjölander, R. C. Underwood and D. Haussler, Nucleic Acids Res. 22 (1994), 5112-5120.
15. T. L. Booth, IEEE Tenth Annual Symposium on Switching and Automata Theory, 1969.
16. U. Grenander, Tech. Rept., Division of Applied Mathematics, Brown University, 1967.
17. S. E. Hutchins, Information Sciences 4 (1972), 179-191.
18. H. Enomoto, T. Katayama and M. Okamoto, Systems Computer Controls 6 (1975), 1-8.
19. T. Huang and K. S. Fu, Information Sciences 3 (1971), 201-224.
20. N. Chomsky and M. P. Schützenberger, in Computer Programming and Formal Systems (P. Braffort and D. Hirschberg, eds.), North-Holland, Amsterdam, 1963, 118-161.
21. R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Cambridge University Press.
22. J. Wuyts, P. De Rijk, Y. Van de Peer, T. Winkelmans and R. De Wachter, Nucleic Acids Res. 29 (2001), 175-177.
23. J. W. Brown, Nucleic Acids Res. 27 (1999), http://jwbrown.mbio.ncsu.edu/RNaseP/home.html.
24. M. Sprinzl, K. S. Vassilenko, J. Emmerich and F. Bauer (20 December, 1999), http://www.uni-bayreuth.de/departments/biochemie/trna/.
25. M. E. Nebel, technical report, http://boa.sads.informatik.uni-frankfurt.de:8000/nebel.html.
26. M. Régnier, Generating Functions in Computational Biology: a Survey, submitted.
27. E. Hille, Blaisdell Publishing Company, Waltham, 1962, 2 vol.
EXPLORING BIAS IN THE PROTEIN DATA BANK USING CONTRAST CLASSIFIERS

K. PENG, Z. OBRADOVIC, S. VUCETIC

Center for Information Science and Technology, Temple University, 1805 N Broad St, Philadelphia, PA 19122, USA
In this study we analyzed the bias existing in the Protein Data Bank (PDB) using the novel contrast classifier approach. We trained an ensemble of neural network classifiers, called a contrast classifier, to learn the distributional differences between non-redundant sequence subsets of PDB and SWISS-PROT. Assuming that SWISS-PROT is a representative of the sequence diversity in nature while the PDB is a biased sample, output of the contrast classifier can be used to measure whether the properties of a given sequence or its region are underrepresented in PDB. We applied the contrast classifier to SWISS-PROT sequences to analyze the bias in PDB towards different functional protein properties. The results showed that transmembrane, signal, disordered, and low complexity regions are significantly underrepresented in PDB, while disulfide bonds, metal binding sites, and sites involved in enzyme activity are overrepresented. Additionally, hydroxylation and phosphorylation posttranslational modification sites were found to be underrepresented while acetylation sites were significantly overrepresented. These results suggest the potential usefulness of contrast classifiers in the selection of target proteins for structural characterization experiments.
1 Introduction

The ultimate goal of structural genomics is to determine structures for every natural protein through large-scale structure characterization and computational analysis. However, in anticipation of the development of cost-effective techniques and protocols for large-scale experiments, current efforts in structural genomics are aimed towards determining structures of a limited portion of representative proteins to achieve a rapid coverage of the protein sequence/structure space [3]. As a common approach, the proteins are first filtered to remove those considered inappropriate for structural characterization, e.g., membrane, low complexity, and signal peptides. The remaining proteins are clustered into families based on sequence similarity. Finally, representative proteins from the families of largest biological interest are selected for structural characterization experiments. Although some progress has been made, selection of the target proteins remains an open problem in structural genomics [3]. As the main database of experimentally characterized structural information, the Protein Data Bank (PDB) [1] contains more than 20,000 structures of proteins, nucleic acids and other related macromolecules characterized by methods such as X-ray diffraction and nuclear magnetic resonance (NMR) spectroscopy. However, current information in PDB is highly biased in the sense that it does not adequately cover the whole sequence/structure space. For example, membrane proteins represent a very important structural class in nature, but their structures are usually extremely difficult to determine due to the need for a lipid bilayer or substitute amphiphile [18]. In
general, PDB is positively biased towards proteins that are more amenable to expression, purification and crystallization. Another source of bias is the fact that different research groups usually have different objectives when selecting the target proteins: some aim at determining structures of proteins from a specific model organism; some may focus on proteins in a single pathway; others may be more interested in certain types of proteins, e.g., disease-related proteins. PDB is also statistically redundant due to the presence of multiple entries for highly similar or identical proteins. According to the PDB statistics available at http://www.rcsb.org/pdb/holdings.html, out of the 3,298 structures deposited in year 2001 only 204 could be considered novel, while the remaining ones were mostly minor variants of those already reported. Understanding the bias and redundancy in PDB is crucial for the selection of further structural targets as well as for various structure predictions. Several studies have been performed towards this goal. Brenner et al. [4] analyzed the SCOP [12] structural classification of PDB proteins and reported high skewness at all classification levels. Gerstein [6] compared several complete genomes with a non-redundant subset of PDB and concluded that the proteins encoded by the genomes were significantly different from those in the PDB with respect to sequence length, amino acid composition and predicted secondary structure composition. Liu and Rost [10] analyzed the proteomes of 30 organisms, estimated that current structural information in PDB and other databases was available for only 6-38% of all proteins, and found over 18,000 segment clusters suitable for structural genomics. In this paper we provide a complementary view of the bias in PDB that explores differences in sequence properties of PDB and SWISS-PROT [2] proteins.
This was accomplished by training an ensemble of neural network classifiers to distinguish between the distributions of the non-redundant subsets of PDB and SWISS-PROT. Following the recently proposed contrast classifier framework [14], the output of such an ensemble of classifiers measures the level to which a given sequence property is overrepresented/underrepresented in PDB as compared to SWISS-PROT. We applied the contrast classifier to analyze the bias in PDB towards numerous protein properties and to examine whether our approach can be useful in selecting the most interesting target proteins for structural characterization.
2 Methods

2.1 Datasets

Since both PDB and SWISS-PROT are statistically redundant due to the presence of a large number of homologues, learning on such data could lead to biased results. Thus, non-redundant subsets were used as unbiased representatives of the two databases. The non-redundant representative of PDB used in this study was PDB-Select-25 [7]
constructed based on all-against-all Smith-Waterman alignments between PDB chains. In this subset, the maximal pairwise identity was limited to 25% since it is believed to be an appropriate compromise between reducing the sequence redundancy and preserving the sequence diversity [17]. The version used in this study was released in December 2002 and consisted of 1,949 chains. After removing chains shorter than 40 residues, the resulting set PDB25 contained 1,824 chains with 324,783 residues. For SWISS-PROT (October 2001, Release 40, 101,602 sequences), we applied an approach used in our previous study [20] to construct its non-redundant representative subset. Sequence similarity information from the ProtoMap database [22] was used to group all SWISS-PROT proteins into 17,676 clusters using the default ProtoMap E-value cutoff of 1. A representative protein with the richest annotation in SWISS-PROT was then selected from each cluster. Similarly to PDB25, proteins shorter than 40 residues were removed. The resulting set SwissRep consisted of 16,360 proteins with 6,946,185 residues. The relatively high E-value cutoff, leading to quite aggressive redundancy reduction, was acceptable since the resulting SwissRep was still sufficiently large to represent the diversity of SWISS-PROT.

Table 1. Summary of special regions in SwissRep.

Regions          number of regions   number of residues
transmembrane    10,274              215,109 (3.1%)
low complexity   14,648              2,041,162 (29.4%)
disordered       11,332              506,229 (7.3%)
We also identified various regions of interest from SwissRep proteins for further analysis. Transmembrane regions were identified through the keywords (KW lines) and feature tables (FT lines) associated with each SWISS-PROT sequence. We identified transmembrane helix regions as the most distinctive among all types of membrane regions. Low complexity regions were marked by the SEG program [21] using the standard parameters K1 = 3.4 and K2 = 3.75, and a window of length 45. Disordered regions longer than 30 residues were predicted by the VL3 disorder predictor [13] with Win/Wout = 41/1 and a threshold of 0.85. Table 1 shows the summary of these identified regions, with their sizes measured as the number of regions, the number of residues, and the percentage of residues in SwissRep.
2.2 Contrast Classifiers

Let us assume we are given dataset G obtained by unbiased sampling from a multivariate underlying distribution, and dataset H obtained by potentially biased sampling from the same distribution. This scenario could occur when objects from H are characterized by a larger set of attributes than those of G. For example, SwissRep is an example of an unbiased dataset G that contains only protein sequence information, while PDB25 is an example of a biased dataset H that contains both protein sequence and structure information. Understanding the bias in data H is of major importance for
an appropriate analysis and inference from such data. The recently proposed contrast classifier approach [14] provides a simple but effective framework for detecting and exploring the data bias. By g(x) and h(x) let us denote the probability density functions (pdf) of unbiased data G and biased data H, respectively. The contrast classifier is a classifier trained to discriminate between the distributions of datasets G and H. Using classification algorithms that are able to approximate the posterior class probability (e.g., neural networks), the output cc(x) of a contrast classifier trained on a balanced set with the same number of examples from G (class 1) and examples from H (class 0) approximates cc(x) = g(x)/(g(x) + h(x)) [14]. With a simple transformation it follows that g(x)/h(x) = cc(x)/(1 - cc(x)), and that cc(x) = 0.5 corresponds to a data point x that is represented equally well in both datasets (i.e., h(x)/g(x) = 1). The contrast classifier output cc(x) is therefore a very suitable measure for analysis of the data bias. The distribution of cc(x) gives information about the overall level of bias in dataset H: if it is concentrated around 0.5 the bias is negligible, while if it is dispersed across the interval [0, 1] the bias is significant. Moreover, we could measure the level to which a given data point is overrepresented/underrepresented in dataset H: data points with cc(x) < 0.5 are overrepresented, while those with cc(x) > 0.5 are underrepresented.
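The identities above can be checked numerically. A minimal sketch in which two known Gaussian densities stand in for g(x) and h(x) (in the real setting cc(x) comes from a trained ensemble, not from closed-form densities):

```python
import math

def gauss_pdf(x, mu, sigma):
    # density of a normal distribution, standing in for a known pdf
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def g(x):  # hypothetical "unbiased" density (plays the role of SwissRep)
    return gauss_pdf(x, 0.0, 1.0)

def h(x):  # hypothetical "biased" density (plays the role of PDB25)
    return gauss_pdf(x, 1.0, 1.0)

def cc(x):
    # output of an ideal contrast classifier trained on a balanced set:
    # the posterior probability of class G given x
    return g(x) / (g(x) + h(x))

# the density ratio is recovered from the classifier output:
# g(x)/h(x) = cc(x) / (1 - cc(x)); at x = 0.5 both densities agree,
# so cc(0.5) = 0.5, while x = -2 is far more typical of G, so cc(-2) > 0.5
ratio_at_2 = cc(2.0) / (1 - cc(2.0))
```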
2.3 Training Contrast Classifiers for Bias Detection in PDB

In this study we assume that SwissRep is a representative of the protein sequence space, while PDB25 is a biased sample. Note that, while the first assumption is probably not completely correct since SWISS-PROT represents only the proteins studied in sufficient detail, it is acceptable for the purpose of analyzing the bias in PDB. Based on the description in Section 2.2, it is evident that contrast classifiers can be used directly to explore the bias in PDB. While any classification algorithm able to approximate the posterior class probability can be employed to train a contrast classifier, in this study we used feedforward neural networks with one hidden layer and sigmoidal neurons. Since there is a large imbalance in the number of data points in the SwissRep and PDB25 datasets (with a proportion of approximately 21:1), learning a single neural network on balanced samples from the two datasets would not properly utilize the data diversity present in SwissRep. We addressed this by training an ensemble of neural networks on balanced training sets consisting of an equal number of PDB25 and SwissRep examples randomly sampled from the available data. Similar to bagging [5], we constructed a contrast classifier by aggregating the predictions of these neural networks through averaging. An additional benefit of using an ensemble of neural networks is that averaging is a successful technique for increasing their accuracy by reducing variance while retaining low bias in prediction.
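A minimal sketch of this balanced-resampling-plus-averaging scheme, with simple histogram posterior estimates standing in for the neural networks and synthetic 1-D data standing in for the two sequence sets (all names, sizes, and distributions here are illustrative, not the paper's):

```python
import random
from collections import Counter

random.seed(0)

# imbalanced synthetic data: a large "unbiased" set G (class 1) and a
# small "biased" set H (class 0), one scalar attribute per example
G = [random.gauss(0.0, 1.0) for _ in range(21000)]
H = [random.gauss(1.0, 1.0) for _ in range(1000)]

BINS, LO, HI = 20, -4.0, 5.0

def bin_of(x):
    i = int((x - LO) / (HI - LO) * BINS)
    return min(max(i, 0), BINS - 1)

def train_member(g_pool, h_pool, n=800):
    """One ensemble member: a histogram estimate of P(G | x) fit on a
    balanced sample of n examples per class."""
    g = random.sample(g_pool, n)
    h = random.sample(h_pool, n)
    cg = Counter(bin_of(x) for x in g)
    ch = Counter(bin_of(x) for x in h)
    # Laplace smoothing so empty bins give 0.5
    return [(cg[b] + 1) / (cg[b] + ch[b] + 2) for b in range(BINS)]

members = [train_member(G, H) for _ in range(25)]

def cc(x):
    # contrast-classifier output: average of the member posteriors
    b = bin_of(x)
    return sum(m[b] for m in members) / len(members)
```

Each member sees a different balanced resample, so together they cover far more of the large set than a single balanced training set would.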
2.4 Knowledge Representation
For each sequence position, a set of relevant attributes was derived from statistics of a subsequence within a window of length W centered at the position. More specifically, given a sequence s = {s_i, i = 1, ..., L} of length L, for each sequence position s_i an appropriate M-dimensional attribute vector x_i = {x_ij, j = 1, ..., M} is constructed and a corresponding class label y_i is assigned. Thus, sequence s is represented as a set of L examples {(x_i, y_i), i = 1, ..., L}. Using a window of length W = 21, a total of 25 attributes were derived for each sequence position. These attributes were proved to be useful in various protein sequence analyses and structure prediction problems. The first 19 attributes were the amino acid frequencies within the window, since it has been shown that PDB proteins exhibit unique amino acid composition patterns [6]. Only 19 of the 20 frequencies were used since the remaining one could be uniquely determined from the rest. Based on the amino acid frequencies, an attribute called K2-entropy [21] was calculated to measure local sequence complexity. We also measured flexibility [19] and hydropathy [9] propensities obtained by a triangular moving average window where the center position had weight 1 and the most distant positions had weight 0.25. While the window length was 21 for the hydropathy attribute, it was only 9 for the flexibility attribute, as suggested by the previous study [19]. The final 3 attributes were outputs of the PHD secondary structure predictor [16], i.e., the prediction scores for alpha helix (H), beta strand (E) and loop (L). Finally, class labels 0 and 1 were assigned to examples from PDB25 and SwissRep, respectively.
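A simplified sketch of this per-position attribute construction, computing the 19 windowed amino-acid frequencies plus a Shannon-entropy complexity term (only a stand-in for the K2-entropy; the flexibility, hydropathy, and PHD attributes are omitted here):

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def window_attributes(seq, i, w=21):
    """Attribute vector for position i: 19 amino-acid frequencies in the
    length-w window centered at i, plus a Shannon-entropy complexity
    attribute (an assumption standing in for the K2-entropy of SEG)."""
    half = w // 2
    win = seq[max(0, i - half): i + half + 1]
    freqs = [win.count(a) / len(win) for a in AMINO_ACIDS]
    entropy = -sum(f * math.log2(f) for f in freqs if f > 0)
    # drop the last frequency: it is uniquely determined by the other 19
    return freqs[:-1] + [entropy]

# hypothetical sequence; attribute vector for its 11th residue
x = window_attributes("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10)
```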
2.5 Using Contrast Classifiers to Explore Bias in PDB

Given a measure of contrast cc(x) at each sequence position, we explored the bias in PDB towards numerous protein functional properties, as defined by the SWISS-PROT keyword and feature classification. It was expected that the analysis would confirm known results (e.g., that transmembrane and low complexity regions are underrepresented in PDB) and point to some less-known sources of bias. For a set R of regions with a given functional property, the mean and standard deviation of the corresponding cc(x) were calculated to measure the direction and level of bias. Additionally, the Kolmogorov-Smirnov goodness-of-fit test [11] (KS test) was used to measure the difference between the cc(x) distributions of R and PDB25. The KS test measures the maximum absolute difference between the empirical cumulative distributions of the two samples and uses it to estimate the test p-value. Since cc(x) of neighboring sequence positions are correlated due to the use of the window W (= 21) in attribute construction (see Section 2.4), we estimated the effective length as L_e = 1 + (L - 1)/W for each sequence region of length L and used it in the calculation of the KS test p-values.
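A small sketch of the two quantities introduced here, the effective length and the two-sample KS statistic (the p-value estimation step is omitted):

```python
def effective_length(L, W=21):
    # neighboring positions share window content, so a region of length L
    # contributes only about 1 + (L - 1)/W roughly independent samples
    return 1 + (L - 1) / W

def ks_statistic(a, b):
    """Maximum absolute difference between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        fa = sum(v <= x for v in a) / len(a)
        fb = sum(v <= x for v in b) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

For example, a 43-residue region with W = 21 counts as only 3 effective samples, which appropriately weakens the resulting p-value.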
3 Results and Discussions
3.1 Training the Contrast Classifier

We built the contrast classifier as an ensemble of 50 neural networks, each having 5 hidden neurons and 1 output neuron with a sigmoid activation function. To reduce bias towards long sequences, a balanced training set for each neural network was selected in two steps: (a) 20 examples sampled randomly without replacement were taken from each sequence in PDB25 and SwissRep, and (b) a balanced set of 8,000 examples was sampled randomly with replacement from the resulting set. Individual neural networks were trained with the backpropagation algorithm. To avoid overfitting, 80% of the balanced set was used for training and the rest was reserved to signal the training termination. If the training was not stopped after 300 epochs it was terminated automatically.
3.2 Distributions of Contrast Classifier Outputs

Comparing contrast classifier outputs on PDB25 and SwissRep. A trained contrast classifier was applied to both PDB25 and SwissRep sequences, and their cc(x) distributions were compared in Figure 1(a). Since SwissRep contained a number of PDB25 sequences and/or their homologues, it was expected that the two distributions would overlap. However, a considerable proportion of SwissRep sequences had relatively large cc(x) values (e.g., larger than 0.7) while most PDB25 sequences had smaller cc(x) values concentrated around 0.47. This result clearly illustrated the existence of bias in PDB. In the following subsections, we analyze the sources of bias in greater detail. We also examined the distributions of another two sets in Figure 1(a): PDB25H, the homologues of PDB25 in SwissRep; and PDB25NH, the remaining sequences of SwissRep. The homologues were identified through 3 iterations of PSI-BLAST search using E-value thresholds of 0.001 for sequence inclusion in the profile and 1 for including sequences in the final selection. As expected, the distribution for PDB25H was similar to PDB25, while the distribution for PDB25NH was similar to SwissRep.

Distributions of 3 specific sequence regions. We examined distributions of cc(x) for transmembrane regions, low complexity regions, and predicted disordered regions from SwissRep (see Section 2.1). As shown in Figure 1(b), all these regions exhibited cc(x) values significantly higher than PDB25 sequences, indicating that they were highly underrepresented in PDB. As discussed in the Introduction, transmembrane regions are typically excluded from structural characterizations. Low complexity regions have a biased amino acid composition involving a few amino acid types and they often do not fold into a stable 3D structure [15]. Huntley and Golding [8] performed an extensive investigation on eukaryotic proteins in PDB and reported a
large deficiency in low complexity regions. Their results indicated that even for the few low complexity regions with structural data present in PDB, tertiary structures were missing in most cases. Predicted disordered regions [13] correspond to the regions very likely to have flexible structure that could not be captured by X-ray crystallography or NMR. Since disordered proteins are hard to crystallize, it was expected that they are underrepresented in PDB.
Figure 1. Comparison of cc(x) distributions between PDB25 and other sets: (a) SwissRep, PDB25H and PDB25NH; (b) various regions of interest from SwissRep.
Distributions of functional regions characterized by SWISS-PROT FT lines. We extended the analysis to functional regions described by feature tables (FT lines in SWISS-PROT) with the FT keywords. Note that the length of functional regions could range from one residue (e.g., posttranslational modification sites) to a few hundred residues. In Figure 2 we plot the distributions of the 3 selected functional region types. The supplementary material with the plots of all functional regions listed in the FT lines can be accessed at http://www.ist.temple.edu/disprot/PSB04. Given the explanation of the contrast classifier output discussed in Section 2.2, a positively skewed output distribution indicates that a certain type of functional site or region is underrepresented in PDB, while a negatively skewed output distribution indicates that it is overrepresented. For example, disulfide bonds (DISULFID) play important roles in stabilizing protein tertiary structure and thus should be abundant in PDB. Consistent with this fact is that their cc(x) distribution is highly negatively skewed (see Figure 2). On the other hand, signal peptides (SIGNAL) are short segments of amino acids in a particular order that govern the transportation of newly synthesized proteins, and are then cleaved from the matured proteins. Since structure characterization experiments usually target matured proteins, signal peptides are expected to be underrepresented in PDB. Accordingly, we observe a positively skewed distribution similar to that of transmembrane regions in Figure 1(b). Repeats
(REPEAT) are internal sequence repetitions; they typically have low sequence complexity and thus exhibit a distribution similar to that of low complexity regions in Figure 1(b).
Figure 2. Distributions of cc(x) of 3 selected sites or regions from SwissRep sequences.
Figure 3. Distributions of contrast classifier output cc(x) of 3 selected posttranslational modification sites from SwissRep sequences.
Comparing distributions of PDB and different functional regions. The cc(x) distributions of the functional sites or regions were compared with the distributions of the PDB25 sequences using the 2-sample Kolmogorov-Smirnov test described in Section 2.5. The FT keywords corresponding to these sites or regions were then ranked according to the resulting p-values, as shown in Table 2. Note that the table does not list FT keywords ACT-SITE, CA-BIND, CONFLICT, INIT-MET, LIPID, MUTAGEN, SIMILAR, SITE, THIOLEST, TRANSIT, NON-CONS, NON-TER, NOT SPECIFIED, NP-BIND, UNSURE, VARIANT, and VARSPLIC since they were either of less interest or their total effective length was less than 1000 residues. Also shown in Table 2 are means and standard deviations of cc(x) values, and the total effective length used in the KS test. We further examined contrast classifier output cc(x) on different posttranslational modification sites identified by FT keyword MOD-RES. Results for the 5 most frequent sites are shown in Table 3. Similar to Table 2, these sites were ranked according to their Kolmogorov-Smirnov test p-values when compared to the distribution of PDB25 sequences. Among the top 3 sites, phosphorylation and hydroxylation sites have positively skewed distributions, while acetylation sites have negatively skewed distribution, as shown in Figure 3. This suggests that the first 2 modification sites are underrepresented in PDB, while the acetylation sites are overrepresented.
Table 2. Comparison of distributions of contrast classifier outputs on sites or regions of interest and PDB25 sequences. The p-values were obtained using the Kolmogorov-Smirnov 2-sample test.

FT keyword   p-value     M(cc)   S(cc)   L
TRANSMEM     0           0.72    0.12    20531
REPEAT       0           0.53    0.14    13377
DOMAIN       7.93e-318   0.51    0.12    82912
CARBOHYD     1.17e-138   0.51    0.12    7015
CHAIN        5.09e-118   0.50    0.12    74737
MOD-RES      3.57e-061   0.53    0.13    1515
PEPTIDE      7.24e-061   0.53    0.13    1225

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Table 3. Comparison of contrast classifier output distributions of different posttranslational modification sites with that of PDB25 sequences.

modification site   p-value     M(cc)   S(cc)   L
phosphorylation     3.54e-054   0.55    0.11    608
amidation           5.68e-015   0.55    0.14    170
methylation         8.66e-010   0.57    0.12    93
PDB25               n/a         0.47    0.10    17194

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Distributions of SCOP structural classes. According to the SCOP database [12] (release 1.61, Nov. 2002), 1,685 out of the 1,824 chains in PDB25 can be classified into 11 structural classes, as shown in Table 4. Note that different parts of a chain might belong to different classes. We examined the cc(x) distributions of individual structural classes and compared them with the overall distribution of PDB25 sequences using the Kolmogorov-Smirnov test (results shown in Table 4). The most significant difference corresponded to sequences from the small class, with a negatively skewed distribution. It is worth noting that the membrane and cell surface, coiled coils, and peptide structural classes appeared to be significantly underrepresented in PDB25.
Table 4. Comparison of cc(x) distributions on PDB25 proteins from different fold classes with that of all PDB25 proteins.

Fold class                  p-value     M(cc)   S(cc)   L
small                       8.74e-037   0.41    0.10    608
alpha                       4.65e-018   0.49    0.10    2779
membrane and cell surface   3.86e-015   0.53    0.14    382
peptides                    0.000585    0.52    0.12    85
PDB25                       n/a         0.47    0.10    17194

M(cc): mean of cc(x); S(cc): standard deviation of cc(x); L: effective number of residues.
Analysis of underrepresented proteins. Complementing the study of cc(x) distributions of different functional protein regions or protein types, we explored the properties of proteins that are most highly underrepresented by PDB25. Some of these proteins are arguably the most interesting targets for future structural determination experiments. For this study, each SwissRep sequence s was represented with a single number cc-avg(s) representing the average cc(x) over the sequence. A total of 2,814 (or 17.2% of) SwissRep sequences having cc-avg(s) > 0.597 were selected, with the threshold chosen such that only 1% of PDB25H sequences satisfied the inequality. We analyzed the properties of the resulting set, called SwissOut, by comparing the commonness of different SWISS-PROT keywords (KW line) in SwissOut and PDB25H (see Section 3.2). By denoting f_SwissOut and f_PDB25H as the frequencies of proteins with a given keyword among SwissOut and PDB25H, respectively, their difference can be quantified by measuring the Z-score defined as

    Z = (f_SwissOut - f_PDB25H) / sqrt( f_PDB25H (1 - f_PDB25H) / N ),

where N is the number of proteins in SwissOut.

Table 5. The top 6 SWISS-PROT keywords associated with underrepresented sequences.

Keyword                f_SwissRep [%]   f_PDB25H [%]   f_SwissOut [%]   Z-score
hypothetical protein   42.38            15.64          54.34            56.52
transmembrane          17.55            14.51          42.89            42.74
complete proteome      31.64            18.24          32.55            19.66
inner membrane         1.49             1.45           3.62             9.64
chloroplast            1.36             1.22           3.13             9.23
chromosomal protein    0.66             1.29           2.74             6.82
In Table 5 we list the 6 SWISS-PROT keywords with the highest Z-scores among the ones represented with more than 50 SwissOut proteins. By careful examination, it is evident that the obtained results are reasonable, and are another indication of the potential of the proposed contrast classifier approach. Furthermore, it is likely that SwissOut proteins with the keyword "complete proteome" would be very interesting structural targets.
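The Z-score can be checked against the transmembrane row of Table 5. A small sketch, assuming the standard one-proportion form of the statistic:

```python
import math

def z_score(f_out, f_pdb, n):
    """Z-score comparing a keyword's frequency among underrepresented
    proteins (f_out) with its frequency among PDB25 homologues (f_pdb),
    for n proteins in the underrepresented set."""
    return (f_out - f_pdb) / math.sqrt(f_pdb * (1 - f_pdb) / n)

N = 2814  # number of SwissOut proteins
# 'transmembrane' row of Table 5: f_PDB25H = 14.51%, f_SwissOut = 42.89%
z = z_score(0.4289, 0.1451, N)
```

Evaluating this reproduces the tabulated Z-score of 42.74 to within rounding of the input frequencies.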
4 Concluding Remarks

We applied the contrast classifier to explore the bias existing in the Protein Data Bank towards different functional protein properties. Assuming SWISS-PROT is a representative of the protein universe while the PDB is a biased sample, we trained a contrast classifier with the non-redundant subsets of PDB and SWISS-PROT and used its output to analyze the bias in PDB. Compared to other methods for examining bias in PDB (see the Introduction), the main strength of our approach is that it provides a quantitative measure to assess the bias in a uniform way. Our results confirmed some well-known facts such as the lack of transmembrane, low complexity and disordered regions among PDB sequences. They have also revealed some less recognized facts such as the depletion of PDB in phosphorylation and hydroxylation modification sites and its overrepresentation of acetylation sites. These results are a strong indication that contrast classifiers should be considered as an attractive tool for the selection of target proteins for future structural characterization experiments. There are several immediate avenues of future research. As shown by our results, a contrast classifier trained with attributes derived from simple statistics over a local window was able to successfully explore the bias in PDB. This suggests that a more sophisticated choice of attributes could provide additional insight into the sources of bias. Similarly, removing well-known underrepresented regions (e.g., transmembrane, low complexity) before the training of the contrast classifier would allow a better focus on the less known sources of bias in PDB. Finally, by a slight extension of the proposed methodology, contrast classifiers could be trained with sequences of known folds vs. sequences in SWISS-PROT. This could have potential in detecting sequences with potentially novel fold structures.
References
1. H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov and P.E. Bourne, "The Protein Data Bank", Nucleic Acids Res., 28, 235 (2000).
2. B. Boeckmann, A. Bairoch, R. Apweiler, M.C. Blatter, A. Estreicher, E. Gasteiger, M.J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout and M. Schneider, "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003", Nucleic Acids Res., 31, 365 (2003).
3. S.E. Brenner, "Target selection for structural genomics", Nat. Struct. Biol., Structural Genomics supplement, 7, 967 (2000).
4. S.E. Brenner, C. Chothia and T.J. Hubbard, "Population statistics of protein structures: lessons from structural classifications", Curr. Opin. Struct. Biol., 7, 369 (1997).
5. L. Breiman, "Bagging predictors", Mach. Learning, 24, 123 (1996).
6. M. Gerstein, "How representative are the known structures of the proteins in a complete genome? A comprehensive structural census", Fold Des., 3, 497 (1998).
7. U. Hobohm and C. Sander, "Enlarged representative set of protein structures", Protein Sci., 3, 522 (1994).
8. M.A. Huntley and G.B. Golding, "Simple sequences are rare in the Protein Data Bank", Proteins: Struc. Funct. Gen., 48, 134 (2002).
9. J. Kyte and R.F. Doolittle, "A simple method for displaying the hydropathic character of a protein", J. Mol. Biol., 157, 105 (1982).
10. J. Liu and B. Rost, "Target space for structural genomics revisited", Bioinformatics, 18, 922 (2002).
11. F.J. Massey Jr., "The Kolmogorov-Smirnov test of goodness of fit", J. Amer. Statist. Assoc., 46, 68 (1951).
12. A.G. Murzin, S.E. Brenner, T. Hubbard and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures", J. Mol. Biol., 247, 536 (1995).
13. Z. Obradovic, K. Peng, S. Vucetic, P. Radivojac, C. Brown and A.K. Dunker, "Predicting intrinsic disorder from amino acid sequence", Proteins: Struc. Funct. Gen., Special Issue on CASP5, in press.
14. K. Peng, S. Vucetic, B. Han, H. Xie and Z. Obradovic, "Exploiting unlabeled data for improving accuracy of predictive data mining", In Proc. Third IEEE Int'l Conf. on Data Mining, November 2003, Melbourne, FL, in press.
15. P. Romero, Z. Obradovic, X. Li, E. Garner, C.J. Brown and A.K. Dunker, "Sequence complexity and disordered protein", Proteins: Struc. Funct. Gen., 42, 38 (2001).
16. B. Rost, "PHD: predicting one-dimensional protein structure by profile-based neural networks", Methods Enzymol., 266, 525 (1996).
17. B. Rost, "Twilight zone of protein sequence alignments", Protein Eng., 12(2), 85 (1999).
18. H. Sakai and T. Tsukihara, "Structures of membrane proteins determined at atomic resolution", J. Biochem., 124, 1051 (1998).
19. M. Vihinen, E. Torkkila and P. Riikonen, "Accuracy of protein flexibility predictions", Proteins: Struc. Funct. Gen., 19, 141 (1994).
20. S. Vucetic, D. Pokrajac, H. Xie and Z. Obradovic, "Detection of underrepresented biological sequences using class-conditional distribution models", In Proc. Third SIAM Int'l Conf. on Data Mining, May 2003, San Francisco, CA.
21. J.C. Wootton and S. Federhen, "Analysis of compositionally biased regions in sequence databases", Methods Enzymol., 266, 554 (1996).
22. G. Yona, N. Linial and M. Linial, "ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space", Proteins, 37, 360 (1999).
GEOMETRIC ANALYSIS OF CROSS-LINKABILITY FOR PROTEIN FOLD DISCRIMINATION

S. POTLURI(1), A.A. KHAN(1), A. KUZMINYKH(2), J.M. BUJNICKI(3), A.M. FRIEDMAN(4), C. BAILEY-KELLOGG(1)

Depts. of (1)Comp. Sci., (2)Math., and (4)Biol. Sci., Purdue Univ., West Lafayette, IN 47907, USA; (3)Intl. Inst. Molec. and Cell Biol., Warsaw, Poland
Abstract
Protein structure provides insight into the evolutionary origins, functions, and mechanisms of proteins. We are pursuing a minimalist approach to protein fold identification that characterizes possible folds in terms of the consistency of their geometric features with restraints derived from relatively cheap, high-throughput experiments. One such experiment is residue-specific cross-linking analyzed by mass spectrometry. This paper presents a suite of novel lower- and upper-bounding algorithms for analyzing the distance between surface cross-link sites and thereby validating predicted models against experimental cross-linking results. Through analysis and computational experiments, using simulated and published experimental data, we demonstrate that our algorithms enable effective model discrimination.
1 Introduction

Knowledge of protein structure is vital for understanding protein function and evolution. Traditional protein structure determination techniques, X-ray crystallography and nuclear magnetic resonance spectroscopy, provide atomic detail, but despite many advances, they remain difficult, expensive, and time-consuming. Recent reports from labs conducting the high-throughput protein structure initiative indicate that only 10 percent of expressed and purified proteins advance to full 3D structure. Alternatively, purely computational techniques (homology modeling, fold recognition, and ab initio) are much faster, but due to the inherent difficulty in scoring predictions, they encounter significant ambiguity in reliably identifying correct structures. We seek a middle ground, verifying predicted structures against minimalist experiments that provide relatively sparse, noisy information relatively quickly and cheaply. In particular, this paper focuses on developing and applying geometric algorithms for model discrimination using data from residue-specific cross-linking, analyzed by mass spectrometry (Fig. 1). We assume here that the models have already been generated and the experimental data have been analyzed to identify a set of cross-links. We present algorithms for checking the consistency of the identified cross-links with the structure models, in order to discriminate among the models.
Figure 1: Cross-linking mass spectrometry protocol. (1) Computationally generate a set of possible structure models. (2) Specifically cross-link the protein using a small molecule of a fixed maximum length. (3) Digest the cross-linked protein with a protease. (4) Obtain and interpret a mass spectrum, using identified cross-links as evidence for spatial proximity and thus for a particular model.
Employing Edman sequencing and mass spectroscopy of cross-links, Haniu et al. developed a largely correct model of human erythropoietin consistent with the cross-linking data, although no alternatives were explicitly considered. Later, Young et al. pioneered the use of mass spectroscopy alone to correctly discriminate among threading models of Basic Fibroblast Growth Factor, FGF-2, in spite of very low sequence similarity. More recent work employs a "top-down" method to fragment proteins within a Fourier transform mass spectrometer, so as to focus on only singly cross-linked protein monomers 4. Similarly, cross-linking has been used to determine tertiary and quaternary arrangements of proteins 5, including membrane proteins that are inherently difficult to crystallize 6,7. The minimalist philosophy has also been applied by other groups in support of approximate structure determination. For example, a limited number of long-range distance constraints from NMR 8,9, mutagenesis followed by functional evaluation 10,11, chemical modification 12, and the pair distance distribution function from small-angle X-ray scattering 13, have all been employed. While traditional structure determination techniques provide substantial overdetermination, minimalist experimental methods for rapid confirmation are noisy and yield only very sparse information. This places a significant burden on computational analyses to carefully characterize model geometry and maximize discriminatory power, in order to be robust to experimental noise and ambiguity. This paper develops a suite of new algorithms, trading complexity vs. accuracy, for analysis of cross-linkability in predicted structure models. The algorithms provide better discriminability and robustness than previously published approaches, and thus promise to enable broader applicability of cross-linking to protein fold identification.
2 Cross-Linkability Analysis

2.1 Problem Formulation

A cross-linker serves as a molecular ruler by linking only "close-enough" pairs of residues. Since the atoms of the cross-linker occupy physical space, the measurements are greatly constrained. We assume here that the cross-linker is energetically excluded
Input: A polyhedral protein surface S, representing the boundary of the body from which the cross-linker is excluded; let Sint denote the interior of the body. A set P of point cross-linking sites on S, representing potentially cross-linked atoms.
Computation: Cross-linking paths between site pairs pi, pj in P, exterior to Sint.
Output: For each pair of sites pi, pj in P, the cross-linking distance D(pi, pj), the minimum of the lengths of cross-linking paths between pi and pj.
Figure 2: Cross-link problem formulation, with a 2D schematic illustrating surface S, atoms, cross-linking sites p1 and p2, and cross-linking paths Q (achieving the cross-linking distance) and R.
from penetrating the protein interior. Since cross-linked residues (e.g. Lys) must be on or near the protein surface in order for the cross-linker to react with them, we represent cross-linked atoms (e.g. Lys NZ) by points on a solvent accessible surface 14. For example, one could find the closest surface point, or a set of "close-enough" such points, reachable from an atom without intersecting the van der Waals spheres of other atoms. While the cross-linked atoms have considerable mobility in solution, we assume that they are fixed for these algorithms. (Dynamics may be accounted for by applying the algorithms to multiple conformations.) We also assume the cross-linker is infinitely flexible. Alternatives will be addressed in a separate publication. With this representation, cross-linkability is determined by testing whether or not the distance between cross-linking sites, measured exterior to the protein, is short enough for the cross-linking molecule. Fig. 2 formalizes the problem and terminology. The basic protein surface representation we employ is a triangulation of the solvent accessible surface, where vertices indicate locations of a probe molecule's center (typically water) when in contact with the protein, and edges connect triangle vertices. In order to allow for uncertainty in the atomic coordinates of models, we have found it desirable to ignore part or all of the protein side chains. For example, Calpha coordinates, as employed by Young 3, completely ignore side chains, while Cbeta coordinates ignore many atoms but retain the side chain direction. We have developed an iterative "peeling" algorithm to remove exposed side chain atoms while leaving internal ones intact so that no voids are introduced. The algorithm first identifies solvent accessible residues (with solvent accessible area above some threshold), and then removes those
side chain atoms that are solvent accessible, starting from the end and moving towards the Calpha in subsequent iterations. This approach guarantees that, upon termination, all and only the outer atoms are removed. The problem of computing cross-linking distance requires finding the shortest path between two points. This is a well-studied problem in graph theory and networks (e.g. Dijkstra's algorithm 15). The complexity of geometric shortest path algorithms (e.g. for robotics) grows rapidly with the dimension. Our cross-linking problem can be viewed as finding the shortest obstacle-avoiding path, treating the protein body as an obstacle. When the path is not constrained to a discrete graph, but can include bends, the number of combinatorially different paths becomes exponential. Several approximation algorithms for finding the shortest path have been developed 16. Here we specialize the shortest path problem to take into consideration the special geometry of proteins. We obtain a hierarchy of novel lower- and upper-bound algorithms for estimating cross-linking distance. Due to space constraints, we present here only high-level pseudocode (Fig. 3), examples (Fig. 4), and sketches of some correctness and complexity arguments.
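The peeling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: side chains are represented as atom lists ordered from the Calpha outward, and `is_accessible` is a hypothetical stand-in for a solvent accessible surface area test.

```python
def peel_side_chains(side_chains, is_accessible):
    """Iteratively strip solvent-exposed side-chain atoms.

    side_chains: dict mapping residue id -> list of atom names ordered
    from Calpha outward, so the terminal atom is last. Atoms are removed
    from the tip inward; a buried tip halts peeling for that chain, so
    no interior voids are introduced.
    """
    chains = {r: list(atoms) for r, atoms in side_chains.items()}
    changed = True
    while changed:
        changed = False
        for atoms in chains.values():
            if atoms and is_accessible(atoms[-1]):
                atoms.pop()          # remove the exposed tip atom
                changed = True
    return chains
```

For example, peeling a lysine side chain `["CB", "CG", "CD", "CE", "NZ"]` with `NZ` and `CE` exposed leaves `["CB", "CG", "CD"]`.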
2.2 Lower Bound Algorithms
The Euclidean distance d(pi, pj) between cross-linking sites provides an obvious lower bound, Dline, on cross-linking distance. This straight-line approach does not account for the model's surface geometry, and provides relatively little information, but has been employed for model discrimination by Young et al. A tighter bound is obtained by sampling cross-sections of the protein at points along the segment connecting cross-link sites. Our disk algorithm (Figs. 3, 4a) computes a lower bound Ddisk by sampling a set C of points on the pi-pj segment and in Sint, and then constructing a sequence of disks with centers in C, perpendicular to pi-pj and contained entirely within the body S together with Sint (they intersect the protein surface only by their boundary circles). The convex hull of the union of the disks and endpoints captures some of the essential surface geometry and provides for immediate computation of a lower bound path. The distance from one site to the other is measured along a path in the intersection of the boundary of the convex hull with a plane containing the segment pi-pj. Ddisk(pi, pj) depends on the sample points C, which we treat as fixed for the following arguments. For all pi, pj, Dline(pi, pj) <= Ddisk(pi, pj), because the length of each path from pi to pj is at least the Euclidean distance. For all pi, pj, Ddisk(pi, pj) <= D(pi, pj) follows from the fact that if the length of a path P from pi to pj is less than Ddisk(pi, pj), then P intersects the interior of at least one of the disks. Thus, if there exists a cross-linking path Pc with |Pc| = D(pi, pj) < Ddisk(pi, pj), then Pc contains an interior point of at least one of the disks. By construction, each interior
PlaneDistance(S, pi, pj)
  C <- a set of sample points on [pi, pj] in Sint
  Theta <- a set of sample plane normals not perpendicular to pi-pj
  return max over c in C of (max over theta in Theta of (min { d(pi, p) + d(p, pj) | p in S intersect plane(c, theta) }))

ShortcutDistance(S, pi, pj)
  P <- a set of sample paths on the graph of S, from pi to pj
  for each P = (pi = v1, v2, ..., vn = pj) in P
    GP <- (V, E): V = {v1, ..., vn}, E = { {vk, vl} | segment vk-vl intersect Sint = empty }
    dP <- length of shortest pi to pj path on GP
  return min over P in P of dP

VisibilityDistance(S, pi, pj)
  G <- (V, E): V = vertices of S, E = { {vk, vl} | segment vk-vl intersect Sint = empty }
  return length of shortest pi to pj path on G

Figure 3: Cross-linking distance bounding algorithms.
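Selecting interior sample points requires classifying a point as inside or outside the triangulated surface. A minimal ray-parity sketch, using the standard Moller-Trumbore ray/triangle intersection (an illustration, not the authors' implementation; a robust version would also perturb the ray direction away from degenerate edge hits):

```python
def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_hits_triangle(orig, d, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore test: does the ray orig + t*d (t > 0) hit
    triangle (v0, v1, v2)?"""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(d, e2)
    det = _dot(e1, p)
    if abs(det) < eps:
        return False                     # ray parallel to triangle plane
    tv = _sub(orig, v0)
    u = _dot(tv, p) / det
    if u < 0 or u > 1:
        return False
    q = _cross(tv, e1)
    v = _dot(d, q) / det
    if v < 0 or u + v > 1:
        return False
    return _dot(e2, q) / det > eps       # hit strictly in front of origin

def point_inside(point, triangles, d=(1.0, 0.0, 0.0)):
    """Parity test: an odd crossing count means the point is in Sint."""
    hits = sum(ray_hits_triangle(point, d, *tri) for tri in triangles)
    return hits % 2 == 1
```

With the four faces of a tetrahedron as `triangles`, an interior point yields one crossing (odd), an exterior point zero.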
point of each of the disks belongs to Sint, so Pc intersects Sint, a contradiction. The complexity of the disk algorithm depends on the implementation of the various geometric tests. Selecting sample points requires testing whether a point is inside or outside the polyhedral surface, and determining disk radii requires finding distances to surface points in the perpendicular plane. We employ a straightforward inside/outside test that counts the number of intersections of a ray from the sample point with the triangles of the protein surface, requiring O(|C||T|) total time, where T is the set of triangles of S. We compute disk radii by first sorting surface vertices in order along the segment pi-pj, and then, for each sample point, using binary search to find vertices of triangles that potentially intersect the disk at the sample point. This requires output-sensitive time O(|C||TC| log |T|), where TC is the set of triangles found by the search. We note that if a very finely sampled set of points is desired (trading increased complexity for increased accuracy), a plane sweep algorithm could be employed, keeping track of surface triangles intersecting the current plane and iterating by vertices in order of
Figure 4: 2D schematics and examples on the protein FGF-2 for (a) disk, (b) plane, and (c) shortcut algorithms.
their projections onto pi-pj. A complementary lower bound, Dplane, considers single cross-sections at multiple angles and positions. Our plane algorithm (Figs. 3, 4b) employs this idea to compute the bound by finding, at each sample point and each admissible plane orientation, the shortest path from one cross-link site to the other via a point on the intersection of the plane and the protein surface. The longest such path determines the lower bound. Correctness of the plane algorithm follows from the fact that the cross-linking path must pass through each such plane without intersecting Sint. The complexity analysis for the plane algorithm is similar to that for the disk algorithm. The disk algorithm considers the sample points simultaneously, at a uniform cross-section angle, while the plane algorithm considers the sample points independently, at variable angles. Both the lower bounds and the computational complexity of these algorithms depend not only on S, pi, pj, but also on the sample points (and, for the plane algorithm, the sample normal directions). The two degrees of freedom sampled for the plane orientations result in more intersection tests than are required for the disk algorithm.
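The PlaneDistance pseudocode of Fig. 3 can be sketched on a point-sampled surface. This is a simplified illustration under stated assumptions: the surface is a finite point sample, planes are given by a center and a unit normal, and `tol` (hypothetical) decides which sampled surface points count as lying on a plane.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def plane_lower_bound(surface_pts, pi, pj, samples, normals, tol=0.1):
    """Point-sampled version of the PlaneDistance bound.

    samples: interior points on the pi-pj segment; normals: unit plane
    normals (not perpendicular to pi-pj). Any exterior path must cross
    each sampled plane at some surface-adjacent point p, costing at
    least d(pi, p) + d(p, pj); the bound is the maximum of these
    minima over all sampled cross-sections.
    """
    best = dist(pi, pj)   # the straight line is itself a lower bound
    for c in samples:
        for n in normals:
            on_plane = [p for p in surface_pts
                        if abs(sum(nk * (pk - ck)
                                   for nk, pk, ck in zip(n, p, c))) < tol]
            if on_plane:
                via = min(dist(pi, p) + dist(p, pj) for p in on_plane)
                best = max(best, via)
    return best
```

For sites at (-2, 0, 0) and (2, 0, 0) separated by a unit-radius obstacle, the cross-section through the origin raises the bound from the straight-line 4 to roughly 4.47.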
2.3 Upper Bound Algorithms
An immediate upper bound on the cross-linking distance is obtained by taking the convex hull of the protein surface, finding paths outside Sint from the cross-linking sites to representative points on the surface of the hull, and finding shortest paths on the hull surface between these points. The correctness of the upper bound computed by this hull algorithm follows immediately, since the hull is exterior to the protein. The bound depends on the paths from the sites to the hull surface, and is useful when the computation of these paths is easy (e.g. when a line segment not intersecting Sint can be identified). By applying Chen and Han's 17 single-source shortest-paths
algorithm for polyhedral surfaces, the complexity for a single site pi to all other pj in P is O(|V|^2), where V is the set of hull vertices.
The convex hull approach takes "shortcuts" across the mouths of concavities by traversing the hull of the protein, but can miss shortcuts through the concavities. A complementary approach is to start with a sample of paths on the protein surface, rather than on the hull, and then take shortcuts where possible to reduce the lengths of these paths. More precisely, a shortcut of a path replaces the subsequence of vertices (pk, pk+1, ..., pl) with the sequence (pk, pl) when the segment pk-pl doesn't intersect Sint. We call such a pair pk, pl a visible pair. Our shortcut algorithm (Figs. 3, 4c) applies this approach to compute an upper bound Dshortcut. Since initial paths are on the surface and shortcuts do not penetrate the body, this is a correct upper bound. The complexity of the shortcut algorithm depends on the approaches to generating paths, computing visibility, and selecting shortcuts. Our current implementation generates diverse paths by repeatedly performing a breadth-first search from pi to pj (taking time linear in the number of surface vertices) and removing edges for path vertices before the next iteration. Other approaches are also possible to achieve diversity. We shortcut a path by an iterative greedy refinement algorithm, starting at pi and at each iteration jumping to the vertex that is furthest along the path and still visible. Visibility can be tested by computing surface triangle intersections, as discussed for the disk algorithm, yielding O(|T||P|^2) total time to shortcut a path P. An alternate approach that we are exploring is to test intersection of a segment with each of the protein atom spheres, using an atomic radius expanded by that of the solvent. In either case, efficient data structures could reduce the number of triangles tested. Dijkstra's single-source shortest path algorithm 15 could be employed instead of the greedy shortcutting, requiring O(|T||P|^2) time to guarantee optimal shortcutting.
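The greedy refinement can be sketched compactly. Here `segment_clear` is a hypothetical visibility oracle (e.g. the triangle-intersection or atom-sphere test mentioned above); vertices adjacent on the original path are always kept connected.

```python
def greedy_shortcut(path, segment_clear):
    """Greedy path shortcutting: from the current vertex, jump to the
    farthest later vertex that is still visible.

    path: list of vertices from pi to pj on the surface graph;
    segment_clear(a, b): True when segment a-b misses the interior.
    """
    out = [path[0]]
    i = 0
    while i < len(path) - 1:
        j = len(path) - 1
        # back off until a visible vertex (or the next one) is found
        while j > i + 1 and not segment_clear(path[i], path[j]):
            j -= 1
        out.append(path[j])
        i = j
    return out
```

For a five-vertex path where only jumps of at most two steps are visible, the result is the three-vertex path through the middle; with full visibility it collapses to the endpoints.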
We find that in practice the greedy approach usually makes substantial progress per iteration and is closer to linear than quadratic in path length. Rather than considering shortcuts on a few sample paths, we can compute, at the cost of additional complexity, a complete visibility graph for the protein surface. A visibility graph 18 indicates all visible pairs of vertices. Given a visibility graph, we can apply standard shortest path algorithms (e.g. Dijkstra's algorithm 15). Our visibility algorithm (Fig. 3) uses this approach to compute an upper bound Dvisibility. As with the shortcut algorithm, correctness as an upper bound is immediate. A straightforward construction of the visibility graph, using the techniques mentioned above for shortcutting, requires O(|T||V|^2) time, where T and V are respectively the set of triangles and the set of vertices of S. This preprocessing is shared across all cross-linking site pairs; Dijkstra's algorithm then requires an additional O(|V|^2) time for each site.
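Given the visibility graph (with Euclidean edge lengths), the single-source computation is standard. A compact Dijkstra sketch, with the graph as an adjacency dict (hypothetical representation):

```python
import heapq

def dijkstra(adj, src):
    """Dijkstra's single-source shortest paths (ref. 15).

    adj: dict vertex -> list of (neighbor, edge_length) pairs, e.g.
    the visibility graph with Euclidean edge lengths. Returns a dict
    of shortest distances from src to every reachable vertex.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                  # stale queue entry, already settled
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

With a binary heap this runs in O(|E| log |V|); the O(|V|^2) figure quoted above corresponds to the classical array implementation on a dense visibility graph.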
2.4 Protein Model Discrimination
In order to discriminate among a set of predicted protein models, we must test for each of them the feasibility of the distances for all observed cross-links. We note that less information can be gained from the absence of evidence for a cross-link under a bottom-up mass spectrometry approach, since several factors other than cross-linking distance can contribute to the absence. More powerful reasoning from negative evidence will be possible in future work, particularly following the application of top-down mass spectrometry for cross-linking analysis 4. When employed with observed cross-links, lower and upper bounds provide complementary information for model discrimination. A lower bound can provide evidence against a model, when the estimated distance for an observed cross-link exceeds the expectation for the cross-linker. An upper bound can provide evidence for a model, when the estimated distance for an observed cross-link is less than the maximum distance. We adopt a simple strategy that assumes cross-links are independent and sums their scores: +1 when an upper bound is satisfied, -1 when a lower bound is violated, and 0 when neither holds. (It is impossible for both to hold.)
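This scoring strategy can be sketched directly. The bound functions and the 24 Angstrom default are stand-ins (the threshold used in the Results for the BS3 cross-linker):

```python
def score_model(cross_links, lower, upper, max_len=24.0):
    """Sum per-cross-link evidence for a model.

    lower/upper: functions giving lower- and upper-bound distances for
    a cross-linked pair (e.g. the disk and shortcut bounds); max_len is
    the feasible cross-linker span. +1 when the upper bound fits
    (evidence for), -1 when even the lower bound exceeds it (evidence
    against), 0 when neither bound is decisive.
    """
    total = 0
    for pair in cross_links:
        if upper(pair) <= max_len:
            total += 1
        elif lower(pair) > max_len:
            total -= 1
    return total
```

For three cross-links with (lower, upper) bounds of (10, 20), (26, 30), and (15, 30), the contributions are +1, -1, and 0, giving a model score of 0.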
3 Results

We have tested the performance of our algorithms for model selection with both published experimental and simulated data. Fibroblast growth factor (FGF-2) is the primary target because of available data 3 and structure (PDB id 4FGF). Competing models were obtained for the published template structures via the protein fold-recognition meta-server 19; two of the models are of the same fold (beta trefoil) as 4FGF. The Lys-specific cross-linker BS3 was used. To further demonstrate the utility of our approach, we chose two CASP4 20 targets with many high-quality models: deoxyribonucleoside kinase (PDB id 1J90) and alpha-catenin (PDB id 1L7C). We applied our algorithms, using NZ, Cgamma, Cbeta, or Calpha atoms (with surfaces appropriately peeled), and found Cbeta to provide the best results. The Calpha straight-line measurement of Young et al. provides a control, although we could not exactly reproduce their model discrimination results (presumably due to differences in the details of the protein models). Visualizations like those in Fig. 4 provide evidence of the ability of our algorithms to better approximate cross-linking distance. To quantitatively characterize discriminatory power, we computed, for each distance between 1 and 45 Angstroms, the number of possible Lys pairs in 4FGF whose length exceeds the threshold, and compared the number for experimentally identified cross-links (to be maximized) and unidentified ones (to be minimized). A greater difference between these numbers at a threshold indicates better abstraction of structural features and an enhanced ability of the method
Figure 5: Comparison of cross-linking distances for (left) Calpha straight-line, (middle) Cbeta disk, and (right) Cbeta plane methods. The x-axis indicates a distance and the y-axis the number of experimentally-identified (blue lower line; 18 maximum) and unidentified (red upper line; 48 maximum) cross-links exceeding that threshold.
employed to separate identified from unidentified cross-links for a cross-linker of that length. Fig. 5 compares the straight-line distance against two of our lower bound methods. The area between the curves (summing the count difference over the range) is 641 for Calpha straight-line, 826 for Cbeta disk, and 887 for Cbeta plane, demonstrating the more informative bounds provided by our algorithms. In model discrimination, Young et al. employ a maximum value of 24 Angstroms for feasible cross-linking distance; we use the same threshold for testing both upper and lower bounds. This value accounts for the BS3 length (11.4 Angstroms), the distance from the reactive NZ to the representative cross-linking site, and a small amount of uncertainty. Fig. 5 shows that some of the experimentally-determined cross-links have distances exceeding even this threshold (e.g. Ddisk(Lys21, Lys125) is 29.5 Angstroms). These large distances were confirmed visually. Possible explanations include experimental errors, artificial distortion of the protein, or extensive natural flexibility. Artificial distortion (e.g. by partial denaturation due to multiple cross-links) may be alleviated by a better choice of experimental conditions. The work of Falke 21 suggests it is possible to obtain cross-links more than 10 Angstroms longer than expected, in mobile situations, although the rate of cross-linking falls off by orders of magnitude. To study such flexibility, we intend to apply our algorithms to multiple frames of a molecular dynamics simulation, boosting the need to trade off efficiency and tightness of bound. We note that infrequent conformations might in general be detected rarely by mass spectrometry, and thus could be treated as noise in a probabilistic analysis. The cross-link experiment could also be altered to exploit differences in rates. We further quantified discriminatory power by comparing differences in estimated cross-link distances between models.
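Comparing models by their estimated cross-link distances can be sketched as a Euclidean difference between per-model distance vectors (a hypothetical helper, one coordinate per cross-link):

```python
import math

def model_difference(dists_a, dists_b):
    """Euclidean distance between two models' cross-link distance
    vectors (one coordinate per cross-link). A larger value suggests
    the cross-linker's fixed length is more likely to separate the
    models on some individual cross-link."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dists_a, dists_b)))
```

For example, two models whose two estimated cross-link distances differ by 3 and 4 Angstroms are 5 Angstroms apart in this space.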
We treat the set of cross-linking distances for a model as a point in l-dimensional space (for l cross-links), and compute differences (Euclidean distances) between these points. A larger difference is indicative of greater discriminatory power, since the cross-linker's fixed length is more likely to separate the points on some dimension (cross-link). We compared our disk Cbeta
algorithm to the control straight-line Calpha, and found that our algorithm yields an average of 0.2-0.3 Angstroms larger average differences for both experimentally observed and all possible cross-links, whether comparing 4FGF to all other models, 4FGF to non-beta-trefoil models, or each model to all other models. We tested our methods by ranking the correct structure vs. the models, scoring with either the Young approach of counting violations (straight-line distance > 24 Angstroms) or our discrimination method combining disk (lower bound) and shortcut (upper bound) distances. We analyzed the effects of cross-link sparsity and noise by choosing datasets consisting of a random subset of the identified plus a random set of the unidentified cross-links. Fig. 6 illustrates the average rank of the correct structure over 100 such simulations for each of several different numbers of observed and unobserved cross-links. (We apply the conservative choice of ranking the correct structure worst in case of a tie.) With smaller subsets of identified cross-links, the two methods are comparable. Larger subsets tend to include more cross-links labeled infeasible by the disk bound, and our method degrades. Finally, we analyzed model discriminability by varying the number of simulated "good" and "bad" cross-links and finding the average rank of the correct structure as above. For tests with our method, good cross-links were chosen from those with shortcut Cbeta distance below 24 Angstroms in the correct structure, and bad cross-links from those with disk Cbeta distance greater than 24 Angstroms. Similarly, good and bad cross-links for the straight-line method were chosen using the 24 Angstrom threshold. Fig. 7 shows results for FGF using each method to analyze the corresponding simulated dataset.
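The conservative tie-handling rank used in these experiments can be sketched as follows (higher scores assumed better, as in the +1/-1/0 scheme; names are hypothetical):

```python
def conservative_rank(correct_score, model_scores):
    """Rank of the correct structure among competing models, counting
    ties against it: every model scoring at least as well as the
    correct structure pushes its rank down (rank 1 is best)."""
    return 1 + sum(1 for s in model_scores if s >= correct_score)
```

For example, a correct-structure score of 5 against model scores [7, 5, 3] yields rank 3, since both the 7 and the tying 5 count against it.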
These results test discriminability and robustness to sparsity and noise: over many different sets of feasible/infeasible cross-links, our distances distinguish the correct structure from the models better than do straight-line distances. Fig. 8 shows our results on the CASP4 targets; straight-line is again inferior (not shown).
4 Conclusions
We have developed and applied a set of lower- and upper-bound algorithms for estimating cross-linking distance. The algorithms trade off complexity and tightness of bound. We have shown that, by taking into account protein surface geometry, our algorithms provide better model discriminability, in terms of cross-link separability, distance differences, and discrimination effectiveness. We illustrated the robustness of our techniques by simulating sets of good and bad cross-link data. Our results demonstrate that information from relatively rapid and inexpensive experiments permits model discrimination in spite of sparse information and the presence of noise. The current work can be further extended in several ways. Protein dynamics can be taken into consideration. As more experimental data become available, better classifiers can be developed to apply distance estimates to model discrimination. While
Figure 6: Discrimination using experimental data for FGF-2 with (a) straight-line Calpha, (b) combined disk and shortcut Cbeta. The x- and y-axes indicate the number of cross-link pairs identified and unidentified, respectively; the z-axis shows the average rank of the actual structure over 100 random subsets.
Figure 7: Discriminability for FGF-2 with (a) straight-line Calpha, (b) combined disk and shortcut Cbeta. The x- and y-axes indicate the number of good and bad cross-link pairs, respectively, chosen according to the same methods; the z-axis shows the average rank of the actual structure over 100 random subsets.
Figure 8: Discriminability, as in Fig. 7, with combined disk-shortcut Cbeta using simulated data for (a) deoxyribonucleoside kinase and (b) alpha-catenin models.
cross-links were considered independent here, a more complex framework would capture dependencies with respect to differential reactivity, competing cross-links, and so forth. Our analysis can also be used in planning experiments, e.g. proposing a cross-linker of the best length or the substitution of particular residues to lysine.

Acknowledgments

This work is supported in part by a US NSF CAREER award to CBK (IIS-0237654), and an EMBO/HHMI Young Investigator and Foundation for Polish Science Young Scholar award to JMB. Thanks to Mike Stoppelman, Xiaoduan Ye, and other members of our labs for helpful discussions and related work.

References

1. Natl. Inst. Gen. Med. Sci. http://www.structuralgenomics.org.
2. M. Haniu, L. O. Narhi, T. Arakawa, S. Elliott, and M. F. Rohde. Protein Sci, 9:1441-51, 1993.
3. M. M. Young et al. PNAS, 97:5802-5806, 2000.
4. G. H. Kruppa, J. Schoeniger, and M. M. Young. Rapid Commun Mass Spectrom, 17(2):155-62, 2003.
5. A. Scaloni et al. J Mol Biol, 277:945-958, 1998.
6. J. B. Swaney. Methods Enzymol, 128:613-626, 1986.
7. I. Kwaw, I. Sun, and H. R. Kaback. Biochemistry, 39:3134-3140, 2000.
8. J. Skolnick, A. Kolinski, and A. R. Ortiz. J Mol Biol, 265:217-241, 1997.
9. P. M. Bowers, C. E. M. Strauss, and D. Baker. J Biomol NMR, 18:311-318, 2000.
10. S. Elliott et al. Blood, 87(7):2702-13, 1996.
11. A. Bohm et al. J Biol Chem, 277(5):3708-17, 2002.
12. F. Zappacosta et al. Protein Sci, 6(9):1901-9, 1997.
13. W. Zheng and S. Doniach. J Mol Biol, 316:173-87, 2002.
14. B. Lee and F. M. Richards. J Mol Biol, 55(3):379-400, 1971.
15. E. W. Dijkstra. Numerische Mathematik, 1:269-271, 1959.
16. J. S. B. Mitchell. Geometric shortest paths and network optimization. Handbook of Computational Geometry, 2000.
17. J. Chen and Y. Han. In Proc ACM Symp Comp Geom, pp. 360-369, 1990.
18. J. C. Latombe. Robot Motion Planning. Kluwer, 1991.
19. M. A. Kurowski and J. M. Bujnicki. Nucleic Acids Res, 31(13):3305-7, 2003. http://genesilico.pl/meta.
20. J. Moult et al. Proteins, S5:2-7, 2001.
21. C. L. Careaga and J. J. Falke. J Mol Biol, 226:1219-35, 1992.
PROTEIN FOLD RECOGNITION THROUGH APPLICATION OF RESIDUAL DIPOLAR COUPLING DATA
Y. QU, J.-T. GUO, V. OLMAN, and Y. XU*

Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA, and Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA (*correspondence: [email protected])

Residual dipolar coupling (RDC) represents one of the most exciting emerging NMR techniques for studying protein structures. However, solving a protein structure using RDC data alone is a highly challenging problem, as it often requires the starting structural model to be close to the actual structure of a protein for the structure calculation procedure to be effective. We report in this paper a computer program, RDC-PROSPECT, for identification of a structural homolog or analog of a target protein in PDB, which best matches the 15N-1H RDC data of the protein recorded in a single ordering medium. The identified structural homolog/analog can then be used as a starting model for RDC-based structure calculation. Since RDC-PROSPECT uses only RDC data and predicted secondary structure information, its performance is virtually independent of sequence similarity between a target protein and its structural homolog/analog, making it applicable to protein targets beyond the scope of current protein threading techniques. We have tested RDC-PROSPECT on all 15N-1H RDC data (representing 33 proteins) available in the BMRB database and the literature. The program correctly identified the structural folds for 80% of the target proteins, significantly better than previously reported results, and achieved an average alignment accuracy of 97.9% of residues aligned within a 4-residue shift. Through careful algorithmic design, RDC-PROSPECT is at least one order of magnitude faster than previously reported algorithms for principal alignment frame search, making it fast enough for large-scale applications.
1 Introduction

Since the publication of the seminal work by Tolman et al.1 and Tjandra & Bax,2 residual dipolar coupling (RDC) in weak alignment media has gained great popularity in solving protein structures using NMR techniques. RDC provides information about the angles of atomic bonds, e.g., N-H bonds, of a protein's amino acids with respect to a specific 3-dimensional (3D) reference frame. Using such information, an NMR structure could, at least theoretically, be solved through molecular dynamics (MD) simulation and energy minimization under the constraints of the RDC angle information. A key advantage of RDC-based NMR structure solution is that RDC data can be obtained from a small number of NMR experiments in a very efficient manner.3 Potentially, it could also overcome a number of limitations of traditional NOE-based NMR structure determination techniques, e.g., the size limit on a target protein.4 Though recognized for its great potential for solving larger proteins faster, direct application of RDC data to protein structure solution remains a highly challenging problem. The problem mainly comes from the well-known four-fold degeneracy of RDC. An RDC value of an N-H bond (for example) does not
uniquely define a single orientation of the N-H bond; rather, it only restricts the orientation to two symmetric cones, making the search space of feasible structural conformations extremely large. In addition, inclusion of the RDC terms in the NMR energy function for structure calculation results in a highly rippled energy surface with innumerable sharp local minima,6 making the search problem exceedingly difficult. In the absence of long-range NOE distance information, it is practically intractable to find the global minimum by conventional optimization techniques. However, if the starting model is close to the true structure, convergence becomes much easier. Therefore, a great deal of effort has been made to obtain good starting structures for RDC-based NMR structure calculation. Existing methods for deriving protein structures from RDC data alone fall mainly into two categories: de novo fragment assembly methods7-10 and whole-protein structural homology search methods.11,12 De novo methods build protein structures by assembling structural fragments that are consistent with the RDC data. These methods typically require a complete or near-complete set of RDC data to be effective, and are often very time-consuming. One example is the RosettaNMR program,10 which typically needs more than 3 RDC data per residue for its structure calculation to be accurate. As these methods typically attempt to assemble a protein structure in a sequential manner, they often suffer from accumulation and propagation of small errors from individual fragments. Structural homology search methods generally require fewer RDC data and much less computing time, but are applicable only to proteins with solved homologous structures.
Based on theoretical estimates of the total number of unique structural folds in nature, and on the low percentage (< 5%) of novel structural folds among all structures submitted to PDB in the past few years,13 it is generally believed that the majority of the unique structural folds in nature are already included in PDB. Hence structural homology search methods are becoming increasingly popular. Annila et al.11 were the first to use assigned RDC data to search for structural homologs; their work demonstrated the feasibility of fold recognition using RDC data alone. Meiler et al.12 developed a program, DipoCoup, for structural homology search using secondary structure alignment. While all the aforementioned methods contain interesting ideas, they have been tested only on a very small set of proteins, in a few cases on only one protein, ubiquitin. Therefore, their true practical usefulness is yet to be determined. We have recently developed a computer program, RDC-PROSPECT (RDC-PROtein Structure PrEdiCtion Toolkit), for protein fold recognition and protein backbone structure prediction. Currently the program uses only assigned N-H RDC data in a single ordering medium, together with predicted secondary structure, to identify structural homologs or analogs from the PDB database. RDC-PROSPECT identifies a structural fold by finding the fold in PDB that best matches the N-H RDC data, using a dynamic programming approach. Compared with existing methods, RDC-PROSPECT has a number of unique capabilities. First, RDC-PROSPECT requires only a small number of RDC data for fold recognition. On our test set, consisting of all publicly available N-H RDC data of 33 proteins deposited in the
BMRB database (www.bmrb.wisc.edu) and published in the literature, RDC-PROSPECT achieves an 80% fold recognition rate with an average of 0.7 RDC data per residue. Requiring fewer RDC data implies a smaller number of NMR experiments needed to solve a structure. Second, RDC-PROSPECT does not require sequence similarity information for fold recognition, making the program equally applicable to proteins with only remote homologs or structural analogs in the PDB database, which represent a significant challenge to current threading methods. Third, RDC-PROSPECT runs significantly faster than almost all existing RDC-based methods, using a novel search algorithm for the principal alignment frame of the RDC data.
2 Methods

An RDC measures the relative angle of an atomic bond in a residue with respect to the principal alignment frame14 of the protein (more rigorously, of each rigid portion of the protein structure). The principal alignment frame, represented as an (x, y, z) Cartesian coordinate system, depends on the medium in which the protein is situated and on the protein structure itself. In this paper, we consider only the RDC data of N-H bonds, the easiest RDC data to obtain experimentally. The RDC value measured by NMR experiments for each N-H bond is defined as15

D = Da (3cos²θ − 1) + 1.5 Dr sin²θ cos 2φ
(1)
where θ is the angle between the bond and the z-axis of the principal alignment frame (x, y, z), and φ is the angle between the bond's projection in the x-y plane and the x-axis; Da and Dr represent the axial and rhombic components of the alignment tensor, respectively. Intuitively, Da and Dr measure the magnitude (intensity) of the alignment. From an NMR experiment, we get a set of {Di} values without knowing which Di corresponds to the N-H bond of which residue, or what the principal alignment frame is. Our goal is to develop a computational procedure that finds a protein fold in the PDB database and searches for an (x, y, z) Cartesian coordinate system producing a set of calculated N-H bond RDC values, using equation (1), that best match the experimental RDC data. In this paper, we solve a constrained version of this fold recognition problem, assuming that the RDC data are already correctly assigned to individual residues.

2.1 Alignment of RDC data with structural fold

The RDC-based fold recognition problem can be rigorously stated as follows. Let D = (D1, ..., DK) be a list of assigned experimental N-H RDC data (DNH) of a target protein. Let D*(T, F) = (D*1, ..., D*M) be the calculated RDC data of a template structure T, assuming the principal alignment frame is F. We want to find an alignment A: i → A(i) between D and D*(T, F) that minimizes the following function:
S(A) = Σi (Di − D*A(i))²/σ² + w1 Σi M(Si, S*A(i)) + w2 Σj pGj    (2)
where Di is aligned with D*A(i), and σ is the standard deviation of the experimental DNH; Si and S*A(i) are the predicted secondary structure type of position i of the target protein and the assigned secondary structure of position A(i) of the template structure; M() is a penalty function for secondary structure type match/mismatch, equal to -1 for a match and 1 for a mismatch; pGj is the total penalty for the j-th gap in the alignment, of the form a + Lj·b, with a the gap opening penalty, b the gap elongation penalty, and Lj the length of the j-th gap (the number of consecutive skipped elements). w1 and w2 are two scaling factors, empirically determined (using simulated data) as w1 = 1 and w2 = 1. The D*(T, F) values of the template structure T are calculated using equation (1) for a specified alignment frame F (we discuss how to systematically search for the correct alignment frame in the next subsection). To estimate Da and Dr in (1), we use the equations of the histogram method proposed by Clore et al.:16

Dzz = 2 Da,    Dyy = −Da (1 + 1.5 Dr/Da)    (3)

where Dzz and Dyy are the maximum or minimum values of the experimental DNH, respectively, with |Dzz| > |Dyy|. θ and φ in equation (1) are calculated for the N-H bond of each residue of the template structure with respect to the specified alignment frame F. We used PSIPRED17 for secondary structure prediction of a target protein sequence. We consider three classes of secondary structure: helix (H), strand (E), and coil (C). In assessing secondary structure matches (using function M()), we consider only PSIPRED predictions with a confidence level of at least 8 on the 0-9 scale. For a prediction with confidence level < 8, we assign a special category U (uncertain) to the position and set M(Si, S*A(i)) = 0 when Si = U. The alignment problem also employs a few additional rules as hard constraints when aligning a list of RDC data with a protein structure.
These include: (a) if a position in the target protein has no assigned RDC data, its corresponding alignment score (the D-portion of (2)) is set to zero; (b) no penalty is charged for gaps at the beginning and end of a global alignment; (c) no alignment gap is allowed in the middle of an H or E secondary structure element of the template structure; and (d) we consider alignment scores defined by (2) only for helix and strand regions, while for coil regions we penalize the length difference of aligned coils. The last rule reflects the following consideration: homologous proteins are generally conserved in their core secondary structures (helices and strands) but not in their coil regions, and detailed alignment of coil regions often hurts fold recognition and alignment accuracy, especially when dealing with remote homologs and structural analogs. We have implemented a simple dynamic programming algorithm for finding the globally optimal solution of this alignment problem under the specified hard
constraints. The dynamic programming algorithm consists of a set of recurrences similar to those of the Needleman-Wunsch algorithm.18 At each step of the recurrence calculation, the hard constraints are checked to guarantee that none is violated.
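As an illustration only (not the authors' code), the back-calculation of equation (1), the tensor estimates of equation (3), and a stripped-down version of the alignment recurrence can be sketched in Python; all function names are ours, a plain linear gap cost stands in for the affine penalty a + Lj·b, and the secondary structure term and hard constraints are omitted:

```python
import math

def rdc(theta, phi, d_a, d_r):
    """Back-calculated N-H RDC from equation (1).

    theta: angle between the bond and the z-axis of the alignment frame
           (radians); phi: angle between the bond's x-y projection and
           the x-axis; d_a, d_r: axial and rhombic tensor components."""
    return (d_a * (3.0 * math.cos(theta) ** 2 - 1.0)
            + 1.5 * d_r * math.sin(theta) ** 2 * math.cos(2.0 * phi))

def estimate_tensor(d_exp):
    """Estimate (Da, Dr) from the extremes of the experimental RDC
    histogram, per equation (3): Dzz = 2 Da, Dyy = -Da (1 + 1.5 Dr/Da)."""
    hi, lo = max(d_exp), min(d_exp)
    d_zz, d_yy = (hi, lo) if abs(hi) > abs(lo) else (lo, hi)  # |Dzz| > |Dyy|
    d_a = d_zz / 2.0
    d_r = -(d_yy + d_a) / 1.5
    return d_a, d_r

def align_cost(target, template, pair_cost, gap=1.0):
    """Needleman-Wunsch-style global alignment minimizing a total cost.

    pair_cost(a, b) plays the role of the per-position D-term of (2)."""
    n, m = len(target), len(template)
    # dp[i][j] = best cost of aligning target[:i] with template[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + pair_cost(target[i - 1], template[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    return dp[n][m]
```

For instance, with Da = 10 and Dr = 3, a bond along the z-axis (θ = 0) back-calculates to D = 2·Da = 20, the Dzz extreme used by equation (3).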
2.2 Assessment of prediction confidence

Because the alignment scores are not normalized with respect to sequence length and amino acid composition, we use a Z-score to assess the quality of an alignment. For an RDC alignment problem with a set of experimental RDC data DNH and a template structure T, we calculate the Z-score of the alignment score T0 as follows. The RDC data, with their respective secondary structure types, are randomly shuffled multiple times, and for each reshuffled RDC list we calculate the alignment score against the template T. The Z-score of T0 is defined as

Z = (Tr − T0) / σr
(4)
where Tr and σr are the mean and standard deviation of the alignment scores of the reshuffled RDC lists. For the current work, we use 500 reshufflings (we have also tried significantly larger numbers of reshufflings, but found that 500 gives Z-scores similar to those obtained with higher numbers). Figure 1 plots Z-score against fold recognition specificity on our test set of 33 proteins, searched against our template structure database. For example, when the Z-score is > 20, the prediction specificity is > 70%.
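A minimal sketch of this shuffle test (illustrative only; `score_fn` stands in for the full alignment of Section 2.1, and since lower alignment scores are better here, the sign follows equation (4)):

```python
import random

def z_score(t0, rdc_list, score_fn, n_shuffles=500, seed=0):
    """Z = (Tr - T0) / sigma_r, where Tr and sigma_r are the mean and
    standard deviation of scores over reshuffled copies of the RDC list."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    scores = []
    for _ in range(n_shuffles):
        shuffled = rdc_list[:]
        rng.shuffle(shuffled)
        scores.append(score_fn(shuffled))
    mean = sum(scores) / n_shuffles
    sigma = (sum((s - mean) ** 2 for s in scores) / n_shuffles) ** 0.5
    return (mean - t0) / sigma
```

A genuinely good alignment (low T0) then yields a large positive Z, matching the specificity trend of Figure 1.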
Figure 1. Fold recognition Z-score versus prediction specificity
2.3 Principal alignment frame search and fold recognition
One of the challenging issues with RDC-based fold recognition is that the principal alignment frame, required for calculating RDC values using equation (1), is not known from the experimental data. If the 3D structure of the target protein were known, this problem would be equivalent to finding the correct rotation, in a fixed 3D Cartesian coordinate system of the structure, that gives the (θ, φ) angles of its N-H bonds, and hence the calculated RDC values, that best match the experimental RDC data. For our fold recognition work, the problem is to find the rotation of a template structure that gives the best match with the experimental data, as defined by equations (2) and (4). Note that any rotation of a 3D protein structure
(say in PDB format) can be accomplished by a combination of clockwise rotations around the x-axis by α degrees and around the z-axis by γ degrees. More specifically, the new coordinates of a data point [x, y, z], after an (α, γ)-rotation, can be calculated as

[x', y', z']ᵀ = Rz(γ) Rx(α) [x, y, z]ᵀ

where the two rotation matrices are defined as

Rx(α) = | 1     0      0    |      Rz(γ) = |  cos γ  sin γ  0 |
        | 0   cos α  sin α  |              | −sin γ  cos γ  0 |
        | 0  −sin α  cos α  |              |    0      0    1 |
For each given template structure, our fold recognition algorithm searches through all possible (α, γ)-rotations. For each (α, γ)-rotation, the algorithm employs the alignment algorithm of Section 2.1 to find the optimal alignment between the (assigned) experimental RDC data and the calculated RDC data for the template under this particular rotation. Note that the range of both α and γ is between 0 and 180 degrees; there is no need to consider 180 < α, γ ≤ 360 because of the four-fold degeneracy of RDC data. We have extensively tested and evaluated different increments for α and γ, ranging from 1 degree to 30 degrees. We found that the search surface (made of values of the calculated RDC) over the (α, γ)-plane is very smooth, and an increment of 30 degrees is adequate for our fold recognition, so we use 30 degrees as the default increment in RDC-PROSPECT. For each template, the algorithm thus conducts 36 (6×6) RDC data alignments; the alignment with the best score among the 36 is taken as the best alignment between the RDC data and the template. When a very accurate alignment frame is needed, we use a finer grid for searching the (α, γ) angles, at the cost of longer search time. Our overall fold recognition procedure is carried out as follows. For each set of assigned RDC data, we search our template database consisting of all proteins in the SCOP40 database.19 Currently, SCOP40 (release 1.63 of May 2003) consists of approximately 5,200 protein domains covering 765 folds and 2,164 families. Hydrogen atoms are added to each structure using the program REDUCE.20 Secondary structure assignment is carried out using the program DSSPcont.21 For each template, we calculate the Z-score of its best alignment with the experimental RDC data using equation (4). Then all templates are ranked based on their alignment raw scores.
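The (α, γ) grid search can be sketched in Python (an illustrative sketch: `score_fn` stands in for the Section 2.1 alignment, and the rotation sign conventions are our assumption):

```python
import math

def rot_x(a):
    """Clockwise rotation matrix about the x-axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def rot_z(g):
    """Clockwise rotation matrix about the z-axis by angle g (radians)."""
    c, s = math.cos(g), math.sin(g)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]

def mat_vec(m, v):
    """Apply a 3x3 matrix to a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def best_rotation(bonds, score_fn, step_deg=30):
    """Grid search over (alpha, gamma) in [0, 180) at step_deg increments.

    The default 30-degree step gives 6 x 6 = 36 alignments per template;
    [180, 360) is redundant by the four-fold degeneracy of RDC.
    bonds: N-H unit vectors of the template; score_fn maps the rotated
    vectors to an alignment score (lower is better)."""
    best_angles, best_score = None, float("inf")
    for a_deg in range(0, 180, step_deg):
        for g_deg in range(0, 180, step_deg):
            rx = rot_x(math.radians(a_deg))
            rz = rot_z(math.radians(g_deg))
            rotated = [mat_vec(rz, mat_vec(rx, v)) for v in bonds]
            s = score_fn(rotated)
            if s < best_score:
                best_angles, best_score = (a_deg, g_deg), s
    return best_angles, best_score
```

Only two nested loops are needed (N² rather than N³ rotation combinations), which is the source of the speedup discussed in Section 4.1.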
3 Results

We have tested RDC-PROSPECT on all publicly available N-H RDC data deposited in the BMRB database and published in the literature (as of July 2003), comprising 51 sets of RDC data for 33 proteins. The goal of the tests is to evaluate the fold recognition rate using RDC data (plus predicted secondary structure of a target protein) and the accuracy of the alignment with the correct structural folds. Tables 1 and 2 summarize the fold recognition and alignment results on the 33 proteins using the 51 sets of RDC data; for some proteins, there are multiple sets of RDC data collected by different labs and/or in different ordering media. For the fold recognition prediction, we consider a prediction correct if a member protein from the same family or superfamily as the target protein is ranked in the top three among all proteins in SCOP40, and incorrect otherwise. From Table 1, we can see that RDC-PROSPECT correctly identified the structural folds for 41 out of 51 RDC data sets (80.4% success rate), and identified 26 structural folds for the 33 target proteins (78.8% success rate). Hence we consider the performance of RDC-PROSPECT quite successful, even under our very conservative definition of correct fold recognition, i.e., ranked among the top three out of thousands of possible structures. Unfortunately, there is very little published data for other RDC-based structure prediction programs; most were tested on only one protein, ubiquitin. The only meaningful comparison we can make is with RosettaNMR, which was tested on 4 proteins using experimental RDC data, ubiquitin (1d3z), BAF (1cmz), cyanovirin-N (1ci4), and GAP (2ezx), and 7 proteins using simulated RDC data.10 Of the 4 proteins with experimental data, RosettaNMR predicted correct structures for 1d3z and 1cmz, and partially (~50%) correct structures for 1ci4 and 2ezm.
Our program correctly identified the backbone structures for 1d3z, 1cmz, and 2ezx (the same protein as 1ci4), but did not find the correct structural fold for 2ezm due to inadequate secondary structure information (only 9.9% of the residues have reliable secondary structure predictions by PSIPRED). From Table 2, we can see that the alignment accuracy for the 26 target proteins with correct fold recognition is very high. The percentage of 4-shifts is commonly used for assessing threading alignment accuracy; RDC-PROSPECT achieved an average alignment accuracy of 97.9% of residues aligned within a 4-residue shift of their correct positions. None of the other RDC-based structure prediction programs provides this kind of statistics. Figure 2 shows the predicted structures (right) versus the actual structures (left) for four target proteins with < 25% sequence identity with their best structural templates.

4 Discussion
Our results have clearly demonstrated that RDC-based fold recognition, when
Table 1. A summary of fold recognition accuracy

[Table body not reproduced: the column entries were scattered beyond reliable reconstruction in this copy.]

Only the highest ranked correct template is listed for each protein. The first two columns give the target id in our test and its PDB code. The third column gives the sequence length of the target. The fourth column gives the id of the RDC data set; some proteins have multiple data sets. The fifth and sixth columns give the correct template id in SCOP code and its sequence length. The seventh column gives the sequence identity between a target protein and its correct template. The eighth column shows the rank of the top template among all SCOP40 proteins, and the ninth the corresponding Z-score. No correct templates were identified among the top three for proteins 27-33 (including 1d2b, 1ghh, 1o8r, 1qnl, 2ezm, 2gat, 4gat).
Table 2. Summary of alignment accuracy

Shift          0-shift   1-shift   2-shift   3-shift   4-shift
Accuracy (%)    63.1      90.1      95.3      96.8      97.9

x-shift represents the percentage of residues that are within x residues of their correct alignment positions.
1ap4        1d3z        1j6t,A        11qQ

Figure 2. Actual (left) and predicted (right) structures of four target proteins with < 25% sequence identity with their best structural folds in SCOP40.
coupled with predicted secondary structure, is highly effective and robust for identifying native-like structural folds and predicting backbone structures. Our test examples cover a wide range of prediction scenarios: the test proteins span 5 SCOP classes and more than 20 SCOP fold families with varying sequence lengths; their N-H RDC data coverage ranges from 43.4% to 95.5%; and their predicted secondary structure coverage ranges from 9.9% to 76.3% (the predictions for the remaining residues are "uncertain" and hence not used). We now discuss some key advantages and unsolved issues of RDC-PROSPECT, along with some future developments.
4.1 Efficient algorithm for alignment tensor orientation search

If we use N to denote the number of rotation angles to search along each axis, previous similar algorithms9,22 all require N³ combinations of rotations, while our algorithm requires only N², saving at least one order of magnitude of search time and making our program much faster than other similar programs.
4.2 Combination of RDC data and predicted secondary structure for fold recognition We found that predicted secondary structure, though not perfect, complements the RDC data for fold recognition. While RDC data are good for identification of global
structural environment, secondary structure is good for finding the local structural environment (e.g., in a helix or in a strand). Our test data have shown that without either one of the two types of data, RDC-PROSPECT's performance drops significantly. In this work, we used predicted secondary structures based on protein sequence information only. Secondary structures could actually be derived more accurately using experimental data, such as chemical shift data. The only reason we did not use chemical shifts is that only 10 of the 33 proteins have such data available in the BMRB database. Using chemical shift data would improve the performance of the program; for example, the otherwise missed correct template for the protein 2ezm can be identified when chemical-shift-based secondary structure prediction is used.
4.3 Why can some protein structures not be correctly predicted?

For 7 of the 33 target proteins, RDC-PROSPECT did not place the correct structural fold in the top three templates. We have done a detailed analysis of the failed predictions and found that the failures can be attributed to two classes of reasons.

a. Proteins composed mainly of coils: this group includes 1o8r, 1qnl, 2gat, 4gat (6gat). As discussed in Section 2, RDC-PROSPECT considers only coil length conservation and does not conduct detailed alignment for coil regions. When a protein is composed mainly of coils, RDC-PROSPECT does not perform well. Work is currently under way to improve on such cases.

b. Others: various other reasons can also contribute to the failure of our RDC-based fold recognition, ranging from inaccurate estimation of Da and Dr, to incorrect prediction of secondary structures, to errors in the measured RDC data.

In this work, we have used raw RDC data without correcting for contributions from internal dynamics, and our results suggest this is feasible in practice. As Rohl and Baker discussed,10 internal dynamics likely contribute to the observed RDC to a greater extent in flexible loops. Since our method does not perform detailed alignment in coil regions, the effect of dynamics that could potentially harm the alignment is greatly alleviated.
4.4 Comparisons with DipoCoup

DipoCoup is a popular program for 3D structure homology search using RDC and pseudo-contact shifts together with secondary structure information. A basic problem with DipoCoup is that it does not use gap penalties in alignment, which significantly limits its applicability. In contrast, RDC-PROSPECT allows the flexibility of having gaps inside or outside secondary structures. Moreover, DipoCoup uses secondary structure fragments as the alignment unit, while RDC-PROSPECT conducts alignments at the residue level, making it more flexible and robust. This also allows us to use sparse secondary structure information, which DipoCoup cannot handle.
4.5 Assignment of RDC data

Like other RDC-based structure prediction programs, RDC-PROSPECT assumes that the RDC data have been assigned to individual residues. This should not limit its applications, as sequential assignments of NMR data (RDC data included), unlike NOE data assignments, are generally solvable using existing programs. A recently published work by Coggins & Zhou23 achieved ~80% assignments without any error for 27 test proteins using their PACES program. Assignments at this level are adequate for RDC-PROSPECT to perform well for most proteins. We have previously published an algorithm/software24 for sequential assignment of NMR data using chemical shift data, and we are in the process of merging the two programs to do fold recognition using unassigned RDC data.
In conclusion, our method has convincingly demonstrated the capability of fast and accurate protein fold recognition through combining sparse RDC data with threading technology. An important feature of our RDC-based homology search method is that it does not use sequence information for alignment; our program thus provides a good complementary and cross-checking tool for conventional threading methods. It is especially attractive in low sequence identity situations, where conventional structure prediction methods generally do not perform reliably. As we continue this project, we will (a) use chemical shift data for more reliable prediction of secondary structures, (b) include other types of RDC data, such as C-H RDC, which can be easily added into the framework of RDC-PROSPECT, and (c) include traditional statistics-based threading energy terms, such as pair-wise interaction potentials, in our RDC-based fold recognition method, as in our threading program PROSPECT.25 We expect RDC-PROSPECT to prove useful for high-throughput structure determination in structural genomics projects, as it can efficiently and effectively fit sparse RDC data, obtained from a minimal number of NMR experiments, to solved structures.
Acknowledgments

This work was funded in part by the Structural Biology Program of the Office of Health and Environmental Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725, managed by UT-Battelle, LLC. We thank Drs. Nitin Jain, Dong Xu and Dongsup Kim for helpful discussions.
References
1. J.R. Tolman, J.M. Flanagan, M.A. Kennedy, and J.H. Prestegard, Proc. Natl. Acad. Sci. U.S.A. 92, 9279 (1995)
2. N. Tjandra and A. Bax, Science 278, 1111 (1997)
3. J.R. Tolman, Curr. Opin. Struct. Biol. 11, 532 (2001)
4. J.H. Prestegard, Nat. Struct. Biol. 5 Suppl, 517 (1998)
5. J.H. Prestegard, H.M. Al-Hashimi, and J.R. Tolman, Q. Rev. Biophys. 33, 371 (2000)
6. A. Bax, Protein Sci. 12, 1 (2003)
7. F. Delaglio, G. Kontaxis, and A. Bax, J. Am. Chem. Soc. 122, 2142 (2000)
8. J.C. Hus, D. Marion, and M. Blackledge, J. Am. Chem. Soc. 123, 1541 (2001)
9. F. Tian, H. Valafar, and J.H. Prestegard, J. Am. Chem. Soc. 123, 11791 (2001)
10. C.A. Rohl and D. Baker, J. Am. Chem. Soc. 124, 2723 (2002)
11. A. Annila, H. Aitio, E. Thulin, and T. Drakenberg, J. Biomol. NMR 14, 223 (1999)
12. J. Meiler, W. Peti, and C. Griesinger, J. Biomol. NMR 17, 283 (2000)
13. D. Lee, A. Grant, D. Buchan, C. Orengo, Curr. Opin. Struct. Biol. 13, 359 (2003)
14. J.A. Losonczi, M. Andrec, M.W. Fischer, and J.H. Prestegard, J. Magn. Reson. 138, 334 (1999)
15. G.M. Clore, A.M. Gronenborn, and N. Tjandra, J. Magn. Reson. 131, 159 (1998)
16. G.M. Clore, A.M. Gronenborn, and A. Bax, J. Magn. Reson. 133, 216 (1998)
17. D.T. Jones, J. Mol. Biol. 292, 195 (1999)
18. S.B. Needleman, C.D. Wunsch, J. Mol. Biol. 48, 443 (1970)
19. A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia, J. Mol. Biol. 247, 536 (1995)
20. J.M. Word, S.C. Lovell, J.S. Richardson, and D.C. Richardson, J. Mol. Biol. 285, 1735 (1999)
21. P. Carter, C.A. Andersen, and B. Rost, Nucleic Acids Res. 31, 3293 (2003)
22. J.C. Hus, J.J. Prompers, and R. Bruschweiler, J. Magn. Reson. 157, 119 (2002)
23. B.E. Coggins and P. Zhou, J. Biomol. NMR 26, 93 (2003)
24. Y. Xu, D. Xu, D. Kim, V. Olman, J. Razumovskaya, and T. Jiang, IEEE Computing in Science & Engineering 4, 50 (2002)
25. Y. Xu and D. Xu, Proteins 40, 343 (2000)
COMPUTATIONAL AND SYMBOLIC SYSTEMS BIOLOGY

T. IDEKER
Department of Bioengineering, U.C. San Diego, La Jolla, CA 92093
[email protected]

E. NEUMANN
Beyond Genomics, 40 Bear Hill Road, Waltham, MA
[email protected]
V. SCHACHTER
GENOSCOPE (National Consortium for Genomics Research)
2, rue Gaston Crémieux, F-91000 EVRY, FRANCE
[email protected]
It has become increasingly evident that the use of large-scale experimental data and the invocation of 'Systems Biological' principles are gaining widespread acceptance in mainstream biology. Systems Biology involves the use of global cellular measurements (genomic, proteomic, and metabolomic) to construct computational models of cellular processes and disease. It typically involves an iterative computational/experimental cycle of 1) inferring an initial model of the cellular process of interest through sequencing, expression profiling, and/or molecular interaction mapping projects; 2) perturbing each model component and recording the corresponding global cellular response to each perturbation; 3) integrating the measured global responses with the current model; and 4) formulating and testing new hypotheses for unexpected observations. Recent technological developments are enabling us to define and interrogate cellular processes more directly and systematically than ever before, using two complementary approaches. First, it is now possible to systematically measure pathway interactions themselves, such as those between proteins and proteins or between proteins and DNA. Several methods are available for measuring protein-protein interactions at large scale, two of the most popular being the two-hybrid system and protein coimmunoprecipitation in conjunction with tandem mass spectrometry. Protein-DNA interactions, as commonly occur between transcription
factors and their DNA binding sites, are also being measured systematically using the technique of chromatin immunoprecipitation. Other types of molecular interactions and reactions, such as those involving metabolites and drugs, have been culled from the literature and stored in large, publicly-accessible databases such as MetaCyc and KEGG. A second major approach for interrogating pathways has been to systematically measure the molecular and cellular states induced by the pathway structure. For example, global changes in gene expression are measured with DNA microarrays, while changes in protein and metabolite concentrations may be quantitated with mass spectrometry, NMR, and other advanced techniques. The amount of quantitative data these experiments yield is on the order of thousands of individual molecular channels, and has been used to successfully identify patterns indicative of biological responses or disease states. However, it has become apparent that single genes or their products do not cause most of the biological phenomena observed. These findings have drawn researchers to the conclusion that the most interesting phenomena in biology result from the interrelated actions of many components within the system as a whole. Recent computational approaches to Systems Biology have involved formulating both molecular interactions and molecular states into computational pathway models of various types. The amount of research in this area has exploded in recent years, as witnessed by the number of research presentations at meetings such as PSB, RECOMB, the Biopathways Consortium, and the International Conference on Systems Biology. Although much of this research has focused on systems of differential equations and other numerical pathway simulations, a variety of model types and formalisms are in fact possible. Models may be numerically computable, but they may also be symbolic and accessible to inferential logic.
Logical formalisms that describe complex phenomena are just as important as modeling molecular dynamics, and may lead to faster insight where the computational complexities are too great for a full-scale simulation. These research areas need to be pursued in parallel to more numerically-driven approaches, since they may offer a way to merge much of the symbolic knowledge derived from existing biological research. In support of this view, almost half of the papers presented in this session involve the use of logical formalisms for modeling pathways, pathway dynamics, and/or network inference. Symbolic logic is used to analyze protein functional domains (Talcott et al.); to infer novel metabolic pathways using information on known pathways and the biochemical structures of their metabolites (McShan et al.); or to
model cell-cell interactions using a stochastic extension of the pi-calculus (Lecca et al.). Many of these papers combine more than one large-scale data type, including gene expression profiles, protein-protein interaction data, and/or pathway databases. Another group of papers concentrates on either new formal representations for network inference or efficient experimental design, i.e., choosing an optimal set of gene deletions, overexpressions, or other experiments to maximize the information gained about the network. Of particular interest here is work by Gat-Viks et al. on representing gene regulation by 'chain functions'; inferring a system of differential equations through systematic overexpressions (di Bernardo et al.); and methods for decomposing gene expression data into its component cellular processes within a Bayesian framework (Lu et al.). Finally, as an overlapping theme, several papers point to how Systems Biology may be used as part of a high-throughput drug discovery and development platform. For instance, the work by McShan et al. might be used to explore how newly developed drugs will be metabolised by the body; the work by di Bernardo et al. could be applied to predict primary drug targets based on the pathways they affect; while the work of Kightley et al. is a method for network inference submitted by researchers in the biotechnology/pharma industry. The field of Systems Biology still includes many challenges and holds much promise. By increasing our repertoire of model representations and analytical formalisms, the methods explored here are the starting points for numerous advances in biotechnology, not the least of which is an enhanced ability to target therapeutics appropriately in diseased cells. Thus, we move one step closer to the day in which computational pathway modeling techniques will have widespread impact and acceptance within basic biological research and replace high-throughput screening as a de facto standard in “big pharma”.
A MIXED INTEGER LINEAR PROGRAMMING (MILP) FRAMEWORK FOR INFERRING TIME DELAY IN GENE REGULATORY NETWORKS

M. S. DASIKA, A. GUPTA AND C. D. MARANAS
Department of Chemical Engineering, The Pennsylvania State University, University Park, PA 16802
E-mail: {msd179, axg218, costas}@psu.edu

In this paper, an optimization based modeling and solution framework for inferring gene regulatory networks while accounting for time delay is described. The proposed framework uses the basic linear model of gene regulation. Boolean variables are used to capture the existence of discrete time delays between the various regulatory relationships. Subsequently, the time delay that best fits the expression profiles is inferred by minimizing the error between the predicted and experimental expression values. Computational experiments are conducted for both in numero and real expression data sets. The former reveal that if time delay is neglected in a system a priori known to be characterized by time delay, then a significantly larger number of parameters is needed to describe the system dynamics. The real microarray data example reveals a considerable number of time-delayed interactions, suggesting that time delay is ubiquitous in gene regulation. Incorporation of time delay leads to inferred networks that are sparser. Analysis of the amount of variance in the data explained by the model, and comparison with randomized data, reveals that accounting for time delay explains more variance in real than in randomized data.
1 Introduction
The advent of microarray technology has made it possible to gather genome-wide expression data. In addition to experimentally quantifying system-wide responses of biological systems, these technologies have provided a major impetus for developing computational approaches for deciphering the gene regulatory networks that control the response of these systems to cellular and environmental stimuli. A complete understanding of the organization and dynamics of gene regulatory networks is an essential first step towards realizing this goal [1, 2]. To date, many computational/algorithmic frameworks have been proposed for inferring regulatory relationships from microarray data. Initial efforts primarily relied on the clustering of genes based on similarity in their expression profiles [3]. This was motivated by the hypothesis that genes with similar expression profiles are likely to be coregulated. Hwang et al. [4] and Stephanopoulos et al. [5] extended these clustering approaches to classify distinct physiological states. However, clustering approaches alone cannot extract any causal relationships among the genes. Many researchers have attempted to explain the regulatory network structure by modeling it as Boolean networks [6, 7]. These networks model the state of a gene as either ON or OFF, and the input-output relationships are postulated as logical functions. Measured transcript levels, however, vary in a continuous manner, implying that
the idealizations underlying Boolean networks may not be appropriate and more general models are required [8]. Recently, there have been many attempts to develop approaches that can uncover the extent and directionality of the interactions among the genes, rather than simply grouping genes based on their expression profiles. These approaches include the modeling of genetic expression using differential equations [9-11], Bayesian networks [12] and neural networks [13]. Even though a lot of progress has been made, key biological features such as time delay have been left largely unaddressed in the context of inferring regulatory networks. Experimentally measured time delay in gene expression has been widely reported in the literature [14-16]. However, on the computational front, the fact that gene expression regulation might be asynchronous in nature (i.e., the expression profiles of all the genes in the system may not be regulated simultaneously) has largely been left unexplored. From a biological viewpoint, time delay in gene regulation arises from the delays characterizing the various underlying processes such as transcription, translation and transport. For example, time delay in regulation may result from the time taken for the transport of a regulatory protein to its site of action. Consequently, accounting for this key attribute of the regulatory structure is essential to ensure that the proposed inference model accurately captures the dynamics of the system. Prominent among the initial efforts made to incorporate time delay is the framework developed by Yildirim and Mackey [17]. The authors examined the effect of time delay in a previously developed mechanistic model of gene expression in the Lac operon [18]. Chen et al. [9] proposed a general mathematical framework to incorporate time delay but did not apply it to any gene expression data to produce verifiable results.
While interesting, these methods are not scalable to large expression data sets where the mechanistic details are often absent. Quin et al. [19] have proposed a time-shifted correlation based approach to infer time delay using dynamic programming. Since this approach relies on pairwise comparisons, it fails to recognize the potential existence of multiple regulatory inputs with different time delays. In this paper, we propose an optimization based modeling and solution framework for inferring gene regulatory relationships while accounting for time delays in these interactions using mixed-integer linear programming (MILP). We compare the proposed model with a model that does not account for time delay, both in terms of its capability to uncover a target network that exhibits time delays for a test example and in terms of computational requirements. The rest of the paper is organized as follows. In the following section, a detailed description of the proposed model formulation is provided. Subsequently, the performance of the proposed model is evaluated on two data sets (one in numero, one real). Finally, concluding remarks are provided and the work is summarized.
2 Method
Here, an inference method is described for extracting the regulatory inputs for each gene in a genetic regulatory network, while accounting for time delays in the system. To this end, the linear model of network inference [20-22] is adopted as a benchmark and modified to account for time delay as shown in Eq 1.
$$\dot{z}_i(t) = \frac{z_i(t+1) - z_i(t)}{\Delta t} = \sum_{\tau=0}^{\tau_{\max}} \sum_{j=1}^{N} w_{ji\tau}\, z_j(t-\tau), \qquad i = 1,2,\ldots,N;\; t = 1,2,\ldots,T \qquad (1)$$
In Eq 1, z_i(t) is the expression level of gene i at time point t and w_{jiτ} is the regulatory coefficient that captures the regulatory effect of gene j on gene i. The index τ indicates that this regulation has a time delay of τ associated with it, while the integer parameter τ_max denotes the longest time delay accounted for. Note that the frequency at which gene expression is sampled through the microarray experiment determines the maximum amount of biologically relevant time delay that can be inferred. For example, if the time points are separated by seconds/minutes then a higher value of τ_max can be used. Subsequently, if w_{jiτ} > 0 then gene j activates gene i with a time delay τ, while if w_{jiτ} < 0 then gene j represses gene i with a time delay τ.
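As an illustration of Eq 1, the following sketch evaluates the delayed linear model for a toy two-gene system and compares it with the forward-difference derivative; all names and numerical values here are invented for the example and are not taken from the paper's data.

```python
# Sketch of the delayed linear model of Eq 1 (illustrative values only).
def predict_derivative(z, w, i, t):
    """z[j][t] = expression of gene j at time t; w[(j, i, tau)] = regulatory
    coefficient of gene j on gene i with delay tau. Returns the modeled
    derivative: sum over (j, tau) of w[j, i, tau] * z[j][t - tau]."""
    return sum(coef * z[j][t - tau]
               for (j, gi, tau), coef in w.items()
               if gi == i and t - tau >= 0)

# Two genes, four time points; gene 0 activates gene 1 with a delay of 1.
z = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.5, 1.5, 2.5]]
w = {(0, 1, 1): 1.0}          # w_{j=0, i=1, tau=1} = 1.0

# Forward-difference "observed" derivative of gene 1 at t = 1 (dt = 1):
obs = z[1][2] - z[1][1]       # 1.5 - 0.5 = 1.0
pred = predict_derivative(z, w, i=1, t=1)   # 1.0 * z[0][0] = 1.0
```

Here the modeled and forward-difference derivatives agree at t = 1 by construction, since the toy profile of gene 1 was built to follow gene 0 with a unit delay.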
$$Y_{ji\tau} = \begin{cases} 1 & \text{if gene } j \text{ regulates gene } i \text{ with a time delay } \tau \\ 0 & \text{otherwise} \end{cases}$$
Subsequently, the network inference model with time delay is formulated as the following mixed integer linear programming (MILP) model.
$$\text{Minimize} \quad \varepsilon = \sum_{i=1}^{N} \sum_{t=1}^{T} \left( e_i^+(t) + e_i^-(t) \right) \qquad (2)$$

subject to

$$\dot{z}_i(t) - \sum_{\tau=0}^{\tau_{\max}} \sum_{j=1}^{N} w_{ji\tau}\, z_j(t-\tau) = e_i^+(t) - e_i^-(t), \quad \forall\, i = 1,2,\ldots,N;\; t = 1,2,\ldots,T \qquad (3)$$

$$\Omega^{L}_{ji\tau}\, Y_{ji\tau} \le w_{ji\tau} \le \Omega^{U}_{ji\tau}\, Y_{ji\tau}, \quad \forall\, i, j, \tau \qquad (4)$$

$$\sum_{\tau=0}^{\tau_{\max}} Y_{ji\tau} \le 1, \quad \forall\, i, j \qquad (5)$$

$$\sum_{j=1}^{N} \sum_{\tau=0}^{\tau_{\max}} Y_{ji\tau} \le N_i, \quad \forall\, i = 1,2,\ldots,N \qquad (6)$$

$$e_i^+(t),\, e_i^-(t) \ge 0 \qquad (7)$$

$$Y_{ji\tau} \in \{0, 1\} \qquad (8)$$
The objective function (Eq 2) minimizes the total (over all genes and time points) absolute error ε between the predicted and the experimental expression values. The absolute value of the error is determined from Eq 3 through the positive and negative error variables e_i^+(t) and e_i^-(t), respectively. For a given gene i and time point t, only one of these variables can be non-zero. Specifically, if the error is positive then e_i^+(t) is non-zero, while if the error is negative then e_i^-(t) is non-zero. This property arises from the fact that, when the constraints of the model are placed in matrix form, the columns associated with these two variables are linearly dependent. Consequently, the linear programming (LP) theory principle stating that the columns of the basic variables (variables that are non-zero at the optimal solution) are linearly independent ensures the above property. Eq 4 ensures that the coefficients for all regulatory relationships not present in the network are forced to zero. In this constraint, Ω^L_{jiτ} and Ω^U_{jiτ} are the lower and upper bounds, respectively, on the values of the regulatory coefficients. Eq 5 imposes the constraint that each regulatory interaction, if it exists, may assume only a single value of time delay associated with it, while Eq 6 limits N_i, the maximum number of regulatory inputs to gene i.
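The paper solves this MILP with an industrial solver; as a solver-free illustration of what the binary variables Y encode, the sketch below enumerates candidate (regulator, delay) pairs for a single target gene with at most one regulatory input (N_i = 1) and fits each coefficient by least squares. The function name and toy data are assumptions for the example, not part of the paper's implementation.

```python
def infer_single_input(zdot_i, z, tau_max):
    """Enumerate candidate (regulator j, delay tau) pairs -- the role played
    by the binary variables Y in the MILP -- and fit the single coefficient
    w by least squares. Returns (best_j, best_tau, best_w, best_abs_err)."""
    best = None
    T = len(zdot_i)
    for j, zj in enumerate(z):
        for tau in range(tau_max + 1):
            xs = [zj[t - tau] for t in range(tau, T)]
            ys = [zdot_i[t] for t in range(tau, T)]
            denom = sum(x * x for x in xs)
            if denom == 0:
                continue                      # regulator carries no signal
            w = sum(x * y for x, y in zip(xs, ys)) / denom
            err = sum(abs(y - w * x) for x, y in zip(xs, ys))
            if best is None or err < best[3]:
                best = (j, tau, w, err)
    return best

# Toy data: gene 1's derivative is exactly 2 * z_0(t - 1).
z = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
zdot_1 = [0.0, 2.0, 4.0, 6.0]
j, tau, w, err = infer_single_input(zdot_1, z, tau_max=2)
# recovers j = 0, tau = 1, w = 2.0 with zero error
```

Unlike this brute-force sketch, the MILP handles multiple simultaneous regulators (up to N_i) and lets the solver prune the combinatorial search.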
The proposed framework has a number of key advantages. The basic linear model with no time delay is a special case of the proposed model and can be recovered by including the following constraint.
$$Y_{ji\tau} = 0, \quad \forall\, i, j = 1,2,\ldots,N;\; \tau > 0 \qquad (9)$$

Additional environmental stimuli may be incorporated by introducing an additional node that describes the influence of the stimulus into the network. Furthermore, various biologically relevant hypotheses can be tested by translating them into
either additional/alternative constraints or objective functions. For example, one of the hypotheses recently proposed concerns the robustness of gene regulatory networks, defined as the ability of these networks to effectively tolerate random fluctuations in gene expression levels [23, 24]. Within the context of the linear model, this translates into having small values of the regulatory coefficients w_{jiτ}, so that small variations in the expression levels of gene j have a small impact on the rate of change of expression of gene i. From a statistical perspective, the proposed framework can be used to capture the trade-off between the degree of model fit and the number of model parameters. By systematically varying the maximum number of regulatory inputs to a particular gene and computing the resulting minimum error, a trade-off curve between accuracy and model complexity can be generated. This curve provides an appropriate means for determining the critical number of regulatory inputs above which the model is tending towards over-fitting of the data. In a system with N genes, there will be N²(τ_max + 1) binary variables, implying a total of 2^{N²(τ_max + 1)} possible alternatives for the network connectivity. Even for a relatively small network inference setting, it is computationally expensive to conduct an exhaustive search through these alternatives. The computational requirements can be reduced, to a certain extent, by exploiting the decomposable structure of the proposed model. This is achieved by recognizing that the model can be solved for each gene i separately without any loss of generality. Note, however, that this model structure is lost if an overall maximum connectivity constraint is imposed in the same spirit as the individual gene maximum connectivity constraint (Eq 6).
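A quick back-of-the-envelope computation (illustrative only; the function name is our own) shows how fast the number of candidate topologies grows, and how the gene-by-gene decomposition shrinks each subproblem:

```python
# N^2 * (tau_max + 1) binary variables overall, hence
# 2 ** (N^2 * (tau_max + 1)) candidate connectivity patterns.
def n_alternatives(N, tau_max):
    return 2 ** (N * N * (tau_max + 1))

# Even a tiny 5-gene network with tau_max = 2 is far beyond enumeration:
total = n_alternatives(5, 2)        # 2**75, about 3.8e22 alternatives

# Solving gene-by-gene (the decomposition described above) leaves only
# N * (tau_max + 1) binaries per subproblem:
per_gene = 2 ** (5 * (2 + 1))       # 2**15 = 32768 alternatives per gene
```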
In addition to improved computational performance, another key advantage of the decomposable property is that it limits the amount of computational resources that need to be expended if only a sub-network involving a sub-set of the genes is to be inferred. The key parameters that determine the computational complexity of the proposed model are the bounds Ω^L_{jiτ} and Ω^U_{jiτ} imposed on the regulatory coefficients in Eq 4.
While in certain special application settings there are pre-specified upper and lower bounds that are part of the model, in our proposed model these bounds are not known a priori. For such cases, typically the "Big-M" approach is utilized, whereby arbitrarily large/small bounds are imposed [25]. Such a simplistic approach circumvents the need to determine tight valid bounds, although at the expense of much higher computational requirements. On the other hand, if tight invalid bounds are specified, the computational gains realized will be off-set by the inability to attain the global optimal solution. In light of this trade-off between computational requirements and quality of the optimal solution, a sequential bound relaxation
procedure is developed and described next. As a starting point for this procedure, for a given gene i*, both the upper and lower bounds are fixed such that

$$|\Omega^{L}_{i^*}| = |\Omega^{U}_{i^*}| = \Omega^{0}_{i^*}$$

The initial value of the bound is selected based on the scaling of the expression values. Specifically, this initial bound value is determined as a value proportional to the ratio of the order of magnitude of the derivative values and that of the expression values. Subsequently, given these bounds, the inference model is solved to obtain the optimal values of the regulatory coefficients w_{ji*τ}(Ω⁰_{i*}) and the absolute error

$$\varepsilon_{i^*}(\Omega^{0}_{i^*}) = \sum_{t=1}^{T} \left( e_{i^*}^+(t) + e_{i^*}^-(t) \right)$$

Next, the bounds are relaxed such that

$$\Omega^{1}_{i^*} = (1 + \delta_{i^*}) \cdot \Omega^{0}_{i^*}, \quad \text{where } 0 < \delta_{i^*} \le 1,$$

followed by re-optimization of the model with these updated bounds. Since the relaxation of the bounds leads to a larger feasible region, it is guaranteed that ε_{i*}(Ω¹_{i*}) ≤ ε_{i*}(Ω⁰_{i*}). These two steps of bound relaxation and optimization
are repeated until the total absolute error for gene i* reduces to below the desired tolerance level. This procedure is then repeated for all the genes in the network until the entire network topology (or a sub-set of it) has been inferred.
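The two steps described above (relax, then re-optimize) can be sketched as a simple loop; `solve_milp` is a hypothetical stand-in for the MILP solve and returns the optimal absolute error for a given symmetric bound.

```python
def sequential_bound_relaxation(solve_milp, omega0, delta, tol, max_iter=50):
    """Sketch of the sequential bound relaxation: start from a tight
    symmetric bound |Omega^L| = |Omega^U| = omega0, solve, and enlarge the
    bound by a factor (1 + delta) until the absolute error drops below tol.
    solve_milp(bound) -> error stands in for the actual MILP solve."""
    bound = omega0
    for _ in range(max_iter):
        err = solve_milp(bound)      # error is non-increasing in the bound
        if err <= tol:
            return bound, err
        bound *= (1 + delta)         # relax: Omega^{k+1} = (1+delta)*Omega^k
    return bound, err

# Toy stand-in: error vanishes once the bound can reach the true weight 2.5.
toy_solve = lambda b: max(0.0, 2.5 - b)
bound, err = sequential_bound_relaxation(toy_solve, omega0=1.0, delta=1.0,
                                         tol=1e-6)
# bounds tried: 1.0 (err 1.5), 2.0 (err 0.5), 4.0 (err 0.0)
```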
3 Results and Discussion
To highlight and test the inference capabilities of the proposed model, it is applied to two different data sets. Data set 1 (40 genes, 8 time points) is generated in numero by assuming known time delays in the system dynamics. The ability of the inference procedure to uncover an a priori known target network, as well as the computational performance of the model, is studied by employing this data set. Subsequently, a real microarray data subset (24 genes, 9 time points) is analyzed to highlight the applicability of the inference procedure to data derived from real biological systems.

3.1 Data Set 1
The expression data for the 40 gene network is generated by assuming that 6 genes have 3 regulatory inputs, 10 genes have 2 regulatory inputs, while the remaining genes have a single regulatory input. 33 interactions are designed to have a time delay of zero, 21 have a time delay of one and 9 have a time delay of two time points. Given this topology of the regulatory network, gene expression values are computed for each one of the 40 genes at 8 time points. The derivatives are computed by employing forward difference. The starting value for the bound for
each gene is set to 1.0 and a bound increment value δ_i = 1.0 is employed for computation. The assumed network constituted 63 interactions with known regulatory weights and time delays associated with these interactions. The original network, in terms of all 63 regulatory interactions and the associated regulatory weights and time delays, is perfectly recovered by solving the proposed model with time delay. The optimization model is solved using the CPLEX solver accessed via the GAMS modeling environment. The CPU time needed to recover the original regulatory inputs for each gene is shown in Figure 1(a), while the distribution of the total number of sequential bound relaxation iterations required is shown in Figure 1(b).
Figure 1: Comparison of computational performance for the model with and without time delay. (a) Total CPU time required for each of the 40 genes in the network (b) Distribution of the total number of sequential bound relaxation procedure iterations.
Specifically, 9,505 CPU seconds (on an IBM RISC 6000 machine) are required for the 86 iterations. In addition to the model with time delay, network inference is also carried out by neglecting time delay. This is achieved by including Eq 9 in the inference model (Eqs 2-8). The model without time delay provides the appropriate benchmark for systematically highlighting the gains, if any, that are realized by accounting for time delay. The computational results for the two models are contrasted in Figures 1(a) and 1(b). For the model without time delay, a total of 4,696 CPU seconds are expended for the 227 iterations that are needed to infer a network with zero error. However, even though the model without time delay is able to fit the data perfectly with fewer computational resources, it is unable to identify the assumed target network in terms of the network topology and regulatory parameters. In addition, as expected, the number of parameters required increases significantly for the model without time delay. In particular, 121 regulatory relationships are inferred by the model without time delay, implying an almost two-fold increase in the number of parameters needed over the model with time delay.
3.2 Data Set 2
The second microarray data set analyzed consisted of time course expression profiles of 24 genes of Bacillus subtilis subjected to an amino-acid pulse in minimal media. Gene expression is measured using Affymetrix GeneChip® arrays at 0, 8, 13, 18, 28, 38, 68, 118 and 178 minutes. The amino-acid pulse is introduced for 8 minutes at the start of the experiment. Subsequently, cubic splines are used to interpolate the expression data and the derivatives are computed by employing a local finite difference approximation at each of the time points. The model with time delay is used to infer the regulatory network. The trade-off curve between error and the maximum number of parents is shown in Figures 2(a) and 2(b) for both the model with and without time delay. Note that the maximum number of parents determines the number of parameters available for fitting. In accordance with the results obtained for data set 1, Figure 2 highlights the fact that, for any imposed threshold error tolerance value, the model with time delay infers a network which is sparser.
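Because the sampling times above are non-uniform, derivative estimation needs some care. The sketch below uses a three-point Lagrange (local finite-difference) derivative on a non-uniform grid as a simplified stand-in for the spline-plus-local-finite-difference procedure described in the text; the toy expression profile is invented for the example.

```python
def local_derivative(times, values):
    """Local finite-difference derivative estimates on a non-uniform grid:
    three-point Lagrange differentiation at interior points, one-sided
    differences at the two endpoints."""
    n = len(times)
    d = [0.0] * n
    d[0] = (values[1] - values[0]) / (times[1] - times[0])
    d[-1] = (values[-1] - values[-2]) / (times[-1] - times[-2])
    for k in range(1, n - 1):
        t0, t1, t2 = times[k - 1], times[k], times[k + 1]
        y0, y1, y2 = values[k - 1], values[k], values[k + 1]
        # derivative at t1 of the quadratic through the three points
        d[k] = (y0 * (t1 - t2) / ((t0 - t1) * (t0 - t2))
                + y1 * (2 * t1 - t0 - t2) / ((t1 - t0) * (t1 - t2))
                + y2 * (t1 - t0) / ((t2 - t0) * (t2 - t1)))
    return d

# Sampling times (minutes) from the experiment described above:
times = [0, 8, 13, 18, 28, 38, 68, 118, 178]
expr = [t * t for t in times]        # toy profile z(t) = t^2, so dz/dt = 2t
d = local_derivative(times, expr)    # interior estimates equal 2*t exactly
```

The three-point scheme is exact for quadratics, which is why the interior estimates reproduce 2t here; a cubic-spline derivative would behave similarly on smooth profiles.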
Figure 2: Trade-off between the number of model parameters and the quality of fit. (a) Model with time delay. (b) Model without time delay.
The inferred regulatory relationships are shown in Figures 3 and 4. The proposed model is able to identify a number of regulatory relationships that have been previously reported in the literature. Jin et al. [26] have hypothesized the existence of a regulatory relationship between citH and genes involved in aspartate production (nadB and purA). The inferred regulatory network identifies a potential indirect mechanism for these regulations, mediated by pycA and odhB (Figure 3). Miller et al. [27] have reported that the genes sdhA and citG might share a common regulatory mechanism. The inferred network indicates that genes involved in glycine, serine and threonine metabolism regulate both citG and sdhC, which is a part of the sdhCAB operon. These results highlight the capability of the proposed inference framework to capture biologically plausible regulatory interactions.
Figure 3. Regulatory network inferred by the model with time delay.
Figure 4. Time delays associated with the inferred regulatory network.
From a statistical perspective, in addition to the relative error, a metric that is widely used to determine how well a regression fits is the coefficient of determination (or multiple correlation coefficient) R² [28]. This metric quantifies the fraction of variability in the response variable that can be explained by the variability in the input variables. In the context of our current setting, the average R² value is given by

$$\bar{R}^2 = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\mathrm{Var}[\varepsilon_i(t)]}{\mathrm{Var}[\dot{z}_i(t)]} \right) \qquad (10)$$

where the Var[·] operator determines the variance of a particular quantity over time and ε_i(t) = e_i^+(t) − e_i^-(t) is the computed error for gene i at time point t. Given this metric, the additional variance explained by the model with time delay is determined as

$$\text{Add. Variance Explained} = R^2[\text{With time delay}] - R^2[\text{Without time delay}] \qquad (11)$$
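A sketch of the average R² metric and the additional-variance measure described above, assuming Var[·] is the population variance over time; the residual series below are invented for the example.

```python
from statistics import pvariance

def avg_r_squared(zdot, errors):
    """Average coefficient of determination over genes: for each gene i,
    1 - Var[error_i(t)] / Var[zdot_i(t)], averaged over all genes.
    zdot[i] and errors[i] are time series for gene i."""
    r2 = [1 - pvariance(e) / pvariance(z) for z, e in zip(zdot, errors)]
    return sum(r2) / len(r2)

# Toy numbers (illustrative, not from the paper's data):
zdot = [[1.0, 3.0, 5.0, 7.0]]
err_delay = [[0.1, -0.1, 0.1, -0.1]]     # residuals of the delayed model
err_nodelay = [[1.0, -1.0, 1.0, -1.0]]   # residuals without time delay
additional = (avg_r_squared(zdot, err_delay)
              - avg_r_squared(zdot, err_nodelay))
# additional variance explained by allowing time delay, about 0.198 here
```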
Figure 5 shows the additional variance explained for data set 2 as the number of parents is varied. In addition to the real data set, the additional variance explained for randomized data is also shown.

Figure 5: Additional variance explained by including time delay for real and randomized data.

Specifically, the randomized data is obtained by permuting the rows and columns of the expression matrix, such that any underlying structure of the data is lost while the scaling of the data is retained. The results of Figure 5 indicate that the model with time delay is able to discriminate between real and randomized data only when the maximum number of inputs allowed is either 4 or 5. For a relatively small number of inputs (1, 2 and 3), the model is unable to capture the underlying structure of the real data due to the lack of a sufficient number of parameters. Similarly, at the other extreme, when too many parameters are made available (6 and 7), the model starts tending towards over-fitting, leading to the overlap between real and randomized data. A clear separation between the two data sets is realized only in the intermediate range of inputs (4 and 5). These results highlight the capabilities of the proposed modeling and solution framework in not only accounting for key system dynamics such as
time delay, but also gaining deeper insights into the topological features of regulatory networks.
4 Summary and Conclusions
In this work, an optimization based modeling and solution framework for incorporating time delay in transcriptional regulation was proposed. The proposed model used the existing linear model as a benchmark and employed Boolean variables to incorporate discrete time delays into the interactions. Since the system of equations describing the interactions is underdetermined, and consequently has a family of solutions that fit the data equally well, various properties of biological networks, such as sparseness and uniqueness of time delay, were employed to search through the solution space. A number of key advantages of the model, in terms of examining the impact of alternative objective functions, incorporating known biological interactions and including environmental stimuli, were discussed. On the computational front, however, the proposed model formulation is NP-hard, implying that the computational requirements increase exponentially with the model size. To alleviate this problem, a sequential bound relaxation procedure was proposed. The inferential potential of the proposed methodology was determined by applying it to an in numero data set and a real expression data set. Results for the in numero data set confirmed that neglecting time delay, in a system a priori known to be characterized by it, results in a significant increase in the number of parameters needed to describe the system dynamics. Subsequently, application of the model to real microarray data uncovered numerous regulatory relationships with time delay, suggesting that time delay is ubiquitous in gene regulation. In the spirit of the results obtained for the first data set, inclusion of time delay resulted in inferred networks that were sparser. In addition, analysis of the amount of variance in the data explained by the model revealed that the proposed methodology explained more variance in real data as compared to randomized data.

References

1. Bolouri, H. and J.M. Bower, Computational Modeling of Genetic and Biochemical Networks, ed. H. Bolouri and J.M. Bower. 2001, Cambridge, Massachusetts: The MIT Press.
2. Bolouri, H. and E.H. Davidson. BioEssays, 2002. 24(12): p. 1118-1127.
3. Spellman, P.T., et al. Mol. Biol. Cell, 1998. 9: p. 3273-3297.
4. Hwang, D., et al. Bioinformatics, 2002. 18: p. 1184-1193.
5. Stephanopoulos, G., et al. Bioinformatics, 2002. 18: p. 1054-1063.
6. Akutsu, T. and S. Miyano. Pac. Symp. Biocomput., 2000. 5: p. 290-301.
7. Ideker, T.E., V. Thorsson, and R.M. Karp. Pac. Symp. Biocomput., 2000. 5: p. 302-313.
8. de Jong, H. Journal of Computational Biology, 2002. 9(1): p. 67-103.
9. Chen, T., H.G.L. He, and G.M. Church. Pac. Symp. Biocomput., 1999. 4: p. 102-111.
10. Yeung, M.K.S., J. Tegner, and J.J. Collins. PNAS, 2002. 99: p. 6163-6168.
11. de Hoon, M.J.L., et al. Pac. Symp. Biocomput., 2003. 8: p. 17-28.
12. Friedman, N., et al. Journal of Computational Biology, 2000. 7: p. 601-620.
13. Vohradsky, J. The Journal of Biological Chemistry, 2001. 276: p. 36168-36173.
14. Jagle, U., et al. The Journal of Biological Chemistry, 1997. 272: p. 5871-5879.
15. Gill, R.T., et al. Journal of Bacteriology, 2002. 184(13): p. 3671-3681.
16. Rosenfeld, N. and U. Alon. J. Mol. Biol., 2003. 329: p. 645-654.
17. Yildirim, N. and M.C. Mackey. Biophysical Journal, 2003. 84: p. 2841-2851.
18. Wong, P., S. Gladney, and J.D. Keasling. Biotechnol. Prog., 1997. 13: p. 132-143.
19. Quin, J., et al. J. Mol. Biol., 2001. 314: p. 1053-1066.
20. D'haeseleer, P., L. Shoudan, and R. Somogyi. Pac. Symp. Biocomput., 1999. 4: p. 41.
21. Weaver, D.C., C.T. Workman, and G.D. Stormo. Pac. Symp. Biocomput., 1999. 4: p. 112-123.
22. Someren, E.P.V., L.F.A. Wessels, and M.J.T. Reinders. 2000.
23. Someren, E.P.V., et al. Proceedings of the 2001 IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP01), Baltimore, Maryland, June 2001.
24. Someren, E.P.V., L.F.A. Wessels, and M.J.T. Reinders. Signal Processing, 2003. 83: p. 763-775.
25. Winston, W.L. and M. Venkataraman, Introduction to Mathematical Programming. 4th ed. Vol. 1. 2003, Pacific Grove: Brooks/Cole-Thomson Learning.
26. Jin, S., M.D. Jesus-Berrios, and A.L. Sonenshein. J. Bacteriol., 1996. 178(2): p. 560-563.
27. Miller, P., et al. J. Bacteriol., 1988. 170(6): p. 2742-2748.
28. Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists. 2nd ed. 2000: Harcourt Academic Press.
ROBUST IDENTIFICATION OF LARGE GENETIC NETWORKS

D. DI BERNARDO
TIGEM, Via P. Castellino 111, 80131 Naples, Italy
email: [email protected]; Tel: +39 081 6132 229; FAX: +39 081 560 98 77

T.S. GARDNER, J.J. COLLINS
Center for BioDynamics and Department of Biomedical Engineering, Boston University, 44 Cummington St., Boston, MA 02215, USA
Temporal and spatial gene expression, together with the concentration of proteins and metabolites, is tightly controlled in the cell. This is possible thanks to complex regulatory networks between these different elements. The identification of these networks would be extremely valuable. We developed a novel algorithm to identify a large genetic network, as a set of linear differential equations, starting from measurements of gene expression at steady state following transcriptional perturbations. Experimentally, it is possible to overexpress each of the genes in the network using an episomal expression plasmid and measure the change in mRNA concentration of all the genes, following the perturbation. Computationally, we reduced the identification problem to a multiple linear regression, assuming that the network is sparse. We implemented a heuristic search method in order to apply the algorithm to large networks. The algorithm can correctly identify the network, even in the presence of large noise in the data, and can be used to predict the genes that directly mediate the action of a compound. Our novel approach is experimentally feasible and it is readily applicable to large genetic networks.
1 Introduction
Temporal and spatial gene expression, together with the concentration of proteins and metabolites, is tightly controlled in the cell. This is possible thanks to complex regulatory networks between these different elements. The identification of these networks would be extremely valuable. Different experimental and computational methods have been proposed to tackle the network identification problem 1,2,3,4,5,6. Although implemented with some success, they are data intensive and the description of the network they provide is limited. A variety of mathematical models may be used to describe genetic networks 7,8,9, including Boolean logic 10,11, Bayesian networks 12, graph theory 13, and
ordinary differential equations. We concentrated our efforts on this last model, because it offers a description of the network as a continuous-time dynamical system that can be used to infer the genes with the major regulatory function in the network. In addition, it can be applied to the RNA expression measurements obtained from pharmacological perturbations to identify the genes that directly mediate a compound's bio-activity in the cell. We already developed and tested in vitro an algorithm to identify a genetic network of nine genes, as a set of linear differential equations, starting from measurements of gene expression at steady state following transcriptional perturbations 14. In what follows we describe a modification of the algorithm to tackle the problem of reverse-engineering large genetic networks.
2 Methods
2.1 Network model description
A network can be described by a set of ordinary differential equations describing the time evolution of the mRNA concentration of the genes in the network:

$$\dot{g} = f(g, u) \qquad (1)$$
where g represents the mRNA concentrations of the genes in the network, and u is a set of transcriptional perturbations. Assuming that the cell under investigation is at equilibrium near a stable steady-state point, we can apply a small perturbation to each of its genes. A perturbation is small if it does not drive the network out of the basin of attraction of its stable steady-state point and if the stable manifold in the neighborhood of the steady-state point is approximately linear. With these assumptions, we can linearize the set of nonlinear rate equations near the stable steady-state point. Thus, for each gene i in a network of N genes we can write the following equation:
$$\dot{x}_{il} = \sum_{j=1}^{N} a_{ij}\, x_{jl} + u_{il} \qquad (2)$$

where x_il is the mRNA concentration of gene i following the perturbation in experiment l; a_ij represents the influence of gene j on gene i; and u_il is an external perturbation to the expression of gene i in experiment l. (From now on we will use the following notation: g represents a column vector, g^T a row vector, x a scalar and A a matrix.) For all N genes, Eqs. 2 can be rewritten in more compact form using matrix notation:
$$\dot{x}_l = A \cdot x_l + u_l \qquad (3)$$

where x_l is an N x 1 vector of the mRNA concentrations of the N genes in experiment l, A is an N x N connectivity matrix composed of elements a_ij, and u_l is an N x 1 vector of the perturbations applied to each of the N genes in experiment l.
2.2 Network Identification

To identify the network, using the model described above, means to retrieve matrix A. This is possible if we measure the mRNA concentration of all the N genes at steady state (i.e., ẋ_l = 0) in M experiments and then solve the system of equations:

$$A \cdot X = -U \qquad (4)$$
where X is an N x M matrix whose columns are the vectors x_l, and U is an N x M matrix whose columns are the vectors u_l. Equation 4 can be solved only if M ≥ N. However, the recovered weights, A, will be extremely sensitive to noise both in the data, X, and in the perturbations, U, and thus unreliable unless we overdetermine the system of Eqs. 4. This can be accomplished either by increasing the number of experiments (M > N), or by assuming that the maximum number of regulators acting on each gene, k, is less than M (i.e., the network is not fully connected 15,16), thus reducing the number of weights a_ij to be recovered.
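In the idealized, noise-free case with M = N independent perturbation experiments, Eq. 4 can be inverted directly. A minimal NumPy sketch of this baseline (the paper's implementation was in MATLAB; the toy network and sizes here are illustrative):

```python
import numpy as np

# Noise-free baseline for Eq. 4: with M = N experiments, A = -U . X^-1.
rng = np.random.default_rng(0)
N = 5
A_true = -np.eye(N) + 0.3 * rng.standard_normal((N, N))  # toy network
U = np.eye(N)                      # one unit perturbation per experiment
X = -np.linalg.inv(A_true) @ U     # steady-state responses (A . X = -U)
A_est = -U @ np.linalg.inv(X)      # invert Eq. 4
assert np.allclose(A_est, A_true)
```

With noisy data, this direct inversion is exactly the unstable step the paper avoids; the sparse regression of sec. 2.4 replaces it.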
2.3 Experimental approach

To identify the network we need to perform transcriptional perturbations for each of the genes in the network and to measure, at steady state following each perturbation, the changes in the mRNA concentrations of all the genes in the network. In each perturbation experiment, it is possible to overexpress a different one of the genes in the network using an episomal expression plasmid. We then let the cells grow under constant physiological conditions to their steady state after the perturbation and measure the change in mRNA concentration compared to cells under the same physiological conditions but unperturbed. This can be achieved using microarrays or real-time quantitative PCR.
2.4 Algorithm

A genetic network can be described by the system of linear differential equations, Eqs. 2. For each gene i at steady state (ẋ_il = 0) in experiment l, we can therefore write:

$$a_i^T \cdot x_l = -u_{il} \qquad (5)$$

where u_il is the transcriptional perturbation applied to gene i in experiment l, a_i^T is a row of A, and x_l (N x 1) are the mRNA concentrations at steady state following the perturbation in experiment l. The algorithm assumes that only k out of the N weights in a_i for gene i are different from zero. For each possible combination of k out of N weights, the algorithm computes the solution to the following linear regression model:

$$y_{il} = b_i^T \cdot z_l + \epsilon_{il} \qquad (6)$$

where y_il = -u_il is the perturbation applied to gene i in experiment l; b_i is a k x 1 vector representing one of the (N choose k) possible combinations of k weights for gene i; ε_il is a scalar stochastic normal variable with zero mean and variance var(ε_il), representing measurement noise on the perturbation of gene i in experiment l; and z_l is a k x 1 vector of mRNA concentrations following the perturbation in experiment l, with added uncorrelated Gaussian noise (γ_l) with zero means and variances var(γ_l). Equation 6 represents a multiple linear regression model with noise η_il = b_i^T · γ_l + ε_il, with zero mean and variance:
$$var(\eta_{il}) = \sum_{j=1}^{k} b_{ij}^2\, var(\gamma_{jl}) + var(\epsilon_{il}) \qquad (7)$$

(if ε_il and γ_l are uncorrelated).
If we collect data in M different experiments, then we can write Eq. 6 for each experiment and obtain the system of equations:

$$y_i = Z^T \cdot b_i + \eta_i \qquad (8)$$

where y_i is an M x 1 vector of the measurements of the perturbation y_il to gene i in the M experiments; Z is a k x M matrix, where each column is the vector z_l for one of the M experiments; and η_i is an M x 1 vector of the noise in the M experiments. From Eqs. 8, it follows that a predictor for y_i given the data matrix Z is:

$$\hat{y}_i = Z^T \cdot \hat{b}_i \qquad (9)$$
We chose to minimize the following cost function to find the k weights, b̂_i, for gene i:

$$C_i = \sum_{l=1}^{M} (y_{il} - \hat{y}_{il})^2 \qquad (10)$$

The solution can be obtained by computing the pseudo-inverse of Z:

$$\hat{b}_i = (Z \cdot Z^T)^{-1} \cdot Z \cdot y_i \qquad (11)$$
Note that the solution, b̂_i, in Eq. 11 is not the maximum likelihood estimate for the parameters b_i when the regressors Z are stochastic variables 17, but it nevertheless is a good estimate. We select, as the best approximation of the weights in Eqs. 2 for gene i, the solution with the smallest least-squares error among the (N choose k) possible solutions b̂_i.
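The exhaustive search of sec. 2.4 can be sketched compactly in NumPy on a noise-free toy problem (the paper's implementation was in MATLAB; the function name and toy sizes are illustrative):

```python
import numpy as np
from itertools import combinations

def best_k_regulators(y_i, X, k):
    """For one gene: try every combination of k candidate regulators
    (rows of X), solve the regression of Eq. 11, and keep the combination
    with the smallest least-squares error (Eq. 10). y_i holds -u_il."""
    N, M = X.shape
    best = (np.inf, None, None)
    for subset in combinations(range(N), k):
        Z = X[list(subset), :]                 # k x M regressor matrix
        b = np.linalg.solve(Z @ Z.T, Z @ y_i)  # Eq. 11 pseudo-inverse
        err = np.sum((y_i - Z.T @ b) ** 2)     # cost function, Eq. 10
        if err < best[0]:
            best = (err, subset, b)
    return best

# tiny noise-free example: the gene is regulated by genes 1 and 3 only
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))               # 5 genes, 20 experiments
a_true = np.array([0.0, 2.0, 0.0, -1.5, 0.0])  # true row of A
y = X.T @ a_true                               # y_l = a^T x_l
err, subset, b = best_k_regulators(y, X, k=2)
assert subset == (1, 3)
```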
2.5 Estimation of the variance of the parameters

We now turn to the estimation of the variance of the estimated parameters b̂_i and the calculation of the goodness of fit. If, in each experiment, the noise is uncorrelated and Gaussian with zero mean and known variance, then the covariance matrix of the estimated parameters b̂_i is 16:

$$Cov(\hat{b}_i) = (Z \cdot Z^T)^{-1} \cdot Z \cdot C_\eta \cdot Z^T \cdot (Z \cdot Z^T)^{-1} \qquad (12)$$

where C_η is an M x M diagonal matrix with diagonal elements equal to the noise variance for gene i in the M experiments, var(η_i1) ... var(η_iM). We assume that we can estimate var(η_il) in each experiment using the parameters b̂_i estimated with Eq. 11 and substituting in Eq. 7:
$$\widehat{var}(\eta_{il}) = \sum_{j=1}^{k} \hat{b}_{ij}^2\, var(\gamma_{jl}) + var(\epsilon_{il}) \qquad (13)$$
We can now compute the variances of the parameters using Eq. 12, where C_η is computed using Eq. 13. The quantities var(γ_jl) and var(ε_jl) are supposed to have been estimated experimentally. We can also compute a goodness-of-fit test using the Chi-squared statistic:

$$\chi_i^2 = \sum_{l=1}^{M} \frac{(y_{il} - \hat{y}_{il})^2}{var(\eta_{il})} \qquad (14)$$
2.6 Modification of the algorithm for large networks

For a network of N genes, with k ≤ N connections for each gene, we need to solve Eq. 6 for all the (N choose k) possible combinations of k genes and then select the one that fits the data best. For large networks, this exhaustive approach is unfeasible since there are too many combinations to test. We used a heuristic search method (Forw-TopD-reest-K 18) to reduce the number of solutions to test. We first compute all the possible solutions with single connections (k = 1) as described in sec. 2.4. We then select the best D solutions (the ones with the smallest least-squares error), and only for these intermediate solutions, we compute all the possible solutions with an additional connection. We then again select the best D solutions, and so on until the number of connections found for each gene is k. We implemented this approach using a value of D = 5.
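The heuristic amounts to a beam search over regulator subsets. A NumPy sketch of this idea (a simplification of the heuristic cited above, not the paper's MATLAB code; names and toy sizes are illustrative):

```python
import numpy as np

def sse(Z, y):
    """Least-squares error of regressing y on the rows of Z (Eqs. 10-11)."""
    b = np.linalg.lstsq(Z.T, y, rcond=None)[0]
    return float(np.sum((y - Z.T @ b) ** 2))

def forw_top_d(y, X, k, D=5):
    """Beam-search sketch of sec. 2.6: rank all single-regulator models,
    keep the best D, extend each kept model by one regulator, re-rank,
    and repeat until the models have k regulators."""
    N = X.shape[0]
    beam = sorted((sse(X[[j], :], y), (j,)) for j in range(N))[:D]
    for _ in range(k - 1):
        cand = {}
        for _, subset in beam:
            for j in range(N):
                if j not in subset:
                    s = tuple(sorted(subset + (j,)))
                    cand.setdefault(s, sse(X[list(s), :], y))
        beam = sorted((e, s) for s, e in cand.items())[:D]
    return beam[0]

# toy check: gene regulated by genes 1 and 3 only, noise-free data
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 20))
y = X.T @ np.array([0.0, 2.0, 0.0, -1.5, 0.0])
err, regulators = forw_top_d(y, X, k=2, D=5)
assert regulators == (1, 3)
```

Instead of (N choose k) regressions, the beam evaluates on the order of k·D·N, which is what makes networks of hundreds of genes tractable.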
2.7 Target prediction
It is possible to use the recovered network to deconvolve the results of an experiment, i.e., to recover the unknown perturbations u_l in an experiment, given the measurements of the response to that perturbation, x_l. The predicted perturbations û_l can be computed from:

$$\hat{u}_l = -\hat{A} \cdot x_l \qquad (15)$$
The variance of the estimated perturbation of gene i can be computed as 19:

$$var(\hat{u}_{il}) = x_l^T \cdot (Z \cdot Z^T)^{-1} \cdot Z \cdot C_\eta \cdot Z^T \cdot (Z \cdot Z^T)^{-1} \cdot x_l + \sum_{j=1}^{k} \hat{a}_{ij}^2\, var(\gamma_{jl}) \qquad (16)$$
Using the variance of the estimated perturbation, we perform a t-test to test the hypothesis that the predicted perturbations are significantly different from zero.
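The deconvolution step itself is a single matrix product. A minimal NumPy sketch (the network, sizes, and perturbed gene are illustrative, and the true A is used in place of a recovered estimate):

```python
import numpy as np

# Sketch of sec. 2.7: with a network model A, the perturbation behind a
# new steady-state profile x is recovered as u_hat = -A . x.
rng = np.random.default_rng(2)
N = 6
A = -np.eye(N) + 0.2 * rng.standard_normal((N, N))  # toy network model
u = np.zeros(N)
u[3] = 1.0                      # gene 3 is secretly perturbed
x = np.linalg.solve(A, -u)      # its steady-state response (Eq. 3, x_dot = 0)
u_hat = -A @ x                  # predicted perturbations, Eq. 15
assert int(np.argmax(np.abs(u_hat))) == 3
```

With a noisy estimate of A, the t-test described above decides which entries of u_hat are significantly nonzero.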
2.8 Simulated data
To test the algorithm on a realistic data set, we generated 10 random networks of N = 100 genes with an average of k = 10 connections for each gene. Each network was represented by a full-rank sparse matrix A (N x N), as described in sec. 2.2. We made sure that all the eigenvalues of these random sparse matrices had a real part less than 0 to ensure that the dynamical systems described by
them were stable. The data set X (N x M) was obtained by inverting Eq. 4 to obtain:

$$X = -A^{-1} \cdot U \qquad (17)$$
where U (N x M) contains the perturbations in M = 100 experiments. We chose U to be the identity matrix. This is equivalent to saying that in each experiment only 1 out of the 100 genes was perturbed, by increasing its transcription rate by 1. The data the algorithm needs to identify the network A are the gene expression data matrix X and the perturbation matrix U. We added white Gaussian noise to each data matrix. For the perturbation matrix U, the standard deviation of the noise was fixed to σ_u = 0.3 (i.e., 30% of the magnitude of the perturbation), while for the gene expression data matrix it varied from σ_x = 0.1·X̄ to σ_x = 0.5·X̄, where X̄ represents the average of the absolute values of the elements in X. The performance of the algorithm in identifying the network A was tested using these data at the different noise levels. We used two measures of performance: coverage (correct connections in the recovered network model / total connections in the true network) and false positives (incorrect connections in the recovered model / total number of recovered connections). In order to test the ability of the identified network to predict unknown perturbations given the gene expression data, for each random network we generated 10 additional experiments in which 3 genes, randomly chosen out of the 100, were perturbed simultaneously. We computed the ability of the recovered network to predict which genes had been perturbed, using the method described in sec. 2.7. The algorithm described in this section was fully implemented in the MATLAB environment. For a network of 100 genes, the algorithm took 50 s to run on a Pentium III with a clock speed of 1.2 GHz.
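The data-generation procedure can be sketched at reduced scale in NumPy (the paper used MATLAB; N, k, the eigenvalue shift, and the seed are illustrative):

```python
import numpy as np

# Scaled-down sketch of sec. 2.8: a sparse random network made stable by
# shifting its eigenvalues, identity perturbations, and Eq. 17 for the data.
rng = np.random.default_rng(3)
N, k = 20, 3
A = np.zeros((N, N))
for i in range(N):
    cols = rng.choice(N, size=k, replace=False)  # k regulators per gene
    A[i, cols] = rng.standard_normal(k)
# push all eigenvalues' real parts below zero so the dynamics are stable
A -= (np.max(np.linalg.eigvals(A).real) + 0.5) * np.eye(N)
U = np.eye(N)                        # perturb one gene per experiment
X = -np.linalg.inv(A) @ U            # noise-free steady states (Eq. 17)
# 30% white Gaussian noise relative to the mean absolute signal
X_noisy = X + 0.3 * np.mean(np.abs(X)) * rng.standard_normal((N, N))
assert np.max(np.linalg.eigvals(A).real) < 0
```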
3 Results
3.1 Identification of networks

Figure 1 shows the average performance of the algorithm across the 10 random networks described in sec. 2.8 for noise levels ranging from 10% to 50%. Since the algorithm also reports the variance of the identified elements in matrix A, it is possible to compute a p-value for each of its elements a_ij. We used a Student t distribution to test the hypothesis that the element a_ij identified
Figure 1: Model recovery performance for simulations. Perturbations of magnitude u_i = 1 (arbitrary units) were applied to ten randomly connected networks of one hundred genes with an average of ten regulatory inputs per gene. For each perturbation to each random network, the mRNA concentrations at steady state were calculated, and normally-distributed, uncorrelated noise was added both to the mRNA concentrations and to the perturbations to represent measurement error. The noise (noise = σ_x/μ_x, where σ_x is the standard deviation of the mean μ_x of x) on the perturbations was set to 30%. The noise on the mRNA concentrations was varied from 10% to 50%. The average coverage, top panel (correct connections in the recovered network model / total connections in the true network), and average false positives, bottom panel (incorrect connections in the recovered model / total number of recovered connections), were calculated across all the models recovered. Filled circles: all the recovered connections were included in the computation of coverage and false positives. Filled squares: only the recovered connections with a p-value ≤ 0.05 were included in the computation.
by the algorithm is significantly different from 0. This is equivalent to testing whether gene i is significantly regulated by gene j. Figure 1 also reports the coverage and false positives in the case where we consider significantly different from 0 only those elements with a p-value ≤ 0.05 (dashed lines).
3.2 Target prediction

Figure 2 shows the coverage (genes correctly identified as perturbed by the network model / total number of perturbed genes) and the percentage of false positives (genes wrongly identified as perturbed by the network model / total number of genes identified as perturbed by the network model) for noise levels ranging from 10% to 50%, averaged across the 10 random networks and across 10 perturbation experiments, as described in sec. 2.8. In Figure 2, open bars show coverage and false positives considering the predicted perturbations correct only if they have a p-value ≤ 0.01; black bars show the same quantities for a p-value ≤ 0.1.

4 Discussion
The algorithm we propose requires only measurements of mRNA concentrations at steady state following transcriptional perturbations. Therefore, the experimental time and costs involved in the procedure are affordable. This is a very useful feature of our approach. Another essential feature is its robustness to measurement noise. Measurements of mRNA concentration using microarrays are noisy, and therefore an algorithm to identify networks is useful only if it is robust to such noise. We showed that the recovered network can be used for target prediction; this can be very useful for drug discovery. Using measurements of mRNA concentration changes at steady state following the application of a compound to a cell population, we can predict which are the direct targets of that drug in a large gene network using the recovered network model. The recovered network model, A, is a linear representation of a nonlinear system. Nonlinear behaviours that are sometimes exhibited by gene, protein, and metabolite networks, including bifurcations, thresholds, and multistability, cannot be described by A. Nevertheless, the linear approximation is topologically equivalent to the nonlinear system near a steady-state point. Therefore, to apply our algorithm, it is necessary to remain near a single steady state during the course of all experiments. From a practical perspective, this means that cells must be maintained under consistent and constant environmental
Figure 2: Perturbation prediction performance for simulations. Three genes were randomly and simultaneously perturbed. Using the steady-state measurements following the perturbation, the network model was used to predict which genes had been perturbed. This experiment was repeated ten times for each one of ten different random networks of one hundred genes with an average of ten regulators per gene. Coverage (genes correctly identified as perturbed by the network model / total number of perturbed genes) and the percentage of false positives (genes wrongly identified as perturbed by the network model / total number of genes identified as perturbed by the network model) are shown for noise levels ranging from 10% to 50%, averaged across the ten random networks and the ten perturbation experiments. Open bars: coverage (tall) and false positives (short) considering correct only predictions with a p-value ≤ 0.01. Filled bars: coverage (tall) and false positives (short) considering correct only predictions with a p-value ≤ 0.1.
and physiological conditions, and the applied perturbations must be relatively small. If these conditions are not met, the recovered model may contain a certain degree of nonlinear error, or, in the extreme, it may not be possible to adequately fit a linear model. In practice, it is generally straightforward to keep the cells in a constant environmental and physiological state, but due to the presence of measurement noise, it can be challenging to meet the condition of small perturbations. For errors due to noise, we can improve the signal-to-noise ratio (S/N) by boosting the size of the perturbations. However, larger perturbations can lead to larger nonlinear errors. Thus, the experimenter must identify an acceptable balance between noise and nonlinear error. The network must be sparse for the method to work. Our algorithm can be successfully applied as long as the real connectivity of the network (i.e., the number of connections per gene) is less than the number of perturbation experiments. An exact threshold for the maximum number of connections that can be recovered correctly with this algorithm cannot be computed, because this will depend on the noise level of the data. For noise-free data, the maximum connectivity will be equal to the number of perturbation experiments performed. Our approach to inferring genetic networks has been shown to work in vivo for small networks 14. The computer simulations described here suggest that a modified version of the algorithm will work also for large genetic networks. We showed that even with considerable noise, it is still possible to recover 60% of the real network with less than 10% wrongly identified connections. This is important in biological research because it can provide a first draft of the map of interactions among hundreds of genes whose function or regulation is partly or completely unknown.
Also, the network recovered with the algorithm can predict the direct targets of an unknown perturbation with a specificity of approximately 80%, even in the presence of large noise. This would greatly help in the identification of the real targets of a novel molecule in a large network, by greatly reducing the number of targets to be tested experimentally. In addition, the experiments required to generate the data needed by the algorithm are feasible and economically affordable also for large networks.
References

1. A. H. Y. Tong, B. Drees, G. Nardelli, G. D. Bader, B. Brannetti, L. Castagnoli, M. Evangelista, S. Ferracuti, B. Nelson, S. Paoluzi, M. Quondam, A. Zucconi, C. W. V. Hogue, S. Fields, C. Boone, and G. Cesareni, Science 295, 321-324 (2002).
2. T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young, Science 298, 799-804 (2002).
3. T. Ideker, V. Thorsson, J. A. Ranish, R. Christmas, J. Buhler, J. K. Eng, R. Bumgarner, D. R. Goodlett, R. Aebersold, and L. Hood, Science 292, 929-934 (2001).
4. E. H. Davidson, J. P. Rast, P. Oliveri, A. Ransick, C. Calestani, et al., Science 295, 1669-1678 (2002).
5. A. Arkin, P. D. Shen, and J. Ross, Science 277, 1275-1279 (1997).
6. M. K. S. Yeung, J. Tegner, and J. J. Collins, Proc. Natl. Acad. Sci. U.S.A. 99, 6163-6168 (2002).
7. H. de Jong, J. Comp. Biol. 9, 67-103 (2002).
8. M. A. Savageau, Chaos, 142-159 (2001).
9. A. Levchenko and A. Iglesias, Biophys. J., 50-63 (2002).
10. I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, Bioinformatics 18, 261-274 (2002).
11. S. Liang, S. Fuhrman, and R. Somogyi, Proc. Pacific Symp. Biocomp. 3, 18-29 (1998).
12. A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, Proc. Pacific Symp. Biocomp. 7, 437-449 (2002).
13. A. Wagner, Bioinformatics 17, 1183-1197 (2001).
14. T. S. Gardner, D. di Bernardo, D. Lorenz, and J. J. Collins, Science 301, 102-105 (2003).
15. Z. N. Oltvai and A. L. Barabási, Science 298, 763-764 (2002).
16. L. Ljung, System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ (1999).
17. W. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, UK (1993).
18. E. P. van Someren, L. F. A. Wessels, M. J. T. Reinders, and E. Backer, Proc. 2nd Int. Conf. Systems Biol., 222-230 (2001).
19. D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to Linear Regression Analysis. John Wiley & Sons, Inc., New York (2001).
RECONSTRUCTING CHAIN FUNCTIONS IN GENETIC NETWORKS

I. GAT-VIKS, R. SHAMIR
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. {iritg,rshamir}@tau.ac.il

R. M. KARP, R. SHARAN
International Computer Science Institute, 1947 Center St., Berkeley, CA 94704. {karp,roded}@icsi.berkeley.edu

Abstract
Deciphering the mechanisms that control gene expression in the cell is a fundamental question in molecular biology. This task is complicated by the large number of possible regulation relations in the cell, and the relatively small amount of available experimental data. Recently, a new class of regulation functions called chain functions was suggested. Many signal transduction pathways can be accurately modeled by chain functions, and the restriction to chain functions greatly reduces the vast search space of regulation relations. In this paper we study the computational problem of reconstructing a chain function using a minimum number of experiments, in each of which only a few genes are perturbed. We give optimal reconstruction schemes for several scenarios and show their application in reconstructing the regulation of galactose utilization in yeast.
1 Introduction
The regulation of mRNA transcription is key to cellular function. High-throughput genomic technologies, such as DNA microarrays, enable a global view of the transcriptome, and provide the means for reconstructing regulatory relations among genes, that is, inferring the set of genes that cooperate in the regulation of a given gene and the particular logical function by which this regulation is determined. This paper studies the number and complexity of biological experiments that are needed in order to infer certain regulatory relations. An experiment involves knocking out or over-expressing certain genes, and measuring the expression levels of all other genes. The order of an experiment is the number of genes that are perturbed. A key obstacle in the inference of regulation relations is the large number of possible solutions and, consequently, the unrealistically large amount of data needed to identify the right one. Akutsu et al. showed that even for a Boolean
network model, the number of experiments that are needed for reconstructing a network of N genes is prohibitive: the lower and upper bounds on the number of experiments of order N - 1 that are needed are Ω(2^(N-1)) and O(N · 2^(N-1)), respectively. Even with no more than d regulators for each regulated gene, the number of required experiments of order d is still Ω(N^d) and O(N^d), respectively. The inherent complexity of genetic network inference led researchers to seek ways around this problem. Ideker et al. studied how to dynamically design experiments so as to maximize the amount of information extracted. Friedman et al. used Bayesian networks to reveal parts of the genetic network that are strongly supported by the data. Tanay and Shamir suggested a method of expanding a known network core using expression data. Several studies used prior knowledge about the network structure, or restrictive models of the structure, in order to identify relevant processes in gene expression data 5,6,7,8. Recently, a biologically motivated model of regulation relations based on chain functions was suggested in order to cope with the problem of genetic network inference 9. In a chain function, the state of the regulated gene depends on the influence of its direct regulator, whose activity may in turn depend on the influence of another regulator, and so on, in a chain of dependencies (we defer formal definitions till later). The chain model further assumes that variable states are Boolean. The latter assumption is a drastic simplification of real biology, yet it captures important features of biological systems and was frequently used in previous studies. The class of chain functions has several important advantages 9: these functions reflect common biological regulation behavior, so many real biological regulatory relations can be elucidated using them (examples include the SOS response mechanism in E. coli 10 and galactose utilization in yeast 11).
Moreover, by restricting consideration to chain functions, the number of candidate functions drops from doubly exponential to singly exponential. In this paper we study the computational problems arising when one wishes to reconstruct chain functions using a minimum number of experiments of the smallest possible order. We address both the question of finding the set of regulators of a chain function, which is typically much smaller than the entire set of genes, and the question of reconstructing the function given its regulators. We give optimal reconstruction schemes for several scenarios and show their application on real data. Our analysis focuses on the theoretical complexity of reconstructing regulation relations (the number and order of experiments), assuming that experiments provide accurate results and that the target function can be studied in isolation from the rest of the genetic network. The paper is organized as follows: Section 2 contains basic definitions related to chain functions. In Section 3 we give worst-case and average-case analyses of the number of experiments needed in order to reconstruct a chain function. Both low-order and high-order experimental settings are considered. In Section 4 we study the reconstruction of composite regulation functions that combine several chains. Finally, in Section 5 we describe a biological application of our analysis to reconstruct the regulation mechanism of galactose utilization in yeast. For lack of space, some proofs are shortened or omitted.

2 Chain Functions
Chain functions were introduced by Gat-Viks and Shamir 9. In the following we define these functions and describe their main properties. Our presentation differs from the original one, to allow a succinct description of the reconstruction schemes in later sections. Let U denote the set of all variables in a network, where |U| = N. These variables correspond to genes, mRNAs, proteins or metabolites. Each variable may attain one of two states: 1 or 0. The state of gene g, denoted by state(g), indicates the discretized expression level of the gene. The intended interpretation is that state(g) is 1 if gene g is capable of being activated in a given environment, and 0 otherwise. A variable normally exists in its wild-type state, but perturbations such as gene knockouts may change its state. Let g_0 ∈ U be regulated by a set S of n variables. In that case we say that S is the regulator set of g_0, and g_0 is called the regulatee. A candidate regulation function for the regulatee g_0 has the form f_{g_0}: {0,1}^n → {0,1}. In other words, the state of g_0 is a function of the states of its regulators. The chain function model assumes that the functional relations are deterministic. The chain function f_{g_0} on the regulators g_n, ..., g_1 determines the state of the regulatee g_0. The order of the regulators is important, as it reflects the order of influence among them. We call g_i the predecessor of g_j for i > j, and the successor of g_k for i < k. Each regulator may activate or repress its successor, and this chain of events enables a signal to propagate from g_n to g_0, in a manner described below. Associated with each regulator g_i is a fixed value y_i which dictates the regulatory influence of g_i on g_{i-1}. If y_i = 0 then g_i is an activator; otherwise g_i is a repressor. The value y_i represents an intrinsic property of the chain and is not subject to change. The control pattern of f_{g_0} is the binary vector (y_n, ..., y_1).
The function f_{g_0} can be defined using two n-long Boolean vectors attributing activity and influence to each g_i. The definitions of the activity and influence are recursive. Let a(g_i) denote the activity of g_i, and infl(g_i) denote the influence of g_i on g_{i-1}. The influence on g_n is always 1. g_i is activated (a(g_i) = 1) iff it is capable of being activated and it receives a positive activation signal from its predecessor. The activation signal infl(g_i), transmitted from g_i to g_{i-1},
is 1 if g_i is an activator and is itself activated, or if g_i is a repressor and is not activated (so that it fails to repress g_{i-1}). Formally:

$$a(g_i) = 1 \text{ iff } (infl(g_{i+1}) = 1 \text{ and } state(g_i) = 1) \qquad (1)$$

$$infl(g_i) = y_i \oplus a(g_i) \qquad (2)$$

Finally, the state of the regulatee g_0 is simply the influence of g_1. We define the output of f_{g_0} to be state(g_0). A chain function is uniquely determined by its set of regulators, their order and the control pattern. Any control pattern may be separated into blocks of consecutive regulators by truncating the control pattern after each 1. The first block (rightmost, ending at g_1) has two possible forms: 0...0 or 0...01. All other blocks are of the form 0...01.
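Equations 1-2 give a direct recursive evaluation procedure. A small Python sketch (the function name and 0/1 list encodings are illustrative, not from the paper):

```python
def chain_output(states, control):
    """Evaluate a chain function (Eqs. 1-2).

    states  = [state(g_n), ..., state(g_1)]   (0/1 each)
    control = [y_n, ..., y_1]                 (0 = activator, 1 = repressor)
    Returns state(g_0), the output of the chain.
    """
    infl = 1                      # the influence on g_n is always 1
    for s, y in zip(states, control):
        a = infl & s              # Eq. 1: active iff influenced and capable
        infl = y ^ a              # Eq. 2: signal passed to the successor
    return infl                   # state(g_0) = infl(g_1)

# all regulators capable: output = state(g_n) XOR y_n XOR ... XOR y_1
assert chain_output([1, 1, 1], [1, 0, 0]) == 0
# g_2 knocked out: output = y_2 XOR y_1
assert chain_output([1, 0, 1], [0, 1, 0]) == 1
```

The two asserted identities are exactly the two cases of Proposition 1.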
3 Reconstruction of Chain Functions
In this section we study the question of uniquely determining the chain function which operates on a known regulatee, using a minimum number of experiments. We assume throughout that all variable states in wildtype are known (or, else, these could be measured). We further assume that all regulator states in wild type are 1, except possibly g n . The latter assumption is motivated by the observation that in many biological examples, all regulators are expressed in wild type and the state of the regulatee is determined by the presence or absence of a metabolite g n . (Examples include the Trp, lac and araBAD operons in E . COZZ~~, and the regulation of galactose utilization in yeast l l . See Section 6 for a discussion of the situation when this assumption does not hold.) An experiment is defined by a set of variables that are externally perturbed (knocked-out or over-expressed). The states of the perturbed variables are thus fixed, and the states of all non-perturbed regulators are assumed to remain at the wild-type values, with the exception of the regulatee. Its state is determined by the chain function. The order of an experiment is the number of externally perturbed variables in it. Our reconstruction algorithms are based on performing various experiments and observing their influence on the state of the regulatee. The algorithms implicitly assume that the regulation function is indeed a chain function and do not explicitly test this property. We now devise a simple set of equations that characterize the output of a chain function as a function of the control pattern and the states of the regulators, both in the wild-type state and in states produced by perturbing some regulators. These equations are the foundation of all the subsequent reconstruction schemes:
Proposition 1 Let f be a chain function on g_n, ..., g_1. If state(g_i) = 1 for 1 ≤ i < n then state(g_0) = state(g_n) ⊕ (⊕_{i=1}^{n} y_i). For any other
state vector, if the least index of a state-0 regulator is j ≤ n then f_{g_0}(g_n, ..., g_1) = ⊕_{i=1}^{j} y_i.
Proof: By definition, a(g_n) = state(g_n). For i < n, state(g_i) = 1 implies that a(g_i) = a(g_{i+1}) ⊕ y_{i+1}. It follows by induction that state(g_0) = state(g_n) ⊕ (⊕_{i=1}^{n} y_i). Similarly, if state(g_j) = 0 and state(g_i) = 1 for all i < j, it follows by induction that f_{g_0}(g_n, ..., g_1) = ⊕_{i=1}^{j} y_i.
3.1 Types and Blocks
A perturbation is an experiment that changes the state of a variable to the opposite of its state in wild type. By our assumption on the regulator states in wild type, the perturbation of a regulator in {g_{n-1}, ..., g_1} is a knockout. For S ⊆ U, an S-perturbation is an experiment in which the states of all the variables in S are perturbed. Let w be state(g_0) in wild type. Let w̄ be the opposite state. For the reconstruction, we first classify the variables in U into two types: W and W̄. A variable is in W (W̄) if its perturbation produces output w (w̄). Naturally, the majority of the genes have type W, since in particular all the genes that are not part of the chain function are such. By Proposition 1 we have g_n ∈ W̄. We call a gene that belongs to W (W̄) a W-gene (W̄-gene). W-successor, W̄-successor of a gene and W-regulator, W̄-regulator are similarly defined. The type of a single gene can be determined by a single perturbation of the gene. Such an experiment will be referred to as a typing experiment throughout.
Corollary 2 Given an ordered set of regulators g_n, ..., g_1, their control pattern can be reconstructed using n typing experiments.
Consider now the block partition of the regulators. The right boundary of a block corresponds to a regulator g_j with y_j = 1 (unless j = 1, in which case y_1 = 0 is also possible), and any other regulator g_i in the block has y_i = 0.
Lemma 3 Each block contains regulators of a single type, and two adjacent blocks contain regulators of opposite types.
The proof follows from the fact that the type of g_i differs from the type of g_{i-1} iff y_i = 1. Thus, we can refer to a block as either a W-block or a W̄-block, and the two types of blocks alternate.
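Corollary 2 can be sketched as follows. By Proposition 1, the experiment perturbing g_i alone yields output y_1 ⊕ ... ⊕ y_i, so consecutive typing outcomes XOR to the control pattern. A minimal sketch, assuming wild-type regulator states are all 1; the function names are ours:

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def typing_results(y):
    """Simulate the n typing experiments: perturb each regulator alone
    (wild-type states are all 1, so each perturbation is a knockout)."""
    n = len(y)
    results = []
    for i in range(1, n + 1):     # perturb g_i, i = 1..n
        s = [1] * n
        s[n - i] = 0              # position of g_i in (g_n, ..., g_1)
        results.append(chain_output(y, s))
    return results                # results[i-1] = y_1 XOR ... XOR y_i

def control_pattern(results):
    """Corollary 2: recover (y_1, ..., y_n) from the typing outcomes."""
    y = [results[0]]
    for i in range(1, len(results)):
        y.append(results[i] ^ results[i - 1])
    return y
```

Note that `control_pattern` returns the pattern indexed (y_1, ..., y_n), i.e., the reverse of the (y_n, ..., y_1) convention used for evaluation.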
3.2 Reconstructing the Regulator Set and the Function
Consider a chain function with control pattern (y_n, ..., y_1) and let g_j, ..., g_i be a block. Then infl(g_i) = [infl(g_{j+1}) ∧ (∧_{h=i}^{j} state(g_h))] ⊕ y_i. Thus,
the behavior of the chain is determined by the boolean variable infl(g_{j+1}), by the control pattern, and by the conjunction of the states of its regulators. Since this conjunction is independent of the order of occurrence of these genes, no experiment based on perturbing the states of the genes can determine the order of the genes within the block. In view of this limitation, our goal is to reconstruct the control pattern, the set of genes within each block (but not the order of their occurrence) and the ordering of the blocks. Correspondingly, in the following we will use the term successor of a gene to denote a regulator that succeeds that gene in the chain and is not a member of its block. For convenience, we shall refer to W-genes that are not regulators of g_0 as predecessors of g_n. The above discussion implies that once we have typed each gene, it remains to determine, for each pair consisting of a W-gene and a W̄-gene, which of these genes precedes the other in the chain. Let k_W, k_W̄ denote the number of regulators of types W, W̄, respectively. Note that k_W + k_W̄ = n ≤ N, and in fact, typically, n << N, as k_W << |W|. Suppose we perform an {i, k}-perturbation with g_i ∈ W and g_k ∈ W̄. If the result is w, then g_k precedes g_i. Otherwise, g_i precedes g_k. A 2-order experiment for determining the relative order of a W-gene and a W̄-gene will be called a comparison throughout.
Proposition 4 Given the set of regulators of a chain function and their types, k_W k_W̄ comparisons are necessary and sufficient to reconstruct the function.
Proof: The upper bound follows by comparing every W-regulator with every W̄-regulator. The lower bound follows from the fact that, in the special case where every W-regulator precedes every W̄-regulator, no set of comparisons can determine the relative order of a given pair consisting of a W-regulator and a W̄-regulator, unless it includes a direct comparison between the pair. Therefore, all such comparisons must be performed.
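Proposition 4's comparison scheme can be simulated end to end. The sketch below (function names, the ordering convention, and the grouping strategy are our own) runs the typing experiments against a simulated hidden chain, performs every W-gene/W̄-gene comparison, groups same-type genes with identical predecessor sets into blocks, and orders the blocks along the chain:

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def reconstruct_blocks(n, y_true):
    """Reconstruct the block partition of a hidden chain with pattern
    y_true (ordered (y_n, ..., y_1)) from typing and comparison experiments."""
    def output(perturbed):
        # knock out the regulators in `perturbed`; wild-type states are 1
        s = [0 if (n - pos) in perturbed else 1 for pos in range(n)]
        return chain_output(y_true, s)

    w = output(set())                                      # wild-type output
    wbar = {i: output({i}) != w for i in range(1, n + 1)}  # True = W̄-gene
    # one comparison per (W-gene, W̄-gene) pair, as in Proposition 4
    pred = {i: set() for i in range(1, n + 1)}
    for i in (g for g in wbar if not wbar[g]):
        for k in (g for g in wbar if wbar[g]):
            if output({i, k}) == w:
                pred[i].add(k)                             # g_k precedes g_i
            else:
                pred[k].add(i)                             # g_i precedes g_k
    # same-type genes with identical predecessor sets share a block
    blocks = {}
    for g in range(1, n + 1):
        blocks.setdefault((wbar[g], frozenset(pred[g])), set()).add(g)
    # order the blocks: a block's predecessors are exactly the
    # opposite-type genes already placed before it
    ordered, placed = [], {False: set(), True: set()}
    items = list(blocks.items())
    while items:
        for idx, ((t, p), genes) in enumerate(items):
            if p == placed[not t]:
                ordered.append(genes)
                placed[t] |= genes
                items.pop(idx)
                break
    return ordered
```

As the proposition states, the order of genes within a block is not recoverable, so the result is a list of sets, from the g_n side down to g_1.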
We now turn to the question of reconstructing a chain function without prior knowledge of the identity of its regulators. The discussion above suggests a way to solve the problem: First, we find the gene types using N typing experiments. Next, we reconstruct the block structure by performing all possible comparisons between a W-gene and a W̄-gene. A more efficient reconstruction is possible when g_n is known. This is common in functions in which g_n stimulates the response. If g_n is known, then, since g_n ∈ W̄, all W-regulators can be identified by comparing every W-gene with g_n, for a total of N − k_W̄ comparisons. Since any W̄-gene is a regulator, these experiments are sufficient to identify all the regulators, and we can apply Proposition 4 to complete the reconstruction.
Proposition 5 A chain function can be reconstructed using at most N typing experiments and k_W̄ × (N − k_W̄) comparisons. Given g_n, a chain function can be reconstructed using at most N − 1 typing experiments and N − n + k_W k_W̄ comparisons.
Propositions 4 and 5 provided a worst-case analysis. Next, we describe another reconstruction algorithm, whose expected number of required experiments is lower. The algorithm is based on identifying g_n efficiently and using it for the reconstruction. Denote by D_g the set of W̄-successors of g ∈ W̄ in f.
Proposition 6 A chain function can be reconstructed using N typing experiments and an expected number of O(N log k_W̄ + k_W k_W̄) comparisons.
Proof: Algorithm: We perform N typing experiments. Next, we apply a randomized scheme to identify g_n and reconstruct the chain: Each time we pick a gene g ∈ W̄ at random, find its successors and their order, and remove g and all its successors from further consideration. We stop when no W̄-genes are left, identifying g_n as the last picked gene. In order to find the successors of g, we first identify the members of D_g using at most N − k_W̄ comparisons. Using D_g, we then reconstruct the part of the chain that spans g and its successors by at most |D_g| · k_W comparisons, as in Proposition 4. Complexity: The set of comparisons can be divided into two parts: those required to identify the sets D_g, and those required to reconstruct the chain parts induced by these sets. For the latter, k_W k_W̄ comparisons are needed in total, since every pair consisting of a W-regulator and a W̄-regulator is compared exactly once. Thus, it suffices to compute the expectation of the first part. Let T(x) be this expectation, given that the current W̄ set contains x elements, where T(0) = 0. Then T(x) ≤ (1/x) Σ_{q=1}^{x} (N + T(x − q)) for x ≥ 1. By induction, T(x) ≤ 2N log x + N. Substituting x = k_W̄ we obtain the required bound.
3.3 Using High-Order Experiments
In this section we show how to improve the above results when using experiments of order q > 2. The results in this section are mainly of theoretical interest, since high-order experiments may not be practical.
Proposition 7 Given the set of n regulators of a chain function, the function can be reconstructed using O(n + (n² log n)/q) experiments of order at most q. This is optimal up to constant factors for q = Θ(n).
Proof: The number of possible chain functions with n regulators is Θ((log_2 e)^{n+1} n!) [9]. Since each experiment provides one bit of information, the information lower bound is Ω(n log n) experiments. We give the upper bound proof for q = n. The proof for other values of q follows by appropriately choosing subsets of regulators of cardinality q, and reconstructing their sub-chains using the method we give next, thereby inferring the entire chain. Let n_i be the number of regulators in block i, where blocks are indexed in right-to-left order. Our reconstruction algorithm is as follows: First, we perform n typing experiments. Next, we identify the type of the first block using one experiment of order n, in which all regulators are perturbed. We proceed to reconstruct the blocks one by one, according to their order along the chain. Note that the type of each block is now known, since the two types alternate. Suppose we have already reconstructed blocks 1, ..., i − 1. For reconstructing the i-th block we only consider the set of regulators that do not belong to the first i − 1 blocks. Out of this set, let A be the subset of regulators that have the same type as block i, and let B be the subset of regulators of the opposite type. We use standard binary search on the set A to identify the members of the i-th block, including in the perturbations also all regulators in B. This requires O(n_i log n) experiments. Thus, altogether we perform O(n log n) experiments.
4 Combining Several Chains
In this section we extend the notion of a chain function to cover common biological examples in which the regulatee state is a boolean function of several chains. Frequently, a combination of several signals influences the transcription of a single regulatee via several pathways that carry these signals to the nucleus, and a regulation function that combines them together. Here, we formalize this situation by modeling each signal transduction pathway by a chain function, and letting the outputs of these paths enter a boolean gate. Define a k-chain function f as a boolean function which is composed of k chain functions over disjoint sets of regulators, that enter a boolean gate G(f). Let f^i be the i-th chain function and let g_j^i denote the j-th regulator in f^i. The output of the function is G(infl(g_1^1), ..., infl(g_1^k)). In the following we present several biological examples of k-chain functions that arise in transcriptional regulation in different organisms: The lac operon codes for lactose utilization enzymes in E. coli. It is under both negative and positive transcriptional control. In the absence of lactose, lac-repressor protein binds to the promoter of the lac operon and inhibits transcription. In the absence of glucose, the level of cAMP
in the cell rises, which leads to the activation of CAP, which in turn promotes transcription of the lac operon. In our formalism, the lac operon is controlled by a 2-chain function with an AND gate. The chains are: f^1(g_2^1, g_1^1) = f^1(lactose, lac-repressor), with control pattern 11, and f^2(g_3^2, g_2^2, g_1^2) = f^2(glucose, cAMP, CAP), with control pattern 100. Other examples of 2-chains with AND gates are the regulation of arginine metabolism and galactose utilization in yeast [11]. A 2-chain with an OR gate regulates lysine biosynthesis pathway enzymes in yeast [11]. These examples motivate us to restrict attention to gates that are either OR or AND. We first show that we can distinguish between OR and AND gates. We then show how to reconstruct k-chain functions in the case of OR and later extend our method to handle AND gates. Denote the output of f^i by O_i. If O_i = 1 in wild type, we call f^i a 1-chain and, otherwise, a 0-chain. A regulator g_j^i is called a 0-regulator (1-regulator) if its perturbation produces O_i = 0 (O_i = 1). Let k_0 (k_1) be the number of 0-regulators (1-regulators) in f. A block is called a 0-block (1-block) if it consists of 0-regulators (1-regulators).
Lemma 8 Given a k-chain function f with gate G(f) which is either AND or OR, k ≥ 2, we can determine, using O(N²) experiments of order at most 2, if G(f) is an AND gate or an OR gate.
Proof: We perform N typing experiments. If w = 0 and W̄ = ∅ then G(f) is an AND gate. If w = 1 and W̄ = ∅ then G(f) is an OR gate. Otherwise, W̄ ≠ ∅. In this situation the cases of w = 0 and w = 1 are similarly analyzed. We describe only the former. If w = 0 we have to differentiate between the case of an OR gate, whose inputs are all 0-chains, and the case of an AND gate, whose inputs are one 0-chain and (k − 1) 1-chains. To this end we perform all comparisons of a W-gene and a W̄-gene. Let T be the set of W-genes g such that the result of a {g, g′}-perturbation is w for every g′ ∈ W̄.
Then T ≠ ∅ iff G(f) is an AND gate.
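The k-chain model itself is straightforward to write down. The following is a sketch, not the paper's code, instantiated on the lac operon example above (parameter names are ours; note that glucose plays the role of g_n for f^2, so its wild-type state may be 0 or 1):

```python
def chain_output(y, s):
    # evaluate a single chain; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def k_chain_output(chains, gate):
    """A k-chain function: each (pattern, states) pair feeds one chain,
    and the chain outputs enter an AND or OR gate."""
    outs = [chain_output(y, s) for y, s in chains]
    return min(outs) if gate == 'AND' else max(outs)

# lac operon as a 2-chain with an AND gate:
# f1(lactose, lac-repressor) with pattern 11,
# f2(glucose, cAMP, CAP) with pattern 100
def lac_operon(lactose, glucose, lac_repressor=1, camp=1, cap=1):
    return k_chain_output(
        [([1, 1], [lactose, lac_repressor]),
         ([1, 0, 0], [glucose, camp, cap])], 'AND')
```

The sketch reproduces the expected behavior: the operon is transcribed only when lactose is present and glucose is absent.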
We now study the reconstruction of an OR gate. Let S be the (possibly empty) set of regulators that reside in one of the first blocks (i.e., blocks containing g_1^i) that are also 1-blocks. We observe that a perturbation of any regulator in S results in state(g_0) = 1 regardless of any other simultaneous perturbations we may perform. Hence, our reconstruction will be unique up to the ordering within blocks and the assignments of the regulators in S to their chains. The next lemma handles the case w = 0. The subsequent lemma treats the case w = 1.
Lemma 9 Given a k-chain function f with an OR gate and assuming that w = 0, we can reconstruct f using N typing experiments and (N − k_1)k_1 comparisons.
Proof: We perform N typing experiments. Then, for each 1-regulator b, we perform all possible comparisons, thereby identifying all 0-regulators that succeed b in its chain. This completes the reconstruction.
Lemma 10 Let f be a k-chain function with an OR gate. Assume that w = 1, and let r be the number of 1-chains entering the OR gate. Then f can be reconstructed using O(N^r + N k_0^r) experiments of order at most min{k + 1, r + 2}.
Proof: First, we determine r, the minimum order of an experiment that will produce output 0 for f. For successive values i we perform all possible i-order experiments; r is determined as the smallest i for which we obtain output 0. In total we perform O(N^r) experiments. We call the set of perturbed genes in an r-order experiment which results in output 0 a reset combination. Next, we identify all 1-regulators. This is done by performing O(N k_0^r) experiments of order (r + 1) as follows: For each reset combination discovered, we perturb in addition each other gene, one at a time, and record those that produce output 1 as 1-regulators. Each reset combination identifies a set of 1-regulators. These sets form a partial order under set inclusion. Let M be a reset combination corresponding to a minimal set in the partial order of 1-regulator sets. The genes in this minimal set will be exactly the 1-regulators in the 0-chains and the 1-regulators in S. By perturbing all r regulators in M, we deactivate the 1-chains, thereby reducing the problem of reconstructing the 0-chains to that of reconstructing a (k − r)-chain function with an OR gate and w = 0. This is done by applying the reconstruction method of Lemma 9 using experiments of order at most min{k + 1, r + 2}. The assignment of 1-regulators in S will remain uncertain. The 1-chains can now be computationally inferred as follows: Pick an arbitrary reset combination and consider in turn each of its subsets of cardinality r − 1. Fixing a subset, consider all reset combinations that contain it. The variable 0-regulators in these combinations correspond to the 0-regulators of a particular 1-chain. For each of these variable 0-regulators our experiments determine a set consisting of the 1-regulators in its chain that succeed it, plus the 1-regulators in S and in the 0-chains, which have been identified by the reset combination M, and can be removed from consideration.
Performing this computation for all combinations and subsets, we will have determined, for each 1-chain, its 0-regulators, its 1-regulators and the ordering relations between them.
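The first phase of Lemma 10 (determining r and the reset combinations by exhaustive search over experiment orders) can be sketched on a small simulated 2-chain OR gate. The function names and the toy instance are our own:

```python
from itertools import combinations

def chain_output(y, s):
    # evaluate a single chain; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

def or_gate_output(chains, perturbed):
    # chains: list of (pattern, gene_ids), each ordered (g_n, ..., g_1);
    # perturbed genes are knocked out, all others stay at wild-type state 1
    outs = [chain_output(y, [0 if g in perturbed else 1 for g in gene_ids])
            for y, gene_ids in chains]
    return max(outs)

def find_resets(chains, N):
    """Try all i-order experiments for i = 1, 2, ... until some experiment
    drives the OR gate to 0; return r and all reset combinations found."""
    for r in range(1, N + 1):
        resets = [set(c) for c in combinations(range(N), r)
                  if or_gate_output(chains, set(c)) == 0]
        if resets:
            return r, resets
    return None, []
```

For a 1-chain f^1 on genes (0, 1) with pattern 11 and a 0-chain f^2 on genes (2, 3) with pattern 10, wild-type output is 1 and the single reset combination {0} is found at order r = 1, matching r = number of 1-chains.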
Note that for k = 1 the above algorithms will reconstruct a single chain. Further note that these algorithms may be used for the reconstruction of an AND gate as well, by exchanging the roles of 0 and 1 in the above description. This gives rise to the following result:
Theorem 11 A k-chain function with an OR or an AND gate can be reconstructed using O(N^k) experiments of order at most k + 1.
5 A Biological Application
The methods we presented above can be applied to reconstruct chain functions from biological data. We describe in detail one such reconstruction of the yeast galactose chain function, for which some of the required perturbations have been performed. We show that one additional experiment suffices to fully reconstruct the regulation function. Galactose utilization in the yeast S. cerevisiae [11] occurs in a biochemical pathway that converts galactose into glucose-6-phosphate. The transporter gene gal2 encodes a protein that transports galactose into the cell. A group of enzymatic genes, gal1, gal7, gal10, gal5 and gal6, encode the proteins responsible for galactose conversion. The regulators gal4p, gal3p and gal80p control the transporter, the enzymes, and to some extent each other (Xp denotes the protein product of gene X). In the following, we describe the regulatory mechanism, assuming that glucose is absent in the medium. gal4p is a DNA binding factor that activates transcription. In the absence of galactose, gal80p binds gal4p and inhibits its activity. In the presence of galactose in the cell, gal80p binds gal3p. This association releases gal4p, promoting transcription. This mechanism can be viewed as a chain function, where f^1(g_4, g_3, g_2, g_1) = f^1(galactose, gal3, gal80, gal4), and the corresponding control pattern is 0110. The gal7, gal10 and gal1 regulatees are also negatively controlled by another chain f^2 containing MIG1 and glucose. The two chains are combined by an AND gate. We focus here on the reconstruction of f^1, since the other chain has no influence in the experiments that we describe below (as those were conducted in the presence of glucose). f^1 consists of 3 blocks, where in wild type (in the presence of glucose and galactose) gal3, gal80 and gal4 are in state 1 (using the same discretization procedure employed by Ideker et al. [12]).
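The galactose chain just described can be written down and checked against the mechanism. A minimal sketch (the helper name is ours; the gene ordering follows f^1(galactose, gal3, gal80, gal4) with control pattern 0110):

```python
def chain_output(y, s):
    # evaluate a chain function; y and s are ordered (g_n, ..., g_1)
    a = s[0]
    infl = y[0] ^ a
    for yi, si in zip(y[1:], s[1:]):
        a = infl if si == 1 else 0
        infl = yi ^ a
    return infl

GAL_PATTERN = [0, 1, 1, 0]   # (y4, y3, y2, y1) for f1

def gal_transcription(galactose=1, gal3=1, gal80=1, gal4=1):
    # f1(g4, g3, g2, g1) = f1(galactose, gal3, gal80, gal4)
    return chain_output(GAL_PATTERN, [galactose, gal3, gal80, gal4])
```

With galactose present the GAL genes are on; removing galactose, or knocking out gal3 or gal4, turns them off, while a gal80 knockout leaves them constitutively on, matching the double-negative regulatory mechanism.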
Assuming we know the group of four regulators, we need, according to Proposition 4, a total of 4 typing experiments and 3 comparisons (since only gal80 is of type W) to reconstruct the chain. Notably, all 4 typings and 2 of the 3 comparisons were performed by Ideker et al. [12], yielding the correct results. The missing experiment is a comparison of gal80 and gal3. A correct result of this experiment will lead to full reconstruction of the chain function.
6 Concluding Remarks
In this paper we studied the computational problems arising when wishing to reconstruct regulation relations using a minimum number of experiments, assuming that the experiments provide correct results. We restricted attention to common biological relations, called chain functions, and exploited their special structure in the reconstruction. We also suggested an extension of that model, that combines several chain functions, and studied the implied reconstruction questions. On the practical side, we have shown an application of our reconstruction scheme for inferring the regulation of galactose utilization in yeast. The task of designing optimal experimental settings is fundamental in meeting the great challenge of regulatory network reconstruction. While this task entails coping with complex interacting regulation functions, we chose here to focus on the reconstruction of a single regulation relation of a single regulatee. We also made two strong assumptions that simplify the analysis considerably: (1) The function can be studied in isolation. Hence, upon any perturbation, none of the other regulators change their states; (2) the wild-type state of all regulators (except possibly g_n) is 1. Our study could serve as a component in a more general scheme for dealing with entire networks, whose regulation relations possibly interact with one another.
Acknowledgments
R. M. Karp and R. Shamir were supported by a grant from the US-Israel Binational Science Foundation (BSF). R. Sharan was supported by a Fulbright grant. I. Gat-Viks was supported by a Colton fellowship.
References
1. T. Akutsu et al. Theor. Comp. Sci., 298:235-251, 2003.
2. T. Ideker, V. Thorsson, and R.M. Karp. In Proc. of the Pacific Symposium on Biocomputing, pages 305-316, 2000.
3. N. Friedman et al. J. Comp. Biol., 7:601-620, 2000.
4. A. Tanay and R. Shamir. Bioinformatics, 17, Supplement 1:270-278, 2001.
5. D. Hanisch et al. Bioinformatics, 18, Supplement 1:145-154, 2002.
6. T. Ideker et al. Bioinformatics, 18, Supplement 1:233-240, 2002.
7. E. Segal et al. Bioinformatics, 17, Supplement 1:243-252, 2001.
8. D. Pe'er, A. Regev, and A. Tanay. Bioinformatics, 18, Supplement 1:258-267, 2002.
9. I. Gat-Viks and R. Shamir. Bioinformatics, 19, Supplement 1:108-117, 2003.
10. F. C. Neidhardt, editor. ASM Press, 1996.
11. E. W. Jones, J. R. Pringle, and J. R. Broach, editors. Cold Spring Harbor Laboratory Press, 1992.
12. T. Ideker et al. Science, 292:929-933, 2001.
INFERRING GENE REGULATORY NETWORKS FROM RAW DATA - A MOLECULAR EPISTEMICS APPROACH
D. A. KIGHTLEY, N. CHANDRA AND K. ELLISTON
Genstruct Inc., 125 Cambridgepark Drive, Cambridge, MA 01702, USA
Biopathways play an important role in the functional understanding and interpretation of gene function. In this paper we present the results of an iterative algorithm for automatically generating gene regulatory networks from raw data. The algorithm is based on an epistemics approach of conjecture (hypothesis formation) and refutation (hypothesis testing). These operations are performed on a matrix representation of the gene network. Our approach also provides a way of incorporating external biological knowledge into the model. This is done by preassigning portions of the matrix, which represent previously known background knowledge. This background knowledge helps make the results closer to a human's rendition of such networks. We illustrate our approach by having the computer replicate a gene regulatory network generated by human scientists at an academic lab.
1 Introduction
Gene regulation in eukaryotes is the result of a complex interaction of numerous elements that combine to determine the expression of genes. The bindings of multiple transcription factors at cis-regulatory sites act in combination to determine the level of gene transcription. Discovering the nature of these interactions remains a challenging problem. Elucidation of the regulatory network architecture from a set of experimental data is a complex problem, and development of an automated process can help in generating networks that are too large and too complex for humans to handle. Algorithms for automatically generating a genetic regulatory network have been used on a number of different data types. Microarrays [5] give a measure of levels of gene expression in a cell and these data have been used to generate the underlying genetic network [17]. However, the cost of analysis of each interaction in the network is high. The complete set of data is rarely produced and data are frequently sparse. As a result, network inference algorithms are typically applied for recreating complex functional network structures from limited datasets [11, 13, 15]. A different technique measures changes in mRNA transcription of various target genes, measured by PCR, when another gene is perturbed. These perturbation studies [8, 10] can yield information as to which genes are regulated, either directly or indirectly, by another. Thus by combining the interactions it is possible to build up a regulatory network. However, an effect can be the result of a direct interaction or an indirect action through intermediate genes. Therefore, it is necessary to incorporate prior knowledge of the system to infer the network structure; a Bayesian network has been used for this purpose [14, 16].
An alternative approach for generating gene regulatory networks has been to use reverse engineering of data using generative algorithms [6, 7, 12]. This approach starts with a set of observations and generates networks that approximate the solution. Through modification and refinement, the network that best explains the data is arrived at (see Section 3.1).
2 Gene Perturbation Data
2.1 Source of the Data
The data relating to gene regulation of purple sea urchin (Strongylocentrotus purpuratus) embryo development have been made available on the Internet [2], from where the data were transcribed. Figure 1 is a sample of the data giving the effects on two transcription factors out of a total of 60 genes. The dataset relates to experiments performed at the Davidson Laboratory at the California Institute of Technology that involved quantitative PCR studies on embryos during the early stages of development (< 72 hr). Details of the findings from the studies have been published [3].
2.2 Gene Perturbation
The experiments performed on the sea urchin embryos involved perturbation of genes and measurement of changes in expression of a second, target gene. In the absence of other influences, perturbation of a gene that is an activator of another will cause the expression of the second gene to be decreased. Alternatively, if the perturbed gene is inhibitory, the expression level of the latter will be increased. The numerical values refer to the cycle number in the PCR experiment, and this relates back to the starting level of mRNA, which is amplified exponentially during PCR. A value of 1 represents an approximate doubling of the initial mRNA level. Thus, if a value of 3 is reported for an interaction, perturbation of the gene resulted in an 8-fold increase in the gene product compared with the unchanged cell. The convention used in the data is that negative values mean less starting mRNA. Thus, if perturbation of a gene results in lower quantities of mRNA transcribed from target genes, the relationship must have been activation. Similarly, positive values indicate inhibition. Transcription regulation involves a complex network of genes that encode transcription factors which, in turn, regulate other genes. A specific transcription factor can regulate multiple genes, and there are chains of interactions which form a cascade. Thus perturbation of a single gene can affect the expression of many other genes, both directly and indirectly. Consequently, an observed change in gene expression is the result of the combined effects on all of the regulatory genes that influence its transcription. Being able to determine whether an interaction is direct or indirect is a hurdle in deciphering causality in gene regulatory networks.
2.3 A Look at the Data from the Davidson Lab
The experimenters presented data relating to three types of perturbations:
Morpholino-substituted antisense oligonucleotide (MASO) - the oligonucleotide binds to the complementary mRNA strand, thereby preventing translation of the gene product.
Messenger RNA overexpression (MOE) - involves amplification of gene products from the perturbed gene.
Engrailed repressor domain fusion (En) - the transcription factor is converted into a form in which it becomes the dominant repressor of all target genes.
The three techniques represent distinctly different methods for gene perturbation. However, we do not have enough details on them to determine whether there are any useful differences in the results. Therefore, no distinction between techniques was made, results having been taken as equivalent, and data for the same perturbation, but from different experimental techniques, were combined. The results for each perturbation experiment were reported as up to 7 individual values that relate to both replicate measurements of the same cDNA batch and separate experiments. These values were averaged to provide a single value for equivalent samples. Results recorded as Not Significant (NS) were treated as zero.
Figure 1. A sample of the data presented on the Davidson Lab website. This portion of the data relates to perturbation of multiple genes and the effect on the transcription factors GataC and GataE.
The original data used ±1.6 as the significance threshold. However, by treating non-significant samples as zero, time-averaged samples were reduced in value, so a lower threshold was needed. After analysis of the data, values that fell below ±0.75 were taken to indicate no significant interaction. Data are presented as a set of time slices that cover intervals in embryo development between 12 and 72 hours after fertilization. However, most data are for three time slices between 12 and 28 hr and the remaining information is very sparse. For the majority of the work, mean values for the first 4 ranges were combined to yield an average across these times. In addition to gene perturbation results, there is a table of genes that are not affected by perturbation during the first 24 hrs, and also footnotes that provide information about gene interactions, many highlighting possible indirect effects. This additional information was incorporated into the experimental data to yield a single value for the effect of one gene on another. Data were available for only around 12.8% (460 out of 3600) of the possible interactions. Some of the remainder may be filled in by future experimentation but, for the purpose of this analysis, these 'unknowns' were taken to indicate no interaction unless there were indications to the contrary.
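The preprocessing just described (averaging replicates, treating NS entries as zero, applying the ±0.75 cutoff, and mapping negative means to activation) might be sketched as follows. The function name and input format are our own assumptions, not the paper's code:

```python
NS_THRESHOLD = 0.75   # values in (-0.75, 0.75) are treated as no interaction

def interaction_sign(values):
    """Collapse replicate PCR cycle-number changes for one perturbation
    into a +1 / -1 / 0 matrix entry. 'NS' entries count as zero; a negative
    average (less starting mRNA) means the perturbed gene was an activator,
    and a positive average means it was an inhibitor."""
    nums = [0.0 if v == 'NS' else float(v) for v in values]
    mean = sum(nums) / len(nums)
    if abs(mean) < NS_THRESHOLD:
        return 0
    return 1 if mean < 0 else -1     # activation = +1, inhibition = -1
```

For example, replicates of [-2.0, -1.5, 'NS'] average to about -1.17, which passes the threshold and is recorded as an activation (+1).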
2.4 Gene Selection
The overall dataset contained 60 genes identified to regulate gene expression in sea urchin embryos. To simplify the system, a decision was made to concentrate on the endomesoderm, since the greatest quantity of data related to these cells. The remainder of the embryonic regions had considerably less experimental coverage. Twenty-one regulatory genes are active in the sea urchin endomesoderm during the chosen developmental stages and, of the 441 possible interactions, there are 162 data points, or 36.7% coverage. In addition to the 21 genes, the published endomesoderm regulatory network also includes complexes (e.g. Su(H)-N'', n-TCF) involving endomesoderm gene products. However, no data were presented that supported the formation of these complexes, nor were there any data for their action within the cell. Therefore, complexes were omitted from the analysis.
3 Algorithm for Network Analysis
3.1 The Flowchart
The algorithm used is based on exploring the state space of all possible gene networks (models) in a systematic, iterative fashion. The first step involves generating a model from a given set of components. The components for the gene network are:
An activation
An inhibition
No effect
These three relations between genes are represented as +1, -1 and 0 in a matrix of gene-to-gene interactions. The initial model generated represents a hypothesis that has to be tested and scored. The next step involves simulation. The model, which represents a set of regulatory connections between genes, can be simulated qualitatively. For example, suppose the network contains the relation: A activates B, which activates C. The experimental data are checked to see what experiments have been done. Assume that one of the experiments involved overexpressing A; then, according to our hypothesized model, an overactivation of A will result in an increase in B and C. The results of the simulation are tested against the actual data. As indicated below, the actual data will show that B increases and C decreases.
(Diagram: the hypothesized model alongside the actual connections, which are unknown to the computer.)
This comparison is then used to score the model. The model is then modified using a state-space search algorithm to create a new model. The process is followed iteratively until the score no longer improves. To avoid local minima, the modified models are randomly perturbed using an annealing method.
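The simulate-compare-score loop can be sketched on the A activates B activates C example. Sign propagation here is a simple breadth-first pass, and everything (function names, scoring as a match count) is our own illustration rather than the paper's implementation:

```python
def predicted_effects(model, source, n):
    """Qualitatively simulate overexpression of `source` on a hypothesized
    sign matrix (model[i][j] = effect of gene i on gene j: +1, -1 or 0):
    propagate signs along directed paths, breadth-first."""
    effect = {source: +1}
    frontier = [source]
    while frontier:
        nxt = []
        for i in frontier:
            for j in range(n):
                if model[i][j] != 0 and j not in effect:
                    effect[j] = effect[i] * model[i][j]
                    nxt.append(j)
        frontier = nxt
    del effect[source]          # report only downstream effects
    return effect

def score(model, experiments, n):
    """Count agreements between simulated and observed effect signs."""
    s = 0
    for source, observed in experiments:
        pred = predicted_effects(model, source, n)
        s += sum(1 for j, sign in observed.items() if pred.get(j, 0) == sign)
    return s
```

For the hypothesized model A activates B activates C, overexpressing A predicts both B and C up; against observed data with B up and C down, the model scores 1 out of 2 agreements, and the search would then modify it.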
(Flowchart: generate a model (hypothesis); refine the model against footnotes and experimental data; compare results; score the model; output the result.)
Figure 2. Molecular Epistemics Algorithm of Conjecture and Refutation.
3.2 Handling non-numerical biological knowledge outside the raw data
The process of scientific discovery involves experimentation, but interpretation of the results involves bringing to bear one's prior knowledge of the underlying biology. Our approach allows outside literature, footnotes and personal knowledge to be added to the model before it runs. This is achieved in two ways. The first approach is to incorporate externally known regulatory knowledge into the input data prior to running the algorithm. Another approach involves incorporating the known prior knowledge into the initial model. The idea here is to make some of the gene-to-gene connections 'fixed', or pre-set, before the model generation process is started. If this cannot be done for all the knowledge, it can be incorporated into the scoring algorithm [1].
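One way to pre-set known connections in the initial model can be sketched as below; the function and its locking convention are hypothetical, since the paper gives no code for this step:

```python
def apply_prior_knowledge(M, fixed):
    """Write externally known interactions into the initial model and
    return the set of locked entries that the search must not modify.
    `fixed` maps (regulator, target) index pairs to +1, -1 or 0."""
    locked = set()
    for (i, j), value in fixed.items():
        M[i][j] = value
        locked.add((i, j))
    return M, locked

# Example: the literature says gene 0 activates gene 2.
model = [[0] * 3 for _ in range(3)]
model, locked = apply_prior_knowledge(model, {(0, 2): 1})
```

A search procedure would then skip perturbing any entry in `locked`, so the fixed links survive every iteration.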
4 Endomesoderm Gene Regulatory Networks
4.1 Representation of the Regulatory Networks
Networks generated by the algorithm were displayed graphically using Netbuilder, a tool for constructing computational models developed by the Science and Technology Research Centre, University of Hertfordshire, UK. This tool was also used by the Davidson Lab team to display their network results. The colors and overall network layout presented here were chosen to closely resemble those used in the Davidson paper, making for easier comparison.
4.2 The Complete Regulatory Network
By a straight substitution of the data, with values greater than or equal to the threshold taken to mean activation or inhibition depending on sign, and all other values taken to signify no connection, a simple representation of the entire network of connections was obtained (Figure 3). This interpretation takes into account the additional information provided in the footnotes to the data (incorporated into the values), but performs no further interpretation or analysis of the data. The generated network comprises 56 links between the genes, of which 45 are activations and 11 inhibitions.
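The straight substitution described above amounts to a sign/threshold map over the data matrix. A minimal sketch, where the names and the `None` convention for missing experiments are our assumptions:

```python
def threshold_network(data, threshold):
    """Map measured effects to a signed connection matrix: a value whose
    magnitude is at or above the threshold becomes +1 (activation) or
    -1 (inhibition) according to its sign; anything else, including
    missing experiments (None), means no connection."""
    n = len(data)
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            v = data[i][j]
            if v is not None and abs(v) >= threshold:
                M[i][j] = 1 if v > 0 else -1
    return M
```

For instance, `threshold_network([[0.2, 2.5], [-1.8, None]], 1.0)` yields `[[0, 1], [-1, 0]]`: one activation, one inhibition, and no link where the signal is weak or the experiment is missing.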
Figure 3: Automatically generated Endomesoderm gene regulatory network that directly reflects the raw data.
The complete network generated directly from the data is similar to the endomesoderm network published by Davidson; however, there are some notable differences which may not be related directly to interpretation of the information. Firstly, the data available on the website are constantly under review and are augmented as new results become available. The dataset used in this study was dated October 28th, 2002 and so was considerably newer than that used to construct the network for the article that appeared in the March 1 issue of Science [3]. Although the network displayed on the website is also being updated, it is changed less frequently than the data and may not reflect all the updates. Secondly, the Davidson Lab's network represents the regulatory network for the organism and includes many genes that are not active in the endomesoderm. These genes will have interactions with the 21 genes under study which may have effects that are not apparent when the endomesoderm is viewed in isolation. Nevertheless, there are still discrepancies. Some links are present on the published network even though the dataset indicates they should not be there. For instance, there are data to suggest an activation link between bra and nrl; however, a footnote states that this must be an indirect link, since bra is not active in the cell at this time. The data used for this work took all of the footnotes into account and so do not show this link, whereas the published network includes it. On the other hand, there are data to support an activation link between eve and four other genes, yet the published network shows only a single effect. Thus, while these networks and the Davidson Lab's published networks show similar information, they show some differences which are, at least partly, due to differences in the source data.
4.3 Network reduction
The scoring mechanism in the underlying algorithm was modified to give a low score to links that can be explained by intermediate genes. This was done to remove indirect links, thereby generating a minimal network that explains the raw data faithfully. For instance, elk, Sox-1 and Notch all activate both GataC and gcm, and gcm activates GataC (Figure 3). Therefore, it is possible that the observed effects on GataC were really the result of an indirect effect through gcm. This suggests that the three links from elk, Sox-1 and Notch to GataC could be removed without contradicting information contained in the data. By eliminating the maximum number of links without breaking any of the connections between genes or making a link with too many intermediates, it was possible to remove 13 links from the network (all activations) and reduce the total number of links from 56 to 43 (Figure 4). In separate runs of the algorithm it was possible to get slightly different sets of links removed, but the minimum number of links necessary to explain all of the data was still 43. The algorithm was also run in a configuration that permitted the removal of links that can be explained through pathways of up to 2 intermediate genes. In this way 3 extra edges could be removed; however, the more intermediates a path has, the harder it is to justify that a removed link is truly indirect and that the observed effect is still explained.
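The single-intermediate test can be sketched as follows. This just enumerates removable candidates rather than reproducing the authors' modified scoring mechanism:

```python
def reducible_links(M):
    """Return (i, j, k) triples where the direct link i -> j could be
    explained by a two-step path i -> k -> j whose combined sign
    matches, making i -> j a candidate for removal."""
    n = len(M)
    candidates = []
    for i in range(n):
        for j in range(n):
            if i == j or M[i][j] == 0:
                continue
            for k in range(n):
                if k != i and k != j and M[i][k] * M[k][j] == M[i][j]:
                    candidates.append((i, j, k))
                    break
    return candidates

# elk (0) activates both gcm (1) and GataC (2); gcm activates GataC,
# so the direct elk -> GataC link is explainable through gcm.
M = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
```

On this toy matrix, `reducible_links(M)` returns `[(0, 2, 1)]`: only the elk-to-GataC link has a sign-consistent one-intermediate explanation, mirroring the example in the text.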
Figure 4: Automatically generated minimal Endomesoderm network with links removed where a connection is already present through a single intermediate node. On the complete network, genes highlighted in rectangular boxes have links to both GataC and gcm (ellipses). In the minimal network, their actions on GataC are all through gcm.
4.4 Networks from separate stages of embryo development
Data for the 21 endomesoderm genes at each time period were rendered into a separate network to compare expression profiles at each time. This yielded a set of networks that contained 15 (12-16 hr), 30 (18-21 hr), 45 (24-28 hr), 6 (32-36 hr), 2 (40-48 hr) and 0 (60-72 hr) links. Although gene expression does change through the development stages, it is unlikely that these results represent an accurate picture of the regulatory system; rather, they are an indication that the dataset is incomplete. Thus, without additional data to indicate that genes operational at one period are turned off in another (there are some such data), it will be very difficult to draw any conclusions from these observations.
5 Next steps
5.1 Probabilistic assignment of effects
The approach taken for this study relied on definitive assignment of a link (or no link) between two genes based on the data. The output from the algorithm is trinary and, therefore, relies heavily on the thresholding function to define whether a gene is activated or inhibited. There is no indication of the certainty of these predictions, and this all-or-nothing approach leads to the possibility that a small change in the threshold level can create or eliminate links. The idea here is to generate networks with links of varying levels of confidence. This may be done in our platform by placing link values on a continuous scale, for example from -10 to +10. The output value is a measure of the certainty with which the algorithm can predict the presence of a link. For instance, a value of +10 would mean an activation relationship with absolute certainty, and likewise -10 a certain inhibition. A value closer to zero is less certain. A threshold function will still be required to apply the cut-off that defines no link. Nevertheless, a value just exceeding the threshold will be labeled as uncertain, rather than all links having equal validity.
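The proposed continuous scale could be read off as below. The cut-off values and the 'certain'/'uncertain' labels are illustrative assumptions, with the sign convention following the +1/-1 coding used earlier:

```python
def classify_link(value, no_link_cutoff=3.0, certain_cutoff=8.0):
    """Interpret a link score on a continuous [-10, +10] scale: small
    magnitudes mean no link; larger ones are activations (positive) or
    inhibitions (negative), labeled by how far they clear the cut-off."""
    if abs(value) < no_link_cutoff:
        return ("none", None)
    kind = "activation" if value > 0 else "inhibition"
    certainty = "certain" if abs(value) >= certain_cutoff else "uncertain"
    return (kind, certainty)
```

A score of 9.5 would thus be a certain activation, -3.2 an uncertain inhibition, and 1.0 no link at all, so values just past the threshold are flagged rather than treated as equally valid.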
5.2 Incorporation of auxiliary information
A mechanism for incorporating external auxiliary knowledge of biology is needed. An example of where auxiliary information could be used is the action of Otx on wnt8. The data indicate that this should be a straightforward inhibition. However, the published network indicates that Otx activates an intermediate gene labeled 'Rep. of wnt8' [Repressor] and that this gene inhibits wnt8. There is no footnote with the data that could indicate why the link was drawn like this, yet evidence can be found in another publication by the group at the Davidson Laboratory [4]. This paper reported that introduction of an obligate repressor of Otx target genes resulted in a many-fold increase in the transcripts of wnt8. Thus, this information shows that the action of Otx on wnt8 is a two- (or more) step process. This knowledge could have been incorporated into the algorithm to improve the accuracy of the output. A future development of the module would, therefore, utilize the auxiliary information known about interactions and incorporate it into the decision to include a link or not. Thus, additional knowledge could be used to strengthen the case for one particular configuration of the network over another.
6 Discussion
Automated generation of biopathways can help generate large, complex gene regulatory networks that can then be minimized to best explain the raw data. These methods can incorporate knowledge gleaned from the literature, footnotes and other sources. This makes the approach closer to how a human would work: bringing knowledge and prior experience to bear when interpreting results from experiments.
Acknowledgements
We would like to thank our scientific advisors Atul Butte and Trey Ideker for their input and direction in selecting the data set and developing the approach.
References
1. Chandra et al. "Epistemics Engine", U.S. Patent application (Nov 2002)
2. Davidson Laboratory Website. http://its.caltech.edu/~mirsky/awr.html
3. Davidson et al. A genomic regulatory network for development. Science 295, 1669-1678 (2002)
4. Davidson et al. A provisional regulatory gene network for specification of endomesoderm in the sea urchin embryo. Developmental Biology 246, 162-190 (2002)
5. Kohane IS, Kho A, Butte AJ. Microarrays for an Integrative Genomics, MIT Press (2002)
6. Koza et al. Reverse engineering of metabolic pathways from observed data using genetic programming. Pacific Symposium on Biocomputing 6, 434-445 (2000)
7. Koza et al. Reverse engineering and automatic synthesis of metabolic pathways from observed data using genetic programming. Stanford University Technical report SMI-2000-0851 (2000)
8. Ideker TE, Thorsson V, Karp RM. Discovery of regulatory interactions through perturbation: inference and experimental design. Pacific Symposium on Biocomputing 5, 302-313 (2000)
9. Wessels LFA, Van Someren EP, Reinders MJT. A comparison of genetic network models. Pacific Symposium on Biocomputing 6, 508-519 (2001)
10. Maki Y et al. Development of a system for the inference of large scale genetic networks. Pacific Symposium on Biocomputing 6, 446-458 (2000)
11. Smith VA, Jarvis ED, Hartemink AJ. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics 18(Suppl. 1), S216-S224 (2002)
12. Liang S, Fuhrman S, Somogyi R. Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pacific Symposium on Biocomputing 3, 18-29 (1998)
13. Imoto S, Goto T, Miyano S. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing 7, 175-186 (2002)
14. Chrisman et al. Incorporating biological knowledge into evaluation of causal regulatory hypotheses. Pacific Symposium on Biocomputing 8, in press (2003)
15. Akutsu T, Miyano S, Kuhara S. Algorithms for inferring qualitative models of biological networks. Pacific Symposium on Biocomputing 5, 290-301 (2000)
16. Hartemink AJ et al. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing 7, 437-449 (2002)
17. Wimberly FC, Glymour C, Ramsey J. Experiments on the accuracy of algorithms for inferring the structure of genetic regulatory networks from associations of gene expressions, I: algorithms using binary variables. Submitted to the Journal of Machine Learning Research (2002)
A BIOSPI MODEL OF LYMPHOCYTE-ENDOTHELIAL INTERACTIONS IN INFLAMED BRAIN VENULES
P. LECCA AND C. PRIAMI
Dipartimento di Informatica e Telecomunicazioni, Università di Trento
{lecca,priami}@science.unitn.it
C. LAUDANNA AND G. CONSTANTIN
Dipartimento di Patologia, Università di Verona
{carlo.laudanna,gabriela.constantin}@univr.it
This paper presents a stochastic model of lymphocyte recruitment in inflamed brain microvessels. The framework used is based on stochastic process algebras for mobile systems. The automatic tool used in the simulation is BioSpi. We compare our approach with classical hydrodynamical specifications.
1 Introduction
Lymphocytes roll along the walls of vessels to survey the endothelial surface for chemotactic signals, which stimulate the lymphocyte to stop rolling and migrate through the endothelium and its supporting basement membrane. Lymphocyte adhesion to the endothelial wall is mediated by binding between cell surface receptors and complementary ligands expressed by the endothelium. The dynamics of adhesion are regulated by the bond association and dissociation rates: different values of these rates give rise to different dynamical behaviors of cell adhesion. The most common approach to the simulation of the rolling process of lymphocytes is based on hydrodynamical models of the particle motion under normal or stressed flow. At a macroscopic scale, the process is generally modeled with the typical equations of mass continuity, momentum transport and interfacial dynamics. At a microscopic scale, the cell rolling is simulated as a sequence of elastic jumps on the endothelial surface that result from sequential breaking and formation of molecular bonds between ligands and receptors. This kind of model is able to simulate the time-evolution of bond density. A major challenge for a mechanical approach is to treat the disparate scales between the cell (typically of the order of micrometers) and the bonds (of the order of nanometers). In fact, rolling involves both dynamical interaction between the cell and the surrounding fluid and microscopic elastic deformations of the bonds with the substrate cells. Moreover, recent studies have revealed
that the process leading to lymphocyte extravasation is a sequence of dynamical states (contact with endothelium, rolling and firm adhesion), mediated by partially overlapping interactions of different adhesion molecules and activation factors. The classical mechanical models are inefficient tools to describe the concurrency of the molecular interactions; even if they treat the physical system at the scale of intermolecular bonds with appreciable detail, they are not able to reproduce the sensitivity to small perturbations in the reagent concentrations or in reaction rates typical of microscopic stochastic systems governed by complex and concurrent contributions of many different molecular reactions. The probabilistic nature of a biological system at the molecular scale requires new languages able to describe and predict the fluctuations in the population levels. We rely on a stochastic extension 21,22 of the π-calculus 17, a calculus of mobile processes based on the notion of naming. The basic idea of this biochemical stochastic π-calculus is to model a system as a set of concurrent processes selected according to a suitable probability distribution in order to quantitatively accommodate the rates and the times at which the reactions occur. We use this framework to model and simulate the molecular mechanism involved in encephalitogenic lymphocyte recruitment in inflamed brain microvessels. Our development can also be interpreted as a comparison between the most common modeling methods, based on hydrodynamical and mechanical studies, and the π-calculus representation, in order to point out the ability of this new tool to perform a stochastic simulation of chemical interactions that is highly sensitive to small perturbations. We also present data obtained from BioSpi simulations.
2 Molecular mechanism of autoreactive lymphocyte recruitment in brain venules
A critical event in the pathogenesis of multiple sclerosis, an autoimmune disease of the central nervous system, is the migration of lymphocytes from the brain vessels into the brain parenchyma. The extravasation of lymphocytes is mediated by highly specialized groups of cell adhesion molecules and activation factors. The process leading to lymphocyte migration, illustrated in Fig. 1, is divided into four main kinetic phases: 1) initial contact with the endothelial membrane (tethering) and rolling along the vessel wall; 2) activation of a G-protein, induced by a chemokine exposed by the inflamed endothelium, and subsequent activation of integrins; 3) firm arrest; and 4) crossing of the endothelium (diapedesis). For this study, we have used a model of
early inflammation in which brain venules express E- and P-selectin, ICAM-1 and VCAM-1 20. The leukocyte is represented by encephalitogenic CD4+ T lymphocytes specific for PLP139-151, cells that are able to induce experimental autoimmune encephalomyelitis, the animal model of multiple sclerosis. Tethering and rolling are mediated by binding between cell surface receptors and complementary ligands expressed on the surface of the endothelium. The principal adhesion molecules involved in these phases are the selectins: the P-selectin glycoprotein ligand-1 (PSGL-1) on the autoreactive lymphocytes and the E- and P-selectin on the endothelial cells. The action of integrins partially overlaps that of the selectins/mucins: a4 integrins and LFA-1 are also involved in the rolling phase, but they have a less relevant role. Chemokines have been shown to trigger rapid integrin-dependent lymphocyte adhesion in vivo through a receptor coupled with Gi proteins. Integrin-dependent firm arrest in the brain microcirculation is blocked by pertussis toxin (PTX), a molecule able to ADP-ribosylate Gi proteins and block their function. Thus, as previously shown in studies on naïve lymphocytes homing to Peyer's patches and lymph nodes, encephalitogenic lymphocytes also require in situ activation by an adhesion-triggering agonist which exerts its effect via a Gi-coupled surface receptor. The firm adhesion/arrest is mediated by lymphocyte integrins and their ligands from the immunoglobulin superfamily expressed by the endothelium. The main adhesion molecule involved in cell arrest is the integrin LFA-1 on the lymphocyte and its counterligand ICAM-1 on the endothelium. The action of a4 integrins partially overlaps that of LFA-1: a4 integrins are involved in the arrest but have a less relevant role 20.
Figure 1. The process leading to lymphocyte extravasation is a finely regulated sequence of steps controlled by both adhesion molecules and activating factors.
3 Kinetics models of cell adhesion
In this section we first describe the micro-scale model of cell adhesion proposed by Dembo et al. 6, which computes the time-evolution of the bond density between ligands and receptors during the rolling phase. Second, we briefly report the recent results of the computational method called Adhesive Dynamics, developed by Chang et al. and based on the Bell model 1, which expresses the dissociation rate as a function of the total force applied on the lymphocyte and simulates the adhesion of a cell to a surface under flow. Here the relationship between ligand/receptor functional properties and the dynamics of adhesion is expressed in state diagrams, drawing the variation of the lymphocyte centroid position in time. We have considered these two models because they describe the two main aspects of cell motion: the molecular interaction at the scale of molecular bonds and the dynamics of the motion at the lymphocyte scale, in order to compare the two kinds of results with the π-calculus simulations. Dembo adhesion model. Rolling is a state of dynamic equilibrium in which there is rapid breaking of bonds at the trailing edge of the lymphocyte-endothelium contact zone, matched by rapid formation of new bonds at the leading edge. The process of lymphocyte rolling and adhesion under blood flow involves the balance of the forces arising from hydrodynamic effects, including shear and normal stresses, and the number and strength of the molecular bonds 7,12,23,24,25.
The kinetic reaction model proposed by Dembo et al. 6 simulates the rolling lymphocyte as a viscous Newtonian fluid enclosed in a pre-stressed elastic membrane, and the adhesion bonds formed between the rolling cell and its substrate are simulated as elastic springs perpendicular to the substrate. The parameters considered by this model are: N_l (ligand density) = N_r (receptor density) = 400 μm⁻², k_on (equilibrium association rate) = 84 s⁻¹, k_off (equilibrium dissociation rate) = 1 s⁻¹, σ (equilibrium spring constant) = 5 dyne/cm, σ_ts (transient bond elastic constant) = 4.5 dyne/cm, K_BT (thermal energy) = 3.8 × 10⁻⁷ ergs and λ (equilibrium bond length) = 20 nm. They are used to compute the bond density N_b, assuming the adhesion bond force F_b = N_b σ(λ - x) 16,18. The hyperbolic analytic solution for the time-evolution of the bond density N_b is plotted in Fig. 2.
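The hyperbolic time course can be illustrated with a generic first-order kinetic sketch using the Dembo parameter values. The rate law below is our simplification, not Dembo's exact equation:

```python
def bond_density(t, Nl=400.0, kon=84.0, koff=1.0, dt=1e-4):
    """Euler integration of dNb/dt = kon*(Nl - Nb) - koff*Nb: bonds form
    from unoccupied ligands and break at rate koff, so the density rises
    hyperbolically and saturates near Nl*kon/(kon + koff), about 395
    bonds per square micrometer with the Dembo values."""
    nb = 0.0
    for _ in range(int(t / dt)):
        nb += dt * (kon * (Nl - nb) - koff * nb)
    return nb
```

Sampling `bond_density` at increasing times reproduces the qualitative shape of Fig. 2: a steep initial rise followed by a plateau.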
Figure 2. Time-evolution of bond density.
external force f can be modeled as k_off = k_off^0 exp(sf / K_BT), where k_off^0 is the unstressed dissociation rate and K_BT is the thermal energy; s is a parameter with units of length that relates the reactivity of the molecule to the distance to the transition state in the intramolecular potential of mean force for single bonds. The Bell model parameters k_off^0 and s are functional properties of the molecules. Using the equation above to model the force dependence of dissociation, Chang et al. performed Adhesive Dynamics computer simulations to obtain the state diagrams of the lymphocyte motion. In the Adhesive Dynamics method 3,13,14, the simulation begins with a freely moving cell, modeled as a sphere with receptors distributed at random over its surface and given kinetic parameters. The cell is allowed to reach a steady translational velocity in the absence of specific interactions, after which receptor-mediated binding is initiated. The adhesion molecules involved and the uniformly reactive substrate react with association rate k_on and dissociation rate k_off. During each time step, bond formation and breakage are simulated by Monte Carlo methods, in which random numbers are compared with the probabilities for binding and unbinding to determine whether a bond will form or break in the time interval. The dynamics of motion involve the elastic bond force, given by Hooke's law, the colloidal force and the force imparted to the cell by the fluid shear. The motion of the lymphocyte is obtained from the mobility matrix for a sphere near a plane in a viscous fluid. The new positions of free receptors and tethers at t + dt are updated from their positions at t, using the translational and angular velocity of the cell. The process is repeated until the cell travels 0.1 cm, or 10 s of simulated time has elapsed. The Adhesive Dynamics simulation parameters are: R (cell radius) = 5 μm, λ (equilibrium bond length) = 20 nm, σ (spring constant) = 100 dyne/cm, μ (viscosity) = 0.01 g cm⁻¹ s⁻¹, T (temperature) = 310 K and γ_w (wall shear rate) = 100 s⁻¹.
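The Bell relation and the Monte Carlo breakage rule can be sketched directly. The parameter defaults below are illustrative assumptions in CGS units; the thermal energy k_B·T at 310 K is about 4.28e-14 erg:

```python
import math
import random

KBT = 4.28e-14  # thermal energy k_B * T at T = 310 K, in erg

def k_off_bell(f, k_off0=1.0, s=1e-7, kbt=KBT):
    """Bell model: dissociation rate of a bond under an applied force f
    (dyne), k_off = k_off0 * exp(s * f / kBT), where s (cm) relates the
    reactivity of the molecule to the transition-state distance."""
    return k_off0 * math.exp(s * f / kbt)

def bond_breaks(f, dt, rng=random.random, **bell_kw):
    """One Adhesive-Dynamics-style Monte Carlo step: the bond under
    force f breaks during dt with probability 1 - exp(-k_off * dt)."""
    p_break = 1.0 - math.exp(-k_off_bell(f, **bell_kw) * dt)
    return rng() < p_break
```

With f = 0 the rate reduces to the unstressed value k_off0; increasing the force raises the breakage probability exponentially, which is what produces the transition from rolling to detachment in the state diagrams.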
From different values of the rate constants in the Bell model (see caption of Fig. 3), different motion state diagrams emerge 16. Tethering, in which lymphocytes move at a translational velocity v < 0.5 v_h (where v_h is the hydrodynamic velocity of the blood flow) but exhibit no durable arrest, is shown in Fig. 3 (upper left). Rolling, for which cells travel at v < 0.5 v_h but experience durable arrests, is shown in Fig. 3 (upper right). Finally, in firm adhesion, shown in Fig. 3 (lower), cells bind to the endothelium and remain motionless.
Figure 3. Representative trajectory of lymphocyte tethering at a mean velocity v equal to one half of the hydrodynamic velocity v_h. The parameters are: γ = 0.001 nm, k_on = 84 s⁻¹, k_off^0 = 1 s⁻¹ (upper left). Representative trajectory of rolling motion of a lymphocyte, with a mean velocity v < 0.5 v_h, experiencing durable arrests (upper right). Representative trajectory of a lymphocyte in firm adhesion. The parameters are: γ = 0.001 nm, k_on = 84 s⁻¹, k_off^0 = 20 s⁻¹ (lower).
4 The BioSpi model implementation and results
We first recall the syntax and the intuitive semantics of the stochastic π-calculus 22. We then describe our specification of the lymphocyte recruitment process, and eventually we discuss the simulation results. Biomolecular processes are carried out by networks of interacting protein molecules, each composed of several distinct independent structural parts, called domains. The interaction between proteins causes biochemical modification of domains (e.g. covalent changes). These modifications affect the potential of the modified protein to interact with other proteins. Since protein interactions directly affect cell function, these modifications are the main mechanism underlying many cellular functions, making the stochastic π-calculus particularly suited for their modeling as mobile communicating systems. The syntax of the calculus follows:
P ::= 0 | (π, r).P | (νx)P | [x = y]P | P|P | P + P | A(y1, ..., yn)
where π may be either x(y) for input, or x̄⟨y⟩ for output (where x is the subject and y is the object), or τ for silent moves. The parameter r corresponds to the basal rate of a biochemical reaction and is the parameter of an exponential distribution associated with the channel occurring in π. The order of precedence among the operators is the order (from left to right) listed above. Processes model molecules and domains. Global channel names and co-names represent complementary domains, and newly declared private channels define complexes and cellular compartments. Communication and channel transmission model chemical interaction and subsequent modifications. The actual rate of a reaction between two proteins is determined according to an empirically determined constant basal rate and the concentrations or quantities of the reactants. Two different reactant molecules, P and Q, are involved, and the reaction rate is given by Brate × |P| × |Q|, where Brate is the reaction's basal rate, and |P| and |Q| are the concentrations of P and Q in the chemical solution, computed via the two auxiliary functions In_x and Out_x that inductively count the number of receive and send operations on a channel x enabled in a process. The semantics of the calculus thereby defines the dynamic behaviour of the modeled system driven by a race condition, yielding a probabilistic model of computation. All the activities enabled in a state compete and the fastest one succeeds. The continuity of exponential distributions ensures that the probability that two activities end simultaneously is zero. The reduction semantics of the biochemical stochastic π-calculus is:
((x̄⟨z⟩, r_b).Q + M') | ((x(y), r_b).P + N') → P{z/y} | Q, with rate r_b · r_0 · r_1
A reaction is implemented by the three parameters r_b, r_0 and r_1, where r_b represents the basal rate, and r_0 and r_1 denote the quantities of interacting molecules; they are computed compositionally by In_x and Out_x.
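The rate computation and the race condition can be sketched as a Gillespie-style selection. The function names below are our own, not BioSpi's API:

```python
import math
import random

def reaction_rate(brate, n_senders, n_receivers):
    """Actual rate of a bimolecular interaction on a channel: the basal
    rate times the counts of enabled send and receive operations,
    i.e. Brate * |P| * |Q|."""
    return brate * n_senders * n_receivers

def next_reaction(rates, rng=random.random):
    """Race condition: every enabled activity draws an exponential delay
    at its own rate; the fastest one fires. Returns (index, delay)."""
    enabled = [(i, r) for i, r in enumerate(rates) if r > 0]
    delays = [(-math.log(rng()) / r, i) for i, r in enabled]
    delay, winner = min(delays)
    return winner, delay
```

For example, a channel with basal rate 0.5 and 10 senders against 4 receivers yields an actual rate of 20.0; among several such rates, `next_reaction` picks the activity whose exponential delay is smallest.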
4.2 Specification
The system of interacting adhesion molecules that regulates lymphocyte recruitment on the endothelial surface, illustrated in Fig. 1, has been implemented in the biochemical stochastic π-calculus. The system is composed of eight concurrent processes, corresponding to the eight species of adhesion molecules that regulate cell rolling and arrest: PSGL1, PSELECTIN, CHEMOKIN, CHEMOREC, ALPHA4, VCAM1, LFA1 and ICAM1. The code implements the four phases of lymphocyte recruitment: the interaction between PSGL1 and PSELECTIN, the ALPHA4 and LFA1 activation by chemokines, and the firm arrest mainly caused by the interaction between the active form of LFA1, LFA1ACTIVE, and ICAM1, and in part also due to the interaction of the active form of ALPHA4, ALPHA4ACTIVE, with VCAM1. Its specification is given below. We simulated the role and the contribution of the different interactions as bimolecular binding processes occurring at different rates. The selectin interaction PSGL1/PSELECTIN plays a crucial role in guaranteeing efficient rolling; therefore the channel rates for the communication in the binding process between PSGL1 and PSELECTIN have been calculated from the deterministic rates of the Bell model that reproduce the tethering and rolling motion. Analogously, for the ALPHA4ACTIVE/VCAM1 interaction, which contributes to rolling and, in part, also to cell arrest, the channel rates have been calculated from the Bell model rates that recreate the rolling motion. The interaction LFA1ACTIVE/ICAM1 is mainly responsible for the firm arrest of the cell on the endothelium, and thus the rates of communication between LFA1ACTIVE and ICAM1 have been calculated from those reproducing firm adhesion in the Bell model simulations. The activation of ALPHA4 and LFA1 integrins by the chemokines is implemented in two steps: first a chemokine CHEMOKIN binds to its receptor CHEMOREC and changes to a "bound" state CHEMOKINBOUND.
Then the complex CHEMOKINBOUND sends two names sign1 and sign2 on the channels alpha_act and lfa_act, on which the processes ALPHA4 and LFA1 are ready to receive them as inputs. After ALPHA4 and LFA1 have received the signals from CHEMOKINBOUND, they change to the active forms ALPHA4ACTIVE and LFA1ACTIVE. The whole process of lymphocyte recruitment occurs in a space of V = 1.96 × 10⁵ μm³, corresponding to the volume of a vessel of 25 μm radius and 100 μm length, and in a simulated time of 15 s. In the considered volume V, the number of molecules is of the order of 10⁶. In our simulations the values
529 S Y S T E M ::= PSGLlIPSELECTINICHEMOKINICHEMORECIALPHAl IVCAMlILFAIIICAMl P S G L l ::= ( u b a c k b o n e ) B I N D I N G P S I T E l B I N D I N G P S I T E ::= (@(backbone), RA).PSGLlBOUND(backbone) P S G L l B O U N D ( b b ) ::= (bb,RDo).PSGLl P S E L E C T I N ::= (bind(cross-backbone),R A ) . P S E LECT I N B O U ND(crossbackbone) P S E L E C T I N B O U N D ( c b b ) ::= RDo).PSELECTIN C H E M O K I N ::= (u chemobb)B I N D I N G - C S I T E B I N D I N G C S I T E ::= (G(chemobb),RA-C).CHEMOCHIN-BOUND(chemobb) C H E M O C H I N B O U N D ( c h e m o b b ) ::= ACTlIACT2IACT3(cbb) ACT1 ::= (alpha-act ( s i g n l ) ,A).ACTI ACT2 ::= (lfa-a&sign2), A).ACT2 ACT3(chb) ::= (chb,R D _ C ) . C H E M O K I N C H E M O R E C ::= (lig(crossxhemobb),R A E ). C H E M O R E C B O U N D ( cross-chemobb) C H E M O R E C B O U N D ( c c r ) ::= (ccr,A ) . C H E M O R E C A L P H A 4 ::= (alphaact(act-a),A ) . A L P H A 4 A C T I V E L F A l ::= (If a-act (act-1),A).L F A l A C T I V E A L P H A 4 A C T I V E ::= ( u backbone2)BINDINGASITE B I N D I N G A S I T E ::= (binda(backbme2),RA).ALPHA4BOUND(backbone2) A L PH A 4 B O U N D( bb2) ::= RD1 ).ALP HA4 V C A M 1 ::= (bind2(cross-back~one2), R A ).V C AM I B O U N D( cross backbone2) V C A M l B O U N D ( c b b 2 )::= (cbba,RDl).VCAMl L F A l A C T I V E ::= ( u backbone3)BINDINGYITE3 B I N D I N G S I T E 3 ::= (bind3(backbwne3),RA).LFAIBOUND(backbone3) L F A l B O U N D ( b b 3 ) ::= (bb3,R D 2 ) . L F A l B O U N D I C AM 1 ::= (bind3(cross-backbae3),R A ).I C A M 1B O U N D ( crossbackbone3) I C A M l B O U N D ( c b b 3 ) ::=(cbb3,R D 2 ) . I C A M l B O U N D
RA = 6.500
RA_C =
RD0 = 0.051
RD1 = 5.100
RD2 = 1.000
RD_C = 3.800
A = infinite
Radius of vessel = 25 micrometers
Length of vessel = 100 micrometers
Volume of vessel = 1.96 × 10^5 cubic micrometers
Radius of lymphocyte = 5 micrometers
of the volume and of the number of molecules have been proportionally re-scaled by this factor, to make the code computationally faster. The stochastic reaction rates for a bimolecular binding/unbinding reaction are inversely proportional to the volume of the space in which the reactions occur [10]; in particular, for the stochastic association rate we have RA = kon/V and for the stochastic dissociation rate we have RD = 2koff/V, where the k's are the deterministic rates. The output of the simulation is the time-evolution of the number of bonds (shown in Fig. 4), assuming the following densities expressed in μm^-2: PSGL-1 [19] and P-SELECTIN 5600, ALPHA4 and VCAM-1 85, CHEMOREC and CHEMOKINES 15000, LFA-1 [11] and ICAM-1 5500. The characterization of the steps and the adhesion molecules implicated in lymphocyte recruitment in brain venules was performed by intravital microscopy, a powerful technique allowing the visualization and analysis of the adhesive interactions directly through the skull in live animals.
[Figure 4: number of bonds vs. time (sec) for the PSGL-1/P-SELECTIN, ALPHA4/VCAM-1 and CHEMOKINES/RECEPTORS interactions.]
Figure 4. BioSpi simulation of the 4-phase model of lymphocyte recruitment.
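The deterministic-to-stochastic rate conversion used in the simulations (RA = kon/V, RD = 2koff/V, as stated in the text) can be sketched as follows; the kon, koff and rescaling-factor values here are hypothetical, not taken from the paper.

```python
# Sketch of the rate conversion described in the text: RA = kon / V and
# RD = 2 * koff / V. The kon, koff and scale factor f below are invented
# for illustration only.
def stochastic_rates(kon, koff, volume):
    """Return (RA, RD) for deterministic rates kon, koff in volume V."""
    return kon / volume, 2.0 * koff / volume

V = 1.96e5  # vessel volume in cubic micrometers, as given in the text

ra, rd = stochastic_rates(kon=1.3e6, koff=5.0, volume=V)

# Rescaling the volume by a factor f scales both stochastic rates by 1/f;
# the text rescales the volume and the molecule numbers proportionally to
# make the simulation computationally cheaper.
f = 0.01
ra_s, rd_s = stochastic_rates(kon=1.3e6, koff=5.0, volume=V * f)
assert abs(ra_s * f - ra) < 1e-9 and abs(rd_s * f - rd) < 1e-9
```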
The BioSpi simulations reproduce the hyperbolic behavior predicted by the Dembo model. However, unlike the Dembo model, the BioSpi model is more sensitive to variations of the dissociation rate constant koff. Moreover, the plots in Fig. 4 show the relevant roles played by the PSGL-1/P-Selectin and LFA-1/ICAM-1 interactions. The curve describing the time-evolution of the number of bonds of the LFA-1/ICAM-1 interaction presents an approximately linear steep increase (with an angular coefficient of the order of 10^3) followed by a clearly constant behavior: this curve represents the firm adhesion of the lymphocyte and is comparable with the state diagram of the Bell model in Fig. 3. In fact, firm arrest is reached when the number of bonds becomes stably constant in time or, analogously, when the position of the cell centroid no longer changes. On the contrary, the plots representing the PSGL-1/P-SELECTIN and ALPHA4/VCAM-1 interactions present, after a steep increase with about the same slope as that of the LFA-1/ICAM-1 binding, an oscillating behavior with respect to the equilibrium positions given by y = 80 and y = 1, respectively. This behavior represents the sequential breaking and formation of bonds in selectin and integrin binding during rolling (see Fig. 3 for comparison). The results obtained in this work assert that the formal description provided by the BioSpi model represents in a concise and expressive way the basic physics governing the process of lymphocyte recruitment.
More generally, physics describes both microscopic and macroscopic interactions between bodies by means of the concept of force, which expresses the action of the field generated by a particle (or a set of particles) on the other bodies of the system. The BioSpi representation captures this view, which is the central paradigm of the physical description of nature, and summarizes it in the new concept of communication exchange (name passing). Moreover, the rates of communication in the stochastic π-calculus encode the whole dynamics of the system, because they contain the quantitative information about the intensity of the forces transmitted between the particles. Finally, the main advantage of the BioSpi model is that the π-calculus permits a deeper investigation of dynamics and of molecular and biochemical details. It has a solid theoretical basis and linguistic structure, unlike other approaches [5].
Conclusion
The usage of new languages such as the stochastic π-calculus to describe and simulate the migration of autoreactive lymphocytes in the target organ will help us better understand the complex dynamics of lymphocyte recruitment during autoimmune inflammation in live animals. Furthermore, our approach may represent an important step toward future predictive studies on lymphocyte behavior in inflamed brain venules. The stochastic π-calculus may, thus, open new perspectives for the simulation of key phenomena in the pathogenesis of autoimmune diseases, implicating not only better knowledge, but also better future control of the autoimmune attack.
References
1. Bell G. I., Science 200, 618-627, 1978.
2. The BioSpi project web site: http://www.wisdom.weizmann.ac.il/~aviv
3. Chang K.-C., Tees D. F. J. and Hammer D. A., The state diagram for cell adhesion under flow: leukocyte adhesion and rolling, Proc. Natl. Acad. Sci. USA, 10.1073/pnas.200240897, 2000.
4. Chigaev A., Blenc A. M., Braaten J. V., Kumaraswamy N., Kepley C. L., Andrews R. P., Oliver J. M., Edwards B. S., Prossnitz E. R., Larson R. S. and Sklar L. A., Real time analysis of the affinity regulation of alpha 4-integrin. The physiologically activated receptor is intermediate in affinity between resting and Mn(2+) or antibody activation, J. Biol. Chem. 2001 Dec 28;276(52):48670-8.
5. Curti M., Degano P. and Baldari C. T., Causal π-calculus for biochemical modelling, Computational Methods in Systems Biology, CMSB 2003, Springer.
6. Dembo M., Torney D. C., Saxman K. and Hammer D., The reaction-limited kinetics of membrane-to-surface adhesion and detachment, Proc. R. Soc. Lond. B, Vol. 234, pp. 55-83, 1988.
7. Dong C., Cao J., Struble E. J. and Lipowsky H., Mechanics of leukocyte deformation and adhesion to endothelium in shear flow, Annals of Biomedical Engineering, Vol. 27, pp. 298-312, 1999.
8. Evans E. and Ritchie K., Biophys. J., Vol. 72, 1541-1555, 1997.
9. Fritz J., Katopodis A. G., Kolbinger F. and Anselmetti D., Force-mediated kinetics of single P-selectin/ligand complexes observed by atomic force microscopy, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 12283-12288, 1998.
10. Gillespie D. T., Exact stochastic simulation of coupled chemical reactions, Journal of Physical Chemistry, 81(25):2340-2361, 1977.
11. Goebel M. U. and Mills P. J., Acute psychological stress and exercise and changes in peripheral leukocyte adhesion molecule expression and density, Psychosom. Med. 2000 Sep-Oct;62(5):664-670.
12. Goldman A. J., Cox R. G. and Brenner H., Slow viscous motion of a sphere parallel to a plane wall: Couette flow, Chem. Eng. Sci., 22:653-660, 1967.
13. Hammer D. A. and Apte S. M., Biophys. J. 63, 35-57, 1992.
14. Kuo S. C., Hammer D. A. and Lauffenburger D. A., Biophys. J. 73, 517-531, 1996.
15. Laudanna C., Kim J. Y., Constantin G. and Butcher E., Rapid leukocyte integrin activation by chemokines, Immunological Reviews, Vol. 186:37-46, 2002.
16. Lei X. and Dong C., Cell deformation and adhesion kinetics in leukocyte rolling, BED-Vol. 50, Bioengineering Conference, ASME 1999 (available at http://asme.pinetec.com/biol999/data/pdfs/a0081514.pdf).
17. Milner R., Communicating and Mobile Systems: the π-calculus, Cambridge University Press, 1999.
18. N'dri N., Shyy W., Udaykumar H. S. and Tran-Son-Tay R., Computational modeling of cell adhesion and movement using continuum-kinetics approach, BED-Vol. 50, Bioengineering Conference, ASME 2001 (available at http://asme.pinetec.com/bio2001/data/pdfs/aOO12976.pdf).
19. Norman K. E., Katopodis A. G., Thoma G., Kolbinger F., Hicks A. E., Cotter M. J., Pockley A. G. and Hellewell P. G., P-selectin glycoprotein ligand-1 supports rolling on E- and P-selectin in vivo, Blood 2000 Nov 15;96(10):3585-3591.
20. Piccio L., Rossi B., Scarpini E., Laudanna C., Giagulli C., Issekutz A. C., Vestweber D., Butcher E. C. and Constantin G., Molecular mechanisms involved in lymphocyte recruitment in inflamed brain microvessels: critical roles for P-selectin glycoprotein ligand-1 and heterotrimeric Gi-linked receptors, The Journal of Immunology, 2002.
21. Priami C., Stochastic π-calculus, The Computer Journal, 38, 6, 578-589, 1995.
22. Priami C., Regev A., Shapiro E. and Silverman W., Application of a stochastic name-passing calculus to representation and simulation of molecular processes, Information Processing Letters, 80, 25-31, 2001.
23. Schmidtke D. W. and Diamond S. L., Direct observation of membrane tethers formed during neutrophil attachment to platelets or P-selectin under physiological flow, The Journal of Cell Biology, Vol. 149, Number 3, 2000.
24. Udaykumar H. S., Kan H.-C., Shyy W. and Tran-Son-Tay R., Multiphase dynamics in arbitrary geometries on fixed Cartesian grids, J. Comp. Phys., Vol. 137, pp. 366-405, 1997.
25. Zhu C., Bao G. and Wang N., Cell mechanics: mechanical response, cell adhesion and molecular deformation, Annual Review of Biomedical Engineering 2:189-226.
MODELING CELLULAR PROCESSES WITH VARIATIONAL BAYESIAN COOPERATIVE VECTOR QUANTIZER

X. LU(1,2,4), M. HAUSKRECHT(2) and R. S. DAY(3)
(1) Center for Biomedical Informatics, (2) Dept. of Computer Science, (3) Dept. of Biostatistics, University of Pittsburgh; (4) Dept. of Biometry and Epidemiology, Medical University of South Carolina
email: [email protected]*, [email protected], [email protected]
Abstract
Gene expression of a cell is controlled by sophisticated cellular processes. The capability of inferring the states of these cellular processes would provide insight into the mechanisms of the gene expression control system. In this paper, we propose and investigate the cooperative vector quantizer (CVQ) model for the analysis of microarray data. The CVQ model is capable of decomposing observed microarray data into many different regulatory subprocesses. To make the CVQ analysis tractable we develop and apply variational approximations. Bayesian model selection is employed in the model, so that the optimal number of processes is determined purely from observed microarray data. We test the model and algorithms on two datasets: (1) simulated gene-expression data and (2) real-world yeast cell-cycle microarray data. The results illustrate the ability of the CVQ approach to recover and characterize regulatory gene expression subprocesses, indicating a potential for advanced gene expression data analysis.
1 Introduction

Current DNA microarray technology allows scientists to monitor gene expression at the genome level. Although microarray data are not direct measurements of the activity of cellular processes (or signal transduction pathways), they provide opportunities to infer the states of the cellular processes and study the mechanism of gene expression control at the system level. When a cell is subjected to different conditions, the states of the processes controlling gene expression change accordingly and result in different gene expression patterns. One important task for systems biologists is to identify the cellular processes controlling gene expression and infer their states under a specific condition based on observed expression patterns. Different approaches have been applied in order to identify the cellular processes by decomposing (deconvoluting) the observed microarray data into different components. For example, singular value decomposition (SVD) [1], principal components analysis (PCA) [2], independent component analysis (ICA) [3,4], Bayesian decomposition [5] and probabilistic
*To whom correspondence should be addressed.
relational modeling (PRM) [6] have been used to decompose observed microarray data into different processes. The problem of identifying hidden regulatory processes in a cell can be formulated as a blind source separation problem, where distinct regulatory processes, which we would like to identify and characterize, are modeled as hidden sources.^b The task is to identify the source signals purely based on observed data. An additional challenge is that the separation process must be performed fully unsupervised - the number of sources is not known in advance. To facilitate biological interpretation, the originating signals of the processes in a system should be identified uniquely. Some of the aforementioned models, such as SVD and PCA, restrict the components to be orthonormal, thus they are not suitable for blind source separation. Independent component analysis (ICA), independent factor analysis (IFA) and various vector quantization models [7,8,9,10] are among the models used for blind source separation. In this work we develop an inference algorithm for one such model - the cooperative vector quantizer (CVQ) model. The main advantage of the CVQ model over other blind source separation models is that it mimics the switching-state nature of the regulatory processes; consequently, the results of the analysis can be easily interpreted by biologists. Fully unsupervised blind source separation requires learning the model structure. In microarray data analysis, one needs to infer the optimal number of latent regulatory processes in the system. The parameters of a latent variable model with a fixed structure (known number of processes) can be learned using maximum likelihood estimation (MLE) techniques, e.g. the expectation maximization (EM) algorithm [11], as in Segal et al [6]. Unfortunately, the value of the likelihood by itself is not suitable for model selection. The main reason is that MLE prefers more complex models and tends to over-fit the training data.
That is, more complex models return higher likelihood scores for the training data, but they do not generalize well to future, yet to be seen, data. On the other hand, the methods used in the studies by Alter et al [1] and Liebermeister [3] simply dictate the number of processes of the model and do not have the flexibility of model selection. Model selection can be addressed effectively within the Bayesian framework [12,13,14]. Bayesian selection penalizes models for complexity as well as for poor fit; therefore it implements Occam's Razor. In this work, we investigate the Bayesian model selection framework in the context of the CVQ model. More specifically, we derive and implement a variational Bayesian approach which can automatically learn both the structure and parameters of the CVQ model, and thus perform full-scale blind source separation. In the following sections, we first present the CVQ model. After that, we discuss the theory of Bayesian model selection and its approximations. We derive and present a variational Bayesian approximation for learning the CVQ model from data.

^b We use "sources" and "processes" interchangeably throughout the rest of the paper.
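Bayesian model selection by marginal likelihood, as discussed above, can be illustrated on a toy problem where the evidence integral is available in closed form. The sketch below compares a zero-parameter coin model against one with a free bias parameter; the data counts are hypothetical, chosen only to show the complexity penalty at work.

```python
from math import lgamma, log

# Toy illustration of Bayesian model selection by marginal likelihood
# ("evidence") on coin-flip data. M0 fixes p = 0.5 (no free parameters);
# M1 places a uniform Beta(1,1) prior on p. The counts are invented.
def betaln(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_m0(heads, tails):
    return (heads + tails) * log(0.5)

def log_evidence_m1(heads, tails, a=1.0, b=1.0):
    # closed form of the evidence integral: B(a + h, b + t) / B(a, b)
    return betaln(a + heads, b + tails) - betaln(a, b)

# Near-fair data: the simpler model M0 wins (Occam's Razor at work).
assert log_evidence_m0(52, 48) > log_evidence_m1(52, 48)
# Strongly biased data: the flexible model M1 wins despite its penalty.
assert log_evidence_m1(90, 10) > log_evidence_m0(90, 10)
```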
Figure 1: A directed acyclic graph (DAG) representation of the cooperative vector quantizer (CVQ) model. The square corresponds to an individual data point, which consists of observed variables y and latent variables s. W, γ, π and τ are model parameters.
Finally, we test the model and algorithms on (1) simulated gene expression data and (2) yeast cell-cycle microarray data [20], and discuss the results.
2 The CVQ Model

In the CVQ model, the states of the cellular processes are represented as a set of binary variables s = {s_k}_{k=1}^K referred to as sources, where K is the number of processes in a given model. Each source assumes a value of 0/1, which simulates the "off/on" state of a cellular process. Each microarray experiment is represented as a D-dimensional vector y, where D is the number of genes on a microarray. An observed data point y^(n) is produced cooperatively by the sources depending on their states. When a source s_k equals 1, it outputs a D-dimensional weight w_k to y. We can think of the source variable s_k as a switch which, when turned on, allows the outflow of the weights w_k to y. More formally,

y = Σ_{k=1}^{K} s_k w_k + ε    (1)
where N(·|μ, Σ) denotes a multivariate Gaussian distribution; s_k is an indicator variable; w_k is the weight output by source s_k; ε ~ N(0, Λ) is the noise of the system. The parameters θ of the model are: π = {π_1, π_2, ..., π_K}, where π_k is the probability of s_k = 1; a D × K weight matrix W whose column w_k corresponds to the weight output for source s_k; γ = {γ_1, γ_2, ..., γ_K}, whose components are the precisions of the columns of the weight matrix; and the covariance matrix Λ = τ^{-1}I, where τ is the precision of the noise ε. The graphical representation of the model is shown in Figure 1. The learning task includes parameter estimation and model selection based on the Bayesian framework.
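A minimal sketch of this generative process, with illustrative dimensions and parameter values (the weights, source probabilities and noise precision below are invented for the example, not taken from the paper):

```python
import numpy as np

# Each binary source s_k gates its weight column w_k; the observation is the
# gated sum plus Gaussian noise with precision tau (Lambda = tau^-1 * I).
rng = np.random.default_rng(0)
D, K, N = 16, 8, 600            # genes, sources, data points (illustrative)
pi = np.full(K, 0.5)            # pi_k = P(s_k = 1)
W = rng.normal(size=(D, K))     # weight matrix; column k is w_k
tau = 25.0                      # noise precision

S = (rng.random((N, K)) < pi).astype(float)               # 0/1 source states
Y = S @ W.T + rng.normal(scale=tau ** -0.5, size=(N, D))  # y = sum_k s_k w_k + eps
```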
3 Bayesian Model Selection

The main task of model selection in the VBCVQ model is to determine the number of processes (sources) in the model. In the Bayesian model selection framework, we choose the model M_i with the highest posterior probability P(M_i|Y) among a set of models M = {M_j}, based on the observed data. Therefore the selection of the model is dictated by the observed data, not arbitrarily by the modeler. According to Bayes' theorem, the posterior probability of a model satisfies P(M_i|Y) ∝ P(Y|M_i)P(M_i), where

P(Y|M_i) = ∫_θ P(Y|θ, M_i) P(θ|M_i) dθ    (2)

and Y = {y^(n)}_{n=1}^N are the observed data; P(Y|M_i) is the marginal likelihood or "evidence" for the model; P(M_i) is the prior probability of the model M_i. If no prior knowledge is available, we use an uninformative prior P(M_i), and the model selection is determined by P(Y|M_i).

Variational approximations. The evaluation of equation (2) is often intractable in practice. Various techniques are used to approximate the integration, e.g., the Laplace approximation, the Bayesian information criterion (BIC) and Markov chain Monte Carlo (MCMC) simulation [13]. Recently, the variational Bayesian approach has been used in various statistical models to approximate the integration in equation (2) [15,16,12,10]. The approach takes advantage of the fact that, for a given model M_i, the log marginal likelihood ln P(Y|M_i) can be bounded from below [15,12] as:
ln P(Y|M_i) ≥ ∫ Q(H, θ) ln [P(Y, H, θ|M_i) / Q(H, θ)] dH dθ = F(Q)    (4)

where Q(·) is an arbitrary distribution, and H and θ denote the sets of hidden variables and parameters of a given model, respectively. The inequality is established by Jensen's inequality. Thus, one can treat the lower bound F as a function of the free distribution Q(H, θ) and maximize F with respect to Q(H, θ). The best result is achieved if Q(H, θ) equals the posterior joint distribution over the hidden variables H and parameters θ. However, the evaluation of the true posterior distribution is intractable in most practical cases. To overcome the difficulty, a variational approximation can be obtained by restricting the maximization of Q(H, θ) to a smaller family of distributions chosen for convenience. A common approach is the mean-field approximation, which maximizes over the family of distributions in which the hidden variables
and parameters are independent. Then the joint distribution can be fully factorized: Q(H, θ) = Π_k Q_H(H_k) Π_j Q_θ(θ_j). Restricting Q(H, θ) to this family gives a less tight bound in equation (4), but one can analytically maximize the lower bound of the log marginal likelihood with respect to the factorized family of distributions by an iterative algorithm similar to the EM algorithm [12]. In the Bayesian framework, the parameters of a given model are treated as random quantities, requiring us to specify prior distributions P(θ|M_i) for all model parameters. We choose the following conjugate priors to facilitate the estimation of the approximate posterior distributions:

P(π) = Π_{k=1}^{K} Beta(π_k|α, β);  P(γ) = Π_{k=1}^{K} G(γ_k|a_γ, b_γ);  P(τ) = G(τ|c_τ, d_τ);  P(W|γ) = Π_{k=1}^{K} N(w_k|0, γ_k^{-1}I);
where Beta(·|α, β) is a beta distribution and G(·|a, b) is a gamma distribution. We use the following values of the hyper-parameters during training: α = β = 1, a_γ = b_γ = c_τ = d_τ = 10^{-3}.

4 Variational Bayesian Learning

In the variational Bayesian approach, we maximize the lower bound F of the log marginal likelihood ln P(Y|M_i) with respect to a set of parameterized variational distributions Q(H_k), k = 1, 2, ..., K, and Q(θ_p), p = 1, 2, ..., P, which are the approximate posterior distributions of the hidden variables and parameters [15,12]. The process of maximizing the lower bound F and learning the parameters is very similar to the conventional expectation-maximization (EM) algorithm [11]. We adopt the iterative variational approximation principle [15,12], which maximizes the functional F by iterating over two alternating re-estimation steps:
• Estimation of the hidden source distributions Q_H(H):

Q*_H(H) ∝ exp ⟨ln P(Y, H|θ)⟩_{Q_θ(θ)}    (5)

• Estimation of the parameter posteriors Q_θ(θ):

Q*_θ(θ) ∝ P(θ) exp ⟨ln P(Y, H|θ)⟩_{Q_H(H)}    (6)
where ⟨·⟩_{Q(·)} denotes the expectation with respect to the distribution Q(·). Expanding and evaluating equations (5) and (6), we obtain a set of approximate posterior distributions of the hidden sources H and parameters θ. Thus, the variational Bayesian approach allows us not only to approximate the log marginal likelihood ln P(Y|M_i) to achieve model selection, but also to learn the approximate distributions of the parameters. In the following, we summarize the form of the approximate posterior distributions and the rules for updating the parameters of the distributions. Complete derivations can be found in a separate report [17].
Q(s) = Π_{k=1}^{K} Be(s_k|λ_k);  Q(π) = Π_{k=1}^{K} Beta(π_k|α̂_k, β̂_k);  Q(τ) = G(τ|ĉ_τ, d̂_τ);

where Be(·|λ) is a Bernoulli distribution. One can maximize the lower bound F by initializing the parameters of the model with a suitable guess and then iteratively updating the parameters of the individual approximate distributions using the following updating rules until F converges to a local maximum, e.g.

ĉ_τ = c_τ + ND/2.
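The target that these updates approximate, the posterior over the binary source states, can be computed exactly by enumeration when K is tiny. The hedged sketch below (with invented W, π and τ values) shows what the variational distributions are approximating; for realistic K the 2^K enumeration is infeasible, which is why the variational scheme is needed.

```python
import itertools
import numpy as np

# Exact posterior P(s | y) over source configurations by enumeration of all
# 2^K states, for a tiny toy CVQ instance. All parameter values are invented.
rng = np.random.default_rng(1)
D, K = 6, 3
W = rng.normal(size=(D, K))
pi = np.array([0.3, 0.5, 0.7])
tau = 10.0

s_true = np.array([1.0, 0.0, 1.0])
y = W @ s_true + rng.normal(scale=tau ** -0.5, size=D)

log_post = {}
for s in itertools.product([0.0, 1.0], repeat=K):
    s = np.array(s)
    resid = y - W @ s
    log_lik = -0.5 * tau * resid @ resid          # up to an s-independent constant
    log_prior = np.sum(s * np.log(pi) + (1 - s) * np.log1p(-pi))
    log_post[tuple(s)] = log_lik + log_prior

# Normalize in log space for numerical stability.
z = max(log_post.values())
probs = {k: np.exp(v - z) for k, v in log_post.items()}
norm = sum(probs.values())
probs = {k: v / norm for k, v in probs.items()}
best = max(probs, key=probs.get)   # most probable source configuration
```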
Figure 2: Left panel: Original source images used to generate the data. Middle panel: Observed images resulting from the mixture of sources. Right panel: Recovered sources.
5 Analysis of Simulated Data

We have implemented the variational Bayesian inference algorithm for the CVQ model. To demonstrate the capability of the model to identify the source processes uniquely, we first applied the model to simulated microarray data. In this experiment, we used 8 hidden sources to simulate cellular processes that control the expression of 16 genes. The left panel of Figure 2 depicts the components of the model, where genes are represented by the pixels of a 4 × 4 image. Each of the 8 sources controls a subset of the 16 genes, where the intensity of the pixels reflects the degree of influence by the source. As the figure shows, some genes are controlled by multiple sources. We generated 600 images (experimental data) by setting sources "on/off" stochastically, summing the weights output by the sources and adding random noise to the images. The middle panel of Figure 2 illustrates some of the data images generated during the process. We ran our program to test its ability to automatically recover the number of sources and their patterns. The right panel of Figure 2 shows the result of an experiment where the algorithm is initialized with 16 hidden sources. The program correctly identified all 8 sources that were used to generate the data and eliminated the remaining 8 unnecessary sources. The experiment demonstrates an excellent performance of the variational Bayesian approach on blind source separation for simulated gene expression data. Figure 2 also shows an interesting characteristic of our Bayesian CVQ model: its ability to eliminate unnecessary sources automatically, thus achieving the effect of model selection. This ability is due to the introduction of the hierarchical parameters γ (see Section 2) into the model. The approach is referred to as automatic relevance determination (ARD). It has been used in a number of Bayesian linear latent variable models to determine the model dimension automatically [16,18,10]. When a variational Bayesian ICA model with mixture-of-Gaussian sources was first tested on a similar image separation task [19,10], the recovery of source images
Figure 3: Source processes recovered from the training data containing a background signal and both positive and negative weight sources. The first image captures the background signal. Black pixels capture negative weights.
from the mixed image data was hindered by contamination with negative "ghost" images. In order to prevent "ghost" images, special constraints on the distributions were incorporated into the ICA model. Specifically, the use of rectified Gaussian priors [10] restricted both the sources and the weight matrix to the positive domain. In contrast, the CVQ model performs blind source separation without special constraints. Adopting Bernoulli distributions for the sources in the CVQ model naturally constrains the sources to the non-negative domain, preventing "ghost" images. No constraint on the weight matrix appears necessary. This flexibility allows the capture of genuine negative influences of sources on the observed data, which is a highly desirable characteristic for detecting the repressive effects of signal transduction components on gene expression. To test the model's ability to capture repressive effects, we generated 600 training data points with 8 sources similar to those described earlier, with one exception: the weight outputs of two sources are negative on some of the pixels. We randomly initialized the parameters for the hidden sources, and then ran the algorithm to recover the sources. Once again our variational Bayesian algorithm was able to correctly identify not only the number of underlying regulatory signals but also their weight matrices, including their repressive (negative) components. Figure 3 shows the sources and weights recovered by the algorithm for the simulated data. Black pixels correspond to negative weights.
6 Application in Microarray Data Analysis

In this section, we present the results of applying the CVQ data analysis to the yeast cell cycle data of Spellman et al [20]. These cell cycle data have been widely used to test different algorithms, including SVD [1] and ICA [3]. The data set contains a collection of whole yeast genome expression measurements (77 samples) across the yeast cell cycle. During the cell cycle, the states of the cellular processes that control
the progression of the cell cycle switch "on/off" periodically. Thus, these data are suitable to test the ability of the CVQ model to capture such periodic behavior of cellular processes. We have extracted the expression patterns of 697 genes that are documented to be cell-cycle dependent [20] and used the CVQ to model the data. The original data is in the form of the log ratio of the fluorescence of labeled sample cDNA and control cDNA. Before fitting the model, the log ratios were transformed to positive values by subtracting the minimum ratio of each gene. In order to determine the optimal model that fits the data well, we tested CVQ models setting the initial number of sources to values ranging from 8 to 30. We ran each model 30 times. Figure 4 shows the results of the experiments. We can see that the lower bound F for the log marginal likelihood reaches a plateau between the models with 12 to 20 sources. Inspecting the recovered models, we found that most of these models have 12 working sources; excess sources were eliminated by the ARD phenomenon. Note that models initialized with more than 20 sources are penalized by the Bayesian approach in that the F values begin to drop. Thus, the variational Bayesian approach consistently returned models with 12 sources as the most suitable for the observed data. In comparison to the models studied by Alter et al [1] and Liebermeister [3], where the number of processes was determined by the number of samples, our approach determines the number of processes based on the sound statistical foundation of the Bayesian framework. In addition, the larger number of processes in their models significantly increases the number of parameters to estimate - about 50,000 more parameters would be needed to carry out a similar experiment. It is well known that models with a large number of parameters are prone to over-fitting the training data, especially with a training set of a small size like the one used in our experiment.
The full Bayesian treatment of the CVQ model implicitly penalizes models with too many parameters, thus making it less likely to over-fit the data.
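The per-gene shift to non-negative values described above can be sketched as follows; the toy matrix is illustrative (rows are genes, columns are samples).

```python
import numpy as np

# Shift each gene's log-ratio profile so its minimum across samples is zero,
# as described in the text. The matrix below is a toy example.
log_ratios = np.array([[-1.2,  0.3, 0.8],
                       [ 0.1, -0.4, 0.6]])
shifted = log_ratios - log_ratios.min(axis=1, keepdims=True)
assert shifted.min() == 0.0 and (shifted >= 0).all()
```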
’,
We have studied the recovered CVQ model to see if it can capture the periodic behavior of the processes. The middle and right panels of Figure 4 show one of the recovered models with the highest F. The middle panel shows the states of the 12 hidden sources across the experimental conditions, in this case a time series of gene expression observations. One can clearly see the cyclic "on/off" pattern of the sources, which is far from random. This is not surprising, and encouraging, as we are modeling the expression control processes of cell-cycle-related genes. For each of the cell cycle time points, we can see sources cooperatively contributing to the observations. Thus, the CVQ model provides another approach to decomposing the overall observation at the genome level into different processes, which may reflect the states of different cellular signal transduction components. A more detailed biological analysis of the results is being carried out and will be reported separately.
Figure 4: Left panel: Mean and standard deviation of F for models initialized with different numbers of sources. Middle panel: Top: States of the hidden sources (rows) for each time series observation (columns). Black blocks indicate the source is "on" and white blocks indicate the source is "off". Bottom: Corresponding cell cycle phase for each observation. Right panel: The weights associated with the sources (columns).
7 Discussion
One important aspect of systems biology is to understand how information is organized inside the cell. For example, an interesting question is: what is the minimum number of central signal transduction components needed to coordinate the variety of cellular signals and cellular functions? A cell is constantly bombarded by extracellular signals; many of these signals are eventually propagated to the nucleus in order to regulate gene expression. It would be surprisingly inefficient for nature to endow every receptor at the plasma membrane with a unique pathway to pass its signal from the plasma membrane to the promoter of a gene. More plausible is a minimum set of partially shared signal transduction components that play a central role in coordinating signals from the extracellular environment and disseminating them to the transcription factor level. These components work as encoders that compress a large amount of information from the extracellular and intracellular environments to a minimum length, then pass the information to gene expression regulating components such as transcription factors or repressors. To model these signal transduction components, model selection becomes a key issue, one which has not been well addressed previously. Bayesian model selection respects Occam's razor, minimizing a fitted model's complexity and potentially increasing the interpretability of the data in terms of information organization and flow inside living cells. These characteristics put the model a step ahead of some commonly used models for modeling the cellular processes controlling gene expression. Like most other models used to decompose observed microarray data into components, the CVQ model is a linear model. In microarray data analysis, measurements are usually transformed by the logarithm, so that cooperative effects that combine multiplicatively at the raw data level can be handled as additive. This simplifies
model-fitting but may be too restrictive. To capture nonlinear relationships in the log space, the CVQ model could naturally be extended to mixtures of CVQ models. This extension will be studied in the future. Another possible improvement of the model is the use of more sophisticated approximation methods, such as Minka's expectation propagation method [21], to obtain a better approximation of the log marginal likelihood and, thus, better model selection and optimization.
Acknowledgments

The authors would like to extend special thanks to Dr. Zoubin Ghahramani for constructive initial inputs and discussion. We thank Drs. Gregory Cooper, Chengxiang Zhai, Rong Jin, Vanathi Gopalakrishnan, Matthew J. Beal and two anonymous reviewers for insightful discussions and comments. Xinghua Lu would like to acknowledge support from the National Library of Medicine (training grant 3T15 LM0705915S1) and the MUSC COBRE for Cardiovascular Disease.
References
1. Alter, O., Brown, P. O. and Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America, 97:10101-10106, 2000.
2. Raychaudhuri, S., Stuart, J. M. and Altman, R. B. Principal components analysis to summarize microarray experiments: application to sporulation time series. In Proceedings of the Pacific Symposium on Biocomputing, pages 455-466, 2000.
3. Liebermeister, W. Linear modes of gene expression determined by independent component analysis. Bioinformatics, 18:51-60, 2002.
4. Martoglio, A., Miskin, J. W., Smith, S. K. and MacKay, D. J. C. A decomposition model to track gene expression signatures: preview on observer-independent classification of ovarian cancer. Bioinformatics, 18(12):1617-1624, 2002.
5. Moloshok, T. D., Klevecz, R. R., Grant, J. D., Manion, F. J., Speier, W. F. and Ochs, M. F. Application of Bayesian decomposition for analysing microarray data. Bioinformatics, 18(4):566-575, 2002.
6. Segal, E., Battle, A. and Koller, D. Decomposing gene expression into cellular processes. In Proceedings of the Pacific Symposium on Biocomputing, volume 8, pages 89-100, 2003.
7. Attias, H. Independent factor analysis. Neural Computation, 11(4):803-851, 1999.
8. Hinton, G. E. and Zemel, R. S. Autoencoders, minimum description length, and Helmholtz free energy. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.
9. Ghahramani, Z. Factorial learning and the EM algorithm. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann Publishers, 1995.
10. Miskin, J. and MacKay, D. Ensemble learning for blind source separation. In S. Roberts and R. Everson, editors, Independent Component Analysis: Principles and Practice, pages 209-233. Cambridge University Press, 2001.
11. Dempster, A. P., Laird, N. M. and Rubin, D. B. Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, B 39:1-38, 1977.
12. Ghahramani, Z. and Beal, M. J. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems 12, pages 507-513. MIT Press, 2000.
13. Kass, R. E. and Raftery, A. E. Bayes factors. Technical Report No. 254 and Technical Report No. 571, Depts. of Statistics, Univ. of Washington and Carnegie Mellon Univ., 1994.
14. MacKay, D. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995.
15. Attias, H. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Uncertainty in AI Conference, pages 21-30, 1999.
16. Bishop, C. M. Variational principal components. In Proceedings of the Ninth International Conference on Artificial Neural Networks, volume 1, pages 509-514. ICANN, 1999.
17. Lu, X., Hauskrecht, M. and Day, R. S. Variational Bayesian learning of the cooperative vector quantizer model - theory. Technical Report No. CBMI-02-181, The Center for Biomedical Informatics, University of Pittsburgh, 2002.
18. Ghahramani, Z. and Beal, M. J. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems 12, Cambridge, MA, 2000. MIT Press.
19. Lawrence, N. D. and Bishop, C. M. Variational Bayesian independent component analysis. Technical report, Computer Laboratory, University of Cambridge, 2000.
20. Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Molecular Biology of the Cell, 9(12):3273-3297, 1998.
21. Minka, T. P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
SYMBOLIC INFERENCE OF XENOBIOTIC METABOLISM
D.C. MCSHAN, M. UPADHYAYA and I. SHAH
School of Medicine, University of Colorado
4200 East 9th Avenue, B-119, Denver, CO 80262
{daniel.mcshan,minesh.upadhyaya,imran.shah}@uchsc.edu
Abstract
We present a new symbolic computational approach to elucidate the biochemical networks of living systems de novo, and we apply it to an important biomedical problem: xenobiotic metabolism. A crucial issue in analyzing and modeling a living organism is understanding its biochemical network beyond what is already known. Our objective is to use the available metabolic information in a representational framework that enables the inference of novel biochemical knowledge and whose results can be validated experimentally. We describe a symbolic computational approach consisting of two parts. First, biotransformation rules are inferred from the molecular graphs of compounds in enzyme-catalyzed reactions. Second, these rules are recursively applied to different compounds to generate novel metabolic networks, containing new biotransformations and new metabolites. Using data for 456 generic reactions and 825 generic compounds from KEGG we were able to extract 110 biotransformation rules, which generalize a subset of known biocatalytic functions. We tested our approach by applying these rules to ethanol, a common substance of abuse, and to furfuryl alcohol, a xenobiotic organic solvent that is absent from metabolic databases. In both cases our predictions on the fate of ethanol and furfuryl alcohol are consistent with the literature on the metabolism of these compounds.
Introduction
The objective of this work is to develop a predictive strategy for elucidating metabolism. We mold available metabolic information into an expressive symbolic representation and employ a novel inference framework to explore uncharted pathways. We hypothesize that biochemical rules can be inferred from the databases of endogenous metabolism and that we can use these rules to predict the metabolism of unknown xenobiotics through detoxification pathways. In particular, we focus on xenobiotic pathways in mammalian systems. What is the importance of discovering new pathways? Our knowledge of metabolism is essentially incomplete and it can be argued that cataloging
all possible mammalian xenobiotic pathways is infeasible. With the availability of the complete genomic blueprint for living systems and a large set of known biotransformations, it is becoming possible to theoretically elucidate metabolism. This includes the analysis of endogenous as well as xenobiotic pathways. Drugs, substances of abuse and environmental pollutants are examples of compounds that may not occur naturally in a living system. Since these compounds and/or their metabolic by-products can be potentially toxic, investigating xenobiotic metabolism is important for human health and the environment. Pathway inference is a computationally challenging problem even with the availability of the genomic blueprint for a living system and the functional annotations of its putative genes. Since the availability of the first microbial genome, Haemophilus influenzae, a number of metabolic reconstruction tools have been developed. These include PathoLogic and PathFinder. These methods focused on matching putatively identified enzymes with known, or "reference", pathways. Although reconstruction is an important starting point for metabolic processes, it does not enable the discovery of new pathways. To overcome some of these issues we have recently developed a new pathway inference system, called PathMiner, to search for novel metabolic routes. PathMiner uses known biotransformations to synthesize new pathways and employs heuristics to contain the combinatorial complexity of the search. This paper delves into a deeper biological problem, de novo pathway inference, and its practical application to a biomedical problem: xenobiotic metabolism. The metabolic potential of a living system depends on biocatalysis. However, understanding the mechanisms of enzymatic catalysis is an extremely difficult problem, and knowledge in this area is limited to a handful of well-studied examples.
Generally, biochemists can abstract empirical "rules" for the biotransformation of metabolites by enzymes. For instance, consider the broad range of substrates for Saccharomyces cerevisiae (yeast) alcohol dehydrogenase (YADH), which reduces acetaldehyde and a variety of other aldehydes, and oxidizes ethanol and other acyclic primary alcohols. Yet an alcohol dehydrogenase from Thermoanaerobium brockii (TADH) catalyzes the stereospecific reduction of ketones and the oxidation of secondary alcohols. The functions of YADH and TADH share common attributes and have some unique differences: they are both alcohol dehydrogenases, but their specificities for the alcohols are different. The functions of these enzymes can be expressed in terms of the functional groups modified (alcohol to aldehyde or ketone), and the backbone structure of the molecule (primary or secondary alcohol). This is essentially a symbolic description of biocatalysis and we believe that it can be applied to complete metabolic systems.
Methods
Our strategy for elucidating de novo xenobiotic metabolism consists of two main steps. First, we use biotransformation data to derive symbolic chemical substructural rules that generalize the action of enzymes on specific compounds. Second, we apply these rules iteratively to a compound to generate a plausible metabolic system. We describe these steps in the following sections, but first we discuss our metabolic representation.
Representing biotransformations and rules
Our abstraction of metabolic concepts is based on work by Karp in terms of high-level concepts including pathways, enzyme-catalyzed reactions and transformations. At the level of biotransformations we are motivated by Kazic in that we focus on the specific chemical substructural details of metabolites that are modified through biocatalysis. In our system, compounds are represented as X. Compounds in our abstraction have a chemical structure which is represented as a molecular graph, Γ, in which nodes are atoms and edges are bonds. In the context of a biotransformation the pattern of substructural changes from the input compound to the output compound is represented as a rule, U. A rule captures the concept of functional group changes that occur in a biotransformation. Rules are implicitly unidirectional, so reversible transformations are represented as two separate rules. The two molecular graphs of a rule are indicated by the input graph, Δ−, and the output graph, Δ+. For instance, the rule for the conversion of a primary alcohol to an aldehyde is shown in Figure 1. In this case Δ− is an alcohol moiety, which is converted to Δ+, an aldehyde moiety.
Figure 1: Alcohol dehydrogenase (EC 1.1.99.9). Transformation from abstract PrimaryAlcohol to abstract Aldehyde showing the computed Δ− and Δ+ moieties. The Δ− moiety is the subgraph that is in Xs but not in Xp. The Δ+ moiety is the subgraph that is in Xp but not in Xs.
In the present work we focus on changes at the level of functional groups between pairs of compounds. We represent the conversion of one input compound to one output compound as a transformation. This simplifies our representation of reactions in terms of the main metabolites. In this work we obtain this data from the KEGG distribution, but we are also exploring automated methods for identifying the main metabolites in a reaction.
Extracting transformation rules from reaction data
One strategy for identifying rules is to curate them manually; however, our goal is to use the available metabolic data to derive biotransformation rules automatically. This is a difficult problem in general, as the information about reactive moieties is not explicitly available. In this paper we have used a simple strategy for extracting rules automatically from "general" reactions. In KEGG, for instance, general reactions are defined when the input and the output compounds are both Markush structures. We find 741 general reactions in KEGG, which constitute 20% of the reactions annotated as being human-specific. For example, a gene that is extremely important in xenobiotic metabolism and encodes a cytochrome P-450 enzyme, CYP2D6, is implicated in the disposition of over thirty toxins. In KEGG, the P-450 enzyme (EC 1.14.14.1) is associated with only four reactions, as shown in Figure 2. There are two specific reactions involved in endogenous functions associated with tryptophan metabolism and gamma-hexachlorocyclohexane degradation. The other two operate on general compounds denoted by their Markush structures (these are abstract structures containing a wildcard "R" group and specific functional groups). We convert these general reactions automatically to rules as described above. This is done by replacing the wildcard of the substrate with "C" and storing it as the Δ− subgraph in the resulting rule; similarly, the "R" in the product graph is replaced and the resulting graph is stored as Δ+. To our knowledge no one has taken advantage of this annotation before in metabolic pathway inference. In this work we focus on the rules important in xenobiotic metabolism in mammalian systems, including oxidation, reduction, hydrolysis and conjugation, to mention a few. There are generally two phases in xenobiotic metabolism. In phase 1 the compounds are 'functionalized', which means that a reactive functional group is exposed.
Detoxification occurs in phase 2 by further action on the functional groups, yielding the form in which the compound is excreted. For instance, the first phase activates a molecular oxygen in the input compound, and the second phase conjugates it. The glucuronide is the most common conjugate and can be attached to any labile oxygen. In the case of alcohol metabolism, both the alcohol and the acid can usually be conjugated.
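The wildcard-substitution step described above can be sketched in a few lines. This is a hypothetical simplification in which Markush structures are treated as SMILES-like strings rather than molecular graphs; the function name and string encoding are illustrative assumptions, not the implementation used in this work.

```python
def reaction_to_rule(substrate_markush, product_markush, wildcard="R"):
    """Turn a general (Markush) reaction into a (delta_minus, delta_plus)
    rule by instantiating the wildcard with a carbon, as described above.
    Structures are simplified to SMILES-like strings for illustration."""
    delta_minus = substrate_markush.replace(wildcard, "C")
    delta_plus = product_markush.replace(wildcard, "C")
    return delta_minus, delta_plus

# e.g. the general reaction  Alkane -> R-OH  from Figure 2
print(reaction_to_rule("R-H", "R-OH"))  # ('C-H', 'C-OH')
```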
Melatonin → 6-Hydroxymelatonin
FattyAcid → alpha-HydroxyFattyAcid
Alkane → R-OH
Parathion → Paraoxon
Figure 2: CYP2D6 (EC 1.14.14.1) reactions in KEGG. Compounds are either Abstract (contain one or more Markush "R" groups) or Normal (have a unique structure).
Biotransformation rule application
Our rule application algorithm is illustrated in Algorithm 1. A rule is applied to a substrate Xs by searching the graph of Xs, Γs, for the subgraph Δ−. If the subgraph Δ− is found, it is replaced by the Δ+ graph to yield the product graph, Γp. This is summarized as follows:
    Γs − Δ− + Δ+ → Γp.

This is graphically illustrated in Figure 3.
Figure 3: Application of alcohol dehydrogenase rule to ethanol
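The graph search-and-replace step just described can be sketched compactly. As a hedged simplification, molecules here are encoded as SMILES-like strings so that subgraph matching reduces to substring matching; a real implementation would perform subgraph isomorphism on the molecular graphs.

```python
def apply_rule(substrate, delta_minus, delta_plus):
    """Apply one biotransformation rule: find the input moiety (delta_minus)
    in the substrate and replace it with the output moiety (delta_plus).
    Returns None when the rule does not apply (no match found)."""
    if delta_minus not in substrate:
        return None
    return substrate.replace(delta_minus, delta_plus, 1)

# alcohol dehydrogenase rule of Figure 1: primary alcohol -> aldehyde
print(apply_rule("CC-OH", "C-OH", "C=O"))  # ethanol -> acetaldehyde: CC=O
```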
The product of applying a rule to a compound can be a completely novel compound or a known compound. We use subgraph isomorphism to search the product molecular graph against the database of known compounds. If the compound is not found, a novel compound Xi is created and given a unique identifier (Nxxxxxx, in which x is a digit from 0-9). The corpus of all rules is designated U. We have a top-level function metabolize(X, U, n) which takes a compound X and systematically applies each rule in the rule-base through n iterations.
input  : Xs, compound to metabolize
         U, list of rules
         n, iterations
output : graphical visualization, Products

Products ← ∅
Γs ← molecular-graph(Xs)
for (Δ−, Δ+) ∈ U do
    Γp ← graph-replace(Γs, Δ−, Δ+)
    if Γp ≠ ∅ then
        Xp ← find-compound-by-graph(Γp)
        if Xp = ∅ then
            Xp ← make-novel-compound(Γp)
        pushnew(Xp, Products)
if n > 1 then
    for X in Products do
        append(metabolize(X, U, n−1), Products)
Algorithm 1: metabolize(X, U, n). Algorithm to create a network of pathways of length n from input compound Xs by applying rules U. Initially the list of Products is set to null. The molecular graph, Γs, of the input compound is obtained from the KEGG mol file representation. For every rule in the rule-base U, we obtain the Δ− and Δ+ subgraphs. The product graph, Γp, is obtained by performing a graphical search/replace on the input graph, Γs. If Γp is non-empty, i.e., a match was found and applied, then the product graph Γp is searched against the database of known compounds and the database of novel compounds to see if an isomorphic graph exists. If the graph matches an existing compound, then Xp is returned. If there is no identified compound with the graph, then a novel compound, Xp, is generated and given a unique identifier (the Nxxxxxx symbols in the diagrams). In either case, the product, Xp, is pushed onto the Products list for this metabolite Xs. This process can occur iteratively for every product X in the Products list. The metabolize function is simply called again with the recursion level reduced. The results are appended to the Products list.
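The recursion of Algorithm 1 can also be paraphrased as a short Python function. This is a hypothetical sketch: molecules are strings, rules are substring pairs, and the database lookup for known compounds is replaced by simple deduplication of the product list.

```python
def metabolize(compound, rules, n):
    """Sketch of Algorithm 1: apply every rule to the compound, then
    recurse on each product for n-1 further iterations, accumulating
    the products of the whole metabolic network."""
    products = []
    for delta_minus, delta_plus in rules:
        if delta_minus in compound:                     # Δ− "subgraph" found?
            product = compound.replace(delta_minus, delta_plus, 1)
            if product not in products:
                products.append(product)
    if n > 1:
        for p in list(products):
            for q in metabolize(p, rules, n - 1):
                if q not in products:
                    products.append(q)
    return products

# ethanol-like chain: alcohol -> aldehyde -> acid (cf. Figure 4)
rules = [("C-OH", "C=O"), ("C=O", "COOH")]
print(metabolize("CH3-C-OH", rules, 2))  # ['CH3-C=O', 'CH3-COOH']
```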
Implementation
The system is implemented in Allegro Common Lisp. The metabolic databases are read in and parsed into CLOS structures. For visualization, the transformations are exported to the AT&T graphviz program neato, which does a simple force-based layout of the metabolic graph. This network is read back in and presented with the nodes replaced by compound structures using our internal visualization system. The novel compounds that are produced by the application of the rules are simply graphs. In order to visualize the compounds, we require 2D coordinates. To achieve this, we export the graph as a mol file with the 2D coordinates as zeroes and then lay out the mol file using the JChem molconvert package. The mol files are read back in and stored with the compounds as they are created.

Reactant | Product | E.C. | Enzyme
Table 1: Simplest 10 of 110 rules inferred from KEGG generic reactions
Results and Discussion
We used a recent version of the KEGG database which had 10,635 compounds, of which 825 are generic. Of the 5,428 reactions in the KEGG database, 741 operate on the generic compounds. From this data, we infer 110 biotransformation rules; the 10 simplest ones are summarized in Table 1. These rules correspond to enzymes which have flexibility in the substrates they can transform. Using the symbolic computational approach described in the previous sections, we elucidate the de novo metabolism of two compounds. First, we consider ethanol, which is a common substance of abuse and for which we have some data on human metabolism. Second, we demonstrate the fate of furfuryl alcohol, which is an industrial organic solvent used as a paint thinner and is absent from our database. Experimental evidence suggests that prolonged exposure to furfuryl alcohol may have significant toxicological effects. We first apply the rules to the compound ethanol, which is in the database. The graph is shown in Figure 4. Next, we apply the rules to a new compound, furfuryl alcohol, which is not in the database. The result is shown in Figure 5. That some of the
Figure 4: The de novo prediction of ethanol metabolism. Ethanol is in the center of the figure. The highlighted transformations are the activation of the alcohol to an aldehyde by alcohol dehydrogenase (EC 1.1.99.20), then to an acid by aldehyde oxidase (EC 1.2.3.1), respectively. Not shown, but in the next iteration is the O-glycosylation of the aldehyde by beta-Glucuronidase (EC 3.2.1.31).
nodes in our ethanol metabolism graph match known compounds in the database is encouraging. Additionally, we were able to identify the pathway alcohol → aldehyde → acid → conjugation, which recapitulates the standard ethanol detoxification pathway. We are also able to predict metabolites for a compound previously unknown to the system. The furfuryl alcohol metabolic predictions are consistent with the literature. Martin et al. report that furfuryl alcohol can be O-glycosylated by beta-Glucuronidase, as we predict (shown as compound N00482 in Figure 5). Additionally, the acid of furfurol, 2-furoate, is actually in the KEGG database and is identified as such by the algorithm. Nomeir et al. report that the initial step in furfuryl alcohol metabolism in rat is the oxidation to furoic acid, which is excreted unchanged and decarboxylated, or conjugated with glycine or condensed with acetic acid. In this case, the limitations in our system to predict the condensation with acetic acid, for instance, lie in the breadth of the rules, not in the fundamental methodology. By extending our method for inferring new rules based on known biochemistry we can overcome this limitation.
Figure 5: The de novo prediction of furfuryl alcohol metabolism. Furfurol is in the center of the figure. The highlighted transformation between furfurol and compound N00482 (up and to the left) is an O-glycosylation by beta-Glucuronidase (EC 3.2.1.31). The highlighted transformations below furfurol are the activation of the alcohol to an aldehyde (N00479, furfural) by alcohol dehydrogenase (EC 1.1.99.20), then to an acid by aldehyde oxidase (EC 1.2.3.1), respectively. The acid is identified by the algorithm as being in the KEGG database (by graph similarity) as 2-Furoate (C01546). In the next iteration, not shown, the acid is finally O-glycosylated by beta-Glucuronidase (EC 3.2.1.31).
Most of the complex products of furfuryl alcohol are simply consecutive glucuronidations by the rule:

    Alcohol → β-D-Glucuronide
Due to the lack of specificity of this rule to primary alcohols, glucuronidation is applied to the hydroxyl groups on the β-D-Glucuronide. While this might be
biologically valid, in reality glucuronidation renders a compound water-soluble, after which it is eliminated by excretion. This limitation is beyond the scope of the current work but can be addressed in the future by considering the physical properties of compounds, like water solubility. That a biotransformation rule can be applied does not imply that it is biochemically valid. For instance, consider the biotransformation rules that apply to a hydroxyl functional group. Compounds containing this functional group include primary alcohols, secondary alcohols, and also carboxylic acids. Enzymes that act on alcohols may not act on carboxylic acids and vice versa. To capture the substrate specificity of enzymes we are working on a more sophisticated representation of rules that can improve their biological validity. Though this is a limitation of our present algorithm, our predictions are still useful for elucidating potential xenobiotic metabolism, which can be tested experimentally. It is important to contrast our approach to other rule-based approaches in pathway prediction. One of the main advantages of our strategy is automated biotransformation rule extraction from available resources of metabolic data. As opposed to manual curation-based efforts, our approach will scale gracefully with increasing data for two important reasons. First, our algorithm for rule extraction can be extended to utilize most of the available enzyme-catalyzed reaction data beyond the generic reactions in KEGG. Second, we can control the combinatorial explosion of plausible biotransformations by extending our existing algorithm for pathway search. Another advantage of our approach is that we can relate our biotransformation predictions to the organism-specific enzymes and genes, which is crucial for in vivo or in vitro experimental validation.
Conclusion
We have developed a symbolic inference approach and demonstrated the de novo elucidation of metabolism. This was accomplished by representing biocatalysis, which is the basis of metabolism, in terms of expressive symbolic biotransformation rules. These biotransformation rules generalize the biocatalytic functions of enzymes and enable the discovery of new metabolic potential in living systems. We developed an algorithm to extract these rules from known enzyme-catalyzed reactions and to apply them to elucidate the metabolism of new compounds. We successfully tested this concept by predicting the xenobiotic metabolism of ethanol and furfuryl alcohol. The results are encouraging because furfuryl alcohol is absent from our database and yet we can correctly identify its products through O-glycosylation and oxidation to
furoic acid in agreement with the literature. These results are also biologically interesting because they support the notion that xenobiotic metabolism is a manifestation of endogenous biocatalytic abilities in an organism. Though there are some limitations in our approach, the method is quite general and scalable for investigating the metabolic network of any living system. This work supports the relevance of symbolic approaches in discovering the biochemical capabilities of living systems. Our results on xenobiotic metabolism offer a prelude to the potential discoveries that can be made in combination with high-throughput or traditional experimental strategies.
Acknowledgments
The authors acknowledge Weiming Zhang for the visualization software. This work is sponsored by the National Science Foundation (BES-9911447), the Department of Energy (DE-FG03-01ER63111/M003), and the Office of Naval Research (N00014-00-1-0749).
References
1. Applications of Biochemical Systems in Organic Chemistry. Wiley, New York, N.Y., 1976.
2. R.D. Fleischmann et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269:496-512, 1995.
3. T. Gaasterland and E.E. Selkov. Automatic reconstruction of metabolic networks using incomplete information. ISMB, 3:127-135, 1995.
4. T. Gaasterland and C.W. Sensen. MAGPIE: automated genome interpretation. Trends Genet, 12(2):76-78, 1996.
5. A. Goesmann, M. Haubrock, F. Meyer, J. Kalinowski, and R. Giegerich. PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics, 18(1):124-9, 2002. PMID 11836220.
6. B.K. Hou, L.P. Wackett, and L.B. Ellis. Microbial pathway prediction: a functional group approach. J Chem Inf Comput Sci, 43(3):1051-7, 2003.
7. P. Karp and M. Riley. Representations of metabolic knowledge: pathways. In R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, editors, Second International Conference on Intelligent Systems for Molecular Biology. AAAI Press, 1994.
8. P.D. Karp, M. Krummenacker, S.M. Paley, and J. Wagg. Integrated pathway/genome databases and their role in drug discovery. Trends in Biotechnology, 17(7):275-281, 1999.
9. T. Kazic. Reasoning about biochemical compounds and processes. Pages 35-49. World Scientific, Singapore, 1992.
10. B.D. Martin, E.R. Welsh, J.C. Mastrangelo, and R. Aggarwal. General O-glycosylation of 2-furfuryl alcohol using beta-glucuronidase. Biotechnol Bioeng, 80(2):222-7, 2002.
11. A.A. Nomeir, D.M. Silveira, M.F. McComish, and M. Chadwick. Comparative metabolism and disposition of furfural and furfuryl alcohol in rats. Drug Metab Dispos, 20(2):198-204, 1992.
FINDING OPTIMAL MODELS FOR SMALL GENE NETWORKS
S. OTT, S. IMOTO, S. MIYANO
Human Genome Center, Institute of Medical Science, The University of Tokyo
4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
{ott,imoto,miyano}@ims.u-tokyo.ac.jp
Finding gene networks from microarray data has been one focus of research in recent years. Given search spaces of super-exponential size, researchers have been applying heuristic approaches like greedy algorithms or simulated annealing to infer such networks. However, the accuracy of heuristics is uncertain, which, in combination with the high measurement noise of microarrays, makes it very difficult to draw conclusions from networks estimated by heuristics. We present a method that finds optimal Bayesian networks of considerable size and show first results of its application to yeast data. Having removed the uncertainty due to heuristic methods, it becomes possible to evaluate the power of different statistical models to find biologically accurate networks.
1 Introduction
Inference of gene networks from gene expression measurements is a major challenge in Systems Biology. If gene networks can be inferred correctly, this can lead to a better understanding of cellular processes and, therefore, have applications to drug discovery, disease studies, and other areas. Bayesian networks are a widely used approach to model gene networks. In Bayesian networks, the behaviour of the gene network is modeled as a joint probability distribution for all genes. This allows a very general modeling of gene interactions. The joint probability distribution can be decomposed as a product of conditional probabilities P(Xg | X1, ..., Xm), each representing the regulation of a gene g by some genes g1, ..., gm. This decomposition can be represented as a directed acyclic graph. The Bayesian network model has been shown to allow finding biologically plausible gene networks. However, the difficulty of learning Bayesian networks lies in the large search space. The search space for a gene network of n genes is the space of directed acyclic graphs with n vertices. A recursive formula as well as an asymptotic expression for the number of directed acyclic graphs with n vertices (G(n)) was derived by Robinson. We state the asymptotic expression here:

    |G(n)| ~ n! 2^(n(n−1)/2) / (M p^n),  where p ≈ 1.488 and M ≈ 0.574.    (1)
For example, there are roughly 2.34 · 10^72 possible networks with 20 genes, and about 2.71 · 10^158 possible solutions for a gene network with 30 genes. Even for a gene network of 9 genes (search space size roughly 1.21 · 10^15), a brute-force approach would take years of computation time even on a supercomputer. Moreover, it is known that the problem of finding an optimal network is NP-hard, even for the discrete scores BDe and MDL. Therefore, researchers have so far used heuristic approaches like simulated annealing or greedy algorithms to estimate Bayesian networks. However, since the accuracy of heuristics is uncertain, it is difficult to base conclusions on heuristically estimated networks. In order to overcome this problem, we have analysed the structure of the super-exponential search space and developed an algorithm that finds the optimal solution within the super-exponential search space in exponential time. This approach is feasible for gene networks of 20 or more genes, depending on the concrete probability distribution used. Furthermore, adding biologically justified assumptions, the optimal network can be inferred for gene networks of up to 40 genes. Overcoming the uncertainties of heuristics opens up the possibility to compare statistical models with respect to their power to infer biologically accurate gene networks. Also, this method is a valuable tool for refining gene networks of known functional groups of genes. We present the method in Section 2. In Section 3, we present results of an application of this method, which show that it can estimate gene networks biologically accurately.
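Robinson's recursive formula for the number of labeled DAGs can be evaluated directly, which reproduces the search-space sizes quoted above. The sketch below is an illustration (the function name is ours); it uses the standard inclusion-exclusion recurrence a(n) = Σ_{k=1}^{n} (−1)^(k+1) C(n, k) 2^(k(n−k)) a(n−k) over the k vertices of in-degree zero.

```python
from math import comb

def num_dags(n):
    """Number of directed acyclic graphs on n labeled vertices,
    via Robinson's recurrence (inclusion-exclusion over the set
    of k vertices with in-degree zero)."""
    a = [1]  # a(0) = 1: the empty graph
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

print(num_dags(9))  # 1213442454842881, i.e. roughly 1.21e15 as quoted above
```

Evaluating num_dags(20) gives a 73-digit number of about 2.34 · 10^72, matching the figure in the text.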
2 The Method

2.1 Preliminaries
Throughout this section, we assume we are given a set of genes G and a network score function as used by several groups, i.e., a function s : G × 2^G → R that assigns a score to a gene g ∈ G and a set of parent genes A ⊆ G. Given a network N, the score of N is defined as score(N) =def Σ_{g∈G} s(g, P_N(g)), where P_N(g) denotes the set of g's parents in N.
Examples:
1. BDe score: The score is proportional to the posterior probability of the network, given the data. When the BDe score is used, the microarray data needs to be discretized.
2. MDL score: The MDL score makes use of the minimal description length principle and also uses discretized data.
3. BNRC score: The BNRC score uses nonparametric regression to capture nonlinear gene interactions. Since the data does not need to be discretized, no information is lost.
The task of inferring a network is to find a set of parent genes for each gene, such that the resulting network is acyclic and the score of the network is minimal. We introduce some notations needed to describe the algorithm.
Definition 1: F
We define F : G × 2^G → R as F(g, A) =def min_{B⊆A} s(g, B) for all g ∈ G and A ⊆ G.
By this definition, F(g, A) is the score of the optimal choice of parents for gene g when the parents must be selected from the subset A. For every acyclic graph, there is an ordering of the vertices such that all edges are oriented in the direction of the ordering. Conversely, when given a fixed order of G, we can consider the set of all graphs that comply with the given order, as we do in the next definition. An ordering of a set A ⊆ G can be described as a permutation π : {1, ..., |A|} → A. Let us use Π_A to denote the set of all permutations of A.
Definition 2: π-linearity
Let A ⊆ G and π ∈ Π_A. Let N ⊆ A × A be a network. We say N is π-linear iff for all (g, h) ∈ N, π⁻¹(g) < π⁻¹(h) holds. □
Now we use the above definitions and define the function Q^A, which will allow us to compute the score of the best π-linear network for a given π, as we show below.
Definition 3: Q^A
Let A ⊆ G. We define Q^A : Π_A → ℝ as

Q^A(π) =def Σ_{g∈A} F(g, {h ∈ A | π⁻¹(h) < π⁻¹(g)})   (2)

for all π ∈ Π_A. □
If we can compute the best π-linear network for a given permutation π using the functions F and Q, then what we need to do in order to find the optimal network is to find the optimal permutation π, which yields the global minimum. Formally, we define the function M for this step.
Definition 4: M
We define M : 2^G → ∪_{A⊆G} Π_A as

M(A) =def argmin_{π∈Π_A} Q^A(π)   (3)

for all A ⊆ G. □

2.2 The Algorithm
Using the above notations, the algorithm can be defined as follows.

Step 1: Compute F(g, ∅) = s(g, ∅) for all g ∈ G.
Step 2: For all A ⊆ G, A ≠ ∅, and all g ∈ G, compute F(g, A) as min{s(g, A), min_{a∈A} F(g, A − {a})}.
Step 3: Set M(∅) = ( ) (the empty permutation).
Step 4: For all A ⊆ G, A ≠ ∅, do the following two steps:
Step 4a: Compute g* = argmin_{g∈A} (F(g, A − {g}) + Q^{A−{g}}(M(A − {g}))).
Step 4b: For all 1 ≤ i < |A|, set M(A)(i) = M(A − {g*})(i), and M(A)(|A|) = g*.
Step 5: Return Q^G(M(G)).
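The five steps above can be sketched directly in Python. This is an illustrative re-implementation, not the authors' C++ program, and the three-gene score table at the end is a hypothetical example chosen so that the chain a → b → c is optimal.

```python
from itertools import combinations

def optimal_network(genes, s):
    """Steps 1-5 of the algorithm: returns (optimal score, parent sets).
    `s(g, B)` is the score of gene g with parent set B (a frozenset)."""
    genes = tuple(genes)
    # Steps 1-2: F[g][A] = min over B subseteq A of s(g, B), remembering best B.
    F = {g: {frozenset(): (s(g, frozenset()), frozenset())} for g in genes}
    subsets = sorted((frozenset(c) for r in range(1, len(genes) + 1)
                      for c in combinations(genes, r)), key=len)
    for A in subsets:
        for g in genes:
            if g in A:          # parent sets for g are drawn from G - {g}
                continue
            best = (s(g, A), A)
            for a in A:
                if F[g][A - {a}][0] < best[0]:
                    best = F[g][A - {a}]
            F[g][A] = best
    # Steps 3-4: M[A] = optimal permutation of A, Q[A] = its score Q^A(M(A)).
    M, Q = {frozenset(): ()}, {frozenset(): 0.0}
    for A in subsets:
        g_star = min(A, key=lambda g: F[g][A - {g}][0] + Q[A - {g}])
        M[A] = M[A - {g_star}] + (g_star,)
        Q[A] = F[g_star][A - {g_star}][0] + Q[A - {g_star}]
    # Step 5: read the optimal parents off along the optimal order M(G).
    parents, upstream = {}, set()
    for g in M[frozenset(genes)]:
        parents[g] = F[g][frozenset(upstream)][1]
        upstream.add(g)
    return Q[frozenset(genes)], parents

# Hypothetical score table for three genes (lower is better).
table = {
    ('a', frozenset()): 1.0, ('a', frozenset('b')): 2.0,
    ('a', frozenset('c')): 2.0, ('a', frozenset('bc')): 3.0,
    ('b', frozenset()): 3.0, ('b', frozenset('a')): 1.0,
    ('b', frozenset('c')): 2.5, ('b', frozenset('ac')): 2.0,
    ('c', frozenset()): 3.0, ('c', frozenset('a')): 2.5,
    ('c', frozenset('b')): 1.0, ('c', frozenset('ab')): 2.0,
}
best_score, best_parents = optimal_network('abc', s=lambda g, B: table[(g, B)])
```

Running this recovers the acyclic network a → b → c with score 3.0, the global minimum over all DAGs on the three genes.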
In the recursive formulas given in Step 2 and in Step 4, we want to compute the function F resp. M for a subset A ⊆ G of cardinality m = |A|, and need values of the function F resp. M for subsets of cardinality m − 1. Therefore, we can apply dynamic programming in Step 2 as well as in Step 4 to compute the functions F resp. M for subsets A of increasing cardinality. In the recursive formula in Step 4, first the last element g* of the permutation M(A) is computed in Step 4a, and then M(A) is set in Step 4b.

2.3 Correctness and Time Complexity
First, we prove the correctness of the algorithm. The correctness of the recursive formula in Step 2 of the algorithm follows directly from the definition of F . Therefore, after execution of Step 1 and Step 2, the values of function
F for all genes g and all subsets A ⊆ G are stored in memory. Before proceeding to Step 3 and Step 4, we state a lemma on the meaning of the function Q^A.
Lemma 1
Let A ⊆ G and π ∈ Π_A. Let N* ⊆ A × A be a π-linear network with minimal score. Then, Q^A(π) = score(N*) holds.
Proof. In a π-linear graph, a gene g can only have parents h which are upstream in the order coded by π, that is, π⁻¹(h) < π⁻¹(g). Therefore, when selecting parents for g, we are restricted to B = {h ∈ A | π⁻¹(h) < π⁻¹(g)}, and F(g, B) is the optimal choice in this case. Since in a π-linear graph all edges comply with the order coded by π, we can choose parents in this way for all genes independently, which proves the claim. □
Using Lemma 1, we prove that the function M can be computed by the formula given in Step 4.
Lemma 2
Let A ⊆ G. Let g* = argmin_{g∈A}(F(g, A − {g}) + Q^{A−{g}}(M(A − {g}))). Define π ∈ Π_A by π(i) = M(A − {g*})(i) for 1 ≤ i < |A|, and π(|A|) = g*. Then, π = M(A).
Proof. Let π′ ∈ Π_A. By the definition of M, we have to show Q^A(π) ≤ Q^A(π′). Let N* be an optimal π-linear network and M* an optimal π′-linear network. Then, by Lemma 1, Q^A(π) ≤ Q^A(π′) is equivalent to score(N*) ≤ score(M*). Let us denote the last element of π′ as h = π′(|A|). We note that for any B ⊆ G, Q^B(M(B)) is the score of a globally optimal network on B by the above definitions. Therefore, we have:

score(M*) = s(h, P_{M*}(h)) + Σ_{g∈A−{h}} s(g, P_{M*}(g))
          ≥ s(h, P_{M*}(h)) + Q^{A−{h}}(M(A − {h}))
          ≥ F(h, A − {h}) + Q^{A−{h}}(M(A − {h}))
          ≥ min_{h∈A}(F(h, A − {h}) + Q^{A−{h}}(M(A − {h})))
          = F(g*, A − {g*}) + Q^{A−{g*}}(M(A − {g*}))
          = score(N*),

which shows the claim. □
Since Q can be directly computed using F , the algorithm can compute
Q^G(M(G)) in Step 5. Finally, Q^G(M(G)) is the score of an optimal Bayesian network by definition, which shows the correctness. If the information of the best parents is stored together with F(g, A) for every gene g and every subset A ⊆ G, the optimal network can be constructed during the computation of Q^G(M(G)).
Theorem 1
Optimal networks can be found using O(n·2ⁿ) dynamic programming steps.
Proof. The dynamic programming in Step 1 and Step 2 requires O(n·2ⁿ) (n = |G|) steps, and in each step one score is computed. In the dynamic programming in Step 3 and Step 4, O(2ⁿ) steps are needed, where each step involves looking up some previously stored scores. Note that the function Q^A does not need to be actually computed in Step 4a, because Q^{A−{g}}(M(A − {g})) can be stored together with M(A − {g}) in previous steps. Therefore, the overall time complexity is O(n·2ⁿ). □
In biological reality, while the number of children of a regulatory gene may be very high, the number of parents can be assumed to be limited. When we limit the number of parents, the number of score calculations reduces substantially, allowing the computation of larger networks. We state the following trivial corollary, which is practically very meaningful (see Section 3).
Corollary 1
Let m ∈ ℕ be a constant. Optimal networks, in which no gene has more than m parents, can be found in O(n·2ⁿ) dynamic programming steps.
If we do not want to limit the number of parents by a constant, but instead can select for each gene a fixed number of candidate parents, the complexity changes as follows.
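The practical effect of bounding the in-degree can be seen by counting the score evaluations in Step 2, which dominate the running time in practice. This is a back-of-the-envelope sketch; the concrete numbers (n = 30, m = 6) are our own illustrative choices.

```python
from math import comb

def score_calls_unbounded(n):
    # Step 2 evaluates s(g, A) once for each gene g and each A subseteq G - {g}.
    return n * 2 ** (n - 1)

def score_calls_bounded(n, m):
    # With at most m parents (Corollary 1), only parent sets of size <= m
    # require a score computation; the count is polynomial in n for fixed m.
    return n * sum(comb(n - 1, k) for k in range(m + 1))

print(score_calls_unbounded(30))   # about 1.6e10 score computations
print(score_calls_bounded(30, 6))  # about 1.9e7 score computations
```

For 30 genes and at most 6 parents, the number of score evaluations drops by roughly three orders of magnitude, although the O(2ⁿ) dynamic programming steps of Steps 3-4 remain.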
Corollary 2
Let m ∈ ℕ be a constant. For each g ∈ G, let C_g ⊆ G be a set with |C_g| ≤ m. Optimal networks, in which each gene g has parents only in C_g, can be found in O(2ⁿ) dynamic programming steps.
Proof. Since the parents of each gene are selected from a set of constant size, the complexity of the dynamic programming in Step 1 and Step 2 becomes
constant. Therefore, the overall complexity becomes O(2ⁿ). □
We note that the two applications of dynamic programming in our algorithm can be implemented as a single application of dynamic programming, because when we compute the function M for a set of size m, we only need values of the function F for sets of size m − 1. Therefore, only the values of the functions F and M for sets of size m − 1 and m need to be stored in memory at the same time. This is practically meaningful for reducing the required amount of memory. We also note that the algorithm can be modified to also compute suboptimal solutions. Computing the second-best or the third-best network might be valuable in order to assess the stability of the inferred networks under marginal changes of the score.
3 Results
The algorithm described above was implemented as a C++ program. As scoring functions, existing implementations of the BNRC score, the BDe score and the MDL score are used. All three approaches (Theorem 1, Corollary 1 and Corollary 2) were implemented. We applied the program to a dataset of 173 microarrays measuring the response of Saccharomyces cerevisiae to various stress conditions.
3.1 Application to Heat Shock Data

From the dataset we selected 15 microarrays from 25°C to 37°C heat shock experiments and 5 microarrays from heat shock experiments from various temperatures to 37°C. Then we selected a set of 9 genes which are involved or putatively involved in the heat shock response. Figure 1 shows the optimal network with respect to the BNRC score. We observe that the transcription factor MCM1 is estimated to regulate three other genes, while it is not regulated by any of the genes in this set, which is plausible. The second transcription factor in our set of genes, HSF1, is estimated to regulate three other heat shock genes. It is also estimated to be regulated by an HSP70 protein (SSA1), which was reported before [16]. Another chaperone among these genes, SSA3, also seems to play an active role in the heat shock response and interacts with SSA1 and HSP104, coinciding with a report by Glover and Lindquist [6]. Overall, the result is biologically plausible and gives an indication of the active role of the chaperones SSA1 and SSA3 during the heat shock response.
We conclude that optimally inferred gene networks are meaningful and useful for the elucidation of gene regulation.

gene     annotation
HSF1     heat shock transcription factor
SSA1     ER and mitochondrial translocation, cytosolic HSP70
SSA3     ER and mitochondrial translocation, cytosolic HSP70
HIG1     heat shock response, heat-induced protein
HSP104   heat shock response, thermotolerance heat shock protein
MCM1     transcription, multifunctional regulator
HSP82    protein folding, HSP90 homolog
YRO2     unknown, putative heat shock protein
HSP26    diauxic shift, stress-induced protein

3.2 Computational Possibilities and Limitations
While even networks of small scale like the network inferred in Section 3.1 cannot be inferred with a brute-force approach (Eqn. 1), they can be optimally inferred by our program using a single 1.9 GHz Pentium CPU in about 10 minutes. In order to evaluate the practical possibilities of this approach, we selected 20 genes with a known active role in gene regulation from the
data set and estimated a network with optimal BNRC score using all 173 microarrays. The computation finished within about 50 hours using a Sun Fire 15K supercomputer with 96 CPUs at 900 MHz each. As a result of this computational experiment, we conclude that our method is feasible for gene networks of 20 genes, even if no constraints are made and a complex scoring scheme like the BNRC score is used. For the discrete scores BDe and MDL, which can be computed much faster, even networks of more than 20 genes can be inferred optimally without constraints. When the number of parents is limited to about 6 (Corollary 1) or, alternatively, sets of about 20 candidate parents are preselected (Corollary 2), gene networks of more than 30 genes can be inferred optimally even with the BNRC score. However, the method as it is now will not allow the estimation of networks of more than about 40 genes.
While the theoretical time complexity of the approach given in Corollary 2 is below the time complexity of the approach given in Corollary 1, we argue that the latter might be practically more important. First, limiting the number of parents by a constant can be done easily and is biologically justified, while selecting a set of candidate parents for each gene requires a method of gene selection, which can potentially bias the computation result. Second, it has to be considered that each dynamic programming step in the computation of the function F requires the computation of one score, while one dynamic programming step for the function M only requires looking up some previous results. When the number of parents is limited as in Corollary 1, the required number of score calculations becomes polynomial, which makes this approach faster in practical applications, though the approach in Corollary 2 is theoretically superior.

4 Conclusion
We have presented a method that allows gene networks of 20-40 genes to be inferred optimally, depending on the probability distribution used and on whether additional assumptions are made. This makes it possible to compare different scoring schemes, to assess the best parameters for a given scoring scheme, and to evaluate the usefulness of given microarray data, since optimal solutions are obtained. Also, the method is especially useful in settings where researchers focus on a certain group of genes and want to exploit gene expression measurements concerning these genes to the full extent. In contrast to heuristic approaches, if the results are unsatisfying or contradictory to biological knowledge, it can be concluded that the statistical model is incorrect or the data is insufficient. Even for a network of 20 genes,
knowing the best network from the huge search space is a large amount of information. We note that the method is not dependent on a certain scoring scheme or a certain kind of gene expression measurements. It can be applied in any setting where a score as defined in Section 2 is given. For example, when sequence information [19], protein interaction data [10], or other knowledge is incorporated in the score function, this method can also be applied. In order to find gene networks with more than 40 genes, two directions of future work open up. First, if a part of the set of subsets in which the algorithm performs the actual search can be pruned, the limit of feasibility might be increased. Second, compartmentalization of gene networks might be used to decompose larger networks into smaller parts, and each partial network inferred optimally.
Acknowledgements

The authors would like to thank Michiel de Hoon for discussions of the manuscript, and Hideo Bannai for advice on implementational issues.
References
1. D.M. Chickering. Learning Bayesian networks is NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V, Springer-Verlag, 1996.
2. G.F. Cooper, E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9: 309-347, 1992.
3. N. Friedman, M. Goldszmidt. Learning Bayesian networks with local structure. In M.I. Jordan, editor, Learning in Graphical Models, Kluwer Academic Publishers, pp. 421-459, 1998.
4. N. Friedman, M. Linial, I. Nachman, D. Pe'er. Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7: 601-620, 2000.
5. A.P. Gasch, et al. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell, 11: 4241-4257, 2000.
6. J.R. Glover, S. Lindquist. Hsp104, Hsp70, and Hsp40: a novel chaperone system that rescues previously aggregated proteins. Cell, 94: 73-82, 1998.
7. A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young. Using graphical models and genomic expression data to statistically validate models
of genetic regulatory networks. Pacific Symposium on Biocomputing, 6: 422-433, 2001.
8. A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, R.A. Young. Combining location and expression data for principled discovery of genetic regulatory network models. Pacific Symposium on Biocomputing, 7: 437-449, 2002.
9. S. Imoto, T. Goto, S. Miyano. Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. Pacific Symposium on Biocomputing, 7: 175-186, 2002.
10. T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, Y. Sakaki. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences USA, 97: 4569-4574, 2001.
11. S. Imoto, S. Kim, T. Goto, S. Aburatani, K. Tashiro, S. Kuhara, S. Miyano. Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology, in press, 2003.
12. T.I. Lee, N.J. Rinaldi, F. Robert, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298: 799-804, 2002.
13. I.M. Ong, J.D. Glasner, D. Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18: 241-248, 2002.
14. D. Pe'er, A. Regev, G. Elidan, N. Friedman. Inferring subnetworks from perturbed expression profiles. Bioinformatics, 17: 215-224, 2001.
15. R.W. Robinson. Counting labeled acyclic digraphs. New Directions in the Theory of Graphs, pp. 239-273, 1973.
16. Y. Shi, D.D. Mosser, R.I. Morimoto. Molecular chaperones as HSF1-specific transcriptional repressors. Genes & Development, 12: 654-666, 1998.
17. V.A. Smith, E.D. Jarvis, A.J. Hartemink. Evaluating functional network inference using simulations of complex biological systems. Bioinformatics, 18: 216-224, 2002.
18. E.P. van Someren, L.F.A. Wessels, E. Backer, M.J.T. Reinders. Genetic network modeling. Pharmacogenomics, 3(4): 507-525, 2002.
19. Y. Tamada, S.
Kim, H. Bannai, S. Imoto, K. Tashiro, S. Kuhara, S. Miyano. Estimating gene networks from gene expression data by combining Bayesian network model with promoter element detection. Bioinformatics, in press, 2003.
PATHWAY LOGIC MODELING OF PROTEIN FUNCTIONAL DOMAINS IN SIGNAL TRANSDUCTION

C. TALCOTT, S. EKER, M. KNAPP, P. LINCOLN, K. LADEROUTE
SRI International, 333 Ravenswood Avenue, Menlo Park CA 94025
(firstname.lastname)@sri.com

Abstract
Protein functional domains (PFDs) are consensus sequences within signaling molecules that recognize and assemble other signaling components into complexes. Here we describe the application of an approach called Pathway Logic to the symbolic modeling of signal transduction networks at the level of PFDs. These models are developed using Maude, a symbolic language founded on rewriting logic. Models can be queried (analyzed) using the execution, search and model-checking tools of Maude. We show how signal transduction processes can be modeled using Maude at very different levels of abstraction involving either an overall state of a protein or its PFDs and their interactions. The key insight for the latter is our algebraic representation of binding interactions as a graph.
1 Introduction

There is a practical need to represent very large biological networks of all kinds as models at different levels of abstraction. For example, consider the following:
- The proteome of eukaryotic cells is at least an order of magnitude larger than the genome (very large and diverse protein networks)
- A large fraction of the genome of mammalian cells (≈ 10% of the human genome) encodes genomic regulators, producing very large regulatory networks of the genome itself
- Biological networks interact as modules/subnetworks to produce high levels of physiological organization (e.g., circadian clock subnetworks are integrated with metabolic, survival, and growth subnetworks)
In silico models of such networks would be valuable but must have certain features. In particular, they must be easily modified (extended or updated) and usable by bench researchers for formulating and testing hypotheses about how signals and other changes are propagated. Pathway Logic [1,2] is an application of techniques from formal methods and rewriting logic to develop models of biological processes. The goals of the Pathway Logic work include: building network models that working biologists and biomedical researchers can interact with and modify; making formal methods tools accessible to the general biological and biomedical research community; and enabling wet-lab researchers to generate informed hypotheses about complex biological networks.
The Pathway Logic work has initially focused on curation of models of signal transduction networks, including the Epidermal Growth Factor Receptor (EGFR) network and closely related networks [4,5,6]. Signal transduction processes are modeled at different levels of abstraction involving: (I) the overall state of proteins, or (II) protein functional domains (PFDs) and their interactions. These signaling networks can be queried using formal methods tools, for example, by choosing an initial condition and trying the following: (i) execution: show me some signaling pathway; (ii) search: show me all pathways leading to a specified final condition; or (iii) model-checking: is there a pathway with certain given properties? In this paper we use the recruitment and activation of the ubiquitous Raf1 serine-threonine protein kinase to illustrate the two levels of representation and in particular to show how PFDs are modeled and how the resulting model can be used. This more detailed representation of signaling proteins in which PFDs are explicit can be used to model domain-specific interactions in signaling networks, an important area of modern signal transduction research. Future work includes expanding the collection of proteins modeled at the level of PFD interactions as data becomes available, modeling additional signal transduction networks, and modeling metabolic pathways and their interactions with signal transduction pathways.

1.1 Formal Methods in Biology
Formal methods techniques have been used by various groups to develop executable models of biological systems at high levels of abstraction. Typically the techniques are based on a model of concurrent computation with associated formal languages for describing system behavior and tools for simulation and analysis. Petri nets were developed to specify and analyze concurrent systems. There are many variants of the Petri net formalism and a variety of languages and tools for specification and analysis of systems using the Petri net model [7]. Petri nets have a graphical representation that corresponds naturally to conventional representations of biochemical networks. They have been used to model metabolic pathways and simple genetic networks (examples include [8,9,10,11]). However, these efforts have largely been concerned with kinetic or stochastic models of biochemistry. In [12] a more abstract and qualitative view was taken, mapping biochemical concepts such as stoichiometry, flux modes, and conservation relations to well-known Petri net theory concepts. The pi-calculus [13] is a process algebra originally developed for describing concurrent computer processes. There are a number of specification languages and tools based on the pi-calculus. A pi-calculus model for the receptor tyrosine kinase/mitogen-activated protein kinase (RTK-MAPK) signal transduction pathway is presented in [14]. Signaling proteins are represented as processes and interactions as synchronous communications between processes (handshakes).
A stochastic variant of the pi-calculus is used in [15] to model both the time and probability of biochemical reactions. Statecharts are a visual notation for specifying reactive concurrent systems [16], used in object-oriented software design methodologies. Statecharts naturally express compartmentalization and hierarchical processes as well as flow of control amongst subprocesses. The resulting models can be used for simulation and visualization of biochemical processes. Statecharts have been used to model biological processes such as T-cell activation [17,18]. Live Sequence Charts [19] are an extension of the Message Sequence Charts modeling notation for system design. Using the associated Play-In/Play-Out approach, models can be built and tested by acting out reaction scenarios. Models of subsystems can be combined, and charts can be annotated with assertions that allow invariants and prohibited conditions to be expressed and checked. This approach has been used to model the process of cell fate acquisition during C. elegans vulval development [20].

1.2 Pathway Logic
Pathway Logic is an approach to modeling biological entities and processes based on formal methods and rewriting logic [3]. Pathway Logic models are developed using the Maude (http://maude.csl.sri.com) system, a formal language and tool set based on rewriting logic. Like the approaches to modeling biological processes mentioned above, Pathway Logic models are executable, hence they can be used for simulation. In addition, the Maude system provides search and model-checking capabilities. Using the search capability, all possible future states of a system can be computed to show its evolution from a given initial state (specified by the states of individual components) in response to a stimulus or perturbation. Using model-checking, a system in a given initial state can be shown to never exhibit pathways with certain properties, or the model-checker can be used to produce a pathway with a given property (by trying to show that no such pathway exists). Using the reflective capability of Maude, models can be mapped to other formalisms and exported in formats suitable for input to other tools for additional analysis capabilities and visualization. Rewriting logic [3] is a logical formalism based on two simple ideas: states of a system are represented as elements of an algebraic data type, and the behavior of a system is given by local transitions between states described by abstractions called rewrite rules. In Pathway Logic, algebraic data types are used to represent concepts from cell biology needed to model signaling processes, including intracellular proteins, biochemicals such as second messengers, extracellular stimuli, biochemical modification of proteins, protein association, and cellular compartmentalization of proteins. Rewrite rules are used to model local processes within a cell or transmission of a signal across a cell membrane. A signaling network is represented as a collection of rewrite rules together with the algebraic declarations. Rewriting logic then allows reasoning about possible complex changes given the basic changes (rules) specified by the model. In particular, pathways in the network satisfying different properties can be generated automatically using tools based on logical inference for execution (deduction), search, and model-checking.
2 Activation of Raf1 modeled at two levels

A Pathway Logic model of the Epidermal Growth Factor Receptor (EGFR) network (reviewed in [4,5,6]) is being developed by curating rewrite rules for relevant biochemical processes from the scientific literature. Depending on what data is available, processes are modeled at different levels of abstraction. Level I rules model processes in terms of overall protein states. Protein functional domains (PFDs) are consensus sequences within signaling molecules that recognize and bind other signaling components to make complexes. When there is enough information about a protein and the domains it contains to hypothesize the details of activation and translocation, Level II rules are developed. These rules model processes in terms of protein functional domains, and explicit posttranslational modifications of individual signaling molecules are included in the model. A key idea for the Level II rules is the representation of PFDs and their interactions algebraically as a graph. Here we use the recruitment and activation of the ubiquitous Raf1 serine-threonine protein kinase to illustrate the two levels of representation. The Raf1 system is a reasonably well-established and detailed example of a signal integrator in the EGFR network [21,22]. The Raf1 kinase is an effector of EGFR and other RTK signaling through the ERK1/2 MAPK pathway, which is organized in a module that can be represented by the kinase cascade MAPKKK → MAPKK → MAPK (reviewed in [5]). In this module, Raf1 is a MAPKKK.

2.1 Activation of Raf1 at Level I
An early step in the activation of Raf1 is recruitment of cytoplasmic Raf1 to the inner side of the cell membrane by Ras, following stimulation of the EGFR. Figure 1 shows both a graphical representation and the Maude representation (from which the picture is generated) of the Level I rule 280 modeling the activation of Raf1 and its recruitment to the cell membrane. This rule says that if the cell contains a Ras type protein with a GTP modification, activated Pak and Src protein kinases on the interior side of the cell membrane, and Raf1, phosphorylated 14-3-3 scaffold/adaptor proteins, and the phosphatase PP2A in the cytoplasm, then Raf1 can be activated and recruited to the membrane along with 14-3-3, leaving PP2A in the cytoplasm. In Maude a cell is represented by a term of the form {CM | ... {...}} where the first ellipsis stands for biochemicals in or attached to the interior of the
crl[280.?Ras.?Pak.Src.PP2A.?14-3-3.->.Raf1] :
  {CM | cm [?Ras - GTP] [?Pak - act] [Src - act]
    {cyto Raf1 [?14-3-3 - phos] PP2A}}
  =>
  {CM | cm [?Ras - GTP] [?Pak - act] [Src - act] [Raf1 - act] [?14-3-3 - phos]
    {cyto PP2A}}
  if ?Ras S:Soup := N-Ras K-Ras H-Ras
  [metadata "21192014(R)"] .
Figure 1: Raf1 activation rule (Level I)
cell membrane, and the second ellipsis stands for the biochemicals and compartments in the cytoplasm. A particular cell state is represented by replacing the ellipses by terms representing specific biochemicals and compartments. In a Maude rule the ellipses are replaced by patterns: terms with variables ranging over some set of biochemicals, represented as sorts in Maude. One of the sorts is Ras, representing the Ras type proteins. We use the convention that the name of a class of proteins prefixed by a ? is a variable ranging over the corresponding sort. Thus ?Ras can be instantiated to any of the proteins in the model declared to be of sort Ras. At Level I, posttranslational modification is represented abstractly by a modification operator [_ - _] applied to a protein and a set of abstract modifications. In the left-hand side of rule 280 the term [?Ras - GTP] represents a Ras type protein with a GTP modification, while the term [Src - act] represents activated Src protein kinase on the interior side of the cell membrane. The occurrences of Raf1, PP2A, and [?14-3-3 - phos] represent Raf1, PP2A and phosphorylated 14-3-3 in the cytoplasm. The variables cm and cyto serve as placeholders for any remaining unspecified biochemicals in (or on the interior side of) the cell membrane and the cytoplasm, respectively. In order to apply a set of rules to a particular cell, the components of that cell are formally represented as a multiset of ground terms (constants and other terms containing no variables) declared to be the initial cell state. A rule such as 280 is then applied to the cell by finding a substitution of components for the variables appearing in the left-hand side that makes it equal to the cell in question (matching), and replacing the cell by the result of applying the matching substitution
to the right-hand side of the rule. Representing cell contents using multisets means that the order in which individual components are listed does not matter, and the matching process takes this into consideration. With the above in mind we can see that application of rule 280 to the initial cell state:

eq cell = PD({CM | [N-Ras - GTP] [Pak1 - act] [Src - act]
    {Raf1 [14-3-3t - phos] PP2A}}) .
does indeed move Raf1 and 14-3-3 from the cytoplasm to the membrane, activating Raf1 and leaving the phosphorylation state of the 14-3-3 protein unchanged. The condition following the if in rule 280 constrains the matching protein found for the variable ?Ras to be one of those listed. The term [metadata "21192014(R)"] represents information that is not used in execution of the model but provides evidence and other useful information that can be used in other operations on the model. This particular metadata is the Medline citation for a paper used in curation of the rule. Level I rules have an alternative representation in terms of occurrences and transitions (corresponding to a special kind of Petri net). An occurrence is a biochemical paired with its location in the cell. For example, the occurrence of Raf1 on the left-hand side of the rule is represented by the pair < Raf1, cyto > and the pair < [Raf1 - act], cm > represents the occurrence on the right-hand side. A rule is then represented by a triple consisting of the multiset of left-hand side occurrences, the rule identifier, and the multiset of right-hand side occurrences. (Generic variables such as cm and cyto are ignored.) In the picture the occurrences are represented by ovals labelled by a printed form and the transition by a rectangle labeled with the rule identifier. Occurrences that appear only on the left-hand side are indicated by arrows from the oval to the rectangle, those that appear only on the right-hand side by arrows from the rectangle to the oval, and those that appear on both sides (enzymes, coenzymes) by dashed bidirectional arrows.
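The occurrence-based reading of Level I rules can be mimicked outside Maude as multiset rewriting over (location, biochemical) pairs. The sketch below is a toy Python rendering of rule 280 with the variables already instantiated (?Ras to N-Ras, ?Pak to Pak1, ?14-3-3 to 14-3-3t); it is illustrative only, uses naming conventions of our own, and is not part of Pathway Logic or Maude.

```python
from collections import Counter

# A cell state is a multiset (Counter) of occurrences: (location, entity) pairs.
# A rule is a pair of occurrence multisets (left-hand side, right-hand side).
RULE_280 = (
    Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
             ('cyto', 'Raf1'): 1, ('cyto', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1}),
    Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
             ('cm', 'Raf1@act'): 1, ('cm', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1}),
)

def apply_rule(state, rule):
    """Apply one rewrite step: if the left-hand side occurrences are all present,
    remove them and add the right-hand side occurrences; otherwise no match."""
    lhs, rhs = rule
    if any(state[occ] < n for occ, n in lhs.items()):
        return None                      # rule does not match this cell state
    return state - lhs + rhs             # replace matched occurrences

cell = Counter({('cm', 'N-Ras@GTP'): 1, ('cm', 'Pak1@act'): 1, ('cm', 'Src@act'): 1,
                ('cyto', 'Raf1'): 1, ('cyto', '14-3-3t@phos'): 1, ('cyto', 'PP2A'): 1})
after = apply_rule(cell, RULE_280)
```

As in the Maude rule, one application moves Raf1 (now activated) and 14-3-3t to the membrane while PP2A stays in the cytoplasm, and the rule then no longer matches the resulting state.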
2.2 Activation of Rafl at Level II
The difference between a Level I rule and a Level II rule is that a Level I rule deals with interactions between whole proteins whereas a Level II rule deals with interactions between protein domains. In Level I, Rafl is considered to be inactive by (1) not having the modification "act" and (2) being located in the cytoplasm. In Level II the phosphorylation states of relevant amino acids, and the domains and sites which are bound intra- or inter-molecularly, are made explicit. Based on work by Dhillon and Kolch22 (augmented with details from a number of other publications) we drew, by hand, a stylized diagram of a possible Rafl activation process (Figure 2). The diagram is focused on the Rafl protein. Rafl is represented as a list of domains (blue bars) and potential phosphorylation sites
(lavender bars) relevant to the interaction being studied. Phosphorylation is indicated by a button labeled P hanging below the site bar. Other proteins binding to Rafl are represented by a bar labeled by the bound domain and the protein name. Those above the Rafl list (red) are in or attached to the cell membrane (also indicated by [CM]), and those below (green) are in the cytoplasm. The first row of the diagram represents inactive Rafl. It is associated with a dimer of 14-3-3 scaffold/adaptor proteins through binding of phosphorylated serines 259 and 621 in Rafl to serine binding domains (SBD) in the 14-3-3 dimer. In the diagram the 14-3-3 dimer is represented by the two 14-3-3 binding domains (green bars) and the line connecting these domains to each other. The arrows in the diagram indicate the progression of the activation process; the arrow labels give a description of the rule governing the interaction and indicate the key triggering biochemistry. For example, the trigger for Raf rule #1 is activated PKCz ([PKCz - act]). Based on this diagram, rules were written to model the steps of Rafl activation. To represent the functional domains of a signaling protein explicitly, we annotate proteins using the notation [p:Protein | atts:Atts]. Here atts:Atts is a set of attributes representing one or more PFDs or amino acid residues (sites). Each attribute may have associated modifications such as phosphorylation (phos) or an indication that the domain/site is participating in a binding (bound). Thus, a protein at Level II can be thought of as an encapsulated collection of functional domains and sites. The association or binding of signaling proteins through their functional domains is explicitly represented by edges in a graph whose nodes are protein-attribute pairs. For example, the inactivated form of Rafl shown in the first row of Figure 2 is represented by the right-hand side of the following Maude equation.
eq Rafl.inact =
  [Rafl | (S 43), RBD, C1, (S 259 - phos - bound), (S 338), (Y 341),
          PABM, (S 621 - phos - bound)]
  [14-3-3a | (SBD - bound), (DMD - bound)]
  [14-3-3b | (SBD - bound), (DMD - bound), (T 141 - phos)]
  e((Rafl,(S 621)), (14-3-3a,SBD))
  e((Rafl,(S 259)), (14-3-3b,SBD))
  e((14-3-3a,DMD), (14-3-3b,DMD)) .
The attributes
(S 43), RBD, C1, (S 259 - phos - bound), (S 338), (Y 341), PABM, (S 621 - phos - bound)
correspond to the bars in Figure 2. The attribute (S 621 - phos - bound) denotes the site (S 621) with two modifications, phos and bound. The modification -phos on the sites S 259 and S 621 corresponds to the buttons labeled P, and the modification -bound is used to indicate locally that the attribute has a binding. In the Maude term the 14-3-3 dimer is represented by the two 14-3-3 protein terms and the edge e((14-3-3a,DMD), (14-3-3b,DMD)).
The two vertical lines connecting the phosphorylated sites on Rafl to the 14-3-3 dimer are represented in the Maude term by the edges
e((Rafl,(S 621)), (14-3-3a,SBD))
e((Rafl,(S 259)), (14-3-3b,SBD)) .
In the Level II representation the activation of Rafl, represented at Level I by the single rule 280, requires several rules in which structural features of some of the proteins, including Rafl, are annotated with information about relevant PFDs and binding sites, and the binding between proteins is made explicit. As an example, we show the Maude representation of the rule numbered 6 in the diagram, in which activated Src phosphorylates partially activated Rafl at tyrosine 341.
rl[Rafl#6.Y341phos]:
  {CM | cm PS PA [?Slk - act] [?Ras | GTPbound, (RafBD - bound)]
   [Rafl | (S 43), (S 259), (Y 341), (C1 - bound), (S 621 - phos - bound),
           (PABM - bound), (RBD - bound), rafl:Atts]
   [14-3-3a | (SBD - bound), (DMD - bound), la:Atts]
   [14-3-3b | SBD, (DMD - bound), (T 141 - phos)]
   e((14-3-3a,DMD), (14-3-3b,DMD))
   e((Rafl,(S 621)), (14-3-3a,SBD))
   e((Rafl,C1), b(PS))
   e((Rafl,PABM), b(PA))
   e((Rafl,RBD), (?Ras,RafBD))
   {cyto}}
  =>
  {CM | cm PS PA [?Slk - act] [?Ras | GTPbound, (RafBD - bound)]
   [Rafl | (S 43), (S 259), (Y 341 - phos), (C1 - bound),
           (S 621 - phos - bound), (PABM - bound), (RBD - bound), rafl:Atts]
   [14-3-3a | (SBD - bound), (DMD - bound), la:Atts]
   [14-3-3b | SBD, (DMD - bound), (T 141 - phos)]
   e((14-3-3a,DMD), (14-3-3b,DMD))
   e((Rafl,(S 621)), (14-3-3a,SBD))
   e((Rafl,C1), b(PS))
   e((Rafl,PABM), b(PA))
   e((Rafl,RBD), (?Ras,RafBD))
   {cyto}} .
The left-hand side of the rule matches a situation in which Rafl is associated with a dimer of 14-3-3 proteins through binding of phosphorylated serine 621 (represented by (S 621 - phos - bound)) to the serine-binding domain ((SBD - bound)) in the 14-3-3 dimer, represented by the edge e((Rafl,(S 621)), (14-3-3a,SBD)).
The additional requirements that Rafl must be bound to Ras, phosphatidylserine (PS), and phosphatidic acid (PA) are represented by the edges
e((Rafl,C1), b(PS))
e((Rafl,PABM), b(PA))
e((Rafl,RBD), (?Ras,RafBD))
where the terms b(PS) and b(PA) represent unspecified binding domains or sites on PS and PA, respectively. Notice that the representation of overall cell structure is the same, and that Level I and Level II notation for proteins can be mixed, using Level II detail only where relevant. For example, Src is used as a Level I protein (as a variable ?Slk) of sort Slk (Src-like kinase). In order for Rafl to be fully activated it must be phosphorylated on both Y341 (by a Src-like kinase) and on S338 (by a member of the Pak family). It is unclear whether Y341 or S338 is phosphorylated first. This is represented in Figure 2 by the branch in the sequence of rules. In the Maude representation, rule 6 deals with this ambiguity by using the variable rafl:Atts instead of requiring a particular phosphorylation state for S338. Rule 5 (not shown) similarly uses an attribute variable instead of requiring a particular phosphorylation state for Y341. The application of Level II rules follows the same procedure as for Level I. Although domains and sites have a fixed order within a protein sequence, in the Maude model we treat them as a set because the ordering information plays no role in the processes represented. (Some ordering information is implicit in the site numbers and could easily be added if required for other purposes.) Level II rules for Rafl are connected to Level I by the equational rule shown above that converts the Level I representation Rafl.inact of inactivated Rafl to its Level II representation, and a dual rule that converts the Level II complex representing activated Rafl to its Level I representation (rule 7 in the pathway shown below).
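The rule-application procedure described above (match a multiset of occurrences, delete the left-hand-side occurrences, add the right-hand-side ones) can be sketched outside Maude. The following Python sketch is purely illustrative, not Pathway Logic code; the occurrence pairs are loosely based on the Level I rule 280, and all names are simplified.

```python
from collections import Counter

def apply_rule(state, lhs, rhs):
    """Fire a transition: if every left-hand-side occurrence is present in the
    state multiset, remove the LHS occurrences and add the RHS ones."""
    state = Counter(state)
    if all(state[occ] >= n for occ, n in lhs.items()):
        state.subtract(lhs)
        state.update(rhs)
        return +state          # unary + drops zero counts
    return None                # the rule does not match this state

# Roughly rule 280: Rafl and 14-3-3 move from cytoplasm (cyto) to the cell
# membrane (cm); Rafl becomes activated, 14-3-3 keeps its phosphorylation.
lhs = Counter({("Rafl", "cyto"): 1, ("14-3-3 - phos", "cyto"): 1,
               ("N-Ras - GTP", "cm"): 1})
rhs = Counter({("Rafl - act", "cm"): 1, ("14-3-3 - phos", "cm"): 1,
               ("N-Ras - GTP", "cm"): 1})   # enzyme-like: on both sides

cell = Counter({("Rafl", "cyto"): 1, ("14-3-3 - phos", "cyto"): 1,
                ("N-Ras - GTP", "cm"): 1, ("PP2A", "cyto"): 1})
new_cell = apply_rule(cell, lhs, rhs)
```

Because the state is a multiset, components may be listed in any order, mirroring the matching behaviour described for the Maude rules.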
3 Using the Pathway Logic Model

We now illustrate some of the ways in which the tools supplied by Maude can be used to query and analyze a Pathway Logic model. To set a context for using the rules for Rafl activation at the PFD level (Level II) we define an initial cell state qraf containing inactive Rafl and the postulated necessary conditions to activate it.
eq qraf =
  PD({CM | PS PA [Pakl - act] [PKCz - act] [Src - act] [H-Ras - GTP]
       {Rafl.inact PP2A}}) .
The form PD( ... ) represents a cell in a Petri dish, possibly with some external signaling compounds. As a first example of using the model, the question "can Rafl in a cell described by qraf be activated?" is answered by defining a proposition praf0 that expresses the query and then using the findPath query.
eq PD( out {CM | cm [Rafl - act] {cyto}} ) |= praf0 = true .
The above equation says that the proposition praf0 is true for a cell if the dish containing it matches the pattern on the left.
The query findPath(qraf, praf0) uses the Maude model checker to find a counterexample to the assertion that no state satisfying praf0 can be reached from the initial state qraf by applying the rules of the model (in this case the equation for Rafl.inact and Raf rules 1-7). If a counterexample is found, the query function extracts a path giving the labels of the rules applied and the state reached that satisfies the property praf0. The Maude command red findPath(qraf, praf0) executes this query, returning the following.
result SimplePath:
  spath('Rafl#1.PKCz 'Rafl#2.PP2A 'Rafl#3.PS.PA 'Rafl#4.Ras
        'Rafl#5.S338phos 'Rafl#6.Y341phos 'Rafl#7.Rafl.is.act,
        PD({CM | PA PS [Pakl - act] [PKCz - act] [Rafl - act]
             [H-Ras - GTP] [Src - act] {14-3-3b PP2A 14-3-3a}}))
The label 'Rafl#7.Rafl.is.act refers to a rule that converts the Rafl complex from Level II to Level I to connect with downstream Level I rules. To determine if other pathways are possible, we use the search command
search qraf =>! d:Dish .
to ask for all paths leading to a final state (a state to which no more rewrite rules apply). The answer here is that there is one final state, the one found by the above query, and two paths. The second path differs from the first only in the order in which rules 5 and 6 are applied. In general we might discover quite different pathways to a given final state, and/or more than one possible final state. The findPath query can also be used to check whether a model can generate expected intermediate states. For example, proposition praf1 expresses the property that a certain collection of bindings occurs.
eq PD( out {CM | cm e((Rafl,(S 621)), (14-3-3a,SBD)) e((Rafl,C1), b(PS))
        e((Rafl,PABM), b(PA)) e((14-3-3a,DMD), (14-3-3b,DMD)) {cyto}} )
   |= praf1 = true .
Executing the query findPath(qraf, praf1) results in a path in which rules 1, 2, and 3 have been applied. Although these results seem satisfactory, we might be concerned that the rules could also generate impossible or unlikely states, such as one in which Rafl is bound to both 14-3-3's in the dimer as well as being bound to PS and PA. To determine whether this possibility is predicted by the model, we can search for a cell state satisfying praf2, defined by matching the pattern
PD( out {CM | cm [H-Ras - GTP] e((14-3-3a,DMD), (14-3-3b,DMD))
     e((Rafl,(S 621)), (14-3-3a,SBD)) e((Rafl,(S 259)), (14-3-3b,SBD))
     e((Rafl,C1), b(PS)) e((Rafl,PABM), b(PA)) {cyto}} )
Indeed, executing the query findPath(qraf, praf2) confirms that such a state is not reachable: Maude returns the result (noPath).SimplePath.
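The behaviour of a findPath-style query can be mimicked with an explicit-state search: treat each dish state as a set of facts, each rule as a (label, precondition, postcondition) triple, and breadth-first search for a state satisfying the property. This is only a conceptual sketch with hypothetical rule names; Maude's model checker is far more general.

```python
from collections import deque

def find_path(initial, rules, prop):
    """Return the rule labels on a shortest path from `initial` to a state
    satisfying `prop`, or None if no such state is reachable."""
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, path = queue.popleft()
        if prop(state):
            return path
        for label, pre, post in rules:
            if pre <= state:                       # all preconditions present
                nxt = frozenset((state - pre) | post)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [label]))
    return None                                    # analogous to (noPath)

# Two toy rules standing in for the Rafl activation sequence.
rules = [("Rafl#1", frozenset({"Rafl.inact"}), frozenset({"Rafl.partial"})),
         ("Rafl#7", frozenset({"Rafl.partial"}), frozenset({"Rafl.act"}))]
path = find_path(frozenset({"Rafl.inact"}), rules, lambda s: "Rafl.act" in s)
```

A reachable property yields a path of rule labels, as in the spath result above; an unreachable one yields None, the analogue of (noPath).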
4 Conclusions

Pathway Logic is an example of how logical formalisms and formal modeling techniques can be used to develop a new science of symbolic systems biology. We believe that this computational science will provide researchers with powerful tools to facilitate the understanding of complex biological systems and accelerate the design of experiments to test hypotheses about their functions in vivo. In particular, we are interested in formalizing models that biologists can use to think about signaling pathways and other processes in familiar terms while allowing them to computationally ask questions about possible outcomes. Here we have exemplified our approach using the biochemistry of signaling involving the mammalian Rafl protein kinase. The use of a logic such as rewriting logic for this kind of modeling has many practical benefits, including the ability to (1) build and analyze models with multiple levels of detail, (2) represent general rules, (3) define new kinds of data and properties, and (4) execute queries using logical inference. Model validation is done both by experimental testing of predictions and by using the analysis tools to check consistency with known results. Already the Pathway Logic models are useful for clarifying and organizing experimental data from the literature. The eventual goal is to reach a level of maturity that supports prediction of new and possibly unexpected results.
Acknowledgments

We thank the anonymous reviewers for their helpful criticisms. This work was supported in part by grant CA73807 from the National Institutes of Health (KL). Maude tool development has been supported by NSF grants CCR-9900326 and CCR-9900334, and DARPA through Air Force Research Laboratory Contract F30602-02-C-0130.
References

1. S. Eker et al. Pathway logic: Symbolic analysis of biological signaling. In Proceedings of the Pacific Symposium on Biocomputing, pages 400-412, January 2002.
2. S. Eker, M. Knapp, K. Laderoute, P. Lincoln, and C. Talcott. Pathway logic: Executable models of biological networks. In Fourth International Workshop on Rewriting Logic and Its Applications (WRLA'2002), 2002. http://www.elsevier.nl/locate/entcs/volume71.html.
3. J. Meseguer. Conditional rewriting logic as a unified model of concurrency. Theoretical Computer Science, 96(1):73-155, 1992.
4. J. M. Kyriakis and J. Avruch. Mammalian mitogen-activated protein kinase signal transduction pathways activated by stress and inflammation. Physiol. Rev., 81:807-869, 2001.
5. G. Pearson et al. Mitogen-activated protein (MAP) kinase pathways: regulation and physiological functions. Endocr. Rev., pages 153-183, 2001.
6. J. D. Jordan, E. Landau, and R. Iyengar. Signaling networks: The origins of cellular multitasking. Cell, 103:193-200, 2000.
7. J. L. Peterson. Petri Nets: Properties, Analysis, and Applications. Prentice-Hall, 1981.
8. P. J. Goss and J. Peccoud. Quantitative modeling of stochastic systems in molecular biology using stochastic Petri nets. Proc. Natl. Acad. Sci. U. S. A., 95:6750-6755, 1998.
9. H. Matsuno, A. Doi, M. Nagasaki, and S. Miyano. Hybrid Petri net representation of gene regulatory network. In Pacific Symposium on Biocomputing, volume 5, pages 341-352, 2000.
10. H. Genrich, R. Kuffner, and K. Voss. Executable Petri net models for the analysis of metabolic pathways. Int. J. STTT, 3, 2001.
11. J. S. Oliveira et al. A computational model for the identification of biochemical pathways in the Krebs cycle. J. Computational Biology, 10:57-82, 2003.
12. I. Zevedei-Oancea and S. Schuster. Topological analysis of metabolic networks based on Petri net theory. In Silico Biology, 3(0029), 2003.
13. R. Milner. Communication and Concurrency. Prentice Hall, 1989.
14. A. Regev, W. Silverman, and E. Shapiro. Representation and simulation of biochemical processes using the pi-calculus process algebra. In R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein, editors, Pacific Symposium on Biocomputing, volume 6, pages 459-470. World Scientific Press, 2001.
15. C. Priami, A. Regev, E. Shapiro, and W. Silverman. Application of a stochastic name-passing calculus to representation and simulation of molecular processes. Information Processing Letters, 2001. In press.
16. D. Harel. Statecharts: A visual formalism for complex systems. Science of Computer Programming, 8:231-274, 1987.
17. N. Kam, I. R. Cohen, and D. Harel. The immune system as a reactive system: Modeling T cell activation with statecharts. Bulletin of Mathematical Biology, 2002. To appear.
18. S. Efroni, D. Harel, and I. R. Cohen. Towards rigorous comprehension of biological complexity: Modeling, execution and visualization of thymic T-cell maturation. Genome Research, 2003. Special issue on Systems Biology, in press.
19. W. Damm and D. Harel. Breathing life into message sequence charts. Formal Methods in System Design, 19(1), 2001.
20. N. Kam et al. Formal modeling of C. elegans development: A scenario-based approach. In First International Workshop on Computational Methods in Systems Biology, volume 2602 of Lecture Notes in Computer Science, pages 4-20. Springer, 2003.
21. W. Kolch. Meaningful relationships: The regulation of the Ras/Raf/MEK/ERK pathway by protein interactions. Biochem. J., 351:289-305, 2000.
22. A. S. Dhillon and W. Kolch. Untying the regulation of the Raf-1 kinase. Arch. Biochem. Biophys., 404:3-9, 2002.
MODELING GENE EXPRESSION FROM MICROARRAY EXPRESSION DATA WITH STATE-SPACE EQUATIONS

F. X. WU, W. J. ZHANG, A. J. KUSALIK
Division of Biomedical Engineering, Department of Computer Science, University of Saskatchewan, 57 Campus Dr., Saskatoon, SK, S7N 5A9, CANADA
[email protected]; [email protected]; [email protected]

We describe a new method to model gene expression from time-course gene expression data. The modelling is in terms of state-space descriptions of linear systems. A cell can be considered to be a system where the behaviours (responses) of the cell depend completely on the current internal state plus any external inputs. The gene expression levels in the cell provide information about the behaviours of the cell. In previously proposed methods, genes were viewed as internal state variables of a cellular system and their expression levels were the values of the internal state variables. This viewpoint suffers from underdetermination of the model parameters. Instead, we view genes as the observation variables, whose expression values depend on the current internal state variables and any external inputs. Factor analysis is used to identify the internal state variables, and the Bayesian Information Criterion (BIC) is used to determine the number of internal state variables. By building dynamic equations of the internal state variables and the relationships between the internal state variables and the observation variables (gene expression profiles), we obtain state-space descriptions of a gene expression model. In the present method, model parameters may be unambiguously identified from time-course gene expression data. We apply the method to two time-course gene expression datasets to illustrate it.
1. Introduction

With advances in DNA microarray technology and genome sequencing, it has become possible to measure gene expression levels on a genomic scale. Data thus collected promise to enhance fundamental understanding of life on the molecular level, from regulation of gene expression and gene function to cellular mechanisms, and may prove useful in medical diagnosis, treatment, and drug design. Analysis of these data requires mathematical tools that are adaptable to the large scale of the data, and capable of reducing the complexity of the data to make it comprehensible. Substantial effort is being made to build models to analyze such data. Non-hierarchical clustering techniques such as k-means clustering are a class of mixture model-based approaches. They group genes with similar expression patterns and have already proven useful in identifying genes that contribute to common functions and are therefore likely to be coregulated. However, as pointed out by Holter et al., whether information about the underlying genetic architecture and regulatory interconnections can be derived from the analysis of gene expression patterns remains to be determined. It is also important to note that models based on clustering analysis are static and thus cannot describe the dynamic evolution of gene expression.
Boolean networks can be applied to gene expression, where a gene's expression (state) is simplified to being either completely "on" or "off". These states are often represented by the binary values 1 and 0, respectively, and the state of a gene is determined by a Boolean function of the states of other genes. The functions can be represented in tables, or as rules. An example of the latter is "if gene A is 'on' AND either gene B OR C is 'off' at time t, then gene D is 'on' at time t + Δt". As the system proceeds from one state (or time point) to the next, the pattern of currently expressed/non-expressed genes is used as input to rules which specify which genes will be "on" at the next state or time point. Somogyi and Sniegoski showed that such Boolean networks have features similar to those in biological systems, such as global complex behaviour, self-organization, stability, redundancy, and periodicity. Liang et al. described an algorithm for inferring genetic network architectures from the rules table of a Boolean network model. Their computational experiments showed that a small number of state transition pairs are sufficient to infer the original network. Akutsu et al. devised a much simpler algorithm for the same problem and proved that if the in-degree of each node (i.e., the number of input nodes to each node) is bounded by a constant h, only O(log n) state transition pairs (from the 2^n possible pairs) are necessary and sufficient to identify the original Boolean network of n nodes (genes) correctly with high probability. However, Boolean network models depend on simplifying assumptions about biological systems. For example, by treating gene expression as either completely "on" or "off", these models ignore those genes that have a range of expression levels and can have regulatory effects at intermediate expression levels. Therefore they ignore those regulatory genes that influence the transcription of other genes to variable degrees.
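The quoted rule style can be made concrete with a toy network. The gene names and update functions below are illustrative only, chosen to mirror the rule "if gene A is 'on' AND either gene B OR C is 'off' at time t, then gene D is 'on' at time t + Δt":

```python
# A minimal synchronous Boolean-network step: every gene's next state is a
# Boolean function of the current states of the other genes.

def step(state):
    """Compute the next state of each gene from the current state."""
    return {
        "A": state["A"],                                 # A holds its value
        "B": not state["A"],                             # A represses B
        "C": state["C"],                                 # C is constant here
        "D": state["A"] and (not state["B"] or not state["C"]),
    }

s0 = {"A": True, "B": True, "C": False, "D": False}
s1 = step(s0)   # D turns on, since A is on and C is off
```

Iterating step from an initial state yields the state-transition trajectory that network-inference algorithms such as those of Liang et al. and Akutsu et al. take as input.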
In addition to Boolean network models (of discrete variables), dynamic models (of continuous variables) have also been applied to gene expression. Chen et al. proposed a differential equation model of gene expression. Due to the lack of gene expression data, the model is usually underdetermined. Using the additional requirement that the gene regulatory network should be sparse, they showed that the model can be constructed in O(n^(h+1)) time, where n is the number of genes and/or proteins in the model and h is the maximum number of nonzero coefficients (connectivity degree of genes in a regulatory network) allowed for each differential equation in the model. In order that the parameters of the models are identifiable, both Chen et al. and Akutsu et al. assume that all genes have a fixed maximum connectivity degree h (often small). These assumptions obviously contradict biological reality. For instance, some genes are known to have many regulatory inputs, while others are not known to have more than a few. Another shortcoming of the previous work is that the fixed maximum connectivity degree h of Chen et al. is chosen in an ad hoc manner. De Hoon et al. considered Chen's differential model and used Akaike's Information Criterion (AIC) to determine the connectivity degree h of each gene. In their method, not all
genes must have a fixed connectivity. However, they do not present an efficient algorithm to identify the parameters of their differential equation model; the brute-force algorithm used in the paper has a computational complexity of O(2^n), where n is the number of genes in the model. The authors claim that their method can be applied to find a network among individual genes. However, for biologically realistic regulatory networks, the computational complexity is prohibitive. For instance, De Hoon et al. do not build any gene expression models among individual genes and instead choose to group the genes into several clusters and only study the interrelationships between the clusters. D'haeseleer et al. proposed a linear model for mRNA expression levels during CNS (Central Nervous System) development and injury. To deal with the lack of gene expression data, the authors used a nonlinear interpolation scheme to guess the shapes of gene expression profiles between the measured time points. Such an interpolation scheme is ad hoc. Therefore, the reasonableness of a model built from such interpolated data is questionable. In addition, while the authors built a linear model for 65 measured mRNA species, a dimensionality problem arises when the number of genes in a model is large, for example, about 6000 (the number of genes in yeast). Recently we have investigated strategies for identifying gene regulatory networks from gene expression data with a state-space description of the gene expression model. We have found that modeling gene expression is key to inferring the regulatory networks among individual genes. Therefore, in this paper we focus on modeling gene expression.
The contributions of this paper are as follows: (1) A state-space description of a gene expression dynamic model is proposed, where gene expression levels are viewed as the observation variables of a cellular system, which in turn are linear combinations of the internal variables of the system. (2) Factor analysis is used to separate the internal variables and calculate their expression values from the values of the observation variables (gene expression data), where the Bayesian Information Criterion (BIC) is used to determine the number of the internal variables. (3) The method is applied to two time-course gene expression datasets. The results suggest that it is possible to determine unambiguously a gene expression dynamic model from limited time-course gene expression data.
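The BIC-based selection in item (2) trades goodness of fit against model size via the standard criterion BIC = -2·logL + k·log(n). A minimal sketch follows; the log-likelihood values and the assumed parameter count per internal variable are made up purely for illustration:

```python
import math

def bic(log_likelihood, num_params, sample_size):
    """Bayesian Information Criterion: penalized negative log-likelihood."""
    return -2.0 * log_likelihood + num_params * math.log(sample_size)

# Hypothetical fitted log-likelihoods for p = 1, 2, 3 internal variables,
# assuming (for illustration) 10 free parameters per internal variable.
candidates = {1: -120.0, 2: -95.0, 3: -93.5}
n = 50                                              # sample size

best_p = min(candidates, key=lambda p: bic(candidates[p], p * 10, n))
```

Here the jump from p = 1 to p = 2 improves the likelihood enough to pay the penalty, while p = 3 does not, so the smallest-BIC model has two internal variables; this is the overfitting guard discussed in the Methods section.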
2. Methods

Chen et al. theoretically model biological data with the following linear differential equations:
dx(t)/dt = A·x(t)    (1)

where the vector x(t) = [x_1(t) ... x_n(t)]^T contains the mRNA and/or protein concentrations as a function of time t, the matrix A is constant and represents the extent or degree of regulatory relationships among genes and/or proteins, and n is the number of genes and/or proteins in the model. The superscript "T" in the formula indicates the transposition of a vector. D'haeseleer et al. proposed the following linear difference equations to model gene expression data:

x(t + Δt) = W·x(t)    (2)
where the vector x(t) = [x_1(t) ... x_n(t)]^T contains gene expression levels as a function of time t, the matrix W = [w_ij]_{n×n} represents the regulatory relationships and degrees among genes, and n is the number of genes in the model. In detail, x_i(t + Δt) is the expression level of gene i at time t + Δt, and w_ij indicates how much the level of gene j influences gene i when time goes from t to t + Δt. Models (1) and (2) are equivalent. When Δt tends to zero, model (2) may be transformed into model (1). On the other hand, to identify the parameters in model (1), one must discretize it into the formalism of model (2). Since gene expression data from DNA microarrays can only be obtained at a series of discrete time points with present experimental technologies, difference equations are employed to model gene expression data in this paper. In addition, in DNA microarray experiments usually only the gene expression levels are determined, while the concentrations of the resulting proteins are unknown. Therefore this work only considers constructing a system describing a gene expression dynamic model. In the Boolean network model, model (1), or model (2), genes are viewed as state variables in a cellular system. This makes parameter identification of the models impossible without additional assumptions when using microarray data. In addition, previous models assume that regulatory relationships among genes are direct; for example, gene j directly regulating gene i with the weight w_ij in model (2). In fact, genes may not be regulated in such a direct way in a cellular system and may instead be regulated by some internal regulatory elements. The following state-space description of a gene expression model is proposed to model gene expression evolution:

z(t + Δt) = A·z(t) + n_1(t)
x(t) = C·z(t) + n_2(t)    (3)
where, in terms of linear system theory, equations (3) are called the state-space description of a system. The vector x(t) = [x_1(t) ... x_n(t)]^T consists of the observation variables of the system, and x_i(t) (i = 1,...,n) represents the expression level of gene i at time t, where n is the number of genes in the model. The vector z(t) = [z_1(t) ... z_p(t)]^T consists of the internal state variables of the system, and z_i(t) (i = 1,...,p) represents the expression value of internal element i at time t, which directly regulates gene expression, where p is the number of internal state variables. The matrix A = [a_ij]_{p×p} is the time translation matrix of the internal state variables, or the state transition matrix. It provides key information on the influences of the internal variables on each other. The matrix C = [c_ij]_{n×p} is the transformation matrix between the observation variables and the internal state variables. The entries of the matrix encode information on the influences of the internal regulatory elements on the genes. Finally, the vectors n_1(t) and n_2(t) stand for system noise and observation noise. For simplicity, noise is ignored in this development. Let X(t) be the gene expression data matrix with n rows and m columns, where n and m are the numbers of genes and measuring time points, respectively. The building of model (3) from microarray gene expression data X(t) may be divided into two phases. Phase one identifies the internal state variables and their expression matrix Z(t), with p rows and m columns, from the data matrix X(t), and computes the transformation matrix C such that

X(t) = C·Z(t)    (4)
Phase two builds the difference equations of the internal states; i.e., it determines the state transition matrix A from the expression matrix Z(t). In the process of building model (3), phase one, i.e., establishing equations (4), is key. There are many methods that may be used to obtain the decomposed equations (4) describing the gene expression data. For example, one may employ cluster analysis, where the means of the clusters may be viewed as the internal variables. One may also employ singular value decomposition, where the characteristic modes or eigengenes may be viewed as the internal variables. However, in typical applications of cluster analysis and singular value decomposition, the number of such internal variables is chosen in ad hoc fashion, with the result that the matrix C and the expression data matrix of the internal variables Z(t) are decided subjectively rather than from the data themselves. Note that the matrices C and Z(t) are dependent. After Z(t) is identified, C may be calculated by the formula C = X(t)·Z⁺(t), where Z⁺(t) is the unique Moore-Penrose generalized inverse of the matrix Z(t). Next, maximum likelihood factor analysis is used to identify the internal state variables, and BIC is used to determine the number of the internal state variables,
where X(t) is the n × m observed data matrix, C is the n × p unobserved loading matrix, and Z(t) is the p × m factor-score matrix. In fact, both the generalized likelihood ratio test (GLRT) and Akaike's Information Criterion (AIC) may also be used to determine the number of the internal variables, but they share a drawback: as the sample size increases there is an increasing tendency to accept the more complex model. The BIC takes sample size into account. Although the BIC method was developed from a Bayesian standpoint, the result is insensitive to the prior distribution for adequate sample size. Thus a prior distribution does not need to be specified, which simplifies the method. For each model, the BIC is calculated as

BIC = -2·(log-likelihood of the estimated model) + log(n)·(number of estimated parameters in the model)    (5)

where n is the sample size. As with AIC, the model with the smallest BIC is chosen. BIC avoids the overfitting of a model to data. After obtaining the expression data matrix of the internal variables Z(t) and the transformation matrix C in phase one, we develop the difference equations in model (3)

Z(t + Δt) = A·Z(t)    (6)
from the data matrix z(t) in phase two. The matrix A contains p² unknown elements, while the matrix z(t) contains m · p known expression data points. If p > m, equations (6) would be underdetermined. Fortunately, the number of internal variables p chosen using BIC is generally less than the number of time points m, so the matrix A is identifiable. To determine the matrix A, the time step Δt is chosen to be the highest common factor of all the experimentally measured time intervals, so that the time of the j-th measurement is t_j = n_j · Δt, where n_j is an integer. For equally spaced measurements, n_j = j.
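The choice of Δt as the highest common factor of the measured intervals can be sketched with hypothetical, unequally spaced measurement times:

```python
from math import gcd
from functools import reduce

# Hypothetical, unequally spaced measurement times (e.g., in minutes).
times = [0, 10, 30, 40, 70, 90]
intervals = [b - a for a, b in zip(times, times[1:])]  # [10, 20, 10, 30, 20]

dt = reduce(gcd, intervals)        # highest common factor of the intervals
n_j = [t // dt for t in times]     # so that t_j = n_j * dt with integer n_j

assert dt == 10
assert n_j == [0, 1, 3, 4, 7, 9]
```

For equally spaced measurements all intervals equal Δt and n_j reduces to j.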
We define a time-variant vector v(t) with the same dimensions as the internal state vector z(t) and with the initial value v(t₀) = z(t₀). For all subsequent times, v(t) is determined from v(t + Δt) = A · v(t). For any integer k, we have

v(t₀ + k · Δt) = A^k · v(t₀).    (7)
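Equation (7) states that k applications of the one-step recursion equal multiplication by the k-th matrix power; a quick numerical check with a synthetic transition matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
A = 0.5 * rng.normal(size=(p, p)) / np.sqrt(p)  # hypothetical transition matrix
z0 = rng.normal(size=p)                          # initial internal state z(t0)

# Propagate v step by step: v(t + dt) = A v(t), starting from v(t0) = z(t0).
k = 4
v = z0.copy()
for _ in range(k):
    v = A @ v

# Equation (7): v(t0 + k*dt) = A^k v(t0).
assert np.allclose(v, np.linalg.matrix_power(A, k) @ z0)
```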
The p² unknown elements of the matrix A are chosen to minimize the cost function (the sum of squared relative errors)

F(A) = Σ_j ||v(t_j) - z(t_j)||² / ||z(t_j)||²,    (8)

where ||·|| stands for the Euclidean norm of a vector. For equally spaced measurements, the problem is a linear regression problem and the cost function (8) can be minimized by least squares. For unequally spaced measurements, the problem becomes nonlinear, and it is necessary to determine the matrix A using an optimization technique such as those in Chapter 10 of Press's text [26].
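For the equally spaced case, stacking equation (6) over consecutive time points gives the linear regression z(:, 1:) ≈ A · z(:, :-1). The sketch below solves this one-step regression by ordinary least squares via the pseudoinverse on synthetic, noise-free data; note that this minimizes the plain squared error rather than the relative-error cost (8), so it is a simplification of the procedure described above:

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 5, 12
# Hypothetical stable transition matrix: 0.9 times a random orthogonal matrix,
# so all eigenvalues have modulus 0.9 (inside the unit circle).
A_true = 0.9 * np.linalg.qr(rng.normal(size=(p, p)))[0]

# Simulate noise-free, equally spaced internal state profiles z(t).
Z = np.empty((p, m))
Z[:, 0] = rng.normal(size=p)
for j in range(1, m):
    Z[:, j] = A_true @ Z[:, j - 1]

# Least-squares solution of the one-step regression via the pseudoinverse.
A_hat = Z[:, 1:] @ np.linalg.pinv(Z[:, :-1])

assert np.allclose(A_hat @ Z[:, :-1], Z[:, 1:])       # fitted one-step predictions
assert np.allclose(A_hat, A_true, atol=1e-6)          # exact recovery (full-rank data)
```

With noisy or unequally spaced data, the relative-error cost (8) must instead be minimized numerically, as the text describes.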
3. Applications
Figure 1. Profiles of BIC with respect to the number of internal variables for (a) the CDC15 data and (b) the BAC data.
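BIC profiles like those in Figure 1 come from evaluating BIC at each candidate number of internal variables. The sketch below does this on synthetic data, using a probabilistic-PCA-style covariance estimate as a stand-in for the EM-fitted maximum likelihood factor analysis used in the paper; the parameter count m·p + m (loadings plus per-variable residual variances) follows the text:

```python
import numpy as np

def factor_model_bic(X, p):
    """BIC of a p-factor Gaussian model of the columns of X (n samples x m variables).

    A probabilistic-PCA-style estimate stands in for full ML factor analysis
    (which the paper fits with the EM algorithm)."""
    n, m = X.shape
    S = (X.T @ X) / n                           # sample covariance (columns centered)
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenvalues descending
    sigma2 = evals[p:].mean()                   # residual variance from discarded modes
    load = evecs[:, :p] * np.sqrt(np.maximum(evals[:p] - sigma2, 0.0))
    cov = load @ load.T + sigma2 * np.eye(m)    # model covariance: Lambda Lambda' + Psi
    _, logdet = np.linalg.slogdet(cov)
    loglik = -0.5 * n * (m * np.log(2 * np.pi) + logdet
                         + np.trace(np.linalg.solve(cov, S)))
    k = m * p + m                               # number of estimated parameters
    return -2.0 * loglik + k * np.log(n)

rng = np.random.default_rng(1)
n, m, p_true = 500, 12, 3
Z = rng.normal(size=(p_true, m))                # hypothetical internal variable profiles
X = rng.normal(size=(n, p_true)) @ Z + 0.1 * rng.normal(size=(n, m))
X -= X.mean(axis=0)

bics = {p: factor_model_bic(X, p) for p in range(1, 7)}
best_p = min(bics, key=bics.get)
```

On this synthetic data, generated with three underlying factors, the minimum-BIC choice recovers p = 3.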
In this section, the proposed methodology is applied to two publicly available microarray datasets. The first dataset (CDC15) is from Spellman et al. [27] and consists of the expression data of 799 cell-cycle-related genes at the first 12 equally spaced time points, representing the first two cycles. The dataset is available at http://cellcycle-www.stanford.edu, and missing data were imputed by the mean values of the microarrays. The second dataset (BAC) is from Laub et al. [28] and consists of the expression data of 1590 genes at 11 equally spaced time points with no missing data. The dataset is available at http://caulobacter.stanford.edu/CellCycle. As the mean values and magnitudes for genes and microarrays mainly reflect the experimental procedure, we normalize the expression profile of each gene to have length one, and then normalize the expression values on each microarray to have mean zero and length one. These normalizations also make the factor analysis simpler.
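The two normalization steps described above (gene profiles to unit length, then each microarray to mean zero and unit length, in that order) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical raw expression matrix: genes x time points (microarrays).
X = rng.normal(loc=2.0, scale=3.0, size=(799, 12))

# Step 1: scale each gene (row) profile to unit Euclidean length.
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# Step 2: center each microarray (column) to mean zero, then scale to unit length.
X = X - X.mean(axis=0, keepdims=True)
X = X / np.linalg.norm(X, axis=0, keepdims=True)

assert np.allclose(X.mean(axis=0), 0.0)
assert np.allclose(np.linalg.norm(X, axis=0), 1.0)
```

Note that the second step partially undoes the row normalization; the steps are applied sequentially, as the text describes them.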
Table 1. The internal variable expression matrices for (a) the CDC15 dataset and (b) the BAC dataset; each column gives the expression profile of one internal variable.
The EM algorithm for maximum likelihood factor analysis [22] was employed for the two datasets. The gene expression profile of one gene is one sample observation, and the identified parameters are the p · m elements of the matrix z(t) and the variances of the m residual errors [22]. Figure 1 depicts the profiles of BIC with respect to the number of internal variables. Clearly from Figure 1, 5 is the best choice for the number of internal variables for both datasets. The expression matrices for the internal variables are listed in Table 1, where each column describes one internal variable. Table 2. The state transition matrices of the internal variables
CDC15:
A = [ 0.4378  -1.0077   0.5009   0.1851  -0.1189
      0.6649   0.5244   0.2475   0.1511  -0.1356
     -0.0702   0.1734   0.6794  -0.3092  -0.5279
     -0.0699  -0.0103   0.1786   0.6163  -0.5190
      0.0161   0.0316  -0.0700   0.1358   0.6662 ]

BAC:
A = [ ... ]
In order to determine the state transition matrices in the models from the internal variable expression matrices, we solve the two optimization problems (8), one for each dataset. As both datasets consist of equally spaced measurements, the least-squares method can be used to obtain the two state transition matrices A in the models, shown in Table 2. Figure 2 gives a comparison of the internal variable expression profiles in Table 1 and their profiles calculated from the model (3) for (a) CDC15 and (b) BAC,
respectively. The values of the cost function are 0.2321 and 0.0761 for the CDC15 dataset and the BAC dataset, respectively. That is, at each time point the average relative errors between the internal variable profiles in Table 1 and their values calculated by model (3) are 0.0622 and 0.0372 for the CDC15 dataset and the BAC dataset, respectively. Therefore, the two state transition matrices in Table 2 are plausible.
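The reported per-point average relative errors are consistent with reading the cost as a sum of squared relative errors over all m · p internal-variable data points, i.e. average error = sqrt(cost / (m · p)):

```python
import math

def avg_relative_error(cost, m, p):
    # Average relative error per data point implied by a summed
    # squared-relative-error cost over m time points and p internal variables.
    return math.sqrt(cost / (m * p))

# CDC15: m = 12 time points, p = 5 internal variables, cost 0.2321.
assert round(avg_relative_error(0.2321, 12, 5), 4) == 0.0622
# BAC: m = 11 time points, p = 5 internal variables, cost 0.0761.
assert round(avg_relative_error(0.0761, 11, 5), 4) == 0.0372
```

Both reported error values are reproduced exactly by this relation.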
Figure 2. A comparison of the internal variable expression profiles in Table 1 and their calculated profiles from the model (3) for (a) CDC15 and (b) BAC. The solid lines correspond to the profiles in Table 1 and the dashed lines to the profiles calculated from the model (3).
Since an exponential or polynomial growth rate of gene expression is unlikely, gene expression systems are assumed to be stable systems [13]. This means that all eigenvalues of the state transition matrix A in model (3) should lie
inside the unit circle if model (3) describes a gene expression dynamic system. The five eigenvalues of the state transition matrix A for the CDC15 dataset are 0.4262 - 0.8488i, 0.4262 + 0.8488i, 0.5509, 0.7605 - 0.2950i, and 0.7605 + 0.2950i, all of which lie inside the unit circle. The five eigenvalues of the state transition matrix A for the BAC dataset are 1.0282, 0.6835 - 0.4997i, 0.6835 + 0.4997i, 0.3092 - 0.5769i, and 0.3092 + 0.5769i. All of these except the first lie inside the unit circle; however, the first eigenvalue is very close to 1. Since these two systems are (almost) stable, they are robust to system noise, for example, squared-summable noise. Therefore, these two models are sound models of gene expression dynamic systems.
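The stability check amounts to computing the moduli of the eigenvalues (in practice via np.linalg.eigvals(A)) and comparing them with 1; using the eigenvalues reported above:

```python
import numpy as np

# Eigenvalues of the state transition matrix A reported for each dataset.
cdc15 = np.array([0.4262 - 0.8488j, 0.4262 + 0.8488j, 0.5509,
                  0.7605 - 0.2950j, 0.7605 + 0.2950j])
bac = np.array([1.0282, 0.6835 - 0.4997j, 0.6835 + 0.4997j,
                0.3092 - 0.5769j, 0.3092 + 0.5769j])

# Stability of z(t + dt) = A z(t) requires all eigenvalues inside the unit circle.
assert np.abs(cdc15).max() < 1          # CDC15: stable
assert np.abs(bac[1:]).max() < 1        # BAC: all stable except the first eigenvalue
assert abs(np.abs(bac[0]) - 1) < 0.05   # ...which is only slightly outside
```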
4. Discussion
This paper proposes a method to model gene expression dynamics from measured time-course gene expression data. The model takes the form of the state-space description of linear systems. Two gene expression models for two previously published gene expression datasets were constructed to show how the method works. The results demonstrate that some features of the models are consistent with biological knowledge: for example, genes may be regulated by internal regulatory elements, and gene expression dynamic systems are stable and robust [29]. Compared to previous models, our model (3) has the following characteristics. First, gene expression profiles are the observation variables rather than the internal state variables. Second, from a biological angle, our model (3) can capture the fact that genes may be regulated by internal regulatory elements. Finally, although it contains two groups of equations (one group of difference equations and one of algebraic equations), the parameters in model (3) are identifiable from existing microarray gene expression data without any assumptions on the connectivity degrees of genes, and identifying them is computationally simple. The main shortcomings of this approach are: 1) its inherent linearity, which can capture only the primary linear components of a biological system that may be nonlinear; 2) its neglect of time delays in a biological system resulting, for example, from the time necessary for transcription, translation, and diffusion; 3) its failure to handle external inputs and noise. In future work, we will address these shortcomings, especially the last one. In addition, the present approach will be applied to more datasets, and the biological relevance of the internal variables will be demonstrated. This last goal requires closer collaboration with biologists.
We cannot expect to obtain, from existing gene expression data, perfect gene expression models that completely explain organismal or suborganismal behaviours at this time. On the other hand, models enforced by subjective assumptions may misinterpret organismal or suborganismal behaviours. Using the present methodology, one may sufficiently explore the data to construct sound models that reflect what the data can tell us. We believe that our method, along with the results of its application to two datasets, advances gene expression modelling from time-course gene expression datasets.
Acknowledgements
We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for partial financial support of this research. The first author thanks the University of Saskatchewan for funding him through a graduate scholarship award and Mrs. Mirka B. Pollak for funding him through The Dr. Victor A. Pollak and Mirka B. Pollak Scholarship(s).
References
1. Pease, A. C., et al. "Light-Generated Oligonucleotide Arrays for Rapid DNA Sequence Analysis" Proc. Natl. Acad. Sci. USA 91: 5022-5026, (1994).
2. Schena, M., et al. "Quantitative monitoring of gene expression patterns with a complementary DNA microarray" Science 270: 467-470, (1995).
3. Sherlock, G., et al. "The Stanford Microarray Database" Nucleic Acids Research 29: 152-155, (2001).
4. Everitt, B. S. and Dunn, G. "Applied Multivariate Data Analysis" New York: Oxford University Press, (1992).
5. Tavazoie, S., et al. "Systematic determination of genetic network architecture" Nature Genetics 22: 281-285, (1999).
6. Yeung, K. Y., et al. "Model-based clustering and data transformations for gene expression data" Bioinformatics 17: 977-987, (2001).
7. Ghosh, D. and Chinnaiyan, A. M. "Mixture modelling of gene expression data from microarray experiments" Bioinformatics 18: 275-286, (2002).
8. McLachlan, G. J., Bean, R. W., and Peel, D. A. "Mixture model-based approach to the clustering of microarray expression data" Bioinformatics 18: 413-422, (2002).
9. Holter, N. S., et al. "Dynamic modeling of gene expression data" Proc. Natl. Acad. Sci. USA 98: 1693-1698, (2001).
10. Somogyi, R. and Sniegoski, C. A. "Modeling the complexity of genetic networks: Understanding multigenic and pleiotropic regulation" Complexity 1: 45-63, (1996).
11. Liang, S., et al. "REVEAL, a general reverse engineering algorithm for inference of genetic network architectures" Pacific Symposium on Biocomputing 3: 18-29, (1998).
12. Akutsu, T., et al. "Identification of gene networks from a small number of gene expression patterns under the Boolean network model" Pacific Symposium on Biocomputing 4: 17-28, (1999).
13. Chen, T., He, H. L., and Church, G. M. "Modeling Gene Expression with Differential Equations" Pacific Symposium on Biocomputing 4: 29-40, (1999).
14. de Hoon, M. J. L., et al. "Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations" Pacific Symposium on Biocomputing 8: 17-28, (2003).
15. D'haeseleer, P., et al. "Linear Modeling of mRNA Expression Levels During CNS Development and Injury" Pacific Symposium on Biocomputing 4: 41-52, (1999).
16. Wu, F. X., et al. "Reverse engineering gene regulatory networks using the state-space description of microarray gene expression data" in preparation.
17. Baldi, P. and Hatfield, G. W. "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling" New York: Cambridge University Press, (2002).
18. Chen, C. T. "Linear System Theory and Design" 3rd edition, New York: Oxford University Press, (1999).
19. van Someren, E. P., Wessels, L. F. A., and Reinders, M. J. T. "Linear modeling of genetic networks from experimental data" In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), La Jolla, California, USA, (2000).
20. Alter, O., Brown, P. O., and Botstein, D. "Singular value decomposition for genome-wide expression data processing and modeling" Proc. Natl. Acad. Sci. USA 97: 10101-10106, (2000).
21. Lawley, D. N. and Maxwell, A. E. "Factor Analysis as a Statistical Method" 2nd edition, London: Butterworth, (1971).
22. Rubin, D. B. and Thayer, D. T. "EM algorithms for ML factor analysis" Psychometrika 47: 69-76, (1982).
23. Burnham, K. P. and Anderson, D. R. "Model selection and inference: a practical information-theoretic approach" New York: Springer, (1998).
24. Raftery, A. E. "Choosing models for cross-classification" American Sociological Review 51: 145-146, (1986).
25. Schwarz, G. "Estimating the dimension of a model" Annals of Statistics 6: 461-464, (1978).
26. Press, W. H., et al. "Numerical Recipes in C: The Art of Scientific Computing" 2nd edition, Cambridge, UK: Cambridge University Press, (1992).
27. Spellman, P. T., et al. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization" Mol. Biol. Cell 9: 3273-3297, (1998).
28. Laub, M. T., et al. "Global analysis of the genetic network controlling a bacterial cell cycle" Science 290: 2144-2148, (2000).
29. Hartwell, L. H., et al. "From molecular to modular cell biology" Nature 402: C47-52, (1999).