Lecture Notes in Bioinformatics 6160
Edited by S. Istrail, P. Pevzner, and M. Waterman
Editorial Board: A. Apostolico, S. Brunak, M. Gelfand, T. Lengauer, S. Miyano, G. Myers, M.-F. Sagot, D. Sankoff, R. Shamir, T. Speed, M. Vingron, W. Wong
Subseries of Lecture Notes in Computer Science
Francesco Masulli Leif E. Peterson Roberto Tagliaferri (Eds.)
Computational Intelligence Methods for Bioinformatics and Biostatistics 6th International Meeting, CIBB 2009 Genoa, Italy, October 15-17, 2009 Revised Selected Papers
Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors
Francesco Masulli, Università di Genova, DISI, Via Dodecaneso 35, 16146 Genoa, Italy
E-mail: [email protected]
Leif E. Peterson, Cornell University, Weill Cornell Medical College, TMHRI, 6565 Fannin, Suite MGJ6-031, Houston, Texas 77030, USA
E-mail: [email protected]
Roberto Tagliaferri, Università di Salerno, DMI, Via Ponte don Melillo, 84084 Fisciano (SA), Italy
E-mail: [email protected]
Library of Congress Control Number: 2010930710
CR Subject Classification (1998): F.1, J.3, I.4, I.2, I.5, F.2
LNCS Sublibrary: SL 8 – Bioinformatics
ISSN 0302-9743
ISBN-10 3-642-14570-1 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14570-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This volume contains a selection of the best contributions delivered at the 6th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2009), held at the Oratorio San Filippo Neri in Genoa (Italy) during October 15–17, 2009. The CIBB meeting series is organized by the Special Interest Group on Bioinformatics of the International Neural Network Society (INNS) to provide a forum open to researchers from different disciplines to present and discuss problems concerning computational techniques in bioinformatics, systems biology and medical informatics, with a particular focus on neural networks, machine learning, fuzzy logic, and evolutionary computational methods.

From 2004 to 2007, CIBB meetings were held, with an increasing number of participants, in the format of a special session of larger conferences, namely, WIRN 2004 in Perugia, WILF 2005 in Crema, FLINS 2006 in Genoa and WILF 2007 in Camogli. Following the great success of the special session at WILF 2007, which included 26 strongly rated papers, we launched the first autonomous CIBB conference edition with the 2008 conference in Vietri.

CIBB 2009 attracted 57 paper submissions from all over the world. A rigorous peer-review selection process was applied to select the papers included in the conference program. This volume collects the best contributions presented at the conference. It also includes one keynote presentation and two tutorial presentations.

The success of CIBB 2009 is to be credited to the contribution of many people. Firstly, we would like to thank the organizers of the special sessions for attracting so many good papers, which extended the focus of the main topics of CIBB. Second, special thanks are due to the Program Committee members and reviewers for providing high-quality reviews. Last, but not least, we would like to thank the keynote speakers Gilles Bernot (University of Nice Sophia Antipolis, France) and Taishin Nomura (Osaka University, Japan), and the tutorial presenters Fioravante Patrone (University of Genoa, Italy), Santo Motta and Francesco Pappalardo (University of Catania, Italy).
October 2009
Francesco Masulli Leif Peterson Roberto Tagliaferri
Organization
The 6th CIBB meeting was a joint operation of the Special Interest Groups on Bioinformatics and Biopattern of INNS and of the Task Force on Neural Networks of the IEEE CIS Technical Committee on Bioinformatics and Bioengineering with the collaboration of the Gruppo Nazionale Calcolo Scientifico, the Italian Neural Networks Society, the Department of Computer and Information Sciences, University of Genoa, Italy, and the Department of Mathematics and Computer Science, University of Salerno, Italy, and with the technical sponsorship of the Human Health Foundation Onlus, the Italian Network for Oncology Bioinformatics, the University of Genoa, Italy, the IEEE CIS and the Regione Liguria.
Conference Chairs
Francesco Masulli, University of Genoa, Italy and Temple University, Philadelphia, USA
Leif E. Peterson, Methodist Hospital Research Institute, Houston, USA
Roberto Tagliaferri, University of Salerno, Italy
CIBB Steering Committee
Pierre Baldi, University of California, Irvine, USA
Alexandru Floares, Oncological Institute Cluj-Napoca, Romania
Jon Garibaldi, University of Nottingham, UK
Francesco Masulli, University of Genoa, Italy and Temple University, Philadelphia, USA
Roberto Tagliaferri, University of Salerno, Italy
Program Committee
Klaus-Peter Adlassnig, Medical University of Vienna, Austria
Gilles Bernot, University of Nice Sophia Antipolis, France
Domenico Bordo, National Cancer Research Institute, Genoa, Italy
Mario Cannataro, University of Magna Graecia, Catanzaro, Italy
Giovanni Cuda, University of Magna Graecia, Catanzaro, Italy
Joaquin Dopazo, C.I. Príncipe Felipe, Valencia, Spain
Enrico Formenti, University of Nice Sophia Antipolis, France
Antonio Giordano, University of Siena, Italy and Sbarro Institute for Cancer Research and Molecular Medicine, Philadelphia, USA
Nicolas Le Novere, Wellcome-Trust Genome Campus, Hinxton, UK
Pietro Liò, University of Cambridge, UK
Giancarlo Mauri, University of Milano Bicocca, Italy
Oleg Okun, University of Oulu, Finland
Giulio Pavesi, University of Milan, Italy
David Alejandro Pelta, University of Granada, Spain
Jagath Rajapakse, Nanyang Technological University, Singapore
Volker Roth, ETH Zurich, Switzerland
Giuseppe Russo, Temple University, Philadelphia, USA
Anna Tramontano, Sapienza University, Rome, Italy
Giorgio Valentini, University of Milan, Italy
Gennady M. Verkhivker, University of Kansas and UC San Diego, USA
L. Gwenn Volkert, Kent State University, Kent, USA
Special Session Organizers
C. Angelini, P. Liò, L. Milanesi: Combining Bayesian and Machine Learning Approaches in Bioinformatics: State of Art and Future Perspectives
A. Floares, F. Manolache, A. Baughman: Intelligent Systems for Medical Decisions Support
F. Patrone: Using Game-Theoretical Tools in Bioinformatics
V.P. Plagianakos, D.K. Tasoulis: Data Clustering and Bioinformatics
Referees (in addition to previous committees)
R. Avogadri, J. Bacardit, M. Biba, P. Chen, A. Ciaramella, E. Cilia, A. d'Acierno, D. di Bernardo, G. Di Fatta, P. Ferreira, A. Fiannaca, M. Filippone, B. Fischer, S. Gaglio, C. Guziolowski, P.H. Guzzi, G. Lo Bosco, L. Milanesi, A. Maratea, G. Pollastri, D. Trudgian
Local Scientific Secretary
Paolo Romano, National Cancer Research Institute, Genoa, Italy
Stefano Rovetta, University of Genoa, Italy

Congress Management
Davide Chicco, University of Genoa, Italy
V.N. Manjunath Aradhya, Dayananda Sagar College of Engg, Bangalore, India and University of Genoa, Italy
Laura Montanari, University of Genoa, Italy
Maura E. Monville, University of Genoa, Italy
Daniela Peghini, University of Genoa, Italy
Financing Institutions
DISI, Department of Computer and Information Sciences, University of Genoa, Italy
DMI, Department of Mathematics and Computer Science, University of Salerno, Italy
GNCS, Gruppo Nazionale Calcolo Scientifico, Italy
Table of Contents
Tools for Bioinformatics

The ImmunoGrid Simulator: How to Use It . . . . . 1
Francesco Pappalardo, Mark Halling-Brown, Marzio Pennisi, Ferdinando Chiacchio, Clare E. Sansom, Adrian J. Shepherd, David S. Moss, Santo Motta, and Vladimir Brusic

Improving Coiled-Coil Prediction with Evolutionary Information . . . . . 20
Piero Fariselli, Lisa Bartoli, and Rita Casadio

Intelligent Text Processing Techniques for Textual-Profile Gene Characterization . . . . . 33
Floriana Esposito, Marenglen Biba, and Stefano Ferilli

SILACAnalyzer - A Tool for Differential Quantitation of Stable Isotope Derived Data . . . . . 45
Lars Nilse, Marc Sturm, David Trudgian, Mogjiborahman Salek, Paul F.G. Sims, Kathleen M. Carroll, and Simon J. Hubbard

Gene Expression Analysis

Non-parametric MANOVA Methods for Detecting Differentially Expressed Genes in Real-Time RT-PCR Experiments . . . . . 56
Niccolò Bassani, Federico Ambrogi, Roberta Bosotti, Matteo Bertolotti, Antonella Isacchi, and Elia Biganzoli

In Silico Screening for Pathogenesis Related-2 Gene Candidates in Vigna Unguiculata Transcriptome . . . . . 70
Ana Carolina Wanderley-Nogueira, Nina da Mota Soares-Cavalcanti, Luis Carlos Belarmino, Adriano Barbosa-Silva, Ederson Akio Kido, Semiramis Jamil Hadad do Monte, Valesca Pandolfi, Tercilio Calsa-Junior, and Ana Maria Benko-Iseppon

Penalized Principal Component Analysis of Microarray Data . . . . . 82
Vladimir Nikulin and Geoffrey J. McLachlan

An Information Theoretic Approach to Reverse Engineering of Regulatory Gene Networks from Time-Course Data . . . . . 97
Pietro Zoppoli, Sandro Morganella, and Michele Ceccarelli

New Perspectives in Bioinformatics

On the Use of Temporal Formal Logic to Model Gene Regulatory Networks . . . . . 112
Gilles Bernot and Jean-Paul Comet

Predicting Protein-Protein Interactions with K-Nearest Neighbors Classification Algorithm . . . . . 139
Mario R. Guarracino and Adriano Nebbia

Simulations of the EGFR - KRAS - MAPK Signalling Network in Colon Cancer. Virtual Mutations and Virtual Treatments with Inhibitors Have More Important Effects Than a 10 Times Range of Normal Parameters and Rates Fluctuations . . . . . 151
Nicoletta Castagnino, Lorenzo Tortolina, Roberto Montagna, Raffaele Pesenti, Anahi Balbi, and Silvio Parodi

Special Session on "Using Game-Theoretical Tools in Bioinformatics"

Basics of Game Theory for Bioinformatics . . . . . 165
Fioravante Patrone

Microarray Data Analysis via Weighted Indices and Weighted Majority Games . . . . . 179
Roberto Lucchetti and Paola Radrizzani

Special Session on "Combining Bayesian and Machine Learning Approaches in Bioinformatics: State of Art and Future Perspectives"

Combining Replicates and Nearby Species Data: A Bayesian Approach . . . . . 191
Claudia Angelini, Italia De Feis, Viet Anh Nguyen, Richard van der Wath, and Pietro Liò

Multiple Sequence Alignment with Genetic Algorithms . . . . . 206
Marco Botta and Guido Negro

Special Session on "Data Clustering and Bioinformatics" (DCB 2009)

Multiple Clustering Solutions Analysis through Least-Squares Consensus Algorithms . . . . . 215
Loredana Murino, Claudia Angelini, Ida Bifulco, Italia De Feis, Giancarlo Raiconi, and Roberto Tagliaferri

Projection Based Clustering of Gene Expression Data . . . . . 228
Sotiris K. Tasoulis, Vassilis P. Plagianakos, and Dimitris K. Tasoulis

Searching a Multivariate Partition Space Using MAX-SAT . . . . . 240
Silvia Liverani, James Cussens, and Jim Q. Smith

A Novel Approach for Biclustering Gene Expression Data Using Modular Singular Value Decomposition . . . . . 254
V.N. Manjunath Aradhya, Francesco Masulli, and Stefano Rovetta

Special Session on "Intelligent Systems for Medical Decisions Support" (ISMDS 2009)

Using Computational Intelligence to Develop Intelligent Clinical Decision Support Systems . . . . . 266
Alexandru G. Floares

Different Methodologies for Patient Stratification Using Survival Data . . . . . 276
Ana S. Fernandes, Davide Bacciu, Ian H. Jarman, Terence A. Etchells, José M. Fonseca, and Paulo J.G. Lisboa

3-D Mouse Brain Model Reconstruction from a Sequence of 2-D Slices in Application to Allen Brain Atlas . . . . . 291
Anton Osokin, Dmitry Vetrov, and Dmitry Kropotov

A Proposed Knowledge Based Approach for Solving Proteomics Issues . . . . . 304
Antonino Fiannaca, Salvatore Gaglio, Massimo La Rosa, Daniele Peri, Riccardo Rizzo, and Alfonso Urso

Author Index . . . . . 319
The ImmunoGrid Simulator: How to Use It

Francesco Pappalardo1,3, Mark Halling-Brown6, Marzio Pennisi1, Ferdinando Chiacchio1, Clare E. Sansom4, Adrian J. Shepherd4, David S. Moss4, Santo Motta1,2, and Vladimir Brusic5

1 Department of Mathematics and Computer Science, University of Catania
2 Faculty of Pharmacy, University of Catania
3 Istituto per le applicazioni del calcolo Mauro Picone, CNR, Roma
4 Department of Biological Sciences and Institute of Structural and Molecular Biology, Birkbeck College, University of London
5 Cancer Vaccine Center, Dana-Farber Cancer Institute, Boston
6 The Institute of Cancer Research, Surrey

{mpennisi,francesco,motta,fchiacchio}@dmi.unict.it, {a.shepherd,d.moss,c.sansom}@mail.cryst.bbk.ac.uk, [email protected], vladimir [email protected]
Abstract. In this paper we present the ImmunoGrid project, whose goal is to develop an immune system simulator that integrates molecular and system level models with Grid computing resources for large-scale tasks and databases. We introduce the models and the technologies used in the ImmunoGrid Simulator, showing how to use them through the ImmunoGrid web interface. The ImmunoGrid project has proved that simulators can be used in conjunction with Grid technologies for drug and vaccine discovery, demonstrating that it is possible to drastically reduce the development time of such products.
1 Introduction
The ImmunoGrid project is a three-year project funded by the FP6 programme of the European Union which established an infrastructure for the simulation of the immune system. One of the major objectives of the ImmunoGrid project is to support the development of vaccines, drugs and immunotherapies. The unique component of this project is the integration of the simulations of immune system processes at molecular, cellular and organ (system) levels for various applications. The project is a web-based implementation of the Virtual Human Immune System using Grid technologies. It adopts a modular structure that can be easily extended and modified. Depending on the users' needs, it can be used as an antigen analyzer (T-cell and B-cell), antigen processing simulator, infection simulator, cancer simulator, allergy simulator, vaccine designer, etc. While immune system models and simulators have existed for many years, their use, from a computational point of view, has been limited by the amount of computational effort needed to simulate a realistic immunological scenario.
In recent years, Grid technologies have enabled access to powerful, distributed computational resources. Partly as a consequence, the use of models and simulators in immunology is becoming increasingly relevant to the health and life sciences. Nevertheless, the lack of communication between modelers and life scientists (biologists, medical doctors, etc.) represents an important drawback that inhibits the wider use of models and simulators. The ImmunoGrid project was intended to develop an easy-to-access simulator capable of showing, using selected descriptive and predictive models, that Information and Communication Technology (ICT) tools can be useful in the life sciences. Existing models of specific pathologies have been upgraded, verified and integrated via a web interface for scientific (predictive models) and educational (descriptive models) purposes. New models have also been developed. Descriptive models are tuned to reproduce data from the literature. Models can also be used to predict the output of a new experiment. If the model predictions are confirmed by wet experiments, the model can be used as a predictive model. Predictive models require fine tuning of the parameters on existing data, which in turn implies longer development times. The ImmunoGrid simulator also includes a model (cancer vaccination) whose predictions have been experimentally validated. Academic and industrial vaccine developers are the primary users to whom the simulator is addressed. In addition, the simulator is of benefit to students, as it can easily be used in academic courses to explain how the immune system works. In the present paper we describe the ImmunoGrid simulator and its components. We show how to use the different models included in the ImmunoGrid simulator; for the description of each model the reader should refer to the published papers quoted in the references. The paper is organized as follows: Section 1 introduces the ImmunoGrid project; Section 2 describes the technologies underlying the ImmunoGrid simulator; Section 3 is devoted to explaining how to use the ImmunoGrid simulator; and in Section 4 final conclusions are drawn.
2 The ImmunoGrid Simulator
The ImmunoGrid simulator uses a modular strategy to integrate molecular and system models with Grid and web-based technologies in order to provide an easily extensible and accessible computational framework. System-level modeling is used to model biological events at the cellular and organ scale. The outcome of system-level modeling consists of system models and software prototypes of the immune system able to reproduce and predict immune responses to viruses, analyze MHC diversity and its effects on host-pathogen interactions, represent B-cell maturation, and simulate both cellular and humoral immune responses. The core of the system-level modeling of the ImmunoGrid Simulator is based on the C-ImmSim and SimTriplex computational models [1, 2]. C-ImmSim reproduces
the immune responses to bacteria and viruses (e.g. HIV-1), and SimTriplex models tumor growth and responses to immunization. Molecular-level models are used to simulate the immune system's sequence-based processing and to discriminate self and non-self proteins at the molecular level. Various modeling paradigms, such as Artificial Neural Networks (ANN), quantitative matrices, Hidden Markov Models (HMM), Support Vector Machines (SVM) and other statistical techniques, are used. These models focus on predictions of T-cell and B-cell epitopes. CBS prediction tools make up the core of the molecular-level models [3, 4]. Both pre-calculated models and online job submission models are supplied from the web for different users. Standardization of concepts and rules for the identification, description and classification of biological components and processes has been carried out to achieve easy cooperation and communication among the involved partners. A common conceptual model, terminology and standards are used to favor the integration of different models into a unique infrastructure. For this purpose, definitions and ontologies are largely based on IMGT® [5, 6], which has become a widely accepted standard in immunology.

2.1 System Level Models
The immune system is a highly distributed system capable of acting against foreign pathogens. Despite the lack of a central control system, it performs complex functions at a system level efficiently and effectively [7]. The principal mechanisms by which the immune system acts against harmful viruses, bacteria and microorganisms have been known for decades. However, our knowledge of the immune system remains incomplete and imprecise. The main goal of mathematical and computational modeling in the life sciences and, in particular, in immunology is to provide a formal description of the underlying organization principles of the immune system and the relationships between its components. This description needs to capture the complexity of the problem at the right level of abstraction in order to coherently describe and reproduce immune system behavior. To use mathematical models in immunology, both the strengths and weaknesses of the methodologies involved and the source data used for modeling have to be deeply understood and analyzed. For example, the source data often need to be properly treated (normalization, filtering, or other pre-processing) before being used for model development. Mathematical models of the immune system evolved from classical models using differential equations, difference equations, or cellular automata to model a relatively small number of interactions, molecules, or cell types involved in immune responses. Descriptions of major mathematical models for immunological applications can be found in [8, 9]. The key enabling technologies of genomics [10], proteomics [11, 12], bioinformatics [13] and systems biology (including genomic pathway analysis) [14] in immunology have provided large quantities of data describing molecular profiles of both physiological and pathological states.
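As an illustration of the classical differential-equation approach mentioned above, the following minimal sketch integrates a generic target-cell-limited model of viral infection (target cells T, infected cells I, free virus V). This is a textbook-style toy model with arbitrary parameter values, not one of the ImmunoGrid models.

```python
# Toy target-cell-limited viral infection model (illustrative only):
#   dT/dt = -beta*T*V            target cells lost to infection
#   dI/dt =  beta*T*V - delta*I  infected cells
#   dV/dt =  p*I - c*V           free virus produced and cleared
import numpy as np
from scipy.integrate import odeint

def viral_model(y, t, beta, delta, p, c):
    T, I, V = y
    return [-beta * T * V,
            beta * T * V - delta * I,
            p * I - c * V]

t = np.linspace(0, 30, 300)          # days
y0 = (1e6, 0.0, 10.0)                # initial T, I, V
params = (2e-7, 0.5, 100.0, 3.0)     # beta, delta, p, c (arbitrary values)
sol = odeint(viral_model, y0, t, args=params)
print("peak viral load: %.3g" % sol[:, 2].max())
```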
Old and new technologies such as multiparametric flow cytometry, nanotechnology for quantitation of cytokine production, ELISPOT, intra-cytoplasmic cytokine staining, mRNA/micro-RNA-based assays and laser scanning cytometry keep improving and expanding, enabling the availability of new data and information that permit a more detailed modeling of immune processes [15, 16]. Mathematical modeling of the immune system has grown to include the extensive use of simulations [9] and iterative refinement of the immune system models at both the molecular [17] and system level [18]. Only models that can be practically applied to vaccine discovery, formulation, optimization and scheduling can be considered. These models must, therefore, be at the same time biologically realistic and mathematically tractable [19]. The ImmunoGrid system-level modeling core has been inspired by IMMSIM, a cellular automata model of the immune system developed by Celada and Seiden [20]. In IMMSIM, entities are followed individually, and the stochastic rules governing the interactions are represented by sophisticated algorithms. The model uses a discrete lattice grid, and more than one entity can be found in every site of the lattice. It also introduces the use of bit-string representations of receptors and ligands. IMMSIM has been a conceptually important advance, because it developed a general modeling framework that could be used for multiple studies. It incorporated enough immunological detail to be used for real immunological problems. The use of spatial physical models such as partial differential equations has also been added to describe lymph nodes, chemotaxis, cell movement and diffusion. The two key models of the immune system used by ImmunoGrid are C-ImmSim [1] and SimTriplex [2]. These models describe both adaptive and innate immunity and their core functions, focusing in particular on modeling the two branches of adaptive immunity (humoral and cellular immunity). C-ImmSim is a multi-purpose model used for the modeling of primary and secondary immune responses and of bacterial and viral infection. It has been used for the description of HIV infection [21, 22], Epstein-Barr virus infection [23] and cancer immunotherapy [24]. The first release of SimTriplex was derived from C-ImmSim and was focused on predictive modeling. Later versions of SimTriplex evolved independently of their progenitor. The SimTriplex model has been used as a predictive model for the immunoprevention of cancer elicited by a vaccine [2, 18, 25, 26] and more recently as a descriptive model of atherosclerosis processes [27]. From the SimTriplex model, new ImmunoGrid physical models describing tumor growth based on nutrient or oxygen starvation (and on the lattice Boltzmann method [28, 29]) have been developed. These models enable the simulation of tumor growth for both benign and malignant tumors. ImmunoGrid also has a lymph node simulator derived from C-ImmSim, which offers a mechanistic view of the immune processes that occur in lymph nodes. The models used in ImmunoGrid have been carefully validated experimentally and incrementally refined (predict-test-refine), allowing the investigation of
various problems, such as the study of mammary cancer immunoprevention, therapeutic approaches to melanoma and lung cancer, immunization with influenza epitopes, the study of HIV infection and HAART therapy, and the modeling of atherosclerosis. The immune response to an immunopreventive vaccine against mammary carcinoma in HER-2/neu mice has been accurately reproduced for up to 52 weeks of age [2, 18, 25, 26]. Modeling of HIV-1 infection in untreated patients as well as in patients receiving HAART has been obtained by tuning with data from the literature as well as clinical observations [21, 22]. The descriptive atherogenesis model was tuned to match published experimental data [27]. The descriptive lymph-node simulator reproduces experimental data published in [30]-[32].
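The bit-string representation of receptors and ligands introduced by IMMSIM and inherited by the system-level models above can be illustrated with a small sketch: affinity is typically a function of the number of complementary bits between two binary strings, and binding becomes possible above a threshold. String length, threshold and the affinity curve below are hypothetical choices for illustration, not the actual C-ImmSim/SimTriplex parameters.

```python
# Toy bit-string receptor/epitope matching in the spirit of IMMSIM-like models.
# A receptor binds an epitope with a probability that grows with the number of
# complementary bits; all numeric choices here are illustrative assumptions.
import random

BITS = 12        # hypothetical bit-string length
THRESHOLD = 9    # minimum complementary bits for binding to be possible

def match_score(receptor: int, epitope: int) -> int:
    """Count complementary bits between the two strings (XOR population count)."""
    return bin((receptor ^ epitope) & ((1 << BITS) - 1)).count("1")

def binding_probability(score: int) -> float:
    """Zero below the threshold, rising linearly to 1 for a perfect complement."""
    if score < THRESHOLD:
        return 0.0
    return (score - THRESHOLD + 1) / (BITS - THRESHOLD + 1)

random.seed(0)
epitope = random.getrandbits(BITS)
receptors = [random.getrandbits(BITS) for _ in range(10000)]
binders = sum(random.random() < binding_probability(match_score(r, epitope))
              for r in receptors)
print(f"{binders} of {len(receptors)} random receptors bind the epitope")
```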
2.2 Molecular Level Models
One of the most important tasks of the immune system is to discriminate between self and non-self patterns. The immune system's specialized sequence-based processing primarily occurs in the adaptive immune response and involves multiple mechanisms (short-range non-covalent interactions, hydrogen bonding, van der Waals interactions, etc.). The adaptive immune system has two different arms: humoral immunity mediated by antibodies and cellular immunity mediated by T cells. Antibodies are released by B cells and can neutralize pathogens and antigens outside the cells. T cells act to neutralize intracellular pathogens and cancers by killing infected or malfunctioning cells (cytotoxic T cells), and can also provide a boost for both humoral and cellular immune responses (T helper cells). B cells use receptors that have the same shape as the antibodies they release after differentiation into plasma cells. They are specialized in recognizing epitopes represented by 3D shapes on the antigen surface, mostly discontinuous parts of antigen sequences. T-cell receptors are instead specialized in recognizing epitopes composed of short fragments of antigenic peptides presented on the surface of infected cells. Cytotoxic CD8 T cells (CTLs) recognize and target cells that express foreign T-cell epitopes in conjunction with class I MHC complexes. Helper CD4 T cells recognize class II MHC complexes presented on the surfaces of specialized antigen-presenting cells such as B cells, macrophages and dendritic cells, and provide the co-stimulatory signals needed for activation of B cells and CTLs. The combinatorial complexity of antigen processing and presentation results in a large variety of cell clones that recognize, act on and gain memory for specific antigens. The prediction and analysis of MHC-binding peptides and T-cell epitopes thus represents a problem suitable for large-scale screening and computational analysis. Many computational models of antigen processing and presentation have been developed during the last two decades. Specialized databases have been used to store information on MHC ligands and T-cell epitopes [33]-[36]. Multiple
prediction tools such as quantitative matrices, artificial neural networks, hidden Markov models, support vector machines, molecular modeling and others have been developed [43, 44] and made available to the scientific community through web servers. The molecular-level models used in ImmunoGrid are primarily based on the CBS tools [37]-[42], but several other predictive models are included as well [45]-[47]. Exhaustive validation with carefully selected experimental data has been used to select the most accurate predictive model for each biological scenario. Focusing on vaccine research, the ImmunoGrid molecular models have been applied to analyze 40,000 influenza proteins for peptides that bind to some 50 HLA alleles, showing that the predictions of peptide binding to HLA class I molecules are of high accuracy and therefore directly applicable to the identification of vaccine targets [48]-[50]. Furthermore, these models are stable and their performance can be reproduced across several major HLA class I variants [51]. Predictions of MHC class II ligands and T-cell epitopes represent a more difficult challenge, since MHC-II-binding predictions are of much lower accuracy than those of MHC-I [52, 53]. However, this accuracy can be improved by the use of genetic algorithms [54], Gibbs sampling [55] or by combining prediction results from multiple predictors [56]. HLA class I predictions can thus be used for the prediction of positives (binders and T-cell epitopes), while HLA class II predictions can be used for the elimination of obvious negatives (non-binders, non-T-cell epitopes).
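Of the prediction paradigms listed above, the quantitative (position-specific scoring) matrix is the simplest to sketch: a 9-mer peptide is scored by summing per-position residue weights and candidate binders are ranked by that score. The tiny matrix below is fabricated purely for illustration and bears no relation to the trained CBS matrices used by ImmunoGrid.

```python
# Toy quantitative-matrix (position-specific scoring) ranking of 9-mer peptides.
# The weights are invented; real predictors use matrices trained on binding data.
def score_peptide(peptide, matrix):
    """Sum position-specific residue weights; residues not in the matrix score 0."""
    return sum(matrix[pos].get(aa, 0.0) for pos, aa in enumerate(peptide))

toy_matrix = [{} for _ in range(9)]
toy_matrix[1] = {"L": 2.0, "M": 1.5}                 # hypothetical P2 anchor
toy_matrix[8] = {"V": 2.0, "L": 1.8, "I": 1.5}       # hypothetical C-terminal anchor

protein = "MDSNTVSSFQVDCFLWHVRKQVADQELGDAPFLDRLRRDQKSLRGRGSTLGL"  # arbitrary sequence
peptides = [protein[i:i + 9] for i in range(len(protein) - 8)]
for pep in sorted(peptides, key=lambda p: score_peptide(p, toy_matrix), reverse=True)[:5]:
    print(pep, round(score_peptide(pep, toy_matrix), 2))
```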
2.3 Integration with the Grid Framework
Newly designed integrated web-based systems for accessing highly optimized computational resources, data storage and management, and laboratory automation are emerging to support and improve research [57, 58]. The main goals of these systems are to provide integration of different hardware resources and to hide the underlying heterogeneities behind a homogeneous interface. These systems have to deal with the bottleneck arising from the huge and rapidly growing quantities of data that current research challenges require. Developments in the fields of information and communication technologies and computational intelligence [59] have given rise to the concept of a "Virtual Laboratory". A Virtual Laboratory can be defined as an integrated intelligent environment for the systematic analysis and production of high-dimensional quality-assured data, in contrast with the common approach where independent exploratory studies are usually combined in an ad hoc manner [60]. The ImmunoGrid project goal is to address many specific vaccine questions thanks to the integration of molecular and system level models with Grid computing resources for large-scale tasks and databases. Even if grid infrastructures are not always essential for computational projects, the ImmunoGrid simulator needs the integration of grid technologies for multiple reasons. One of the essential requirements of the project is the need to run large numbers of simulations. Multiple simulations over large numbers of individuals allow the exploration of the parameter space.
Increasing the size of the simulation space is an additional aim of the ImmunoGrid project. Larger simulations can give a more coherent representation of the "natural scale" of the immune system, and therefore more powerful machines, such as those usually available in grid environments, have to be used. The required computational resources may vary substantially from one simulation to another, depending on the size and complexity of the question we want to address. For this reason, some resources will be more appropriate than others for certain problems. A Grid environment can easily supply access to a more diverse set of resources than would otherwise be available. The ImmunoGrid grid implementation has been designed to provide access to multiple diverse grid middlewares through a single interface, hiding the complexity of the implementation from the user. Without this approach, the average user would have to deal with complexities such as virtualization software, application servers, authentication, certificates and so on. Using the ImmunoGrid integrated simulator, the final user only sees a front end which is as simple to use as an average web form. To achieve this goal, a job broker/launcher has been implemented in the ImmunoGrid Web Server (IWS) in order to access PI2S2 [61], UNICORE [62] or AHE [63] clients via command-line tools, and web services via the standard Simple Object Access Protocol (SOAP). The gLite middleware, mandatory for accessing the PI2S2 Cometa Grid Infrastructure [61], has been integrated with the IWS. Other middlewares that have been integrated are the Application Hosting Environment (AHE), the DESHL client [64] (UNICORE) and simple web service access. The AHE permits jobs to be launched on both Globus-enabled [66] resources and local (non-national grid) GridSAM [65] resources. GridSAM is a web service capable of launching jobs on local systems and of building Resource Specification Language (RSL) requests to Globus-enabled machines. To allow access to supercomputing resources at the CINECA DEISA site [67], integration of a DESHL client (a command-line client which is able to launch jobs on UNICORE-enabled machines) has also been provided. A simulator web service implemented in Java, Perl and PHP has been created to allow resources which do not have any grid middleware to be made available through the IWS. This service runs on an application server on the host resource and forks jobs to the local system. Interested readers can find further details in [69-71].
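The dispatch logic just described can be sketched as follows: the broker picks a resource according to some criterion and builds the corresponding command-line invocation. The sketch is in Python for brevity, whereas the actual IWS launcher is written in PHP; the client commands and the selection rule are placeholders, not the real gLite/DESHL/AHE syntax.

```python
# Simplified job-broker sketch. Each middleware is reached through its
# command-line client; the commands below are placeholders, not real syntax.
MIDDLEWARE_CLIENTS = {
    "glite": "glite-submit-placeholder",   # PI2S2 / Cometa infrastructure
    "deshl": "deshl-submit-placeholder",   # CINECA DEISA site (UNICORE)
    "ahe":   "ahe-submit-placeholder",     # Globus / GridSAM resources via AHE
    "ws":    "ws-submit-placeholder",      # plain SOAP web-service host
}

def choose_resource(estimated_cpu_hours: float) -> str:
    """Placeholder criterion: large jobs go to the supercomputing site."""
    return "deshl" if estimated_cpu_hours > 100 else "glite"

def build_command(job_description: str, estimated_cpu_hours: float) -> list:
    resource = choose_resource(estimated_cpu_hours)
    # The real launcher executes the command, parses the returned job id and
    # stores it in the local database for later status queries.
    return [MIDDLEWARE_CLIENTS[resource], job_description]

print(build_command("influenza_run.xml", estimated_cpu_hours=250))
```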
2.4 The Web Interface
The ImmunoGrid web interface offers an educational and a research section. The educational section is dedicated to students, whereas the research section is devoted to researchers and requires login credentials. Jobs coming from the educational section are only allowed access to local IWS resources, as they are normally small, low-complexity jobs. Jobs launched from the research section have access to the full range of resources. This approach also prevents access to the grid resources by unauthorized users.
The web interface is implemented in PHP, AJAX and DHTML to provide a modern look and feel to the site. The interface is used to create and prepare jobs to be launched onto the grid. As previously stated, communication with the different grid middlewares is completely dealt with by the IWS job launcher. This launcher is written in PHP. The main challenge faced when implementing this grid solution has been that grid clients have to be completely accessible via the command line in order for the PHP job launcher to interact with them. Solutions to this problem have been presented in [68]. When a new job is created by the user, it is sent to the job launcher. The launcher selects an appropriate resource based on a pre-established criterion. It then executes the appropriate command-line script for the chosen grid infrastructure. Any information required to uniquely identify the job, as well as the actual state of the job itself, is stored in the local database, allowing the job state to be monitored when the user requests it through the IWS. This centralised scheduling approach allows transparent access to multiple middlewares. It is also useful for enforcing the use of a particular resource under certain circumstances. At the end of a simulation, results and log files are retrieved to the IWS user space. All the authentication steps between the IWS and the different grid environments are dealt with by the job launcher. Access is granted through the use of public-key authentication and PKCS12 certificates. This process is also hidden from the final user, who only needs authorized access to the ImmunoGrid website.
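A minimal sketch of the bookkeeping described above (one record per job, holding the identifier returned by the middleware and the current state shown on the "My jobs" page) is given below. The table layout and field names are invented for illustration; the actual IWS schema is not described in this paper.

```python
# Illustrative job-tracking table; schema and states are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_id    TEXT PRIMARY KEY,   -- identifier returned by the middleware
    resource  TEXT,               -- e.g. 'PI2S2', 'DEISA', 'AHE'
    user_name TEXT,
    status    TEXT                -- e.g. 'SUBMITTED', 'RUNNING', 'DONE'
)""")

def record_submission(job_id, resource, user):
    conn.execute("INSERT INTO jobs VALUES (?, ?, ?, 'SUBMITTED')",
                 (job_id, resource, user))

def update_status(job_id, status):
    conn.execute("UPDATE jobs SET status = ? WHERE job_id = ?", (status, job_id))

def my_jobs(user):
    """What the 'My jobs' page would report for a given user."""
    return conn.execute("SELECT job_id, resource, status FROM jobs "
                        "WHERE user_name = ?", (user,)).fetchall()

record_submission("grid-0001", "PI2S2", "alice")
update_status("grid-0001", "RUNNING")
print(my_jobs("alice"))
```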
2.5 Predicted Epitopes Database
The database of predicted epitopes has been integrated into the ImmunoGrid project. It provides a summary of predictions for all protein antigens of a particular virus (e.g. influenza A or dengue) or all known cancer antigens for the human (HLA) or mouse (H-2) MHC molecules. This database represents all possible peptide targets from known antigens and will serve as a maximal set of compounds for designing T-cell epitope based vaccines. The advantage relative to the current state of the art is that it offers vaccine researchers an exhaustive set of targets that should be considered, rather than the small, incomplete, and historically biased data sets currently in use.
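As a sketch of how such a pre-computed collection might be queried, the toy example below filters predicted binders by antigen and reports per-peptide allele promiscuity, mirroring the kind of summary (peptide, position, list of binding alleles) described for the database. The records, peptides, positions and alleles are invented example data, not ImmunoGrid predictions.

```python
# Toy query over a pre-computed epitope prediction table (invented records).
from collections import defaultdict

# Each record: (antigen, start position, peptide, HLA allele, predicted binder?)
predictions = [
    ("HER2", 10, "ALAKAAAAL", "HLA-A*02:01", True),
    ("HER2", 10, "ALAKAAAAL", "HLA-A*24:02", True),
    ("HER2", 55, "GLFGAAAKL", "HLA-A*02:01", True),
    ("M1",   58, "SLLPAIVEL", "HLA-A*02:01", True),
]

def promiscuous_peptides(antigen, min_alleles=2):
    """Peptides of an antigen predicted to bind at least `min_alleles` alleles."""
    by_peptide = defaultdict(set)
    for ag, pos, pep, allele, binder in predictions:
        if ag == antigen and binder:
            by_peptide[(pos, pep)].add(allele)
    return {k: sorted(v) for k, v in by_peptide.items() if len(v) >= min_alleles}

print(promiscuous_peptides("HER2"))
```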
2.6 IMGT-ONTOLOGY for Antibody Engineering
The use of the IMGT-ONTOLOGY concepts and standards provides a standardized way to compare and analyse immunoglobulin sequences in the process of antibody engineering, whatever the chain type (heavy and light) and whatever the species (e.g. murine and human). More specifically, the IMGT unique
numbering provides the delimitations of the frameworks (FR-IMGT) and complementarity determining regions (CDR-IMGT) for the analysis of loop grafting in antibody humanization.
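To make the FR-IMGT/CDR-IMGT delimitation concrete, the sketch below slices a sequence that has already been gap-aligned to the IMGT unique numbering into framework and CDR regions. The boundaries used are the commonly cited IMGT V-domain delimitations (e.g. CDR1-IMGT 27-38, CDR2-IMGT 56-65, CDR3-IMGT 105-117); the IMGT reference documentation remains authoritative.

```python
# Slice an IMGT-gapped V-domain sequence (positions 1-128, gaps as '.') into
# FR-IMGT and CDR-IMGT regions, using commonly cited IMGT delimitations.
IMGT_REGIONS = {
    "FR1-IMGT":  (1, 26),
    "CDR1-IMGT": (27, 38),
    "FR2-IMGT":  (39, 55),
    "CDR2-IMGT": (56, 65),
    "FR3-IMGT":  (66, 104),
    "CDR3-IMGT": (105, 117),
    "FR4-IMGT":  (118, 128),
}

def split_imgt_regions(gapped_sequence: str) -> dict:
    """Return each region's residues (gaps removed), keyed by region name."""
    regions = {}
    for name, (start, end) in IMGT_REGIONS.items():
        segment = gapped_sequence[start - 1:end]   # IMGT positions are 1-based
        regions[name] = segment.replace(".", "")
    return regions

# Usage (with an already IMGT-aligned heavy-chain sequence):
#   print(split_imgt_regions(aligned_vh)["CDR3-IMGT"])
```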
3 Using the ImmunoGrid Simulator
To access the ImmunoGrid simulator web interface, users only have to type the ImmunoGrid address (http://www.ImmunoGrid.eu) in a web browser and click on the "Simulators" link. Registered users can then log in to the website in order to have full access to the simulator; occasional (anonymous) users can access only the Educational section of the simulator. Complete documentation about the simulator's modules is available by clicking on the "Education" link.

3.1 The Educational Simulator
Simulations produced by ImmunoGrid include results of either descriptive or predictive modeling. Selected cases have been integrated in the educational simulator. These educational examples use the concept of learning objects, where students can select several experimental conditions and observe the outcomes using realistic simulations. Learning objects offer visual and interactive means to introduce key concepts, information and situated data in developing discipline-specific thinking. ImmunoGrid simulations help students develop a culture of practice and a spirit of inquiry, understand theoretical knowledge and information in immunology through encapsulated descriptions, and understand practical aspects through simulations of key equipment, tools and processes in vaccine development. The Educational Simulator can be used in courses in the partner institutions. This exploitable result may become a template for the development of a virtual teaching laboratory.

3.2 System Models Developed for Specific Diseases
System models have been built and customized to address disease-specific questions. Examples of this kind are the simulation of HIV-1 infection and related antiretroviral therapy, the simulation of the Epstein-Barr virus, the simulation of hypersensitivity reactions, and the simulation of cancer immunotherapy for a generic non-solid tumor. The innovation lies in the general-purpose nature of this simulator and the relative ease with which it can be upgraded and customized. Two disease-specific questions were addressed to the ImmunoGrid consortium, namely to model the immune control of atherogenesis and to model an influenza vaccine which was under investigation at the Dana-Farber Cancer Institute in Boston. The results of these new investigations are now two new modules of the ImmunoGrid simulator. The viral infections section is the first simulation section one can select. Three different viral infections can be selected: influenza/A, HIV and Epstein-Barr virus. For each of these it is possible to run different types of simulations, choosing among various specific parameters.
Selecting the influenza simulator, it is possible to run simulations of influenza/A. Once a simulation has been selected, the interface provides a screen where the user can choose parameter values. The list of available parameters depends on the specific simulation (figure 1).
Fig. 1. The influenza/A simulator parameters form
Once the parameter values have been chosen, the interface takes care of building the Grid job and submitting it, freeing the user from having to know about specific Grid procedures. The simulation will be executed and the selected grid location will be shown on a map. Currently, four different Grid environments are available: DEISA Grid – Cineca (Bologna, Italy); PI2S2 (Catania, Italy); Birkbeck College (London, UK); Dana-Farber Cancer Institute (Boston, USA). Once submitted, the status of the job can be followed in the "My jobs" section. When a job is completed, the user can analyze results both through the web interface (visualizing graphs and movies) and by retrieving raw data for post-processing (figure 2). The same methodology applies to the other simulations. The user can obviously run different types of simulations at the same time. For simulations that require more time, the user is free to log out and return to the web interface later. The "My jobs" section will provide information about the job status. The next section is the cancer section. Here one can choose between a generic cancer growth simulator and a breast cancer one. For the cancer growth simulation, it is possible to see pre-computed results (see figure 3) or to run a specific simulation, modifying simulation parameters in order to simulate several cancer growth scenarios. Available results are shown in an executive summary, interactive graphs and movies.
Fig. 2. Graphical visualization of influenza/A simulator results
Fig. 3. Pre-computed results of cancer growth simulations
The breast cancer in HER-2/neu mice simulator has been validated with in vivo data from the University of Bologna partner. From this section it is possible to simulate a single mammary gland, or to launch a "natural scale"
simulation using 10 mammary glands at the same time. It is possible to choose the vaccination schedule and the number of simulations, and to vary many biological parameters such as the length of the simulation, the IL2 quantity, the number of released antibodies and so on. Results are presented through an executive summary, interactive graphs and movies. The atherogenesis simulator can be used to model atheromatous plaque formation related to LDL concentration. Here too it is possible to choose from a wide range of atherogenesis parameters, such as the number of simulations, the length of the simulation, the LDL quantity, the IL2 quantity and so on. When a simulation has finished, it is possible to visualize, download and play movies of the plaque formation as previously described. The last simulation is a descriptive simulation and deals with lymph node modeling. It is possible to simulate a lymph node with or without antigen infection, or just to have a look at pre-run simulations. Results are visualized using interactive graphs.

3.3 Molecular Level Prediction Models
The CBS web-based tools provide a way to find epitopes in protein sequences. More specifically, the detection of epitopes could improve rational vaccine design and the creation of subunit or genetic vaccines, and resolve problems related to the use of whole live or attenuated organisms. The models are divided into two classes:
– T-cell epitope prediction tools. NetChop performs predictions of cleavage sites of the human proteasome. NetMHC predicts the binding of peptides to a large number of HLA alleles. NetCTL is an integrative method using NetChop and NetMHC (integrating peptide-MHC binding prediction, proteasomal C-terminal cleavage and TAP transport efficiency); a toy sketch of this kind of score combination is given at the end of this section. NetMHCII predicts the binding of peptides to several HLA class II alleles.
– B-cell epitope prediction tools. BepiPred offers prediction of linear epitopes, and DiscoTope specifically targets the prediction of discontinuous epitopes based on the three-dimensional structure.
All the CBS prediction tools have been integrated with the ImmunoGrid web portal to permit easy centralized access and to allow the use of Grid computational frameworks to execute predictions. The first workflow is the epitope prediction in breast cancer. Selecting the cancer section, it is possible to execute the HER2 epitope prediction. After choosing the MHC/HLA alleles (for either humans or mice), one can execute the simulation. The results show, for each protein, the position, the peptide and the list of alleles that recognize a specific peptide (promiscuity) (see figure 4). Structural display through a 3D Jmol HER2 protein model is also available. From the influenza section it is possible to proceed with the influenza epitope prediction. After choosing Human class I and II alleles, the model allows
the selection of the host (avian, mammal, human), the influenza protein, the influenza subtype and the influenza isolate. Results are presented in a similar way as previously shown. Structural display (with epitope and protein surface highlighting) is also possible here (see figure 5). The workflow for HIV epitope prediction is accessible from the HIV section of the ImmunoGrid web interface. For this kind of prediction it is possible to select the Human class I alleles (max 20 alleles), the protein (e.g. envelope protein), the virus clade and the virus strain. A list of peptides and binding scores is then presented. The atherosclerosis epitope prediction workflow can be accessed from the Atherogenesis section of the web interface. Selection of parameters and display of results is done in a similar way as previously presented.

Fig. 4. Visualization of results relative to the breast cancer epitopes prediction module

Fig. 5. Structural display with protein surface highlighting for the influenza epitope prediction module
Fig. 6. Example of results obtained using the atherosclerosis epitopes prediction module
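The integrative strategy described for NetCTL earlier in this section (combining MHC class I binding, proteasomal C-terminal cleavage and TAP transport predictions) can be sketched as a weighted sum of the three component scores. The weights and the example scores below are placeholders and the component predictors are assumed to exist upstream; this is not the actual NetCTL algorithm or its parameters.

```python
# Illustrative combination of three per-peptide scores into one CTL-epitope
# score, in the spirit of integrative methods such as NetCTL. Weights and the
# example component scores are placeholders, not the real tool's values.
def combined_ctl_score(mhc_score, cleavage_score, tap_score,
                       w_cleavage=0.15, w_tap=0.05):
    """Weighted sum: MHC binding dominates, cleavage and TAP act as corrections."""
    return mhc_score + w_cleavage * cleavage_score + w_tap * tap_score

candidates = {
    # peptide: (hypothetical MHC, cleavage and TAP scores from upstream predictors)
    "PEPTIDEA": (0.85, 0.90, 1.2),
    "PEPTIDEB": (0.20, 0.40, 0.3),
}
ranked = sorted(candidates, key=lambda p: combined_ctl_score(*candidates[p]),
                reverse=True)
for pep in ranked:
    print(pep, round(combined_ctl_score(*candidates[pep]), 3))
```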
4 Conclusions
In the present paper we have introduced the ImmunoGrid EC-funded project, describing the models and the technologies used for building the ImmunoGrid Simulator, and focusing on how to use it. The ImmunoGrid project has reached some major goals: the use and integration of different grid middlewares, easy-to-use access to simulators running on grid environments, an extended ontology framework for immunology, and a general framework for simulating the immune system response. These achievements required close cooperation among the partners, who have gained significant experience in the integration of technology with mathematics and computer science to describe the immune response to different pathologies. The ImmunoGrid project has proved that simulators and grid technologies can be used with success in the framework of drug and vaccine discovery and life science research. At least 50 years of experience in the physical, chemical and engineering sciences has proven that "computer-aided" discovery, engineering, design, etc. drastically reduces the development time of a product.
Nowadays this experience can be translated to vaccine and drug discovery [72, 73]. If industry picks up the ImmunoGrid experience and all the other experiences of the ICT-VPH projects, drug development time can be reduced by a significant amount, ranging from 30% to 50%. The impact on European people's health and European industrial competitiveness is potentially enormous.
References 1. Castiglione, F., Bernaschi, M., Succi, S.: Simulating the immune response on a distributed parallel computer. Int. J. Mod. Phys. C 8, 527–545 (1997) 2. Motta, S., Castiglione, F., Lollini, P., Pappalardo, F.: Modelling vaccination schedules for a cancer immunoprevention vaccine. Immunome Res. 1, 5 (2005) 3. Lin, H.H., Ray, S., Tongchusak, S., et al.: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 9, 8 (2008) 4. Lin, H.H., Zhang, G.L., Tongchusak, S., et al.: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 9(Suppl. 12), S22 (2008) R 5. Lefranc, M.P.: IMGT, the international ImMunoGeneTics information system : a standardized approach for immunogenetics and immunoinformatics. Immunome Res. 1, 3 (2005) [imgt.cines.fr] R a system and an ontology that 6. Lefranc, M.P., Giudicelli, V., Duroux, P.: IMGT , bridge biological and computational spheres in bioinformatics. Brief Bioinform. 9, 263–275 (2008) 7. Motta, S., Brusic, V.: Mathematical modeling of the immune system. In: Ciobanu, G., Rozenberg, G. (eds.) Modelling in Molecular Biology. Natural Computing Series, pp. 193–218. Springer, Berlin (2004) 8. Louzoun, Y.: The evolution of mathematical immunology. Immunol. Rev. 216, 9–20 (2007) 9. Castiglione, F., Liso, A.: The role of computational models of the immune system in designing vaccination strategies. Immunopharmacol. Immunotoxicol. 27, 417–432 (2005) 10. Falus, A. (ed.): Immunogenomics and HumanDisease. Wiley, Hoboken (2006) 11. Purcell, A.W., Gorman, J.J.: Immunoproteomics: Massspectrometry-based methods to study the targets of the immune response. Mol. Cell Proteomics 3, 193–208 (2004) 12. Brusic, V., Marina, O., Wu, C.J., Reinherz, E.L.: Proteome informatics for cancer research: from molecules to clinic. Proteomics 7, 976–991 (2007) 13. Sch¨ onbach, C., Ranganathan, S., Brusic, V. (eds.): Immunoinformatics. Springer, Heidelberg (2007) 14. Tegn´er, J., Nilsson, R., Bajic, V.B., et al.: Systems biology of innate immunity. Cell Immunol. 244, 105–109 (2006) 15. Sachdeva, N., Asthana, D.: Cytokine quantitation: technologies and applications. Front Biosci. 12, 4682–4695 (2007) 16. Harnett, M.M.: Laser scanning cytometry: understanding the immune system in situ. Nat. Rev. Immunol. 7, 897–904 (2007)
17. Brusic, V., Bucci, K., Scho¨ nbach, C., et al.: Efficient discovery of immune response targets by cyclical refinement of QSAR models of peptide binding. J. Mol. Graph Model 19, 405–411 (2001) 18. Pappalardo, F., Motta, S., Lollini, P.L., Mastriani, E.: Analysis of vaccines schedules using models. Cell Immunol. 244, 137–140 (2006) 19. Yates, A., Chan, C.C., Callard, R.E., et al.: An approach to modelling in immunology. Brief Bioinform. 2, 245–257 (2001) 20. Celada, F., Seiden, P.E.: A computer model of cellular inter- action in the immune system. Immunol. Today 13, 56–62 (1992) 21. Castiglione, F., Poccia, F., D’Offizi, G., Bernaschi, M.: Mutation, fitness, viral diversity and predictive markers of disease progression in a computational model of HIV-1 infection. AIDS Res. Hum. Retroviruses 20, 1316–1325 (2004) 22. Baldazzi, V., Castiglione, F., Bernaschi, M.: An enhanced agent based model of the immune system response. Cell Immunol. 244, 77–79 (2006) 23. Castiglione, F., Duca, K., Jarrah, A., et al.: Simulating Epstein- Barr virus infection with C-ImmSim. Bioinformatics 23, 1371–1377 (2007) 24. Castiglione, F., Toschi, F., Bernaschi, M., et al.: Computational modeling of the immune response to tumor antigens: implications for vaccination. J. Theo. Biol. 237/4, 390–400 (2005) 25. Lollini, P.L., Motta, S., Pappalardo, F.: Discovery of cancer vaccination protocols with a genetic algorithm driving an agent based simulator. BMC Bioinformatics 7, 352 (2006) 26. Pappalardo, F., Lollini, P.L., Castiglione, F., Motta, S.: Modeling and simulation of cancer immunoprevention vaccine. Bioinformatics 21, 2891–2897 (2005) 27. Pappalardo, F., Musumeci, S., Motta, S.: Modeling immune system control of atherogenesis. Bioinformatics 24, 1715–1721 (2008) 28. He, X., Luo, L.: Theory of the lattice Boltzmann method: from the Boltzmann equation to the lattice Boltzmann equation. Phys. Rev. E 56, 6811–6817 (1997) 29. Ferreira Jr., S.C., Martins, M.L., Vilela, M.J.: Morphology transitions induced by chemotherapy in carcinomas in situ. Phys. Rev. E 67, 051914 (2003) 30. Catron, D.M., Itano, A.A., Pape, K.A., et al.: Visualizing the first 50hr of the primary immune response to a soluble antigen. Immunity 21, 341–347 (2004) 31. Garside, P., Ingulli, E., Merica, R.R., et al.: Visualization of specific B and T lymphocyte interactions in the lymph node. Science 281, 96–99 (1998) 32. Mempel, T.R., Henrickson, S.E., Von Andrian, U.H.: T-cell priming by dendritic cells in lymph nodes occurs in three distinct phases. Nature 427, 154–159 (2004) 33. Brusic, V., Rudy, G., Harrison, L.C.: MHCPEP, a database of MHC-binding peptides: update 1997. Nucleic Acids Res 26, 368–371 (1998) 34. Rammensee, H., Bachmann, J., Emmerich, N.P., et al.: SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50, 213–219 (1999) 35. Toseland, C.P., Clayton, D.J., McSparron, H., et al.: AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res. 1, 4 (2005) 36. Sette, A., Bui, H., Sidney, J., et al.: The immune epitope database and analysis resource. In: Rajapakse, J.C., Wong, L., Acharya, R. (eds.) PRIB 2006. LNCS (LNBI), vol. 4146, pp. 126–132. Springer, Heidelberg (2006) 37. Nielsen, M., Lundegaard, C., Lund, O., Kesmir, C.: The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 57, 33–41 (2005)
38. Larsen, M.V., Lundegaard, C., Lamberth, K., et al.: An integrative approach to CTL epitope prediction: a combined algorithm integrating MHC class I binding, TAP transport efficiency, and proteasomal cleavage predictions. Eur. J. Immunol. 35, 2295–2303 (2005) 39. Nielsen, M., Lundegaard, C., Lund, O.: Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics 8, 238 (2007) 40. Nielsen, M., Lundegaard, C., Blicher, T., et al.: NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS ONE 2, e796 (2007) 41. Larsen, J.E., Lund, O., Nielsen, M.: Improved method for predicting linear B-cell epitopes. Immunome Res. 2, 2 (2006) 42. Andersen, P.H., Nielsen, M., Lund, O.: Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Sci. 15, 2558–2567 (2006) 43. Brusic, V., Bajic, V.B., Petrovsky, N.: Computational methods for prediction of T-cell epitopes a framework for modelling, testing, and applications. Methods 34, 436–443 (2004) 44. Tong, J.C., Tan, T.W., Ranganathan, S.: Methods and protocols for prediction of immunogenic epitopes. Brief Bioinform. 8, 96–108 (2007) 45. Reche, P.A., Glutting, J.P., Zhang, H., Reinherz, E.L.: Enhancement to the RANKPEP resource for the prediction of peptide binding to MHC molecules using profiles. Immunogenetics 56, 405–419 (2004) 46. Zhang, G.L., Khan, A.M., Srinivasan, K.N., et al.: MULTIPRED: a computational system for prediction of promiscuous HLA binding peptides. Nucleic Acids Res. 33, 17–29 (2005) 47. Zhang, G.L., Bozic, I., Kwoh, C.K., et al.: Prediction of supertype-specific HLA class I binding peptides using support vector machines. J. Immunol. Meth. 320, 143–154 (2007) 48. Peters, B., Bui, H.H., Frankild, S.: A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput. Biol. 2, e65 (2006) 49. Larsen, M.V., Lundegaard, C., Lamberth, K., et al.: Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinformatics 8, 424 (2007) 50. Lin, H.H., Ray, S., Tongchusak, S., et al.: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 9, 8 (2008) 51. You, L., Zhang, P., Bod´en, M., Brusic, V.: Understanding prediction systems for HLA-binding peptides and T-cell epitope identification. In: Rajapakse, J.C., Schmidt, B., Volkert, L.G. (eds.) PRIB 2007. LNCS (LNBI), vol. 4774, pp. 337–348. Springer, Heidelberg (2007) 52. Lin, H.H., Zhang, G.L., Tongchusak, S., et al.: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 9(Suppl. 12), S22 (2008) 53. Gowthaman, U., Agrewala, J.N.: In silico tools for predicting peptides binding to HLA-class II molecules: more confusion than conclusion. J. Proteome Res. 7, 154–163 (2008)
18
F. Pappalardo et al.
54. Rajapakse, M., Schmidt, B., Feng, L., Brusic, V.: Predicting peptides binding to MHC class II molecules using multi- objective evolutionary algorithms. BMC Bioinformatics 8, 459 (2007) 55. Nielsen, M., Lundegaard, C., Worning, P., et al.: Improved prediction of MHC class I and class II epitopes using a novel Gibbs sampling approach. Bioinformatics 20, 1388–1397 (2004) 56. Karpenko, O., Huang, L., Dai, Y.: A probabilistic meta- predictor for the MHC class II binding peptides. Immunogenetics 60, 25–36 (2008) 57. Zhang, C., Crasta, O., Cammer, S., et al.: An emerging cyberinfrastructure for biodefense pathogen and pathogen-host data. Nucleic Acids Res. 36, 884–891 (2008) 58. Laghaee, A., Malcolm, C., Hallam, J., Ghazal, P.: Artificial intelligence and robotics in high throughput post-genomics. Drug Discov. Today 10, 12539 (2005) 59. Fogel, G.: Computational Intelligence approaches for pattern discovery in biological systems. Brief Bioinform. 9, 307–316 (2008) 60. Rauwerda, H., Roos, M., Hertzberger, B.O., Breit, T.M.: The promise of a virtual lab in drug discovery. Drug Discov. Today 11, 228–236 (2006) 61. Becciani, U.: The Cometa Consortium and the PI2S2 project. Mem. S.A.It 13(Suppl.) (2009) 62. Romberg, M.: The UNICORE Architecture: Seamless Access to Distributed Resources, High Performance Distributed Computing. In: Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing, August 03-06 (1999) 63. Coveney, P.V., Saksena, R.S., Zasada, S.J., McKeown, M., Pickles, S.: The Application Hosting Environment: Lightweight Middleware for Grid-Based Computational Science. Computer Physics Communications 176(6), 406–418 64. Sloan, T.M., Menday, R., Seed, T.P., Illingworth, M., Trew, A.S.: DESHL– Standards Based Access to a Heterogeneous European Supercomputing Infrastructure. In: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, p. 91 (2006) 65. McGougha, A.S., Leeb, W., Dasc, S.: A standards based approach to enabling legacy applications on the Grid. Future Generation Computer Systems 24(7), 731–743 (2008) 66. Foster, I., Kesselman, C.: Globus: a Metacomputing Infrastructure Toolkit. International Journal of High Performance Computing Applications 11(2), 115–128 (1997), doi:10.1177/109434209701100205 67. Niederberger, R.: DEISA: Motivations, strategies, technologies. In: Proc. of the Int. Supercomputer Conference, ISC 2004 (2004) 68. Mastriani, E., Halling-Brown, M., Giorgio, E., Pappalardo, F., Motta, S.: P2SI2ImmunoGrid services integration: a working example of web based approach. In: Proceedings of the Final Workshop of Grid Projects, PON Ricerca 2000-2006, vol. 1575, pp. 438–445 (2009); ISBN: 978-88-95892-02-3 69. Halling-Brown, M.D., Moss, D.S., Sansom, C.J., Shepherd, A.J.: Computational Grid Framework for Immunological Applications. Philosophical Transactions of the Royal Society A (2009) 70. Halling-Brown, M.D., Moss, D.S., Shepherd, A.J.: Towards a lightweight generic computational grid framework for biological research. BMC Bioinformatics 9, 407 (2008)
The ImmunoGrid Simulator: How to Use It
19
71. Halling-Brown, M.D., Moss, D.S., Sansom, C.S., Sheperd, A.J.: Web Services, Workflow & Grid Technologies for Immunoinformatics. In: Proceedings of Intern. Congress of Immunogenomics and Immunomics, vol. 268 (2006) 72. Kumar, N., Hendriks, B.S., Janes, K.A., De Graaf, D., Lauffenburger, D.A.: Applying computational modeling to drug discovery and development. Drug discovery today 11(17-18), 806–811 (2006) 73. Davies, M.N., Flower, D.R.: Harnessing bioinformatics to discover new vaccines. Drug Discovery Today 12(9-10), 389–395 (2007)
Improving Coiled-Coil Prediction with Evolutionary Information

Piero Fariselli, Lisa Bartoli, and Rita Casadio

Biocomputing Group, Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy
{lisa,piero,casadio}@biocomp.unibo.it
Abstract. The coiled-coil is a widespread protein structural motif known to have a stabilization function and to be involved in key interactions in cells and organisms. Here we show that it is possible to increase the prediction performance of an ab initio method by exploiting evolutionary information. We implement a new program (referred to here as PS-COILS) that takes as input both single sequences and multiple sequence alignments. PS-COILS is introduced to define a baseline approach for benchmarking new coiled-coil predictors. We then design a new version of MARCOIL (a Hidden Markov Model based predictor) that can exploit evolutionary information in the form of sequence profiles. We show that the methods trained on sequence profiles perform better than the same methods trained and tested only on single sequences. Furthermore, we create a new structurally-annotated and freely-available dataset of coiled-coil structures (www.biocomp.unibo.it/ lisa/CC). The baseline method PS-COILS is available at www.plone4bio.org through the Subversion interface.
1 Introduction
Coiled-coils are structural motifs comprising two or more alpha-helices that wind around each other in regular bundles to produce rope-like structures (Fig. 1) [1,2]. Coiled-coils are important and widespread motifs that account for 5-10% of the protein sequences in the various genomes [3,4]. A new advancement in the accurate detection of coiled-coil motifs at atomic resolution has been obtained with the development of the SOCKET program [4]. SOCKET recognizes the characteristic “knobs-into-holes” side-chain packing of coiled-coils and it can, therefore, distinguish between coiled-coils and the great majority of helix-helix packing arrangements observed in globular domains. Recently, SOCKET was also utilized to generate a “Periodic Table” of coiled-coil structures [5] available in a web-based browsable repository [6]. Several programs for predicting coiled-coil regions in protein sequences have been developed so far: the first and widely-used COILS [3,7], PAIRCOIL [8], the retrained version PAIRCOIL2 [9], MULTICOIL [10], MARCOIL [11] and a profile-based version of COILS, called PCOILS [12], which substitutes sequence-profile comparisons with profile-profile
Fig. 1. Antiparallel regular coiled-coil (seven residues over two turns) of the Seryl-tRNA synthetase (PDB id: 1SRY)
comparisons exploiting evolutionary information. For a comparative analysis of these coiled-coil prediction methods, see [13]. Recently, we also developed CCHMM PROF [14], an evolutionary-based HMM for coiled-coil prediction.
2 Datasets
The only annotated dataset publicly available that was created for developing a predictor is the MARCOIL dataset of protein sequences. However, the MARCOIL authors themselves stated that the coiled-coil annotations in their database are not reliable [11]. For this reason we followed the prescription suggested by Lupas and coworkers [12] by adopting the intersection between SCOP [15] and SOCKET [4] as a “safer” and more impartial set with respect to those used in the literature. We generated our dataset of experimentally-determined coiled-coil structures following this suggestion and considering only the intersection between the SCOP coiled-coil class and the output of the SOCKET program. Following this procedure we ended up with a final annotated dataset comprising 104 sequences (S104). Sequences shorter than 30 residues, or with coiled-coil domains shorter than 9 residues, were excluded. This lower limit has been chosen since 9-residue-long domains are the shortest ones classified by MARCOIL [11]. The complete S104 dataset contains 10,724 residues with 3,565 coiled-coil residues (33% of the overall dataset). More specifically, among the different protein chains that are labeled with a coiled-coil domain in SCOP, we retained the subset for which SOCKET found at least a coiled-coil segment in that domain. The final annotation of the coiled-coil segment is the one indicated by the SOCKET program. This set can be downloaded from the web page, so that we provide a structurally-annotated dataset available for benchmarking (or for improving/correcting it). Furthermore, in order to test the different methods on a blind set, we selected a subset of 50 non-identical protein chains (S50) with sequence identity <30% with respect to the sequences of the MARCOIL dataset [11]. The S50 dataset contains 4,903 residues, among which 1,696 belong to coiled-coil regions (about 35%).
Another relevant and different issue is the discrimination between proteins containing coiled-coil domains and proteins that do not contain them. In order to perform this evaluation we selected a new dataset of protein chains that do not contain coiled-coil segments (“negative examples”). To obtain the negative dataset, we downloaded the Astral SCOP (release 1.69), which contains sequences with less than 40% identity. The selected sequences have been processed with SOCKET (7.4 Å packing cut-off) and all the sequences for which the program detected at least one coiled-coil residue were removed from the dataset. We also filtered out all the sequences similar (>25% identity) to any sequence of the MARCOIL negative set. In addition, we checked the corresponding entries of the PDB in order to further remove all the structures annotated as coiled-coil or coiled-coil related. Finally, we clustered the remaining sequences fixing the sequence identity threshold to 25% and we chose a representative sequence for each cluster. The negative dataset consists of 1,139 protein sequences (S1139).
3 Measures of Accuracy

3.1 Per-residue Indices
The overall accuracy (Q2) is defined as:

Q2 = p / N    (1)

where p and N are the total number of correct predictions and the total number of residues, respectively. The correlation coefficient (C) for a given class s is defined as:

C(s) = \frac{p(s)\,n(s) - u(s)\,o(s)}{d(s)}    (2)

where d(s) is the factor

d(s) = \left[(p(s) + u(s))(p(s) + o(s))(n(s) + u(s))(n(s) + o(s))\right]^{1/2}    (3)

p(s) and n(s) are respectively the true positive and true negative predictions for class s, while o(s) and u(s) are the numbers of false positives and false negatives. The sensitivity (coverage, Sn) for each class s is defined as

Sn(s) = p(s) / [p(s) + u(s)]    (4)

The specificity (accuracy, Sp) is the probability of correct predictions and it is expressed as follows:

Sp(s) = p(s) / [p(s) + o(s)]    (5)

However, these measures cannot discriminate between similar and dissimilar segment distributions. To overcome this problem, the per-segment index SOV (Segment OVerlap) has been defined to evaluate secondary structure segments rather than individual residues [16].
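Before moving to the per-segment index, note that the per-residue indices above follow directly from the four prediction counts. The following Python sketch (our illustration, not part of the original evaluation code) computes them from two label strings over {'C','N'}:

```python
# Illustrative sketch (not the authors' code): per-residue indices from two
# label strings over {'C' (coiled-coil), 'N' (non coiled-coil)}.
from math import sqrt

def per_residue_indices(observed, predicted, s='C'):
    tp = sum(1 for o, q in zip(observed, predicted) if o == s and q == s)   # p(s)
    tn = sum(1 for o, q in zip(observed, predicted) if o != s and q != s)   # n(s)
    fp = sum(1 for o, q in zip(observed, predicted) if o != s and q == s)   # o(s)
    fn = sum(1 for o, q in zip(observed, predicted) if o == s and q != s)   # u(s)
    N = len(observed)
    q2 = (tp + tn) / N                                  # Eq. 1: correct predictions / N
    d = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    c = (tp * tn - fn * fp) / d if d else 0.0           # Eqs. 2-3: correlation coefficient
    sn = tp / (tp + fn) if tp + fn else 0.0             # Eq. 4: sensitivity
    sp = tp / (tp + fp) if tp + fp else 0.0             # Eq. 5: specificity
    return q2, c, sn, sp

print(per_residue_indices("CCCCNNNNCC", "CCNNNNNNCC"))
```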
3.2 Per-segment Indices
If (s_1, s_2) is a pair of overlapping segments, S(i) is defined as the set of all the overlapping pairs in state i:

S(i) = \{(s_1, s_2) : s_1 \cap s_2 \neq \emptyset,\ s_1 \text{ and } s_2 \text{ in conformation } i\}    (6)

while S'(i) is the set of all segments s_1 for which there is no overlapping segment s_2 in state i:

S'(i) = \{s_1 : \forall s_2,\ s_1 \cap s_2 = \emptyset,\ s_1 \text{ and } s_2 \text{ in conformation } i\}    (7)

For state i the segment overlap (SOV) is defined as:

SOV(i) = 100 \times \frac{1}{N_i} \sum_{S(i)} \frac{minov(s_1, s_2) + \delta(s_1, s_2)}{maxov(s_1, s_2)} \times len(s_1)    (8)

with the normalization factor N_i defined as:

N_i = \sum_{S(i)} len(s_1) + \sum_{S'(i)} len(s_1)    (9)

The sums over S(i) run over the segment pairs in state i which overlap by at least one residue. The other sum in the second equation runs over the remaining segments in state i. len(s_1) and len(s_2) are the lengths of segments s_1 and s_2 respectively, minov(s_1, s_2) is the length of the overlap between s_1 and s_2, maxov(s_1, s_2) is the total extent for which either of the segments has a residue labelled with i, and \delta(s_1, s_2) is defined as:

\delta(s_1, s_2) = \min\{maxov(s_1, s_2) - minov(s_1, s_2);\ minov(s_1, s_2);\ int(len(s_1)/2);\ int(len(s_2)/2)\}    (10)
More specifically, in the tables we indicate with SOV(CC) the value of the segment overlap for the coiled-coil regions and with SOV(N) the value for the non coiled-coil regions.
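To make the per-segment index concrete, the following Python sketch (ours, not the authors' implementation) computes SOV(i) from lists of (start, end) segments, under the assumption that segment ends are inclusive:

```python
# Illustrative sketch (not the authors' code) of the SOV(i) per-segment index
# (Eqs. 6-10). Segments are (start, end) residue intervals, end inclusive.
def sov(observed_segs, predicted_segs):
    def length(s):
        return s[1] - s[0] + 1

    def overlap(s1, s2):
        return min(s1[1], s2[1]) - max(s1[0], s2[0]) + 1   # > 0 iff the segments overlap

    total, norm = 0.0, 0
    for s1 in observed_segs:
        pairs = [s2 for s2 in predicted_segs if overlap(s1, s2) > 0]
        if not pairs:                       # s1 has no overlapping prediction: it belongs to S'(i)
            norm += length(s1)
            continue
        for s2 in pairs:                    # overlapping pairs: S(i)
            minov = overlap(s1, s2)
            maxov = max(s1[1], s2[1]) - min(s1[0], s2[0]) + 1
            delta = min(maxov - minov, minov, length(s1) // 2, length(s2) // 2)
            total += (minov + delta) / maxov * length(s1)
            norm += length(s1)
    return 100.0 * total / norm if norm else 0.0

# Example: one observed coiled-coil segment, one slightly shifted prediction.
print(sov([(10, 30)], [(12, 33)]))
```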
3.3 Per-protein Indices
Protein OVerlap (POV) is a binary per-protein measure (0 or 1). POV is equal to 1 only
– if the number of predicted (N_p) and observed (N_o) coiled-coil segments is the same, and
– if all corresponding pairs have a minimum segment overlap.

More formally:

\text{if } (N_p = N_o \text{ and } p_i \cap o_i \geq th,\ \forall\, i = j) \Rightarrow P = 1    (11)

Otherwise:

\text{if } N_p \neq N_o \Rightarrow P = 0    (12)

To establish the segment overlap we set two thresholds. The first one is the minimum between the half lengths of the segments:

th = \min(L_p/2,\ L_o/2)    (13)

where L_p is the length of the predicted coiled-coil segment and L_o is the length of the corresponding observed segment. A second and more strict threshold is the mean of the half lengths of the segments:

th = (L_p/2 + L_o/2)/2    (14)

For a set of proteins the average of all POVs over the total number of proteins N is:

POV = \frac{\sum_{i=1}^{N} P_i}{N}    (15)

The final scores are obtained by averaging over the whole set the values computed for each protein. This measure is usually more stringent than summing up all the predictions and computing the indices at the end, since in this last case the scores of completely misclassified proteins can be absorbed by other predictions. For this reason, it may happen that both Sn and Sp can be lower than the corresponding Q2.
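The following Python sketch (our illustration, not the authors' code) computes the 0/1 POV value for a single protein from its observed and predicted segment lists, with the two thresholds of Eqs. 13-14 selectable:

```python
# Illustrative sketch (not the authors' code) of the per-protein POV measure
# (Eqs. 11-15). Segments are (start, end) intervals, end inclusive.
def pov(observed_segs, predicted_segs, strict=False):
    if len(predicted_segs) != len(observed_segs):
        return 0                                  # Eq. 12: different number of segments
    for p_seg, o_seg in zip(sorted(predicted_segs), sorted(observed_segs)):
        lp = p_seg[1] - p_seg[0] + 1
        lo = o_seg[1] - o_seg[0] + 1
        ov = min(p_seg[1], o_seg[1]) - max(p_seg[0], o_seg[0]) + 1
        th = (lp / 2 + lo / 2) / 2 if strict else min(lp / 2, lo / 2)   # Eqs. 14 / 13
        if ov < th:
            return 0
    return 1

# POVmin / POVav over a set of proteins are the averages of these 0/1 scores (Eq. 15).
proteins = [([(5, 25)], [(7, 27)]), ([(5, 25)], [(40, 50)])]
print(sum(pov(obs, pred) for obs, pred in proteins) / len(proteins))
```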
4 Baseline Predictor
In order to highlight the contribution of the introduction of evolutionary information we implemented two baseline predictors. The first (called PS-COILS-ss) takes a single sequence as input; the other one is a multiple sequence-based version of the former (PS-COILS-ms). PS-COILS-ss is our implementation of the COILS program. The basic idea behind COILS is the adoption of a simple Bayesian classifier. In practice, given a protein segment x_i of length W starting at position i, PS-COILS-ss computes the probability P(cc|x_i) of being in a coiled-coil structure (cc) given the segment as:

P(cc|x_i) = \frac{P(x_i|cc)}{P(x_i|cc) + c \cdot P(x_i|g)}    (16)

where c = P(g)/P(cc) is the ratio of the prior probabilities, while P(x_i|cc) and P(x_i|g) are the probabilities of the segment x_i given the coiled-coil structure (cc) or non coiled-coil structure (g), respectively. To compute these quantities PS-COILS-ss models the likelihoods as Gaussian functions (P(x_i|cc) = G_cc and P(x_i|g) = G_g). The COILS parameters are [7]:
– μ_cc and μ_g, the average scoring values of the coiled-coil and globular protein sets, respectively;
– σ_cc and σ_g, the standard deviations of the scoring values for the coiled-coil and globular protein sets;
– S^h(a), the score for the residue type a in the heptad position h (with h ranging from 1 to 7).

Thus, the probability of a coiled-coil segment of length W starting at position i (P(cc|x_i)) in a given sequence is computed as:

P(cc|x_i) = \frac{G_{cc}}{G_{cc} + c \cdot G_{g}}    (17)

where G_cc and G_g are defined as:

G_{cc} = \frac{1}{\sqrt{2\sigma_{cc}}}\, e^{-\frac{(x_i - \mu_{cc})^2}{\sigma_{cc}^2}} \;;\qquad G_{g} = \frac{1}{\sqrt{2\sigma_{g}}}\, e^{-\frac{(x_i - \mu_{g})^2}{\sigma_{g}^2}}    (18)

The score x_i is computed using the matrix S^h(a) along the segment W starting at position i:

x_i = \left(\prod_{h=0}^{W-1} f(a_{i+h}, h)^{\,e_h}\right)^{1/N}    (19)
where e_h is the exponential weight of position h (e_h = 1 if not weighted) and N is the normalization factor N = \sum_{h=1}^{W} e_h. The function f, in the case of the PS-COILS-ss program, is simply

f(a_{i+h}, h) = S^h(a_{i+h})    (20)

where S^h(a_{i+h}) is the element of the COILS scoring table accounting for the residue type a_{i+h} in the h-th heptad position. The parameters S^h(a_{i+h}) were computed by Lupas [3] on the original datasets and here they are not updated, in order to better highlight the effect of the evolutionary information. PS-COILS-ms takes as input a sequence profile P_k(s) and its score is defined as:

x_i = \left(\prod_{h=0}^{W-1} f(S, P, h)^{\,e_h}\right)^{1/N}    (21)
and

f(S, P, h) = \langle S^h, P_{i+h} \rangle = \sum_{a \in \{Residues\}} S^h(a) \cdot P_{i+h}(a)    (22)
PS-COILS is designed to be more general and its score is a linear combination of the PS-COILS-ss and PS-COILS-ms scores, namely PS-COILS = λ PS-COILS-ss + (1−λ) PS-COILS-ms, where λ is a weight in the range [0,1]. The program adopts the same parameters that were calculated for PS-COILS-ss and combines single-sequence and evolutionary information (in the form of sequence profiles). We then have:
x_i = \left(\prod_{h=1}^{W} f(S, P, h, \lambda)^{\,e_h}\right)^{1/N}    (23)

and

f(S, P, h, \lambda) = \lambda\, S^h(a_{i+h}) + (1 - \lambda)\, \langle S^h, P_{i+h} \rangle    (24)
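To make the scoring scheme concrete, the following Python sketch (ours, with toy placeholder values rather than the real COILS scoring table) computes the window score of Eqs. 19-24 under the weighted geometric-mean reading given above, for the single-sequence, profile and λ-combined cases:

```python
# Illustrative sketch (not the authors' code) of the PS-COILS window score.
# S[h][a] is a toy heptad scoring table (NOT the real COILS parameters);
# profile[j][a] is the frequency of residue a at sequence position j.
def window_score(start, W, S, e, residues=None, profile=None, lam=1.0):
    N = sum(e)                                   # normalization factor N = sum_h e_h
    prod = 1.0
    for h in range(W):
        j = start + h
        f_ss = S[h % 7][residues[j]] if residues is not None else 0.0        # Eq. 20
        f_ms = (sum(S[h % 7][a] * profile[j][a] for a in profile[j])          # Eq. 22
                if profile is not None else 0.0)
        f = lam * f_ss + (1.0 - lam) * f_ms                                   # Eq. 24
        prod *= f ** e[h]                        # weighted geometric mean (Eqs. 19/21/23)
    return prod ** (1.0 / N)

# Toy example with a 2-letter alphabet and a single heptad window.
S = [{'L': 2.0, 'K': 0.5} for _ in range(7)]
seq = "LKLKLKL"
prof = [{'L': 0.7, 'K': 0.3}] * 7
e = [1.0] * 7
print(window_score(0, 7, S, e, residues=seq, lam=1.0))                 # PS-COILS-ss
print(window_score(0, 7, S, e, residues=seq, profile=prof, lam=0.5))   # PS-COILS 05
```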
5 HMM Predictors: MARCOIL and CCHMM PROF
In a previous work we developed CCHMM PROF [14], the first HMM-based coiled-coil predictor that exploits sequence profiles. CCHMM PROF was specifically designed to predict structurally-defined coiled-coil segments. However, to better highlight the effect of evolutionary information on the performance of the prediction of coiled-coil regions, here we implemented a new version of MARCOIL (called MARCOIL-ms) that can exploit evolutionary information in the form of sequence profiles. In this way we can compare the old version, based on single sequences, with the new one, which can take advantage of multiple sequence alignments.
Fig. 2. Organization of the states of the MARCOIL HMM. The background state is the state 0. Each of the 9 groups has seven states that represent the heptad repeat. The arrows represent the allowed transitions.
The MARCOIL HMM [11] is composed of 64 states. The model has a background state (labeled with 0) and 63 other states labeled with a group number (from 1 to 9) and with a letter that indicates the heptad position. The minimal coiled-coil length allowed by the model is nine residues. Groups from 1 to 4 model the first four residues of a coiled-coil domain (namely the N-terminal
turn) while groups from 6 to 9 cover the last four residues (the C-terminal turn). The fifth group models the internal coiled-coil residues. In Figure 2 a diagram of the allowed transitions in the MARCOIL model is represented. A coiled-coil region begins with a transition to the first group and ends with a transition to group state 0. With domains of more than nine residues, the states of group five are visited more than once. As depicted in Figure 3, each of the 9 groups has seven states that represent the heptad repeat.
Fig. 3. Details of the heptad transitions within each one of the 9 groups of states in the MARCOIL HMM. The arrows represent the most probable transitions.
MARCOIL-ms uses the same automaton but the states emit vectors instead of symbols, as described in [17]. The sequences of characters, commonly analyzed by single sequence-based HMMs, are replaced with sequences of vectors, namely the sequence profile. In practice, the emission probabilities e_k(c) for each state k and for each emission symbol c are substituted with the dot product of the profile entries \vec{x} with the internal emission probability vector (for the k-th state we have e^V_k(\vec{x}) = \langle \vec{e}_k, \vec{x} \rangle). In accordance with this change we modify the updating procedure of the Expectation-Maximization learning. If E_k(c) is the expected number of times in which the symbol c is emitted from state k and A_{i,k} is the expected number of transitions between state i and state k, then the transition probabilities and the emission probabilities can be updated as:

a_{i,k} = \frac{A_{i,k}}{\sum_{l} A_{i,l}}    (25)

e_k(c) = \frac{E_k(c)}{\sum_{l} E_k(l)}    (26)
A_{i,k} and E_k(c) are computed with the Forward and Backward algorithms:

A_{i,k} = \sum_{p=1}^{N_p} \frac{1}{P(x^p)} \sum_{t=0}^{L_p-1} f_i(t)\, a_{i,k}\, e_k(x^p_{t+1})\, b_k(t+1)    (27)

E_k(c) = \sum_{p=1}^{N_p} \frac{1}{P(x^p)} \sum_{t\,:\,x_t = c} f_k(t)\, b_k(t)    (28)

Using the scaling factor:

A_{i,k} = \sum_{p=1}^{N_p} \sum_{t=0}^{L_p-1} f_i(t)\, a_{i,k}\, e_k(x^p_{t+1})\, b_k(t+1)    (29)

E_k(c) = \sum_{p=1}^{N_p} \sum_{t\,:\,x_t = c} f_k(t)\, b_k(t)\, Scale(t)    (30)

Finally, the updating equations are (without and with the scaling factor, respectively):

A_{i,k} = \sum_{p=1}^{N_p} \frac{1}{P(x^p)} \sum_{t=0}^{L_p-1} f_i(t)\, a_{i,k}\, e^V_k(x^p_{t+1})\, b_k(t+1), \qquad A_{i,k} = \sum_{p=1}^{N_p} \sum_{t=0}^{L_p-1} f_i(t)\, a_{i,k}\, e^V_k(x^p_{t+1})\, b_k(t+1)    (31)

and for the emissions:

E_k(c) = \sum_{p=1}^{N_p} \frac{1}{P(x^p)} \sum_{t=1}^{L_p} f_k(t)\, b_k(t)\, x_t(c), \qquad E_k(c) = \sum_{p=1}^{N_p} \sum_{t=1}^{L_p} f_k(t)\, b_k(t)\, Scale(t)\, x_t(c)    (32)

where x_t(c) is the component c of the vector \vec{x}_t, representing the t-th sequence position.
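The following Python sketch (our illustration, not the MARCOIL-ms code) shows the two ingredients just described: the vector emission e^V_k(x) = ⟨e_k, x⟩ computed as a dot product with a profile column, and the normalization step of Eqs. 25-26 applied to toy expectation matrices:

```python
# Illustrative sketch (not the authors' code): profile (vector) emissions and the
# normalized re-estimation step. ALPHABET and the counts below are toy placeholders
# for the 20 residue types and the Baum-Welch expectations A and E.
import numpy as np

ALPHABET = ["A", "C", "D"]                       # toy residue alphabet

def vector_emission(e_k, x_profile):
    """e^V_k(x) = <e_k, x>: emission probability for a sequence-profile column."""
    return float(np.dot(e_k, x_profile))

def update_parameters(A, E):
    """Eq. 25: a_ik = A_ik / sum_l A_il ;  Eq. 26: e_k(c) = E_k(c) / sum_l E_k(l)."""
    a = A / A.sum(axis=1, keepdims=True)
    e = E / E.sum(axis=1, keepdims=True)
    return a, e

e_k = np.array([0.6, 0.3, 0.1])                  # internal emission vector of state k
x_t = np.array([0.2, 0.7, 0.1])                  # profile column at position t
print(vector_emission(e_k, x_t))

A = np.array([[4.0, 1.0], [2.0, 6.0]])           # expected transition counts A_ik
E = np.array([[3.0, 1.0, 1.0], [1.0, 2.0, 2.0]]) # expected emission counts E_k(c)
print(update_parameters(A, E))
```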
6 Results and Discussion

6.1 PS-COILS Performance as a Function of the λ Value

In order to define a baseline predictor we evaluated the PS-COILS performance as a function of the λ value. Figure 4 shows the performance of our method at different λ values: the behavior of the per-residue index Q2, of the SOV(CC) per-segment index and of the POVmin per-protein index is plotted. The best performance of PS-COILS corresponds to λ = 0.5 (referred to as PS-COILS 05 in Tables 1 and 2), when single sequence and evolutionary information are equally weighted. It is worth remembering that the method parameters are not retrained, so that PS-COILS can be taken as a benchmarking baseline method.

6.2 Comparison with Other Methods
The two major issues of the prediction of coiled-coil domains can be summarized as:
1. sorting out the list of proteins that contain coiled-coil segments in a given set of protein sequences (or in genomes);
2. predicting the number and the location of coiled-coil domains in a protein chain.
Fig. 4. PS-COILS overall accuracy (Q2), coiled-coil Segment OVerlap (SOV(CC)) and Protein OVerlap (POVmin) as a function of the λ value
The majority of the papers addressed both questions at the same time and their final accuracy is very good for the first task. For each method we computed the Receiver Operating Characteristic (ROC) curve and in Table 1 we report the Area Under the Curve (AUC). From Table 1 it is evident that the most recent methods outperform PS-COILS-ss in this task. However, when evolutionary information in the form of sequence profiles is added to the oldest method the picture changes, and PS-COILS-ms and PS-COILS 05 achieve the same level of accuracy. This highlights once more the relevance of evolutionary information. Furthermore, in this paper we focus on the prediction of the location of the coiled-coil segments in the protein sequences. In Table 2 we report the accuracy of the different methods on the S50 subset (with no sequence similarity to the MARCOIL dataset proteins). The methods are evaluated using their default thresholds to predict the coiled-coil segments. From the POV indices, it is evident that about 60% of chains are correctly predicted at the protein level when evolutionary information is taken into account. The exploitation of evolutionary information increases the methods' accuracy (see PS-COILS-ms and PS-COILS 05 versus PS-COILS-ss). In this respect, PS-COILS-ms can be compared to more recent methods, and PS-COILS-ms (or PS-COILS 05) can be adopted as a baseline method for benchmarking tests.
Table 1. Area under the ROC curve (AUC) computed for all the classifiers

Method          AUC
PS-COILS-ss     0.92
MARCOIL         0.95
PAIRCOIL2       0.96
PS-COILS-ms     0.96
PS-COILS 05     0.96
MARCOIL-ms      0.96
CCHMM PROF      0.97
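AUC values such as those in Table 1 can be obtained from per-chain prediction scores with a standard routine; a minimal sketch with made-up labels and scores (not the actual evaluation data) is:

```python
# Illustrative sketch (not the authors' evaluation code): AUC of a predictor's
# per-chain score for separating coiled-coil (1) from negative (0) chains.
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0, 0, 0]                   # made-up chain labels
scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1, 0.3]     # made-up per-chain prediction scores
print(roc_auc_score(labels, scores))
```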
Table 2. Comparison of the different methods on the benchmark S50 dataset

                    Per-protein      Per-segment                 Per-residue
Method             POVmin  POVav   SOV(CC)  SOV(N)    C     Sn(CC)  Sp(CC)  Sn(N)  Sp(N)
single sequence
PS-COILS-ss         0.43    0.37    0.46     0.55    0.25    0.41    0.85    0.47   0.57
MARCOIL             0.55    0.47    0.55     0.53    0.35    0.60    0.72    0.53   0.73
PAIRCOIL2           0.53    0.45    0.54     0.38    0.15    0.55    0.61    0.45   0.45
multiple sequence
PS-COILS-ms         0.61    0.49    0.62     0.53    0.36    0.67    0.66    0.55   0.73
PS-COILS 05         0.62    0.60    0.64     0.56    0.36    0.64    0.70    0.60   0.71
MARCOIL-ms          0.80    0.78    0.81     0.53    0.53    0.98    0.49    0.71   0.96
MARCOIL-ms-cv       0.80    0.78    0.82     0.56    0.55    0.98    0.51    0.72   0.97
CCHMM PROF(1)       0.80    0.75    0.81     0.61    0.62    0.98    0.60    0.73   0.97

(1) CCHMM PROF is a new HMM-based predictor specifically developed for predicting structurally-annotated coiled-coil segments [14].
When the evolutionary information is taken into account, machine learning approaches such as MARCOIL and CCHMM PROF are the best performing methods, surpassing the new baseline method PS-COILS 05. In particular MARCOIL, which was originally trained and tested with single-sequence encoding on an old manually annotated dataset, performs significantly better (POVmin 0.80 vs 0.53) when a structurally annotated dataset and multiple sequence information are provided. Although MARCOIL-ms and CCHMM PROF are two very different HMM models, they can be used interchangeably for predicting coiled-coil segments, since they correctly predict 80% of the tested proteins. Summing up, in this work we addressed the following relevant issues:
1. we developed a new program for predicting coiled-coil segments, freely available under the GPL license, that can be used as a baseline predictor for benchmarking;
2. we present MARCOIL-ms, a new HMM-based predictor that achieves state-of-the-art accuracy;
3. we propose a new scoring scheme that takes into account per-segment and per-protein accuracy, allowing a more robust evaluation of the performance; 4. we show that the evolutionary information plays a relevant role also in the task of predicting coiled-coil segments.
Acknowledgments RC acknowledges the receipt of the following grants: FIRB 2003 LIBI–International Laboratory of Bioinformatics.
References

[1] Gruber, M., Lupas, A.N.: Historical review: another 50th anniversary–new periodicities in coiled coils. Trends Biochem. Sci. 28, 679–685 (2003)
[2] Lupas, A.N., Gruber, M.: The structure of alpha-helical coiled coils. Adv. Protein Chem. 70, 37–78 (2005)
[3] Lupas, A.N.: Prediction and analysis of coiled-coil structures. Methods Enzymol. 266, 513–525 (1996)
[4] Walshaw, J., Woolfson, D.N.: Socket: a program for identifying and analysing coiled-coil motifs within protein structures. J. Mol. Biol. 307, 1427–1450 (2001)
[5] Moutevelis, E., Woolfson, D.N.: A periodic table of coiled-coil protein structures. J. Mol. Biol. 385, 726–732 (2008)
[6] Testa, O.D., Moutevelis, E., Woolfson, D.N.: CC+: a relational database of coiled-coil structures. Nucleic Acids Res. 37(Database issue), D315–D322 (2009)
[7] Lupas, A.N., Van Dyke, M., Stock, J.: Predicting coiled coils from protein sequences. Science 252, 1162–1164 (1991)
[8] Berger, B., Wilson, D.B., Wolf, E., Tonchev, T., Milla, M., Kim, P.S.: Predicting coiled coils by use of pairwise residue correlations. Proc. Natl. Acad. Sci. USA 92, 8259–8263 (1995)
[9] McDonnell, A.V., Jiang, T., Keating, A.E., Berger, B.: Paircoil2: improved prediction of coiled coils from sequence. Bioinformatics 22, 356–358 (2006)
[10] Wolf, E., Kim, P.S., Berger, B.: MultiCoil: a program for predicting two- and three-stranded coiled coils. Protein Sci. 6, 1179–1189 (1997)
[11] Delorenzi, M., Speed, T.: An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics 18, 617–662 (2002)
[12] Gruber, M., Söding, J., Lupas, A.N.: REPPER-repeats and their periodicities in fibrous proteins. Nucleic Acids Res. 33, 239–243 (2005)
[13] Gruber, M., Söding, J., Lupas, A.N.: Comparative analysis of coiled-coil prediction methods. J. Struct. Biol. 155, 140–145 (2006)
[14] Bartoli, L., Fariselli, P., Krogh, A., Casadio, R.: CCHMM PROF: a HMM-based coiled-coil predictor with evolutionary information. Bioinformatics 25, 2757–2763 (2009)
[15] Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995)
[16] Zemla, A., Venclovas, C., Fidelis, K., Rost, B.: A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34, 220–223 (1999)
[17] Martelli, P.L., Fariselli, P., Krogh, A., Casadio, R.: A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 18(Suppl. 1), S46–S53 (2002)
Intelligent Text Processing Techniques for Textual-Profile Gene Characterization

Floriana Esposito¹, Marenglen Biba², and Stefano Ferilli¹

¹ Department of Computer Science, University of Bari, Via E. Orabona 4, 70125 Bari, Italy
² Department of Computer Science, University of New York Tirana, Rr. Komuna e Parisit, Tirana, Albania
[email protected], [email protected], [email protected]
Abstract. We present a suite of Machine Learning and knowledge-based components for textual-profile based gene prioritization. Most genetic diseases are characterized by many potential candidate genes that can cause the disease. Gene expression analysis typically produces a large number of co-expressed genes that could be potentially responsible for a given disease. Extracting prior knowledge from text-based genomic information sources is essential in order to reduce the list of potential candidate genes to be then further analyzed in laboratory. In this paper we present a suite of Machine Learning algorithms and knowledge-based components for improving the computational gene prioritization process. The suite includes basic Natural Language Processing capabilities, advanced text classification and clustering algorithms, robust information extraction components based on qualitative and quantitative keyword extraction methods and exploitation of lexical knowledge bases for semantic text processing.
1 Introduction

Many wide-spread diseases are still an important health concern for our society due to their complexity of functioning or to their unknown causes. Some of them can be acquired but some can be genetic, and a large number of genes have already been associated with particular diseases. However, many potential candidate genes are still suspected to cause a certain disease and there is strong motivation for developing computational methods that can help reduce the number of these susceptibility genes. The number of suspected regions of the genome that encode probable disease genes for particular diseases is often estimated to be very large. This gives rise to a really large number of genes to be analyzed, which would be infeasible in practice. In order to focus on the most promising genes, prioritization methods are being developed in order to exploit prior knowledge on genetic diseases that can help guide the process of identifying novel genes. Much of the prior background knowledge on genetic diseases is primarily reported in the form of free text. Extracting relevant information from unstructured data has always been a key challenge for Machine Learning methods [2].
These have the power to provide precious capabilities to rank genes based on their textual profile. Indeed, human knowledge on the genome and genetic diseases is becoming huge and manual exploitation of the large amount of raw text of scientific papers is infeasible. For this reason, many automatic methods for information extraction from text sources have recently been developed. The approach presented in [1] links groups of genes with relevant MEDLINE abstracts through the PubMed engine, characterizing each group with a set of keywords from MeSH and the UMLS ontology. In [3] co-expression information is linked with a co-citation network constructed on purpose. In this way co-expressed genes are characterized by MeSH keywords appearing in the abstracts about genes. In [4] the authors developed an approach called neighborhood divergence that quantifies the functional coherence of a group of genes through a database that links genes to documents. In [5] a schema was proposed that links abstracts to genes in a probabilistic framework and uses the EM algorithm to estimate the parameters of the word distributions. Genes are defined as similar when the corresponding gene-by-document representations are close. Finally, in [6] and [7] a proof of principle is provided on how clustering of genes encoded in a keyword-based representation can be useful to discover relevant subpatterns. In this paper we describe a suite of machine learning and knowledge-based methods for textual-profile based gene prioritization. We present a classification system based on inductive logic programming [8] that is able to semantically classify texts regarding genes or diseases. The power of the approach lies in the representation language, Horn logic, which makes it possible to describe a text through its logical structure and thus reason about it. Classification theories are induced from examples and are then used to classify novel unseen documents. The input to this machine learning system is given by a knowledge-based component for text processing. This rule-based system performs a series of NLP steps, such as part-of-speech tagging and disambiguation, in order to produce an accurate representation of the text, which is then transformed into a Prolog clause that serves as input for the learning system. Then we present a novel distance between Horn clauses that is used in this context to define similarities between papers regarding genes. The distance is then used in an instance-based approach to cluster documents of disease genes and candidate genes. Through the gene-to-document mapping provided by EntrezGene, the clustering of documents can be seen as a clustering of genes, and further analysis can be performed to reduce the number of genes to be further analyzed. Both the learning system and the instance-based approach are combined with the taxonomic knowledge base WordNET [9,10]. In the case of the inductive approach, WordNET is used to properly generalize theories in the presence of similar words that share common parent nodes in the hierarchy of concepts of WordNET. In the case of the instance-based approach, WordNET is used to semantically define a taxonomic distance between words in the text and include it in the overall distance between two texts. Clustering of documents is then mapped directly to gene clustering using the gene-to-docs mapping of EntrezGene.
We also present two keyword-based approaches for textual-profile gene clustering. The first approach is quantitative, in that it combines in the Bayes theorem the frequency of a term (in a document and in a collection) and its position in the document. Through this formula, the probability that a term is a keyword is computed. On the other hand, the qualitative approach exploits WordNET Domains [11,12] to semantically extract keywords from documents using the hierarchy of categories defined in WordNET Domains for each of the synsets found in the document. Both methods are used to generate a document × terms matrix, which is then combined with the gene-to-docs mapping taken from EntrezGene to give the final gene × terms matrix. Then classical clustering algorithms are performed on this matrix to discover relationships between disease and candidate genes. The hypothesis is that pairs of genes (candidate-disease) that fall within the same cluster and show high similarity, as given by their textual profile, are probably involved in the same disease. The paper is organized as follows: in Section 2 we describe the various components of the suite for gene prioritization, in Section 3 we describe some preliminary experimental scenarios, in Section 4 we discuss related work and the differences with our approach, and in Section 5 we conclude.
2 Machine Learning and Knowledge-Based Components

In this section we describe the various parts of the suite and how they interact with each other. We also describe how each component is used for the final goal of scoring candidate genes.

2.1 Rule-Induction System for Text Classification on Diseases
The rule-induction system INTHELEX [13] is an incremental learning system that induces and revises first-order logic (FOL) theories for multiple classes from examples. It uses two inductive refinement operators to fix wrong classifications of the current theory. In case a positive example is rejected, the system generalizes the definition of the corresponding concept by dropping some conditions (ensuring consistency with all the past negative examples), or by adding to the current theory (if consistent) a new alternative definition of the concept. When a negative example is explained, a specialization of one of the theory definitions that concur in explaining the example is attempted by adding to it some (positive or negative) conditions which characterize all the past positive examples and discriminate them from the current negative one. When dealing with examples describing text, the system has difficulties in generalizing theories if lexical knowledge is missing. Therefore, to handle theory refinement for text classification we have introduced in the system an operator for finding common parents of words in the WordNET knowledge base. The following example clarifies how generalization is performed using WordNET:
Let us consider the following phrases: “The enterprise bought shares.” and “The company acquired stocks.” These represent two examples for the learning system. When the first example comes, the learning system generates a theory to properly classify it. When the second example is given to the system, it fails to explain it and tries to generalize the current theory. It fails to generalize the theory since it does not have any further information on the proper logical literals to use in the theory revision process. But if we use WordNET, we can navigate the concept hierarchy and find that the couples enterprise-company, buy-acquire and share-stock share common parents, and can therefore be generalized using the most specific or general common parent in the WordNET graph. Each example is expressed in Horn logic and describes the structural composition of the text (subject, verb, complement) and the part-of-speech of each component. This is performed using the rule-based approach described in Section 2.3. The Prolog representation of one of the above examples is:

observation(text1) :- phrase(text1, text1_e1), subject(text1_e1, text1_e2), token(text1_e2, text1_e3), company(text1_e3), noun(text1_e3), verb(text1_e1, text1_e4), token(text1_e4, text1_e5), buy(text1_e5), past_tense(text1_e5), complement_object(text1_e1, text1_e6), token(text1_e6, text1_e7), shares(text1_e7), noun(text1_e7).

The induction system is used in the context of gene prioritization on a certain disease to classify documents regarding this disease. The gene-to-doc mapping coming from EntrezGene does not specify if a document is about a particular disease, thus the learning system, once trained, is used to properly weight the gene × documents matrix (Figure 1). Training examples are taken from PubMed, collecting positive examples on the disease of interest and negative examples on any other disease, where each example is constructed from the document abstract. The learning system induces theories that can then be used to classify unknown texts.
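As an illustration of the common-parent lookup described above (not the INTHELEX implementation itself), the following Python sketch uses NLTK's WordNet interface to retrieve the most specific shared ancestor of two nouns:

```python
# Illustrative sketch (not part of INTHELEX): finding a common parent of two
# words in the WordNet is-a hierarchy, as used conceptually when generalizing
# "company"/"enterprise" or "share"/"stock".
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def common_parents(word1, word2, pos=wn.NOUN):
    s1, s2 = wn.synsets(word1, pos=pos)[0], wn.synsets(word2, pos=pos)[0]
    return s1.lowest_common_hypernyms(s2)   # most specific shared ancestor(s)

print(common_parents("company", "enterprise"))
print(common_parents("share", "stock"))
```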
2.2 Instance-Based System for Text Clustering Regarding Genes
Clustering aims at organizing a set of unlabeled patterns into groups of homogeneous elements based on their similarity. The similarity measure exploited to evaluate the distance between elements determines the effectiveness of the clustering algorithms. Differently from the case of attribute-value representations [17], completely new comparison criteria and measures must be defined for FOL descriptions since they do not induce an Euclidean space. Our proposal is based on a similarity assessment technique for Horn clauses introduced in [14], and exploits the resulting similarity value as a distance measure in a classical k-means clustering algorithm, based on medoids instead of centroids as cluster prototypes guiding the partitioning process (due to the non-Euclidean space issue).
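As an illustration of clustering with a non-Euclidean, medoid-based scheme (not the system's actual code), the following Python sketch runs a simple k-medoids procedure on an arbitrary pairwise distance matrix, which is how a Horn-clause similarity could be plugged in:

```python
# Illustrative sketch (not the authors' system): k-medoids-style clustering driven
# by a pairwise distance matrix D, as needed when the similarity of [14] does not
# induce a Euclidean space. D below is a toy matrix.
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)          # assign to the closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):                                # new medoid = member with the
                costs = D[np.ix_(members, members)].sum(axis=1)   # smallest intra-cluster cost
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

D = np.array([[0, 1, 9, 8], [1, 0, 8, 9], [9, 8, 0, 1], [8, 9, 1, 0]], dtype=float)
print(k_medoids(D, 2))
```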
In order to properly deal with text, the distance in [14] is enriched with lexical knowledge from WordNET, defining a taxonomic similarity element as part of the overall similarity. This taxonomic similarity between couples of words in the text is defined through the intersection of the graphs of the parents in the is-a hierarchy of concepts in WordNET. The more nodes these graphs have in common, the more similar the respective words are. Clustering is applied to abstracts from EntrezGene in the following way. Disease genes are identified by the domain expert (there are already a large number of repositories which detail known genes for different diseases) and related documents for each gene are taken from EntrezGene. The same is performed for a series of candidate genes suspected to be responsible for the disease. The abstracts are pre-processed with the system presented in Section 2.3 and transformed into Horn clauses. These clauses are given in input to the clustering algorithm, which produces a set of clusters containing documents on disease and candidate genes. These clusters provide precious information to be analyzed in order to discover, in the same cluster, elements regarding disease and candidate genes. This is useful to produce a final prioritization based on the similarity values computed during clustering.

2.3 Rule-Based System for Syntactic and Logical Analysis of Text
Natural language processing (NLP) is one of the classical challenges for Artificial Intelligence and Machine Learning. Many NLP approaches rely on knowledge about natural language instead of statistical training on corpora. Our rule-based system falls among the knowledge-based approaches to NLP. The rule-based component is part of a larger system, DOMINUS [15], which performs a number of pre-processing steps on the raw text before it is given in input to the rule-based component. Specifically, after a tokenization step that aims at splitting the text into homogeneous components such as words, values, dates and nouns, and a language identification step, the system also carries out additional steps that are language-dependent: PoS-tagging (each word is assigned the grammatical role it plays in the text), stopword removal (less frequent or uniformly frequent items in the text, such as articles, prepositions, conjunctions, etc., are ignored to improve effectiveness and efficiency), and stemming (all forms of a term are reduced to a standardized form, this way reducing the number of elements). After these basic NLP steps, the text is given in input to the rule-based system, which performs Syntactic Analysis (yielding the grammatical structure of the sentences in the text) and Logical Analysis (providing the role of the various grammatical components in the sentences). After each grammatical component has been tagged with its semantic role, the structure is saved in an XML format for future use and then transformed into a Prolog representation as described in Section 2.1. The Prolog clause represents an example for the rule-induction system or a training instance for the clustering algorithm.
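As an illustration of these pre-processing steps (NLTK is used here only as a stand-in for the DOMINUS components), a minimal Python sketch is:

```python
# Illustrative sketch (not the DOMINUS system): tokenization, PoS-tagging,
# stop-word removal and stemming on a toy sentence.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('stopwords')

text = "The company acquired stocks."
tokens = nltk.word_tokenize(text)                      # tokenization
tagged = nltk.pos_tag(tokens)                          # PoS-tagging
stop = set(stopwords.words("english"))
content = [w for w, t in tagged if w.lower() not in stop and w.isalpha()]  # stop-word removal
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in content])              # stemming
```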
2.4 Quantitative Keyword Extraction Method
The quantitative keyword extraction method implemented in our system is a naïve Bayes technique based on the concepts of frequency and position of a term and on the independence of such concepts [16]. Indeed, a term is a possible keyword candidate if the frequency of the term is high both in the document and in the collection. Furthermore, the position of a term (both in the whole document and in a specific sentence or section) is an interesting feature to consider, since a keyword is usually positioned at the beginning/end of the text. Such features are combined according to the Bayes theorem in a formula to calculate the probability of a term to be a keyword; P(key|T, D, PT, PS) is given by:

P(key|T, D, PT, PS) = \frac{P(key) \cdot P(T|key) \cdot \prod_{i=1}^{|insD|} P(D_i|key) \cdot \prod_{j=1}^{|insT|} P(PT_j|key) \cdot \prod_{k=1}^{|insS|} P(PS_k|key)}{P\!\left(\sum_{i=1}^{|insD|} D_i + \sum_{j=1}^{|insT|} PT_j + \sum_{k=1}^{|insS|} PS_k\right)}    (1)

where P(key) represents the a priori probability that a term is a keyword (the same for each term), P(T|key) is the standard tf-idf value of the term, and P(D|key), respectively P(PT|key) and P(PS|key), are computed by dividing the distance of the first occurrence of the term from the beginning of the section (D), document (PT) and sentence (PS) by the number of terms in the section, document and sentence, respectively. Finally, P(D, PT, PS) is computed by adding the distances of the first occurrence of the term from the beginning of the section, document and sentence. Since a term could occur in more than one document, section or sentence, the sums of the values are considered. In this way, the probabilities of the candidate keywords are calculated and the first k with the highest probability are considered as the final keywords for the document. In the specific context, keywords are extracted from documents of EntrezGene regarding disease and candidate genes. For each document a set of weighted keywords is extracted and saved into a document × term matrix. This matrix is then combined with the gene-to-doc mapping given by EntrezGene in order to obtain the final gene × terms matrix. Different clustering algorithms are then applied to this matrix, such as simple k-means, Expectation-Maximization clustering or Cobweb.
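As an illustration of Eq. 1 (not the system's code), the following Python sketch combines pre-computed component probabilities into the keyword probability of a candidate term; all numeric values are toy placeholders:

```python
# Illustrative sketch (not the authors' code) of the keyword probability of Eq. 1.
# The component probabilities are assumed to be pre-computed as described in the
# text (tf-idf for P(T|key), relative first-occurrence positions for the others).
from math import prod

def keyword_probability(p_key, p_t_key, p_d_key, p_pt_key, p_ps_key, denom):
    """p_d_key / p_pt_key / p_ps_key are lists with one value per occurrence
    (section, document and sentence instances); denom is P(D, PT, PS)."""
    numerator = p_key * p_t_key * prod(p_d_key) * prod(p_pt_key) * prod(p_ps_key)
    return numerator / denom

# Toy values for one candidate term occurring in two sections, one document, two sentences.
print(keyword_probability(0.01, 0.35, [0.2, 0.5], [0.1], [0.3, 0.4], denom=1.5))
```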
2.5 Qualitative Keyword Extraction Method That Exploits Lexical Knowledge
In [18] it was shown that quantitative keyword extraction methods are complementary to qualitative ones. The quantitative method basically exploits term frequency and is more related to a collection of documents, while qualitative methods, which exploit lexical knowledge, are in general more related to the single document. The qualitative method implemented in our suite exploits a WordNET-based density function defined in [19]. The method works as follows:
terms not included in WordNet (frequent words such as articles, pronouns, conjunctions and prepositions) are not evaluated for classification, this way implicitly performing a stop-word removal. Given the set W = t_1, ..., t_n of terms in a sentence, each having a set of associated synsets S(t_i), a generic synset s will have weights:

– p(S(t_i), s) = 1/|S(t_i)| if s \in S(t_i), 0 otherwise, in S(t_i), and
– p(W, s) = \sum_{i=1,..,n} p(S(t_i), s)/|W| in sentence W.

If a term t is not present in WordNet, S(t) is empty and t will not contribute to the computation of |W|. The weight of a synset associated to a single term t_i is 1/(|W| \cdot |S(t_i)|). The normalized weight for a sentence is equal to 1. Given a document D made up of m sentences W_i, each with associated weight w_i > 0, the weight of W_i is p(D, W_i) = w_i / \sum_{k=1,..,m} w_k. The total weight for a document, given by the sum of the weights of all its sentences, is equal to 1. Thus, the weight of a synset s in a document can be defined as: p(D, s) = \sum_{j=1,...,m} p(W_j, s) \cdot p(D, W_j). In order to assign a document to a category, the weights of the synsets in the document that refer to the same WordNet Domains category are summed, and the category with the highest score is chosen. This Text Categorization technique, differently from traditional ones, represents a static classifier that does not need a training phase, and takes the categories from WordNet Domains. The keyword extraction algorithm, after computing the density function, proceeds as follows:
1. sort the list of synsets in the document by decreasing weight;
2. assign to the document the first k terms in the document referred to the synsets with highest weight in the list;
3. for each pair synset-weight create the pair label-weight, where label is the one that WordNet Domains assigns to that synset;
4. sort the pairs label-weight by decreasing weight;
5. select the first n domain labels that are above a given quality threshold.

After assigning weights to all synsets expressed by the document under processing, the synsets with the highest ranking can be selected, and the corresponding terms can be extracted from the document as best representatives of its content. Keywords extracted with the qualitative method can be used in the same way as those from the quantitative method described in the previous section. An interesting point to investigate is the intersection regarding common keywords found by both methods. In [18] it was found that, in a large collection of documents, there is an important intersection between the two methods, and this can be exploited by taking into consideration only the keywords found by both methods.
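The density weights described above can be illustrated with the following Python sketch (ours, with toy synset sets standing in for the WordNet lookups):

```python
# Illustrative sketch (not the authors' code) of the WordNet-based density weights:
# p(W, s) for a sentence and p(D, s) for a document. Synset identifiers are toy
# placeholders; in the real system they come from WordNet / WordNet Domains.
from collections import defaultdict

def sentence_weights(term_synsets):
    """term_synsets: list of synset sets, one per WordNet term of the sentence."""
    W = [S for S in term_synsets if S]              # terms outside WordNet are skipped
    weights = defaultdict(float)
    for S in W:
        for s in S:
            weights[s] += 1.0 / (len(W) * len(S))   # p(W,s) = sum_i p(S(t_i),s) / |W|
    return dict(weights)                            # sentence weights sum to 1

def document_weights(sentences, sentence_w):
    """sentences: per-sentence synset weights; sentence_w: weights w_i > 0."""
    total = sum(sentence_w)
    doc = defaultdict(float)
    for pw, w in zip(sentences, sentence_w):
        for s, v in pw.items():
            doc[s] += v * (w / total)               # p(D,s) = sum_j p(W_j,s) * p(D,W_j)
    return dict(doc)

s1 = sentence_weights([{"gene.n.01"}, {"disease.n.01", "illness.n.01"}])
s2 = sentence_weights([{"gene.n.01"}])
print(document_weights([s1, s2], [1.0, 1.0]))
```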
3 Experimental Evaluation
The suite of components is currently being tested on a number of candidate genes. There are different scenarios in which the framework could be used, and we are currently evaluating how to use it in practice for an important
Fig. 1. The Machine Learning and knowledge-based framework
disease. Here we describe each preliminary experiment that we are performing. Figures 1 and 2 present the various components of the pipeline. Scenario 1. Disease and candidate genes are taken from known lists of genes. Abstracts of documents regarding candidate and disease genes are taken from EntrezGene and given in input to the two keyword extraction methods. A document × term matrix is produced from each of them (a single matrix can be obtained by taking the keywords at the intersection of both methods). Then this matrix is combined with the gene-to-doc mapping of EntrezGene to produce the final gene × terms matrix. This matrix is then transformed into a suitable representation for clustering algorithms such as k-means, EM or Cobweb. Scenario 2. Scientific papers regarding a certain disease (and not) are taken from MEDLINE, serving as positive and negative examples for the learning system. These are first given in input to the pre-processing engine, which performs the basic NLP steps and the syntactic and logical analysis of the text. The output of this phase is a structural representation of the text with the semantic role of each text element. This representation is transformed into Prolog examples for the rule-induction system, which learns a semantic text classifier exploiting also the WordNET lexical knowledge base. The classifier is then used to properly weight papers from EntrezGene (regarding the genes involved) that do not have any tagging or labeling about the disease. The papers classified as not regarding the given disease are weighted differently from the papers classified as treating that disease. This produces a weighted gene × documents matrix that, combined with the matrix in Scenario 1, gives the final gene × terms matrix for the clustering of genes.
Fig. 2. Instance-based learning for gene clustering on first-order descriptions
Scenario 3. Disease and candidate genes are taken from domain experts or known lists from online repositories and given in input to the pre-processing component, which produces a Prolog clause for each abstract. These are given to the clustering algorithm that uses the similarity function defined on Horn clauses. Clustered documents are then mapped to a gene clustering through the gene-to-doc mapping of EntrezGene. Scenario 4. Scientific papers regarding specific genes are taken from MEDLINE and labeled with the aid of a domain expert. These are given to the pre-processing engine and then to the learning system, which produces an accurate text classifier for each gene. Then other unknown papers from MEDLINE are given in input to the classifier, which assigns each document to a gene. This process produces a novel gene-to-doc mapping which, combined with the document × term matrix from Scenario 1, gives the final gene × terms matrix on which clustering is then performed. The pipeline is designed also to use domain vocabularies. Currently the vocabulary to be used is MeSH, but we intend to extend to others such as eVOC, GO, OMIM and LDDB. Current experiments regard genes which are suspected to be involved in a certain disease. The respective documents are planned to be taken from MEDLINE according to the mapping of EntrezGene. Scenario 1 will be experimented on a collection of a very large number of documents that regard disease and candidate genes. Keywords will be extracted and weighted with the quantitative method described in the previous sections. For each gene a vector of weights will be produced, and then the Euclidean distance between the vectors will be computed. Finally, for each candidate gene the score is given by the sum of the distances of the respective vectors. Complete results on these experiments will be reported in an extended version of this work.
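As an illustration of Scenario 1 (not the actual pipeline code), the following Python sketch builds a gene × term matrix from a document × term matrix and a gene-to-document mapping, and ranks a candidate gene by its total Euclidean distance from the disease genes; all names and values are toy placeholders:

```python
# Illustrative sketch (not the authors' pipeline): gene x term profiles from a
# document x term matrix and a gene-to-document mapping, then distance-based scoring.
import numpy as np

doc_term = np.array([[3, 0, 1],        # doc 0
                     [0, 2, 2],        # doc 1
                     [1, 1, 0]])       # doc 2   (columns = keyword weights)
gene_to_docs = {"GENE_A": [0, 2], "DIS_GENE_1": [1], "DIS_GENE_2": [1, 2]}

# Each gene profile is the averaged profile of its documents.
gene_term = {g: doc_term[d].sum(axis=0) / len(d) for g, d in gene_to_docs.items()}

disease_genes = ["DIS_GENE_1", "DIS_GENE_2"]
for g in ["GENE_A"]:                   # candidate genes
    score = sum(np.linalg.norm(gene_term[g] - gene_term[d]) for d in disease_genes)
    print(g, score)                    # lower total distance = higher priority
```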
4 Related Work and Discussion
Text-based gene prioritization is receiving much attention and different approaches have been proposed for this purpose. In [20] a tool was proposed that scores concepts in GO according to their relevance to each disease via text mining. Potential candidate genes are then scored through a BLASTX search on reference sequences. Another approach proposed in [21] uses shared GO annotations to identify similarities for genes to be involved in the same disease. A similar text-mining approach was proposed in [22], where candidate gene selection is performed using the eVOC ontology as a controlled vocabulary. eVOC terms are first associated with disease names according to co-occurrence in MEDLINE abstracts. Then the identified terms are ranked and the genes annotated with the top-ranking terms are selected. One of the few approaches that exploits machine learning, and one of the most promising ones, is that proposed in [23]. The authors use machine learning to build a model and then rank the test set of candidate genes according to the similarity to the model. The similarity is computed as the correlation for vector space data and the BLAST score for sequence data. The advantage of the method is that it incorporates multiple genomic data sources (microarray, InterPro, BIND, sequence, GO annotation, Motif, Kegg, EST, and text mining). Recently, in [24] a gene prioritization tool for complex traits was proposed, which ranks genes by comparing the standard correlation of term-frequency vectors (TF profiles) of annotated terms in different ontological descriptions and integrates multiple ranking results by arithmetical (min, max, and average) and parametric integrations. The approach most closely related to ours is that of [25], which exploits the textual profile of genes for their clustering. The suite of algorithms and components that we propose here differs in many points from these previous approaches. First, to the best of our knowledge, machine learning in the form of rule-induction has not been used before for text-based gene prioritization or clustering. Rule-induction, due to the power of the approach, has been mainly used in biological domains for structural classification of molecules. Relation extraction from text is another important task often faced with rule-induction approaches, but these relations have not yet been used to produce gene characterization profiles. The use of relations could lead to novel and more reliable gene profiles, because relations could involve different biological entities of interest, and thus important information that was ignored before can now be used to strengthen the informative power of the gene profile. Second, similarity measures used previously for gene prioritization have always been defined on attribute-value representations, whereas here we use a novel similarity function defined on first-order descriptions. This has the advantage that first-order languages allow for a more thorough description of text, and this can help capture hidden features of the entities. Moreover, we adopt a novel representation of texts not simply as bags-of-words but as Horn clauses incorporating the syntactic and logical role of elements in the sentence. This helps perform, through rule-induction, more robust text classification compared to attribute-value based methods. In addition, taxonomic similarity is another novel feature
that we exploit in our similarity function in order to better capture the meaning of each text and properly define the similarity between texts. Finally, qualitative and quantitative keyword extraction methods are plugged in together to boost the extraction of the most significant terms. The qualitative method, being related to the single document, can find more specialized words that properly represent the single document. On the other hand, the quantitative method, being related to a collection of documents, tries to capture more general words found in the entire collection that can distinguish between the documents.
5
Conclusion
Most genetic diseases are characterized by many potential candidate genes that can cause the disease. Gene expression analysis typically produces a large number of co-expressed genes that could be potentially responsible for a given disease. Extracting prior knowledge from text-based genomic information sources is essential in order to reduce the list of potential candidate genes to be then further analyzed in laboratory. In this paper we present a suite of Machine Learning algorithms and knowledge-based components for improving the computational gene prioritization process. The pipeline includes basic Natural Language Processing capabilities, advanced text classification and clustering algorithms, robust information extraction components based on qualitative and quantitative keyword extraction methods and exploitation of lexical knowledge bases for semantic text processing.
References 1. Masys, D.R., Welsh, J.B., Fink, J.L., Gribskov, M., Klacansky, I., Corbeil, J.: Use of keyword hierarchies to interpret gene expression. Bioinformatics 17, 319–326 (2001) 2. Feldman, R., Sanger, J.: Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, Cambridge (2006) 3. Jenssen, T., Laegreid, A., Komorowski, J., Hovig, E.: A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet. 28, 21–28 (2001) 4. Raychaudhuri, S., Schutze, H., Altman, R.B.: Using text analysis to identify functionally coherent gene groups. Genome Res. 12, 1582–1590 (2002) 5. Shatkay, H., Edwards, S., Boguski, M.: Information retrieval meets gene analysis. IEEE Intelligent Systems (Special Issue on Intelligent Systems in Biology) 17, 45–53 (2002) 6. Chaussabel, D., Sher, A.: Mining microarray expression data by literature profiling. Genome Biol. 3 (2002) 7. Glenisson, P., Antal, P., Mathys, J., Moreau, Y., Moor, B.D.: Evaluation of the vector space representation in text-based gene clustering. In: Pacific Symposium on Biocomputing, pp. 391–402 (2003) 8. Lavrac, N., Dzeroski, S.: Inductive Logic Programming: Techniques and applications. Ellis Horwood, UK (1994)
9. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.: Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography 3(4), 235–244 (1990) 10. Fellbaum, C.: WordNet an Electronic Database, pp. 1–23. MIT Press, Cambridge (1998) 11. Magnini, B., Strapparava, C., Pezzulo, G., Gliozzo, A.: The role of domain Information in Word Sense Disambiguation. Natural Language Engineering 8(4), 359–373 (2002) 12. Magnini, B., Cavagli, G.: Integrating Subject Field Codes into WordNet. ITC-irst. In: Proc. Second International Conference on Language Resources and Evaluation, LREC 2000, pp. 1–6 (2000) 13. Esposito, F., Ferilli, S., Fanizzi, N., Basile, T.M., Di Mauro, N.: Incremental multistrategy learning for document processing. Applied Artificial Intelligence: An International Journal 17(8/9), 859–883 (2003) 14. Ferilli, S., Basile, T.M.A., Biba, M., Di Mauro, N., Esposito, F.: A General Similarity Framework for Horn Clause Logic. Fundamenta Informaticae Journal 90(1-2), 43–66 (2009) 15. Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for Digital Document Processing: From Layout Analysis To Metadata Extraction - Machine Learning in Document Analysis and Recognition, pp. 105–138 (2008) 16. Uzun, Y.: Keyword Extraction Using Na¨ıve Bayes, Bilkent University, Department of Computer Science (2005) 17. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Transactions On Information Theory 50(12) (December 2004) 18. Ferilli, S., Biba, M., Basile, T.M.A., Esposito, F.: Combining Qualitative and Quantitative Keyword Extraction Methods with Document Layout Analysis. In: Proceedings of 5th Italian Research Conference on Digital Libraries (IRCDL 2009), DELOS: an Association for Digital Libraries (2009) 19. Angioni, M., Demontis, R., Tuveri, F.: A Semantic Approach for Resource Cataloguing and Query Resolution. Communications of SIWN 5, 62–66 (2008) 20. Perez-Iratxeta, C., Wjst, M., Bork, P., Andrade, M.A.: G2D: a tool for mining genes associated with disease. BMC Genet. 6, 45 (2005) 21. Turner, F.S., Clutterbuck, D.R., Semple, C.A.M.: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol. 4(11), R75 (2003) 22. Tiffin, N., Kelso, J.F., Powell, A.R., Pan, H., Bajic, V.B., Hide, W.A.: Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res. 33(5), 1544–1552 (2005) 23. Aerts, S., Lambrechts, D., Maity, S., Van Loo, P., Coessens, B., De Smet, F., Tranchevent, L.C., De Moor, B., Marynen, P., Hassan, B., Carmeliet, P., Moreau, Y.: Gene prioritization through genomic data fusion. Nat. Biotechnol. 24(5), 537–544 (2006) 24. Gaulton, K.J., Mohlke, K.L., Vision, T.: A computational system to select candidate genes for complex human traits. Bioinformatics 23(9), 1132–1140 (2007) 25. Glenisson, P., Coessens, B., Van Vooren, S., Mathys, J., Moreau, Y., De Moor, B.: TXTGate: profiling gene groups with text-based information. Genome Biol. 5(6), R43 (2004)
SILACAnalyzer - A Tool for Differential Quantitation of Stable Isotope Derived Data Lars Nilse1,2 , Marc Sturm2 , David Trudgian3,4 , Mogjiborahman Salek4 , Paul F.G. Sims5 , Kathleen M. Carroll6, and Simon J. Hubbard1 1
Faculty of Life Sciences, University of Manchester, Michael Smith Building, Oxford Road, Manchester M13 9PT, UK
[email protected],
[email protected] 2 Wilhelm Schickard Institute for Computer Science, Eberhard Karls University, Tübingen, Sand 14, 72076 Tübingen, Germany
[email protected] 3 Centre for Cellular and Molecular Physiology, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK
[email protected] 4 Sir William Dunn School of Pathology, University of Oxford, South Parks Road, Oxford OX1 3RE, UK
[email protected] 5 Faculty of Life Sciences, Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK
[email protected] 6 Manchester Centre for Integrative Systems Biology, Manchester Interdisciplinary Biocentre, 131 Princess Street, Manchester M1 7DN, UK
[email protected]
Abstract. Quantitative proteomics is a growing field where several experimental techniques such as those based around stable isotope labelling are reaching maturity. These advances require the parallel development of informatics tools to process and analyse the data, especially for high-throughput experiments seeking to quantify large numbers of proteins. We have developed a novel algorithm for the quantitative analysis of stable isotope-based proteomics data at the peptide level. Without prior formal identification of the peptides by MS/MS, the algorithm determines the mass-to-charge ratio m/z and retention time t of stable isotope-labelled peptide pairs and calculates their relative ratios. It supports several non-proprietary XML input formats, requires only minimal parameter tuning, and runs fully automated. We have tested its performance on a low complexity peptide sample in an initial study. In comparison to a manual analysis and an automated approach using MSQuant, it performs as well as or better, and we therefore believe it has utility for groups wishing to perform high-throughput experiments.
1
Introduction
Proteomics experiments are becoming increasingly automated, genome wide and quantitative [1–3]. The move towards high-throughput quantitative methods leads
to larger data sets and data processing challenges. Many approaches are used for quantitation. Stable isotope labelling methods, where labelled amino acids are introduced in cell culture (e.g. SILAC [4]), allow otherwise identical cells to be cultured whose proteins and peptides differ only by a predictable, fixed mass shift. Quantitation can then be performed by comparing the intensities of labelled and unlabelled peptides observed during mass spectrometry (MS) experiments. The related QconCAT technique concerns recombinant proteins [5], but also performs quantitation by considering labelled peptides. Manual analysis of quantitative data, where individual peak pairs must be identified and their intensities integrated at the ion chromatogram level [6], is acceptable for small-scale studies, but is less attractive for several hundred or more peptide/protein identifications. A wide range of software tools for automated quantitation exist, provided by MS instrument vendors, third-party companies, and academic groups. Commercial packages such as Mascot Distiller (Matrix Science) (http://www.sanger.ac.uk/turl/90c) support a range of quantitation methods, but have significant licensing costs. Vendor software and many third party packages, e.g. MSQuant (http://www.sanger.ac.uk/turl/90f) and MaxQuant [7], are instrument specific, compatible with data files from a subset of MS instruments only. In this paper we introduce SILACAnalyzer, a tool which offers high quality automatable processing comparable with the best manual analyses, but does not require identification of peptides or indeed any other molecular ion species prior to quantitation. The software has minimal user parameters, is freely available, and has been integrated in the OpenMS framework [8], which ensures it can be used in pipelines and as a GUI-driven tool. It is compatible with a wide range of formats including the Proteome Standards Initiative (PSI) supported formats [9]. We believe these advantages ensure SILACAnalyzer is a useful addition to the range of software tools available for quantitation in the proteomics field.
2
Methods
The combination of liquid chromatography (LC) and mass spectrometry (MS) is a powerful technique for the analysis of complex protein samples. As the retention time t increases, different sets of molecules elute from the LC column. Each of these simplified mixtures is then analysed in a mass spectrometer. The resulting set of spectra (intensity vs mass-to-charge ratio m/z) forms an intensity map in the two-dimensional t-m/z plane. Such a raw map of LC-MS data [10] is the starting point of our analysis.
2.1 Algorithm
Our algorithm is divided into three parts: filtering, clustering and linear fitting, see Fig 1(d), (e) and (f). In the following discussion let us consider a particular mass spectrum at retention time 1350 s, see Fig 1(a). It contains a peptide of mass
Fig. 1. (a) Part of a raw spectrum showing the isotopic envelopes of a peptide pair, (b) spline fit on the raw data, (c) standard intensity cut-off filter, (d) non-local intensity cut-off filter as used in the algorithm, (e) clusters of raw data points that passed the filter, (f) linear fit on intensity pairs of a single cluster
1492 Da and its 6 Da heavier labelled counterpart (the exact mass shift is 6.02 Da; in the following discussion we use the approximation). Both are doubly charged [M+2H]2+ in this instance. Their isotopic envelopes therefore appear at m/z 746 and 749 in the spectrum. The isotopic peaks within each envelope are separated by 0.5. The spectrum was recorded at finite intervals. In order to read accurate intensities at arbitrary m/z we spline-fit over the data, see Fig 1(b). We would like to search for such peptide pairs in our LC-MS data set. As a warm-up let us consider a standard intensity cut-off filter, see Fig 1(c). Scanning
through the entire m/z range (red dot), only data points with intensities above a certain threshold pass the filter. Unlike such a local filter, the filter used in our algorithm takes intensities at a range of m/z positions into account, see Fig 1(d). A data point (red dot) passes if
1. all six intensities at m/z, m/z+0.5, m/z+1, m/z+3, m/z+3.5 and m/z+4 lie above a certain threshold, and
2. the intensities within the first envelope (at m/z, m/z+0.5 and m/z+1) and the second envelope (at m/z+3, m/z+3.5 and m/z+4) decrease successively (see footnote 2).
Let us now filter not only a single spectrum but all spectra in our data set (see footnote 3). Data points that pass the filter form clusters in the t-m/z plane, see Fig 1(e). Each cluster centers around the unlabelled peptide of a pair. We now use hierarchical clustering methods to assign each data point to a specific cluster. The optimum number of clusters is determined by maximizing the silhouette width [11] of the partitioning (see footnote 4). Each data point in a cluster corresponds to three pairs of intensities (at [m/z, m/z+3], [m/z+0.5, m/z+3.5] and [m/z+1, m/z+4]). A plot of all intensity pairs in a cluster shows a clear linear correlation, see Fig 1(f). Using linear regression we can determine the relative amounts of labelled and unlabelled peptides in the sample.
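The filtering and ratio-estimation steps just described can be summarised in a short sketch. The following Python fragment is only an illustration of the idea, not SILACAnalyzer code: the spline interpolation, the offsets for a doubly charged pair with a 6 Da label (3 m/z units apart, 0.5 m/z isotopic spacing), the function names and the toy spectrum are all assumptions made for the example.

import numpy as np
from scipy.interpolate import CubicSpline
from scipy.stats import linregress

def passes_pair_filter(intensity, mz, cutoff):
    # non-local filter: the six isotopic positions of a light/heavy pair
    # (charge 2+, 6 Da label => 3 m/z shift) must all exceed the cutoff
    # and decrease monotonically within each envelope
    light = [intensity(mz), intensity(mz + 0.5), intensity(mz + 1.0)]
    heavy = [intensity(mz + 3.0), intensity(mz + 3.5), intensity(mz + 4.0)]
    above = all(i > cutoff for i in light + heavy)
    decreasing = (light[0] >= light[1] >= light[2]) and (heavy[0] >= heavy[1] >= heavy[2])
    return above and decreasing

def pair_ratio(intensity_pairs):
    # estimate the heavy/light ratio of a cluster from its (light, heavy)
    # intensity pairs by linear regression, as in Fig 1(f)
    light, heavy = np.asarray(intensity_pairs).T
    slope, intercept, r, p, se = linregress(light, heavy)
    return slope

# toy spline-fitted spectrum: two isotopic envelopes at m/z 746 and 749,
# the labelled form being twice as intense as the unlabelled one
mz_grid = np.arange(744.0, 752.0, 0.01)
peaks = sum(a * np.exp(-0.5 * ((mz_grid - c) / 0.02) ** 2)
            for c, a in [(746.0, 100), (746.5, 60), (747.0, 30),
                         (749.0, 200), (749.5, 120), (750.0, 60)])
spline = CubicSpline(mz_grid, peaks)

print(passes_pair_filter(spline, 746.0, cutoff=10.0))   # True
print(pair_ratio([(100, 200), (60, 120), (30, 60)]))    # ~2.0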
2.2 Implementation
We implemented the above algorithm as part of TOPP - The OpenMS Proteomics Pipeline [8]. TOPP is based on OpenMS [10], an open-source C++ software library for the development of proteomics software tools. OpenMS provides a wide range of data structures and algorithms for data handling, signal processing, feature detection and visualization. It is written in ANSI C++ and makes use of several other libraries where appropriate, see Fig 2. As a result it is platform-independent, stable and fast. The classes in OpenMS are easy to use, intuitive and well documented. In short, developers can focus on their algorithms and do not need to worry about technicalities.
Footnote 2: The second rule leads to a significant reduction of the false positive rate in the peptide pair detection. Background signals are strongly suppressed. The rule has one drawback: at very high masses the first peak of an envelope is no longer the one of highest intensity. In such a case the data points of the first peak do not pass the filter, the corresponding cluster contains fewer data points, and the accuracy of the ratio is slightly reduced.
Footnote 3: The filtering step is controlled by two parameters: mz_step_width determines the step width with which we scan over the spline-fitted spectrum; it should be of the order of the m/z resolution of the raw data. intensity_cutoff specifies the intensity cut-off itself, see the dashed line in Fig 1(d).
Footnote 4: The clustering step requires two further parameters. The hierarchical clustering algorithm performs best for symmetric clusters, i.e. width and height of a cluster should be of the same order; we therefore scale the retention times by a dimensionless factor rt_scaling (for the example in Fig 3 we choose rt_scaling=0.02). In some cases the number of clusters returned by the silhouette width maximization is not optimal; the dimensionless parameter cluster_number_scaling can be used for fine-tuning.
Fig. 2. The TOPP tools including SILACAnalyzer are based on the OpenMS software library. OpenMS in turn builds on several other libraries: The Computational Geometry Algorithms Library (CGAL) (http://www.cgal.org) is used for geometric calculations. The GNU Scientific Library (GSL) (http://www.sanger.ac.uk/turl/917) deals with numerics problems. Xerces (http://www.sanger.ac.uk/turl/918) is used for XML file handling. Qt (http://www.sanger.ac.uk/turl/919) provides the platform independence layer and is used for the graphical user interface. LibSVM (http://www.sanger.ac.uk/turl/91a) is used for machine learning.
TOPP comprises a set of specialized software tools which can be used to design analysis pipelines for particular LC-MS experiments. OpenMS, and therefore TOPP, support the non-proprietary file formats mzData [12], mzXML [13] and mzML. Vendor software either allows the export to these formats or there exist helper tools that can convert from vendor formats to these open standards. In this way it is possible to analyse data sets from mass spectrometers irrespective of the vendor. The new TOPP tool SILACAnalyzer is an implementation of the algorithm described above. In the following section we test its performance in detail.
2.3 Samples
For the test we prepared three samples, each containing 25 pairs of known peptides plus an excess of unlabelled unpaired peptides (≈ 250). Each pair consisted of a labelled and an unlabelled form of the peptide. Their ratio was varied from sample to sample. The samples were analyzed in two different mass spectrometers, a Thermo LTQ Orbitrap XL and a Bruker Esquire 3000plus. On each machine three identical aliquots of each sample were run. The resulting 18 data sets were then analyzed in three different ways: manually, using SILACAnalyzer and using MSQuant. The analyses were restricted to peptide pairs with a mass difference of 6 Da in charge state 2+ (see footnote 5); they consequently appear as pairs with an m/z separation of 3 in the spectra. The peptides are the result of the tryptic digest of a QconCAT protein [5]. Table 1 lists all 25 peptides, T1 to T25, and their associated post-translational modifications.
Footnote 5: The data sets of the Thermo Orbitrap contain predominantly 2+ and 3+ ions. Data sets of the Bruker ion trap are dominated by 1+ and 2+ ions. In order to compare like with like, we include only 2+ ions in our analysis.
Table 1. Peptides T1 to T25 of the artificial QconCAT protein [5] and their post-translational modifications: pyroglutamate formation (pyro-glu) and methionine oxidation (meth-ox). Peptides in the lower part are not expected to be present in the spectra, see discussion in the text.

peptide  sequence                        m/z
T3       GFLIDGYPR                       519.27
T4       VVLAYEPVWAIGTGK                 801.95
T5       NLAPYSDELR                      589.30
T6       GDQLFTATEGR                     597.79
T7       SYELPDGQVITIGNER                895.95
T8       QVVESAYEVIR                     646.85
T8'      QVVESAYEVIR (pyro-glu)          638.34
T9       LITGEQLGEIYR                    696.38
T10      ATDAESEVASLNR                   681.83
T11      SLEDQLSEIK                      581.31
T12      VLYPNDNFFEGK                    721.85
T13      GILAADESVGTMGNR                 745.87
T13'     GILAADESVGTMGNR (meth-ox)       753.87
T14      ATDAEAEVASLNR                   673.84
T15      LQNEVEDLMVDVER                  844.91
T15'     LQNEVEDLMVDVER (meth-ox)        852.91
T16      LVSWYDNEFGYSNR                  875.40
T18      QVVDSAYEVIK                     625.84
T18'     QVVDSAYEVIK (pyro-glu)          617.32
T19      AAVPSGASTGIYEALELR              902.98
T20      LLPSESALLPAPGSPYGR              913.00
T21      FGVEQNVDMVFASFIR                929.96
T21'     FGVEQNVDMVFASFIR (meth-ox)      937.96
T22      GTGGVDTAAVGAVFDISNADR           996.99
T1       MAGR                            217.61
T2       VIR                             194.14
T17      ALESPERPFLAILGGAK               885.00
T23      AGK                             138.09
T24      VICSAEGSK                       447.22
T25      LAAALEHHHHHH                    705.35
Some of these peptides are not expected to appear as pairs with an m/z separation of 3 in the raw LC-MS data for the following reasons: The m/z of peptides T1, T2 and T23 are very small and lie outside the scan range used in the respective MS methods. Peptide T17 contains an arginine (R) which is protected from tryptic digestion by the following proline (P); the T17 peptides are consequently separated by a 12 Da shift and will not appear in our search. Peptide T24 contains a cysteine residue which was not protected by reduction and alkylation; we therefore do not expect it to be seen in the analyses. Finally, T25 contains a His tag and is therefore very likely to appear in higher charge states and not as 2+.
Table 2. False positive and negative rates for identification of 216 peptide pairs (3 samples of 3 aliquots each containing 24 peptide pairs)

Thermo orbitrap analysis   false negatives   false positives
manual                     18                3
SILACAnalyzer              26                9
MSQuant                    25                0

Bruker ion trap analysis   false negatives   false positives
manual                     86                10
SILACAnalyzer              103               10
Table 3. Average standard deviations of peptide pair ratios (only peptide pairs which were identified in all three aliquots were taken into account)

Thermo orbitrap analysis   sample 1   sample 2   sample 3
manual                     0.0186     0.0590     0.0639
SILACAnalyzer              0.0030     0.0344     0.0485
MSQuant                    0.0102     0.0653     0.1228

Bruker ion trap analysis   sample 1   sample 2   sample 3
manual                     0.1517     0.0464     0.2787
SILACAnalyzer              0.0777     0.0159     0.1581
3
Discussion
In what follows, we will compare performance and ease of use of the three different analyses. In the identification of peptide pairs, all three methods perform well, with broadly equivalent performance, see Table 2. The data presented highlight only the errors, and in the majority of cases (190 out of 216, 88%) the heavy/light peptide pairs were identified, notably without a priori identification in the case of SILACAnalyzer. Some of the peptides, for example T21 and its modified form T21’, appear with a very low signal in the spectra. Finding such pairs is equally challenging for the manual analysis as it is for the two software tools and such pairs were infrequently identified across all experiments and data processing methods. For cases in which peptide pairs were successfully identified in each of the three aliquots, we calculated the average standard deviation of the ratios, shown in Table 3. The results from SILACAnalyzer are the most consistent ones of the three analyses, a desirable property in an analysis tool. Since the precise source of the underlying systematic error is unknown in this system, this is an important metric to assess the quality of the approach.
Fig. 3. (a) SILACAnalyzer identifies 22 of the 24 peptide pairs in a data set of sample 2 (b) zoom of cluster 1, (c) plot of intensity pairs of cluster 1
Let us now compare how much effort was needed to achieve these results. First, it should be noted that an MSQuant analysis of the Bruker data set was not possible since the software did not support the data format. SILACAnalyzer on the other hand was able to analyse data from both mass spectrometers. That allowed a direct and unbiased comparison of both machines. Our algorithm requires no prior peptide identification or fitting of the data to theoretical models. The few parameters of the algorithm were optimized within minutes. Once the parameters for a mass spectrometer were found, the analyses were performed autonomously. On the other hand, both MSQuant and manual analysis required human input and took several hours or days respectively to complete. We
Fig. 4. Three samples with labelled and unlabelled peptides at ratios 1:3, 4:3 and 8:3 were analysed in a Thermo LTQ orbitrap and a Bruker Esquire 3000plus ion trap mass spectrometer. The ratios were calculated manually, using SILACAnalyzer and MSQuant. (The Bruker data format is not supported by MSQuant. An analysis of these data sets was therefore not possible).
therefore believe SILACAnalyzer will be well suited for the analysis of complex protein samples. The quality of the results obtained is also naturally of relevance, not just the ease of use or the ability to process data without obtaining identifications. We present a comparison of the performance of the different processing pipelines in Fig 4. Again, the performance of SILACAnalyzer is comparable to, and frequently better than, the other methods. The data presented also highlight the inherent variability of quantification data, which is reasonably typical of experimental data viewed at the individual peptide level. A considerable level of variability is present when compared to the expected ratio, which is
generally independent of the data processing method. This is apparent from the fact that individual data points which deviate from expectation are generally seen as triplets which co-cluster with the equivalent points processed by another method. A good example is T15’ for which the ratio is consistently underestimated across all experiments and processing pipelines, as shown in Fig 4.
4
Conclusions
In this paper we have described a new, effective and automated algorithm for the quantitative analysis of LC-MS data sets at the peptide level. Importantly, it requires no prior peptide identification and is therefore well suited to high-throughput analyses, and can identify molecular species with interesting quantitative changes which may be targeted for further study. Indeed, if only analytes with high confidence identifications are targeted for quantification then interesting changes could be lost. Our approach therefore provides the possibility of quantification without identification and can save instrument duty cycles that are usually required for acquiring MS/MS spectra for identification purposes. Using our approach, the interesting changes in the amount of molecular species can be prioritised for identification, maximising the amount of time the instrument spends studying relevant ions. An implementation of the algorithm is available as part of The OpenMS Proteomics Pipeline (TOPP) [8]. In combination with other TOPP tools such as IDMapper, SILACAnalyzer is an important component for the design of quantitative proteomics pipelines. We have tested the performance of our software tool and find its results to be as good as or better than those of existing methods. Its principal advantages over other methods lie in its fast performance, full automation and minimal parameter tuning, as well as the ability to perform quantitation without prior peptide identifications.
Availability
SILACAnalyzer is available as part of The OpenMS Proteomics Pipeline (TOPP) under the GNU Lesser General Public License (LGPL) from http://www.openms.de. Binary packages for Windows, MacOS and several Linux distributions are available in addition to the platform-independent source package.
Acknowledgments
Lars Nilse would like to thank Julia Handl for very helpful discussions and sharing her knowledge on clustering methods, and Oliver Kohlbacher and his group for their hospitality during a visit in Tübingen. David Trudgian wishes to acknowledge the Computational Biology Research Group, Medical Sciences Division, University of Oxford for use of their services in this project.
Funding: This work was supported by the Biotechnology and Biological Science Research Council [grant ref BB/E024912/1]. David Trudgian is funded by a John Fell OUP award and the EPA trust. Paul Sims thanks the Wellcome Trust for research support (grant no. 073896).
References 1. Cox, J., Mann, M.: Is proteomics the new genomics? Cell 130(3), 395–398 (2007) 2. Alterovitz, G., Liu, J., Chow, J., Ramoni, M.F.: Automation, parallelism, and robotics for proteomics. Proteomics 6(14), 4016–4022 (2006) 3. de Godoy, L.M.F., Olsen, J.V., Cox, J., Nielsen, M.L., Hubner, N.C., Frohlich, F., Walther, T.C., Mann, M.: Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast. Nature 455(7217), 1251–1254 (2008) 4. Ong, S.-E., Blagoev, B., Kratchmarova, I., Kristensen, D.B., Steen, H., Pandey, A., Mann, M.: Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1(5), 376–386 (2002) 5. Beynon, R.J., Doherty, M.K., Pratt, J.M., Gaskell, S.J.: Multiplexed absolute quantification in proteomics using artificial QCAT proteins of concatenated signature peptides. Nat. Methods 2(8), 587–589 (2005) 6. Schulze, W.X., Mann, M.: A novel proteomic screen for peptide-protein interactions. J. Biol. Chem. 279(11), 10756–10764 (2004) 7. Cox, J., Mann, M.: MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 26(12), 1367–1372 (2008) 8. Kohlbacher, O., Reinert, K., Gr¨ opl, C., Lange, E., Pfeifer, N., Schulz-Trieglaff, O., Sturm, M.: TOPP – the OpenMS proteomics pipeline. Bioinformatics 23(2), e191–e197 (2007) 9. Martens, L., Orchard, S., Apweiler, R., Hermjakob, H.: Human Proteome Organization Proteomics Standards Initiative: data standardization, a view on developments and policy. Mol. Cell. Proteomics 6(9), 1666–1667 (2007) 10. Sturm, M., Bertsch, A., Gr¨ opl, C., Hildebrandt, A., Hussong, R., Lange, E., Pfeifer, N., Schulz-Trieglaff, O., Zerck, A., Reinert, K., Kohlbacher, O.: OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics 9, 163 (2008) 11. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987) 12. Orchard, S., Hermjakob, H., Taylor, C., Binz, P.-A., Hoogland, C., Julian, R., Garavelli, J.S., Aebersold, R., Apweiler, R.: Autumn 2005 Workshop of the Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) Geneva, September 4-6 (2005); Proteomics 6(3), 738–741 (2006) 13. Pedrioli, P., et al.: A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 22(11), 1459–1466 (2004)
Non-parametric MANOVA Methods for Detecting Differentially Expressed Genes in Real-Time RT-PCR Experiments Niccol´o Bassani1 , Federico Ambrogi1 , Roberta Bosotti2 , Matteo Bertolotti2 , Antonella Isacchi2 , and Elia Biganzoli1 1
Institute of Medical Statistics and Biometry “G.A.Maccacaro” - University of Milan Campus Cascina Rosa, via Vanzetti, 5 - 20133 Milan (MI) Italy
[email protected] 2 Biotechnology Dept., Genomics Lab - Nerviano Medical Sciences SRL Viale Pasteur, 10 - 20014 Nerviano (MI)
Abstract. RT-PCR is a quantitative technique of molecular biology used to amplify DNA sequences starting from a sample of mRNA, typically used to explore gene expression variation across treatment groups. Because of the non-normal distribution of such data, non-parametric methods based on the MANOVA approach and on the use of permutations to obtain global F-ratio tests have been proposed. The issue of analyzing univariate contrasts is addressed via Steel-type tests. Results of a study involving 30 mice assigned to 5 different treatment regimens are presented. The MANOVA methods detect an effect of treatment on gene expression, with good agreement between methods. These results are potentially useful to draw out new biological hypotheses to be verified in subsequently designed studies. Future research will focus on the comparison of such methods with classical strategies for analysing RT-PCR data; moreover, work will also concentrate on extending such methods to doubly multivariate designs. Keywords: RT-PCR, gene expression, MANOVA, non-parametric, permutations.
1
Introduction
In molecular biology reverse transcription polymerase chain reaction (RT-PCR) is a laboratory technique used to amplify DNA copies from specific mRNA sequences. It is the method of choice to detect differentially expressed genes in various treatment groups due to its advantages in detection sensitivity, sequence specificity, large dynamic range, as well as its high precision and reproducible quantitation compared to other techniques [1]. In these investigations one usually has n samples (cell lines, tumor or skin biopsies, etc.), whose mRNA is reverse transcribed into cDNA and then amplified to obtain a value of expression for p
genes with the method of the threshold cycle. Often the n samples have been treated with k different doses of a molecule whose effect in determining a differential expression of the p genes involved is the main objective of the experiment. If the goal is to explore differences between the k groups, MANOVA is a well-established technique for this purpose. However, some basic assumptions, such as the normal distribution of the observations, must hold. Moreover, the number of covariates p must not exceed the number of units n, in order to have enough degrees of freedom for the estimation of the effects. In the context of RT-PCR both requirements are often violated: gene expression values typically show a right-skewed distribution, and it is quite common to have a small sample size on which the expression value of a high number of genes is measured. In the past few years there have been various attempts to address such issues. Specifically, Anderson et al. [2,3] and Xu et al. [4] have proposed methods for non-parametric MANOVA. Both these methods have been applied to real datasets of RT-PCR experiments. In the Methods section we briefly outline the two techniques for the one-way situation, as well as a non-parametric procedure for simultaneous univariate tests on the single variables introduced by Steel [5,6], while in the Results section a study on 30 mice for the analysis of the differential expression of 58 genes in tumor samples, performed at Nerviano Medical Sciences, a pharmaceutical company with a focus in Oncology located north of Milan, Italy, is presented.
2
Methods
Let yα^(j) be i.i.d. samples from the p-dimensional multivariate populations y^(j) with mean vectors μ^(j) and covariance Σ (j = 1, ..., k). The hypothesis to test in the classical MANOVA context is

H_0 : \mu^{(1)} = \mu^{(2)} = \ldots = \mu^{(k)} .   (1)

that is, there is no difference in the mean gene expression across the k groups. Starting from this hypothesis, Anderson et al. [2,3] and Xu et al. [4] have developed methods for testing differences among groups, using respectively the distances between the observations and the median statistics instead of the mean for the computation of the sums of squares. Both these methods do not assume multivariate normality of the responses as in classical MANOVA, and consider the p variables to be independent.
2.1
Method of Anderson: The Use of Distances
The method proposed by Anderson [2,3] was first introduced for ecological data, which often showed patterns that required a non-traditional handling. Given a population of N experimental units (i.e. observations) divided in k groups and a set of p variables, the task of determining whether the units differ across groups is transformed in the task of determining whether the distances
between units in the same group differ from those between units in different groups. Note that the term distance is used here to refer to any measure of distance and/or dissimilarity. Thus, starting from an N x p matrix containing the data, an N x N distance matrix D is defined, and from this it is possible to compute the total sum of squares as

SS_T = \frac{1}{N} \sum_{i=1}^{N-1} \sum_{s=i+1}^{N} d_{is}^2 .   (2)
where N is the total sample size of the experiment and d_{is} represents the (i, s)-th element of D, i.e. the distance between observations i = 1, ..., N and s = 1, ..., N. Similarly, the within-group sum of squares is computed as

SS_W = \frac{1}{n_j} \sum_{i=1}^{N-1} \sum_{s=i+1}^{N} d_{is}^2 \epsilon_{is} .   (3)
where n_j is the number of experimental units in the j-th group and ε_{is} takes the value 1 if observations i and s belong to the same group, and 0 otherwise. The between-group sum of squares can be obtained, as in the classical ANOVA setting, as SS_B = SS_T − SS_W. From these quantities one can compute a pseudo F-ratio in order to test the null hypothesis of no difference between the k groups of treatment:

F = \frac{SS_B / (k - 1)}{SS_W / (N - k)} .   (4)

As the variables are expected to follow a non-normal distribution, it is not reasonable to assume this test statistic to be distributed like a classical F with (k − 1, N − k) degrees of freedom. To test the hypothesis the authors propose to resort to the permutation method, which consists in randomly permuting m times the rows of the original dataset, computing m values of the F statistic, denoted as F*, and then calculating the p-value as
P = \frac{N^{\circ} \text{ of } F^* \geq F}{m} .   (5)

According to the results of this permutation analysis, if the null hypothesis of equality of the k treatments is rejected, the focus of the analysis should be to investigate the differences between the k groups. Thus, contrasts are analyzed using reduced datasets containing only the observations belonging to the groups that are to be compared and performing the same calculations described before to obtain the suitable pseudo-F tests and p-values.
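To make the procedure concrete, a minimal Python sketch of equations (2)-(5) is given below; it assumes Euclidean distances and balanced groups, and it is an illustration written for these notes rather than the authors' implementation. All names are ours.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def pseudo_f(d, groups):
    # Anderson's pseudo F-ratio from an N x N distance matrix and group labels
    groups = np.asarray(groups)
    n = len(groups)
    k = len(np.unique(groups))
    iu = np.triu_indices(n, 1)
    d2 = d[iu] ** 2
    same = groups[iu[0]] == groups[iu[1]]
    n_j = n / k                                  # balanced design assumed
    ss_t = d2.sum() / n                          # eq. (2)
    ss_w = d2[same].sum() / n_j                  # eq. (3)
    ss_b = ss_t - ss_w
    return (ss_b / (k - 1)) / (ss_w / (n - k))   # eq. (4)

def permutation_p(x, groups, m=999, seed=0):
    # permutation p-value, eq. (5): reshuffle the group labels m times
    rng = np.random.default_rng(seed)
    d = squareform(pdist(x))
    f_obs = pseudo_f(d, groups)
    f_perm = np.array([pseudo_f(d, rng.permutation(groups)) for _ in range(m)])
    return f_obs, (f_perm >= f_obs).sum() / m

# toy data: 5 groups of 6 "mice", 58 "genes", first group shifted upwards
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 58))
x[:6] += 1.0
print(permutation_p(x, np.repeat(np.arange(5), 6)))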
2.2
Method of Xu: The Use of Medians
This method, originally proposed by the authors for microarray investigation data, makes use of the medians in place of the mean for the computation of the sum of squares.
First, a robust version of the within-group error matrix and of the between-group error matrix is defined, as follows:

\tilde{W} = \sum_j n_j \, \mathrm{median}_\alpha \left\{ \left( y_\alpha^{(j)} - \tilde{\mu}^{(j)} \right) \left( y_\alpha^{(j)} - \tilde{\mu}^{(j)} \right)' \right\} .   (6)

\tilde{B} = \sum_j n_j \left( \tilde{\mu}^{(j)} - \tilde{\mu} \right) \left( \tilde{\mu}^{(j)} - \tilde{\mu} \right)' .   (7)

where

\tilde{\mu}^{(j)} = \mathrm{median} \left\{ y_\alpha^{(j)}, \; \alpha = 1, \ldots, n_j \right\} .   (8)

\tilde{\mu} = \mathrm{median} \left\{ y_\alpha^{(j)}, \; \alpha = 1, \ldots, n_j, \; j = 1, \ldots, k \right\} .   (9)

Here median_α means that the median is calculated element-wise on the matrix inside the curly brackets. For a more detailed notation refer to the original paper by Xu et al. [4]. It has to be noted that in general one cannot say that \tilde{W} + \tilde{B} = \tilde{T} = \sum_j n_j \mathrm{median}_\alpha \{ (y_\alpha^{(j)} - \tilde{\mu})(y_\alpha^{(j)} - \tilde{\mu})' \}, because of the properties of the median operator. To test the hypothesis previously stated, the following pseudo-F statistic has to be computed:

G = \frac{\mathrm{tr}(\tilde{B})}{\mathrm{tr}(\tilde{W})} .   (10)

To compute the p-value the authors suggest to use the method of random permutations. Just like in the method of Anderson, the p-value is given by

P = \frac{\#\{G^* \geq G\}}{m} .   (11)

where G* is the value of the test statistic for each permutation and m is the number of permutations. Also for this method it is possible to analyze single contrasts using pseudo-F tests computed on reduced datasets.
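The statistic G lends itself to an equally compact sketch; since only the traces of the robust error matrices are needed, only their diagonal elements are computed. Again this is a Python illustration of equations (6)-(11) written for these notes, not the authors' code, and all names are ours.

import numpy as np

def robust_g(x, groups):
    # median-based pseudo-F statistic G = tr(B~) / tr(W~)
    groups = np.asarray(groups)
    mu_all = np.median(x, axis=0)                        # overall median, eq. (9)
    tr_w = 0.0
    tr_b = 0.0
    for g in np.unique(groups):
        xg = x[groups == g]
        n_g = len(xg)
        mu_g = np.median(xg, axis=0)                     # group median, eq. (8)
        # diagonal of W~: element-wise median over alpha of the squared deviations
        tr_w += n_g * np.median((xg - mu_g) ** 2, axis=0).sum()
        tr_b += n_g * ((mu_g - mu_all) ** 2).sum()       # diagonal of B~, eq. (7)
    return tr_b / tr_w                                   # eq. (10)

def permutation_p(x, groups, m=999, seed=0):
    # permutation p-value of eq. (11)
    rng = np.random.default_rng(seed)
    g_obs = robust_g(x, groups)
    g_perm = np.array([robust_g(x, rng.permutation(np.asarray(groups))) for _ in range(m)])
    return g_obs, (g_perm >= g_obs).sum() / m

# usage (same toy layout as before): permutation_p(x, np.repeat(np.arange(5), 6))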
2.3
Steel Test: A Non-parametric Procedure for Simultaneous Univariate Contrasts
Once the analysis of contrasts has led to the identification of the relevant differences among the groups, one is interested in exploring for which variables these differences actually occur. To do this, Steel [5,6] developed a non-parametric univariate technique which can be considered an analogue of Tukey's procedure for multiple comparisons in the classical ANOVA setting. The procedure proposed by the author was originally introduced for balanced situations.
Given a set of k continuous random variables X_i (i = 1, ..., k) measuring a single characteristic for n subjects in each of the k treatment groups, let us define their cumulative distribution functions as F_i. Then, to compare treatment groups one has to test the null hypothesis

H_0 : F_1 = \ldots = F_k .   (12)

against the alternative

H_1 : F_i \neq F_j \text{ for at least one } i, j \text{ pair} .   (13)
To test H_0, jointly rank the X_i's and X_j's (i, j = 1, ..., k, i ≠ j), assigning rank 1 to the least observation and 2n to the greatest in each ranking. Then the rank sums T_{ij} are obtained by summing the ranks of every i, j pair previously calculated, and a conjugate value T'_{ij} is computed as T'_{ij} = (2n + 1)n − T_{ij}. The smaller of T_{ij} and T'_{ij} is used for carrying out the test. Given the current values of k and n one can determine the tabulated critical value of T_{ij} with which the actual values have to be compared to determine whether there is a difference between the groups. This outline clearly refers to the balanced situation, but it is possible, as Munzel et al. [7] show, to extend it to the unbalanced situation. In the real dataset used in the Results section, however, data are balanced across groups.
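For a single control-versus-treatment contrast in the balanced case, the rank sums of the Steel procedure can be sketched in Python as follows; the critical value must still be taken from Steel's tables, so only the statistic min(T, T') is computed here, and the toy data and names are invented.

import numpy as np
from scipy.stats import rankdata

def steel_rank_sum(control, treatment):
    # joint ranking of one control and one treatment group of equal size n;
    # returns min(T, T') with T the treatment rank sum and T' = (2n + 1) n - T
    n = len(control)
    assert len(treatment) == n, "Steel's original procedure assumes balanced groups"
    ranks = rankdata(np.concatenate([control, treatment]))   # ranks 1 .. 2n
    t = ranks[n:].sum()
    t_conj = (2 * n + 1) * n - t
    return min(t, t_conj)

rng = np.random.default_rng(2)
control = rng.normal(0.0, 1.0, size=6)
treatment = rng.normal(1.0, 1.0, size=6)
print(steel_rank_sum(control, treatment))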
3
Results and Discussion
Results are reported from an RT-PCR experiment planned and carried out at Nerviano Medical Sciences, a pharmaceutical company with a focus in Oncology, which involved 30 mice, considered as independent experimental units, divided into 5 groups with different doses of treatment with a cell cycle inhibitor: vehicle (i.e. not treated), 2.5 mg/kg, 5 mg/kg, 10 mg/kg, 20 mg/kg. In this experiment the expression level of 58 genes was measured, and the primary objective was to determine whether there was a difference in the levels of gene expression between the control group and the other 4 treatment levels.
3.1
Cluster Analysis
To preliminarily explore patterns of gene expression in this experiment, we drew a heatmap and the dendrograms of hierarchical clustering on both genes and mice, using Euclidean distance and complete linkage. It can be seen from Fig. 1 that control mice seem to cluster separately from the other groups, whereas higher doses of treatment (10 mg/kg - 20 mg/kg) are associated with increased gene expression (more prevalence of lighter grey blocks) and lower doses of treatment are associated with decreased gene expression (more prevalence of darker grey blocks). Moreover, there seem to be no relevant differences within these two groups.
Fig. 1. Heatmap and dendrogram on mice (columns) and genes (rows): most subjects receiving 2.5 mg/kg or 5 mg/kg of treatment dose are on the right side of the map, while most subjects receiving 10 mg/kg or 20 mg/kg are on the left side of the map. These groups are separated by subjects belonging to the control group (i.e. not receiving any treatment).
To confirm these preliminary exploratory results, a particular clustering algorithm named Affinity Propagation [8], recently proposed by Frey and Dueck, was applied. Affinity Propagation was chosen because it offers the possibility to use different distance measures and has a principled way to choose the number of clusters to be considered. This algorithm searches for exemplars among the data points, starting from a user-defined preference value which indicates how likely the i-th point is to be an exemplar. Thus, depending on this preference value the algorithm will identify a different number of clusters: it is possible to choose the number of clusters n by using a plot which shows the value of n associated with each preference value.
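One possible way to reproduce such a scan over preference values, assuming the 30 x 58 expression matrix is available, is scikit-learn's AffinityPropagation run on a precomputed similarity matrix; the random matrix below is only a placeholder and the grid of preference values is merely indicative of how the plot of n against the preference can be built. This is not the code used by the authors.

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import pairwise_distances

x = np.random.default_rng(0).normal(size=(30, 58))       # placeholder for the expression matrix
similarity = -pairwise_distances(x, metric="euclidean")   # negative Euclidean distance

# number of clusters obtained for a range of preference values
for pref in np.linspace(similarity.min(), np.median(similarity), 10):
    ap = AffinityPropagation(affinity="precomputed", preference=pref,
                             random_state=0).fit(similarity)
    print(f"preference={pref:9.2f}  clusters={len(ap.cluster_centers_indices_)}")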
In figures 2 and 3 the number of clusters associated with different preference values is presented, using two different similarity measures: negative Euclidean distance and Pearson correlation.
Fig. 2. Affinity Propagation using negative euclidean distance as similarity measure, where each plateau represents a different solution. The table shows the solution with three clusters: vehicle mice form an almost stand-alone cluster, whereas low(high) doses experimental units tend to cluster separately from high(low) doses.
The choice of the solution with three clusters is due to the fact that when using the Pearson correlation coefficient as similarity measure there are many preference values associated with three clusters. This solution seems to work well also when using negative Euclidean distance, even if only a few preference values are associated with it. For both measures, however, the results of the cluster analysis show that there seems to be a difference between vehicle mice and all other treatment groups. From both figures 2 and 3 it can be noted that the solution with 3 clusters is not the only one likely to be informative. In particular, for both the 4- and 5-cluster solutions and for both similarity measures described before, vehicle mice tend to cluster apart from all the others (results not shown). Also low-dose mice show patterns of clustering similar to those previously described, whereas higher doses tend to separate into different clusters, and sometimes to mix with low-dose mice. These results indicate that we should expect a difference in gene expression for all the 4 levels of treatment with respect to the control group, and that there
Fig. 3. Affinity Propagation using Pearson correlation as similarity measure, where each plateau represents a different solution. Results are similar to those shown in figure 2.
should be no such difference within the high (10 vs 20 mg/kg) and low (2.5 vs 5 mg/kg) regimens. In order to compare the two methods proposed in Sections 2.1 and 2.2 and to evaluate their agreement with these cluster analysis procedures, results from both analyses are described. With regard to the software, the method of Anderson made use of a Fortran routine available on the website of the author, while for the method of Xu an ad hoc set of R functions was written, starting from the Matlab code available from the authors.
3.2
Anderson Method
From the global F-ratio test in Tab. 1 it seems reasonable to suppose an effect of the factor treatment in determining a differential gene expression. Here the p-value was obtained after 100000 permutations of the original dataset.

Table 1. ANOVA table - Anderson method

Source      df   SS           MS          F        p (perm)
Treatment   4    20275.7481   5068.9370   5.4807   0.0003
Residual    25   23121.6129   924.8645
Total       29   43397.361
In Tab. 2 the results of the contrast analysis are shown: such analysis was carried out to evaluate differences between the control group (i.e. the vehicle mice) and each of the 4 treatment groups. It is interesting to note that the t tests reported here are nothing but the square roots of an F-ratio test computed as previously described, using a dataset with only the observations from the two groups compared. The adjustment via the False Discovery Rate method was used as multiple comparisons were made, since the Bonferroni method was considered too conservative for the exploratory purposes of the analysis.

Table 2. Contrast analysis - Anderson method

Contrast               t test   p (perm)   p (adjusted)a
2.5 mg/kg vs control   3.0866   0.0023     0.0046
5 mg/kg vs control     2.2696   0.0021     0.0046
10 mg/kg vs control    2.3822   0.0152     0.0152
20 mg/kg vs control    2.9347   0.0066     0.0088

a False Discovery Rate adjustment.
It can be noted that all contrasts are significant, which means that every group of treatment shows patterns of differential gene expression when compared to the control one. Moreover, consistent with the heatmap, no statistically significant difference was found when comparing mice treated with 2.5 mg/kg and with 5 mg/kg, nor when comparing those treated with 10 mg/kg with those treated with 20 mg/kg.
3.3
Xu Method
Recalling that the total sum of squares cannot be derived as the sum of the treatment and residual sums of squares using the Xu method, Tab. 3 reports only the traces (sums of squares) of the within-group (residual) and between-group (treatment effect) matrices, with the corresponding value of the F test and the p-value, here computed with 10000 permutations.

Table 3. ANOVA table - Xu method

Source      SS         F          p (perm)
Treatment   4388.906   1.737521   0.0085
Residual    2525.959
The difference in the choice of the number of random permutations is due to the computational burden of the method. In Fig. 4 the p-values resulting from different numbers of permutations (m) are shown; since the p-value of the pseudo global F test seems to become stable for values of m greater than 100, the choice of 10000 permutations is considered to be suitable.
Fig. 4. Global p-values with different numbers of permutations
In Tab. 4 the results of the contrast analysis made with the method of Xu are reported, with the corresponding values of the F test on the subgroups and the adjusted p-values. It can easily be seen that the two methods provide very similar results, both for the global F-ratio test and for the contrast analysis, even with a different number of permutations.

Table 4. Contrast analysis - Xu method

Contrast               F test   p (perm)   p (adjusted)a
2.5 mg/kg vs control   1.8328   0.0000     0.0000
5 mg/kg vs control     1.6195   0.0020     0.0040
10 mg/kg vs control    1.1365   0.0450     0.0450
20 mg/kg vs control    1.6459   0.0094     0.0125

a False Discovery Rate adjustment.
Specifically, from both Tab. 2 and 4, it can be seen that there is a difference in gene expression for all contrasts considered, i.e. each dose of treatment seems to be associated with a change in gene expression with respect to the control group. 3.4
Multiple Comparison Univariate Tests
To explore which of the 58 genes could be differentially expressed, the non-parametric procedure of Steel, which is particularly suitable for simultaneous univariate tests, was used.
Table 5. Univariate contrasts - Steel test

Gene     1       2       3       4
gene1    0.324   0.420   1.000   0.958
gene2    0.081   0.172   0.740   0.174
gene3    0.034   0.635   1.000   0.421
gene4    0.419   1.000   0.958   0.420
gene5    0.997   0.986   0.324   0.035
gene6    0.833   0.997   0.081   0.033
gene7    0.997   0.635   0.240   0.121
gene8    0.323   0.986   0.081   0.173
gene9    0.053   0.241   0.419   1.000
gene10   0.081   0.034   0.241   0.324
gene11   0.833   1.000   0.958   0.034
gene12   0.033   0.033   0.034   0.080
gene13   0.081   0.120   0.907   0.997
gene14   0.034   0.033   0.323   0.033
gene15   0.833   0.420   0.240   0.080
gene16   1.000   0.322   0.080   0.034
gene17   0.120   0.120   1.000   1.000
gene18   0.033   0.053   1.000   1.000
gene19   0.997   0.241   0.034   0.034
gene20   0.986   0.986   1.000   0.636
gene21   0.421   0.241   1.000   0.526
gene22   0.081   0.033   0.080   0.173
gene23   0.741   0.240   0.741   0.907
gene24   0.034   0.033   0.033   0.034
gene25   0.173   0.034   0.082   0.053
gene26   1.000   0.986   0.740   0.741
gene27   0.081   0.034   0.033   0.033
gene28   0.526   0.241   0.324   0.421
gene29   0.033   0.033   0.033   0.120
gene30   0.121   0.173   1.000   0.957
gene31   0.033   0.033   0.033   0.033
gene32   0.907   0.324   0.833   0.324
gene33   0.957   0.120   0.033   0.053
gene34   0.323   0.121   0.986   0.907
gene35   0.526   0.997   0.741   0.323
gene36   0.997   0.907   1.000   1.000
gene37   0.525   0.907   1.000   0.526
gene38   0.635   0.635   0.173   0.986
gene39   0.740   1.000   1.000   0.986
gene40   0.420   0.986   0.907   0.907
gene41   0.052   0.324   0.636   0.636
gene42   0.324   0.958   0.741   0.526
gene43   0.241   0.833   1.000   0.741
gene44   0.053   0.241   0.324   0.907
gene45   0.997   0.635   0.081   0.052
gene46   0.833   0.420   1.000   0.997
gene47   0.740   0.907   0.986   0.833
gene48   0.081   0.033   0.033   0.081
gene49   0.420   0.986   0.636   0.240
gene50   0.120   0.324   0.997   0.740
gene51   0.241   0.986   1.000   1.000
gene52   0.526   0.636   0.997   0.833
gene53   0.833   1.000   0.907   0.526
gene54   0.421   0.741   0.833   0.526
gene55   0.120   0.033   0.986   0.907
gene56   0.526   0.635   1.000   0.986
gene57   1.000   0.526   0.173   0.173
gene58   0.034   0.034   0.035   0.033

1: 2.5 mg/kg vs control; 2: 5 mg/kg vs control; 3: 10 mg/kg vs control; 4: 20 mg/kg vs control.
In Tab. 5 all the 2-sided p-values associated with the contrasts shown in Tab. 2 and in Tab. 4 are reported. 17 out of the 58 genes involved in this experiment were found to show possible patterns of differential expression for at least one of the four pre-specified contrasts. It can be noted that the p-values seem to be discretized, which is due to the fact that the Steel test acts on ranks of the variables to determine whether two groups are different or not, so that the test statistic can assume only a discrete number of values. The apparent lower boundary of the p-values is linked, besides the characteristics of the test itself, to the sample size of each group. In order to evaluate this boundary, a small simulation study was performed. As the Steel test works with univariate multiple comparisons, data were simulated from 5 Normal distributions (our five groups of treatment) with various values of μ and σ and multiple comparisons were carried out between the first group (the control one) and the others (treatment groups). In all the settings considered, and for different sample sizes, data were simulated 100 times to evaluate the impact
Table 6. Simulation features

Setting 1 (n = 6, 10, 15)a
Group   μ       σ
1       0.888   0.080
2       0.515   0.140
3       0.424   0.049
4       0.526   0.087
5       0.497   0.094

Setting 2 (n = 5, 10, 20, 40, 60, 100)a
Group   μ       σ
1       0.888   0.800
2       0.515   0.950
3       0.424   0.860
4       0.526   0.670
5       0.497   0.730

Setting 3 (n = 6, 10, 25, 40, 60, 100)a
Group   μ       σ
1       3.794   1.156
2       3.047   0.481
3       3.730   0.415
4       4.077   0.631
5       4.967   0.684

a Data simulated 100 times.
of group size on the p-values lower boundary. Means, standard deviations and sample sizes for each setting are summarized in Table 6. It has to be noted that the values of μ and σ of setting 1 are the experimental means and standard deviations of gene 24 in Table 5: these were chosen because they provided a situation where the means were quite different and σ was very low, in contrast with the second setting, where the means are equal to the previous ones but the values of σ are much higher. The values of μ and σ for the third setting are the experimental values for gene 4, characterized by partially overlapping means, a situation considerably different from those previously described. The barplot in figure 5 shows the simulation results for contrast 1 in the first setting. Only small sample sizes could be used because higher sizes led to p-values which were mostly equal to zero, whereas in the other settings it was possible to increase the size up to 100 (see Table 6). Nevertheless, the discretization previously described seems to become less relevant when increasing the sample size. In the other settings instead, the increase of the sample size led mostly to much more distinct p-values, and so to the disappearance of the lower boundary previously described. In some cases, however, larger sample sizes were associated with a higher degree of discretization, mainly because p-values tended to zero much more frequently, thus making it difficult to actually distinguish between them, even when using a barplot. An example of this can be seen in figure 6, which refers to contrast 1 in the second setting: as can be noted, a sample size of 100 seems to be associated with a higher frequency of p-values equal to zero compared to a size of 60. Results for the other settings and contrasts are similar and are not reported here.
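A rough Python sketch of this kind of simulation for setting 1 of Table 6 is given below; since Steel's tabulated critical values are not reproduced here, an exact conditional (permutation-type) two-sided p-value based on min(T, T') is used as a stand-in, which also exhibits the discreteness discussed above. All implementation choices are ours and are only meant to illustrate the design.

import numpy as np
from itertools import combinations
from scipy.stats import rankdata

def steel_stat(control, treatment):
    # min(T, T') for one balanced control-versus-treatment contrast
    n = len(control)
    ranks = rankdata(np.concatenate([control, treatment]))
    t = ranks[n:].sum()
    return min(t, (2 * n + 1) * n - t)

def exact_p(control, treatment):
    # exact conditional p-value: proportion of all equally likely splits of the
    # pooled sample whose statistic is at least as extreme (small) as observed
    pooled = np.concatenate([control, treatment])
    n = len(control)
    obs = steel_stat(control, treatment)
    count = total = 0
    for c in combinations(range(2 * n), n):
        mask = np.zeros(2 * n, dtype=bool)
        mask[list(c)] = True
        count += steel_stat(pooled[mask], pooled[~mask]) <= obs
        total += 1
    return count / total

# setting 1 of Table 6: well separated control group, very small sigma
mu = [0.888, 0.515, 0.424, 0.526, 0.497]
sigma = [0.080, 0.140, 0.049, 0.087, 0.094]
rng = np.random.default_rng(1)
n = 6
pvals = []
for _ in range(100):                                   # 100 simulated datasets, as in the paper
    groups = [rng.normal(m, s, n) for m, s in zip(mu, sigma)]
    pvals += [exact_p(groups[0], groups[j]) for j in range(1, 5)]
print(sorted(set(round(p, 3) for p in pvals))[:10])    # the few attainable p-values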
Fig. 5. Barplots of p-values in setting 1 (see table 6), with n = 6 (upper-left panel), 10 (upper-right panel) and 15 (lower panel)
Fig. 6. Barplots of p-values for contrast 1 in setting 2 (see table 6), with n = 60 (left panel) and 100 (right panel)
Even if the authors report that such tests are likely to have better small-sample properties than other non-parametric procedures [7], they do not
report how small the sample sizes can be for these properties to hold. It thus seems clear that larger and more detailed simulation studies are needed to better understand the particular features of this test, especially when it is used with low sample sizes.
4
Conclusions and Future Work
Beyond agreeing with well-known exploratory techniques (the clustering procedures shown in Section 3.1), these results are very useful for exploratory purposes; as a matter of fact, it is possible to draw out new biological hypotheses to be tested in appropriately designed studies. One main drawback of both methods is that they consider the p dependent variables as independent of one another: such an assumption is likely to be violated in many real situations, because it is reasonable to think of the genes as possibly correlated variables. On the other hand, these methods have the advantage of being easy to interpret, as the results are the same as those of a classic MANOVA approach: the goal is evaluating whether a set of variables shows different patterns depending on the levels of a specific factor. It is possible to extend these methods to let them account for doubly multivariate studies, i.e. situations where on the same subjects a set of covariates is measured repeatedly over time/space. In this case, the problem lies mainly in handling the pattern of correlation between repeated measures. A topic of future research will be the evaluation of the performance of these methods compared to other approaches used in the analysis of RT-PCR experiments. Issues concerning these approaches (e.g. dimensionality of the data) and the particular setting involved (e.g. correlation between variables) will be addressed via simulation studies.
References 1. Wong, M.L., Medrano, J.F.: Real-Time PCR for mRNA Quantitation. Biotechniques 39, 75–85 (2005) 2. Anderson, M.J.: A New Method for Non-Parametric Multivariate Analysis of Variance. Austral. Ecol. 26, 32–46 (2001) 3. McArdle, B.H., Anderson, M.J.: Fitting Multivariate Models to Community Data: a Comment on Distance-Based Redundancy Analysis. Ecology 82, 290–297 (2001) 4. Xu, J., Cui, X.: Robustified MANOVA with Applications in Detecting Differentially Expressed Genes from Oligonucleotide Arrays. Bioinformatics 24, 1056–1062 (2008) 5. Steel, R.G.D.: A Multiple Comparison Rank Sum Test: Treatments vs Control. Biometrics 15, 560–572 (1959) 6. Steel, R.G.D.: A Rank Sum Test for Compairing All Pairs of Treatments. Technometrics 2, 197–207 (1960) 7. Munzel, U., Hothorn, L.A.: A Unified Approach to Simultaneous Rank Test Procedures in the Unbalanced One-Way Layout. Biom. J. 43, 553–569 (2001) 8. Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science 315, 972–976 (2007)
In Silico Screening for Pathogenesis Related-2 Gene Candidates in Vigna Unguiculata Transcriptome Ana Carolina Wanderley-Nogueira1, Nina da Mota Soares-Cavalcanti1, Luis Carlos Belarmino1 , Adriano Barbosa-Silva1,2, Ederson Akio Kido1 , Semiramis Jamil Hadad do Monte3 , Valesca Pandolfi1,3 , Tercilio Calsa-Junior1, and Ana Maria Benko-Iseppon1 1 Universidade Federal de Pernambuco Center of Biological Sciences, Departament of Genetics, Laboratory of Plant Genetics and Biotechnology, R. Prof. Moraes Rˆego s/no., Recife, PE, Brazil
[email protected] 2 Max-Delbrueck Center for Molecular Medicine Computational Biology and Data Mining Group, Robert-Roessle-Str. 10, 13125, Berlin, Germany 3 Universidade Federal do Piaui Laboratory of Immunogenetics and Molecular Biology, Campus Petrˆ onio Portela bloco 16, Teresina, PI, Brazil
Abstract. Plants evolved diverse mechanisms to struggle against pathogen attack, for example the activity of Pathogenesis-Related (PR) genes. Within this category, PR-2 encodes a Beta-glucanase able to degrade the polysaccharides present in the pathogen cell wall. The aim of this work was to screen the NordEST database to identify PR-2 members in the cowpea transcriptome and to analyze the structure of the identified sequences as compared with data from public databases. After the search for PR-2 sequences in NordEST, CLUSTALx and MEGA4 were used to align PR-2 orthologs and generate a dendrogram. The CLUSTER program revealed the expression pattern through differential display. A new tool was developed aiming to identify plant PR-2 proteins based on HMMER analysis. Among the results, a complete candidate from cowpea could be identified. Higher expression included all libraries submitted to biotic (cowpea severe mosaic virus, CPSMV) stress, as well as wounded and salinity-stressed tissues, confirming PR expression under different kinds of stresses. Dendrogram analysis showed two main clades, the outgroup and Magnoliopsida, where monocot and dicot organisms were positioned as sister groups. The developed HMM model could identify PR-2 also in other important plant species, allowing the development of a bioinformatic routine that may help the identification not only of pathogenesis-related genes but also of any other gene classes that present similar conserved domains and motifs. Keywords: Bioinformatics, pathogenesis-related, Beta-glucanases, biotic stress, cowpea.
1
Introduction
In response to persistent challenge by a broad spectrum of microorganisms, plants have evolved diverse mechanisms that enable them not only to resist drought and wounding but also to oppose attacks by pathogenic organisms and prevent infection [1,2]. The first mechanism is the hypersensitive response (HR), which is immediate, starting with the recognition of a signal from a pathogen elicitor by a host Resistance (R) gene and leading to rapid cell death [3]; it acts as a lock-and-key system in which HR occurs only if the host presents the correct resistance gene that recognizes the pathogen Avirulence (Avr) gene [4]. The second strategy is the systemic activation of genes encoding mitogen-activated protein kinases (MAPKs) and pathogenesis-related (PR) proteins, which are directly or indirectly inhibitory towards pathogens and have been associated with the phenomenon of systemic acquired resistance (SAR) [5]. The PR proteins include seventeen gene families defined on the basis of their serological properties and sequence data. They generally comprise two subclasses: an acidic subclass, secreted to the cellular space, and a vacuolar basic subclass [6]. Direct antimicrobial activities for members of PR protein families have been demonstrated in vitro through hydrolytic activities on cell walls and contact toxicity, whereas indirect activities perhaps bypass an involvement in defense signaling [7]. There are at least ten PR families whose members have direct activities against fungal pathogens, such as Beta-glucanases. This enzyme, the product of PR-2 gene activity, is able to degrade the polysaccharides present in the pathogen cell wall, especially in fungi and oomycetes [2], preventing the colonization of the host by these organisms [8]. Studies suggested that PR-2 proteins play a protective role through two distinct mechanisms. First, the enzyme can impair microbial growth and proliferation directly by hydrolyzing the Beta-1,3/1,6-glucan of the cell walls, rendering the cells susceptible to lysis and possibly to other plant defense responses. Second, an indirect defensive role is suggested by the observation that specific Beta-1,3/1,6-glucan oligosaccharides, released from the pathogen walls by the action of glucanases, can induce a wide range of plant defense responses and hence the SAR [9]. In herbaceous plants PR-2 activation is strongly influenced by the accumulation of salicylic acid (SA) in their tissues [10]. In other plants high levels of ethylene and methyl jasmonate can act as markers for the PR-2 response [11]. Genes related to this pathway are highly conserved within the plant kingdom in relation to size, amino acid composition and isoelectric point [12], whilst some components of the system show similarity to proteins involved in innate immunity in the animal kingdom [13]. PR-2 genes are also constitutively present in some plant organs or tissues, including roots, leaves and floral tissues [14]. There is no previous evaluation regarding these metabolic pathways in V. unguiculata, which has great economic importance in semi-arid regions throughout the world. The present work aimed to perform a data mining-based identification of PR-2 genes in the NordEST database, comparing them with sequences deposited in public databases and literature data.
2
Methods
For the identification of PR-2 gene candidates, a tBLASTn alignment was carried out against the NordEST database, constructed using 18,984 transcripts isolated from V. unguiculata. An Arabidopsis thaliana sequence (AT3G57260.1) was used as the seed sequence. After this search, PR-2 matching sequences (cutoff e-5) were used to screen for homology in GenBank (NCBI) using the BLASTx tool [15]. Cowpea clusters were translated at ExPASy [16] and screened for conserved motifs with the aid of the RPS-BLAST CD-search tool [15]. Multiple alignments with CLUSTALx (available at http://www.nordest.ufpe.br/~lgbv/PR-2_Alignment) allowed the structural analysis of conserved and diverging sites as well as the elimination of non-aligned terminal segments. CLUSTALx alignments were submitted to the program MEGA (Molecular Evolutionary Genetic Analysis V.4) [17] in order to create a dendrogram using the maximum parsimony method and the bootstrap function (1,000 replicates). Conserved motif evaluation was carried out using *.aln files (from CLUSTALx) of eight PR-2 candidates from eight different species as input to the HMMER (Hidden Markov Models) program, which allowed the search for PR-2 typical patterns in 16 selected cowpea sequences. To establish an overall picture of the PR-2 gene distribution pattern in cowpea, we carried out a direct correlation of the read frequency of each protein sequence in the various NordEST cDNA libraries (available at http://www.nordest.ufpe.br/~lgbv/PR-2_Candidates_Reads). Afterwards, a hierarchical clustering approach was applied using normalized data and a graphic representation was constructed with the aid of the CLUSTER program. Dendrograms including both axes (using the weighted pair-group for each cluster and library) were generated by the TreeView program [18]. On the graphics, light gray means no expression and black means all degrees of expression (see Fig. 4). The analysis of protein migratory behavior, using only the best matches of each seed sequence, was generated with the JvirGel program [19] using an analytical mode for the serial calculation of MW (molecular weight), pI (isoelectric point), pH-dependent charge curves and hydrophobicity probes, generating a virtual 2D gel as a Java applet. The analysis of SuperSAGE data included the evaluation of 298,060 tags (26 bp) distributed in four libraries submitted to injury or mosaic virus infection against a negative control (available at http://www.nordest.ufpe.br/~lgbv/Sage_Tags). The transcripts were screened for homology using as seed sequence a full-length cDNA sequence corresponding to the A. thaliana PR-2 gene. A local BLASTn against the SuperSAGE tags database (cutoff e-10) was performed using the Bioedit program. The obtained tags were classified as up- or down-regulated by comparing the control with the infected and injured libraries; for this purpose the Discovery Space program was used. Moreover, a web tool was developed in order to identify amino acid sequences similar to the PR-2 protein. For this purpose we used a set of sequences highly similar to PR-2 from A. thaliana extracted from Entrez Protein.
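A minimal sketch of the screening step described above is given below: keeping only tBLASTn hits below the e-value cutoff of e-5. This is an illustration, not the NordEST project's actual code, and it assumes BLAST tabular output (one hit per line, twelve tab-separated columns with the e-value in the eleventh, as produced for example by "-outfmt 6" in BLAST+); the file name used is hypothetical.

# Filter tBLASTn hits at an e-value cutoff, mimicking the screening step above.
def filter_blast_hits(tabular_path, evalue_cutoff=1e-5):
    """Return (query_id, subject_id, evalue) tuples passing the cutoff."""
    hits = []
    with open(tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 12:
                continue  # skip malformed lines
            query_id, subject_id = fields[0], fields[1]
            evalue = float(fields[10])
            if evalue <= evalue_cutoff:
                hits.append((query_id, subject_id, evalue))
    return hits

if __name__ == "__main__":
    # Hypothetical file name; the real search used AT3G57260.1 against NordEST.
    for hit in filter_blast_hits("tblastn_vs_nordest.tsv"):
        print(hit)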
The tool has been developed as an HTML platform which accepts a single FASTA-formatted sequence (or a list of sequences) and some alignment parameters such as e-value, score, coverage and dust filters. Behind the platform, the software BLAST [15] is used to perform the batch pairwise alignments and PHP (available at http://www.php.net) is used to parse the alignment results in order to generate the tool output. In parallel to the alignment, for each query sequence that matched PR-2 sequences above the desired thresholds, an HMMPFAM analysis using the software HMMER [20] is conducted in order to screen for the relevant biological motifs present in PR-2 proteins, such as the glyco-hydro domain. Finally, a distance tree based on protein similarities is generated for the local PR-2 proteins and the user's selected sequences. For this purpose we have used the PHYLIP package (http://evolution.gs.washington.edu/phylip.html) and the Neighbor-Joining method. The developed web tool can be accessed at the supplementary material website in the corresponding link (available at http://www.nordest.ufpe.br/~lgbv/HAMMER_consensus). This bioinformatic routine may help the identification not only of pathogenesis-related genes but also of any other gene classes that present similar conserved domains and motifs, as shown in Figure 1.
Fig. 1. Pipeline to identify PR genes. Black boxes indicate data from automatic annotation. Gray boxes indicate manual annotation steps. Cylinders represent the databases used.
3
Results
After the tBLASTn search, five PR-2 candidates were identified in the NordEST database, including three clusters presenting the desired glyco-hydro motif as conserved domain (Tab. 1) and two singlets (sequences available at http://www.nordest.ufpe.br/~lgbv/vigna_nucleotide_sequences).

Table 1. V. unguiculata PR-2 candidates, showing the sequence size in nucleotides and amino acids with the conserved start and end sites, and the best alignment in the NCBI database, including the score and e-value. The accession numbers gb.AAY96764.1, gb.AAY25165.1, emb.CAO61165.1, emb.CAO71593.1 and gbEAU72980.1 correspond to sequences from Phaseolus vulgaris, Ziziphus jujuba, Vitis vinifera, Leishmania infantum and Synechococcus sp., respectively. Abbreviations: CD, Glyco-hydro Conserved Domain.

Vigna sequence        Size (nt/aa)   CD    Start/End   Best match        Score/E-value
Contig2370            804/140        yes   2/121       gb.AAY96764.1     255/8.00E-67
Contig2265            833/198        yes   1/164       gb.AAY25165.1     286/6.00E-76
Contig277             1244/242       yes   2/107       emb.CAO61165.1    345/1.00E-93
VUPIST02011D02.b00    867/103        no                emb.CAO71593.1    43/0.005
VUABTT00001D02.b00    552/78         no                gbEAU72980.1      33/4.6

Best alignments using the identified clusters occurred with chinese jujube (Ziziphus jujuba), grape (Vitis vinifera) and common bean (Phaseolus vulgaris). The three clusters presented the desired domain, although only Contig277 (Vigna 1) presented it complete; in the other two it was incomplete at the amino-terminal end. The HMMER search for glyco-hydro domains generated a pattern of conserved motifs characteristic of these proteins when applied to the V. unguiculata PR-2 sequences. Among the 16 NCBI cowpea sequences, three presented all 12 motifs that determine Beta-glucanase activity according to the HMMER consensus. Among the other sequences, 23.07% presented nine of the 12 conserved sites, 53.84% presented eight, 23.07% presented six and only 0.07% presented five sites (Fig. 2). As the aim was to reconstruct the evolutionary history of the PR-2 gene family considering the recently sequenced V. unguiculata, the most ancestral organisms (Pinus pinaster and Physcomitrella patens) were selected as outgroup (Fig. 3). The topology showed two main clades, as expected: (A) the outgroup and (B) the Magnoliophyta group, which appeared as a monophyletic clade. Moreover, within the latter, the monocot and dicot organisms were identified as sister groups (Fig. 3B). Considering the Magnoliopsida subclade, the organisms were grouped according to their family, but the Rosid subclass behaved as a paraphyletic group, sharing characteristics with the Asterids. The virtual electrophoresis evaluation of PR-2 proteins from the 11 analyzed species presented isoelectric points from 3.0 to 9.45. Considering the molecular mass (MW), values varied from 22.16 to 41.15 (Fig. 4). Closely related species did not present similar pIs, except in the case of corn and sugarcane. Transcripts obtained from the NordEST database were used to perform a hierarchical clustering analysis of the ESTs, permitting an evaluation of expression intensity considering co-expression in different libraries (black upper dendrogram) (Fig. 5).
Fig. 2. Twelve conserved motifs characteristic of PR-2 protein in 17 clusters from V. unguiculata. The first line shows the conserved motifs generated by the HMMER program using PR-2 proteins from eight different organisms. In light gray, it is possible to observe which motifs appeared in cowpea PR-2 candidates.
Fig. 3. Dendrogram generated after Maximum Parsimony analysis, showing relationships among the PR-2 seed sequence of A. thaliana and orthologs of V. unguiculata and other organisms with PR-2 proteins bearing the desired domains. The dotted line delimits the main taxonomic units and the letters on the right of the dendrogram refer to the groupings. The circle at the root of clade B shows the divergence point between monocot and dicot organisms. Decimal numbers under the branch lines represent distance values. The numbers in parentheses to the left of the branch nodes correspond to the bootstrap values.
Fig. 4. Graphic representation of PR-2 virtual 2D electrophoresis gel
The three V. unguiculata contigs were formed by 95 reads, with expression in seven of the nine NordEST libraries. The highest expression (21%) occurred in the library SS02 (roots of the genotype sensitive to salinity after 2 hours of stress). Furthermore, IM90 (leaves collected 90 min after virus inoculation) and CT00 (negative control) presented no expression. Regarding the SuperSAGE analysis, 31 tags were obtained with the local BLASTn in our database (Tab. 2); among them, eight (25.8%) were up-regulated in both the injured (BRC2) and infected (BRM) libraries when compared with the control. No PR-2 tag was down-regulated and 23 presented no differential regulation.
Fig. 5. PR-2 expression profile. Black indicates higher expression, gray lower expression, and light gray absence of expression in the corresponding tissue and cluster. Abbreviations: CT00 (control); BM90 (Leaves of BR14-Mulato genotype); IM90 (Leaves of IT85F genotype collected with 90 minutes after mosaic viruses infection); SS00 (Root of genotype sensitive to salinity without salt stress); SS02 (Root of genotype sensitive to salinity after 2 hours of stress); SS08 (Root of genotype sensitive to salinity after 8 hours of stress); ST00 (Root of genotype tolerant to salinity without salt stress); ST02 (Root of genotype tolerant to salinity after 2 hours of stress); ST08 (Root of genotype tolerant to salinity after 8 hours of stress).
Table 2. V. unguiculata SuperSAGE tags in two stress conditions (injured and infected) and information about their expression when compared to the control

SuperSAGE Tag   Tag Sequence
VM753           CATGGTGGTGGGTTCAAGAAGTGGAA
VM792           CATGGCTTGCAGCTCATCCTCACTGT
VM882           CATGCACATTCACCTCATTTCAATGG
VM10042         CATGAAATTCTCGGTGATCCTTTTTC
VM11167         CATGGCATAGATATGTTGATGATTCG
VM13390         CATGCAGAGTATCAAATTGTTCACCT
VM13919         CATGCTTGTTGTAGTAAATTCAAATT
VM16412         CATGAAACAGTAAGGAATAATTAAGG
VM16415         CATGAAACAGTCCCTAAATAATAGAT
VM17065         CATGAATGGATGAGAATAATTAATGC
VM17386         CATGACCAAGAAGGAAGCTACCCAGG
VM19065         CATGCACATTCACCTCATTTCAATGC
VM20627         CATGCTTTAATTTCAACTATGGCATC
VM20828         CATGGAACTGGATCTGGATGAAAGAC
VM20889         CATGGAAGTTAATTTGAACTACTCTG
VM24293         CATGTCAGCTCGGTTAAATACGCTTA
VM24766         CATGTGAAGAATGAAATATTTGTGCT
VM24767         CATGTGAAGAATGACTCAAAGAAATA
VM25644         CATGTTAACAATTTGTAATGAATCAG
VM3048          CATGATTTTGGGAACTTGTTGTATTA
VM3157          CATGTGAAGAATGACTCAAAGAATAA
VM3748          CATGCGAGCAAGGGAGAAGTTGTAGG
VM5358          CATGCTTTAATTTCAACTATGGGTCG
VM6343          CATGTGTACATAAACAACAAAACATT
VM6436          CATGTTTCTTGATTTTTGGTGGGATT
VM8048          CATGTAACTTTTAACAATTTGATATT
VM8117          CATGTATATATTGAAATTAACTTTAC
VM8194          CATGTCCTTCATTTCCAACGTCCCTG
VM8909          CATGCCAAGGTTATAAATGTTGTTGT
VVM9116         CATGGAGAGTAGGAAGGTTCAGGATG
VM9550          CATGTATTCTGTATTTTTCTATGATA

Of these 31 tags, eight were up-regulated in the injured library and eight in the infected library relative to the control (see text); the remaining tags presented no differential regulation.

4
Discussion
Considering the number of sequences available hitherto in the NordEST project, the number of PR-2 candidate sequences was higher than expected, revealing five clusters that aligned with this gene, formed by 95 reads. Regarding the search for conserved motifs, essential for protein activity, it is interesting to note that some sequences presented sites that did not match the consensus generated by the HMMER program, although most of the differences occurred with synonymous amino acids like methionine, leucine, valine or isoleucine, probably not affecting the protein activity; as methionine is a hydrophobic amino acid, and can nearly be classed with the other aliphatic amino acids, it prefers substitution with other hydrophobic amino acids (like leucine, valine
and isoleucine) [21]. Profile hidden Markov models (profile HMMs) may be a valuable tool to identify not only PR proteins but any other family built from the seed alignment and an automatically generated full alignment that contains all detectable protein sequences [22]. In the generated dendrogram it was possible to identify that symplesiomorphic characteristics united all Magnoliopsida organisms, as expected, since these plants evolved from a common ancestor. In the Magnoliopsida group, the evolutionary model of PR-2 seemed to follow a synapomorphic pattern, leading to its presence in different families and subclasses. Moreover, it was possible to perceive that the different Magnoliopsida families were grouped based on unique autapomorphies. In the monocot grouping all members were annual cereal grains of the Poaceae family. However, while maize and sugarcane were grouped in a terminal clade, rice was grouped separately. This may be justified by the fact that Zea and Saccharum belong to the Panicoideae subfamily, while Oryza belongs to the Ehrhartoideae subfamily. The studied organisms presented different centers of origin, habitats and life cycles, as well as tolerance, resistance and sensitivity to diverse kinds of biotic and abiotic stresses. Despite that, in a general view, it is evident that the PR-2 pathway, which is activated under any type of stressor component, was a characteristic present in a common ancestor, being relatively conserved in the different groups during evolution. Considering our evaluation using virtual bidimensional electrophoresis migration for the PR-2 sequences, one group could be clearly identified, with only two sequences (corn and sugarcane) deviating, especially with respect to their molecular weight, despite their similar pIs, which also confirms the similarity revealed in the dendrogram. The second group included most of the 10 remaining species. Since the functional domains presented a high degree of conservation, the divergent pattern of migration probably reflected divergences in the extra-domain regions that are responsible for the acidic and basic character of the proteins. Such extra-domain variations are also responsible for the diversity of the sequences and probably for differences in their overall structure and transcription control. The pattern of expression showed that PR-2 transcripts appeared in seven of the nine available libraries from the NordEST project, being absent from the negative control (CT00, no stress) and from IM90 (leaves collected from the IT85F genotype 90 minutes after virus inoculation). The absence in the non-stressed plants was expected, and the absence in the genotype IT85F (which is susceptible to this virus infection) suggests that in the resistant genotype (BR14-Mulato) this category of gene is probably activated earlier after injury or virus infection. A similar pattern was confirmed by our SuperSAGE data, with high expression after salinity stress as compared with the negative control. The activation of PR genes after different categories of abiotic stress (wounding and salinity) confirmed the theory that these genes can be activated and expressed systemically as a response to any kind of stress, biotic or abiotic [10, 12].
The identified sequences represent valuable resources for the development of markers for molecular breeding and of pathogenesis-related genes specific for cowpea and other related crops. Additionally, the bioinformatic routine may help biologists to identify not only PR-2 genes but also any other class of genes that bear similar conserved domains and motifs in their structure in higher plants. This platform represents the first step towards the development of online tools to identify genes and factors associated with the response to pathogen infection in plants.
Acknowledgments. This work was developed as part of the Northeastern Biotechnology Network (RENORBIO). The authors are grateful to the Brazilian Ministry of Science and Technology (MCT/CNPq) and to the Foundation for Support of Science and Technology of Pernambuco State, Brazil (FACEPE) for supporting our research.
References

1. Pedley, K.F., Martin, G.B.: Role of mitogen-activated protein kinases in plant immunity. Current Opinion in Plant Biology 8, 541-547 (2005)
2. Campos, A.M., Rosa, D.D., Teixeira, J.E.C., Targon, M.L.P.N., Sousa, A.A., Paiva, L.V., Stach-Machado, D.R., Machado, M.A.: PR gene families of citrus: their organ specific biotic and abiotic inducible expression profiles based on ESTs approach. Genetics and Molecular Biology 30, 917-930 (2007)
3. Bonas, U., Anckerveken, G.V.: Gene-for-gene interactions: bacterial avirulence proteins specify plant disease resistance. Current Opinion in Plant Biology 2, 94-98 (1999)
4. van Loon, L.C., van Strien, E.A.: The families of pathogenesis-related proteins, their activities, and comparative analysis of PR-1 type proteins. Physiological and Molecular Plant Pathology 55, 85-97 (1999)
5. Bokshi, A.I., Morris, S.C., McDonald, K., McConchie, R.M.: Application of INA and BABA control pre and post harvest diseases of melons through induction of systemic acquired resistance. Acta Horticulturae 694 (2003)
6. Cutt, J.R., Klessig, D.F.: Pathogenesis Related Proteins. In: Boller, T., Meins, F. (eds.) Genes Involved in Plant Defense, pp. 209-243. Springer, Vienna (1992)
7. van Loon, L.C., Rep, M., Pieterse, C.M.J.: Significance of inducible defense-related proteins in infected plants. Annual Review of Phytopathology 44, 135-162 (2006)
8. Zhu, Q.E., Maher, S., Masoud, R., Dixon, R., Lamb, C.J.: Enhanced protection against fungal attack by constitutive coexpression of chitinase and glucanase genes in transgenic tobacco. Biotechnology 12, 807-812 (1994)
9. Côté, F., Hahn, M.G.: Oligosaccharins: structures and signal transduction. Plant Molecular Biology 26, 1379-1411 (1994)
10. Delaney, T.P., Ukness, S., Vernooij, B., Friedrich, L., Weymann, K., Negrotto, D., Gaffney, T., Gut-Rella, M., Kessman, H., Ward, E., Ryals, J.: A central role of salicylic acid in plant disease resistance. Science 266, 754-756 (1994)
11. Fanta, N., Ortega, X., Perez, L.M.: The development of Alternaria alternata is prevented by chitinases and beta-1,3-glucanases from Citrus limon seedlings. Biology Research 36, 411-420 (2003)
12. Bonasera, J.M., Kim, J.F., Beer, S.V.: PR genes of apple: identification and expression in response to elicitors and inoculation with Erwinia amylovora. BMC Plant Biology 6, 23-34 (2006)
13. Nurnberg, T., Brunner, F.: Innate immunity in plants and animals: emerging parallels between the recognition of general elicitors and pathogen-associated molecular patterns. Current Opinion in Plant Biology 5, 318-324 (2002)
14. Osmond, R.I., Hrmova, M., Fontaine, F., Imberty, A., Fincher, G.B.: Binding interactions between barley thaumatin-like proteins and (1,3)-beta-D-glucans. Kinetics, specificity, structural analysis and biological implications. European Journal of Biochemistry 268, 4190-4199 (2001)
15. Altschul, S.F., Gish, W., Miller, W., Myers, E.: Basic local alignment search tool. Journal of Molecular Biology 215, 403-410 (1990)
16. Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D., Bairoch, A.: ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research 31(13), 3784-3788 (2003)
17. Kumar, S., Tamura, K., Nei, M.: MEGA 3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Briefings in Bioinformatics 5, 150-163 (2004)
18. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, B.: Cluster analysis and display of genome-wide expression patterns. Genetics 25, 14863-14868 (1998)
19. Ewing, R.M., Kahla, A.B., Poirot, O., Lopez, F.: Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Research 9, 950-959 (1999)
20. Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14, 755-763 (1998)
21. Betts, M.J., Russell, R.B.: Amino acid properties and consequences of substitutions. In: Barnes, M.R., Gray, I.C. (eds.) Bioinformatics for Geneticists. Wiley, Chichester (2003)
22. Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, L.E., Bateman, A.: The Pfam protein families database. Nucleic Acids Research 36, 281-288 (2008)
Penalized Principal Component Analysis of Microarray Data Vladimir Nikulin and Geoffrey J. McLachlan Department of Mathematics, University of Queensland {v.nikulin,g.mclachlan}@uq.edu.au
Abstract. The high dimensionality of microarray data, the expressions of thousands of genes in a much smaller number of samples, presents challenges that affect the validity of the analytical results. Hence attention has to be given to some form of dimension reduction to represent the data in terms of a smaller number of variables. The latter are often chosen to be linear combinations of the original variables (genes), called metagenes. One commonly used approach is principal component analysis (PCA), which can be implemented via a singular value decomposition (SVD). However, in the case of a high-dimensional matrix, SVD may be very expensive in terms of computational time. We propose to reduce the SVD task to an ordinary maximisation problem with a Euclidean norm, which may be solved easily using gradient-based optimisation. We demonstrate the effectiveness of this approach for the supervised classification of gene expression data. Keywords: Singular value decomposition, k-means clustering, gradient-based optimisation, cross-validation, gene expression data.
1
Introduction
The goal of projection pursuit is to find low-dimensional projections that provide the most revealing views of the full-dimensional data [1], [2]. Principal component analysis, in statistical science, provides a useful mathematical framework for finding important and interesting low-dimensional projections. A PCA can be implemented via SVD of the sample covariance matrix [3]. In PCA, the derived principal components are orthogonal to each other and represent the directions of largest variance. PCA captures the largest amount of information in the first few principal components, and guarantees minimal information loss and minimal reconstruction error in a least-squares sense [4]. PCA is a popular method of data decomposition or matrix factorisation with applications throughout science and engineering. The decomposition performed by PCA is a linear combination of the input coordinates, where the coefficients of the combination (the principal vectors) form a low-dimensional subspace that corresponds to the direction of maximal variance in the data. PCA is attractive for a number of reasons. First, the maximum variance property provides a way to compress the data with minimal information loss; in fact, the principal vectors provide the closest linear subspace to the data. Second, the representation of the data in the
projected space is uncorrelated, which is a useful property for subsequent statistical analysis. Third, the PCA decomposition can be achieved via an eigenvalue decomposition of the data covariance matrix [5]. The pure SVD is quite a complex task, which (in the case of high-dimensional data) would require a significant amount of computational time. The main objective of the imposed penalty or regularisation is to simplify the task and to reduce it to an ordinary squared optimisation, which may be easily solved using stochastic gradient descent techniques. We shall compare penalized PCA with another approach that uses regularised k-means to group the variables into clusters [6] and represent them by the corresponding centroids. Both methods are unsupervised and, therefore, may be characterised as so-called filter methods. Generally, 'unsupervised learning methods' are widely used to study the structure of the data when no specific response variable is specified. In contrast to SVM-RFE [7], we perform dimension reduction using penalized PCA only once to select the variables. This paper is organised as follows. Section 2 discusses two stochastic gradient descent models, with left and right updates, which are essentially different. In the case of the model with the left update we look for the matrix of metagenes directly, and are able to approximate the original observations as linear combinations of these metagenes. As a consequence, it may be difficult to explain secondary metagenes in terms of the original genes. In contrast, in the model with the right update we look for the orthogonal matrix-operator R, and compute the required matrix of metagenes as the product of the original matrix of gene expression data X and R. As a result, the computational structure of the metagenes may be clearly explained as linear combinations of the original genes. Dimension reduction represents the first step of the proposed two-step procedure. As a second step (direct classification of the patterns) we can use linear regression, SVM or multinomial logistic regression (MLR) [8], which may be particularly useful in the case of multi-class classification to explore the hidden dependencies between different classes. Section 3 describes five real datasets, of which three are binary and two have multi-class labels. Section 4 explains the experimental procedure and the most important aspects of our approach. Finally, Section 5 concludes the paper.
2
Modelling Technique
Let (x_i, y_i), i = 1, . . . , n, be a training sample of observations, where x_i ∈ R^p is a p-dimensional vector of features and y_i is a binary label, y_i ∈ {−1, 1}. Boldface letters denote matrices or column vectors. Let us denote by X = {x_ij, i = 1, . . . , n, j = 1, . . . , p} the matrix containing the observed values of the p variables for the above n observations. For gene expression studies, the number p of genes is typically in the thousands, and the number n of experiments is typically less than 100. The data are represented by an expression matrix X of size n × p, whose columns contain the expression levels of the p genes in the n samples. Our goal is to find a small number of metagenes or factors.
Fig. 1. Algorithm 1: behaviour of T^{(L)}_{n·p} as a function of global iteration: (a) colon, (b) leukaemia, (c) lymphoma and (d) Sharma
SVD can be used in practice to remove 'uninteresting structure' from X [9]. It is well known that the leading term of the SVD provides the bilinear form with the best least-squares approximation to X. SVD can be employed to perform a principal component analysis on X'X. The classical method of factoring a two-dimensional array (matrix) [10] is to factor it as X ≈ LDR', where L is a matrix of left eigenvectors of size n × k, D is a k × k diagonal matrix of eigenvalues, and R' is a k × p matrix of (transposed) right eigenvectors, with L'L = R'R = I_k, where I_k is the identity matrix of size k. The relationship

X = LDR',    (1)
is exact if X has rank k. The task is to find the matrix L, which may then be used as a replacement for the original microarray matrix X.

Algorithm 1. Penalised PCA (left update)
1. Input: X - matrix of microarrays.
2. Select the number of global iterations; k - number of factors; α - regulation parameter; λ - initial learning rate; 0 < ξ < 1 - correction rate; TS - initial value of the target function.
3. The initial matrix L may be generated randomly.
4. Compute the values of the matrices E, Q and the vector S according to (5).
5. Global cycle: repeat the following steps 6-12 for the selected number of global iterations:
6. factors-cycle: for j = 1 to k repeat steps 7-10:
7. genes-cycle: for i = 1 to n repeat steps 8-9:
8. compute T_ij according to (6);
9. update l_ij ⇐ l_ij + λ · T_ij;
10. re-compute: e_ja, a = 1, . . . , p; q_ja, q_aj, a = 1, . . . , k, and s_j;
11. compute T(L) according to (4);
12. TS = T(L) if T(L) > TS; otherwise: λ ⇐ λ · ξ.
13. Output: L - matrix of metagenes or latent factors.
2.1
Penalised PCA with Left Update
The following relations follow directly from (1):

\|L'X\|_F^2 = \|DR'\|_F^2 = \|D\|_F^2 = \sum_{i=1}^{k} d_i^2,    (2)

where \|M\|_F^2 = tr(M'M) indicates the squared Frobenius norm (the sum of the squared elements of the matrix M). In fact, we might not be interested in all k eigenvectors, but only in those which correspond to the larger eigenvalues (principal components). In order to reduce noise and the corresponding overfitting, it would be better to exclude the other eigenvectors from further consideration. As a result, we shall derive the following squared optimisation problem:

\max_L \left\{ \frac{1}{2}\|L'X\|_F^2 - \frac{\alpha}{2}\|L'L - I_k\|_F^2 \right\},    (3)

where α is a positive regulation parameter; the second term in (3) represents the penalty or regularisation.

Remark 1. Note that another penalized version of the SVD was presented in [11] and [12], where the authors are most interested in obtaining a decomposition made up of sparse vectors. To do this, they use an L1-penalty (LASSO-type penalty) on the rows and columns of the decomposition.

Further, we can re-write the target function of (3) in a more convenient form

T(L) = \frac{1}{2}\|L'X\|_F^2 - \frac{\alpha}{2}\|L'L - I_k\|_F^2 = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{p} e_{ij}^2 - \alpha\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} q_{ij}^2 - \frac{\alpha}{2}\sum_{i=1}^{k} s_i^2,    (4)

where

e_{ij} = \sum_{t=1}^{n} l_{ti} x_{tj},  q_{ij} = \sum_{t=1}^{n} l_{ti} l_{tj},  s_i = q_{ii} - 1.    (5)
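As a quick numerical illustration of relation (2), the following numpy check (an assumption-free toy example, not part of the original paper) verifies that, when L holds the first k left singular vectors of X, the squared Frobenius norm of L'X equals the sum of the k largest squared singular values.

import numpy as np

rng = np.random.default_rng(0)
n, p, k = 60, 2000, 10
X = rng.standard_normal((n, p))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
L = U[:, :k]                      # n x k matrix of leading left singular vectors

lhs = np.linalg.norm(L.T @ X, "fro") ** 2
rhs = np.sum(d[:k] ** 2)
print(lhs, rhs)                   # the two numbers agree up to rounding error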
Fig. 2. Algorithm 1: behaviour of the average LOO misclassification rates (ALMR) as a function of the parameter α: (a) colon, (b) Sharma; (c) log(log(α + e)), where the values of α correspond to the above cases (a) and (b). The range of values of α is [0.01, . . . , 60]; see Section 4 for more details.
In order to maximise (4), we shall use gradient-based optimisation, and will need the following matrix of partial derivatives:

T_{ab} = \frac{\partial T(L)}{\partial l_{ab}} = \sum_{j=1}^{p} e_{bj} x_{aj} - 2\alpha \sum_{j=1, j \neq b}^{k} q_{bj} l_{aj} - \alpha s_b l_{ab}.    (6)
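The following is a vectorised sketch of Algorithm 1 following equations (4)-(6); it is an illustration only, not the authors' original C implementation. It performs the gradient updates in batch form for the whole matrix L rather than element by element as in the pseudocode, and the default values of the learning rate and correction rate are placeholders.

import numpy as np

def penalised_pca_left(X, k, alpha=1.0, lam=1e-5, xi=0.55, n_iter=100, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    L = rng.standard_normal((n, k)) * 0.01   # step 3: random initialisation
    best = -np.inf
    for _ in range(n_iter):                  # global cycle (steps 5-12)
        E = L.T @ X                          # e_ij, eq. (5)
        Q = L.T @ L                          # q_ij, eq. (5)
        s = np.diag(Q) - 1.0                 # s_i,  eq. (5)
        # gradient matrix T_ab of eq. (6), assembled for all (a, b) at once
        G = X @ E.T - 2.0 * alpha * (L @ Q - L * np.diag(Q)) - alpha * (L * s)
        L_new = L + lam * G                  # gradient ascent update (step 9)
        # target function T(L), eq. (4), evaluated at the updated matrix
        E2, Q2 = L_new.T @ X, L_new.T @ L_new
        s2 = np.diag(Q2) - 1.0
        T = 0.5 * np.sum(E2 ** 2) - alpha * np.sum(np.triu(Q2, 1) ** 2) \
            - 0.5 * alpha * np.sum(s2 ** 2)
        if T > best:
            best = T
        else:
            lam *= xi                        # step 12: shrink the learning rate
        L = L_new
    return L                                 # metagenes (latent factors)

As in the paper, the stability of this sketch depends strongly on the learning rate λ not being too large.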
Fig. 1 was produced using the following settings: (a) colon: λ = 4 · 10^{-6}, ξ = 0.55, k = 15, α = 1.5; (b) leukaemia: λ = 2 · 10^{-6}, ξ = 0.55, k = 7, α = 1; (c) lymphoma: λ = 2 · 10^{-6}, ξ = 0.55, k = 23, α = 2; (d) Sharma: λ = 15 · 10^{-6}, ξ = 0.55, k = 30, α = 0.3.

2.2
Penalised PCA with Right Update
Similarly to (2), we have

\|XR\|_F^2 = \|LD\|_F^2 = \|D\|_F^2 = \sum_{i=1}^{k} d_i^2.    (7)
Table 1. Algorithm 1: some selected test results with scheme (11), where LMR is the LOO misclassification rate and NM is the number of misclassified samples

Data        Model   k    α    LMR      NM
colon       SVM     6    10   0.0968   6
colon       SVM     20   10   0.0806   5
colon       SVM     27   10   0.0968   6
leukaemia   SVM     9    5    0.0139   1
leukaemia   SVM     13   5    0.0      0
leukaemia   SVM     20   5    0.0139   1
lymphoma    MLR     10   50   0.0484   3
lymphoma    MLR     13   50   0.0323   2
lymphoma    MLR     20   50   0.0484   3
Sharma      SVM     27   5    0.0833   5
Sharma      SVM     33   5    0.05     3
Sharma      SVM     37   5    0.0833   5
Khan        MLR     23   50   0.0482   4
Khan        MLR     31   50   0.0241   2
Khan        MLR     55   50   0.0361   3
The task in this section is to find the matrix R, which may be used directly to compute the matrix of metagenes XR as a replacement for the original matrix X. Let us consider the following target function:

T(R) = \frac{1}{2}\|XR\|_F^2 - \frac{\alpha}{2}\|R'R - I_k\|_F^2 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{k} e_{ij}^2 - \alpha\sum_{i=1}^{k-1}\sum_{j=i+1}^{k} q_{ij}^2 - \frac{\alpha}{2}\sum_{i=1}^{k} s_i^2,    (8)

where

e_{ij} = \sum_{t=1}^{p} r_{ti} x_{tj},  q_{ij} = \sum_{t=1}^{p} r_{ti} r_{tj},  s_i = q_{ii} - 1.    (9)
Algorithm 2. Penalised PCA (right update)
1. Input: X - matrix of microarrays.
2. Select the number of global iterations; k - number of factors; α - regulation parameter; λ - initial learning rate; 0 < ξ < 1 - correction rate; TS - initial value of the target function.
3. The initial matrix R may be generated randomly.
4. Compute the values of the matrices E, Q and the vector S according to (9).
5. Global cycle: repeat the following steps 6-12 for the selected number of global iterations:
6. factors-cycle: for j = 1 to k repeat steps 7-10:
7. genes-cycle: for i = 1 to p repeat steps 8-9:
8. compute T_ij according to (10);
9. update r_ij ⇐ r_ij + λ · T_ij;
10. re-compute: e_aj, a = 1, . . . , n; q_ja, q_aj, a = 1, . . . , k, and s_j;
11. compute T(R) according to (8);
12. TS = T(R) if T(R) > TS; otherwise: λ ⇐ λ · ξ.
13. Output: R, where XR is the matrix of metagenes or latent factors.

Fig. 3. Matrices of metagenes: (a) lymphoma, (b) leukaemia
In order to maximise (8), we shall use gradient-based optimisation, and will need the following matrix of partial derivatives:

T_{ab} = \frac{\partial T(R)}{\partial r_{ab}} = \sum_{i=1}^{n} e_{bi} x_{ia} - 2\alpha \sum_{j=1, j \neq b}^{k} q_{bj} r_{aj} - \alpha s_b r_{ab}.    (10)
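A compact sketch of the right-update variant is given below; again this is only an illustration, not the authors' code. For brevity the gradient of the target (8) is written in matrix form (X'XR for the data term) rather than element by element as in (10), and a fixed learning rate is used.

import numpy as np

def penalised_pca_right(X, k, alpha=1.0, lam=1e-6, n_iter=100, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((p, k)) * 0.01
    for _ in range(n_iter):
        Q = R.T @ R                           # q_ij of eq. (9)
        s = np.diag(Q) - 1.0                  # s_i of eq. (9)
        # gradient of (8): data term X'XR minus the penalty terms
        G = X.T @ (X @ R) - 2.0 * alpha * (R @ Q - R * np.diag(Q)) - alpha * (R * s)
        R = R + lam * G
    return R

The metagene matrix used for classification is then simply XR; for a new tissue described by a gene vector vg_new, its metagenes are vg_new R (see Section 4).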
Fig. 4 was produced using the following settings: (a) colon: λ = 4 · 10^{-6}, ξ = 0.55, k = 19, α = 0.8; (b) leukaemia: λ = 4 · 10^{-6}, ξ = 0.55, k = 23, α = 5; (c) lymphoma: λ = 2 · 10^{-6}, ξ = 0.55, k = 39, α = 35; (d) Sharma: λ = 15 · 10^{-6}, ξ = 0.55, k = 40, α = 25. Also, we used the parameters λ = 2 · 10^{-5}, ξ = 0.55 in Table 2 in all the lymphoma and Khan cases related to the MLR model.
Fig. 4. Algorithm 2: behaviour of T^{(R)}_{n·p} as a function of global iteration: (a) colon, (b) leukaemia, (c) lymphoma and (d) Sharma
2.3
Regularised k-Means Clustering Algorithm
Clustering methods provide a powerful tool for the exploratory analysis of high-dimensional, low-sample-size data sets, such as gene expression microarray data. As with PCA, cluster analysis requires no response variable and thus falls into the category of unsupervised learning methods. There are two major problems: the stability of the clustering and the meaningfulness of the centroids as cluster representatives. On the one hand, big clusters impose strong smoothing and a possible loss of very essential information. On the other hand, small clusters are usually very unstable and noisy. Accordingly, they cannot be treated as equal and independent representatives. To address the above problems, we propose a regularisation to prevent the creation of super-big clusters, and to attract data to existing small clusters [6].

Algorithm 3. Regularised k-means clustering
1. Select the number of clusters k, the distance Φ and the regulation parameter β;
2. split the available genes randomly into k subsets (clusters) of approximately the same size;
3. compute an average (centroid) q_c for any cluster c;
4. compute the maximum distance L between genes and centroids;
5. redistribute the genes according to Φ(x_j, q_c) + R_c, where the regularisation term is R_c = β · L · #c / p, and #c is the size of cluster c at the current time;
6. recompute the centroids (known from the previous sections as metagenes);
7. repeat steps 5-6 until convergence (that is, stable behaviour of the target function).
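A small illustrative sketch of Algorithm 3 is given below, clustering the genes (columns of X) with a Euclidean distance. It is an assumption-laden toy version, not the authors' implementation: the maximum distance L and the penalty are recomputed at every pass, empty clusters are handled with zero centroids, and the iteration cap is a placeholder.

import numpy as np

def regularised_kmeans(X, k, beta=0.1, n_iter=50, seed=0):
    """X is n x p (tissues x genes); each gene (column) is assigned to a cluster."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=p)             # step 2: random initial split
    for _ in range(n_iter):
        # steps 3/6: centroids (metagenes) as the mean profile of each cluster
        centroids = np.stack(
            [X[:, labels == c].mean(axis=1) if np.any(labels == c) else np.zeros(n)
             for c in range(k)], axis=1)            # n x k
        # Euclidean distances between every gene and every centroid
        dist = np.linalg.norm(X[:, :, None] - centroids[:, None, :], axis=0)  # p x k
        L = dist.max()                              # step 4: maximum distance
        sizes = np.bincount(labels, minlength=k)    # current cluster sizes #c
        penalty = beta * L * sizes / p              # regularisation term R_c
        new_labels = np.argmin(dist + penalty[None, :], axis=1)  # step 5
        if np.array_equal(new_labels, labels):      # step 7: convergence
            break
        labels = new_labels
    centroids = np.stack(
        [X[:, labels == c].mean(axis=1) if np.any(labels == c) else np.zeros(n)
         for c in range(k)], axis=1)
    return labels, centroids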
3
Data
3.1
Colon
The colon dataset1 is represented by a matrix of 62 tissue samples (40 tumor and 22 normal) and 2000 genes. The microarray matrix for this set thus has p = 2000 rows and n = 62 columns.

3.2
Leukaemia
The leukaemia dataset2 was originally provided in [13]. It contains the expression levels of p = 7129 genes for n = 72 patients, consisting of 47 patients with acute lymphoblastic leukaemia (ALL) and 25 patients with acute myeloid leukaemia (AML).

Table 2. Algorithm 2: some selected test results with scheme (11)
Data        Model   k    α     LMR      NM
colon       SVM     9    0.8   0.129    8
colon       SVM     19   0.8   0.0968   6
colon       SVM     30   0.8   0.1129   7
leukaemia   SVM     15   9     0.0139   1
leukaemia   SVM     23   9     0        0
leukaemia   SVM     25   9     0.0139   1
lymphoma    MLR     26   50    0.0484   3
Sharma      SVM     40   25    0.1      6
Sharma      SVM     21   25    0.1833   11
Sharma      SVM     28   25    0.15     9
Sharma      SVM     50   25    0.0667   4
Khan        MLR     19   120   0.0482   4
Khan        MLR     39   120   0.0241   2
Khan        MLR     43   120   0.0361   3

3.3
Lymphoma
This publicly available dataset3 contains the gene expression levels of the three most prevalent adult lymphoid malignancies: (1) 42 samples of diffuse large B-cell lymphoma (DLCL), (2) 9 samples of follicular lymphoma (FL), and (3) 11 samples of chronic lymphocytic leukaemia (CLL). The total sample size is n = 62 with p = 4026 genes. Note that about 4.6% of the values in the original lymphoma gene expression matrix are missing and were replaced by average values (in relation to the corresponding genes). The maximum number of missing values per gene is 16. The range of values in the expression matrix is [−6.06, . . . , 6.3]. More information on these data may be found in [14].

1 http://microarray.princeton.edu/oncology/affydata/index.html
2 http://www.broad.mit.edu/cgi-bin/cancer/publications/
3 http://llmpp.nih.gov/lymphoma/data/figure1

3.4
Sharma
This dataset was described in [15] and contains the expression levels (mRNA) of 1368 genes from 60 blood samples taken from 56 women. Each sample was labelled by clinicians, with 24 labelled as having breast cancer and 36 labelled as not having it. Some of the samples were analysed more than once in separate batches, giving a total of 102 labelled base samples. For each of the 60 samples we computed the average of the corresponding base samples. As a result, the original matrix with dimensions 102 × 1368 was reduced to a 60 × 1368 matrix.

3.5
Khan
This dataset [16] contains 2308 genes and 83 observations, each from a child who was determined by clinicians to have a type of small round blue cell tumour. This includes the following four classes: neuroblastoma (N), rhabdomyosarcoma (R), Burkitt lymphoma (B) and Ewing sarcoma (E). The numbers in each class are: 18 N, 25 R, 11 B and 29 E.

3.6
Additional Pre-processing Steps
We followed the pre-processing steps of [17] applied to the leukaemia set: (1) thresholding: floor of 1 and ceiling of 20000; (2) filtering: exclusion of genes with max/min ≤ 2 and (max − min) ≤ 100, where max and min refer respectively to the maximum and minimum expression levels of a particular gene across the tissue samples. This left us with 1896 genes. In addition, the natural logarithm of the expression levels was taken. After the above pre-processing we observed a significant improvement in the quality of classification.

Remark 2. We conducted similar studies in application to the colon data. Firstly, we observed the following statistical characteristics: min = 5.82, max = 20903, 4.38 ≤ max/min ≤ 1258.6. Then, we took the natural logarithm of the expression levels. Based on our experimental results we cannot report any improvement in the quality of classification.

We applied double normalisation to the data. Firstly, we normalised each column to have mean zero and unit standard deviation. Then, we applied the same normalisation to each row.
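The pre-processing just described (thresholding, filtering, log transform, then column- and row-wise standardisation) can be sketched as follows. The thresholds follow [17] as quoted above; the function name and the keep-or-drop convention are illustrative assumptions, not the authors' code.

import numpy as np

def preprocess(X, floor=1.0, ceiling=20000.0, fold=2.0, spread=100.0):
    """X is n x p (tissues x genes) of raw expression levels."""
    X = np.clip(X, floor, ceiling)                           # (1) thresholding
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    # (2) filtering: a gene is excluded only if max/min <= fold AND max-min <= spread
    keep = (gmax / gmin > fold) | (gmax - gmin > spread)
    X = np.log(X[:, keep])                                   # natural logarithm
    # double normalisation: columns (genes), then rows (tissues)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return X, keep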
Table 3. Algorithm 3: some selected test results with scheme (11)

Data        Model   k    β     LMR      NM
colon       SVM     7    0.2   0.1129   7
colon       SVM     8    0.2   0.0968   6
colon       SVM     9    0.2   0.1129   7
leukaemia   SVM     25   0.1   0.0139   1
leukaemia   SVM     30   0.1   0        0
lymphoma    MLR     60   0.1   0.0484   3
lymphoma    MLR     70   0.1   0.0323   2
lymphoma    MLR     80   0.1   0.0484   3
Sharma      SVM     50   0.1   0.1167   7
Sharma      SVM     55   0.1   0.0833   5
Sharma      SVM     60   0.1   0.1333   8
Khan        MLR     40   0.1   0.0482   4
Khan        MLR     44   0.1   0.0241   2
Khan        MLR     50   0.1   0.0361   3

4
Experiments
After the decomposition of the original matrix X we used the leave-one-out (LOO) classification scheme, applied to the matrix L in the case of Algorithm 1 and to the matrix XR in the case of Algorithm 2. This means that we set aside the i-th observation and fit the classifier using the remaining (n − 1) data points. We conducted experiments with a linear SVM (we used our own code written in C). The experimental procedure has a heuristic nature and its performance depends essentially on the initial settings. We shall denote such a scheme as

SCH(nrs, Model, LOO1),
(11)
where we conducted nrs identical experiments with randomly generated initial settings. Each particular experiment includes two steps: 1) dimensionality reduction with Algorithm 1 or Algorithm 2; 2) LOO evaluation with the classification Model. In most of our experiments with scheme (11) we used nrs = 20. We conducted experiments with two different classifiers: 1) a linear SVM, and 2) MLR, applied to the lymphoma and Khan sets. These classifiers are denoted by "Model" in (11). Fig. 2 illustrates the expected LOO misclassification rates (ELMR) as a function of the double logarithm of the parameter α. As may be expected, the results corresponding to small values of α are poor. Then, we have a significant improvement up to some point. After that point the model suffers from over-regularisation. Tables 1-2 present some of the best results. It can be seen that our results are competitive with those in [18], [19], where the best reported result for the colon set is LMR = 0.113, and LMR = 0.0139 for the leukaemia set.
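A sketch of the LOO step of scheme (11) is given below: the metagene matrix (L or XR) is computed once on the full data set, and a leave-one-out loop with a linear SVM is then run on it. scikit-learn's LinearSVC is used here purely for illustration; the authors used their own C implementation of the linear SVM.

import numpy as np
from sklearn.svm import LinearSVC

def loo_scheme_11(M, y, C=1.0):
    """M is the n x k matrix of metagenes (L or XR), y the vector of labels."""
    n = M.shape[0]
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        clf = LinearSVC(C=C).fit(M[train], y[train])
        errors += int(clf.predict(M[i:i + 1])[0] != y[i])
    return errors / n          # LOO misclassification rate (LMR)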
In the case of Fig. 2, we used the following settings: (a) colon: λ = 10^{-5}, ξ = 0.55, k = 13; (b) Sharma: λ = 10^{-4}, ξ = 0.75, k = 15.

4.1
Selection Bias
Cross-validation can be used to estimate model or classifier parameters as well as to perform model and variable selection. However, combining these steps with error estimation for the final classifier can lead to bias unless one is particularly careful [20]. 'Selection bias' [21] (p. 218) can occur when cross-validation is used 'internally'. In this case, all the available data are used to select the model as a subset of the available predictors. This subset of predictors is then fixed and the error rate is estimated by cross-validation. In Fig. 2, we have plotted the ELMRs estimated using LOO cross-validation under the assumption that the matrix factorisation will be the same during the n validation trials as that chosen on the basis of the full data set. However, there will be a selection bias in these estimates, as the matrix factorisation should be re-formed as a natural part of any validation trial; see, for example, [22]. But, since the labels y_t of the training data were not used in the factoring process, the selection bias should not be of practical importance. The validation scheme (LOO scheme N2)

SCH(nrs, Model, LOO2),
(12)
where step 1 with the dimensionality reduction is a part of every LOO loop, requires much more computational time compared with scheme (11). Nevertheless, we tested scheme (12) with nrs = 10 using a fixed number of metagenes. Note that the validation scheme (12) is very compatible with Algorithm 2 and may be easily implemented: at any LOO cycle we can compute the vector-row of metagenes according to the following formula: vmg_new = vg_new R, where vg_new is the vector-row of genes corresponding to the tissue under consideration. In the case of the colon dataset, we used k = 19 and observed LMR = 0.0968 (1), 0.1129 (6), 0.129 (2) and 0.1451 (1), where the integer numbers in brackets indicate the number of times the corresponding value was observed. In the case of the leukaemia dataset, we used k = 25 and observed the following values: LMR = 0.0139 (6), 0.0278 (3) and 0.0417 (1). In the case of the Sharma dataset, we used k = 18 and observed the following values: LMR = 0.1833 (2), 0.2 (3), 0.2167 (3) and 0.2333 (2).
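The following sketch illustrates scheme (12) for Algorithm 2: the factorisation is re-estimated inside every leave-one-out trial and the withheld tissue is projected onto the metagene space with vmg_new = vg_new R, as described above. It reuses the penalised_pca_right sketch given earlier and scikit-learn's LinearSVC, both purely for illustration.

import numpy as np
from sklearn.svm import LinearSVC

def loo_scheme_12(X, y, k, alpha, C=1.0):
    n = X.shape[0]
    errors = 0
    for i in range(n):
        train = np.arange(n) != i
        R = penalised_pca_right(X[train], k, alpha=alpha)   # fit without tissue i
        clf = LinearSVC(C=C).fit(X[train] @ R, y[train])
        vmg_new = X[i:i + 1] @ R                            # project held-out tissue
        errors += int(clf.predict(vmg_new)[0] != y[i])
    return errors / n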
Selection Bias with the k-Means Algorithm. In the case of Algorithm 3, scheme (12) may be organised as follows. We can run Algorithm 3 without any particular tissue. As an outcome, Algorithm 3 produces a codebook, that is, the assignment of each gene to a unique cluster. After completion of the clustering with Algorithm 3, this codebook may be applied to the withheld tissue in order to recompute the centroids/metagenes.

Table 4. Algorithm 3: some selected test results with scheme (12)

Data        Model   k    β     LMR      NM
colon       SVM     13   0.2   0.1129   7
leukaemia   SVM     44   0.1   0        0
lymphoma    MLR     60   0.1   0.0484   3
Sharma      SVM     50   0.1   0.1666   10
Khan        MLR     30   0.1   0.0241   2

4.2
Computation Time
A Linux computer (3.2 GHz, 16 GB RAM) was used for most of the computations. The time for 100 global iterations with Algorithm 1 (using special code written in C) in the case of k = 50 was about 15 sec (Sharma dataset).

Table 5. Algorithm 2: computation time, where Scheme 1 corresponds to (11) and Scheme 2 corresponds to (12)

Data        k    n    p      Scheme   nrs   Time (min.)
colon       29   62   2000   1        20    6
lymphoma    21   62   4026   1        20    11
colon       19   62   2000   2        10    100
leukaemia   25   72   1896   2        10    180
Sharma      40   60   1368   2        10    195
lymphoma    39   62   4026   2        10    600

5
Conclusion
Microarray data analysis presents a challenge to the traditional machine learning techniques due to the availability of only a limited number of training instances and the existence of a large number of genes. In many cases, machine learning techniques rely too much on the gene selection, which may cause selection bias. Generally, feature selection may be classified into two categories based on whether the criterion depends on the learning algorithm used to construct the prediction rule. If the criterion is independent of the prediction rule, the method is said to follow a filter approach, and if the criterion depends on the rule, the method is said to follow a wrapper approach [22]. The objective of this study is to develop a filtering machine learning approach and produce a robust classification procedure for microarray data.
Based on our experiments, the proposed penalized PCA performed an effective dimension reduction as a preparation step for the subsequent supervised classification. Similarly, a two-step clustering procedure was introduced in [6]. Also, it was reported in [6] that classifiers built in the metagene, rather than the original gene, space are more robust and reproducible, because the projection can reduce noise more than simple normalisation. Algorithms 1 and 2, as the main contribution of this paper, are conceptually simple. Consequently, they are very fast and stable. Note also that the stability of the algorithms depends essentially on a properly selected learning rate, which must not be too large. We can include additional functions so that the learning rate will be reduced or increased depending on the current performance. There are many advantages to such a metagene approach. By capturing the major, invariant biological features and reducing noise, metagenes provide descriptions of data sets that allow them to be more easily combined and compared. In addition, interpretation of the metagenes, which characterize a subtype or subset of samples, can give us insight into the underlying mechanisms and processes of a disease. The results that we obtained on five real datasets confirm the potential of our approach. Based on Tables 1-3, we can conclude that the performance of the penalised PCA introduced in this paper is better in terms of LMR compared to the regularised k-means algorithm. It is also faster to implement.
References

[1] Huber, P.: Projection pursuit. The Annals of Statistics 13, 435-475 (1985)
[2] Friedman, J.: Exploratory projection pursuit. Journal of the American Statistical Association 82, 249-266 (1987)
[3] Alter, O., Brown, P., Botstein, D.: Singular value decomposition for genome-wide expression data processing and modelling. PNAS 97, 10101-10106 (2000)
[4] Guan, Y., Dy, J.: Sparse probabilistic principal component analysis. In: AISTATS, pp. 185-192 (2009)
[5] Zass, R., Shashua, A.: Nonnegative sparse PCA. In: Advances in Neural Information Processing Systems (2006)
[6] Nikulin, V., McLachlan, G.: Regularised k-means clustering for dimension reduction applied to supervised classification. In: CIBB Conference, Genova, Italy (2009)
[7] Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46, 389-422 (2002)
[8] Bohning, D.: Multinomial logistic regression algorithm. Ann. Inst. Statist. Math. 44, 197-200 (1992)
[9] Liu, L., Hawkins, D., Ghosh, S., Young, S.: Robust singular value decomposition analysis of microarray data. PNAS 100, 13167-13172 (2003)
[10] Fogel, P., Young, S., Hawkins, D., Ledirac, N.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Bioinformatics 23, 44-49 (2007)
[11] Hastie, T., Tibshirani, R.: Efficient quadratic regularisation of expression arrays. Biostatistics 5, 329-340 (2004)
[12] Witten, D., Tibshirani, R., Hastie, T.: A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515-534 (2009)
[13] Golub, T., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999)
[14] Alizadeh, A., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511 (2000)
[15] Sharma, P., et al.: Early detection of breast cancer based on gene-expression patterns in peripheral blood cells. Breast Cancer Research 7, R634-R644 (2005)
[16] Khan, J., et al.: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673-679 (2001)
[17] Dudoit, S., Fridlyand, J., Speed, I.: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97, 77-87 (2002)
[18] Dettling, M., Buhlmann, P.: Boosting for tumor classification with gene expression data. Bioinformatics 19, 1061-1069 (2003)
[19] Peng, Y.: A novel ensemble machine learning for robust microarray data classification. Computers in Biology and Medicine 36, 553-573 (2006)
[20] Wood, I., Visscher, P., Mengersen, K.: Classification based upon expression data: bias and precision of error rates. Bioinformatics 23, 1363-1370 (2007)
[21] McLachlan, G., et al.: Analysing Microarray Gene Expression Data. Wiley, Hoboken (2004)
[22] Ambroise, C., McLachlan, G.: Selection bias in gene extraction on the basis of microarray gene expression data. PNAS 99, 6562-6566 (2002)
An Information Theoretic Approach to Reverse Engineering of Regulatory Gene Networks from Time-Course Data
Pietro Zoppoli1,2, Sandro Morganella1,2, and Michele Ceccarelli1,2
1 Department of Biological and Environmental Sciences, University of Sannio, Benevento, Italy
2 Bioinformatics Lab, IRGS Istituto di Ricerche Genetiche G. Salvatore, BioGeM s.c.a r.l., Ariano Irpino (AV), Italy
Abstract. One of main aims of Molecular Biology is the gain of knowledge about how molecular components interact each other and to understand gene function regulations. Several methods have been developed to infer gene networks from steady-state data, much less literature is produced about time-course data, so the development of algorithms to infer gene networks from time-series measurements is a current challenge into bioinformatics research area. In order to detect dependencies between genes at different time delays, we propose an approach to infer gene regulatory networks from time-series measurements starting from a well known algorithm based on information theory. In particular, we show how the ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) algorithm can be used for gene regulatory network inference in the case of time-course expression profiles. The resulting method is called TimeDelay-ARACNE. It just tries to extract dependencies between two genes at different time delays, providing a measure of these dependencies in terms of mutual information. The basic idea of the proposed algorithm is to detect time-delayed dependencies between the expression profiles by assuming as underlying probabilistic model a stationary Markov Random Field. Less informative dependencies are filtered out using an auto calculated threshold, retaining most reliable connections. TimeDelay-ARACNE can infer small local networks of time regulated gene-gene interactions detecting their versus and also discovering cyclic interactions also when only a medium-small number of measurements are available. We test the algorithm both on synthetic networks and on microarray expression profiles. Microarray measurements are concerning part of S. cerevisiae cell cycle and E. coli SOS pathways. Our results are compared with the ones of two previously published algorithms: Dynamic Bayesian Networks and systems of ODEs, showing that TimeDelay-ARACNE has good accuracy, recall and F -score for the network reconstruction task.
1
Introduction
In order to understand cellular complexity much attention is placed on large dynamic networks of co-regulated genes at the base of phenotype differences. F. Masulli, L. Peterson, and R. Tagliaferri (Eds.): CIBB 2009, LNBI 6160, pp. 97–111, 2010. c Springer-Verlag Berlin Heidelberg 2010
98
P. Zoppoli, S. Morganella, and M. Ceccarelli
One of the aims in molecular biology is to make sense of high-throughput data like that from microarray of gene expression experiments. Many important biological processes (e.g., cellular differentiation during development, aging, disease aetiology etc.) are very unlikely controlled by a single gene instead by the underlying complex regulatory interactions between thousands of genes within a four-dimension space. In order to identify these interactions, expression data over time can be exploited. An important open question is related to the development of efficient methods to infer the underlying gene regulation networks (GRN) from temporal gene expression profiles. Inferring, or reverse-engineering, gene networks can be defined as the process of identifying gene interactions from experimental data through computational analysis. A GRN can be modelled as a graph G = (V, U, D), where V is the set of nodes corresponding to genes, U is the set of unordered pair (undirected edges) and D is the set of ordered pairs D (directed edges). A directed edge dij from vi to vj is present iff there is a causal effect from node vi to node vj . An undirected edge uij represents the mutual association between nodes vi and vj . Gene expression data from microarrays are typically used for this purpose. There are two broad classes of reverse-engineering algorithms [1]: those based on the physical interaction approach which aim at identifying interactions among transcription factors and their target genes (geneto-sequence interaction) and those based on the influence interaction approach that try to relate the expression of a gene to the expression of the other genes in the cell (gene-to-gene interaction), rather than relating it to sequence motifs found in the promoters. We will refer to the ensemble of these influence interactions as gene networks. Many algorithms have been proposed in the literature to model gene regulatory networks [2] and solve the network inference problem [3]. Ordinary Differential Equations Reverse-engineering algorithms based on ordinary differential equations (ODEs) relate changes in gene transcript concentration to each other and to an external perturbation. Typical perturbations can be for example the treatment with a chemical compound (i.e. a drug), or the over expression or down regulation of particular genes. A set of ODEs, one for each gene, describes gene regulation as a function of other genes. As ODEs are deterministic, the interactions among genes represent causal interactions, rather than statistical dependencies. The ODE-based approaches yield signed directed graphs and can be applied to both steady-state and time-series expression profiles [3, 4]. Bayesian Networks A Bayesian network [5] is a graphical model for representing probabilistic relationships among a set of random variables Xi , where i = 1, · · · , n. These relationships are encoded in the structure of a directed acyclic graph G, whose vertexes (or nodes) are the random variables Xi . The relationships between the variables are described by a joint probability distribution P (X1 , · · · , Xn ). The genes, on which the probability is conditioned, are called the parents of gene
An Information Theoretic Approach to Reverse Engineering
99
i and represent its regulators, and the joint probability density is expressed as a product of conditional probabilities. Bayesian networks cannot contain cycles (i.e. no feedback loops). This restriction is the principal limitation of the Bayesian network model [6]. Dynamic Bayesian networks overcome this limitation [7]. Dynamic Bayesian networks are an extension of Bayesian networks able to infer interactions from a data set consisting of time-series rather than steady-state data. Graphical Gaussian Model Graphical Gaussian model, also known as covariance selection or concentration graph models, assumes multivariate normal distribution for underlying data. The independence graph is defined by a set of pairwise conditional independence relationships calculated using partial correlations as a measure of independence of any two genes that determine the edge-set of the graph [8]. Partial cross correlation has been also used to deal with time delays [9]. Gene Relevance Network Gene relevance networks are based on the covariance graph model. Given a measure of association and defined a threshold value, for all pairs of domain variables (X, Y ), association A(X, Y ) is computed. Variables X and Y are connected by an undirected edge when association A(X, Y ) exceeds the predefined threshold value. One of the measures of association is the mutual information (MI) [10], one of the information theory (IT) main tools. In IT approaches, the expression level of a gene is considered as a random variable. MI is the main tool for measuring if and how two genes influence each other. MI between two variables X and Y is also defined as the reduction in uncertainty about a variable X after observing a second random variable Y . Edges in networks derived by information-theoretic approaches represent statistical dependencies among gene expression profiles. As in the case of Bayesian network, the edge does not represent a direct causal interaction between two genes, but only a statistical dependency. It is possible to derive the information-theoretic approach as a method to approximate the joint probability density function of gene expression profiles, as it is performed for Bayesian networks [11–13]. Time-Course Reverse Engineering Availability of time-series gene expression data can be of help in the study of the dynamical properties of molecular networks, by exploiting the causal genegene temporal relationships. In the recent literature several dynamic models, such as Probabilistic Boolean Networks (PBN) [14]; Dynamic Bayesian Networks (DBN) [7]; Hidden Markov Model (HMM) [15] Kalfman filters [16]; Ordinary Differential Equations (ODEs) [4, 17]; pattern recognition approaches [18]; signal processing approaches [19], model-free approaches [20] and informational
100
P. Zoppoli, S. Morganella, and M. Ceccarelli
approaches [21] have been proposed for reconstructing regulatory networks from time-course gene expression data. Most of them are essentially model-based trying to uncover the dynamics of the system by estimating a series of parameters, such as auto regressive coefficients [19] or the coefficients of state-transition matrix [16] or of a stiffness matrix [4, 17]. The model parameters themselves describe the temporal relationships between nodes of the molecular network. In addition, several current approaches try to catch the dynamical nature of the network by unrolling in time the states of the network nodes, this is the case of Dynamic Bayesian Networks [7] or Hidden Markov Models [15]. One of the major differences between the approach proposed here and these approaches, is that the dynamical nature of the behavior of the nodes in the networks, in terms of time dependence between reciprocal regulation between them, can be modeled in the connections rather that in the time-unwrapping of the nodes. As reported in Figure 1, we assume that the the activation of a gene A can influence the activation of a gene B in successive time instants, and that this information is carried out in the connection between gene A and gene B. Indeed, this idea is also at the basis of the time delay neural network model efficiently used in sequence analysis and speech recognition [22]. Another interesting feature of the reported method, with respect to the ARACNE algorithm, is the fact that the time-delayed dependencies can eventually be used for derive the direction of the connections between the nodes of the network, trying to discriminate between regulator gene and regulated genes. The approach reported here has also some similarities with the method proposed in [21], the main differences are in the use of different time delays, the use of the data processing inequality for pruning the network rather than the minimum description length principle and the discretization of the expression values. Summary of the Proposed Algorithm TimeDelay-ARACNE tries to extend to time-course data ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) retrieving time statistical dependency between gene expression profiles. The idea on which TimeDelayARACNE is based comes from the consideration that the expression of a gene at a certain time could depend by the expression level of another gene at previous time point or at very few time points before. TimeDelay-ARACNE is a threesteps algorithm: first it detects, for all genes, the time point of the initial changes in the expression, secondly there is network construction and finally a network pruning step. Is is worth noticing that, the analytical tools for time series often require conditions such as stability and stationarity (see [23]). Although it is not possible to state that these conditions hold in general for microarray data, this is due to the limited number of samples and to the particular experimental setup producing the data, nevertheless time series analysis methods have been demonstrated to be useful tools in many applications of time course gene expression data analysis, for example Ramoni et al. [24], used an auto-regressive estimation step as feature extraction prior to classification, while Holter et al., [25] use the characteristic modes obtained by singular value decomposition to
An Information Theoretic Approach to Reverse Engineering
101
model a linear framework resulting in a time translational matrix. In particular TimeDelay-ARACNE implicitly assumes stationarity and stability conditions in the kernel regression estimator used for the computation of the mutual information, as described in the section Algorithms.
Methods Datasets Simulated Gene Expression Data. We construct some synthetic gene networks in order to compute the functions p, r and F -score of the method having reference true tables and to compare its performance to other methods. According to the terminology in [26] we consider a gene network to be well-defined if its interactions allow to distinguish between regulator genes and regulated genes, where the first affect the behaviour of the second ones. Given a well defined network, we can have genes with zero regulators (called stimulators, which could represent the external environmental conditions), genes with one regulator, genes with two regulators, and so on. If a gene has at least one regulator (it is not a stimulator) then it owns a regulation function which describes its response to a particular stimulus of its regulator/regulators. Our synthetic networks contain some stimulator genes with a random behaviour and regulated genes which can eventually be regulators of other genes. The network dynamics are modeled by linear causal relations which are formulated by a set of randomly generated equations. In particular, let us call the expression of gene i at time t as git , our synthetic network generation module works as follows, – if gene i is a stimulator gene then its expression profile, git , t = 0, 1, ... is randomly initialized with a sequence of uniform numbers in [1, 100] – for each non-stimulator gene i, gi0 is initialized with a uniform random number in [1, 100] – for each non-stimulator gene i, the expression values gi1 , ..., git are computed according to a stochastic difference equation with random coefficients depending on one or two other regulator genes by using one of the two equations below: + ηit git = αti git−1 + βit gpt−1 i git
=
αti git−1
+
βit gpt−1 i
+
γit gqt−1 i
+
ηit
(1) (2)
here the coefficients αti , βit and γit are random variables uniformly distributed in [0, 1] and ηit is a random Gaussian noise with zero mean and variance σ. Moreover the regulators genes pi and qi of the i-th are randomly selected at the beginning of each simulation run. The network generation algorithm is set in such a way that 75% of genes have one regulator and 25% of genes have two regulators. – each expression profile is then normalized to be within the interval [0, 1]
102
P. Zoppoli, S. Morganella, and M. Ceccarelli
In our experiments, the PPV, recall and F -score of the proposed and the other methods is computed as the average over a set of 20 runs over different random networks with the same number of genes, number of time points and noise levels. Microarray Expression Profiles. The time course profiles for a set of 11 genes, part of the G1 step of yeast cell cycle, are selected from the widely used yeast, Saccharomyces cerevisiae, cell cycle microarray data [27]. These microarray experiments were designed to create a comprehensive list of yeast genes whose transcription levels were expressed periodically within the cell cycle. We select one of this profile in which the gene expressions of cell cycle synchronized yeast cultures were collected over 17 time points taken in 10-minute intervals. This time series covers more than two complete cycles of cell division. The first time point, related to the M step, is excluded in order to better recover the time relationships present in the G1 step. The true edges of the underlying network were provided by KEGG yeast’s cell cycle reference pathway [28]. Green Fluorescent Protein Real-Time Gene Expression Data. The time course profiles for a set of 8 genes, part of the SOS pathway of E. coli [29] are selected. Data are produced by a system for real-time monitoring of the transcriptional activity of operons by means of low-copy reporter plasmids in which a promoter controls GFP (green fluorescent protein). From such dataset we select the first 3 time course profiles and for each one the first 15 points, then we concatenate such profiles obtaining a 45 points profile. For each profile we use only the first 15 points avoiding the misguiding flat tails characterizing such gene profiles (the response to the UV stimulus is quick, so very soon mRNAs came back to pre-stimulus condition).
Algorithms ARACNE The ARACNE algorithm has been proposed in [11, 30]. ARACNE is an information-theoretic algorithm for the reverse engineering of transcriptional networks from steady-state microarray data. ARACNE, just as many other approaches, is based on the assumptions that the expression level of a given gene can be considered as a random variable, and the mutual relationships between them can be derived by statistical dependencies. It defines an edge as an irreducible statistical dependency between gene expression profiles that cannot be explained as an artifact of other statistical dependencies in the network. It is a two steps algorithm: network construction and network pruning. Within the assumption of a two-way network, all statistical dependencies can be inferred from pairwise marginals, and no higher order analysis is needed. ARACNE identifies candidate interactions by estimating pairwise gene expression profile mutual information, I(gi , gj ) ≡ Iij , an information-theoretic measure of relatedness that is
An Information Theoretic Approach to Reverse Engineering
103
zero iff the joint distribution between the expression level of gene i and gene j satisfies P (gi , gj ) = P (gi )P (gj ). ARACNE estimates MI using a computationally efficient Gaussian Kernel estimator. Since MI is reparameterization invariant, ARACNE copula-transforms (i.e., rank-order) the profiles before MI estimation; the range of these transformed variables is thus between 0 and 1, and their marginal probability distributions are manifestly uniform. This decreases the influence of arbitrary transformations involved in microarray data pre-processing and removes the need to consider position-dependent kernel widths which might be preferable for non-uniformly distributed data. Secondly the MIs are filtered using an appropriate threshold, I0 thus removing the most of indirect candidate interactions using a well known information theoretic property, the data processing inequality (DPI). ARACNE eliminate all edges for which the null hypothesis of mutually independent genes cannot be ruled out. To this extent, ARACNE randomly shuffles the expression of genes across the various microarray profiles and evaluate the MI for such manifestly independent genes. The DPI states that if genes g1 and g3 interact only through a third gene, g2 , (i.e., if the interaction network is g1 ↔ · · · ↔ g2 ↔ · · · ↔ g3 and no alternative path exists between g1 and g3 ), then I(g1 , g3 ) ≤ min(I(g1 , g2 ); I(g2 , g3 ))[31]. Thus the least of the three MIs can come from indirect interactions only, and so it’s pruned. TimeDelay-ARACNE TimeDelay-ARACNE tries to extend to time-course data ARACNE retrieving time statistical dependency between gene expression profiles. TimeDelayARACNE is a 3 steps algorithm: it first detects, for all genes, the time point of the initial changes in the expression, secondly there is network construction than network pruning. Step 1. The first step of the algorithm is aimed at the selection of the initial change expression points in order to flag the possible regulator genes [7]. In particular, let us consider the sequence of expression of gene ga : ga0 , ga1 , ...gat , ..., we use two thresholds τup and τdown and the initial change of expression (IcE) is defined as IcE(ga ) = arg min{ga0 /gaj ≥ τup or gaj /ga0 ≤ τdown } j
(3)
1 The thresholds are chosen with τup = τdown . In all the reported experiments we used τup = 1.2 and consequently τdown = 0.83. The quantity IcE(ga ) can be used in order to reduce the unuseful influence relations between genes. Indeed, a gene ga can eventually influence gene gb only if IcE(ga ) ≤ IcE(gb )[7].
Step 2. The basic idea of the proposed algorithm is to detect time-delayed dependencies between the expression profiles by assuming as underlying probabilistic model a stationary Markov Random Field [32]. In particular the model should try to catch statistical dependencies between the activation of a given gene
104
P. Zoppoli, S. Morganella, and M. Ceccarelli
ga at time t and another gb at time t+ κ with Ice(ga ) ≤ Ice(gb ). Our assumption relies on the fact the probabilistic properties of the network are determined by (κ) (κ) the joint distribution P (ga , gb ). Here gb is the expression series of gene gb shifted κ instants forward in time. For our problem we should therefore try to (κ) estimate both the stationary joint distribution P (ga , gb ) and, for each pair of genes, the best suited delay parameter κ. In order to solve these problems, as in the case of the ARACNE algorithm [30], the copula-based estimation can help in simplifying the computations [33]. The idea of the copula transform is based on the assumption that a simple transformation can be made of each variable in such a way that each transformed marginal variable has a uniform distribution. In practice, the transformations might be used as an initial step for each margin [34]. For stationary Markov models, Chen et al. [33] suggest to use a standard kernel estimator for the evaluation of the marginal distributions after the copula transform. Here we the use the simplest rank based empirical copula [34] as other kind of transformations did not produce any particular advantage for (κ) the considered problem. Starting from a kernel density estimation P˜ (ga , gb ) of P the algorithm identifies candidate interactions by pairwise time-delayed gene expression profile mutual information defined as: I κ (ga , gb ) =
P˜ (gai , gbi+κ ) P˜ (gai , gbi+κ )log P˜ (gai )P˜ (gi+κ ) i=1
(4)
Therefore, time-dependent MIs are calculated for each expression profile obtained by shifting genes by one time-step till the defined maximum time delay is reached (see Figure 1), by assuming a stationary shift invariant distribution.
Fig. 1. Figure 1 - TimeDelay-ARACNE pairwise time MI idea. The basic idea of TimeDelay-ARACNE is to represent the time-shifting in the connections rather than unrolling the activation of nodes in time.
After this we introduce the Influence as the max time-dependent MIs, I κ (gA , gB ), over all possible delays κ: (k)
Inf l(ga , gb ) = maxκ {I κ (ga , gb ) : κ = 1, 2, ..., with IcE(ga ) ≤ IcE(gb )} (5) TimeDelay-ARACNE infers directed edges because shifted gene values are posterior to locked gene ones; so, if there is an edge it goes from locked data gene to the other one. Shifting profiles also makes the influence measure asymmetric: = I κ (Y, X) for κ =0 I κ (X, Y )
(6)
An Information Theoretic Approach to Reverse Engineering
105
In particular, if the measure Inf l(ga , gb ) is above the the significance threshold, explained below, for a value of κ > 0, then this means that the activation of gene ga influences the activation of gene gb at a later time. In other terms there is a directed link “from” gene ga “to” gene gb , this is the way TimeDelay-ARACNE can recover directed edges. On the contrary, the ARACNE algorithm does not produce directed edges as it corresponds to the case κ = 0, and the Mutual Information is of course symmetric. We want to show direct gene interactions so under the condition of the perfect choice of experimental time points the best time delay is one because it allows to better capture direct interactions while other delays ideally should evidence more indirect interactions but usually time points are not sharply calibrated to detect such information, so considering few different time points could help in the task. If you consider a too long time delay you can see a correlation between gene a and gene c losing gene b which is regulated by a and regulates c while short time delay can be not sufficient to evidence the connection between gene a and gene b, so using some few delays we try to overcome the above problem. After the computation of the Inf l() estimations, TimeDelay-ARACNE filters them using an appropriate threshold, I0 , in order to retrieve only statistical significant edges. In particular TimeDelay-ARACNE auto-sets threshold randomly permuting dataset row values, calculating the average MI, μi and standard deviation, σI of these random values. The threshold is then set with I0 = μI + ασi . Step 3. In the last step TimeDelay-ARACNE uses the DPI twice. In particular the first DPI is useful to split three-nodes cliques (triangles) at a single time delay. Whereas the second is applied after the computation of the time delay between each pair of genes as in (5). Just as in the standard ARACNE algorithm, three genes loops are still possible on the basis of a tolerance parameter. In particular triangles are maintained by the algorithm if the difference between the mutual information of the three connections is below the 15% (this the same tolerance parameter adopted in the ARACNE algorithm).
Results Algorithm Evaluation TimeDelay-ARACNE was evaluated first alone than against ARACNE, dynamical Bayesian Networks implemented in the Banjo package [35] (a software application and framework for structure learning of static and dynamic Bayesian networks) and ODE implemented in the TSNI package [36] (Time Series Network Identification) with both simulated gene expression data and real gene expression data related to yeast cell cycle [27] and SOS signaling pathway in E. coli [29]. Synthetic Data In order to quantitatively evaluate the performance of the algorithm reported here over a dataset with a simple and fair “gold standard” and also to evaluate how the
106
P. Zoppoli, S. Morganella, and M. Ceccarelli
performance depend of the size of the problem at the hand, such as network dimension, number of time points, and other variables we generated different synthetic datasets. Our profile generation algorithm (see Methods) starts by creating a random graph which represents the statistical dependencies between expression profiles, and then the expression values are generated according to a set of stochastic difference equation with random coefficients. The network generation algorithm works in such a way that each node can have zero (a “stimulator” node) one or two regulators. In addition to the random coefficients of the stochastic equations, a random Gaussian noise is added to each expression value. The performance are evaluated for each network size, number of time points and amount of noise by averaging the PPV, recall and F -score over a set of 20 runs with different randomly generated networks. The performance is measured in terms of: – positive predictive value (PPV), it is the percentage of inferred connections which are correct: p=
Number of true positives Number of true positives + Number of false positives
(7)
– recall, it is the percentage of true connection which are correctly inferred by the algorithm: r=
Number of true positives Number of true positives + Number of false negatives
(8)
– F -score. Indeed, the overall performance depend both of the positive prediction value and recall as one can improve the former by increasing the detection threshold, but at the expenses of the latter and vice versa. The F -score measure is the geometric mean of p and r and represents the compromise between them: 2(p · r) F = (9) p+r Since TimeDelay-ARACNE always tries to infer edge’s direction, so the precisionrecall curves take into account direction. As a matter of fact an edge is considered as a true positive only if the edge exist in reference table and the direction is correct. 1.1
Synthetic Data
Many simulated datasets were derived from any different networks analyzed. TD-ARACNE performance in network reconstruction is evaluated for accuracy and recall. Tab. 1 and Tab. 2 show that TD-ARACNE’s performance is directly correlated with time point numbers but not very strictly correlated with network gene numbers. In Tab. 3 we choose 5 small-medium synthetic networks among those previously tested against TD-ARACNE that are different for gene numbers and time points. TD-ARACNE performs much better than competitors but TSNI probably needs of a perturbation and Banjo needs of a very high number of experiments (time points) [35].
An Information Theoretic Approach to Reverse Engineering
107
Table 1. TD-ARACNE performance against synthetic data changing network gene numbers. TD-ARACNE performance results seem to be not very strictly correlated with network gene numbers.
Genes Time Points Accuracy 20 30 0.45 30 30 0.51 50 30 0.46 100 30 0.39
Recall F-score 0.31 0.37 0.31 0.39 0.26 0.33 0.20 0.26
Table 2. TD-ARACNE performance against synthetic data changing network data points. TD-ARACNE performance results show direct correlation with time points.
Genes Time Points Accuracy 20 10 0.23 20 20 0.41 20 30 0.45 20 50 0.61 20 100 0.67
Recall F-score 0.13 0.17 0.25 0.31 0.31 0.37 0.43 0.50 0.52 0.59
Table 3. Algorithms performance comparison against synthetic data. Performance comparison against 5 small-medium synthetic networks among those previously tested against TD-ARACNE that are different for gene numbers and time points. Results checked for accuracy (or positive predictive value) and recall states TD-ARACNE goodness.
TD-ARACNE TSNI Banjo Genes Time Points Accuracy Recall Accuracy Recall Accuracy Recall 11 17 0.67 0.60 0.33 0.13 0.11 0.08 20 20 0.41 0.25 0.29 0.10 0.11 0.08 20 100 0.67 0.52 0.23 < 0.1 < 0.1 < 0.1 40 30 0.49 0.28 0.38 < 0.1 < 0.1 < 0.1 50 50 0.64 0.40 0.47 < 0.1 < 0.1 < 0.1
1.2
Microarray Expression Profiles
In order to test TD-ARACNE performance on microarray expression profile we selected an eleven gene network from yeast S. cerevisiae cell cycle, precisely part of G1 step. Selected genes are: Cln3, Cdc28, Mbp1, Swi4, Clb6, Cdc6, Sic1, Swi6, Cln1, Cln2, Clb5. In Fig. 2 we report the network graphs reconstructed by TD-ARACNE, TSNI, Banjo and the KEGG true table. TSNI and Banjo are used with default settings reported in their manuals. TD-ARACNE recovers many gene-gene edges with both a good accuracy and recall. We also tested our
108
P. Zoppoli, S. Morganella, and M. Ceccarelli
Fig. 2. Yeast cell-cycle KEGG pathway and reconstructed network by three different algorithms. a) yeast cell-cycle KEGG pathway; b) TNSI inferred graph; c)TD-ARACNE inferred graph; d) Banjo inferred graph. TSNI and Banjo are used with default settings reported in their manuals. TD-ARACNE better recovers this yeast network topology than other algorithms.
Fig. 3. TD-ARACNE SOS predicted network and SOS pathway reference. a) E. coli SOS pathway; b) TNSI inferred graph; c) TD-ARACNE inferred graph; d) Banjo inferred graph. TSNI and Banjo are used with default settings reported in their manuals. TD-ARACNE finds lexA correctly as the HUB, recovers 5 edges correctly, 1 edge has wrong direction and only 1 edge is missing even if an undirected relation is established (lexA → polB → uvrY ). TD-ARACNE again better recover E. coli SOS pathway than other algorithms.
An Information Theoretic Approach to Reverse Engineering
109
algorithm using 8 genes by E. coli SOS pathway. Selected genes are: uvrD, lexA, umuDC, recA, uvrA, uvrY, ruvA, polB. In Fig. 3 are reported SOS pathway reconstruction by the three algorithms and relative bibliographic control [37]. TD-ARACNE better recovers these network topologies than other algorithms.
Conclusions The goal of TimeDelay-ARACNE is to recover gene time dependencies from time-course data producing oriented graph. To do this we introduce time Mutual Information and Influence concepts. First tests on synthetic networks and on yeast cell cycle and SOS pathway data give good results but many other tests should be made. Particular attention is to be made to the data normalization step because the lack of a rule. According to the little performance loss linked to the increasing gene numbers shown in this paper, next developmental step will be the extension from little-medium networks to medium networks.
References 1. Gardner, T.S., Faith, J.J.: Reverse-engineering transcription control networks. Physics of Life Reviews 2(1), 65–88 (2005) 2. Hasty, J., McMillen, D., Isaacs, F., Collins, J.: Computational studies of gene regulatory networks: in numeromolecular biology. Nature Review Genetics 2, 268– 279 (2001) 3. Bansal, M., Belcastro, V., Ambesi-Impiombato, A., di Bernardo, D.: How to infer gene networks from expression profiles. Mol. Syst. Biol. 3, 78 (2007) 4. Kim, S., Kim, J., Cho, K.: Inferring gene regulatory networks from temporal expression profiles under time-delay and noise. Computational Biology and Chemistry 31, 239–245 (2007) 5. Neapolitan, R.: Learning bayesian networks. Prentice Hall, Upper Saddle River (2003) 6. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using bayesian networks to analyze expression data. Journal of Computational Biology 7, 601–620 (2000) 7. Zou, M., Conzen, S.D.: A new dnamic bayesian network (dbn) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71–79 (2005) 8. Sch¨ afer, J., Strimmer, K.: An empirical bayes approach to inferring large-scale gene association networks. Bioinformatics 21(6), 754–764 (2005) 9. Stark, E., Drori, R., Abeles, M.: Partial Cross-Correlation analysis resolves ambiguity in the encoding of multiple movement features. J. Neurophysiol. 95(3), 1966–1975 (2006) 10. Butte, A.J., Kohane, I.S.: Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. In: Pacific Symposium on Biocomputing, vol. 5, pp. 415–426 (2000) 11. Margolin, A.A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera, R., Califano, A.: Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl. I), S7 (2006)
110
P. Zoppoli, S. Morganella, and M. Ceccarelli
12. Faith, J.J., Hayete, B., Thaden, T.T., Mogno, I., Wierzbowski, J., Cottarel, G., Kasif, S., Collins, J.J., Gardner, T.S.: Large-scale mapping and validation of escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology 5(1), e8+ (2007) 13. Meyer, P.E., Kontos, K., Lafitte, F., Bontempi, G.: Information theoretic inference of large transcriptional regulatory network. EURASIP Journal on Bioinformatics and Systems Biology 2007 (2007) 14. Shmulevich, I., Dougherty, E.R., Kim, S., Zhang, W.: Probabilistic boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 19, i255–i263 (2002) 15. Schliep, A., Sch¨ onhuth, A., Steinhoff, C.: Using hidden markov models to analyze gene expression time course data. Bioinformatics 18(2), 261–274 (2003) 16. Cui, Q., Liu, B., Jiang, T., Ma, S.: Characterizing the dynamic connectivity between genes by variable parameter regression and kalman filtering based on temporal gene expression data. Bioinformatics 21(8), 1538–1541 (2005) 17. Bansal, M., Gatta, G., di Bernardo, D.: Inference of gene regulatory networks and compound mode of action from time course gene expression. Bioinformatics 22(7), 815–822 (2006) 18. Chuang, C., Jen, C., Chen, C., Shieh, G.: A pattern recognition approach to infer time-lagged genetic interactions. Bioinformatics 24(9), 1183–1190 (2008) 19. Opgen-Rhein, R., Strimmer, K.: Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics 8, S3 (2007) 20. Li, X., Rao, S., Jiang, W., Li, C., Xiao, Y., Guo, Z., Zhang, Q., Wang, L., Du, L., Li, J., et al.: Discovery of time-delayed gene regulatory networks based on temporal gene expression profiling. BMC bioinformatics 7(1), 26 (2006) 21. Zhao, W., Serpedin, E., Dougherty, E.: Inferring gene regulatory networks from time series data using the minimum description length principle. Bioinformatics 22(17), 21–29 (2006) 22. Waibel, A.: Modular construction of time-delay neural networks for speech recognition. Neural Computation 1(1), 39–46 (1989) 23. Luktepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Heidelberg (2005) 24. Ramoni, M., Sebastiani, P., Kohane, I.: Cluster analysis of gene expression dynamics. Proceedings of the National Academy of Science 99(14), 9121–9126 (2002) 25. Holter, N., Maritan, A., Cieplak, M., Fedoroff, N., Banavar, J.: Dynamic modeling of gene expression data. Proceedings of the National Academy of Science 98(4), 1693–1698 (2000) 26. Gat-Viks, I., Tanay, A., Shamir, R.: Modeling and analysis of heterogeneous regulation in biological network. In: Eskin, E., Workman, C. (eds.) RECOMB-WS 2004. LNCS (LNBI), vol. 3318, pp. 98–113. Springer, Heidelberg (2005) 27. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botsein, D., Futcher, B.: Comprehensive identification of cell cycleregulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9(12), 3273–3297 (1998) 28. Kanehisa, M., Goto, S.: Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acid Res. 28(1), 27–30 (2000) 29. Ronen, M., Rosenberg, R., Shraiman, B.I., Alon, U.: Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics. Proc. Natl. Acad. Sci. U.S.A. 99(16), 10555–10560 (2002)
An Information Theoretic Approach to Reverse Engineering
111
30. Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla Favera, R., Califano, A.: Reverse engineering of regulatory networks in human b cells. Nature Genetics 37(4), 382–390 (2005) 31. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley-Interscience, Hoboken (1991) 32. Havard, R., Held, L.: Gaussian Markov random fields: theory and applications. CRC Press, Boca Raton (2005) 33. Chen, X., Fan, Y.: Estimation of copula-based semiparametric time series models. Journal of Econometrics (January 2006) 34. Nelsen, R.B.: An Introduction to Copulas. Springer, Heidelberg (2006) 35. Yu, J., Smith, V.A., Wang, P.P., Hartemink, A.J., Jarvis, A.J.: Advances to bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20(18), 3594–3603 (2004) 36. Bensal, M., Della Gatta, G., Di Bernardo, D.: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22(7), 815–822 (2006) 37. Saito, S., Aburatani, S., Horimoto, K.: Network evaluation from the consistency of the graph structure with the measured data. BMC Systems Biology 2(84), 1–14 (2008)
On the Use of Temporal Formal Logic to Model Gene Regulatory Networks Gilles Bernot and Jean-Paul Comet Laboratoire I3S, UMR 6070 UNS-CNRS Algorithmes-Euclide-B 2000, route des Lucioles B.P. 121 F-06903 Sophia Antipolis Cedex {bernot,comet}@unice.fr
Abstract. Modelling activities in molecular biology face the difficulty of prediction to link molecular knowledge with cell phenotypes. Even when the interaction graph between molecules is known, the deduction of the cellular dynamics from this graph remains a strong corner stone of the modelling activity, in particular one has to face the parameter identification problem. This article is devoted to convince the reader that computers can be used not only to simulate a model of the studied biological system but also to deduce the sets of parameter values that lead to a behaviour compatible with the biological knowledge (or hypotheses) about dynamics. This approach is based on formal logic. It is illustrated in the discrete modelling framework of genetic regulatory networks due to René Thomas.
1
Introduction: Modelling Gene Regulatory Networks
Since the advent of molecular biology, biologists have to face increasing difficulties of prediction to link molecular knowledge with cell phenotypes. The belief that the sequencing of genomes would rapidly open the door to a personalized medicine has been confronted at first to the necessity of annotating finely genomes, then to the difficulty to deduce the structure(s) of proteins, then to the huge inventory of interactions that constitute biological networks, and so on. In the same way, we have to face now the fact that the knowledge of an interaction graph does not make it possible to deduce the cellular dynamics. Indeed, interaction graphs are of static nature in the same way as genetic sequences, and it turns out that a large number of parameters, which are unknown and not easily measurable, control the dynamics of interactions. Moreover, combined interactions (and especially feedback circuits in an interaction graph) result in several possible behaviours of the system, qualitatively very different. Even with only two genes the situation is far from simple. Let us consider for example the interaction graph of Figure 1. This simple graph contains 2 circuits, whose intersection is the gene x. The left hand side circuit is said positive: F. Masulli, L. Peterson, and R. Tagliaferri (Eds.): CIBB 2009, LNBI 6160, pp. 112–138, 2010. © Springer-Verlag Berlin Heidelberg 2010
On the Use of Temporal Formal Logic to Model Gene Regulatory Networks
113
activation
x activation
y inhibition
Fig. 1. A simple interaction graph containing a positive circuits and a negative one
– If, from an external stress, the concentration level of the protein coded by gene x grows up, then, at a certain threshold, it will favour the expression of x, producing in turn more x proteins even if the external stress has disappeared. – On the contrary, if the concentration level of the protein of gene x is low, then it will not favour the expression of x, and the concentration level of the x protein can stay at a low level. More generally, a positive circuit in a gene interaction network is a circuit that contains an even number of inhibitions, and a positive circuit favours the existence of 2 stable states [28,19,24,10,27,21] (respectively high level and low level of expression of x). The right hand side circuit is said negative because it contains an odd number of inhibitions: – If the concentration level of the protein coded by gene x grows up, then it will favour the expression of gene y, which will in turn inhibit gene x, resulting in a decreasing of the x protein concentration. – Conversely, a low concentration level of the x protein shall decrease the expression level of gene y, which shall become unable to inhibit x, resulting in a higher expression level of x. . . and the process will start again. More generally, a negative circuit favours homeostasy: oscillations which can either be damped towards a unique stable state or sustained towards a limit cycle surrounding a unique unstable equilibrium state of the biological system. The two circuits of Figure 1 are consequently competitors: do we get one or two stable states? shall we observe oscillations? and if so, do we get oscillations around a unique stable state or around two stable states? These predictions entirely depend on parameters that control the strength of the activation or inhibition arrows of the interaction graph (see Section 2.3). Most of the time, mathematical models are used to perform simulations using computers. Biological knowledge is encoded into ordinary differential equations (ODE) for instance, and many parameters of the system of ODEs are a priori unknown, a few of them being approximately known. Many and many simulations are performed, with different values for the parameters, and the behaviour observed in silico is compared with the known in vivo behaviour. This process, by trial and error, makes it possible to propose a robust set of parameters
114
G. Bernot and J.-P. Comet
(among others) that is compatible with the biological observations. Then, several additional simulations that simulate novel situations can predict interesting behaviours, and suggest new biological experiments. The goal of this article is to convince the reader that “brute force simulations” are not the only way to use a computer for gene regulatory networks. A computer can do more than computations. A computer manipulates symbols. Consequently, using deduction rules, a computer can perform proofs within adequate logics. One of the main advantages of logics is that they exhaustively manipulate sets of models, and exhaustively manage the subset of all models that satisfy a given set of properties. More precisely, a logic provides three well established concepts: – a syntax that defines the properties that can be expressed and manipulated (these properties are called formulas), – a semantics that defines the models under consideration and the meaning of each formula according to these models, – deduction rules or model checking from which we get algorithms in order to prove if a given model (or a set of models) satisfy a formula (or a finite set of formulas). Logic can thus be used to manipulate, in a computer aided manner, the set of all the models that satisfy a given set of known properties. Such an exhaustive approach avoids focusing on one model, which can be “tuned” ad libitum because it has many parameters and few equations. Consequently it avoids focusing on a model which is non predictive. Logic also helps studying the ability to refute a set of models with the current experimental capabilities, it also brings useful concepts such as observationally equivalent models, and so on. More generally formal methods are helpful to assist biologists in their reasonnings [4,7,22,1,18]. In this article, we provide a survey of the “SMBioNet method” whose purpose is to elucidate the behaviour of a gene interaction graph, to find all the possible parameter values (if any), and to suggest suitable experiments to validate or to refute a given biological hypothesis. In the next section, we remind the approach of René Thomas [29] to obtain discrete models (with finite sets of possible states) for gene regulatory networks. We take as a pedagogical example the well known lactose operon in E. coli. In Section 3, we show how temporal logic and more precisely CTL can be used to properly encode biological properties. In Section 4, we show how temporal logic can be used to guide the process of gene network elucidation.
2 2.1
Discrete Framework for Gene Regulatory Networks A Classical Example
In this article, we consider the system of the lac operon which plays a crucial role in the transport and metabolism of lactose in Escherichia coli and some other enteric bacteria [11]. Lactose is a sugar which can be used as a source of
On the Use of Temporal Formal Logic to Model Gene Regulatory Networks
115
carbon mandatory for mitosis. This system allows to switch on the production of enzymes allowing the metabolism of carbon only if lactose is available and no other more readily-available energy sources are available (e.g. glucose). The Lactose Operon and Its Associated Biological System. The operon consists of three adjacent structural genes, a promoter, a terminator, and an operator. The lac operon is regulated by several factors including the availability of glucose or of lactose. When lactose is absent in the current environment, a repressor protein maintains the expression of the operon at its basal level. In presence of lactose, it enters into the cell thanks to a protein named permease which is coded by the operon itself. The lactose proteins have affinity to the repressor proteins, form complexes with them leading first to a decreasing of the concentration of free repressor and thus to the activation of the operon. Consequently, the permease concentration increases, the lactose enters more efficiently into the cell, maintaining the low concentration of free repressor. These interactions form then a positive feedback loop on the intracellular lactose (left part of Figure 2). Another protein coded by the operon plays also a role in the carbon metabolism: the enzyme galactosidase. It is responsible for the degradation of the lactose in order to transform the lactose into carbon. Thus the increasing of the intracellular lactose leads to an increasing of the galactosidase, then to the decreasing of the intracellular lactose. These interactions form then a negative feedback loop on the intracellular lactose (right part of Figure 2). Moreover, when glucose is present, it inhibits indirectly the transcription of the operon (via an indirect inhibition of cAMP (cyclic Adenosine MonoPhospate) which forms with CAP (Catabolite gene Activator Proteins) the complex responsible for the transcription of the operon. Thus this alternative pathway of the carbon metabolism is inhibited. To summarize, the intracellular lactose is subject to two influences which are contradictory one to another. The positive feedback loop attempts to keep the high concentration of intracellular lactose whereas the negative feedback loop attempts to decrease this concentration, as shown in Figure 2. Biological Questions Drive Modelling Abstractions. The modelling of a biological system often means to construct a mathematical objet that mimics the behaviours of the considered biological system. The elaborated model is often built according to a particular knowledge on the system (interactions, structuration, published behaviours, hypotheses. . . ), and this knowledge is often not complete: for example one may improperly focus on a given subsystem. Thus the modelling process presented in this paper proposes to construct models according a particular point of view on the biological system. This point of view may turn out to be inconsistent and the modelling process should be able to point out inconsistencies leading to reconsider the model(s). The construction of a model aims at studying a particular behaviour of the system. Each facet of the system corresponds to a particular partial model. When
116
G. Bernot and J.-P. Comet
extra−cellular lactose
permease
intra−cellular lactose
lac repressor
operon
galactosidase
glucose
Fig. 2. Schematic representation of the lac operon system in Escherichia coli
this facet of the system is understood, the modeller can throw away the current model in order to go further in the understanding of the system. In other words, the point of view of the modeller evolves during the modelling process, leading to refinements or re-buildings of the model. In such a perspective, as the construction of the model is based on a set of biological facts and hypotheses, it becomes possible to construct a model only to test different hypotheses, to apprehend the consequences of such hypotheses and possibly to refute some of them. Last but not the least, these models can be used in order to suggest some experiments. For example, if biologists want to put the spot on the use of lactose, then we can adopt a modelling point of view that implicitly assumes the absence of glucose. Then glucose does not belong to the model any more. Moreover, even if it may seem surprising, the lacI repressor can be suppressed because the inhibition of cellular lactose on lacI and the inhibition of the lacI repressor on the operon are known to be always functional. These successive repressions can consequently be abstracted by a unique direct activation from cellular lactose to the operon. Depending on the studied hypotheses, some additional simplifications of the model are possible: – If the hypothesis does not refer explicitly to the galactosidase, then we can abstract galactosidase in a similar manner than for the repressor, see Figure 3. – If the hypothesis does not refer explicitly to the permease, then we can abstract permease as in Figure 4. In the sequel of this article we will consider the interaction schema of Figure 4. 2.2
Gene Interaction Networks
Discretization. When modelling gene interactions, threshold phenomena observed in biology [16] constitute one of the key points for the comprehension of
On the Use of Temporal Formal Logic to Model Gene Regulatory Networks
extra−cellular lactose
117
intra−cellular lactose
permease
Fig. 3. Abstraction of the lac operon system when focusing on intra-cellular lactose and permease
extra−cellular lactose
intra−cellular lactose
galactosidase
Fig. 4. Abstraction of the lac operon system when focusing on intra-cellular lactose and galactosidase
the behaviour of the system. Combined with the additional in vivo phenomenon of macromolecule degradation, the interaction curves get a sigmoidal shape (e.g. Hill functions) [6], see Figure 5. Then, it becomes clear that for each interaction, two qualitative situations have to be considered: the regulation is effective if the concentration of the regulator is above the threshold of the sigmoid, and conversely, it is ineffective if the concentration of the regulator is below the threshold. When the product of a gene regulates more than one target, more than two situations have to be considered. For example, Figure 5 assumes that u is a gene product which acts positively on v and negatively on w; each curve being the concentration of v (resp. w) with respect to the concentration of u; after a sufficient delay for u to act on v (resp. w). Obviously, three regions are relevant in the different levels of concentration of u: – In the first region u acts neither on v nor on w, – In the second region, u acts on v but it still does not act on w: – In the last region, u acts both on v and w: The sigmoid nature of the interactions shown in Fig. 5 is almost always verified and it justifies this discretization of the concentration levels of u: three abstract levels (0, 1 and 2) emerge corresponding to the three previous regions and constitute the only relevant information from a qualitative point of view1 . The generalization is straightforward: if a gene acts on n targets, at most n+1 abstract regions are considered (from 0 to n). Less abstract levels are possible when two thresholds for two different targets are equal. Gene regulatory graphs. A biological regulatory graph is defined as a labelled directed graph. A vertex represents a variable (which can abstract a gene and its protein for instance) and has a boundary which is the maximal value of its discrete concentration level. Each directed edge u → v represents an action of 1
Atypic behaviours on the thresholds can also be studied, see for example [8].
118
G. Bernot and J.-P. Comet v
u w
u 0
1
2
Fig. 5. Discretization of concentration of a regulator with 2 targets
u on v. The corresponding sigmoid can be increasing or decreasing (Figure 5), leading respectively to an activation or an inhibition. Thus each directed edge u → v is labelled with an integer threshold belonging to [0, bu ] and a sign: + for an activation of v and − for an inhibition of v. Definition 1. A biological regulatory graph is a labelled directed graph G = (V, E) where: – each vertex v of V , called variable, is provided with a boundary bv ∈ N∗ less or equal to the out-degree of v in G; except when the out-degree is 0 where bv = 1; – each edge u → v of E is labelled with a couple (t; ) where t, called threshold, is an integer between 1 and bu and ∈ {−, +}. The schematic figure 4 is too informal to represent a biological regulatory graph: the edge modelling the auto-regulation of the intra-cellular lactose does not point on a variable but on an edge and, moreover, the thresholds are missing. To construct from the figure 4 a biological regulatory graph, we modify the edge modelling the auto-regulation of the intra-cellular lactose: its target becomes directly the intra-cellular lactose, see Figure 6. Moreover, three different rankings of thresholds can be considered : cases A, B or C. Gene regulatory networks. The discretization step allows one to consider only situations which are qualitatively different: if an abstract level changes, there exists at least one interaction which becomes effective or ineffective. To go further, one has to define what are the possible evolutions of each variable under some effective regulations. Assuming that u1 . . . un have an influence on v (entering arrows ui → v), toward which concentration level is v attracted? This level depends on the set of active regulators, which evolves with time: at a given time, only some of them pass the threshold. For example in Figure 6-A, the level toward which intra-cellular lactose is attracted only depends on the presence of extra-cellular lactose (i.e. has a level
Fig. 6. Three different biological regulatory graphs of the operon lactose system (note the different values of interaction thresholds)
For example, in Figure 6-A, the level toward which intra-cellular lactose is attracted only depends on the presence of extra-cellular lactose (i.e. its level is greater than or equal to 1), the presence of itself (i.e. its level is greater than or equal to 2) and the absence of galactosidase (i.e. its level is strictly less than 1). Indeed the absence of an inhibitor is equivalent to the presence of an activator, from the symmetry of sigmoids. These “target” concentration levels are defined by parameters, denoted by kv,ω, where ω is a subset of regulators. Biological regulatory networks are biological regulatory graphs (Definition 1) together with these parameters kv,ω.
Definition 2. A biological regulatory network is a couple R = (G, K) where G = (V, E) is a biological regulatory graph, and K = {kv,ω} is a family of integers such that
– v belongs to V,
– ω is a subset of G−1(v), the set of predecessors of v in the graph G, and will be called a set of resources of v,
– 0 ≤ kv,ω ≤ bv.
Intuitively, the parameter kv,ω describes the behaviour of variable v when all variables of ω act as a resource of v (a resource being the presence of an activator or the absence of an inhibitor). Most of the time, we consider an additional monotony condition called the Snoussi condition [25]: ∀v ∈ V, ∀ω, ω′ ⊆ G−1(v), ω ⊂ ω′ ⇒ kv,ω ≤ kv,ω′. In other words, values of parameters never contradict the quantity of resources. For the running example, the variable intra-cellular lactose, noted intra in the sequel, is regulated by two activators and by one inhibitor. This variable can thus be regulated by 2^3 = 8 different subsets of its inhibitors/activators. In the same way, 2 parameters have to be given for the variable galactosidase, noted g in the sequel. To sum up, 10 parameters have to be given:
K = { kg,∅, kg,{intra}, kintra,∅, kintra,{extra}, kintra,{g}, kintra,{extra,g}, kintra,{intra}, kintra,{intra,extra}, kintra,{intra,g}, kintra,{intra,extra,g} }.
These parameters control the dynamics of the model since they define how targets evolve according to their current sets of resources.
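To fix ideas, the regulatory graph of Figure 6-A and the notion of resource (Definition 2) can be encoded directly; the sketch below is ours and only restates the thresholds and signs discussed above (the parameters kv,ω are dealt with in the next subsection).

```python
# Sketch of the regulatory graph of Figure 6-A: an edge u -> v carries (threshold, sign).
EDGES = {  # target: {regulator: (threshold, sign)}
    "intra": {"extra": (1, "+"), "intra": (2, "+"), "g": (1, "-")},
    "g":     {"intra": (1, "+")},
}

def resources(v, state):
    """Regulators of v that currently help it: activators above their threshold,
    inhibitors below their threshold (Definition 2's sets of resources)."""
    return frozenset(u for u, (t, sign) in EDGES[v].items()
                     if (sign == "+" and state[u] >= t) or (sign == "-" and state[u] < t))

state = {"extra": 1, "intra": 1, "g": 0}
print(sorted(resources("intra", state)))   # ['extra', 'g']
print(sorted(resources("g", state)))       # ['intra']
```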
2.3 Dynamics of Gene Networks: State Graphs
At a given time, each variable of a regulatory network has a unique concentration level. The state of the biological regulatory network is the vector of concentration levels (the coordinate associated with any variable v is an integer belonging to the interval from 0 to the boundary bv). According to a given state, the resources of a variable are the regulators that help the variable to be expressed. The set of resources of a variable is constituted by all activators whose level is above their threshold of activation and all the inhibitors whose level is below their threshold. Resources are used to determine the evolution of the system. At a given state, each variable is attracted by the corresponding parameter kv,ω where ω is its set of resources. The function that associates with each state the vector formed by the corresponding kv,ω is an endomorphism of the state space. Table 1 defines this endomorphism for the example of Figure 6-A when extra-cellular lactose is present (parameter values are given in the caption).

Table 1. Partial state table for Figure 6-A. Only states with extra-cellular lactose present are considered. Values of the parameters are: kintra,{extra} = 0, kintra,{extra,intra} = 1, kintra,{extra,g} = 2, kintra,{extra,intra,g} = 2, kg,∅ = 0 and kg,{intra} = 1.

extra | intra | g | ω(intra)          | ω(g)    | kintra,ω(intra) | kg,ω(g)
  1   |   0   | 0 | {extra, g}        | {}      | 2               | 0
  1   |   0   | 1 | {extra}           | {}      | 0               | 0
  1   |   1   | 0 | {extra, g}        | {intra} | 2               | 1
  1   |   1   | 1 | {extra}           | {intra} | 0               | 1
  1   |   2   | 0 | {extra, intra, g} | {intra} | 2               | 1
  1   |   2   | 1 | {extra, intra}    | {intra} | 1               | 1
Such a table can be represented by a synchronous state graph in which each state has a unique successor: the state towards which the system is attracted, see the left part of Figure 7 where extra is supposed to be equal to 1. In this example, when the system is in the state (1, 1, 0), it is attracted towards the state (1, 2, 1). Variables intra and g are both attracted toward different values. The probability that both variables pass through their respective thresholds at exactly the same time is negligible in vivo, but we do not know which one will be passed first. Accordingly, we replace such a diagonal transition by the collection of the transitions which modify only one of the involved variables at a time. For example, the transition (1, 1, 0) → (1, 2, 1) is replaced by the transitions (1, 1, 0) → (1, 2, 0) and (1, 1, 0) → (1, 1, 1): the first one corresponds to the case where the variable intra evolves first whereas the second one corresponds to the case where the variable g evolves first, see the right part of Figure 7. An arrow of length greater than or equal to 2 would imply a variable which increases its concentration level abruptly and jumps several thresholds. For our example, when the system is in the state (1, 0, 0), it is attracted towards the state (1, 2, 0).
Since the concentration varies continuously, independently of whether it varies rapidly or not, real transitions should only reach neighbouring states. Thus, the transition (1, 0, 0) → (1, 2, 0) is replaced by the transition (1, 0, 0) → (1, 1, 0), see the right part of Figure 7.
Fig. 7. From the values of the parameters to the asynchronous state graph (kintra,{extra} = 0, kintra,{extra,intra} = 1, kintra,{extra,g} = 2, kintra,{extra,intra,g} = 2, kg,∅ = 0, kg,{intra} = 1; left: synchronous state graph, right: asynchronous state graph, in the plane extra = 1)
The value of the parameter kv,ω (where ω is the set of resources of v at the current state η) indicates how the expression level of v can evolve from the state η. It can increase (respectively decrease) if the parameter value is greater (respectively smaller) than the current level of the variable v. The expression level must stay constant if both values are equal. Formally:
Definition 3. The asynchronous state graph of the biological regulatory network R = (G, K) is defined as follows:
– the set of vertices is the set of states Πv∈V [0, bv];
– there is a transition from the state n = (n1, . . . , n|V|) to the state m = (m1, . . . , m|V|) iff either ∃!i such that mi ≠ ni and mi = (ni ↝ ki,ωi(n)), or m = n and ∀i ∈ [1, |V|], ni = (ni ↝ ki,ωi(n)),
where ωv(n) represents the set of resources of variable v at state n and where (a ↝ b) = a + 1 if b > a, (a ↝ b) = a − 1 if b < a and (a ↝ b) = a if b = a.
Transitions have biological interpretations. For the current example, horizontal left-to-right transitions correspond to the entering of extra-cellular lactose into the cell whereas horizontal right-to-left transitions correspond to the breakdown of lactose into glucose. Unfortunately, this asynchronous state graph has been built from the knowledge of the different parameters. In fact, usually no information about these parameters is available, and it is necessary to consider all possible values. The gene regulatory graphs of Figures 6-A and 6-B give rise to 2^3 = 8 parameters for intra-cellular lactose (intra) and 2^1 = 2 parameters for galactosidase (g). The intra-parameters can take 3 possible values (from 0 to 2) and the g-parameters can take 2 possible values (0 or 1). So, there are 3^(2^3) × 2^(2^1) = 26244 different regulatory networks associated with the regulatory graph of Figure 6-A or 6-B.
More generally, each gene of a regulatory graph contributes (out+1)^(2^in) different parameter combinations, where in and out are its in- and out-degrees in the graph, and gene contributions are multiplicative. The total number of different regulatory networks denoted by Figure 6 is thus 26244 + 26244 + 2^(2^3) × 2^(2^1) = 53512, because Figure 6-C assumes that the two outgoing arrows of intra-cellular lactose share the same threshold. Let us nevertheless note that the number of different asynchronous state graphs can be smaller than the number of parameterizations (two different parameterizations can lead to the same state graph). For example, the parameterization deduced from the one of Figure 7 by replacing the value of kintra,{extra,intra} by 0 leads to the same dynamics. Anyway, the number of parameterizations depends on the number of interactions pointing to each variable, following a double exponential.
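Definition 3 can be made concrete on the running example. The sketch below (ours) hard-codes the attracting levels of Table 1 for the plane extra = 1 and derives the asynchronous successors; the last lines merely recompute the counting argument given above.

```python
# Asynchronous successors (Definition 3) on the plane extra = 1 of Figure 6-A.
# target[(intra, g)] is the pair of parameters toward which the state is attracted (Table 1).
target = {(0, 0): (2, 0), (0, 1): (0, 0), (1, 0): (2, 1),
          (1, 1): (0, 1), (2, 0): (2, 1), (2, 1): (1, 1)}

def step(x, k):                      # move one unit toward the attracting level k
    return x + 1 if k > x else x - 1 if k < x else x

def successors(state):
    succs = []
    for i, (x, k) in enumerate(zip(state, target[state])):
        if step(x, k) != x:          # only variables that want to move, one at a time
            s = list(state); s[i] = step(x, k)
            succs.append(tuple(s))
    return succs or [state]          # a stable state loops on itself

for s in sorted(target):
    print(s, "->", successors(s))

# Counting argument of Sect. 2.3:
print(3 ** (2 ** 3) * 2 ** (2 ** 1))                 # 26244 networks for Fig. 6-A (or 6-B)
print(2 * 26244 + 2 ** (2 ** 3) * 2 ** (2 ** 1))     # 53512 networks for Figure 6 as a whole
```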
2.4 Introducing Multiplexes
In order to decrease the number of parameterizations, other structural knowledge can be useful. For example, let us consider a variable c that has 2 different activators a and b. Without further information, we have to consider 2^2 = 4 different parameters associated with the four situations: what is the evolution of c when both regulators are absent, when only a is present, when only b is present and when both are present. Sometimes additional structural knowledge can be derived from molecular biology: for example, it could be known that the regulation takes place only when both regulators are present, because the effective regulator is in fact the complex of a and b. In such a case, the four parameters reduce to only two: what is the evolution when the complex is present (both regulators are present) and when the complex is absent (at least one of the regulators is absent). Such information reduces the number of parameters, and drastically decreases the number of parameterizations to consider. Multiplexes allow the introduction of such information [2]. They provide a slight extension of R. Thomas’ modelling, with explicit information about cooperative, concurrent or more complex molecular interactions [14,15]. Intuitively, a regulatory graph with multiplexes is a graph with two kinds of vertices: variables give rise to vertices (they constitute the set V of nodes, V for variables) and, moreover, each piece of information about cooperative, concurrent or more complex molecular interactions also gives rise to a vertex (they constitute the set M of nodes, M for multiplexes). Information about molecular interactions is coded into a logical formula that explains when the interaction takes place. In the previous example of complexation, the interaction takes place only when both regulators are present, that is when (a ≥ sa) ∧ (b ≥ sb).
Definition 4. A gene regulatory graph with multiplexes is a tuple G = (V, M, EV, EM) such that:
1. (V ∪ M, EV ∪ EM) constitutes a (labelled) directed graph whose set of nodes is V ∪ M and set of edges is EV ∪ EM, with EV ⊂ V × N × M and EM ⊂ M × (V ∪ M).
2. V and M are disjoint finite sets. Nodes of V are called variables and nodes of M are called multiplexes. An edge (v, s, m) of EV is denoted v →s m, where s is called the threshold.
3. Each variable v of V is labelled with a positive integer bv called the bound of v.
4. Each multiplex m of M is labelled with a formula belonging to the language Lm inductively defined by:
– If v →s m ∈ EV, then vs is an atom of Lm, and if (m′ → m) ∈ EM then m′ is an atom of Lm.
– If φ and ψ belong to Lm then ¬φ, (φ ∧ ψ), (φ ∨ ψ) and (φ ⇒ ψ) also belong to Lm.
5. All cycles of the underlying graph (V ∪ M, EV ∪ EM) contain at least one node belonging to V.²
Let us remark that point 1 of the previous definition separates two sets of edges. On the one hand, the first set is made of edges starting from a variable: their targets are multiplexes, and they are labelled by a threshold that determines the atoms used in the target multiplexes. On the other hand, the second set is made of edges starting from a multiplex: their targets can be either a variable (the target of the complex interaction) or a multiplex (the logical formula of the source multiplex plays the role of an atom in the language of the target multiplex). We now define a regulatory network with multiplexes as a regulatory graph with multiplexes provided with a family of parameters which define the evolutions of the system according to the subset of predecessors (which are now multiplexes instead of variables).
Definition 5. A gene regulatory network with multiplexes is a couple (G, K) where
– G = (V, M, EV, EM) is a regulatory graph with multiplexes,
– K = {kv,ω} is a family of parameters indexed by v ∈ V and ω ⊂ G−1(v) such that all kv,ω are integers and 0 ≤ kv,ω ≤ bv.
As in the classical framework, the parameters kv,ω define how the variable v evolves when the set of effective interactions on v is ω ⊂ G−1(v). This set of effective interactions, named the set of resources, is defined inductively for each variable v and each state η:
Definition 6. Given a regulatory graph with multiplexes G = (V, M, EV, EM) and a state η of G, the set of resources of a variable v ∈ V for the state η is the set of multiplexes m of G−1(v) such that the formula ϕm of the multiplex m is satisfied. The interpretation of ϕm in m is inductively defined by:
² This condition is mandatory for the definition of the dynamics (Definition 6).
– If ϕm is reduced to an atom vs with v ∈ G−1(m), then ϕm is satisfied iff v ≥ s according to the state η.
– If ϕm is reduced to an atom m′ ∈ M with m′ ∈ G−1(m), then ϕm is satisfied iff ϕm′ of m′ is satisfied.
– If ϕm ≡ ψ1 ∧ ψ2 then ϕm is satisfied iff ψ1 and ψ2 are satisfied; and we proceed similarly for all other connectives.
We note ρ(v, η) the set of resources of v for the state η. Definition 3 of the asynchronous state graph remains valid for gene regulatory graphs with multiplexes: the set of vertices does not change, nor does the definition of transitions; the only difference resides in the definition of the set of resources: ωi(n) has to be replaced by ρ(v, n). The contribution of multiplexes is thus simply to decrease the number of parameters. Introducing a multiplex corresponds to specifying how the predecessors of the multiplex cooperate, and allows one to associate a single parameter with the multiplex, whatever the number of predecessors. For example, if the cooperation of three regulators on a common target is well known, then without multiplexes one needs 2^3 = 8 parameters to describe the evolution of the target in each situation, whereas when considering multiplexes, only 2 are mandatory: one to describe the evolution of the target when the cooperation of regulators takes place, and another to describe the evolution of the target when the cooperation does not take place.
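The gain provided by a multiplex can be illustrated on the complexation example used above; the sketch below (hypothetical names and thresholds) evaluates the multiplex formula (a ≥ sa) ∧ (b ≥ sb) and shows that a single parameter per truth value of the formula is enough.

```python
# Sketch of a multiplex: a and b regulate c only through the complex a:b.
S_A, S_B = 1, 1                                    # made-up thresholds

def multiplex_ab(state):                           # phi_m = (a >= s_a) and (b >= s_b)
    return state["a"] >= S_A and state["b"] >= S_B

def resources_of_c(state):                         # satisfied multiplexes of G^-1(c)
    return frozenset(m for m, phi in {"ab": multiplex_ab}.items() if phi(state))

K_C = {frozenset(): 0, frozenset({"ab"}): 1}       # two parameters instead of four

for a in (0, 1):
    for b in (0, 1):
        st = {"a": a, "b": b, "c": 0}
        print((a, b), "-> c attracted toward", K_C[resources_of_c(st)])
```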
3 Temporal Logic and Model Checking for Biology
Since the parameters are generally not measurable in vivo, finding a suitable class of parameters constitutes a major issue of the modelling activity. This reverse engineering problem is a parameter identification problem, since the structure of the interactions is supposed known, see for example [13] for such a problem in the differential framework. In our discrete framework, this problem is simpler because of the finite number of parameterizations to consider. Nevertheless this number is so enormous that a computer aided method is needed to help biologists to go further in the comprehension of the biological system under study. Moreover, when studying a system, the biological knowledge arrives in an incremental manner. It would be desirable to approach the problem in such a way that, when new knowledge has to be taken into account, previous work is not put into question. In other words, one would like to handle not only a single possible model of the system, but the exhaustive set of models which are, at a given time, acceptable according to the current knowledge. Biological knowledge about the behaviour of the system, extracted from the literature, can be seen as constraints on the set of possible dynamics: a model is satisfactory only if its dynamics are compatible with the biological behaviours reported in the literature. In a similar way, when a new wet experiment leads to new knowledge about the behaviour of the system, it can also be used as a new constraint which filters the set of possible models: only the subset of models
whose dynamics is compatible with this new experiment has to be conserved for further investigation. It thus becomes clear that a computer aided approach has to be developed in order to manipulate the different pieces of knowledge about the dynamics of the system and to make use of them to automatically handle the set of compatible parameterizations. Logic precisely allows the definition of the sets of these models. It constitutes a suitable approach for addressing such a combinatorial problem.
3.1 The Computation Tree Logic (CTL)
Computation tree logic (CTL) is a branching-time logic, in which the time structure is like a tree: the future is not determined; there are different paths in the future, in other words, at some points of time, it is possible to choose among a set of different evolutions. CTL is generally used in formal verification of software or hardware, especially when the artificial system is supposed to control systems where the consequences of a bug can lead to tragedies (public transportation, nuclear power plants, ...). For such a goal, software applications known as model checkers are useful: they determine whether a given model of a system satisfies a given temporal property written in CTL.
Syntax of CTL. The language of well-formed CTL formulae is generated by the following recursive definition:
φ ::= ⊥ | ⊤ | p                                            (atoms)
    | (¬φ) | (φ ∧ φ) | (φ ∨ φ) | (φ ⇒ φ) | (φ ⇔ φ)         (usual connectives)
    | AXφ | EXφ | AFφ | EFφ                                (temporal connectives)
    | AGφ | EGφ | A[φUφ] | E[φUφ]                          (temporal connectives)
where ⊥ and ⊤ code for False and True, p denotes a particular atomic formula, and φ is another well-formed CTL formula. In the context of R. Thomas’ theory for genetic regulatory networks, the atoms can be of the form (a ∝ n) where
– a is a variable of the system,
– ∝ is an operator in {<, ≤, >, ≥},
– n is an integer belonging to the interval [0, ba].
Semantics of CTL. This definition uses the usual connectives (¬, ∧, ∨, ⇒, ⇔) as well as temporal modalities which are pairs of symbols: the first element of the pair is A or E and the second belongs to {X, F, G, U}; their meanings are given in the following table.
First letter:   A   for All path choices
                E   for at least one path choice (Exists)
Second letter:  X   neXt state
                F   some Future state
                G   all future states (Globally)
                U   Until
Fig. 8. Semantics of CTL formulae. Dashed arrows (resp. arrows with an empty head) point to states where ϕ (resp. ψ) is satisfied.
Figure 8 illustrates each temporal modality:
– EX(ϕ) is true at the current state if there exists a successor state where ϕ is true.
– AX(ϕ) is true at the current state if ϕ is true in all successor states.
– EF(ϕ) is true at the current state if there exists a path leading to a state where ϕ is true.
– AF(ϕ) is true at the current state if all paths lead to a state where ϕ is true.
– EG(ϕ) is true at the current state if there exists a path starting from the current state all of whose states satisfy the formula ϕ.
– AG(ϕ) is true at the current state if all states of all paths starting from the current state satisfy the formula ϕ.
– E[ϕUψ] is true at the current state if there exists a path starting from the current state leading to a state where ψ is true, and passing only through states satisfying ϕ.
– A[ϕUψ] is true at the current state if all paths starting from the current state lead to a state where ψ is true, passing only through states satisfying ϕ.
For example, AX(intra ≥ 1) means that in all next states accessible from the current state in the asynchronous state graph, the concentration level of intra is greater than or equal to 1. Note that this formula is false in the asynchronous state graph of Figure 7 if the initial state is (1, 1) or (0, 1), and is true for all other initial states. The formula EG(g = 0) means that there exists at least one path starting from the current state along which the concentration of g is constantly equal to 0. In Fig. 7 no state satisfies this formula, because from each state, all paths eventually reach a state where the concentration of g is equal to 1.
We say that a model or a state graph satisfies a CTL formula if each of its states satisfies the formula.
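To give an operational flavour of what a model checker computes on such state graphs, the sketch below (ours, and much cruder than the symbolic algorithms of NuSMV) evaluates EX, EF and AF by the usual fixpoint computations on an explicit graph; it uses the asynchronous state graph of Figure 7, with states written (intra, g) in the plane extra = 1.

```python
# Explicit-state evaluation of some CTL modalities by fixpoint computation.
# A formula is represented by the set of states where it holds; succ maps states to successors.

def EX(succ, phi):
    return {s for s in succ if any(t in phi for t in succ[s])}

def EF(succ, phi):                       # least fixpoint of  phi  OR  EX(...)
    result = set(phi)
    while True:
        new = result | EX(succ, result)
        if new == result:
            return result
        result = new

def AF(succ, phi):                       # least fixpoint of  phi  OR  "all successors already in"
    result = set(phi)
    while True:
        new = result | {s for s in succ if succ[s] and all(t in result for t in succ[s])}
        if new == result:
            return result
        result = new

# Asynchronous state graph of Figure 7 (every state has at least one successor).
succ = {(0, 0): [(1, 0)], (0, 1): [(0, 0)], (1, 0): [(2, 0), (1, 1)],
        (1, 1): [(0, 1)], (2, 0): [(2, 1)], (2, 1): [(1, 1)]}

intra_pos = {s for s in succ if s[0] > 0}          # atom  intra > 0
print(AF(succ, intra_pos) == set(succ))            # True: AF(intra > 0) holds at every state
print(EF(succ, {(2, 1)}) == set(succ))             # True: (2, 1) is reachable from every state
```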
3.2 CTL to Encode Biological Properties
CTL formulas are useful to express temporal properties of the biological system. Once such properties have been elaborated, a model of the biological system is acceptable only if its state graph satisfies the CTL formulas; otherwise, it is no longer considered. The first temporal property focuses on the functionality of the entering of lactose into the cell: when external lactose is constantly present in the environment, after a sufficiently long time, intra-cellular lactose will increase and will at least cross its first threshold. This first property can then be written in CTL by the formula ϕ1 as follows:
ϕ1 ≡ AG(extra = 1) =⇒ AF(intra > 0)
This CTL formula is not satisfied by the state graph of Figure 9 (where only the plane extra = 1 is drawn) because from the state (0, 0), it is not possible to increase the abstract level of intra. Let us remark that the state graph of Figure 7 does satisfy it, because each path leads to a state where intra is present.
Fig. 9. From other values of the parameters to the asynchronous state graph (kintra,{extra} = 0, kintra,{extra,intra} = 2, kintra,{extra,g} = 0, kintra,{extra,intra,g} = 2, kg,∅ = 0, kg,{intra} = 1; left: synchronous state graph, right: asynchronous state graph)
The second temporal property focuses on the production of galactosidase: when external lactose is constantly present in the environment, after a sufficiently long time, β-galactosidase will be sufficiently produced to degrade lactose, and then it will stay present forever. The translation of this property into CTL is:
ϕ2 ≡ AG(extra = 1) =⇒ AF(AG(g = 1))
This CTL formula is not satisfied by the state graphs of Figures 7 and 9: in the first case, each path leads to a state where g = 0, whereas in the second case, from the state (0, 0), it is not possible to increase the abstract level of g. Similarly, when external lactose is constantly absent from the environment, after a sufficiently long time, β-galactosidase will disappear. The CTL formula expressing this property is:
ϕ3 ≡ AG(extra = 0) =⇒ AF(g = 0)
The fourth formula we consider states that when the environment is rich in lactose, the degradation of intra-cellular lactose to produce carbon is not sufficient to entirely consume the intra-cellular lactose. In other words, the permease allows the lactose to enter sufficiently rapidly to balance the consumption of intra-cellular lactose. To express this property, we focus on the case where extra-cellular lactose, intra-cellular lactose and galactosidase are all present. In such a configuration, the intra-cellular lactose will never reach the basal level equal to 0. The CTL formula coding for this property is then written as follows:
ϕ4 ≡ (AG(extra = 1) ∧ (intra > 0) ∧ (g = 1)) =⇒ AG(intra > 0)
This CTL formula is not satisfied by the state graph of Figure 7, because from each state there is a path that leads to the total degradation of intra. The state graph of Figure 9 does not satisfy it either, because each path starting from a state where intra = 1 leads to a total degradation of intra. The two last temporal properties focus on the functionality of the lactose pathway, even when the environment is no longer rich in lactose. On one hand, when intra-cellular lactose is present at level 1 but extra-cellular lactose is absent, the pathway leads to a state where intra-cellular lactose has been entirely consumed, without passing through a state where intra is at its highest level (the only source of intra-cellular lactose is the extra-cellular one). Moreover, when this state is reached, there is no way to increase the concentration of intra-cellular lactose: its level then remains at 0.
ϕ5 ≡ AG(extra = 0) ∧ (intra = 1) =⇒ A[(intra = 1) U AG(intra = 0)]
On the other hand, when intra-cellular lactose is present at level 2 but extra-cellular lactose is absent, the pathway leads to a state where intra-cellular lactose has decreased to level 1.
ϕ6 ≡ AG(extra = 0) ∧ (intra = 2) =⇒ AF(intra = 1)
These two CTL formulas cannot be checked in Figures 7 and 9, because the formulas concern the dynamics of the system when extra is absent whereas the figures focus on the dynamics when extra is present. The formal language CTL has been developed in a computer science framework, and is therefore not dedicated to gene regulatory networks. For example, it is not possible to express in CTL that the dynamical system presents n different stable states. Nevertheless, if we know a frontier between two stable behaviours, it becomes possible to express it in CTL. Let us consider a system where, if variable a is at a level less than 2, the system converges to a stable state, and if variable a is at a level greater than or equal to 2, the system converges to another stable state. This property can be translated into the formula:
((a < 2) ⇒ AG(a < 2)) ∧ ((a ≥ 2) ⇒ AG(a ≥ 2))
Even if in some cases the translation of a property is tricky, in practice, CTL is sufficient to express the majority of biological properties useful for gene regulatory networks.
Let us now emphasize that the CTL language makes the link between the biological experiments and the models that are supposed to represent the behaviours of the studied biological system. Indeed,
– a CTL formula can be confronted with a model: one checks whether the traces of the dynamics of the model verify the temporal property expressed through the CTL formula,
– a CTL formula can also be confronted with traces observed through a wet experiment: either the wet experiment is a realisation of the temporal property coded by the CTL formula, or it accounts for a counter-example.
So, the modelling activity has to focus on how to select models that satisfy the CTL formulas representing dynamic knowledge about the system, extracted from experiments.
4 Computer Aided Elaboration of Formal Models
4.1 The Landscape
The subject of this article is to present our computer aided method to accompany the process of discovery in biology, using formal modelling in order to make valuable predictions. According to this point of view, the “ultimate model”, which would perfectly mimic the in vivo behaviour, is not our object of interest. It may be surprising for computer scientists, but this model is in fact rarely a subject of interest for biologists. Indeed the “ultimate model” would be intractable. The majority of valuable results, for a researcher in biology, comes from well chosen wet experiments, and contributions to biology “are” wet experiments. The theoretical models are only intermediate objects, which reflect intermediate hypotheses that facilitate a good choice of experiments. Consequently, a computer aided process of discovery requires these models to be managed formally. It implies a formal expression of the sensible knowledge about the biological function of interest. It also implies a formal expression of the set of (possibly successive) biological hypotheses that motivate the biological research. So, our method manages automatically the set of all possible models and we take benefit of this in order to guide a sensible choice of wet experiments. There are two kinds of knowledge:
– Structural knowledge, which inventories the set of relevant genes as well as the possible gene interactions. This knowledge can come from static analysis of gene sequences or protein sequences; it can come from dynamic data, e.g. transcriptomic data, via machine learning techniques; it can also come from the literature. This kind of knowledge can be formalized by one or several putative regulatory graphs.
– Behavioural knowledge, which reflects the dynamic properties of the biological system, such as the response to a given stress, some possible stationary states, known oscillations, etc. This kind of knowledge can be formalized by a set of temporal formulas.
They give rise to several formal objects of different nature:
– The set M of all the structurally possible regulatory networks (for each putative regulatory graph, we consider all possible values of all the parameters). So, M can be seen as the set of all possible models according to the terminology of formal logic, since each regulatory network defines a unique state graph. For example, once the decision to make permease implicit is taken, the possible regulatory graphs are drawn in Figure 6, where all possible threshold distributions are considered. Remember that this gives rise to 53512 different parameterizations (Section 2.3).
– The set Φ of the CTL formulas that formalize the dynamic properties. For example, according to Section 3.2, the set Φ can contain the 6 formulas ϕ1 to ϕ6.
– Moreover, as already mentioned, the biological research is usually motivated by a biological hypothesis, which can be formalized via a set of CTL formulas H. For example, let us consider the following hypothesis: “If extra-cellular lactose is constantly present then the positive circuit on intra-cellular lactose is functional.” It would mean that when extra-cellular lactose is constantly present, there is a multi-stationarity on intra, which is separated by its auto-induction threshold 2 (according to the notion of characteristic state [26]). It can be formalized with H = {ψ1, ψ2} as follows:
ψ1 ≡ AG(extra = 1) =⇒ (intra = 2 =⇒ AG(intra = 2))
ψ2 ≡ AG(extra = 1) =⇒ (intra < 2 =⇒ AG(intra < 2))
Of course, if “the ultimate model” were known and properly defined (i.e. a regulatory network with known values for all its parameters), it would satisfy exactly the set of behavioural properties that are true in vivo; thus M would be reduced to a singleton (the ultimate model), Φ would be useless and H would be decided by simply checking whether H is satisfied by M. The difficulty comes from the uncertainty of the model structure and parameters, the incompleteness of the behavioural knowledge and the complexity of the systems, which makes intuitive reasoning almost useless when studying hypotheses. Fortunately, once the formalization step is performed, formal logic and formal models allow us to test hypotheses, to check consistency, to elaborate more precise models incrementally, and to suggest new biological experiments. The set of potential models M and the set of properties Φ ∪ H being given, two obvious scientific questions naturally arise:
1. Is Φ ∪ H consistent with M? In other words: does there exist at least one model of M that satisfies all the formulas in Φ ∪ H? In the remainder, this question will be referred to as the consistency question of knowledge and hypotheses.
2. And if so, does Φ ∪ H hold in vivo? As a matter of fact, the existence of a mathematical model satisfying the hypotheses is not sufficient. We must verify that the model reflecting the real in vivo behaviour belongs to the set of models that satisfy Φ ∪ H. This implies proposing experiments in order to validate or refute H (assuming that the knowledge Φ is validated).
For both questions, we can take benefit of computer aided proofs, and computer optimized validation schemas can be proposed. More precisely, a CTL property can be confronted with traces, and traces can be either generated by wet experiments or extracted from a state graph. Consequently, a logic such as CTL establishes a bridge between in vivo experiments and mathematical models.
4.2 Consistency
In practice, when actually working with researchers in biology, there is an obvious method to check consistency:
1. Draw all the sensible regulatory graphs according to biological knowledge, with all the sensible possible threshold allocations. This formalizes the structural knowledge.
2. From the discussions with the biologists, express in CTL the known behavioural properties as well as the considered biological hypotheses. This defines Φ and H.
3. Then, automatically generate, for each possible regulatory graph, all the possible values for all Thomas’ parameters: we get all the possible regulatory networks. For all of them, generate the corresponding state graph: this defines M. Our software platform SMBioNet handles this automatically.
4. Check each of these models against Φ ∪ H. SMBioNet intensively uses the model checker NuSMV [5] to perform this step automatically.
If no model survives the fourth step, then reconsider the hypotheses and perhaps extend the model schemas. If, on the contrary, at least one model survives, then the biological hypotheses are consistent. Even better: the possible parameter values kv,ω have been exhaustively identified. If we consider for example the set of models M characterized by Figure 6-A, there are 19 parameter settings leading to a dynamics compatible with the set of properties Φ ∪ H proposed above. Among the 8+2 = 10 parameters that govern the dynamics, 6 of them are completely identified (i.e., shared by the 19 parameter settings): kintra,∅ = 0, kintra,{g} = 0, kintra,{extra} = 1, kintra,{extra,g} = 1, kg,∅ = 0 and kg,{intra} = 1. The 4 other parameters are those where intra is a resource of itself. With respect to the classical ODE simulation method, where some possible parameters are identified by trial and error, this method has the obvious advantage of computing the exhaustive set of possible models according to the current biological knowledge. It also has the crucial advantage of facilitating the refutation of models in a systematic manner.
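The enumeration behind steps 3 and 4 can be pictured with a deliberately naive sketch (ours): it enumerates, in the plane extra = 1 of Figure 6-A only, the parameters that matter there, rebuilds the asynchronous dynamics for each choice, and keeps the choices satisfying a single stand-in property (every path eventually reaches intra > 0); SMBioNet of course checks the complete set Φ ∪ H with NuSMV instead.

```python
# Naive "enumerate and refute" loop restricted to the plane extra = 1 of Figure 6-A.
from itertools import product

STATES = [(i, g) for i in range(3) for g in range(2)]            # (intra, g)

def omega_intra(i, g):        # resources of intra when extra = 1 (thresholds of Fig. 6-A)
    return ("extra",) + (("intra",) if i >= 2 else ()) + (("g",) if g < 1 else ())

def step(x, k):
    return x + 1 if k > x else x - 1 if k < x else x

def successors(state, ki, kg):
    i, g = state
    ti, tg = ki[omega_intra(i, g)], kg[i >= 1]                    # attracting levels
    succs = [s for s in ((step(i, ti), g), (i, step(g, tg))) if s != state]
    return succs or [state]

def every_path_reaches_intra_pos(ki, kg):                         # AF(intra > 0) at every state
    good = {s for s in STATES if s[0] > 0}
    while True:
        new = good | {s for s in STATES if all(t in good for t in successors(s, ki, kg))}
        if new == good:
            return good == set(STATES)
        good = new

OMEGAS = [omega_intra(i, g) for i, g in ((0, 1), (2, 1), (0, 0), (2, 0))]  # the 4 resource sets
kept = 0
for vals in product(range(3), repeat=4):                          # the k_intra values, in 0..2
    ki = dict(zip(OMEGAS, vals))
    for kg0 in range(2):                                          # the k_g values, in 0..1
        for kg1 in range(2):
            kept += every_path_reaches_intra_pos(ki, {False: kg0, True: kg1})
print(kept, "of", 3 ** 4 * 2 ** 2, "restricted parameterizations pass this single test")
```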
The four steps described above, as such, replace a “brute force simulation method” by a “brute force refutation method” based on formal logic. In fact, the method is considerably more sophisticated, in order to avoid the brute force enumeration of all the possible models. For example, in SMBioNet, when several regulatory networks share the same state graph, only one of them is considered; even better, it is not necessary to generate common state graphs as they can be identified a priori. In [9,17], logic programming and constraint solving are integrated into this method and they almost entirely avoid the enumeration of models in order to establish consistency. In [23], model checking is replaced by proof techniques based on products of automata. Moreover, in practice, the use of multiplexes considerably reduces the number of different networks to be considered. Anyway, all these clever approaches considerably improve the algorithmic treatment, but the global method remains the same.
4.3 Selection of Biological Experiments
Once the first question (consistency) is positively answered, the second question (validation) has to be addressed: we aim at proposing “wet” experiment plans in order to validate or refute H (assuming that the knowledge Φ is validated). Here, we address this question from a point of view entirely based on formal logic.
The Global Shape. Our framework is based on CTL, whose atoms allow the comparison of the discrete expression level of a gene with a given integer value. So, a regulatory graph being given, we can consider the set CTLsyntax of all the formulas that can be written about the regulatory graph, according to the CTL language. Notice that this set is not restricted to valid formulas: for example if ϕ belongs to CTLsyntax then its negation ¬ϕ also belongs to CTLsyntax.
– The set of hypotheses H is a subset of CTLsyntax. In Figure 10, we delimit the set CTLsyntax with a bold black line and H is drawn in a circle.
– Unfortunately, there usually does not exist a single feasible wet experiment that can decide whether H is valid or not in vivo (except for trivial hypotheses, which do not deserve a research campaign). A given wet experiment reveals a small set of CTL properties, which are usually elementary properties. So, feasible wet experiments define a subset of CTLsyntax: the set of all properties that can be decided, without any ambiguity, from a unique feasible wet experiment. Such properties are often called observable properties and we note Obs this subset of CTLsyntax. In Figure 10, Obs is represented by the vertically hatched set.
– H is usually disjoint from Obs; however, we can consider the set ThΦ(H) of all the consequences of H (assuming the knowledge Φ). In Figure 10, ThΦ(H) is represented by the horizontally hatched set.
– Let E = ThΦ(H) ∩ Obs be the intersection of ThΦ(H) and Obs. It denotes the set of all the consequences of the hypotheses that can be verified experimentally. If ψ is a formula belonging to E then there exists a wet experiment e after which the validity of ψ is decided without any ambiguity:
• If the experiment e “fails”, then ψ is false in vivo and the hypothesis H is refuted.
• If, on the contrary, the experiment e “succeeds”, then ψ is true in vivo. Of course, this usually does not imply H.
In Figure 10, the intersection E defines the set of relevant experiments with respect to the hypothesis H.
Fig. 10. Sets of CTL formulas involved in the computer aided selection of biological experiments (CTLsyntax, H, ThΦ(H), Obs and E = ThΦ(H) ∩ Obs)
Observability and Refutability. According to Popper [20], we should only consider hypotheses H for which we can propose experiments able to refute H: if H is false in vivo, there must exist a wet experiment e that fails. In other words: ¬H =⇒ (∃ψ ∈ E | ¬ψ), which is equivalent to: E =⇒ H. Consequently, refutability depends on the “power” of Obs with respect to H, because E = ThΦ(H) ∩ Obs. If Obs is “big enough” then it increases the capability of E to refute H. If E does not imply H (assuming Φ), it may be that experimental capabilities in biology are insufficient, so that H is out of the scope of current know-how and current knowledge. It may also be possible that the wet laboratory does not have enough funding, or that the experimental cost would be disproportionate with regard to the problem under consideration. We have shown so far that, for the modelling process, properties (Φ and H) are as important as models (M). Refutability issues prove that “experimental observability” (Obs) constitutes the third support of the process. Often, observable formulas are of the form (ρ =⇒ AF(ω)) or (ρ =⇒ EF(ω)), where ρ characterizes some initial states that biologists can impose to a population of cells at the beginning of the experiment, and ω is deducible without any ambiguity from what can be observed at the end of the experiment. We use AF when all repeated experiments give the same result, and EF when we suspect that some additional conditions are imposed by the chosen experimental protocol during the experiments. According to our small running example, we may consider that only external lactose can be controlled, and that the flux of entering lactose can be roughly
estimated. So, ρ can be of the form extra = 0 or extra = 1, possibly prefixed by a modality such as AG. Moreover, if the flux is considered “high”, it denotes the presence of many permease proteins, and consequently implies that intra has reached the threshold 2 (according to Figure 6-A). So, we may imagine for example that ω can be of the form intra = 2 or its negation intra < 2, possibly prefixed by modalities such as AG or AU. This defines Obs. We will see later on that this observability is a rough underestimation, and how formal proofs can help to improve it.
Selection of Experimental Schemas. Unfortunately, E is infinite in general, so the art of choosing “good” wet experiments can be formalized by heuristics to select a finite (small) subset of formulas in E that has “good” chances to refute H if one of the corresponding experiments fails. Classical testing frameworks from computer science [3,12] aim at selecting such subsets. However, the subsets selected by the corresponding software testing tools are always huge, because running lots of tests on a computer costs almost nothing. Nevertheless, the main idea of these frameworks can still be suitably applied to regulatory networks. Tests are selected incrementally and completeness in the limit is the main preoccupation: if H is not valid then the incremental selection process must be able to provide a counter-example after a certain number of iterations. It formally means that each possible reason for H to be false is tested after a finite (possibly large) amount of selection time. Let us illustrate this completeness criterion on a simple case: according to our example, H is made of 2 formulas ψ1 and ψ2. Spending a lot of money to refute only ψ1 would be a bad idea: this strategy would be incomplete, because if H is false because ψ2 is false, then this strategy will never refute H. Thus, one must try to refute both formulas.
Refutation of ψ1: AG(extra = 1) =⇒ (intra = 2 =⇒ AG(intra = 2)). It is well known that the truth table of “=⇒” is always true when the precondition is false. Consequently, any wet experiment that does not ensure AG(extra = 1) has no chance to refute the hypothesis. So, Popper tells us that any experiment associated with ψ1 must constantly have external lactose (and recall that, from the context, glucose is always absent). For the same reason, one must start with a population of bacteria with intra = 2 as initial state, and unfortunately the precondition ρ ≡ AG(extra = 1) ∧ (intra = 2) is not reachable according to our description of observable formulas. Moreover, our knowledge (ϕ1 to ϕ6) never concludes on the atom intra = 2, so CTL cannot propose a sufficient condition to reach this initial state. Let us postpone this problem for a short time.
Refutation of ψ2: AG(extra = 1) =⇒ (intra < 2 =⇒ AG(intra < 2)). The same reasoning applies: one must start with a population of bacteria with intra < 2 as initial state, and ensure AG(extra = 1). Here, the formal definition
of “<” makes it possible to transform ψ2 into the conjunction of two formulas, using classical unfolding techniques:
ψ3 ≡ AG(extra = 1) =⇒ (intra = 0 =⇒ AG(intra < 2))
ψ4 ≡ AG(extra = 1) =⇒ (intra = 1 =⇒ AG(intra < 2))
From now on, we have to refute 3 hypotheses (ψ1, ψ3 and ψ4) and again, the completeness of the method requires treating all of them. More precisely, the completeness of the unfolding technique, commonly used in Prolog for example, ensures the completeness of our method to select wet experiments: in practice, after several unfoldings, the set of formulas under consideration gives a complete panel of the qualitatively different cases that deserve to be experimented in the wet laboratory. Unfolding steps make the cases more and more precise. This example gives the 3 obvious cases; for more elaborate examples, the exhaustive inventory of the relevant cases is far less obvious for a human, and the technique has proved useful. On this example, we would have to ask the biologists whether they can increase their experimental capabilities in order to control the value of intra at the initial state of the experiments. Proof techniques can help. For example, ψ3 needs intra = 0 and there are formulas in Φ that conclude on intra = 0. In fact, it is not difficult to establish that Φ implies ϕ7:
ϕ7 ≡ AG(extra = 0) =⇒ AF(AG(intra = 0))
The formula ϕ7 indicates that the atom intra = 0 can be included in the preconditions ρ of observable formulas. This extension of Obs suggests the following experimental protocol: in order to ensure intra = 0, keep the cell population in an environment without external lactose for a sufficiently long time, and then put external lactose in order to check ψ3. By the way, this ψ3-experiment gives, after a sufficiently long time in the Petri dish, a high flux of entering lactose, which shows that intra = 2. This refutes the hypothesis H in vivo, although it was consistent.
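The case splitting performed by unfolding can be mechanized in an elementary way; the toy sketch below (our names) expands an atom of the form v < n over the finite domain of v, which is exactly how ψ2 yields ψ3 and ψ4.

```python
# Toy unfolding of a hypothesis over the finite domain of a discrete variable.
def unfold_less_than(precondition, var, bound, conclusion):
    """Split 'var < bound' into the exhaustive cases var = 0, ..., bound-1."""
    return [f"{precondition} => ({var} = {value} => {conclusion})" for value in range(bound)]

for case in unfold_less_than("AG(extra = 1)", "intra", 2, "AG(intra < 2)"):
    print(case)
# AG(extra = 1) => (intra = 0 => AG(intra < 2))    (psi_3)
# AG(extra = 1) => (intra = 1 => AG(intra < 2))    (psi_4)
```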
5 Conclusion
We have shown that formal methods from computer science can help the discovery process in molecular biology. More precisely, we have proposed a formal modelling method for gene regulatory networks, based on the discrete approach of René Thomas, that we have enriched with CTL and model checking. We have also shown that, even if simulations are useful and easy to perform in this setting, the main subject to address is the logical identification of parameters. During the modelling process, we must face incomplete knowledge and, consequently, we must manage in parallel: – a (possibly large) set of different potential models and parameter values, – a set of known behavioural properties (whose consistency has to be managed),
– a set of biological hypotheses, that motivate the biological research (and whose consistency has to be managed), – and a set of conceivable wet experiments (which sometimes appears to be insufficient to identify the parameters or to refute some potential models). Our framework is not only a method to identify the “good” model(s); it is rather a full modelling process to accompany the discovery process in molecular biology, from the elaboration of abstract models to the design of well chosen wet experiments. At the beginning of the process, the hypotheses of interest being expressed, we can simplify the models under consideration provided that the simplifications do not modify the truth of the hypotheses. This simple idea makes it possible to efficiently reduce the number of nodes in the regulatory graph. Consistency checking is the next step of our method and the formalization of biological properties into temporal logic is a key point. It facilitates the use of tools from formal logic, e.g. model checking or constraint solving, in order to establish the consistency of knowledge and hypotheses. Lastly, a formal definition of the so-called observable properties helps to suggest wet experiments in order to validate or refute the hypotheses in vivo. If the hypotheses are consistent, then formal manipulations of the syntax of the formulas, such as unfolding techniques and theorem proving, can produce observable consequences that describe wet experiment schemas. All in all, formal models are not “universal” in biology: they are only temporary intermediates that serve to validate or refute biological hypotheses.
References
1. Ahmad, J., Bourdon, J., Eveillard, D., Fromentin, J., Roux, O., Sinoquet, C.: Temporal constraints of a gene regulatory network: refining a qualitative simulation. Biosystems 98(3), 149–159 (2009)
2. Bernot, G., Comet, J.-P., Khalis, Z.: Gene regulatory networks with multiplexes. In: European Simulation and Modelling Conference Proceedings, France, October 27-29, pp. 423–432 (2008) ISBN 978-90-77381-44-1
3. Bernot, G., Gaudel, M.C., Marre, B.: Software testing based on formal specifications: A theory and a tool. Software Engineering Journal 6(6), 387–405 (1991)
4. Cardelli, L., Caron, E., Gardner, P., Kahramanogulları, O., Phillips, A.: A process model of Rho GTP-binding proteins. Theoretical Computer Science 410(33-34), 3166–3185 (2009)
5. Cimatti, A., Clarke, E., Giunchiglia, E., Giunchiglia, F., Pistore, M., Roveri, M., Sebastiani, R., Tacchella, A.: NuSMV Version 2: An OpenSource Tool for Symbolic Model Checking. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, p. 359. Springer, Heidelberg (2002)
6. Collado-Vides, J., Magasanik, B., Smith, T.: Integrative approaches to molecular biology. The MIT Press, Cambridge (1996)
7. Curti, M., Degano, P., Priami, C., Baldari, C.T.: Modelling biochemical pathways through enhanced π-calculus. Theoretical Computer Science 325(1), 111–140 (2004)
8. de Jong, H.: Qualitative modeling and simulation of bacterial regulatory networks. In: Heiner, M., Uhrmacher, A. (eds.) CMSB 2008. LNCS (LNBI), vol. 5307, p. 1. Springer, Heidelberg (2008)
9. Fanchon, E., Corblin, F., Trilling, L., Hermant, B., Gulino, D.: Modeling the molecular network controlling adhesion between human endothelial cells: Inference and simulation using constraint logic programming. In: Danos, V., Schachter, V. (eds.) CMSB 2004. LNCS (LNBI), vol. 3082, pp. 104–118. Springer, Heidelberg (2005)
10. Gouzé, J.-L.: Positive and negative circuits in dynamical systems. Journal of Biological Systems 6, 11–15 (1998)
11. Jacob, F., Monod, J.: Genetic regulatory mechanisms in the synthesis of proteins. Journal of Molecular Biology 3, 318–356 (1961)
12. Jard, C., Jéron, T.: TGV: theory, principles and algorithms. A tool for the automatic synthesis of conformance test cases for non-deterministic reactive systems. Software Tools for Technology Transfer 7(4), 297–315 (2005)
13. Kügler, P., Gaubitzer, E., Müller, S.: Parameter identification for chemical reaction systems using sparsity enforcing regularization: A case study for the chlorite-iodide reaction. Journal of Physical Chemistry A 113(12), 2775–2785 (2009)
14. Khalis, Z., Bernot, G., Comet, J.-P.: Gene Regulatory Networks: Introduction of multiplexes into R. Thomas’ modelling. In: Proc. of the Nice Spring School on Modelling and Simulation of Biological Processes in the Context of Genomics, EDP Sciences, pp. 139–151 (2009) ISBN 978-2-7598-0437-5
15. Khalis, Z., Comet, J.-P., Richard, A., Bernot, G.: The SMBioNet method for discovering models of gene regulatory networks. Genes, Genomes and Genomics (2009)
16. Little, J.W.: Threshold effects in gene regulation: When some is not enough. PNAS 102(15), 5310–5311 (2005)
17. Mateus, D., Gallois, J.-P., Comet, J.-P., Le Gall, P.: Symbolic modeling of genetic regulatory networks. Journal of Bioinformatics and Computational Biology 5(2B), 627–640 (2007)
18. Naldi, A., Remy, E., Thieffry, D., Chaouiya, C.: A reduction of logical regulatory graphs preserving essential dynamical properties. In: Degano, P., Gorrieri, R. (eds.) CMSB 2009. LNCS (LNBI), vol. 5688, pp. 266–280. Springer, Heidelberg (2009)
19. Plahte, E., Mestl, T., Omholt, S.W.: Feedback loops, stability and multistationarity in dynamical systems. Journal of Biological Systems 3, 409–413 (1995)
20. Popper, K.R.: Conjectures and refutations: the growth of scientific knowledge. Routledge & Kegan Paul, London (1965)
21. Richard, A., Comet, J.-P.: Necessary conditions for multistationarity in discrete dynamical systems. Discrete Applied Mathematics 155(18), 2403–2413 (2007)
22. Rizk, A., Batt, G., Fages, F., Soliman, S.: A general computational method for robustness analysis with applications to synthetic gene networks. Bioinformatics 25(12), i169–i178 (2009)
23. Siebert, H., Bockmayr, A.: Temporal constraints in the logical analysis of regulatory networks. Theoretical Computer Science 391(3), 258–275 (2008)
24. Snoussi, E.H.: Necessary conditions for multistationarity and stable periodicity. Journal of Biological Systems 6, 3–9 (1998)
25. Snoussi, E.H.: Qualitative dynamics of piecewise-linear differential equations: a discrete mapping approach. Dynamics and Stability of Systems 4, 189–207 (1989)
26. Snoussi, E.H., Thomas, R.: Logical identification of all steady states: the concept of feedback loop characteristic states. Bull. Math. Biol. 55(5), 973–991 (1993)
27. Soulé, C.: Graphical requirements for multistationarity. ComPlexUs 1, 123–133 (2003)
28. Thomas, R.: On the relation between the logical structure of systems and their ability to generate multiple steady states and sustained oscillations. In: Series in Synergetics, vol. 9, pp. 180–193. Springer, Heidelberg (1981)
29. Thomas, R., d’Ari, R.: Biological Feedback. CRC Press, Boca Raton (1990)
Predicting Protein-Protein Interactions with K-Nearest Neighbors Classification Algorithm
Mario R. Guarracino and Adriano Nebbia
High Performance Computing and Networking Institute (ICAR-CNR), National Research Council, Italy, Via P. Castellino, 111 - 80131 Napoli (IT)
[email protected], [email protected]
Abstract. In this work we address the problem of predicting protein-protein interactions. Its solution can give greater insight into the study of complex diseases, like cancer, and provides valuable information in the study of active small molecules for new drugs, limiting the number of molecules to be tested in the laboratory. We model the problem as a binary classification task, using a suitable coding of the amino acid sequences. We apply the k-Nearest Neighbors classification algorithm to the classes of interacting and noninteracting proteins. Results show that it is possible to achieve high prediction accuracy in cross validation. A case study is analyzed to show that it is possible to reconstruct a real network of thousands of interacting proteins with high accuracy on standard hardware.
Keywords: Protein-protein interaction prediction, conjoint-triad method, k-Nearest Neighbors, binary classification.
1 Introduction
Proteins are the main components of living organisms and take part in every process within a cell. They are composed of amino acids arranged in a linear sequence of variable length. In general, protein sequences are composed of 20 different amino acids, except for some organisms having two more amino acids. The sequence of amino acids is defined by the nucleotide sequence of a gene and is determined by its genetic code. The longest known human proteins are the titins, whose length is around 27,000 amino acids. Protein-protein interactions (PPIs) take place in many biological processes and many diseases. The study of PPIs started much earlier than the advent of proteomics. A notable example is the work of Ruhlmann and colleagues in the '70s and Amit and colleagues in the '80s on antigen-antibody and protease inhibitor complexes, which has given insight into protein interfaces and their properties. At that time, although it was not clear to which extent and degree proteins would interact, it was clear that many biological processes rely on their interaction, and that protein functions may be regulated by other proteins. Therefore, the capability to predict interactions could be exploited to block these interactions with a drug. Such capability would give the chance to direct research towards highly probable predicted targets.
Proteins interact with other proteins when one binds to another. There are different types of interactions, which can be either permanent or transient. In the former case, proteins form a stable protein complex, as in the case of ATPases. The latter interactions take place in certain cellular states to carry out a particular biomolecular function, as in signal transduction cascades [1]. Nevertheless, many PPIs do not fall into distinct types. For example, an interaction may be mainly transient in vivo but become permanent under certain cellular conditions [2]. In the following, we will not make any difference among different interaction types. The problem of predicting PPIs is intrinsically difficult, due to the imbalance between the number of interacting and noninteracting pairs. Consider the case of Saccharomyces cerevisiae (yeast), for which approximately 6,300 proteins are supposed to exist [3]. Therefore, all possible interacting pairs are in the order of 6,300² ≈ 39,000,000. At the time we write, there are 18,440 yeast PPIs available in the DIP database [4], and there are estimates of up to 30,000 interacting pairs. Even if we trust a much larger estimate of 390,000 interacting pairs, we still have that the pairs of interacting proteins only represent 0.01% of all possible pairs. In other words, if we classify all possible pairs of proteins as noninteracting, we will be 99.99% accurate! Nevertheless, the capability of predicting interactions is so important that over the years many efforts have been devoted to devising methods to predict PPIs [5–8] and it still represents a very active research area. The deluge of PPIs published in scientific journals is collected in publicly available databases [9]. Such databases contain information manually curated and inserted by experts, and are usually accessible from a web interface. The user can query protein pairs to know whether they are known to interact. There is a large number of public databases containing PPI data. Some only contain information about a single organism, whereas others address multiple organisms. It is outside the scope of the present work to give a detailed description of existing databases, but a short description of some of the major ones can give a better understanding of such resources. The Munich Information Center for Protein Sequences (MIPS) provides manually curated information about PPIs, protein complexes, pathways, etc., for several organisms [10]. The Database of Interacting Proteins (DIP) is a curated database containing information about experimentally determined PPIs. It comprises about 63,000 unique interactions among 20,700 proteins in more than 270 organisms including yeast and human [4]. However, it still maintains only 2,292 human PPIs, compared to the 18,440 interactions in S. cerevisiae, 22,881 for D. melanogaster, and 4,043 for C. elegans. The General Repository for Interaction Datasets (GRID) contains interactions from several high throughput studies, as well as the interactions from MIPS and other databases [11]. A Molecular INTeraction database (MINT) contains about 83,000 interactions curated manually from the literature [12]. The Online Predicted Human Interaction Database (OPHID), now the Interologous Interaction Database (I2D), stores predicted and known human PPIs [13]. I2D comprised over 60,000 known human PPIs obtained from various databases and additional 35,000 predicted interactions from model organisms. Predictions are available from S. cerevisiae, C. elegans, D. melanogaster, and M. musculus.
have been extensively investigated to provide biological evidence for their support. Finally, the Human Protein Reference Database (HPRD) is a large-scale effort to manually store many known human PPIs related to diseases. Currently, HPRD comprises 25,661 human protein interactions derived from literature sources [14].

In this work we address the problem of predicting PPIs with machine learning methods applied only to sequence information. This problem has been addressed by Shen and colleagues [15]. Many other researchers have also tackled the same problem, obtaining less accurate results [16,17]. All these works only use sequence information, on the assumption that the sequence specifies the structure. The approach of Shen and colleagues is based on a coding of the proteins relying on the count of the amino acids in the protein sequence. In this way, proteins composed of different numbers of amino acids are represented by vectors of the same size. As will be detailed in the next sections, the 20 amino acids are clustered into 7 classes, and the count of all possible class triples is recorded. The number of possible triples of elements drawn from the seven classes is equal to 7 × 7 × 7 = 343. The problem is modeled as a binary classification task, in which the positive class (gold standard positive set - GSP) is composed of couples of proteins from HPRD that are known to interact in vitro or in vivo, and the negative one (gold standard negative - GSN) is composed of random couples that are not known to interact. Each class element is obtained by pairing the two protein vectors. In this setting, the classification model is obtained in a space with 686 dimensions. Shen and colleagues report a mean classification accuracy, obtained on a dataset composed of 32,486 pairs and a test set of 400 protein pairs, of 83.90% ± 1.29, using Support Vector Machines (SVMs) with various kernels on five hold-outs. In the same paper, the authors try to reconstruct three different networks. It has to be noted that, although all interacting protein pairs of the three networks were contained in HPRD, and therefore in the training set, kernel SVMs could predict a large part of the interactions, but not all of them.

In our work we could not completely reproduce the experiments of [15], due to the computational complexity and memory footprint of the classification problem. For this reason, we directed our attention to alternative classification methods. Furthermore, we decided to use an instance-based classification method, which provides exact classification for all pairs in the training set, without the side effect of kernel methods of overfitting the problem. The contribution of this work consists in showing that it is possible to: 1. gather a GSN dataset to increase prediction accuracy, and 2. obtain accurate results using a classification method with low computational complexity, namely k-Nearest Neighbors (k-NN) [18], with an appropriate choice of the metric and of k. In this new setting, a 3-fold cross validation accuracy of 94.29% has been obtained on a dataset of 33,300 protein pairs, with an execution time in the order of tens of minutes on standard hardware.
The rest of the paper is organized as follows. In the next section, classification methods are described and compared. Then, in section 3, the data preparation is detailed. In section 4, classification results are discussed and, in section 5, a case study is analyzed. Finally, in section 6, conclusions are drawn and future work is proposed.
2 Methods
In the case of two linearly separable classes, the SVM classification algorithm [19] finds a hyperplane that separates the elements of the training set belonging to the two different classes. The separating hyperplane is usually chosen to maximize the margin between the two classes. The margin can be defined as the maximum distance between two parallel hyperplanes wᵀx − b = ±1 that leave all cases of the two classes on different sides. The classification hyperplane is wᵀx − b = 0, which is in the middle of, and parallel to, the planes maximizing the margin. The points that are closest to the hyperplane are called support vectors, and are the only points needed to train the classifier. SVMs represent the state of the art in supervised learning and they have been successfully applied to solve many scientific and technological problems [20]. Their computational complexity in the training phase is O(m^3 × d) for the quadratic programming problem, and O(m^2.1) when Sequential Minimal Optimization [21] is used, where m is the size of the training set and d the number of features. The prediction phase complexity for n instances is O(n × d).

Unfortunately, the number of protein pairs we want to deal with is in the order of 30,000, with 686 features, which cannot be handled by standard hardware with acceptable execution times. Another problem is that the method does not usually reproduce the original class labels of the points in the training set. That can be acceptable in standard machine learning problems, when data are affected by errors and a classification model exactly reproducing all training set class labels would overfit the noisy data. In the present case, the data represent frequencies of amino acids, which are exactly known. If a pair of interacting proteins belongs to the GSP dataset used for training, it should be recognized by the predictive method, and not misclassified. When selecting the negative dataset, particular attention is needed, since it is essential to the reliability of the prediction model. It is difficult to generate such a dataset, because of the limited information about proteins which are really non-interacting.

For these reasons we turned our attention to the k-Nearest Neighbors algorithm. The key idea of the algorithm is to classify a new point in the most frequent class of its closest k neighbors in the training set. This is a majority vote on the class labels of the test point's neighbors. When k = 1, the point is simply assigned to the same class as its closest neighbor. For every choice of k, the classification model is not explicitly computed; every time a new point has to be assigned to a class, the k closest points are detected, and
their most frequent class is voted. To measure the distance between two points, different distance functions can be used. As will be discussed later, we decided to use the cityblock distance, which is induced by the Minkowski metric:

d(a, b) = ( Σ_{i=1}^{d} |a_i − b_i|^l )^{1/l}   (1)
with l = 1. The cityblock distance has a lower computational complexity than the Euclidean metric, and it is less affected by differences in individual features. The k-NN computational complexity for training is O(1), since no computation is needed. For the prediction phase, with k = 1, it is O(n × d × m), where n is the test set size, m the training set size and d the number of features.
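To make the procedure concrete, the following minimal sketch implements k-NN prediction with the cityblock (l = 1) metric. It is an illustrative Python/NumPy reconstruction, not the authors' Matlab code, and all function and variable names are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Classify each test vector by majority vote among its k nearest
    training vectors, using the cityblock (Minkowski l = 1) distance."""
    predictions = []
    for x in X_test:
        # L1 distance from x to every training point (eq. (1) with l = 1)
        dists = np.abs(X_train - x).sum(axis=1)
        nearest = np.argsort(dists)[:k]          # indices of the k closest points
        votes = y_train[nearest]
        labels, counts = np.unique(votes, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# toy usage: 686-dimensional pair vectors, labels 1 = interacting, 0 = not
rng = np.random.default_rng(0)
X_train = rng.random((100, 686)); y_train = rng.integers(0, 2, 100)
X_test = rng.random((5, 686))
print(knn_predict(X_train, y_train, X_test, k=1))
```

With k = 1 the loop simply returns the label of the closest training pair, consistent with the O(n × d × m) prediction cost discussed above.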
3 Dataset Preparation
To predict PPIs using only sequence information, one of the major challenges is an efficient representation of the protein amino acid sequence. One of the most common solutions is based on counting the occurrences of the amino acids. It has been shown by Costantini and Facchiano [22] that, in protein structural class prediction tasks, in order to preserve sequence information, the frequency of groups of t adjacent amino acids needs to be considered. For t = 1, the representation of a protein is in a space of dimension 20, and each component represents the relative count of one amino acid. For t = 2 the vector x representing the protein is of size 20 × 20 = 400, which is the total number of pairs that can be obtained with 20 amino acids. Larger tuples of amino acids produce higher dimensional vector spaces, without increasing the descriptive capabilities of the vector. Costantini and Facchiano found that the best structural classification of proteins is obtained with t = 3. In that work, the vector space has dimension 8,000, while in [15] Shen and colleagues prefer to use conjoint triads of amino acids. These triplets are composed by grouping the amino acids into seven classes: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, {C}. This type of grouping is based on the electrostatic and hydrophobic interactions that take place on the side chains of amino acids. Amino acids belonging to the same class are supposed to have similar roles in cellular processes. The criteria of this classification are highlighted in Table 1. With this grouping, the dimension of the vector space is reduced from 20³ to 7³, which makes the problem computationally tractable. For each protein i, the vector x_i with the counts of all 343 possible triplets is computed. Then, the normalized vector v_i is obtained:

v_i = (x_i − min(x_i)) / (max(x_i) − min(x_i))   (2)
The normalization is used to avoid longer protein sequences having larger frequency values.
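A minimal sketch of the conjoint triad encoding and of the normalization of equation (2), assuming plain amino acid strings as input, is given below. The class grouping follows the seven classes listed above; the function names are hypothetical and the code is an illustration, not the authors' implementation.

```python
import numpy as np

# seven amino acid classes used for the conjoint triad encoding
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA2CLASS = {aa: c for c, group in enumerate(CLASSES) for aa in group}

def triad_vector(sequence):
    """Count the 7*7*7 = 343 class triads along the sequence and
    min-max normalize the counts (equation (2))."""
    x = np.zeros(343)
    codes = [AA2CLASS[aa] for aa in sequence if aa in AA2CLASS]
    for i in range(len(codes) - 2):
        a, b, c = codes[i], codes[i + 1], codes[i + 2]
        x[a * 49 + b * 7 + c] += 1
    return (x - x.min()) / (x.max() - x.min())   # normalized vector v_i

def pair_vector(seq_i, seq_j):
    """Concatenate the two protein vectors into one 686-dimensional example."""
    return np.concatenate([triad_vector(seq_i), triad_vector(seq_j)])

print(pair_vector("MKTAYIAKQR", "GAVLIWWCDE").shape)   # (686,)
```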
Table 1. (a) Dipole scale (Debye): −, Dipole < 1.0; +, 1.0 < Dipole < 2.0; ++, 2.0 < Dipole < 3.0; +++, Dipole > 3.0; ++++, Dipole > 3.0 with opposite orientation. (b) Volume scale (Å³): −, Volume < 50; +, Volume > 50. (c) Cys is separated from class 3 because of its ability to form disulfide bonds.

No.  Dipole scale (a)  Volume scale (b)  Class
1    −                 −                 Ala, Gly, Val
2    −                 +                 Ile, Leu, Phe, Pro
3    +                 +                 Tyr, Met, Thr, Ser
4    ++                +                 His, Asn, Gln, Trp
5    +++               +                 Arg, Lys
6    ++++              +                 Asp, Glu
7    + (c)             +                 Cys
To prepare the positive class of the training set we use HPRD, release 090107. This database release contains 38,167 protein interactions between 25,661 proteins from the human proteome. Each interaction has been manually extracted from the literature by experts. We take a set of 16,650 pairs at random and, for each protein sequence in a pair (i, j), we use equation (2) to compute the vectors v_i and v_j, and their concatenation composed of 686 components. For the negative GSN, we extract at random the pairs (l, m) and (k, s) from the HPRD database, and we check that the pair (l, s) is not present in HPRD. This does not mean that the two proteins do not interact, but the choice is motivated by the fact that, if the two proteins are both listed in the database and there is no evidence of their interaction, they can be supposed to form a non-interacting pair. In a way similar to [23], we compute the Pearson coefficient of the corresponding frequency vectors:

ρ_{l,s} = σ_{l,s} / (σ_l σ_s),

where σ_{l,s} is the covariance between l and s, and σ_l and σ_s are the standard deviations of the vectors v_l and v_s. We accept the pair as a non-interacting one if the absolute value of the Pearson coefficient is less than 0.3, in accordance with [24]. The process ends when 16,650 non-interacting pairs have been computed. The total number of unique proteins is 7,652 in the negative class and 10,780 in the positive one. The number of unique proteins present in both classes is 155. In Fig. 1 the sample distribution of the interactions in HPRD (left) and of the positive interactions in the training set (right) is reported. In both cases, there is a large number of proteins with few interactions and a few proteins with a large number of links. This is usually accepted as representative of the behavior of proteins in cells, whose interactions form a scale-free network, in which only a few proteins take part in many different processes.
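The construction of the negative set can be sketched as follows. This is an illustrative, simplified Python variant that samples protein pairs directly rather than recombining existing HPRD pairs as described above; `vectors` is assumed to map each protein to its normalized triad vector, `known_pairs` to contain the interactions listed in HPRD, and all names are hypothetical.

```python
import random
import numpy as np

def build_negative_set(vectors, known_pairs, n_pairs=16650, threshold=0.3, seed=42):
    """Randomly pair proteins, discard pairs already listed as interacting,
    and keep a pair only if |Pearson correlation| of its triad vectors < threshold."""
    rng = random.Random(seed)
    proteins = list(vectors)
    negatives = []
    while len(negatives) < n_pairs:
        l, s = rng.sample(proteins, 2)
        if (l, s) in known_pairs or (s, l) in known_pairs:
            continue
        rho = np.corrcoef(vectors[l], vectors[s])[0, 1]   # Pearson coefficient
        if abs(rho) < threshold:
            negatives.append((l, s))
    return negatives
```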
Fig. 1. Histograms of protein interactions: (a) HPRD; (b) interacting proteins in the training set. Both panels report the number of proteins with k interactions as a function of the number of interactions k.
4 Computational Experiments
All macros have been implemented with Matlab 7.3.0. Results are calculated using an Intel Xeon CPU 3.20 GHz, 6 GB RAM running Red Hat Enterprise Linux WS release 3. We use the Matlab implementation of the k-NN classifier. Tables 2, 3 and 4 report the results of a 3-fold cross validation for the dataset described in the previous section. Accuracy refers to the mean classification accuracy on the three folds. Sensitivity and specificity are defined as:

Sensitivity = TP / (TP + FN),    Specificity = TN / (TN + FP),

where TP is the number of true positives, TN the number of true negatives, FN the number of false negatives, and FP the number of false positives. The standard deviation of the accuracy on the three folds is also given. The distances used are cityblock, as defined in equation (1), Euclidean, and correlation, which is defined as one minus the correlation of the two vectors.
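For reference, these quantities can be computed from predicted and true labels with a few lines. This is an illustrative Python sketch, not the authors' Matlab scripts, and the function name is hypothetical.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = interacting)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```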
Table 2. Tests with cityblock metric

k   Accuracy            Sensitivity   Specificity
1   94.29% ± 0.24 e-2   96.22%        93.03%
3   92.95% ± 0.31 e-2   96.26%        90.74%
5   90.95% ± 0.47 e-2   96.29%        87.53%
Table 3. Tests with euclidean metric

k   Accuracy            Sensitivity   Specificity
1   84.64% ± 0.79 e-3   95.19%        79.56%
3   79.47% ± 0.16 e-2   95.98%        73.06%
5   75.19% ± 0.36 e-2   96.53%        68.48%
Table 4. Tests with correlation metric

k   Accuracy            Sensitivity   Specificity
1   76.89% ± 0.51 e-2   82.21%        80.27%
3   76.40% ± 0.46 e-4   84.68%        77.46%
5   75.25% ± 0.79 e-4   86.51%        74.69%
We note that the highest accuracy is obtained with k = 1 and the cityblock metric. In order to understand whether the accuracy results depend on the number of proteins present in both datasets, we repeated the experiments with a dataset with more proteins in common. We selected 16,200 pairs from HPRD with 1,957 unique proteins. We generated 65,000 couples from those unique proteins, from which we selected 16,200 pairs with an absolute Pearson coefficient less than 0.3. The number of unique proteins in the negative class is 1,855. The results obtained with this dataset are reported in Table 5.

Table 5. Results for the dataset with large overlap

Distance      Accuracy            Sensitivity   Specificity
Cityblock     88.33% ± 0.43 e-2   91.05%        88.09%
Euclidean     86.69% ± 0.59 e-2   92.33%        84.53%
Correlation   87.65% ± 0.58 e-3   92.26%        86.00%
We note that the accuracy depends on the number of overlapping proteins in the two classes. Nevertheless, the cityblock distance always achieves the highest accuracy. As expected, we obtained lower accuracy results when the overlap is larger. This is due to the fact that the distance among points in the vector space is smaller. We believe that in a real setting the training set has to contain all information about interactions available from experiments. For the GSP, all known interacting proteins should be taken into consideration. For the GSN, pairs should be selected with respect to the analysis task at hand, which means that the processes, localization and functions of the proteins to be analyzed should be taken into account in the design. On the other hand, the accuracy values obtained by a random choice of the training set can still be satisfactory, as will be shown in the following experiment, where the GSP and GSN were the ones described in section 3. Finally, we compare our results with those obtained by Shen and colleagues. In Table 6 we show the results of 5 hold-outs obtained by randomly selecting 32,486 PPIs and 400 protein pairs for the test set. The accuracy values for SVM are taken from [15], while the 1-NN results have been computed using the cityblock metric. The prediction accuracy of 1-NN outperforms the SVM accuracy.
Table 6. SVM and 1-NN accuracy comparison

Test set   SVM              1-NN
1          84.25%           96.25%
2          82.75%           97.75%
3          83.25%           96.00%
4          83.25%           96.50%
5          86.00%           97.00%
Mean       83.90% ± 1.29    96.70% ± 0.69 e-2
Fig. 2. Predicted aging network
5 A Case Study
To better understand the capability of the prediction algorithm, we decided to test it on the protein-protein interaction network found in [25]. This network is composed of human homologs of proteins that have an impact on the longevity of invertebrate species. The network is composed of 175 human homologs of proteins that have been experimentally found to increase longevity in yeast, nematode, or fly, and 2,163 additional human proteins that interact with these homologs. Overall, the network consists of 3,271 binary interactions among 2,338 unique proteins. The article provides the names of the interacting protein pairs, but no accession numbers. From these 3,271 pairs we took the 1,740 protein names that were also present in the HPRD database. Then we found 2,062 pairs with both proteins in HPRD. We used k-NN trained on the 33,300 protein pairs obtained as described in section 3. The classifier was able to correctly predict the interaction of 2,023 pairs out of 2,062, with a classification accuracy of 98.11%. It is worth noting that the training set contains only 32 interacting pairs present in the network. The resulting network has been processed with Cytoscape (www.cytoscape.org) and the result is depicted in Figure 2. In the future, it will be interesting to investigate the correctness of the classifier on the remaining 1,209 pairs composed by the 598 proteins not found in HPRD. For all proteins contained in the network, amino acid sequences are available, but for those not in HPRD, obtaining the sequence from the name, rather than from the accession number, becomes a cumbersome and error-prone task.
6 Conclusion
In this work we propose a novel way to predict protein-protein interactions. Results on a network of 3,271 binary interactions among 2,338 unique proteins provide a prediction accuracy of 98.11%. Validation results are higher than those available in the literature using different classification methods and methodologies to build the training set. Future work will be devoted to the implementation of different filtering techniques for the GSN, to fully understand the capability of the prediction algorithm in connection with the training sets.
Acknowledgement. Adriano Nebbia spent a period at ICAR-CNR as an undergraduate student. He contributed to the present work by implementing all the software needed to encode the protein frequencies for the datasets used in the experiments of sections 4 and 5. This work has been partially funded by the MIUR project PRIN 2007.
References

1. De Las Rivas, J., de Luis, A.: Interactome data and databases: different types of protein interaction: Conference reviews. Comp. Funct. Genomics 5(2), 173–178 (2004)
2. Nooren, I.M., Thornton, J.M.: Diversity of protein-protein interactions. EMBO J. 22(14), 3486–3492 (2003) 3. Grigoriev, A.: On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Research 31, 4157–4161 (2003) 4. Xenarios, I., Rice, D., Salwinski, L., Baron, M., Marcotte, E., Eisenberg, D.: Dip: the database of interacting proteins. Nucleic Acids Research 28(1), 289–291 (2000) 5. Walker-Taylor, A., Jones, D.: Computational methods for predicting protein protein interactions. In: Waksman, G. (ed.) Proteomics and protein-protein interactions: biology, chemistry, bioinformatics, and drug design, pp. 89–114. Springer, Heidelberg (2005) 6. Shoemaker, B., Panchenko, A.: Deciphering protein–protein interactions - part ii. computational methods to predict protein and domain interaction partners. PLoS Computational Biology 3(4), 595–601 (2007) 7. Shi, T.L., Li, Y.X., Cai, Y.D., Chou, K.C.: Computational methods for proteinprotein interaction and their application. Curr. Protein Pept Sci. 6(5), 443–449 (2005) 8. Pitre, S., Alamgir, M., Green, J., Dumontier, M., Dehne, F., Golshani, A.: Computational Methods for Predicting Protein-Protein Interactions. In: The Adaption of Virtual Man-Computer Interfaces to User Requirements in Dialogs, vol. 110, pp. 247–267. Springer, Berlin (2008) 9. Mathivanan, S., Periaswamy, B., Gandhi, T.K.B., Kandasamy, K., Suresh, S., Mohmood, R., Ramachandra, Y.L., Pandey, A.: An evaluation of human proteinprotein interaction data in the public domain. BMC Bioinformatics 7(Suppl. 5) (2006) 10. Mewes, H.W., Dietmann, S., Frishman, D., Gregory, R., Mannhaupt, G., Mayer, K.F.X., M¨ unsterk¨ otter, M., Ruepp, A., Spannagl, M., St¨ umpflen, V., Rattei, T.: Mips: analysis and annotation of genome information in 2007. Nucleic Acids Research 36(Database-Issue), 196–201 (2008) 11. Stark, C., Breitkreutz, B.J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: Biogrid: a general repository for interaction datasets. Nucleic Acids Research 34(Database issue) (January 2006) 12. Chatr-aryamontri, A., Ceol, A., Palazzi, L.M.M., Nardelli, G., Schneider, M.V.V., Castagnoli, L., Cesareni, G.: Mint: the molecular interaction database. Nucleic Acids Research 35(Database issue), D572–D574 (2007) 13. Brown, K.R., Jurisica, I.: Online predicted human interaction database. Bioinformatics 21(9), 2076–2082 (2005) 14. Prasad, K.T.S., Goel, R., Kandasamy, K., Keerthikumar, S., Kumar, S., Mathivanan, S., Telikicherla, D., Raju, R., Shafreen, B., Venugopal, A., Balakrishnan, L., Marimuthu, A., Banerjee, S., Somanathan, D.S., Sebastian, A., Rani, S., Ray, S., Kishore, H.C.J., Kanth, S., Ahmed, M., Kashyap, M.K., Mohmood, R., Ramachandra, Y.L., Krishna, V., Rahiman, A.B., Mohan, S., Ranganathan, P., Ramabadran, S., Chaerkady, R., Pandey, A.: Human protein reference database–2009 update. Nucleic Acids Research 37(Database issue), gkn892+ (2009) 15. Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y., Jiang, H.: Predicting protein-protein interactions based only on sequences information. PNAS 104(11), 4337–4341 (2007) 16. Bock, J.R., Gough, D.A.: Predicting protein–protein interactions from primary structure. Bioinformatics 17(5), 455–460 (2001) 17. Nanni, L.: Hyperplanes for predicting protein-protein interactions. Neurocomputing 69(1-3), 257–263 (2005)
18. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997) 19. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995) 20. Guarracino, M., Cuciniello, S., Feminiano, D., Toraldo, G., Pardalos, P.: Current classification algorithms for biomedical applications. Centre de Recherches Math´ematiques CRM Proceedings & Lecture Notes of the American Mathematical Society 45(2), 109–126 (2008) 21. Platt, J.: Fast training of SVMs using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999) 22. Costantini, S., Facchiano, A.M.: Prediction of the protein structural class by specific peptide frequencies. Biochimie 1-4 (2008) 23. Hur, A.B., Noble, W.: Choosing negative examples for the prediction of proteinprotein interactions. BMC Bioinformatics 7(Suppl. 1) (2006) 24. Shi, M.G., Xia, J.F., Li, X.L.: Predicting protein–protein interactions from sequence using correlation coefficient and high-quality interaction dataset. Amino Acids (2009) (online) 25. Bell, R., Hubbard, A., Chettier, R., Chen, D., Miller, J.P., Kapahi, P., Tarnopolsky, M., Sahasrabuhde, S., Melov, S., Hughes, R.E.: A human protein interaction network shows conservation of aging processes between human and invertebrate species. Plos Genetics 5(3) (2009)
Simulations of the EGFR - KRAS - MAPK Signalling Network in Colon Cancer. Virtual Mutations and Virtual Treatments with Inhibitors Have More Important Effects Than a 10 Times Range of Normal Parameters and Rates Fluctuations

Nicoletta Castagnino (1), Lorenzo Tortolina (1), Roberto Montagna (2), Raffaele Pesenti (3), Anahi Balbi (4), and Silvio Parodi (1)

(1) Department of Oncology, Biology and Genetics, University of Genoa; National Cancer Institute of Genoa, Largo R. Benzi 10, 16132 Genova, Italy
(2) Department of Informatics and Information Sciences, University of Genoa, Via Dodecaneso 35, 16146 Genova, Italy
(3) Department of Applied Mathematics, University of Ca' Foscari of Venice, Dorsoduro 3825-30123, Venezia, Italy
(4) Advanced Biotechnology Center of Genoa, Largo R. Benzi 10, 16132 Genova, Italy
Abstract. The fragment of the signaling network we have considered was formally described as a sort of circuit diagram, a Molecular Interaction Map (MIM). We have mostly followed the syntactic rules proposed by Kurt W. Kohn [1,10,11]. In our MIM we drew 19 basic species. Our dynamic simulations involve 46 modified species and complexes, 50 forward reactions, 50 backward reactions, 17 catalytic activities. A significant amount of parameters concerning molecular concentrations, association rates, dissociation rates and turnover numbers, are known for this intensively studied neighborhood of the signaling network. In other cases, molecular, cellular and even clinical data generate additional indirect constraints. Some unknown parameters have been adjusted to satisfy these indirect constraints. In order to avoid hidden bugs in writing the software we have used two independent approaches: a) a more classic approach using Ordinary Differential Equations (ODEs); b) a stochastic simulation engine, written in Java, based on the Gillespie algorithm: we obtained overlapping results. For a quiescent and EGF stimulated network we have obtained a behavior in good agreement with what is experimentally known. We have introduced virtual mutations (excess of
function) for EGFR, KRAS and BRAF onco-proteins. We have also considered virtual inhibitions induced by different EGFR, KRAS, BRAF and MEKPP inhibitors. Drugs of this kind are already in the phase of preclinical and clinical studies. The major results of our work are the following: 3.16x or 3.16/ fluctuations of the total concentrations of independent molecular species or fluctuations of rates were introduced systematically. We examined the effects on the plateau levels of the 61 parameters representing all molecular species / complexes. Fluctuations of concentrations generated scores of deviation from the normal reference situation with median = 5 and Ist–IIIrd quartile = 1–9. Fluctuations of rates generated scores of deviation with median and Ist–IIIrd quartile = 0. In the case of virtual mutations, the deviation from the normal reference situation generated scores in the range 33–115, well above the fluctuation range. The addition of a target-specific virtual inhibitor to its respective virtual mutation reduced the deviation scores by 64% (KRAS), 67% (BRAF), 97% (EGFR mutation) and 90% (EGFR strong stimulation). A double alteration (EGFR & KRAS) could be best inhibited by the association of the two corresponding inhibitors. In conclusion, the effects of virtual mutations and virtual inhibitors seem definitely more important than random noise fluctuations in concentrations and rates.

This work was supported by: MIUR PRIN 2007-prot. 2005061408 005; national project Compagnia di San Paolo 2489 IT/CV, 2006.1997, Genoa Operative Unit (O.U.); Regione Liguria, Cap. 5296 F.S.R. Nr. Prov. 23, Prot. Gen. 616, 2007; Istituto Superiore di Oncologia, O.U.s of an "Istituto Oncologico Veneto" project, and a "Regione Campania" project; project Compagnia di San Paolo 4998 IT/CV, 2007.0087, Genoa University; Project LIMONTE-SP1P 2009 (with Castagnola and Zardi); project CARIGE, Genoa University 2008 (with Patrone, Ballestrero et al.).
1 Introduction
We work on a signaling network downstream of the imbricated pathways of TGFβ, Wnt and EGF (G0 → G1 transition). We focus on the colon and on colon cancer as a privileged reference experimental model, in the broader framework of epithelial cancers. In this work we wanted to explore whether dynamic simulations of limited signaling regions, in the presence of limited knowledge of the parameter space, suggest a non-erratic behavior, in which the effects of virtual mutations and virtual inhibitors tend to emerge from the background of concentration and rate fluctuations. Dynamic modeling is a difficult effort under the siege of many potential pitfalls. In order to test the feasibility of our major goal (the simulation of a larger MIM involving the imbricated TGFβ pathway, the Wnt pathway and the present EGF pathway, converging toward the transcription factor TCF/LEF-β-catenin), we adopted a strategy of gradual effort. We started by examining, under a multiplicity of different perspectives, the behavior of a sufficiently well reconstructed smaller section (about 1/3) of our final signaling network map. Considering the limited size of our MIM of Fig. 1, and the fact that the molecules present in our MIM are rather well known, we do not show here the corresponding annotation list. However, it should not be too difficult to reconstruct from the references [4,6,9,12,22] most of the sources of our information. A synthetic glossary is shown at the bottom of Fig. 2. EGFR, KRAS and BRAF are dominant oncogenes to which oncogenic mutations confer excess of function. The general rule in molecular oncology is that dominant oncogenes participate in the process of malignant transformation through mutations, conferring in various
ways to the protein gene product a drastic and deregulated excess of function. Excess of function in a dominant onco-protein can also be the consequence of over-expression or of post-translational modifications downstream of other molecular alterations in the signaling network. In order to introduce alterations as close as possible to the real molecular ones, we introduced the following signaling alterations in the simulated network: a drastic loss of susceptibility to phosphatases for the activated phosphorylated EGFR2 homodimer; a drastic loss of the GTPase activity of GAP over a mutated RAS-GTP; a drastic loss of susceptibility to phosphatases for the activated form of BRAF-P (i.e., BRAF*). Excess of function in EGFR was also induced through a strong stimulation. We introduced in a similar way the activity of virtual inhibitors of EGFR, KRAS, BRAF and MEKPP. They display a high-affinity, competitive, reversible binding with the target. The complex [target : inhibitor] is a dead end.
2 Methods
The MIM that represents the graphic picture of the network behind our simulations is shown in Fig. 1. The logic / syntactic rules for drawing our MIM have been described in [1,10,11] and are synthetically reported in Fig. 1. Our MIM describes a pathway downstream of EGF ↔ EGFR. Adaptor proteins like Grb2 and Shc connect the activating GDP ↔ GTP exchange factor SOS to KRAS. KRAS-GTP is brought back to an inactive KRAS-GDP by the protein GAP. KRAS-GTP can activate BRAF. BRAF-P can activate MEK. MEKPP can activate ERK. A temporary steady-state equilibrium is assured by seven phosphatases. In Tables 1 and 2 we show a list of the 50 reversible reactions and 17 catalytic reactions that represent the complete set of our dynamic simulations. We bring them to steady-state equilibrium. We have verified that, starting from different out-of-equilibrium conditions, we can have different provisional transitory peaks, but we then converge rather rapidly (typically in less than 10 virtual minutes) toward the same steady state, provided that the initial total concentrations of our 15 independent molecular species remain invariant, as well as the 50 association rates, the 50 dissociation rates and the 17 catalytic turnover numbers.
2.1 Simulations Using ODEs
To simulate the signaling network considered in this paper, we have mathematically formalized the reaction scheme of Tables 1 and 2 in terms of the reactions' kinetic laws [20]. The kinetic laws of a reaction describe the velocity at which the reactants are transformed into the products of the reaction. More specifically, we assumed that all reactions follow a mass action kinetic law. According to this kinetic law, the velocity of the reaction is directly proportional to the concentration of the reactants multiplied by the reaction rate. As an example, given the reversible reaction

[A] + [B] ↔ [A−B],   (1)
the velocity of the [A−B] formation reaction is k1 [A][B] − k−1 [A−B], where each [X] indicates the concentration of a given reactant, and k1 and k−1 are respectively the forward (association) and backward (dissociation) rates of the reversible reaction. At equilibrium:

k1 [A][B] = k−1 [A−B],   (2)

([A][B]) / [A−B] = k−1 / k1 = Kd (the equilibrium constant).   (3)
(6)
Then: d[A]/dt = −v1 + v−1 − v2 + v−2 − v3 + v−3 =
(7)
= −(k1 [A][B]+k2 [A][C]+k3 [A][D])+(k−1 [A−B]+k−2 [A−C]+k−3 [A−D]). (8) The collection of this type of differential equations for all the 61 complexes included in the signaling network completely describes the dynamic behavior of our biological system. Unfortunately, the non linear nature of the above differential equations has prevented us from determining the analytical expressions for the system evolution over time. Nevertheless, we have been able to simulate the system evolution numerically with the help of an appropriate software such as the SimBiology toolbox of Matlab (http://www. mathworks.com/products/simbiology/).Using our data stored in SBML we sometimes also used CellDesigner (http://celldesigner. org/) integrated with SBML ODE Solver. The two tools gave the same results.These kinds of numerical approaches have been pursued by different authors, among them [9,22] and other authors whose models are available in the BioModels Database (http://www.ebi.ac.uk/biomodels-main/). 2.2
Simulations Using a Stochastic Engine, Written in Java, Based on the Gillespie Algorithm
To our knowledge, a dozen of stochastic simulators have been produced, (COPASI, Gillespie2, Dizzy, among others). A nice and comprehensive review about the state of the art in stochastic modeling is presented in [21] . The stochastic
Simulations of the EGFR - KRAS - MAPK Signalling Network
155
simulations and the consequent results contained in this paper were however obtained using an ad hoc stochastic engine written in Java (Ph.D. thesis of M.R., in preparation); the simulator implements the Gillespies Algorithm (GA) [7], more specifically a ”safe” optimization proposed by Gibson and Bruck [14]. In a discrete and stochastic approach, it is necessary to take into account the precise number of molecules present for each species and complex (cardinality), it is therefore necessary to transform the initial quantitative information, usually expressed in terms of molar concentrations, in numbers of molecules. The greatest difference between this method and the deterministic ODEs approach are the modifications introduced by the stochastic aspect of the interactions among individual molecules. In GA, we associate discrete numbers and consequently a probability to each reaction. As an example, for a reaction involving the numbers of molecules (cardinality) of the three molecular species: |A| + |B| → |A − B|, where k is the association rate of the reaction, we calculate a value proportional to k ∗ |A| ∗ |B|. Starting from this value, we build a negative exponential distribution of probabilities [7]; from them we compute the next instant in which the reaction will take place. As a major consequence of this procedure, we obtain a non deterministic stochastic simulation; each individual simulation performed on the same system will slightly differ from others because it represents only a simulation of an individual run. This stochastic behavior is the basic behavior of an individual cell, and could impose restraints to the minimum number of molecules of a given molecular species that can exist at any given moment. It could also have subtle effects on ”normal” fluctuations of cell behavior [19]. We have verified that the average behavior of a number of stochastic simulations is essentially super-imposable with the behavior of the deterministic ODEs approach. In our hands, ODEs simulations are much faster than stochastic simulations, but they miss the possibility of understanding protein dynamics in single cells [2]. Having in our hands our own stochastic simulator is potentially conferring more flexibility towards specific needs. 2.3
A Synthetic Description of the Way of Implementation of the Three Virtual Mutations
As illustrated in detail from the reactions of Fig. 2, our three virtual mutations, involving EGFR, KRAS and BRAF, have been implemented according to the general rule that the mutated protein could not be de-phosphorylated. More specifically, for the tetramer [(EGF-EGFRP)2 ] the de-phosphorylation velocity of the PTP1B phosphatase (k4 ) was broth to 0, the much less significant spontaneous de-phosphorylation (k−3 ) was also broth to 0; for [KRAS-GTP] the GTP-ase activity of GAP was broth to 0 (k−61 ); for BRAF* (or BRAF-P) the de-phosphorylation velocity of the P-ase 1 (k−42 ) was broth to 0. 2.4
A Synthetic Description of the Way of Implementation of Our Four Virtual Inhibitors
As illustrated in detail in the reactions Fig. 2, our four virtual inhibitors (EGFR, KRAS, BRAF and MEKPP inhibitors) have been implemented according to the
156
N. Castagnino et al.
Fig. 1. The heuristic MIM of the EGFR - KRAS - MAPK signaling network. MIMs are based on a system of symbols and syntactic conventions: reactions operate on molecular species; contingencies operate on reactions or on other contingencies (see the right part of the Fig. 1).
following basic idea: an inhibitor works as a dead-end when bound to the target molecule, it has to bind > 90% of the target. 2.5
Scores Measuring Deviations from the Normal Condition: Fluctuations Scores, Mutations Scores, Inhibitors Scores
For each of the 15 molecular total concentrations we introduced a 3.16x and a 3.16/ fluctuation (30 independent, individual fluctuations). We random sorted 10 association rates (forward reactions), 10 dissociation rates (backward reactions) and 4 turnover numbers. For each of the 24 randomly sorted rates we again introduced a 3.16x and a 3.16/ fluctuation (48 independent, individual fluctuations). At steady state equilibrium we have 61 different molecular complexes, including the 15 basic species, considered in this context as free molecules. We evaluated the deviation of each of these 61 different parameters from their plateau level equilibrium when in the presence of the standard (normal) values of concentrations and rates. For each of the 61 molecular complexes, deviations were computed as follows: a deviation (absolute value) < 3.16 was computed = 0; a deviation ≥ 3.16 and < 10 was computed = 1; a deviation ≥ 10 and < 31.6 was computed = 2; a deviation ≥ 31.6 and < 100 was computed = 3; a deviation ≥ 100 and < 316 was computed = 4; and so on. At the end we made the of all the 61 deviations generated from an individual fluctuation. referred to the 61 deviations concerning the 61 different molecular complexes). An individual was the score an individual 3.16x or 3.16/ fluctuation. We generated 30 concentration related scores and 48 rate related scores. Scores for the deviations
Simulations of the EGFR - KRAS - MAPK Signalling Network
157
Table 1. Total protein concentrations which are assumed constant on the time scale under consideration, are the following (nM): [EGF]=3.4 ; [EGFR]= 100; [PLCγ]=105; [Shc]= 150; [Grb2]= 85; [SOS]= 34; [KRAS]= 85; [GAP]= 12; [BRAF]= 50; [MEK]= 200; [ERK]=200; [P-ase 1]= 50; [PP2A]= 50; [MKP3]= 100; [P-ase 5]= 50. The concentration of GTP and GDP are respectively 10,000 nM and 500 nM and are assumed constant. PTP1B, PTP, and P-ase 4, in the corresponding de-phosphorylation kinetic reactions, have been treated using a Michaelis-Menten approximation. Reversible reactions R + EGF ↔ Ra Ra + Ra ↔ R2 R2 ↔ R2P R2P + PLCy ↔ R PL R PL ↔ R PLP R2P + PLCy-P ↔ R PLP PLCy-P ↔ PLCyP 1 R2P + Grb2 ↔ R G R G + SOS ↔ R G S R2P + G S ↔ R G S Grb2 + SOS ↔ G S
Reversible reactions k1 = 0.003 k−1 = 0.06 k2 = 0.01 k−2 = 0.1 k3 = 1.0 k−3 = 0.01 k5 = 0.06 k−5 = 0.2 k6 = 1.0 k−6 = 0.05 k7 = 0.006 k−7 = 0.3 k9 = 1.0 k−9 = 0.03 k10 = 0.0015 k−10 = 0.2 k11 = 0.01 k−11 = 0.06 k12 = 0.0028 k−12 = 0.15 k13 = 0.0001 k−13 = 0.0015
R2P + Shc ↔ R Sh
k14 = k−14 = 0.6
0.09
R Sh ↔ R ShP
k15 = k−15 = 0.06
6.00
ShP + R2P ↔ R ShP
k16 = 0.0009 k−16 = 0.3
R ShP + Grb2 ↔ R ShP G R2P + ShP G ↔ R ShP G R ShP G + SOS ↔ R ShP G S ShP G S + R2P ↔ R ShP G S ShP G + SOS ↔ ShP G S ShP + Grb2 ↔ ShP G ShP + G S ↔ ShP G S R ShP + G S ↔ R ShP G S KRAS + GDP ↔ KRAS GDP KRAS + GTP ↔ KRAS GTP R ShP G S + KRAS GDP R Sh G S KRAS GDP
k17 k−17 k18 k−18 k19 k−19 k20 k−20 k21 k−21 k22 k−22 k24 k−24 k25 k−25 k26 k−26 k27 k−27 ↔ k29 k−29
= 0.0030 = 0.1 = 0.0009 = 0.3 = 0.01 = 0.0214 = 0.00024 = 0.12 = 0.03 = 0.064 = 0.0030 = 0.1 = 0.0210 = 0.1 = 0.0090 = 0.0429 = 0.00027 = .0000054 = 0.0780 = .00078 = 0.00475 = 0.76
R R R R R R R R R R R R R R R R R
Sh G S KRAS + GDP Sh G S KRAS GDP Sh G S KRAS + GTP Sh G S KRAS GTP ShP G S + KRAS GTP Sh G S KRAS GTP ShP G S + KRAS Sh G S KRAS G S + KRAS GDP G S KRAS GDP G S KRAS + GDP G S KRAS GDP G S KRAS + GTP G S KRAS GTP G S + KRAS GTP G S KRAS GTP G S + KRAS ↔ R G S KRAS
BRAF + KRAS GTP BRAF KRAS GTP
BRAF ∗
+
KRAS GTP
BRAF KRAS GTP BRAF ∗ + P-ase 1 BRAF ∗ P − ase1 MEK + BRAF ∗ M EK BRAF ∗ MEKP + BRAF ∗ ∗ M EKP BRAF
↔ k30 k−30 ↔ k31 k−31 ↔ k32 k−32 ↔ k33 k−33 ↔ k34 k−34 ↔ k35 k−35 ↔ k36 k−36 ↔ k37 k−37 k38 k−38 ↔ k39 k−39
= 0.0930 = 46.5 = 0.0030 = 2.4 = 1.5750 = 806.4 = 0.15625 = 0.001 = 0.0075 = 1.2 = 0.1 = 50 = 0.1 = 80 = 1.25 = 640 = 0.25 = 0.0016 = 0.01 = 0.0053
↔ k40 = k−40 = 1
0.0007
↔ k41 = 0.0717 k−41 = 0.2 ↔ k43 = 0.0111 k−43 = .01833 ↔ k45 = 0.0111 k−45 = .01833
MEKPP + PP2A ↔ MEKPP PP2A
k47 = 0.0143 k−47 = 0.8 k49 = 0.00025 k−49 = 0.5 ERK + MEKPP ↔ ERK MEKPP k51 = 0.01 k−51 = 0.033 ERKP + MEKPP ↔ ERKP MEKPP k53 = 0.01 k−53 = 0.033 ERKPP + MKP3 ↔ ERKPP MKP3 k55 = 0.0145 k−55 = 0.6 ERKP + MKP3 ↔ ERKP MKP3 k57 = 0.05 k−57 = 0.5 GAP + R2P ↔ R GAP k59 = 0.0830 k−59 = 1.5 KRAS GTP + R GAP ↔ k60 = 0.0622 R GAP KRAS GTP R Sh G S + ERKPP → k62 = 0.01 R Sh G S ERKPP k−62 = 0.033 R G S + ERKPP → R G S ERKPP k64 = 0.01 k−64 = 0.033 SOSP + P-ase 5 → SOSP P-ase 5 k66 = 0.01 k−66 = 0.033 MEKP + PP2A ↔ MEKP PP2A
Scores for the deviations from "normality" induced by virtual mutations and virtual inhibitors were computed exactly in the same way.
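One possible reading of this scoring procedure can be summarized by the following sketch (illustrative Python; it assumes that the deviation of each complex is the fold change of its steady-state plateau level with respect to the normal reference, binned on a log scale with a factor of 3.16 per bin; function and variable names are hypothetical):

```python
import numpy as np

def deviation_score(perturbed, reference):
    """Sum over the complexes of the binned fold-change deviations:
    <3.16x -> 0, 3.16-10x -> 1, 10-31.6x -> 2, 31.6-100x -> 3, and so on."""
    perturbed = np.asarray(perturbed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # fold change >= 1, regardless of the direction of the deviation
    fold = np.maximum(perturbed, reference) / np.minimum(perturbed, reference)
    bins = np.floor(np.log10(fold) / 0.5).astype(int)   # 3.16 = 10**0.5 per bin
    return int(bins.sum())

# toy usage with made-up plateau levels for a handful of complexes
ref = np.array([10.0, 5.0, 2.0, 100.0])
mut = np.array([10.5, 60.0, 2.1, 1.0])
print(deviation_score(mut, ref))
```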
3 Results
Table 2. • The following apparent turnover numbers represent the product of the fraction of the total concentration of the phosphorylated molecule [molP] actually bound to its phosphatase [P-ase], times the real turnover number, in a condition of steady state equilibrium in which d[molP-P-ase]/dt = 0 Catalytic reactions R2P → R2 PLCy-P → PLCy ShP → Shc KRAS GTP → KRAS GDP BRAF ∗ P − ase1 → BRAF + P-
ase 1 M EK BRAF
∗
→
MEKP M EKP BRAF ∗ ∗
→
BRAF
∗
MEKPP
• k = 9.0 4 • k = 0.01 8 •k 23 = 0.005 k28 = 0.0001 k42 = 1
Catalytic reactions ERK MEKPP → ERKP + MEKPP ERKP MEKPP → ERKPP + MEKPP ERKPP MKP3 → ERKP + MKP3 ERKP MKP3 → ERK + MKP3 R GAP KRAS GTP KRAS GDP
→
→
R GAP
R ShP G
k52 k54 k56 k58
= = = =
16 5.7 0.27 0.3
+ k61 = 1.494
+ k44 = 3.5
R Sh G S ERKPP SOSP + ERKPP
+ k63 = 5.7
+ k46 = 2.9
R G S ERKPP → R G + SOSP + k65 = 5.7 ERKPP
BRAF
MEKP PP2A → MEK + PP2A MEKPP PP2A → MEKP + PP2A
k48 = 0.058 k50 = 0.058
SOSP P-ase 5 → SOS + P-ase 5
k67 = 5.7
The parameter space concerning the total protein abundance of the basic molecular species, the association and dissociation rates of protein-protein interactions, and the catalytic turnover numbers is only partially known, even if Wolf and others [6,5,9,13,17,22], especially for the module we are discussing here, did meritorious work extracting as many parameters as possible from a large set of experimental data and also, when feasible, filling some gaps according to less direct experimental evidence at the molecular and cellular level. In the case of inhibitors of mutated onco-proteins, constraints can even come from preclinical and clinical observations [8]. In the legends of Tables 1 and 2 we give the "normal" total concentrations of 15 of the 19 basic species shown in the MIM of Fig. 1. PTP1B, PTP and P-ase 4 were treated using a Michaelis-Menten approximation, and the species "cytoskeleton protein" binding PLCγ-P was considered in large excess. Especially for the most studied regions of the signaling network, filling the parameter space is a work in progress. We can expect continuous adjustments and new improved tunings of the dynamic simulations, including the introduction of additional signaling proteins. However, what seems important is to explore whether a virtual mutation (a focal all-or-none alteration of the network) will introduce drastic and biologically significant dynamic or steady state (in this simulation) changes, capable of emerging clearly over the background of steady state changes induced by the entire spectrum of milder possible parameter space fluctuations. Only if this is the case, only if a relatively small introduced fluctuation does not cause very large fluctuation effects, can we already work toward progressively larger and better defined signaling networks, with special reference in our case to pathways containing mutated onco-proteins, challenged with virtual inhibitors, alone or in association. We do not have proof that the relative stability of our sub-network will be maintained in a larger one. However, if the difficulties and uncertainties in front of us are positively overcome, this type of approach could be exceptionally relevant for a rational signaling-targeted cancer therapy. We also examined the behavior of a 10x KRAS over-expression, and the behavior of a very strong stimulation of EGFR with EGF 680 nM (peri-membrane concentration); in the "normal" condition we chose an EGF concentration of 3.4 nM (200 times less).
Fig. 2. Top - left: virtual mutations. Top - right: virtual inhibitors. Bottom: A synthetic glossary of the 19 molecular species depicted in our MIM.
Starting from different initial situations of disequilibrium, we generated different initial temporary peaks. We empirically verified many different starting conditions. In all tested cases we always came close, within a virtual time of 20-30 minutes, to the same plateau levels (residual differences < 5%). As introduced in Sect. 2, we computed a score for the fluctuations around the 61 "normal" plateau values reached by our complexes at steady state equilibrium. Among the 50 association rates, 50 dissociation rates and 17 turnover numbers involved in our simulations (Tables 1 and 2), we randomly selected 10 association rates, 10 dissociation rates and 4 turnover numbers; 24 3.16x and 24 3.16/ fluctuations were tested (48 rate fluctuations in total). The scores of the rate fluctuations are reported in Fig. 3A. Fluctuations were also induced by introducing 15 3.16x and 15 3.16/ (global range = 10) variations around the normal total concentrations (see Tables 1 and 2) of the 15 independent basic species. The scores of these 30 fluctuations are reported in Fig. 3B. In Fig. 4A we report the corresponding scores obtained with the individual virtual mutations, and also with a 10x KRAS amplification and a very strong EGFR stimulation (EGF 200 times more concentrated than at the basic level). In Fig. 4B we report the behavior of an over-stimulated or mutated EGFR + a clearly active (in terms of concentration and target affinity) EGFR inhibitor; we did the same for a mutated KRAS and a KRAS inhibitor, and for a mutated BRAF and a BRAF inhibitor; we also report the effect of a MEKPP inhibitor on a mutated BRAF, and the negligible effects of a BRAF inhibitor or a MEKPP inhibitor on a mutated KRAS. We comment on this apparent paradox, due mostly to the limited size of our MIM, in the discussion section; in fact we have only the ERK protein downstream of BRAF and MEK. In Fig. 4C we report a double alteration clinically described in colon cancer [3] (EGFR over-stimulation and KRAS mutation). Individually, the two inhibitors seem unable to go significantly below the mutated-KRAS score, but a virtual treatment with both the EGFR inhibitor + the KRAS inhibitor brings us down from the mutated-KRAS score (33) to a score of 15-23, much closer to the fluctuation scores of Fig. 3.
In order to avoid hidden bugs in writing the software we have used two independent approaches: a) a more classic approach using Ordinary Differential Equations (ODEs); b) a stochastic simulation engine, written in Java, based on the Gillespie algorithm [7]. We have verified, for all 61 complexes, in "normal" conditions, in the presence of mutations alone and with inhibitors, that ODEs simulations and stochastic Java simulations generate practically superimposable results. However, especially when we deal with rare complexes, stochastic fluctuations become dramatically evident. We report an example in Fig. 5. We have assumed the radius of our ideal spherical cell to be 10 μm. Its volume is therefore 4.19 · 10⁻¹² liters. From our ODEs graph, at the steady state plateau level, the concentration of the complex [(EGF:EGFR-P)2:Grb2:SOS:KRAS-GTP] is 5.4 · 10⁻¹⁵ M. The equivalent number of molecules is 6.022 · 10²³ · 4.19 · 10⁻¹² · 5.4 · 10⁻¹⁵ = 0.0136 molecules. According to a Poisson distribution, P(0) = 0.9865, P(1) = 0.0134, P(2) = 0.0000912. Most of the time the basic components of the complex are distributed in adjacent complexes. Especially in the first and second run, we have (in each run) about four empty sampling intervals of 40-70 sec in duration. In our three runs, considering the equilibrium condition reached in the last 1,000 sec, and considering that our graph makes 9,259 samplings every 1,000 seconds, according to the Poisson distribution described above we would expect 0.84 instances in the last 1,000 sec in which we have two molecules of our complex. In run 1 we have one such case, in run 2 we have zero cases, and in run 3 we have two cases. Run 3 also has an average frequency of molecules present slightly higher than the other two runs, as one would expect when dealing with stochastic phenomena.
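The back-of-the-envelope numbers in this paragraph can be reproduced with a few lines (illustrative Python: sphere volume for a 10 μm radius, expected copy number, and Poisson probabilities):

```python
import math

r_cm = 10e-4                                          # 10 micrometres expressed in cm
volume_l = (4.0 / 3.0) * math.pi * r_cm**3 / 1000.0   # sphere volume in litres (~4.19e-12)
conc_M = 5.4e-15                                      # plateau concentration of the complex
n_avg = 6.022e23 * volume_l * conc_M                  # expected number of molecules per cell

poisson = lambda k: math.exp(-n_avg) * n_avg**k / math.factorial(k)
print(round(n_avg, 4), [round(poisson(k), 6) for k in (0, 1, 2)])
# expected ~0.0136 molecules; P(0) ~ 0.9865, P(1) ~ 0.0134, P(2) ~ 9.1e-5
```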
4 Discussion
We systematically introduced fluctuations around our reaction rates. A 3.16x or 3.16/ individual fluctuation around its reference "normal" rate (total range of the fluctuation: 10 times) had negligible effects on the plateau values assumed by the concentrations of our 61 complexes (range 1-7, median and Ist–IIIrd quartile = 0). We did the same with the total concentrations of the 15 independent basic species. In this case the fluctuation range was still small but more significant (range 1-22, median = 5, Ist–IIIrd quartile = 1-9).
Fig. 3. 3.16x - 3.16/ fluctuations: A) scores related to rates; B) scores related to concentrations. See Methods for details.
Fig. 4. A) Score of EGFR, KRAS, and BRAF mutations + EGFR strong stimulation + KRAS amplification. B) Each of the above mutations and EGFR stimulation + their corresponding inhibitor. Effects of BRAF inhibitor on KRAS mutation. Effects of MEKPP inhibitor on KRAS mutation and BRAF mutation, respectively. C) Double EGFR + KRAS alteration, in the presence of one or both virtual inhibitors.
It is perhaps important to underline that, in the case of the fluctuations introduced at the level of rates, at each time we changed only 1/117 of them, while, in the case of the fluctuations introduced at the level of the total concentrations of the 15 basal independent species, at each time we changed 1/15 of them. It is reasonable to expect a larger average effect of each fluctuation in the second case. In addition, a fluctuation in EGFR concentration is upstream of the whole pathway (Fig. 1), while a fluctuation in PLCγ is in a peripheral position in our MIM. Accordingly, we observed the following very different individual scores: EGFx = 22; EGF/ = 18; PLCγx = 1; PLCγ/ = 1. It is interesting to note that our score system also suggests an important role for the GDP-GTP exchange factor SOS: SOSx score = 9; SOS/ score = 13. It was encouraging to observe that the scores of all our mutations were well above the fluctuation range (Fig. 4A): 33 for a mutated KRAS, 43 for a mutated BRAF, 115 for a mutated EGFR. It was also encouraging to observe the important effects, in the right direction (Fig. 4B), of the inhibitors of our three mutated onco-proteins. With all the cautions related to the small size of our MIM, inhibitors downstream of a mutated KRAS (the BRAF and the MEKPP inhibitors) had an almost negligible effect in counteracting the KRAS mutation. A larger MIM would be needed to confirm the importance of inhibiting KRAS directly, or immediately downstream. In our limited MIM, ERKPP is a dead end downstream of the whole pathway; important effects downstream of ERKPP are not represented in our MIM.
Fig. 5. Three different stochastic runs concerning the complex [(EGF:EGFR-P)2:Grb2:SOS:KRAS-GTP]. Bottom right: moving average (over 1,000 samplings); smaller quadrant: same simulation using ODEs.
An activated ERKPP phosphorylates a wide variety of signaling-protein substrates relevant for cell-cycle entry and for cell growth, including the transcription factors c-Myc, Elk1, and c-Fos [18]. As a consequence, we have to consider the effects of the MEKPP inhibitor as more important than they appear in our limited MIM. For instance, in the presence of a mutated KRAS the steady state plateau concentration of ERKPP (active form) is 105.22 nM; it decreases drastically to 1.51 nM in the presence of the MEKPP virtual inhibitor. Similarly, in the presence of a mutated BRAF the steady state plateau concentration of ERKPP (active form) is 111.79 nM; it decreases drastically to 1.58 nM in the presence of the MEKPP virtual inhibitor. In conclusion, a human-supervised analysis indicates that the effects of a MEKPP inhibitor could be definitely more important than suggested by our score system applied to this limited MIM. Past efforts toward finding a selective KRAS inhibitor at the level of the geranyl-geranylation (isoprenylation) of KRAS failed mostly because this membrane anchorage mechanism is shared by too many other sub-membrane proteins. However, a specific KRAS inhibitor acting at the level of some protein-protein interaction could still be possible. It was interesting to note in our simulations that such an inhibitor could play important roles. For instance, it could reinstate sensitivity in the case of an over-expressed EGFR & mutant KRAS, through an association of the EGFR inhibitor & the KRAS inhibitor (Fig. 4C). EGFR is frequently over-expressed in colon cancers, but the tumor becomes resistant to the EGFR inhibitor in the case of an additional KRAS mutation (approximately 30% of the cases) [3,8]. Looking at Fig. 4C, it is very interesting to consider that apparently the double alteration of EGFR & KRAS causes a deviation of the scores from normality not stronger than that of the EGFR alteration alone. However, using the EGFR inhibitor alone, the double alteration cannot be brought below the deviation from normality induced by KRAS (a sort of resistance). The KRAS inhibitor alone is even less effective. On the
contrary, an association of the two inhibitors brings us down to a deviation score of 15-23, very close to the 1-9 Ist–IIIrd quartile range of random concentration-related fluctuations. Interesting stochastic simulations have been produced in the medical field, for instance applied to ageing [15,16]. For the moment we have checked the behavior of our stochastic simulator mostly for comparisons (and elimination of hidden bugs) with the deterministic ODEs approach. However, the comparison of a stochastic and a non-stochastic approach, shown in Fig. 5, reminds us that, in an individual cell, the number of molecules of important signaling complexes has to be treated with great caution when using the approximation of infinite or very large numbers of molecules. It is probably not farfetched to think that there is an evolutionary pressure toward saving organic precursors and energy. Even mammalian cells could have evolved toward a minimum limit on the number of molecules for each molecular species. This limit could mean that rare signaling complexes could have (at each instant) a probability of existence well below one (0.0136 in our case). Probably a network of checks and balances (feedbacks) has evolved to keep fluctuations of cell behavior acceptably contained, while staying close to a minimum number of molecules (and a minimum cell size).
References 1. Aladjem, M.I., Pasa, S., Parodi, S., Weinstein, J.N., Pommier, Y., Kohn, K.W.: Molecular interaction maps a diagrammatic graphical language for bioregulatory networks. Sci. STKE 222, 8 (2004) 2. Batchelor, E., Loewer, A., Lahav, G.: The ups and downs of p53: understanding protein dynamics in single cells. Nat. Rev. Cancer 9(5), 371–377 (2009) 3. Benvenuti, S., Sartore-Bianchi, A., Di Nicolantonio, F., Zanon, C., Moroni, M., Veronese, S., Siena, S., Bardelli, A.: Oncogenic activation of the RAS/RAF signaling pathway impairs the response of metastatic colorectal cancers to anti-epidermal growth factor receptor antibody therapies. Cancer Res. 67(6), 2643–2648 (2007) 4. Birtwistle, M.R., Hatakeyama, M., Yumoto, N., Ogunnaike, B.A., Hoek, J.B., Kholodenko, B.N.: Ligand-dependent responses of the ErbB signaling network: experimental and modeling analyses. Mol. Syst. Biol. 3, 144 (2007) 5. Blinov, M.L., Faeder, J.R., Goldstein, B., Hlavacek, W.S.: A network model of early events in epidermal growth factor receptor signaling that accounts for combinatorial complexity. Biosystems 83(2-3), 136–151 (2006) 6. Brightman, F.A., Fell, D.A.: Differential feedback regulation of the MAPK cascade underlies the quantitative differences in EGF and NGF signalling in PC12 cells. FEBS Lett. 482(3), 169–174 (2000) 7. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25), 2340–2361 (1977), doi:10.1021/j100540a008 8. Khambata-Ford, S., Garrett, C.R., Meropol, N.J., Basik, M., Harbison, C.T., Wu, S., Wong, T.W., Huang, X., Takimoto, C.H., Godwin, A.K., Tan, B.R., Krishnamurthi, S.S., Burris 3rd, H.A., Poplin, E.A., Hidalgo, M., Baselga, J., Clark, E.A., Mauro, D.J.: Expression of epiregulin and amphiregulin and K-ras mutation status predict disease control in metastatic colorectal cancer patients treated with cetuximab. J. Clin. Oncol. 25(22), 3230 (2007)
9. Kholodenko, B.N., Demin, O.V., Moehren, G., Hoek, J.B.: Quantification of short term signaling by the epidermal growth factor receptor. J. Biol. Chem. 274(42), 30169–30181 (1999) 10. Kohn, K.W.: Molecular interaction map of the mammalian cell cycle control and DNA repair systems. Mol. Biol. Cell 10, 2703–2734 (1999) 11. Kohn, K.W., Aladjem, M.I., Weinstein, J.N., Pommier, Y.: Molecular interaction maps of bioregulatory networks: a general rubric for systems biology. Mol. Biol. Cell 17, 1–13 (2006) 12. Jorissen, R.N., Walker, F., Pouliot, N., Garrett, T.P., Ward, C.W., Burgess, A.W.: Epidermal growth factor receptor: mechanisms of activation and signaling. Exp. Cell Res. 284(1), 31–53 (2003) 13. Markevich, N.I., Moehren, G., Demin, O.V., Kiyatkin, A., Hoek, J.B., Kholodenko, B.N.: Signal processing at the Ras circuit: what shapes Ras activation patterns? Syst. Biol. (Stevenage) 1(1), 104–113 (2004) 14. Gibson, M.A., Bruck, J.: Efficient Exact Stochastic Simulation of Chemical Systems with Many Species and Many Channels. J. Phys. Chem. A 104(9), 1876–1889 (2000), doi:10.1021/jp993732q 15. Proctor, C.J., Soti, C., Boys, R.J., Gillespie, C.S., Shanley, D.P., Wilkinson, D.J., Kirkwood, T.B.L.: Modelling the actions of chaperones and their role in ageing. Mech. Ageing Dev. 126(1), 119–131 (2005) 16. Proctor, C.J., Tsirigotis, M., Gray, D.A.: An in silico model of the ubiquitinproteasome system that incorporates normal homeostasis and age-related decline. BMC Syst. Biol. 1, 17 (2007) 17. Schoeberl, B., Eichler-Jonsson, C., Gilles, E.D., Mller, G.: Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nat. Biotechnol. 20(4), 370–375 (2002) 18. Shaw, R.J., Cantley, L.C.: Ras, PI(3)K and mTOR signaling controls tumor cell growth. Nature 441, 424–430 (2006) 19. Sigal, A., Milo, R., Cohen, A., Geva-Zatorsky, N., Klein, Y., Liron, Y., Rosenfeld, N., Danon, T., Perzov, N., Alon, U.: Variability and memory of protein levels in human cells. Nature 444(7119), 643–646 (2006) 20. Tyson, J.J., Novak, B., Odell, G.M., Chen, K., Thron, C.D.: Chemical kinetic theory: understanding cell-cycle regulation. Trends Biochem. Sci. 21, 89–96 (1996) 21. Wilkinson, D.J.: Stochastic modelling for quantitative description of heterogeneous biological systems. Nat. Rev. Genet. 10(2), 122–133 (2009) 22. Wolf, J., Dronov, S., Tobin, F., Goryanin, I.: The impact of the regulatory design on the response of epidermal growth factor receptor-mediated signal transduction towards oncogenic mutations. FEBS J. 274(21), 729–743 (2007)
Basics of Game Theory for Bioinformatics

Fioravante Patrone

DIPTEM, University of Genoa
[email protected]
Abstract. This "tutorial" offers a quick introduction to game theory, together with some suggested readings on the subject. It also considers a small set of game-theoretic applications in the bioinformatics field.
1 Introduction
It could be considered strange that game theory (we shall abbreviate it as GT) is used in the field of bioinformatics. After all, not only was GT created to model economic problems, but its foundational assumptions are also very close relatives of those made in neoclassical economics: basically, the assumption that "players" are "rational" and intelligent decision makers (usually assumed to be human beings). If we deal with viruses, or genes, it is not so clear whether these basic assumptions retain any meaning, and we could proceed further, wondering whether there are "rational decision makers" around. We shall see that there are some grounds for such an extension in the scope of game theory, but at the same time we acknowledge that there is another reason (almost opposite) to use game theory in the field of bioinformatics. This second reason can be found in the fact that game theory can be seen as "math + intended interpretation". Of course, if we discard the "intended interpretation", we are left only with mathematics: by its very nature, math is "context free", so that we are authorized to use all of the mathematics that has been developed in and for game theory, having in mind whatever "intended meaning" we would like to focus on (a relevant example of this "de-contextualization" is offered in Moretti and Patrone (2008), about the so-called Shapley value). This switch in the interpretation of the mathematical tools needs sound justifications if it is to be considered a serious scientific contribution, but this can be done, as it has been done in other very different fields of mathematics. So, we shall try to emphasize a bit these two approaches to the subject of this tutorial. Since we do not assume any previous knowledge of game theory, we shall start with a very quick sketch of the basics of game theory (section 2), at least to make it possible to use its language. Since game theory is a subject of considerable breadth and depth, for the reader interested in going further (maybe considering applying GT on his or her own) we cannot do anything better than provide suggestions for further reading. For this reason, in section 3 we shall provide a concise guide to the main relevant literature, especially to GT books.
We shall then move on to illustrate some of the GT applications in the field of bioinformatics. Due to our personal contributions to the field, we shall mainly stress the applications of a basic concept and tool for "cooperative games": the so-called Shapley value. This will be the subject of the first subsection of section 4. The second subsection will be a very quick tour touching on some of the diverse ways in which GT has been used in bioinformatics.
2 Game Theory: Its Basics
There is no doubt about the date of birth of GT: it is 1944, the year in which Theory of Games and Economic Behavior, by John von Neumann and Oskar Morgenstern, appeared. That book built the language of the discipline and its basic models (still used today, with the improvement given by Kuhn (1953) to the definition of games in extensive form), and strongly suggested that GT should be the adequate mathematical tool to model economic phenomena. Of course, there was some GT before 1944 (extremely important is the so-called "minimax theorem" of von Neumann, which appeared in 1928), but that book set the stage on which many actors have been playing since then. Since the "birth" of GT, a great deal of models, results, applications, and foundational deepening has accumulated. After more than 60 years, GT is really an indispensable tool in many areas of economic theory (consumers' behavior, theory of the firm, industrial organization, auctions, public goods, etc.), and has spread its scope to other social sciences (politics, sociology, law, anthropology, etc.), and much beyond. The core situation that GT models is one in which individuals "strategically interact". The way in which interaction is meant is the following: there is a set of individuals, each of whom has to make a choice from a set of available actions; it is the full set of choices, one for each individual, that determines the result (or outcome). An important point is that individuals generally have different preferences on the final outcomes. Moreover, interaction is said to be "strategic": it is important that individuals are aware of this interactive situation, so that they are urged to analyze it, and for that it will be important to know the structure of the interactive situation, the information available, and also the amount of intelligence of the players. The form most used to model such a situation is the so-called "strategic form". It is quite easy to describe: it is a tuple G = (N, (X_i)_{i∈N}, (f_i)_{i∈N}), where:

- N is the (finite) set of players;
- X_i is the set of actions (usually called "strategies") available to player i ∈ N;
- f_i : X → R, where X = ∏_{i∈N} X_i, are the "payoffs".

Many things should be said about the payoffs, but the main point is the following: a so-called "strategy profile", i.e. x ∈ X, determines an outcome h(x); f_i(x) is the personal evaluation of that outcome by player i, measured on some scale. Using a terminology which is standard in neoclassical economics, one can
see f_i as the composition of the function h with the "utility function" u_i of player i (u_i is defined on the set of outcomes). We introduce here a couple of very simple (but quite relevant) examples of games in strategic form. The first one is the so-called "prisoner's dilemma": N = {I, II}, X_I = X_II = {C, NC}, and f_I, f_II are described in Table 1 (where, e.g., f_I(C, NC) = 4 and f_II(C, NC) = 1).

Table 1. Prisoner's dilemma

I\II      C          NC
C         (2, 2)     (4, 1)
NC        (1, 4)     (3, 3)
The description of a game by means of a table like Table 1 is quite common when there are two players with finite strategy sets (the table is often referred to as a "bimatrix", since each cell contains two numbers, the payoffs to the two players). A second example is the "battle of the sexes", which is given in Table 2.

Table 2. Battle of the sexes

I\II      L          R
T         (2, 1)     (0, 0)
B         (0, 0)     (1, 2)
It is important to stress that the standard assumption behind a game modeled in strategic form is that each player chooses his strategy without knowing the choices of the other players (just think of a sealed-bid auction). The relevant question is: what is a "solution" for a game in strategic form? Before answering this question, it is important to notice that game theory can be seen both from a prescriptive (or normative) and from a descriptive (or positive) point of view; these "dual" points of view are often found in models for the social sciences. That is, a "solution" can be seen either as:

- what players should do (prescriptive point of view), or
- what we expect that players (will) do (descriptive point of view).

The basic idea of solution for strategic games is the Nash equilibrium, which can find justifications on both normative and descriptive grounds. The formal definition is as follows. Given a game G, a strategy profile x̄ ∈ X is a Nash equilibrium if:

- for every i ∈ N, f_i(x̄) ≥ f_i(x_i, x̄_{-i}) for all x_i ∈ X_i

(here x̄_{-i} denotes the remaining "coordinates" of x̄, after having deleted the i-th). An essential condition behind the use of the Nash equilibrium as a solution for games in strategic form brings us to consider the so-called "institutional setting"
under which the game is played: we refer here to the fact that players cannot sign binding agreements. This assumption is fundamental in considering the game as a non-cooperative game: if players can sign binding agreements, we say that we have a cooperative game. Namely, if we assume that agreements are not binding, one can see the definition of Nash equilibrium as some kind of stability condition for the agreement x̄: the condition that defines a Nash equilibrium amounts to saying that no player has an incentive to unilaterally deviate from the agreement. What are the Nash equilibria for the two examples introduced before? For the prisoner's dilemma, the unique Nash equilibrium is (C, C), which yields a result that is not efficient. This surprising fact (players are rational and intelligent!) has been the reason for a lot of interest in this game, which is by far the "most famous" one in GT. For the battle of the sexes there are two Nash equilibria: (T, L) and (B, R). Not only does the battle of the sexes provide an example in which the Nash equilibrium is not unique, but it has the additional feature that players have opposite preferences on these two equilibria: player I prefers (T, L), while player II prefers (B, R). But the most interesting point is that the battle of the sexes shows quite clearly that a Nash equilibrium is a profile of strategies, and that it is (generally) nonsense to speak of equilibrium strategies for the players: which would be the equilibrium strategies for player I? It has to be stressed that the model of a game in strategic form can be used in principle to represent both cooperative and non-cooperative games, but de facto it is essentially used only for non-cooperative games. The same thing can be said of the second basic model used in GT: games in extensive form. We shall not discuss this model, which is useful to describe interactive situations in which the "timing" of the choices made by players is relevant (chess, poker, English auctions): we shall just mention the fact that it can be considered as a natural extension of a "decision tree". It is also important that a game in extensive form can be transformed in a canonical way into a game in strategic form: this fact (already emphasized in von Neumann (1928)) is relevant in extending the scope of games in strategic form, going beyond the immediate and trivial interpretation (players decide "simultaneously"). This is obtained by means of the idea of "strategy" as an action plan, allowing the use of games in strategic form to describe also situations in which players are called to make choices that can be based on the observation of previous choices made by other players. So far, we have almost left out the cooperative games. Even if the strategic and extensive form models can be used, the most common model for cooperative games is given by the so-called games in "characteristic form" (also called "in coalitional form"). Such a model provides a description of the interactive situation which is much less detailed than those provided by the other two models. Apart from this, it is a model which is "outcome oriented", since it displays the "best" results that can be achieved (possibly using binding agreements) by coalitions, i.e., groups of players.
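Before turning to cooperative games, the Nash equilibria just computed by hand for Tables 1 and 2 can also be checked mechanically. The following short Python sketch is not part of the original tutorial (function and variable names are ours); it simply applies the definition, declaring a profile an equilibrium when neither player can gain by a unilateral deviation.

# Pure-strategy Nash equilibria of a two-player game in strategic form.
# payoffs[(r, c)] = (payoff to player I, payoff to player II) when I plays
# row r and II plays column c.

def pure_nash_equilibria(rows, cols, payoffs):
    equilibria = []
    for r in rows:
        for c in cols:
            u1, u2 = payoffs[(r, c)]
            # player I must not gain by deviating to another row
            best_for_I = all(payoffs[(r2, c)][0] <= u1 for r2 in rows)
            # player II must not gain by deviating to another column
            best_for_II = all(payoffs[(r, c2)][1] <= u2 for c2 in cols)
            if best_for_I and best_for_II:
                equilibria.append((r, c))
    return equilibria

prisoners_dilemma = {("C", "C"): (2, 2), ("C", "NC"): (4, 1),
                     ("NC", "C"): (1, 4), ("NC", "NC"): (3, 3)}
battle_of_sexes = {("T", "L"): (2, 1), ("T", "R"): (0, 0),
                   ("B", "L"): (0, 0), ("B", "R"): (1, 2)}

print(pure_nash_equilibria(["C", "NC"], ["C", "NC"], prisoners_dilemma))
# [('C', 'C')]
print(pure_nash_equilibria(["T", "B"], ["L", "R"], battle_of_sexes))
# [('T', 'L'), ('B', 'R')]

As expected, the check recovers (C, C) for the prisoner's dilemma and the two equilibria (T, L), (B, R) for the battle of the sexes.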
We give here the formal definition of a game in characteristic form. To be more precise, we shall confine ourselves to the (simpler) version that is identified as the "transferable utility" case. A game in characteristic form with transferable utility (for short: a TU-game) is a couple (N, v), where:

- N is a finite set (the players);
- v : P(N) → R, with v(∅) = 0, is the so-called "characteristic function".

Of course, the key ingredient is v, the characteristic function, which assigns to each subset of N (usually referred to as a "coalition" of players) a real number which has the meaning of describing (in some common scale) the best result that can be achieved by them. As can be clearly seen, there is no room for "strategies" in this model: the basic information is the (collective) outcome that can be achieved by each coalition. Due to the quite different approach with respect to the model used for non-cooperative games, it should not come as a surprise that the "solutions" available for TU-games obey different types of conditions, compared to the Nash equilibrium. Before describing the two main solution concepts for TU-games (the core and the Shapley value), it is worth emphasizing a relevant difference between cooperative and non-cooperative games as far as the approach to the idea of a solution is concerned. Games in strategic and extensive form, used in the non-cooperative setting, try to offer an adequate description of the possibilities of choice available to the players: this fact allows us to use essentially a basic principle (that of Nash equilibrium) as the ground for solution concepts. (It must be noticed that there are other relevant solutions besides the Nash equilibrium. To quote the most relevant, we can mention: subgame perfect equilibrium, perfect equilibrium, proper equilibrium, sequential equilibrium. These different solution concepts try to cope with deficiencies of the Nash equilibrium, but are no more than (relevant) variants of the basic idea behind it.) On the other hand, the astounding simplicity of a TU-game has a price to be paid: the huge amount of detail which is missing from the model has to be recovered somehow when we pass to the "solution" issue. As we shall see, the two main solutions that we shall describe obey different lines of thought. The overall result is that we find significantly different groups of solution concepts: to give just an example, the "bargaining set", the "kernel", and the "nucleolus" are three different solutions that all try to incorporate conditions different from those incorporated in the core or the Shapley value. (The "bargaining set" tries to take care of the possibilities of "bargaining" among the players, through proposals for an allocation and the objections and counter-objections that can be made with respect to the proposed allocation. The "kernel" and the "nucleolus" were introduced having in mind approaches to find allocations belonging to the bargaining set, which is difficult to determine.) Having said this, let us consider one of the two anticipated solution concepts for a TU-game: the core. In the remainder of this section we shall assume that the TU-games we consider satisfy the superadditivity condition (this is not an unavoidable condition, but it allows for more "natural" interpretations of the formal conditions that we shall use):
- A TU-game is said to be superadditive if v(S ∪ T) ≥ v(S) + v(T) for any S, T ⊆ N such that S ∩ T = ∅.

The interpretation of this condition is quite natural: if two groups of players join together, they should be able to achieve at least the sum of what they are able to achieve acting separately. For a superadditive game, the best collective results can be achieved when all of the players act together: so, a critical point in identifying a solution is to look at v(N) and at reasonable methods to split it among the players. Some terminology is needed:

- x = (x_i)_{i∈N} ∈ R^N is said to be an allocation;
- an allocation x is said to be an imputation if:
  - x_i ≥ v({i}) for each i ∈ N (individual rationality condition);
  - Σ_{i∈N} x_i = v(N) (efficiency condition).

It is clear that being an imputation is essentially a necessary condition for a reasonable way of dividing v(N) among the players: if the individual rationality condition is violated for some i, this player will prefer to "play alone", instead of joining the "grand coalition" N. The "core" of a TU-game can be seen as an almost straightforward extension of the individual rationality condition to some sort of "group rationality". The definition is actually quite simple. An allocation x = (x_i)_{i∈N} ∈ R^N belongs to the core of the game (N, v) if:

- Σ_{i∈S} x_i ≥ v(S) for all S ⊆ N;
- Σ_{i∈N} x_i = v(N).

It could be surprising, maybe, but the core of a superadditive TU-game can be empty. A standard example is given by the so-called simple majority game (N = {1, 2, 3}): v(N) = v({1, 2}) = v({1, 3}) = v({2, 3}) = 1, v({1}) = v({2}) = v({3}) = 0. If an allocation satisfies the conditions for being in the core, it must be:

- x_1 + x_2 ≥ 1
- x_1 + x_3 ≥ 1
- x_2 + x_3 ≥ 1

From this we get that x_1 + x_2 + x_3 ≥ 3/2, which contradicts the efficiency condition. Another interesting example is given by the so-called "glove game". Members of N belong to two subsets that partition N, "L" and "R", whose intended meaning is that each player owns exactly one "left" or "right" glove, respectively. Unpaired gloves are worthless, pairs of gloves have value 1: this is enough to get a TU-game (clearly v(S) = min{#(S ∩ L), #(S ∩ R)}).
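As a small numerical companion to these two examples (an illustrative sketch of our own, with hypothetical function names), one can test core membership directly from the definition: an allocation is in the core when it is efficient and no coalition can improve upon it. The snippet below confirms that no allocation survives the three inequalities of the simple majority game, and that in a tiny glove game with one left-glove owner and two right-glove owners the allocation giving everything to the left-glove owner does lie in the core.

from itertools import combinations

def in_core(players, v, x):
    # x is an allocation (dict player -> amount); v maps frozensets to worths.
    grand = frozenset(players)
    if abs(sum(x[i] for i in players) - v[grand]) > 1e-9:      # efficiency
        return False
    for k in range(1, len(players) + 1):
        for S in combinations(players, k):
            if sum(x[i] for i in S) < v[frozenset(S)] - 1e-9:   # coalitional rationality
                return False
    return True

N = [1, 2, 3]

# Simple majority game: any coalition of at least two players wins.
v_majority = {frozenset(S): (1 if len(S) >= 2 else 0)
              for k in range(1, 4) for S in combinations(N, k)}
print(in_core(N, v_majority, {1: 1/3, 2: 1/3, 3: 1/3}))   # False
print(in_core(N, v_majority, {1: 0.5, 2: 0.5, 3: 0.0}))   # False (the core is empty)

# Glove game with L = {1} and R = {2, 3}: giving everything to player 1 works.
L, R = {1}, {2, 3}
v_glove = {frozenset(S): min(len(set(S) & L), len(set(S) & R))
           for k in range(1, 4) for S in combinations(N, k)}
print(in_core(N, v_glove, {1: 1.0, 2: 0.0, 3: 0.0}))       # True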
If #R = #L + 1, there is a unique allocation in the core: "left" owners get 1, "right" owners get 0. This is true for all values of #L, even if #L = 10^6. If one can imagine that left owners have some kind of advantage, due to the relative scarcity of left gloves with respect to right gloves, this advantage intuitively does not seem strong enough (especially for "big" values of #R) to justify the fact that left owners get everything, leaving nothing to the right-glove owners. Despite these examples, the core is an important solution concept, and it has proved to be especially useful in economic applications, in particular for the class of the so-called market games, for which it can be proved (under reasonable assumptions) that it is always non-empty. It is interesting to notice that, despite its simplicity, it was not proposed as a solution in the seminal book of von Neumann and Morgenstern. It appeared later, in 1953, proposed by Gillies in his PhD thesis (Gillies (1953)), which was devoted to the analysis of "stable sets": the idea of a "stable set" was the proposal of von Neumann and Morgenstern as a solution for TU-games. This kind of solution, quite interesting from a theoretical point of view, proved to be impractical, except for games with few players, offering moreover in some cases results that do not admit any kind of sensible interpretation. In the same year in which the idea of the "core" appeared, another, quite different, proposal was made by Shapley (1953). There are many reasons of interest for the solution proposed by Shapley (now named the Shapley value), and we shall quickly list the most prominent here:

- contrary to the core, which is a set of imputations, it is a single-valued solution;
- it was conceived as a way to "predict" the gain that a player could expect (on average) from playing a given game;
- it was introduced in an "axiomatic" way (the axiomatic approach had been applied in a very successful way by Arrow (1951) and Nash (1950) a few years before): that is, Shapley introduced a set of conditions that a solution should satisfy, and proved that there is a unique way to associate to any TU-game an allocation that satisfies these conditions.

Let us see a set of axioms that determine the Shapley value (it is a set of conditions a bit different from that used by Shapley, but the differences are more formal than substantial and offer us a quicker way to approach it). Given a set of players N, consider the set of all superadditive TU-games having N as player set: SG(N). We want to find a map Φ : SG(N) → R^N that satisfies the following conditions:

- EFF: Σ_{i∈N} Φ((N, v))_i = v(N).
- SYM: Symmetric players get the same. Two players i and j are symmetric if v((S ∪ {i}) \ {j}) = v((S ∪ {j}) \ {i}) for all S ⊆ N.
- NPP: If i is a "null player", i.e. v(S ∪ {i}) = v(S) for all S ⊆ N, then i gets zero.
- ADD: Φ(v + w) = Φ(v) + Φ(w) for all v, w ∈ SG(N). The game v + w is defined "pointwise" as follows: (v + w)(S) = v(S) + w(S) for any coalition S.
The interesting result is that there is a unique Φ satisfying these four conditions, and this map is called the "Shapley value". It can be shown that the Shapley value is an appropriate average of the marginal contributions of the players (the "marginal contribution" of player i to a coalition S is v(S ∪ {i}) − v(S)), from which one can get the following formula that expresses the Shapley value:

Φ_i(v) = (1/n!) Σ_σ m_i^σ(v).
Here σ is a permutation of N; j is the unique element in N such that i = σ(j), and m_i^σ(v) = v({σ(1), σ(2), . . . , σ(j)}) − v({σ(1), σ(2), . . . , σ(j − 1)}). As could be expected, in the simple majority game the Shapley value assigns 1/3 to each player (this is immediately seen using SYM and EFF). For the glove game, if #L = 10^6, we have that "L" players get approximately 0.500443 and "R" players 0.499557: a much more "reasonable" result than the one offered by the core. It is interesting to notice that significantly different axiomatic characterizations are available: Young (1985), Hart and Mas-Colell (1987). Another aspect worth mentioning is that the axiomatic characterization given by Shapley does not necessarily "work" if we restrict attention to subsets of SG(N). A particularly relevant example is provided by the class of simple games: a TU-game is simple if v assumes just the values 0 or 1, and v(N) = 1. It is obvious that summing two simple games we do not get a simple game (at the very least, v(N) would be equal to 2), so the ADD condition becomes completely irrelevant, and the three remaining axioms are not enough for a complete characterization of the Shapley value for this class of games. Dubey (1975) has shown that it is enough to substitute ADD with TRANSF:

- TRANSF: For any two simple games v and w, Φ(v ∨ w) + Φ(v ∧ w) = Φ(v) + Φ(w).

A similar problem is found for the class of microarray games, which we shall discuss extensively in section 4.
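The permutation formula above can be turned into a few lines of code for games with a small number of players. The sketch below is our own illustration (brute-force enumeration of all orderings, so only feasible for small n): it reproduces the value 1/3 for each player in the simple majority game, and shows on a small glove game (2 left gloves, 3 right gloves) the same qualitative asymmetry between the two sides that the 0.500443/0.499557 figures display for the huge game.

from itertools import permutations, combinations
from fractions import Fraction
from math import factorial

def all_coalitions(players):
    for k in range(len(players) + 1):
        for S in combinations(players, k):
            yield frozenset(S)

def shapley_value(players, v):
    # Average the marginal contributions m_i^sigma(v) over all orderings sigma.
    n = len(players)
    phi = {i: Fraction(0) for i in players}
    for order in permutations(players):
        coalition = frozenset()
        for i in order:
            bigger = coalition | {i}
            phi[i] += Fraction(v[bigger] - v[coalition])
            coalition = bigger
    return {i: phi[i] / factorial(n) for i in players}

# Simple majority game on three players: 1/3 for everybody.
N = [1, 2, 3]
v_majority = {S: (1 if len(S) >= 2 else 0) for S in all_coalitions(N)}
print(shapley_value(N, v_majority))          # {1: 1/3, 2: 1/3, 3: 1/3}

# A small glove game (2 left-glove owners, 3 right-glove owners): left owners
# get 13/20 each and right owners 7/30 each.
players = ["L1", "L2", "R1", "R2", "R3"]
left = {"L1", "L2"}
v_glove = {S: min(len(S & left), len(S - left)) for S in all_coalitions(players)}
print(shapley_value(players, v_glove))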
3 Suggested Readings
As said in the introduction, the room allowed for introducing basic facts about GT is obviously too small to consider even quite relevant questions: to quote just one, a due discussion about the meaning of the payoffs and of the values of the characteristic function. So, we shall provide a relatively small set of bibliographical items that could be useful for readers willing to go beyond these short notes. First of all, books that are good as a "preliminary step", almost at a popular level: Davis (1970) and Straffin (1996). A more complete introduction to GT is offered by the books of Osborne (2003), Binmore (1992), Dutta (1999), and Luce and Raiffa (1957). At a more advanced level, I would suggest three books that are essentially texts for graduate students: Owen (1968), a very good source for cooperative games; Myerson (1991), whose highlights are (in my humble opinion) the chapters on utility theory, on communication, and on cooperation under uncertainty; and
Osborne and Rubinstein (1994), which I would recommend for repeated games and implementation theory. A text that is worth considering as a general reference source is the Handbook of Game Theory, in three volumes, edited by Aumann and Hart (1992). It is worth mentioning, having in mind the bioinformatics reader, a recent volume entitled "Algorithmic Game Theory", edited by Nisam et al. (2007). Interesting sources for the discussion of foundational and interpretational issues are the already quoted book by Luce and Raiffa, Kreps (1990), and Aumann (1987). As far as web-based sources are concerned, the web page maintained by Shor is quite rich: http://www.gametheory.net. A lot of material (most of it in Italian, though a video course on GT, available for download and in English, can also be found) is available on my web page: http://dri.diptem.unige.it. To conclude, I mention a video lecture by Tommi Jaakkola, MIT: "Game theoretic models in molecular biology", http://videolectures.net/pmsb06_jaakkola_gtmmb/.
4 Applications of Game Theory
In this section we shall concentrate on applications that use the Shapley value for cooperative games, offering for the other applications just a quick description and references.
4.1 Using the Shapley Value
A first application that we shall briefly describe is Kaufman et al. (2004). The focus of this contribution is the RAD6 DNA repair pathway of S. cerevisiae (yeast). Such a pathway involves 12 genes, and the question is to measure the relevance of each gene for this pathway. The laboratory technique used is somewhat classical, i.e., the knock-out of genes. The critical factor is that groups of genes are knocked out: in such a way a "coalition" S is identified, composed of the remaining genes. The value v(S) assigned to S is the survival rate after UV radiation exposure. In such a way one gets a TU-game (N, v). There is a non-trivial problem, however. Even if the knock-out technique is classical, one has to consider that the number of subsets of N = {1, 2, . . . , 12} is 2^12 = 4096, so that it is not sensible to conduct all of the experiments needed to get "the whole game" (N, v). This is a quite common issue in "practical" applications of GT, and it has been analyzed in Moretti and Patrone (2004), which takes into account the fact that obtaining the data for knowing (N, v) is a costly operation, so that one has to balance costs with the main aim of the application (in that paper, the focus was on the fairness of the imputation of costs). Getting back to the contribution of Kaufman et al. (2004), they propose the Shapley value to evaluate the relevance of the different genes in the RAD6 pathway. Of course, not having all of the data for (N, v), they need some approximation/estimation techniques, for which we refer the reader to the paper.
We shall now switch to another application of GT, to which we shall devote some more space, since it presents at least a couple of non-trivial questions related to the application of the GT tools. In Moretti et al. (2007) a subclass of TU-games has been introduced, named "microarray games", aiming at answering the following research question: find a relevance index for genes (using the data obtained by a microarray gene experiment), taking into account the way in which groups of genes "move together" when switching from the healthy to the ill condition. We shall focus here on microarray data from mRNA expression. Having selected a set of genes to analyze, each sample from an experiment will provide an array of data (expression levels for the selected genes), so that the outcome of an experiment will be a matrix collecting the data from all samples. The starting point for applying GT is to obtain a TU-game (a microarray game) from such a matrix. This is achieved in two steps: the first one is a discretization procedure, which converts the expression matrix into a boolean matrix. This is done by fixing thresholds (the choice of the threshold is clearly delicate: we refer to the original paper of Moretti et al. (2007) for some discussion about it) that distinguish, e.g., "normally expressed" genes (to which the value 0 will be attributed in the boolean matrix) from "abnormally expressed" genes (value 1). Given a boolean matrix and a set ("coalition") of genes S, v(S) is defined in the following way: it is the number of samples for which S contains all of the genes to which the value 1 is assigned, divided by the total number of samples. Having the microarray game (N, v), one can apply all of the mathematical tools developed for TU-games, and refer to any kind of solution, Shapley value included. Of course, there remains the problem of justifying the use of a particular "solution", given the context. The path followed in the paper is somehow reversed with respect to what has just been noticed. The idea followed is to propose a set of conditions that a "relevance index" should satisfy and then see what one gets. A critical role is reserved for partnerships of genes. Partnerships in the context of TU-games were introduced by Kalai and Samet (1988) to analyze the weighted Shapley value. The definition is the following:

- given (N, v), S ∈ P(N) \ {∅} is a partnership if: for each proper subset T of S and each R ⊆ N \ S, v(R ∪ T) = v(R).

So a partnership, in the setting of microarray games, represents a set of genes that are relevant when they "move together", and so it is natural to attribute an adequate "weight" to them. The assumptions made on the "relevance index" (a map F, defined on the set M(N) of microarray games with gene set N and with values in R^N) that is sought are given by the following three conditions ("axioms"), which involve partnerships:

- PR: The solution F has the Partnership Rationality property if Σ_{i∈S} F_i(N, v) ≥ v(S) for each S ∈ P(N) \ {∅} such that S is a partnership of genes in the game v.
The PR property determines a lower bound on the power of a partnership, i.e., the total relevance of a partnership of genes in determining the onset of the tumor in the individuals should not be lower than the average number of cases of tumor enforced by the partnership itself.

- PF: The solution F has the Partnership Feasibility property if Σ_{i∈S} F_i(N, v) ≤ v(N) for each S ∈ P(N) \ {∅} such that S is a partnership of genes in the game v.

Contrary to PR, the PF property fixes a natural upper bound on the power of a partnership: the total relevance of a partnership of genes in determining the tumor onset in the individuals should not be greater than the average number of cases of tumor enforced by the grand coalition.

- PM: The solution F has the Partnership Monotonicity property if F_i(N, v) ≥ F_j(N, v) for each i ∈ S and each j ∈ T, where S, T ∈ P(N) \ {∅} are partnerships of genes in v such that S ∩ T = ∅, v(S) = v(T), v(S ∪ T) = v(N), |S| ≤ |T|.

The PM property is very intuitive: genes in the smaller partnership should receive a higher relevance index than genes in the bigger one, where the likelihood that some genes are redundant is higher. To these crucial axioms a couple of "almost obvious" conditions have been added:

- NG: For a null gene (i.e., a null "player") i ∈ N, F_i(N, v) = 0.
- ES: We refer to Moretti et al. (2007) for the somewhat technical definition. The intended meaning is that all of the samples deserve the same importance.

It turns out that there is a unique "relevance index" that satisfies these five conditions, and that it coincides with the Shapley value. This theoretical result poses a "practical" computational question: is it possible to compute the Shapley value for a given microarray game, taking into account that the "players" are of the order of thousands, or tens of thousands? The answer, luckily, is positive. Not only is it possible, but the computation is also very easy and very quick: the reason for this is to be found in the fact that the Shapley value can be computed directly from the discretized (boolean) matrix, without having to consider the microarray game. This is not a new fact in GT: for example, in the field of the so-called "Operations Research games" it often happens that a GT solution can be computed directly from the data of the original OR problem, without the need of knowing the game that has been built on it, the "Bird rule" being perhaps the most well-known example of this kind (it provides an element of the core of a "minimum cost spanning tree" game just by looking directly at the information on the connection costs associated to the connection graph; see Bird (1976)). Going back to "microarray games" and the Shapley value, the meaning of what we have noticed is that the role of GT is mainly connected with the justification of the computations that will rank genes in order of relevance.
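The computational remark just made can be illustrated with a short sketch. Since a microarray game is an average of unanimity games, and the Shapley value of the unanimity game u_R gives 1/|R| to each member of R, linearity of the Shapley value yields the gene ranking in a single pass over the boolean matrix. The code below is our own illustration (it assumes numpy is available; the tiny matrix is made up, with rows as genes and columns as samples, and 1 marking abnormal expression, as in the definition above).

import numpy as np

def shapley_from_boolean_matrix(B):
    # The microarray game is the average of the unanimity games on the column
    # supports, so each gene receives 1/|support| per sample containing it,
    # averaged over the samples.
    n_genes, n_samples = B.shape
    phi = np.zeros(n_genes)
    for j in range(n_samples):
        support = np.flatnonzero(B[:, j])
        if support.size > 0:
            phi[support] += 1.0 / support.size
    return phi / n_samples

# Tiny made-up matrix: 4 genes, 3 samples.
B = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
print(shapley_from_boolean_matrix(B))
# [1/3, 1/6, 1/3, 1/6]: genes 0 and 2 are abnormal in two samples,
# genes 1 and 3 in one sample each.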
Applications of this approach have been made in Moretti et al. (2007) to colon cancer data, provided in Alon et al. (1999); to data about neuroblastoma, in Albino et al. (2008); to the influence of pollution, in Moretti et al. (2008), where the GT analysis is combined with classical statistical tools; and to autism, in Esteban and Wall (2009). A comparison has also been made between the use of the Shapley value and of the Banzhaf value (originally introduced as a "power index" for simple games by Banzhaf (1965), and later extended to a value defined on all of SG(N)) for microarray games, in Lucchetti et al. (2009).
4.2 A Quick Look at Other GT Applications
We shall give here a very concise description of a few GT applications that offer an idea of the different GT tools and the different applied fields falling under the heading of "game-theoretic applications in bioinformatics". In Kovács et al. (2005), in the context of energy and topological networks of proteins, the possibility is discussed of introducing protein games to reduce the number of energy landscapes occurring when a protein binds to another macromolecule. Bellomo and Delitala (2008) analyze the role of the mathematical kinetic theory of active particles in modelling the early stages of cancer phenomena, including stochastic games among the available tools. Wolf and Arkin (2003) stress that regulatory network dynamics in bacteria can be analyzed using motifs, modules and games. In particular, it is suggested that the dynamics of modules determines the strategies used by the organism to survive in the game in which cells and the environment are involved. Frick and Schuster (2003) suggest that, for some range of the relevant parameters, microorganisms that can choose between two different pathways for ATP production can be engaged in a prisoner's dilemma game. The effect is that of an "inefficient" outcome, meaning in this setting a waste of resources. In Nowik (2009) the competition that takes place among motoneurons that innervate a muscle is described. The problem is modeled as a game, where the "strategies" for the motoneurons are their "activity levels". This model is able to justify the emergence of the so-called "size principle", and to provide predictions which can be subjected to experimental testing. The last paper that we mention is Schuster et al. (2008), which is a survey on the use of evolutionary game theory to analyze biochemical and biophysical systems.
References

Albino, D., Scaruffi, P., Moretti, S., Coco, S., Truini, M., Di Cristofano, C., Cavazzana, A., Stigliani, S., Bonassi, S., Tonini, G.P.: Identification of low intratumoral gene expression heterogeneity in Neuroblastic Tumors by wide-genome expression analysis and game theory. Cancer 113, 1412–1422 (2008)
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 96, 6745–6750 (1999) Arrow, K.J.: Social Choice and Individual Values. Wiley, New York (1951); 2nd edn. (1963) Aumann, R.J.: Game Theory. In: Eatwell, J., Milgate, M., Newman, e P. (eds.) The new Palgrave Dictionary of Economics, pp. 460–482. Macmillan, London (1987) Aumann, R.J., Hart, S. (eds.): Handbook of Game Theory, vol. 3 vols. North-Holland, Amsterdam (1992) Banzhaf, J.F.: III: Weighted voting doesn’t work: A game theoretic approach. Rutgers Law Review 19, 317–343 (1965) Bellomo, N., Delitala, M.: From the mathematical kinetic, and stochastic game theory to modelling mutations, onset, progression and immune competition of cancer cells. Physics of Life Reviews 5, 183–206 (2008) Binmore, K.: Fun and games. Heath and Company, Lexington, MA (1992) Bird, C.G.: On cost allocation for a spanning tree: A game theoretic approach. Networks 6, 335–350 (1976) Davis, M.D.: Game Theory: A Nontechnical Introduction. Basic Books, New York (1970); 2nd edn. (1983) (reprinted 1997 by Dover, Mineola) Dubey, P.: On the Uniqueness of the Shapley Value. International Journal of Game Theory 4, 131–139 (1975) Dutta, P.K.: Strategies and Games: Theory and Practice. MIT Press, Cambridge (1999) Esteban, F.J., Wall, D.P.: Using game theory to detect genes involved in Autism Spectrum Disorder. Top (2009) (in press) doi: 10.1007/s11750-009-0111-6 Frick, T., Schuster, S.: An example of the prisoner’s dilemma in biochemistry. Journal Naturwissenschaften 90, 327–331 (2003) Gillies, D.B.: Some Theorems on n-Person Games. PhD thesis, Department of Mathematics, Princeton University, Princeton (1953) Hart, S., Mas-Colell, A.: Potential, value and consistency. Econometrica 57, 589–614 (1987) Kalai, E., Samet, D.: Weighted Shapley Values. In: Roth, A. (ed.) The Shapley Value, Essays in Honor of Lloyd S. Shapley, pp. 83–100. Cambridge University Press, Cambridge (1988) Kaufman, A., Kupiec, M., Ruppin, E.: Multi-knockout genetic network analysis: the Rad6 example. In: Proceedings of the 2004 IEEE computational systems bioinformatics conference, CSB 2004 (2004) Kovács, I., Szalay, M., Csermely, P.: Water and molecular chaperones act as weak links of protein folding networks: Energy landscape and punctuated equilibrium changes point towards a game theory of proteins. FEBS Letters 579(11), 2254–2260 (2005) Kreps, D.M.: Game Theory and Economic Modeling. Oxford University Press, Oxford (1990) Kuhn, H.W.: Extensive Games and Problems of Information. In: Kuhn, H.W., Tucker, e A.W. (eds.) Contributions to the Theory of Games. Annals of Math. Studies, vol. 28 (1953) Lucchetti, R., Moretti, S., Patrone, F., Radrizzani, P.: The Shapley and Banzhaf values in microarray games. Computers & Operations Research (2009) (in press), doi:10.1016/j.cor.2009.02.020 Luce, R.D., Raiffa, H.: Games and Decisions. Wiley, New York (1957)
Moretti, S., van Leeuwen, D., Gmuender, H., Bonassi, S., van Delft, J., Kleinjans, J., Patrone, F., Merlo, F.: Combining Shapley value and statistics to the analysis of gene expression data in children exposed to air pollution. BMC Bioinformatics 9, 361 (2008) Moretti, S., Patrone, F.: Cost Allocation Games with Information Costs. Mathematical Methods of Operations Research 59, 419–434 (2004) Moretti, S., Patrone, F.: Transversality of the Shapley value. Top 16, 1–41 (2008); Invited paper: from pages 42 to 59, comments by Vito Fragnelli, Michel Grabisch, Claus-Jochen Haake, Ignacio García-Jurado, Joaquín Sánchez-Soriano, Stef Tijs; rejoinder at pages 60–61 Moretti, S., Patrone, F., Bonassi, S.: The class of Microarray games and the relevance index for genes. Top 15, 256–280 (2007) Myerson, R.B.: Game Theory: Analysis of Conflict. Harvard University Press, Cambridge (1991) Nash Jr., J.F.: The Bargaining Problem. Econometrica 18, 155–162 (1950) von Neumann, J.: Zur Theorie der Gesellschaftsspiele. Matematische Annalen 100, 295–320 (1928) von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior. Princeton University Press, Princeton (1944), 2nd edn. (1947); 3rd edn. (1953) Nisam, N., Roughgarden, T., Tardos, É., Vazirani, V.V. (eds.): Algorithmic Game Theory. Cambridge University Press, Cambridge (2007), http://www.cambridge.org/journals/nisan/downloads/ nisan_non-printable.pdf Nowik, I.: The game motoneurons play. Games and Economic Behavior 66, 426–461 (2009) Osborne, M.J.: An introduction to game theory. Oxford University Press, Oxford (2003) Osborne, M., Rubinstein, A.: A course in Game Theory. MIT Press, Cambridge (1994), http://theory.economics.utoronto.ca/books/ Owen, G.: Game Theory. Academic Press, New York (1968), 2nd edn. (1982); 3rd edn. (1995) Schuster, S., Kreft, J.-U., Schroeter, A., Pfeiffer, T.: Use of Game-Theoretical Methods in Biochemistry and Biophysics. Journal of Biological Physics 34, 1–17 (2008) Shapley, L.S.: A Value for n-Person Games. In: Kuhn, H.W., Tucker, e.A.W. (eds.) Contributions to the Theory of Games, II. Annals of Math. Studies, vol. 28, pp. 307–317. Princeton University Press, Princeton (1953) Straffin, P.: Game Theory and Strategy. The Mathematical Association of America, Washington DC (1995) Wolf, M., Arkin, D.M., Motifs, A.P.: modules and games in bacteria. Current Opinion in Microbiology 6, 125–134 (2003) Young, H.P.: Monotonic Solutions of Cooperative Games. International Journal of Game Theory 14, 65–72 (1985)
Microarray Data Analysis via Weighted Indices and Weighted Majority Games

Roberto Lucchetti and Paola Radrizzani

Dipartimento di Matematica, Politecnico di Milano, Italy
[email protected], [email protected]
Abstract. A recent paper ([12]) introduces the notion of microarray game, a new class of cooperative games, with the aim of ranking genes potentially responsible for genetic diseases (especially tumors). A microarray game is, in short, an average of unanimity games, where each game corresponds to a patient: the underlying assumption is that the set of the abnormally expressed genes is, as a whole, responsible for the disease in each patient, and thus is globally the (minimal) "winning" coalition of the corresponding game. The Shapley index is then (axiomatized on the class of the microarray games and) used to rank the genes. Subsequent papers ([8], [9]) deal with the same issue by using different indices. Here we propose a definition of extended microarray game, which allows the use of weighted power indices; this is very useful for better ranking the genes: by using only unanimity games, the enormous number of players in each game results in considering too many of them as symmetric players, which does not allow us to differentiate them significantly. Moreover, the extended game also allows us to play an (average of) weighted majority game(s) among the genes in the first positions according to the ranking obtained by the previous method, since it provides a natural way to attach weights to the players. We then apply the machinery to some real data concerning tumoral diseases, and we observe that our ranking highlights the role of some genes already considered in the literature as important in the onset of the disease. Keywords: Extended microarray game, weighted indices, weighted majority game, gene expression analysis.
1 Introduction
The theory of games deals with the issue of describing (and/or predicting) the rational behavior of agents interacting with each other. It was initially applied to human beings, and subsequently to animals. Recently, different applications have been provided, with the underlying idea that some behavior can be considered rational even if the agents involved cannot at all be considered intelligent. This
Acknowledgement footnote: The research of the authors was partially supported by Ministero dell'Istruzione, dell'Università e della Ricerca (COFIN 2007).
applies to computers, to microorganisms, and to genes, for instance. In a sense, groups of genes can be considered to interact with each other, with the result of provoking the onset of a specific disease of genetic nature, as is the case for most tumors. Thus, tools of game theory can in principle be used to detect the effects of the genes. The issue we want to consider here is the attempt to rank the genes of people affected by a specific genetic disease, in order to single out a group of them potentially important in the process of the disease (actually this is the first setting of application for microarray games, but not the only one: later we mention another possible application, and further ones are under investigation). In order to do this, we need to explain in a few words what the microarray technique is, and in which way we may apply (cooperative) game theory to this setting. The recent availability of the genome sequence information has promoted the development of a number of new technologies, including microarrays. Microarrays have played an important role in shifting investigation from a focus on single genes to an omics approach ([5], [7]). Omic studies are characterized by the use of high-throughput methods that produce large quantities of data. Microarrays can measure RNA, DNA, or protein levels from cells or tissues on a genome-wide scale. For example, DNA and RNA level alterations measured from the same sample provide information about genes in which expression is corrupted due to increased or decreased copy number. Copy number alterations represent an important mechanism for cancer cells to promote or suppress the expression of genes involved in cancer progression. Furthermore, genes deregulated in association with high-level amplifications have been linked to poor outcome of cancer, representing potential drug targets ([3]). Thus the integrated array data can identify therapeutic targets which might then provide alternative options to surgery and radiation therapy for cancer. Innovative approaches have been developed to exploit DNA sequence data and yield information about gene expression for entire genomes. The DNA microarray is based upon the mutual and specific affinity of complementary strands of DNA. This approach provides a quantitative measurement of the gene expression (the amount of mRNA in a cell sample) for thousands of genes in the same experiment. Array size can range from a small subset of 500 genes to a large pool of 30,000 genes. Microarrays are used to probe differences in gene expression. In order to highlight these differences, the use of proper controls is vital. mRNA must be extracted from a normal control as well as the experimental samples and purified for use in the array experiment. This RNA can be obtained from a variety of sources including cell culture, tissue samples from animal models or clinical patients, and histologically archived samples. Under different biological conditions, individual genes may be up-regulated or down-regulated, and the fluorescent signal of the marker dyes reflects these changes. Usually, at the end of the process, the gene expression of a single individual is summarized by a very large vector of numbers.
At this point comes the task of interpreting all these data. Here is the point where the theory of games can help. Let us briefly explain how. Let N be a (finite) set, and denote by 2^N the set of the subsets of N. A coalitional game, or characteristic-form game, is a pair (N, v), where N denotes the finite set of players and v : 2^N → R is its characteristic function, with v(∅) = 0. We shall implicitly assume from now on that N = {1, . . . , n}. A group of players T ⊆ N is called a coalition and v(T) is called the value of this coalition. A coalitional game (N, w) such that w : 2^N → {0, 1} is called a {0, 1}-game, or also a simple game. In these games, a coalition A such that w(A) = 1 is called a winning coalition; otherwise it is a losing coalition. Among the simple games, the so-called weighted majority games are important: suppose there are n players and that n + 1 positive integers q, w_1, . . . , w_n are given. The associated weighted majority game v is defined as

v(S) = 1 if w(S) ≥ q, and v(S) = 0 if w(S) < q, where w(S) = Σ_{i∈S} w_i.

Such a game will be denoted by [q; w_1, . . . , w_n]. The meaning is clear: player i has weight w_i, the majority quota q is fixed, and a coalition is winning if the sum of the weights of its members reaches the quota. An important family of games (actually a basis for the (2^n − 1)-dimensional vector space of the characteristic-form games with n players) is the following: for each R ⊆ N, let the unanimity game (N, u_R) be defined as

u_R(T) = 1 if R ⊆ T, and u_R(T) = 0 otherwise.

A solution vector for a game v is a vector (x_1, . . . , x_n), where x_i represents the amount assigned to player i. A solution, instead, is a function defined on the set of the games v, assigning a solution vector to each v. Among them, particularly relevant for simple games are the so-called power indices, which aim to highlight the power of each player in the game. A power index φ usually has the form

φ_i(v) = Σ_{S ∈ 2^{N\{i}}} p_i(S) m_i(v, S),   (1)
where m_i(v, S) = v(S ∪ {i}) − v(S) and p_i is a probability measure on the set 2^{N\{i}}. The meaning of the formula is clear: the term m_i(v, S) is the marginal value of player i to the coalition S; thus the index assigns to each player a weighted average of its marginal values. And now the two main indices we use here: the Shapley index is defined as

σ_i(v) = Σ_{S ∈ 2^{N\{i}}} [s!(n − s − 1)!/n!] [v(S ∪ {i}) − v(S)],   (2)

where s denotes the cardinality of S;
and the Banzhaf index β is defined as

β_i(v) = Σ_{S ∈ 2^{N\{i}}} (1/2^{n−1}) [v(S ∪ {i}) − v(S)].   (3)
In the case of simple games, fix a player i and let W denote the set of the coalitions S such that: i ∉ S; v(S) = 0; v(S ∪ {i}) = 1. Coalitions in W are those for which the joining of i is crucial to make them winning: one sometimes says that i is a swing for the coalition. In this case, the formulas simplify:

σ_i(v) = Σ_{S∈W} s!(n − s − 1)!/n!,   β_i(v) = Σ_{S∈W} 1/2^{n−1}.   (4)
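Formula (4) translates directly into code for games with few players. The following sketch is our own illustration (names are ours): it enumerates swings to compute both indices for a small weighted majority game. For large numbers of players one would instead resort to the generating-function ("formal series") algorithms recalled below.

from itertools import combinations
from math import factorial

def power_indices(n, is_winning):
    # Shapley-Shubik and (raw) Banzhaf indices of a simple game with players
    # 0, ..., n-1, obtained by enumerating the swings as in formula (4).
    shapley = [0.0] * n
    banzhaf = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                S = set(S)
                if not is_winning(S) and is_winning(S | {i}):   # i is a swing for S
                    shapley[i] += factorial(k) * factorial(n - k - 1) / factorial(n)
                    banzhaf[i] += 1.0 / 2 ** (n - 1)
    return shapley, banzhaf

# Weighted majority game [51; 49, 49, 2]: despite its tiny weight, player 2 is
# as powerful as the other two, since any two players reach the quota.
weights, quota = [49, 49, 2], 51

def win(S):
    return sum(weights[j] for j in S) >= quota

print(power_indices(3, win))
# ([0.333..., 0.333..., 0.333...], [0.5, 0.5, 0.5])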
Observe that, in any case, on the one hand the practical calculation of an index is trivial for games with few players, but on the other hand the problem rapidly becomes intractable, due to the fact that the number of coalitions is exponential in the number of players. For general games, it becomes difficult to deal with more than twenty players. However, for particular classes of games the calculation becomes easier, thanks to the possibility of using fast ad hoc algorithms. One such class is that of the weighted majority games, where the tool of formal series allows one to calculate the two indices easily, even for games with one hundred players (see f.i. [2]). And now we define the microarray games, in the spirit of [12]. Consider an (n × m) matrix M = (m_ij) such that m_ij is either zero or one, and such that for every j there is an i with m_ij ≠ 0. Given the column m_·j, j = 1, . . . , m, define its support as the set supp m_·j = {i : m_ij = 1}, and define the associated unanimity game v^j generated by supp m_·j, i.e.

v^j(T) = u_{supp m_·j}(T) = 1 if T ⊇ supp m_·j, and 0 otherwise.   (5)

Then the microarray game associated to M = (m_ij) is defined as

v = (1/m) Σ_{j=1}^{m} v^j.   (6)
Let us briefly explain the idea underlying this definition. First of all, a few words on how the matrix M is built up. Usually, the microarray data of a single experiment are divided into two (very big) matrices, say A, an n × l matrix, and B, an n × m matrix, with real entries. Each row corresponds to a gene, each column to an individual, and the individuals are partitioned into two sets: call the reference set
the group of individual whose data are collected in the matrix A. The entries aij , bij represent the expression of gene i in the individual j. From the data of the row ai· we construct an interval, say [mi , Mi ], called the normality interval for the gene i. A possible choice could be taking mi as the smallest entry of the row, as Mi as the greatest entry of the row, but other choices are possible. Finally, the matrix A is built up in the following way: mij = 1 if and only if mij ∈ / [mi , Mi ]. Usually, the data from the matrix A are relative to sane individuals, while those from the matrix B are taken from people affected by a specific disease. But this is not the only possible application: for instance we could take two groups of people affected by a similar but not identical form of tumor, and the mutual comparison of the data (taking one time one group as a reference group, and the other one under experiment and repeating the experiment the other way around) can provide another example of microarray game. This actually was done in [1], and it is repeated here. Or also, we could test the effects of providing a protein to a group of patients; this will be the subject of a future paper. Once the microarray game is defined, it can be of course treated as any cooperative game. In particular, power indices of the games can be calculated, with the aim of providing a ranking of the genes, with the idea that genes with higher power can be suspected as being responsible of the onset of the disease under investigation. But looking at the data processed in the papers using these techniques ([8] and [9]), we can notice that the Shapley and Banzhaf indices (especially the second one) can have difficulties in distinguishing the genes. This can be structural, in that few samples versus so many genes can cause the fact that several of them could be grouped in families of symmetric players. On the other hand, it is quite possible that round off errors do not allow evaluating very small differences. By elaborating some real data it turned out that some patients presented around 200 genes abnormally expressed. In such a case, the Banzhaf assign zero power to all players3 , thus not distinguishing the abnormally expressed genes from the other ones. This means that actually the patient does not provide useful data, since it considers all genes as null genes4 . This in principle cannot be considered totally useless: in some sense, it indicates that the patient could be considered not meaningful for the analysis, since its abnormally expressed genes are too many. But on the other hand, especially when treating data with few patients, it is of interest to avoid the risk of having a partition of the set of genes made by few elements (i.e. few subsets with a large number of genes). For this reason, it is interesting to try to better differentiate the contribution that each gene could give to the disease. Thus, it seems to be a promising idea to try to differentiate the genes, by considering an extended idea of microarray game. The purpose of this note is twofold: at first we introduce a variant of the microarray game, called generalized microarray game, that allows us using the 3 4
3. We are talking about calculations made by computers, where round-off errors are present.
4. A player i is called null in the game v if v(S ∪ {i}) = v(S) for all S. Null players have zero power.
The purpose of this note is twofold. First, we introduce a variant of the microarray game, called the generalized microarray game, which allows us to use the so-called weighted indices (see [6]). This is done by attaching a weight to each gene for each patient, according to how far the gene is from the normality interval, as explained below. This, as is easily seen, produces fewer ties in the ranking of the genes. Then, we consider a new model of game, derived from the results of the (modified) microarray game. In short, to each patient we attach a weighted majority game, by considering a restricted set of genes (usually 50): those best ranked by means of the above procedure. Finally, we rank the genes by averaging their ranking over all patients. Since at the very beginning there are several thousand players, it is clear that it is impossible to evaluate any index, unless we use unanimity games5. This is a non-negligible argument in favor of defining the microarray game as an average of unanimity games. However, it is conceivable that we could arrange a game among a much smaller set of players, after a preliminary selection made by the indices. Again, if we do not want to restrict our choice too much, we need to appeal to some class of games for which the calculation of the various indices is simpler than in standard games. Thus we shall attach to the generalized microarray game a weighted majority game, to provide further standings for a restricted set of genes. As already mentioned, by exploiting the tool of formal series, algorithms were developed which easily compute the indices for games whose number of players is in the double digits (see, for instance, [2]). We shall do it here with four sets of real microarray data, and in order to perform our calculations, rather than using applets available on the net, we developed a program in MATLAB, since we needed to make a (minor) change to the known algorithms in order to take into account that in our context a player could have zero weight within the data of a single patient, and also in order to evaluate the index of the extended microarray game by averaging over the patients. Then, we apply all the machinery to four sets of data taken from the literature, and we compare our results with others. It is clear that our model contains a number of assumptions which do not have, at present, strong theoretical motivations. For instance, it is not clear which index to privilege in order to select a group of genes to analyze further with the weighted majority game6. Nor is it clear how many genes should be used in the subsequent game. Also, the weights assigned to the various players and the majority quotas could be a source of discussion. But even if all we did must be considered, at the current state of the art, mainly experimental, we believe that this model is interesting for at least two good reasons. First of all, we noticed some form of stability in the experimental results: looking at the genes in the first 100 positions according to the various indices, we find that a great percentage of them is present in all rankings made by the different indices (with the exception of Banzhaf's, which, for the reason explained above, does not significantly differentiate the genes).
5. In the unanimity game u_R the Shapley value assigns the power 1/|R| to the players in R, while Banzhaf assigns 1/2^{|R|-1}.
6. Of course, the choice of the index need not be limited to the Shapley and Banzhaf indices. Other indices, for instance, actually inspired by the microarray games, but whose use is broader, were defined in [9].
Thus the use of the weighted indices introduced here is justified by the fact that in some cases they allow a better differentiation of the genes, and in general the results show similarities with the Shapley index, in the cases when the latter is able to differentiate the genes7. The second important reason why our data seem to be promising is that a check of the medical literature showed that some of the genes selected by means of our methods in particular experiments are considered to be of great importance, from the medical point of view, in the onset of the disease. It is well known that interpreting all the data collected by these new techniques based on DNA analysis is quite a difficult task, and that researchers in molecular biology are seeking new methods to detect genes responsible for a disease. The game theoretic approach, though a very recent and not yet well developed idea, may be useful in signalling genes deserving deeper investigation. Thus, as a conclusion, we believe that further interaction with researchers in medical groups should be encouraged in order to suggest new developments of this approach.
2 Generalized Microarray Games
The idea underlying the new version of the microarray game is to allow the matrix at the core of the game to contain not only zeroes and ones. In other words, we do not classify the genes into two categories only, the normally expressed and the abnormally expressed, but we also take into account "how much" the genes are abnormally expressed, by giving them a weight that gradually increases depending on how far the gene is from the reference interval. Of course, this can be done in several ways. A natural one is to consider, for each gene i, the reference interval, let us call it N_i^0 = [mi, Mi], to evaluate the standard deviation si relative to the data of the gene, to set N_i^k = [mi − k si, Mi + k si], k = 1, . . . , j, and to assign the value k to the gene falling in the set N_i^k \ N_i^{k−1} (j if it falls outside all these sets). In this way, we can rank the genes according to another type of index, called the weighted index. Thus, suppose we are given an n × m matrix M such that mij ≥ 0 for all i, j. Observe that when M represents a classical microarray game, i.e. mij ∈ {0, 1}, the Shapley index of player i fulfills the formula
\sigma_i(v) = \frac{1}{m} \sum_{j=1}^{m} \frac{m_{ij}}{\sum_{l=1}^{n} m_{lj}} .    (7)
It is then natural to use exactly the same formula also when mij is not valued only in {0, 1}. On the other hand, for every fixed j, this is exactly the expression of the so-called weighted (Shapley) index (with associated weight mij for player i = 1, . . . , n in the game j).
7. Different data sets can provide very different situations as far as the differentiation of the genes is concerned. In some cases the Shapley index does not give many ties (especially in the first positions); in other cases we found dozens and dozens of genes with the same index. In such cases the index considered here produces far fewer ties.
We refer the interested reader to the survey article [6] for more about these indices. Our first attempts at processing data showed a kind of stability with respect to the ranking of the genes, even though, as expected, taking into account more intervals resulted in a better differentiation of the genes. Thus, we decided not to bound the number of intervals, in order to obtain a finer ranking of the genes. The extended microarray matrix also serves, as already mentioned, to build a weighted majority game. Quite naturally, the weight of player i in the game j is given by the coefficient mij. We want to add one remark. Even if the indices give the same ranking in weighted majority games8, this is no longer true in microarray games, as simple examples show. Thus in the microarray game the ranking of the genes, and not only the ratio of the players' power, depends on the index. Nevertheless, again, our data show strong stability in the ranking of the genes across the different indices. However, since all indices are well characterized by several sets of properties (their axiomatic definitions), differences in the rankings could be interpreted in the light of different groups of properties. As far as the data analysis is concerned, we performed tests with different data sets, i.e. Stroma Rich and Stroma Poor Neuroblastic tumors, Ductal and Lobular breast tumors, and two different types of Colon tumor. It must be observed that in one case (Stroma Rich and Stroma Poor Neuroblastic tumors) we did a different experiment, in that we compared tissues from two groups of cells affected by two different, though similar, forms of tumor. We analyzed both situations arising from taking one of the groups as the reference group. In this case the ranking of the genes takes on the meaning of singling out those genes characterizing one form of tumor with respect to the other. The results are briefly illustrated in the next section. More extended results are available in the Ph.D. thesis of the second author9, available upon request.
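As an illustration of the construction described in this section, the sketch below (not the authors' MATLAB program; all function and variable names are ours) builds the generalized microarray matrix from a reference matrix A and a case matrix B, using the min–max normality interval widened by multiples of the per-gene standard deviation, and then evaluates the weighted index of formula (7). The interval choice and the cap j on the number of bands follow the text, but other choices are possible.

```python
import numpy as np

def generalized_matrix(A, B, j=3):
    """A: n x l reference data, B: n x m case data (rows = genes, columns = patients).
    Returns the n x m weight matrix M with entries in {0, 1, ..., j}."""
    m_low, m_high = A.min(axis=1), A.max(axis=1)       # normality interval [m_i, M_i]
    s = A.std(axis=1)                                   # per-gene standard deviation
    M = np.zeros(B.shape, dtype=int)
    for k in range(j):                                  # count how many of N^0, ..., N^{j-1}
        low = (m_low - k * s)[:, None]                  # the observation falls outside of:
        high = (m_high + k * s)[:, None]                # this gives weight k on N^k \ N^{k-1}
        M += ((B < low) | (B > high)).astype(int)       # and weight j outside all intervals
    return M

def weighted_shapley(M):
    """Formula (7): sigma_i(v) = (1/m) * sum_j m_ij / sum_l m_lj.
    Patients whose column sums to zero are given a zero contribution here --
    one possible convention, not prescribed by the formula itself."""
    M = np.asarray(M, dtype=float)
    col = M.sum(axis=0)
    ratios = np.divide(M, col, out=np.zeros_like(M), where=col > 0)
    return ratios.mean(axis=1)
```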
3 Data Analysis
In all subsequent data sets we ranked the genes by using the weighted indices introduced above and we played the averaged weighted majority game with the genes in the first 50 positions. The weights in each game are the values given to the genes in the cooperative game played with the weighted indices (i.e., denoting by M the extended microarray matrix, in the game j the weight of gene i is given by mij, and in each game we use the simple majority quota q_j = [ (\sum_i m_{ij}) / 2 ] + 1). All tables of the rankings (of the first 50 genes) are available upon request.
8. The position of each player is quite reasonably related to its weight, in the following sense: for a power index φ and for the game v = [q; w_1, . . . , w_n], φ_i(v) ≥ φ_j(v) if and only if w_i ≥ w_j.
9. P. Radrizzani, Game Theory and Microarray Data Analysis, Ph.D. thesis, Politecnico di Milano, March 2009.
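The per-patient weighted majority game just described has around 50 players, so its indices cannot be computed by enumerating coalitions; the generating-function idea of [2] mentioned earlier reduces the work to a polynomial-coefficient computation. The sketch below illustrates that idea for the normalized Banzhaf index only, assuming integer weights; it is not the authors' modified algorithm, although it does handle zero-weight players, which simply receive zero swings.

```python
def banzhaf_weighted_majority(weights, quota):
    """Normalized Banzhaf index of the weighted majority game [quota; weights].
    Swings of player i = number of coalitions of N \\ {i} whose weight lies in
    [quota - w_i, quota - 1]; these counts are the coefficients of
    prod_{j != i} (1 + x^{w_j})."""
    n = len(weights)
    swings = [0] * n
    for i, wi in enumerate(weights):
        if wi == 0:
            continue                                  # a zero-weight player is never decisive
        coeff = [0] * (sum(weights) - wi + 1)         # coefficient t = #coalitions of weight t
        coeff[0] = 1
        for j, wj in enumerate(weights):
            if j == i:
                continue
            if wj == 0:
                coeff = [2 * c for c in coeff]        # (1 + x^0) just doubles every count
                continue
            for t in range(len(coeff) - 1, wj - 1, -1):
                coeff[t] += coeff[t - wj]             # knapsack-style polynomial product
        lo, hi = max(0, quota - wi), quota - 1
        swings[i] = sum(coeff[lo:hi + 1])
    total = sum(swings)
    return [s / total if total else 0.0 for s in swings]

# Example: one patient's game with the simple majority quota floor(sum(w)/2) + 1.
w = [4, 3, 2, 1, 0]
q = sum(w) // 2 + 1
print(banzhaf_weighted_majority(w, q))
```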
3.1 Data from Early Onset Colorectal Cancer
Gene expression analysis was performed by using Human Genome U133A-Plus 2.0 GeneChip arrays [15]. This data set contains 10 healthy samples and 12 samples derived from tumor tissues. The following 7 genes are known to function in a multitude of biological processes, ranging from transcription, angiogenesis, adhesion and inflammatory regulation to protein catabolism, in various cellular compartments, from the extracellular space to the nucleus, and they were already identified as potentially predictive of early onset colorectal cancer ([14]): CYR61, UCHL1, FOS, FOSB, EGR1, VIP, KRT24. One of them (UCHL1) was ranked at around the 100th place by the weighted Shapley value. All the other ones are among the first 50 and are thus included in the majority game. In the following table we quote their positions in the rankings.

Gene    wS   S    B
FOSB     1   1    1
CYR61    2   2    2
FOS      3   4    3
VIP      5   5    5
EGR1     8  14    9
KRT24   34  46   46

(wS = weighted Shapley, S = Shapley in the weighted majority game, B = Banzhaf in the weighted majority game)
3.2 Data from Neuroblastic Tumors
This case is different from all the other ones since, as already mentioned, we compare two groups of people affected by two similar but not identical tumors. The mathematical model, however, is the same, and the idea is to use one group as the reference group and the other as the object of investigation, and conversely. This can single out genes which specifically differentiate these two tumors. It would be very interesting to also have a reference group of healthy individuals to make a further comparison. Gene expression analysis was performed by using Human Genome U133A GeneChip arrays [15]. This data set contains 10 samples derived from tissues of stroma poor (SP) neuroblastic tumors and 9 samples derived from tissues of stroma rich (SR) neuroblastic tumors. The over-expression of the following 8 genes (already mentioned in the medical literature in connection with these diseases) was already identified in [1]10:
10. In that paper the genes are identified by using a combination of game theoretic and statistical tools. The game theoretic approach differs from the present one especially in the way the reference intervals are defined. Thus it is interesting to see that an important group of genes seems to be persistent when using different techniques.
ANGPTL7, PMP2, TSPAN8, CENPF, EYA1, PBK, TOP2A, TFAP2B. Moreover, five of them (CENPF, EYA1, PBK, TOP2A, TFAP2B) encode nuclear proteins. All of them are present in our rankings, according to the following tables.
First table: SP is used as reference group
Gene       wS   S    B
PMP2        3   4    3
TSPAN8     18  19   17
ANGPTL7    24  27   30

(wS = weighted Shapley, S = Shapley in the weighted majority game, B = Banzhaf in the weighted majority game)
Second table: SR is used as reference group
Gene      wS   S    B
CENPF      3   3    3
PBK       19  24   24
TOP2A     23  25   25
EYA1      25  20   20
TFAP2B    39  30   30

(wS = weighted Shapley, S = Shapley in the weighted majority game, B = Banzhaf in the weighted majority game)
3.3 Data from Lobular and Ductal Invasive Breast Carcinomas
Gene expression analysis was performed by using Human Genome U133-Plus 2.0 GeneChip arrays [15]. This data set contains 10 healthy samples of ductal and lobular cells, 5 samples of ductal cells and 5 samples of lobular cells derived from tumor tissues. In our ranking of genes, we identified the important gene HMMR, which is already known to be associated with a higher risk of breast cancer in humans ([13]). In that paper the authors, starting with four known genes encoding tumor suppressors of breast cancer, combined gene expression profiling with functional genomic and omic data from various species to generate a network containing 118 genes linked by 866 potential functional associations. This network shows higher connectivity than expected by chance, suggesting that its components function in biologically related pathways. One of the components of the network is HMMR, encoding a centrosome subunit. Two case-control studies of incident breast cancer indicate that the HMMR locus is associated with higher risk of breast cancer in humans. We found HMMR only in the lobular data, ranked at positions 21 (weighted Shapley), 19 (Shapley in the weighted majority game) and 20 (Banzhaf in the weighted majority game).
3.4 Data from Colon Tumor
Gene expression analysis was performed by using Affymetrix oligonucleotide microarrays for a set of 40 tumor samples and a set of 22 normal samples [16]. Some of the genes selected were previously observed in association with colon cancer ([4]): the vasoactive intestinal peptide (M36634: Human vasoactive intestinal peptide (VIP)) has been suggested to promote the growth and proliferation of tumor cells; the membrane cofactor protein (M58050: Human membrane cofactor protein (MCP)) represents a possible mechanism of the ability of the tumor to evade destruction by the immune system. H72234: DNA-(APURINIC OR APYRIMIDINIC SITE) LYASE (HUMAN) plays an important role in DNA repair and in the resistance of cancer cells to radiotherapy ([10]). In our ranking we found M36634 at the 68th position (thus not entering the subsequent game), while H72234 is ranked at positions 6 (weighted Shapley), 7 (Shapley in the weighted majority game) and 8 (Banzhaf in the weighted majority game), and M58050 is ranked at positions 5 (weighted Shapley), 3 (Shapley in the weighted majority game) and 5 (Banzhaf in the weighted majority game), as the following table displays.

Gene      wS   S   B
H72234     6   7   8
M58050     5   3   5
References

1. Albino, D., Scaruffi, P., Moretti, S., Coco, S., Di Cristofano, C., Cavazzana, A., Truini, M., Stigliani, S., Bonassi, S., Tonini, G.P.: Stroma poor and stroma rich gene signatures show a low intratumoural gene expression heterogeneity in Neuroblastic tumours. Cancer 113, 1412–1422 (2008)
2. Bilbao, J.M., Fernandez, J.R., Jimenez Losada, A., Lopez, J.J.: Generating Functions for Computing Power Indices efficiently. TOP 8, 191–213 (2000)
3. Chin, K., DeVries, S., Fridlyand, J., Spellman, P.T., Roydasgupta, R., Kuo, W.L., Lapuk, A., Neve, R.M., Qian, Z., Ryder, T., Chen, F., Feiler, H., Tokuyasu, T., Kingsley, C., Dairkee, S., Meng, Z., Chew, K., Pinkel, D., Jain, A., Ljung, B.M., et al.: Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. Cancer Cell 10, 529–541 (2006)
4. Fujarewicz, K., Wiench, M.: Selecting differentially expressed genes for colon tumour classification. International Journal of Applied Mathematics and Computer Science 13(3), 327–335 (2003)
5. Ge, H., Walhout, A.J., Vidal, M.: Integrating 'omic' information: a bridge between genomics and systems biology. Trends Genet. 19, 551–560 (2003)
6. Kalai, E., Samet, D.: On weighted Shapley values. International Journal of Game Theory 24, 179–186 (1987); Kalai, E., Samet, D.: Weighted Shapley Values. In: Roth, A. (ed.) The Shapley Value: Essays in Honor of Lloyd S. Shapley, pp. 83–100. Cambridge University Press, Cambridge (1988)
7. Liu, E.T., Kuznetsov, V.A., Miller, L.D.: In the pursuit of complexity: systems medicine in cancer biology. Cancer Cell 9, 245–247 (2006)
8. Lucchetti, R., Moretti, S., Patrone, F., Radrizzani, P.: The Shapley and Banzhaf indices in microarray games. Computers and Operations Research S0305–0548(09) 00060–00064 (2009)
9. Lucchetti, R., Radrizzani, P.: A family of new power indices (to appear)
10. Moler, E.J., Chow, M.L., Mian, I.S.: Analysis of molecular profile data using generative and discriminative methods. Physiological Genomics 4, 109–126 (2000)
11. Monderer, D., Samet, D.: Variations on the Shapley value. In: Aumann, R.J., Hart, S. (eds.) Handbook of Game Theory, vol. 54. Elsevier Science, Amsterdam (2002)
12. Moretti, S., Patrone, F., Bonassi, S.: The class of microarray games and the relevance index for genes. TOP 15, 256–280 (2007)
13. Pujana, M.A., Han, J.D., Starita, L.M., Stevens, K.N., Tewari, M., Sook Ahn, J., Rennert, G., Moreno, V., Kirchhoff, T., Gold, B., Assmann, V., ElShamy, W., Rual, J.F., Levine, D., Rozek, L.S., Gelman, R.S., Gunsalus, K.C., Greenberg, R.A., Sobhian, B., Bertin, N., Venkatesan, K., Ayivi-Guedehoussou, N., Solé, X., Hernández, P., Lázaro, C., Nathanson, K.L., Weber, B.L., Cusick, M.E., Hill, D.E., Offit, K., Livingston, D.M., Gruber, S.B., Parvin, J.D., Vidal, M.: Network modeling links breast cancer susceptibility and centrosome dysfunction. Nature Genetics 39, 1338–1349 (2007)
14. Yi, H., Kok, S.H., Kong, W.E., Peh, Y.C.: A Susceptibility Gene Set for Early Onset Colorectal Cancer That Integrates Diverse Signaling Pathways: Implication for Tumorigenesis. Clin. Cancer Res. 13(4) (2007)
15. Affymetrix, Inc., Calif., http://www.ncbi.nlm.nih.gov/projects/geo/index.cgi
16. Affymetrix Inc., Calif., http://microarray.princeton.edu/oncology/affydata/index.html
Combining Replicates and Nearby Species Data: A Bayesian Approach

Claudia Angelini 1, Italia De Feis 1, Viet Anh Nguyen 2, Richard van der Wath 2, and Pietro Liò 2

1 Istituto per le Applicazioni del Calcolo "Mauro Picone", Consiglio Nazionale delle Ricerche, Naples, Italy
[email protected], [email protected]
2 Computer Laboratory, University of Cambridge, Cambridge, United Kingdom
[email protected], [email protected], [email protected]
Abstract. Here we discuss the biological high-throughput data dilemma: how should replicated experiments and nearby species data be integrated? Should we consider each species as a monadic source of data when replicated experiments are available or, vice versa, should we try to collect information from the large number of nearby species analyzed in different laboratories? In this paper we make and justify the observation that experimental replicates and phylogenetic data may be combined to strengthen the evidence for identifying transcriptional motifs and networks, which seems to be quite difficult using other currently used methods. In particular we discuss the use of phylogenetic inference and the potential of the Bayesian variable selection procedure in data integration. In order to illustrate the proposed approach we present a case study considering sequence and microarray data from fungi species. We also focus on the interpretation of the results with respect to the problem of experimental and biological noise.
1 Introduction
Statistical genetics and Bioinformatics are experiencing a period of great capability in providing the methodology for life science experimental research, and they are keeping pace with the growing availability of a large variety of molecular biology high-throughput data. Molecular biologists are now pressing with more challenging requests. The most important issue concerns the integration of the different types of high-throughput data (omics). Another issue is whether, in order to assess the robustness of a result, it is better to generate replicates (and how many) or to use nearby species data (how many species, and how evolutionarily close). The problem with the first point is that it is not so obvious how one omic data type (say transcriptomics) could support findings on another, say genomics or proteomics. Clearly all the omics are interdependent because they are related to the same process of the survival of the organism. So in theory we need all of them, but their contribution to explaining life is not simply additive but more complex. For example, you can get a certain amount of protein by having several copies of a
gene or having one gene being highly transcribed or highly translated or simply slowly degraded. Note that at the time when only few DNA sequences were stored in databases there was the belief that the future availability of massive amount of sequences could dispel all statistical uncertainties. Nevertheless, the recent advent of high through-put sequencing data has not yet clarified the scenario. In summary due to the untangling of the biological processes it would be important to use all the evidences, however mathematical models and methods are not sophisticated enough to fully reach such goal. The second issue, i.e. the use of replicates and/or nearby species data, presents different but not simpler problems. In most of cases the experimenter cannot have all the replicates he would wish to have (cost, time constraints). From evolutionary theory we know that close species may behave similarly, such evidence may offer a way to improve the analysis, but again we do not have a methodology for integrating results from different species. A situation in which we have different omics, several replicates and also nearby species data, and a proper metric, as in our example is central to a basin of available data for which we would like to provide statistical support. One of the most challenging steps in annotating a genome is to find the transcription factor binding sites. These sites, located at various distances upstream each gene, directly influence the amount of gene transcripts. Several computational methods for the discovery of transcription binding sites (TFBSs) have been described, see [9]. They can be schematically divided into two main approaches: local multiple alignment of these promoters in order to detect common consensus sequences or by detection of over-represented oligonucleotide sequences, see [16]. Ideally, gene expression data may be combined with genome sequence data to identify regulatory modules. In [3] the authors applied linear regression with stepwise selection after getting a list of candidate motifs using MDScan (see [8]), an algorithm that makes use of word-enumeration and position-specific probability matrix updating techniques. The candidate motifs were scored in terms of number of sites and degree of matching with each gene. Following this approach, in [15] the authors proposed using Bayesian variable selection techniques instead of stepwise methods. The motivation was that Bayesian model selection methods perform a more thorough search of the model space and hence might potentially pick up motifs that can be missed by stepwise methods. They showed also that the method yielded better results than that in [3]. Recently [1] and [2] extended the Bayesian variable selection method to take into account the different and multiple information sources available, to pool together results of several experiments and to allow the users to select the motifs that best explain and predict the changes in expression level in a group of co-regulated genes. In the present paper we further extend and validate the above mentioned methodology. A problem with gene expression data is the noise, where actually several sources of noise have been identified. The variations in the level of gene expression is made of an intrinsic noise which originates from the inherent stochasticity of biochemical processes such as transcription and translation. An extrinsic noise is instead related to variations in the molecular environment, i.e. other
cellular components. Then there is a noise linked to non perfect reproducibility of the experimental conditions. When the experimental data is variable, the most obvious way to improve the signal to noise ratio is to perform experimental replicates. Replicated observations would allow to quantify precisely the experimental noise in measurements for each gene at each experimental condition. When the level of experimental variability varies between different genes and between different experimental conditions, experimental replicates are necessary for assessing the reproducibility of observed patterns. On the other hand it is well documented phenomenon that log-transformed expression measurements of low-expressed genes vary more than expression measurements of highly expressed genes. The expression levels of some genes might be more variable than others [5]. When experiments are costly, particularly in high throughput biology, replicates are usually in a number to assure ‘just above’ the threshold statistical reliability for disseminating and publishing results; funding constraints sometimes result in seriously hampering statistical robustness. The biological reasoning behind replicating experiments is that each organism has homeostatic mechanisms that maintain the genetic information and its expression and functions. This allows to replicate experiments with decent accuracy if the conditions remain the same and no instrumental errors are present. In molecular biology it is often difficult to retain the exact experimental setup but close conditions, say the same technology, the same lab, the same cells, are widely accepted. In a field where technology changes constantly and at great pace, results that are few months distant may be based on slightly different technologies or manufacturing aspects of some components leading to slightly different accuracy. The biological samples used in the experiments, i.e. culture cell lines, for examples bacterial or yeast cells which have short generation times, although kept in the same medium, may slightly change because of mutations and contamination during their periodical proliferation and in vitro amplification. Cells may have minimal different concentration of constituents such as nucleic acid, proteic, lipid or sugar factors, ions, giving them different fitness with respect other colonies. However, a large body of experimental evidences in comparative genomics is showing that recently diverged species may retain similarities in gene sequence, expression and genome organization; see for a methodology example [18]. These considerations suggest that, in absence of experimental replicates, or in addition to these, statistical support to experimental evidences may also be searched by analyzing close variants of the species under examination or phylogenetic nearby species, i.e. species which have recently diverged. Obviously the validity of the assumption of replicas through phylogenetical relatedness should be proved by copious experimental evidences or literature. Such useful exploitation of species richness is hampered by the lack of any theoretical statistical framework leading to combine the knowledge from true replicas and the nearby species replicas. Our proposal is that a phylogenetic metric may allow to combine replicated and cross-species data. 
Replicates will be on the tip of leaves representing current species (internal nodes are ancestors); the distance between leaves provides an estimate of the effectiveness (weight) of different species in providing statistical
support to the nearby others. Mutants or different variants will be very close to the leaf under investigation. In order to make full use of the richness of cases without using too distant relatives, we have focused on fungi sequences and expression data. In Section 2 we summarize the methodology, whose results are presented in Section 3. Conclusions are drawn in Section 4.
2 Methodology
The proposed approach requires the following steps:
1. Selection of a group of co-regulated genes in a well annotated species. Progressing from [1] and [2], we focused our attention on the ENG1 cluster of S. Pombe, a set of very strongly cell cycle-regulated genes, which contains eight genes involved in cell separation. Motif searches showed that each gene of the cluster has at least one binding site for the Ace2 transcription factor (consensus CCAGCC), see [10] and [13].
2. Determination of the nearby species by means of a phylogenetic tree generated on the basis of the information on the selected genes according to the guidance principles. Phylogenetic inference can be conducted with several methodologies. Here, we assessed the distance between fungi based on the ENG1 protein tree using the Jones, Taylor and Thornton amino acid substitution model (JTT model, [6]) of evolution, due to the fact that the ENG1 protein family are globular cytoplasmic proteins. Likelihood maximization and maximum likelihood parameter estimation were performed by numerical optimization routines using a replacement matrix for all sites. We have selected S. Japonicus and S. Octosporus as nearby species of S. Pombe.
3. Choice of a set of about 500 biologically independent genes of the annotated species (S. Pombe) after a phylogenetic analysis. This gene selection is guided by the fact that extensive comparative genomic analysis has revealed that all the eukaryotic genomes contain families of duplicated genes which have recently diverged. In many cases these families have retained a large part of the upstream regulatory sequences. In particular, the remnants of whole genome duplications have been identified in different yeast strains [7], [14] and [17] as well as in other species. The redundancy of the yeast genome suggests that we should select a meaningful non-redundant ensemble of genes that contains all the relevant statistical characteristics of the genome and that will play the role of control genes in step 6). In order to select this set of genes we have performed a phylogenetic analysis of the data set of genes using neighbor joining, which is a distance method. The phylogenetic analysis allows us to identify the genes which have very large sequence similarities and therefore may derive from a common ancestor. From each of these groups of very similar genes, all members except one were removed. Eventually we end up with approximately 500 genes.
4. Identification of the homologous genes in the nearby species (S. Japonicus and S. Octosporus) using BLAST [4];
5. Extraction of the upstream sequences up to 1000 bp of each species, shortening them, if necessary, to avoid any overlap with adjacent ORFs. For genes with negative orientation, we considered the reverse complement of the sequences. Note that motif finding algorithms are sensitive to noise, which increases with the size of the upstream sequences examined and, as reported in [16], the vast majority of the yeast regulatory sites from the TRANSFAC database are located within 800 bp of the translation start site.
6. Determination and scoring of a set of about 150 candidate motifs using a modified version of the software MDSCAN proposed in [8] to search for nucleotide patterns which appear in the upstream sequences of the genes of interest for each species. For a given species the score for each gene and each candidate motif of length w was calculated as in [3] using the following equation (a toy illustration of this score is sketched after this list):

X_{mg} = \log_2 \Big[ \sum_{x \in Q_{wg}} \Pr(x \text{ from } \theta_m) / \Pr(x \text{ from } \theta_0) \Big]
where m = 1, . . . , M (M number of motifs); g = 1, . . . , G (G number of genes); Q_{wg} is the set of all w-mers in the upstream region of gene g; θ_m is the probability matrix of motif m of width w; θ_0 is the background model, computed using a Markov chain of the sixth order (Liu's original algorithm permits only Markov chains of the third order) from the upstream regions of all the species of interest. We have considered nucleotide patterns of length 5 to 12 bp and we have scored up to 30 distinct candidates for each width.
7. Selection of a few best motifs using Bayesian variable selection as described in [2], which represents the extension of [15] to handle the case of multiple experiments. The idea is to fit a linear regression model relating gene expression levels (Y^e ∈ R^{G×S_e}), where S_e indicates the number of technical replicates for each experiment e, e = 1, . . . , E, performed on G genes of the annotated species (S. Pombe), to the pattern scores (X) of dimension G×M, M being the number of candidate motifs, evaluated on the nearby species (S. Japonicus/S. Octosporus). We assume the following model

y_{ges} \mid \mu_{ge}, \sigma_e^2 \sim N(\mu_{ge}, \sigma_e^2), \quad g = 1, \dots, G;\ s = 1, \dots, S_e;\ e = 1, \dots, E
where y_{ges} represents the observed gene expression value of gene g in the e-th experiment and the s-th replicate, and μ_{ge} is the true expression of gene g under the experimental condition e. A latent vector γ ∈ R^M with binary entries is introduced to identify the variables included in the model; γ_m takes on value 1 if the m-th variable (motif) is included and 0 otherwise. Let m_γ = \sum_{i=1}^{M} γ_i. The true gene expression value μ_{ge} is connected to a specific subset of the M candidate motifs identified by the latent vector γ by the following relation

\mu_{ge} \mid \gamma = \sum_{\{m:\gamma_m=1\}} x_{gm}\,\beta_{me} = X_{g\cdot(\gamma)}\,\beta_{e(\gamma)},
with X_{g·(γ)} the row of X_γ (score matrix of the species to infer) referring to gene g for all the included motifs. For the coefficients, the variance and the elements of γ the following priors have been elicited:

\beta_{e(\gamma)} \mid \sigma_e^2, \gamma \sim N\big(0, \sigma_e^2 H_{(\gamma)}\big), \quad \sigma_e^2 \sim IG(\nu, S), \quad p(\gamma) = \prod_{j=1}^{M} \theta^{\gamma_j}(1-\theta)^{1-\gamma_j}
where H_{(γ)}, ν and S need to be assessed through a sensitivity analysis and θ = mprior/M, with mprior the number of covariates expected a priori to be included in the model. After some standard algebra, the posterior distribution of the vector γ given the data, i.e. f(γ | X, Y^1, . . . , Y^E), can be obtained explicitly; however, given the large number of possible vector values (2^M possibilities with M covariates), we use a stochastic search Markov Chain Monte Carlo (MCMC) technique to search for sets with high posterior probabilities.
8. Running of several parallel MCMC chains of length 100,000 and pooling of the sets of patterns visited by the MCMC chains for each species. The algorithm computes the normalized posterior probabilities of each distinct visited set and the marginal posterior probabilities for the inclusion of single nucleotide patterns. In our study we considered 10 parallel chains. We ran Steps 7.-8. with several values of mprior (mprior = 1, 3, 5 in our case) to investigate the effects of the sparsity request on model selection. Note that we have chosen these values for mprior because fungi are rather simple organisms and it is known that the regulation mechanism is based on few motifs. Additionally, in order to study the robustness of the proposed methodology with respect to the choices of both each single experiment and the control gene set, we repeated Steps 7.-8. for several different subsets of about 200 control genes randomly selected from the 500 genes identified in point 3., in combination with the leave-one-out cross-validation strategy over the experiments for all the possible combinations. In our case we have used 8 different subsets.
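As forward-referenced in step 6, the following toy sketch shows how the motif score X_mg could be computed for one gene and one candidate motif. For brevity it replaces the sixth-order Markov background of the paper with a zeroth-order model; theta is a hypothetical column-stochastic motif matrix and all names are illustrative, not taken from the authors' software.

```python
import math

BASES = "ACGT"

def motif_score(upstream, theta_m, background):
    """X_mg = log2( sum over all w-mers x of Pr(x | theta_m) / Pr(x | theta_0) )."""
    w = len(theta_m[0])                       # theta_m[b][p]: probability of base b at position p
    total = 0.0
    for start in range(len(upstream) - w + 1):
        x = upstream[start:start + w]
        if any(c not in BASES for c in x):
            continue                          # skip ambiguous characters such as N
        p_motif, p_bg = 1.0, 1.0
        for pos, c in enumerate(x):
            p_motif *= theta_m[BASES.index(c)][pos]
            p_bg *= background[c]
        total += p_motif / p_bg
    return math.log2(total) if total > 0 else float("-inf")

# toy example: a 3-bp motif strongly preferring "CAG"
theta = [[0.1, 0.7, 0.1],   # A
         [0.7, 0.1, 0.1],   # C
         [0.1, 0.1, 0.7],   # G
         [0.1, 0.1, 0.1]]   # T
bg = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
print(motif_score("TTCAGCAGTT", theta, bg))
```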
3 Results
In order to show the performance of the proposed methodology we selected microarray experiments and sequence data from S. Pombe, a well studied species, and sequence data from S. Japonicus and S. Octosporus, recently sequenced species. As an illustrative example we chose to investigate the regulatory mechanism of the cluster Eng1, so we considered the experiment elutriation A described in [10] and the experiments elutriation 1, elutriation 2, elutriation 3 and cdc25 block release 1 described in [13]. All the data explore the transcriptional responses of the fission yeast S. Pombe to the cell cycle and measure gene expression as a function of time in cells synchronized through different approaches: centrifugal elutriation and the use of temperature sensitive cell cycle mutants. All these experiments have no technical replicates. The sequence data can be downloaded at http://www.broad.mit.edu. In the following, the value yge,
for gene g, g = 1, . . . , G, and experiment e, e = 1, . . . , 5, represents the average of the gene expression values measured on the microarray in the interval where the Eng1 genes show their common activation peak, approximately 30–90 minutes. We applied the algorithm described in the Methods section. As regards the variable selection procedure (step 7) we chose the prior parameters in the following way: H_{(γ)} = c [diag(X′X)^+]_{(γ)}, where X is the score matrix evaluated in step 6) and c is equal to the variability of the regression coefficients of the full model averaged over the experiments; ν = 3, to give weak prior knowledge, since it corresponds to the smallest integer such that the expectation of σ_e^2, E(σ_e^2) = ν/(ν − 2) S, exists; and the scaling value S is equal to the variability of the data averaged over the experiments. We analyzed all 3 species, S. Pombe, S. Japonicus and S. Octosporus, using different values of mprior, different control gene subsets, and leave-one-out cross-validation over the experiments. Indeed we considered mprior = 1, 3, 5 to study the effects of the sparsity request on model selection. We chose these values because fungi are rather simple organisms, and it is known that the regulation mechanism is based on few motifs. For each mprior we repeated the analysis 8 times using G = 200 genes randomly selected from the set of 500 genes chosen in step 3) in order to show the robustness of the method with respect to these selections. Finally, the whole procedure was repeated within a leave-one-out cross-validation strategy over the experiments in order to study both the influence of each single experiment and the increase of power that we can have by adding more information. In order to investigate and show the convergence of the MCMC procedure we ran 10 chains of length 100,000 and we inspected both the output of each single chain and that of the pooled chains. In general we found great robustness of the proposed procedure with respect to all the mentioned settings. In the following, for the sake of brevity, we present only the results obtained using all the 5 experiments for mprior = 1. Figures 1, 2 and 3 display the plots of the number of included variables in the model versus the MCMC iteration for one of the 8 analyses and all the 10 chains for S. Pombe, S. Japonicus and S. Octosporus, respectively. The MCMC samplers mostly visited models with 10-15 variables for all the 3 species. Figures 4, 5 and 6 show the pooled marginal probability for one of the 8 analyses for S. Pombe, S. Japonicus and S. Octosporus, respectively. In the figures the x-axes correspond to the pattern indices and the y-axes correspond to the marginal posterior probability. The spikes indicate patterns included in the model with high probability. Similar results were obtained for all the 8 analyses and for mprior = 3 and mprior = 5. Tables 1, 2 and 3 show for each species the motifs with marginal probability greater than 0.5 for each of the 8 subsets of control genes, their averaged marginal probability and their total occurrence in the 8 subsets of control genes. Note that long patterns are more often selected than short ones. This corresponds to the background model not so efficiently rejecting association among
nearby DNA bases. Eukaryotic DNA is highly heterogeneous, patchy and repetitious [11], and state-of-the-art background models cannot adequately account for the variations in base association. We also found that, considering more replicates or data from more species, the marginal probabilities become much higher (about 3-fold) than those obtained using single replicates and one species, see [1]. We got both confirmation of known results and new findings (motifs) which have high marginal probability values.
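For readers who want to experiment with the variable selection step, the sketch below shows one way the stochastic search could be organized: the marginal likelihood f(Y^e | γ) is available in closed form after integrating out β and σ_e^2, and a Metropolis–Hastings sampler flips one entry of γ at a time. The inverse-gamma parameterization (shape ν, scale S), the use of a single expression vector per experiment, and all names are assumptions of this illustration, not the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(y, X, gamma, H_diag, nu, S):
    """log f(y | gamma) for one experiment, integrating out beta and sigma^2 under
    beta | sigma^2 ~ N(0, sigma^2 * diag(H_diag[gamma])) and sigma^2 ~ IG(nu, S)
    (shape nu, scale S -- an assumed parameterization)."""
    G = len(y)
    idx = np.flatnonzero(gamma)
    Sigma = np.eye(G)
    if idx.size:
        Xg = X[:, idx]
        Sigma += Xg @ np.diag(H_diag[idx]) @ Xg.T     # marginal covariance / sigma^2
    _, logdet = np.linalg.slogdet(Sigma)
    quad = y @ np.linalg.solve(Sigma, y)
    return (-0.5 * G * np.log(2 * np.pi) - 0.5 * logdet
            + nu * np.log(S) - gammaln(nu) + gammaln(nu + G / 2.0)
            - (nu + G / 2.0) * np.log(S + 0.5 * quad))

def mcmc_search(Y_list, X, H_diag, nu, S, theta, n_iter=10000, seed=0):
    """Bit-flip Metropolis-Hastings over gamma; returns estimated marginal
    inclusion probabilities. Y_list holds one expression vector per experiment."""
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    gamma = np.zeros(M, dtype=int)
    def log_post(g):
        lp = np.sum(g * np.log(theta) + (1 - g) * np.log(1 - theta))  # Bernoulli prior
        return lp + sum(log_marginal(y, X, g, H_diag, nu, S) for y in Y_list)
    current = log_post(gamma)
    visits = np.zeros(M)
    for _ in range(n_iter):
        j = rng.integers(M)                     # propose flipping one coordinate of gamma
        prop = gamma.copy()
        prop[j] = 1 - prop[j]
        lp = log_post(prop)
        if np.log(rng.random()) < lp - current:
            gamma, current = prop, lp
        visits += gamma
    return visits / n_iter
```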
Fig. 1. Number of included variables versus the MCMC iteration for S. Pombe
4 Conclusions and Interesting Biological Statistical Problems Involved
Here we have discussed the state of the art and the potential of annotating genome transcription factor binding sites on the basis of sequence data sets from other species. Although we have elucidated a single gene network in S. Pombe and its differences in other yeast species, the procedure can be generalized to other genetic networks. The most important suggestion is that the identification of a gene network in one species using another species' information could reveal differences which may be real or may derive from an inaccurate, fragmentary knowledge of the network in the annotated species. Having defined a metric to use the nearby species tree, there are several questions we are able to answer and others we can only attempt to. Would it be better to add a species or a not-so-great-quality replicate? Obviously we need to estimate the quality of an experimental replicate (the quality may even be different for different genes) and to estimate the quality of nearby species data. Therefore,
Fig. 2. Number of included variables versus the MCMC iteration for S. Japonicus
Fig. 3. Number of included variables versus the MCMC iteration for S. Octosporus
Fig. 4. Marginal posterior probabilities for S. Pombe
Fig. 5. Marginal posterior probabilities for S. Japonicus
Fig. 6. Marginal posterior probabilities for S. Octosporus
Table 1. Motifs with marginal probability greater than 0.5 in each of the 8 subsets of control genes, their average probability and their total occurrence for S. Pombe, mprior = 1.

Motif          Probability  Occurrence
GTAAAAAA          0.9118        2
CAAATATAAA        0.8920        2
AATGTAA           0.9923        3
GTGGTTGG          0.9908        3
GAAAATCGAA        0.9648        3
AAATTTAAGAG       0.8998        3
GATTTTACCA        0.9301        4
TTACTTTCTT        0.7289        4
TTATCCAGCC        0.9140        4
GGTGGCTGGCA       0.9959        4
TCTATATTCGG       0.7739        4
TTGCTTTAT         0.9664        5
TCAATATCCAGC      0.9306        5
TTATATAA          0.8923        6
CATGGCGGG         0.8681        6
ATCGATGGTAA       0.9643        6
CAAGAAAGTAC       0.9529        8
Table 2. Motifs with marginal probability greater than 0.5 in each of the 8 subsets of control genes, their average probability and their total occurrence for S. Japonicus, mprior = 1.

Motif          Probability  Occurrence
ACCAGCC           0.7427        2
GTGTCAC           0.7271        2
AAGGAGGCT         0.8879        2
ACTCGCGTCAC       0.9625        2
GGCTGG            0.8282        3
ATGCATA           0.7045        3
ACGGTGTGAA        0.8602        3
AGGCTGGT          0.7900        5
CAGATTTCGTGC      0.7774        6
a first non trivial statistical problem is which phylogenetically close species to choose and how much evidence may contribute. Note that species variants (mutants, strains) may be certainly better than species that are very different. For instance schizosaccharomyces have at least three close variants (pombe, japonicus and octosporus), aspergillum has three variants (clavatus, niger, oryzae). A criterion is to identify a distance threshold on the phylogenetic tree that would give some confidence on combining data. This would require generating the tree using phylogenetic advanced methodology, considering the same genes under investigation. Noteworthy, we use models of evolution of coding sequences and not of transcription factor binding sites. How wrong are we from the ideally correct models of evolution of binding sites? The currently available models for sequence evolution are based on fixed rate matrices generated from globular (for example Dayhoff, JTT) or mitochondrial proteins (for example MTrev). Note that regulatory regions, although under purifying selection, i.e. they evolve slower than surrounding sequences, are nevertheless diverging at higher rate than coding regions. In [12] the authors have demonstrated an inverse correlation between the rate of evolution of transcription factors and the number of genes that they regulate. Therefore, for small gene networks, distant species may not provide adequate support and in general the distance may depend on the size of the genetic network. Transcription factor binding sites are very short so there is a size limit effect. More important, while the discrepancy between coding region models and the actual patterns of changes depends on not considering the three dimensional building of a protein, the discrepancy between a model of transcription factor binding site and the true pattern depends on the binding energies of the interaction between the transcription factor and the binding sites. So coding regions and TFBS are inherently different. This error can lead to assigning a small distance on the tree due to the slow accumulation of mutations within orthologous genes while the binding factors might have been under stronger or more relaxed evolution. Therefore our metric should be considered simply as few
Table 3. Motifs with marginal probability greater than 0.5 in each of the 8 subsets of control genes, their average probability and their total occurrence for S. Octosporus, mprior = 1.

Motif          Probability  Occurrence
ACTTTACTC         0.9208        2
CGTCGTGGTG        0.8705        2
GTATCGGTTG        0.9980        2
GTTCGATGGC        0.8317        2
AAGAGCAGAGC       0.8820        2
ACTTTCATCCA       0.9916        2
GATTTTACTCG       0.7602        2
TTCGTTTCCGT       0.9559        2
TTGTTTGTTTA       0.9964        2
AGAGAGAAA         0.9186        3
TCAATCCAGT        0.9752        3
CATTCAGGGG        0.9557        4
ACAATGGAT         0.9843        5
GTAGAAACA         0.7577        5
CCTTCCACCGA       0.9803        6
GTTGCAAGT         0.9975        7
GGTACGAAGAA       0.9111        7
GATGGCTGGTA       1.0000        8
steps of inching towards reality. Extensive comparative genomic analysis has revealed that all the eukaryotic genomes contain families of duplicated genes which have recently diverged. In many cases these families have retained a large part of the upstream regulatory sequences. In particular, the remnants of whole genome duplications have been identified in different yeast strains [7,14,17] as well as in other species. Replicates allow us to investigate the strength of a regulatory network in a species, i.e. small variations in gene expression may mean that the gene network is finely regulated and does not allow for much natural variation. The inclusion of phylogenetic data tells us how much that network has changed in different species in terms of the number of genes involved, motif patterns, and changes in expression. Indeed, both S. Japonicus and S. Octosporus show some changes in the number of genes and motif patterns. Therefore an important result of including additional species is the insight that this analysis gives on the evolution of the gene regulatory network under investigation. The combined use of replicates and nearby species data will also bring attention to developing better metrics of distance between the species. Current methods for phylogenetic inference are based on sequences. The possibility of developing models from an ensemble of replicated sequence and gene expression data from different species will provide a better estimation of evolution.
Better trees will influence multiple sequence alignment and protein structure prediction methods, triggering an avalanche of improvements which, in this era, will affect translational bioinformatics.
Acknowledgments We acknowledge the Fungal Genome Initiative at Broad Institute of Harvard and MIT (http://www.broad.mit.edu), the CNR DG.RSTL.004.002 project and the CNR Bioinformatics Interdepartmental project.
References

1. Angelini, C., Cutillo, L., De Feis, I., van der Wath, R., Liò, P.: Identifying regulatory sites using neighborhood species. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) EvoBIO 2007. LNCS, vol. 4447, pp. 1–10. Springer, Heidelberg (2007)
2. Angelini, C., Cutillo, L., De Feis, I., Liò, P., van der Wath, R.: Combining experimental evidences from replicates and nearby species data for annotating novel genomes. In: AIP proceedings, vol. 1028, pp. 277–291 (2008)
3. Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S.: Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA 100, 3339–3344 (2003)
4. Ewens, W.J., Grant, G.R.: Statistical Methods in Bioinformatics. Springer, New York (2001)
5. Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., Kidd, M.J., King, A.M., Meyer, M.R., Slade, D., Lum, P.Y., Stepaniants, S.B., Shoemaker, D.D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., Friend, S.H.: Functional discovery via a compendium of expression profiles. Cell 102(1), 109–126 (2000)
6. Jones, D.T., Taylor, W.R., Thornton, J.M.: The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282 (1992)
7. Kellis, M., Birren, B.W., Lander, E.S.: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428, 617–624 (2004)
8. Liu, X.S., Brutlag, D.L., Liu, J.S.: An algorithm for finding protein-dna binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839 (2002)
9. Ohler, U., Niemann, H.: Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 17, 56–60 (2001)
10. Oliva, A., Rosebrock, A., Ferrezuelo, F., Pyne, S., Chen, H., Skiena, S., Futcher, B., Leatherwood, J.: The cell cycle regulated genes of schizosaccharomyces pombe. PLoS Biol. 33(7), 1239–1260 (2005)
11. Piazza, F., Liò, P.: Statistical analysis of low–complexity sequences in the human genome. Physica A 347, 472–488 (2005)
12. Rajewsky, N., Socci, N.D., Zapotocky, M., Siggia, E.D.: The evolution of dna regulatory regions for proteo-gamma bacteria by interspecies comparisons. Genome Res. 12, 298–308 (2002)
13. Rustici, G., Mata, J., Kivinen, K., Liò, P., Penkett, C.J., Burns, G., Hayles, J., Brazma, A., Nurse, P., Bähler, J.: Periodic gene expression program of the fission yeast cell cycle. Nature Genet. 36, 809–817 (2004)
14. Seoighe, C., Wolfe, K.H.: Updated map of duplicated regions in the yeast genome. Gene 238, 253–261 (1999)
15. Tadesse, M.G., Vannucci, M., Liò, P.: Identification of dna regulatory motifs using bayesian variable selection. Bioinformatics 20, 2553–2561 (2004)
16. Van Helden, J., Andre, B., Collado-Vides, J.: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842 (1998)
17. Wolfe, K.H., Shields, D.C.: Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713 (1997)
18. Sartor, M.A., Tomlinson, C.R., Wesselkamper, S.C., Sivaganesan, S., Leikauf, G.D., Medvedovic, M.: Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics 7, 538–547 (2006)
Multiple Sequence Alignment with Genetic Algorithms

Marco Botta 1,2 and Guido Negro 1

1 Dipartimento di Informatica, University of Torino, Corso Svizzera 185 – 10149 Torino – Italy
2 Center for Complex Systems in Molecular Biology and Medicine, University of Torino, Via Accademia Albertina 13 – 10100 Torino – Italy
[email protected], [email protected]
Abstract. The multiple sequence alignment problem is one of the most common tasks in the analysis of sequential data, especially in bioinformatics. In this paper, we propose to use a genetic algorithm to compute a multiple sequence alignment by optimizing a simple scoring function. Even though the idea of using genetic algorithms is not new, the presented approach differs in the representation of the multiple alignment and in the simplicity of the genetic operators. The results obtained so far are reported and discussed in this paper.
Keywords: Genetic Algorithms, Sequence Alignment.
1 Introduction

One of the most common tasks in the analysis of biological sequences is sequence alignment: pairwise alignment involves two sequences, while multiple sequence alignment (MSA) involves three or more sequences. In this paper, we will focus on multiple sequence alignment and present a heuristic method to face the problem. Multiple alignments of protein sequences constitute an extremely powerful means of revealing the constraints imposed by structure and function on the evolution of a protein family. They make it possible to answer a wide range of important biological questions, such as phylogenetic tree estimation, identification of conserved motifs and domains, structure prediction, and critical residue identification. The most natural formulation of the computational problem is to define a model of sequence evolution that assigns probabilities to elementary sequence edits and seeks a most probable directed graph in which edges represent edits and terminal nodes are the observed sequences. No tractable method for finding such a graph is known. A heuristic alternative is to seek a multiple alignment that optimizes some scoring function that evaluates how good the multiple alignment of the given sequences is. A widely used scoring function is the sum of pairs (SP) score, i.e. the sum of pairwise alignment scores. Optimizing the SP score is NP complete [26] and can be achieved by dynamic programming with time and space complexity O(L^N) in the sequence length L and number of sequences N [28].
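For concreteness, the sum-of-pairs objective mentioned above can be evaluated column by column once an alignment is given; the hard part is searching the space of alignments, not scoring one. The sketch below uses arbitrary placeholder scoring values (match, mismatch, and a linear gap penalty, with gap–gap pairs ignored), not the scheme of any particular program.

```python
from itertools import combinations

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a gapped alignment given as equal-length rows."""
    score = 0
    for col in zip(*alignment):                # iterate over alignment columns
        for a, b in combinations(col, 2):      # every pair of sequences
            if a == "-" and b == "-":
                continue                       # a common convention: gap-gap pairs score 0
            elif a == "-" or b == "-":
                score += gap
            else:
                score += match if a == b else mismatch
    return score

print(sp_score(["ACG-T", "AC-GT", "GCGGT"]))
```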
Due to this, all the current implementations of multiple alignment algorithms are heuristics and none of them guarantee a full optimization. Considering their most obvious properties, existing algorithms can be classified into three main categories: exact, progressive and iterative. Exact algorithms are high quality heuristics that deliver an alignment usually very close to optimality [16,20], but also require enormous quantities of computational resources (time and space). They can only handle a small number of sequences (< 20) and are limited to the sums-of-pairs objective function. Progressive alignments are by far the most widely used [22,12,6]. They depend on a progressive assembly of the multiple alignment [13,8,22] where sequences or alignments are added one at a time, so that never more than two sequences (or multiple alignments) are simultaneously aligned using dynamic programming [14]. This approach has the great advantage of speed and simplicity combined with reasonable sensitivity, even if it is by nature a heuristic that does not guarantee any level of optimization. Other progressive alignment methods exist, such as DiAlign [19] or Match-Box [7], which assemble the alignment in a sequence-independent manner by combining segment pairs in an order dictated by their score, until every residue of every sequence has been incorporated in the multiple alignment. Iterative alignment methods depend on algorithms able to produce an alignment and to refine it through a series of cycles (iterations) until no more improvements can be made. Iterative methods can be deterministic or stochastic, depending on the strategy used to improve the alignment. The simplest iterative strategies are deterministic. They involve extracting sequences one by one from a multiple alignment and realigning them to the remaining sequences [3,11]; some of these methods can even be a mixture of progressive and iterative strategies [12]. The procedure is terminated when no more improvement can be made (convergence). Stochastic iterative methods include HMM training [15], simulated annealing, Gibbs sampling and genetic algorithms [17,21,31,1,4,5,10,16]. Simulated annealing has been used for multiple alignment but it is generally very slow and usually only works well as an alignment improver, i.e. when the method is given an alignment that is already close to optimal and is not trapped in a local minimum. Gibbs sampling has been very successfully applied to the problem of finding the best local multiple alignment block with no gaps, but its application to gapped multiple alignment is not trivial. An inspiring use of genetic algorithms for multiple sequence alignment is SAGA [21], an optimization black box in which any scoring function invented can be tested. The basic idea of SAGA is very straightforward and closely follows the 'simple GA' scheme [9]: given a set of sequences, a population of randomly generated multiple alignments evolves under some selection pressure. These alignments are in competition with each other for survival (survival of the fittest) and reproduction. Within SAGA, fitness depends on the score measured by an objective function (the better the score, the better the multiple alignment). Over a series of cycles (generations), alignments will die or survive, depending on their fitness. They can also improve and reproduce through some stochastic modifications known as mutations and crossovers. Mutations randomly insert or shift gaps while crossovers combine the content of two alignments (see Fig. 1).
Overall, 25 operators co-exist in SAGA and
compete for usage. The program does not guarantee to find the optimal multiple alignment, but it has been shown to provide high-quality alignments. Of course, the more operators are defined, the more parameters the user has to tune. But is it really necessary to have so many ad hoc operators? The presented approach therefore differs from the above-mentioned genetic algorithms in two main respects: first, the basic idea is to evolve a population of Positional Weight Matrices (PWMs), each representing a multiple alignment profile of the input sequences, instead of directly evolving multiple alignments; second, we use just the two standard genetic operators, mutation and crossover, implemented for a matrix representation (see details in Section 2). Moreover, the implemented system can be used to align any kind of sequences, as it is not limited to DNA/RNA or protein alphabets.
Fig. 1. Example of crossover operator in SAGA
The tests on standard benchmarks show that the quality of the resulting alignments is generally high compared to standard multiple alignment algorithms. This paper is organized as follows: Section 2 presents the approach, Section 3 describes the experimental settings and the results obtained, and Section 4 draws conclusions and outlines future work.
2 The PWMAligner Approach
As mentioned, genetic algorithms have been used for solving the multiple sequence alignment problem by evolving a population of MSAs, as done in SAGA. In the presented approach the basic idea is that of evolving a population of Positional Weight Matrices (PWMs), each representing a multiple alignment profile of the input sequences.
A PWM is a matrix of score values that gives a weighted match to any given sequence of fixed length. It has one row for each symbol of the alphabet and one column for each position in the profile. The score assigned by a PWM to a sequence is defined as $\sum_{j} m_{s_j,j}$, where $j$ represents a position in the sequence, $s_j$ is the symbol at position $j$ in the sequence, and $m_{\alpha,j}$ is the score in row $\alpha$, column $j$ of the matrix. In other words, a PWM score is the sum of position-specific scores for each symbol in the sequence. A PWM assumes independence between positions in the profile, as it calculates scores at each position independently of the symbols at other positions. The score of a sequence aligned with a PWM can be interpreted as the log-likelihood of the sequence under a product multinomial distribution. PWMs are usually more compact than MSAs, especially when many sequences have to be aligned; they can be easily represented as bitstrings, and standard genetic operators can be applied directly. A PWM, therefore, describes the information content of an MSA in a compact and operational way and can be used to find the same profile in other sequences. A PWM can be easily extracted from an MSA: each cell of the PWM contains the number of occurrences of a symbol in the corresponding column of the MSA divided by the total number of sequences (i.e., each cell corresponds to the probability that a given symbol appears in that column of the multiple alignment). Given a set of sequences and a PWM, it is also possible to build a corresponding MSA that has the PWM as profile, by aligning each sequence pairwise to the PWM and collecting together the best alignments. In order to compute the pairwise alignment, PWMAligner uses the standard dynamic programming algorithm by Needleman & Wunsch [20]. An example of this procedure is shown in Fig. 2.
Fig. 2. Given a set of sequences (a) and a PWM (b), each sequence is aligned to the PWM and the best alignment is used for building the multiple sequence alignment on the right (c)
Even though there might be more than one MSA having that PWM as profile, the outlined procedure deterministically produces the same multiple alignment given a PWM and the input sequences. By evaluating the resulting MSA, one can assign a score to the corresponding PWM and tell which one better represents a multiple alignment of the input sequences.
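As a concrete illustration of the two operations described above — extracting a PWM from an MSA and scoring a sequence against it — the following minimal Python sketch may help. It is not the PWMAligner implementation (which aligns each sequence to the PWM with Needleman–Wunsch and stores PWMs as GALib genomes); the alphabet and the example alignment are assumptions made only for the example.

```python
import numpy as np

# Hypothetical alphabet for the example; PWMAligner itself is not tied to any alphabet.
ALPHABET = "ACGT-"

def pwm_from_msa(msa):
    """Column-wise symbol frequencies of an MSA (rows are aligned sequences)."""
    n_seqs, length = len(msa), len(msa[0])
    pwm = np.zeros((len(ALPHABET), length))
    for j in range(length):
        for row in msa:
            pwm[ALPHABET.index(row[j]), j] += 1.0
    return pwm / n_seqs

def pwm_score(pwm, seq):
    """Score of a sequence against the PWM: the sum of m[s_j, j] over positions j."""
    return sum(pwm[ALPHABET.index(s), j] for j, s in enumerate(seq))

msa = ["ACG-T", "ACGTT", "A-GTT"]
pwm = pwm_from_msa(msa)          # frequency matrix, one column per profile position
print(pwm_score(pwm, "ACGTT"))   # a higher score means a better match to the profile
```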
2.1 Scoring Function
Although several scoring functions can be easily implemented in PWMAligner, in this paper we focus on two scoring functions specifically designed for biological sequences. Any scoring function for protein multiple alignment has advantages and disadvantages. Some of the differences between the scoring functions in use come from different ideas of how to represent the underlying biology and the likelihoods of different evolutionary events. A straightforward way to define a scoring function for multiple sequence alignments is to add together the pairwise similarity between each pair of sequences at each position, obtaining the well-known and widely used weighted sum-of-pairs scoring function (wSP):
$$wSP(msa) = \sum_{i=1}^{nSeqs-1} \; \sum_{j=i+1}^{nSeqs} w_{i,j}\,\sigma(msa[i], msa[j]) \qquad (1)$$

where $\sigma(s_1, s_2)$ is the score of aligning $s_1$ to $s_2$, and $w_{i,j}$ is a weight measure (usually related to the percentage of identity between the two aligned sequences). Its computation requires a total of $N^2$ additions per column, if $N$ is the number of sequences in the alignment. A simpler approach chooses a single sequence as a template, and scores each position by taking the sum of similarities between each sequence and the template at that position:
$$STT(msa, \tau) = \sum_{i=1}^{nSeqs} \sigma(msa[i], \tau) \qquad (2)$$
where σ (s1,s2) is the score of aligning s1 to s2 and τ is the template. The template sequence may be one of the sequences being aligned, their consensus sequence, or another sequence that was determined externally. In our case, we use the PWM as template sequence and the scoring function being optimized is therefore the sum of similarities to the PWM template. For biological sequences, BLOSUM matrices are used to score pairs of aligned symbols, according to sequence identities, and an affine gap model is used for scoring gaps. PWMAligner is implemented as a GASteadyStateGA of the GALib library [27]. PWMs are implemented as GA2DArrayGenome, and the standard crossover and mutation operators provided by the GALib library are used to evolve a population of individuals, until convergence (i.e., the fitness does not change in the last 10 generations) or a fixed number of generations. In summary, the proposed approach to multiple sequence alignment differs from the most used ones in three main respects:
1) It implements a stochastic search for the optimal alignment by a simple genetic algorithm that evolves PWMs.
2) Even though the common sum-of-pairs scoring function can be used as an option, better results have been obtained by using a simplified scoring function (STT).
3) Biological knowledge is not hard-coded in the algorithm operators, and thus PWMAligner can be used to align any kind of sequences.
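To make the two objective functions concrete, here is a minimal sketch of Eq. (1) and Eq. (2). The sigma function below is a toy substitution score standing in for a BLOSUM lookup with affine gap penalties, and the uniform weights are an assumption; this is an illustration, not the PWMAligner code.

```python
import numpy as np

def sigma(a, b):
    # Toy substitution score; PWMAligner uses BLOSUM matrices and an affine gap model.
    if a == "-" or b == "-":
        return -1.0
    return 2.0 if a == b else -0.5

def wsp_score(msa, w=None):
    """Weighted sum-of-pairs (Eq. 1); weights default to 1 for every sequence pair."""
    n = len(msa)
    w = np.ones((n, n)) if w is None else w
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            total += w[i, j] * sum(sigma(a, b) for a, b in zip(msa[i], msa[j]))
    return total

def stt_score(msa, pwm, alphabet="ACGT-"):
    """Sum of similarities to the PWM used as template tau (Eq. 2)."""
    return sum(pwm[alphabet.index(s), j]
               for row in msa for j, s in enumerate(row))
```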
3 Experimental Results
We tested the performance of PWMAligner on the BAliBASE benchmark [2,26], version 2, references Ref1, Ref2, Ref3, Ref4 and Ref5, for a total of 139 multiple alignments of varying lengths. BAliBASE is a database of manually refined multiple sequence alignments specifically designed for the evaluation and comparison of multiple sequence alignment programs. In order to measure accuracy, we computed the SPS score, i.e., the number of correctly aligned residue pairs divided by the number of residue pairs in the reference alignment, as reported by the bali_score program, measured only on the core blocks annotated in the database.
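A rough sketch of how an SPS-style score can be computed is given below; unlike bali_score it ignores the core-block annotation and simply counts how many reference residue pairs are recovered by the test alignment.

```python
def aligned_residue_pairs(msa):
    """Pairs ((seq_i, residue_i), (seq_j, residue_j)) placed in the same column."""
    counters = [0] * len(msa)
    pairs = set()
    for col in range(len(msa[0])):
        residues = []
        for i, row in enumerate(msa):
            if row[col] != "-":
                residues.append((i, counters[i]))
                counters[i] += 1
        for a in range(len(residues)):
            for b in range(a + 1, len(residues)):
                pairs.add((residues[a], residues[b]))
    return pairs

def sps_score(test_msa, ref_msa):
    """Fraction of reference residue pairs that are also aligned in the test MSA."""
    ref = aligned_residue_pairs(ref_msa)
    return len(ref & aligned_residue_pairs(test_msa)) / len(ref) if ref else 0.0
```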
[Fig. 3 is a bar chart of SPS scores (0.0–1.0) on the Ref1–Ref5 subsets for PRRP, CLUSTALX, SAGA, DIALIGN, SB_PIMA, ML_PIMA, MULTALIGN, PILEUP8, MULTAL, HMMT and PWMAligner.]
Fig. 3. Results obtained on the BAliBase benchmark
Fig. 3 reports the results obtained by PWMAligner compared to those published on the BAliBASE web site. In particular, PWMAligner used the STT scoring function, for which we obtained the best results, and was run with a population of 200 PWMs until convergence or 500 generations, whichever came first. In 3 out of 5 cases PWMAligner obtained better results; on the Ref1 subset it obtained comparable performance, while on Ref4 it is significantly worse. By analysing the PWMAligner behaviour more deeply, we found that PWMAligner effectively optimizes the implemented scoring function in most cases (95% for STT, 80% for wSP), as shown in Table 1. This means that in most cases the BAliBASE manually refined alignments do not correspond to a maximum of the implemented scoring function. So, in order to obtain better results, PWMAligner should implement a more tailored scoring function.
Table 1. Number of times BAliBASE alignment gets higher scores than PWMAligner

                    Ref1  Ref2  Ref3  Ref4  Ref5  Total
  BAliBASE STT         5     0     0     2     0      7
  PWMAligner STT      76    23    11    10    12    132
  BAliBASE wSP        21     1     0     2     4     28
  PWMAligner wSP      60    22    11    10     8    111
Moreover, performance has also been tested on a second benchmark: HOMSTRAD is a database of protein alignments assembled automatically by the structural alignment program COMPARER from sets of sequences whose members all have a known 3D structure (http://tardis.nibio.go.jp/homstrad/). We considered 233 multiple alignments made of sequences of varying lengths. To measure accuracy and compare with other algorithms, we report the CS score computed by the aln_compare program. As for BAliBASE, PWMAligner was run with the STT scoring function until convergence or 500 generations, with a population of 100 PWMs. As reported in Fig. 4, PWMAligner accuracy is significantly better (P < 0.05) than that of the competitors.
[Fig. 4 is a bar chart of CS scores (roughly 30–80) on HOMSTRAD for Probcons, POA, PCMA, Muscle6, FINSI, Dialign-T, ClustalW, M-Coffee, T-Coffee and PWMAligner.]
Fig. 4. Results obtained on the HOMSTRAD benchmark by PWMAligner compared to the ones published in [28]
All experiments have been run on the ShareGrid platform [25].
4 Conclusions and Future Works
In this paper, we presented an approach to multiple sequence alignment based on genetic algorithms. It turned out that GAs are a very effective tool for optimizing the
scoring function used to evaluate multiple alignments. However, as shown on the BAliBASE benchmark, PWMAligner in most cases finds a multiple alignment that attains an optimum of the scoring function but does not always correspond to the reference alignment, i.e. it does not have a high SPS score as computed by the bali_score program. Further investigations will allow us to tailor the scoring function to better mimic the manually refined alignments in BAliBASE. No running-time tests have been performed so far, because we were more interested in obtaining high-scoring alignments than fast answers. In any case, the time spent in the computation of a multiple alignment depends on the length and number of sequences to be aligned, the number of individuals in the GA population and the number of generations. In the experiments performed, the running times ranged from a few minutes to one hour on an Intel Core 2 Duo processor. Finally, further multiple alignment benchmarks will be taken into consideration, and performance tests will be carried out.
Acknowledgements
We would like to thank Dr. Pietro Liò for his helpful suggestions and discussions on the scoring functions.
References 1. Anabarasu, L.A.: Multiple sequence alignment using parallel genetic algorithms. In: The Second Asia-Pacific Conference on Simulated Evolution (SEAL-98), Canberra Australia (1998) 2. Bahr, A., Thompson, J.D., Thierry, J.C., Poch, O.: BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res. 29, 323–326 (2001) 3. Barton, G.J., Sternberg, M.J.E.: A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J. Mol. Biol. 198, 327–337 (1987) 4. Cai, L., Juedes, D., Liakhovitch, E.: Evolutionary computation techniques for multiple sequence alignment. In: Congress on Evolutionary Computation (2000) 5. Chellapilla, K., Fogel, G.B.: Multiple sequence alignment using evolutionary programming. In: Congress on Evolutionary Computation (1999) 6. Corpet, F.: Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881–10890 (1988) 7. Depiereux, E., Baudoux, G., Briffeuil, P., et al.: Match-Box_server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13(3), 249–256 (1997) 8. Feng, D.-F., Doolittle, R.F.: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–360 (1987) 9. Goldberg, D.E.: Genetic Algorithms. In: Goldberg, D.E. (ed.) Search Optimization and Machine Learning. Addison-Wesley, New York (1989) 10. Gonzalez, R.R.: Multiple protein sequence comparison by genetic algorithms. In: SPIE-98 (1999)
11. Gotoh, O.: Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinements as Assessed by Reference to Structural Alignments. J. Mol. Biol. 264(4), 823–838 (1996) 12. Heringa, J.: Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Computers and Chemistry 23, 341–364 (1999) 13. Higgins, D.G., Sharp, P.M.: Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244 (1988) 14. Hogeweg, P., Hesper, B.: The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J. Mol. Evol. 20, 175–186 (1984) 15. Krogh, A., Brown, M., Mian, I.S., Sjölander, K., Haussler, D.: Hidden Markov Models in Computational Biology: Applications to Protein Modeling. J. Mol. Biol. 235, 1501–1531 (1994) 16. Kim, J., Pramanik, S., Chung, M.J.: Multiple Sequence Alignment using Simulated Annealing. Comp. Applic. Biosci. 10(4), 419–426 (1994) 17. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993) 18. Lipman, D.J., Altschul, S.F., Kececioglu, J.D.: A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA 86, 4412–4415 (1989) 19. Morgenstern, B., Dress, A., Wener, T.: Multiple DNA and protein sequence based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93, 12098–12103 (1996) 20. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970) 21. Notredame, C., Higgins, D.G.: SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 24, 1515–1524 (1996) 22. Notredame, C., Higgins, D.G., Heringa, J.: TCoffee: A novel algorithm for multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000) 23. Stoye, J., Moulton, V., Dress, A.W.: DCA: an efficient implementation of the divideandconquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13(6), 625–626 (1997) 24. Taylor, W.R.: A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161–169 (1988) 25. The ShareGrid Project Home Page, http://dcs.di.unipmn.it/ShareGrid/ (visited on June 2009) 26. Thompson, J.D., Plewniak, F., Poch, O.: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88 (1999) 27. Wall, M.: GAlib: A C++ Library of Genetic Algorithm Components, ver. 2.4.7, http://lancet.mit.edu/ga/ 28. Wallace, I.M., O’Sullivan, O., Higgins, D.G., Notredame, C.: M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 34(6), 1692–1699 (2006) 29. Wang, L., Jiang, T.: On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348 (1994) 30. Waterman, M.S., Smith, T.F., Beyer, W.A.: Some biological sequence metrics. Adv. Math. 20, 367–387 (1976) 31. Zhang, C., Wong, A.K.: A genetic algorithm for multiple molecular sequence alignment. Comput. Appl. Biosci. 13(6), 565–581 (1997)
Multiple Clustering Solutions Analysis through Least-Squares Consensus Algorithms Loredana Murino1 , Claudia Angelini2 , Ida Bifulco1 , Italia De Feis2 , Giancarlo Raiconi1 , and Roberto Tagliaferri1 1
NeuRoNe Lab, DMI University of Salerno, via Ponte don Melillo, 84084 Fisciano (SA) Italy 2 Istituto per le Applicazioni del Calcolo ‘Mauro Picone’ CNR, via Pietro Castellino, 111 - 80131 Napoli Italy
Abstract. Clustering is one of the most important unsupervised learning problems and it deals with finding a structure in a collection of unlabeled data; however, different clustering algorithms applied to the same data-set produce different solutions. In many applications the problem of multiple solutions becomes crucial and providing a limited group of good clusterings is often more desirable than a single solution. In this work we propose the Least Square Consensus clustering that allows a user to extrapolate a small number of different clustering solutions from an initial (large) set of solutions obtained by applying any clustering algorithm to a given data-set. Two different implementations are presented. In both cases, each consensus is accomplished with a measure of quality defined in terms of Least Square error and a graphical visualization is provided in order to make immediately interpretable the result. Numerical experiments are carried out on both synthetic and real data-sets.
1 Introduction
Clustering can be considered as one of the most important unsupervised learning techniques and it deals with finding a structure in a collection of unlabeled data. It is well known that different clustering algorithms applied to the same data-set produce different solutions. Also, the most used clustering algorithms, such as partitional clustering and model-based clustering, are based on initial random assignments or follow random procedures. So when the same algorithm runs several times on the same data, different clusterings can be obtained and more than one of these solutions can explain the data distribution in a convincing manner. As a final result, there are multiple different solutions for a single problem. Clustering has proved to be a useful technique to reveal the hidden structure of data in many applications and, in particular, in genomic experiments where a large amount of data has to be analyzed. In this case the problem of multiple solutions becomes crucial and providing a limited group of different "good" solutions is often more desirable than a single solution. One possible approach to the problem of analyzing multiple clustering solutions is to merge several solutions to obtain a new clustering: this is called Consensus Clustering [1,3,10,11,14,16,17].
Obviously a measure of quality of the new clustering must be defined to show how representative the new solution is with respect to the starting clusterings. In this work we propose the Least Square Consensus method, which extends the idea of Least Square Clustering (see, for example, [8]) and allows one to extrapolate a small number of different clustering solutions from an initial (large) set of clusterings obtained by applying any clustering algorithm to a selected data-set. To this aim two different procedures have been developed. A measure of quality is also defined in terms of Least Square error, and a graphical visualization of the solutions is presented to make the results of the two procedures immediately interpretable. The present paper is organized as follows. Section 2 is devoted to the methods proposed for the selection of a small group of clustering solutions, Section 3 describes the results of the experiments performed on a simulated and a real data-set. Details about the software used are given in Section 4. Conclusions are drawn at the end.
2 Methodology
Let γ1, . . . , γM be a collection of M clustering solutions obtained from any clustering algorithm applied to the same data-set of dimensions N × D, where N is the number of elements to cluster (genes) and D is the number of features (for each gene). The goal is to find a set of L solutions γ1*, . . . , γL*, with L << M, which are representative solutions of the starting M and for which it is possible to define a measure of quality E1, . . . , EL. A consensus clustering algorithm based on Least-Squares Clustering (LS) has been developed for this purpose. The idea is based on the Least-Squares Clustering (LS) used in [8] for selecting a single clustering solution from the thousands of clusterings generated by the Markov chain Monte Carlo (MCMC) algorithm. In that context LS selected the clustering solution from the Markov chains by minimizing the sum of squared deviations from the averaged pairwise probability matrix that genes are clustered together. Starting from this basic idea, we have defined a Least-Square Consensus Clustering of a set of clustering solutions and its associated quality measure. For each clustering γ ∈ Γ, where Γ is a subset of clusterings, an association matrix δ(γ) of dimension N × N can be formed whose (i, j) element is δi,j(γ), an indicator of whether gene i is clustered with gene j. Element-wise averaging of these association matrices yields the pairwise probability matrix of clustering, denoted π. Specifically, the least-squares clustering γLS is the observed clustering γ which minimizes the sum of squared deviations of its association matrix δ(γ) from the pairwise probability matrix π:

$$\gamma_{LS} = \arg\min_{\gamma \in \Gamma} \sum_{i=1}^{N} \sum_{j=1}^{N} \left(\delta_{i,j}(\gamma) - \pi_{i,j}\right)^2 \qquad (1)$$
The Least-Squares Consensus Clustering presents many advantages: first, it uses information from all the clusterings (via the pairwise probability matrix) and is intuitively appealing because it selects the average clustering (instead of forming a clustering via an external, ad hoc algorithm). It is independent of the number of clusters k of the single clusterings. Finally, the consensus clustering, being one of the original clusterings of the considered set, provides a label for all the N data-set elements, that is, no element is eliminated by the combining procedure. Given a least-squares clustering γLS for a certain group of clustering solutions Γ, one can define a quality measure to evaluate how representative the obtained solution is. We have defined such a measure in terms of the Least Square Error. More specifically, the error ELS associated with a least-squares clustering γLS is the sum of squared deviations of the association matrices of all the clusterings of the set Γ from the association matrix of the consensus clustering γLS, divided by the total number of clusterings Z in Γ:

$$E_{LS} = \frac{1}{Z} \sum_{\gamma_r \in \Gamma} \sum_{i=1}^{N} \sum_{j=1}^{N} \left(\delta_{i,j}(\gamma_r) - \delta_{i,j}(\gamma_{LS})\right)^2 \qquad (2)$$
Intuitively, Eq. 2 shows that the greater the Least Square Error ELS, the more distant the least-squares clustering γLS is from the other clusterings of the group. On the other hand, when ELS is small, the consensus clustering γLS is closer to the other clusterings and therefore more representative of the whole group. In the following we describe two procedures, based on Least Square Consensus Clustering, that allow a user to form a limited number of lists of similar solutions and to extract for each list a consensus solution and its quality measure. In both cases the procedures start from the concept of a similarity measure between clusterings. Well-known similarity functions are found in the literature, such as the Minkowski index, the Jaccard coefficient, and correlation and matching coefficients (all found in [2]). In our studies we used a measure S based on the entropy of the confusion matrix between clustering solutions [5]. When M clusterings γ1, . . . , γM are considered, the measure Sl,t can be computed for any pair of clusterings γl and γt and assembled in a similarity matrix SM of dimension M × M.
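The two quantities defined by Eq. (1) and Eq. (2) are straightforward to compute from label vectors; the sketch below is one possible NumPy rendering (function and variable names are ours, not taken from the authors' MATLAB code).

```python
import numpy as np

def association_matrix(labels):
    """delta(gamma): 1 if elements i and j share the same cluster label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def ls_consensus(clusterings):
    """Least-squares consensus gamma_LS (Eq. 1) and its error E_LS (Eq. 2)."""
    deltas = [association_matrix(c) for c in clusterings]
    pi = np.mean(deltas, axis=0)                # pairwise probability matrix
    best = int(np.argmin([np.sum((d - pi) ** 2) for d in deltas]))
    gamma_ls = deltas[best]
    e_ls = np.mean([np.sum((d - gamma_ls) ** 2) for d in deltas])
    return best, e_ls, pi
```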
2.1 Algorithm 1
The algorithm is based only on the similarity matrix SM. It can be summarized in the following steps:
1. Construct the similarity matrix SM, select its first row and set Γ = {γ1, . . . , γM}.
2. Compute the Least Square consensus clustering γLS (Eq. 1) for all the clusterings in Γ.
3. Compute the corresponding error ELS (Eq. 2).
4. In the selected row of SM, find the clustering with the minimum similarity value, that is, the clustering furthest from the considered one. Remove it from the initial set, obtaining a new set Γ that contains one element less than the initial one.
5. Repeat steps 2–5 using the new set Γ until it contains only 3 clusterings. In fact, when there are 2 elements both clusterings have the same distance from the pairwise probability matrix π and the choice of either one as the consensus clustering γLS is equivalent.
6. Repeat the whole procedure (steps 2–6) for all the rows of the similarity matrix SM.
Using this procedure, for each row i (clustering) and at each iteration (see step 5) we have a list of clusterings Γi, the corresponding consensus clustering and its error measure. Moreover, it is possible to plot the error ELS versus the number of clusterings contained in Γi. By superimposing the error plots of all the rows, the user can select a common threshold for ELS. Given this value, for each row i it is possible to extract the largest Γi and its consensus clustering such that the Least Square Error is lower than the established threshold. In practice, the number L of distinct consensus solutions depends on the choice of the threshold and is usually much smaller than M. Such a list of potential consensus solutions can be further reduced. For example, one can inspect the groups of clusterings Γi which have produced the same consensus, i.e., for each distinct consensus we can associate the list of clusterings that form that consensus. There are at least two options for carrying out this step: constructing the minimum list (intersection of the clusterings) or the maximum list (their union). Clearly, comparing the lists corresponding to different consensus solutions, one can notice that some of them are strictly contained in others and, for this reason, they can be eliminated from the final list, leaving as consensus the one corresponding to the larger set. On the other hand, some groups are completely distinct, so they form "independent consensus" solutions; however, most of the groups are characterized by partial overlaps of varying size with other groups of the list, forming the "partially dependent consensus" solutions.
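A sketch of the row-wise shrinking loop, reusing ls_consensus from the previous sketch, might look as follows; the min_size of 3 mirrors step 5, while the threshold handling described above is left to the caller. The function name and interface are assumptions made for illustration.

```python
def algorithm1_row(sim_row, clusterings, min_size=3):
    """Shrink the clustering set guided by one row of the similarity matrix SM,
    recording at each iteration the active set, its consensus and the LS error."""
    active = list(range(len(clusterings)))
    history = []
    while len(active) > min_size:
        best, e_ls, _ = ls_consensus([clusterings[i] for i in active])
        history.append((list(active), active[best], e_ls))
        # step 4: drop the clustering with minimum similarity in the selected row
        worst = min(active, key=lambda i: sim_row[i])
        active.remove(worst)
    return history
```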
2.2 Algorithm 2
Unlike Algorithm 1, this procedure is based not only on the similarity matrix SM but also on the dendrogram that can be constructed starting from this matrix. In fact, once the similarity between each pair of solutions is obtained, a hierarchical clustering can be directly applied to it. The result is a meta-clustering dendrogram in which the leaves represent the clustering solutions. Algorithm 2 can be summarized in the following steps:
1. Construct the similarity matrix SM.
2. Construct a dendrogram using a hierarchical clustering algorithm applied to SM.
Fig. 1. Least Square Error of the clusterings set at each step of the Algorithm 2 for the synthetic data-set 1
3. Starting from the lower level of linkage (or, equivalently, the higher level) of the dendrogram, consider the groups of solutions which have been aggregated in each node.
4. Compute the Least Square consensus clustering γLS (Eq. 1) for the current group of solutions.
5. Compute the error ELS (Eq. 2) as follows. If during the present step another leaf is aggregated to the group of solutions of the former node, define the error ELS as the maximum between the error calculated at the current step and that calculated for the same group without the new leaf. Otherwise, if during the present step two clusterings have been joined in a node in another region of the dendrogram, define the error ELS as the maximum between the error calculated at the current step and that calculated during the previous one.
6. Repeat the whole procedure until all the nodes of the dendrogram have been explored, that is, until all the starting clustering solutions of the problem have been aggregated in a unique group.
As for Algorithm 1, at each step of this procedure a group of clusterings, its consensus solution and its associated error are available and, also in this case, it is possible to construct a plot of the Least Square Error versus the number of clusterings of the group at each step. The final result is a single curve showing the error trend. Observing this kind of curve for some experiments, as shown in Section 3, it is clear that against an average steady behaviour, some "jumps" are present.
Depending on the algorithm, these jumps emphasize the aggregation of a single clustering, or of a group of solutions, which is far (in the least-squares sense) from the group of solutions obtained by the previous aggregation. The natural way to proceed is to select a threshold value on the plot in correspondence with one of these jumps. This value identifies a corresponding threshold on the dendrogram, which can be cut to reveal several groups of similar clusterings. For each group a consensus solution and a measure of quality have been defined. The result is a list of groups of solutions which are all distinct from each other and, by construction, do not admit overlaps. On the other hand, the result is strongly dependent on the hierarchical clustering algorithm used to construct the dendrogram starting from the similarity matrix.
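Using SciPy's hierarchical clustering, the dendrogram-cutting step can be sketched as follows. Converting the similarity S into a distance as 1 − S is an assumption made for the example, the complete-linkage choice follows Section 3, and the threshold is the value read off the error-jump plot; this reuses ls_consensus from the earlier sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def algorithm2(sim, clusterings, threshold):
    """Cut a complete-linkage dendrogram built on 1 - SM and return, for every
    resulting group, the index of its LS consensus clustering and its error."""
    dist = 1.0 - np.asarray(sim, dtype=float)
    condensed = dist[np.triu_indices_from(dist, k=1)]   # condensed distance vector
    tree = linkage(condensed, method="complete")
    groups = fcluster(tree, t=threshold, criterion="distance")
    results = {}
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        best, e_ls, _ = ls_consensus([clusterings[i] for i in idx])
        results[int(g)] = (int(idx[best]), e_ls)
    return results
```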
2.3 Pairwise Matrix Visualization
As explained in Section 1, both methodologies provide a small number of solutions to the end user who, in most cases, is a biologist, and who is asked to analyze the biological significance of the obtained results by performing an accurate and deep study of the genes that are grouped together. In order to give an immediate feedback on the analysis results, a graphical visualization has been provided. It is valid for both procedures. Let us consider the list of clusterings Γt = {γ1, . . . , γm}, where m ∈ {1, . . . , M} (M is the number of initial clustering solutions), and let γt* denote its Least Square consensus. Without loss of generality, the samples can be rearranged in order to place together elements (genes) with the same label, so that γt* contains first all the samples that belong to cluster 1, then those in cluster 2, and finally those in cluster k. The average pairwise probability matrix πij can be rearranged accordingly. Note that each element (i, j) of this matrix takes the value 1 if the corresponding pair of elements is allocated to the same cluster in all the clusterings {γ1, . . . , γm}, and takes the value 0 if the corresponding pair of elements is allocated to different clusters in every clustering. All the values in the range [0, 1] model the intermediate conditions. In this way a heat-map of the pairwise probability matrix of each group Γ1, . . . , ΓL of the final list can be displayed. It is easy to see that when the clusterings of the same group are similar (in the least-squares sense), homogeneous blocks appear in the matrix, showing that these sets of genes have always been clustered in the same manner. This visualization also makes it possible to display "how many" and "which" clusters are "mixed", that is, to highlight the genes classified in a different way in several clusterings. To improve on this type of visualization, a further reorganization of the blocks is possible: for each block the portion of cells with value 1 is arranged in the upper left corner of the block and, using an iterative procedure, all the other portions are arranged in descending order depending on their values in the range [0, 1]. This makes it easier to evaluate the homogeneous fractions of the single clusters but, on the other hand, all the information on the gene positions is lost.
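The heat-map itself can be produced with a few lines of matplotlib; the reordering by consensus labels is the only non-obvious step. The colormap and figure details below are arbitrary choices, and association_matrix is the helper from the earlier sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pairwise_matrix(clusterings, consensus_labels):
    """Reorder samples by the consensus labels and display the averaged
    pairwise probability matrix pi as a heat-map."""
    order = np.argsort(np.asarray(consensus_labels), kind="stable")
    pi = np.mean([association_matrix(c) for c in clusterings], axis=0)
    pi = pi[np.ix_(order, order)]
    plt.imshow(pi, vmin=0.0, vmax=1.0, cmap="jet")
    plt.colorbar()
    plt.show()
```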
Fig. 2. Dendrogram of the solutions for the synthetic data-set 1. Different colors indicate different groups of aggregated clusterings.
3 Experimental Results
In the present work the K-means clustering algorithm was used to generate the initial group of clustering solutions. It is clear that multiple K-means runs can produce the same solution, so, in a preprocessing phase, only distinct clusterings were selected. Moreover, to construct the dendrogram of clustering solutions starting from the similarity matrix, the "complete linkage" algorithm, already implemented in the MATLAB toolbox, was used. Both procedures have been applied to simulated and real data.
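One way to build the initial pool of distinct solutions is sketched below, here with scikit-learn's K-means rather than the authors' MATLAB code; labelings are canonicalized so that clusterings differing only by a permutation of labels are treated as identical. The function names and the canonicalization step are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def canonical(labels):
    """Relabel clusters by order of first appearance (permutation-invariant key)."""
    mapping = {}
    return tuple(mapping.setdefault(l, len(mapping)) for l in labels)

def distinct_kmeans_solutions(X, k=5, runs=500, seed=0):
    rng = np.random.RandomState(seed)
    seen, solutions = set(), []
    for _ in range(runs):
        labels = KMeans(n_clusters=k, n_init=1,
                        random_state=rng.randint(2**31 - 1)).fit_predict(X)
        key = canonical(labels)
        if key not in seen:
            seen.add(key)
            solutions.append(np.asarray(labels))
    return solutions
```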
3.1 Simulated Data
In order to test the proposed procedures, we applied them to synthetic data-sets composed of mixtures of Gaussians with different covariance matrices. Three synthetic data-sets have been generated: the first is composed of 5 Gaussians in 6 dimensions with a total of N = 814 points, the second of 5 Gaussians in 8 dimensions with a total of N = 439 points, and the third of 10 Gaussians in 8 dimensions with a total of N = 590 points. For the sake of brevity only the results obtained for the first data-set will be presented; the results on the other data-sets are analogous. We start by running the K-means algorithm 500 times, obtaining 52 different solutions. For these solutions we computed the similarity matrix SM according to the similarity measure defined in Section 2. Applying Algorithm 1 to this data-set we obtained a plot of 52 Least Square Error curves, one for each clustering. From this plot we selected a threshold value E*_LS = 5000. Considering, for example, the minimum list (intersection)
of the clusterings which have a consensus characterized by an error lower than the threshold, we obtained a final list of 16 groups of clusterings: 8 are independent (in particular there are 2 singletons) and the remaining 8 present partial overlaps. A similar result was obtained considering the maximum list (union) of the clusterings. Then, using the complete linkage method, we built the hierarchical tree on the 52 solutions and applied Algorithm 2. As explained in Section 2, starting from the lower level of aggregation, we explored the whole dendrogram, obtaining at each step a group of clusterings, its Least Square Consensus and the associated error. Fig. 1 shows the plot of the Least Square Error of the clustering set at each step. A first evident jump in the plot appears at step number 33. The corresponding value of the Least Square Error is E33 = 4729, which approximates the threshold value E*_LS = 5000 selected in Algorithm 1. This fact confirms that the two procedures are equivalent and, even though they are independent, they provide similar results. The selected threshold E33 allows a cut of the dendrogram of the solutions. The result of this cut is shown in Fig. 2: different groups of clusterings have been highlighted using a different color for each group. From the starting 52 clusterings, a set of 20 solutions (13 groups and 7 singletons) has been extrapolated.
Fig. 3. Pairwise matrix visualization for the group of clusterings {39, 40, 41, 42, 43, 44} obtained from the application of Algorithm 2 to the synthetic data-set 1
Finally we show the pairwise matrix visualization applied on the group formed by the following 6 clusterings: {39, 40, 41, 42, 43, 44}. The LS consensus clustering
for this group is solution 42; it is rearranged to separate the groups of elements with the same label k, where k ∈ {1, . . . , 5}. Then the pairwise probability matrix π is computed and rearranged according to the new order of clustering 42. The result is the symmetric 814 × 814 matrix shown in Fig. 3. Three homogeneous blocks are clearly visible: they represent three clusters whose elements have been clustered together in all the 6 solutions of the group. The other two clusters present mixed regions, identifiable by different colors. This type of plot has been generated for all the clustering groups of the final list. It enables the user to obtain an immediate verification of the LS aggregation result.
3.2 Real Data
As an example, we applied our framework to the Central Nervous System Embryonal Tumors data-set [13]. The CNS embryonal tumors data set was obtained from a representative heterogeneous group of tumors about which little is known biologically. The data set is composed of microarray gene expression data derived from 96 patient samples. Patients included children with medulloblastomas, young adults with malignant gliomas (WHO grades III and IV), children with AT/RTs, children with renal/extrarenal rhabdoid tumors and children with supratentorial PNETs. This data-set consists of 96 samples over 7129 conditions. The corresponding clustering solution has 5 clusters. The raw data were normalized by standardizing each row (sample) to mean 0 and variance 1.
Fig. 4. Least Square Error of the clusterings set at each step of the Algorithm 2 for the CNS data-set
As for the synthetic data-set described in Section 3.1, also in this case 500 runs of the K-means algorithm have been considered to obtain the starting set of clustering solutions. Actually, this same procedure has been performed 5 times using the classical K-means and as many times using an optimized version of K-means, also implemented in the MATLAB statistics toolbox. In this second case K-means repeats the clustering a certain number of times, each time with a new set of initial cluster centroid positions, and returns the solution with the lowest distortion value. Finally, the generation of multiple clustering solutions was also performed by exploiting a Global Optimization approach [4]. In many applications the function of interest is multi-modal and possibly not differentiable, and it is with this view in mind that the Controlled Random Search (CRS) algorithm [15] was initially developed. Some CRS algorithms also include genetic techniques [6,7]. We used a Price-based Global Optimization algorithm to explore local minima of the K-means objective function. Instead of sticking to the best found solution, we collect all the solutions corresponding to local minima. For brevity, only the results of the test performed with the Global Optimization approach are presented. Using K-means integrated in the Price algorithm we obtained a set of 121 different clusterings. The similarity matrix SM is computed and the hierarchical tree is built using complete linkage; then the two algorithms are applied. Using Algorithm 1, we obtain a plot of 121 Least Square Error curves from which it is very hard to select a threshold value, so in this case Algorithm 2 proves more suitable. Fig. 4 shows the plot of the Least Square Error of the clustering set at each step of Algorithm 2. For this example, an evident jump in the plot appears at step number 101. Cutting the dendrogram at this jump, we obtain a list of 18 groups of clusterings, represented by as many consensus solutions, and 3 singletons (see Fig. 5). So, a smaller group of clustering solutions can be provided to the end user, as required. The results of the procedure applied to the other sets of starting solutions obtained by the K-means algorithm show a similar behaviour. Fig. 6 shows the pairwise matrix visualization for one of the groups of clusterings of the final list. It refers to the group of 11 clusterings {8, 15, 18, 21, 23, 30, 36, 37, 77, 82, 85}. Of the 5 clusters, 2 are quite homogeneous (the majority of pixels is of the same color), that is, the patients in these clusters have always been classified in the same manner in all the clusterings. The remaining 3 clusters present mixed color areas, that is, the patients have often been classified in a different manner in the different clusterings of the selected group.
4 Software
All the software we developed runs in the MATLAB environment. In a future work it will be mainly integrated in the MIDA (Modular Interactive Dendrogram Analyzer, [12]) project to make the analysis more interactive and user friendly.
Fig. 5. Dendrogram of the solutions obtained using the Global Optimization approach for the CNS data-set. Different colors indicate different groups of aggregated clusterings.
Fig. 6. Pairwise matrix visualization for the group of clusterings {8, 15, 18, 21, 23, 30, 36, 37, 77, 82, 85} (obtained using the Global Optimization approach) of the CNS data-set
5 Conclusions
In this work we have introduced the Least Square Consensus clustering. The aim is to extrapolate a small number of different clustering solutions from an initial large set obtained by applying any clustering algorithm to a given data-set. Two different methodologies have been developed, both accompanied by a quality measure definition and a graphical visualization. In a future work we want to modify the procedure of dendrogram exploration in order to avoid a horizontal cut: the aggregation of solutions at different heights could be considered. To this aim a new definition of the error at each step of the procedure is necessary. The proposed approach can help in better understanding the data structure and permits an easier analysis by the end user.
References 1. Barthlemy, J.P., Leclerc, B.: The median procedure for partitions. In: Cox, I.J., Hansen, P., Julesz, B. (eds.) Partitioning Data Sets, pp. 3–34. American Mathematical Society, Providence (1995) 2. Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. In: Pacific Symposium on Biocomputing, vol. 7, pp. 6–17 (2002) 3. Bifulco, I., Fedullo, C., Napolitano, F., Raiconi, G., Tagliaferri, R.: Robust Clustering by Aggregation and Intersection Methods. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part III. LNCS (LNAI), vol. 5179, pp. 732–739. Springer, Heidelberg (2008) 4. Bifulco, I., Murino, L., Napolitano, F., Raiconi, G., Tagliaferri, R.: Using Global Optimization to Explore Multiple Solutions of Clustering Problems. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part III. LNCS (LNAI), vol. 5179, pp. 724–731. Springer, Heidelberg (2008) 5. Bishehsari, F., Mahdavinia, M., Malekzadeh, R., Mariani-Costantini, R., Miele, G., Napolitano, F., Raiconi, G., Tagliaferri, R., Verginelli, F.: PCA based feature selection applied to the analysis of the international variation in diet. In: Masulli, F., Mitra, S., Pasi, G. (eds.) WILF 2007. LNCS (LNAI), vol. 4578, pp. 551–556. Springer, Heidelberg (2007) 6. Brachetti, P., De Felice Ciccoli, M., Di Pillo, G., Lucidi, S.: A new version of the Price’s algorithm for global optimization. Journal of Global Optimization 10, 165–184 (1997) 7. Bresco, M., Raiconi, G., Barone, F., De Rosa, R., Milano, L.: Genetic approach helps to speed classical Price algorithm for global optimization. Soft Computing Journal 9, 525–535 (2005) 8. Dahl, D.B.: Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model. In: Do, K.-A., M¨ uller, P., Vannucci, M. (eds.) Bayesian Inference for Gene Expression and Proteomics, pp. 201–218. Cambridge University Press, Cambridge (2006) 9. Dudoit, S., Fridlyand, J.: A Prediction-based Resampling Method for Estimating the Number of Clusters in a Dataset. Genome Biology 3(7) (2002) 10. Fred, A.L.N., Jain, A.K.: Combining Multiple Clusterings Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 835–850 (2005)
11. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ACM Trans. Knowl. Discov. Data 1(1, 4) (2007) 12. MIDA software, NeuRoNe lab, DMI, University of Salerno, http://www.neuronelab.dmi.unisa.it 13. Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y.H., Goumnerovak, L.C., Blackk, P.M., Lau, C., Allen, J.C., ZagzagI, D., Olson, J.M., Curran, T., Wetmore, C., Biegel, J.A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D.N., Mesirov, J.P., Lander, E.S., Golub, T.R.: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442 (2002) 14. Nguyen, N., Caruana, R.: Consensus Clustering. In: ICDM, pp. 607–612 (2007) 15. Price, W.L.: Global optimization by controlled random search. Journal of Optimization Theory and Applications 55, 333–348 (1983) 16. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002) 17. Swift, S., Tucker, A., Vinciotti, V., Martin, N., Orengo, C., Liu, X., Kellam, P.: Consensus clustering and functional interpretation of gene-expression data. Genome Biology 5(11) (2004)
Projection Based Clustering of Gene Expression Data Sotiris K. Tasoulis1 , Vassilis P. Plagianakos1, and Dimitris K. Tasoulis2 1
Department of Computer Science and Biomedical Informatics, University of Central Greece, Papassiopoulou 2–4, Lamia, 35100, Greece {stas,vpp}@ucg.gr 2 Mathematics Department, Imperial College London, 180 Queen’s Gate, London SW7 2AZ, UK [email protected]
Abstract. The microarray DNA technologies have given researchers the ability to examine, discover and monitor thousands of genes in a single experiment. Nonetheless, the tremendous amount of data that can be obtained from microarray studies presents a challenge for data analysis, mainly due to the very high data dimensionality. A particular class of clustering algorithms has been very successful in dealing with such data, utilising information driven by the Principal Component Analysis. In this paper, we investigate the application of recently proposed projection based hierarchical clustering algorithms on gene expression microarray data. The algorithms apart from identifying the clusters present in a data set also calculate their number and thus require no special knowledge about the data. Keywords: Unsupervised Clustering, Cluster Analysis, Principal Component Analysis, Kernel Density Estimation, Bioinformatics, Gene Expression Analysis.
1 Introduction
The normal function of a cell is largely affected by the gene expression at a given stage. To understand the biological processes, the gene expression levels in different developmental phases need to be measured, at different body tissues, at different clinical conditions and for different organisms. This kind of information can help in the characterisation of gene function and the understanding of other molecular biological processes [30]. The discovery of hidden patterns in gene expression microarray data is a big challenge for genomics and proteomics research [30]. A promising technique seems to be the use of data mining methods. In particular, clustering could be the key step for understanding how gene activity varies during biological
processes, and how it is affected by states of diseases and cellular environments. Clustering can be used either to recognise groups of genes according to their expression in a set of samples [6,24], or to group samples into similar clusters that corresponds to particular macroscopic phenotypes [11]. This second type of analysis is more difficult because of the “curse of dimensionality” [3], i.e. limited number of samples and very high feature dimensionality. Formally, clustering can be defined as “the process of partitioning a set of data vectors into disjoint groups (clusters), so that objects of the same cluster are more similar to each other than objects in different clusters”. The modern roots of data clustering date back to 1939 [23], but there are references to it from antiquity. While clustering has a long history and a large number of clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields, significant challenges still remain. There exist many categorisations of clustering algorithms, but broadly we could divide them into three main categories: Hierarchical, Partitioning, and Distance-Based. Hierarchical clustering algorithms construct hierarchies of clusters in a top-down (agglomerative) or bottom-up (divisive) fashion. The former, start from n clusters, where n stands for the number of data points, each containing a single data point and iteratively merge the clusters satisfying certain measures of closeness. Divisive algorithms follow a reverse approach; starting with a single cluster containing all the data points and iteratively split existing clusters to subsets. Hierarchical clustering algorithms have been shown to result in high quality partitions especially for applications involving clustering text collections. None the less, their high computational requirements, usually prevents their usage in real-life applications, where the number of samples and their dimensionality is expected to be high (the computational cost is quadratic to the number of samples). One of the most important challenges encountered in clustering is associated with high data dimensionality [3]. In general terms, these problems result from the fact that a fixed number of data points become increasingly “sparse” as the dimensionality increases [21]. For clustering purposes, the most relevant aspect of the curse of dimensionality concerns the effect that it has on distance or similarity. In [4], for certain data distributions, it is shown that the relative difference of the distances of the closest and farthest data points of an independently selected point goes to 0 as the dimensionality increases. Thus, it is often said that “in high dimensional spaces, distances between points become relatively uniform” [21]. However, things are sometimes not as bad as they might seem, since it is often possible to reduce the dimensionality of the data without significant information loss. This can be accomplished by an appropriate feature selection procedure, i.e. discarding features showing little variation or being highly correlated with other features. Feature selection is a complicated subject in its own right and is out of the scope of the present study.
2 Projection Based Clustering
A projection based method projects points from a higher dimensional space to a suitable lower dimensional space. Typically this type of dimensionality reduction is accomplished by applying techniques from linear algebra or statistics, such as Principal Component Analysis (PCA) [13] or Singular Value Decomposition (SVD) [16]. This way, a powerful class of clustering algorithms has emerged that provides the opportunity for the deployment of these computational technologies from numerical linear algebra, an area that has seen enormous expansion in recent decades. In this class of algorithms the Principal Direction Divisive Partitioning (PDDP) algorithm is of particular value [5]. Compared to other similar techniques (like Latent Semantic Indexing [8] and Linear Least Square Fit [7]), PDDP has the advantage of very low computational complexity. PDDP is a "divisive" hierarchical clustering algorithm. As already described, this kind of technique produces a nested sequence of partitions, with a single, all-inclusive cluster at the top. Starting from this all-inclusive cluster, the nested sequence of partitions is constructed by iteratively splitting clusters, until a termination criterion is satisfied. Any divisive clustering algorithm can be characterised by the way it chooses to provide answers to the following three questions: Q1: Which cluster to split further? Q2: How to split the selected cluster? Q3: When should the iteration terminate? The PDDP based algorithms in particular use information from the PCA of the corresponding data matrix to provide answers to all three questions, in a computationally efficient manner. This is achieved by incorporating information from only the first singular vector and not a full rank decomposition of the data matrix.
3 Background
To formally describe the manner in which principal direction based divisive clustering algorithms operate, let us assume the data are represented by an n × a matrix D, whose row vectors represent a data sample d_i, for i = 1, . . . , n. Also define the vector b and matrix Σ to represent the mean vector and the covariance of the data, respectively:

$$b = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad \Sigma = \frac{1}{n}\,(D - be)^{\top}(D - be),$$
where e is a column vector of ones. The covariance matrix Σ is symmetric and positive semi-definite, so all its eigenvalues are real and non-negative. The eigenvectors uj j = 1, . . . , k corresponding to the k largest eigenvalues are called the
principal components or principal directions. The PDDP algorithm and all similar techniques use the projections $p_i = u_1(d_i - b)$, $i = 1, \ldots, n$,
onto the first principal component $u_1$, to initially separate the entire data set into two partitions P1 and P2. To generate this partitioning, the original PDDP algorithm uses the sign of the projection of each data point (division point 0). This reveals the text mining origins of the algorithm since, based on a term-document data specification [5], data points (documents) with positive projections should be more similar to each other than data points (documents) with negative ones. However, very often the sign of the projection can lead the algorithm to undesirable cluster splits [9,10,22,28]. More formally, we can define this splitting criterion as follows:
– (Splitting Criterion – SPC1): ∀ d_i ∈ D, if p_i ≥ 0, then the i-th data point belongs to the first partition, P1 = P1 ∪ {d_i}; otherwise, it belongs to the second partition, P2 = P2 ∪ {d_i}.
To answer the cluster selection question Q1, the PDDP algorithm, as well as all its variations [9,10,17,28,29], selects the cluster with maximum scatter value SV, defined as
$$SV = \|D - be\|_F, \qquad (1)$$
where e is a column vector of ones, the vector b represents the mean vector of D, and $\|\cdot\|_F$ is the Frobenius norm. This quantity can be a measure of coherence and it is the same as the one used by the k-means steered PDDP algorithm [29]. Finally, most of the PDDP variants terminate the clustering procedure when a user-defined number of clusters has been retrieved.
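For reference, the original sign-based split (SPC1) and the scatter value of Eq. (1) can be written in a few lines of NumPy; this is only an illustrative sketch under the definitions above, not the authors' implementation.

```python
import numpy as np

def first_pc_projections(D):
    """Projections of the (centered) rows of D onto the first principal direction."""
    centered = D - D.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0], centered

def pddp_sign_split(D):
    """Original PDDP splitting criterion SPC1: split by the sign of the projection."""
    p, _ = first_pc_projections(D)
    return np.where(p >= 0)[0], np.where(p < 0)[0]

def scatter_value(D):
    """Scatter SV = ||D - be||_F used to choose the next cluster to split (Eq. 1)."""
    return np.linalg.norm(D - D.mean(axis=0), ord="fro")
```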
4 Algorithms
In this section we present two algorithms that are enhancements of the original PDDP algorithm. The criteria that these algorithms use are presented in a simple way in order to demonstrate the algorithmic procedures. The particular implementations do not try to mix different approaches, but are pure in the sense that they employ criteria from a particular methodology. The two algorithmic implementations are, first, iPDDP, which splits at the largest distance between any two consecutive projections, and, second, dePDDP, which is a compilation of criteria based on the minimiser of the estimated density of the projections.
4.1 iPDDP
Recently in [22], a new splitting criterion was proposed based on density or histogram arguments. In that study, it was proposed to compute the most sparse region of the projected data, by first sorting and then finding the largest distance between two consecutive projections. The iPDDP implementation is shown below:
Function iPDDP (D, c_max, MinPts) {
1. Set Π = {D}
2. While |Π| < c_max, do
3.   Select a set C ∈ Π using cluster selection criterion CS1
4.   Split C into two sub-sets C_1, C_2 using Splitting Criterion SPC2
5.   Remove C from Π and set Π → Π ∪ {C_1, C_2}
6. Set O = ∅
7. For any C in Π with |C| < MinPts, do
8.   Remove C from Π
9.   Set O → O ∪ C
10. Return Π, the partition of D into |Π| clusters, and O, the set of outliers.
}
This algorithmic scheme is built around the following criteria:
– (Stopping Criterion – ST1): Iteratively split the data set D into c_max subsets. Report as clusters the ones with more than MinPts points. Designate the points included in the remaining clusters as outliers.
– (Cluster Selection Criterion – CS1): Let Π be a partition of the data set D into k sets. Let M be the set of the largest distances M_i among consecutive projections, for each C_i ∈ Π, i = 1, . . . , k. The next set to split is C_j, with j = arg max_i {M_i : M_i ∈ M}.
– (Splitting Criterion – SPC2): For all p_i ∈ P, let ne(i) = arg min_j {p_j : p_j > p_i, ∀ p_j ∈ P}, and define isp = arg max_i {p_{ne(i)} − p_i, ∀ p_i ∈ P}. Then P_1 = {d_i ∈ D : p_i ≤ p_{isp}}, and P_2 = {d_i ∈ D : p_i > p_{isp}}.
The iPDDP algorithm has the drawback that the user needs to specify the maximum number of clusters (c_max) and the minimum number of points that are allowed to constitute a valid cluster (MinPts). Although the selection of the MinPts parameter is easy, the selection of c_max is not straightforward. If c_max is set to a value much larger than the actual number of clusters, the algorithm will be forced to iteratively strip points from the clusters. In extreme cases, it could even occur that a cluster is totally decomposed into small sets of fewer than MinPts points, ending up in the outlier set. The computational complexity of the iPDDP implementation is mostly influenced by the computation of the principal vectors, as in the original PDDP algorithm. To compute them, the Singular Value Decomposition of the data matrix D is employed. Thus, if the iterations needed by the Lanczos SVD computation algorithm are k_SVD and s_nz is the fraction of non-zero entries in D, the total worst case complexity of the algorithm is O(c_max (2 + k_SVD) s_nz n a). For more details refer to [5]. In the iPDDP case, the additional computation steps that are required change this complexity to O(c_max (2 + k_SVD)(s_nz n a + n log(n))), which, although increased, is still on par with most clustering algorithms. Also note that the additional cost is not influenced by the data dimensionality. Thus, the ability of the algorithms to deal with ultra high dimensional data is maintained.
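As an illustration of splitting criterion SPC2, a short Python sketch of the maximum-gap split might look as follows (the sorting step accounts for the extra O(n log(n)) term in the complexity above; the function name is our own):

import numpy as np

def ipddp_split(D, p):
    # SPC2 splits at the largest gap between consecutive sorted projections
    order = np.argsort(p)
    sorted_p = p[order]
    gaps = np.diff(sorted_p)
    isp = np.argmax(gaps)            # left end of the largest gap
    threshold = sorted_p[isp]
    return D[p <= threshold], D[p > threshold]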
Figure 1 illustrates the splitting criterion introduced by the iPDDP algorithm on a simple two dimensional dataset containing three actual clusters. In this case, the PDDP algorithm would split the bottom cluster in half, since its splitting criterion divides the data points according to the sign of the respective projections (PDDP split point). On the other hand, based on the iPDDP splitting criterion (iPDDP split point), the dataset is properly divided.
Fig. 1. An illustration of the iPDDP splitting criterion (projections onto the first and second principal components; the PDDP split point lies at zero, the iPDDP split point at the maximum distance between consecutive projections)
The PDDP algorithm and its variations choose to split the cluster with maximum scatter value, as reported in the previous section. This has a serious drawback in the presence of unbalanced clusters. An example of unbalanced clusters is illustrated in Figure 2. In this case, the cluster selection criterion introduced by the iPDDP algorithm is not affected by the cluster size.
Fig. 2. An illustration of unbalanced clusters (projections onto the first principal component; the projection variance of the right cluster differs from that of the left cluster)
4.2 dePDDP
The dePDDP algorithm is based on a splitting criterion that suggests that the minimiser of the estimated density of the projected data onto the first principal component is the best choice for avoiding splitting clusters. The cluster selection criterion and the termination criterion are guided by the same idea. More specifically, the dePDDP algorithm utilises the following criteria:
– (Stopping Criterion – ST2): Let Π be a partition of the data set D into k sets. Let X be the set of minimisers x*_i of the density estimates f̂(x*_i; h) of the projection of the data of each C_i ∈ Π, i = 1, . . . , k. Stop the procedure when the set X is empty.
– (Cluster Selection Criterion – CS2): Let Π be a partition of the data set D into k sets. Let F be the set of the density estimates f_i = f̂(x*_i; h) at the minimisers x*_i for the projection of the data of each C_i ∈ Π, i = 1, . . . , k. The next set to split is C_j, with j = arg max_i {f_i : f_i ∈ F}.
– (Splitting Criterion – SPC3): Let f̂(x; h) be the kernel density estimate of the density of the projections p_i ∈ P, and x* its global minimiser. Then construct P_1 = {d_i ∈ D : p_i ≤ x*} and P_2 = {d_i ∈ D : p_i > x*}.
The dePDDP implementation is shown below:
Function dePDDP (D) {
1. Set Π = {D}
2. Do
3.   Select a set C ∈ Π using cluster selection criterion CS2
4.   Split C into two sub-sets C_1, C_2 using Splitting Criterion SPC3
5.   Remove C from Π and set Π → Π ∪ {C_1, C_2}
6. While Stopping Criterion ST2 is not satisfied
7. Return Π, the partition of D into |Π| clusters
}
The computational complexity of this approach, using a brute force technique, would be quadratic in the number of samples. However, it has been shown [12,26] that using techniques like the Fast Gauss Transform we can achieve linear running time for the kernel density estimation, especially for the one dimensional case at hand. To find the minimiser we only need to evaluate the density at n positions, in between the projected data points, since those are the only places where valid splitting points can occur. Thus the total complexity of the algorithm remains O(c_max (2 + k_SVD)(s_nz n a)). In Figure 3, we illustrate the splitting criterion of the dePDDP algorithm on the same simple two dimensional dataset used to illustrate the iPDDP algorithm. Using the dePDDP splitting criterion (dePDDP split point), guided by the global minimiser of the kernel density estimate, the dataset is properly divided.
Fig. 3. An illustration of the dePDDP splitting criterion (projections onto the first and second principal components; the dePDDP split point lies at the global minimiser of the density estimate)
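A minimal Python sketch of splitting criterion SPC3 is given below; it uses SciPy's Gaussian kernel density estimator with its default bandwidth as a stand-in for the fast Gauss transform implementation referred to above, and treats the absence of an interior minimiser as the per-cluster signal required by stopping criterion ST2 (these implementation choices are ours, not those of the original algorithm):

import numpy as np
from scipy.stats import gaussian_kde

def depddp_split(D, p):
    # evaluate the density estimate between consecutive sorted projections,
    # the only positions where a valid splitting point can occur
    sorted_p = np.sort(p)
    mid = (sorted_p[:-1] + sorted_p[1:]) / 2.0
    dens = gaussian_kde(p)(mid)
    i = np.argmin(dens)
    # no interior minimiser: do not split this cluster (stopping criterion ST2)
    if i == 0 or i == len(dens) - 1:
        return None
    x_star = mid[i]
    return D[p <= x_star], D[p > x_star]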
4.3 Automatic Cluster Number Determination
In [22], a termination criterion based on the maximum distance between consecutive projections was proposed. More specifically, as mentioned in the previous section, the maximum number of allowed clusters c_max is used as an upper bound, and splitting is continued as long as there exist clusters with more than MinPts points, where MinPts is a user defined parameter describing the minimum number of points that are allowed to constitute a valid cluster. This is not an uncommon procedure for algorithms that are designed to deal with noisy datasets [19]. Formally, this termination criterion based on the two user defined parameters is defined above (see Stopping Criterion ST1). For the density based approach presented here, we can let the existence of a minimiser guide the termination of the procedure: we stop the iteration as soon as no minimiser exists for any of the retrieved clusters. This criterion depends on the bandwidth selection, but automated bandwidth selection techniques could be employed to remove user intervention (see Stopping Criterion ST2).
5 Experimental Results
The development of microarray technologies gives scientists the ability to examine, discover and monitor the mRNA transcript levels of thousands of genes in a single experiment. Discovering the patterns hidden in gene expression microarray data is a tremendous opportunity and challenge for functional genomics and proteomics. A promising approach to address this task is to utilise data mining techniques. Cluster analysis is a key step in understanding how the activity of genes varies during biological processes and is affected by disease states and cellular environments. In particular, clustering can be used either to identify sets of genes according to their expression in a set of samples, or to cluster samples
into homogeneous groups that may correspond to particular macroscopic phenotypes. The latter is in general more difficult because of the high dimensionality of these datasets. For this reason they make a perfect candidate for exploring the performance of the algorithms proposed in this paper. As such, we employ the following microarray datasets:
– The COLON data set [2] consists of 40 tumour and 22 normal colon tissues. For each sample there exist 2000 gene expression level measurements. The data set is available at http://microarray.princeton.edu/oncology.
– The LYMPHOMA data set [1] contains 62 samples of 3 lymphoid malignancy types. The samples are measured over 4026 gene expression levels. This dataset is available at http://genome-www.stanford.edu/.
– The PROSTATE data set [20] contains 52 prostate tumour samples and 50 non-tumour prostate samples. There exist 6033 gene expression level measurements per sample. Available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.
– The ALL data set [27] covers six subtypes of acute lymphoblastic leukemia. The data set is available at http://www.stjuderesearch.org/data/ALL1.
– The CARCINOMA data set [18] contains 18 tumour samples and 18 normal samples. There exist 7457 gene expression level measurements per sample. The data set is available at http://microarray.princeton.edu/oncology/carcinoma.html.
– The SRBCT data set [15] contains 83 samples spanning 4 classes. There exist 2308 gene expression level measurements per sample. The data set is available at http://research.nhgri.nih.gov/microarray/Supplement.
To measure the quality of the clustering results we use purity, as defined in [14,25]:
P = \frac{\sum_{i=1}^{k} |D_i|}{|N|},

where |D_i| denotes the number of points with the dominant class label in cluster i, |N| the total number of points, and k the number of clusters. In Table 1, the clustering results with respect to the purity of the algorithms for each dataset are illustrated. The actual number of clusters in the data is given as input to the PDDP, the iPDDP, the dePDDP, and the k-means-PDDP [29] algorithms. The number in parentheses is the number of clusters that the algorithm retrieved. It is evident that the PDDP, the iPDDP and the k-means-PDDP algorithms manage to identify pure partitions only in the case of the LYMPHOMA dataset. In particular, iPDDP results in a purity of 1, which translates to clusters with direct correspondence to the class labels of the data. The dePDDP algorithm results in good partitions in all of the cases. Additionally, it successfully determined the actual number of clusters in the LYMPHOMA (3 clusters) and in the ALL (6 clusters) datasets. It is also interesting to examine the results of the algorithms when, in an attempt to improve their purity, they are allowed to compute more clusters than the labels in the dataset. In Table 2 we illustrate the clustering results with respect to the purity of the algorithms for the COLON, PROSTATE, CARCINOMA and SRBCT datasets, since for these datasets the dePDDP algorithm exhibited better purity results and located a higher number of clusters.
Table 1. Results with respect to the clustering purity for different methods

Dataset        PDDP       iPDDP      dePDDP      k-means-PDDP
COLON(2)       0.6452(2)  0.6452(2)  0.7903(7)   0.6452(2)
LYMPHOMA(3)    0.9677(3)  1(3)       1(3)        0.8548(3)
ALL(6)         0.3266(6)  0.3226(6)  0.6935(6)   0.3266(6)
PROSTATE(2)    0.5784(2)  0.5882(2)  0.8039(12)  0.5882(2)
CARCINOMA(2)   0.7500(2)  0.7500(2)  0.9167(5)   0.7500(2)
SRBCT(4)       0.6190(4)  0.4444(4)  0.9455(14)  0.4921(4)
Table 2. Results with respect to the clustering purity for different methods. The algorithms are forced to find more clusters.

Dataset        PDDP        iPDDP       k-means-PDDP
COLON(2)       0.7097(7)   0.7742(7)   0.6452(7)
PROSTATE(2)    0.6275(12)  0.6373(12)  0.6765(12)
CARCINOMA(2)   0.7500(5)   0.8333(5)   0.8056(5)
SRBCT(4)       0.7460(14)  0.6825(14)  0.6667(14)
For the new results to be comparable with the cluster numbers computed by the dePDDP algorithm (see Table 1), the cluster number for these problems was set to 7, 12, 5 and 14, respectively. Notice that the greater cluster numbers result in no significant changes in the purity results.
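For completeness, the purity measure reported in the tables above can be computed as in the following Python sketch (the function name and the array-based interface are our own choices):

import numpy as np

def purity(cluster_labels, class_labels):
    # P = (sum over clusters of the size of the dominant class) / total number of points
    total_dominant = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total_dominant += counts.max()
    return total_dominant / len(class_labels)

# e.g. purity(np.array([0, 0, 1, 1]), np.array(["a", "a", "a", "b"])) returns 0.75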
6 Concluding Remarks
DNA microarray technologies measure gene expression levels for a very large number of genes covering the entire genome. However, the number of genes is usually very high compared to the number of data samples. This problem has been recognised as the “curse of dimensionality” and has introduced novel challenges for the analysis of such data. For this reason a powerful class of clustering algorithms has emerged which, using technologies from numerical linear algebra, can efficiently and with low computational complexity produce high quality data partitions. These algorithms operate on the projection of the data onto a few principal components of the covariance matrix. In this work, we present two approaches that improve significantly upon the well-known PDDP projection based clustering algorithm. As the experimental results show, the algorithms, apart from identifying pure clusters, are also able to give good approximations of their number.
References

1. Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)
2. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. USA 96(12), 6745–6750 (1999) 3. Bellman, R.: Adaptive control processes: A guided tour. Princeton University Press, Princeton (1961) 4. Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful. In: 7th International Conference on Database Theory, pp. 217–235 (1999) 5. Boley, D.: Principal direction divisive partitioning. Data Mining and Knowledge Discovery 2(4), 325–344 (1998) 6. Brown, P., Botstein, D., Eisen, M., Spellman, P.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 95(25), 14863–14868 (1998) 7. Chute, C., Yang, Y.: An overview of statistical methods for the classification and retrieval of patient events. Methods Inf. Med. 34(1-2), 104–110 (1995) 8. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990) 9. Dhillon, I., Kogan, J., Nicholas, C.: Feature selection and document clustering. A Comprehensive Survey of Text Mining, 73–100 (2003) 10. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 269–274. ACM, New York (2001) 11. Golub, T., Slomin, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M., Caligiuri, M., Downing, J., Bloomfield, C., Lander, E.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 268, 531–537 (1999) 12. Greengard, L., Strain, J.: The fast gauss transform. SIAM J. Sci. Stat. Comput. 12(1), 79–94 (1991) 13. Jain, A.K., Dubes, R.C.: Algorithms for clustering data (1988) 14. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999), http://citeseer.ist.psu.edu/jain99data.html 15. Khan, J., Wei, J., Ringner, M., Saal, L., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C., Peterson, C., Meltzer, P.: Classification and diagnostic prediction of cancers using expression profiling and artificial neural networks. Nature Medicine 7, 673–679 (2001) 16. Lax, P.D.: Linear algebra and its applications. Wiley Interscience, Hoboken (2007) 17. Nilsson, M.: Hierarchical Clustering Using Non-Greedy Principal Direction Divisive Partitioning. Information Retrieval 5(4), 311–321 (2002) 18. Notterman, D.A., Alon, U., Sierk, A.J., Levine, A.J.: Transcriptional gene expression profiles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays. Cancer Research 61, 3124–3130 (2001) 19. Sander, J., Ester, M., Kriegel, H.P., Xu, X.: Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 2(2), 169–194 (1998) 20. Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P.: Gene expression correlates of clinical prostate cancer behavior. Cancer cell 1(2), 203–209 (2002)
21. Steinbach, M., Ertz, L., Kumar, V.: The challenges of clustering high dimensional data. New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition (2003) 22. Tasoulis, S., Tasoulis, D.: Improving principal direction divisive clustering. In: 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), Workshop on Data Mining using Matrices and Tensors, Las Vegas, USA (2008) 23. Tryon, C.: Cluster Analysis. Edward Brothers, Ann Arbor (1939) 24. Wen, X., Fuhrman, S., Michaels, G., Carr, D., Smith, S., Barker, J., Somogyi, R.: Large-scale temporal gene expression mapping of cns development. Proceedings of the National Academy of Sciences of the United States of America 95, 334–339 (1998) 25. Zhao, Y., Karypis, G.: Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55(3), 311–331 (2004) 26. Yang, C., Duraiswami, R., Gumerov, N.A., Davis, L.: Improved fast gauss transform and efficient kernel density estimation. In: Proceedings of Ninth IEEE International Conference on Computer Vision, pp. 664–671 (2003) 27. Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A.: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer cell 1(2), 133–143 (2002) 28. Zeimpekis, D., Gallopoulos, E.: PDDP(l): Towards a Flexing Principal Direction Divisive Partitioning Clustering Algorithms. In: Boley, D., Dhillon, I., Ghosh, J., Kogan, J. (eds.) Proc. IEEE ICDM ’03 Workshop on Clustering Large Data Sets, Melbourne, Florida, pp. 26–35 (2003) 29. Zeimpekis, D., Gallopoulos, E.: Principal direction divisive partitioning with kernels and k-means steering. In: Survey of Text Mining II: Clustering, Classification, and Retrieval, pp. 45–64 (2007) 30. Zhangi, A., Jiang, D., Tang, C.: Cluster analysis for gene expression data: a survey. IEEE Transactions on Knowledge Data Engineering 16(11), 1370–1386 (2004)
Searching a Multivariate Partition Space Using MAX-SAT

Silvia Liverani1, James Cussens2, and Jim Q. Smith3

1 Department of Mathematics, University of Bristol, Bristol BS8 1TW, UK
[email protected]
2 York Centre for Complex Systems Analysis & Department of Computer Science, University of York, York YO10 5DD, UK
3 Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
Abstract. Because of the huge number of partitions of even a moderately sized dataset, a comprehensive search for the highest scoring (MAP) partition is usually impossible, even when Bayes factors have a closed form. Therefore, both deterministic and random search algorithms traverse only a small proportion of partitions of a large dataset. The main contribution of this paper is to encode the formal Bayes factor search on partitions as a weighted MAX-SAT problem and use well-known solvers for that problem to search for partitions. We demonstrate how, with the appropriate priors over the partition space, this method can be used to fully search the space of partitions in smaller problems and how it can be used to enhance the performance of more familiar algorithms in large problems. We illustrate our method on clustering of time-course microarray experiments.
1 Introduction

Many Bayesian model selection procedures are based on the posterior probability distribution over the appropriate model space. A very common method is MAP selection, where the most a posteriori probable model is selected (9). These selection techniques have the advantage that they incorporate scientific judgements (11). However, a full exploration of the partition space is not possible when, as in our case, the number of elements is in the order of tens of thousands, even when using fast conjugate modelling. The number of partitions of a set of n elements grows quickly with n. For example, there are 5.1 × 10^13 ways to partition 20 elements. In this paper we demonstrate how to explore a partition space using weighted MAX-SAT. The SAT problem, which asks whether a given set of propositional clauses is satisfiable, can be extended to the weighted MAX-SAT problem, where weights are added to each clause and the goal is to find an assignment that maximizes the sum of the weights of the satisfied clauses. This problem setting has been used by (6) for
model search over Bayesian networks, a class of models which shares some similarities with the search over partitions. For example, in both scenarios, models are scored using a marginal likelihood which is local in the sense of (11) and decomposable (see Section 2). The advantage of algorithms encoding the weighted MAX-SAT methodology over many greedy search algorithms, such as agglomerative hierarchical clustering (AHC), is that they are not intrinsically sequential. Under AHC, once a decision to combine two clusters is made it cannot be reversed. This is not the case with weighted MAX-SAT solvers generally. In our illustrative examples this is a big advantage, since under Bayes factor search via AHC early combinations of clusters are prone to be distorted by the presence of outliers (16). On the other hand, the advantage weighted MAX-SAT has over random search algorithms is that it is typically more efficient and finds local maxima of the Bayes score function for sure, in a sense explained later in the paper. Thus in small problems weighted MAX-SAT can be used to find an optimal partition for sure, whilst in large problems it can be used to enhance the performance of faster but less refined and adaptable algorithms. Provided the appropriate local prior structure over the partition space is used, a weighted MAX-SAT algorithm can be very flexible and can be used to search all the spaces its competitors can. Here we will illustrate how this method can be used to cluster a class of time-course experiments known to exhibit circadian rhythms (8). The paper is organized as follows. In Section 2 we illustrate the model used to score partitions and review the current methods used to search the partition space. Section 3 describes how the search on the partition space is encoded as a weighted MAX-SAT problem. We discuss some examples in Section 4 and present ongoing work in Section 5.
2 Evaluating Partitions

The main contribution of this paper is to encode the formal Bayes factor search on partitions as a weighted MAX-SAT problem and use well-known solvers for that problem to search over a multivariate partition space. We use weighted MAX-SAT in conjunction with a conjugate Gaussian regression model developed by (9). This model has a wide applicability because it can be customized through the choice of a given design matrix X. Conjugacy ensures the fast computation of scores for a given partition because these can be written explicitly and in closed form as functions of the data and the chosen values of the hyperparameters of the prior. Applications range from one-dimensional data points to multidimensional datasets with time dependence among points or where the points are obtained by applying different treatments to the units. Let Y_i ∈ R^r for i = 1, . . . , n represent the r-dimensional units to cluster. In our example in Section 4 these are log expressions of genes over r time points at which measurements are taken. Let D = (Y_1, . . . , Y_n) and Y = vec(D) satisfy

Y = Xβ + ε
where β = (β_1, β_2, . . . , β_p) ∈ R^p and ε ∼ N(0, σ²I) is a vector of independent error terms with σ² > 0. The posterior Normal Inverse Gamma joint density of the parameters (β, σ²), denoted by NIG(0, V, a, b), is given by

p(β, σ² | y) ∝ (σ²)^{-(a^* + p/2 + 1)} \exp\left( -\frac{1}{2σ²} \left[ (β - m^*)^\top (V^*)^{-1} (β - m^*) + 2b^* \right] \right)

with

m^* = (V^{-1} + X^\top X)^{-1} X^\top Y,
V^* = (V^{-1} + X^\top X)^{-1},
γ = Y^\top Y - (m^*)^\top (V^*)^{-1} m^*,
a^* = a + rn/2,  b^* = b + γ/2,
where a, b > 0 and V is a positive definite matrix. Throughout this paper we assume that X = 1_n ⊗ B, where B is a known matrix, and that X^\top X = n B^\top B is full rank. The design or basis function matrix B encodes the type of basis used for the clustering: linear splines in (9), wavelets in (15) or Fourier in (8). The latter is the most appropriate choice in the context of a study of daily rhythms as in Section 4. The Bayes factor associated with this model can be calculated from its marginal likelihood L(y); see for example (7) and (13). Thus

L(y) = \left(\frac{1}{\pi}\right)^{nr/2} \frac{b^a}{(b^*)^{a^*}} \frac{|V^*|^{1/2}}{|V|^{1/2}} \frac{\Gamma(a^*)}{\Gamma(a)}    (1)
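The log of the marginal likelihood (1) for a single cluster can be evaluated directly from the quantities defined above, as in the following Python sketch (it assumes the zero prior mean of the NIG(0, V, a, b) specification and takes the stacked response Y and design matrix X of one cluster as inputs; the function name is our own):

import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(Y, X, V, a, b):
    # posterior quantities m*, V*, gamma, a*, b* as defined in the text
    Vinv = np.linalg.inv(V)
    V_star = np.linalg.inv(Vinv + X.T @ X)
    m_star = V_star @ (X.T @ Y)
    gamma = Y @ Y - m_star @ np.linalg.solve(V_star, m_star)
    a_star = a + len(Y) / 2.0
    b_star = b + gamma / 2.0
    _, logdet_Vs = np.linalg.slogdet(V_star)
    _, logdet_V = np.linalg.slogdet(V)
    # log of equation (1)
    return (-len(Y) / 2.0 * np.log(np.pi)
            + a * np.log(b) - a_star * np.log(b_star)
            + 0.5 * (logdet_Vs - logdet_V)
            + gammaln(a_star) - gammaln(a))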
Unlike for univariate data, within the class of problems illustrated in Section 4 there are a myriad of different possible shapes of expression over time for each gene. Consequently, Bayes factors associated with different gene combinations are highly discriminative and informative. Let C denote a partition belonging to the space of partitions C, on a space Ω of cardinality n, and c a cluster of such a partition. (9) assume that each observation is exchangeable within its containing cluster. The Normal Inverse-Gamma conjugate Bayesian linear regression model for each observation in a cluster c takes the form

Y^{(c)} = X^{(c)} β^{(c)} + ε^{(c)}

Here β^{(c)} = (β_1^{(c)}, . . . , β_p^{(c)}) is the vector of parameters with p ≤ r, X^{(c)} is the design matrix of size n_c r × p, and ε^{(c)} ∼ N(0, σ_c² I_{r n_c}), where n_c is the number of observations in cluster c and I_{r n_c} is the identity matrix of size r n_c × r n_c. A partition C divides the n observations into N clusters of sizes {n_1, . . . , n_N}, with n = \sum_{i=1}^{N} n_i. Assuming the parameters of different clusters are independent then, because the likelihood separates, it is straightforward to check (16) that the log marginal likelihood score Σ(C) for any partition C with clusters c ∈ C is given by
Σ(C) = \log p(C) + \sum_{c \in C} \log p_c(y)    (2)
Here the prior p(C) is often chosen from the class of cohesion priors over the partition space (14), which assigns weights to different models in a plausible and convenient way: see e.g. (16). An essential property of the search for MAP models, one that dramatically increases the efficiency of the partition search, is that, with the right family of priors, the search is local. That is, if C^+ and C^- differ only in the sense that the cluster c^+ ∈ C^+ is split into two clusters c_1^-, c_2^- ∈ C^-, then the log marginal likelihood score is a linear function only of the posterior cluster probabilities on c^+, c_1^- and c_2^-.

2.1 Choosing an Appropriate Prior over Partitions

Although there are many possible choices for a prior over partitions, an appropriate choice in this scenario is the Crowley partition prior p(C) (5; 12; 3) for partition C

p(C) = \frac{\Gamma(\lambda)\,\lambda^N}{\Gamma(n + \lambda)} \prod_{i=1}^{N} \Gamma(n_i)    (3)
where λ > 0 is the parameter of the partition prior, N is the number of clusters and n is the total number of observations, with n_i the number of observations in cluster c_i. This prior is consistent in the sense of (12). The authors argue that this property is extremely desirable for any partition process to hold. Conveniently, if we use a prior from this family then the score in (2) decomposes. Thus

Σ(C) = \log p(N, n_1, . . . , n_N | y) = \log p(N, n_1, . . . , n_N) + \sum_{i=1}^{N} \log p(y_i)
     = \log \Gamma(\lambda) - \log \Gamma(n + \lambda) + \sum_{i=1}^{N} S_i
where

S_i = \log p(y_i) + \log \Gamma(n_i) + \log \lambda.

Thus, the score Σ(C) is decomposable into the sum of the scores S_i over the individual clusters plus a constant term. This is especially useful for weighted MAX-SAT, which needs the score of an object to be expressible as a sum of component scores. The choice of the Crowley prior in (3) ensures that the score of a partition is expressible as a linear combination of scores associated with the individual sets within the partition. It is this property that enables us to find a straightforward encoding of the MAP search as a weighted MAX-SAT problem. Note that a particular example of a Crowley prior is the Multinomial-Dirichlet prior used by (9), where λ is set so that λ ∈ (1/n, 1/2).
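Under this prior the partition score can be assembled from per-cluster log marginal likelihoods exactly as in the decomposition above; a minimal Python sketch might be (the function name and interface are our own):

import numpy as np
from scipy.special import gammaln

def partition_score(cluster_log_marginals, cluster_sizes, lam):
    # S_i = log p(y_i) + log Gamma(n_i) + log lambda
    S = [lp + gammaln(n_i) + np.log(lam)
         for lp, n_i in zip(cluster_log_marginals, cluster_sizes)]
    n = sum(cluster_sizes)
    # Sigma(C) = log Gamma(lambda) - log Gamma(n + lambda) + sum_i S_i
    return gammaln(lam) - gammaln(n + lam) + sum(S), S

The per-cluster scores S_i returned here are the weights used in the weighted MAX-SAT encoding of Section 3.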
2.2 Searching the Partition Space

The simplest search method using the local property is agglomerative hierarchical clustering (AHC). It starts with all the observations in separate clusters, our original C_0, and evaluates the score of this partition. Each cluster is then compared with all the other clusters, and the two clusters which increase the log likelihood in (2) the most are combined to produce a new partition C_1. We now substitute C_1 for C_0 and repeat this procedure to obtain the next partition C_2. We continue in this way until we have evaluated the log marginal score Σ(C_i) for each partition {C_i, 1 ≤ i ≤ n}. We then choose the partition which maximizes the score Σ(C_i). A drawback of this method is that the set of partitions searched is an extremely small subset of the set of all partitions. The number of partitions of a set of n elements grows quickly with n. For example, there are 5.1 × 10^13 ways to partition 20 elements, and AHC evaluates only 1331 of them! Despite searching only a small number of partitions, AHC is surprisingly powerful and often finds good partitions of clusters, especially when used for time-course profile clustering as in Section 4. It is also very fast. However, one drawback is that the final choice of optimal partition is completely dependent on the early combinations of elements into clusters. This initial part of the combination process is sensitive and can make poor initial choices, especially in the presence of outliers or poor choices of hyperparameters when used with Bayes factor scores (see Section 4), in a way carefully described in (16). Analogous instabilities in search algorithms over similar model spaces have prompted some authors to develop algorithms that devote time to early refinement of the initial choices in the search (4) or to propose alternative stochastic search (10). The latter method appears very promising but is difficult to implement within our framework due to the size of the datasets. We propose an enhancement of the widely used AHC with weighted MAX-SAT. This is simple to use in this context provided a prior such as (3), which admits a decomposable score, is used over the model space. Weighted MAX-SAT is able to explore many more partitions and different regions of the partition space, as we will demonstrate in Section 4, and is not nearly as sensitive to the instabilities that AHC, used on its own, is prone to exhibit.
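A schematic Python version of the AHC search over a decomposable score is given below (the cluster_score argument stands for any per-cluster score such as the S_i above; the greedy loop and its bookkeeping are our own paraphrase of the procedure just described):

def ahc(n, cluster_score):
    # start from the partition of singletons and greedily merge the pair of
    # clusters whose combination increases the (decomposable) score the most
    partition = [frozenset([i]) for i in range(n)]
    best_partition = list(partition)
    best_total = sum(cluster_score(c) for c in partition)
    while len(partition) > 1:
        best_gain, best_pair = None, None
        for i in range(len(partition)):
            for j in range(i + 1, len(partition)):
                merged = partition[i] | partition[j]
                gain = (cluster_score(merged)
                        - cluster_score(partition[i]) - cluster_score(partition[j]))
                if best_gain is None or gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        i, j = best_pair
        merged = partition[i] | partition[j]
        partition = [c for k, c in enumerate(partition) if k not in (i, j)] + [merged]
        total = sum(cluster_score(c) for c in partition)
        if total > best_total:
            best_total, best_partition = total, list(partition)
    return best_partition, best_total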
3 Encoding the Clustering Algorithm

(6) showed that for the class of Bayesian networks a decomposition of the marginal likelihood score allowed weighted MAX-SAT algorithms to be used. The decomposition was in terms of the child-parent configurations p(x_i | Pa_{x_i}) associated with each random variable x_i in the Bayesian network. Here our partition space under the Crowley prior exhibits an analogous decomposition into cluster scores.

3.1 Weighted MAX-SAT Encoding

For each considered cluster c_i, a propositional atom, also called c_i, is created. In what follows no distinction is made between clusters and the propositional atoms representing them. Propositional atoms are just binary variables with two values: TRUE and
FALSE. A partition is represented by setting all of its clusters to TRUE and all other clusters to FALSE. However, most truth-value assignments for the c_i do not correspond to a valid partition, and so such assignments must be ruled out by constraints represented by logical clauses. To rule out the inclusion of overlapping clusters we assert clauses of the form:

¬c_i ∨ ¬c_j    (4)

for all non-disjoint pairs of clusters c_i, c_j (where ¬ denotes negation). Each such clause is logically equivalent to ¬(c_i ∧ c_j): both clusters cannot be included in a partition. In general, it is also necessary to state that each data point must be included in some cluster in the partition. Let {c_{y1}, c_{y2}, . . . , c_{yi(y)}} be the set of all clusters containing data point y. For each y a single clause of the form:

c_{y1} ∨ c_{y2} ∨ · · · ∨ c_{yi(y)}    (5)
is created. The ‘hard’ clauses in (4) and (5) suffice to rule out non-partitions; it remains to ensure that each partition has the right score. This can be done by exploiting the decomposability of the partition score into cluster scores and using ‘soft’ clauses to represent cluster scores. If S_i, the score for cluster c_i, is positive, the following weighted clause is asserted:

S_i : c_i    (6)

Such a clause intuitively says: “We want c_i to be true (i.e. to be one of the clusters in the partition) and this preference has weight S_i.” If a cluster c_j has a negative score S_j then this weighted clause is asserted:

−S_j : ¬c_j    (7)
which states a preference for cj not to be included in the partition. Given an input composed of the clauses in (4)–(7) the task of a weighted MAX-SAT solver is to find a truth assignment to the ci which respects all hard clauses and maximizes the sum of the weights of satisfied soft clauses. Such an assignment will encode the highest scoring partition constructed from the given clusters. Note that if a given cluster ci can be partitioned into clusters ci1 , ci2 , . . . cij(i) where Si < Si1 + Si2 + · · · + Sij(i) , then due to the decomposability of the partition score, ci cannot be a member of any optimal partition: any partition with ci can be improved by replacing ci with ci1 , ci2 , . . . cij(i) . Removing such clusters prior to the logical encoding reduces the problem considerably and can be done reasonably quickly: for example, one particular collection of 1023 clusters which would have generated 495,285 clauses was reduced to 166 clusters with 13,158 clauses using this approach. The filtering process took 25 seconds using a Python script. This cluster reduction technique was used in addition to those mentioned in the sections immediately following.
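As an illustration of the encoding, the following Python sketch writes clauses (4)-(7) in the weighted DIMACS (wcnf) format read by many weighted MAX-SAT solvers; the choice of file format, the integer rescaling of the real-valued scores, and the function name are our own assumptions rather than details of the implementation used in the paper:

def write_wcnf(clusters, scores, n_points, path, scale=100):
    # clusters: list of frozensets of data-point indices; atom i+1 represents clusters[i]
    # scores: the corresponding cluster scores S_i (real numbers)
    hard = []
    # clauses (4): overlapping clusters cannot both be in the partition
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i] & clusters[j]:
                hard.append([-(i + 1), -(j + 1)])
    # clauses (5): every data point must be covered by some selected cluster
    for y in range(n_points):
        hard.append([i + 1 for i, c in enumerate(clusters) if y in c])
    # clauses (6) and (7): soft preferences weighted by the (rescaled) cluster scores
    soft = [(max(1, int(round(abs(s) * scale))), [i + 1] if s > 0 else [-(i + 1)])
            for i, s in enumerate(scores)]
    top = sum(w for w, _ in soft) + 1          # hard-clause weight above any soft total
    with open(path, "w") as f:
        f.write("p wcnf %d %d %d\n" % (len(clusters), len(hard) + len(soft), top))
        for clause in hard:
            f.write("%d %s 0\n" % (top, " ".join(map(str, clause))))
        for w, clause in soft:
            f.write("%d %s 0\n" % (w, " ".join(map(str, clause))))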
3.2 Reducing the Number of Cluster Scores

To use weighted MAX-SAT algorithms effectively in this context, the challenge in even moderately sized partition spaces is to identify promising clusters that might be components of an optimal partition. The method in (6) of evaluating the scores only of subsets of less than a certain size is not ideal in this context, since in our applications many good clusters appear to have a high cardinality. However, there are more promising techniques, formulated in other contexts, to address this issue. One of these, which we use in the illustrative example, is outlined below; others are presented in Section 5.

Reduction by iterative augmentation. A simple way to reduce the number of potential cluster scores for weighted MAX-SAT is to evaluate all the possible clusters containing a single observation and to iteratively augment the size of the plausible clusters only if their score increases too, thanks to the nice decomposability of our score function. We will focus our discussion in this paper on the iterative augmentation algorithm described below (a code sketch is given after the hybrid algorithm below).

Step 1. Compute the cluster score for all n observations as if each belonged to a different cluster. Save these scores as input for weighted MAX-SAT. Set k ← 0 and c ← ∅.
Step 2. Set k ← k + 1, j ← k + 1 and c ← {k}. Exit the algorithm when k = n.
Step 3. Add element j to cluster c to form the new cluster c′ and compute its score. If S_{c′} > S_c + S_j, then
  – Save the score for cluster c′
  – If j = n, go to Step 2.
  – Set c ← c′ and j ← j + 1
  – Go to Step 3
else
  – If j = n, go to Step 2.
  – Set j ← j + 1
  – Go to Step 3.

The main advantage of this algorithm is that it evaluates the actual cluster scores, never approximating them by pairwise dissimilarities or in any other way. Furthermore, this method does not put any restriction on the maximum size of the potential clusters.

Hybrid AHC algorithm. Even though this algorithm performs extremely well when the number of clustered units n < 100, it slows down quickly as the number of observational vectors increases. However, this deficiency disappears if we use it in conjunction with the popular AHC search to refine clusters of fewer than 100 units. When used to compare partitions of profiles as described in Section 2, AHC performs extremely well when the combined clusters are large. So, to improve its performance, we use weighted MAX-SAT to reduce the dependence on poor initialization. By running a mixture of AHC together with weighted MAX-SAT we are able to reduce this dependence whilst retaining the speed of AHC and its efficacy with large clusters. AHC is used to initialize a candidate partition. Then weighted MAX-SAT is used as a ‘split’ move to refine these clusters and find a new and improved partition on which to start a new AHC algorithm. The hybrid algorithm is described below.
Step 1. Initialize by running AHC to find the best scoring partition C_1 on this search.
Step 2. (Splitting step) Take each cluster c in C_1. Score promising subsets of c and run a weighted MAX-SAT solver to find the highest scoring partition of c. Note that, because our clusters are usually several orders of magnitude smaller than the whole set, this step will be feasible at least for interesting clusters.
Step 3. Substitute all the best sub-clusters of each cluster c in C_1 to form the next partition C_2.
Step 4. If C_1 = C_2 (i.e. if the best sub-cluster for each cluster in C_1 is the cluster itself) then stop.
Step 5. (Combining step) If this is not the case then, by the linearity of the score, C_2 must be higher scoring than C_1. Now take C_2 and, beginning with this starting partition, test combinations of clusters in C_2 using AHC. (Note we could alternatively use weighted MAX-SAT here as well.) This step may combine together spuriously clustered observations that initially appeared in different clusters of C_1 and were thrown out of these clusters in the first weighted MAX-SAT step. Find the optimal partition C_3 in this way.
Step 6. If C_3 = C_2 stop, otherwise go to Step 2.

This hybrid algorithm obviously performs at least as well as AHC and is able to undo any early erroneous combination of AHC. The shortcomings of AHC, discussed in (16), are overcome by checking each cluster with weighted MAX-SAT to identify outliers. Note that the method is fast because weighted MAX-SAT is only run on subsets of small cardinality. We note that, at least in the applications that we have encountered, most clusters of interest appear to contain fewer than a hundred units.
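The candidate-generation used inside the splitting step (Step 2) can be sketched as follows, following the iterative augmentation of Section 3.2 (cluster_score is any per-cluster score function; the exact bookkeeping is our own rendering of the step list above):

def iterative_augmentation(points, cluster_score):
    # singleton scores, always passed to weighted MAX-SAT as candidate clusters
    singleton = {i: cluster_score(frozenset([i])) for i in points}
    candidates = {frozenset([i]): s for i, s in singleton.items()}
    points = sorted(points)
    for pos, k in enumerate(points):
        c, s_c = frozenset([k]), singleton[k]
        for j in points[pos + 1:]:
            c_new = c | {j}
            s_new = cluster_score(c_new)
            # keep the augmented cluster only if it beats splitting off element j
            if s_new > s_c + singleton[j]:
                candidates[c_new] = s_new
                c, s_c = c_new, s_new
    return candidates

The candidate clusters and scores returned here are exactly the kind of input expected by the encoding sketched in Section 3.1.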
4 A Simple Example

We will illustrate the implementation of weighted MAX-SAT for clustering problems in comparison with, and in conjunction with, the widely used AHC. Here we demonstrate that weighted MAX-SAT can be used to cluster time-course gene expression data. The cluster scores are computed in C++, along the lines of the algorithm by (9), and include the modifications suggested by (16) and (1). All runs of weighted MAX-SAT were conducted using the C implementation available from the UBCSAT home page http://www.satlib.org/ubcsat. UBCSAT (17) is an implementation and experimentation environment for Stochastic Local Search (SLS) algorithms for SAT and MAX-SAT. We have used their implementation of WalkSat in this paper.

4.1 Data

Our algorithm will be illustrated by an example on a recent microarray experiment on the plant model organism Arabidopsis thaliana. This experiment was designed to detect genes whose expression levels, and hence functionality, might be connected with circadian rhythms. The aim is to identify the genes (of order 1,000) which may be connected with the circadian clock of the plant. A full analysis and exposition of this data, together with a discussion of its biological significance, is given in (8).
We will illustrate our algorithms on genes selected from this experiment. The gene expression of n = 22,810 genes was measured at r = 13 time points over two days by Affymetrix microarrays. Constant white light was shone on the plants for 26 hours before the first microarray was taken, with samples every four hours. The light remained on for the rest of the time course. Thus, there are two cycles of data for each Arabidopsis microarray chip. Subjective dawn occurs at about the 24th and 48th hours; this is when the plant has been trained to expect light after 12 hours of darkness.

4.2 Hybrid AHC Using Weighted MAX-SAT
Although our clustering algorithms apply to a huge space of over 22,000 gene profiles, to illustrate the efficacy of our hybrid method it is sufficient to show results on a small subset of the genes, here a proxy for two clusters. Thus we will illustrate how our hybrid algorithm can outperform AHC and how it rectifies partitions containing genes clustered spuriously in an initial step. In the example below we have therefore selected 15 circadian genes from the dataset above and contaminated these with 3 outliers that we generated artificially. We set the parameters v = 10, a = 0.001, b = 0.001 and λ = 0.5 and ran AHC, which obtained the partition formed by the 2 clusters shown in Figure 1. AHC is partially successful: the 15 circadian genes have been clustered together, and so have the 3 outliers. The latter cluster is a typical example of misclassification in the sense of (16), in that it is rather coarse with a relatively high associated variance. The score for this partition is Σ(C_AHC) = 64.89565.
Fig. 1. Clusters obtained on 18 genes of Arabidopsis thaliana using AHC (Σ(C_AHC) = 64.89565). Cluster 1 (3 genes) and Cluster 2 (15 genes) are plotted against time; the y-axis is the log of gene expression. Note the different y-axis scale for the two clusters.
Fig. 2. Clusters obtained on the 3 outliers of Arabidopsis thaliana using AHC (1 cluster, S_1 = −156.706) and weighted MAX-SAT (3 clusters, S_1 = −145.571). Clusters 1, 2 and 3 (1 gene each) are plotted against time.
Following the hybrid AHC algorithm, we then ran MAX-SAT on both of the clusters obtained by AHC. The clusters obtained are shown in Figures 2 and 3. Both of the clusters obtained by AHC have been split up by MAX-SAT. The score of the partition formed by these 5 clusters, including the constants, is now Σ(C_MAX-SAT) = 79.43005. This is the log of the marginal likelihood and, taking the appropriate exponential, in terms of Bayes factors this represents a decisive improvement for our model. Note that the increase in the log marginal likelihood is also supported by the visual display. The outliers are very different from each other and from the real data, and it seems reasonable that each one would generate a better cluster on its own; note the different scale of the y-axis. The other 15 genes have a more similar shape and it seems visually reasonable to cluster them together, as AHC does initially, but MAX-SAT is able to identify a more subtle difference between 2 shapes contained in that cluster. It was not necessary in our case to run AHC again to combine clusters, given the nature of our data. A single iteration of the loop described in our hybrid algorithm identified the useful refinement of the original partition. This example shows how, as discussed in (16), AHC can be unstable, especially when dealing with outliers at an early stage in the clustering. Weighted MAX-SAT is helpful to refine the algorithm and obtain a higher scoring partition.
Fig. 3. Clusters obtained on 15 genes of Arabidopsis thaliana using AHC (1 cluster, S_2 = 255.973) and weighted MAX-SAT (2 clusters, S_2 = 259.372). Cluster 4 (4 genes) and Cluster 5 (11 genes) are plotted against time.
It is clear that in larger examples involving thousands of genes the improvements above add up over all moderate sized clusters of an initial partition, by simply using weighted MAX-SAT over each cluster in the partition, as described in our algorithm and illustrated above.
5 Further Work on Cluster Scores for Large Clusters

In the approach taken in this paper clusters are explicitly represented as propositional atoms in the weighted MAX-SAT encoding, and so it is important to reduce the number of clusters considered as much as possible. The hybrid method with iterative augmentation that we have described in Section 3.2 works very efficiently for splitting clusters with cardinality smaller than 100. However, it slows down dramatically for greater cardinalities. It would be useful to generalize the approach so that it can also be employed to split up larger clusters. The main challenge here is to identify good candidate sets. Two methods that we are currently investigating are outlined below.

Reducing cluster scores using cliques. One promising method for identifying candidate clusters is to use a graphical approach based on pairwise proximity between the clustered units. (2), a well known and highly cited paper, proposes the CAST algorithm to identify the clique graph which is closest to the graph obtained from the proximity matrix. A graph is called a clique graph if it is a disjoint union of complete graphs. The disjoint cliques obtained by the CAST algorithm define the partition. We suggest using an approach similar to (2), enhanced by the use of weighted MAX-SAT and a fully Bayesian model.
We focus on maximal cliques, instead of clique graphs as in (2), to identify possible clusters to feed into weighted MAX-SAT. A maximal clique is a set of vertices that induces a complete subgraph and that is not a subset of the vertices of any larger complete subgraph. The idea is to create an undirected graph based on the adjacency matrix obtained by scoring each pair of observations as a possible cluster, and then use the maximal cliques of this graph to find plausible clusters. It is reasonable to assume that a group of elements is really close, and should belong to the same cluster, when it forms a clique. This considerably reduces the number of clusters that need to be evaluated and fed into weighted MAX-SAT, which will then identify the highest scoring partition. The first step is to calculate the proximity between observations i and j (i, j = 1, . . . , n), defined as

D = \{d_{ij}\}, \qquad d_{ij} = S_{ij} - (S_i + S_j),

which gives a matrix of adjacencies A,

A = \{a_{ij}\}, \qquad a_{ij} = \begin{cases} 1 & \text{if } d_{ij} > K \\ 0 & \text{otherwise} \end{cases}
from which we can draw a graph (S_{ij} is the score for the cluster of the 2 elements i and j). Each vertex represents an observation. Two vertices are connected by an edge according to the adjacency matrix A. The adjacency matrix defines an undirected graph. The maximal cliques, the intersections between maximal cliques and the unions of maximal cliques with common elements define the potential cluster scores for weighted MAX-SAT. Although such methods are deficient in the sense that they use only pairwise relationships within putative clusters, they identify potentially high scoring clusters quickly. Of course, it does not matter whether some of these clusters turn out to be low scoring within this candidate set, because each is subsequently fully scored for weighted MAX-SAT and its deficiency identified. This is in contrast to the method of (2), which is completely based on pairwise dissimilarities. So the only difficulty with this approach is induced by those clusters which are actually high scoring but nevertheless are not identified as promising. Other advantages of this method are that all the scores that are calculated are used as weights in the weighted MAX-SAT and that it does not induce any artificial constraint on cluster cardinalities.

Reducing cluster scores by approximating. An alternative to the method described above is to represent the equivalence relation given by a partition directly: for each distinct pair of data points y_i, y_j, an atom a_{i,j} would be created to mean that these two data points are in the same cluster. Only O(n²) such atoms are needed. Hard clauses (O(n³) of them) expressing the transitivity of the equivalence relation would have to be added. With this approach it might be possible to indirectly include information on cluster scores by approximating the score of a cluster by a quadratic function of the data points in it. A second-order Taylor approximation is an obvious choice. Such an approach would be improved by using a different approximating function for each cluster size.
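The clique-based candidate generation described above might be sketched as follows in Python (the use of the networkx library to enumerate maximal cliques is our own choice; pair_scores holds the two-element cluster scores S_ij and single_scores the singleton scores S_i):

import networkx as nx

def clique_candidates(pair_scores, single_scores, K):
    # build the graph whose edges join pairs with d_ij = S_ij - (S_i + S_j) > K
    n = len(single_scores)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = pair_scores[i][j] - (single_scores[i] + single_scores[j])
            if d_ij > K:
                G.add_edge(i, j)
    # maximal cliques of this graph are the candidate clusters for weighted MAX-SAT
    return [frozenset(c) for c in nx.find_cliques(G)]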
6 Discussion

WalkMaxSat appears to be a promising algorithm for enhancing partition search. It looks especially useful for embellishing other methods such as AHC, exploring regions around the AHC optimal partition and finding close partitions with better explanatory power. We demonstrated above that this technique can enhance performance on small subsets of the data and, in conjunction with AHC, on large datasets too. Although we have not tested this algorithm in the following regard, it can also be used as a useful exhaustive local check of a MAP partition found by numerical search (10). Also, note that weighted MAX-SAT can be used not just for MAP identification but also, by following the adaptation suggested by (6), in model averaging, using it to identify all models that are good. There are many embellishments of the types of methods described above that will potentially further improve our hybrid search algorithm. However, in this paper we have demonstrated that, in circumstances where the Crowley priors are appropriate, weighted MAX-SAT solvers can provide a very helpful addition to the toolbox of methods for MAP search over a partition space.
References

[1] Anderson, P.E., Smith, J.Q., Edwards, K.D., Millar, A.J.: Guided conjugate Bayesian clustering for uncovering rhythmically expressed genes. CRiSM Working Paper (2006) [2] Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. Journal of Computational Biology 6(3-4), 281–297 (1999) [3] Booth, J.G., Casella, G., Hobert, J.P.: Clustering using objective functions and stochastic search. J. Royal Statist. Soc.: Series B 70(1), 119–139 (2008) [4] Chipman, H.A., George, E.I., McCulloch, R.E.: Bayesian treed models. Machine Learning 48(1-3), 299–320 (2002) [5] Crowley, E.M.: Product partition models for normal means. Journal of the American Statistical Association 92(437), 192–198 (1997) [6] Cussens, J.: Bayesian network learning by compiling to weighted MAX-SAT. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (2008) [7] Denison, D.G.T., Holmes, C.C., Mallick, B.K., Smith, A.F.M. (eds.): Bayesian Methods for Nonlinear Classification and Regression. Wiley Series in Probability and Statistics. John Wiley and Sons, Chichester (2002) [8] Edwards, K.D., Anderson, P.E., Hall, A., Salathia, N.S., Locke, J.C.W., Lynn, J.R., Straume, M., Smith, J.Q., Millar, A.J.: FLOWERING LOCUS C Mediates Natural Variation in the High-Temperature Response of the Arabidopsis Circadian Clock. The Plant Cell 18, 639–650 (2006) [9] Heard, N.A., Holmes, C.C., Stephens, D.A.: A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves. Journal of the American Statistical Association 101(473), 18–29 (2006) [10] Lau, J.W., Green, P.J.: Bayesian Model-Based Clustering Procedures. Journal of Computational and Graphical Statistics 16(3), 526 (2007) [11] Liverani, S., Anderson, P.E., Edwards, K.D., Millar, A.J., Smith, J.Q.: Efficient Utility-based Clustering over High Dimensional Partition Spaces. Journal of Bayesian Analysis 4(3), 539–572 (2009)
[12] McCullagh, P., Yang, J.: Stochastic classification models. In: Proceedings International Congress of Mathmaticians, vol. III, pp. 669–686 (2006) [13] O’Hagan, A., Forster, J.: Bayesian Inference: Kendall’s Advanced Theory of Statistics, 2nd edn., Arnold (2004) [14] Quintana, F.A., Iglesias, P.L.: Bayesian clustering and product partition models. J. Royal Statist. Soc.: Series B 65(2), 557–574 (2003) [15] Ray, S., Mallick, B.: Functional clustering by Bayesian wavelet methods. J. Royal Statist. Soc. Series B 68(2), 305–332 (2006) [16] Smith, J.Q., Anderson, P.E., Liverani, S.: Separation measures and the geometry of Bayes factor selection for classification. J. Royal Statist. Soc.: Series B 70(5), 957–980 (2006) [17] Tompkins, D.A.D., Hoos, H.H.: UBCSAT: An implementation and experimentation environment for SLS algorithms for SAT and MAX-SAT. In: Hoos, H.H., Mitchell, D.G. (eds.) Theory and Applications of Satisfiability Testing: Revised Selected Papers of the Seventh International Conference, pp. 306–320. Springer, Heidelberg (2005)
A Novel Approach for Biclustering Gene Expression Data Using Modular Singular Value Decomposition

V.N. Manjunath Aradhya1,2, Francesco Masulli1,3, and Stefano Rovetta1

1 Dept. of Computer and Information Sciences, University of Genova, Via Dodecaneso 35, 16146 Genova, Italy
{aradhya,masulli,ste}@disi.unige.it
2 Dept. of ISE, Dayananda Sagar College of Engg, Bangalore, India - 560078
3 Sbarro Institute for Cancer Research and Molecular Medicine, Center for Biotechnology, Temple University, BioLife Science Bldg., 1900 N 12th Street, Philadelphia, PA 19122, USA

Abstract. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Recently, biclustering (or co-clustering), which performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. In this paper we propose a novel approach to biclustering gene expression data based on Modular Singular Value Decomposition (Mod-SVD). Instead of applying SVD directly to the data matrix, the proposed approach computes the SVD in a modular fashion. Experiments conducted on synthetic and real datasets demonstrate the effectiveness of the algorithm on gene expression data.
1 Introduction
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis [18]. In spite of the mature statistical literature on clustering, DNA microarray data have triggered the development of multiple new methods. The clusters produced by these methods reflect the global patterns of expression data, but in most cases an interesting cellular process may involve only a subset of genes that are co-expressed only under a subset of conditions. Discovering such local expression patterns may be the key to uncovering many genetic pathways that are not apparent otherwise. Therefore, it is highly desirable to move beyond the clustering paradigm and to develop approaches capable of discovering local patterns in microarray data [6,9]. In recent developments, biclustering is one of the most promising innovations in the field of clustering. Hartigan's [14] so-called ”direct clustering” is the inspiration for the introduction of the term biclustering to gene expression analysis by
Biclustering, also called co-clustering, is the problem of simultaneously clustering rows and columns of a data matrix. Unlike clustering, which seeks similar rows or columns, co-clustering seeks "blocks" (or co-clusters) of rows and columns that are inter-related. In recent years, biclustering has received a lot of attention in several practical applications, such as simultaneous clustering of documents and words in text mining [16], genes and experimental conditions in bioinformatics [6,19], and tokens and contexts in natural language processing [27], among others. Many approaches to biclustering gene expression data have been proposed to date (and the problem is known to be NP-complete). Existing biclustering methods include δ-biclusters [6,32], gene shaving [15], the flexible overlapped biclustering algorithm (FLOC) [33], the order preserving sub-matrix (OPSM) [3], interrelated two-way clustering (ITWC) [29], coupled two-way clustering (CTWC) [13], spectral biclustering [19], the statistical algorithmic model (SAMBA) [28], the iterative signature algorithm [17], a fast divide-and-conquer algorithm (Bimax) [9], and maximum similarity biclustering (MSBE) [22]. A detailed survey on biclustering algorithms for biological data analysis can be found in [23], which covers the models, methods and applications in the field. Another interesting paper, on the comparison and evaluation of biclustering methods for gene expression data, is [9]: it compares five prominent biclustering methods with hierarchical clustering and a baseline biclustering algorithm with respect to their capability of identifying groups of (locally) co-expressed genes. Different approaches to biclustering gene expression data, based on the Hough transform, possibilistic clustering, fuzzy methods, and genetic algorithms, can also be found in the literature. A geometric biclustering algorithm based on the Hough Transform (HT) for the analysis of large scale microarray data is presented in [35]. The HT is performed in a two-dimensional (2-D) space of column pairs, and coherent columns are then combined iteratively to form larger and larger biclusters. This reduces the computational complexity considerably and makes it possible to analyze large-scale microarray data. An HT-based algorithm has also been developed to analyze three-color microarray data, which involve three experimental conditions [36]. However, the original HT-based algorithm becomes ineffective, in terms of both computing time and storage space, as the number of conditions increases. Another approach based on a geometric interpretation of the biclustering problem is described in [12]: an implementation of the geometric framework using the Fast Hough transform for hyperplane detection can be used to discover biologically significant subsets of genes under subsets of conditions for microarray data analysis. Work on a possibilistic approach for biclustering microarray data is presented in [5]. This approach, the Possibilistic Spectral Biclustering algorithm (PSB), obtains potentially overlapping biclusters and is based on fuzzy techniques and spectral clustering; the method is tested on S. cerevisiae cell cycle expression data and on a human cancer dataset. An approach to the biclustering problem using the Possibilistic Clustering paradigm is described in [11]; this method finds one bicluster at a time, assigning a membership to the bicluster to each gene and each condition.
Possibilistic clustering is tested on the Yeast database, obtaining fast convergence and good quality solutions. Genetic Algorithms (GAs) have also been used in [24] to maximize homogeneity in a single bicluster. Given the complexity of the biclustering problem, that method adopts a global search method, namely genetic algorithms. A multicriterion approach is also used to optimize two fitness functions implementing the criteria of bicluster homogeneity (Mean Squared Residue) and size [7]. A comparison of fuzzy approaches to biclustering is described in [10], where fuzzy central clustering approaches are shown to have very interesting properties, such as speed in finding a solution, stability, and the capability of obtaining large and homogeneous biclusters. SVD-based methods have also been used to obtain biclusters in gene expression data and in many other applications [8,21]. Applying SVD directly to the data may yield biclusters, but obtaining a large number of biclusters that are functionally enriched with GO categories is still a challenging problem. Hence, in this paper we propose a method based on modular SVD for biclustering gene expression data. In this way, local features of genes and conditions can be extracted efficiently in order to obtain better biclusters. The organization of the paper is as follows: in Sect. 2 we explain the proposed Modular SVD based method; in Sect. 3 we report experiments on synthetic and real datasets; finally, conclusions are drawn at the end.
2 Proposed Method
In this section we describe our proposed method, which is based on modular (data subset) SVD. The SVD is one of the most important and powerful tools used in numerical signal processing. It is employed in a variety of signal processing applications, such as spectrum analysis, filter design, and system identification. SVD-based methods have also been used to obtain biclusters in gene expression data and in many other applications [8,21]. Applying SVD directly to the data may yield biclusters, but obtaining good biclusters is still a challenging problem. The standard SVD-based method may not be very effective under different conditions, since it considers the global information of genes and conditions and represents them with a single set of weights. Hence, in this work we attempt to overcome this problem by partitioning the gene expression data into several smaller subsets and then applying SVD to each subset separately. The three main steps of our method, named M-SVD Biclustering, are:
1. The original data set, denoted by a matrix, is partitioned into a set of equally sized sub-matrices in a non-overlapping way.
2. SVD is performed on each of these data subsets.
3. Finally, a single global feature is synthesized by concatenating the features obtained from each data subset.
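As a point of reference for step 1, the partitioning can be expressed in a few lines. This is a minimal sketch, not the authors' implementation, assuming the expression matrix is held as a NumPy array and that the blocks are contiguous, non-overlapping row slices; the block count K = 5 is an assumption chosen so that each block holds roughly 20% of the rows.

```python
import numpy as np

def partition_rows(A, K=5):
    """Split the expression matrix A (genes x conditions) into K
    non-overlapping blocks of contiguous rows of roughly equal size."""
    return np.array_split(A, K, axis=0)

# Example: a toy 100 x 10 expression matrix split into 5 row blocks (~20% each)
A = np.random.rand(100, 10)
blocks = partition_rows(A, K=5)
print([b.shape for b in blocks])  # [(20, 10), (20, 10), (20, 10), (20, 10), (20, 10)]
```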
In this work, we partitioned the data along the rows in a non-overlapping way, as shown in Fig. 1. The size of each partition is around 20% of the original matrix.
Fig. 1. Procedure of applying SVD on partitioned data
Each step of the algorithm is explained in detail as follows:

Data Partition: Let us consider an m × n matrix A, which contains m genes and n conditions. This matrix A is partitioned into K d-dimensional sub-matrices of similar size in a non-overlapping way, where A = (A_1, A_2, ..., A_K), with A_k being a sub-matrix of A. Figure 1 shows the partition procedure for a given data matrix. Note that a partition of A can be obtained in many different ways, e.g., selecting groups of contiguous rows, groups of contiguous columns, or randomly sampling some rows or columns. In this work, we selected groups of contiguous rows for the experiments.

Apply SVD on the K sub-matrices: According to the second step, conventional SVD is applied to each of the K sub-matrices. The SVD provides a factorization for all matrices, even matrices that are not square or have repeated eigenvalues. In general, the theory of SVD states that any matrix A of size m × n can be factorized into a product of unitary matrices and a diagonal matrix, as follows [20]:

\[ A = U \Sigma V^T \tag{1} \]

where U ∈ R^{m×m} is unitary, V ∈ R^{n×n} is unitary, and Σ ∈ R^{m×n} has the form Σ = diag(λ_1, λ_2, ..., λ_p), where p = min(m, n).
The diagonal elements of Σ are called the singular values of A and are usually ordered in descending order. The SVD collects the eigenvectors of AA^T in U and those of A^T A in V. This decomposition may be represented as shown in Fig. 2:
Fig. 2. Illustration of SVD method
The "diagonal" of Σ contains the singular values of A. They are real and non-negative numbers. The upper part (green strip) contains the positive singular values: there are r positive singular values, where r is the rank of A. The lower part of the diagonal (gray strip) contains the (n − r) zero, or "vanishing", singular values. SVD is known to be more robust than the usual eigenvectors of the covariance matrix, because its robustness is determined by directional vectors rather than by mere scalar quantities such as magnitudes (the singular values stored in Σ). These directions are encoded in the U and V matrices, which are inherently orthogonal, unlike eigenvectors, which need not be orthogonal. Hence, a small perturbation such as noise has very little effect on the orthogonality properties encoded in the U and V matrices, and this is probably the main reason for the robust behaviour of the SVD. Finally, for each of the data partitions we would expect the eigenvectors corresponding to the largest eigenvalue to provide the optimal clusters. However, we also observed that eigenvectors with smaller eigenvalues could yield clusters. Instead of clustering each eigenvector individually, we perform clustering by applying k-means to the data projected onto the best three or four eigenvectors. In general, the procedure can be described as follows:
– Apply SVD on each partitioned data subset.
– Project the data onto the assigned number of eigenvectors.
– Apply the k-means algorithm to the projected data with the desired number of clusters.
– Finally, find the biclusters for the specified number of clusters.
Steps to obtain biclusters:
– [U S V] = svd(partition_data)
– cluster_index = kmeans(projected (U & V), desired number of clusters)
– cluster_data = cluster_index(x, y)
More formally, the proposed method is presented in the form of an algorithm, as shown below.

Algorithm: Modular SVD
– Input: gene expression data
– Output: heterogeneity H and bicluster size N
– Steps:
1. Acquire the gene expression matrix and generate K d-dimensional sub-matrices in a non-overlapping way, reshaped into the K × n matrix A = (A_1, A_2, ..., A_K), with A_k being a sub-matrix of A.
2. Apply the standard SVD method to each sub-matrix obtained in Step 1.
3. Perform the final clustering step by applying k-means to the data projected onto the best three or four eigenvectors.
4. Repeat this procedure for all the partitions of the gene expression data.
5. Finally, compute the heterogeneity H and the size N from the resulting biclusters obtained from each partition matrix.
– Algorithm ends
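The whole procedure can be prototyped compactly. The sketch below is an illustrative reading of M-SVD-BC, not the authors' implementation (which, as noted in Sect. 3, was written in R): it assumes contiguous row blocks, projection onto the leading right singular vectors of each block, and k-means on the projected rows; the number of blocks, retained components, and clusters are free parameters here.

```python
import numpy as np
from sklearn.cluster import KMeans

def msvd_bicluster(A, n_blocks=5, n_components=3, n_clusters=2, seed=0):
    """For each row block of A, project the block onto its leading right
    singular vectors and run k-means; returns one label array per block."""
    block_labels = []
    for block in np.array_split(A, n_blocks, axis=0):
        # Step 2: SVD of the sub-matrix
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        # Step 3: project the block onto its best few right singular vectors
        projected = block @ Vt[:n_components].T
        # Step 4: k-means on the projected rows gives the row groups (biclusters)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        block_labels.append(km.fit_predict(projected))
    return block_labels

A = np.random.rand(100, 10)            # toy expression matrix (genes x conditions)
labels_per_block = msvd_bicluster(A)
print([np.bincount(l) for l in labels_per_block])   # row-group sizes in each block
```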
3 Experiment Results and Comparative Study
In this section we experimentally evaluate the proposed method on synthetic and standard datasets. The proposed algorithm has been coded in the R language on a Pentium IV 2 GHz with 756 MB of RAM under the Windows platform. Following [6], we are interested in the largest biclusters N from DNA microarray data that do not exceed an assigned homogeneity constraint. To this aim we can use the definition of biclustering average heterogeneity H. The size N of a bicluster is usually defined as the number of cells of the gene expression matrix X belonging to it, that is, the product of the cardinalities n_g = |g| and n_c = |c|, where g and c are the sets of selected genes (rows) and conditions (columns) of X, respectively:

\[ N = n_g \cdot n_c \tag{2} \]
We can define H and G as the sum-squared residue and mean-squared residue, respectively, two related quantities that measure the bicluster heterogeneity:

\[ H = \sum_{i \in g} \sum_{j \in c} d_{ij}^2 \tag{3} \]

\[ G = \frac{1}{N} \sum_{i \in g} \sum_{j \in c} d_{ij}^2 = \frac{H}{N} \tag{4} \]

where d_{ij} = x_{ij} + x_{IJ} − x_{iJ} − x_{Ij}, and x_{IJ}, x_{Ij} and x_{iJ} are the bicluster mean, the row mean and the column mean of X for the selected genes and conditions, respectively.
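Equations (3) and (4) translate directly into code. A minimal sketch, assuming the bicluster is specified by row (gene) and column (condition) index arrays over the expression matrix:

```python
import numpy as np

def residues(X, genes, conds):
    """Sum-squared residue H (eq. 3) and mean-squared residue G = H/N (eq. 4)."""
    B = X[np.ix_(genes, conds)]
    row_mean = B.mean(axis=1, keepdims=True)   # x_iJ
    col_mean = B.mean(axis=0, keepdims=True)   # x_Ij
    all_mean = B.mean()                        # x_IJ
    d = B + all_mean - row_mean - col_mean     # d_ij
    H = np.sum(d ** 2)                         # eq. (3)
    N = B.size                                 # bicluster size n_g * n_c
    return H, H / N

X = np.random.rand(50, 20)
H, G = residues(X, genes=np.arange(5), conds=np.arange(4))
print(H, G)
```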
3.1 Results on Synthetic Dataset
We compared our results with a standard spectral method [19] using a synthetic and a standard real dataset. To obtain the first synthetic dataset, we generated matrices with random values of size 100 × 10 (rows × columns). Table 1 shows the heterogeneity and bicluster size obtained using the standard spectral method and the proposed method. From the results we can see that the spectral method failed to find a bicluster, whereas the proposed method successfully extracted a bicluster from the data.

Table 1. Heterogeneity and size N for the synthetic dataset of size 100 × 10
Methods          Heterogeneity (H)   Size (N)
Spectral [19]            —              —
M-SVD-BC               0.90             70

We also tested our method on another synthetic dataset. To this aim, we generated matrices with random numbers in which 3 similar biclusters were embedded, with dimensions ranging from 3–5 rows and 5–7 columns. The heterogeneity H and bicluster size N are tabulated in Table 2. The heterogeneity obtained by the two methods is comparable; however, considering bicluster size, the proposed method achieved a larger size than the standard spectral method. Figure 3 shows an example of the biclusters extracted by the proposed method from the synthetic dataset. From these two results, it can be ascertained that the proposed M-SVD based method performs better than the standard spectral method in both heterogeneity and bicluster size.

Table 2. Heterogeneity (H) and size N for the synthetic dataset of size 10 × 10
Methods          Heterogeneity (H)   Size (N)
Spectral [19]           7.1             30
M-SVD-BC               7.21             51
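Synthetic matrices of this kind can be reproduced in spirit by planting a coherent block inside random background noise. The sketch below is an assumed setup, not the authors' exact generator; the matrix size matches Table 1 and the planted block size falls in the ranges quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 10))                     # background noise, 100 x 10 as in Table 1

# Plant one coherent (up-shifted) bicluster of 5 rows x 6 columns
rows = rng.choice(100, size=5, replace=False)
cols = rng.choice(10, size=6, replace=False)
X[np.ix_(rows, cols)] += 2.0                  # shift the block so it forms a coherent pattern

print(sorted(rows), sorted(cols))             # ground-truth bicluster to recover
```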
Fig. 3. (a) A synthetic dataset with multiple biclusters of different patterns; (b–d) the biclusters extracted
3.2 Results on Real Dataset
For the experiments on real data, we tested our proposed method on the standard Bicat Yeast, Syntren E. coli and Yeast datasets. The Bicat Yeast dataset [2] contains the expression levels of 419 probesets over 70 conditions, following the Affymetrix probeset notation. Affymetrix data files are normally available in DAT, CEL, CHP and EXP formats. Data contained in CDF files can also be used; these files hold the information about which probes belong to which probe set. For more information on the file formats see [1]. Results for this dataset are reported in Table 3. Another important experiment concerns the performance index, defined as the ratio N/G: the larger the ratio, the better the performance. Figure 4 shows the relationship between the two quality criteria obtained by the proposed and the spectral [19] method, respectively. The first column in the graph represents the bicluster size (N) for a varying number of principal components (PCs). The second column shows the value of the performance index, defined as the ratio N/G.

Table 3. H and N for the standard Bicat Yeast dataset
Methods          Heterogeneity (H)   Size (N)
Spectral [19]          0.721            680
M-SVD-BC               0.789           2840
Another dataset, with information about the expression levels of 200 genes over 20 conditions from a transcription regulatory network, is also used in our experiments [26]. A detailed description of Syntren can be found in [4]. The heterogeneity and bicluster size results are tabulated in Table 4. The heterogeneity obtained by the spectral method is higher than that of the proposed method: the proposed method obtained lower heterogeneity and, at the same time, achieved a larger bicluster size than the spectral method.
Fig. 4. Performance comparison of biclustering techniques: size (N) and ratio N/G

Table 4. H and N for the standard Syntren E. coli dataset
Methods          Heterogeneity (H)   Size (N)
Spectral [19]          19.95            16
M-SVD-BC                7.07           196
We also applied our algorithm to the Yeast dataset, a genomic dataset composed of 2884 genes and 17 conditions [30]. The original microarray gene expression data can be obtained from the web site http://arep.med.harvard.edu/biclustering/yeast.matrix. Results are expressed as both the heterogeneity H and the bicluster size N. Table 5 shows the H and N obtained by well-known algorithms on the Yeast dataset. Here we compared our method with Deterministic Biclustering with Frequent pattern mining (DBF) [34], the Flexible Overlapped biclustering algorithm (FLOC) [33], Cheng and Church [6], single- and multi-objective Genetic Algorithms [25,7], Fuzzy Co-clustering with Ruspini's condition (FCR) [31], and Spectral biclustering [19]. From the results it can be ascertained that the proposed modular SVD extracts biclusters of larger size than the other well-known techniques.
Table 5. H and N for the standard Yeast dataset
Methods                     Heterogeneity (H)   Size (N)
DBF [34]                          115             4000
FLOC [33]                         188             2000
Cheng and Church [6]              204             4485
Single-objective GA [7]          52.9             1408
Multi-objective GA [25]           235            14828
FCR [31]                        973.9            15174
Spectral [19]                     201            12320
Proposed Method                   259            18845
From the above observations and results it is clear that applying SVD in a modular fashion yields better performance than the standard approaches.
4 Conclusion
In this paper, a novel approach for biclustering gene expression data based on Modular SVD has been proposed. Instead of applying SVD directly to the data, the proposed method uses a modular approach to extract biclusters. The standard SVD-based method may not be very effective under different conditions, since it considers the global information of genes and conditions and represents them with a single set of weights. By applying SVD in a modular way, local features of genes and conditions can be extracted efficiently in order to obtain better biclusters. The proposed method was tested on synthetic as well as real datasets, and the experiments demonstrated its effectiveness when compared with well-known existing algorithms.
References 1. http://www.affymetrix.com/analysis/index.affix 2. Barkow, S., Bleuler, S., Prelic, A., Zimmermann, P., Zitzler, E.: Bicat: A biclustering analysis toolbox. Bioinformatics 19, 1282–1283 (2006) 3. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving sub-matrix problem. In: Proceedings of the Sixth Annual International Conference on Computational Biology, pp. 49–57. ACM Press, New York (2002) 4. Blucke, Leemput, Naudts, Remortel, Ma, Verschoren, Moor, Marchal: Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithm. BMC Bioinformatics 7, 1–16 (2006) 5. Cano, C., Adarve, L., L´ opez, J., Blanco, A.: Possibilistic approach for biclustering microarray data. Computers in Biology and Medicine 37, 1426–1436 (2007) 6. Cheng, Y., Church: Biclustering of expression data. In: Proceedings of the Intl Conf. on intelligent Systems and Molecular Biology, pp. 93–103 (2000) 7. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: Nsga-ii. IEEE Trans. on Evolutionary Computatation 6, 182– 197 (2002) 8. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the 7th ACM SIGKDD, pp. 269–274 (2001) 9. Prelic, A., et al.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22, 1122–1129 (2006) 10. Filippone, M., Masulli, F., Rovetta, S., Zini, L.: Comparing fuzzy approaches to biclustering. In: Proceedings of International Meeting on Computational Intelligence for Bioinformatics and Biostatistics, CIBB (2008) 11. Filippone, M., Masulli, F., Stefano, R.: Possibilistic approach to biclustering: An application to oligonucleotide microarray data analysis. In: Proceedings of the Computational Methods in System Biology, pp. 312–322 (2006) 12. Gan, X., Alan, Yan, H.: Discovering biclusters in gene expression data based on high dimensional linear geometries. BMC Bioinformatics 9, 209–223 (2008)
13. Getz, G., Levine, E., Domany, E.: Coupled two-way clustering analysis of gene microarray data. In: Proceedings of National Academy of Science, 12079–12084 (2000) 14. Hartigan, J.A.: direct clustering of a data matrix. Journal of the American Statistical Association 67, 123–129 (1972) 15. Hastie, T., Levine, E., Domany, E.: ’Gene shaving’ as a method for identifying distinct set of genes with similar expression patterns. Genome Biology 1, 0003.1– 0003.21 (2000) 16. Mallela, S., Dhillon, I., Modha, D.: Information-theoretic co-clustering. In: In Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 89–98 (2003) 17. Ihmels, J., Bergmann, S., Barkai, N.: Defining transcription modules using largescale gene expression data. Bioinformatics 20, 1993–2003 (2004) 18. Jain, A.K., Murthy, M.N., Flynn, P.J.: Data clustering: A review. ACM Computing Surveys 31, 264–323 19. Kluger, Y., Basri, Chang, Gerstein: Spectral biclustering of microarray data: Coclustering genes and conditions. Genome Research 13, 703–716 (2003) 20. Lay, D.C.: Linear Algebra and its Applications. Addison-Wesley, Reading (2002) 21. Li, Z., Lu, X., Shi, W.: Process variation dimension reduction based on svd. In: Proceedings of the Intl Symposium on Circuits and Systems, pp. 672–675 (2003) 22. Liu, X., Wang, L.: Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 23, 50–56 (2007) 23. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey. IEEE & ACM Trans. on Computational Biology and Bioinformatics 1, 24–45 (2004) 24. Mitra, S., Banka, H.: Mulit-objective evolutionary biclustering of gene expression data. Pattern Recognition 39, 2464–2477 (2006) 25. Mitra, S., Banka, H.: Multi-objective evolutionary biclustering of gene expression data. Pattern Recognition 39, 2464–2477 (2006) 26. Orr, S.: Network motifs in the transcriptional regulation network of escherichia coli. Nature Genetics 31, 64–68 (2002) 27. Rohwer, R., Freitag, D.: Towards full automation of lexicon construction. In: HLTNAACL 2004: Workshop on Computational Lexical Semantics, pp. 9–16 (2004) 28. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, 136–144 (2002) 29. Tang, C., Zhang, L., Zhang, A., Ramanathan, M.: Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. In: Proceedings of the Second Annual IEEE International Symposium on Bioinformatics and Bioengineering, BIBE, pp. 41–48 (2001) 30. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M.: Systematic determination of genetic network architecture. Nature Genetics 22, 281–285 (1999) 31. Tjhi, W.C., Chen, L.: A partitioning based algorithm to fuzzy co-cluster documents and words. Pattern Recognition Letters 27, 151–159 (2006) 32. Yang, J., Wang, W., Wang, H., Yu, P.: δ-cluster: capturing subspace correlation in a large data set. In: Proceedings of the 18th IEEE International Conference Data Engineering, pp. 517–528 (2002) 33. Yang, J., Wang, W., Wang, H., Yu, P.: Enhanced biclustering on expression data. In: Proceedings of the Third IEEE Conference on Bioinformatics and Bioengineering, pp. 321–327 (2003)
34. Zhang, Z., Teo, A., Ooi, B.: Mining deterministic biclusters in gene expression data. In: Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering, p. 283 (2004) 35. Zhao, H., Alan, Xie, X., Yan, H.: A new geometric biclustering algorithm based on the Hough transform for analysis of large scale microarray data. Journal of Theoretical Biology 251, 264–274 (2008) 36. Zhao, H., Yan, H.: Hough feature, a novel method for assessing drug effects in three-color cdna microarray experiments. BMC Bioinformatics 8, 256 (2007)
Using Computational Intelligence to Develop Intelligent Clinical Decision Support Systems

Alexandru G. Floares1,2

1 SAIA - Solutions of Artificial Intelligence Applications, Cluj-Napoca, Romania [email protected]
2 Department of Artificial Intelligence, Cancer Institute Cluj-Napoca, Romania [email protected]
Abstract. Clinical Decision Support Systems have the potential to optimize medical decisions, improve medical care, and reduce costs. An effective strategy to reach these goals is to transform conventional Clinical Decision Support into Intelligent Clinical Decision Support, using knowledge discovery in data and computational intelligence tools. In this paper we used genetic programming and decision trees. Adaptive Intelligent Clinical Decision Support systems also have the capability of self-modifying their rule sets, through supervised learning from patient data. Intelligent and Adaptive Intelligent Clinical Decision Support also represent an essential step toward clinical research automation, and a stronger foundation for evidence-based medicine. We propose a methodology and related concepts, and analyze the advantages of transforming conventional Clinical Decision Support into intelligent and adaptive Intelligent Clinical Decision Support. These are illustrated with a number of our results in liver diseases and prostate cancer, some of them showing the best published performance.
1 Introduction
Using IT to replace painful, invasive, or costly procedures, to optimize various medical decisions, and to improve medical care and reduce costs are major goals of Biomedical Informatics. Using Clinical Decision Support Systems (CDSS), which are computer systems designed to impact clinician decision making about individual patients at the point in time that these decisions are made [1], on a large scale is a major step toward these goals. Transforming conventional CDSS into Intelligent Clinical Decision Support (i-CDSS), using knowledge discovery in data and computational intelligence, could mark a revolutionary step in the evolution of evidence-based medicine. Adaptive i-CDSS also have the capability of self-modifying their rule sets through learning from patient data. Intelligent and adaptive CDSS are a significant step toward clinical research automation too. We are performing a series of investigations on constructing i-CDSS for several liver diseases, prostate and thyroid cancer, and chromosomal disorders (e.g., Down syndrome) during pregnancy. Here, we propose a methodology and related concepts, and analyze the advantages of transforming conventional CDSS into intelligent and evolving CDSS.
These are briefly illustrated with some of our results in liver diseases, some of which show the best published performance.
2 Proposed Methods and Concepts
We are developing a set of methods for transforming conventional CDSS into i-CDSS in liver diseases, especially chronic hepatitis C and B, prostate and thyroid cancer, and chromosomal disorders (e.g., Down syndrome) during pregnancy. Some of these methods and concepts are mature enough to be presented, although they are still under development.

2.1 Non-invasive i-CDSS for Fibrosis and Necroinflammation Assessment in Chronic Hepatitis B and C
Chronic hepatitis B and C are major diseases of mankind and a serious global public health problem. Persons with these chronic diseases are at high risk of death from cirrhosis and liver cancer. Liver biopsy is the gold standard for grading the severity of disease, staging the degree of fibrosis and grading the necroinflammation. The most used scoring systems are METAVIR A (A for activity) or Ishak NI (NI for necroinflammatory), and METAVIR F or Ishak F for the fibrosis stage (F). By assigning scores for severity, the grading and staging of hepatitis are used as a CDSS for patient management. The first step of the proposed strategy consists in identifying the problems of the conventional CDSS that are potentially solvable with the aid of computational intelligence. We note that the investigated CDSS are based on clinical practice guidelines, adapted to our clinical studies; they were built and implemented by our interdisciplinary team. We analyzed the clinical workflows used in monitoring chronic hepatitis B and C patients. Liver biopsy is invasive, painful, and relatively costly; complications severe enough to require hospitalization can occur in approximately 4% of patients [2]. In a review of over 68,000 patients recovering from liver biopsy, 96% experienced adverse symptoms during the first 24 hours of recovery. Hemorrhage was the most common symptom, but infections also occurred. Side effects of the biopsies included pain, tenderness, internal bleeding, pneumothorax, and, rarely, death [3]. The second step consists in searching the literature for potential solutions and critically evaluating them. There are two main non-invasive diagnosis techniques of interest [4]. FibroScan is a type of ultrasound machine that uses transient elastography to measure liver stiffness; the device reports a value measured in kilopascals (kPa). FibroTest, for assessing fibrosis, and ActiTest, for assessing necroinflammatory activity, are available through BioPredictive (www.biopredictive.com). These tests use algorithms to combine the results of serum tests of alpha-2-macroglobulin, haptoglobin, apolipoprotein A1, total bilirubin, gamma glutamyltranspeptidase (GGT), and alanine aminotransferase (ALT). The results of these diagnosis techniques are not directly interpretable by a pathologist, but can be extrapolated to a fibrosis and necroinflammation score.
FibroTest and FibroScan have reasonably good utility for the identification of cirrhosis, but lower accuracy for earlier stages. It is considered that refinements are necessary before these tests can replace liver biopsy [4], and this motivates us to continue our investigation. The third step consists in investigating the possibility of building i-CDSS capable of solving the problems of the conventional CDSS. To this goal, we used a knowledge discovery in data approach, based on computational intelligence. The existing solutions have two main weak points which could be addressed: the accuracy, which might be improved, and the fact that the results are not expressed in the familiar pathologist scoring systems. We extracted and integrated information from various non-invasive data sources, e.g. imaging, clinical, and laboratory data, and built i-CDSS capable of predicting the biopsy results - fibrosis stage and necroinflammation grade - with an acceptable accuracy. The meaning of an acceptable accuracy is a matter of consensus. Probably, in this context, taking into account the opinions of the hepatologists from our team and the performances of the best commercial systems (see above), the majority of physicians will agree that the prediction accuracy should be at least 80%.
2.2 i-CDSS for Optimizing Interferon Treatment and Reducing Error Costs in Chronic Hepatitis B and C
Another major conventional CDSS that will be investigated and transformed into an adaptive i-CDSS, using the same three-step approach, is related to treatment. Chronic hepatitis B and C are treated with drugs called Interferon or Lamivudine, which can help some patients. This treatment decision is based on several patient selection criteria. For example, the Romanian Ministry of Health's criteria for selecting the patients with chronic hepatitis C who will benefit from Interferon treatment are:
1. Chronic infection with the hepatitis C virus (HCV): antibodies against HCV (anti-HCV) are present for at least 3 months.
   (a) the hepatitis B surface antigen (HBsAg) is present for at least 6 months, or
   (b) the hepatitis B e antigen (HBeAg) is present for at least 10 weeks.
2. The cytolytic syndrome: the transaminases level is increased or normal.
3. Pathology (biopsy): Ishak NI ≥ 4 and Ishak F ≤ 3.
4. The virus is replicating: the transaminases level is increased or normal, anti-HCV are present, and RNA-HCV ≥ 10^5 copies/milliliter.
For hepatitis B there is a similar set of selection rules; an illustrative encoding of the HCV criteria above is sketched below. A careful analysis of these conventional CDSS identifies two problems:
1. Invasiveness: the patient selection criteria include fibrosis and necroinflammation assessed by liver biopsy, an invasive medical procedure.
2. Cost of wrong decisions: patient selection mistakes are very costly, because Interferon or Lamivudine therapy costs thousands of dollars.
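For illustration only, the HCV selection criteria listed above can be written as an explicit rule set. The function below is a hypothetical encoding with made-up field names (not a real clinical schema); the thresholds come from the listed criteria, and the biopsy-based rule (3) is the one an i-CDSS would later replace with predicted scores.

```python
def interferon_eligible(p):
    """Hypothetical encoding of the listed HCV selection criteria.
    `p` is a dict with illustrative field names, not a real clinical schema."""
    chronic_infection = p["anti_hcv_months"] >= 3                     # rule 1
    cytolytic = p["transaminases"] in ("increased", "normal")         # rule 2
    pathology = p["ishak_ni"] >= 4 and p["ishak_f"] <= 3              # rule 3 (biopsy)
    replicating = (cytolytic and chronic_infection
                   and p["rna_hcv_copies_per_ml"] >= 1e5)             # rule 4
    return chronic_infection and cytolytic and pathology and replicating

patient = {"anti_hcv_months": 8, "transaminases": "increased",
           "ishak_ni": 5, "ishak_f": 2, "rna_hcv_copies_per_ml": 3e5}
print(interferon_eligible(patient))   # True
```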
Developing intelligent CDSS based on non-invasive medical investigations and optimized selection criteria could be of great benefit to the patients and could also save money. To this goal, one should investigate whether it is possible:
1. To build i-CDSS capable of predicting the biopsy results - fibrosis stage and necroinflammation grade - with an accuracy of at least 80%.
2. To integrate the i-CDSS predicting the biopsy results with the other selection criteria in an i-CDSS for Interferon treatment.
3. To make the Interferon treatment i-CDSS an evolving one, capable of optimizing the treatment decisions by self-modification through learning.
Evolving i-CDSS can minimize the costs due to wrong patient selection and maximize the benefit of the treatment. They can optimize the selection rule sets by finding the relevant selection criteria and their proper cutoff values. For this, the results of the Interferon treatment must be clearly defined as numerical or categorical attributes and registered in a database for each treated patient. Then, intelligent agents are employed to learn to predict the treatment results. They must be capable of expressing the extracted information in the form of rules, using non-invasive clinical, laboratory and imaging attributes as inputs. Using feature selection [5], one will find the relevant patient selection criteria. Thus, the i-CDSS starts with the widely accepted patient selection criteria, but these then evolve. It is worth mentioning that the evolved selection criteria can be different from, and usually better than, those initially proposed by physicians. However, they should always be evaluated by the experts. In the supervised learning process, intelligent agents also discover the proper cutoff values of the relevant selection criteria. Again, these are usually better than those proposed by experts, but they should always be evaluated by them. In our opinion, learning and adapting capabilities are of fundamental importance for evidence-based medicine. The third step of our approach is also the key step and the most complex one. The i-CDSS are the result of a data mining predictive modeling strategy, which is now patent pending, consisting mainly of:
1. Extracting and integrating information from various medical data sources, after a laborious preprocessing: (a) cleaning features and patients, (b) various treatments of missing data, (c) ranking features, (d) selecting features, (e) balancing data.
2. Testing various classifiers or predictive modeling algorithms.
3. Testing various ensemble methods for combining classifiers.
The following variables were removed:
1. Variables that have more than 70% missing values.
2. Categorical variables that have a single category counting for more than 90% of the cases.
3. Continuous variables that have a very small standard deviation (almost constant).
4. Continuous variables that have a coefficient of variation CV ≤ 0.1 (CV = standard deviation/mean).
5. Categorical variables whose number of categories is greater than 95% of the number of cases.
A sketch applying these screening rules is given after the list of algorithms below. For modeling, we first tested the fibrosis and necroinflammation prediction accuracy of various types of computational intelligence agents:
1. Neural Networks of various types and architectures,
2. Decision trees: C5.0 and Classification and Regression Trees,
3. Support Vector Machines, with various kernels,
4. Bayesian Networks,
5. Genetic Programming based agents.
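The variable-screening rules listed just before this algorithm list can be expressed as a handful of column filters. This is a hedged sketch, assuming the records sit in a pandas DataFrame; the column names and dtypes in the usage example are made up for illustration.

```python
import pandas as pd

def screen_variables(df):
    """Drop columns according to the five screening rules listed above (a sketch)."""
    keep = []
    n = len(df)
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > 0.70:                            # rule 1: >70% missing
            continue
        if s.dtype == object or str(s.dtype) == "category":
            counts = s.value_counts(dropna=True)
            if len(counts) and counts.iloc[0] / n > 0.90:     # rule 2: one category >90%
                continue
            if s.nunique(dropna=True) > 0.95 * n:             # rule 5: too many categories
                continue
        else:
            mean, std = s.mean(), s.std()
            if std < 1e-8:                                    # rule 3: (almost) constant
                continue
            if mean != 0 and std / abs(mean) <= 0.1:          # rule 4: CV <= 0.1
                continue
        keep.append(col)
    return df[keep]

df = pd.DataFrame({"age": [40, 51, 63, 58], "sex": ["F", "F", "F", "F"],
                   "alt": [30, 31, 30, 30]})
print(screen_variables(df).columns.tolist())  # ['age']: 'sex' fails rule 2, 'alt' fails rule 4
```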
Usually, physicians prefer white-box algorithms for supporting their clinical decisions. Of the above algorithms, decision trees, Bayesian Networks and Genetic Programming are white-box. Decision trees can also be produced using Genetic Programming. Genetic Programming has the unique capability of automatically producing mathematical models. Unfortunately, physicians are not very familiar with mathematical models, and they will be inclined to favor decision trees. Mathematical models are preferred when they are more accurate than the other white-box algorithms and also simple. Simplicity can be partially controlled by properly restricting the function set to the simplest one reaching the desired performance. As decision trees, we have chosen the C5.0 algorithm, the latest version of the C4.5 algorithm [6], with 10-fold cross-validation. As ensemble method, we used Freund and Schapire's boosting [7] for improving the predictive power of C5.0 classifier learning systems: a set of C5.0 classifiers is produced and combined by voting, adjusting the weights of the training cases. We suggest that boosting should always be tried when peak predictive accuracy is required, especially when the unboosted classifiers are already quite accurate. We also used a linear version of steady-state genetic programming proposed by Banzhaf (see [8] for a detailed introduction and the literature cited there). In linear genetic programming the individuals are computer programs represented as a sequence of instructions from an imperative programming language or machine language; Nordin introduced the use of machine code in this context (cited in [8]). The major preparatory steps for GP consist in determining: (1) the set of terminals (see Table 1), (2) the set of functions (see Table 1), (3) the fitness measure, (4) the parameters for the run (see Table 1), (5) the method for designating a result, and (6) the criterion for terminating a run. The function set, also called the instruction set in linear GP, can be composed of standard arithmetic or programming operations, standard mathematical functions, logical functions, or domain-specific functions. The terminals are the attributes and parameters.
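C5.0 itself is a proprietary algorithm, but the combination described here (a decision-tree learner, boosting by re-weighting, and 10-fold cross-validation) can be approximated with freely available components. The sketch below uses CART plus AdaBoost from scikit-learn as a stand-in for boosted C5.0, on synthetic placeholder data rather than patient records.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(700, 20))                    # placeholder for ~700 patients, 20 features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # placeholder binary fibrosis label

# Boosted decision trees (a stand-in for boosted C5.0), scored with 10-fold CV
model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                           n_estimators=25, random_state=0)
scores = cross_val_score(model, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean(), scores.std())
```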
Table 1. Genetic Programming Parameters
Parameter                                  Setting
Population size                            500
Mutation frequency                         95%
Block mutation rate                        30%
Instruction mutation rate                  30%
Instruction data mutation rate             40%
Crossover frequency                        50%
Homologous crossover                       95%
Program size                               80-128
Demes:
  Crossover between demes                  0%
  Number of demes                          10
  Migration rate                           1%
Dynamic Subset Selection:
  Target subset size                       50
  Selection by age                         50%
  Selection by difficulty                  50%
  Stochastic selection                     0%
  Frequency (in generation equivalents)    1
Function set                               {+, -, *, /}
Terminal set                               64 = j + k (j constants, k inputs)
Because we are reporting work in progress, we will only mention some of the main features of our experiments; a detailed description is in preparation. The aforementioned criteria for feature selection were passed by about 40 features. Usually, removing the patients with missing values of the selected features is a conservative and robust practice. The problem is that the disproportion between the number of features and the number of records increases. We experimented with 40 features and a small number of patients, due to the removal of patients with missing values for some of these features, and the accuracy was about 80%. With 25 features the accuracy increased significantly, but it started to decrease again when the number of features was reduced to 20, 15, and 10, respectively. The rationale for these experiments is that we want to end up with a number of features small enough for the method to be clinically feasible, but large enough for an acceptable accuracy. Both the decision tree and the genetic programming algorithms performed their own feature selection, but they needed enough features to do it properly. As previously mentioned, we are preparing a detailed manuscript with these results, but here we give a short description of the results of applying GP to 20 features. We used a commercial GP software package - Discipulus, by RML Technologies, Inc. - and the main parameter settings of the GP algorithm are presented in Table 1. This software package has the facility of building teams of models. The experiments were performed on a Lenovo ThinkStation, Intel Xeon E5450 CPU, 2 processors at 3.00 GHz, 32 GB RAM, Windows Vista 64-bit OS.
3 Results
We started with a dataset of 700 patients and more than 100 inputs. The accuracy of the first experiments, using the aforementioned algorithms with default settings and without careful data preprocessing, was about 60%. Preprocessing increased the accuracy by 20% to 25% for most of the algorithms. C5.0 accuracy was one of the highest, about 80%. Parameter tuning and boosting increased the accuracy of some i-CDSS even to 100% [9], [10], [11]. The GP results were similar, and building teams of models increased the accuracy to about 91%-92%. The GP results are preliminary, but for some reasons (not presented here) we believe that this accuracy is more robust against the subjectivity of ultrasound image interpretation. Moreover, these results are not only in the acceptable accuracy range, starting from about 80%, but they also outperform the two best methods already in use in clinical practice, FibroTest and FibroScan [4]. We obtained similar results for prostate cancer, predicting the Gleason score with about 92% accuracy (manuscript in preparation). We also built the i-CDSS for Interferon treatment [12], [13], [14]. This non-invasive i-CDSS is of a special kind, being able to adapt: by attempting to predict the progressively accumulating results of the Interferon treatment, it identifies in time the proper patient selection criteria, and their cutoff values, from data. Thus, the rule set of this i-CDSS is evolving. We tried to develop not only the technical foundation of the intelligent evolving CDSS, but also the related concepts. A central one is i-Biopsy, which is an i-CDSS capable of predicting, with an acceptable accuracy (e.g., at least 80%), the results usually given by a pathologist examining the tissue samples from biopsies, expressed as scores of a largely accepted scoring system. To do this, it takes as inputs, and integrates, various routine, non-invasive, clinical, imaging and lab data. Also, to distinguish between the scores of the real biopsy and their counterparts predicted by i-Biopsy, we proposed the general term of i-scores. For example, in the gastroenterological applications, we have:
1. The liver i-Biopsy is the i-CDSS corresponding to the real liver biopsy; the i-Metavir F scores are the values predicted by i-Biopsy for the Metavir F fibrosis scores, designating exactly the same pathological features.
2. The i-Metavir F scores and the biopsy Metavir F scores could have different values for the same patient at the same moment, depending, for example, on the prediction accuracy.
3. i-Metavir F scores are obtained in a non-invasive, painless, and riskless manner, as opposed to Metavir F scores, which are assessed by liver biopsy.
For simplicity, we referred only to the Metavir F scores, but these considerations are quite general and can easily be extrapolated to other liver scores, e.g., Ishak F, Metavir A, and Ishak NI. We also developed i-Biopsy as a non-invasive i-CDSS counterpart of prostate biopsy, and we proposed the i-Gleason score (work in progress). While we built i-CDSS with accuracy reaching even 100%, the GP results proved to be robust, showing a constant accuracy of about 92% for different lots of patients and medical teams.
For amniocentesis and thyroid cancer we have encouraging preliminary results too (not shown). We have built the following i-CDSS modules, which can be used for Interferon treatment decision support:
1. a module for liver fibrosis prediction, according to either the Metavir F or the Ishak F scoring system, with and without liver stiffness (FibroScan);
2. a module for the prediction of the grade of necroinflammation (activity), according to the Ishak NI scoring system.
4 Discussions
A short digression about the meaning of the diagnosis accuracy of i-CDSS in general, and of i-Biopsy in particular, seems to be necessary, because it has confused many physicians, especially when very high values like 100% are reported. Typically, physicians believe that 100% accuracy is not possible in medicine. The meaning will be made clear by means of examples. Typically, an invasive liver biopsy is performed on the patient, and a pathologist analyzes the tissue samples, assessing fibrosis, necroinflammation, etc., expressed as scores. The pathologist may have access to other medical data of the patient, but usually these are not necessary for the pathological diagnosis. Moreover, in some studies it is required that the pathologist knows nothing about the patient. His or her diagnosis can be more or less correct, or even wrong, for many reasons not discussed here. We have proposed i-CDSS predicting the fibrosis scores resulting from liver biopsy with accuracy reaching even 100% for a number of systems. On the contrary, for the i-CDSS a number of clinical, imaging, and lab data of the patient are essential, because they were incorporated in the system: they were used as input features to train the system, and they are required for a new, unseen patient, because the i-Biopsy is in fact a relationship between these inputs and the fibrosis or necroinflammation scores as outputs. The category of i-CDSS discussed here does not deal directly with diagnosis correctness, but with diagnosis prediction accuracy. Without going into details, this is due in part to the supervised nature of the learning methods used to build them. The intelligent agents learned to predict the results of the biopsy given by the pathologist, and the pathologist's diagnosis could be more or less correct. For example, suppose that the pathologist's diagnosis is wrong: the i-Biopsy could still be 100% accurate in predicting this wrong diagnosis, but this is rarely the case. In other words, the i-Biopsy will predict, in a non-invasive and painless way, and without the risks of the biopsy, a diagnosis which could be even 100% identical to the pathologist's diagnosis if the biopsy were performed. While the accuracy and the correctness of the diagnosis are related in a subtle way, they are different matters. i-Biopsy will use the information content of several non-invasive investigations to predict the pathologist's diagnosis, without performing the biopsy. The correctness of the diagnosis is a different matter, but typically a good accuracy correlates well with a correct diagnosis. The accuracy of the diagnosis, as well as other performance measures like the area under the receiver operating characteristic curve (AUROC) for a binary classifier system [13], is useful for comparing intelligent systems.
From the point of view of accuracy, one of the most important medical criteria, to our knowledge the proposed liver i-Biopsy system outperformed the most popular and accurate systems, FibroTest and ActiTest [14], commercialized by the BioPredictive company, and FibroScan. The liver i-Biopsy is a multi-class classifier, expressing the results in the pathologist's scoring systems, e.g., five classes for Metavir F and seven classes for Ishak F. Multi-class classifiers are more difficult to develop than binary classifiers, whose outputs are not directly related to the fibrosis scores. We also built binary classifiers as decision trees with similar accuracy, and mathematical models (work in progress). Although AUROC is defined only for binary classifiers, loosely speaking an n-class classifier with 100% accuracy is equivalent to n binary classifiers with AUROC = 1 (maximal). The BioPredictive company analyzed a total of 30 studies which pooled 6,378 subjects with both FibroTest and biopsy (3,501 with chronic hepatitis C). The mean standardized AUROC was 0.85 (0.82-0.87). The robustness of these results is clearly demonstrated by this cross-validation, while the i-Biopsy results still need to be cross-validated. The fact that i-Biopsy, in its actual setting, relies on routine ultrasound features is both a strong point and a weak one, because of the subjectiveness of ultrasound image interpretation. It is worth noting that in certain circumstances the result of the liver i-Biopsy could be superior to that of the real biopsy. When building the i-CDSS, the results of the potentially erroneous biopsies, which did not fulfil some technical requirements, were eliminated from the data set. Thus, the i-Biopsy predicted results correspond only to the results of the correctly performed biopsies, while a number of the real biopsy results are wrong, because they were not correctly performed. Due to the invasive and unpleasant nature of the biopsy, it is very improbable that a patient will accept a technically incorrect biopsy being repeated. Unlike the real biopsy, i-Biopsy can be used to evaluate the fibrosis evolution, which is of interest in various biomedical and pharmaceutical studies, because, being non-invasive, painless and without any risk, it can be repeated as many times as needed. Also, in the early stages of liver diseases the symptoms are often not really harmful to the patient, but the treatment is more effective than in more advanced fibrosis stages. The physician will hesitate to indicate an invasive, painful and risky liver biopsy, and the patients are not as worried about their disease as they are about the pain of the biopsy. However, i-Biopsy can be performed, and an early start of the treatment could be much more effective.
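For completeness, the two performance measures discussed in this section (accuracy for multi-class score prediction, AUROC for binary classifiers) can be computed as follows; the labels below are toy values, not patient data.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Multi-class case: predicted vs. pathologist Metavir F scores (toy values)
y_true = [0, 1, 2, 3, 4, 2, 1]
y_pred = [0, 1, 2, 3, 4, 2, 0]
print(accuracy_score(y_true, y_pred))    # fraction of exactly matched scores

# Binary case: AUROC from predicted probabilities (toy values)
y_bin = [0, 0, 1, 1, 1, 0]
p_hat = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3]
print(roc_auc_score(y_bin, p_hat))       # 1.0 means perfect ranking of cases
```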
References 1. Berner, E.S. (ed.): Clinical Decision Support Systems. Springer, New York (2007) 2. Lindor, A.: The role of ultrasonography and automatic-needle biopsy in outpatient percutaneous liver biopsy. Hepatology 23, 1079–1083 (1996) 3. Tobkes, A., Nord, H.J.: Liver biopsy: Review of methodology and complications. Digestive Disorders 13, 267–274 (1995)
4. Shaheen, A.A., Wan, A.F., Myers, R.P.: FibroTest and Fibroscan for the prediction of hepatitis C-related fibrosis: a systematic review of diagnostic test accuracy. Am. J. Gastroenterol. 102(11), 2589–2600 (2007) 5. Guyon, I., et al.: Feature Extraction: Foundations and Applications. In: Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2006) 6. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993) 7. Freund, Y., Schapire, R.E.: A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 8. Brameier, M., Banzhaf, W.: Linear Genetic Programming. Springer, Heidelberg (2007) 9. Floares, A.G., et al.: Toward Intelligent Virtual Biopsy: Using Artificial Intelligence to Predict Fibrosis Stage in Chronic Hepatitis C Patients without Biopsy. Journal of Hepatology 48(2) (2008) 10. Floares, A.G.: Liver Intelligent Virtual Biopsy and the Intelligent METAVIR and Ishak Fibrosis Scores. In: Proceedings of the Computational Intelligence in Bioinformatics and Biostatistics, Vietri Sul Mare, Italy (2008) 11. Floares, A.G., et al.: Intelligent virtual biopsy can predict fibrosis stage in chronic hepatitis C, combining ultrasonographic and laboratory parameters, with 100% accuracy. In: Proceedings of The XXth Congress of European Federation of Societies for Ultrasound in Medicine and Biology (2008) 12. Floares, A.G.: Artificial Intelligence Support for Interferon Treatment Decision in Chronic Hepatitis B. In: International Conference on Medical Informatics and Biomedical Engineering; WASET - World Academy of Science, Engineering and Technology, Venice, Italy (2008) 13. Floares, A.G.: Intelligent Systems for Interferon Treatment Decision Support in Chronic Hepatitis C Based on i-Biopsy. In: Proceedings of Intelligent Data Analysis in Biomedicine and Pharmacology, Artificial Intelligence in Medicine, Washington DC (2008) 14. Floares, A.G.: Intelligent clinical decision supports for Interferon treatment in chronic hepatitis C and B based on i-Biopsy. In: Proceedings of the International Joint Conference on Neural Networks, Atlanta, Georgia, USA (2009)
Different Methodologies for Patient Stratification Using Survival Data

Ana S. Fernandes2, Davide Bacciu3, Ian H. Jarman1, Terence A. Etchells1, José M. Fonseca2, and Paulo J.G. Lisboa1

1 School of Computing and Mathematical Sciences, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, UK {T.A.Etchells,I.H.Jarman,P.J.Lisboa}@ljmu.ac.uk
2 Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa {asff,jmf}@uninova.pt
3 Dipartimento di Informatica, Università di Pisa {bacciu}@di.unipi.it
Abstract. Clinical characterization of breast cancer patients in terms of their risk and profiles is an important part of making correct prognostic assessments. This paper first proposes a prognostic index obtained by applying a flexible non-linear time-to-event model, and compares it to a widely used linear survival estimator. This index underpins different stratification methodologies, including informed clustering utilising the principle of learning metrics, regression trees, and recursive application of the log-rank test. The missing data issue was overcome using multiple imputation, which was applied to a neural network model of survival fitted to a data set for breast cancer (n=743). It was found that the three methodologies broadly agree, but with important differences. Keywords: Prognostic risk index, patient stratification.
1 Introduction

Clinical oncologists are very interested in making prognostic assessments of patients with operable breast cancer, in order to better tailor the treatments and to better assess the impact of prognostic factors on survival. It is therefore necessary to identify patients with higher and lower survival risk, as treatment may vary according to the survival behaviour. However, this risk is only meaningful if stratified with different thresholds, depending not only on a prognostic index but also on the prognostic factors. Based on survival models, the most widely used test statistic to stratify prognostic indices in the presence of censored data is the log-rank test. However, this statistic finds the different patient risk groups by thresholding only the Prognostic Index (PI), making the assumption that the threshold separates distinct patient populations, while in practice it may be cutting across a single patient population. It would be desirable to stratify by identifying distinct patient populations directly from the prognostic factors.
However, traditional clustering methods will often return clusters that have little specificity for outcome differences. This paper presents a comparison between three stratification methodologies. The first is a log-rank bootstrap aggregation methodology, which uses the log-rank statistic at its core but carries out bootstrap re-sampling of the population of prognostic indices in order to gain robustness over a maximum-significance search. The second methodology is based on regression trees, applied to the continuous-valued prognostic scores. The third methodology uses informed clustering utilising the principle of learning metrics: more specifically, it estimates a local metric based on the Fisher information matrix of the patients' distribution with respect to the PI. Cluster number estimation is performed in the Fisher-induced affine space, stratifying the patients into groups characterized by significantly different survival. Two different survival models were considered: PLANN-ARD, a neural network for time-to-event data, and Cox proportional hazards, such that each model provides an independent prognostic index. It is important to note that survival models must take account of censorship, which occurs when a patient drops out of the study before the event of interest is observed, or if the event of interest does not take place until the end of the study. All patients in this study were censored after 5 years of follow-up. Real-world clinical data, especially when routinely acquired, are likely to have missing data. Previous research [3] on this dataset showed the information to be missing at random (MAR); hence, it can be successfully imputed. The application of multiple imputation in combination with neural network models for time-to-event modelling takes account of missing data and censorship within principled frameworks [3]. Section 2 gives a description of the data set used to train the model and defines the predictive variables chosen for the analysis. Sections 3 and 4 present the prognostic models and the methodologies used for patient stratification into different survival groups, followed by the conclusions of the analysis.
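As a point of reference for the log-rank statistic mentioned above, the test comparing two candidate risk groups can be run with the lifelines package; the group split and follow-up times below are toy values, and the cap at 60 months mirrors the 5-year censoring used in this study.

```python
from lifelines.statistics import logrank_test

# Toy survival times in months (capped at 60, i.e. censored at 5 years) and event flags
low_risk_t  = [60, 58, 60, 45, 60, 60];  low_risk_e  = [0, 1, 0, 1, 0, 0]
high_risk_t = [12, 30, 25, 60, 18, 40];  high_risk_e = [1, 1, 1, 0, 1, 1]

result = logrank_test(low_risk_t, high_risk_t,
                      event_observed_A=low_risk_e, event_observed_B=high_risk_e)
print(result.p_value)   # a small p-value suggests the two strata have different survival
```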
2 Data Description and Model Selection The data set consists of 931 routinely acquired clinical records for female patients recruited by Christie Hospital in Wilmslow, Manchester, UK, during 1990-94. The current study only comprises patients with early or operable breast cancer, whose cases are filtered using the standard TNM (Tumour, Nodes, Metastasis) staging system as tumour size less than 5 cm, node stage less than 2 and without clinical symptoms of metastatic spread. Two of the 931 records in the training data were identified as outliers and removed. Time-to-event was measured in months from surgery with a follow-up of 5 years, and the event of interest was death from any cause. 16 explanatory variables in addition to outcome variables were acquired for all patient records. All covariates and their attributes are listed in Table 1. The category “others” in the “Histological type” variable also includes patients with “in situ” tumour. These records should not be included in the data set because this category of patients has a different disease dynamic, so this category was removed from the data set. The present study therefore focuses only on the lobular and ductal histological types. The final data set comprises 743 subjects.
Missing data is a common problem in prediction research. Different causes may lead to missing data, and it is important to consider them carefully since approaches to handling missing data rely on assumptions about the causes. After analysing the data, the information was considered to be Missing at Random (MAR). Here, missingness is related to known patient characteristics, that is, the probability of a missing value on a predictor is independent of the values of the predictor itself but depends on the observed values of other variables. Given this, a new attribute may be created to denote missing information, or the missing values can be imputed. The latter has been shown to be effective [1,2] and does not make assumptions about the distribution of “missingness” in the training data, which is essential for inference on future patients. Therefore, the missing covariates were imputed following the method indicated in [1,2], and the imputation was repeated 10 times. This number is a conservative choice, as several studies have shown that the required number of repeated imputations can be as low as three for data with up to 20% missing information. Model selection was carried out through Cox regression (proportional hazards) [3], where six predictive variables were identified: age at diagnosis, node stage, histological type, ratio of axillary nodes affected to axillary nodes removed, pathological size (i.e. tumour size in cm) and oestrogen receptor count. All of these variables are binary coded as 1-from-N.
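As an illustration only, repeated imputation can be sketched with off-the-shelf tooling. The study follows the imputation procedure of [1,2]; the function name, array layout and the use of scikit-learn's IterativeImputer below are assumptions made for this sketch, not the authors' implementation.

```python
# Hypothetical sketch: 10 repeated imputations of a covariate matrix X
# (rows = patients, columns = covariates, NaN = missing values).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_repeatedly(X, n_imputations=10):
    """Return a list of completed copies of X, one per imputation."""
    imputed_sets = []
    for seed in range(n_imputations):
        # sample_posterior=True draws imputed values rather than using point estimates,
        # which is what makes repeated imputations differ from one another
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        imputed_sets.append(imputer.fit_transform(X))
    return imputed_sets
```

Each completed data set would then be modelled separately, and the resulting estimates combined, as described in the following section.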
3 Prognostic Modeling for Breast Cancer Patients In clinical practice, and more specifically for breast cancer patients, it is common to define prognostic indices that rank patient data by severity of the illness. In order to determine the prognostic index for each patient it is necessary to define a prognostic model. Two analytical models have been used to fit the data set to a prognostic index: a piecewise linear model, Cox regression (also known as proportional hazards), and a flexible model consisting of Partial Logistic Artificial Neural Networks regularised with Automatic Relevance Determination (PLANN-ARD). Cox regression factorises the dependence on time and covariates, where the hazard rate is modelled for each patient with covariates x_p at time t_k as follows:
h(x_p, t_k) / (1 − h(x_p, t_k)) = [h_0(t_k) / (1 − h_0(t_k))] · exp( ∑_{i=1}^{N_i} b_i x_i )    (1)
where h0 represents the empirical hazard for a reference population and xi are the patient variables. Using this model, the prognostic index is defined by the traditional linear index βx. As in this study, missing data was imputed 10 times, there are 10 different data sets to be used with such model, which means that the final patient prognostic index was determined as the mean of 10 prognostic indices. PLANN-ARD has the structure of a multi-layer perceptron with a single hidden layer and sigmoidal activations in the hidden and output layer nodes. Covariates and
discrete monthly time increments are introduced into the network as inputs, and the output is the hazard for each patient at each time. The objective function is the log-likelihood summed over the observed status of the patients, with a binary indicator of whether a patient is observed alive or has died. Flexible models have the potential to model arbitrary continuous functions and therefore need to be regularized in order to avoid overfitting. For this purpose, we have applied Automatic Relevance Determination (ARD) regularization [4].

Table 1. Variable description, existing values and coding for each category for the Christie Hospital data set
Variable description | Values | Categories
Age category | 1;2;3 | 20-39; 40-59; 60+
Histological type | 1;2;3 | Invasive ductal; invasive lobular/lobular in situ; in situ/mixed/medullary/mucoid/papillary/tubular/other mixed in situ
Menopausal status | 1;2;3; 9 | Pre-menopausal; peri-menopausal; post-menopausal; missing
Histological grade | 1;2;3; 9 | Well differentiated; moderately differentiated; poorly differentiated; missing
Nodes involved | 1;2;3;4; 9 | 0; 1-3; 4+; 98 (too many to count); missing
Nodes removed | 1;2;3;4; 9 | 0-9; 10-19; 20+; 98 (too many to count); missing
Nodes ratio | 1;2;3;4; 9 | 0-20%; 21-40%; 41-60%; 61+%; missing
Pathological size | 1;2 | <2 cm; 2-5 cm
Oestrogen receptors | 1;2; 9 | 0-10; 10+; missing (Christie 51%; BCCA 19%)
Metastasis stage | 1 | M0 (no distant metastasis)
Predominant site | 1;2;3;4;5; 9 | Upper outer; lower outer; upper inner; lower inner; subareolar; missing
Side | 1;2 | Right; left
Maximum diameter of tumour | 1;2;3; 9 | <2 cm; 2-5 cm; 5+ cm; missing
Clinical stage tumour | 1;2 | T1 (tumour <2 cm); T2 (2-5 cm)
Clinical stage nodes | 0;1 | N0 (no nodes found clinically or node negative by histological type); N1 (ipsilateral and mobile axillary nodes)
ARD regularization has the effect of suppressing covariates that have little predictive influence on outcome. This has the fundamental advantage of automatically adjusting the effect of each covariate according to its relevance to the model, protecting against over-fitting of the data without requiring hard model selection. The papers cited in [4] establish a strict theoretical correspondence between this neural network model and classical statistical time-to-event models for censored data. An appropriate prognostic index for non-linear models such as PLANN-ARD is a natural extension of the βx approach described previously, that is
PI(x_p) = ln(− ln(1 − CCI(t))) = ln(− ln(S(t)))    (2)
where CCI is the crude cumulative incidence, identified as the probability of the occurrence of a specific event of interest [5], and S(t) is the estimated survival at the end of follow-up, i.e. at t = 60 months in this study. The estimated survival follows
S(t_k) = ∏_{l=1}^{k} (1 − h(t_l))    (3)
The above assumes that a single estimate of the hazard is available for each patient and time period. As a consequence of imputation, PLANN-ARD was run once for each of the 10 imputed data sets, which resulted in 10 different trained networks. The best estimate of the hazard is obtained by empirically averaging the hazard estimates from each imputed data set, namely
h(x_p, t_l) = ∫ h(x_p, t_l | μ) P(μ) dμ ≈ (1/T) ∑_{i=1}^{10} h(x_p, t_l | μ_i)    (4)
where, in the continuous expression, μ represents a Lebesgue measure for the distribution of the imputed data and, in the discrete version, μ_i denotes a particular imputed sample. The time-independent prognostic index mentioned above was then computed from the hazard and survival estimates as described previously. Once the PI has been defined, independently of the chosen modelling method, a stratification methodology needs to be identified so that patients can be separated into statistically significant risk groups by overall mortality.
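For concreteness, the computation in (2)-(4) can be sketched as follows; the array shapes and names are assumptions made for illustration (10 imputation-specific networks, each returning a 60-month vector of discrete hazards per patient), not the authors' code.

```python
# Minimal sketch of equations (2)-(4): average the monthly hazards over the
# 10 imputation-specific models, form the survival estimate at 60 months,
# and take the log(-log) transform as the time-independent prognostic index.
import numpy as np

def prognostic_index(hazards):
    """hazards: array of shape (n_imputations, n_patients, n_months), values in [0, 1)."""
    mean_hazard = hazards.mean(axis=0)                 # eq. (4): empirical average over imputations
    survival_60 = np.prod(1.0 - mean_hazard, axis=1)   # eq. (3): S(t_k) = prod_l (1 - h(t_l))
    return np.log(-np.log(survival_60))                # eq. (2): PI = ln(-ln S(60))
```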
4 Stratification Methodologies There are a variety of parametric and non-parametric methods for comparing distributions in the complete data, but there are fewer options for comparing two survival
distributions in the presence of censored data. For this kind of data, the most popular test choice is the log-rank test. The prognostic index obtained for each patient using a given prognostic model can be grouped into classes, so as to obtain different survival curves for each group. However, the definition of cohorts with significantly different survival may not be determined directly from a prognostic index. Consequently, four different stratification methodologies are compared: a robust bootstrap log-rank aggregation, a regression decision tree, informed clustering with a local Fisher metric, and clustering using the standard Euclidean metric. The bootstrap log-rank aggregation is based on the log-rank test. This well-known test statistic can be used to find the best cut-points where there are significant differences in the survival of the patients. However, the optimization of these cut-points results in an overestimation of the relative risk between the two prognostic groups. Moreover, it should be kept in mind that the obtained cut-points would be highly data dependent and that such values are expected to vary markedly between different data sets, which is not advisable for model validation. In order to diminish this overestimation of the cut-points, a bootstrap re-sampling technique can be used, applied to the prognostic indices as these are calculated from the original data set. Several studies use bootstrapping methods, as in [6,7], and an earlier study proposed the robust methodology for risk group allocation adopted here, which exploits bootstrap re-sampling in order to stabilize the distribution of risk groups predicted for each value of the risk score index [8]. The regression tree methodology was performed using the CART algorithm [7], with 6 predictive categorical variables and one continuous target variable, which is the prognostic index already obtained. It is important to mention that the predictor variables used in CART are the mode of the imputed data sets obtained previously. The minimum number of records in each tree node was set to 20, in order to define a certain significance level for the populations to be compared. The maximum span of the tree for the prognostic index calculated with Cox regression produced 51 leaves, while the tree for the PLANN-ARD prognostic index produced 48 leaves at most. After growing the tree, a “pruning method” was applied to combine pairs of leaves which do not have significantly different survival, calculating a new average of the PI for each merged leaf. The final leaves are consequently ordered by their new prognostic index average and, for each pair of leaves, the pairwise log-rank statistic is computed. This is repeated until all leaves have pairwise survival that differs significantly at the 95% level. Both trees ended with 4 different survival groups, which means that the regression trees, after pruning, can be considered as classification trees. The Learning Metrics model [9] provides a means of learning informed distance functions ║·║J(x) by exploiting prior information concerning the distribution of the samples with respect to some auxiliary information. Such auxiliary information is typically modeled as a random variable c that is bound to the input samples x by a conditional distribution P(c|x), providing information regarding relevant aspects of the data.
The learning metrics, then, measures distances in terms of changes in the distribution P(c|x) as x varies; such changes can be measured by the local KullbackLeibler divergence as
D( P(c|x) ∥ P(c|x + dx) ) = dx^T J(x) dx    (5)
where J(x) is the Fisher information matrix, that is
J(x) = E_{P(c|x)} [ (∂ log P(c|x) / ∂x)^T (∂ log P(c|x) / ∂x) ]    (6)
where EP(c|x) is the expectation over c. The tensor J(x) is a positive semidefinite function defining a local scaling of the input space at the point x. The new metric that is used to cluster the data in place of the Euclidean distance is as follows
d²(x, m_i) = (x − m_i)^T J(x) (x − m_i)    (7)
where mi is the prototype of the i-th cluster and J(x) is the Fisher matrix at the point x. For the purpose of this paper, we derive the informed metric based on the Fisher information matrix of the conditional distribution of the prognostic indices obtained by survival analysis. In particular, we consider a set of samples (i.e. the patient profiles) x ∈ X that are associated to two independent auxiliary variables PICox and PIPLANNARD, that are the prognostic indices defined in Section 3, through the respective conditional probabilities P(PICox|x) and P(PIPLANNARD|x). In addition, we also consider the joint conditional probability of the two independent prognostic indices, that is P(PICox , PIPLANNARD|x). To obtain an analytical formulation for the Fisher information matrix in our survival analysis scenario, we need to compute the derivative
∂ log(P(PI|x)) / ∂x for each of the three distributions of the prognostic indices. To do so, we consider PICox and PIPLANNARD to be Normally distributed as
P(PI|x) ~ exp{ −(CCX^T Σ^{-1} CCX) / 2 }    (8)
where CCX is a short form for
CCX = Bx + β_0 − μ    (9)
where B is the 1 × K vector (2 × K matrix for the joint distribution P(PICox, PIPLANNARD|x)) of the linear parameters of the Cox survival model. The term μ refers to the Normal expectation and x is the K × 1 sample vector. To calculate the Fisher matrix with (7) we need to compute

∂ log(P(PI|x)) / ∂x = ∂[ −(CCX^T Σ^{-1} CCX) / 2 ] / ∂x    (10)
Given that CCX writes as in (9), we can solve (10) using the product rule, yielding
∂ log(P(PI|x)) / ∂x = −(Bx + β_0 − μ)^T Σ^{-1} B    (11)
By inserting this result in (7), we obtain the Fisher distance for the prognostic indices. To complete the derivation of the model, we need to fit the linear parameters B and β_0 to the values of the prognostic indices PI predicted by the Cox and PLANN-ARD models for each sample x. This entails solving a linear system with respect to [B; β_0], which can be straightforwardly done by using the pseudo-inverse matrix (see [10] for details). Once the Fisher distance has been estimated, it can be embedded within the clustering algorithm at the stage where unit activation is computed. Following the approach in [10], we focus on the CoRe clustering model [11], an algorithm that performs cluster number identification by exploiting an information compression mechanism of the visual cortex, named repetition suppression. Starting from an initial overestimation of the actual cluster number, the CoRe algorithm iteratively suppresses neurons whenever they fire unselectively for the input patterns, eventually pruning unselective units from the network. The neurons retained at the end of the learning phase encode the clusters found by CoRe within the input data, hence the estimated cluster number is equal to the network size at convergence. The response of CoRe units is determined by a multivariate Gaussian: to embed the Fisher metric within such units, we consider the following activation function

φ_i(x, m_i, Σ_i) = exp{ −(1/2) [R(x)(x − m_i)]^T Σ_i^{-T} Σ_i^{-1} [R(x)(x − m_i)] }    (12)
where R(x) is the right Cholesky decomposition of the Fisher matrix J(x). As regards learning, the original CoRe algorithm updates unit means and variances in the direction given by the gradients (∂φ_i/∂m_i) and (∂φ_i/∂Σ_i). Since the Fisher metric is a Riemannian metric, such steepest descent can be computed by means of the natural gradient [12]. The application of this rule to the prototype vectors m_i yields the same update rules as for the Euclidean case; on the other hand, the learning equations for the scale matrix Σ_i need to be slightly modified to include the contribution from the Fisher matrix J(x): see [10] for the details of the learning equations. In the experimental phase, the CoRe algorithm with the Fisher metric has been applied to discover clusters within the patient population, by exploiting the information from the distribution of the prognostic indices PICox and PIPLANNARD in isolation and jointly. The simulation setup is the same as described in [10]: the CoRe network has been initialized with 30 units and the cluster number estimates are based on 50 repeated runs of the algorithm, with random initial prototype assignments.
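A minimal numerical sketch of the Fisher-informed distance is given below. It assumes that, for the Gaussian formulation of (8)-(11) with a mean linear in x, the Fisher matrix reduces to J = B^T Σ^{-1} B (constant in x); the function and variable names and the least-squares fit via the pseudo-inverse are illustrative assumptions, not the exact implementation of [10].

```python
# Illustrative sketch: fit the linear map PI ~ B x + b0 by least squares
# (pseudo-inverse), form the Fisher matrix J = B^T S^-1 B of the Gaussian model
# in (8)-(11), and use it in the informed distance of eq. (7).
import numpy as np

def fit_linear_map(X, PI):
    """X: (n, K) patient profiles; PI: (n, m) prognostic indices (m = 1 or 2)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])       # append a column for the offset b0
    W = np.linalg.pinv(X1) @ PI                         # stacked [B^T; b0^T], shape (K+1, m)
    B = W[:-1].T                                        # (m, K) linear parameters
    residuals = PI - X1 @ W
    S = np.atleast_2d(np.cov(residuals, rowvar=False))  # (m, m) residual covariance
    return B, S

def fisher_distance_sq(x, m_i, B, S):
    """Squared informed distance d^2(x, m_i) = (x - m_i)^T J (x - m_i)."""
    J = B.T @ np.linalg.inv(S) @ B                      # (K, K); constant because the mean is linear in x
    diff = x - m_i
    return float(diff @ J @ diff)
```

This distance can then replace the Euclidean distance wherever the clustering algorithm computes unit activations, as described in the text.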
5 Results Obtained with the Different Stratification Approaches The prognostic indices PICox and PIPLANNARD, together with the mode of the 6 most predictive variables across the 10 imputed data sets, were used for the different stratification methodologies. Using both the bootstrap log-rank aggregation and the regression tree method, 4 different risk groups were found for both prognostic indices; their KM curves are shown in Fig. 1 and Fig. 2.
Fig. 1. Actuarial estimates of survival obtained with the Kaplan-Meier method, stratified over a 60 month period using the regression tree method. The upper picture displays the results using PICox and the bottom pictures represent the survival for PIPLANNARD.
Fig. 2. Actuarial estimates of survival obtained with the Kaplan-Meier method, stratified over a 60 month period using the bootstrap log-rank aggregation method. The upper picture displays the results using PICox and the bottom pictures represent the survival for PIPLANNARD.
Comparing the KM curves for PICox and PIPLANNARD, it can be concluded that, for both algorithms, survival is lower for the risk groups obtained using PICox. However, the risk group allocation obtained with PIPLANNARD is more conservative than with PICox, because patients are allocated to higher risk groups, as can be observed in Table 2. Although survival for both methods is similar, group membership is not the same, as can be observed in Table 3. Here, with the exception of the 4th risk group, the bootstrap log-rank aggregation is generally more conservative in terms of patients' risk group allocation than the regression tree method.
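The Kaplan-Meier curves and the pairwise group comparisons behind Figs. 1-2 and Tables 2-3 can be reproduced with standard survival-analysis tooling. The sketch below uses the lifelines package; the array names (durations in months, binary event indicators, integer group labels) are assumptions made for illustration.

```python
# Illustrative sketch: Kaplan-Meier curves per risk group and a pairwise
# log-rank test between two groups.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def plot_groups(durations, events, groups):
    kmf = KaplanMeierFitter()
    for g in np.unique(groups):
        mask = groups == g
        kmf.fit(durations[mask], event_observed=events[mask], label=f"risk group {g}")
        kmf.plot_survival_function()

def pairwise_logrank(durations, events, groups, g1, g2):
    a, b = groups == g1, groups == g2
    result = logrank_test(durations[a], durations[b],
                          event_observed_A=events[a], event_observed_B=events[b])
    return result.p_value
```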
Table 2. The top table gives the patients' cross-tabulation for the regression tree method and the bottom table the cross-tabulation for the bootstrap log-rank aggregation, using the PI obtained with Cox (columns) and PLANN-ARD (rows)

Regression tree method
              Cox 1   Cox 2   Cox 3   Cox 4   Total
PLANN-ARD 1     280      13       0       0     293
PLANN-ARD 2      12     235       0       0     247
PLANN-ARD 3       0      85      19       0     104
PLANN-ARD 4       1       3      58      37      99
Total           293     336      77      37     743

Bootstrap log-rank aggregation
              Cox 1   Cox 2   Cox 3   Cox 4   Total
PLANN-ARD 1     322       3       0       0     325
PLANN-ARD 2      43     137       1       0     181
PLANN-ARD 3       0      33     116       0     149
PLANN-ARD 4       0       0      49      39      88
Total           365     173     166      39     743

Table 3. The top table gives the patients' cross-tabulation for PICox and the bottom table the cross-tabulation for PIPLANNARD, comparing the regression tree (columns) and bootstrap log-rank aggregation (rows) groupings

PICox
                 Regr. tree 1   Regr. tree 2   Regr. tree 3   Regr. tree 4   Total
Bootstrap 1               291             73              1              0     365
Bootstrap 2                 2            170              1              0     173
Bootstrap 3                 0             92             69              5     166
Bootstrap 4                 0              1              6             32      39
Total                     293            336             77             37     743

PIPLANNARD
                 Regr. tree 1   Regr. tree 2   Regr. tree 3   Regr. tree 4   Total
Bootstrap 1               284             10              0              1     325
Bootstrap 2                 9            166              5              1     181
Bootstrap 3                 0             39             93             17     149
Bootstrap 4                 0              2              6             80      88
Total                     293            247            104             99     743
The learning metrics approach described in Section 4 has been applied to 3 distributions of auxiliary information, i.e. using PICox alone, using PIPLANNARD alone and on the joint information from the two independent indices. The first two experiments have predicted the cluster number to be either 6 or 7. By using the joint information,
Fig. 3. The top-most picture represents the actuarial estimates of survival obtained with the Kaplan-Meier method, stratified over a 60 month period and the bottom-most picture depicts the clustered samples in the space of the two prognostic indices
on the other hand, the algorithm stably identifies 5 clusters: Figure 3 shows the KM curves for the 5 patient groups as well as a 2D plot of the clustered samples in the space of the prognostic indices. The KM plot clearly shows that clusters 2 and 3 denote the same risk profile, while the remaining 3 clusters identify 3 markedly different survival behaviors. Figure 4 shows the clusters projected onto the 3 principal components of the original data and of the dataset subject to the affine transformations induced by the Fisher metric. In the original space,
Fig. 4. The top-most picture shows the clusters projected onto the 3 principal components of the original data and the bottom-most picture shows the affine transformed samples in the Fisher-induced space
cluster 3 is fully contained in a separated sample group, and such separation in the input space seems to be the cause of the generation of a separate cluster with the same survival profile as cluster 2, also in the Fisher space. Overall, the clustering analysis seems to confirm that there are 4 risk groups within the data, although the discrimination of the clustering methodology seems lower than for the bootstrap and regression tree algorithms.
6 Conclusions Perhaps surprisingly, the three risk allocation methodologies broadly agree, although they are founded on very different principles. However, at the level of detail there are
important differences. In particular, it is generally the case in breast cancer that the population of operable patients comprises a very well surviving group and another, thankfully much smaller, group with especially poor survival. Nevertheless, it is the accurate discrimination and grouping of patients in the mid-surviving groups that is of most interest, since these two groups of patients are those likely to benefit most from better targeting of therapy. The bootstrap log-rank method showed clearly the best discrimination in survival between the most and least surviving groups, and it is more conservative than the use of regression trees, compared to which it draws a substantial number of patients from group 2 into group 3. This effect is more pronounced when the linear survival estimator is used, in part reflecting the observation that the non-linear estimator, PLANN-ARD, is itself slightly more conservative than Cox regression with respect to these two risk groups. In contrast, informed clustering with the Fisher information matrix as a metric finds two different patient subgroups with similar disease progression and stratifies the patient population into distinct cohorts that show a progression in survival, also reflected by a localised distribution of risk scores estimated by either survival model. This approach has the merit of permitting a specific definition of the patient population, from which to forward predict grouped survival, instead of inferring a threshold back from the log-rank separation index. Further work will characterise each risk group in more detail to search for clear differentiators among the individual risk factors. Furthermore, the generalisation power of the group separation in multicentre studies will also be ascertained. Acknowledgments. The authors gratefully acknowledge Mr. R. Swindell from Christie Hospital for making the anonymised modelling data set available for this study and members of the BC Cancer Agency's Breast Cancer Outcomes Unit for assembling the British Columbia data. This study was partially funded by the European Network of Excellence Biopattern (FP6/2002/IST/1-508803) and by Fundação para a Ciência e Tecnologia, through the POS_C(SFRH/BD/30260/2006).
References 1. Clark, T.G., Altman, D.G.: Developing a prognostic model in the presence of missing data an ovarian cancer case study. Journal of clinical epidemiology 56, 28–37 (2003) 2. Van Buuren, S., Boshuizen, H.C., Knook, D.L.: Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18(6), 681–694 (1999) 3. Fernandes, A.S., Jarman, I.H., Etchells, T.A., Fonseca, J.M., Biganzoli, E., Bajdik, C., Lisboa, P.J.G.: Missing data imputation in longitudinal cohort studies – application of PLANN-ARD in breast cancer survival. In: Machine Learning and Applications, ICMLA’08, pp. 644–649 (2008) 4. Lisboa, P.J.G., Wong, H., Harris, P., Swindell, R.: A Bayesian neural network approach for modelling censored data with an application to prognosis after surgery for breast cancer. Artificial Intelligence in Medicine 28(1), 1–25 (2003) 5. Ambrogi, F., Biganzoli, E., Boracchi, P.: Estimates of clinically useful measures in competing risks survival analysis. Statistics in Medicine 27(30), 6407–6425 (2008)
6. Etchells, T.A., Fernandes, A.S., Jarman, I.H., Fonseca, J.M., Lisboa, P.J.G.: Stratification of severity of illness indices: a case study for breast cancer prognosis. In: Lovrek, I., Howlett, R.J., Jain, L.C. (eds.) KES 2008, Part II. LNCS (LNAI), vol. 5178, pp. 214–221. Springer, Heidelberg (2008) 7. Breiman, L., Friedman, J.H., Olsen, A.R., Stone, C.J.: Classification and Regression Trees. The Wadsworth & Brooks (1984) 8. Lisboa, P.J.G.: A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks, Invited Paper 15(1), 9–37 (2002) 9. Kaski, S., Sinkkonen, J., Peltonen, J.: Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks 12(4), 936–947 (2001) 10. Bacciu, D., Jarman, I.H., Etchells, T.A., Lisboa, P.J.G.: Patient Stratification with Competing Risks by Multivariate Fisher Distance. In: Int. Joint Conf. on Neural Networks (IJCNN’09), Atlanta, USA (2009) 11. Bacciu, D., Starita, A.: Competitive repetition suppression (CoRe) clustering: A biologically inspired learning model with application to robust clustering. IEEE Transactions on Neural Networks 19(11), 1922–1941 (2008) 12. Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
3-D Mouse Brain Model Reconstruction from a Sequence of 2-D Slices in Application to Allen Brain Atlas Anton Osokin1 , Dmitry Vetrov1 , and Dmitry Kropotov2
1 Moscow State University, Computational Mathematics and Cybernetics Department, 119992, Russia, Moscow, Leninskie Gory, 2-nd ed. bldg. [email protected], [email protected]
2 Dorodnicyn Computing Center of the Russian Academy of Sciences, 119333, Russia, Moscow, Vavilov str. 40 [email protected]
Abstract. The paper describes a method for fully automatic 3D reconstruction of a mouse brain from a sequence of histological coronal 2D slices. The model is constructed via non-linear transformations between neighboring slices and further morphing. We also use rigid-body transforms in the preprocessing stage to align the slices. Afterwards, the obtained 3D model is used to generate virtual 2D images of the brain in an arbitrary section-plane. We use this approach to construct a high-resolution anatomic 3D model of a mouse brain using the well-known, publicly available Allen Brain Atlas. Keywords: 3D-Reconstruction, neuroimaging, morphing, elastic deformations, image registration, B-splines.
1 Introduction
The problem of automatically annotating brain structures using only images of histological brain slices is very important in modern brain research [1]. Biologists are now able to monitor the activity of various genes throughout the brain. This is usually done in vitro, i.e. on dead animals. The extracted brain is frozen and then cut into slices. Each slice is then double-stained by the Nissl method to highlight histology and by a special stain which reveals the neurons expressing the corresponding genes. The main problem is to determine the brain structures where active genes are located. This problem is difficult even for experts, especially when slices are obtained using a non-standard section-plane. However, there are several atlases for various animals which contain both histological images and the corresponding images in which all brain structures are marked by an expert [2]. The intriguing problem is to use such atlases for constructing an annotated 2D image in an arbitrary section-plane. Although detailed histological atlases of mouse
brain are easy to find (see, e.g., [3,4]), the anatomic structure is either absent or not detailed. Perhaps the best mouse brain atlas with detailed manual annotation of anatomic structures is the well-known Allen Brain Atlas (ABA), which is available on the web [6]. Unfortunately, only each 5th slice is annotated, and if one constructs a 3D anatomic model directly, the result is of poor quality. In this paper we suggest an algorithm which is fully automatic and allows one to construct a histological 3D model and the corresponding anatomical 3D model using only those 20% of histological slices from the ABA that have manual anatomical annotation. Such a model can then be used to obtain 2D anatomic virtual images in various sections. To construct the 3D model we first perform rigid-body transforms to align the neighboring slices and eliminate contour fluctuations. Secondly, we find non-linear transforms which map each slice to the neighboring ones in the best way. The family of B-splines is used as a set of basis deformations. We use a morphing approach to compute virtual intermediate images between real slices. Our 3D model is not voxel-based. Instead we keep the initial slices and a set of B-spline coefficients. This allows us to save memory and at the same time provides high spatial resolution, which is important for obtaining virtual slices in different planes. When the 3D model is constructed we use it to synthesize a 2D virtual slice by setting an arbitrary section-plane. As we have used only annotated slices from the atlas, we may use the same transforms to calculate the anatomic structure for this virtual slice. To show the applicability of the suggested approach we compare the quality of virtual images of the brain in sagittal and axial sections to the ones produced from a histological 3D model built from the whole set of coronal ABA slices. The rest of the paper is organized as follows. In Section 2 we briefly characterize the particular brain atlas we use, the Allen Brain Atlas. Section 3 gives a list of steps for 3D modelling. We describe preprocessing of brain images in Sections 4 and 5 and non-linear interpolation of slices in Section 6. Section 7 contains some experimental results and in the last section we conclude.
2 Allen Brain Atlas
The Allen Brain Atlas [5,6] is a set of full-color, high-resolution coronal digital images (132 images) of mouse brain accompanied by a systematic, hierarchically organized taxonomy of mouse brain structures. The Allen Brain Atlas is obtained from an 8-week old C57Bl/6J male mouse brain prepared as unfixed, fresh-frozen tissue. The total number of coronal slices is 528 spaced at 20 μm, but only 132 of them spaced at 100 μm are annotated. We therefore construct the 3D model using only these 132 images, for which both histological and anatomical structures are known. In Fig. 1 the left half represents the histological structure of one mouse brain slice and the right half represents the structural color segmentation of the mouse brain. This segmentation was made by experts.
Fig. 1. Allen Brain Atlas image
3 Stages of Mouse Brain 3D Modelling
Here we consider the problem of 3D mouse brain model reconstruction using a set of coronal 2D slice images obtained from the Allen Brain Atlas. In the paper we propose to solve this problem in 3 steps:
1. Illumination correction. The illumination level differs between brain images, and even within one image there are areas with different illumination levels.
2. Proportional alignment. Due to technological aspects of the brain cutting procedure, some brain slices may change slightly in their actual size and shape. This deformation can be significant for automatic 3D model reconstruction. In fMRI imaging problems the alignment transformations are assumed to be either affine or rigid-body in 3D or 2D space [7]. In our case we limit the set of alignment transforms to vertical shifts and stretches only, exploiting the fact that the slices were already horizontally centered and aligned w.r.t. the symmetry line.
3. Non-linear transformation between neighboring slices. Such a transformation allows us to find the correspondence not only between the shapes of the slices but also between their internal structures. There is a variety of methods for non-linear deformations: parametric deformation models [7,8,9], enclosed dynamic programming [10], optical flows [11,14], level-sets [12,13], and free-form deformations [15,16]. In this paper we follow [9] and use the parametric approach based on B-spline basis functions. The choice of cubic B-splines as basis functions provides good deformation quality and high calculation speed because of the limited number of basis functions. It is also considered to be more robust, as only a subregion of the image influences the coefficient of each B-spline.
In the paper we provide methods for solving all three problems.
(a)
(b)
Fig. 2. Atlas image without illumination correction (a) and after illumination correction (b)
Although atlas images are given as positives we decided to work with negatives due to some implementation aspects. Hereinafter all further illustrations are given as negatives. Atlas image resolution is 5690 × 4418 pixels. In our implementation image resolution was reduced to 270 × 204 pixels because of high computational cost. Resolution decrease gives additional image smoothing as well.
4 Illumination Correction
For illumination correction we apply a Gauss filter with a large radius:

g(x, y) = 1/(2πσ²) · exp( −(x² + y²) / (2σ²) ),   −R ≤ x, y ≤ R.    (1)
We used σ = 20, R = 100. This operation gives an illumination map of the image, I_map. Afterwards the initial image is divided by the obtained illumination map:

I_new(i, j) = I(i, j) / (I_map(i, j) + 1).    (2)
This operation provides both equal local illumination within each image and equal illumination of different atlas images. Figure 2 illustrates this procedure.
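A minimal sketch of this correction, assuming a greyscale slice stored as a floating-point NumPy array, could be:

```python
# Sketch of the illumination correction of eqs. (1)-(2): a wide Gaussian blur
# estimates the illumination map, and the image is divided by (map + 1).
import numpy as np
from scipy.ndimage import gaussian_filter

def correct_illumination(image, sigma=20.0):
    """image: 2-D float array (a negative of the atlas slice)."""
    # kernel radius is roughly truncate * sigma = 100 px, mirroring R in eq. (1)
    illumination_map = gaussian_filter(image, sigma=sigma, truncate=5.0)
    return image / (illumination_map + 1.0)
```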
5 Alignment of Atlas Images
For alignment of atlas images we first find the bounding box for each slice. Then we consider the borders of bounding box as a function of slice number. Figure 3,a shows top and bottom borders of bounding boxes without alignment.
U D 0
50 100 Atlas (b)slices
150
Fig. 3. Top (U) and bottom (D) borders of mouse brain without alignment (a) and after alignment (b)
These functions are noisy and need to be smoothed before building 3-dimensional model. Here we apply Savitzky-Golay filter to smooth these functions. The idea of Savitzky-Golay filtering is to find filter coefficients that preserve higher moments. Equivalently, the idea is to approximate the underlying function within the moving window not by a constant (whose estimate is the average), but by a polynomial of higher order (we used order of 5 in our approach). For each point we fit a polynomial to the points in the moving window (we used window width 15) using least squares method, and then set the new value to be the value of that polynomial at the same position. Savitzky-Golay filters can be thought of as a generalized moving average. Their coefficients are chosen in a way to preserve higher moments in the data, thus reducing distortion of essential features like peak heights and line widths in a spectrum, while the suppression of random noise is improved. Figure 3,b shows top and bottom borders of brain rectangles after alignment. If we are interested in any specific section of mouse brain we can make additional alignment in appropriate plane. Such alignment makes specific section smoother but the whole model becomes less smooth. So for 3D-model reconstruction we perform the alignment of top and bottom borders of bounding box only and don’t use the alignment w.r.t. specific plane.
6
Non-linear Deformations
A 3D model of mouse brain is a function: F : R3 → [0, 1].
(3)
From atlas slices we know the values of F only at some points. Within the slice plane the expansion is straightforward and can be done e.g., by bilinear interpolation between neighboring pixels. Spatial expansion can be done in the same way (weighted sum of neighboring slices). However, this simple solution makes a 3D model of poor quality. A better solution can be obtained using non-linear image deformations.
296
A. Osokin, D. Vetrov, and D. Kropotov
The input images are given as two 2-dimensional discrete functions: f1 , f2 : I ⊂ Z2 → [0, 1].
(4)
Here I is a 2-dimensional discrete interval covering the set of all pixels in the image. Function values stand for normalized intensities of corresponding pixels. We denote continuous expansions of two images as f1c , f2c . Our goal is to find a deformation of the first image to the second one in the following way: (5) f1c (g(x, y)) ≈ f2 (x, y). Here g(x, y) : R2 → R2 is a deformation (correspondence) function between pixels. We measure the difference between images by SSD (sum of squared deviations) criterion: (f1c (g(i, j)) − f2 (i, j))2 . (6) E= (i,j)∈I
We find deformation function by minimizing E with respect to g. To solve this problem we consider deformation function as a linear combination of some basis functions: g(x, y) = ck bk (x, y), (7) k∈K
where ck ∈ R2 and bk : R2 → R are some set of basis functions. The family of deformation functions (7) reduces the functional optimization problem to optimization in finite-dimensional space. We use uniformly spaced cubic B-splines as basis functions. A B-spline βr of degree r is recursively defined as βr = βr−1 ∗ β0 , r > 0.
(8)
Here β0 is a characteristic function of [−0.5, 0.5], ∗ is a convolution operator. Specifically, cubic B-spline is the following function: ⎧ 2 ⎪ ⎨2/3 − (1 − |x|/2)x , 0 < |x| ≤ 1, 3 β3 (x) = (2 − |x|) /6, 1 < |x| < 2, (9) ⎪ ⎩ 0, |x| ≥ 2. So we are looking for the deformation function in the family: g(x, y) = ckx ,ky β3 (x/hx − kx )β3 (y/hy − ky ).
(10)
(kx ,ky )∈K
Centers of B-spline functions are placed on the regular grid (kx hx , ky hy ). Working with uniformly-spaced splines is significantly faster in comparison to the use of non-uniform grid. In order to get complete control over g, we put some spline knots outside the image.
3-D Mouse Brain Model Reconstruction from a Sequence of 2-D Slices
297
Fig. 4. Deformation field for B-spline method. This deformation field is obtained by applying deformation of neighboring slices to regular grid.
Finally the problem is to optimize SSD criteria E w.r.t. set of parameters c. Here we use gradient descent algorithm with feedback step size adjustment. In this algorithm parameter update rule is Δc = −μ∇c E(c). After a successful step μ is multiplied by some value μf > 1, otherwise it is divided by some other value μ∗f > 1. An example of deformation field obtained from B-spline basis functions for a pair of neighboring slices from ABA is shown in figure 4. Since we have deformation of the first image to the second one and vice versa, we can fill gaps between atlas slices with weighted sum of deformed neighboring slices: 1−α α (x, y) + αf2,k (x, y). (11) F (x, y, z) = (1 − α)f1,k−1 Here α =
z−zk−1 zk −zk−1 ,
zk−1 ≤ z < zk , zk is a z-coordinate of slice number k.
α k f1,k−1 (x, y) = fk−1 ((x, y) + α(gk−1 (x, y) − (x, y))) α f2,k (x, y)
= fk ((x, y) +
α(gkk−1 (x, y)
− (x, y)))
(12) (13)
Here gij (x, y) is a deformation function of slice number i to slice number j. To construct a 3D-model we interpolate between 132 ABA coronal slices adding 50 intermediate virtual coronal slices between each neighboring pair of ABA slices.
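A sketch of the morphing step of eqs. (11)-(13) is given below; scipy.ndimage.map_coordinates is used for the image warping, and the function and variable names are assumptions. The deformation fields g_fwd (slice k−1 to k) and g_bwd (slice k to k−1) are assumed to be given as absolute pixel coordinates on the pixel grid.

```python
# Sketch of eqs. (11)-(13): build a virtual slice at fraction alpha between
# slice k-1 (f_prev) and slice k (f_next) from the two deformation fields.
import numpy as np
from scipy.ndimage import map_coordinates

def warp(image, coords_xy):
    """Sample image at positions coords_xy[..., 0] = x (column), [..., 1] = y (row)."""
    coords = np.stack([coords_xy[..., 1], coords_xy[..., 0]])  # (row, col) order for map_coordinates
    return map_coordinates(image, coords, order=1, mode="nearest")

def intermediate_slice(f_prev, f_next, g_fwd, g_bwd, alpha):
    ny, nx = f_prev.shape
    xx, yy = np.meshgrid(np.arange(nx), np.arange(ny))
    identity = np.stack([xx, yy], axis=-1).astype(float)
    warped_prev = warp(f_prev, identity + alpha * (g_fwd - identity))          # eq. (12)
    warped_next = warp(f_next, identity + (1.0 - alpha) * (g_bwd - identity))  # eq. (13)
    return (1.0 - alpha) * warped_prev + alpha * warped_next                   # eq. (11)
```

Repeating this for, e.g., 50 values of alpha between each pair of annotated slices yields the stack of virtual coronal slices used to build the 3D model.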
7 Experimental Results
We used the coronal Allen Brain Atlas for the 3D reconstruction of the brain model and then synthesized virtual slices in sagittal and axial projections. For the sagittal projection we compared both the histological and anatomical views with the ones obtained from the sagittal Allen Brain Atlas.¹ It should be noted that the sagittal Allen Brain Atlas consists of only 21 annotated sagittal slices spaced at 200 μm and can hardly be used for constructing an anatomical 3D model, since neighboring slices differ too much. We also compared our method of building virtual slices with the analogous method used in the AGEA project [17] (figure 8). In the AGEA project all 528 histological coronal slices from the ABA are used for building the histological 3D model, but an anatomical model is not constructed.
http://mouse.brain-map.org/atlas/ARA/Sagittal/browser.html
Fig. 5. Sagittal histological and structural view of 3D model in three cases: without illumination correction, alignment and nonlinear deformations (figures a and b), with illumination correction and alignment (figures c and d) and with illumination correction, alignment and nonlinear deformations (figures e and f)
Fig. 6. Appropriate histological (a) and structural (b) slice from Sagittal Allen Brain Atlas
Figure 5,a,b shows the sagittal brain view for the 3D model reconstructed straightforwardly from the atlas without illumination correction, proportional alignment or non-linear deformations between neighboring slices. Both the histological and anatomical structures for this virtual slice are shown. It can be seen that the image quality is poor. There are many gaps in brain structures, e.g. the hippocampus (the light C-shaped structure in the middle of the histological image)
Fig. 7. Axial histological and structural view of 3D model in three cases: without illumination correction, alignment and nonlinear deformations (figures a and b), with illumination correction and alignment (figures c and d), and with illumination correction, alignment and nonlinear deformations (figures e and f)
is discontinuous. Figure 5,c,d shows the same view for 3D-model built with illumination correction and atlas image alignment (after steps 1 and 2 of our method). The border of virtual slice became more natural but there are still many discontinuities in inner structures. Finally figure 5,e,f shows the result obtained with application of nonlinear deformations (after step 3 of our method) between the neighboring slices. Both histological and anatomical structures are now smooth although still noisy. The real histological sagittal slice with corresponding anatomical annotation taken from approximately the same part of the brain in sagittal ABA is shown in figure 6. These images are a bit different
Fig. 8. Sagittal (a) and axial (b) histological view of AGEA project 3D model
Fig. 9. Examples of diagonal sections of 3D model obtained by our approach. (a) — histological view, (b) — structural view, (c) — combined histological and structural view.
from ones obtained by us due to peculiarity of cutting technology and different mouse brain. But they show that our method gives virtual views of brain with correctly reproduced internal structures. They can be treated as “correct answer” in some sense. For comparison we also provide virtual histological slice obtained in AGEA project (see figure 8,a) by using 5 times more histological coronal slices. Figure 6 shows appropriate histological and structural views from sagittal ABA. Figure 7 shows same pictures for axial virtual slices views. Again we may conclude that the application of steps 1 and 2 makes the virtual image more homogeneous but does not improve significantly the quality of anatomical image. The application of step 3 provides better anatomical picture, and histological image becomes comparable to the one in figure 8,b which shows the axial virtual slice of approximately the same part of the brain obtained from AGEA project. Since axial ABA doesn’t exist we cannot check the correspondence of synthesized histological and anatomical images to the right ones. As opposed to AGEA project and different Allen Brain Atlases we can obtain an arbitrary section of 3D model of a mouse brain. Figure 9 shows diagonal virtual slices of the 3D model obtained in two different diagonal sections.
8 Conclusion
In the paper we proposed an algorithm that constructs an anatomical 3D model of a mouse brain using annotated histological coronal slices from the ABA. Using this model we can construct virtual slices of the brain w.r.t. an arbitrary section-plane. We have shown that such an algorithm allows us to obtain synthetic images of relatively good quality with both histological and anatomical structure. Such an algorithm opens great perspectives for further brain research, as it provides the opportunity of discovering the anatomical structures in a single slice of a real mouse brain. The procedure of slice preparation is very time- and labor-consuming, which is why it is highly desirable to reduce the number of slices obtained from a real mouse to a minimum (in the ideal case to the one slice which is of interest for biologists). The slice can be made in a non-standard (coronal, sagittal, or axial) section-plane, and it should be mapped into the 3D model of the atlas brain. Our algorithm allows us to synthesize the image of an atlas brain w.r.t. any section-plane and hence is the key part of a future method which will compute the best histological mapping. When this is done, the anatomical structures in a real brain slice can be found either by projecting the anatomical structure of the atlas brain onto the virtual slice and then performing an inverse mapping to adapt it to the real brain slice, or by using the anatomical structure of the best-matching virtual slice as prior information and then performing segmentation of the real slice using, e.g., the conditional random fields (CRF) approach [18]. An algorithm for identifying anatomical structures in an arbitrary brain slice is extremely useful for brain research, as it allows one to understand which structures are responsible for specific gene expression in specific situations.
Acknowledgments This work is supported by Russian Foundation for Basic Research, projects 0801-00405, 08-01-90016 and 09-01-12060, and Russian Federal Target Program on Scientific Staff, project P1295. We would like to thank prof. Konstantin Anokhin for problem formulation and Dr. Anton Konushin for useful discussions.
References 1. Ng, L., Pathak, S.D., Cuan, C., Lau, C., Dong, H., Sodt, A., Dang, C., Avants, B., Yushkevich, P., Gee, J.C., Haynor, D., Lein, E., Jones, A., Hawlyrycz, M.: Neuroinformatics for Genome-Wide 3D Gene Expression Mapping in the Mouse Brain. IEEE Transactions on Computational Biology and Bioinfomatics 4(3), 382– 393 (2007) 2. Bolyne, J., Lee, E.F., Toga, A.W.: Digital Atlases as a Framework for Data Sharing. Frontiers in Neuroscience 2, 100–106 (2008) 3. Brookhaven National Laboratory (Internet). 3-D MRI Digital Atlas Database of an Adult C57BL/6J Mouse Brain, http://www.bnl.gov/CTN/mouse/ 4. Mikula, S., Trotts, I., Stone, J., Jones, E.G.: Internet-Enabled High-Resolution Brain Mapping and Virtual Microscopy. NeuroImage 35(1), 9–15 (2007) 5. Lein, E.S., et al.: Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176 (2007) 6. Allen Brain Atlas (Internet). Allen Institute for Brain Science, Seattle (2008), http://www.brain-map.org 7. Frackowiak, R.S.J., Friston, K.J., Frith, C., Dolan, R., Price, C.J., Zeki, S., Ashburner, J., Penny, W.D.: Human Brain Function, 2nd edn. Academic Press, London (2003) 8. Kybic, J., Thevenaz, P., Nirkko, A., Unser, M.: Unwarping of Unidirectionally Distorted EPI Images. IEEE Transactions on Medical Imaging 19(2), 80–93 (2000) 9. Kybic, J., Unser, M.: Fast Parametric Elastic Image Registration. IEEE Transactions on Image Processing 12(11), 1427–1442 (2003) 10. Ju, T., Warren, J., Carson, J., Bello, M., Kakadiaris, I., Chiu, W., Thaller, C., Eichele, G.: 3D volume reconstruction of a mouse brain from histological sections using warp filtering. Journal of Neuroscience Methods 156, 84–100 (2006) 11. Barron, J.L., Fleet, D.J., Beauchemin, S.: Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77 (1994) 12. Vemuri, B.C., Ye, J., Chen, Y., Leonard, C.M.: Image registration via level-set motion: applications to atlas-based segmentation. Med. Image Anal. 7(1), 1–20 (2003) 13. Gefen, S., Kiryati, N., Nissanov, J.: Atlas-Based Indexing of Brain SEctions via 2-D to 3-D Image Registration. IEEE Transactions on Biomedical Engineering 55(1), 147–156 (2008) 14. Fleet, D.J., Weiss, Y.: Optical flow estimation. In: Paragios, N., Chen, Y., Faugeras, O. (eds.) Mathematical models for Computer Vision: The Handbook, Springer, Heidelberg (2005) 15. Rueckert, D., Sonoda, L.I., Hayes, C., Hill, D.L.G., Leach, M.O., Hawkes, D.J.: Nonrigid Registration Using Free-Form Deformations: Application to Breast MR Images. IEEE Transaction on Medical Imaging 18(8) (August 1999)
16. Myronenko, A., Song, X.: Image Registration by Minimization of Residual Complexity. In: Computer Vision and Pattern Recognition (CVPR’09), pp. 49–56 (2009) 17. AGEA Project (Internet). Allen Institute for Brain Science, Seattle (2008), http://mouse.brain-map.org/agea/ 18. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. 18th International Conf. on Machine Learning (2001)
A Proposed Knowledge Based Approach for Solving Proteomics Issues
Antonino Fiannaca1,2, Salvatore Gaglio1,2, Massimo La Rosa1,2, Daniele Peri1,2, Riccardo Rizzo2, and Alfonso Urso2
1 Department of Computer Science (DINFO), University of Palermo, Viale delle Scienze, Ed. 6, 90128 Palermo, Italy
2 ICAR-CNR, National Research Council, Viale delle Scienze, Ed. 11, 90128 Palermo, Italy
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. In this paper we present a novel knowledge-based approach that aims at helping scientists to face and resolve a large number of proteomics problems. The system architecture is based on an ontology to model the knowledge base, a reasoner that, starting from the user's request and a set of rules, builds the workflow of tasks to be done, and an executor that runs the algorithms and software scheduled by the reasoner. The system can interact with the user, showing intermediate results and several options in order to refine the workflow and supporting the choice among different forks. Thanks to the presence of the knowledge base and the modularity provided by the ontology, the system can be enriched with new expertise in order to deal with other proteomic or bioinformatics issues. Two possible application scenarios are presented.
1 Introduction
Knowledge management aims at turning data into information so that it is possible to perform artificial reasoning that can help the scientist to face and solve problems. Year after year, several expert systems have been developed in order to help users handle the growing amount of available data. Computer-assisted analysis has brought benefits in many fields, especially in biomedical research. For example, an interesting early application in the medical field was developed in [1], where the authors obtained rules through interviews with human experts and used a Bayesian approach for analysis with uncertainties in the detection of artery stenosis. In this way, each fact was given a score representing the prior probability that the fact was true. The proposed inference engine uses a decision tree that links facts through their relationships. In [2] the concepts of demons and agenda in a rule-based diagnostic expert system were adopted, using some meta-rules to focus on a specific knowledge base context. The enhanced version of the rule-based expert system proposed by the authors aims at arranging knowledge bases into hierarchical, heterarchical or sequential arrangements, and at combining diagnostics with certain planning actions.
Further support for biologists is provided by systems oriented to management of and reasoning with biological knowledge; in fact, some knowledge-based systems have been developed to obtain rules based on biologists' experience. For instance, a popular system is [3], where the authors propose an algorithm for protein secondary structure prediction: their approach uses a knowledge base of peptide sequence-structure pairs and a heuristic based on a voting scheme for structure prediction. The system can extract a match rate, which represents the amount of structural information belonging to each amino acid, among homologous proteins contained in the knowledge base. The match rate plays a central role in this work, because it provides a confidence level for the prediction results for each amino acid. A few recent works in bioinformatics use expert systems without a knowledge base: although these systems use some rules to determine the best way to resolve a specific problem, the workflow is set in advance. In other words, no reasoning is done by the system. For instance, an interesting system is [4], where the authors propose an expert system to predict protein thermostability; in this work, a predetermined process is executed to evaluate three different machine learning algorithms for the classification task. Each candidate algorithm is run and the related score is computed. The comparison of these scores is the discriminant factor to obtain the best classification algorithm for the proposed goal. In recent years, valuable support for biologists has been provided by workflow management systems such as [5,6]. Among these, the most famous is [5], a system able to automatically integrate tools and databases available on the web, so that users without extensive experience with programming languages can design, execute and share workflows of local or web services. The process flow is described either graphically or through a script language. Unfortunately, like the previously cited expert systems, these systems have neither a knowledge base nor a reasoner to infer the sequence of operations, but they allow bioinformaticians to construct pipelines of services to perform different analyses, such as sequence analysis or genome annotation. A similar approach was implemented in knowledge-based systems that support decision-making activities. They represent a middle ground between workflow management systems, where the user must design the workflow, and classical knowledge-based expert systems, where the system automatically resolves the user's query. Typically, these systems are used in the medical field. A recent clinical decision support system is [7], where unstructured databases of medical records and a set of clinical practice guidelines have been organized into both a dedicated ontology and related rules. These rules are needed in order to make inference on the ontology, to build new guidelines and to obtain recommendations for the patient care process. In the present work, we introduce a new knowledge-based system focused on resolving proteomics issues. It aims at helping scientists to generate an appropriate data model operating on unstructured information. In fact, many proteomic experiments are executed using guidelines that are not enough to describe and codify a well-structured protocol. The proposed approach represents an
intelligent alternative to workflow management systems, because the user can obtain the same result reached with previous systems, but with less effort; in fact, our system incorporates a meta-level that generates a main workflow and indicates possible alternatives to refine it, using appropriate evaluation criteria. This meta-level will plan and suggest a workflow to the user according to expert knowledge, guidelines, databases and web services available to the community and, then, the user will be able to interact with the system in order to request a fine tuning of parameters. Of course, if the system does not have enough elements in its knowledge base to select a “winner algorithm”, then it will communicate to the user some possible ways to select the winner among the candidates. In the next section, we briefly discuss some proteomic issues. In Section 3, we describe the architecture of the proposed system. In Section 4, we show the main concepts of our ontology. In Section 5, we give a brief overview of the main software components used in order to implement our system. In Section 6, we present two possible scenarios in the fields of protein structure prediction and protein-protein interaction networks, respectively. Finally, in the last section we provide concluding remarks and an outlook on future work.
2 Proteomic Issues
Proteomics represents a big challenge for bioinformatics: it is very hard to understand how proteins work in biological processes. In fact, the proteome of a specific organism differs even from cell to cell, because a single gene can code for over 1,000 proteins and each protein can express several functionalities, according to the other proteins it interacts with. In the present study, following the bioinformatics topic classification in [8], proteomics is divided into five classes of problems: protein structure prediction, protein annotation, protein function prediction, protein-protein interaction and protein localization prediction. This top-level partitioning represents the main proteomic issues our system is supposed to face. Of course, the design and implementation of a system able to solve all of these problems represents a huge effort; nevertheless, our purpose is to focus on some problems, such as structure prediction and protein-protein interaction, and at the same time to provide a methodology and a framework so that other researchers can add their knowledge and expertise in order to cope with the remaining domains.
2.1 Protein Structure Prediction
The structure of a protein represents a key feature in its functionality [9]. Unfortunately, the prediction of 2D and 3D structures is in general an NP-hard problem, because most proteins are composed of thousands of atoms and bonds and the number of potential structures is very large. For this reason, in order to approximate the real structure of a protein, several optimization techniques based on machine learning approaches have been implemented, and a competition (CASP [14]), aiming at improving prediction techniques over the years, has been instituted.
2.2 Protein-Protein Interaction
The analysis of protein-protein interaction (PPI) plays a central role in understanding the biological mechanisms of cellular processes. Nowadays a large amount of PPI data has been identified with many technologies, but only a few of these data are confirmed as real interactions with an emerging function. Moreover, at the biological pathway level, the functionality is not linked to a simple pair of proteins, but arises from protein complexes, i.e., collections of PPIs. The analysis of protein-protein interactions, as well as the identification and extraction of protein complexes, represents a hard task for machine learning algorithms [10], because uncertain information about the interconnection and functionality of each protein could lead to erroneous interpretations.
2.3 Protein Annotation
Available databases and technical information on proteins form the raw material of proteomics. A correct organization of these input data prevents a misleading interpretation of their elements. A critical phase in this process is the correct annotation of the properties and main features of proteins. This step is based on the classification of scientific texts and on information extraction in the biological domain [11], and it copes with identification problems. In the biological field the nomenclature is highly variable and ambiguous, especially for protein name identification, where both the use of phenotypical descriptions and the management of gene homonyms and aliases have influenced naming conventions.
2.4 Protein Function Prediction
Another challenge is to determine protein function at the proteomic scale. In fact, although many individual proteins in a model organism have known sequences and structures, their functions remain unknown. In particular, a single protein can express different functions according to some environmental parameters; therefore it is not enough to identify which proteins are responsible for diseases or are advised for medical treatments, if the specific functions are unknown. Approaches to function prediction are based on different techniques [12]: some of these are related to protein sequence and structure, while others use protein-protein interaction patterns and correlations between occurrences of related proteins in different organisms.
2.5 Protein Localization Prediction
The prediction of protein localization aims at determining the localization sites of unknown proteins in a cell. By means of this study, it is possible to cope with problems like genome annotation, protein function prediction, and drug discovery. The location of a protein within the cell can be determined through experimental approaches [13], but these are time- and cost-consuming; thus a computational technique able to screen possible candidates for further analyses appears to be a desirable solution.
3 System Architecture
The system structure, shown in Fig. 1, is composed of three main functional layers that interact with each other. The Interface layer, on top, allows the user to interact with the system. Using a Graphical User Interface (GUI), the user submits his query and receives intermediate and final results, as well as the suggested strategies he can choose. The wrapper module manages the communication between the Controller layer and the GUI. The middle layer, called the Controller layer, is the core of our system. If we consider it as a single block, it receives a request from the Interface layer, in the form of a query, runs all the needed tools, belonging to the Object layer, and sends the result to the Interface layer. The Controller layer holds several modules: first of all we have a reasoner, which can be seen as a task scheduler. The reasoner decides the set of operations to perform in order to fulfil the user's request. These decisions are taken by consulting a Knowledge Base (KB), representing the expertise of the system about the application domain. The KB, see Fig. 2, is built by means of an ontology, used to give a fixed structure to the KB itself. The design and the main features of our ontology are described in the next section. The KB consists of facts and rules: the facts are instances of the concepts defined in the ontology and represent the information content about the domain; the rules, written by an expert of the domain, are in the typical form "IF condition THEN action" and describe which conditions have to be satisfied in order to run a specific task or algorithm. Rules can be seen as a coding of the strategies and heuristics our system can provide. The information content, coded as facts, is extracted from the most popular biological databases, such as PDB [23]; structural databases, like SCOP [21] and CATH [22]; biological and bioinformatics papers; and knowledge provided directly by human experts of the domain. The knowledge base is not a closed structure: it is easily expandable when new information, or expertise, becomes available. The reasoner, given the query, checks the knowledge base and, if there are facts that match the condition expressed in one or more rules, the corresponding actions are processed. With this mechanism of checking the knowledge base and executing actions, the reasoner fills the executor's agenda with all the tasks to be done in order to give a response to the user's query. The tasks to perform also include all the operations related to the organization of the correct input for each software component, algorithm or tool to execute. The executor module is assigned to run the tasks that the reasoner schedules. To do that, the executor makes use of the algorithms, tools and web services belonging to the lower layer. Moreover, the executor can update the knowledge base with intermediate results and send the final results to the wrapper, which arranges them in the GUI. The Object layer represents all the low-level parts that will be run by the executor, according to the decisions taken by the reasoner. The Object layer can
Fig. 1. System architecture. Three main layers interact to make the system work. The Interface layer interacts with the user by receiving his request and showing him the results. The Controller layer is the core of our system and is composed of several modules. The Knowledge Base (KB) represents what the system knows about the application domain; its information content is organized and structured through an ontology. The reasoner, given a query, consults the KB and then fills the agenda with all the operations to do in order to accomplish the user's request. The executor runs the software components belonging to the Object layer, updates the KB with intermediate results that can help the reasoner during the decision process, and sends the results to the wrapper so that they can be displayed through the GUI. Finally, the Object layer represents all the algorithms, software, tools and web services used by the executor to carry out the workflow of operations decided by the reasoner.
be considered as a big container, made up of different compartments corresponding to different classes of software and tools. For example, we have a library of machine learning algorithms and bioinformatics tools, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Bayesian networks, Blast and PSI-Blast; validation and evaluation criteria; access to the most common web services; and so on.
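To make the fact-and-rule mechanism described above more concrete, the following is a minimal sketch in the Jess rule language (the engine adopted in Section 5). All template, slot and tool names here are hypothetical and only illustrate how the reasoner could turn a matching fact into a task scheduled for the executor; they are not taken from the actual knowledge base.

(deftemplate user-query (slot problem) (slot input))      ; the request coming from the Interface layer
(deftemplate task (slot tool) (slot input))               ; an entry scheduled on the executor's agenda

(defrule schedule-structure-prediction
  "IF the query asks for a 3D prediction THEN schedule a homology search task"
  (user-query (problem "3D prediction") (input ?seq))
  =>
  (assert (task (tool "PSI-Blast") (input ?seq))))        ; hypothetical first step of the workflow

; Example query fact; running the engine fires the rule above and fills the agenda.
(assert (user-query (problem "3D prediction") (input "MKTAYIAKQR")))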
Fig. 2. Knowledge Base, composed of an ontology, which defines the structure and the relationships among the concepts of the domain; the facts, representing single pieces of known information about the domain; and the rules, which allow reasoning and inferencing over the facts
4 Ontology Design
The need to build a well-structured, robust, consistent, and easily expandable knowledge base led us to design an ontology of our application domain. Our aim is to model three main concepts:
1. the set of problems regarding the proteomic domain, i.e., problems inherent in the study of proteins, especially their structure and functions;
2. the tools, software and algorithms currently used in bioinformatics and proteomics;
3. the protein domain, that is, the notions strictly related to the properties, functions, structures and features of proteins from a biological point of view.
We are working on the first two components of the ontology, Proteomic Domain and Tools, whereas for the protein domain we are going to integrate an existing ontology, the Protein Ontology (PO) [25], which describes proteins from a structural and functional perspective. There exists another ontology developed to model the protein concept, simply called Protein Ontology (PRO) [24]: it considers proteins in terms of evolutionary classes and multiple products of a gene locus. Since we are interested in the relationships between the structural and functional protein features used by the most popular bioinformatics algorithms and techniques to solve proteomic issues, PRO will not be the subject of our study.
All three main branches of our ontology are modelled according to a hierarchy of classes and subclasses. The proteomic domain is decomposed into subdomains, corresponding to the different case studies in proteomics: we have identified five domains, namely Function prediction, Location prediction, Structure prediction, Protein annotation and Protein-Protein interaction (PPI). When possible, these domains are further divided into one or more subdomains; for example, inside Structure Prediction we have 2D Prediction, 3D Prediction and Contact Map Prediction. The hierarchy that constitutes the proteomic domain can be considered as a set of classes, or types, of problems: the instances of these classes represent the proteomic problems we actually want to resolve; for example, an instance of 3D Prediction is a specific 3D prediction to perform. Multiple instances of the same class inside ProteomicDomain represent different strategies that can be used to solve the same problem. The Tools component is also structured as a hierarchy of types of algorithms: we have neural networks, algorithms on graphs, structural predictors, etc. As instances we consider a particular algorithm or software: for example, Self-Organizing Map, Neural Gas and Multi-layer Perceptron are instances of the Neural Network class. The Protein Ontology concept hierarchy and information can be found in [25]. The hierarchical decomposition of the ontology is shown in Fig. 3. The ProteomicDomain and Tools super classes are both abstract classes, because they represent top-level concepts that will never be instantiated; on the other hand, at this level we can define some general properties, also called slots, that all the subclasses will inherit. Apart from a hierarchy of classes and subclasses, an ontology is characterized by the relationships among those classes. On the one hand, we are interested above all in the relationships between algorithms and the proteomic domain, because we want to find the best set of operations to perform given a proteomic issue; on the other hand, we are interested in the relationships between protein structures and functions and the proteomic domain, because we want to know which tasks we are allowed to perform according to the type of data we want to analyse. At the top level we have defined a mutual relation between ProteomicDomain and Tools: an instance of Tools "resolves" an instance of ProteomicDomain, which conversely is "resolved by" an instance of Tools. In this way we want to point out that a particular software or algorithm is suited to be applied to a particular proteomic problem. In Section 6.1 we will see the part of the ontology related to a proposed case study.
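Since the knowledge base is implemented in Jess (Section 5), the class hierarchy and the mutual resolves / isResolvedBy relation could, for instance, be mirrored by templates like the following sketch. The slot names follow this section, but the concrete subclasses, inheritance layout and instances shown are only an illustration of the idea, not an export of the real Protégé model.

(deftemplate ProteomicDomain (slot name) (multislot isResolvedBy))   ; abstract top-level concept
(deftemplate Tools (slot name) (multislot resolves))                  ; abstract top-level concept

; Subclasses inherit the slots of their parent and may add their own.
(deftemplate StructurePrediction extends ProteomicDomain (slot subdomain))
(deftemplate StructuralPredictor extends Tools (slot category))       ; e.g. homology or ab initio

; An instance of Tools "resolves" an instance of ProteomicDomain, and vice versa.
(assert (StructurePrediction (name "3D Prediction") (subdomain "3D") (isResolvedBy "3D Predictor")))
(assert (StructuralPredictor (name "3D Predictor") (category homology) (resolves "3D Prediction")))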
5 System Implementation
As for the implementation details, both the GUI and the wrapper are written in Java; the ontology is modelled using the Protégé 3.4 editor [18]; the reasoner, the knowledge base and the executor are implemented using Jess [19], the rule engine for the Java platform.
Fig. 3. Hierarchy of classes for the three main branches of the ontology: Tools, representing the set of software, algorithms and bioinformatics tools supported; ProteinOntology, defining the concept of protein from a structural and functional perspective; Proteomic Domain, which structures the types of proteomic issues
6 Case Studies
The following two scenarios are introduced in order to show how our system exploits the ontology with basic rules and how it builds a workflow with different abstraction layers, respectively.
6.1 Scenario: 3D Structure Protein Prediction
In proteomics, as highlighted in Section 2.1, the identification of the three-dimensional structure of a protein is fundamental in order to understand the protein's functions. If we look at the latest CASP competition (CASP8, 2008) [14], considering only the server submissions, there are more than 70 different 3D structure predictors. At this point, it is obvious that the choice of the best predictor for a given protein sequence or set of sequences is not trivial. In fact, although the CASP competition provides a ranking of the predictors according to several scoring methods [15,16], the number of target sequences, about 120, cannot be considered statistically significant: choosing the first-ranked predictor may therefore not always be the best selection. Structural predictors, moreover, differ from each other on the basis of the adopted algorithm, the presence, or not, of a template used as a starting point for the prediction, and the physical and structural protein properties considered to make a prediction, such as secondary
structure, solvent accessibility and evolutionary profiles. For this reason, some predictors give better results with certain classes of proteins, and in CASP8 we can find individual predictor rankings for each target protein. Our system, using a Knowledge Base and a reasoner, as we saw in Section 3, can give a solid structure and consistency to all the information and expertise needed to make an optimal choice. A typical scenario for 3D structure prediction can be the following: the user has a protein sequence, that is, only the sequence of the amino acids, and queries our system in order to obtain the prediction of the three-dimensional structure of his input sequence. The system, once it has received the query, starts its reasoning by means of the facts and the rules belonging to the KB and tells the agenda what to do to make a 3D prediction. First of all, the system searches for homologous proteins using PSI-Blast [17] and finds that a target protein of the last CASP can be taken as a template. The system runs the predictor that obtained the best score for that template, and gives the prediction results and the score for that prediction to the user, notifying him that it is possible to obtain more refined results, if he/she wants. The user accepts to carry out a deeper experiment, so the system, this time, searches for homologies in the main protein databases, such as PDB [23], Swiss-Prot [26] and UniProt [27], and finds a better template than the previous one. The system runs the ten top-ranked CASP8 template-based predictors, retrieving the needed parameters or, if necessary, asking the user for them. After receiving the 3D models, the system performs a model quality assessment procedure or model evaluation test, like [29] and [28]. Finally, the system shows the user the obtained 3D models with their own scores. In Fig. 4 we present the part of the ontology used in the scenario described in this section. With regard to the Proteomic Domain module, we consider a 3D Prediction problem (the grey class), belonging to the Structure Prediction class. As for the Tools, we consider a 3D Predictor, that is, a subclass of Structural Predictors. For the protein domain, we consider only the Entry subclass, containing the database identifiers of the proteins belonging to the KB. The 3D Prediction class is characterized by the following properties: isResolvedBy, accepting as value an instance of the 3D Predictor class and representing which predictor can be used; suitedFor, representing which protein sequence the predictor has already been applied to; score, an index giving the quality of the previous prediction; source, that is, where the prediction comes from, e.g., CASP, a paper, a database, etc. Examples of instances of this class are shown in Table 1. As previously said, the reasoner, once a template has been found, checks whether there exists in the KB an optimal predictor for that template, applying the following rule to facts structured as in Table 1:
IF input has template and template has predictor THEN use that predictor
If the score of the predictor is lower than a certain threshold, chosen by an expert, the system suggests to the user to run other predictors in order to look for a better prediction. The corresponding rule can be the following one:
IF score IS lower than threshold THEN suggest other predictors.
The 3D Predictor class, being a subclass of Structural Predictors directly and of Tools indirectly, has the following slots: input, for example the protein sequence in FASTA format; output, for example whether the predictor gives the protein backbone or also the side chains; description, representing a textual overview of the predictor; status, a boolean property indicating whether the predictor is available or not; category, which specifies whether the predictor works on a template (homology) or without one (ab initio); resolves, which denotes what instances of ProteomicDomain it can be applied to, i.e., an instance of 3D Prediction.
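As a hedged illustration of how these two rules and the facts of Table 1 might look once encoded in Jess, consider the following sketch. The template layout, the task template and the 80.0 threshold are assumptions introduced for the example; only the fact values come from Table 1.

(defglobal ?*score-threshold* = 80.0)                    ; illustrative value; in practice chosen by a domain expert

(deftemplate prediction (slot entry) (slot predictor) (slot score) (slot source))
(deftemplate query (slot sequence) (slot template))      ; template found, e.g., via PSI-Blast
(deftemplate task (slot run) (slot on) (slot expected-score))

(deffacts casp8-predictions "Facts shaped like the rows of Table 1"
  (prediction (entry T0388-D1) (predictor pro-sp3Tasser) (score 93.14) (source CASP8))
  (prediction (entry T0391-D1) (predictor multicom-rank) (score 91.77) (source CASP8))
  (prediction (entry T0414-D1) (predictor "Phyre de novo") (score 63.98) (source CASP8)))

(defrule use-template-predictor
  "IF input has template and template has predictor THEN use that predictor"
  (query (sequence ?s) (template ?t))
  (prediction (entry ?t) (predictor ?p) (score ?score))
  =>
  (assert (task (run ?p) (on ?s) (expected-score ?score))))

(defrule suggest-other-predictors
  "IF score is lower than threshold THEN suggest other predictors"
  (task (run ?p) (expected-score ?score))
  (test (< ?score ?*score-threshold*))
  =>
  (printout t "Score of " ?p " below threshold: suggest alternative predictors" crlf))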
Fig. 4. Part of the ontology related to the case study presented in Section 6.1. The asterisk mark (*) in slots or relationships means multiple values are allowed.
6.2 Scenario: Detection of Functional Modules from a Protein-Protein Interaction Network
Proteins represent the working molecules of a cell and, as is well known, they carry out many biological activities. In fact, proteins exhibit their functions by interacting with other proteins. For this reason, as highlighted in Section 2.2, the analysis of protein-protein interaction networks is necessary for the detection of functional modules.
Table 1. Examples of facts belonging to the Knowledge Base: they represent instances of the 3D Prediction class

Entry     Predictor       Score  Source
T0388-D1  pro-sp3Tasser   93.14  CASP8
T0391-D1  multicom-rank   91.77  CASP8
T0414-D1  Phyre de novo   63.98  CASP8
...       ...             ...    ...
Several predictive methods, based on high-throughput proteomics technologies, are available to predict protein interactions and generate a large amount of data. Experimentalists have the chance to use different online PPI network databases (DIP [30], MIPS [31], etc.), each of which contains its own set of features (according to the interaction detection method, interaction type, experimental conditions, etc.); moreover, there are many tools and strategies used to solve the protein complex identification problem (soft clustering, greedy heuristics, probabilistic approaches, etc.). The proposed system both allows the experimentalist to try a wide combination of different models and strategies and supports the retrieval of data from the available databases. More in detail, in protein complex analysis, it can suggest the right clustering approach among several strategies and, if necessary, can combine more than one model.
Fig. 5. Part of the workflow related to the case study presented in Section 6.2
Figure 5 shows a typical workflow generated by the proposed system. It emphasizes the logical abstraction layers, the strategies and sub-strategies adopted to solve each problem and, at the lowest layer, the instances of the algorithms implemented. Obviously, each workflow is generated according to user preferences; in fact, the system only suggests a strategy by exploiting the reasoner module, while the iteration between the proposed model and the user provides the key to ensuring that the models are as good as possible.
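As an illustration of the kind of rule the reasoner could use to suggest a strategy in this scenario, the sketch below encodes, in Jess, a hypothetical criterion for preferring a soft-clustering approach on dense, weighted interaction networks. The templates, the density threshold and the strategy name are assumptions introduced for the example; they do not reproduce rules of the actual system.

(deftemplate ppi-network (slot source) (slot density) (slot weighted))   ; e.g., a network loaded from DIP or MIPS
(deftemplate suggestion (slot strategy) (slot reason))

(defrule suggest-soft-clustering
  "On a dense network with confidence-weighted edges, propose soft clustering"
  (ppi-network (source ?db) (density ?d) (weighted TRUE))
  (test (> ?d 0.1))                                       ; illustrative density criterion
  =>
  (assert (suggestion (strategy soft-clustering)
                      (reason "dense, weighted interaction network"))))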
7 Conclusion and Future Work
In this paper, we proposed a novel knowledge-based approach that is able to help the scientist to handle and, in principle, resolve a large set of proteomic issues. The main components of the system are: an ontology, used to model and give consistency to the knowledge base; the facts and the rules, provided by an expert of the domain, belonging to the KB and representing the expertise of our system; a reasoner, which schedules the set of operations to perform, by consulting the knowledge base, in order to accomplish the user's request; and an executor, which physically runs what the reasoner has decided. The proposed system allows the user to interact with the decision process, providing intermediate results, showing the advantages and the weaknesses of the possible alternatives in the development of the workflow of operations, and letting the user set some constraints such as execution time or quality of the results. The system, moreover, builds workflows at different levels of abstraction and is designed to be easily expandable, thanks to its modular structure and the possibility of adding further expertise to the Knowledge Base. In the near future we are going to polish and improve our ontology, identifying the strengths and weaknesses of the supported algorithms and the potential relationships between protein features and algorithm performance. At the same time, we are going to develop and implement other application scenarios, including typical bioinformatics and systems biology issues such as the reverse engineering of gene regulatory networks.
References

1. Cios, K.J., Shin, I., Wedding II, D.K.: Bayesian Approach to Dealing with Uncertainties for Detection of Coronary Artery Stenosis Using a Knowledge Based System. IEEE Engineering in Medicine and Biology 8(4), 53–58 (1989)
2. Lhotska, L., Marik, V., Vlcek, T.: Medical applications of enhanced rule-based expert systems. International Journal of Medical Informatics 63(1), 61–75 (2001)
3. Lin, H.N., Chang, J.M., Wu, K.P., Sung, T.Y., Hsu, W.L.: HYPROSP II: a knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence. Bioinformatics 21(15), 3227–3233 (2005)
4. Wu, L.C., Lee, J.X., Huang, H.D., Liu, B.J., Horng, J.T.: An expert system to predict protein thermostability using decision tree. Expert Systems with Applications 36(5), 9007–9014 (2009)
5. Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock, M.R., Li, P., Oinn, T.: Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 34 (2006)
6. Bartocci, E., Corradini, F., Merelli, E., Schortichini, L.: BioWMS: a Web-based Workflow Management System for Bioinformatics. BMC Bioinformatics 8(1) (2007)
7. Ceccarelli, M., Donatiello, A., Vitale, D.: KON3: a Clinical Decision Support System, in oncology environment, based on knowledge management. In: IEEE International Conference on Tools with Artificial Intelligence, vol. 2, pp. 206–210 (2008)
8. Larrañaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J.A., Armañanzas, R., Santafé, G., Pérez, A., Robles, V.: Machine learning in bioinformatics. Briefings in Bioinformatics 7(1), 86–112 (2005)
9. Robles, V., Larrañaga, P., Peña, J.M., Menasalvas, E., Pérez, M.S., Herves, V.: Bayesian networks as consensed voting system in the construction of a multi-classifier for protein secondary structure prediction. Artificial Intelligence in Medicine 31, 117–136 (2004)
10. Yamakawa, H., Maruhashi, K., Nakao, Y.: Predicting Types of Protein-Protein Interactions Using a Multiple-Instance Learning Model. In: Washio, T., Satoh, K., Takeda, H., Inokuchi, A. (eds.) JSAI 2006. LNCS (LNAI), vol. 4384, pp. 42–53. Springer, Heidelberg (2007)
11. Hanisch, D., Fundel, K., Mevissen, H.T., Zimmer, R., Fluck, J.: ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 6(suppl. 1), S14 (2005)
12. Whisstock, J.C., Lesk, A.M.: Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics 36(3), 307–340 (2003)
13. Su, E.C., Chiu, H.S., Lo, A., Hwang, J.K., Sung, T.Y., Hsu, W.L.: Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics 8 (2007)
14. CASP: Critical Assessment of Techniques for Protein Structure Prediction, http://predictioncenter.org/index.cgi
15. Zemla, A.: LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Research 31(13), 3370–3374 (2003)
16. Zhang, Y., Skolnick, J.: Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004)
17. Altschul, S.F., et al.: Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research 25(17), 3389–3402 (1997)
18. The Protégé Ontology Editor and Knowledge Acquisition System, http://protege.stanford.edu
19. Sandia National Laboratories: Jess, the Rule Engine for the Java Platform (2003), http://herzberg.ca.sandia.gov/jess/
20. Eriksson, H.: Using JessTab to integrate Protégé and Jess. IEEE Intelligent Systems 18(2), 43–50 (2003)
21. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995)
22. Orengo, C.A., Michie, A.D., Jones, D.T., Swindells, M.B., Thornton, J.M.: CATH: A Hierarchic Classification of Protein Domain Structures. Structure 5, 1093–1108 (1997)
23. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic Acids Research 28, 235–242 (2000), http://www.pdb.org
24. Natale, D.A., Arighi, C.N., Barker, W.C., Blake, J., Chang, T., Hu, Z., Liu, H., Smith, B., Wu, C.H.: Framework for a Protein Ontology. BMC Bioinformatics 8(suppl. 9), S1 (2007)
25. Protein Ontology (PO) concept hierarchy, http://proteinontology.org.au/hierarchy.htm
26. Bairoch, A., Boeckmann, B., Ferro, S., Gasteiger, E.: Swiss-Prot: Juggling between evolution and stability. Brief. Bioinformatics 5, 39–55 (2004)
27. The UniProt Consortium: The Universal Protein Resource (UniProt). Nucleic Acids Res. 36, D190–D195 (2008)
28. Wiederstein, M., Sippl, M.J.: ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Research 35, W407–W410 (2007)
29. Wang, Z., Tegge, A.N., Cheng, J.: Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 75, 638–645 (2009)
30. DIP: Database of Interacting Proteins, http://dip.doe-mbi.ucla.edu/
31. MIPS: Munich Information Center for Protein Sequences, http://www.helmholtz-muenchen.de/en/mips/