Methods in Molecular Biology™
Series Editor: John M. Walker, School of Life Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK
For other titles published in this series, go to www.springer.com/series/7651
Bioinformatics for Omics Data
Methods and Protocols

Edited by
Bernd Mayer
emergentec biodevelopment GmbH, Vienna, Austria
Editor
Bernd Mayer, Ph.D.
emergentec biodevelopment GmbH
Gersthofer Strasse 29-31
1180 Vienna, Austria
[email protected]

and

Institute for Theoretical Chemistry
University of Vienna
Währinger Strasse 17
1090 Vienna, Austria
[email protected]
ISSN 1064-3745    e-ISSN 1940-6029
ISBN 978-1-61779-026-3    e-ISBN 978-1-61779-027-0
DOI 10.1007/978-1-61779-027-0
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2011922257

© Springer Science+Business Media, LLC 2011

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface

This book discusses the multiple facets of "Bioinformatics for Omics Data," an area of research that intersects with and integrates diverse disciplines, including molecular biology, applied informatics, and statistics, among others. Bioinformatics has become a default technology for data-driven research in the Omics realm and a necessary skill set for the Omics practitioner. Progress in miniaturization, coupled with advancements in readout technologies, has enabled a multitude of cellular components and states to be assessed simultaneously, providing an unparalleled ability to characterize a given biological phenotype. However, without appropriate processing and analysis, Omics data add nothing to our understanding of the phenotype under study. Even managing the enormous amounts of raw data that these methods generate has become something of an art.

Viewed from one perspective, bioinformatics might be perceived as a purely technical discipline. However, as a research discipline, bioinformatics might more accurately be viewed as "[molecular] biology involving computation." Omics has triggered a paradigm shift in experimental study design, expanding beyond hypothesis-driven approaches to research that is essentially explorative. At present, Omics is in the process of consolidating various intermediate forms between these two extremes. In this context, bioinformatics for Omics data serves both hypothesis generation and validation and is thus much more than mere data management and processing.

Bioinformatics workflows with data interpretation strategies that reflect the complexity of biological organization have been designed. These approaches interrogate abundance profiles together with regulatory elements, all expressed as interaction networks, thus allowing a one-step (descriptive) embodiment of wide-ranging cellular processes. Here, the seamless transition to computational Systems Biology becomes apparent, the ultimate goal of which is representing the dynamics of a phenotype in quantitative models capable of predicting the emergence of higher order molecular processes and functions arising from the interplay of the basic molecular entities that constitute a living cell. Bioinformatics for Omics data is certainly embedded in a highly complex technological and scientific environment, but it is also a component and driver of one of the most exciting developments in modern molecular biology. Thus, while this book seeks to provide practical guidelines, it hopefully also conveys a sense of the fascination associated with this research field.

This volume is structured in three parts. Part I provides central analysis strategies, standardization, and data management guidelines, as well as fundamental statistics for analyzing Omics profiles. Part II addresses bioinformatics approaches for specific Omics tracks, spanning genome, transcriptome, proteome, and metabolome levels. For each track, the conceptual and experimental background is provided, together with specific guidelines for handling raw data, including preprocessing and analysis. Part III presents examples of integrated Omics bioinformatics applications, complemented by case studies on biomarker and target identification in the context of human disease.

I wish to express my gratitude to all authors for their dedication in providing excellent chapters, and to John Walker, who initiated this project. As for any omissions or errors, the responsibility is mine. In any case, enjoy reading.

Vienna, Austria
Bernd Mayer
Contents

Preface ...... v
Contributors ...... ix

Part I  Omics Bioinformatics Fundamentals

 1  Omics Technologies, Data and Bioinformatics Principles ...... 3
    Maria V. Schneider and Sandra Orchard
 2  Data Standards for Omics Data: The Basis of Data Sharing and Reuse ...... 31
    Stephen A. Chervitz, Eric W. Deutsch, Dawn Field, Helen Parkinson, John Quackenbush, Philippe Rocca-Serra, Susanna-Assunta Sansone, Christian J. Stoeckert, Jr., Chris F. Taylor, Ronald Taylor, and Catherine A. Ball
 3  Omics Data Management and Annotation ...... 71
    Arye Harel, Irina Dalah, Shmuel Pietrokovski, Marilyn Safran, and Doron Lancet
 4  Data and Knowledge Management in Cross-Omics Research Projects ...... 97
    Martin Wiesinger, Martin Haiduk, Marco Behr, Henrique Lopes de Abreu Madeira, Gernot Glöckler, Paul Perco, and Arno Lukas
 5  Statistical Analysis Principles for Omics Data ...... 113
    Daniela Dunkler, Fátima Sánchez-Cabo, and Georg Heinze
 6  Statistical Methods and Models for Bridging Omics Data Levels ...... 133
    Simon Rogers
 7  Analysis of Time Course Omics Datasets ...... 153
    Martin G. Grigorov
 8  The Use and Abuse of -Omes ...... 173
    Sonja J. Prohaska and Peter F. Stadler

Part II  Omics Data and Analysis Tracks

 9  Computational Analysis of High Throughput Sequencing Data ...... 199
    Steve Hoffmann
10  Analysis of Single Nucleotide Polymorphisms in Case–Control Studies ...... 219
    Yonghong Li, Dov Shiffman, and Rainer Oberbauer
11  Bioinformatics for Copy Number Variation Data ...... 235
    Melissa Warden, Roger Pique-Regi, Antonio Ortega, and Shahab Asgharzadeh
12  Processing ChIP-Chip Data: From the Scanner to the Browser ...... 251
    Pierre Cauchy, Touati Benoukraf, and Pierre Ferrier
13  Insights Into Global Mechanisms and Disease by Gene Expression Profiling ...... 269
    Fátima Sánchez-Cabo, Johannes Rainer, Ana Dopazo, Zlatko Trajanoski, and Hubert Hackl
14  Bioinformatics for RNomics ...... 299
    Kristin Reiche, Katharina Schutt, Kerstin Boll, Friedemann Horn, and Jörg Hackermüller
15  Bioinformatics for Qualitative and Quantitative Proteomics ...... 331
    Chris Bielow, Clemens Gröpl, Oliver Kohlbacher, and Knut Reinert
16  Bioinformatics for Mass Spectrometry-Based Metabolomics ...... 351
    David P. Enot, Bernd Haas, and Klaus M. Weinberger

Part III  Applied Omics Bioinformatics

17  Computational Analysis Workflows for Omics Data Interpretation ...... 379
    Irmgard Mühlberger, Julia Wilflingseder, Andreas Bernthaler, Raul Fechete, Arno Lukas, and Paul Perco
18  Integration, Warehousing, and Analysis Strategies of Omics Data ...... 399
    Srinubabu Gedela
19  Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis ...... 415
    Paolo Tieri, Alberto de la Fuente, Alberto Termanini, and Claudio Franceschi
20  Network Inference from Time-Dependent Omics Data ...... 435
    Paola Lecca, Thanh-Phuong Nguyen, Corrado Priami, and Paola Quaglia
21  Omics and Literature Mining ...... 457
    Vinod Kumar
22  Omics–Bioinformatics in the Context of Clinical Data ...... 479
    Gert Mayer, Georg Heinze, Harald Mischak, Merel E. Hellemons, Hiddo J. Lambers Heerspink, Stephan J.L. Bakker, Dick de Zeeuw, Martin Haiduk, Peter Rossing, and Rainer Oberbauer
23  Omics-Based Identification of Pathophysiological Processes ...... 499
    Hiroshi Tanaka and Soichi Ogishima
24  Data Mining Methods in Omics-Based Biomarker Discovery ...... 511
    Fan Zhang and Jake Y. Chen
25  Integrated Bioinformatics Analysis for Cancer Target Identification ...... 527
    Yongliang Yang, S. James Adelstein, and Amin I. Kassis
26  Omics-Based Molecular Target and Biomarker Identification ...... 547
    Zhang-Zhi Hu, Hongzhan Huang, Cathy H. Wu, Mira Jung, Anatoly Dritschilo, Anna T. Riegel, and Anton Wellstein

Index ...... 573
Contributors

S. James Adelstein • Harvard Medical School, Harvard University, Boston, MA, USA
Shahab Asgharzadeh • Department of Pediatrics and Pathology, Keck School of Medicine, Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA, USA
Stephan J.L. Bakker • Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands
Catherine A. Ball • Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
Marco Behr • emergentec biodevelopment GmbH, Vienna, Austria
Touati Benoukraf • Université de la Méditerranée, Marseille, France; Centre d'Immunologie de Marseille-Luminy, Marseille, France; CNRS, UMR6102, Marseille, France; Inserm, U631, Marseille, France
Andreas Bernthaler • emergentec biodevelopment GmbH, Vienna, Austria
Chris Bielow • AG Algorithmische Bioinformatik, Institut für Informatik, Freie Universität Berlin, Berlin, Germany
Kerstin Boll • Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany
Pierre Cauchy • Inserm, U928, TAGC, Marseille, France; Université de la Méditerranée, Marseille, France
Jake Y. Chen • Indiana University School of Informatics, Indianapolis, IN, USA
Stephen A. Chervitz • Affymetrix, Inc., Santa Clara, CA, USA
Irina Dalah • Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Eric W. Deutsch • Institute for Systems Biology, Seattle, WA, USA
Ana Dopazo • Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain
Anatoly Dritschilo • Lombardi Cancer Center, Georgetown University, Washington, DC, USA
Daniela Dunkler • Section of Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
David P. Enot • BIOCRATES life sciences AG, Innsbruck, Austria
Raul Fechete • emergentec biodevelopment GmbH, Vienna, Austria
Pierre Ferrier • Centre d'Immunologie de Marseille-Luminy (CIML), Marseille, France
Dawn Field • NERC Centre for Ecology and Hydrology, Oxford, UK
Claudio Franceschi • 'L Galvani' Interdept Center, University of Bologna, Bologna, Italy
Alberto de la Fuente • CRS4 Bioinformatica, Parco Tecnologico SOLARIS, Pula, Italy
Srinubabu Gedela • Stanford University School of Medicine, Stanford, CA, USA
Gernot Glöckler • emergentec biodevelopment GmbH, Vienna, Austria
Martin G. Grigorov • Nestlé Research Center, Lausanne, Switzerland
Clemens Gröpl • Ernst-Moritz-Arndt-Universität Greifswald, Greifswald, Germany
Bernd Haas • BIOCRATES life sciences AG, Innsbruck, Austria
Jörg Hackermüller • Bioinformatics Group, Department of Computer Science, University of Leipzig, Leipzig, Germany; Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany
Hubert Hackl • Division for Bioinformatics, Innsbruck Medical University, Innsbruck, Austria
Martin Haiduk • emergentec biodevelopment GmbH, Vienna, Austria
Arye Harel • Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Georg Heinze • Section of Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
Merel E. Hellemons • Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands
Steve Hoffmann • Interdisciplinary Center for Bioinformatics and The Junior Research Group for Transcriptome Bioinformatics in the LIFE Research Cluster, University Leipzig, Leipzig, Germany
Friedemann Horn • Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany
Zhang-Zhi Hu • Lombardi Cancer Center, Georgetown University, Washington, DC, USA
Hongzhan Huang • Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE, USA
Mira Jung • Lombardi Cancer Center, Georgetown University, Washington, DC, USA
Amin I. Kassis • Harvard Medical School, Harvard University, Boston, MA, USA
Oliver Kohlbacher • Eberhard-Karls-Universität Tübingen, Tübingen, Germany
Vinod Kumar • Computational Biology, Quantitative Sciences, GlaxoSmithKline, King of Prussia, PA, USA
Hiddo J. Lambers Heerspink • Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands
Doron Lancet • Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Paola Lecca • The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy
Yonghong Li • Celera Corporation, Alameda, CA, USA
Henrique Lopes de Abreu Madeira • emergentec biodevelopment GmbH, Vienna, Austria
Arno Lukas • emergentec biodevelopment GmbH, Vienna, Austria
Gert Mayer • Department of Internal Medicine IV (Nephrology and Hypertension), Medical University of Innsbruck, Innsbruck, Austria
Harald Mischak • mosaiques diagnostics GmbH, Hannover, Germany
Irmgard Mühlberger • emergentec biodevelopment GmbH, Vienna, Austria
Thanh-Phuong Nguyen • The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy
Rainer Oberbauer • Medical University of Vienna and KH Elisabethinen Linz, Vienna, Austria
Soichi Ogishima • Department of Bioinformatics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
Sandra Orchard • EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Antonio Ortega • Department of Electrical Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA
Helen Parkinson • EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Paul Perco • emergentec biodevelopment GmbH, Vienna, Austria
Shmuel Pietrokovski • Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Roger Pique-Regi • Department of Human Genetics, University of Chicago, Chicago, IL, USA
Corrado Priami • The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy
Sonja J. Prohaska • Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
John Quackenbush • Department of Biostatistics, Dana-Farber Cancer Institute, Boston, MA, USA
Paola Quaglia • The Microsoft Research – University of Trento Centre for Computational and Systems Biology, Povo, Trento, Italy
Johannes Rainer • Bioinformatics Group, Division Molecular Pathophysiology, Medical University Innsbruck, Innsbruck, Austria
Kristin Reiche • Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany
Knut Reinert • AG Algorithmische Bioinformatik, Institut für Informatik, Freie Universität Berlin, Berlin, Germany
Anna T. Riegel • Lombardi Cancer Center, Georgetown University, Washington, DC, USA
Philippe Rocca-Serra • EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Simon Rogers • Inference Research Group, Department of Computing Science, University of Glasgow, Glasgow, UK
Peter Rossing • Steno Diabetes Center Denmark, Gentofte, Denmark
Marilyn Safran • Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Fátima Sánchez-Cabo • Genomics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, Spain
Susanna-Assunta Sansone • EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Maria V. Schneider • EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Katharina Schutt • Fraunhofer Institute for Cell Therapy and Immunology, Leipzig, Germany; Institute of Clinical Immunology, University of Leipzig, Leipzig, Germany
Dov Shiffman • Celera Corporation, Alameda, CA, USA
Peter F. Stadler • Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany
Christian J. Stoeckert Jr • Department of Genetics and Center for Bioinformatics, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
Hiroshi Tanaka • Department of Computational Biology, Graduate School of Biomedical Science, Tokyo Medical and Dental University, Tokyo, Japan
Chris F. Taylor • EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
Ronald Taylor • Computational Biology & Bioinformatics Group, Pacific Northwest National Laboratory, Richland, WA, USA
Alberto Termanini • 'L Galvani' Interdept Center, University of Bologna, Bologna, Italy
Paolo Tieri • 'L Galvani' Interdept Center, University of Bologna, Bologna, Italy
Zlatko Trajanoski • Division for Bioinformatics, Innsbruck Medical University, Innsbruck, Austria
Melissa Warden • Department of Pediatrics and Pathology, Keck School of Medicine, Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA, USA
Klaus M. Weinberger • BIOCRATES life sciences AG, Innsbruck, Austria
Anton Wellstein • Lombardi Cancer Center, Georgetown University, Washington, DC, USA
Martin Wiesinger • emergentec biodevelopment GmbH, Vienna, Austria
Julia Wilflingseder • Medical University of Vienna and KH Elisabethinen Linz, Vienna, Austria
Cathy H. Wu • Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE, USA
Yongliang Yang • Department of Radiology, Harvard Medical School, Harvard University, Boston, MA, USA; Center of Molecular Medicine, Department of Biological Engineering, Dalian University of Technology, Dalian, China
Dick de Zeeuw • Department of Nephrology, University Medical Center Groningen, Groningen, The Netherlands
Fan Zhang • Indiana University School of Informatics, Indianapolis, IN, USA
Part I Omics Bioinformatics Fundamentals
Chapter 1

Omics Technologies, Data and Bioinformatics Principles

Maria V. Schneider and Sandra Orchard

Abstract

We provide an overview of the state of the art of Omics technologies, the types of Omics data, and the bioinformatics resources relevant and related to Omics. We also illustrate the bioinformatics challenges of dealing with high-throughput data. This overview touches on several fundamental aspects of Omics and bioinformatics: data standardisation, data sharing, the appropriate storage of Omics data, and the exploration of Omics data in bioinformatics. Though the principles and concepts presented hold for the various technological fields, we concentrate on three main Omics fields, namely genomics, transcriptomics, and proteomics. Finally, we address the integration of Omics data and provide several useful links for bioinformatics and Omics.

Key words: Omics, Bioinformatics, High-throughput, Genomics, Transcriptomics, Proteomics, Interactomics, Data integration, Omics databases, Omics tools
1. Introduction

The last decade has seen an explosion in the amount of biological data generated by an ever-increasing number of techniques enabling the simultaneous detection of a large number of alterations in molecular components (1). The Omics technologies utilise these high-throughput (HT) screening techniques to generate the large amounts of data required to enable a system-level understanding of correlations and dependencies between molecular components. Omics techniques must be high throughput because they need to analyse very large numbers of genes, gene expression profiles, or proteins, either in a single procedure or in a combination of procedures. Computational analysis, i.e., the discipline now known as bioinformatics, is a key requirement for the study of the vast amounts of data generated. Omics requires the use of
techniques that can handle extremely complex biological samples in large quantities (i.e., at high throughput) with high sensitivity and specificity. Next-generation analytical tools require improved robustness, flexibility, and cost efficiency. All of these aspects are being continuously improved, potentially enabling institutes such as the Wellcome Trust Sanger Sequencing Centre (see Note 1) to generate thousands of millions of base pairs per day, rather than the current output of 100 million per day (http://www.yourgenome.org/sc/nt). However, all this data production makes sense only if one is equipped with the necessary analytical resources and tools to understand it. The evolution of laboratory techniques therefore has to occur in parallel with a corresponding improvement in the analytical methodology and tools used to handle the data.

The term Omics – a suffix signifying the measurement of the entire complement of a given level of biological molecules and information – encompasses a variety of new technologies that can help explain both normal and abnormal cell pathways, networks, and processes via the simultaneous monitoring of thousands of molecular components. Bioinformaticians use computers and statistics to perform extensive Omics-related research by searching biological databases and comparing gene sequences and proteins on a vast scale, to identify sequences or proteins that differ between diseased and healthy tissues or, more generally, between different phenotypes.

"Omics" spans an increasingly wide range of fields, from genomics (the quantitative study of protein-coding genes, regulatory elements, and noncoding sequences), transcriptomics (RNA and gene expression), proteomics (e.g., focusing on protein abundance), and metabolomics (metabolites and metabolic networks) to advances of the post-genomic era of biology and medicine: pharmacogenomics (the quantitative study of how genetics affects a host's response to drugs), physiomics (physiological dynamics and functions of whole organisms), nutrigenomics (a rapidly growing discipline that focuses on identifying the genetic factors that influence the body's response to diet and on how the bioactive constituents of food affect gene expression), phylogenomics (analysis involving genome data and evolutionary reconstructions, especially phylogenetics), and interactomics (molecular interaction networks).

Though in the remainder of this chapter we concentrate on a few selected examples of Omics technologies, much of what is said, for example about data standardisation, data sharing, storage, and analysis requirements, holds true for all of these technological fields. Large amounts of data have already been generated by these technologies, and this trend is increasing; for example, second- and third-generation sequencing technologies are leading to an exponential increase in the amount of sequencing data available. From a computational point of view, in order to address the
complexity of these data, understand molecular regulation, and gain the most from such a comprehensive set of information, knowledge discovery – the process of automatically searching large volumes of data for patterns – is a crucial step. This process of bioinformatics analysis includes: (1) data processing and molecule (e.g., protein) identification, (2) statistical data analysis, (3) pathway analysis, and (4) data modelling in a system-wide context. In this chapter we present some of these analytical methods and discuss ways in which data can be made accessible both to the specialised bioinformatician and, in particular, to the research scientist.

2. Materials

There are a variety of definitions of the term HT; however, we can loosely apply it to cases where automation is used to increase the throughput of an experimental procedure. HT technologies exploit robotics, optics, chemistry, biology, and image analysis research. The explosion in data production in the public domain is a consequence of falling equipment prices, the opening of major national screening centres, and new HT core facilities at universities and other academic institutes. The role of bioinformatics in HT technologies is of essential importance.

2.1. Genomics High-Throughput Technologies
High-Throughput Sequencing (HTS) technologies are used not only for traditional applications in genomics and metagenomics (see Note 2), but also for novel applications in the fields of transcriptomics, metatranscriptomics (see Note 3), epigenomics (see Note 4), and studies of genome variation (see Note 5). Next-generation sequencing platforms determine sequence data from amplified single DNA fragments and have been developed specifically to lend themselves to robotics and parallelisation. Current methods can directly sequence only relatively short (300–1,000 nucleotides long) DNA fragments in a single reaction. Short-read sequencing technologies dramatically reduce the sequencing cost. There were initial fears that the increase in quantity might result in a decrease in quality, and improvements in accuracy and read length are still being sought. Nevertheless, these advances have significantly reduced the cost of several sequencing applications, such as resequencing individual genomes (2) and readout assays (e.g., ChIP-Seq (3) and RNA-Seq (4)).
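Much of the routine bioinformatics work on HTS output begins with simple operations on short-read files. The following minimal sketch, assuming Phred+33 encoded FASTQ input and an invented file name and quality threshold, parses reads and discards those whose mean base quality is low; real pipelines use far more sophisticated per-base trimming and filtering.

```python
# Minimal sketch: filter short reads by mean base quality.
# Assumes Phred+33 encoded FASTQ; file name and threshold are
# illustrative only.

def parse_fastq(path):
    """Yield (header, sequence, quality) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break  # end of file
            seq = fh.readline().rstrip()
            fh.readline()  # the '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of a Phred+33 encoded quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)

kept = [rec for rec in parse_fastq("reads.fastq") if mean_phred(rec[2]) >= 20]
print(len(kept), "reads passed the quality filter")
```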
2.2. Transcriptomics High-Throughput Technologies
The transcriptome is the set of all messenger RNA (mRNA) molecules, or "transcripts", produced in one cell or a population of cells. Several methods have been developed to gain expression information at the high-throughput level.
Global gene expression analysis has been conducted either by hybridization with oligonucleotide microarrays or by counting of sequence tags. Digital transcriptomics with pyrophosphate-based ultra-high-throughput DNA sequencing of ditags represents a revolutionary approach to expression analysis, generating genome-wide expression profiles. ChIP-Seq is a technique that combines chromatin immunoprecipitation with sequencing technology to identify and quantify in vivo protein–DNA interactions on a genome-wide scale. Many of these applications are directly comparable to microarray experiments; for example, ChIP-chip and ChIP-Seq are, for all intents and purposes, the same (5). The most recent increase in data generation in this evolving field is due to novel cycle-array sequencing methods (see Note 6), also known as next-generation sequencing (NGS) and more commonly described as second-generation sequencing, which are already being used by technologies such as next-generation expressed-sequence-tag sequencing (see Note 7).

2.3. Proteomics High-Throughput Technologies
Proteomics is the large-scale study of proteins, particularly their expression patterns, structures, and functions, and various HT techniques are applied in this area. Here we explore two main proteomics fields: high-throughput mass spectrometry and protein–protein interactions (PPIs).
2.3.1. Mass Spectrometry High-Throughput Technologies
Mass spectrometry is an important emerging method for the characterization of proteins. It is also a rapidly developing field, currently moving towards large-scale quantification of specific proteins in particular cell types under defined conditions. The rise of gel-free protein separation techniques, coupled with advances in MS instrumentation sensitivity and automation, has provided a foundation for high-throughput approaches to the study of proteins. The identification of parent proteins from derived peptides now relies almost entirely on search engine software, which performs in silico digests of protein sequences to generate candidate peptides; their molecular masses are then matched to the masses of the experimentally derived protein fragments.
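To make the search-engine step concrete, the sketch below performs the in silico digest just described: it cleaves a protein with trypsin-like rules (after K or R, not before P) and computes monoisotopic peptide masses that could be matched against observed fragment masses. The example sequence is invented, and real search engines additionally handle missed cleavages, modifications, and scoring.

```python
# Sketch of an in silico tryptic digest with monoisotopic peptide
# masses. Residue masses are standard values; the toy protein
# sequence is illustrative only.

RESIDUE = {  # monoisotopic residue masses (Da)
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to every peptide

def tryptic_peptides(seq):
    """Cleave after K/R unless followed by P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    return sum(RESIDUE[aa] for aa in pep) + WATER

protein = "MKWVTFISLLLLFSSAYSRGVFRR"  # toy sequence
for pep in tryptic_peptides(protein):
    print(f"{pep:20s} {peptide_mass(pep):10.4f}")
```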
2.3.2. Interactomics HT Technologies
Studying protein–protein interactions provides valuable insights into many fields by helping to precisely define a protein's role inside a specific cell type, and many of the techniques commonly used to experimentally determine protein interactions lend themselves to high-throughput methodologies. Complementation assays (e.g., two-hybrid) measure the oligomerisation-assisted complementation of two fragments of a single protein which, when united, produce a simple biological readout; the two protein fragments are fused to the potential bait and prey interaction partners, respectively. This methodology is easily scalable to HT since it can
yield very high numbers of coding sequences assayed in a relatively simple experiment, and a wide variety of interactions can be detected and characterised following one single, commonly used protocol. However, the proteins are expressed in a heterologous cell system with a loss of temporal and physiological control of expression patterns, resulting in a large number of false-positive interactions. Affinity-based assays, such as affinity chromatography, pull-down, and coimmunoprecipitation, rely on the strength of the interaction between two entities. These techniques can be used on interactions which form under physiological conditions, but are only as good as the reagents and techniques used to identify the participating proteins. High-throughput mass spectrometry is increasingly used for the rapid identification of the participants in an affinity complex. Physical methods depend on the properties of molecules to enable measurement of an interaction, as typified by techniques such as X-ray crystallography and enzymatic assays. High-quality data can be produced, but highly purified proteins are required, which has always proved a rate-limiting step. The availability of automated chromatography systems and custom robotic systems that streamline the whole process, from cell harvesting and lysis through to sample clarification and chromatography, has changed this, and increasing amounts of data are being generated by such experiments.

2.4. Challenges in HT Technologies

It is now largely the case that high-throughput methods exist for all or most of the Omics domains. The challenge now is to prevent bottlenecks appearing in the storage, annotation, and analysis of the data. First, the data required to describe both how an experiment was performed and the results it generated must be defined. A place to store that information must be identified, a means by which it will be gathered has to be agreed upon, and ways in which the information will be queried, retrieved, and analysed must also be decided. Data in isolation is of limited use, so ideally the data format chosen should enable the combination and comparison of multiple datasets, both in-house and with other groups working in the same area. HT data is increasingly used in a broader context beyond the individual project; consequently, it is becoming more important to standardise and share this information appropriately and to pre-interpret it for scientists who were not involved in the experiment, whilst still making the raw data available for those who wish to perform their own analyses.
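As a toy illustration of defining "how an experiment was performed" in a structured, machine-readable form, the snippet below serialises a small experiment description to JSON. The field names loosely echo the spirit of minimum-information checklists such as MIAME (discussed under Subheading 3.4), but this is not an official serialization of any standard, and all values are invented.

```python
# Toy example: capturing experiment metadata in a structured record
# instead of free text, so that datasets from different groups can be
# combined and compared. Field names and values are illustrative only.

import json

experiment = {
    "design": "case-control expression comparison",
    "organism": "Homo sapiens",
    "samples": [
        {"id": "S1", "tissue": "cortex", "disease_state": "control"},
        {"id": "S2", "tissue": "cortex", "disease_state": "diseased"},
    ],
    "platform": "two-channel cDNA microarray",  # hypothetical
    "protocol_refs": ["labelling-v2", "hybridisation-standard"],
    "raw_data_files": ["S1.cel", "S2.cel"],
}

with open("experiment_metadata.json", "w") as out:
    json.dump(experiment, out, indent=2)
```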
2.5. Bioinformatics Concepts
In high-throughput research, knowledge discovery starts by collecting, selecting, and cleaning the data in order to fill a database. A database is a collection (archive) of consistent data files stored in a uniform and efficient manner. A relational database consists of a set of tables, each storing records (instances).
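The record/attribute model just described can be made concrete with a few lines of SQLite; the table and column names here are invented purely for illustration.

```python
# Minimal sketch of a relational table: each row is a record, each
# column an attribute, and the accession serves as a unique primary
# key. Schema and values are illustrative only.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE protein (
        accession  TEXT PRIMARY KEY,  -- unique identifier
        gene_name  TEXT NOT NULL,
        organism   TEXT NOT NULL,
        sequence   TEXT NOT NULL
    )
""")
con.execute(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    ("P04637", "TP53", "Homo sapiens", "MEEPQSDPSV..."),  # truncated
)
for row in con.execute("SELECT accession, gene_name FROM protein"):
    print(row)
```

The PRIMARY KEY constraint enforces the kind of unique identifier discussed next.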
A record is represented as a set of attributes, each of which defines a property of the record. Attributes are identified by their name and store a value. All records in a table have the same number and type of attributes. Database design is a crucial step, in which the data requirements of the application have first to be defined (conceptual design), including the entities and their relationships. Logical design is the implementation of the database using database management systems, which ensure that the process is scalable. Finally, the physical design phase estimates the workload and refines the database design accordingly. It is during this phase that table designs are optimized, indexing is implemented, and clustering approaches are tuned. These steps are fundamental to obtaining fast responses to frequent queries without jeopardising database integrity (e.g., through redundancy).

Primary or archived databases contain information directly deposited by submitters and give an exact representation of their published data, for example DNA sequences, DNA and protein structures, and DNA and protein expression profiles. Secondary or derived databases are so-called because they contain the results of analysis of the primary resources, including information on sequence patterns or motifs, variants and mutations, and evolutionary relationships.

The fundamental characteristic of a database record is a unique identifier. This is crucial in biology given the large number of situations where a single entity has many names, or one name refers to multiple entities. To some extent, this problem can be overcome by the use of an accession number, a primary key assigned by a reference database to describe the appearance of that entity in that database. For example, using the UniProtKB protein sequence database accession number of the human p53 gene products (P04637) gives information on the sequences of all the isoforms of these proteins and on gene and protein nomenclature, as well as a wealth of information about the protein's function and role in the cell. More than one protein sequence database exists, and the vast majority of protein sequences exist in all of these. Fortunately, resources to translate between these multiple accession numbers now exist, for example the Protein Identifier Cross-Reference (PICR) Service at the European Bioinformatics Institute (EBI) (see Note 8).

The Omics fields share with all of biology the challenge of handling ever-increasing amounts of complex information effectively and flexibly. A crucial step in bioinformatics is therefore to choose the appropriate representation of the data. One of the simplest but most efficient approaches has been the use of controlled vocabularies (CVs), which provide a standardised dictionary of terms for representing and managing information. Ontologies are structured CVs. An excellent example of this methodology is the Gene Ontology (GO), which describes gene products in terms of their associated biological processes, cellular
components and molecular functions in a species-independent manner. Substantial effort has been, and continues to be, put into the development and maintenance of the ontologies themselves; the annotation of gene products, which entails making associations between the ontologies and the genes and gene products across databases; and the development of tools that facilitate the creation, maintenance, and use of ontologies. The hierarchical nature of these CVs enables more meaningful queries to be made: for example, searching either a microarray or a proteomics experiment for expression patterns in the brain will include experiments annotated to the cortex, because the BRENDA tissue CV recognises the cortex as "part of" the brain (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BTO). Use of these CVs has been encouraged, and even made mandatory, by many groups such as the Microarray Gene Expression Data (MGED) group, which recommends the use of the MGED ontology (6) for the description of key experimental concepts and, where possible, of ontologies developed by other communities for describing terms such as anatomy, disease, and chemical compounds.

Clustering methods are used to identify patterns in the data, in other words to recognise what is similar, to identify what is different, and from there to know when differences are meaningful. These three steps are not trivial at all; proteins, for example, exhibit rich evolutionary relationships and complex molecular interactions, and hence present many challenges for computational sequence analysis. Sequence similarity refers to the degree to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity (the extent to which two nucleotide or amino acid sequences are invariant) and/or conservation (changes at a specific position of an amino acid or nucleotide sequence that preserve the physicochemical properties of the original residue). The applications of sequence similarity searching are numerous, ranging from the characterization of newly sequenced genomes, through phylogenetics, to species identification in environmental samples. However, it is important to keep in mind that identifying similarity between sequences is not necessarily equivalent to identifying other properties of such sequences, for example their function.

3. Methods

It is obvious that without bioinformatics it is impossible to make sense of the huge amounts of data produced in Omics research. If we look at the growth of the EMBL Nucleotide Sequence Database (EMBL-Bank), Release 105 of 27-AUG-2010 contained
195,241,608 sequence entries comprising 292,078,866,691 nucleotides. This translates to a total of 128 GB of compressed and 831 GB of uncompressed data. Bioinformatics must not only provide the structures in which to store the information, but also store it in such a way that it is retrievable and comparable, not only to similar data but also to other types of information. The challenges and concepts bioinformatics as a discipline currently encompasses do not essentially differ from those listed in (7); they have merely expanded to meet the challenges imposed by the volume of data produced. These include:

1. A precise, predictive model of transcription initiation and termination: the ability to predict where and when transcription will occur in a genome (fundamental for HTS and proteomics);
2. A precise, predictive model of RNA splicing/alternative splicing: the ability to predict the splicing pattern of any primary transcript in any tissue (fundamental for transcriptomics and proteomics);
3. Precise, quantitative models of signal transduction pathways: the ability to predict cellular responses to external stimuli (required in proteomics and pathway analysis);
4. Determination of effective protein:DNA, protein:RNA, and protein:protein recognition codes (important for recognition of interactions among the various types of molecules);
5. Accurate ab initio protein structure prediction (required for proteomics and pathway analysis);
6. Rational design of small-molecule inhibitors of proteins (chemogenomics);
7. Mechanistic understanding of protein evolution: understanding exactly how new protein functions evolve (comparative genomics);
8. Mechanistic understanding of speciation: molecular details of how speciation occurs (comparative genome sequences, sequence variation);
9. Continued development of effective gene ontologies – systematic ways to describe the functions of any gene or protein (genomics, transcriptomics, and proteomics).

The above list summarises general concepts required for multiple Omics data sources. Next we describe issues which are specific to one particular field but may have downstream consequences in other areas.

3.1. The Role of Bioinformatics in Genomics

Here we will explore two major challenges in genomics: de novo sequencing assembly and genome annotation.
3.1.1. De Novo Genome Sequencing
A critical stage in de novo genome sequencing is the assembly of shotgun reads, in other words putting together fragments randomly extracted from the sample to form a set of contiguous sequences (contigs) that represent the DNA in the sample. Algorithms are available for whole-genome shotgun fragment assembly, including Atlas (8), Arachne (9), Celera (10), PCAP (11), Phrap (http://www.phrap.org), and Phusion (12). All these programmes rely on the overlap-layout-consensus approach (13), in which all the reads are compared to each other in a pair-wise fashion. However, this approach presents several disadvantages, especially in the case of next-generation microread sequencing. EDENA (14) is the only microread assembler developed using computation of pairwise overlaps. Included reads, i.e., reads which align over their whole length onto another read, have to be removed from the graph; this means that mixed-length sequencing cannot be performed directly with an overlap graph. Short reads are either simply mapped onto long-read contigs or assembled separately (Daniel Zerbino, personal communication).

The use of a sequence graph to represent an assembly was introduced by Idury and Waterman (15), who presented an assembly algorithm for an alternative sequencing technique, sequencing by hybridisation, in which an oligoarray could detect all the k-nucleotide words, also known as k-mers, present in a given genome. Pevzner et al. (16) expanded on this idea, proposing a slightly different formalisation of the sequence graph, called the de Bruijn graph, whereby the k-mers are represented as arcs and overlapping k-mers join at their tips; they subsequently presented algorithms to build and correct errors in the de Bruijn graph (13), to use paired-end reads (16), and to handle short reads (17). Zerbino and Birney (18) developed a new set of algorithms, collectively called "Velvet", to manipulate de Bruijn graphs for the de novo assembly of microreads, and several studies have used Velvet (19–22). Other assembly software adopting the de Bruijn graph includes ALLPATHS (23) and SHORTY (24), which specialise in localising the use of paired-end reads, whereas ABySS (25, 26) successfully parallelised the construction of the de Bruijn graph, thus removing practical memory limitations on assemblies. The field of de novo assembly of NGS reads is constantly evolving, and there is not yet a firm process or best practice set in place.
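The graph construction at the heart of these assemblers can be sketched in a few lines: each k-mer drawn from the reads becomes an arc between its (k-1)-mer prefix and suffix, and unambiguous paths through the resulting graph spell out contigs. The toy code below assumes error-free reads and covers only the graph-building step; error correction, read pairing, and the memory-efficient data structures of real assemblers are omitted.

```python
# Minimal de Bruijn graph construction from short reads:
# nodes are (k-1)-mers, arcs are the k-mers observed in the reads.

from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ACGTACGA", "GTACGATT"]  # toy microreads
graph = de_bruijn(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", ", ".join(successors))
```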
3.1.2. Genome Annotation
Genome annotation is the process of marking the genes and other biological features in a DNA sequence. It consists of two main steps: (1) gene finding, i.e., identifying elements on the genome, and (2) attaching biological information to these elements. Automatic annotation tools perform all of this by computer analysis, as opposed to manual annotation, which involves human expertise. Ideally, these approaches coexist and complement each
other in the same annotation pipeline. The basic level of annotation uses BLAST to find similarities and annotates genomes based on the matches found. However, nowadays more and more additional information is added to the annotation platform. Structural annotation consists of the identification of genomic elements: ORFs and their localisation, gene structure, coding regions, and the location of regulatory motifs. Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, involved regulation and interactions, and expression. These steps may involve both biological experiments and in silico analysis, and are often initially performed in related databases, usually protein sequence databases such as UniProtKB, and then transferred back onto the genomic sequence. A variety of software tools have been developed to permit scientists to view and share genome annotations. The additional information allows manual annotators to disentangle discrepancies between genes that have been given conflicting annotations. For example, the Ensembl genome browser relies on curated data sources as well as a range of different software tools in its automated genome annotation pipeline (27).

Genome annotation remains a major challenge for many genome projects. The identification of the location of genes and other genetic control elements is frequently described as defining the biological "parts list" for the assembly and normal operation of an organism. Researchers are still at an early stage in the process of delineating this parts list, as well as trying to understand how all the parts "fit together".

3.2. The Role of Bioinformatics in Transcriptomics

Both microarray and proteomics experiments provide long lists of gene products (mRNAs and proteins, respectively) co-expressed at any one time, and the challenge is to give biological relevance to these lists. Several different computational algorithms have been developed and can be usefully applied at various steps of the analytical pipeline. Clustering methods are used to order and visualise the underlying patterns in large-scale expression datasets: transcripts showing similar patterns can be grouped according to their co-regulation/co-expression (e.g., at specific developmental times or cellular/tissue locations). This indicates (1) that co-regulated transcripts might be functionally related, and (2) that the clusters represent a natural structure of the data. Transcripts can also be grouped by their known or predicted function. A resource commonly used for this is the Gene Ontology (http://www.geneontology.org). There are several bioinformatics tools for calculating the number of significantly enriched GO terms, for example: (1) GOMiner (http://discover.nci.nih.gov/gominer), which generates a summary of GO terms that are significantly enriched in a user-input list of protein accession numbers when compared to a reference database such as UniProtKB/Swiss-Prot;
(2) GO slims, which are subsets of terms from the whole Gene Ontology and are particularly useful for giving a summary of the results of GO annotation of a genome, microarray, or proteomics dataset (http://amigo.geneontology.org/cgi-bin/amigo/go.cgi).
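The statistic underlying such enrichment tools is typically a hypergeometric (Fisher-type) test: given N annotated genes of which K carry a given GO term, it asks how probable it is to observe at least k such genes in a list of n. A minimal sketch with invented numbers follows; real tools additionally correct for the many GO terms tested in parallel.

```python
# Hypergeometric enrichment test for one GO term.
# All counts below are invented for illustration.

from scipy.stats import hypergeom

N = 20000  # genes in the reference set (e.g. UniProtKB/Swiss-Prot)
K = 300    # reference genes annotated with the GO term
n = 150    # genes in the experimental list
k = 12     # list genes carrying the term

# P(X >= k) is the survival function evaluated at k - 1
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment P-value: {p_value:.3g}")
```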
3.3.2. Protein–Protein Interaction Analysis and Comparative Interactomics
The use of different bioinformatics approaches to determine the presence of a gene or open reading frame (ORF) in those genomes can lead to divergent ORF annotations (even for data generated from the same genomic sequences). It is therefore crucial to use the correct dataset for protein sequence translations. One method for confirming a correct protein sequence is mass spectrometry based proteomics, in particular by de novo sequencing which does not rely on pre-existing knowledge of a protein sequence. However, historically, there has initially been no method for publishing these protein sequences, except as long lists reported directly with the article or included on the publisher’s website as supplementary information. In either case, these lists are typically provided as PDF or spreadsheet documents with a custom-made layout, making it practically impossible for computer programmes to interpret them, or efficiently query them. A solution to this problem is provided by the PRIDE database (http://www.ebi.ac. uk/pride) which provides a standards compliant, public repository for mass spectrometry based proteomics, giving access to experimental evidence that a transcribed gene product does exist, as well as the pattern of tissues in which it is expressed (28). The annotation of protein functional information largely relies on manual curation, namely biologists reading the scientific literature and transferring the information to computational records – a process in which the UniProtKB curators have lead the way for many years. The many proteins for which functional information is not available, however, rely on selected information being transferred from closely related orthologues in other species. A number of protein signature databases now exist, which create algorithms to recognise these closely related protein families or domains within proteins. These resources have been combined in a single database, Interpro (http://www.ebi.ac.uk/ interpro) and the tool InterProScan (see Note 9) (http://www. ebi.ac.uk/Tools/InterProScan) is available for any biologist wishing to perform their own automated protein (or gene) annotation (29). Protein–protein interactions are generally represented in graphical networks with nodes corresponding to the proteins and edges to the interactions. Although edges can vary in length most networks represent undirected and only binary interactions. Bioinformatics tools and computational biology efforts into graph theory methods have and continue to be part of the knowledge
14
Schneider and Orchard
discovery process in this field. Analysis of PPI networks involves many challenges, due to the inherent complexity of these networks, high noise level characteristic of the data, and the presence of unusual topological phenomena. A variety of data-mining and statistical techniques have been applied to effectively analyze PPI data and the resulting PPI networks. The major challenges for computational analysis of PPI networks remain: 1. Unreliability of large scale experiments; 2. Biological redundancy and multiplicity: a protein can have several different functions; or a protein may be included in one or more functional groups. In such instances overlapping clusters should be identified in the PPI networks, however since conventional clustering methods generally produce pairwise disjoint clusters, they may not be effective when applied to PPI networks; 3. Two proteins with different functions frequently interact with each other. Such frequent connections between the proteins in different functional groups expand the topological complexity of the PPI networks, posing difficulties to the detection of unambiguous partitions. Intensive research trying to understand and characterise the structural behaviours of such systems from a topological perspective have shown that features such as small-world properties (any two nodes can be connected via a short path of a few links), scalefree degree distributions (power-law degree distribution indicating that a few hubs bind numerous small nodes), and hierarchical modularity (hierarchical organization of modules) suggests that a functional module in a PPI network represents a maximal set of functionally associated proteins. In other words, it is composed of those proteins that are mutually involved in a given biological process or function. In this model, the significance of a few hub nodes is emphasized, and these nodes are viewed as the determinants of survival during network perturbations and as the essential backbone of the hierarchical structure. The information retrieved from HT interactomics data could be very valuable as a means to obtain insights into a systems evolution (e.g. by comparing the organization of interaction networks and by analyzing their variation and conservation). Likewise, one could learn whether and how to extend the network information obtained experimentally in well-characterised model systems onto different organisms. Cesareni et al. (30) concluded that, despite the recent completion of several high throughput experiments aimed at the description of complete interactomes, the available interaction information is not yet of sufficient coverage and quality to draw any biologically meaningful conclusion from the comparison of different interactomes. The development of more
Omics Technologies, Data and Bioinformatics Principles
15
accurate experimental and informatics approaches is required to allow us to study network evolution. 3.4. Storing Omics Data Appropriately
The massive amounts of data produced in Omics experiments can help us gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analyzed. Data has to be stored in a structured and standardized format that enables data sharing between multiple resources, as well as common tool development and the ability to merge data sets generated by different technologies. Omics is very much technology driven, and all instrument and software manufacturers initially produce data in their own proprietary formats, often then tying customers into a limited number of downstream analytical instruments. Efforts have been ongoing for many years to develop and encourage the development of common formats to enable data exchange and standardized methods for the annotation of such data to allow dataset comparison. These efforts were spear-headed by the transcriptomics community, who developed the MIAME standards (Minimum Information About a Microarray Experiment, http://www.mged. org/Workgroups/MIAME/miame.html) (31). The MIAME standards describe the set of information sufficient to interpret a microarray experiment and its results unambiguously, to enable verification of the data and potentially to reproduce the experiment itself. Their lead was soon followed by the proteomics community with the MIAPE standards (Minimum Information About a Proteomics Experiment, http://www.psidev.info/index. php?q=node/91), the interaction community (MIMIx, http:// imex.sourceforge.net/MIMIx) and many others. This has resulted in the development of tools which can combine datasets, for example it is possible to import protein interaction data into the visualisation tool Cytoscape (http://www.cytoscape.org) in a common XML format (PSI-MI) and overlay this with expression data from a microarray experiment.
3.5. Exploring Omics Data in Bioinformatics
Below we will follow the three Omics fields we described above. It would be impossible to list all the databases dealing with these data, however as the European Bioinformatics Institute hosts one of the most comprehensive sets of bioinformatics databases and also actively coordinates or is involved in setting standards and their implementation, it serves as exemplar for databases that are at the state of the art for standards, technologies and integration of the data. A list of major Institutes and their databases is provided at the end of this chapter (see Note 18).
3.5.1. Genomics
The genome is a central concept at the heart of biology. Since the first complete genome was sequenced in the mid-1990s, over 800
more have been sequenced, annotated, and submitted to the public databases. New ultra-high-throughput sequencing technologies are now beginning to generate complete genome sequences at an accelerating rate, both to gap-fill portions of the taxonomy where no genome sequence has yet been deciphered (e.g., the GEBA project, http://www.jgi.doe.gov/programs/GEBA, which aims to sequence 6,000 bacteria from taxonomically distinct clades), and to generate data on variation in populations of species of special interest (e.g., the 1000 Genomes Project in human, http://www.1000genomes.org, and the 1001 Genomes Project in Arabidopsis, http://www.1001genomes.org). In addition, modern sequencing technologies are increasingly being used to generate data on gene regulation and expression on a genome-wide scale.

The vast amount of information associated with the genomic sequence demands a way to organise and access it (see Note 19). A successful example of this is the genome browser Ensembl. Ensembl (http://www.ensembl.org) is a joint project between the EBI and the Wellcome Trust Sanger Institute that annotates chordate genomes (i.e., vertebrates and closely related invertebrates with a notochord, such as sea squirt). Gene sets from model organisms such as yeast and fly are also imported for comparative analysis by the Ensembl "compara" team. Most annotation is updated every 2 months; however, the gene sets are determined about once a year. A new browser, http://www.ensemblgenomes.org, has now been set up to access non-chordate genomes from bacteria, plants, fungi, metazoa, and protists. Ensembl provides genes and other annotation such as regulatory regions, base pairs conserved across species, and mRNA and protein mappings to the genome. Ensembl condenses many layers of genome annotation into a simplified view for the ease of the user.

The Ensembl gene set reflects a comprehensive transcript set based on protein and mRNA evidence in the UniProt and NCBI RefSeq databases (see Note 10). These proteins and mRNAs are aligned against a genomic sequence assembly imported from a relevant sequencing centre or consortium. Transcripts are clustered into the same gene if they have overlapping coding sequence. Each transcript is given a list of the mRNAs and proteins it is based upon. Ensembl utilises BioMart, a query-optimised database for efficient data mining (described below), and a comparative analysis pipeline, Compara. The Ensembl Compara multi-species database stores the results of genome-wide species comparisons calculated for each data release, including (1) comparative genomics (whole-genome alignments and synteny regions) and (2) comparative proteomics (orthologue and paralogue predictions).
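As a toy version of the transcript-clustering rule quoted above, the sketch below merges transcripts whose genomic intervals overlap into a single "gene" cluster. The coordinates are invented, and the actual Ensembl pipeline compares coding sequence rather than simple spans.

```python
# Cluster transcripts into genes by interval overlap.
# Coordinates are invented for illustration.

def cluster_transcripts(intervals):
    """Merge overlapping (start, end) intervals into clusters."""
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start <= clusters[-1][1]:
            # overlaps the current cluster: extend it
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return clusters

transcripts = [(100, 500), (450, 900), (2000, 2600), (2500, 3100)]
print(cluster_transcripts(transcripts))  # -> [[100, 900], [2000, 3100]]
```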
Ensembl Compara includes GeneTrees, a comprehensive gene-oriented phylogenetic resource. It is based on a computational pipeline that handles clustering, multiple alignment, and tree generation, including the handling of large gene families. Ensembl also imports variations, including single nucleotide polymorphisms (SNPs) and insertion–deletion mutations (indels), together with their flanking sequences, from various sources. These sequences are aligned to the reference sequence, from which the positions of the variations are calculated, along with any effects on transcripts in the surrounding region. The majority of variations are obtained from NCBI dbSNP. For human, other sources include Affymetrix GeneChip arrays, the European Genome-phenome Archive, and whole-genome alignments of the individual sequences of Venter (32), Watson (33), and the Celera individuals (34). Sources for other species include Sanger re-sequencing projects for mouse and alignments of sequences from the STAR consortium for rat. Ancestral alleles from dbSNP were determined through a comparison of human and chimpanzee DNA (35).

3.5.2. Transcriptomics
There is a wide range of HT transcriptomics data: single- and dual-channel microarray-based experiments measuring mRNA, miRNA, and non-coding RNA generally. One can also include non-array techniques such as serial analysis of gene expression (SAGE). There are three main public repositories for microarray-based studies: ArrayExpress (36), Gene Expression Omnibus (37), and CIBEX (38). Here we describe the EBI microarray repository, ArrayExpress, which consists of three components:
● the ArrayExpress Repository – a public archive of functional genomics experiments and supporting data;
● the ArrayExpress Warehouse – a database of gene expression profiles and other bio-measurements;
● the ArrayExpress Atlas – a new summary database and meta-analytical tool of ranked gene expression across multiple experiments and different biological conditions.
The Warehouse and Atlas allow users to query for differentially expressed genes by gene names and properties, by experimental conditions and sample properties, or by a combination of both (39). The recently developed ArrayExpress Atlas of Gene Expression (http://www.ebi.ac.uk/microarray-as/atlas) allows the user to query for condition-specific gene expression across multiple data sets. The user can query for a gene or a set of genes by name, synonym, Ensembl identifier, or GO term, or, alternatively, for a biological sample property or condition (e.g. tissue type, disease name, developmental stage, compound name or identifier). Queries for both genes and conditions are also possible (e.g. the user can query for all "DNA repair" genes up-regulated in
cancer, which returns a list of (experiment, condition, gene) triplets, each with a P-value and an up/down arrow characterising the significance and direction of the gene's differential expression in a particular condition in an experiment). ArrayExpress accepts data generated on all array-based technologies, including gene expression, protein array, ChIP-chip, and genotyping. More recently, data from transcriptomic and related applications of uHTS technologies such as Illumina (Solexa Ltd, Saffron Walden, UK) and 454 Life Sciences (Roche, Branford, Connecticut) are also accepted. For Solexa data, FASTQ files, sample annotation, and processed data files corresponding to transcription values per genomic location are submitted and curated to the emerging MINSEQE standard (http://www.mged.org/minseqe), and instrument-level data are stored in the European Short Read Archive (http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html). The ArrayExpress Warehouse now includes gene expression profiles from in situ gene expression measurements, as well as other molecular measurement data from metabolomics and protein profiling technologies. Where in situ and array-based gene expression data are available for the same gene, these are displayed in the same view, and links are provided to the multi-species 4DXpress database of in situ gene expression (39). The Gene Expression Atlas provides a statistically robust framework for integrating gene expression results across different platforms at a meta-analytical level. It also offers a simple interface for identifying strong differential-expression candidate genes in conditions of interest. The Atlas further integrates ontologies for high-quality annotation of gene and sample attributes and builds new summarised views of gene expression, with the aim of supporting analysis of putative signalling pathway targets, discovery of correlated gene expression patterns, and identification of condition- or tissue-specific patterns of gene expression. A list of URLs for bioinformatics resources relevant to transcriptomics can be found in Subheading 4 (see Note 20).
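The triplet structure described above lends itself to straightforward downstream filtering. The sketch below uses invented triplets to show the kind of query the Atlas interface answers over its real index, such as genes up-regulated in a condition below a chosen significance threshold.

```python
# Hypothetical triplets of the kind the Atlas returns:
# (experiment accession, condition, gene, p-value, direction)
triplets = [
    ('E-0001', 'cancer', 'BRCA1', 0.003, 'up'),
    ('E-0001', 'cancer', 'TP53', 0.20, 'up'),
    ('E-0002', 'normal liver', 'BRCA1', 0.01, 'down'),
]

def query(triplets, condition, direction, alpha=0.05):
    """Return triplets matching a condition, regulated in the given
    direction, at significance threshold alpha."""
    return [t for t in triplets
            if t[1] == condition and t[4] == direction and t[3] < alpha]

for exp, cond, gene, p, d in query(triplets, 'cancer', 'up'):
    print(exp, cond, gene, p, d)   # -> E-0001 cancer BRCA1 0.003 up
```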
3.5.3. Proteomics

A list of proteomics-relevant bioinformatics resources can be found in Note 21.
3.5.3.1. Protein Sequence and Functional Annotation
Translated proteins and their co-translational and post-translational modifications (PTMs) are the backbone of proteomics (28). UniProt is the most comprehensive data repository on protein sequence and functional annotation. It is maintained by a collaboration among the Swiss Institute of Bioinformatics (SIB), the Protein Information Resource (PIR), and the EBI. It has four components, each optimized for different user profiles: 1. The UniProt Knowledgebase (UniProtKB) comprises two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
(a) UniProtKB/Swiss-Prot contains high-quality annotation, extracted from the literature and from computational analyses, curated by experts. Annotations include, among others: protein function(s), protein domains and sites, PTMs, subcellular location(s), tissue specificity, structure, interactions, and diseases associated with deficiencies or abnormalities.

(b) UniProtKB/TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL/GenBank/DDBJ nucleotide sequence databases, excluding some types of data such as pseudogenes. UniProtKB/TrEMBL records are annotated automatically based on computational analyses.

2. UniProt Reference Clusters (UniRef) provides clustered sets of all sequences from the UniProtKB database and selected UniProt Archive records, giving complete coverage of sequences at different resolutions (100, 90, and 50% sequence identity) while hiding redundant sequences.

3. The UniProt Archive (UniParc) is a repository that reflects the history of all protein sequences.

4. The UniProt Metagenomic and Environmental Sequences database (UniMES) contains data from metagenomic projects such as the Global Ocean Sampling expedition.

UniProtKB includes cross-references to over 120 external databases, including the Gene Ontology (GO), InterPro (protein families and domains), PRIDE (protein identification data), IntEnz (enzymes) (see Note 11), OMIM (the Online Mendelian Inheritance in Man database) (see Note 12), interaction databases (e.g. IntAct, DIP, MINT; see Note 21), Ensembl, several genomic databases for potential pathogens (e.g. EchoBase, EcoGene, LegioList; see Note 13), the European Hepatitis C Virus database (http://euhcvdb.ibcp.fr/euHCVdb), and others.
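UniProtKB entries can also be retrieved programmatically. At the time of writing, www.uniprot.org serves individual records at REST-style addresses of the form /uniprot/&lt;accession&gt;.&lt;format&gt;; the sketch below assumes that URL scheme (and a working network connection), and uses the human haemoglobin alpha subunit accession purely as an example.

```python
import urllib.request

def fetch_uniprot(accession, fmt='txt'):
    """Fetch one UniProtKB entry in flat-file ('txt') or 'fasta' format,
    assuming the REST-style URLs exposed by www.uniprot.org."""
    url = 'http://www.uniprot.org/uniprot/%s.%s' % (accession, fmt)
    with urllib.request.urlopen(url) as handle:
        return handle.read().decode('utf-8')

entry = fetch_uniprot('P69905')   # human haemoglobin subunit alpha
# Print only the feature (FT) and cross-reference (DR) lines,
# i.e. the annotation types discussed in the text.
for line in entry.splitlines():
    if line.startswith(('FT', 'DR')):
        print(line)
```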
3.5.3.2. Mass Spectrometry Repositories

Several repositories have been established to store protein and peptide identifications derived from MS, the main method for the identification and quantification of proteins (28). There are two main repositories for MS data in proteomics:
● the Proteomics IDEntifications database (PRIDE, http://www.ebi.ac.uk/pride);
● Peptidome (http://www.ncbi.nlm.nih.gov/peptidome);
and a number of related resources, such as PeptideAtlas (http://www.peptideatlas.org) and the Global Proteome Machine (http://www.thegpm.org/GPMDB), which take deposited raw data for reanalysis in their own pipelines.
These all serve as web-based portals for data mining, data visualisation, data sharing, and cross-validation in the field.
The Proteomics IDEntifications (PRIDE) database has been built to turn publicly available data, buried in numerous academic publications, into publicly accessible data. PRIDE is fully compliant with the standards released by the HUPO-PSI and makes extensive use of controlled vocabularies such as Taxonomy, the BRENDA Tissue Ontology, and the Gene Ontology; direct access to PRIDE data organised by species, tissue, subcellular location, disease state, and project name can thus be obtained via the "Browse Experiments" menu item. PRIDE remains the most complete database in terms of the metadata associated with peptide identifications, since it contains numerous experimental details of the protocols followed by the submitters (28). The detailed metadata in PRIDE have enabled analyses of large datasets, which have proven to yield very interesting information for the field (28). PRIDE uses Tranche (see Note 14) to allow the sharing of massive data files, currently including search engine output files and binary raw data from mass spectrometers, which can be accessed via a hyperlink from PRIDE. As a member of the ProteomeXchange consortium, PRIDE will make both the annotated metadata and the raw spectral data available, via Tranche, to related analytical pipelines such as PeptideAtlas (see Note 15) and the Global Proteome Machine (see Note 16).

3.5.3.3. Protein–Protein Interactions and Interactomics
Both the number of laboratories producing PPI data and the size of such experiments continue to increase, and a number of repositories exist to collect these data (see Note 22). Here we explore IntAct, a freely available, open-source database system with analysis tools for molecular interaction data derived from the literature or from direct user submissions. IntAct follows a deep curation model, capturing a high level of detail from the experimental reports in the full text of each publication. Queries may be performed on the website, with the data initially presented as a list of binary interaction evidences. Users can access the individual pieces of evidence that describe the interaction of two specific molecules, and can filter result sets (e.g. by interaction detection method) to retain only user-defined evidences. For convenience, evidence pertaining to the same interactors is grouped together in the binary interaction evidence table. Any dataset can be downloaded in both PSI-MI XML and tab-delimited MITAB format, providing end users with the highest level of detail without compromising the integrity and simplicity of access to the data (40). IntAct is also involved in a major data exchange collaboration driven by the major public interaction data providers (listed at the end of this chapter): the partners of the International Molecular Exchange Consortium (IMEx, http://imex.sourceforge.net) share curation efforts and exchange completed records of molecular interaction data. Curation has been aligned to a common standard, as detailed in the curation manuals of the individual databases and
summarised in the joint curation manual available at http://imex.sourceforge.net. IMEx partner databases request that the set of minimum information about a molecular interaction experiment (MIMIx) be provided with each data deposition (41). The use of common data standards encourages the development of tools that exploit these formats. For example, Cytoscape (http://www.cytoscape.org) is an open-source bioinformatics software platform for visualising molecular interaction networks and integrating these interactions with gene expression profiles and other state data; data from resources such as IntAct can be visualised in Cytoscape and combined with other datasets. The value of the information obtained from comparing networks depends heavily on both the quality of the data used to assemble the networks and the coverage of those networks (30, 42). The most comprehensive studies are in Saccharomyces cerevisiae; however, it should be noted that two comparable, "comprehensive" experiments, performed in parallel by two different groups using the same approach (tandem affinity purification), ended up with fewer than 30% of the interactions discovered by each group in common (43), suggesting that coverage is far from complete.
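Because IMEx partners exchange records in the tab-delimited PSI-MITAB format, rough comparisons of the kind just cited are straightforward to script. The sketch below assumes only that the first two MITAB columns carry the interactor identifiers and treats each interaction as an unordered pair; the input file names are hypothetical.

```python
def mitab_pairs(path):
    """Read a PSI-MITAB file and return its interactions as a set of
    unordered identifier pairs (taken from columns 1 and 2)."""
    pairs = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith('#') or not line.strip():
                continue  # skip headers and blank lines
            a, b = line.rstrip('\n').split('\t')[:2]
            pairs.add(frozenset((a, b)))
    return pairs

# Hypothetical exports of two screens of the same interactome.
screen1 = mitab_pairs('screen1.mitab')
screen2 = mitab_pairs('screen2.mitab')
shared = screen1 & screen2
print('%d of %d interactions in screen 1 also seen in screen 2 (%.0f%%)'
      % (len(shared), len(screen1), 100.0 * len(shared) / max(len(screen1), 1)))
```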
3.6. Integration of Omics Data

In the Omics field, several efforts have been made, and continue to be made, to create computational tools for integrating Omics data. These need to address three different aspects of integration (44):

1. identifying the network scaffold by delineating the connections that exist between cellular components;
2. decomposing the network scaffold into its constituent parts in an attempt to understand the overall network structure;
3. developing cellular or system models to simulate and predict the network behaviour that gives rise to particular cellular phenotypes.

As we have seen in the previous section, there are significant challenges associated with modern post-genomics data sets:

1. many technological platforms, both hardware and software, are available for the various Omics data types, but some of them are prone to introducing technical artefacts;
2. standardized data representations are not always adopted, which complicates cross-experiment comparisons;
3. data quality, context, and lab-to-lab variation represent another important hurdle that must be overcome in genome-scale science.

The spread of Omics data across a wide variety of formats thus poses technical challenges for integrating data sets and migrating them across platforms. One of the important
techniques often used is XML, a document markup language that is comparatively easy to learn, retrieve, store, and transmit, and that is semantically richer than HTML (45). Here we present three infrastructures that have been used for, and represent different approaches to, the integration of Omics data: BioMart, Taverna, and the BII infrastructure.

3.6.1. BioMart
BioMart is a query-oriented DBMS developed jointly by the Ontario Institute for Cancer Research and the EBI. BioMart (http://www.biomart.org) is particularly suited to providing "data mining"-like searches of complex descriptive data. It can be used with any type of data, as shown by some of the resources currently powered by BioMart: Ensembl, UniProt, InterPro, HGNC, Rat Genome Database, ArrayExpress DW, HapMap, GermOnLine, PRIDE, PepSeeker, VectorBase, HTGT, and Reactome. BioMart comes with an "out of the box" website that can be installed, configured, and customised according to user requirements. Further access is provided by graphical and text-based applications, or programmatically using web services or APIs written in Perl and Java. BioMart has built-in support for query optimisation and data federation, and it can also be configured to work as a DAS 1.5 annotation server. The process of converting a data source into BioMart format is fully automated by the tools included in the package. Currently supported RDBMS platforms are MySQL, Oracle, and PostgreSQL. BioMart is completely open source, licensed under the LGPL, and freely available to anyone without restrictions (46).
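As a sketch of programmatic access, the snippet below posts an XML query document to the central martservice interface; the endpoint URL, dataset, filter, and attribute names reflect the publicly documented Ensembl BioMart configuration at the time of writing, and should be checked against the current documentation before use.

```python
import urllib.parse
import urllib.request

# An XML query against the (assumed) Ensembl human gene dataset,
# asking for gene identifiers and names on chromosome 21.
query = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="external_gene_id"/>
  </Dataset>
</Query>"""

url = 'http://www.biomart.org/biomart/martservice'
data = urllib.parse.urlencode({'query': query}).encode('ascii')
with urllib.request.urlopen(url, data) as handle:
    for line in handle.read().decode('utf-8').splitlines()[:5]:
        print(line)   # tab-separated gene id / gene name rows
```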
3.6.2. Taverna
The Taverna workbench (http://taverna.sourceforge.net) is a free software tool for designing and executing workflows, created by the myGrid project (http://www.mygrid.org.uk/tools/taverna) and funded through OMII-UK (http://www.omii.ac.uk). Taverna allows users to integrate many different software tools, including web services from many different domains. Bioinformatics services include those provided by the National Center for Biotechnology Information, the EBI, the DNA Data Bank of Japan, Soaplab, BioMOBY, and EMBOSS (see Note 17). Effectively, Taverna allows a scientist with a limited computational background and limited technical resource support to construct highly complex analyses over public and private data and computational resources, all from a standard PC, UNIX box, or Apple computer. A successful example of using Taverna in Omics is the work of Li et al. (47), in which the authors describe a workflow involving the statistical identification of differentially expressed genes from microarray data, followed by the annotation of their relationships to cellular processes. They show that Taverna can be used by data analysis experts as a
generic tool for composing ad hoc analyses of quantitative data, by combining scripts written in the R programming language with tools exposed as services in workflows (47).

3.6.3. BII Infrastructure
As we have seen, it is now possible to run complex multi-assay studies using a variety of Omics technologies, for example determining the effect of a compound on a number of subjects by characterising a metabolic profile and measuring tissue-specific protein and gene expression (by mass spectrometry and DNA microarrays, respectively), alongside conventional histological analysis. It is essential that such complex metadata (i.e. sample characteristics, study design, assay execution, sample–data relationships) are reported in a standard manner in order to correctly interpret the final results (data) that they contextualise. Relevant EBI systems, such as ArrayExpress, PRIDE, and ENA-Reads (the European Nucleotide Archive (ENA) accepts data generated by NGS methodologies such as 454, Illumina, and ABI SOLiD), are built to store microarray-based, proteomics, and NGS-based experiments, respectively. However, these systems have different submission and download formats, and diverse representations of the metadata and terminologies used. Nothing yet exists to archive metabolomics-based assays and other conventional biomedical/environmental assays. The BioInv Index (BioInvestigation Index, http://www.ebi.ac.uk/net-project/projects.html) infrastructure (BII) aims to fill this gap, and to be a single entry point for researchers wishing to deposit their multi-assay studies and datasets and/or to easily download similar datasets. This infrastructure provides a common way of representing and storing the experimental metadata of biological, biomedical, and environmental studies. Although it relies on other EBI production systems, the BII infrastructure shields users from their diverse formats and ontologies by progressively implementing, in its editor tool, integrative cross-domain "standards" such as MIBBI, the OBO Foundry, and ISA-TAB. A prototype instance is up and running at http://www.ebi.ac.uk/bioinvindex/home.seam.
4. Notes

1. Wellcome Trust Sanger Sequencing Centre: The Sanger Institute is a genome research institute primarily funded by the Wellcome Trust. The Institute uses large-scale sequencing, informatics, and analysis of genetic variation to further improve our understanding of gene function in health and disease, and to generate data and resources of lasting value to biomedical research; see http://www.sanger.ac.uk.
2. Metagenomics: The term indicates the study of metagenomes, genetic material recovered directly from environmental samples. It is also used generically for environmental genomics, ecogenomics, or community genomics. Metagenomics data can be submitted to and stored in appropriate databases (see http://www.ncbi.nlm.nih.gov/Genbank/metagenome.html and http://www.ebi.ac.uk/genomes/wgs.html).

3. Metatranscriptomics: This term refers to studies in which microbial gene expression in the environment is accessed directly from natural microbial assemblages (e.g. by pyrosequencing).

4. Epigenomics: Understanding the large numbers of variations in DNA methylation and chromatin modification by exploiting Omics techniques. There are various recent efforts in this direction (e.g. http://www.heroic-ip.eu).

5. Studies of genome variation: Clear examples of the advances on this front come from the large-scale human variation databases, which archive and provide access to experimental data resulting from HT genotyping and sequencing technologies. The European Genotype Archive (http://www.ebi.ac.uk/ega/page.php) provides dense genotype data associated with distinct individuals. Another relevant project on this front is ENCODE (http://www.genome.gov/10005107), the Encyclopedia Of DNA Elements, which aims to identify all functional elements in the human genome sequence.

6. Cycle-array sequencing methods (also known as NGS): Cycle-array methods generally involve multiple cycles of some enzymatic manipulation of an array of spatially separated oligonucleotide features. Each cycle queries only one or a few bases, but an enormous number of features are processed in parallel. Array features can be ordered or randomly dispersed.

7. Next-generation expressed-sequence-tag sequencing: ESTs are small pieces of DNA sequence (200–500 nucleotides long) that are generated by sequencing an expressed gene. Bits of DNA that represent genes expressed in certain cells, tissues, or organs of different organisms are sequenced and used as "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. Characterising transcripts through sequencing rather than hybridization to a chip has its advantages (e.g. the sequencing approach does not require knowledge of the genome sequence as a prerequisite, as the transcript sequences can be compared to the closest annotated reference sequence in the public databases using standard computational tools).

8. The PICR service reconciles protein identifiers across multiple source databases (http://www.ebi.ac.uk/tools/picr).
9. InterPro/InterProScan: InterPro is a database of protein families, domains, regions, repeats, and sites, in which identifiable features found in known proteins can be applied to new protein sequences (http://www.ebi.ac.uk/interpro/index.html). InterPro combines a number of databases (referred to as member databases) that use different methodologies, and varying degrees of biological information on well-characterised proteins, to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool: InterProScan. InterProScan is a sequence search package that combines the individual search methods of the member databases and provides the results in a consistent format; the user can choose among text, raw, HTML, or XML output. The results display potential GO terms and, where applicable, the InterPro entry relationships (http://www.ebi.ac.uk/Tools/InterProScan).

10. NCBI RefSeq databases: The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa; see http://www.ncbi.nlm.nih.gov/RefSeq.

11. IntEnz: The Integrated relational Enzyme database is a freely available resource focused on enzyme nomenclature (http://www.ebi.ac.uk/intenz).

12. OMIM: the Online Mendelian Inheritance in Man database (http://www.ncbi.nlm.nih.gov/omim).

13. Genomic databases for potential pathogens: EchoBase is a database that curates new experimental and bioinformatic information about the genes and gene products of the model bacterium Escherichia coli K-12 strain MG1655 (http://www.york.ac.uk/res/thomas). The EcoGene database contains updated information about the E. coli K-12 genome and proteome sequences, including extensive gene bibliographies (http://ecogene.org). LegioList is a database dedicated to the analysis of the genomes of Legionella pneumophila strain Paris (endemic in France), strain Lens (epidemic isolate), strain Philadelphia 1, and strain Corby (http://genolist.pasteur.fr/LegioList).

14. Tranche: Tranche is a free and open-source file sharing tool that facilitates the storage of large amounts of data; see https://trancheproject.org.

15. PeptideAtlas: PeptideAtlas (http://www.peptideatlas.org) is a multi-organism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments.
16. The Global Proteome Machine: An open-source, freely available informatics system for the identification of proteins using tandem mass spectra of peptides derived from an enzymatic digest of a mixture of mature proteins; for more see http://www.thegpm.org.

17. EMBOSS: EMBOSS is "The European Molecular Biology Open Software Suite", a free, open-source software analysis package especially designed for the needs of the molecular biology user community. EMBOSS automatically copes with data in a variety of formats and allows transparent retrieval of sequence data from the web; see http://emboss.sourceforge.net/what.

18. Selected projects, organisations, and institutes relevant to Omics:
http://www.ebi.ac.uk
http://www.ncbi.nlm.nih.gov
http://www.bii.a-star.edu.sg
http://www.ibioinformatics.org
http://www.bioinformatics.org.nz
http://www.isb-sib.ch
http://www.igb.uci.edu
http://www.uhnres.utoronto.ca/centres/proteomics
http://www.humanvariomeproject.org
http://www.expasy.org/links.html
http://bioinfo.cipf.es
http://www.bcgsc.ca
http://www.blueprint.org
http://www.cmbi.kun.nl/edu/webtutorials
http://newscenter.cancer.gov/sciencebehind
http://www.genome.gov/Research
http://cmgm.stanford.edu

19. Genomics-related resources:
Genomes Pages at the EBI: http://www.ebi.ac.uk/genomes
Ensembl: http://www.ensembl.org/index.html, http://www.ensemblgenomes.org
Caenorhabditis elegans (and some other nematodes): http://www.wormbase.org
Database for Drosophila melanogaster: http://flybase.org
Mouse Genome Informatics: http://www.informatics.jax.org
Rat Genome Database: http://rgd.mcw.edu
Saccharomyces Genome Database: http://www.yeastgenome.org
S. pombe Genome Project: http://www.sanger.ac.uk/Projects/S_pombe
AceDB genome database: http://www.acedb.org/introduction.shtml
HIV Sequence Database: http://www.hiv.lanl.gov/content/sequence/HIV/mainpage.html
3-D structural information about nucleic acids: http://ndbserver.rutgers.edu
Gene Ontology: http://www.geneontology.org
Human mitochondrial genome database: http://www.mitomap.org

20. Transcriptomics-related resources:
ArrayExpress: http://www.ebi.ac.uk/microarray-as/ae
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo
MGED Society: http://www.mged.org
miRBase: http://www.mirbase.org
Comparative RNA: http://www.rna.ccbb.utexas.edu
Arabidopsis gene expression database: http://www.arexdb.org
Noncoding RNA database: http://www.ncrna.org/frnadb
Mammalian noncoding RNA database: http://jsm-research.imb.uq.edu.au/rnadb
Noncoding RNA databases: http://biobases.ibch.poznan.pl/ncRNA
Comprehensive ribosomal RNA database: http://www.arb-silva.de
RNA modification pathways: http://modomics.genesilico.pl
RNA modification database: http://library.med.utah.edu/RNAmods
RNAi database: http://nematoda.bio.nyu.edu:8001/cgi-bin/index.cgi
Genomic tRNA database: http://gtrnadb.ucsc.edu
MicroCosm Targets: http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5
miRNA sequences: http://www.ebi.ac.uk/enright-srv/MapMi
microRNA binding and siRNA off-target effects: http://www.ebi.ac.uk/enright/sylamer

21. Proteomics-related resources:
Protein sequences: http://www.uniprot.org
ExPASy Proteomics Server: http://www.expasy.org
Protein Information Resource: http://pir.georgetown.edu
Gene Ontology (GO) annotations to proteins: http://www.ebi.ac.uk/GOA/index.html
The peptidase database: http://merops.sanger.ac.uk
Molecular Class-Specific Information System (MCSIS) project: http://www.gpcr.org
PROWL (mass spectrometry and gaseous ion chemistry): http://prowl.rockefeller.edu
Protein fingerprinting: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Protein families: http://pfam.sanger.ac.uk
Domain prediction: http://hydra.icgeb.trieste.it/~kristian/SBASE
Protein domain families: http://prodom.prabi.fr/prodom/current/html/home.php
Protein families, domains, and regions: http://www.ebi.ac.uk/interpro/index.html
Simple Modular Architecture Research Tool: http://smart.embl-heidelberg.de
Integrated Protein Knowledgebase: http://pir.georgetown.edu/iproclass
TIGRFAMs: http://www.jcvi.org/cms/research/projects/tigrfams/overview
Protein Data Bank: http://www.rcsb.org/pdb/home/home.do
PRIDE: http://www.ebi.ac.uk/pride
Protein Data Bank in Europe: http://www.ebi.ac.uk/pdbe
Peptidome: http://www.ncbi.nlm.nih.gov/peptidome
PeptideAtlas: http://www.peptideatlas.org
Global Proteome Machine: http://www.thegpm.org/GPMDB/index.html
22. Protein–protein interaction databases:
IntAct: http://www.ebi.ac.uk/intact/main.xhtml
IMEx: http://imex.sourceforge.net
DIP: http://dip.doe-mbi.ucla.edu
MINT: http://mint.bio.uniroma2.it/mint
MPact: http://mips.gsf.de/genre/proj/mpact
MatrixDB: http://matrixdb.ibcp.fr
MPIDB: http://www.jcvi.org/mpidb
BioGRID: http://www.thebiogrid.org
Acknowledgements

The authors would like to thank Dr. Gabriella Rustici and Dr. Daniel Zerbino for useful insights and information on transcriptomics and genome assembly, respectively. The authors would also like to thank Dr. James Watson for useful comments on the manuscript.

References

1. Knasmüller, S. et al. (2008) Use of conventional and -omics based methods for health claims of dietary antioxidants: A critical overview. Br J Nutr 99, ES3–52.
2. Hillier, L.W. et al. (2008) Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5, 183–88.
3. Johnson, D.S. et al. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1441–42.
4. Mortazavi, A. et al. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–8.
5. Rustici, G. et al. (2008) Data storage and analysis in ArrayExpress and Expression Profiler. Curr Protoc Bioinformatics 7, 7–13.
6. Whetzel, P.L. et al. (2006) The MGED Ontology: A resource for semantics-based description of microarray experiments. Bioinformatics 22, 866–73.
7. Burge, C., Birney, E., and Fickett, J. (2002) Top 10 future challenges for bioinformatics. Genome Technol 17, 1–3.
8. Havlak, P. et al. (2004) The Atlas genome assembly system. Genome Res 14, 721–32.
9. Batzoglou, S. et al. (2002) ARACHNE: A whole genome shotgun assembler. Genome Res 12, 177–89.
10. Myers, E.W. et al. (2000) A whole-genome assembly of Drosophila. Science 287, 2196–204.
11. Huang, X. et al. (2003) PCAP: A whole-genome assembly program. Genome Res 13, 2164–70.
12. Mullikin, J.C., and Ning, Z. (2003) The Phusion assembler. Genome Res 13, 81–90.
13. Pevzner, P.A., Tang, H., and Waterman, M.S. (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98, 9748–53.
14. Hernandez, D. et al. (2008) De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer. Genome Res 18, 802–9.
15. Idury, R., and Waterman, M. (1995) A new algorithm for DNA sequence assembly. J Comput Biol 2, 291–306.
16. Pevzner, P., and Tang, H. (2001) Fragment assembly with double-barrelled data. Bioinformatics 17, S225–33.
17. Chaisson, M.J., and Pevzner, P.A. (2008) Short read fragment assembly of bacterial genomes. Genome Res 18, 324–30.
18. Zerbino, D.R., and Birney, E. (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–9.
19. Ossowski, S. et al. (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18, 2024–33.
20. Farrer, R.A. et al. (2009) De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiol Lett 291, 103–11.
21. Wakaguri, H. et al. (2008) DBTSS: Database of transcription start sites, progress report. Nucleic Acids Res 36, D97–101.
22. Chen, X. et al. (2009) High throughput genome-wide survey of small RNAs from the parasitic protists Giardia intestinalis and Trichomonas vaginalis. Genome Biol Evol 1, 165–75.
23. Butler, J. et al. (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 18, 810–20.
24. Chen, J., and Skiena, S. (2007) Assembly for double-ended short-read sequencing technologies. In 'Advances in Genome Sequencing Technology and Algorithms', edited by E. Mardis, S. Kim, and H. Tang. Artech House Publishers, Boston.
25. Simpson, J.T. et al. (2009) ABySS: A parallel assembler for short read sequence data. Genome Res 19, 1117–23.
26. Jackson, B.G., Schnable, P.S., and Aluru, S. (2009) Parallel short sequence assembly of transcriptomes. BMC Bioinformatics 10 (Suppl 1), S14.
27. Spudich, G., Fernandez-Suarez, X.M., and Birney, E. (2007) Genome browsing with Ensembl: A practical overview. Brief Funct Genomic Proteomic 6, 202–19.
28. Vizcaíno, J.A. et al. (2009) A guide to the Proteomics Identifications Database proteomics data repository. Proteomics 9, 4276–83.
29. Hunter, S. et al. (2009) InterPro: The integrative protein signature database. Nucleic Acids Res 37, D211–15.
30. Cesareni, G. et al. (2005) Comparative interactomics. FEBS Lett 579, 1828–33.
31. Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29, 365–71.
32. Levy, S. et al. (2007) The diploid genome sequence of an individual human. PLoS Biol 5, 2113–44.
33. Wheeler, D.A. et al. (2008) The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–76.
34. Venter, J.C. et al. (2001) The sequence of the human genome. Science 291, 1304–51.
35. Spencer, C.C. et al. (2006) The influence of recombination on human genetic diversity. PLoS Genet 2, e148.
36. Brazma, A. et al. (2003) ArrayExpress – A public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31, 68–71.
37. Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10.
38. Ikeo, K. et al. (2003) CIBEX: Center for information biology gene expression database. C R Biol 326, 1079–82.
39. Parkinson, H. et al. (2009) ArrayExpress update – From an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37, 868–72.
40. Aranda, B. et al. (2009) The IntAct molecular interaction database. Nucleic Acids Res, doi:10.1093/nar/gkp878.
41. Orchard, S. et al. (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25, 894–8.
42. Kiemer, L., and Cesareni, G. (2007) Comparative interactomics: Comparing apples and pears? Trends Biotechnol 25, 448–54.
43. Kiemer, L. et al. (2007) WI-PHI: A weighted yeast interactome enriched for direct physical interactions. Proteomics 7, 932–43.
44. Joyce, A.R., and Palsson, B.Ø. (2006) The model organism as a system: Integrating 'omics' data sets. Nat Rev Mol Cell Biol 7, 198–210.
45. Akula, S.P. et al. (2009) Techniques for integrating -omics data. Bioinformation 3, 284–6.
46. Haider, S. et al. (2009) BioMart Central Portal – Unified access to biological data. Nucleic Acids Res 37, W23–27.
47. Li, P. et al. (2008) Performing statistical analyses on quantitative data in Taverna workflows: An example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics 9, 334.
Chapter 2

Data Standards for Omics Data: The Basis of Data Sharing and Reuse

Stephen A. Chervitz, Eric W. Deutsch, Dawn Field, Helen Parkinson, John Quackenbush, Philippe Rocca-Serra, Susanna-Assunta Sansone, Christian J. Stoeckert Jr., Chris F. Taylor, Ronald Taylor, and Catherine A. Ball

Abstract

To facilitate the sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data.

Key words: Data sharing, Data exchange, Data standards, MGED, MIAME, Ontology, Data format, Microarray, Proteomics, Metabolomics
1. Introduction

The advent of genome sequencing efforts in the 1990s led to a dramatic change in the scale of biomedical experiments. With the comprehensive lists of genes and predicted gene products that resulted from genome sequences, researchers could design experiments that assayed every gene, every protein, or every predicted metabolite. When exploiting transformative Omics technologies such as microarrays, proteomics, or high-throughput cell assays, a single experiment can generate very large amounts of raw data, as well as summaries in the form of lists of sequences, genes, proteins, metabolites, or SNPs. Managing, analyzing, and sharing the large data sets from Omics experiments present challenges
because the standards and conventions developed for single-gene or single-protein studies do not accommodate the needs of Omics studies (1) (see Note 1). The development and application of Omics technologies are evolving rapidly, and so is awareness of the need for, and value of, data-sharing standards in the life sciences community. Standards that become widely adopted can help scientists and data analysts better utilize, share, and archive the ever-growing mountain of Omics data sets. Such standards are also essential for the application of Omics approaches in healthcare environments. This chapter provides an introduction to the major Omics data sharing standards initiatives in the domains of genomics, transcriptomics, proteomics, and metabolomics, and includes summaries of goals, example applications, and references for further information. New standards and organizations for standards development may well arise in the future that will augment or supersede the ones described here. Interested readers are invited to further explore the standards described in this chapter (as well as others not mentioned) and to keep up with the latest developments by visiting the website http://biostandards.info.

1.1. Goals and Motivations for Standards in the Life Sciences
Standards within a scientific domain have the potential to provide uniformity and consistency in the data generated by different researchers, organizations, and technologies. They thereby facilitate more effective reuse, integration, and mining of those data by other researchers and third-party software applications, as well as enable easier collaboration between different groups. Standards-compliant data sets have increased value for scientists who must interpret and build on earlier efforts. And, of course, software analysis tools, which of necessity require some sort of regularized data input, are very often designed to process data that conform to public data formatting standards, when such are available for the domain of interest. Standard laboratory procedures and reference materials enable the creation of guidelines, systems benchmarks, and laboratory protocols for quality assessment and cross-platform comparisons of experimental results, which are needed in order to deploy a technology within research, industrial, or clinical environments. The value of standards in the life sciences for improving the utility of data from high-throughput post-genomic experiments has been widely noted for some years (2–6). To understand how the conclusions of a study were obtained, not only do the underlying data need to be available, but the details of how the data were generated also need to be adequately described (i.e., samples, procedural methods, and data analysis). Depositing data in public repositories is necessary but not sufficient for this purpose. Several standard types of associated data are also needed. Reporting, or "minimum information," standards
are needed to ensure that submitted data are sufficient for clear interpretation and querying by other scientists. Standard data formats greatly reduce the amount of effort required to share and make use of data produced by different investigators. Standards for the terminology used to describe a study and how its data were generated enable not only improved understanding of a given set of experimental results, but also an improved ability to compare studies produced by different scientists and organizations. Standard physical reference materials, as well as standard methods for data collection and analysis, can also facilitate such comparisons and aid the development of reusable data quality metrics. Ideally, any standards effort would take into account the usability of the proposed standard. A standard that is not widely used is not really a standard, and the successful adoption of a standard by end-user scientists requires a reasonable cost-benefit ratio. The effort of producing a new standard (development cost) and, more importantly, the effort needed to learn how to use the standard or to generate standards-conforming data (end-user cost) have to be outweighed by gains in the ability to publish experimental results, the ability to use other published results to advance one's own work, and the higher visibility bestowed on standards-compliant publications (7). Thus, a major focus of standards initiatives is minimizing end-user usability barriers, typically through educational outreach via workshops and tutorials, as well as by fostering the development of software tools that help scientists utilize the standard in their investigations. There must also be a means for incorporating feedback from the target community, both at the initiation of standard development and on a continuing basis, so that the standard can adapt to user needs that may change over time. Brazma and colleagues (8) discuss some additional factors that contribute to the success of standards in systems biology and functional genomics.

1.2. History of Standards for Omics
The motivation for standards for Omics initially came from the parallel needs of the scientific journals, which wanted standards for data publication, and the needs of researchers, who recognized the value of comparing the large and complex data sets characteristic of Omics experiments. Such data sets, often with thousands of data points, required new data formats and publication guidelines. Scientists using DNA microarrays for genome-wide gene expression analysis were the first to respond to these needs. In 2001, the Microarray and Gene Expression Data (MGED) Society (http://www.mged.org) published the Minimum Information About a Microarray Experiment (MIAME) standard (9), a guideline for the minimum information required to describe a DNA microarray-based experiment. The MIAME guidelines specify the information required to describe such an
experiment so that another researcher in the same discipline could either reproduce the experiment or analyze the data de novo. Adoption of the MIAME guidelines was expedited when a number of journals and funding agencies required compliance with the standard as a precondition for publication. In parallel with MIAME, data modeling and XML-based exchange standards called the Microarray Gene Expression Object Model (MAGE-OM) and Markup Language (MAGE-ML) (10), and a controlled vocabulary called the MGED Ontology (11), were created. These standards facilitated the creation and growth of a number of interoperable databases and public data repositories. Use of these standards also led to the establishment of open-source software projects for DNA microarray data analysis. Resources such as the ArrayExpress database (12–14) at the European Bioinformatics Institute (EBI), the Gene Expression Omnibus (GEO) (15–18) at the National Center for Biotechnology Information (NCBI), and others were advertised as "MIAME-compliant" and capable of importing data submitted in the MAGE-ML format (10). Minimum information guidelines akin to MIAME then arose within other Omics communities; for example, the Minimum Information about a Proteomics Experiment (MIAPE) guidelines for proteomics studies (19) have been developed. More recent initiatives have been directed towards technology-independent standards for reporting, modeling, and exchange that support work spanning multiple Omics technologies or domains, and toward the harmonization of related standards. These projects have, of necessity, required extensive collaboration across disciplines. The resulting standards have gained in sophistication, benefiting from insights gained in the use and implementation of earlier standards, from the formalisms imposed by the need to make the data computationally tractable and logically coherent, and from the experience of engaging multiple academic communities in the development of those prior standards. Increasingly, the drive for standards in Omics is shifting beyond the academic communities to include the biomedical and healthcare communities as well. As the application of Omics technologies and data expands into the clinical and diagnostic arena, organizations such as the US Food and Drug Administration (FDA) and technology manufacturers are becoming more involved in a range of standards efforts; for example, the MicroArray Quality Control (MAQC) consortium brings together representatives of many such organizations (20). Quality control/assurance projects and reference standards that support comparability of data across different manufacturer platforms are of particular interest as Omics technologies mature and start to play an expanded role in healthcare settings.
2. Materials

Omics standards are typically scoped to a specific aspect of an Omics investigation. Generally speaking, a given standard will either cover the description of a completed experiment, or target some aspect of performing the experiment or analyzing its results. Standards are further stratified to handle more specific needs, such as reporting data for publication, providing data exchange formats, or defining standard terminologies. Such scoping reflects a pragmatic decoupling that permits different standards groups to develop complementary specifications concurrently, and allows different initiatives to attract individuals with relevant expertise or interest in the target area (8). As a result of this arrangement, a standard or standardization effort within Omics can generally be characterized by its domain and scope. The domain reflects the type of experimental data (transcriptomics, proteomics, metabolomics, etc.), while the scope defines the area of applicability of the standard or the methodology being standardized (experiment reporting, data exchange, etc.). Tables 1 and 2 list the different domains and scopes, respectively, that characterize existing Omics standardization efforts (see Note 2).
Table 1. Domains of Omics standards. The domain indicates the type of experimental data that the standard is designed to handle.

Domain | Description
Genomics | Genome sequence assembly, genetic variations, genomes and metagenomes, and DNA modifications
Transcriptomics | Gene expression (transcription), alternative splicing, and promoter activity
Proteomics | Protein identification, protein–protein interactions, protein abundance, and post-translational modifications
Metabolomics | Metabolite profiling, pathway flux, and pathway perturbation analysis
Healthcare and Toxicogenomics (a) | Clinical, diagnostic, or toxicological applications
Harmonization and Multiomics (a) | Cross-domain compatibility and interoperability

(a) Healthcare, toxicological, and harmonization standards may be applicable to one or more other domain areas. These domains impose additional requirements on top of the needs of the pure Omics domains.
Table 2. Scope of Omics standards. Scope defines the area of applicability or methodology to which the standard pertains. Scope-General: standards can be broadly partitioned based on whether they are used for describing or for executing an experiment. Scope-Specific: the scope can be further narrowed to cover more specific aspects of the general scope.

Scope-General | Scope-Specific | Description
Experiment description | Reporting (minimum information) | Documentation for publication or data deposition
Experiment description | Data exchange & modeling | Communication between organizations and tools
Experiment description | Terminology | Ontologies and CVs to describe experiments or data
Experiment execution | Physical standards | Reference materials, spike-in controls
Experiment execution | Data analysis & quality metrics | Analyze, compare, QA/QC experimental results

CV controlled vocabulary, QA/QC quality assurance/quality control
The remainder of this section describes the different scopes of Omics standards, listing the major standards initiatives and organizations relevant to each scope. The next section then surveys the standards by domain, providing more in-depth descriptions of the relevant standards, example applications, and references for further information.

2.1. Experiment Description Standards
Experiment description standards, also referred to generally as “data standards”, concern the development of guidelines, conventions, and methodologies for representing and communicating the raw and processed data generated by experiments as well as the metadata for describing how an experiment was carried out, including a description of all reagents, specimens, samples, equipment, protocols, controls, data transformations, software algorithms, and any other factors needed to accurately communicate, interpret, reproduce, or analyze the experimental results. Omics studies and the data they generate are complex. The diversity of scientific problems, experimental designs, and technology platforms creates a challenging landscape of data for any descriptive standardization effort. Even within a given domain and technology type, it is not practical for a single specification to encompass all aspects of describing an experiment. Certain aspects are more effectively handled separately; for example, a description
of the essential elements to be reported for an experiment is independent of the specific data format in which that information should be encoded for import or export by software applications. In recognition of this, experiment description standardization efforts within the Omics community are further scoped into more specialized areas that address the distinct data handling requirements of the different aspects, or types of data, encountered in an Omics study. Thus we have:
● reporting;
● data exchange & modeling;
● terminology.
These different areas serve complementary roles and together provide a complete package for describing an Omics experiment within a given domain or technology platform. For example, a data exchange/modeling standard will typically have elements to satisfy the needs of a reporting standard, with a set of allowable values for those elements provided by an associated standard controlled vocabulary/terminology.

2.1.1. Reporting Standards: Minimum Information Guidelines
The scope of a reporting standard pertains to how a researcher should record the information required to unambiguously communicate experimental designs, treatments and analyses, to contextualize the data generated, and underpin the conclusions drawn. Such standards are also known as data content or minimum information standards because they usually have an acronym beginning with “MI” standing for “minimum information” (e.g. MIAME). The motivation behind reporting standards is to enable an experiment to be interpreted by other scientists and (in principle) to be independently reproduced. Such standards provide guidance to investigators when preparing to report or publish their investigation or archive their data in a repository of experimental results. When an experiment is submitted to a journal for publication, compliance with a reporting standard can be valuable to reviewers, aiding them in their assessment of whether an experiment has been adequately described and is thus worthy of approval for publication. A reporting specification does not normally mandate a particular format in which to capture/transport information, but simply delineates the data and metadata that their originating community considers appropriate to sufficiently describe how a particular investigation was carried out. Although a reporting standard does not have a specific data formatting requirement, the often explicit expectation is that the data should be provided using a technology-appropriate standard format where feasible, and that controlled vocabulary or ontology terms should be used in descriptions where feasible. Data repositories may impose such a requirement as a condition for data submission.
Omics experiments, in addition to their novelty, can be quite complex in their execution, analysis, and reporting. Minimum information guidelines help in this regard by providing a consistent framework to help scientists think about and report essential aspects of their experiments, with the ultimate aim of ensuring the usefulness of the results to scientists who want to understand or reproduce the study. Such guidelines also help by easing compliance with a related data exchange standard, which is often designed to support the requirements of a reporting standard (discussed below). Depending on the nature of a particular investigation, information in addition to what is specified by a reporting standard may be provided as desired by the authors of the study or as deemed necessary by reviewers of the study. Table 3 lists the major reporting standards for different Omics domains. The MIBBI project (discussed later in this chapter) catalogues these and many other reporting standards and provides a useful introduction (21).
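In practice, compliance with a "minimum information" checklist can be checked mechanically before submission. The toy sketch below encodes a handful of MIAME-inspired required fields (the field names are invented for illustration; consult the actual MIAME checklist for the authoritative list) and reports what a draft experiment description is missing.

```python
# Invented, MIAME-inspired required fields -- not the official checklist.
REQUIRED = [
    'experiment_design', 'array_design', 'samples',
    'hybridizations', 'raw_data', 'normalized_data',
]

def missing_fields(description):
    """Return the required fields that are absent or empty in an
    experiment description supplied as a dictionary."""
    return [f for f in REQUIRED if not description.get(f)]

submission = {
    'experiment_design': 'dose response, 3 biological replicates',
    'samples': ['liver, 0 mg/kg', 'liver, 10 mg/kg'],
    'raw_data': 'CEL files, see archive',
}
gaps = missing_fields(submission)
print('Missing:', ', '.join(gaps) if gaps else 'none')
# -> Missing: array_design, hybridizations, normalized_data
```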
Table 3. Existing reporting standards for Omics.

Acronym | Full name | Domain | Organization
CIMR | Core Information for Metabolomics Reporting | Metabolomics | MSI
MIAME | Minimum Information about a Microarray Experiment | Transcriptomics | MGED
MIAPE | Minimum Information about a Proteomics Experiment | Proteomics | HUPO-PSI
MIGS-MIMS | Minimum Information about a Genome/Metagenome Sequence | Genomics | GSC
MIMIx | Minimum Information about a Molecular Interaction eXperiment | Proteomics | HUPO-PSI
MINIMESS | Minimal Metagenome Sequence Analysis Standard | Metagenomics | GSC
MINSEQE | Minimum Information about a high-throughput Nucleotide Sequencing Experiment | Genomics, Transcriptomics (UHTS) | MGED
MISFISHIE | Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments | Transcriptomics | MGED

Acronyms and definitions of the major reporting standards efforts are shown, indicating their target domain and the maintaining organization, which are as follows: MGED, MGED Society, http://mged.org; GSC, Genomic Standards Consortium, http://gensc.org; HUPO-PSI, Human Proteome Organization Proteomics Standards Initiative, http://www.psidev.info; MSI, Metabolomics Standards Initiative, http://msi-workgroups.sourceforge.net.
For some publishers, compliance with a reporting standard is increasingly becoming an important criterion for accepting or rejecting a submitted Omics manuscript (22). The journals Nature, Cell, and The Lancet have led the way in the enforcement of compliance for DNA microarray experiments by requiring submitted manuscripts to demonstrate compliance with the MIAME guidelines as a condition of publication. Currently, most journals that publish such experiments have adopted some elaboration of this policy. Furthermore, publishers such as BioMed Central are moving to, or already endorse, the MIBBI project, described below, as a portal to the diverse set of available guidelines for the biosciences.

2.1.2. Data Exchange and Modeling Standards
The scope of a data exchange standard is the definition of an encoding format for use in sharing data between researchers and organizations, and for exchanging data between software programs or information storage systems. A data exchange standard delineates what data types can be encoded and the particular way they should be encoded (e.g., tab-delimited columns, XML, binary, etc.), but does not specify what a document should contain in order to be considered complete. There is an expectation that the content will be constructed in accordance with a community-approved reporting standard, and the data exchange standard itself is typically designed so that users can construct documents that are compliant with a particular reporting standard (e.g., MAGE-ML and MAGE-TAB contain placeholders that are designed to hold the data needed for the production of MIAME-compliant documents). A data exchange standard is often designed to work in conjunction with a data modeling standard, which defines the attributes and behaviors of the key entities and concepts (objects) that occur within an Omics data set. The model is intended to capture the exchange format-encoded data for the purpose of storage or downstream data mining by software applications. The data model itself is designed to be independent of any particular software implementation (database schema, XML file, etc.) or programming language (Java, C++, Perl, etc.). The implementation decisions are thus left to the application programmer, to be made using the most appropriate technology(s) for the target user base. This separation of the model (or "platform-independent model") and the implementation (or "platform-specific implementation") was first defined by the Object Management Group's Model Driven Architecture (http://www.omg.org/mda) and offers a design methodology that holds promise for building software systems that are more interoperable and adaptable to technological change. Such extensibility has been recognized as an essential feature of data models for Omics experiments (23). Data exchange and modeling standards are listed in Table 4 (a toy illustration of the model/implementation separation follows the table).
Table 4
A sampling of data exchange and modeling standards for Omics

Acronym (data format / object model) | Full name | Domain | Organization
FuGE-ML / FuGE-OM | Functional Genomics Experiment Markup Language/Object Model | Multiomics | FuGE
ISA-TAB | Investigation Study Assay – Tabular | Multiomics | RSBI
MAGE-ML / MAGE-OM | MicroArray and Gene Expression Markup Language / Object Model | Transcriptomics | MGED
MAGE-TAB | MicroArray and Gene Expression Tabular Format | Transcriptomics | MGED
MIF (PSI-MI XML) | Molecular Interactions Format | Proteomics | HUPO-PSI
mzML | Mass Spectrometry Markup Language | Proteomics | HUPO-PSI
mzIdentML | Mass Spectrometry Identifications Markup Language | Proteomics | HUPO-PSI
PML / PAGE-OM | Polymorphism Markup Language / Phenotype and Genotype Object Model | Genomics | GEN2PHEN
SDTM | Study Data Tabulation Model | Healthcare | CDISC
Acronyms and names of some of the major data exchange standards efforts are shown, indicating their target domain and the maintaining organization, which are as described in the legend to Table 3 with the following additions: RSBI Reporting Structure for Biological Investigations, http://www.mged.org/Workgroups/rsb; FuGE Functional Genomics Experiment, http://fuge.sourceforge.net; GEN2PHEN Genotype to phenotype databases, http://www.gen2phen.org; CDISC Clinical Data Interchange Standards Consortium, http://www.cdisc.org. Additional proteomics exchange standards are described on the HUPO-PSI website, http://www.psidev.info
2.1.3. Terminology Standards
The scope of a terminology standard is typically defined by the use cases it is intended to support and the competency questions it is designed to answer. An example of a use case is annotating the data generated in an investigation with regard to materials, procedures, and results; associated competency questions would include those used in data mining (for example, "find all cancer studies done using Affymetrix microarrays"). Terminology standards generally
provide controlled vocabularies and some degree of organization. Ontologies have become popular as mechanisms to encode terminology standards because they provide definitions for terms in the controlled vocabulary as well as properties of, and relationships between, terms. The Gene Ontology (24) is one such ontology, created to address the use case of providing consistent annotation of gene products across different species and enabling questions such as "return all kinases".

The primary goal of a terminology standard is to promote consistent use of terms within a community and thereby facilitate knowledge integration by enabling better querying and data mining within and across data repositories as well as across domain areas. Use of standard terminologies by scientists working in different Omics domains can enable interrelation of experimental results from diverse data sets (see Note 3). For example, annotating results with standard terminologies could help correlate the expression profile of a particular gene, assayed in a transcriptomics experiment, with its protein modification state, assayed in a separate proteomics experiment. Using a suitably annotated metabolomics experiment, the gene/protein results could then be linked to the activity of the pathway(s) in which they operate, or to a disease state documented in a patient's sample record.

Consistent use of a standard terminology such as GO has enabled research advances. Data integration is possible across diverse data sets as long as they are annotated using terms from GO. Data analysis for association of particular types of gene products with results from investigations is also made possible by the effort the GO Consortium has made to consistently annotate gene products with GO. Numerous tools that support such analyses are listed at the Gene Ontology site, http://www.geneontology.org/GO.tools.microarray.shtml.

There is already quite a proliferation of terminologies in the life sciences. Key to their success is adoption by scientists, bioinformaticians, and software developers for use in the annotation of Omics data. However, the proliferation of ontologies that are not interoperable can be a barrier to integration (25) (see Note 4). The OBO Foundry targets this area, delineating best practices for the construction of terminologies and maximizing their internal integrity, extensibility, and reuse. Easy access to standard terminologies is important and is being addressed through sites such as the NCBO BioPortal (http://bioportal.bioontology.org) and the EBI Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup). These web sites allow browsing and downloading of ontologies. They also provide programmatic access through web services, which is important for integration with software tools and web sites that want to make use of them.
Table 5
Terminology standards

Acronym | Full name | Domain | Organization
EVS | Enterprise Vocabulary Services | Healthcare | NCI
GO | Gene Ontology | Multiomics | GOC
MS | Proteomics Standards Initiative Mass Spectrometry controlled vocabulary | Proteomics | HUPO-PSI
MO | MGED Ontology | Transcriptomics | MGED
OBI | Ontology for Biomedical Investigations | Multiomics | OBI
OBO | Open Biomedical Ontologies | Multiomics | NCBO
PSI-MI | Proteomics Standards Initiative Molecular Interactions ontology | Proteomics | HUPO-PSI
sepCV | Sample processing and separations controlled vocabulary | Proteomics | HUPO-PSI
SO | Sequence Ontology | Multiomics | GOC

Acronyms and names of some of the major terminology standards in use with Omics data are shown, indicating their target domain and the maintaining organization, which are as described in the legends to Tables 3 and 4 with the following additions: GOC Gene Ontology Consortium, http://geneontology.org/GO.consortiumlist.shtml; NCI National Cancer Institute, http://www.cancer.gov; NCBO National Center for Biomedical Ontology, http://bioontology.org; OBI Ontology for Biomedical Investigations, http://purl.obofoundry.org/obo/obi
Terms in ontologies are organized into classes and are typically placed in a hierarchy. Classes represent types of entities for which there can be different instances. Terms can be given accession numbers so that they can be tracked, and can be assigned details such as who is responsible for the term and what the source of its definition was. If the ontology is based on a knowledge representation language such as OWL (Web Ontology Language, http://www.w3.org/TR/owl-ref), then restrictions on the usage of a term can be encoded; for example, one can require associations between terms (e.g., the inputs and outputs of a process). Building an ontology is usually done with a tool such as Protégé (http://protege.stanford.edu) or OBO-Edit (http://oboedit.org). These tools are also useful for navigating ontologies. Table 5 lists some of the ontologies and controlled vocabularies relevant to Omics. For a complete listing and description of these and related ontologies, see the OBO Foundry website (http://www.obofoundry.org).
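As a toy illustration of the competency-question idea discussed above, the following Python sketch answers a "return all kinases"-style query over a hand-built is_a hierarchy. The term labels and relationships are invented; a real query would traverse the full Gene Ontology with dedicated tooling.

```python
# A minimal sketch of answering a competency question ("return all
# kinases") over a toy is_a hierarchy. Term labels are illustrative only.
from collections import defaultdict

# child -> parent (is_a) relationships, invented for illustration
IS_A = {
    "protein kinase activity": "kinase activity",
    "tyrosine kinase activity": "protein kinase activity",
    "kinase activity": "catalytic activity",
}


def descendants(root):
    """All terms whose is_a closure reaches the root term."""
    children = defaultdict(set)
    for child, parent in IS_A.items():
        children[parent].add(child)
    found, stack = set(), [root]
    while stack:
        term = stack.pop()
        for c in children[term]:
            if c not in found:
                found.add(c)
                stack.append(c)
    return found


# "Return all kinases" = the transitive is_a descendants of the term.
print(descendants("kinase activity"))
```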
2.2. Experiment Execution Standards

2.2.1. Physical Standards

The scope of a physical standard pertains to the development of standard reagents for use as spike-in controls in assays.
Table 6
Organizations involved in the creation of physical standards relevant to Omics experiments

Acronym | Full name | Domain | Website
ERCC | External RNA Control Consortium | Transcriptomics | http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/GeneExpression/ERCC.htm
LGC | Laboratory of the Government Chemist | Transcriptomics, Proteomics | http://www.lgc.co.uk
NIST | National Institute of Standards and Technology | Transcriptomics | http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/Main_Page.htm
NMS | National Measurement System (NMS) Chemical and Biological Metrology | Multiomics | http://www.nmschembio.org.uk
ATCC | American Type Culture Collection Standards Development Organization | Healthcare | http://www.atcc.org/Standards/ATCCStandardsDevelopmentOrganizationSDO/tabid/233/Default.aspx
A physical standard serves as a stable reference point that can facilitate the quantification of experimental results and the comparison of results between different runs, investigators, organizations, or technology platforms. Physical standards are essential for quality metrics purposes and are especially important for applications of Omics technologies in regulated environments such as clinical or diagnostic settings. In the early days of DNA microarray-based gene expression experiments, results from different investigators, laboratories, or array technologies were notoriously hard to compare despite the use of reporting and data exchange standards (26). The advent of physical standards and improved metrology promises to increase the accuracy of comparisons of cross-platform and cross-investigator experimental results. Such improvements are necessary for the adoption of Omics technologies in clinical and diagnostic applications within the regulated healthcare industry. Examples of physical standards are provided in Table 6.

2.2.2. Data Analysis and Quality Metrics
The scope of a data analysis or quality metrics standard is the delineation of best practices for algorithmic and statistical
approaches to processing experimental results, as well as methods to assess and assure data quality. Methodologies for data analysis cover the following areas (a minimal sketch of the first item follows the list):

● Data transformation (normalization) protocols
● Background or noise correction
● Clustering
● Hypothesis testing
● Statistical data modeling
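As a deliberately simple instance of a data transformation protocol, the hedged Python sketch below log2-transforms invented intensity values and median-centers each array. It is one of many possible normalization strategies, not a prescribed standard.

```python
# A minimal sketch of one common data transformation protocol:
# log2 transformation followed by per-array median centering.
# The intensity values are invented for illustration.
import math
import statistics

# Each inner list holds one array's probe intensities.
arrays = [
    [120.0, 95.0, 310.0, 18.0],   # array 1
    [240.0, 180.0, 650.0, 40.0],  # array 2
]


def normalize(values):
    logs = [math.log2(v) for v in values]
    center = statistics.median(logs)
    return [x - center for x in logs]  # median of each array becomes 0


for i, arr in enumerate(arrays, start=1):
    print(f"array {i}:", [round(x, 2) for x in normalize(arr)])
```

After centering, the two arrays' values are directly comparable even though their raw intensities differ by a constant scale factor.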
Analysis procedures have historically been developed in a tool-specific manner by commercial vendors, and users of these tools would rely on the manufacturer for guidance. Yet efforts to define more general guidelines and protocols for data analysis best practices are emerging. Driving some of these efforts is the need for consistent approaches to measuring data quality, which is critical for determining one's confidence in the results of any given experiment and for judging the comparability of results obtained under different conditions (days, laboratories, equipment operators, manufacturing batches, etc.).
Table 7
Data analysis and quality metrics projects

Acronym | Full name | Domain | Organization
arrayQualityMetrics | Quality assessment software package | Transcriptomics | BioConductor
CAMDA | Critical Assessment of Microarray Data Analysis | Transcriptomics | n/a
CAMSI | Critical Assessment of Mass Spectrometry Informatics | Proteomics | n/a
iPRG | Proteome Informatics Research Group | Proteomics | ABRF
MAQC | Microarray Quality Control Project | Transcriptomics | FDA
NTO | Normalization and Transformation Ontology | Transcriptomics | EMERALD

BioConductor's arrayQualityMetrics: http://bioconductor.org/packages/2.3/bioc/html/arrayQualityMetrics.html. CAMDA is managed by a local organizing committee at different annual venues: http://camda.bioinfo.cipf.es. EMERALD's NTO: http://www.microarray-quality.org/ontology_work.html. MAQC is described in Subheading 3.2.6
Data quality metrics rely on data analysis standards as well as the application of physical standards. Collecting or assessing data quality using quality metrics is facilitated by having data that conform to widely adopted reporting standards and are available in common data exchange formats. A number of data analysis and quality metrics efforts are listed in Table 7.
3. Methods

Here we review some of the more prominent standards and initiatives within the main Omics domains: genomics, transcriptomics, proteomics, and metabolomics. Of these, transcriptomics is the most mature in terms of standards development and community adoption, though proteomics is a close second.

3.1. Genomic Standards
Genomic sequence data are used in a variety of applications, such as genome assembly, comparative genomics, DNA variation assessment (SNP genotyping and copy number), epigenomics (DNA methylation analysis), and metagenomics (DNA sequencing of environmental samples for organism identification). Continued progress in the development of high-throughput sequencing technology has led to an explosion of new genome sequence data and new applications of this technology. A number of efforts are underway to standardize the way scientists describe and exchange these genomic data in order to facilitate better exchange and integration of data contributed by different laboratories using different sequencing technologies.
3.1.1. MIGS-MIMS
MIGS (Minimum Information About a Genome Sequence, http://gensc.org) is a minimum information checklist aimed at standardizing the description of a genomic sequence, such as the complete assembly of a bacterial or eukaryotic genome. It is intended to extend the core information that has traditionally been captured by the major nucleotide sequence repositories (GenBank, EMBL, and DDBJ) in order to accommodate the additional requirements of scientists working with genome sequencing project data. MIGS is maintained by the Genomic Standards Consortium (http://gensc.org), which has also developed an extension of MIGS for supporting metagenomic data sets called MIMS (Minimum Information about a Metagenomic Sequence/Sample). MIMS allows for additional metadata particular to a metagenomics experiment, such as details about environmental sampling.
A data format called GCDML (Genomic Contextual Data Markup Language) is under development by the GSC for the purpose of providing a MIGS/MIMS-compliant data format for exchanging data from genomic and metagenomic experiments.

3.1.2. SAM Tools
The SAM format is an emerging data exchange format for efficiently representing large sequence alignments, driven by the explosion of data from high-throughput sequencing projects such as the 1000 Genomes Project (27). It is designed to be simple and compact, and to accommodate data from different alignment programs. The SAM Tools open source project provides utilities for manipulating alignments in the SAM format, including sorting, merging, indexing, and generating alignments (http://samtools.sourceforge.net).
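The following Python sketch, offered as illustration rather than production code, parses the eleven mandatory tab-delimited fields of a single (invented) SAM alignment line; real pipelines would normally rely on the SAM Tools utilities or an equivalent library.

```python
# A minimal sketch of reading the 11 mandatory, tab-delimited fields of a
# SAM alignment line. The record below is invented for illustration.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
          "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

line = "read1\t0\tchr1\t1000\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"


def parse_sam_line(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:11]))
    record["POS"] = int(record["POS"])    # 1-based leftmost position
    record["FLAG"] = int(record["FLAG"])  # bitwise alignment flags
    return record


rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])
```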
3.1.3. PML and PaGE-OM
The Polymorphism Markup Language PML (http://www.openpml.org) was approved as an XML-based data format for the exchange of genetic polymorphism data (e.g., SNPs) in June 2005. It was designed to facilitate data exchange among the different data repositories and researchers who produce or consume this type of data. The Phenotype and Genotype Experiment Object Model (PaGE-OM) is an updated, broader version of the PML standard that provides a richer object model and incorporates phenotypic information. It was approved as a standard by the OMG in March 2008. PaGE-OM defines a generic, platform-independent representation for entities such as alleles, genotypes, phenotype values, and the relationships between these entities, with the goal of capturing the minimum amount of information required to properly report most genetic experiments involving genotype and/or phenotype information (28). Further refinements of the PaGE-OM object model, harmonization with object models from other domains, and generation of exchange formats are underway at the time of writing. PaGE-OM is maintained by JBIC (http://www.pageom.org) in partnership with the Gen2Phen project (http://www.gen2phen.org).
3.2. Transcriptomics Standards
This section describes the organizations and standards related to technologies that measure transcription, gene expression, or its regulation on a genomic scale. Transcriptomics standards pertain to the following technologies or types of investigation:

● Gene expression via DNA microarrays or ultra high-throughput sequencing
● Tiling
● Promoter binding (ChIP-chip, ChIP-seq)
● In situ hybridization studies of gene expression
3.2.1. MIAME
The goal of MIAME (Minimum Information About a Microarray Experiment, http://www.mged.org/Workgroups/MIAME/miame.html) is to permit the unambiguous interpretation, reproduction, and verification of the results of a microarray experiment. MIAME was the original reporting standard, and it inspired similar "minimum information" specifications in other Omics domains (9). MIAME defines the following six elements as essential for achieving these goals:

1. The raw data from each hybridization.
2. The final processed data for the set of hybridizations in the experiment.
3. The essential sample annotation, including experimental factors and their values.
4. The experiment design, including sample-data relationships.
5. Sufficient annotation of the array design.
6. Essential experimental and data processing protocols.

The MIAME standard has proven useful for microarray data repositories, which have used it both as a guideline for data submitters and as a basis for judging the completeness of data submissions. The ArrayExpress database provides a service to publishers of microarray studies wherein ArrayExpress curators assess a dataset on the basis of how well it satisfies the MIAME requirements (29). A publisher can then choose whether to accept or reject a manuscript on the basis of the assessment. ArrayExpress judges the following aspects of a report to be the most critical for MIAME compliance:

1. Sufficient information about the array design (e.g., reporter sequences for oligonucleotide arrays or database accession numbers for cDNA arrays).
2. Raw data as obtained from the image analysis software (e.g., CEL files for Affymetrix technology, or GPR files for GenePix).
3. Processed data for the set of hybridizations.
4. Essential sample annotation, including experimental factors (variables) and their values (e.g., the compound and dose in a dose-response experiment).
5. Essential experimental and data processing protocols.

Several publishers now have policies in place that require MIAME compliance as a precondition for publication.
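As an illustration of how a checklist can be used to judge completeness, here is a hedged Python sketch that tests an invented submission record against the six MIAME elements; the dictionary keys are hypothetical and do not correspond to any repository's actual schema.

```python
# A minimal sketch of using a reporting checklist the way a repository
# curator might: test a submission for the six MIAME elements.
# The submission dictionary and its keys are invented for illustration.
MIAME_ELEMENTS = [
    "raw_data",           # raw data from each hybridization
    "processed_data",     # final processed data
    "sample_annotation",  # experimental factors and their values
    "experiment_design",  # sample-data relationships
    "array_design",       # annotation of the array design
    "protocols",          # experimental/data processing protocols
]


def missing_elements(submission):
    return [e for e in MIAME_ELEMENTS if not submission.get(e)]


submission = {"raw_data": "hyb1.CEL", "processed_data": "matrix.txt"}
gaps = missing_elements(submission)
print("MIAME-compliant" if not gaps else "missing: " + ", ".join(gaps))
```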
3.2.2. MINSEQE
The Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (MINSEQE, http://www.mged.org/minseqe) provides a reporting guideline akin to MIAME that is
applicable to high-throughput nucleotide sequencing experiments used to assay biological state. It does not pertain to traditional sequencing projects, where the aim is to assemble a chromosomal sequence or resequence a given genomic region, but rather to applications of sequencing in areas such as transcriptomics, where high-throughput sequencing is used to compare the populations of sequences between samples derived from different biological states; for example, sequencing cDNAs to assess differential gene expression. Here, sequencing provides a means to assay the sequence composition of different biological samples, analogous to the way that DNA microarrays have traditionally been used. MINSEQE is now supported by both the Gene Expression Omnibus (GEO) and ArrayExpress. ArrayExpress and GEO have entered into a metadata exchange agreement, meaning that UHTS sequence experiments will appear in both databases regardless of where they were submitted. This complements the exchange of underlying raw data between the NCBI and EBI short read archives, SRA and ERA.

3.2.3. MAGE
The MAGE project (MicroArray Gene Expression, http://www.mged.org/Workgroups/MAGE/mage.html) aims to provide a standard for the representation of microarray gene expression data to facilitate the creation of software tools for exchanging microarray information between different users and data repositories. The MAGE family of standards does not have direct support for capturing the results of higher-level analysis (e.g., clustering of expression data from a microarray experiment). MAGE includes the following sub-projects:

● MAGE-OM: MAGE Object Model
● MAGE-ML: MAGE Markup Language
● MAGEstk: MAGE Software Toolkit
● MAGE-TAB: MAGE Tabular Format
MAGE-OM is a platform-independent model for representing gene expression microarray data. Using the MAGE-OM model, the MGED Society has implemented MAGE-ML (an XML-based format) as well as MAGE-TAB (a tab-delimited values format). Both formats can be used for annotating and communicating data from microarray gene expression experiments in a MIAME-compliant fashion.

MAGE-TAB evolved out of a need for a simpler version of MAGE-ML. MAGE-TAB is easier to use and thus more accessible to a wider cross-section of the microarray-based gene expression community, which has struggled with the often large, structured XML-based MAGE-ML documents. A limitation of MAGE-TAB is that only single values are permitted for certain data slots that may in practice be
multivalued. Data that cannot be adequately represented by MAGE-TAB can be described using MAGE-ML, which is quite flexible.

MAGEstk is a collection of open source packages that implement the MAGE Object Model in various programming languages (10). The toolkit is meant for bioinformatics users who develop their own applications and need to integrate functionality for managing an instance of MAGE-OM. The toolkit facilitates easy reading and writing of MAGE-ML to and from the MAGE-OM, and all MAGE objects have methods to maintain and update the MAGE-OM at all levels. In essence, MAGEstk is the glue between a software application and the standard way of representing DNA microarray data in MAGE-OM as a MAGE-ML file.

3.2.4. MAGE-TAB
MAGE-TAB (30) is a simple tab-delimited format that is used to represent gene expression and other high-throughput data, such as high-throughput sequencing (see Note 5). It is the main submission format for the ArrayExpress database at the European Bioinformatics Institute and is supported by the BioConductor package ArrayExpress. There are also converters to MAGE-TAB from the GEO SOFT format and from MAGE-ML, as well as an open source template generation system (31). The MGED community maintains a complete list of applications using MAGE-TAB at http://www.mged.org/mage-tab (see Note 6).
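To give a flavor of the format, the short Python sketch below emits a small SDRF-like tab-delimited table in the spirit of MAGE-TAB. The column headers follow MAGE-TAB conventions, but the values are invented, and a real submission would also require an accompanying IDF file.

```python
# A minimal sketch of emitting a small SDRF-like tab-delimited table in
# the spirit of MAGE-TAB. Values are illustrative; this is not a
# complete, valid MAGE-TAB document.
import csv
import sys

header = ["Source Name", "Characteristics[organism]",
          "Factor Value[compound]", "Array Data File"]
rows = [
    ["sample1", "Homo sapiens", "none", "sample1.CEL"],
    ["sample2", "Homo sapiens", "tamoxifen", "sample2.CEL"],
]

writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
```

Because the output is plain tab-delimited text, it can be opened and checked in any spreadsheet program, which is precisely the accessibility argument made for MAGE-TAB above.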
3.2.5. MO
The MGED Ontology (MO, http://mged.sourceforge.net/ontologies/index.php) provides standard terms for describing the different components of a microarray experiment (11). MO is complementary to the other MGED standards, MIAME and MAGE, which respectively specify what information should be provided and how that information should be structured; the specification of the terminology used for labeling that information has been left to MO. MO is an ontology with defined classes, instances, and relations. A primary motivation for the creation of MO was to provide terms wherever needed in the MAGE Object Model, which has led to MO being organized along the same lines as the MAGE-OM packages. A feature of MO is that it provides pointers to other resources as appropriate to describe a sample, biomaterial characteristics, or treatment compounds used in the experiment (e.g., NCBI Taxonomy, ChEBI) rather than importing, mapping, or duplicating those terms. A major revision of MO (currently at version 1.3.1.1, released in February 2007) was planned to address structural issues; however, such plans have recently been superseded by efforts aimed at incorporating MO into the Ontology for Biomedical Investigations (OBI). The primary usage of MO has been for the annotation of microarray experiments. MO terms can be found incorporated
into a number of microarray databases (e.g., ArrayExpress, the RNA Abundance Database (RAD) (32), caArray (http://caarray.nci.nih.gov/), the Stanford Microarray Database (SMD) (33–38), maxD (http://www.bioinf.manchester.ac.uk/microarray/maxd), and MiMiR (39)), enabling consistent retrieval of studies across these different sites. MO terms have also been used as part of column headers for MAGE-TAB (30), a tab-delimited form of MAGE. Example terms from MO v.1.3.1.1:

● BioMaterialPackage (MO_182): Description of the source of the nucleic acid used to generate labeled material for the microarray experiment (an abstract class taken from MAGE to organize MO).
● BioMaterialCharacteristics (MO_5): Properties of the biomaterial before treatment in any manner for the purposes of the experiment (a subclass of BioMaterialPackage).
● CellTypeDatabase (MO_141): Database of cell type information (a subclass of Database).
● eVOC (MO_684): Ontology of human terms that describe the sample source of human cDNA and SAGE libraries (an instance of CellTypeDatabase).

3.2.6. MAQC
The MAQC project (MicroArray Quality Control project, http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc) aims to develop best practices for executing microarray experiments and analyzing results in a manner that maximizes consistency between different vendor platforms. The effort is spearheaded by the U.S. Food and Drug Administration (FDA) and has participants spanning the microarray industry. The work of the MAQC project is providing guidance for the development of quality measures and procedures that will facilitate the reliable use of microarray technology in clinical practice and regulatory decision-making, thereby helping realize the promises of personalized medicine (40). The project consists of two phases:

1. MAQC-I demonstrated the technical performance of microarray platforms in the identification of differentially expressed genes (20).
2. MAQC-II is aimed at reaching consensus on best practices for developing and validating predictive models based on microarray data. This phase includes genotyping data as well as the gene expression data that was the focus of MAQC-I. MAQC-II is currently in progress, with results expected soon (http://www.fda.gov/nctr/science/centers/toxicoinformatics/maqc).
3.2.7. ERCC
The External RNA Control Consortium (ERCC, http://www.cstl.nist.gov/biotech/Cell&TissueMeasurements/GeneExpression/ERCC.htm) aims to create well-characterized and tested RNA spike-in controls for gene expression assays. It has worked with the U.S. National Institute of Standards and Technology (NIST) to create certified reference materials useful for evaluating sample and system performance. Such materials facilitate standardized data comparisons among commercial and custom microarray gene expression platforms, as well as with alternative expression profiling methods such as qRT-PCR. The ERCC originated in 2003 and has grown to include more than 90 organizations spanning a cross-section of industry and academic groups from around the world. The controls developed by this group have been based on contributions from member organizations and have undergone rigorous evaluation to ensure efficacy across different expression platforms.
3.3. Proteomic Standards
This section describes the standards and organizations related to technologies that measure protein-related phenomena on a genomic scale.
3.3.1. HUPO PSI
The primary digital communications standards organization in this domain is the Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI, http://www.psidev.info), which provides an official process for drafting, reviewing, and accepting proteomics-related standards (41). As with other standardization efforts, the PSI creates and promotes both minimum information standards, which define what metadata about a study should be provided, and data exchange standards, which define the standardized, computer-readable format for conveying the information. Within the PSI are six working groups, which define standards in subdomains representing different components of typical workflows or different types of investigation:
● Sample processing
● Gel electrophoresis
● Mass spectrometry
● Proteomics informatics
● Protein modifications
● Molecular interactions

3.3.2. MIAPE
MIAPE (Minimum Information About a Proteomics Experiment, http://www.psidev.info/index.php?q=node/91) is a reporting standard for proteomics experiments, analogous to MIAME for gene expression experiments. The main MIAPE publication (19) describes the overall goals and organization of the MIAPE
specifications. Each subdomain (e.g., sample processing, column chromatography, mass spectrometry, etc.) has been given a separate MIAPE module that describes the information needed for each component of the study being presented. The PSI has actively engaged journal editors to refine the MIAPE modules to a level that the editors are willing to enforce.

3.3.3. Proteomics Mass Spectrometry Data Exchange Formats
Since 2003, several data formats for encoding data from proteomics mass spectrometry experiments have emerged. Some early XML-based formats originating from the Institute for Systems Biology, such as mzXML (42) and pepXML/protXML (43), were widely adopted and became de facto standards. More recently, the PSI has built on these formats to develop official standards such as mzML (44) for mass spectrometer output, GelML for gel electrophoresis, and mzIdentML for the bioinformatics analysis results derived from such data. See Deutsch et al. (45) for a review of some of these formats. These newer PSI formats are accompanied by controlled vocabularies, semantic validator software, example instance documents, and in some cases fully open-source software libraries to enable swift adoption of these standards.
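As a minimal illustration, the Python sketch below streams spectrum identifiers out of an mzML file using only the standard library. The namespace URI and the local file name are assumptions; dedicated PSI-aware libraries with semantic validation are the more robust route for real data.

```python
# A minimal sketch of pulling spectrum identifiers out of an mzML file
# with the standard library. The namespace below is assumed to be the
# usual mzML namespace, and "example.mzML" is a hypothetical local file.
import xml.etree.ElementTree as ET

NS = "{http://psi.hupo.org/ms/mzml}"  # assumed mzML namespace


def spectrum_ids(path):
    for _, elem in ET.iterparse(path):
        if elem.tag == NS + "spectrum":
            yield elem.get("id")
            elem.clear()  # keep memory bounded on large files


for sid in spectrum_ids("example.mzML"):
    print(sid)
```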
3.3.4. MIMIx
The PSI Molecular Interactions (MI) Working Group (http://www.psidev.info/index.php?q=node/277) has developed and approved several standards to facilitate the sharing of molecular interaction information. MIMIx (Minimum Information about a Molecular Interaction Experiment) (46) is the minimum information standard that defines what information must be present in a compliant list of molecular interactions. The PSI-MI XML (or MIF) standard is an XML-based data exchange format for encoding the results of molecular interaction experiments. A major component of the format is a controlled vocabulary (PSI-MI CV) that ensures the terms used to describe and annotate interactions are applied consistently by all documents and software. In addition to the XML format, a simpler tab-delimited data exchange format, MITAB2.5, has been developed. It supports a subset of the PSI-MI XML functionality and can be edited easily using widely available spreadsheet software (47).
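A hedged sketch of how simple the tab-delimited representation is to process: the Python fragment below pulls the two interactor identifier columns from a single invented MITAB2.5-style row (the remaining columns are elided with placeholder dashes, and the column semantics are simplified).

```python
# A minimal sketch of reading a tab-delimited molecular interaction row
# in the spirit of MITAB2.5. The row is invented; only the first two
# columns (identifiers of interactors A and B) are interpreted here.
row = "uniprotkb:P04637\tuniprotkb:Q00987\t" + "\t".join(["-"] * 13)


def interactors(mitab_row):
    cols = mitab_row.split("\t")
    return cols[0], cols[1]  # interactor A, interactor B


a, b = interactors(row)
print(f"{a} interacts with {b}")
```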
3.4. Metabolomics Standards
This section describes the standards and organizations related to the study of metabolomics, which profiles low-molecular-weight metabolites on a comprehensive, genomic scale within a biological sample. Metabolomic standards initiatives are not as mature as those in the transcriptomic and proteomic domains, though there is growing community interest in this area. (Note that no distinction is made in this text between metabolomics and metabonomics; we use "metabolomics" to refer to both types of investigation, insofar as a distinction exists.)
Metabolomic standards pertain to the following technologies or types of investigation:

● Metabolic profiling of all compounds in a specific pathway
● Biochemical network modeling
● Biochemical network perturbation analysis (environmental, genetic)
● Network flux analysis

The metabolomics research community is engaged in the development of a variety of standards, coordinated by the Metabolomics Standards Initiative (48, 49). Under development are "minimum information" reporting standards (48, 50), data exchange formats (51), data models (52–54), and standard ontologies (55). A number of specific experiment description-related projects for metabolomics are described below.

3.4.1. CIMR
CIMR (Core Information for Metabolomics Reporting, http://msi-workgroups.sourceforge.net) is in development as a minimal information guideline for reporting metabolomics experiments. It is expected to cover all metabolomics application areas and analysis technologies. The MSI is also involved in collaborative efforts to develop ontologies and data exchange formats for metabolomics experiments.
3.4.2. MeMo and ArMet
MeMo (Metabolic Modelling, http://dbkgroup.org/memo) defines a data model and XML-based data exchange format for metabolomic studies in yeast (54). ArMet (Architecture for Metabolomics, http://www.armet.org) defines a data model for plant metabolomics experiments and also provides guidance for data collection (52, 56).
3.5. Healthcare Standards
The healthcare community has a long history of using standards to drive data exchange and submission to regulatory agencies. Within this setting, it is vital to ensure that data from assays pass quality assessments and can be transferred without loss of meaning, in a format that can easily be used by common tools. The drive to translate Omics approaches from research to clinical settings has provided strong motivation for the development of physical standards and guidelines for their use. Omics technologies hold much promise for improving our understanding of the molecular basis of disease and for developing improved diagnostics and therapeutics tailored to individual patients (6, 57). Looking forward, the healthcare community is now engaged in numerous efforts to define standards important for clinical, diagnostic, and toxicological applications of data from high-throughput genomics technologies. The types and amount of data from a clinical trial or toxicogenomics study are quite extensive,
incorporating data from multiple Omics domains. Standards development for electronic submission of these data is still ongoing, with best practices yet to emerge. While it is likely that high-throughput data will be summarized prior to transmission, it is anticipated that raw files should be available for analysis if requested by regulators and other scientists. Standards-related activities pertaining to the use of Omics technologies within a healthcare setting can be roughly divided into three main focus areas: experiment description standards, reference materials, and laboratory procedures.
Orthogonal to the experiment description standards efforts in the basic research and technical communities, clinicians and biologists have identified the need to describe the characteristics of an organism or specimen under study in a way that is understandable to clinicians as well as scientists. Under development within these biomedical communities are reporting standards that codify what data should be captured, and in what data exchange format, to permit reuse of the data by others. As with the other minimum information standards, the goal is to create a common way to describe the characteristics of the objects of a study and to identify the essential characteristics to include when publishing the study. Parallel work is underway in the arena of toxicogenomics (21, 58). Additionally, standard terminologies in the form of thesauri or controlled vocabularies, along with systematic annotation methods, are under development.

It is envisioned that clinically relevant standards (some of which are summarized in Table 8) will be used in conjunction with the experiment description standards being developed by the basic research communities that study the same biological objects and organisms. For example, ISA-TAB (described below) is intended to complement existing biomedical formats such as the Study Data Tabulation Model (SDTM), an FDA-endorsed data model created by CDISC to organize, structure, and format both clinical and nonclinical (toxicological) data submissions to regulatory authorities (http://www.cdisc.org/models/sds/v3.1/index.html). It is inevitable that some information will be duplicated between the two frameworks, but this is not generally seen as a major problem. Links between related components of ISA-TAB and SDTM could be created using properties of the subject source, for example.
3.5.2. Reference Materials
Developing industry-respected standard reference materials, such as a reagent for use as a positive or negative control in an assay, is essential for any work in a clinical or diagnostic setting. Reference materials are physical standards (see above) that provide an objective way to evaluate the performance of laboratory equipment, protocols, and sample integrity.
Table 8
Summary of healthcare experiment description standards initiatives

Acronym | Full name | Description | Scope | Website
BIRN | Biomedical Informatics Research Network | Collaborative informatics resources for medical/clinical data | Data analysis; terminology | http://www.nbirn.net
CDISC | Clinical Data Interchange Standards Consortium | Regulatory submissions of clinical data | Data exchange & modeling | http://www.cdisc.org
CONSORT | Consolidated Standards of Reporting Trials | Minimum requirements for reporting randomized clinical trials | Reporting | http://www.consort-statement.org
EVS | Enterprise Vocabulary Services | Controlled vocabulary by the National Cancer Institute in support of cancer | Terminology | http://www.cancer.gov/cancertopics/terminologyresources
HL7 | Health Level 7 | Programmatic data exchange for healthcare applications | Data exchange | http://www.hl7.org
SEND | Standards for Exchange of Preclinical Data | Regulatory submissions of preclinical data; based on CDISC | Data exchange & modeling | http://www.cdisc.org/standards
ToxML | Toxicology XML | Toxicology data exchange; based on controlled vocabulary | Data exchange; terminology | http://www.leadscope.com/toxml.php
The lack of suitable reference materials and guidelines for their use has been a major factor in slowing the adoption of Omics technologies such as DNA microarrays within clinical and diagnostic settings (6). The ERCC (described above) and the LGC (http://www.lgc.co.uk) are the key organizations working on the development of standard reference materials, currently targeting transcriptomics experiments.

3.5.3. Laboratory Procedures
Standard protocols providing guidance in the application of reference materials, experiment design, and data analysis best practices are essential for performing high-throughput Omics procedures in clinical or diagnostic applications.
Table 9
CLSI documents most relevant to functional genomics technologies

Document | Description | Status
MM12-A | Diagnostic Nucleic Acid Microarrays | Approved guideline
MM14-A | Proficiency Testing (External Quality Assessment) for Molecular Methods | Approved guideline
MM16-A | Use of External RNA Controls in Gene Expression Assays | Approved guideline
MM17-A | Verification and Validation of Multiplex Nucleic Acid Assays | Approved guideline
The Clinical and Laboratory Standards Institute (CLSI, http://www.clsi.org) is an organization that provides an infrastructure for ratifying and publishing guidelines for clinical laboratories. Working with organizations such as the ERCC (described above), it has produced a number of documents (see Table 9) applicable to the use of multiplex, whole-genome technologies such as gene expression and genotyping within clinical or diagnostic settings.

3.6. Challenges for Omics Standards in Basic Research
A major challenge facing Omics standards is proving their value to a significant fraction of the user base and thereby facilitating widespread adoption. Given the relative youth of the field of Omics and of the standardization efforts for such work, the main selling point for use of a standard has been that it will benefit future scientists and application/database developers, with limited added value for the users who are being asked to comply with the standard at publication time. Regardless of how well designed a standard is, if complying with it is perceived as difficult or complicated, widespread adoption is unlikely to occur. Some degree of enforcement of compliance by publishers and data repositories will most likely be required to inculcate a standard and build a critical mass within the targeted scientific community that then sustains its adoption. Significant progress has been achieved here: for DNA microarray gene expression studies, for example, most journals now require MIAME compliance, and there is broad recognition of the value of this standard within the target community. Here are some of the "pressure points" any standard will experience from its community of intended users:

● Domain experts, who want to ensure comprehensiveness of the standard
● End-user scientists, who want the standard to be easy to comply with
● Software developers, who want tools for encoding and decoding standards-compliant data
● Standards architects, who want to ensure formal correctness of the standard
Satisfying all of these interests is not an easy task. One complication is that the various interested groups may not be equally involved in the development of the standard. Balancing these different priorities and groups is the task of the group responsible for maintaining a standard. This is an ongoing process that must remain responsive to user feedback. The MAGE-TAB data exchange format in the DNA microarray community provides a case in point: it was created largely in response to users who found MAGE-ML difficult to work with.

3.7. Challenges for Omics Standards in Healthcare Settings
The handling of clinical data adds further challenges on top of the intrinsic complexities of Omics data. Investigators must respect regulations imposed by regulatory authorities. For example, the Health Insurance Portability and Accountability Act (HIPAA) mandates the de-identification of patient data to protect an individual's privacy. Standards and information systems used by the healthcare community must therefore be formulated to deal with such regulations (e.g., (59)). While the use of open standards poses risks of releasing protected health information, the removal of detailed patient metadata about samples can present barriers to research (60, 61). Enabling effective research while maintaining patient privacy remains an ongoing issue (Joe White, Dana-Farber Cancer Institute, personal communication).
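As a deliberately simplified illustration of de-identification, the Python sketch below strips a few direct identifiers from an invented sample record. Real HIPAA-compliant de-identification covers many more identifier classes and must follow institutional and regulatory guidance.

```python
# A deliberately simplified sketch of de-identifying sample metadata
# before sharing. Field names are invented; real de-identification
# involves many more identifier classes than shown here.
DIRECT_IDENTIFIERS = {"patient_name", "medical_record_number",
                      "date_of_birth", "address"}


def deidentify(record):
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


sample = {
    "patient_name": "Jane Doe",
    "medical_record_number": "MRN-0042",
    "diagnosis": "breast carcinoma",
    "rna_sample_id": "S-17",
}
# Clinical and assay fields survive; direct identifiers do not.
print(deidentify(sample))
```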
3.8. Upcoming Trends: Standards Harmonization
The field of Omics is not suffering from a lack of interest in standards development, as the number of different standards discussed in this chapter attests. Such a complex landscape can have adverse effects on data sharing, integration, and systems interoperability – the very things that the standards are intended to help (62). To address this, a number of projects in the research and biomedical communities are engaged in harmonization activities that focus on integrating standards with related or complementary scope, aiming to enhance interoperability in the reporting and analysis of data generated by different technologies or within different Omics domains.

Some standards facilitate harmonization by having a sufficiently general-purpose design that they can accommodate data from experiments in different domains. Such "multiomics" standards typically have a mechanism that allows them to be extended as needed in order to incorporate aspects specific to a particular application area. The use of these domain- and technology-neutral frameworks is anticipated to improve the interoperability of data analysis tools that need to handle data from different types of Omics experiments, as well as to reduce wheel reinvention by different standards groups with similar needs.
Table 10
Existing Omics standards harmonization projects and initiatives

Acronym | Full name | Scope | Organization
FuGE-ML / FuGE-OM | Functional Genomics Experiment Markup Language/Object Model | Data exchange & modeling | FuGE
ISA-TAB | Investigation Study Assay Tabular Format | Data exchange | RSBI, GSC, MSI, HUPO-PSI
HITSP | Healthcare Information Technology Standards Panel | (various) | ANSI
MIBBI | Minimum Information for Biological and Biomedical Investigations | Reporting | MIBBI
OBI | Ontology for Biomedical Investigations | Terminology | OBI
OBO | Open Biomedical Ontologies | Terminology | NCBO
P3G | Public Population Project in Genomics | (various) | International Consortium

P3G covers harmonization between genomic biobanks and longitudinal population genomic studies, including technical, social, and ethical issues: http://www.p3gconsortium.org. The other projects noted in this table are described further in the chapter
Harmonization and multiomics projects are collaborative efforts, involving participants from different domain-specific standards developing organizations with shared interests. Indeed, the success of these efforts depends on continued broad-based community involvement. In Table 10, we describe the major current efforts in multiomics standards and harmonization.
The FuGE (Functional Genomics Experiment, http://fuge.sourceforge.net) project aims to build generic components that capture common facets of different Omics domains (63). Its contributors come from different standards efforts, primarily MGED and HUPO-PSI, reflecting the desire to build components that provide properties and functionalities common across different Omics technologies and application areas.
The vision of this effort is that, using FuGE-based components, a software developer will be better able to create and modify tools for handling Omics data without having to reinvent the wheel for common tasks in potentially incompatible ways. Further, tools based on such shared components are expected to be more interoperable. FuGE has several sub-projects, including the FuGE Object Model (FuGE-OM) and the FuGE Markup Language (FuGE-ML), a data exchange format. Technology-specific aspects can be added by extending the generic FuGE components, building on the common functionalities. For example, a microarray-specific framework equivalent to MAGE could be derived by extending FuGE, deriving microarray-specific objects from the FuGE object model.
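The extension pattern described here can be sketched in a few lines of Python. The classes below are FuGE-like in spirit only, with all names invented, showing generic components specialized with a microarray-specific object.

```python
# A minimal sketch of the extension pattern described above: generic,
# domain-neutral classes (FuGE-like in spirit only) specialized for a
# microarray application. All class and attribute names are invented.
from dataclasses import dataclass


@dataclass
class Material:
    """Generic entity shared across Omics domains."""
    name: str


@dataclass
class ProtocolApplication:
    """Generic record of applying a protocol to an input material."""
    protocol_name: str
    input_material: Material


@dataclass
class ArrayDesign(Material):
    """Microarray-specific extension of the generic Material."""
    platform: str = "unknown"


hyb = ProtocolApplication("hybridization",
                          ArrayDesign("A-AFFY-44", platform="Affymetrix"))
print(hyb.protocol_name, "->", hyb.input_material.name)
```

Because ProtocolApplication only depends on the generic Material, any domain-specific subclass can be plugged in without changing the shared component, which is the interoperability argument made for FuGE above.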
3.8.2. HITSP

The Healthcare Information Technology Standards Panel (HITSP) is a public-private partnership of standards developers, healthcare providers, government representatives, consumers, and vendors in the healthcare industry. It is administered by the American National Standards Institute (ANSI, http://www.ansi.org) to harmonize healthcare-related standards and improve the interoperability of healthcare software systems. It produces recommendations and reports contributing to the development of a Nationwide Health Information Network for the United States (NHIN, http://www.hhs.gov/healthit/healthnetwork/background). The HITSP is driven by use cases issued by the American Health Information Community (AHIC, http://www.hhs.gov/healthit/community/background). A number of use cases have been defined on a range of topics, such as personalized healthcare, newborn screening, and consumer adverse event reporting (http://www.hhs.gov/healthit/usecases).
3.8.3. ISA-TAB
The ISA-TAB format (Investigation Study Assay Tabular format, http://isatab.sourceforge.net) is a general-purpose framework with which to communicate both data and metadata from experiments involving a combination of functional technologies (64). ISA-TAB therefore has broader applicability and a more extended structure compared with a domain-specific data exchange format such as MAGE-TAB. An example where ISA-TAB might be applied is an experiment looking at changes both in (1) the metabolite profile of urine and (2) gene expression in the liver, in subjects treated with a compound inducing liver damage, using mass spectrometry and DNA microarray technologies, respectively. The ISA-TAB format is the backbone of the ISA Infrastructure – a set of tools that support the capture of multiomics experiment descriptions. It also serves as a submission format to compatible databases such as the BioInvestigation Index project at the EBI
(http://www.ebi.ac.uk/bioinvindex). It allows users to create a common structured representation of the metadata required to interpret an experiment for the purpose of combined submission to experimental data repositories such as ArrayExpress, PRIDE, and an upcoming metabolomics repository (64). Additional motivation comes from a group of collaborative systems, part of MGED's RSBI group (65), each of which is committed to pipelining Omics-based experimental data into EBI public repositories, willing to exchange data among themselves, or aiming to enable their users to import data from public repositories into their local systems.

ISA-TAB has a number of additional features that make it a more general framework, one that can comfortably accommodate multidomain experimental designs. ISA-TAB builds on the MAGE-TAB paradigm and shares its motivation for the use of tab-delimited text files; i.e., that they can easily be created, viewed, and edited by researchers using spreadsheet software such as Microsoft Excel. ISA-TAB also employs MAGE-TAB syntax as far as possible, to ensure backward compatibility with existing MAGE-TAB files. It was also important to align the concepts in ISA-TAB with some of the objects in the FuGE model. The ISA-TAB format could be seen as competing with XML-based formats such as FuGE-ML. However, ISA-TAB addresses the immediate need for a framework with which to communicate multiomics experiments, whereas all existing FuGE-based modules are still under development. When these become available, ISA-TAB could continue serving those with minimal bioinformatics support, as well as finding utility as a user-friendly presentation layer for XML-based formats (via an XSL transformation); i.e., in the manner of the HTML rendering of MAGE-ML documents.

Initial work has been carried out to evaluate the feasibility of rendering FuGE-ML files (and FuGE-based extensions, such as GelML and Flow-ML) in the ISA-TAB format. Examples are available at the ISA-TAB website under the documents section, along with a report detailing the issues faced during these transformations. When finalized, the XSL templates will also be released, along with XPath expressions and a table mapping FuGE objects to ISA-TAB labels. Additional ISA-TAB-formatted examples are available, including a MIGS-MIMS-compliant dataset (see http://isatab.sourceforge.net/examples.html).

The decision on how to regulate the use of ISA-TAB (marking certain fields as mandatory or enforcing the use of controlled terminology) is a matter for those who implement the format in their systems. Although certain fields would benefit from the use of controlled terminology, ISA-TAB files with all fields left empty are syntactically valid, as are those where all fields are filled with free-text values rather than controlled vocabulary or ontology terms.
3.8.4. MIBBI
Experiments in different Omics domains typically share some reporting requirements (for example, specifying the source of a biological specimen). The MIBBI project (Minimum Information for Biological and Biomedical Investigations, http://mibbi.org; developers: http://mibbi.sourceforge.net) aims to work collaboratively with different groups to harmonize and modularize their minimum information checklists (e.g., MIAME, MIGS-MIMS, etc.), refactoring the common requirements to make it possible to use the checklists in combination (21). Additionally, the MIBBI project provides a comprehensive web portal offering registration of, and access to, the different minimum information checklists for different types of Omics (and other) experiments.
3.8.5. OBI
An excellent description of the OBI project comes from its home web page: the Ontology for Biomedical Investigations (OBI, http://purl.obofoundry.org/obo/obi) project is developing an integrated ontology for the description of biological and medical experiments and investigations. This includes a set of "universal" terms that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. The ontology supports the consistent annotation of biomedical investigations, regardless of the particular field of study, and models the design of an investigation, the protocols and instrumentation used, the materials used, the data generated, and the type of analysis performed on it. This project was formerly called the Functional Genomics Investigation Ontology (FuGO) project (66).

OBI is a collaborative effort of many communities representing particular research domains and technological platforms (http://obi-ontology.org/page/Consortium). OBI is meant to serve very practical needs rather than be an academic exercise; thus it is very much driven by use cases and validation questions. The OBI user community provides valuable feedback about the utility of OBI and acts as a source of terms and use cases. As a member of the OBO Foundry (described below), OBI has made a commitment to be interoperable with other biomedical ontologies. Each term in OBI has a set of annotation properties, some of which are mandatory (minimal metadata defined at http://obi-ontology.org/page/OBI_Minimal_metadata). These include the term's preferred name, definition source, editor, and curation status.
3.8.6. OBO Consortium and the NCBO
The OBO Consortium (Open Biomedical Ontologies Consortium, http://www.obofoundry.org), a voluntary, collaborative effort among different OBO developers, has developed the OBO Foundry as a way to avoid the proliferation of incompatible ontologies in the biomedical domain (25). The OBO Foundry provides validation and assessment of ontologies to ensure
interoperability. It also defines principles and best practices for ontology construction, such as the Basic Formal Ontology, which serves as a root-level ontology from which other domain-specific ontologies can be built, and the Relations Ontology, which defines a common set of relationship types (67). Incorporation of such elements within OBO is intended to facilitate interoperability between ontologies (i.e., to allow one OBO Foundry ontology to import components of other ontologies without conflict) and the construction of "accurate representations of biological reality."

The NCBO (National Center for Biomedical Ontology, http://bioontology.org) supports the OBO Consortium by providing tools and resources to help manage the ontologies and to help the scientific community access, query, visualize, and use them to annotate experimental data (68). The NCBO's BioPortal website provides searches across multiple ontologies and contains a large library of ontologies spanning many species and many scales, from molecules to whole organisms. The ontology content comes from the model organism communities, biology, chemistry, anatomy, radiology, and medicine. Together, the OBO Consortium and the NCBO are helping to construct a consistent arsenal of ontologies and to promote their application in annotating Omics and other biological experiments. This is the sort of community-based ontology building that holds great potential to help the life science community convert complex and daunting Omics data sets into new discoveries that expand our knowledge and improve human health.

3.9. Concluding on the Need for Standards
A key motivation behind Omics standards is to foster data sharing, reuse, and integration, with the ultimate goal of producing new biological insights (within basic research environments) and better medical treatments (within healthcare environments). Widely adopted minimum information guidelines for publication and formats for data exchange are leading to increased and better reporting of results, greater submission of experimental data to public repositories, and more effective data mining of large Omics data sets. Standards harmonization efforts are in progress to improve data integration and the interoperability of software within both basic research and healthcare environments. Standard reference materials, and protocols for their use, are also under active development and hold much promise for improving data quality and systems benchmarking and for facilitating the use of Omics technologies within clinical and diagnostic settings.

High-throughput Omics experiments, with their large and complex data sets, have posed many challenges to the creation and adoption of standards. However, in recent years, the standards initiatives in this field have risen to the challenge and continue to
Data Standards for Omics Data: The Basis of Data Sharing and Reuse
63
engage their respective communities to improve the fit of the standards to user and market needs. Omics communities have recognized that standards-compliant software tools can go a long way towards enhancing the adoption and usefulness of a standard by enabling ease-of-use. For data exchange standards, such tools can “hide the technical complexities of the standard and facilitate manipulation of the standard format in an easy way” (8). Some tools can themselves become part of standard practice when they are widely used throughout a community. Efforts are underway within organizations such as MGED and HUPO PSI to enhance the usefulness of tools for end user scientists working with standard data formats in order to ease the process of data submission, annotation, and analysis. The widespread adoption of some of the more mature Omics standards by large numbers of life science researchers, data analysts, software developers, and journals has had a number of benefits. Adoption has promoted data sharing and reanalysis, facilitated publication, and spawned a number of data repositories to store data from Omics experiments. A higher citation rate and other benefits have been detected for researchers who share their data (7, 69). Estimates of total volume of high-throughput data available in the public domain are complex to calculate, but a list of databases maintained by the Nucleic Acids Research journal (http://www3.oup.co.uk/nar/database/a) contained more than 1,000 databases in areas ranging from nucleic acid sequence data to experimental archives and specialist data integration resources (70). More public databases appear every year and as technologies change, so that deep sequencing of genomes and transcriptomes becomes more cost effective, the volume will undoubtedly rise even further. Consistent annotation of this growing volume of Omics data using interoperable ontologies and controlled vocabularies will play an important role in enabling collaborations and reuse of the data by other third parties. More advanced forms of knowledge integration that rely on standard terminologies are beginning to be explored using semantic web approaches (71–73). Adherence to standards by public data repositories is expected to facilitate data querying and reuse. Even in the absence of strict standards (such as compliance requirements upon data submission), useful data mining can be performed from large bodies of raw data originating from the same technology platform (74), especially if standards efforts make annotation guidelines available and repositories encourage their use. Approaches such as this may help researchers better utilize the limited levels of consistently annotated data in the public domain. It was recently noted that only a fraction of data generated is deposited in public data repositories (75). Improvements in this
64
Chervitz et al.
area can be anticipated through the proliferation of better tools for bench scientists that make it easier for them to submit their data in a consistent, standards-compliant manner. The full value of Omics research will only be realized once scientists in the laboratory and the clinic are able to share and integrate over large amounts of Omics data as easily as they can now do so with primary biological sequence data.
4. Notes
1. Tools for programmers: Many labs need to implement their own tools for managing and analyzing data locally. There are a number of parsers and tool kits for common data formats that can be reused in this context. These are listed in Table 11.
2. Tips for using standards: Standards are commonly supported by tools and applications related to projects or to public repositories. One example is the ISA-TAB related infrastructure described in Subheading 3.8.3; others are provided in Table 12. These include simple conversion tools for formats used by standards-compliant databases such as ArrayExpress and GEO, and tools that allow users to access these databases and load data into analysis applications.
Table 11
Programmatic tools for dealing with standards, ontologies, and common data formats

Tool name        Language       Purpose                                      Website
Limpopo          Java           MAGE-TAB parser                              http://sourceforge.net/projects/limpopo/
MAGEstk          Perl and Java  MAGE-ML toolkit                              http://www.mged.org/Workgroups/MAGE/magestk.html
MAGE-Tab module  Perl           MAGE-TAB API                                 http://magetabutils.sourceforge.net/
OntoCat          Java           Ontology access tool for OWL and OBO         http://ontocat.sourceforge.net/
                                format files and ontology web services
OWL-API          Java           Reading and querying OWL and OBO             http://owlapi.sourceforge.net/
                                format files
Table 12
Freely available standards-related format conversion tools

Tool name             Language          Formats supported                  Website
MAGETabulator         Perl              SOFT to MAGE-TAB                   http://tab2mage.sourceforge.net
MAGETabulator         Perl              MAGE-TAB to MAGE-ML                http://tab2mage.sourceforge.net
ArrayExpress package  R (Bioconductor)  MAGE-TAB to R objects              http://www.bioconductor.org/packages/bioc/html/ArrayExpress.html
GEOquery              R (Bioconductor)  GEO SOFT to R objects              http://www.bioconductor.org/packages/1.8/bioc/html/GEOquery.html
ISA-Creator           Java              ISA-TAB to MAGE-TAB                http://isatab.sourceforge.net/tools.html
ISA-Creator           Java              ISA-TAB to PRIDE XML               http://isatab.sourceforge.net/tools.html
ISA-Creator           Java              ISA-TAB to Short Read Archive XML  http://isatab.sourceforge.net/tools.html
Table 13
Standards-compliant data annotation tools

Tool name      Language        Purpose                                    Website
Annotare       Adobe AIR/Java  Desktop MAGE-TAB annotation application    http://code.google.com/p/annotare/
MAGETabulator  Perl            MAGE-TAB template generation and related   http://tab2mage.sourceforge.net
                               database
caArray        Java            MAGE-TAB data management solution          https://array.nci.nih.gov/caarray/home.action
ISA-Creator    Java            ISA-TAB annotation application             http://isatab.sourceforge.net/tools.html
3. Annotation tools for biologists and bioinformaticians: Annotation of data to be compliant with standards is supported by several open-source annotation tools. Some of these are related to repositories supporting standards, but most are also available for local installation. These are described in Table 13.
4. Tips for using ontologies: Further introductory information on the design and use of ontologies can be found at the Ontogenesis site (http://ontogenesis.knowledgeblog.org). Publicly available ontologies can be queried from the NCBO's website (http://www.bioportal.org), and tutorials on developing ontologies and using supporting tools such as the OWL-API are run by several organizations, including the NCBO, the OBO Foundry, and the University of Manchester, UK.
5. Format conversion tools: The MAGE-ML format described in Subheading 3.2.4 has been superseded by MAGE-TAB, and the different gene expression databases use different formats to express the same standards-compliant data. There are therefore a number of open-source conversion tools that reformat data, or preprocess data for access by analysis applications. These are provided as downloadable applications and are summarized in Table 12. Support for understanding and applying data formats is often available from the repositories that use these formats for data submission and exchange. Validation tools and supporting code may also be available; email the respective helpdesks for support.
6. Tips for developing standards: Most standards bodies have affiliated academic or industry groups and fora that are developing applications and welcome input from the community. For example, MGED has mailing lists, workshops, and an open-source project that provides tools for common data representation tasks.

References

1. Boguski, M.S. (1999) Biosequence exegesis. Science 286(5439), 453–5.
2. Brazma, A. (2001) On the importance of standardisation in life sciences. Bioinformatics 17(2), 113–4.
3. Stoeckert, C.J., Jr., Causton, H.C., and Ball, C.A. (2002) Microarray databases: standards and ontologies. Nat Genet 32, 469–73.
4. Brooksbank, C., and Quackenbush, J. (2006) Data standards: a call to action. OMICS 10(2), 94–9.
5. Rogers, S., and Cambrosio, A. (2007) Making a new technology work: the standardization and regulation of microarrays. Yale J Biol Med 80(4), 165–78.
6. Warrington, J.A. (2008) Standard controls and protocols for microarray based assays in clinical applications, in Book of Genes and Medicine. Medical Do Co: Osaka.
7. Piwowar, H.A., et al. (2008) Towards a data sharing culture: recommendations for leadership from academic health centers. PLoS Med 5(9), e183.
8. Brazma, A., Krestyaninova, M., and Sarkans, U. (2006) Standards for systems biology. Nat Rev Genet 7(8), 593–605.
9. Brazma, A., et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29(4), 365–71.
10. Spellman, P.T., et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 3(9), RESEARCH0046.
11. Whetzel, P.L., et al. (2006) The MGED ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22(7), 866–73.
12. Parkinson, H., et al. (2009) ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Database issue), D868–72.
13. Parkinson, H., et al. (2007) ArrayExpress – a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35(Database issue), D747–50.
14. Parkinson, H., et al. (2005) ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 33(Database issue), D553–5.
15. Barrett, T., and Edgar, R. (2006) Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol 411, 352–69.
16. Barrett, T., et al. (2005) NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res 33(Database issue), D562–6.
17. Barrett, T., et al. (2007) NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucleic Acids Res 35(Database issue), D760–5.
18. Barrett, T., et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 37(Database issue), D885–90.
19. Taylor, C.F., et al. (2007) The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25(8), 887–93.
20. Shi, L., et al. (2006) The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 24(9), 1151–61.
21. Taylor, C.F., et al. (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26(8), 889–96.
22. DeFrancesco, L. (2002) Journal trio embraces MIAME. Genome Biol 8(6), R112.
23. Jones, A.R., and Paton, N.W. (2005) An analysis of extensible modelling for functional genomics data. BMC Bioinformatics 6, 235.
24. Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1), 25–9.
25. Smith, B., et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11), 1251–5.
26. Salit, M. (2006) Standards in gene expression microarray experiments. Methods Enzymol 411, 63–78.
27. Li, H., et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16), 2078–9.
28. Brookes, A.J., et al. (2009) The phenotype and genotype experiment object model (PaGE-OM): a robust data structure for information related to DNA variation. Hum Mutat 30(6), 968–77.
29. Brazma, A., and Parkinson, H. (2006) ArrayExpress service for reviewers/editors of DNA microarray papers. Nat Biotechnol 24(11), 1321–2.
30. Rayner, T.F., et al. (2006) A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics 7, 489.
31. Rayner, T.F., et al. (2009) MAGETabulator, a suite of tools to support the microarray data format MAGE-TAB. Bioinformatics 25(2), 279–80.
32. Manduchi, E., et al. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics 20(4), 452–9.
33. Ball, C.A., et al. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 33(Database issue), D580–2.
34. Demeter, J., et al. (2007) The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res 35(Database issue), D766–70.
35. Gollub, J., et al. (2003) The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res 31(1), 94–6.
36. Gollub, J., Ball, C.A., and Sherlock, G. (2006) The Stanford Microarray Database: a user's guide. Methods Mol Biol 338, 191–208.
37. Hubble, J., et al. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37(Database issue), D898–901.
38. Sherlock, G., et al. (2001) The Stanford Microarray Database. Nucleic Acids Res 29(1), 152–5.
39. Navarange, M., et al. (2005) MiMiR: a comprehensive solution for storage, annotation and exchange of microarray data. BMC Bioinformatics 6, 268.
40. Allison, M. (2008) Is personalized medicine finally arriving? Nat Biotechnol 26(5), 509–17.
41. Orchard, S., and Hermjakob, H. (2008) The HUPO proteomics standards initiative – easing communication and minimizing data loss in a changing world. Brief Bioinform 9(2), 166–73.
42. Pedrioli, P.G., et al. (2004) A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol 22(11), 1459–66.
43. Keller, A., et al. (2005) A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol Syst Biol 1, 0017.
44. Deutsch, E. (2008) mzML: a single, unifying data format for mass spectrometer output. Proteomics 8(14), 2776–7.
45. Deutsch, E.W., Lam, H., and Aebersold, R. (2008) Data analysis and bioinformatics tools for tandem mass spectrometry in proteomics. Physiol Genomics 33(1), 18–25.
46. Orchard, S., et al. (2007) The minimum information required for reporting a molecular interaction experiment (MIMIx). Nat Biotechnol 25(8), 894–8.
47. Kerrien, S., et al. (2007) Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol 5, 44.
48. Fiehn, O., et al. (2006) Establishing reporting standards for metabolomic and metabonomic studies: a call for participation. OMICS 10(2), 158–63.
49. Sansone, S.A., et al. (2007) The metabolomics standards initiative. Nat Biotechnol 25(8), 846–8.
50. Goodacre, R., et al. (2007) Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3(3), 231–41.
51. Hardy, N., and Taylor, C. (2007) A roadmap for the establishment of standard data exchange structures for metabolomics. Metabolomics 3(3), 243–8.
52. Jenkins, H., Johnson, H., Kular, B., Wang, T., and Hardy, N. (2005) Toward supportive data collection tools for plant metabolomics. Plant Physiol 138(1), 67–77.
53. Jenkins, H., et al. (2004) A proposed framework for the description of plant metabolomics experiments and their results. Nat Biotechnol 22(12), 1601–6.
54. Spasic, I., et al. (2006) MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. BMC Bioinformatics 7, 281.
55. Sansone, S.-A., Schober, D., Atherton, H., Fiehn, O., Jenkins, H., Rocca-Serra, P., et al. (2007) Metabolomics standards initiative: ontology working group work in progress. Metabolomics 3(3), 249–56.
56. Jenkins, H., Hardy, N., Beckmann, M., Draper, J., Smith, A.R., Taylor, J., et al. (2004) A proposed framework for the description of plant metabolomics experiments and their results. Nat Biotechnol 22(12), 1601–6.
57. Kumar, D. (2007) From evidence-based medicine to genomic medicine. Genomic Med 1(3–4), 95–104.
58. Fostel, J.M. (2008) Towards standards for data exchange and integration and their impact on a public database such as CEBS (Chemical Effects in Biological Systems). Toxicol Appl Pharmacol 233(1), 54–62.
59. Bland, P.H., Laderach, G.E., and Meyer, C.R. (2007) A web-based interface for communication of data between the clinical and research environments without revealing identifying information. Acad Radiol 14(6), 757–64.
60. Meslin, E.M. (2006) Shifting paradigms in health services research ethics. Consent, privacy, and the challenges for IRBs. J Gen Intern Med 21(3), 279–80.
61. Ferris, T.A., Garrison, G.M., and Lowe, H.J. (2002) A proposed key escrow system for secure patient information disclosure in biomedical research databases. Proc AMIA Symp, 245–9.
62. Quackenbush, J., et al. (2006) Top-down standards will not serve systems biology. Nature 440(7080), 24.
63. Jones, A.R., et al. (2007) The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25(10), 1127–33.
64. Sansone, S.A., et al. (2008) The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?" OMICS 12(2), 143–9.
65. Sansone, S.A., et al. (2006) A strategy capitalizing on synergies: the Reporting Structure for Biological Investigation (RSBI) working group. OMICS 10(2), 164–71.
66. Whetzel, P.L., et al. (2006) Development of FuGO: an ontology for functional genomics investigations. OMICS 10(2), 199–204.
67. Smith, B., et al. (2005) Relations in biomedical ontologies. Genome Biol 6(5), R46.
68. Rubin, D.L., et al. (2006) National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge. OMICS 10(2), 185–98.
69. Piwowar, H.A., and Chapman, W.W. (2008) Identifying data sharing in biomedical literature. AMIA Annu Symp Proc, 596–600.
70. Galperin, M.Y., and Cochrane, G.R. (2009) Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 37(Database issue), D1–4.
71. Ruttenberg, A., et al. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics 8(Suppl 3), S2.
72. Sagotsky, J.A., et al. (2008) Life Sciences and the web: a new era for collaboration. Mol Syst Biol 4, 201.
73. Stein, L.D. (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9(9), 678–88.
74. Day, A., et al. (2007) Celsius: a community resource for Affymetrix microarray data. Genome Biol 8(6), R112.
75. Ochsner, S.A., et al. (2008) Much room for improvement in deposition rates of expression microarray datasets. Nat Methods 5(12), 991.
Chapter 3

Omics Data Management and Annotation

Arye Harel, Irina Dalah, Shmuel Pietrokovski, Marilyn Safran, and Doron Lancet

Abstract

Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and management. In project planning, special consideration has to be given to choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems or relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter.

Key words: Data management, Omics data integration, GeneCards, Project life cycle, Relational database, Heuristics, Versioning, Quality assurance, Annotation, Data modeling
Bernd Mayer (ed.), Bioinformatics for Omics Data: Methods and Protocols, Methods in Molecular Biology, vol. 719, DOI 10.1007/978-1-61779-027-0_3, © Springer Science+Business Media, LLC 2011
1. Introduction

1.1. What Is Data Management?
Data management is the development, execution, and supervision of policies, plans, data architectures, procedures, programs, and practices that control, protect, deliver, and enhance the value of data and information assets. Topics in data management include architecture, analysis, security, quality assurance, integration, and metadata management. Good data management allows one to work more efficiently, to produce higher quality information, to achieve greater exposure, and to protect the data from loss or misuse (1–4).
1.2. Why Manage Omics Data?
Technological breakthroughs in genomics and proteomics, next-generation sequencing (5, 6), as well as polychromatic flow cytometry and imaging, account for vastly accelerating data acquisition trends in all Omics fields (4–7). Even one biological sample may be used to generate many different, enormous Omics data sets in parallel (8). At the same time, these technologies shift the focus of biology from reductionist analyses of parts to system-wide analyses and modeling, which further increases this avalanche of data (9). For example, the complex 2.25-billion-base giant panda genome was determined using 52-base reads at 56× coverage, capturing 94% of the genome and probably excluding only its repeat regions (10). This state of affairs has caused a change in approaches to data handling and processing. Extensive computer manipulations are required for even basic analyses, and a focused data management strategy becomes pivotal (11, 12). Data management has been identified as a crucial ingredient in all large-scale experimental projects. Exploiting a well-structured data management system can leverage the value of the data supplied by Omics projects (13). The advantages of data management are as follows:
1. It ensures that once data is collected, information remains secure, interpretable, and exploitable.
2. It helps keep and maintain complete and accurate records obtained in parallel by many researchers who manipulate multiple objects in different geographical locations.
3. It can bring order to the complexity of experimental procedures, and to the diversity and high rate of development of protocols used in one or more centers.
4. It addresses needs that increase nonlinearly when analyses are carried out across several data sets.
5. It enables better research and the effective use of data mining tools.
6. It supports combining data mined from numerous, diverse databases.
2. Materials

2.1. Data Management System Requirements
Present-day Omics research environments embody multidimensional complexity, characterized by diverse data types stemming from multicenter efforts and employing a variety of software and technological platforms, and thus require detailed project planning that takes into account the complete life cycle of the project (Fig. 1). The ideal data management system should fulfill a variety of requirements, so as to facilitate downstream bioinformatics and systems biology analyses (6, 13, 14):
1. Flexible inputs, supporting source databases of different formats, a variety of data types, facile attainment of annotation about the experiments performed, and a pipeline for adding information at various stages of the project.
2. Flexible views of the database at each stage of the project (summary views, extended views, etc.), customized for different project personnel, including laboratory technicians and project managers.
3. User-friendliness, preferably with Web-based interfaces for data viewing and editing.
Fig. 1. Data management starts with project planning, and completes its initial cycle with the first public release of the system.
4. Interactive data entry, including the capabilities to associate entries with particular protocols, to trace the relationships between different types of information, to record time stamps and user/robot identifiers, to trace bottlenecks, and to react to the dynamic needs of laboratory managers (rather than software developers).
5. Computing capacity for routine calculations, via the launching of large-scale computations, e.g., on a computer cluster or grid.
6. External depositions, i.e., a capacity to create valid depositions for relevant databases (such as GenBank (15), Ensembl (16), and SwissProt (17)), including the tracking of deposition targets.
7. Robustness, through residing on an industrial-strength, well-supported, high-quality database management system (preferably relational) ensuring security and organization.

2.2. Barriers for Implementing Data Management
Given the volume of information currently generated, and the relevant resources engaged, making the case for data management is relatively straightforward. However, the barriers that must be overcome before data management becomes a reality are substantial (4, 6, 8, 18):
1. Time and effort considerations. Data management is a long-term project. The development of data management solutions frequently stretches well beyond the initial implementation phase (instrumentation and laboratory workflow evolve continuously, thus making data management an ever-moving target).
2. Personnel recruitment. Data management requires a different set of skills and a different frame of mind as compared to data analysis, hence a recruitment challenge.
3. Search and display capacities. For large sets of complex data, and where a wide variety of results can be generated from a single experiment, project planning should include extensive search capacities and versatile display modes.
4. Proactive management. A capacity to accommodate newly discovered data, new insights, and remodeling needs is required. Planning upfront for migration to new versions is essential.
5. Optimized data sharing. In large projects encompassing several research centers, data sharing (19) poses several impediments that need to be overcome, including issues of shared funding, publication, and intellectual property, as well as legal and ethical issues, necessitating means to avoid unauthorized use of the data.
6. Community resources. Lack of informatics expertise poses a problem, and expert pools of scientists with the requisite skills must be developed, as well as a community of biocurators (12). Paucity of funding has been highlighted (20), necessitating new ways of balancing streams of support for the generation of novel data and the protection of existing data (8).
7. Data storage. The cost of storing the hundreds of terabytes of raw data produced by next-generation sequencing has been estimated to be greater than the cost of generating the data in the first place (6). In this and other Omics examples, maintaining the streams of data in a readily usable and queryable form is an important challenge.
8. Redundancy. Whole genome resequencing, including the human 1,000 genomes project (http://www.1000genomes.org) (21), as well as multiple plant and animal variation discovery programs, is leading to a shift from central databases to interactive databases of genome variation, such as dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP) (22) and the databases supporting the human and bovine HapMap projects (http://www.hapmap.org) (23). These present data integration challenges that need to be addressed.
9. Ontologies and semantics. Use of ontologies (24, 25) and limited vocabularies across many databases is an invaluable aid to semantic integration. However, it seems that no single hierarchy is expressive enough to reflect the abundant differences in scientists' viewpoints. Furthermore, the complexity of ontologies creates difficulties, since grappling with a very deep and complex ontology can sometimes be as confusing as not having one at all.
10. Integration. Since significant portions of the data incorporated in Omics databases represent information copied from other sources, complexities arise due to the lack of standardized formatting, posing an integration challenge.
11. Text mining vs. manual curation. Many databases lack objective yardsticks to validate their automatic data mining annotation. The alternatives are expert community-based annotation, or manual curation of information from the scientific literature by a local team. These are time-consuming processes which require highly trained personnel. While algorithms for computer-assisted literature annotation are emerging (26, 27), the field is still in its infancy.
12. The metagenomics challenge. The production of metagenomics data is yet another challenge for data management, because sequences cannot always be associated with specific species.
13. The laboratory-database hiatus. Information is often fragmented among hard drives, CDs, printouts, and laboratory notebooks. As a result, data entry is often incomplete or delayed, and often accompanied by only minimal supporting information. New means to overcome this problem have to be urgently developed.
2.3. Omics Data Management
Since data management was recognized as crucial for exploiting large sets of data, considerable effort has been invested by the scientific Omics community to produce relevant computer-based systems, and to standardize rules that apply to worldwide users of such systems (1, 6, 9, 20, 21). Such broadly accessed projects in genomics and proteomics include the international nucleotide sequence databases, consisting of GenBank (http://www.ncbi.nlm.nih.gov/Genbank) (15), the DNA Databank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) (28), and the European Molecular Biological Laboratory (EMBL, http://www.embl.org) (29), as well as the Universal Protein Resource (UniProtKB, http://www.uniprot.org) (17) and the Protein Data Bank (PDB, http://www.pdb.org) (30). In addition to hosting text sequence data, they encompass basic annotation and, in many cases, the raw underlying experimental data. Although these projects are pivotal in the Omics field, they do not answer the complete variety of needs of the scientific community. To fill the gap, a number of other databases have been developed. Many are meta-databases, integrating the major databases and often also various others. Examples include GeneCards (http://www.genecards.org) (31–35) and Harvester (http://harvester.fzk.de) (36). In parallel, databases that focus on specific areas have emerged, including HapMap (23) for genetic variation, RNAdb (http://research.imb.uq.edu.au/rnadb) (37) for RNA genes, and OMIM (http://www.ncbi.nlm.nih.gov/omim) (38) for genetic disorders in humans.
2.4. Laboratory Information Management Systems
In order to manage large projects, it is possible to use Laboratory Information Management Systems (LIMS), a well-established methodology in scientific data management. A LIMS is a software system used in laboratories to manage samples, laboratory users, instruments, standards, and other laboratory functions, such as microtiter plate management, workflow automation, and even invoicing (13, 39, 40). A LIMS facilitates the day-to-day work of complex, high-dimensional projects (multiple users, multiple geographic locations, and many types of data, instruments, and input/output formats); it organizes information and allows its retrieval in a convenient manner, which improves searches. It also simplifies scientific management by centralizing the information and identifying bottlenecks through global data analyses. Finally, it allows data mining studies that can in turn help choose better targets, thus improving many project outcomes (13). Since Omics data management tasks are complex and cover a wide range of fields, a variety of implementation platforms have been developed. These began with general Perl modules supporting the development of LIMS for projects in genomics and transcriptomics (40), followed by a more sophisticated LIMS project (41) covering data management in several Omics fields (e.g., 2D gels, microarray, SNP, MS, and sequence data), and specialized projects for managing EST data (42, 43), transcriptome data (44), functional genomics (45), toxicology, and biomarkers (46). In addition, LIMS designed specifically for the proteomics arena include:
Xtrack (47). Designed to manage crystallization data and to hold the chemical compositions of the major crystallization screens, Xtrack stores sample expression and purification data throughout the process. After completion of the structure, all data needed for deposition in the PDB are present in the database.
Sesame (48). Designed for the management of protein production and structure determination by X-ray crystallography and NMR spectroscopy.
HalX (49). Developed for managing structural genomics projects. The Northeast Structural Genomics Consortium has developed a LIMS as part of its SPINE-2 software (50).
ProteinScape™ (51). A bioinformatics platform which enables researchers to manage proteomics data from generation and warehousing to storage in a central repository.

2.5. Data Viewers
As reference genome sequences have become available, several genome viewers have been developed to give users effective access to the data. Common browsers include EnsEMBL (http://www.ensembl.org) (16), GBrowse (http://gmod.org/wiki/Gbrowse) (52), and the University of California, Santa Cruz genome browser (http://genome.ucsc.edu) (53) (see Note 1). These viewers increasingly include sequence variation and comparative genome analysis tools, and they are becoming the primary information source for many researchers who do not wish to trawl through the underlying sequence data behind the annotation, comparison, and variation results. Genome viewers and their underlying databases are becoming both the visualization and the interrogation tools of choice for sequencing data.
2.6. Standardization
Massive-scale raw data must be highly structured to be useful to downstream users. In all types of Omics projects, many targets are manipulated, and the results must be interpretable within the context of the experimental conditions. In such large-scale efforts, data exchange standardization is a necessity for facilitating and accelerating the collection of relevant metadata, reducing replication of effort, and maximizing the ability to share and integrate data. One effective solution is to develop a consensus-based approach (54). Standardized solutions are increasingly available for describing, formatting, submitting, sharing, annotating, and exchanging data. These reporting standards include minimum information checklists (55), ontologies that provide the terms needed to describe the minimal information requirements (25), and file formats (56, 57).
The "minimum information about a genome sequence" guideline published by the Genomic Standards Consortium (GSC) (9) calls for this critical field to be mandatory for all genome and metagenome submissions. The GSC is an open-membership, international working body formed in September 2005, which aims to promote mechanisms that standardize the description of genomes and the exchange and integration of genomic data (58). Some pertinent cases in point are:
Minimum Information About a Microarray Experiment (MIAME), which has been largely accepted as a standard for microarray data; nearly all LIMS developed for relevant experiments are now "MIAME-compliant" (59–61).
The US Protein Structure Initiative, the PDB, and BioMagResBank jointly developed standards that were partially used by the European SPINE project in designing its data model (62).
The Proteomics Standards Initiative of the Human Proteome Organization (HUPO), which aims to provide the data standards and interchange formats for the storage and sharing of proteomics data (63).
The Metabolomics Standards Initiative, which recommended that metabolomics studies report the details of study design, metadata, and the experimental, analytical, data processing, and statistical techniques used (64).
The growing number of standards and policies in these fields has stimulated the generation of new databases that centralize and integrate these data: the Digital Curation Centre (DCC, http://www.dcc.ac.uk), which tracks data standards and documents best practice (65), and the BioSharing database (http://biosharing.org) (66), which currently centralizes bioscience data policies and standards, providing a "one-stop shop" for those seeking data policy information (8).

2.7. Project Planning
As in all disciplines, the proper start to an Omics project is defining requirements. Good planning makes for clean, efficient subsequent work. The following areas of a project require planning:
Type of data to be used. Should the project be restricted to only one type of Omics data (genomics, metabolomics, etc.), in order to get a highly specialized database with simple relations among a limited number of business objects (see below)? Or does one prefer to include and connect numerous data types, potentially leading to new cross-correlation insights? Does one wish to focus on experimental work or include computational inference?
Data presentation. Most projects opt for a Web-based interface, but there are many advantages to stand-alone client-based applications, which users install on their own computers. Advantages of Web-based applications include fast propagation to worldwide users, easier deployment and maintenance, and ubiquity. Advantages of stand-alone applications include no developer-side data security issues (they all fall on the user) and no server load issues (e.g., too many simultaneous Web requests).
Design of business objects. To be managed, data must be categorized and broken into smaller units, customarily called business objects. How should those be designed? And what can be expected from the data? Is simple presentation to the user sufficient, or are searches required? Should there be an option for user contribution to the project's information base? Should the contributions be named or anonymous?
Data updates. Omics projects, especially those based on remote sites, need to take into account that information is not static: it is never sufficient to just create a data compilation; update plans are crucial.
Data integrity. No data management plan is satisfactory if it does not allow for constant maintenance of data integrity, security, and accuracy.

2.8. Choosing Omics Sources
What type of Omics data? With so many fields to choose from, and with such a high need for the analysis of large-volume data, the choice is never straightforward. A large variety exists (67, 68), including:

Genomics          The genes' sequences and the information therein
Transcriptomics   The presence and abundance of transcription products
Proteomics        The proteins' sequence, presence, and function within the cell
Metabolomics      The complete set of metabolites within the cell
Localizomics      The subcellular localization of all proteins
Phenomics         High-throughput determination of cell function and viability
Metallomics       The totality of metal/metalloid species within an organism (69)
Lipidomics        The totality of lipids
Interactomics     The totality of the molecular interactions in an organism
Spliceomics       The totality of the alternative splicing isoforms
Exomics           All the exons
Mechanomics       The force and mechanical systems within an organism
Histomics         The totality of tissues in an organ
Many projects evolve from a field of interest and from curiosity-driven research; others are an outgrowth of a wider recognition of a scarcity of knowledge.
Scope of database. The choice is between a low-volume, highly specialized database, freed from the constraints of supporting general and generic requirements, wherein one can optimize the design to be as efficient as possible for the relevant application, and a larger resource. The former presents the opportunity for the author to become the recognized expert in a specific Omics area; larger databases involve more complex database management. Even for a focused database, experiments often yield too many results to be practically manageable in an online system; one must then decide what, how much, and how often to archive, and design tools to effectively access the archives and integrate them into query results when needed. In addition, one should realize that ongoing analyses of the data often provide insights which in turn impact the implementation of updated versions of the data management system.
Data integration and parsing. Data integration involves combining data residing in different sources and providing users with a unified and intelligent view of the merged information. One of the most important scientific decisions in data integration is which fields should be covered. There are many databases for each Omics field; designing a standard interface which also eliminates duplicates and unifies measurements and nomenclature is very important, but not easy to implement. An example of where multiarea coverage has been beneficial is the integration of metabolomics and transcriptomics data to help define prognostic characteristics of neuroendocrine cancers (70). Some of the challenges include the need to parse and merge heterogeneous source data into a unified format, and to import/merge it into the business objects of the project. This step includes taking text that is given in one format, breaking it into records (one or more fields or lines), choosing some or all of its parts, and writing it in another format, often adding original annotation. The main implementation hurdle is dealing with different formats from different sources. Some need to be parsed from the Web, preferably in an automatic fashion. Many sources provide users with data dumps from a relational database; others opt for exporting extensible markup language (XML) files (see Note 2) or simple text files, e.g., in a comma-separated values (CSV) format. The more sources a project has, the higher the likelihood that more than one input format needs parsing. The project has to provide specific software for each of its mined sources; when developing in an object-oriented fashion, type-specific parsing classes can be used as a foundation to ease this work (71), as in the sketch below. Irrespective of programming language, and especially if there is more than one programmer working on a project, a source code version control system, like CVS (see Note 3), should be strongly considered. Also, for efficiency, most programmers use an integrated development environment, such as Eclipse (see Note 4).
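The following minimal Perl sketch illustrates the idea of type-specific parsing classes sharing a common interface; all source names, file names, and field layouts are hypothetical, and the CSV handling is deliberately naive (it ignores quoted commas), so a real project would substitute a robust parser per format:

    #!/usr/bin/perl
    # Minimal sketch: a common parser interface with one subclass per
    # input format. Each subclass turns one source format into a uniform
    # list of records (hash references) ready for database loading.
    use strict;
    use warnings;

    package Parser::Base;
    sub new   { my ($class, %args) = @_; return bless { %args }, $class; }
    sub parse { die "parse() must be implemented by a format-specific subclass\n"; }

    package Parser::CSV;
    our @ISA = ('Parser::Base');
    sub parse {
        my ($self, $file) = @_;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        chomp(my $header = <$fh>);
        my @cols = split /,/, $header;      # column names from the first line
        my @records;
        while (my $line = <$fh>) {
            chomp $line;
            my %rec;
            @rec{@cols} = split /,/, $line; # map values onto column names
            push @records, \%rec;
        }
        close $fh;
        return \@records;
    }

    package main;
    my $parser = Parser::CSV->new(source => 'ExampleDB');    # hypothetical source
    # my $records = $parser->parse('exampledb_dump.csv');    # hypothetical dump file

An XML subclass would follow the same pattern, so that downstream integration code never needs to know which source format a record came from.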
2.9. Defining the Data Model Within the Project: Data, Metadata, and Application Flow

Once the sources of information are chosen, the system's data definition has to be developed. The volume of data, be it small or large, has to be broken down and categorized into business objects. Achieving these goals and implementing the objects and their relationships is called data modeling (72). An item may be defined as a business object if it has specific annotation. For example, a genomics project may choose to define genes, sequences, and locations among its many objects, but not proteins, even though they may be mentioned as the product of the gene. If the base information does not contain protein annotation, such as amino acid sequence or 3D images, there may be no need for the protein to be defined as an object. The names of the objects, their permitted actions within the project, their data types, and more, are referred to as metadata (see the sketch after this paragraph). Application flow is the sum of actions that can be performed within the project, and the order in which they are connected. It is important for the efficient management and integration of the data. Examples of questions to consider include: Is it sufficient for the project information to be merely presented to the user, or should it be searchable as well? What must the user be able to do with the results? Should there be a capability for users to add comments or even new data, or will the database as provided by the system be static until its next official update? Large projects, with complex relationships among their fields, are useful only if the data is searchable, so the application flow should contain: (a) a home page; (b) a page from which the user can enter search terms and define search parameters; and (c) a page listing the search results. Additional useful features would be for the search results page to: (a) sort by different criteria; (b) refine the search; (c) perform statistical analyses on the results set; and (d) export the results set to a file (say, in Excel format) or to a different application, either internal or external to the system. Understanding the application flow is critical at the data modeling stage, since its requirements (both functional and performance) drive the design choices.
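To make the separation of business objects from their metadata concrete, here is a minimal Perl sketch (all object, field, and gene names are hypothetical) in which the metadata declares the permitted fields and whether they are mandatory, and the constructor enforces it:

    # Minimal sketch: a business object plus its metadata. The metadata
    # names the object's fields, their types, and whether each field is
    # mandatory; the constructor enforces it. All names are hypothetical.
    use strict;
    use warnings;

    my %GENE_METADATA = (
        symbol   => { type => 'string', mandatory => 1 },
        location => { type => 'string', mandatory => 0 },
        sequence => { type => 'dna',    mandatory => 0 },
    );

    sub new_gene {
        my (%fields) = @_;
        for my $f (keys %fields) {                       # reject unknown fields
            die "Unknown field '$f'\n" unless exists $GENE_METADATA{$f};
        }
        for my $f (grep { $GENE_METADATA{$_}{mandatory} } keys %GENE_METADATA) {
            die "Missing mandatory field '$f'\n" unless defined $fields{$f};
        }
        return \%fields;                                 # the business object
    }

    my $gene = new_gene(symbol => 'ABC1', location => '7q31');   # hypothetical gene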
2.10. Defining Data Warehouse Requirements
A data warehouse is defined as the central repository for a project's (or a company's) electronically stored data (73). Different databases allow mining via different means: parsing of Web pages, ad hoc text files (flat files), CSV files, XML files, and database dumps. For clean, efficient data use in the project, all of these sources should be integrated into one uniform project database. With the advent of reliable public domain solutions, a significant proportion of current Omics projects use relational database management systems (RDBMSs) as their data warehouse. A text file database is best suited for small projects with relatively low-volume annotations and relatively simple relations between the business objects. For example, an experimental data project specializing in cytoskeleton proteins has a small volume of data in comparison to, say, NCBI's EntrezGene project, so text files may suffice for it. A text file database has some advantages:
1. It needs less specialized maintenance; simple system commands can be used to check how many files there are, how they are organized, and how much space they take up.
2. Implementation can be very quick: one can choose a favorite programming language and write the necessary code to create, extract, and search the data. If the files are in XML, or another popular format, many relevant public-domain modules, in a variety of programming languages, are already written and freely accessible.
3. It is easier to deploy. One can use system commands again to organize and compress the set of files into one downloadable file.
However, the file system solution also has considerable disadvantages, especially for high-volume data with complex relationships among objects (74):
1. It can use more disk space; often data is stored redundantly for easier access.
2. Writing and maintaining custom-made code to access/add/analyze data may become cumbersome and/or slow.
3. Fewer analyses can be done on complex relationships, since they cannot be easily marked in text files.
Indeed, for high-volume, complex-relationship projects, a relational database offers the following important advantages:
1. Application program independence; data is stored in a standard, uniform fashion, as tables.
2. Multiple views of the data, as well as expandability, flexibility, and scalability.
3. Self-describing metadata which elucidates the content of each field.
4. Reduced application development times once the system is in place.
5. Standards enforcement; all applications/projects using the same RDBMS have a ready-made solution for integration.
A widely used open-source RDBMS is MySQL (75), and the Web includes many tutorials, examples, free applications, and user groups relating to it. Complementing such a database system are programming languages that automate the insertion of data and the curation and analysis of the included data, for example, employing Perl (76) to generate user MySQL extensions (77); a minimal sketch follows below. To a lesser degree, the type of mined data also plays a role in choosing a data warehouse. If all other things are equal, then keeping the same type as the project's input data saves implementation time. RDBMSs provide basic querying facilities which are often sufficient to power the searches needed by Omics applications. Systems based on flat files, as well as relational databases for which speed and full-text searching are essential, typically need an external search engine (e.g., as can be provided by Google, or by specialized software like Glimpse or Lucene (78, 79) for efficient indexing and querying).
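As a minimal illustration of driving MySQL from Perl, the following sketch assumes the widely used (non-core) DBI and DBD::mysql modules are installed; the database name, table, and credentials are all hypothetical:

    # Minimal sketch: creating, loading, and querying one warehouse table
    # with Perl/DBI. Database, table, and account names are hypothetical.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=omics_demo;host=localhost',
                           'demo_user', 'demo_pass', { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS genes (
            id     INT AUTO_INCREMENT PRIMARY KEY,
            symbol VARCHAR(32) NOT NULL,
            descr  TEXT
        )
    });

    my $ins = $dbh->prepare('INSERT INTO genes (symbol, descr) VALUES (?, ?)');
    $ins->execute('ABC1', 'hypothetical example gene');

    my $rows = $dbh->selectall_arrayref(
        'SELECT symbol, descr FROM genes WHERE symbol LIKE ?', {}, 'ABC%');
    print "$_->[0]\t$_->[1]\n" for @$rows;

    $dbh->disconnect;

Placeholders (the ? parameters) keep the insertion and query code safe and reusable across the automated loading runs described above.

2.11. Defining Versioning Requirements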
Planning a project cannot end before accounting for its life cycle. Both in-silico and experimental data projects need to anticipate the future evolution of the data and its analyses. Integrated databases would rapidly become obsolete if they failed to deploy updates with new information from their sources. Programming development time forms a lower bound for version intervals, and the more sources a project mines, the more complicated this process becomes. Not all sources update at the same time, and frequencies vary. Planning version cycles must also take into account the interdependence of different sources. Data updates can be either complete (with all data mined and integrated from scratch) or incremental (updating only what has changed, e.g., new research articles that have just been published about a specific disease); a minimal decision sketch follows below. Incremental updates of this sort, along with triggered reports to users, are very attractive. In practice, they are often extremely difficult to implement in Omics applications, due to widespread data complexities, exacerbated by unpredictable source data format changes, as well as the interdependencies of many of the major data sources. Finding full or partial solutions in this arena is an interesting research focus. Ensuring speedy deployment of data updates (complete or partial) is not the only reason for versioning. As time passes, more users send feedback about an Omics project, new scientific areas and technologies emerge, new features become desirable, and/or new application behaviors become necessary. These warrant a code update, and since the code services the data, it is often most convenient to provide joint data and feature updates.
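One simple way to decide between a complete and an incremental rebuild is to record, for every mined source, the release label used in the previous build and compare it against what is currently online; the following minimal Perl sketch uses invented source names, release labels, and a deliberately simple two-way policy:

    # Minimal sketch: choosing between full and incremental updates by
    # comparing source release labels against those of the previous
    # build. Source names, labels, and the policy are all hypothetical.
    use strict;
    use warnings;

    my %previous = ( SourceA => '2010-03', SourceB => 'r41' );   # last build
    my %current  = ( SourceA => '2010-06', SourceB => 'r41' );   # now online

    my @changed = grep { $current{$_} ne $previous{$_} } keys %current;

    if (@changed == keys %current) {
        print "All sources changed: schedule a complete rebuild\n";
    } elsif (@changed) {
        print "Incremental update for: @changed\n";
    } else {
        print "All sources unchanged: no data update needed\n";
    }

In a real pipeline the interdependencies discussed above would complicate this decision, since a change in one source can force re-mining of others that depend on it.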
2.12. Data Quality Control and Assurance
Quality Assurance (QA) refers to the process used to ensure that deliverables comply with the designed models and specifications, and it is crucial to all projects. Examples of quality assurance include process checklists, project audits, and methodology and standards development. It is usually done toward the end of project development, though it is extremely useful in intermediate stages as well, and it can be done by anyone who has access to the project manual or user interface. For an in-silico application, QA means checking that all of the data is represented as planned, that all of the application's functionality exists and behaves as designed, and that all of the application's required steps follow one another in the specified order. For a project based on experimental data, quality assurance also includes verification that the data in the project's database is identical to the data gathered by the scientists. Plans for QA should be designed in parallel with the implementation, and QA test runs should be allotted their own time frames within milestone plans. Once shortcomings are uncovered, they should be returned to the implementation stages for correction and retesting. Some of the defects will also lead to redesigning the test plan, often to add more checks. For Omics projects in particular, QA has to ensure that both the business logic (the specific data entities and the relationships among them) and the science logic (e.g., no DNA sequence should contain the letter P, and a full protein sequence should be at least 30 amino acids long) are correct; see the sketch below.
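The two science-logic examples just quoted translate directly into automated checks; a minimal Perl sketch, with the thresholds taken from the text and the function names invented for illustration:

    # Minimal sketch: science-logic QA checks. A DNA sequence may contain
    # only A, C, G, T (so never the letter P), and a full protein
    # sequence must be at least 30 amino acids long.
    use strict;
    use warnings;

    sub check_dna {
        my ($seq) = @_;
        return $seq =~ /\A[ACGT]+\z/i ? () : ('DNA contains non-ACGT letters');
    }

    sub check_protein {
        my ($seq) = @_;
        my @errors;
        push @errors, 'protein shorter than 30 AA' if length($seq) < 30;
        push @errors, 'protein contains invalid letters'
            unless $seq =~ /\A[ACDEFGHIKLMNPQRSTVWY]+\z/i;   # the 20 amino acids
        return @errors;
    }

    print "$_\n" for check_dna('ACGTP'), check_protein('MKV');  # both fail

Checks of this kind can run automatically after every data load, so that science-logic violations are caught before a release rather than by users.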
3. Methods

3.1. GeneCards Data and Its Sources
In this section, we use the GeneCards project as a case study to illustrate many of the data management concepts described above. GeneCards® (http://www.genecards.org) (31–35) is a comprehensive, authoritative, and widely used compendium of annotative information about human genes. Its gene-centric content is automatically mined and integrated from over 80 digital sources, including HGNC (80), NCBI (81), ENSEMBL (82), UniProtKB (83), and many more (84), resulting in a Web-based, deep-linked card for each of >53,000 human gene entries, categorized as protein-coding, RNA genes, pseudogenes, and more. Figure 2 depicts the GeneCards project life cycle. The result is a comprehensive and searchable information resource of human genes, providing concise genome, proteome, transcriptome, disease, and function data on all known and predicted human genes, and successfully overcoming barriers of data format heterogeneity by using standard nomenclature, especially approved gene symbols (85). GeneCards "cards" include distinct functional areas encompassing a variety of topics, including the GIFtS annotation score (35), aliases and descriptions, summaries, location, proteins, domains and families, gene function, proteins and interactions, drugs and compounds, transcripts, expression, orthologs, paralogs, SNPs, disorders, publications, other genome-wide and specialized databases, licensable technologies, and products and services featuring a variety of gene-specific research reagents. A powerful search facility provides full-text and field-specific searches; set-specific operations are available via the GeneALaCart and GeneDecks (34) subsystems.

Fig. 2. The GeneCards project's instantiation of data management planning, implementation, releases, and versioning, with examples of its sources, technologies, data models, presentation needs, de novo insights, algorithms, quality assurance, user interfaces, and data dumps.

3.2. GeneCards Data Modeling and Integration
The GeneCards data model is complex. In legacy GeneCards Versions 2.xx, information is stored in flat files, one file per gene. Version 3.0 (V3), deployed in 2010, uses a persistent object/ relational approach, attempting to model all of the data entities and relationships in an efficient manner so that the diverse functions of displaying single genes, extracting various slices of attributes of sets of genes, and performing well on both full text and field-specific searches are taken into account. Since the data is collected by interrogating dozens of sources, it is initially organized according to those sources. However, it is important for the data to
also be presented to users organized by topics of interest, e.g., with all diseases grouped together, whether mined from specialized disorder databases, literature text-mining tools, or protein data sources. Data integration in GeneCards operates at a variety of additional levels, providing good examples of such a process. In some cases, the integration amounts only to juxtaposition, such as sequentially presenting lists of pathways from seven different data sources, thereby allowing the user to perform comparisons. In other cases, further unification-oriented processing takes place, striving to eliminate duplicates and to perform prioritization. This is exemplified by the alias and descriptor list, by the genomic location information, which employs an original exon-based algorithm (86), and by the gene-related publications list. The latter integration is based on the prevalence of the associations and on the quality of the association method (manual versus automatic curation). The functional view is provided on the Web and in the V3 data model. Each of the views (source or topic) is available in a corresponding set of XML files. The V2-to-V3 migration path uses these files as the input for loading the relational database. The administration of the database is facilitated by the use of the phpMyAdmin (87) tool (see Note 5). The data generation pipeline is completed by having the database serve as input to an indexing facility that empowers fast and sophisticated searches.
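As a hedged illustration of the unification-oriented processing described above, the following Java sketch merges alias lists from several sources, eliminating duplicates and prioritizing by source. The source names, their priority order, and the case-insensitive duplicate rule are assumptions made for this example, not the actual GeneCards algorithm.

    import java.util.*;

    // Sketch: merge per-source alias lists, dropping duplicates and keeping
    // the spelling contributed by the highest-priority source.
    public class AliasUnifier {
        // Assumed priority order, highest first.
        private static final List<String> SOURCE_PRIORITY =
                List.of("HGNC", "EntrezGene", "Ensembl", "UniProtKB");

        public static List<String> unify(Map<String, List<String>> aliasesBySource) {
            Set<String> seen = new HashSet<>();
            List<String> merged = new ArrayList<>();
            for (String source : SOURCE_PRIORITY) {
                for (String alias : aliasesBySource.getOrDefault(source, List.of())) {
                    // Case-insensitive duplicate elimination.
                    if (seen.add(alias.toLowerCase())) {
                        merged.add(alias);
                    }
                }
            }
            return merged;
        }

        public static void main(String[] args) {
            Map<String, List<String>> input = Map.of(
                    "HGNC", List.of("TP53"),
                    "EntrezGene", List.of("p53", "TP53"),
                    "UniProtKB", List.of("P53"));
            System.out.println(unify(input)); // prints [TP53, p53]
        }
    }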
3.3. GeneCards Business Objects and Application Flow

The GeneCards V3 database (in version 3.02) is very elaborate, with >55,000 gene entries, a count that results from a consensus among three different gene entry sources: HGNC (80), NCBI (81), and ENSEMBL (82). It further encompasses ~456,000 aliases, descriptions, and external identifiers, >800,000 articles, >900,000 authors, >3 million gene-to-article associations, and >7 million SNPs – just a sample of the myriad annotations and relationships culled from the >80 data sources, populating 110 tables (data and system) and two views, interlinked by 81 foreign keys. The primary business object is the genes entity, with attributes that include symbol, GeneCards identifier, HGNC approved symbol (when available), and origin (HGNC, EntrezGene, or ENSEMBL). The data model parallels that of the Webcard, with some of the complex sections (e.g., gene function) represented by many tables. An off-line generation pipeline mines and integrates the data from all sources. The application flow of the online GeneCards Web site enables users to: (a) view a particular GeneCards gene and all of its attributes; (b) search the database from the home page or from any of the specific webcards; (c) analyze and/or export the search results; (d) use the GeneALaCart batch query facility to download a subset of the
descriptors for a specified set of genes; (e) use GeneDecks's Partner Hunter (either from a card or from its home page) to find more genes that are similar to a given gene based on a chosen set of GeneCards attributes; (f) use GeneDecks's Set Distiller to find GeneCards annotations that are the most strongly enriched within a given set of genes.

3.4. Different Views of the Data
GeneCards users are eclectic, and include biologists, bioinformaticians, and medical researchers from academia and industry, students, patent/IP personnel, physicians, and lay people. To address the varied individual needs of a multifaceted user base, GeneCards affords a variety of output formats, including the Web interface as described above, Excel files exported by batch queries and GeneDecks results, plain text files embodying the legacy V2 format, XML files organized by sources or by function, and MySQL (75) database dumps containing all the tabular information. Additionally, it provides a Solr (88)/Lucene (79) index, available for independent querying by the Solr analyzer; an object-oriented interface to the data, facilitated by Propel (89) (see Note 6); and a complete copy of the data and software, used by academic and commercial mirror sites. An Application Programming Interface (API), either developed in-house or adopted from projects of similar scope, is being planned. Examples of useful algorithms implemented within GeneCards include integrated exon-based gene locations (86), and SNP filtering and sorting. In the latter, SNPs are hierarchically sorted (by default) by validation status, by location type (e.g., coding non-synonymous/synonymous, splice site), and by the number of validations.
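The default hierarchical sort just described maps naturally onto a chained comparator. The sketch below assumes a simplified SNP representation in which the location type has already been encoded as a numeric rank; all field names and encodings are hypothetical.

    import java.util.*;

    // Sketch of hierarchical SNP sorting: validation status first, then
    // location-type rank, then number of validations (most validated first).
    public class SnpSorter {
        record Snp(String id, boolean validated, int locationRank, int validationCount) {}

        public static void main(String[] args) {
            List<Snp> snps = new ArrayList<>(List.of(
                    new Snp("rs0001", false, 2, 0),
                    new Snp("rs0002", true, 1, 3),
                    new Snp("rs0003", true, 1, 1)));

            snps.sort(Comparator
                    .comparing((Snp s) -> !s.validated())                // validated SNPs first
                    .thenComparingInt(Snp::locationRank)                 // e.g. coding before splice site
                    .thenComparing(Comparator.comparingInt(Snp::validationCount).reversed()));

            snps.forEach(s -> System.out.println(s.id())); // rs0002, rs0003, rs0001
        }
    }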
3.5. Managing GeneCards Versions
Since 1997, the GeneCards project has released over 70 revisions, addressing numerous data format changes imposed by the mined sources. The timing of new releases has often been constrained by uneven source synchronization relative to the latest builds of root sources, such as NCBI. Mechanisms for incremental updates have been designed, but were often found to be suboptimal solutions. The GeneCards generation process is embarrassingly parallelizable, so the time to generate all of the data from scratch into text files has been reduced to about 1 or 2 days, followed by about 1 week for XML data generation and the loading of the MySQL database, and a few hours for indexing.
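Because every gene's card can be generated independently of all others, the generation step distributes trivially over a pool of workers. The following sketch illustrates the idea only; generateCard() is a hypothetical placeholder for the real per-gene mining and writing step.

    import java.util.List;
    import java.util.concurrent.*;

    // Sketch of an embarrassingly parallel generation step.
    public class ParallelGeneration {
        static void generateCard(String geneSymbol) {
            // Placeholder: the real pipeline would mine all sources and write text/XML files.
            System.out.println("generated card for " + geneSymbol);
        }

        public static void main(String[] args) throws InterruptedException {
            List<String> genes = List.of("TP53", "BRCA1", "EGFR", "APOE");
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            for (String gene : genes) {
                pool.submit(() -> generateCard(gene));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }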
3.6. Quality Assurance
Consistent data integrity and correctness are a major priority, and we have developed a semi-automated system, GeneQArds, for instantiating a key data management quality assurance component. GeneQArds is an in-house mechanism that was established to: (a) assess the integrity of the migration from the V2 text file system into the MySQL database and (b) validate and quantify the results of the new V3 search engine. To ensure correctness,
we have developed a mechanism (using SQL queries and PHP) that builds a binary matrix for all gene entries, indicating the presence or absence of data from each one of the GeneCards sources in the database. Comparison of such matrices for two builds or software versions provides an assessment of database integrity and points to possible sources of error. A search engine comparison tool enables comparisons of single Web queries as well as batch (command line) queries. A report provides lists of genes showing search engine inconsistencies, and enables tracking of the particular source contributing to such discrepancies. Finally, an automated software development and bug tracking system, Bugzilla (90) (see Note 7), is used to record, organize, and update the status of feature and bug lists.
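The core of the matrix comparison can be sketched as follows; the actual GeneQArds implementation uses SQL queries and PHP, so this Java version, with its invented names and bit encoding, is only meant to convey the idea.

    import java.util.*;

    // Sketch: one bit per (gene, source) pair marks whether that source
    // contributed data; XOR-ing the matrices of two builds highlights
    // exactly where data appeared or disappeared.
    public class BuildMatrixDiff {
        static void diff(Map<String, BitSet> oldBuild, Map<String, BitSet> newBuild, String[] sources) {
            for (String gene : oldBuild.keySet()) {
                BitSet changed = (BitSet) oldBuild.get(gene).clone();
                changed.xor(newBuild.getOrDefault(gene, new BitSet()));
                for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1)) {
                    boolean present = newBuild.getOrDefault(gene, new BitSet()).get(i);
                    System.out.printf("%s: source %s %s%n", gene, sources[i],
                            present ? "appeared" : "disappeared");
                }
            }
        }

        public static void main(String[] args) {
            String[] sources = {"HGNC", "UniProtKB", "Ensembl"};
            Map<String, BitSet> v2 = Map.of("TP53", BitSet.valueOf(new long[]{0b111}));
            Map<String, BitSet> v3 = Map.of("TP53", BitSet.valueOf(new long[]{0b101}));
            diff(v2, v3, sources); // reports that UniProtKB data disappeared for TP53
        }
    }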
3.7. Lessons Learned

Work on the GeneCards project provides a variety of data management lessons and insights. Regarding versioning, it shows the advantages of careful version planning, so as to ensure data release consistency. In parallel, GeneCards provides an example of the difficulties involved in finding the balance between full and incremental updates, and of how the problem is partially offset by optimizing database generation speed. Database architecture migration is often considered difficult, and GeneCards' successful move from a flat file architecture to a relational database model demonstrates the feasibility, advantages, and hurdles of such a transition. It also shows how an evolutionary transition helps prevent disruptions in service and addresses time and funding constraints. Finally, the GeneCards example shows the advantage of developing a comparison-based quality assurance system. Furthermore, it shows that developing quality assurance software systems results in useful side effects. This is exemplified by how the presence/absence vectors for the data sources, embedded in the GeneQArds QA effort, helped develop a novel gene annotation scoring mechanism (35).
3.8. Practical Issues Regarding Single-Researcher Data Management
Assembling, annotating, and analyzing a database of various Omics data is currently quite feasible for modest-sized labs and even single researchers. The key developments allowing this are: (a) the very large amounts of public Omics data; (b) free and simple accessibility of this data through the internet; (c) affordable personal computers with strong processors and very large storage capacity; and (d) free or inexpensive software to construct databases and to write programs that assemble, curate, and analyze data. We illustrate these points, and several potential difficulties and their possible solutions, using specific types of Omics data, resources, and computational approaches. Studying a particular gene or gene family is common for many biology and related research groups. Computational biology and bioinformatics analyses often accompany experimental
work on genes, in order to keep abreast of current research and to complement and direct the "wet" work at a fraction of its time and cost. A typical starting point is the studied gene sequence and data about its function. Usual questions on gene function include the sequence location of active site(s), tissue and subcellular expression sites, and the occurrence and reconstructed evolution of paralogs and orthologs (see Note 1). Sequence data can help address these questions and is readily accessible through public databases, such as the ones at the NCBI, EBI, and DDBJ. These databases are more than simple data repositories, offering diverse data search methods, easy data downloading, several data analysis procedures, and links between related data within and between these sites and others. Most researchers are well familiar with finding sequences related to their gene of interest by sequence-to-sequence searches (e.g., BLAST (91)) and simple keyword searches. The advantage of the sequence search is that it searches primary data (i.e., the sequenced nucleotide data) or its immediate derivative (i.e., the virtually translated protein sequences). However, these searches rely on detectable sequence similarity, which can be difficult to identify between quickly diverging and/or anciently separated sequences. The advantage of keyword searches is in accessing the sometimes rich annotations of many sequences, thus exploiting experimental and computational analyses already available for the database sequences. The main disadvantages of keyword searches are missing and mis-annotated data. The former is usually the result of the deposit of large amounts of raw data that is yet to be processed in the database. Large genomic data sets are an example of such data. Raw sequence reads (e.g., the Trace Archive (92)) are an example of data which is not even meant to be annotated beyond its source and sequencing procedures. These sequence reads are meant to complement the assembled data or to be reassembled by interested users. Mis-annotation of data is a more severe problem, since it can mislead researchers and direct them to erroneous conclusions or even futile experiments. The causes of mis-annotation are often automatic annotation necessitated by the sheer amounts of deposited data (15), data contamination that can be present before the final assembly at late stages of large sequencing projects, hindrances in the process of metadata integration, and incorrect search engine implementation. One way to avoid and reduce the pitfalls of these two types of sequence-finding approaches (by sequence similarity and by keywords) is to use both and then cross-reference their results, along with careful manual inspection of the retrieved data, employing critical assessment and considering all possible pitfalls. Once the sequence data is found and downloaded, it needs to be curated and stored. Following the discussion in the previous
parts of this chapter, it is clear that it should be organized in a database. Current standard personal computers feature enough computational power and storage space to easily accommodate databases of thousands of sequences and their accompanying data, and sequence assemblies of prokaryotes and even of complex organisms with genome sizes of a few gigabases. Freely available, powerful database systems, such as MySQL, and applications to view and manage them (see Note 5) are not too difficult to install on personal computers, by end users themselves or with the skilled assistance of computer support personnel. Using this approach, a research group can set up a database of biological data with the resources of an up-to-date personal computer and an internet connection, the main effort being staff time. Depending on users' backgrounds, appreciable time might be needed to install and utilize the relevant database systems and programming languages. Relevant courses, free tutorials, examples, and other resources are available on the Web (74, 93). Researchers can then devote their thoughts and time to planning, constructing, curating, analyzing, and managing their Omics databases.
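As one small example of the cross-referencing strategy recommended above, the following sketch intersects the accessions returned by a similarity search with those returned by a keyword search; the accession numbers are invented for illustration.

    import java.util.*;

    // Sketch: hits supported by both search approaches are high confidence;
    // hits found by only one approach are flagged for manual inspection.
    public class SearchCrossReference {
        public static void main(String[] args) {
            Set<String> similarityHits = Set.of("NM_000546", "NM_001126112", "XM_0000001");
            Set<String> keywordHits = Set.of("NM_000546", "NM_9999999");

            Set<String> highConfidence = new TreeSet<>(similarityHits);
            highConfidence.retainAll(keywordHits);

            Set<String> manualReview = new TreeSet<>(similarityHits);
            manualReview.addAll(keywordHits);
            manualReview.removeAll(highConfidence);

            System.out.println("high confidence: " + highConfidence);
            System.out.println("inspect manually: " + manualReview);
        }
    }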
4. Notes

1. Genome Viewing Tools. The UCSC (53) portal includes the popular Web-based UCSC Genome Browser, which zooms and scrolls over complete chromosomes, showing a variety of external annotations; GeneSorter, which elaborates expression, homology, and detailed information on groups of genes; GenomeGraphs, which enables visualization of genome-wide data sets; and others. Artemis (94) is a free stand-alone genome viewer and annotation tool from the Sanger Institute, which allows visualization of sequence features and the results of analyses within sequence contexts.

2. XML. Extensible Markup Language (XML) (95) is a simple, flexible, standardized tag-based format for data exchange between projects, Web sites, institutions, and more. Originally designed to meet the challenges of large-scale electronic publishing, it has evolved to become a data formatting tool that is as widespread as relational databases.

3. CVS. Concurrent Versions System (CVS) (96) is a UNIX-based, open-source code control system that allows many programmers to work simultaneously on the same program files, reports on conflicts, and keeps track of code changes over time and by programmer. This makes it easy to find which changes introduced bugs, and to revert to previous revisions if necessary.
4. Eclipse. An integrated development environment supporting a variety of programming languages, including C++, Java, Perl, and PHP, Eclipse (97) allows programmers to easily develop, maintain, test, and manage all of the files associated with software projects. Eclipse is a free, open-source application, written by many individuals affiliated with different institutions and united under the Eclipse platform.

5. MySQL Graphical Interfaces. phpMyAdmin (87) is a free, open-source graphical user interface, written in PHP, that enables handling of MySQL database administration (75) via a Web browser. phpMyAdmin supports a wide range of operations with MySQL, including managing and querying databases, tables, fields, relations, and indexes, with results presented in a friendly and intuitive manner. Sequel Pro (98), another graphical interface for managing MySQL databases, is a free, open-source application for Mac OS X 10.5. It is a stand-alone program that can manage and browse both local and remote MySQL databases.

6. PHP Propel. An open-source Object-Relational Mapping (ORM) framework for the PHP programming language. Propel (89) is free software that helps automatically create object-oriented code as well as relational database structures based on the same input schema. With the help of Propel, a system can be based on a standard relational database model of normalized tables of scalars, and also allow easy access to persistent, complex data objects, enabling application writers to work with database elements in the same way that they work with other PHP objects.

7. Bugzilla. A popular open-source, Web-based, general-purpose bug tracker and testing tool (90). Each bug or enhancement request is assigned a unique number and, at any point in time, is attached to a particular project, component, version, owner, and status (migrating from new to assigned to resolved to verified or reopened, and eventually to closed).
Acknowledgments

We thank the members of the GeneCards team: Iris Bahir, Tirza Doniger, Tsippi Iny Stein, Hagit Krugh, Noam Nativ, Naomi Rosen, and Gil Stelzer. The GeneCards project is funded by Xennex Inc., the Weizmann Institute of Science Crown Human Genome Center, and the EU SYNLET (FP6 project number 043312) and SysKID (FP7 project number 241544) grants.
References

1. Liolios, K., Mavromatis, K., Tavernarakis, N., and Kyrpides, N. C. (2008) The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36, 475–9. 2. Data Management International, http://www.dama.org/i4a/pages/index.cfm?pageid=1. 3. Tech FAQ. What is Data Management?, http://www.tech-faq.com/data-management.shtml. 4. Chaussabel, D., Ueno, H., Banchereau, J., and Quinn, C. (2009) Data management: it starts at the bench. Nat Immunol 10, 1225–7. 5. Aebersold, R., and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207. 6. Batley, J., and Edwards, D. (2009) Genome sequence data: management, storage, and visualization. Biotechniques 46, 333–6. 7. Wilkins, M. R., Pasquali, C., Appel, R. D., Ou, K., Golaz, O., Sanchez, J. C., Yan, J. X., Gooley, A. A., Hughes, G., Humphery-Smith, I., Williams, K. L., and Hochstrasser, D. F. (1996) From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (NY) 14, 61–5. 8. Field, D., Sansone, S. A., Collis, A., Booth, T., Dukes, P., Gregurick, S. K., Kennedy, K., Kolar, P., Kolker, E., Maxon, M., Millard, S., Mugabushaka, A. M., Perrin, N., Remacle, J. E., Remington, K., Rocca-Serra, P., Taylor, C. F., Thorley, M., Tiwari, B., and Wilbanks, J. (2009) Megascience. 'Omics data sharing'. Science 326, 234–6. 9. Field, D., Garrity, G., Gray, T., Morrison, N., Selengut, J., Sterk, P., Tatusova, T., Thomson, N., Allen, M. J., Angiuoli, S. V., Ashburner, M., Axelrod, N., Baldauf, S., Ballard, S., Boore, J., Cochrane, G., Cole, J., Dawyndt, P., De Vos, P., DePamphilis, C., Edwards, R., Faruque, N., Feldman, R., Gilbert, J., Gilna, P., Glockner, F. O., Goldstein, P., Guralnick, R., Haft, D., Hancock, D., Hermjakob, H., Hertz-Fowler, C., Hugenholtz, P., Joint, I., Kagan, L., Kane, M., Kennedy, J., Kowalchuk, G., Kottmann, R., Kolker, E., Kravitz, S., Kyrpides, N., Leebens-Mack, J., Lewis, S. E., Li, K., Lister, A. L., Lord, P., Maltsev, N., Markowitz, V., Martiny, J., Methe, B., Mizrachi, I., Moxon, R., Nelson, K., Parkhill, J., Proctor, L., White, O., Sansone, S. A., Spiers, A., Stevens, R., Swift, P., Taylor, C., Tateno, Y., Tett, A., Turner, S., Ussery, D.,
Vaughan, B., Ward, N., Whetzel, T., San Gil, I., Wilson, G., and Wipat, A. (2008) The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 26, 541–7. 10. Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O. A., Leung, F. C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C. C., Lam, T. T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M. W., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T.-W., Yiu, S.-M., Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L., Kristiansen, K., Wong, G. K., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., and Wang, J. (2009) The sequence and de novo assembly of the giant panda genome. Nature 463, 311–7. 11. (2008) Big Data special issue. Nature 455. 12. Howe, D., Costanzo, M., Fey, P., Gojobori, T., Hannick, L., Hide, W., Hill, D. P., Kania, R., Schaeffer, M., St Pierre, S., Twigger, S., White, O., and Rhee, S. Y. (2008) Big data: the future of biocuration. Nature 455, 47–50. 13. Haquin, S., Oeuillet, E., Pajon, A., Harris, M., Jones, A. T., van Tilbeurgh, H., Markley, J. L., Zolnai, Z., and Poupon, A. (2008) Data management in structural genomics: an overview. Methods Mol Biol 426, 49–79. 14. Gribskov, M. (2003) Challenges in data management for functional genomics. OMICS 7, 3–5. 15. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Wheeler, D. L. (2006) GenBank. Nucleic Acids Res 34, D16–20. 16. Birney, E., Andrews, T. D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., Down, T., Eyras, E., Fernandez-Suarez, X. M., Gane, P.,
Gibbins, B., Gilbert, J., Hammond, M., Hotz, H. R., Iyer, V., Jekosch, K., Kahari, A., Kasprzyk, A., Keefe, D., Keenan, S., Lehvaslaiho, H., McVicker, G., Melsopp, C., Meidl, P., Mongin, E., Pettett, R., Potter, S., Proctor, G., Rae, M., Searle, S., Slater, G., Smedley, D., Smith, J., Spooner, W., Stabenau, A., Stalker, J., Storey, R., Ureta-Vidal, A., Woodwark, K. C., Cameron, G., Durbin, R., Cox, A., Hubbard, T., and Clamp, M. (2004) An overview of Ensembl. Genome Res 14, 925–8. 17. Boutet, E., Lieberherr, D., Tognolli, M., Schneider, M., and Bairoch, A. (2007) UniProtKB/Swiss-Prot. Methods Mol Biol 406, 89–112. 18. Schofield, P. N., Bubela, T., Weaver, T., Portilla, L., Brown, S. D., Hancock, J. M., Einhorn, D., Tocchini-Valentini, G., Hrabe de Angelis, M., and Rosenthal, N. (2009) Post-publication sharing of data and tools. Nature 461, 171–3. 19. Pennisi, E. (2009) Data sharing. Group calls for rapid release of more genomics data. Science 324, 1000–1. 20. Merali, Z., and Giles, J. (2005) Databases in peril. Nature 435, 1010–1. 21. 1000 Human Genomes Project, http://www.1000genomes.org. 22. Smigielski, E. M., Sirotkin, K., Ward, M., and Sherry, S. T. (2000) dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res 28, 352–5. 23. Frazer, K. A., Ballinger, D. G., Cox, D. R., Hinds, D. A., Stuve, L. L., Gibbs, R. A., Belmont, J. W., Boudreau, A., Hardenbol, P., Leal, S. M., Pasternak, S., Wheeler, D. A., Willis, T. D., Yu, F., Yang, H., Zeng, C., Gao, Y., Hu, H., Hu, W., Li, C., Lin, W., Liu, S., Pan, H., Tang, X., Wang, J., Wang, W., Yu, J., Zhang, B., Zhang, Q., Zhao, H., Zhou, J., Gabriel, S. B., Barry, R., Blumenstiel, B., Camargo, A., Defelice, M., Faggart, M., Goyette, M., Gupta, S., Moore, J., Nguyen, H., Onofrio, R. C., Parkin, M., Roy, J., Stahl, E., Winchester, E., Ziaugra, L., Altshuler, D., Shen, Y., Yao, Z., Huang, W., Chu, X., He, Y., Jin, L., Liu, Y., Sun, W., Wang, H., Wang, Y., Xiong, X., Xu, L., Waye, M. M., Tsui, S. K., Xue, H., Wong, J. T., Galver, L. M., Fan, J. B., Gunderson, K., Murray, S. S., Oliphant, A. R., Chee, M. S., Montpetit, A., Chagnon, F., Ferretti, V., Leboeuf, M., Olivier, J. F., Phillips, M. S., Roumy, S., Sallee, C., Verner, A., Hudson, T. J., Kwok, P. Y., Cai, D., Koboldt, D. C., Miller, R. D., Pawlikowska, L., Taillon-Miller, P., Xiao, M., Tsui, L. C., Mak, W., Song, Y. Q., Tam, P. K., Nakamura,
Y., Kawaguchi, T., Kitamoto, T., Morizono, T., Nagashima, A., Ohnishi, Y., Sekine, A., Tanaka, T., Tsunoda, T., Deloukas, P., Bird, C. P., Delgado, M., Dermitzakis, E. T., Gwilliam, R., Hunt, S., Morrison, J., Powell, D., Stranger, B. E., Whittaker, P., Bentley, D. R., Daly, M. J., de Bakker, P. I., Barrett, J., Chretien, Y. R., Maller, J., McCarroll, S., Patterson, N., Pe’er, I., Price, A., Purcell, S., Richter, D. J., Sabeti, P., Saxena, R., Schaffner, S. F., Sham, P. C., Varilly, P., Stein, L. D., Krishnan, L., Smith, A. V., Tello-Ruiz, M. K., Thorisson, G. A., Chakravarti, A., Chen, P. E., Cutler, D. J., Kashuk, C. S., Lin, S., Abecasis, G. R., Guan, W., Li, Y., Munro, H. M., Qin, Z. S., Thomas, D. J., McVean, G., Auton, A., Bottolo, L., Cardin, N., Eyheramendy, S., Freeman, C., Marchini, J., Myers, S., Spencer, C., Stephens, M., Donnelly, P., Cardon, L. R., Clarke, G., Evans, D. M., Morris, A. P., Weir, B. S., Mullikin, J. C., Sherry, S. T., Feolo, M., Skol, A., Zhang, H., Matsuda, I., Fukushima, Y., Macer, D. R., Suda, E., Rotimi, C. N., Adebamowo, C. A., Ajayi, I., Aniagwu, T., Marshall, P. A., Nkwodimmah, C., Royal, C. D., Leppert, M. F., Dixon, M., Peiffer, A., Qiu, R., Kent, A., Kato, K., Niikawa, N., Adewole, I. F., Knoppers, B. M., Foster, M. W., Clayton, E. W., Watkin, J., Muzny, D., Nazareth, L., Sodergren, E., Weinstock, G. M., Yakub, I., Birren, B. W., Wilson, R. K., Fulton, L. L., Rogers, J., Burton, J., Carter, N. P., Clee, C. M., Griffiths, M., Jones, M. C., McLay, K., Plumb, R. W., Ross, M. T., Sims, S. K., Willey, D. L., Chen, Z., Han, H., Kang, L., Godbout, M., Wallenburg, J. C., L’Archeveque, P., Bellemare, G., Saeki, K., An, D., Fu, H., Li, Q., Wang, Z., Wang, R., Holden, A. L., Brooks, L. D., McEwen, J. E., Guyer, M. S., Wang, V. O., Peterson, J. L., Shi, M., Spiegel, J., Sung, L. M., Zacharia, L. F., Collins, F. S., Kennedy, K., Jamieson, R., and Stewart, J. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–61. 24. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–9. 25. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A.,
Sansone, S. A., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25, 1251–5. 26. ClearForest, Text Analytics Solutions, http://www.clearforest.com/index.asp. 27. novo|seek, http://www.novoseek.com/Welcome.action. 28. DDBJ: DNA Data Bank of Japan, http://www.ddbj.nig.ac.jp. 29. Cochrane, G., Aldebert, P., Althorpe, N., Andersson, M., Baker, W., Baldwin, A., Bates, K., Bhattacharyya, S., Browne, P., van den Broek, A., Castro, M., Duggan, K., Eberhardt, R., Faruque, N., Gamble, J., Kanz, C., Kulikova, T., Lee, C., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., McHale, M., McWilliam, H., Mukherjee, G., Nardone, F., Pastor, M. P., Sobhany, S., Stoehr, P., Tzouvara, K., Vaughan, R., Wu, D., Zhu, W., and Apweiler, R. (2006) EMBL Nucleotide Sequence Database: developments in 2005. Nucleic Acids Res 34, D10–5. 30. Sussman, J. L., Lin, D., Jiang, J., Manning, N. O., Prilusky, J., Ritter, O., and Abola, E. E. (1998) Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr 54, 1078–84. 31. Rebhan, M., Chalifa-Caspi, V., Prilusky, J., and Lancet, D. (1998) GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics 14, 656–64. 32. Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., and Lancet, D. (2003) Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res 31, 142–6. 33. Safran, M., Solomon, I., Shmueli, O., Lapidot, M., Shen-Orr, S., Adato, A., Ben-Dor, U., Esterman, N., Rosen, N., Peter, I., Olender, T., Chalifa-Caspi, V., and Lancet, D. (2002) GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics 18, 1542–3. 34. Stelzer, G., Inger, A., Olender, T., Iny-Stein, T., Dalah, I., Harel, A., Safran, M., and Lancet, D. (2009) GeneDecks: paralog hunting and gene-set distillation with GeneCards annotation. OMICS 13, 477–87. 35. Harel, A., Inger, A., Stelzer, G., Strichman-Almashanu, L., Dalah, I., Safran, M., and
Lancet, D. (2009) GIFtS: annotation landscape analysis with GeneCards. BMC Bioinformatics 10, 348. 36. Liebel, U., Kindler, B., and Pepperkok, R. (2004) 'Harvester': a fast meta search engine of human protein resources. Bioinformatics 20, 1962–3. 37. Pang, K. C., Stephen, S., Engstrom, P. G., Tajul-Arifin, K., Chen, W., Wahlestedt, C., Lenhard, B., Hayashizaki, Y., and Mattick, J. S. (2005) RNAdb – a comprehensive mammalian noncoding RNA database. Nucleic Acids Res 33, D125–30. 38. Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A., and McKusick, V. A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33, D514–7. 39. Laboratory information management system, http://en.wikipedia.org/wiki/Laboratory_information_management_system. 40. Morris, J. A., Gayther, S. A., Jacobs, I. J., and Jones, C. (2008) A Perl toolkit for LIMS development. Source Code Biol Med 3, 4. 41. Genome Canada LIMS, http://wishart.biology.ualberta.ca/labm/index.htm. 42. Parkinson, J., Anthony, A., Wasmuth, J., Schmid, R., Hedley, A., and Blaxter, M. (2004) PartiGene – constructing partial genomes. Bioinformatics 20, 1398–404. 43. Schmid, R., and Blaxter, M. (2009) EST processing: from trace to sequence. Methods Mol Biol 533, 189–220. 44. The maxd software: supporting genomic expression analysis, http://www.bioinf.manchester.ac.uk/microarray/maxd. 45. Gribskov, M., Fana, F., Harper, J., Hope, D. A., Harmon, A. C., Smith, D. W., Tax, F. E., and Zhang, G. (2001) PlantsP: a functional genomics database for plant phosphorylation. Nucleic Acids Res 29, 111–3. 46. Predict-IV, www.predict-iv.toxi.uni-wuerzburg.de/participants/participant_7. 47. Harris, M., and Jones, T. A. (2002) Xtrack – a web-based crystallographic notebook. Acta Crystallogr D Biol Crystallogr 58, 1889–91. 48. Zolnai, Z., Lee, P. T., Li, J., Chapman, M. R., Newman, C. S., Phillips, G. N., Jr., Rayment, I., Ulrich, E. L., Volkman, B. F., and Markley, J. L. (2003) Project management system for structural and functional proteomics: Sesame. J Struct Funct Genomics 4, 11–23. 49. Prilusky, J., Oueillet, E., Ulryck, N., Pajon, A., Bernauer, J., Krimm, I., Quevillon-Cheruel, S., Leulliot, N., Graille, M., Liger,
D., Tresaugues, L., Sussman, J. L., Janin, J., van Tilbeurgh, H., and Poupon, A. (2005) HalX: an open-source LIMS (Laboratory Information Management System) for small- to large-scale laboratories. Acta Crystallogr D Biol Crystallogr 61, 671–8. 50. Goh, C. S., Lan, N., Echols, N., Douglas, S. M., Milburn, D., Bertone, P., Xiao, R., Ma, L. C., Zheng, D., Wunderlich, Z., Acton, T., Montelione, G. T., and Gerstein, M. (2003) SPINE 2: a system for collaborative structural proteomics within a federated database framework. Nucleic Acids Res 31, 2833–8. 51. ProteinScape™, http://www.protagen.de/index.php?option=com_content&task=view&id=95&Itemid=288. 52. Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J. E., Harris, T. W., Arva, A., and Lewis, S. (2002) The generic genome browser: a building block for a model organism system database. Genome Res 12, 1599–610. 53. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., Weber, R. J., Haussler, D., and Kent, W. J. (2003) The UCSC Genome Browser Database. Nucleic Acids Res 31, 51–4. 54. Brazma, A. (2001) On the importance of standardisation in life sciences. Bioinformatics 17, 113–4. 55. Taylor, C. F., Field, D., Sansone, S. A., Aerts, J., Apweiler, R., Ashburner, M., Ball, C. A., Binz, P. A., Bogue, M., Booth, T., Brazma, A., Brinkman, R. R., Michael Clark, A., Deutsch, E. W., Fiehn, O., Fostel, J., Ghazal, P., Gibson, F., Gray, T., Grimes, G., Hancock, J. M., Hardy, N. W., Hermjakob, H., Julian, R. K., Jr., Kane, M., Kettner, C., Kinsinger, C., Kolker, E., Kuiper, M., Le Novere, N., Leebens-Mack, J., Lewis, S. E., Lord, P., Mallon, A. M., Marthandan, N., Masuya, H., McNally, R., Mehrle, A., Morrison, N., Orchard, S., Quackenbush, J., Reecy, J. M., Robertson, D. G., Rocca-Serra, P., Rodriguez, H., Rosenfelder, H., Santoyo-Lopez, J., Scheuermann, R. H., Schober, D., Smith, B., Snape, J., Stoeckert, C. J., Jr., Tipton, K., Sterk, P., Untergasser, A., Vandesompele, J., and Wiemann, S. (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26, 889–96. 56. Jones, A. R., Miller, M., Aebersold, R., Apweiler, R., Ball, C. A., Brazma, A., Degreef, J., Hardy, N., Hermjakob, H., Hubbard, S. J., Hussey, P., Igra, M., Jenkins, H., Julian, R. K., Jr., Laursen, K., Oliver, S. G., Paton, N.
W., Sansone, S. A., Sarkans, U., Stoeckert, C. J., Jr., Taylor, C. F., Whetzel, P. L., White, J. A., Spellman, P., and Pizarro, A. (2007) The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25, 1127–33. 57. Sansone, S. A., Rocca-Serra, P., Brandizi, M., Brazma, A., Field, D., Fostel, J., Garrow, A. G., Gilbert, J., Goodsaid, F., Hardy, N., Jones, P., Lister, A., Miller, M., Morrison, N., Rayner, T., Sklyar, N., Taylor, C., Tong, W., Warner, G., and Wiemann, S. (2008) The first RSBI (ISA-TAB) workshop: "can a simple format work for complex studies?". OMICS 12, 143–9. 58. Field, D., Garrity, G., Morrison, N., Selengut, J., Sterk, P., Tatusova, T., and Thomson, N. (2005) eGenomics: cataloguing our complete genome collection. Comp Funct Genomics 6, 363–8. 59. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–71. 60. Webb, S. C., Attwood, A., Brooks, T., Freeman, T., Gardner, P., Pritchard, C., Williams, D., Underhill, P., Strivens, M. A., Greenfield, A., and Pilicheva, E. (2004) LIMaS: the JAVA-based application and database for microarray experiment tracking. Mamm Genome 15, 740–7. 61. Ball, C. A., Awad, I. A., Demeter, J., Gollub, J., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., and Sherlock, G. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res 33, D580–2. 62. Pajon, A., Ionides, J., Diprose, J., Fillon, J., Fogh, R., Ashton, A. W., Berman, H., Boucher, W., Cygler, M., Deleury, E., Esnouf, R., Janin, J., Kim, R., Krimm, I., Lawson, C. L., Oeuillet, E., Poupon, A., Raymond, S., Stevens, T., van Tilbeurgh, H., Westbrook, J., Wood, P., Ulrich, E., Vranken, W., Xueli, L., Laue, E., Stuart, D. I., and Henrick, K. (2005) Design of a data model for developing laboratory information management and analysis systems for protein production. Proteins 58, 278–84.
63. Orchard, S., Hermjakob, H., Binz, P. A., Hoogland, C., Taylor, C. F., Zhu, W., Julian, R. K., Jr., and Apweiler, R. (2005) Further steps towards data standardisation: the Proteomic Standards Initiative HUPO 3rd annual congress, Beijing, 25–27th October, 2004. Proteomics 5, 337–9. 64. Lindon, J. C., Nicholson, J. K., Holmes, E., Keun, H. C., Craig, A., Pearce, J. T., Bruce, S. J., Hardy, N., Sansone, S. A., Antti, H., Jonsson, P., Daykin, C., Navarange, M., Beger, R. D., Verheij, E. R., Amberg, A., Baunsgaard, D., Cantor, G. H., Lehman-McKeeman, L., Earll, M., Wold, S., Johansson, E., Haselden, J. N., Kramer, K., Thomas, C., Lindberg, J., Schuppe-Koistinen, I., Wilson, I. D., Reily, M. D., Robertson, D. G., Senn, H., Krotzky, A., Kochhar, S., Powell, J., van der Ouderaa, F., Plumb, R., Schaefer, H., and Spraul, M. (2005) Summary recommendations for standardization and reporting of metabolic analyses. Nat Biotechnol 23, 833–8. 65. Digital Curation Centre, http://www.dcc.ac.uk. 66. Biosharing, http://biosharing.org. 67. Joyce, A. R., and Palsson, B. Ø. (2006) The model organism as a system: integrating 'omics' data sets. Nat Rev Mol Cell Biol 7, 198–210. 68. Omes and Omics, http://omics.org/index.php/Omes_and_Omics. 69. Mounicou, S., Szpunar, J., and Lobinski, R. (2009) Metallomics: the concept and methodology. Chem Soc Rev 38, 1119–38. 70. Ippolito, J. E., Xu, J., Jain, S., Moulder, K., Mennerick, S., Crowley, J. R., Townsend, R. R., and Gordon, J. I. (2005) An integrated functional genomics and metabolomics approach for defining poor prognosis in human neuroendocrine cancers. Proc Natl Acad Sci USA 102, 9901–6. 71. Pefkaros, K. (2008) Using object-oriented analysis and design over traditional structured analysis and design. International Journal of Business Research, International Academy of Business and Economics. HighBeam Research, http://www.highbeam.com, accessed 2 Jan. 2011. 72. Whitten, J. L., Bentley, L. D., and Dittman, K. C. (2004) Systems Analysis and Design Methods, 6th ed. McGraw-Hill Irwin, New York. 73. Todman, C. (2001) Designing a Data Warehouse: Supporting Customer Relationship Management, 1st ed., pp 25–58. Prentice-Hall PTR, New Jersey.
74. CIS 3400 Database Management Systems Course – Baruch College CUNY, http://cisnet.baruch.cuny.edu/holowczak/classes/3400. 75. MySQL, http://dev.mysql.com. 76. Perl, http://www.perl.org. 77. BioPerl, http://www.bioperl.org. 78. Glimpse, http://www.webglimpse.org. 79. Lucene, http://lucene.apache.org. 80. HGNC, http://www.genenames.org. 81. Entrez Gene, http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene. 82. Ensembl, http://www.ensembl.org/index.html. 83. Universal Protein Resource (UniProtKB), http://www.uniprot.org. 84. GeneCards sources, http://www.genecards.org/sources.shtml. 85. Eyre, T. A., Ducluzeau, F., Sneddon, T. P., Povey, S., Bruford, E. A., and Lush, M. J. (2006) The HUGO Gene Nomenclature Database, 2006 updates. Nucleic Acids Res 34, D319–21. 86. Rosen, N., Chalifa-Caspi, V., Shmueli, O., Adato, A., Lapidot, M., Stampnitzky, J., Safran, M., and Lancet, D. (2003) GeneLoc: exon-based integration of human genome maps. Bioinformatics 19, i222–4. 87. phpMyAdmin, http://www.phpmyadmin.net/home_page/index.php. 88. Solr, http://lucene.apache.org/solr. 89. Propel, http://propel.phpdb.org/trac. 90. Bugzilla – server software for managing software development, http://www.bugzilla.org. 91. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. J Mol Biol 215, 403–10. 92. Trace at NCBI, http://www.ncbi.nlm.nih.gov/Traces. 93. Perl for bioinformatics and internet, http://bip.weizmann.ac.il/course/prog. 94. Artemis, http://www.sanger.ac.uk/Software/Artemis. 95. Extensible Markup Language (XML), http://www.w3.org/XML. 96. Concurrent Versions System (CVS) Overview, http://www.thathost.com/wincvs-howto/cvsdoc/cvs_1.html#SEC1. 97. Eclipse project, http://www.eclipse.org/eclipse. 98. Sequel Pro, http://www.sequelpro.com.
Chapter 4

Data and Knowledge Management in Cross-Omics Research Projects

Martin Wiesinger, Martin Haiduk, Marco Behr, Henrique Lopes de Abreu Madeira, Gernot Glöckler, Paul Perco, and Arno Lukas

Abstract

Cross-Omics studies aimed at characterizing a specific phenotype on multiple levels are entering the scientific literature, and merging e.g. transcriptomics and proteomics data clearly promises to improve Omics data interpretation. Also for Systems Biology, the integration of multi-level Omics profiles (also across species) is considered a central element. Due to the complexity of each specific Omics technique, specialization of experimental and bioinformatics research groups has become necessary, in turn demanding collaborative efforts for effectively implementing cross-Omics. This setting places specific emphasis on data sharing platforms for Omics data integration and cross-Omics data analysis and interpretation. Here we describe a software concept and methodology fostering Omics data sharing in a distributed team setting which, next to the data management component, also provides hypothesis generation via inference, semantic search, and community functions. Investigators are supported in data workflow management and interpretation, easing the transition from a collection of heterogeneous Omics profiles into an integrated body of knowledge.

Key words: Scientific data management, Cross-Omics, Biomedical knowledge management, Systems biology, Inference, Context
1. Introduction

Technological advancements in biomedical research, in particular the Omics revolution, have provided powerful means for investigating complex molecular processes and phenotypes. Each single high-throughput Omics technique is accompanied by the generation of a significant amount of data, and their combination (cross-Omics) has started to become common practice in multidisciplinary team project settings. Hence, data
and information sharing has become essential. Virtual online environments address those issues and support critical processes in biomedical research (1–4). Complementary use of Omics methods, where each single Omics technique is run by a specific group, demands data unification and harmonization. Additionally, the integration of internal and third party (public domain) information is desired for supporting data interpretation. Unfortunately, effective and focused data/knowledge sharing is far from the norm in practice (5). Next to technical difficulties, further factors potentially hampering communication and information sharing are given by legal and cultural matters as well as partially diverging interests of collaborating parties (6). Many technical and procedural barriers are caused by the peculiarity of scientific data, characterized by heterogeneity, complexity, and extensive volume (7). In any case, user acceptance of technologies supporting data and knowledge exchange is central. Accessibility, usability, and a comprehensive understanding of purpose and benefits for each participating team are prerequisites for an "added value" deployment of any scientific data management tool in the Omics field and beyond. We introduce a concept dedicated to supporting a collaborative project setup, and specifically demonstrate the workflow for cross-Omics projects. Although publicly available repositories, as found e.g. for transcriptomics and proteomics data, do not meet these general requirements, they are valuable representatives of established scientific data sharing platforms. Examples of transcriptomics data repositories providing access to a multitude of profiles on various cellular conditions include caArray (https://cabig.nci.nih.gov/tools/caArray) (8), Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) (9), ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae) (10), and the Stanford Microarray Database (http://smd.stanford.edu) (11). These repositories usually provide raw data files further characterized by metadata (12, 13) holding study aim, biological material used, experiment type and design, contributors, technology platform, and sample description, among others. The scope of these platforms is relatively narrow, and functions regarding data retrieval (and partially analysis) are similar in nature. Formats for metadata are usually rigid, and the purpose of these platforms is clearly focused on creating a public memory for specific Omics data rather than facilitating collaboration. For our envisaged data management this rigid concept has to be expanded:

(a) Management of diverse Omics data (next to transcriptomics also covering other Omics tracks) has to be taken care of.

(b) Seeing cross-Omics team efforts, a knowledge management system has to provide the necessary flexibility allowing adaptation to the research workflow (which for scientific workflows frequently experiences significant changes over time).
(c) Such a concept has to respect nonfunctional requirements, covering group constraints such as access policy.

(d) Next to mere data management, such a system shall support interlinking of data (context representation), e.g. by easily interrogating transcriptomics and proteomics feature lists and querying the resulting profile.

A defined level of flexibility enables adapting to changing project conditions, but in turn the framework still has to guarantee data consistency. These issues have to be respected under the constraints of operational, regulatory, and productivity aspects, including security and versioning, and further providing an audit trail and referencing. Based on the complex methodological environment and translational research aspects in molecular biology, collaborative projects are clearly seen as a way to design and implement research projects (5, 14, 15), and this trend is particularly obvious when combining Omics methods (7, 12).
2. Materials

2.1. Basic Considerations
We will discuss the knowledge management concept for a typical project setting involving collaborative application of different Omics techniques (transcriptomics and proteomics) driven by a team focusing on a specific phenotype. The data accumulated by the different groups participating in the effort shall be shared and integrated to provide the basis for cross-Omics analysis. Information may originate from human sources (e.g. a study plan document specifying a particular Omics study with respect to samples and associated sample descriptors) as well as machine sources (e.g. a transcriptomics raw data image file produced by a microarray scanner). In general, standardization of the data formats is important, particularly for information subjected to further automated processing (e.g. for bioinformatics workflows and statistics procedures). The example scenario respects the following major considerations and requirements:

(a) Project centered view
In contrast to an institutional (or company) view on data maintenance and sharing, a project view implicitly assumes an end of activities at project completion. This strongly impacts the modus operandi regarding what information shall be shared and why. The integration with legacy systems is considered less critical.

(b) Confidentiality and data security
Unauthorized access to information must be prevented; thus a capable permission system and encrypted data transfer are
needed. The "need-to-know" principle may apply even in a collaborative research setting.

(c) Share and find data
Comprehensive search capabilities are required to facilitate quick data retrieval. Implementation of complex queries as well as full text search options is recommended for supporting user needs. Infrastructure capable of supporting large scale data exchange has to be in place.

(d) Flexible structure regarding data objects
Various types of data have to be considered, e.g. study plans, raw data (as generated by transcriptomics and proteomics instruments), as well as result profiles and associated analysis protocols. Those data objects might change over time, thus flexible data modeling is required.

(e) Persistence and maintenance of data
Once data is accessible to other participants it shall be persistent over time, with a unique identifier assigned for referencing. Versioning of information is needed to account for eventual changes and to support reproducing results at any given point in time.

(f) Inference as a mechanism for turning data into knowledge
A data repository shall not necessarily be limited to the data management aspect. Establishing and modeling associations among related information has to be envisaged, either upon manual definition (the user explicitly specifies which data objects "belong" together) or by automated mining procedures retrieving implicit relations. Consequently, relations among experimental studies, samples, raw data (transcriptomics and proteomics), result protocols, and the like should be represented. In this way a data repository is transformed into an analysis and information retrieval environment.

2.2. General Data Concept
A common challenge in scientific data management is the need for a flexible data structure. Even if the data structure is properly defined with respect to the requirements identified at project start, adaptations are frequently required due to changes in the experimental protocols, or even in larger parts of the workflow, arising during the project lifetime. Unfortunately, increasing flexibility regarding data structure usually comes with higher risk regarding data consistency and standardized data querying and analysis. LIMS (Laboratory Information Management Systems) partially address customizability for adapting to changing procedures. However, these powerful software platforms are mainly designed for mirroring relatively rigid processes, which, for example, come into place at later stages of biomedical product development. A major issue to be considered here is the granularity of data representation: representation may be fine grained, e.g. each
single expression value of an expression array is explicitly stored and addressable, or coarse grained, e.g. by storing the whole expression file as one data object. Certainly, more fine grained data representation eases specific querying, but on the other hand complicates adaptability, as all data instances have to be explicitly taken care of in the database design. To cope with the requirement of flexible data structures for an integrated Omics-bioinformatics repository we propose a mixed model for the data structure. The central components are (a) taxonomy, (b) records, and (c) relations. These components allow building up a structured collection of data objects, as schematically given in Fig. 1. This general data model can easily be adapted to react to requirements introduced by changing research procedures. In this concept a user request allows specific access to a data object in its native format (e.g. a CEL file coming from Affymetrix GeneChips) together with associated metadata. On top, the user may also query the native file itself if the data provided in this file follows a strict standard (as given for machine-generated raw data files).

(a) Taxonomy component
Similar to a conventional file system, data objects are most easily organized in a user-definable taxonomy. This taxonomy may represent a hierarchical structure modeled according to the project requirements. Like conventional file systems, the taxonomy channels the assignment of each single data object to one unique location in the hierarchy. Examples
Fig. 1. Components of the general information concept. (a) Hierarchical taxonomy shown as file system-like folder structure resembling an Omics experiment covering protocols and raw data. (b) Example raw data record characterized by metadata and the native data file. (c) Explicit relations between records, where the raw data record is uniquely assigned to a specific study plan record.
Table 1
Selected taxonomy types

Work Breakdown Structure: In project management, the Work Breakdown Structure is a deliverable-oriented grouping of tasks in a way that supports defining and organizing the scope of a project.

Product Breakdown Structure: The Product Breakdown Structure is a grouping of physical or functional components (items) that make up a specific deliverable (e.g. a product, such as the experimentally validated features from the Omics-bioinformatics workflow).

Organizational Structure: An Organizational Structure represents a hierarchy of teams (contributors) on the level of organizations, departments, or individuals.

Procedural Structure: A Procedural Structure is a hierarchy of process-related categories, such as the consecutive, stepwise maturation of deliverables described by status changes.
for taxonomies are listed in Table 1. Using, e.g., a work breakdown structure (16) of a project as taxonomy means that the tasks are represented as a folder hierarchy (e.g. a main folder "study plans" with subfolders for specific studies) and data objects generated in the context of a specific task (study) are assembled in the respective task folder. Figure 2 exemplarily shows a work breakdown structure that could be used as taxonomy for covering an integrated Omics-bioinformatics project. Next to such a taxonomy built on the basis of a project workflow, alternative taxonomies such as biomedical terminologies or ontology terms from molecular biology may be used (17, 18).

(b) Record component
For representing and handling data we introduce a "record" as the fundamental information instance. A record can be understood as the smallest, self-contained entity of data or information stored in the repository. Records comprise a set of named properties, each containing either data files, references to other records, or just plain text. A typical record combines a data file (e.g. a study plan document) along with a set of properties (metadata) providing further information about the data file (author, subject, study, etc.). In short, metadata can be understood as data about data (9). Diverse types of records (a study plan, an analysis report, etc.) can be modeled side by side, in the following referred to as "record types". Each record type is defined via a fixed set of descriptors
Fig. 2. Example of a Work Breakdown Structure for an integrated Omics-bioinformatics workflow. Main elements (next to project management) include sample collection, Omics, data analysis, and validation workflows.
for defining the properties of this particular type (as a study plan record has a fundamentally different scope than an Omics raw data record). These properties are populated by the user upon creation of a new record assigned to a specific record type. This concept of records serves as a generic data object representation; however, all records of a particular type have the same particular purpose and specific metadata definition. This concept of metadata and records allows both rigidity and flexibility in data modeling, which is central for standardization efforts in a (to a large extent non-standardized) environment as found in a science project. Each record type at least has to hold a description of scope and purpose to allow a researcher to query the database, either by explicitly searching in selected metadata fields or by doing a full-text search on the record as such.

(c) Relation component
The relation component allows explicit or implicit linking of records, where directed and undirected relations are relevant. These relations allow the definition of context between records. A typical directed and explicit (user-defined) relation would be a record being "associated" with another record, as given for "sample descriptor" records associated with the record "sample cohorts". An example of an undirected and implicit (system-identified) relation would be given by two
records which share keywords in their metadata (e.g. two study plans both addressing a specific phenotype). Here the implicit relation is identified by a script building relations based on screening the metadata information of all records and record types. This concept can be further expanded for delineating complex sets of relations, represented as semantics (19), e.g. on the basis of OWL (Web Ontology Language, http://www.w3.org/TR/owl-features) or RDF (Resource Description Framework, http://www.w3.org/RDF). The introduction of relations between records is part of the metadata functions.
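To make the three components tangible, the following Java sketch models a record with its type, its unique taxonomy location, its metadata properties, and its explicit relations. It is a minimal illustration of the concept under the assumptions stated in the comments, not the actual data model of the described system; all names are invented.

    import java.util.*;

    // Sketch of the taxonomy/record/relation concept.
    public class RecordModel {
        record Relation(String type, String targetRecordId) {}   // directed, explicit link

        static class DataRecord {
            final String id;
            final String recordType;        // e.g. "study plan" or "raw data"
            final String taxonomyPath;      // one unique location in the hierarchy
            final Map<String, String> metadata = new LinkedHashMap<>();
            final List<Relation> relations = new ArrayList<>();

            DataRecord(String id, String recordType, String taxonomyPath) {
                this.id = id;
                this.recordType = recordType;
                this.taxonomyPath = taxonomyPath;
            }
        }

        public static void main(String[] args) {
            DataRecord study = new DataRecord("S-002", "study plan", "/Omics/Studies");
            study.metadata.put("subject", "transcriptomics study");

            DataRecord rawData = new DataRecord("RD-001", "raw data", "/Omics/Raw Data");
            rawData.metadata.put("file", "expr02.cel");
            // Explicit relation: this raw data record belongs to study S-002.
            rawData.relations.add(new Relation("associated-with", study.id));

            System.out.println(rawData.id + " -> " + rawData.relations);
        }
    }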
2.3. Technical Realization of a Knowledge Management Application
Modular implementation based on existing software building blocks is recommended for realizing a knowledge management system as introduced above. Significant infrastructure and utility functionality is offered in the public domain ready for use, clearly reducing the development effort. Generally, a web application is favored to support easy handling on the client side in a multicenter setting, e.g., using the Java Enterprise Platform (http://java.sun.com/javaee). For supporting dynamic data models, a post-relational approach as data foundation, as given by the Content Repository for Java technology (http://www.ibm.com/developerworks/java/library/j-jcr), is recommended. The main components of the application are illustrated in Fig. 3. Additional details on software architecture and environment (see Note 1), data persistence (see Note 2), and presentation technology (see Note 3) have to be considered.
[Fig. 3 diagram: a web application server hosting Presentation (JSF/ICEfaces), Web Service, Business Logic (EJBs), and Data Access (Jackrabbit) layers, backed by a relational database and the filesystem.]
Fig. 3. Prototypical software architecture for realizing a record management system.
2.4. Information Management Policy
Equally important as the general concept and its proper implementation is an appropriate information management policy specifically designed for the particular project setup. Such a policy shall at least clarify the following main issues:
(a) Who should provide information, and when? It has to be clarified which team member is, at what point in time, expected to provide information to others. Automatic e-mail notification, e.g., when a record is added to a specific folder, supports an efficient information flow.
(b) What information should be provided, and why? It should be clear which information is required for reaching the project goals. Therefore, the purpose of information with respect to driving project-specific processes also has to be defined.
(c) In what format should information be provided? This definition is important for the modeling of record types. For files considered for automated processing, the format has to be specified in a strict manner. The policy also has to define the type of taxonomy to be applied.
(d) What context definition should be applied to information? Rules regarding explicit (user-driven) relations among records have to be clarified.
(e) Who is allowed to use information, and for what? Restrictions for data retrieval and use have to be clear and agreed on between the parties. Freedom of operation, exploitation, intellectual property, and confidentiality issues have to be considered.
3. Methods
3.1. Application Example
In this section we outline a practical application example utilizing the concept given above, focusing on an integrated Omics-bioinformatics workflow:
User A: investigator at laboratory A; defines studies and collects biological samples.
User B: investigator at laboratory B; performs transcriptomics experiments.
User C: investigator at laboratory C; performs proteomics experiments.
User D: investigator at organization D; analyzes raw data and performs bioinformatics.
In our record concept, data is collected in the form of record stacks, in the sense that data of the same type is stored using the same record type and metadata structure, assigned to storage locations in the given taxonomy. Records can therefore be distinguished according to source, generation methodology, or particular scope (raw data, analysis report, etc.). Each study can be clearly described according to the hypothesis, the experimental methods and conditions, the materials and samples used, and finally the results generated. Therefore, record types have to be defined to model these studies in advance. The procedural creation of records and their relational organization is outlined in Fig. 4.
[Fig. 4 diagram: a timeline of records - a Study Plan and an Addendum, Transcriptomics Data records (user B), Proteomics Data (user C), and Analysis Reports for transcriptomics, proteomics, and cross-omics (user D) - all linked back to the study plan.]
Fig. 4. Example of an Omics-bioinformatics workflow realized in a team setting, with all records centrally linked to a study plan, as represented by explicit and directed linking of records.
The first step of the process for user A is the creation of a study plan document and of the respective study plan record. The study plan is a document defining the aims and procedures of the study, the major goals to be achieved, participating members, timelines, etc. User A thus creates a record of the record type "Study plan" by uploading the study plan document to the repository. User A may add further information to the specific study plan using the predefined metadata fields provided for this record type. The study plan record may also hold the description and label assignment of samples used in the experimental protocol (see Note 4). Process constraints may be introduced at this step, e.g., by requiring a study plan as the first document before any downstream tasks can be started, such as adding an Addendum record associated with the study plan (specifying in detail the experimental conditions for the proteomics experiment), followed by creating a new record for the proteomics raw data. In parallel, user B performs transcriptomics experiments according to the specifications provided in the study plan and creates records holding the transcriptomics raw data. As soon as users B and C have provided data, user D is able to download and feed the Omics raw data into an analysis workflow, subsequently generating result records appropriately stored as the record type "Analysis Report". Finally, all information generated in the course of the study refers to the respective study plan. Organizing cross-Omics data as depicted in Fig. 4 has numerous advantages: raw data files and analysis reports can be easily recalled; data provided by different contributors can be distinguished; and, via relations, raw data, analysis files, and the samples used are all unambiguously linked to a study plan. In this way information can be documented and used (or reused) in a comprehensive manner.
3.2. Neighborhood and Context
The concept shown in the example leads to a repository of project-specific content utilizing records conforming to record types. Browsing taxonomy folders and applying keyword searches are simple ways of navigating and identifying content in the repository. However, such search procedures make only limited use of relations. Relations among records allow the derivation of context networks (ontologies), thereby offering entirely new opportunities for interpreting the data generated in their context. In a context network, relations with a specific meaning are defined among records. The "meaning" is given by semantics, which provide edges (relations) between nodes (records). This concept is also found in relational databases, but in our system relations are defined between records, and not during database design on the level of fine-grained data tables. A visual representation of relations for a selected record of the set of records given in Fig. 4 is shown in Fig. 5. For the record "Study Plan A" (central node in Fig. 5a), relations to other records are given. Explicit relations indicating "associated" are depicted as solid lines, whereas implicit relations are indicated by dashed lines. Implicit relations are inferred automatically from records sharing similar attributes. In the example, a second relation instance is used to represent additional context, namely that study plans A and B are related, e.g., by using the same sample cohort. A simple computational procedure that mines metadata can be implemented to screen for such implicit relations. Yet another type of implicit relation may be present, as illustrated in Fig. 5b: one relation between two analysis reports (transcriptomics and proteomics) may, e.g., be derived on the basis of a shared list of features (provided as gene or protein identifiers) jointly identified as relevant. The second relation, between "Analysis Report" (cross-omics) and "Study Plan B", may be based on a common disease term found in both records, either in the metadata fields or in the text of attached files (see Note 5). Such a flexible delineation of implicit neighborhood as discussed here facilitates an explorative approach, allowing for the discovery of relations (context) among data objects (records). Browsing such neighborhoods provides entirely new ways of extracting the information stored in the record repository, going far beyond mere data management.
[Fig. 5 diagram, panels (a) and (b): neighborhood graphs around a selected record, with nodes such as Study Plan A, Study Plan B, Addendum, Transcriptomics Data, Proteomics Data, and Analysis Reports (transcriptomics, proteomics, cross-omics); the legend distinguishes selected records, neighbor records, explicit relations, and implicit relations.]
Fig. 5. Visual representation of neighborhood. (a) Explicit (solid line) and implicit (dashed line) relations are shown. The explicit relations resemble the record structure given in Fig. 4. The implicit relation in (a) is derived by analyzing record metadata; the implicit relations given in (b) demonstrate the neighborhood representation of records.
4. Notes
1. Modern server-side web technologies (often referred to as Web 2.0) enable access to complex applications regardless of location or equipment used. The requirements for using such web applications are mostly as simple as having access to the Internet via a standard web browser. Therefore, designing a data/knowledge management system as a web application is a method of choice for supporting distributed groups. Web application servers are used as platforms for deploying and executing web applications. Such platforms come with numerous powerful technologies and concepts supporting the development, maintenance, and operation of web applications. Recent platforms provide a broad range of infrastructure functionality, among them database connection management, transactions, and security. Hence, the effort for implementing such functionality can be significantly reduced. Moreover, most application servers are built in a modular style to encourage clean levels of abstraction and to support exchangeability of single components to a certain extent. One example is Glassfish v3 (https://glassfish.dev.java.net) as application server, supporting Java Enterprise 6 (Java EE) technologies and paradigms. Following the Enterprise Java Bean (EJB, http://java.sun.com/products/ejb) server-side component architecture (as included in Java EE) facilitates a layered design, e.g., separating the application logic from the presentation logic. This implies that changing or adding presentation components does not affect the application logic (e.g., when a dedicated client application needs to be introduced).
2. Data structures as well as persistence mechanisms form the basis of data-centric applications. Knowledge management systems in particular demand database solutions with specific characteristics: the data management system (and its data structures) is required to support the handling of semi-structured data as well as changes of the data model during operation. Especially in the life sciences domain it is essential to have data solutions capable of handling large amounts of data. By contrast, database performance for considerably complex queries is secondary. We found it very convenient to use the Content Repository for Java (JCR, as defined in Java Specification Request JSR 170) as the central technology for data access and persistence. Following a hierarchy-centered, post-relational database model, JCR allows accessing data at a very high level of abstraction that meets the requirements mentioned above. Specifically, we use Apache Jackrabbit (http://jackrabbit.apache.org), the reference implementation of JCR. Since Jackrabbit provides additional features such as versioning, full text search, and a dynamic data model, the development effort can be further reduced. A relational database (such as MySQL, http://www.mysql.com) and the filesystem serve as immediate persistence mechanisms; data access through JCR remains completely transparent, as the relational database beneath is never accessed directly. Jackrabbit integrates seamlessly with the application server environment as a connector module component.
3. Choosing a capable and easy-to-use presentation framework significantly influences the development effort for the user interface. Accessibility of the application largely depends on the technology used for presentation with respect to usability and technical compatibility. Building a concise user interface directly influences acceptance and user satisfaction. Naturally, it is of high importance not to exclude users because of their client environment (web browser, operating system). The Java Server Faces (JSF, http://java.sun.com/javaee/javaserverfaces) based integrated application framework may be used as server-side technology to build a user interface. In particular, ICEfaces (http://www.icefaces.org) is a rich component framework based on JSF that utilizes Asynchronous JavaScript and XML (Ajax) technology to provide a responsive user interface while also saving bandwidth. Unlike some other frameworks, JSF as server-side technology does not demand any specific software preinstalled on the clients. JSF-based solutions can easily be accessed via standard web browsers (with JavaScript enabled).
4. A properly designed system should provide the user (or system administrator) the option of flexible record type definition. In our example case, the sample documentation may, e.g., also be done by generating a separate record of the record type "Sample documentation", followed by explicitly introducing relations between the samples used and the study plan. Such an extension would allow adding specific metadata to individual samples (retrieval procedure, date of sample drawn), in contrast to having all samples organized in a tabular manner in a single record (where the metadata can only hold information valid for all samples, e.g., the cohort name).
5. Implementing the record management system in Jackrabbit (the JCR reference implementation, see Note 2) provides full text search; consequently, it is a straightforward procedure to search for any specific term in all records. Interrogating the full text index with a term list of genes, proteins, diseases, pathway names, etc. is an easy computational task (see the sketch below).
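As a toy illustration of such an interrogation, outside of Jackrabbit (whose query API is not shown here), the following R sketch greps a hypothetical term list against hypothetical record texts:

```r
# Minimal sketch: screen record texts for terms from a given list.
# record.texts and term.list are hypothetical example data.
record.texts <- c(rec1 = "TP53 is differentially expressed in tumor samples",
                  rec2 = "Sample storage protocol for cohort X")
term.list <- c("TP53", "BRCA1", "tumor")

hits <- sapply(term.list, function(term)
  grepl(term, record.texts, ignore.case = TRUE))
rownames(hits) <- names(record.texts)
hits  # logical matrix: which record mentions which term
```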
Acknowledgments
The research leading to these results has received funding from the European Community's Seventh Framework Programme under grant agreement n° HEALTH-F5-2008-202222.

References
1. Sagotsky, J. A., Zhang, L., Wang, Z., Martin, S., and Deisboeck, T. S. (2008) Life Sciences and the web: a new era for collaboration. Mol Syst Biol 4, 201.
2. Waldrop, M. (2008) Big data: Wikiomics. Nature 455, 22–25.
3. Ruttenberg, A., Clark, T., Bug, W., Samwald, M., Bodenreider, O., Chen, H., Doherty, D., Forsberg, K., Gao, Y., Kashyap, V., Kinoshita, J., Luciano, J., Marshall, M. S., Ogbuji, C., Rees, J., Stephens, S., Wong, G. T., Wu, E., Zaccagnini, D., Hongsermeier, T., Neumann, E., Herman, I., and Cheung, K. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics 8 Suppl 3, S2.
4. Stein, L. D. (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9, 678–88.
5. Disis, M. L., and Slattery, J. T. (2010) The road we must take: multidisciplinary team science. Sci Transl Med 2, 22–9.
6. Nelson, B. (2009) Data sharing: empty archives. Nature 461, 160–3.
7. Moore, R. (2000) Data management systems for scientific applications. IFIP Conf Proc 188, 273–84.
8. Bian, X., Gurses, L., Miller, S., Boal, T., Mason, W., Misquitta, L., Kokotov, D., Swan, D., Duncan, M., Wysong, R., Klink, A., Johnson, A., Klemm, J., Fontenay, G., Basu, A., Colbert, M., Liu, J., Hadfield, J., Komatsoulis, G., Duvall, P., Srinivasa, R., and Parnell, T. (2009) Data submission and curation for caArray, a standard based microarray data repository system. Nat Proc doi:10.1038/npre.2009.3138.1.
9. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10.
10. Parkinson, H., Kapushesky, M., Kolesnikov, N., Rustici, G., Shojatalab, M., Abeygunawardena, N., Berube, H., Dylag, M., Emam, I., Farne, A., Holloway, E., Lukk, M., Malone, J., Mani, R., Pilicheva, E., Rayner, T. F., Rezwan, F., Sharma, A., Williams, E., Bradley, X. Z., Adamusiak, T., Brandizi, M., Burdett, T., Coulson, R., Krestyaninova, M., Kurnosov, P., Maguire, E., Neogi, S. G., Rocca-Serra, P., Sansone, S., Sklyar, N., Zhao, M., Sarkans, U., and Brazma, A. (2009) ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37, D868–72.
11. Hubble, J., Demeter, J., Jin, H., Mao, M., Nitzberg, M., Reddy, T. B. K., Wymore, F., Zachariah, Z. K., Sherlock, G., and Ball, C. A. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37, D898–901.
12. Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., and Heber, G. (2005) Scientific data management in the coming decade. ACM SIGMOD Record 34, 34–41.
13. National Information Standards Organization (U.S.) (2004) Understanding metadata. NISO Press, Bethesda, MD.
14. Wuchty, S., Jones, B. F., and Uzzi, B. (2007) The increasing dominance of teams in production of knowledge. Science 316, 1036–39.
15. Gray, N. S. (2006) Drug discovery through industry-academic partnerships. Nat Chem Biol 2, 649–53.
16. Project Management Institute (2008) A Guide to the Project Management Body of Knowledge: PMBoK Guide, Fourth Edition. PMI.
17. Bodenreider, O. (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32, D267–70.
18. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25, 1251–55.
19. Das, S., Girard, L., Green, T., Weitzman, L., Lewis-Bowen, A., and Clark, T. (2009) Building biomedical web communities using a semantically aware content management system. Brief Bioinformatics 10, 129–38.
Chapter 5
Statistical Analysis Principles for Omics Data
Daniela Dunkler, Fátima Sánchez-Cabo, and Georg Heinze
Abstract
In Omics experiments, typically thousands of hypotheses are tested simultaneously, each based on very few independent replicates. Traditional tests like the t-test were shown to perform poorly with this new type of data. Furthermore, simultaneous consideration of many hypotheses, each prone to a decision error, requires powerful adjustments for this multiple testing situation. After a general introduction to statistical testing, we present the moderated t-statistic, the SAM statistic, and the RankProduct statistic, which have been developed to evaluate hypotheses in typical Omics experiments. We also provide an introduction to the multiple testing problem and discuss some state-of-the-art procedures to address this issue. The presented test statistics are subjected to a comparative analysis of a microarray experiment comparing tissue samples of two groups of tumors. All calculations can be done using the freely available statistical software R. Accompanying commented code is available at http://www.meduniwien.ac.at/msi/biometrie/MIMB.
Key words: Differential expression analysis, False discovery rate, Familywise error rate, Moderated t-statistics, RankProduct, Significance analysis of microarrays
1. Introduction
The recent developments of experimental molecular biology have led to a new standard in the design of Omics experiments: the so-called $p \gg n$ paradigm. Under this new paradigm, the number of independent subjects n (e.g., tissue samples) is much smaller than the number of variables p (e.g., number of genes in an expression profile) that is analyzed. While in classical settings few prespecified null hypotheses are evaluated, we are now confronted with simultaneous testing of thousands of hypotheses. Classical statistical methods typically require the number of independent subjects to be large and a multiple of the number of variables in order to avoid collinearity and overfit (1). However, the number of subjects that can be considered for a high-throughput experiment is often limited due to technical and economic limitations of the experiment. Hence, new statistical methods had to be developed to handle Omics data. These developments mainly focus on three issues: first, new test statistics (2, 3) and corrections for multiple testing of many hypotheses in one experiment (4, 5); second, avoiding collinearity in model estimation with more variables than subjects; and third, feature selection by optimizing predictive accuracy in independent samples. Some methods simultaneously address the latter two issues (6–8). In this contribution, we focus on statistical methods developed for the simultaneous comparison of continuous variables (e.g., gene expression profiles) between two conditions (e.g., two types of tumors). We do not discuss advanced issues of model building or feature selection that go beyond the scope of an introductory text. Thus, the remainder of the chapter is organized as follows: in Subheading 2, we explain basic statistical concepts and their use in an Omics data context. Subheading 3 exemplifies the most commonly used methods for hypothesis testing in Omics experiments by means of a microarray experiment comparing the gene expression of two types of tumor tissues (9).
2. Materials
2.1. Population and Sample
Any biological experiment that involves data collection seeks to infer properties of an underlying entity. This entity is called the target population of our experiment, and any conclusions we draw from the experiment apply to this target population. Conclusions that are restricted to the few subjects of our experiment must be considered scientifically worthless. The variation of gene expression among the members of the population constitutes a statistical distribution. Its characteristics can be described by statistical measures, e.g., its central tendency by the mean and its spread by the standard deviation. However, since the target population is not completely observable, we have to estimate its characteristics using data independently collected from a sample of few members of the population (the biological replicates or subjects, see Note 1). Statistical estimates are values computed from the subjects in a sample which are our best guesses for the corresponding population characteristics. Repeated sampling from a population yields different estimates of population characteristics, but the underlying population characteristics remain constant (as sampling has no influence on the target population).
2.2. The Principle of Statistical Testing
In statistical testing, we translate our biological hypothesis, e.g., "The expression of gene XY is different between basal-like human breast cancers (BLC) and non-BLC (nBLC) tumors," into a statistical null hypothesis: "The mean expression of gene XY is equal among BLC and nBLC tumors." A statistical test results in a decision about this null hypothesis: either reject the null hypothesis and claim that there is a difference between the groups (a positive finding), or do not reject the null hypothesis because of insufficient evidence against it (a negative finding). Statistical tests do not prove or verify hypotheses; instead, they reject or falsify postulated hypotheses if the data are implausible under a given hypothesis. The complementary hypothesis to the null is denoted as the alternative hypothesis and corresponds to any condition not covered by the null hypothesis: "Mean gene expression in BLC tumors differs from mean gene expression in nBLC tumors." Our population now comprises BLC as well as nBLC tumors, and the parameter of interest is the difference in expected gene expression between these two groups. This parameter could be zero (corresponding to the null hypothesis) as well as different from zero (corresponding to the alternative hypothesis), and by our experiment we wish to infer its true value. A test statistic measures the deviation of the observed data from the null hypothesis. In our example, the test statistic could be defined as the difference in mean gene expression between BLC and nBLC tumors. We could now reject the null hypothesis as soon as the sample difference in means is different from zero. However, since we get a new difference in means each time we draw a new sample, we must take sampling variation into account. Therefore, we first assume that the null hypothesis is true and estimate the distribution of the test statistic under this assumption. Then, we estimate a p-value, which measures, on a probability scale, the plausibility of the observed result given the null hypothesis is true. p-values which are less than the predefined significance level correspond to low plausibility and lead to rejection of the temporarily assumed null hypothesis. The steps of statistical testing can be formalized as follows:
1. Define a null hypothesis in terms of the population parameter of interest.
2. Define a significance level, i.e., the probability of a falsely rejected null hypothesis.
3. Select a test statistic that scores the data with respect to the null hypothesis.
4. Derive the distribution of the test statistic given the null hypothesis applies.
5. Compute the test statistic from the data at hand.
6. Compute the p-value as the probability, given the null hypothesis applies, of a test statistic as extreme as or more extreme than the value observed in the sample.
7. Reject the null hypothesis if the p-value is less than the significance level; otherwise, do not reject the null hypothesis.
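As a worked illustration of these steps for a single gene, the following R sketch runs a two-sample t-test at a significance level of 0.05; the expression values are simulated, so all numbers are hypothetical.

```r
# Steps 1-7 for one gene: H0 "mean expression equal in BLC and nBLC"
set.seed(1)
blc   <- rnorm(8, mean = 7.5, sd = 0.7)   # simulated log2 expressions, group 1
nblc  <- rnorm(8, mean = 6.5, sd = 0.7)   # simulated log2 expressions, group 2
alpha <- 0.05                             # step 2: significance level

tt <- t.test(blc, nblc, var.equal = TRUE) # steps 3-6: t-statistic and p-value
tt$statistic                              # observed test statistic
tt$p.value                                # p-value
tt$p.value < alpha                        # step 7: reject H0?
```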
The decision of a statistical test can be wrong: we could falsely reject the null hypothesis although the observed difference in means was only due to sampling variation. This type of wrong decision is called the type I error. We can control the probability of a type I error by choosing a small significance level. In gene expression comparisons, a type I error implies that we declare a gene differentially expressed between two groups of subjects although it is not. By contrast, a type II error occurs if the null hypothesis is falsely not rejected, i.e., if an experiment fails to declare a gene as important although in truth this gene is differentially expressed. The principle of statistical testing provides control over the type I error rate by fixing it at the significance level. Although a significance level of 5% has been established as a quasi-standard in many fields of science, this number has no justification other than being easy to use in the precomputer era. In screening studies, investigators may accept a higher type I error rate in order to improve the chances of identifying genes truly related to the experimental condition. In a confirmatory analysis, however, a low type I error rate is typically desirable. The type II error rate can be controlled by including a sufficient number of subjects, but, to estimate it precisely, a number of assumptions are needed.
2.3. Hypothesis Testing in Omics Experiments
In a typical analysis of an Omics experiment, we are interested in which of several thousand features are differentially expressed. Thus, hypothesis testing in Omics experiments results in a so-called gene list, i.e., a list of genes assumed to be differentially expressed. The principle of statistical testing as outlined in the last section can still be applied; however, error rates now refer to the gene list and not to a single gene. In particular, the type I error rate applicable to a single hypothesis is replaced by the familywise error rate (FWER), which is the probability to find differentially expressed genes although the global null hypothesis of no differential expression in any gene is true. Often the false discovery rate (FDR) (4) is considered as an alternative to the FWER. The FDR is defined as the proportion of truly nondifferentially expressed genes among those in the gene list. Furthermore, modified test statistics have been developed which make use of similarities in the distributions across genes to enhance precision, rather than treating each gene as a separate outcome variable.
2.4. Test Statistics for Omics Experiments
In order to present test statistics useful for Omics experiments, some basic notation is needed. We assume that gene expression values have been background corrected, normalized, and transformed by taking the logarithm to base 2 (see Subheading 3.2). In the following, the log2 gene expression of gene $g$ ($g = 1, \dots, G$) in subject $i$ ($i = 1, \dots, n_k$) belonging to group $k$ ($k \in \{1, 2\}$) is denoted as $y_{gik}$. The sample mean and variance of gene $g$ in group $k$ are given as $\bar{y}_{gk} = n_k^{-1} \sum_{i=1}^{n_k} y_{gik}$ and $s_{gk}^2 = (n_k - 1)^{-1} \sum_{i=1}^{n_k} (y_{gik} - \bar{y}_{gk})^2$, respectively. The square root of the variance, $s_{gk}$, is also known as the standard deviation. The following test statistics are frequently used:
2.4.1. Fold Change Statistic
The mean difference is given by $M_g = \bar{y}_{g1} - \bar{y}_{g2}$. Transferring $M_g$ back to the original scale of gene expression yields $2^{M_g}$, a statistic denoted as the fold change of gene $g$ between groups 1 and 2. Early analyses of microarray data used to define a threshold $M$ on $M_g$, and declared all genes significant for which $|M_g| > |M|$. This procedure makes the strong assumption that the variances are equal across all genes, which, however, is not plausible for gene expression data.
2.4.2. t-Statistic
In order to take different variances between genes into account, one may use the t-statistic $t_g = M_g / s_g$, with the so-called pooled within-group standard error defined as $s_g = \sqrt{s_{g1}^2 / n_1 + s_{g2}^2 / n_2}$. Under the null hypothesis, and given normal distribution of $y_{g1}$ and $y_{g2}$, $t_g$ follows Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom (see Note 2). The t-statistic uses a distinct pooled within-group standard error estimated for each gene separately, thereby taking into account different variances across different genes. However, in small samples the variance estimates are fairly unstable, and large values of $t_g$, associated with high significance, may arise even with biologically meaningless fold changes.
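A minimal base-R sketch of the fold change and t-statistic for a whole expression matrix (rows = genes, columns = arrays) could look as follows; the matrix x and the group vector are simulated and hence hypothetical, standing in for real preprocessed data.

```r
set.seed(2)
G <- 1000; n1 <- 8; n2 <- 8
x <- matrix(rnorm(G * (n1 + n2), mean = 7, sd = 0.7), nrow = G)  # simulated log2 data
group <- rep(1:2, c(n1, n2))

M  <- rowMeans(x[, group == 1]) - rowMeans(x[, group == 2])  # mean difference M_g
fc <- 2^M                                                    # fold change
v1 <- apply(x[, group == 1], 1, var)                         # s^2_g1
v2 <- apply(x[, group == 2], 1, var)                         # s^2_g2
s  <- sqrt(v1 / n1 + v2 / n2)                                # pooled standard error s_g
t.stat <- M / s
p <- 2 * pt(-abs(t.stat), df = n1 + n2 - 2)                  # two-sided p-values
```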
2.4.3. Moderated t-Statistic
A general-purpose model proposed for microarray experiments with arbitrary experimental design (2) can be applied to a two-group comparison, resulting in the moderated t-statistic $\tilde{t}_g = M_g / \tilde{s}_g$, where $\tilde{s}_g = \sqrt{(d_0 s_0^2 + d_g s_g^2)/(d_0 + d_g)}$ is a weighted average of the gene-specific variance $s_g^2$ and a common variance $s_0^2$; $d_g = n_g - 1$ and $d_0$ are the degrees of freedom of $s_g^2$ and $s_0^2$, respectively. $s_0^2$ and $d_0$ are so-called hyperparameters and can be estimated, using empirical Bayes methods, from the distributions of gene expression of all genes (2). While $s_0^2$ serves as a variance-stabilizing constant, $d_0$ can be interpreted as the relative weight assigned to this variance stabilizer. Under the null hypothesis, the moderated t-statistic $\tilde{t}_g$ follows a Student's t-distribution with $d_0 + d_g$ degrees of freedom. The approach is implemented in R's limma package (see Note 3).
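Assuming an expression matrix x and group vector as in the sketch above, a typical limma workflow might look as follows; this is a sketch of common usage, not the authors' distributed script.

```r
library(limma)
design <- model.matrix(~ factor(group))   # intercept + group effect
fit <- lmFit(x, design)                   # gene-wise linear models
fit <- eBayes(fit)                        # empirical Bayes: estimates d0 and s0^2
topTable(fit, coef = 2, adjust.method = "BH", number = 10)  # moderated t, p- and q-values
```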
2.4.4. Significance Analysis of Microarrays
The SAM procedure (3), implemented in the R package samr, tries to solve the problem of unstable variance estimates by defining a test statistic $d_g = M_g / (s_0 + s_g)$, where, compared to the t-statistic, the denominator is inflated by an offset $s_0$, which is based on the distribution of $s_g$ across all genes. Common choices for $s_0$ are the 90th percentile of $s_g$ (10) or more refined estimates like the one applied in samr (11). Although related, $d_g$ (which offsets the standard errors) and $\tilde{t}_g$ (which employs an offset on the variances) are not connected in any formal way. Due to the modification of its denominator, the distribution of $d_g$ is intractable even under the null hypothesis, and has to be estimated by resampling methods (see Subheading 2.6 and Note 4).
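The samr package implements the full SAM procedure including resampling; as a bare-bones illustration of the statistic itself, one can compute $d_g$ with the 90th-percentile choice of $s_0$ directly, continuing the hypothetical M and s from the sketch above.

```r
# d_g = M_g / (s0 + s_g), with s0 the 90th percentile of all s_g (one common choice)
s0 <- quantile(s, probs = 0.9)
d  <- M / (s0 + s)
# The null distribution of d is intractable and would be estimated by
# permuting the group labels (see Subheading 2.6), not from a t-distribution.
```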
2.4.5. Wilcoxon Rank-Sum Statistic
The classical nonparametric alternative to the t-test is the Wilcoxon rank-sum test, also known as the Mann–Whitney U-test, which tests the null hypothesis of equality of the distributions in the two groups. Its test statistic is directly related to the probabilistic index $\Pr(y_{gi1} > y_{gj2})$, i.e., the probability that a gene expression $y_{gi1}$ randomly picked from group 1 is greater than a gene expression $y_{gj2}$ randomly picked from group 2. The Wilcoxon rank-sum test is based on the ranks of the original data values. Thus, it is invariant to monotone transformations of the data and does not impose any parametric assumptions. It is particularly attractive because of its robustness to outliers. However, it does not have the same interpretation as the t-type statistics (fold change, t, moderated t, SAM), as it is sensitive to any differences in the location and shape of the gene expression distributions in the two groups. Assuming that the data have been rank-transformed across groups, the Wilcoxon rank-sum test statistic $W_g$ is given by the sum of the ranks in group 1. In its implementation in the R package samr, $W_g$ is standardized to $Z_g = [W_g - E(W_g)] / (s_0 + s_g)$, where $E(W_g)$ is the expected value of $W_g$ under the null hypothesis and $s_g$ is the standard error of that rank sum. samr again uses an offset $s_0$ in the denominator, which is by default the 5th percentile of all $s_g$, $g = 1, \dots, G$.
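Gene-wise Wilcoxon rank-sum tests can also be run with base R (the samr standardization is not reproduced here); a sketch using the hypothetical matrix x from above:

```r
w.p <- apply(x, 1, function(z)
  wilcox.test(z[group == 1], z[group == 2], exact = FALSE)$p.value)
head(w.p)  # raw two-sided p-values, one per gene
```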
2.4.6. RankProduct Statistic
Breitling et al. (12) proposed a simple nonparametric method to assess differential expression based on the RankProduct statistic $R_g = \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} \mathrm{rank}^{\mathrm{up}}(y_{gi1} - y_{gj2}) / G$, where $\mathrm{rank}^{\mathrm{up}}(y_{gi1} - y_{gj2})$ sorts the gene-specific expression differences between two arrays $i$ and $j$ in descending order such that the largest positive difference is assigned a rank of 1. If a gene is highly expressed in group 1 only, $R_g$ will assume small values, as the differences $(y_{gi1} - y_{gj2})$ across all pairs of subjects $(i, j)$ from groups 1 and 2 are more likely to be positive. Similarly, $R_g$ will be large if gene $g$ shows higher expression in group 2. For a gene not differentially expressed, the product terms $\mathrm{rank}^{\mathrm{up}}(y_{gi1} - y_{gj2}) / G$ fall around 1/2. The distribution of the RankProduct statistic can be assessed by permutation methods (see Subheading 2.6).
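A direct, unoptimized base-R sketch of $R_g$ (accumulated on the log scale to avoid numerical underflow; again using the hypothetical x and group from above):

```r
# rank_up over genes for each between-group pair of arrays (i, j)
g1 <- which(group == 1); g2 <- which(group == 2)
log.rp <- numeric(nrow(x))
for (i in g1) for (j in g2) {
  d.ij <- x[, i] - x[, j]                  # gene-wise differences for pair (i, j)
  r.up <- rank(-d.ij)                      # largest positive difference gets rank 1
  log.rp <- log.rp + log(r.up / nrow(x))   # accumulate log of rank/G
}
RP <- exp(log.rp)   # RankProduct R_g; small values = higher expression in group 1
# Significance would be assessed by permuting group labels (see Subheading 2.6).
```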
2.5. The Multiple Testing Problem
The main problem that we are confronted with in Omics analyses is the huge number of hypotheses to be tested simultaneously, leading to an inflation of the FWER if no adjustment for multiple testing is made. Corrections have been developed which allow researchers to preserve the FWER at the significance level. p-value adjustments inflate original p-values, thus resulting in an increased type II error compared to an unadjusted analysis. Therefore, in recent years more and more experiments aim at controlling the FDR (4), which allows for a less stringent p-value correction. The FDR is defined as the proportion of false positives among all genes declared differentially expressed, i.e., in contrast to the FWER, the reference set is no longer defined by the true status of the gene (differentially expressed or not), but by our decision about that status. Here, it is no longer believed that the null hypothesis applies to all genes under consideration. Instead, we are quite sure that differential expression is present in at least a subset of the genes. Procedures to control the FDR assign a q-value to each gene, an analog to the p-value that refers to the expected proportion of false discoveries in a gene list obtained by calling all genes with equal or lower q-values differentially expressed. The q-values are typically lower than adjusted p-values.
2.6. Assessing Significance in Omics Experiments While Controlling the FDR
q-values can be obtained by the following two procedures:
2.6.1. Stepwise Adjustment
A recursive step-up procedure, proposed by Benjamini and Hochberg (4), starts at the highest p-value, which is directly translated into a q-value. The second-highest p-value is multiplied by $G/(G-1)$ to obtain the corresponding q-value, etc. Each q-value is restricted such that it cannot exceed the preceding q-value. Formally, the procedure is defined as follows:
1. Raw p-values are ordered such that $p_{(1)} \le \dots \le p_{(g)} \le \dots \le p_{(G)}$
2. $q_{(G)} = p_{(G)}$
3. $q_{(g)} = \min[q_{(g+1)}, p_{(g)} G / g]$
This procedure takes raw p-values as input and is therefore suitable for all test statistics for which p-values can be analytically derived: $t_g$ and $\tilde{t}_g$. It relies on the assumption of independent or positively correlated test statistics, an assumption that is likely to apply to microarray experiments (13). The Benjamini–Hochberg approach is the default p-value adjustment for the moderated t-statistic $\tilde{t}_g$ in R's limma package.
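In R, Benjamini–Hochberg q-values are available via p.adjust; the three formal steps can also be spelled out directly, as in the following sketch applied to the hypothetical p-values p from above.

```r
q.bh <- p.adjust(p, method = "BH")      # built-in step-up adjustment

# The same computation spelled out:
o <- order(p, decreasing = TRUE)        # start from the highest p-value
q <- p[o] * length(p) / (length(p):1)   # p_(g) * G / g
q <- cummin(q)                          # q_(g) cannot exceed the preceding q-value
q[o] <- q                               # back to the original gene order
all.equal(q, q.bh)                      # should be TRUE
```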
2.6.2. Simulation-Based Adjustment
Storey (5) proposed a direct permutation-based approach to q-values to control the FDR. This adjustment is based on two ideas: first, we assume that some genes are truly differentially expressed, and second, that the FDR associated with a p-value cutoff of, e.g., 0.05 equals the expected proportion of false findings among all genes declared significant at that cutoff. The number of false findings can be estimated from, e.g., 100 data sets that emerge from random reassignment of the group labels to the arrays. In each of these permuted data sets, the number of genes declared significant is counted. The average count estimates the number of false findings. Assuming that genes are ordered by their raw p-values, i.e., $p_{(1)} \le \dots \le p_{(g)} \le \dots \le p_{(G)}$, and given the permuted p-values $P_{lb}$ ($l = 1, \dots, G$; $b = 1, \dots, B$) from $B$ permuted data sets as input, the Storey adjustment can be formalized as
$$q^S_{(g)} = \hat{p}_0 \sum_{b=1}^{B} \sum_{l=1}^{G} I[P_{lb} < p_{(g)}] \,/\, [\mathrm{rank}(g)\, B].$$
Here, $\mathrm{rank}(g)$ denotes the index of gene $g$ in the ordered sequence of unadjusted p-values such that, e.g., $\mathrm{rank}(g) = 1$ for the most significant gene, etc. $\mathrm{rank}(g)$ is essentially the number of genes declared significant at a p-value threshold of $p_{(g)}$, and $\sum_{l=1}^{G} I[P_{lb} < p_{(g)}] / \mathrm{rank}(g)$ is the estimated false positive rate among nondifferentially expressed genes in permutation $b$, which is averaged over $B$ permutations. Alternatively, one may also apply the adjustment to the ordered absolute test statistics $T_{(1)} \ge \dots \ge T_{(g)} \ge \dots \ge T_{(G)}$, yielding $q^S_{(g)} = \hat{p}_0 \sum_{b=1}^{B} \sum_{l=1}^{G} I[|T_{lb}| > |T_{(g)}|] / [\mathrm{rank}(g)\, B]$. In both variants, $\hat{p}_0$ is the estimated proportion of genes not differentially expressed. Various suggestions have been made about how to estimate this proportion. Based on the empirical distribution function of raw p-values $\hat{F}(p_g) = \sum_{l=1}^{G} I(p_l \le p_g) / G$, one may use $\hat{p}_0 = [1 - \hat{F}(0.5)] / 0.5$, i.e., the ratio of observed and expected proportions of p-values exceeding 0.5. The approach is applicable to both p-values and test statistics, thus it works with all proposed test statistics. Since it employs permutation, it relies on the assumption of subset pivotality (see Note 5).
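A compact sketch of the p-value variant in R, using the hypothetical x and group from above and a plain gene-wise t-test as the raw p-value generator (the function raw.p is introduced here for illustration only):

```r
raw.p <- function(labels) {                  # gene-wise two-sample t-test p-values
  apply(x, 1, function(z)
    t.test(z[labels == 1], z[labels == 2], var.equal = TRUE)$p.value)
}
p.obs  <- raw.p(group)
B      <- 100
P.perm <- replicate(B, raw.p(sample(group))) # G x B matrix of permuted p-values

F.hat <- mean(p.obs <= 0.5)
p0    <- min(1, (1 - F.hat) / 0.5)           # estimated proportion of null genes
q.storey <- sapply(p.obs, function(pg) {
  r <- sum(p.obs <= pg)                      # rank(g): genes called at threshold pg
  p0 * sum(P.perm < pg) / (r * B)
})
```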
2.7. Improving Statistical Power by Unspecific Filtering
Adjusted p-values are always inflated compared to raw p-values. The magnitude of this inflation directly depends on G, the number of hypotheses to be considered, irrespective of the type of adjustment. In particular, control of the FDR depends on the number of hypotheses rejected, not on the total number of hypotheses. Hence, it is desirable to identify a priori genes that are very unlikely to be differentially expressed and that could be excluded from statistical testing. Some unspecific filtering criteria, i.e., criteria that do not make use of the group information, are defined as follows (14, 15):
1. Exclude genes whose variation is close to zero. Consider the interquartile range, defined as IQR = 75th percentile − 25th percentile. The IQR filter excludes those genes where the IQR is lower than a prespecified minimum relevant log fold change, e.g., 1, here corresponding to a fold change of 2. If groups are perfectly separated, the median log2 expression in the groups will be equal to the 75th and 25th percentiles computed over the combined groups. Thus, for genes that do not pass this filter, even under perfect separation of the groups, the minimum relevant log fold change (approximately equal to the difference in medians) cannot be achieved.
2. Exclude genes with low average expression. If nearly all expression values are low, there is no chance to detect overexpression in one group.
3. Exclude genes with a high proportion of undetectable signals or high background noise.
The thresholds used for these filtering rules may depend on the platform used and on individual taste, such that the number of genes excluded by filtering is highly arbitrary. Lusa et al. (16) proposed a method to combine the results from a variety of filtering methods without the need to prespecify filter thresholds. It is important to note that only unspecific filtering does not compromise the validity of results. If a filtering rule is applied that uncovers the group information (e.g., $|M_g| > M$, with some arbitrary threshold $M$), then this must be considered also in resampling-based assessment of statistical significance.
3. Methods
This section presents a complete workflow of a differential gene expression analysis, exemplified by comparing tissues of BLC patients to tissues of nBLC cancer patients (9). We compute test statistics and assess significance comparing eight BLC to eight nBLC subjects. Each tissue sample was hybridized to an Affymetrix Human Genome U133 Plus 2.0 Array, and hence methods for the preprocessing of Affymetrix data are briefly described. All analyses were performed using R (http://www.r-project.org) and Bioconductor (http://www.bioconductor.org) software (17, 18). The data set and the R script to perform all analytical steps can be found at http://www.meduniwien.ac.at/msi/biometrie/MIMB.
3.1. Experimental Design Considerations
3.1.1. Sample Size Assessment for Microarray Experiments
The necessary sample size can be estimated by assuming values for the following quantities:
1. The number of genes investigated (NG).
2. The number of genes assumed to be differentially expressed (this number is often assumed to be equal to the number to be detected by testing) (NDE).
3. The acceptable number of false positives (FP).
4. The relevant mean difference (or log fold change), often assumed as 1.0.
5. The per gene variation of gene expression within groups (usually expressed as standard deviation), often assumed as 0.7.
The FDR that is to be controlled is given by FP/NDE. High variation of gene expression and a low (relevant) mean difference increase the necessary sample size. Given that the NDE equals the number of genes detected by testing, the FDR equals the type II error rate. One minus the type II error rate is defined as the statistical power. The per gene type I error is also denoted as the false negative rate and is given by FP/(NG − NDE), given all other numbers remain constant. Sample size assessment for microarrays can be done using the MD Anderson sample size calculator available at http://bioinformatics.mdanderson.org/MicroarraySampleSize. Table 1 gives some impression about how the required sample size varies with different assumptions. R's samr package also provides an assessment of sample size, given some pilot data, and can take into account different standard deviations across genes as well as correlation between genes.

Table 1. Sample size calculations performed by the MD Anderson microarray sample size calculator (http://bioinformatics.mdanderson.org/MicroarraySampleSize)

NG | NDE | FP | FDR | FC | SD | FNR | N
54,000 | 1,000 | 100 | 0.10 | 2 | 0.7 | 0.0186 | 19
5,400 | 1,000 | 100 | 0.10 | 2 | 0.7 | 0.1887 | 13
54,000 | 1,000 | 100 | 0.10 | 1.5 | 0.7 | 0.0186 | 56
54,000 | 1,000 | 100 | 0.10 | 2 | 1.4 | 0.0186 | 76
54,000 | 1,000 | 50 | 0.05 | 2 | 0.7 | 0.0186 | 25
54,000 | 300 | 30 | 0.10 | 2 | 0.7 | 0.00056 | 22

Provided with the numbers of the first six columns, the calculator supplies estimates of the false negative rate and the necessary sample size per group. NG, number of genes investigated; NDE, number of genes truly/declared differentially expressed; FP, number of false positive findings; FDR, false discovery rate (=1 − power), FDR = FP/NDE; FNR, false negative rate (per gene type I error), FNR = FP/(NG − NDE); FC, fold change to be detected; SD, per gene standard deviation (on log2 expressions); N, necessary number of arrays per group.

3.1.2. Bias
Bias may arise as a result of using arrays from different production batches, using different scanners to produce intensity images, or analyzing samples at different times. To minimize these sources of bias, stratification or randomization of these factors between experimental conditions is performed. As an example, consider
two experimental conditions 1 and 2, and assume that arrays stem from two production batches B1 and B2 of sizes 30 and 10, respectively, and are hybridized and analyzed on days D1–D5. On each day, a maximum of eight samples can be processed. Let design A be characterized by the sequence of conditions {1111111111111111-11112222-22222222-222222}, bold figures standing for arrays from production batch B1. This design does not allow separating the biological effect of BLC vs. nBLC from the effects of day and production batch. By contrast, design B {12122121-12112112-21211212-21121221-121121} randomizes the two latter effects such that the biological effect is approximately uncorrelated to the nuisance factors.
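A simple way to generate such a randomized allocation in R (a sketch; batch sizes, days, and the number of conditions follow the hypothetical example above):

```r
set.seed(3)
batch <- rep(c("B1", "B2"), c(30, 10))     # production batches
day   <- rep(paste0("D", 1:5), each = 8)   # processing days, 8 arrays per day
cond  <- sample(rep(1:2, each = 20))       # randomized condition sequence
table(cond, batch); table(cond, day)       # check approximate balance
```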
3.2. Data Preprocessing
Prior to statistical analysis of the data, several steps have to be performed to make the data comparable from array to array: background correction, normalization, and summarization. The most popular method for preprocessing of Affymetrix data is the Robust Multichip Average (RMA) (19) method, which serves all three purposes: first, the background is estimated based on the optical noise and nonspecific signal, ensuring that the background-corrected signal is always positive. Second, quantile normalization (20) is applied to the background-corrected intensities of all arrays. This type of normalization forces the gene expressions to have a similar distribution in all arrays under study, which is plausible assuming that typically only few genes show differential expression. Third, the multiple probes per gene are summarized by assuming that the log2 intensity $x_{gij}$ of gene $g$, array $i$, and probe $j$ is a sum of the log2 expression level $y_{gi}$, a probe-specific affinity effect $a_{gj}$ common to all arrays, and some residual error. Assuming that the probe affinity effects sum to zero, they are estimated by a robust alternative to linear regression minimizing the sum of absolute residuals, the median polish (21), and $y_{gi}$ results as the estimated log2 expression level.
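With Bioconductor's affy package, the whole RMA preprocessing is available in essentially one call; the following sketch assumes CEL files in the working directory.

```r
library(affy)
abatch <- ReadAffy()    # read all CEL files in the working directory
eset   <- rma(abatch)   # background correction, quantile normalization, median polish
x      <- exprs(eset)   # matrix of log2 expression values (genes x arrays)
```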
3.3. Quality Checks
Before and after normalization, quality checks need to be performed in order to discover arrays whose data quality may be questionable. Assuming that gene expression should be similar for the majority of genes, a useful and simple method is to compute concordance correlation coefficients and Spearman correlation coefficients between all arrays. Principal components analysis (see Subheading 3.5) can also be helpful to detect spurious arrays. After outlier detection and removal, the normalization step must be repeated, as outlying arrays may have a disproportional impact on the results of normalization. Furthermore, a plot of the difference against the mean of gene expression of two arrays (MA-plot), or a plot of the standard deviation against the mean of gene expression across all genes, is useful (see Note 6).
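Basic between-array checks take only a few lines of base R; this sketch operates on the log2 expression matrix x from above.

```r
round(cor(x, method = "spearman"), 2)   # Spearman correlations between all arrays

# MA-plot for arrays 1 and 2
a.val <- (x[, 1] + x[, 2]) / 2          # mean log2 expression
m.val <- x[, 1] - x[, 2]                # difference in log2 expression
plot(a.val, m.val, pch = ".", xlab = "Mean", ylab = "Difference")
abline(h = 0, lty = 2)
```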
3.4. Unspecific Filtering
For our selected example gene expression profiles, we employ R’s genefilter package to filter out genes with an IQR < 1 such that a fold change of 2 cannot be reached even under perfect separation of the groups. Only 5,874 out of 54,675 genes pass this filter.
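The same filter can be expressed in a few lines of base R (a sketch; the genefilter package offers equivalent convenience functions):

```r
iqr <- apply(x, 1, IQR)      # gene-wise interquartile range of log2 expression
x.filtered <- x[iqr >= 1, ]  # keep genes that can reach a fold change of 2
nrow(x.filtered)             # number of genes passing the filter
```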
3.5. Assessing Differential Expression
After preprocessing and filtering, the capability of gene expression to separate the two groups can be visualized by a plot of the first two principal components. This multivariate statistical procedure reduces the high-dimensional space of gene expressions to a few dimensions which can then be displayed graphically. Each of these principal components is a weighted sum of gene expressions, where the weights are determined such that the principal components capture as much variance of gene expression as possible. Figure 1 shows good separation of the samples in the space spanned by the first two principal components. An extension of the principal components plot is the spectral map (22). Analysis of differential expression produces a gene list. In our analysis, we apply the test statistics developed for testing hypotheses in Omics experiments presented above: moderated t, SAM, and the RankProduct statistic, controlling the FDR at 1%.
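The principal components plot can be produced with prcomp; a sketch on the hypothetical filtered matrix and group vector from above (samples become rows after transposition):

```r
pca <- prcomp(t(x.filtered))                            # PCA over samples
var.expl <- round(100 * pca$sdev^2 / sum(pca$sdev^2))   # percent variance per component
plot(pca$x[, 1], pca$x[, 2], pch = c(1, 2)[group],      # circles vs. triangles
     xlab = paste0("PC1 (", var.expl[1], "%)"),
     ylab = paste0("PC2 (", var.expl[2], "%)"))
```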
Fig. 1. BLC (circles) and nBLC (triangles) arrays in the space spanned by the first two principal components. The first two principal components are an approximation of the high-dimensional space of gene expressions, which cannot be visualized in a two-dimensional diagram. The two principal components account for 29% (PC1) and 11% (PC2) of the variance of gene expression across all genes.
1. The moderated t-test declares 626 genes significant while controlling an FDR of 1%. q-value estimation is based on the deterministic step-up procedure, such that the q-value estimates are not prone to simulation error.
2. The SAM procedure estimates the proportion of genes not related to group status as 0.46. This number refers to the genes subjected to SAM analysis (5,874); thus, the total proportion of null genes is estimated to be 0.95. 902 genes are declared significant while controlling an FDR of 1%. Since q-value computation is based on simulation, SAM issues an estimate of the median q-value and of its 90th percentile, and results depend on the starting value of the random number generator.
3. The RankProduct test results in 208 genes declared significant at an FDR of 1%. Since RankProduct computes one-sided tests (see Note 7), we have to combine the tables produced for significant up- and downregulation in BLC, and the total FDR tolerated has to be allocated to both tables in equal parts. The procedure outputs "percentages of false positive predictions", which have a similar interpretation as q-values. These values are again based on simulation.
4. The methods agree in finding 172 genes which are contained in all three lists. The lists generated by the moderated t and SAM methods agree in 619 genes (see Note 8). Statistical characteristics of a selected number of genes are presented in Table 2.
3.7. Visualizing Results
Although a higher number of genes has been detected by SAM than by moderated t or by RankProduct, it is not appropriate to conclude that SAM is more powerful than the other approaches, as the number of false findings may be higher than the estimated 1% due to violations of the assumptions of this test (see Notes 2 and 5). Conclusions about higher or lower power can only be meaningfully drawn by simulation experiments in which the population is known. Very likely, advantages of one or the other method depend crucially on the data generation mechanism. Higher robustness of the nonparametric approach (RankProduct) has to be traded by a lower number of differentially expressed genes. Here, we have fixed the FDR in advance and have produced gene lists accordingly. In experiments with less impressive separation of the groups, it may be desired to produce a gene list of prespecified length, and estimate the FDR associated with that list. 1. A heat map can be used to visualize results from cluster analysis on samples and on genes. Ideally, a cluster analysis using genes found to be differentially expressed reveals perfect separation of the two groups. However, unless results are validated on an independent data set, such separation does not proof that group membership of individuals can be predicted by the expression values of those genes. The heat map could
Table 2. Characteristics of the top five ranked genes from moderated t, SAM, and RankProduct methods, and of five further genes

Gene ID | Fold change (BLC/nBLC) | Average expression (log2) | Standard deviation (log2)(a) | Moderated t: rank in list | $\tilde{t}_g$ | q-value | SAM: rank in list | $d_g$ | q-value | RankProduct: rank in list | $R_g$ | q-value
219654_at | 4.6 | 8.6 | 0.7 | 1 | −14.6 | 2.2e−9 | 1 | −5.3 | 0.00 | 1 | 350.5 | 0.00
220892_s_at | 2.8 | 7.4 | 0.5 | 2 | −13.1 | 9.8e−9 | 2 | −4.6 | 0.00 | 4 | 358.8 | 0.00
226684_at | 1/2.3 | 7.8 | 0.8 | 4 | 11.0 | 1.7e−7 | 4 | 4.1 | 0.00 | 8 | 1,192.7 | 0.00
229927_at | 4.5 | 8.8 | 0.9 | 5 | 10.3 | 3.4e−7 | 6 | 3.9 | 0.00 | 10 | 800.8 | 0.00
224590_at | 1/5.2 | 7.2 | 0.7 | 6 | −10.3 | 3.4e−7 | 7 | −3.7 | 0.00 | 13 | 370.7 | 0.00
220425_x_at | 18.0 | 7.5 | 1.2 | 199 | 4.8 | 0.0021 | 163 | 1.9 | 0.00 | 184 | 5.9 | 0.0088
215867_x_at | 1/23.6 | 5.9 | 1.3 | 368 | −4.3 | 0.0045 | 258 | −1.7 | 0.0021 | 193 | 18.2 | 0.0076
203963_at | 1/26.5 | 6.9 | 0.3 | 769 | 3.5 | 0.0133 | 837 | 1.2 | 0.0088 | 1626 | 31.5 | 0.7560
1553613_s_at | 33.8 | 7.4 | 0.5 | 563 | −3.8 | 0.0088 | 634 | −1.3 | 0.0084 | 716 | 36.8 | 0.2660
205044_at | 69.0 | 6.7 | 1.1 | 258 | −5.6 | 0.0029 | 211 | −1.8 | 0.0021 | 204 | 46.8 | 0.0086

(a) Pooled within-group standard deviation, given by $\sqrt{s_{g1}^2/2 + s_{g2}^2/2}$.
3.7. Visualizing Results
Fig. 2. Example dot plots to depict differential expression, with dots representing the arrays, log2 gene expression shown on the vertical axis, and horizontal lines referring to the median within each group.
1. A heat map can be used to visualize results from cluster analysis on samples and on genes (see the sketch after this list). Ideally, a cluster analysis using genes found to be differentially expressed reveals perfect separation of the two groups. However, unless results are validated on an independent data set, such separation does not prove that group membership of individuals can be predicted by the expression values of those genes. The heat map could help to identify subgroups defined by subclusters. The gene dimension of the heat map allows detecting groups of genes with similar expression patterns (see Note 8).
2. A plot of the intensity values for genes is useful to judge the relevance of the gene for separating the groups, to depict the range of intensity values, and to relate the fold change to this range (Fig. 2). In practice, it is useful to include values of relevant statistics in the plot, as exemplified in Fig. 2 by means of the concordance or c-index. This nonparametric measure estimates the probability that a gene expression from the BLC group is greater than a gene expression from the nBLC group. A value of 0.5 is associated with no group difference, while values close to 0 or 1 indicate perfect separation of the groups.
3. Gene lists are compared with a reference list to assess whether functional groups are over- or underrepresented, based on Gene Set Enrichment or Gene Ontology.
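Both displays are available in base R; in this sketch, de.genes is a hypothetical index vector of genes declared differentially expressed, derived here from the simulated p-values p used above.

```r
# Heat map of the genes declared differentially expressed
de.genes <- order(p)[1:50]   # hypothetical: the 50 smallest raw p-values
heatmap(x[de.genes, ], scale = "row", labCol = group)  # clusters genes and samples

# Dot plot of one gene with group medians (cf. Fig. 2)
g <- de.genes[1]
stripchart(split(x[g, ], group), vertical = TRUE, pch = 16,
           ylab = "log2 expression")
points(1:2, tapply(x[g, ], group, median), pch = "-", cex = 3)  # group medians
```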
3.8. Sensitivity Analyses
1. Analysis should be repeated with different filter criteria (see Subheading 3.4) to assess whether results are sensitive to the assumptions imposed in this step. Without the variance filter step, moderated t, SAM, and RankProduct detect 247, 185, and 1,461 genes, compared to 626, 902, and 208 after variance filtering, respectively. We see that for the moderated t and SAM statistics, variance filtering increases power, while RankProduct shows discordant behavior: it is likely that the enlarged gene list obtained without filtering includes many genes with nonrelevant fold change, as the IQR is low in these genes (see Note 8).
2. Analysis should be repeated with some arrays left out. Candidates to be omitted are those arrays that exhibit low concordance correlation coefficients with the other arrays. Note that the whole process, starting from normalization across arrays, which takes into account data from all arrays, should be repeated with the reduced data set (see Note 8).
3. Sensitivity analyses should not be used to optimize results based on the sensitivity condition; otherwise, the validity of results would be compromised. Conditions such as which array is left out of the analysis, or which variance filter is applied, are hyperparameters, and optimizing these criteria would require an additional adjustment for multiple testing.
3.9. Other Types of Omics Data
3.9. Other Types of Omics Data
Platforms like proteomics, metabolomics or microRNA profiling often supply intensity values exhibiting a point mass at zero intensity. Zero intensity results if a compound is absent, or if the signal is below a technical detection limit. In either case, the data distribution consists of a proportion of zeros and a continuous component of usually positive intensity values. Standard tests typically focus either on a difference in the proportion of zeros between the groups, or on a difference in the mean of the continuous part. If the zeros are simply added to the continuous part such that the mixture distribution is analyzed, then the normality assumption of the t-test, which is optimal under that assumption, is heavily violated. To solve this problem, two-part statistics have been proposed which employ a t-test or a Wilcoxon rank-sum test to compare the distribution of the continuous part, and a binomial test to compare the proportion of zeros between the groups. The two resulting test statistics are appropriately rescaled into chi-squared-type statistics. Under the null hypothesis, both parts are independent and a combined test statistic can be defined as the sum of both components, distributed as chi-squared with two degrees of freedom. Other proposals include the empirical likelihood ratio statistic (23), the so-called truncated Wilcoxon test (24), and the Wilcoxon test applied to the mixture distribution of zeros and continuous intensity values (25). We have performed a simulation study to evaluate the performance of these proposals in some typical Omics settings. We simulated 200 data sets with 15 out of 250 features exhibiting a fold
change of 2 in the original, nontruncated means, with ten samples per group. Data for each feature were independently simulated from normal or lognormal distributions. For each data set, a detection limit was imposed by truncating the lowest 30% or 60% of intensity values to zero. This resulted in – on average – 30% or 60% point mass at zero per feature. Then, we analyzed each data set with the test statistics mentioned above. In each simulated data set, we sorted the features by the resulting test statistic and declared the top-ranked features as being differentially expressed (DE). By varying the number of features called DE, we obtain different true positive rates (the proportion of truly DE genes that are called DE) and false positive rates (the proportion of non-DE genes that are called DE). These preliminary simulations revealed that the Wilcoxon test applied to the mixed distribution of zeros and positive intensities provides the highest true positive rates when comparing tests at similar false positive rates. The Wilcoxon test can be computed with the samr R package.
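The following R sketch shows the general idea of such a two-part statistic; it is a simplified illustration, not the exact procedure of any of the cited proposals, and the function name two_part_test is ours (degenerate cases, such as features without any zeros, are not handled).

# Two-part test: a proportion-of-zeros part and a Wilcoxon part on the
# positive intensities, each rescaled to a chi-squared-type statistic and
# summed, yielding 2 degrees of freedom overall
two_part_test <- function(x, y) {
  p1 <- prop.test(c(sum(x == 0), sum(y == 0)),
                  c(length(x), length(y)), correct = FALSE)
  p2 <- wilcox.test(x[x > 0], y[y > 0], exact = FALSE)
  chi2 <- unname(p1$statistic) + qnorm(p2$p.value / 2)^2
  pchisq(chi2, df = 2, lower.tail = FALSE)   # combined p-value
}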
4. Notes
1. Here, sample refers to a subset of the population. On other occasions, sample may denote the material obtained from one biological replicate or subject.
2. This condition is fulfilled in small samples (nk < 30) if the ygik follow a normal distribution with group-specific expected value μgk and unknown variance σ²g, assumed to be the same in both groups.
3. Moderated t-statistics are not restricted to two-group comparisons; they have been presented for a broad framework of models that may take into account the experimental design of an Omics study. As an example, one may adjust for batch differences by including a random factor as in a classical mixed ANOVA model. One may also test composite hypotheses (e.g., three-group or paired comparisons) by moderated F-statistics, which generalize moderated t-statistics to composite contrasts of interest.
4. SAM analysis can also be performed with other types of outcome, e.g., with paired data (the same samples are measured under different experimental conditions), with time-to-event outcome (to detect genes associated with survival of individuals), or with multiclass data (more than two groups are compared).
5. Permutation of the group labels assumes that the distributions in the two groups differ only in the parameter of interest, i.e., in mean gene expression. This "subset pivotality" assumption may be violated and could lead to misleading
results (26). In such cases, a different resampling procedure based on sampling with replacement (the bootstrap) can be applied. Here, B resamples are obtained by sampling n1 + n2 subjects with replacement from the original sample, such that some subjects appear multiple times in one resample and others not at all. From each resample b, the G test statistics Tgb are computed. These test statistics are then centered and scaled such that across the B resamples the mean is zero and the standard deviation is 1. The test statistic Tg computed from the original data is similarly centered and scaled. Finally, the simulation-based adjustment can be applied (a sketch follows these notes).
6. Several Bioconductor packages specific for Affymetrix arrays are available to this end (e.g., affyPLM, affyQCReport) and others can be used for all platforms (e.g., arrayQualityMetrics). More information can be found, e.g., in the documentation of R's limma package, or in textbooks (22, 27).
7. Two-sided testing, as performed in Subheading 3, reveals up- as well as downregulated genes. By contrast, with one-sided testing one is only interested in upregulation of genes in one of the groups. Test statistics have to be adjusted such that only positive values indicate significance. One-sided testing requires the null hypothesis to be stated similar to "Gene expression in group BLC is equal to or lower than in group nBLC." Finding downregulation in group BLC would match the null hypothesis. Therefore, only upregulation in BLC can lead to a significant result. As a consequence, p-values from one-sided tests are, if the observed mean difference is in favor of the alternative hypothesis, only half the p-values from equivalent two-sided tests.
8. More details can be found at http://www.meduniwien.ac.at/msi/biometrie/MIMB.
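A sketch of the resampling scheme of Note 5 in R is given below; stat_fun stands for any user-supplied function returning one test statistic per feature for a data matrix and group vector, and both the function name boot_maxT and the use of the max-T form of the simulation-based adjustment are our choices for illustration.

# Bootstrap null distribution with per-feature centering and scaling (Note 5)
boot_maxT <- function(data, groups, stat_fun, B = 1000) {
  T_obs <- stat_fun(data, groups)                # G observed statistics
  T_b <- replicate(B, {                          # G x B resampled statistics
    idx <- sample(ncol(data), replace = TRUE)    # resample n1 + n2 subjects
    stat_fun(data[, idx], groups)
  })
  m <- rowMeans(T_b)
  s <- apply(T_b, 1, sd)
  T_b_std <- (T_b - m) / s                       # centre and scale per feature
  T_std   <- (T_obs - m) / s
  maxs <- apply(abs(T_b_std), 2, max)            # max |T| in each resample
  sapply(abs(T_std), function(t) mean(maxs >= t))  # adjusted p-values
}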
Acknowledgments
This work was partially supported by the European Union FP7 project "SysKid," project number 241544.
References
1. Harrell, F.E. (2001) Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis. Springer Verlag, New York.
2. Smyth, G.K. (2004) Linear models and empirical Bayes methods for assessing differential
expression in microarray experiments. Stat Appl Genet Mol Biol 3, 3.
3. Tusher, V.G., Tibshirani, R.J., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98, 5116–21.
4. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57, 289–300.
5. Storey, J.D. (2002) A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol 64, 479–98.
6. Boulesteix, A.L. (2004) PLS dimension reduction for classification in microarray data. Stat Appl Genet Mol Biol 3, 33.
7. Tibshirani, R.J. (1997) The LASSO method for variable selection in the Cox model. Stat Med 16, 385–95.
8. Dettling, M. (2005) Classification with gene expression data, pp. 421–430. In Gentleman, R., Carey, V., Huber, W., Irizarry, R.A., and Dudoit, S. (ed.), Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York.
9. Richardson, A.L., Wang, Z.C., De Nicolo, A., Lu, X., Brown, M., Miron, A., Liao, X., Iglehart, J.D., Livingston, D.M., and Ganesan, S. (2006) X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 9, 121–32.
10. Efron, B., Tibshirani, R.J., Storey, J.D., and Tusher, V.G. (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96, 1151–60.
11. Chu, G., Narasimhan, B., Tibshirani, R.J., and Tusher, V.G. (2009) Significance Analysis of Microarrays – User's Guide and Technical Document.
12. Breitling, R., Armengaud, P., Amtmann, A., and Herzyk, P. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 573, 83–92.
13. Reiner, A., Yekutieli, D., and Benjamini, Y. (2003) Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19, 368–75.
14. Hackstadt, A.J., and Hess, A.M. (2009) Filtering for increased power for microarray data analysis. BMC Bioinformatics 10, 11.
15. McClintick, J.N., and Edenberg, H.J. (2006) Effects of filtering by present call on analysis of microarray experiments. BMC Bioinformatics 7, 49.
16. Lusa, L., Korn, E.L., and McShane, L.M. (2009) A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Stat Med 27, 5834–49.
17. Gentleman, R., Carey, V., Bates, D., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J., and Zhang, J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.
18. R Development Core Team (2009) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
19. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–64.
20. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on bias and variance. Bioinformatics 19, 185–93.
21. Holder, D., Raubertas, R.F., Pikounis, V.B., Svetnik, V., and Soper, K. (2001) Statistical analysis of high density oligonucleotide arrays: a SAFER approach. Proceedings of the ASA Annual Meeting, Atlanta, GA.
22. Amaratunga, D., and Cabrera, J. (2004) Exploration and Analysis of DNA Microarray and Protein Array Data. Wiley, Hoboken, NJ.
23. Taylor, S., and Pollard, K.S. (2009) Hypothesis tests for point-mass mixture data with application to Omics data with many zero values. Stat Appl Genet Mol Biol 8, 8.
24. Hallstrom, A.P. (2010) A modified Wilcoxon test for non-negative distributions with a clump of zeros. Stat Med 29, 391–400.
25. Dakna, M., Harris, K., Kalousis, A., Carpentier, S., Kolch, W., Schanstra, J.M., Haubitz, M., Vlahou, A., Mischak, H., and Girolami, M. (2010) Addressing the challenge of defining valid proteomic biomarkers and classifiers. BMC Bioinformatics 11, 594.
26. Kerr, K.F. (2009) Comments on the analysis of unbalanced microarray data. Bioinformatics 25, 2035–41.
27. Gentleman, R., Carey, V.J., Huber, W., Irizarry, R.A., and Dudoit, S. (2005) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York.
Chapter 6
Statistical Methods and Models for Bridging Omics Data Levels
Simon Rogers
Abstract
Multiple Omics datasets (for example, high-throughput mRNA and protein measurements for the same set of genes) are beginning to appear more widely within the fields of bioinformatics and computational biology. There are many tools available for the analysis of single datasets, but two (or more) sets of coupled observations present more of a challenge. I describe some of the methods available – from classical statistical techniques to more recent advances from the fields of Machine Learning and Pattern Recognition – for linking Omics data levels, with particular focus on transcriptomic and proteomic profiles.
Key words: Data integration, Clustering, Classification, Multi-view learning
1. Introduction
The development of Omic-scale measurement techniques and the subsequent use of the data they produce has revolutionised Molecular Biology and motivated the development of many useful and sophisticated algorithms in the fields of bioinformatics and applied statistics. However, it is only relatively recently that researchers have started investigating the synergistic possibilities of analysing Omics data of more than one type simultaneously. Picking any two Omics data types, it is easy to see how their combined analysis could reveal more than investigating them individually. For example, in an early paper (1), sequence and mRNA expression data for the same set of genes were combined to help search for regulatory motifs, the idea being that genes with similar expression and similar motifs in their upstream regions are more likely to be co-regulated than those with similarity in just one dataset. Algorithmically, this work did not provide
much novelty – the sequence and expression components were both standard, prepublished models – but the explicit combination of the two data types (as opposed to two-step techniques, such as clustering the expression and then searching within clusters for sequence motifs) opened the door for similar approaches with other data types, some of which we will see later. In this chapter, many examples involve combining transcriptomic and proteomic expression data. This is a particularly important area as it promises to help elucidate the complex regulatory mechanisms that, when operating correctly, underpin cellular behaviour but, when broken, can be responsible for diseases such as cancer. As well as aiding investigation into specific regulatory networks, combining these data can help us answer more general questions. For example, only by combined analysis of the expression of both mRNA and proteins can we draw conclusions as to how much regulatory control is exhibited at the transcriptional level and how much occurs after transcript production. At a more abstract level, this analysis can tell us how well reality concurs with the oft-used assumption (particularly when modelling transcriptional regulation) that mRNA expression data can be used as a proxy for the activity of the proteins it encodes. We begin with a brief description of publicly available tools that help the practitioner in the nontrivial task of matching database IDs to produce the necessary mapping between the objects in each representation. In the remainder, we describe methods for analysing combined data sources, including correlation, clustering, predictive models, and projection-based techniques.
2. Materials
2.1. Data Object Matching
The first task that we are likely to face when given multiple datasets of the same objects is matching the object identifiers across datasets. Each gene may carry different IDs depending on the reference data source, and it is likely that data produced from different experimental platforms use different ones. For example, mRNA expression data may use RefSeq accessions (http://www.ncbi.nlm.nih.gov/RefSeq/) while proteomic data may use SwissProt names (http://www.uniprot.org/). As it is not feasible to match the objects by hand for even a moderately sized dataset, there are tools available to perform the matching automatically. One example is Matchminer (2) (http://discover.nci.nih.gov/matchminer/index.jsp). The Batch Merge feature of Matchminer enables the user to provide two lists of identifiers that are then matched. For more than two sets of identifiers, this procedure could be used in a repeated fashion – matching the output of the first two with the third and so on (see Note 1).
2.2. Statistical Resources
Computational tools are available that implement the statistical methods described later in this chapter. Perhaps the most obvious is Matlab (http://www.mathworks.com), which has a statistical toolbox including many useful functions. In addition, the bioinformatics toolbox provides utilities for reading in Omics data of many different types in standard formats and enables the user to query online databases (for example, the Gene Ontology) to help biologically verify analysis results. A popular alternative is the statistical language R (http://www.r-project.org), which has much of the functionality of Matlab. Bioconductor (http://www.bioconductor.org) is a repository of free, open source R code specifically for bioinformatics (see Note 2). As methods for combining data sources are relatively new, they are unlikely to be found within Matlab or R/Bioconductor. Often, those working in this area release code when new algorithms are published. For example, Matlab code for the coupled clustering method described in Subheading 3.2.3 can be downloaded from http://www.dcs.gla.ac.uk/inference/genomic_integration. An excellent resource for kernel methods of various types is http://www.kernel-machines.org/, including code in various languages (e.g. Matlab, C, C++) as well as an extensive bibliography. The Machine Learning community has been very active in multi-view learning (of which Omics data combination is an obvious application). Various workshops have been organised, for example, the 2008 and 2009 NIPS workshops on Learning from Multiple Sources (http://nips.cc/Conferences/2008/Program/event.php?ID=1051 and http://nips.cc/Conferences/2009/Program/event.php?ID=1518). These are worth examining to find people active in this area and discover state-of-the-art techniques (see Note 3).
3. Methods
3.1. Univariate Methods
In this section, we consider the following univariate problem formulation. We have quantitative observations from two separate Omics data sources for g = 1 … G objects. We denote these two datasets by x1, …, xg, …, xG and y1, …, yg, …, yG (see Note 4). A common example of this setting is steady state mRNA and protein expressions for G genes (e.g. (3, 4)). In all of the following, we assume a one-to-one mapping between objects in the two datasets although this is not always the case.
3.1.1. Correlation
The simplest question we may wish to ask concerns how similar the values for the objects are across the two spaces. For example, if xg is high, is yg also high? An alternative way of asking the same
question is: how much do we know about yg given a value for xg? This is important in a modelling context where one may be interested in the feasibility of using xg in a model as a proxy for the potentially more difficult to measure yg. Computing the correlation between the two sets of variables is one straightforward way to gain some insight into the relationship between them. In the following, we describe two correlation measures that have been used for exactly this problem.
Pearson's correlation coefficient: Pearson's correlation coefficient (see, e.g. (5)) is computed as

rp = ∑g (xg − mx)(yg − my) / ((G − 1) sx sy)
where mx, sx and my, sy are the sample means and standard deviations of x and y, respectively. rp ranges between ±1, with 1 corresponding to a positive correlation (for example, when x is high, y is high) and −1 corresponding to a negative correlation (when x is high, y is low) (see Note 5). When analysing transcriptomic and proteomic data, a high positive correlation would suggest that translation is performed on demand and that there is minimal regulatory control after transcription. A correlation closer to zero would suggest a much more complex regime. A strong negative correlation would be harder to explain but could be the result of negative feedback loops. Figure 1 shows two datasets exhibiting different levels of linear correlation. One benefit of Pearson's correlation is that there exist statistical tests to determine whether an observed value of rp is likely to be due to a real effect or has
Fig. 1. Example synthetic datasets with a linear correlation significant at p = 0.05 (a: ρ = 0.823, p = 1.1e−13) and not significant at p = 0.05 (b: ρ = 0.136, p = 0.173). Each point represents a gene, with mRNA expression on the horizontal axis and protein expression on the vertical axis; the two data types have been normalised to have zero mean and unit standard deviation.
just come about randomly. In particular, one can attempt to reject the null hypothesis that there is no linear correlation (i.e. rp = 0) using a Student's t-test with statistic

t = r √(G − 2) / √(1 − r²)

and G − 2 degrees of freedom. Pearson's correlation coefficient has been used widely for the analysis of transcriptomic and proteomic data – for example (3, 4, 6, 7), with the latter also discussing potential methods for determining how much of the lack of correlation can be ascribed to measurement error. Its popularity is likely due to its simplicity and interpretability – it essentially computes the strength of the linear relationship between the two quantities. This is clearly also a weakness – if we obtain a highly significant value for Pearson's correlation coefficient, we have strong evidence of a linear relationship between the two variables. However, a value close to zero does not imply that there is no relationship, just no linear relationship. Correlation of any type also does not imply causation – if yg is high whenever xg is high, we cannot necessarily conclude that yg is high because xg is high.
Spearman's rank correlation coefficient: One way to overcome the linearity assumption that underlies Pearson's correlation coefficient is to work with the ranks of the two sets rather than their actual values. For example, we can give each value of xg a rank corresponding to its position in an ordered list of all G values. If we do the same for y (remaining within the context of mRNA–protein expression combination), we obtain two ranks for each gene – one for mRNA and one for protein – and subtract one from the other to give a distance dg. This value is then used to compute Spearman's rank correlation coefficient (see e.g. (8))

rs = 1 − 6 ∑g=1…G dg² / (G (G² − 1))
The statistical significance of such correlation can be assessed in exactly the same way as described above. If there are any identical values and hence tied ranks, the process is slightly different and the reader is referred to (8) for more information. Because we are only working with ranks and not values, we do not have to make any assumptions as to the form of the relationship (e.g. that it is linear) and this coefficient provides a high correlation value for any monotonic relationship. As an example, consider the synthetic data shown in Fig. 2. Here, we see a saturation effect: protein expression increases reasonably linearly with mRNA expression until it reaches a maximum quantity after which production saturates.
Fig. 2. Example synthetic dataset showing a nonlinear relationship due to protein production saturation, with mRNA expression on the horizontal axis and protein expression on the vertical axis. The Pearson correlation (ρp = 0.757) is much lower than the Spearman correlation (ρs = 0.999).
There is a clear monotonic relationship here (an increase in mRNA never leads to a decrease in protein) but the relationship is nonlinear. This is reflected in the values of the two different correlation coefficients – Pearson's (rp = 0.783) and Spearman's (rs = 0.999). For an example of the use of Spearman's rank correlation coefficient in the analysis of transcriptomic and proteomic data, see (4). Results from correlation analysis on univariate data have been reasonably consistent thus far. Values of 0.3 < r < 0.5 are not uncommon, suggesting that some linear correlation exists between the two gene products in a variety of organisms. Interestingly, the linear correlation appears to (in general) increase as the magnitude of the expressions increases (see, for example (4)). While this may be an interesting biological phenomenon, it may also be an artefact of the measurement process, whereby objects present in small quantities are harder to measure and more subject to noise. An interesting study (9) uses correlation to decipher the relationship between mRNA and protein expression change. Rather than looking at the correlation between mRNA and protein expression values, they look for correlation between the changes in mRNA and protein expression over two experimental conditions. This is intuitively appealing as it is reasonable to assume that the changes in these quantities are likely to be more related than their absolute values.
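Both coefficients and their significance tests are available in R through cor.test; the following small sketch uses simulated data, with the vectors mrna and prot standing in for real, normalised measurements.

set.seed(1)
G <- 200
mrna <- rnorm(G)                               # standardised mRNA expression
prot <- pmin(mrna + rnorm(G, sd = 0.3), 1)     # saturating protein response
cor.test(mrna, prot, method = "pearson")       # rp and its t-test with G - 2 df
cor.test(mrna, prot, method = "spearman")      # rs based on ranks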
3.2. Multivariate Methods
We now turn our attention to the more general setting of multivariate data. If numerically valued, our objects are now represented as column vectors xg and yg, with dimensions dx and dy (note that a vector representation in one space and a scalar in the other could be considered a special case of this), and often we stack them
(after transposing) together to make a pair of matrices X and Y with dimensions G × dx and G × dy. If not numerical (for example, strings or graphs), we will still represent the general collections of objects as X and Y, although they will not necessarily be matrices. An example of a dataset where both representations are numerical is given by the mRNA and protein expression time series data of (10), and an example where the representations are different is the mRNA expression and sequence data mentioned in Subheading 1 (1).
3.2.1. Projection-Based Approaches: Canonical Correlation Analysis
Data projection techniques are widely used in the analysis of single Omics datasets. The general scheme involves taking our real-valued data X and projecting the objects into a lower dimensional space. Mathematically, this consists of choosing some dz × dx projection matrix W that transforms each xg into a dz-dimensional variable zg: zg = W xg. The most popular approach is principal components analysis (PCA) (see Note 6), in which the objects are projected to a lower dimensional space such that as much of the original variability is maintained as possible. With, for example, microarray data, the objects could be the genes or the arrays, depending on the goal of our analysis – we may be interested in the analysis of genes based on several representations of each gene, or in the analysis of arrays based on several representations of the objects being assayed. One particularly popular use is to project the arrays from G dimensions into two dimensions so that they are easier to visualise. Visualisation is one of the most popular uses of projection techniques, but they are also used extensively as a preprocessing step before, for example, cluster analysis. Canonical correlation analysis (CCA) (11) (see Note 7) is a statistical technique that extends the basic premise behind PCA to multiple datasets. Consider a vector a of length dx and a vector b of length dy. This pair of vectors defines a mapping from x-space and y-space onto the real line. CCA finds the vectors a and b such that the projections are maximally correlated (using Pearson's correlation) with one another. Once the most correlated pair has been found, it then searches for the next best pair that is additionally uncorrelated with the first pair. The end result of this procedure is that we can project both our datasets into the dz dimensional space in which they are maximally (linearly) correlated. Using the transcriptomic and proteomic data described in (10), Fig. 3 shows an example of applying CCA to an Omics dataset. The dataset consists of mRNA and protein expression profiles for 542 genes over six time-points. Therefore, dx = 6, dy = 6, and G = 542. Using the canoncorr function supplied with the Matlab statistical toolbox, we project both datasets into the dz = 5 dimensional space in which they are most correlated. Figure 3a shows the original datasets (rows correspond to genes, columns to time-points) and the projections of each to the shared space. In addition, the scatter
Fig. 3. CCA example on real Omics data. (a) Data in the original mRNA and protein spaces and in the projected space; the scatter plot shows the first projected dimension of each dataset (r = 0.333). (b, c, d) Example genes whose correlation improves dramatically in the projected space. In each plot, the left panel shows the data in the original space (solid line mRNA, dotted line protein) and the right panel shows the data in the projected space. Note that the axes of the projected plots no longer correspond to concrete entities like time and expression.
plot to the right shows the first dimension in each space (each point corresponds to a gene), showing the discovered correlation. Figure 3b, c show two genes whose correlation is much higher in the projected space than in the original space. Choosing the dimensionality of the projected space, dz, is nontrivial. Too few dimensions and we may miss interesting inter-dataset relationships; too many and we dilute the information that is present with unnecessary noise. One measure that we can use when making this choice is the degree of correlation between the datasets in each of the projected dimensions. The correlations for this dataset can be seen in Fig. 4. Based on this, one might argue that one dimension captures most of the shared information, as there is a large drop in correlation between dimensions one and two.
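In R, the analogous computation is available through the cancor function in the stats package; in the sketch below, X and Y stand for hypothetical G × 6 matrices of mRNA and protein time profiles.

cc <- cancor(X, Y)                       # centres both datasets by default
cc$cor                                   # canonical correlation per dimension
Zx <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
Zy <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef
plot(cc$cor, type = "b",
     xlab = "Projection dimension", ylab = "Canonical correlation r")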
Fig. 4. Canonical correlations for the first 5 projection dimensions for the data in Fig. 3. The y-axis shows r, the correlation between the datasets in projected dimensions.
Unfortunately, there is no way around this subjective step and the choice is likely to be highly dataset specific (see Note 8). While CCA is undoubtedly an essential tool for discovering shared trends, it is not immediately clear how one should interpret the results from biological data. Discovering a linear combination of variables in one dataset that is maximally correlated with a linear combination of variables from another does not necessarily have any immediate biological use. Perhaps the most obvious way to exploit the CCA output is to use the projection (rather than the original data) for subsequent analysis. For example, it might be interesting to cluster the mRNA data in the projected space. The rationale behind this is that even though we are only analysing one dataset, there is an influence from the other through the CCA projections – in some sense, we are analysing the mRNA data in the space in which it is most similar to the protein data. One drawback of the CCA method is that we still have two datasets, albeit in the same projected space. A recent paper has investigated the possibility of using projection to transform several datasets into just one (12). They consider the following three-step procedure (illustrated in the sketch after this list):
1. Whiten the data: In this step, any relationships between the variables in the individual datasets are removed. Essentially, each dataset is rotated such that the covariance matrix has no off-diagonal elements.
2. Concatenate the data: The two (or more) whitened datasets are concatenated, resulting in a single G × (dx + dy + …) matrix.
3. Perform PCA: PCA is then performed on the concatenated matrix. Any combinations of variables found with high variance must come from different datasets (because of the whitening in step 1) and so must correspond to shared variability.
The result of this procedure (if we take dz principal components) is a single G × dz matrix that can then be used for further analysis. In the paper, the authors prove that this procedure is exactly equivalent to performing CCA on the original data and then adding the projected datasets together. In Fig. 5, we show two clusters obtained by performing cluster analysis using the data of (10) combined in this way. For each cluster, we also show the same genes in the original spaces. Because of the transformation, the data in the combined space bear no similarity to those in the original spaces, as we would expect. However, it becomes clear from this example that the shared space is capturing some shared variability in the two individual spaces – there is certainly some inter-dataset homogeneity in the clusters shown (for example, in Fig. 5a both mRNA and protein expression seem to be increasing over time).
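A compact R sketch of the three steps, assuming full-rank covariance matrices; X and Y are again hypothetical numeric G × dx and G × dy matrices, and the number of retained components and clusters are arbitrary choices.

whiten <- function(M) {                  # step 1: rotate and rescale so that
  M <- scale(M, scale = FALSE)           # the covariance matrix becomes I
  e <- eigen(cov(M), symmetric = TRUE)
  M %*% e$vectors %*% diag(1 / sqrt(e$values))
}
Z  <- cbind(whiten(X), whiten(Y))        # step 2: concatenate
pc <- prcomp(Z)                          # step 3: PCA on the combined data
shared <- pc$x[, 1:2]                    # a single G x dz matrix (dz = 2)
kmeans(shared, centers = 6)$cluster      # e.g. clustering in the shared space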
Fig. 5. Example of using the approach proposed in (12) to combine two datasets and then performing cluster analysis in the combined space. Shown are two clusters in the combined space as well as the data for these genes in the original mRNA and protein spaces.
The applications we have shown so far for CCA are only applicable when both datasets are real-valued, and we have reason to assume that there may be a linear relationship between them. In many Omics applications, we may be interested in the joint analysis of very different types. Fortunately, there is an extension to classical CCA that allows us to do this. Kernel-CCA (KCCA) uses the popular "kernel trick" to allow different data types to be analysed together.
The kernel trick is typically applied to algorithms (such as CCA) that can be formulated so that the data only appears within inner products (e.g. xgᵀxg′). Consider some arbitrary mapping f(xg) that places our data into a space in which it is more readily analysable (the feature space). We could map each of our data points and then perform our analysis. Alternatively, given that we only require f(xg)ᵀf(xg′), it may be possible to define a function k(xg, xg′) that calculates f(xg)ᵀf(xg′) directly, without having to explicitly perform the mapping. Such a function is known as a kernel function, and kernel functions exist for many diverse data types, for example, real numbers, strings, and graphs – see (13) for a comprehensive list and (14) for many examples of kernel methods within computational biology. In KCCA, rather than working with the data matrix, we work with a G × G kernel matrix, the (g, g′) element of which is k(xg, xg′). As well as enabling us to use diverse data types, KCCA also has the potential to find nonlinear relationships between the two datasets – although KCCA projections are linear in the mapped variables f(xg), whether or not they are linear in the original variables xg depends on the choice of mapping f(⋅). We omit an example here as the methodology is the same as for classical CCA once the data matrices or objects have been converted into kernel matrices. KCCA has appeared in the bioinformatics literature. Two interesting examples are (15), which we consider more thoroughly in the classification section below, and (16), where KCCA is used to combine expression, location, and graph data (see Note 9). As well as being kernelised, CCA has been given a probabilistic makeover. First proposed by (17), this method was extended into a technique called "Local Dependent Components" by (18) and used to combine time series gene expression data with ChIP-chip data. Detailed description is beyond the scope of this chapter, but the reader is referred to (18) for more information. There have also been attempts at designing bespoke projection algorithms for the task of Omics data integration – see (19) for an example.
3.2.2. Predictive Models
The choice of data combination method depends largely on the ultimate data analysis goal. Predictive modelling, particularly classification, is common in bioinformatics. This involves learning a rule or model that is able to classify new data instances as belonging to one of a distinct set of classes. The rule or model is typically learnt from a set of training objects and their associated class labels. In bioinformatics, classification has been used in, for example, disease diagnosis (20) and protein function prediction (21). Combining datasets within a classification framework is a popular Machine Learning research area; approaches can be split
into three categories, depending on the stage at which the data is combined:
1. Early: The data is combined prior to classification using, for example, one of the projection techniques described above.
2. Mid: The data is combined within the learning phase of the classifier.
3. Late: Classifiers are trained on the individual data sources and the predictions made by the individual classifiers are combined.
As the focus of this chapter is combining Omics data, we focus our discussion on the first and second options. For a thorough treatment of classifier combination (late combination), the reader is referred to (22). An interesting example of early combination is given in (15). The aim is to classify protein function based on mRNA expression. Rather than only relying on the expression data, the authors suggest using KCCA to project the expression data into a space in which it is highly correlated with gene network data. The rationale behind this is the reasonable assumption that good features (combinations or functions of input variables) should be made from combinations of genes that have similar mRNA profiles and are close together in a network of genetic interactions. The results show that features generated in this way are significantly better than those produced from mRNA expression alone. An interesting facet of this work is that, by the very nature of the problem, network information is unavailable for the proteins that require classification (i.e. those for which function is unknown). Hence, once the projection has been defined from the ~600 genes for which both representations are available, network information is not explicitly used. Rather, it is implicitly present in the effect it has had on defining the projection for the expression data. More generally, kernel methods provide a highly flexible framework for combining data types in a classification framework (mid combination). As already described in the section covering KCCA, kernel methods do not operate on the original data matrix, but on G × G kernel matrices (where G is the number of objects available for training and testing our predictive algorithm). In (21), the authors highlight a useful facet of kernels: that linear combinations of kernel matrices are themselves kernel matrices. Consider the problem of protein function prediction. For G proteins, we might have a set of mRNA expression values (e.g. a time series), a set of protein expression values and some network information. Using appropriate kernel functions, we use this information to generate three kernel matrices, Ke, Kp, and Kn. The model uses the following G × G composite kernel
K = β1 Ke + β2 Kp + β3 Kn
subject to the constraints βi ≥ 0 and ∑i βi = 1. The learning task then involves learning the standard classifier as well as the kernel weights β. In (21), the authors propose an extension to the standard Support Vector Machine (SVM) that accomplishes this extra learning stage and present highly competitive performance in protein functional class prediction. The approach is not limited to SVMs; it has also been extended to other kernel-based learning methods such as Relevance Vector Machines (23) and Gaussian Processes (24) (see Notes 10 and 11).
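The following kernlab-based R sketch builds such a composite kernel with fixed weights; Xe, Xp and Xn are hypothetical numeric feature matrices for the three views (a genuine network view would require a graph kernel, such as a diffusion kernel), and y holds the class labels. Note that (21) additionally learn the weights inside the SVM optimisation, which requires a dedicated multiple-kernel solver.

library(kernlab)
rbf <- rbfdot(sigma = 0.1)
Ke <- kernelMatrix(rbf, Xe)              # mRNA expression view
Kp <- kernelMatrix(rbf, Xp)              # protein expression view
Kn <- kernelMatrix(rbf, Xn)              # network view (simplified here)
b  <- c(0.5, 0.3, 0.2)                   # fixed weights: b_i >= 0, sum(b) = 1
K  <- b[1] * Ke + b[2] * Kp + b[3] * Kn  # still a valid kernel matrix
fit <- ksvm(as.kernelMatrix(K), y, type = "C-svc")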
3.2.3. Cluster Analysis
Cluster analysis is very widely used for the analysis of single Omics datasets – for example, clustering genes based on their expression profiles to find potentially co-regulated groups (25). In the following, we consider the problem of clustering genes based on two representations (say mRNA expression and protein expression under different conditions or at different time-points), although all of the proposed methods could be just as applicable to the problem of clustering biological samples based on more than one representation of their state. If we are provided with two real-valued representations, the simplest approach to clustering involves simply concatenating the two datasets and then using any one of the many clustering algorithms currently available. The outcome of this approach would be groups of genes with intra-dataset similarity in both datasets. This process is depicted in Fig. 6 using mRNA and
Fig. 6. Example of concatenated clustering. On the top line, the two datasets are combined, rows correspond to genes, columns to time-points. In the bottom line, we show three examples of clusters obtained from cluster analysis of the concatenated data.
protein expression data described in (10). The top line shows the concatenation procedure (rows correspond to genes and columns to time-points) and in the second line we show three example clusters produced by clustering the concatenated data. It is important to distinguish the patterns found here from those found in correlation analysis. Each cluster discovered in the concatenated data includes genes that exhibit similar behaviour to one another in the two individual representations. There is no reason to assume that there is inter-dataset similarity. Therefore, we do not necessarily find groups of genes that exhibit similar correlated behaviour over the two representations. To explore clustering methods further we must move to a higher level of abstraction. Considering that a cluster is just a group of similar genes, it does not matter how many representations we have of the genes, as long as we can define what similar means for each representation and we have a way of combining these similarity measures into a single measure of how close two genes are. If we consider the data in Fig. 6 and use Euclidean distance as our (dis)similarity measure in each space (see Note 12), then we can easily combine these similarity measures by adding the squared distances together and taking the square root. Performing hierarchical clustering using this measure would be identical to performing hierarchical clustering on the concatenated data, and hence we can view concatenation within our general abstraction. A very flexible family of clustering algorithms that could easily fit into this scheme is the family of probabilistic mixture models. A mixture model for a single dataset X assumes that each data point was generated from one of a finite set of probability distributions. Mathematically, this corresponds to assuming that the data is distributed as
p(X | Δ) = ∏g=1…G ∑k=1…K p(k) p(xg | Δk)
where p(k) is the prior probability of a data object belonging to distribution (cluster) k, p(xg | Δk) is the distribution corresponding to cluster k, and Δ is used to represent the parameters required to define the distributions. The clustering task involves assigning data points to distributions and learning the parameters of those distributions, as well as the prior cluster probabilities. Gaussian mixture models (i.e. where p(xg | Δk) is a Gaussian distribution) have been widely used to cluster mRNA expression data. It is straightforward to extend this to multiple datasets. The only change is that we must now model the distribution p(xg, yg | Δk) rather than p(xg | Δk). The simplest way to do this is to assume that, given that gene g is assigned to cluster k, the two representations (xg and yg) are independent
p(xg, yg | Δk) = p(xg | Δkx) p(yg | Δky)
so that we assume the complete dataset is distributed as

p(X, Y | Δ) = ∏g=1…G ∑k=1…K p(k) p(xg | Δkx) p(yg | Δky).
In the example shown in Fig. 6, we used exactly this method to cluster, assuming Gaussian components for each data type. In general, the component distributions do not need to be the same, and for many applications they will not be – for example, if x is real-valued and y discrete. An example of such a model with different components is given in (1), where genes were clustered based on their mRNA expression and sequence information. A cluster was represented as a Gaussian in the mRNA space and a probabilistic motif model (using a position-specific scoring matrix) in the sequence space. This model is depicted graphically in Fig. 7. The flexibility of mixture models makes such an approach very appealing – if a distribution can be defined over the space in question, it is straightforward to cluster. Recently, this joint mixture model clustering paradigm has been extended. In (10), the assumption that genes should reside in the same clusters on each side was tested. The authors propose a more flexible mixture model that has K components for the mRNA side and J components for the protein side, i.e. the data is assumed to be distributed as
p(X, Y | Δ) = ∏g=1…G ∑k=1…K ∑j=1…J p(k, j) p(xg | Δkx) p(yg | Δjy).
Fig. 7. Cartoon of joint clustering method of (1). Expression data is depicted on the left, sequence data on the right. Each expression cluster is paired with exactly one sequence cluster.
The model described above can be thought of as a special case of this more general model where J = K and p(k, j) = δ(j = k)p(k), where δ(j = k) is 1 if j = k and 0 otherwise. If we go to the other extreme and define p(k, j) = p(k)p(j), then we have, in effect, two completely uncoupled cluster models. (10) uses a third decomposition, p(k, j) = p(k)p(j | k). p(j | k) describes the probability of a gene's protein profile residing in cluster j on the protein side if its mRNA profile resides in cluster k on the mRNA side, and its values are learnt in the clustering phase. If the "concatenation" assumption (p(k, j) = δ(j = k)p(k)) is reasonable, these distributions should be dominated by one particular j for each k. The result of the investigation was emphatically that this was not the case, with very complex relationships existing between the two representations. Some strong links were found, corresponding to large protein complexes (e.g. the ribosome), but in general each mRNA cluster was linked to many protein clusters. These results suggest that for this particular organism (human), there are a lot of regulatory forces at play after transcription – a conclusion that would be impossible to draw via the analysis of either of the datasets alone. There have been other attempts to cluster jointly while loosening the concatenation assumption. In (26), the authors present a framework for the joint analysis of several mRNA datasets. They propose fitting mixtures of regression models, where the linking is done through a prior on the regression coefficients that acts as a switch – they can be correlated between datasets or not. As well as producing clusters for each dataset simultaneously, the authors are able to automatically determine groups of genes that behave similarly across all datasets.
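The shared-cluster special case (p(k, j) = δ(j = k)p(k)) with Gaussian components can be sketched in R with the mclust package; using a diagonal covariance model makes the x- and y-blocks conditionally independent given the cluster, as in the factorised equation above. X and Y are hypothetical numeric matrices, and the number of clusters is an arbitrary choice.

library(mclust)
XY  <- cbind(scale(X), scale(Y))             # concatenate the standardised views
fit <- Mclust(XY, G = 6, modelNames = "VVI") # VVI: diagonal, cluster-specific
table(fit$classification)                    # cluster sizes
# fit$parameters$mean holds, per cluster, the means in both data spaces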
3.2.4. Bespoke Probabilistic Models
Probabilistic clustering algorithms are not the only probabilistic models that have been applied in this area. In (27), a regression model is proposed whereby the observed protein level is assumed to be distributed as a Poisson random variable whose mean is a function of the mRNA abundance. Such a model, which explicitly accounts for variability in the mRNA–protein relationship, is potentially able to find subtle relationships that would be missed with a more crude approach. Another method was proposed by (28). Here, the authors make the assumption that both mass spectrometry peptide counts and mRNA levels can be modelled as functions of some latent variable that represents protein expression. In a method similar to (27), the observed peptide counts are assumed to be Poisson distributed with a mean equal to the protein expression. The measured mRNA values are modelled in one of two ways: either they are a function of the protein expression or they come from some baseline distribution. The decision for each particular gene is left in the hands of the model. This means that the model automatically finds genes for which there appears to be a link between mRNA and protein levels (see Note 13).
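As a toy illustration of such a Poisson link (not the exact models of (27) or (28)), consider the following R sketch on simulated data; (27) in fact fit zero-inflated Poisson models, for which pscl::zeroinfl offers one possible implementation.

set.seed(2)
log_mrna <- rnorm(300)                               # hypothetical mRNA levels
counts   <- rpois(300, exp(0.5 + 0.8 * log_mrna))    # simulated peptide counts
fit <- glm(counts ~ log_mrna, family = poisson)      # Poisson regression
summary(fit)$coefficients
# zero-inflated variant, closer in spirit to (27):
# library(pscl); zeroinfl(counts ~ log_mrna, dist = "poisson")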
4. Notes
1. Unfortunately, 1–1 mapping of IDs is not straightforward. For example, some IDs become obsolete and in some data regimes the relationship might be more complex. There is no single way to overcome this problem; further sources for ID matching are provided, e.g., in GeneCards (http://www.genecards.org) or BioMart (http://www.biomart.org).
2. Matlab (http://www.mathworks.com) and R (http://www.r-project.org) are probably the most popular tools for statistical analysis of Omics data, and as they have similar functionality, the choice between them is often personal. Matlab is a commercial product, while R is open source and hence freely available.
3. As these methods are state-of-the-art, they are not normally provided in a particularly user-friendly manner and tend to be unsupported. As such, they are best used by someone familiar with the techniques. In addition, some methods scale rather badly and so may be infeasible on standard computing hardware.
4. Often a degree of data preparation and normalisation is required before analysis can take place. Standardising the variables so that they have the same range is normally sensible. A further problem often found with Omics data is that of missing values. Much research effort has gone into developing algorithms to impute missing values (e.g. http://www.bioconductor.org/packages/release/bioc/html/impute.html). When using such a method, the practitioner should always remain aware of the (normally quite restrictive) assumptions that are being used. Some probabilistic algorithms may be able to handle missing data, but this varies from algorithm to algorithm.
5. Pearson's correlation coefficient can be computed using the function corrcoef in Matlab. This function computes both the correlation and tests for significance.
6. PCA can be performed in Matlab using the princomp function.
7. CCA can be performed in Matlab using the canoncorr function.
8. One exception to this is when CCA is being used as a preprocessing step before the data is used in a predictive model. The number of dimensions can then be optimised with respect to predictive performance.
9. The easiest way to perform KCCA is to compute kernel matrices (see, for example, code available at www.kernel-machines.org) and then feed the kernel matrices into standard CCA code
(see Note 7). When using KCCA, the choice of kernel function is crucial – see (14) and (13) for extensive discussion.
10. From a practical point of view, the main difference between these different kernel classification methods is in the output produced. SVMs and derivatives provide a predicted class, whereas probabilistic methods provide a distribution over output classes. The latter may be more appropriate in applications where the risk of misclassification is unbalanced (i.e. it is better to misdiagnose a patient who is not ill than one who is ill). SVMs also require the tuning of an additional parameter (the margin parameter), adding to the computational load.
11. Multiple-kernel RVM Matlab code is available from http://www.dcs.gla.ac.uk/inference/pMKL/pMKL.html.
12. Choice of distance metric depends on the type of data and any normalisation that has occurred. Euclidean distance is widely used if the data are on the same scale.
13. By their nature, these models are rather specific and unlikely to be directly useful outside their particular application.
References
1. Holmes, I. and Bruno, W. J. (2000) Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB), 8, 202–210.
2. Bussey, K., Kane, D., Sunshine, M., Narasimhan, S., Nishizuka, S., Reinhold, W., Zeeberg, B., Ajay, W., and Weinstein, J. (2003) Matchminer: a tool for batch navigation among gene and gene product identifiers. Genome Biol, 4, 4.
3. Gygi, S. P., Rochon, Y., Franza, B. R., and Aebersold, R. (1999) Correlation between protein and mRNA abundance in yeast. Mol Cell Biol, 19(3), 1720–1730.
4. Schmidt, M. W., Houseman, A., Ivanov, A. R., and Wolf, D. A. (2007) Comparative proteomic and transcriptomic profiling of the fission yeast Schizosaccharomyces pombe. Mol Syst Biol, 3, 79.
5. Meyer, P. (1978) Introductory probability and statistical applications. Addison-Wesley, 2nd edition.
6. Cox, B., Kislinger, T., and Emili, A. (2005) Integrating gene and protein expression data: pattern analysis and profile mining. Methods, 35(3), 303–314.
7. Nie, L., Wu, G., Culley, D. E., Scholten, J. C. M., and Zhang, W. (2007) Integrative analysis of transcriptomic and proteomic data: challenges, solutions and applications. Crit Rev Biotechnol, 27(2), 63–75.
8. Gibbons, J. D. (1971) Nonparametric statistical inference. McGraw-Hill.
9. Griffin, T. J., Gygi, S. P., Ideker, T., Rist, B., Eng, J., Hood, L., and Aebersold, R. (2002) Complementary profiling of gene expression at the transcriptome and proteome levels in Saccharomyces cerevisiae. Mol Cell Proteomics, 1(4), 323–333.
10. Rogers, S., Girolami, M., Kolch, W., Waters, K. M., Liu, T., Thrall, B., and Wiley, H. S. (2008) Investigating the correspondence between transcriptomic and proteomic expression profiles using coupled cluster models. Bioinformatics, 24(24), 2894–2900.
11. Hotelling, H. (1936) Relations between two sets of variates. Biometrika, 28(3–4), 321–377.
12. Tripathi, A., Klami, A., and Kaski, S. (2008) Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics, 9, 111.
13. Shawe-Taylor, J. and Cristianini, N. (2004) Kernel methods for pattern analysis. Cambridge University Press.
14. Schölkopf, B., Tsuda, K., and Vert, J.-P., editors (2004) Kernel methods in computational biology. MIT Press.
15. Vert, J.-P. and Kanehisa, M. (2003) Graph-driven feature extraction from microarray data using diffusion kernels and kernel CCA. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15. MIT Press.
16. Yamanishi, Y., Vert, J.-P., and Kanehisa, M. (2004) Heterogenous data comparison and gene selection with kernel canonical correlation analysis. In Schölkopf, B., Tsuda, K., and Vert, J.-P., editors, Kernel methods in computational biology. MIT Press.
17. Bach, F. and Jordan, M. (2005) A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.
18. Klami, A. and Kaski, S. (2007) Local dependent components. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 425–432, New York, NY, USA.
19. Fagan, A., Culhane, A. C., and Higgins, D. G. (2007) A multivariate analysis approach to the integration of proteomic and gene expression data. Proteomics, 7(13), 2162–2171.
20. Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M., and Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10), 906–914.
21. Lanckriet, G., Bie, T. D., Cristianini, N., Jordan, M., and Stafford Noble, W. (2004) A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626–2635.
22. Kuncheva, L. (2004) Combining pattern classifiers: methods and algorithms. Wiley.
23. Girolami, M. and Rogers, S. (2005) Hierarchic Bayesian models for kernel learning. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 241–248, New York, NY, USA.
24. Girolami, M. and Zhong, M. (2007) Data integration for classification problems employing Gaussian process priors. In 20th annual conference on Neural Information Processing Systems – NIPS 2006. MIT Press.
25. Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95(25), 14863–14868.
26. Heard, N. A., Holmes, C. C., Stephens, D. A., Hand, D. J., and Dimopoulos, G. (2005) Bayesian coclustering of Anopheles gene expression time series: study of immune defense response to multiple experimental challenges. Proc Natl Acad Sci USA, 102(47), 16939–16944.
27. Nie, L., Wu, G., Brockman, F. J., and Zhang, W. (2006) Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated Poisson regression models to predict abundance of undetected proteins. Bioinformatics, 22(13), 1641–1647.
28. Kannan, A., Emili, A., and Frey, B. (2007) A Bayesian model that links microarray mRNA measurements to mass spectrometry protein measurements. Research in Computational Molecular Biology, pages 325–338.
Chapter 7

Analysis of Time Course Omics Datasets

Martin G. Grigorov

Abstract

Over the past 20 years, Omics technologies have emerged as the consensual denomination of holistic molecular profiling. These techniques enable parallel measurements of biological -omes, or “all constituents considered collectively”, and draw on the latest advances in transcriptomics, proteomics, metabolomics, imaging, and bioinformatics. Technological accomplishments in increasing the sensitivity and throughput of analytical devices, the standardization of protocols, and the widespread availability of reagents have made the capture of static molecular portraits of biological systems a routine task. The next generation of time course molecular profiling already allows extensive molecular snapshots to be taken along the trajectory of time evolution of the investigated biological systems. Such datasets provide the basis for the application of the inverse scientific approach, which consists of inferring scientific hypotheses and theories about the structure and dynamics of the investigated biological system without any a priori knowledge, relying solely on data analysis to unveil the underlying patterns. However, most temporal Omics data still contain a limited number of time points, taken over arbitrary time intervals, through measurements on biological processes shifted in time. The analysis of the resulting short and noisy time series datasets is a challenge. Traditional statistical methods for the study of static Omics datasets are of limited relevance, and new methods are required. This chapter discusses algorithms that enable the application of the inverse analysis approach to short Omics time series.

Key words: Omics, Molecular profiling, Time-course Omics experiments, Time series, Data analysis, Data mining
1. Introduction

A time series is a sequence of data points resulting from measurements recorded at successive moments, typically equally spaced in time. The interest in time series recording and analysis is based on the assumption that an investigation of the recorded data can yield information about the internal structure of the system which generated it (1–3). A successful Omics experiment investigating the dynamics of a biological process requires first a carefully devised experimental design. Then, the processing of the recorded time series is carried out in two main stages. First, a learning phase is aimed at preprocessing the data and at determining some basic statistical properties, such as the molecular constituents differentially expressed in time, and the occurrence in their dynamic behavior of trends, cycles, or other information-rich structures. This lays the basis for the true understanding of the processes which generated the time series. At the second stage, that of interpretation and understanding, mathematical models are developed to explain the observed dynamical processes. If these models are realistic, they should generate time series similar in their statistical structure to the investigated ones.
2. Materials

2.1. Experimental Design
An appropriate experimental design is the key to the success of any scientific investigation. In the case of time-course Omics experiments, the few replicates available, the uneven sampling, the loss of synchronization, and the phase shift within the studied biological population make classical statistical approaches inadequate. They may yield misleading conclusions when trying to identify functional relationships among transcripts, proteins, and metabolites. For specific points to be considered when designing and analyzing a time course Omics experiment, see Notes 1–3.
2.1.1. Replicates
Methods operating with no or few replicates have recently been reported in the literature. In order to identify genes with significant temporal variation, Billups et al. (4) proposed an algorithm based on a temporal test statistic exploiting the degree to which data are smoothed when fit by a spline function. The method does not require replicates to draw statistically significant results. In the case when only few replicates are available, but similar studies have already been conducted, Sun et al. (5) proposed a meta-analysis methodology to combine existing data, improving the statistical power for detecting differentially expressed genes. More recently, Han et al. (6) proposed the Partial Energy ratio for Microarray (PEM) method to remedy the lack of a sufficient number of replicates. The sole assumption of the PEM method is that gene expression varies smoothly in the temporal domain. A new statistic is then computed by comparing the energies of two convoluted profiles. The method was found to outperform previous approaches such as SAM (7) or the spline-based EDGE (8) developed for identifying differentially expressed genes.
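The spirit of such spline-based statistics is easy to illustrate. The following minimal sketch (Python/SciPy; the function name and the choice of statistic are illustrative and not the published Billups et al. test) contrasts how well a single expression profile is explained by a smooth curve versus a flat fit:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smoothness_statistic(t, y, s=1.0):
    """Score temporal variation of one expression profile: profiles that
    a smoothing spline explains much better than a constant fit are
    candidates for significant temporal variation."""
    spline = UnivariateSpline(t, y, s=s)        # smooth fit over time
    rss_spline = np.sum((y - spline(t)) ** 2)   # residuals around the spline
    rss_flat = np.sum((y - y.mean()) ** 2)      # residuals around a flat fit
    return rss_flat / (rss_spline + 1e-12)      # large ratio = temporal signal

t = np.linspace(0, 10, 12)                      # 12 time points
y = np.sin(t) + 0.2 * np.random.default_rng(0).normal(size=t.size)
print(smoothness_statistic(t, y))
```

In practice, the significance of such a score is assessed against a null distribution, for example one obtained by permuting the time points.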
2.1.2. Synchronization
A limitation of most of the Omics analytical methods currently in use resides in their inability to perform measurements at the single cell level. Instead, samples are most commonly composed of a large number of cells of similar type which undergo similar biological processes, but shifted in phase. The acquired data limit the possibility to analyze the dynamic behavior of cells directly at the pathway level. One solution to this problem is to arrest cell development and to restart it for all cells at the same time. Synchronization has received much attention, and Shedden and Cooper (9) used a Fourier analysis algorithm to test the synchronization achieved by different arrest methods. The method proposed by the authors considers how many genes are best explained by a periodic curve and how many are best explained by an aperiodic curve. The actual synchronization achieved by an arrest method can be estimated by randomization tests comparing the two sets of genes.

2.1.3. Sampling Rate
The central issue of an experimental design specific to time course Omics experiments is the determination of the sampling rate. If a cyclical process is observed for a duration shorter than one full cycle, this will almost certainly impair the inference of realistic causal relationships. Undersampling can lead to temporal aggregation effects, as shown by Bay et al. (10), when each successive sampling point represents the sum of all signal changes since the previous sample. On the other hand, oversampling is expensive and time consuming. An adequate sampling rate should be derived from the characteristic frequencies of the observed molecular processes. The commonly used Fast Fourier Transform algorithm is applicable only when data are evenly spaced and no values are missing (see Note 4). A number of more sophisticated methods to uncover such characteristic frequencies have been proposed and applied by numerous researchers, the most important ones being summarized in Table 1. Dequeant et al. (11) described the combined use of a number of these algorithms for periodic pattern recognition and confirmed the superiority of such a jury-like, consensus-based approach. Wu et al. (12) were probably the first to devise an objective procedure for determining the adequate number of observations and the interval between them for a time course Omics experiment.
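For unevenly spaced series, the Lomb–Scargle periodogram listed in Table 1 is straightforward to apply. A minimal sketch (Python/SciPy; the sampling times and the 24 h rhythm are hypothetical data):

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 48, size=15))      # 15 unevenly spaced time points (h)
y = np.cos(2 * np.pi * t / 24) + 0.3 * rng.normal(size=t.size)  # 24 h rhythm
y = y - y.mean()                               # Lomb-Scargle assumes zero mean

periods = np.linspace(6, 48, 200)              # candidate periods (h)
freqs = 2 * np.pi / periods                    # angular frequencies
power = lombscargle(t, y, freqs, normalize=True)
print(periods[np.argmax(power)])               # should be close to 24
```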
2.2. Data Preprocessing
In this section, data preprocessing techniques are reviewed within the specific context of time course Omics experiments. These complement standard preprocessing procedures for the transformation of static Omics data, such as background correction for transcriptomics data, normalization to adjust for technical variability, and spectral binning, peak detection, and peak selection for NMR or MS spectra acquired in proteomics or metabolomics experiments.
2.2.1. Missing Values
Missing value estimation is an important preprocessing step in Omics time series analysis.
Table 1
Algorithms to detect significant periodic modes in time course Omics datasets

Cyclohedron (Morton, J., et al. (13)): A rank test inspired by recent advances in algebraic combinatorics. Robust to measurement errors, it can be used to ascertain the significance of top-ranked genes.

Average periodogram (Wichert, S., et al. (14)): Graphical assessment and exact statistical test to identify periodically expressed genes.

Shape (Luan, Y., et al. (15)): Shape-invariant model together with a false discovery rate (FDR) procedure for identifying periodically expressed genes.

C&G (Chen, J. (16)): Statistical hypothesis testing methods for identifying periodic components in time series.

Permutation (Ptitsyn, A. A., et al. (17)): Test based on a random permutation of time points in order to estimate the nonrandomness of a periodogram. The permuted time Pt-test is able to detect oscillations within noisy expression profiles.

Lomb–Scargle (Glynn, E. F., et al. (18)): The Lomb–Scargle periodogram provides a method to detect periodic genes and to treat missing values and unevenly spaced time points.

Modified Lomb–Scargle (Liew, A. W.-C., et al. (19)): The new method is based on signal reconstruction in a shift-invariant signal space, where a direct spectral estimation procedure is developed using the B-spline basis.

Wavelets (Klevecz, R. R., et al. (20)): Allows for detection of oscillations in time series transcriptomics experiments with Saccharomyces cerevisiae.
Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with a limited number of samples, high missing rates, or very noisy measurements. Hu et al. (21) proposed an integrative Missing Value Estimation Method (iMISS, http://zhoulab.usc.edu/iMISS) which incorporates information from multiple reference microarray datasets. Validation studies suggested that iMISS can significantly and consistently improve the accuracy of imputation. In a similar approach, Jornsten et al. (22) explored the feasibility of imputation across transcriptomics experiments, using the vast amount of publicly available microarray data. More recently, Brock et al. (23) reported a benchmark study of the most efficient imputation methods in use in transcriptomics studies. The authors found that several imputation algorithms, namely least squares adaptive, local least squares (LLS, http://www.ii.uib.no/~trondb/imputation), and Bayesian principal component analysis, are all highly competitive with each other, and that no method is uniformly superior in all of the data sets examined.
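The local least squares idea is compact enough to sketch. The following illustration (Python/NumPy; the function name and neighborhood size are illustrative, not the published LLS implementation) regresses each incomplete profile on its nearest complete neighbors:

```python
import numpy as np

def lls_impute(X, k=5):
    """Impute missing values (NaN) in a genes x time-points matrix using a
    local least squares scheme: each incomplete row is regressed on its k
    nearest complete rows. Assumes every incomplete row still has some
    observed values and that at least k rows are complete."""
    X = X.astype(float)
    complete = ~np.isnan(X).any(axis=1)
    ref = X[complete]                          # candidate predictor rows
    out = X.copy()
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])                  # observed time points of row i
        # rank complete rows by distance on the observed columns only
        d = np.linalg.norm(ref[:, obs] - X[i, obs], axis=1)
        nb = ref[np.argsort(d)[:k]]            # the k nearest neighbours
        # least squares fit on observed columns, then predict the missing ones
        coef, *_ = np.linalg.lstsq(nb[:, obs].T, X[i, obs], rcond=None)
        out[i, ~obs] = coef @ nb[:, ~obs]
    return out
```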
2.2.2. Regularization of Profiles

Another aspect of the design of time-course Omics experiments resides in the determination of the temporal ordering of gene expression under different experimental factors, such as heat shock or oxidative stress. Yoneya et al. (24) developed a Hidden Markov Model-based algorithm to compare time-series Omics data with timing differences caused by differing experimental factors. A second aspect of comparing heterogeneous time profiles is that functional data often exhibit a common shape, but with variations in amplitude and phase across profiles. Telesca et al. (25) proposed a method to align transcriptomics profiles, and Fischer et al. (26) reported an algorithm for processing proteomics data sets.
2.2.3. Quality Control
Determining the quality of the molecular profiles resulting from time course Omics studies remains an important problem. Simon et al. (27) proposed an objective criterion comparing the structure of a time series with averages over static expression experiments in similar biological contexts. For each gene, the authors determined whether its temporal expression profile can be reconciled with its static expression levels. Experimental validation has shown the utility of this approach for determining the accuracy of each gene expression pattern. Proper reporting enhances the quality control process for a time course Omics experiment (see Note 5).
3. Methods

3.1. Data Structures
Once preprocessed, the resulting data are collated in matrix data structures. In the following discussion such a matrix will be referred to as the data matrix X. It typically contains a number of columns of the order of thousands, much larger than the number of its rows. The columns of the data matrix contain the relative expressions, abundances, quantities, or concentrations of the different molecular species (genes, proteins, and metabolites) under investigation, and the rows represent the conditions obtained by sampling the biological system states at every point of time t = 1, …, N. Each matrix element x_ik of the data matrix reports this quantity for the i-th gene, protein, or metabolite at the k-th time point, 1 ≤ k ≤ N. An important derivative of the data matrix is the covariance matrix of order k, with matrix elements {cov(x_i(t), x_j(t − k)): t = 1, …, N; 1 ≤ k ≤ N}, or the generalized covariance matrix when a summation over the index k is performed, {Σ_k cov(x_i(t), x_j(t − k)): t = 1, …, N}. These matrices encode the similarities of the dynamical behavior of the i-th and j-th cellular constituents and are important for the reconstruction of causal relationships and for the inference of molecular networks. The related autocorrelation and generalized autocorrelation matrices can be derived by scaling the matrix elements of the autocovariance matrices by the respective standard deviations. In the literature, other data structures have been reported, related to specific molecular constituents. For example, it was proposed that the time series {x_i(t): t = 1, …, N}, characterizing the dynamical behavior of the i-th gene, protein, or metabolite, can be embedded into a vector space of dimension M. The M lagged copies {x_i(t − k): k = 1, …, M}, produced by sliding a window of dimension M over the original time series {x_i(t): t = 1, …, N}, then form a matrix of dimension (N − M + 1) × M which is referred to as the trajectory matrix.
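Both structures are easy to build in practice; a minimal sketch (Python/NumPy, with illustrative function names):

```python
import numpy as np

def lagged_covariance(X, k):
    """Order-k covariance between all pairs of profiles, cov(x_i(t), x_j(t - k)),
    estimated over the overlapping time points of a genes x time-points matrix X."""
    A, B = X[:, k:], X[:, :-k]                 # x_i(t) aligned with x_j(t - k)
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    return A @ B.T / (A.shape[1] - 1)

def trajectory_matrix(x, M):
    """Embed a length-N univariate series into the (N - M + 1) x M trajectory
    matrix by sliding a window of width M over it."""
    x = np.asarray(x, dtype=float)
    return np.array([x[i:i + M] for i in range(len(x) - M + 1)])
```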
3.2. Learning

Time series analysis of dynamical datasets can be performed by univariate or multivariate statistical methods, depending on whether a single variable or multiple variables are investigated concurrently. Omics experiments investigate the effects of perturbations on biological systems induced by physical, chemical, pharmacological, or nutritional factors over time. Unsupervised statistical learning methods mine the data and extract relevant information without any a priori information. In contrast, some statistical learning methods were specifically designed to take into account the phenotypic effects of such perturbations, when these effects are either known or measurable. These methods, referred to as supervised statistical learning, enable the addition of external information during the data processing and learning phases.
3.2.1. Univariate Statistical Analysis
The term “univariate time series” refers to a time series consisting of single scalar measurements recorded sequentially over equal time intervals. Because in most cases data points are equispaced on the time axis, such a time series is represented as a single column of numbers, with time taken as an implicit variable. In the case of a time-course Omics dataset stored in a data matrix, the methods available for the processing of univariate time series are typically applied column by column. Very often the analysis is restricted to a set of columns containing data about genes, proteins, or metabolites of specific biological significance within a predefined context. Such sets are represented by genes, proteins, or metabolites spatially grouped in a tissue or functionally clustered in biological networks (28).
3.2.1.1. Unsupervised Univariate Methods
Unsupervised univariate analysis methods are based on standard statistical tests, and a substantial number of them have been described in the literature in recent years. The most important univariate statistical methods to identify differentially expressed genes in time-course Omics datasets are summarized in Table 2.
Table 2
Statistical methods to identify differentially expressed genes in time course Omics datasets

ANOVA-SCA (Nueda, M. J., et al. (33)): Direct generalization of analysis of variance (ANOVA) for univariate data to the multivariate case, improved by dimension reduction to the G-test.

Hotelling-T2 (Vinciotti, V., et al. (34)): New statistical test.

Regression (Xu, X. L., et al. (35)): Least squares is used to estimate a robust Z-statistic under a set of well-defined assumptions. A permutation is used to estimate the number of false positives, providing a measure of statistical significance.

Regression Bayes (Yu, K., et al. (36)): Bayesian median regression model, based on polynomial functions, to detect genes whose temporal profile is significantly different across a number of biological conditions.

MARD (Cheng, C., et al. (37)): Treatment and control time courses are converted into neighborhood networks. Differentially expressed genes are then determined by comparing gene relationship networks.

BATS (Angelini, C., et al. (38)): Software package addressing the challenge of identifying differentially expressed genes from time course Omics experiments.

Compression (Ahnert, S. E., et al. (39)): The test estimates the algorithmic compressibility of a time series. If compressible, the time series is more likely to result from simple underlying mechanisms than series which are incompressible.
Recently, Fischer et al. (29) compared methods for identifying differentially expressed genes on simulated time-series microarray data from artificial gene networks. In another comparative study, Di Camillo et al. (30) evaluated the performance of three selection methods, using synthetic data, over a range of experimental conditions. The ranking of differentially expressed genes based on their statistical significance is important to guide the efforts of biologists in prioritizing leads for follow-up studies (31, 32).
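Many of these unsupervised tests share a common skeleton: compute a statistic sensitive to temporal structure and calibrate it by permutation of the time labels. A generic illustration (Python/NumPy; the lag-1 autocorrelation statistic is chosen only for simplicity and is not any specific published test):

```python
import numpy as np

def permutation_pvalue(y, n_perm=1000, seed=0):
    """Permutation test for temporal structure in one expression profile:
    compare the observed lag-1 autocorrelation with its null distribution
    under random reshuffling of the time points."""
    rng = np.random.default_rng(seed)
    def lag1(v):
        return np.corrcoef(v[:-1], v[1:])[0, 1]
    obs = lag1(y)
    null = np.array([lag1(rng.permutation(y)) for _ in range(n_perm)])
    return np.mean(np.abs(null) >= np.abs(obs))   # two-sided empirical p-value
```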
3.2.1.2. Supervised Univariate Methods

Supervised univariate learning algorithms are mathematical methods aimed at transforming univariate time series into biological hypotheses under the guidance of a known phenotypic response. A typical supervised learning algorithm is therefore designed to identify a model that will correctly associate an input with a target response, either measured or recorded in a biological ontology.
For example, Hvidsten et al. developed the supervised learning algorithm Rosetta (http://rosetta.sourceforge.net), based on rough set theory, which is able to generate hypotheses on the involvement of single, yet uncharacterized, genes in known biological processes (40). The model could predict the participation of single genes in biological processes even though the genes of this class exhibit a wide variety of gene expression profiles.

3.2.2. Multivariate Statistical Analysis
The datasets stemming from global Omics studies are high-dimensional multivariate arrays, and commonly contain up to tens of thousands of variables, which are often noisy and interdependent, if not collinear. Multivariate data analysis is essential in the process of extracting information from these complex data sets, and has previously been shown to be valuable for the analysis and visualization of biological and chemical data. In this section, current multivariate statistical methods are discussed, together with how they are used in the analysis of time series Omics experiments.
3.2.2.1. Unsupervised Multivariate Methods
Biological systems are characterized by multiple degrees of freedom, and the datasets obtained by molecular profiling of such systems are inherently multivariate. One variable is rarely sufficient to describe on its own the effect of a change in such systems. In most cases, the response occurs as simultaneous changes in a restricted number of variables. These sets of characteristic variables are not directly measurable and are often referred to as latent or hidden variables. The existence of a restricted set of latent variables allows a biological system to be described and studied in a mathematical space of much lower dimension than the rank of the original data matrix X. However, specific challenges exist in the processing of time course Omics measurements, where thousands of molecular constituents are profiled simultaneously across few time points. In such datasets, many genes, proteins, or metabolites might display similar expression patterns just by random chance. Furthermore, usually few, if any, full time series replicates are available to gain statistical power.
3.2.2.2. Clustering
Clustering analysis provides the means for finding observations or variables that have similar properties in a data set, and can be useful for both visualization and interpretation of data. Time-course Omics experiments produce vectors of gene expression, protein abundance, or metabolite concentration profiles across a series of time points. Clustering algorithms utilize similarity metrics defined over such profiles to define distances between observations at different time points and to discover functionally related and co-regulated genes. An explicit use of the causality condition should allow for a more sensitive detection of such genes. Most of the available clustering methods are listed in Table 3.
Table 3
Clustering algorithms for time course Omics datasets

ORIOGEN (order-restricted inference; Peddada, S. D., et al. (42)): ORIOGEN is a user-friendly Java-based statistical software package for selecting and clustering genes according to their profiles. Available at: http://dir.niehs.nih.gov/dirbb/oriogen/index.cfm

TIME-WARP (random time-warping; Liu, X., and Mueller, H.-G. (43)): A nonparametric time-synchronized iterative mean updating technique to find an overall mode representation of a sample of expression profiles, viewed as a random sample in function space.

SPLINES (smoothing splines; (44)): Spline smoothing and first derivative computation are combined with hierarchical and partitioning clustering. A heuristic approach is proposed to tune the spline smoothing parameter using both statistical and biological considerations.

STEM (profile match; (45)): The algorithm operates by assigning genes to a predefined set of model profiles; significant profiles are retained for further analysis and are combined to form clusters. Available at: http://www.cs.cmu.edu/~jernst/stem

WAVELET (multiresolution analysis; Song, J. Z., et al. (46)): The algorithm transforms the data obtained under different growth conditions to permit comparison of expression patterns from experiments that have time shifts or delays.

MBBC (Bayesian inference; Joo, Y., et al. (47)): The Bayesian product partition model simultaneously searches for the optimal number of clusters and assigns cluster memberships based on temporal changes of gene expression. Available at: http://www.stat.ufl.edu/~casella/mbbc

DIFF (symbolic dynamics; Kim, J., et al. (48)): The algorithm defines a pattern as a sequence of symbols indicating the direction and rate of change between time points, and each gene is assigned to a cluster whose members share a similar pattern. Available at: http://www.biomedcentral.com/content/pdf/1471-2105-8-253.pdf

N/A (Bayesian inference; Wang, L., et al. (49)): The algorithm uses polynomial models to describe the gene expression patterns over time, with an iterative procedure to identify genes that have a common temporal expression profile. Available at: http://www.biomedcentral.com/content/pdf/1471-2105-9-147.pdf

TIMECLUST (symbolic dynamics; Magni, P., et al. (50)): TimeClust is a user-friendly software package to cluster genes according to their temporal expression profiles. It implements two original algorithms specifically designed for clustering short time series, together with HCA and SOM.

QUADR (quadratic regression; Liu, T., et al. (51)): The algorithm identifies differentially expressed genes and classifies genes based on their temporal expression profiles for noncyclic short time-course microarray data. Available at: http://www.biomedcentral.com/content/pdf/1471-2105-10-146.pdf

RANK-CLUST (rank-based clustering; Yi, S.-G., et al. (52)): The method performs rank-based clustering after initial discretization of the expression data into groups. The testing procedure uses bootstrap samples to select the genes that show similar patterns.
Datta et al. (41) evaluated several of these algorithms on real time course microarray data and concluded that no single clustering algorithm may be best suited for clustering genes into functional groups. The authors proposed a validation measure to guide the selection of an optimal algorithm.
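A typical profile-clustering workflow couples a similarity metric over time profiles with an agglomerative procedure; a minimal sketch (Python/SciPy, with simulated data and an arbitrary cluster count):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 100 hypothetical short time courses

# correlation distance (1 - Pearson r) as the profile similarity metric
d = pdist(X, metric='correlation')
Z = linkage(d, method='average')       # hierarchical agglomerative clustering
labels = fcluster(Z, t=10, criterion='maxclust')  # cut the tree into 10 clusters
```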
3.2.2.3. Dimension Reduction

The most commonly used linear dimension reduction method is principal component analysis (PCA). The method is based on the minimization of the statistical dependence among the variables in a multivariate dataset up to the second statistical order. Numerically, PCA consists in the eigenvalue decomposition of the covariance matrix of the data matrix X into the product of three matrices: the loadings matrix, the eigenvalues matrix, and the scores matrix. The PCA loadings correspond to the eigenvectors of the covariance matrix of the data matrix X. The PCA scores represent the projections of the data matrix X onto the corresponding loading vectors, and can be interpreted as the latent variables describing the main variation in the dataset. Recently, several authors adapted the original PCA algorithm specifically for the purpose of analyzing time course Omics experiments (53, 54). Analyzing time-course Omics data in PCA- or SVD-truncated subspaces, which approximate the original dataset by filtering out noise and irrelevant patterns, was found to significantly improve the performance of clustering algorithms (55–57) and the possibility to integrate heterogeneous datasets (58). An alternative linear approach for dimension reduction is independent component analysis, aimed at separating and recovering signals from several different observed linear mixtures. In the context of microarray data, “sources” may correspond to specific cellular responses or to co-regulated genes (59, 60). Probably the best known nonlinear approach for dimensionality reduction of multivariate datasets is multidimensional scaling (MDS). The procedure is referred to as metric when Euclidean or other topological distance functions are applied to quantify the similarity among the different time profiles, or, respectively, non-metric, when more general functions are used, such as the Pearson correlation coefficient or the Kullback–Leibler mutual information (61). Other nonlinear dimension reduction methods have been proposed in the literature, such as multivariate curve resolution (62), correspondence analysis (63), Fourier harmonic analysis (64), and multidimensional unfolding analysis (65).
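In code, the PCA decomposition described above reduces to a singular value decomposition of the centered data matrix; a minimal sketch (Python/NumPy, on simulated data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 500))         # time points x genes data matrix

Xc = X - X.mean(axis=0)                # column-center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt.T                        # eigenvectors of the covariance matrix
scores = U * s                         # projections of X onto the loadings
var_explained = s**2 / np.sum(s**2)    # proportion of variance per component
```

Truncating the expansion to the first few components yields the noise-filtered subspaces discussed above.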
3.2.2.4. Connectionist Methods

Neural networks have been used to extract knowledge from various datasets. Architectures specifically designed for processing noisy, high-dimensional time course Omics data include regularized neural networks (66) and hierarchical Bayesian neural networks (67). Such architectures were found to outperform other classification techniques such as Nearest Neighbor, Support Vector Machines, and Kohonen Self-Organizing Maps.
3.2.2.5. Supervised Multivariate Methods
Multivariate partial least squares projection to latent structures (PLS) is the supervised method of choice, widely used to date to model complex biological events. In a recent study, Palermo et al. investigated the influence of different factors on the performance of the method (68). This method and its derivatives, such as OSC, OPLS, and OPLS-DA (69), have been extensively used in time-course, longitudinal metabolomics experiments investigating the response of biological systems to toxicants (70–72). At present, time course metabolomics studies are still quite rare due to long technological times, of the order of weeks of machine time, for setting up such experiments. Software platforms to process time course metabolomics data are starting to be developed, such as MetaboAnalyst (http://www.metaboanalyst.ca) (73). Time course proteomics studies are even rarer at present, and few examples have been described in the literature, most of them investigating the trajectories of time evolution of biological systems under stress conditions or perturbed by toxicants (74).

3.3. Understanding

3.3.1. Causality
The central dogma of the flow of information in biological systems, from genes to proteins to metabolites, imposes a strict condition of causality at the molecular level, controlled at the very beginning by the activation of transcription factors. Recently, several methods capable of extracting information about the transcriptional regulation of gene expression from microarray time series have been reported in the literature (75, 76). Wiener (77) and later Granger (78) proposed the intuitive concept of causality between two variables, now referred to as Granger causality, which is based on the idea that an effect never occurs before its cause. Later, Geweke generalized the concept to multivariate Granger causality (79). Although Granger causality is not the “effective causality” of Aristotle, this concept is useful to infer directionality and information flow in the observed data (80). Granger causality is usually identified by vector autoregressive models due to their simplicity (81–83). Software for identifying transcriptional modules from time series is available, such as TRANS-MNET (http://daweb.ism.ac.jp/~yoshidar/software/ssm) (84).
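A pairwise Granger test is readily run with standard time series tooling; a minimal sketch (Python with statsmodels, on simulated profiles where x drives y with a one-step lag):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)                         # putative cause (e.g., a TF profile)
y = np.empty(n)
y[0] = rng.normal()
y[1:] = x[:-1] + 0.5 * rng.normal(size=n - 1)  # y follows x with a one-step lag

# column order is [effect, putative cause]; the F-tests ask whether past
# values of x improve the autoregressive prediction of y
res = grangercausalitytests(np.column_stack([y, x]), maxlag=2)
```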
3.3.2. Reconstruction of Molecular Networks
The analysis of time series derived from time course Omics experiments allows for the reconstruction of the functional relationships between the molecular parts of complex biological systems. The assumption made when establishing a relationship between two molecular constituents is that if they are functionally related, they will exhibit similar time profiles (85). A number of measures of similarity have been described in the literature (86–90). Most of the currently available algorithms for reverse engineering of molecular networks are summarized in Table 4.
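The simplest embodiment of this assumption is a relevance network: connect two species whenever their time profiles are sufficiently correlated. A minimal sketch (Python/NumPy; the threshold value is arbitrary):

```python
import numpy as np

def coexpression_network(X, threshold=0.9):
    """Naive relevance-network reconstruction: connect two molecular species
    if the absolute Pearson correlation of their time profiles exceeds a
    threshold. X is a genes x time-points matrix; returns a boolean adjacency."""
    C = np.corrcoef(X)                 # genes x genes correlation matrix
    A = np.abs(C) >= threshold
    np.fill_diagonal(A, False)         # no self-edges
    return A
```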
3.3.3. Dynamics
Early evidence for a characteristic sensitivity of biological systems to initial conditions, and for a fractal geometry of the data points in phase space, was discovered by analyzing physiological electrocardiogram (ECG) and electroencephalogram (EEG) time series. These findings made possible the use of several measures of low-dimensional chaos to predict patients at imminent risk of lethal ventricular fibrillation (98). More recent findings suggest that shifts from low-dimensional dynamics to higher-dimensional (noise) or to few-dimensional (periodic) regimes are a sign of an underlying pathology (99).
Table 4
Algorithms for inference of networks from time course Omics data

DBN (Kim, S. Y., et al. (91)): Dynamic Bayesian Network for analysis of time course microarray experiments with perturbations.

QUANTIZATION (Di Camillo, B., et al. (92)): A quantization method, based on a model of the experimental error and on a significance level, able to compromise between false-positive and false-negative classifications; used with two discrete reverse engineering methods, Reveal and Dynamic Bayesian Networks.

PARE (Chuang, C.-L., et al. (93)): Pattern recognition method (PARE) that utilizes a nonlinear score to identify subclasses of gene pairs with different time lags for inferring gene functional relationships. Available at: http://www.stat.sinica.edu.tw/~gshieh/pare.htm

TdGRN (Jiang, W., et al. (94)): Reverse engineers the dynamic mechanisms of gene regulation by identifying time-delayed gene regulations through supervised decision-tree analysis of the time-delayed gene expression matrix.

GENENETWORK (Wu, C.-C., et al. (95)): An interactive tool which provides four reverse engineering models and three data interpolation approaches to infer relationships between genes.

SARGE (Shaw, O. J., et al. (96)): An interactive tool for creating, visualizing, and manipulating a putative genetic network from time series microarray data. Available at: http://www.bioinformatics.cs.ncl.ac.uk/sarge/index.html

BioCichlid (Ishiwata, R. R., et al. (97)): BioCichlid is a 3D visualization system of time-course microarray data on molecular networks for interpretation of gene expression data. Available at: http://newton.tmd.ac.jp/
Attempts were made to explain the observed chaotic dynamics of the physiological parameters at a finer, molecular scale. An important result providing the link between the macro- and micro-dynamics within a biological system was obtained by Peng and associates, who discovered the fractal character of the exon structure of DNA (100). This irregular structure certainly affects transcriptional events. For example, recent findings indicated that the concentration in the cell nucleus of the major transcription factor NF-kB converges to a strange attractor under periodic stimuli (101). In this perspective, the reports about the discovery of underlying patterns
or “characteristic modes” in time course microarray expression data projected into phase spaces after singular value decomposition came as another indication of the molecular source of biological chaos (102–104). These findings open new avenues for time course Omics research.
4. Notes

1. Scaling. It is a major challenge to extract relevant biological information from large heterogeneous time series Omics datasets. Such datasets often contain concentrations of molecular species covering several orders of magnitude. These differences in concentration do not reflect the actual biological relevance of the molecular species, and data analysis methods are not able to make the distinction. Among many data pretreatment methods, scaling can correct for aspects that hinder the biological interpretation of time series Omics datasets (105).

2. Stationarity. The statistical properties of stationary time series, such as mean, variance, and autocorrelation, are constant over time. Biological time series are rarely strictly stationary, as they are generated by underlying non-equilibrium dissipative dynamical processes. An important step in the analysis of Omics time series consists in a check for stationarity and, if necessary, the application of data transformations to “stationarize” the data. The reason to seek stationarity is that the derived mathematical models are robust and allow for extrapolations over model parameters, which is important, e.g., for sensitivity analysis. In most cases, a non-stationary time series will exhibit a trend which is consistently increasing over time. In such a case, it may be possible to “stationarize” the series by fitting a trend line and subtracting it from the original time series. This procedure is called detrending, and the resulting time series is referred to as trend-stationary. If detrending is not sufficient to ensure stationarity, it might be necessary to transform the series into a difference-stationary one by taking the differences between successive sampling time points (see the sketch after these notes).

3. Surrogate data. The availability of powerful computational devices, and the implementation of robust random number generators on them, transformed experimental statistics by allowing the generation of series of random datasets with predefined statistical properties. Surrogate data is an ensemble of data sets similar to an observed time series, but consistent with a predefined null hypothesis to be tested. The use of such surrogate datasets allows complex inference problems to be translated into formal statistical problems of hypothesis testing. The method of surrogate data is a form of bootstrapping which was first introduced by Theiler et al. (106) for distinguishing between chaotic time series and colored noise.

4. Cyclical patterns. Two complementary approaches to time series analysis are associated with the time domain and the spectral domain. Analysis in the time domain relies on the computation of classical statistical properties, such as the mean, the variance, and the lag autocorrelation function. The Bochner–Khinchin–Wiener theorem states that the latter function is the Fourier transform of the spectral density, or power spectrum, of a time series, and provides the connection between the time domain and the spectral domain (107). The spectral-domain approach is motivated by the observation that the most regular, and hence predictable, behavior of a time series resides in its cyclical patterns. This approach then proceeds to determine the periodic components embedded in the time series by computing the associated periods, amplitudes, and phases from its power spectrum.

5. Reporting. The ultimate goal of Omics studies is the discovery of molecular species which would be indicative of the evolution of a biological system into pathological deviations. These species are referred to as biomarkers. In the past, a number of Omics studies suffered from the lack of standards and the low quality of reporting of the experimental results. Recently, such standards were developed by the scientific community, described in the literature, and endorsed by regulatory agencies. Accurate documentation and reporting have improved research quality and facilitated the translation of biomarkers into clinical practice (108).
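The detrending and differencing transformations of Note 2 are one-liners in practice; a minimal sketch (Python/NumPy, on a hypothetical trended series):

```python
import numpy as np

t = np.arange(24, dtype=float)
rng = np.random.default_rng(3)
x = 0.3 * t + np.sin(t) + 0.1 * rng.normal(size=t.size)  # series with a trend

# trend-stationary: fit a linear trend and subtract it (detrending)
a, b = np.polyfit(t, x, deg=1)
x_detrended = x - (a * t + b)

# difference-stationary: take first differences between sampling points
x_diff = np.diff(x)
```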
References

1. Bar-Joseph, Z. (2004) Analyzing time-series gene expression data. Bioinformatics 20(16), 2493–503.
2. Androulakis, I. P., Yang, E., and Almon, R. R. (2007) Analysis of time-series gene expression data: methods, challenges, and opportunities. Annu Rev Biomed Eng 9(3), 1–24.
3. Wang, X., Wu, M., Li, Z., and Chan, C. (2008) Short time-series microarray analysis: methods and challenges. BMC Sys Biol 2, 58–64.
4. Billups, S. C., Neville, M. C., Rudolph, M., Porter, W., and Schedin, P. (2009) Identifying significant temporal variation in time course microarray data without replicates. BMC Bioinformatics 10, 96.
5. Sun, R., Fu, X., Guo, F., Ma, Z., Goulbourne, C., Jiang, M., Li, Y., Xie, Y., and Mao, Y. (2009) A strategy for meta-analysis of short time series microarray datasets. Front Biosci 14, 4058–70.
6. Han, X., Sung, W.-K., and Feng, L. (2006) PEM: a general statistical approach for identifying differentially expressed genes in time-course cDNA microarray experiment without replicate. Ser Adv Bioinform Comp Biol 4, 123–32.
7. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Nat Acad Sci USA 98(9), 5116–21.
8. Leek, J. T., Monsen, E., Dabney, A. R., and Storey, J. D. (2006) EDGE: extraction and analysis of differential gene expression. Bioinformatics 22(4), 507–8.
9. Shedden, K., and Cooper, S. (2002) Analysis of cell-cycle gene expression in Saccharomyces cerevisiae using microarrays and multiple synchronization methods. Nucleic Acids Res 30, 2920–29.
10. Bay, S. D., Chrisman, L., Pohorille, A., and Shrager, J. (2004) Temporal aggregation bias and inference of causal regulatory networks. J Comp Biol 11(5), 971–85.
11. Dequeant, M.-L., Ahnert, S., Edelsbrunner, H., Fink, T. M. A., Glynn, E. F., Hattem, G., Kudlicki, A., Mileyko, Y., Morton, J., Mushegian, A. R., Pachter, L., Rowicka, M., Shiu, A., Sturmfels, B., and Pourquie, O. (2008) Comparison of pattern detection methods in microarray time series of the segmentation clock. PLoS One 3, 8–12.
12. Wu, F.-X., Zhang, W. J., and Kusalik, A. J. (2006) Determination of the minimum number of microarray experiments for discovery of gene expression patterns. BMC Bioinformatics 7(Suppl. 4), S4–13.
13. Morton, J., Pachter, L., Shiu, A., and Sturmfels, B. (2007) The cyclohedron test for finding periodic genes in time course expression studies. Stat Appl Genet Mol Biol 6, 1–10.
14. Wichert, S., Fokianos, K., and Strimmer, K. (2004) Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 20(1), 5–20.
15. Luan, Y., and Li, H. (2004) Model-based methods for identifying periodically expressed genes based on time course microarray gene expression data. Bioinformatics 20(3), 332–9.
16. Chen, J. (2005) Identification of significant periodic genes in microarray gene expression data. BMC Bioinformatics 6, 286.
17. Ptitsyn, A. A., Zvonic, S., and Gimble, J. M. (2006) Permutation test for periodicity in short time series data. BMC Bioinformatics 7(Suppl. 2), S2–10.
18. Glynn, E. F., Chen, J., and Mushegian, A. R. (2006) Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics 22(3), 310–16.
19. Liew, A. W.-C., Xian, J., Wu, S., Smith, D., and Yan, H. (2007) Spectral estimation in unevenly sampled space of periodically expressed microarray time series data. BMC Bioinformatics 8, 137.
20. Klevecz, R. R., Li, C. M., and Bolen, J. L. (2007) Signal processing and the design of microarray time-series experiments. Methods Mol Biol 377, 75–94.
21. Hu, J., Li, H., Waterman, M. S., and Zhou, X. J. (2006) Integrative missing value estimation for microarray data. BMC Bioinformatics 7, 449.
22. Jornsten, R., Ouyang, M., and Wang, H.-Y. (2007) A meta-data based method for DNA microarray imputation. BMC Bioinformatics 8, 109.
23. Brock, G. N., Shaffer, J. R., Blakesley, R. E., Lotz, M. J., and Tseng, G. C. (2008) Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes. BMC Bioinformatics 9, 9–12.
24. Yoneya, T., and Mamitsuka, H. (2007) A hidden Markov model-based approach for identifying timing differences in gene expression under different experimental factors. Bioinformatics 23(7), 842–9.
25. Telesca, D., and Inoue, L. Y. T. (2008) Bayesian hierarchical curve registration. J Am Stat Assoc 103(481), 328–39.
26. Fischer, B., Roth, V., and Buhmann, J. M. (2007) Time-series alignment by non-negative multiple generalized canonical correlation analysis. BMC Bioinformatics 8(Suppl. 10), S10–14.
27. Simon, I., Siegfried, Z., Ernst, J., and Bar-Joseph, Z. (2005) Combined static and dynamic analysis for determining the quality of time-series expression profiles. Nat Biotechnol 23(12), 1503–8.
28. Filkov, V., Skiena, S., and Zhi, J. (2002) Analysis techniques for microarray time-series data. J Comput Biol 9(2), 317–30.
29. Fischer, E. A., Friedman, M. A., and Markey, M. K. (2007) Empirical comparison of tests for differential expression on time-series microarray experiments. Genomics 89(4), 460–70.
30. Di Camillo, B., Toffolo, G., Nair, S. K., Greenlund, L. J., and Cobelli, C. (2007) Significance analysis of microarray transcript levels in time series experiments. BMC Bioinformatics 8(Suppl. 1), S1–10.
31. Xu, R., and Li, X. (2003) A comparison of parametric versus permutation methods with applications to general and temporal microarray gene expression data. Bioinformatics 19(10), 1284–9.
32. Tai, Y. C., and Speed, T. P. (2009) On gene ranking using replicated microarray time course data. Biometrics 65(1), 40–51.
33. Nueda, M. J., Conesa, A., Westerhuis, J. A., Hoefsloot, H. C. J., Smilde, A. K., Talon, M., and Ferrer, A. (2007) Discovering gene expression patterns in time course microarray experiments by ANOVA-SCA. Bioinformatics 23(14), 1792–800.
34. Vinciotti, V., Liu, X., Turk, R., de Meijer, E. J., and 't Hoen, P. A. C. (2006) Exploiting the full power of temporal gene expression profiling through a new statistical test: application to the analysis of muscular dystrophy data. BMC Bioinformatics 7, 183.
35. Xu, X. L., Olson, J. M., and Zhao, L. P. (2002) A regression-based method to identify differentially expressed genes in microarray time course studies and its application in an inducible Huntington's disease transgenic model. Hum Mol Genet 11(17), 1977–85.
36. Yu, K., Vinciotti, V., Liu, X., and 't Hoen, P. A. C. (2007) Bayesian median regression for temporal gene expression data. AIP Conference Proceedings 940 (CompLife 2007), 60–70.
37. Cheng, C., Ma, X., Yan, X., Sun, F., and Li, L. M. (2006) MARD: a new method to detect differential gene expression in treatment-control time courses. Bioinformatics 22(21), 2650–7.
38. Angelini, C., Cutillo, L., De Canditiis, D., Mutarelli, M., and Pensky, M. (2008) BATS: a Bayesian user-friendly software for Analyzing Time Series microarray experiments. BMC Bioinformatics 9, 415.
39. Ahnert, S. E., Willbrand, K., Brown, F. C. S., and Fink, T. M. A. (2006) Unbiased pattern detection in microarray data series. Bioinformatics 22(12), 1471–6.
40. Hvidsten, T. R., Laegreid, A., and Komorowski, J. (2003) Learning rule-based models of biological process from gene expression time profiles using gene ontology. Bioinformatics 19(9), 1116–23.
41. Datta, S., and Datta, S. (2006) Evaluation of clustering algorithms for gene expression data. BMC Bioinformatics 7(Suppl. 4), S4–17.
42. Peddada, S., Harris, S., Zajd, J., and Harvey, E. (2005) ORIOGEN: order restricted inference for ordered gene expression data. Bioinformatics 21(20), 3933–4.
43. Liu, X., and Mueller, H.-G. (2003) Modes and clustering for time-warped gene expression profile data. Bioinformatics 19(15), 1937–44.
44. Luan, Y., and Li, H. (2003) Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19(4), 474–82.
45. Dejean, S., Martin, P. G. P., Baccini, A., and Besse, P. (2007) Clustering time-series gene expression data using smoothing spline derivatives. EURASIP J Bioinform Syst Biol 70561.
46. Song, J. Z., Duan, K. M., Ware, T., and Surette, M. (2007) The wavelet-based cluster analysis for temporal gene expression data. EURASIP J Bioinform Syst Biol 39382.
47. Joo, Y., Booth, J. G., Namkoong, Y., and Casella, G. (2008) Model-based Bayesian clustering (MBBC). Bioinformatics 24(6), 874–5.
48. Kim, J., and Kim, J. H. (2007) Difference-based clustering of short time-course microarray data with replicates. BMC Bioinformatics 8, 253–8.
49. Wang, L., Montano, M., Rarick, M., and Sebastiani, P. (2008) Conditional clustering of temporal expression profiles. BMC Bioinformatics 9, 147.
50. Magni, P., Ferrazzi, F., Sacchi, L., and Bellazzi, R. (2008) TimeClust: a clustering tool for gene expression time series. Bioinformatics 24(3), 430–2.
51. Liu, T., Lin, N., Shi, N., and Zhang, B. (2009) Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments. BMC Bioinformatics 10, 146.
52. Yi, S.-G., Joo, Y.-J., and Park, T. (2009) Rank-based clustering analysis for the time-course microarray data. J Bioinform Comput Biol 7(1), 75–91.
53. Jonnalagadda, S., and Srinivasan, R. (2008) Principal components analysis based methodology to identify differentially expressed genes in time-course microarray data. BMC Bioinformatics 9, 267.
54. Nueda, M. J., Sebastian, P., Tarazona, S., Garcia-Garcia, F., Dopazo, J., Ferrer, A., and Conesa, A. (2009) Functional assessment of time course microarray data. BMC Bioinformatics 10(Suppl. 6), S6–9.
55. Horn, D., and Axel, I. (2003) Novel clustering algorithm for microarray expression data in a truncated SVD space. Bioinformatics 19(9), 1110–15.
56. Kim, H. Y., Kim, M. J., Han, J. I., Kim, B. K., Lee, Y. S., Lee, Y. S., and Kim, J. H. (2009) Searching the principal genes for neural differentiation of mouse ES cells by factorizing eigengenes of clusters. BioSystems 95(1), 17–25.
57. Ghosh, D. (2002) Resampling methods for variance estimation of singular value decomposition analyses from microarray experiments. Funct Integr Genomics 2(3), 92–7.
58. Omberg, L., Golub, G. H., and Alter, O. (2007) A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies. Proc Natl Acad Sci USA 104(47), 18371–6.
59. Frigyesi, A., Veerla, S., Lindgren, D., and Hoeglund, M. (2006) Independent component analysis reveals new and biologically significant structures in microarray data. BMC Bioinformatics 7, 290.
60. Chiappetta, P., Roubaud, M. C., and Torresani, B. (2004) Blind source separation and the analysis of microarray data. J Comput Biol 11(6), 1090–109.
61. Taguchi, Y.-H., and Oono, Y. (2005) Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics 21(6), 730–40.
62. Wentzell, P. D., Karakach, T. K., Roy, S., Martinez, M. J., Allen, C. P., and Werner-Washburne, M. (2006) Multivariate curve resolution of time course microarray data. BMC Bioinformatics 7, 343.
63. Tan, Q., Brusgaard, K., Kruse, T. A., Oakeley, E., Hemmings, B., Beck-Nielsen, H., Hansen, L., and Gaster, M. (2004) Correspondence analysis of microarray time-course data in case-control design. J Biomed Inform 37(5), 358–65.
64. Zhang, L., Zhang, A., and Ramanathan, M. (2003) Fourier harmonic approach for visualizing temporal patterns of gene expression data. Proc. 2nd IEEE Bioinf. Conf., IEEE Computer Society (Los Alamitos, CA, USA), 137–47.
65. Van Deun, K., Marchal, K., Heiser, W. J., Engelen, K., and Van Mechelen, I. (2007) Joint mapping of genes and conditions via multidimensional unfolding analysis. BMC Bioinformatics 8, 181.
66. Liang, Y., and Kelemen, A. (2005) Temporal gene expression classification with regularised neural network. Int J Bioinform Res Appl 1(4), 399–413.
67. Liang, Y., and Kelemen, A. G. (2004) Hierarchical Bayesian Neural Network for gene expression temporal patterns. Stat App Genet Mol Biol 3(1), 20.
68. Palermo, G., Piraino, P., and Zucht, H.-D. (2009) Performance of PLS regression coefficients in selecting variables for each response of a multivariate PLS for omics-type data. Adv Appl Bioinform Chem 2, 57–70.
69. Svensson, O., Kourti, T., and MacGregor, J. F. (2002) An investigation of orthogonal signal correction algorithms and their characteristics. J Chem 16(4), 176–88.
70. Keun, H. C., Ebbels, T. M. D., Bollard, M. E., Beckonert, O., Antti, H., Holmes, E., Lindon, J. C., and Nicholson, J. K. (2004) Geometric trajectory analysis of metabolic responses to toxicity can define treatment specific profiles. Chem Res Toxicol 17(5), 579–87.
71. Bohus, E., Coen, M., Keun, H. C., Ebbels, T. M. D., Beckonert, O., Lindon, J. C., Holmes, E., Noszal, B., and Nicholson, J. K. (2008) Temporal metabonomic modeling of l-arginine-induced exocrine pancreatitis. J Proteome Res 7(10), 4435–45.
72. Yap, I. K., Clayton, T. A., Tang, H., Everett, J. R., Hanton, G., Provost, J.-P., Le Net, J.-L., Charuel, C., Lindon, J. C., and Nicholson, J. K. (2006) An integrated metabonomic approach to describe temporal metabolic disregulation induced in the rat by the model hepatotoxin allyl formate. J Proteome Res 5(10), 2675–84.
73. Xia, J., Psychogios, N., Young, N., and Wishart, D. S. (2009) MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res 37, W652–60.
74. Mintz, M., Vanderver, A., Brown, K. J., Lin, J., Wang, Z., Kaneski, C., Schiffmann, R., Nagaraju, K., Hoffman, E. P., and Hathout, Y. (2008) Time series proteome profiling to study endoplasmic reticulum stress response. J Proteome Res 7(6), 2435–44.
75. Vu, T. T., and Vohradsky, J. (2007) Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucleic Acids Res 35(1), 279–87.
76. Wang, L., Chen, G., and Li, H. (2007) Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics 23(12), 1486–94.
77. Wiener, N. (1956) The theory of prediction. In E. F. Beckenbach, Ed., Modern mathematics for engineers (McGraw-Hill, New York, USA).
78. Granger, C. W. J. (1969) Investigating causal relationships by econometric models and cross-spectral methods. Econometrica 37, 424–38.
79. Geweke, J. (1984) Measures of conditional linear dependence and feedback between time series. J Am Stat Assoc 79, 907–15.
80. Magwene, P. M., Lizardi, P., and Kim, J. (2003) Reconstructing the temporal ordering of biological samples using microarray data. Bioinformatics 19(7), 842–50.
81. Mukhopadhyay, N. D., and Chatterjee, S. (2007) Causality and pathway search in microarray time series experiment. Bioinformatics 23(4), 442–9.
82. Opgen-Rhein, R., and Strimmer, K. (2007) Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics 8(Suppl. 2), S2–3.
83. Fujita, A., Sato, J. R., Garay-Malpartida, H. M., Yamaguchi, R., Miyano, S., Sogayar, M. C., and Ferreira, C. E. (2007) Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Systems Biology 1, 39.
84. Hirose, O., Yoshida, R., Imoto, S., Yamaguchi, R., Higuchi, T., Charnock-Jones, D. S., Print, C., and Miyano, S. (2008) Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics 24(7), 932–42.
85. Rajaram, S. (2009) A novel meta-analysis method exploiting consistency of high-throughput experiments. Bioinformatics 25(5), 636–42.
86. Butte, A. J., Bao, L., Reis, B. Y., Watkins, T. W., and Kohane, I. S. (2001) Comparing the similarity of time-series gene expression using signal processing metrics. J Biomed Inform 34(6), 396–405.
87. Lindloef, A., and Lubovac, Z. (2005) Simulations of simple artificial genetic networks reveal features in the use of Relevance Networks. In Silico Biol 5(3), 239–49.
88. Wei, H., and Kaznessis, Y. (2005) Inferring gene regulatory relationships by combining target-target pattern recognition and regulator-specific motif examination. Biotech Bioeng 89(1), 53–77.
89. Soranzo, N., Bianconi, G., and Altafini, C. (2007) Comparing association network algorithms for reverse engineering of large-scale gene regulatory networks: synthetic versus real data. Bioinformatics 23(13), 1640–47.
90. Yeung, L. K., Szeto, L. K., Liew, A. W.-C., and Yan, H. (2004) Dominant spectral component analysis for transcriptional regulations using microarray time-series data. Bioinformatics 20(5), 742–9.
91. Kim, S. Y., Imoto, S., and Miyano, S. (2003) Inferring gene networks from time series microarray data using dynamic Bayesian networks. Brief Bioinform 4(3), 228–35.
92. Di Camillo, B., Sanchez-Cabo, F., Toffolo, G., Nair, S. K., Trajanoski, Z., and Cobelli, C. (2005) A quantization method based on threshold optimization for microarray short time series. BMC Bioinformatics 6(Suppl. 4), S4–11.
93. Chuang, C.-L., Jen, C.-H., Chen, C.-M., and Shieh, G. S. (2008) A pattern recognition approach to infer time-lagged genetic interactions. Bioinformatics 24(9), 1183–90.
94. Jiang, W., Li, X., Guo, Z., Li, C., Wang, L., and Rao, S. (2006) A novel model-free approach for reconstruction of time-delayed gene regulatory networks. Sci China C Life Sci 49(2), 190–200.
95. Wu, C.-C., Huang, H.-C., Juan, H.-F., and Chen, S.-T. (2004) GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics 20(18), 3691–93.
96. Shaw, O. J., Harwood, C., Steggles, L. J., and Wipat, A. (2004) SARGE: a tool for creation of putative genetic networks. Bioinformatics 20(18), 3638–40.
97. Ishiwata, R. R., Morioka, M. S., Ogishima, S., and Tanaka, H. (2009) BioCichlid: central dogma-based 3D visualization system of time-course microarray data on hierarchical biological network. Bioinformatics 25(4), 543–44.
98. Skinner, E. J. (1994) Low-dimensional chaos in biological systems. Nat Biotechnol 12, 596–600.
99. Peng, C. K., Buldyrev, S. V., Hausdorff, J. M., Havlin, S., Mietus, J. E., Simons, M., Stanley, H. E., and Goldberger, A. L. (1994) Non-equilibrium dynamics as an indispensable characteristic of a healthy biological system. Integr Physiol Behavioral Sci 29, 283–293.
100. Peng, C. K., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Sciortino, F., Simons, M., and Stanley, H. E. (1992) Long-range correlations in nucleotide sequences. Nature 356, 168–170.
101. Fonslet, J., Rud-Petersen, K., Krishna, S., and Jensen, M. H. (2007) Pulses and chaos: dynamical response in a simple genetic oscillator. Int J Mod Phys B 21(23–24), 4083–90.
102. Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., and Fedoroff, N. V. (2000) Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA 97(15), 8409–14.
103. Rifkin, S. A., and Kim, J. (2002) Geometry of gene expression dynamics. Bioinformatics 18(9), 1176–83.
104. Grigorov, M. G. (2006) Global dynamics of biological systems from time-resolved omics experiments. Bioinformatics 22(12), 1424–30.
105. van den Berg, R. A., Hoefsloot, H. C. J., Westerhuis, J. A., Smilde, A. K., and van der Werf, M. J. (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142–7.
106. Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., and Farmer, J. D. (1992) Testing for nonlinearity in time series: the method of surrogate data. Physica D 58, 77–94.
107. Ghil, M., Allen, R. M., Dettinger, M. D., Ide, K., Kondrashov, D., Mann, M. E., Robertson, A. W., Saunders, A., Tian, Y., Varadi, P., and Yiou, P. (2002) Advanced spectral methods for climatic time series. Rev Geophys 40(1), 3.1–3.41.
108. Azuaje, F., Devaux, Y., and Wagner, D. (2009) Challenges and standards in reporting diagnostic and prognostic biomarker studies. Clin Transl Sci 2(2), 156–61.
Chapter 8

The Use and Abuse of -Omes

Sonja J. Prohaska and Peter F. Stadler

Abstract

The diverse fields of Omics research share a common logical structure combining a cataloging effort for a particular class of molecules or interactions, the underlying -ome, and a quantitative aspect attempting to record spatiotemporal patterns of concentration, expression, or variation. Consequently, these fields also share a common set of difficulties and limitations. In spite of the great success stories of Omics projects over the last decade, much remains to be understood not only at the technological but also at the conceptual level. Here, we focus on the dark corners of Omics research, where the problems, limitations, conceptual difficulties, and lack of knowledge are hidden.

Key words: Omics, Systems biology, Data integration, Annotation, Assumptions, Limitations
1. Introduction

In cellular and molecular biology, the suffix -ome refers to "all constituents considered collectively." The first -ome of this kind was defined in the early twentieth century: the term "genome" combined the "gene" with the "chromosome" (i.e., "colored body," from the Greek "chromo," "color," and "soma," "body"), which was known to carry the genes. Over the last couple of decades, bioinformaticians and molecular biologists have come to use -ome widely, adding the suffix to all sorts of biological concepts from epigenes to transcripts, to refer to large data sets produced in one shot by high-throughput technologies. An entertaining commentary on the history of the Omics words is provided by Lederberg et al. (1). The emerging research field of digesting the resulting pile of data was readily labeled as Omics, in analogy to the already time-honored fields of genomics, transcriptomics, and proteomics. The Gerstein Lab (http://www.bioinfo.mbb.yale.edu) keeps a list of the most popular -omes on their Web page. We have updated their citation data and added several other -omes and Omics that have made it into PubMed in Table 1.
Table 1
Usage of -ome and Omics terms in the scientific literature. PubMed was queried on Fri Jan 29 2010 for "*ome or *omes" and "*omics" for each of the terms below. The distribution of -ome and Omics terms follows a power law; only a handful of top-ranking terms are commonly used

Term            Since   -ome[s] entries   Omics entries
Genome          1943    189,019           55,750
Proteome        1995     15,756           23,343
Transcriptome   1997      6,022              778
Metabolome      1998        950            1,686
Interactome     1999        578               43
Epigenome       1987        375              189
Secretome       2000        333                8
Peptidome       2001        160              158
Phenome         1989        141              102
Glycome         2000        120              479
Lipidome        2001         64              279
Orfeome         2000         63                1
Degradome       2003         53               22
Cellome         2002         32               68
Fluxome         1999         25               21
Regulome        2004         19                2
Variome         2006         16                –
Toponome        2003         13                7
Transportome    2004          8                –
Modificome      2006          6                3
Translatome     2001          6                2
Localizome      2002          6                –
Ribonome        2002          4               10
RNome           2005          4               54
Morphome        1996          3                1
Recombinome     2006          3                –
Signalome       2001          2                –
Expressome      2007          2                –
Foldome         2009          1                1

[Inset plot: log2(PubMed entries) versus log2(rank) for the -ome[s] counts, with a linear fit of slope = −3.32 ± 0.07.]
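The slope quoted in the inset can be recomputed from the -ome[s] column of Table 1. The following minimal sketch is not part of the original analysis; it simply fits an ordinary least-squares line to the log2-transformed rank-frequency data with numpy. The recomputed value will be close to, but need not exactly match, the quoted −3.32 ± 0.07, depending on whether the original fit used all 29 terms:

```python
import numpy as np

# -ome[s] PubMed hit counts from Table 1, already sorted by rank
counts = np.array([189019, 15756, 6022, 950, 578, 375, 333, 160, 141,
                   120, 64, 63, 53, 32, 25, 19, 16, 13, 8, 6, 6, 6,
                   4, 4, 3, 3, 2, 2, 1])
ranks = np.arange(1, len(counts) + 1)

# Slope of log2(count) vs. log2(rank) estimates the exponent of the
# rank-frequency power law discussed in the table caption.
slope, intercept = np.polyfit(np.log2(ranks), np.log2(counts), 1)
print(f"power-law exponent ~ {slope:.2f}")
```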
The high-throughput Omics approaches have profoundly changed Molecular Biology over the past decade, and many exciting success stories have been highlighted in top-ranking journals. Here, however, we want to focus on the flip side of the coin: the limitations and shortcomings of the current practice of "Omics biology." Being closest to the authors' own work, we use transcriptomics as a paradigmatic example; as we shall see in the next section, however, the issues are generic.

Transcriptomics aims at cataloging the transcriptome by listing all transcripts. Transcripts of what? The transcriptome of an individual organism, say a mouse, is the set of transcripts that can be found in all of the organism's cells taken together. In contrast to genomes, transcriptomes vary between cell types, between developmental states, and in response to environmental stimuli. As a consequence, one may say that a mouse has more than one transcriptome. Distinct individual transcriptomes can be derived from different samples, and such a transcriptome needs to be interpreted in the context of the cells sampled (Fig. 1). Emerging single-cell methods are expected to show variation even between individual cells of the same cell type. Some of these differences may still be functional, subdividing the cell types further into subtypes, while another part of the variance constitutes neutral stochastic variation.

The power of high-throughput techniques to measure many constituents concurrently generally comes hand in hand with a lack of selectivity, sensitivity, and precision. The difficulty of the task is to keep similar but distinct objects separate and to accurately measure, in parallel, concentrations that vary by several orders of magnitude. Low-abundance transcripts are either not detected at all, or their signal cannot be separated from the background noise inherent in the measurement technology. Quite inconveniently, however, rarity of a functional element is not synonymous with negligibility.
Fig. 1. Transcriptomics is the attempt to list all transcripts and to quantify their relative or absolute abundances in a sample. The interpretability of the results depends (1) on the sample that was taken, for example, from a whole mouse: given samples from different mice, developmental stages, and/or tissues, the transcriptomes can be expected to differ. In addition, (2) individual cells can contribute further variation (see "I" and "S"). (3) The completeness of the list is strongly dependent on the measurement technique (in the figure, only "one-stroke characters" are analyzed); transcripts that are not recognized (see "E" and "X") are systematically missed. Several orders of magnitude separate the most and least abundant transcripts and complicate quantification. The measurement technique is also a source of imprecision, making it difficult to distinguish true variation from uncertainty.
In Fig. 1, the functional elements "X" and "E" are rare, but they are essential to form words of length six or more (e.g., "MEXICO" and "MEIOSIS").

As if the technical issues were not enough, the biological concept of the entities under investigation may be inadequate and introduce unintended biases. Early transcriptomics, for example, was performed on poly-T primed cDNA samples, biasing the sample toward mRNA-like transcripts while overlooking even the most highly abundant noncoding RNAs, as well as the mRNAs of histones. A fundamental question in Omics-style approaches should be: How complete is the catalog really? And in all honesty we have to answer, even for the best understood biological systems, "we don't really know."

Long before the rise of high-throughput techniques, we learned to address the limited precision of measurements. Repetition of experiments, statistical methods to evaluate significance, and comparative control experiments help us handle unavoidable imprecision and identify relevant signals. The combination of poorly understood biological variation, poorly understood technological biases and artifacts (from library preparation to hybridization on chips or sequencing chemistry), and the high costs that in practice limit the number of replicates challenges the available statistical methods and limits the accuracy and confidence of the results. It is still fair to say that there is no high-throughput method that can accurately measure absolute or relative concentrations of transcripts. Indeed, it has become standard operating procedure to validate measurements by means of classical "gold standards," such as Northern blots (to demonstrate the existence of an RNA) or quantitative PCR (to measure concentrations).

The mass of Omics data available today has a profound impact on the way we think about and organize large-scale data in the Life Sciences. Moving away from a molecular toward a systems view of cells or larger functional entities, we are expected to handle single -omes with ease. Computational and statistical methods are therefore quickly becoming ubiquitous and indispensable in experimental laboratory environments. Data analysis and interpretation focus on the comparison of related experimental settings as well as the large-scale integration of different Omics on the same or related settings, a task that we see at the very heart of Systems Biology. We argue below that data organization is not just a technical detail. On the contrary, it has a profound impact on the biological interpretation whether transcriptomics data are stored, interpreted, and utilized as expression profiles of DNA intervals (defined by a probe on an array or a piece of sequence), entire transcripts (ESTs), exons (using exon arrays), or entire genes (as defined by various GeneChips).

The eventual goal of all the Omics technologies is to measure each of the parts in full spatiotemporal resolution for a complete collection of environmental conditions. Would it not be great, say,
to have a complete list of all transcripts with exact abundances for every single cell in Mr. Spurlock (http://www.youtube.com/watch?v=I1Lkyb6SU5U) at 1-s intervals, starting from the first bite of his Super Sized fast food meal, instead of aggregating data of entire organs in a before/after comparison (2)? Technical limitations most likely prevent such a SciFi-style experiment, in particular if we stipulate that it should be performed in a way that is nondestructive for Mr. Spurlock. We shall see in the following, however, that technical limitations are not the only obstacles. Current Omics experimentation can also be restricted by problematic experimental design, flawed concepts, and sloppy analysis.

Clearly, the measurement of, say, a spatiotemporally resolved transcriptome is not an ultimate goal in itself. Eventually, the data are to be utilized to infer knowledge of biochemical mechanisms and biological processes, to diagnose an illness, or to assist a therapeutic regimen. Obviously, therefore, Omics is eventually about function on a large scale. Just as obviously, a transcript does not reveal its function merely because we measure a single transcriptome, or even a series of them. The task of inferring biological functions, usually known as annotation, requires the interlinking of data from different types of experiments. In the Omics context, because of the sheer amount of data, this requires the systematic integration of diverse Omics data sets. Before we proceed to this topic in more detail, however, let us analyze the current state of affairs in a somewhat more formal setting (see Note 1).
2. Materials

2.1. Generic Properties
Many of the modern high-throughput methods in the Life Sciences have claimed a term ending in Omics, proclaiming the aim to provide comprehensive descriptions of their subject of study. With the advent of alternative technologies to detect and measure the same, or at least similar, objects, the term has usually shifted to refer to the study of the corresponding -ome, i.e., the comprehensive collection of a particular type of relevant objects, irrespective of the technology that is employed. Not surprisingly, this has also led to quite a bit of confusion in the language: transcriptomics, for instance, is often used, in particular in a biomedical context, to refer specifically to gene-chip data. We argue here that a full-fledged Omics has three components that make it a coherent scientific endeavor:

1. A suite of (typically high-throughput) technologies addresses a well-defined collection of biological objects, the pertinent -ome.
2. Both technologically and conceptually, there is a cataloging effort to enumerate and characterize the individual objects, hopefully coupled with an initiative to make this catalog available as a database.

3. Beyond the enumeration of objects, one strives to cover the quantitative aspects of the -ome at hand.

A surprisingly large number of -omes adhere to this general outline (Table 1) (see Note 2). Two classes of subfields, or flavors, of Omics are recurring topics in the literature. Functional -omics summarizes the attempts to make biological sense out of the wealth of data generated for the -ome, typically by ascribing biological functions to the individual objects; one might also say that the goal is to "functionally" annotate the objects of the -ome. Comparative -omics, on the other hand, makes use of cross-species comparisons of Omics data to leverage functional information that has already been determined for one species in order to gain, usually functional, insights on a homologous object in another organism. The four canonical Omics that deal with the best-studied cellular constituents have reached this state (Table 2).

The widespread use of Omics data has also fundamentally changed the workflow of bio-data analysis. Even simple routine analyses of next-generation sequencing, tandem mass spectrometry, or time-resolved imaging data already require computational capabilities that go way beyond spreadsheets. Sophisticated bioinformatics has thus become an integral part of Omics research, bringing with it algorithmic challenges and the requirement to provide a formal data model that is amenable to efficient processing, for example, in databases. Many of the Omics fields in Table 1 thus come with an associated "computational Omics" that deals specifically with these aspects.
Table 2
Structural comparison of Omics fields. The last two columns list the number of PubMed articles returned by querying the phrases "comparative -omics" and "functional -omics" on Fri Jan 29 2010

Field            Objects        Quantitative    Comparative   Functional
Genomics         DNA sequence   variation             2,448        5,215
Transcriptomics  Transcripts    expression               31            4
Proteomics       Polypeptides   expression              340          424
Metabolomics     Metabolites    concentration            11           32
2.2. Limitations
In practice, all Omics studies are subject to inaccuracies and bounds on prior knowledge that necessarily restrict our efforts to approximations of what we aspire to measure. An issue of particular importance is the set of often unspoken assumptions that underlie the measurement technology, the design of the experiments, and the subsequent data analysis. Due to their similar conceptual construction, all Omics fields are confronted with essentially the same set of issues that are, depending on the maturity of measurement technology and computational analysis techniques, addressed only partially, implicitly, or not at all in the scientific literature. In the following paragraphs, we address these issues at a general level; individual examples are briefly discussed in the subsequent survey.

1. Technical Limitations are caused by the measurement technology itself. Notorious examples are the uncertain connection between actual RNA expression levels and the signals produced by the various microarray technologies (3, 4), biochemical issues that, for example, leave genomic DNA libraries incomplete, or problems in detecting very small peptides in current proteomics protocols.

2. Limitations in the Experimental Design preclude access to certain information by construction, not just as a practical matter. Obvious examples are the limitation of GeneChips to mostly coding regions, the implicit assumption in high-throughput sequencing that nucleic acids use the canonical 4-letter alphabet despite sometimes abundant chemical modifications, or the selection of poly-adenylated mRNAs in EST library construction. There are many less obvious cases as well: exon chip approaches cannot disentangle certain combinations of splice variants, and sufficiently long repetitive elements make it impossible to complete genome assemblies based exclusively on shotgun data (Fig. 2) (see Note 3). Expression profiles of a mix of different tissues, developmental stages, or even species also fall into this category. Obviously, one had better be aware of such limitations in subsequent data analysis, and in particular when phrasing the biological interpretation.

3. Conceptual Limitations affect both technology and experimental design. The notion of a "gene," for instance, carries a particular burden in that it is not well defined, and a closer analysis (see below) shows that its practical application, for example, in GeneChips, does not correspond very well with our current understanding of transcriptome structures (Fig. 3). It leaves the nagging question "what is it that we are measuring here?" and raises issues in the biological interpretation of data. Measurements that necessarily produce complex aggregated data, representing a mixed signal arising from multiple distinct physical entities, may be hard or even impossible to map to biological reality.
Fig. 2. Two examples of design limitations. (a) Certain combinations of splice variants cannot be disentangled by exon chips. For instance, the concentrations of the four transcripts arising from a combination of two pairs A, B and C, D of mutually exclusive exons cannot be inferred from measuring expression levels of the individual exons (see Note 3). (b) Repetitive elements that are longer than the shotgun reads make it impossible to determine the relative order of the nonrepetitive genome fragments X and Y interleaved with the repeats R. Reads located entirely within the repeats could belong to any one of the copies. Unambiguous contigs in pure shotgun assembly are therefore restricted to intervals devoid of long repetitive elements.
4. Limitations in the Analysis cause a partial loss of the measured information at the level of computational analysis. This may have many causes. The analytical tasks may be inherently difficult, algorithmically or in practice. An example is the analysis of the in situ hybridization (ISH) images of the Berkeley Drosophila Genome Project (BDGP) (5), for which several basic image processing problems have so far not been solved satisfactorily, such as the registration of expression patterns from different individuals onto each other. Computational resources may be too limited for more accurate predictions, as, for example, in the case of large-scale protein structure predictions (6). In many cases, however, it is just the convenience of a quick-and-dirty analysis with the aim of harvesting the low-hanging fruit.

A particularly important issue is that of the Incompleteness of the Catalog, i.e., the inability to completely enumerate the members of the -ome in question. Indeed, all Omics fields are still in a discovery phase in the sense that none of them can claim to possess a complete catalog of a particular -ome (with the possible exception of a few "finished" genomes). More often than not, holes in the catalog are a composite of several or all of the types of limitations introduced above. Technical limitations make us miss low-abundance transcripts as they fall below the detection limits. Implicit assumptions on the transcripts, for example, on a particular chemistry of their 3′ and 5′ ends (7), may focus on or exclude certain molecules by design. Conceptually, one may have set out to measure concentrations of "genes" (whatever these are), leading in practice to a restriction to certain ORF-carrying RNAs in array experiments.
Fig. 3. GeneChip design causes difficulties in data interpretation. Each “gene” is assayed by a collection of probes whose signals are averaged to determine the expression level. In this example, the eight probes (little squares) respond to different combinations of splice variants that derive from two different transcripts. The “gene expression level” is therefore a complex combination of the different transcripts that cannot be disentangled without a detailed understanding of both the transcript structures and the concentration-response curve of each individual probe. Such arrangements of overlapping transcripts are common in eukaryotic genomes (10, 23), implying that “gene expression levels” are typically composite signals.
The deliberate choice to exclude unspliced ESTs (e.g., because their reading direction cannot be determined) may have been a limitation by design, by computational considerations, or the result of a misleading concept stipulating that "interesting genes" arise from spliced mRNAs. Pragmatic choices that may eventually turn out to be misleading can also result from the legitimate desire to avoid false positives. If we presuppose that trans-splicing is a rare exotic phenomenon in mammals (see refs. 8, 9 for a few well-documented case studies), we may be inclined to remove as artifacts all chimeric sequences that map to different chromosomal regions or chromosomes. From the published literature, it is by far not always clear why such choices are made, and whether they have been taken by design or just because this is how others proceeded before.

A second serious issue is the comparability and reproducibility of data among different technologies. Here, the high costs of high-throughput experiments are the limiting factor on the number of replicates, and the employment of different technologies to measure the same system is usually completely out of reach. The ENCODE Pilot Project (10) may in fact serve as a good example of how different the results of different technologies can be in detail, and to what extent different methods produce complementary rather than congruent results. We do not see this as a problem per se, as long as the researcher is aware of these issues and deals with them appropriately in the interpretation of the results. We shall return to this point briefly in the following section.

2.3. What Can Go Wrong, Will Go Wrong
By far the most frequently invoked Omics approach is genomics. Defined as the study of complete genomes, the terminology is the prime exception to the rule outlined above. Genomics is not (or should not be) concerned with the collection of all "genes," since we have learned that genes are just a small subset of the genomic DNA at best (11), and an ill-defined concept at worst (12). Genomics is not only the oldest and most mature Omics; it also has
the least difficulty fulfilling the promise to provide "complete catalogs": after all, we can quite safely assume that most prokaryotic genomes are complete, and we have at least a few "finished" eukaryotic genomes at our disposal. Still, the devil is in the details. For instance, the microchromosomes are notably underrepresented in the chicken genome assembly, so that there are still genes known to exist in chicken that are not represented, see e.g. (13). With most plant and animal genomes still at draft stage, similar artifacts may well have gone unnoticed in other cases as well.

A related issue of practical relevance is sequence assembly itself. Since genomes are of course not sequenced as a whole, but broken down into pieces that are palatable for the various sequencing technologies (from a few dozen nucleotides for emerging next-generation sequencing approaches to about a kilobase for Sanger sequencing), reconstructing the correct genomic sequence is still a formidable computational problem (14). Current toolkits are still plagued by all sorts of limitations that, in combination with the unavoidable biases in the data generation itself, adversely affect the final result. The Hox cluster regions of the chicken genome may serve as an instructive example of the difficulties (15).

Sequence variation across individuals in a population or among tissues in the same individual, i.e., single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), may be seen as a quantitative aspect of the genome. In some cases, developmental variations can be dramatic. Ciliates have long been known for their elaborate processing of genomic DNA during the construction of the functional macronucleus and the transcriptionally inactive micronucleus (16, 17) (Fig. 4). Chris Amemiya and collaborators recently reported that such mechanisms are also at work much closer to home (18): the sea lamprey, a jawless vertebrate, undergoes a dramatic remodeling of its genome, resulting in the elimination of hundreds of millions of base pairs from many somatic cell lineages. This suggests that genomes can be much more dynamic than commonly perceived, and it opens up yet another can of worms: current genome browsers and genome databases do not seem well equipped to deal with such situations.

Transcriptomics started out as the attempt to quantify the expression levels of mRNAs (19), laying claim to reflecting a sample's transcriptional state in its entirety. Unbiased approaches to cataloging transcripts, such as the large-scale cDNA sequencing project FANTOM (mouse) and the Encyclopedia of DNA Elements (ENCODE, human), which set out to complete the list of protein-coding genes, amassed evidence for pervasive transcription of the genome, massive differential expression of non-protein-coding transcripts, and a much more complex organization of the "transcriptome" (10, 20). The recent advent of high-throughput sequencing technologies has not only led to an ever-increasing flood of transcriptomics data, for instance, as part of organized efforts
Fig. 4. Ciliates exhibit complex processing of the DNA in the transition from their micronucleus to the macronucleus. (a) In Tetrahymena, the macronuclear chromosomes (MAC) derive from the micronuclear (germline) DNA by deletion of the MIC-specific IES sequences, site-specific cleavage at the Cbs sites (15-nt sequence signals), and subsequent attachment of new telomeres. Most MAC chromosomes are amplified about 45 times (76). (b) In Oxytricha, some of the micronuclear genes are "encrypted" in addition: they consist of "macronuclear-destined segments" (MDS) that are out of order relative to the functional gene sequence. Upon removal of the IES, the MDS are rearranged in the correct order (77). Only the preprocessed and amplified macronuclear chromosomes are transcriptionally active.
such as the modENCODE projects and the scale-up phase of ENCODE, but also keeps changing our picture of genome/transcriptome organization by refining and expanding both the complexity of primary transcription and the scope of processing. Even for the best-studied organisms, the catalog of transcripts is far from complete, and novel genes, including protein-coding ones, as well as a plethora of widely different ncRNAs, keep being discovered,
see e.g. (21–24). Large-scale whole-mount ISH (5, 25) opens the way to spatially resolved transcriptome analysis. So far, however, broad application of this approach is hampered by unresolved technical difficulties in both image processing (such as the registration of images from different biological samples onto each other) and subsequent data processing (26–28).

In proteomics, tandem mass spectrometry followed by database search is currently the predominant technology for peptide sequencing in shotgun experiments (29). In practice, this endeavor is not hypothesis-free: rather, databases of putative polypeptide sequences or previously determined mass spectra underlie the recognition of fragmentation patterns (30–32), effectively limiting the scope of most experiments to what is already known or predicted to exist. There are also technical limitations: most, but by no means all, proteins of an organism are detectable by current MS-based proteomics. Small peptides in complex protein mixtures are a particular challenge. Compared to overall protein expression levels, short peptides often show low abundance, they are easily lost using standard proteomic protocols, and only a limited number of proteolytic peptides can be obtained (33, 34). Many proteins are modified posttranslationally in chemically very diverse manners, ranging from simple phosphorylation or acetylation to the complex carbohydrate structures of glycoproteins; specialized methods are necessary to address these modifications in practice (35, 36). Multi-Epitope-Ligand-"Kartographie" (MELK) (37) is an ultrasensitive topological proteomics technology capable of analyzing protein colocalization with subcellular spatial resolution. It addresses the higher-level order in a proteome, referred to as the toponome, which encodes cell functions through topologically determined networks of interacting proteins. The computational techniques for the analysis of such data are currently under rapid development.

Metabolomics by definition deals with tens of thousands of molecules of small molecular weight, many of which are still of unknown identity. The compounds in question have a wide variety of functional groups, physicochemical properties, and chemical reactivities, and they appear in distinct pathways at abundances that vary by many orders of magnitude, from sugars and salts at mM concentrations to vitamins and metabolic intermediates in the nM or even lower concentration range. All these issues limit metabolomics techniques in practice (38) and make incompleteness of the data a crucial issue.

Epigenomics is concerned with position-specific information that sits "on top of" the genome, in particular DNA methylation and histone variants and modifications. Both seem to be relevant for the cross talk between genome and environment and for the definition of genomic/cellular states. DNA methylation marks, i.e., methylated cytidines, are also studied under the name methylome.
Methylated DNA immunoprecipitation (MeDIP) is one technique for assaying the methylome: it achieves a roughly 90-fold enrichment of methylated genomic DNA fragments by means of an antibody specific for methylated cytosine (39). Sequence identification of methylated fragments can be carried out by hybridization to a microarray (MeDIP-chip) or by sequencing (MeDIP-seq). To obtain a signal, the fragments need to be long enough and to span more than one methylated cytosine for the antibody to bind. The maximal resolution of the MeDIP technology is therefore limited by the size of the sequenced fragments and/or the chip resolution; it is unlikely to fall below 30 nt. In theory, single-nucleotide resolution can be achieved with bisulfite sequencing. In practice, however, incomplete conversion of unmethylated cytosine to uracil and DNA degradation are limiting (40).

The information contained in nucleosomes and their positions is even more difficult to analyze. The nucleosome is a DNA–protein complex built from two copies each of four histones, (H3−H4)2(H2A−H2B)2, that wrap up 146 nt of DNA. For a human nucleosome, more than one hundred distinct marks are possible, which may appear in complex combinations (a toy calculation of this combinatorial capacity follows at the end of this section); see (41) for a summary. To date, it is known that histone marks can index the genome, marking transcribed regions, active promoters, direction of transcription, and exon/intron boundaries (42–45). Ideally, a map of the epigenome would list the type of the mark and the marked site, distinguish between the two copies of the four histone types, take histone isoforms/paralogs into account, and capture the genomic positioning of the nucleosome in vivo. Chromatin immunoprecipitation in combination with DNA sequencing (ChIP-seq) is currently the standard high-throughput method to study the genome-wide distribution of selected histone modifications. Modified histones can be assigned to a genomic location with a resolution of ±50 nt. This constitutes a tenfold increase over the resolution of ChIP-chip technologies, but it still provides only approximate nucleosome positions (see Note 4).

A quickly increasing number of less well-known Omics fields is currently entering the literature (Table 1). Some of them are well-defined in the sense of the discussion of this section. Glycomics (46, 47), for instance, deals with glycans, i.e., oligo- and polysaccharides. Glycans are branched rather than linear polymers, they frequently incorporate chemically modified sugars, and, in contrast to the overwhelming majority of peptides, they are not coded in easily accessible information molecules. Already the determination of their structural formula, i.e., the analog of sequencing, presents substantial challenges. Other -omes, such as the ORFeome, are probably better understood as particular aspects of more inclusive research areas, in this case proteomics. A similar case is RNomics (focusing on small ncRNAs in its original definition (48) and used in a broader sense by many authors since then), which we see as a subfield of transcriptomics.
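To make the combinatorial point about nucleosomal marks concrete: even under the simplifying, purely illustrative assumption that each of roughly 100 possible marks is simply present or absent and that only a handful co-occur on a given nucleosome, the number of distinguishable modification states is already enormous. A back-of-the-envelope calculation:

```python
from math import comb

# Assume ~100 possible distinct marks per nucleosome (see text) and
# count the modification states with at most 5 marks set at once.
states = sum(comb(100, k) for k in range(6))
print(f"{states:,} states with at most 5 of 100 marks")  # 79,375,496
```

Allowing arbitrary subsets would give 2^100 states; the point is only that nucleosomal marks can, in principle, index the genome with far more states than any realistic assay can resolve.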
2.4. Interactomics
This fast-growing field is concerned with the complex networks of interactions that link biological molecules in a cell (49). In contrast to the Omics disciplines discussed in the previous section, it deals not with material objects but with relationships between them. Nevertheless, "physical interactions of two or more biomolecules" is a well-defined notion, giving a concrete meaning to the interactome. A broad array of techniques is employed to assay different types of interactions. Despite the technological advances of recent years, however, only certain aspects of the complete interactome are accessible to experimental approaches at present:

● Protein–protein interactions are most frequently assessed using yeast two-hybrid (Y2H) assays, reviewed in (50). Modern Y2H tools can screen nearly the entire cellular proteome for interactions and, in particular, deal with membrane proteins, transcriptionally active proteins, and different localizations. Nevertheless, the method suffers from high false positive and false negative rates.

● An alternative to Y2H is affinity purification of protein complexes and subsequent mass spectrometric analysis (AP/MS) of the intact protein complexes, see e.g. (51). Again, technical issues limit the accuracy of the attainable data. At present, Y2H should be seen as complementary to emerging AP/MS techniques, and a combination of different techniques together with bioinformatics stands a chance of yielding an accurate description of large interaction networks (50).

● DNA–protein interactions can be assayed by large-scale chromatin immunoprecipitation, reviewed, for example, in (52). The attached nucleic acid sequences are then determined either by quantification on a microarray (53) or by deep sequencing (54). A similar technology, known as HITS-CLIP, maps RNA–protein binding sites in vivo by crosslinking immunoprecipitation and subsequent sequencing of the RNAs (55). Again, a corresponding microarray-based approach, known as RIP-chip, is possible (56). This type of approach is limited by the availability, sensitivity, and specificity of antibodies against the protein component for the immunoprecipitation step. The size of the crosslinked DNA or RNA fragments, furthermore, imposes a limit on the positional resolution.

● Interactions between two nucleic acid molecules are typically mediated by specific base pairing (hybridization). Direct experimental verification of DNA–DNA, DNA–RNA, or RNA–RNA binding interactions is based on detailed chemical probing of the cofolded structure, an approach that is still beyond routine procedures even at low-throughput scale (57). Nucleic acid interactions, however, can be predicted computationally with decent (though by far not perfect) accuracy (58–61); a toy scoring sketch follows this list.
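The predictions cited in the last bullet rest on scoring complementary base pairing. As a toy illustration only (real tools such as those in refs. 58–61 use thermodynamic energy models, allow loops and bulges, and handle G–U wobble pairs, none of which appear here), one can score gapless antiparallel duplexes simply by counting Watson–Crick pairs:

```python
# Watson-Crick pairs for RNA; a real predictor would also score G-U wobbles.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def best_hybrid(query: str, target: str) -> tuple[int, int]:
    """Best gapless antiparallel alignment: returns (offset, paired bases)."""
    rev = target[::-1]  # read the target 3'->5' so indices pair antiparallel
    best_off, best_score = 0, -1
    for off in range(-len(query) + 1, len(rev)):
        score = 0
        for i, q in enumerate(query):
            j = i + off
            if 0 <= j < len(rev) and (q, rev[j]) in PAIRS:
                score += 1
        if score > best_score:
            best_off, best_score = off, score
    return best_off, best_score

# A 7-mer against its perfect reverse complement pairs at all 7 positions.
print(best_hybrid("ACGGAGU", "ACUCCGU"))  # (0, 7)
```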
We are not aware, furthermore, of high-throughput approaches to measuring the interactions of biological macromolecules with small molecules, such as metabolites. In contrast to interactomics, which studies measurable interactions, the related term “regulomics” refers to the study of biological regulation at system-wide scales. As such, regulomics is not at all an Omics field in the sense of this article, but rather constitutes a part of systems biology that attempts to reconstruct regulatory networks and mechanisms.
3. Methods

3.1. Comparability of Omics Data
Biological systems are neither stationary in time nor homogeneous in space. In fact, much of the success of Omics approaches comes from comparing variation between individuals, environmental stimuli, tissue types, or cell lines. One may say, as we did in the introduction, that each sample defines its own transcriptome, proteome, metabolome, etc. From a data management point of view, however, it is much more convenient to speak, for example, of the "mouse transcriptome" as the union of all the transcripts detectable in Mus musculus. A particular transcriptomics measurement then assays the state of the transcriptome in a particular sample. Interpreting the -ome as the collection of objects (here: transcripts) that can in principle be observed, it is justified to represent the state as a vector, providing a convenient mathematical starting point for analyzing quantitative data. Of course, this is the setting in which, for example, GeneChip experiments have always been analyzed anyway.

Logically, the comparison of two Omics measurements thus requires that we have a way of identifying when two of the assayed objects are the same. This may seem trivial at first glance. In the case of microarray experiments, the issue is typically hidden in the manufacturer's software, which produces expression levels for "genes" as output. It is easy to conceive examples, however, where different platforms, using different probe locations, interrogate different transcripts. To our knowledge, this issue has not been investigated systematically. That there is an issue, however, is exemplified by several attempts to recompute the assignment of probes to genes for Affymetrix GeneChips in the light of improving genome assemblies and gene annotations (62). In the case of sequencing data, object identity is much less obvious: individual reads first need to be assembled before reconstructed transcripts can be compared. As for ESTs, there is no guarantee that such contigs are complete at both ends; hence, in general, ambiguities remain, including those of Fig. 2.

In the words of President Clinton, it depends what the meaning of "is" is (Grand jury testimony, August 17, 1998), when we say that object A in experiment I is object B in experiment II. The meaning of "is" depends explicitly on a model of both the physical entities and the measurement process. In many cases, this model remains implicit, hidden in the computational voodoo of data integration pipelines. One instantiation could be a collection of tables that map probe IDs to gene IDs, and gene IDs of different nomenclature and annotation systems to each other (a minimal sketch of this idea closes this section). A particularly striking case is the ambiguity of gene annotations. The concept of the gene itself has come under intense scrutiny in response to the recognition that the "standard" model of genes as beads on a genomic DNA string is inconsistent with the findings of high-throughput transcriptomics (63, 64). As a consequence, several modifications of the concept of the gene have been explored, ranging from purely structural definitions in terms of groups of transcripts (65), via the consideration of transcripts themselves as the central operational units of the genome (66), to functional notions (67), and attempts to reconcile functional and structural aspects (12, 68) (see Note 5).
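The following sketch illustrates the kind of mapping tables just mentioned. All identifiers are hypothetical, and real probe-to-gene maps would come from remapping probe sequences against a current annotation; the point is only that a cross-platform comparison is defined solely over objects that both maps resolve to the same entity:

```python
# Hypothetical probe-to-gene tables for two platforms (invented IDs).
platform1 = {"p1_0007": "GeneX", "p1_0042": "GeneY", "p1_0099": "GeneZ"}
platform2 = {"p2_1311": "GeneX", "p2_2050": "GeneY", "p2_3333": "GeneW"}

expr1 = {"p1_0007": 8.1, "p1_0042": 5.3, "p1_0099": 11.0}
expr2 = {"p2_1311": 7.9, "p2_2050": 6.0, "p2_3333": 4.2}

def per_gene(mapping, expr):
    """Average probe-level signals per mapped gene."""
    sums: dict[str, list[float]] = {}
    for probe, gene in mapping.items():
        sums.setdefault(gene, []).append(expr[probe])
    return {gene: sum(v) / len(v) for gene, v in sums.items()}

g1, g2 = per_gene(platform1, expr1), per_gene(platform2, expr2)
shared = sorted(set(g1) & set(g2))  # only here does "is" have a meaning
print([(g, g1[g], g2[g]) for g in shared])  # GeneX and GeneY are comparable
```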
3.2. Data Models for Computational Omics

The definition of the entities that constitute the -ome naturally determines how particular Omics data are stored in data repositories and how they are handled in data integration efforts. If we think of genes as the natural entities of which a transcriptome is composed, it is completely legitimate to represent (possibly spatiotemporal) expression levels of genes as the basic data, and all integration of other data, such as the interpretation of variational data from genome resequencing, has to refer to these basic entities. Modern transcriptomics, however, has shown beyond reasonable doubt that "genes" are a theoretical construct whose exact meaning remains ambiguous at best and ill-defined at worst, suggesting that "genes" are not a particularly appropriate basis for genome annotation in the first place (12). We suspect that other Omics disciplines suffer to varying degrees from similar issues. Genome browsers, for example, invariably view data relative to a single reference genome, which is entirely adequate for Escherichia coli or Homo sapiens, but makes it difficult to deal with organisms that show a large degree of regulated DNA remodeling. This is not just a matter of technological convenience, but of the deep-seated preconception that what needs to be understood is how transcripts map to genomic locations, because transcripts directly derive from there in the few model organisms that are typically studied. It may not come as a surprise that the very same technology is also used to browse Tetrahymena (http://www.ciliate.org), where a more detailed model incorporating the intricate relationships of micronucleus and macronucleus would be desirable (Fig. 4).
Fig. 5. Relationships between processing products are not always obvious and would ideally be stored explicitly. The snoRNA, for instance, might be processed from an intron, transcribed from its own promoter, or possibly produced through either pathway in different contexts. The tRNA is cleaved from a precursor (by RNase P and RNase Z), a CCA tail is added (e.g., in human), and several positions are chemically modified. The pol-II transcript is spliced (here in two alternative forms), capped, and polyadenylated, while the snoRNA-carrying intron is spliced out. The sequence of processing stages is often known (or can be inferred from the sequences, albeit with some effort), but for the most part these simple relations (gray shades) are not explicitly represented and hence cannot be utilized.
As we argue below, the integration of Omics data calls for data representations that are less focused on a particular task and can instead be viewed as abstractions of biological processes. For instance, transcripts are processed in complex ways, from primary to mature transcripts and further to degradation products, which may or may not be functional (Fig. 5) (see Note 6); a toy sketch of such a "processed from" relation follows.
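A minimal sketch of an explicit "processed from" relation, in the spirit of Note 6. The identifiers and processing labels are invented for illustration; a real data model would reference actual transcript accessions:

```python
# Toy "processed_from" graph: product -> (precursor, processing step).
processed_from = {
    "mature_mRNA_a": ("primary_pol2_transcript", "splicing + cap + polyA"),
    "mature_mRNA_b": ("primary_pol2_transcript", "alternative splicing"),
    "snoRNA_x": ("intron_2_of_primary_pol2_transcript", "debranch + trim"),
    "intron_2_of_primary_pol2_transcript": ("primary_pol2_transcript",
                                            "splicing"),
    "mature_tRNA_gly": ("tRNA_precursor", "RNase P/Z cleavage + CCA"),
}

def lineage(product: str) -> list[str]:
    """Walk the processing chain back toward the genome-encoded precursor."""
    chain = [product]
    while chain[-1] in processed_from:
        chain.append(processed_from[chain[-1]][0])
    return chain

print(lineage("snoRNA_x"))
# ['snoRNA_x', 'intron_2_of_primary_pol2_transcript',
#  'primary_pol2_transcript']
```

With such a relation in place, expression measured for any product could be propagated to, or disambiguated against, its precursors, which is precisely the information that is lost when only genomic coordinates are stored.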
3.3. Functional Omics = Annotation

The goal of Omics analysis is, in many cases, to infer functional information. By definition, this information does not come from within the Omics experiment itself but rather from linking it to other experiments of different types. Typically, function is expressed in terms of direct interactions ("X is part of the Y-complex"), as involvement in a particular process ("Y is part of the apoptosis pathway," "Z contributes to the risk for coronary heart disease," "W catalyzes the dehydrogenation of malate in the citric acid cycle"), as membership in a particular class of molecules ("Q is a tyrosine kinase"), or, in not too many cases, in terms of a mechanistic description of a catalytic activity.

Much of this information is derived from correlations between Omics experiments and/or transferred from the annotation of one biological system to another by means of comparative arguments. In particular, sequence homology is routinely used to
annotate newly sequenced genomes, on the premise that sequence similarity predicts functional similarity. Despite the power of homology in spreading annotation information from system to system, this information must eventually be grounded in experimentally verified knowledge, derived, for instance, from related Omics experiments.

3.4. Integration of Omics Data = Systems Biology
The combination of different Omics data is necessarily based on the assumption that we know how the different types of data are related. In some cases, this is straightforward: from mature mRNAs, we can predict the encoded proteins with rather high certainty, allowing the direct comparison of transcriptomics and proteomics data (despite all the complications of multiple transcripts and splice variants that may eventually lead to the same protein). The observation that transcript levels are surprisingly poor predictors of protein levels, for instance, was the first indication of the pervasive scale of posttranscriptional regulation. Another example is the combination of genomic variation (SNP) and gene expression data, known as expression Quantitative Trait Locus (eQTL) studies. These provide indications of the effects of mutations on particular regulatory pathways. In practice, however, the interpretation of such data requires a rather detailed model of the pathways in question. In combination with comparative genomics, correlated expression and the presence of common conserved sequence motifs can be utilized to infer cis-regulatory networks, see e.g. (69).

Large-scale integration of Omics data can thus be seen as an effort to make rather vague statements ("Y is part of the apoptosis pathway") more precise and to place them in the context of a pathway ("Y is downregulated by X and inhibits Z") by observing correlations and reactions to perturbations of the system. Knowledge about the molecular properties of X, Y, and Z may furthermore help to make the statement even more precise ("The microRNA Y targets the mRNA of Z. The expression of the primary precursor of Y is inhibited by the transcription factor X").

The model-dependent integration of Omics data constitutes one facet of Systems Biology. The other is the creation and modification of the models themselves, a task that can be rephrased as "generating biological insights." As emphasized, for example, in (70), this requires that the underlying conceptual assumptions be made explicit in the form of a computational model, so that they can be used in the course of data processing and eventually become amenable to rigorous analysis. Conversely, every computational integration of Omics data explicitly or implicitly presupposes an underlying model of the relationships among the data. The efforts in Omics research are, to a large extent, geared toward the construction of a model of biological reality that is as faithful as possible: it strives to provide a mechanistic explanation of the molecules and their interactions that is as complete and as accurate as possible.
Many systems biologists describe their field in a somewhat different way: Systems Biology combines experimental techniques and computational methods in order to construct predictive models (71). Naïvely, one might think that predictive models are even better than "just" descriptions of the mechanisms. In fact, the inference of the parameters implicitly also determines the (significant) interactions as those with nonzero values. Mathematically, this is formulated as an inverse problem, in which the unknown network topology and parameters are fitted to reproduce the measured data. This inverse problem, however, is ill-posed, i.e., under-determined in general, so that additional constraints, so-called regularizations, need to be employed. Most commonly, one asks for "sparsity," that is, the number of nonzero parameters should be as small as possible (a minimal sketch follows at the end of this section). A similar approach is often taken in the statistical analysis of gene expression data, when "candidate gene" or "gene set" approaches focus on the (usually small) subsets showing the strongest response to stimuli or differences between conditions.

It cannot be overemphasized that the interpretation of any reconstruction or parameter estimation is highly model-dependent. For example, the discrepancies between transcript and peptide concentrations imply that mechanisms of posttranscriptional regulation are at work. It may well be possible to represent much or even all of the observed data, for example, by a complex network of proteins influencing each other's lifetimes, and to estimate parameters for such a model. The point is that expression data for polyadenylated mRNAs and polypeptides, even taken together, do not imply or even hint at the existence and importance of microRNAs, which, as we now know, play an important role in posttranscriptional regulation. Indeed, we are not aware of many hints in the literature, before the discovery of microRNAs, that a complex additional regulation circuitry, such as the RNAi machinery, could be missing from our mechanistic understanding of gene regulation, let alone that its specificity would be based on RNAs rather than proteins. To reach this conclusion, the discovery of microRNAs and their characterization as widespread regulators was necessary.

At least with the present arsenal of mathematical and statistical methods, we stand little chance of determining the completeness and mechanistic correctness of models by considering time series, stimulus-response data, or healthy/diseased comparisons. The reason is that nature appears to have evolved robust, redundant networks (72). For this type of system, the details of the architecture appear to be very hard to infer; already in simple cases, such as Bayesian Network models of tiny signaling networks, huge amounts of data may be required (73). We should be aware, therefore, that regularized sparse models, whether obtained as solutions of an inverse problem or constructed "by hand" (as has been done with most transcription factor networks and signaling pathways) based on Omics data and biological insight, might work well as predictors for a certain range of circumstances and phenomena, while at the same time being quite far from the mechanistic reality of the biological system.

The Systems Biology approach to this problem is to create a sufficiently large data pool, focusing on simultaneous experimental approaches in conjunction with improved, model-driven approaches to integrating heterogeneous experiments (74, 75). Unbiased cataloging efforts, such as ENCODE, are another part of the remedy. The discovery of a huge diversity of noncoding RNAs provides us with new pieces, many of which still need to be placed in the puzzle; but at least we know that the puzzle is much bigger than anticipated, and we can try to fit the pieces into various models to see where they belong. Similarly, in order to understand the role of epigenetic marks in gene expression, it will be necessary to acquire such data systematically and as completely as possible. Even complete parts lists and spatiotemporal correlations between the parts are not sufficient, however, to build a complete computational model. We strongly suspect that the systematic measurement of physical interactions, as constraints on correlation-based data, is indispensable for making the step from sufficiently predictive representations to mechanistic explanations.
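To make the sparsity idea above tangible, here is a minimal sketch of L1-regularized ("sparse") regression applied to a synthetic expression matrix. Everything here is invented toy data, not a recommended inference protocol; realistic network reconstruction adds stability selection, cross-validated penalties, and all the caveats discussed in the text:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic expression matrix: 50 samples x 10 genes. Gene 0 is driven
# by genes 3 and 7; all other genes are independent noise.
X = rng.normal(size=(50, 10))
X[:, 0] = 0.8 * X[:, 3] - 0.6 * X[:, 7] + 0.1 * rng.normal(size=50)

# Regress gene 0 on the remaining genes with an L1 penalty; the penalty
# drives most coefficients to exactly zero, i.e., a sparse "network".
model = Lasso(alpha=0.1).fit(X[:, 1:], X[:, 0])
print(np.round(model.coef_, 2))
# Nonzero entries should appear essentially only at the true regulators
# (columns 2 and 6 of the predictor matrix, i.e., genes 3 and 7).
```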
4. Notes

1. There is no discipline of "omology" devoted to the systematic study of -omes and their interrelations at an abstract level. We suggest that this could be a worthwhile endeavor, given the amount and importance of Omics data in present-day Life Sciences.

2. Some of them, such as the peptidome (dealing with small peptides), the secretome (dealing with secreted peptides), or the degradome (dealing with proteases), are clearly identifiable as specialized subsets, in this case of the proteome. Only a few -omes are pervasively used in the literature.

3. The concentrations cAC, cAD, cBC, and cBD of the four transcripts arising from a combination of two pairs A, B and C, D of mutually exclusive exons in Fig. 2 cannot be inferred from measuring the exon signals [A] through [D], since the linear system
$$
\begin{pmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1
\end{pmatrix}
\cdot
\begin{pmatrix} c_{AC} \\ c_{BC} \\ c_{AD} \\ c_{BD} \end{pmatrix}
=
\begin{pmatrix} [A] \\ [B] \\ [C] \\ [D] \end{pmatrix}
$$

relating these quantities is not invertible: the rows satisfy [A] + [B] = [C] + [D], so the matrix has rank 3 and the four transcript concentrations cannot be recovered uniquely.
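The rank deficiency can be verified in one line; a minimal numpy check (not part of the original note):

```python
import numpy as np

# Exon-signal design matrix from Note 3: rows are [A], [B], [C], [D];
# columns are the transcripts AC, BC, AD, BD.
M = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

print(np.linalg.matrix_rank(M))  # 3, not 4: the linear system is singular
```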
4. It is hard to evaluate the accuracy of individual modification maps since the antibodies used in the immunoprecipitation step recognize the chemical groups in their histone context and hence might be disturbed by complex modification patterns. At present, no high-throughput method can analyze the histone modification state without relying on the specificity of antibodies.

5. In practice, different annotation schemes (GENCODE, RefSeq, VEGA, ENSEMBL) subsume different transcription and protein products under their gene symbols, and they do so based on the expert judgment of individual annotators. Database information linked to gene symbols thus cannot be traced unambiguously to the actual physical entities involved in the original measurements (without going back to the original literature).

6. It would be extremely helpful, for example, to have an explicit representation of the "processed from" relation, linking genomic DNA (more precisely, germ-line DNA, to accommodate DNA remodeling as well) to the products it encodes. So far, the only connection that is routinely available is the link between a protein-coding mRNA and the polypeptide it encodes.

References

1. Lederberg, J., McCray, A. T. (2001) "ome sweet" omics – a genealogical treasury of words. The Scientist 15(7), 7–8.
2. Somel, M., Creely, H., Franz, H., et al. (2008) Human and chimpanzee gene expression differences replicated in mice fed different diets. PLoS ONE 3, e1504.
3. Binder, H., Kirsten, T., Löffler, M., Stadler, P. F. (2004) The sensitivity of microarray oligonucleotide probes – variability and the effect of base composition. J Phys Chem 108, 18003–18014.
4. Binder, H., Preibisch, S. (2008) "Hook" calibration of GeneChip microarrays: theory and algorithm. Alg Mol Biol 3, 12.
5. Tomancak, P., Beaton, A., Weiszmann, R., et al. (2002) Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol 3, R0088.
6. Pieper, U., Eswar, N., Webb, B. M., et al. (2009) MODBASE, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 37, D347–D354.
7. Ruby, J. G., Jan, C., Player, C., et al. (2006) Large-scale sequencing reveals 21U-RNAs
and additional microRNAs and endogenous siRNAs in C. elegans. Cell 127, 1193–1207.
8. Gingeras, T. R. (2009) Implications of chimaeric non-co-linear transcripts. Nature 461, 206–211.
9. Li, H., Wang, J., Ma, X., Sklar, J. (2009) Gene fusions and RNA trans-splicing in normal and neoplastic human cells. Cell Cycle 8, 218–222.
10. The ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816.
11. Pennisi, E. (2003) A low number wins the genesweep pool. Science 300, 1484.
12. Prohaska, S. J., Stadler, P. F. (2008) "Genes". Theory Biosci 127, 215–221.
13. Douaud, M., Fève, K., Gerus, M., et al. (2008) Addition of the microchromosome GGA25 to the chicken genome sequence assembly through radiation hybrid and genetic mapping. BMC Genomics 9, 129.
14. Scheibye-Alsing, K., Hoffmann, S., Frankel, A. M., et al. (2009) Sequence assembly. Comp Biol Chem 33, 121–136.
15. Richardson, M. K., Crooijmans, R. P., Groenen, M. A. (2007) Sequencing and genomic annotation of the chicken (Gallus gallus) Hox clusters, and mapping of evolutionarily conserved regions. Cytogenet Genome Res 117, 110–119.
16. Katz, L. A. (2005) Evolution and implications of genome rearrangements in ciliates. J Euk Microbiol 52, 7S–27S.
17. Duharcourt, S., Lepère, G., Meyer, E. (2009) Developmental genome rearrangements in ciliates: a natural genomic subtraction mediated by non-coding transcripts. Trends Genet 25, 344–350.
18. Smith, J. J., Antonacci, F., Eichler, E. E., Amemiya, C. T. (2009) Programmed loss of millions of base pairs from a vertebrate genome. Proc Natl Acad Sci USA 106, 11212–11217.
19. Schena, M., Shalon, D., Davis, R. W., Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
20. Maeda, N., Kasukawa, T., Oyama, R., et al. (2006) Transcript annotation in FANTOM3: Mouse gene catalog based on physical cDNAs. PLoS Genetics 2, e62.
21. Drosophila 12 Genomes Consortium (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218.
22. Hiller, M., Findeiß, S., Lein, S., et al. (2009) Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Res 19, 1289–1300.
23. Kapranov, P., Cheng, J., Dike, S., et al. (2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–1488.
24. Dinger, M. E., Amaral, P. P., Mercer, T. R., Mattick, J. S. (2009) Pervasive transcription of the eukaryotic genome: functional indices and conceptual implications. Brief Funct Genomic Proteomic 8, 407–423.
25. Frise, E., Hammonds, A. S., Celniker, S. E. (2010) Systematic image-driven analysis of the spatial Drosophila embryonic expression landscape. Mol Syst Biol 6, 345.
26. Kumar, S., Jayaraman, K., Panchanathan, S., Gurunathan, R., Marti-Subirana, A., Newfeld, S. J. (2002) BEST: a novel computational approach for comparing gene expression patterns from early stages of Drosophila melanogaster development. Genetics 162, 2037–2047.
27. Ye, J., Chen, J., Janardan, R., Kumar, S. (2008) Developmental stage annotation of Drosophila gene expression pattern images via an entire solution path for LDA. ACM Trans Knowl Discov Data 2, 1–21.
28. Heffel, A., Stadler, P. F., Prohaska, S. J., Kauer, G., Kuska, J.-P. Process flow for classification and clustering of fruit fly gene expression patterns. In: Proceedings of the 15th IEEE International Conference on Image Processing, ICIP 2008. IEEE, 2008 721–724. 29. Aebersold, R., Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207. 30. Johnson, R. S., Davis, M. T., Taylor, J. A., Patterson, S. D. (2005) Informatics for protein identification by mass spectrometry. Methods 35, 223–236. 31. Liu, J., Bell, A. W., Bergeron, J. J. M., et al. (2007) Methods for peptide identification by spectral comparison. Proteome Sci 5, 3. 32. Lam, H., Deutsch, E. W., Eddes, J. S., Eng, J. K., Stein, S. E., Aebersold, R. (2008) Building consensus spectral libraries for peptide identification in proteomics. Nat Methods 5, 873–875. 33. Garbis, S., Lubec, G., Fountoulakis, M. (2005) Limitations of current proteomics technologies. J Chromatography A 1077, 1–18. 34. Klein, C., Aivaliotis, M., Olsen, J. V., et al. (2007) The low molecular weight proteome of Halobacterium salinarum. J Proteome Res 6, 1510–1518. 35. Reinders, J., Sickmann, A. (2007) Modificomics: posttranslational modifications beyond protein phosphorylation and glycosylation. Biomol Eng 24, 169–177. 36. Fu, Y., Jia, W., Lu, Z., et al. (2009) Efficient discovery of abundant post-translational modifications and spectral pairs using peptide mass and retention time differences. BMC Bioinformatics 10 (Suppl 1), S50. 37. Schubert, W., Bonnekoh, B., Pommer, A. J., et al. (1270–1278) Analyzing proteome topology and function by automated multidimensional fluorescence microscopy. Nat Biotechnol 24, 2006. 38. Scalbert, A., Brennan, L., Fiehn, O., et al. (2009) Mass-spectrometry-based metabolomics: limitations and recommendations for future progress with particular focus on nutrition research. Metabolomics 5, 435–458. 39. Weber, M., Davies, J. J., Wittig, D., et al. (2005) Chromosome-wide and promoterspecific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 37, 853–862. 40. Fraga, M. F., Esteller, M. (2002) DNA methylation: a profile of methods and applications. Biotechniques 33, 632–649. 41. Prohaska, S. J., Stadler, P. F., Krakauer, D. C. (2010) Innovation in gene regulation: the case of chromatin computation. J Theor Biol 265, 27–44.
The Use and Abuse of -Omes 42. Andersson, R., Enroth, S., Rada-Iglesias, A., Wadelius, C., Komorowski, J. (2009) Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome Res 19, 1732–1741. 43. Nahkuri, S., Taft, R. J., Mattick, J. S. (2009) Nucleosomes are preferentially positioned at exons in somatic and sperm cells. Cell Cycle 8, 3420–3424. 44. Schwartz, S., Meshorer, E., Ast, G. (2009) Chromatin organization marks exon–intron structure. Nat Struct Mol Biol 16, 990–995. 45. Tilgner, H., Nikolaou, C., Althammer, S., et al. (2009) Nucleosome positioning as a determinant of exon recognition. Nat Struct Mol Biol 16, 996–1001. 46. Apte, A., Meitei, N. S. (2010) Bioinformatics in glycomics: glycan characterization with mass spectrometric data using SimGlycan. Methods Mol Biol 600, 269–281. 47. Zaia, J. (2008) Mass spectrometry and the emerging field of glycomics. Chem Biol 15. 48. Hüttenhofer, A., Brosius, J., Bachellerie, J. P. (2002) RNomics: identification and function of small, non-messenger RNAs. Curr Opin Chem Biol 6, 835–843. 49. Kiemer, L., Cesareni, G. (2007) Comparative interactomics: comparing apples and pears? Trends Biotech 25, 448–454. 50. Brückner, A., Polge, C., Lentze, N., Auerbach, D., Schlattner, U. (2009) Yeast two-hybrid, a powerful tool for systems biology. Int J Mol Sci 10, 2763–2788. 51. Heck, A. J. (2008) Native mass spectrometry: a bridge between interactomics and structural biology. Nat Methods 5, 927–933. 52. Wong, E., Wei, C. L. (2009) ChIP’ing the mammalian genome: technical advances and insights into functional elements. Genome Med 1, 89. 53. Yoder, S. J., Enkemann, S. A. (2009) ChIPon-Chip analysis methods for Affymetrix tiling arrays. Methods Mol Biol 523, 367–381. 54. Park, P. J. (2009) ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 10, 669–680. 55. Licatalosi, D., Mele, A., Fak, J. J., et al. (2008) HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456, 464–469. 56. Khalil, A. M., Guttman, M., Huarte, M., Garber, M., et al. (2009) Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci USA 106, 11675–11680. 57. Brunel, C., Romby, P. (2000) Probing RNA structure and RNA-ligand complexes with
195
chemical probes. Methods Enzymol 318, 3–21. 58. Mückstein, U., Tafer, H., Hackermüller, J., Bernhard, S. B., Stadler, P. F., Hofacker, I. L. (2006) Thermodynamics of RNA-RNA binding. Bioinformatics 22, 1177–1182. 59. Busch, A., Richter, A., Backofen, R. (2008) IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics 24, 2849–2856. 60. Chitsaz, H., Salari, R., Sahinalp, S., Backofen, R. (2009) A partition function algorithm for interacting nucleic acid strands. Bioinformatics 25, i365–i373. 61. Huang, F. W. D., Qin, J., Reidys, C. M., Stadler, P. F. (2010) Target prediction and a statistical sampling algorithm for RNA-RNA interaction. Bioinformatics 26, 175–181. 62. de Leeuw, W. C., Rauwerda, M. J., Jonker, H., Breit, T. M. (2008) Salvaging affymetrix probes after probe-level re-annotation. BMC Res Notes 1, 66. 63. Pearson, H. (2006) Genetics: what is a gene? Nature 441, 398–401. 64. Pennisi, E. (2007) DNA study forces rethink of what it means to be a gene. Science 316, 1556–1557. 65. Gerstein, M. B., Bruce, C., Rozowsky, J. S., et al. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17, 669–681. 66. Gingeras, T. R. (2007) Origin of phenotypes: genes and transcripts. Genome Res 17, 682–690. 67. Scherrer, K., Jost, J. (2007) The gene and the genon concept: a conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theory Biosci 126, 65–113. 68. Stadler, P. F., Prohaska, S. J., Forst, C. V., Krakauer, D. C. (2009) Defining genes: a computational framework. Theory Biosci 128, 165–170. 69. Dean, A., Harris, S. E. H., Kalajzic, I., Ruan, J. (2009) A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes. BMC Bioinformatics 10 (Suppl 9), S5. 70. Laubenbacher, R., Hower, V., Jarrah, A., et al. (2009) A systems biology view of cancer. Biochim Biophys Acta 1796, 129–139. 71. Engl, H. W., Flamm, C., Kügler, P., Lu, J., Müller, S., Schuster, P. (2009) Inverse problems in systems biology. Inverse Problems 25, 123014. 72. Ciliberti, S., Martin, O. C., Wagner, A. (2007) Innovation and robustness in complex
196
Prohaska and Stadler
regulatory gene networks. Proc Natl Acad Sci USA 104, 13591–13596. 73. Missal, K., Cross, M. A., Drasdo, D. (2006) Gene network inference from incomplete expression data: transcriptional control of hematopoietic commitment. Bioinformatics 22, 731–738. 74. Klipp, E., Herwig, R., Kowald, A., Wierling, C., Lehrach, H. Systems Biology in Practice. Concepts, Implementation, and Application. Weinheim, DE: Wiley, 2005.
75. Marcus, F. Bioinformatics and Systems Biology. Berlin: Springer, 2008. 76. Eisen, J. A., Coyne, R. S., Wu, M., et al. (2006) Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol 4, e286. 77. Prescott, J. D., DuBois, M. L., Prescott, D. M. (1998) Evolution of the scrambled germline gene encoding a-telomere binding protein in three hypotrichous ciliates. Chromosoma 107, 293–303.
Part II Omics Data and Analysis Tracks
Chapter 9
Computational Analysis of High Throughput Sequencing Data
Steve Hoffmann
Abstract
The advent of High Throughput Sequencing (HTS) methods opens new opportunities for the analysis of genomes and transcriptomes. While the sequencing of a whole mammalian genome took several years at the turn of this century, today it is only a matter of weeks. The race towards the thousand-dollar genome is fueled by the – ethically challenging – idea of personalized genomic medicine. However, these methods also allow new and interesting insights into many aspects of biology, such as the discovery of novel noncoding RNA classes, structural variants, or alternative splice sites, to name a few. Meanwhile, several methods for HTS have been introduced to the markets. Here, an overview of the technologies and the bioinformatics analysis of HTS data is given.
Key words: High throughput sequencing, Short reads, Mapping, Assembly, SNP detection, 454, Illumina, Helicos, SOLiD
1. Introduction
When turning to High Throughput Sequencing (HTS), often also referred to as Next Generation Sequencing (NGS), one quickly realizes that the time of spreadsheet bioinformatics is coming to an end. A single 3-day sequencer run confronts the scientist with terabytes of data: images, qualities, statistics, summaries, sequences, maps, and assemblies, to name a few. In the light of this quick and massive avalanche of data, deciding on data storage policies alone appears to be a difficult task. The deletion of any file will inevitably limit the possibilities of reanalysis or require cumbersome reevaluations. Especially for precious biological samples it might be worthwhile to store the data. For example, the reanalysis of images with alternative base calling tools may yield important improvements. Because the prices for whole-genome sequencing
of mammalian genomes have dropped significantly below 20,000 US$ (1) and the time needed for sequencing does not exceed a couple of weeks, as-yet-unknown amounts of data will accumulate in laboratories all over the world. Hence, whatever policies are adopted, and irrespective of whether the sequencing itself is outsourced or not, a series of smaller HTS projects already requires a specialized infrastructure with arrays of hard disks, network architectures, or even tape storages. These issues cannot be overstressed, since they heavily affect everyday work with the HTS data. A few questions have to be answered carefully in each case. How to exchange data with collaborators? How to set up pipelines for analyses? Is a version control of reanalyzed data necessary? Picturing a tape storage library with loads of TB-cartridges in a molecular biology lab, one immediately realizes that the analysis of HTS data is a problem of its own. Many standard algorithms and tools frequently used in genome informatics had or still have to be revised in order to contain the deluge of HTS data. Moreover, HTS offers the opportunity for new types of analysis, such as transcription start site detection, ncRNA detection, or RNA structure probing, requiring new algorithms and standards. This chapter intends to give an overview of the sequencing technologies as well as basic approaches to HTS data analysis. As the HTS methods are steadily improving and changing at a very fast pace, the focus will be on their basic principles rather than fugacious facts or specific pros and cons of single technologies. Despite their different error models, properties, and application areas, all HTS approaches gain their true beauty and fascination from their genuine combination of different techniques from various fields of science: molecular biology, chemistry, physics, material science, engineering, and computer science.
2. Materials
Current high throughput DNA sequencing methods may be subdivided into two major approaches: sequencing by ligation and sequencing by synthesis. In sequencing by synthesis, a single-stranded, primer-probed DNA template is sequentially duplicated by a polymerase. During duplication, chemically modified nucleotides added to the newly synthesized strand allow the detection of each base of the template. Sequencing by ligation, on the other hand, does not use polymerases but employs specifically binding primer probes. ABI's SOLiD sequencing platform (2) puts this idea into practice. However, in the near future a direct readout of sequences using the physicochemical properties of nucleotides (such as charges) may become the state of the art.
In principle, all novel sequencing methods achieve high throughput by immobilizing large numbers of DNA or cDNA fragments locally. Regardless of whether the immobilization takes place on beads or planar surfaces, the idea is to spatially separate the fragments sufficiently to perform the sequencing for all fragments simultaneously.
2.1. Sequencing Platforms
2.1.1. 454 Pyrosequencing
The 454 pyrosequencing system was the first HTS system introduced to the markets (3). The key idea of pyrosequencing is to trigger detectable chemiluminescent reactions during the sequencing step. To prepare a DNA library for the 454 platform, the double-stranded source material needs to be sheared into fragments of several hundred nucleotides. In a second step, two different linkers, i.e., specific DNA sequences of known length and composition, are ligated to the double-stranded fragments. A 5′-biotin tag attached to one of the linkers allows the immobilization of the fragments on DNA capture beads. An excess of beads ensures that the expected number of bound fragments per bead does not exceed one. To generate detectable light signals during the synthesis step, an emulsion polymerase chain reaction (emPCR) clonally amplifies the fragment. During this process millions of copies are directly attached to the bead. Picotiter plates, i.e., plates with wells just large enough to contain a single bead, are used to trap and locally immobilize the beads. The nucleotides adenine (A), cytosine (C), thymine (T), and guanine (G) are sequentially washed over the plates in recurring cycles of four. Polymerases along with other enzymes generate light signals when nucleotides are incorporated into the DNA strand. More precisely, the enzyme luciferase is the major component of the chemiluminescent reaction (Fig. 1). After each cycle, a washing step is needed to remove the excess of nucleotides from the plate. The description of the sequencing-by-synthesis step already reveals an important problem: the sequencing of homopolymers. A stretch of two or more identical nucleotides in the DNA template will generate multiple subsequent nonsynchronous chemiluminescent reactions during a single cycle. Hence, the intensity of the light signal is the only way to determine the length of a homopolymer – making rather complicated signal processing steps necessary. Additionally, despite the washing steps, nucleotides accumulate within the wells, causing an increase of the background signal as the sequencing proceeds. The major advantage of the 454 method in comparison with other technologies is the longer read length, making the sequences especially useful for assemblies (Table 1).
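To make the homopolymer problem concrete, the following minimal sketch (in Python) computes the idealized, noise-free flowgram of a template: a run of n identical bases reacts within a single flow, so only the signal intensity – here simply the run length – distinguishes homopolymers of different lengths. The flow order and the function names are illustrative assumptions, not part of any vendor software.

# Idealized 454 flowgram: each flow incorporates the entire homopolymer
# run at the current template position, so the noise-free signal is
# proportional to the run length.
FLOW_ORDER = "TACG"  # one washing cycle; assumed here for illustration

def ideal_flowgram(template, cycles=2):
    """Return (nucleotide, signal) pairs for a noise-free sequencing run."""
    pos, flows = 0, []
    for _ in range(cycles):
        for nuc in FLOW_ORDER:
            run = 0
            while pos < len(template) and template[pos] == nuc:
                run += 1  # the whole run reacts during this single flow
                pos += 1
            flows.append((nuc, run))
    return flows

if __name__ == "__main__":
    # 'AAA' yields one flow of ~3x intensity, not three separate signals
    for nuc, signal in ideal_flowgram("TAAACGG"):
        print(nuc, signal)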
Fig. 1. 454 pyrosequencing. After clonal amplification of DNA fragments on DNA capture beads, the beads are trapped in small wells of a picotiter plate. During the sequencing-by-synthesis step the four nucleotides are washed in recurring cycles over the plate. Incorporation of nucleotides by the polymerase results in the production of ATP. In turn, the energy-rich ATP triggers a luciferase reaction generating a light signal that is recorded by a CCD camera. 454 Sequencing © Roche Diagnostics. All rights reserved.
2.1.2. Illumina (Solexa)
With about 100 nucleotides, Illumina reads are significantly shorter than those from 454 (4). However, due to its degree of parallelization, this method allows a higher throughput with more than 20 gigabases per day. In contrast to 454, fragments ligated to two different adapters are immobilized on translucent plates (flow cells) densely coated with oligonucleotides complementary to the adapters. The key idea of this procedure is to clonally amplify the fragments in a circumscribed region of the flow cell. The complementary adapters on the plate act as PCR primers to generate clusters of millions of identical copies using a process called bridge amplification. The prerequisite for a successful generation and detection of such clusters is that the initially binding fragments are well separated from each other (Fig. 2). In fact, separability and purity of clusters is one of the major challenges in the signal detection step. The sequencing-by-synthesis step involves reversible dye terminators.
Table 1
Comparison of high-throughput sequencing methods

Technology (platform)                     | Read length       | bp per run                          | Accuracy (a) | Run time | Remarks (b)
Roche 454 (GS FLX w/ Titanium chemistry)  | ~400 bp           | ~400 MB                             | 99.5%        | 7 h      | Supports mate pair sequencing; indels in homopolymer stretches
Illumina/Solexa (HiSeq 2000)              | 36–100 bp         | ~200 GB (paired end/mate pair mode) | >98%         | 8 days   | Supports paired end and mate pair sequencing
SOLiD (SOLiD 3.0)                         | ~50 bp            | ~20–30 GB (paired-end)              | 99.94%       | 14 days  | Supports mate pair sequencing
Helicos                                   | 22–55 bp (35 bp)  | ~20–30 GB                           | >95%         | 8 days   | Supports paired end sequencing

(a) According to the manufacturer
(b) Please note that mate pair sequencing refers to parallel sequencing of two DNA fragments with a known approximate physical distance. Paired end sequencing refers to the sequencing of a single DNA fragment from both sides
Fig. 2. Illumina/Solexa cluster generation. Adapter-tagged fragments are immobilized on glass cover slips densely coated with reverse complementary adapters (1). Subsequently, fragments are amplified using a bridge amplification step (2), resulting in locally separated clusters of clonally amplified fragments (3). Separability and purity of clusters is a key prerequisite for the Illumina technology. During the sequencing step (using reversible terminator dye chemistry) the clusters generate sufficiently strong light signals to be detected by a camera. ©2010, Illumina, Inc. All rights reserved.
The polymerase incorporates differently labeled nucleotides that bring the sequencing to a sudden stop. Laser excitation of the labels indicates the type of the incorporated nucleotide and cleaves the terminator activity. Subsequently, the next base can be called. The Illumina approach intends to call bases synchronously, i.e., all light signals generated in the nth cycle belong to the nth nucleotide of all fragments in the flow cell. In practice, however, this does not always work. A failure to remove the terminator or the incorporation of nucleotides without terminator activity, for example, will result in a phase shift. This phasing increases noise and complicates the signal detection.
2.1.3. Helicos
Helicos styles its single molecule sequencing figuratively as “DNA microscopy”. Although the company's own interpretation is at least semantically questionable, their system provides a solution that does not require amplification, which may introduce biases preceding the detection step (1, 5). Instead of specific adapter sequences, poly-A anchors are covalently attached to single fragments. Subsequently, glass cover slips covered with poly-T oligomers are used to capture the library fragments. During sequencing, the plate is incubated with only one Cy5-labeled nucleotide at a time and a light signal is generated upon laser excitation. Because of the low signal strength, good background discrimination and high-resolution image detection are required. And indeed, the error rates of Helicos are reported to be significantly higher in comparison with other HTS technologies (1), demanding more sensitive software in the downstream analysis.
2.1.4. SOLiD
Applied Biosystems' SOLiD technology employs a “sequencing by ligation” approach. Sheared fragments coupled to a universal primer (P1) are attached to beads and clonally amplified in an emPCR reaction. In contrast to 454, the beads are not captured in wells but covalently bound to glass plates.
In the first sequencing round, a universal primer complementary to the 3′-end of P1 is used. Subsequently, fluorescently labeled 5-mer probes are washed over the plate. Each label represents a set of four dinucleotide combinations specific only for the first and second position of the probe, while the other three nucleotides are random. Probes complementarily binding to the template sequence are ligated to the primer. After signal detection, the labels are simultaneously cleaved and a second ligation reaction elongates the extension product. At the end of the first ligation round with multiple ligation reactions, the color information is obtained for pairs of nucleotides with a lag of 4 nt, i.e., the 1st and 2nd, the 6th and 7th, the 11th and 12th, and so on. Note that the color information from the first round is not sufficient to decode the bases. To decode a single base, the antecedent nucleotide (not only the color information) has to be known (Table 2). Hence, additional ligation rounds are necessary to translate the colors and to interrogate the remaining nucleotides. Therefore, the extension product is removed and a second primer, complementary to the 3′-end of P1 with an offset of −1, binds to the template. In this ligation round, the first two interrogated nucleotides are the last base of P1 and the first base of the template. In total, it requires five ligation rounds to translate the whole template from color space to sequence space. Due to the fact that each base is interrogated twice, a higher accuracy is expected. However, in case of a sequencing error, e.g., by an unspecific binding of a probe, correct translation to nucleotide space will fail for all subsequent bases. As a result, there are two basic philosophies for analyzing SOLiD data.
Table 2
The SOLiD color space coding table. Note that for each dinucleotide, the reverse, the complement, and the reverse complement always have the same color

             Second base
First base   A        C        T        G
A            Blue     Green    Yellow   Red
C            Green    Blue     Red      Yellow
T            Yellow   Red      Blue     Green
G            Red      Yellow   Green    Blue
While some tools skip the translation and process the SOLiD data directly in color space (e.g., mapping the data to a color space version of the reference genome), others actually translate it to sequence space. Note that the SOLiD color space has some interesting properties: for an unknown color x, the color signal x-blue-blue-blue-blue may translate to poly-A, poly-C, poly-T, or poly-G (Table 2). It was specifically designed so that for each dinucleotide the reverse, the complement, and the reverse complement have the same color. Another sequencing-by-ligation approach, called combinatorial probe anchor ligation (cPAL), is used by Complete Genomics. This company currently offers sequencing services only. Complete Genomics was the first to announce the sequencing of a whole human genome with a coverage of 78-fold for less than 5,000 US$ (6).
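A minimal sketch of this translation step is given below (in Python); the color table implements Table 2 with color names instead of the digit codes used in practice, decoding always starts from the known last base of the P1 primer, and the function names and example inputs are illustrative.

# Translate a SOLiD color-space read to sequence space (cf. Table 2).
COLOR = {}
for first, row in zip("ACTG", (("blue", "green", "yellow", "red"),
                               ("green", "blue", "red", "yellow"),
                               ("yellow", "red", "blue", "green"),
                               ("red", "yellow", "green", "blue"))):
    for second, color in zip("ACTG", row):
        COLOR[first + second] = color  # dinucleotide -> color, as in Table 2

def decode(last_primer_base, colors):
    """Decode a color read; each base serves as antecedent for the next."""
    seq, prev = [], last_primer_base
    for c in colors:
        # the antecedent base plus the color determine the next base uniquely
        prev = next(b for b in "ACGT" if COLOR[prev + b] == c)
        seq.append(prev)
    return "".join(seq)

if __name__ == "__main__":
    print(decode("T", ["green", "blue", "red", "blue"]))   # -> GGAA
    # a single miscalled first color corrupts all downstream bases:
    print(decode("T", ["yellow", "blue", "red", "blue"]))  # -> AAGG

The second call illustrates the caveat discussed above: one erroneous color shifts the translation of every subsequent base.
2.1.5. Other Approaches to Sequencing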
Nanopore sequencing is an alternative approach to HTS. The initial concept was based on threading a DNA molecule through a staphylococcal alpha-hemolysin pore. During passage, the DNA would alter a current applied to the molecule in a way that is specific for its nucleotide composition. Newer publications investigate the usage of covalently incorporated adapters to detect single nucleotides cleaved from DNA fragments upon exonuclease treatment by means of current modulations (7). Under optimal conditions this approach was reported to achieve acceptable error rates. A different technology, currently developed by Pacific Biosciences, locally immobilizes the polymerases on a glass surface. Each polymerase is surrounded by a zero-mode waveguide (ZMW) that confines light to volumes (typically <10⁻²¹ L) with dimensions smaller than its wavelength (8). Using ZMWs, light signals generated upon the integration of labeled nucleotides can be distinguished from the background. While the reads are reported to be significantly longer, error rates close to 10% make further improvements necessary before the technology is ready for the markets.
2.1.6. Paired-Ends
The sequencing of paired ends helps to overcome the problem of ambiguity caused by short read lengths and improves assembly results. By using DNA circularization during the library construction, shorter fragments can be obtained from both ends of a larger genomic fragment with a typical length of 2–5 kb. The pair of small DNA fragments is then sequenced in the same cluster or well. The prior knowledge of the approximate distance between the two fragments helps to prevent the misplacement of the reads during alignment and helps to assemble repetitive regions. Furthermore, it is useful for detecting structural variations such as copy number variations.
Note that there is some confusion about the terminology. In the Illumina protocol, the process described above is termed mate-pair sequencing, while the 454 protocol refers to it as paired-end sequencing. Illumina's paired-end sequencing does not require DNA circularization: a small modification of the single-end protocol allows the sequencing of a DNA fragment of about 200–500 bp from both ends simultaneously.
3. Methods
As mentioned above, the sheer amounts of data demand efficient and fast algorithms for the bioinformatics analysis. In principle, the analysis already begins with the base calling, i.e., the decoding of electromagnetic signals into nucleotide sequences and associated quality values. Meanwhile, several independent groups have proposed alternative base calling approaches with improved results (9–11). These concepts range from Bayesian base calling methods to approaches using support vector machines. Here, we only focus on some standard input and output formats as well as the two most common forms of sequence data analysis: mapping and assembly.
3.1. Basic Data Analysis and Evaluation
For further downstream analysis, base callers typically report the sequences together with their quality values. While the Illumina platform reports the sequences along with the quality values in a multiple FASTQ format, reads and quality values from the 454 GS FLX are often exported from the binary SFF format (Standard Flowgram Format) to two separate multiple FASTA files. However, the Genome Sequencer Data Analysis suite that comes with the sequencer offers several tools to process the SFF files directly. A multiple FASTA file contains multiple sequences, each preceded by a header line. The header starts either with the symbol “>” or “;”. It holds the basic identifiers and sequence information. All subsequent lines hold the sequence itself. For 454 reads, the header line starts with a unique identifier followed by the read length, the coordinates of the bead, and the date of the sequencing run.

>FW8YT1Q01B9VMY length=380 xy=0815_2008 region=1 run=R_XXXX
TGATCTTACTATCATTGCAAAGCCACTTAAAGACCACACACTACGTCACTGGAAAAGAGT
TCAATAGAGGCCTCCTACGAGTAACACCCTTACACTTCTGCTACAGAAACTACACCTTTT
Quality values for 454 reads are given in a separate file. The assignment of reads to quality values is possible via the header line.

>FW8YT1Q01B9VMY length=380 xy=0815_2008 region=1 run=R_XXXX
37 39 39 39 39 28 28 31 33 37 37 35 35 35 35 35 35 37 27 31 31 37 37 37 37 37 37 38 37 37 37 37 39 39 39 39 39 39 39 39 39 37 37 33 32 32 30 32 32 32 19 19 15 15 15 15 23 16 30 29 …
The FASTQ format is a slight modification of the FASTA format.

@solexaY:7:1:6:1669/1
GCCAGGNTCCCCACGAACGTGCGGTGCGTGACGGGC
+solexaY:7:1:6:1669/1
``a`aYDZaa``aa_Z_`a[`````a`_P][[\_\V
The FASTQ header begins with an “@” symbol. Again, the following line holds the sequence itself. For Illumina reads, the header informs about the name of the instrument (solexaY), the flow cell lane (7), the tile number within the flow cell (1), the x- and y-coordinates of the cluster (6:1669), and a flag indicating whether the read is single-end (/1) or belongs to a mate-pair or paired-end run (/2). The “+” sign followed by the same sequence identifier indicates the beginning of the quality value string. Note that the qualities are given in ASCII encoding. The quality values give an estimate of the accuracy of the base calling. Nowadays, most sequencing platforms report a Phred quality score. The score, originally developed in the context of the Human Genome Project, is given by
Q = −10 · log10(p),

where p is the probability that the reported base is incorrect. Illumina initially decided to deviate from this scoring and instead used the formula

Q_Solexa = −10 · log10( p / (1 − p) ).

While the Illumina quality score Q_Solexa is asymptotically identical to Q for low error probabilities, it is typically smaller for higher error probabilities. Since the Illumina quality scores can become negative, a conversion to real Phred scores using

Q = 10 · log10( 1 + 10^(Q_Solexa/10) )
may be necessary. While high Illumina quality scores have been reported to overestimate the base calling accuracy, low scores underestimate it (10, 12). Since version 1.3, the proprietary Solexa pipeline uses Phred scores. It is important to note that the encoding of the quality string has also been subject to changes. The new pipeline encodes the Phred qualities from 0 to 62 in a nonstandard way using the ASCII characters 64 to 126. Due to the fact that Phred scores from 0 to 93 are normally encoded using the ASCII characters 33 to 126, a conversion might be necessary.
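The following short sketch (in Python) illustrates both conversions described above – the Solexa-to-Phred transformation and the decoding of ASCII-encoded quality strings with the two common offsets; the function names are ours and not part of any vendor pipeline.

import math

def solexa_to_phred(q_solexa):
    """Convert an (old) Solexa quality score to a standard Phred score."""
    return 10 * math.log10(1 + 10 ** (q_solexa / 10))

def decode_quality(qual_string, offset=33):
    """Decode an ASCII quality string (offset 33 for Phred, 64 for Solexa)."""
    return [ord(ch) - offset for ch in qual_string]

if __name__ == "__main__":
    print(round(solexa_to_phred(-5), 2))     # 1.19: low scores differ markedly
    print(round(solexa_to_phred(40), 2))     # 40.0: high scores nearly identical
    print(decode_quality("II5", offset=33))  # [40, 40, 20]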
3.2. Mapping
In genome informatics, mapping describes the process of generating a (mostly heuristic) alignment of query sequences to a reference genome. It is the basis for qualitative as well as quantitative analyses. To map HTS sequences, the algorithms have to address three different problems at once: in addition to the tremendous amount of sequence data, the methods have to deal with lower data quality and shorter read lengths. Particularly for short (erroneous) reads it is often not possible to decide on the original position in the reference genome, since the reads may align equally well to several genomic locations. Sequencing of repetitive regions complicates this problem even more. The methods presented here apply different mapping policies to tackle those problems. To address the problem of the huge amount of data, most of the short read alignment programs use index structures, either for the reads or for the reference.
3.2.1. Mapping with Hash Tables
Heng Li et al. developed one of the first read mappers for Illumina sequences based on hash tables, MAQ. Although the tool is no longer supported, a look into the core of this approach reveals some basic principles and policies of short read mapping. The focus of MAQ is to incorporate quality values to facilitate and improve read mapping (13). By default, MAQ indexes only the first 28 bp of the reads (the seed) in six different hash tables, ensuring that all reads with at most two mismatches may be found in the genome. Equivalently, for a seed of 8 bp, the hash tables would be built from three pairs of complementary templates, 11110000, 00001111, 11000011, 00111100, 11001100, and 00110011, where a 1 indicates a base that is included in the hash key generation. After this indexing step, MAQ proceeds by scanning the reference sequence once for each pair of complementary templates. Each time a seed hit is encountered, MAQ attempts to extend the hit beyond the seed and scores it according to the quality values. It has been reported earlier that the use of quality values during the read alignment can improve the mapping results substantially (14). By default, MAQ reports all hits with up to two mismatches – but its algorithm is able to find only 57% of the reads with three mismatches. Hits with insertions and deletions (indels) are not reported. Furthermore, for reads with multiple equally scoring best hits only one hit is reported.
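The template-based indexing can be sketched in a few lines of Python; the reads and names below are illustrative, and real MAQ of course implements this in C with 28-bp seeds:

# Spaced-seed hashing in the spirit of MAQ, for the 8 bp illustration:
# each template projects the seed onto half of its positions, and the
# six templates guarantee that any seed with at most two mismatching
# positions matches exactly in at least one hash table.
TEMPLATES = ["11110000", "00001111", "11000011",
             "00111100", "11001100", "00110011"]

def hash_key(seed, template):
    """Keep only the seed positions marked '1' in the template."""
    return "".join(b for b, t in zip(seed, template) if t == "1")

def index_reads(reads):
    """One hash table per template, mapping projected keys to read ids."""
    tables = [{} for _ in TEMPLATES]
    for rid, read in enumerate(reads):
        for table, tpl in zip(tables, TEMPLATES):
            table.setdefault(hash_key(read[:8], tpl), []).append(rid)
    return tables

if __name__ == "__main__":
    tables = index_reads(["ACGTACGT"])
    # a genomic 8-mer with two mismatches in the last four positions
    # still hits the read via the first template (positions 1-4):
    print(tables[0].get(hash_key("ACGTACCA", TEMPLATES[0])))  # [0]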
3.2.2. Mapping with Suffix Arrays and the Burrows–Wheeler Transform
A second approach to short read alignment uses the Burrows–Wheeler Transform (BWT). In brief, the BWT is a permutation of some text T, e.g., a reference genome, obtained by sorting all of its cyclic rotations. Its main advantage is that the BWT of T contains stretches of repeated symbols – making the compression of T more effective. The backward search algorithm (15) on a compressed BWT simulates a fast traversal of a prefix tree for T – without explicitly representing
the tree in memory. It only requires two arrays to efficiently access the compressed BWT, which is the key to the speed and the low memory footprint of read aligners such as BWA (16), Bowtie (17), and SOAP2 (18). Because the backward search only finds exact matches, additional algorithms for inexact searches had to be devised. BWA, for example, solves this problem by enumerating alternative nucleotides to find mismatches, insertions, and deletions, while SOAP2 employs a split alignment strategy. Here, the read is split into two parts to allow a single mismatch, insertion, or deletion: the mismatch can exist in at most one of the two fragments at the same time. Likewise, the read is split into three fragments to allow two mismatches, and so forth. Other tools such as Bowtie do not allow short read alignments with gaps. BWT-based read mappers are the speed champions among short read aligners – with an exceptionally low memory footprint. However, for all of the tools described above, the user has to carefully choose a threshold for a maximum number of acceptable errors. For error thresholds >2 mismatches, insertions, or deletions, the speed decreases significantly. While these thresholds seem to be sufficient for mapping of genomic DNA, mapping of transcriptome data or data that contains contaminations (e.g., linkers) may be more difficult. In contrast, the tool segemehl (19), based on enhanced suffix arrays, aims to find a best local alignment with increased sensitivity. In a first step, exact matches of all substrings of a read and the reference genome are computed. The exact substring matches are then modified by a limited number of mismatches, insertions, and deletions. The set of exact and inexact substring matches is subsequently evaluated using a fast and accurate alignment method. While the program shows good recall rates of 80% even for high error rates of around 10%, it has a significantly larger memory footprint in comparison with the BWT and hashing methods. A practical example is given at the end of this chapter (see Note 1). The selection of an appropriate mapping method depends on various criteria (see Note 2). Due to the different indexing techniques, some short read aligners are limited to certain read lengths. These tools may not be used if long reads or reads of different sizes need to be aligned. Furthermore, for speed reasons some aligners report only one hit per read – regardless of whether multiple equally good hits could be obtained. This may be a problem if repetitive regions are sequenced. The user has to assess carefully which degree of sensitivity is needed. A method that discards reads with multiple hits (sometimes a random hit is reported) or high error rates may be suitable for SNP detection, while mapping of transcriptome (RNA-seq) data may require a higher sensitivity. A selection of mapping tools is given at the end of the chapter (see Note 3).
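A naive but complete sketch of the backward search is given below (in Python); production tools build the BWT via suffix arrays and replace the naive occurrence counts with compressed rank structures, so this toy version only illustrates the principle:

# Burrows-Wheeler transform and backward search (toy version).
def bwt(text):
    text += "$"  # unique, lexicographically smallest terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text."""
    # C[c]: number of symbols in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    occ = lambda c, i: bwt_str[:i].count(c)  # Occ(c, i), naive version
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):  # the pattern is matched back to front
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    b = bwt("ACGTACGTTACG")
    print(backward_search(b, "ACG"))  # 3 exact matches
    print(backward_search(b, "GGG"))  # 0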
3.2.3. SAM/BAM Mapping Output Format
Because most of the read mapping tools have their own output formats, a standard output format for short read aligners was developed in the context of the 1000 Genomes Project (http://www.1000genomes.org) (20). The Sequence Alignment/Map (SAM) format is a human readable tab-delimited format. A binary equivalent (BAM) is intended to facilitate the parsing with computer programs. The SAM format contains a header and an alignment section. A typical header section starts with a mandatory header line (@HD) that holds the file format version (VN:1.0). Sequence dictionaries (@SQ) hold the names (SN:chr20) and the lengths (LN:62435964) of the reference sequences to which the reads in the alignment section are mapped.

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
To identify different biological samples, the SAM file may also hold one or more read groups (@RG). Each group has to have a unique identifier (ID:L1, ID:L2) and the name of the sample (SM:NA12891) from which the reads were obtained. Additionally, the platform unit (PU:SC_1_10), e.g., the lane of the Illumina flow cell, or the library name (LB:SC_1) can be given. The alignment section holds all read alignments. A typical alignment line like

read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
has the format

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

where the <QNAME> field holds the name of the query sequence (or sequence pair), <RNAME> the name of the reference sequence, and <POS> the position in the reference sequence. The mapping quality value is stored in the <MAPQ> field. The extended <CIGAR> string is a compact representation of the read alignment. It is comprised of a series of operation lengths plus the operation types. While the conventional CIGAR format only allows for three types of operations (M for match or mismatch, I for insertion, and D for deletion), the extended CIGAR also identifies clipping, padding, and splicing operations. Finally, the fields <SEQ> and <QUAL> hold the read sequence and the corresponding quality values. A complete description of the SAM and BAM formats can be obtained from http://samtools.sourceforge.net.
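A minimal parser for a single SAM alignment line, following the column layout above, might look as follows in Python (a sketch without any validation; real projects should rely on SAMtools or an established library):

# Parse one SAM alignment line into mandatory fields and optional tags.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]

def parse_sam_line(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "MPOS", "ISIZE"):
        record[key] = int(record[key])
    # optional fields are TAG:VTYPE:VALUE triplets
    record["tags"] = dict((tag, value) for tag, vtype, value in
                          (field.split(":", 2) for field in cols[11:]))
    return record

if __name__ == "__main__":
    line = ("read_28833_29006_6945\t99\tchr20\t28833\t20\t10M1D25M\t=\t"
            "28993\t195\tAGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG\t"
            "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<\tNM:i:1\tRG:Z:L1")
    rec = parse_sam_line(line)
    print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["tags"]["RG"])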
This output format has quickly advanced to a standard, and many read mappers offer a SAM/BAM-compatible output.
3.3. Assembly of Short Read Data
The advent of HTS has raised hopes to quickly and inexpensively perform de novo assemblies of large genomes. However, shorter read lengths and higher error rates have spoiled all too optimistic expectations. One of the first tools for short read assembly, SSAKE (21), employs a greedy method to build larger contigs from short Illumina reads. After building a hash table holding unique read sequences, a prefix tree indexes all such sequences. The assembly starts with the most abundant unique sequence. All 3′-most k-mers, i.e., substrings of length k at most m characters from the 3′ end of the sequence, are looked up in the prefix tree. All hits are used to build the first consensus contig. This consensus is then used to find the next set of k-mers. This process is iterated until all possibilities of the contig extension are exhausted. While such a simple method works well for small genomes, the assembly of larger genomes from short reads is rather cumbersome. Zerbino et al. (22) used a more sophisticated de Bruijn graph approach in their program called “Velvet”. Each node in the graph represents a series of overlapping k-mers, such that two adjacent k-mers overlap by k − 1 characters. Two nodes A and B are connected by a directed edge if the last k-mer of node A overlaps with the first of B. Hence, not only single reads, but a whole series of overlapping reads can be modeled as a path through the graph. The authors report that Velvet is capable of assembling bacterial genomes with N50 contig lengths of up to 50 kb. In simulations with 5-Mb regions of large mammalian genomes, contigs were ~3 kb long. If available, both applications make use of mate-pair information. In the future, alternative approaches that combine the high coverage provided by short read sequencers such as Illumina with longer 454 and Sanger reads may prove to be more effective when it comes to the assembly of larger vertebrate genomes. However, another tool that employs de Bruijn graphs, SOAPdenovo, was used to successfully assemble mammalian genomes from single-end and mate-pair Illumina sequences only (23, 24). A list of selected tools is given at the end of the chapter (see Note 4).
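The de Bruijn idea can be sketched compactly in Python; this toy version ignores the sequencing errors, tips, and bubbles that real assemblers such as Velvet must resolve, and the reads and parameters are illustrative:

# Toy de Bruijn graph: nodes are k-mers, edges connect k-mers that
# overlap by k-1 characters within some read.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph

def extend_contig(graph, start):
    """Greedily walk unambiguous edges to grow a contig."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in seen:       # stop on cycles
            break
        seen.add(node)
        contig += node[-1]     # each step contributes one new character
    return contig

if __name__ == "__main__":
    g = de_bruijn(["ACGTAC", "GTACGG", "TACGGA"], k=4)
    print(extend_contig(g, "ACGT"))  # reconstructs ACGTACGGA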
3.4. Other Applications
The most important goal of personalized genomics is the detection of variations such as single nucleotide polymorphisms (SNPs). HTS for the first time offers the opportunity to detect previously unknown SNPs with minor allelic variants on a large scale. Therefore, some vendors of HTS platforms such as Illumina provide tools to call SNPs directly from the sequencing data.
Alternative methods for SNP calling are, for example, offered by MAQ (13) or SOAPsnp (25). The primary prerequisite for SNP calling is a high quality of the sequencing data. To ensure this, typically all reads with multiple hits, reads with low overall qualities, and reads with more than one mismatch are discarded. The success of SNP calling in HTS data depends not only on the quality of the reads but also on the coverage. After mapping the reads to a reference, the cross-section at sufficiently covered (>10 reads) genomic positions is checked for polymorphisms. To do this, many SNP callers employ Bayesian statistics. For example, SOAPsnp assumes a set of ten different genotypes

T_i = H_m H_n ∈ {AA, CC, GG, TT, AC, AG, AT, CG, CT, GT},

where H_m and H_n denote the two haplotypes of the genotype T_i at some position i of the genome. To obtain an estimate of the conditional probability of the data one may calculate

P(D | T) = ∏_{k=1}^{l} [ P(d_k | H_m) + P(d_k | H_n) ] / 2,

where l is the number of observed alleles in the cross-section and P(d_k | H) is the probability of observing the allele d_k under the hypothesis H. The posterior probability for a genotype T_i given the HTS data D is then evaluated with

P(T_i | D) = P(T_i) P(D | T_i) / Σ_{x=1}^{10} P(T_x) P(D | T_x),

where the probability of a genotype P(T) is usually calculated using prior knowledge. To reduce false-positive SNP calls, SOAPsnp additionally considers quality values. A similar approach was chosen for the Helicos pipeline (5). However, an important drawback of the approach sketched above is that successful base calling depends on an equally successful coverage of both haplotypes, and it may only be used for single individuals but not for pooled samples. It furthermore assumes that only two nucleotides segregate per site, so that autosomal mutations may be missed. An interesting alternative maximum-likelihood approach for analyzing pooled samples was published by Michael Lynch (26).
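The genotype posterior defined above can be computed directly, as the following simplified Python sketch shows; the flat prior and the uniform per-allele error rate are placeholder assumptions, and SOAPsnp additionally weights each observation by its quality value:

# Bayesian genotype calling for one cross-section of mapped alleles.
GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]

def p_allele(allele, hap, error=0.01):
    """P(d_k | H): observed allele matches the haplotype base or is an error."""
    return 1 - error if allele == hap else error / 3

def likelihood(alleles, genotype, error=0.01):
    """P(D | T) = prod_k [P(d_k | H_m) + P(d_k | H_n)] / 2."""
    hm, hn = genotype
    prob = 1.0
    for d in alleles:
        prob *= (p_allele(d, hm, error) + p_allele(d, hn, error)) / 2
    return prob

def posteriors(alleles):
    prior = 1.0 / len(GENOTYPES)  # flat prior as a placeholder assumption
    joint = {g: prior * likelihood(alleles, g) for g in GENOTYPES}
    norm = sum(joint.values())
    return {g: p / norm for g, p in joint.items()}

if __name__ == "__main__":
    post = posteriors(list("AAAAAAGGGGGG"))  # 12-fold coverage, balanced alleles
    print(max(post, key=post.get))           # the heterozygote AG wins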
4. Notes
1. Meanwhile, there are several tools available to map reads to a reference genome. The selection of the appropriate read mapper depends on several criteria such as accuracy or speed. Here, a mapping run with segemehl is given as an example. The program can be downloaded at http://www.bioinf.uni-leipzig.de/Software/segemehl. The program should compile on all LINUX systems with a C99-compatible C compiler and 2 GB of free memory. To generate an executable binary, type

tar -xvzf segemehl_0_0_*.tar.gz

and then call

make

to compile the program. To start, a set of short reads and a reference genome is needed. Reads for Arabidopsis thaliana may be obtained at http://www.bioinf.uni-leipzig.de/~steve/omics/arabidopsis.fna. The A. thaliana reference sequence is available at the website of the plant genome database (http://www.plantgdb.org/download/Download/xGDB/AtGDB/ATgenomeTAIR9.171). To map the reads, just call

./segemehl.x -x ATgenomeTAIR9.171.idx -d ATgenomeTAIR9.171 -q arabidopsis.fa > arabidopsis.map

With the same call, segemehl generates an index (ATgenomeTAIR9.171.idx) of the Arabidopsis genome. In this example most of the time is spent on the index construction. If the index was already built, segemehl may be called with

./segemehl.x -i ATgenomeTAIR9.171.idx -d ATgenomeTAIR9.171 -q arabidopsis.fa > arabidopsis.map

To increase the sensitivity, the option -D 2 may be given. In case the program is run on a multi-core architecture, --threads 4 will parallelize the task in four threads. The minimum required accuracy of the alignment may be changed using the -A parameter. The output file arabidopsis.map now contains the mapping information. Note that, unlike other tools, segemehl by default also reports reads that map to multiple sites. The mapping file contains a description of the fields in the header line. This file may be used to generate BED or other file formats to visualize the mapping data in genome browsers.
2. Although HTS methods are already well established, some issues may still be a nuisance and a common source of error in data analysis. The error models of the technologies are very different.
While mismatches are the major error type in Illumina sequences, 454 sequences suffer from insertions and deletions – especially in homopolymers. As mentioned earlier, the rate of indels increases significantly along homopolymer stretches, making it cumbersome to call SNPs in those regions. Some sequencers such as Helicos are reported to have much higher error rates. Moreover, the reference sequences themselves are not free of errors. Due to the additional RT-PCR step, RNA-seq data is less accurate than genomic data. Therefore, especially small RNAs such as miRNAs require additional sensitivity. Before mapping, RNA sequences should be scanned for poly-A tails. Clipping of these tails facilitates the mapping process, as, e.g., A-rich genomic stretches may misguide the mapping procedure (a minimal clipping sketch is given after the tool lists below). As pointed out earlier, short read lengths make the assembly of larger contigs very difficult. Hence, assemblies with HTS data have to be planned carefully. An approach that uses technologies providing longer reads, or a combination of different sequencing methods, is more likely to succeed.
3. Selected mapping programs
–– MAQ (13) is one of the first tools for mapping of HTS reads with hash tables. The tool only considers mismatches and was developed for Illumina reads only (http://maq.sourceforge.net/maq-man.shtml).
–– BWA (16) is a very fast short read aligner based on the Burrows–Wheeler transform that also allows the detection of insertions and deletions. It is not limited to a specific platform or read length. The tool typically allows only a few errors per read (http://bio-bwa.sourceforge.net/bwa.shtml).
–– Bowtie (17) is also based on a Burrows–Wheeler transform. This tool currently does not support the detection of indels and works for Illumina reads only. Only a few errors per read are allowed (http://bowtie-bio.sourceforge.net/tutorial.shtml).
–– SOAP2 (18) is an alternative to BWA (http://soap.genomics.org.cn/).
–– segemehl (19) is a sensitive read aligner with indel detection support. The program does not depend on fixed read lengths and is platform-independent. It allows mapping of sequences with higher error rates but has a large memory footprint (http://www.bioinf.uni-leipzig.de/Software/segemehl; http://ngslib.genome.tugraz.at/node/36).
4. Selected assembly programs
–– SSAKE (21) is a sequence assembly program for Illumina reads based on a greedy hashing method. The program is capable of assembling short genomes (http://www.bcgsc.ca/platform/bioinfo/software/ssake).
–– Velvet (22) is based on a de Bruijn graph method (http://www.ebi.ac.uk/~zerbino/velvet).
–– SHARCGS (27) is one of the first sequence assemblers for Illumina reads. It uses a strategy similar to that of SSAKE (http://sharcgs.molgen.mpg.de/).
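The poly-A clipping mentioned in Note 2 can be sketched as follows (in Python); the tail length and mismatch thresholds are illustrative assumptions, not values prescribed by any particular pipeline:

# Trim a 3' poly-A tail from an RNA read before mapping, tolerating
# a small number of non-A bases within the tail.
def clip_poly_a(read, min_tail=6, max_mismatch=1):
    mismatches, cut = 0, len(read)
    for i in range(len(read) - 1, -1, -1):   # scan from the 3' end
        if read[i] != "A":
            mismatches += 1
            if mismatches > max_mismatch:
                break
        cut = i
    if len(read) - cut >= min_tail:          # clip only a convincing tail
        return read[:cut]
    return read

if __name__ == "__main__":
    print(clip_poly_a("ACGTGCTAGCAAAAAAAA"))  # tail removed
    print(clip_poly_a("ACGTGCTAGCAAG"))       # no sufficient tail: unchanged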
Acknowledgments
Many thanks go to Maribel Hernandez Rosales, Dulce Palafox, Ishaan Gupta, Sven Findeis, Dominic Rose, and Jörg Hackermüller for fruitful discussions and proof-reading the manuscript. This publication is supported by LIFE-Leipzig Research Center for Civilization Diseases, Universitaet Leipzig. This project was funded by means of the European Social Fund and the Free State of Saxony.
References
1. Pushkarev, D., Neff, N. F., and Quake, S. R. (2009) Single-molecule sequencing of an individual human genome. Nat Biotechnol 27, 847–52.
2. Pandey, V., and Nutter, P. E. (2008) Next-generation genome sequencing: towards personalized medicine. Wiley, New York.
3. Margulies, M., Egholm, M., Altman, W. E., et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376–80.
4. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–9.
5. Harris, T. D., et al. (2008) Single-molecule DNA sequencing of a viral genome. Science 320, 106–9.
6. Drmanac, R., Sparks, A. B., Callow, M. J., et al. (2009) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327, 78–81.
7. Clarke, J., Wu, H.-C., Jayasinghe, L., Patel, A., Reid, S., and Bayley, H. (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4, 265–70.
8. Eid, J., Fehr, A., Gray, J., et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science 323, 133–8.
9. Quinlan, A. R., Stewart, D. A., Stromberg, M. P., and Marth, G. T. (2008) Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5, 454–7.
10. Kircher, M., Stenzel, U., and Kelso, J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biol 10, R83.
11. Erlich, Y., Mitra, P. P., de la Bastide, M., McCombie, W. R., and Hannon, G. J. (2008) Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods 5, 679–82.
12. Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 36, e105.
13. Li, H., Ruan, J., and Durbin, R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18, 1851–8.
14. Smith, A. D., Xuan, Z., and Zhang, M. Q. (2008) Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinform 9, 128.
15. Ferragina, P., and Manzini, G. (2000) Opportunistic data structures with applications. Proceedings 41st Annual Symposium on Foundations of Computer Science, 390–8.
16. Li, H., and Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–60.
17. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25.
18. Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–7.
19. Hoffmann, S., Otto, C., Kurtz, S., Sharma, C. M., Khaitovich, P., Vogel, J., Stadler, P. F., and Hackermüller, J. (2009) Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comput Biol 5, e1000502.
20. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–9.
21. Warren, R. L., Sutton, G. G., Jones, S. J. M., and Holt, R. A. (2007) Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23, 500–1.
22. Zerbino, D. R., and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–9.
23. Li, R., Zhu, H., and Wang, J. (2009) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res doi:10.1101/gr.097261.109.
24. Li, R., et al. (2009) The sequence and de novo assembly of the giant panda genome. Nature doi:10.1038/nature08696.
25. Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K., and Wang, J. (2009) SNP detection for massively parallel whole-genome resequencing. Genome Res 19, 1124–32.
26. Lynch, M. (2009) Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182, 295–301.
27. Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2007) SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 17, 1697–706.
Chapter 10
Analysis of Single Nucleotide Polymorphisms in Case–Control Studies
Yonghong Li, Dov Shiffman, and Rainer Oberbauer
Abstract
Single nucleotide polymorphisms (SNPs) are the most common type of genetic variant in the human genome. SNPs are known to modify susceptibility to complex diseases. We describe and discuss methods used to identify SNPs associated with disease in case–control studies. An outline of study population selection, sample collection, and genotyping platforms is presented, complemented by SNP selection, data preprocessing, and analysis.
Key words: Single nucleotide polymorphism, Case–control association studies, Genotyping, Genetic analysis, Linkage disequilibrium
1. Introduction
The goal of a genetic analysis is to identify genetic variants that are associated with an outcome (a phenotype, a quantitative trait, or a disease) of interest. In this chapter, we discuss the relevance of genetic variation in common complex human diseases. As of November 2009, 505 genome wide association studies (GWAS) of sufficient size have been reported (see the HuGE Navigator, http://www.hugenavigator.net). Here, we focus on a class of common genetic variants called SNPs (Single Nucleotide Polymorphisms) – an exchange of one nucleotide for another – which occur frequently (1 in every 300–500 base pairs [bp]) along the 3 billion bp of the human genome (see the haplotype map of the human genome, http://www.hapmap.org) (1). For most SNPs, it is currently unknown whether they are relevant to disease pathology or simply represent normal interindividual and interpopulation variability.
SNPs that occur in both coding and noncoding regions of genes, as well as in intergenic regions, may have functional consequences. SNPs can influence gene function by changing the encoded amino acid (nonsynonymous SNPs). SNPs in the noncoding regions of the genome can influence promoter activity, microRNA binding, mRNA stability, as well as the subcellular localization of mRNA. These functional consequences are the biological cause for the association of SNPs with human diseases. GWAS have identified at least 16 nonsynonymous SNPs associated with human diseases or traits (2) with a high level of statistical significance (P < 5 × 10−8). However, the function of the majority of the disease-associated SNPs has not been characterized. SNPs make a modest contribution to the risk of common complex diseases (Fig. 1, and see Note 1), but their identification in human diseases can enhance our understanding of biological processes as well as of diseases, on an individual basis as well as from a population viewpoint. For example, recent genetic analyses of SNPs have implicated genes in the IL-12/IL-23 and TNFα pathways in psoriasis and highlighted the involvement of the TNFα and NF-κB pathways in rheumatoid arthritis (3). Insights from the genetic studies may ultimately help improve disease diagnosis, prognosis, and therapy.
[Figure 1 – axes: effect size (OR) versus allele frequency; regions: “Single Allele Causing Mendelian Disorders” (rare – difficult to find) and “CVCD Population Genetics – Multiple Alleles (Haplotypes) involved” (difficult to identify).]
Fig. 1. Allele frequencies and effect sizes of genetic variants in human diseases. Disease-causing single allele mutations are rare, but show high penetrance (Mendelian inheritance). On the other hand, the contribution of common genetic variants to common disease (CVCD) only explains a small proportion of a phenotype (population genetics). The other extremes are very low allele frequencies with little impact, which may be detected only by sequencing the locus (lower left corner), and the very rare diseases caused by highly prevalent SNPs (upper right corner). Adapted from McCarthy et al. (36).
2. Materials 2.1. Study Population Selection
The data needed for genetic analyses are genotypes and phenotypes for a given study population. In this chapter, we primarily describe the experiments and analyses to identify and validate disease-associated SNPs, and we limit our discussion to case–control studies of unrelated individuals. Case–control studies are relatively easy to assemble and are thus most frequently used to investigate the genetics of common, complex diseases. Depending on the objective of the study (e.g., discovery study or validation study) and the scale (e.g., candidate gene approach or whole genome association study), several of the basic experiments and analyses outlined in the following sections can be performed. Discussion of more advanced genetic analyses, such as gene–gene interaction and genetic pathway analysis, can be found elsewhere (4, 5).
2.2. Sample Collection and DNA Preparation
Sample collection should follow the design of the genetic study (see Note 2) and adhere to a protocol approved by an Institutional Review Board. Since a large study size improves the power to detect a genetic effect (Fig. 2, also see Note 3), an effort should be made to enroll as many participants as possible, provided they meet the inclusion and exclusion criteria. The ascertainment of the case and control definitions is crucial to the success of a genetic study, because inconsistent definitions of disease could result in a collection of cases with different diseases or different disease stages (see Note 4). For example, a broad definition of cardiovascular disease could include coronary artery narrowing, myocardial infarction, stroke, or carotid intimal-medial thickening. However, each of these conditions is likely to have its own specific genetic components, and considering them together would likely dilute the strength of the genetic association. Genomic DNA is typically prepared from blood samples, but can also be extracted from buccal swabs, saliva, dried blood spots, serum, plasma, and biopsy specimens. A detailed description of sample collection and DNA preparation can be found elsewhere (6).
2.3. Genotyping Platforms
When the goal of the study is to investigate genetic polymorphisms throughout the genome (GWAS), there are currently two dominant commercial platforms for genotyping 500,000 SNPs or more in a single reaction (Illumina, http://www.illumina.com, and Affymetrix, http://www.affymetrix.com). Both platforms provide reliable genotype data but have different workflows that could affect their suitability for different laboratory setups. Both platforms
Fig. 2. Power analysis of case–control association studies. The power to detect a SNP with an additive odds ratio of 1.25 or 1.5 and with various risk allele frequencies was determined using the QUANTO program (http://hydra.usc.edu/gxe). The significance threshold was set at 0.05 for single-marker testing or 1 × 10−8 for GWAS. The sample size refers to cases; cases and controls are matched at a 1:1 ratio.
offer a choice of standard chips (sets of SNPs). For example, Illumina currently offers chips with standard sets of >1,000,000, ~658,000, or ~300,000 SNPs. These options change rapidly, and up-to-date information is available online. Both platforms also offer chips that are custom made to include SNPs specified by the customer. Genotyping on either platform can also be performed by service providers, an option that makes sense if large-scale genotyping projects are not performed routinely. When the goal of the study is to interrogate a limited number of candidate SNPs or to understand the genetic variability in a specific genomic region, several commercial sources offer genotyping assays for specific variants "off the shelf" (TaqMan® SNP Genotyping, http://www.appliedbiosystems.com; Illumina GoldenGate and Illumina iSelect; Sequenom iPLEX, http://www.sequenom.com). In addition, PCR-based "home brew" methods can be developed for almost any genetic variant using existing technologies (7). Finally, the current reduction in cost and the improvement in speed of DNA
sequencing could make it feasible to determine the sequence of entire genomes and thus obtain complete genetic information for each individual in a study; however, the vast quantity of data that whole genome sequencing generates would create new analytical and computational challenges. These "next generation" sequencing platforms are currently available from several commercial sources (e.g. SOLiD, http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiD-System-Sequencing-B/index.htm; Illumina; 454, http://www.454.com).
3. Methods
3.1. SNP Selection
The number of SNPs that can be included in genetic association studies has expanded rapidly with advances in our understanding of the human genome. Currently, the Phase III HapMap dataset (release #27) lists genotypes and frequencies of about 4 million SNPs in the African, Asian, and Caucasian populations and over 1.4 million SNPs in other populations (http://www.hapmap.org). The objectives of a study should determine which SNPs are interrogated. When a particular locus is targeted (e.g. linkage regions or fine mapping), only SNPs in that region would be tested. However, even if the study goal is to interrogate the entire genome, the functional consequences of the SNPs, linkage disequilibrium (LD), and allele frequency are important considerations.
3.1.1. Functional SNPs
Potentially functional SNPs include those coding for missense and nonsense substitutions as well as SNPs in transcription factor binding sites, donor and acceptor splice sites, exon skipping sites, and miRNA binding sites. Since these SNPs may provide a plausible mechanistic explanation for a disease association, they have an a priori higher probability of being associated with disease. Thus, interrogating functional SNPs should be prioritized over other SNPs. For example, the missense SNP rs2476601 (Arg620Trp) in PTPN22 was found to associate with rheumatoid arthritis following testing of putative functional SNPs in candidate genes and linkage peaks (8), and rs3798220 (Ile4399Met) in LPA was identified for its association with myocardial infarction and severe coronary artery disease from genome-wide scans of over 12,000 putative functional SNPs (9, 10). Annotation of SNP types (synonymous or nonsynonymous, UTR, intron, intergenic) can be found in the dbSNP database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp); annotation of SNPs in regulatory elements such as transcription factor binding sites can be found in other databases (see, for example, http://variome.kobic.re.kr/SNPatPromoter).
3.1.2. Linkage Disequilibrium
The genetic heterogeneity of a given region can be interrogated efficiently and economically by genotyping only a subset of the SNPs, the tagging SNPs (tSNPs), that are correlated with the other SNPs. The International HapMap Project has mapped the LD structure of the genome, and information on SNP–SNP correlation (measured as D′ and r2 (11)) is publicly available (http://www.hapmap.org). This information has been used to design genotyping arrays for GWAS (http://www.illumina.com). For a study of a particular locus or gene, tSNPs can be identified with the Tagger function of the HaploView program (http://www.broadinstitute.org/haploview/haploview). Given a genomic region, the ethnicity of the study population, a desired tagging procedure (pairwise or multimarker), an LD threshold, and a minimal allele frequency to be captured, Tagger generates a list of tSNPs and provides LD information between the tSNPs and the SNPs they tag. When an assay for a particular tSNP is not available or cannot be designed, another SNP in the same tagging group should be investigated; note that such an alternative SNP may not be sufficient to tag all other SNPs in the group.
3.1.3. Allele Frequency
SNPs have allele frequencies ranging from rare (loosely defined as minor allele frequency [MAF] < 0.5%) through low (0.5% < MAF < 5%) to common (MAF > 5%). The power of an association study is affected by the allele frequency, the effect size of the allele, and the study size (Fig. 2). In a case–control study with 3,000 cases and 3,000 controls, there is nearly 80% power to detect a SNP with an odds ratio (OR) of 1.25 (per allele) and a risk allele frequency of 5%. Bear in mind that allele frequencies may be population specific; frequency information should therefore be obtained from the population of interest. For example, the common missense SNP (rs2476601) in PTPN22 associated with rheumatoid arthritis in the Caucasian population is largely absent in the Asian population, and testing this SNP in an Asian population is therefore unlikely to be informative.
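To make these power considerations concrete, the following R sketch (not part of the original protocol) approximates the power of a per-allele test with a closed-form two-proportion calculation; the approximation, the function name, and its arguments are our own assumptions rather than the QUANTO method used for Fig. 2.

# Approximate power of a per-allele (allelic) case-control test.
# Assumes the per-allele OR acts on the allele-frequency odds and that a
# two-sided two-proportion z-test on 2N alleles approximates the trend test.
allelic.power <- function(n.cases, n.controls, maf, or, alpha) {
  odds.controls <- maf / (1 - maf)
  p.cases <- or * odds.controls / (1 + or * odds.controls)  # risk allele frequency in cases
  se <- sqrt(p.cases * (1 - p.cases) / (2 * n.cases) +
             maf * (1 - maf) / (2 * n.controls))
  pnorm(abs(p.cases - maf) / se - qnorm(1 - alpha / 2))
}
allelic.power(3000, 3000, maf = 0.05, or = 1.25, alpha = 0.05)  # ~0.8, as quoted above
allelic.power(3000, 3000, maf = 0.05, or = 1.25, alpha = 1e-8)  # far lower at a GWAS threshold

The steep drop at the genome-wide threshold illustrates why GWAS of modest effect sizes require very large samples.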
3.2. Quality Control and Data Preprocessing
Most genotyping platforms report the signal intensity of each allele for each SNP. Plotting the signal intensities of a single SNP for all the samples in the study in a scatter plot (Fig. 3) is a good first step to identify potential problems with the data. For example, Fig. 3 shows the signal intensities of the two alleles of a SNP in each of 996 samples. The three data clusters (circles) indicate the three genotype groups (major homozygotes (AA), heterozygotes (Aa), and minor homozygotes (aa)). Some samples did not yield a signal (broken circle), probably because these DNA samples were of poor quality or insufficient quantity. One sample (indicated by an arrow) gave an intermediate result (between the Aa and aa clusters); this sample should not be included in the analysis.
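A minimal R sketch of such a scatter plot, using simulated intensities (the cluster locations and sample counts below are invented for illustration and do not correspond to the data behind Fig. 3):

set.seed(42)
# Simulated allele A/B signal intensities for one SNP in 996 samples:
# three genotype clusters (AA, Aa, aa) plus a few failed, low-signal samples.
a <- c(rnorm(600, 10, 0.8), rnorm(300, 5.5, 0.8), rnorm(86, 1.2, 0.4), rnorm(10, 0.3, 0.2))
b <- c(rnorm(600, 1.2, 0.4), rnorm(300, 5.5, 0.8), rnorm(86, 10, 0.8), rnorm(10, 0.3, 0.2))
plot(a, b, xlab = "Allele A signal intensity", ylab = "Allele B signal intensity")
# Samples falling between or below the clusters should be excluded.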
Fig. 3. Cluster analysis of genotyping quality. Each dot represents signal intensities of the major and minor alleles in one sample.
Data that do not cluster into three genotype groups indicate a problematic assay, and such data should not be used.
3.2.1. Hardy–Weinberg Equilibrium
Another good quality control measure is to test for deviation from Hardy–Weinberg equilibrium (HWE) expectations. HWE defines the expected distribution of the three genotype groups of a biallelic SNP as p² + 2pq + q² = 1, where p and q are the frequencies of the two alleles (p + q = 1). These relations between the three genotype groups hold in populations that are large enough to allow random mating; although strictly random mating does not occur in human populations, HWE expectations are generally met. Deviation from HWE could indicate a study population derived from ethnic groups with different allele frequencies, a SNP on the X chromosome genotyped in a population that includes only males or both males and females, or extreme selective pressure. However, if there is no biological explanation for a deviation from HWE expectations, the deviation most likely indicates a flawed genotyping assay that under- or overreports one of the genotype groups (a minimal sketch of this test is given below). Several other tests should be considered to assess the internal consistency of the data. Genotyping some DNA samples in the study more than once provides information about genotyping accuracy, and duplicate DNA samples can also be used to identify potential plate or sample mix-ups. A Y chromosome sex assay can likewise be used to test for sample mix-ups, since the reported patient sex and the Y chromosome-determined sex should by and large match; any gross deviation from the expected result indicates a sample mix-up. Similarly, genotype information can be used to identify unintentional sample duplication.
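A minimal sketch of the HWE goodness-of-fit test described above; hwe.test is a helper written here for illustration (it is not from the chapter), and the genotype counts are invented:

# Chi-square test for deviation from HWE at one biallelic SNP.
hwe.test <- function(n.AA, n.Aa, n.aa) {
  n <- n.AA + n.Aa + n.aa
  p <- (2 * n.AA + n.Aa) / (2 * n)                   # frequency of allele A
  expected <- n * c(p^2, 2 * p * (1 - p), (1 - p)^2) # HWE-expected genotype counts
  chisq <- sum((c(n.AA, n.Aa, n.aa) - expected)^2 / expected)
  pchisq(chisq, df = 1, lower.tail = FALSE)          # df = 3 classes - 1 - 1 estimated allele frequency
}
hwe.test(640, 450, 118)  # a very small p-value would flag the assay (or the population)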
3.2.2. Imputation – In Silico Genotyping
Imputation is the prediction of the genotype of one SNP based on its known correlation with the genotypes of other SNPs in a haplotype. This method can be used to predict missing genotype calls (see Note 5), thereby increasing the power of an association study. It can also be used to predict genotypes of SNPs that were not genotyped, increasing the number of SNPs that are interrogated. Because different GWAS platforms test different sets of markers, imputation makes it possible to compare or combine (e.g. by meta-analysis) data obtained from different studies. Several tools are available for imputation (see Note 6). In principle, genotype data for the study population and haplotype information for a population of the same ethnicity are required. The output of an imputation program provides the most likely genotype of a SNP for a particular individual, based on the probability of each of the three possible genotypes. Some genotypes can be predicted with high confidence (for example, 99% probability for AA vs. 1% for Aa or aa), while others cannot (for example, 50% probability for AA, 35% for Aa, and 15% for aa). Typically, genotypes from imputed SNPs with a high probability score (e.g. >95%) are acceptable, provided that genotypes with a sufficiently high probability score are available for most (e.g. >90%) of the study population; a minimal sketch of this filtering step follows.
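The probability filter can be sketched in a few lines of R; the matrix layout below simply restates the two examples from the text, and real imputation outputs differ in format from tool to tool:

# Posterior genotype probabilities for one imputed SNP (rows = samples).
probs <- rbind(sample1 = c(AA = 0.99, Aa = 0.01, aa = 0.00),
               sample2 = c(AA = 0.50, Aa = 0.35, aa = 0.15))
best <- apply(probs, 1, max)
calls <- ifelse(best > 0.95, colnames(probs)[apply(probs, 1, which.max)], NA)
calls                 # sample1 is called "AA"; sample2 is left uncalled
mean(!is.na(calls))   # keep the SNP only if this call rate exceeds 0.90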
3.3. Association Testing
In a case–control study, a contingency table is a useful tool for initial analysis. Figure 4 presents the genotype counts among cases and controls. Using this table, the OR – that is, the odds of being a case given a specific genotype divided by the odds of being a case given the reference genotype – can be calculated. For example, the OR for the aa genotype using the AA genotype as the reference genotype is (88/118) ÷ (393/640) = 1.54 (Fig. 4, 1). Using a chi-square test, we can ask whether the genotype distribution differs between cases and controls (without assuming any inheritance model). The same table can be used to test a co-dominant model with the Cochran–Armitage trend test (12), that is, asking whether the effect size of aa is greater than that of Aa, which in turn is greater than that of AA. Dominant and recessive models can be tested by collapsing the heterozygote row (Aa) with either the minor or the major homozygotes to test specific dominant (Fig. 4, 2) or recessive (Fig. 4, 3) inheritance models. If there is an a priori reason to test a specific inheritance model, only that model should be tested in order to reduce multiple testing.
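The table-based calculations above are easily reproduced in R. The aa and AA counts below are those quoted in the text; the heterozygote counts are invented placeholders, since Fig. 4 is not reproduced here:

counts <- matrix(c(393, 480, 88,      # cases:    AA, Aa, aa (Aa invented)
                   640, 600, 118),    # controls: AA, Aa, aa (Aa invented)
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("case", "control"), c("AA", "Aa", "aa")))
(88 / 118) / (393 / 640)                             # OR of aa vs. AA, as in the text
chisq.test(counts)                                   # genotype distribution, no inheritance model
prop.trend.test(counts["case", ], colSums(counts))   # Cochran-Armitage trend test (12)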
Fig. 4. Case–control association testing. Contingency tables and corresponding statistics are presented.
If the inheritance model is not known, typically only the codominant model is tested. The codominant model can detect associations under other inheritance models, albeit with lower power (11). Since other risk factors can affect genetic associations, established risk factors can be taken into account using regression models (e.g. logistic regression in this case–control setting) that include the genetic variant as well as the other risk factors. Multiple testing in genetic analysis, as in most "Omics" investigations, is a persistent problem (see Note 7). Several approaches to multiple testing correction have been implemented in genetic testing, including Bonferroni correction (13), the false discovery rate (14), and, more recently, a Bayesian approach (15); the first two are illustrated below. Finally, a combined analysis of several studies can be used to increase the power to detect association, provided the combined studies have a similar endpoint and are drawn from comparable populations. Combined analyses can be performed with techniques for combining multiple contingency tables, such as the Mantel–Haenszel method (16), or with methods for meta-analysis (17).
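The first two corrections are available directly in base R; the p-values below are invented for illustration:

pvals <- c(2e-9, 4e-7, 1e-3, 0.03, 0.20)  # per-SNP association p-values
p.adjust(pvals, method = "bonferroni")    # Bonferroni correction (13)
p.adjust(pvals, method = "BH")            # Benjamini-Hochberg false discovery rate (14)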
3.4. Fine Mapping
Because the genotypes of many SNPs in the same locus are correlated, when a SNP is found to be associated with a disease, it is necessary to investigate other SNPs in the same locus in order to identify which gene(s) and which SNP(s) are most likely to affect the disease. For example, a genome-wide scan identified several SNPs in the 5q31 locus that are associated with psoriasis (18). However, this locus contains many immune-related genes that are in high LD, so fine mapping was required to determine that IL13, rather than the neighboring immune-related genes, is the
most likely causal gene (19). In addition, a gene may have multiple independent genetic variants that modify disease risk, as has now been demonstrated in many studies (see, for example, ref. (20)). To identify likely causal and independent variants, fine mapping involves genotyping many SNPs in the locus (dense genotyping) followed by statistical analyses. In addition, using the genotype information from the large number of SNPs in the locus, imputation can be used effectively to obtain genotype information for untyped SNPs. Alternatively, if imputation is considered at the design stage of fine mapping, SNPs can be selected to increase the number and accuracy of non-genotyped, imputable SNPs. To carry out these experiments efficiently, however, it is necessary to understand the LD structure of the chromosomal region of interest. LD information is available from the HapMap data set and can be viewed in HaploView (http://www.broadinstitute.org/mpg/haploview), as shown in Fig. 5. LD blocks, i.e., regions within which recombination has been rare, can often be visualized as the triangles formed by the color-coded pairwise r2 values (ranging from r2 = 1 in black to r2 = 0 in white). If SNP1 was the initial variant found to be associated with a disease, then the causal SNP (i.e., the SNP that causes the functional change responsible for the observed association) is expected to lie in the LD block marked in Fig. 5. An LD block can be contained within a single gene (Scenario A) or span several genes (Scenario B). Once the gene that includes the causative SNP is identified (e.g. Gene 1 in Scenario A), additional genotyping and association testing may be required to identify SNPs that are independently associated with disease. Finally, genetic variants associated with disease may be defined by haplotypes, which can be inferred from the genotype data by various algorithms (21).
Fig. 5. HapMap LD graph. SNP–SNP linkage disequilibrium in r2 (ranging from r2 = 1 in black to 0 in white) in a genomic locus is shown at the bottom. SNP1 represents a variant identified in an initial case–control study. The target LD block is noted with a black bar. The gene distribution is noted above.
3.5. Targeted Sequencing
Targeted sequencing of a locus may identify SNPs or other genomic variants (see Note 8) that affect common diseases. It is particularly suited to the discovery of low-frequency variants that might have a large effect size. One may choose to sequence the exons or the entire genomic region, particularly in individuals with extreme phenotypes or individuals of different ethnicities. For example, following the finding that rs738409 in PNPLA3 is associated with non-alcoholic fatty liver disease (NAFLD) in Caucasians, resequencing the coding regions of the gene in African patients led to the discovery of another missense SNP, rs6006460 (Ser453Ile), associated with risk of NAFLD in Africans; this SNP is common in Africans (MAF of 10%), but rare in Caucasians (MAF of 0.3%) and of low frequency in Hispanics (MAF of 0.8%) (22).
3.6. Functional Studies
Functional characterization of SNPs may be required to identify the causal SNP when genetics alone cannot tease apart several correlated SNPs that are all associated with a disease. Furthermore, such studies are important to understand the biochemical mechanism that links genetic variation to disease. For example, a combination of functional and genetic data may reveal that up- or downregulation of a gene increases disease risk. To gain insight into the function of a SNP, both bioinformatics analysis and molecular and cellular experimentation should be carried out. Bioinformatics can shed light on the likely functional consequence of a SNP that encodes an amino acid substitution: is the change likely to affect enzyme activity, protein stability, or protein–protein interaction? Similarly, bioinformatics can predict whether a SNP in a regulatory element is likely to affect gene transcription, splicing, or mRNA stability. These hypotheses are then tested in in vitro and in vivo experiments. One example of this approach is the characterization of the missense SNP (rs2476601) in PTPN22, which encodes the lymphoid tyrosine phosphatase (LYP). The amino acid residue corresponding to the SNP maps to the proline-rich motif in LYP that binds the SH3 domain of the tyrosine kinase Csk. This information led to experiments showing that the Trp620 variant bound Csk less efficiently than the Arg620 variant (8, 23). Additional enzymatic assays showed that the Trp620 variant had higher phosphatase activity than the Arg620 variant, and cell-based assays showed that the Trp620 variant was a more potent inhibitor of T-cell activation (24).
Another example is rs1800925, an intergenic SNP immediately upstream of IL13 that fine mapping analysis suggested was a causal variant predisposing to psoriasis (19). In silico analysis revealed that the SNP affects binding of the transcription factor YY1 to the site; YY1 binds DNA with the consensus sequence AAAATGA, but not AAAACGA (SNP base underlined) (25). This was validated by gel shift assays with synthetic DNA of the two variant forms. Furthermore, a reporter assay showed that the IL13 promoter with the T allele had higher activity than that with the C allele, corresponding to their ability to bind YY1. In mitogen-activated peripheral blood mononuclear cells, significantly more IL13 was secreted from cells derived from individuals homozygous for the T allele than from those homozygous for the C allele. Together with the genetic association data, these in vitro and in vivo observations suggest that upregulation of IL13 protects from psoriasis.
4. Notes
1. Common complex diseases are thought to be modulated by common genetic variants (the common variant–common disease hypothesis) or by rare genetic variants (Fig. 1). Although single-allele mutations causing inherited disease exist, they are very rare and subject to strong selection pressure. For example, infrequent single-allele mutations in the PKD1 gene (encoding the glycoprotein polycystin-1) cause autosomal dominant polycystic kidney disease, which has the highest incidence rate among the Mendelian disorders (roughly 1 per 1,000 live births) (http://pkdb.mayo.edu).
2. Case–control association studies are often carried out in two or more stages, comprising an exploratory (or hypothesis-generating) study and validation studies. The exploratory study identifies a subset of tested SNPs that are then tested in the validation studies. This staged approach can reduce cost and conserve valuable clinical samples. For two-stage association studies, joint analysis is more efficient than replication-based analysis (26).
3. The power of a case–control study to detect a genetic effect has a significant impact on the outcome of the study and should be considered when interpreting the data. As shown in Fig. 2, the estimated power is a function of the sample size, the allele frequency, the effect size, and the threshold of significance. Since common complex diseases stem from multiple etiologies, individual SNPs will not have large effect sizes, and a well-powered genetic association study nowadays requires thousands or tens of thousands of samples (27).
4. Misclassification of the outcome is a classical mistake that occurs when subjects meeting the criteria for being cases are present in the population-based control group. Since the outcome of interest is already known in case–control studies, they are also referred to as "trohoc" studies (cohort in reverse) and are therefore subject to selection bias. On the other hand, infrequent outcomes may be difficult to study prospectively, so case–control studies may be the only alternative. Drug treatment that modifies the disease or quantitative trait is another source of potential bias that should be addressed (28).
5. As in all other large clinical studies, missing information is unavoidable, and the pattern of absent data therefore needs to be investigated. Nonrandom missingness in genotype as well as phenotype/outcome classification may lead to spurious associations. A sensitivity analysis using permutation techniques and imputation methods, compared with a complete-cases-only analysis, may uncover the informative nature of systematic missingness.
6. Useful links (a comprehensive review of SNP resources can be found elsewhere (29)):
Genetic association database, searchable by phenotype and gene: http://www.hugenavigator.net
International HapMap Project: http://hapmap.ncbi.nlm.nih.gov
Human genome browser: http://genome.ucsc.edu/cgi-bin/hgGateway
dbSNP database: http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp
Whole genome association analysis tools, including imputation: http://pngu.mgh.harvard.edu/~purcell/plink
Sample size or power for association studies of genes, gene–environment interaction, or gene–gene interaction: http://hydra.usc.edu/GxE
HaploView program: http://www.broadinstitute.org/haploview/haploview
Imputation (in silico genotyping): MACH, http://www.sph.umich.edu/csg/abecasis/MaCH/tour/imputation.html; BEAGLE, http://www.stat.auckland.ac.nz/~bbrowning/beagle/beagle.html; IMPUTE, https://mathgen.stats.ox.ac.uk/impute/impute.html
7. Pitfalls and incorrect interpretation of results: case–control studies can produce both false-positive and false-negative
results; it is therefore suggested that the guidelines by Little et al. (30) be followed when reporting genetic association studies. To control false-positive reporting, multiple-testing correction should always be applied. Population stratification can also lead to false-positive associations – for example, a study of a population derived from two ethnic groups, in which the disease is more prevalent in one group than in the other, would identify every SNP with a different frequency in the two groups as associated with the disease (31). Several methods have been developed to examine whether there are ancestry differences between cases and controls; for example, principal component analysis can be used to detect and correct for population stratification using the genotypes of unlinked SNPs (32). On the other hand, false-negative results are often attributable to low power. Meta-analysis of multiple studies is used to increase the effective power of the analysis and to obtain a more reliable estimate of the association of a particular SNP with a disease.
8. SNPs are the most common type of genetic variant (1), and current technology allows comprehensive testing of SNPs at an increasingly affordable price. As of November 2009, 505 GWAS had been reported (HuGE Navigator, http://www.hugenavigator.net). The results of these studies indicate that the initial expectation, namely the identification of numerous SNPs with large effects on the risk of common diseases such as diabetes or coronary heart disease, was not realized. Most reported SNPs have only a modest effect. For example, a GWAS of SNPs associated with height found 54 SNPs that together explained only a few percent of the variability in the population; if all height-predicting SNPs have a similar effect size, roughly 100,000 SNPs would be needed to explain 80% of the variability in height in a population (3). Thus, to account more fully for the genetic contribution to common complex diseases, studies of rare SNPs, gene–gene interaction, gene–environment interaction, and genetic variants other than SNPs may be needed (33). Indeed, other variants such as copy number variations (CNVs) have been found to modulate disease risk (34). Methods to survey and analyze CNVs on a large scale have been developed (35).
Acknowledgments
Research related to this manuscript was supported by Celera (Y.L. and D.S.), the Austrian Science Fund (FWF P-18325), and the Austrian Academy of Sciences (OELZELT EST370/04) (R.O.).
References
1. The International HapMap Project. (2003) Nature 426, 789–96.
2. Hindorff, L.A., Sethupathy, P., Junkins, H.A., Ramos, E.M., Mehta, J.P., Collins, F.S., and Manolio, T.A. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106, 9362–7.
3. Li, Y., and Begovich, A.B. (2009) Unraveling the genetics of complex diseases: susceptibility genes for rheumatoid arthritis and psoriasis. Semin Immunol 21, 318–27.
4. Hahn, L.W., Ritchie, M.D., and Moore, J.H. (2003) Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics 19, 376–82.
5. Torkamani, A., and Schork, N.J. (2009) Pathway and network analysis with high-density allelic association data. Methods Mol Biol 563, 289–301.
6. Vance, J.M. (2006) Collection of biological samples for DNA analysis. In: Haines, J.L., and Pericak-Vance, M. (eds.) Genetic Analysis of Complex Diseases, 2nd edition, Wiley, New York.
7. Germer, S., Holland, M.J., and Higuchi, R. (2000) High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res 10, 258–66.
8. Begovich, A.B., Carlton, V.E., Honigberg, L.A., Schrodi, S.J., Chokkalingam, A.P., Alexander, H.C., Ardlie, K.G., Huang, Q., Smith, A.M., Spoerke, J.M., Conn, M.T., Chang, M., Chang, S.Y., Saiki, R.K., Catanese, J.J., Leong, D.U., Garcia, V.E., McAllister, L.B., Jeffery, D.A., Lee, A.T., Batliwalla, F., Remmers, E., Criswell, L.A., Seldin, M.F., Kastner, D.L., Amos, C.I., Sninsky, J.J., and Gregersen, P.K. (2004) A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet 75, 330–7.
9. Luke, M.M., Kane, J.P., Liu, D.M., Rowland, C.M., Shiffman, D., Cassano, J., Catanese, J.J., Pullinger, C.R., Leong, D.U., Arellano, A.R., Tong, C.H., Movsesyan, I., Naya-Vigne, J., Noordhof, C., Feric, N.T., Malloy, M.J., Topol, E.J., Koschinsky, M.L., Devlin, J.J., and Ellis, S.G. (2007) A polymorphism in the protease-like domain of apolipoprotein(a) is associated with severe coronary artery disease. Arterioscler Thromb Vasc Biol 27, 2030–6.
10. Shiffman, D., Kane, J.P., Louie, J.Z., Arellano, A.R., Ross, D.A., Catanese, J.J., Malloy, M.J., Ellis, S.G., and Devlin, J.J. (2008) Analysis of 17,576 potentially functional SNPs in three case–control studies of myocardial infarction. PLoS One 3, e2895.
11. Balding, D.J. (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7, 781–91.
12. Armitage, P. (1955) Tests for linear trends in proportions and frequencies. Biometrics 11, 375–86.
13. Bonferroni, C.E. (1936) Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62.
14. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57, 289–300.
15. Stephens, M., and Balding, D.J. (2009) Bayesian statistical methods for genetic association studies. Nat Rev Genet 10, 681–90.
16. Mantel, N., and Haenszel, W. (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 22, 719–48.
17. Rothman, K., and Greenland, S. (1998) Modern Epidemiology, 2nd edition, Lippincott Williams and Wilkins, Philadelphia, PA.
18. Chang, M., Li, Y., Yan, C., Callis, K., Matsunami, N., Garcia, V., Cargill, M., Civello, D., Bui, N., Catanese, J., Leppert, M., Krueger, G., Begovich, A., and Schrodi, S. (2008) Variants in the 5q31 cytokine gene cluster are associated with psoriasis. Genes Immun 9, 176–81.
19. Li, Y., Chang, M., Schrodi, S., Callis-Duffin, K., Matsunami, N., Civello, D., Bui, N., Catanese, J., Leppert, M.F., Krueger, G.G., and Begovich, A. (2008) The 5q31 variants associated with psoriasis and Crohn's disease are distinct. Hum Mol Genet 17, 2978–85.
20. Cargill, M., Schrodi, S.J., Chang, M., Garcia, V.E., Brandon, R., Callis, K.P., Matsunami, N., Ardlie, K.G., Civello, D., Catanese, J.J., Leong, D.U., Panko, J.M., McAllister, L.B., Hansen, C.B., Papenfuss, J., Prescott, S.M., White, T.J., Leppert, M.F., Krueger, G.G., and Begovich, A.B. (2007) A large-scale genetic association study confirms IL12B and leads to the identification of IL23R as psoriasis-risk genes. Am J Hum Genet 80, 273–90.
21. Niu, T. (2004) Algorithms for inferring haplotypes. Genet Epidemiol 27, 334–47.
22. Romeo, S., Kozlitina, J., Xing, C., Pertsemlidis, A., Cox, D., Pennacchio, L.A., Boerwinkle, E., Cohen, J.C., and Hobbs, H.H. (2008) Genetic variation in PNPLA3 confers susceptibility to nonalcoholic fatty liver disease. Nat Genet 40, 1461–5.
23. Bottini, N., Musumeci, L., Alonso, A., Rahmouni, S., Nika, K., Rostamkhani, M., MacMurray, J., Meloni, G.F., Lucarelli, P., Pellecchia, M., Eisenbarth, G.S., Comings, D., and Mustelin, T. (2004) A functional variant of lymphoid tyrosine phosphatase is associated with type I diabetes. Nat Genet 36, 337–8.
24. Vang, T., Congia, M., Macis, M.D., Musumeci, L., Orru, V., Zavattari, P., Nika, K., Tautz, L., Tasken, K., Cucca, F., Mustelin, T., and Bottini, N. (2005) Autoimmune-associated lymphoid tyrosine phosphatase is a gain-of-function variant. Nat Genet 37, 1317–9.
25. Cameron, L., Webster, R.B., Strempel, J.M., Kiesler, P., Kabesch, M., Ramachandran, H., Yu, L., Stern, D.A., Graves, P.E., Lohman, I.C., Wright, A.L., Halonen, M., Klimecki, W.T., and Vercelli, D. (2006) Th2 cell-selective enhancement of human IL13 transcription by IL13-1112C>T, a polymorphism associated with allergic inflammation. J Immunol 177, 8633–42.
26. Skol, A.D., Scott, L.J., Abecasis, G.R., and Boehnke, M. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38, 209–13.
27. Wellcome Trust Case Control Consortium. (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–78.
28. Tobin, M.D., Sheehan, N.A., Scurrah, K.J., and Burton, P.R. (2005) Adjusting for treatment effects in studies of quantitative traits: antihypertensive therapy and systolic blood pressure. Stat Med 24, 2911–35.
29. Johnson, A.D. (2009) Single-nucleotide polymorphism bioinformatics: a comprehensive review of resources. Circ Cardiovasc Genet 2, 530–6.
30. Little, J., Higgins, J.P., Ioannidis, J.P., Moher, D., Gagnon, F., von Elm, E., Khoury, M.J., Cohen, B., Davey-Smith, G., Grimshaw, J., Scheet, P., Gwinn, M., Williamson, R.E., Zou, G.Y., Hutchings, K., Johnson, C.Y., Tait, V., Wiens, M., Golding, J., van Duijn, C., McLaughlin, J., Paterson, A., Wells, G., Fortier, I., Freedman, M., Zecevic, M., King, R., Infante-Rivard, C., Stewart, A., and Birkett, N. (2009) Strengthening the reporting of genetic association studies (STREGA): an extension of the STROBE statement. Eur J Epidemiol 24, 37–55.
31. Cardon, L.R., and Bell, J.I. (2001) Association study designs for complex diseases. Nat Rev Genet 2, 91–9.
32. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–9.
33. Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., Cho, J.H., Guttmacher, A.E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C.N., Slatkin, M., Valle, D., Whittemore, A.S., Boehnke, M., Clark, A.G., Eichler, E.E., Gibson, G., Haines, J.L., Mackay, T.F., McCarroll, S.A., and Visscher, P.M. (2009) Finding the missing heritability of complex diseases. Nature 461, 747–53.
34. Wain, L.V., Armour, J.A., and Tobin, M.D. (2009) Genomic copy number variation, human health, and disease. Lancet 374, 340–50.
35. Ionita-Laza, I., Rogers, A.J., Lange, C., Raby, B.A., and Lee, C. (2009) Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 93, 22–6.
36. McCarthy, M.I., Abecasis, G.R., Cardon, L.R., Goldstein, D.B., Little, J., Ioannidis, J.P., and Hirschhorn, J.N. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9, 356–69.
Chapter 11
Bioinformatics for Copy Number Variation Data
Melissa Warden, Roger Pique-Regi, Antonio Ortega, and Shahab Asgharzadeh
Abstract
Copy number variation is known to be an important component of structural variation in the human genome. Greater than 1 kb in size, these gains and losses of genetic material are known to confer risk to many human diseases, both Mendelian and complex. Therefore, the technologies used to detect copy number variation have been quickly improving in both throughput and cost. From comparative genomic hybridization to synthetic high-density oligonucleotide arrays to next-generation sequencing methods, algorithms used to estimate copy number are plentiful. Here we provide a practical introduction to copy number variation technology and the available analysis methods, and demonstrate the analysis flow on an example case.
Key words: Single nucleotide polymorphism, Copy number variation, Comparative genomic hybridization, Microarray, Bioinformatics
1. Introduction
Single nucleotide polymorphisms (SNPs) are thought to be the most abundant source of human genetic variation. A SNP is a single site in the DNA at which two or more different nucleotide pairs occur at a frequency of 1% or greater within a population. Several million SNPs have been documented to date, and they are responsible for the majority of phenotypic variability observed in humans. More recently, the importance of another submicroscopic type of structural genetic variant has been recognized: copy number variants (CNVs) (1). CNVs are structural changes that occur throughout the genome, primarily due to duplication, deletion, insertion, and unbalanced translocation events (2–5). Mechanisms of CNV formation include meiotic recombination, homology-directed and nonhomologous repair of double-strand
breaks, and errors in replication (5). These gains and losses of genetic material are 1 kb or greater in size and vary in frequency among healthy individuals (6). Copy number polymorphisms (CNPs) are common CNVs present in greater than 1% of the population, while CNVs found in less than 1% of the population are considered rare (7). The frequency of CNVs varies by ethnicity, which may contribute to phenotypic variation and differences in disease susceptibility across ethnic groups (8, 9). Several public databases provide a comprehensive summary of CNVs detected in disease-free human populations. For example, the Database of Genomic Variants (DGV) is a collection of the structural variation identified in the human genome. It is continuously updated with detailed information on the location and gene content of several types of structural variation, including, but not limited to, CNVs (see Notes 1 and 2). The functional impact of CNVs has been demonstrated both through cellular phenotypes, such as gene expression, and through the genetic basis of human disease (10). The large genomic regions encompassed by CNVs are likely to include several deleted or duplicated genes, unlike traditional SNPs, which affect only one gene at a time (11). CNVs are known to confer risk for inherited diseases, such as autism spectrum disorders (12), schizophrenia and bipolar disorder (13), and X-linked mental retardation in males (14); complex diseases, such as systemic lupus erythematosus (15) and HIV-1/AIDS susceptibility (16); and cancer. It was recently reported that a common CNV inherited at 1q21.1 is associated with an increased risk of developing neuroblastoma, a common childhood cancer, and that this CNV influences the expression of a previously unknown neuroblastoma breakpoint family (NBPF) gene (17). The NBPF gene is located in regions of segmental duplication on chromosome 1 and was named after a neuroblastoma patient in whom this region was disrupted by a chromosomal translocation (18). Several approaches have been used to examine CNVs in the human genome. Comparative genomic hybridization (CGH) was first developed for genome-wide analysis of DNA sequence copy number in a single experiment (19). CGH is based on competitive in situ hybridization of differentially fluorescently labeled test and reference DNA to normal human metaphase chromosomes. The fluorescence intensity ratio measured along the length of each chromosome is approximately proportional to the ratio of the copy numbers of the corresponding DNA sequences in the test and reference genomes. However, the use of CGH was limited by its low resolution of only 5–10 Mb, and improvements were made possible by the resources generated for the public-domain Human Genome Project, where large-insert clone libraries were developed and assembled into overlapping contigs for
sequencing (20). The metaphase chromosomes used for CGH could now be replaced with arrays of clones accurately mapped to the human genome. Bacterial artificial chromosome (BAC) and P1-derived artificial chromosome (PAC) clones are most commonly amplified and spotted for genome-wide CGH arrays, with a resolution of 1–1.5 Mb (21). Array-CGH is similar to CGH in that test and reference DNA are differentially fluorescently labeled and hybridized together to the array. The resulting fluorescence signal ratio is then measured for each clone and plotted relative to the position of the clone in the genome (22). Oligonucleotide probes provide the highest resolution for array-CGH; however, these shorter probes hybridize less stringently, leading to a poorer signal-to-noise ratio and higher signal variability than the clone-based CGH platforms (23). Synthetic high-density oligonucleotide microarrays developed for genome-wide SNP genotyping are now also being used to estimate copy number (24). Hybridizations are not performed by cohybridization of test and reference DNA as in array-CGH, but rather by hybridization of a single DNA sample. In order to improve the signal-to-noise ratio, Affymetrix has developed a technology in which the DNA sample is first digested using restriction enzymes. The smaller DNA fragments are ligated to adapters, and polymerase chain reaction (PCR) is then used to amplify the fragments with universal PCR primers. The PCR product of a single sample is fluorescently labeled and hybridized to a chip consisting of millions of 25 base pair (bp) oligonucleotide probes. The signal intensities of these millions of hybridizations are used to determine genotypes and estimate copy number. This process reduces the complexity of the hybridization; however, it also introduces possible bias: preferential amplification of different regions of the genome may reflect differences in restriction digestion patterns rather than copy number differences between individuals. Illumina® has developed an alternative platform using 50-bp oligonucleotides attached to indexed beads randomly deposited onto glass slides. Following whole genome amplification and fragmentation, a two-step allele detection method is used. First, unlabeled DNA fragments are hybridized to the 50-bp probes on the array, followed by an enzymatic single-base extension with labeled nucleotides. As with the Affymetrix arrays, the signal intensities of these millions of hybridizations are used to determine genotypes and estimate copy number. While high-density oligonucleotide microarrays have revolutionized the detection of CNVs in large-scale genome studies, next-generation sequencing technologies are now available, providing improved accuracy and specificity. The rapid development of new sequencing technologies, such as Roche's 454 sequencing (25), Illumina's Genome Analyzer (aka Solexa) (26), and Applied Biosystems'
SOLiD (27), is continuously increasing the speed and throughput of sequencing, while also decreasing the cost. Several computational methods are available for CNV detection using these next-generation sequencing platforms (28, 29).
2. Materials
2.1. Microarray Platforms
Comparative genomic hybridization and array-CGH were originally used for genome-wide CNV detection, but are limited by their low resolution. The synthetic high-density oligonucleotide microarrays used for genome-wide SNP genotyping and copy number estimation provide the highest resolution. Currently, the Genome-Wide Human SNP Array 6.0 is the highest-resolution SNP microarray platform from Affymetrix, and features more than 1.8 million markers of genetic variation, including probes for the detection of both SNPs and CNVs. Illumina® has developed an alternative platform using 50-bp oligonucleotides attached to indexed beads randomly deposited onto glass slides; the HumanOmni1-Quad BeadChip, containing more than one million probes, is currently the highest resolution offered. Roche's NimbleGen and Agilent Technologies both offer array-CGH products with more than one million probes of 50 bp or greater, useful for detecting CNPs but not SNPs.
2.2. Raw Data and Annotation Files
The Affymetrix CEL file stores the results of the intensity calculations from the DAT file, which is where the pixel intensity values collected from an Affymetrix scanner are stored. For each feature on the microarray, the CEL file includes an intensity value, the standard deviation of the intensity, the number of pixels used to calculate the intensity value, and a flag to denote an outlier, indicating that the feature should be excluded from further analysis; the CEL file is used for all downstream analysis. For Illumina microarrays, the IDAT file contains the intensity value for each probe averaged over at least 20 beads, as collected from an Illumina scanner. The IDAT file can be read by the Illumina BeadStudio analysis software to produce all other file types for downstream analysis. A unique Chip Definition File (CDF) accompanies each type of Affymetrix microarray. The CDF file contains the necessary information about the specific layout of the microarray and can be downloaded from the Affymetrix website. Platform-specific annotation files map the units in the CDF file and contain genome information specific to the type of microarray, such as the chromosome number, transcription start and end sites, and strand indication (the sense strand of a gene relative to the genome sequence). Several command-line and Graphical User Interface (GUI)-based applications are available for preprocessing the array files (see Table 1).
Table 1
Software packages available for CNV analysis

| Software | Platforms | Supported CNV algorithms | Details |
|---|---|---|---|
| Aroma Project | Multiple platforms | CBS, GADA (additional algorithms can be introduced into the system) | Normalization, summarization; R package; utilizes other CNV algorithms for CNV and LOH detection; http://www.aroma-project.org |
| BeadStudio Analysis | Illumina | cnvPartition, proprietary (additional algorithms can be imported as modules) | Normalization, CNV and LOH detection; GUI-based software; http://www.illumina.com |
| Affymetrix Genotyping Console | Affymetrix | Canary | Normalization, CNV and LOH detection; GUI-based software; http://www.affymetrix.com/index.affx |
| Affymetrix Power Tools (APT) | Affymetrix | Multiple | Normalization, CNV and LOH detection; command-line; http://www.affymetrix.com/partners_programs/programs/developer/tools/powertools.affx |
| DNA-chip Analyzer (dChip) | Affymetrix, Illumina | HMM | Normalization, CNV and LOH detection; GUI-based software; http://biosun1.harvard.edu/complab/dchip |
| Nexus Copy Number | Multiple platforms | Segmentation algorithm | Normalization, CNV and LOH detection; GUI-based software; http://www.biodiscovery.com |
| Partek Genomics Suite | Multiple platforms | HMM | Normalization, CNV and LOH detection; GUI-based software; http://www.partek.com |
| GADA | Multiple platforms | SBL | CNV detection; R package; http://groups.google.com/group/gadaproject |
| CNAG | Affymetrix | HMM | CNV and LOH detection; GUI-based software; http://www.genome.umin.jp |
| ITALICS | Affymetrix | GLAD algorithm | Normalization, CNV detection; R package; http://www.bioconductor.org |
| PennCNV | Multiple platforms | HMM | CNV detection; command-line; http://www.openbioinformatics.org/penncnv |
Additional modules available in the open-source R statistical platform also allow for preprocessing of samples. The Aroma Project (http://www.aroma-project.org) supports preprocessing of Affymetrix raw data and contains specific information regarding the 25-mer oligonucleotide sequences and strand indication. This software also allows further downstream analyses of a variety of commercial array platforms, including the Illumina platform.
2.3. Computer Hardware and Software
For large studies, a UNIX-based environment is highly recommended. For small to moderate projects, a 32-bit or 64-bit MS Windows operating system is sufficient. High-density genome-wide microarray datasets require large amounts of memory and storage; for instance, a single Affymetrix Genome-Wide Human SNP Array 6.0 CEL file is approximately 70 MB. Therefore, the minimum hardware requirements are a 120 GB hard drive, 4 GB of memory, and at least a 2.0 GHz Intel Pentium processor. R is both a computer language for statistical computing and free software that provides a coherent, flexible system for data analysis that can be extended as needed (http://www.r-project.org). The open-source nature of R ensures its availability, and it runs on a variety of UNIX platforms, Windows, and MacOS. Aroma.Affymetrix is an open-source R package that provides memory-efficient methods to perform basic data analysis, such as normalization and probe-level summarization, on Affymetrix microarray datasets (30). Genome Alteration Detection Algorithm (GADA) is an R package developed by our group that imports normalized Affymetrix or Illumina microarray data sets and
detects CNVs, allowing joint modeling of both copy number and reference intensities (31). GADA's CNV module can also be called from within the Aroma Project.
3. Methods
3.1. Quality Control
Quality control (QC) is the first component of data analysis. A single-sample QC analysis can be used to identify poor-quality samples that should be removed from subsequent analysis. This can be done simply by comparing the signal intensities (on a log scale) of each probe on the microarray for each sample; a sketch is given below. The single-sample QC metric should be a good indicator of the final performance of the copy number estimation. The signal intensities from the X chromosome of male samples can also be used to measure the distance between a copy number state of two and a copy number state of one. These QC methods can be performed both before and after normalization to determine whether known sources of background have been correctly minimized.
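A minimal sketch of such a single-sample comparison, assuming a probes-by-samples matrix of raw intensities (simulated here; the matrix name and dimensions are illustrative only):

set.seed(1)
intensity <- matrix(2^rnorm(5000 * 6, mean = 10, sd = 1), ncol = 6,
                    dimnames = list(NULL, paste0("sample", 1:6)))
log.int <- log2(intensity)
boxplot(log.int, las = 2, ylab = "log2 signal intensity")  # visually flag outlier samples
sort(apply(log.int, 2, mad))  # per-sample spread as a crude single-sample noise metric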
3.2. Normalization
Measurements from microarrays can be affected by many experimental factors, such as sample extraction and hybridization. It is therefore important to correct the measurements using normalization procedures in order to make comparisons between different samples. Probe-level transformation methods convert the measurements into modified probe intensities by identifying and removing systematic effects that cannot be explained by the biological variation of interest or by random noise. Examples of generic transformations include Robust Multi-Array (RMA) background correction, gcRMA background correction, and quantile normalization (32, 33). RMA background correction (as provided, e.g., in the Bioconductor packages for R) estimates the background using a mixture model which assumes that the background signals follow a normal distribution and the true signals follow an exponential distribution. In quantile normalization, the target distribution is first estimated by averaging the signal intensities across all the microarrays, and each microarray is then normalized toward this target distribution (34). In addition, an effect known as allelic crosstalk can occur on the microarrays, because the oligonucleotide sequences for allele A and allele B probes differ by only one nucleotide. This cross-hybridization can be corrected for using allelic-crosstalk calibration methods (see Table 1) (34).
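Dedicated packages implement these corrections, but the core of quantile normalization is small enough to sketch directly (a naive version; ties are broken arbitrarily and missing values are not handled):

quantile.normalize <- function(x) {             # x: probes-by-samples intensity matrix
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))         # average (target) distribution
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

After this step every column has exactly the same distribution, so between-sample differences at a probe reflect rank changes rather than array-wide effects.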
3.3. Summarization
Once the probe-level signal intensities have been background-corrected and normalized, the signals must be summarized. Summarization methods combine the multiple signals from
a set of probes into a single signal per sample (a toy illustration follows). Probe-level models (PLMs) describe the preprocessed signal intensities using statistical models that include both the effects of interest and random noise (see Table 1).
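As a toy stand-in for such a summarization step, the probes of one probe set can be collapsed to a single per-sample value with the median (real PLMs fit probe and chip effects plus noise, e.g. by median polish; the data below are simulated):

probe.signals <- matrix(rnorm(10 * 4, mean = 8), nrow = 10,
                        dimnames = list(NULL, paste0("sample", 1:4)))
apply(probe.signals, 2, median)  # one summarized signal per sample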
3.4. Detection of Copy Number Variation
A major concern for the detection of CNVs using synthetic high-density oligonucleotide microarray technology is how to define the breakpoints of a given CNV. Many algorithms have been developed to detect CNVs, based on the assumption that the genome of a normal diploid individual consists of DNA segments of constant copy number (31, 35). The "Genome Alteration Detection Algorithm" is an R package developed by our group that imports preprocessed Affymetrix microarray data sets and detects CNVs by jointly modeling copy number and reference intensities (31, 36, 37). This segmentation procedure is done in two steps: first, a sparse Bayesian learning (SBL) model is fit to determine the most likely candidate breakpoints of a given CNV; second, a backward elimination (BE) procedure consecutively removes the least significant breakpoints and allows the false discovery rate (FDR) to be adjusted. Several other applications and algorithms have been developed for CNV detection (Table 1). "PennCNV" is a free software tool for the detection of CNVs from Affymetrix and Illumina microarray data sets. This algorithm uses a hidden Markov model (HMM)-based approach that incorporates the total signal intensity and allelic intensity ratio for each probe, the distance between neighboring SNPs, the allele frequency of SNPs, and pedigree information when available (38). dChip SNP and CNAG are also freely available GUI-based applications that allow processing of Affymetrix CEL files and the detection and visualization of chromosomal regions with loss of heterozygosity (LOH) and copy number alterations using HMM-based algorithms.
3.5. Practical Applications
3.5.1. Association Studies
Genetic association studies are the predominant strategy for identifying CNVs conferring risk for complex genetic diseases, either within candidate loci or genome-wide. In this approach, the frequency of a given CNV is compared between affected and unaffected individuals (a minimal sketch follows). Such studies require large sample sizes and are susceptible to population stratification if cases and controls differ by ethnicity.
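In its simplest form, such a comparison is a test on a 2 × 2 table of CNV carrier counts; all numbers below are invented for illustration:

carriers <- matrix(c(40, 960,    # cases:    CNV carriers, non-carriers
                     15, 985),   # controls: CNV carriers, non-carriers
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("case", "control"), c("carrier", "noncarrier")))
fisher.test(carriers)  # OR and p-value for the case-control CNV comparison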
3.5.2. Population Genetics
CNVs in normal individuals follow a model of Mendelian inheritance and exhibit a broad range of population frequencies. The distribution of copy number variation within and among different populations is influenced by mutation, selection, and demographic history. One study that attempted to create a CNV map in African Americans revealed two regions of the genome with
large CNV frequency differences between Caucasians and African Americans, one on chromosome 15 and another on chromosome 17 (39).
3.5.3. Cytogenetics
Cytogenetic studies attempt to identify structural changes in DNA, such as copy number changes. Platforms that reliably detect CNVs, such as those from Affymetrix and Illumina, are particularly useful for genome-wide assessment of uniparental disomy (when two copies of a chromosome are present but both have been inherited from a single parent) in the form of LOH, the loss of one of the two alleles of a gene. Despite the widespread copy number variation in the genomes of healthy individuals, clinical cytogeneticists must differentiate between pathogenic CNVs and CNVs that do not contribute to the clinical presentation of an affected individual (40).
3.5.4. Example
Presented here is a guided analysis using HapMap (http://www.hapmap.org) data from an individual (NA06991) profiled on the Affymetrix Genome-Wide Human SNP Array 6.0. The data were generated from the lymphoblastoid cell line of this individual, who is part of the collection of Centre d'Etude du Polymorphisme Humain (CEPH) families (http://ccr.coriell.org). The CEPH pedigree consists of multigenerational Caucasian families from Utah; this sample represents diploid data from a healthy female. There are several commercial and open-source tools available for preprocessing and CNV analysis of data from SNP or aCGH arrays (Table 1). Commercial software offers ease of data import and analysis, but is limited in flexibility and in the number of samples that can be analyzed. Most array vendors provide preprocessing tools and the ability to conduct CNV analysis. In our example, we use the Affymetrix Power Tools (APT) command-line software package apt-copynumber-workflow (apt-1.12.0, http://www.affymetrix.com/support/developer/powertools/changelog/index.html) to preprocess the signal intensities for downstream segmentation methods (34). In the following procedures, the "Genome Alteration Detection Algorithm" R package is used to illustrate the copy number segmentation procedure. The example dataset and software can be obtained at http://groups.google.com/group/gadaproject.
1. Import the processed array data into GADA and store the object with the imported data (comments for each command follow #).
>install.packages("gada_0.7-5.tar.gz", repos=NULL) #Install the GADA R package.
>library(gada) #Load the GADA R package.
>ParAffyData <- setupParGADAaffy(log2ratioCol=5, NumCols=8) #Import the Affymetrix array data. log2ratioCol designates which column contains the log2 ratio intensities and NumCols indicates the number of columns included in the file. These arguments may be modified depending on the file format.
>save(ParAffyData, file="ParAffyData.rData") #Store the object with the imported data for future analysis, so the data do not need to be imported again.
>load("ParAffyData.rData") #Load the stored object.
2. Visualize the log ratio signal intensities by chromosome (Fig. 1).
>plotRatio(ParAffyData, Sample=1, chr=22, num.points=100000) #Plot the signal intensities of sample 1, chromosome 22. Plotting all probes in a single plot would generate a very large file given the high resolution of the microarray; the number of plotted points can therefore be limited using the num.points argument.
3. Create the GADA model. Use aa, T, and MinSegLen to control the parameters. The array noise level can be set manually using the argument sigma2=s2; otherwise, it is estimated automatically by the algorithm by setting estim.sigma2=TRUE. The sparseness hyperparameter, aa, controls the SBL prior distribution, which does not indicate the location of the CNV breakpoints but imposes a penalty on their number. The lower the aa value, the greater the number of CNV breakpoints detected, including both true positives and false positives. A sparseness hyperparameter setting of aa=0.2 is recommended as an effective way to balance sensitivity against FDR. The T argument indicates the critical value (cutoff) of the BE algorithm: the statistical score tm associated with each breakpoint m remaining in the model has to be greater than T. The score tm can be interpreted as the difference between the sample averages of the probes located on the left and right segments, divided by a pooled estimate of the standard error. Adjusting T allows ranking of breakpoints from the BE algorithm without additional computational cost, which provides great flexibility in adjusting the final CNV set. The expected FDR decreases monotonically with T, so a list of breakpoints with lower FDR can be obtained by increasing the threshold T (Fig. 1). Finally, the minimum number of probes included in each copy number segment can be adjusted via MinSegLen. A minimum segment length of three is recommended to eliminate false detections due to extreme outliers.
Fig. 1. The Backward Elimination (BE) algorithm implemented in GADA allows for flexibility in CNV breakpoint detection. An increase in the T parameter will decrease the false discovery rate (FDR), but may also decrease the detection of true positives (TP). A plot of the log2 ratios and the segments detected by GADA for chromosome 22 of a single HapMap sample (NA06991) are shown using a value of T = 3 (a), T = 6 (b), and T = 12 (c).
>parSBL(ParAffyData, aAlpha=0.5, estim.sigma2=TRUE) #The first step of the segmentation procedure fits the SBL model and detects the most likely candidate breakpoints for the copy number state (see above for the definition of aAlpha and estim.sigma2).
>parBE(ParAffyData, T=6, MinSegLen=8) #The second step of the segmentation procedure uses the BE procedure to remove the least significant breakpoints one after the other (see above for the definition of T and MinSegLen).
4. Visualize the results of the segmentation (Fig. 1).
>plotRatio(ParAffyData, Sample=1, chr=22, num.points=500000, segments=TRUE) #Plot the signal intensities and segments of sample 1, chromosome 22 obtained after the BE procedure. num.points defines the density of the plot and segments is a TRUE/FALSE indicator that superimposes the segmentation results on top of the normalized data.
5. Summarize the results of the segmentation.
>NA06991 <- summary(ParAffyData) #Create a list of the altered segments detected for sample NA06991.
6. Identify the probes that are located in CNV regions on chromosome 22.
>probes <- getAlteredProbes(NA06991, chr=22) #Create a list of altered probes (gains or losses) of sample NA06991, chromosome 22.
7. Export the segments in BED format. The GADA results for this HapMap sample can also be visualized using the UCSC Genome Browser (see Notes 2 and 3) and compared to known CNV regions (Fig. 2).
Fig. 2. UCSC Genome Browser view of a known duplicated CNV region in chromosome 22 of HapMap sample NA06991, compared to a CNV segment detected in the same sample using GADA (T = 8).
>exportToBED(NA06991) #Generate the file "BED.txt", which contains the information required for visualization in the UCSC and other major genome browsers.
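Because the BE step is inexpensive compared with the SBL fit, alternative values of T can be explored on the same fitted model, as was done for Fig. 1. A minimal sketch using the GADA functions introduced above (the exact structure returned by summary may differ between package versions):
>parSBL(ParAffyData, aAlpha=0.5, estim.sigma2=TRUE) #Fit the SBL model once (the slow step).
>for (Tval in c(3, 6, 12)) { parBE(ParAffyData, T=Tval, MinSegLen=3); print(summary(ParAffyData)) } #Re-apply backward elimination at increasing T and inspect how the segment list shrinks as the FDR decreases.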
4. Notes
1. The Database of Genomic Variants (DGV, http://projects.tcag.ca/variation) is a collection of the structural variation identified in the human genome. It is continuously updated with detailed information on the location and gene content of several types of structural variation, including, but not limited to, CNV.
2. Researchers at the Wellcome Trust Sanger Institute have generated a comprehensive map of variable copy number regions in healthy individuals, entitled The Copy Number Variation (CNV) Project (http://www.sanger.ac.uk/humgen/cnv). The CNV Project Data Display can be used to visualize copy number regions present in HapMap samples, along with user-supplied custom tracks, using the UCSC Genome Browser.
3. The UCSC Genome Bioinformatics website (http://genome.ucsc.edu) contains reference sequences and working draft assemblies for a large set of genomes. In particular, the UCSC Genome Browser allows for visualization of data from many genomes, with extensive annotation tracks for various types of data.
References
1. Beckmann, J. S., Sharp, A. J., and Antonarakis, S. E. (2008) CNVs and genetic medicine (excitement and consequences of a rediscovery). Cytogenet Genome Res 123, 7–16.
2. Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., Fiegler, H., Shapero, M. H., Carson, A. R., Chen, W., Cho, E. K., Dallaire, S., Freeman, J. L., González, J. R., Gratacós, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J. R., Marshall, C. R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M. J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D. F., Estivill, X., Tyler-Smith, C., Carter, N. P., Aburatani, H., Lee, C., Jones, K. W., Scherer, S. W., and Hurles, M. E. (2006) Global variation in copy number in the human genome. Nature 444, 444–54.
3. Sharp, A. J., Cheng, Z., and Eichler, E. E. (2006) Structural variation of the human genome. Annu Rev Genomics Hum Genet 7, 407–42.
4. Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. W., and Lee, C. (2004) Detection of large-scale variation in the human genome. Nat Genet 36, 949–51.
5. Hastings, P. J., Lupski, J. R., Rosenberg, S. M., and Ira, G. (2009) Mechanisms of change in gene copy number. Nat Rev Genet 10, 551–64.
6. Shaikh, T. H., Gai, X., Perin, J. C., et al. (2009) High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res 19, 1682–90.
7. Itsara, A., Cooper, G. M., Baker, C., Girirajan, S., Li, J., Absher, D., Krauss, R. M., Myers, R. M., Ridker, P. M., Chasman, D. I., Mefford, H., Ying, P., Nickerson, D. A., and Eichler, E. E. (2009) Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet 84, 148–61.
8. Takahashi, N., Satoh, Y., Kodaira, M., and Katayame, H. (2008) Large-scale copy number variants (CNVs) detected in different ethnic human populations. Cytogenet Genome Res 123, 224–33.
9. Jakobsson, M., Scholz, S. W., Scheet, P., Gibbs, J. R., VanLiere, J. M., Fung, H.-C., Szpiech, Z. A., Degnan, J. H., Wang, K., Guerreiro, R., Bras, J. M., Schymick, J. C., Hernandez, D. G., Traynor, B. J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H. M., Hardy, J. A., Rosenberg, N. A., and Singleton, A. B. (2008) Genotype, haplotype, and copy-number variation in worldwide human populations. Nature 451, 998–1003.
10. Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., Aerts, J., Andrews, T. D., Barnes, C., Campbell, P., Fitzgerald, T., Hu, M., Ihm, C. H., Kristiansson, K., MacArthur, D. G., MacDonald, J. R., Onyiah, I., Wing, A., Pang, C., Robson, S., Stirrups, K., Valsesia, A., Walter, K., Wei, J., Tyler-Smith, C., Carter, N. P., Lee, C., Scherer, S. W., and Hurles, M. E. (2009) Origins and functional impact of copy number variation in the human genome. Nature 464, 704–12.
11. Shlien, A., and Malkin, D. (2009) Copy number variations and cancer. Genome Med 1, 62.
12. Kusenda, M., and Sebat, J. (2008) The role of rare structural variants in the genetics of autism spectrum disorders. Cytogenet Genome Res 123, 36–43.
13. Lachman, H. M. (2008) Copy variations in schizophrenia and bipolar disorder. Cytogenet Genome Res 123, 27–35.
14. Bauters, M., Weuts, A., Vandewalle, M., Nevelsteen, J., Marynen, P., Esch, H. V., and Froyen, G. (2008) Detection and validation of copy number variation in X-linked mental retardation. Cytogenet Genome Res 123, 44–53.
15. Ptacek, T., Li, X., Kelley, J. M., and Edberg, J. C. (2008) Copy number variants in genetic susceptibility and severity of systemic lupus erythematosus. Cytogenet Genome Res 123, 142–47.
16. Nakajima, T., Kaur, G., Mehra, N., and Kimura, A. (2008) HIV-1/AIDS susceptibility and copy number variation in CCL3L1, a gene encoding a natural ligand for HIV-1 coreceptor CCR5. Cytogenet Genome Res 123, 156–60.
17. Diskin, S. J., Hou, C., Glessner, J. T., Attiyeh, E. F., Laudenslager, M., Bosse, K., Cole, K., Mossé, Y. P., Wood, A., Lynch, J. E., Pecor, K., Diamond, M., Winter, C., Wang, K., Kim, C., Geiger, E. A., McGrady, P. W., Blakemore, A. I. F., London, W. B., Shaikh, T. H., Bradfield, J., Grant, S. F. A., Li, H., Devoto, M., Rappaport, E. R., Hakonarson, H., and Maris, J. M. (2009) Copy number variation at 1q21.1 associated with neuroblastoma. Nature 459, 987–92.
18. Vandepoele, K., Roy, N. V., Staes, K., Speleman, F., and van Roy, F. (2005) A novel gene family NBPF: intricate structure generated by gene duplications during primate evolution. Mol Biol Evol 22, 2265–74.
19. Kallioniemi, O., Kallioniemi, A., Sudar, D., Rutovitz, D., Gray, J., Waldman, F., and Pinkel, D. (1993) Comparative genomic hybridization: a rapid new method for detecting and mapping DNA amplifications in tumors. Semin Cancer Biol 4, 41–6.
20. Cheung, V. (2001) Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409, 953–8.
21. Snijders, A., Nowak, N., and Segraves, R. (2001) Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 29, 263–4.
22. Pinkel, D., Segraves, R., Sudar, D., Clark, S., Poole, I., Kowbel, D., Collins, C., Kuo, W.-L., Chen, C., Zhai, Y., Dairkee, S. H., Ljung, B.-M., Gray, J. W., and Albertson, D. G. (1998) High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20, 207–11.
23. Carvalho, B., Ouwerkerk, E., Meijer, G. A., and Ylstra, B. (2004) High resolution microarray comparative genomic hybridisation analysis using spotted oligonucleotides. J Clin Pathol 57, 644–6.
24. Bengtsson, H., Wirapati, P., and Speed, T. P. (2009) A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics 25, 2149–56.
25. Margulies, M., Egholm, M., Altman, W. E., Attiya, S., Bader, J. S., Bemben, L. A., Berka, J., Braverman, M. S., Chen, Y.-J., Chen, Z., Dewell, S. B., Du, L., Fierro, J. M., Gomes, X. V., Goodwin, B. C., He, W., Helgesen, S., Ho, C. H., Irzyk, G. P., Jando, S. C., Alenquer, M. L. I., Jarvie, T. P., Jirage, K. B., Kim, J.-B., Knight, J. R., Lanza, J. R., Leamon, J. H., Lefkowitz, S. M., Lei, M., Li, J., Lohman, K. L., Lu, H., Makhijani, V. B., McDade, K. E., McKenna, M. P., Myers, E. W., Nickerson, E., Nobile, J. R., Plant, R., Puc, B. P., Ronan, M. T., Roth, G. T., Sarkis, G. J., Simons, J. F., Simpson, J. W., Srinivasan, M., Tartaro, K. R., Tomasz, A., Vogt, K. A., Volkmer, G. A., Wang, S. H., and Wang, Y. (2005) Genome sequencing in open microfabricated high density picoliter reactors. Nature 437, 376–80.
26. Bentley, D. R. (2006) Whole-genome resequencing. Curr Opin Genet Dev 16, 545–52.
27. Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J. A., Costa, G., McKernan, K., Sidow, A., Fire, A., and Johnson, S. M. (2008) A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18, 1051–63.
28. Xie, C., and Tammi, M. T. (2009) CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform 10, 80–9.
29. Yoon, S., Xuan, Z., and Makarov, V. (2009) Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 19, 1586–92.
30. Bengtsson, H., Ray, A., Spellman, P., and Speed, T. P. (2009) A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods. Bioinformatics 25, 1223–30.
31. Pique-Regi, R., Ortega, A., and Asgharzadeh, S. (2009) Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA. Bioinformatics 25, 1223–30.
32. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–64.
33. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31, e15.
34. Bengtsson, H., Irizarry, R., Carvalho, B., and Speed, T. P. (2008) Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics 24, 759–67.
35. Rueda, O. M., and Diaz-Uriarte, R. (2010) Finding recurrent copy number alteration regions: a review of methods. Curr Bioinform 5, 1–17.
36. Pique-Regi, R., Tsau, E., Ortega, A., Seeger, R. C., and Asgharzadeh, S. (2007) Wavelet footprints and sparse Bayesian learning for DNA copy number change analysis. IEEE Proc ICASSP.
37. Pique-Regi, R., Ortega, A., Triche, T. J., Seeger, R. C., and Asgharzadeh, S. (2008) Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 24, 309–18.
38. Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F. A., Hakonarson, H., and Bucan, M. (2007) PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 17, 1665–74.
39. McElroy, J. P., Nelson, M. R., Caillier, S. J., and Oksenberg, J. R. (2009) Copy number variation in African Americans. BMC Genet 10, 15.
40. Lee, C., Iafrate, A. J., and Brothman, A. R. (2007) Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat Genet 39, S48–54.
Chapter 12
Processing ChIP-Chip Data: From the Scanner to the Browser
Pierre Cauchy, Touati Benoukraf, and Pierre Ferrier
Abstract
High-density tiling microarrays are increasingly used in combination with chromatin immunoprecipitation (ChIP) assays to delineate the regulation of gene expression. Besides the technical challenges inherent to such complex biological assays, a critical, often daunting issue is the correct interpretation, using computational methods, of the sheer amount of raw data generated. Here, we go through the main steps of this intricate process, including optimized chromatin immunoprecipitation on chip (ChIP-chip) data normalization, peak detection, as well as quality control reports. We also describe convenient stand-alone software suites, including our own, CoCAS, which works on the latest generation of Agilent high-density arrays and allows dye swap, replicate correlation, and easy connection with genome browsers for results interpretation, or with, e.g., other peak detection algorithms. Overall, the guidelines described herein provide an effective introduction to ChIP-chip technology and analysis.
Key words: ChIP-chip, Chromatin immunoprecipitation, Microarray, Protein–DNA interaction, DNA motif, Normalization, Bioinformatics
1. Introduction
High-resolution mapping of epigenetic marks and transcription factors (TFs) is a powerful tool which, in the long run, helps to decipher the complexity of gene expression, including the intricate mechanisms sustaining regulatory networks of TFs (1), the polymerase II carboxyl-terminal domain (CTD) code (2, 3), and the histone code (4). Two major techniques, among others, can achieve such precise mapping: chromatin immunoprecipitation on chip (ChIP-chip) (5) and ChIP-Seq (6). In this chapter, we focus on ChIP-chip, describing both the low-level (i.e., data processing) and high-level (i.e., data interpretation) analysis methods used. The peak detection and high-level analysis methods described
herein also apply to ChIP-Seq. Although ChIP-Seq is becoming increasingly popular, ChIP-chip remains widely used due to its very competitive price and because complete genome coverage is not always necessary. In order to grasp the bioinformatics underlying ChIP-chip analysis, we would like to remind readers of the biological basis of this technique; for further details, see, e.g., (7). ChIP-chip essentially consists of ChIP followed by microarray hybridization. ChIP itself is a molecular biology technique that involves immunoprecipitation (IP) of one (or more) chromatin proteins. A cross-linking agent, usually formaldehyde, is used to freeze all bound chromatin proteins along with DNA. The chromatin is then fragmented, using either sonication or micrococcal nuclease (MNase) digestion. Following this step, a specific antibody is used to pull down the protein(s) of interest along with the chromatin to which it is bound. DNA is eventually recovered through elution of the immunoprecipitated chromatin and subsequent purification. Results can be assessed by Far Western blot (8) and/or via semiquantitative or real-time (qPCR) PCR amplification of a region of interest, known or suspected to interact with the protein, and compared to a control sample that underwent no IP (Input). ChIP-chip takes this a step further by making use of microarray technology. There is a variety of chromatin-associated proteins; the major types usually in focus are TFs and epigenetic marks. TFs may interact directly with DNA, in which case they are sequence-specific and therefore possess a DNA-binding motif for a specific transcription factor binding site (TFBS). Alternatively, they can bind to other TFs and usually behave as transcriptional cofactors that are able to modulate the trans-activating activity of TFs. TFs are either general, such as the TATA binding protein (TBP), which implies that they are part of the basal transcription machinery and are present in all cell types, or specific, in which case their presence is determined by several variables, such as the tissue/cell type, degree of phosphorylation, and/or other physical/structural changes following the triggering of discrete signal transduction cascades (9–12). Another type of comprehensive study involving ChIP-chip is that of epigenetic marks with regard to their effect on gene activation. Histone octamers form nucleosomes which, when wrapped with DNA, encompass 146 base pairs (bp). Histone tails undergo particular posttranslational modifications, including acetylation, methylation, phosphorylation, and ubiquitination, which are often a sign of gene activation or repression. These modifications, known to modulate gene activity (see Note 1), make up what is called the histone code (4). The above biological considerations are not without consequence in terms of experimental results and computational analysis, in that ChIP pulls down significantly
different amounts of DNA depending on whether the experiment was carried out on a histone modification mark or on a TF (see Subheadings 3.8.1 and 3.8.2). Regardless, it has been brought to light that certain epigenetic marks are specific to distinct types of cis-regulatory elements, namely, promoters and enhancers (13). Thus, computational analysis of the data generated by high-throughput techniques such as ChIP-chip still holds a bright future in terms of Systems Biology and regulatory network modeling.
2. Materials
ChIP-chip, though versatile enough to be carried out in the average lab, requires attention with respect to the choice of microarray type(s). Furthermore, this technique still makes use of heavy equipment, namely, scanners and reasonably powerful computing hardware, in order to run rather CPU-intensive software.
2.1. Physical Support of Microarrays: "Platforms"
Historically, the first ChIP-chip experiments were performed on nylon arrays using radioactive labeling. Modern ChIP-chip makes use of glass slides and one fluorescent cyanine dye per channel (see Note 2). A list of ChIP-chip microarray suppliers can be found in Table 1 (also see Note 3). The resolution, or genomic distance between each probe (specific to each platform), is an
Table 1
List of ChIP-chip microarray manufacturers and array types

Supplier | Maximum number of probes | Species supported | Number of channels | Tiling array
Affymetrix | 6,500,000 | Human, mouse, Arabidopsis, C. elegans, Drosophila, S. pombe, yeast | 1 | Yes
Agilent | 1,000,000 | Human, mouse, Arabidopsis, C. elegans, Drosophila, S. pombe, yeast, zebrafish | 2 | Yes
Nimblegen | 2,100,000 | Human, mouse, Arabidopsis, C. elegans, Drosophila, yeast, chicken, dog, rat, E. coli | 2 | Yes
Aviva | 20,000 | Human | 2 | No
NCI Operon | 30,942 | Human | 2 | No
important criterion during peak detection (see below). In Agilent arrays, probes are typically 60-mers, spaced in the genome by about 300 bp. This distance ranges from 100 to 200 bp in Nimblegen arrays. On custom arrays, probe intervals can shrink to 100 bp for both platforms. Finally, a noteworthy point is that in all arrays, probes are randomized in order to minimize hybridization artifacts (bubbles, uneven hybridization) and/or other false positive signals, e.g., dust particles, streaks caused by washing, etc.
2.2. Scanners
Agilent, Affymetrix, and Nimblegen manufacture their own scanners. Genepix manufactures scanners that can in theory scan any type of array, provided that its printing process is supported by the scanner (2, 5, or 10 µm) and that the array grid is available in the universal MicroArray Gene Expression Markup Language (MAGE-ML) format. This format also allows the scanning of microarrays using equipment from different manufacturers. Grid files are array-specific and should be downloaded and installed into the scanning software prior to scanning. Scanning instructions are manufacturer-specific and should be obtained from the vendor. Modern scanners usually read 2-µm arrays; however, older scanners often require an upgrade to read them. Scanning time depends on the array printing process, the parameters used, and the number of channels and passes. For example, reading a 244K array with two channels, two passes, and extended dynamic range (XDR) usually takes 15 min. A 1M array takes an hour with these settings.
2.3. Computer Hardware and Software
Most scanning software is written for 32-bit MS Windows, and scanner hardware drivers are usually not yet available in 64-bit versions, so make sure that you still have a 32-bit system. Memory concerns may arise when dealing with high-density arrays printed using a 2-µm process, in which case 4 GB of RAM is required. High storage capacity is also desirable, since a single image from such an array takes up to 1 GB, as does the corresponding Feature Extraction/Genepix file. We suggest these files be zipped, as most software (notably R) reads compressed Agilent and Nimblegen files, which yields about a 3:1 compression ratio. Affymetrix has worked around this problem by using compression in their CEL format.
3. Methods
As previously mentioned, the computational methods used for data interpretation of ChIP-chip can be divided into two main steps: low- and high-level microarray analysis. Low-level analysis
focuses on processing the data and comprises hybridization quality control, intra-array and inter-array normalization, as well as peak detection. High-level microarray analysis aims to interpret the experimental results and notably involves motif discovery, genomic average profiles, and gene and/or assay clustering.
3.1. Quality Control
Prior to starting the analysis, it is pivotal to assess the quality of the experiment, in a first step simply by visual inspection of the scanner image. A successful hybridization is characterized by an image composed of clear dots ranging from red to green. Underrepresentation of one color often hints at either dye or hybridization bias. The positive and negative control spots (frequently located at the corners of arrays) should also be checked; these should be brightly colored. Second, before starting normalization, the quality and level of hybridization should be assessed by computing a density plot (14) and an MA plot.
3.1.1. Density Plot
The Input channel (generally green – Cy3) represents total DNA, which should theoretically bind all probes with equal intensity, except for some controls. In statistical language, the signal should follow a normal distribution (Fig. 1a; but see Note 3 when handling, e.g., cell lines or primary cancer cells). Mathematically, one can represent the IP channel (generally red – Cy5) as a mixture of two Gaussian curves corresponding to specific and nonspecific binding, respectively (15). The degree of overlap between the two Gaussians directly reflects the enrichment level, which corresponds to the log ratio of IP/Input intensities (Fig. 1b, c; and see below). Concretely, Fig. 1b represents a low-enrichment experiment, such as that of a specific TF. Conversely, Fig. 1c shows high enrichment, such as that of a histone-tail modification (see Subheading 3.8.2).
Fig. 1. Input and ChIP density plots in primary cells. (a) Input intensities follow a normal distribution. Signal distribution of the ChIP channel is composed of a mixture of two Gaussian curves (b, c), which correspond to specific and nonspecific enrichment (respectively short and long dashed lines). The degree of overlap between both Gaussians is contingent on the overall enrichment level: low (b) and (c) high.
3.1.2. MA Plot
MA values are used to compare the probe values of each channel and are defined as follows for each probe:

M = log₂(R) − log₂(G)

A = (log₂(R) + log₂(G)) / 2
M is therefore the log intensity ratio (or fold change) and A is the average log intensity for a dot in the plot. MA plots are used to visualize the intensity-dependent ratio of raw and normalized microarray data. In an MA plot, M corresponds to the y-axis and A to the x-axis. In many microarray experiments, the general assumption is that most probes are not enriched; therefore, the majority of the points would be located at M = 0, since log(1) is 0. The MA plot gives a quick overview of dye bias (traditionally referred to as the "banana shape"), enrichment, and normalization effects (Fig. 2).
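As an illustration, M and A values can be computed and plotted directly from two vectors of raw channel intensities; the following is a minimal sketch in base R (R and G are assumed numeric vectors of red and green probe intensities; LIMMA's plotMA function provides an equivalent for its own data objects):
M <- log2(R) - log2(G) #Per-probe log ratio (fold change).
A <- 0.5 * (log2(R) + log2(G)) #Per-probe average log intensity.
plot(A, M, pch=".", xlab="A", ylab="M") #The MA plot.
abline(h=0, lty=2) #Most probes should scatter around M = 0.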
3.2. Normalization
Comparing two different channels involves making adjustments for systematic errors introduced by differences in procedures and dye intensity effects. A local regression (loess or lowess normalization) or a data rescaling (median normalization) can achieve dye normalization for two-color arrays; this is generally called intra-array normalization. LIMMA (16), an R/Bioconductor
Fig. 2. MA plots for ChIP-chip data prior to and following normalization. (a) Raw Suz-12 ChIP-chip data for an experiment, where the effects of dye bias are clearly visible on the linear regression of global intensities (dashed line). (b) Normalized Suz-12 data. We correct this bias using lowess intra-array normalization. The fit corresponds to the signal median. Since the ChIP now follows a normal distribution, enriched probes can be discerned by establishing a statistical threshold.
library, provides a set of tools for background correction (underlying noise in the array, see Note 4) and intra-array normalization. A handy method to assess normalization quality consists in drawing MA plots before (Fig. 2a) and after processing (Fig. 2b). Successful normalization should show the signal median around zero.
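A minimal sketch of this step with LIMMA is given below (the file names and the background-correction method are illustrative; see the LIMMA documentation for platform-specific import settings):
library(limma)
RG <- read.maimages(c("array1.txt", "array2.txt"), source="agilent") #Read raw two-color data.
RG <- backgroundCorrect(RG, method="normexp") #Background correction (see Note 4).
MA <- normalizeWithinArrays(RG, method="loess") #Intra-array (dye) normalization.
MA <- normalizeBetweenArrays(MA, method="quantile") #Inter-array normalization (see Subheading 3.3).
plotMA(MA, array=1) #Check: the signal median should now lie around zero.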
3.3. Microarray Data Merging
In experiments involving multiple arrays, data from replicates require correlating and merging to check whether results are reproducible. Likewise, arrays from multiple slide designs necessitate concatenation in order to be interpreted. Additionally, comparing several arrays implies having a common reference; to this end, inter-array normalization may be performed on replicates and/or multiple slide designs. Inter-array normalization types include median, quantile, and variance stabilization and normalization (VSN) (16).
3.3.1. Replicates
Replicates can be merged simply by averaging each probe value or using a weighted average. As a matter of fact, certain scanner extraction software, such as Agilent’s Feature Extraction, provides P-values for spot quality. These are used by the Rosetta error model (17) when merging replicates.
3.3.2. Correlations
An intensity correlation plot is the best way to assess experimental reproducibility. If there are several replicates, correlation plots should be generated for all of them. A Pearson correlation score shows the strength of the relationship between replicates X and Y:
rX,Y = E[(X − μX)(Y − μY)] / (σX σY)
where E is the expected value, μ is the signal average, and σ is the standard deviation. The higher the correlation coefficient, the stronger the relationship. Correlations are only meaningful on the enriched probes and/or peaks: in ChIP-chip assays, most probes are usually not enriched and correspond to background, which by definition does not correlate. Therefore, interpreting the correlation coefficient should only take enriched probes into account.
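As an illustration, the following sketch computes the Pearson coefficient over enriched probes only (rep1 and rep2 are assumed vectors of normalized log ratios from two replicates; the threshold follows Subheading 3.5.2):
bg <- mean(c(rep1, rep2)) + 2 * sd(c(rep1, rep2)) #Background threshold (n = 2).
enriched <- rep1 > bg | rep2 > bg #Restrict to probes enriched in either replicate.
cor(rep1[enriched], rep2[enriched], method="pearson", use="complete.obs") #Correlation over enriched probes.
plot(rep1[enriched], rep2[enriched], pch=".") #Intensity correlation plot.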
3.3.3. Multiple Slide Designs
The resolution of modern tiling arrays reaches two million probes per array using a 2-µm printing process. While this can be sufficient for certain experiments, such as transcriptome and short promoter arrays, full genome coverage still requires several arrays. This statement also applies to lower resolution platforms, e.g., the Agilent Human promoter array set, which spans 2 × 244K probes. Each array must be treated separately until inter-array normalization is complete, as implemented in most programs. In R, one would need to concatenate the corresponding parts of each design together.
3.4. Dye Swap
Dye bias often occurs, especially for Cy5, which is sensitive to heat and high ozone concentrations. Ideally, labeling and hybridization are carried out in ozone-free and mild environments; however, Cy5 quenching can still occur. Moreover, dye incorporation can be uneven during labeling. A simple solution consists in doubling the number of arrays and swapping dyes within a given set. For example, you can use Cy5 as the IP channel and Cy3 as the Input channel in one array, and the reverse in a technical replicate. Most ChIP-chip analysis programs allow this. In R, this translates as swapping the R and G columns of the given array(s) in an RGList/ExpressionSet object before normalization.
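In LIMMA, for instance, this amounts to exchanging the two channel matrices of the RGList for the affected arrays; a minimal sketch (the swapped array indices are hypothetical):
swapped <- c(2, 4) #Hypothetical indices of the dye-swapped arrays.
tmp <- RG$R[, swapped]
RG$R[, swapped] <- RG$G[, swapped] #Move the IP signal back to the red channel.
RG$G[, swapped] <- tmp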
3.5. Peak Detection
The main goal of this process is to identify significant binding events. Peak detection is carried out on normalized ChIP-chip data. Normalized probe intensity log ratios can be viewed in most open-source genome browsers, e.g., IGB or the UCSC Genome Browser (see Note 5), provided that the microarray data are in Browser Extensible Data (BED), General Feature Format (GFF), or S-Plus GRaph (SGR) format. Probes are sorted by position on the genome and, due to the nature of ChIP-chip, enriched regions show up as a "peak" spanning several probes. The platform's resolution also defines how precise a peak's boundaries are. As a matter of fact, ChIP fragments have different sizes and span around the IP-ed protein, resulting in a phenomenon called the neighborhood effect (Fig. 3). Consequently, a peak can be detected if the signal of its central probe is significantly above the background. Thanks to this property, artifacts can be identified, since for a true peak the neighboring probes should be enriched as well. The background can be estimated using the following two methods.
3.5.1. Models
– Statistical model. This method makes use of the binding event probability. In Mpeak (18), peak detection consists in superimposing a triangle onto the signal curve using a sliding window; peaks are scored according to their area.
– Deterministic model. In this model, no probabilities are computed. Due to the neighborhood effect, a binding event is detected if at least three consecutive probes are significantly greater than a given threshold.
3.5.2. Background Estimation
Below, we list two main ways to calculate a significance threshold. Most algorithms assess the background Bg as:

Bg = Mean(Signal) + n × StandardDeviation(Signal)

where Signal represents the probe intensities and n generally equals 2 (see Note 6).
Fig. 3. Neighborhood effect in ChIP-chip. Top: Consecutive probes for a given genomic region are shown. ChIP fragments corresponding to one binding event vary in size and span. They therefore hybridize to more than one oligonucleotide probe from a given tiling set. Bottom: This translates as a Gaussian-shaped IP/Input log ratio centered around the actual binding site, where the signal is highest.
Alternatively, Ringo (19) is an original method which uses random probe shuffling and smoothing of intensity log ratios; it is generally accepted that the background corresponds to all signal below the 99th percentile. Our software, CoCAS (20), makes use of both techniques in order to perform peak detection. However, we introduced a second threshold that we call the "extension threshold": once a significantly enriched probe has been isolated, one needs to identify the probes that are also part of the given peak. Due to the neighborhood effect, these surrounding probes likely possess a weaker signal ratio, especially in the case of TFs.
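A minimal sketch of the deterministic model is shown below; it is an illustration of the principle rather than CoCAS code, and assumes a numeric vector logratio of normalized log ratios sorted by genomic position:
bg <- mean(logratio) + 2 * sd(logratio) #Background threshold with n = 2 (see Note 6).
calls <- logratio > bg #Per-probe enrichment calls.
runs <- rle(calls) #Runs of consecutive enriched probes.
sum(runs$values & runs$lengths >= 3) #Binding events: runs of at least three enriched probes.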
3.6. Motif Discovery
TF or histone ChIP-ed regions are often motif-rich (21). One way to elucidate a region's function is to study its motif composition. We can simply correlate enrichment to CpG content, which is useful for methylation experiments, or map regions by potential TFBS. This approach is widely used in the case of ChIP-chip experiments on TFs (22). There are three methods that can be used and combined when looking for motifs.
The first one involves matching known patterns or position weight matrices; the latter often have been identified using Selex (23). A convenient suite for this purpose is RSA Tools (http://rsat.ulb.ac.be/rsat) (see Note 5). The second one is known as phylogenetic footprinting (24), which is based on sequence conservation. Its paradigm is that functional motifs should be conserved throughout several species. We thereby perform a pairwise or a multiple alignment of the sequences of interest in different species in order to highlight conserved motifs. We then annotate these sequences by comparing overrepresented motifs to TFBS databases, such as Jaspar (25) or Transfac (http://www.gene-regulation.com/pub/databases.html). The third method is based on motif discovery. Here, we attempt to isolate both known and unknown motifs that are enriched in ChIP-ed regions. This is useful when studying a TF whose binding motif is unknown, or when looking for TF modules (1). MEME (26, 27) and Gibbs sampling (28) are two convenient algorithms for this purpose.
3.7. Gene Profiles
Experiments that generate high-level enrichments (histone marks, a few TFs such as Pol II, Suz12, etc.) may be illustrated by plotting the average ChIP signal throughout the gene body as well as around the transcription start site (TSS). Such profiles are helpful to discriminate between several clusters, depending on the enrichment, transcription level, gene ontology, etc. Most profile studies are performed in R/Bioconductor by interpolation or local regression over a gene's probes (e.g., the approx function from base R or the locfit function from the locfit package). For a more thorough analysis, R/Bioconductor allows the linking of data to genome annotation and ontology (29). Alternatively, when comparing experiments, clustering by assay and gene reveals how and where assays correlate. One may use either total peaks or the signal flanking the TSS, gene body, or transcription termination site (TTS). Suitable programs for this type of analysis are CEAS, RSA Tools, and CisGenome (see Note 5).
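A minimal sketch of such an average TSS profile in base R is given below (probe.pos, logratio, and tss are assumed inputs: probe positions, normalized log ratios, and TSS coordinates, all on the same chromosome):
win <- 5000; bin <- 250 #Profile from -5 kb to +5 kb in 250-bp bins.
offsets <- seq(-win, win - bin, by=bin)
profile <- sapply(offsets, function(o) {
  idx <- unlist(lapply(tss, function(t) which(probe.pos >= t + o & probe.pos < t + o + bin)))
  mean(logratio[idx], na.rm=TRUE) #Average signal in this bin over all genes.
})
plot(offsets + bin/2, profile, type="l", xlab="Distance to TSS (bp)", ylab="Mean log2 IP/Input")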
3.8. Practical Applications
Here, we present a guided analysis of the Polycomb protein Suz12 ChIP-chip assay in embryonic stem (ES) cells (20) using our own software suite, CoCAS, as well as a brief analysis for the TF ETS1. Both experiments were carried out using two biological replicates. Each replicate was hybridized on a promoter array set made up of two arrays, printed respectively with probes from chromosomes 1 to 10 and 11 to Y. Dye swap was used for the second biological replicate.
3.8.1. Examples
1. The Feature Extraction files may be downloaded at http://www.ciml.univ-mrs.fr/software/cocas/dataset/suz12.zip. All four files can be loaded in CoCAS.
2. Since we are using multiple array designs, we need to check the "multiple slide design merge" option. In this mode, arrays with identical "Slide #" are treated as biological replicates. Consequently, the first two slides (from chr1 to 10, for each replicate) are labeled "1" in order to be concatenated with the last two slides (hence labeled "2").
3. Since we used dye swap, we need to specify this to CoCAS. To do so, we select "Cy3" under IP for both slides that make up the array set undergoing dye swap. Typically, these are the slides appearing last in each replicate.
4. In "Intra-Normalization type," we select "Loess" (see Figs. 2 and 4 for details). In "Inter-Normalization Type," we select median.
5. Peak detection: Ringo with 0.99 (99th percentile) as the central probe threshold and 0.95 as the extension threshold. CoCAS is then started.
6. In the UCSC Genome Browser, the output shows a clear peak on the KLF4 gene (Fig. 4a).
Using similar parameters, we can also process ChIP-chip data from an experiment where an anti-ETS1 antibody was used (PC, unpublished results). In this assay, custom arrays spotted with probes ranging from −50 kb to +50 kb around the TSS were hybridized. A peak is clearly visible on the TCRα gene enhancer-containing region (Fig. 4b). In this example, we show how CoCAS can be used for ChIP-chip analysis on the Agilent microarray platform. However, in Note 5, we provide a full list of programs for each step of the analysis.
Fig. 4. Views of binding events (arrows) in the UCSC Genome Browser of (a) a Suz12 peak in the promoter region of Klf4, a TF involved in ES cell differentiation, and (b) a well-known ETS1 peak in the mouse TCRα locus enhancer (43). Each bar corresponds to a probe's intensity log ratio.
3.8.2. Pitfalls
1. TF ChIP bias. Due to the much lower occurrence of specific TFBS, a ChIP assay using an antibody against a given TF generally yields significantly smaller amounts of DNA (usually in the range of 1–15 ng per 10⁸ cells, depending on the antibody's efficiency, vs. 1–5 µg for a histone modification mark). Consequently, enrichment appears much lower when plotting an MA plot. In such cases, one should use lowess (intensity-dependent) intra-array normalization. Traditionally, histone modification ChIP-chip assays only require median intra-array normalization.
2. How to detect a problem with the Input DNA? A non-Gaussian distribution in a density plot reveals experimental issues, such as incomplete labeling or amplification bias (Fig. 5a). In such cases, the experiment should be redone.
3. The density plot shows no enrichment following normalization. Extremely poor enrichment, even following lowess normalization, often reveals poor antibody specificity. This results in a very low signal/noise ratio (Fig. 5b), necessitating either performing the experiment again with more washes, or using
Fig. 5. (a) Density plot of an experiment with labeling or amplification bias shows non-Gaussian distributions in the Input and IP in primary cells. (b) When viewed in IGB, correct signal appears homogeneous, where peaks can clearly be distinguished from the background (top). Isolated enriched probes distributed heterogeneously generally correspond to false positives (bottom), which reflect low signal/noise ratio.
an antibody with improved specificity. Typically, there should be a minimum tenfold ChIP enrichment against the reference background (IgG mock IP) when performing quality controls, e.g., qPCR.
4. Notes
1. Histone modification marks are written as histone number, residue, and type of modification, e.g., H3K4me3 is histone 3 lysine 4 trimethylation. Activation marks include the latter and H3K36me3 (also a sign of gene transcription), as well as H3K4/K9Ac2. Known repression marks include H3K27me3 and H3K9me3.
2. Channels: two types of experiments can be performed in ChIP-chip, one-channel (ChIP only) or two-channel (ChIP and Input) experiments. The ChIP sample is amplified using either whole genome amplification (WGA) or T7 amplification, then labeled with a fluorescent dye. Two-channel experiments involve similar treatment for Input samples, although these are labeled using a separate dye. In this case, Cy5 (red) is usually used for the ChIP channel and Cy3 for the Input channel. These can be inverted for dye-swap experiments in order to correct for dye bias.
3. Affymetrix does not use two-channel comparative hybridization (30); therefore, one-channel data processing is required. Not having an Input channel has downsides in that enrichment in the IP channel can be caused by either WGA/T7 amplification bias or abnormally amplified regions in the genome, which is often the case in cell lines.
4. Background subtraction consists in removing background noise that is inherent to the quality of a given hybridization. It is measured by spot detection software and corresponds to the signal around each spot. High background noise occurs upon poor hybridization; therefore, subtraction should not be used in this case. Not all programs feature this, although CoCAS does. Additionally, background subtraction can be passed as an argument to the "normalizeWithinArrays" function provided by R/LIMMA. When used, this function contributes toward improving the experiment's signal/noise ratio.
5. Software and program list. The widespread use of the techniques outlined in this chapter has prompted several laboratories to develop new analytic tools. In the following, we provide a list of programs we recommend for use for both low- and high-level analysis, sorted by user-friendliness. The first
part describes online programs (due to high uploading times, mostly used for high-level analysis). They can be run on a standard machine and are intended for nonspecialists. Results are generally sent by e-mail. The second part deals with stand-alone programs. Most of these are also intended for nonspecialists but require a relatively powerful computer, as they can carry out most steps of ChIP-chip analysis locally. The third part highlights scripts and programing functions, which can provide a more personalized analysis, although a minimum expertise in computer science and bioinformatics is required.
Online programs
– TAMALPAIS (31) is a Web-based online analysis tool for NimbleGen ChIP-chip platforms, which allows low-level analysis and provides binding site localization (http://chipanalysis.genomecenter.ucdavis.edu/cgi-bin/tamalpais.cgi).
– UCSC Genome Browser (http://genome.ucsc.edu) is a convenient genome browser that allows superimposing of diverse biological data, such as sequence composition, species conservation, expression, etc.
– CEAS (32) is a high-level analysis suite, starting from a BED or GFF file. This pipeline provides information about ChIP-ed sequences, such as GC content, evolutionary conservation, annotation, TFBS, etc., and may be used for human and mouse genomes (http://liulab.dfci.harvard.edu/CEAS; a stand-alone version is available since January 2010).
– The MEME suite is an online tool helpful for motif discovery (26, 27) and motif annotation (33, 34) (http://meme.sdsc.edu/meme4_3_0).
– DCODE.org (35) combines several phylogenetic footprinting tools, such as ECR Browser and MultiTF (http://www.dcode.org).
– RSA Tools (http://rsat.ulb.ac.be/rsat/) provides several efficient motif search and discovery algorithms.
Stand-alone programs
– CoCAS (20), as described earlier (see Subheading 3.8.1) (http://www.ciml.univ-mrs.fr/software/cocas).
– HMMTiling (36) is a Python analysis pipeline for the Affymetrix platform (http://chip.dfci.harvard.edu/~wli/HMM.Tiling/HMMTiling/HMMTiling_Readme.htm).
– CisGenome (21, 37) is a ChIP-chip/Seq analysis suite. It is designed for Affymetrix tiling arrays but can import
other formats. CisGenome allows normalization, sequence annotation, motif discovery, and TFBS mapping (http://www.biostat.jhsph.edu/~hji/cisgenome).
– Chipper (38) is an online and stand-alone R-based program. The main normalization method is the VSN package (39) from Bioconductor. All ChIP-chip platforms are supported, though raw data should be converted to a simple format (Identifier, IP, control) prior to use (http://llama.med.harvard.edu/Chipper).
– Mpeak (18) is a reference peak detection program that takes GFF format files as input (http://www.stat.ucla.edu/~zmdl/mpeak).
– IGB is a dynamic genome browser, useful mostly for data visualization. IGB works with any gff/bed/wig/sgr file. It can be downloaded at http://igb.bioviz.org/.
– DRIM (40) is a motif discovery algorithm for ChIP-ed sequences (http://bioinfo.cs.technion.ac.il/drim).
Scripts and programing functions
– R/Bioconductor. R is a free programing language for statistical computing and graphics. Combined with Bioconductor, a mainstream bioinformatics R package repository, the language has become the main environment for microarray analysis. Most worldwide bioinformatics contributions in this field are hosted at http://bioconductor.org.
– Ringo (19) is an R/Bioconductor package designed to perform low-level analysis, peak detection, and genome visualization for the Nimblegen and Agilent platforms.
– ACME (41) is also an R/Bioconductor package that combines several algorithms for calculating ChIP-chip enrichment.
– TiMAT2 (http://bdtnp.lbl.gov/TiMAT/TiMAT2) is a collection of JAVA command-line applications. It can be used for both low- and high-level tiling microarray data analysis using Affymetrix or Nimblegen platforms. TiMAT2 is designed for processing ChIP-chip, transcriptome assays, as well as comparative genome-wide hybridization experiments from both single and multi-chip datasets.
– ChIPOTle (42) is a Microsoft Excel macro. Note that there is a restriction on the number of probes that can be analyzed, since the maximum number of rows handled by Excel is 65,536 in Excel 2003 and about one million in Excel 2007 (http://www.bio.unc.edu/faculty/lieb/labpages/chipotle/home.htm).
6. Central probe threshold: Gaussian confidence intervals. If signal X follows a normal distribution, its signal density curve follows a Gaussian distribution. Therefore, the probability that X is less than a threshold x is:

Pr(X ≤ x) = ½ [1 + erf((x − μ) / (σ√2))]

where erf is the error function. According to the "three sigma" rule, the area under the bell curve between μ − 2σ and μ + 2σ, in terms of the cumulative normal distribution function, is equal to erf(2/√2) = 0.9545. Consequently, when n = 2, Pr(μ − 2σ ≤ X ≤ μ + 2σ) = 0.9545, i.e., the probability of falling outside this interval, Pr[(X ≤ μ − 2σ) ∪ (X ≥ μ + 2σ)], is only 0.0455. As for the central probe, we therefore use the threshold μ + 2σ, since we are only interested in positive events.
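This coverage probability is easily verified in R through the normal cumulative distribution function:
2 * pnorm(2) - 1 #Probability mass within two standard deviations of the mean: 0.9545.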
Acknowledgments Work in the Ferrier laboratory is supported by Inserm, CNRS, the Agence Nationale de la Recherche (ANR), Institut National du Cancer (INCa), Association pour la Recherche sur le Cancer (ARC), Fondation Princesse Grace de Monaco, Fondation de France, Association Laurette Fugain, and Commission of the European Communities. PC and TB were supported by fellowships from, respectively, INCa and Marseille-Nice Genopole and ANR 06-BYOS-0006; both are now fellows from the Fondation de la Recherche Médicale (FRM). We also extend our thanks to Jean-Christophe Andrau, Salvatore Spicuglia, Frederic Koch, and Frederic Rosa for their comments, as well as Virginia Cauchy for her corrections. References 1. Smeenk, L., van Heeringen, S. J., Koeppel, M., van Driel, M. A., Bartels, S. J., Akkers, R. C., Denissov, S., Stunnenberg, H. G. and Lohrum, M. (2008) Characterization of genome-wide p53-binding sites upon stress response. Nucleic Acids Res 36, 3639–54. 2. Koch, F., Jourquin, F., Ferrier, P. and Andrau, J. C. (2008) Genome-wide RNA polymerase II: not genes only! Trends Biochem Sci 33, 265–73. 3. Buratowski, S. (2003) The CTD code. Nat Struct Biol 10, 679–80. 4. Kouzarides, T. (2007) Chromatin modifications and their function. Cell 128, 693–705.
5. Ren, B., Robert, F., Wyrick, J. J., Aparicio, O., Jennings, E. G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T. L., Wilson, C. J., Bell, S. P. and Young, R. A. (2000) Genome-wide location and function of DNA binding proteins. Science 290, 2306–9. 6. Barski, A. and Zhao, K. (2009) Genomic location analysis by ChIP-Seq. J Cell Biochem 107, 11–8. 7. Gilmour, D. S. and Lis, J. T. (1985) In vivo interactions of RNA polymerase II with genes of Drosophila melanogaster. Mol Cell Biol 5, 2009–18.
Processing ChIP-Chip Data: From the Scanner to the Browser 8. Defeo-Jones, D., Huang, P. S., Jones, R. E., Haskell, K. M., Vuocolo, G. A., Hanobik, M. G., Huber, H. E. and Oliff, A. (1991) Cloning of cDNAs for cellular proteins that bind to the retinoblastoma gene product. Nature 352, 251–4. 9. Siegel, J. N., Egerton, M., Phillips, A. F. and Samelson, L. E. (1991) Multiple signal transduction pathways activated through the T cell receptor for antigen. Semin Immunol 3, 325–34. 10. Darnell, J. E., Jr., Kerr, I. M. and Stark, G. R. (1994) Jak-STAT pathways and transcriptional activation in response to IFNs and other extracellular signaling proteins. Science 264, 1415–21. 11. Hanekom, C., Nel, A., Gittinger, C., Rheeder, A. and Landreth, G. (1989) Complexing of the CD-3 subunit by a monoclonal antibody activates a microtubule-associated protein 2 (MAP-2) serine kinase in Jurkat cells. Biochem J 262, 449–56. 12. Berridge, M. J. and Irvine, R. F. (1984) Inositol trisphosphate, a novel second messenger in cellular signal transduction. Nature 312, 315–21. 13. Heintzman, N. D., Stuart, R. K., Hon, G., Fu, Y., Ching, C. W., Hawkins, R. D., Barrera, L. O., Van Calcar, S., Qu, C., Ching, K. A., Wang, W., Weng, Z., Green, R. D., Crawford, G. E. and Ren, B. (2007) Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 39, 311–8. 14. Parzen, E. (1962) On estimation of a probability density function and mode. Ann Math Stat 33, 1065–76. 15. Martin-Magniette, M. L., Mary-Huard, T., Berard, C. and Robin, S. (2008) ChIPmix: mixture model of regressions for two-color ChIP-chip analysis. Bioinformatics 24, i181–6. 16. Smyth, G. K. (2004) Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, Iss 1 Article 3. http:// www.bepress.com/sagmb/vol3/iss1/art3/. 17. Weng, L., Dai, H., Zhan, Y., He, Y., Stepaniants, S. B. and Bassett, D. E. (2006) Rosetta error model for gene expression analysis. Bioinformatics 22, 1111–21. 18. Zheng, M., Barrera, L. O., Ren, B. and Wu, Y. N. (2007) ChIP-chip: data, model, and analysis. Biometrics 63, 787–96. 19. Toedling, J., Skylar, O., Krueger, T., Fischer, J. J., Sperling, S. and Huber, W. (2007) Ringo – an R/Bioconductor package for analyzing ChIP-chip readouts. BMC Bioinformatics 8, 221.
20. Benoukraf, T., Cauchy, P., Fenouil, R., Jeanniard, A., Koch, F., Jaeger, S., Thieffry, D., Imbert, J., Andrau, J. C., Spicuglia, S. and Ferrier, P. (2009) CoCAS: a ChIP-on-chip analysis suite. Bioinformatics 25, 954–5. 21. Ji, H., Vokes, S. A. and Wong, W. H. (2006) A comparative analysis of genome-wide chromatin immunoprecipitation data for mammalian transcription factors. Nucleic Acids Res 34, e146. 22. Farnham, P. J. (2009) Insights from genomic profiling of transcription factors. Nat Rev Genet 10, 605–16. 23. Tuerk, C. and Gold, L. (1990) Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249, 505–10. 24. Kesmir, C., van Noort, V., de Boer, R. J. and Hogeweg, P. (2003) Bioinformatic analysis of functional differences between the immunoproteasome and the constitutive proteasome. Immunogenetics 55, 437–49. 25. Bryne, J. C., Valen, E., Tang, M. H., Marstrand, T., Winther, O., da Piedade, I., Krogh, A., Lenhard, B. and Sandelin, A. (2008) JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 36, D102–6. 26. Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28–36. 27. Bailey, T. L. and Gribskov, M. (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48–54. 28. Liu, X., Brutlag, D. L. and Liu, J. S. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of coexpressed genes. Pac Symp Biocomput 6, 127–38. 29. Toedling, J. and Huber, W. (2008) Analyzing ChIP-chip data using bioconductor. PLoS Comput Biol 4, e1000227. 30. Johnson, D. S., Li, W., Gordon, D. B., Bhattacharjee, A., Curry, B., Ghosh, J., Brizuela, L., Carroll, J. S., Brown, M., Flicek, P., Koch, C. M., Dunham, I., Bieda, M., Xu, X., Farnham, P. J., Kapranov, P., Nix, D. A., Gingeras, T. R., Zhang, X., Holster, H., Jiang, N., Green, R. D., Song, J. S., McCuine, S. A., Anton, E., Nguyen, L., Trinklein, N. D., Ye, Z., Ching, K., Hawkins, D., Ren, B., Scacheri, P. C., Rozowsky, J., Karpikov, A., Euskirchen, G., Weissman, S., Gerstein, M., Snyder, M., Yang, A., Moqtaderi, Z., Hirsch, H., Shulha, H. P., Fu, Y., Weng, Z., Struhl, K., Myers, R.
M., Lieb, J. D. and Liu, X. S. (2008) Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 18, 393–403.
31. Bieda, M., Xu, X., Singer, M. A., Green, R. and Farnham, P. J. (2006) Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res 16, 595–605.
32. Shin, H., Liu, T., Manrai, A. K. and Liu, X. S. (2009) CEAS: cis-regulatory element annotation system. Bioinformatics 25, 2605–6.
33. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. and Noble, W. S. (2007) Quantifying similarity between motifs. Genome Biol 8, R24.
34. Boden, M. and Bailey, T. L. (2008) Associating transcription factor-binding site motifs with target GO terms and target genes. Nucleic Acids Res 36, 4108–17.
35. Ovcharenko, I., Nobrega, M. A., Loots, G. G. and Stubbs, L. (2004) ECR Browser: a tool for visualizing and accessing data from comparisons of multiple vertebrate genomes. Nucleic Acids Res 32, W280–6.
36. Li, W., Meyer, C. A. and Liu, X. S. (2005) A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21 Suppl 1, i274–82.
37. Zhou, Q. and Wong, W. H. (2004) CisModule: de novo discovery of cis-regulatory modules
by hierarchical mixture modeling. Proc Natl Acad Sci USA 101, 12114–9. 38. Gibbons, F. D., Proft, M., Struhl, K. and Roth, F. P. (2005) Chipper: discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization. Genome Biol 6, R96. 39. Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A. and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl 1, S96–104. 40. Eden, E., Lipson, D., Yogev, S. and Yakhini, Z. (2007) Discovering motifs in ranked lists of DNA sequences. PLoS Comput Biol 3, e39. 41. Scacheri, P. C., Crawford, G. E. and Davis, S. (2006) Statistics for ChIP-chip and DNase hypersensitivity experiments on NimbleGen arrays. Methods Enzymol 411, 270–82. 42. Buck, M. J., Nobel, A. B. and Lieb, J. D. (2005) ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol 6, R97. 43. Ho, I. C., Bhat, N. K., Gottschalk, L. R., Lindsten, T., Thompson, C. B., Papas, T. S. and Leiden, J. M. (1990) Sequence-specific binding of human Ets-1 to the T cell receptor alpha gene enhancer. Science 250, 814–8.
Chapter 13

Insights into Global Mechanisms and Disease by Gene Expression Profiling

Fátima Sánchez-Cabo, Johannes Rainer, Ana Dopazo, Zlatko Trajanoski, and Hubert Hackl

Abstract

Transcriptomics has played an essential role as proof of concept in the development of experimental and bioinformatics approaches for the generation and analysis of Omics data. We give an introduction to how large-scale technologies for gene expression profiling, especially microarrays, have shifted the focus from the study of single molecular events to a systems-level view of the global mechanisms in a cell, its biological processes, and their pathological alterations. The main platforms available for gene expression profiling (from microarrays to RNA-seq) are presented, and the general concepts that need to be taken into account for proper data analysis, in order to extract objective and general conclusions from transcriptomics experiments, are introduced. We also describe the main bioinformatics resources available for this purpose.

Key words: Gene expression profiling, Transcriptomics, Microarrays, RNA-seq
1. Introduction

With the advent of microarray technology, the simultaneous profiling of the expression levels of thousands of genes in a single experiment has become possible. The great potential of DNA microarrays lies in viewing the technology not only as a collection of individual expression measurements, but as a means of generating a composite picture of the expression profile of the cell. The possibility of taking a snapshot of the currently expressed genes at a given time, state, environment, genetic background, and treatment, in one or numerous cells and over several different conditions, makes it an indispensable tool for genomics studies. This approach specifically allows insights into global mechanisms; for instance,
to show which pathways and biological processes are turned on and off at the transcriptional level under a specific condition, or whether there is an association between the phenotype (from patient samples/tissues) and typical expression profiles (classification and stratification). Even though there are several regulatory mechanisms between the DNA and the active protein, such as epigenetic and posttranscriptional regulation, the genome-wide transcriptional status over a wide range of conditions might reflect the global underlying regulation patterns. A basic assumption is that coexpressed genes (sharing similar expression profiles over several conditions) are coregulated (by the same set of transcriptional regulators) or involved in the same biological process or mechanism ("guilt-by-association") (1). The ultimate goal is the systematic understanding of the gene regulatory network. Considering the complexity of the transcriptional process in higher eukaryotes and the limited sensitivity/accuracy of microarray assays for low-abundance mRNAs, attempts to deduce regulatory interactions from gene expression profiles alone might not be sufficient to build a valid network model. However, in some cases, the successful identification of (direct) transcription factor targets has been shown based on time series microarray experiments on a perturbed system (2, 3). Microarrays are nowadays a standard laboratory technique in basic research, mostly used to screen for new molecular targets (genes differentially expressed between samples). Moreover, this technology has found its way into clinical medicine and pharmacogenomics. Especially in the research and diagnosis of cancer and the analysis of tumors, this approach adds immense detail and complexity to the information available from traditional clinical and pathological sources. The applications include the identification and validation of cancer biomarkers and therapeutic targets, the elucidation of the mechanisms of cancer pathways, and clinical classification and stratification. A variety of cancer studies and profiling efforts on cancer cell lines have been performed using DNA microarrays. A large fraction of these data can be found integrated in Oncomine (http://www.oncomine.org) (4), a cancer microarray database and web-based data-mining platform. Over the years, different methods have been developed to study the transcriptional activity of genes, ranging from qualitative methods like northern blotting and RNase protection assays, over low-throughput methods like differential display (5), differential and subtractive hybridization (6, 7), and real-time reverse transcription polymerase chain reaction (qPCR) (8), to high-throughput methods involving oligo- and cDNA microarrays (9, 10) and sequencing-based large-scale methods such as serial analysis of gene expression (SAGE) (11), massively parallel signature sequencing (MPSS) (12), and RNA-seq (13, 14). For a long time, hybridization-based methods, especially microarrays, have been the main source for the generation of large-scale gene expression data. However, with the
advent of new sequence-based technologies, RNA-seq will become more popular, especially if its price drops. Since no probes have to be preselected, the study of alternative splicing and of the differential expression of isoforms becomes possible. Meanwhile, there is still a place for microarrays, given that their analysis tools are more widely available and better developed, making analysis less complicated and more affordable in the case of a high number of samples. The chapter is organized as follows: in the following section we describe the main platforms available for gene expression profiling, from traditional arrays to RNA-seq; we summarize the main points that should be taken into account in the experimental design, outline the data generation pipeline, and give some examples of typical microarray output files. In Subheading 3 we describe the state-of-the-art methods for data analysis: platform-specific preprocessing methods, statistical analysis tools and mathematical models, biological interpretation, and data integration of the results. Some of the methods are exemplified on publicly available data (15) from eight samples of basal-like breast cancer (BLC) versus eight non-BLC samples. Finally, we provide different resources and some practical remarks to be taken into account in gene expression experiments.
2. Materials

2.1. Platforms

2.1.1. qPCR
The real-time reverse transcription polymerase chain reaction (RT-PCR, qPCR) uses fluorescent reporter molecules to monitor the production of amplification products during each cycle of the PCR reaction (16). There are different variations of this technology (for a review see ref. 17) and different calculation methods, including comparison to a reference (18S rRNA, housekeeping genes) (18–21). qPCR is not per se a method to generate Omics data, but it can be used (1) to validate the expression levels of selected genes from a microarray analysis, and (2) to produce medium-scale data (e.g. in 384-well format), exploiting the high sensitivity, large dynamic range, and accuracy of qPCR (22). TaqMan® assays from Applied Biosystems, with predesigned and tested primers and probes, are an excellent tool for this purpose.
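As an illustration of the reference-based calculation, the following minimal R sketch computes a relative expression value with the widely used comparative Ct (2^(-ddCt)) approach; the Ct values are invented, the data layout is hypothetical, and an amplification efficiency of 100% is assumed.

## Hypothetical Ct values for one target gene and one housekeeping
## reference in two control and two treated samples (invented numbers)
ct <- data.frame(
  group     = c("ctrl", "ctrl", "treat", "treat"),
  ct_target = c(24.1, 24.3, 21.8, 22.0),
  ct_ref    = c(18.0, 18.1, 18.0, 17.9)
)
dct  <- ct$ct_target - ct$ct_ref          # normalize to the reference gene
ddct <- mean(dct[ct$group == "treat"]) -  # treated versus control
        mean(dct[ct$group == "ctrl"])
2^(-ddct)                                 # relative fold change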
2.1.2. 3′ Gene Expression Arrays

2.1.2.1. Affymetrix

Affymetrix GeneChips® are one of the most widely used (commercial) platforms for DNA microarray analyses. DNA sequences are built up using light-directed chemical synthesis; for this purpose, photolithographic masks are used. The whole procedure is similar to that of semiconductor production and dates back to the work of Stephen Fodor and his team in the early 1990s (23, 24). 25-mer oligonucleotides (probes) are synthesized in situ on the surface of a glass slide. This technology, together with enhancements in
array manufacturing, allows a high number of features and high feature densities. The latest generation features an 11 µm spacing (GeneChips® typically have features in squared patches) within a total area of 1.28 cm², allowing up to 1.3 million unique oligonucleotides per array. Each gene (transcript) is represented on the array by 11–20 paired sets of perfect match (PM) and mismatch (MM) oligonucleotides, such that PM–MM pairs are in adjacent positions on the array (but otherwise randomly distributed across the array). Whereas the PM feature is a perfect match to the target sequence, the MM has a single mismatch in the center (position 13 of 25). The purpose of the MM sequence is to capture the nonspecific binding (background) that would otherwise interfere with the measured intensity level of the PM. To extract the expression level of each gene, the probe set pairs have to be summarized based on the intensity levels of PM and MM. For this purpose a variety of algorithms have been developed; some of them are discussed below. GeneChips® are one-color arrays, where only one sample is hybridized to each array. Quite a number of different versions of GeneChips® for different organisms and applications have been developed so far. To give an example of the current 3′ gene expression arrays, the Human Genome U133 Plus 2.0 Array contains over 54,000 probe sets representing approximately 38,500 transcripts on a single array. Probe sets are designed from the sequences (600 bp) most proximal to the 3′ end of the transcript (see Note 1).

2.1.2.2. Agilent
Agilent Technologies is today a renowned provider of gene expression microarray solutions, and Agilent Whole Genome Expression microarrays are perhaps the second most used (commercial) platform for DNA microarray gene expression analyses. It can be considered an open platform, since Agilent microarrays are printed using 60-mer oligonucleotides on standard 1″ × 3″ glass slides, so that any microarray scanner suitable for standard glass slide microarrays can be used. In general, each gene (transcript) is represented on the array by one specific 60-mer oligonucleotide. Optimized 60-mer oligonucleotides are carefully selected from groups of computationally determined candidate probes, and over 70% of the probes represented on the microarray are validated by Agilent's laboratory validation process. Microarrays are manufactured using a proprietary noncontact in situ synthesis process, base by base, from digital sequence files. This is done using an inkjet process, which distributes the extremely small volumes of chemicals to be spotted with very high accuracy. Today, Agilent microarrays are available in a variety of formats that allow researchers to optimize the density and the number of microarrays on each slide, thus reducing the cost per experiment; typical formats are 1 × 244K, 2 × 105K, 4 × 44K, or 8 × 15K individual microarrays printed on a single glass slide.
Agilent's 244,000-feature microarrays represent a fivefold increase in density compared with their previous microarray series, although for transcriptome profiling purposes the 4 × 44K slide format is the one preferred among Agilent's users. This format provides four whole-genome microarrays on a single slide; in the case of human microarrays, each array covers more than 41,000 human genes and transcripts, all with public domain annotations. Agilent 60-mer oligonucleotide microarrays, available for quite a number of different organisms, can be used for both one- and two-color experiments, and are fully enabled for use with Agilent's Dual-Mode Gene Expression microarray platform, which integrates optimized protocols, reagents, hardware, and software for gene expression applications. Researchers can also design their own arrays using Agilent's eArray, a web-based tool that allows customers to rapidly design custom microarrays in a secure online environment at no additional cost.

2.1.2.3. Other Array Platforms
A good summary of different microarray platforms can be found in (25). There are also other commercial suppliers of in situ synthesis platforms, namely NimbleGen (Roche) and Febit, which make use of maskless photolithography based on controlled micromirrors, a technology developed by Texas Instruments (25). The big advantage of this technology is that no masks are necessary, and these platforms are therefore well suited for customized arrays. NimbleGen is also a service provider for whole-genome tiling arrays. ABI has stopped the production of the AB 1700 array platform with 60-mer oligonucleotides and chemiluminescent detection. Spotted cDNA (or oligonucleotide) two-color microarrays are an inexpensive alternative to in situ synthesized platforms (26), especially for customized arrays. PCR products of a cDNA library or designed oligonucleotides (e.g. from commercial providers like MWG Operon) are spotted on chemically modified glass slides (e.g. amino-silanated) by a robotic device using contact printing or inkjet technology. A bioinformatics challenge, however, lies in the design of the oligonucleotides and in the annotation of expressed sequence tags (ESTs).
2.1.3. Exon Arrays
Identification of alternative splice events of expressed or regulated gene isoforms is not, or only partially, possible with classical whole genome gene expression technologies. Classical microarrays target only a small part of a gene’s mRNA, in the case of Affymetrix GeneChips® and Agilent arrays only its 3′ end. Such microarrays thus allow only the identification of differentially expressed genes, but not of individual gene variants and also fail to identify potentially new gene isoforms. Affymetrix thus designed a microarray with oligo-probe sequences targeting all known and predicted (based on EST evidence) exons of all genes. This Exon microarray should allow
measuring the expression of individual exons of a gene, and should therefore allow identifying the expressed gene isoforms as well as potential alternative splice events. However, the high coverage of probe sequences along transcripts comes at the cost of a limited potential to design probe sequences with similar or comparable hybridization properties and affinities. The measured intensities of the probes are thus strongly biased by their sequence composition, with probes having a high guanine (G) and cytosine (C) content tending to yield higher intensities than other probes. Adjustment of the raw data for this bias during preprocessing is thus a crucial step in Exon microarray analysis. Exon microarrays can be employed for the analysis of differentially expressed genes, or also for the identification of differential splice events. A similar approach is used for the new Affymetrix Gene Arrays available for human, mouse, and rat. These chips comprise a subset of the well-annotated probes of the corresponding Exon arrays. For example, on the Human Gene 1.0 ST Array each of the 28,869 genes is represented by around 26 probes spread over the whole gene length, which provides a better picture of gene expression. This makes a total of 764,885 different probes on the array.

2.1.4. RNA-Seq
A recently developed alternative approach to identify differential splicing, alternative isoform regulation, and gene expression is RNA-seq (13, 14, 27, 28), in which poly(A) RNA is sequenced by high-throughput sequencing (HTS) technology. The several hundred million short reads (about 30 nucleotides long) from HTS are aligned to known mRNA sequences of genes, EST-based potential transcript variants, and all possible exon–exon combinations per gene. Predicted gene isoforms based on EST alignments and genomic sequence analyses are compiled in databases like the ASTD (http://www.ebi.ac.uk/astd) (29). Sequence reads spanning the splice junction of two exons make it possible to determine which exons of a gene are spliced together and thus to identify potential splice events (e.g. exon skipping). However, identification and measurement of the whole sequence of a transcript variant is not yet possible with RNA-seq, due to the limited sequence read length of HTS (30). The read length of second-generation sequencers has recently been increased to about 100 bases (ABI SOLiD and Illumina GAIIx; Roche 454 reads about 500 nucleotides, but produces fewer sequences), but to read long sequences representing complete RNAs, several thousand nucleotides would have to be sequenced. This may ultimately be possible with third-generation sequencers, such as those from Pacific Biosciences or Oxford Nanopore. The main advantage of RNA-seq over microarrays is that it provides an unbiased (probe-less) measurement of the sample transcripts.
These assays can also be performed for non-model organisms for which commercial microarrays are not available. Currently, RNA-seq does present some other limitations, such as its cost, which makes it unrealistic to perform enough replicates for statistical analysis in routine experiments (see Note 2).

2.2. Experimental Design
As for any biological experiment, the basic principles of experimental design also apply to large-scale gene expression studies. The objective is to make the analysis of the data and the interpretation of the results as simple as possible, given the purpose of the experiment and the constraints of the experimental material (31).
2.2.1. Replication
There are several levels of replication in a microarray experiment (or other large-scale experiments). The most important is replication at the biological level (samples from different patients or animals, primary cell cultures from different donors), which guarantees that conclusions can be drawn about a larger population. This level includes variability due to strain, disease state, treatment variation, and environmental factors, in addition to the variability introduced by subsequent levels (like RNA isolation or labeling). Replication of the microarray hybridization, or replicated spots/features on the array, also referred to as technical replicates, accounts for the variability introduced by measurement errors. A special type of technical replicate for two-color arrays is the repeated hybridization of the same RNA samples with reversed dye assignment (dye-swap). A pitfall in experimental design is a chosen level of replication that does not correspond to the question addressed (see Note 3) (32, 33). In the case of large numbers of experimental units or limited availability of RNA, pooling can be an option. This minimizes the biological variance, but it might also eliminate precious sample-specific information, thus limiting the possibilities for statistical analysis (34–36).
2.2.2. Design Types and Controls
Single-channel arrays have become increasingly popular, among other reasons, due to the simplicity of their experimental design. For two-color arrays, however, there are mainly two types of designs: the loop and the reference design. A universal reference (e.g. a pool of cancer cell lines, or a pool of all patient samples), which ensures that many genes are expressed at some level, can be used for the identification of tumor subtypes within many tumor tissue samples from different patients. In the case of indirect comparisons, the ratios (in log scale) of two microarrays have to be subtracted, hence doubling the variance of the fold change (FC) of interest ($2\sigma^2$) (35). Block and loop designs can effectively reduce the number of arrays required for a given number of nonreference samples, but they lose many of the advantages of the reference design (33, 37). Initially there were some approaches to test these design types, including multifactorial designs; however, they are not applied to
the bulk of large-scale gene expression studies. As for the array design, all platforms include positive and negative controls. Positive controls help the scanner to set the intensity range and provide anchor points for the grid. Negative controls can be used to detect slides with increased background intensities and spatial effects.

2.3. Data Generation
Microarray technology is now part of the laboratory routine. For the different platforms, standard operating procedures are available and widely applied. Common to all these protocols is that 5–20 µg of total RNA per sample is used to generate labeled cDNA/cRNA, by direct or indirect labeling during cDNA synthesis (using fluorescently labeled dUTP or amino-allyl dUTP) or by in vitro transcription, e.g. by T7 polymerase incorporating biotin-labeled nucleotides, which is then subjected to hybridization. Fluorescent dyes are used as labels; Cy3 and Cy5 are the most common for two-color arrays, as for instance in the Agilent technology, but any dye with a well-separable excitation/emission spectrum can be applied. In the case of the Affymetrix platform, slides are stained after hybridization and washing with streptavidin–phycoerythrin (SAPE) to detect the biotin labels. Image acquisition is performed using fluorescence scanners (similar to a laser scanning confocal microscope). The basic principle of this process is the excitation of the fluorescent dyes incorporated into the heteroduplexes on the surface of the array using one or two lasers (mostly at a wavelength of 532 nm for Cy3 and SAPE, and 635 nm for Cy5). The slide (or in some cases the optics) is moved so that the scanner can excite each area of the surface. For each pixel the fluorescence intensity is measured and transformed into a digital signal by a photomultiplier tube (PMT) or a charge-coupled device (CCD) using dye-specific emission filters. Specific parameters of these instruments are the resolution (from 2 to 10 µm, depending on feature size), the dynamic range (for instance, with a 16-bit scanner there are 65,536 possible intensity levels; a typical background intensity level of 100 and beginning saturation at an average pixel intensity of 50,000 result in a 500-fold dynamic range for microarrays), and the voltage of the PMT, which determines the amplification of the signal and can be adjusted for each channel separately to balance the total intensities of the different dyes. The resulting digital image(s) for each microarray (mostly a 16-bit single- or multi-TIFF format is used, because this format can store additional image information such as further channels and meta-information like scanner settings) are the starting point for all subsequent bioinformatics procedures, including image segmentation, extraction, and statistics of feature and background intensities. However, they are very specific to the microarray and technology used in terms of which features are included on the array, how the features are arranged, and how
feature intensity and background information can be integrated into a (relative) expression level of a specific transcript. For current Affymetrix arrays with 11 µm feature size, the GeneChip Scanner 3000 allows scanning at a resolution down to 1.25 µm. The GeneChip software analyzes the image data file (.dat) and computes a single intensity value for each probe cell on the array, which is the 75th percentile of the pixel intensities for that feature after removing the boundary pixels (9). These intensity levels are saved to another file (.cel) utilizing the information about the physical position of the features. The cel-files are the basis for any further data analysis. The image analysis process also includes a dynamic gridding algorithm to segment the image (based also on alignment features in each corner of the image), and a regionalized method of background correction is used, based on the average of the lowest 2% of probe cell intensity levels in a region, typically a block of 16 probe cells. For two-color arrays, the scanning process results in a multi-TIFF image including the two channels, usually at 10 µm resolution if the feature size is around 100 µm. The gridding, image segmentation, and feature extraction, including all necessary statistical parameters about the pixel information of the features (spots) and the local background, can be done, for instance, with the GenePix software (Axon Instruments). For this purpose, information about the spatial alignment of features across the array (gal-file) is used in a semi- or fully automated process to align grids, subgrids, rows, and columns, taking advantage of the fact that robotic spotting is performed in a regular, predefined manner. Adaptive circle segmentation is used to identify foreground pixels (spot) and local background (surroundings of a spot, space between spots). The calculated statistics of the foreground and background pixel intensities make it possible (1) to perform background correction (e.g. subtraction of the median background pixel intensity), (2) to remove low-quality spots (flagging) based on criteria applied to parameters like the percentage of saturated pixels, the background-corrected median spot pixel intensity, or the standard deviation of the background intensities, and (3) to calculate a ratio (or log2 ratio) between the channels. All parameters (including flag information) for every feature are stored in a result file (.gpr), which builds the basis for all further analysis. See Note 4 for different typical output files from several platforms.
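As a brief illustration of how such output files enter an analysis, the following R sketch imports Affymetrix cel-files and GenePix gpr-files with standard Bioconductor functions; the file and sample sheet names are placeholders.

## Affymetrix: read all .cel files in the working directory
library(affy)
ab <- ReadAffy()

## Two-color GenePix result files (.gpr), listed in a hypothetical
## sample sheet "targets.txt"
library(limma)
targets <- readTargets("targets.txt")
RG <- read.maimages(targets$FileName, source = "genepix")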
3. Methods

3.1. Preprocessing
In a high-throughput experiment we measure the intensity of thousands of entities (probes in a microarray experiment, sequences in an RNA-seq experiment) simultaneously across some
biological samples. But the measured intensities might contain biological as well as nonbiological signal. The sources of confounding variability might be systematic (if they affect a large proportion of entities in the same way, we can estimate them and approximately remove them) or random (affecting different measurements to different extents and hence difficult to estimate). The first type comprises, for example, a high overall background in a particular slide or set of slides, different amounts of initial RNA per sample, batch effects, etc. The second comprises the detection errors of the scanners, different dye efficiencies per probe, etc. Since the sources of error are technology-specific (and, within a technology, platform-specific), we now present separately the main issues for RNA-seq and microarray data preprocessing.

3.1.1. RNA-Seq Preprocessing
With next-generation sequencing, after enrichment of the regions of interest (see Note 2), sequences of varying length are generated. Quantifying transcript abundance by mapping the sequenced fragments onto known transcripts and counting the overlapping ones gives a snapshot of the abundance of each transcript in the cell under a given biological condition. Obtaining this final count per transcript requires several bioinformatics steps: (1) read each base of a given sequence (base calling), (2) for multiplex approaches, trim the adaptor sequences, (3) align to the reference genome for traditional transcriptional studies, or assemble the sequences for de novo sequencing, and (4) normalize per transcript length and per overall sequencing depth (to make the results comparable between samples and independent of the total amount of transcripts). The measure reads per kilobase per million reads (RPKM) allows comparing the expression levels of different genes within a library and between libraries, and is defined as follows (13):

\[ \mathrm{RPKM}_{\text{exon}} = \frac{r_{\text{exon}} \cdot 10^{6}}{r_{\text{lane}} \cdot \left( n_{\text{theor,exon}} / 10^{3} \right)} \tag{1} \]

\[ \mathrm{RPKM}_{\text{gene}} = \frac{\sum_{\text{exon} \in \text{gene}} \mathrm{RPKM}_{\text{exon}} \cdot n_{\text{theor,exon}}}{\sum_{\text{exon} \in \text{gene}} n_{\text{theor,exon}}} \tag{2} \]
where $r_{\text{exon}}$ is the number of reads mapping to an exon, $r_{\text{lane}}$ is the total number of mapped unique reads for the whole sequencing lane, and $n_{\text{theor,exon}}$ is the number of theoretical unique read positions in the exon for a specific read length. The background noise is estimated by counting the number of reads mapping to a genomic region not annotated for transcripts, relative to the length of this region.
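A minimal R sketch of Eqs. 1 and 2, assuming that the per-exon read counts and the numbers of theoretical unique positions are already available; all variable names and numbers are illustrative.

rpkm_exon <- function(r_exon, r_lane, n_theor_exon) {
  # Eq. 1: reads per kilobase of theoretical positions per million reads
  (r_exon * 1e6) / (r_lane * (n_theor_exon / 1e3))
}

rpkm_gene <- function(r_exons, r_lane, n_theor_exons) {
  # Eq. 2: length-weighted average of the exon RPKM values
  sum(rpkm_exon(r_exons, r_lane, n_theor_exons) * n_theor_exons) /
    sum(n_theor_exons)
}

## Example: a gene with three exons in a lane of 20 million mapped reads
rpkm_gene(r_exons = c(150, 80, 210), r_lane = 2e7,
          n_theor_exons = c(180, 95, 240))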
3.1.2. Microarray Data Preprocessing

While RNA-seq allows absolute quantification of the number of transcripts in a given sample, microarrays are a comparative technique, given that the binding properties of the spotted sequences might vary from probe to probe. Most error models proposed for microarrays (38–41) suggest additive–multiplicative models that can be simplified into (40):
\[ I = B + aS \tag{3} \]
where $I$ is the measured intensity for a given spot in a given sample, $B$ is a random variable modeling the background noise (optical noise as well as nonspecific binding), $a$ is a scaling factor, and $S$ is a random variable that accounts for the true signal, the measurement error, and the probe effects. This type of model is usually called an additive background, multiplicative error (ABME) model (38). The model, which strongly suggests some kind of logarithmic transformation and the use of the FC unit to correct for probe-specific bias, motivates the main low-level steps of the microarray analysis pipeline: (1) background correction (to estimate and remove noise adding to the true signal), (2) normalization (so that the data are approximately normally distributed and the variance is not intensity-dependent), and (3) summarization, if more than one probe is available per gene. Most currently used methods integrate all three steps.

3.1.2.1. Background Correction
There are mainly two types of background estimates: biased (forcing, for example, the corrected signal to be nonnegative) or unbiased. Traditionally, the local background intensity (per spot) provided by the image analysis software has been used as an unbiased estimate. Negative signals or very low intensities might be truncated to a given value to avoid not-a-number (NaN) values after logarithmic transformation. If a different transformation, more appropriate for low intensity levels, is used (e.g. the generalized logarithm (41)), this step is not needed. Ritchie et al. (42) compared some of the available background correction methods and proposed a new one (normexp) that seems to outperform the local background subtraction strategy. Estimating the background level from negative controls can also be a useful approach, although they might not account for spatial effects. For Affymetrix arrays, the MM probes were supposed to serve as estimates of the nonspecific signal. Similar to the background-subtracted signal, most initial methods for the analysis of Affymetrix data were based on PM−MM quantities (43, 44). The distribution of the MMs, however, has been shown (45) to lie above zero, suggesting that they capture specific as well as nonspecific signal. Irizarry et al. (46) proposed a method based on the ABME error model to estimate the background due to optical noise and nonspecific binding from the PMs only. GeneChip RMA (GC-RMA (47)) is an improvement of the RMA method that uses the sequence information of the probes to obtain more accurate estimates.
3.1.2.2. Normalization
Some systematic errors, such as different amounts of initial RNA, might affect the measured intensities of all probes within the hybridized samples. Normalization makes the measurements comparable across arrays and transforms the data so that traditional statistical methods can be applied, i.e. so that the data are roughly normally distributed (at least symmetric; see ref. 48 for a review of the distribution of the abundance of different biomolecular species) and the variance is stable across the intensity range. Figure 1a shows the MA plot for the raw data of 8 versus 8 samples from the experiment described in (15), comparing tumor versus non-tumor samples. An MA plot displays, for each gene:

\[ M = \log_2 \frac{I_1}{I_2} \tag{4} \]

\[ A = \frac{\log_2 (I_1) + \log_2 (I_2)}{2} \tag{5} \]
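The quantities in Eqs. 4 and 5 are straightforward to compute; the following sketch assumes an intensity matrix expr (genes × samples) and index vectors tumor and nontumor for the two groups, all of which are hypothetical names.

i1 <- rowMeans(expr[, tumor])      # average intensity, tumor samples
i2 <- rowMeans(expr[, nontumor])   # average intensity, non-tumor samples
M  <- log2(i1 / i2)                # log fold change (Eq. 4)
A  <- (log2(i1) + log2(i2)) / 2    # average log intensity (Eq. 5)
plot(A, M, pch = ".")
lines(lowess(A, M), col = "gray")  # local trend, cf. the loess curve in Fig. 1a
abline(h = 0, lty = 2)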
In Fig. 1a, $I_1$ is the average intensity across the tumor samples and $I_2$ the corresponding average for the non-tumor samples. While data on the logarithmic scale are already approximately normally distributed, normalization methods try to achieve two things simultaneously: (1) make the data comparable across arrays, correcting systematic errors, and (2) stabilize the variance, which in the raw data increases with the intensity, as seen in Fig. 1b. Normalization methods usually assume that most genes do not change across samples
Fig. 1. (a) MA plot and (b) standard deviation (sd) versus rank of the mean for the raw data of eight tumor samples versus eight non-tumor samples from ref. (15). The MA plot shows for each spot the average log fold change between the tumor and non-tumor samples versus the average intensity. The light gray line is the fitted loess curve, which clearly deviates from 0. The standard deviation versus mean plot shows that the variance of the raw data is not constant across the whole intensity range.
(or, at least, that the overall distributions should be similar across arrays (49)). For experiments in which these conditions might be violated, the normalization methods can be adjusted to a subset of genes expected not to vary, such as spike-ins or housekeeping genes (50–52). The first normalization methods simply tried to force all arrays to have the same mean/median. However, this simple shift in expression was not enough to stabilize the variance across intensities. For this reason, the loess transformation (53) gained increasing importance for two-color arrays. LOESS (54) locally fits a line to each subset of probes with similar intensity, resulting in a nonlinear transformation without the need to assume any particular shape for the relationship between the channels/arrays. The light gray line in Fig. 1a is the fitted loess curve. Generalizations of loess, such as cyclic loess, can also be applied to single-channel arrays (49), although they are not used as much as the two methods described below (see Note 5). Quantile normalization (49) forces the empirical (quantile-based) distributions of all hybridized samples to be the same. It is the most used method for single-channel arrays. Quantile normalization effectively reduces the variability across arrays, resulting simultaneously in a good reproducibility (low variance) within each condition and across conditions. This leads to more significant results, even with slightly lower FCs than other methods. Finally, the variance stabilization normalization (VSN) proposed by Huber et al. (41) can be used for both single- and dual-channel arrays. Arising directly from the ABME model, the data are transformed according to (45):

\[ y_{ij} = \operatorname{glog}_2 \left( \frac{I_{ij} - b_j}{k_j} \right) = m_i + \varepsilon_{ij} \tag{6} \]
where $y_{ij}$ is the corrected intensity for feature $i$ in sample $j$, $I_{ij}$ is the corresponding measured intensity, $b_j$ and $k_j$ are a background estimate and a scaling factor estimated per array, and $m_i$ is the average intensity of the feature on the $\operatorname{glog}_2$ scale. The estimates for $b_j$, $k_j$, and $m_i$ are obtained with the least trimmed sum of squares (LTS) (55), using the smallest $q\%$ of the residuals for the adjustment, with $q \in (50, 100)$, which makes the fit less sensitive to outliers. $q$ is user-defined depending on the experiment, with $q = 100\%$ corresponding to ordinary least squares regression.

3.1.2.3. Probe Summarization

Most arrays contain different probes for the same gene, to minimize probe-specific effects and to obtain an estimate of gene expression that is as robust as possible. In particular, Affymetrix chips contain probe sets, i.e. sets of 11–20 probes referring to the same transcript. After background correction and normalization, the probes corresponding to the same gene (or at least to the same probe set) can be summarized. Since there can be strong variations among probes (46), a method robust against
outliers is desired. While MAS5 (43) used the Tukey biweight as summary measure, the RMA method proposed by Irizarry et al. obtains the final estimate of the expression of each probe set in each array using median polish to fit the linear model on which the whole transformation is based.
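A minimal sketch of the two preprocessing routes discussed above, using the Bioconductor packages affy and vsn; it assumes that the cel-files of the experiment are in the working directory.

library(affy)
ab <- ReadAffy()            # raw probe-level data from the .cel files

## RMA: background correction, quantile normalization, and
## median-polish summarization in one step
eset <- rma(ab)

## Alternative: variance stabilization according to Eq. 6, applied to
## the raw intensity matrix (probe level, without summarization)
library(vsn)
nmat <- justvsn(exprs(ab))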
3.2. Quality Control

A first "de visu" quality check is performed after scanning the array images, in order to detect artifacts that might render any of the chips unusable. For Affymetrix arrays, the Bioconductor (56) packages affyPLM and affyQCReport (45) can be used to detect potential quality issues from the raw intensities; see Fig. 2 for an example of these packages applied to the example data. After data preprocessing, it is important to check whether the normalization has worked as expected and whether there is any outlier array that might lower the power of the subsequent statistical analysis. Principal component analysis and simple correlation are implemented in some commercial packages (e.g. GeneSpring (57)) to detect possible outlier arrays. More complete is the arrayQualityMetrics package (58) from Bioconductor (http://www.bioconductor.org) (56), which detects outlier arrays from MA plots, dendrograms, and boxplots. It can also help to detect confounding effects such as array or batch effects that should be
Fig. 2. Boxplots of the Normalized Unscaled Standard Error (NUSE) and the Relative Log Expression (RLE) for the 16 Affymetrix arrays used as an example.
Fig. 3. Standard deviation versus mean for all probes after RMA preprocessing (a) and after VSN normalization (b). Compared to Fig. 1b, the variance has been stabilized by either method. No probe set summarization has been performed after VSN, resulting in a higher spot density.
included in the statistical model in order to extract reliable conclusions (see Note 6). Figure 3 shows the standard deviation versus mean plot for the example data after preprocessing using (a) RMA and (b) VSN.
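The outlier and confounding checks described above can be run in a single call; the sketch below assumes that eset is the normalized ExpressionSet from preprocessing and that the sample groups are stored in a phenoData column named group (a hypothetical name).

library(arrayQualityMetrics)
arrayQualityMetrics(expressionset = eset,
                    outdir   = "QC_report",  # HTML report with MA plots,
                    intgroup = "group",      # boxplots, dendrogram, and
                    force    = TRUE)         # outlier detection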
3.3. Statistical Analysis and Mathematical Modeling

Once systematic errors have been removed from the data, confounding effects have been detected, the distribution of the data looks approximately normal, and the variance is independent of the mean, we are in a position to apply standard statistical tools to answer the question that motivated the experiment. The following methods are independent of the technique or platform used for data generation. The analysis questions can be:

1. Which genes are differentially expressed between two or more different conditions?

2. Which genes are coexpressed across all conditions of the experiment?

3. Which genes are associated with a given phenotype?

4. What are the regulatory mechanisms underlying the observed expression patterns?

Depending on the question, different statistical or bioinformatics tools can be used to answer it.
3.3.1. Finding Differentially Expressed Genes
To answer the first question, statistical methods specific to Omics data have been developed to accommodate their peculiarities. In the first place, while typical statistical tests assume a relatively large number of independent replicates per condition (~10), typical
Omics experiments usually have no more than four independent replicates, sometimes only two. The small sample size results in an underestimation of the sample variance, which might lead to large values of the t-statistic even for very small log fold changes. Moderated t-tests (59, 60) borrow information from all genes in the experiment to increase the precision of the variance estimates. It is also worth mentioning that the methods available for statistical testing in the high-throughput context are sophisticated enough to allow models with several fixed and random effects (61, 62). This is essential in order to take into account the available biological information, as well as other known sources of variability that might influence the outcome of the experiment and that can be detected as described in Subheading 3.2. Once statistical significance has been calculated for every gene, raw p-values do not tell us much about the significance of a change between conditions for a single gene. This is due to the fact that, just by chance, in a typical set-up with 30,000 genes, up to 5% of them (i.e. 1,500) could be false positives at a significance level of 0.05. Hence, correction for multiple testing is essential. While traditional methods controlling the family-wise error rate (63) are too stringent (resulting in a very low statistical power), methods controlling the false discovery rate (FDR (64), the expected proportion of false positives among all genes declared differentially expressed) are widely used in Omics experiments. In order to improve the statistical power, nonspecific filtering can be performed prior to statistical testing (65, 66): genes not expressed at all or not changing in the experiment can be removed, since they can only inflate the false positive rate. While empirical Bayes methods such as limma (59) and permutation tests such as SAM (60) are emerging as the state of the art for typical gene expression testing (a minimal limma sketch is given below), a battery of new methods has also been developed to discover isoforms in a sample or to detect potential alternative splicing events from exon arrays. In particular, some of the most used are PAC (67), MADS (68), and FIRMA (69). FIRMA calculates splicing scores for individual exons based on the residuals from fitting the RMA linear model (46) to the expression intensities, and also takes the individual probe affinities into account. Other approaches are based on the so-called splicing index (70), that is, the expression of the exon normalized to the expression level of the gene. These methods, however, might overestimate the potential alternative splicing events, especially for differentially expressed genes, as discussed by Gaidatzis et al. (71).
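A minimal limma sketch of the moderated t-test with FDR correction for the example comparison (8 BLC versus 8 non-BLC samples), assuming that eset holds the normalized data with the samples ordered by group.

library(limma)
group  <- factor(rep(c("BLC", "nBLC"), each = 8))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

fit  <- lmFit(eset, design)                        # per-gene linear models
cont <- makeContrasts(BLC - nBLC, levels = design)
fit2 <- eBayes(contrasts.fit(fit, cont))           # moderated t-statistics
topTable(fit2, adjust.method = "BH", number = 20)  # FDR-adjusted p-values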
3.3.2. Clustering and Classification

Clustering refers to the discovery of groups of genes (or samples) that behave similarly across conditions (or genes). The biological meaning of these groups (i.e. common regulation by a transcription factor, functional relatedness, etc.) is a priori unknown.
On the other hand, if the aim is not the discovery of new groups (because the grouping information is already available), but to find features (genes) that make it possible to distinguish among groups of samples with different biological characteristics, we are dealing with a classification problem. While clustering methods are merely exploratory tools, classification algorithms are inferential.

3.3.2.1. Clustering
Cluster analysis algorithms are mainly divided into hierarchical clustering methods and partitioning methods, although there are also hybrid methods. The result of a hierarchical cluster analysis is a tree, usually plotted as a dendrogram, with the length of the branches representing the similarity between elements. Figure 4 shows the hierarchical clustering performed with the academic software Genesis (http://genome.tugraz.at/genesisclient/genesisclient_description.shtml) (72) for the 626 genes detected as differentially expressed between the BLC and non-BLC tumor samples of our example data set. Partitioning methods, like k-means clustering, require a predefined number of clusters, although there are techniques such as the "figure of merit" (73)
Fig. 4. Partial view of the hierarchical clustering of the genes detected as differentially expressed between the BLC and nBLC samples. We see that the samples cluster according to their biological origin. Groups of genes that discriminate well between BLC and nBLC, and that could be used for classification, are marked in gray.
that help to define the optimal number of classes. To assess the similarity between items, each clustering algorithm has to use some sort of similarity or distance metric. While the Euclidean distance is the default metric for hierarchical cluster analysis, in certain experimental settings, like time series experiments where similarity in trend is more important than similarity in values, a different metric, like the Pearson correlation, might be used. As a useful alternative, the expression profiles can be centered around the mean expression level of each gene. All these options are available in free programs such as Genesis (72) and TM4 (http://www.tm4.org) (74); a minimal R sketch of these clustering options is given at the end of this subsection. As an alternative to these methods, several clustering algorithms based on a probabilistic model of the data have also been proposed (75–77). These methods are mainly based on normal mixture models and can also automatically estimate the optimal number of clusters in the data. They are, however, not widely used, due to their computational expensiveness. Once a set of coregulated genes has been identified, it is of interest to unravel any regulatory program that might explain this behavior. One option is to look for common upstream motifs in the promoters and enhancers of this gene set. There are two possibilities: (1) de novo motif search using enumerative (word-based) or optimization-based methods (Gibbs sampling or expectation maximization), and (2) finding regulatory sequences using position weight matrices (PWMs) based on experimentally verified transcription factor binding sites (78). A comprehensive list of PWMs for a number of transcription factors can be found in TRANSFAC (http://www.gene-regulation.com) (79) and JASPAR (http://jaspar.cgb.ki.se) (80). These matrices can also be used to look for over-represented potential binding sites in the promoter regions of the clustered genes. Sequences associated with a biological function tend to be conserved across organisms. Hence, considering only conserved motifs limits the number of false positive predictions (phylogenetic footprinting). There is a variety of publicly available software or services, summarized in (81, 82), and commercial approaches like those from Genomatix (http://www.genomatix.de) (83). Another identified (post-)transcriptional regulatory mechanism is the targeting of 3′ UTRs by microRNAs. Therefore it might be helpful to detect a significant fraction of coregulated genes potentially targeted by microRNAs. While not many miRNA–mRNA interactions have been experimentally tested (84), many algorithms have been developed to find potential miRNA targets by modeling the binding site characteristics (84–86).
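A minimal R sketch of the hierarchical and partitioning approaches described above, assuming that de is the expression matrix (genes × samples) of the differentially expressed genes; the correlation-based distance reflects similarity in trend rather than in absolute values.

dec <- t(scale(t(de)))                # center/scale each gene's profile
d   <- as.dist(1 - cor(t(dec)))       # Pearson correlation distance
hc  <- hclust(d, method = "average")  # hierarchical clustering
plot(as.dendrogram(hc), leaflab = "none")

km <- kmeans(dec, centers = 6)        # partitioning alternative with a
                                      # predefined number of clusters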
3.3.2.2. Classification Methods

In a typical microarray experiment we usually try to classify samples (for which thousands of transcripts have been measured) into different classes, i.e. whether a given sample is from a tumor or not, what the type of tumor is, or what the stage of the patient is.
The first step is to discover a set of features (i.e. genes) whose expression characterizes the given samples according to their biological origin. One very easy approach is to use correlation (87) or logistic regression, using the class label as the dependent variable and the expression levels of the genes as independent variables. The genes selected as differentially expressed can also be used as class features. In our example, the genes that are differentially expressed with an FC larger than 2 can be used as class predictors, and the classification model can be built on them. The methods for feature selection are varied, and it is difficult to know beforehand which one will work best. Software tools such as Weka (http://www.cs.waikato.ac.nz/ml/weka) (88) allow choosing among many of them. With the subset of selected features in hand, and always on the training data set, a model is built in order to predict, as reliably as possible, the sample class. Some of the most popular methods for classification are support vector machines (89) and decision trees (90); for a review of methods and practical applications see ref. 40. Once the classifier is built, its performance can be estimated on the test set. It is important to shuffle the data so that the classification scheme obtained is not only valid for the current training/test set partition. Methods such as bootstrapping and cross-validation need to be applied, and the final performance of the method is the average performance across all runs.
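A sketch of a cross-validated support vector machine on the selected features, using the e1071 package; the matrix x (samples × selected genes) and the group sizes are assumptions taken from the example.

library(e1071)
y   <- factor(rep(c("BLC", "nBLC"), each = 8))  # class labels
fit <- svm(x, y, kernel = "linear", cross = 4)  # 4-fold cross-validation
summary(fit)   # reports the per-fold accuracies and their average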
3.3.3. Reverse Engineering

If the aim of our experiment is not only to find associations within gene expression profiles, but also to find potential cause–effect relationships among them, reverse engineering techniques can be applied to the data. The idea, which comes from control theory, is that if we know the input (stimulus, biological conditions under which our experiment was performed) and the output (gene expression profiles), we might be able to infer the regulatory network. Although the principle seems readily applicable to microarray data, and the tools, like Boolean networks, Bayesian networks, and systems of differential equations, have been developed for years (see refs. 91, 92 for a review), in practice the limited number of samples available in microarray experiments and the unbalanced sampling schemes used (more suited to the surveillance of a biological process than to mathematical modeling) make it difficult to obtain reliable system models for a large number of genes. The large number of genes to be modeled and the small number of samples available make the problem prone to collinearity and overfitting. Hence, some techniques for data reduction might be applied before the reverse engineering methods, for example, clustering the genes and building networks on the clusters of genes instead of the genes themselves. Another approach consists in the artificial generation of measurements via interpolation. Even with all these problems, there have already been some
successful applications of reverse engineering methods to microarray data for lower (93, 94), but also for higher organisms (2, 3). Simultaneously, the decreasing price of arrays has made it possible to perform experiments specifically designed for the application of reverse engineering methods (2, 95). Since gene expression regulation remains largely unknown, another problem is the lack of realistic simulated data sets to test the performance of the proposed methods; this has, however, been overcome in recent years (96, 97).
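The Boolean and Bayesian network methods cited above are beyond the scope of a short example, but a crude correlation-based ("relevance") network built on clusters of genes illustrates the data-reduction idea mentioned above; profiles is an assumed matrix of cluster-mean expression profiles (clusters × conditions).

cc  <- cor(t(profiles))      # pairwise Pearson correlations between clusters
adj <- abs(cc) > 0.9         # threshold into an adjacency matrix
diag(adj) <- FALSE
which(adj, arr.ind = TRUE)   # candidate edges of the network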
3.4. Functional Analysis

Once we have selected a subset of genes based on the magnitude of the change across conditions and its reproducibility, the question is whether they explain any mechanism or have any predictive value for the biological process under study. One approach is to establish which Gene Ontology categories are enriched in terms of the molecular function, biological process, or cellular component of the gene products, as annotated in the Gene Ontology database (http://www.geneontology.org) (98). Similarly, numerous curated databases contain proven interactions among different biomolecular species (i.e. proteins, genes, miRNAs) that constitute canonical biochemical pathways, for example, KEGG (http://www.kegg.org), Biocarta (http://www.biocarta.com), and GenMAPP (http://www.genmapp.org). Some academic (99) or commercial (100, 101) applications are available to map gene expression data onto these networks and pathways. Figure 5 shows an example of the Ingenuity®
Fig. 5. Top gene network (a) from IPA (100) for the genes differentially expressed in our example and (b) enriched functional categories, with cancer as the most overrepresented.
Pathway Analysis suite (http://www.ingenuity.com, based on published interactions), showing a gene network onto which many of the genes differentially expressed in our example map. Some programs construct networks of genes based on cocitation (102). A GO category or a pathway is said to be over-represented if the proportion of genes with this term in the set of interest is significantly larger than the corresponding proportion in the whole gene set. Fisher's exact test or a chi-square test can be used to test the hypothesis of equality of the two proportions (a minimal sketch is given at the end of this subsection). The resulting p-value assesses the hypergeometric probability that the number of interesting genes associated with the category is larger than expected. The p-values need to be corrected for multiple testing. Bioconductor (40) contains annotation for most available Affymetrix chips and also provides information on and statistical analysis of the GO terms (see Note 7). The interpretation of a GO analysis can, however, be difficult due to the overlap of categories. To alleviate this problem, modifications of the original GO analysis have been developed (103–105) that perform the tests from the most specific to the least specific GO terms and remove all genes associated with a significantly enriched GO term from its less specific parent terms. An alternative approach to identify significant sets of genes in a data set is Gene Set Enrichment Analysis (GSEA, http://www.broadinstitute.org/gsea) (106–109). Unlike all previously described methods, which are based on prefiltered data, e.g. use the list of significantly differentially expressed genes as input, GSEA is usually applied to the whole data set of a microarray experiment (although nonspecific filtering based on the variances across samples should be performed to remove uninformative genes). GSEA aggregates the per-gene statistics (e.g. for differential expression) across the genes within a predefined gene set, thus allowing the detection of situations where all genes in the set change in a small, but coordinated way. GSEA is thus a useful alternative approach for Omics data, especially for experiments without strong differences at the transcriptional level, but it depends on a correct and meaningful definition of the gene sets.
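The over-representation test described above amounts to a Fisher's exact test on a 2 × 2 table; the counts below are invented for illustration, with 626 differentially expressed genes as in the example.

n_total  <- 20000   # genes on the array (assumed)
n_de     <- 626     # differentially expressed (DE) genes
n_cat    <- 400     # genes annotated to the category (assumed)
n_de_cat <- 40      # DE genes annotated to the category (assumed)

tab <- matrix(c(n_de_cat,        n_cat - n_de_cat,
                n_de - n_de_cat, n_total - n_cat - n_de + n_de_cat),
              nrow = 2)   # rows: DE yes/no; columns: in category yes/no
fisher.test(tab, alternative = "greater")$p.value
## p-values over all tested categories must then be corrected for
## multiple testing, e.g. with p.adjust(p, method = "BH")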
3.5. Databases and Repositories

A few years of use were enough to highlight the importance of describing and standardizing high-throughput experiments. The MIAME standard (Minimum Information About a Microarray Experiment) (110) has set the basis for all information and experimental parameters that are required when presenting the results of a microarray experiment in order to make it reproducible. Two main repositories have emerged, namely the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo) hosted at the NCBI (111), and ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae) at the EBI (112). Both are public repositories for microarray-based Omics data and ultra-high-throughput sequencing (UHTS) data, using the underlying
short read archives (SRA and ERA). Many journals oblige authors to deposit their experimental information in one of these repositories prior to publication (see Note 8). While processes are being properly standardized and described at the experimental level, it is often difficult to follow all data analysis steps. It is therefore important to provide information about the software, algorithms, settings, and parameters used, so that everyone starting from the raw data can reach the same conclusions by following the described analysis strategy. Raw data and partly processed data can be freely downloaded from GEO and ArrayExpress and further analyzed. There are also quite a number of gene expression databases that provide access to already processed data, e.g. expression data from all human or mouse tissues or related to specific diseases (see Note 9). The integration of different data sets helps to unravel the underlying biological mechanisms via mathematical modeling, by including more conditions or a larger sample size (113). The latter is also known as "meta-analysis," similar to what is done in clinical research, where a generalized hypothesis is deduced in a systematic review from the analyses of multiple studies. The next challenge comes with the integration of heterogeneous data sets from different biomolecular species and clinical parameters (114). This will ultimately provide us with a systems-level view of the biological processes and pathological changes within a cell.

3.6. Validation
3.6. Validation
We have discussed typical methods for detecting differentially expressed genes, all of which carry a certain degree of uncertainty. Although statistical testing helps us make the best-supported decision about the hypothesis given the data at hand, there is still some probability that we are making the wrong decision. Hence, the conclusions of an experiment need to be validated, ideally on different samples, to warrant generalization of the results. Although the technique traditionally used for the validation of array data was qPCR (or western blots), at the moment it is not unusual to see experiments with RNA-seq and expression arrays in which one validates the other. Regardless of the technique used, it is very important in Omics experiments to validate the main conclusions with an independent technique.
3.7. Software and Tools
Apart from commercial software suites for the analysis of high-throughput transcriptomics data, a large panel of free and open-source software tools is available. These range from pure application programming interfaces, like the Perl API from Ensembl (http://www.ensembl.org) or the statistical programming language and environment R (http://www.r-project.org), to specialized computer programs with graphical user interfaces and web applications.
The Ensembl Perl API allows querying of all Ensembl databases. Perl scripts using the API can thus easily be implemented, e.g., to annotate probe sets to genes, or to retrieve all annotations for a specific gene, including protein domains, transcript variants, exon–intron structures, functional annotations, or even the sequences of individual transcripts.
R is an open-source implementation of the S language (115) and is becoming one of the most widely used software tools in bioinformatics. This is mainly due to its flexibility and data handling capabilities, as well as the large number of available functions for the analysis of biological data, from microarray data to data from HTS technologies. Most of these methods are supplied in software packages from the Bioconductor project (56). R also provides the Sweave system (116), a tool that allows analysts to integrate text (using LaTeX) and computer code for the R language. Hence, reports can be generated for individual analyses that contain all R commands, generated figures, tables, and results, as well as descriptive text. Comprehensive analysis reports are crucial to ensure the reproducibility of analysis results, which is also becoming increasingly important for the description of bioinformatics analyses in publications.
Tools with a graphical user interface, including web applications, lack the high flexibility of frameworks like the R environment but are easier to use, especially for untrained users. Among software with graphical user interfaces, web applications have the advantage that the user does not have to install any software locally, and that the calculations are performed remotely on a server, which usually has larger hardware resources than a standard workstation. Two of the most frequently used web applications for the analysis of microarray data are GEPAS (http://gepas.bioinfo.cipf.es) (117) and CARMAweb (https://carmaweb.genome.tugraz.at/carma) (118). Both GEPAS and CARMAweb provide tools for microarray data preprocessing, differential gene expression, cluster analysis, classification, and functional annotation of genes and gene lists. CARMAweb also generates comprehensive analysis report files, because it uses the Sweave system to perform all calculations with functions from the Bioconductor packages in R.
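To illustrate the Sweave mechanism, a minimal, hypothetical report source (report.Rnw) that embeds a limma call is sketched below; running Sweave("report.Rnw") in R executes the chunk and writes a LaTeX file containing the commands, their output, and the surrounding text. The expr and design objects are assumptions of this sketch, not part of the original text:

  \documentclass{article}
  \begin{document}
  \section{Differential expression}
  The following chunk is executed by R and its output is embedded:
  <<deg, echo=TRUE>>=
  library(limma)
  # 'expr' (normalized log2 matrix) and 'design' (design matrix) are
  # assumed to have been created in an earlier chunk of this report
  fit <- eBayes(lmFit(expr, design))
  topTable(fit, coef = 2, adjust.method = "BH", number = 5)
  @
  \end{document}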
4. Notes
1. Differentially expressed probe sets from an analysis based on classical 3′ UTR Affymetrix microarrays (e.g., HG-U133 Plus 2.0) should be checked for their genomic position relative to the 3′ end of the corresponding gene, since for this type of microarray the target RNA is reverse transcribed into
cDNA using a T7-oligo(dT) primer. Differences in reverse transcription efficiency, as well as low RNA quality or partial RNA degradation, can thus lead to apparent differential expression of more 5′-located probe sets. The location of probe sets relative to their target genes can be visually inspected, e.g., in the Ensembl Genome Browser (http://www.ensembl.org).
2. RNA-seq represents an alternative to arrays, with the advantage that no probes need to be preselected. However, for targeted sequencing, a prior hybridization step is needed to reduce the complexity of the target material; this step also relies on probes, which do not cover highly complex regions of the genome. Quantitative RNA measurements are also only possible with HTS if a certain sequencing depth is achieved, i.e., if each RNA species is sequenced several times to allow unambiguous alignments and the exclusion of sequencing errors. Additionally, most currently used sequencers require a PCR amplification step that is known to introduce biases and errors.
3. The level of replication in the design of a microarray experiment should match the level of the question being addressed. For instance, if three extracts are taken from one tissue of the same donor and gene expression is studied, the only relevant question that can be addressed is what the effects of the extraction method on gene expression are. No conclusion can be drawn about gene expression levels in this tissue in a larger population.
4. Typical output files, with descriptions, from several microarray platforms (Agilent, Affymetrix, two-color microarrays) can be viewed at http://genome.tugraz.at/fileformats.html.
5. It can be difficult to know beforehand which normalization method will perform best. It is important to take the biological characteristics of the experiment into account and, after normalization, to inspect QC plots to see how well it has performed.
6. In-depth quality assessment of microarray data before and after preprocessing is crucial to identify potentially problematic microarrays and outlier samples that might influence the final results of the analysis. It is also important to include in the statistical model the random effects (such as batch) that might obscure the conclusions of the analysis. Some Bioconductor packages (limma and maanova) can fit mixed models. Maanova can also estimate the effect of the random covariates, although its use of permutation-based estimation methods makes it computationally expensive.
7. An annotated output file generated with the annotate package from Bioconductor can be found at http://genome.tugraz.at/DEG.html.
8. The submission of information and results about a microarray experiment proceeds in a structured format via an adapted markup language (MAGE-ML, http://www.mged.org/Workgroups/MAGE/mage-ml.html) (119), utilized by common state-of-the-art microarray databases (see for example MARS, http://genome.tugraz.at/mars/mars_description.shtml, ref. 120).
9. A number of databases and online resources provide processed gene expression data, which can be used to retrieve gene signatures or expression profiles over several conditions and experiments for specific genes. Some examples are listed below:
Gene expression atlas, http://biogps.gnf.org
Gene Expression Omnibus (GEO) Profiles, http://www.ncbi.nlm.nih.gov/geo
ArrayExpress Gene Expression Atlas, http://www.ebi.ac.uk/gxa
Oncomine, http://www.oncomine.org
GeneSigDB, http://compbio.dfci.harvard.edu/genesigdb
Genevestigator, http://www.genevestigator.ethz.ch
L2L, http://depts.washington.edu/l2l
References
1. Quackenbush, J. (2001) Computational analysis of microarray data. Nat Rev Genet 2, 418–27. 2. Basso, K., Margolin, A. A., Stolovitzky, G., Klein, U., Dalla-Favera, R., and Califano, A. (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37, 382–90. 3. Della Gatta, G., Bansal, M., Ambesi-Impiombato, A., Antonini, D., Missero, C., and di Bernardo, D. (2008) Direct targets of the TRP63 transcription factor revealed by a combination of gene expression profiling and reverse engineering. Genome Res 18, 939–48. 4. Rhodes, D. R., Kalyana-Sundaram, S., Mahavisno, V., Varambally, R., Yu, J., Briggs, B. B., Barrette, T. R., Anstet, M. J., Kincead-Beal, C., Kulkarni, P., Varambally, S., Ghosh, D., and Chinnaiyan, A. M. (2007) Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9, 166–80.
5. Liang, P., and Pardee, A. B. (1992) Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257, 967–71. 6. St John, T. P., and Davis, R. W. (1979) Isolation of galactose-inducible DNA sequences from Saccharomyces cerevisiae by differential plaque filter hybridization. Cell 16, 443–52. 7. Sargent, T. D., and Dawid, I. B. (1983) Differential gene expression in the gastrula of Xenopus laevis. Science 222, 135–39. 8. Weis, J. H., Tan, S. S., Martin, B. K., and Wittwer, C. T. (1992) Detection of rare mRNAs via quantitative RT-PCR. Trends Genet 8, 263–64. 9. Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14, 1675–80.
10. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–70. 11. Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995) Serial analysis of gene expression. Science 270, 484–87. 12. Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D. H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., Roth, R., George, D., Eletr, S., Albrecht, G., Vermaas, E., Williams, S. R., Moon, K., Burcham, T., Pallas, M., DuBridge, R. B., Kirchner, J., Fearon, K., Mao, J., and Corcoran, K. (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18, 630–34. 13. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621–28. 14. Sultan, M., Schulz, M. H., Richard, H., Magen, A., Klingenhoff, A., Scherf, M., Seifert, M., Borodina, T., Soldatov, A., Parkhomchuk, D., Schmidt, D., O’Keeffe, S., Haas, S., Vingron, M., Lehrach, H., and Yaspo, M. L. (2008) A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321, 956–60. 15. Richardson, A. L., Wang, Z. C., De Nicolo, A., Lu, X., Brown, M., Miron, A., Liao, X., Iglehart, J. D., Livingston, D. M., and Ganesan, S. (2006) X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell 9, 121–32. 16. Bustin, S. A., Benes, V., Nolan, T., and Pfaffl, M. W. (2005) Quantitative real-time RT-PCR – a perspective. J Mol Endocrinol 34, 597–601. 17. Vanguilder, H. D., Vrana, K. E., and Freeman, W. M. (2008) Twenty-five years of quantitative PCR for gene expression analysis. Biotechniques 44, 619–26. 18. Pfaffl, M. W. (2001) A new mathematical model for relative quantification in real-time RT-PCR. Nucleic Acids Res 29, e45. 19. Hellemans, J., Mortier, G., De Paepe, A., Speleman, F., and Vandesompele, J. (2007) qBase relative quantification framework and software for management and automated analysis of real-time quantitative PCR data. Genome Biol 8, R19. 20. Bookout, A. L., and Mangelsdorf, D. J. (2003) Quantitative real-time PCR protocol
for analysis of nuclear receptor signaling pathways. Nucl Recept Signal 1, e012. 21. Livak, K. J., and Schmittgen, T. D. (2001) Analysis of relative gene expression data using real-time quantitative PCR and the 2(-delta delta C(T)) method. Methods 25, 402–08. 22. Bookout, A. L., Cummins, C. L., Mangelsdorf, D. J., Pesola, J. M., and Kramer, M. F. (2006) High-throughput realtime quantitative reverse transcription PCR. Curr Protoc Mol Biol Chapter 15, Unit. 23. Pease, A. C., Solas, D., Sullivan, E. J., Cronin, M. T., Holmes, C. P., and Fodor, S. P. (1994) Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci USA 91, 5022–26. 24. Lipshutz, R. J., Fodor, S. P., Gingeras, T. R., and Lockhart, D. J. (1999) High density synthetic oligonucleotide arrays. Nat Genet 21, 20–4. 25. Hardiman, G. (2004) Microarray platforms – comparisons and contrasts. Pharmacogenomics 5, 487–502. 26. Seidel, C. (2008) Introduction to DNA microarrays. In Analysis of microarray data: a network-based approach (Edited by Emmert-Streib, F., and Dehmer, M.), pp. 1–25. Wiley-VCH, New York. 27. Wang, E. T., Sandberg, R., Luo, S., Khrebtukova, I., Zhang, L., Mayr, C., Kingsmore, S. F., Schroth, G. P., and Burge, C. B. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–76. 28. Hughes, T. R., Marton, M. J., Jones, A. R., Roberts, C. J., Stoughton, R., Armour, C. D., Bennett, H. A., Coffey, E., Dai, H., He, Y. D., Kidd, M. J., King, A. M., Meyer, M. R., Slade, D., Lum, P. Y., Stepaniants, S. B., Shoemaker, D. D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., and Friend, S. H. (2000) Functional discovery via a compendium of expression profiles. Cell 102, 109–26. 29. Stamm, S., Riethoven, J. J., Le Texier, V., Gopalakrishnan, C., Kumanduri, V., Tang, Y., Barbosa-Morais, N. L., and Thanaraj, T. A. (2006) ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res 34, D46–55. 30. Carninci, P. (2009) Is sequencing enlightenment ending the dark age of the transcriptome? Nat Methods 6, 711–13. 31. Yang, Y. H., and Speed, T. (2002) Design issues for cDNA microarray experiments. Nat Rev Genet 3, 579–88.
32. Simon, R. M., and Dobbin, K. (2003) Experimental design of DNA microarray experiments. Biotechniques Suppl, 16–21. 33. Simon, R., Radmacher, M. D., and Dobbin, K. (2002) Design of studies using DNA microarrays. Genet Epidemiol 23, 21–36. 34. Hackl, H., Sanchez, C. F., Sturn, A., Wolkenhauer, O., and Trajanoski, Z. (2004) Analysis of DNA microarray data. Curr Top Med Chem 4, 1357–70. 35. Churchill, G. A. (2002) Fundamentals of experimental design for cDNA microarrays. Nat Genet 32 Suppl, 490–95. 36. Kendziorski, C., Irizarry, R. A., Chen, K. S., Haag, J. D., and Gould, M. N. (2005) On the utility of pooling biological samples in microarray experiments. Proc Natl Acad Sci USA 102, 4252–57. 37. Kerr, M. K., and Churchill, G. A. (2001) Statistical design and the analysis of gene expression microarray data. Genet Res 77, 123–28. 38. Rocke, D. M., and Durbin, B. (2001) A model for measurement error for gene expression arrays. J Comput Biol 8, 557–69. 39. Kerr, M. K., Martin, M., and Churchill, G. A. (2000) Analysis of variance for gene expression microarray data. J Comput Biol 7, 819–37. 40. Gentleman, R., Carey, V., Huber, W., Irizarry, R., and Dudoit, S. (2005) Bioinformatics and computational biology solutions using R and Bioconductor. Springer Science+Business Media, New York, NY, USA. 41. Huber, W., von Heydebreck, A., Sultmann, H., Poustka, A., and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18 Suppl 1, S96–104. 42. Ritchie, M. E., Silver, J., Oshlack, A., Holmes, M., Diyagama, D., Holloway, A., and Smyth, G. K. (2007) A comparison of background correction methods for two-colour microarrays. Bioinformatics 23, 2700–07. 43. Affymetrix (2002) Statistical algorithms description document. http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf 44. Li, C., and Wong, W. H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 98, 31–36. 45. Hahne, F., Huber, W., Gentleman, R., and Falcon, S. (2008) Bioconductor case studies. Springer Science+Business Media, New York, NY, USA.
46. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–64. 47. Wu, Z., and Irizarry, R. A. (2004) Preprocessing of oligonucleotide array data. Nat Biotechnol 22, 656–58. 48. Lu, C., and King, R. D. (2009) An investigation into the population abundance distribution of mRNAs, proteins, and metabolites in biological systems. Bioinformatics 25, 2020–27. 49. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–93. 50. van de Peppel, J., Kemmeren, P., van Bakel, H., Radonjic, M., van Leenen, D., and Holstege, F. C. (2003) Monitoring global messenger RNA changes in externally controlled microarray experiments. EMBO Rep 4, 387–93. 51. Sarkar, D., Parkin, R., Wyman, S., Bendoraite, A., Sather, C., Delrow, J., Godwin, A. K., Drescher, C., Huber, W., Gentleman, R., and Tewari, M. (2009) Quality assessment and data analysis for microRNA expression arrays. Nucleic Acids Res 37, e17. 52. Pradervand, S., Weber, J., Thomas, J., Bueno, M., Wirapati, P., Lefort, K., Dotto, G. P., and Harshman, K. (2009) Impact of normalization on miRNA microarray expression profiling. RNA 15, 493–501. 53. Yang, Y. H., Dudoit, S., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30, e15. 54. Cleveland, W. (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74, 829–36. 55. Rousseeuw, P., and Leroy, A. (1987) Robust regression and outlier detection. Wiley, New York. 56. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y., and Zhang, J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.
57. Agilent (2009) GeneSpring GX Software. http://www.chem.agilent.com. 58. Kauffmann, A., Gentleman, R., and Huber, W. (2009) arrayQualityMetrics – a bioconductor package for quality assessment of microarray data. Bioinformatics 25, 415–16. 59. Smyth, G. K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, Article3. 60. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98, 5116–21. 61. Smyth, G., Thorne, N., and Wettenhall, J. (2009) limma User's Guide. http://bioinf.wehi.edu.au/limma. 62. Wu, H., Yang, H., Sheppard, K., and Churchill, G. (2009) maanova: tools for analyzing Micro Array experiments. http://cran.r-project.org/web/packages/maanova/index.html. 63. Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003) Multiple hypothesis testing in microarray experiments. Stat Sci 18, 71–103. 64. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate – a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57, 289–300. 65. Hackstadt, A. J., and Hess, A. M. (2009) Filtering for increased power for microarray data analysis. BMC Bioinformatics 10, 11. 66. Lusa, L., Korn, E. L., and McShane, L. M. (2008) A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Stat Med 27, 5834–49. 67. French, P. J., Peeters, J., Horsman, S., Duijm, E., Siccama, I., van den Bent, M. J., Luider, T. M., Kros, J. M., van der Spek, P., and Sillevis Smitt, P. A. (2007) Identification of differentially regulated splice variants and novel exons in glial brain tumors using exon expression arrays. Cancer Res 67, 5635–42. 68. Xing, Y., Stoilov, P., Kapur, K., Han, A., Jiang, H., Shen, S., Black, D. L., and Wong, W. H. (2008) MADS: a new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. RNA 14, 1470–79. 69. Purdom, E., Simpson, K. M., Robinson, M. D., Conboy, J. G., Lapuk, A. V., and Speed, T. P. (2008) FIRMA: a method for detection of alternative splicing from exon array data. Bioinformatics 24, 1707–14.
70. Clark, T. A., Sugnet, C. W., and Ares, M., Jr. (2002) Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296, 907–10. 71. Gaidatzis, D., Jacobeit, K., Oakeley, E. J., and Stadler, M. B. (2009) Overestimation of alternative splicing caused by variable probe characteristics in exon arrays. Nucleic Acids Res 37, e107. 72. Sturn, A., Quackenbush, J., and Trajanoski, Z. (2002) Genesis: cluster analysis of microarray data. Bioinformatics 18, 207–8. 73. Yeung, K. Y., Haynor, D. R., and Ruzzo, W. L. (2001) Validating clustering for gene expression data. Bioinformatics 17, 309–18. 74. Saeed, A. I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A., Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., and Quackenbush, J. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34, 374–8. 75. Banfield, J. D., and Raftery, A. E. (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–21. 76. Yeung, K. Y., Medvedovic, M., and Bumgarner, R. E. (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4, R34. 77. Vogl, C., Sanchez-Cabo, F., Stocker, G., Hubbard, S., Wolkenhauer, O., and Trajanoski, Z. (2005) A fully Bayesian model to cluster gene-expression profiles. Bioinformatics 21 Suppl 2, ii130–136. 78. Vingron, M., Brazma, A., Coulson, R., van Helden, J., Manke, T., Palin, K., Sand, O., and Ukkonen, E. (2009) Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol 10, 202. 79. Wingender, E., Dietze, P., Karas, H., and Knuppel, R. (1996) TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 24, 238–41. 80. Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W. W., and Lenhard, B. (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32, D91–94. 81. MacIsaac, K. D., and Fraenkel, E. (2006) Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2, e36. 82. Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V.,
Frith, M. C., Fu, Y., Kent, W. J., Makeev, V. J., Mironov, A. A., Noble, W. S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., and Zhu, Z. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23, 137–44. 83. Werner, T. (2000) Computer-assisted analysis of transcription control regions. MatInspector and other programs. Methods Mol Biol 132, 337–49. 84. Sethupathy, P., Megraw, M., and Hatzigeorgiou, A. G. (2006) A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3, 881–86. 85. Krek, A., Grun, D., Poy, M. N., Wolf, R., Rosenberg, L., Epstein, E. J., Macmenamin, P., da Piedade, I., Gunsalus, K. C., Stoffel, M., and Rajewsky, N. (2005) Combinatorial microRNA target predictions. Nat Genet 37, 495–500. 86. Lewis, B. P., Burge, C. B., and Bartel, D. P. (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20. 87. van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–36. 88. Witten, I., and Frank, E. (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA. 89. Furey, T. S., Cristianini, N., Duffy, N., Bednarski, D. W., Schummer, M., and Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906–14. 90. Pittman, J., Huang, E., Nevins, J., Wang, Q., and West, M. (2004) Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes. Biostatistics 5, 587–601. 91. D'haeseleer, P., Liang, S., and Somogyi, R. (2000) Genetic network inference: from coexpression clustering to reverse engineering. Bioinformatics 16, 707–26. 92. de Jong, H. (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9, 67–103.
93. Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J. B., Volkert, T. L., Fraenkel, E., Gifford, D. K., and Young, R. A. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804. 94. Shen-Orr, S. S., Milo, R., Mangan, S., and Alon, U. (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64–68. 95. Gardner, T. S., di Bernardo, D., Lorenz, D., and Collins, J. J. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105. 96. Di Camillo, B., Toffolo, G., and Cobelli, C. (2009) A gene network simulator to assess reverse engineering algorithms. Ann N Y Acad Sci 1158, 125–42. 97. Marbach, D., Schaffter, T., Mattiussi, C., and Floreano, D. (2009) Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol 16, 229–39. 98. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–29. 99. Mlecnik, B., Scheideler, M., Hackl, H., Hartler, J., Sanchez-Cabo, F., and Trajanoski, Z. (2005) PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways. Nucleic Acids Res 33, W633–W637. 100. Ingenuity systems (2009) Ingenuity Pathway Analysis Software. http://www.ingenuity.com. 101. SRI International (2009) PANTHER Classification System for Genes and Proteins. http://www.pantherdb.org. 102. Hoffmann, R., Krallinger, M., Andres, E., Tamames, J., Blaschke, C., and Valencia, A. (2005) Text mining for metabolic pathways, signaling cascades, and protein networks. Sci STKE 2005, e21. 103. Alexa, A., Rahnenfuhrer, J., and Lengauer, T. (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–07.
104. Falcon, S., and Gentleman, R. (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23, 257–58. 105. Bindea, G., Mlecnik, B., Hackl, H., Charoentong, P., Tosolini, M., Kirilovsky, A., Fridman, W. H., Pages, F., Trajanoski, Z., and Galon, J. (2009) ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25, 1091–93. 106. Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S., and Mesirov, J. P. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102, 15545–50. 107. Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S., and Park, P. J. (2005) Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA 102, 13544–49. 108. Saxena, V., Orgill, D., and Kohane, I. (2006) Absolute enrichment: gene set enrichment analysis for homeostatic systems. Nucleic Acids Res 34, e151. 109. Jiang, Z., and Gentleman, R. (2007) Extensions to gene set enrichment. Bioinformatics 23, 306–13. 110. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29, 365–71. 111. Edgar, R., Domrachev, M., and Lash, A. E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10. 112. Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., Holloway, E., Kapushesky, M., Kemmeren, P., Lara, G. G., Oezcimen, A., Rocca-Serra, P., and Sansone, S. A. (2003) ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31, 68–71. 113. Hwang, D., Rust, A. G., Ramsey, S., Smith, J. J., Leslie, D. M., Weston, A. D., de Atauri,
P., Aitchison, J. D., Hood, L., Siegel, A. F., and Bolouri, H. (2005) A data integration methodology for systems biology. Proc Natl Acad Sci USA 102, 17296–301. 114. Galon, J., Costes, A., Sanchez-Cabo, F., Kirilovsky, A., Mlecnik, B., Lagorce-Pages, C., Tosolini, M., Camus, M., Berger, A., Wind, P., Zinzindohoue, F., Bruneval, P., Cugnenc, P. H., Trajanoski, Z., Fridman, W. H., and Pages, F. (2006) Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science 313, 1960–64. 115. Becker, R., Chambers, J., and Wilks, A. (1988) The New S Language: a programming environment for data analysis and statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA, USA. 116. Leisch, F. (2002) Sweave: dynamic generation of statistical reports using literate data analysis. In Compstat 2002 – proceedings in computational statistics (Edited by Haerdle, W., and Roenz, B.), Physica-Verlag, Heidelberg, Germany. 117. Tarraga, J., Medina, I., Carbonell, J., Huerta-Cepas, J., Minguez, P., Alloza, E., Al-Shahrour, F., Vegas-Azcarate, S., Goetz, S., Escobar, P., Garcia-Garcia, F., Conesa, A., Montaner, D., and Dopazo, J. (2008) GEPAS, a web-based tool for microarray data analysis and interpretation. Nucleic Acids Res 36, W308–W314. 118. Rainer, J., Sanchez-Cabo, F., Stocker, G., Sturn, A., and Trajanoski, Z. (2006) CARMAweb: comprehensive R- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res 34, W498–W503. 119. Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M., Aronow, B. J., Robinson, A., Bassett, D., Stoeckert, C. J., Jr., and Brazma, A. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 3, RESEARCH0046. 120. Maurer, M., Molidor, R., Sturn, A., Hartler, J., Hackl, H., Stocker, G., Prokesch, A., Scheideler, M., and Trajanoski, Z. (2005) MARS: microarray analysis, retrieval, and storage system. BMC Bioinformatics 6, 101.
Chapter 14
Bioinformatics for RNomics
Kristin Reiche, Katharina Schutt, Kerstin Boll, Friedemann Horn, and Jörg Hackermüller
Abstract
Rapid improvements in high-throughput experimental technologies nowadays make it possible to study in detail the expression, as well as changes in expression, of whole transcriptomes under different environmental conditions. We describe current approaches to identify genome-wide functional RNA transcripts (experimentally as well as computationally) and focus on computational methods that may be utilized to disclose their function. While genome databases offer a wealth of information about known and putative functions of protein-coding genes, functional information for novel non-coding RNA genes is almost nonexistent. This is mainly explained by the lack of established software tools to efficiently reveal the function and evolutionary origin of non-coding RNA genes. Here, we describe in detail computational approaches one may follow to annotate and classify an RNA transcript.
Key words: RNomics, Non-coding RNA, ncRNA, Transcriptome, Bioinformatics, Regulatory RNA
1. Introduction
Of the 3.3 billion bases of the human genome, only about 2% code for proteins. Until very recently, the remaining 98% were considered to be "junk" and functionless. However, large transcriptomic studies like ENCODE (ENCyclopedia Of DNA Elements) (1, 2) or FANTOM (The Functional Annotation Of the Mammalian Genome) (3) have shown that approximately 90% of the genome is actively transcribed into RNA. Of the overall transcriptional output of human cells, non-coding RNAs (ncRNAs) represent the vast majority (up to 98%). The non-coding parts of genomes have expanded dramatically during evolution toward higher species, whereas the number of
protein-coding genes has remained rather constant. Therefore, ncRNAs might critically contribute to the complexity of higher organisms. In fact, the pilot phase of the ENCODE project revealed that ncRNAs, compared to mRNAs, are expressed in a far more pronounced cell- and tissue-specific manner (1). In this pilot phase, the ENCODE project focused on 1% of the human genome in 30 regions to identify all sequence elements that confer biological function. It is becoming increasingly apparent that ncRNAs form an important, previously underestimated regulatory layer that contributes pivotally to the control of many cellular functions. The potential of RNAs as regulatory molecules was revealed decades ago. Recently, many classes of ncRNAs, in species ranging from viruses to mammals, have been discovered to be involved in the control of RNA stability, gene expression, tissue and cellular development, RNA modification, chromatin organization, alternative splicing, subcellular localization of proteins, heat shock sensing, and other processes (4). In prokaryotes, a limited number of trans-acting small ncRNAs have been described that appear to mainly regulate mRNA translation or stability. However, ncRNAs do not dominate genomic output in prokaryotes, representing only a small fraction of their genomes, which are generally dominated (80–95%) by protein-coding sequences (5, 6). Although the number of identified ncRNAs in humans and other eukaryotes has increased tremendously during the last few years, many if not most ncRNAs still remain to be identified and functionally characterized. Taking their functional repertoire into account, ncRNAs can be largely divided into two classes: (1) housekeeping and (2) regulatory ncRNAs. Housekeeping RNAs comprise transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) involved in mRNA translation, small nuclear RNAs (snRNAs) involved in splicing, small nucleolar RNAs (snoRNAs) involved in the modification of rRNAs, RNase P RNAs (so-called ribozymes), telomerase RNA, and others. Whereas housekeeping ncRNAs are generally constitutively expressed and are required for basal cell functions, the class of regulatory ncRNAs, or riboregulators, is expressed at certain stages of development, during cell proliferation, or in response to external stimuli (7). Among regulatory ncRNAs, microRNAs (miRNAs), ~21 nt long RNAs that regulate mRNAs at the posttranscriptional level, belong to the best-studied classes. miRNAs are involved in the control of crucial biological processes like development, differentiation, and apoptosis (8). Initially discovered in Caenorhabditis elegans (9, 10), they are today known to be expressed in plants and throughout the animal kingdom, including humans, and many of them are evolutionarily conserved (11, 12). miRNAs bind to partially complementary sites in the 3′ UTRs of target mRNAs, causing either mRNA degradation or translational repression,
depending on the degree of complementarity. Another class of small ncRNAs are the Piwi-interacting RNAs (piRNAs). piRNAs form RNA–protein complexes through interactions with Piwi proteins. These piRNA complexes have been linked to transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells, particularly during spermatogenesis (13). In contrast to the various classes of small ncRNAs, long ncRNAs, with a length of >200 nt, lack a satisfactory classification. The broad functional repertoire of long ncRNAs includes epigenetic mechanisms as well as transcriptional and post-transcriptional regulation. The large non-coding RNA Xist, for example, is the master regulator of X chromosome inactivation. Xist is negatively regulated by its antisense transcript Tsix. This repressive antisense transcription across Xist operates at least in part through modification of the chromatin environment of the locus (14). Long ncRNAs can act as cofactors by modulating transcription factor activity, effect global changes by interacting with basal components of the RNA polymerase II (RNAPII)-dependent transcription machinery, and regulate RNAPII activity, e.g., by influencing promoter choice. As an example, an ncRNA transcribed from a region upstream of the human dihydrofolate reductase (DHFR) locus forms a triplex in the major promoter of DHFR to prevent the binding of the transcriptional cofactor TFIID (15). Furthermore, most mammalian genes express antisense transcripts, which might constitute a class of ncRNA that is particularly adept at regulating mRNA dynamics. Antisense transcripts have been shown to direct the alternative splicing of mRNA isoforms, for example for Zeb2 (16); alternatively, an annealing ncRNA can target protein effector complexes to the sense mRNA transcript, in a manner analogous to the targeting of the RNA-induced silencing complex (RISC) to mRNAs by siRNAs (17). RNAs, in particular ncRNAs and other functional RNAs, are characterized by structure motifs, sequence motifs, or a combination of both, a consequence of their action in RNA–(m)RNA or RNA–protein complexes as described above. Short- and long-range base pair interactions that organize the RNA molecule into structured and unstructured domains define characteristic secondary and tertiary structures, which form the basis of their diverse functional activities (18, 19) (see Note 1). Evolutionary selection ensures that these sequence and structure motifs remain mostly unmodified. ncRNAs of the same evolutionary origin are defined as belonging to the same RNA family. Members of the same ncRNA family generally share strong sequence similarity, which decreases for more ancestral homologies, while the secondary structure may still be conserved. RNA families sharing the same evolutionary origin but no sequence similarity are defined as members of the same RNA clan. Lastly, RNA families that did not evolve from a common origin, but whose similar cellular functions converged to similar secondary structures, are considered members of the same RNA class.
2. Materials
In this section, we briefly review available datasets of ncRNAs. In the field of non-coding RNAs, computational prediction of functional ncRNAs and experimental identification of expressed RNAs have run neck and neck with each other. Bioinformatic prediction delivered putative RNAs that showed signs of stabilizing selection, i.e., of functionality, but lacked evidence of expression. Experimental approaches delivered numerous expressed ncRNAs without being able to delineate their functional relevance. We therefore present datasets of predicted RNAs as well as datasets from large-scale experimental identification efforts. Due to the dynamics this field has recently developed, this chapter cannot be comprehensive, and we mainly focus on complex datasets derived for mammalian species, despite the many interesting reports on functional ncRNAs in prokaryotes and more basal eukaryotic model organisms.
2.1. Datasets of Computationally Predicted ncRNAs
2.1.1. Prediction of Structured ncRNAs
Many of the long-known "housekeeping" ncRNAs, like tRNAs, snRNAs, or snoRNAs, exhibit pronounced secondary structure features that are under stabilizing selection, e.g., because a particular structure is required for interaction with protein complexes. It was therefore suggested that functional RNA elements should have a secondary structure that is energetically more stable than expected by chance (20). However, thermodynamic stability alone did not prove to be statistically significant enough for ncRNA detection (21). Comparative approaches, in contrast, aim to detect conserved RNA secondary structure in sequence alignments. QRNA, which was the first successful algorithm of this type, compares which of three models best describes a given pairwise sequence alignment: a pair stochastic context-free grammar (SCFG) is used to model RNA secondary structure evolution, a pair Hidden Markov Model (HMM) describes protein-coding sequence evolution, and another pair HMM represents the null model of an unconstrained sequence (22). An up-to-date extension of QRNA to multiple sequence alignments (MSAs) is EvoFold, which combines SCFGs to model RNA secondary structure with a phylogenetic tree to model substitution rates along the branches of the tree (23). Human genome-wide predictions of structured RNAs using EvoFold are provided as a track in the UCSC genome browser (see Notes 3 and 5). An alternative strategy for the prediction of RNAs with secondary structure under stabilizing selection is RNAz, which is based on thermodynamic RNA folding. It determines a structure conservation index (SCI), obtained by comparing the folding energies of the individual sequences with the predicted consensus folding of an MSA, as one classification criterion and a z-score measuring
thermodynamic stability of the individual sequences as a second criterion. A support vector machine that detects conserved and stable RNA secondary structures with high sensitivity and specificity combines both measures (24); a toy illustration of the two criteria is sketched below. RNAz has been applied to detect ncRNAs in the human genome (25) (data are available as BED files at http://www.tbi.univie.ac.at/papers/SUPPLEMENTS/ncRNA), for Urochordates (26), Nematodes (27), Drosophilids (28), yeast (29), Plasmodia (30), and teleost fishes (31). Estimation of false discovery rates (FDR) for alignment-based methods like RNAz is intricate, as it requires the randomization of alignments, which is easily biased. A solution to this issue is SISSIz, which generates random alignments satisfying various constraints and has been used for more accurate FDR estimates of RNAz (32). RNAz 2.0 has recently been released (33), and a web server is available for small-scale predictions (34). Setting up your own RNAz screen is described in (35, 36) and may require the computation of dedicated MSAs, for which NcDNAlign provides a solution adjusted to the needs of RNAz (37) (see Note 5).
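To make the two RNAz classification criteria concrete, here is a toy sketch in R with hypothetical folding energies (in a real screen these would be obtained from RNAfold and RNAalifold):

  # Hypothetical minimum free energies (kcal/mol) of the individual
  # alignment sequences, e.g., as computed by RNAfold
  E_single <- c(-25.3, -27.1, -24.8, -26.0)
  E_consensus <- -24.5    # hypothetical consensus energy (RNAalifold)
  # Structure conservation index: values near 1 indicate that the
  # consensus structure is about as stable as the individual structures
  SCI <- E_consensus / mean(E_single)
  # z-score of one sequence against (hypothetical) energies of shuffled
  # sequences with the same length and base composition
  E_shuffled <- c(-18.2, -20.1, -19.5, -17.8, -19.0)
  z <- (E_single[1] - mean(E_shuffled)) / sd(E_shuffled)
  c(SCI = SCI, z = z)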
2.1.2. Sequence Alignment Independent Approaches
The approaches described above have the disadvantage that reliable MSAs may not be at hand, because only distantly related genomes are available or because of the rapid evolution of non-coding sequences. This can be overcome by relying on structural alignments, however at the cost of a huge computational effort. Variants of the Sankoff algorithm, which solves RNA folding and sequence alignment simultaneously (38), have been used for ncRNA prediction, e.g., a classifier based on Dynalign (39, 40), or a screen using FOLDALIGN (see Note 5) to compare sequences between human and mouse that are not alignable at the primary sequence level (41, 42). While these and similar approaches can be used for ncRNA screens, a genome-wide application is usually not feasible, or feasible only when outstanding computational resources are available. Apart from ncRNA identification, these approaches are also useful for the classification of ncRNAs.
2.1.3. Secondary Structure Independent Prediction of ncRNAs
While the different approaches for detecting structured ncRNAs have delivered a plethora of potentially functional ncRNAs, not all ncRNAs known to date have easily detectable structural features; some contain only a few structured domains within a longer unstructured sequence, which is particularly true for most mRNA-like ncRNAs identified so far. In addition, the overlap between ncRNAs detected experimentally in ENCODE and those predicted by EvoFold and RNAz (see Note 5) was surprisingly small (43). A recently published approach that successfully identified novel unstructured ncRNAs in Drosophila melanogaster relied on the detection of conserved intron positions in an otherwise largely non-conserved sequence context (44).
2.2. Datasets of Experimentally Identified ncRNAs
We briefly introduce large-scale studies that have led to the identification of non-coding transcripts. Approaches aimed at finding individual ncRNAs cannot be covered here, but we refer to the databases listed in Note 2, which collect these transcripts. Another useful meta-resource for ncRNAs is the ncRNA web repository (see Note 4). Experimental identification of ncRNAs has one important prerequisite: methods need to be unbiased, i.e., detection must not be restricted to a particular subset of RNAs, such as polyadenylated transcripts. Also, as many ncRNAs are, compared to mRNAs, expressed at low levels, approaches for identifying novel ncRNAs benefit from increased sensitivity. Over the last years, unbiased transcriptomics using tiling arrays and the sequencing of cDNA and EST libraries have been most successful in finding new ncRNAs. More recently, transcriptome sequencing (RNAseq), approaches identifying RNAs bound to proteins, and the identification of ncRNAs based on epigenetic patterns have proved to be of value.
2.2.1. ncRNA Identification by Transcriptomics Approaches
Microarrays have been used since the 1990s for quantifying the expression of known transcripts. Phil Kapranov (45), Eric Schadt (46), Paul Bertone (47), and Viktor Stolc (48) and colleagues pioneered the application of tiling arrays to identify novel transcripts. Tiling arrays are oligonucleotide microarrays that, unlike other arrays, do not interrogate the sequences of specific transcripts but rather tile the genome of interest, i.e., probes are spaced at regular intervals throughout the genome. Typically, probes with low complexity or repetitive sequences are removed, as their signal cannot be attributed to a specific genomic location. Apart from that, and the variation of probe lengths in designs relying on longer probes, no probe design is possible for this type of array, and hybridization free energies vary strongly between probes. Together with cross-hybridization effects, this leads to rather noisy signals, which require rigorous data processing. Therefore, Kampa et al. (49) proposed to use a stringent cutoff for considering individual probes as expressed and to require at least three consecutive probes with a signal above this cutoff to designate a region as expressed (see the sketch at the end of this subsection). With a probe pitch of 35 nt in many whole-genome tiling experiments, this results in the efficient detection of transcripts that are at least approximately 110 nt in size. Arrays with a resolution of 5 nt have been used to extend the approach to the detection of small RNAs (50). However, with the need for around 180 arrays per sample to accommodate the resulting number of probes, this approach is hardly practical for whole-genome applications. Different tiling array approaches have been a mainstay of the pilot phase of the ENCODE project (1, 2). Based on these data, ENCODE concluded that more than 90% of the human genome is transcribed into RNA on at least one strand. In addition, comparing expression data of several cell lines and tissues, it was shown that on
average expression patterns of ncRNAs are far more specific than those of mRNAs. Datasets of the ENCODE pilot phase and ongoing ENCODE studies are most easily accessible via the UCSC genome browser (see Note 3). A disadvantage of tiling arrays is the locality of the information, i.e., transcript starts and ends can be inferred only indirectly, based on segmentation of the expression signal into blocks of similar expression. Tiling array data are therefore ideally complemented by techniques like cap analysis of gene expression (CAGE) tags or paired-end tags (PET). With the rapid development of second-generation sequencing techniques, transcriptome sequencing (RNAseq) has become a viable alternative to array-based approaches, in particular for small RNAs (51). Compared to tiling arrays, RNAseq methods have the advantage that sequences of any size that can be reliably mapped to the genome, or assembled, can be identified. On the other hand, library preparation for RNAseq may be more biased than labeling and amplification for tiling arrays. In addition, the capacity of today's sequencers may still be limiting for identifying differential expression of long ncRNAs in complex samples.
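The Kampa et al. rule referred to above can be sketched as a simple run-length computation in R; the probe intensities and the cutoff below are hypothetical:

  # Mark a stretch of probes as expressed if at least 'min_run'
  # consecutive probes exceed the intensity cutoff
  expressed_runs <- function(intensity, cutoff, min_run = 3) {
    r <- rle(intensity > cutoff)              # runs of above/below-cutoff probes
    keep <- r$values & r$lengths >= min_run
    ends <- cumsum(r$lengths)
    starts <- ends - r$lengths + 1
    data.frame(start = starts[keep], end = ends[keep])  # probe indices
  }
  set.seed(1)
  probes <- c(rnorm(10, 5), rnorm(5, 9), rnorm(10, 5))  # toy signal
  expressed_runs(probes, cutoff = 7)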
2.2.2. ncRNAs Identified from cDNA Clones
FANTOM, the functional annotation of the mouse transcriptome project, initially identified numerous ncRNAs by sequencing full-length cDNA clones (3, 52, 53). Subsequent FANTOM studies supplemented the cDNA data with the massive identification of transcription start and termination sites (54), most recently combining CAGE with deep sequencing (deepCAGE) (55). In particular, the latter technique alleviates one of the major downsides of purely clone-based approaches, which hardly capture differential expression.
2.2.3. Epigenetics Based Approaches
Recently, Mitchell Guttman and colleagues introduced an approach for identifying ncRNAs that is orthogonal to transcriptomics (56): studying the profiles of specific histone modifications in and around protein-coding genes, they derived an epigenetic signature of actively transcribed genes and used these patterns to identify a series of novel transcripts called lincRNAs.
3. Methods
In this section, we describe common computational methods that one may use to annotate and classify ncRNAs retrieved either from genome-wide high-throughput experimental transcriptomics or from genome-wide computational predictions. Interested readers should also consider Bompfünewerer et al. (57). Annotation of ncRNAs comprises problems like disclosing the RNA family, clan, or class the RNA belongs to; identifying the
reading strand; predicting the secondary structure in order to find functional motifs; and revealing likely targets, i.e., binding partners. To achieve a comprehensive annotation, one may follow the outline described below and in Fig. 1. A selection of the most important software tools addressing the problem of annotating and classifying ncRNA genes is given in Table 1 (see Note 5).
Fig. 1. Short outline one may follow to annotate and classify an ncRNA candidate retrieved from experimental or computational ncRNA identification approaches.
3.1. Sequence-Based Classification
Sequence-based classification aims to identify ncRNAs that are conserved in their primary sequence and thus exhibit close homology as well as similar cellular functionality. With standard local sequence alignment tools like blast (58) (see Note 5), good results can already be achieved. However, sequence conservation is mostly restricted to rather short functional sequence motifs that are interrupted by longer stretches of weak sequence
Sequence based classification
De novo prediction
n
n n n
y
y
y
(y)
blastn
SSEARCH
Probalign
GotohScan
y y
y n
EvoFold RNAz
MSA
Pairwise
y
QRNA
Local
MSA
Prob. model
Citations
Background and tips
(continued)
(62)
(60, 61)
(59)
(58)
Citations
(22) A pair-SCFG is used to model RNA secondary structure evolution. A pairwise alignment is classified to one of three evolution classes: structural RNA evolution, coding sequence evolution, or position independent evolution. Extension of QRNA to multiple sequence alignments (MSA). (23) Alternative de novo prediction based one thermodynamic (24, 33) RNA folding. It compares folding energies of the individual sequences with the predicted consensus folding of an MSA.
Background and tips
Fast local alignment tool to detect homology at the nucleotide level between a sequence pair. Detecting conserved subsequences of ncRNAs using blastn often forms a good starting point, while post-filtering steps reject all sequences not exhibiting specific structural features of the query ncRNA. Full Smith–Waterman It is much slower than blast but more sensitive, because no heuristics to speed up searches in large databases are implemented. Partition function In contrast to blast and SSEARCH, it computes all sub-optimal alignments between a pair of sequences. Semi-global Interprets the problem of homology searches as identifyalignment ing the best match of the query ncRNA within a complete genome.
Heuristic Smith–Waterman
Algorithm
n y
n
Energy
Table 1 A selection of the most common software tools for identification and functional classification of non-coding RNAs
Bioinformatics for RNomics 307
Secondary structure prediction
Table 1 (continued) – y
Sankoff
n
n
y
n
n
Energy
y
y
n
fragrep
ClustalW
UNAFold
Vienna RNA Package
FOLDALIGN(M)
n
n
n
Prob. model
n
y/n
n
MSA
Progressive MSA
Pattern matching
Citations
(65)
(63)
One of the two standard software packages for secondary (66) structure prediction for one or two single stranded RNA sequences. Includes algorithms for free energy minimization, partition function calculations and stochastic sampling. It replaces and extends the original mfold package. (67, 68) The second standard software package for secondary structure prediction for RNA sequences. Includes algorithms for prediction of minimum free energy structures, of suboptimal secondary structures as well as of locally stable structure of long sequences, and calculation of partition functions. RNAfold predicts minimum free energy structure from a single sequence, and RNAalifold from an MSA. (42, 69) Implementation of a simplified version of the Sankoff algorithm (38) for local or global simultaneous secondary structure prediction and aligning a collection of RNA sequences.
Background and tips
Fragmented pattern search that treats sequence regions which are poorly conserved as distance constraints between significantly conserved blocks. MSA is built from a phylogenetic tree either created by pairwise alignments or provided by the user. Produces good alignments of ncRNAs that show high to medium sequence homology (64), and forms the basis of the famous ncRNA prediction tool RNAz (24, 33).
308 Reiche et al.
- Dynalign (40) [Sankoff]: Combines free energy minimization and comparative sequence analysis to find a low free energy structure common to two sequences.
- PMcomp (71) [Sankoff]: Uses McCaskill's approach (70) to compute base pairing probability matrices, which incorporate information on the energetics of each sequence, and aligns these matrices to retrieve a consensus secondary structure model and alignment for two sequences.
- (m)LocARNA (72) [Sankoff]: Same approach as used in PMcomp, but with decreased memory and run-time requirements. mLocARNA computes a multiple alignment progressively from pairwise LocARNA alignments.
- Pfold (73, 74) [probabilistic model]: Combines an evolutionary model of the RNA sequences with a probabilistic model for the secondary structures.
- PETfold (75) [probabilistic model, energy]: Extends Pfold such that energetic information is also incorporated into the structure prediction.
- Consan (76) [probabilistic model]: Uses pair stochastic context-free grammars (pairSCFGs) as a unifying framework for scoring pairwise alignment and folding.
- SimulFold (77) [probabilistic model]: Predicts secondary structures (including pseudoknots), multiple sequence alignments, and an evolutionary tree from unaligned RNA input sequences simultaneously.

Secondary structure-based classification (tools designed for a known RNA class):
- tRNAscan-SE (78) [CM]: Standard software tool for scanning genomes for tRNA genes. It detects tRNA pseudogenes, as well as unusual tRNA homologs such as selenocysteine tRNAs.
- RNAmicro (79) [SVM]: An SVM that recognizes conserved miRNA precursors in multiple sequence alignments.
- SnoReport (80) [SVM]: An SVM that recognizes the two major classes of snoRNAs, box C/D and box H/ACA snoRNAs, in multiple sequence alignments. Classification of the multiple alignment is based solely on information about conserved sequence boxes and secondary structure constraints; it does not require any target information.
- SnoScan (81) [probabilistic model]: Implementation of a deterministic search algorithm and a probabilistic gene model to scan genomic sequences for C/D box snoRNA genes. SnoScan requires as input the sequence of the query, i.e., the snoRNA candidate, as well as the sequence of the target, i.e., RNAs that may interact with and be modified by the query snoRNA.
- SnoGPS (82) [probabilistic model]: Analogous implementation to SnoScan for H/ACA snoRNA genes.
- RNAmmer (83) [HMM]: Fast computational predictor for the major rRNA species.

Secondary structure-based classification (general purpose):
- Infernal (84) [CM]: Searches sequence databases for RNA structure and sequence similarities. Requires a structure-annotated alignment of related RNA sequences as input.
- CMfinder (85) [CM]: Expectation maximization algorithm that captures secondary structure motifs within a set of unaligned RNA sequences.
- RaveNnA (86) [CM]: Implementation of rigorous filters that accelerate CM searches for almost all known ncRNA families from the RFAM database and the tRNA models in tRNAscan-SE.
- RSEARCH (87) [CM]: Alignments are scored by single nucleotide and base pair substitution matrices (RIBOSUM matrices) specifically designed for pairwise alignments of RNA sequences.
- FastR (88): Improves run-time by applying structural filters in order to eliminate unrelated sequences from the database, while truly homologous RNAs are retained.
- ERPIN (89, 90) [lod-score profile]: Requires an RNA sequence alignment as input and identifies related sequences using a profile-based dynamic programming algorithm. It is able to handle pseudoknots.
- fragrep3 (91) [pattern]: A pattern matching approach for RNA secondary structures combining features of statistical approaches and descriptor-based methods. Conceptually similar to ERPIN, but treats poorly conserved regions as simple distance constraints.
- ExpaRNA (92) [pattern]: An exact pattern matching approach to detect the longest collinear sequence of substructures common to two RNA sequences. Substructures common to both RNA sequences are treated as a whole unit, while variable regions are allowed between them.
- RNAbob (93) [pattern]: Uses a nondeterministic finite state machine with node rewriting rules to model RNA structure motifs.
- HyPaL (94) [pattern]: Contains a library of annotated structural elements characteristic for certain classes of structural and/or functional RNAs and allows searching databases for those motifs. Allows user-defined approximate rules, which rank matches according to their distance to the motif.
- RNAMotif (18) [pattern]: Relies on a flexible structure definition language, which can specify any type of base pair, and provides a user-controlled scoring section that allows ranking of matches.
Interaction-based classification, miRNA–mRNA target prediction:
- RNAhybrid (95, 96): Introduces a specialized energy model for dimer hybridization. Especially designed to predict multiple potential binding sites of miRNAs in large target RNAs.
- miRanda (Microcosm) (97, 98): Whole-genome predictions of miRNA target genes. It also incorporates the degree of conservation of putative target sites into the prediction process.
- PicTar (99) [HMM]: Identification of mRNAs likely to be the target of a combination of miRNAs, using a scoring system based on an HMM.
- PITA (100): Quantifies the effect of target site accessibility on miRNA–mRNA interactions by systematic experimentation. The model underlying PITA computes the difference between the free energy gained from the formation of the miRNA–target duplex and the energetic cost of unpairing the target to make it accessible to the miRNA.
- TargetScan(S) (101, 102): Predicts targets of miRNAs by searching for conserved 8mer and 7mer sites that are complementary to the seed region of the miRNA.
- DIANA microT (103): Employs a combined bioinformatics and experimental approach to identify rules important for miRNA-target identification that allow prediction of human miRNA targets.
- MicroInspector (104): A web tool that takes a different view of the miRNA target prediction problem: it checks whether a given mRNA sequence contains a binding site for any miRNA of the organism that is available in the database. Hence, the miRNA need not be known beforehand.
- rna22 (105) [pattern]: A pattern-based approach for the discovery of miRNA binding sites. It identifies putative miRNA binding sites without the need to know the identity of the specific targeting miRNA, by defining sequence patterns common to known mature miRNAs.
- miRtarget2 (106) [SVM]: Target prediction is done by an SVM, which evaluates several features like seed conservation, GC content, accessibility, free energy, and others. Conservation of the target site is considered, but not required. The SVM has been trained by systematically studying public microarray data.
- miTarget (107) [SVM]: Another SVM-based miRNA target prediction tool. Features include complementarity to the seed part and the 3′ region, free energy of the total alignment as well as of the seed and 3′ region, and position-based features.
- NbmiRTar (108) [Naïve Bayes]: A Naïve Bayes classifier that is based on sequence information and free energy.

RNA–(m)RNA interaction prediction (scope global/global):
- UNAFold (109): Computes the minimum free energy (MFE) secondary structure for two interacting RNA molecules. Only intermolecular base pairs are considered in duplexes.
- PairFold, MultiFold (110): Also considers intramolecular pairing for RNA duplexes. MultiFold is the extension of the PairFold algorithm to multiple molecules.
- RNAcofold (111): Provides an extension of McCaskill's partition function algorithm to compute base pairing probabilities, realistic interaction energies, and equilibrium concentrations of duplex structures, allowing intramolecular base pairing.
- RNAplex (112) [scope global/local]: Based on a simplified energy model to speed up target site identification in large databases; not specifically designed for a particular target class.
- RNAup (113) [scope global/local]: Includes target site accessibility, i.e., checks that the binding site is unpaired.
- IntaRNA (114, 115) [scope global/local]: Includes target site accessibility and user-definable seeds.
- RNArip (116) [scope global/local]: Enables the calculation of interaction probabilities for any given interval on the target RNA.
- piRNA (117) [scope global/local]: Computes the interaction partition function over the whole ensemble of structures between two interacting RNAs.

RNA–protein interaction, protein-binding motif prediction in RNAs:
- MEME (118): Standard sequence motif discovery tool. Discovers any type of sequence motif common to a set of DNA sequences. It is not especially designed to detect protein-binding sites in RNAs.
- MEMERIS (119): Searches for sequence motifs that occur preferentially in single-stranded regions by guiding the motif search to unpaired regions of the RNA sequence.

RNA–protein interaction, RNA-binding site prediction in proteins:
- PPRint (120) [SVM]: RNA-binding sites are predicted from a protein sequence by training an SVM using a position-specific scoring matrix (PSSM) that reflects evolutionary conservation of the binding motif.
- RNAProB (121) [SVM]: Here the PSSM is smoothed, such that it incorporates, for each amino acid in the protein sequence, the dependency effect from its neighboring amino acids.
- BindN (122) [SVM]: Features like the pK(a) value, hydrophobicity index, and molecular mass of an amino acid are used to train an SVM on known DNA- or RNA-binding residues.
- RNABindR (123) [Naïve Bayes]: Naïve Bayes classifier trained on a set of known protein–RNA complexes to predict which amino acids in a protein sequence are most likely to bind RNA.
- PRINTR (124) [SVM]: Incorporates secondary structure prediction of the protein into the process.
- RnaPred (125): Provides a classification of known RNA nucleotide and dinucleotide protein-binding sites in order to identify common types of shared 3D physicochemical binding patterns. Searches a complete protein for regions similar to known 3D consensus patterns of RNA-binding sites.
The table is divided into five subtables describing software tools addressing de novo prediction of ncRNAs, sequence-based classification, secondary structure prediction, secondary structure-based classification, and interaction-based classification, respectively. The feature columns of the original table are as follows. De novo prediction: Prob. model, MSA, and Energy indicate whether the described algorithm uses a probabilistic model for prediction, requires a multiple sequence alignment as input, and whether thermodynamic RNA folding is included. Sequence-based classification: Local, MSA, and Algorithm indicate whether the software tool computes local sequence alignments, whether it computes a multiple sequence alignment, and the type of algorithm used, respectively. Secondary structure prediction: Energy, Sankoff, Prob. model, and MSA indicate whether the prediction relies on minimization of the free energy, on predicting the alignment and the secondary structure simultaneously, or on a probabilistic model for RNA secondary structures, and whether an MSA is a prerequisite. Secondary structure-based classification: Motif, Prob. model, and Model indicate whether the software tool is able to predict secondary structure motifs common to the input RNA sequences, whether a probabilistic model is used, and the type of model used to describe the RNA motif (SVM: support vector machine; HMM: Hidden Markov Model; CM: covariance model for RNAs). Software tools are separated into tools specifically designed for an RNA family/clan/class (known class) and tools that are not restricted to a specific ncRNA family/clan/class (general purpose). Interaction-based classification: for the miRNA–mRNA subtable, Access., Seed, Energy, and Cons. indicate whether the tools incorporate the structural accessibility of the mRNA target, the complementarity to the seed region of a miRNA, the free energy, and the conservation of the target site into the prediction process, respectively. For the RNA–(m)RNA subtable, Accessibility and Scope indicate whether the tool checks that the binding region in the target RNA is likely to be unpaired, and the scope of the interaction. In the RNA–protein subtable, Accessibility indicates whether the tool guides the motif search to single-stranded regions of the RNA; Structure known, Cons., and Method indicate whether a tool incorporates the secondary or tertiary structure of the protein or bases the prediction solely on sequence information, whether conservation of binding sites is considered, and the method used for classification.
conservation that may be part of a secondary structure motif. The most promising approaches to detect homologous ncRNAs by sequence conservation alone are searches for fragmented patterns that do not require a conserved substring of sufficient length as seed region for the local alignment (91). Regions of poor conservation are simply treated as distance constraints between well-conserved blocks.

3.2. Secondary Structure-Based Classification
In contrast, classification based on secondary structure resolves more distant homologs of ncRNAs, because ancestral copies of the same ncRNA are likely to exhibit only low sequence conservation but still recognizable conservation in secondary structure. Furthermore, classes of ncRNAs likely to be involved in the same cellular processes can be identified by structure-based classification, as they are expected to share secondary structure motifs. Such classification problems have in common that structural similarity must be evaluated in an efficient and reliable way among a large set of candidate structures. A prerequisite for classification by secondary structure are reliable secondary structure models that reflect the structure and sequence motifs central to the functional constraints of an ncRNA. Predicting the secondary structure from a set of evolutionarily related RNA sequences outperforms predictions from single RNA sequences, because functionally important RNA structures are expected to evolve much more slowly than their underlying sequences. Exploiting patterns of sequence covariation according to structure variation improves secondary structure prediction considerably. The Sankoff algorithm, which optimizes sequence and structure conservation of related RNAs simultaneously, yields reliable predictions (38). However, a full implementation of the Sankoff algorithm is computationally too expensive for widespread application. Implementations that are tractable for realistic input sizes are FOLDALIGN (41, 42), FoldalignM (69), Dynalign (39, 40), PMcomp (71), and LocARNA (72). Probabilistic approaches to simultaneously align and predict secondary structures are usually based on stochastic context-free grammars (SCFGs), like CMfinder (85), SimulFold (77), and Consan (76). Efficient alternatives to those Sankoff-like methods are approaches that evaluate covariations in sequence and structure based on a given multiple sequence alignment. Implementations of such methods are RNAalifold, which predicts the minimum free energy secondary structure (MFE structure) of a global sequence alignment (67, 126); Pfold, which combines an evolutionary model of the RNA sequences with a probabilistic model of the secondary structures (73, 74); and PETfold, which simultaneously uses evolutionary and energetic information (75) (see Note 5).
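As a brief illustration of how such predictions can be scripted, the sketch below folds a single sequence and computes a consensus structure for a toy alignment. It assumes the ViennaRNA package is installed with its Python bindings (module RNA); RNA.alifold() is assumed to be exposed alongside RNA.fold(), and the sequences are invented examples, not data from this chapter.

```python
# Minimal sketch: single-sequence MFE folding and consensus folding of a
# pre-aligned set of sequences, using the ViennaRNA Python bindings.
import RNA

# RNA.fold() returns the minimum free energy (MFE) structure in
# dot-bracket notation together with the free energy in kcal/mol
structure, mfe = RNA.fold("GGGAAAUCCCGCGAAAGCGGGAUUUCCC")
print(structure, mfe)

# consensus structure from an alignment (gaps written as '-'), in the
# spirit of RNAalifold; assumed to be available as RNA.alifold()
consensus, energy = RNA.alifold(["GGGAAAUCCC-GAAAG",
                                 "GGGA-AUCCCGGAAAG"])
print(consensus, energy)
```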
Predicting the common secondary structure from a set of pre-aligned RNA sequences has the disadvantage that the pure sequence alignment must reflect sequence as well as structural conservation, which is problematic for ncRNAs exhibiting sequence conservation of less than 60% (64). For evolutionarily distant ncRNAs, hand-curation of the multiple sequence alignments is still necessary to retrieve reliable consensus secondary structure predictions.

Specialized classification methods use secondary structure models that are specifically designed for particular ncRNA clans/classes. They describe specific structure and sequence motifs occurring explicitly in the ncRNA clan/class of interest. An alternative to predefined models are descriptor-based approaches that provide description languages allowing users to define combined sequence/structure models for particular ncRNA clans/classes and to search for instances of the defined patterns in databases. Such methods mostly combine a pattern language with user-defined approximate rules, which rank the results according to their distance to the motif. The construction of specialized models is in general time-consuming and requires extensive expert knowledge about the RNA clan/class of interest.

General-purpose classification tools are, in contrast, not restricted to a specific ncRNA clan/class, but detect new members for any predefined secondary structure motif. The most common tools require as input a sequence alignment in combination with structural annotation as a training set to automatically learn a statistical model. Freyhult et al. present a recent evaluation of such methods (127). Infernal (84) and CMfinder (85) are based on covariance models, which are the stochastic context-free grammar analogue of profile Hidden Markov Models. They are specifically designed to model base pair interactions and search a database of candidate sequences for high-scoring matches to the RNA model. Infernal requires an alignment of the set of RNA sequences, while CMfinder is able to learn the covariance model from an unaligned set of RNA sequences. These approaches are extremely time-consuming, and pre-filtering steps, as provided by RaveNnA, are required; however, the recent version of Infernal may make pre-filtering unnecessary. An alternative approach is followed by ERPIN (89) (see Note 5). It takes a structure-annotated multiple alignment as input to construct an RNA descriptor in terms of log-odds-score profiles. In contrast to descriptor-based search methods, the user needs little effort to generate the statistical model, but consequently cannot easily improve the model by incorporating expert knowledge.
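To make the descriptor idea concrete, the toy sketch below (deliberately not one of the tools above) scans a sequence for a minimal stem-loop descriptor: a stem of fixed length, allowing Watson–Crick and G-U wobble pairs, enclosing a loop of constrained length. Real descriptor languages additionally support mismatches, variable stem lengths, and sequence constraints.

```python
# Toy descriptor search: hairpins with a stem of exactly `stem` base pairs
# and a loop of min_loop..max_loop nucleotides.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def find_hairpins(seq, stem=4, min_loop=3, max_loop=8):
    hits = []
    n = len(seq)
    for i in range(n - 2 * stem - min_loop + 1):
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop + stem  # one past the closing strand
            if j > n:
                break
            # the k-th base of the 5' strand must pair with the
            # (stem-1-k)-th base of the 3' strand (strands are antiparallel)
            if all((seq[i + k], seq[j - 1 - k]) in PAIRS for k in range(stem)):
                hits.append((i, j, loop))
    return hits

print(find_hairpins("AAGGGCAAAAGCCCUAA"))  # -> [(2, 14, 4)]
```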
Secondary structure clustering, on the other hand, identifies, among a diverse set of ncRNAs, members with the same secondary structure profile in an unsupervised way. This allows not only assigning candidates for functional ncRNAs to known RNA families, but also revealing novel ncRNA families. These approaches use, in general, an efficient sequence-structure alignment method to compute all pairwise alignments in order to reveal pairwise structural similarities among the input set. Subsequently, all sequences are clustered according to a distance matrix derived from the pairwise alignments, using, for example, an agglomerative clustering approach. Lastly, multiple alignments are computed for all clusters and the optimal number of clusters is retrieved (69, 72, 128).
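The following sketch caricatures such a clustering pipeline under strong simplifications: structures come from single-sequence MFE folding (via the ViennaRNA Python bindings) rather than from sequence-structure alignments, and the distance between two molecules is simply the symmetric difference of their base-pair sets. The input sequences are invented.

```python
# Toy structure-based clustering: fold, derive base-pair sets, build a
# condensed distance matrix, and cluster agglomeratively with SciPy.
import RNA
from scipy.cluster.hierarchy import fcluster, linkage

def base_pairs(db):
    """Return the set of base pairs (i, j) encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, char in enumerate(db):
        if char == "(":
            stack.append(i)
        elif char == ")":
            pairs.add((stack.pop(), i))
    return pairs

seqs = ["GGGAAAUCCC", "GGGAGAUCCC", "ACGUACGUAC"]  # invented input set
pair_sets = [base_pairs(RNA.fold(s)[0]) for s in seqs]

# condensed pairwise distance vector in the order expected by linkage()
dists = [len(pair_sets[i] ^ pair_sets[j])
         for i in range(len(seqs)) for j in range(i + 1, len(seqs))]
tree = linkage(dists, method="average")
print(fcluster(tree, t=2, criterion="maxclust"))  # cluster labels
```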
3.3. Interaction-Based Classification

Interaction-based classification focuses on the computational identification of binding partners, i.e., the regulatory coworkers of ncRNAs. The majority of known ncRNAs exert their regulatory activity by base pairing with a target (m)RNA or by binding to protein complexes. Computational prediction of target (m)RNAs or target proteins, especially for novel ncRNA families, is essential to reveal their regulatory roles and the signal transduction pathways they may be involved in. Nevertheless, with the exception of miRNAs, only few efficient software tools are available to predict RNA interaction targets with high sensitivity and specificity.
3.3.1. RNA Target Prediction
RNA target prediction either identifies RNA targets that simply show sense–antisense base pair interactions, which is mainly observed for small RNAs and their target mRNAs, or targets exhibiting more complex interaction structures like the co-folding of two long RNA molecules. Hybridization energies, structural accessibility of the target sites, as well as perfect Watson–Crick base pairing of short consecutive seed regions are important for sense–antisense base pair predictions. Software tools designed for this task can be separated into approaches that completely avoid internal base pairing in both RNA strands (95, 111) and tools that follow a more sophisticated approach, restricting base pairing to sub-regions that are likely to remain unpaired in the internal secondary structure of both RNA molecules (113, 114). These approaches model general types of RNA–RNA interaction structures and do not specifically consider hybridization characteristics that are typical for the RNA classes under consideration. Specialized target prediction tools have mainly been developed for miRNAs, while little effort has been put into methods for other regulatory RNA classes. First attempts to predict complex RNA–RNA interactions concatenate both RNA molecules into one single strand and predict its minimum free energy structure. The drawback of this approach is that it cannot predict important RNA–RNA interaction motifs like kissing hairpin loops. Recently, two efficient methods have been published that compute the whole ensemble of interaction structures that are known from in vivo RNA–RNA complexes (116, 117). Their implementations follow an energy minimization algorithm for RNA–RNA joint secondary structures proposed by Alkan et al. (129).
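As a sketch of the simplest, duplex-only flavor of such prediction, the snippet below hybridizes two invented sequences while ignoring all intramolecular structure. It assumes the ViennaRNA Python bindings expose duplexfold(), whose result is assumed to carry the duplex structure (with '&' separating the strands), the binding positions, and the hybridization energy.

```python
# Duplex-only sense-antisense prediction: intramolecular base pairs in
# both strands are ignored, exactly the simplification discussed above.
import RNA

small_rna = "UGAGGUAGUAGGUUGUAUAGUU"     # invented query RNA
target = "AACUAUACAACCUACUACCUCAUCCC"    # invented target region

duplex = RNA.duplexfold(small_rna, target)
print(duplex.structure, duplex.i, duplex.j, duplex.energy)
```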
3.3.2. microRNA Prediction
Though several hundred miRNAs have been validated in humans and other animals, only a small fraction of verified miRNA:target pairs exists, mainly due to a lack of high-throughput experimental identification methods. However, several methods to predict miRNA targets computationally have been developed (see Note 5). The major challenge for such in silico target identification is the lack of complete complementarity between the miRNA and its target sequence, which shows imperfect characteristics like mismatches, gaps, and G:U wobbles. The seed match, i.e., almost complete complementarity between nucleotides 2 and 7 of the miRNA, evolutionary conservation of the miRNA and target binding site, as well as binding of the miRNA to the 3′UTR of its target gene are features most of the prediction programs have in common. Most of these methods, like miRanda (now termed Microcosm) (97, 98), PicTar (99), PITA (100), TargetScan/TargetScanS (101, 102), DIANA microT (103), MicroInspector (104), and miRtarget2 (106), primarily use sequence complementarity, thermodynamic stability calculations, and evolutionary conservation among species to determine the likelihood of a productive miRNA:mRNA duplex formation. Variations in these algorithms include, for example, the number of miRNA binding sites in one target, the accessibility of the target (miRtarget2 and PITA), binding sites outside the 3′UTR (CDS and 5′UTR) (miRanda/Microcosm, MicroInspector), as well as additionally conserved nucleotides beside the seed region (TargetScanS). In addition, several target prediction methods are available that do not rely on sequence conservation (rna22 (105), miTarget (107), NbmiRTar (108)) or on the seed match only (NbmiRTar). Rna22, in contrast to all other programs, first identifies the putative target site in a given mRNA before it identifies the targeting sequence. Future knowledge gained from experimentally validated miRNA:target pairs is required to optimize existing and implement new algorithms and to lower the false positive rates of the programs (~30%).
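The seed match itself is easy to state in code. The self-contained sketch below reports 3′UTR positions that are perfectly complementary to nucleotides 2–7 of a miRNA; conservation, free energy, and site accessibility, which the programs above additionally evaluate, are deliberately ignored, and both sequences are invented.

```python
# Naive seed-match scanner: find target sites whose sequence is the
# reverse complement of miRNA nucleotides 2-7 (the "seed").
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_sites(mirna, utr):
    # reverse complement of miRNA positions 2-7 (1-based), i.e. indices 1-6
    seed = "".join(COMPLEMENT[b] for b in reversed(mirna[1:7]))
    return [i for i in range(len(utr) - len(seed) + 1)
            if utr[i:i + len(seed)] == seed]

mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # invented miRNA (let-7-like)
utr = "AAAACUACCUCAAAACUACCUCAAA"  # invented 3'UTR with two seed matches
print(seed_sites(mirna, utr))      # -> [5, 16]
```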
3.3.3. Identification of Protein Binding Partners
Identification of protein binding partners for an ncRNA is an important step towards understanding the cellular role of an RNA molecule. Computationally, this task is approached by either predicting protein-binding motifs in a given ncRNA or predicting RNA-binding sites in a given protein. In both cases, the corresponding binding partner is generally not known, and thus the search for a binding motif in either an RNA molecule or a protein is based only on the sequence and structure information of the RNA or protein, respectively. Two main types of protein–RNA interactions are known (125, 130): (1) interactions with the backbone of double-stranded RNA molecules and (2) interactions of single-stranded RNA bases that are accommodated in the protein
binding pockets. While little knowledge is available to identify interactions concerning the backbone, different approaches exist to predict binding interactions with single-stranded RNA bases (131). Many RNA-binding proteins contain domains, often in multiple copies, that bind specifically to single-stranded RNA. Examples of RNA-binding domains in proteins are the K homology domain (KH), the RNA recognition motif (RRM), the pumilio homology domain (PUF-HD), and the zinc-finger binding domains (131–136). RNA molecules interacting with protein domains contain sequence-specific motifs with a general structural property, such as being located within single-stranded regions of an arbitrary secondary structure, as for example the trans-activation response element (137) and the U1A polyadenylation inhibition element (138). Software tools designed to predict RNA-binding domains in proteins can be separated into approaches incorporating knowledge about the tertiary structure of the protein (125) and approaches that identify, from the amino acid sequence of the protein of interest alone, those residues that may be located at the RNA-interaction interface, while both the structure of the protein and the sequence of the target RNA are unknown. Examples of the latter approach are RNAProB (121), BindN (122), RNABindR (123), PPRint (120), and PRINTR (124) (see Note 5). While most research focuses on predicting RNA-binding domains in proteins, only little work has been done on predicting protein-binding motifs in RNA sequences, with the exception of the work by Westhof and colleagues (139–141). Mostly, standard sequence motif search tools like MEME or the Gibbs sampler (118, 142, 143) are utilized. The only tool that also includes information about the secondary structure of the RNA is MEMERIS (119) (see Note 5). It guides the motif search in RNA sequences towards single-stranded regions by considering information about paired and unpaired regions as a requirement for putative start positions of reasonable binding motifs.
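The PSSM-window idea behind several of these predictors can be sketched as follows; this is a schematic illustration, not an implementation of PPRint or RNAProB. Each residue is represented by the PSSM rows of a window around it, and an SVM is trained on residues with known binding labels. Here the PSSM and the labels are random placeholders, whereas in practice they would come from PSI-BLAST profiles and annotated protein–RNA complexes.

```python
# Schematic PSSM-window classifier for RNA-binding residue prediction.
import numpy as np
from sklearn.svm import SVC

def window_features(pssm, w=3):
    """pssm: (sequence_length x 20) array; returns one feature vector per
    residue, padding with zero rows at the protein termini."""
    padded = np.vstack([np.zeros((w, 20)), pssm, np.zeros((w, 20))])
    return np.array([padded[i:i + 2 * w + 1].ravel()
                     for i in range(len(pssm))])

rng = np.random.default_rng(0)
pssm = rng.normal(size=(50, 20))       # placeholder PSSM for one protein
labels = rng.integers(0, 2, size=50)   # placeholder binding annotations

clf = SVC(kernel="rbf").fit(window_features(pssm), labels)
print(clf.predict(window_features(pssm))[:10])
```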
4. Notes

1. RNA secondary structure. RNA molecules fall into two different groups: one group consists of RNAs whose function within a cell is mainly defined by features of their secondary structure, which provides a scaffold for their tertiary structure, and a second group of RNA molecules that are functional while being unstructured. At the current state of knowledge, it is not clear whether the majority of functional RNA molecules are structured or unstructured. We focus on structured
RNAs because they contain several structural patterns that can be recognized and classified by computational methods, whereas computational annotation and classification of unstructured RNAs remains an open problem. The sequence of an RNA molecule is composed of two purines, adenine (A) and guanine (G), and two pyrimidines, cytosine (C) and uracil (U). The secondary structure of structured RNA molecules is determined by base pairs, which interact with each other via hydrogen bonds, and by different loop types, i.e., single-stranded regions. Possible base pairs are the two Watson–Crick base pairs G-C and A-U and the common wobble base pair G-U, where G-C is more stable than A-U and G-U. The type of a loop is defined by its number of closing base pairs. A hairpin loop interrupts the reverse complementary sequences of the same base pairing region and thus has just one closing base pair. An internal loop disconnects two consecutive base pairing regions and thus has two closing base pairs. A special case of an internal loop is a bulge, where the single-stranded region is restricted to one of the two reverse complementary sequences of a base pairing region. A multi-loop is a structural motif consisting of at least three closing base pairs.
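The pairing rules of this note translate directly into a small validation helper; the sketch below parses a dot-bracket string and verifies that every annotated pair is a Watson–Crick or G-U wobble pair.

```python
# Validate a dot-bracket secondary structure against the allowed base pairs.
VALID = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def check_structure(seq, db):
    stack = []
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            j = stack.pop()
            if (seq[j], seq[i]) not in VALID:
                raise ValueError(f"invalid pair {seq[j]}-{seq[i]} at ({j},{i})")
    if stack:
        raise ValueError("unbalanced structure")
    return True

# hairpin loop example: one base pairing region closing a loop of 4 nt
print(check_structure("GGGAAAACCC", "(((....)))"))  # -> True
```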
2. Databases of known regulatory ncRNAs.
● Rfam (http://rfam.sanger.ac.uk) is a collection of all known RNA families providing multiple alignments, consensus secondary structure information, and an evolutionary tree for each RNA family.
● miRBase (http://www.mirbase.org) is a collection of published miRNA sequences providing information about genomic location, hairpin sequence, and mature miRNA sequence.
● NONCODE (http://www.noncode.org) is an integrated knowledge database, which classifies ncRNAs based on cellular process, function, and association with diseases.
● RNAdb (http://research.imb.uq.edu.au/rnadb) is a comprehensive collection of mammalian ncRNAs containing their sequences and annotation.
● snoRNA–LBME-db (http://www-snorna.biotoul.fr) is a database of human H/ACA and C/D box snoRNAs (small nucleolar RNAs).
● snoRNAdb (http://lowelab.ucsc.edu/snoRNAdb) is a collection of predicted snoRNAs.
3. Genome browsers.
● UCSC (http://genome.ucsc.edu) is one of the two standard genome browsers providing comprehensive annotation tracks for many species for which the genomic DNA is sequenced. A big advantage of the UCSC browser is the possibility to upload user tracks for parallel viewing with standard UCSC annotation tracks.
● The UCSC mirror for ncRNAs (http://www.ncrna.org/glocal/cgi-bin/hgGateway) focuses on annotation tracks for the non-coding part of the genomes of a variety of species.
● Ensembl (http://www.ensembl.org/index.html) is the second standard genome browser and provides comprehensive annotation tracks for a variety of sequenced species.
4. Useful web services.
● The Vienna RNA server (http://rna.tbi.univie.ac.at) contains a collection of online versions of RNA software tools, mainly from the Vienna RNA package. Available tools range from secondary structure prediction from single sequences (RNAfold) and alignments (RNAalifold and LocARNA) to RNA interaction prediction (RNAup) and prediction of whether a multiple alignment may contain a conserved secondary structure (RNAz).
● Further web services for common RNA software tools are: the RNA Studio (http://bibiserv.techfak.uni-bielefeld.de/bibi/Tools_RNA_Studio.html), the software repository by Sean Eddy and coworkers (http://selab.janelia.org/software.html), the Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA), the software repository by Rolf Backofen and coworkers (http://www.bioinf.uni-freiburg.de/Software), and the software repository by Peter F. Stadler and coworkers (http://www.bioinf.uni-leipzig.de/software.html).
● miRecords (http://mirecords.biolead.org) is a comprehensive resource for animal miRNA target interactions.
● SCOR (http://scor.berkeley.edu) is a database of 3D structural classification of ncRNAs.
● ncRNA web repository (http://www.ncrna.org) is a web page collecting bioinformatics tools and databases specialized for functional RNAs.
5. Software tools.

5.1. Overview of the most common software tools. Table 1 collects the most common software tools that address questions concerning RNA identification and annotation. Each software tool is classified according to the problem it addresses. In addition, the most important features as well as tips are provided.
5.2. Software links in alphabetical order. In the following, the URLs of all described software tools are listed insofar as they are available online.

BindN: http://bioinfo.ggc.org/bindn
blast: http://blast.ncbi.nlm.nih.gov/Blast.cgi
ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
CMfinder: http://bio.cs.washington.edu/yzizhen/CMfinder
Consan: http://selab.janelia.org/software/consan
DIANA MicroT: http://www.diana.pcbi.upenn.edu/cgi-bin/micro_t.cgi
Dynalign: http://rna.urmc.rochester.edu/dynalign.html
ERPIN: http://tagc.univ-mrs.fr/erpin
EvoFold: http://users.soe.ucsc.edu/~jsp/EvoFold
ExpaRNA: http://www.bioinf.uni-freiburg.de/Software/index.html?en
FastR: http://cseweb.ucsd.edu/~vbafna
FOLDALIGN(M): http://foldalign.ku.dk/software/index.html
fragrep: http://www.bioinf.uni-leipzig.de/Software/fragrep
Gibbs sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
GotohScan: http://www.bioinf.uni-leipzig.de/Software/GotohScan
HyPaL: http://www.zbh.uni-hamburg.de/research/GI/software.php
Infernal: http://infernal.janelia.org
IntaRNA: http://www.bioinf.uni-freiburg.de/Software
LocARNA: http://rna.tbi.univie.ac.at/cgi-bin/LocARNA.cgi
MEME: http://meme.sdsc.edu/meme4_3_0/fimo-intro.html
MEMERIS: http://www.bioinf.uni-freiburg.de/~hiller/MEMERIS
MicroInspector: http://bioinfo.uni-plovdiv.bg/microinspector
miRanda (Microcosm): http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5
miRtarget2: http://mirdb.org/miRDB
miTarget: http://cbit.snu.ac.kr/~miTarget
NbmiRTar: http://wotan.wistar.upenn.edu/NBmiRTar/login.php
NcDNAlign: http://www.bioinf.uni-leipzig.de/Software/NcDNAlign
PairFold: http://www.rnasoft.ca/cgi-bin/RNAsoft/PairFold/pairfold.pl
PETfold: http://genome.ku.dk/resources/petfold
Pfold: http://www.daimi.au.dk/~compbio/pfold
PicTar: http://pictar.mdc-berlin.de
PITA: http://genie.weizmann.ac.il/pubs/mir07/mir07_prediction.html
PMcomp: http://www.tbi.univie.ac.at/RNA/PMcomp
PPRint: http://www.imtech.res.in/raghava/pprint
PRINTR: http://210.42.106.80/printr
Probalign: http://probalign.njit.edu/standalone.html
QRNA: http://nbc11.biologie.uni-kl.de/framed/left/menu/auto/right/qrna
RaveNnA: http://bliss.biology.yale.edu/~zasha/ravenna
rna22: http://cbcsrv.watson.ibm.com/rna22.html
RNAalifold: http://rna.tbi.univie.ac.at/cgi-bin/RNAalifold.cgi
RNABindR: http://bindr.gdcb.iastate.edu/RNABindR
RNAbob: ftp://selab.janelia.org/pub/software/rnabob
RNAcofold: http://rna.tbi.univie.ac.at/cgi-bin/RNAcofold.cgi
RNAduplex: http://www.tbi.univie.ac.at/RNA/RNAduplex.html
RNAfold: http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
RNAhybrid: http://bibiserv.techfak.uni-bielefeld.de/rnahybrid
RNAmicro: http://www.tbi.univie.ac.at/~jana/software/RNAmicro.html
RNAMotif: http://casegroup.rutgers.edu/
RNAplex: http://www.tbi.univie.ac.at/~htafer
RNAstrand: http://rna.tbi.univie.ac.at/cgi-bin/RNAstrand.cgi
RNAup: http://www.tbi.univie.ac.at/~ulim/RNAup
RNAz: http://rna.tbi.univie.ac.at/cgi-bin/RNAz.cgi and http://www.tbi.univie.ac.at/~wash/RNAz
RSEARCH: ftp://selab.janelia.org/pub/software/rsearch
SimulFold: http://people.cs.ubc.ca/~irmtraud/simulfold
SnoGPS: http://lowelab.ucsc.edu/snoGPS
SnoReport: http://www.bioinf.uni-leipzig.de/~jana/index.php/jana-hortel-software/64janahortel-SnoReport
SnoScan: http://lowelab.ucsc.edu/snoscan
SSEARCH: http://www.biology.wustl.edu/gcg/ssearch.html
Stemloc: http://biowiki.org/StemLoc
TargetScan/TargetScanS: http://www.targetscan.org and http://genes.mit.edu/tscan/targetscanS2005.html
tRNAscan-SE: http://lowelab.ucsc.edu/tRNAscan-SE
UNAFold: http://dinamelt.bioinfo.rpi.edu/unafold
References

1. The ENCODE Project Consortium. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799–816. 2. The ENCODE Project Consortium. (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306, 636–40. 3. Maeda N, Kasukawa T, Oyama R, Gough J, Frith M, Engström PG, et al. (2006) Transcript annotation in FANTOM3: mouse gene catalog based on physical cDNAs. PLoS Genet 2, e62. 4. Kapranov P, Willingham AT, and Gingeras TR. (2007) Genome-wide transcription and the implications for genomic organization. Nat Rev Genet 8, 413–23. 5. Mattick JS. (2004) The hidden genetic program of complex organisms. Sci Am 291, 60–7. 6. Mattick JS, and Makunin IV. (2006) Noncoding RNA. Hum Mol Genet 15(Spec No 1), R17–29. 7. Prasanth KV, and Spector DL. (2007) Eukaryotic regulatory RNAs: an answer to the ‘genome complexity’ conundrum. Genes Dev 21, 11–42. 8. Bartel DP. (2004) MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116, 281–97.
9. Wightman B, Ha I, and Ruvkun G. (1993) Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75, 855–62. 10. Lee RC, Feinbaum RL, and Ambros V. (1993) The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–54. 11. Pasquinelli AE, Reinhart BJ, Slack F, Martindale MQ, Kuroda MI, Maller B, et al. (2000) Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature 408, 86–9. 12. Kato M, and Slack FJ. (2008) MicroRNAs: small molecules with big roles – C. elegans to human cancer. Biol Cell 100, 71–81. 13. Seto AG, Kingston RE, and Lau NC. (2007) The coming of age for Piwi proteins. Mol Cell 26, 603–9. 14. Senner CE, and Brockdorff N. (2009) Xist gene regulation at the onset of X inactivation. Curr Opin Genet Dev 19, 122–6. 15. Martianov I, Ramadass A, Barros AS, Chow N, and Akoulitchev A. (2007) Repression of the human dihydrofolate reductase gene by a non-coding interfering transcript. Nature 445, 666–70.
16. Beltran M, Puig I, Peña C, García JM, Alvarez AB, Peña R, et al. (2008) A natural antisense transcript regulates Zeb2/Sip1 gene expression during Snail1-induced epithelial-mesenchymal transition. Genes Dev 22, 756–69. 17. Mercer TR, Dinger ME, and Mattick JS. (2009) Long non-coding RNAs: insights into functions. Nat Rev Genet 10, 155–9. 18. Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, and Sampath R. (2001) RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res 29, 4724–35. 19. Tinoco I, and Bustamante C. (1999) How RNA folds. J Mol Biol 293, 271–81. 20. Le SV, Chen JH, Currey KM, and Maizel JV. (1998) A program for predicting significant RNA secondary structures. Comput Appl Biosci 4, 153–9. 21. Rivas E, and Eddy SR. (2000) Secondary structure alone is generally not statistically significant for the detection of non-coding RNAs. Bioinformatics 16, 583–605. 22. Rivas E, and Eddy SR. (2001) Non-coding RNA gene detection using comparative sequence analysis. BMC Bioinformatics 2, 8. 23. Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander ES, et al. (2006) Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol 2, e33. 24. Washietl S, Hofacker IL, and Stadler PF. (2005) Fast and reliable prediction of non-coding RNAs. Proc Natl Acad Sci USA 102, 2454–9. 25. Washietl S, Hofacker IL, Lukasser M, Hüttenhofer A, and Stadler PF. (2005) Mapping of conserved RNA secondary structures predicts thousands of functional non-coding RNAs in the human genome. Nat Biotechnol 23, 1383–90. 26. Missal K, Rose D, and Stadler PF. (2005) Non-coding RNAs in Ciona intestinalis. Bioinformatics 21 Suppl 2, ii77–8. 27. Missal K, Zhu X, Rose D, Deng W, Skogerbø G, Chen R, et al. (2006) Prediction of structured non-coding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. J Exp Zool B Mol Dev Evol 306, 379–92. 28. Rose D, Hackermüller J, Washietl S, Reiche K, Hertel J, Findeiss S, et al. (2007) Computational RNomics of drosophilids. BMC Genomics 8, 406. 29. Steigele S, Huber W, Stocsits C, Stadler PF, and Nieselt K. (2007) Comparative analysis
of structured RNAs in S. cerevisiae indicates a multitude of different functions. BMC Biol 5, 25. 30. Mourier T, Carret C, Kyes S, Christodoulou Z, Gardner PP, Jeffares DC, et al. (2008) Genome-wide discovery and verification of novel structured RNAs in Plasmodium falciparum. Genome Res 18, 281–92. 31. Rose D, Jöris J, Hackermüller J, Reiche K, Li Q, and Stadler PF. (2008) Duplicated RNA genes in teleost fish genomes. J Bioinform Comput Biol 6, 1157–75. 32. Gesell T, and Washietl S. (2008) Dinucleotide controlled null models for comparative RNA gene prediction. BMC Bioinformatics 9, 248. 33. Gruber A, Findeiss S, Washietl S, Hofacker I, and Stadler P. (2010) RNAz 2.0: improved non-coding RNA detection. Pac Symp Biocomput 15, 69–79. 34. Gruber AR, Neuböck R, Hofacker IL, and Washietl S. (2007) The RNAz web server: prediction of thermodynamically stable and evolutionarily conserved RNA structures Nucleic Acids Res 35, W335–8. 35. Washietl S. (2007) Prediction of structural non-coding RNAs with RNAz. Methods Mol Biol 395, 503–26. 36. Washietl S, and Hofacker IL. (2007) Identifying structural non-coding RNAs using RNAz. Curr Protoc Bioinformatics 12, Unit 12.7. 37. Rose D, Hertel J, Reiche K, Stadler PF, and Hackermüller J. (2008) NcDNAlign: plausible multiple alignments of non-protein-coding genomic sequences. Genomics 92, 65–74. 38. Sankoff D. (1985) Simultaneous solution of the RNA folding, alignment, and protosequence problems. SIAM J Appl Math 45, 810–25. 39. Uzilov AV, Keegan JM, and Mathews DH. (2006) Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics 7, 173. 40. Mathews DH, and Turner DH. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol 317, 191–203. 41. Torarinsson E, Sawera M, Havgaard JH, Fredholm M, and Gorodkin J. (2006) Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure. Genome Res 16, 885–9. 42. Gorodkin J, Heyer LJ, and Stormo GD. (1997) Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res 25, 3724–32.
43. Washietl S, Pedersen JS, Korbel JO, Stocsits C, Gruber AR, Hackermüller J, et al. (2007) Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 17, 852–64. 44. Hiller M, Findeiss S, Lein S, Marz M, Nickel C, Rose D, et al. (2009) Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Res 19, 1289–300. 45. Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SPA, et al. (2002) Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–9. 46. Schadt EE, Edwards SW, GuhaThakurta D, Holder D, Ying L, Svetnik V, et al. (2004) A comprehensive transcript index of the human genome generated using microarrays and computational approaches. Genome Biol 5, R73. 47. Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, et al. (2004) Global identification of human transcribed sequences with genome tiling arrays. Science 306, 2242–6. 48. Stolc V, Samanta MP, Tongprasit W, Sethi H, Liang S, Nelson DC, et al. (2005) Identification of transcribed sequences in Arabidopsis thaliana by using high-resolution genome tiling arrays. Proc Natl Acad Sci USA 102, 4453–8. 49. Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, et al. (2004) Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res 14, 331–42. 50. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, et al. (2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316, 1484–8. 51. Berezikov E, Thuemmler F, van Laake LW, Kondova I, Bontrop R, Cuppen E, et al. (2006) Diversity of microRNAs in human and chimpanzee brain. Nat Genet 38, 1375–7. 52. Kawai J, Shinagawa A, Shibata K, Yoshino M, Itoh M, Ishii Y, et al. (2001) Functional annotation of a full-length mouse cDNA collection. Nature 409, 685–90. 53. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–73. 54. Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, et al. (2005) The
transcriptional landscape of the mammalian genome. Science 309, 1559–63. 55. The FANTOM Consortium, Suzuki H, Forrest ARR, van Nimwegen E, Daub CO, Balwierz PJ, et al. (2009) The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line. Nat Genet 41, 553–62. 56. Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. (2009) Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–7. 57. Athanasius F Bompfünewerer Consortium, Backofen R, Bernhart SH, Flamm C, Fried C, Fritzsch G, et al. (2007) RNAs everywhere: genome-wide annotation of structured RNAs. J Exp Zool B Mol Dev Evol 308, 1–25. 58. Altschul SF, Gish W, Miller W, Myers EW, and Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol 215, 403–10. 59. Smith TF, and Waterman MS. (1981) Identification of common molecular subsequences. J Mol Biol 147, 195–7. 60. Roshan U, and Livesay DR. (2006) Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 22, 2715–21. 61. Roshan U, Chikkagoudar S, and Livesay DR. (2008) Searching for evolutionary distant RNA homologs within genomic sequences using partition function posterior probabilities. BMC Bioinformatics 9, 61. 62. Hertel J, de Jong D, Marz M, Rose D, Tafer H, Tanzer A, et al. (2009) Non-coding RNA annotation of the genome of Trichoplax adhaerens. Nucleic Acids Res 37, 1602–15. 63. Mosig A, Sameith K, and Stadler P. (2006) Fragrep: an efficient search tool for fragmented patterns in genomic sequences. Genom Proteom Bioinf 4, 56–60. 64. Gardner PP, Wilm A, and Washietl S. (2005) A benchmark of multiple sequence alignment programs upon structural RNAs. Nucleic Acids Res 33, 2433–9. 65. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, et al. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31, 3497–500. 66. Markham NR, and Zuker M. (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 453, 3–31. 67. Hofacker IL. (2007) RNA consensus structure prediction with RNAalifold. Methods Mol Biol 395, 527–44.
68. Hofacker IL. (2009) RNA secondary structure analysis using the Vienna RNA package. Curr Protoc Bioinf 12, Unit12.2. 69. Torarinsson E, Havgaard JH, and Gorodkin J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics 23, 926–32. 70. McCaskill JS. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105–19. 71. Hofacker IL, Bernhart SHF, and Stadler PF. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics 20, 2222–7. 72. Will S, Reiche K, Hofacker IL, Stadler PF, and Backofen R. (2007) Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 3, e65. 73. Knudsen B, and Hein J. (1999) RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 15, 446–54. 74. Knudsen B, and Hein J. (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res 31, 3423–8. 75. Seemann SE, Gorodkin J, and Backofen R. (2008) Unifying evolutionary and thermodynamic information for RNA folding of multiple alignments. Nucleic Acids Res 36, 6355–62. 76. Dowell RD, and Eddy SR. (2006) Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics 7, 400. 77. Meyer IM, and Miklós I. (2007) SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework. PLoS Comput Biol 3, e149. 78. Lowe TM, and Eddy SR. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955–64. 79. Hertel J, and Stadler PF. (2006) Hairpins in a Haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics 22, e197–202. 80. Hertel J, Hofacker IL, and Stadler PF. (2008) SnoReport: computational identification of snoRNAs with unknown targets. Bioinformatics 24, 158–64. 81. Lowe TM, and Eddy SR. (1999) A computational screen for methylation guide snoRNAs in yeast. Science 283, 1168–71.
82. Schattner P, Decatur WA, Davis CA, Ares M, Fournier MJ, and Lowe TM. (2004) Genome-wide searching for pseudouridylation guide snoRNAs: analysis of the Saccharomyces cerevisiae genome. Nucleic Acids Res 32, 4281–96. 83. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, and Ussery DW. (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35, 3100–8. 84. Nawrocki EP, Kolbe DL, and Eddy SR. (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–7. 85. Yao Z, Weinberg Z, and Ruzzo WL. (2006) CMfinder – a covariance model based RNA motif finding algorithm. Bioinformatics 22, 445–52. 86. Weinberg Z, and Ruzzo WL. (2004) Exploiting conserved structure for faster annotation of non-coding RNAs without loss of accuracy. Bioinformatics 1, i334–41. 87. Klein RJ, and Eddy SR. (2003) RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics 4, 44. 88. Bafna V, and Zhang S. (2004) FastR: fast database search tool for non-coding RNA. Proc IEEE Comput Syst Bioinform Conf p. 52–61. 89. Lambert A, Fontaine JF, Legendre M, Leclerc F, Permal E, Major F, et al. (2004) The ERPIN server: an interface to profile-based RNA motif identification. Nucleic Acids Res 32, W160–5. 90. Gautheret D, and Lambert A. (2001) Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol 313, 1003–11. 91. Mosig A, Zhu L, and Stadler PF. (2009) Customized strategies for discovering distant ncRNA homologs. Brief Funct Gen Proteom 8, 451–60. 92. Heyne S, Will S, Beckstette M, and Backofen R. (2009) Lightweight comparison of RNAs based on exact sequence-structure matches. Bioinformatics 25, 2095–102. 93. Eddy SR. RNABOB: a program to search for RNA secondary structure motifs in sequence databases. http://selab.janelia.org/pub/software/rnabob. 94. Gräf S, Strothmann D, Kurtz S, and Steger G. (2001) HyPaLib: a database of RNAs and RNA structural elements defined by hybrid patterns. Nucleic Acids Res 29, 196–8. 95. Rehmsmeier M, Steffen P, Hochsmann M, and Giegerich R. (2004) Fast and effective
prediction of microRNA/target duplexes. RNA 10, 1507–17. 96. Krüger J, and Rehmsmeier M. (2006) RNAhybrid: microRNA target prediction easy, fast and flexible. Nucleic Acids Res 34, W451–4. 97. Enright AJ, John B, Gaul U, Tuschl T, Sander C, and Marks DS. (2003) MicroRNA targets in Drosophila. Genome Biol 5, R1. 98. John B, Enright AJ, Aravin A, Tuschl T, Sander C, and Marks DS. (2004) Human MicroRNA targets. PLoS Biol 2, e363. 99. Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, et al. (2005) Combinatorial microRNA target predictions. Nat Genet 37, 495–500. 100. Kertesz M, Iovino N, Unnerstall U, Gaul U, and Segal E. (2007) The role of site accessibility in microRNA target recognition. Nat Genet 39, 1278–84. 101. Lewis BP, Hung Shih I, Jones-Rhoades MW, Bartel DP, and Burge CB. (2003) Prediction of mammalian microRNA targets. Cell 115, 787–98. 102. Lewis BP, Burge CB, and Bartel DP. (2005) Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120, 15–20. 103. Kiriakidou M, Nelson PT, Kouranov A, Fitziev P, Bouyioukos C, Mourelatos Z, et al. (2004) A combined computational-experimental approach predicts human microRNA targets. Genes Dev 18, 1165–78. 104. Rusinov V, Baev V, Minkov IN, and Tabler M. (2005) MicroInspector: a web tool for detection of miRNA binding sites in an RNA sequence. Nucleic Acids Res 33, W696–700. 105. Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, et al. (2006) A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203–17. 106. Wang X, and Naqa IME. (2008) Prediction of both conserved and nonconserved microRNA targets in animals. Bioinformatics 24, 325–32. 107. Kim SK, Nam JW, Rhee JK, Lee WJ, and Zhang BT. (2006) miTarget: microRNA target gene prediction using a support vector machine. BMC Bioinformatics 7, 411. 108. Yousef M, Jung S, Kossenkov AV, Showe LC, and Showe MK. (2007) Naïve Bayes for microRNA target predictions–machine learning for microRNA targets. Bioinformatics 23, 2987–92.
109. Dimitrov RA, and Zuker M. (2004) Prediction of hybridization and melting for double-stranded nucleic acids. Biophys J 87, 215–26. 110. Andronescu M, Zhang ZC, and Condon A. (2005) Secondary structure prediction of interacting RNA molecules. J Mol Biol 345, 987–1001. 111. Bernhart SH, Tafer H, Mückstein U, Flamm C, Stadler PF, and Hofacker IL. (2006) Partition function and base pairing probabilities of RNA heterodimers. Algorithms Mol Biol 1, 3. 112. Tafer H, and Hofacker IL. (2008) RNAplex: a fast tool for RNA-RNA interaction search. Bioinformatics 24, 2657–63. 113. Mückstein U, Tafer H, Hackermüller J, Bernhart SH, Stadler PF, and Hofacker IL. (2006) Thermodynamics of RNA-RNA binding. Bioinformatics 22, 1177–82. 114. Busch A, Richter AS, and Backofen R. (2008) IntaRNA: efficient prediction of bacterial sRNA targets incorporating target site accessibility and seed regions. Bioinformatics 24, 2849–56. 115. Richter AS, Schleberger C, Backofen R, and Steglich C. (2010) Seed-based INTARNA prediction combined with GFP-reporter system identifies mRNA targets of the small RNA Yfr. Bioinformatics 26, 1–5. 116. Huang FWD, Qin J, Reidys CM, and Stadler PF. (2009) Partition function and base pairing probabilities for RNA-RNA interaction prediction. Bioinformatics 25, 2646–54. 117. Chitsaz H, Salari R, Sahinalp SC, and Backofen R. (2009) A partition function algorithm for interacting nucleic acid strands. Bioinformatics 25, i365–73. 118. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37, W202–8. 119. Hiller M, Pudimat R, Busch A, and Backofen R. (2006) Using RNA secondary structures to guide sequence motif finding towards single-stranded regions. Nucleic Acids Res 34, e117. 120. Kumar M, Gromiha MM, Raghava GPS. (2008) Prediction of RNA binding sites in a protein using SVM and PSSM profile. Proteins 71, 189–94. 121. Cheng CW, Su ECY, Hwang JK, Sung TY, and Hsu WL. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics 9(12), S6.
122. Wang L, and Brown SJ. (2006) BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res 34, W243–8. 123. Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, and Dobbs D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA 12, 1450–62. 124. Wang Y, Xue Z, Shen G, and Xu J. (2008) PRINTR: prediction of RNA binding sites in proteins using SVM and profiles. Amino Acids 35, 295–302. 125. Shulman-Peleg A, Shatsky M, Nussinov R, and Wolfson HJ. (2008) Prediction of interacting single-stranded RNA bases by protein-binding patterns. J Mol Biol 379, 299–316. 126. Bernhart SH, Hofacker IL, Will S, Gruber AR, and Stadler PF. (2008) RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9, 474. 127. Freyhult EK, Bollback JP, and Gardner PP. (2007) Exploring genomic dark matter: a critical assessment of the performance of homology search methods on non-coding RNA. Genome Res 17, 117–25. 128. Kaczkowski B, Torarinsson E, Reiche K, Havgaard JH, Stadler PF, and Gorodkin J. (2009) Structural profiles of human miRNA families from pairwise clustering. Bioinformatics 25, 291–4. 129. Alkan C, Karakoç E, Nadeau JH, Sahinalp SC, and Zhang K. (2006) RNA-RNA interaction prediction and antisense RNA target search. J Comput Biol 13, 267–82. 130. Draper DE. (1999) Themes in RNA-protein recognition. J Mol Biol 293, 255–70. 131. Auweter SD, Oberstrass FC, and Allain FHT. (2006) Sequence-specific binding of single-stranded RNA: is there a code for recognition? Nucleic Acids Res 34, 4943–59. 132. Messias AC, and Sattler M. (2004) Structural basis of single-stranded RNA recognition. Acc Chem Res 37, 279–87. 133. Hall KB, and Stump WT. (1992) Interaction of N-terminal domain of U1A protein with
an RNA stem/loop. Nucleic Acids Res 20, 4283–90. 134. Spassov DS, and Jurecic R. (2003) The PUF family of RNA-binding proteins: does evolutionarily conserved structure equal conserved function? IUBMB Life 55, 359–66. 135. de Moor CH, Meijer H, and Lissenden S. (2005) Mechanisms of translational control by the 3′ UTR in development and differentiation. Semin Cell Dev Biol 16, 49–58. 136. Hudson BP, Martinez-Yamout MA, Dyson HJ, and Wright PE. (2004) Recognition of the mRNA AU-rich element by the zinc finger domain of TIS11d. Nat Struct Mol Biol 11, 257–64. 137. Kulinski T, Olejniczak M, Huthoff H, Bielecki L, Pachulska-Wieczorek K, Das AT, et al. (2003) The apical loop of the HIV-1 TAR RNA hairpin is stabilized by a cross-loop base pair. J Biol Chem 278, 38892–901. 138. Clerte C, and Hall KB. (2004) Global and local dynamics of the U1A polyadenylation inhibition element (PIE) RNA and PIE RNA-U1A complexes. Biochemistry 43, 13404–15. 139. Leontis NB, and Westhof E. (2002) The annotation of RNA motifs. Comp Funct Genomics 3, 518–24. 140. Leontis NB, and Westhof E. (2003) Analysis of RNA motifs. Curr Opin Struct Biol 13, 300–8. 141. Hermann T, and Westhof E. (1999) Non-Watson-Crick base pairs in RNA-protein recognition. Chem Biol 6, R335–43. 142. Thompson W, McCue LA, Lawrence CE. (2005) Using the Gibbs motif sampler to find conserved domains in DNA and protein sequences. Curr Protoc Bioinf 2, Unit 2.8. 143. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–14.
Chapter 15
Bioinformatics for Qualitative and Quantitative Proteomics
Chris Bielow, Clemens Gröpl, Oliver Kohlbacher, and Knut Reinert

Abstract
Mass spectrometry is today a key analytical technique to elucidate the amount and content of proteins expressed in a certain cellular context. The degree of automation in proteomics has yet to reach that of genomic techniques, but even current technologies make a manual inspection of the data infeasible. This article addresses the key algorithmic problems bioinformaticians face when handling modern proteomic samples and presents common solutions to them. We provide examples of how algorithms can be combined to build relatively complex analysis pipelines, point out certain pitfalls and aspects worth considering, and give a list of current state-of-the-art tools.

Key words: Mass spectrometry, MALDI, ESI, HPLC, MS, MS/MS, Tandem MS, Software, Algorithm
1. Introduction

High-throughput Omics techniques are dominated by a rather limited set of analytical methods. For transcriptomics, these techniques are microarrays and next-generation sequencing. In proteomics (and to a large extent also in metabolomics), mass spectrometry, coupled to various separation techniques, is the key analytical tool. A typical proteomics sample contains – depending on its origin – hundreds to tens of thousands of different proteins. Hence, it is necessary to reduce the complexity of the sample by separation techniques and then to conduct one or more mass spectrometric analyses. Common separation techniques are high-performance liquid chromatography (HPLC), capillary electrophoresis (CE), and two-dimensional gel electrophoresis (2D PAGE). Techniques like HPLC or CE have certain advantages over 2D PAGE for high-throughput analysis because they can be
readily automated. HPLC, on the other hand, has difficulties separating intact proteins, so, in general, a shotgun proteomics approach is applied: proteins are enzymatically digested into peptides, which are then easily separated using HPLC. Peptides elute off the HPLC column at different retention times due to their interaction with the stationary phase of the column. The eluting peptides are then typically spotted onto a MALDI target (LC-MALDI-MS) or continuously injected into a mass spectrometer using electrospray ionization (LC-ESI-MS). In matrix-assisted laser desorption/ionization (MALDI)-MS, the ionization of the peptides is achieved by evaporation of the analyte together with a matrix. The analyte is spotted onto a metal target using an organic matrix. Matrix and analyte are dried on the target, and this solid is then evaporated using laser shots. The matrix is chosen to have an absorption maximum at the laser wavelength. It is thus rapidly heated by the laser shots and evaporates into the gas phase together with the analyte. The matrix then transfers protons to the analyte in a gentle ionization process, which is very well suited to peptides. While LC-MALDI-MS requires the spotting and handling of targets, LC-ESI-MS directly couples the HPLC column to the mass spectrometer. Electrospray ionization (ESI) ionizes the analytes at atmospheric pressure by applying a strong electrostatic field at a very fine tip (typically 3–6 kV). The analyte is dissolved in the mobile phase and is dispersed into very fine, highly charged droplets. The fine spray then enters the mass spectrometer, where the solvent evaporates and, during this process, ionizes the analyte. Inside the mass spectrometer, the mass-to-charge ratio of the ionized analyte is measured using a mass analyzer. There are various types of mass analyzers like quadrupole analyzers, quadrupole ion trap analyzers, time-of-flight analyzers, ion cyclotron resonance and Fourier transform analyzers, orbitraps, etc. We now briefly describe the time-of-flight (TOF) analyzer. Here, the ionized analyte with charge z is accelerated by an electrostatic field and travels through a field-free chamber until it hits the detector. The detector measures the flight time t of the ion, from which the mass-to-charge ratio follows (for an acceleration voltage U and a flight path of length L, m/z = 2eU t^2/L^2, i.e., the flight time is proportional to the square root of m/z). Other mass analyzers use different physical principles to derive the m/z of an ion. Nevertheless, they all yield the same fundamental data: a spectrum, i.e., an ion count at a certain mass-to-charge ratio. Depending on the instrument, different software solutions are available (see Note 1). Figure 1 shows a typical data set generated from a biological sample using HPLC-MS and illustrates its multidimensional nature. After a possible first reduction of sample complexity by appropriate fractionation techniques, each fraction is further separated using a chromatographic column. The eluting analytes are then continuously injected into the mass spectrometer, which
Fig. 1. Data generated in a typical multidimensional experiment. On the left a three-dimensional map consisting of several scans (up to thousands); on the right a zoomed-in section of a scan. The inset shows the complete scan, which is highlighted on the left.
records mass spectra (scans) at high speed. Stacking individual spectra (on the right-hand side of the figure) yields a three-dimensional data set, a so-called map (depicted on the left). Variations of this setup are widely used in differential analysis of samples of healthy and diseased patients, to investigate changes occurring in time-series experiments, or to explore the effects of perturbations on biological systems. This type of analysis is an essential tool for understanding the molecular foundations of diseases, the discovery of biomarkers, or the identification of potential drug targets. In all those applications, the two main questions of mass spectrometry today are those of identification and quantitation of compounds. In the identification problem, one can either use Peptide Mass Fingerprinting (PMF) or MS/MS (also known as "peptide fragment fingerprinting"). In PMF, which can be used for simple protein mixtures, one tries to identify the masses of all peptides resulting from the digest of the parent protein. For smaller peptides, or more complex samples, MS/MS is used. In the latter, usually a small mass range is chosen in MS mode to isolate a single peptide. This so-called precursor is then subjected to a secondary fragmentation (usually by a collision with gas molecules) and then to another MS measurement (hence the term MS/MS or tandem MS). The fragmentation usually takes place at preferred places on the amino acid backbone, forming pairs of complementary ions (for example, c and z ions, or the more common b and y ions). Ions having the charge on the N-terminal fragment are labeled a, b, c depending on the exact break point, and those from the C-terminus x, y, z. The ions are numbered along the backbone. All ions of a specific type form an ion ladder (see Fig. 2 for an example). The difference between adjacent ions of the
Fig. 2. MS/MS spectrum of the peptide SGFLEEDELK. The inset shows b, y, and a ions. The masses of all b- and y-ions are listed above, but not all are observed in the spectrum.
same type in an ion ladder obviously accounts for the mass of an amino acid residue. Hence, it is straightforward to reconstruct the amino acid sequence of a peptide if the ion ladder is complete. In practice, this is not always the case and hence the problem is difficult, especially since interpretation is further complicated by the presence of different ion types, noise peaks, contaminants, and missing peaks. In quantitation, the key task is to find the signals in a map that belong to the same ion, quantitate them, and assign them to the corresponding signal from another sample for comparison. The signal in the other sample is either in the same map but marked with a label changing its mass to avoid a mixture with the original signal (labeled quantitation), or it is in a different map. In the latter case, the reproducibility of the HPLC and the high mass accuracy of the MS measurements are used to find the corresponding signal in the second map (label-free quantitation). Both problems are solved by means of bioinformatics algorithms, preferably in an automated way. In recent years, it has become evident that the handling and computational analysis of the data are the major bottleneck for biomedical studies in the fields of proteomics and metabolomics. In the next sections, we elaborate on the specifics involved in the analysis of large proteomics data sets. We introduce the algorithmic problems, and then propose specific state-of-the-art methods to solve them.
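To make the ion-ladder arithmetic concrete, the following minimal sketch (ours, not part of the original protocol) computes the singly protonated b- and y-ion ladders for the peptide SGFLEEDELK from Fig. 2, using standard monoisotopic residue masses. Adjacent ions of one series differ by exactly one residue mass, which is the property exploited for sequence read-out.

```python
# Toy b-/y-ion ladders for a peptide (monoisotopic masses, charge 1+).
# b_i = sum of the first i residues + proton;
# y_i = sum of the last i residues + water + proton.
RESIDUE = {  # monoisotopic amino acid residue masses in Da
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'L': 113.08406, 'I': 113.08406,
    'N': 114.04293, 'D': 115.02694, 'Q': 128.05858, 'K': 128.09496,
    'E': 129.04259, 'M': 131.04049, 'H': 137.05891, 'F': 147.06841,
    'R': 156.10111, 'Y': 163.06333, 'W': 186.07931, 'C': 103.00919,
}
PROTON, WATER = 1.007276, 18.010565

def ion_ladders(peptide):
    """Return the singly charged b- and y-ion m/z ladders of a peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y

b_ions, y_ions = ion_ladders("SGFLEEDELK")
# Consecutive b-ions differ by one residue mass: this is what allows
# de novo sequence read-out when the ladder is complete.
print([round(m, 3) for m in b_ions])
print([round(m, 3) for m in y_ions])
```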
2. Materials

A broad range of algorithmic problems in mass spectrometry has been identified and solved during the last two decades. We limit ourselves to a few select ones in this context. Numerous others, e.g., mass calibration and intensity normalization, are important problems that, unfortunately, cannot be dealt with here. We start in this section by defining the problems and give practical details of algorithms for solving them in Subheading 3. The input data itself is usually obtained from in-house instruments, partners, or online databases (see Note 2).

2.1. Baseline Filtering
In MALDI spectra, and to some extent in ESI spectra, a baseline is apparent (Fig. 3). In MALDI spectra, the baseline can become dominant in the low m/z regions and disappears with increasing m/z. It is typically shaped like an exponential decay distribution and can be attributed to matrix material. The baseline leads to poorly resolved peak shapes due to a loss of baseline separation between adjacent peaks. The baseline thus interferes with intensity estimation and has to be removed computationally.
2.2. Noise Filtering
Every mass spectrometer suffers from high-frequency noise (electronic noise, usually attributed to the detector, and chemical noise, usually attributed to solvents, buffers, and contaminants) and, thus, peaks expected to be approximately Gaussian shaped might
Fig. 3. Raw spectrum (black) and baseline filtered spectrum (gray). Note that signal due to overlapping Gaussian peaks at the right is correctly preserved and not attributed to baseline.
not be convex any longer. This is a potential pitfall for algorithms which rely on local minima to separate isotope peaks. A noise filter smoothes the data by removing high-frequency noise.

2.3. Centroiding
Due to the limited resolving power of the mass analyzer, and since the ionization is a stochastic process, even identical ions are not measured at the exact same m/z and the resulting distribution should be processed with two main goals: first, to obtain the correct m/z value of the ion, and second, to obtain an accurate estimate of the number of ions measured (see also Fig. 4 for a series of isotopic peaks). The shape of an individual peak is described by a mathematical model (a normal distribution is a good approximation, but not quite sufficient) whose parameters are adjusted to the observed m/z distribution. The most prominent parameters of a mass peak are:
Fig. 4. Visualization of data processing: (a) raw spectrum, (b) noise filtering applied, and (c) subsequent peak picking on a zoomed-in single MS spectrum. Note that there is an extra centroided peak on the right, which is due to an overlapping isotope envelope from a neighboring peptide.
centroid, intensity, width, and skew. Centroiding reduces the raw measurement data to a handful of parameters for each compound. It shrinks the volume of data by orders of magnitude and extracts the information of interest (see Note 3). The centroid m/z can be reported as the position of the maximal intensity, or by averaging over m/z (raw data points weighted by intensity). Likewise, the intensity of a peak can be read off as the maximum height from the raw data (the peak apex), or one can compute the area under the curve, i.e., the peak volume. The peak apex is a simple and intuitive measure, but relatively susceptible to random measurement errors. The average m/z and area under the curve are generally more stable and accurate measures, but they depend on details of the peak picking procedure, e.g., how the lower and upper margins of the mass range are chosen for integration. The width of a peak can be quantified in statistical terms using the variance or standard deviation of its m/z distribution. Taking a morphological perspective, one can also report the width at a certain percentage of the maximum height. Note that for many mass spectrometers the peak width is smaller for higher charge states, and is therefore an important parameter for charge estimation. Higher-order statistics, such as the skew, can also be interesting. Ideally, the shape of peaks should be symmetric, but systematic deviations exist in practice, due, for example, to the so-called "dead time" of the detector, or if a peak actually represents an entire unresolved isotopic envelope (see below). For asymmetric peaks, the maximum intensity is generally not attained at average m/z.

2.4. Deisotoping and Charge Estimation
Amino acids are made up of elements which exist in different isotopes. For example, about 1.1% of carbon in nature is 13C, and 0.4% of nitrogen is 15N. The signal of a peptide observed by mass spectrometry therefore does not consist of a single peak, but a characteristic isotopic pattern, a series of isotopic peaks separated by about 1 Da mass difference. With modern instruments, these peaks can be resolved. For small peptides (e.g., 700 Da), the monoisotopic peak, which corresponds to the most frequent isotopes, is the strongest one, and the envelope is approximately a Poisson distribution. Larger peptides tend to contain at least a few heavy isotopes, and the envelope becomes more symmetric and bell-shaped. For masses above 5,000 Da, the monoisotopic peak is faint and starts to fade into the background. In this situation, average m/z is more appropriate than m/z at maximum intensity to indicate peak position. In mass spectra of entire proteins, the isotopic peaks cannot be resolved easily (due to the high charge and hence the small distance between peaks), and all we might observe is a single, slightly skewed peak. Knowing the m/z of the monoisotopic peak exactly is a big step toward identification of a peptide. The process of
inferring the monoisotopic mass is called deisotoping. With modern high-precision instruments, even more subtle effects become observable. For example, the mass difference between 15N and 14N is not the same as between 13C and 12C. Sometimes, the presence of sulfur or phosphorus can also be proven this way. During ionization, different charge states can be formed, for example, by attaching a variable number of protons. This is normally the case when electrospray ionization (ESI) is used. Each charge comes with an additional mass, and the result is a characteristic charge ladder in the observed spectrum. For example, the peptide YYGYTGAFR has a monoisotopic weight of M ~ 1,097.2. Assuming the charge adduct is a proton, we observe the monoisotopic peaks for charge states 1, 2, and 3, at 1,098.2 ([M + H]1+/1), 549.6 ([M + 2H]2+/2), and 366.7 ([M + 3H]3+/3). Note that the m/z difference between isotopic peaks is 1/z. This fact is used most commonly to deduce the charge state from an isotope pattern. Yet another indicator for charge state is the peak width (as stated above). The entire process of inferring the original, uncharged mass is called decharging. When analyzing whole proteins (usually in a simple mixture), one can observe extremely high charge states (>100) and usually cannot discern isotopic patterns any longer. This is not the case when analyzing peptides, which have a much lower charge mainly due to a smaller number of basic amino acids, which are targets for protonation.

2.5. Feature Detection
When peptide mass spectrometry is preceded by liquid chromatography fractionation, the observed signal corresponding to a single charge state of a peptide is actually a two-dimensional intensity distribution in retention time and mass-to-charge (see also Fig. 1). The primary task of feature detection is to estimate the parameters of this distribution. Signals of different charge states can be combined later. Features extend over several consecutive spectra. In general, one can assume that the two-dimensional distribution is a product of two independent distributions. For the marginal distribution over m/z, a similar reasoning applies as for individual spectra (see centroiding and deisotoping above). The combination of monoisotopic mass and retention time is fairly unique for each peptide. This can be used for protein identification, extending the PMF approach mentioned above. The region occupied by a feature is where it stands out significantly from the background. It can be modeled as an axis-parallel rectangle, but more generally, a convex hull can also be used. When the isotopic peaks are baseline separated, each one is assigned its own region. We can think of the region as the support of its intensity distribution. A fixed (small) interval of m/z, extending over retention time, is sometimes called a mass trace. Its projection onto retention time is an extracted ion chromatogram or elution profile. Elution profiles
are highly correlated among isotopic peaks and charge variants. Data acquisition on the MS level is sometimes interrupted when MS/MS spectra are taken. Competitive ionization or ion suppression can also lead to notches or even gaps in the elution profile. In general, elution profiles are less regular than m/z peak shapes (e.g., fronting and tailing effects are often observed) and the width can vary from a few seconds to the entire run of an experiment.

2.6. Map Alignment
Larger proteomics studies create several LC-MS(-MS) data sets. In order to compare the same features in two given maps, we first have to match the features between the two maps. This is not immediately possible because both the retention time and the m/z dimension can be distorted in systematic ways and, moreover, both coordinates are afflicted with random errors. The process of correcting systematic errors of the retention time by means of a suitable warping function is called map alignment. On the other hand, the ultimate goal is to establish a table of corresponding features. We refer to this process as feature grouping. This step is equally important because in many cases the uncorrelated (stochastic) error is of about the same size as the systematic error, and in practice it can be hard to define the criteria for what should be considered a match. Peaks or features missing from the table can be filled in if there is sufficient evidence from both the alignment and the raw data. When multiple experiments are compared, e.g., five normal vs. five diseased, a progressive strategy can be used.
2.7. From Peptides to Proteins
Most of the data recorded nowadays stems from shotgun experiments, i.e., proteins are split into peptides before analysis. Peptides are usually identified by MS2 measurements, which are run against a database of theoretical peptides using a search engine, such as Mascot (1) (http://www.matrixscience.com, commercial) or X!Tandem (2) (http://www.thegpm.org/tandem/index.html, free). The primary goal of (differential) quantitation is to obtain meaningful quantitative data at the level of proteins rather than peptides. The process of inferring the set of proteins contained in the sample, i.e., the protein inference problem, is complicated by the fact that peptides can be ambiguous/degenerate, i.e., are shared among two or even more proteins. This also complicates the estimation of their amounts (quantitation).
3. Methods

We now describe which algorithmic solutions exist for the obstacles presented in Subheading 2. Some steps can be computationally demanding, and the time required can vary greatly between software packages (see Note 4).
3.1. Baseline Filter
Algorithms to remove the baseline usually either employ a kind of Fourier transform to estimate low-frequency noise, use a morphological filter (e.g., the top-hat filter (3)), or use a local regression like LOESS (4). The problem with baseline removal is that there is no accepted method of judging the best algorithm (see ref. 5) and usually visual inspection is used. See Fig. 3 for a successful baseline removal example.
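As an illustration of the morphological idea, the sketch below estimates the baseline as a grey-scale opening with a structuring element wider than any genuine peak, then subtracts it (the classical top-hat operation). The window width of 201 points and the synthetic spectrum are arbitrary choices for demonstration, not recommended settings.

```python
import numpy as np
from scipy.ndimage import grey_opening

def tophat_baseline_removal(intensity, window):
    """Morphological top-hat: subtract a baseline estimated by a
    grey-scale opening whose window is wider than any true peak."""
    baseline = grey_opening(intensity, size=window)
    return intensity - baseline, baseline

# Synthetic spectrum: two Gaussian peaks on an exponentially decaying,
# MALDI-like baseline (cf. Subheading 2.1).
x = np.linspace(0, 1000, 2000)
signal = 800 * np.exp(-x / 150)
for mu in (400, 420):
    signal += 300 * np.exp(-0.5 * ((x - mu) / 4) ** 2)
corrected, baseline = tophat_baseline_removal(signal, window=201)
```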
3.2. Noise Filter
The most prominent noise filter is the Savitzky–Golay (SG) filter (6), which approximates the underlying data within a sliding window by a polynomial of higher order and retains the shape and m/z of peak maxima. However, it may change the peak volume. To be computationally effective, SG requires uniformly spaced data. The Gaussian filter is also widely used, and has the advantage that the area under the curve is preserved exactly, although the peak maxima tend to be flattened. Implementations can be found in various software packages, e.g., OpenMS/TOPP, http://www.openms.de (3). Another method is the wavelet transformation filtering (Symmlet8) used, e.g., in the SpecArray program (7). See Fig. 4 for an example of noise filtering with subsequent centroiding, which we will describe next.
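A minimal Savitzky–Golay smoothing example using SciPy is given below; the window length and polynomial order are illustrative values that would have to be tuned to the peak width of the instrument, and, as noted above, the filter assumes uniformly spaced data.

```python
import numpy as np
from scipy.signal import savgol_filter

# Smooth a noisy profile spectrum with a Savitzky-Golay filter.
# The (odd) window should span several raw data points per peak; the
# polynomial order trades noise suppression against peak distortion.
rng = np.random.default_rng(0)
mz = np.linspace(500.0, 501.0, 400)                 # uniformly spaced m/z
profile = np.exp(-0.5 * ((mz - 500.5) / 0.02) ** 2) \
          + rng.normal(0, 0.05, mz.size)            # Gaussian peak + noise
smoothed = savgol_filter(profile, window_length=11, polyorder=3)
```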
3.3. Centroiding
Converting profile data into centroided data, also known as peak picking, is a fundamental problem in mass spectrometry data analysis. We only sketch the main ideas here. Most algorithms start by identifying a region of interest and then adjust the theoretical model for the peak shape to the data. The region of interest must contain a local maximum that is significantly higher than the surrounding baseline. An interesting approach to this problem uses the wavelet transformation, which combines baseline and noise filtering in one step and was shown to estimate the peak position very accurately (8). Once the peak apex has been found, the remaining parameters can be assigned using maximum-likelihood estimators or least-squares fitting, just to name a couple of strategies. In practice, centroiding is mostly performed using the software provided with the mass spectrometers and the details of the algorithms working behind the scenes are often poorly documented. Finding the right parameter settings can be tedious and time consuming, but its importance should not be underestimated (see Note 5). It is always a good idea to have a look at the raw profile data to confirm that peak picking was successful because the process cannot be reproduced once the raw data has been discarded (which might be necessary due to limited storage). Alternatively, such quality control can be done automatically, which in itself is not an easy problem (9).
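The following toy centroider illustrates the two-step scheme sketched above: locate a region of interest around a local maximum, then estimate the peak position as an intensity-weighted mean. The fixed window and plain height threshold stand in for the model fitting and significance testing of real implementations.

```python
import numpy as np
from scipy.signal import find_peaks

def centroid(mz, intensity, min_height, halfwidth=5):
    """Naive peak picking: local maxima above a threshold, each reported
    as the intensity-weighted average m/z (centroid) and the summed
    intensity (a crude area surrogate) over a window around the apex."""
    peaks, _ = find_peaks(intensity, height=min_height)
    out = []
    for p in peaks:
        lo, hi = max(p - halfwidth, 0), min(p + halfwidth + 1, len(mz))
        w = intensity[lo:hi]
        out.append((np.average(mz[lo:hi], weights=w), w.sum()))
    return out  # list of (centroid m/z, intensity)
```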
3.4. Deisotoping and Charge Estimation
Deisotoping can be performed on profile or centroided data. For peptides, the expected isotopic distribution at a given mass can be approximated using the so-called averagine, a fictitious
amino acid obtained by averaging the atomic composition of typical peptides. Charge estimation for peptides is usually done by locally estimating the distance between isotopic peaks. Almost every software package offers such an algorithm (Table 1). Mass is then calculated from the resulting monoisotopic peak and charge. Charge ladders are not addressed by most software packages for peptide analysis (a notable exception is MaxQuant, http://www.maxquant.org (10), which uses charge variants to increase mass resolution by averaging). For protein analysis, charge ladder data is usually the only information used; the algorithms are historically older and usually work directly on raw spectra. Notable algorithms are the ZScore algorithm (11) and THRASH (12).
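Two of the ideas above can be sketched in a few lines: charge estimation from the local spacing of isotopic peaks, and a crude averagine-style isotope envelope based on the Poisson approximation mentioned in Subheading 2.4. The constant of roughly 1,800 Da per expected heavy isotope is a rule-of-thumb assumption of this sketch, not a published parameter of any specific tool.

```python
import math

def charge_from_spacing(isotope_mz):
    """Estimate charge from the median spacing of an isotope series:
    adjacent isotopic peaks are ~1.00235 Da apart on the mass scale,
    hence ~1.00235/z apart on the m/z scale."""
    gaps = sorted(b - a for a, b in zip(isotope_mz, isotope_mz[1:]))
    return round(1.00235 / gaps[len(gaps) // 2])

def averagine_envelope(mono_mass, n_peaks=6):
    """Rough isotope envelope via a Poisson model: the expected number
    of heavy isotopes grows roughly linearly with mass (lambda ~
    mass/1800 is a crude averagine-style constant, assumed here)."""
    lam = mono_mass / 1800.0
    probs = [math.exp(-lam) * lam**k / math.factorial(k)
             for k in range(n_peaks)]
    total = sum(probs)
    return [p / total for p in probs]

print(charge_from_spacing([549.62, 550.12, 550.62, 551.12]))   # -> 2
print([round(p, 3) for p in averagine_envelope(1097.2)])       # mono peak dominates
```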
3.5. Feature Detection

The overall approach to feature detection is similar to that of centroiding: find a region of interest and adjust a theoretical model to it. The first phase can be improved by searching for regions containing an isotopic pattern. In the case of centroided data, this implies that the peaks should align to a grid with 1/z spacing. For profile data, the wavelet approach has been extended to isotopic patterns (13). Other free algorithms for feature detection are msInspect, http://proteomics.fhcrc.org (14) and MZmine, http://mzmine.sourceforge.net (15). When two features overlap, a correlation analysis on the extracted ion chromatograms can reveal which of them actually belong together (16).
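A toy version of the region-of-interest phase is shown below: centroided peaks from consecutive scans are chained into mass traces whenever their m/z values agree within a tolerance. Real feature finders would additionally test the resulting traces against isotope-pattern and elution-profile models.

```python
def extract_mass_traces(scans, tol=0.01, min_length=3):
    """Chain centroided peaks from consecutive scans into mass traces.
    scans: list of (rt, [(mz, intensity), ...]) sorted by retention time.
    A trace is a list of (rt, mz, intensity) tuples whose m/z drifts by
    at most `tol` between consecutive scans."""
    open_traces, finished = [], []
    for rt, peaks in scans:
        unmatched, still_open = list(peaks), []
        for trace in open_traces:
            last_mz = trace[-1][1]
            hit = next((p for p in unmatched
                        if abs(p[0] - last_mz) <= tol), None)
            if hit is not None:
                unmatched.remove(hit)
                trace.append((rt, hit[0], hit[1]))
                still_open.append(trace)
            elif len(trace) >= min_length:   # trace ended; keep if long enough
                finished.append(trace)
        open_traces = still_open + [[(rt, mz, i)] for mz, i in unmatched]
    finished += [t for t in open_traces if len(t) >= min_length]
    return finished
```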
3.6. Map Alignment
The most appropriate type of warping function is a debated topic in map alignment. In simple cases, a shift or affine function often suffices (OpenMS, http://www.OpenMS.de (17)), or the warping step can be skipped altogether. A simple minus-versus-add (MA) plot of the aligned retention times can be used for quality control. If the MA plot shows that the systematic error changes rapidly at certain points of time, e.g., at the start and end of the gradient applied for chromatography, then it is better to choose a spline function (XCMS, http://masspec.scripps.edu/xcms/xcms.php (18)). The risk of overfitting can be minimized by outlier removal (OBIWarp, http://obi-warp.sourceforge.net/ (19)). This is particularly important in regions where the landmarks are scarce. With regard to the grouping aspect, the information available about individual peaks and features can include m/z, charge, intensity, retention time, MS/MS spectra, sequence annotations, etc., depending on the type of experiment. Various combinations of these can be considered to define a criterion that constitutes a “match” in the presence of random errors. During matching, false positives and false negatives must be balanced. In practice, the error rates can be estimated if a set of reliable and trusted matches is available at least for a subset of the data. Such an independent ground truth can be obtained, e.g., using information that is not available for the entire data set (20), or by manual inspection.
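For the simple shift/affine case, map alignment reduces to a least-squares line fit on landmark pairs, as in the hypothetical sketch below; the hard parts, namely choosing reliable landmarks and removing outliers, are omitted.

```python
import numpy as np

def fit_affine_warp(rt_reference, rt_observed):
    """Least-squares affine retention-time warp rt_ref ~ a*rt_obs + b,
    fitted on landmark pairs (e.g., confidently matched features)."""
    A = np.vstack([rt_observed, np.ones_like(rt_observed)]).T
    (a, b), *_ = np.linalg.lstsq(A, rt_reference, rcond=None)
    return a, b

# Hypothetical landmarks: the second run elutes ~2% slower plus a 12 s offset.
ref = np.array([300.0, 600.0, 900.0, 1200.0])
obs = (ref - 12.0) / 1.02
a, b = fit_affine_warp(ref, obs)
corrected = a * obs + b   # observed times mapped onto the reference scale
```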
Table 1
LC-MS programs for labeled and label-free quantitation and identification

| Software      | OS      | Instrument         | Input format        | Distribution (language) | URL                                                               | References |
|---------------|---------|--------------------|---------------------|-------------------------|-------------------------------------------------------------------|------------|
| Corra         | L       | High-res           | mzXML               | Open source (Perl, R)   | http://sourceforge.net/projects/corra/                            | (29)       |
| OpenMS (TOPP) | L, O, W | All                | mzML, mzXML, mzData | Open source (C++)       | http://www.OpenMS.de                                              | (3)        |
| PEPPeR        | L, O, W | High-res           | mzXML               | Open source (multiple)  | http://www.broad.mit.edu/cancer/software/genepattern/             | (30)       |
| MaxQuant      | W       | High-res           | mzXML               | Free of charge          | http://www.biochem.mpg.de/en/rd/maxquant/                         | (10)       |
| MSight        | W       | All                | mzXML, raw          | Free of charge (C++)    | http://www.expasy.org/MSight/                                     | (31)       |
| msInspect     | L, O, W | Isotope resolved   | mzXML               | Open source (Java)      | http://proteomics.fhcrc.org/CPL/home.html                         | (14)       |
| MSQuant       | W       | All                | mgf                 | Open source (.NET)      | http://msquant.alwaysdata.net/                                    | (32)       |
| MZmine 2      | L, O, W | High-res           | mzML, mzXML, mzData | Open source (Java)      | http://mzmine.sourceforge.net/                                    | (15)       |
| ProteoWizard  | L, O, W | All                | mzML, mzXML         | Open source (C++)       | http://proteowizard.sourceforge.net/                              | (33)       |
| SpecArray     | L       | High-res           | mzXML               | Open source (C)         | http://tools.proteomecenter.org/software.php                      | (7)        |
| SuperHirn     | L, O    | High-res           | mzXML               | Open source (C++)       | http://tools.proteomecenter.org/wiki/                             | (34)       |
| TPP           | L, W    | All                | mzXML               | Open source (multiple)  | http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP | –          |
| Elucidator    | W       | High-res           | Raw                 | Commercial              | http://www.rosettabio.com                                         | –          |
| Expressionist | W       | All                | mzXML               | Commercial              | http://www.genedata.com                                           | –          |
| ProteinPilot  | W       | AB instruments     | Raw                 | Commercial              | http://www.appliedbiosystems.com                                  | –          |
| QuanLynx      | W       | Waters instruments | Raw                 | Commercial              | http://www.waters.com                                             | –          |
| SIEVE         | W       | Thermo instruments | Raw                 | Commercial              | http://www.thermo.com                                             | –          |

OS operating system, L Linux, O MacOSX, W Windows
There are also methods that align data in earlier processing stages (not features). Global characteristics such as the similarity of the entire MS spectra, usually combined with some form of binning (ChamS, http://www.pasteur.fr/recherche/unites/Biolsys/chams/ (21)), or even the total ion chromatogram (CPM, http://www.cs.toronto.edu/~jenn/CPM/ (22)) can then be used to guide the alignment. A recent survey of map alignment algorithms is given in (23).

3.7. From Peptides to Proteins
Solving the problems of protein inference and quantitation is not trivial and is still an ongoing topic of research. Even standards for reporting peptide and protein identification results are not unified within the community. One commonly used guideline was established in 2005 in Paris (termed the "Paris Guidelines," available at http://www.mcponline.org/misc/ParisReport_Final.dtl), setting standards as to what information should be included in publications to enable other researchers to critically assess the results. It was common practice to accept only proteins of which at least two peptides had been identified (two-peptide rule) and discard so-called one-hit wonders, until a paper by Gupta (24) proposed the use of error rates as the basis for protein identification, using only the highest scoring peptide of a protein. This results in a better sensitivity/specificity trade-off than the two-peptide rule and outperformed ProteinProphet (25). ProteinProphet is a widely used and readily available program, distributed as part of the Trans-Proteomic Pipeline (TPP) available at http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP, which uses peptide probabilities to derive protein probabilities. Usually, a probability cut-off between 0.9 and 0.95 for proteins is used, that, when passed, includes the protein into the final set of identified proteins. Another recent approach (26) based on Bayesian statistics, which incorporates peptide detectability (i.e., the probability of a peptide to be observed if the protein is present in the sample), is also available for public usage at http://darwin.informatics.indiana.edu/yonli/proteininfer. Protein quantitation is usually addressed separately from protein identification, although the two are closely linked. Quantitation using a single peptide is usually less reliable than deriving a protein quantitation from several peptide intensity values. Usually, only unique peptides are used for quantitation, as shared peptides' abundance is biased by other proteins. One solution which can use shared peptides to improve protein quantitation applies an integer linear programming approach (27) and is also available for download at http://cseweb.ucsd.edu/~bdost/downloads.htm. For labeled experiments (i.e., multiple ratios of peptide intensities for a protein, one ratio being derived from differently labeled peptides), a recent publication by Carrillo (28) investigates methods to estimate relative protein abundance and
concludes that adding intensities of all peptides in one channel before computing the channel ratios is the most stable approach. Search engines that support protein identification are Mascot and X!Tandem; Mascot also supports quantitation out of the box.
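The channel-summing strategy reported as most stable in (28) is easy to state in code; the sketch below assumes that (heavy, light) intensity pairs for the unique peptides of one protein have already been extracted.

```python
def protein_ratio(peptide_intensities):
    """Relative protein abundance for a labeled experiment, following
    the strategy reported as most stable in (28): sum the intensities
    of all (unique) peptides per channel first, then form one ratio."""
    heavy = sum(h for h, _ in peptide_intensities)
    light = sum(l for _, l in peptide_intensities)
    return heavy / light

# Hypothetical (heavy, light) intensity pairs for three unique peptides:
print(protein_ratio([(1.2e6, 0.8e6), (3.0e5, 2.2e5), (9.0e4, 5.5e4)]))
```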
3.8. Quantitation Pipeline

All the above steps can be used as building blocks for an analysis pipeline. Proteomics data processing pipelines come in various designs, depending on the separation technique used, the instrument, the labeling state of the sample, the type of information sought, and more. The list of available software used for data analysis is long, and we have assembled the most prominent tools in Table 1. We concentrate on one of the most commonly used pipelines, namely, label-free quantitation. The pipeline can have different entry points based on the kind of data available. Usually, the more preprocessed the data, the later it enters the pipeline. The most prominent example is a raw data reduction step called centroiding, which is performed by default by the instrument software before the data is even exported to be used by other software packages. As a rule of thumb, you should try to acquire data as "raw as possible" because vendor software might not be optimal for centroiding the data (see Note 6). Common to most pipelines is the basic preprocessing sequence of noise filtering, baseline filtering, and centroiding. We look at this pipeline in more detail. The goal of this pipeline is to quantify peptides in two or more unlabeled samples and derive a differential quantitation (i.e., over-/underexpression) leading to potential biomarkers. In the general setting, it works as follows:
- Baseline filter: removes the low-frequency signal.
- Noise filter: smoothes the mass spectra to remove high-frequency noise which might distort centroiding.
- Centroiding (peak picking): reduces data by merging points in the m/z dimension belonging to a single Gaussian peak to its centroid.
- Feature detection: further reduces data by grouping peaks which were induced by the same chemical entity (e.g., all peaks of the isotopic and RT envelope of a peptide).
- Map alignment: corrects retention time differences between different maps so that (ideally) identical peptides have identical coordinates (in RT and m/z) and are linked.
- Statistical analysis: given a matrix of features in two (or more) classes of maps, tries to find differentially expressed peptides.
This pipeline is designed to work on MALDI data, but for ESI data the baseline filter step might be superfluous. If given centroided data, enter the pipeline at the feature detection step. The order
of the steps is not fixed and might be different depending on the software employed, e.g., it is possible that the map alignment algorithm requires raw data, and thus this step moves further toward the beginning. Furthermore, there are algorithms for feature detection (13) which require raw data and do not work on centroided data, so the centroiding might be skipped. Statistical analysis comprises normalization (e.g., quantile or median normalization), which can be adapted from the microarray community, and classical machine learning approaches (e.g., support vector machines). For the analysis, the most prominent tools are R (http://www.r-project.org) or WEKA (http://www.cs.waikato.ac.nz/ml/weka).
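To illustrate how the building blocks compose, a skeleton of the label-free pipeline is sketched below. It reuses the toy functions from the earlier sketches in this section (tophat_baseline_removal, centroid, extract_mass_traces), and all parameters are placeholders rather than recommended settings.

```python
from scipy.signal import savgol_filter

def process_map(scans):
    """Toy label-free preprocessing: per-spectrum baseline removal,
    smoothing, and centroiding, followed by grouping into mass traces.
    `scans` is a list of (rt, mz_array, intensity_array) tuples; the
    helper functions are the sketches from Subheadings 3.1-3.5."""
    centroided = []
    for rt, mz, intensity in scans:
        corrected, _ = tophat_baseline_removal(intensity, window=201)
        smoothed = savgol_filter(corrected, window_length=11, polyorder=3)
        centroided.append((rt, centroid(mz, smoothed, min_height=50.0)))
    return extract_mass_traces(centroided)  # crude feature candidates
```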
4. Notes

1. Software Target Platforms. Some algorithms are designed to work only on certain types of instruments (in terms of resolution). Read the manual before attempting to analyze QStar data with software written for OrbiTrap data. See Table 1 for an overview.
2. Data Transfer. It is sometimes necessary to transfer MS data to colleagues for analysis. Common approaches range from FTP transfer, to public databases like PRIDE (http://www.ebi.ac.uk/pride) or Tranche (https://trancheproject.org), to sending hard disks by mail.
3. HDD Space. With new instruments producing more than 100 GB per map, you might run into serious data storage issues. You can either store only preprocessed data (e.g., centroided) or acquire more, rather expensive, server storage. Preprocessing will not allow you to redo certain steps; thus, you ought to be sure about the quality of your preprocessing.
4. Computation Time. Most software processes MS data in less time than it takes to measure the data; however, there can be significant differences in runtime. Some software supports distributed computing or at least multicore CPUs. Compare the results and the time required using different packages and choose what fits best.
5. Parameter Settings. Be careful in choosing parameters for the algorithms, since free software (especially) is usually designed to work with a broad range of instruments. Wrong parameters lead to suboptimal results. In addition, you can try more than one software package on your data and compare the results.
6. Vendor Formats. Most free software requires HUPO-PSI data formats (like mzML) as input and cannot operate on vendor-specific formats. To convert the vendor data, you can use specialized conversion tools (see Table 2). This usually requires the vendor libraries which come with the instrument software.
Table 2
Data conversion tools

| Software     | OS      | Input formats                   | Output formats                  | Vendor library | URL                                                                  |
|--------------|---------|---------------------------------|---------------------------------|----------------|----------------------------------------------------------------------|
| OpenMS/TOPP  | L, O, W | AndiMS(a), mzML, mzXML, mzData  | AndiMS(b), mzML, mzXML, mzData  | Not required   | http://www.OpenMS.de                                                 |
| ProteoWizard | L, O, W | Thermo Xcalibur (.RAW)          | mzML, mzXML, MGF                | Yes(c)         | http://proteowizard.sourceforge.net/                                 |
| ReAdW / TPP  | W       | Thermo Xcalibur (.RAW)          | mzML, mzXML                     | Required       | http://tools.proteomecenter.org/wiki/index.php?title=Software:ReAdW  |
| massWolf     | W       | Waters MassLynx folder          | mzML, mzXML                     | Required       | http://sourceforge.net/projects/sashimi/files/                       |
| mzWiff       | W       | ABI Analyst (.wiff)             | mzML, mzXML                     | Required       | http://sourceforge.net/projects/sashimi/files/                       |
| trapper      | W       | Agilent MassHunter (.d)         | mzXML                           | Included       | http://sourceforge.net/projects/sashimi/files/                       |
| CompassXport | W       | Bruker (.baf, .yep, .fid)       | mzXML, mzData                   | Required       | http://www.brukerdaltonics.com/                                      |

(a) 32-bit Linux only. (b) 32-bit Linux only. (c) Vendor-independent version in preparation/beta release.
OS operating system, L Linux, O MacOSX, W Windows
Acknowledgments
CB is supported by the European Commission's 7th Framework Program (GA202222). OK gratefully acknowledges financial support from DFG (SFB 685/B1, SPP 1335) and BMBF (0313842A, 0315395F).

References
1. Perkins, D. N., Pappin, D. J. C., Creasy, D. M., Cottrell, J. S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567.
2. Craig, R., Beavis, R. C. (2004) TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467.
3. Kohlbacher, O., Reinert, K., Gröpl, C., et al. (2007) TOPP – the OpenMS proteomics pipeline. Bioinformatics 23, e191–e197.
4. Ruckstuhl, A. (2001) Baseline subtraction using robust local regression estimation. J Quant Spectrosc Radiat Transfer 68, 179–193.
5. Williams, B., Cornett, S., Dawant, B., Crecelius, A., Bodenheimer, B., Caprioli, R. (2005) An algorithm for baseline correction of MALDI mass spectra. New York, NY: ACM Press.
6. Savitzky, A., Golay, M. J. E. (1964) Smoothing and differentiation of data by simplified least squares procedures. Anal Chem 36, 1627–1639.
7. Li, X.-J., Yi, E. C., Kemp, C. J., Zhang, H., Aebersold, R. (2005) A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Mol Cell Proteomics 4, 1328–1340.
8. Lange, E., Gröpl, C., Reinert, K., Kohlbacher, O., Hildebrandt, A. (2006) High accuracy peak-picking of proteomics data using wavelet techniques. In: Proceedings of the 11th Pacific Symposium on Biocomputing (PSB06), 243–254.
9. Schulz-Trieglaff, O., Machtejevas, E., Reinert, K., Schlüter, H., Thiemann, J., Unger, K. (2009) Statistical quality assessment and outlier detection for liquid chromatography-mass spectrometry experiments. BioData Min 2, 4.
10. Cox, J., Mann, M. (2008) MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat Biotechnol 26, 1367–1372.
11. Zhang, Z., Marshall, A. (1998) A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra. J Am Soc Mass Spectrom 9, 225–233.
12. Horn, D. (2000) Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J Am Soc Mass Spectrom 11, 320–332.
13. Schulz-Trieglaff, O., Hussong, R., Gröpl, C., Hildebrandt, A., Reinert, K. (2007) A fast and accurate algorithm for the quantification of peptides from mass spectrometry data. In: Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology, 473–487.
14. Bellew, M., Coram, M., Fitzgibbon, M., et al. (2006) A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 22, 1902–1909.
15. Katajamaa, M., Miettinen, J., Oresic, M. (2006) MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22, 634–636.
16. Tautenhahn, R., Böttcher, C., Neumann, S. (2007) Annotation of LC/ESI-MS mass signals. In: BIRD, Hochreiter, S., Wagner, R., eds., vol. 4414 of Lecture Notes in Computer Science. Springer, 371–380.
17. Lange, E., Gröpl, C., Schulz-Trieglaff, O., Leinenbach, A., Huber, C., Reinert, K. (2007) A geometric approach for the alignment of liquid chromatography-mass spectrometry data. Bioinformatics 23, i273–i281.
18. Smith, C. A., Want, E. J., O'Maille, G., Abagyan, R., Siuzdak, G. (2006) XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal Chem 78, 779–787.
19. Prince, J. T., Marcotte, E. M. (2006) Chromatographic alignment of ESI-LC-MS proteomics datasets by ordered bijective interpolated warping. Anal Chem 78, 6140–6152.
20. Lange, E., Tautenhahn, R., Neumann, S., Gröpl, C. (2008) Critical assessment of alignment procedures for LC-MS proteomics and metabolomic measurements. BMC Bioinformatics 9, 375.
21. Prakash, A., Mallick, P., Whiteaker, J., et al. (2005) Signal maps for mass spectrometry-based comparative proteomics. Mol Cell Proteomics 5, 423–432.
22. Listgarten, J., Neal, R. M., Roweis, S. T., Wong, P., Emili, A. (2007) Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics 23, e198–e204.
23. Vandenbogaert, M., Li-Thiao-Té, S., Kaltenbach, H.-M., Zhang, R., Aittokallio, T., Schwikowski, B. (2008) Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 8, 650–672.
24. Gupta, N., Pevzner, P. A. (2009) False discovery rates of protein identifications: a strike against the two-peptide rule. J Proteome Res 8, 4173–4181.
25. Nesvizhskii, A. I., Keller, A., Kolker, E., Aebersold, R. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75, 4646–4658.
26. Li, Y. F., Arnold, R. J., Li, Y., Radivojac, P., Sheng, Q., Tang, H. (2009) A bayesian approach to protein inference problem in shotgun proteomics. J Comput Biol 16, 1183–1193.
27. Dost, B., Bandeira, N., Li, X., Shen, Z., Briggs, S., Bafna, V. (2009) Shared peptides in mass spectrometry based protein quantification. In: Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology, Batzoglou, S., ed., vol. 5541 of Lecture Notes in Computer Science. Springer, 356–371.
28. Carrillo, B., Yanofsky, C., Laboissiere, S., Nadon, R., Kearney, R. E. (2009) Methods for combining peptide intensities to estimate relative protein abundance. Bioinformatics.
29. Brusniak, M.-Y., Bodenmiller, B., Campbell, D., et al. (2008) Corra: computational framework and tools for LC-MS discovery and targeted mass spectrometry-based proteomics. BMC Bioinformatics 9, 542.
30. Jaffe, J. D., Mani, D. R., Leptos, K. C., Church, G. M., Gillette, M. A., Carr, S. A. (2006) PEPPeR, a platform for experimental proteomic pattern recognition. Mol Cell Proteomics 5, 1927–1941.
31. Palagi, P. M., Walther, D., Quadroni, M., et al. (2005) MSight: an image analysis software for liquid chromatography-mass spectrometry. Proteomics 5, 2381–2384.
32. Schulze, W. X., Mann, M. (2004) A novel proteomic screen for peptide-protein interactions. J Biol Chem 279, 10756–10764.
33. Kessner, D., Chambers, M., Burke, R., Agus, D., Mallick, P. (2008) ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 24, 2534–2536.
34. Mueller, L. N., Rinner, O., Schmidt, A., et al. (2007) SuperHirn – a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics 7, 3470–3480.
Chapter 16
Bioinformatics for Mass Spectrometry-Based Metabolomics
David P. Enot, Bernd Haas, and Klaus M. Weinberger

Abstract
The broad view of the state of biological systems cannot be complete without the added value of integrating proteomic and genomic data with metabolite measurements. By definition, metabolomics aims at quantifying no less than the totality of small molecules present in a biofluid, tissue, organism, or any material beyond living systems. To cope with the complexity of the task, mass spectrometry (MS) is the most promising analytical environment to fulfill the increasing appetite for a more accurate and larger view of the metabolome while providing sufficient data generation throughput. Bioinformatics and associated disciplines naturally play a central role in bridging the gap between fast-evolving technology and domain experts. Here, we describe the strategies to translate crude MS information into features characteristic of metabolites, and resources available to guide scientists along the metabolomics pipeline. A particular emphasis is put on pragmatic solutions to interpret the outcome of metabolomics experiments at the level of signal processing, statistical treatment, and biochemical understanding.

Key words: Metabolomics, Metabolome, Mass spectrometry, Computational biology, Data mining, Biostatistics, Research design
1. Introduction

1.1. History and Proof-of-Concept
The euphoria before and around the first sequencing of the human genome was quickly displaced by the realization that a functional understanding of an organism and its (patho-)physiology required much more than just the genetic blueprint. In the era of functional genomics that was proclaimed soon afterward, all new disciplines were immediately conceived as systematic, comprehensive approaches, the so-called Omics. While some of these initiatives focused on chemical or functional niches (e.g., glycomics, lipidomics, signalomics, secretomics, fluxomics), the three traditional
stages of the central paradigm of molecular biology attracted and keep attracting most attention: transcriptomics, proteomics, and metabolomics. Among these three areas, multiplexed methods and standardized commercial solutions for medium to high-throughput were first developed for transcriptomics, followed by various competing and far less satisfying approaches to the presumably most complex of the Omes, the proteome. Over the last few years, the youngest of these disciplines, metabolomics, has gained remarkable momentum, mainly driven by two independent facts: First, increased sensitivity of triple quadrupole and time-of-flight mass spectrometers and experimental improvements, such as the parallelized use of multiple reaction monitoring (MRM) and the application of stable isotope dilution (SID) for absolute quantitation paved the way to a systematic detection of metabolites in biologically relevant sample types, such as plasma or serum; the limited sensitivity of previous Nuclear Magnetic Resonance (NMR)-based workflows restricted their use mainly to urine (urine as a sample type is analytically very convenient but the concentration of metabolites in urine is not controlled and regulated in the sense of a strict homeostasis as in blood). Second, the long-standing experience in elucidating metabolic pathways offers an invaluable source of background information that supports interpretation of (quantitative) metabolomics datasets. While the scientific community just scratches the surface in understanding and mapping protein–protein interactions (and does not even seriously try interaction predictions at the mRNA level), the majority of metabolic pathways are characterized at a very detailed level (namely, substrates and products of enzymatic reactions, reaction mechanisms, equilibria, kinetics, and energetics of these reactions, cofactors or compartmentalization). This wealth of information allows a uniquely direct way of functional interpretation as soon as reliable quantitative data on metabolite concentrations become available (1–3). The most convincing proof-of-concept for mass spectrometry (MS)-based metabolomics has been achieved in the routine diagnostic application of neonatal screening for inborn errors of metabolism, e.g., monogenic disorders of amino acid metabolism and fatty acid transport and mitochondrial oxidation (FATMO). Here, the introduction of tandem mass spectrometry and multiparametric diagnostic criteria in clinical routine has enabled a dramatic expansion of the diagnostic portfolio, a significant improvement of diagnostic performance (particularly of the specificity and the predictive values for very rare diseases), and, additionally, led to a marked reduction of health care costs.
1.2. Schools of Thought
Metabolomics (in the NMR community, the term metabonomics is also sometimes used and artificially distinguished from metabolomics) comes in two flavors depending on the stage of the workflow at which some form of chemical identification is performed. Discerning the two approaches is essential as it dictates the downstream bioinformatics strategy. Nontargeted metabolomics (usually called metabolic profiling) aims at providing a holistic view of the metabolome with minimal chemical/biological bias (i.e., data driven) by using all the information from the mass spectrometer, in a raw or partially processed form (also called biochemical fingerprinting or footprinting). Additionally, comparative or differential analysis is undertaken to select potential marker candidates that undergo subsequent chemical characterization and biochemical interpretation. In a targeted approach (i.e., analyte driven), in contrast, metabolite-specific peaks/signals (selected according to biochemical hypotheses) are retrieved from the crude data, and measurements are presented to statistical analysis in the form of integrated signals or calculated concentrations (see below for quantitation options). When opting for a nontargeted strategy, exploration of chemical space is only limited by sample extraction (if performed) and instrument sensitivity, and, therefore, de novo metabolite discovery remains possible in principle (but is a very rare event in the biomedical field). One must keep in mind that post-analysis is still not trivial, and there is no guarantee that the origin of significant changes can be elucidated unequivocally. Targeted metabolomics is inevitably biased by the a priori knowledge of the metabolome but essentially pays off through adapted sample preparation, so that low-abundance molecules are integrated into the profile, spurious features do not enter the analysis, and, finally, metabolite changes can be directly associated with biochemical mechanisms. In metabolite profiling, the origin of the MS information depends heavily on the analytical technique and the intrinsic nature of the measurements, i.e., the properties of the metabolites; these are two real challenges that motivate ad hoc bioinformatics procedures. Despite these two and other issues (such as the variety of the experimental design, the overall objective of the study, and the availability of relevant expertise along the process), the overall metabolomics workflow follows a logical set of protocols, e.g., data acquisition, signal processing, information compression, data preparation, statistical analysis, and finally data interpretation and dissemination (Fig. 1). For the sake of brevity, we only concentrate on general computational aspects in a typical MS-based metabolomics experiment: processing of crude information from the analytical platform, solutions for normalization and quantification, metabolite identification, considerations for the adapted statistical treatment, and common practices in follow-up data interpretation.
354
Enot, Haas, and Weinberger
[Figure 1 flow chart: Data Acquisition (flow injection (FIA) or liquid/gas chromatography (LC/GC); full-scan vs. single reaction monitoring instruments) → Data Processing (baseline correction, smoothing, alignment, ion extraction/peak detection, isotope correction) → Data Analysis and Interpretation (normalization/quantification, data quality checks, univariate and multivariate analysis, pathway analysis), linked to metabolite-, spectral-, and pathway-specific databases.]
Fig. 1. Overview of the MS-based metabolomics workflow.
2. Materials

2.1. MS Data Processing
The general purpose of data processing is to translate crude signals into a more natural quantity characterizing putative metabolites. Typically, MS data are made up of the mass-to-charge ratio (m/z, mass domain), the retention time if chromatographic separation is performed (time domain), and a measurement relating to ion intensity. In flow injection analysis (FIA), the sample is directly introduced into the mass spectrometer over a short period of time (typically 3 min). The optional separation dimensions that arise in the case of GCxGC or multidimensional LC platforms are not discussed here. In addition to inherent physical and biological properties of metabolites (i.e., abundance, polarity, ionization behavior), the diversity of chromatographic specificity, ionization methods, mass analyzers, ion detection accuracy, mass spectrometry platforms, and vendors are the major obstacles to a common and accepted framework to extract and consolidate metabolomics data. Figure 1 gives an overview of the most common instrumentation in metabolomics.
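As a minimal illustration of this data model, the sketch below stores a run as parallel arrays of (retention time, m/z, intensity) triples and averages an FIA infusion profile on a 0.01 m/z grid; both the layout and the grid width are assumptions for demonstration only.

```python
import numpy as np

# One convenient in-memory model of an MS run: parallel arrays of
# (retention time, m/z, intensity) triples, i.e., a "map" in the LC case.
rt    = np.array([0.1, 0.1, 0.2, 0.2, 0.3])              # minutes (toy values)
mz    = np.array([150.05, 210.10, 150.05, 210.11, 150.06])
inten = np.array([1.0e4, 5.0e3, 1.2e4, 4.8e3, 9.0e3])

# For FIA, the time axis carries no separation information, so scans
# acquired over the short infusion window are typically averaged: here
# by rounding m/z to a 0.01 grid and averaging intensities per grid point.
grid = np.round(mz, 2)
averaged = {g: inten[grid == g].mean() for g in np.unique(grid)}
```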
Alongside the sample introduction strategy, technologies are split into two categories depending on the m/z data acquisition for each time point: in full scan mode, the mass spectrometer monitors a range of masses (varying from unit mass resolution to exact mass resolution), whereas in single reaction monitoring, the mass spectrometer is set to record only one ion at a time. Regardless of the profiling strategy and the analytical platform, a general MS processing workflow navigates across three stages with different levels of sophistication: raw processing during or after data acquisition (low-level processing), data compression strategies to convert data points into features that are characteristic of the metabolite content (mid-level processing) and, finally, production of a measurement matrix ready for data exploitation and dissemination. A detailed summary of the data processing steps specific to each analytical system from our classification schema is given in Fig. 1.

2.1.1. Low-Level Processing
Low-level processing is motivated by the fact that MS signals are subject to chemical noise (interference coming from the complex biological sample), electronic noise (related to the physics of the analytical technique), and sample preparation errors (such as cross-contamination or variability of derivatization reactions). Typical operations rely on traditional signal processing techniques, namely, filtering, smoothing, and baseline correction. In the special case of FIA platforms, the crude infusion profile must first be averaged (or equivalent, see (4)) prior to any further manipulation. The baseline, or background, corresponds to a combination of the spectrometer response observed when no biological sample is intentionally introduced (i.e., zero sample), and the offset caused by interfering components with varying severity depending on the matrix (such as urine or plant extracts). The baseline shape is usually of nonlinear nature, is noticeable in both the time and mass domain, has no theoretical justification, and its behavior is rarely transferable between experiments (5, 6). One consensus procedure to model the baseline shape relies on the estimation of local average or minimum intensity from a short window on the mass domain (7–9). Alternative approaches have been formulated using polynomial models, Eigensystems (10), or low order polynomial Savitzky–Golay filters (11). For a listing of relevant software, see Table 1. Mass spectral data are also characterized by their high-frequency noise component, i.e., rapid changes in intensity regardless of the actual intensity levels (5). Typical noise reduction procedures are smoothing filters like the Gaussian filter or moving average/median filters (12, 13), Savitzky–Golay type filters, or decomposition methods like the wavelet transform (8, 14). There are no systematic comparisons between these methods or the
Table 1
Software solutions for processing metabolomics data

| Tool | License | Data type | Features | Output |
|------|---------|-----------|----------|--------|
| MarkerView | $ | LC-MS | Baseline correction, noise filtering, peak detection/alignment | Table of integrated peaks |
| MassLynx | $ | LC-MS | Automated processing | Table of integrated peaks/concentrations |
| Sieve | $ | LC-MS | Automated processing | Table of integrated peaks/concentrations |
| MetIQ | $ | FIA/LC-MS | Automated processing, quantification and identification | Concentration table |
| Genedata Expressionist | $ | LC/GC-MS | Automated processing | Table of integrated peaks/concentrations |
| MZmine 2 | OS | LC/GC-MS | Binning, baseline correction, smoothing, alignment, peak extraction | Table of extracted ions |
| XCMS | OS | LC/GC-MS | Alignment, peak picking, normalization | Table of extracted ions/peaks |
| AMDIS | (F) | GC-MS | Deconvolution | List of components |
| msInspect | OS | LC-MS | Baseline correction, noise filtering, peak detection/alignment | Table of integrated peaks |
| MathDAMP | OS | GC/LC-MS | Signal processing, alignment | Processed profiles |
| MET-IDEA | (F) | LC/GC-MS | Alignment, PD | Table of extracted ions |
| metAlign | F | FIA/LC/GC-MS | Binning, baseline correction, denoising, alignment, peak extraction | Table of integrated peaks |

Annotation: $ commercial, ($) licensing under special conditions, F free, OS open source
There are no systematic comparisons between these methods or the sequence in which to apply them, and it is advisable to keep in mind the potential bias that may be introduced for downstream analyses (6, 15). Both steps are critical for deriving high-quality data. Aggressive parameterization of the smoothing algorithm may attenuate relevant signals and irreversibly mask overlapping peaks (rendering peak deconvolution ineffective), whereas a less stringent approach keeps noisy and uninformative features at the risk of hampering subsequent processing steps. A typical consequence of conservative baseline shape estimation is revealed during data mining when the background noise becomes discriminatory.
2.1.2. Data Compression (Mid-Level)
Mid-level processing relates to methods for compressing the data into a smaller set of informative components, given as signals or peaks, that are presumably characteristic of metabolites (16). This goal is decisive in a nontargeted context to limit follow-up chemical characterization to genuine information rather than artifacts. Ultimately, data compression allows direct comparison between samples by matching each component to every sample in order to comply with machine learning and statistical method requirements. The type of data dictates feature extraction from MS profiles. A first approach to compressing MS data on the mass domain consists of grouping m/z points into a number of bins for which the average or maximum intensity across all points represents the signal abundance (4, 11). Despite the method's relative simplicity and efficiency, the biochemical status may not be reflected in an optimal way, and incorrect features can arise from improper definition of the bin boundaries. A second strategy involves the detection of peaks, usually on the time domain, by monitoring a specific ion or a specific transition (i.e., single reaction monitoring, SRM, or MRM in the case of MS/MS experiments), or by generating all plausible features in the dataset. Such an operation, known as peak picking, can be formulated in terms of the identification of the accurate position, height, and dispersion measure of all MS peaks, considering the presence of noise, baseline artifacts, and potential overlap between unresolved components. The most efficient methods available for this task use peak shape model approximations, Gaussian-based mixture modeling, statistical testing (17), extraction of derivatives (18, 19), refined heuristics to isolate the peak information from the background noise (20), application of continuous/discrete wavelet or Fourier transforms (8, 14), and pattern classification methods (21). Existing peak picking algorithms are mostly validated on the MS-based application or instrument type they were developed for, so that the choice of method depends largely on the number of data points defining a peak, the degree of overlap between peaks, and the levels of high/low frequency noise.
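Returning to the binning approach described above, a minimal R sketch may help; the centroid m/z values, intensities, and the 1-Da bin width are hypothetical choices, not prescriptions from the cited methods.

```r
# Hypothetical centroided spectrum compressed by 1-Da binning: each m/z point
# is assigned a nominal-mass bin, and the maximum intensity per bin becomes
# the feature value.
mz       <- c(87.04, 87.05, 104.10, 104.11, 132.10)
intens   <- c(1200, 900, 4300, 4100, 2500)
bins     <- floor(mz + 0.5)               # nominal-mass bin label per point
features <- tapply(intens, bins, max)     # one abundance value per bin
features                                  # named vector: bins 87, 104, 132
```

Mis-set bin boundaries in exactly this step produce the split or merged features the text warns about.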
An alternative strategy exploits cross-correlations on both the mass and time domains to extract putative spectra and associated abundances without any a priori knowledge of the chemical content (12). MS-based metabolomics data are also prone to systematic drift in both the time and mass domains. This effect is amplified in large experiments due to changes in the analytical set-up, e.g., replacement of chromatographic separation columns. Drift on the m/z axis is less apparent and can easily be overcome by strictly following manufacturer calibration specifications or by applying an adequate mass binning strategy. In contrast, drift on the time domain must often be compensated for to correct for differences in elution patterns due to chromatographic performance, changes in pressure and temperature, and the sample's individual physicochemical properties. Most alignment algorithms work by mapping each sample in the experiment to a reference chromatogram, maximizing the overlay over a set of transformations or using nonparametric alignments such as dynamic programming (22–25). These approaches are most effective when they work from information-rich raw data but show limitations with respect to: (1) the choice of a good reference template encompassing all sample characteristics (including different responses related to the physiological/pathological status); (2) the type of the template, which is usually in the form of a summary over all m/z (e.g., total ion count); and (3) their ability to cope with peaks whose elution order switches. Since the ionization of small molecules rarely leads to the formation of multiply charged complexes, charge deconvolution is not a major concern (in contrast to proteome analysis). However, isotope deconvolution may be performed to remove isotope peaks, simplifying the profile to unique metabolites and removing residual intensity resulting from neighboring components in the mass domain. Currently, published methods for correcting isotopic artifacts are limited to targeted approaches, since knowledge of the molecular composition is expected to be available. Efficient methods have been described (26) and implemented in a commercially available solution for targeted metabolomics. Another application of isotopic pattern identification is to help reduce the number of potential molecular formulae of an unknown signal in the context of high resolution MS (27).
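As a toy illustration of reference-based alignment, the sketch below estimates a single global retention-time shift by maximizing the cross-correlation with a reference trace; real aligners such as those cited above (22–25) fit nonlinear warping functions, so this only conveys the principle, on simulated data.

```r
# Toy alignment: estimate one global retention-time shift by maximizing the
# cross-correlation between a sample trace and a reference, then shift the
# sample accordingly.
ref     <- dnorm(seq(-5, 5, by = 0.01))               # reference chromatogram
smp     <- c(rep(0, 40), ref)[seq_along(ref)]         # same trace, delayed 40 scans
cc      <- ccf(smp, ref, lag.max = 100, plot = FALSE)
lag     <- cc$lag[which.max(cc$acf)]                  # estimated shift (scan units)
aligned <- if (lag > 0) c(smp[-seq_len(lag)], rep(0, lag)) else smp
```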
2.2. Normalization
In metabolomics, the term normalization refers to the data adjustment step that precedes data dissemination and subsequent statistical treatment, including data scaling and transformation (7, 28). As the primary goal of metabolomics is the study of metabolite changes in response to environmental and genetic perturbations, normalization is a crucial step to make measurements comparable. Typically, intersample variability originates from differences in sample concentration and homogeneity, loss of sensitivity, and drift of the analytical
system, or sample degradation over time. It becomes a real challenge when multiple experiments are considered and when metabolite measurements originate from several experimental procedures or from different analytical platforms. A first class of strategies comprises methods that infer a sample-related bias factor, also known as a dilution or scaling factor, from the data themselves (29). The simplest approach is global normalization, where the scaling factor is the sum (or related to the sum) of all measurements across the sample (also known as constant sum, integral, or total ion count normalization). Limitations of this rather naive approach have been widely discussed (29, 30), and alternative approaches have been proposed that derive a more adequate scaling factor estimate by introducing class information (7), subselection of peaks/signals (31, 32), or mapping the intensity distribution to a reference sample profile (29, 30). A second family uses information from the biological context to adjust sample concentrations (31). As such, creatinine, urine volume, or osmolality at the time of sampling are commonly used in urine-based studies, and tissue weight, cell count, protein content, or DNA content may also be employed to scale sample measurements. Beyond the practical availability of such information and the inherent errors related to its measurement, the application of experimental/clinical parameters for normalization can be misleading when the study addresses biochemical processes that may induce dramatic changes in their estimation (e.g., urine analysis in conditions with impaired kidney function). Regardless of the sophistication of the technique employed and its applicability for solving a specific biological question, the generated data remain in the form of a dimensionless or experiment-dependent quantity that cannot be efficiently transferred or compared between different application areas.
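The two data-driven strategies above can be sketched in a few lines of R on a samples-by-features matrix; the probabilistic quotient step follows the recipe of (30), while the data themselves are simulated.

```r
# Scaling-factor normalization on a samples-by-features matrix X (simulated):
# (a) constant-sum (total ion count) scaling, and (b) probabilistic quotient
# normalization, dividing each sample by its median ratio to a reference.
X <- matrix(abs(rnorm(5 * 20, mean = 100, sd = 20)), nrow = 5)

X_tic <- X / rowSums(X)                        # (a) constant-sum normalization

reference <- apply(X_tic, 2, median)           # (b) median reference profile
quotients <- sweep(X_tic, 2, reference, "/")   # per-feature ratios to reference
dilution  <- apply(quotients, 1, median)       # one dilution factor per sample
X_pqn     <- sweep(X_tic, 1, dilution, "/")
```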
2.3. Metabolite Quantification
Determination of absolute metabolite concentrations by means of calibration to internal (i.e., added to the sample before extraction, typically as structurally identical compounds labeled with stable isotopes such as deuterium, 13C, or 15N) or external (i.e., added to the sample after extraction) standards is the most reliable approach to minimize inter-individual variance and, more importantly, to remove confounding effects coming from molecular interferences in the sample affecting metabolite ionization behavior (i.e., matrix effects like ion suppression). This allows the results of the measurements to be presented in a meaningful way (in SI units for concentration) and ultimately facilitates a more direct way to compare datasets originating from multiple sites and experiments. An additional technique, called standard addition, which is based on incremental spiking of biological samples with standards, is even better suited to the quantification of endogenous
metabolites but is not discussed here due to its evident limitations in high-throughput metabolomics profiling and considering the low individual sample volumes that are usually available. The general idea of calibration by either means relies on the determination of a calibration curve constructed by regressing the analyte and standard peak responses in the sample on the concentrations of the standard in the calibration mixture. Quantitation in metabolomics can be approached from two bioinformatics-related angles. Concentrations or calibration characteristics are provided by the analytical team, and computational tasks are restricted to signal extraction, calculations according to the standard operating procedure, and data storage, including statistics relevant to the validation guidelines (various international standards, e.g., ISO 17025, or the FDA's guidance for industry on bioanalytical method validation (33)). Best practices in calibration, adequacy of quantification, or validation are not within the scope of this chapter, and readers are referred to the above-mentioned guidelines. However, strict implementation of quantification can become a daunting task in a high-throughput context, mostly due to the chemical diversity of the matrix: the use of a single internal standard is not applicable to a wide range of compounds and/or a large number of analytes in the time domain, and the lack of suitable blank matrices free of metabolites limits authentic calibration and exact determination of validation parameters, such as the Limit of Detection (LOD), Lower Limit of Quantification (LLOQ), or accuracy. Alternative quantification methods are being investigated to circumvent these issues by normalizing all peaks with multiple internal standards specific to retention time regions (34, 35) or compound classes (36, 37), but a comprehensive collection of internal standards that would optimally account for matrix effects is extremely costly and sometimes not commercially available at all.
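A one-analyte version of such a calibration can be sketched with an ordinary linear regression; all concentrations and response ratios below are invented for illustration.

```r
# Hypothetical one-analyte calibration: regress the analyte/internal-standard
# response ratio of the calibration samples on the known standard
# concentrations, then invert the fitted line to quantify an unknown sample.
conc  <- c(0.1, 0.5, 1, 5, 10, 50)               # standard concentrations (µM)
ratio <- c(0.012, 0.055, 0.11, 0.53, 1.08, 5.2)  # measured response ratios
cal   <- lm(ratio ~ conc)                        # calibration curve
unknown_ratio <- 0.8
unknown_conc  <- (unknown_ratio - coef(cal)[1]) / coef(cal)[2]
```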
2.4. Metabolomics Data Processing Software
The need for more efficient processing, access to external resources, and integration into in-house workflows has inevitably resulted in an increasing number of software platforms covering the metabolomics pipeline. Table 1 summarizes widely used solutions encompassing all types of MS data and suitable to both targeted and nontargeted strategies. None of the tools is generic, and a comparative assessment of their respective performance remains an open question, since many implemented methods are most efficient within their domain of application (including mass spectrometer design) and differ in the degree of chemical expertise required to apply them. Accessibility, analytical characteristics, and data processing features are summarized alongside the type of data generated. Output ranges from processed profiles to measurement matrices in the form of integrated peaks or calculated concentrations, where one analytical feature is matched to each
sample. From a bioinformatics point of view, additional usability criteria that facilitate workflow integration are analytical platform dependence, open source MS format support (mzXML and mzData), and the ability to cope with the automated processing of large experiments. Whereas data exploitation is a separate area, several solutions embed data interpretation and statistical analysis functionalities directly, or via other resources available in their respective development environments.
2.5. Metabolomics Resources
Regardless of the adopted metabolomics strategy, external sources of chemical information, preferably in the form of Web-based services, are key components at most stages of the workflow (Fig. 1). These fulfill several objectives, namely, identification of likely target metabolites for the species of interest, relating m/z signals and spectra to chemical entities, and integration of biochemical knowledge by means of biological or literature databases. In contrast to gene and, to a lesser extent, protein databases, which are now fairly well annotated and standardized, the metabolomics community cannot rely on a system comprehensive enough to cover these three objectives altogether. Current solutions are primarily designed to (1) contain spectral information; (2) support or predict metabolite annotation; (3) store raw metabolic profiles; (4) list known metabolites for a given species or sample matrix; and (5) represent established biological knowledge. However, all available databases still suffer from incomplete coverage, generic representation of complex metabolite classes (most lipids and carbohydrates), and differing and often unknown levels of curation of the individual content. An overview of representative solutions commonly employed by the metabolomics community is given in Table 2. In-depth descriptions of individual databases can be accessed on their respective Web sites or in general publications, e.g., (38). Databases are described according to criteria we feel are relevant for bioinformatics. This includes database content (according to the above classification), the level of human intervention to ensure information validity, and the ease of integration in terms of supporting text-based queries and/or offering downloadable database dumps. Interoperability between databases and linking to other Omics sources also deserve particular attention. Matching metabolites/chemicals across databases requires overcoming the lack of a generally accepted chemical information format or ontology. Among several options, we focus on explicit molecular representations encoded in convertible formats (InCHI, SMILES, SDF, and Mol), linking to the authoritative PubChem database alongside CAS numbers and KEGG identifiers that have been historically adopted by the chemical and metabolomics communities (27). The final topic is mostly relevant to both data interpretation and data integration of metabolic measurements with their genomics and proteomics counterparts.
Table 2
Common resources covering the whole metabolomics pipeline

| Database | Content | Level of curation | Data integration | Chemical annotation | Links to other sources |
|----------|---------|-------------------|------------------|---------------------|------------------------|
| NIST | 1/2 | ++ | − | CAS, InCHI, Mol | |
| Metlin | 1/2 | ++ | ++ | KEGG, CAS | Spectral/physical information only |
| Massbank | 1/2 | ++ | + | CAS, KEGG, PubChem, SMILES | Spectral/experiment information only |
| BinBase | 1/3 | − | + | KEGG, PubChem | |
| [email protected] | 1/3 | + | + | None | |
| KNApSAcK | 2/4 | − | + | CAS | Biochemical role, literature |
| MZedDb | 2/4 | − | + | SMILES | KEGG, Biocyc, BRENDA, HMDB |
| KEGG | 4/5 | ++ | ++ | CAS, PubChem, Mol | ExPASy, ExplorEnz, BRENDA, GO |
| MetaCyc | 4/5 | + | ++ | CAS, PubChem, SMILES | GO, BRENDA, Uniprot, ExPASy, KEGG |
| CheBI | 4/5 | ++ | ++ | CAS, KEGG, PubChem, SMILES, SDF, InCHI | KEGG, IntEnz, EBI resources, literature, patents |
| HMDB | 4/5/(1) | ++ | ++ | CAS, KEGG, PubChem, SMILES, SDF, InCHI | Links to sequences, disease, literature, concentration data |
| LipidMaps | 5 | + | ++ | PubChem, SDF | EntrezGene, UniProt, GO, KEGG |
| Reactome | 5 | ++ | + | KEGG, PubChem | GO, UniProt, EC, EntrezGene, KEGG, CheBI |

Annotation: − unsatisfactory or nonexistent, + minimal or restricted, ++ adequate
2.6. Standardization and Ontologies for Metabolomics Studies
Besides the multidisciplinary nature of the tasks and the necessity of joint efforts between chemists, biologists, computer scientists, and software engineers, a fundamental aspect of metabolomics bioinformatics is that of context. In the situation of a cell or animal experiment, metabolomics provides data whose analytical context must match the biochemical context, as recognized by the MIBBI consortium (39, 40). For metabolomics studies, standardization must specifically address issues related
to sample handling and preparation, the analytical set-up, including instrumental performance and method validation, and, eventually, metadata and chemical annotation in relation to the experiment. Despite the lack of agreement on standards across the metabolomics community, these topics are being covered within the Metabolomics Standards Initiative Core Information for Metabolomics Reporting (CIMR, (41)) and have been summarized in (42). Recently proposed models such as the Architecture for a Metabolomics Experiment (ArMet) (43) or MeMo (44, 45) aim at implementing these requirements and at designing software tools for the Minimum Information About METabolomics experiments (MIAMET). A clear obstacle to designing new software for processing MS data is that each instrument vendor generally stores the raw mass spectrometry measurement data in its own proprietary file format. JCAMP-DX and netCDF are, historically, the most common exchange formats in practice for storing MS data, but they fail to efficiently support complex MS experiments in a high-throughput setting. Attention has turned to the proteomics community (Proteomics Standards Initiative and Seattle Proteome Center), which has been developing open XML-based formats (namely, mzXML and mzData, available at http://www.proteomecommons.org, (46)) and associated converters and validators. We strongly advocate that the use of format conversion software be seen as a temporary solution, and efforts, including convincing manufacturers to offer appropriate export options, should be made to start any data processing from a vendor-neutral format.
3. Methods
3.1. Data Analysis Workflows in Metabolomics
Despite the inherent complexity of metabolomics data, originating from their high dimensionality and variance, the data analysis step usually reduces to the seemingly simple task of identifying a set of features (i.e., markers) that change between physiological conditions. Depending on the biological question addressed, data analysis employs a mixture of univariate, multivariate, and machine learning methods utilizing supervised and unsupervised algorithms. Common practices for analyzing metabolomics data rely on similar techniques and algorithms as other Omics topics. Metabolomics analysis issues are similar to those of, e.g., microarrays, in that the features are interdependent and their number usually exceeds the number of samples (curse of dimensionality), and common recommendations in machine learning (overfitting, model selection bias) and practices for correcting for multiplicity of tests are also
applicable. A dedicated overview of common pitfalls encountered in the statistical analysis of metabolomics datasets has been compiled in (47), and several data analysis workflows have been illustrated in (7, 35, 48). Within the short remit of this chapter, the common backbone of most analyses of metabolomics data is summarized in Note 1, where the choice of a particular modeling approach is left to individual expert preference. Whereas the development of the open source environment R, via the Bioconductor project, accompanied advances in genomics technologies, there is neither a consensus solution for performing metabolomics data analysis nor a standardization effort such as that of the MAQC Consortium. Historically, metabolomics data were heavily subjected to chemometrics methods, such as Principal Components Analysis (PCA) and Partial Least Squares (PLS) as implemented in SIMCA-P (Umetrics AB), and data exploitation relied on interpreting the latent vector coefficients. Several data processing solutions listed in Table 1 (e.g., MarkerView, MetIQ) offer statistical methods ranging from univariate tests to multivariate techniques and interactive visualization. At the other end, open source/script-based software such as R and, to a lesser extent, Matlab or Python is also receiving increasing attention. Beyond these general considerations, we prefer to emphasize the peculiarities and characteristics of metabolomics data in comparison to other Omics and their impact during statistical treatment.
3.2. Data Variance
As for any high-throughput technology, basic experimental considerations, such as strict control of the experiment, adequate cohort selection, technical and experimental randomization, and sample replication, should be enforced. Even with the most dedicated technical team, the irreducible variance associated with instrument performance and laboratory practices, expressed in terms of the coefficient of variation (CV, the ratio between standard deviation and average abundance), is around 10%, whereas biological variance is expected to reach CVs on the order of 50% under normal perturbations (49, 50). These facts point to two important consequences for the data analyst. First, features with high technical variability, as detected from repeated measurements of a representative sample (QC, quality control), should not be included, to limit the risk of highlighting changes likely to be analytical artifacts. The second point relates to the essential step of sample size determination in relation to the goal of the study and the calculation of the observed power. In a typical case/control study assuming lognormal distributions (see below) and specifying standard statistical error levels (alpha = 0.05 and beta = 0.2, or power = 0.8), a rough estimate of the required sample size can simply be obtained as n = 16 × log(1 + CV²)/(log(FC))² (51).
Considering a within-group coefficient of variation of 50% (CV = 0.5) and a treatment effect translating into a biologically relevant concentration shift of 25% (FC = 1.25), an adequate number of samples in each cohort would be 72.
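The approximation quoted above translates directly into a two-line R helper; the function name is ours, and the result reproduces the figure of 72 samples per cohort.

```r
# Per-group sample size for alpha = 0.05 and power = 0.8, following the
# approximation of (51) as quoted in the text.
sample_size <- function(cv, fc) ceiling(16 * log(1 + cv^2) / log(fc)^2)
sample_size(cv = 0.5, fc = 1.25)   # returns 72
```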
3.3. Metabolomics Data Origin
Regardless of the origin of metabolomics data and the process used to obtain them, as in any experimental approach, there is no guarantee that metabolomics measurements answer the biological question posed. One reason is that, to date, no technology, alone or in combination, is able to offer complete coverage of the metabolome. Another issue is purely related to laboratory practices, namely, sample collection protocols, storage, efficiency of the analytical process, and metabolite stability. As far as bioinformatics is concerned, the metabolic fingerprint that enters further statistical treatment may comprise analytical artifacts and noisy features because of inappropriate signal processing and/or inefficient work-up to control molecular interferences. For possible outcomes typically observed at the data assessment and mining stages, alongside possible causes under the control of the bioinformatician, see Note 2.
3.4. Correlation in Metabolomics Data
The notion of correlation in metabolomics experiments differs somewhat from that in their transcriptomics and proteomics counterparts. Elucidation of the interdependence between metabolite abundances is an often overlooked but powerful tool and should also be exploited during data analysis. Correlation between analyte responses can reflect chemical characteristics, e.g., multiple signals belonging to the isotopic pattern of the same parent ion, or represent underlying properties of the analytical platform, such as several ionization and derivatization products from the same molecule (52, 53). Relating metabolite correlation to the reaction network has been discussed at length in (54–56). According to these studies, the interpretation of correlation between a pair of metabolites can be summarized in simple terms as follows:
1. Positive correlation results from metabolites that are in chemical equilibrium (e.g., glucose-6-phosphate and fructose-6-phosphate).
2. Negative correlation involves metabolites sharing a common moiety (e.g., NAD+ and NADH) or another mass conservation relation.
3. Modification of correlation between two pathological states can reflect changes in the regulation of the respective metabolites.
We have to put particular emphasis on the limitations of metabolic pathways for analyzing biological correlation by means of metabolite proximity in biochemical networks; therefore, correlation-based distance metrics, as often employed for clustering and/or graph-based representations, are of limited use for uncovering affected pathways.
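A minimal sketch of screening for strongly correlated feature pairs, which in practice often turn out to be isotope peaks, adducts, or derivatization products of the same metabolite (52, 53), could look as follows; the data and the 0.9 threshold are arbitrary.

```r
# Screen a simulated samples-by-features matrix for strongly correlated
# feature pairs (|r| > 0.9, arbitrary threshold).
set.seed(2)
X <- matrix(rnorm(30 * 8), nrow = 30,
            dimnames = list(NULL, paste0("feat", 1:8)))
X[, "feat2"] <- X[, "feat1"] + rnorm(30, sd = 0.1)   # plant a redundant pair
cm    <- cor(X, method = "pearson")
pairs <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(a = rownames(cm)[pairs[, 1]],
           b = colnames(cm)[pairs[, 2]],
           r = cm[pairs])
```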
3.5. Measurement Characteristics
By nature, metabolomics data are skewed due to detection (and quantitation) thresholds [LOD; LLOQ; ULOQ (Upper Limit of Quantification)] of the analytical instrumentation and a universal biological tendency toward a non-normal distribution of metabolite concentrations in almost every (sub)population. In addition, most metabolomics datasets are characterized by a clear intensity-to-variance relationship that needs to be considered. These issues have immediate consequences for the application of adequate transformations prior to data mining and for methods for coping with missing values ("nondetects"). First, many studies, also beyond the scope of metabolomics, suggest that log or similar transformations are appropriate ways to transform MS data in order to convert multiplicative errors into additive errors and consequently stabilize the variance (6, 57, 58). Second, nondetects appear because of a combination of metabolic pathway dynamics, complex molecular interactions within samples, and the efficiency of the overall analytical protocol, including data processing. With respect to the limited number of strategies to adequately cope with nondetects, the replacement of missing data points by artificial values is more a computational convenience for applying most multivariate statistical tools than a significant improvement of their predictive abilities. Imputing missing data points by means, medians, or a prespecified value such as zero or the minimum observed value in the dataset is generally recognized as a suboptimal solution (59). Instead, patterns should be examined in the light of experimental factors, features with large numbers of missing data should be discarded (7, 35), and missing values should be estimated by multivariate methods, such as those summarized in (60).
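Combining the two recommendations, a hedged sketch using the Bioconductor package pcaMethods (60) might read as follows; the choice of probabilistic PCA and three components is an arbitrary illustration, not a recommendation.

```r
# Log-transform to stabilize variance, then impute nondetects with a
# model-based multivariate method from pcaMethods.
library(pcaMethods)
X <- matrix(rlnorm(20 * 10), nrow = 20)
X[sample(length(X), 15)] <- NA        # introduce some nondetects
logX <- log2(X)
fit  <- pca(logX, method = "ppca", nPcs = 3)
logX_imputed <- completeObs(fit)      # matrix with NAs replaced by estimates
```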
3.6. Data Exploitation
The unique level of understanding that makes metabolomics data suitable for a key role in functional genomics mainly results from this last step of data handling, namely, the biochemical interpretation in the context of pathway and background knowledge. Despite its importance, very few standardized procedures have been developed and/or published for this step, and some experts would probably consider the derivation of biochemical and pathobiochemical insight from multivariate metabolic datasets their proprietary methodology. The last few years have seen multiple efforts to systematically annotate endogenous metabolites, leading to databases such as KEGG (61), Reactome (62), BioCyc (63), HMDB (64), and OMIM (65). Despite suffering from some of the aforementioned shortcomings, these may serve as a more or less accepted framework for future knowledge collection. These databases also provide the background for various attempts at visualizing metabolic pathways and mapping data on these charts, although most of these projects still follow a static
Fig. 2. Ranking of metabolic pathways according to the significant differences between a study and control cohort in analogy to a gene set enrichment analysis (MarkerIDQ™ software, Biocrates).
approach of predefined (and predrawn) maps that cannot do justice to the dynamics of biochemical networks. In the following paragraphs, we would like to demonstrate a few concepts of how dynamic representation and simulation (66) of metabolic pathways enable the first steps of generating hypotheses from multivariate datasets. Firstly, the electronic availability of metabolites and metabolic reactions facilitates an almost trivial but nevertheless powerful approach that is analogous to a gene set enrichment analysis (GSEA) (67). Any given set of metabolites that has been identified by statistics as significantly different in two biological states or clinical cohorts can be mapped onto the entirety of metabolic pathways, and these pathways can then be ranked by the number of altered metabolites they contain (Fig. 2). This is a way of structuring the data that scientists from transcriptomics and proteomics are familiar with, although the definition of metabolic pathways does not follow a similarly strict classification system as the classical gene ontology (GO, (68)). Note also that a reliable selection of species-specific enzymatic reactions instead of the generic reference pathways is necessary to reduce the risk of false positive hits. Secondly, starting from a particular metabolite of interest, exploration of the reactions that either synthesize or degrade this metabolite immediately generates a list of enzymes of interest for further investigation. This concept of exploring shells of reactions around a metabolite is shown as an example for tryptophan metabolism in Fig. 3 and can be expanded stepwise around every metabolite serving as a new seed node. Each of these reactions can then be characterized by a ratio of product and substrate concentrations as a measure of enzymatic activity. Assessment of such ratios reduces biological noise and often dramatically increases the significance of the findings (69–71).
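The product/substrate ratio idea can be illustrated numerically: for a hypothetical reaction S -> P whose product is mildly elevated in cases, the log-ratio acts as a proxy for enzymatic activity and can be tested between cohorts. All names and effect sizes below are invented.

```r
# Invented example: test a product/substrate log-ratio between cohorts; the
# ratio is often more significant than either metabolite alone.
set.seed(7)
S_ctrl <- rlnorm(40, 3, 0.4);  P_ctrl <- rlnorm(40, 3,   0.4)
S_case <- rlnorm(40, 3, 0.4);  P_case <- rlnorm(40, 3.3, 0.4)
t.test(log(P_case / S_case), log(P_ctrl / S_ctrl))   # ratio-based comparison
```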
Fig. 3. Shellwise exploration of enzymatic reactions in tryptophan metabolism. L-tryptophan was used as the first seed node; after expansion of eight synthetic or degrading reactions, 5-hydroxy-L-tryptophan was used as a secondary seed node and further expanded (MarkerIDQ™ software, Biocrates).
Thirdly, moving even further from a traditional textbook representation of metabolic pathways, one can apply route finding algorithms to find and depict connections between metabolites of interest across the boundaries of (often artificially) predefined pathways. Such algorithms can identify the shortest route, routes up to a defined length, or routes that do not share a certain metabolite (termed node-disjoint paths) or enzyme (so-called edge-disjoint paths), depending on the respective biological question (Fig. 4). Here, the main prerequisite to avoid a potentially very large number of false positive hits is the exclusion of common cofactors and small inorganic molecules that connect many metabolites to many others, e.g., H2O, CO2, ATP, NADP. Using tools like these, and keeping in mind all the caveats discussed above, enzymes and pathways involved in the pathophysiology of a certain disease or in the mode-of-action of a drug can be identified more efficiently. In addition, hypotheses for designing further validation experiments and studies can be formulated.
Fig. 4. Route finding across metabolic pathways. Nine paths from arginine to spermine, ranging in length from four to six steps, were calculated based on the KEGG dataset, and the settings allowed for joint nodes, e.g., ornithine (MarkerIDQ™ software, Biocrates).
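A minimal route-finding sketch with the igraph package, on an invented miniature network corresponding to the arginine-to-spermine example of Fig. 4, might look as follows; common cofactors are simply left out of the edge list, mirroring the exclusion rule discussed above.

```r
# Shortest-route sketch on a toy metabolite graph; cofactors (ATP, H2O, ...)
# are omitted from the edge list to avoid trivial shortcuts.
library(igraph)
edges <- data.frame(
  from = c("arginine",  "ornithine",  "putrescine", "spermidine"),
  to   = c("ornithine", "putrescine", "spermidine", "spermine"))
g <- graph_from_data_frame(edges, directed = FALSE)
shortest_paths(g, from = "arginine", to = "spermine")$vpath
```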
Yet, all of this needs to be combined with another plausibility check, which originates from inherent redundancies in metabolism: quite often, groups of compounds are metabolized by the same enzyme and should, therefore, be influenced in at least a similar (if not the same) way by regulatory mechanisms, drugs, etc. If this rule of thumb is severely challenged, one should always check for possible analytical or statistical artifacts, or – not uncommon in pharmaceutical R&D – interference by xenobiotics, e.g., a drug or drug metabolite disturbing the signal for an endogenous metabolite.
3.7. Final Remarks
Metabolomics, possibly the most informative and diagnostically relevant branch of functional genomics, has made major inroads into biomedical research, pharmaceutical R&D, and clinical diagnostics. Data integration with other Omics sources has been attempted for many years, but doing this in a meaningful way (i.e., beyond simply merging data from different analytical platforms) is still an extremely ambitious goal. First successful examples of combining genome-wide association studies (GWA) with metabolic phenotypes, so-called metabotypes, have recently been published (70) and show significant promise for a more useful outcome of population-based association studies in general. As much as the analytical background and data formats differ between the Omics worlds, most of the basic issues, like data normalization, data reduction, and statistical assessment with all its caveats of multiple testing, remain more or less the same (at least from a qualitative perspective). The most important difference is the level of functional understanding that metabolomics currently offers to a much greater extent than the other -omes.
Yet, the tools for optimally exploiting this potential (i.e., for a bioinformatics-supported interpretation of metabolic datasets) are lagging far behind and deserve more attention in the scientific community. Finally, and this is, of course, true for every analytical or experimental improvement: new opportunities to generate more and more accurate data do not relieve scientists of the necessity to design their studies and interpret the results as thoroughly as possible. The enthusiasm for new "toys" always stimulates intellectual curiosity but requires even more scientific responsibility than conventional "non-Omics" experiments.
4. Notes

1. Stepwise workflow for the statistical treatment of metabolomics data

1.1. Data integrity/quality assessment
• Check data integrity in terms of sample assignment and annotation of the measurements.
• Exclude variables for which the majority of measurements are below a specified detection threshold (e.g., signal-to-noise ratio less than 3).
• Assess unplanned interactions between experimental factors and analytical processes, such as injection order or batch.
• Examine patterns between planned experiment covariates and metabolite detection occurrence. This typically refers to the identification of cases where the detection of a metabolite depends on a particular treatment group.
• Assess sample (i.e., biological replicate) variance in the light of technical (i.e., replicate from the same material) variance to identify unreliable measurements.
• Discard outlying samples. Judgment can be made from both univariate and multivariate angles: a high number of measurements below the detection threshold, or singular behavior in a multivariate space, e.g., PCA, or in diagnostic tools, such as the sum of all measurements (7).

1.2. Data transformation
• Apply a normalization method of choice after ensuring that it does not introduce any additional bias.
• Apply variance stabilization methods to alleviate the abundance/variance dependency.
• If the data comprise nondetected measurements, missing data should be imputed before further analyses, unless an adequate model introducing "left censoring" (i.e., measured concentrations below the LLOQ) is used.

1.3. Univariate analysis
• Perform univariate testing, generally by means of linear models including factors of interest and optionally other covariates, to derive p-values and fold changes.
• Derive appropriate indicators for estimating uncertainty (e.g., confidence intervals) and correct p-values for multiple testing.
• Select interesting changes on the basis of the expected levels of significance, fold changes, and possibly other measures, such as the area under the receiver operating characteristic curve (AUC).
• Place each marker in the biological context by comparing it to the gold standard and by assessing the biological plausibility of the findings.

1.4. Multivariate modeling
• Average, if necessary, replicates originating from the same biological sample so that only independent samples enter the analysis.
• Derive multivariate classification/regression models by repetitive splitting of the data into training/test sets (e.g., cross-validation, bootstrapping); a minimal sketch is given after this list. If required, optimization of the algorithm parameters (e.g., the number of components in a PLS model or the kernel parameters of a support vector machine) must be performed within the training set.
• Introduce feature selection/reduction in the optimization step to derive parsimonious models. As a result, the selection probability of a metabolite can be used to further assess its relevance.
• Repeat classification/regression modeling with different data splits and algorithms to assess the accuracy estimate.
• Examine sample predictions to highlight consistent misclassification patterns that reflect wrong class assignment, inadequate characterization of the treatment (e.g., nonresponders), or unexpected bias (e.g., sample collection conditions).
• If available, compare the predictive power of the set of markers to a gold standard, as well as its effectiveness on an independent set of samples (i.e., a validation set) that is normally agreed upon prior to statistical analysis.
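The repeated train/test splitting referenced in Note 1.4 can be illustrated with a hand-rolled nearest-centroid classifier, chosen only to keep the sketch self-contained; any classifier (PLS-DA, SVM, random forest) could be substituted, and the data are simulated.

```r
# Repeated train/test splitting with a nearest-centroid classifier on
# simulated data carrying a modest class difference in three features.
set.seed(3)
X <- matrix(rnorm(60 * 30), nrow = 60)
X[31:60, 1:3] <- X[31:60, 1:3] + 1            # "case" samples shifted
y <- rep(c("ctrl", "case"), each = 30)
acc <- replicate(100, {
  test <- sample(60, 15)                      # hold out 15 samples
  cen  <- rowsum(X[-test, ], y[-test]) / as.vector(table(y[-test]))
  d    <- as.matrix(dist(rbind(cen, X[test, ])))[-(1:2), 1:2]
  mean(rownames(cen)[apply(d, 1, which.min)] == y[test])
})
mean(acc)                                     # accuracy over 100 random splits
```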
2. Typical metabolomics data processing errors that can be detected during the data evaluation phase

| Diagnostics | Potential reasons |
|-------------|-------------------|
| Sparse dataset (i.e., missing values) | Overcorrection of the baseline, misalignment, incorrect setting of the deconvolution algorithm, aggressive smoothing |
| Metabolite detection patterns are complementary between two (or more) variables | Misalignment, database mismatches, inadequate peak picking or deconvolution |
| High variance observed in the QC samples | Inappropriate normalization to compensate for machine drift, wrong peak integration, additive noise |
| Sample injection order dominates variance (e.g., "horseshoe" in PCA) | Check randomization; poor biological variability, conservative baseline correction |
| Data mining reveals significant differences at inflection points of the peak rather than the apex | Misalignment; a different data compression strategy to be preferred |
| PCA loadings and/or univariate data mining show consistent offset over the m/z range | Review baseline correction parameterization and normalization method; use of internal standards to overcome matrix effects |
| Consecutive m/z have opposite loading coefficients in PCA | Incorrect m/z binning threshold/strategy |
| Poor reproducibility of classifier | Integration of noise, conservative smoothing, inclusion of system peaks |
References

1. Weinberger, K. M., and Graber, A. (2005) Using comprehensive metabolomics to identify novel biomarkers. Screen Trends Drug Discov 6, 42–5.
2. Weinberger, K. M. (2008) Metabolomics in diagnosing metabolic diseases. Ther Umsch 65, 487–91.
3. Weinberger, K. M., Ramsay, S. L., and Graber, A. (2005) Towards the biochemical fingerprint. Biosyst Solut 12, 36–7.
4. Beckmann, M., Parker, D., Enot, D. P., Duval, E., and Draper, J. (2008) High-throughput, nontargeted metabolite fingerprinting using nominal mass flow injection electrospray mass spectrometry. Nat Protoc 3, 486–504.
5. Shin, H., and Markey, M. K. (2006) A machine learning perspective on the development of clinical decision support systems utilizing mass spectra of blood samples. J Biomed Inform 39, 227–48.
6. Listgarten, J., and Emili, A. (2005) Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 4, 419–34.
7. Enot, D. P., Lin, W., Beckmann, M., Parker, D., Overy, D. P., and Draper, J. (2008) Preprocessing, classification modeling and feature selection using flow injection electrospray mass spectrometry metabolite fingerprint data. Nat Protoc 3, 446–70.
8. Karpievitch, Y. V., Hill, E. G., Smolka, A. J., Morris, J. S., Coombes, K. R., Baggerly, K. A., and Almeida, J. S. (2007) PrepMS: TOF MS data graphical preprocessing tool. Bioinformatics 23, 264–5.
9. Haimi, P., Uphoff, A., Hermansson, M., and Somerharju, P. (2006) Software tools for analysis of mass spectrometric lipidome data. Anal Chem 78, 8324–31.
10. Bylund, D. (2001) Chemometric tools for enhanced performance in liquid chromatography-mass spectrometry. Comprehensive summaries of Uppsala dissertations from the Faculty of Science and Technology, ISSN 1104-232X; 607.
11. Wang, W., Zhou, H., Lin, H., Roy, S., Shaler, T. A., Hill, L. R., et al. (2003) Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal Chem 75, 4818–26.
12. Jonsson, P., Johansson, A. I., Gullberg, J., Trygg, J., Grung, B., Marklund, S., et al. (2005) High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal Chem 77, 5635–42.
13. Zhu, W., Wang, X., Ma, Y., Rao, M., Glimm, J., and Kovach, J. S. (2003) Detection of cancer-specific markers amid massive mass spectral data. Proc Natl Acad Sci USA 100, 14666–71.
14. Zhao, Q., Stoyanova, R., Du, S., Sajda, P., and Brown, T. R. (2006) HiRes – a tool for comprehensive assessment and interpretation of metabolomic data. Bioinformatics 22, 2562–4.
15. Fredriksson, M. J., Petersson, P., Axelsson, B. O., and Bylund, D. (2009) An automatic peak finding method for LC-MS data using Gaussian second derivative filtering. J Sep Sci 32, 3906–18.
16. Katajamaa, M., and Oresic, M. (2007) Data processing for mass spectrometry-based metabolomics. J Chromatogr A 1158, 318–28.
17. Tan, C. S., Ploner, A., Quandt, A., Lehtiö, J., and Pawitan, Y. (2006) Finding regions of significance in SELDI measurements for identifying protein biomarkers. Bioinformatics 22, 1515–23.
18. Vivó-Truyols, G., Torres-Lapasió, J. R., van Nederkassel, A. M., Vander Heyden, Y., and Massart, D. L. (2005) Automatic program for peak detection and deconvolution of multi-overlapped chromatographic signals part I: peak detection. J Chromatogr A 1096, 133–45.
19. Fredriksson, M., Petersson, P., Jörnten-Karlsson, M., Axelsson, B. O., and Bylund, D. (2007) An objective comparison of pre-processing methods for enhancement of liquid chromatography-mass spectrometry data. J Chromatogr A 1172, 135–50.
20. Morris, J. S., Coombes, K. R., Koomen, J., Baggerly, K. A., and Kobayashi, R. (2005) Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 21, 1764–75.
21. Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G., Koong, A., and Le, Q. T. (2004) Sample classification from protein mass spectrometry by peak probability contrasts. Bioinformatics 20, 3034–44.
22. Lange, E., Tautenhahn, R., Neumann, S., and Gropl, C. (2008) Critical assessment of alignment procedures for LC-MS proteomics and metabolomics measurements. BMC Bioinformatics 9, 375.
23. Nordstrom, A., O'Maille, G., Qin, C., and Siuzdak, G. (2006) Nonlinear data alignment for UPLC-MS and HPLC-MS based metabolomics: quantitative analysis of endogenous and exogenous metabolites in human serum. Anal Chem 78, 3289–95.
24. Sadygov, R. G., Maroto, F. M., and Huhmer, A. F. (2006) ChromAlign: a two-step algorithmic procedure for time alignment of three-dimensional LC-MS chromatographic surfaces. Anal Chem 78, 8207–17.
25. Peters, S., van Velzen, E., and Janssen, H. G. (2009) Parameter selection for peak alignment in chromatographic sample profiling: objective quality indicators and use of control samples. Anal Bioanal Chem 394, 1273–81.
26. Eibl, G., Bernardo, K., Koal, T., Ramsay, S. L., Weinberger, K. M., and Graber, A. (2008) Isotope correction of mass spectrometry profiles. Rapid Commun Mass Spectrom 22, 2248–52.
27. Kind, T., Scholz, M., and Fiehn, O. (2009) How large is the metabolome? A critical analysis of data exchange practices in chemistry. PLoS ONE 4, e5440.
28. Goodacre, R., Vaidyanathan, S., Dunn, W. B., Harrigan, G., and Kell, D. B. (2004) Metabolomics by numbers: acquiring and understanding global metabolite data. Trends Biotechnol 22, 245–52.
29. Torgrip, R. J. O., Aberg, K. M., Alm, E., Schuppe-Koistinen, I., and Lindberg, J. (2008) A note on normalization of biofluid 1D 1H-NMR data. Metabolomics 4, 114–21.
30. Dieterle, F., Ross, A., Schlotterbeck, G., and Senn, H. (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem 78, 4281–90.
31. Warrack, B. M., Hnatyshyn, S., Ott, K. H., Reily, M. D., Sanders, M., Zhang, H., and Drexler, D. M. (2009) Normalization strategies for metabonomic analysis of urine samples. J Chromatogr B Analyt Technol Biomed Life Sci 877, 547–52.
32. Wang, P., Tang, H., Zhang, H., Whiteaker, J., Paulovich, A. G., and Mcintosh, M. (2006) Normalization regarding non-random missing values in high-throughput mass spectrometry data. Pac Symp Biocomput 11, 315–26.
33. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM070107.pdf
34. Sysi-Aho, M., Katajamaa, M., Yetukuri, L., and Oresic, M. (2007) Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics 8, 93.
35. Bijlsma, S., Bobeldijk, I., Verheij, E. R., Ramaker, R., Kochhar, S., Macdonald, I. A., et al. (2006) Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal Chem 78, 567–74.
36. Hermansson, M., Uphoff, A., Käkelä, R., and Somerharju, P. (2005) Automated quantitative analysis of complex lipidomes by liquid chromatography/mass spectrometry. Anal Chem 77, 2166–75.
37. Unterwurzacher, I., Koal, T., Bonn, G. K., Weinberger, K. M., and Ramsay, S. L. (2008) Rapid sample preparation and simultaneous quantitation of prostaglandins and lipoxygenase derived fatty acid metabolites by liquid chromatography-mass spectrometry from small sample volumes. Clin Chem Lab Med 46, 1589–97.
38. Go, E. P. (2009) Database resources in metabolomics: an overview. J Neuroimmune Pharmacol 5, 18–30.
39. http://www.mibbi.org
40. Taylor, C. F., Field, D., Sansone, S. A., Aerts, J., Apweiler, R., Ashburner, M., et al. (2008) Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26, 889–96.
41. http://msi-workgroups.sourceforge.net
42. MSI Board Members, Sansone, S. A., Fan, T., Goodacre, R., Griffin, J. L., Hardy, N. W., et al. (2007) The metabolomics standards initiative. Nat Biotechnol 25, 846–8.
43. Jenkins, H., Hardy, N., Beckmann, M., Draper, J., Smith, A. R., Taylor, J., et al. (2004) A proposed framework for the description of plant metabolomics experiments and their results. Nat Biotechnol 22, 1601–6.
44. http://dbkgroup.org/memo
45. Spasić, I., Dunn, W. B., Velarde, G., Tseng, A., Jenkins, H., Hardy, N. W., et al. (2006) MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics. BMC Bioinformatics 7, 281.
46. http://www.proteomecommons.org
47. Broadhurst, D. I., and Kell, D. B. (2006) Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2, 171–96.
48. Brown, M., Dunn, W. B., Ellis, D. I., Goodacre, R., Handl, J., Knowles, J. D., et al. (2005) A metabolome pipeline: from concept to data to knowledge. Metabolomics 1, 39–59.
49. Parsons, H. M., Ekman, D. R., Collette, T. W., and Viant, M. R. (2009) Spectral relative standard deviation: a practical benchmark in metabolomics. Analyst 134, 478–85.
50. Crews, B., Wikoff, W. R., Patti, G. J., Woo, H. K., Kalisiak, E., Heideker, J., and Siuzdak, G. (2009) Variability analysis of human plasma and cerebral spinal fluid reveals statistical significance of changes in mass spectrometry-based metabolomics data. Anal Chem 81, 8538–44.
51. van Belle, G., and Martin, D. C. (1993) Sample size as a function of coefficient of variation and ratio of means. Am Stat 47, 165–7.
52. Werner, E., Croixmarie, V., Umbdenstock, T., Ezan, E., Chaminade, P., Tabet, J. C., and Junot, C. (2008) Mass spectrometry-based metabolomics: accelerating the characterization of discriminating signals by combining statistical correlations and ultrahigh resolution. Anal Chem 80, 4918–32.
53. Draper, J., Enot, D. P., Parker, D., Beckmann, M., Snowdon, S., Lin, W., and Zubair, H. (2009) Metabolite signal identification in accurate mass metabolomics data with MZedDB, an interactive m/z annotation tool utilising predicted ionisation behaviour 'rules'. BMC Bioinformatics 10, 227.
54. Steuer, R. (2006) Review: on the analysis and interpretation of correlations in metabolomic data. Brief Bioinform 7, 151–8.
55. Mendes, P., Camacho, D., and de la Fuente, A. (2005) Modelling and simulation for metabolomics data analysis. Biochem Soc Trans 33, 1427–9.
56. Camacho, D., de la Fuente, A., and Mendes, P. (2005) The origin of correlations in metabolomics data. Metabolomics 1, 53–63.
57. Lu, C., and King, R. D. (2009) An investigation into the population abundance distribution of mRNAs, proteins, and metabolites in biological systems. Bioinformatics 25, 2020–7.
58. Purohit, P. V., Rocke, D. M., Viant, M. R., and Woodruff, D. L. (2004) Discrimination models using variance-stabilizing transformation of metabolomic NMR data. OMICS 8, 118–30.
59. Jain, R. B., Caudill, S. P., Wang, R. Y., and Monsell, E. (2008) Evaluation of maximum likelihood procedures to estimate left censored observations. Anal Chem 80, 1124–32.
60. Stacklies, W., Redestig, H., Scholz, M., Walther, D., and Selbig, J. (2007) pcaMethods – a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 1164–7.
61. http://www.genome.jp/kegg
62. http://www.reactome.org
63. http://biocyc.org
64. http://www.hmdb.ca
65. http://www.ncbi.nlm.nih.gov/omim
66. Modre-Osprian, R., Osprian, I., Tilg, B., Schreier, G., Weinberger, K. M., and Graber, A. (2009) Dynamic simulations on the mitochondrial fatty acid beta-oxidation network. BMC Syst Biol 3, 2.
67. http://www.broadinstitute.org/gsea
68. http://www.geneontology.org
69. Wang-Sattler, R., Yu, Y., Mittelstrass, K., Lattka, E., Altmaier, E., Gieger, C., et al. (2008) Metabolic profiling reveals distinct variations linked to nicotine consumption in humans – first results from the KORA study. PLoS ONE 3, e3863.
70. Gieger, C., Geistlinger, L., Altmaier, E., Hrabe de Angelis, M., Kronenberg, F., Meitinger, T., et al. (2008) Genetics meets metabolomics: a genome-wide association study of metabolite profiles in human serum. PLoS Genet 4, e1000282.
71. Altmaier, E., Ramsay, S. L., Graber, A., Mewes, H. W., Weinberger, K. M., and Suhre, K. (2008) Bioinformatics analysis of targeted metabolomics – uncovering old and new tales of diabetic mice under medication. Endocrinology 149, 3478–89.
Part III Applied Omics Bioinformatics
Chapter 17
Computational Analysis Workflows for Omics Data Interpretation
Irmgard Mühlberger, Julia Wilflingseder, Andreas Bernthaler, Raul Fechete, Arno Lukas, and Paul Perco

Abstract
Progress in experimental procedures has led to rapid availability of Omics profiles. Various open-access as well as commercial tools have been developed for storage, analysis, and interpretation of transcriptomics, proteomics, and metabolomics data. Generally, major analysis steps include data storage, retrieval, preprocessing, and normalization, followed by identification of differentially expressed features, functional annotation on the level of biological processes and molecular pathways, as well as interpretation of gene lists in the context of protein–protein interaction networks. In this chapter, we discuss a sequential transcriptomics data analysis workflow utilizing open-source tools, specifically exemplified on a gene expression dataset on familial hypercholesterolemia.

Key words: Omics data analysis, Bioinformatics workflow, Transcription factor, Protein network, Data interpretation
1. Introduction
High-throughput methods in molecular biology research, in particular microarray technologies and mass spectrometry, have led to the quantitative assessment of thousands of features on the level of the genome, transcriptome, proteome, and metabolome, resulting in the accumulation of a massive amount of data. Microarray technologies, initially restricted to applications in research, have in the meantime found their way into the clinic, e.g., the MammaPrint microarray-based test system cleared by the FDA in early 2007 for the prognosis of breast cancer patients (1). Beyond basic research and molecular diagnostics,
Omics procedures are also used for toxicological profiling as well as for drug discovery research in the hunt for novel therapeutic targets, to give just a few examples. With these well-established methodologies and standardized protocols for experimental processing in hand, the emphasis of research in recent years has been on the analysis of high-throughput data and the interpretation of results (2). Analysis steps include data storage, data annotation, data preprocessing, and normalization, followed by explorative and statistical analyses, functional interpretation, and hypothesis generation. For all these different steps, open-source tools are available, and databases storing Omics raw data have been vigorously populated. In this chapter, we address computational analysis workflows for the interpretation of Omics data. We provide links to databases, tools, and Web sites, discuss their applicability, and navigate through the analysis process using a given example dataset on gene expression profiles of monocytes from patients with familial hypercholesterolemia.
2. Materials
2.1. Omics Data Repositories
Public databases provide genomics and proteomics data for a wide range of cells, tissues, and diseases (Table 1A). Open-access repositories for microarray data are, e.g., the ArrayExpress database hosted by the European Bioinformatics Institute (EBI) (3), the Gene Expression Omnibus (GEO) developed at the National Center for Biotechnology Information (NCBI) (4), and the Stanford Microarray Database (SMD) (5). One of the most comprehensive collections of proteomics data is provided by SWISS-2DPAGE, hosted by the Swiss Institute of Bioinformatics (6, 7) (see Note 1). Standards for the annotation and exchange of microarray data have been introduced by the Microarray Gene Expression Data (MGED) Society. The Minimum Information About a Microarray Experiment (MIAME) guidelines describe the minimum information needed to unambiguously review and interpret the results of a microarray-based experiment (8).
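As a hedged example of programmatic access to such repositories, the Bioconductor package GEOquery can retrieve a GEO series directly; the accession used below is assumed to be the familial hypercholesterolemia monocyte dataset analyzed later in this chapter.

```r
# Programmatic retrieval of a GEO series with GEOquery; "GSE6054" is an
# assumed accession for the monocyte/familial hypercholesterolemia series.
library(GEOquery)
gse  <- getGEO("GSE6054", GSEMatrix = TRUE)
eset <- gse[[1]]            # an ExpressionSet: assay data plus annotation
dim(exprs(eset))            # probes x samples
pData(eset)[1:3, 1:4]       # sample (phenotype) information
```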
2.2. Data Preprocessing
Table 1
Listing of Omics repositories, Web resources, and analysis tools discussed in this chapter

A: Omics repositories
ArrayExpress | http://www.ebi.ac.uk/microarray-as/ae | (3)
Gene Expression Omnibus | http://www.ncbi.nlm.nih.gov/geo | (4)
Stanford Microarray Database | http://smd.stanford.edu | (5)
Proteomics database (SWISS-2DPAGE) | http://www.expasy.ch/ch2d | (6)

B: Data preprocessing
RMA | http://rmaexpress.bmbolstad.com | (9)
MAS5 | http://src.moffitt.usf.edu/sf/projects/libaffy | (10)
dChip | http://www.dchip.org | (12)

C: Explorative analysis routines
Bioconductor | http://www.bioconductor.org | (25)
SAM | http://www-stat.stanford.edu/~tibs/SAM | (27)
TIGR MeV | http://www.tm4.org/mev.html | (28)

D: Functional annotation
DAVID | http://david.abcc.ncifcrf.gov | (30)
PANTHER | http://www.pantherdb.org | (32)

E: Pathway analysis
KEGG | http://www.genome.jp/kegg/pathway.html | (31)
PANTHER | http://www.pantherdb.org | (32)
KEGG spider | http://mips.helmholtz-muenchen.de/proj/keggspider | (34)

F: In silico promoter analysis
JASPAR | http://jaspar.genereg.net | (35)
oPOSSUM | http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum | (36)

G: Interaction network analysis
STRING | http://string.embl.de | (37)
FunCoup | http://funcoup.sbc.su.se | (39)

A sequence of data preprocessing steps is required for the analysis of abundance data, e.g., from gene expression or protein profiling (Table 1B). Background correction and normalization of the data are the first steps to remove the impact of nonbiological influences potentially arising from different array batches used, or from varying intensities of different dyes. Frequently used background correction methods are the Robust Multi-array Average (RMA) method (9) and MAS 5.0 from the Affymetrix Microarray Suite (10). Normalization techniques are Quantile Normalization (RMA), Invariant Difference Selection (IDS) (11), and dChip (12). Further preprocessing is particularly important for gene expression data to achieve a reduction of data complexity. Filter routines focus on the elimination of entries which are probably invalid and do not contribute to informative results. One possible filter is to remove all objects for which the number of missing values over all experiments (arrays) performed exceeds a certain threshold. Missing values may be caused by improper resolution, image corruption, or physical defects. Methods for handling missing values span from simple row average estimates to more sophisticated approaches, e.g., based on
K-nearest-neighbor replacement (13), Bayesian variable selection (14), least squares replacement (15), or a combination of the above-mentioned procedures (16). Preprocessing of proteomic MS data aims to identify a list of m/z peak values to be used directly for further analyses. Analysis steps include background correction, filtering, noise estimation, peak detection, and spectral alignment algorithms (18–23). Nie et al. summarized current applications of statistics in the several stages of global gel-free proteomic analysis by mass spectrometry (17). For protein identification based on m/z data, several resources are available, e.g., MASCOT (18). After normalization issues are resolved, the annotation of Omics features is essential. The SOURCE tool from the Stanford Genomics Facility (19) and the GeneCards system from the Weizmann Institute of Science (20) are commonly used annotation databases/tools for DNA/mRNA and protein sequences.
2.3. Identification of Differentially Expressed Genes and Proteins
For the evaluation of differentially expressed genes (DEGs)/proteins, several methods based on test statistics are in use (Table 1C). A straightforward method is the Student's t-test, determining the significance of differences between the distributions of expression levels, combined with computation of the fold change. Correction for multiple testing is pivotal in the analysis of Omics data in order to reduce the number of false positive findings. A very stringent correction method is the Bonferroni correction, whereas less conservative methods are based on permutations, e.g., realized by the maxT and minP methods as described by Westfall and Young (21). Such permutation and resampling methods are described in detail by Dudoit et al. (22) and Ge et al. (23). Implementations of these algorithms can be found in the multtest Bioconductor package of the R statistics environment (24, 25). Bootstrap and jackknife procedures, both using randomly drawn subsets of the whole dataset, further strengthen the statistical findings and lower the susceptibility to outliers (26). Significance Analysis of Microarrays (SAM) is also based on data permutation but controls the false discovery rate (FDR), defined as the expected fraction of features falsely identified as significant among all features called significant (27). This method is widely accepted in microarray analysis. SAM is available as a stand-alone package and is also implemented in the MultiExperiment Viewer (MeV) developed at The Institute for Genomic Research (TIGR) (28).
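The logic of such a two-group test with multiple-testing correction can be sketched in a few lines of Python. The sketch below is not the SAM algorithm itself (no S0 fudge factor, no permutations); it applies a per-transcript t-test with Benjamini-Hochberg FDR adjustment to an invented expression matrix, with group sizes merely assumed.

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg FDR adjustment of a vector of p-values."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    # enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(ranked)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

# invented expression matrix: rows = transcripts, columns = arrays
rng = np.random.default_rng(0)
expr = rng.lognormal(5, 1, size=(1000, 23))
is_diseased = np.array([True] * 12 + [False] * 11)  # assumed group labels

t, p = stats.ttest_ind(expr[:, is_diseased], expr[:, ~is_diseased], axis=1)
q = bh_adjust(p)
fold = expr[:, is_diseased].mean(axis=1) / expr[:, ~is_diseased].mean(axis=1)
deg = np.where((q < 0.05) & (np.abs(np.log2(fold)) >= 1))[0]
print(f"{deg.size} transcripts pass FDR < 0.05 and twofold change")
```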
2.4. Functional Annotation and Pathway Enrichment Analysis
One approach for functional grouping of genes or proteins identified as relevant from a statistical viewpoint is realized by utilizing gene ontologies (GO), categorizing proteins according to their molecular functions, cellular components, and biological processes (31, 32) (Table 1D). Another classification system is the Protein ANalysis THrough Evolutionary Relationships (PANTHER) ontology (33–35). Generally, ontologies are controlled vocabularies and can be represented as acyclic, directed graphs, where each ontology category can have one or more parent terms and subterms. Statistical tools exist to identify enriched or depleted categories for a list of genes or proteins of interest (29). One of these tools is the Database for Annotation, Visualization, and Integrated Discovery (DAVID) (30). Pathway databases like the one from the Kyoto Encyclopedia of Genes and Genomes (KEGG) (31) complement the functional ontologies and can give even more information on the interplay of genes and proteins. Other pathway databases describing metabolic networks and signal transduction cascades are BioCarta, the PANTHER pathway database (32), and Reactome (33). KEGG spider provides a robust analytical framework for the interpretation of gene lists in the context of a global gene metabolic network (34) (Table 1E).
2.5. In Silico Promoter Analysis
Transcription factors are key elements in the regulation of transcription, exerting their function by binding to the promoter region of a gene as well as to regulatory elements further away from the transcription start site (Table 1F). JASPAR is a database holding binding site matrices for specific transcription factors, which can be used by pattern matching algorithms to scan genomic sequences for potential transcription factor binding sites (TFBS) (35). The JASPAR Core database provides a curated, nonredundant set of binding profiles from experimentally defined TFBS for eukaryotes reported in the literature. For a given list of differentially regulated genes or proteins, the search for enriched TFBS in the regulatory regions becomes feasible. The oPOSSUM database holds precalculated TFBS in the regulatory regions of human genes that can be used to identify enriched transcription factors in a set of deregulated genes (36). The regulatory regions of human genes are identified by searching for conserved regions in the mouse genome (phylogenetic footprinting) using different stringency criteria. The oPOSSUM tool uses TFBS as stored in the JASPAR database.
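The matching of a binding-site matrix against a sequence, as performed by such pattern matching algorithms, reduces to scoring every window of the sequence with a position weight matrix (PWM). The following minimal sketch uses an invented count matrix, not an actual JASPAR profile:

```python
import numpy as np

# invented count matrix for a 4-bp binding site (rows: A, C, G, T)
counts = np.array([[12,  3,  0, 10],
                   [ 2,  1,  1,  2],
                   [ 1, 11, 14,  1],
                   [ 5,  5,  5,  7]], dtype=float)
freqs = (counts + 0.25) / (counts + 0.25).sum(axis=0)  # add pseudocounts
pwm = np.log2(freqs / 0.25)                            # log-odds vs uniform background
idx = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_hit(seq, pwm):
    """Slide the PWM over a sequence; return the best score and its position."""
    w = pwm.shape[1]
    scores = [sum(pwm[idx[base], j] for j, base in enumerate(seq[i:i + w]))
              for i in range(len(seq) - w + 1)]
    pos = int(np.argmax(scores))
    return scores[pos], pos

print(best_hit("TTAGGCATAGGC", pwm))
```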
2.6. Integrated Approaches
Besides sequential workflows following a step-by-step analysis, several integrated approaches exist (Table 1G). One example is STRING, provided by EMBL, which aims to present genes directly or indirectly related to a query gene (37, 38). The basis of STRING is a protein network obtained from integrating high-confidence data, high-throughput experiments, and computationally derived data for more than 2.5 million proteins
occurring in 630 organisms. Information is integrated over organisms, and the respective proteins are represented as clusters of orthologous groups. STRING currently integrates protein interactions, co-expression data, literature co-occurrences, genomic context encoded by conserved genomic neighborhoods, gene fusion events, and phylogenetic co-occurrences. For each pair of proteins, STRING precomputes a detailed measure of evidence based on each available data source describing the association between the two proteins. These subscores are combined into a single evidence score. A STRING query is performed by entering a gene name, protein name, or protein sequence, or a list of identifiers or sequences. As a result, STRING shows an integrated, interactively expandable view of the network context of the input proteins, enriched with biological information associated with these proteins. The routine FunCoup globally reconstructs protein networks in human and other eukaryotes from comprehensive data integration, namely protein–protein interactions, mRNA expression, subcellular location, phylogenetic profiles, miRNA–mRNA targeting, transcription factor binding sites, protein expression, and domain–domain interactions (39). The software utilizes InParanoid to transfer information between species. In the course of visualization, the user is provided with the option to group networks by the spatial subcellular position of proteins or by their membership in pathways, or to use a force-directed layout. Furthermore, where possible, a detailed description of the type of association between the proteins is provided (direct physical interaction, protein complex membership, metabolic reaction, regulatory/signaling). omicsNET is another data integration framework supporting researchers throughout the analysis of disease-specific data in identifying and selecting potential diagnostic markers or therapeutic targets (40). Pairwise dependencies between human proteins are calculated based on the following data sources: gene expression profiles in normal human tissues, functional gene annotation based on gene ontologies as well as on pathway information, shared transcription factor binding sites as well as miRNA profiles, information on subcellular protein localization, protein–protein interaction data, and shared protein domains. Based on these dependencies, a protein network is constructed which is easily extendable and is embedded in a fully automatic downloading and importing framework capable of following the fast update cycles of scientific data repositories and data formats. Objects are centered around a general definition of biological entities based on International Protein Index (IPI) IDs, presently covering about 68k protein sequences (41).
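The principle behind combining several independent evidence channels into a single association score can be illustrated with a naive probabilistic integration; this is a simplified stand-in for STRING's actual scoring scheme (which additionally corrects each channel for a random-expectation prior), with made-up channel scores:

```python
def combine_evidence(scores, prior=0.0):
    """Naively combine independent evidence channels into one score.

    Each channel score is treated as a probability that the association
    is real; the combined score is 1 minus the probability that all
    channels are wrong. 'prior' optionally removes a shared random
    expectation from each channel first (simplified STRING-style).
    """
    p_all_wrong = 1.0
    for s in scores:
        s_adj = max(0.0, (s - prior) / (1.0 - prior)) if prior else s
        p_all_wrong *= 1.0 - s_adj
    return 1.0 - p_all_wrong

# assumed channel scores: experiments, co-expression, text mining
print(combine_evidence([0.62, 0.30, 0.45]))  # -> 0.8537
```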
3. Methods
In the following section, the tools described above are exemplarily applied to a publicly available gene expression dataset. Mosig and colleagues profiled the gene expression of monocytes of patients with familial hypercholesterolemia (FH) (42). In this study, microarray gene expression experiments were performed using Affymetrix HG-U133 Plus 2.0 GeneChips, each holding 54,675 unique transcripts.
3.1. Omics Data Repositories and Data Retrieval
The example dataset is deposited in the public GEO database (http://www.ncbi.nlm.nih.gov/geo) hosted by NCBI and is reachable via the GEO accession number "GSE6054." The summary page of this specific record holds a short summary of the study, the experiment type, the samples used in the experiment, as well as the contributors. The contact details of the corresponding author as well as the date of submission are furthermore provided. The raw data files are provided as a zipped archive, which includes 23 Affymetrix CEL files providing the basis for further preprocessing and analysis (see Note 2).
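Retrieval can also be scripted. GEO organizes series files under a predictable FTP path, so the series matrix file discussed in Note 2 can be fetched and parsed directly; the URL pattern below reflects the customary GEO layout and should be verified before use (assumes pandas is installed):

```python
import gzip
import io
import urllib.request

import pandas as pd

# GEO's customary layout for series matrix files (verify before relying on it)
url = ("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE6nnn/GSE6054/"
       "matrix/GSE6054_series_matrix.txt.gz")

raw = urllib.request.urlopen(url).read()
with gzip.open(io.BytesIO(raw), mode="rt") as handle:
    # metadata lines start with '!', the expression matrix follows
    expr = pd.read_csv(handle, sep="\t", comment="!", index_col=0)

print(expr.shape)        # transcripts x samples
print(expr.columns[:3])  # GSM sample accessions
```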
3.2. Data Preprocessing
Main data preprocessing steps involve background correction and data normalization. One tool capable of handling both tasks in a user-friendly way is CARMAweb, developed at the Technical University of Graz (https://carmaweb.genome.tugraz.at) (43). Creating an account in CARMAweb allows the user to store files and results for later analysis. CARMAweb supports a number of file formats generated by the scanner software of different platforms, including Affymetrix, Applied Biosystems, as well as two-color systems. When using Affymetrix data, the CEL files have to be uploaded to the system in order to start the preprocessing procedure, as described step by step below (see Note 3 for a detailed discussion of input parameters and resulting plots):
1. Choose New Analysis from the tool bar
2. Select Perform an Affymetrix GeneChip analysis
3. Upload the raw data CEL files for the analysis
4. Select the preprocessing method "mas5"
5. Scale the values to 200
6. Check the boxes for drawing additional plots from the raw and normalized data
7. Check the box Save the normalized expression values to a text file
8. Skip the replicate handling step, as there are no replicated arrays in this example dataset
9. Start the analysis
Results as well as the analysis protocols are accessible after the preprocessing steps are completed. The analysis report contains a summary of the performed analysis steps as well as plots for checking the quality of the given array data. The normalized expression dataset that is used for further analysis is denoted as "ExpressionValues.txt" and can be downloaded to a local machine (Table 2). Features are annotated with their respective NCBI GenBank accession number, NCBI UniGene Cluster ID, NCBI Entrez Gene ID (LocusLink ID), and NCBI Gene Symbol, as well as a short summary. Result files can be downloaded separately or as a compressed archive.
3.3. Identification of Differentially Expressed Genes
The preprocessed and normalized data file "ExpressionValues.txt" is the basis for the identification of DEGs. The main interest in our study is the identification of genes that show differential expression between subjects with familial hypercholesterolemia and healthy controls. Various open-source as well as commercial tools exist for this task, as outlined in the Materials section. One open-source tool that we consider very intuitive to use is the MeV developed at TIGR (http://www.tm4.org/mev.html). MeV handles tab-delimited text files holding expression datasets, such as our normalized file "ExpressionValues.txt." Various statistical tests are implemented in the MeV software package, among them the t-test, the Analysis of Variance (ANOVA) for multigroup comparisons, and the Significance Analysis of Microarrays (SAM) method controlling the FDR. The following steps result in a list of significantly differentially expressed transcripts using the SAM method (see Note 4 for a detailed discussion of input parameters):
1. Select Load Data from the MeV file menu
2. Check Single-color Array in the Expression File Loader dialog box
3. Load the file "ExpressionValues.txt"
4. Select Significance Analysis for Microarrays from the Statistics tab
5. Select the Two-class unpaired tab
6. Assign diseased samples to group A and healthy control samples to group B
7. Set the number of permutations to 500
8. Select S0 using Tusher et al. method
9. Check no for calculating q-values
Table 2
Excerpt of the file "ExpressionValues.txt" resulting from CARMAweb preprocessing

Id | GenBank | UniGene | Description | LocusLink | Symbol | GSM140232.CEL | GSM140233.CEL
1007_s_at | U48705 | Hs.631988 | Discoidin domain receptor family, member 1 | 780 | DDR1 | 133.1116888 | 129.7100459
1053_at | M87338 | Hs.647062 | Replication factor C (activator 1) 2, 40 kDa | 5982 | RFC2 | 217.6610085 | 239.3148494
117_at | X51757 | Hs.654614 | Heat shock 70 kDa protein 6 (HSP70B') | 3310 | HSPA6 | 781.1669739 | 465.5422967
121_at | X69699 | Hs.469728 | Paired box 8 | 7849 | PAX8 | 204.6355705 | 281.2443974
1255_g_at | L36861 | Hs.92858 | Guanylate cyclase activator 1A (retina) | 2978 | GUCA1A | 5.513836299 | 12.32890509
1294_at | L13852 | Hs.16695 | Ubiquitin-activating enzyme E1-like | 7318 | UBE1L | 766.9258871 | 742.6529846
1316_at | X55005 | Hs.724 | Thyroid hormone receptor, alpha | 7067 | THRA | 53.92162423 | 79.99926573
1320_at | X79510 | Hs.437040 | Protein tyrosine phosphatase, nonreceptor type 21 | 11099 | PTPN21 | 7.528551122 | 6.906572974
1405_i_at | M21121 | Hs.514821 | Chemokine (C-C motif) ligand 5 | 6352 | CCL5 | 5625.3812 | 5054.011476
1431_at | J02843 | Hs.12907 | Cytochrome P450, family 2, subfamily E, polypeptide 1 | 1571 | CYP2E1 | 37.95928637 | 31.52161119
1438_at | X75208 | Hs.2913 | EPH receptor B3 | 2049 | EPHB3 | 15.63201307 | 14.03528667
1487_at | L38487 | Hs.110849 | Estrogen-related receptor alpha | 2101 | ESRRA | 510.4666229 | 430.1333161
1494_f_at | M33318 | Hs.439056 | Cytochrome P450, family 2, subfamily A, polypeptide 6 | 1548 | CYP2A6 | 72.22351713 | 79.30259241
1552256_a_at | NM_005505 | Hs.520348 | Scavenger receptor class B, member 1 | 949 | SCARB1 | 294.3431947 | 239.620105
1552257_a_at | NM_015140 | Hs.517670 | Tubulin tyrosine ligase-like family, member 12 | 23170 | TTLL12 | 483.2240522 | 425.4874463
1552258_at | NM_052871 | Hs.652166 | Chromosome 2 open reading frame 59 | 112597 | C2orf59 | 17.18700753 | 24.89409347
1552261_at | NM_080735 | Hs.2719 | WAP four-disulfide core domain 2 | 10406 | WFDC2 | 35.72363554 | 52.43665489
1552263_at | NM_138957 | Hs.431850 | Mitogen-activated protein kinase 1 | 5594 | MAPK1 | 872.2548451 | 604.3554185
1552264_a_at | NM_138957 | Hs.431850 | Mitogen-activated protein kinase 1 | 5594 | MAPK1 | 676.0495879 | 887.2818401
1552266_at | NM_145004 | Hs.521545 | ADAM metallopeptidase domain 32 | 203102 | ADAM32 | 40.02411982 | 37.37860646

The first six columns hold the main identifiers and a short description for all of the 54,675 transcripts included on the Affymetrix HG-U133 Plus 2.0 GeneChip; the last columns hold the normalized expression values for each array (the values of the first two arrays are exemplarily shown).
Fig. 1. Example output graph resulting from a SAM analysis. The two dotted lines delimit the region within +/− delta units (here set to 1.156) around the observed-versus-expected line. Genes whose plot values fall within +/− delta units are considered nonsignificant, those above + delta units are considered significantly upregulated, and those below − delta units are considered significantly downregulated.
10. Select K-nearest neighbors impute as Imputation Engine with ten neighbors
11. Start analysis
Once the analysis has finished, the resulting SAM graph is displayed, reporting the number of significantly differentially regulated genes for the group comparison as well as the median number of false positive genes at a given delta threshold level (Fig. 1). The slider for controlling the delta value at the bottom of the graph can be used to set the FDR, representing the fraction of false positive genes among the total number of genes indicated as being differentially regulated. Usually, values in the range of 5–10% are acceptable. In our experimental setting, a delta value of 1.156 results in 1,016 significant genes, with a median number of 50 falsely significant genes. Please note that these results may vary slightly when the analysis is redone, due to the sequence of random permutations used. The list of 1,016 significant genes can be displayed by selecting the node Table Views/All Significant Genes in the folder Analysis Results/SAM on the left of the MeV navigation window. The table of significant genes can be downloaded via selecting Save cluster from the menu. Using the fold change criterion can further reduce the list of interesting genes to be considered for further analysis. The fold change determines how many times the expression level of a given transcript is increased or decreased in the diseased samples as compared to the healthy individuals. Focusing on genes showing at least a twofold change in either direction further reduces the dataset from 1,016 DEGs to 97 DEGs.
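The relationship between the chosen threshold, the number of called genes, and the FDR reported by the SAM plot can be illustrated with a bare-bones permutation estimate; the sketch below uses plain t-statistics on invented data and is a simplification of the actual SAM procedure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
expr = rng.normal(size=(5000, 23))                 # invented data, no real signal
labels = np.array([1] * 12 + [0] * 11, dtype=bool)
threshold = 3.0                                    # stands in for the delta cut-off

def n_called(x, grp, thr):
    """Count features whose absolute t-statistic exceeds the threshold."""
    t, _ = stats.ttest_ind(x[:, grp], x[:, ~grp], axis=1)
    return int(np.sum(np.abs(t) > thr))

observed = n_called(expr, labels, threshold)
null_counts = [n_called(expr, rng.permutation(labels), threshold)
               for _ in range(500)]
fdr = np.median(null_counts) / max(observed, 1)
print(f"called: {observed}, median false calls: {np.median(null_counts):.0f}, "
      f"estimated FDR: {fdr:.1%}")
```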
3.4. Functional Annotation and Pathway Enrichment Analysis
DEGs can be linked to gene ontology categories in order to identify enriched or depleted biological processes, as implemented in the DAVID tool (http://david.abcc.ncifcrf.gov). Input is a list of NCBI Gene IDs, e.g., DEGs or, more generally speaking, genes of interest, which can either be pasted into the data input field provided by the application or uploaded as a simple text file. The following steps are necessary to complete the analysis:
1. Select Start Analysis from the tool bar
2. Paste the list of identifiers into "box A" or upload the identifiers from a text file
3. Select ENTREZ_GENE_ID as Identifier
4. Select Gene List as List Type
5. Submit list
6. Choose HOMO SAPIENS as species in the "List Manager"
DAVID integrates several tools for data annotation, and in a first step we assign GO terms and KEGG pathways to the individual genes:
1. Select Functional Annotation Table
2. Check the boxes GOTERM_BP_ALL, GOTERM_CC_ALL, and GOTERM_MF_ALL from the Gene Ontology node and KEGG_PATHWAY from the Pathways node on the Annotation Summary Results Page
3. Select Functional Annotation Table
4. A separate window opens showing a table with all submitted Entrez Gene IDs and their functional categories (Fig. 2)
5. Download the table as a text file
Another Web tool for categorizing genes by their biological function is PANTHER (http://www.pantherdb.org). To analyze the genes differentially expressed between FH and healthy monocytes (as given for our example case) in terms of functional enrichment compared to the whole NCBI H. sapiens gene list, the following steps have to be performed:
1. Select Tools from the tool bar
2. Choose Gene Expression Data Analysis and Compare gene lists
3. Select Gene ID as identifier and upload the list of Entrez Gene IDs for the DEGs
Fig. 2. Example analysis output utilizing DAVID. Given are the gene ontology terms for two differentially expressed genes.
Fig. 3. PANTHER analysis example output. The second and third columns hold the number of genes in the reference and FH lists mapping to the PANTHER classification category in the first column. The expected number of genes in the respective category is listed in column 4. A plus or minus sign in the fifth column indicates over- or underrepresentation of features for the given category. The last column of the results table holds the p-values from a chi-square test, indicating the significance of the deviation of the identified number of features from the number expected for the particular category.
4. Finish selecting lists
5. Select NCBI: H. sapiens genes as reference list
6. Check Biological Processes
7. Launch analysis
8. Download the results table (Fig. 3)
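The enrichment statistics reported by tools such as DAVID and PANTHER are variants of classical contingency tests. As a hedged illustration of the underlying idea, the following sketch computes a hypergeometric enrichment p-value for a single category, with all counts invented:

```python
from scipy.stats import hypergeom

# assumed toy numbers: N genes on the array, n of them in one GO category,
# K differentially expressed genes, k of those falling into the category
N, n, K, k = 20000, 400, 97, 8

# probability of observing k or more category members among K draws
p_enriched = hypergeom.sf(k - 1, N, n, K)
expected = K * n / N
print(f"expected {expected:.1f}, observed {k}, p = {p_enriched:.3g}")
```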
3.5. In Silico Promoter Analysis
Transcription factors with enriched binding sites in a set of genes or proteins can be identified with the oPOSSUM tool (http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum). Gene as well as protein identifiers are accepted by the analysis tool, such as Ensembl IDs, HUGO Gene Symbols or aliases, RefSeq IDs, or Entrez Gene IDs. The following steps are necessary to obtain transcription factors with enriched binding sites (for a discussion of input parameters, see Note 5):
1. Select as organism either human or mouse
2. Select the type of identifier and upload your list of IDs
3. Select all JASPAR Core profiles with a specificity of 10 bits
4. Set the level of conservation to the top 10% of conserved regions and the matrix match threshold to 85%
5. Define the region with respect to the transcription start site to be searched for binding sites
6. Focus on significantly enriched transcription factors by setting the Z-score to ≥5 and the p-value of the Fisher's exact test to ≤0.05
In our example, the transcription factor NR2F1 is found to be significantly enriched, with a p-value of <0.001 and a Z-score of 8.069, when searching 2,000 base pairs upstream of the transcription start sites of all upregulated genes. Next to the statistics, the counts of TFBS in our gene set as well as in the background gene set are given, along with the respective transcription factor class and supergroup. A detailed view of the predicted binding sites in the analysis dataset is accessible via the link in the field of target gene hits.
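Such enrichment statistics can be approximated locally once TFBS counts for the target and background sets are in hand. The sketch below computes a Fisher's exact test and a simple one-proportion Z-score that is analogous in spirit, but not identical, to the oPOSSUM Z-score (which operates on per-nucleotide site counts); all counts are invented:

```python
from scipy.stats import fisher_exact

# assumed counts: genes with at least one site for one factor,
# in the target promoters vs a background gene set
target_hits, target_total = 14, 97
bg_hits, bg_total = 800, 24000

odds, p = fisher_exact([[target_hits, target_total - target_hits],
                        [bg_hits, bg_total - bg_hits]],
                       alternative="greater")

# one-proportion Z-score against the background rate
p0 = bg_hits / bg_total
z = (target_hits - target_total * p0) / (target_total * p0 * (1 - p0)) ** 0.5
print(f"Fisher p = {p:.3g}, Z = {z:.2f}")
```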
3.6. Integrated Approaches
The STRING tool (http://string.embl.de) for the generation of protein interaction networks accepts both protein identifiers and protein sequences as input. To retrieve protein identifiers from the list of DEGs, the DAVID tool can be used. The procedure is the same as described above for the assignment of GO terms and pathways, but the box UNIPROT_ACCESSION from the Main Accession node has to be selected. The following steps lead to a STRING network of proteins from the DEGs:
1. Select the multiple names tab from the search box
2. Paste the list of protein identifiers in the respective box
3. Choose Homo sapiens as organism
4. Start the analysis
5. Review the list of input proteins and continue
The resulting network holds the uploaded proteins and can be further expanded with additional interacting partners by selecting the "more" buttons below the graphics. The default network view is the evidence view, where nodes represent proteins and edge color indicates the type of evidence for the association. Further views can be selected at the bottom of the results page. Figure 4 shows a resulting subgraph obtained when expanding the entire network of DEGs by adding ten additional partners with the highest evidence scores. For the given example, most of the members are involved in mRNA transcription.
Fig. 4. Subgraph extracted from the STRING protein network. Edge colors indicate the type of interaction. Dark edges: interaction based on text mining; Gray edges: experimental interaction evidence; Light gray edges: information from other databases.
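Network exports from STRING (or any other interaction resource) can be filtered and inspected locally; the sketch below assumes a hypothetical tab-separated edge list with a combined score on STRING's customary 0-999 scale, and uses the networkx package:

```python
import networkx as nx
import pandas as pd

# hypothetical STRING export: protein1, protein2, combined_score (0-999)
edges = pd.DataFrame({
    "protein1": ["MAPK1", "MAPK1", "CCL5"],
    "protein2": ["MAPK3", "ELK1", "CCR5"],
    "combined_score": [999, 920, 950],
})

G = nx.Graph()
for row in edges.itertuples(index=False):
    if row.combined_score >= 700:  # keep high-confidence edges only
        G.add_edge(row.protein1, row.protein2, weight=row.combined_score)

hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)
print(hubs[:5])  # most connected proteins in the subnetwork
```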
4. Notes
1. A listing of the databases, Web-based resources, and tools discussed in this work is given in Table 1.
2. Next to the zipped CEL files, the GEO accession summary page provides links to three additional files holding information on metadata and the normalized expression values. The SOFT formatted family file and the MINiML formatted family file include information about the family of the specific accession in text or XML format, respectively. Family implies all records related to the accession, including platform, sample, and series records. The third file, called "Series Matrix File," is a text file holding expression values for all samples in matrix format. The header of these files contains all relevant metadata, including the abstract, contributors, sample hybridization protocol, processing method, etc., and can be used as input for analysis software packages like the TIGR MeV tool.
3. CARMAweb provides several different methods for preprocessing, including MAS5 and RMA; additionally, a custom normalization can be defined. The custom normalization allows the user to select from various methods for the consecutive steps of the preprocessing procedure. In order to make arrays comparable, the expression values are scaled up or down using a predefined intensity value, which is by default set to 200 when using MAS5 in CARMAweb. A histogram and a boxplot of the raw data as well as of the normalized data are drawn after checking the respective box. These plots can give a first impression of the data and array quality. If a dataset includes array replicates, they can be merged by calculating the mean expression values across the replicates.
4. SAM is implemented for two-class unpaired, two-class paired, multi-class, censored survival, and one-class group comparisons. Because the FH dataset used in the given example case consists of two groups (diseased and healthy) and no pairing of samples is available, we choose the two-class unpaired design. For our dataset, we consider 500 permutations to be sufficient for reaching robust results. This number, however, can be increased up to the point where all possible permutations are performed. If a number higher than the possible number of unique permutations is entered, the user is asked whether to use all possible permutations. The S0 constant minimizes the coefficient of variation of the relative difference in gene expression and is computed as a percentile based on alpha, which indicates the probability of false positive results. Q-values can be computed to indicate the lowest FDR at which the transcript is denoted as significant. For imputation of missing values, SAM provides two methods, namely, the K-nearest neighbor algorithm and the row average method. The K-nearest neighbor algorithm replaces missing values based on the K nearest neighbors according to the Euclidean distance, whereas the row average method simply uses the mean of the expression values for the respective transcript over all arrays.
5. In order to reduce the number of false positive predictions, the use of more stringent input parameters is advised. We only use transcription factor binding matrices with a minimum specificity of 10 bits and a matrix match threshold of 85%. Additionally, only the top 10% of conserved regions with a minimum conservation of 70% are used.

References
1. Wittner, B. S., Sgroi, D. C., Ryan, P. D., Bruinsma, T. J., Glas, A. M., Male, A., Dahiya, S., Habin, K., Bernards, R., Haber, D. A., Van't Veer, L. J., and Ramaswamy, S. (2008) Analysis of the MammaPrint breast cancer assay in a predominantly postmenopausal cohort. Clin Cancer Res 14, 2988–93.
2. Perco, P., Rapberger, R., Siehs, C., Lukas, A., Oberbauer, R., Mayer, G., and Mayer, B. (2006) Transforming omics data into context: bioinformatics on genomics and proteomics raw data. Electrophoresis 27, 2659–75.
3. Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N., Coulson, R., Farne, A., Holloway, E., Kolesnykov, N., Lilja, P., Lukk, M., Mani, R., Rayner, T., Sharma, A., William, E., Sarkans, U., and Brazma, A. (2007) ArrayExpress – a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 35, D747–50.
4. Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., Marshall, K. A., Phillippy, K. H., Sherman, P. M., Muertter, R. N., and Edgar, R. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res 37, D885–90.
5. Demeter, J., Beauheim, C., Gollub, J., Hernandez-Boussard, T., Jin, H., Maier, D., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., Sherlock, G., and Ball, C. A. (2007) The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res 35, D766–70.
6. Hoogland, C., Mostaguir, K., Sanchez, J. C., Hochstrasser, D. F., and Appel, R. D. (2004) SWISS-2DPAGE, ten years later. Proteomics 4, 2352–6.
7. Smolka, M., Zhou, H., and Aebersold, R. (2002) Quantitative protein profiling using two-dimensional gel electrophoresis, isotope-coded affinity tag labeling, and mass spectrometry. Mol Cell Proteomics 1, 19–29.
8. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29, 365–71.
9. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–64.
10. Affymetrix (2001) Statistical algorithms reference guide. Technical Report, Affymetrix.
11. Schadt, E. E., Li, C., Ellis, B., and Wong, W. H. (2001) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J Cell Biochem Suppl 37, 120–5.
12. Li, C., and Wong, W. H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2, RESEARCH0032.
13. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17, 520–5.
14. Zhou, X., Wang, X., and Dougherty, E. R. (2003) Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 19, 2302–7.
15. Bo, T. H., Dysvik, B., and Jonassen, I. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 32, e34.
16. Jornsten, R., Wang, H. Y., Welsh, W. J., and Ouyang, M. (2005) DNA microarray data imputation and significance analysis of differential expression. Bioinformatics 21, 4155–61.
17. Nie, L., Wu, G., and Zhang, W. (2008) Statistical application and challenges in global gel-free proteomic analysis by mass spectrometry. Crit Rev Biotechnol 28, 297–307.
18. Grosse-Coosmann, F., Boehm, A. M., and Sickmann, A. (2005) Efficient analysis and extraction of MS/MS result data from Mascot result files. BMC Bioinformatics 6, 290.
19. Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J. C., Hernandez-Boussard, T., Rees, C. A., Cherry, J. M., Botstein, D., Brown, P. O., and Alizadeh, A. A. (2003) SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res 31, 219–23.
20. Safran, M., Chalifa-Caspi, V., Shmueli, O., Olender, T., Lapidot, M., Rosen, N., Shmoish, M., Peter, Y., Glusman, G., Feldmesser, E., Adato, A., Peter, I., Khen, M., Atarot, T., Groner, Y., and Lancet, D. (2003) Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res 31, 142–6.
21. Westfall, P. H., and Young, S. S. (1993) Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley series in probability and mathematical statistics. Wiley, New York.
22. Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003) Multiple hypothesis testing in microarray experiments. Statistical Science 19, 1090–9.
23. Ge, Y., Dudoit, S., and Speed, T. P. (2003) Resampling-based multiple testing for microarray data analysis. TEST 12, 1–44.
24. van der Laan, M. J., Dudoit, S., and Pollard, K. S. (2004) Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Stat Appl Genet Mol Biol 3, Article 14.
25. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y., and Zhang, J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80.
26. Efron, B., and Tibshirani, R. J. (1993) An introduction to the bootstrap. Chapman and Hall, New York.
27. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98, 5116–21.
28. Saeed, A. I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A., Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., and Quackenbush, J. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques 34, 374–8.
29. Khatri, P., and Draghici, S. (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21, 3587–95.
30. Huang da, W., Sherman, B. T., and Lempicki, R. A. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4, 44–57.
31. Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res 30, 42–6.
32. Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Muruganujan, A., Doremieux, O., Campbell, M. J., Kitano, H., and Thomas, P. D. (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res 33, D284–8.
33. Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G. R., Wu, G. R., Matthews, L., Lewis, S., Birney, E., and Stein, L. (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33, D428–32.
34. Antonov, A. V., Dietmann, S., and Mewes, H. W. (2008) KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol 9, R179.
35. Portales-Casamar, E., Thongjuea, S., Kwon, A. T., Arenillas, D., Zhao, X., Valen, E., Yusuf, D., Lenhard, B., Wasserman, W. W., and Sandelin, A. (2010) JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38, D105–10.
36. Ho Sui, S. J., Mortimer, J. R., Arenillas, D. J., Brumm, J., Walsh, C. J., Kennedy, B. P., and Wasserman, W. W. (2005) oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucleic Acids Res 33, 3154–64.
37. von Mering, C., Jensen, L. J., Kuhn, M., Chaffron, S., Doerks, T., Kruger, B., Snel, B., and Bork, P. (2007) STRING 7 – recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35, D358–62.
38. Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M., Bork, P., and von Mering, C. (2009) STRING 8 – a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 37, D412–6.
39. Alexeyenko, A., and Sonnhammer, E. L. (2009) Global networks of functional coupling in eukaryotes from comprehensive data integration. Genome Res 19, 1107–16.
40. Bernthaler, A., Muhlberger, I., Fechete, R., Perco, P., Lukas, A., and Mayer, B. (2009) A dependency graph approach for the analysis of differential gene expression profiles. Mol Biosyst 5, 1720–31.
41. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., and Apweiler, R. (2004) The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985–8.
42. Mosig, S., Rennert, K., Buttner, P., Krause, S., Lutjohann, D., Soufi, M., Heller, R., and Funke, H. (2008) Monocytes of patients with familial hypercholesterolemia show alterations in cholesterol metabolism. BMC Med Genomics 1, 60.
43. Rainer, J., Sanchez-Cabo, F., Stocker, G., Sturn, A., and Trajanoski, Z. (2006) CARMAweb: comprehensive R- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res 34, W498–503.
Chapter 18 Integration, Warehousing, and Analysis Strategies of Omics Data Srinubabu Gedela Abstract "-Omics" is a current suffix for numerous types of large-scale biological data generation procedures, which naturally demand the development of novel algorithms for data storage and analysis. With next-generation genome sequencing burgeoning, it is pivotal to derive, next to the pure sequence information, the coding sites on the genome, gene functions, and information on transcripts. To explore a genome and downstream molecular processes, we need numerous results at the various levels of cellular organization, obtained by utilizing different experimental designs, data analysis strategies, and methodologies. Here comes the need for controlled vocabularies and data integration to annotate, store, and update the flow of experimental data. This chapter explores key methodologies for merging Omics data via semantic data carriers, discusses controlled vocabularies such as eXtensible Markup Languages (XML), and provides practical guidance, databases, and software links supporting the integration of Omics data. Key words: XML, RDF, Controlled vocabularies, Omics data, Warehousing, Data integration
1. Introduction
1.1. General Considerations
Living cells are organized around some central aspects, including complex and integrated structure, regulatory mechanisms (such as homeostasis), growth and development, energy utilization, response to environmental stimuli, reproduction (DNA guarantees (semi)exact replication), and evolution (the capacity of living entities to adapt over time), which all together are reflected in Systems Biology. Recent research has expanded toward a systems view of complex diseases, also including different species as, e.g., in the Systems Biology of host–pathogen interactions. Data warehouses holding biological data on gene products and metabolites and, most importantly, their relationships and biochemical organization in metabolic pathways, are a central prerequisite for such Systems
Biology approaches. As an example, resources such as BioCyc (http://www.biocyc.org) provide tools for developing organism-specific metabolic pathway databases from previously annotated metabolites, also allowing the inclusion of newly annotated metabolites (1). Genome-wide or at least large-scale quantification of molecular components and experimental assessment of how these components interact have offered a broader insight into cellular function and into the effects of genetic and environmental perturbations. Omics data include quantification of mRNA transcripts (transcriptome), protein abundance (proteome), metabolic fluxes (fluxome), the concentration profiles of intracellular and extracellular metabolites (metabolome), and information on protein–protein and protein–DNA interactions (interactome). A variety of methods has been derived for data analysis, interpretation of phenomenological observations, and quantitative prediction of cellular behavior. These methods include comparative analysis of Omics profiles (e.g., statistical tests and dimensionality reduction methods), models for integrative analysis (e.g., graph theory-based models), and predictive models (2, 3). Selecting a type of model capable of appropriately handling a given problem plays an important role in the extraction of knowledge, and experimental design should anticipate the planned data analysis strategy, and certainly follow a clear definition of the biological hypothesis.
1.2. "-Omics" Data
Biological information extracted in Omics includes all levels of exploration of cellular activities, spanning from the gene to expression and further to the phenotype level. Main data levels are genomics, transcriptomics, proteomics, glycomics, lipidomics, metabolomics, and localizomics. Functional states covered by Omics data are, e.g., phenomics and fluxomics, reflecting the effective expression levels and the fluxes of metabolites through pathways. Further Omics levels describe interactions, such as protein–protein or protein–DNA interactions. Data flows describing the various components of Omics within a cell are shown in Fig. 1. These Omics procedures in turn generate enormous amounts of data, which need to be stored in efficient ways. Large-scale information is readily available in genome-scale Omics repositories, while most of the dedicated databases store experimental data on the gene and protein level taken from various sources. A selected list of available data repositories, including a short description and URL, is provided in Table 1. Omics data are naturally retrieved from specific experimental procedures, and raw data as well as processed data and analysis results are represented in databases. Formally, the integration of Omics data can be described as an array in which all the individual array elements are interlinked. To respect this fact, a design has to be chosen that is capable of representing both the experiments as
Fig. 1. Schematic representation of different components of Omics data and information flow within a cell.
such and their result vectors. The experimental designs noted in Fig. 2 span from the genome level to sequence annotation and further to ORF validation, e.g., utilizing microarrays and SAGE. Subsequent experiments such as proteomics provide, e.g., information on post-translational modifications (PTMs), centrally used for enzyme annotations in metabolic pathways. Analysis of these Omics levels allows studying interactions in gene regulatory networks and protein–protein interaction networks (see Note 1). Finally, functional annotation contributes to an understanding of the overall expression on the gene level, as, e.g., represented in OmicBrowse (http://omicspace.riken.jp/omicBrowse), which interconnects different Omics data component levels in a semantic fashion (4).
2. Materials
Omics data are frequently represented in controlled vocabularies, e.g., as hierarchical data elements, which also provide the interface to numerous data repositories and analysis tools (5). For example, the protein ontology (http://pir.georgetown.edu/pro)
Table 1
Major Omics data resources

Data types | Online resource | Description | URL

Components
Genomics | Genomes OnLine Database (GOLD) | Repository of completed and ongoing genome projects | http://www.genomesonline.org
Transcriptomics | Gene Expression Omnibus (GEO) | Microarray- and SAGE-based genome-wide expression profiles | http://www.ncbi.nlm.nih.gov/geo
Transcriptomics | Stanford Microarray Database (SMD) | Microarray-based genome-wide expression data | http://genome-www.stanford.edu/microarray
Proteomics | World-2DPAGE | Links to 2D-PAGE data | http://us.expasy.org/ch2d/2d-index.html
Proteomics | Open Proteomics Database (OPD) | Mass-spectrometry-based proteomics data | http://bioinformatics.icmb.utexas.edu/OPD
Lipidomics | Lipid Metabolites and Pathways Strategy (LIPID MAPS) | Genome-scale lipids database | http://www.lipidmaps.org
Localizomics | Yeast GFP Fusion Localization Database | Yeast genome-scale protein-localization data | http://yeastgfp.ucsf.edu
Glycomics | Consortium for Functional Glycomics | Glycan array and profile data | http://www.functionalglycomics.org/

Interactions
Protein–DNA | Biomolecular Network Database (BIND) | Published protein–DNA interactions | http://www.bind.ca/Action/
Protein–DNA | Encyclopedia of DNA Elements (ENCODE) | Database of functional elements in human DNA | http://genome.ucsc.edu/ENCODE/index.html
Protein–protein | Munich Information Center for Protein Sequences (MIPS) | Links to protein–protein interaction data and resources | http://mips.gsf.de/proj/ppi
Protein–protein | Database of Interacting Proteins (DIP) | Published protein–protein interactions | http://dip.doe-mbi.ucla.edu

Functional states
Phenomics | RNAi database | C. elegans RNAi screen data | http://rnai.org
Phenomics | General Repository for Interaction Datasets (GRID) | Synthetic-lethal interactions in yeast | http://biodata.mshri.on.ca/grid
Phenomics | A Systematic Annotation Package for Community Analysis of Genomes (ASAP) | Single-gene-deletion microarray data for E. coli phenotypes | http://www.genome.wisc.edu/tools/asap.htm
Fig. 2. Types of experimental designs: ChIP–chip (chromatin-immunoprecipitation–DNA-microarray); co AP-MS (co-affinity purification–mass-spectrometry); RNAi (RNA interference); SAGE (serial analysis of gene expression); yeast 2H (yeast two-hybrid analysis).
is a standard for supporting data integration, data mining, and models for deriving protein structural and functional properties (6). The Gene Ontology (GO, http://www.geneontology.org) is a controlled vocabulary to functionally annotate gene products with respect to their biological process, molecular function, and cellular location (7). Various frameworks have been derived for implementing vocabularies, including eXtensible Markup Languages (XML), the Resource Description Framework (RDF), Open Biomedical Ontologies (OBO), and the OWL Web Ontology Language (OWL).
2.1. eXtensible Markup Languages
XML is a general-purpose markup language that supports data sharing across heterogeneous systems and provides a format of choice for storing information with an inherent hierarchical structure (see Note 2). XML has been widely accepted in the Omics sciences as a standard for data exchange. Examples of powerful XML-based data integration are the GLYcan Data Exchange standard (GLYDE, http://lsdis.cs.uga.edu/projects/glycomics), enabling interoperability and exchange of glycomics data and, more generally, of structures carrying glycan moieties, as developed by Sahoo
and his team (8), and BioMart (http://www.biomart.org), an integration tool using XML syntax for building data elements from different databases and providing user-defined queries (9).
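To illustrate why XML suits hierarchical Omics records, the following sketch parses an invented mini-document with Python's standard library; the element and attribute names are illustrative only and are not taken from GLYDE or any published schema:

```python
import xml.etree.ElementTree as ET

# invented mini-document with a hierarchical Omics record
doc = """
<experiment id="E1" platform="HG-U133_Plus_2">
  <sample name="GSM140232">
    <feature probe="1007_s_at" symbol="DDR1" value="133.11"/>
    <feature probe="1405_i_at" symbol="CCL5" value="5625.38"/>
  </sample>
</experiment>
"""

root = ET.fromstring(doc)
for feature in root.iter("feature"):
    print(feature.get("symbol"), float(feature.get("value")))
```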
2.2. Resource Description Framework
RDF (http://www.w3.org/RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. RDF is used as a general syntax for linking a wide variety of data in a single framework. RDF is, e.g., used to combine genome data and public domain annotations within GO, KEGG, and the SUPERFAMILY database (10).
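A minimal sketch of RDF-style annotation, assuming the Python rdflib package is available; the namespace and the GO annotation shown are illustrative, not an official mapping (note that older rdflib versions return bytes rather than str from serialize):

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/omics/")   # invented namespace
GO = Namespace("http://purl.org/obo/owl/GO#")

g = Graph()
g.add((EX["CCL5"], RDF.type, EX["Gene"]))
g.add((EX["CCL5"], EX["annotatedWith"], GO["GO_0006935"]))  # chemotaxis
g.add((EX["CCL5"], EX["expressionValue"], Literal(5625.38)))

print(g.serialize(format="turtle"))
```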
2.3. Open Biomedical Ontologies
OBI (http://obi.sourceforge.net) is a collection of controlled vocabularies freely available to the biomedical community. Web-based ontology portals, such as the BioPortal (http://bioportal.bioontology.org), allow users to browse, search, submit, and visualize ontologies. The need for innovative technology and methods that allow scientists to record, manage, and disseminate biomedical information and knowledge in machine-processable form gave rise to the National Center for Biomedical Ontology (NCBO, http://www.bioontology.org) initiative, created in 2005 (11).
2.4. OWL Web Ontology Language
OWL (http://www.w3.org/TR/owl-guide) facilitates further improved machine interpretability of Web content compared to XML, RDF, and RDF Schema (RDF-S) by providing additional vocabulary along with a formal semantics. OWL has three sublanguages: OWL Lite, OWL DL, and OWL Full. They are described as follows:
● OWL Lite supports a classification hierarchy and simple syntax.
● OWL DL is an ontology language based on description logics (DLs).
● OWL Full offers maximum expressiveness and syntactic freedom.
BioPAX (http://www.biopax.org) is an effort to create a data exchange format for biological pathway data utilizing OWL semantics.
3. Methods
Various methods for Omics data integration and analysis are available (12). The first criteria for data integration depend upon the type of data; hence, most of the available algorithms are based on genomics, transcriptomics, and proteomics experimental data. Other Omics data components, like phenomics and fluxomics, can be studied through the integration of further analysis tools into the integration environments.
Fig. 3. A flow chart describing various steps of data integration toward network reconstruction and model building.
3.1. Identifying, Decomposing, and Modeling
The procedure of identifying and decomposing a network scaffold, followed by cellular systems modeling and analysis is schematically depicted in Fig. 3.
3.1.1. Identifying a Network Scaffold
This task depicts the strategy for identifying all interactions between Omics components. A typical example is the identification of a gene-regulatory network scaffold by integrating chromatin immunoprecipitation (ChIP) and microarray gene expression data (referred to as ChIP–chip data). Such Omics data specify the interactions between a transcriptional regulator and its target gene, and various statistical approaches are available to derive the specific regulatory relationship (namely, transcriptional activation or repression). Data on protein–DNA and protein–protein interactomes reflect the activity of a cellular network, and a typical analysis strategy follows clustering of high-throughput gene expression data sets, complemented by isolating the upstream regions of clustered genes for identifying common cis-regulatory motifs. Tools implementing scaffold-building algorithms based on the transcriptional motifs found in clustered gene expression data are available, such as Module construction using gene expression and sequence motifs (MODEM) (13) and regulatory-element detection using correlation with expression (REDUCE, http://bussemaker.bio.columbia.edu/reduce) (14). Another approach is Genetic regulatory modules (GRAM, http://psrg.lcs.mit.edu/GRAM/Index.html) (15) for identifying protein–DNA binding events within sets of transcription factors (see Note 3).
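At its core, scaffold identification is a thresholding exercise over binding evidence, with the regulatory mode derived from the expression response. A hedged sketch on invented ChIP–chip results:

```python
import pandas as pd

# invented ChIP-chip results: binding p-value plus the target's
# expression change upon perturbation of the regulator
chip = pd.DataFrame({
    "tf":      ["GAL4", "GAL4", "MIG1", "MIG1"],
    "target":  ["GAL1", "CDC19", "GAL1", "SUC2"],
    "p_bind":  [1e-5, 0.2, 5e-4, 1e-6],
    "log2_fc": [2.1, 0.1, -1.3, -2.4],
})

scaffold = chip[chip.p_bind < 1e-3].copy()
scaffold["mode"] = scaffold.log2_fc.apply(
    lambda fc: "activation" if fc > 0 else "repression")
print(scaffold[["tf", "target", "mode"]])
```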
3.1.2. Network Scaffold Decomposition
Integrating Omics data in network modules and aligning such modules into more complete networks is the common procedure in reconstructing networks. Network modules rest on available interactome data and are typically composed of a limited number of nodes. Such identified motifs represent the basic building blocks that comprise the cellular network. The incorporation of localizomics data further supports the isolation of biologically relevant motifs, as interacting components are found with higher probability in the same subcellular location. Methods such as Statistical Analysis of Network Dynamics (SANDY, http://sandy.topnet.gersteinlab.org/), methods for bicluster analysis (SAMBA, http://www.cs.tau.ac.il/%7Ershamir/expander/expander.html), and tools like Mdraw (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mdraw/) and Mfinder (http://www.weizmann.ac.il/mcb/UriAlon/NetworkMotifsSW/mfinder/MfinderManual.pdf) are available for constructing correlative maps (see Note 4). A representative map constructed by using Mdraw is depicted in Fig. 3.
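Motif counting of the kind performed by Mfinder can be sketched directly for the most prominent motif, the feed-forward loop (X regulates Y, and both X and Y regulate Z); the edges below are invented:

```python
import networkx as nx

# invented regulatory edges (direction: regulator -> target)
G = nx.DiGraph([("X", "Y"), ("X", "Z"), ("Y", "Z"),
                ("A", "B"), ("B", "C")])

def feed_forward_loops(g):
    """Enumerate triples (x, y, z) with edges x->y, x->z, and y->z."""
    loops = []
    for x, y in g.edges():
        for z in set(g.successors(x)) & set(g.successors(y)):
            if z not in (x, y):
                loops.append((x, y, z))
    return loops

print(feed_forward_loops(G))   # [('X', 'Y', 'Z')]
```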
3.1.3. Cellular Systems Modeling and Analysis
The availability of Omics data sets opens the way for efforts aimed at integrating diverse Omics profiles into whole-cell or systems models, spanning from the identification of network modules to quantitative modeling and simulation (see Note 5). The constraint-based reconstruction and analysis (COBRA, http://gcrg.ucsd.edu/Downloads/Cobra_Toolbox) technique (16) has emerged in recent years as a successful approach for modeling systems on a genome scale, integrating genomic, proteomic, and other high-throughput data. This toolbox can be downloaded for Matlab. Next to a quantitative description of a cellular state, Omics data may also be seen in the context of overall constraints from thermodynamics, mass conservation, the reactions involved, etc. A reconstruction is here defined as the list of biochemical reactions occurring in a particular cellular procedure (such as metabolism), and the associations between these reactions and the relevant proteins, transcripts, and genes. A reconstruction can be converted into a model by including the assumptions necessary for computational simulation, for example, maximum reaction rates and nutrient uptake rates, which results in a simulation-ready representation of the cellular process encoded within the Omics data. Recent methods for developing such cellular simulations involve tools like BioTapestry (http://www.biotapestry.org) (17). A sample process done using BioTapestry (see Note 6) is depicted in Fig. 4.
Fig. 4. Constraint-Based Reconstruction and Analysis (COBRA) method.
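The computational core of such constraint-based analysis is a linear program over the stoichiometric matrix S: maximize an objective flux subject to the steady-state condition S·v = 0 and capacity bounds on v. The COBRA Toolbox itself runs in Matlab, as noted above; the following toy three-reaction sketch uses scipy instead:

```python
import numpy as np
from scipy.optimize import linprog

# toy network: uptake -> A, A -> B, B -> biomass
# rows: metabolites A, B; columns: reactions v1, v2, v3
S = np.array([[ 1, -1,  0],
              [ 0,  1, -1]], dtype=float)
bounds = [(0, 10), (0, 10), (0, 10)]   # flux capacities
c = np.array([0, 0, -1.0])             # maximize v3 (linprog minimizes)

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print("optimal biomass flux:", -res.fun)   # -> 10.0
```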
3.2. DBE
The Data analysis and visualization system for Biological Experiments (DBE, http://www.bic-gh.de/dbe) (18) describes a method for mapping metabolomics measurements onto metabolite data, where DBE helps scientists in managing, analyzing, and visualizing experimental data. DBE comprises a set of components for handling Omics data in a multidisciplinary way: the DBE-Web site provides the user interface, the DBE-Database supports consistent data storage, data import is supported via Excel-based templates, DBE-Pictures supports the handling of, e.g., image files, and DBE-Gravisto provides network analysis and visualization. Selected components are shown in Fig. 5. For demonstrating DBE functionalities, we use metabolite data available for seed development of beans (Vicia narbonensis). In this case, transgenic technology was applied to increase protein accumulation via introduction of the bacterial enzyme phosphoenolpyruvate carboxylase (PEPC). The enzyme refixes HCO3− liberated by respiration and, together with phosphoenolpyruvate, yields oxaloacetate that can either be converted to aspartate or into malate and other intermediates of the citric acid cycle. To characterize the responsible metabolic shift within seeds from sugars/starch into organic acids/amino acids/proteins, the metabolite pattern for glycolysis and the citrate cycle, as well as related sugars and free amino acids, was analyzed. Visualization of metabolites within their pathways (Fig. 6) gives an immediate overview of specific changes in metabolism within transgenic seeds.
Fig. 5. DBE-Gravisto, a network analysis and graph visualization system.
3.3. BioMart
BioMart (http://www.biomart.org) (9) is an open source data management system that comes with a range of query interfaces allowing the user to group and refine data based upon many different criteria. The capabilities of BioMart are further extended by integration with several widely used software packages, such as BioConductor, DAS, Galaxy, Cytoscape, or Taverna. BioMart provides graphical as well as command line interfaces, and furthermore Web services and APIs written in Perl and Java, supporting various database systems such as MySQL, Oracle, and Postgres. Data integration involves four steps, namely, (1) querying, (2) configuration, (3) transformation, and (4) source data (Fig. 7). Querying allows the user to select data, including filtering on the basis of attributes like the Gene ID or GO terms, providing a structured XML view. Configuration rests on XML for aligning heterogeneous data in support of structured querying. Transformation integrates the source data into the XML format; the source data are available data sets that are parsed through Perl APIs into MySQL databases. BioMart has a three-tier architecture. The first tier consists of one or more relational databases; two tools operate on this tier:
● Mart Builder, to construct the SQL statements for transforming a schema into a mart.
● Mart Editor, to generate a data set configuration XML stored in metadata tables within the actual mart database.
Fig. 6. Visualization of experimental data in the context of a metabolic network constructed by using the DBE-Gravisto standalone version 1.1 (beta).
The second tier is the Perl API, which interacts with both the data set configuration and the mart databases. The third tier consists of the query interfaces, which utilize the API to present the possible BioMart queries and results:
● Mart View, a Web browser interface.
● Mart Service, a Web services interface.
● MartURLAccess, a mart view based on Web URLs.
Fig. 7. Steps of data integration.
As a practical example, we show the analysis of the 1 kb upstream sequences of a cluster of human genes identified by an expression profiling experiment using an Affymetrix GeneChip U95Av2. The Homo sapiens genes data set is selected, and the ID list limit filter in the GENE section is chosen. Selecting the Affy hg u95av2 ID(s) option allows uploading the Affymetrix probeset IDs using the file Browse button, or alternatively copying and pasting the data set into the text box. Data types include complementary DNA (cDNA), peptides, coding regions, untranslated regions (UTRs), and exons with additional upstream and downstream flanking regions. In order to identify upstream regulatory features in subsequent analysis, the 1 kb upstream flank sequence for each gene has to be selected (Fig. 8). The resulting data can be used for further annotation, e.g., by assigning GO terms to the Affymetrix data via selecting the respective filters and feature attributes as shown in Fig. 9. A number of external software packages have incorporated BioMart for enhanced querying capabilities, e.g., for using services such as Galaxy, BioConductor, and Taverna, or for adding further annotation and visualization of results (e.g., Cytoscape, http://www.cytoscape.org). This integration has been made possible through MartService. BioMart can also easily be configured to become a DAS annotation server for viewing data through various Distributed Annotation System (DAS) clients.
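The same query can also be submitted programmatically by posting an XML query document to a MartService endpoint, which makes the retrieval step scriptable and reproducible. The sketch below is illustrative only: the endpoint URL and the filter/attribute names (affy_hg_u95av2, upstream_flank, gene_flank) follow common BioMart naming conventions but must be checked against the configuration of the mart actually used, and the probeset IDs are hypothetical examples.

import requests

QUERY = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="FASTA" header="0" count="">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="affy_hg_u95av2" value="1939_at,1503_at"/>
    <Filter name="upstream_flank" value="1000"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="gene_flank"/>
  </Dataset>
</Query>"""

response = requests.post("http://www.biomart.org/biomart/martservice",
                         data={"query": QUERY})
print(response.text[:500])  # FASTA-formatted 1 kb upstream sequences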
Fig. 8. Example of sequence attributes, filters, and results after selecting the given options in the MART window.
4. Notes
1. Further functional studies provide additional detailed information, e.g., on the druggability of target genes, adding value to the discovery of novel therapeutics (19).
2. XML is a common set of well-defined data formats and is the format of choice for storing information with an inherent hierarchical structure. XML has been widely accepted in Omics for data exchange, migration, and storage.
3. Grid Resource Allocation Manager or Globus Resource Allocation Manager (GRAM) is a software component of the Globus Toolkit that can locate, submit, monitor, and cancel jobs on Grid computing resources. It provides reliable operation, stateful monitoring, credential management, and file staging. GRAM does not provide job scheduler functionality; it is in fact just a front-end (or interoperability bridge) to the functionality provided by an external scheduler that does not natively support the Globus Web service protocols. REDUCE is an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data (14).
Fig. 9. Selected features, attributes, filters, and results after choosing the given options in the MART viewer, showing the GO-annotated tables.
4. Mdraw is a network motif drawing tool, written in C# using the Mono platform, that visualizes the motifs detected by Mfinder.
5. So far, the merging of Omics data has fundamentally contributed to basic biological research by deriving models and controlled vocabularies for annotating biological processes. On a subsequent level, pharmacogenomics and pharmacoproteomics have emerged to study, e.g., drug pharmacodynamics and pharmacokinetics with reference to humans and other organisms, allowing the analysis of small molecule drugs as well as biologicals.
6. BioTapestry is an interactive tool for building, visualizing, and simulating genetic regulatory networks. The tool can also be used for interactive Web models.

References
1. Caspi, R., Foerster, H., Fulcher, C.A., Kaipa, P., Krummenacker, M., Latendresse, M., Paley, S., Rhee, S.Y., Shearer, A.G., and Tissier, C. (2008) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 36, D623–31.
2. Srinubabu, G. (2009) Computational systems biology of -Omics data: integration, warehousing and validation. BIT Life Sciences' 2nd Annual World Summit of Antivirals, July 18–20, 2009, Beijing, China.
3. Hanuman, T., Raghava, N.M., Siva, P.A., Mrithyunjaya, R.K., Chandra, S.V., Allam, A.R., and Srinubabu, G. (2009) Performance comparative in classification algorithms using real datasets. J Comput Sci Syst Biol 2, 97–100.
4. Tetsuro, T., Yoshiki, M., Keith, P., Naohiko, H., Norio, K., and Yoshiyuki, S. (2007) OmicBrowse: a browser of multidimensional omics annotations. Bioinformatics 23, 524–26.
5. Avraham, S., Tung, C.W., Ilic, K., Jaiswal, P., Kellogg, E.A., McCouch, S., Pujar, A., Reiser, L., Rhee, S.Y., Sachs, M.M., Schaeffer, M., Stein, L., Stevens, P., Vincent, L., Zapata, F., and Ware, D. (2008) The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations. Nucleic Acids Res 36, D449.
6. Sidhu, A.S., Dillon, T.S., and Chang, E. (2006) Advances in Protein Ontology Project. Computer-Based Medical Systems CBMS 19th IEEE International Symposium, 588–92.
7. Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25, 25–29.
8. Satya, S.S., Christopher, T., Amit, S., Cory, H., and William, S. (2005) GLYDE – An expressive XML standard for the representation of glycan structure. Carbohydr Res 18, 2802–7.
9. Syed, S.H., Benoit, B., Richard, H., Darin, L., Gudmundur, T., and Arek, K. (2009) BioMart – biological queries made easy. BMC Genomics 10, 22.
10. Vandervalk, B.P., McCarthy, E.L., and Wilkinson, M.D. (2009) Moby and Moby 2: creatures of the deep (web). Brief Bioinform 10, 114–28.
11. Burgun, A., and Bodenreider, O. (2008) Accessing and integrating data and knowledge for biomedical research. Yearb Med Inform, 91–101.
12. Akula, S.P., Miriyala, R.N., Thota, H., Rao, A.A., and Srinubabu, G. (2009) Techniques for integrating -omics data. Bioinformation 3, 284–86.
13. Wei, W., Michael, C.J., Yigal, N., Emmitt, J., David, B., and Hao, L. (2005) Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation. Proc Natl Acad Sci USA 102, 1998–2003.
14. Crispin, R., and Harmen, J.B. (2003) REDUCE: an online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. Nucleic Acids Res 31, 3487–90.
15. Bar-Joseph, Z., Gerber, G.K., Lee, T.I., Rinaldi, N.J., Yoo, J.Y., Robert, F., Gordon, D.B., Fraenkel, E., Jaakkola, T.S., Young, R.A., and Gifford, D.K. (2003) Computational discovery of gene modules and regulatory networks. Nat Biotechnol 21, 1337–42.
16. Scott, A.B., Adam, M.F., Monica, L.M., Gregory, H., Bernhard, P., and Markus, J.H. (2007) Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat Protoc 2, 227–38.
17. Longabaugh, W.J.R., Eric, H.D., and Hamid, B. (2005) Computational representation of developmental genetic regulatory networks. Dev Biol 283, 1–16.
18. Ljudmilla, B., Mohammad-Reza, H., Christian, K., Hardy, R., and Falk, S. (2005) Integrating data from biological experiments into metabolic networks with the DBE information system. In Silico Biol 5, 93–102.
19. Denong, W., and Srinubabu, G. (2008) Insights of new tools in glycomics research. J Proteomics Bioinform 1, 374–78.
Chapter 19
Integrating Omics Data for Signaling Pathways, Interactome Reconstruction, and Functional Analysis
Paolo Tieri, Alberto de la Fuente, Alberto Termanini, and Claudio Franceschi
Abstract
Omics data and computational approaches today provide a key to disentangle the complex architecture of living systems. The integration and analysis of data of different nature makes it possible to extract meaningful representations of signaling pathways and protein interaction networks, helpful in achieving an increased understanding of such intricate biochemical processes. We describe here a general workflow, and the hurdles associated with it, for integrating online Omics data and analyzing the reconstructed representations by using the available computational platforms.
Key words: Pathway, Interactome, Signaling, Network, Protein interactions, Data integration, Data retrieval, Systems biology, Bioinformatics
1. Introduction
Network abstractions and network analysis are today common in science. This approach has been applied to the representation of complex systems with considerable success, from social studies (1) to engineering (2) and biology (3–10). Despite its intrinsically limited perspective, such conceptualization enables complex biological systems to be considered as a whole and opened up for mathematical analysis, aiming at the discovery of salient systemic features and providing an accurate, at-a-glance analytic view of the entities, relations, and functions that characterize them. This approach also makes it possible to highlight how the qualities and behavior of single elements influence network topology and dynamics, how network structure impinges upon
processes spreading over the network, or the effect of perturbations on network performance (11, 12). In this regard, the network abstraction of biochemical signaling pathways can represent a useful functional view that complements analyses and approaches from molecular biology and the various Omics. Biochemical pathways are usually understood as intracellular processes whose scale lies somewhere between small events, such as protein complex formation or enzyme catalysis, and cell-wide or larger events, such as cell death or inflammation. These processes can be divided into separate steps, which seldom follow a linear and unambiguous succession. It is not yet simple to define a pathway in terms of its components, steps, dynamics, and function, given its manifold, hazy, and intricate nature. Indeed, pathways and signaling cascades are not isolated entities. A signaling pathway can be triggered by different extra- or intracellular events, may cover different parallel paths and branches, may intersect with, compete with, cooperate with, or depend on other events, and each of its steps may have diverging functions. Pathways, in conclusion, are processes characterized by high complexity (13–15). The abstractions and models of biological networks and pathways discussed here are mainly protein interaction networks (PINs) and protein-signaling networks (PSNs). PINs represent protein–protein binding events on a proteome-wide scale: nodes and undirected edges represent proteins and the binding events among them. In PSNs, nodes and directed edges represent phosphoproteins and phosphorylation reactions. The two models can be combined and enriched with additional layers, such as transcriptional regulatory networks, among others. Omics data and computational approaches today provide a key to disentangle the complexity of objects like signaling pathways, assisted by dedicated online databases and specific software tools. Through such methodology, it is possible to integrate data of different nature to extract meaningful representations and useful information, finally leading to an increased understanding of the biochemical process under examination. Nevertheless, the workflow for the integrated reconstruction and analysis of signaling pathways, interactomes, and biological networks is hampered by difficulties of diverse nature, such as lack of data, annotation differences or multiple interpretations, and data integration problems (16–18). The materials and workflow described here are intended to demonstrate a general approach for gathering information of interest from some of the existing pathway and protein interaction databases, for integrating and analyzing data and reconstructed representations by using the available tools, and for understanding which kind of knowledge can be extracted from the combination of existing information (Fig. 1). We shortly describe the characteristics of some of the many pathway and protein
Fig. 1. Schematic representation of the analysis workflow. From manual and automated data retrieval, through human curation and software platforms, data are integrated to reconstruct coherent objects able to undergo mathematical analysis. Results can feed back into the pipeline for further enrichment, analysis, simulations, or improvement of existing models and representations.
interaction resources and databases available online, and how the Cytoscape software platform and other analysis tools can be applied to reconstruct and analyze some exemplar pathways and interactomes.
2. Materials
2.1. Overview of Databases and Online Data Sources
Signaling cascade and pathway information is more and more systematically collected and organized into publicly available databases. Such resources lay the foundations for the systems-level approach, enabling a workflow for the reconstruction of pathway/interactome networks that generally consists of the manual or automated retrieval of pathway data; their integration, merging, comparison, and enrichment with other forms of data; and the subsequent analytical process (simulation, mathematical modeling, statistical analysis). Iterative cycles of
such procedures, modeling, and prediction, combined with experimental validation, can result in improved knowledge of cell signaling and responses. Online dedicated databases usually store cell signaling data in exchangeable formats (often BioPAX, the Biological Pathway Exchange format, or SBML, the Systems Biology Markup Language; see Note 1) accessible by diverse software platforms and tools, allowing for their retrieval, visualization, and analysis. The following list should by no means be considered exhaustive; links and URLs can be found in the Notes section. Pathguide (the Pathway Resource List, see Note 2) (19) is a useful starting point for biological pathway analysis, since it is a content aggregator for integrated biological information systems. Pathguide is a meta-database that provides an overview of current pathway and other systems biology-oriented databases; it currently lists and provides details and links to more than 300 Web-accessible biological pathway and network databases. These include databases on metabolic pathways, signaling pathways, transcription factor targets, gene regulatory networks, genetic interactions, protein–compound interactions, and protein–protein interactions. The listed databases are curated and maintained by diverse scientific groups in different worldwide locations, and the information represented is derived either from the scientific literature or from systematic, high-throughput experiments. Reactome ((20), see Note 2) is a pathway database covering a wide set of biological processes, organized in a hierarchical manner: lower levels for smaller reactions, higher levels for pathways and extended processes. Data are extracted from the literature and biomedical experiments, are human-curated, and are represented as chains of chemical reactions (including transcription, catalysis, binding). Data can be physical entities (DNA, RNA, protein complexes, phosphorylated and unphosphorylated proteins, small molecules…) or events (reaction-like events for smaller reactions, or pathway-like events clustering a set of reaction-like events). The tool allows remote search and browsing, as well as downloading data in the most common formats or as graphical representations. The Web site also provides some useful statistical and graphical tools and can be accessed through a Simple Object Access Protocol (SOAP, http://www.w3.org/TR/soap) Web service for automated data queries. KEGG ((21, 22), see Note 2) consists of a number of interlinked databases devoted to several domains in the cell and beyond (genes, genomes, proteins, chemical compounds, pathways, diseases, drugs, ontologies). The pathways section covers many organisms, including human. Data are categorized into the different processes (metabolic, genetic information, signaling, etc.) and are coded in a special XML format (KGML), but are also available in BioPAX and SBML through additionally available coding tools.
The Nature Pathway Interaction Database (PID) ((23), see Note 2) is hierarchically organized in a way similar to Reactome and hosts pathway data (available in BioPAX or XML) obtained from peer-reviewed literature or imported from other databases, such as Reactome or BioCarta (a supplier of reagents and assays for biopharmaceutical and academic research; see Note 2). DNA and RNA are not part of the PID pathways, but active/inactive and phosphorylated/unphosphorylated states are annotated. The pathways can be browsed starting from UniProt, Entrez Gene (see Note 2), or other identifiers, and query as well as statistical tools are provided. Pathway Commons builds on already existing databases, such as Reactome, PID, and other protein interaction databases, and provides an integrated access point and a compilation of such databases, thus conserving their structure and data hierarchies. However, this kind of integration is not a simple task and may result in overlapping, discordant, and/or redundant information. A useful feature is the complete accessibility through the dedicated Pathway Commons plugin for the Cytoscape platform (see later in the chapter). WikiPathways ((24), see Note 2) is an open source and collaborative platform for biological pathway information, storage, and curation, in the style of Wikipedia. Data are categorized by species and processes (e.g., metabolic process, molecular function, etc.) and are coded in the GenMAPP (an application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes, see Note 2) Pathway Markup Language (GPML), being compatible with applications such as PathVisio (a visualization tool, see Note 2), Cytoscape, and GenMAPP. The Agile Protein Interaction DataAnalyzer (APID) (25) is an interactive Web-based platform devoted to the exploration and analysis of diverse information about protein interactions, integrated and unified in a common and comparative environment. APID provides an open access frame where all experimentally validated protein–protein interactions (obtained from protein interaction databases such as BIND, BioGRID, DIP, HPRD, IntAct, and MINT, see Note 2) are unified in a unique Web application that allows the exploration and analysis of networks and interactomes. APID provides some embedded online tools to query and browse data and, most usefully, a Cytoscape plugin (APID2NET (26)) that allows the user to extract, visualize, and analyze unified interactome data by directly querying the APID servers, including all the annotations and attributes associated with the retrieved PPIs. The Transcriptional Regulatory Element Database (TRED) ((27, 28), see Note 2) is a manually curated database of regulatory elements (promoters, transcription factor binding sites, both cis and trans) with experimental evidence in mammalian genomes.
Currently, it lists a total of 36 transcription factor families (most of which are involved in cancer), more than 7,000 target genes, and around 15,000 target promoters, with the goal of assisting detailed functional studies and helping to obtain a panoramic view of gene regulatory networks from a cancer research perspective. TRANSPATH (29), together with the better-known TRANSFAC ((30), see Note 2), which stores transcription factors and their DNA binding sites, is a widely used and powerful knowledge base on gene regulatory networks that comprises and integrates information on signal transduction with tools for visualization and analysis. It allows obtaining complete signaling pathways from ligand to target genes and their products. Access requires a license purchase, although an older version can be accessed for free. NetPath ((31), see Note 2) is a curated compendium of human signaling pathways that currently contains annotations for several cancer and immune signaling pathways. Pathway data are available for browsing and download in the most common formats (including the Proteomics Standards Initiative-Molecular Interaction, PSI-MI, format), and listings of up- and downregulated genes for each pathway are provided based on experimental data and literature. Notwithstanding the quantity and quality of the publicly available resources, information automatically extracted from pathway databases is usually not yet exhaustive. Given the often complementary nature of data in different databases, they should be retrieved, integrated, and combined, and we feel the quality of the result strongly relies upon a sharp manual curation effort (16–18). The integration process itself, however, can present several problems, not least those of interchangeability of the different formats and data models, but also in terms of reaction annotation, or of significant differences in other key biological factors, such as cellular state and type (16). Thus, literature extraction of data (possibly aided by text mining techniques), combined with information from databases under expert supervision and curation, probably remains a good choice in order to obtain an accurate pathway reconstruction. A complete and deep curation process can last months, employ many experts, and still yield controversial results. Conversely, manual integration of data extracted from online pathway resources – under expert review – can be decently performed in days, allowing the creation of a sufficiently accurate (also depending on the scope) representation of a given pathway, or part of it, ready for further functional enrichment and analysis.
2.2. Computational Analysis Software
2.2.1. Main Platforms and Tools
Since the purpose of the interactome or pathway reconstruction process is to obtain an "object" that can be further elaborated, enriched, and analyzed step by step, we need to access and store data on local machines, in contrast to browsing them online. As described before, most of the databases allow downloading the
relevant data in diverse formats (BioPAX, SBML, PSI-MI, among others). At this point, the choice of one or more tools for network editing and analysis is up to the user. Some of these are directly embedded or available inside the different databases, such as Reactome, WikiPathways, BioCarta, and GenMAPP. Others are commercial suites, such as Ingenuity or Pathway Studio, with special visualization features (see Note 2). Among the open source applications, Cytoscape ((32), see Note 2) is a very powerful software platform, available for all the major operating systems, designed for biological research but versatile enough to be used in many other fields where network editing, visualization, and analysis are key features. The core tool has been developed to visualize molecular interaction networks and biological pathways, and to integrate these networks with annotations, gene expression profiles, and other state data. Many more features, such as advanced network and molecular profiling analyses, new layouts, additional file format support, scripting, and connection with databases, are available as plugins. Cytoscape supports many different standard network and annotation file formats, including the Simple Interaction Format (SIF), BioPAX, PSI-MI, SBML, tab-delimited text files, and MS Excel. BiologicalNetworks ((33), see Note 2) is an integrated research environment for the biological sciences that allows querying and integrating molecular interaction networks and metabolic and signaling pathways with a large number of biological features related to transcriptional regulation, microarray and proteomics experiments, 3D structures, ontologies, taxonomies, and other types of data. The tool is based on a database currently integrating over 100 curated and publicly contributed data sources for thousands of eukaryotic, prokaryotic, and viral genomes. CellDesigner ((34), see Note 2) is a structured diagram editor for drawing gene-regulatory and biochemical networks. Networks are drawn based on a process diagram, with a dedicated graphical notation system, and are stored in the SBML format. Networks can be linked with simulation and other analysis packages through a wider software platform named the Systems Biology Workbench (SBW). In the Methods section, we focus on a workflow mainly based on the Cytoscape platform, given its free availability, diffusion in biology research, upgradeability, and versatility.
2.2.2. Other Specific Analysis Tools and Plugins
Powerful standalone packages specific to network analysis are freely available. Pajek (35) ("spider" in Slovene, the nationality of the developers, see Note 2), for instance, is able to visualize and analyze networks of millions of nodes. Specific add-on modules can be used inside the well-known R statistical package (http://www.r-project.org). Other packages have direct Web-based functionality: GraphWeb (36) is a public Web server for graph-based analysis
that has been designed for extensive analyses of directed and undirected, weighted and unweighted, heterogeneous networks of genes, proteins, and microarray probesets for many eukaryotic genomes, and is able to integrate multiple, diverse data sets for constructing extended networks. Among the many available Cytoscape plugins (for an exhaustive list and references see the Cytoscape.org Web site), NetworkAnalyzer (37) requires no expert knowledge of graph theory. The tool provides functionality to compute and display charts for a fairly complete set of topological parameters of undirected and directed networks, including the number of nodes, edges, and connected components, the network diameter, radius, density, centralization, heterogeneity, clustering coefficient, and the characteristic path length. ClusterMaker (a Cytoscape plugin) unifies different clustering techniques and displays within a single interface. It uses specific algorithms for clustering expression or genetic data, and similarity networks to look for protein families and putative functional similarities. The Hub Objects Analyzer (Hubba) (38) is both a Web-based service and a Cytoscape plugin for exploring networks for the discovery of hubs in an interactome network generated from specific small- or large-scale experimental methods.
3. Methods
3.1. General Retrieval and Reconstruction Procedures
3.1.1. Data Retrieval
The process of manual literature mining for data extraction is labor-intensive and time-consuming but typically gives back high-quality data and models. Given the breadth and importance of this topic, it cannot be exhaustively treated here, and we refer to Jensen and colleagues (39) for a comprehensive review of the field of manual and machine-aided extraction of biomedical facts from the scientific literature. In the first step of retrieving the pathway data of interest through Cytoscape, the user can utilize one of the many existing plugins, each designed to query and retrieve data from different databases. It is advisable that the user first browse the candidate databases to understand which type, model, and format is used for data representation. Among the many Cytoscape plugins, BioNetBuilder (40) can be used to build networks for many different species, including the most common model organisms and human, retrieving data from currently supported databases that include DIP, BIND, KEGG, HPRD, and BioGrid, among others. The interface offers different options to specify a set of initial genes/gene products for which to find molecular interactions (including loading a text file, finding
genes with specified Gene Ontology annotations, and finding genes whose common name matches a given string). Biological networks for whole organisms can also be created and displayed. Another very useful plugin is the aforementioned APID2NET, linked to the APID database. This tool allows the user to specify a list of proteins and retrieve the network of their interactions at the desired connection level (level 0 considers only the interactions among the listed proteins, level 1 considers all their neighbors in APID, level 2 also considers the neighbors of the neighbors, etc.), validated by a selectable set of different experimental methods. The system also displays additional information on node, edge, and network attributes. The user can also start a Cytoscape session with the embedded "import network from web service" function to connect directly to the Pathway Commons or WikiPathways servers and obtain the data. It is also possible to retrieve the data from each single database simply by downloading a formatted file and then uploading and opening it in the Cytoscape client for visualizing the network. It is not always possible to retrieve data following a plugin-automated or semi-automated process as described above. For some databases that are not specifically designed for systems biology but contain useful and well-arranged information, such as TRED, no workflow is provided. Here, it may be necessary to formulate a query, extract the data with copy/paste operations in text format, and perform further adaptation to import and incorporate them into a network in a very manual fashion.
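When a database offers only a downloadable file, the same network can also be assembled outside Cytoscape with a few lines of scripting, which is convenient along the manual route just described. The sketch below, for illustration only, builds an undirected PIN from a PSI-MI TAB (MITAB) file, whose first two tab-separated columns hold the identifiers of the two interactors; the file name is hypothetical and the Python networkx library is assumed.

import networkx as nx

pin = nx.Graph()
with open("interactions.mitab") as handle:
    for line in handle:
        if line.startswith("#"):  # skip the MITAB header line
            continue
        fields = line.rstrip("\n").split("\t")
        pin.add_edge(fields[0], fields[1])  # interactor A, interactor B

print(pin.number_of_nodes(), pin.number_of_edges())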
3.1.2. Data Merging and Combination
As said, the combination of data from different pathway databases is highly desirable. The user can, for instance, download the same pathway data as provided by two or more databases and try to combine them in order to make the representation as complete as possible. For this purpose, again, suitable Cytoscape plugins (e.g., AdvancedNetworkMerge) or embedded functions can be used. This is a critical point: molecular and reaction data are frequently encoded and modeled in different manners depending on the originating database, so that the network resulting from the merging of two or more networks can disappointingly turn out to be a simple sum of the originating objects, or an otherwise inconsistent network, without any partial overlap or any other shared information or link. There is no trivial solution to this kind of issue, since across databases there are no uniquely defined identifiers for each of the entities that compose the pathways or the networks. Accurate filtering and expert curation performed before the merging process can purge the data of undesired or redundant information. This also usually makes it quite easy to build improved versions of the networks based on additional and different types of data.
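A minimal sketch of the identifier problem and its remedy is given below: two hypothetical reconstructions of the same pathway, one keyed by UniProt accessions and one by gene symbols, only merge into a consistent network after both are mapped onto a shared identifier space (the mapping table would be compiled beforehand, e.g., via the UniProt ID mapping service).

import networkx as nx

net_a = nx.Graph([("P19838", "Q04206")])                 # UniProt accessions
net_b = nx.Graph([("NFKB1", "RELA"), ("RELA", "CHUK")])  # gene symbols
id_map = {"NFKB1": "P19838", "RELA": "Q04206", "CHUK": "O15111"}

def normalize(graph, mapping):
    # relabel nodes to the shared identifier space; unmapped nodes are kept as-is
    return nx.relabel_nodes(graph, lambda n: mapping.get(n, n))

a, b = normalize(net_a, id_map), normalize(net_b, id_map)
merged = nx.compose(a, b)
for u, v in merged.edges():
    merged[u][v]["in_both"] = a.has_edge(u, v) and b.has_edge(u, v)
print(merged.edges(data=True))

Without the mapping step, composing net_a and net_b would yield two disconnected components over five nodes instead of a single three-node network, exactly the "simple sum of the originating objects" warned about above.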
3.1.3. Functional Enrichment
The obtained networks can be functionally enriched, i.e., integrated and superimposed with data of a different type, such as gene expression data or Gene Ontology (GO) categories, to verify whether statistically overrepresented features are linked to topological characteristics. Some plugins are available for Cytoscape, and many other tools are accessible on the Internet. Among them, we mention BiNGO (41) and ClueGO (42) as plugins for determining which Gene Ontology categories are overrepresented in sets of genes (in the present context corresponding to subgraphs of a given biological network), allowing the predominant functional themes of a given gene set to be mapped on the GO hierarchy as a graph, and cluster analysis and comparison of clusters to be performed.
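The statistical core of such enrichment analyses can be stated in a few lines: the probability of observing at least k genes annotated with a given GO term in a cluster of n genes, when K of the N background genes carry the annotation, is the tail of a hypergeometric distribution. The counts below are hypothetical, and production tools such as BiNGO additionally correct for multiple testing across all GO terms.

from scipy.stats import hypergeom

N, K = 6000, 150  # background size; background genes annotated with the term
n, k = 40, 8      # cluster size; annotated genes observed in the cluster

p_value = hypergeom.sf(k - 1, N, K, n)  # P(X >= k)
print(p_value)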
3.2. Network Analysis
Once the user has performed the reconstruction steps and considers the "object" pathway or interactome reasonably complete and stable (for the purpose of the study to be carried out), it is time to proceed with the subsequent network analysis. All the cited computational platforms are designed precisely to perform such analyses, which can be easily implemented through embedded or add-on features.
3.2.1. Topological Measures
The goal of topological analysis of protein networks is to discern organizational "design" principles, relate those to dynamical properties, and establish connections to biological functions. Interesting topological properties are detected by comparing the network under study with a "null model," that is, a set of networks that reflect what is expected by random chance. If a network under study possesses certain characteristics different from what is expected by chance alone, then these might be related to the specific function of the network: they could have been selected by evolution for their advantageous properties. Topological measures have demonstrated their usefulness in uncovering the organizing principles that rule the development and evolution of networks of different nature (8). Several observations have led to the conclusion that the classical degree distribution, and the well-investigated scale-free characteristic of nodes in PINs, for instance, correlate with biologically meaningful features, such as importance, lethality, robustness, and the dynamics of perturbations. Hierarchical topology, subgraphs, modular structures, and clusters are, among others, strongly characterizing features of networks that a focused analysis can reveal (10). In some fields, such as cancer research, extensive and deep meta-analyses have shown how specific measures, such as betweenness and stress centrality, among others, are particularly relevant in characterizing pathological states and malignant tissues (43).
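As a concrete, minimal version of the null-model comparison described above, the sketch below contrasts the average clustering coefficient of a network with the distribution obtained from degree-preserving randomizations (repeated double-edge swaps). The stand-in graph is synthetic; in practice one would pass the reconstructed PIN.

import networkx as nx

pin = nx.barabasi_albert_graph(200, 2, seed=1)  # stand-in for a reconstructed PIN
observed = nx.average_clustering(pin)

null = []
for seed in range(100):
    rnd = pin.copy()  # randomize while preserving the degree sequence
    nx.double_edge_swap(rnd, nswap=4 * rnd.number_of_edges(),
                        max_tries=10**6, seed=seed)
    null.append(nx.average_clustering(rnd))

p_emp = sum(value >= observed for value in null) / len(null)
print(observed, sum(null) / len(null), p_emp)  # empirical one-sided P value

A low empirical P value indicates that the observed clustering is unlikely under the degree-preserving null model and may therefore reflect functional organization rather than the degree sequence alone.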
3.2.2. Dynamical Models
Owing to the intricacy of signal transduction, computational analysis is necessary to understand the dynamical properties of PSNs. Even for very small, relatively simple PSNs,
it has been shown that a wide range of complex dynamical properties can be attained (13, 44–46), and parallels have been drawn between signaling circuits and man-made control systems to explain important biological properties, such as amplification, robustness, homeostasis, and adaptation, particularly highlighting the importance of feedbacks in PSNs (45, 47–51). Several larger mathematical models based on ordinary differential equations have been formulated for PSNs, and their parameters optimized in order to fit experimental observations (52–55). Although studies with such models provide detailed insights into the dynamics and function of signaling pathways, formulating them is a difficult problem that requires a huge amount of specific and quantitative experimental data, which are not expected to be available on a proteome-wide scale in the near future. Dynamical models of proteome-wide PSNs, although lacking precise quantitative information on the kinetic dependencies, can still be used to discover principles of global dynamical organization. For example, a qualitative approach to the dynamic modeling of PSNs is the use of Boolean logic, in which each protein is "off" or "on" at a given time-step depending on the states of its inputs. Recently, it was shown that a PSN formalized with Boolean logic can classify sets of inputs into distinct output patterns – an ability that arises through the complex wiring pattern among the proteins in the PSN (56). This ability is an emergent dynamical property determined by the structure of the PSN, as the authors showed that randomizing the network results in its loss. Interesting metaphors have been drawn between PSNs and computational networks (14). Back in 1990, Dennis Bray highlighted the similarity between PSNs and "artificial neural networks" (57). Rather than viewing signal transduction as just a mechanism to transmit information from the cell surface to the nucleus and other locations, this analogy suggests a process of turning complex input signals (environments) into complex output signals (biological responses). Similar to artificial neural networks, where the parameters are adjusted through mathematical optimization to obtain required input–output relationships, evolution has tweaked the parameters in PSNs to obtain the ability to generate appropriate responses to the wide variety of complex environmental signals that organisms are subjected to (57). It has become clear that the proteome forms a complex system with many emergent properties yet to be discovered and understood (10, 14). Topological and dynamical studies of PSNs that explicitly take the INPUT → CENTRAL NETWORK → OUTPUT structure (56, 58–61) into account will most certainly yield many insights into the functional organization of these intricate protein networks (Fig. 2).
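A minimal sketch of the Boolean abstraction is given below: each protein is 0/1, and all nodes are updated synchronously from the states of their inputs. The three-node wiring and its update rules are hypothetical (a kinase activated by a receptor and switched off by a phosphatase it induces) and are not taken from ref. 56, yet they already display nontrivial dynamics.

rules = {
    "receptor": lambda s: s["input"],                         # follows the stimulus
    "kinase": lambda s: s["receptor"] and not s["phosphatase"],
    "phosphatase": lambda s: s["kinase"],                     # negative feedback
}

state = {"input": 1, "receptor": 0, "kinase": 0, "phosphatase": 0}
for step in range(6):
    state = dict(state, **{n: int(rule(state)) for n, rule in rules.items()})
    print(step, state)  # the negative feedback makes the kinase oscillate

Even this toy circuit shows how a constant input can be converted into a time-structured output, a flavor of the emergent input–output processing discussed above.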
Fig. 2. The “bow-tie” layout of a Human Protein Signaling Network clearly shows the network’s main information flow from the input nodes to the central core, which processes and passes the information to the output nodes, in turn establishing the physiological responses (59). This is one example among the many relevant results that can be obtained by network analysis.
3.3. Practical Applications
3.3.1. Examples
We present here the main steps of the workflow followed for the reconstruction and analysis of the transcription factor Nuclear Factor-kB (NF-kB) interactome (62). NF-kB is a central transcription factor, involved in inflammation as well as in many other normal and pathophysiological processes. Given the intricacy of the signaling system and the number of genes directly regulated, it is interesting to study the main characteristics of its interactome.
1. We start from manual literature mining and review: such an approach, which in this particular case lasted about 2 months, guarantees a fairly complete list of the proteins that take part in the signaling cascade with different roles and importance. This basic list can be expanded, enriched, and refined step by step, comparing and complementing preliminary results with data browsed and downloaded (manually or utilizing tools such as Cytoscape) from several pathway and PIN databases. The result of this manual process is a "core list" of 140 proteins.
2. Protein interaction data are added to build the first version of the "core interactome." The main tool used in this step is the APID database, automatically accessed through the dedicated plugin in Cytoscape. The result is a network consisting of 140 nodes and 829 nondirectional interactions.
3. Using automated retrieval tools and databases (APID2NET and BioNetBuilder in Cytoscape, the main PIN databases), a "wider interactome" is built, taking into account all the proteins with evidence of interaction with at least one protein present in the "core interactome." At the end of the process, the whole "wider interactome" consists of more than 3,100 proteins accounting for a total of more than 42,600 interactions.
4. Data elaborated from a manually curated list of NF-kB-downstream genes (63) and from the TRED database (manually extracted), integrated with results from TRANSFAC, allow the compilation of a relatively comprehensive list of about 400 genes that are up- or downregulated via NF-kB. Gene products and the corresponding UniProt identifiers are obtained directly through the ID mapping functions available on the UniProt Web interface, allowing compilation of the list of proteins whose expression can be regulated by NF-kB.
5. The whole interactome, now consisting of core proteins (those that directly participate in the signaling cascades activating NF-kB), wider interactome proteins (their direct interactors), and regulated genes with the corresponding expressed proteins, then undergoes functional enrichment and analysis: topological characterization, GO enrichment, and clustering are all easily performed thanks to the availability of several standalone analysis tools, such as Cytoscape, as well as Web-based services. Results of the analysis include, among others, a wider, integrated overview of the NF-kB signaling system and its main topological characteristics, the detection of specific hubs or central proteins, the discovery of feedback loops and cross controls among proteins, and genes that are candidates for further in-depth studies.
3.3.2. Pitfalls
We consider here some pitfalls in the procedure shown, together with some general considerations on the proposed workflow and the problems encountered. One of the major concerns in pathway and PSN reconstruction is the lack of clear and comprehensive data about reactions and their directionality: directional information is still a rare commodity. As said, the user is unlikely to find the same pathway represented in even similar ways in different databases. This imposes the need to choose one out of several data models and contents, or to engage in the nontrivial effort of integrating
and complementing the various data and data types. The lack of undisputable data about a number of reactions and proteins in the mentioned NF-kB interactome reconstruction, together with the usual time constraints, persuaded us – at least provisionally – to omit the corresponding dynamical information from our representation. Without directional information, it is impossible to implement dynamical models and simulation, even a simple model based on Boolean dynamics, unless one is willing to assume that each edge A–B is bidirectional, i.e., A → B and B → A, which is very unrealistic indeed. The automation of procedures able to integrate different pathways in a coherent and biologically meaningful way is a critical point. Currently, there is no practical, coherent, and effective way to integrate data from multiple sources into a single object other than manual intervention. Even if data on single pathways in the different databases often come close to being precise, comprehensive, and satisfying enough to serve as a starting base, it is the integration process and the subsequent elaboration that hopefully bring valuable information and new knowledge. Indeed, in this regard, data models, representations, and annotation are key points in the discussion about these hot topics (64–67).
4. Notes
1. Standards for the representation of information about pathways are necessary for the integration and analysis of data from various sources. BioPAX (Biological Pathway Exchange, http://www.biopax.org) is a biological pathway data exchange format. It enables the integration of diverse pathway resources by defining an open file format specification for the exchange of biological pathway data. Widespread adoption of BioPAX for data exchange facilitates uniform access to pathway data from different sources, thereby increasing the efficiency of computational pathway research. The Systems Biology Markup Language (SBML, http://www.sbml.org) is a computer-readable format for representing models of biological processes. It is mostly used for dynamical models of metabolism, cell signaling, and many other topics. PSI-MI (http://www.psidev.info) is a standard proposed for improving the annotation and representation of molecular interaction data wherever they are published, i.e., in journal articles, on authors' Web sites, or in public domain databases, and for improving the general accessibility of molecular interaction data.
The Gene Ontology project (http://www.geneontology.org) is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members, as well as tools to access and process this data.
2. The following is an alphabetical and nonexhaustive list of the resources cited and used in the described reconstruction and analysis process.
● APID Agile Protein Interaction DataAnalyzer – http://bioinfow.dep.usal.es/apid/index.htm
● Ariadne Genomics Pathway Studio – http://www.ariadnegenomics.com/products/pathway-studio
● BIND Biomolecular Interaction Network Database – http://www.bind.ca
● BioCarta Pathways – http://www.biocarta.com/genes/index.asp
● BioGRID The Biological General Repository for Interaction Datasets – http://www.thebiogrid.org
● BiologicalNetworks – http://biologicalnetworks.net
● CellDesigner – http://www.celldesigner.org
● ClusterMaker – http://www.cgl.ucsf.edu/cytoscape/cluster/clusterMaker.html
● Cytoscape – http://www.cytoscape.org
● DIP Database of Interacting Proteins – http://dip.doe-mbi.ucla.edu/dip/Main.cgi
● Entrez Gene – http://www.ncbi.nlm.nih.gov/gene
● GenMAPP Gene Map Annotator and Pathway Profiler – http://www.genmapp.org
● GraphWeb – http://biit.cs.ut.ee/graphweb
● HPRD Human Protein Reference Database – http://www.hprd.org
● HUBBA Hub objects analyzer – http://hub.iis.sinica.edu.tw/Hubba
● Ingenuity Systems – http://www.ingenuity.com
● IntAct – http://www.ebi.ac.uk/intact
● KEGG Kyoto Encyclopedia of Genes and Genomes – http://www.genome.jp/kegg
● MINT the Molecular INTeraction database – http://mint.bio.uniroma2.it/mint
● NCI-Nature Pathway Interaction Database – http://pid.nci.nih.gov
● NetPath – http://www.netpath.org
● NetworkAnalyzer – http://med.bioinf.mpi-inf.mpg.de/netanalyzer
● Pajek – http://vlado.fmf.uni-lj.si/pub/networks/pajek
● Pathguide: the pathway resource list – http://www.pathguide.org
● PathVisio – http://www.pathvisio.org
● Pathway Commons – http://www.pathwaycommons.org
● R Project for Statistical Computing – http://www.r-project.org
● Reactome – http://www.reactome.org
● SBW Systems Biology Workbench – http://sbw.sourceforge.net
● TRANSFAC & TRANSPATH – http://www.gene-regulation.com
● TRED Transcriptional Regulatory Element Database – http://rulai.cshl.edu/cgi-bin/TRED
● UniProt – http://www.uniprot.org
● WikiPathways – http://www.wikipathways.org/index.php/WikiPathways
Acknowledgments
This work has been partially funded by the Emilia-Romagna Region BioPharmaNet High Technology Network (http://www.biopharmanet.eu) and by the Regional Authorities of Sardinia.

References
1. Travers J., and Milgram S. (1969) An experimental study of the small world problem. Sociometry 32, 425–43.
2. Alderson D.L., Li L., Willinger W., and Doyle J.C. (2005) Understanding internet topology: principles, models, and validation. IEEE/ACM Trans Netw 13, 1205–18.
3. Watts D.J., and Strogatz S.H. (1998) Collective dynamics of 'small-world' networks. Nature 393, 440–42.
4. Albert R., Jeong H., and Barabasi A.L. (2000) Error and attack tolerance of complex networks. Nature 406, 378–82.
5. Jeong H., Tombor B., Albert R., Oltvai Z.N., and Barabasi A.L. (2000) The large-scale organization of metabolic networks. Nature 407, 651–4.
6. Newman M.E.J. (2000) Models of the small world. J Stat Phys 101, 819–41.
7. Jeong H., Mason S.P., Barabasi A.L., and Oltvai Z.N. (2001) Lethality and centrality in protein networks. Nature 411, 41–2.
8. Barabasi A.L., and Oltvai Z.N. (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5, 101–15.
9. Goh K.I., Cusick M.E., Valle D., Childs B., and Vidal M. (2007) The human disease network. Proc Natl Acad Sci USA 104, 8685–90.
10. Pieroni E., de la Fuente van Bentem S., Mancosu G., Capobianco E., Hirt H., and de la Fuente A. (2008) Protein networking: insights into global functional organization of proteomes. Proteomics 8, 799–816.
11. Boccaletti S., Latora V., Moreno Y., Chavez M., and Hwang D.U. (2006) Complex networks: structure and dynamics. Phys Rep 424, 175–308.
12. Tieri P., Valensin S., Latora V., Castellani G.C., Marchiori M., Remondini D., and Franceschi C. (2005) Quantifying the relevance of different mediators in the human immune cell network. Bioinformatics 21, 1639–43.
13. Bhalla U.S., and Iyengar R. (1999) Emergent properties of networks of biological signaling pathways. Science 283, 381–7.
14. Bhalla U.S. (2003) Understanding complex signaling networks through models and metaphors. Prog Biophys Mol Biol 81, 45–65.
15. Ivakhno S., and Armstrong J.D. (2007) Nonlinear dimensionality reduction of signaling networks. BMC Sys Biol 1, 27.
16. Adriaens M.E., Jaillard M., Waagmeester A., Coort S.L.M., Pico A.R., and Evelo C.T.A. (2008) The public road to high-quality curated biological pathways. Drug Discov Today 13, 856–62.
17. Bauer-Mehren A., Furlong L.I., and Sanz F. (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Sys Biol 5, 290.
18. Gardy J.L., Lynn D.J., Brinkman F.S., and Hancock R.E. (2009) Enabling a systems biology approach to immunology: focus on innate immunity. Trends Immunol 30, 249–62.
19. Bader G.D., Cary M.P., and Sander C. (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34, D504–6.
20. Matthews L., Gopinath G., Gillespie M., Caudy M., Croft D., de Bono B., Garapati P., Hemish J., Hermjakob H., Jassal B., Kanapin A., Lewis S., Mahajan S., May B., Schmidt E., Vastrik I., Wu G., Birney E., Stein L., and D'Eustachio P. (2009) Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res 37, D619–22.
21. Kanehisa M., and Goto S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28, 27–30.
22. Kanehisa M., Goto S., Furumichi M., Tanabe M., and Hirakawa M. (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38, D355–60.
23. Schaefer C.F., Anthony K., Krupa S., Buchoff J., Day M., Hannay T., and Buetow K.H. (2009) PID: the Pathway Interaction Database. Nucleic Acids Res 37, D674–9.
24. Pico A.R., Kelder T., van Iersel M.P., Hanspers K., Conklin B.R., and Evelo C. (2008) WikiPathways: pathway editing for the people. PLoS Biol 6, e184.
25. Prieto C., and De Las Rivas J. (2006) APID: Agile Protein Interaction DataAnalyzer. Nucleic Acids Res 34, W298–302.
26. Hernandez-Toro J., Prieto C., and De las Rivas J. (2007) APID2NET: unified interactome graphic analyzer. Bioinformatics 23, 2495–7.
27. Zhao F., Xuan Z., Liu L., and Zhang M.Q. (2005) TRED: a transcriptional regulatory element database and a platform for in silico gene regulation studies. Nucleic Acids Res 33, D103–7.
28. Jiang C., Xuan Z., Zhao F., and Zhang M.Q. (2007) TRED: a transcriptional regulatory element database, new entries and other development. Nucleic Acids Res 35, D137–40.
29. Choi C., Krull M., Kel A., Kel-Margoulis O., Pistor S., Potapov A., Voss N., and Wingender E. (2004) TRANSPATH – A high quality database focused on signal transduction. Comp Funct Genom 2, 163–8.
30. Matys V., Fricke E., Geffers R., Gössling E., Haubrock M., Hehl R., Hornischer K., Karas D., Kel A.E., Kel-Margoulis O.V., Kloos D.U., Land S., Lewicki-Potapov B., Michael H., Münch R., Reuter I., Rotert S., Saxel H., Scheer M., Thiele S., and Wingender E. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31, 374–8.
31. Keshava Prasad T.S., Goel R., Kandasamy K., Keerthikumar S., Kumar S., Mathivanan S., Telikicherla D., Raju R., Shafreen B., Venugopal A., Balakrishnan L., Marimuthu A., Banerjee S., Somanathan D.S., Sebastian A., Rani S., Ray S., Harrys Kishore C.J., Kanth S., Ahmed M., Kashyap M.K., Mohmood R., Ramachandra Y.L., Krishna V., Rahiman B.A., Mohan S., Ranganathan P., Ramabadran S., Chaerkady R., and Pandey A. (2009) Human Protein Reference Database – 2009 update. Nucleic Acids Res 37, D767–72.
32. Shannon P., Markiel A., Ozier O., Baliga N.S., Wang J.T., Ramage D., Amin N., Schwikowski B., and Ideker T. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13, 2498–504.
33. Baitaluk M., Sedova M., Ray A., and Gupta A. (2006) Biological Networks: visualization and analysis tool for systems biology. Nucleic Acids Res 34, W466–71.
34. Funahashi A., Tanimura N., Morohashi M., and Kitano H. (2003) CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. BIOSILICO 1, 159–62.
35. Batagelj V., and Mrvar A. (2003) Pajek – analysis and visualization of large networks. In Jünger M., Mutzel P. (Eds.) Graph drawing software. Springer, Berlin, 77–103.
36. Reimand J., Tooming L., Peterson H., Adler P., and Vilo J. (2008) GraphWeb: mining heterogeneous biological networks for gene modules with functional significance. Nucleic Acids Res 36, W452–9.
37. Assenov Y., Ramírez F., Schelhorn S.E., Lengauer T., and Albrecht M. (2008) Computing topological parameters of biological networks. Bioinformatics 24, 282–4.
38. Lin C.Y., Chin C.H., Wu H.H., Chen S.H., Ho C.W., and Ko M.T. (2008) Hubba: hub objects analyzer – a framework of interactome hubs identification for network biology. Nucleic Acids Res 36, W438–43.
39. Jensen L.J., Saric J., and Bork P. (2006) Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 7, 119–29.
40. Avila-Campillo I., Drew K., Lin J., Reiss D.J., and Bonneau R. (2007) BioNetBuilder: automatic integration of biological networks. Bioinformatics 23, 392–3.
41. Maere S., Heymans K., and Kuiper M. (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21, 3448–9.
42. Bindea G., Mlecnik B., Hackl H., Charoentong P., Tosolini M., Kirilovsky A., Fridman W.H., Pagès F., Trajanoski Z., and Galon J. (2009) ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25, 1091–3.
43. Platzer A., Perco P., Lukas A., and Mayer B. (2007) Characterization of protein-interaction networks in tumors. BMC Bioinformatics 8, 224.
44. Bray D. (1995) Protein molecules as computational elements in living cells. Nature 376, 307–12.
45. Sauro H.M., and Kholodenko B.N. (2004) Quantitative analysis of signaling networks. Prog Biophys Mol Biol 86, 5–43.
46. Tyson J.J., Chen K.C., and Novak B. (2003) Sniffers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Curr Opin Cell Biol 15, 221–31.
47. Alon U., Surette M.G., Barkai N., and Leibler S. (1999) Robustness in bacterial chemotaxis. Nature 397, 168–71.
48. Ferrell J.E., Jr. (1996) Tripping the switch fantastic: how a protein kinase cascade can convert graded inputs into switch-like outputs. Trends Biochem Sci 21, 460–6.
49. Goldbeter A., and Koshland D.E., Jr. (1981) An amplified sensitivity arising from covalent modification in biological systems. Proc Natl Acad Sci USA 78, 6840–4.
50. Levin M.D., Morton-Firth C.J., Abouhamad W.N., Bourret R.B., and Bray D. (1998) Origins of individual swimming behavior in bacteria. Biophys J 74, 175–81.
51. Yi T.M., Huang Y., Simon M.I., and Doyle J. (2000) Robust perfect adaptation in bacterial chemotaxis through integral feedback control. Proc Natl Acad Sci USA 97, 4649–53.
52. Chen K.C., Calzone L., Csikasz-Nagy A., Cross F.R., Novak B., and Tyson J.J. (2004) Integrative analysis of cell cycle control in budding yeast. Mol Biol Cell 15, 3841–62.
53. Chen K.C., Csikasz-Nagy A., Gyorffy B., Val J., Novak B., and Tyson J.J. (2000) Kinetic analysis of a molecular model of the budding yeast cell cycle. Mol Biol Cell 11, 369–91.
54. Kholodenko B.N. (2006) Cell-signalling dynamics in time and space. Nat Rev Mol Cell Biol 7, 165–76.
55. Tyson J.J., Chen K., and Novak B. (2001) Network dynamics and cell physiology. Nat Rev Mol Cell Biol 2, 908–16.
56. Helikar T., Konvalina J., Heidel J., and Rogers J.A. (2008) Emergent decision-making in biological signal transduction networks. Proc Natl Acad Sci USA 105, 1913–8.
57. Bray D. (1990) Intracellular signalling as a parallel distributed process. J Theor Biol 143, 215–31.
58. Cui Q., Yu Z., Purisima E.O., and Wang E. (2006) Principles of microRNA regulation of a human cellular signaling network. Mol Syst Biol 2, 46.
59. de la Fuente A., Fotia G., Maggio F., Mancosu G., and Pieroni E. (2008) Insights into biological information processing: structural and dynamical analysis of a Human Protein Signalling Network. J Phys A 41, 224013.
60. Liu W., Li D., Zhang J., Zhu Y., and He F. (2006) SigFlux: a novel network feature to evaluate the importance of proteins in signal transduction networks. BMC Bioinformatics 7, 515.
61. Ma'ayan A., Jenkins S.L., Neves S., Hasseldine A., Grace E., Dubin-Thaler B., Eungdamrong N.J., Weng G., Ram P.T., Rice J.J., Kershenbaum A., Stolovitzky G.A., Blitzer R.D., and Iyengar R. (2005) Formation of regulatory patterns during signal propagation in a Mammalian cellular network. Science 309, 1078–83.
62. Tieri P. (2009) Reconstruction and analysis of the NF-kB pathway interactome, communication to NetSci 2010, International Conference on Complex Network Science, 10–14 May 2010, M.I.T. Boston, USA (http://www.netsci2010.net/abstracts/Tieri.htm), and RECOMBSAT 2010, 16–20 November 2010, Columbia Univ., New York, USA (available from Nature Precedings, http://dx.doi.org/10.1038/npre.2010.5266.1).
63. Gilmore T.D. Rel/NF-kB Transcription Factors website, http://www.nf-kb.org.
433
64. Ceol A., Chatr-Aryamontri A., Licata L., and Cesareni G. (2008) Linking entries in protein interaction database to structured text: the FEBS Letters experiment. FEBS Lett 582, 1171–7. 65. Leitner F., and Valencia A. (2008) A textmining perspective on the requirements for electronically annotated abstracts. FEBS Lett 582, 1178–81. 66. Gerstein M., Seringhaus M., and Fields S. (2007) Structured digital abstract makes text mining easy. Nature 447, 142. 67. Termanini A., Tieri P., Franceschi C. (2010) Encoding the states of interacting proteins to facilitate biological pathways reconstruction. Biology Direct 13, 5:52.
Chapter 20

Network Inference from Time-Dependent Omics Data

Paola Lecca, Thanh-Phuong Nguyen, Corrado Priami, and Paola Quaglia

Abstract

We provide a commented overview of the available databases for the systematic collection of pathway information and biological models essential for the interpretation of Omics data. Then, we present both the state of the art and the future challenges of network inference, a research area dealing with the deduction of reaction mechanisms from experimental Omics data. This approach represents one of the most challenging instances of making use of the huge amount of information gathered in the Omics era.

Key words: Signaling, Pathway database, Bayesian inference, Network inference
1. Introduction

With the emergence and growing impact of Systems Biology, Omics data are now more available than ever. Access to these data offers a chance to better understand how biological systems behave as a result of the integration of, and interaction among, the individual components that high-throughput experimental methods can now monitor simultaneously (see Note 1). Here, we present an overview of Omics databases for the systematic collection of pathway information and biological models, together with the challenges related to the integration of the data they contain. This overview is accompanied by a commented list of related software and tools, and by a presentation of recent approaches to data integration. We then present both the state of the art and the upcoming challenges in network inference. This research area, concerned with the deduction of reaction mechanisms from experimental data, is one of the most promising and challenging instances of making use of the huge
amount of information available in the Omics era. First, we review the state of the art of the three main methodological techniques for attacking network inference: the perturbation-based approach, the correlation-based approach, and the probabilistic approach. Then, by means of an example relating to the Nuclear Factor-kB (NF-kB) signaling pathway, we illustrate a novel method for network inference that can only partially be classified into these three main categories. This method is implemented in KInfer, one of the tools in the CoSBiLab suite (http://www.cosbi.eu/index.php/research/prototypes/overview) (1), which is also presented.
2. Materials

Bioinformatics and Systems Biology have induced significant new developments of general interest in databases, machine learning, graph inference, semi-supervised learning, system identification, and novel combinations of optimization and learning algorithms (2, 3). The challenge is how to effectively integrate data from multiple sources, identify the relevant pieces of information, and make sense of them. The systematic collection of pathway information in the form of pathway databases, and their analysis for pathway modeling, is crucial. At present, there are several repositories on cell signaling pathways that contain high-quality data in terms of annotation and cross-references to biological databases (4). These pathways are mostly presented in graphical format (e.g., in textbooks), or in standard formats allowing exchange between different software platforms and further processing by network analysis, visualization, and modeling tools. The Pathguide resource provides one of the most comprehensive overviews of current pathway databases (5). Some of the well-known databases include the Reactome, KEGG, WikiPathways, and PID databases. The Reactome database contains reactions for almost all types of biological processes and organizes them in a hierarchical manner. Besides holding information on pathways, the KEGG database is composed of 19 highly interconnected databases covering genomic, chemical, and phenotypic information. Building on a different idea, WikiPathways serves as an open and collaborative platform for creating and editing biological pathways in various species. Based on three data sources – peer-reviewed literature, the Reactome database, and the BioCarta database – the PID database consists of data on cell signaling in humans. With the increasing number of signaling pathway databases, modeling databases have emerged as repositories of complex biological systems. The two best-known are the BioModels Database and the CellML repository. The BioModels
Database and the CellML repository both allow the user to view a model as a traditional flow diagram and to download the underlying data. The BioModels Database contains links to proteins, but the models are not rigidly categorized by pathway type, such as circadian or intracellular signaling. The CellML repository allows models to be imported directly into the PCEnv modeling package. As more biologists see the possibilities of quantitative computer modeling, databases of biological models are starting to grow. Table 1 lists popular databases for signaling pathways.
Table 1
Signaling pathway databases

Database | Web address
COPE – Cytokines Online Pathfinder Encyclopedia | http://www.copewithcytokines.de
Dynamic Signaling Maps | http://www.hippron.com/hippron/index.html
Pathway Analysis Tool for Integration and Knowledge Acquisition | http://www.patika.org
PFBP – Protein Function and Biochemical Pathways | http://www.scmbb.ulb.ac.be/amaze
KEGG – Kyoto Encyclopedia of Genes and Genomes | http://www.genome.ad.jp/kegg
PathDB | http://www.ncgr.org/pathdb/index.html
SPAD – Signaling PAthway Database | http://www.grt.kyushu-u.ac.jp/spad
Cytokine Signaling Pathway Database | http://www.signaling-gateway.org
EcoCyc and MetaCyc | http://ecocyc.PangeaSystems.com/ecocyc
BioCarta database | http://www.biocarta.com
TransPath | http://www.biobase-international.com
Reactome | http://www.reactome.org
Pathway Interaction Database | http://pid.nci.nih.gov
SigPath | http://www.sigpath.org
WikiPathways | http://www.wikipathways.org/index.php/WikiPathways
The Alliance for Cellular Signaling | http://www.signaling-gateway.org
BioModels Database | http://www.ebi.ac.uk/biomodels
DOQCS | http://doqcs.ncbs.res.in
JWS Online | http://jjj.biochem.sun.ac.za/database/index.html
Kinetic Data of Biomolecular Interaction | http://xin.cz3.nus.edu.sg/group/kdbi/kdbi.asp
A Comprehensive Systems-Biology Database | http://csbdb.mpimp-golm.mpg.de
CellML Model Repository | http://models.cellml.org
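Several of these resources can also be queried programmatically. As a minimal sketch – not an endorsement of any particular interface – the following retrieves a pathway from KEGG via its REST service (https://rest.kegg.jp); the identifier hsa04064 (the NF-kappa B signaling pathway map) is used purely as an example, and network access is assumed:

```python
import urllib.request

def kegg_get(entry, option=None):
    """Fetch a KEGG entry, optionally in a specific format (e.g., 'kgml')."""
    url = f"https://rest.kegg.jp/get/{entry}"
    if option:
        url += f"/{option}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Retrieve the NF-kappa B signaling pathway map as KGML (XML), ready for
# further processing by network analysis or visualization tools.
kgml = kegg_get("hsa04064", "kgml")
print(kgml[:200])
```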
Each of the databases contains a different subset of biological knowledge. Hence, a specific database can successfully provide specific data within its domain, but cannot cover extensive data across domains. To obtain a complete view of the biological process of interest, data integration is highly encouraged (6). A crucial issue is finding an in-between solution that maintains the scientific and ownership independence of individual databases while fostering the integration of the recorded information to enable cross-database queries. Various research groups have been working on this issue, and three main approaches for the integration of biological databases have been investigated, referred to as link integration, view integration, and data warehousing:

● Link integration. Beginning a query with one data source, researchers can then follow hypertext links to related information in other data sources. An important variant of link-level integration is Web-service applications.
● View integration. The objective is to build an environment around databases that makes them all seem to be part of a single large system.
● Data warehousing. This general approach can be broadly described as bringing all the data under one roof in a single database.
Data integration brings numerous advantages, as well as many relevant challenges. Two of the main issues are (1) the different formats and models used for the same data and (2) the assignment and maintenance of correct names for biological objects across databases. These problems cause clashes of concepts as users move from one database to another. The main problem faced by researchers in carrying out model integration is the absence of a general rule for naming molecular species. Users commonly name model components (e.g., proteins) in an idiosyncratic way, often using short, almost cryptic labels; and even a name that is perfectly clear to a biologist cannot always be unambiguously identified or linked to a specific protein in a database. Additionally, there are technical challenges. The various biological databases use different database management systems (DBMSs), and none provides a standard way of accessing the data. Some databases provide large text dumps of their contents, some offer access to the underlying DBMS, and still others provide only Web pages as their primary mode of access. Even more challenging is the issue of updates: biological databases are always changing, so integration must be an ongoing task.
One possible solution for solving integration problems is the use of ontologies, i.e., strict rules for naming entities and their relationships. In this regard, a stream of continuous improvements has resulted in an impressive set of tools, such as the Semantic Web, the Open Document Format (ODF, http://opendocument.xml.org), and the Web Ontology Language (OWL, http://www.w3.org/TR/owl-semantics). For boosting the efficacy of data processing, careful annotation of the data has to be applied, further supported by the development of minimum information standards. For example, MIAME is the agreed minimum standard for microarray data. For models, the minimal information requested in the annotation of biochemical models (MIRIAM, http://www.ebi.ac.uk/miriam) has been registered with the minimum information for biological and biochemical investigations (MIBBI, http://www.mibbi.org). MIBBI acts as a clearing house for diverse minimum information specifications in different fields and consolidates what is agreed to be the acceptable standard of annotation. Another approach is BlenX4Bio (7), which allows users to collect data in tabular format and automatically translate them into an algorithmic model written in BlenX (8), ready for simulation in the CoSBiLab suite. Many software packages and tools in Systems Biology have been developed to aid scientists in their research; however, they are frequently tailored to specific applications. They can be API programs, software packages, Web-service applications, etc. A listing of tools for the visualization of networks and pathways includes VANTED, Cytoscape, MapMan, KaPPA-View, PathwayExplorer, and the Omics Viewer included in MetaCyc-related databases (http://metacyc.org), such as AraCyc and others. At the same time, simulation tools for biochemical systems are currently in focus (9). As the number of modelers grows, it is not surprising that there has recently been an exponential increase in the number of modeling tools available; about 110 of them are listed on the SBML Web page (http://sbml.org/index.psp). Modeling programs certainly differ in the number and type of features they offer. For example, the E-Cell program, part of the comprehensive E-Cell Project, is notable for its hybrid simulations, that is, simulations that use a combination of several algorithms. E-Cell provides a simple scripting interface that eases the creation of events. The tool embeds 13 different modeling algorithms, any combination of which can be mixed in a single simulation. One of the latest trends in the development of tools for Systems Biology is the Web-service approach, which allows accessing the software via a Web site, where the creation of models, their visualization, and their manipulation can be done online. In this way, any changes to the software can be checked against any effects on the model output (10). Table 2 presents some tools and software used in Systems Biology.
Table 2
Tools and software for network inference

Tool name | Web address
CoSBiLab | http://www.cosbi.eu/index.php/research/prototypes/overview
Copasi | http://www.copasi.org
Pathway Analytics | http://www.teranode.com/products/solutions/pathway_analytics.php
PathwayLab | http://www.innetics.com
SimTool | http://www.simtool.com
Jarnac/JDesigner | http://jdesigner.sourceforge.net
SimBiology | http://www.mathworks.com/products/simbiology
Genesis/KinetiKit | http://genesis-sim.org
Gepasi3 | http://www.gepasi.org
E-Cell | http://www.e-cell.org/ecell
JWS Online | http://jjj.biochem.sun.ac.za
WebCell | http://webcell.kaist.ac.kr
Simulation Web Application | http://sbw.kgi.edu/Simulation2005
Virtual Cell | http://www.nrcam.uchc.edu/index.html
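Most of the tools in Table 2 exchange models through SBML. As a minimal sketch of handling such a model programmatically – assuming the python-libsbml package and a hypothetical local file model.xml, e.g., downloaded from the BioModels Database – one can load a model and inspect its contents:

```python
import libsbml  # provided by the python-libsbml package

doc = libsbml.readSBML("model.xml")   # hypothetical local SBML file
if doc.getNumErrors() > 0:
    doc.printErrors()

model = doc.getModel()
print(model.getNumSpecies(), "species,", model.getNumReactions(), "reactions")
for i in range(model.getNumSpecies()):
    species = model.getSpecies(i)
    print(species.getId(), species.getInitialConcentration())
```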
3. Methods

Below we provide an overview of network inference, one of the most promising and challenging instances of using the huge amount of information generated by Omics techniques. Network inference is the research area in computational Systems Biology dealing with the deduction of reaction mechanisms from given experimental Omics data. At present, network inference is a challenging problem that requires considerable expertise and intellectual effort, both in the design of experimental protocols and in the development of theoretical procedures. Different theoretical and modeling paradigms need to be developed and applied, depending on the type of experimental techniques used for retrieving the data. The currently available approaches to network inference can be classified into three main classes: (1) perturbation methods, (2) correlation-based methods, and (3) probabilistic approaches. In the next section,
we give a survey of the main inference techniques belonging to these three groups. Then, we discuss the major upcoming challenges that these methods have to tackle. To understand the functioning of cells, or that of higher units of biological organization, it is beneficial to conceptualize them as systems of interacting elements, i.e., as networks of (physical or statistical) influences between components. For such a conceptual framework, which is the basis for a system-level description, one needs to know (1) the identity of the components that constitute the biological system; (2) the dynamic behavior of the abundance or activity of these components; and (3) the interactions among these components (11). Ultimately, this information can be combined into a model that, if validated and consistent with current knowledge, provides new insights and predictions, such as the behavior of the system under previously unexplored conditions. High-throughput experimental methods (see Note 1) enable, for example, the measurement of expression levels for thousands of genes or the determination of thousands of protein–protein or protein–DNA interactions. It is increasingly recognized that theoretical and computational network inference methods are needed to make sense of this manifold of data. Methodological approaches and algorithms have been proposed to determine reaction mechanisms from time series data (see Note 2) collected for gene and protein interactions and for metabolic pathways and networks. The aim of these techniques is to infer biochemical reaction mechanisms systematically from time series data on the concentration or abundance of the different reacting components of a network, with little prior information about the pathways involved. The great majority of these methods belong to the class of mechanistic approaches to network inference. Crampin et al. (12) provide a survey of mathematical and computational mechanistic techniques proposed to deduce complex biochemical reaction networks. The majority of the reviewed techniques require the generation and analysis of significant quantities of experimental data in the form of composition and concentration time series. The mechanistic view of a system of biochemical reactions is widespread among modelers. It is considered important for several reasons: (1) an improved understanding of the functional role of different molecules can be achieved only with knowledge of the mechanism of specific reactions and the nature of key intermediates; (2) the control (or regulation) of different biochemical pathways can best be understood if some hypothesis about the reaction mechanism is available; and (3) kinetic modeling, which forms the basis for understanding reaction dynamics, relies on comprehensive information about the reaction mechanism. Kinetic models allow simulation of complicated pathways, and
even whole-cell dynamics, which is an increasingly important predictive tool in the post-genomic era (12, 13). The data required for kinetic modeling are typically time series data (see Note 2) of the response of a biochemical system to different conditions and stimuli. These data are used because time series reveal transient behavior, away from chemical equilibrium, and contain information on the dynamic interactions among reacting components. From time-resolved data of reagent concentrations, mechanistic inference methods deduce the nature of the reagents and their interactions (how they react with, or transform into, each other) and determine the rates of these transformations. Mechanistic inference can be broken down into two tasks: first, a connectivity – or "wiring diagram" – has to be established, and second, the individual interactions are assigned appropriate kinetics, or rate laws. The most direct approach to building a wiring diagram consists in evaluating the Jacobian matrix. The (i,j)th entry of the Jacobian matrix corresponds to the magnitude of change in the time behavior of species i in response to an infinitesimal change in the level of species j. The experimental method to build this matrix for a biochemical network consists in perturbing one or more of the concentrations from steady state and monitoring the response of each of the chemical species as the system relaxes. The biggest hurdle of this method is that the response of each of the biochemical species in the network must be monitored in order to determine the Jacobian: even if recent advances in experimental technologies allow simultaneous measurement of the concentrations of several species, for practical reasons it may not be possible to measure many concentrations concurrently.
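In silico, the same protocol amounts to a finite-difference estimate of the Jacobian at steady state. The following is a minimal sketch, in which the derivative function f and the steady state x_ss are placeholders for a concrete kinetic model:

```python
import numpy as np

def jacobian_from_perturbations(f, x_ss, delta=1e-6):
    """Estimate the Jacobian at a steady state x_ss by perturbing each
    species in turn; f(x) returns the vector of time derivatives."""
    n = len(x_ss)
    J = np.zeros((n, n))
    f0 = np.asarray(f(x_ss))              # approximately 0 at a true steady state
    for j in range(n):
        x = np.array(x_ss, dtype=float)
        x[j] += delta                      # infinitesimal change in species j
        J[:, j] = (np.asarray(f(x)) - f0) / delta
    return J
```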
Rather than deducing connectivity from observations of the response to perturbations of arbitrarily small amplitude made at different locations in the network, more feasible experiments can be designed to deduce connectivity from the order and magnitude of the responses of different species to stimuli perturbing the network at different points. If the concentration of one or more species of a network at steady state is increased by some arbitrary amount – unlike in the previous methods for determining the Jacobian, which relied on small-amplitude perturbations – the responses in the concentrations of the other species reveal qualitative properties of the network (14). As the concentrations increase and decrease following the initial impulse, the order of the appearance of peaks and troughs in the time courses for the different species reveals information about their ordering in a pathway. For example, consider an unbranched chain of reactions as given in Fig. 1. Suppose that at the beginning the concentrations of species A and B are both equal to 100, in arbitrary units (au), and assume without loss of generality that the rate constants k are all equal to 0.1 au (see Fig. 2a). If we perturb the reaction chain of Fig. 1 at one end, for instance by applying pulses to A, we can observe the propagation of the pulse along the chain.

Fig. 1. A didactic example of a chain of reactions. X1 transforms into X2, X2 transforms into X3, and X3 is degraded. X4 is produced at a constant rate and is transformed into X2. The ks are the rate constants of the first-order chemical reactions of this network. The connectivity of this branched pathway can be determined by pulses applied to components X1 and X4.
Fig. 2. Behavior of the time series in response to impulse changes applied to species A (b) and species D (c). In (a), the reference time series for [A] = [B] = 100 is shown. Time and concentration are expressed in arbitrary units (au).
Figure 2b shows that, changing the concentration of A to 200 and keeping the concentration of B equal to 100, the response of C follows the response of B. Non-null derivatives in the initial dynamics of B indicate a direct connection of A to B.
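The pulse experiment on the chain of Fig. 1 is straightforward to reproduce in silico. The sketch below assumes first-order mass-action kinetics with all rate constants equal to 0.1, and makes assumptions where the text is silent (the constant production rate of X4 and the initial levels of X2 and X3); it illustrates the idea rather than reproducing the exact simulation behind Fig. 2:

```python
import numpy as np
from scipy.integrate import odeint

K = 0.1    # all first-order rate constants (au), as in the text
K0 = 10.0  # assumed constant production rate of X4 (value not given)

def chain(x, t):
    x1, x2, x3, x4 = x
    return [-K * x1,                    # X1 -> X2
            K * x1 + K * x4 - K * x2,   # inflow from X1 and X4, outflow to X3
            K * x2 - K * x3,            # X3 is degraded
            K0 - K * x4]                # X4 produced at a constant rate

t = np.linspace(0.0, 60.0, 601)
reference = odeint(chain, [100.0, 100.0, 0.0, 100.0], t)  # assumed initial state
pulsed = odeint(chain, [200.0, 100.0, 0.0, 100.0], t)     # pulse applied to X1

# The order in which the responses peak along the chain (X2 before X3)
# reveals the ordering of the species in the pathway.
response = pulsed - reference
print(response.argmax(axis=0))
```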
If we repeat this experiment by perturbing D, changing its concentration to 200 and keeping the concentration of A equal to 100, we obtain the dynamics shown in Fig. 2c. Vance et al. (14) motivated this approach by arguing that, by perturbing the components of the network, sufficient information can be collected to infer the causal order of the responses. One year later, this method was successfully used by Torralba et al. (15) in an experimental study to determine part of the glycolysis pathway in vitro. However, for complicated networks, the interpretation of the responses is usually hard and the network is difficult to reconstruct. Moreover, the pulse generally propagates dissipatively through the network, so the highest response intensities, i.e., the best information, are recovered for those species closest in a pathway to the point of perturbation. Conversely, if a species is placed far from the point of disturbance, its response will have lower intensity, and it may be difficult to detect responses at all, or even to distinguish them from noise. Due to the difficulty of interpreting the response time series of complex biochemical networks, perturbation methods can be regarded as semi-quantitative approaches used to guide further experimental investigations aimed at grasping biological insights about the underlying mechanisms, rather than at deducing quantitative connectivity between species.

More quantitative approaches to network identification are the so-called correlation-based methods. Correlation in time series data for a biochemical network can be used to reveal dependencies between variables and to infer connectivity between species (16). These methods are used especially for inferring gene networks from microarray data (17). The key idea of correlation-based inference methods is to group together species with similar dynamic profiles by using data clustering approaches. In order to perform a clustering analysis, the similarity between two time series must be quantified. Consider n species with time series Xi(t) for i = 1, …, n; the correlation matrix of the n(n − 1)/2 independent pairwise correlation coefficients can be used to cluster the data set into groups of species. Correlations between species within a cluster are high compared to the pairwise correlations between different groups. These groupings can most easily be discerned by calculating a matrix of pairwise distances dij from the correlation matrix, whereby dij = 0 for two species which are completely positively correlated. The distance matrix can subsequently be analyzed to find clusters in the data (17, 18). In practical situations, the influence of one species on another takes some finite amount of time to propagate through the network. Two time series which have a low correlation may in fact be strongly correlated if a time lag is allowed between the data points for the two species. In particular, this is evident in the time series
if the time interval between concentration measurements is smaller than the characteristic response timescales of the network. Time-lagged correlations extend the standard correlation-based approach by determining the best correlations among profiles shifted in time. For a concentration profile represented by a series of n measurements, the correlation between species i and j with a time lag τ is R(τ) = (rij(τ)), defined by

$$r_{ij}(\tau) = \frac{S_{ij}(\tau)}{\sqrt{S_{ii}(0)\,S_{jj}(0)}}$$

and

$$S_{ij}(\tau) = \left\langle \bigl(x_i(t) - \bar{x}_i\bigr)\bigl(x_j(t+\tau) - \bar{x}_j\bigr) \right\rangle$$

where xi(t) denotes the concentration of species i at time t, x̄i is the concentration of species i averaged across all time points, and the angled brackets denote the inner product between the time-shifted series. The matrix of lagged correlations R(τ) can be used to rank the correlation and anticorrelation between species through conversion to a Euclidean distance metric dij:

$$d_{ij} = \sqrt{c_{ii} - 2c_{ij} + c_{jj}} = \sqrt{2\bigl(1 - c_{ij}\bigr)}, \qquad c_{ij} = \max_{\tau}\,\bigl|r_{ij}(\tau)\bigr|$$

where cij is the maximum absolute value of the correlation between two species over all time lags τ. If the value of τ that gives the maximum correlation is 0, then the two species are best correlated with no time lag. The matrix D = (dij) describes the correlation between two species i and j in terms of "distance," making the species that are least correlated (for any τ) the "farthest" apart (19). A network of potential interactions, as well as cause-and-effect relationships, can be inferred by finding species that are closely related and then examining the corresponding value of τ. Some caution is needed in the application of this method to ensure that species with high correlation have been identified using enough data points to give statistical significance; otherwise, the τ values merely overfit the data. Such errors may occur if the values of τ are unreasonably long from a biological standpoint. The information contained in the distance matrix can be extracted and interpreted graphically. A projection of the distances onto an n-dimensional space can be exploited to represent the stronger connections (shorter distances) between species as a connected graph, while weaker interactions (longer distances) are collapsed and can be ignored (18). For a given distance matrix and a given dimension, a technique called multidimensional scaling (MDS, e.g., implemented in http://www.ailab.si/orange) finds the optimum projection and so provides the best separation of the data (16). Graphical interpretation of the projected data can
reveal not only the ordering of species in pathways, but may also provide clues as to the type of interaction. For instance, species which are strongly localized may form a subsystem that is weakly coupled to the rest of the network, or may represent reversible conversions of reaction intermediates at quasi-steady state. The strength of the correlation-based approaches is that information can be extracted from experimental time series data with little a priori knowledge of the underlying mechanisms. Moreover, to some degree, these methods can deal with the effects of unobserved species on the network inference problem, because the correlation between xi and xj is still observed in the data even if their interaction is mediated by some intermediate species which are not measured.
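The construction above is easy to implement directly. A minimal sketch for a data matrix X with one row per time point and one column per species, normalizing by the zero-lag autocovariances as in the formula:

```python
import numpy as np

def lagged_distance_matrix(X, max_lag):
    """c_ij = max over tau of |r_ij(tau)|; d_ij = sqrt(2 * (1 - c_ij))."""
    Xc = X - X.mean(axis=0)                  # center each time series
    n_t, n_s = Xc.shape
    norm = np.sqrt((Xc ** 2).sum(axis=0))    # sqrt(S_ii(0)) for each species
    c = np.zeros((n_s, n_s))
    for tau in range(max_lag + 1):
        S = Xc[: n_t - tau].T @ Xc[tau:]     # S_ij(tau) inner products
        r = np.abs(S / np.outer(norm, norm))
        c = np.maximum(c, np.maximum(r, r.T))  # also covers negative lags
    d = np.sqrt(2.0 * (1.0 - np.minimum(c, 1.0)))
    return c, d
```

The resulting distance matrix can then be handed to any MDS implementation (e.g., sklearn.manifold.MDS with dissimilarity="precomputed") to obtain the graphical projection discussed above.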
Besides perturbation- and correlation-based methods, it is increasingly recognized that statistical inference methods can also be helpful in inferring interactions and functional relationships among species. Bayesian methods have been considered valuable because the Bayesian paradigm is fully probabilistic and is able to provide a statistical model that includes prior knowledge as well as unknown variables and parameters. In statistical Bayesian inference (see Note 3), there is no fundamental distinction between any of the unknowns in a statistical model – parameters, hidden variables, and observations are all treated together in a consistent mathematical structure – and this is the main reason for the power of the methodology (20). Bayesian methods aim to find a directed acyclic graph describing the causal dependency relationships among the components of a system, together with a set of local joint probability distributions that statistically model these relationships (21, 22). The starting edges are established on the basis of an initial assessment of the experimental data and are refined by an iterative search-and-score algorithm until the causal network and posterior probability distribution best describing the observed state of each node are found. Bayesian inference was recently used to infer the signaling network responsible for embryonic stem cell fate responses to external cues (23). However, the main limiting factors in applying Bayesian methods are the need for a priori knowledge about the system and, most of all, the computational difficulty: for nontrivial problems, analytic approaches to Bayesian inference are not possible, and their numerical solution is often challenging due to the necessity of solving high-dimensional integration problems, which in the discrete case translate into combinatorial summation problems.

A method for network inference that is neither purely mechanistic nor probabilistic has recently been proposed in (24). The method consists of two parts. The first is the quantification of the correlation between the time series profiles; the procedure adopted for the estimation of the correlations among species is inspired by the work of Arkin et al. (18). The second part consists of the elimination of those relationships in the connectivity graph that have a non-null correlation coefficient but are thought not to be biochemically plausible. The cutting of false correlations from the graph is achieved through the calibration of network models: each edge of the graph representing the interaction network is weighted by the value of the putative kinetic rate constant, and edges weighted by a kinetic constant of zero represent no dynamics and are cut. To calibrate the network, i.e., to detect null dynamics or nonplausible correlations, we developed an innovative probabilistic model for the inference of the rate coefficients. The tool implementing this model is called Knowledge Inference (KInfer). This software is freely available for noncommercial purposes and can be downloaded from http://www.cosbi.eu/index.php/research/prototypes/kinfer. For a comprehensive description of the tool, we refer the reader to (25) and to the tutorial available on the Web page. KInfer implements an inference model that deduces the rate constants of a system of biochemical reactions from experimentally measured time courses of the reactants. Based on a new probabilistic model of the variations in reactant concentrations, it infers from the time-course data first the prediction interval, and then the values of the kinetic rate constants and the level of noise in the input data. The probabilistic maximum-likelihood formulation of the inference method, combined with a finite-difference model of the law of mass action, makes the accuracy of the predictions remarkably robust against experimental, biological, and stochastic noise. KInfer and this correlation-based method for network inference are part of the CoSBi Lab software platform for modeling and simulating biological processes (1). In the following, we show the use of KInfer to infer a subnetwork of the NF-kB signaling pathway (24). For this case study, we report the input time series, the expected network of interactions, and the KInfer-inferred distance matrix and network graph.

3.1. Example: NF-kB Signaling Pathway
Activation of the NF-kB transcription factor can be triggered by exposing cells to a multitude of external stimuli, such as tumor necrosis factor and interleukin 1. These cytokines initiate numerous and diverse intracellular signaling cascades, most of which activate the IKK complex. The IKK complex positively regulates the activity of the NF-kB transcription factor by phosphorylating its inhibitor, IkB. The IKK complex catalyzes the transfer of the terminal phosphoryl group of ATP to the IkB protein substrate, thereby tagging the inhibitor protein for ubiquitination and subsequent degradation. The previously inactive NF-kB is thus activated and available for regulating gene expression.
Table 3
Rate constants of the phosphorylation pathway

Reactants | Kinetic parameter | Error of estimate | Variance of estimate | Units
E, ATP | 1.2308 | 0.0008 | 9.6 | 1/(nM min)
E.ATP | 0.9306 | 0.00018 | 14.4 | 1/min
E.ATP, IkBa | 2.8506 | 0.0003 | 12.6 | 1/(nM min)
E.ATP.IkBa | 0.692538 | 0.000018 | 42.6 | 1/min
E.IkBa, ATP | 1.351 | 0.004 | 0.54 | 1/(nM min)
E, ATP, IkBa | 0.4345 | 0.0004 | 8.4 | 1/(nM min)
This crucial component in the NF-kB activation cascade typically consists of two catalytic subunits, IKK1 and IKK2, and a regulatory unit, NEMO (IKK). The cytoplasmic inhibitors of NF-kB are phosphorylated by activated IKK at specific N-terminal residues, tagging them for poly-ubiquitination and rapid proteasomal degradation. This allows NF-kB to be released upon activation, whereupon it translocates to the nucleus to induce the transcription of genes encoding regulators of immune and inflammatory responses, as well as of genes involved in apoptosis signaling and cell proliferation. Since recombinant human IKK2 (rhIKK2) phosphorylates IkBa in vitro, we examine below, as an example, its activity on IkBa, followed by the association and dissociation reactions, as shown in the chemical reaction system of Table 3 (25). For this example, the time series of the reactant concentrations are taken from Ihekwaba et al. (26) and used as the input of the network inference procedure; the output is a weighted distance matrix, where the weights are the rate constants inferred by KInfer (see Note 4). The weighted distances between species are represented as solid circles with different colors and sizes, corresponding to the intensities of the correlation between the species (Fig. 3). Table 3 shows the set of non-null rate coefficients that are used as weights in calculating the distances among the species. The distance matrix reflects the experimentally observed dynamics (Fig. 4), and the inferred model of interactions is in agreement with the experimental observations (26).
3.2. Current Challenges to Network Inference

The analysis of experimental time series data is complicated by uncertainty due to measurement errors, noise, artifacts, and missing data.
Fig. 3. CoSBi Lab visualization of the weighted distance matrix of the species involved in phosphorylation (“E” denotes the enzyme IKK2).
The construction of models from time series with additive measurement errors is intricate because noise is propagated through the model in such a way that the errors are no longer normally distributed, and their distributions depend on the nonlinearities. A further complication in time series analysis is whether or not the underlying biochemical mechanism was stationary throughout the period in which the data were recorded. External influences that are assumed to be held constant during the experiment may, in fact, change slowly. This is often referred to as dynamical uncertainty, which is modeled as uncertainty in the kinetic parameters of the interaction/reaction. Therefore, a good understanding of all the sources of inaccuracy inherent in the experimental apparatus and measurements, i.e., of the observational uncertainty, is needed. The modeling framework should take this observational uncertainty into account, as it influences the parameter estimates and the predictive accuracy of the resulting model. The output of any model inference procedure is not just one model but a set of models. This raises the problem of model discrimination, which in turn points toward the validation of the inferred models. Generally, consistency with the data is not sufficient to accept a particular model; nor does consistency provide a rationale for selecting one consistent model over another. Numerous competing models cannot easily be tested experimentally, and the main criteria for acceptance are often biological plausibility and consistency with "known facts."
Fig. 4. Smoothed time behavior of the species included in the inferred model of the IKK phosphorylation reactions in the NF-kB signaling pathway.
For the time series analysis techniques described in the previous section, a commonly adopted approach, which may be very useful in this regard, is known as forecasting. While models are often constructed in an attempt to gain a better understanding of the underlying processes, they are usually assessed through their ability to reproduce empirical observations. Once the model parameters have been estimated, a model can be evaluated by assessing the distribution of out-of-sample prediction errors (errors for data not used in the model construction) as a measure of the quality of the model, as is done to avoid overfitting of the experimental data (12). This data-driven approach to model evaluation may prove particularly useful for large and complicated data sets, where the "known facts" are unreliable, few, or somewhere in between.
4. Notes

1. High-throughput screening is an experimental method used especially in fields of biology and chemistry such as target and drug discovery. Using robotics and control software equipped with liquid-handling devices and sensitive detectors, high-throughput screening allows a researcher to conduct millions of biochemical, genetic, or pharmacological tests in parallel. Through this process, the experimentalist can rapidly identify active compounds involved in a particular biomolecular pathway. The results of these experiments provide starting points for understanding the interaction or role of a particular biochemical process in biology. A high-throughput experiment runs a screen of an assay against a library spanning from 100,000 to more than 2,000,000 candidate compounds. An assay is a test specific for the inhibition or stimulation of a biological mechanism. The testing vessel is the microtiter plate: a small container that features a grid of small, open divots called wells. The number of wells is a multiple of 96 (typically 384, 1,536, or 3,456), reflecting the original 96-well microplate with its 8 × 12 grid of wells at 9 mm spacing. Some of the wells are filled with experimentally useful matter, often an aqueous solution of dimethyl sulfoxide and some other chemical compound, the latter being different for each well across the plate. Other wells may be left empty, intended for use as experimental controls. The researcher then fills the wells with the biological entities to be analyzed, such as a nucleotide sequence, a protein, or whole cells. After some incubation time has passed to allow the biological matter to absorb, bind to, or otherwise react (or fail to react) with the compounds in the wells, measurements are taken across all the plate's wells. An automated analysis machine runs a number of experiments on the wells; for instance, the samples may be irradiated with polarized light to measure their reflectivity, which is an indication of protein binding. The machine's output represents the result of each experiment as a grid of numeric values, with each number mapping to the value obtained from a single well. A high-capacity analysis machine can measure dozens of plates in a few minutes, generating thousands of experimental data points very quickly.

2. Estimating the parameters and the structure of a biochemical system from time-resolved experimental concentration measurements is difficult because the temporal behavior of the system is often nonlinear, and thus no general analytic result exists. We must therefore resort to nonlinear optimization techniques,
where a measure of the distance between model predictions and experimental data is used as the optimality criterion to be minimized. The selection criterion depends on the assumptions about the data disturbance, on the temporal resolution of the measurements, and on the amount of information provided by the user. For instance, the maximum likelihood estimator maximizes the probability of the occurrence of the observed measurements. If we assume that the residuals are normally distributed and independent with the same variance, then the maximum likelihood criterion is equivalent to least squares, and we aim to find the minimum of the sum of squared residuals of all the responses, subject to the measured dynamics of the system and possibly other algebraic constraints. Furthermore, model parameters are also subject to upper and lower bounds. Moreover, when estimating the parameters of dynamical systems, a number of difficulties may arise, e.g., convergence to local solutions if standard local methods are used, a flat objective function in the neighborhood of the solution, overdetermined models, badly scaled model functions, or nondifferentiable terms in the system dynamics. Due to the nonlinear and constrained nature of the system dynamics, these problems are very often multimodal. Thus, traditional gradient-based methods, like Levenberg–Marquardt or Gauss–Newton, may fail to identify the global solution and may converge to a local minimum although an improved solution exists just a small distance away. Moreover, in the presence of a bad fit or scarce experimental data points, there is no way of knowing whether bad parameter estimation is due to a wrong model formulation, or whether it is simply a consequence of local convergence.
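Under the Gaussian, equal-variance assumption, maximum likelihood thus reduces to least squares on the simulated responses. A toy sketch for a single first-order reaction A → B with unknown rate constant k (model, data, and noise level are invented for illustration):

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import least_squares

def simulate(k, y0, t):
    # A -> B with first-order kinetics: d[A]/dt = -k[A], d[B]/dt = k[A]
    return odeint(lambda y, _t: [-k * y[0], k * y[0]], y0, t)

t_obs = np.linspace(0.0, 10.0, 21)
rng = np.random.default_rng(1)
y_obs = simulate(0.7, [1.0, 0.0], t_obs) + 0.02 * rng.standard_normal((21, 2))

def residuals(params):
    return (simulate(params[0], [1.0, 0.0], t_obs) - y_obs).ravel()

# Bounded least squares with an initial guess, as discussed in this note
fit = least_squares(residuals, x0=[0.1], bounds=(0.0, np.inf))
print("estimated k:", fit.x[0])
```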
3. Bayes methods are often used for estimating the parameters of the kinetics of a biological system. In the continuous setting, we are interested in making inferences about a parameter vector φ of a probability density model p(y | φ) giving rise to an observed data vector y. If we treat the parameters as uncertain and allocate to them a "prior" probability density p(φ), then Bayes' theorem gives the "posterior" density

$$p(\phi \mid y) = \frac{p(\phi)\, p(y \mid \phi)}{p(y)}$$

where p(y) is the marginal density for y, obtained by integrating over the prior. Since p(y | φ) is regarded as a function of φ for a fixed (observed) y, we can rewrite this as

$$p(\phi \mid y) \propto p(\phi)\, p(y \mid \phi)$$
so that the posterior is proportional to the prior times the likelihood. Practical difficulties arise because typically the normalizing constant p(y) is not known, and either p(y | φ) is not known explicitly or marginalization over some components of φ is required. Since these integration problems are typically analytically intractable, they are amenable to a Monte Carlo or a Markov chain Monte Carlo solution. In the high-dimensional context, the problem is decomposed according to the underlying conditional independence structure of the model, which is suitably described by graphical models (also known as conditional independence graphs). In this regard, it is worth pointing out that in nonstatistical communities the term "Bayes(ian) network" is often used to describe a discrete graphical model. However, graphical models can be used to describe any probabilistic conditional independence structure, and many of the techniques that are often used to "learn" Bayesian networks are not Bayesian.
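Because the posterior is known only up to the normalizing constant p(y), Markov chain Monte Carlo is attractive: it requires only the unnormalized product p(φ)p(y | φ). A minimal random-walk Metropolis sketch for a scalar parameter, with an assumed standard normal prior and a normal likelihood (both invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prior(phi):
    return -0.5 * phi ** 2            # assumed standard normal prior

def log_likelihood(phi, y, sigma=1.0):
    return -0.5 * np.sum((y - phi) ** 2) / sigma ** 2  # assumed y ~ N(phi, sigma^2)

def metropolis(y, n_iter=10_000, step=0.5):
    """Random-walk Metropolis sampler for p(phi | y) ∝ p(phi) p(y | phi)."""
    phi = 0.0
    lp = log_prior(phi) + log_likelihood(phi, y)
    samples = []
    for _ in range(n_iter):
        proposal = phi + step * rng.standard_normal()
        lp_prop = log_prior(proposal) + log_likelihood(proposal, y)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
            phi, lp = proposal, lp_prop
        samples.append(phi)
    return np.asarray(samples)

y = rng.normal(1.5, 1.0, size=20)    # synthetic observations
posterior_draws = metropolis(y)
print(posterior_draws.mean())
```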
●●
●●
●●
Automatic generation of generalized mass action model utomatic estimation of the initial guesses and bounds A for the parameter values stimation of the experimental errors on the inferred E parameters Estimation of the strength of noise in the input data.
The method of KInfer is based on a probabilistic model of the variations in reactant concentrations. We observe time series of concentrations for all the reactant species, gathered in N state vectors X1, …, XN. Our method approximates the rate of change of the reactant concentrations by finite differences and provides a tool to predict the values of the variables Xi at time t, conditioned on their values at the previous time point. The variations of the species concentrations at different time points are conditionally independent, owing to the Markov nature of this approximated model of the rate equation. Assuming the observation noise to be Gaussian with variance σ, the probability of observing a variation Δi in the concentration [X]i of species i between times tk−1 and tk is a Gaussian whose variance depends on σ and whose mean is the expectation value of the law of mass action under the noise distribution. The likelihood of the observed increments/decrements Δi can be optimized with respect to the kinetic rate constants of the biochemical network under consideration and with respect to the level of noise σ affecting the time course of the reactant concentrations.
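The gist of this construction – finite differences standing in for the time derivative, compared with mass-action increments under Gaussian noise – can be sketched as follows. This is a toy rendering of the idea, not KInfer's actual implementation:

```python
import numpy as np

def neg_log_likelihood(k, t, a, b, c, sigma=1.0):
    """Toy negative log-likelihood for a single reaction A + B -> C with
    rate constant k: under the law of mass action, the expected increment
    of [C] over (t_{i-1}, t_i] is k * [A] * [B] * dt."""
    dt = np.diff(t)
    expected = k * a[:-1] * b[:-1] * dt   # discretized mass action
    observed = np.diff(c)                 # finite-difference increments
    return 0.5 * np.sum((observed - expected) ** 2) / sigma ** 2

# A one-dimensional minimization then yields the rate-constant estimate:
# from scipy.optimize import minimize_scalar
# k_hat = minimize_scalar(lambda k: neg_log_likelihood(k, t, a, b, c)).x
```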
The approximation of the time derivative of the reactant concentrations by finite differences provides a model for the variations of the species concentrations. The discretization of the law of mass action and the probabilistic formulation of KInfer's algorithm guarantee efficiency and robustness to noise. Moreover, the probabilistic inference model allows a Bayesian extension of the method and the development of an automated model selection strategy based on the comparison of the marginal likelihoods of different models.

References

1. COSBiLab. (2009) CoSBiLab web page, www.cosbi.eu/index.php/research/prototypes/overview
2. Larranaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., et al. (2006) Machine learning in bioinformatics. Brief Bioinform 7(1), 86–112.
3. Muggleton, S. (2005) Machine learning for systems biology. Proceedings of the 15th International Conference on Inductive Logic Programming.
4. Bauer-Mehren, A., Furlong, L. I., and Sanz, F. (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Syst Biol 5, 290.
5. Bader, G. D., Cary, M. P., and Sander, C. (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34(Database issue), 504–6.
6. Stein, L. (2003) Integrating biological databases. Nat Rev Genet 4(5), 337–45.
7. Priami, C., Ballarini, P., and Quaglia, P. (2009) BlenX4Bio – BlenX for Biologists. CMSB 2009, volume 5688 of LNCS/LNBI. Springer.
8. Dematté, L., Priami, C., and Romanel, A. (2008) The BlenX Language: A Tutorial. SFM 2008, LNCS 5016. Springer.
9. Pettinen, A., Aho, T., Smolander, O.-P., Manninen, T., Saarinen, A., et al. (2005) Simulation tools for biochemical networks: evaluation of performance and usability. Bioinformatics 21(3), 357–63.
10. Buckingham, S. (2007) To build a better model. Nat Methods 4, 367–74.
11. Kitano, H. (2002) Systems biology: a brief overview. Science 295, 1662–6.
12. Crampin, E. J., Schnell, S., and McSharry, P. E. (2004) Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Prog Biophys Mol Biol 86, 72–112.
13. Noble, D. (2002) The rise of computational biology. Nat Rev Mol Cell Biol 3, 459–63.
14. Vance, W., Arkin, A., and Ross, J. (2002) Determination of causal connectivities of species in reaction networks. Proc Natl Acad Sci USA 99, 5816–21.
15. Torralba, A. S., Yu, K., Shen, P. D., Oefner, P. J., and Ross, J. (2003) Experimental test of a method for determining causal connectivities of species in reactions. Proc Natl Acad Sci USA 100, 1494–8.
16. Samoilov, M., Arkin, A., and Ross, J. (2001) On the deduction of chemical reaction pathways from measurements of time series of concentrations. Chaos 11(1), 108–14.
17. Schmitt, W. A., Raab, R. M., and Stephanopoulos, G. (2004) Elucidation of gene interaction networks through time-lagged correlation analysis. Genome Res 14, 1654–63.
18. Arkin, A., Shen, P., and Ross, J. (1997) A test case of correlation metric construction. Science 277, 1275–9.
19. Arkin, A., and Ross, J. (1995) Statistical construction of chemical-reaction mechanisms from measured time-series. J Phys Chem 99, 970–9.
20. Wilkinson, D. J. (2007) Bayesian methods in bioinformatics and computational systems biology. Brief Bioinform 8(2), 109–16.
21. Friedman, N., Linial, M., Nachman, I., and Pe'er, D. (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7(3–4), 601–20.
22. Friedman, N. (2004) Inferring cellular networks using probabilistic graphical models. Science 303, 799–805.
23. Woolf, P. J., Prudhomme, W., Daheron, L., Daley, G. Q., and Lauffenburger, D. A. (2005) Bayesian analysis of signaling networks governing embryonic stem cell fate decisions. Bioinformatics 21, 741–53.
24. Lecca, P., Palmisano, A., and Ihekwaba, A. E. (2010) Correlation-based network inference and modelling in systems biology: the NF-kB signalling network case study. International Conference on Intelligent Systems, Modelling and Simulation. Liverpool, England: IEEE CPS.
25. Lecca, P., Palmisano, A., Ihekwaba, A., and Priami, C. (2010) Calibration of dynamic models of biological systems with KInfer. Eur Biophys J 39, 1019–39.
26. Ihekwaba, A. E., Wilkinson, S. J., Broomhead, D. S., Waithe, D., Grimley, R., Benson, N., et al. (2007) Bridging the gap between in silico and cell-based analysis of the NF-kB signalling pathway by in vitro studies of IKK2. FEBS J 274, 1678–90.
Chapter 21

Omics and Literature Mining

Vinod Kumar

Abstract

The measurement of the simultaneous expression values of thousands of genes or proteins on high-throughput Omics platforms creates a large amount of data whose interpretation by inspection can be a daunting task. A major challenge in using such data is to translate these lists of genes/proteins into a better understanding of the underlying biological phenomena. We describe approaches to identify biological concepts, in the form of Medical Subject Headings (MeSH terms) extracted from MEDLINE, that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying Omics platform. The method's principal strength is its ability to simultaneously depict similarities that may exist at the level of biological structure, molecular function, physiology, genetics, and clinically manifest diseases, just as a single published article about a gene of interest may report findings within several of these same dimensions.

Key words: Biomedical literature mining, PubMed, MEDLINE, MeSH, Omics, Microarray
1. Introduction

The last decade has seen a surge of interest in systematically using the biomedical literature (1–5), ranging from relatively modest tasks such as finding reported gene locations on chromosomes to more ambitious attempts to construct putative gene networks based on gene-name co-occurrence within articles (3). Since the literature covers all aspects of biology, chemistry, and medicine, there is almost no limit to the types of information that may be recovered through careful and exhaustive mining. Some possible applications of such efforts include reconstructing and predicting pathways, establishing connections between genes and diseases, finding relationships between genes and specific biological functions, and much more.
Almost every known or postulated piece of information pertaining to genes, proteins, and their role in biological processes is reported somewhere in the vast amount of published biomedical literature. The advancement of genome sequencing techniques has also been accompanied by an overwhelming increase in the literature discussing the discovered genes. This combined abundance of genes and literature produces a major bottleneck for interpreting and planning genome-wide experiments. Thus, the ability to rapidly survey this literature constitutes a necessary step toward both the design and the interpretation of any large-scale experiment. Moreover, automated literature mining offers a yet untapped opportunity to integrate the many fragments of information gathered by researchers from multiple fields of expertise into a complete picture exposing the interrelated roles of various genes, proteins, and chemical reactions in cells and organisms. The landscape of biomedical research has been transformed by the widespread embrace of high-throughput experimental technologies, which have collectively given birth to the Omics fields (e.g., proteomics, transcriptomics, etc.). Such technologies bring new perspectives in enabling the discovery of global patterns of biological responses to experimental or natural perturbations that can potentially provide valuable insights into the molecular mechanisms underlying disease. While these technologies have allowed the unrestrained and rapid generation of large quantities of data, the analytical challenges of interpreting these data are still formidable. A major challenge in using data from such platforms is to translate the lists of differentially regulated genes into a better understanding of the underlying biological phenomena. A list of genes of interest may often include dozens or hundreds of different genes, and it is beyond the limits of unaided human cognition to identify the most salient biological concepts by inspection alone. Many researchers parse such lists of genes manually, using literature searches and browsing public databases, in an attempt to extract the relevant biological processes and pathways. This is an extremely tedious and error-prone process that usually takes many months. Several resources that provide one-stop access to comprehensive information on genes or gene products are widely available. For instance, the NCBI Entrez browser (http://www.ncbi.nlm.nih.gov/Entrez) (6) is a cross-database portal to sequence data, conserved domains, gene location, protein structure, published microarray expression data (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo (7)), and function (OMIM, Online Mendelian Inheritance in Man, http://www.ncbi.nlm.nih.gov/omim) (8). However, improving access to information does not necessarily create the conditions required for data interpretation. What are needed are analytical tools that help make sense of such a mind-boggling flow of information.
A variety of approaches have been engineered to condense and manage functional data for the purpose of analyzing data obtained from such high-throughput experimental technologies. Collections of controlled vocabulary such as the Gene Ontology (http://www.geneontology.org) (9) have been extensively used by the developers of functional mining tools (e.g., GoMiner, http://discover.nci.nih.gov/gominer (10), NetAffx, http://www.affymetrix.com/analysis/index.affx (11), and MeSHMap (12)), and hand-curated signaling or biochemical pathways constitute another popular means of condensing information to facilitate data interpretation (e.g., KEGG, http://www.genome.jp/kegg (13), Biocarta, http://www.biocarta.com (14), and GenMAPP, http://www.genmapp.org (15)). One plausible approach to categorizing the characteristics of known genes within a group of interest is to exploit the information content of the literature published about those genes. The availability of published literature that describes genes and their function in a computer-interpretable form is a potentially rich source of information that can be exploited for such purposes. Various tools for this type of data mining have been described. Jenssen et al. (3) employed HUGO gene symbols associated with specific loci on microarrays as the common currency for linking to the literature and displayed characterizations of genes using the MeSH keywords from the literature associated with those genes. Masys et al. (16) developed a similar method for interpreting gene clusters that uses GenBank accession numbers as the common currency for linking to the literature and extends the notion of characterizing groups of genes through literature-derived keywords by placing those keywords in concept hierarchies that represent "is-a" and "part-whole" relationships. For example, in the MeSH Anatomy hierarchy, a search on "hand" will also include records retrieved on fingers, thumb, and wrist, because those terms are "indented" under hand in the MeSH tree structure. Shatkay et al. (4) used such information retrieval methods to find the literature most closely related to a gene set and to predict relationships among genes independent of experimental values. A common approach taken by these tools for information retrieval from the biomedical literature is the use of ontologies representing the essential concepts contained within a text. These ontologies help organize indexed terms into meaningful hierarchies that capture domain knowledge. The biomedical literature in the National Library of Medicine (NLM) MEDLINE database is indexed with keywords drawn from a controlled vocabulary called Medical Subject Headings (MeSH) to aid search through the disambiguation of topics (17). Here we describe a similar approach to literature mining within the context of interpreting Omics data as they relate to drug discovery. The principal strength of such an
approach is its ability to generate hypotheses about which biological properties are shared within a set of genes or proteins, based on their associations with MeSH concepts extracted from the published literature, and then to use those relationships to cluster both the genes and the concepts. We then provide two case studies applying our literature-based annotation strategy to find shared functional relations among genes obtained from microarray analysis, an approach that can be extended to any high-throughput platform.
2. Materials

2.1. MEDLINE and MeSH

MEDLINE 2009, the U.S. NLM biomedical abstract repository, contains approximately 18 million reference articles from around 5,400 journals. Despite the growing availability of full-text articles on the Web, MEDLINE remains in practice a central point of access to biomedical research. Under a license agreement with the U.S. NLM, which can be obtained at no cost, the entire MEDLINE/PubMed baseline files, including sample data, can be downloaded onto a local server.
An example of a MEDLINE record, describing a full-text article, is shown in Table 1. It includes textual fields, such as title and abstract, as well as MeSH fields (denoted MH). The MeSH fields present several advantages over the textual fields: unlike the free-text content of the title/abstract fields, a MeSH field unambiguously associates a single term with a single concept. In addition, MeSH terms are assigned to each MEDLINE abstract by human indexers only after careful examination of the entire research article; consequently, they cover more conceptual ground than the title/abstract free text. A MEDLINE MeSH field is a combination of a MeSH descriptor with zero or more MeSH qualifiers. In Table 1, "Anoxia/*physiopathology" is the combination of the descriptor "Anoxia" with the qualifier "physiopathology". MeSH fields can describe major themes of the article (concepts central to the article) or minor themes (secondary concepts), and a star is used to distinguish the major themes from the minor ones. Thus, the association "Anoxia/*physiopathology" is a major theme of the MEDLINE record, along with "Nerve Degeneration/metabolism/*pathology". The MeSH 2009 vocabulary includes 25,186 descriptors, 83 qualifiers, and 180,672 supplementary concepts (see Note 1). Descriptors are the main elements of the vocabulary; qualifiers are assigned to descriptors inside the MeSH fields to express a special aspect of the concept. Both descriptors and qualifiers are organized in several hierarchies.
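For orientation, a minimal sketch (in Python; assuming the MH field strings have already been pulled from a MEDLINE record such as that in Table 1) of decomposing a MeSH field into its descriptor, qualifiers, and major-theme flag:

```python
def parse_mh(field):
    """Parse a MEDLINE MH field such as 'Anoxia/*physiopathology'
    into (descriptor, qualifiers, major_theme)."""
    major = "*" in field                      # a star anywhere marks a major theme
    parts = field.replace("*", "").split("/")  # descriptor, then qualifiers
    return parts[0], parts[1:], major

print(parse_mh("Anoxia/*physiopathology"))
# -> ('Anoxia', ['physiopathology'], True)
print(parse_mh("Nerve Degeneration/metabolism/*pathology"))
# -> ('Nerve Degeneration', ['metabolism', 'pathology'], True)
```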
Table 1
A MEDLINE record example showing some of the available fields

MEDLINE field | Value
PMID | 19845619
TI   | Hypoxia and neurodegeneration
AB   | Periods of chronic hypoxia, which can arise from numerous cardiorespiratory disorders, predispose individuals to the development of dementias ….
AU   | Peers C
AU   | Dallas ML
AU   | Boycott HE
AU   | Scragg JL
…    | …
MH   | Alzheimer disease/metabolism/pathology
MH   | Amyloid beta-protein/metabolism
MH   | Animals
MH   | Anoxia/*physiopathology
MH   | Calcium/metabolism
MH   | Humans
MH   | Models, biological
MH   | Nerve degeneration/metabolism/*pathology

PMID PubMed ID, TI title, AB abstract, AU author, MH MeSH term
The MeSH file uses translation tables and "explode" functionality that includes all related narrower terms derived from the MeSH tree structure, in order to enhance search capabilities. This file is also used to generate the MeSH tree numbers. The tree number data are the basis of the capability whereby MeSH terms are arranged hierarchically by subject category, with more specific terms arranged beneath broader ones. NLM uses this feature so that when MeSH terms are searched in PubMed, the program automatically includes the more specific MeSH terms. For example, in the MeSH disease hierarchy, Multiple Sclerosis is an example of Autoimmune Demyelinating Disease, which in turn is an example of Nervous System Disease. For each article, the MeSH terms indexed for that article, including substance names (relevant when identification of novel drugs is the analysis goal), were extracted using a parsing program designed to analyze the text contained in the downloaded abstract (18).
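The "explode" behavior described above can be emulated by prefix matching on tree numbers, as in the following minimal sketch (the three-entry tree-number table is a hypothetical stand-in for the full MeSH file):

```python
# Hypothetical fragment of the MeSH tree-number table (term -> tree numbers);
# the real table is parsed from the downloadable MeSH files.
TREE = {
    "Autoimmune Diseases of the Nervous System": ["C20.111.258"],
    "Multiple Sclerosis": ["C20.111.258.250.500", "C10.114.375.500"],
    "Nervous System Diseases": ["C10"],
}

def explode(term):
    """Return the term itself plus all narrower descendants, i.e., all terms
    whose tree numbers fall under any tree number of `term`."""
    prefixes = TREE[term]
    return {
        t for t, numbers in TREE.items()
        if any(n == p or n.startswith(p + ".") for n in numbers for p in prefixes)
    }

print(explode("Nervous System Diseases"))
# includes "Multiple Sclerosis" via its tree number C10.114.375.500
```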
While we could have parsed the text by directly mapping it to terms from the Unified Medical Language System (UMLS, http://www.nlm.nih.gov/research/umls) (19) or other ontologies, we chose the manually curated MeSH terms because manual curation, when comprehensive and systematic, remains the gold standard. On average, roughly ten MeSH indexing terms are applied to each MEDLINE citation by professional indexers, who choose these descriptors after reading the full-length article. From each PubMed article, the following features are extracted: PubMed identifier, year of publication, title, author list, affiliation, MeSH terms (with a flag indicating major themes), and substance names. Such a procedure yields ~200 million article-to-MeSH-term or substance-name mappings.

2.2. MEDLINE and Genes
Each MEDLINE article's title and abstract can, for example, be scanned for all high-quality gene names as described by Agarwal and Searls (18). In this setting, the gene name list is built by integrating names and descriptions from multiple fields within EntrezGene, HUGO, and UniProt (see Note 2). All mouse and rat gene synonyms are mapped onto the orthologous human EntrezGene entries using HomoloGene (http://www.ncbi.nlm.nih.gov/homologene) (6), a system developed for the detection of homologs among annotated genes of several completely sequenced eukaryotic genomes. These mappings yield a total of 420,000 synonyms, though only ~66,000 of the synonyms are found in PubMed abstracts. Gene synonyms that refer to multiple human EntrezGene entries are flagged, while gene names that correspond to common English words or are likely to be other medical terms (such as "AND", "CELL", etc.) or abbreviations (for example, most three-letter symbols) are discarded (see Note 3). Two additional data files from EntrezGene can be used to augment gene-to-PubMed mappings: gene2pubmed (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz) and GeneRIF (ftp://ftp.ncbi.nih.gov/gene/GeneRIF/generifs_basic.gz). GeneRIF (Gene Reference into Function) provides quality functional annotation that may extend beyond the genes mentioned in the abstract (20), and is usually produced by manual curation. In total, a system derived by the procedures given above contains ~7.1 million PubMed-to-human-gene mappings covering ~6.3 million articles and ~17,000 human genes.
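The synonym clean-up just described can be sketched as follows (the synonym pairs and stop list are hypothetical stand-ins for the EntrezGene/HUGO/UniProt-derived vocabulary):

```python
COMMON_WORDS = {"AND", "CELL", "NOT", "ALL", "WAS"}  # stop list (see Note 3)

raw_synonyms = [                      # hypothetical (synonym, EntrezGene ID) pairs
    ("TP53", 7157), ("p53", 7157),
    ("LEP", 3952), ("OB", 3952),
    ("CELL", 1234),                   # common English word -> discarded
    ("FAS", 355), ("FAS", 2194),      # one symbol, two genes -> flagged
]

mapping = {}                          # synonym -> set of EntrezGene IDs
for name, gene_id in raw_synonyms:
    if name.upper() in COMMON_WORDS:
        continue                      # discard stop-list collisions
    mapping.setdefault(name.upper(), set()).add(gene_id)

ambiguous = {s for s, ids in mapping.items() if len(ids) > 1}
print(sorted(mapping))   # ['FAS', 'LEP', 'OB', 'P53', 'TP53']
print(ambiguous)         # {'FAS'}
```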
2.3. MeSH and Genes
Generating hypotheses about which biological properties are shared within a set of genes or proteins, and correlating these properties, would facilitate the characterization of a set of unknown biomolecules. However, since biological properties are not always well defined, we define a set of genes with shared biological properties as a "bucket", to which a name and the source from which it was derived are assigned. Buckets are generated by searching all abstracts from PubMed for co-occurrences of gene names and MeSH terms (see Note 4).
Fig. 1. Simplified workflow of (a) the literature mining system for generating MeSH–gene buckets; (b) identification of biological concepts, in the form of Medical Subject Headings (MeSH terms), that are significantly overrepresented within the gene set relative to those associated with the overall collection of genes on the underlying Omics platform.
PubMed identifiers corresponding to each MeSH term are then mapped to gene names. Figure 1a provides a schematic workflow of the literature mining system and the generation of the buckets. We restricted the analysis to articles carrying the search term as a major MeSH annotation, and excluded search terms that are descendants of other search terms in the MeSH tree. The reason for the exclusion is to deemphasize obvious relationships between parent terms and their children (such as Diabetes Mellitus and Type II Diabetes Mellitus). Such a literature-based bucket collection currently comprises 14,993 buckets (Table 2), hierarchically categorized into 15 concept hierarchies such as "Anatomy", "Chemicals and Drugs", and "Diseases". The largest category is Chemicals and Drugs with 7,696 buckets, followed by Diseases (3,699), Anatomy (1,292), and Phenomena and Processes (1,208). In total, 16,016 unique genes are identified with these buckets.
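A minimal sketch of this co-occurrence step, assuming per-article gene and major-theme MeSH mappings (as built in Subheadings 2.1 and 2.2) are already available; the example data are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-article mappings standing in for the full MEDLINE-derived data
pmid_to_genes = {101: {"TP53", "CDKN1A"}, 102: {"TP53"}, 103: {"LEP"}}
pmid_to_major_mesh = {101: {"Apoptosis"}, 102: {"Apoptosis", "DNA Repair"}, 103: {"Obesity"}}

buckets = defaultdict(set)            # MeSH term -> set of co-occurring genes
for pmid, genes in pmid_to_genes.items():
    for term in pmid_to_major_mesh.get(pmid, ()):
        buckets[term] |= genes

print(dict(buckets))
# {'Apoptosis': {'TP53', 'CDKN1A'}, 'DNA Repair': {'TP53'}, 'Obesity': {'LEP'}}
```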
3. Methods

3.1. Implementation
The data mining approach described here involves retrieving all key MeSH concepts from the published literature linked to a set of submitted gene identifiers.
Table 2
List of concept hierarchies of MeSH terms from MEDLINE, showing the number of buckets and the average and median number of genes associated with them

MeSH category | Number of buckets | Average genes/bucket (std dev) | Median genes/bucket
Chemicals and drugs | 7,696 | 134 ± 290 | 57
Diseases | 3,699 | 65 ± 121 | 29
Anatomy | 1,292 | 182 ± 277 | 80
Phenomena and processes | 1,208 | 196 ± 410 | 56
Health care | 385 | 13 ± 26 | 6
Psychiatry and psychology | 283 | 39 ± 78 | 11
Disciplines and occupations | 178 | 27 ± 103 | 8
Technology, industry, agriculture | 168 | 43 ± 75 | 21
Analytical, diagnostic, therapeutic techniques and equipment | 35 | 183 ± 233 | 57
Anthropology, education, sociology, and social phenomena | 18 | 33 ± 40 | 18.5
Information science | 16 | 60 ± 112 | 6
Named groups | 6 | 24 ± 25 | 18.5
Humanities | 4 | 4 ± 2.4 | 2.5
Organisms | 2 | 21 ± 2 | 20.5
Publication characteristics | 2 | 11 ± 10.6 | 10.5
Geographicals | 1 | 3 | 3
The utility of this approach is that when a user submits a list of input IDs specified by accession number, Affymetrix probe ID, or gene symbol, the system converts the list into a list of EntrezGene identifiers by searching a custom database that integrates data collected from a number of public databases, including GenBank, UniGene, EntrezGene, RefSeq, etc. This list of EntrezGene identifiers is then used to query an Oracle "buckets" database. Several groups of Perl
routines are then used to download the bucket data from the database and generate profiles for each input gene, as shown in Tables 1 and 2. The final step is a classical enrichment analysis based on Fisher's exact test (21) to identify those buckets that contain a proportion of the input gene set significantly different from what is expected by chance.

3.1.1. Calculation of Enrichment of MeSH Buckets in the Analyzed Data Set
A MeSH term is considered enriched if the observed number of genes or proteins in the target data set associated with the term or bucket F, denoted ka, is higher than the number of genes/proteins that would be expected to be associated with F in a similar-sized set of randomly selected genes, denoted ke (22). The ke for a given term is calculated as:

ke = (n × K) / N
where n denotes the number of genes in the query set, K denotes the number of genes associated with the concept or bucket F, and N denotes the total number of genes in the genome, set to a conservative estimate of 7,500 (reflecting the number of human genes that have been identified, annotated, or otherwise classified in the bucket collection). It is possible to compute an effective genome size by counting all unique sequences partitioned into one or more buckets, or to restrict the genome size to the number of probe sets (or unique genes) available on a specific DNA microarray or chip. Regardless, the specific value of the genome size has no impact on the rank order of the buckets reported as significant matches; this uncertainty only affects the cutoff level for statistical significance. The enrichment factor, R, is simply the ratio of ka and ke:

R = ka / ke

3.1.2. Computing Significance
The p-values represent the likelihood of observing ka genes of a possible K associated with a particular term in a subset of n genes randomly drawn from a total of N genes. Since the set of n distinct genes represents a sample drawn without replacement, the p-value is calculated from the hypergeometric distribution. The one-sided Fisher exact probability for overrepresentation is computed from the 2 × 2 contingency table with entries ka, K − ka, n − ka, and N − n − K + ka. Since significance is tested across thousands of MeSH buckets, the q-value for multiple hypothesis testing (23), based on the False Discovery Rate (FDR), is computed from the resulting p-values using the QVALUE software (http://genomics.princeton.edu/storeylab/qvalue) (24).
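For orientation, a minimal sketch of this calculation in Python: SciPy's hypergeometric survival function gives the one-sided p-value, and Benjamini–Hochberg adjustment from statsmodels is shown as a freely available stand-in for the Storey q-value computed by QVALUE (a swap of method, not the tool named in the text):

```python
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

N = 7500   # conservative genome-size estimate used in the text
n = 66     # number of genes in the query set

def bucket_enrichment(ka, K):
    """Expected count ke, enrichment factor R, and one-sided p-value
    P(X >= ka) for a bucket of size K with ka observed query genes."""
    ke = n * K / N
    p = hypergeom.sf(ka - 1, N, K, n)  # hypergeometric upper tail
    return ke, ka / ke, p

# 'Apoptosis' bucket from Table 3: K = 167, ka = 10
# (Table 3 reports expected 1.5 and p = 1.60E-06 for this bucket)
ke, R, p = bucket_enrichment(10, 167)

# q-values across all tested buckets (Benjamini-Hochberg shown here)
pvals = [bucket_enrichment(ka, K)[2] for ka, K in [(10, 167), (19, 664), (13, 314)]]
qvals = multipletests(pvals, method="fdr_bh")[1]
```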
The analysis returns a list of buckets overrepresented in the input query set of genes, ordered by statistical significance: the buckets most similar to the input query set appear at the top, a lower p-value corresponding to higher significance. Figure 1b provides a simplified workflow illustrating this approach.

3.2. Biological Interpretation of Omics Data
To illustrate the method's functionality in identifying shared properties among a gene set using "buckets" derived from our literature-based annotation strategy, we consider two examples: expression of a set of p53-regulated genes over a time-course study, and genes differentially expressed in a diet-induced obesity model.
3.2.1. p53-Regulated Gene Set
The data considered in the first example consist of the expression profiles of 400 genes at multiple time points (2, 4, 8, 12, and 24 h) in response to p53 induction in the colon tumor-derived cell line EB-1, as measured with a custom spotted microarray (25). The microarray includes probe sets for genes reported to be potential p53 targets, some control genes, and 161 genes that appeared to be p53 responsive in an initial screen with the Human GeneChip 6500, a high-density oligonucleotide microarray (Affymetrix) containing probes for 7,464 human genes. Following p53 induction, 69 of the 400 genes differed from control by at least a factor of 2.5 in at least two time points. When subjected to cluster analysis (26), the 69 genes fall into eight clusters with distinct expression kinetics: clusters 1 through 5 contain transcriptionally induced genes, while clusters 6 and 7 contain genes repressed following p53 induction. Among the transcriptionally induced genes, early-response genes are in cluster 2, intermediate-response genes in clusters 1 and 3, and late-response genes in cluster 4; cluster 5 contains transiently induced genes with heterogeneous expression profiles. Among the transcriptionally repressed genes, early-response genes are found in cluster 6 and late-response genes in cluster 7. Genes in cluster 8 are repressed relative to control at earlier time points but induced at later time points. The 69 gene set clearly contains several putative p53 target genes, but their relevance to p53 cellular activity is not clear. To address this, we first generate a protein–protein interaction network view of the 69 gene set (Fig. 2), based on direct interactions between proteins extracted manually from the literature. This is a maximally scoring subgraph enriched for genes with the best
Fig. 2. A protein interaction network of enriched genes associated with the colon tumor-derived cell line EB-1 in response to p53 induction. This subset of 69 differentially regulated genes was found to differ from control by at least a factor of 2.5 in at least two time points. The network edges are based on direct interactions between proteins that were extracted manually from the literature. The interaction graph was generated with Scalable Vector Graphics (SVG; http://en.wikipedia.org/wiki/Scalable_Vector_Graphics).
p-values. Of the 69 signature genes, 3 have no HUGO gene symbols, and 18 have no reported interactions with other proteins and therefore no edges. The remaining genes are all connected, graphically displaying an underlying biological relationship and helping focus attention on the key genes within the query set. The entire signature list of 66 genes with HUGO gene symbols can be used to search the "buckets" database containing public information on biochemical function, biological process, cellular role, cellular component, pathways, expression, regulation, molecular function, and literature-based annotation. To demonstrate the power of our literature-based strategy, Table 3 lists highly enriched common themes for the gene set, expressed as MeSH concepts from the published literature. Among the top-ranked MeSH terms, 23 of the 66 genes are involved with the concept "Tumor Suppressor Protein p53", 19 are associated with "Neoplastic Gene Expression Regulation", 10 with "Apoptosis", 13 with "DNA Replication", 18 with "Squamous Cell Carcinoma", and 14 with "Cell Division". Appropriately, the key concepts associated with the gene set are drawn from literature describing their relevance to
Table 3
Partial list of statistically relevant MeSH concepts obtained from the analysis of the query gene set of 69 differentially regulated genes that differed from control by at least a factor of 2.5 in at least two time points in response to p53 induction

MeSH tree structure | P-value | Bucket size | Observed | Expected

Phenomenon and process
Apoptosis | 1.60E−06 | 167 | 10 | 1.5
Gene expression regulation, neoplastic | 2.50E−06 | 664 | 19 | 5.8
Liver regeneration | 2.60E−06 | 314 | 13 | 2.8
Cell division | 3.50E−06 | 376 | 14 | 3.3
DNA replication | 4.50E−06 | 330 | 13 | 2.9
Cell proliferation | 1.70E−05 | 688 | 18 | 6.1
Cell cycle | 2.60E−05 | 388 | 13 | 3.4
DNA damage | 3.00E−05 | 516 | 15 | 4.5
DNA fragmentation | 5.30E−05 | 9 | 3 | 0.079
MAP kinase signaling system | 0.00027 | 1,431 | 25 | 13
Mitosis | 0.00029 | 67 | 5 | 0.59
DNA repair | 0.0013 | 245 | 8 | 2.2
Cell aging | 0.0016 | 58 | 4 | 0.51
Cell survival | 0.0019 | 101 | 5 | 0.89
Oxidative stress | 0.0019 | 457 | 11 | 4
Signal transduction | 0.0039 | 574 | 12 | 5.1
Gene silencing | 0.0049 | 240 | 7 | 2.1
Necrosis | 0.0055 | 41 | 3 | 0.36

Disease
Carcinoma, squamous cell | 5.40E−09 | 400 | 18 | 3.5
Adenoma | 8.70E−09 | 71 | 9 | 0.62
Carcinoma, non-small-cell lung | 4.00E−08 | 455 | 18 | 4
Precancerous conditions | 2.00E−07 | 133 | 10 | 1.2
Adenocarcinoma | 4.70E−07 | 658 | 20 | 5.8
Melanoma | 9.40E−06 | 409 | 14 | 3.6
Glioblastoma | 1.10E−05 | 305 | 12 | 2.7
Papilloma | 1.70E−05 | 63 | 6 | 0.55
Alopecia | 2.20E−05 | 7 | 3 | 0.062
Ataxia telangiectasia | 2.80E−05 | 102 | 7 | 0.9
Endometriosis | 6.50E−05 | 116 | 7 | 1
Mandibular diseases | 7.50E−05 | 10 | 3 | 0.088
Neoplasms | 0.00024 | 550 | 14 | 4.8
Hypertension, pulmonary | 0.00027 | 66 | 5 | 0.58
Ovarian cysts | 0.00034 | 16 | 3 | 0.14
Pulmonary fibrosis | 0.00036 | 39 | 4 | 0.34
Ameloblastoma | 0.00043 | 41 | 4 | 0.36
Lymphoma, non-Hodgkin | 0.00043 | 41 | 4 | 0.36
Leukemia, myeloid | 0.0025 | 402 | 10 | 3.5
Eye abnormalities | 0.0035 | 35 | 3 | 0.31

Chemicals/drugs & proteins
Tumor suppressor protein p53 | 1.50E−09 | 633 | 23 | 5.6
Thymosin | 3.50E−07 | 18 | 5 | 0.16
Cyclins | 5.10E−07 | 863 | 23 | 7.6
Tumor markers, biological | 1.40E−06 | 1,218 | 27 | 11
Doxorubicin | 1.90E−06 | 356 | 14 | 3.1
Paclitaxel | 4.00E−06 | 184 | 10 | 1.6
Dexamethasone | 5.70E−06 | 509 | 16 | 4.5
Cell cycle proteins | 1.00E−05 | 58 | 6 | 0.51
Thiazoles | 1.50E−05 | 427 | 14 | 3.8
Luciferases | 1.70E−05 | 2,015 | 34 | 18
Histone deacetylases | 4.50E−05 | 243 | 10 | 2.1
Cyclooxygenase inhibitors | 5.00E−05 | 76 | 6 | 0.67
DNA topoisomerases, type II, eukaryotic | 5.10E−05 | 24 | 4 | 0.21
various p53 cellular activities, as well as several types of cancer. Looking at the functional concepts under the "Phenomena and Processes" category, we might conclude that the p53-regulated gene list is associated with "Apoptosis", "Cell Division", "DNA Replication", "Cell Proliferation", and "Cell Cycle". These annotations provide a first step for researchers to focus on specific concepts or processes of which they were potentially unaware. The next step is to find out which of the affected genes from the list are involved in these processes. A search of the highly enriched "Apoptosis" bucket indicates that 8 of the 10 genes are p53-activated gene targets. Transcriptional activation of Gadd45 and Tnfrsf10b occurred as early as 2 h after p53 induction, while Tp53, Tp5313, C4a, Cdkn1a, and Fas are mostly intermediate-response p53-activated targets. Other enriched buckets associated with p53-activated genes include "Oxidative Stress" (8/11), "Liver Regeneration" (10/13), "Signal Transduction", and "Neoplastic Gene Expression Regulation" (14/18). A search of the "DNA Replication" bucket shows that 7 of 13 genes are delayed-response p53-repressed genes from cluster 7. Notably, several of these genes, including the early-response repressed genes from cluster 6, are associated with buckets such as "DNA Repair", "Mitosis", and "Cell Division", suggesting that several phases of cell cycle-related processes are affected by p53-mediated transcriptional repression. Other enriched buckets within the repressed targets include "DNA Helicases" and "Eukaryotic DNA Topoisomerase Type II"
that are related to "DNA Replication". Topoisomerases act by transiently cutting one or both strands of DNA to relax the coil and extend the DNA molecule. The regulation of DNA supercoiling is essential to DNA replication, when the DNA helix must unwind to permit the proper function of the enzymatic machinery; the unwinding begins with the "unzipping" of the parent DNA molecule by helicases prior to replication. The results of such an analysis can be tremendously useful, since they save the researcher the inordinate effort of going through each of the 66 genes, compiling all MeSH concepts each gene is involved in, and then cross-referencing those concepts to determine how many genes fall in each process. Second, considering the expected numbers of genes in the enriched buckets can radically change the interpretation of the data: we may now want to correlate the p53-repressed gene set with cellular activity related to cell cycle progression or DNA replication rather than apoptosis. In this study, the authors chose to examine genes that differed from control by at least a factor of 2.5 in at least two time points. Clearly, other genes are also p53 responsive but may not show such a dramatic fold change. Typically, one might determine the p-values of the observed expression differences and examine gene sets that satisfy different p-value and fold-change cut-offs. Our method can be used to examine any gene set. Cut-offs of different stringency can help identify other potential p53-responsive genes, determine their kinetic behavior, and suggest roles and functions, and comparing results across cut-offs can lend insight into where the signal in less stringent cut-offs becomes overwhelmed by noise.

3.2.2. Diet-Induced Obesity Model
Diet-induced obesity models have been used widely to understand the pharmaceutical mechanisms of anti-obesity and Type 2 Diabetes (T2D) medications. In an attempt to better understand the molecular basis of dietary obesity, the differential expression of over 12,500 transcripts in epididymal fat pads of obese versus control rats was determined (27). In that study, rats were made obese with a high-fat diet (65% of calories from fat), whereas control rats were fed a standard laboratory diet containing 6% of calories from fat. The rat samples were hybridized on a Rat Genome U34A Array (Affymetrix) containing probes derived from more than 4,500 established rat genes and 8,000 EST clusters. More than 800 transcripts were found to be increased or decreased twofold or more in obese rats in response to the fatty diet. We examined the expression profiles of the 101 gene subset selected by the authors because of their implication in energy
Fig. 3. A protein interaction network of enriched genes that are differentially regulated in epididymal fat pads of obese versus control rats. This subset of 101 differentially regulated genes was generated by the authors because of their implication in energy metabolism, cytoskeleton, signal transduction, redox status, and transcriptional regulation. The network edges are based on direct interaction between proteins that were extracted manually from literature. The interaction graph was generated with Scalable Vector Graphics (SVG).
metabolism, cytoskeleton, signal transduction, redox status, and transcriptional regulation, as well as being differentially regulated. As with the p53-regulatory network, we assessed this 101 gene set by integrating it with a protein–protein interaction network (Fig. 3) extracted manually from the literature. Of the 101 signature genes, 6 have no HUGO gene symbols, and 16 have no neighbors and therefore no edges. The remaining genes from the query set are well connected, graphically displaying an underlying biological relationship. Table 4 shows highly enriched literature-based "buckets" for the gene set using MeSH concepts from the published literature. The most representative enriched buckets are implicated in energy metabolism and lipid metabolism. Specifically, buckets related to "Glycolysis", "Gluconeogenesis", "Pentose Phosphate Pathway", and "Citric Acid Cycle" all show a significant overrepresentation of genes after the diet-induced protocol. The top bucket for the gene set is "Adipose Tissue" with 23 hits, of which 18 genes are upregulated, including Lep, Fabp3, Cebpa, Gpd1, Scd1, Ptgds, Pparg, Igf1, and Ucp3. One of the phenotypes observed in the obese, high-fat diet rats is a statistically significant enlargement of the epididymal white adipose pads, which is often accompanied by an increase in gene expression associated with adipose tissue metabolism. Adipose tissue is an active player in the regulation of
Table 4
Partial list of statistically relevant MeSH concepts obtained from the analysis of the query gene set of 101 differentially regulated genes from epididymal fat pads of obese versus control rats that are implicated in energy metabolism, cytoskeleton, signal transduction, redox status, and transcriptional regulation

MeSH tree structure | P-value | Bucket size | Observed | Expected

Phenomena and process
Weight loss | 2.40E−10 | 63 | 11 | 0.79
Lipid metabolism | 3.10E−10 | 229 | 18 | 2.9
Insulin resistance | 2.00E−08 | 170 | 14 | 2.1
Glycolysis | 1.70E−06 | 91 | 9 | 1.1
Thermogenesis | 3.30E−06 | 34 | 6 | 0.43
Lipolysis | 4.80E−06 | 55 | 7 | 0.69
Energy metabolism | 1.10E−05 | 114 | 9 | 1.4
Gluconeogenesis | 2.30E−05 | 47 | 6 | 0.59
Oxidative stress | 3.70E−05 | 318 | 14 | 4
Pentose phosphate pathway | 5.80E−05 | 34 | 5 | 0.43
Sleep deprivation | 0.00012 | 21 | 4 | 0.26
Vascular resistance | 0.00029 | 11 | 3 | 0.14
Lipid peroxidation | 0.00041 | 78 | 6 | 0.98
Citric acid cycle | 0.0005 | 13 | 3 | 0.16
Appetite regulation | 0.004 | 26 | 3 | 0.33
Biological transport, active | 0.005 | 28 | 3 | 0.35
Cell hypoxia | 0.0055 | 312 | 10 | 3.9

Disease
Starvation | 1.90E−09 | 58 | 10 | 0.73
Diabetes mellitus, Type 2 | 6.00E−09 | 155 | 14 | 1.9
Hypercholesterolemia | 4.50E−07 | 57 | 8 | 0.71
Hypothyroidism | 1.60E−06 | 90 | 9 | 1.1
Glucose intolerance | 3.30E−06 | 34 | 6 | 0.43
Arteriosclerosis | 4.50E−06 | 193 | 12 | 2.4
Diabetic nephropathies | 2.10E−05 | 188 | 11 | 2.4
Hyperglycemia | 2.40E−05 | 96 | 8 | 1.2
Sleep apnea, obstructive | 2.60E−05 | 29 | 5 | 0.36
Liver failure | 3.00E−05 | 49 | 6 | 0.61
Cachexia | 7.40E−05 | 83 | 7 | 1
Weight gain | 8.80E−05 | 37 | 5 | 0.46

Chemicals, drugs & proteins
Adiponectin | 2.60E−13 | 174 | 19 | 2.2
Lipoprotein lipase | 3.60E−13 | 177 | 19 | 2.2
Thiazolidinediones | 6.00E−12 | 293 | 22 | 3.7
Antilipemic agents | 7.20E−12 | 114 | 15 | 1.4
Hypoglycemic agents | 8.60E−12 | 160 | 17 | 2
Leptin | 3.20E−11 | 352 | 23 | 4.4
Fatty acid-binding proteins | 4.00E−09 | 236 | 17 | 3
Bezafibrate | 8.40E−08 | 108 | 11 | 1.4
Oleic acid | 8.60E−08 | 222 | 15 | 2.8
Receptors, adrenergic, beta-3 | 2.00E−07 | 35 | 7 | 0.44
Peroxisome proliferators | 6.60E−07 | 192 | 13 | 2.4
Thiazoles | 1.30E−06 | 434 | 19 | 5.4
Gemfibrozil | 3.40E−06 | 74 | 8 | 0.93
Streptozocin | 3.50E−06 | 156 | 11 | 2
Glutathione transferases | 1.90E−06 | 3 | 3 | 0.038
Prostaglandin endoperoxide synthase | 5.50E−05 | 419 | 16 | 5.3

Anatomy
Adipose tissue | 1.30E−14 | 243 | 23 | 3
Liver | 8.00E−14 | 907 | 40 | 11
Adipocytes | 3.90E−13 | 285 | 23 | 3.6
Myocardium | 1.80E−07 | 305 | 17 | 3.8
Muscle, skeletal | 2.60E−07 | 702 | 26 | 8.8
Thyroid gland | 4.50E−07 | 57 | 8 | 0.71
Subcutaneous tissue | 1.90E−06 | 31 | 6 | 0.39
Foam cells | 7.30E−06 | 108 | 9 | 1.4
Caveolae | 1.30E−05 | 179 | 11 | 2.2
energy homeostasis, and, not surprisingly, most of these genes are transcription factors or genes involved in energy or lipid metabolism and appear to be upregulated. Buckets related to "Skeletal Muscle", "Leptin Metabolism", "Oxidative Stress", and "Leptin" are also significantly enriched. Obesity per se induces systemic oxidative stress, and increased oxidative stress in accumulated fat is, at least in part, the underlying cause of the dysregulation of adipocytokines and the development of obesity-associated metabolic syndrome (28). One of the enriched buckets further down the list is "Glutathione Transferases", which primarily comprises genes significantly downregulated in the obese animals; these genes are primarily associated with xenobiotic metabolism. High-carbohydrate, high-fat feeding of obese insulin-resistant mice resulted in an approximately two- to threefold increase in total adipose protein carbonylation, while the abundance of glutathione S-transferases was decreased approximately three- to fourfold in the adipose tissue of obese mice (29). These results support the hypothesis that obesity is accompanied by increased carbonylation of a number of adipose-regulatory proteins, which may serve as a mechanistic link between increased oxidative stress and the development of insulin resistance. Another interesting bucket associated with downregulated genes is "Prostaglandin-Endoperoxide Synthase". Yan et al. (30) showed Prostaglandin-Endoperoxide
Synthase 2, commonly known as Cyclooxygenase-2 (COX-2), to be downregulated during cell differentiation, with the COX pathway involved in the regulation of adipogenesis. Fain et al. (31) showed that COX-2 might be involved in body fat regulation: mice heterozygous for the COX-2 gene showed body weight increased by about 30%, with fat pads enlarged two- to threefold compared with those of the wild type. Most of these findings agree with those reported by the original authors. For example, we observed similar biological themes, such as glycolysis/gluconeogenesis, pentose phosphate pathway, citric acid cycle, lipid metabolism, energy metabolism, and glutathione transferases. In addition, we also observed new biological themes such as prostaglandin-endoperoxide synthases and cell hypoxia. Interpretive tools currently available for analyzing results from Omics platforms, as exemplified here for p53 regulation and for a complex disease model such as obesity, provide only a partial view of the relevant literature, aptly described by Masys et al. (16) as "gazing through a picket fence" (see Note 5). Our approach provides a statistical procedure to identify biological concepts, in the form of Medical Subject Headings (MeSH terms) extracted from MEDLINE, that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying "Omics" platform. Such an approach is designed to help the researcher find relevant biological processes by exploiting the available functional annotation data (see Note 6).
4. Notes

1. The MeSH vocabulary files used to validate MeSH indexing terms at data entry into NLM's PubMed are available in XML and ASCII formats, and can be downloaded from http://www.nlm.nih.gov/mesh/filelist.html upon completion of an online memorandum of understanding.

2. The gene name vocabulary is built by integrating names and descriptions from multiple fields within EntrezGene, HUGO, and UniProt. The EntrezGene data may be downloaded from ftp://ftp.ncbi.nih.gov/gene/DATA/; all gene names and descriptions are extracted from the gene_info.gz file. HUGO-approved gene names and symbols are extracted from the "All Data text" downloaded from http://www.genenames.org/data/gdlw_index.html. All gene names and descriptions in the GN, ID, and DE fields are extracted from UniProt, which may be downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz.
The file gene2accession.gz is then used to map UniProt accessions to EntrezGene.

3. Build standardized or controlled vocabularies by compiling a list of commonly occurring English terms and the various gene/disease synonyms; these allow on-the-fly disambiguation in searches.

4. The use of MEDLINE records has the advantage of providing an explicit basis for defining co-occurrence. One question regarding co-occurrences is whether they reflect meaningful relationships between genes. Jenssen et al. (3) and Stapley and Benoit (32) investigated such relationships and found that abstract co-occurrence reflects meaningful biology, and that most incorrect pairs were explained by synonym or name confusion.

5. The method has several limitations. First, the use of MEDLINE records restricts the relationships that can be found to those mentioned in titles and abstracts. Second, when searching for the literature relevant to specific entities such as a gene, a protein, or a disease, the effective use of key search terms is critical, but necessarily limiting. Furthermore, since both the English language and biomedical jargon suffer from several levels of ambiguity, we may miss relevant papers as well as retrieve irrelevant ones. These issues can be addressed by enforcing standardized or controlled vocabularies (see Note 3). Other limitations include the inescapable fact that expressed genes without associated publications do not participate in the analysis, and the more subtle bias that well-known, better-characterized genes are overrepresented in the literature relative to newly discovered genes. While such genes currently represent a very small fraction of all genes, this situation should improve over time as additional literature assigning functions to them is published.

6. The usefulness of automated linkages to the literature for interpreting high-throughput data from multiple Omics platforms will improve as the literature expands and becomes increasingly available as electronic full text, and as computational tools for processing language become more powerful and robust.
Acknowledgements The author would like to thank Craig Volker, Pankaj Agarwal, Liwen Liu, Tom White, Dilip Rajagopalan, William Reisdorf, Karen Kabnick, and David Searls for their contribution towards the development of this approach.
References

1. Andrade, M.A., and Valencia, A. (1997) Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. Proc Int Conf Intell Syst Mol Biol 5, 25–32.
2. Hanisch, D., Fluck, J., Mevissen, H.T., and Zimmer, R. (2003) Playing biology's name game: identifying protein names in scientific text. Pac Symp Biocomput, 403–14.
3. Jenssen, T.K., Laegreid, A., Komorowski, J., and Hovig, E. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 28, 21–8.
4. Shatkay, H., Edwards, S., Wilbur, W.J., and Boguski, M. (2000) Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol 8, 317–28.
5. Yandell, M.D., and Majoros, W.H. (2002) Genomics and natural language processing. Nat Rev Genet 3, 601–10.
6. Wheeler, D.L. et al. (2008) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 36, D13–21.
7. Edgar, R., Domrachev, M., and Lash, A.E. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30, 207–10.
8. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., and McKusick, V.A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 33, D514–7.
9. Ashburner, M. et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25–9.
10. Zeeberg, B.R. et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4, R28.
11. Cheng, J. et al. (2004) NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis. Bioinformatics 20, 1462–3.
12. Srinivasan, P. (2001) MeSHmap: a text mining tool for MEDLINE. Proc AMIA Symp, 642–6.
13. Kanehisa, M., and Goto, S. (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28, 27–30.
14. BioCarta, http://www.biocarta.com, 2009.
15. Dahlquist, K.D., Salomonis, N., Vranizan, K., Lawlor, S.C., and Conklin, B.R. (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31, 19–20.
16. Masys, D.R. et al. (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics 17, 319–26.
17. Srinivasan, P., and Hristovski, D. (2004) Distilling conceptual connections from MeSH co-occurrences. Stud Health Technol Inform 107, 808–12.
18. Agarwal, P., and Searls, D.B. (2008) Literature mining in support of drug discovery. Brief Bioinform 9, 479–92.
19. McCray, A.T. (2003) An upper-level ontology for the biomedical domain. Comp Funct Genomics 4, 80–4.
20. Mitchell, J.A. et al. (2003) Gene indexing: characterization and analysis of NLM's GeneRIFs. AMIA Annu Symp Proc, 460–4.
21. Al-Shahrour, F. et al. (2007) FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res 35, W91–6.
22. Johnson, R.J. et al. (2005) Analysis of gene ontology features in microarray data using the Proteome BioKnowledge Library. In Silico Biol 5, 389–99.
23. Storey, J.D., and Tibshirani, R. (2003) Statistical methods for identifying differentially expressed genes in DNA microarrays. Methods Mol Biol 224, 149–57.
24. Storey, J.D., and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100, 9440–5.
25. Zhao, R. et al. (2000) Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes Dev 14, 981–93.
26. Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–8.
27. Lopez, I.P. et al. (2003) DNA microarray analysis of genes differentially expressed in diet-induced (cafeteria) obese rats. Obes Res 11, 188–94.
28. Furukawa, S. et al. (2004) Increased oxidative stress in obesity and its impact on metabolic syndrome. J Clin Invest 114, 1752–61.
29. Grimsrud, P.A., Picklo, M.J., Sr., Griffin, T.J., and Bernlohr, D.A. (2007) Carbonylation of adipose proteins in obesity and insulin resistance: identification of adipocyte fatty acid-binding protein as a cellular target of 4-hydroxynonenal. Mol Cell Proteomics 6, 624–37.
30. Yan, H., Kermouni, A., Abdel-Hafez, M., and Lau, D.C. (2003) Role of cyclooxygenases COX-1 and COX-2 in modulating adipogenesis in 3T3-L1 cells. J Lipid Res 44, 424–9.
31. Fain, J.N., Ballou, L.R., and Bahouth, S.W. (2001) Obesity is induced in mice heterozygous for cyclooxygenase-2. Prostaglandins Other Lipid Mediat 65, 199–209.
32. Stapley, B.J., and Benoit, G. (2000) Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput, 529–40.
Chapter 22

Omics–Bioinformatics in the Context of Clinical Data

Gert Mayer, Georg Heinze, Harald Mischak, Merel E. Hellemons, Hiddo J. Lambers Heerspink, Stephan J.L. Bakker, Dick de Zeeuw, Martin Haiduk, Peter Rossing, and Rainer Oberbauer

Abstract

The Omics revolution has provided the researcher with tools and methodologies for qualitative and quantitative assessment of a wide spectrum of molecular players, spanning from the genome to the metabolome level. As a consequence, explorative analysis (in contrast to purely hypothesis-driven research procedures) has become applicable. However, numerous issues have to be considered for deriving meaningful results from Omics, and bioinformatics has to respect these in data analysis and interpretation. Aspects include sample type and quality, concise definition of the (clinical) question, and selection of samples, ideally coming from thoroughly defined sample and data repositories. Omics suffers from a principal shortcoming, namely the unbalanced sample-to-feature matrix denoted the "curse of dimensionality", where a feature refers to a specific gene or protein among the many thousands assayed in parallel in an Omics experiment. This setting makes the identification of features relevant to the phenotype under analysis error prone from a statistical perspective. Consequently, sample size calculation for screening studies and for verification of results from Omics bioinformatics is essential. Here we present key elements to be considered for embedding Omics bioinformatics in a quality-controlled workflow for Omics screening, feature identification, and validation. Relevant items include sample and clinical data management, minimum sample quality requirements, sample size estimates, and statistical procedures for computing the significance of findings from Omics bioinformatics in validation studies.

Key words: Clinical study, Database, Data standards, Minimum sample requirements, Clinical statistics, Outcome analysis, Biomarker, Target
1. Introduction

The advent of the Omics revolution has forced us to improve our ability to acquire, measure, and handle large data sets. Omics technology platforms such as expression arrays and mass spectrometry,
with their excellent selectivity, sensitivity, and specificity, are ideal procedures for the detection, quantitation, and identification of mRNA, proteins, and metabolites derived from complex body tissues and fluids (1). In clinical medicine, three major approaches, namely genomics, proteomics, and metabolomics, are currently being applied on a regular basis. Three somewhat interconnected areas can be considered part of genomics:

1. Structural genomics deals with sequencing and mapping of genomes.

2. Functional genomics includes the analysis of mRNA expression in development, physiology, and disease, as well as the effects of polymorphic variation on gene function.

3. Computational genomics develops strategies to analyze the vast and puzzling genomic data sets.

Real-time reverse transcriptase-polymerase chain reaction (RT-PCR) enables the exact quantification of a limited number of mRNAs with extreme sensitivity. DNA microarray systems with the capability to study extended portions of the human transcriptome have been introduced during the last decade and are now state-of-the-art techniques in functional genomics; in some settings they are already in clinical use (2). Gene expression analysis defines steady-state transcript abundance. However, this represents only one potential aspect of the regulation of the gene product: from the transcriptional regulation of the gene to post-translational modification of the functional protein, multiple levels of regulation add considerable regulatory and functional complexity. Proteomics, the study of global sets of proteins, their expression, function, and structure, is currently evolving from classical analytical techniques to complex approaches combining pattern recognition with quantification (3). Metabolomics, in contrast, focuses on the set of metabolites, where the metabolome is defined as the collection of all exogenous and endogenous small molecule metabolites present in a living cell or organism. Metabolomics identifies the ongoing biological status via quantitative determination of the intermediary metabolites (lipids, amino acids, simple sugars, cofactors, etc.) crucial for the phenotype of a biological unit (4). These three techniques, especially when combined in a systems biology approach (also referred to as pathway, network, or integrative biology), have fuelled a revolution in hypothesis-generating research that provides a powerful complement to the conventional hypothesis-driven approach. Methods for network analysis and systems biology offer the promise of integrating multiple levels of data, connecting molecular pathways to cellular, tissue, or organ function, ideally leading to a new understanding of
integrative physiology, which emphasizes the importance of understanding pathways with overlapping, complementary, or opposing effects and their interactions in the context of intact organisms. Even though initially met with low acceptance, the concept of integrative physiology is nowadays more and more appreciated. In clinical medicine, however, these techniques are still rarely used, most likely because academic medicine is strongly attached to the hypothesis-driven approach. A randomized controlled trial, for example, which is currently supposed to provide the highest level of evidence for hypothesis confirmation, is structured such that the number of hypotheses that can be evaluated is limited to a few, and classical statistical approaches have been developed to control the probabilities of false-positive and false-negative results even before the trial has started. In the Omics arena, this relationship is reversed under the curse of dimensionality, as the number of features measured usually far exceeds the number of samples. Therefore, if these techniques are used, study design principles need to be modified. Additionally, studies using Omics techniques require special precautions (e.g., regarding sample collection), and most probably the number of confounders that need to be taken into account is much larger than in conventional clinical trials, making adequate clinical data collection even more important. Finally, some general principles still apply no matter which technique is used. Hard clinical endpoints, e.g., death, are preferable, even though an integrative systems biology approach might better define surrogate or intermediate endpoints (e.g., disease progression), as these can be better integrated into the pathophysiology of disease processes.
2. Materials

2.1. Clinical Data and Sample Collections
Regarding clinical data, organizations such as CDISC (Clinical Data Interchange Standards Consortium, http://www.cdisc.org) work on establishing standards for acquisition, exchange, submission, and archiving; the technical standard XML is deployed for case record forms, namely in the Case Report Tabulations Data Definition Specification (CRT.DDS). More information is given at http://www.cdisc.org/content1057. Omics studies frequently make use of clinical data and samples that have been collected in the past, usually with intentions other than the performance of Omics studies. If this is the case, the prerequisites for the collected clinical data and samples depend on the goal and nature of the intended Omics study. If the goal is to find (a) new biomarker(s) for a disease (process), the new biomarker should either have predictive value in addition to already existing biomarkers, or have
the promise that it can be measured much more easily and/or cost-effectively, so that it can replace the existing markers or procedure (see Note 1). A study determining the additive value of a new biomarker on the basis of an Omics screening study will only be of value if the matrix for the existing biomarkers is sampled and stored under optimal conditions. The word "optimal" is very important in this context, because the predictive performance of an existing biomarker may seem easy to generate from existing samples, but there are several pitfalls one may encounter. In the case of urinary albumin, for example (a marker for kidney disease), untreated urine samples that have been stored at −20°C for longer than 3 months prior to assessment are not suitable for the generation of adequate data on this variable. Studies have shown that prolonged frozen storage of urine samples at −20°C results in a marked decline in urinary albumin concentration (UAC). In addition, the decline shows large variability between samples (5), making it very unpredictable which samples exhibit a false UAC after frozen storage. This large inter-sample variability eventually impairs the performance of albuminuria (stored at −20°C) in predicting mortality after prolonged frozen storage of urine (6, 7). Thus, if one wants to compare the additive prognostic performance of a new biomarker, one should ensure that the existing biomarkers are collected, stored, and measured under optimal circumstances. When existing biomarkers are not sampled and stored under optimal circumstances, their predictive performance may be underestimated, and consequently the performance of the new biomarker may be overestimated. We have illustrated this with the case of urinary albumin, but the same may be equally true for other small or large proteins, or even for other small molecules. Detailed information on sampling and storage conditions of existing biomarkers is not necessarily needed if the goal is to gain insight into the pathophysiology of a disease or disease process. An interesting example can again be derived from diabetic nephropathy, where it was found that patients with diabetes and nephropathy excrete fewer collagen fragments in urine than patients with diabetes without diabetic nephropathy (8). This finding could be consistent with the former patients retaining collagen in their tissues, probably including their kidneys. This is an interesting new, previously unconsidered perspective on the pathophysiology of diabetic nephropathy, which does not per se require information on urinary albumin with the same degree of "optimal collection" as for finding a new biomarker. Bioinformatics and statistics consequently have to keep these issues in mind when designing analysis procedures tailored either to understanding the pathophysiology or to defining biomarker candidates.
2.2. Implementing a Data and Sample Repository
Frequently, Omics experiments are performed on "ad hoc" collected samples or sample collections derived in a less controlled environment; this again has to be kept in mind when assessing the evidence level in bioinformatics analyses. Best practice includes, next to SOP-controlled specimen and data collection, a detailed study plan with a precise definition of the (clinical) question, primary and secondary endpoints, and assessment of confounders that might be relevant in downstream bioinformatics and statistics. These boundaries also allow sample size calculation for handling type I and II errors, and enable the specification of the statistical procedures to be applied. In this context, data and sample repositories (or databases) come into play, which might accompany already given collections or have to be established to support prospective collections. If one is fortunate, a data and sample repository is already available, and selection of cases and controls to be analyzed in Omics (or selected for feature validation) can be pursued. If not, a dedicated repository has to be established, which, next to issues associated with the hypothesis (event rates, follow-up time needed, etc.), also requires thorough design on a technical level. Key issues to be considered include, among others:

1. Single center versus multicenter operation, which determines whether the repository shall be established as a Web solution or as an application, with differing requirements on the client side; user roles and access policy also have to be considered.

2. Data quality and security, involving encryption standards, audit trails, backup, and administration, all tightly bound to provisions set out by the ethics bodies involved in the study centers.

3. Maintenance requirements, including system lifetime (centrally defined by follow-up times), minimum downtime considerations, and expected size of the repository (which with modern hardware is usually not an issue).

4. Parameters to be assessed according to the study protocol (carefully avoiding free text, and supporting data quality by constraining parameters with upper/lower limits, etc.) (see Note 2).

For usability reasons, most electronic data repositories are nowadays realized as Internet applications, i.e., web-based solutions implemented, e.g., with Java Server Pages (JSP), Java Server Faces (JSF) (http://java.sun.com/javaee/javaserverfaces), PHP (http://php.net/index.php), or Microsoft .NET (http://www.microsoft.com/net). One major advantage of web applications is the minimal requirements on the client side. These implementations rest on the Hypertext Transfer Protocol (HTTP) and are per definition
stateless; if asynchronous communication is needed, e.g., AJAX (http://www.ajax.org) can be used. Client–server applications, on the other hand, are stateful, allowing communication in both directions (and thereby, e.g., allowing concurrent modification of entries). Realizing a multi-user (across several institutions) data repository requires administration of institutions and users, demanding the definition of user roles including: (1) a study administrator responsible for administrating all institutions as well as parameter definitions, (2) an institution administrator responsible for user accounts in local groups, and (3) a standard user for data entry (see Note 3). Next to selecting the software framework in accordance with the study design, a detailed specification of the application has to be produced (following typical software development standards), covering both the database and the server/client architecture. Steps involve definition of user requirements, implementation, and application testing, completed by rollout. Such a design can be followed for a precisely defined study setup involving explicit implementation of a relational database. Explorative studies in particular, however, may demand changes in parameters, or even modifications of the underlying business logic. To reflect these facts on the software development layer, persistence frameworks such as Hibernate (for Java) and NHibernate (for .NET) are available (https://www.hibernate.org). These frameworks persist objects with their parameters into a relational database via Object-Relational Mapping. As for most software applications, there are numerous approaches to realizing a repository for clinical data and sample management. Independent of how the business logic is established, freely available software such as MySQL for the database, Java for the application, and Linux as the server operating system is readily available.
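A minimal sketch of the Object-Relational Mapping idea, using Python with SQLAlchemy rather than the Hibernate/Java stack named in the text (a deliberate technology swap for illustration; all entity and field names are hypothetical stand-ins for the institution/user/sample roles described above):

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Institution(Base):
    __tablename__ = "institution"
    id = Column(Integer, primary_key=True)
    name = Column(String(128), nullable=False)
    users = relationship("User", back_populates="institution")

class User(Base):
    __tablename__ = "user"
    id = Column(Integer, primary_key=True)
    login = Column(String(64), unique=True, nullable=False)
    # one of: study_admin, institution_admin, standard_user (roles above)
    role = Column(String(32), nullable=False)
    institution_id = Column(Integer, ForeignKey("institution.id"))
    institution = relationship("Institution", back_populates="users")

class Sample(Base):
    __tablename__ = "sample"
    id = Column(Integer, primary_key=True)
    storage_temp_c = Column(Float)   # constrain at entry, e.g., -80 or -20

engine = create_engine("sqlite:///:memory:")  # stand-in for MySQL in production
Base.metadata.create_all(engine)

with Session(engine) as session:
    inst = Institution(name="Center A")
    user = User(login="jdoe", role="standard_user", institution=inst)
    session.add_all([inst, user, Sample(storage_temp_c=-80.0)])
    session.commit()
```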
2.3. Omics Sample Size Calculation

As specified in the study design, and in line with data and sample collection, the number of Omics samples to be included in a screening study has to be defined. Sample size computations for a comparison of intensities (concentrations and abundances) between two experimental conditions or between two states (case–control study) are usually based on assumptions about the following quantities:

1. The number of features investigated (NG).

2. The number of features assumed to be differentially expressed (this number is often assumed to equal the number detected by testing) (NDE).

3. The acceptable number of false positives (FP).
4. The minimally relevant fold change (FC)
5. The per-feature variation of intensities within groups [usually expressed as the standard deviation (SD) of log2 intensities]
The false discovery rate to be controlled by the study is given by FP/NDE. High variation of intensities and a low fold change (i.e., close to 1) will increase the necessary sample size. Given that NDE equals the number of features declared significant by testing, the number of false negatives equals FP, so that the FDR equals the type II error rate, i.e., FDR = 1 − statistical power. The per-feature type I error, also denoted the false-positive rate, is given by FP/(NG − NDE), given that all the other numbers remain constant. Sample size assessment for Omics studies can be done using, e.g., the MD Anderson sample size calculator available at http://bioinformatics.mdanderson.org/MicroarraySampleSize, or by simulation using information from existing pilot data (9).
When correlating Omics with clinical data, one will usually not deal with experimental conditions that can be controlled by the investigators, but with patients who may be either cases or controls, defined by their clinical outcome. Cases are patients who experienced the outcome of interest within a fixed follow-up time, e.g., diabetic patients who progressed from normoalbuminuria (no or low level of albumin in urine) to microalbuminuria (moderately elevated albumin level in urine). Controls are patients with the same baseline constitution who did not progress during the same follow-up period. The cause–response relationship between Omics and clinical outcome is such that one assumes that variation in the intensity of some features may influence the clinical outcome. This is adequately dealt with by regression models that use intensity values as independent variables and clinical outcomes as dependent variables. However, sample size calculation for such multivariable models is difficult to conduct, as assumptions on the correlation of the features jointly influencing clinical outcome are needed but usually cannot be justified. Therefore, for sample size calculations, the two possible states of a patient in a case–control study are treated as the "groups" to be compared, as if these states were levels controlled in a randomized experiment.
Meaningful values for NG, NDE, FC, and SD can often be obtained from pilot experiments. Due to limitations of signal detection, some types of Omics technologies, e.g., proteomics or metabolomics, lead to intensity distributions with a clump of zeros. To obtain an SD of log intensities from pilot data, one has to replace intensity values of zero by a nonzero value, e.g., one half of the lowest nonzero intensity of a specific feature. The total number of features, NG, can already take into account the reduction of features due to unspecific prefiltering. For example, features with few nonzero intensity values or with low variation will be excluded a priori from analysis.
We exemplify these considerations with a metabolomics study evaluating 486 metabolites in blood samples of diabetes mellitus type 2 patients with normoalbuminuria, who either progressed or did not progress to microalbuminuria within 2 years of follow-up. From pilot data on 76 blood samples, it is known that 197 metabolites may have to be excluded from analysis because their relative frequency of zero intensities exceeds 50%, such that NG = 486 − 197 = 289. Replacing each zero intensity value by one half of the lowest nonzero intensity, the SD is estimated to be 0.704. Assuming 14 metabolites with a fold change greater than 1.5, and accepting one false positive out of 14 metabolites declared differentially expressed, results in a power of 0.93 and an FDR of 0.07 (1 over 14). Inserting these values into the MD Anderson sample size calculator yields a sample size of 56 per group.
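This per-group figure can be reproduced with the standard two-group formula underlying such calculators: a per-feature two-sided test at level α = FP/(NG − NDE) with power 1 − FDR and effect size log2(FC) in units of the SD. A minimal sketch (our own illustration, not the calculator's code):

from math import ceil, log2
from scipy.stats import norm

def omics_sample_size(ng, nde, fp, fc, sd):
    # ng: features after prefiltering; nde: truly differentially expressed
    # fp: acceptable false positives; fc: minimal relevant fold change
    # sd: standard deviation of log2 intensities within groups
    alpha = fp / (ng - nde)          # per-feature type I error
    power = 1.0 - fp / nde           # power = 1 - FDR (declared = NDE)
    delta = log2(fc)                 # effect size on the log2 scale
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(2 * (z * sd / delta) ** 2)

print(omics_sample_size(ng=289, nde=14, fp=1, fc=1.5, sd=0.704))  # -> 56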
2.4. Technical Specifications of Omics Samples

2.4.1. General Consideration

A prerequisite for a successful Omics experiment is knowledge of the platform's performance, and SOPs and a QC system must be in place (see Note 4). In addition, all relevant clinical and demographic data must be present, and informed consent must be given. The golden rule is: the more information, the better. Certain demographic information is mandatory (such as age, gender, and BMI). The mandatory clinical information depends on the study; e.g., in a study on diabetes, parameters like blood pressure, serum creatinine, cholesterol, and metabolic status are frequently of relevance and should be collected (see Note 5). Assessing the biological, pre-analytical (due to sample processing), and analytical variability, and assessing the number of experiments required to detect an x-fold change with confidence prior to the Omics screening as such, is highly beneficial. Pre-analytical variability can be addressed by collecting material from the same individual and preparing and analyzing identical as well as different samples independently, as also outlined by the FDA (http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM070107.pdf). Analytical variability can, for example, be assessed by spiking an identical sample (which can be a pool of samples) with different amounts (e.g., 1, 3, and 10 units) of a known reference standard upon collection, and assessing how many samples have to be analyzed to detect the difference as significant. Stability of the information to be assessed from a particular sample is essential. Ideally, only analytes that are stable throughout collection and preparation, until the actual analytical step, are assessed. However, if this cannot be fully achieved, compromises have to be made, and these depend to a large degree on the type of analysis to be performed.

2.4.2. Transcriptomics

In general, the transcriptome, being substantially regulated by the environment and not at steady state, must be "preserved", preferably at the minute of harvest. Due to the different tissues
analyzed, an array of protocols exists. A common feature of these is that the sample must be processed as fast as possible, with cells kept at low temperature (4°C), and stabilizers like RNAlater added as soon as possible. To ensure comparability between samples, SOPs for collection and immediate processing must be available and strictly followed. While the high susceptibility of the transcriptome to environmental factors and conditions between collection and preservation represents a substantial obstacle, the analytical technology for transcriptome analysis is well developed and associated with relatively minor challenges in comparison to proteomics and metabolomics. RNA can be efficiently amplified with routine protocols and kits, and it can be detected using specific probes, substantially easing the assessment of the transcriptome. Based on the analytical steps alone, transcriptomics would yield highly accurate information of high density. Pre-analytical issues offset this benefit, and sometimes render the extraction of relevant information from such experiments impossible. Another consideration that must be taken into account is the "curse of dimensionality": identification of transcriptional regulation significantly associated with a certain (patho)physiological state mandates adjustment for multiple testing. Consequently, statistical power depends on the number of features (transcripts) assessed and decreases with an increasing number of features. It may be advantageous to analyze only a subset of the transcriptome, thereby increasing statistical power and enabling the identification of significant changes based on a lower number of samples. This can be accomplished either by using a custom chip with a reduced number of represented features (which, however, already demands a hypothesis for selecting this feature subset), or by forwarding a reduced number of features to statistical analysis by applying filter mechanisms (e.g., based on a minimum standard deviation of expression values).

2.4.3. Proteomics
As for any other Omics experiment, consistency is of utmost importance. Relevant variables are temperature and the time until stabilization, as proteolysis cannot be completely inhibited; hence, the proteome observed will be substantially influenced by proteolysis. An array of protocols for the collection of tissue samples for proteomics exists. In general, performing proteome analysis from tissue remains a substantial challenge, due to sample instability and inhomogeneity. To cope with the latter, microdissection or laser capture microdissection can be performed, and first results from such experiments have been published (10). Blood (serum and plasma), while appearing to be the specimen of choice at first sight, unfortunately poses several severe problems, especially the high intrinsic protease activity and the presence of a few highly abundant proteins that obscure lower abundant biomarkers (11).
In general, plasma is preferred over serum, and no conclusive statement on the ideal stabilizer is available as yet (12). While several products exist to deplete abundant proteins (depletion columns for up to ten abundant plasma proteins are available from several suppliers), theoretically rendering the less abundant proteins accessible to conventional proteomics, experimental data clearly indicate partial and unreproducible co-depletion of >1,000 proteins, increasing variability beyond what can be practically handled (13). The alternative, "equalizing" proteins based on their interaction with a large array of peptide ligands, has been proposed (14), but has not yet been successfully used to identify any valid biomarker. Urine appears to be best suited for proteome analysis, mainly due to its high stability. In addition, a standard reference urine sample has recently been established, likely easing the comparative assessment of platform performance and the comparison of datasets obtained on different platforms (15). Furthermore, standard protocols for urine collection have been established and internationally agreed upon (http://www.eurokup.org/node/137).

2.4.4. Metabolomics
Metabolites are by definition not end products; they are transient, and their preservation is therefore mandatory (16, 17). Common sources for metabolome analysis are blood (plasma and serum) or urine. Here, EDTA plasma is the preferred specimen (better suited than heparin plasma, which in turn is better than serum; citrate plasma should not be used). The generally inferior results obtained with urine may be a consequence of the fact that urine is "stored" in the bladder for varying amounts of time, resulting in substantial changes in the metabolome. Ideally, postprandial samples are employed in metabolome analysis, collected and centrifuged under standard conditions. Samples can be stored at −70°C, strictly omitting freeze-thaw cycles. The time from collection until freezing should be no longer than 1 h, and, as above, it is imperative that the same procedure be applied to all samples. In summary, at the beginning of an Omics experiment, and before even analyzing the first sample for a specific purpose, a platform of known performance characteristics with respect to the specimens used should be available, power calculations shall provide an estimate of the required number of cases and controls, and at least this required number of specimens, together with all necessary clinical data, has to be present. In particular for Omics data preprocessing, knowledge of platform performance and sample handling is pivotal for the quality assessment of profiles.
2.5. Outcome Analysis
Having a well-designed data repository at hand allows validation of the results obtained from the different Omics tracks. Features identified in Omics are always linked to a phenotype (e.g., a clinical representation) and are in this context defined as relevant. One typical notion of relevance is the assumption that the feature
(biomarker) provides a prognostic assessment. A novel biomarker is required to improve the prediction of such a clinical outcome beyond known clinical or genetic predictors. Various concepts exist to measure the predictive accuracy of a set of predictors, depending on the type of clinical outcome. For continuous outcomes, the classical adjusted R-square of linear regression supplies an estimate of the proportion of variance of the outcome variable explained by the predictors. For binary and time-to-event outcomes, alternative pseudo-R-square measures have been proposed (18, 19). The additional value of genetic markers can be expressed as the increase in (pseudo-)R-square due to the addition of that marker to a set of known predictors (20, 21).
Depending on whether one or a few candidate biomarkers should be evaluated for additional predictive value, or a huge number of biomarkers should be screened for further evaluation, different statistical approaches have to be considered. In the former case, classical regression models can be applied, as the number of independent observations n will clearly exceed the number of independent variables p by at least tenfold. For continuous, binary, and time-to-event clinical outcomes, linear, logistic, or Cox regression will be considered, respectively.
In the latter case, one generally assumes that not one single biomarker alone is associated with clinical outcome, but several simultaneously. Therefore, multivariable regression models will be considered, which are able to assess the simultaneous effect of several variables on the outcome. In classical statistical methods such as linear, logistic, or Cox regression, estimates of the model parameters are found by maximizing the likelihood, i.e., the probability of the observed data given the model parameters. When evaluating all available biomarkers at the same time, the resulting p >> n situation causes a breakdown of these methods, because the estimation problem is over-parameterized and no unique solution for the parameter estimates can be found. Penalized likelihood methods have been proposed to deal with this situation. These methods impose restrictions on the norm of the parameter vector: either on the sum of absolute parameter values (L1 restriction, known as the lasso) (22, 23), or on the sum of their squared values (L2 restriction, also known as ridge regression) (24, 25), or on both (this combined method is known as the "elastic net") (26). Implementations of these methods in the statistical software R exist for all three types of models mentioned above. In general, the penalized likelihood to be maximized assumes the structure
$L^*(\beta_1, \ldots, \beta_p) = L(\beta_1, \ldots, \beta_p) - \lambda_1 \sum_{j=1}^{p} |\beta_j| - \lambda_2 \sum_{j=1}^{p} \beta_j^2$
where for the lasso, λ2 is set to zero, and for ridge regression λ1 = 0. The norm restriction methods need to optimize the
additional tuning parameters λ1, λ2, or both, by maximizing cross-validated predictive accuracy. The lasso has the general advantage that it forces most parameter estimates to zero, and it tends to assign nonzero parameter estimates to those features that are truly related to clinical outcome (the so-called oracle property). However, it results in less accurate predictions than those obtained by ridge regression. Therefore, the elastic net has been proposed, which combines both approaches at the cost of optimizing two tuning parameters.
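As an illustration of how these penalties are used in practice, the following Python sketch (scikit-learn; simulated data, all dimensions hypothetical) fits an elastic net on a p >> n matrix and tunes the penalty strength and the L1/L2 mix by cross-validated predictive accuracy; an L1 fraction of 1 yields the lasso, while small values approach ridge regression:

import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 100, 5000                        # many more features than samples
X = rng.normal(size=(n, p))             # hypothetical expression matrix
beta = np.zeros(p)
beta[:10] = 1.0                         # only ten features truly relevant
y = X @ beta + rng.normal(size=n)       # continuous clinical outcome

# l1_ratio = 1 is the lasso; values toward 0 approach ridge regression
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("chosen l1_ratio:", model.l1_ratio_, "alpha:", model.alpha_)
print("nonzero coefficients:", int(np.sum(model.coef_ != 0)))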
2.6. Validation Sample Size Calculation

Equally important to estimating the sample size for performing an Omics experiment is the sample size to be considered in validating Omics findings. The imbalance between the several thousands of parameters obtained in one run for each subject in the explorative stage and the relative paucity of hard outcomes makes a selection of a few, i.e., the most promising, candidate parameters (markers) necessary (27). The markers showing the strongest associations with the outcome are usually considered most promising for further investigation. This approach, however, neglects posttranscriptional modification and the interaction/regulation of potent effectors by less expressed parameters. The desired approach is the validation of a set of few markers in an independent set of samples. If such a set is not available, or too cost-intensive, validation must be done internally on the same data set by applying resampling strategies (28, 29). However, the gold standard is still validation in an independent data set.
Sample size calculation for such a validation study may be based on the added value, expressed as additional (pseudo-)R-square, of a biomarker on top of known (e.g., clinical) predictors. We exemplify this by assuming an arbitrary continuous clinical outcome parameter, e.g., urinary protein excretion. Assuming that five known predictors account for 25% of the variance of the outcome parameter, a sample size of 112 will have 80% power to detect, at a significance level of 5%, an increase in R-square of 5 percentage points (i.e., to 30%) due to including the biomarker. This number increases to 230 if the increase in R-square deemed relevant is changed to 2.5 percentage points. The sample size is, however, independent of the number of known predictors, and only to a minor degree dependent on the assumed predictive accuracy of the known predictors. It decreases to 97 if it is assumed that the known predictors already account for 35% of the variance of the outcome parameter. These calculations are based on Gatsonis and Sampson (30) and were performed with the commercial software nQuery Advisor 6.0 (Statistical Solutions Ltd, Cork, Ireland).
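Outside commercial software, such numbers can be approximated via the noncentral F distribution for the test of an increase in R-square. The sketch below, under the stated assumptions (one added biomarker, five known predictors, noncentrality taken as f²·n), reproduces the quoted sample sizes to within a few subjects; the exact method of Gatsonis and Sampson (30) differs slightly:

from scipy.stats import f as f_dist, ncf

def n_for_r2_increase(r2_base, r2_full, k_known, alpha=0.05, power=0.80):
    # Cohen's effect size f^2 for the added predictor
    f2 = (r2_full - r2_base) / (1.0 - r2_full)
    for n in range(k_known + 3, 100000):
        dfn, dfd = 1, n - k_known - 2      # 1 df for the new biomarker
        crit = f_dist.ppf(1 - alpha, dfn, dfd)
        if ncf.sf(crit, dfn, dfd, f2 * n) >= power:
            return n

print(n_for_r2_increase(0.25, 0.300, 5))  # ~112 quoted in the text
print(n_for_r2_increase(0.25, 0.275, 5))  # ~230 quoted in the text
print(n_for_r2_increase(0.35, 0.400, 5))  # ~97 quoted in the text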
Similar calculations could be done for a binary or a time-to-event outcome, but these are not implemented in standard sample size calculation programs, as no algebraic formulas exist for the sample size or power calculations. In these cases, sample sizes have to be estimated by simulating studies.
In transcriptomics experiments, the tags of interest are usually analyzed individually by quantitative PCR; in proteomics and metabolomics, candidates are validated by antibody-based tests such as Luminex technology, ELISA, or western blotting. These technologies finally supply continuous intensity values. The situation is somewhat different in genomics experiments, where the minor allele frequency (MAF) needs to be considered. If the MAF of a polymorphism is low, even assuming a high effect size will not lead to a substantial increase in predictive accuracy, and several thousands of subjects may be needed to validate its effect on a clinical outcome.
3. Methods

3.1. Design of Biomarker Studies
Since ancient times, physicians have used urine to diagnose patients with certain diseases and to study disease processes. A nice example comes from the field of diabetes, where the sweetness of the urine indicated whether a patient had diabetes. Nowadays, many measurements are performed in urine, but serum, plasma, saliva, feces, and other matrices are also used to measure and generate biomarkers for diagnostic and prognostic purposes. However, before new biomarkers can be implemented in daily practice, the validity of the biomarker needs to be tested in carefully designed studies. A potential strategy to discover and validate novel biomarkers consists of three phases, following the general principles for biomarker discovery and validation described by the National Heart, Lung, and Blood Institute (31).
3.2. Discovery Study
The first step of this strategy comprises a discovery study to screen for potentially useful biomarkers using Omics techniques. This screening should optimally occur in high quality samples derived from subjects with a very well established phenotype (for cross-sectional studies) or a precisely defined outcome (for longitudinal studies), versus very well matched controls with characteristics as similar as possible except for the phenotype or outcome of interest. The samples for the discovery phase need to be collected and stored under similar and optimal circumstances. This ensures consistency and guarantees that the potential usefulness of novel biomarkers is not obscured by degradation or other variables related to nonoptimal sampling or storage conditions. The case–control design is an efficient design for the discovery phase (resulting in straightforward group comparisons for the subsequent bioinformatics). The advantage is that this setting
requires a relatively small sample size, is relatively fast and less expensive, and, despite the limited sample size, allows a wide range of biomarkers to be investigated. One has to realize, however, that the sample size needs to be large enough to identify several potential biomarkers with sufficient reliability. In addition, this type of study design is prone to selection bias, and the predictive performance of candidate biomarkers is poorly displayed in a case–control design. Finally, these candidates need not necessarily be (independently) prognostic, but could also play a role in etiology. Therefore, the discovery phase needs to be followed by a validation phase to ascertain the predictive capacity of the candidate biomarkers.

3.3. Validation Study
The validation phase is a clinical study that needs to be carefully designed, including a proper sample size calculation, and usually involves many more subjects than the discovery study (depending on the nature of the outcome variable and its distribution). Subjects participating in the validation study have to be reasonably well phenotyped, while sampling of urine, blood, or other matrices needs to be well defined and occur under optimal conditions. The characteristics or phenotype for the outcome will be less extreme than in the discovery study, and many subjects will be "intermediate". Generally, a validation study will link variation in the putative biomarker(s) to variation in the phenotype, both on a more continuous scale than in the discovery phase. The same Omics techniques may be applied as in the discovery phase, but typically validation of selected biomarkers is done with dedicated assays. The validation study can be designed as a cohort study or as a case–control study. In a cohort study, a study population (cohort) is defined consisting of subjects who are free of the endpoint of interest at baseline and followed over time. At the end of the study, a proportion of subjects will have developed the endpoint of interest. This design facilitates the determination of (baseline) factors that are associated with the development of the outcome. Using this type of design, the predictive capacity of multiple candidate biomarkers can be assessed simultaneously. The cohort for validation can either be an already existing cohort (retrospective validation) or a newly defined cohort (prospective validation). Retrospective validation is more efficient and cost-effective than conducting a prospective cohort study, as the patients have already been followed for a period of time. The predictive power of candidate biomarkers can then be assessed using the already collected baseline samples of all patients. The disadvantage of retrospective validation is that it may be subject to selection bias and that samples for Omics studies may not have been collected under optimal conditions. Prospective validation has the advantage that the population of interest can be
defined upfront, which prevents selection bias. Furthermore, follow-up visits can be conducted at the appropriate time intervals to optimally assess the natural course of the biomarker over time. Like all study designs, prospective cohort studies also have certain weaknesses. They are much more expensive than retrospective cohort studies. In addition, prospective designs are not feasible when the endpoint under investigation is rare and/or takes years to develop. Cohort studies are preferred over case–control studies, as confounding and bias can be prevented more adequately. Cohort studies are, however, relatively inefficient, because it may take a long time before an event occurs, and a relatively large population is needed when studying rare events. If this design is not possible (no suitable cohort, limited financial resources), it is also possible to validate candidate biomarkers with a case–control study design. This type of design again facilitates smaller sample sizes and lower costs, but it is less preferred because it will have the same pitfalls as the discovery case–control study.

3.4. Real Life Simulation
Applying a biomarker in daily practice can be very different from its "behavior" in validation studies. If a biomarker is intended to be applied in an emergency setting, it may easily be outperformed by biomarkers with poorer predictive or diagnostic properties if the assay results for these "poorer" biomarkers are available more quickly and easily. A very promising biomarker may never reach the status of general acceptance if such a requirement is not already fulfilled at its introduction. If a biomarker is intended to be applied in the cardiovascular "arena", it will also be an enormous advantage if the result can already be available at the desk of the physician during the patient's visit. This enables the physician to make a risk assessment during the consultation, which can help in the decision whether or not to start or modify drug prescription. If the results of the biomarker only come in at the next visit, or even visits later, it is unlikely that such a biomarker will become very "popular". Good examples are cholesterol and blood pressure: the patient goes to the laboratory prior to the clinical visit, and because of the speed of the assay, the results can be available at the desk of the physician during the patient's visit some time later the same day. Blood pressure is measured in the office of the physician, and the result is available at the same moment. For new biomarkers, it is wise to perform a "real life simulation" prior to their introduction, to identify potential gaps between the results of validation studies and practical implementation.
3.5. Practical Example
In this paragraph, we illustrate with a practical example how to conduct a series of studies to identify novel biomarkers for detecting progression of early renal disease. The design of such a set of studies may obviously differ for each disease and for each purpose for which the biomarker is intended to be used.
Elevated urinary albumin excretion and serum creatinine are the classical biomarkers of diabetic (and nondiabetic) renal complications. As kidney disease in patients with type 2 diabetes progresses at a slow but relentless rate, clinical studies following patients from the first signs of kidney damage to end-stage renal disease require very long follow-up.
Albuminuria levels can be categorized into normoalbuminuria, microalbuminuria, and macroalbuminuria. Changes between these states of nephropathy are considered a hallmark of progression (and regression) of disease. Studies in nephrology (either cohort studies or randomized controlled trials) therefore use the transition between certain states of nephropathy as an intermediate endpoint, to circumvent the otherwise inappropriately long (and expensive) duration of follow-up.
To identify novel biomarkers for progression of early kidney disease, we conduct a discovery study. This study will be a case–control study including subjects who progressed either from normoalbuminuria to microalbuminuria or from microalbuminuria to macroalbuminuria, with matched controls who remained normoalbuminuric or microalbuminuric, respectively. The aim of the discovery is to identify a broad panel of potential novel biomarkers using different Omics techniques. The most promising biomarkers from the case–control study will subsequently be validated during the validation phase. The validation study will be a retrospective cohort study. The patients selected for the retrospective cohort study will have either normo- or microalbuminuria and will have been followed for at least some years, to ensure that a sufficient number of patients has progressed either from normoalbuminuria to microalbuminuria or from microalbuminuria to macroalbuminuria. Finally, in the third step, we will conduct a prospective validation study. Optimal sampling conditions and careful design of the population for the prospective cohort study will guarantee that the predictive capacity of each biomarker is validated as well as possible.
4. Notes

1. For example, in the prediction of diabetic nephropathy and progression to end-stage renal disease in patients with diabetes, it is unlikely that urinary albumin will be outperformed with respect to ease of use (with urine as sample matrix) and price. So, in the case of diabetic nephropathy, a new biomarker should have substantial additive predictive or diagnostic value.
2. In practice, scientific studies show inherent explorative characteristics, and as a consequence, further parameters to be recorded are frequently added while the study is already
ongoing. If the need for modifications is anticipated, the technical framework has to be chosen adequately.
3. A multicenter solution ideally includes a user administration module that can be handled by the institution administrator. In particular, for studies with long follow-up, changes of standard users are frequent, and handling this remotely through the central study administrator becomes cumbersome.
4. These specifications depend on the specimen to be examined, and hence the quality of the specimen analyzed is another major consideration in any Omics analysis (32). An experiment in the absence of these requirements (a well-defined and strictly controlled platform, and high quality samples) will almost certainly fail. An immediate consequence of noncompliance with these requirements is, due to the high dimensionality of the datasets, the detection of artifactual differences between two datasets that would, in the absence of corrective adjustments, erroneously be attributed to disease. In addition, power calculations cannot be performed, resulting in complete uncertainty about the required number of samples that should be analyzed to reach a defined goal.
5. Data from Omics analyses can frequently be used in multiple experiments, provided that the analytical parameters are identical (SOPs did not change), informed consent was given, and the relevant clinical data are available. As these data can generally not be recovered easily (if at all) at a later stage, it is reasonable to assume that only the data that are delivered together with the sample are available.
Acknowledgements

This work was supported by the European Union FP7 project "SysKid", project number 241544.

References

1. Tanaka, H. (2010) Omics-based medicine and systems pathology. A new perspective for personalized and predictive medicine. Methods Inf Med 16, 173–85.
2. Buyse, M., Loi, S., van't Veer, L., et al. (2006) Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst 98, 1183–92.
3. Zürbig, P., Schiffer, E., and Mischak, H. (2009) Capillary electrophoresis coupled to mass spectrometry for proteomic profiling of human urine and biomarker discovery. Methods Mol Biol 564, 105–21.
4. Illig, T., Gieger, Ch., Zhai, G., Römisch-Margl, W., Wang-Sattler, R., Prehn, C., Altmaier, E., Kastenmüller, G., Kato, B.S., Mewes, H.W., Meitinger, T., Hrabé de Angelis, M., Kronenberg, F., Soranzo, N., Wichmann, H.E., Spector, T.D., Adamski, J., and Suhre, K. (2010) A genome-wide perspective of genetic variation in human metabolism. Nat Genet 42, 137–41.
5. Brinkman, J.W., de Zeeuw, D., Duker, J.J., Gansevoort, R.T., Kema, I.P., Hillege, H.L., et al. (2005) Falsely low urinary albumin concentrations after prolonged frozen storage of urine samples. Clin Chem 51, 2181–83.
6. Brinkman, J.W., de Zeeuw, D., Gansevoort, R.T., Duker, J.J., Kema, I.P., de Jong, P.E., et al. (2007) Prolonged frozen storage of urine reduces the value of albuminuria for mortality prediction. Clin Chem 53, 153–4.
7. Lambers Heerspink, H.J., Nauta, F.L., van der Zee, C.P., Brinkman, J.W., Gansevoort, R.T., de Zeeuw, D., et al. (2009) Alkalinization of urine samples preserves albumin concentrations during prolonged frozen storage in patients with diabetes mellitus. Diabet Med 26, 556–9.
8. Rossing, K., Mischak, H., Dakna, M., Zürbig, P., Novak, J., Julian, B.A., Good, D.M., Coon, J.J., Tarnow, L., and Rossing, P. (2008) Urinary proteomics in diabetes and CKD. J Am Soc Nephrol 19, 1283–90.
9. Jung, S.-H., Bang, H., and Young, S. (2005) Sample size calculation for multiple testing in microarray analysis. Biostatistics 6, 157–69.
10. Sitek, B., Potthoff, S., Schulenborg, T., Stegbauer, J., Vinke, T., Rump, L.C., Meyer, H.E., Vonend, O., and Stuhler, K. (2006) Novel approaches to analyse glomerular proteins from smallest scale murine and human samples using DIGE saturation labelling. Proteomics 6, 4337–45.
11. Mischak, H., Coon, J.J., Novak, J., Weissinger, E.M., Schanstra, J.P., and Dominiczak, A. (2009) Capillary electrophoresis-mass spectrometry as a powerful tool in biomarker discovery and clinical diagnosis: An update of recent developments. Mass Spectrom Rev 28, 703–24.
12. Rai, A.J., Gelfand, C.A., Haywood, B.C., Warunek, D.J., Yi, J., Schuchard, M.D., Mehigh, R.J., Cockrill, S.L., Scott, G.B., Tammen, H., Schulz-Knappe, P., Speicher, D.W., Vitzthum, F., Haab, B.B., Siest, G., and Chan, D.W. (2005) HUPO Plasma Proteome Project specimen collection and handling: Towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–77.
13. Shen, Y., Kim, J., Strittmatter, E.F., Jacobs, J.M., Camp, D.G., Fang, R., Tolie, N., Moore, R.J., and Smith, R.D. (2005) Characterization of the human blood plasma proteome. Proteomics 5, 4034–45.
14. Righetti, P.G., and Boschetti, E. (2008) The ProteoMiner and the FortyNiners: Searching for gold nuggets in the proteomic arena. Mass Spectrom Rev 27, 596–608.
15. Mischak, H., Kolch, W., Aivalotis, M., Bouyssie, D., Court, M., Dihazi, H., Dihazi, G.H., Franke, J., Garin, J., Gonzales de Peredo, A., Iphöfer, A., Jansch, L., Lacroix, C., Makridakis, M., Masselon, C., Metzger, J., Monsarrat, B., Mrug, M., Norling, M., Novak, J., Pich, A., Pitt, A., Bongcam-Rudloff, E., Siwy, J., Suzuki, H., Thongboonkerd, V., Wang, L., Zoidakis, J., Zurbig, P., Schanstra, J., and Vlahou, A. (2010) Comprehensive human urine standards for comparability and standardization in clinical proteome analysis. Proteomics Clin Appl 4, 464–78.
16. Dettmer, K., Aronov, P.A., and Hammock, B.D. (2007) Mass spectrometry-based metabolomics. Mass Spectrom Rev 26, 51–78.
17. Ramautar, R., Somsen, G.W., and de Jong, G.J. (2009) CE-MS in metabolomics. Electrophoresis 30, 276–91.
18. Mittlböck, M., and Schemper, M. (1996) Explained variation for logistic regression. Stat Med 15, 1987–97.
19. Schemper, M., and Henderson, R. (2000) Predictive accuracy and explained variation in Cox regression. Biometrics 56, 249–55.
20. Heinze, G., and Schemper, M. (2003) Comparing the importance of prognostic factors in Cox and logistic regression using SAS. Comput Methods Programs Biomed 71, 1455–63.
21. Dunkler, D., Michiels, S., and Schemper, M. (2007) Gene expression profiling: Does it add predictive accuracy to clinical characteristics in cancer prognosis? Eur J Cancer 43, 745–51.
22. Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc B 58, 267–88.
23. Tibshirani, R. (1997) The lasso method for variable selection in the Cox model. Stat Med 16, 385–95.
24. le Cessie, S., and van Houwelingen, H.C. (1992) Ridge estimators in logistic regression. Appl Stat 41, 191–201.
25. Verweij, P.J.M., and van Houwelingen, H.C. (1994) Penalized likelihood in Cox regression. Stat Med 13, 2427–36.
26. Zou, H., and Hastie, T. (2005) Regularization and variable selection via the elastic net. J Royal Stat Soc B 67, 301–20.
27. Berrar, D., Bradbury, I., and Dubitzky, W. (2006) Avoiding model selection bias in small-sample genomic datasets. Bioinformatics 15, 1245–50.
28. Lusa, L., McShane, L.M., Radmacher, M.D., Shih, J.H., Wright, G.W., and Simon, R. (2007) Appropriateness of some resampling-based inference procedures for assessing performance of prognostic classifiers derived from microarray data. Stat Med 28, 1102–13.
29. Jiang, W., Varma, S., and Simon, R. (2008) Calculating confidence intervals for prediction error in microarray classification using resampling. Stat Appl Genet Mol Biol 7, 8.
30. Gatsonis, C., and Sampson, A.R. (1989) Multiple correlation: Exact power and sample size calculations. Psychol Bull 106, 516–24.
31. Granger, C.B., Van Eyk, J.E., Mockrin, S.C., and Anderson, N.L. (2004) National Heart, Lung, and Blood Institute Clinical Proteomics Working Group report. Circulation 109, 1697–703.
32. Mischak, H., Apweiler, R., Banks, R.E., Conaway, M., Coon, J.J., Dominizak, A., Ehrich, J.H., Fliser, D., Girolami, M., Hermjakob, H., Hochstrasser, D.F., Jankowski, V., Julian, B.A., Kolch, W., Massy, Z., Neususs, C., Novak, J., Peter, K., Rossing, K., Schanstra, J.P., Semmes, O.J., Theodorescu, D., Thongboonkerd, V., Weissinger, E.M., Van Eyk, J.E., and Yamamoto, T. (2007) Clinical proteomics: A need to define the field and to begin to set adequate standards. Proteomics Clin Appl 1, 148–56.
Chapter 23

Omics-Based Identification of Pathophysiological Processes

Hiroshi Tanaka and Soichi Ogishima

Abstract

Owing to the growing knowledge about the cellular molecular network and its alterations in diseases, most diseases have come to be considered as "systems distortions of the cellular molecular network". This view of diseases, which we call "systems pathology", has brought about a new usage of disease Omics, that is, to identify the altered molecular network underlying the disease. In this chapter, we discuss technologies and clinical applications for Omics-based identification of pathophysiological processes. In doing so, we classify the methods into two classes: one is the "data-inductive approach", which infers gene regulatory and transcriptional networks from DNA microarray gene expression data, and the other is the "knowledge-referenced approach", which combines the differentially expressed genes from gene expression profiles with existing protein interaction networks or literature-curated pathways. Several typical methods, such as ARACNe and eQTL, are described together with their recent clinical applications.

Key words: DNA microarray, Reverse engineering, Disease pathway, Systems pathology, Disease Omics
1. Introduction

On account of recent progress in high-throughput technologies, such as various types of DNA microarrays (1–3) and mass spectrometers (4, 5), post-genomic Omics information, or simply "Omics" data, has become available in the clinical context. Many studies have revealed that Omics information observed in the diseased state ("disease Omics") provides comprehensive and substantial information on the ongoing disease process, so that it can contribute to actual clinical medicine, bringing about more exact diagnosis and prognosis of diseases.
In using disease Omics for clinical medicine, Omics data were first utilized by directly associating them with clinical phenotypes (clinical outcomes or diagnostic classifications). As a typical example, efficient sets of genes called "signatures" have been determined from the gene expression profiles of diseased cells in order to predict disease prognosis, such as the recurrence of cancer within several years after surgery (6, 7).
Owing to the rapidly growing knowledge about the cellular molecular network and its alterations in diseases, most diseases, except for rare monogenic diseases, have come to be considered as caused by a "systems distortion of the cellular molecular network", due to the interrelated malfunction of genes and proteins. This view of diseases, which we call system-level understanding of diseases, or "systems pathology" (8), has brought about a new usage of disease Omics, that is, to identify the altered molecular network underlying the disease. However, on account of the high interdependency of disease Omics data, identification of a disease-associated pathway solely on the basis of Omics data is frequently difficult. Hence, in recent years, various attempts have been proposed to jointly use existing knowledge about molecular networks (signaling, transcriptional, or protein–protein interaction networks) in the identification of disease-associated pathways from observed Omics data.
In this chapter, we survey the major methods for Omics-based identification of pathophysiological processes. This methodology is of central importance in systems pathology or, more widely, Omics-based systems medicine, which is expected to bring about more personalized, predictive, and preventive medical care (8).
2. Materials

In this section, we introduce the various materials and technologies employed for the Omics-based identification of pathophysiological processes. These concepts mostly aim to reconstruct gene regulatory and transcriptional networks from gene expression profiles. There are mainly two classes of approaches: one is a completely empirical approach using solely the observed Omics data, which we call the "data-inductive approach"; the other is based on using existing knowledge about molecular pathways in conjunction with Omics data, which we call the "knowledge-referenced approach". In the former approach, we first assume a mathematical formalism (differential equations, Bayesian networks, etc.) and then infer gene regulatory and transcriptional networks in this formalism solely from gene expression data, as derived from microarrays or from mRNA sequencing enabled by the new generation of sequencers. In the latter approach, we combine the differentially expressed genes (DEG) obtained from the expression profile
analysis with existing protein interaction networks or literature-curated pathways, to determine the networks or pathways most influential for the observed gene expression.

2.1. Data-Inductive Identification of Gene Regulatory Networks
Data-inductive reconstruction of gene regulatory networks has been well studied since immediately after the invention of DNA microarray technologies in the mid-1990s (1, 2). The procedure is called "reverse engineering" of a gene network from gene expression data. The simplest model of gene regulatory networks is a Boolean network, in which the network connections are described by Boolean functions; REVEAL was the first inference algorithm developed for Boolean networks (9). To date, various inference methods have been proposed, using simultaneous ordinary differential equations (ODEs), mutual information, and Bayesian networks (10).
2.1.1. ODE Based Reconstruction
A gene regulatory network is generically described by simultaneous ODEs as follows:
$\frac{dx_i(t)}{dt} = \sum_j w_{ij} x_j + b_i - \lambda_i x_i(t)$    (1)
where xi is the gene expression value of gene i, wij is a weight matrix quantifying the effect of gene j on gene i, bi is the basal gene expression of gene i, and λi is a decay constant for gene i. Inference of a gene regulatory network is then equivalent to inference of the weight matrix wij. However, there are difficulties in inferring a weight matrix from time-series mRNA expression data. Algorithms have to estimate the rates of change of the transcripts (dx/dt), but this is very difficult because calculating the derivative can amplify the measurement errors contained in the data. Gardner et al. proposed an algorithm named NIR (network identification by multiple regression) (see Note 1) for inference of gene regulatory networks, based on the first-order ODE model defined below (11):
$\frac{dx_i(t)}{dt} = \sum_{j=1}^{N} w_{ij} x_j + u_i$    (2)
where ui represents an external perturbation to the rate of accumulation of xi. To eliminate error amplification, the authors imposed the steady-state assumption (dx/dt = 0) and reduced Eq. 2 to
$\sum_{j=1}^{N} w_{ij} x_j = -u_i$    (3)
According to Eq. 3, the weight matrix w can be solved for from mRNA expression data of N distinct perturbation experiments by multiple linear regression.
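The regression step can be sketched in a few lines of Python (simulated data; unlike NIR proper, no limit on the number of regulators per gene is imposed):

import numpy as np

rng = np.random.default_rng(1)
n_genes = 9                                  # as in the E. coli SOS example
W_true = rng.normal(scale=0.5, size=(n_genes, n_genes))
W_true -= 2 * np.eye(n_genes)                # self-decay keeps W invertible
U = np.eye(n_genes)                          # one known perturbation per run
X = np.linalg.solve(W_true, -U)              # steady states satisfy W x = -u

# Recover each row w_i of W by regressing -u_i on the expression profiles
W_hat = np.empty_like(W_true)
for i in range(n_genes):
    W_hat[i], *_ = np.linalg.lstsq(X.T, -U[i], rcond=None)

print("max abs error:", np.abs(W_hat - W_true).max())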
2.1.2. Mutual Information Based Reconstruction
Pearson correlation coefficients (PCC) can be calculated among genes from their expression data. By extracting correlated pairs of genes with a PCC exceeding a certain threshold, an undirected gene regulatory network can be obtained. However, a gene regulatory network based on PCC may contain a considerable number of apparent (false positive) connections between nodes. Suppose genes X, Y, and Z have the relationships X → Y → Z or X ← Y → Z; the PCC between genes X and Z will then be high, producing a false positive connection between these nodes through an indirect relationship. The use of partial correlation coefficients eliminates these indirect effects to some extent and can be applied to the inference of gene regulatory networks (Gaussian graphical models) (12). Furthermore, the PCC indicates the strength of a linear relationship between two gene expression values, but the relations between the expression of genes are not necessarily linear. Thus, PCC is not ideally suited for the reconstruction of gene regulatory networks. Mutual information does not assume linear relationships and has therefore been widely used to reconstruct gene regulatory networks. Basso et al. proposed an algorithm named ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) (see Note 2) for inference of gene regulatory networks based on mutual information, defined as (13):

$I(X, Y) = S(X) + S(Y) - S(X, Y)$    (4)
where S(X) is the entropy of the expression value of gene X, and S(X,Y) is the joint entropy of the expression values of genes X and Y. ARACNe evaluates the p value associated with a given mutual information under the null hypothesis by Monte Carlo simulation. The null hypothesis corresponds to pairs of nodes that are disconnected from the network and from each other. By filtering pairs of genes at a certain p value threshold, ARACNe eliminates pairs with low mutual information. However, the retained pairs of genes may still contain false positives (indirect interactions). To discard indirect interactions, ARACNe applies the data processing inequality (DPI). The DPI states that if X → Y → Z, then I(X,Y) > I(X,Z). Thus, the DPI provides a quantitative measure to evaluate indirect interactions. Consider a pair of genes X and Y (X–Y), and consider a path through some other gene Z between them (X–Z–Y). If I(X,Y) < min(I(X,Z), I(Z,Y)), then ARACNe removes the pair X–Y as an indirect interaction, according to the DPI.
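A toy version of the two core steps, pairwise mutual information on binned expression values followed by DPI pruning, is sketched below; ARACNe itself uses kernel density estimators and permutation-derived thresholds rather than this simple histogram estimate:

import itertools
import numpy as np
from sklearn.metrics import mutual_info_score

def pairwise_mi(expr, bins=8):
    # expr: genes x samples matrix; MI from binned expression values
    binned = np.array([np.digitize(g, np.histogram_bin_edges(g, bins))
                       for g in expr])
    n = len(expr)
    mi = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        mi[i, j] = mi[j, i] = mutual_info_score(binned[i], binned[j])
    return mi

def dpi_prune(mi, eps=0.0):
    # Drop edge X-Y if a path X-Z-Y explains it: I(X,Y) < min(I(X,Z), I(Z,Y))
    keep = mi > 0
    n = len(mi)
    for i, j in itertools.combinations(range(n), 2):
        for k in range(n):
            if k not in (i, j) and \
               mi[i, j] < min(mi[i, k], mi[k, j]) * (1 - eps):
                keep[i, j] = keep[j, i] = False
                break
    return keep

expr = np.random.default_rng(4).normal(size=(5, 200))  # 5 genes, 200 samples
print(dpi_prune(pairwise_mi(expr)))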
2.1.3. Bayesian Network Inference

A gene regulatory network is also generically represented by a Bayesian network as follows:
$P(G_1, G_2, \ldots, G_N) = \prod_i P(G_i \mid \pi(G_i))$
where Gi is gene i, P(Gi) is the probability of expression of gene i, P(G1, G2, …, GN) is the joint probability distribution, and π(Gi) denotes the parent genes of gene Gi. A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG). The Bayesian approach enables us to consider relationships among multiple variables simultaneously. To reconstruct the Bayesian network, a learning algorithm has to be applied to determine an optimal gene regulatory network B under a given mRNA profile set D. According to Bayes' theorem, the posterior probability of a graph given the expression data is evaluated as follows:
$P(B_i \mid D) = \frac{P(D \mid B_i)\,P(B_i)}{\sum_i P(D \mid B_i)\,P(B_i)} \cong \frac{P(D \mid B_i)\,P(B_i)}{P(D)}$
Basically, P (D|Bi) is difficult to calculate, but Cooper et al. provided methodologies for calculating, by utilizing some assumptions (14). Finally, P (Bi|D), the estimation function for a Bayesian model, can be calculated for selection of the best Bayesian network for given data. The most advantageous point of the Bayesian network is that it can provide the direction of causative flow among the genes. In contrast to inference based on mutual information where indirect relationships were discarded by conducting DPI, Bayesian network inference constructs the most concise model, which automatically excludes arcs based on dependencies already explained by the model. Consider the signaling cascade from X → Y → Z, where correlation exists between the measured activities of each pair. Despite the correlation between X and Z, the arc between X and Z is omitted, because the X-Y and the Y-Z relationships explain the X-Z correlation. 2.2. KnowledgeReferenced Approach for Gene Expression Data on Networks
2.2. Knowledge-Referenced Approach for Gene Expression Data on Networks

The knowledge-referenced approach to identifying molecular networks based on gene expression data mainly superimposes gene expression data, here with a special focus on DEG, onto existing protein interaction networks and literature-curated pathways. The former is called CGI (combining gene expression and protein interaction data), and the latter corresponds to ORA (over-representation analysis) and GSEA (gene set enrichment analysis). ORA and GSEA can detect hallmark pathways and display DEG on these pathways. ORA is an over-representation analysis in which the ratio of DEG contained in a given gene set (a certain pathway) is compared with the ratio of DEG among the genes outside this set (pathway), in order to find DEG-enriched pathways. ORA is implemented in major software packages for processing microarray data, such as DAVID, GOstats
(R/Bioconductor), and GeneSpring. DAVID is a web application for ORA analysis based on Fisher's exact test, and provides a web interface to explore DEGs on over-represented pathways (15) (see Note 3).
GSEA, on the other hand, deals with genome-wide expression profiles belonging to two classes, labeled 1 or 2. Genes are ranked (in the ranked list L) based on the strength of the correlation between their expression and the class distinction (16). Given an a priori defined set of genes S, which could be genes belonging to a certain pathway or sharing the same Gene Ontology (GO) category, the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or concentrated at the top or bottom of the ranking. We calculate an enrichment score (ES) that reflects the degree to which the set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter a gene not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The ES is defined as the peak or trough of the running-sum curve along the ranked list L, that is, the maximum deviation from zero encountered in the random walk. It corresponds to a weighted Kolmogorov–Smirnov-like statistic. The ES is then normalized for each gene set to account for the size of the set. The gene sets (pathways) with high ES best explain the gene expression pattern. GSEA also conducts a permutation test to estimate the significance level of the ES (see Note 4).
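The running-sum statistic itself takes only a few lines; below is a sketch with hypothetical gene names and correlation values, using the weighting exponent p = 1 of the weighted form of the statistic:

import numpy as np

def enrichment_score(ranked_genes, corr, gene_set, p=1.0):
    # ranked_genes: ordered by correlation with the class distinction
    in_set = np.array([g in gene_set for g in ranked_genes])
    w = np.abs(np.asarray(corr, dtype=float)) ** p
    hit = np.where(in_set, w / w[in_set].sum(), 0.0)       # increments
    miss = np.where(~in_set, 1.0 / (~in_set).sum(), 0.0)   # decrements
    running = np.cumsum(hit - miss)
    return running[np.argmax(np.abs(running))]             # max deviation

genes = ["g%d" % i for i in range(100)]
corr = np.linspace(1, -1, 100)            # list L, already ranked
S = {"g1", "g3", "g5", "g7", "g9"}        # set concentrated at the top
print(enrichment_score(genes, corr, S))   # large positive ES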
2.3. Other Methods Related to the Estimation of Pathophysiological Processes

The methods described above identify the pathophysiological process in the form of molecular networks. There are, however, other methods whose results are not in the form of a molecular network, but which provide important causative aspects of the pathophysiological mechanism of diseases.
2.3.1. Promoter Analysis for Reconstruction of Transcriptional Networks
Promoter analysis is a search for transcription factor (TF) binding sites (TFBSs) in promoter sequences. It utilizes not only mRNA expression data, but also information on promoters and transcription factor binding sequences. TRANSFAC and JASPAR are well-known databases of TFBSs (see Note 5). By employing promoter analysis on regulons, candidate regulating TFs can be listed. The main problem of promoter analysis, however, is that it generically yields too many candidate regulating factors (false positives), so experimental validation is inevitable.
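At its core, such a search scores a position weight matrix (e.g., built from TRANSFAC or JASPAR counts) against every window of a promoter sequence; a minimal sketch with a hypothetical 4-bp motif:

import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def scan(promoter, pwm, threshold):
    # pwm: 4 x L log-odds matrix (rows A, C, G, T); report scoring windows
    L = pwm.shape[1]
    hits = []
    for i in range(len(promoter) - L + 1):
        score = sum(pwm[BASES[b], j] for j, b in enumerate(promoter[i:i + L]))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits

# Hypothetical motif "CACG": position-specific base probabilities
probs = np.array([[.05, .85, .05, .05],   # position 0 prefers C
                  [.85, .05, .05, .05],   # position 1 prefers A
                  [.05, .85, .05, .05],   # position 2 prefers C
                  [.05, .05, .85, .05]])  # position 3 prefers G
pwm = np.log2(probs.T / 0.25)             # log-odds vs. uniform background
print(scan("TTCACGGTACACGTT", pwm, threshold=5.0))  # hits at positions 2 and 9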
We developed an integrated promoter analysis that utilizes not only gene expression data and promoter sequences, but also (1) evolutionary conservation information from interspecies genetic comparisons and (2) the expected function of the explored gene. We applied this method to reveal a transcriptional network regulating the development of mouse taste cells, identified only one candidate gene, Hes1, as a common regulatory factor in silico, and examined its function in vivo (17). We revealed that Hes1 plays a role in orchestrating taste cell differentiation in developing taste buds.
To examine transcriptional networks reconstructed by promoter analysis, sophisticated visualization of the reconstructed networks is indispensable. We developed the 3D network visualization software BioCichlid (18) (see Note 6), which handles time-course microarray data on molecular networks and visualizes both the physical (protein) and genetic (regulatory) network layers. Transcriptional regulations are represented as bridges between the physical network (transcription factors) and genetic network (regulated genes) layers, thus integrating promoter analysis into the pathway mapping.
2.3.2. Expression Quantitative Trait Loci

Expression quantitative trait loci (eQTL) mapping is another approach to determine which genomic regions help to regulate transcription of genes, and how polymorphisms (genetic variation) within these regions, such as SNPs, affect gene regulation (19). In such studies, gene expression levels are viewed as quantitative traits (continuous phenotypes), and the genetic basis of gene expression phenotypes is mapped to particular genomic loci by well-established linkage or association methods from statistical genetics. The eQTL approach can relate a gene expression profile to a restricted number of regulating genes, which are, in most cases, cis-regulating transcription factors.
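In the association setting, each eQTL test essentially reduces to regressing a transcript's expression level on allele dosage across individuals; a simulated sketch:

import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
n_ind, n_snps = 500, 200
geno = rng.integers(0, 3, size=(n_snps, n_ind))   # allele dosages 0/1/2
expr = 0.8 * geno[42] + rng.normal(size=n_ind)    # SNP 42: true cis-eQTL

pvals = np.array([linregress(g, expr).pvalue for g in geno])
print("best SNP:", int(pvals.argmin()), "p =", pvals.min())  # expect SNP 42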
3. Methods

3.1. Application Examples in the Medical Context
All methods described above have been applied in the medical context (see Note 7). The NIR algorithm was first applied to the reconstruction of a nine-gene subnetwork of the SOS pathway in Escherichia coli, which regulates cell survival and repair after DNA damage (11). The recA and lexA genes were correctly identified as the major regulatory genes, and the recA gene was also identified as a direct target of the compound mitomycin C (an external perturbation) after elimination of indirect effects based on the inferred model. This approach provides a framework for revealing gene regulatory networks and, as seen in this application, for identifying both direct molecular targets of pharmacological compounds and dysregulated genes in the diseased state.
The ARACNe algorithm was first applied to the reconstruction of regulatory networks of human B cells, which were shown to form a hierarchical, scale-free network, in which a few hubs account for most of the interactions (20). Validation of the network against
available data led to the identification of the MYC gene as a major hub, which regulates a network comprising known target genes as well as novel ones (the ZRF1, BYSL, and RSL1D1 genes), as further biochemically validated by ChIP assays.
The Bayesian network inference algorithm developed by Friedman et al. was first applied to microarray data of Saccharomyces cerevisiae; they reconstructed a cell-cycle gene regulatory network of about 800 genes using this inference approach (10).
The CGI approach was applied by Chuang et al. to the gene expression profiles of metastatic and nonmetastatic breast cancer patients, to identify subnetworks whose expression levels correlate with the prognosis of metastasis (21). We also superimposed the gene expression profile of surgical samples of liver cancer onto BIND (the Biomolecular Interaction Network Database), in which protein–protein interactions are combined with gene (DNA)–protein (transcription factor) interactions, to identify the causative pathway for recurrence of hepatocellular carcinoma after surgery (22).
ORA and GSEA have been utilized in a variety of clinical Omics studies. For example, as a test similar to GSEA, Kolmogorov–Smirnov analysis was used to clarify the mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer (23).
As for the application of eQTL, Göring et al. obtained gene expression profiles of lymphocyte samples from 1,240 subjects, in which eQTL analysis uncovered 41,000 cis-regulated transcripts at a false discovery rate (FDR) of 0.05. They selected high-density lipoprotein cholesterol concentration as a phenotype of clinical importance and identified the cis-regulated vanin 1 (VNN1) gene as harboring sequence variants that influence high-density lipoprotein cholesterol concentrations (24).

3.2. Example of Integrated Omics-Based Disease Pathway Identification
Carro et al. inferred transcriptional networks regulating transitions into the "mesenchymal" state in human malignant gliomas (25). The mesenchymal gene expression signature (MGES) is the hallmark of tumor aggressiveness in HGG (high-grade gliomas). They applied the ARACNe algorithm to infer a genome-wide repertoire of HGG-specific transcriptional interactions (the HGG-interactome) from 176 gene expression profiles of grade III and grade IV samples previously classified into three molecular signature groups: proneural, proliferative, and mesenchymal. The group identified 53 MGES-specific TFs by master regulator analysis, which computes the statistical significance of the overlap between the regulon of each TF (ARACNe-inferred targets) and the MGES genes (Fisher's exact test). The top six TFs (STAT3, the C/EBP complex, bHLH-B2, RUNX1, FOSL2, and ZNF238) collectively controlled 74% of the MGES genes. Furthermore, they conducted promoter occupancy analysis and then showed
their transcriptional network. C/EBP and STAT3 occupy their own promoters and sit at the top of the hierarchical regulatory modules of the identified transcriptional network. Interestingly, they exhibit autoregulatory loops and form feed-forward loops with a larger fraction of MGES genes (43%) than any other TF pair. C/EBP and STAT3 thus act as synergistic initiators and master regulators that initiate and maintain mesenchymal transformation. Human physiology allows categorization of about 210 distinct cell types, and if we view the transcriptional regulatory network as a dynamical system, this network can be considered to have 210 "stable solutions," each corresponding to a cell type. The mesenchymal phenotype is one such stable, sustainable cellular state, and the transition to it is realized by a feed-forward loop of the dynamical transcriptional network regulated by C/EBP and STAT3.
4. Notes
1. The NIR algorithm was implemented in MATLAB, and the software is distributed for noncommercial and academic use at http://dibernardo.tigem.it/Website/NIR. Commercial users are required to contact Diego di Bernardo ([email protected]).
2. ARACNe is implemented as a Java routine and distributed at http://amdec-bioinfo.cu-genome.org/html/ARACNE.htm. MS Windows, Linux, and Mac OS PPC executables are available.
3. DAVID is a web application for conducting ORA by Fisher's exact test and is available at http://david.abcc.ncifcrf.gov.
4. GSEA software is freely available at http://www.broadinstitute.org/gsea for individuals in both academia and industry for internal research purposes.
5. TRANSFAC is a commercial database holding TFBSs and is located at http://www.biobase-international.com/pages/index.php?id=transfac. JASPAR is a freely accessible catalogue of TFBSs at http://jaspar.cgb.ki.se.
6. BioCichlid is a client–server software built on the Cichlid software and is available at http://newton.tmd.ac.jp/. Both source code and executables for Windows and Mac OS X are available.
7. Several points have to be taken care of when applying the methods to real data. The computational feasibility of the methods is one of the important practical issues. ARACNe is
a computationally tractable method and can be run on a standard PC. The most time-consuming step is calculating the mutual information between all pairs of genes. This calculation has O(n^2 m^2) complexity, where n and m are the numbers of genes and samples, respectively (26). For example, for 1,000 genes and 100 gene expression profiles, ARACNe took about 2,000 s on a computer equipped with a Pentium 4 processor. As for Boolean networks, in principle all 2^n possible state transitions should be examined to infer a set of Boolean rule tables, but REVEAL takes a practical approach in which the number of inputs (k) to each gene from other genes is first assumed to be 1 and then increased incrementally. The practically computable scale of gene regulatory networks is therefore smaller than for ARACNe, but if k is less than 3 for every gene, the correct rule tables are found quickly even for n = 50. The most challenging method is Bayesian networks. The search space for maximum a posteriori (MAP) estimation of a gene regulatory network of n genes is huge, with the approximate size (27)

$$c_n = \frac{n! \cdot 2^{n(n-1)/2}}{r \cdot z^n},$$

where r ≈ 0.57436 and z ≈ 1.4881. Ott et al. improved the algorithm and reconstructed a gene network of 20 genes within about 50 h using a Sun Fire 15K supercomputer with 96 CPUs (28). Therefore, at the present time, only the reconstruction of small gene regulatory networks is practically feasible. The NIR algorithm can easily be implemented and applied to larger networks as well; however, it requires a large number of experiments (gene expression profiles), obtained by applying a sufficient number of different perturbation signals to the network.
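To see why this search space is prohibitive, the approximate count c_n can be evaluated directly; the short sketch below uses the constants quoted above and shows the super-exponential growth:

```python
from math import factorial

R, Z = 0.57436, 1.4881  # constants of the asymptotic formula above

def c_n(n: int) -> float:
    """Approximate size of the DAG search space for n genes."""
    return factorial(n) * 2 ** (n * (n - 1) / 2) / (R * Z ** n)

for n in (5, 10, 20):
    print(n, f"{c_n(n):.3e}")  # grows super-exponentially in n
```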
References
1. Schena M, Shalon D, Davis RW, and Brown PO. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–70.
2. Fodor SP, Rava RP, Huang XC, Pease AC, Holmes CP, and Adams CL. (1993) Multiplexed biochemical assays with biological chips. Nature 364, 555–6.
3. Xing Y, Kapur K, and Wong WH. (2006) Probe selection and expression index computation of Affymetrix Exon Arrays. PLoS ONE 1, e88.
4. Tanaka K, Waki H, Ido Y, Akita S, Yoshida Y, and Yoshida T. (1988) Protein and polymer analyses up to m/z 100000 by laser ionization time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 2, 151–3.
5. Hutchens TW, and Yip TT. (1993) New desorption strategies for the mass spectrometric analysis of macromolecules. Rapid Commun Mass Spectrom 7, 576–80.
6. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, and Lander ES. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–7.
7. Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, and Botstein D. (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci USA 100, 8418–23.
8. Tanaka H. (2010) Omics-based medicine and systems pathology. Meth Informat Med 49, 173–185.
9. Liang S, Fuhrman S, and Somogyi R. (1998) Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 3, 18–29.
10. Friedman N, Linial M, Nachman I, and Pe'er D. (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7, 601–20.
11. Gardner TS, di Bernardo D, Lorenz D, and Collins JJ. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–5.
12. Edwards DG. (2000) Introduction to Graphical Modelling. Springer Verlag, Heidelberg.
13. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, and Califano A. (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37, 382–90.
14. Cooper GF, and Herskovits E. (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9, 309–47.
15. Huang DW, Sherman BT, and Lempicki RA. (2009) Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc 4, 44–57.
16. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, and Mesirov JP. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102, 15545–50.
17. Ota MS, Kaneko Y, Kondo K, Ogishima S, Tanaka H, Eto K, and Kondo T. (2009) Combined in silico and in vivo analyses reveal role of Hes1 in taste cell differentiation. PLoS Genet 5, e1000443.
18. Ishiwata RR, Morioka MS, Ogishima S, and Tanaka H. (2009) BioCichlid: central dogma-based 3D visualization system of time-course microarray data on a hierarchical biological network. Bioinformatics 25, 543–4.
19. Gilad Y, Rifkin SA, and Pritchard JK. (2008) Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet 24, 408–15.
20. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, and Califano A. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7, S7.
21. Chuang HY, Lee E, Liu YT, Lee D, and Ideker T. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3, 140.
22. Tanaka S, Mogushi K, Yasen M, Noguchi N, Kudo A, Kurokawa T, Nakamura N, Inazawa J, Tanaka H, and Arii S. (2009) Surgical contribution to recurrence-free survival in patients with macrovascular-invasion-negative hepatocellular carcinoma. J Am Coll Surg 208, 368–74.
23. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, and Ewen ME. (2003) A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Cell 114, 323–34.
24. Goring HH, Curran JE, Johnson MP, Dyer TD, Charlesworth J, Cole SA, Jowett JB, Abraham LJ, Rainwater DL, Comuzzie AG, Mahaney MC, Almasy L, MacCluer JW, Kissebah AH, Collier GR, Moses EK, and Blangero J. (2007) Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet 39, 1208–16.
25. Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F, Colman H, Lasorella A, Aldape K, Califano A, and Iavarone A. (2010) The transcriptional network for mesenchymal transformation of brain tumours. Nature 463, 318–25.
26. Margolin AA, Wang K, Lim WK, Kustagi M, Nemenman I, and Califano A. (2006) Reverse engineering cellular networks. Nat Protoc 1, 662–71.
27. Robinson RW. (1973) Counting labeled acyclic digraphs, in "New Directions in the Theory of Graphs", F. Harary ed., Academic Press, New York.
28. Ott S, Imoto S, and Miyano S. (2004) Finding optimal models for small gene networks. Pac Symp Biocomput 9, 557–67.
Chapter 24 Data Mining Methods in Omics-Based Biomarker Discovery Fan Zhang and Jake Y. Chen Abstract The advent of Omics technologies such as genomics and proteomics has brought the hope of discovering novel biomarkers that can be used to diagnose, predict, and monitor the progress of disease. The importance of data mining for identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray and proteomic data has been increasingly recognized. We present an overview of general data mining methods and their applications to biomarker discovery, with particular focus on genomics and proteomics data. Two exemplary case studies are presented, and relevant data mining terminology and techniques are explained. Key words: Data mining, Genomics, Proteomics, Biomarker discovery, Mass spectrometry
1. Introduction Massive amounts of available data in genomics and proteomics have provided a unique opportunity for researchers to utilize data mining methods for diagnosing and classifying disease, to better understand and define, e.g., specific tumors and their development, to identify potential biomarkers, and to provide targets for therapy. 1.1. Data Mining
Data mining is the process of analyzing data to extract patterns, associations, and relationships; it has also been called exploratory data analysis, among other things. Masses of biological data generated from microarrays, mass spectrometry, SNP genotyping, etc., are explored, analyzed, reduced, and reused. Searches are performed across different data mining models proposed for identifying biomarkers and predicting cancer risk and treatment response. Classical statistical approaches are fundamental to data mining.
1.2. Biomarkers
A biomarker, as defined by the National Cancer Institute, is "a biological molecule found in blood, other body fluids, or tissues that is a sign of a normal or abnormal process, or of a condition or disease." For example, a biomarker may be used to measure the efficacy of a treatment. A biomarker is also called a "molecular marker" or "signature molecule." From a clinical perspective, biomarkers may serve a variety of functions (Table 1), which correspond to different stages of a disease (1), such as the progression of cancer or cardiovascular disease. Biomarkers can be used to detect and subsequently treat early-stage (pre)cancers (screening biomarkers), to definitively establish the presence of cancer in those who are suspected to have the disease (diagnostic biomarkers), or to portend disease outcome at the time of diagnosis, without reference to any specific therapy, for those with overt disease (prognostic biomarkers) for whom therapy may or may not have been initiated. Biomarkers can also be used to predict the outcome of a particular therapy (predictive biomarkers), or to measure response to treatment and to detect disease progression or relapse early (monitoring biomarkers) (2). This chapter covers basic concepts of data mining and applications for identifying biomarkers on the basis of Omics data. Following this introduction, a description of the data mining process is provided, followed by a presentation of data mining methods and tools. We complement this with a section on a series of successful applications of data mining in genomics and proteomics, specifically focusing on the value of these analyses for discovering biomarkers. Two case studies are provided to demonstrate how to identify biomarkers utilizing data mining methods and tools.
Table 1 Rationale and objectives for the use of biomarkers, exemplified on cancer

Type of biomarker   Objective for use
Screening           To detect and treat early-stage (pre)cancers
Diagnostic          To definitively establish the presence of cancer
Prognostic          To portend disease outcome at the time of diagnosis without reference to any specific therapy
Predictive          To predict outcome of a particular therapy
Monitoring          To measure response to treatment and early detection of disease progression or relapse
2. Materials In order to systematically conduct data mining analysis for biomarker discovery, we present a general pipeline as a cyclic process comprising biomarker understanding, data understanding, data preprocessing, data modeling, evaluation, and deployment (Fig. 1).
● Biomarker Understanding: Biomarker understanding includes determining biomarker discovery objectives, assessing the current clinical and molecular situation, establishing data mining goals, and developing a project plan.
● Data Understanding: Once biomarker discovery objectives and the project plan are established, data understanding mainly considers data requirements. This step includes data collection, data description, data storage, and data exploration.
● Data Preprocessing: Once the available data are stored and managed, removal or imputation of missing values, data transformation, and data normalization have to be performed to prepare for data modeling (a minimal preprocessing sketch follows this list).
● Data Modeling: Models such as neural networks, tree-based models, logistic models, or other statistical models are useful for analyzing, predicting, and visualizing the patterns, associations, or relationships among the given data, being the prerequisite for deciding on biomarker candidates.
● Evaluation: Evaluation includes two steps: computational evaluation and biological evaluation. First, model results should
be evaluated with computational methods in the context of the biomarker objectives established in the first phase (biomarker understanding), followed by experimental validation.
● Deployment: Once the experimental assessment of biomarkers provides clinically meaningful results, the development plan, protocols, and methods are applied for the biomarker in accordance with accepted regulatory guidance and prepared for submission to the FDA or other regulatory agencies.

Fig. 1. Data mining pipeline for biomarker discovery resting on Omics data.
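The following is a minimal sketch of the Data Preprocessing step referenced in the list above; the specific choices (median imputation, log2 transform, per-feature z-scoring) are illustrative rather than prescriptive:

```python
import numpy as np

def preprocess(x):
    """Median-impute missing values, log2-transform, and z-score each
    feature of an Omics matrix (rows = features, columns = samples)."""
    x = np.asarray(x, dtype=float).copy()
    for row in x:                                  # impute per feature
        row[np.isnan(row)] = np.nanmedian(row)
    x = np.log2(x + 1.0)                           # stabilize variance
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + 1e-12)              # per-feature z-score

raw = [[120.0, float("nan"), 95.0, 310.0],
       [15.0, 22.0, float("nan"), 18.0]]
print(preprocess(raw))
```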
3. Methods 3.1. Data Mining Methodologies 3.1.1. Bayesian Classification
A simple Bayesian classifier is the probabilistic Naive Bayes classifier, which applies Bayes' theorem under strong (naive) independence assumptions. Assuming in general that Y is any discrete-valued variable and that the attributes X_1 … X_n are any discrete or real-valued attributes, the Naive Bayes classification rule is

$$Y^* = \arg\max_{y \in \{y_1, \ldots, y_m\}} P(Y \mid X_1, \ldots, X_n). \quad (1)$$

Using Bayes' theorem (see Note 1) and assuming the attributes X_1 … X_n are all conditionally independent of one another with respect to Y, Eq. 1 can be rewritten as

$$
\begin{aligned}
Y^* &= \arg\max_{y \in \{y_1, \ldots, y_m\}} P(Y \mid X_1, \ldots, X_n) \\
    &= \arg\max_{y \in \{y_1, \ldots, y_m\}} \frac{P(X_1, \ldots, X_n \mid Y)\, P(Y)}{P(X_1, \ldots, X_n)} \\
    &= \arg\max_{y \in \{y_1, \ldots, y_m\}} \frac{P(Y) \prod_{i=1}^{n} P(X_i \mid Y)}{P(X_1, \ldots, X_n)},
\end{aligned} \quad (2)
$$

where $P(Y)$ is the prior probability, $\prod_{i=1}^{n} P(X_i \mid Y)$ is the likelihood, and $P(X_1, \ldots, X_n)$ is the evidence. de Oliveira et al., e.g., used a Bayesian network as a tool to support clinical decision making in the online detection of premature ventricular contraction (PVC) beats in electrocardiogram (ECG) records; a Bayesian network is well suited to model the uncertain, random character embedded in this problem (3). Kwon et al. analyzed multiple single nucleotide polymorphisms (SNPs) in genome-wide association studies using Bayesian classification with a singular value decomposition method (4).
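For illustration, the following sketch implements Eq. 2 directly for discrete attributes, estimating the prior and the per-attribute likelihoods with Laplace smoothing; the toy data and the smoothing constant are assumptions of the sketch:

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    """Estimate the prior P(Y) and likelihoods P(X_i | Y) for discrete
    attributes, with Laplace smoothing (alpha)."""
    priors, cond = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        cond[c] = [{v: (np.sum(Xc[:, i] == v) + alpha)
                       / (len(Xc) + alpha * len(np.unique(X[:, i])))
                    for v in np.unique(X[:, i])}
                   for i in range(X.shape[1])]
    return priors, cond

def nb_predict(x, priors, cond):
    """Eq. 2 with the constant evidence term omitted (log scale)."""
    scores = {c: np.log(priors[c])
                 + sum(np.log(cond[c][i][v]) for i, v in enumerate(x))
              for c in priors}
    return max(scores, key=scores.get)

# Toy data: two binary markers, binary class label.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
priors, cond = nb_fit(X, y)
print(nb_predict([1, 1], priors, cond))  # predicts class 1
```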
3.1.2. Bayesian Networks
Bayesian networks are probabilistic graphical models that represent a set of random variables and their conditional independencies via a directed acyclic graph whose nodes represent the variables (observable quantities, latent variables, unknown parameters, or hypotheses) and whose edges represent conditional dependencies. Let G = (V, E) be a directed acyclic graph, and let X = {X_1, X_2, …, X_n} be a set of random variables. Suppose that each variable is conditionally independent of all its nondescendants in the graph given the values of all its parents. Then, X is a Bayesian network with respect to G. Its joint probability density function (with respect to a product measure) can be written as a product of the individual density functions, conditional on their parent variables, as follows (5):

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{parents}(X_i)\bigr), \quad (3)$$

where $\mathrm{parents}(X_i)$ is the set of parents of $X_i$. For any set of random variables, the probability of any member of a joint distribution can be calculated from conditional probabilities using the chain rule as follows (5):

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i+1} = x_{i+1}, \ldots, X_n = x_n). \quad (4)$$
For example, Deng et al. used a Bayesian network approach to incorporate mass spectrometry and microarray data for cross-platform analysis of cancer biomarkers. They presented a novel Bayesian network model that integrates and cross-annotates multiple datasets related to prostate cancer, and developed an empirical scoring scheme and a simulation algorithm for inferring biomarkers. Fourteen genes/proteins, including prostate-specific antigen (PSA), were identified as reliable serum biomarkers that are insensitive to the model assumptions (6). van Steensel et al. applied a Bayesian network to analyze chromatin interactions. Based on genome-wide binding maps, they constructed a Bayesian network model of the targeting interactions among a broad set of 43 chromatin components in Drosophila. They found that the homologous proteins HP1 and HP1C each target the heterochromatin protein HP3 to distinct sets of genes in a competitive manner. They also discovered a central role for the remodeling factor Brahma (brm) in the targeting of several DNA-binding factors, including GAGA factor, JRA, and SU(VAR)3-7 (7).
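A minimal illustration of Eq. 3: for a toy three-node network with made-up conditional probability tables, the joint probability is simply the product of each node's probability conditioned on its parents:

```python
# Joint probability of a three-node Bayesian network A -> B, A -> C (Eq. 3):
# P(A, B, C) = P(A) * P(B | A) * P(C | A). The CPT values are illustrative.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_a = {True: {True: 0.5, False: 0.5}, False: {True: 0.4, False: 0.6}}

def joint(a: bool, b: bool, c: bool) -> float:
    """Product of each node's probability conditioned on its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# The joint distribution sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (True, False)
            for b in (True, False) for c in (True, False))
print(joint(True, True, False), total)  # 0.3 * 0.8 * 0.5 = 0.12, total = 1.0
```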
3.1.3. Artificial Neural Networks
Neural networks have several unique advantages and characteristics as research tools heavily used in analyzing Omics data, e.g., in cancer (8–12). A very important feature of these networks is their
adaptive nature, where "learning by examples" replaces conventional "programming by different cases" in solving problems. A generalized feed-forward neural network has three layers: an input layer, a hidden layer, and an output layer, and is trained using a back-propagation supervised training algorithm. The input is used as activation for the hidden layer and is propagated to the output layer. The generated output is then compared to the desired output, and an error value is calculated for each node in the output layer. The weights on edges going into the output layer are then adjusted relative to the error value. This error is propagated backward through the network to correct edge weights at all levels. In the case study section below, we demonstrate the application of a neural network for identifying multimarker panels for breast cancer based on liquid chromatography tandem mass spectrometry (LC/MS/MS) proteomics profiles.
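A compact sketch of such a feed-forward network trained by back propagation is given below; the XOR task, the layer sizes, and the learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy binary task (XOR) with one hidden layer, trained exactly as described:
# forward pass, output-layer error, error propagated backward to all weights.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)    # input  -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # hidden -> output
lr = 1.0

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                 # hidden-layer activation
    out = sigmoid(h @ W2 + b2)               # network output
    d_out = (out - y) * out * (1 - out)      # error at each output node
    d_h = (d_out @ W2.T) * h * (1 - h)       # error propagated backward
    W2 -= lr * (h.T @ d_out)                 # adjust hidden -> output weights
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)                   # correct weights at lower level
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())                  # approaches [0, 1, 1, 0]
```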
3.1.4. Genetic Algorithms
Genetic algorithms are implemented as a computer simulation in which a population of abstract representations (called chromosomes, or the genotype of the genome) of candidate solutions (called individuals, or phenotypes) to an optimization problem evolves toward an improved solution. Genetic algorithms are inspired by Darwin's theory of evolution: simply said, the solution to a problem solved by genetic algorithms "evolves," following the basic notions of chromosomes, their recombination, mutation, and fitness. The fitness of an organism is measured by the success of its genotype under given boundary conditions. A genetic algorithm starts with a set of solutions (represented by chromosomes) called the population. Solutions from one population are taken and used to form a new population through recombination and mutation. The solutions selected to form new solutions (offspring) are chosen according to their fitness: the more suitable they are, the higher their chance to reproduce. This procedure is repeated until some condition (e.g., a number of populations or an improvement of the best solution) is satisfied. A scheme of a basic genetic algorithm is outlined as follows:
1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem)
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population
3. [New population] Create a new population by repeating the following steps until the new population is complete
(a) [Selection] Select two parent chromosomes from the population according to their fitness (the higher the fitness, the higher the chance to be selected)
(b) [Crossover] With a certain probability, crossover of the parent chromosomes is performed to form new (child) offspring.
If no crossover is performed, the offspring is an exact copy of the parents.
(c) [Mutation] With a certain probability, the offspring is mutated at each locus (position in the chromosome)
(d) [Accepting] Place the new offspring in the new population
4. [Replace] Use the newly generated population for a further iteration of the algorithm
5. [Test] If the end condition is satisfied, stop, and return the best solution of the current population
6. [Loop] Go to step 2 (a minimal skeleton of this loop is sketched below)
Dolled-Filhart et al., for example, used genetic algorithms to classify breast cancer based on tissue microarrays. A minimal number of markers with maximal prognostic or predictive value were identified, including GATA3, NAT1, and the estrogen receptor. They first divided a given cancer patient cohort into a training set of 223 cases to detect a solution capable of defining a subset of patients with a 5-year survival of 96%. The algorithm was then validated on both an internal validation set (n = 223, 5-year survival = 95.8%) and an independent cohort. Their work illustrates the potential of genetic algorithms for discovering multiplexed biomarkers on the tissue microarray platform (13).
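The skeleton below implements the scheme outlined above on a toy bit-string problem; the fitness function, population size, and probabilities are placeholders:

```python
import random

random.seed(0)

N, LENGTH, GENERATIONS = 30, 20, 60
P_CROSSOVER, P_MUTATION = 0.8, 0.02

def fitness(chromosome):
    """Toy fitness: number of 1-bits (higher is better)."""
    return sum(chromosome)

def select(population):
    """[Selection] Fitness-proportional (roulette-wheel) choice of a parent."""
    weights = [fitness(c) + 1 for c in population]
    return random.choices(population, weights=weights, k=1)[0]

# [Start] random population of N chromosomes
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(N)]

for _ in range(GENERATIONS):                         # [Loop]
    new_population = []
    while len(new_population) < N:                   # [New population]
        p1, p2 = select(population), select(population)
        child = list(p1)                             # exact copy by default
        if random.random() < P_CROSSOVER:            # [Crossover] one-point
            cut = random.randrange(1, LENGTH)
            child = p1[:cut] + p2[cut:]
        child = [g ^ 1 if random.random() < P_MUTATION else g
                 for g in child]                     # [Mutation] per locus
        new_population.append(child)                 # [Accepting]
    population = new_population                      # [Replace]

print(max(fitness(c) for c in population))           # [Test] best solution
```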
3.1.5. Decision Trees
A decision tree (or tree diagram) is a decision support tool that uses a tree-like graph (or model) of rule-based decisions and their consequences for the classification of data. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both approaches provide a set of rules that can be applied to a new (unclassified) dataset for predicting a given end point. CART segments a dataset by creating two-way splits, while CHAID segments by using chi-square tests to create multiway splits. CART typically requires less data preparation than CHAID. For example, Su et al. used decision tree classification of mass spectral data to diagnose gastric cancer. They subjected serum samples from 245 individuals (including 127 gastric cancer patients, 100 age- and sex-matched healthy individuals, nine benign gastric lesion patients, and nine colorectal cancer patients) to analysis by surface-enhanced laser desorption/ionization (SELDI) mass spectrometry. Peaks were detected with Ciphergen SELDI software version 3.1.1 and analyzed with Biomarker Patterns software 5.0. They developed a classifier separating the gastric cancer group from the healthy group. Three protein masses at 1,468, 3,935, and 7,560 m/z were selected as a potential "fingerprint" for the detection of gastric cancer. The prediction accuracy showed a sensitivity of 95.6% and a specificity of 92.0% (14).
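A minimal CART example in the spirit of the Su et al. study, using simulated peak intensities; the m/z labels and the simulated shift in the cancer group are assumptions of the sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Simulated intensities for three m/z peaks in 100 cancer and 100 control
# sera; peak 0 is shifted upward in the cancer group.
cancer = rng.normal([12, 5, 8], 1.5, size=(100, 3))
control = rng.normal([9, 5, 8], 1.5, size=(100, 3))
X = np.vstack([cancer, control])
y = np.array([1] * 100 + [0] * 100)

# CART builds binary (two-way) splits that best separate the two classes.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["peak_1468", "peak_3935", "peak_7560"]))
```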
3.1.6. Graph Theory
A graph is a collection of vertices or nodes together with a collection of edges that connect pairs of vertices, mathematically denoted as G(V, E). The vertex set of G is usually denoted by V(G), and the edge set of G by E(G). The degree, or valency, d_G(v), of a vertex v in a graph G is the number of edges incident to v, with loops counted twice. The vertex connectivity, or connectivity, k(G) of a graph G is the minimum number of vertices that need to be removed to disconnect G. Graph theoretical approaches have been widely applied to the analysis of molecular interaction networks, such as protein–protein interaction (PPI) networks, gene–gene co-expression networks (see Note 2), genetic interaction networks, molecular co-annotation networks, literature co-occurrence networks, and molecular entity association networks, where a vertex can represent a gene, protein, or pathway, and an edge can represent an interaction, a similarity, or a distance. Kohler et al. presented a graph theory approach for the prioritization of candidate genes. They used a global network distance measure, random walk analysis, for the definition of similarities in PPI networks. They tested the method on 110 disease-gene families with a total of 783 genes and reached an area under the ROC curve of up to 98% when simulating linkage intervals of 100 genes surrounding the disease gene (15).
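The vertex degree d_G(v) and connectivity k(G) defined above can be computed directly, e.g., with the networkx library; the toy PPI edges listed here are illustrative:

```python
import networkx as nx

# Toy PPI network: vertices are proteins, edges are reported interactions.
G = nx.Graph([("TP53", "MDM2"), ("TP53", "EP300"), ("MDM2", "UBE2D1"),
              ("EP300", "STAT3"), ("STAT3", "MDM2")])

print(G.degree("TP53"))          # d_G(v): number of edges incident to v
print(nx.node_connectivity(G))   # k(G): min vertices to remove to disconnect G
```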
3.1.7. Nearest Neighbor Method
The nearest neighbor method classifies each record in a dataset based on the classes of the k most similar record(s) in a historical dataset (here k = 1). For example, Tian and colleagues used a nearest neighbor method to perform gene selection, i.e., identifying the nearest neighbor of each gene within the same class (gender) using the Euclidean distance over all experiments, and ranking the individual genes by their ability to distinguish class (gender) based on intensity values (16).
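A minimal sketch of this classification rule for k = 1:

```python
import numpy as np

def nearest_neighbor_class(x, X_train, y_train):
    """Assign x the class of its single nearest training record (k = 1),
    using the Euclidean distance over all features."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    return y_train[np.argmin(d)]

X_train = np.array([[1.0, 2.0], [1.2, 1.8], [6.0, 7.0], [5.8, 7.2]])
y_train = np.array(["male", "male", "female", "female"])
print(nearest_neighbor_class(np.array([5.5, 6.9]), X_train, y_train))  # female
```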
3.1.8. Data Visualization
Graphics tools are used to illustrate data relationships, allowing a visual interpretation of complex relationships in multidimensional data. Common data visualization tools include Cytoscape, OpenGL, GDI+, MATLAB, and dedicated R scripts. As a good example, You et al. demonstrated the use of data visualization to explore differential gene expression profiles in biomolecular interaction networks (17).
3.1.9. Similarity Measures
Table 2 lists several measures that determine how similar the patterns of two sets are. Among them, the Euclidean distance and the Pearson correlation coefficient are two of the simplest and most commonly used similarity measures for analyzing gene expression or proteomics profiles. The two most important clustering methods utilizing these similarity measures are hierarchical clustering (see Note 3) and K-means (see Note 4). Liu et al. used a similarity measure to construct pathways of tumor progression and to discover cancer-associated biomarkers.
Table 2 Similarity measures ($X = \{X_1, \ldots, X_n\}$, $Y = \{Y_1, \ldots, Y_n\}$)

Manhattan distance: $d = \sum_{i=1}^{n} |X_i - Y_i|$

Euclidean distance: $d = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$

Pearson correlation: $d = 1 - \dfrac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}}$

Chebychev distance: $d = \max_i |X_i - Y_i|$

Spearman rank correlation: $d = \dfrac{6 \sum_{i=1}^{n} \left(\operatorname{rank}(X_i) - \operatorname{rank}(Y_i)\right)^2}{n(n^2 - 1)}$

Mahalanobis distance: $d = \sqrt{(X - Y)^{\mathrm{T}} S^{-1} (X - Y)}$, where $S$ is the covariance matrix
First, they defined a similarity measure using normalized mutual information and clustered the data with this measure to find the pathway nodes. Then, they built the pathway tree based on the center of each cluster and the heritability of the genotype and phenotype data (18).
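Most of the measures in Table 2 are available off the shelf; the sketch below evaluates them for two toy profiles with SciPy. For the Mahalanobis distance, the covariance matrix is estimated from simulated background profiles, an assumption of this sketch:

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean, mahalanobis
from scipy.stats import pearsonr, spearmanr

x = np.array([2.1, 0.5, 1.8, 3.0, 0.9])
y = np.array([1.9, 0.7, 2.2, 2.6, 1.1])

print(cityblock(x, y))          # Manhattan distance
print(euclidean(x, y))          # Euclidean distance
print(1 - pearsonr(x, y)[0])    # Pearson correlation distance (Table 2)
print(chebyshev(x, y))          # Chebychev distance
print(1 - spearmanr(x, y)[0])   # Spearman distance, 6*sum(d^2)/(n(n^2-1))

# The Mahalanobis distance needs a covariance matrix S; here it is
# estimated from 50 simulated 5-dimensional background profiles.
background = np.random.default_rng(0).normal(size=(50, 5))
S_inv = np.linalg.inv(np.cov(background, rowvar=False))
print(mahalanobis(x, y, S_inv))
```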
3.2. Data Mining Tools
Dedicated data mining tools are widely used for analyzing Omics data. These tools provide statistical data models (classification or clustering methods, regression and modeling, etc.) and utilize visualization functions to support the interpretation of analysis results. They may be implemented on existing user platforms or further integrated with other applications as part of a larger data management and analysis strategy. Data mining tools provide users with an interface for discovering, manipulating, and analyzing given Omics data. Table 3 summarizes six of the most highly acclaimed data mining tools suitable for supporting Omics data analysis aimed at biomarker discovery: (1) SAS Enterprise Miner; (2) IBM Intelligent Miner; (3) SPSS/Clementine; (4) Oracle Darwin; (5) the R package; (6) MATLAB.
Table 3 Comparison of data mining tools with respect to capabilities, algorithms, and applications in the context of Omics and biomarker discovery

                            Enterprise  Intelligent
                            Miner       Miner        CLEMENTINE  Darwin  R    MATLAB
Capability
  Handles missing data      No          No           No          No      Yes  No
  Has programming language  Yes         No           Yes         No      Yes  Yes
  Visualization             +           +            +           -       -    +
  Debug                     No          No           No          No      Yes  Yes
  GNU                       No          No           No          No      Yes  No
Algorithms
  Decision trees            Yes         Yes          Yes         Yes     Yes  Yes
  Neural networks           Yes         Yes          Yes         Yes     No   Yes
  Regression                Yes         Yes          Yes         Yes     Yes  Yes
  Nearest neighbor method   Yes         Yes          Yes         Yes     Yes  Yes
  Graph theory              No          No           No          No      Yes  Yes
  Genetic algorithm         No          No           No          No      Yes  Yes
  Bayesian classification   No          No           No          No      Yes  Yes
  Bayesian network          No          No           No          No      Yes  Yes
Applications
  Genomic application       No          No           No          No      Yes  Yes
  Proteomics application    No          No           No          No      Yes  Yes

Note: + good, - needs improvement
3.3. Data Mining Applications 3.3.1. Data Mining Applications in Genomics
Genomic analysis refers to techniques aimed at determining and comparing the different properties and variations of the genome. This includes DNA sequencing, the genotyping of SNPs in the human genome, and DNA microarray technology for the analysis of gene expression profiles at the mRNA level (19). Various data mining tools have been developed to make genome mining available to computational and experimental biologists alike. For example, Yang et al. (20) used integrative genomic data mining for the discovery of potential blood-borne biomarkers for the early diagnosis of cancer:
1. Genes overexpressed in cancer tissues relative to their corresponding normal tissues are first filtered by Gene Ontology keywords focusing on the extracellular environment and using a corrected Q-value (False Discovery Rate) cutoff.
2. The identified genes are imported into the Ingenuity Pathway Analysis (IPA, www.ingenuity.com) biomarker module to separate those genes encoding putative secreted or cell-surface proteins as blood-borne (blood/serum/plasma) cancer markers.
3. The filtered potential indicators are ranked and prioritized according to normalized absolute Student t-test values.
Such a combined mining strategy, based on an integrated cancer microarray platform (Oncomine, http://www.oncomine.org) and the biomarker module of the IPA program, has proven useful for identifying potential blood-based markers for human cancer. Fernandez-Suarez et al. (21) provided workflows for interacting with BioMart (http://www.biomart.org) from other applications to retrieve information through different platforms, such as Galaxy and the biomaRt package of BioConductor (http://www.bioconductor.org). Many of these tools also interact with the UCSC Table Browser (http://genome.ucsc.edu). Dinu et al. (22) integrated domain knowledge with statistical and data mining methods to identify genes and pathways associated with disease. They described Pathway/SNP (http://www.dinulab.org), a software application designed to help evaluate the association between pathways and disease. Pathway/SNP can be used to explore the etiology of complex diseases by integrating domain knowledge, SNP data, and gene and pathway annotation from multiple sources with statistical and data mining algorithms. With the development of network-based approaches for identifying biological markers for diagnostic classification and prognostic assessment in the context of microarray data, Zhu et al. (23) proposed a network-based support vector machine for binary classification problems. In this approach, a penalty term constructed from the F-infinity-norm (see Note 5) is applied to pairwise gene neighbors with the aim of improving predictive performance and gene selection. They applied the method in both low- and high-dimensional data settings as well as in two microarray applications to identify clinically relevant genes while maintaining a sparse model with similar or higher prediction accuracy compared to the standard and the L1-penalized support vector machines. They concluded that the proposed network-based support vector machine has the potential to be a practically useful classification tool for microarrays and other high-dimensional data. Lancashire et al. (24) reviewed recent literature showing that artificial neural networks can cope with high-dimensional, complex datasets such as those generated by protein mass spectrometry and DNA microarray experiments, and can be used to solve problems such as disease classification and the identification of biomarkers.
3.3.2. Data Mining Applications in Proteomics
Data mining applications in the discovery of biomarkers using proteomics data have gained much interest in recent years. However, the considerable amount of data resulting from advances in proteomics and mass spectrometry cannot be easily analyzed, visualized, or interpreted. Typical mass spectrometry-based proteomic datasets are high dimensional but of small sample size (a shortcoming found with almost all Omics technologies). Therefore, advanced artificial intelligence and machine learning algorithms are needed for knowledge discovery from such datasets. Two of the major experimental techniques are matrix-assisted laser desorption/ionization (MALDI-MS) and its extension, SELDI-MS. Data samples from the two techniques typically comprise hundreds to thousands of protein peaks; such a vast amount of data cannot be visually analyzed or processed by ordinary data mining tools. The data mining pipeline given in Fig. 1 can be applied to mining protein mass spectrometry data from MALDI-MS or SELDI-MS, and each step involves various techniques, here in particular data preprocessing (transformation, normalization, noise elimination) and data modeling (see earlier sections of this chapter). For example, Saksena et al. (25) presented a procedure for learning a probabilistic model from mass spectrometry data that accounts for domain-specific noise and mitigates the complexity of Bayesian structure learning. Conrads et al. (26) and Petricoin et al. (27) used a genetic algorithm for peak selection and self-organizing maps for the classification of SELDI-MS proteomic data in cancer diagnosis. Schaub et al. (28) used proteomic mass spectrometry approaches for profiling, fractionation, and identification of serum proteins from breast cancer patients to discover new biomarkers reflecting stage and nodal status. They used a k-nearest neighbor genetic algorithm for deriving group profiles and cross-validation scores. Rogers et al. (29) investigated the clinical utility of SELDI profiling of urine samples in conjunction with neural-network analysis to detect renal cancer biomarkers. The samples used were from a total of 218 individuals. Samples from patients before undergoing nephrectomy for clear cell renal cell carcinoma (RCC; n = 48), from healthy volunteers (n = 38), and from outpatients attending with benign diseases of the urogenital tract (n = 20) were used to successfully train neural-network models based on either the presence/absence of peaks or peak intensity values. The sensitivity and specificity values reached 98.3–100%. Testing the models with a "blinded" group of samples from 12 patients with RCC, 11 healthy controls, and 9 patients with benign diseases resulted in sensitivity and specificity of 81.8–83.3%.
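As a minimal illustration of such preprocessing, the sketch below smooths a simulated SELDI-like spectrum and picks peaks by prominence; the simulated peak positions echo the fingerprint masses mentioned earlier and are otherwise arbitrary:

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

rng = np.random.default_rng(0)

# Simulated spectrum: three Gaussian peaks on a decaying, noisy baseline.
mz = np.linspace(1000, 10000, 4000)
spectrum = (40 * np.exp(-((mz - 1468) / 20) ** 2)
            + 25 * np.exp(-((mz - 3935) / 30) ** 2)
            + 15 * np.exp(-((mz - 7560) / 40) ** 2)
            + 30 * np.exp(-mz / 3000)            # chemical baseline
            + rng.normal(0, 0.8, mz.size))       # detector noise

smoothed = savgol_filter(spectrum, window_length=21, polyorder=3)  # denoise
peaks, _ = find_peaks(smoothed, prominence=5, distance=50)         # peak picking
print(np.round(mz[peaks]))   # approximately [1468, 3935, 7560]
```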
3.4. Case Studies 3.4.1. Colorectal Cancer Gene Fishing with Integrated Genomic Data and Molecular Interaction Networks
In the post-genome era, disease-relevant gene biomarker discovery has drawn on various Omics tracks, such as genome-wide association studies, and on various interaction data, such as pathway and PPI networks. Combining network analysis with traditional data mining methods has gained increasing importance. Huang et al. (30) presented a simple yet generic computational framework based on protein interaction networks to perform and evaluate disease gene hunting in colorectal cancer. Three sets of colorectal cancer-related genes retrieved from different resources were collected as seeds: (1) the CORE1 set, derived from manually curated databases by querying the OMIM and KEGG databases for "colorectal cancer"; (2) the CORE2 set, derived from high-throughput microarray data as provided in the ONCOMINE database, keeping only differentially expressed genes with p-values < 0.05 when comparing colorectal cancer samples against controls; and (3) the CORE3 set, derived from the Comparative Toxicogenomics Database (CTD) by searching for colorectal cancer genes associated with more than two chemicals in the database. Three disease gene ranking strategies were considered: (1) the global degree strategy, in which the protein's node degree in a PPI network is used as the weight; (2) the local degree strategy, in which the protein's node degree in the local (colorectal-specific) PPI network is used as the weight; and (3) the Edge-weighted Promiscuous Hub Subtraction (EPHS) strategy, a variant of the local degree strategy that penalizes the impact of low-quality, promiscuous protein hubs on the ranks. Statistical measurements, including specificity, sensitivity, and Positive Predictive Value (PPV), were used to evaluate the performance of the disease gene ranking pipeline, which comprises seed gene selection, protein interaction data quality and coverage, and network-based gene-ranking strategies. The results showed that the best performance came from using curated gene sets as seeds, applying a protein interaction dataset with high data coverage and decent quality, and adopting variants of the local degree method.

3.4.2. A Neural Network Approach for Developing Multimarker Breast Cancer Panels Based on LC/MS/MS Proteomics Profiles

LC/MS/MS-based plasma proteomics profiling is a promising and minimally invasive technology for deciphering prognostic factors for complex human diseases, such as cancer. We collected plasma protein profiles in two batches, which we refer to as Study A and Study B. Both studies included 80 samples (40 from women with breast cancer and 40 from healthy volunteers serving as controls). The demography and clinical parameters characterizing cancer stage/subtype were similar in Studies A and B (31). We used a data analysis method based on a Feed-Forward Neural Network (FFNN) to identify multiprotein biomarker panels,
with which we are capable of separating cancerous from reference plasma samples with high predictive performance.
1. We applied analysis of variance (ANOVA) to filter a list of single candidate biomarkers that differ significantly between breast cancer samples and controls.
2. A neural network was constructed and trained for each combination of single marker proteins. According to an area-under-the-curve (AUC) criterion, the optimal combination of a panel of five markers was determined, using both a two-variable output encoding scheme and a single-variable encoding scheme.
3. We compared the Receiver Operating Characteristic (ROC) performance and verified that the best five-marker panel performed well on both the training dataset and the test dataset, achieving more than 82.5% in sensitivity and specificity.
Our computational method can serve as a general model for cancer panel biomarker discovery applications.
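The sketch below mimics this panel search on simulated data: every candidate five-marker combination is scored by the test-set AUC of a small feed-forward network. The marker count, the network size, and the simulated effect are assumptions of the sketch, not the study's actual configuration:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Simulated plasma profiles: 8 candidate proteins, 80 samples (40 cancer,
# 40 control); the first five proteins carry signal, the rest are noise.
y = np.array([1] * 40 + [0] * 40)
X = rng.normal(size=(80, 8))
X[y == 1, :5] += 0.9

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

best_auc, best_panel = 0.0, None
for panel in combinations(range(8), 5):      # every candidate 5-marker panel
    cols = list(panel)
    net = MLPClassifier(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
    net.fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_te, net.predict_proba(X_te[:, cols])[:, 1])
    if auc > best_auc:
        best_auc, best_panel = auc, panel

print(f"best panel {best_panel} with test AUC {best_auc:.2f}")
```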
4. Notes
1. Bayes' theorem relates the conditional and marginal probabilities of events A and B, where B has a nonvanishing probability:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

where P(A) is the prior or marginal probability of A, P(A|B) is the conditional probability of A given B, P(B|A) is the conditional probability of B given A, and P(B) is the prior or marginal probability of B.
2. Gene–gene co-expression networks are scale-free biological networks in which genes are linked when they are co-regulated and involved in the same biological process.
3. There are two types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering takes each entity (i.e., gene) as a single cluster to start off with and then builds larger clusters by grouping similar entities together until the entire dataset is encapsulated into one final cluster. Divisive hierarchical clustering works the opposite way: the entire dataset is first considered to be one cluster and is then broken down into smaller subsets until each subset consists of only a single entity.
4. k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
5. F-infinity-norm is defined as

$$\|F_g\|_\infty = \|b^{(g)}\|_\infty = \max_{j \in S_g} |b_j|.$$
Acknowledgments
This work was supported in part by a grant from the National Cancer Institute (U24CA126480-01), part of NCI's Clinical Proteomic Technologies Initiative (http://proteomics.cancer.gov), awarded to Dr. Fred Regnier (PI) and Dr. Jake Chen (co-PI). We thank Hui Huang and Jiao Li for providing a case study.

References
1. Soreide K. (2009) Receiver-operating characteristic curve analysis in diagnostic, prognostic and predictive biomarker research. J Clin Pathol 62, 1–5.
2. Jaffe C.C. (2009) Pathology and imaging in biomarker development. Arch Pathol Lab Med 133, 547–9.
3. de Oliveira L.S., Andreao R.V., and Sarcinelli-Filho M. (2010) The use of Bayesian networks for heart beat classification. Adv Exp Med Biol 657, 217–31.
4. Kwon S., Cui J., Rhodes S.L., Tsiang D., Rotter J.I., and Guo X. (2009) Application of Bayesian classification with singular value decomposition method in genome-wide association studies. BMC Proc 3, S9.
5. Needham C.J., Bradford J.R., Bulpitt A.J., and Westhead D.R. (2006) Inference in Bayesian networks. Nat Biotechnol 24, 51–3.
6. Deng X., Geng H., and Ali H.H. (2007) Cross-platform analysis of cancer biomarkers: A Bayesian network approach to incorporating mass spectrometry and microarray data. Cancer Inform 3, 183–202.
7. van Steensel B., Braunschweig U., Filion G.J., Chen M., van Bemmel J.G., and Ideker T. (2010) Bayesian network analysis of targeting interactions in chromatin. Genome Res 20, 190–200.
8. Lai K.C., Chiang H.C., Chen W.C., Tsai F.J., and Jeng L.B. (2008) Artificial neural network-based study can predict gastric cancer staging. Hepatogastroenterology 55, 1859–63.
9. Amiri Z., Mohammad K., Mahmoudi M., Zeraati H., and Fotouhi A. (2008) Assessment of gastric cancer survival: Using an artificial hierarchical neural network. Pak J Biol Sci 11, 1076–84.
10. Chi C.L., Street W.N., and Wolberg W.H. (2007) Application of artificial neural network-based survival analysis on two breast cancer datasets. AMIA Annu Symp Proc, 130–4.
11. Anagnostopoulos I., and Maglogiannis I. (2006) Neural network-based diagnostic and prognostic estimations in breast cancer microscopic instances. Med Biol Eng Comput 44, 773–84.
12. Wang H.Q., Wong H.S., Zhu H., and Yip T.T. (2009) A neural network-based biomarker association information extraction approach for cancer classification. J Biomed Inform 42, 654–66.
13. Dolled-Filhart M., Ryden L., Cregger M., Jirstrom K., Harigopal M., Camp R.L., and Rimm D.L. (2006) Classification of breast cancer using genetic algorithms and tissue microarrays. Clin Cancer Res 12, 6459–68.
14. Su Y., Shen J., Qian H., Ma H., Ji J., Ma L., Zhang W., Meng L., Li Z., Wu J., et al. (2007) Diagnosis of gastric cancer using decision tree classification of mass spectral data. Cancer Sci 98, 37–43.
15. Kohler S., Bauer S., Horn D., and Robinson P.N. (2008) Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet 82, 949–58.
16. Tian Z., Palmer N., Schmid P., Yao H., Galdzicki M., Berger B., Wu E., and Kohane I.S. (2009) A practical platform for blood biomarker study by using global gene expression profiling of peripheral whole blood. PLoS One 4, e5157.
17. You Q., Fang S., and Chen J.Y. (2008) GeneTerrain: Visual exploration of differential gene expression profiles organized in native biomolecular interaction networks. J Inf Vis, doi: 10.1057/palgrave.ivs.9500169.
18. Liu Z., Guo Z., and Tan M. (2008) Constructing tumor progression pathways and biomarker discovery with fuzzy kernel kmeans and DNA methylation data. Cancer Inform 6, 1–7.
19. Lee P.S., and Lee K.H. (2000) Genomic analysis. Curr Opin Biotechnol 11, 171–5.
20. Yang Y., Pospisil P., Iyer L.K., Adelstein S.J., and Kassis A.I. (2008) Integrative genomic data mining for discovery of potential blood-borne biomarkers for early diagnosis of cancer. PLoS One 3, e3661.
21. Fernandez-Suarez X.M., and Birney E. (2008) Advanced genomic data mining. PLoS Comput Biol 4, e1000121.
22. Dinu V., Zhao H., and Miller P.L. (2007) Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis. J Biomed Inform 40, 750–60.
23. Zhu Y., Shen X., and Pan W. (2009) Network-based support vector machine for classification of microarray samples. BMC Bioinformatics 10, S21.
24. Lancashire L.J., Lemetre C., and Ball G.R. (2009) An introduction to artificial neural networks in bioinformatics – application to complex microarray and mass spectrometry datasets in cancer studies. Brief Bioinform 10, 315–29.
25. Saksena A., Lucarelli D., and Wang I.J. (2005) Bayesian model selection for mining mass spectrometry data. Neural Netw 18, 843–9.
26. Conrads T.P., Zhou M., Petricoin E.F., Liotta L., and Veenstra T.D. (2003) Cancer diagnosis using proteomic patterns. Expert Rev Mol Diagn 3, 411–20.
27. Petricoin E.F., and Liotta L.A. (2004) SELDI-TOF-based serum proteomic pattern diagnostics for early detection of cancer. Curr Opin Biotechnol 15, 24–30.
28. Schaub N.P., Jones K.J., Nyalwidhe J.O., Cazares L.H., Karbassi I.D., Semmes O.J., Feliberti E.C., Perry R.R., and Drake R.R. (2009) Serum proteomic biomarker discovery reflective of stage and obesity in breast cancer patients. J Am Coll Surg 208, 970–8.
29. Rogers M.A., Clarke P., Noble J., Munro N.P., Paul A., Selby P.J., and Banks R.E. (2003) Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis: Identification of key issues affecting potential clinical utility. Cancer Res 63, 6971–83.
30. Huang H., Li J., and Chen J.Y. (2009) Disease gene-fishing in molecular interaction networks: A case study in colorectal cancer. Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2009).
31. Zhang F., and Chen J.Y. (2009) A neural network approach to developing multi-marker panels for breast cancer based on LC/MS/MS proteomics profiles. Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2009.
Chapter 25 Integrated Bioinformatics Analysis for Cancer Target Identification Yongliang Yang, S. James Adelstein, and Amin I. Kassis Abstract The exponential growth of high-throughput Omics data has provided an unprecedented opportunity for new target identification to fuel the dried-up drug discovery pipeline. However, the bioinformatics analysis of large amounts of heterogeneous Omics data poses considerable technical challenges for experimentalists who lack statistical skills. Moreover, owing to the complexity of human diseases, it is essential to analyze Omics data in the context of molecular networks to detect meaningful biological targets and understand disease processes. Here, we describe an integrated bioinformatics analysis strategy and provide a running example of identifying suitable targets for our in-house Enzyme-Mediated Cancer Imaging and Therapy (EMCIT) technology. In addition, we go through a few key concepts in the process, including the corrected false discovery rate (FDR), Gene Ontology (GO), pathway analysis, and tissue specificity. We also describe popular programs and databases that allow the convenient annotation and network analysis of Omics data. We provide a practical guideline for researchers to quickly follow the protocol described and identify those targets that are pertinent to their work. Key words: Microarray data analysis, Literature data analysis, Pathway analysis, False discovery rate, Subcellular location, Tissue specificity, Integrated bioinformatics analysis
1. Introduction Therapeutic target discovery is the most crucial step in the modern drug discovery campaign (1). To date, about 1,174 distinct proteins (including isoforms) have been documented as targets with clinical potential, among which 239 are targets of marketed drugs (2). The majority of these "druggable" targets are enzymes, kinases, G-protein coupled receptors (GPCRs), ion channels, transporters, and nuclear hormone receptors. In the past decades, the sky-rocketing increase of investments in pharmaceutical R&D has been accompanied by a high failure rate of drug candidates in the
clinical development phase (3, 4). Nowadays, it is obvious that the lack of appropriate "druggable" targets is a major contributing factor. Furthermore, the narrow focus on individual targets and the underestimation of the complex physiological role of the targets in the intact organism have worsened the situation. Undoubtedly, the flourishing of high-throughput Omics technologies has provided immense opportunities for new target discovery to support the dried-up drug discovery pipeline. For example, the number of databases warehousing various Omics data is rapidly growing, with their size estimated to double every 2 years (5). However, extracting valuable biological targets and information from a wealth of Omics data is an arduous task for researchers, particularly for experimentalists who do not possess the necessary statistical skills. It is thus essential to develop practical bioinformatics protocols that can facilitate the identification of targets for further development, especially since the criteria for selecting potential targets can vary significantly with the context of the therapeutic strategy. Nonetheless, a number of common factors influence an entity's potential as an effective therapeutic target: for instance, aberrant expression in diseased tissues compared to healthy tissues, subcellular location (e.g., the extracellular space and the cell surface are easily accessible to various drug delivery mechanisms), tissue specificity, the biological role in the disease process, etc. Target discovery is also challenging because human diseases are highly complex and biomedical data are largely heterogeneous and poorly defined. Hence, we need to integrate and analyze Omics data across many different disciplines. More importantly, integrated bioinformatics analysis helps researchers retrieve and prioritize biologically meaningful targets (6). For example, the integration of curated pathway knowledgebases (e.g., KEGG and GenMAPP, see Note 1) into the filtering of biological entities has enabled scientists to analyze and visualize a variety of datasets in the context of major regulatory, metabolic, and cellular pathways. We have many reasons to believe that integrated bioinformatics analysis will play a highly significant role in future target discovery campaigns.
2. Materials 2.1. False Discovery Rate
In Omics data studies, researchers need to control the potentially high rate of false positives that arises from analyzing several thousands to tens of thousands of entities simultaneously. The traditional statistical P value is not perfectly suitable for such a task. To surmount this problem, the FDR (Q value in meta-analysis), defined as the expected proportion of false positives among the declared
significant results of a test, has been devised (7). For instance, if we declare a set of 1,000 overexpressed genes with a maximum FDR value of 0.05, then we can expect a maximum of 50 genes to be false positives. Consequently, the FDR value, which controls the proportion of positive calls that are false positives, can be used as a sensible measure balancing the numbers of true positives and false positives in Omics data analysis, such as microarray data studies. Methods of FDR estimation have therefore been introduced into a number of microarray repositories to help researchers filter the statistically meaningful entities. For example, in the Oncomine microarray database ((8), and see Note 2), Q values are calculated as Q = NP/R, where P is the significance level assigned to a gene in a statistical test, N is the total number of genes analyzed, and R is the sorted rank of the P value. The Q value therefore represents a meta-analysis measure of significance in a pool of genes. For instance, in the course of microarray data analysis, researchers could choose a stringent Q value cutoff of 0.05 to increase the likelihood that a set of filtered entities is truly differentially expressed.
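The Oncomine-style Q value defined above is straightforward to compute; a minimal sketch:

```python
import numpy as np

def q_values(p: np.ndarray) -> np.ndarray:
    """Q = N * P / R, with P the p-value, N the number of genes tested,
    and R the sorted rank of each p-value (as used by Oncomine)."""
    n = len(p)
    ranks = np.argsort(np.argsort(p)) + 1  # 1-based rank of each p-value
    return n * p / ranks

p = np.array([0.0001, 0.0004, 0.0100, 0.0300, 0.4000])
print(q_values(p))   # [0.0005, 0.001, 0.0167, 0.0375, 0.4]

# Declare genes with Q <= 0.05 significant; at most ~5% of these positive
# calls are expected to be false positives.
print(np.where(q_values(p) <= 0.05)[0])
```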
2.2. Gene Ontology
Gene Ontology (GO) is a controlled vocabulary produced by the Gene Ontology Consortium (http://www.geneontology.org) to describe the function of gene products, their location in the cell, and the biological process(es) they are involved in (9). Molecular Function, Biological Process, and Cellular Component have been developed as three structured and defined ontologies in a species-independent manner. Here, we present three examples from GO as of November 2009: (1) 11,047 gene products share the Cellular Component GO term <<extracellular region>> (GO:0005576), which defines "the space external to the outermost structure of a cell"; for cells without external protective or external encapsulating structures, this refers to the space outside of the plasma membrane. (2) 10,808 gene products share the Molecular Function GO term <<hydrolase activity, acting on ester bonds>> (GO:0016788), which defines "catalysis of the hydrolysis of any ester bonds." (3) 30,499 gene products share the Biological Process GO term <<cell communication>> (GO:0007154), which defines "any process that mediates interactions between a cell and its surroundings," encompassing interactions such as signaling or attachment between one cell and another cell, between a cell and an extracellular matrix, or between a cell and any other aspect of its environment. GO terms can also be used to annotate and group Omics data in an organized fashion after statistical analysis. This is very useful for the identification of diagnostic and therapeutic targets in the sense that they inform investigators where the gene products are located, what their functions are, and which biological processes they are associated with. Moreover, researchers could gain
a deeper understanding of the gene products from a biological significance point of view rather than from pure statistical analysis. As an example, Welsh et al. used a set of 30 GO terms implying extracellularity to filter putative genes encoding secreted protein targets from Affymetrix probe sets (10).
2.3. Pathway Analysis
Pathway analysis (also referred to as functional enrichment) is a procedure in which a list of entities (genes or proteins) is examined for Gene Ontology terms, biochemical functions, known biochemical and regulatory relationships, and known protein–protein and gene–gene interactions (11). It has become popular to integrate pathway analysis to determine the biological relevance of a list of entities mined from high-throughput Omics data. For example, we recently identified lists of entities that are either cell surface or membrane-bound as potential targets for cancer imaging, based on the analysis of publicly available genomic profiles (unpublished results). Interestingly, by integrating pathway analysis in our study, we were able to detect a number of putative subnetworks of "influential" molecules responsible for tumor cell growth and proliferation. It is advantageous to integrate pathway analysis in target discovery for the following reasons. First, genes or proteins with therapeutic potential are more likely to function as a cooperative group or network than as individual units (12). For human cancers, it has been speculated that cooperative groups of entities can alter biological pathways to promote tumor growth and progression. Pathway analysis is more useful than pure statistical analysis in identifying subtle concordant changes of a group of entities in a biological process. For instance, Gene Set Enrichment Analysis (GSEA, http://www.broadinstitute.org/gsea) calculates an enrichment score for a list of genes based on the length of the gene list and the number of genes mapped to a specific pathway. Using GSEA, Mootha et al. (13) were able to identify a set of coordinately dysregulated genes in diabetic muscle that pure statistical analysis had failed to discover. Second, by integrating pathway analysis, researchers can organize and map "focused" interaction networks derived from significantly deregulated gene–gene or protein–protein pairs to reflect important cellular functions in the disease process. The genes or proteins with high connectivity (a large number of interactions with other entities) in these networks are highly influential and might be preferred diagnostic or therapeutic targets. In fact, in pace with the explosion of Omics data, a number of pathway interaction databases have been developed to facilitate such integrated bioinformatics analysis. For example, Yue et al. (14) constructed a "focused" interaction network by mapping 21 differentially expressed proteins derived from proteomics studies onto the Unified Human Interactome database (UniHI, see Note 1). Within the "focused"
interaction network, these authors were able to identify the 14-3-3 regulatory protein family as a potential target for antitumor drugs. It is thus clear that pathway analysis, when integrated into the analysis of Omics data, is a promising and powerful approach for the identification of pertinent targets and mechanisms underlying human diseases.
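Because the cited enrichment tools (GSEA, IPA) are interactive or commercial, a simple over-representation test conveys the core idea. The sketch below is a minimal illustration assuming a hypergeometric null model; the gene symbols and the pathway membership list are illustrative placeholders, and the background of 20,000 human genes is an assumption.

```python
from scipy.stats import hypergeom

def pathway_enrichment(hits, pathway_genes, background_size):
    """P value for observing at least this overlap between a gene list
    and a pathway by chance (hypergeometric over-representation test)."""
    overlap = len(set(hits) & set(pathway_genes))
    p = hypergeom.sf(overlap - 1, background_size,
                     len(pathway_genes), len(hits))
    return overlap, p

hits = ["MMP9", "PLAU", "FN1", "KLK3", "GAPDH"]      # e.g., deregulated entities
pathway = ["MMP9", "PLAU", "FN1", "MMP2", "MMP13"]   # e.g., an ECM-degradation set
print(pathway_enrichment(hits, pathway, background_size=20000))
```

A small P value indicates that the list shares more members with the pathway than expected by chance, which is the signal that GSEA-like methods formalize with ranked enrichment scores.

2.4. Tissue Specificity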
Tissue specificity can be defined as the preferential expression of certain entities (genes or proteins) in only one particular tissue type (15). Current Omics data analysis has been largely dedicated to detecting quantitative expression differences of entities between two pathological categories (e.g., prostate tumor vs. normal prostate tissue) or between treated and untreated conditions. However, past records indicate that some failed drug developments can be attributed to the poor tissue specificity of the selected targets (e.g., inability to accumulate a therapeutically effective dose within the targeted tissue owing to poor tissue specificity; serious side effects consequent to accumulation within normal tissues). Therefore, attention should be focused on identifying entities with preferential expression in selected tissues as potential clinically useful targets. For example, Yang et al. (16) successfully identified lists of tissue-specific blood-borne markers through a combined mining strategy using microarray databases and a curated pathway knowledgebase. First, all significantly upregulated genes in cancer with controlled GO subcellular locations were collected with an FDR cutoff of Q ≤ 0.05. These retrieved genes were then subjected to pathway analysis, and only those putative markers encoding proteins secreted into blood/serum/plasma were kept in the list. Using this approach, a multiple-comparison study of the retrieved markers across six common human tumor types led to the identification of a panel of tissue-specific markers for each tumor type. Several of the selected and prioritized tissue-specific markers (e.g., LEP, SMO, MMP2, CD44, FAS, and NOTCH4) had already been identified and used clinically. While the importance of tissue specificity should not be underestimated in the course of target discovery, such studies might also shed light on the relevant pathogenesis of human diseases and suggest novel therapeutic targeting strategies.
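The two-step filter of Yang et al. (16) can be mimicked in a few lines. The sketch below is a minimal illustration assuming a flat export table; the file name and column names (gene, tumor_type, q_value, go_location) are hypothetical, and real database exports will differ.

```python
import pandas as pd

# Hypothetical export with columns: gene, tumor_type, q_value, go_location
df = pd.read_csv("oncomine_export.csv")

extracellular = {"extracellular space", "extracellular region", "plasma membrane"}
passed = df[(df["q_value"] <= 0.05) & (df["go_location"].isin(extracellular))]

# Treat a gene as tissue-specific if it passes the filter in only one
# of the tumor types compared.
counts = passed.groupby("gene")["tumor_type"].nunique()
print(counts[counts == 1].index.tolist())
```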
3. Methods
3.1. EMCIT Technology
We start this part with a running example from our lab of identifying cancer targets through bioinformatics analysis, intended to give experimentalists highly practical data analysis experience. Briefly, we have been developing a technology which relies on
tumor-mediated in situ hydrolysis of water-soluble, low-molecular-weight radioactive prodrugs into water-insoluble molecules within the extracellular compartment of solid tumors and metastatic lesions (17–19). The uniqueness of this Enzyme-Mediated Cancer Imaging and Therapy (EMCIT) technology is that it enables the permanent localization and accumulation of the imaging, or radiotherapeutic, molecules within the extracellular spaces of a solid tumor. In the latter situation, tumor cells within the range of the emitted charged particles (e.g., energetic electrons, alpha particles) are killed, while distant normal cells are spared. In the EMCIT concept, the radioactive prodrugs need not be internalized by cells. Consequently, appropriate candidate targets for EMCIT technology should have the following characteristics: (a) they should be aberrantly overproduced by tumor cells compared to healthy cells; (b) they should be hydrolases located within the extracellular compartment of tumor cells; (c) they should be specifically involved in certain tumorigenesis processes rather than being randomly spread throughout other biological processes; and (d) they should be ascertained by both Omics data analysis and precedent literature studies.
3.2. Bioinformatics Analysis Strategy
Two approaches, microarray data analysis and literature data analysis, were used in parallel to mine and validate extracellular hydrolases as EMCIT targets (Fig. 1). Microarray data analysis is based on the retrieval of significantly overexpressed genes, with a controlled statistical cutoff value, from publicly accessible cancer microarray databases. Literature data analysis is based on the retrieval of entities (genes or proteins) from PubMed abstracts. In both approaches, a set of controlled GO terms is used to select extracellular entities, including those secreted or anchored on the plasma membrane. Potential protein targets might be implicated in a disease process owing to their particular functions or their ability to interact with other proteins. A bioinformatics strategy to extend the candidate pool and reveal all potential targets is to search for functionally related neighbors and entities having direct biological interactions (20). Consequently, pathway analysis is integrated into both approaches to enlarge and enrich the retrieved entities based on related functions and direct biological interactions. Furthermore, pathway analysis allows us to understand complex disease processes, such as tumor growth and progression, within the context of complex molecular networks, which is very important for the detection of suitable EMCIT targets (and very likely any target). Lastly, our integrated bioinformatics analysis identifies a short list of entities ascertained by both approaches as potential EMCIT targets. The literature data analysis is also used to validate the gene-level profiles on the basis of published evidence, as literature mining queries different aspects of biological knowledge than microarray data analysis.
[Fig. 1 flowchart, recoverable content only: two parallel six-step workflows. Microarray arm: filter 14 Oncomine datasets by tissue type <<pancreas>> and analysis type; 5 microarray datasets with 3,402 upregulated genes; filter by Gene Ontology and Q value (566 genes); import into IPA and resolve duplicates (403 entities); enlarge by functionally related entities and biomolecular networks (829 entities); identify IPA locations <<extracellular space>> or plasma membrane (396 entities) and IPA families: peptidases, phosphatases, and other IPA enzymes. Literature arm: filter 168,812 abstracts by keywords <<extracellular>> or <<membrane>> (5,379 abstracts, 1,257 entities); filter by Gene Ontology (456 entities); import into IPA and resolve duplicates; enlarge by functionally related entities and biomolecular networks (778 entities); identify IPA locations (404 entities) and IPA families.]
Fig. 1. Scheme of an integrated bioinformatics analysis strategy for identification of cancer-associated entities within the context of EMCIT technology, using pancreatic cancer as example.
For example, a literature meta-review method has been applied to validate and rank a panel of candidate genes as potential specific thyroid cancer markers (21).
3.3. Microarray Data Analysis
To identify EMCIT-suitable targets, we have devised an integrated microarray data analysis strategy involving the Oncomine microarray database and a curated pathway knowledgebase. Oncomine is a cancer microarray database presently incorporating 392 independent microarray studies, totaling more than 28,880 microarray experiments, which span 41 cancer types. It is unique in that it provides differential expression analyses comparing most major types of cancer with respective normal tissues. More importantly, and as stated above, Oncomine is integrated with a GO annotation filter (22), which permits users to identify genes with particular biological processes, molecular functions, and subcellular locations. Thus, in comparison with other cancer microarray data sources, Oncomine is more experimentalist-friendly for in-depth data analysis on a per-gene basis (see Note 3). A typical analysis may follow the scheme outlined in the following section.
First, for each of the six common cancer types (prostate, breast, lung, colon, ovary, and pancreas), all genes upregulated in cancer versus normal tissue samples are collected, irrespective of the different microarray platforms used in these studies. The frequency with which these cancer genes are upregulated in different studies is also ignored, to keep the list as complete as possible, although this most likely introduces some inherent noise. Second, the lists of upregulated genes are further filtered by a combination of relevant GO terms implying extracellularity and a stringent corrected Q value (FDR) cutoff of Q ≤ 0.05. The GO terms used in the filtering (<<extracellular space>>, <<extracellular region>>, <>, <>, and <>) are checked against the GO database in order to identify all relevant hits encoding secreted or membrane-bound proteins. Other researchers could apply other GO filters to identify targets according to their specific research interests and envisioned target properties. The Q value cutoff is chosen to increase the likelihood that the filtered genes are truly upregulated in tumor compared to normal tissues. Next, the filtered genes are imported and mapped in the Ingenuity Pathways Analysis (IPA) program (Ingenuity Systems, Mountain View, California, USA) (see Note 4). Duplicates and genes with inappropriate locations, such as "cytoplasm" or "nucleus" (defined and curated by human experts in the IPA system), are first eliminated from the list, since we are interested in mining overexpressed genes within particular cellular locations (extracellular, cell surface, plasma membrane, GPI-anchored). The remaining list of entities is then enriched by related functions and direct biological interactions embedded in seven publicly accessible interaction databases implemented within the Ingenuity system: the Biomolecular Interaction Network Database (BIND), the Biological General Repository for Interaction Datasets (BioGRID), the Database of Interacting Proteins (DIP), the IntAct protein interaction database, the Human Interactome Map (HiMAP), the Molecular Interaction database (MINT), and the Munich Information Center for Protein Sequences (MIPS) (see Note 1 for more details about publicly accessible interaction databases). Alternatively, researchers could manually enrich and enlarge the entities using other interaction resources according to their own particular interests. For instance, investigators could choose to enrich the entities by protein–protein domain interactions embedded in the Structural Classification of Protein–Protein Interfaces (SCOPPI, http://www.scoppi.org). Next, in order to select genes with an indication of significant overexpression across all datasets, we use the abs(t) value to rank and prioritize the entities from the enriched lists. In Oncomine, each gene is assessed for differential expression with Student's t test, calculated as t = (ȳ1 − ȳ2)/d, where ȳ1 and ȳ2
are the mean expression values in disease and normal conditions, respectively, and d is the common variance of the two distributions normalized across different studies. The t value should therefore be a good indicator of the degree of change in expression level in cancer tissue relative to its normal tissue, similar to the fold change value (see Note 5). Finally, the ranked hydrolases from the lists, defined in the Ingenuity system as <<peptidases>>, <<phosphatases>>, and <<other enzymes>>, are collected (e.g., in the form of an Excel list) as candidate entities for further analysis.
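A minimal sketch of this ranking step is given below; the pooled standard-error term d is our stand-in for Oncomine's study-normalized common variance, which is computed internally and not exposed, and the expression values are illustrative.

```python
import numpy as np

def t_statistic(disease, normal):
    """t = (mean_disease - mean_normal) / d, with d a pooled standard error."""
    disease = np.asarray(disease, dtype=float)
    normal = np.asarray(normal, dtype=float)
    d = np.sqrt(disease.var(ddof=1) / disease.size + normal.var(ddof=1) / normal.size)
    return (disease.mean() - normal.mean()) / d

expr = {  # illustrative log-expression values per gene (disease, normal)
    "MMP9": ([8.1, 7.9, 8.4, 8.0], [5.2, 5.5, 5.1, 5.4]),
    "ACTB": ([9.0, 9.2, 8.9, 9.1], [9.1, 9.0, 9.2, 8.9]),
}
ranked = sorted(expr, key=lambda g: abs(t_statistic(*expr[g])), reverse=True)
print(ranked)  # genes with the largest abs(t) come first
```

3.4. Literature Data Analysis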
Literature data analysis is used as a parallel approach to mine extracellular hydrolase targets and, in principle, to validate the results from microarray data analysis. GoPubMed (http://www.gopubmed.org) (23) is a Web server which allows researchers to explore PubMed abstracts with Medical Subject Headings (MeSH) keyword searches and GO terms. In our studies, the tissue names prostate/prostatic, breast, lung/pulmonary, colon/colonic, ovary/ovarian, and pancreas/pancreatic are applied to retrieve all published abstracts for the six tumor types. Additionally, the keywords <<extracellular>> or <<membrane>> are employed to implicate the extracellular environment in the search. The abstracts, which are now associated with the GO terms <<extracellular region>> or <<membrane>>, are then retrieved and exported into plain text. GAPSCORE (http://bionlp.stanford.edu/gapscore) (24), a user-friendly program that scans text and identifies the names of genes and proteins based on a natural language processing approach, is applied to retrieve the entities (genes/proteins) from the filtered abstracts. The routine provides a score, and only those entities scored as "excellent" are collected in the list. Next, the retrieved entities (in the form of an Excel list) are imported into IPA to search for entities with related functions and direct biological interactions. Finally, only those hydrolases from the lists defined in the Ingenuity system as <<peptidases>>, <<phosphatases>>, and <<other enzymes>> are collected (also in the form of an Excel list) as candidate entities for further analysis.
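Since GoPubMed and GAPSCORE are interactive Web services, the abstract-retrieval step can only be approximated programmatically. The sketch below is a minimal illustration using NCBI E-utilities via Biopython; the query string is an assumption that reproduces the keyword logic for pancreas, and the e-mail address is a placeholder that NCBI requires you to replace.

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder; required by NCBI

query = ("(pancreas[MeSH Terms] OR pancreatic[Title/Abstract]) "
         "AND (extracellular[Title/Abstract] OR membrane[Title/Abstract])")
handle = Entrez.esearch(db="pubmed", term=query, retmax=100)
record = Entrez.read(handle)
handle.close()

print(record["Count"])       # total number of matching abstracts
print(record["IdList"][:5])  # first PubMed IDs, for fetching abstracts later
```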
3.5. Selection of Putative EMCIT Targets
We now apply the procedure set out above to a practical example of identifying targets for cancer therapy, focusing on the six cancer etiologies. For microarray data analysis, between 211 (colon) and 2,782 (ovary) overexpressed genes for the six cancer types were filtered from the Oncomine database using the stringent Q (FDR) value cutoff and the set of controlled GO terms (Table 1). Six nonredundant subsets of 140 (colon) to 810 (ovary) proteins are retrieved after resolving the duplicates and removing cytoplasmic proteins. From these, 352 (colon) to 1,421 (ovary) entities are identified after enrichment by related functions and direct biological interactions.
Table 1
Number of candidate entities identified in six common cancers from microarray data analysis (a)

                                                          Prostate   Breast    Lung      Colon    Ovary    Pancreas
Oncomine: number of microarray datasets                   82         170       95        41       77       14
Datasets remaining after filtering by analysis type <>    13         6         15        4        7        5
Total number of measured genes (b)                        181,361    121,400   156,767   35,195   97,581   149,309
Total number of upregulated genes                         13,353     12,297    18,188    3,064    19,645   3,402
Entities filtered by Gene Ontology with
  Q value cutoff of 0.05 (c)                              1,930      1,055     2,233     211      2,782    566
IPA: imported entities laid onto global molecular
  network and resolved duplicates                         750        641       771       140      810      403
Enlarged by functionally related entities and
  biomolecular interaction networks                       1,405      1,257     1,323     352      1,421    829
IPA-location: Extracellular space and plasma membrane     578        555       599       150      650      396
IPA-location: Extracellular space                         206        177       203       79       261      153
IPA-location: Plasma membrane                             372        378       396       71       389      243
IPA-family: Peptidases                                    33         38        44        19       42       23
IPA-family: Phosphatases                                  7          3         7         1        9        2
IPA-family: Other enzymes                                 36         25        37        7        35       19

a Completed by 11/01/09
b Sum of measured genes in all datasets filtered by <>
c GO keywords include <<extracellular space>>, <<extracellular region>>, <>, <>, and <>
This procedure shows that the majority of these entities are growth factors, transporters, G-protein-coupled receptors, and hydrolytic enzymes. Subsequently, 27 (colon) to 88 (lung) hydrolase targets are detected and collected for the six tumor types.
Table 2
Number of candidate entities identified in six common cancers from literature data analysis (a)

                                                          Prostate   Breast    Lung      Colon    Ovary    Pancreas
PubMed: abstracts about tissue types                      100,254    224,579   671,023   141,521  151,244  168,812
Abstracts remaining after filtering for keywords
  <<extracellular>> or <<membrane>>                       1,981      4,279     9,489     5,002    4,820    5,379
Entities filtered additionally by Gene Ontology with
  <<extracellular region>> or <<membrane>>                827        1,080     2,008     1,269    1,544    1,257
IPA: imported entities laid onto global molecular
  network (b) and removed duplicates                      392        469       615       483      518      456
Enlarged by functionally related entities and
  biomolecular interaction networks                       697        804       953       782      824      778
IPA-location: Extracellular space and plasma membrane     345        420       530       415      450      404
IPA-location: Extracellular space                         142        178       219       177      189      173
IPA-location: Plasma membrane                             203        242       311       238      261      231
IPA-family: Peptidases                                    27         27        35        24       31       30
IPA-family: Phosphatases                                  5          9         11        7        8        5
IPA-family: Other enzymes                                 13         16        20        17       16       16

a Completed by 11/01/09, including update for period of 11/18/05–01/01/09
b Numbers below correspond to subnetworks of entities designated by IPA-location or IPA-family and are part of the IPA cancer network
Analogously, when the literature data analysis method is used, 45 (prostate) to 66 (lung) hydrolase targets are identified and collected for the six tumor types (Table 2). When the two sets of data are compared, 8 (colon) to 21 (lung) common hydrolase targets are validated by both analysis approaches and designated as putative EMCIT targets (Table 3). The following examples show a few of the promising entities from these lists and serve to validate our analysis strategy.
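The final validation step reduces to a set intersection. The sketch below is a minimal illustration; the gene symbols are small subsets drawn from Table 3, not the complete per-tumor lists.

```python
# Hydrolases from the two independent approaches (illustrative subsets).
microarray_hits = {"ACPP", "ADAM10", "ADAM15", "FN1", "FOLH1", "KLK3", "MMP3", "MMP13"}
literature_hits = {"ACPP", "ADAM15", "FN1", "KLK3", "MMP13", "PLAU"}

# Putative EMCIT targets are the entities ascertained by BOTH approaches.
putative_targets = sorted(microarray_hits & literature_hits)
print(putative_targets)  # -> ['ACPP', 'ADAM15', 'FN1', 'KLK3', 'MMP13']
```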
Table 3
Common extracellular hydrolases for six tumor types ascertained by both microarray data analysis and literature mining approaches as putative EMCIT targets

Name     Description                                            Type

Prostate
ACPP     Acid phosphatase, prostate                             Phosphatase
ADAM10   ADAM metallopeptidase domain 10                        Peptidase
ADAM15   ADAM metallopeptidase domain 15                        Peptidase
FN1      Fibronectin 1                                          Enzyme
FOLH1    Prostate-specific membrane antigen 1                   Peptidase
GNA12    Guanine nucleotide-binding protein alpha 12            Enzyme
HRAS     Harvey rat sarcoma viral oncogene homolog              Enzyme
KLK3     Kallikrein-related peptidase 3                         Peptidase
MMP3     Matrix metallopeptidase 3                              Peptidase
MMP13    Matrix metallopeptidase 13                             Peptidase
OCLN     Occludin                                               Enzyme
SILV     Silver homolog (mouse)                                 Enzyme

Breast
ADAM15   ADAM metallopeptidase domain 15                        Peptidase
CHI3L1   Chitinase 3-like 1 (cartilage glycoprotein-39)         Enzyme
FN1      Fibronectin 1                                          Enzyme
KLK3     Kallikrein-related peptidase 3                         Peptidase
MMP1     Matrix metallopeptidase 1 (interstitial collagenase)   Peptidase
MMP9     Matrix metallopeptidase 9                              Peptidase
MMP13    Matrix metallopeptidase 13 (collagenase 3)             Peptidase
PCSK2    Proprotein convertase subtilisin/kexin type 2          Peptidase
PLAU     Plasminogen activator, urokinase                       Peptidase
PLG      Plasminogen                                            Peptidase
PTPRO    Protein tyrosine phosphatase, receptor type, O         Phosphatase

Lung
ADAM9    ADAM metallopeptidase domain 9 (meltrin gamma)         Peptidase
ADAM12   ADAM metallopeptidase domain 12                        Peptidase
ADAM15   ADAM metallopeptidase domain 15                        Peptidase
ADAM28   ADAM metallopeptidase domain 28                        Peptidase
DPP4     Dipeptidyl-peptidase 4                                 Peptidase
ENPP2    Ectonucleotide pyrophosphatase/phosphodiesterase 2     Enzyme
F2       Coagulation factor II (thrombin)                       Peptidase
F7       Coagulation factor VII                                 Peptidase
F12      Coagulation factor XII (Hageman factor)                Peptidase
FN1      Fibronectin 1                                          Enzyme
HRAS     v-Ha-ras Harvey rat sarcoma viral oncogene homolog     Enzyme
KLK3     Kallikrein-related peptidase 3                         Peptidase
KLKB1    Kallikrein B, plasma (Fletcher factor) 1               Peptidase
LASS1    LAG1 homolog, ceramide synthase 1                      Enzyme
LOXL1    Lysyl oxidase-like 1                                   Enzyme
MMP9     Matrix metallopeptidase 9                              Peptidase
MMP13    Matrix metallopeptidase 13 (collagenase 3)             Peptidase
NEU3     Sialidase 3 (membrane sialidase)                       Enzyme
PLAU     Plasminogen activator, urokinase                       Peptidase
PSEN1    Presenilin 1                                           Peptidase
PTPRO    Protein tyrosine phosphatase, receptor type, O         Phosphatase

Colon
ADAM9    ADAM metallopeptidase domain 9 (meltrin gamma)         Peptidase
ANPEP    Alanyl (membrane) aminopeptidase                       Peptidase
HRAS     v-Ha-ras Harvey rat sarcoma viral oncogene homolog     Enzyme
MMP3     Matrix metallopeptidase 3                              Peptidase
MMP7     Matrix metallopeptidase 7 (matrilysin, uterine)        Peptidase
MMP9     Matrix metallopeptidase 9                              Peptidase
MMP14    Matrix metallopeptidase 14 (membrane-inserted)         Peptidase
PLAU     Plasminogen activator, urokinase                       Peptidase

Ovary
ACPP     Acid phosphatase, prostate                             Phosphatase
ADAM15   ADAM metallopeptidase domain 15                        Peptidase
DPP4     Dipeptidyl-peptidase 4                                 Peptidase
F2       Coagulation factor II (thrombin)                       Peptidase
F12      Coagulation factor XII (Hageman factor)                Peptidase
FN1      Fibronectin 1                                          Enzyme
HRAS     v-Ha-ras Harvey rat sarcoma viral oncogene homolog     Enzyme
LTF      Lactotransferrin                                       Peptidase
MMP7     Matrix metallopeptidase 7 (matrilysin, uterine)        Peptidase
MMP9     Matrix metallopeptidase 9                              Peptidase
MMP10    Matrix metallopeptidase 10 (stromelysin 2)             Peptidase
OCLN     Occludin                                               Enzyme
PLAU     Plasminogen activator, urokinase                       Peptidase
PTPRO    Protein tyrosine phosphatase, receptor type, O         Phosphatase
TBXAS1   Thromboxane A synthase 1 (platelet)                    Enzyme

Pancreas
ADAM15   ADAM metallopeptidase domain 15                        Peptidase
FN1      Fibronectin 1                                          Enzyme
MMP2     Matrix metallopeptidase 2                              Peptidase
MMP7     Matrix metallopeptidase 7 (matrilysin, uterine)        Peptidase
MMP9     Matrix metallopeptidase 9                              Peptidase
PLAU     Plasminogen activator, urokinase                       Peptidase
PSEN1    Presenilin 1                                           Peptidase
PTPRM    Protein tyrosine phosphatase, receptor type, M         Phosphatase
SULF1    Sulfatase 1                                            Enzyme
Acid phosphatase, prostate (ACPP/PAP). ACPP, also referred to as PAP, encodes an enzyme that catalyzes the conversion of orthophosphoric monoesters to alcohol and orthophosphate. It is synthesized under androgen regulation, secreted by the epithelial cells of the prostate gland, and found in seminal fluid. PAP has been detected by both of our data analysis approaches for prostate and ovarian tumors, strongly implicating its potential as an EMCIT target. To this end, we have designed and synthesized ammonium 2-(2-phosphoryloxyphenyl)-6-iodo-4-(3H)-quinazolinone (IQ2-P), a substrate (and a prodrug) for the PAP enzyme present within the extracellular space of solid human tumors (18). We have found that when 125IQ2-P, the radioiodinated form of the water-soluble prodrug, is incubated with PAP, rapid hydrolysis of the compound is observed, as exemplified by the formation of its water-insoluble form, 125IQ2-OH. Furthermore, the incubation of IQ2-P with human LNCaP, PC-3, and 22Rv1 prostate tumor cells results in the formation and entrapment of large fluorescent IQ2-OH crystals. More importantly, no hydrolysis of IQ2-P was seen in the presence of normal human cells. Taken together, these findings show that PAP, ascertained by our bioinformatics analysis approaches, could be a promising target for enabling the active in vivo entrapment of radioimaging and radiotherapeutic compounds within the extracellular spaces of primary solid tumors or their metastases. In addition, the upregulation of PAP in prostate and ovarian tumors might point to the importance of androgen regulation in both disease processes.

Folate hydrolase/prostate-specific membrane antigen (FOLH1/PSMA). FOLH1, also referred to as PSMA, encodes a membrane-bound glycoprotein acting as a glutamate carboxypeptidase which cleaves N-acetyl-L-aspartyl-L-glutamate (NAAG) to N-acetyl-L-aspartate (NAA) and glutamate. Interestingly, among the six common tumor types examined in our bioinformatics analysis approaches, PSMA appears as a tissue-specific target for prostate tumors. For example, through meta-analysis of microarray data, PSMA is found to be overexpressed (P value threshold of ≤0.0001) in 4 of 14 independent microarray studies comparing prostate tumor to normal tissue within Oncomine. More importantly, PSMA is not found to be overexpressed in most other tumor types. Prior experiments have also confirmed that PSMA is abundantly expressed at all stages of prostate tumors and is not shed into the circulation (25). Indeed, a number of sensing (imaging) probes targeting PSMA have been developed for the detection and treatment of prostate cancer, including radiolabeled (99mTc, 11C, 125I, and 18F) urea-based agents and near-infrared fluorescent contrast agents with high affinity, high tumor uptake, and rapid clearance (26, 27). Put together, the unique properties of PSMA indicate its appropriateness as an EMCIT target, and our lab is actively developing radioiodinated quinazolinone-conjugated peptidic prodrugs for the imaging and therapy of prostate tumors.
Matrix metallopeptidases (MMPs). The MMP protein family has been implicated in many human diseases and, in particular, in human cancer, because MMPs are involved in the breakdown of the extracellular matrix in many disease processes (e.g., metastasis). Consistently, our combined microarray data analysis and literature review has identified several MMPs as potential EMCIT targets. For instance, through meta-analysis of microarray data across the six tumor types, we have found that MMP1 is uniquely overexpressed in breast cancer. Indeed, in one microarray dataset, MMP1 was ranked by Student's t test value as the most overexpressed gene in breast cancer. The literature review approach also found consistent evidence for its role in breast tumors. For example, MMP1 was found to be one of the most highly upregulated genes among 540 entities identified by real-time reverse transcriptase polymerase chain reaction (RT-PCR) analysis (28). Moreover, animal studies have suggested upregulation of the MMP1-encoded protein in invasive breast cancer. Consequently, MMP1 could be a promising EMCIT target for the diagnosis and treatment of breast cancer. Similarly, MMP7 and MMP9 have been ascertained as putative targets by both of our analysis approaches. For instance, MMP7 is found to be significantly overexpressed in pancreatic cancer by the microarray data analysis approach. Prior experiments have also demonstrated that elevated levels of the MMP7-encoded protein are frequently found in pancreatic carcinoma and closely associated with the growth and invasion of pancreatic cancer cells (29). On the other hand, MMP9 is prioritized as a highly overexpressed gene in lung cancer by microarray data analysis, and prior literature evidence suggests the potential of MMP9 as a therapeutic target for non-small cell lung cancer (30).

Extracellular sulfatase 1 (Hsulf-1/SULF1). One potential EMCIT target for human pancreatic cancer ascertained by both microarray data analysis and literature review is Hsulf-1/SULF1, a hydrolase that is secreted outside the cell and facilitates apoptosis in response to exogenous stimulation. Remarkably, in two of five independent microarray datasets comparing pancreatic tumor to normal tissues, SULF1 is ranked among the top 5% of the most overexpressed genes within Oncomine. More importantly, we have found through a comparison analysis across the six tumor types that SULF1 is a tissue-specific target in patients bearing pancreatic cancer. To these ends, our lab has designed and prepared a sulfate monoester substrate, 2-(2′-sulfooxyphenyl)-6-iodo-4-(3H)-quinazolinone (IQ2-S), targeting extracellular SULF1 for the imaging and treatment of pancreatic tumors. We have demonstrated the in vitro hydrolysis of IQ2-S by human pancreatic T3M4 tumor cells, whereas no hydrolysis is seen in the presence of normal human cells and other tumor cells, including OVCAR-3 and LNCaP (unpublished results). This is consistent with the presence and upregulation of the Hsulf-1 protein encoded
by SULF1 in the extracellular region of pancreatic cancer cells (31). We therefore anticipate that the continued development of novel radiopharmaceuticals targeting SULF1 will eventually lead to a noninvasive approach for the detection and personalized treatment of human pancreatic cancer.

Plasminogen activator, urokinase (PLAU/uPA). PLAU, also referred to as uPA, encodes a serine protease involved in the degradation of the extracellular matrix and possibly in tumor cell migration and proliferation. Both microarray data analysis and literature review have ascertained the potential of PLAU as an EMCIT target in five cancer types: breast, lung, ovary, colon, and pancreas. In particular, PLAU is ranked among the top 5% of the most overexpressed genes in three of five independent microarray studies comparing lung cancer to normal tissues within Oncomine. The ubiquitous presence of PLAU in human cancers strongly implicates its universal relevance in the pathogenesis of these cancer types. In fact, the binding of uPA by uPAR, a glycolipid-anchored receptor of uPA, activates the Ras/extracellular signal-regulated kinase pathway, leading to tumor cell proliferation, migration, and invasion (32). Consequently, dysregulation of the uPA/uPAR system is strongly coupled with tumor growth and metastasis. Thus, it might be possible to develop "smart" probes targeting the uPA/uPAR system for the detection and treatment of human cancer. Indeed, an early study successfully employed peptide-conjugated fluorescent probes to map tumor-associated uPA activity (33). Furthermore, Li et al. (34) have recently developed a linear 64Cu-labeled peptide targeting uPAR for noninvasive in vivo PET imaging. We anticipate that the development of "smart" probes to quantify the expression of the uPA/uPAR system will benefit the diagnosis and treatment of human cancer and help unravel its possible "universal" mechanisms.
4. Notes
1. Popular databases and tools for integrated bioinformatics analysis:
Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo) is a public microarray data repository which allows Web-based visualization and interpretation of gene expression datasets.
ArrayExpress (http://www.ebi.ac.uk/microarray-as/ae) is a comprehensive database warehousing gene expression profiles and other microarray data, with on-line browsing and mining tools.
Pathguide (http://www.pathguide.org) contains information about 310 biological pathway resources.
Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg) is an integrated database resource consisting of 16 main databases, broadly categorized into metabolic pathways, genomic pathways, and chemical information.
GenMAPP (http://www.genmapp.org) is a software suite designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes.
UniHI (http://theoderich.fb3.mdc-berlin.de:8080/unihi/home) is a comprehensive database of computationally and experimentally derived human protein interaction networks, containing more than ~253,000 distinct interactions between ~22,300 unique human proteins.
PathwayExplorer (http://pathwayexplorer.genome.tugraz.at) is a Web-based package for visualizing high-throughput expression data on biological pathways.
BioGRID (http://www.thebiogrid.org) holds collections of protein and genetic interactions from major model organism species.
BIND (http://www.bind.ca) is a database designed to store full descriptions of interactions, molecular complexes, and pathways.
DIP (http://dip.doe-mbi.ucla.edu) is a database that documents experimentally determined protein–protein interactions.
IntAct (http://www.ebi.ac.uk/intact/main.xhtml) provides a freely available, open-source database system and analysis tools for protein interaction data.
HiMAP (http://www.himap.org) is a Web browser for the human protein–protein interaction map.
MINT (http://mint.bio.uniroma2.it/mint) stores experimentally verified protein–protein interactions.
2. Oncomine (http://www.oncomine.org) is a public-domain cancer microarray platform incorporating 392 independent microarray datasets, which span 41 cancer types.
3. The newest Oncomine version (Oncomine 4.0) has dropped the per-gene Gene Ontology analysis function. Researchers can instead upload their lists of entities into packages such as GoMiner (http://discover.nci.nih.gov/gominer) and analyze them with respect to Gene Ontology.
4. Although the commercial Ingenuity system (http://www.ingenuity.com) was used in our studies, the lists of entities could also be annotated, enriched, and analyzed in biological pathways by other publicly accessible suites, such as the Database for Annotation, Visualization, and Integrated
Discovery (DAVID) (35). DAVID is able to extract various biological features from large entity lists and visualize their interactions on biological pathways (http://david.abcc.ncifcrf.gov/).
5. The number of microarray studies reporting a gene as overexpressed (also referred to as a "vote") could be used as another criterion to measure confidence and rank the gene list. This "voting" strategy has been applied in some meta-analysis studies (21).
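The "voting" criterion in Note 5 amounts to counting supporting studies per gene. A minimal sketch, assuming each study is represented as a set of overexpressed gene symbols (the lists here are illustrative placeholders):

```python
from collections import Counter

studies = [
    {"MMP1", "MMP9", "PLAU"},   # study 1: overexpressed genes
    {"MMP1", "FN1"},            # study 2
    {"MMP1", "MMP9", "KLK3"},   # study 3
]
votes = Counter(gene for study in studies for gene in study)
for gene, n_votes in votes.most_common():
    print(gene, n_votes)        # rank genes by the number of supporting studies
```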
Acknowledgments This work was supported in part by National Cancer Institute grant, Detection of Prostate Cancer Genomic Signatures in Blood (to AIK). Work in the Y. Yang laboratory was supported by Start-up Fund (grant: 3016-893318) at Dalian University of Technology and National Science Foundation in China, Medical Division Oncology Department (grant: 81000975). References 1. Yang Y, Adelstein SJ, and Kassis AI. (2009) Target discovery from data mining approaches. Drug Discov Today 14(3–4), 147–54. 2. Chen X, Ji ZL, and Chen YZ. (2002) TTD: Therapeutic Target Database. Nucleic Acids Res 30(1), 412–5. 3. Zheng C, Han L, Yap CW, Xie B et al. (2006) Progress and problems in the exploration of therapeutic targets. Drug Discov Today 11 (9–10), 412–20. 4. Sams-Dodd F. (2005) Target-based drug discovery: is something wrong? Drug Discov Today 10(2), 139–47. 5. Butcher SP. (2003) Target discovery and validation in the post-genomic era. Neurochem Res 28(2), 367–71. 6. Rhodes DR, and Chinnaiyan AM. (2005) Integrative analysis of the cancer transcriptome. Nat Genet 37, 31–7. 7. Pawitan Y, Michiels S, Koscielny S, Gusnanto A et al. (2005) False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21(13), 3017–24. 8. Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R et al. (2007) Oncomine 3.0: genes, pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 9, 166–80. 9. Li S, Becich MJ, and Gilbertson J. (2004) Microarray data mining using Gene Ontology. Medinfo 107, 778–82.
10. Welsh JB, Sapinoso LM, Kern SG, Brown DA et al. (2003) Large-scale delineation of secreted protein biomarkers overexpressed in cancer tissue and serum. Proc Natl Acad Sci USA 100, 3410–15. 11. Curtis RK, Oresic M, and Vidal-Puig A. (2005) Pathways to the analysis of microarray data. Trends Biotechnol 23(8), 429–35. 12. Bredel M, Scholtens DM, Harsh GR, Bredel C et al. (2009) A network model of a cooperative genetic landscape in brain tumors. JAMA 302(3), 261–75. 13. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A et al. (2003) PGC-1alpharesponsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3), 267–73. 14. Yue QX, Cao ZW, Guan SH, Liu XH et al. (2008) Proteomics characterization of the cytotoxicity mechanism of ganoderic acid D and computer-automated estimation of the possible drug target network. Mol Cell Proteomics 7(5), 949–61. 15. Liang S, Li Y, Be X, Howes S et al. (2006) Detecting and profiling tissue-selective genes. Physiol Genomics 26(2), 158–62. 16. Yang Y, Pospisil P, Adelstein SJ, and Kassis AI. (2008) Integrative genomic data mining for discovery of potential blood-borne biomarkers for early diagnosis of cancer. PLoS ONE 3(11), e3661.
Integrated Bioinformatics Analysis for Cancer Target Identification 17. Chen K, Aowad AF, Adelstein SJ, and Kassis AI. (2007) Molecular-docking-guided design, synthesis, and biologic evaluation of radioiodinated quinazolinone prodrugs. J Med Chem 50(4), 663–73. 18. Pospisil P, Wang K, Al Aowad AF, Iyer LK et al. (2007) Computational modeling and experimental evaluation of a novel prodrug for targeting the extracellular space of prostate tumors. Cancer Res 67, 2197–205. 19. Kassis AI, Korideck H, Wang K, Pospisil P et al. (2008) Novel prodrugs for targeting diagnostic and therapeutic radionuclides to solid tumors. Molecules 13(2), 391–404. 20. Pospisil P, Iyer LK, Adelstein SJ, and Kassis AI. (2006) A combined approach to data mining of textual and structured data to identify cancer-related targets. BMC Bioinformatics 7, 354. 21. Griffith OL, Melck A, Jones SJ, and Wiseman SM. (2006) Meta-analysis and meta-review of thyroid cancer gene expression profiling studies identifies important diagnostic biomarkers. J Clin Oncol 24(31), 5043–51. 22. Harris MA, Clark J, Ireland A, Lomax J et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32, D258–61. 23. Doms A, and Schroeder M. (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res 33, 783–6. 24. Chang JT, Schütze H, and Altman RB. (2004) GAPSCORE: finding gene and protein names one word at a time. Bioinformatics 20(2), 216–25. 25. Schülke N, Varlamova OA, Donovan GP, Ma D et al. (2003) The homodimer of prostatespecific membrane antigen is a functional target for cancer therapy. Proc Natl Acad Sci USA 100(22), 12590–5. 26. Banerjee SR, Foss CA, Castanares M, Mease RC et al. (2008) Synthesis and evaluation of technetium-99m- and rhenium-labeled inhibitors
27.
28.
29.
30.
31.
32. 33.
34.
35.
545
of the prostate-specific membrane antigen (PSMA). J Med Chem 51(15), 4504–17. Humblet V, Lapidus R, Williams LR, Tsukamoto T et al. (2005) High-affinity nearinfrared fluorescent small-molecule contrast agents for in vivo imaging of prostate-specific membrane antigen. Mol Imaging 4(4), 448–62. Poola I, DeWitty RL, Marshalleck JJ, Bhatnagar R et al. (2005) Identification of MMP-1 as a putative breast cancer predictive marker by global gene expression analysis. Nat Med 11, 481–83. Kuhlmann KFD, van Till JWO, Boermeester MA, de Reuver PR et al. (2007) Evaluation of matrix metalloproteinase 7 in plasma and pancreatic juice as a biomarker for pancreatic cancer. Cancer Epidemiol Biomarkers Prev 16, 886–91. Vihinen P, and Kähäri V-M. (2002) Matrix metalloproteinases in cancer: prognostic markers and therapeutic targets. Int J Cancer 99, 157–66. Abiatari I, Kleeff J, Li J, Felix K et al. (2006) Hsulf-1 regulates growth and invasion of pancreatic cancer cells. J Clin Pathol 59, 1052–58. Duffy MJ. (2004) The urokinase plasminogen activator system: role in malignancy. Curr Pharm Des 10(1), 39–49. Law B, Curino A, Bugge TH, Weissleder R et al. (2004) Design, synthesis, and characterization of urokinase plasminogen-activatorsensitive near-infrared reporter. Chem Biol 11(1), 99–106. Li ZB, Niu G, Wang H, He L et al. (2008) Imaging of urokinase-type plasminogen activator receptor expression using a 64Cu-labeled linear peptide antagonist by microPET. Clin Cancer Res 14(15), 4758–66. Huang da W, Sherman BT, and Lempicki RA. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1), 44–5.
Chapter 26
Omics-Based Molecular Target and Biomarker Identification
Zhang-Zhi Hu, Hongzhan Huang, Cathy H. Wu, Mira Jung, Anatoly Dritschilo, Anna T. Riegel, and Anton Wellstein
Abstract
Genomic, proteomic, and other omic-based approaches are now broadly used in biomedical research to facilitate the understanding of disease mechanisms and the identification of molecular targets and biomarkers for therapeutic and diagnostic development. While the Omics technologies and bioinformatics tools for analyzing Omics data are rapidly advancing, the functional analysis and interpretation of the data remain challenging due to the inherent nature of the generally long workflows of Omics experiments. We adopt a strategy that emphasizes the use of curated knowledge resources coupled with expert-guided examination and interpretation of Omics data for the selection of potential molecular targets. We describe a downstream workflow and procedures for functional analysis that focus on biological pathways, from which molecular targets can be derived and proposed for experimental validation.
Key words: Proteomics, Genomics, Bioinformatics, Biological pathways, Cell signaling, Databases, Molecular targets, Biomarkers
1. Introduction
Biomarkers are biological entities or characteristics that can be used to indicate the state of healthy or diseased cells, tissues, or individuals. Nowadays, biomarkers are mostly molecular markers, such as genes, proteins, metabolites, glycans, and other molecules, that can be used for disease diagnosis, prognosis, and prediction of therapeutic responses, as well as for therapeutic development (1–3). Over the past decade, high-throughput technologies, such as genomic microarrays and proteomic and metabolomic mass spectrometry, have been used to generate large amounts of data from single experiments, allowing global comparison of the changes in molecular profiles that underlie particular cellular phenotypes. As a result, the omics-based approaches,
coupled with computational and bioinformatics methods, provide unprecedented opportunities to speed up biomarker discovery and are now widely used to facilitate diagnostic and therapeutic developments for many diseases, particularly cancers (4–10). Potential biomarkers have been identified at various molecular levels, including genetic, mRNA, and protein/peptide, as well as epigenetic (11), miRNA (12), glycan (13), and metabolite (4) levels. For example, using DIGE-based proteomics, potential biomarkers (e.g., PPA2 and Ezrin) were identified as useful for the diagnosis of metastatic prostate cancer (14), and a proteolytic fragment of alpha1-antitrypsin (BF5) was identified as a potential diagnostic and prognostic marker for inflammatory breast cancer as well as a target for potential therapeutic intervention (15, 16). An epigenetic marker, PITX2 DNA methylation, has been reported as a robust paraffin-embedded tissue assay for outcome prediction in early breast cancer patients treated by adjuvant tamoxifen therapy (11). In addition, microRNAs, such as miR-500, have been identified as potential diagnostic markers for hepatic cell carcinoma (17). Increasingly, pathway- and network-based analyses are applied to Omics data to gain more insight into the underlying biological functions and processes, such as cell signaling and metabolic pathways and gene regulatory networks (18, 19). For example, 12 core signaling pathways were shown through genomic analyses to be altered in human pancreatic cancers (18). Network modeling linked breast cancer susceptibility to centrosome dysfunction (20) and led to the identification of a proliferation/differentiation switch in the cellular networks of multicellular organisms (21). These approaches have led to a new trend in biomarker identification in recent years, namely, pathway- and network-based biomarker discovery, which identifies panels of biomarkers, instead of single biomarkers, for practical use in diagnostic and therapeutic developments (22–24). Protein networks have been shown to provide a powerful source of information for disease classification and to help in predicting disease-causing genes (25, 26). Network approaches have also been used to improve the prediction of cancer outcome (27, 28), to provide novel hypotheses for pathways involved in tumor progression (28), and to explore cancer-associated genes (29). In this chapter, we focus on methodology for the identification of molecular targets through functional Omics data analysis, particularly of biological pathways, which can provide more mechanistic insights into the underlying phenotypes and may facilitate therapeutics development. We adopt a strategy that emphasizes the use of curated knowledge resources, and we describe a workflow and procedures, coupled with expert-guided analysis and interpretation, for the selection of potential molecular targets.
2. Materials
Despite the rapid advancement of high-throughput technologies and bioinformatics tools, the functional analysis and interpretation of Omics data remain challenging due to the high variation, low reproducibility, and noise of the data. Although many algorithms and tools have been developed to address these challenges, much is inherent to the long workflows of Omics experiments, from sample preparation and raw data acquisition to data processing and analysis. Many statistical and machine learning methods have been developed for better partitioning or clustering of genes (30–33); however, understanding the biological meaning and functional interpretation of a group of genes/proteins are critical downstream steps in the Omics workflow and are necessary for the design of therapeutic strategies. This downstream functional analysis relies heavily on existing knowledge annotated for genes or proteins and frequently requires expert-guided analysis for appropriate interpretation.
2.1. Bioinformatics Databases
Annotations of genes and proteins integrated from multiple bioinformatics databases are the basis for functional analysis and interpretation of Omics data (34). Numerous gene and protein databases, varying in size and scope, have been developed to provide functional annotations for genes and gene products, as archived over the past decade, e.g., in the "Molecular Biology Database Collection" of the journal Nucleic Acids Research (35). The number of databases and database entries is growing rapidly; e.g., in 2009 the journal archived a total of 1,170 databases, nearly 100 more than in 2008. These databases are divided into 14 general categories, including databases of DNA, RNA, and protein sequences, structure, genomics, proteomics, and metabolic and signaling pathways. Databases most relevant to Omics data analyses include: (1) gene and protein databases, such as UniProt (36) for protein-based annotations, and Entrez Gene (37) and model organism databases (e.g., the Mouse Genome Database) (38) for gene-based annotations; (2) GO annotations, such as GOA, for annotation of gene products with Gene Ontology (GO) terms (39); (3) biological pathway databases, such as KEGG (40) and the Pathway Interaction Database (PID) (41), for annotations of proteins involved in metabolic and signaling pathways (Pathway Commons has been developed as a single point of access for diverse pathway databases); and (4) protein–protein interaction (PPI) databases, such as IntAct (42) and MINT (43), for annotations of proteins involved in physical protein interactions.
2.2. Data Mapping and Integration Tools
Mapping different Omics data types (e.g., gene, mRNA, peptide/protein, metabolite) to the common biological entities (e.g., proteins) is an essential step for deriving comprehensive
annotations for functional Omics data analysis (34). Omics data mapping is accomplished most commonly by ID (database entry identifier) mapping, which allows different but related biological entities to be mapped to the IDs of common entities (e.g., proteins). One of the most common issues in protein mapping is that the relation between different types of biological entities can be one-to-one (e.g., one gene ID to one protein ID) or one-to-many (e.g., one gene ID to two or more protein IDs); this is caused not only by the difference between genes and proteins (e.g., one gene encodes several protein isoforms) but can also result from database redundancy (see Note 1). The UniProt Knowledgebase (UniProtKB) is the main section of UniProt, with comprehensive and high-quality protein sequence annotations (44), and iProClass is an integrated database for all UniProt protein sequences, with value-added annotations integrated from over 100 other databases (45). The UniProt and iProClass databases thus serve as the underlying infrastructure for protein ID mapping (different IDs mapped to UniProtKB protein IDs) and data integration for experimental Omics data. ID mapping based on the two databases allows ~32 commonly used, heterogeneous ID types to be interconverted, and the ID mapping services are available online both at the Protein Information Resource (PIR) (http://pir.georgetown.edu) and UniProt (http://www.uniprot.org). ID mapping data files are also available at PIR for download to perform data mapping offline. Other ID mapping tools include the DAVID gene ID conversion tool (http://david.abcc.ncifcrf.gov/conversion.jsp) (46) and the Protein Identifier Cross-Reference Service (PICR, http://www.ebi.ac.uk/Tools/picr) (47).
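For offline use, the downloadable mapping files can be loaded directly. The sketch below is a minimal illustration assuming the standard three-column, tab-separated UniProt idmapping format (UniProtKB accession, ID type, source ID); the file name corresponds to the human by-organism download, but verify the current paths at PIR or UniProt before relying on it.

```python
def load_mapping(path, id_type="GeneID"):
    """Map source IDs of one type (e.g., Entrez GeneID) to UniProtKB accessions."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            uniprot_ac, typ, source_id = line.rstrip("\n").split("\t")
            if typ == id_type:
                # One source ID may legitimately map to several UniProtKB
                # entries (isoforms, database redundancy; see Note 1).
                mapping.setdefault(source_id, []).append(uniprot_ac)
    return mapping

gene2uniprot = load_mapping("HUMAN_9606_idmapping.dat", id_type="GeneID")
print(gene2uniprot.get("4318"))  # Entrez GeneID 4318 = human MMP9
```

2.3. Functional Profiling and Pathway Analysis Tools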
Various bioinformatics tools are available for functional profiling of Omics data based on annotations of genes and proteins, such as the PIR batch retrieval and functional categorization tool (http://pir.georgetown.edu/pirwww/search/batch.shtml), iProXpress (http://pir.georgetown.edu/iproxpress) (34), DAVID (48), and BABELOMICS (http://babelomics.bioinfo.cipf.es) (49). Annotations used for profiling by these tools include GO terms, pathways, keywords, sequence features, and families, among which GO terms and pathways are the most commonly used: GO has become a common annotation standard, and pathways provide more insightful biological meaning for the data. Moreover, many concepts in other annotations, such as keywords, are covered by GO terms. While most of these tools allow profiling of a single gene/protein list, or two lists for comparison, iProXpress provides comparative profiling of multiple data sets (or data groups) for cross-data set comparison, a very useful feature that accommodates many real-world data analysis issues.
For pathway analysis, mapping experimental data to metabolic and signaling pathways is key for the functional interpretation of Omics data. Curated canonical pathway maps are available in many pathway databases; however, few public Omics analysis tools integrate the maps into their systems to allow experimental data to be superimposed onto the pathway maps. Several commercial pathway analysis systems are available, such as Ingenuity IPA (http://www.ingenuity.com) and GeneGO MetaCore (http://www.genego.com). Although these tools differ in features, such as visualization of canonical pathways and presentation of experimental data mapped onto the pathways, they all have one feature in common, i.e., the integration into their systems of additional pathway and functional association data manually curated from the literature, in addition to the publicly available data in pathway databases such as KEGG and PID.
2.4. Literature Text Mining Tools
Despite the extensive use of annotations from current knowledgebases for functional analysis of Omics data, annotations of genes and proteins lag far behind the rapid growth of the literature, owing to the ever-expanding sequence data and the laborious nature of manual curation. In nearly all Omics experiments, varying numbers of identified genes or proteins lack sufficient annotations in databases to be functionally analyzed, and in such cases the literature becomes the critical source for deriving functional information. Although literature data have been used alone or combined with other Omics data to generate gene/protein association networks (50–52), currently no literature mining tools have been integrated into any pipelined Omics system in a fashion in which computationally extracted data are directly used as annotations for functional data analysis. Nonetheless, literature text mining is an important component of the data analysis workflow and has been used to assist pathway analysis, such as in ResNet of Pathway Studio (53) (http://www.ariadnegenomics.com/products/databases/ariadne-resnet). A variety of text mining tools are available to assist in mining relevant gene or protein data from the literature, and this, coupled with manual searches of PubMed, is often necessary for functional Omics data analyses (see Note 2).
3. Methods
The pathway- and network-based Omics data analysis approach aims to delineate the molecular maps that underlie changes in the biological samples under investigation, and to aid in the discovery of molecular targets and biomarkers for diagnostic and therapeutic developments. Below we describe practical procedures applied to analyses of Omics data related to cell signaling and metabolic pathways, as well as organelle biogenesis.
3.1. Omics Data Analysis Workflow
We focus on the downstream analytical steps of the Omics workflow leading to functional interpretation of Omics data. The workflow begins with a list of gene or protein identifiers or peptide sequences resulting from upstream data processing and analysis (e.g., gene clusters or differentially expressed genes or proteins) and follows steps 1–6 depicted in Fig. 1. The genes or proteins in the list are first mapped to UniProtKB protein identifiers (step 1). Next, functional annotations are derived for the list of genes or proteins (step 2) based on integrated data from multiple bioinformatics databases (step 4), including text mining of the literature for information that has not yet been annotated in databases (step 5). Steps 4 and 5 make maximal use of public knowledge resources. Functional analyses are often conducted using several approaches (step 3) based on the different types of knowledge annotated in bioinformatics databases, i.e., GO profiling, molecular networks, and biological pathways. Among them, GO profiling, while revealing limited biological insights into Omics data, usually covers most of the genes/proteins under analysis (see Note 3).
Fig. 1. A downstream functional analysis workflow for molecular target and biomarker discovery from Omics data.
By contrast, pathway analysis gives more biological insight but is limited by the low coverage of proteins annotated in known canonical pathways (see Note 4). Between GO profiling and pathway mapping lies molecular network analysis of interactions or functional associations between genes or proteins. Finally, molecular targets are inferred from the functional analysis (step 6).
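To make these six steps concrete, the sketch below strings them together on toy data. All lookup tables and helper functions are invented placeholders standing in for the real resources (ID mapping services, UniProtKB/iProClass annotations, text mining output); it illustrates the control flow only and is not the iProXpress implementation.

```python
# Illustrative driver for the six-step workflow in Fig. 1 (toy data only).

def map_to_uniprot(ids):
    # Step 1: map input identifiers to UniProtKB ACs; toy table standing in
    # for the PIR/UniProt ID mapping services (Entrez Gene ID -> AC).
    toy_map = {"7157": "P04637", "672": "P38398"}  # TP53, BRCA1
    return [ac for ac in (toy_map.get(x) for x in ids) if ac]

def annotate(ac):
    # Steps 2 and 4: toy annotations standing in for UniProtKB/iProClass.
    toy_db = {"P04637": {"go": {"response to stimulus"}, "pathway": {"p53 signaling"}},
              "P38398": {"go": {"DNA repair"}, "pathway": set()}}
    return toy_db[ac]

def text_mine(ac):
    # Step 5: literature mining fills gaps left by database annotation.
    return {"pathway": {"DNA damage response"}} if ac == "P38398" else {}

def run_workflow(input_ids):
    annotations = {}
    for ac in map_to_uniprot(input_ids):
        ann = annotate(ac)
        for field, extra in text_mine(ac).items():
            ann[field] = ann[field] | extra
        annotations[ac] = ann
    # Steps 3 and 6: profile the annotations and keep proteins with
    # pathway-level evidence as candidate targets.
    return {ac: ann for ac, ann in annotations.items() if ann["pathway"]}

print(run_workflow(["7157", "672"]))
```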
3.2. Omics Data Grouping
Omics experiments are often carried out under various experimental conditions, from which differential patterns of gene or protein expression are analyzed and potential molecular targets are sought. To assist the subsequent bioinformatics analysis, genes or proteins associated with different experimental conditions are divided into appropriate data groups and assigned notations (Table 1); a parsing sketch for such notations follows the table. Although there is no fixed scheme for the assignment, the notations should clearly distinguish the key conditions under which each experiment is carried out and/or data are collected. Additional considerations apply to the grouping of proteomic data (see Note 5).
Table 1
Proteomics data grouping based on experimental design and methods

Experimental group          Common types and examples
Treatment                   −/+ Radiation; −/+ Estrogen (E2)
Time course                 One time point or multiple (30 m, 1 h, 3 h, 9 h, …)
Cell types                  ATCL8 and AT5BIVA (ref. 54); MCF-7 and MCF-7:5C (ref. 59)
Immunoprecipitation (IP)    Phosphotyrosine (pY) IP; AIB1 IP
Sample separation           1D or 2D gel electrophoresis
Mass spectrometry (MS)      Single MS; tandem MS (MS/MS)
Data type                   Proteomics; mRNA expression microarray
Changes                     Increased or decreased
Notations for groups        A_8_3h_increase – increased on 2D-gel at 3 h postradiation in ATCL8 cells; MS2AIB1_A – identified in lane A using anti-AIB1 IP and MS/MS (MS2) in MCF-7 cells (−E2)
Experimental notes          "ATCL8 6.413, 8-pep, 3 h" – increased 6.413-fold on 2D-gel and identified with 8 peptides, at 3 h in ATCL8 cells; "B11 30K 24K 100 CI90" – identified in lane B of 1D-gel, band 11 at apparent MW 30 kDa, calculated MW 24 kDa, score 100, CI > 90%
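Because each notation encodes several fields, it helps to parse them programmatically. The sketch below assumes the underscore-delimited convention of Table 1 and Subheading 3.7.1 (A = protein level by 2D-gel/MS, B = mRNA level by microarray, 8 = ATCL8 cells); this field grammar is one plausible reading of that convention, not a published specification.

```python
import re

# Parse group notations of the form <assay>_<cell>_<time>_<direction>,
# e.g. "A_8_3h_increase" (Table 1). The decoding tables below are assumed.
ASSAY = {"A": "2D-gel/MS (protein level)", "B": "microarray (mRNA level)"}
CELLS = {"8": "ATCL8 (ATM+)"}  # extend as needed, e.g. for AT5BIVA (ATM-)

PATTERN = re.compile(
    r"^(?P<assay>[AB])_(?P<cell>[^_]+)_(?P<time>\d+[mh])_(?P<change>increase|decrease)$"
)

def parse_group(notation):
    m = PATTERN.match(notation)
    if not m:
        raise ValueError(f"unrecognized group notation: {notation!r}")
    fields = m.groupdict()
    fields["assay"] = ASSAY.get(fields["assay"], fields["assay"])
    fields["cell"] = CELLS.get(fields["cell"], fields["cell"])
    return fields

print(parse_group("A_8_3h_increase"))
# {'assay': '2D-gel/MS (protein level)', 'cell': 'ATCL8 (ATM+)',
#  'time': '3h', 'change': 'increase'}
```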
3.3. Omics Data Mapping and Integration
Since the UniProt and iProClass databases are the data warehouse of the iProXpress system and serve as the underlying infrastructure for Omics data mapping and integration, the list of genes or proteins from the Omics data is mapped to UniProtKB protein entries, referred to as protein mapping, to obtain functional annotations. Protein mapping is primarily based on gene/protein identifiers. For gene expression microarray data, commonly used gene identifiers include the Entrez Gene ID, NCBI gi number, and RefSeq ID. For mass spectrometry (MS) proteomic data, depending on the database selected for protein identification by the search engine (e.g., MASCOT), commonly used identifiers include UniProtKB, IPI, NCBI nr, and RefSeq. Gene and protein IDs are mapped to UniProtKB entries using the comprehensive ID mapping tools available at PIR or UniProt, which convert commonly used gene and protein IDs (such as NCBI's gi number and the Entrez Gene ID) to UniProtKB IDs and vice versa. After protein mapping, all gene or protein IDs from one or more data sets or experimental groups are integrated into a master list of UniProtKB identifiers (ACs or IDs), each associated with the corresponding experimental groups and notes (Table 1). This master list of proteins is the basis for the subsequent functional annotation and analysis using the iProXpress system. Frequently, UniProtKB entry matches are not found for a fraction of the input gene or protein identifiers, a result of the identifier updates or entry deletions that occur in most databases; this is especially common when analyzing legacy data, in which mixed database identifiers are often used. In such cases, the mapping can be based on sequence comparison, or on name mapping if the sequence is not available. For genes, sequence identity and taxonomy information may be used to map gi numbers to UniProtKB IDs, in addition to mapping bridged by EMBL/GenBank protein accessions (34). For MS proteomic data, peptide sequences are matched against all sequences in UniProtKB (see Note 6). When gene microarray and MS proteomic experiments are conducted on the same biological samples under identical or similar conditions, the two Omics data sets are compared after the data are merged through protein mapping. Direct comparison of expression at both the mRNA and protein levels can provide stronger evidence for the underlying changes. For example, a 2D-gel/MS proteomics study identified 412 and 771 proteins that potentially changed in response to radiation treatment in ATM (Ataxia Telangiectasia Mutated) mutated (ATM−) and wild-type (ATM+) cells, respectively, while the corresponding gene microarray study identified 103 and 131 significantly changed genes in the two cell lines (54). Among those genes/proteins, only 13 were commonly identified, including RRM2, the catalytic subunit of ribonucleoside-diphosphate reductase (RR), a rate-limiting enzyme required for the synthesis of dNDPs and thus for DNA synthesis in human (55).
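A minimal sketch of the merge step follows: source identifiers are translated to UniProtKB ACs through a mapping table (here a toy in-memory dictionary; in practice the PIR/UniProt ID mapping files or services), group notations are accumulated per AC, and the resulting master list supports cross-Omics comparisons such as the mRNA/protein overlap described above. The group assignments are illustrative placeholders.

```python
from collections import defaultdict

def build_master_list(datasets, id_map):
    """Merge data sets ({group notation: [source IDs]}) into a master list
    keyed by UniProtKB AC; each AC keeps the set of groups it occurs in."""
    master, unmapped = defaultdict(set), set()
    for group, ids in datasets.items():
        for source_id in ids:
            ac = id_map.get(source_id)
            if ac:
                master[ac].add(group)
            else:
                unmapped.add(source_id)  # candidates for sequence/name mapping
    return dict(master), unmapped

# Toy mapping table (Entrez Gene ID -> AC): 7157 = TP53, 672 = BRCA1.
id_map = {"7157": "P04637", "672": "P38398"}
datasets = {
    "B_8_30m_increase": ["7157", "672"],    # microarray group (illustrative)
    "A_8_3h_increase":  ["672", "999999"],  # 2D-gel/MS group (illustrative)
}
master, unmapped = build_master_list(datasets, id_map)

# Cross-Omics comparison: ACs supported at both protein (A_) and mRNA (B_) level.
both = {ac for ac, groups in master.items()
        if any(g.startswith("A_") for g in groups)
        and any(g.startswith("B_") for g in groups)}
print(both, unmapped)  # {'P38398'} {'999999'}
```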
However, care should be taken in mapping data from genes to proteins because of one-to-many relations and redundancy in the UniProt database (see Note 1).
3.4. Omics Data Annotation and Functional Profiling
3.4.1. Metadata Annotation
As discussed above, the experimental groups in which the genes or proteins were identified, together with additional experimental information, are annotated for all proteins with the proper notations. The annotated data groups are used for direct comparative analysis between selected groups of interest, such as cell types, treatment types, time courses, and Omics data types. The metadata annotation can also be used to limit functional profiling to proteins in selected groups using the iProXpress interface (see below).
3.4.2. Functional Annotation
After protein mapping, rich annotations are compiled for the given Omics data sets in a so-called protein information matrix (Table 2) that captures salient features of the proteins, such as functions, pathways, and protein–protein interactions, derived from the comprehensive protein annotations integrated into the UniProt and iProClass databases.
Table 2
Major categories of a protein information matrix

Major category                                Example data sources(a)

General information
  Protein name                                UniProtKB, RefSeq
  Taxonomy                                    NCBI Taxon
  Gene name                                   UniProtKB
  Keywords                                    UniProtKB
  Function                                    UniProtKB
  Subunit                                     UniProtKB
  Tissue specificity                          UniProtKB
  Bibliography                                UniProtKB, SGD, GeneRIF

Gene-related information
  Genome/gene                                 GenBank, Entrez Gene, MGI
  Gene expression                             GEO, CleanEx
  Genetic variation/disease                   HapMap, OMIM
  Gene regulation                             ISG

Protein function-related information
  Ontology                                    GOA
  Enzyme/function                             KEGG, BRENDA, MetaCyc
  Pathway                                     KEGG, EcoCyc, PID, Reactome
  Complex/interaction                         IntAct, DIP
  Protein expression                          Swiss-2DPAGE, PMG
  Structure                                   PDB, SCOP, CATH
  Feature and posttranslational modifications UniProtKB, RESID, PhosphoSite
  Protein family                              PIRSF, Pfam, COG, InterPro

(a) Detailed data sources are available at http://pir.georgetown.edu/cgi-bin/iproclass_stat
Fig. 2. iProXpress interface for browsing, searching, and functional profiling of Omics data. As an example, the interface displays the proteomic data sets derived from 2D gel and mass spectrometry as well as the gene expression microarray data sets from ATM− (AT5BIVA) and ATM+ (ATCL8) human fibroblast cells (54).
The matrix allows browsing and searching of rich protein information through the iProXpress Web interface, which facilitates detailed examination of the Omics data (Fig. 2). Among the protein annotations, GO terms (molecular function, biological process, and cellular component) and pathways, such as those from KEGG, are the most commonly used for functional profiling.
3.4.3. GO Profiling
Gene Ontology profiling is primarily based on GO slims, cut-down versions of the GO consisting of terms from high levels of the GO hierarchy (http://www.geneontology.org/GO.slims). GO slims are usually derived from terms at the second and third levels of the hierarchy, though sources vary in their selection of additional terms from deeper levels. GO profiling provides a general view of the biology underlying the Omics data and can suggest significant functional categories of genes or proteins for further investigation. For example, 26 genes found to be upregulated in ionizing-radiation-treated ATM+ cells in the gene expression microarray data were profiled using the GO biological process ontology (Fig. 3).
Fig. 3. GO biological process profiling of upregulated genes in ATCL8 cells (ATM+) at 30 min postirradiation. A total of 26 differentially expressed genes are profiled, and the GO categories are ranked by the number of proteins annotated with the corresponding GO terms (frequency); categories with only one protein are partially displayed at the bottom. Encircled in a dashed line are the top six GO categories, which cover 77% of the proteins (20/26); e.g., five genes appear in three to five GO categories (in the box).
The profile shows high representation of proteins in GO categories such as "cell communication," "response to stimulus," and "cell proliferation," in which several proteins are known to be involved in radiation-induced responses, e.g., BRCA1, p53, HDAC1, and STAT3. Because GO slim terms are high level, the genes/proteins profiled under given GO categories often overlap to varying degrees; e.g., the above-mentioned proteins are common to three or more of the top five GO categories (Fig. 3). However, some terms, such as "regulation of biological process" or "biological regulation," are too broad to reveal meaningful biological information (see Note 7).
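At its core, the ranking in Fig. 3 is a frequency count of proteins per GO slim term. A minimal version follows, assuming each protein has already been assigned its (possibly overlapping) slim terms; the assignments shown are toy examples, not the actual study annotations.

```python
from collections import Counter

# protein -> GO slim biological-process terms (toy assignments; real ones
# come from GOA/UniProtKB annotations mapped up to a GO slim)
go_slim = {
    "TP53":  {"cell communication", "response to stimulus", "cell proliferation"},
    "BRCA1": {"response to stimulus", "cell proliferation"},
    "HDAC1": {"cell communication", "response to stimulus"},
    "STAT3": {"cell communication", "cell proliferation"},
}

profile = Counter(term for terms in go_slim.values() for term in terms)
for term, n in profile.most_common():
    print(f"{term}: {n}/{len(go_slim)} proteins")
# Counts across categories sum to more than the number of proteins because
# GO slim categories overlap (see Note 7).
```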
3.4.4. Pathway Profiling
Owing to the overall low coverage of pathway annotations for any given organism, relatively large numbers of proteins are usually missed in pathway profiling of any Omics data set. Nonetheless, pathway profiling can provide significant insight into the underlying biology, particularly when used for cross-data set comparative profiling. For example, in our previous comparison of nine organelle proteomes, including mitochondria, the endoplasmic reticulum (ER), and seven other lysosome-related organelles (56), the KEGG-based pathway profiles showed that the oxidative phosphorylation pathway is prevalent in mitochondria and the N-glycan biosynthesis pathway in the ER (Fig. 4), consistent with the well-established functions of the two organelles. Pathway profiling also led to the identification of the purine metabolism pathway as showing notable differences between radiation-treated and untreated ATM− and ATM+ cells (Fig. 5a) (54).
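Once per-pathway counts are in hand, cross-data set comparison reduces to comparing per-pathway fractions between proteomes. The sketch below uses invented counts purely to illustrate the kind of contrast visible in Fig. 4.

```python
# Pathway profiles of two organelle proteomes as fractions of the proteins
# mapped to KEGG pathways; all counts are invented placeholders.
profiles = {
    "Mitochondria": {"Oxidative phosphorylation": 60, "N-Glycan biosynthesis": 2},
    "ER":           {"Oxidative phosphorylation": 3,  "N-Glycan biosynthesis": 25},
}

for organelle, counts in profiles.items():
    total = sum(counts.values())
    for pathway, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{organelle:12s}  {pathway:26s}  {n/total:5.1%}")
```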
Fig. 4. Comparative profiling of organellar proteomes using KEGG pathways. Proteomes of nine organelles (56) are profiled using KEGG pathways. Although only a small portion of each proteome is covered by the KEGG pathways, the profiles show striking contrasts between organelles: e.g., mitochondria (Mit) and the endoplasmic reticulum (ER) are enriched for the "oxidative phosphorylation" and "N-glycan biosynthesis" pathways (encircled on the left), respectively.
Fig. 5. (a) KEGG pathway profiling of radiation-induced protein expression changes in ATM mutated (ATM−) and ATM wild-type (ATM+) cells at 3 h postirradiation. The "purine metabolism" pathway is encircled; the largest number of differentially changed proteins (up- or downregulated in response to radiation in the two cell lines) is found in this pathway. The profile is a partial display; the remaining pathways have small numbers of proteins and no striking differences between groups. The figure is adapted from Hu et al. (54). (b) Mapping of radiation-induced protein changes onto the purine metabolism pathway. Enzymes in the KEGG reference map are represented by Enzyme Commission numbers (EC numbers, e.g., 1.17.4.1). Enzymes labeled with a diamond shape are those identified in human; all others are known to be absent in human. Enzymes with up-tilted arrows are upregulated in ATM+ cells; those with down-tilted arrows are downregulated in ATM− cells; the enzyme with double down-tilted arrows is downregulated in both cell lines. Upper left, biochemical steps surrounding dADP/dATP; upper right, biochemical steps surrounding dGDP/dGTP; bottom, illustration of the rate-limiting step in dATP or dGTP synthesis from the reduction of ADP or GDP, respectively, catalyzed by RRM2 in human.
3.5. Pathway Mapping and Visualization
One key step in functional Omics data analysis is pathway mapping, the process of mapping genes/proteins detected in Omics experiments to the corresponding proteins annotated in canonical pathways. Various software tools are available for pathway mapping, including iProXpress, DAVID, and commercial tools such as IPA (http://www.ingenuity.com) and MetaCore (http://www.genego.com). Visualization of the mapped pathways greatly facilitates comparative analysis and understanding of the underlying differences across experimental groups, and is thus critical for identifying potential molecular targets. Visualization of mapped pathways is provided as part of several software systems; e.g., mapped proteins in canonical pathways are highlighted in a distinct color (for one experimental condition, as in IPA) or labeled with the experimental conditions under which they were detected (as in MetaCore). Recently, KEGG released a standalone tool, KegArray, for mapping gene expression profiles onto pathways and genomes (57). Different pathway tools should be used in combination to maximize the identification of potential pathway-based targets, because pathways annotated in different databases vary in content and boundaries (see Note 8). We have used the iProXpress, KEGG, IPA, and MetaCore pathway tools for mapping and/or visualization of metabolic and signaling pathways in several proteomic and functional genomic studies, including those on organelle biogenesis (58), radiation-induced DNA damage repair (54), and estrogen-induced apoptosis in breast cancer cells (59). Pathway mapping can identify the specific steps in which the proteins participate and the roles they may play.
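Pathway membership can also be queried programmatically. The sketch below uses the KEGG REST interface, which became available after most of the tools discussed here; the endpoint and output format are stated to the best of our knowledge and should be verified against current KEGG documentation before use.

```python
import urllib.request

def kegg_pathways_for_enzyme(ec_number):
    """List KEGG pathways linked to an EC number, e.g. 1.17.4.1
    (ribonucleoside-diphosphate reductase, cf. Fig. 5b)."""
    url = f"http://rest.kegg.jp/link/pathway/ec:{ec_number}"
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().strip().splitlines()
    # Each line is tab-delimited: "ec:1.17.4.1<TAB>path:..."
    return [line.split("\t")[1] for line in lines if "\t" in line]

# Expect the purine metabolism map (00230) among the results.
print(kegg_pathways_for_enzyme("1.17.4.1"))
```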
3.6. Literature Mining
For genes or proteins of interest that are derived from the Omics data based on differential expression and/or functional profiling but lack annotated pathway information, literature mining is used to uncover their potential associations with the underlying phenotypes or the pathways involved. Various text mining tools are available to assist literature mining (see Note 2).
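As a quick first pass before dedicated text mining tools, PubMed can be queried programmatically through NCBI E-utilities. The sketch below merely counts citations for a co-mention query, a crude but useful screen for prioritizing manual reading.

```python
import urllib.parse, urllib.request

def pubmed_count(term):
    """Count PubMed citations matching a query via the E-utilities
    esearch endpoint (db, term, and retmax are documented parameters)."""
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    url = base + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "retmax": 0})
    xml = urllib.request.urlopen(url).read().decode()
    return int(xml.split("<Count>")[1].split("</Count>")[0])

# e.g. screen a candidate gene against a phenotype of interest
print(pubmed_count("RRM2 AND radiation"))
```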
3.7. Practical Applications
3.7.1. Examples
We use the functional analysis of Omics data generated from radiation-treated ATM− and ATM+ cells (54) as an example to illustrate the workflow described above. ATM, a serine–threonine protein kinase, plays critical roles in stress-induced responses such as DNA damage repair and cell cycle regulation. Using human fibroblast cell lines expressing a mutated ATM gene (AT5BIVA cells, ATM−) or wild-type ATM (ATCL8 cells, ATM+), the study aimed to better understand ATM-mediated pathways in the response to ionizing radiation, which could facilitate the identification of molecular targets for therapeutic intervention, such as increasing the radiation or drug sensitivity of cancers. The two cell lines were subjected to global expression profiling using gene microarray and 2D-gel/MS proteomics.
Below are the steps used for the analysis.
1. Proteins identified in the output files of the MASCOT search engine (http://www.matrixscience.com) are compiled into one protein list and annotated with the corresponding experimental groups (e.g., cell lines and time points). The database searched by MASCOT is Swiss-Prot (the manually annotated portion of UniProtKB). The differentially changed genes (up- or downregulated genes from the microarray) are mapped from Entrez Gene IDs to UniProtKB accessions (ACs).
2. Some UniProtKB ACs in the protein list from the MASCOT output may need to be replaced by new ACs (usually still identifying the same protein sequence) if the bioinformatics analysis is conducted some time after the MS analysis, when UniProtKB has had newer releases in which protein sequences may have been updated or corrected, or redundant sequences merged. Updated ID mapping files can be downloaded from ftp://ftp.pir.georgetown.edu/databases/idmapping/idmapping.tb.gz and used to obtain an updated experimental protein list. Alternatively, online ID mapping is available at PIR (http://pir.georgetown.edu) or UniProt (http://www.uniprot.org).
3. Functional annotations for the protein list are derived from the iProClass database, which contains comprehensive annotations and is available for download at ftp://ftp.pir.georgetown.edu/databases/iproclass/iproclass.xml.gz. An output data file is generated that contains all identified proteins, their corresponding groups and experimental notes, and functional annotations.
4. The data file is browsed, searched, and profiled using the iProXpress interface: http://pir.georgetown.edu/iproxpress (data set: http://pir.georgetown.edu/cgi-bin/textsearch_iprox.pl?data=gu1). Boolean searches (AND, OR, NOT) can be used to display specific experimental groups or proteins matching certain annotations; e.g., using "A_8_3h_increase" OR "B_8_3h_increase" as the "group" query displays proteins that are increased at the protein (2D-gel/MS) or mRNA (microarray) level 3 h after radiation in ATCL8 cells (ATM+), resulting in 160 proteins (Fig. 2); a set-based sketch of this query appears at the end of this subsection. While providing many analytic functions, the interface mainly supports profiling the protein list using GO slims and KEGG pathways.
5. The GO or pathway profiles are examined and compared across experimental groups, for the entire protein list or selected subsets, and the most differential GO categories or pathways are examined. Comparisons can also be made on merged or de-merged groups using the interface; e.g., experimental repeats can be merged into a single group based on experimental conditions.
GO and pathway profiles can also be generated for a single list of proteins/genes using PIR batch retrieval at http://pir.georgetown.edu/pirwww/search/batch.shtml, but without metadata annotations (Fig. 3).
6. The iProXpress interface is used for pathway profiling and shows that the purine metabolism pathway is significantly and differentially represented in radiation-treated and untreated ATM−/ATM+ cells (Fig. 5a). Pathway mapping using KEGG is conducted at http://www.genome.jp/kegg/tool/color_pathway.html, which accepts enzymes of interest as input (using EC numbers, e.g., 1.17.4.1) and generates pathway maps with the input enzymes highlighted in colors corresponding to the different experimental groups (Fig. 5b).
7. For pathway analysis using Ingenuity IPA, the entire protein list from the study is loaded, and "my list" sets of genes/proteins are created for specific experimental groups. Pathway profiles are examined, and pathway maps are analyzed with regard to the positions and relations of specific genes/proteins of interest (e.g., from certain experimental groups) in the pathway; e.g., p53, BRCA1, and Chk1, increased in ATM+ cells after irradiation, map to the G2/M DNA damage checkpoint regulation pathway (Fig. 6a). Because one protein can appear in multiple canonical pathways, the pathway maps should be examined carefully with expert guidance. In addition to canonical pathways, gene/protein networks can be generated based on functional associations annotated in the Ingenuity IPA knowledgebase (Fig. 6b), providing further evidence for ATM-mediated radiation response pathways involving p53, BRCA1, HDAC1, and RRM2.
In summary, through functional profiling and pathway mapping, this example shows that purine metabolism is significantly represented and differentially changed in ATM− and ATM+ cells in response to radiation. The increased expression of RRM2 at both the mRNA and protein levels, and of p53, BRCA1, HDAC1, and Chk1 at the mRNA level, in ATM+ but not in ATM− cells strongly suggests that RRM2 is a downstream target of ATM-mediated radiation response pathways and is required for radiation-induced DNA repair. This is supported by a recent report that upregulation of RRM2 transcription in response to DNA damage in human cells involves the ATR/ATM-Chk1-E2F1 pathway (60). RRM2 is also known to play roles in cell proliferation, tumorigenicity, metastasis, and drug resistance (61). Increased expression of RRM2 has been linked to increased drug resistance, and decreased expression to the reversal of drug resistance in cancer cells (61, 62).
Fig. 6. (a) Ingenuity pathway profiling and mapping of genes/proteins from ATM−/ATM+ cells with or without ionizing radiation treatment. The analysis was performed using Ingenuity IPA. Top, top-ranked pathway profiles (well above the threshold p-value), in which the ratio of genes/proteins detected in the experiment to the total number of proteins annotated in the pathway is given as gray squares. Purine metabolism (encircled on the left) is the third-ranked pathway in the study. Bottom, pathway map of cell cycle G2/M DNA damage checkpoint regulation. BRCA1 and p53 are upregulated at the mRNA level 30 min after irradiation in ATCL8 cells (labeled with a dark triangle shape). Chk1, identified from 2D gel/MS, was increased at 3 h after irradiation in ATCL8 cells (encircled with a dashed line). (b) Gene networks linking RRM2 with DNA damage repair pathway proteins. The functional networks show RRM2 connected to other major DNA repair and cell cycle proteins, such as p53, BRCA1, and HDAC1. Networks were generated using the Ingenuity IPA tool and merged from three subnetworks: one containing RRM2 and HDAC1, one with p53, and the third with BRCA1.
Fig. 6. (continued) The protein or gene nodes labeled with a dark triangle shape are those differentially expressed in the study. The lines (edges) connecting nodes indicate associations between proteins or genes, which encompass interaction, binding, activation, inhibition, etc. Solid lines (edges) are for direct and dashed ones for indirect associations. The figure is adapted from Hu et al. (54).
RRM2 is thus a potential therapeutic target for cancers, e.g., targeting RRM2 to sensitize cancer cells to drugs by enhancing camptothecin (CPT)-induced DNA damage in breast cancer cells (60).
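The Boolean group query in step 4 can be reproduced outside the interface with ordinary set operations over the master list. The group assignments below are illustrative placeholders following the Table 1 convention; the accessions are the UniProtKB entries for p53, BRCA1, and Chk1.

```python
# Master list: UniProtKB AC -> set of experimental group notations (toy data).
master = {
    "P04637": {"B_8_30m_increase"},                     # p53, mRNA level
    "P38398": {"B_8_30m_increase", "A_8_3h_increase"},  # BRCA1
    "O14757": {"A_8_3h_increase"},                      # Chk1, 2D-gel/MS
}

def query_or(master, *groups):
    """OR query: proteins present in any of the given experimental groups."""
    wanted = set(groups)
    return {ac for ac, g in master.items() if g & wanted}

print(query_or(master, "A_8_3h_increase", "B_8_3h_increase"))
# e.g. {'P38398', 'O14757'} (set order may vary)
```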
3.7.2. Pitfalls
Omics-based molecular target and biomarker identification remains challenging, and many limitations exist (e.g., see the review in ref. 63).
1. Proteomics data coverage bias. Missing (false negative) identifications are common in mass spectrometry-based proteomics; experimental repeats, including repeats at the level of sample preparation, often improve the protein identification rate.
The coverage bias also partially accounts for the relatively small overlap between proteomics and gene expression microarray data from identical biological samples (54, 64).
2. Limitations of knowledgebases. Although our approach relies heavily on the annotations in knowledgebases, these curated databases have several limitations. Common shortcomings that can affect the analysis include database entry redundancy, insufficient annotation, and a high proportion of electronically derived annotations. For example, database entry redundancy can cause ambiguous ID mapping (see Note 1), and insufficient annotation can limit the power of functional interpretation of Omics data. In the case of GO annotation, the vast majority of GO terms (~90%) annotated for gene products are inferred from electronic annotation (IEA) (see http://www.geneontology.org/GO.current.annotations.shtml); caution should therefore be exercised when using GO slim profiling.
3. Lack of tissue and/or isoform specificity in pathway annotations. A potential bias in the interpretation of pathway mapping results arises because pathway annotations currently take little account of the tissue specificity of the genes or proteins in a pathway. Thus, specific steps of a pathway may not actually be active in the tissues/cells from which the Omics data were generated. In some cases this occurs because protein isoforms or splice variants, which may be expressed differentially in different tissues/cells, are annotated in the pathway as a protein class or a canonical protein sequence, respectively.
4. Variations in pathway annotations. Because biological pathways are inherently complex and dynamic, pathway annotations in different pathway databases vary significantly in their pathway models and in a number of other respects, e.g., specific protein forms, dynamic complex formation, subcellular locations, and pathway cross talk (pathway boundaries; see also Note 8). Pathway Commons is an effort to provide a link between the disparate pathway databases.
4. Notes
1. Gene IDs such as Entrez Gene numbers are often mapped to multiple UniProt protein entries, some of which correspond to protein isoforms that need to be merged under the entry of the same protein precursor, but most of which result from sequence redundancy in the database. For example, UniProtKB has two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL: the former is manually annotated with minimal redundancy, while the latter is computationally annotated with more redundancy, including fragments of the same gene products.
If the complete proteome annotation is available for an organism (e.g., human), in most cases one can limit the ID mapping to UniProtKB/Swiss-Prot and check any remaining unmapped IDs. Redundant sequence entries can be resolved using UniRef100 and/or UniRef90, which cluster sequences of 100% or 90% identity into one group for selection of the appropriate entries (http://www.uniprot.org/help/uniref).
2. Although PubMed is the primary tool for accessing literature citations, some literature mining tools are available to help mine relevant protein data, such as protein–protein interactions (e.g., MetaServer, http://bcms.bioinfo.cnio.es) and protein phosphorylation (e.g., RLIMS-P, http://pir.georgetown.edu/pirwww/iprolink/rlimsp.shtml). In addition, gene or protein synonyms can be identified using BioThesaurus (http://pir.georgetown.edu/iprolink/biothesaurus), which helps identify more relevant literature in PubMed for a given gene/protein.
3. GO annotations have high coverage for a given genome; e.g., currently >88% of human proteins in UniProtKB/Swiss-Prot are annotated with GO terms (Table 3). Overall, the vast majority of GO terms (~90%) are annotated based on computational inference (evidence code IEA, Inferred from Electronic Annotation; http://www.geneontology.org/GO.evidence.shtml). Manual GO annotation remains laborious.
Table 3
Numbers of UniProtKB/Swiss-Prot entries with functional annotations

                                         Ontology         Pathway                                     PPI
Organism (Taxon ID)    # Total entries   GO(b)            KEGG           PID          Reactome       IntAct
Mammal (40674)         64,813            59,865           14,289         1,652        3,834          8,281
Human(a) (9606)        20,328            18,049 (88.8)    4,925 (24.2)   1,649 (8.1)  3,790 (18.6)   6,423 (31.6)
Mouse(a) (10090)       16,204            14,955 (92.3)    3,685 (22.7)   N/A          N/A            1,467 (9.1)
Rat (10116)            7,449             7,060            2,415          N/A          N/A            304

All numbers are derived from the iProClass database as of November 24, 2009. N/A, not applicable, because only human proteins and pathways are annotated in the PID and Reactome pathway databases. GO, Gene Ontology; PPI, protein–protein interaction.
(a) The complete human proteome has been annotated in UniProtKB/Swiss-Prot (Human Proteome Initiative project), and the mouse proteome also has high coverage compared with rat and other mammals.
(b) GO annotations include all evidence codes, including IEA.
(c) Numbers in parentheses are the percentage of annotated proteins over the total number of entries for the corresponding species.
4. In general, only a small percentage of a proteome has been annotated with pathways; thus, depending on the data sets being analyzed, the pathway coverage for a given Omics data set varies. For human, currently only about one quarter of proteins are covered by pathway databases, including KEGG, PID, and Reactome (Table 3). Although integrated into Pathway Commons (http://www.pathwaycommons.org), PPI data are not part of annotated pathways, but they can be used to generate protein interaction networks.
5. Another aspect of dividing experimental data concerns dividing proteins identified by mass spectrometry, such as MALDI-TOF, into groups identified with high (>90%) or low (<90%) confidence intervals (CI), assigned by statistical processing of MASCOT search results by software such as GPS Explorer™, to increase the probability of true target identification. Low CI values can result from factors such as the size of the database used by the search engine, protein abundance, and the type of mass spectrometry instrument. Furthermore, MS proteomic data often require additional filtering for appropriate analysis. For example, a number of proteins deemed nonspecific (e.g., keratins) are frequently detected regardless of the underlying experiment, which can be caused by sample contamination and/or detection bias toward highly abundant proteins; such proteins are therefore often removed from the analysis. Proteins identified from a 1D gel that migrate at an apparent molecular weight (MW) deviating strongly from the calculated MW can also be removed, albeit with the caveat that protein degradation or aggregation may have occurred at or before gel electrophoresis. These practices are currently applied in an ongoing study investigating E2-induced apoptosis pathways in breast cancer cells (59).
6. A two-step procedure is generally used for peptide mapping: direct sequence mapping, followed by redundancy reduction using UniRef90 clusters (http://www.uniprot.org/help/uniref) (65). Sequences in UniProtKB with 90% or more sequence identity are grouped in a UniRef90 cluster, and proteins within a UniRef90 cluster are likely to have the same function. For a peptide matching more than one UniProtKB sequence, if the matching sequences are in the same UniRef90 cluster, the peptide is mapped to the representative sequence of the cluster.
7. Some GO terms appear at high frequency for nearly any given list of proteins, such as "GO:0065007: biological regulation," and thus reveal little specific functional information about the proteins being profiled. In such cases, statistical testing of the functional enrichment of GO terms is provided by tools such as DAVID (http://david.abcc.ncifcrf.gov/summary.jsp); a minimal enrichment test is sketched after these notes.
In some cases, a pie chart of GO terms is used to depict the functional categories of a list of proteins. This should be interpreted with caution because the GO categories are not mutually exclusive, especially with regard to molecular functions and biological processes. A list of proteins can also be categorized based on keywords, functions, and other information from the literature, as well as guided by experts.
8. Biological pathways are inherently complex, and cross talk between pathways is frequent. Pathways are often annotated using different models in different pathway databases. Among the differences, the boundary drawn around the same core pathway differs most notably between databases, depending on which additional proteins known to interact with the core pathway are included. For example, 62 proteins are included in TGF-beta signaling in the PID database (http://pid.nci.nih.gov), while 40 are found in Reactome (http://reactome.org). Combining pathway data from different databases therefore gives better coverage of the proteins to be analyzed, even when the databases share the same core pathways.
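The enrichment testing mentioned in Note 7 typically amounts to a one-sided Fisher's exact test on a 2 × 2 table per GO term (in the list vs. in the background, annotated vs. not), followed by multiple-testing correction across terms. A minimal sketch with SciPy follows; the counts in the example call are placeholders.

```python
from scipy.stats import fisher_exact

def go_term_enrichment(k, n, K, N):
    """One-sided Fisher's exact test for a single GO term.
    k: annotated proteins in the list, n: list size,
    K: annotated proteins in the background, N: background size."""
    table = [[k, n - k],
             [K - k, (N - n) - (K - k)]]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return p_value  # correct across all tested terms, e.g. Benjamini-Hochberg

# e.g. 5 of 26 list proteins vs. 200 of 20,000 background proteins
print(go_term_enrichment(5, 26, 200, 20000))
```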
Acknowledgments

The work has been supported in part by Federal funds from the National Cancer Institute (NCI), National Institutes of Health (NIH), under Contract No. HHSN261200800001E (Z.Z.H.), by NCI grant P01CA074175 (A.D.), by NIH grant U01-HG02712 (C.W.), and by the Department of Defense Breast Cancer Research Program W81XWH-06-10590 Center of Excellence Grant (A.W., A.T.R.). The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

References

1. Ransohoff, D.F. (2003) Cancer. Developing molecular biomarkers for cancer. Science 299, 1679–80.
2. Riesterer, O., Milas, L., and Ang, K.K. (2007) Use of molecular biomarkers for predicting the response to radiotherapy with or without chemotherapy. J Clin Oncol 25, 4075–83.
3. Kim, Y.S., Maruvada, P., and Milner, J.A. (2008) Metabolomics in biomarker discovery: future uses for cancer prevention. Future Oncol 4, 93–102.
4. Tainsky, M.A. (2009) Genomic and proteomic biomarkers for cancer: a multitude of opportunities. Biochim Biophys Acta 1796, 176–93.
5. Hanash, S. (2004) Integrated global profiling of cancer. Nat Rev Cancer 4, 638–44.
6. Souchelnytskyi, S. (2005) Proteomics of TGF-beta signaling and its impact on breast cancer. Expert Rev Proteomics 2, 925–35.
7. Walgren, J.L., and Thompson, D.C. (2004) Application of proteomic technologies in the drug development process. Toxicol Lett 149, 377–85.
8. Tugwood, J.D., Hollins, L.E., and Cockerill, M.J. (2003) Genomics and the search for novel biomarkers in toxicology. Biomarkers 8, 79–92.
9. Merrick, B.A., and Bruno, M.E. (2004) Genomic and proteomic profiling for biomarkers and signature profiles of toxicity. Curr Opin Mol Ther 6, 600–7.
10. Sreekumar, A., Poisson, L.M., Rajendiran, T.M., Khan, A.P., Cao, Q., Yu, J., Laxman, B., Mehra, R., Lonigro, R.J., Li, Y., Nyati, M.K., Ahsan, A., Kalyana-Sundaram, S., Han, B., Cao, X., Byun, J., Omenn, G.S., Ghosh, D., Pennathur, S., Alexander, D.C., Berger, A., Shuster, J.R., Wei, J.T., Varambally, S., Beecher, C., and Chinnaiyan, A.M. (2009) Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 457, 910–4.
11. Martens, J.W., Margossian, A.L., Schmitt, M., Foekens, J., and Harbeck, N. (2009) DNA methylation as a biomarker in breast cancer. Future Oncol 5, 1245–56.
12. Ruan, K., Fang, X., and Ouyang, G. (2009) MicroRNAs: novel regulators in the hallmarks of human cancer. Cancer Lett 285, 116–26.
13. Brooks, S.A. (2009) Strategies for analysis of the glycosylation of proteins: current status and future perspectives. Mol Biotechnol 43, 76–88.
14. Pang, J., Liu, W.P., Liu, X.P., Li, L.Y., Fang, Y.Q., Sun, Q.P., Liu, S.J., Li, M.T., Su, Z.L., and Gao, X. (2010) Profiling protein markers associated with lymph node metastasis in prostate cancer by DIGE-based proteomics analysis. J Proteome Res 9(1), 216–26.
15. Li, J., Zhao, J., Yu, X., Lange, J., Kuerer, H., Krishnamurthy, S., Schilling, E., Khan, S.A., Sukumar, S., and Chan, D.W. (2005) Identification of biomarkers for breast cancer in nipple aspiration and ductal lavage fluid. Clin Cancer Res 11, 8312–20.
16. Zhou, J., Trock, B., Tsangaris, T.N., Friedman, N.B., Shapiro, D., Brotzman, M., Chan-Li, Y., Chan, D.W., and Li, J. (2010) A unique proteolytic fragment of alpha1-antitrypsin is elevated in ductal fluid of breast cancer patient. Breast Cancer Res Treat 123(1), 73–86.
17. Yamamoto, Y., Kosaka, N., Tanaka, M., Koizumi, F., Kanai, Y., Mizutani, T., Murakami, Y., Kuroda, M., Miyajima, A., Kato, T., and Ochiya, T. (2009) MicroRNA-500 as a potential diagnostic marker for hepatocellular carcinoma. Biomarkers 14, 529–38.
18. Jones, S., Zhang, X., Parsons, D.W., Lin, J.C., Leary, R.J., Angenendt, P., Mankoo, P., Carter, H., Kamiyama, H., Jimeno, A., Hong, S.M., Fu, B., Lin, M.T., Calhoun, E.S., Kamiyama, M., Walter, K., Nikolskaya, T., Nikolsky, Y., Hartigan, J., Smith, D.R., Hidalgo, M., Leach, S.D., Klein, A.P., Jaffee, E.M., Goggins, M., Maitra, A., Iacobuzio-Donahue, C., Eshleman, J.R., Kern, S.E., Hruban, R.H., Karchin, R., Papadopoulos, N., Parmigiani, G., Vogelstein, B., Velculescu, V.E., and Kinzler, K.W. (2008) Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321, 1801–6.
19. Zhu, X., Gerstein, M., and Snyder, M. (2007) Getting connected: analysis and principles of biological networks. Genes Dev 21, 1010–24.
20. Pujana, M.A., Han, J.D., Starita, L.M., Stevens, K.N., Tewari, M., Ahn, J.S., Rennert, G., Moreno, V., Kirchhoff, T., Gold, B., Assmann, V., Elshamy, W.M., Rual, J.F., Levine, D., Rozek, L.S., Gelman, R.S., Gunsalus, K.C., Greenberg, R.A., Sobhian, B., Bertin, N., Venkatesan, K., Ayivi-Guedehoussou, N., Solé, X., Hernández, P., Lázaro, C., Nathanson, K.L., Weber, B.L., Cusick, M.E., Hill, D.E., Offit, K., Livingston, D.M., Gruber, S.B., Parvin, J.D., and Vidal, M. (2007) Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat Genet 39, 1338–49.
21. Xia, K., Xue, H., Dong, D., Zhu, S., Wang, J., Zhang, Q., Hou, L., Chen, H., Tao, R., Huang, Z., Fu, Z., Chen, Y.G., and Han, J.D. (2006) Identification of the proliferation/differentiation switch in the cellular network of multicellular organisms. PLoS Comput Biol 2, e145.
22. Bertagnolli, M.M. (2009) The forest and the trees: pathways and proteins as colorectal cancer biomarkers. J Clin Oncol 27(35), 5866–7.
23. Zhang, D.Y., Ye, F., Gao, L., Liu, X., Zhao, X., Che, Y., Wang, H., Wang, L., Wu, J., Song, D., Liu, W., Xu, H., Jiang, B., Zhang, W., Wang, J., and Lee, P. (2009) Proteomics, pathway array and signaling network-based medicine in cancer. Cell Div 4, 20.
24. Ptitsyn, A.A., Weil, M.M., and Thamm, D.H. (2008) Systems biology approach to identification of biomarkers for metastatic progression in cancer. BMC Bioinformatics 9 Suppl 9, S8.
25. Ideker, T., and Sharan, R. (2008) Protein networks in disease. Genome Res 18, 644–52.
26. Loscalzo, J., Kohane, I., and Barabasi, A.L. (2007) Human disease classification in the postgenomic era: a complex systems approach to human pathobiology. Mol Syst Biol 3, 124.
27. Auffray, C. (2007) Protein subnetwork markers improve prediction of cancer outcome. Mol Syst Biol 3, 141.
28. Chuang, H.Y., Lee, E., Liu, Y.T., Lee, D., and Ideker, T. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3, 140.
29. Wang, E., Lenferink, A., and O'Connor-McCourt, M. (2007) Cancer systems biology: exploring cancer-associated genes on cellular networks. Cell Mol Life Sci 64, 1752–62.
30. Do, J.H., and Choi, D.K. (2008) Clustering approaches to identifying gene expression patterns from DNA microarray data. Mol Cells 25, 279–88.
31. Kerr, G., Ruskin, H.J., Crane, M., and Doolan, P. (2008) Techniques for clustering gene expression data. Comput Biol Med 38, 283–93.
32. Weeraratna, A.T., and Taub, D.D. (2007) Microarray data analysis: an overview of design, methodology, and analysis. Methods Mol Biol 377, 1–16.
33. Handl, J., Knowles, J., and Kell, D.B. (2005) Computational cluster validation in postgenomic data analysis. Bioinformatics 21, 3201–12.
34. Huang, H., Hu, Z.Z., Arighi, C.N., and Wu, C.H. (2007) Integration of bioinformatics resources for functional analysis of gene expression and proteomic data. Front Biosci 12, 5071–88.
35. Galperin, M.Y., and Cochrane, G.R. (2009) Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 37(Database issue), D1–4.
36. UniProt Consortium (2009) The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 37(Database issue), D169–74.
37. Maglott, D., Ostell, J., Pruitt, K.D., and Tatusova, T. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33(Database issue), D54–8.
38. Bult, C.J., Kadin, J.A., Richardson, J.E., Blake, J.A., Eppig, J.T., and the Mouse Genome Database Group (2010) The Mouse Genome Database: enhancements and updates. Nucleic Acids Res 38(Database issue), D586–92.
39. Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C., and Apweiler, R. (2009) The GOA database in 2009 – an integrated Gene Ontology Annotation resource. Nucleic Acids Res 37(Database issue), D396–403.
40. Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T., and Yamanishi, Y. (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36(Database issue), D480–4.
41. Schaefer, C.F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., and Buetow, K.H. (2009) PID: the Pathway Interaction Database. Nucleic Acids Res 37(Database issue), D674–9.
42. Aranda, B., Achuthan, P., Alam-Faruque, Y., Armean, I., Bridge, A., Derow, C., Feuermann, M., Ghanbarian, A.T., Kerrien, S., Khadake, J., Kerssemakers, J., Leroy, C., Menden, M., Michaut, M., Montecchi-Palazzi, L., Neuhauser, S.N., Orchard, S., Perreau, V., Roechert, B., van Eijk, K., and Hermjakob, H. (2010) The IntAct molecular interaction database in 2010. Nucleic Acids Res 38(Database issue), D525–31.
43. Ceol, A., Chatr Aryamontri, A., Licata, L., Peluso, D., Briganti, L., Perfetto, L., Castagnoli, L., and Cesareni, G. (2010) MINT, the molecular interaction database: 2009 update. Nucleic Acids Res 38(Database issue), D532–9.
44. Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., and Yeh, L.S. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32, D115–9.
45. Wu, C.H., Huang, H., Nikolskaya, A., Hu, Z., and Barker, W.C. (2004) The iProClass integrated database for protein functional analysis. Comput Biol Chem 28, 87–96.
46. Huang, da W., Sherman, B.T., Stephens, R., Baseler, M.W., Lane, H.C., and Lempicki, R.A. (2008) DAVID gene ID conversion tool. Bioinformation 2, 428–30.
47. Côté, R.G., Jones, P., Martens, L., Kerrien, S., Reisinger, F., Lin, Q., Leinonen, R., Apweiler, R., and Hermjakob, H. (2007) The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 8, 401.
48. Sherman, B.T., Huang, da W., Tan, Q., Guo, Y., Bour, S., Liu, D., Stephens, R., Baseler, M.W., Lane, H.C., and Lempicki, R.A. (2007) DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics 8, 426.
49. Al-Shahrour, F., Carbonell, J., Minguez, P., Goetz, S., Conesa, A., Tárraga, J., Medina, I., Alloza, E., Montaner, D., and Dopazo, J. (2008) Babelomics: advanced functional profiling of transcriptomics, proteomics and genomics experiments. Nucleic Acids Res 36(Web Server issue), W341–6.
50. Li, Y., and Agarwal, P. (2009) A pathway-based view of human diseases and disease relationships. PLoS One 4, e4346.
51. Ozgür, A., Vu, T., Erkan, G., and Radev, D.R. (2008) Identifying gene-disease associations using centrality on a literature mined gene-interaction network. Bioinformatics 24, i277–85.
52. Li, S., Wu, L., and Zhang, Z. (2006) Constructing biological networks through combined literature mining and microarray analysis: a LMMA approach. Bioinformatics 22, 2143–50.
53. Nikitin, A., Egorov, S., Daraselia, N., and Mazo, I. (2003) Pathway studio – the analysis and navigation of molecular networks. Bioinformatics 19, 2155–7.
54. Hu, Z.Z., Huang, H., Cheema, A., Jung, M., Dritschilo, A., and Wu, C.H. (2008) Integrated bioinformatics for radiation-induced pathway analysis from proteomics and microarray data. J Proteomics Bioinform 1, 47–60.
55. Nordlund, P., and Reichard, P. (2006) Ribonucleotide reductases. Annu Rev Biochem 75, 681–706.
56. Hu, Z.Z., Valencia, J.C., Huang, H., Chi, A., Shabanowitz, J., Hearing, V.J., Appella, E., and Wu, C.H. (2007) Comparative bioinformatics analyses and profiling of lysosome-related organelle proteomes. Int J Mass Spectrom 259, 147–60.
57. Wheelock, C.E., Wheelock, A.M., Kawashima, S., Diez, D., Kanehisa, M., van Erk, M., Kleemann, R., Haeggström, J.Z., and Goto, S. (2009) Systems biology approaches and pathway tools for investigating cardiovascular disease. Mol Biosyst 5, 588–602.
58. Chi, A., Valencia, J.C., Hu, Z.Z., Watabe, H., Yamaguchi, H., Mangini, N.J., Huang, H., Canfield, V.A., Cheng, K.C., Yang, F., Abe, R., Yamagishi, S., Shabanowitz, J., Hearing, V.J., Wu, C., Appella, E., and Hunt, D.F. (2006) Proteomic and bioinformatic characterization of the biogenesis and function of melanosomes. J Proteome Res 5, 3135–44.
59. Hu, Z.Z., Kagan, B., Huang, H., Liu, H., Jordan, V.C., Riegel, A., Wellstein, A., and Wu, C. (2009) Pathway and Network Analysis of E2-Induced Apoptosis in Breast Cancer Cells. 100th AACR Conference, Denver, CO, April 18–22, Abstract #3285.
60. Zhang, Y.W., Jones, T.L., Martin, S.E., Caplen, N.J., and Pommier, Y. (2009) Implication of checkpoint kinase-dependent up-regulation of ribonucleotide reductase R2 in DNA damage response. J Biol Chem 284, 18085–95.
61. Zhou, B., and Yen, Y. (2001) Characterization of the human ribonucleotide reductase M2 subunit gene; genomic structure and promoter analyses. Cytogenet Cell Genet 95, 52–59.
62. Zhou, B., Tsai, P., Ker, R., Tsai, J., Ho, R., Yu, J., Shih, J., and Yen, Y. (1998) Overexpression of transfected human ribonucleotide reductase M2 subunit in human cancer cells enhances their invasive potential. Clin Exp Metastasis 16, 43–9.
63. Ransohoff, D.F. (2009) Promises and limitations of biomarkers. Recent Results Cancer Res 181, 55–9.
64. Waters, K.M., Pounds, J.G., and Thrall, B.D. (2006) Data merging for integrated microarray and proteomic analysis. Brief Funct Genomic Proteomic 5, 261–72.
65. Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Mazumder, R., O'Donovan, C., Redaschi, N., and Suzek, B. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34(Database issue), D187–91.
Index 2D gel.......................................... 76, 553, 554, 556, 561, 563 PAGE..........................................................................331 454........................ 18, 23, 201–203, 207, 212, 215, 237, 274
A ABI............................................................23, 200, 273, 274 ab initio............................................................................. 10 Abyss......................................................................................11 Accession..............................8, 12, 42, 47, 134, 385, 386, 392, 394, 464, 475, 554, 561 Accuracy................................5, 43, 79, 114, 156, 157, 176, 186, 193, 203, 205, 208, 214, 225, 228, 270, 271, 334, 354, 360, 371, 447, 449, 489–491, 517, 521 Acetylation............................................................. 184, 252 Additive background multiplicative error model (ABME)............................ 279, 281 Affymetrix....................... 17, 40, 47, 101, 121, 123, 130, 187, 237–240, 242–244, 253, 263, 264, 271, 273, 274, 277, 279, 282, 289, 291, 292, 380, 385, 388, 410, 459, 464, 466, 470, 530 Agilent.............. 238, 253, 254, 257, 261, 265, 272, 273, 292 Algorithm.............. 11–13, 36, 43, 73, 85, 87, 133, 143–147, 149, 154–157, 159–163, 165, 178, 180, 239, 240, 242, 244, 245, 258, 260, 264, 265, 272, 285, 286, 290, 302, 303, 307, 308, 310, 311, 313, 315, 316, 318, 331, 334, 335, 339–341, 344, 346, 357, 358, 368, 371, 372, 383, 395, 404, 405, 422, 436, 439, 441, 454, 501–503, 505–508, 516–517, 521, 522 Alignment, multiple..................17, 260, 302, 303, 306, 307, 309–311, 313, 315–318, 322 Allele, frequency.............................. 220, 223–225, 230, 491 Allpaths............................................................................. 11 Amplification..........................202, 204, 237, 252, 262, 263, 271, 276, 296, 431, 501 Annotation...................... 7, 9–13, 15, 16, 18, 19, 22, 41, 47, 49, 61, 63, 65, 71–91, 177, 187–189, 193, 223, 238, 260, 264, 265, 273, 291, 299, 305, 306, 317, 321, 322, 341, 354, 356, 361, 363, 380, 382–383, 390–391, 401, 402, 410, 416, 420, 421, 428, 429, 439, 462, 466, 469, 521, 533, 549–551, 554–556, 565, 566
ANOVA...................................................129, 159, 386, 524 Antibody........................................... 185, 252, 261, 262, 491 APID...............................................................419, 423, 427 Aracne...............................................................502, 505–508 Array gene expression..........................................271–272, 293 single channel................................................... 275, 281 ArrayExpress......... 17, 18, 22, 23, 27, 34, 47–50, 64, 65, 98, 289, 293, 380 Assay gel-shift.................................................................... 230 reporter............................................................. 230, 271 RNAse protection..................................................... 270 Assembly......10–12, 29, 35, 45, 180, 182, 206, 207, 212, 215 Atlas...............................................................11, 17–19, 293 AUC....................................................................... 371, 524 Audit trail......................................................................... 99 Autocorrelation...............................................158, 166, 167
B BABELOMICS............................................................. 550 Background correction............123, 241, 277, 279, 281, 380, 382, 385 estimation......................................................... 258–259 Base calling..............................................199, 207, 208, 213 Bayes Bayesian.....................156, 159, 161, 162, 165, 191, 213, 227, 242, 287, 344, 382, 446, 453, 500, 502, 503, 508, 514, 515, 522 naive.......................................................................... 514 BED......................................... 214, 246, 247, 258, 264, 303 Benjamini Hochberg...................................................... 119 Bespoke................................................................... 143, 148 BIND....................................... 419, 422, 429, 506, 534, 543 BindN...............................................................315, 320, 323 BiNGO........................................................................... 424 Biocarta.............................288, 383, 419, 421, 429, 436, 459 BioCichlid............................................................... 505, 507 Bioconductor.........................44, 49, 65, 121, 130, 135, 149, 241, 256, 260, 265, 282, 289, 291, 293, 364, 382, 408, 410, 504, 521 Biocyc...................................................................... 366, 400 BIOGRID................................. 28, 419, 422, 429, 534, 543
Bernd Mayer (ed.), Bioinformatics for Omics Data: Methods and Protocols, Methods in Molecular Biology, vol. 719, DOI 10.1007/978-1-61779-027-0, © Springer Science+Business Media, LLC 2011
573
Bioinformatics for Omics Data 574 Index
Bioinformatics...................3, 34, 73, 99, 122, 133, 178, 229, 247, 264, 276, 312, 322, 353, 360, 370, 380, 429, 482, 531, 549 Biomarker........................................................................ 512 Biomart................................ 16, 22, 149, 404, 408–410, 521 Biomodels............................................................... 436, 437 BioPAX............................................ 404, 418, 419, 421, 428 Bioportal..............................................................41, 62, 404 Biostatistics............................................................4, 99, 133 Biotapestry............................................................. 406, 413 Blast............................................................................ 12, 89 BlenX4Bio . .................................................................... 439 Bonferroni............................................................... 227, 382 Boolean............................................ 287, 425, 501, 508, 561 Bootstrap.................................. 130, 162, 167, 287, 371, 382 Bowtie..................................................................... 210, 215 Breakpoint...............................................236, 242, 244–246 BRENDA..................................................................... 9, 20 Bucket......................................................462, 463, 472, 473 Burrows-Wheeler........................................................... 215
C caArray...................................................................50, 65, 98 CAGE............................................................................. 305 Calibration..............................................335, 358–360, 447 Cancer........................ 18, 22, 40, 55, 57, 121, 134, 236, 255, 266, 270, 271, 275, 288, 379, 420, 424, 506, 511, 512, 515–518, 520–524, 527–545, 548, 560, 562, 564, 567 Canonical correlation analysis................................ 139–143 CARMAweb...........................................291, 385, 387, 394 Catalog...............................38, 175, 176, 178, 180, 182, 183 Causality.................................................................. 160, 164 CCD....................................................................... 202, 276 Celldesigner............................................................ 421, 429 CellML................................................................... 436, 437 Centroid/centroiding............... 336–338, 340, 341, 345, 346 ChamS............................................................................ 344 Charge estimation.................................................. 337–340 Chemogenomics............................................................... 10 ChIP ChIP-chip................................. 6, 18, 46, 185, 251–266 ChIP-seq..................................... 5, 6, 46, 185, 251, 252 Chip���������143, 177, 179, 185, 186, 237, 238, 465, 466, 487 Chip Definition File (CDF).................................. 238, 363 Chi-square...............................................128, 226, 289, 391 Chi Square Automatic Interaction Detection (CHAID)..................................... 517 Chromatin, immunoprecipitation................. 6, 186, 251, 252, 403, 405 Chromatography affinity��������������������������������������������������������������������������7 co-immunoprecipitation............................................... 7 pull-down..................................................................... 7 CIBEX............................................................................. 17
Cis-regulatory..................................................190, 253, 405 Classification.................................................................. 354 Classification and Regression Trees (CART)................. 517 ClueGO�������������������������������������������������������������������������424 Clustering��������������9, 12, 14, 17, 44, 48, 134, 135, 145–148, 160–163, 169, 255, 260, 284–286, 317, 318, 405, 418, 422, 427, 444, 518, 519, 524, 549 CMfinder���������������������������������������������������������������316, 317 CNP�����������������������������������������������������������������������236, 238 CoCAS��������������������������������������������������� 259–261, 263, 264 Coding������ 4, 7, 12, 16, 19, 84, 87, 179, 182–184, 193, 205, 220, 223, 229, 300, 302, 305, 410, 418 Collinearity............................................................. 113, 114 Comparative genomic hybridization (CGH)......... 236, 238 Computational biology............................................. 13, 143 Consan���������������������������������������������������������������������������316 Context���������� 5, 7, 21, 23, 46, 64, 77, 90, 99, 102, 103, 105, 107–108, 114, 136, 137, 155, 157, 158, 160, 163, 175, 177, 190, 193, 208, 211, 284, 302, 303, 316, 317, 357–360, 362, 366, 371, 383, 384, 406, 409, 424, 453, 459, 466, 479–495, 505, 514, 520, 521, 528, 532, 533 Contig���������������������������������������11, 180, 187, 212, 215, 236 Controlled vocabulary.................34, 36, 37, 41, 52, 60, 403, 429, 459, 529 Copy number variations (CNV)..............182, 232, 235–247 Correlation matrix�����������������������������������������������������������������������444 partial������������������������������������������������������������������������502 Pearson.............................. 138, 163, 257, 286, 502, 518 CosBiLab���������������������������������������������������������������436, 439 Covariance............................... 141, 157, 162, 163, 315, 317 CpG�������������������������������������������������������������������������������259 CPM������������������������������������������������������������������������������344 Cross-Omics............................................................ 97–110 Cross-validation....................................................... 19, 287 Crystallography, X-ray.................................................. 7, 77 Cytogenetics................................................................... 243 Cytoscape����15, 21, 408, 410, 417, 419, 421–424, 426, 427, 429, 439, 518
D DAS...............22, 408, 410 Data accuracy...............186 analysis...............5, 22, 32, 34, 36, 41, 43–45, 55, 57, 74, 106, 107, 143, 160, 166, 176, 178, 179, 200, 207, 214, 226, 240, 241, 265, 271, 277, 290, 345, 354, 363–365, 390, 392, 400, 407, 511, 519, 523, 529, 531–533, 535–537, 540–542, 550, 551, 560 classification...............517 clustering...............48 exchange...............15, 20, 35–40, 45, 46, 48, 51–53, 55, 57, 59, 60, 62, 63, 77, 90, 100, 200, 403, 404, 411, 428
format...............7, 33, 37, 46, 63, 64, 66, 84, 87, 90, 99, 346, 369, 384, 411 generation...............6, 86, 87, 125, 182, 271, 276–277, 283 harmonization...............35, 57–58, 62, 98 integration...............41, 62, 63, 75, 80, 86, 89, 143, 188, 271, 290, 361, 369, 384, 403–405, 408, 410, 416, 435, 438, 550 integrity...............79, 87, 354, 370 maintenance...............79, 100 management...............71–91, 98, 100, 109, 187, 408, 519 merging...............257, 369, 423 mining...............14, 16, 19, 22, 39–41, 62, 63, 72, 75, 76, 270, 357, 366, 403, 459, 463, 511–524 persistence...............100 preprocessing...............123, 155, 224–226, 278–279, 282, 380–383, 385–386, 488, 513, 522 reduction...............287, 345, 369 repository...............18, 100, 488, 542 retrieval...............98, 100, 105, 385, 417, 422–423 scaling...............358 security...............79, 99 sharing...............4, 15, 19, 31–66, 74, 78, 98, 403 standards...............4, 21, 31–66, 78, 155 storage...............75, 199, 346, 360, 380, 407, 513 structure...............72, 100, 101, 105, 109, 157 transformation...............36, 44, 166, 354, 370 unification...............98 warehouse...............81–83, 399, 554 Database design...............8, 101, 107, 543 relational...............7, 80–83, 86, 88, 90, 91, 107, 109, 408, 484 Database for annotation, visualization, and integrated discovery (DAVID)...............383, 390–392, 503, 504, 507, 544, 550, 560, 567 DBMS...............22, 438 dChip...............242, 381 DDBJ...............45, 76, 89 de Bruijn...............11, 212, 216 Decision tree...............287, 517 Deconvolution...............357, 358, 372 Deisotoping...............337–338, 340–341 Deletion...............17, 183, 199, 209–211, 215, 235, 402, 554 Dendrogram...............282, 285 Density function...............515 plot...............255–256, 262 Diagnosis...............143, 220, 270, 499, 512, 520, 522, 541, 542, 547, 548 Differential display...............270 DIGE...............548 DIP...............19, 28, 419, 422, 429, 534, 543, 555 Distance matrix...............318, 444, 445, 447–449 Ditags...............6
DNA cDNA...............47, 48, 50, 176, 182, 201, 270, 273, 276, 296, 304, 305, 410 motif...............252 Druggable...............527, 528 Dye swap...............258, 260, 261, 263, 275 Dynalign...............303, 316 Dynamics...............4, 153, 162, 164, 165, 301, 302, 366, 367, 406, 415, 416, 424, 425, 428, 441–444, 447, 448, 452
E Eclipse...............81, 91 EDENA...............11 Electrospray ionization (ESI)...............332, 335, 338, 345 ELISA...............491 EMBL...............9, 19, 45, 76, 554 Emulsion polymerase chain reaction (emPCR)...............201, 204 ENCODE...............24, 181–183, 192, 299, 300, 303–305, 402 Enhancer...............261 Ensemble...............166, 314, 318 Entrez...............386, 390, 392, 419, 458, 464, 549, 554, 555, 561, 565 Enzyme-mediated cancer imaging and therapy (EMCIT)...............531–533, 535–542 Epigenetic marks...............192, 251–253, 548 Epigenomics...............5, 24, 45, 184 Error type I...............116, 122, 485 type II...............116, 119, 122, 485 EST...............123, 179, 273, 274, 304, 470 Euclidean...............286, 518 European Bioinformatics Institute (EBI)...............8, 16–18, 22, 23, 26, 34, 41, 48, 59, 60, 89, 289, 362, 380, 383 European molecular biology open software suite (EMBOSS)...............22, 26 EvoFold...............302, 303, 307 Exomics...............79 Exon...............86, 87, 165, 176, 179, 180, 185, 192, 223, 273–274, 278, 284, 291 Explorative/Exploratory...............107, 230, 380, 381, 484, 490, 494 Expression Quantitative Trait Locus (eQTL)...............190, 505, 506
F False negative...............122, 165, 186, 231, 232, 485, 564 False positive...............7, 120, 122, 129, 165, 213, 232, 254, 286, 319, 367, 368, 382, 389, 394, 395, 481, 486, 502, 529 Familywise error rate (FWER)...............116, 118, 119 Far Western Blot...............252 FASTA...............207–208 FASTQ...............18, 207, 208 FDA...............34, 44, 50, 54, 360, 379, 486, 514
FDR...............116, 119–120, 122, 124, 125, 156, 242, 244, 245, 284, 303, 382, 386, 389, 394, 465, 485, 506, 528, 529, 531, 534, 535 Feature detection...............338–339, 341, 345 grouping...............339 FIA...............354–356 Filtering baseline...............335, 340 noise...............335, 340 Fingerprint...............333 FIRMA...............284 Fisher...............289, 392, 465, 506, 507 Fitness...............516 Flow cytometry polychromatic...............72 Fluxomics/fluxome...............174, 351, 400, 404 FOLDALIGN...............303, 308, 316 Fold change/FC...............117, 118, 120–122, 124, 126–128, 256, 275, 280, 382, 389, 470, 485, 486, 535 Forecasting...............450 Fourier...............155, 163, 167, 332, 340, 357 Functional annotation of the mammalian genome (FANTOM)...............182, 299, 305 Functional genomics experiment (FuGE)...............40, 58–60
G Gaussian...............145–147, 255, 259, 262, 266, 335, 340, 345, 355, 357, 453, 502 GBrowse...............77 GEBA...............16 GenBank...............19, 45, 74, 75, 386–388, 459, 464, 554, 555 Gene ontology...............8, 13, 19, 20, 28, 41, 42, 127, 135, 260, 288, 367, 390, 391, 403, 423, 424, 429, 504, 520, 529–530, 533, 536, 537, 543, 549, 556, 566 set enrichment...............127, 289, 367, 530 GeneCards...............76, 84–88, 91, 149, 382 GeneChip...............17, 101, 176, 179, 181, 187, 271–273, 277, 279, 385, 388, 410, 466 Gene Expression Atlas...............18, 293 Gene Expression Omnibus (GEO)...............17, 27, 34, 48, 98, 289, 293, 381, 402, 458, 542 GeneGO...............551, 560 Gene Ontology (GO)...............8, 13, 19, 20, 28, 41, 42, 127, 135, 260, 288, 367, 390, 391, 403, 423, 424, 429, 504, 520, 529–530, 533, 536, 537, 543, 549, 556, 566 GenePix...............254, 277 Gene2pubmed...............462 GeneRIF...............462, 555 Gene set enrichment analysis (GSEA)...............289, 367, 503, 504, 506, 507, 530 Genesis...............66, 285, 286, 440
GeneSpring...............504 Gene symbol...............386 Genetic algorithm...............516–517, 520, 522 Genetic association study...............230 Genetic regulatory modules (GRAM)...............265, 302, 406, 411, 412, 514 GenMAPP...............288, 419, 421, 429, 461, 528, 543 Genome annotation...............10–12, 16, 188, 260 assembly...............29, 45, 182 sequencing...............11, 31, 45, 199, 223, 458 wide...............6, 16, 33, 85, 90, 219, 223, 227, 236–238, 240, 242, 243, 265, 270, 302, 303, 305, 369, 400, 402, 458, 504, 506, 514, 515, 523 Genomic contextual data markup language (GCDML)...............46 Genomics comparative...............10, 16, 45, 190 computational...............480 functional...............33, 40, 56, 58, 61, 77, 176, 366, 369, 480 genotyping...............18, 24, 56, 221, 222, 224, 237, 238, 520 structural...............77, 480 Genotype/genotyping...............18, 24, 40, 45, 46, 50, 56, 213, 221–226, 228, 229, 231, 237–239, 516, 519, 520 GEPAS...............291 Gibbs sampling...............260, 286 GLYcan data exchange (GLYDE)...............403 Glycomics...............185, 351, 400, 402, 403 GoMiner...............12, 459, 543 GOstats...............503 Graph acyclic...............446, 503, 515 directed...............383 inference...............436 sequence...............11 theory...............13, 400, 422, 518, 520 GraphWeb...............421, 429 GWA...............369
H Haplotype...............213, 219, 226, 229 HapMap...............22, 75, 76, 219, 223, 224, 228, 231, 243, 245–247, 555 Hardy–Weinberg equilibrium...............224–226 Health Insurance Portability and Accountability Act (HIPAA)...............57 Heat map...............125–127 Helicos...............203, 204, 213, 215 HGNC...............22, 84, 86 Hidden Markov model...............157, 242, 302, 317 Hierarchy...............41, 75, 101, 102, 109, 404, 424, 459, 461, 556 High-performance liquid chromatography (HPLC)...............331, 332, 334
High-throughput...............3, 5–6, 31, 32, 38, 45–48, 55, 62, 63, 79, 97, 114, 173–177, 179, 181, 182, 185, 187, 188, 193, 203, 270, 277, 284, 289, 305, 319, 331, 352, 360, 363, 364, 379, 383, 418, 435, 441, 451, 458–460, 499, 523, 528, 530, 543, 547, 550 High-throughput sequencing (HTS)...............5, 10, 199–201, 204, 206, 209, 212–215, 274, 291, 292 Histomics...............79 Histone...............184, 185, 193, 251–253, 255, 259, 260, 262, 263, 305, 469 Homolog...............538–539 HomoloGene...............462 HUGO...............392, 459, 462, 467, 471, 474 Human interactome map (HiMAP)...............534, 543 Human protein reference database (HPRD)...............419, 422, 429 Human proteome organization (HUPO)...............20, 38, 40, 42, 51, 58, 63, 78, 346 Hybridization...............6, 24, 38, 46, 47, 176, 180, 185, 186, 236–238, 241, 252, 254, 258, 263, 265, 270, 274–276, 292, 304, 312, 318, 394 Hypothesis/hypotheses...............44, 106, 113–120, 124, 128–130, 137, 156, 159, 160, 166, 167, 184, 213, 229, 230, 289, 290, 353, 367, 368, 380, 400, 441, 460, 462, 465, 473, 480, 481, 483, 487, 502, 515, 548 null...............113, 115–119, 128, 130, 137, 166, 502
I Identification.....................5–7, 9, 12, 18–20, 22, 26, 35, 40, 45, 50, 57, 185, 220, 232, 270, 273–275, 302–320, 323–325, 333, 337, 338, 342–345, 353, 356–358, 361, 370, 382, 386–390, 405, 406, 436, 444, 461, 480, 487, 499–508, 521, 522, 527–544, 547–568 Illumina........18, 23, 202–204, 207–209, 211, 212, 215, 216, 221–224, 237–240, 242, 243, 274 Imaging...................................... 72, 178, 530, 532, 540–542 Imputation................156, 226, 228, 231, 354, 389, 395, 513 Independent component analysis................................... 163 Inference............................78, 100, 135, 155, 158, 161, 162, 165, 166, 191, 339, 344, 359, 435–454, 501–503, 506, 566 Infernal............................................................310, 317, 323 Ingenuity.................. 289, 421, 429, 521, 534, 535, 543, 551, 560, 562, 563 InParanoid...................................................................... 384 Insertion...................................... 17, 83, 209–211, 215, 235 In silico......................................6, 12, 83, 84, 226, 230, 231, 319, 381, 383, 392, 505 In-situ.............................18, 38, 46, 180, 236, 271–273, 532 IntAct.................... 19–21, 28, 186, 332, 419, 429, 481, 528, 534, 543, 549, 555, 566
Integration...............3, 15, 18, 21–22, 32, 41, 45, 57, 62, 63, 71, 72, 75, 78, 80–82, 85–86, 89, 98, 99, 135, 143, 176, 177, 188–192, 204, 206, 271, 290, 337, 360–362, 369, 372, 384, 399–413, 416, 417, 419, 420, 428, 435, 437–439, 446, 453, 528, 549–551, 554–555 Integrative biology...............480 Intensity...............122, 123, 127–129, 201, 224, 236, 238, 242, 255–259, 261, 262, 272, 276, 277, 279–281, 335, 337, 338, 341, 344, 354, 355, 357–359, 366, 394, 444, 485, 486, 491, 518, 522 Interaction protein–DNA...............6, 400, 402, 441 protein–protein...............6, 13–15, 20–21, 28, 35, 186, 229, 352, 384, 401, 402, 418, 419, 466, 471, 500, 506, 518, 543, 549, 555, 566 protein–RNA...............319 RNA–RNA...............318 Interactomics...............4, 6–7, 13–15, 20–21, 79, 186–187 Interactome...............14, 174, 186, 400, 405, 406, 415–430, 506, 530, 534 Intergenic...............220, 223, 230 InterPro...............13, 19, 22, 25, 555 Intron...............185, 189, 223, 291, 303 iProClass...............550, 554, 556, 561, 566 IQR...............120, 124, 128 Isoform...............8, 79, 185, 271, 273, 274, 284, 301, 527, 550, 565 Isotopic/isotope pattern...............337, 338, 341, 358, 365 peak...............336–339, 341
J Jackknife......................................................................... 382 Jacobian........................................................................... 442 JASPAR............................260, 286, 381, 383, 392, 504, 507 Java..................... 22, 39, 64, 65, 91, 104, 109, 110, 161, 265, 342, 408, 483, 484, 507
K Kernel...............135, 142–145, 149, 150, 371 K-nearest...............381, 389, 395, 522 Knowledge...............13, 24, 41, 42, 62, 63, 80, 97–110, 163, 177, 179, 187, 190, 206, 213, 302, 317, 319–321, 353, 358, 361, 366, 400, 404, 416, 418, 420, 422, 428, 437, 438, 441, 446, 447, 459, 486, 488, 500, 503–504, 521, 522, 532, 548, 549, 552 discovery...............5, 7, 522 Knowledge Inference (KInfer)...............436, 447, 448, 453, 454 Kyoto Encyclopedia of Genes and Genomes (KEGG)...............288, 361, 362, 366, 369, 381, 383, 390, 404, 418, 422, 429, 436, 437, 459, 523, 528, 543, 549, 551, 555, 556, 558–562, 566, 567
L Laboratory Information Management System (LIMS)............................................. 76–78, 100 Lasso....................................................................... 489, 490 Learning...........14, 22, 33, 88, 135, 143–146, 154, 158–164, 176, 181, 242, 317, 346, 357, 363, 436, 453, 503, 516, 522, 549 Ligation............................................200–202, 204–206, 237 Likelihood.................80, 128, 213, 319, 340, 447, 452–454, 465, 489, 514, 529, 534 Limit of detection (LOD)...................................... 360, 366 LIMMA...........................117, 119, 130, 256, 263, 284, 292 Linkage disequilibrium (LD)..................223, 224, 227, 228 Link integration............................................................. 438 Lipidomics.................................................79, 351, 400, 402 Literature mining................... 422, 426, 457–475, 532, 538, 551, 560, 566 Localizomics.............................................79, 400, 402, 406 LocARNA...............................................309, 316, 322, 323 Loss of Heterozygosity (LOH).......................239, 242, 243 Lower limit of quantification (LLOQ)...........360, 366, 371 Luminex.......................................................................... 491
M Machine learning....................135, 143, 346, 357, 363, 436, 522, 549 Mann-Whitney.............................................................. 118 Map alignment................................................339, 341–346 Maplot/MA-plot.................................................... 123, 282 MapMan........................................................................ 439 Mapping................. 16, 49, 60, 124, 125, 127, 134, 135, 139, 143, 149, 179, 181, 185, 186, 188, 193, 199, 206, 207, 209–211, 213–215, 219, 223, 227–230, 238, 242, 247, 251, 259, 265, 278, 288, 289, 333, 334, 339, 345, 346, 352, 358, 359, 366, 367, 391, 406, 407, 419, 424, 427, 451, 462, 463, 475, 480, 505, 515, 522, 530, 542, 543, 549–551, 553–555, 559–563, 565–567 MARS............................................................................. 293 MAS............................................................................... 380 MASCOT............................... 339, 345, 382, 554, 561, 567 Massively parallel signature sequencing (MPSS)........... 270 Mass-to-charge (m/z)......332, 335–341, 345, 354, 355, 357, 358, 361, 372, 382, 517 Match......................... 24, 130, 134, 149, 161, 210, 211, 226, 260, 272, 307, 311, 317, 319, 339, 341, 357, 361, 362, 383, 392, 395, 423, 465, 554, 567 Matchminer.................................................................... 134 Mate-pair.................................................203, 207, 208, 212 Matlab............... 135, 139, 149, 150, 364, 406, 507, 518–520 Matrix...................88, 139, 141–144, 147, 157, 158, 160, 162, 163, 165, 314, 318, 332, 335, 345, 355, 359–361, 372, 392, 394, 395, 442, 444, 445, 447–449, 482, 494, 501, 519, 529, 538, 539, 541, 542, 555, 556
Matrix-assisted laser desorption/ionization (MALDI)..................... 332, 335, 345, 522, 567 Maximum expectation.................................................... 286 MaxQuant............................................................... 341, 342 maxT............................................................................... 382 Medical Subject Headings (MeSH)...................... 459–465, 467–474, 535 Medline............................................ 459–462, 464, 474, 475 MEME............................................ 260, 264, 314, 320, 323 MEMERIS.....................................................314, 320, 323 Mendelian................................... 19, 25, 220, 230, 242, 458 Messenger RNA (mRNA)............... 5, 12, 16, 17, 133–135, 137–142, 144–148, 176, 179, 181, 182, 190, 191, 193, 220, 229, 270, 273, 274, 286, 300, 301, 303–305, 312, 313, 319, 352, 382, 384, 393, 400, 480, 500, 501, 503, 504, 520, 548, 549, 553, 554, 561–563 Meta-analysis...........154, 226, 232, 528, 529, 540, 541, 544 Metabolite.............. 4, 31, 35, 52, 59, 79, 154, 157, 158, 160, 164, 178, 187, 352–355, 357–361, 365–372, 399, 400, 402, 407, 480, 486, 488, 547–549 Metabolome/metabonome............... 52, 174, 187, 353, 365, 379, 400, 480, 488 Metabolomics non-targeted..............................................353, 357, 360 targeted..............................................353, 358, 360, 361 Metabotype.................................................................... 369 MetaCore................................................................ 551, 560 Metadata...........20, 23, 36, 37, 45, 48, 51, 57, 59–61, 72, 77, 78, 81, 89, 98, 101–108, 110, 363, 394, 404, 408, 555, 562 Metagenomics................................. 5, 19, 24, 38, 45, 46, 75 Metallomics...................................................................... 79 Metatranscriptomics..................................................... 5, 24 Methylation..........................24, 45, 184, 252, 259, 263, 548 Methylome..................................................................... 184 Metric distance......................................................286, 365, 445 similarity........................................................... 160, 286 Mfinder................................................................... 406, 413 Microarray.................... 6, 10, 31, 76, 98, 114, 139, 154, 179, 237, 252, 269, 304, 331, 363, 379, 401, 421, 439, 458, 480, 499, 511, 529, 547 Microcosm.................................................27, 312, 319, 323 Microinspector................................................313, 319, 323 Minor allele frequency (MAF)........................224, 229, 491 minP................................................................................ 382 Mismatch.................209–211, 213–215, 272, 279, 319, 372 Missing value............149, 155–157, 366, 372, 381, 395, 513 Model................10, 14, 16, 20, 21, 25, 39, 40, 46, 48, 50, 53, 54, 58–62, 78, 81, 83, 85, 86, 88, 91, 101, 104, 106, 109, 114, 117, 129, 133–150, 154, 156, 157, 159–166, 178, 188–192, 200, 214, 226, 227, 241, 242, 244, 246, 257, 258, 270, 271, 275, 279, 281–284, 286, 287, 292, 302,
307–312, 314–318, 336, 340, 341, 355, 357, 363, 371, 400, 403–406, 413, 416, 417, 422, 424–428, 435, 437–439, 441, 446–450, 452–454, 461, 466, 470–474, 485, 489, 501–503, 505, 511, 513–515, 517, 519, 521, 522, 524, 543, 549, 565, 568 Modeling/modelling............5, 34, 36, 37, 39–40, 44, 53, 55, 58, 72, 81, 85–86, 100, 103, 105, 134, 136, 143, 241, 242, 253, 279, 283–288, 290, 357, 364, 371, 405–407, 417, 418, 425, 436, 437, 439–442, 447, 449, 513, 519, 522, 548 MODEM....................................................................... 405 Molecular Interaction Database (MINT)........28, 419, 429, 534, 543, 549 Monte Carlo........................................................... 453, 502 Motif discovery.................................... 255, 259–260, 264, 265 regulatory.....................................................12, 133, 405 Mpeak.................................................................... 258, 265 msInspect.........................................................341, 342, 356 MS/MS................................... 333, 334, 339, 341, 357, 553 tandem-MS............25, 26, 178, 184, 333, 352, 516, 553 Multidimensional scaling (MDS).......................... 163, 445 Multi-Epitope-Ligand-“Kartographie” (MELK).......... 184 MultiExperiment Viewer (MeV)...........................381, 382, 386, 389, 394 Multiomics.......................................... 35, 40, 42, 43, 57–60 Multiple reaction monitoring (MRM)................... 352, 357 Multiple testing....... 114, 118–119, 128, 226, 227, 232, 284, 289, 369, 371, 382, 487 Multiplexed...............................................56, 278, 352, 517 Multi-view learning........................................................ 135 MySQL................... 22, 82, 83, 85, 87, 90, 91, 109, 408, 484 mzData....................................................342, 347, 361, 363 MZmine..........................................................341, 342, 356 mzXML..................................... 52, 342, 343, 347, 361, 363
N National Center for Biomedical Ontology (NCBO).................... 41, 42, 58, 61–62, 66, 404 National Center for Biotechnology Information (NCBI)........................16, 17, 22, 25, 34, 49, 82, 84, 86, 87, 89, 289, 380, 385, 386, 390, 391, 458, 554, 555 Nearest neighbor..................... 163, 389, 395, 518, 520, 522 NetAffx........................................................................... 459 Network abstraction........................................................ 415, 416 Bayesian.................................... 163, 165, 191, 287, 453, 500–503, 506, 508, 514, 515, 520 betweenness.............................................................. 424 Boolean......................................................287, 501, 508 centralization............................................................ 422 clustering coefficient................................................. 422 density....................................................................... 422
directed..................................................................... 422 dynamics............................ 367, 415, 424–426, 447, 507 inference.................... 158, 165, 435–454, 501–503, 506 neural......................... 163, 425, 513, 515–516, 520–524 performance.............................................................. 416 perturbation................................... 14, 53, 424, 440, 508 property.....................................................424, 425, 442 regulatory........... 134, 187, 190, 251, 253, 270, 287, 401, 405, 413, 416, 418, 420, 421, 471, 500–503, 505–508, 548 scaffold....................................................... 21, 405–406 scale free....................................................424, 505, 524 signaling.............191, 383, 416, 421, 424–427, 446, 500 stress centrality......................................................... 424 structure.......................................................21, 415, 425 topology.....................................................191, 415, 424 transcriptional............................................500, 504–507 undirected....................................................13, 422, 502 Network identification by multiple regression (NIR).... 501, 505, 507, 508 Neural network................ 163, 425, 513, 515–516, 520–524 Next generation sequencing (NGS)............. 5, 6, 11, 23, 24, 75, 178, 182, 199, 223, 237, 238, 278, 331 Nimblegen............................... 238, 253, 254, 264, 265, 273 Noise.................................................. 14, 44, 121, 123, 138, 140, 163, 164, 167, 175, 204, 237, 241, 242, 244, 257, 262, 263, 278, 279, 334–336, 340, 345, 355–357, 367, 370, 372, 382, 444, 447, 448, 453, 454, 470, 522, 534, 549 Non-coding RNA (ncRNA).......17, 27, 176, 183, 185, 192, 200, 299–308, 310, 315–319, 322 Normalization intra-array..................................................255–257, 262 loess............................................................256, 261, 281 lowess................................................................ 256, 262 median....................................... 256, 257, 261, 262, 346 quantile.............................. 123, 241, 257, 281, 346, 381 VSN...................................................257, 265, 281, 283 Northern blot......................................................... 176, 270 Nuclear magnetic resonance (NMR).........77, 155, 352, 353 Nutrigenomics.................................................................... 4
O Object-relational mapping (ORM).......................... 91, 484 Odds ratio............................................................... 222, 224 Omics...................................... 3–28, 31–66, 71–91, 97–110, 113–130, 133–150, 153–167, 173–181, 184, 185, 187–192, 227, 271, 283, 284, 289, 290, 331, 351, 361, 363, 364, 369, 370, 379–395, 399–413, 415–430, 435–454, 457–475, 479–495, 499–508, 511–525, 528–532, 547–568 omicsNET...................................................................... 384 Oncomine.........270, 293, 521, 523, 529, 533–536, 540–543 Online Mendelian Inheritance in Man (OMIM)..... 19, 25, 76, 366, 458, 523, 555
Ontology...............8, 9, 18, 20, 23, 36, 37, 41, 42, 44, 49, 50, 53, 58, 60–64, 66, 75, 77, 102, 107, 159, 260, 361–363, 383, 401, 404, 418, 421, 439, 459, 462, 529, 555, 566 Open document format (ODF)...................................... 439 OpenMS..........................................................340–342, 347 Open reading frame (ORF).........................12, 13, 388, 401 oPOSSUM......................................................381, 383, 392 Oracle................................................ 22, 408, 464, 490, 519 Ortholog..................................................................... 85, 89 Outcome.................... 76, 116, 129, 145, 219, 230, 231, 284, 365, 369, 485, 488–492, 500, 512, 548 Overfit..................................... 113, 287, 341, 363, 445, 450 Over-representation analysis (ORA).......503, 504, 506, 507
P Paired end tags (PET)............................................ 305, 542 Paralog.................................................................85, 89, 185 Partial energy ratio for microarray (PEM)..................... 154 Partial least square (PLS)................................163, 364, 371 Path edge disjoint............................................................. 368 node disjoint............................................................. 368 Pathguide................................................................ 418, 436 Pathology......................................... 164, 219, 460, 461, 500 Pathophysiology.............................. 426, 481, 482, 499–508 Pathovisio....................................................................... 419 Pathway...............4, 5, 10, 35, 41, 53, 86, 110, 155, 184, 189, 190, 192, 220, 221, 270, 288, 289, 318, 352, 356, 365–369, 381–384, 390–392, 399–402, 404, 407, 415–430, 435–437, 439–444, 446–448, 450, 451, 457–459, 467, 471, 472, 474, 480, 481, 500, 501, 503–507, 518, 519, 521, 523, 528, 530–534, 543, 544, 548–553, 555, 556, 558–563, 565–568 PathwayExplore...................................................... 439, 543 Pathway Studio....................................................... 421, 551 Pattern matching....................................................308, 311, 383 recognition.........................................155, 165, 184, 480 PCAP................................................................................ 11 PCEnv............................................................................. 437 Peak detection.................155, 251, 254, 255, 258–259, 261, 265, 356, 357, 382 Peptide mass fingerprinting (PMF)....................... 333, 338 Perl......................... 22, 39, 64, 65, 76, 83, 91, 290, 291, 342, 408, 409, 464 Permutation..... 118–120, 129, 156, 209, 231, 292, 382, 386, 389, 394, 504 Personalized................................. 50, 59, 212, 264, 500, 542 Perturbation................... 14, 35, 53, 158, 165, 190, 333, 364, 416, 424, 436, 440, 442, 444, 446, 501, 505, 508 PETfold...........................................................309, 316, 324 Pfold.................................................................309, 316, 324
Pharmacogenomics...............4, 270, 413 Phenomics...............79, 400, 402, 404 Phenotype...............4, 21, 40, 46, 99, 104, 219–221, 229, 231, 236, 270, 283, 369, 400, 402, 471, 480, 488, 491, 492, 500, 504–507, 516, 519, 547, 548, 560 Phosphorylation...............184, 252, 416, 419, 447–450, 558, 566 Photomultiplier...............276 Phrap...............11 Phusion...............11 Phylogenetic footprinting...............260, 264, 286, 383 Phylogenetics...............4, 9 Phylogenomics...............4 Physiomics...............4 PITA...............312, 319, 324 PMcomp...............309, 316, 324 Poisson...............148, 337 Polymerase chain reaction (PCR) qPCR...............252, 263, 270, 271, 290 RT-PCR...............215, 271, 480, 541 Polymorphism...............17, 40, 46, 213, 491, 505 Positive predictive value (PPV)...............523 Post-genomic...............4, 21, 32, 442, 499, 523 Postgres...............22, 408 Post-transcriptional...............190, 191, 286, 301, 490 Post-translational modifications (PTM)...............18, 19, 401 Precision...............116, 175, 176, 284, 338 Prediction function...............143–145, 302 structure...............10, 180, 302–303, 308, 309, 315–317, 322 Pre-processing...............121, 123, 124, 139, 154–157, 224–226, 238, 240, 243, 271, 274, 277–283, 291, 292, 345, 346, 380–382, 385–387, 394, 488, 513, 522 Primer...............200, 202, 204, 205, 237, 271 Principal component analysis (PCA)...............139, 142, 149, 162, 163, 364, 370, 372 Probability...............115, 116, 118, 127, 146, 148, 208, 213, 223, 226, 258, 266, 289, 290, 309, 313, 314, 344, 371, 394, 446, 452, 453, 465, 481, 489, 503, 514–517, 524, 567 Probabilistic...............118, 143, 146–150, 286, 309, 310, 315, 316, 436, 440, 446, 447, 453, 454, 503, 514, 515, 522 Process...............4, 32, 77, 97, 128, 137, 153, 175, 201, 237, 254, 270, 300, 332, 353, 403, 416, 436, 458, 481, 499–508, 511, 528, 548 Profile...............6, 8, 17, 18, 21, 23, 41, 52, 59, 98–100, 113, 114, 124, 139, 144, 145, 148, 154, 156, 157, 159–164, 176, 179, 243, 255, 260, 269, 270, 286, 287, 293, 305, 311, 317, 338–341, 353, 355–361, 380, 383–385, 392, 400, 402, 406, 410, 421, 429, 444–446, 465, 466, 470, 488, 500, 503–506, 508, 516, 518, 520, 522–524, 530, 532, 542, 547, 556–563, 567 Prognosis...............80, 220, 379, 499, 500, 506
Promoter.......35, 46, 185, 189, 220, 223, 230, 253, 257, 260, 261, 286, 301, 381, 383, 392, 419, 420, 504–507 Protein....... 3, 31, 76, 107, 134, 154, 180, 229, 252, 270, 299, 331, 352, 380, 400, 416, 437, 458, 480, 500, 515, 527, 547 Protein analysis through evolutionary relationships (PANTHER).........................381, 383, 390, 391 Protein data bank (PDB).................................... 76–78, 555 Protein identifier cross-reference (PICR)............. 8, 24, 550 Protein information resource (PIR)..........................18, 550, 554, 561, 562 ProteinProphet............................................................... 344 ProteinScape..................................................................... 77 Proteomics...........................4, 6–7, 9, 10, 12–15, 18–21, 23, 27, 28, 31, 32, 34, 35, 38, 40–45, 51–52, 72, 76–79, 98–100, 105–107, 128, 134, 136–139, 155, 157, 164, 178, 179, 184, 185, 190, 331–347, 352, 361, 363, 380–382, 400–402, 404, 406, 420, 421, 458, 480, 485, 487–488, 491, 511, 512, 516, 518, 520, 522–524, 530, 547–549, 553, 554, 556, 561, 564, 567 Proteomics identifications (PRIDE)..............13, 19, 20, 22, 23, 28, 60, 65, 346 PubMed................... 174, 178, 460–463, 474, 532, 535, 537, 551, 566 p-value.................................18, 115, 118–120, 130, 257, 284, 289, 371, 391, 392, 465–470, 472, 473, 502, 523, 528, 529, 540, 563 Pyrosequencing................................................. 24, 201–202
Q QRNA.............................................................302, 307, 324 Quality control (QC)................. 36, 241, 292, 364, 372, 486 Quantification..................6, 19, 42, 175, 186, 278, 353, 356, 359–360, 366, 400, 446, 480 Quantitation..................... 4, 10, 23, 135, 176, 178, 182, 187, 190, 209, 219, 231, 292, 331–347, 352, 353, 360, 366, 379, 400, 406, 425, 437, 444, 480, 491, 502, 505, 531 q-value...............................119, 124–126, 386, 394, 465, 520
R Rank product...................................................118, 124–128 Reactome........... 22, 362, 366, 383, 418, 419, 421, 430, 436, 437, 555, 566–568 Read length................. 5, 201, 203, 206, 207, 209, 210, 212, 215, 274, 278 Receiver operating characteristics (ROC).............. 518, 524 Reconstruction.............. 4, 89, 156, 158, 164, 182, 187, 191, 334, 384, 405–407, 415–430, 457, 500–506, 508 Record...................7, 8, 13, 15, 19, 20, 37, 41, 72, 74, 80, 88, 101–108, 110, 153, 154, 158, 159, 202, 322, 333, 339, 355, 385, 394, 404, 438, 442, 449, 459–461, 475, 481, 494, 514, 518, 531
Reference sequence (RefSeq)...................... 16, 25, 134, 193, 392, 554, 555 Regression Cox........................................................................... 489 linear.......................................... 123, 256, 260, 489, 501 logistic.......................................................227, 287, 489 ridge.................................................................. 489, 490 Regulomics..................................................................... 187 Relation............. 49, 62, 78, 82, 91, 100, 101, 103–105, 107, 108, 110, 137, 189, 190, 193, 225, 363–365, 384, 415, 460, 502, 550, 555, 562 Relational database management systems (RDBMS)........................................... 22, 81–83 Replication................ 77, 230, 236, 275, 292, 364, 387, 399, 467–470 Resource description framework (RDF)..........104, 403, 404 REVEAL........................................................165, 501, 508 Reverse engineering......................... 164, 165, 287–288, 501 Ribozyme........................................................................ 300 RNA clan.....................................................301, 305, 315, 317 class................................................................... 301, 317 family.................................................301, 305, 315, 321 functional................................... 301, 302, 311, 320, 322 mi................... 17, 27, 215, 223, 286, 288, 300, 309, 312, 313, 318, 319, 321, 322, 384, 548 micro............. 27, 128, 190, 191, 220, 286, 300, 319, 548 non coding..............17, 27, 176, 192, 299, 301–303, 307 PiWi......................................................................... 301 regulatory...................................................300, 318, 321 ribosomal............................................................ 27, 300 single stranded...................................308, 314, 319–321 small nuclear............................................................. 300 small nucleolar.................................................. 300, 321 RNAalifold..............................................308, 316, 322, 324 RNA-induced silencing complex (RISC)....................... 301 RNAProB............................................................... 315, 320 RNAseq....................................... 5, 210, 215, 270, 304, 305 RNAz....................................... 302, 303, 307, 308, 322, 325 RNomics..........................................................185, 299–325 Robust multichip average (RMA)..................123, 241, 279, 282–284, 380, 381, 394 Robustness..............................4, 74, 118, 125, 424, 425, 454
S SAMBA.......................................................................... 406 Sample size...... 121–122, 222, 230, 231, 242, 284, 290, 364, 483–486, 490–493, 522 SAS................................................................................. 519 Savitzky-Golay (SG).............................................. 340, 355 Scaffold......................................................21, 320, 405–406 Scaling...................................... 158, 166, 279, 281, 358, 359 Screening............ 3, 5, 59, 104, 116, 352, 451, 482, 484, 486, 491, 512
Secretomics..................................................................... 351 Segemehl.........................................................210, 214, 215 Selectivity................................................................ 175, 480 Selex............................................................................... 260 Semantic........... 22, 52, 63, 75, 104, 107, 204, 401, 404, 439 Sensitivity........ 4, 6, 127–128, 164, 166, 175, 186, 210, 214, 215, 231, 244, 270, 303, 304, 318, 344, 352, 353, 358, 480, 517, 522–524 Sequence assembly.................................... 10, 11, 16, 35, 182, 215 nucleotide...... 9, 10, 19, 38, 45, 47, 48, 76, 89, 182, 204, 207, 307, 451 similarity..........................................................9, 89, 301 Sequence alignment/map (SAM).....................46, 117, 118, 124–129, 154, 211–212, 284, 381 Sequencing deep..............................................................63, 186, 305 mate pair....................................................203, 207, 212 paired end..................................................203, 206, 207 Sanger........................................................4, 17, 23, 182 Serial analysis of gene expression (SAGE).........17, 50, 270, 401–403 Sesame.............................................................................. 77 SHORTY......................................................................... 11 Shotgun..................................... 11, 179, 180, 184, 332, 339 Signal................. 10, 121, 123, 128, 155, 156, 163, 175, 176, 181, 183, 185, 192, 201, 202, 204–207, 224, 225, 237, 241–244, 246, 252, 254–260, 262, 263, 266, 276, 278, 279, 304, 305, 318, 334, 335, 337, 338, 345, 353–361, 365, 369, 370, 420, 424, 425, 468–472, 485, 508, 542 Signaling............ 18, 191, 192, 383, 384, 415–430, 436, 437, 446–448, 450, 459, 468, 500, 503, 529, 548, 549, 551, 560, 568 Signalomics.................................................................... 351 Signal-to-noise...................................................... 237, 370 Signal transduction........................... 10, 252, 318, 420, 424, 425, 468, 469, 471, 472 Signature.............. 13, 25, 293, 305, 467, 471, 500, 506, 512 Significance.......... 5, 12, 14, 18, 21, 56, 75, 81, 97, 98, 104, 108, 109, 115–121, 124, 125, 130, 136, 137, 144, 149, 154, 156, 158, 159, 161, 163, 165, 176, 191, 200, 202, 204, 206, 210, 215, 220, 222, 230, 242, 246, 252, 257–259, 262, 281, 284, 286, 289, 302, 308, 338, 340, 346, 352, 353, 366, 367, 369, 371, 372, 382, 386, 389, 391, 392, 395, 420, 436, 441, 445, 463, 465–466, 471, 473, 474, 481, 486, 487, 490, 504, 506, 524, 528–532, 534, 541, 554, 556, 558, 562, 565 Significance analysis of microarrays (SAM)................... 382 SIMCA........................................................................... 364 Simulation....... 119–120, 124, 125, 128–130, 212, 367, 406, 417, 421, 428, 439–441, 485, 491, 493, 515, 516
SimulFold........................................................309, 316, 325 SNP nonsynonymous................................................ 220, 223 synonymous........................................................ 87, 223 SOAP.............................................................................. 418 SOLEXA........................................... 18, 202–204, 208, 237 SOLiD...............................23, 200, 203–206, 223, 238, 274 SOM.............................................................................. 162 SOP................................................................................ 483 SOURCE....................................................................... 382 Sparse Bayesian learning (SBL)..............239, 242, 244, 246 Spatio-temporal.......................................175–177, 188, 192 Spearman.................................................123, 137, 138, 519 SpecArray................................................................ 340, 342 Specificity....................... 4, 19, 186, 191, 193, 237, 262, 263, 303, 318, 344, 352, 354, 392, 395, 480, 517, 522–524, 528, 531, 555, 565 Spectroscopy mass.......................................6, 7, 13, 19, 23, 25, 28, 40, 42, 44, 51, 52, 59, 148, 178, 184, 331–333, 335, 337, 338, 340, 351–372, 379, 382, 402, 403, 479, 511, 515–517, 521, 522, 547, 553, 554, 556, 564, 567 NMR.......................................................................... 77 SPINE........................................................................ 77, 78 Spliceomics....................................................................... 79 Splicing..................................10, 35, 79, 181, 211, 223, 229, 271, 274, 284, 300, 301 SPSS............................................................................... 519 SRM............................................................................... 357 SSAKE............................................................212, 215, 216 Stable isotope dilution (SID)......................................... 352 Standard ArMet................................................................. 53, 363 CDISC....................................................40, 54, 55, 481 CIMR............................................................38, 53, 363 ERCC........................................................43, 51, 55, 56 experiment description..............................36–42, 54, 55 experiment execution...................................... 36, 42–45 HITSP................................................................. 58, 59 IMEx...............................................................20, 21, 28 ISA-TAB................................. 23, 40, 54, 58–60, 64, 65 LGC..................................................................... 43, 55 MAGE................... 34, 39, 40, 48–50, 57, 59, 60, 64–66, 254, 293 MAQC....................................................34, 44, 50, 364 MeMo................................................................ 53, 363 MGED...........9, 27, 33, 34, 38, 40, 42, 48, 49, 58, 60, 63, 66, 380 MIAME................ 15, 33, 34, 37–39, 47–49, 51, 56, 61, 78, 289, 380, 439 MIAMET................................................................ 363 MIAPE...............................................15, 34, 38, 51–52 MIBBI..................................23, 38, 39, 58, 61, 362, 439
MIGS..................................................38, 45–46, 60, 61 MiMiR....................................................................... 50 MIMIx......................................................15, 21, 38, 52 MIMS..................................................38, 45–46, 60, 61 MINSEQE...............................................18, 38, 47–48 MIRIAM................................................................. 439 OBI................................................... 42, 49, 58, 61, 404 OBO...........................23, 41, 42, 58, 61–62, 64, 66, 403 OWL................................42, 64, 66, 104, 403, 404, 439 PaGE-OM........................................................... 40, 46 PML..................................................................... 40, 46 PSI-MI.........................15, 20, 40, 42, 52, 420, 421, 428 terminology...............................................33, 37, 40–42 UMLS....................................................................... 462 Standard addition........................................................... 359 Standard deviation...................114, 117, 122, 123, 126, 130, 136, 238, 257, 277, 280, 283, 337, 364, 485, 487 Stanford microarray database (SMD)................50, 380, 402 Stationarity..................................................................... 166 Statistic/Statistical analysis.............. 113–130, 149, 158, 160, 191, 271, 275, 282–289, 345, 346, 353, 361, 364, 371, 386, 406, 417, 487, 529, 530 chi-squared............................................................... 128 explorative................................................................ 380 multivariate........................................124, 158, 160, 366 supervised................................................................. 158 testing............................................... 114–116, 120, 156, 284, 290, 357, 567 univariate.................................................................. 158 unsupervised............................................................. 158 Statistical analysis of microarrays (SAM)............................. 386, 389, 394–395 Statistical analysis of network dynamics (SANDY )..................................................... 406 Stratification....................................................232, 242, 270 Streptavidin-phycoerythrin (SAPE)............................... 276 STRING......................................... 381, 383, 384, 392, 393 Structural classification of protein–protein interfaces (SCOPPI)..................................................... 534 Study case-control......................................224, 226–228, 230, 484, 485, 492–494 cohort............................................................... 492, 494 longitudinal.............................................................. 491 validation...........................................221, 490, 492–494 Subcellular location..............9, 384, 406, 528, 531, 533, 565 Subtractive hybridization................................................ 270 Suffix array............................................................. 209–210 Summarization........................ 123, 239–242, 279, 281–283 SUPERFAMILY........................................................... 404 SVD................................................................................ 163 SVM........................................ 145, 150, 309, 310, 313–315 Swissprot..............................................................12, 74, 134
Systems biology.....................33, 52, 73, 176, 187, 190–192, 253, 399, 418, 421, 423, 428, 430, 435–437, 439, 440, 480, 481 Systems biology markup language (SBML)...........418, 421, 428, 439 Systems biology workbench (SBW)....................... 421, 430 Systems pathology.......................................................... 500
T Tagging...............224, 447, 448 Tandem mass spectroscopy...............25, 178, 184, 352, 516 Target...............18, 27, 33, 35, 38, 39, 41, 42, 55, 56, 74, 77, 114, 159, 190, 223, 228, 229, 241, 270, 273, 286, 291, 292, 300, 301, 311–314, 318–320, 322, 332, 338, 346, 353, 358, 360, 380, 384, 392, 405, 411, 418, 420, 451, 465, 466, 469, 505, 506, 511, 515, 527–544, 548, 551–553, 560, 562, 564 Taverna...............22, 408, 410 Taxonomy...............16, 20, 49, 101, 102, 105, 107, 554 Therapy...............220, 511, 512, 532, 535, 540, 548 THRASH...............341 Tier...............408, 409 TIGR...............382, 386, 394 Tiling...............46, 253, 257, 259, 264, 265, 273, 304, 305 Time dependent...............435–454 Time series/time course...............139, 143, 144, 153–155, 157–160, 162, 164, 166, 167, 191, 270, 286, 333, 441–449, 453, 501 Tissue specificity...............19, 528, 531 Topology/topological...............14, 163, 184, 191, 415, 422, 424, 425, 427 Toponome...............184 Toxicology...............77 TPP...............344 Trajectory...............158 Transcription factor...............165, 190, 191, 223, 230, 252, 270, 285, 301, 383, 384, 392, 395, 418, 419, 426, 447, 504 Transcription factor binding site (TFBS)...............252, 384 Transcriptomics...............18, 52, 136–139, 299, 304 Transcript/transcriptional...............10, 16, 24, 134, 164, 165, 177, 181, 182, 189, 190, 252, 270, 272, 274, 277, 278, 286, 289, 291, 299, 301, 305, 390, 395, 405, 416, 419, 421, 469, 471, 472, 480, 487, 500, 504–507 TRANSFAC...............260, 286, 420, 427, 430, 504, 507 Transformation...............44, 60, 142, 155, 241, 279, 281, 282, 340, 354, 358, 370, 408 TRANSPATH...............420, 430 TRED...............419, 423, 427, 430 TrEMBL...............18, 19, 566 t-test moderated...............124, 284 student's t-test, t-statistic...............117, 119, 284
Two-hybrid/2-hybrid..........................................6, 186, 403 Type I error.....................................................116, 122, 485 Type II error............................................116, 119, 122, 485
U Ubiquitination...............252, 448 UCSC...............90, 246, 247, 258, 261, 264, 302, 305, 321, 322, 521 UniProt...............16, 18, 19, 22, 392, 419, 427, 462, 474, 475, 549, 550, 554–556, 561 UniRef...............19, 566, 567 Universal reference...............275 UNIX...............22, 90, 240 Upper limit of quantification (ULOQ)...............366
V Validation prospective........................................................ 492, 494 retrospective.............................................................. 492 VANTED....................................................................... 439 Variance............ 116, 117, 124, 127–129, 159, 166, 167, 175, 257, 275, 279–281, 283, 284, 337, 359, 363, 364, 366, 370, 372, 386, 448, 452, 453, 489, 490, 524, 535 Velvet.................................................................11, 212, 216 Visualization................................. 90, 160, 242, 247, 265, 336, 364, 366, 383, 384, 407–410, 418–421, 439, 449, 505, 518–520, 542, 543, 551, 560
W Warehouse/warehousing..........................17, 18, 81–83, 554 Wavelet............................................................340, 341, 355 W3C............................................................................... 404 Web Ontology Language (OWL)................ 42, 64, 66, 104, 403, 404, 439 Weka....................................................................... 287, 346 Western blot................................................................... 252 Wilcoxon.........................................................118, 128, 129 Workflow...........................22, 23, 74, 76, 98–100, 102, 103, 107, 121, 178, 221, 243, 352, 353, 355, 361, 363–364, 370, 379–395, 550, 552
X XML....................... 15, 20, 22, 25, 34, 39, 46, 48, 52, 53, 60, 80–82, 86, 87, 90, 110, 361, 363, 394, 403, 404, 408, 411, 418, 419, 474, 481 X!Tandem............................................................... 339, 345 Xtrack............................................................................... 77
Y Yeast-two-hybrid/Y2H................................................... 186
Z Zero-mode waveguide (ZMW)..................................... 206 Zscore............................................................................. 341