Clinical Proteomics
METHODS IN MOLECULAR BIOLOGY™
John M. Walker, SERIES EDITOR
METHODS IN MOLECULAR BIOLOGY™
Clinical Proteomics Methods and Protocols
Edited by
Antonia Vlahou Biomedical Research Foundation, Academy of Athens, Athens, Greece
Editor
Antonia Vlahou
Academy of Athens Biomedical Research Foundation
Athens 115 27, Greece
Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Herts., AL10 9AB UK
ISBN: 978-1-58829-837-9
e-ISBN: 978-1-59745-117-8
Library of Congress Control Number: 2007939413
©2008 Humana Press, a part of Springer Science+Business Media, LLC
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, 999 Riverview Drive, Suite 208, Totowa, NJ 07512 USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
springer.com
Preface
Clinical proteomics has rapidly evolved over the past few years and is continuously growing as new methodologies and technologies emerge. In this volume, leading researchers in the field have contributed their state-of-the-art methodologies on protein profiling and identification of disease biomarkers in tissues, microdissected cells, and body fluids. Experimental approaches involving application of two-dimensional electrophoresis, multidimensional liquid chromatography, SELDI/MALDI mass spectrometry, and protein arrays, as well as the bioinformatics and statistical tools pertinent to the analysis of proteomics data, are described. As stated in the introductory chapter by Prof. Paik, the Vice President of the Human Proteome Organization, “clinical proteomics needs the integration of biochemistry, pathology, analytical technology, bioinformatics, and proteome informatics to develop highly sensitive diagnostic tools for routine clinical care in the future.” The multidisciplinary character of clinical proteomics approaches is evident in the detailed step-by-step protocols described in this volume, which makes them of potential use to a wide range of researchers, including clinicians, molecular biologists, chemists, bioinformaticians, and computational biologists.
Antonia Vlahou
Acknowledgments
The editor gratefully acknowledges all contributing authors for their collaboration, which made this project possible and brought it to fruition; the series editor, Prof. John Walker, whose help and guidance have been instrumental; and Mr. Patrick Marton, Mr. David Casey, and the whole production team at Humana headed by the late Mr. Tom Laningan for the excellent production of this book.
Contents
Preface
Acknowledgments
Contributors

1. Overview and Introduction to Clinical Proteomics
   Young-Ki Paik, Hoguen Kim, Eun-Young Lee, Min-Seok Kwon, and Sang Yun Cho

Part I: Specimen Collection for Clinical Proteomics
2. Specimen Collection and Handling: Standardization of Blood Sample Collection
   Harald Tammen
3. Tissue Sample Collection for Proteomics Analysis
   Jose I. Diaz, Lisa H. Cazares, and O. John Semmes

Part II: Clinical Proteomics by 2DE and Direct MALDI/SELDI MS Profiling
4. Protein Profiling of Human Plasma Samples by Two-Dimensional Electrophoresis
   Sang Yun Cho, Eun-Young Lee, Hye-Young Kim, Min-Jung Kang, Hyoung-Joo Lee, Hoguen Kim, and Young-Ki Paik
5. Analysis of Laser Capture Microdissected Cells by 2-Dimensional Gel Electrophoresis
   Daohai Zhang and Evelyn Siew-Chuan Koay
6. Optimizing the Difference Gel Electrophoresis (DIGE) Technology
   David B. Friedman and Kathryn S. Lilley
7. MALDI/SELDI Protein Profiling of Serum for the Identification of Cancer Biomarkers
   Lisa H. Cazares, Jose I. Diaz, Rick R. Drake, and O. John Semmes
8. Urine Sample Preparation and Protein Profiling by Two-Dimensional Electrophoresis and Matrix-Assisted Laser Desorption Ionization Time of Flight Mass Spectroscopy
   Panagiotis G. Zerefos and Antonia Vlahou
9. Combining Laser Capture Microdissection and Proteomics Techniques
   Dana Mustafa, Johan M. Kros, and Theo Luider

Part III: Clinical Proteomics by LC-MS Approaches
10. Comparison of Protein Expression by Isotope-Coded Affinity Tag Labeling
   Zhen Xiao and Timothy D. Veenstra
11. Analysis of Microdissected Cells by Two-Dimensional LC-MS Approaches
   Chen Li, Yi-Hong, Ye-Xiong Tan, Jian-Hua Ai, Hu Zhou, Su-Jun Li, Lei Zhang, Qi-Chang Xia, Jia-Rui Wu, Hong-Yang Wang, and Rong Zeng
12. Label-Free LC-MS Method for the Identification of Biomarkers
   Richard E. Higgs, Michael D. Knierman, Valentina Gelfanova, Jon P. Butler, and John E. Hale
13. Analysis of the Extracellular Matrix and Secreted Vesicle Proteomes by Mass Spectrometry
   Zhen Xiao, Thomas P. Conrads, George R. Beck, Jr., and Timothy D. Veenstra

Part IV: Clinical Proteomics and Antibody Arrays
14. Miniaturized Parallelized Sandwich Immunoassays
   Hsin-Yun Hsu, Silke Wittemann, and Thomas O. Joos
15. Dissecting Cancer Serum Protein Profiles Using Antibody Arrays
   Marta Sanchez-Carbayo

Part V: Statistics and Bioinformatics in Clinical Proteomics Data Analysis
16. 2D-PAGE Maps Analysis
   Emilio Marengo, Elisa Robotti, and Marco Bobba
17. Finding the Significant Markers: Statistical Analysis of Proteomic Data
   Sebastien Christian Carpentier, Bart Panis, Rony Swennen, and Jeroen Lammertyn
18. Web-Based Tools for Protein Classification
   Costas D. Paliakasis, Ioannis Michalopoulos, and Sophia Kossida
19. Open-Source Platform for the Analysis of Liquid Chromatography-Mass Spectrometry (LC-MS) Data
   Matthew Fitzgibbon, Wendy Law, Damon May, Andrea Detter, and Martin McIntosh
20. Pattern Recognition Approaches for Classifying Proteomic Mass Spectra of Biofluids
   Ray L. Somorjai

Index
Contributors
Jian-Hua Ai • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
George R. Beck, Jr. • Division of Endocrinology, Metabolism and Lipids, Emory University, School of Medicine, Atlanta, GA
Marco Bobba • University of Eastern Piedmont, Department of Environmental and Life Sciences, Alessandria, Italy
Jon P. Butler • Lilly Corporate Center, Indianapolis, IN
Sebastien Christian Carpentier • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Lisa H. Cazares • The George L. Wright Jr. Center for Biomedical Proteomics, Eastern Virginia Medical School, Norfolk, VA
Sang Yun Cho • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Thomas P. Conrads • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Andrea Detter • Fred Hutchinson Cancer Research Center, Seattle, WA
Jose I. Diaz • Cancer Therapy Research Center’s Institute for Drug Development, University of Texas, Health Science Center, San Antonio, TX
Rick R. Drake • Eastern Virginia Medical School, Norfolk, VA
Matthew Fitzgibbon • Fred Hutchinson Cancer Research Center, Seattle, WA
David B. Friedman • Proteomics Laboratory, Mass Spectrometry Research Center, Department of Biochemistry, Vanderbilt University School of Medicine, Nashville, TN
Valentina Gelfanova • Lilly Corporate Center, Indianapolis, IN
John E. Hale • Lilly Corporate Center, Indianapolis, IN
Richard E. Higgs • Lilly Corporate Center, Indianapolis, IN
Yi-Hong • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Hsin-Yun Hsu • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Thomas O. Joos • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Min-Jung Kang • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Hoguen Kim • Department of Pathology, College of Medicine, Yonsei University, Seoul, Korea
Hye-Young Kim • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Michael D. Knierman • Lilly Corporate Center, Indianapolis, IN
Evelyn Siew-Chuan Koay • Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, and Molecular Diagnosis Center, Department of Laboratory Medicine, National University Hospital, Singapore
Sophia Kossida • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Johan M. Kros • Department of Pathology, Josephine Nefkens Institute Erasmus Medical Center, Rotterdam, The Netherlands
Min-Seok Kwon • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Jeroen Lammertyn • Faculty of Bioscience Engineering, Division of Mechatronics, Biostatistics and Sensors, K.U. Leuven, Leuven, Belgium
Wendy Law • Fred Hutchinson Cancer Research Center, Seattle, WA
Eun-Young Lee • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Hyoung-Joo Lee • Yonsei Biomedical Proteome Research Center, Department of Biochemistry, College of Sciences, Seoul, Korea
Chen Li • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Su-Jun Li • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Kathryn S. Lilley • Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, United Kingdom
Theo Luider • Laboratories of Neuro-Oncology/Clinical and Cancer Proteomics, Josephine Nefkens Institute Erasmus Medical Center, Rotterdam, The Netherlands
Emilio Marengo • Department of Environmental and Life Sciences, University of Eastern Piedmont, Alessandria, Italy
Damon May • Fred Hutchinson Cancer Research Center, Seattle, WA
Martin McIntosh • Fred Hutchinson Cancer Research Center, Seattle, WA
Ioannis Michalopoulos • Biomedical Research Foundation, Academy of Athens, Athens, Greece
Dana Mustafa • Department of Pathology, Josephine Nefkens Institute Erasmus Medical Center, Rotterdam, The Netherlands
Young-Ki Paik • Department of Biochemistry, Yonsei Proteome Research Center & Biomedical Proteome Research Center, Seoul, Korea
Costas D. Paliakasis • Biomedical Research Foundation, Academy of Athens, Athens, Greece
Bart Panis • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Elisa Robotti • Department of Environmental and Life Sciences, University of Eastern Piedmont, Alessandria, Italy
Marta Sanchez-Carbayo • Tumor Markers Group, Spanish National Cancer Center (CNIO), Madrid, Spain
O. John Semmes • The George L. Wright Jr. Center for Biomedical Proteomics, Eastern Virginia Medical School, Norfolk, VA
Ray L. Somorjai • Biomedical Informatics Institute for Biodiagnostics, National Research Council, Winnipeg, Manitoba, Canada
Rony Swennen • Faculty of Bioscience Engineering, Division of Crop Biotechnics, K.U. Leuven, Leuven, Belgium
Harald Tammen • Digilab BioVisioN GmbH, Hannover, Germany
Ye-Xiong Tan • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Timothy D. Veenstra • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Antonia Vlahou • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Hong-Yang Wang • Eastern Hepatobiliary Surgery Hospital, Shanghai, China
Silke Wittemann • Biochemistry Department, NMI Natural and Medical Sciences Institute at the University of Tuebingen, Reutlingen, Germany
Jia-Rui Wu • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Qi-Chang Xia • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Zhen Xiao • Laboratory of Proteomics and Analytical Technologies, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD
Rong Zeng • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Panagiotis G. Zerefos • Division of Biotechnology, Biomedical Research Foundation, Academy of Athens, Athens, Greece
Daohai Zhang • Molecular Diagnosis Center, Department of Laboratory Medicine, National University Hospital, Singapore and Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Lei Zhang • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
Hu Zhou • Research Center for Proteome Analysis, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
1
Overview and Introduction to Clinical Proteomics
Young-Ki Paik, Hoguen Kim, Eun-Young Lee, Min-Seok Kwon, and Sang Yun Cho

Summary
As the field of clinical proteomics progresses, discovery of disease biomarkers becomes paramount. However, the immediate challenges are to establish standard operating procedures for both clinical specimen handling and reduction of sample complexity and to increase the ability to detect proteins and peptides present in low amounts. The traditional concept of a disease biomarker is shifting toward a new paradigm, namely, that an ensemble of proteins or peptides would be more efficient than a single protein/peptide in the diagnosis of disease. Because clinical proteomics usually requires easy access to well-defined fresh clinical specimens (including morphologically consistent tissue and properly pretreated body fluids of sufficient quantity), biorepository systems need to be established. Here, we address these questions and emphasize the necessity of developing various microdissection techniques for tissue specimens, multidimensional fractionation for body fluids, and other related techniques (including bioinformatics), tools which could become integral parts of clinical proteomics for disease biomarker discovery.
Key Words: biomarker; body fluids; clinical proteomics; translational proteomics; depletion; biorepository; multidimensional fractionation; specimen bank; biomarker panel.

Abbreviations: CSF: Cerebrospinal Fluid; SILAC: Stable Isotope Labeling with Amino acids in Cell culture; FFE: Free Flow Electrophoresis; IMAC: Immobilized Metal Affinity Chromatography; 2DE: 2-Dimensional Gel Electrophoresis; CBB: Coomassie Brilliant Blue; SELDI: Surface-Enhanced Laser Desorption/Ionization; MALDI: Matrix-Assisted Laser Desorption/Ionization; MDLC: Multi-dimensional Liquid Chromatography; LC: Liquid Chromatography; TOF: Time-of-Flight; CID: Collision-Induced Dissociation; ETD: Electron Transfer Dissociation; LIT: Linear Ion-Trap; FT: Fourier-Transform; Q: Quadrupole; ELISA: Enzyme-Linked Immunosorbent Assay; SISCAPA: Stable Isotope Standards with Capture by Anti-Peptide Antibody; AQUA: Absolute Quantitative Analysis.

Commercial brands are also shown: MARS: Multiple Affinity Removal System (Agilent, Palo Alto, CA, USA); Enchant™: Enchant™ Multi-protein Affinity Separation Kit (Pall Life Sciences, Ann Arbor, MI, USA); Gradiflow™: Gradiflow™ Separation (Life Bioprocess, Frenchs Forest, Australia); FFE™: BD Free Flow Electrophoresis System (BD Diagnostics, Martinsried/Planegg, Germany); Zoom®: Zoom® Benchtop Proteomics System (Invitrogen Corporation, Carlsbad, CA, USA); Rotofor: Bio-Rad Rotofor® Prep IEF Cell (Bio-Rad, Hercules, CA, USA); PF2D: ProteomeLab™ PF2D Protein Fractionation System (Beckman Coulter, Inc., Fullerton, CA, USA); DIGE: Ettan™ DIGE System (GE Healthcare Bio-Sciences AB, Uppsala, Sweden); Deep Purple™: Deep Purple™ Total Protein Stain (GE Healthcare Bio-Sciences AB, Uppsala, Sweden); ICAT™: Isotope-coded affinity tags (Applied Biosystems, Foster City, CA, USA); iTRAQ™: iTRAQ™ Reagents (Applied Biosystems, Foster City, CA, USA); Q-TRAP™ (Applied Biosystems, Foster City, CA, USA).

From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Overview and Scope of Clinical Proteomics
Clinical proteomics is defined as the comprehensive qualitative and quantitative profiling of proteins (and peptides) present in clinical specimens such as body fluids and tissues. The comparison of specimens from healthy and diseased individuals may lead to the discovery of a disease biomarker (1). The biomarker serves as a molecular signature reflecting stages of disease before or after treatment and can also be used for prognostic purposes in monitoring the response to treatment (2). Clinical proteomics consists of a variety of experimental processes, which include the collection of well-phenotyped clinical specimens, analysis of proteins or peptides of interest, data interpretation, and validation of proteomics data in a clinical context (Fig. 1). After successful identification of a few disease biomarker candidates through extensive profiling, translational proteomics involving validation with a cohort study follows. Even after proper identification and verification of a disease biomarker, it takes a long time to prove that the biomarker is applicable to clinical diagnosis or prognosis (3,4).

Fig. 1. Clinical and translational proteomics. The key components of experimental methods are included in each box.

There has been a remarkable increase in the publication of clinical proteomics papers within a short period of time [more than 800 papers in 2006 (Fig. 2)], coinciding with the rapid growth of proteomics. Reflecting this trend, this chapter reviews the core technologies used in the field of clinical proteomics with respect to specimen processing, protein separation platforms (e.g., gel-based systems or liquid-based methods), quantitative labeling, mass spectrometry (MS), and proteome informatics tools. It is noteworthy that, despite the advent of new technologies, several bottlenecks remain in the proteomics field, such as the lack of dataset standardization, difficulties in quantifying the proteins of interest and in verifying the proteins or peptides identified, and the absence of an overall strategy for handling biomarkers after their identification. Thus, the pace of biomarker discovery, one of the key agendas of clinical proteomics, will depend on how well these obstacles are resolved by technical advancement (4). The following sections address these issues in the context of clinical proteomics.
Fig. 2. Recent trends in clinical proteomics publications. The distribution of the articles related to clinical proteomics listed in PubMed is shown here. The key words used for searching articles are as follows: query (clinical[All Fields] OR ((“biological markers”[TIAB] NOT Medline[SB]) OR “biological markers”[MeSH Terms] OR biomarker[Text Word])) AND (“proteomics”[MeSH Terms] OR proteomics[Text Word] OR proteomic[All Fields] OR “proteome”[MeSH Terms] OR proteome[Text Word]).
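The search behind Fig. 2 can be sketched programmatically. The minimal example below only builds request URLs for the NCBI E-utilities ESearch endpoint, one per publication year, using the query string from the caption; it does not contact the server. The year range, the function name `yearly_count_url`, and the choice of `rettype=count` are illustrative assumptions, not a description of how the authors produced the figure.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (public PubMed search service).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query as given in the Fig. 2 caption, with typographic quotes
# normalized to plain ASCII quotes.
QUERY = (
    '(clinical[All Fields] OR (("biological markers"[TIAB] NOT Medline[SB]) '
    'OR "biological markers"[MeSH Terms] OR biomarker[Text Word])) '
    'AND ("proteomics"[MeSH Terms] OR proteomics[Text Word] '
    'OR proteomic[All Fields] OR "proteome"[MeSH Terms] OR proteome[Text Word])'
)


def yearly_count_url(year: int) -> str:
    """Build an ESearch URL asking for the number of hits published in `year`."""
    params = {
        "db": "pubmed",
        "term": QUERY,
        "datetype": "pdat",    # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",    # return only the hit count
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical year range chosen for illustration only.
urls = {year: yearly_count_url(year) for year in range(2000, 2007)}

for year, url in urls.items():
    print(year, url)
```

Fetching each URL would return an XML document containing the hit count for that year. Counts obtained today are likely to differ from those in Fig. 2, because PubMed indexing changes over time.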
2. Sample Specimens and Processing Techniques Used for Clinical Proteomics
2.1. General Considerations
Because clinical proteomics relies heavily on patient specimens, three important factors need to be considered before the selection and preparation of clinical specimens: (1) selection of the correct clinical samples according to the type of research, (2) isolation of the appropriate component from the clinical samples, and (3) establishment of optimal experimental conditions for each sample (5,6,7,8). For the selection of correct clinical samples, the relationship between the clinical samples and the specific disease should also be considered. For example, although cancer tissue represents a specific cancer, several types of body fluids from patients may also have a relationship to the cancer. If the selected clinical samples specifically represent the disease, the next step is to evaluate which components are related to the specific disease. For instance, tumor cells in cancerous tissues are surrounded by many types of stromal cells, inflammatory cells, and connective tissues that are directly related to changes in protein expression in the cancer. If the purpose of proteomic analysis is to identify characteristic changes of specific proteins in tumor cells, then precise determination of the tumor cell percentage, which can be increased by tissue microdissection, appears to be necessary (5,6,7). As specimen conditions directly impact the results of biomarker discovery, well-defined clinical specimens should be used, since the discovery of disease biomarkers is much easier when the samples have clear anatomical and pathophysiological definitions. Because clinical specimens are heterogeneous, sophisticated pathological discrimination is required for the isolation of specific diseased tissue or body fluids. Without the expertise of a pathologist at the earliest stage, it may be difficult to isolate a specifically defined specimen for clinical proteomics.

Generally, clinical samples contain variable factors and components originating from the microenvironment of specific tissues. For instance, liver tissues usually contain a large amount of blood in the sinusoids, and this amount is increased in tissues with dilated sinusoids (9). Lung tissues usually contain deposited exogenous materials, and this amount is increased in heavy smokers (10). Note that the amount of blood present in isolated tissues may directly influence the relative proportion of proteins found in clinical specimens. Deposited materials and other chemicals, such as the staining dyes and fixatives used in microdissection, may also influence the experimental conditions (11). In the analysis of clinical samples, suitable buffer conditions, minimal lysis time, and high-yield protein precipitation are highly recommended. To avoid substantial variations between experiments using clinical specimens, a large set of specimens is also necessary because, unlike cultured cell lines, clinical specimens have high component variability (12). More details on specific disease types are described throughout this volume.

2.2. Body Fluids
A survey of the literature suggests that there are five to six different types of clinical specimens. Body fluids (e.g., plasma, urine, tears, cerebrospinal fluid, lymph, and ascites), tissues (e.g., liver, heart, muscle, brain, and lung), cells, bone, and hair have all been used for clinical proteomics (Table 1) (13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33). Each has its own merits and limitations for biomarker discovery via proteomic analysis. Among these specimens, the number of publications using body fluids has increased recently, perhaps because of their convenience and ease of use for noninvasive diagnosis. Since proteins secreted into body fluids during or after disease may reflect a broad range of pathophysiological conditions, much emphasis has been given to the identification of prominent protein/peptide biomarkers that exhibit differential expression at different stages. In the literature, the terms “body fluids” and “biofluids” are used interchangeably, although the former indicates a greater likelihood of being obtained directly from patients, while the latter is applied more broadly, referring to liquid or liquid-like samples obtained from living organisms, including model animals and plants. Throughout this chapter we will use “body fluids” for clarity. Given the large dynamic range of protein and peptide sources, plasma (a complex liquid interface between tissues) and extracellular fluids may be the best body fluid to use for clinical proteomics and biomarker discovery (34,35,36,37,38). In addition to plasma, more than a dozen additional body fluids are currently used for biomarker discovery, ranging from urine to peritoneal fluids (Table 1).
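The effect of a dominant protein on the composition of plasma can be illustrated with simple arithmetic. Table 1 notes that albumin accounts for about 50% of plasma proteins. The sketch below computes what share of the remaining protein mass belongs to all other proteins once a given fraction of albumin is removed. The function name `share_after_depletion` and the removal efficiencies (0%, 90%, 99%) are hypothetical values chosen for illustration; they are not measurements from this chapter, and the sketch ignores nonspecific loss of other proteins.

```python
def share_after_depletion(abundant_fraction: float, removal_efficiency: float) -> float:
    """Share of the remaining protein mass that is NOT the abundant protein.

    abundant_fraction: initial mass fraction of the abundant protein (0 <= f < 1)
    removal_efficiency: fraction of the abundant protein removed (0 <= e <= 1)
    """
    if not 0.0 <= abundant_fraction < 1.0:
        raise ValueError("abundant_fraction must be in [0, 1)")
    if not 0.0 <= removal_efficiency <= 1.0:
        raise ValueError("removal_efficiency must be in [0, 1]")
    remaining_abundant = abundant_fraction * (1.0 - removal_efficiency)
    others = 1.0 - abundant_fraction
    return others / (others + remaining_abundant)


albumin_fraction = 0.50  # approximate value quoted in Table 1

# Hypothetical removal efficiencies, for illustration only.
for efficiency in (0.0, 0.90, 0.99):
    share = share_after_depletion(albumin_fraction, efficiency)
    enrichment = share / (1.0 - albumin_fraction)
    print(f"removal {efficiency:.0%}: other proteins = {share:.1%} "
          f"of remaining mass ({enrichment:.2f}-fold relative enrichment)")
```

Under these assumptions, removing albumin alone can at most double the relative share of the other proteins, which is one reason depletion schemes usually target several high-abundance proteins at once.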
However, the biggest challenge in body fluids proteomics may be the multiple pretreatment processes including depletion of high-abundance proteins (in the case of plasma) (34,35,36) and/or their enrichment (in the case of urine) (15,39) prior to analysis (Table 1). Thus, the outcome of clinical proteomics may depend on proper sample processing since the quality of selection and handling of the most specific type of specimen will affect the overall pattern of profiling. Because the details of body fluid proteomics have been well described by Shen Hu et al. (38), we would like to focus on only a few essential points. First, standard measures need to be introduced to protect specimens from nonspecific proteolysis, lysis, and modification during collection and preparation (11). For the standardization of blood sample collection, Tammen emphasizes many useful considerations of preanalytical variables in plasma proteomics, which can be applied to processes involved with blood specimens [(40) and see Chapter 2]. The more specific problems involved in sample
Table 1
Types of Biological Specimens Used in Clinical Proteomics
Columns: Type; Disease; Characteristics of the samples; Reference; Pretreatment required for proteomics
Fluid (plasma/serum, secretions, proximal fluid, body cavity fluid)
  Specimen, disease, and reference:
    Plasma/serum (13,14)
    Urine: prostate cancer (15)
    Nasal discharge: seasonal allergic rhinitis (16)
    Tears: blepharitis and dry eye (17,18)
    Saliva: oral and breast cancer (19)
    Amniotic/cervical fluid: fetal aneuploidy and intra-amniotic inflammation (20,21)
    Follicular fluid: recurrent spontaneous abortion (22)
    Seminal fluid: male infertility (23)
    Nipple aspirate fluid: breast cancer (24)
    Cerebrospinal fluid: brain tumor (25)
    Synovial fluid: rheumatoid arthritis (26)
    Ascites: ovarian cancer (13)
    Bronchial lavage fluid: chronic obstructive pulmonary disease, asthmatics and lung disease (27,28)
    Pleural fluid: lung cancer (29)
    Peritoneal fluid: ovarian cancer (14)
  Characteristics of the samples:
    • Routinely accessible body fluids
    • Very important in the discovery of biomarkers of diseases (systemic vs. organ specific/local)
    • Important for early detection, disease severity, prognosis, monitoring of response to therapy
    • Can reflect disease perturbations in the organs or tissues from which they are secreted
    • Procedure of synovial biopsy is not very difficult
  Pretreatment required for proteomics:
    • Considerations for sample adequacy: storage, hemolysis, influence of anticoagulants, consistent results
    • Consider whether to pool samples or analyze individual samples
    • Depletion of high-abundance proteins (albumin constitutes 50% of plasma proteins)
    • Mucosa and salt have to be removed
Tissue (LCM or LMPC isolated; formalin fixed; paraffin embedded)
  Disease: any type of disease (30)
  Characteristics of the samples:
    • Very important for the development of novel in situ biomarkers
    • Immunofluorescence, immunocytochemistry, imaging mass spectrometry
  Pretreatment required for proteomics:
    • Considerations for sample adequacy
    • Integrity, degradation of protein
    • Contamination (microorganisms, extraneous material)
Cell (cell lines or primary tissue culture)
  Disease: any type of disease (31)
  Characteristics of the samples:
    • Very important in the discovery of biomarker candidates
    • Validation should be performed using primary tumor samples (e.g., immunohistologic methods, imaging MS)
  Pretreatment required for proteomics:
    • Desalting and removal of media components
Bone (cartilage)
  Disease: rheumatoid arthritis (32)
  Characteristics of the samples:
    • Cartilage consists mainly of extracellular matrix, mostly made of collagens and proteoglycans
  Pretreatment required for proteomics:
    • Cetylpyridinium chloride effectively aggregates with proteoglycan
Hair (33)
  Characteristics of the samples:
    • Over 300 proteins were found to constitute the insoluble complex formed by transglutaminase crosslinking
  Pretreatment required for proteomics:
    • Sufficient extraction of protein from the insoluble complex is needed
Paik et al.
handling are also addressed by Rai et al. (41). Second, to increase the dynamic range of detection and reduce sample heterogeneity, pretreatments such as depletion of high-abundance proteins appear to be required (34,35,36), often in several steps during initial sample processing. Multiple fractionations of clinical samples prior to the main separation work would reduce sample complexity. Note that coremoval of low-abundance proteins during this type of multiple depletion (36,42) and modification of proteins of interest during or after isolation (43) should be considered as well. To address several problems encountered with specimen collection, Xiao et al. (Chapter 13) in this volume also describe different methods to isolate extracellular matrix (ECM) and analyze the proteome of secreted vesicles. These methods will be useful for studying ECM and secreted vesicles in samples ranging from primary cultured cells to tissue specimens. Therefore, one must consider the best options for this process before doing the main experiment.
2.3. Tissues and Other Samples
Usually tissues are used as primary screening samples to find direct causes of disease from the lesion present in tissues of the corresponding organ, for example, liver tissue in hepatocellular carcinoma (HCC) (44,45). Tissues are widely used for clinical proteomics, although there are no standard operating procedures for specimen fractionation and the detection limit of current instrumentation remains borderline. As listed in Table 1, many cancer tissues can be prepared in different ways, such as laser capture microdissection (LCM) (5,6), laser microdissection and pressure catapulting (LMPC) (30,46), and formalin-fixed paraffin-embedded sample preparation (11). These techniques are well described in Chapters 3, 5, 9, and 11 in this volume.
It is desirable, however, that proteomics studies of disease tissues also be coupled with parallel analysis of the corresponding body fluids. For example, in the study of cancer biomarkers, paired cancer tissue sets (tumor vs. nontumor) and the same patient’s plasma were used, which led to a more comprehensive analysis (47,48). Owing to the complexity of the sample, experiments on tissue samples may be more suitable for pathophysiological studies than for biomarker discovery. Specimen processing for proteomics studies usually involves several unwanted problems, such as artifacts created during sample collection, processing, and storage. Other matters arise in the handling of patient information regarding sex, age, and race (49). To minimize the problems associated with sample handling, it is advisable to establish a specimen bank (50,51,52). In fact, the collection of many clinical samples in a biorepository would have enormous
Overview and Introduction to Clinical Proteomics
benefits for proteomic research. This enables the selection of homogeneous clinical samples according to the research purpose and the isolation of specific components from clinical samples. Additionally, large-scale collection of clinical specimens in a biorepository is essential for the validation of specific markers after biomarker candidate discovery. Ideally, the clinical samples stored in the biorepository should be (1) collected and stored immediately, because dead cells and altered proteins affect proteomic analysis, (2) subjected to accurate quality control, and (3) catalogued with reliable and secure clinical data. The quality control of clinical samples includes trimming of specimens and confirmation of diagnosis by pathologists; the information gained (such as the tumor cell to stromal cell ratio, percentage of necrosis, percentage of fibrosis, proportion of infiltrated inflammatory cells, etc.) should be stored in a database of clinical samples. It is also essential to store clinical and follow-up data for each sample and each patient’s written informed consent form in the biorepository network. This clinical specimen banking network provides convenience, reduced cost, and reliability for researchers involved in clinical proteomic research (50,51,52). For representative tissue sample collection for proteomics studies, Diaz et al. (Chapter 3) address a practical experimental strategy for storage and handling of specimens that are used in surface-enhanced laser desorption/ionization (SELDI), 2D gel, and liquid chromatography (LC)-based proteomics. Emphasis should be given to the primary responsibility of pathologists in the whole process of tissue proteomics in addition to morphological analysis at the molecular level.
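The quality-control information listed above lends itself to a simple structured record. The following is a minimal sketch in Python, not a description of any existing biorepository system: the class name `SpecimenRecord`, its field names, the function `select_homogeneous`, and the cutoff values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpecimenRecord:
    """One banked specimen with the quality-control fields named in the text."""
    specimen_id: str              # hypothetical identifier
    diagnosis_confirmed: bool     # diagnosis confirmed by a pathologist
    tumor_cell_fraction: float    # tumor cells as a fraction (0..1) of tumor + stromal cells
    necrosis_pct: float           # percentage of necrosis
    fibrosis_pct: float           # percentage of fibrosis
    inflammatory_cell_pct: float  # proportion of infiltrated inflammatory cells
    informed_consent: bool        # written informed consent on file


def select_homogeneous(records: List[SpecimenRecord],
                       min_tumor_fraction: float = 0.7,
                       max_necrosis_pct: float = 10.0) -> List[SpecimenRecord]:
    """Keep specimens that pass QC and meet illustrative (assumed) cutoffs."""
    return [
        r for r in records
        if r.diagnosis_confirmed
        and r.informed_consent
        and r.tumor_cell_fraction >= min_tumor_fraction
        and r.necrosis_pct <= max_necrosis_pct
    ]


# Made-up example records, for illustration only.
bank = [
    SpecimenRecord("S001", True, 0.85, 5.0, 10.0, 3.0, True),
    SpecimenRecord("S002", True, 0.40, 2.0, 30.0, 8.0, True),
    SpecimenRecord("S003", True, 0.90, 25.0, 5.0, 2.0, True),
    SpecimenRecord("S004", False, 0.95, 1.0, 4.0, 1.0, True),
]
selected_ids = [r.specimen_id for r in select_homogeneous(bank)]
print(selected_ids)
```

Storing these fields explicitly is what makes the selection of homogeneous sample sets a query rather than a manual review; the appropriate cutoffs would depend on the research purpose.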
3. Biomarker Discovery and Clinical Proteomics
Given that one of the central issues of clinical proteomics is biomarker discovery and its application, a brief account of this subject is appropriate here. An excellent review of the whole arena of biomarker development can be found elsewhere (53,54,55). Until now, the generally accepted concept of a disease biomarker has been a single protein/peptide with high specificity that is usually present in low abundance, is expressed in a disease in a stage-specific manner, and serves as a major fingerprint of the body’s response to drugs or other treatments. Although many examples of broad biomarkers for various diseases are known (56,57,58,59,60), identification of more specific and selective biomarkers is urgently needed. Accordingly, we may also need to change the current biomarker concept and eliminate the inherent bias toward individual disease biomarkers. Recently, a new idea has been introduced that an ensemble of different proteins would be more efficient than a single protein/peptide in the diagnosis of disease (61,62,63). To solve
this problem we propose a general strategy of clinical proteomics leading to disease biomarker discovery, as outlined in Fig. 3. Since biomarker candidate proteins could come from many different cellular processes, they could be of either low or high abundance, which would directly or indirectly reflect the physiological condition of the body. They may also be present in different concentrations depending on the disease stage or tissue type. For example, common proteins such as Hsp 27 (64,65), 14-3-3 proteins (66,67), apoA-I (68,69), and serum amyloid precursor A (70) appear in most disease samples from lung cancer, gastric cancer, pancreatic cancer, prostate cancer, neuroblastoma, and inflammation. A number of questions then arise: should they be treated as disease-specific or disease-nonspecific proteins? What would be the criterion for making this decision? Is this due to the fact that the number and type of proteins secreted from a specific
Fig. 3. The concept of the creation of a protein biomarker panel for a specific disease. Each white, gray, dark-gray, and black circle represents a putative protein biomarker of a specific disease at that clinical stage. A group of slash-lined circles symbolizes the biomarker panel of liver disease as an example.
physiological condition of many different types of diseases might be similar? How can one distinguish one type of disease from another simply by looking at their protein profiles? As outlined in Fig. 3, at the beginning of a certain disease, signals may be limited to only a few easily counted molecules. As the disease progresses, more signal molecules may be produced, resulting in mixed types of biomarkers representing multiple disease phenomena. Although this assumption seems oversimplified, more noise is created at a certain stage, where it becomes more difficult to identify those molecules for two reasons: (1) they are present in amounts too small to be detected using current technology and (2) it may be too early for the molecules to be specific for a particular disease. Presumably, proteins appearing in stage 3 or 4 may have higher specificity for a particular disease, but the sensitivity might be low. It may be that this noise interferes with the signaling pathway of a certain disease, and we may end up having no decisive marker. To circumvent this problem, it may be desirable to identify a set of biomarker candidate proteins, termed a “biomarker panel,” which ideally contains potential candidate proteins or peptides that represent specific stages of the disease as a group. Given this panel, extensive validation may be pursued using a large cohort. Analogous to this strategy, many biomarker candidates at stage 1 can be included in the panel, which can have more specificity and sensitivity than a single-molecule biomarker. Such a biomarker panel can be used not only as a diagnostic marker but also as a prognostic indicator in monitoring treatment effectiveness. For example, Linkov et al.
(61) reported that sensitivity and specificity improved up to 84.5 and 98%, respectively, when a panel of 25 markers was used in the early diagnosis of squamous cell cancer of the head and neck. In the diagnosis of prostate cancer, specificity increased from 5–15 to 84–95% when a biomarker panel containing six marker proteins was used instead of a single marker. In HCC, studies have been carried out on a biomarker panel consisting of a protein array that can be used as a diagnostic kit (62,63). A general strategy for biomarker discovery is outlined in Fig. 4. In typical clinical proteomics work, sample collection is the first step, followed by pretreatment of the sample to reduce its complexity and enable the search for low-abundance proteins (e.g., disease biomarkers) using various fractionation tools. This multidimensional fractionation is well described elsewhere (34,35,36) and depends on the properties and concentration of the sample. Typically the prefractionated samples go to either a two-dimensional electrophoresis (2DE) or an LC-based proteomics separation system, followed by single or multiple steps of mass spectrometric analysis depending on the sample
quantity and experimental goal. The data obtained from this series of analyses are integrated into the proteome informatics system, where protein/peptide identification, quantification, modification analysis, and verification of peak lists are carried out [(71) and also Chapter 19]. Usually this step becomes rate limiting, since major profiling data are constructed and analyzed at this point. The clinical relevance of those proteins (and changes in their expression level) in a specific disease state is largely determined here, which eventually leads to identification of biomarker candidates. In addition, SELDI, molecular imaging, and protein microarrays can also be applied before or after this step. Once major biomarker candidates are identified, those proteins are subjected to further verification via sophisticated analytical arrays and translational proteomics, which involves cohort studies, pre-evaluation, and a robust analytical system (4,72). Throughout the process of translational proteomics, one may be able to judge whether the identified panel or single proteins are suitable as biomarkers of a specific disease. A recent comprehensive review by Zolg (73) addressed several considerations in the biomarker development pipeline from discovery to validation. Three critical challenges within the pipeline are reduction of clinical sample complexity, the proof of principle of biomarker function, and the detection limit of unique proteins present in the samples.
Fig. 4. A typical experimental strategy for clinical proteomics and translational proteomics. In clinical proteomics research, various experimental techniques are included: specimen collection, prefractionation, 2DE, non-2DE (liquid-based separation), mass spectrometry, informatics, and others. The course of each section as marked (square, circle in different color) is determined by the investigators, depending on the experimental goal. At the bottom, experimental procedures for the verification and validation of biomarker candidates are schematically outlined, leading to clinical screening and applications. The squares indicate the separation system based on the specific characteristics of proteins and the general prefractionation system. The open circles and open triangle represent analytical modules at the protein and peptide level, respectively. The arrows and junction points indicate the options at each selection. The bottom parts indicate the verification procedure employing multiple reaction monitoring and quantitative mass analysis. Biomarker candidates identified from typical clinical proteomics would be subject to translational proteomics for validation, where a large-scale cohort study and evaluation would then proceed.
In the search for biomarker panels, reliable statistical tools and bioinformatics resources are needed, which are now available on the web (Table 2; see also Chapters 16 and 17). As the number of biomarker panel candidates increases, more cases are being examined, which requires statistical learning methods. These methods include neural networks, genetic algorithms, k-means nearest-neighbor analysis, Euclidean distance-based nonlinear methods, fuzzy pattern matching, self-organizing mapping, and support vector machines (74,75,76,77,78). They are very useful for classification of proteins according to the specific disease state (see also Chapters 16 and 20). Once biomarker candidates are identified, it is necessary to predict in silico the function of these proteins and validate them in the context of clinical application. Table 3 provides web resources that can be used for clinical data management, in silico functional annotation (see Chapter 18), prediction, and identification of modified forms of proteins. Thus, by combining experimental methods (Fig. 4) and informatics tools (Tables 2 and 3), one is able to obtain a set of biomarker candidate proteins (panel) that would be further used for validation through translational proteomics (Fig. 1).
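To make the classification idea concrete, the sketch below applies a plain k-nearest-neighbor classifier with Euclidean distance, one simple member of the family of methods listed above, to a three-protein panel and then computes the sensitivity and specificity discussed earlier for biomarker panels. It is an illustration under stated assumptions and not the method of any cited study: the abundance values are invented, and the function names `euclidean`, `knn_predict`, and `sensitivity_specificity` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training, query, k=3):
    """Label a query profile by majority vote among its k nearest training profiles."""
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def sensitivity_specificity(truth, predicted, positive="disease"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Invented abundances for a three-protein panel (arbitrary units).
training = [
    ((8.0, 7.5, 1.0), "disease"),
    ((7.5, 8.2, 1.5), "disease"),
    ((9.0, 7.0, 0.8), "disease"),
    ((2.0, 3.0, 5.0), "control"),
    ((2.5, 2.2, 6.0), "control"),
    ((1.8, 3.5, 5.5), "control"),
]
held_out = [
    ((8.2, 7.8, 1.2), "disease"),
    ((7.0, 6.5, 2.0), "disease"),
    ((3.5, 3.5, 4.0), "disease"),   # an atypical case that resembles the controls
    ((2.2, 2.8, 5.2), "control"),
    ((3.0, 3.0, 4.5), "control"),
]

truth = [label for _, label in held_out]
predicted = [knn_predict(training, profile) for profile, _ in held_out]
sensitivity, specificity = sensitivity_specificity(truth, predicted)
print(predicted, sensitivity, specificity)
```

In this toy data set the atypical disease sample is misclassified, so sensitivity falls below 1 while specificity is unaffected. Real panels would need far larger cohorts and held-out validation, as the text emphasizes.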
4. Introduction of the Experimental Strategy Described in This Volume
For protein profiling and identification, proteomics platform technologies are moving forward in many areas, not only in clinical proteomics but also in the general biological field. In this section, the leading scientists in the field of proteomics outline core techniques and their application to studies in clinical proteomics. For example, in plasma proteome analysis, it is necessary to deplete high-abundance proteins using various techniques such as multidimensional fractionation by immunoaffinity column, gel permeation, and beads (Fig. 4). Cho et al. (Chapter 4) address this in relation to 2D gel analysis of plasma, describing the technical details of sample preparation, gel electrophoresis, and quantification of proteins on the gel. Zhang and Koay (Chapter 5) describe the methods of 2D gel analysis for cells prepared by LCM. They describe the application of LCM in dissecting tumor cells in breast cancer for macromolecular extraction and 2D gels. This can be used to prepare samples from paraffin-embedded tissue blocks by microdissecting the cells of interest. Further to this procedure, Mustafa et al. (Chapter 9) review the application of LCM for proteomics analysis and demonstrate that combining LCM and MS would facilitate identification of specific proteins for each sample type. For urine sample analysis, Zerefos et al. (Chapter 8) provide simple protocols for protein analysis by 2D gel or direct matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry. These techniques include protein enrichment by means of protein precipitation and ultrafiltration. Combining these methods with the above profiling technologies allows reproducible and sensitive analysis of one of the most significant and complex biological samples (77).
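The effect of the plasma depletion step mentioned in this section, and of the coremoval problem raised earlier, can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model only: it uses the figure from Table 1 that albumin makes up about 50% of plasma protein, while the 98% depletion efficiency, the 30% coremoval of the target, and the function name `fraction_after_depletion` are illustrative assumptions rather than measured values.

```python
def fraction_after_depletion(target_fraction, abundant_fraction,
                             depletion_efficiency, target_coremoval=0.0):
    """Mass fraction of a target protein after depleting one abundant protein.

    All arguments are fractions between 0 and 1. The fractions of target and
    abundant protein are relative to total protein mass before depletion.
    """
    remaining_target = target_fraction * (1.0 - target_coremoval)
    removed_abundant = abundant_fraction * depletion_efficiency
    removed_target = target_fraction * target_coremoval
    remaining_total = 1.0 - removed_abundant - removed_target
    return remaining_target / remaining_total


target = 1e-6          # assumed: target is one millionth of total protein mass
albumin = 0.5          # albumin is about 50% of plasma protein (Table 1)

clean = fraction_after_depletion(target, albumin, depletion_efficiency=0.98)
lossy = fraction_after_depletion(target, albumin, depletion_efficiency=0.98,
                                 target_coremoval=0.30)

fold_clean = clean / target   # relative enrichment with no coremoval
fold_lossy = lossy / target   # relative enrichment with 30% of the target lost
print(round(fold_clean, 2), round(fold_lossy, 2))
```

Under these assumptions, removing albumin alone roughly doubles the relative share of a low-abundance protein, and coremoval erodes part of that gain. This is consistent with the text's point that several depletion and fractionation steps are usually combined.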
Table 2
Clinical Proteomics Initiatives and Resources
Institute
CPTI
  Details: National Cancer Institute’s Clinical Proteomics Technologies Initiative for Cancer
  Website: http://proteomics.cancer.gov
ABRF
  Details: The Association of Biomolecular Resource Facilities, an international society dedicated to advancing core and research biotechnology laboratories through research, communication, and education
  Website: http://www.abrf.org/
PPI
  Details: Plasma Proteome Institute; the PPI is working to facilitate clinical adoption of advanced diagnostic tests using proteins in plasma and serum
  Website: http://www.plasmaproteome.org/plasmaframes.htm
EDRN
  Details: The Early Detection Research Network; the EDRN provides up-to-date information on biomarker research through this website and scientific publications
  Website: http://edrn.nci.nih.gov
Web resources
ExPASy
  Details: Expert Protein Analysis System, proteomics-related information and database
  Website: http://www.expasy.org/
NCBI
  Details: National Center for Biotechnology Information; the protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq
  Website: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein&itool=toolbar
CPRMap
  Details: Clinical Proteomics Research Map, updated research articles for disease and clinical proteomics
  Website: http://www.cprmap.com/
Database
MedGene
  Details: MedGene can make a list of human genes associated with a particular human disease in ranking order
  Website: http://hipseq.med.harvard.edu/MEDGENE
Table 3 Available Bioinformatic Resources for the Analysis of Proteomics Data Name
Description
Clinical proteome data management system
Proteus: LIMS for proteomics pipeline
CPAS: LIMS for identification and quantification using LC-MS/MS data
Systems biology experiment analysis management system: A management system for collecting, storing, and accessing data produced by microarray, proteomics, and immunohistochemistry
GPM database: Open source system for analyzing, validating, and storing protein identification data
SpectrumMill: MS/MS data analysis and management system
Phosphorylation
Group-based phosphorylation scoring method: Prediction of kinase-specific phosphorylation sites
KinasePhos: A web tool for identifying protein kinase-specific phosphorylation sites using a hidden Markov model
NetPhos: Sequence and structure-based prediction of eukaryotic protein phosphorylation sites
NetPhosK: Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence
Website URL
PMID
http://www.genologics.com 16396501
http://www.sbeams.org/
16756676
http://www.thegpm.org/
15595733
http://www.chem.agilent.com/
http://973proteinweb.ustc.edu.cn/gps/gps_web/ http://kinasePhos.mbc.nctu.edu.tw
15980451
http://www.cbs.dtu.dk/services/NetPhos/
10600390
http://www.cbs.dtu.dk/services/NetPhosK/
15174133
15980458
PredPhospho: Prediction of phosphorylation sites using support vector machine
PREDIKIN: A prediction of substrates for serine/threonine protein kinases based on the primary sequence of a protein kinase catalytic domain
Prosite: A prediction of substrates for protein kinases based on conserved motif search
Scansite: Prediction of PK-specific phosphorylation site with Bayesian decision theory
Phospho.ELM: A database of experimentally verified phosphorylation sites in eukaryotic proteins
Human protein reference database (HPRD): A database of known kinase/phosphatase substrate as well as binding motifs that are curated from the published literature
PhosphoSite: A bioinformatics resource dedicated to physiological protein phosphorylation
Glycosylation
NetOGlyc 2.0
http://pred.ngri.re.kr/PredPhospho.htm http://florey.biosci.uq.edu.au/kinsub/home.htm
15231530
http://kr.expasy.org/prosite
17237102
http://scansite.mit.edu
16549034
http://phospho.elm.eu.org/
15212693
http://www.phosphosite.org/Login.jsp
15174125
Predicts O-glycosylation sites in mucin-type proteins
http://www.cbs.dtu.dk/services/NetOGlyc/ http://www.cbs.dtu.dk/services/DictyOGlyc/ http://www.cbs.dtu.dk/services/YinOYang/ http://www.cbs.dtu.dk/services/NetNGlyc/ http://www.expasy.ch/tools/glycomod/
9557871
DictyOGlyc 1.1
Predicts O-GlcNAc sites in eukaryotic proteins
YinOYang 1.2
Predicts O-GlcNAc sites in eukaryotic proteins
NetNGlyc 1.0
Predicting N-glycosylation sites
GlycoMod
Web software for prediction of the possible oligosaccharide structures in glycoproteins from their experimentally determined masses
16445868
http://www.hprd.org/PhosphoMotif_finder
10521537
16316981
11680880
Glyco-fragment: A web tool to support the interpretation of mass spectra of complex carbohydrates
http://www.dkfz.de/spec/projekte/fragments/
14625865
GlycoSearchMS: Compares each peak of a measured mass spectrum with the calculated fragments of all structures contained in the SweetDB
GlycosidIQ: Based on the matching of experimental MS2 data with the theoretical fragmentation of glycan structures in GlycoSuiteDB
Saccharide topology analysis tool: A web-based computational program that can quickly extract sequence information from a set of MSn spectra for an oligosaccharide of up to 10 residues
GlycoX: Determines simultaneously the glycosylation sites and oligosaccharide heterogeneity of glycoproteins using MATLAB
MODi: A web server for identifying multiple post-translational peptide modifications from tandem mass spectra
SWEET-DB: An attempt to create annotated data collections for carbohydrates
Protein–protein interaction
Munich information center for protein sequence’s MPPI: The database of mammalian protein–protein interactions
http://www.dkfz.de/spec/glycosciences.de/sweetdb/ms/ 15215392
https://tmat.proteomesystems.com/glyco/glycosuite/glycodb 15174134 10857602
17022651
http://www.unimod.org
16845006
http://www.dkfz.de/spec2/sweetdb/
11752350
http://mips.gsf.de
16381839
Database of interacting proteins: A database that documents experimentally determined protein–protein interactions
Molecular interaction network database: A database that stores, in a structured format, information about molecular interactions by extracting experimental details from work published in peer-reviewed journals
Protein–protein interactions of cancer proteins: Predicts interactions, which are derived from homology with experimentally known protein–protein interactions from various species
IntAct: IntAct provides a freely available, open source database system and analysis tools for protein interaction data
Biomolecular interaction network database: A database designed to store full descriptions of interactions, molecular complexes and pathways
Metabolic and signal pathway
BioCarta
KEGG
Cancer cell map
HPRD
http://dip.doe-mbi.ucla.edu/
11752321
http://mint.bio.uniroma2.it/mint
17135203
http://bmm.cancerresearchuk.org/~pip
16398927
http://www.ebi.ac.uk/intact/
17145710
http://www.bind.ca
12519993
A pathway database
http://www.biocarta.com http://www.genome.jp/kegg
A pathway database with genomical, chemical, and biological network information The cancer cell map is a selected set of human cancer focused pathways A database with data pertaining to post-translational modifications, protein–protein interactions, tissue expression,
16381885
http://cancer.cellmap.org/cellmap/ http://www.hprd.org/
subcellular localization, and enzyme–substrate relationships
Proteomic data resource
The cancer cell map: A database of clinical data from SELDI-TOF
Proteomics identifications database: A database of protein and peptide identifications that have been described in the scientific literature
PeptideAtlas: A multiorganism, publicly accessible compendium of peptides identified in a large set of tandem mass spectrometry proteomics experiments
Disease resource
Online Mendelian Inheritance in Man
GeneCards
Cancer gene census
http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp http://www.ebi.ac.uk/pride/
16381953
http://www.peptideatlas.org
16381952
A database of human genes and genetic disorders
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
17170002
An integrated database of human genes that includes automatically mined genomic, proteomic, and transcriptomic information A catalogue of those genes for which mutations have been causally implicated in cancer
http://www.genecards.org/index.shtml 15608261
http://www.sanger.ac.uk/genetics/CGP/Census/
14993899
Two-dimensional electrophoresis is perhaps the most popular start-up tool for proteome analysis. In clinical proteomics, 2DE has been the traditional workhorse for the analysis of clinical specimens ranging from plasma to urine (Table 1). Quantification problems in 2DE can now be addressed by employing fluorescent dyes (Cy3 and Cy5), which allow normalization
of data obtained from two different clinical specimens (79). Friedman and Lilley (Chapter 6) present general optimization conditions for differential in-gel electrophoresis (DIGE) in the quantitative analysis of clinical samples. They address the usefulness of differential labeling dyes (Cy2, Cy3, and Cy5). The essence of any DIGE system is to minimize potential human errors in the identification and quantification of proteins spotted in a 2D gel (79). The difficulties in 2D map analysis are introduced by Marengo et al. (Chapter 16). They describe methods for comparing protein spots using image analysis technology and related informatics tools to minimize variations between measurements of spot volume, a key to successful 2D map construction. There are many variations of LC in protein profiling, including mass detection methods, column types, data mining through search engines, mass accuracy, and running conditions (80,81,82). These are all related to quantification of proteins or peptides in the sample, one of the major bottlenecks in proteomics (83,84,85,86,87). Among the available techniques are isotope-coded affinity tags (ICAT), mass-coded affinity tagging, and nonisotope-labeled methods. Xiao and Veenstra (Chapter 10) present the application of ICAT to the study of COX-2 inhibitor-regulated proteins in a colon cancer cell line. With emphasis on sample preparation, they provide details on ICAT procedures for quantitative proteomics (88). In addition to this approach, Li et al. (Chapter 11) employ a strategy that combines LCM techniques for sample preparation of HCC with cleavable isotope-coded affinity tags in order to identify those markers quantitatively. However, it should be mentioned here that other measures are needed to increase the efficiency of ICAT, since it has drawbacks in the efficiency of sample recovery during or after the labeling steps (87).
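The paired-label measurements described above, whether spot volumes from two dye channels in DIGE or light and heavy peak areas in ICAT, are commonly summarized as log ratios. The sketch below shows one simple way to do this in Python. It is an illustration only: median centering and the twofold threshold are assumptions made for this example and are not prescribed by the chapters cited, and the function names and intensity values are hypothetical.

```python
import math
from statistics import median


def log2_ratios(channel_a, channel_b):
    """log2(a / b) for every feature measured in both channels."""
    return {
        feature: math.log2(channel_a[feature] / channel_b[feature])
        for feature in channel_a
        if feature in channel_b
    }


def median_normalize(ratios):
    """Center log ratios on their median to correct a global labeling or loading bias."""
    center = median(ratios.values())
    return {feature: value - center for feature, value in ratios.items()}


def changed_features(normalized, threshold=1.0):
    """Features whose normalized |log2 ratio| meets the threshold (1.0 = twofold)."""
    return sorted(f for f, v in normalized.items() if abs(v) >= threshold)


# Invented intensities; the disease channel was loaded at twice the amount.
disease = {"spot1": 2000, "spot2": 4000, "spot3": 1000, "spot4": 16000, "spot5": 500}
control = {"spot1": 1000, "spot2": 2000, "spot3": 500, "spot4": 2000, "spot5": 1000}

normalized = median_normalize(log2_ratios(disease, control))
changed = changed_features(normalized)
print(normalized, changed)
```

Without the centering step, every spot in this toy data set except `spot5` would appear to be upregulated twofold or more, which illustrates why normalization between specimens matters before differential spots or peptides are called.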
A label-free serum quantification method has been recently introduced (48) (see Chapter 12 by Higgs et al.). The use of antibody arrays in clinical proteomics has increased recently in the context of high-throughput detection of cancer specimens where the identities of the proteins of interest are known (89,90). The evaluation of antibody cross-reactivity and specificity is crucial in these assays. This matter is addressed by Sanchez-Carbayo (Chapter 15), who describes the technical aspects and application of planar antibody arrays in the quantification of serum proteins, and by Hsu et al. (Chapter 14), who describe the development and use of bead-based miniaturized multiplexed sandwich immunoassays for focused protein profiling in various body fluids. The latter method, using bead-based protein arrays or suspension microarrays, allows the simultaneous analysis of a variety of parameters within a single experiment. Given the versatility of suspension microarrays in the analysis of proteins of interest present in different types of body fluids, ranging from serum to synovial fluids, the multiplexed protein profiling technology described by Hsu et al. (Chapter 14) seems to hold great promise in clinical proteomics. Similarly, in combination with
tissue microarray technology (91) it would also be possible to perform parallel molecular profiling of clinical samples together with immunohistochemistry, fluorescence in situ hybridization, or RNA in situ hybridization. SELDI is another approach to high-throughput profiling of clinical samples in the course of disease marker discovery [(92,93), Chapter 7]. It is expected that profiling approaches in proteomics, such as SELDI-MS, will be frequently used in disease marker discovery, but only if the identification technologies coupled with SELDI are improved. During the course of biomarker discovery, large data sets are usually generated and deposited in a coordinated fashion (Tables 2 and 3) (94,95). Indeed, statistical analysis of 2DE proteomics data, which comprise several hundred protein spots, is complex. To circumvent some inconsistency in 2D gel proteomics data, Friedman and Lilley (Chapter 6) and Carpentier et al. (Chapter 17) point out available statistical tools and suggest case-specific guidelines for 2D gel spot analysis. Fitzgibbon et al. (Chapter 19) describe an open source platform for LC-MS spectra in which the msInspect program is used to lower false positives and guide normalization of the dataset. It is also demonstrated that msInspect can analyze data from quantitative studies with and without isotopic labels. Paliakasis et al. (Chapter 18) introduce web-based tools for protein classification, which lead to prediction of potential protein function and family clustering of related proteins. They provide some guidelines for classifying protein data into more meaningful families. Finally, Somorjai (Chapter 20) addresses important filtering criteria for the application of protein pattern recognition to biomarker discovery using statistical tools.
5. Concluding Remarks
Although there are several bottlenecks in clinical proteomics (such as lack of standardization of specimen processing, quantification, and an overall strategy for handling biomarkers after identification), we believe that the field holds great promise in biomarker discovery. The success of clinical proteomics depends on the availability and selection of well-phenotyped specimens, reduction of sample complexity, development of good informatics tools, and efficient data management. Therefore, sample handling techniques, including microdissection for tissue samples, multidimensional fractionation for body fluids, and pretreatment of other clinical specimens (e.g., urine, tears, and cells), should be developed in this context. Since there is no gold standard for sample collection and handling, one needs to find the best options available for sample processing without damage. In addition, establishment of a biorepository system would systematically minimize some artifacts and variation between samples during or after identification of biomarkers.
Overview and Introduction to Clinical Proteomics
It is now generally accepted that an ensemble (or panel) of different proteins would be more efficient than a single protein/peptide in the diagnosis of disease, an idea that is poised to replace the conventional concept of a biomarker. As a high-throughput means of protein profiling, antibody arrays have recently seen increased use in clinical proteomics for the detection of cancer specimens. However, when antibody arrays are used to profile serum autoantibodies, issues of cross-reactivity and specificity have to be resolved. Although not covered here due to space limitations, with the advent of proteomics techniques one can further analyze networks of protein–protein interactions as well as post-translational modifications of the proteins involved in a specific disease (Table 3). It is now highly recommended that common reagents such as antibodies and standard proteins, which are very useful for spiking, quantification, and normalizing sensitivity from one instrument to another, be used in worldwide efforts such as the Human Proteome Organization (HUPO) Plasma Proteome Project (96,97). Finally, clinical proteomics needs the integration of biochemistry, pathology, analytical technology, bioinformatics, and proteome informatics to develop highly sensitive diagnostic tools for routine clinical care in the future (71,98).
Acknowledgments
This study was supported by a grant from the Korea Health 21 R&D project, Ministry of Health & Welfare, Republic of Korea (A030003 to YKP).
References
1. Etzioni, R., Urban, N., Ramsey, S., McIntosh, M., Schwartz, S., Reid, B., Radich, J., Anderson, G., and Hartwell, L. (2003) The case for early detection. Nat. Rev. Cancer 3, 1–10. 2. Ludwig, J. A. and Weinstein, J. N. (2005) Biomarkers in cancer staging, prognosis and treatment selection. Nat. Rev. Cancer 5, 845–856. 3. Xiao, Z., Prieto, D., Conrads, T. P., Veenstra, T. D., and Issaq, H. J. (2005) Proteomic patterns: their potential for disease diagnosis. Mol. Cell Endocrinol. 230, 95–106. 4. Rifai, N., Gillette, M. A., and Carr, S. A. (2006) Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 24, 971–983. 5. Emmert-Buck, M. R., Bonner, R. F., Smith, P. D., Chuaqui, R. F., Zhuang, Z., Goldstein, S. R., Weiss, R. A., and Liotta, L. A. (1996) Laser capture microdissection. Science 274, 998–1001. 6. Gillespie, J. W., Ahram, M., Best, C. J., Swalwell, J. I., Krizman, D. B., Petricoin, E. F., Liotta, L. A., and Emmert-Buck, M. R. (2001) The role of tissue microdissection in cancer research. Cancer J. 7, 32–39.
7. Craven, R. A. and Banks, R. E. (2002) Use of laser capture microdissection to selectively obtain distinct populations of cells for proteomic analysis. Methods Enzymol. 356, 33–49. 8. Vincourt, J. B., Lionneton, F., Kratassiouk, G., Guillemin, F., Netter, P., Mainard, D., and Magdalou, J. (2006) Establishment of a reliable method for direct proteome characterization of human articular cartilage. Mol. Cell Proteomics 5, 1984–1995. 9. Platt, M. S., Agamanolis, D. P., Krill, C. E. Jr., Boeckman, C., Potter, J. L., Robinson, H., and Lloyd, J. (1983) Occult hepatic sinusoid tumor of infancy simulating neuroblastoma. Cancer 52, 1183–1189. 10. Mahadevia, P. J., Fleisher, L. A., Frick, K. D., Eng, J., Goodman, S. N., and Powe, N. R. (2003) Lung cancer screening with helical computed tomography in older adult smokers: a decision and cost-effectiveness analysis. JAMA 289, 313–322. 11. Hood, B. L., Darfler, M. M., Guiel, T. G., Furusato, B., Lucas, D. A., Ringeisen, B. R., Sesterhenn, I. A., Conrads, T. P., Veenstra, T. D., and Krizman, D. B. (2005) Proteomic analysis of formalin-fixed prostate cancer tissue. Mol. Cell Proteomics 4, 1741–1753. 12. Alaiya, A., Al-Mohanna, M., and Linder, S. (2005) Clinical cancer proteomics: promises and pitfalls. J. Proteome Res. 4, 1213–1222. 13. Gericke, B., Raila, J., Sehouli, J., Haebel, S., Konsgen, D., Mustea, A., and Schweigert, F. J. (2005) Microheterogeneity of transthyretin in serum and ascitic fluid of ovarian cancer patients. BMC Cancer 17, 133–141. 14. Swisher, E. M., Wollan, M., Mahtani, S. M., Willner, J. B., Garcia, R., Goff, B. A., and King, M. C. (2005) Tumor-specific p53 sequences in blood and peritoneal fluid of women with epithelial ovarian cancer. Am. J. Obstet. Gynecol. 193, 662–667. 15. Pisitkun, T., Johnstone, R., and Knepper, M. A. (2006) Discovery of urinary biomarkers. Mol. Cell Proteomics 5, 1760–1771. 16. Ghafouri, B., Irander, K., Lindbom, J., Tagesson, C., and Lindahl, M. 
(2006) Comparative proteomics of nasal fluid in seasonal allergic rhinitis. J. Proteome Res. 5, 330–338. 17. Koo, B. S., Lee, D. Y., Ha, H. S., Kim, J. C., and Kim, C. W. (2005) Comparative analysis of the tear protein expression in blepharitis patients using two-dimensional electrophoresis. J. Proteome Res. 4, 719–724. 18. Grus, F. H., Podust, V. N., Bruns, K., Lackner, K., Fu, S., Dalmasso, E. A., Wirthlin, A., and Pfeiffer, N. (2005) SELDI-TOF-MS ProteinChip array profiling of tears from patients with dry eye. Invest. Ophthalmol. Vis. Sci. 46, 863–876. 19. Amado, F. M., Vitorino, R. M., Domingues, P. M., Lobo, M. J., and Duarte, J. A. (2005) Analysis of the human saliva proteome. Expert Rev. Proteomics 2, 521–539. 20. Wang, T. H., Chang, Y. L., Peng, H. H., Wang, S. T., Lu, H. W., Teng, S. H., Chang, S. D., and Wang, H. S. (2005) Rapid detection of fetal aneuploidy using proteomics approaches on amniotic fluid supernatant. Prenat. Diagn. 25, 559–566. 21. Ruetschi, U., Rosen, A., Karlsson, G., Zetterberg, H., Rymo, L., Hagberg, H., and Jacobsson, B. (2005) Proteomic analysis using protein chips to detect
biomarkers in cervical and amniotic fluid in women with intra-amniotic inflammation. J. Proteome Res. 4, 2236–2242. 22. Kim, Y. S., Kim, M. S., Lee, S. H., Choi, B. C., Lim, J. M., Cha, K. Y., and Baek, K. H. (2006) Proteomic analysis of recurrent spontaneous abortion: identification of an inadequately expressed set of proteins in human follicular fluid. Proteomics 6, 3445–3454. 23. Pilch, B. and Mann, M. (2006) Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol. 7, R40. 24. Varnum, S. M., Covington, C. C., Woodbury, R. L., Petritis, K., Kangas, L. J., Abdullah, M. S., Pounds, J. G., Smith, R. D., and Zangar, R. C. (2003) Proteomic characterization of nipple aspirate fluid: identification of potential biomarkers of breast cancer. Breast Cancer Res. Treat. 80, 87–97. 25. Zheng, P. P., Luider, T. M., Pieters, R., Avezaat, C. J., van den Bent, M. J., Sillevis Smitt, P. A., and Kros, J. M. (2003) Identification of tumor-related proteins by proteomic analysis of cerebrospinal fluid from patients with primary brain tumors. J. Neuropathol. Exp. Neurol. 62, 855–862. 26. Gibson, D. S., Blelock, S., Brockbank, S., Curry, J., Healy, A., McAllister, C., and Rooney, M. E. (2006) Proteomic analysis of recurrent joint inflammation in juvenile idiopathic arthritis. J. Proteome Res. 5, 1988–1995. 27. Merkel, D., Rist, W., Seither, P., Weith, A., and Lenter, M. C. (2005) Proteomic study of human bronchoalveolar lavage fluids from smokers with chronic obstructive pulmonary disease by combining surface-enhanced laser desorption/ionization-mass spectrometry profiling with mass spectrometric protein identification. Proteomics 5, 2972–2980. 28. Wu, J., Kobayashi, M., Sousa, E. A., Liu, W., Cai, J., Goldman, S. J., Dorner, A. J., Projan, S. J., Kavuru, M. S., Qiu, Y., and Thomassen, M. J. (2005) Differential proteomic analysis of bronchoalveolar lavage fluid in asthmatics following segmental antigen challenge. Mol. Cell Proteomics 4, 1251–1264. 29. Tyan, Y. C., Wu, H. Y., Lai, W. W., Su, W. C., and Liao, P. C. (2005) Proteomic profiling of human pleural effusion using two-dimensional nano liquid chromatography tandem mass spectrometry. J. Proteome Res. 4, 1274–1286. 30. Khalil, A. A. and James, P. (2007) Biomarker discovery: a proteomic approach for brain cancer profiling. Cancer Sci. 98, 201–213. 31. Khodavirdi, A. C., Song, Z., Yang, S., Zhong, C., Wang, S., Wu, H., Pritchard, C., Nelson, P. S., and Roy-Burman, P. (2006) Increased expression of osteopontin contributes to the progression of prostate cancer. Cancer Res. 66, 883–888. 32. Vincourt, J. B., Lionneton, F., Kratassiouk, G., Guillemin, F., Netter, P., Mainard, D., and Magdalou, J. (2006) Establishment of a reliable method for direct proteome characterization of human articular cartilage. Mol. Cell Proteomics 5, 1984–1995. 33. Lee, Y. J., Rice, R. H., and Lee, Y. M. (2006) Proteome analysis of human hair shaft: from protein identification to post-translational modification. Mol. Cell Proteomics 5, 789–800. 34. Cho, S. Y., Lee, E. Y., Lee, J. S., Kim, H. Y., Park, J. M., Kwon, M. S., Park, Y. K., Lee, H. J., Kang, M. J., Kim, J. Y., Yoo, J. S., Park, S. J., Cho, J. W., Kim, H. S., and Paik, Y. K. (2005) Efficient prefractionation of low-abundance proteins in human plasma and construction of a two-dimensional map. Proteomics 5, 3386–3396. 35. Lathrop, J. T., Hayes, T. K., Carrick, K., and Hammond, D. J. (2005) Rarity gives a charm: evaluation of trace proteins in plasma and serum. Expert Rev. Proteomics 2, 393–406. 36. Lee, H. J., Lee, E. Y., Kwon, M. S., and Paik, Y. K. (2006) Biomarker discovery from the plasma proteome using multidimensional fractionation proteomics. Curr. Opin. Chem. Biol. 10, 42–49. 37. Anderson, N. L. and Anderson, N. G. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell Proteomics 1, 845–867. 38. Hu, S., Loo, J. A., and Wong, D. T. (2006) Human body fluid proteome analysis. Proteomics 6, 6326–6353. 39. Park, M. R., Wang, E. H., Jin, D. C., Cha, J. H., Lee, K. H., Yang, C. W., Kang, C. S., and Choi, Y. J. (2006) Establishment of a 2-D human urinary proteomic map in IgA nephropathy. Proteomics 6, 1066–1076. 40. Tammen, H., Schulte, I., Hess, R., Menzel, C., Kellmann, M., and Schulz-Knappe, P. (2005) Prerequisites for peptidomic analysis of blood samples: I. Evaluation of blood specimen qualities and determination of technical performance characteristics. Comb. Chem. High Throughput Screen 8, 725–733. 41. Rai, A. J., Gelfand, C. A., Haywood, B. C., Warunek, D. J., Yi, J., Schuchard, M. D., Mehigh, R. J., Cockrill, S. L., Scott, G. B., Tammen, H., Schulz-Knappe, P., Speicher, D. W., Vitzthum, F., Haab, B. B., Siest, G., and Chan, D. W. (2005) HUPO plasma proteome project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–3277. 42. Zhou, M., Lucas, D. A., Chan, K. C., Issaq, H. J., Petricoin, E. F. 3rd, Liotta, L. A., Veenstra, T. D., and Conrads, T. P. (2004) An investigation into the human serum “interactome”. Electrophoresis 25, 1289–1298. 43. Findeisen, P., Sismanidis, D., Riedl, M., Costina, V., and Neumaier, M. (2005) Preanalytical impact of sample handling on proteome profiling experiments with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Clin. Chem. 51, 2409–2411. 44. Park, K. S., Kim, H., Kim, N. G., Cho, S. Y., Choi, K. H., Seong, J. K., and Paik, Y. K. (2002) Proteomic analysis and molecular characterization of tissue ferritin light chain in hepatocellular carcinoma. Hepatology 35, 1459–1466. 45. Park, K. S., Cho, S. Y., Kim, H., and Paik, Y. K. (2002) Proteomic alterations of the variants of human aldehyde dehydrogenase isozymes correlate with hepatocellular carcinoma. Int. J. Cancer 97, 261–265. 46. Marko-Varga, G., Berglund, M., Malmstrom, J., Lindberg, H., and Fehniger, T. E. (2003) Targeting hepatocytes from liver tissue by laser capture microdissection and proteomics expression profiling. Electrophoresis 24, 3800–3805. 47. Paradis, V., Degos, F., Dargere, D., Pham, N., Belghiti, J., Degott, C., Janeau, J. L., Bezeaud, A., Delforge, D., Cubizolles, M., Laurendeau, I., and Bedossa, P. (2005) Identification of a new biomarker of hepatocellular carcinoma by serum protein profiling of patients with chronic liver diseases. Hepatology 41, 40–47.
48. Ru, Q. C., Zhu, L. A., Silberman, J., and Shriver, C. D. (2006) Label-free semiquantitative peptide feature profiling of human breast cancer and breast disease sera via two-dimensional liquid chromatography–mass spectrometry. Mol. Cell Proteomics 5, 1095–1104. 49. Azad, N. S., Rasool, N., Annunziata, C. M., Minasian, L., Whiteley, G., and Kohn, E. C. (2006) Proteomics in clinical trials and practice: present uses and future promise. Mol. Cell Proteomics 5, 1819–1829. 50. Gunter, E. W. (1997) Biological and environmental specimen banking at the Centers for Disease Control and Prevention. Chemosphere 34, 1945–1953. 51. Strauss, G. H. and Kelly, S. J. (1990) The development of the U.S. EPA health effects research laboratory frozen blood cell repository program. Mutat. Res. 234, 349–354. 52. Romeo, M. J., Espina, V., Lowenthal, M., Espina, B. H., Petricoin, E. F. 3rd, and Liotta, L. A. (2005) CSF proteome: a protein repository for potential biomarker identification. Expert Rev. Proteomics 2, 57–70. 53. Conrads, T. P., Hood, B. L., Petricoin, E. F. 3rd, Liotta, L. A., and Veenstra, T. D. (2005) Cancer proteomics: many technologies, one goal. Expert Rev. Proteomics 2, 693–703. 54. Schrader, M. and Selle, H. (2006) The process chain for peptidomic biomarker discovery. Dis. Markers 22, 27–37. 55. Danna, E. A. and Nolan, G. P. (2006) Transcending the biomarker mindset: deciphering disease mechanisms at the single cell level. Curr. Opin. Chem. Biol. 10, 20–27. 56. De Masi, S., Tosti, M. E., and Mele, A. (2005) Screening for hepatocellular carcinoma. Dig. Liver Dis. 37, 260–268. 57. Yamaguchi, K., Nagano, M., Torada, N., Hamasaki, N., Kawakita, M., and Tanaka, M. (2004) Urine diacetylspermine as a novel tumor marker for pancreatobiliary carcinomas. Rinsho. Byori. 52, 336–339. 58. Dabrowska, M., Grubek-Jaworska, H., Domagala-Kulawik, J., Bartoszewicz, Z., Kondracka, A., Krenke, R., Nejman, P., and Chazan, R.
(2004) Diagnostic usefulness of selected tumor markers (CA125, CEA, CYFRA 21–1) in bronchoalveolar lavage fluid in patients with non-small cell lung cancer. Pol. Arch. Med. Wewn 111, 659–665. 59. Gann, P. H., Hennekens, C. H., and Stampfer, M. J. (1995) A prospective evaluation of plasma prostate-specific antigen for detection of prostatic cancer. JAMA 273, 289–294 60. Ciambellotti, E., Coda, C., and Lanza, E. (1993) Determination of CA 15–3 in the control of primary and metastatic breast carcinoma. Minerva Med. 84, 107–112. 61. Linkov, F., Lisovich, A., Yurkovetsky, Z., Marrangoni, A., Velikokhatnaya, L., Nolen, B., Winans, M., Bigbee, W., Siegfried, J., Lokshin, A., and Ferris, R. L. (2007) Early detection of head and neck cancer: development of a novel screening tool using multiplexed immunobead-based biomarker profiling. Cancer Epidemiol. Biomarkers Prev. 16, 102–107. 62. Casiano, C. A., Mediavilla-Varela, M., and Tan, E. M. (2006) Tumor-associated antigen arrays for the serological diagnosis of cancer. Mol. Cell Proteomics 5, 1745–1759.
63. Nissom, P. M., Lo, S. L., Lo, J. C., Ong, P. F., Lim, J. W., Ou, K., Liang, R. C., Seow, T. K., and Chung, M. C. (2006) Hcc-2, a novel mammalian ER thioredoxin that is differentially expressed in hepatocellular carcinoma. FEBS Lett. 580, 2216–2226. 64. Feng, J. T., Liu, Y. K., Song, H. Y., Dai, Z., Qin, L. X., Almofti, M. R., Fang, C. Y., Lu, H. J., Yang, P. Y., and Tang, Z. Y. (2005) Heat-shock protein 27: a potential biomarker for hepatocellular carcinoma identified by serum proteome analysis. Proteomics 5, 4581–4588. 65. Li, D. Q., Wang, L., Fei, F., Hou, Y. F., Luo, J. M., Wei-Chen, Zeng, R., Wu, J., Lu, J. S., Di, G. H., Ou, Z. L., Xia, Q. C., Shen, Z. Z., and Shao, Z. M. (2006) Identification of breast cancer metastasis-associated proteins in an isogenic tumor metastasis model using two-dimensional gel electrophoresis and liquid chromatography-ion trap-mass spectrometry. Proteomics 6, 3352–3368. 66. Lee, I. N., Chen, C. H., Sheu, J. C., Lee, H. S., Huang, G. T., Yu, C. Y., Lu, F. J., and Chow, L. P. (2005) Identification of human hepatocellular carcinoma-related biomarkers by two-dimensional difference gel electrophoresis and mass spectrometry. J. Proteome Res. 4, 2062–2069. 67. Righetti, P. G., Castagna, A., Antonucci, F., Piubelli, C., Cecconi, D., Campostrini, N., Rustichelli, C., Antonioli, P., Zanusso, G., Monaco, S., Lomas, L., and Boschetti, E. (2005) Proteome analysis in the clinical chemistry laboratory: myth or reality? Clin. Chim. Acta 357, 123–139. 68. Jang, J. S., Cho, H. Y., Lee, Y. J., Ha, W. S., and Kim, H. W. (2004) The differential proteome profile of stomach cancer: identification of the biomarker candidates. Oncol. Res. 14, 491–499. 69. Steel, L. F., Shumpert, D., Trotter, M., Seeholzer, S. H., Evans, A. A., London, W. T., Dwek, R., and Block, T. M. (2003) A strategy for the comparative analysis of serum proteomes for the discovery of biomarkers for hepatocellular carcinoma. Proteomics 3, 601–609. 70. Yip, T. T., Chan, J. W., Cho, W.
C., Yip, T. T., Wang, Z., Kwan, T. L., Law, S. C., Tsang, D. N., Chan, J. K., Lee, K. C., Cheng, W. W., Ma, V. W., Yip, C., Lim, C. K., Ngan, R. K., Au, J. S., Chan, A., Lim, W. W., and Ciphergen SARS Proteomics Study Group (2005) Protein chip array profiling analysis in patients with severe acute respiratory syndrome identified serum amyloid a protein as a biomarker potentially useful in monitoring the extent of pneumonia. Clin. Chem. 51, 47–55. 71. Anderson, L. and Hunter, C. L. (2005) Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol. Cell Proteomics 5, 573–588. 72. Lee, J. W., Figeys, D., and Vasilescu, J. (2007) Biomarker assay translation from discovery to clinical studies in cancer drug development: quantification of emerging protein biomarkers. Adv. Cancer Res. 96, 269–298. 73. Zolg, W. (2006) The proteomic search for diagnostic biomarkers: lost in translation? Mol. Cell Proteomics 5, 1720–1726.
74. Bensmail, H., Golek, J., Moody, M. M., Semmes, J. O., and Haoudi, A. (2005) A novel approach for clustering proteomics data using Bayesian fast Fourier transform. Bioinformatics 21, 2210–2224. 75. Ward, D. G., Cheng, Y., N’Kontchou, G., Thar, T. T., Barget, N., Wei, W., Billingham, L. J., Martin, A., Beaugrand, M., and Johnson, P. J. (2006) Changes in the serum proteome associated with the development of hepatocellular carcinoma in hepatitis C-related cirrhosis. Br. J. Cancer 94, 287–292. 76. Lin, N. and Zhao, H. (2005) Are scale-free networks robust to measurement errors? BMC Bioinformatics 6, 119. 77. Castagna, A., Cecconi, D., Sennels, L., Rappsilber, J., Guerrier, L., Fortis, F., Boschetti, E., Lomas, L., and Righetti, P. G. (2005) Exploring the hidden human urinary proteome via ligand library beads. J. Proteome Res. 4, 1917–1930. 78. Rauch, A., Bellew, M., Eng, J., Fitzgibbon, M., Holzman, T., Hussey, P., Igra, M., Maclean, B., Lin, C. W., Detter, A., Fang, R., Faca, V., Gafken, P., Zhang, H., Whiteaker, J., States, D., Hanash, S., Paulovich, A., and McIntosh, M. W. (2006) Computational proteomics analysis system (CPAS): an extensible open source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J. Proteome Res. 5, 112–121. 79. Lilley, K. S. and Friedman, D. B. (2004) All about DIGE: quantification technology for differential-display 2D-gel proteomics. Expert Rev. Proteomics 1, 401–409. 80. Qian, W. J., Jacobs, J. M., Liu, T., Camp, D. G. 2nd, and Smith, R. D. (2006) Advances and challenges in liquid chromatography-mass spectrometry-based proteomics profiling for clinical applications. Mol. Cell Proteomics 5, 1727–1744. 81. Powell, D. W., Merchant, M. L., and Link, A. J. (2006) Discovery of regulatory molecular events and biomarkers using 2D capillary chromatography and mass spectrometry. Expert Rev. Proteomics 3, 63–74. 82. Andre, M., Le Caer, J.
P., Greco, C., Planchon, S., El Nemer, W., Boucheix, C., Rubinstein, E., Chamot-Rooke, J., and Le Naour, F. (2006) Proteomic analysis of the tetraspanin web using LC-ESI-MS/MS and MALDI-FTICR-MS. Proteomics 6, 1437–1449. 83. Greengauz-Roberts, O., Stoppler, H., Nomura, S., Yamaguchi, H., Goldenring, J. R., Podolsky, R. H., Lee, J. R., and Dynan, W. S. (2005) Saturation labeling with cysteine-reactive cyanine fluorescent dyes provides increased sensitivity for protein expression profiling of laser-microdissected clinical specimens. Proteomics 5, 1746–1757. 84. Heck, A. J. and Krijgsveld, J. (2004) Mass spectrometry-based quantitative proteomics. Expert Rev. Proteomics 1, 317–326. 85. Schneider, L. V. and Hall, M. P. (2005) Stable isotope methods for high-precision proteomics. Drug Discov. Today 10, 353–363. 86. Zhang, J., Goodlett, D. R., Peskind, E. R., Quinn, J. F., Zhou, Y., Wang, Q., Pan, C., Yi, E., Eng, J., Aebersold, R. H., and Montine, T. J. (2005) Quantitative proteomic analysis of age-related changes in human cerebrospinal fluid. Neurobiol Aging 26, 207–227.
87. Liu, T., Qian, W. J., Strittmatter, E. F., Camp, D. G. 2nd, Anderson, G. A., Thrall, B. D., and Smith, R. D. (2004) High-throughput comparative proteome analysis using a quantitative cysteinyl-peptide enrichment technology. Anal. Chem. 76, 5345–5353. 88. Li, C., Hong, Y., Tan, Y. X., Zhou, H., Ai, J. H., Li, S. J., Zhang, L., Xia, Q. C., Wu, J. R., Wang, H. Y., and Zeng, R. (2004) Accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Mol. Cell Proteomics 3, 399–409. 89. Sheehan, K. M., Calvert, V. S., Kay, E. W., Lu, Y., Fishman, D., Espina, V., Aquino, J., Speer, R., Araujo, R., Mills, G. B., Liotta, L. A., Petricoin, E. F. 3rd, and Wulfkuhle, J. D. (2005) Use of reverse phase protein microarrays and reference standard development for molecular network analysis of metastatic ovarian carcinoma. Mol. Cell Proteomics 4, 346–355. 90. Knezevic, V., Leethanakul, C., Bichsel, V. E., Worth, J. M., Prabhu, V. V., Gutkind, J. S., Liotta, L. A., Munson, P. J., Petricoin, E. F. 3rd, and Krizman, D. B. (2001) Proteomic profiling of the cancer microenvironment by antibody arrays. Proteomics 1, 1271–1278. 91. Sharma-Oates, A., Quirke, P., and Westhead, D. R. (2005) TmaDB: a repository for tissue microarray data. BMC Bioinformatics 6, 218. 92. Rai, A. J., Stemmer, P. M., Zhang, Z., Adam, B. L., Morgan, W. T., Caffrey, R. E., Podust, V. N., Patel, M., Lim, L. Y., Shipulina, N. V., Chan, D. W., Semmes, O. J., and Leung, H. C. (2005) Analysis of human proteome organization plasma proteome project (HUPO PPP) reference specimens using surface enhanced laser desorption/ionization-time of flight (SELDI-TOF) mass spectrometry: multi-institution correlation of spectra and identification of biomarkers. Proteomics 5, 3467–3474. 93. Engwegen, J. Y., Gast, M. C., Schellens, J. H., and Beijnen, J. H.
(2006) Clinical proteomics: searching for better tumour markers with SELDI-TOF mass spectrometry. Trends Pharmacol. Sci. 27, 251–259. 94. Domon, B. and Aebersold, R. (2006) Mass spectrometry and protein analysis. Science 312, 212–217. 95. Domon, B. and Aebersold, R. (2006) Challenges and opportunities in proteomics data analysis. Mol. Cell Proteomics 5, 1921–1926. 96. Uhlen, M. and Ponten, F. (2005) Antibody-based proteomics for human tissue profiling. Mol. Cell Proteomics 4, 384–393. 97. Taussig, M. J., Stoevesandt, O., Borrebaeck, C. A., Bradbury, A. R., Cahill, D., Cambillau, C., de Daruvar, A., Dubel, S., Eichler, J., Frank, R., Gibson, T. J., Gloriam, D., Gold, L., Herberg, F. W., Hermjakob, H., Hoheisel, J. D., Joos, T. O., Kallioniemi, O., Koegll, M., Konthur, Z., Korn, B., Kremmer, E., Krobitsch, S., Landegren, U., van der Maarel, S., McCafferty, J., Muyldermans, S., Nygren, P. A., Palcy, S., Pluckthun, A., Polic, B., Przybylski, M., Saviranta, P., Sawyer, A., Sherman, D. J., Skerra, A., Templin, M., Ueffing, M., and Uhlen, M. (2007)
ProteomeBinders: planning a European resource of affinity reagents for analysis of the human proteome. Nat. Methods 4, 13–17. 98. Ilyin, S. E., Belkowski, S. M., and Plata-Salaman, C. R. (2004) Biomarker discovery and validation: technologies and integrative approaches. Trends Biotechnol. 22, 411–416.
I Specimen Collection for Clinical Proteomics
2
Specimen Collection and Handling
Standardization of Blood Sample Collection
Harald Tammen
Summary
Preanalytical variables can alter the analysis of blood-derived samples. Prior to the analysis of a blood sample, multiple steps are necessary to generate the desired specimen. The choice of blood specimen and its collection, handling, processing, and storage are important aspects, since they can have a tremendous impact on the results of the analysis. Awareness of clinical practices in medical laboratories, together with current knowledge, allows for identification of specific variables that affect the results of a proteomic study. Knowledge of preanalytical variables is a prerequisite to understanding and controlling their impact.
Key Words: blood; plasma; serum; proteomics; specimen; preanalytical variables.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Proteomic analysis of blood specimens by semi-quantitative multiplex techniques offers a valuable approach for discovery of disease or therapy-related biomarkers (1,2). Based on reproducible separation of proteins by their physical–chemical properties in combination with semi-quantitative detection methods and bioinformatic data analysis, proteomics allows for sensitive measurement of proteins in blood specimens (3). Blood can be regarded as a complex liquid tissue that comprises cells and extracellular fluid (4). The choice of a suitable specimen-collection protocol is crucial to minimize artificial processes (e.g., cell lysis, proteolysis) occurring during specimen collection and preparation (5). Preanalytic procedures can alter the analysis of blood-derived samples. These procedures comprise the processes prior to actual analysis of the sample and include the steps needed to obtain the primary sample (e.g., blood) and the analytical specimen (e.g., plasma, serum, cells). Legal or ethical issues (e.g., importance of informed consent) and potential risks of phlebotomy (e.g., bleeding) are not covered in this article.
1.1. Collection of Blood Samples
It has been reported that the most frequent faults in the preanalytical phase result from erroneous sample collection procedures (e.g., drawing blood from an infusion line, which dilutes the sample) (6). The design of blood collection devices may aid correct sampling: evacuated containers draw an accurate quantity of blood, which ensures the correct concentration of additives or the correct dilution of the blood, as in the case of citrated plasma. The speed of the blood draw is also controlled, which limits mechanical stress. The favored site of collection is the median cubital vein, which is generally easily found and accessed. As such, it is the most comfortable site for the patient and should not evoke additional stress. Preparation of the collection site includes proper cleaning of the skin with alcohol (2-propanol). The alcohol must be allowed to evaporate, since mixing of residual alcohol with the blood sample may result in hemolysis, raise the levels of distinct analytes, and cause interferences. The position of the patient (standing, lying, sitting) can affect the hematocrit (7), and hence may change the concentration of the analytes. A tourniquet should be applied 3–4 inches above the site of venipuncture and released as soon as blood begins flowing into the collection device. Prolonged venous occlusion (>1 min) can affect the sample composition: it may result in hemoconcentration and subsequently increase the concentration of various analytes, e.g., total protein.
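As an illustration of how the collection variables described so far could be recorded and checked, the following Python sketch flags deviations from these recommendations. It is not part of the original protocol: the `BloodDraw` record, its field names, and the `preanalytical_flags` function are hypothetical, and the only numeric threshold used is the 1-min venous occlusion limit mentioned in the text.

```python
from dataclasses import dataclass
from typing import List

# From the text: venous occlusion longer than 1 min can affect sample composition.
MAX_OCCLUSION_S = 60.0
KNOWN_POSTURES = {"standing", "lying", "sitting"}


@dataclass
class BloodDraw:
    """Hypothetical record of one venipuncture (field names are illustrative)."""
    site: str                  # e.g. "median cubital vein"
    from_infusion_line: bool   # True if blood was drawn from an infusion line
    alcohol_evaporated: bool   # True if the disinfectant was allowed to dry
    posture: str               # "standing", "lying", "sitting", or "" if unknown
    tourniquet_seconds: float  # duration of venous occlusion


def preanalytical_flags(draw: BloodDraw) -> List[str]:
    """Return a list of warnings for deviations named in the text."""
    flags = []
    if draw.from_infusion_line:
        flags.append("drawn from infusion line: possible sample dilution")
    if not draw.alcohol_evaporated:
        flags.append("alcohol not evaporated: risk of hemolysis and interference")
    if draw.posture not in KNOWN_POSTURES:
        flags.append("posture not recorded: hematocrit may differ between draws")
    if draw.tourniquet_seconds > MAX_OCCLUSION_S:
        flags.append("venous occlusion > 1 min: possible hemoconcentration")
    return flags


if __name__ == "__main__":
    draw = BloodDraw("median cubital vein", False, True, "sitting", 95.0)
    for flag in preanalytical_flags(draw):
        print(flag)
```

A record that triggers no flags is not thereby validated; the sketch only makes the variables named in this section explicit so that they can be documented consistently across samples.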
Blood should be collected from fasting patients in the morning between 7 and 9 a.m., because food intake and circadian rhythms can alter the concentration of analytes considerably (e.g., total protein, hemoglobin, myoglobin).
1.2. Characteristics of Serum and Plasma Specimens
Serum is one of the most frequently analyzed blood specimens. The generation of serum is time consuming and associated with activation of the coagulation cascade and the complement system. These processes influence the composition of the samples, because they result in cell lysis (e.g., thrombocytes, erythrocytes). As a consequence, the concentrations of components in the extracellular fluid, such as aspartate aminotransferase, serotonin, neuron-specific enolase, and lactate dehydrogenase, are increased (8). On the other hand, degradation of the analytes (e.g., hormones) may occur faster (9). On the
proteomic level, more peptides and fewer proteins are observed in serum than in plasma (10,11). Consequently, the activation of clotting cascades necessary to generate serum can lead to artefacts. A reason to use serum as a specimen is based on the notion that the proteome or peptidome of serum may reflect biological events (12). Post-sampling proteolytic cleavage products have been proposed as biomarkers, and it has been further suggested that the serum peptidome is of particular diagnostic value for the detection of cancer (13). However, it has been reported that more protein changes occur in serum than in plasma (14). Thus, it can be expected that the reproducibility of such ex vivo proteolytic events is comparatively low. In plasma, unlike serum, citrate and EDTA inhibit coagulation and other enzymatic processes by chelate formation with ions, thereby inhibiting ion-dependent enzymes. Heparin, by contrast, acts through the activation of antithrombin III. The main concern associated with heparinized plasma for proteomic studies is that heparin is a polydisperse charged molecule that binds many proteins non-specifically (15,16), and may also influence separation procedures and mass spectrometric detection of peptides and small proteins due to its similar molecular weight (17). The sampling of plasma is less time consuming than the acquisition of serum. Cells and the liquid phase can be separated directly after sample collection, since no clotting time (30–60 min) is required. In comparison to serum, the amount of plasma generated from blood is approximately 10 to 20% higher. Additionally, the protein content of plasma is higher than that of serum, because of the presence of clotting factors and associated components. Furthermore, in serum, proteins may be bound to the clot, resulting in a decrease of protein concentration.
1.3. Processing of Blood Samples
A quick separation of cells from the plasma is favorable, since cellular constituents may liberate substances that alter the composition of the sample. Generally, it is recommended that plasma and serum be centrifuged at 1300–2000×g for 10 min within 30 min of sample collection. The temperature should generally be 15–24°C (18), unless recommended differently for distinct analytes like gastrin or A-type natriuretic peptide. Processing at 4°C appears to be attractive, because enzymatic degradation processes are reduced at low temperatures. However, platelets become activated at low temperatures (19) and release intracellular proteins and enzymes, which affect the sample composition. Thus, processing at low temperatures is safe only after thrombocytes have been removed. Since one centrifugation step may be insufficient for
depletion of platelets below 10 cells/nL, a second centrifugation step (2500×g for 15 min at room temperature) or a filtration step may be required to obtain platelet-poor plasma. This procedure is applicable only to plasma, since the platelets in serum are already activated.
1.4. Protease Inhibitors
The use of protease inhibitors is attractive in principle, but commonly used protease inhibitor cocktails may introduce difficulties because they interfere with mass spectrometry and form covalent bonds with proteins, which shifts the isoform pattern (20). Protease inhibitors have been considered and investigated as additives in proteome research to prevent or slow down proteolytic processes and thereby provide a means of more sensitive detection of markers in blood (21). Even though protein integrity has been shown to be maintained by the addition of 15 commercially available protease inhibitors, the usefulness of protease inhibitors in overall protein stabilization of blood samples remains to be investigated in more detail (22). Certain protease inhibitors are toxic to live cells when present in whole blood. Stressed, apoptotic, or necrotic cells release substances, and it may be argued that this affects the composition of serum or plasma until the cellular and soluble fractions of blood are separated. However, careful selection of an appropriate protease inhibitor may solve this problem.
2. Materials
1. Twenty-gauge needles and an appropriate adapter (e.g., Sarstedt, Nümbrecht, Germany) or a Vacutainer system (BD Bioscience, Franklin Lakes, USA).
2. Alcohol (2-propanol) in spray flask.
3. Swabs.
4. Examination gloves.
5. Tourniquet or sphygmomanometer.
6. Blood collection tubes (e.g., Sarstedt).
7. Centrifuge with a swinging bucket rotor (e.g., Sigma 4K15, Sigma Laborzentrifugen, Osterode, Harz).
8. A 10-mL syringe equipped with a cellulose acetate filter unit with 0.2 μm pore size and 5 cm² filtration area (e.g., Sartorius Minisart, Sarstedt).
9. 2-mL cryo-vials.
10. Pipette and tips.
3. Methods
1. Venipuncture of a cubital vein is performed using a 20-gauge needle (diameter: 0.9 mm; e.g., a butterfly system with a maximum tubing length of 6 cm). If a tourniquet is applied, it should not remain in place for longer than 1 min (risk of falsifying results due to
hemoconcentration). As soon as the blood flows into the container, the tourniquet has to be released at least partially. If more time is required, the tourniquet has to be released so that circulation resumes and normal skin color returns to the extremity.
• Prior to blood collection for proteomic analysis, blood is aspirated into a first container (e.g., 2.7 mL S-Monovette, Sarstedt, Nümbrecht, Germany). This is done to flush the surface and remove initial traces of contact-induced coagulation. This sample is not useful for analysis.
• Afterward, blood is drawn into a standard EDTA- or citrate-containing syringe (e.g., 9 mL EDTA-Monovette, Sarstedt, Nümbrecht, Germany). Depending on ease of blood flow, several samples can be collected. Free flow with mild aspiration should be assured to avoid hemolysis.
2. After venipuncture, plasma is obtained by centrifugation for 10 min at 2000×g at room temperature. Centrifugation should start within 30 min after blood collection. The resulting plasma sample may now be separated from red and white blood cells in an efficient and gentle way. Nevertheless, a significant number of platelets (∼25%) are still present in the sample. This requires an additional preparation step.
3. For platelet depletion, one of the following procedures has to be undertaken directly after step 2:
• Platelet removal by centrifugation: The plasma sample is transferred into a second vial for another centrifugation for 15 min at 2500×g at room temperature. After centrifugation, the supernatant is transferred in aliquots of 1.5 mL into cryo vials.
• Platelet removal by filtration: Plasma aliquots of 1.5 mL resulting from step 2 are transferred into 2-mL cryo vials using a 10-mL syringe equipped with a cellulose acetate filter unit with 0.2 μm pore size and 5 cm² filtration area (e.g., Sartorius Minisart®, Sartorius, Göttingen, Germany). Filtration requires only gentle pressure.
4. Samples are transferred to a –80°C freezer within 30 min. Storage is at –80°C. Transport of samples is done on dry ice.
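The time limits in steps 2 and 4 lend themselves to a simple check in a sample log. The following Python sketch is only an illustration: the function and field names are hypothetical, and it assumes that the 30-min limit for freezing in step 4 is counted from the end of platelet depletion, which the protocol does not state explicitly.

```python
from datetime import datetime, timedelta

# Limits taken from the Methods text; the names are illustrative only.
MAX_DELAY_TO_CENTRIFUGE = timedelta(minutes=30)  # step 2
MAX_DELAY_TO_FREEZER = timedelta(minutes=30)     # step 4 (assumed to start after step 3)


def timing_deviations(collected, centrifuge_start, processing_done, frozen):
    """Return a list of timing deviations for one plasma sample."""
    deviations = []
    if centrifuge_start - collected > MAX_DELAY_TO_CENTRIFUGE:
        deviations.append("centrifugation started more than 30 min after collection")
    if frozen - processing_done > MAX_DELAY_TO_FREEZER:
        deviations.append("sample frozen more than 30 min after processing")
    return deviations


# Example: a sample drawn at 08:00 and handled within both limits.
t0 = datetime(2024, 1, 1, 8, 0)
print(timing_deviations(t0,
                        t0 + timedelta(minutes=20),
                        t0 + timedelta(minutes=60),
                        t0 + timedelta(minutes=80)))
```

An empty list means that no timing deviation was detected for that sample.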
4. Notes
4.1. Frequently Made Mistakes
4.1.1. Blood Withdrawal
• The patient was not fasting (i.e., had taken food prior to sampling).
• The blood was drawn from an infusion line.
• The blood was drawn in a wrong position (e.g., supine, upright).
• The consumables used were different from those recommended.
• The expiry date of consumables was already reached.
• The tubes were not properly filled.
• The tubes were agitated vigorously (instead of gentle shaking to dissolve the anticoagulant).
• The blood sample tubes were not consistently kept at room temperature.
• The sample tubes were put on ice or in a refrigerator.
4.1.2. Lab Handling
• Centrifugation was delayed more than 30 min after blood withdrawal.
• A cooling centrifuge was adjusted below room temperature.
• The centrifugation speed was wrong (e.g., revolutions per minute were set instead of g-force).
• The centrifugation time was wrong.
• The removal of blood plasma by pipetting was done without proper caution. Consequently, the buffy coat or the red blood cells were churned up.
• The second centrifugation of recovered plasma samples was delayed after the first centrifugation.
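Confusing revolutions per minute with g-force is easy to avoid with a quick conversion. The sketch below uses the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², with the rotor radius r in centimeters. The 15-cm radius is a hypothetical value chosen for illustration; the actual radius must be taken from the rotor documentation.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in rpm


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) for a given speed and rotor radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Speed (rpm) needed to reach a given relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


# Hypothetical rotor radius of 15 cm.
radius = 15.0
print(round(rpm_from_rcf(2000, radius)))   # speed needed for 2000 x g
print(round(rcf_from_rpm(2000, radius)))   # force obtained if 2000 rpm is set by mistake
```

With this assumed radius, setting 2000 rpm instead of 2000×g yields only about 670×g, roughly one third of the intended force.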
4.1.3. Storage of Samples
• The storage of samples was delayed.
• The storage temperatures were above –80°C.
• The labeling of sample containers was unreadable or confusable.
• The labels were not properly attached to the sample containers, or storage or handling resulted in loss of labels.
4.1.4. General Recommendations
• A proper first centrifugation should produce a visible white blood cell layer (buffy coat) between red blood cells and plasma. If not, centrifugation speed or time may be wrong.
• One should discard plasma that is icteric or exhibits signs of hemolysis. One should check with an expert whether these signs are due to the patient's particular disease.
References 1. Vitzthum F, Behrens F, Anderson NL, Shaw JH. (2005) Proteomics: from basic research to diagnostic application. A review of requirements and needs. J. Proteome Res. 4, 1086–97. 2. Lathrop JT, Anderson NL, Anderson NG, Hammond DJ. (2003) Therapeutic potential of the plasma proteome. Curr. Opin. Mol. Ther. 5, 250–7.
3. Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR et al. (2003) Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 75, 4818–26. 4. Anderson NL, Anderson NG. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–67. 5. Omenn GS. (2004) The Human Proteome Organization Plasma Proteome Project pilot phase: reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics 4, 1235–40. 6. Plebani M, Carraro P. (1997) Mistakes in a stat laboratory: types and frequency. Clin. Chem. 43, 1348–51. 7. Burtis CA, Ashwood E. (eds) (2001) Fundamentals of Clinical Chemistry. Saunders, Philadelphia. 8. Guder WG, Narayanan S, Wisser H, Zawata B. (2003) Samples: From the Patient to the Laboratory. The Impact of Preanalytical Variables on the Quality of Laboratory Results. GIT Verlag, Darmstadt, Germany. 9. Evans MJ, Livesey JH, Ellis MJ, Yandle TG. (2001) Effect of anticoagulants and storage temperatures on stability of plasma and serum hormones. Clin. Biochem 34, 107–12. 10. Omenn GS, States DJ, Adamski M, Blackwell TW, Menon R, Hermjakob H et al. (2005) Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 5, 3226–45. 11. Rai AJ, Gelfand CA, Haywood BC, Warunek DJ, Yi J, Schuchard MD et al. (2005) HUPO Plasma Proteome Project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–77. 12. Villanueva J, Shaffer DR, Philip J, Chaparro CA, Erdjument-Bromage H, Olshen AB et al. (2006) Differential exoprotease activities confer tumor-specific serum peptidome patterns. J. Clin. Invest. 116, 271–84. 13. Liotta LA, Petricoin EF. 
(2006) Serum peptidome for cancer detection: spinning biologic trash into diagnostic gold. J. Clin. Invest. 116, 26–30. 14. Tammen H, Schulte I, Hess R, Menzel C, Kellmann M, Schulz-Knappe P. (2005) Prerequisites for peptidomic analysis of blood samples: I. Evaluation of blood specimen qualities and determination of technical performance characteristics. Comb. Chem. High Throughput Screen. 8, 725–33. 15. Holland NT, Smith MT, Eskenazi B, Bastaki M. (2003) Biological sample collection and processing for molecular epidemiological studies. Mutat. Res. 543, 217–34. 16. Landi MT, Caporaso N. (1997) Sample collection, processing and storage. IARC Sci. Publ. 223–36. 17. Tammen H, Schulte I, Hess R, Menzel C, Kellmann M, Mohring T, Schulz-Knappe P. (2005) Peptidomic analysis of human blood specimens: comparison between plasma specimens and serum by differential peptide display. Proteomics 13, 3414–22.
18. Favaloro EJ, Soltani S, McDonald J. (2004) Potential laboratory misdiagnosis of hemophilia and von Willebrand disorder owing to cold activation of blood samples for testing. Am. J. Clin. Pathol. 122, 686–92. 19. Mustard JF, Kinlough-Rathbone RL, Packham MA. (1989) Isolation of human platelets from plasma by centrifugation and washing. Methods Enzymol. 169, 3–11. 20. Schuchard MD, Mehigh RJ, Cockrill SL, Lipscomb GT, Stephan JD, Wildsmith J et al. (2005) Artifactual isoform profile modification following treatment of human plasma or serum with protease inhibitor, monitored by 2-dimensional electrophoresis and mass spectrometry. Biotechniques 39, 239–47. 21. Jeffrey DH, Deidra B, Keith H, Shu-Pang H, Deborah LR, Gregory JO, Stanley AH. (2004) An Investigation of Plasma Collection, Stabilization, and Storage Procedures for Proteomic Analysis of Clinical Samples. Humana, Totowa, NJ. 22. Rai AJ, Vitzthum F. (2006) Effects of preanalytical variables on peptide and protein measurements in human serum and plasma: implications for clinical proteomics. Expert Rev. Proteomics 3, 409–26.
3 Tissue Sample Collection for Proteomics Analysis Jose I. Diaz, Lisa H. Cazares, and O. John Semmes
Summary
Successful collection of tissue samples for molecular analysis requires careful attention to several critical steps. We describe here our procedure for tissue specimen collection for proteomic purposes, with emphasis on the most important steps, including timing issues and the procedures for immediate freezing, storage, microdissection of the cells of interest or “tissue targets,” and preparation of lysates for protein isolation for SELDI, MALDI, and 2DGE applications. The pathologist is the cornerstone of this process and is an invaluable collaborator. In most institutions, pathologists are responsible for “tissue custody,” and they closely supervise the tissue bank. In addition, they are optimally trained in histopathology and can therefore assist investigators in correlating tissue morphology with molecular findings. In recent years, the advent of the laser capture microscope, a tool ideally designed for pathologists, has tremendously improved the efficiency of collecting tissue targets for molecular analysis.
Key Words: tissue bank; frozen section; immunofluorescence; laser capture microscope; proteomics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
From the completion of surgery and the acquisition of the tissue sample to protein isolation and the various proteomic techniques, a number of challenges must be overcome. The first challenge is time. Surgery is associated with loss of vascular supply, resulting in a progressive increase of endogenous protease activity, protein degradation, and tissue autolysis. For this reason, specimens submitted for tissue procurement must be processed without delay. Formalin fixation, a standard processing procedure in pathology,
stops protease activity. However, formalin is a cross-linking fixative that irreversibly alters protein, thus compromising the quality of the extracts for most proteomic techniques. Recent technical developments appear promising and may ultimately enable peptide analysis and protein identification (bottom up proteomics) in formalin-fixed paraffin embedded tissue (1). At present, however, it is imperative to take a representative “fresh” tissue sample immediately after surgery when collecting tissue for proteomic studies, including MALDI TOF MS and 2DGE. The surgical specimen should be transported quickly to pathology, and a representative tissue sample should be obtained under the supervision of a pathologist. The sample should be embedded in OCT and frozen without delay. Ideally, a frozen section should be performed for quality assurance before archiving the sample. Once the pathologist confirms that the expected targets are present in the collected tissue (for instance, tumor and non-tumor tissue), the frozen specimen can be stored in a –80°C freezer for subsequent use. Overcoming time constraints requires appropriate institutional policies and dedicated personnel. From our experience, it is better to delegate the responsibility of transporting the surgical specimen from the operating room to pathology to dedicated tissue procurement personnel, instead of expecting the surgical team to deliver the specimens. When collecting and archiving tissue samples, our policy is to bisect the sample into two halves, one embedded in OCT and stored permanently at –80°C for future molecular studies, and one submitted as a “mirror image” processed in formalin after performing a frozen section for morphologic comparison and cell type mapping after basic hematoxylin and eosin (H&E) staining. This formalin-processed mirror image tissue provides optimal morphological detail, which might be necessary in the future. 
For instance, it is very difficult to identify prostatic intraepithelial neoplasia (PIN) on frozen section slides; however, the formalin-fixed section, which closely mirrors the frozen section, can be used for guidance.
After archiving the tissue sample, the next challenge is to ensure that the proteomic findings are representative of the tissue targets under investigation, given the cellular heterogeneity present in most tissues. For instance, to determine the differential protein expression in tumor versus non-tumor tissue, one must ensure that proteins are separately and reliably extracted from normal and tumor cells. Certainly, many solid tumors are visible to the naked eye, and both tumor and non-tumor tissues can be collected by gross inspection. However, under a microscope, the tumor bed contains not only tumor cells but also many tumor-associated, non-tumoral elements, such as supporting stromal cells, blood vessels, and infiltrating lymphocytes. Moreover, microscopic foci of tumor may infiltrate grossly normal tissue. In the past, various approaches were followed to collect cells from tissue sections, including manual microdissection with a syringe. In recent years, the procedure
of laser-capture microdissection (LCM) (2) has tremendously increased the quality, specificity, and speed of the process, allowing selective capture of cells and various tissue elements while preserving their molecular integrity (3,4,5). The LCM instrument is a special microscope that isolates cells from frozen or formalin-fixed tissues and cytological preparations. Microdissection of single cells or multicellular structures is accomplished by placing a plastic polymer (cap) over the tissue and pulsing an infrared laser so that the polymer melts and adheres to the target cells under the laser ring. When the cap is removed, the cells that adhered to the polymer detach from the surrounding tissue without any molecular damage, becoming suitable for the extraction of high-quality nucleic acids and proteins, and for a wide range of downstream molecular analyses,
Fig. 1. Selective immunofluorescent LCM of the basal cells of a prostate gland by immunocapture: (A) immunofluorescent staining of basal cells with a mAb against high-molecular-weight keratins, which are highly expressed on basal cells; (B) selection of immunofluorescent-positive basal cells for subsequent LCM; (C) captured immunofluorescent-positive cells after LCM, photographed from the plastic cap; (D) remainder of the gland after removing the basal cell layer by LCM.
such as gene expression microarrays or proteomics. Microdissection can be coupled with special immunostaining procedures if one wishes to capture specific cell types not easily identified by morphology alone. This so-called immunocapture procedure (6,7) further enhances the specificity of tissue procurement for molecular analysis. For example, in a previous study (8), we were able to selectively capture basal cells from benign prostate glands, which are extremely difficult to recognize morphologically but easily identifiable after immunostaining for high-molecular-weight cytokeratin (Fig. 1). We obtained excellent protein quality and were able to identify several protein peaks preferentially expressed in these cells using SELDI-TOF-MS. When we compared the protein spectra from sections of the same tissue sample routinely stained with hematoxylin with those immunostained for high-molecular-weight cytokeratins, there was no difference in the spectra, which argues against any significant protein deterioration due to the immunostaining procedure.
2. Materials
2.1. Tissue Collection and Storage
1. Tissue-Tek Cryomold-standard (Sakura, Torrance, CA)
2. Tissue-Tek OCT (Sakura)
3. 2-Methylbutane (Mallinckrodt, St. Louis, MO)
4. Shandon Histobath II (Thermo Electron Corp., Waltham, MA)
5. –80°C freezer
2.2. Frozen Tissue Sectioning and Staining
1. Cryostat
2. HistoGene™ LCM Frozen Section Staining Kit (Arcturus Biosciences Inc, Mountain View, CA). The kit contains HistoGene staining solution, ethanol (75, 95, 100%), xylene, nuclease-free distilled water, HistoGene LCM slides, and disposable slide staining jars.
3. 1× PBS made from 10× stock (Fisher Scientific)
4. Acetone (high purity grade)
5. Cy3-Streptavidin (Invitrogen, Carlsbad, CA)
6. Biotinylated mAbs: Any antibody can be biotinylated. We routinely have 1.5 mg of antibody labeled with 0.2 mg biotin (Alpha Diagnostic Intl. Inc., San Antonio, TX).
2.3. LCM
1. PixCell II LCM System (Arcturus Biosciences Inc)
2. AutoPix™ Automated LCM System (Arcturus Biosciences Inc)
3. CapSure® LCM caps (Arcturus Biosciences Inc)
4. Prep Strip (Arcturus Biosciences Inc)
5. Microcentrifuge tubes (0.5 ml) (Eppendorf North America)
2.4. LCM Lysate
1. Micropipet capable of delivering 1 μl accurately
2. 20 mM HEPES (pH to 8.0 with NaOH) with 1% Triton X-100
3. Sonicator (optional)
4. 1× PBS
2.5. SELDI Analysis
1. IMAC3 or WCX2 Protein Array Chips (Ciphergen Biosystems, Palo Alto, CA)
2. HPLC grade water (Fisher Scientific)
3. 100 mM sodium acetate pH 4.0
4. 100 mM ammonium acetate pH 4.0
5. Sinapinic acid (SPA) (Ciphergen Biosystems, Palo Alto, CA)
6. Optima grade acetonitrile (Fisher Scientific)
7. Trifluoroacetic acid, packaged in 1 ml ampules (Pierce Chemical Company, Rockford, IL)
2.6. MALDI Analysis
1. Target plate
2. α-Cyano-4-hydroxycinnamic acid (CHCA) (Bruker Daltonics, Palo Alto, CA)
3. SPA (Fluka)
4. Optima grade acetonitrile (Fisher Scientific)
5. Trifluoroacetic acid, packaged in 1 ml ampules (Pierce Chemical Company)
3. Method
3.1. Tissue Collection and Storage
1. The tissue sample is embedded in OCT using a cryomold and is frozen in the Shandon Histobath, which contains 2-methylbutane (see Note 1).
2. Hold the cryomold against the 2-methylbutane liquid interface and allow the tissue to freeze slowly (3–5 min) (see Note 2).
3. After achieving complete freezing, place the frozen cryomold containing the sample in a plastic bag and transport the sample within a liquid nitrogen container. Store the sample in a –80°C freezer.
3.2. Frozen Tissue Sectioning and Staining
3.2.1. Regular Hematoxylin Staining
Prior to LCM, cut 8-μm-thick frozen tissue sections in the cryostat (discard folded or wrinkled sections). Keep the slides with sections in the cryostat after cutting, or freeze them at –80°C until stained, and stain as follows (see Notes 3 and 9):
1. Remove the slides from the freezer or cryostat and place in 70% ethanol (30 s).
2. Place in purified water (5 s).
3. Add the HistoGene staining solution (30 s) (see Note 4).
4. Rinse the slides with purified water.
5. Wash with 70% ethanol (60 s).
6. Wash with 95% ethanol twice (60 s each).
7. Wash with 100% ethanol (60 s).
8. Place the slides in xylene to ensure complete dehydration (10 min) (see Note 5).
9. Shake off and drain carefully by touching the corner with a particle-free tissue paper.
10. Air dry the slides to allow xylene to evaporate completely (at least 2 min).
11. The slides are now ready for LCM (they should not be coverslipped) (see Note 12).
3.2.2. Immunofluorescence Staining (see Note 7)
1. Thaw slides (1 min).
2. Place in cold acetone at 4°C (2 min).
3. Air dry (30 s).
4. Wash in filtered pH 7.4 1× PBS.
5. Drain off slides.
6. Add 100 μl of first biotinylated Ab at optimal dilution: recommended concentration 30–100 μg/ml, optimize for best results (3 min).
7. Rinse in PBS.
8. Add 100 μl of Cy3 at dilution 1:100 (the user may determine the optimal staining concentration of the Cy3-Streptavidin conjugate by performing a serial dilution staining experiment) (1 min).
9. Rinse in PBS.
10. Place slides in 75% ethanol (30 s).
11. Place slides in 95% ethanol (30 s).
12. Place slides in 100% ethanol (30 s).
13. Place slides in xylene (5 min) (see Note 6).
14. Air dry (5 min).
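Note 7 asks that the whole immunostaining and dehydration procedure stay under 30 min. As a rough plausibility check, the timed steps listed above can be summed; the sketch below does this in Python. The step durations are copied from the list, while the variable names are illustrative, and the untimed washes, rinses, and draining are deliberately left out, so the real bench time will be somewhat longer.

```python
# Timed steps of the immunofluorescence staining protocol (seconds).
# Untimed steps (PBS wash, draining, PBS rinses) are not included.
TIMED_STEPS = [
    ("thaw slides", 60),
    ("cold acetone", 120),
    ("air dry", 30),
    ("biotinylated antibody", 180),
    ("Cy3-streptavidin", 60),
    ("75% ethanol", 30),
    ("95% ethanol", 30),
    ("100% ethanol", 30),
    ("xylene", 300),
    ("final air dry", 300),
]

MAX_TOTAL_SECONDS = 30 * 60  # upper limit recommended in Note 7

total_seconds = sum(duration for _, duration in TIMED_STEPS)
within_limit = total_seconds <= MAX_TOTAL_SECONDS

print(f"timed steps: {total_seconds / 60:.0f} min, within limit: {within_limit}")
```

The timed steps add up to 19 min, which leaves about 11 min for the washes and for handling between steps.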
3.3. LCM
The newer instruments developed by Arcturus, such as the AutoPix™ and the Veritas™, are enclosed, automated systems operated entirely by computer. We describe here the LCM procedure using the PixCell II instrument, which is manually operated, is the least expensive LCM instrument today, and is therefore more widely used (see Note 8).
1. Turn on the instrument and enter pertinent data such as slide #, case #, cap lot #, and thickness (always 8 μm), and place the stained slide on the mechanical stage (see Note 10).
2. Turn on the vacuum pump to immobilize the slide (small aperture on the left side of the stage) and push in the filter bottom for optimal image quality.
3. Place the caps in the rail on the right side of the stage. Unlock the mechanical arm, move it toward the tissue, and drop it at the top of the tissue. Align the joystick to move the stage to a centered and perpendicular position before beginning the microdissection process.
4. Turn on the key on the right side of the power supply to enable the infrared laser. Focus the laser before beginning microdissection using the smallest ring diameter and adjust to the desired diameter.
5. Select the appropriate energy (mW) and time of exposure (ms) for the desired laser ring diameter and ensure its effectiveness in an area of the tissue that lacks any interest using a cap to be discarded (see Note 11).
6. Fire the laser each time the ring is over the desired tissue target. Move the stage supporting the glass slide with the aid of the joystick, which allows fine and precise motion. Check if the tissue is appropriately microdissected and capture the tissue images before and after LCM as well as the image of the target tissue that was captured in the cap (see Note 13).
7. When the cap is filled with the desired amount of tissue, remove the cap and use a 0.5-ml microcentrifuge tube to collect the tissue (the cap is designed to fit the tube perfectly and close it) (see Note 14).
8. The microcentrifuge tube can be safely stored in a –80°C freezer without adding any buffer and without lysing the cells, which may be done at a convenient time later.
3.4. LCM Lysate
1. Lyse a total of 1500–2000 laser shots (about 3000 to 6000 microdissected cells) in 4 μl of 20 mM HEPES pH 8.0 with 1% Triton X-100. This is sufficient for one SELDI protein array or one MALDI run. For 2D analysis, a minimum of approximately 25,000 cells is necessary.
2. Add the above lysing buffer to the cap while it is held in the microfuge tube. This is usually done with two additions of 2 μl to the LCM cap. Pipet up and down and scrape the surface of the LCM cap to remove all the cells. A gentle scraping motion with the pipet tip may be necessary to remove the cells, but be careful not to rip the polymer film (see Note 15). Transfer the lysate from the surface of the cap to the microfuge tube. Cells from multiple caps may be combined by using the 4 μl of lysate from one cap to lyse the cells on another cap. In this way the volume will remain small. If 2DGE is to be performed, the lysis procedure is different (see below). Make a 1:10 dilution of each lysate in PBS (for IMAC3 SELDI chips) or 100 mM ammonium acetate pH 4.0 (for WCX2 chips) (i.e., add 36 μl to the 4 μl lysate) and vortex for at least 1 min (see Note 16). Spin down briefly.
3. Prepare the arrays of the IMAC chip with CuSO4 according to the manufacturer’s specifications: 20 μl of 100 mM CuSO4 for 10 min, wash with HPLC water; 20 μl of 100 mM Na acetate pH 4.0 for 5 min, wash with water. Use the Micromix shaker for all incubations with the following settings: Form-20, Amplitude-5.
4. Assemble the bioprocessor with the desired number of chips and add 2× 200 μl PBS to each well, incubating on the shaker for 5 min each time. Pretreat the WCX2 chip with 100 mM ammonium acetate pH 4.0. This can be done on the BioMek robot.
5. Add the diluted lysate to the spot on the chip(s) in the bioprocessor.
6. Cover the bioprocessor with a plastic seal and incubate overnight on the MicroMix shaker at room temperature, using the same settings as given above.
7. Remove lysates carefully with a pipet; do not touch the surface of the arrays. Save if needed for another experiment.
8. Wash the spots in the bioprocessor 2× with 200 μl PBS (for IMAC) or 100 mM ammonium acetate pH 4.0 (for WCX) for 5 min on the shaker.
9. Wash the arrays with HPLC water 2× for 5 min (on shaker).
10. Remove the chip(s) from the bioprocessor and give them a final rinse with HPLC water.
11. Let the chip dry completely, usually overnight.
12. Add 2× 0.5 μl saturated SPA dissolved in 50% acetonitrile, 0.5% TFA.
13. Read at instrument settings optimized for resolution and intensity for the m/z range of 1000–20,000. Higher laser energy will be required to see higher molecular weight peaks.
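The 1:10 dilution in step 2 is a total-volume dilution: one part lysate in ten parts final volume. A minimal Python sketch of the arithmetic follows; the function name is hypothetical and the calculation assumes that volumes are additive.

```python
def diluent_volume(sample_ul, dilution_factor):
    """Volume of diluent (in microliters) to add for a 1:dilution_factor dilution.

    A 1:10 dilution means the sample makes up one tenth of the final volume.
    """
    if sample_ul <= 0 or dilution_factor < 1:
        raise ValueError("sample volume must be positive and dilution factor at least 1")
    return sample_ul * (dilution_factor - 1)


# 4 ul of LCM lysate diluted 1:10, as in step 2.
lysate_ul = 4
buffer_ul = diluent_volume(lysate_ul, 10)
print(f"add {buffer_ul} ul buffer to {lysate_ul} ul lysate "
      f"(final volume {lysate_ul + buffer_ul} ul)")
```

For the 4 μl lysate this reproduces the 36 μl of PBS or ammonium acetate given in the protocol, for a final volume of 40 μl.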
One method of MALDI sample preparation that reduces the complexity of cell lysates while remaining robust and easily amenable to automated high-throughput applications is sample fractionation using magnetic beads (MB) combined with pre-structured MALDI sample supports (AnchorChip Technology). Several magnetic bead types with different surface chemistries can be used to fractionate serum and increase the number of detectable peaks (see the chapter on serum protein profiling for details). For MALDI analysis, dilute the lysate 1:10 with CHCA or SPA matrix (5–10 mg/ml in 50% acetonitrile, 0.1% TFA). Spot on the AnchorChip plate and read in a MALDI instrument. Further dilution and/or fractionation of the lysate may be necessary to achieve optimal spectra.
If 2DGE analysis will be performed, the cells should be lysed as follows: Remove the LCM cap from the tube and add a small volume (10 μl) of 1D focusing rehydration buffer to the tube. The preferred number of laser shots is approximately 100 K. Replace the cap and invert the tube to allow the buffer to come in contact with the cells on the cap and lyse them. Incubate 5 min at room temperature. Sonicate the samples to ensure lysis. Continue with the basic protocol for 1D IEF and 2D analysis.
4. Notes
1. In our experience, a time window of 30 min between completion of surgery and tissue freezing yields good protein quality for most proteomic techniques. However, if one is studying protein phosphorylation, this begins to decrease significantly 20 min after completion of surgery (10).
2. When freezing the tissue sample in the Histobath, avoid immediate and complete immersion in 2-methylbutane to preserve optimal tissue morphology. Hold the sample at the liquid interface with minimal immersion and wait until the OCT and the tissue slowly turn white.
3. Use uncoated glass slides for LCM. Coated or electrically charged glass slides will interfere with the detachment process of the plastic polymer and are not suitable for LCM.
4. Precipitate from hematoxylin can contaminate the surface of the tissue. Filter these solutions. Add one tablet of protease inhibitor to each staining bath (we use Complete, from BMB). Do not add protease inhibitor to alcohol baths. If using the HistoGene staining kit (Arcturus) for frozen sections, this is not necessary.
5. Change all the staining and alcohol solutions after staining 20 slides.
6. Poor transfers may result if the 100% ethanol has taken up water. Increasing the incubation time in xylene often improves transfer.
7. When specific cells need to be microdissected and these cannot be identified morphologically, the cells of interest can be immunostained with specific mAbs against proteins highly expressed on those cells (immunophenotype). It is critical to expedite the immunostaining procedure because the shorter the immunostaining time, the better the protein quality. One must avoid exceeding 30 min for the total immunostaining and dehydration procedure. In the past, we have used the immunoperoxidase technique with DAB labeling (6), but it was difficult to perform quickly enough to preserve optimal protein integrity. Also, manual microdissection of DAB-labeled cells with the PixCell II is extremely tedious and impractical. The immunofluorescence staining method (7) is faster and easier to perform. This method coupled with the AutoPix microscope, which has dark field fluorescence and automation capabilities, is the ideal procedure for immunocapture.
Since Cy3-streptavidin binds to the antibody labeled with biotin, there is no need for a secondary antibody, which decreases the necessary staining time. It is recommended to run a negative control staining; use a biotinylated control antibody from the same animal species and of the same isotype as your primary antibody. Dilute to the same working concentration as the primary antibody.
8. Always wear gloves while performing LCM, including when handling the plastic caps.
9. The thickness of the tissue section is a critical parameter for effective LCM. In our experience (using the PixCell II and the AutoPix instruments by Arcturus), 8 μm is the optimal thickness for LCM.
10. Smooth out the surface of the tissue section with a Prep Strip before placing the slide on the LCM instrument, which improves the efficiency and uniformity of the microdissection process.
11. The main factors affecting the efficiency of LCM include the energy, the time of exposure, and the diameter of the laser beam. Regarding the diameter, when using the PixCell II, the smallest ring is 7 μm, the medium ring is 15 μm, and the widest ring is 30 μm. Very often, we have used the medium ring (15 μm, which lifts up about three cells with each shot). When trying to microdissect single cells with
52
Diaz et al. Pixel II, one must use the smallest (7 μm) diameter ring, but our experience was frustrating. With Autopix, we have observed that microdissection of individual cells is better achieved setting the laser ring at 10 μm diameter, below which it becomes very difficult to lift up cells efficiently. A 30-μm diameter laser is very effective for microdissection of whole glands and other large tissue structures.
Regarding the other two parameters, the optimization depends on the tissue type. For instance, for prostate tissue, an energy of 80 mW with a duration of 0.5 ms is usually effective for a medium-size ring (15 μm). These parameters are tuned by trial and error, progressively adjusting the energy and the time of exposure for the desired diameter, which in turn depends on the microdissection task (single cells vs. medium- or large-size tissue structures).
12. Another factor that affects the effectiveness of LCM is the time the tissue section has been dry after the staining and dehydration procedure. Ideally, the tissue should be stained and microdissected within 1 h. Avoid keeping the slide under LCM for more than 4 h. If microdissecting many tissues, stain only four slides at a time.
13. When capturing images before and after microdissection for documentation purposes, make sure the image on the monitor is in focus, because that is the image that will be captured. Sometimes the image is in focus through the microscope but out of focus on the monitor. In a typical experiment, you will capture the image before and after firing the laser, which provides a record of how effectively the target cells were removed. You can also capture the image of the microdissected cells on the polymer cap.
14. Avoid overcrowding the LCM caps. With the 15-μm laser ring, each shot captures about three cells, so one should expect around 3000 cells per 1000 shots, which is about right for a single cap.
15. LCM caps can be viewed under a dissecting microscope to ensure that all cells have been removed from the polymer film after the lysing procedure.
16. Depending on the cell type, vigorous vortexing and sonication may be necessary to completely lyse the cells after they are removed from the cap.
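The arithmetic in Note 14 can be sketched in a few lines. This is only an illustration, assuming the yield of about three cells per shot and the limit of about 1000 shots per cap quoted in the note; the function names are hypothetical and do not belong to any instrument software.

```python
import math

CELLS_PER_SHOT_15UM = 3    # approximate yield per shot with the 15-um ring (Note 14)
MAX_SHOTS_PER_CAP = 1000   # shots that are "about right" for a single cap (Note 14)


def shots_needed(target_cells, cells_per_shot=CELLS_PER_SHOT_15UM):
    """Number of laser shots needed to collect at least target_cells."""
    return math.ceil(target_cells / cells_per_shot)


def caps_needed(target_cells, cells_per_shot=CELLS_PER_SHOT_15UM,
                max_shots_per_cap=MAX_SHOTS_PER_CAP):
    """Number of caps needed when each cap receives at most max_shots_per_cap."""
    shots = shots_needed(target_cells, cells_per_shot)
    return math.ceil(shots / max_shots_per_cap)


print(shots_needed(3000))   # 1000
print(caps_needed(10000))   # 4
```

Actual yields depend on tissue type and section quality, so the per-shot figure should be treated as a planning estimate.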
References
1. Prieto, D.A., Hood, B.L., Darfler, M.M., Guiel, T.G., Lucas, D.A., Conrads, T.P., Veenstra, D.T., and Krizman, D.B. (2005) Liquid Tissue™: proteomic profiling of formalin-fixed tissues. Biotechniques 38: 32–5.
2. Emmert-Buck, M.R., Bonner, R.F., Smith, P.D., Chuaqui, R.F., Zhuang, Z., Goldstein, S.R., Weiss, R.A., and Liotta, L.A. (1996) Laser capture microdissection. Science 274: 998–1001.
3. Espina, V., Milia, J., Wu, G., Cowherd, S., and Liotta, L.A. (2006) Laser capture microdissection. Methods Mol Biol 319: 213–29.
4. Best, C.J., and Emmert-Buck, M.R. (2001) Molecular profiling of tissue samples using laser capture microdissection. Expert Rev Mol Diagn 1: 53–60.
5. Ornstein, D.K., Gillespie, J.W., Paweletz, C.P., Duray, P.H., Herring, J., Vocke, C.D., Topalian, S.L., Bostwick, D.G., Linehan, W.M., Petricoin, E.F., III, and Emmert-Buck, M.R. (2000) Proteomic analysis of laser capture microdissected human prostate cancer and in vitro prostate cell lines. Electrophoresis 21: 2235–42.
6. Fend, F., Emmert-Buck, M.R., Chuaqui, R., Cole, K., Lee, J., Liotta, L.A., and Raffeld, M. (1999) Immuno-LCM: laser capture microdissection of immunostained frozen sections for mRNA analysis. Am J Pathol 154: 61–6.
7. Murakami, H., Liotta, L., and Star, R.A. (2000) IF-LCM: laser capture microdissection of immunofluorescently defined cells for mRNA analysis rapid communication. Kidney Int 58(3): 1346–53.
8. Cazares, L.H., Adam, B.L., Ward, M.D., Nasim, S., Schellhammer, P.F., Semmes, O.J., and Wright, G.L., Jr (2002) Normal, benign, preneoplastic, and malignant prostate cells have distinct protein expression profiles resolved by surface enhanced laser desorption/ionization mass spectrometry. Clin Cancer Res 8: 2541–52.
9. Diaz, J., Cazares, L.H., Corica, A., and Semmes, O. (2004) Selective capture of prostatic basal cells and secretory epithelial cells for proteomic and genomic analysis. Urol Oncol 22(4): 329–36.
10. Mora, L., Buettner, R., Seigne, J., Diaz, J., Hamad, N., Garcia, R., Bowman, T., Falcone, R., Faigurth, R., Cantor, A., Muro-Cacho, C., Livistong, S., Levitzki, A., Kraker, A., Karras, J., Pow-Sang, J., and Jove, R. (2002) Constitutive activation of Stat3 in human prostate tumors and cell lines: direct inhibition of stat3 signaling induces apoptosis of prostate cancer cells. Cancer Research 62: 6659–66.
4
Protein Profiling of Human Plasma Samples by Two-Dimensional Electrophoresis
Sang Yun Cho, Eun-Young Lee, Hye-Young Kim, Min-Jung Kang, Hyoung-Joo Lee, Hoguen Kim, and Young-Ki Paik
Summary
Human plasma is regarded as the most complex and well-known clinical specimen that can be easily obtained; alterations in the levels of plasma proteins or their corresponding enzyme activities may reflect either a healthy or a diseased state. Given that there is no defined genomic information as to the intact protein components in plasma, protein profiling could be the first step toward its molecular characterization. Several problems exist in the analysis of plasma proteins, however. For example, the very wide dynamic range of protein concentrations, the presence of high-abundance proteins, and post-translational modifications need to be considered before proteomic studies are undertaken. In particular, efficient depletion or pre-fractionation of high-abundance proteins is crucial for the identification of low-abundance proteins that may include potential biomarkers. After the removal of high-abundance proteins, protein profiling can be initiated using two-dimensional electrophoresis (2DE), which has been widely used for displaying the differential proteome under specific physiological conditions. Here, we describe a typical 2DE procedure for the plasma proteome in either a healthy or a diseased state (e.g., liver cancer), in which pre-fractionation and depletion are integral steps in the search for disease biomarkers.
Key Words: 2-dimensional gel electrophoresis; plasma; HPPP; immunoaffinity column.
Abbreviations: IEF: Isoelectric Focusing; IPG: Immobilized pH Gradient; TCA: Trichloroacetic Acid; FFE: Free Flow Electrophoresis; HPMC: Hydroxypropyl Methylcellulose; TBP: Tributylphosphine; 2DE: 2-dimensional Gel Electrophoresis; BPB: Bromophenol Blue; CHCA: α-cyano-4-hydroxycinnamic acid; LTQ: Linear Ion Trap; MALDI-TOF: Matrix-assisted Laser Desorption Ionization - Time of Flight Mass Spectrometry; HPPP: Human Plasma Proteome Project.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Human plasma is an intravascular fluid that serves as a liquid medium for blood proteins derived from various cells, tissues, and other biofluids (1). The components of plasma are very heterogeneous and include inorganic ions (e.g., bicarbonate, calcium), metabolic intermediates (e.g., cholesterol, glucose), and plasma proteins (e.g., albumin, globulin), which are important in maintaining body fluid balance, immune response, blood clotting, and other metabolic mechanisms of homeostasis. Plasma contains many different proteins that are primarily synthesized in the liver and are often subjected to post-translational modification (PTM) (2). Since human plasma is the most complex and well-known clinical specimen that can be easily obtained, it has been a central target for many biomedical studies (2). Alterations in the levels of plasma proteins or their corresponding enzyme activities may reflect either a healthy or a diseased state that can be monitored by various analytical tools, including biochemical assays and proteomics. Given that there is no defined genomic information as to the intact protein components in plasma, a proteomic study may be the method of choice (3,4). Recently, plasma protein profiling was conducted as part of the plasma proteome project of HUPO, termed HPPP (5). The pilot phase of HPPP identified 3020 non-redundant proteins in human plasma and serum (5,6). However, several points must be addressed before proteomic studies are undertaken. First, plasma proteins are believed to span the widest dynamic range of concentrations (more than 10 orders of magnitude), creating many technical obstacles to proteomic detection by mass spectrometry (MS) (2,3). For example, the removal of high-abundance proteins (e.g., albumin, IgG, transferrin, fibrinogen, IgA, etc.)
that account for more than 90% of all plasma proteins prior to biochemical analysis may be a big challenge and perhaps even problematic in light of plasma-derived biomarker discovery (3,7). Second, since many plasma proteins have multiple structural isoforms, a more efficient analytical system is needed to facilitate their analysis (1). Third, since many plasma proteins are synthesized as pre-proteins that are subjected to various PTMs for cellular function, more efficient methods to analyze modified proteins (e.g., glycosylated proteins) are required. For example, since glycopeptides are not readily ionized during MS analysis, which leads to inadequate spectral data and low detection sensitivity due to the attached glycans, a strategy
for the removal of glycans must be considered for protein identification. Taken together, all these factors are important for the proteomic study of plasma (8). Of the problems listed above, the first to be addressed in the protein profiling of plasma is usually the depletion or pre-fractionation of high-abundance plasma proteins (3,4,7). Without this depletion procedure, the identification of low-abundance proteins (including biomarkers) may not be practical. After the removal of high-abundance proteins, two-dimensional electrophoresis (2DE) may be the first step chosen to analyze plasma proteins because it is easy to perform in the laboratory. Although 2DE has several limitations in terms of reproducibility and the separation of membrane proteins, low-molecular-weight proteins, and proteins with extreme pIs (<3 or >10), this technique has been widely used, coupled with MS, as a first analysis of proteins in a particular physiological state (9). Recently, quantitative 2DE has been performed with the difference gel electrophoresis (DIGE) system (see Chapter by Friedman and Lilley for detail), in which two or three different fluorescent dyes are applied to specific protein populations to determine quantitative changes in their expression levels under a specific physiological condition (10). This chapter is intended to provide the reader with the necessary information on the systematic analysis of the plasma proteome using 2DE, in an attempt to search for disease biomarkers among the plasma proteins of patients with hepatocellular carcinoma (HCC) (11,12).
2. Materials
2.1. Preparation of Human Plasma Samples
1. Blood collection tubes: BD Plus Plastic K2 EDTA (BD, 367525; 10 mL), BD Glass Serum with silica clot activator (367820, 10 mL).
2. Protease inhibitor (Complete Protease Inhibitor Cocktail, Roche, 11 697 498 001, 20 tablets): One tablet contains protease inhibitors (antipain, bestatin, chymostatin, leupeptin, pepstatin, aprotinin, phosphoramidon, and EDTA) sufficient for the processing of 100 mL plasma samples. Prepare 25× stock solutions in 2 mL distilled water.
2.2. Depletion of High-Abundance Proteins with an Immunoaffinity Column
1. HPLC system, such as the HP1100 LC system (Agilent).
2. Multiple affinity removal system (MARS): LC column (Agilent, 5185-5984); Buffer A for sample loading, washing, and equilibrating (Agilent, 5185-5987); Buffer B for eluting (Agilent, 5185-5988).
2.3. Isoelectric Focusing (IEF) with Immobilized pH Gradient (IPG) Strip
1. MultiPhor™ (GE Healthcare) or Protean IEF cell (Bio-Rad): Numerous commercially available isoelectric focusing units exist.
2. Re-swelling tray.
3. Mineral oil: Immobiline Dry Strip Cover Fluid (GE Healthcare).
4. Power supply, such as the EPS 3501 XL power supply (GE Healthcare).
5. Thermostatic circulator: Multitemp III thermostatic circulator (GE Healthcare).
6. IPG strip: Immobiline Dry Strip, pH 3–10 nonlinear (NL), pH 4.0–5.0, or pH 5.5–6.7, 18 cm long, 0.5 mm thick (GE Healthcare), or ReadyStrip IPG strips with the same pH ranges (Bio-Rad).
7. Carrier ampholyte mixtures: IPG buffer or Pharmalyte, same range as the selected IPG strip.
8. Sample buffer: 7 M urea, 2 M thiourea, 4% (w/v) CHAPS, 0.5% (v/v) ampholyte, 100 mM DTT, 40 mM Tris-HCl, pH 7.5, and a trace amount of bromophenol blue (BPB).
2.4. Microscale Solution Isoelectric Focusing: ZOOM®
1. ZOOM® IEF Fractionator (Invitrogen, ZF10001).
2. ZOOM® disks: pHs 3.0, 4.6, 5.4, 6.2, 7.0, and 10.0 [Invitrogen, ZD series (e.g., ZD10030 for pH 3.0)].
3. IEF Anode Buffer (50X) (Novex, LC5300, 100 mL).
4. IEF Cathode Buffer (10X) (Novex, LC5310, 125 mL).
5. Anode buffer: 8.4 g urea, 3.0 g thiourea, 3.3 mL Novex® IEF Anode Buffer (50X). Add water to a final volume of 20 mL.
6. Cathode buffer: 8.4 g urea, 3.0 g thiourea, 3.3 mL Novex® IEF Cathode Buffer (50X). Add water to a final volume of 20 mL.
2.5. Fractionation of Plasma Samples by Free Flow Electrophoresis (FFE)
1. ProTeam™ FFE instrument (Tecan).
2. 1% 2-(4-sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS) (Tecan, 517074).
3. 0.8% hydroxypropyl methylcellulose (HPMC) (Tecan, 5170709).
4. pI markers: mixture of pI markers that indicate pHs 4.2, 5.1, 6.3, 7.4, 8.7, and 10.1 (Tecan, 5170705).
5. Prolyte™ 1, Prolyte™ 2, and Prolyte™ 3 (Tecan, 0309081, 0309102, and 0309093).
6. Anodic stabilization medium (Inlet I1): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 100 mM H2SO4.
7. Separation medium 1 (Inlet I2): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 1.
8. Separation medium 2 (Inlets I3–5): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 2.
9. Separation medium 3 (Inlet I6): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 14.5% (w/w) Prolyte™ 3.
10. Cathodic stabilization medium (Inlet I7): 14.5% (w/w) glycerol, 8 M urea, 0.03% (w/w) HPMC, 100 mM NaOH.
11. Counter flow medium (Inlet I8): 14.5% (w/w) glycerol, 8 M urea.
12. Anodic circuit electrolyte: 100 mM H2SO4.
13. Cathodic circuit electrolyte: 100 mM NaOH.
2.6. Preparation of 2D Gels
1. Gradient former: Either of two Bio-Rad models can be used in this step: Model 385 (30–100 mL capacity) or Model 395 (100–750 mL capacity).
2. Orbital shaker with speed controller.
3. SDS-PAGE: Protean II xi multicell and multicasting chamber (Bio-Rad) or Ettan DALT twelve large vertical system (GE Healthcare).
4. 5× Tris-HCl buffer: Dissolve 227 g Tris in 800 mL distilled water and adjust to pH 8.8 with HCl (∼30 mL). Add distilled water to a final volume of 1 L.
5. 5× Gel buffer: Dissolve 15 g Tris, 72 g glycine, and 5 g sodium dodecyl sulfate (SDS) in 800 mL distilled water and add distilled water to a final volume of 1 L.
6. SDS equilibration buffer: 6 M urea, 2% (w/v) SDS, 5× gel buffer (pH 8.8), 50% (v/v) glycerol, and 2.5% (w/v) acrylamide monomer.
7. Acrylamide stock solution: Acrylamide/Bis-acrylamide 37.5:1, 40% (w/v) solution (Amresco, M157, 500 mL).
8. Fixing solution: 40% (v/v) methanol and 5% (v/v) phosphoric acid in distilled water.
9. Coomassie blue G-250 staining solution: 17% (w/v) ammonium sulfate, 3% (v/v) phosphoric acid, 34% (v/v) methanol, and 0.1% (w/v) Coomassie blue G-250 in distilled water.
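As a quick sanity check on the "5×" labels of the two buffers above, the masses can be converted to concentrations. This is a minimal sketch, assuming molecular weights of 121.14 g/mol for Tris base and 75.07 g/mol for glycine; the helper name `molar` is hypothetical.

```python
TRIS_MW = 121.14     # g/mol, Tris base (assumed)
GLYCINE_MW = 75.07   # g/mol, glycine (assumed)


def molar(grams, mol_weight, liters=1.0):
    """Molar concentration from mass, molecular weight, and final volume."""
    return grams / mol_weight / liters


# 5x Tris-HCl buffer: 227 g Tris made up to 1 L
tris_5x = molar(227, TRIS_MW)
tris_1x = tris_5x / 5
print(f"Tris-HCl buffer: {tris_5x:.2f} M at 5x, {tris_1x * 1000:.0f} mM at 1x")

# 5x gel buffer: 15 g Tris, 72 g glycine, and 5 g SDS made up to 1 L
gel_tris_1x_mm = molar(15, TRIS_MW) / 5 * 1000
gel_glycine_1x_mm = molar(72, GLYCINE_MW) / 5 * 1000
gel_sds_1x_pct = 5 / 1000 * 100 / 5   # g per 100 mL after fivefold dilution
print(f"Gel buffer at 1x: {gel_tris_1x_mm:.1f} mM Tris, "
      f"{gel_glycine_1x_mm:.1f} mM glycine, {gel_sds_1x_pct:.1f}% SDS")
```

Under these assumptions the 1× values come out close to 375 mM Tris for the Tris-HCl buffer and close to 25 mM Tris, 192 mM glycine, and 0.1% SDS for the gel buffer.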
2.7. 2D Gel Image Analysis
1. Scanner with transparency unit, such as Bio-Rad GS710 or GS800.
2. 2D gel image analysis program: Image Master Platinum 5 (GE Healthcare), PDQuest 7.3.0 (Bio-Rad), or Progenesis Discovery (NonLinear Dynamics, Ltd.).
2.8. Destaining, In-gel Deglycosylation, and In-gel Tryptic Digestion
1. Speed Vac (Heto).
2. PNGase F stock solution for in-gel deglycosylation: PNGase F (Glyko, Inc, GKE5010). Dilute 1 μL PNGase F (2 mU) with 2.5 mL 1× N-glycanase incubation buffer [20 mM sodium phosphate, pH 7.5, and 0.02% (w/v) sodium azide].
3. Sequencing-grade modified trypsin (Promega, V5111, 100 μg, 18,100 U/mg).
4. 50 mM ammonium bicarbonate.
2.9. Desalting of Peptides and MALDI Plating
1. GELoader tips (Eppendorf, No. 0030 048.083, 20 μL capacity).
2. Poros 10 R2 resin (PerSeptive Biosystems, 1-1118-02, 0.8 g).
3. Oligo R3 resins (PerSeptive Biosystems, 1-1339-03, 6.3 g).
4. 2% (v/v) formic acid in 70% (v/v) acetonitrile (ACN).
5. 0.1% (v/v) trifluoroacetic acid in 70% (v/v) ACN.
6. 1-mL syringe.
7. Matrix: α-cyano-4-hydroxycinnamic acid (CHCA).
8. Opti-TOF™ 384-well insert (123 × 81 mm, 1016491, Applied Biosystems).
2.10. MALDI-TOF and Peptide Mass Fingerprinting
1. MALDI-TOF and MALDI-TOF/TOF: Voyager DE-Pro and 4800 MALDI TOF/TOF™ Analyzer (Applied Biosystems) equipped with a 355-nm Nd:YAG laser. The pressure in the TOF analyzer is approximately 7.6 × 10⁻⁷ Torr.
3. Methods
3.1. Human Plasma Sample Preparation
The following protocol is conducted according to the HUPO reference sample collection protocol (13).
1. Each sample pool consisted of 400 mL blood from one healthy, fasting male and one healthy, fasting postmenopausal female, and was collected into 10-mL tubes by two venipunctures, 20 tubes per venipuncture (see Note 1).
2. Equal numbers of tubes and aliquots were generated with appropriate concentrations of K2-EDTA, lithium heparin, or sodium citrate for plasma or were permitted to clot at room temperature for 30 min to yield serum (with micronized silica as the clot activator) (see Note 2).
3. The specimens were centrifuged for 10–15 min under refrigerated conditions at 2–6°C.
4. The resultant serum and plasma from 10 spun tubes of the same type from each donor were pooled into one secondary 50-mL conical bottom BD™ Falcon tube for each tube type.
5. The secondary tube was centrifuged at 2400×g for 15 min to remove residual cellular material from serum and to prepare platelet-poor plasma from the EDTA, heparin, and citrate secondary tubes.
6. Equal volumes of either serum or plasma were pooled from each secondary tube into media bottles (see Note 3).
7. Serum/plasma was mixed gently and kept on ice while distributed as 20-μL aliquots into cryovials and was then frozen and stored at –70°C.
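Centrifugation in step 5 is specified as a relative centrifugal force (2400×g), whereas many centrifuges are set in rpm. The conversion can be sketched with the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm², with r in cm. The rotor radius used below (15 cm) is purely hypothetical and must be replaced by the value for the rotor actually used.

```python
import math

RCF_CONSTANT = 1.118e-5  # standard factor for radius in cm and speed in rpm


def rcf_to_rpm(rcf, radius_cm):
    """Rotor speed (rpm) that produces the requested relative centrifugal force."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


def rpm_to_rcf(rpm, radius_cm):
    """Relative centrifugal force (x g) produced at a given rotor speed."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


radius_cm = 15.0  # hypothetical rotor radius, not taken from the protocol
rpm = rcf_to_rpm(2400, radius_cm)
print(f"2400 x g at r = {radius_cm} cm corresponds to about {rpm:.0f} rpm")
```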
3.2. Depletion of High-abundance Proteins with an Immunoaffinity Column
For efficient depletion of high-abundance proteins prior to their molecular analysis, many reports have indicated that it is convenient to use commercially available immunoaffinity columns, such as the MARS (Agilent) (2,3) or the prepacked 2-mL Seppro™ MIXED12 affinity LC column (GenWay Biotech.) (14), coupled with an HPLC system. For depletion of the six most abundant proteins (i.e., albumin, transferrin, IgG, IgA, haptoglobin, and antitrypsin) in either serum or plasma, we used MARS, which has been applied successfully to a wide variety of sample types, including cerebrospinal fluid (CSF) and follicular fluid (2,3) (see Fig. 1).
1. Dilute human serum or plasma fivefold with Buffer A (for example, 20 μL human plasma with 80 μL Buffer A) containing the protease inhibitor stock solution (40 μL per 1 mL plasma) (see Note 4) (adapted from the manufacturer's instructions).
2. Remove the particulates with a 0.22-μm spin filter for 1 min at 16,000×g.
3. Inject 75–100 μL of the diluted serum or plasma at a flow rate of 0.5 mL/min.
Fig. 1. The 2DE images of total human plasma proteins that were depleted of the major six abundant proteins through MARS. Proteins were isoelectrically focused with pH 3–10 NL IPG strips in the first dimension and then resolved by 9–16% SDS-PAGE in the second dimension. (A) Whole plasma. (B) Flow through from MARS. Approximately 800 protein spots are displayed by 2DE and identified by MALDI-TOF mass spectrometry. The names of the major proteins of each gel are marked on the image (5) (from (4) with permission).
4. Collect the flow-through fractions that appear between 1.5 and 4.5 min and store them at –20°C if they are not to be analyzed immediately.
5. Elute bound proteins from the column with Buffer B (elution buffer) at a flow rate of 1 mL/min for 3.5 min.
6. Regenerate the column by equilibrating with Buffer A for an additional 7.4 min at a flow rate of 1 mL/min.
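The volumes implied by the depletion steps can be worked out as follows. This is a minimal sketch under the figures given in steps 1 and 3–5 (fivefold dilution, 40 μL inhibitor stock per 1 mL plasma, 0.5 mL/min loading, 1 mL/min elution); the function name is hypothetical, and the inhibitor stock is treated as part of the Buffer A volume.

```python
def mars_sample_mix(plasma_ul, dilution_factor=5, inhibitor_ul_per_ml_plasma=40):
    """Volumes (in uL) for diluting plasma with inhibitor-containing Buffer A."""
    total_ul = plasma_ul * dilution_factor
    return {
        "plasma_ul": plasma_ul,
        "buffer_a_ul": total_ul - plasma_ul,  # includes the inhibitor stock
        "inhibitor_stock_ul": plasma_ul * inhibitor_ul_per_ml_plasma / 1000,
        "total_ul": total_ul,
    }


mix = mars_sample_mix(20)
print(mix)

# Volume collected as flow-through: 1.5 to 4.5 min at 0.5 mL/min
flow_through_ml = (4.5 - 1.5) * 0.5
# Volume of the bound fraction: 3.5 min at 1 mL/min
eluate_ml = 3.5 * 1.0
print(flow_through_ml, eluate_ml)  # 1.5 3.5
```

The roughly 1.5 mL of dilute flow-through per injection is one reason a concentration step such as the TCA/acetone precipitation in Subheading 3.3 follows.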
3.3. TCA/Acetone Precipitation
For 2DE, interfering compounds, such as proteolytic enzymes, salts, lipids, nucleic acids, and any residual high-abundance proteins present after depletion, must be removed or inactivated. In the case of plasma samples, the two most important parameters are salt and proteolysis. TCA/acetone precipitation is the most useful method for desalting the whole plasma and the flow-through fractions of MARS.
1. Add 50% (w/v) trichloroacetic acid (TCA, Sigma, T9159) to reach a final TCA concentration of 5–8%. Mix gently by inverting the tube 5 to 6 times and incubate on ice for 2 h.
2. Centrifuge the sample at 14,000×g for 15 min and discard the supernatant.
3. Add 200 μL cold acetone and resuspend the protein pellet with a pipette.
4. Incubate on ice for 15 min, centrifuge the sample at 14,000×g for 20 min, discard the acetone, and air-dry the pellet (see Note 5).
5. Dissolve the pellet in the sample buffer for 2DE and quantify the protein concentration by the Bradford protein assay.
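The volume of 50% TCA stock needed in step 1 follows from a simple mixing balance: adding a volume V_stock to a sample of volume V gives a final concentration of C_stock × V_stock / (V + V_stock). A minimal sketch of this calculation is shown below; the function name and the 1500-μL example volume are illustrative assumptions, not part of the protocol.

```python
def tca_stock_volume(sample_ul, final_pct, stock_pct=50.0):
    """Volume of TCA stock (uL) to add to a sample to reach final_pct (w/v)."""
    if not 0 < final_pct < stock_pct:
        raise ValueError("final concentration must lie between 0 and the stock")
    return sample_ul * final_pct / (stock_pct - final_pct)


sample_ul = 1500  # hypothetical sample volume
low = tca_stock_volume(sample_ul, 5)
high = tca_stock_volume(sample_ul, 8)
print(f"Add {low:.0f}-{high:.0f} uL of 50% TCA to {sample_ul} uL sample")
```

For example, 900 μL of sample plus 100 μL of 50% TCA gives 1000 μL at 5% TCA.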
3.4. Rehydration of the IPG Gel Strip
For analytical purposes, typically 0.3–1.0 mg protein can be loaded onto an 18-cm-long IPG with a wide pH range (e.g., pH 3–10), or 0.5–2.0 mg onto an IPG with a narrow pH range (e.g., pH 5.5–6.7). A narrow-range IPG usually gives higher resolution when proteins are separated by sequential IEF: first, fractionate the proteins into several pI ranges in solution with ZOOM® disks or FFE (see Subheadings 3.6 and 3.7), and then perform IEF with IPG strips [strips covering one pH unit are also available (e.g., pH 3.0–4.0 or pH 3.5–4.5, up to pH 6.7)]. Certain proteins appear to be trapped in the disk membrane partitions, so sample loss should be considered.
1. Dilute 1.0 mg protein with the sample buffer to a final volume of 400 μL for 18-cm-long IPG strips (see Note 6).
2. Transfer the entire protein-containing sample buffer into the re-swelling tray.
3. Peel off the protective cover from the IPG strip and slowly slide the IPG strip (gel side down) onto the sample solution. Avoid trapping air bubbles and distribute the sample solution evenly under the strips.
4. Overlay the strip with mineral oil and leave for 12–16 h at room temperature (see Note 7 for cup loading).
3.5. IEF with IPG Strip
1. Remove the rehydrated IPG strips that are carrying the protein samples and place them (gel side up) on the strip tray.
2. Place the 2.5-cm filter papers, wetted with distilled water, on both sides of the strips at both cathodic and anodic ends. Place the strip tray on the IEF unit.
3. Cover the strips entirely with mineral oil.
4. Program the instrument (e.g., Multiphor II): Increase the voltage from 100 to 3500 V to reach a total of 80,000 volt-hours (Vh) (e.g., sequentially, 300 Vh at 100 V, 600 Vh at 300 V, 600 Vh at 600 V, 1000 Vh at 1000 V, and 2000 Vh at 2000 V, and finally 3500 V until a total of 80,000 Vh is reached) (see Notes 8 and 9).
5. During IEF, the temperature is set to 20°C with a water circulator.
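The voltage program in step 4 can be turned into approximate step durations, since time (h) = Vh / V at a constant voltage. The sketch below assumes that each step is held at a constant voltage (no ramping) and that 80,000 Vh is the cumulative total over all steps; both are interpretations of the program as written, and the function name is hypothetical.

```python
STEPS = [(100, 300), (300, 600), (600, 600), (1000, 1000), (2000, 2000)]  # (V, Vh)
FINAL_VOLTAGE = 3500
TOTAL_VH = 80000  # assumed to be the cumulative total over all steps


def ief_schedule(steps, final_voltage, total_vh):
    """Return (voltage, volt-hours, hours) for each step, including the final hold."""
    rows = [(volts, vh, vh / volts) for volts, vh in steps]
    remaining_vh = total_vh - sum(vh for _, vh in steps)
    rows.append((final_voltage, remaining_vh, remaining_vh / final_voltage))
    return rows


schedule = ief_schedule(STEPS, FINAL_VOLTAGE, TOTAL_VH)
for volts, vh, hours in schedule:
    print(f"{volts:>5} V  {vh:>6} Vh  {hours:5.1f} h")
total_hours = sum(hours for _, _, hours in schedule)
print(f"Total run time: about {total_hours:.1f} h")
```

Under these assumptions the low-voltage steps take 8 h in total and the final 3500 V hold about 21.6 h, so the whole run is planned as an overnight-plus focusing of roughly 30 h.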
3.6. Microscale Solution IEF: ZOOM®
To reduce typical artifacts that may occur when using narrow-range IPG strips (e.g., streaking, distortion, and loss of protein spots), one may use MicroSol-IEF (e.g., ZOOM®, Invitrogen) prior to running 2D gels (3) (see Fig. 2). MicroSol-IEF is a preparative solution-phase IEF apparatus that is divided into chambers by membrane disks of defined pH (15,16). Using MicroSol-IEF, 2.5–3.0 mg plasma proteins can be loaded and efficiently fractionated into five separate chambers by their pI values.
1. Add 2 μL of 99% N,N-dimethylacrylamide (DMA) to the 400-μL sample (see Subheading 3.4, Step 2) for alkylation and incubate the sample on a rotary shaker for 30 min at room temperature (adapted from the manufacturer's instructions).
2. Add 4 μL of 2 M DTT to quench any excess DMA. Centrifuge at 16,000×g for 20 min at 4°C.
3. Preparation of protein samples: Dilute 3 mg protein to a volume of 3250 μL with sample buffer. The amount of diluted sample per chamber in the ZOOM® IEF Fractionator is 650 μL.
4. Assemble the ZOOM® IEF Fractionator according to the manufacturer's instructions. Six disks (pHs 3.0, 4.6, 5.4, 6.2, 7.0, and 10.0) are used to create five fractions that together cover pH 3.0–10.0.
5. Add each buffer (anode or cathode) to the corresponding blank chamber.
6. Remove the sample chamber cap and add 650 μL of protein sample (step 3) to each chamber.
7. Fractionation can be carried out under the following conditions: 100 V for 20 min, 200 V for 80 min, and 600 V for 80 min (see Note 10). The starting current is approximately 0.6 mA, which increases to approximately 1.2 mA at the beginning of the 200-V step, and the ending current is approximately 0.2 mA.
8. Load the electro-focused samples onto the narrow pH range IPG strips for 2DE.
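The chamber layout in steps 3–7 can be summarized with a short calculation: six disks bound five pH fractions, the 3250-μL sample is split evenly over the five chambers, and the voltage program sums to a total number of volt-hours. The sketch below is illustrative only; the function `fraction_for` and the example pI of 5.8 are hypothetical, and real proteins may partition less cleanly than this idealized assignment suggests.

```python
from bisect import bisect_right

DISK_PHS = [3.0, 4.6, 5.4, 6.2, 7.0, 10.0]  # disk pH values from step 4
TOTAL_SAMPLE_UL = 3250
TOTAL_PROTEIN_MG = 3.0

# Adjacent disks bound each chamber: (3.0, 4.6), (4.6, 5.4), ...
fractions = list(zip(DISK_PHS[:-1], DISK_PHS[1:]))
volume_per_chamber_ul = TOTAL_SAMPLE_UL / len(fractions)
protein_per_chamber_mg = TOTAL_PROTEIN_MG / len(fractions)


def fraction_for(pi):
    """pH window in which a protein of the given pI would be expected to focus."""
    if not DISK_PHS[0] <= pi <= DISK_PHS[-1]:
        raise ValueError("pI outside the range covered by the disks")
    index = min(bisect_right(DISK_PHS, pi) - 1, len(fractions) - 1)
    return fractions[index]


# Voltage program from step 7 as (volts, minutes)
voltage_steps = [(100, 20), (200, 80), (600, 80)]
volt_hours = sum(volts * minutes / 60 for volts, minutes in voltage_steps)

print(fractions)
print(volume_per_chamber_ul, protein_per_chamber_mg)
print(fraction_for(5.8))
print(round(volt_hours))
```

Each chamber therefore starts with 650 μL containing about 0.6 mg protein, and the run delivers roughly 1100 Vh over 3 h.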
Fig. 2. Narrow pH range 2DE images of plasma proteins after depletion of the major six abundant proteins through MARS. After microscale solution IEF (ZOOM® ), the pH 5.5–6.2 fraction was separated on pH 5.5–6.7 IPG strips by second isoelectric focusing and then resolved on a 9–16% gel. (A) Whole 2DE image of pH 3–10 NL and pH 5.5–6.7. (B) One spot on the pH 3–10 NL gel can be separated into two or more spots in the narrow pH range 2DE. (C) Many hidden spots on the pH 3–10 NL gel appear in the narrow pH range 2DE of normal and HCC plasma.
3.7. Fractionation of the Plasma Samples by Free Flow Electrophoresis
To identify and isolate biomarker candidates from the plasma of patients with HCC using 2DE, higher resolution is critical, and it can be achieved by performing narrow pH range IEF. However, for narrow pH range IEF, higher amounts of protein (e.g., 10-fold or more) should be loaded onto the IPG strip, since the proteins that fall in other pH ranges are discarded. Therefore, prefractionation or depletion is required prior to running both IEF and the 2D gel. FFE is useful for prefractionation of plasma samples since it yields a specific fraction of interest (e.g., by pI or density). For example, if one knows the pI of certain proteins, fractionation by FFE can be useful for prefractionation of complex plasma. We describe here one of several procedures for prefractionation of plasma samples using FFE.
1. Dissolve the TCA-precipitated flow-through fractions of MARS (∼2.0 mg) in 500 μL separation medium 3 (see Subheading 2.5) (adapted from the manufacturer's instructions).
2. Add traces of the red acidic dye 2-(4-sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS, Aldrich) to make it easier to follow the migration of the sample within the separation chamber by eye.
3. FFE is carried out at 10°C using the following media (each solution is applied at the inlet indicated): anodic stabilization medium (Inlet I1), separation medium 1 (Inlet I2), separation medium 2 (Inlets I3–5), separation medium 3 (Inlet I6), cathodic stabilization medium (Inlet I7), and counter-flow medium (Inlet I8).
4. Apply the anodic circuit electrolyte to the anode and the cathodic circuit electrolyte to the cathode.
5. Assemble the ProTeam™ FFE instrument (Tecan). Use a 0.4-mm spacer for the separation chamber, a flow rate of approximately 60 mL/h (Inlets I1–7), and a voltage of 1500 V, which results in a current of 20–24 mA.
6. Perfuse the separation chamber with the sample using the cathodal inlet at approximately 0.7 mL/h (4,17). Residence time in the separation chamber is approximately 33 min.
7. Collect each fraction into polypropylene 96 deep-well plates, numbered 1 (anode) through 44 (cathode) (4).
8. Remove glycerol and HPMC by TCA/acetone precipitation and dissolve the proteins with sample buffer.
9. Load the electro-focused, narrow pH range samples onto the IPG strips for 2DE.
3.8. Preparation of 2D Gels
1. Place the glass plates (separated by two 1.5-mm spacers positioned along the sides) and thin plastic sheets in the multi-casting chamber (20).
2. Prepare gel solution for making 10 gels (20 × 20 cm, 1.5-mm spacer, 9–16% gradient): heavy solution (66.7 mL of 5× Tris-HCl buffer, 75 mL of a 40% acrylamide stock solution, 0.7 mL of 10% ammonium persulfate (APS), 70 μL TEMED, and 191.7 mL of 50% glycerol), light solution (66.7 mL of 5× Tris-HCl buffer, 141.7 mL of a 40% acrylamide stock solution, 0.7 mL of 10% APS, 70 μL TEMED, and 125 mL distilled water).
3. Assemble the gradient maker and peristaltic pump. Pour the light gel solution into the mixing chamber (close to the casting chamber) and the heavy gel solution into the reservoir chamber of the gradient maker. Operate the magnetic stirrer in the mixing chamber. Run the peristaltic pump until the gel solution reaches 0.5–1.0 cm below the end of the glass plates (∼5 min). Check the flow rate, which should be 100–120 mL/min.
4. After the gel solution is poured, overlay the gel solution with distilled water to exclude air and to ensure a level surface on the top of the gel.
5. Allow polymerization to occur overnight at room temperature.
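The recipe in step 2 can be checked by computing the final composition of each solution from the listed volumes. The sketch below uses only the volumes as printed; the dictionary keys and the function name are hypothetical, and the computed percentages should be compared with the nominal 9–16% gradient before casting.

```python
ACRYLAMIDE_STOCK_PCT = 40.0

# Volumes in mL, as printed in step 2
heavy = {"tris_5x": 66.7, "acrylamide_40": 75.0, "aps_10": 0.7,
         "temed": 0.07, "glycerol_50": 191.7}
light = {"tris_5x": 66.7, "acrylamide_40": 141.7, "aps_10": 0.7,
         "temed": 0.07, "water": 125.0}


def summarize(recipe):
    """Return total volume (mL), final acrylamide %, and final Tris buffer strength (x)."""
    total_ml = sum(recipe.values())
    acrylamide_pct = recipe["acrylamide_40"] * ACRYLAMIDE_STOCK_PCT / total_ml
    tris_strength = 5 * recipe["tris_5x"] / total_ml
    return total_ml, acrylamide_pct, tris_strength


for name, recipe in (("heavy", heavy), ("light", light)):
    total_ml, acrylamide_pct, tris_strength = summarize(recipe)
    print(f"{name}: {total_ml:.1f} mL, {acrylamide_pct:.1f}% acrylamide, "
          f"{tris_strength:.2f}x Tris-HCl")

# Solution available per gel versus the volume of one 20 x 20 x 0.15 cm gel
per_gel_ml = (summarize(heavy)[0] + summarize(light)[0]) / 10
gel_volume_ml = 20 * 20 * 0.15
print(f"{per_gel_ml:.1f} mL prepared per gel for a {gel_volume_ml:.0f}-mL gel")
```

With the volumes as printed, both solutions total about 334 mL at close to 1× Tris-HCl, the solution with 75 mL stock works out to roughly 9% acrylamide, and the solution with 141.7 mL stock to roughly 17%.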
3.9. Equilibration of the Sample and Running of the Gel
To solubilize the electro-focused proteins and to allow SDS to bind to them, it is necessary to soak the IPG strips in SDS equilibration buffer. This step is analogous to boiling the sample in SDS buffer prior to SDS-PAGE. The reducing agents, dithiothreitol (DTT) and tributylphosphine (TBP), reduce the disulfide bonds of cysteine residues to sulfhydryls. Alkylating agents, such as iodoacetamide (IAA), prevent reoxidation of the free sulfhydryl groups (21).
1. Prior to use, add approximately 158 μL TBP in 1 mL isopropanol to 100 mL SDS equilibration buffer and sonicate in a bath-type sonicator until the solution becomes transparent (see Note 11) (termed TBP equilibration buffer).
2. Add 15 mL TBP equilibration buffer to each strip (gel side up) and shake gently on an orbital shaker for 25 min (TBP equilibration) (see Note 12).
3. Briefly rinse the IPG strip with 1× gel buffer, load it onto the top of the gel, and pour on the agarose embedding solution (molten agarose solution with trace amounts of BPB) (see Note 13).
4. Perform SDS-PAGE (40 mA/gel) until the BPB dye reaches the bottom of the gel. Keep the temperature at 10°C. The total run time for 20 × 20 cm gels is approximately 6 h.
3.10. Coomassie Brilliant Blue G-250 Staining
1. Fix the separated proteins in the gel in 200 mL fixing solution for 1 h.
2. Decant the fixing solution and stain the gel in Coomassie brilliant blue G-250 overnight.
3. Decant the staining solution.
4. Wash several times (>3 times) in distilled water for more than 4 h.
5. Scan the gel, then wrap the gel in plastic, and store it at 4°C.
3.11. 2D Gel Image Analysis
1. Import the gel image (recommended 12–16 bit, tiff format) and convert it into an ImageMaster file (*.mel).
2. Detect the protein spots and determine the volume and percentage volume of each spot. The percentage volume is the normalized value that remains relatively independent of any irrelevant variations between gels, particularly those caused by varying experimental conditions.
3. Select the differentially displayed protein spots (see Fig. 3).
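The normalization in step 2 and the selection in step 3 can be sketched as follows. This is a minimal illustration, assuming that percentage volume is defined as a spot's volume divided by the summed volume of all detected spots on the same gel; the spot names, volumes, and the twofold threshold are hypothetical and are not taken from the analysis software or from the data in this chapter.

```python
def percent_volumes(volumes):
    """Normalize raw spot volumes to percentages of the total volume on one gel."""
    total = sum(volumes.values())
    return {spot: 100.0 * volume / total for spot, volume in volumes.items()}


def differential_spots(gel_a, gel_b, min_fold=2.0):
    """Spots whose percentage volume differs between gels by at least min_fold."""
    pct_a, pct_b = percent_volumes(gel_a), percent_volumes(gel_b)
    hits = {}
    for spot in pct_a.keys() & pct_b.keys():   # only spots matched on both gels
        ratio = pct_b[spot] / pct_a[spot]
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            hits[spot] = ratio
    return hits


# Hypothetical raw spot volumes from two gels
normal = {"spot1": 100.0, "spot2": 200.0, "spot3": 700.0}
hcc = {"spot1": 300.0, "spot2": 200.0, "spot3": 700.0}

print(percent_volumes(normal))
print(differential_spots(normal, hcc))
```

Because each spot is expressed relative to the total, a large change in one abundant spot slightly shifts the percentage volumes of all other spots, which is worth keeping in mind when setting the threshold.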
3.12. Destaining, In-gel Deglycosylation, and In-gel Tryptic Digestion
Most plasma proteins are glycosylated, including clotting factors, lipoproteins, and antibodies (22,23). These carbohydrate-containing proteins play major roles in normal biological functions in plasma. Because the attached glycans hinder complete ionization of glycopeptides during MS analysis, which may lead to inadequate spectral data and low detection sensitivity, a strategy for removing the glycans is necessary for protein identification.
1. Pick (or excise) the protein spot with an end-cut yellow tip and transfer the gel piece into a 1.5-mL Eppendorf tube.
2. Wash the gel piece with 100 μL distilled water.
3. Add 50 μL of 50 mM NH4HCO3 (pH 7.8) and ACN (6:4), and shake for 10 min.
4. Repeat step 3 until the Coomassie blue G-250 dye disappears (2 to 5 times).
5. Decant the supernatant and dry the gel piece in a Speed Vac for 10 min (see Note 14).
6. Add 5 μL trypsin (12.5 ng/μL in 50 mM NH4HCO3) and leave the gel piece on ice for 45 min.
7. Add 10 μL of 50 mM NH4HCO3 to the gel slice.
8. Incubate the gel piece at 37°C for 12 h.
3.13. Desalting of Peptides and MALDI Plating 1. Resin packing: Twist the column body (GELoader tip, Eppendorf) near the end of the tip and push the resin solution [Poros R2:Oligo R3 (2:1) in 70% (v/v) ACN, occasionally in a more efficient ratio of 1:1] with a 1-mL syringe. A packed resin length of 2-3 mm is suitable (18,19). 2. Equilibration of the column: Add 20 μL of 2% (v/v) formic acid and push the solution through the column with the 1-mL syringe. 3. Peptide binding: Add the peptide solution (supernatant of step 9 in Subheading 3.12, approximately 10-12 μL) and push this solution through the column with the syringe. 4. Washing: Add 20 μL of 2% (v/v) formic acid and push this solution through the column with the syringe.
Fig. 3. Detection of PTMs on the 2DE of plasma proteins. (A) 2DE images of plasma proteins that were depleted of the six major abundant proteins through MARS, untreated (left) and alkaline phosphatase (AP)-treated (right). (B) One of the differentially displayed proteins after treatment with AP. (C) Data-dependent neutral loss scan spectrum of sequence KEPCVESLVSpQYFQTVTDYGKD corresponding to the phosphorylated apolipoprotein A-II precursor.
5. MALDI spotting: Add 1 μL matrix solution [10 mg/mL CHCA in 70% (v/v) ACN and 2% (v/v) formic acid] and directly spot the eluted peptides and matrix mixture onto the MALDI plate (Opti-TOF™ 384-well Insert, Applied Biosystems).
6. Reuse the column: Add 20 μL of 100% ACN and push this solution through the column with the syringe and repeat step 2 for equilibration of the column.
3.14. MALDI-TOF and Peptide Mass Fingerprinting
1. Perform peptide mass fingerprinting (PMF) with the Voyager DE-PRO or 4800 MALDI-TOF/TOF mass spectrometer (Applied Biosystems).
2. Obtain the mass spectra in reflectron/delayed extraction mode with an accelerating voltage of 20 kV and sum data from either 500 laser pulses (4800 MALDI-TOF/TOF) or 100 laser pulses (Voyager DE-PRO).
3. Calibrate the spectrum with the trypsin autodigestion peaks (m/z 842.5090 and 2211.1046) and obtain monoisotopic peptide masses with Data Explorer 3.5 (PerSeptive Biosystems).
4. Search the Swiss-Prot and NCBInr databases with the Matrix Science search engine (http://www.matrixscience.com).
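The internal calibration in step 3 can be pictured as a two-point linear correction. The sketch below is only an illustration under that assumption; the calibration model actually used by the instrument software is not specified in this protocol, and the observed peak positions are hypothetical.

```python
# Reference m/z values of the two trypsin autodigestion peaks quoted in step 3.
TRYPSIN_REF = (842.5090, 2211.1046)


def two_point_calibration(observed_low, observed_high, ref=TRYPSIN_REF):
    """Return a function mapping observed m/z to calibrated m/z.

    Assumes a linear relation between observed and true m/z (an assumption,
    not the documented behavior of any particular software).
    """
    slope = (ref[1] - ref[0]) / (observed_high - observed_low)
    intercept = ref[0] - slope * observed_low
    return lambda mz: slope * mz + intercept


def ppm_error(measured, theoretical):
    """Mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


# Hypothetical observed positions of the two autodigestion peaks.
calibrate = two_point_calibration(842.60, 2211.30)
print(round(calibrate(1500.00), 4))
```

After calibration the two reference peaks map exactly onto their theoretical values, and every other peak is shifted proportionally between them.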
3.15. Profiling of PTMs on Selected Spots
Although shotgun proteomics approaches that utilize various labeling techniques (e.g., SILAC and iTRAQ) are useful for high-throughput protein identification, they have many limitations for PTM analysis. In contrast, 2D gels usually display proteins with PTMs, or isoforms of certain proteins, as spots at different positions on a single gel, which allows their molecular characteristics to be examined further with the aid of high-resolution LC-MS/MS. For example, in a typical 2D gel of plasma, the phosphorylated forms of certain proteins can be easily detected as a ladder of spots that results from their different pIs. Figure 3 shows the localization of the exact phosphorylation site of the apolipoprotein A-II precursor. As seen in the figure, there is a clear difference in the 2D gel between spots that are alkaline phosphatase (AP)-treated and those that are untreated: the treated group is shifted to a more basic position. The phosphorylation site of these proteins can be determined using multidimensional MS (MS2 and MS3). Here, we describe the procedure for identification of phosphorylated proteins by 2DE coupled to MS.
1. Desalting is performed on the MARS-treated (high-abundance protein-depleted) plasma sample using an Amicon Ultra-15 (molecular weight cutoff, 5 kDa; Millipore).
2. Dephosphorylation is carried out overnight at 37°C in a solution of 0.4% ammonium carbonate buffer (pH 8.5) with 24 ng/μL calf intestine AP in 0.4% NH4HCO3.
3. The reaction is stopped by freeze drying for further analysis.
4. Execute 2DE, picking, extraction, and desalting of peptides under the same conditions (see Subheadings 3.8–3.13).
5. Dissolve the extracted and desalted peptides in 10 μL of LC-MS/MS solution [0.4% (v/v) acetic acid and 0.005% (v/v) heptafluorobutyric acid (HFBA)].
6. Nano LC-MS/MS analysis is then performed on an Agilent Nano HPLC system (Agilent) and LTQ mass spectrometer (Thermo Electron, San Jose, CA).
7. The capillary column used for LC-MS/MS analysis (150 mm × 0.075 mm) was obtained from Proxeon (Odense M, Denmark), and the slurry was packed in-house with a 5-μm, 100-Å pore size Magic C18 stationary phase (Michrom Bioresources, Auburn, CA).
8. The mobile phase A for LC separation was 0.4% acetic acid and 0.005% HFBA in deionized water (Cascada™, Pall, USA), and the mobile phase B was 0.4% acetic acid and 0.005% HFBA in ACN.
9. The sample obtained from the Oasis HLB (Waters, USA) desalting step and Nanosep (Pall, USA) filtering was loaded onto the LC column.
10. The chromatography gradient was designed to provide a linear increase from 5% B to 35% B over 50 min and from 40% B to 60% B over 20 min and from 60% B to 80% B over 5 min. The flow rate was maintained at 300 nL/min.
11. The mass spectra were acquired using data-dependent acquisition with a full mass scan (400–1800 m/z) followed by MS/MS scans. Each MS/MS scan acquired was an average of three microscans on the LTQ.
12. The temperature of the ion transfer tube was controlled at 200°C, and the spray voltage was 2.0–3.0 kV. The normalized collision energy was set at 35% for MS2.
13. To determine the exact position of the phosphorylation site, the automated neutral loss MS3 scan was employed, which relies on the observed behavior of phosphopeptides subjected to MS/MS analysis in an ion trap. If the MS/MS scan produces a fragment corresponding to neutral loss of the phosphate group (98 with charge state 1+, 49 with charge state 2+, and 32.6 with charge state 3+), an MS3 scan of the product ion is initiated (see Note 15).
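The neutral loss trigger described in the last step can be sketched as follows. This is a simplified illustration, not the instrument's acquisition logic: it uses the nominal 98 Da loss quoted above, a hypothetical m/z tolerance, and ignores fragment intensity, which real data-dependent methods normally also consider.

```python
NEUTRAL_LOSS_DA = 98.0  # nominal mass lost as phosphoric acid, as quoted in the text


def neutral_loss_mz(precursor_mz, charge, loss=NEUTRAL_LOSS_DA):
    """Expected m/z of the precursor after neutral loss at the given charge state."""
    return precursor_mz - loss / charge


def triggers_ms3(precursor_mz, charge, fragment_mzs, tol=0.5):
    """Return True if any MS/MS fragment lies within tol (hypothetical value)
    of the expected neutral loss position."""
    target = neutral_loss_mz(precursor_mz, charge)
    return any(abs(mz - target) <= tol for mz in fragment_mzs)


# Hypothetical doubly charged precursor at m/z 1000.0: loss appears 49 m/z lower.
print(neutral_loss_mz(1000.0, 2))
print(triggers_ms3(1000.0, 2, [500.2, 951.3]))
```

The m/z offset shrinks with charge state because the same 98 Da loss is divided by the number of charges, which is why the text lists a different offset for each charge state.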
4. Notes
1. Donors were tested and determined negative for HIV-1 and HIV-2 antibodies, HIV-1 antigen (HIV-1), hepatitis B surface antigen (HBsAg), hepatitis B core antigen (anti-HBc), hepatitis C virus (anti-HCV), HTLV-I/II antibody (anti-HTLV-I/II), and syphilis.
2. No protease inhibitor cocktails were used. This procedure required 2 h at 2–6°C.
3. Approximately 10% of the sample was left at the bottom of the secondary tube to ensure that no cellular material was collected.
4. If excess protease inhibitors are used, the resolving power of protein spots in the 2D gel will be decreased, and the borders of the spots will be unclear.
5. If protein pellets are dried completely in the Speed Vac, they will not redissolve in sample buffer. Pellets should be air dried for 15–30 min.
6. To ensure complete dissolution of the sample buffer, it is usually recommended to warm the sample buffer to room temperature. Sample buffer that contains proteins should not be heated, to avoid carbamylation of the proteins by isocyanate formed from the decomposition of urea, which may lead to charge heterogeneities.
7. Cup loading: Rehydrate the IPG gel strip with 350 μL sample buffer (proteins are not included), and load the 100-μL protein sample in sample buffer in the sample cup. High salt concentrations are better tolerated by cup loading.
8. Apply low voltages (100 V) at the beginning of the run for 3–5 h. Replace the filter paper (for desalting purposes) at the end of the run.
9. After the 1D (first-dimension) run, IPG strips that are not immediately used for the 2D (second-dimension) run can be preserved at –80°C for several months.
10. If electrical current passes through the system, BPB dye starts to migrate toward the anode reservoir, which eventually results in a change in the color of the anode buffer (to yellow).
11. Concentrated TBP reacts violently with organic matter. All procedures for preparing TBP stock solutions should be done in a fume hood. Store the TBP stock solution in the dark at 4°C. Do not store it longer than 2 weeks.
12. DTT/IAA equilibration procedure: For reduction and alkylation of proteins, the DTT/IAA equilibration procedure can be used in place of the TBP equilibration procedure. Divide the SDS equilibration buffer into two 50-mL aliquots. Add 1 g DTT to the first aliquot and 1.25 g IAA to the second aliquot. Add 10 mL of the DTT equilibration buffer to each strip and place on a shaker for 10 min. Decant the DTT equilibration buffer and shake with 10 mL of the IAA equilibration buffer for another 10 min.
13. To prepare the agarose embedding solution, dissolve 1 g of agarose in 100 mL of small gel buffer and melt in a microwave on medium power. For complete melting of the agarose solution, heat it in short intervals with occasional swirling to mix the solution.
14. In-gel deglycosylation: After destaining, one may remove the glycan groups of glycoproteins before trypsin digestion to obtain peptides of the highest purity. Rehydrate gel spots (see Subheading 3.12, step 5) with 10 μL of PNGase F stock solution (10 μU) and incubate for 3 h at 37°C. Decant the supernatant including the glycans. Wash the gel piece with 50 μL 50 mM NH4HCO3 (pH 7.8) and ACN (6:4). Dry the gel piece in a Speed Vac.
15. The SEQUEST software was used to identify the peptide sequences: DeltaCn ≥ 0.1 and Rsp ≤ 4; Xcorr ≥ 1.9 with charge state 1+, Xcorr ≥ 2.2 with charge state 2+, and Xcorr ≥ 3.75 with charge state 3+ were used as cutoffs for peptide identification.
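The cutoffs in Note 15 amount to a simple per-peptide filter. The sketch below applies them to individual search hits; the function name and argument layout are hypothetical, and rejecting charge states other than 1+ to 3+ is an assumption, since the note does not say how such hits were handled.

```python
# Minimum Xcorr per precursor charge state, as listed in Note 15.
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.75}


def passes_cutoffs(xcorr, delta_cn, rsp, charge):
    """Return True if a peptide hit satisfies the Note 15 cutoffs."""
    min_xcorr = XCORR_MIN.get(charge)
    if min_xcorr is None:
        # Charge states not covered by Note 15 are rejected (assumption).
        return False
    return xcorr >= min_xcorr and delta_cn >= 0.1 and rsp <= 4


# Hypothetical hits: (Xcorr, DeltaCn, Rsp, charge)
hits = [(2.5, 0.15, 1, 2), (2.5, 0.15, 1, 3), (4.0, 0.20, 5, 3)]
print([passes_cutoffs(*hit) for hit in hits])
```

The same Xcorr can pass at one charge state and fail at another, which is why the threshold is looked up by charge before the comparison.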
Acknowledgments This study was supported by a grant from the Korean Health 21 R&D project, Ministry of Health & Welfare, Republic of Korea (A030003 to YKP).
References
1. Putnam, F. W. (ed) (1987) The Plasma Proteins, Academic Press, New York.
2. Anderson, N. L., and Anderson, N. G. (2002) The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 1, 845–867.
3. Lee, H. J., Lee, E. Y., Kwon, M. S., and Paik, Y. K. (2006) Biomarker discovery from the plasma proteome using multidimensional fractionation proteomics. Curr. Opin. Chem. Biol. 10, 42–49.
4. Cho, S. Y., Lee, E. Y., Lee, J. S., Kim, H. Y., Park, J. M., Kwon, M. S., Park, Y. K., Lee, H. J., Kang, M. J., Kim, J. Y., Yoo, J. S., Park, S. J., Cho, J. W., Kim, H. S., and Paik, Y. K. (2005) Efficient prefractionation of low-abundance proteins in human plasma and construction of a two-dimensional map. Proteomics 5, 3386–3396.
5. Omenn, G. S., States, D. J., Adamski, M., and Blackwell, T. W. (2005) Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 5, 3226–3245.
6. States, D. J., Omenn, G. S., Blackwell, T. W., Fermin, D., Eng, J., Speicher, D. W., and Hanash, S. M. (2006) Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat. Biotechnol. 24, 333–338.
7. Yang, Z., Hancock, W. S., Chew, T. R., and Bonilla, L. (2005) A study of glycoproteins in human serum and plasma reference standards (HUPO) using multilectin affinity chromatography coupled with RPLC-MS/MS. Proteomics 5, 3353–3366.
8. Wang, Y., Wu, S. L., and Hancock, W. S. (2006) Approaches to the study of N-linked glycoproteins in human plasma using lectin affinity chromatography and nano-HPLC coupled to electrospray linear ion trap-Fourier transform mass spectrometry. Glycobiology 16, 514–523.
9. Gorg, A., Boguth, G., Kopf, A., Reil, G., Parlar, H., and Weiss, W. (2002) Sample prefractionation with Sephadex isoelectric focusing prior to narrow pH range two-dimensional gels. Proteomics 2, 1652–1657.
10. Wu, T. L. (2006) Two-dimensional difference gel electrophoresis. Methods Mol. Biol. 328, 71–95.
11. Park, K. S., Kim, H., Kim, N. G., Cho, S. Y., Choi, K. H., Seong, J. K., and Paik, Y. K. (2002) Proteomic analysis and molecular characterization of tissue ferritin light chain in hepatocellular carcinoma. Hepatology 6, 1459–1466.
12. Park, K. S., Cho, S. Y., Kim, H., and Paik, Y. K. (2002) Proteomic alterations of the variants of human aldehyde dehydrogenase isozymes correlate with hepatocellular carcinoma. Int. J. Cancer 2, 261–265.
13. Rai, A. J., Gelfand, C. A., Haywood, B. C., Warunek, D. J., Yi, J., Schuchard, M. D., Mehigh, R. J., Cockrill, S. L., Scott, G. B., Tammen, H., Schulz-Knappe, P., Speicher, D. W., Vitzthum, F., Haab, B. B., Siest, G., and Chan, D. W. (2005) HUPO plasma proteome project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics 5, 3262–3277.
14. Huang, L., Harvie, G., Feitelson, J. S., Gramatikoff, K., Herold, D. A., Allen, D. L., Amunngama, R., Hagler, R. A., Pisano, M. R., Zhang, W. W., and Fang, X. (2005) Immunoaffinity separation of plasma proteins by IgY microbeads: meeting the needs of proteomic sample preparation and analysis. Proteomics 5, 3314–3328.
15. Herbert, B. and Righetti, P. G. (2000) A turning point in proteome analysis: sample prefractionation via multicompartment electrolyzers with isoelectric membranes. Electrophoresis 21, 3639–3648.
16. Miklos, G. L. and Maleszka, R. (2001) Integrating molecular medicine with functional proteomics: realities and expectations. Proteomics 1, 30–41.
17. Weber, G., Islinger, M., Weber, P., Eckerskorn, C., and Volkl, A. (2004) Efficient separation and analysis of peroxisomal membrane proteins using free-flow isoelectric focusing. Electrophoresis 25, 1735–1747.
18. Choi, B. K., Cho, Y. M., Bae, S. H., Zouboulis, C. C., and Paik, Y. K. (2003) Single-step perfusion chromatography with a throughput potential for enhanced peptide detection by matrix-assisted laser desorption/ionization-mass spectrometry. Proteomics 3, 1955–1961.
19. Gobom, J., Nordhoff, E., Mirgorodskaya, E., Ekman, R., and Roepstorff, P. (1999) A sample purification and preparation technique based on nano-scale RP-columns for the sensitive analysis of complex peptide mixtures by MALDI-MS. J. Mass Spectrom. 24, 105–116.
20. Walsh, B. J., and Herbert, B. R. (1999) Casting and running vertical slab-gel electrophoresis for 2D-PAGE. Methods Mol. Biol. 112, 245–253.
21. Newhall, W. J. and Jones, R. B. (1983) Disulfide-linked oligomers of the major outer membrane protein of chlamydiae. J. Bacteriol. 154, 998–1001.
22. Kaufman, R. J. (1998) Post-translational modifications required for coagulation factor secretion and function. Thromb. Haemost. 79, 1068–1079.
23. Tabas, I. (1999) Nonoxidative modifications of lipoproteins in atherogenesis. Annu. Rev. Nutr. 19, 123–139.
II Clinical Proteomics by 2DE and Direct MALDI/SELDI MS Profiling
5 Analysis of Laser Capture Microdissected Cells by 2-Dimensional Gel Electrophoresis Daohai Zhang and Evelyn Siew-Chuan Koay
Summary
Laser capture microdissection (LCM) is a powerful tool for procuring near-pure populations of targeted cell types from specific microscopic regions of tissue sections, by overcoming problems due to tissue heterogeneity and minimizing intermixture and contamination by other cell types. The combination of LCM with various proteomic technologies has enabled high-throughput molecular analysis of human tumors, and provided critical tools in the search for novel disease markers and therapeutic targets. As an example, we describe the application of LCM in dissecting the tumor cells in breast cancer for macromolecular extraction and subsequent protein separation by 2-dimensional gel electrophoresis (2-D GE). The protocols and the key issues involved in preparing ethanol-fixed paraffin-embedded tissue blocks and microscopic sections, microdissecting the cells of interest using the PixCell II LCM system, extracting and separating the cellular proteins by 2-D GE, and preparing selected proteins for peptide mass analysis by mass spectrometry, are discussed. The aim is to provide a practical guide to performing high-throughput microdissection of target cells and gel-based proteomics, which can be adapted to research in cancer formation and growth.
Key Words: laser capture microdissection; 2-dimensional gel electrophoresis; breast cancer; proteomics; silver staining.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Cellular proteins (collectively known as “proteomes”) are less susceptible than the transcriptome to experimental artifacts arising from the rigors of tissue collection and processing, and advances in global protein expression analysis
(expression proteomics) have been used in mapping cellular pathways, identifying the molecular alterations associated with disease onset and progression, and searching for potential tumor markers or drug targets in human disease, especially in cancer. However, to obtain cell-specific protein profiles, homogeneous or near-pure populations of the cells of interest, free from contamination by adjacent cell types, are prerequisites. Laser capture microdissection (LCM) was developed to enable the procurement of near-pure populations of the target cells with greater speed and precision than is possible with manual dissection methods. LCM permits selective transfer of specific cell types, under direct microscopic visualization, from complex tissues onto a polymer film that is activated by laser pulses, whilst retaining their morphology. The homogeneity of the encapsulated cells can be verified microscopically. With these inherent advantages, LCM has become a valuable research tool and has been applied to cellular and molecular studies of various cancers, including breast (1,2), colon (3), and liver (4) cancers. It is equally efficacious in procuring cell populations from both frozen tissues (3,4) and ethanol-fixed, paraffin-embedded tissues (1,5). Protein profiles of the LCM-dissected cells can be obtained by two-dimensional fluorescence difference gel electrophoresis (2-D DIGE) (6), 16O/18O isotopic labeling (7), differential iodine radioisotope detection (2), isotope-coded affinity tag (iCAT) labeling coupled with two-dimensional liquid chromatography-tandem mass spectrometry (2-D LC-MS/MS) (8), and mass spectrometry-compatible silver staining (1,9). Protein samples from LCM-dissected cells can also be applied to reverse-phase protein arrays to analyze the key cellular signaling pathways and metabolic networks (10,11).
In this chapter, the in-house protocols used in the authors’ laboratory for procuring near-pure populations of breast tumor cells from clinical samples, and for the extraction, isolation, and analysis of their protein profiles, are described. These include: (1) preparation of ethanol-fixed paraffin-embedded tissue blocks; (2) microdissection using the PixCell II LCM System and cellular protein extraction; (3) protein separation by 2-D gel electrophoresis (2-D GE), silver staining, and gel image analysis; and (4) preparation of targeted proteins of interest for peptide mass analysis by tandem mass spectrometry and identification of proteins of interest via database search.
2. Materials
2.1. Histology: Tissue Block and Tissue Section Preparation
1. 70% (v/v), 80% (v/v), 95% (v/v), 100% ethanol
2. Deionized or Milli-Q water (Millipore, Bedford, MA, USA)
3. Hematoxylin solution, Mayer’s (Sigma, St. Louis, MO, USA)
4. Eosin Y solution (Sigma)
5. Complete, mini protease inhibitor cocktail tablets (Roche Applied Science, Pleasanton, CA, USA)
6. Disposable microtome blades (Feather Safety Razor Co., Ltd., Osaka, Japan)
7. Uncharged microscopic glass slides (Paul Marienfeld GmbH & Co. KG, Lauda-Königshofen, Germany)
8. Sakura Tissue-Tek® V.I.P.™ 5 Jr tissue processor (Sakura Finetek, Inc. Japan Co., Ltd, Tokyo)
9. Paraffin wax: Paraplast® tissue embedding medium; melting point 56–58°C, store at room temperature (RT) (Structure Probe, Inc., West Chester, PA, USA)
10. Xylenes, Reagent Grade (Sigma)
11. Embedding molds: super metal base molds, 66 mm × 54 mm × 15 mm (Surgipath Medical Industries, Richmond, IL, USA)
2.2. Laser Capture Microdissection and Protein Sample Preparation 1. PixCell II LCM system (Arcturus Engineering, Mountain View, CA, USA) 2. CapSure transparent plastic caps (Arcturus Engineering) 3. Lysis buffer: 7 M urea, 2 M thiourea, 4% (w/v) CHAPS, 1% Nonidet P (NP)-40, 0.5% (v/v) Triton X-100, 50 mM dithiothreitol (DTT), 40 mM Tris-HCl, pH 7.5, 2 mM tributyl phosphine (TBP), and 1% (v/v) IPG buffer (pH 3–10). Store at RT. 4. PlusOne 2-D Clean-up Kit (GE Healthcare, San Francisco, CA, USA) 5. Immobilized pH gradient (IPG) buffer (pH 3–10) (GE Healthcare) 6. PlusOne 2-D Quantitation Kit (GE Healthcare)
2.3. Isoelectric Focusing (IEF) and Sodium Dodecyl Sulfate-Polyacrylamide Gel Electrophoresis (SDS-PAGE)
1. Ettan™ IPGphor™ IEF electrophoresis unit (GE Healthcare)
2. Ceramic strip holders and Ettan™ IPGphor™ Strip Holder Cleaning Solution (GE Healthcare)
3. Immobiline™ IPG DryStrips (18 cm, pH 3–10, NL) (GE Healthcare)
4. DryStrip Cover Fluid (GE Healthcare)
5. Sample rehydration buffer: 7 M urea, 2 M thiourea, 4% (w/v) CHAPS, 1% (w/v) NP-40, 1% (v/v) IPG buffer, 50 mM DTT. Add DTT freshly to the rehydration buffer prior to use. Store at RT.
6. Equilibration buffer A (prepare 10 ml for each strip): 6 M urea, 30% glycerol, 2% SDS, 1% DTT, 50 mM Tris-HCl, pH 8.8. DTT is added to the stock solution before use.
7. Equilibration buffer B (prepare 10 ml for each strip): 6 M urea, 30% glycerol, 2% SDS, 250 mg (2.5%, w/v) iodoacetamide (IAA), 50 mM Tris-HCl, pH 8.8. IAA is added to the stock solution before use.
8. 10% SDS-acrylamide gel: 33 ml acrylamide/bis (30% T, 5% C) (Bio-Rad Laboratories, Hercules, CA, USA), 25 ml Tris (1.5 M, pH 8.8), 1 ml 10% (w/v) SDS, 0.5 ml 10% (w/v) ammonium persulfate (freshly prepared on the day of use), 35 μl TEMED (Bio-Rad). Make up to 100 ml with Milli-Q water.
9. Water-saturated isobutanol: Shake equal volumes of Milli-Q water and isobutanol in a glass bottle and allow the mixture to separate. Transfer the top layer to a new bottle and store at RT.
10. Agarose sealing solution: Dissolve 0.5% low-melting-point agarose and 0.1% (w/v) bromophenol blue in 1× SDS-PAGE running buffer. Store at RT.
11. SDS-PAGE running buffer: 25 mM Tris, 198 mM glycine, 0.2% (w/v) SDS, pH 8.3
12. PROTEAN™ II xi Cell system (Bio-Rad)
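The 10% SDS-acrylamide gel recipe in this subheading is given per 100 ml. A minimal sketch for scaling it to another batch size is shown below; it assumes that volumes are additive and that water simply makes up the remainder, and the 50-ml batch is a hypothetical example rather than a volume recommended by the protocol.

```python
# Component volumes (ml) per 100 ml of gel solution, taken from the recipe above.
RECIPE_PER_100_ML = {
    "acrylamide/bis (30% T)": 33.0,
    "1.5 M Tris, pH 8.8": 25.0,
    "10% (w/v) SDS": 1.0,
    "10% (w/v) APS": 0.5,
    "TEMED": 0.035,
}


def scale_recipe(total_ml):
    """Scale the recipe to total_ml and add water to make up the final volume."""
    factor = total_ml / 100.0
    volumes = {name: ml * factor for name, ml in RECIPE_PER_100_ML.items()}
    volumes["Milli-Q water"] = total_ml - sum(volumes.values())
    return volumes


def final_percent_t(stock_percent_t=30.0):
    """Final acrylamide concentration (%T) given by the recipe."""
    return stock_percent_t * RECIPE_PER_100_ML["acrylamide/bis (30% T)"] / 100.0


for name, ml in scale_recipe(50.0).items():  # hypothetical 50-ml batch
    print(f"{name}: {ml:.3f} ml")
print(f"final %T: {final_percent_t():.1f}")
```

The computed final concentration is 9.9% T, consistent with the nominal 10% gel named in the recipe.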
2.4. Silver Staining (see Note 1) 1. Fix solution: 5% acetic acid and 50% ethanol per 100 ml 2. Sensitivity-enhancing solution: 30% (v/v) ethanol, 6.8% (w/v) sodium acetate, 100 μl of 2% (w/v) sodium thiosulphate per 100 ml 3. Silver staining solution: 0.25% (w/v) silver nitrate 4. Development solution: 2.5% (w/v) anhydrous potassium carbonate, 20 μl of 2% (w/v) sodium thiosulphate per 100 ml, 40 μl of 37% formaldehyde per 100 ml. 5. Stop solution: 4% (w/v) Tris and 2% (v/v) acetic acid per 100 ml 6. Gel store (soak) solution: 1% (w/v) sodium acetate and 10% (v/v) methanol per 100 ml
2.5. Gel Image Analysis 1. Personal Densitometer SI (Molecular Dynamics, Sunnyvale, CA, USA) 2. ImageMaster 2D Elite (Platinum) software (GE Healthcare)
2.6. In-gel Trypsin Digestion and Preparation for MS Analysis
1. Destaining solution: 30 mM potassium ferricyanide and 100 mM sodium thiosulfate (1:1)
2. 25 mM sodium bicarbonate
3. Dehydrating solution: 50 mM sodium bicarbonate and 50% (v/v) methanol per 100 ml
4. SpeedVac centrifuge (TeleChem International, Inc., Sunnyvale, CA, USA)
5. Digestion solution: 40 ng/μl trypsin sequencing grade (Promega, Madison, WI, USA) in 20 mM ammonium bicarbonate solution
6. Extraction solution (for hydrophobic peptides): 5% (v/v) trifluoroacetic acid (TFA) and 50% (v/v) acetonitrile (ACN) per 100 ml
7. Peptide reconstitution solution: 0.1% (v/v) TFA
8. ZipTip C18 columns (Millipore)
9. Eluant: 70% (v/v) ACN and 0.1% TFA per 100 ml
10. Stainless steel MALDI-TOF sample target plates (Applied Biosystems, Framingham, MA, USA)
11. Alpha-cyano-4-hydroxycinnamic acid (α-CHCA) matrix, 3 mg/ml (Sigma)
12. Applied Biosystems 4700 MALDI-TOF/TOF mass spectrometer
2.7. Database Search for Protein Identification 1. MASCOT software (Matrix Science, London, England) 2. MS-Fit software (http://prospector.ucsf.edu)
3. Methods
The methods described below have been successfully used in the authors’ laboratory for proteomics studies in human breast cancer specimens (1,9) and can be applied to other cancer tissues as well. Breast tumors and matched normal tissues were obtained from the Tissue Repository Unit of the National University Hospital, Singapore, after approval by our Institutional Review Board.
3.1. Preparation of Tissue Sections for LCM
In this step, frozen tissues can be directly transferred from the –80°C freezer, where they had been stored after surgical excision and trimming, to a pre-cooled tube containing 70% (v/v) ethanol and kept on ice. Ethanol-fixed paraffin-embedded tissue blocks should be prepared as quickly as possible, and the completed blocks stored at or below 4°C.
1. Fix the frozen tissue overnight in 70% ethanol at 4°C.
2. Place each ethanol-fixed tissue piece, trimmed to appropriate dimensions, into a pre-cooled cassette within the tissue processor and dehydrate according to the following procedure: 30 min each in 70% and 80% ethanol at 40°C; 45 min in 95% ethanol at 40°C (twice); 45 min in 100% ethanol at 40°C (twice); and 45 min in xylene at 40°C (twice) (see Note 2).
3. Embed the specimen in paraffin using embedding molds, with four changes of paraffin after every 30-min interval.
4. Store the paraffin blocks at or below 4°C if they are not to be processed immediately for sectioning.
5. Put the block in a –20°C freezer for at least 1 h before cutting sections from it.
6. Cut sections of 8 μm thickness using a standard microtome. Blades should be changed regularly (see Note 3).
7. Collect the tissue sections on uncharged microscopic glass slides, allow the tissue sections to air dry, and store the cut sections at or below 4°C.
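The dehydration program of step 2 can be written out as a small table to check the total processing time. This is only a bookkeeping sketch; it assumes each "(twice)" means two consecutive incubations of the stated length and ignores transfer times between baths.

```python
# (reagent, minutes per incubation, number of incubations), all at 40°C, from step 2.
PROCESSING_STEPS = [
    ("70% ethanol", 30, 1),
    ("80% ethanol", 30, 1),
    ("95% ethanol", 45, 2),
    ("100% ethanol", 45, 2),
    ("xylene", 45, 2),
]


def total_minutes(steps):
    """Sum the incubation time over all steps of the program."""
    return sum(minutes * repeats for _, minutes, repeats in steps)


minutes = total_minutes(PROCESSING_STEPS)
print(f"{minutes} min ({minutes / 60:.1f} h)")
```

Under these assumptions the dehydration program takes 330 min (5.5 h), which is useful when scheduling the overnight fixation and the subsequent paraffin embedding.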
3.2. Staining of Paraffin-embedded Sections
The staining of sections for LCM is similar to that used in most histology laboratories for morphological assessment. However, using a minimal amount of stain to visualize the tissue for microdissection will improve macromolecule recovery (see Note 4). One tablet of protease inhibitor cocktail should be added to every 10 ml of each reagent (except xylene), and all reagents should be prepared using double-deionized water or Milli-Q® water. Staining should be performed as close as possible to the scheduled LCM dissection.
1. Deparaffinize the sections in fresh xylene for 5 min, followed by another 5 min with a fresh change of xylene.
2. Rehydrate for 15 s in each step of the following series: 100% ethanol, 95% ethanol, 70% ethanol, and deionized water.
3. Stain with Mayer’s Hematoxylin for 30 s.
4. Rinse off excess stain with deionized water for 15 s; repeat the rinse a second time.
5. Dehydrate for 15 s in 70% ethanol.
6. Stain with Eosin Y for 5 s.
7. Dehydrate the sections for 15 s (twice) in 95% ethanol, 15 s (twice) in 100% ethanol, and 60 s in xylene.
8. Air-dry for approximately 2–5 min to allow xylene to evaporate completely (see Note 5).
9. The tissue is now ready for LCM (see Note 6).
3.3. Laser Capture Microdissection and Protein Sample Preparation The PixCell II LCM system (Arcturus Engineering, Mountain View, CA, USA) is used for specific microdissection of tumor cells in our laboratory. Tissue sections are usually mounted on uncoated glass slides to provide support for the CapSure cap during microdissection. LCM utilizes an infrared laser integrated into a standard microscope, and when the desired cells move into the path of the light source, the investigator activates the laser, which in turn activates the membrane (a short laser pulse emitted heats the transparent membrane to ∼90°C for 5 ms). This melts the membrane, with subsequent binding and encapsulation of the cells of interest, segregating them from the surrounding cells and connective tissues. Images of the tissues before and after microdissection and of the captured cells on the cap can be visualized, thus maintaining an accurate record of each dissection. The laser beam diameter may be adjusted from 7.5 to 30 μm to procure either single cells or groups of cells, respectively. 1. Place the slide containing the prepared tissue on the microscope stage. Set the laser parameters as follows: spot diameter at 15 μm, pulse duration at 5 ms, and power at 50 mW. 2. Scan the tissue section to locate the desired cells. Dissect out the target cells of interest and capture all encapsulated cells from each section in quick succession into one cap. Cells dissected from ∼2500 shots can be captured into one cap (see Note 7). Figure 1 shows an example of tumor cells before and after microdissection.
3. Place the LCM cap on an Eppendorf tube containing 100 μl of lysis buffer with protease inhibitor, invert the tube, and vortex vigorously for 1 min.
4. Place the tube on ice for approximately 20 min, then sonicate the microdissected sample in a bath sonicator for 1 min, using 5-s pulses separated by 5-s intervals.
5. Return the sample to ice immediately after the 1-min sonication.
6. Centrifuge the sample at 16,000 × g for 20 min at 4°C and transfer the supernatant to a new Eppendorf tube.
7. Determine the protein concentration using the PlusOne 2-D Quantitation Kit (GE Healthcare) and clean up the sample using the PlusOne 2-D Clean-up Kit (GE Healthcare), following the manufacturer’s instructions closely.
8. Dissolve the protein pellet in the appropriate volume of sample rehydration buffer and aliquot according to experimental plans for immediate and later usage. Store the aliquotted samples at –80°C until analyzed (see Note 8).
Fig. 1. Laser capture microdissection (LCM) of breast tumor cells. The tissue section on the uncharged glass slide was stained with hematoxylin and eosin and microdissected with the PixCell II LCM system (Arcturus Engineering). (A) Section before LCM; (B) section after LCM; (C) microdissected cell.
3.4. First-dimension Gel Electrophoresis (Isoelectric Focusing) 1. Prepare the strip holder for the 18-cm IPG strip (see Note 9). 2. Squeeze a few drops of Ettan™ IPGphor™ Strip Holder Cleaning Solution (GE Healthcare) into the slot and clean thoroughly. Rinse with Milli-Q water and dry completely. 3. Mix approximately 50 μl of the reconstituted protein samples (∼100–150 μg) with the appropriate volume of rehydration buffer. The total volume should be 340 μl for one 18-cm IPG strip. 4. Transfer the entire volume of the diluted protein sample into the groove of the IPG strip holder. 5. Remove the cover from the IPG strip (18 cm, pH 3–10) and place the IPG strip in the holder such that the gel of the strip is in contact with the sample (i.e., gel
side down). Try to remove any trapped air bubbles by lifting the strip up and down from one side.
6. Overlay the IPG strip with 2–3 ml of DryStrip Cover Fluid to prevent urea crystallization and evaporation, and replace the cover on the strip holder.
7. Rehydrate the IPG strip at 20 V for 12 h at 20°C.
8. Perform IEF under the following conditions: 500 V for 1 h, 2000 V for 1 h, 4000 V for 1 h, and 8000 V for 6 h.
9. Once focusing is complete, pour off the oil. The strips can be stored at –20°C for several weeks, or immediately treated as described below (see Subheading 3.5).
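Two small calculations are implicit in this subheading: how much rehydration buffer to add to the sample, and how many volt-hours the focusing program delivers. The sketch below works both out from the values stated above. It assumes that each voltage is reached instantly and held for the stated time (step-and-hold), so the volt-hour figure is nominal; the function names are illustrative only.

```python
TOTAL_VOLUME_UL = 340.0  # total volume for one 18-cm IPG strip (from the protocol)

# Focusing program from the protocol as (volts, hours) pairs.
IEF_PROGRAM = [(500, 1), (2000, 1), (4000, 1), (8000, 6)]


def rehydration_buffer_volume(sample_ul: float, total_ul: float = TOTAL_VOLUME_UL) -> float:
    """Volume of rehydration buffer needed to bring the sample to the total volume."""
    if not 0 < sample_ul <= total_ul:
        raise ValueError("sample volume must be positive and no larger than the total volume")
    return total_ul - sample_ul


def total_volt_hours(program) -> int:
    """Nominal volt-hours, assuming each voltage is held for its full duration."""
    return sum(volts * hours for volts, hours in program)


print(rehydration_buffer_volume(50.0))  # 290.0
print(total_volt_hours(IEF_PROGRAM))    # 54500
```

The 12-h rehydration at 20 V is left out of the volt-hour total here, because it precedes the focusing program proper.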
3.5. IPG Strip Equilibration 1. Place the focused IPG strips in a container with 10 ml of equilibration buffer A and shake for 15 min at RT (see Note 10). 2. Transfer the IPG strip to a container with 10 ml of equilibration buffer B and shake for 15 min at RT (see Note 10). 3. The equilibrated strips can then be processed for second-dimension gel electrophoresis.
3.6. Second-dimensional SDS-PAGE Prepare the SDS-polyacrylamide gels in advance, and make sure that the gels are well polymerized before performing the equilibration of IPG strips. The proteins have to be charged by equilibration with SDS, and be reduced and alkylated to avoid the formation of oligomers. In our laboratory, we use the PROTEAN II xi Cell system (Bio-Rad) for SDS-PAGE. 1. Assemble the gel casting cassette as per the manufacturer’s instructions. 2. Prepare 10% SDS-PAGE (see Note 10) and pour the solution slowly into the cassette (two 16 cm × 20 cm glass plates sandwiched by 1.5-mm thick spacers) until the gel height is approximately 1 cm from the top. 3. Overlay the gel solution with 2 ml of water-saturated isobutanol. It is best to pour 1 ml of water-saturated isobutanol from one side of the gel and 1 ml on the other side. Do not pour it all along the gel meniscus. 4. Allow the gel to polymerize for at least 2 h. 5. When polymerization is completed, remove the water-saturated isobutanol and rinse with water again. 6. With a pair of forceps, carefully place the equilibrated strip on top of the PAGE gel, with the acidic side of the strip at left. Cover the strip with melted agarose sealing solution (see Note 11). 7. Assemble the electrophoresis unit (Bio-Rad) and perform electrophoresis at 15°C as follows: 40 V for 15 min or until the blue dye enters the gel and then raise the voltage to 125 V and run the gel overnight or until the blue dye migrates to the bottom of the gel. 8. Switch off the main power and disassemble the gel cassette.
9. Place the gel in a glass container and wash the gel with Milli-Q water. 10. Stain the gel using the mass spectrometry-compatible silver staining protocol (see Subheading 3.7).
3.7. Silver Staining and Image Analysis
The silver staining protocol described below is used in the authors’ laboratory and is highly compatible with protein identification by MALDI-TOF MS and MALDI-TOF/TOF MS/MS. Adequate washing with Milli-Q water is essential to reduce the risk of keratin contamination. All the solutions must be prepared with Milli-Q water, and all the chemical reagents should be filtered to remove any particles that may cause interference during MS analysis. All solutions prepared from solid chemicals should be freshly prepared before performing silver staining.
1. Fix the gel with fixing solution for at least 2 h, replacing the solution with fresh fixing solution at hourly intervals.
2. Wash with Milli-Q water for about 15 min, with constant shaking.
3. Remove the wash, cover the gel with the appropriate sensitivity-enhancing solution, and incubate for 1 h with constant shaking.
4. Wash the gel thoroughly with Milli-Q water for 6 × 15 min, with gentle shaking, replacing with fresh Milli-Q water after each cycle (see Note 12).
5. Stain the gel with silver staining solution for 30 min.
6. Wash off excess stain from the gel with Milli-Q water (2 × 1 min).
7. Develop the gel for 5–30 min in developing solution (see Note 13).
8. Add Stop Solution and shake the gel for approximately 20 min to stop the reaction.
9. Wash the gel with Milli-Q water for 20 min; replace the water and repeat the wash.
10. Scan the gel using the Personal Densitometer SI, or store the gel in the gel soak solution for analysis at a later time.
11. Capture the image using ImageMaster 2D Elite software (GE Healthcare). The image analysis includes spot detection, quantification, and normalization of spot intensity against background, according to the instructions from the software. An example of images showing the differences between the protein profiles of LCM-microdissected HER-2/neu positive and -negative tumor cells is shown in Fig. 2.
12. Analyze the image using the software and identify spots that show significant differences in spot intensity (see Note 14), reflecting differential protein expression in the two subtypes of breast cancer triggered by the presence or suppression of the HER-2/neu oncogene. Only those spots that show a more than threefold increase or decrease in signal intensity, consistently across three replicate sets of gels, are considered to demonstrate differential protein expression and are selected for further analysis by MALDI-TOF MS/MS. Proteins with less convincing evidence of differential expression are unlikely to be potential biomarkers for early detection of tumor growth or therapeutic targets for breast cancer treatment.
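The selection rule in the final step (a more than threefold change, in the same direction, in all three replicate gel sets) can be sketched as a simple filter. The example below assumes that each spot has already been reduced to one intensity ratio per replicate set (HER-2/neu positive over negative, after normalization by the image analysis software). The spot names and ratios are invented for illustration and are not data from this study.

```python
from typing import Dict, List

FOLD_THRESHOLD = 3.0  # more than threefold up or down, as stated in the protocol


def is_differential(ratios: List[float], threshold: float = FOLD_THRESHOLD) -> bool:
    """True if all replicate ratios exceed the threshold in the same direction."""
    if len(ratios) < 3:
        return False  # the protocol asks for three replicate sets of gels
    up = all(r > threshold for r in ratios)
    down = all(0 < r < 1.0 / threshold for r in ratios)
    return up or down


def select_spots(spot_ratios: Dict[str, List[float]]) -> List[str]:
    """Names of spots that pass the consistency and fold-change criteria."""
    return [spot for spot, ratios in spot_ratios.items() if is_differential(ratios)]


# Hypothetical ratios (positive/negative) from three replicate gel sets.
example = {
    "spot_A": [3.4, 4.1, 3.2],     # consistently up
    "spot_B": [0.25, 0.30, 0.20],  # consistently down
    "spot_C": [3.5, 2.1, 4.0],     # one replicate below threshold
    "spot_D": [3.6, 0.2, 3.9],     # inconsistent direction
}
print(select_spots(example))  # ['spot_A', 'spot_B']
```

A filter of this kind does not replace the manual verification of spot detection and matching recommended in Note 14.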
Fig. 2. Silver-stained protein profiles of LCM-dissected cells. Protein samples from HER-2/neu positive and -negative cells were separated using IPG® strips (18 cm, pH 3–10 NL) and homogeneous SDS-PAGE (10%), and then stained with silver nitrate. Silver-stained gels were scanned using the Personal Densitometer SI (Molecular Dynamics), and differentially expressed protein spots were analyzed with ImageMaster 2-D Elite software (GE Healthcare). The accession numbers indicate the proteins identified by MALDI-TOF/TOF tandem mass spectrometry and NCBInr database search using Mascot software (Matrix Science, London, UK). [Figure not reproduced. Panels: HER-2/neu-P and HER-2/neu-N; horizontal axis pI 3–10; vertical axis in kDa with labels 92, 50, 35, and 28; labeled spots: NP004095, AAH025396, P04075, P06753-2, P07339, NP001531, AAB49495, NP000627.]
3.8. Trypsin Digestion and Preparation of Peptides for Mass Spectrometric Analysis
1. Excise the silver-stained protein spots showing significant differential protein expression, as described above, one at a time, taking care not to include adjacent proteins, and transfer them to individual tubes.
2. Wash with 100 μl of Milli-Q water for 5 min.
3. Add 50 μl of the destaining solution to the tubes and shake for about 20 min on a platform shaker at RT until the gel pieces become clear.
4. Remove the solution carefully and wash with 100 μl of Milli-Q water.
5. Incubate the gel pieces with 25 mM sodium bicarbonate for 20 min, and then cut them into smaller pieces with the tip of the transfer pipette. Avoid carryover and contamination during repetitive work on consecutive samples.
6. Rinse the gel pieces with Milli-Q water, discard the wash after pulsing down the gel pieces, and repeat the washing process three times.
7. Add 100 μl of dehydrating solution and incubate for 20 min at RT.
8. Dry the gel pieces in a SpeedVac centrifuge.
9. Re-swell the dried gel pieces with 10–20 μl of Digestion Solution and leave overnight at 37°C to ensure complete digestion.
10. Extract the resultant hydrophilic peptides first with 10 μl of Milli-Q water for 1 h.
11. Then extract the hydrophobic peptides with Extraction Solution for 2 h. 12. Pool the extracted hydrophilic and hydrophobic peptides and dry the peptide mixture using the SpeedVac centrifuge. 13. Redissolve the dried peptides in 10 μl of 0.1% (v/v) TFA. 14. Desalt the sample with ZipTip C18 columns (Millipore) and elute the treated and purified peptides with 2.5 μl of Eluant. 15. Mix 0.5 μl of the sample eluate with 0.5 μl of CHCA matrix (3 mg/ml) and spot the mixture onto the stainless steel MALDI-TOF sample target plates. 16. The pretreated peptide samples must be stored on ice during transfer to the core facility for mass spectrometric analysis. In our laboratory, peptide mass spectra are obtained by the Applied Biosystems 4700 Proteomics Analyzer MALDI-TOF/TOF mass spectrometer, set in the positive ion reflector mode. The subsequent MS/MS analyses are performed in a data-dependent manner, and the 10 most abundant ions fulfilling certain preset criteria are subjected to high-energy CID analysis. The collision energy is set to 1 keV, and nitrogen is used as the collision gas.
3.9. Database Search to Match Protein Identities
Database searches were conducted using the MASCOT search engine (Matrix Science, London, UK; http://www.matrixscience.com). Known contamination peaks, such as keratin and autoproteolysis peaks, were removed prior to the search. All tandem mass spectra were searched against the NCBInr database, with a mass accuracy of within 200 ppm for mass measurement and an MS/MS tolerance window of 0.5 Da. Searches were performed without constraining the protein molecular weight (Mr), isoelectric point (pI), or species, and allowing for carbamidomethylation of cysteine and partial oxidation of methionine residues. Up to one missed tryptic cleavage was considered for all tryptic-mass searches. Protein scores greater than 75 were considered significant (p < 0.05).
3.10. Experimental Example: Differential Protein Profiles between HER-2/neu Positive and -Negative Breast Tumors
We dissected the tumor cells from two different subtypes of breast tumors and compared their protein profiles, based on the protocols described above. Figure 2 shows the LCM-dissected tumor cell protein patterns visualized by silver staining. It should be noted that pooled protein samples from different cases of the same tumor subtype were used for 2-D GE. This gel-based protein visualization technique requires a high amount of protein, and thus more sensitive detection reagents and protein identification strategies had to be developed to produce meaningful results (see Notes 15 and 16). Using
the silver-staining protocol, we detected 500–600 protein spots in the protein profiles generated by coupling LCM and 2-D GE. Protein spots of interest were excised and digested with trypsin (Promega), desalted with ZipTip C18 columns (Millipore), and analyzed using MALDI-TOF/TOF tandem mass spectrometry. Protein identities, as shown in Fig. 2, were obtained by searching the NCBInr database using the MASCOT software (Matrix Science).
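The search settings in Subheading 3.9 include a 200 ppm mass tolerance and a protein score cutoff of 75. The sketch below shows, under simple assumptions, how a ppm error is computed and how a list of hits might be filtered by score. It does not call MASCOT or reproduce its scoring; the m/z values, hit names, and scores are hypothetical.

```python
from typing import List, Tuple


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Mass error of an observed peak relative to its theoretical value, in ppm."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float, tol_ppm: float = 200.0) -> bool:
    """True if the observed peak lies within the ppm tolerance (200 ppm in the text)."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def significant_hits(hits: List[Tuple[str, float]], min_score: float = 75.0) -> List[str]:
    """Keep hits whose protein score is greater than the cutoff (75 in the text)."""
    return [name for name, score in hits if score > min_score]


# Hypothetical peak: theoretical m/z 1500.00, observed 1500.24 (about 160 ppm).
print(within_tolerance(1500.24, 1500.00))  # True
print(within_tolerance(1500.45, 1500.00))  # False (about 300 ppm)

# Hypothetical search results as (name, protein score) pairs.
hits = [("hypothetical_protein_1", 120.0), ("hypothetical_protein_2", 60.0)]
print(significant_hits(hits))  # ['hypothetical_protein_1']
```

The score that corresponds to p < 0.05 depends on the database searched, so the cutoff of 75 applies to the search described here and should not be assumed for other databases.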
4. Notes
1. All the chemical solutions should be filtered through filter paper (Cat No. 1001 150, Whatman®, Whatman International Limited, Springfield Mill, Maidstone, Kent, England) to minimize precipitates forming on the gels during silver staining.
2. Tissue processors in standard histopathology laboratories generally include formalin fixation as the first step in the paraffin infiltration procedure. It is important to avoid this step when processing tissues intended for molecular gene and proteome profiling.
3. Consistent LCM transfers have been demonstrated from 5–10 μm thick paraffin-embedded tissue sections. For a successful LCM transfer, the bond between the polymer film and the targeted tissue must be stronger than that between the tissue and the underlying glass slide. Therefore, for most tissue types, sections should be collected on uncharged glass slides. To prevent cross-contamination while sectioning, residual paraffin and tissue fragments should be wiped off the area of the sectioning blade with xylenes between consecutive slides. If possible, a fresh microtome blade should be used to section a different block.
4. In our hands, hematoxylin and eosin are best reduced to 10% of the standard concentrations used for routine histomorphological work when applied to slides prepared for LCM. With this modification, breast tumor cells can be clearly visualized and distinguished from other cell types without affecting the procurement of tumor cells by LCM. Minimal staining also improves macromolecular recovery during cellular protein extraction.
5. Complete dehydration and air drying of sections are the main factors influencing the efficiency of LCM. Prolonged air drying or the presence of moisture in the sections appears to inhibit, at least partially, the transfer of cells to the plastic film.
6. Investigators with less experience in examining cancer tissue sections are strongly advised to consult the pathologists in their institutions for assistance in identifying the target cell types to be microdissected using LCM. It is essential to avoid contamination by other cell types, or dissecting the wrong cells.
7. During microdissection, make sure that there are no irregularities on the tissue surface in or near the area to be microdissected. It should also be noted that wrinkles can elevate the LCM cap away from the tissue surface and decrease the
membrane contact during laser activation. Use an adhesive pad after microdissection to remove cells that may have attached non-specifically to the LCM cap. A cap-alone control is recommended for each experiment to ensure that non-specific transfer is not occurring during microdissection. The cap should be processed together with other tissue-containing caps and serves as a negative control. For protein separation by 2-D GE, 20 to 30 sections from each tissue sample are dissected, depending on the percentage of target cells in the full sections. Generally, 2300–2700 laser pulse shots are used for each cap. Cells from at least 50,000 shots (spot diameter is 15 μm) are required for each 18-cm gel.
8. Up to 15 mg of proteins can be solubilized with 500 μl of the sample rehydration buffer, but with our breast tumor tissue samples, we usually reconstitute 1–2 mg of extracted proteins in 500 μl, or 2–4 mg/ml. It is recommended that the reconstituted proteins be stored in appropriate aliquots, and that only the number of aliquots needed for the experiment at hand be removed at any time, to avoid repeated freezing and thawing of the proteins, which will lead to sample deterioration.
9. IEF is performed using the Ettan™ IPGphor™ IEF electrophoresis unit. Rehydration loading of protein samples is used in the authors’ laboratory. The IPG strips for first-dimensional separation are commercially available, and can be procured from GE Healthcare and other suppliers. IPG strips with various pH gradients and dimensions are available, and are chosen according to the resolution needed for protein separation. The strips should be kept frozen at –20°C, and thawed just before use. The IEF conditions are dependent on the pH range. Reference to the manufacturer’s protocol is recommended. For alkaline pH gradients, cup loading is a must, and DTT in the rehydration buffer should be replaced by other reducing agents, such as hydroxyethyl-disulfide (HED) reagent (Destreak, GE Healthcare).
10. It is essential to equilibrate the strips before they are applied to the second-dimension gel electrophoresis (2-D SDS-PAGE). DTT added to buffer A will reduce the disulfide bonds, whereas IAA in buffer B will alkylate the resulting sulfhydryl groups of proteins. This is to prevent re-oxidation of sulfhydryl groups and streaking of spots during 2-D SDS-PAGE. Further, the presence of SDS makes the proteins negatively charged and suitably primed for SDS-PAGE. Use the best quality SDS available for sample and running buffers that include SDS in their formulation. We recommend C12 Grade SDS from Pierce (Rockford, IL, USA).
11. When placing the strips on top of the gel, ensure that the plastic backing of the strips is in contact with the glass wall. If necessary, the strips can be trimmed. When adding agarose sealing solution, make sure that there are no air bubbles trapped between the IEF strip and the 2-D gel.
12. Wash the gels thoroughly and repeatedly, as recommended, prior to the development step and during the development step itself, to get clear stained gels. During the development of the gels, formaldehyde should be added prior to use, and the suggested concentration should be followed strictly to avoid interference during MALDI-TOF analysis.
13. During the developing stage, the gel should be constantly shaken to reduce the background. The developing time depends on the total amount of protein that is used for 2-D separation. With a higher amount of protein, a shorter developing time can be used, without compromising the aim of visualizing the maximum number of protein spots.
14. It is important to manually verify spot detection and matching because, given the variations in gel resolution, staining, and gel background, automatic image analysis may not correctly define the spot contours in every case. This variability and the complexity of 2-D gel patterns hinder the accurate matching of analogous spots in different gels.
15. In our experience, approximately 500 to 600 distinct proteins from the dissected breast tumor cells can be visualized on 2D-PAGE stained with silver. On average, we can extract approximately 4–6 μg of total cellular proteins from 2500 laser pulses.
16. Our experience is that silver staining of LCM-dissected cell proteins is a sufficiently sensitive tool for isolating and identifying the dysregulated cellular proteins of high or moderate abundance. However, for the dysregulated proteins of low abundance, the lower detection limit of this technology would have to be enhanced by other techniques such as 125-iodine labeling or biotinylation and fluorescent dye labeling. In addition, the use of scanning immunoblotting with class-specific antibodies, for example, would allow sensitive detection of specific subsets of proteins, e.g., all known proteins involved with cell-cycle regulation. Protein identification by MALDI-TOF, LC-MS/MS, or other techniques is also limited by the requirement of a minimal protein input amount, which is often not attainable from certain types of biopsy samples. A useful strategy to improve protein identification is to produce parallel “diagnostic” fingerprints derived from microdissected cells and “sequencing” fingerprints generated from the whole tissue section from each case. Alignment of the diagnostic and sequencing 2D gels permits determination of the proteins of interest for subsequent mass spectrometry or N-terminal sequence analysis.
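The figures quoted in the Notes (2300–2700 laser shots per cap, at least 50,000 shots per 18-cm gel, and roughly 4–6 μg of protein per 2500 pulses) allow a rough planning calculation. The sketch below assumes that protein yield scales linearly with the number of laser shots, which is a simplification; the function names are illustrative.

```python
import math

SHOTS_PER_GEL = 50_000            # minimum shots per 18-cm gel (from the Notes)
UG_PER_2500_SHOTS = (4.0, 6.0)    # approximate protein yield range (from the Notes)


def caps_needed(shots_per_cap: int, shots_required: int = SHOTS_PER_GEL) -> int:
    """Number of LCM caps needed to accumulate the required number of shots."""
    return math.ceil(shots_required / shots_per_cap)


def expected_protein_ug(shots: int, ug_per_2500=UG_PER_2500_SHOTS):
    """Expected protein yield range (low, high) in micrograms, assuming linear scaling."""
    low, high = ug_per_2500
    return (shots / 2500 * low, shots / 2500 * high)


print(caps_needed(2700), caps_needed(2300))  # 19 22
print(expected_protein_ug(50_000))           # (80.0, 120.0)
```

On these assumptions, about 19–22 caps supply one gel, and the estimated 80–120 μg overlaps the approximately 100–150 μg loaded per strip in Subheading 3.4.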
Acknowledgments The Tumor Repository of the National University Hospital, Singapore, provided the clinical breast cancer frozen tissues for LCM. The use of the PixCell II LCM system was courtesy of the Department of Pathology, Yong Loo Lin School of Medicine, National University of Singapore (NUS). This work was supported by an Academic Research Fund from the NUS (Grant No. R-179-000-032) to the authors.
References
1. Zhang, D., Tai, L. K., Wong, L. L., Sethi, S. K., Koay, E. S. (2005) Proteomics of breast cancer: enhanced expression of cytokeratin 19 in human epidermal growth factor receptor type 2 positive breast tumors. Proteomics 5, 1797–1805.
2. Neubauer, H., Clare, S. E., Kurek, R., Fehm, T., Wallwiener, D., Sotlar, K., et al. (2006) Breast cancer proteomics by laser capture microdissection, sample pooling, 54-cm IPG IEF, and differential iodine radioisotope detection. Electrophoresis 27, 1840–1852.
3. Lawrie, L. C., Curran, S., McLeod, H. L., Fothergill, J. E., Murray, G. I. (2001) Application of laser capture microdissection and proteomics in colon cancer. J. Clin. Pathol: Mol. Pathol. 54, 253–258.
4. Ai, J., Tan, Y., Ying, W., Hong, Y., Liu, S., Wu, M., et al. (2006) Proteome analysis of hepatocellular carcinoma by laser capture microdissection. Proteomics 6, 538–546.
5. Ahram, M., Flaig, M. J., Gillespie, J. W., Duray, P. H., Linehan, W. M., Ornstein, D. K., et al. (2003) Evaluation of ethanol-fixed, paraffin-embedded tissues for proteomic applications. Proteomics 3, 413–421.
6. Greengauz-Roberts, O., Stoppler, H., Nomura, S., Yamaguchi, H., Goldenring, J. R., Podolsky, R. H., et al. (2005) Saturation labeling with cysteine-reactive cyanine fluorescent dyes provides increased sensitivity for protein expression profiling of laser-microdissected clinical specimens. Proteomics 5, 1746–1757.
7. Zang, L., Palmer-Toy, D., Hancock, W. S., Sgroi, D. C., Karger, B. L. (2004) Proteomic analysis of ductal carcinoma of the breast using laser capture microdissection, LC-MS, and 16O/18O isotopic labeling. J. Proteome Res. 3, 604–612.
8. Li, C., Hong, Y., Tan, Y. X., Zhou, H., Ai, J. H., Li, S. J., et al. (2004) Accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Mol. Cell. Proteomics 3, 399–409.
9. Zhang, D., Tai, L. K., Wong, L. L., Chiu, L. L., Sethi, S. K., and Koay, E. S. (2005) Proteomic study reveals that proteins involved in metabolic and detoxification pathways are highly expressed in HER-2/neu-positive breast cancer. Mol. Cell. Proteomics 4, 1686–1696.
10. Cowherd, S. M., Espina, V. A., Petricoin, E. F. III, Liotta, L. A. (2004) Proteomic analysis of human breast cancer tissue with laser-capture microdissection and reverse-phase protein microarrays. Clin. Breast Cancer 5, 385–392.
11. Gulmann, C., Espina, V., Petricoin, E. III, Longo, D. L., Santi, M., Knutsen, T., et al. (2005) Proteomic analysis of apoptotic pathways reveals prognostic factors in follicular lymphoma. Clin. Cancer Res. 11, 5847–5855.
6
Optimizing the Difference Gel Electrophoresis (DIGE) Technology
David B. Friedman and Kathryn S. Lilley
Summary Difference gel electrophoresis (DIGE) technology has been used to provide a powerful quantitative component to proteomics experiments involving 2D gel electrophoresis. DIGE combines spectrally resolvable fluorescent dyes (Cy2, Cy3, and Cy5) with sample multiplexing for low technical variation, and uses an internal standard methodology to analyze replicate samples from multiple experimental conditions with unsurpassed statistical confidence for 2D gel-based differential display proteomics. DIGE experiments can readily accommodate sufficient independent (biological) replicate samples to control for the large interpersonal variation expected from clinical samples. Multivariate statistical analyses can then be used to assess the global variation in a complex set of independent samples, filtering out the noise from technical variation and normal biological variation and thereby focusing on the underlying variation that can describe different disease states. This chapter focuses on the design and implementation of the DIGE methodology employing a pooled-sample internal standard in conjunction with the minimal CyDye chemistry. Notes are also provided for the use of the alternative saturation labeling chemistry.
Key Words: difference gel electrophoresis; two-dimensional gel electrophoresis; quantification.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Human disease phenotypes are a direct result of protein expression and modification. In many cases, such phenotypes cannot be tied directly to a single alteration in the genome or resulting proteome, but are likely to be the result
of multiple factors. Studying disease at the protein level is challenging, but as proteins are the mediators of phenotype, the study of protein abundance on a global scale is required to gain a more complete understanding of the underlying molecular mechanisms of disease. Proteomics in the clinical setting is rapidly developing and is having a major impact on the way in which diseases will be diagnosed, treated, and monitored (1). It has been estimated that there could be hundreds of thousands of different protein isoforms in a mammalian cell, but the vast dynamic range of protein abundance results in only the most abundant species of proteins being observable by quantitative proteomics approaches unless technically variable biochemical or subcellular fractionation is employed. The repertoire of techniques and associated hardware, which is now applied to this field, is expanding exponentially, and although a complete visualization of the proteome is still beyond reach of any single technique, each technology platform can provide complementary datasets. Difference gel electrophoresis (DIGE) has proven to be a powerful quantitative technology for differential display proteomics on a global level, where the individual abundance changes for thousands of intact proteins can be simultaneously monitored in replicate samples over multiple variables with statistical confidence (see Note 1). This includes quantitative information on protein isoforms that arise due to post-translational modifications (such as acetylation or phosphorylation), which result in a change in the isoelectric point of the protein. This also includes splice variants and the results of protein processing, all of which are resolved for individual quantification and subsequent analysis by MS. 
DIGE is based on conventional 2D gel technology that is capable of resolving several thousands of intact proteins first by charge using isoelectric focusing (IEF) and then by apparent molecular mass using SDS-polyacrylamide gel electrophoresis (PAGE) (6,7) (see Note 2 and Chapters 4 and 5 by Cho et al. and Zhang et al., respectively). Importantly, DIGE overcomes many of the limitations commonly associated with 2D gels, such as analytical (gel-to-gel) variation and limited dynamic range, that can severely hamper a quantitative differential display study. This is accomplished using up to three spectrally resolvable fluorescent dyes (Cy2, Cy3, and Cy5, referred to as CyDyes) that enable low- to subnanogram sensitivities with a linear dynamic range of >10^4, and then by multiplexing the prelabeled samples into the same analytical run (2D gel). Multiplexing in this way allows for direct quantitative measurements between the samples coresolved in the same gel, and therefore avoids the limitations imposed by between-gel comparisons with conventional 2D gels. The highest statistical power of this multiplexing approach stems from the utilization of a pooled-sample internal standard comprised of an equal aliquot
of every sample in the experiment (see Subheading 1.2.1). With this method, two dyes (Cy3 and Cy5) are used to individually label two independent samples from a much larger experiment, and the Cy2 dye is used to label an internal standard, which is comprised of an equal aliquot of proteins from every sample in the experiment. This pooled-sample internal standard is labeled only once in bulk to avoid additional technical variation, and enough is made and labeled to allow for an equal aliquot to be coresolved on each gel. The three differentially labeled samples are then coresolved on the same 2D gel, after which direct measurements can be made for each resolved protein using the spectrally exclusive dye channels without interference from technical variation of the separation (gel-to-gel variation). Rather than making direct quantitative measurements between the two samples in the gel, the measurements are instead made relative to the Cy2 signal for each resolved protein. The Cy2 signal should be the same for a given protein across different gels because it came from the same bulk mixture/labeling; therefore, any difference represents gel-to-gel variation, which can be effectively neutralized by normalizing all Cy2 values for a given protein across all gels. Using the Cy2 signal to normalize ratios between gels then allows for the Cy3:Cy2 and Cy5:Cy2 ratios for each protein within each gel to be normalized to the cognate ratios from the other gels, encompassing all samples. Each gel may contain different (and/or replicate) samples in the Cy3 and Cy5 channels, but all samples can be quantified relative to each other because each protein from each sample is measured to the cognate Cy2 signal from the internal standard present on each gel. With the use of sufficient replicates, a plethora of advanced statistical tests can be applied, which can highlight proteins of interest whose change in expression is related to the disease state under investigation. 
Since the technical noise is low, these vital replicates should be independent (biological) replicates, as most of the observed variation will be related to the clinical samples rather than to technical or experimental factors. In a final step, specific proteins of interest are then identified using standard mass spectrometry (MS) approaches on gel-resolved proteins that have been excised and proteolyzed into a discrete set of peptides. Briefly, excised proteins are subjected to in-gel digestion with a protease (typically trypsin), and MS is used to acquire accurate mass determinations on the resulting peptides, as well as fragmentation data on individual peptides. The mass spectral data are then used to identify statistically significant candidate protein matches through sophisticated computer search algorithms that compare the observed MS data with theoretical peptide masses (using data generated by peptide mass fingerprinting) or collision-induced fragmentation patterns (obtained from tandem MS) generated in silico from protein sequences present in databases (see Chapter 19 by Fitzgibbon et al.).
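The internal-standard logic described above can be illustrated with a small numerical sketch. For one protein spot, each sample volume (Cy3 or Cy5) is divided by the Cy2 volume from the same gel, and the resulting standardized values are then compared across gels. The spot volumes below are invented so that the second gel is uniformly twice as intense as the first, mimicking gel-to-gel variation. Dedicated DIGE analysis software applies further normalization steps that are not modeled here.

```python
import math
from statistics import mean

# Hypothetical spot volumes for ONE protein on two gels (illustrative numbers only).
# Cy2 carries the pooled internal standard; Cy3 and Cy5 carry individual samples,
# with the dye assignment swapped between gels.
gels = [
    {"Cy2": 1000.0, "Cy3": ("disease", 1500.0), "Cy5": ("control", 500.0)},
    {"Cy2": 2000.0, "Cy3": ("control", 1000.0), "Cy5": ("disease", 3000.0)},
]


def standardized_log2(sample_volume: float, standard_volume: float) -> float:
    """log2 of the sample:Cy2 ratio for one spot within one gel."""
    return math.log2(sample_volume / standard_volume)


def group_log2_abundances(gel_list):
    """Collect standardized log2 abundances per experimental group across gels."""
    groups = {}
    for gel in gel_list:
        for channel in ("Cy3", "Cy5"):
            group, volume = gel[channel]
            groups.setdefault(group, []).append(standardized_log2(volume, gel["Cy2"]))
    return groups


groups = group_log2_abundances(gels)
log2_fold = mean(groups["disease"]) - mean(groups["control"])
fold_change = 2 ** log2_fold
print(round(fold_change, 3))  # 3.0
```

Because every value is expressed relative to the Cy2 signal on its own gel, the twofold intensity difference between the two gels cancels, and the disease and control values agree across gels.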
1.1. Optimizing Sensitivity and Resolution
There are currently two forms of CyDye labeling chemistries available: minimal labeling, which uses N-hydroxy succinimidyl (NHS) ester reagents for low-stoichiometry labeling of proteins largely via lysine residues, and saturation labeling, which utilizes maleimide reagents for the stoichiometric labeling of cysteine sulfhydryls. The most established DIGE chemistry is the “minimal labeling” method, which has been commercially available since July 2002. Here the CyDye DIGE fluors are supplied as NHS esters, which react with the ε-amine groups of lysine side chains. The three fluors are mass matched (ca. 500 Da), and carry an intrinsic +1 charge to compensate for the loss of each proton-accepting site that becomes labeled (thereby maintaining the pI of the labeled protein). Each dye molecule also adds a hydrophobic component to proteins, which along with MW influences how proteins migrate in SDS-PAGE. Minimal labeling reactions are optimized such that only 2–5% of the total number of lysine residues are labeled, so that on average a given labeled protein would contain only one dye molecule. This is necessary because lysine is an abundant amino acid, and multiple labeling events may affect the hydrophobicity of some proteins such that they may no longer remain soluble under 2DE conditions. Although a given protein form may exhibit specific labeling efficiencies, these will be the same for labeling with all three dyes, allowing for direct relative quantification. Minimal labeling with CyDye DIGE fluors is very sensitive, comparable to silver staining or postelectrophoretic fluorescent stains such as Sypro Ruby, Deep Purple, or Flamingo Pink (ca. 1 ng), but with a linear response in protein concentration over five orders of magnitude (8) (see Note 3).
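The statement that labeling 2–5% of lysines leaves most labeled proteins with a single dye molecule can be explored with a simple probability sketch. The model below assumes that every lysine in a protein is labeled independently with the same probability, which is an idealization and not something the labeling chemistry guarantees. The lysine count and labeling fraction in the example are hypothetical.

```python
def labeling_probabilities(n_lysines: int, p_label: float):
    """Binomial probabilities of 0, exactly 1, and 2 or more dye molecules per protein.

    Assumes each lysine is labeled independently with probability p_label.
    """
    p_none = (1.0 - p_label) ** n_lysines
    p_single = n_lysines * p_label * (1.0 - p_label) ** (n_lysines - 1)
    p_multiple = 1.0 - p_none - p_single
    return p_none, p_single, p_multiple


# Hypothetical protein with 30 lysines, labeled at 3% per lysine.
p_none, p_single, p_multiple = labeling_probabilities(30, 0.03)
labeled = p_single + p_multiple
print(f"unlabeled: {p_none:.2f}, single: {p_single:.2f}, multiple: {p_multiple:.2f}")
print(f"fraction of labeled molecules carrying one dye: {p_single / labeled:.2f}")
```

Under this idealized model, lowering the labeling fraction increases the share of labeled molecules that carry exactly one dye, at the cost of leaving more molecules unlabeled.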
For maleimide labeling of the cysteine sulfhydryls, the overall lower cysteine content in proteins allows these residues to be labeled to saturation without increasing the overall hydrophobicity of the proteins enough to cause insolubility problems. Saturation labeling is ultimately more sensitive (150–500 picograms, and even more so for proteins with high cysteine content). Its use is not as commonplace, most likely due to the availability of only Cy3 and Cy5 with this chemistry (see Note 4), the fact that it is blind to the small but significant population of noncysteine-containing proteins, and the additional optimization of complete cysteine reduction necessary for reproducible labeling. For these reasons, saturation DIGE is usually reserved for experiments where samples are limited, where the advantage of the increased sensitivity outweighs these additional considerations. To maximize the information that can be gained from DIGE experiments, it is imperative that resolution of protein species within gels is optimized. Although single 2DE runs can resolve proteins with pI ranges between pH 3 and 11, and
apparent molecular mass ranges between 10 and 200 kDa, higher resolution and sensitivity can be obtained by running a series of medium range (e.g., pH 4–7, 7–11) and narrow range (e.g., pH 5–6) IEF gradients with increasing protein loads, leading to an overall more comprehensive proteomic analysis (6,7,10) (see Note 5). This is analogous to gaining increased resolution and sensitivity in an LC/MS-based strategy by using multiple high performance liquid chromatography columns with different affinity chemistries [e.g., MuDPIT (12)]. Much of the sensitivity limitation associated with 2D gels can be attributed to the analysis of unfractionated, whole-cell and whole-tissue extracts. Additional sensitivity can be gained via enrichment for the proteins of interest, such as by analyzing prefractionated or subcellular samples, or immune complexes. However, the additional experimental manipulations required for prefractionation introduce more technical variation into the samples and necessitate more independent (biological) replicates (which can be accommodated with the DIGE internal standard methodology). The identification of proteins of interest using MS can be performed directly from the DIGE gels when protein amounts have been optimized in this way (see Subheading 3.5). Alternatively, some experimental approaches perform DIGE analysis using “analytical” gels with lower protein amounts, followed by protein excision from a secondary, “preparative” gel with higher protein amounts. This approach has its advantages when dealing with small sample amounts, as is often the case with the saturation dye chemistries, but is also prone to uncertainties that arise from the disproportionate amounts of protein loaded (see Note 6). The methods presented in this protocol are for optimization of both the DIGE data and the material for subsequent MS using high protein loads. 1.2. Optimizing Statistical Significance 1.2.1. 
Using the Internal Standard The ability to coresolve and compare two or three samples in a single gel is attractive, because it allows for direct relative quantification for a given protein without any interference from gel-to-gel variations in migration and resolution, removing the need for running replicate gels for each sample (similar to stable isotope LC/MS-based strategies, see Chapter 10). This approach has limited statistical power, however, since confidence intervals are determined based on the overall variation within a population (see Subheading 3.6.2). Many researchers new to DIGE technology are not immediately aware of the increased statistical advantage and multiplexing capabilities of DIGE when combining this approach with a pooled-sample mixture as an internal standard for a series of coordinated DIGE gels (13). This design will allow for repetitive measurements (vital to any type of experimental investigation), and in
such a way as both to control for gel-to-gel variation and to provide increased statistical confidence. In this way, statistical confidence can be measured for each individual protein based on the variance of repetitive measurements, independent of the variation in the population. Incorporating independently prepared replicate samples into the experimental design also controls for unexpected variation introduced into the samples during sample preparation. This more complex and statistically powerful experimental design is accomplished by using one of the three dyes (usually Cy2) to label an internal standard, which comprises equal aliquots of protein from all of the samples in an experiment. The total amount of the Cy2-labeled internal standard is such that an equal aliquot can be coresolved within each DIGE gel that also contains an individual Cy3- and Cy5-labeled sample from the experiment. Since this standard is composed of all of the samples in a coordinated experiment, each protein in a given sample should be represented in the standard and thus have its own unique internal standard (see Note 7). Direct quantitative comparisons are made individually for each resolved protein between the Cy3- or Cy5-labeled samples and the cognate protein signal from the Cy2-labeled standard for that gel (without interference from gel-to-gel variation). This results in the calculation of a standardized abundance for every spot matched across all gels within a multigel experiment. The individual signals from the internal standard are also used to normalize and compare between each in-gel direct quantitative comparison for that particular protein from the other gels. Using the Cy2-labeled standard in this fashion, therefore, allows for more precise and complex quantitative comparisons between gels, including independent (biological) sample repetition (Fig. 1). 
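The ratio arithmetic behind the standardized abundance can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of any particular DIGE software package: the spot volumes, the group labels, and the function name `standardized_abundance` are all hypothetical, and real packages also normalize each image before ratios are formed.

```python
import math

def standardized_abundance(sample_volume, standard_volume):
    """Log2 ratio of a sample spot volume to the Cy2 internal-standard
    volume measured for the same spot on the same gel."""
    return math.log2(sample_volume / standard_volume)

# Hypothetical spot volumes for ONE matched protein spot on three gels.
# Each gel carries a Cy3 sample, a Cy5 sample, and the Cy2 pooled standard.
gels = [
    {"Cy2": 1000.0, "Cy3": ("control", 900.0), "Cy5": ("treated", 2100.0)},
    {"Cy2": 500.0, "Cy3": ("treated", 1000.0), "Cy5": ("control", 520.0)},
    {"Cy2": 2000.0, "Cy3": ("control", 2000.0), "Cy5": ("treated", 3900.0)},
]

# Collect the intragel log ratios by experimental group.
by_group = {}
for gel in gels:
    for dye in ("Cy3", "Cy5"):
        group, volume = gel[dye]
        by_group.setdefault(group, []).append(
            standardized_abundance(volume, gel["Cy2"]))

# Average the replicates per group and express the change as a fold ratio.
mean_log2 = {g: sum(v) / len(v) for g, v in by_group.items()}
fold_change = 2 ** (mean_log2["treated"] - mean_log2["control"])
```

Because every ratio is formed against the Cy2 signal from the same gel, the fourfold difference in raw signal between the second and third hypothetical gels does not affect the comparison between groups.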
Importantly, the internal standard experimental design allows for the identification of significant changes that would not have been identified if the analyses were performed separately, even when using Cy3- and Cy5-labeled samples on the same DIGE gel (14). This experimental design also allows for multivariable analyses to be performed in one coordinated experiment, whereby statistically significant abundance changes can be quantitatively measured simultaneously between several sample types (e.g., different genotypes, drug treatments, or disease states), with repetition and without the necessity for every pairwise comparison to be made within a single DIGE gel (15,16) (see Note 8 and Chapter 17 by Carpentier et al.). 1.2.2. Assessing Intersample Variation Clinical proteomics is hampered by the significant variation associated with patient samples. The largest proportion of this variation comes from biological diversity, but a significant amount may also come from variable collection
Fig. 1. Illustration of DIGE and experimental design using the mixed-sample internal standard. (A) Representative gel from a six-gel set containing three differentially labeled samples: Cy2-labeled internal standard, Cy3-labeled sample #1, and Cy5-labeled sample #2. The individual protein forms all coresolve in this one gel, but these three independently labeled populations of proteins can be individually imaged using mutually exclusive excitation/emission properties of the CyDyes. (B) Schematic of the sample loading matrix indicating gel number, CyDye labeling and three replicates (indicated as “1, 2, and 3”) of the four conditions being tested (A, B, C, D). Within the boxed regions representing each labeled sample is depicted a theoretical protein that is upregulated in condition D. Dotted lines illustrate how the protein signals from each sample are directly quantified relative to the Cy2 internal standard signal for that protein without interference from gel-to-gel variation, and how the Cy3:Cy2 and Cy5:Cy2 intragel ratios are normalized between the six gels. (C) A graphical representation of the normalized abundance ratios for this theoretical protein change. Adapted from (10).
and storage of biological samples. It is of vital importance to identify changes in protein abundance that are disease specific rather than patient or sample specific. In order to gain the more robust data sets necessary to be able to draw accurate conclusions from clinical proteomics studies, it is, therefore, necessary to collect and store samples using very stringent and closely adhered to
protocols. It is also necessary to assess the biological variation within the population being tested and also within a single individual. Interindividual variation has been the focus of several studies (17,18). Determining the typical diversity within a single patient (i.e., taking longitudinal samples and assessing variability in protein abundance) and between patients establishes the minimum number of patient samples required for an experiment. This is an essential step before embarking on any large-scale and potentially costly DIGE experiment. Without this type of pretest, the results of underpowered experiments run the risk of being peppered with false information (both false positives and false negatives). As with all complex technologies, the DIGE technique itself is subject to technical variation, which will be laboratory specific to a greater or lesser extent. However, the amplitude of this variation is generally outweighed by the biological variation associated with a typical sample set (19).
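As a rough illustration of how a pilot estimate of variation translates into a minimum cohort size, the sketch below applies the standard normal-approximation formula for a two-group comparison, n = 2(z(1-α/2) + z(power))² σ²/δ². The function name, the default α and power, and the example standard deviations are assumptions chosen for illustration; the approximation also underestimates n slightly when groups are small, so a t-based calculation or dedicated power software is preferable for a real study.

```python
import math
from statistics import NormalDist

def samples_per_group(sd, min_difference, alpha=0.05, power=0.8):
    """Approximate number of independent samples needed per group to detect
    a mean difference of `min_difference` (same units as `sd`, e.g. log2
    standardized abundance) with a two-sided test.

    Normal approximation for two groups of equal size and equal variance.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / min_difference) ** 2
    return math.ceil(n)

# Hypothetical pilot estimates: detecting a twofold change (1.0 log2 unit).
n_low_variation = samples_per_group(sd=0.5, min_difference=1.0)
n_high_variation = samples_per_group(sd=1.0, min_difference=1.0)
```

Doubling the standard deviation quadruples the required number of samples, which is why the variation pretest matters most for heterogeneous patient material.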
1.2.3. Univariate Statistical Analyses To date, the majority of published quantitative proteomics studies using the DIGE technology have applied a univariate test, such as a Student’s t-test or analysis of variance (ANOVA), to identify protein species with significant changes in expression [(20) and Chapter 17 by Carpentier et al.]. These tests calculate the probability (p) of observing a difference at least as large as the one measured if the samples being compared were in fact the same, that is, if any apparent change in expression arose by chance alone. An expression change is typically considered significant if the calculated p-value falls below a prescribed significance threshold, usually 0.05 (whereby 1 in 20 tests may give a change in expression by chance). For more stringent analyses, a p-value of 0.01 is often used as the significance threshold. When employing these tests on DIGE datasets, there are several factors that must be considered if correct conclusions are to be drawn from the ensuing analyses. Student’s t-tests and ANOVA assume that the data are normally distributed and that the variance is homogeneous. The measurement and correction of systematic bias within DIGE experiments have been the subject of several studies, which chart methods to optimize normalization of data sets (21,22,23). Another important consideration is the false discovery rate (FDR) that results from applying statistical tests such as those described above to many spots at once. These tests involve the simultaneous and independent testing of thousands of spots. The probability of a false positive being recorded for each test is such that a substantial number of false positives may accumulate. There are several approaches to determine the FDR and adjust p-values to compensate for this,
the most widely used to date being the Benjamini and Hochberg method, whose use in conjunction with DIGE data has been described by Fodor et al. (21). 1.2.4. Multivariate Statistical Analyses Discovery phase proteomics often produces large lists of proteins that are identified as changing significantly in the experiment, many of which may well be false positives. Another approach to overcome these is the application of additional multivariate statistical analyses to these datasets, which can help to filter out false positives that result from whole sample outliers (i.e., sample misclassification and/or poor sample preparation technique). These analyses, such as principal component analysis (PCA), partial least squares discriminant analysis, and unsupervised hierarchical clustering (HC) (see Figs. 2 and 3 and Chapter 16 by Marengo et al.) have recently been applied to DIGE datasets (10,24,25,26,27,28,29,30,31,32). Raw and normalized data can be exported from most DIGE software solutions (e.g., DeCyder, Progenesis), and several multivariate analyses are now part of an extended data analysis (EDA) software module within the DeCyder suite of software tools (GE Healthcare), which was specifically developed for DIGE analysis (see Subheading 3.6). These multivariate analyses work essentially by comparing the expression patterns of all (or a subset of) proteins across all samples, using the variation of expression patterns to group or cluster individual samples. Technical noise (poor sample prep, run-to-run variation) and biological noise (normal differences between samples, especially present in clinical samples) are almost always
Fig. 2. Illustration of the use of principal component analysis. DIGE was used to analyze changes in Staphylococcus proteins in response to genetic and chemical alterations affecting iron utilization. Plot axes: PC1 and PC2; sample groups shown: control, heme, –Fe, and Δfur. Adapted from (24).
Fig. 3. Hierarchical clustering (by average distance correlation) of representative novel circadian proteins detected by 2D DIGE of soluble protein extracts from mouse liver. Pale gray represents low levels of protein expression, black represents intermediate levels, and dark gray represents high levels of expression. Adapted from (32).
associated with any analytical dataset of this nature, and may well override any variation that arises due to actual differences related to the biological questions being tested. Unsupervised clustering of related samples, therefore, adds confidence that a “list of proteins” changing in a DIGE experiment is not arising stochastically (10).
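The FDR adjustment mentioned in Subheading 1.2.3 can be illustrated with a short, dependency-free sketch of the Benjamini and Hochberg step-up procedure. The p-values below are invented for illustration, and the function name is an assumption; established statistics packages provide tested implementations that should be preferred for real analyses.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-spot p-values from univariate tests on ten spots.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
q = benjamini_hochberg(p)

raw_hits = [i for i, value in enumerate(p) if value <= 0.05]
fdr_hits = [i for i, value in enumerate(q) if value <= 0.05]
```

In this invented example five spots pass an unadjusted threshold of 0.05, but only the first two remain once the FDR is controlled at 5%.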
1.3. DIGE in the Clinical Setting Although the potential for DIGE in clinical studies is only beginning to be explored [for example, see (29,30)], many studies have been published demonstrating the feasibility and benefit of DIGE/MS using small patient cohorts for preliminary studies in colon (14), liver (33,34,35), breast (36,37), esophageal (38,39), and pancreatic cancers (40), as well as other important clinical studies such as Severe Acute Respiratory Syndrome (SARS) (41). Many studies also explore the important benefit of procuring samples using laser capture microdissection (LCM; see Chapters 3, 5, and 9 by Diaz et al., Zhang et al., and Mustafa et al., respectively) for a highly enriched population of the cells under study (16,30,42,43,44). These LCM studies necessitate the use of the saturation chemistry, which offers increased sensitivity but limited multiplexing power, and typically require secondary preparative gels with higher protein loads to enable protein identification by MS. The study of Suehara et al. (29) illustrates the utility of a multivariable DIGE/MS analysis with an extended sample set pertinent to a clinical study. Eighty soft tissue sarcoma samples comprising seven different histological backgrounds were analyzed. Using the saturation DIGE fluors, individual samples were labeled with Cy5 and multiplexed with a pooled-sample internal standard (labeled in bulk with Cy3) for each DIGE gel. Using high-resolution 2D gel separations and a combination of multivariate statistical tools (support vector machines, leave-one-out cross-validation, PCA, and HC), these studies identified a small subset of proteins including tropomyosin and HSP27 that were able to discriminate between the different classes of tumors. 
HSP27 in particular was part of a subclass of discriminating proteins that could distinguish between leiomyosarcoma and malignant fibrous histiocytoma (MFH), as well as correlate with patient survival between low-risk and high-risk groups. HSP27 has long been associated with prognosis in MFH as well as in other human carcinomas (45). 2. Materials This chapter assumes a solid understanding of 2D gel electrophoresis and will focus on the design and implementation of the DIGE method using the pooled-sample internal standard methodology and the minimal dye chemistry for Cy2, Cy3, and Cy5, with notes provided for saturation labeling chemistry. 2.1. Cell Lysis Buffers 1. TNE: 50 mM Tris–HCl pH 7.6, 150 mM NaCl, 2 mM EDTA pH 8.0, 2 mM DTT, 1% (v/v) NP-40.
2. RIPA buffer: 50 mM Tris–HCl pH 8.0, 150 mM NaCl, 1% NP-40, 0.5% deoxycholic acid, 0.1% SDS. 3. Two-dimensional gel electrophoresis lysis buffer: 7 M urea, 2 M thiourea, 4% CHAPS, 2 mg/mL DTT, 50 mM Tris–HCl pH 8.0. 4. ASB14 lysis buffer: 7 M urea, 2 M thiourea, 2% amidosulfobetaine 14, 50 mM Tris–HCl pH 8.0.
NB: depending on the sample, it may also be necessary to add protease inhibitors and phosphatase inhibitors [sodium pyrophosphate (1 mM), sodium orthovanadate (1 mM), beta-glycerophosphate (10 mM) and sodium fluoride (50 mM)] to the chosen lysis buffer (see Subheading 3.1). 2.2. SDS-Polyacrylamide Gel Electrophoresis 1. Immobilized pH gradient (IPG) strips and accompanying ampholyte mixtures can be purchased from a number of commercial vendors. Strip lengths vary from 7 cm to high-resolution 24 cm strips, and pH ranges vary from wide-range (e.g., pH 3–11) to high-resolution narrow-range (e.g., pH 5–6) strips. 2. Bind silane working solution (50 mL): 40 mL ethanol, 1 mL acetic acid, 50 μL bind silane solution (GE Healthcare), 9 mL water (see Note 9). 3. 4× separating gel buffer: 1.5 M Tris-base pH 8.8. 4. 30% acrylamide:bis-acrylamide (37.5:1), N,N,N′,N′-tetramethylethylenediamine (TEMED), and ammonium persulfate. 5. 10× SDS-PAGE running buffer (1 L): 30.25 g Tris-base, 144.13 g glycine, 10 g SDS (0.1% at 1×). 6. Fixing solution for SyproRuby staining (1 L): 100 mL methanol, 70 mL acetic acid, 830 mL water. SyproRuby stain is available from several commercial sources and can be substituted by other total protein stains, such as Deep Purple (GE Healthcare) or Flamingo Pink (BioRad). 7. Two-dimensional equilibration buffer: 6 M urea, 50 mM Tris-base pH 8.8, 30% glycerol, 2% SDS, trace bromophenol blue. 8. Water-saturated butanol (see Note 10). 9. Dithiothreitol (store desiccated). 10. Iodoacetamide (store desiccated, keep in the dark).
2.3. DIGE Labeling Materials 1. N,N-dimethyl formamide (DMF) (see Note 11). 2. Labeling (L) buffer: 7 M urea, 2 M thiourea, 4% CHAPS, 30 mM Tris-base (do not adjust the pH, but ensure that the pH of the final solution is between 8.0 and 9.0), 5 mM magnesium acetate (see Note 12). Alternatively, 4% CHAPS can be replaced with 2% ASB14, especially in cases where membrane-rich samples are being used. 3. Rehydration (R) buffer: 7 M urea, 2 M thiourea, 4% CHAPS, 2 mg/mL DTT (13 mM; 0.2%).
4. Cyanine dyes with NHS-ester chemistry for minimal labeling (Cy2, Cy3, and Cy5), and with maleimide chemistry for saturation labeling (Cy3 and Cy5) are available from GE Healthcare as dry solids. 5. Quenching solution (for minimal labeling): 10 mM lysine. 6. Dithiothreitol reduction stock solution: 200 mg/mL DTT.
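When making up these buffers, the conversion between molarity, mass, and mg/mL concentrations is easy to get wrong. The sketch below shows the arithmetic for a 50 mL batch of L-buffer and for the DTT concentration in R-buffer. The batch size, the dictionary of molar masses, and the function names are illustrative assumptions; molar masses should be checked against the supplier's certificate for the exact reagent form used.

```python
# Molar masses in g/mol for the anhydrous free compounds (check reagent labels).
MOLAR_MASS = {
    "urea": 60.06,
    "thiourea": 76.12,
    "Tris base": 121.14,
    "DTT": 154.25,
}

def grams_needed(compound, molarity, volume_ml):
    """Mass of compound required for `volume_ml` of solution at `molarity` (mol/L)."""
    return molarity * (volume_ml / 1000.0) * MOLAR_MASS[compound]

def mg_per_ml_to_millimolar(compound, mg_per_ml):
    """Convert a mg/mL concentration to mM."""
    return mg_per_ml / MOLAR_MASS[compound] * 1000.0

# Hypothetical 50 mL batch of L-buffer (solids dissolved, then brought to volume).
l_buffer_50ml = {
    "urea": grams_needed("urea", 7.0, 50),            # 7 M
    "thiourea": grams_needed("thiourea", 2.0, 50),    # 2 M
    "Tris base": grams_needed("Tris base", 0.030, 50),  # 30 mM
    "CHAPS": 0.04 * 50,                               # 4% (w/v) = 4 g per 100 mL
}

# 2 mg/mL DTT in R-buffer expressed in mM and in % (w/v).
dtt_millimolar = mg_per_ml_to_millimolar("DTT", 2.0)
dtt_percent_wv = 2.0 / 10.0  # 1% (w/v) = 10 mg/mL
```

The last two values confirm that 2 mg/mL DTT corresponds to roughly 13 mM, or 0.2% (w/v).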
3. Methods DIGE is a powerful technique for quantitative multivariable differential display proteomics. However, the quality of the data will only be as good as the quality of the underlying 2D gel electrophoresis technology upon which it is based. The main focus of this chapter is to provide detailed notes on the DIGE technology; however, some key considerations for successful high-resolution 2D gel electrophoresis are also provided. This section describes methods associated with labeling using minimal CyDyes. 3.1. Sample Preparation The key to success for any analytical measurement begins with robust sample preparation. This includes not only the buffers and materials used, but also the nature of the samples and the way in which they are procured. The addition of exogenous materials (such as DNAse, RNAse), or allowing for uncontrolled manipulation of the sample (such as conditions that may lead to proteolysis) can severely hamper and sometimes completely prevent an analysis. Care should be taken to guard against common laboratory contaminants (e.g., mycoplasma for tissue culture) that, if present, may be detected as significant changes using DIGE, either because they are present in only a subset of samples or because they respond to the experimental perturbation. 1. Prepare protein extracts using any method of preference. The appropriate amount of protein can be subsequently precipitated prior to resuspension in the CyDye labeling buffer (see Subheading 3.2). Guarding against proteolysis and loss of post-translational modifications (e.g., phosphorylation) is of paramount importance. Care should be taken not to use reagents that will resolve on the 2D gel, such as soybean trypsin inhibitor. Small molecule inhibitors such as aprotinin, leupeptin, pepstatin A, antipain, 4-(2-aminoethyl) benzenesulfonyl fluoride hydrochloride (AEBSF), sodium orthovanadate, okadaic acid, and microcystin, among others, are far better choices. 2. 
Lyse cells using standard lysis buffers such as TNE and RIPA buffers, or even the buffers used for 2D gel electrophoresis.
All of these buffers have the capability of producing high-resolution samples for 2DE. In most cases, the presence of reagents that would otherwise interfere with CyDye labeling (such as those that contain primary amines) will be removed prior to labeling by protein precipitation (see Subheading 3.2).
3. Sonicate cells if necessary to improve sample quality. Sonication improves sample quality by disrupting nucleic acids, which are subsequently removed by sample cleanup (see Subheading 3.2) along with phospholipids. Both of these nonproteinaceous ionic components can obliterate the resolution during IEF. Short bursts with a tip-sonicator are suggested. It is important to keep the system chilled, especially with urea-containing samples, which should never be heated (see Note 12). 4. Determine the protein concentration of the sample using a system that is compatible with the buffer in which the proteins are extracted. CHAPS and thiourea in the buffers used for DIGE, although adequately chaotropic, interfere with either the Bradford or bicinchoninic acid assays, making the data inaccurate and unreliable. In these cases, aliquots should be precipitated prior to quantification in a suitable buffer, or a detergent-compatible assay should be used. 5. Aim to use a protein concentration between 1 and 10 mg/mL. Too dilute and it will be difficult to quantitatively recover proteins following precipitation cleanup (see Subheading 3.2); too concentrated and it will be difficult to accurately dispense the appropriate volume for the experiment. Freeze/thawing should also be kept to a minimum; freezing samples in aliquots of 1 mL or less will usually suffice.
3.2. Sample Cleanup The desired amount of sample to be used in the experiment should be precipitated prior to labeling. This both removes nonproteinaceous ions from the sample (e.g., nucleic acids, phospholipids) that can interfere with IEF and transfers the proteins into a labeling buffer optimized for CyDye labeling and subsequent IEF. Determine how much total protein will be on each gel, and precipitate ½ of that amount for each sample to be run on that gel. This is straightforward for a two-component separation, but also works out for the multigel experiments where 1/3 of the total protein amount on each gel comes from the pooled-sample internal standard (see Table 1). Precipitate only what is needed for each sample for the experiment; too much material may create pellets that are difficult to resolubilize completely.
Table 1 Experimental Design for CyDye Labeling Using a Pooled-Sample Internal Standard

Samples: Control-1, Control-2, Control-3, Treated-1, Treated-2, and Treated-3, run two per gel on Gel 1, Gel 2, and Gel 3, plus the Pool.

                        Each individual sample      Pool
Precipitated amount     150 μg
L-buffer                24 μL
Aliquot                 16 μL                       8 μL (×6)
Cy2                                                 6 μL
Cy3 or Cy5              2 μL
                        30 min on ice in the dark
Lysine (quench)         2 μL                        6 μL
                        10 min on ice in the dark
Total volume            20 μL                       60 μL

For each gel, combine the quenched Cy3- and Cy5-labeled samples and add 1/3 of the quenched Cy2-labeled pooled mixture.

                        Gel 1             Gel 2             Gel 3
Combined volume         20 + 20 + 20 μL   20 + 20 + 20 μL   20 + 20 + 20 μL
2× R-buffer             60 μL             60 μL             60 μL
Total                   120 μL            120 μL            120 μL
R-buffer                to Vf             to Vf             to Vf

This table illustrates a typical DIGE labeling experiment, as described in Subheadings 3.2 and 3.3.
Many precipitation methods are available; the following is a MeOH/CHCl3 protocol that works well for DIGE, and can be easily performed in 1.5 mL tubes [adapted from (46)]:
1. Bring up predetermined amount of protein extract to 100 μL with water.
2. Add 300 μL (3 volumes) water.
3. Add 400 μL (4 volumes) methanol.
4. Add 100 μL (1 volume) chloroform.
5. Vortex vigorously and centrifuge; the protein precipitate should appear at the interface.
6. Remove the water/MeOH mix on top of the interface, being careful not to disturb the interface. Often the precipitated proteins do not make a visibly white interface, and care should be taken not to disturb the interface.
7. Add another 400 μL methanol to wash the precipitate.
8. Vortex vigorously and centrifuge; the protein precipitate should now pellet to the bottom of the tube.
9. Remove the supernatant and briefly dry the pellets in a vacuum centrifuge.
10. Resuspend the pellets in a suitable amount of CyDye labeling buffer (L-buffer, see Table 1).
An alternative widely used precipitation method is as follows:
1. Add 5 volumes of cold 0.1 M ammonium acetate in methanol.
2. Leave at –20°C for 12 h or overnight.
3. Centrifuge at ∼3000 rpm (1400×g) for 10 min at 4°C and remove the supernatant.
4. A pellet of protein should be visible at this stage.
5. To wash the pellet, add 80% 0.1 M ammonium acetate in methanol and mix to resuspend the protein.
6. Centrifuge at 3000 rpm (1400×g) for 10 min at 4°C and remove the supernatant.
7. To dehydrate the pellet add 80% acetone and resuspend the pellet by mixing.
8. Centrifuge at 3000 rpm (1400×g) for 10 min at 4°C and remove the supernatant.
9. Dry pellet for 15 min by leaving open tube in a laminar flow cabinet.
3.3. DIGE Experimental Design 1. Start with a preliminary gel. All experiments should start with a preliminary gel on representative samples to ensure equivalent protein amounts between samples, and that the highest resolution and sensitivity are obtained before embarking on a multigel DIGE experiment (see Notes 6 and 13). The preliminary gel will also show any problems with the sample preparation that may be corrected by adjusting the procurement methods (see Subheading 3.1). This step can also be used to determine the maximal amount of protein that can be loaded without adversely affecting resolution.
The preliminary gel needs only to test one or two of the samples of a much larger experiment. This gel can simply be stained with a total protein stain (e.g., Sypro Ruby or Deep Purple) to visually inspect the resolution and sensitivity.
Alternatively, the gel can contain two different samples prelabeled with Cy3 and Cy5 and coresolved (see Note 14). 2. Choose a suitable pH gradient for the IEF. Precast IEF strips are commercially available from several vendors. The longest strip length is currently 24 cm, providing the highest resolving power for a given pH range. Medium-range IEF gradients (e.g., pH 4–7) offer the best trade-off between overall resolution and sensitivity. Subsequent experiments can then be designed to resolve proteins in the basic range (pH 7–11) and in narrow pI ranges with commensurate increases in protein loading to gain access to the lower abundant proteins in a given sample (see Note 5). In this way a more comprehensive picture of the proteomes under study can be obtained. 3. Incorporate a pooled-sample mixture internal standard on every DIGE gel in a coordinated experiment. This internal standard, usually labeled with Cy2, is composed of an equal aliquot of every sample in the entire experiment, and therefore represents every protein present across all samples in an experiment. The use of this pooled-sample internal standard on every DIGE gel in a coordinated experiment allows for the facile comparison of independent sample replicates with increased statistical confidence. This experimental design also enables the simultaneous quantitative comparison between multiple variables in a coordinated experiment (Fig. 1). 4. Plan out which samples will be labeled with which dyes ahead of time. For minimal dye labeling chemistry (see Subheading 3.4), each gel will contain two individual samples labeled with either Cy3 or Cy5, and an equal amount of the pooled-sample internal standard. The example outlined in Table 1 is for a two-component comparison repeated in triplicate, with 300 μg total protein loaded onto each of three gels. In this case, 150 μg of each sample should be precipitated (see Subheading 3.2), resuspended in L-buffer and then split 2:1. 
Two-thirds of each sample (100 μg) will be individually labeled with either Cy3 or Cy5. The remaining 1/3 of each sample will be pooled together and labeled with Cy2 to serve as an internal standard. Following this scheme, there will be enough of the Cy2-labeled internal standard to load an amount equal to that of the Cy3- or Cy5-labeled samples onto each gel (see Note 15).
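The 2:1 split described in step 4 generalizes to other protein loads. The sketch below, in which the function name and dictionary keys are illustrative assumptions, computes how much of each sample to precipitate, label individually, and contribute to the pool for a design with two individually labeled samples plus the internal standard on every gel.

```python
def plan_labeling(total_per_gel_ug, samples_per_gel=2):
    """Split each precipitated sample between individual labeling and
    the pooled internal standard.

    Assumes every gel carries `samples_per_gel` individually labeled
    samples plus an equal amount of the pooled standard, and that every
    sample in the experiment contributes equally to the pool.
    """
    # Each dye channel (Cy3 sample, Cy5 sample, Cy2 standard) gets an equal share.
    per_channel_ug = total_per_gel_ug / (samples_per_gel + 1)
    # Each sample supplies its share of one gel's worth of standard.
    to_pool_ug = per_channel_ug / samples_per_gel
    return {
        "precipitate_ug": per_channel_ug + to_pool_ug,
        "label_ug": per_channel_ug,
        "to_pool_ug": to_pool_ug,
    }

# The design in Table 1: 300 μg total protein per gel, three gels, six samples.
plan = plan_labeling(300.0)
n_samples, n_gels = 6, 3
pool_total_ug = n_samples * plan["to_pool_ug"]
pool_per_gel_ug = pool_total_ug / n_gels
```

For the Table 1 design this reproduces the figures in the text: 150 μg precipitated per sample, 100 μg labeled individually, 50 μg sent to the pool, and 100 μg of pooled standard available for each gel.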
3.4. CyDye Labeling All steps are performed on ice. The following protocol is for sample loading via rehydration of IPG strips, and assumes incorporation of a pooled-sample internal standard to coordinate many samples across multiple DIGE gels simultaneously. The steps are summarized in Table 1 (see Note 16). 1. Resuspend precipitated sample in 24 μL labeling (L) buffer. Remove 8 μL (1/3 of sample) and place into a new tube that will contain the pooled-sample internal standard (8 μL from all of the other individual samples will be pooled into this tube) (see Note 17).
2. CyDyes are purchased as dry solids and should be reconstituted to 10× stock solutions (1 nmol/μL) in fresh DMF. Dilute stock solutions of CyDyes 1:10 in fresh DMF to a final working concentration of 100 pmol/μL (see Note 11). 3. Label each sample (50–250 μg) with 2–4 μL (200–400 pmol) of either Cy3 or Cy5 working dilution for 30 min on ice in the dark. Label the pooled-sample mixture with 2–4 μL (200–400 pmol) of Cy2 working dilution for every equivalent amount of sample present in the pooled standard as compared with the individually labeled samples. That is, if 100 μg of each sample is labeled with 200 pmol of Cy3 or Cy5, then 50 μg of each of these samples is present in the pooled standard, and 200 pmol of Cy2 is used for every 100 μg of pooled standard (see Table 1 and Note 18). 4. Quench reactions with 2 μL of 10 mM lysine for 10 min on ice in the dark. 5. For each gel, combine the quenched Cy3- and Cy5-labeled samples and add 1/3 of the quenched Cy2-labeled pooled mixture. 6. To each tripartite mixture, add an equal volume of 2× R-buffer and incubate on ice for 10 min. 2× R-buffer is R-buffer supplemented with an additional 2 mg/mL DTT using the 200 mg/mL DTT stock solution. DTT is omitted from the L-buffer to prevent unfavorable interaction with the CyDyes. Adding an equal volume of 2× R-buffer to the quenched reactions provides the reducing agents to the total reaction volume at a 1× final concentration. 7. Add R-buffer (1× DTT concentration) to the final volume suggested by the manufacturer for the given IPG strip length (e.g., 450 μL for 24 cm strips). Add the appropriate volume of IPG buffer ampholines to 0.5% final (v/v) for IEF. Rehydrate the dehydrated IPG strips for >16 h and then proceed with IEF (see Subheading 3.5.3 and Note 19).
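The volumes in steps 3 to 7 can be checked with a short calculation. The sketch below uses the Table 1 values as defaults; the function names are illustrative, and counting the IPG buffer as part of the final rehydration volume is an assumption made here for simplicity rather than something the protocol specifies.

```python
def dye_pmol(volume_ul, working_pmol_per_ul=100.0):
    """Amount of dye delivered by a volume of the 100 pmol/μL working dilution."""
    return volume_ul * working_pmol_per_ul

def rehydration_mix(n_reactions=3, reaction_ul=20.0, final_ul=450.0,
                    ipg_buffer_fraction=0.005):
    """Volumes needed to bring one tripartite mixture to the strip volume.

    Assumes the IPG buffer (0.5% v/v final) is included within `final_ul`.
    """
    combined_ul = n_reactions * reaction_ul    # quenched Cy2 + Cy3 + Cy5 reactions
    two_x_r_ul = combined_ul                   # equal volume of 2× R-buffer
    ipg_buffer_ul = final_ul * ipg_buffer_fraction
    r_buffer_ul = final_ul - combined_ul - two_x_r_ul - ipg_buffer_ul
    if r_buffer_ul < 0:
        raise ValueError("labeled sample volume exceeds the strip rehydration volume")
    return {
        "combined_ul": combined_ul,
        "two_x_r_ul": two_x_r_ul,
        "ipg_buffer_ul": ipg_buffer_ul,
        "r_buffer_ul": r_buffer_ul,
    }

# Defaults follow Table 1 and a 24 cm strip (450 μL final volume).
mix = rehydration_mix()
cy3_pmol = dye_pmol(2.0)        # per 100 μg individual sample
cy2_pmol = dye_pmol(6.0)        # for the 300 μg pooled standard
```

With these defaults the 60 μL tripartite mixture receives 60 μL of 2× R-buffer, about 2.25 μL of IPG buffer, and about 328 μL of 1× R-buffer, and the Cy2 amount works out to the same 200 pmol per 100 μg used for the individual samples.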
3.5. 2D Gel Electrophoresis and Poststaining As a result of the minimal labeling, quantification with the CyDyes is carried out on only 2–5% of the proteins that are labeled, and the labeled portion of the protein may migrate at a higher apparent molecular mass than the majority of the unlabeled protein due to the added mass and hydrophobicity of the dyes (exacerbated in lower Mr species). To ensure that the maximum amount of protein is excised for subsequent in-gel digestion and MS, minimally labeled 2D DIGE gels are poststained with a total protein stain such as SyproRuby or Deep Purple. Accurate excision is also ensured by preferentially affixing the second dimension gel to a presilanized glass plate during gel casting so that the gel dimensions do not change during the analysis (see Notes 20 and 21). These methods assume the use of the Ettan 2D electrophoresis system (GE Healthcare), but are easily adaptable to other commercially available systems. It also assumes usage of high-resolution 24 cm × 20 cm gels. 1. Special gels for second dimension SDS-PAGE. Using low-fluorescence glass plates, pretreat one plate for each gel with 3–5 mL bind silane working solution,
carefully wiping the entire surface of the plate with a lint-free wipe. Leave treated plates covered with lint-free wipes for several hours to allow for sufficient outgassing of fumes (that may contain bind silane) before assembling gel plates and casting of second dimensional SDS-PAGE gels (see Note 22).
2. Assemble plates and pour 12% homogeneous SDS-PAGE gel(s) using the appropriate amount of 30% stock acrylamide and 4× separating gel buffer for the volumes needed for the number of gels being poured (see Note 23). Overlay the gels with water-saturated butanol for several hours to provide a straight and level surface to place the focused IPG strip (see Note 10).
3. Perform IEF using an IPGphor II IEF unit (GE Healthcare) of the combined tripartite-labeled samples, brought up to final volume with 1× R-buffer and passively rehydrated into IPG strips for >16 h (see Subheading 3.4.7 and Note 24).
4. Equilibrate the focused IPG strips into the second dimensional equilibration buffer. During this step, the cysteine sulfhydryls in the focused proteins are reduced and carbamidomethylated by supplementing the equilibration buffer with 1% DTT for 20 min at room temperature, followed by 2.5% iodoacetamide in fresh equilibration buffer for an additional 20 min room temperature incubation (see Note 25).
5. Place equilibrated IPG strip on top of the SDS-PAGE gels that were precast with low-fluorescence glass plates. Use a thin card or ruler to carefully tamp down the IPG strip to the SDS-PAGE gel, removing air bubbles at the interface (see Notes 26 and 27).
6. Perform second dimensional SDS-PAGE at constant wattage, using 1 W/gel for at least 1 h prior to ramping up to <20 W/gel (see Note 28).
7. CyDye images are acquired using a fluorescence imager, such as the Typhoon 9400 series (GE Healthcare) equipped with lasers and filters that are compatible with the emission/excitation spectra of the dyes. Imaging is performed through the glass plates using the intact gel cassette (see Note 29).
8. After imaging the gels, carefully remove the plate that was untreated with bind silane. The gel will remain stuck to the treated plate and can be stained with an appropriate total protein dye (such as SyproRuby, Deep Purple, or Flamingo Pink) “open-faced” in the fixation/staining solutions. For SyproRuby, fix gels for at least 2 h with fixation solution sufficient to completely cover the gel. Longer fixations are possible without adversely affecting subsequent MS. After removing the fixation solution, stain gels overnight in SyproRuby and acquire images using a fluorescence imager (see Notes 21 and 30).
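The "appropriate amount" of 30% stock acrylamide and 4× separating gel buffer in step 2 follows from simple dilution arithmetic. The sketch below is a hypothetical helper, not part of the protocol: the 100 mL casting volume per gel is an assumed placeholder (use the volume specified for the casting unit), and the small volumes of polymerization catalysts are ignored.

```python
# Sketch of the dilution arithmetic for casting homogeneous SDS-PAGE gels.
# The per-gel casting volume is a placeholder assumption; catalyst volumes
# (e.g., APS and TEMED) are not included.

def separating_gel_mix(total_ml, final_pct=12.0, stock_pct=30.0, buffer_fold=4.0):
    """Return mL of acrylamide stock, separating gel buffer, and water."""
    if not 0 < final_pct <= stock_pct:
        raise ValueError("final_pct must be positive and no more than stock_pct")
    acrylamide_ml = total_ml * final_pct / stock_pct  # C1*V1 = C2*V2
    buffer_ml = total_ml / buffer_fold                # 4x buffer diluted to 1x
    water_ml = total_ml - acrylamide_ml - buffer_ml
    if water_ml < 0:
        raise ValueError("stock and buffer volumes exceed the total volume")
    return {
        "acrylamide_ml": acrylamide_ml,
        "buffer_ml": buffer_ml,
        "water_ml": water_ml,
    }


n_gels = 4
ml_per_gel = 100.0  # hypothetical casting volume per gel
mix = separating_gel_mix(n_gels * ml_per_gel)
for component, volume in mix.items():
    print(f"{component}: {volume:.1f} mL")
```

Preparing a modest excess over the computed total is a common precaution, since some solution is lost in tubing and casting hardware.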
3.6. DIGE Analysis
3.6.1. Software Algorithms
Many bioinformatics tools are commercially available for the comparison of multiple 2D gel-separated protein spot patterns. Some free internet-based utilities (e.g., www.lecb.ncifcrf.gov/flicker/) provide simple alternation between
two spot patterns, whereas most of the commercial products contain proprietary algorithms for protein spot detection, intergel matching, protein spot quantification, and even utilities for building web-based tools for data dissemination. Many include the ability to average replicate patterns into a single virtual pattern to be used in a comparative study. They are all designed to compare multiple spot patterns and quantify abundance changes for individual proteins between experimental conditions.
Several software packages allow for the analysis of DIGE data. The DeCyder suite of software tools was specifically developed to support the DIGE platform when this technology was first marketed by GE Healthcare and is therefore used as an example here. The differential in-gel analysis (DIA) module of DeCyder is used for direct quantification of protein spot volume ratios between the triply codetected signals emanating from each resolved protein, and can be used for the simplest form of a DIGE experiment for pairwise comparisons with N = 1. The more advanced DIGE experiments that use the internal standard to cross-compare replicate samples from pairwise and multivariable analyses (N > 3) are handled by the biological variation analysis (BVA) module of DeCyder. In a BVA experiment, the signals emanating from the internal standard are used both for direct quantification within each DIGE gel in a coordinated set (using the DIA module) and for normalization and protein spot pattern matching between gels (see Note 31). This allows for the calculation of Student’s t-test and ANOVA statistics for individual abundance changes (see Subheading 3.6.2 and Table 2). BVA is also used to match patterns between SyproRuby- and CyDye-stained images to facilitate protein excision for subsequent MS (see Notes 20, 21, and 30).
3.6.2. Experimental Design and Statistical Confidence
In the simplest form of a DIGE experiment, two or three samples are separately labeled with one of the three dyes and separated in the same gel for direct pairwise comparisons. In this case, the software first normalizes the entire signal for each CyDye channel and then calculates the protein spot volume ratio for each protein pair. A normal distribution is modeled over the actual distribution of protein pair volume ratios, and two standard deviations from the mean of this normal distribution represent the 95% confidence level for significant abundance changes. This N = 1 type of experiment has limited statistical power, since the 95% confidence interval is determined based on the overall distribution of changes within the population (see Note 32). Many more changes in abundance of much lesser magnitude can be detected with much greater statistical confidence (Student’s t-test and ANOVA, Table 2) by incorporating independent
Table 2
Statistical Applications of DeCyder Biological Variation Analysis and Extended Data Analysis (EDA) Modules
Average ratio: Calculated for each protein spot feature between two groups or experimental conditions. Derived from the log standardized protein abundance changes that were directly quantified within each DIGE gel relative to the internal standard for the protein spot feature.
Student’s t-test: Univariate test of statistical significance for an abundance change between two groups or experimental conditions. p-values reflect the probability that the observed change has occurred due to stochastic chance alone. With DIGE, p-values of <0.01 are often observed. The test assumes normal distributions of protein abundance and can be performed either unpaired or paired.
One-way ANOVA: Tests for differences in standardized abundance of a given protein across all groups of a multicomponent analysis. Indicates that one group is significantly different from another in the group.
Two-way ANOVA: Tests for differences in standardized abundance of a given protein between multiple groups with the same condition, where multiple conditions are analyzed.
Principal component analysis (EDA only): Reduces the dimension of the variables in a multidimensional space. The first principal component (PC1) divides the dataset along an axis describing the most variance in a system, with the orthogonal second component (PC2) accounting for the second greatest source of variation.
Hierarchical clustering (EDA only): Compares groups based on similarity of the collective expression patterns of individual proteins, often displayed in an expression matrix (heatmap). Similarity between groups is proportional to the lateral distance depicted as a branched dendrogram.
K-means (EDA only): Used to classify proteins into a predefined number of bins based on similarity.
Self organizing maps (EDA only): Similar to K-means, but also clusters nearest neighbors (based on expression patterns) in a two-dimensional map.
Gene shaving (EDA only): Used to identify groups of proteins that have similar expression profiles. Unlike K-means, proteins can belong to more than one group provided there is high coherence within each group.
Discriminant analysis (EDA only): Identifies proteins that can discriminate between groups based on a variety of classifier schemes, including cross-validation, feature selection, partial least squares, and K-nearest neighbors.
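To make the first two rows of Table 2 concrete, the following sketch computes a standardized log abundance, an average ratio, and an unpaired Student's t statistic for one protein spot. It is an illustration under stated assumptions rather than a description of the DeCyder algorithms: the function names are hypothetical, base-10 logarithms are assumed, and the example values are invented. Converting the t statistic to a p-value requires the t distribution (available, for example, through SciPy's `scipy.stats.ttest_ind`).

```python
# Sketch of the univariate statistics in Table 2 for a single protein spot.
# Not the DeCyder implementation; log base and names are assumptions.
import math
from statistics import mean, variance


def standardized_log_abundance(sample_volume, standard_volume):
    """log10 of a spot volume relative to the co-resolved internal standard."""
    return math.log10(sample_volume / standard_volume)


def average_ratio(log_a, log_b):
    """Fold change of group B over group A from log10 standardized abundances."""
    return 10 ** (mean(log_b) - mean(log_a))


def student_t(log_a, log_b):
    """Unpaired Student's t statistic (pooled variance) and degrees of freedom.

    Requires at least two values per group and nonzero pooled variance.
    """
    n_a, n_b = len(log_a), len(log_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(log_a) + (n_b - 1) * variance(log_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(log_b) - mean(log_a)) / std_err, df


# Invented log10 standardized abundances for one spot, four gels per group.
control = [0.02, -0.01, 0.03, 0.00]
treated = [0.31, 0.28, 0.35, 0.30]
ratio = average_ratio(control, treated)
t_stat, dof = student_t(control, treated)
print(f"average ratio = {ratio:.2f}, t = {t_stat:.2f}, df = {dof}")
```

Because every value is expressed relative to the internal standard run on the same gel, the groups can be compared across gels without the spot volumes themselves being directly comparable.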
replicate samples into the experiment (see Note 33). The number of replicates required in a study depends on the amount of variation in the system being investigated. Increasing the number of replicates will increase confidence in smaller changes in expression. The number of gel replicates that are needed for the experiment to have sufficient sensitivity to detect expression changes can be determined using power calculations (for example, see (19)). With replicate samples, the Student’s t-test and ANOVA statistics measure the significance of the variation of a specific protein change, independent of the overall distribution of abundance changes in the population. Incorporating replicate samples into the experimental design also controls for unexpected variation introduced into the samples during sample preparation. This design not only allows for the identification of abundance changes that are consistent across multiple replicates of an experiment, but can also identify significant abundance changes that would not have been identified even if the analyses were performed using Cy3- and Cy5-labeled samples on the same gels, but without the pooled-sample internal standard to coordinate them (14).
3.6.3. Multivariate Statistical Analysis
Univariate analyses such as the Student’s t-test and ANOVA have traditionally been used in DIGE experiments to provide a list of statistically significant changes in protein abundance. The application of multivariate statistical analyses (as outlined in Subheading 1.2.4) allows for the assessment of changes on a global scale, and can bring added insight to the usual “list of proteins” generated. Most software packages allow for the export of raw and normalized protein spot volumes to allow for these additional statistical tests and data manipulations; in addition, the DeCyder suite of software tools now provides an Extended Data Analysis (EDA) module that includes many of these tools (Table 2).
These tools are now becoming more evident in recent DIGE publications (10,24,28,29,30,32,52). Although these multivariate analyses are especially beneficial when analyzing a DIGE experiment that contains three or more conditions, they can also be useful in two-condition comparisons to detect sample outliers, fouled samples, or even poor experimental design. Figure 2 illustrates an example of PCA applied to a DIGE dataset composed of four experimental conditions, each measured in quadruplicate. PCA simplifies multidimensional datasets by reducing the variation down to the two or three most significant sources of variation. In this example, the first principal component (PC1) accounts for 62.3% of the variation amongst 156 proteins of interest, with the second principal component (PC2) accounting for an additional 12.5% of the variation. Each sample datapoint describes the collective expression profile for the subset of 156 proteins, and PC1 and PC2 orthogonally
divide the samples into quadrants based on these two largest sources of variation within the DIGE dataset. In this case, 75% of the variance between these proteins clusters the samples into the proper categories (adapted from (24)).
Figure 3 is taken from a 2D DIGE study, which determined the change in protein abundance in mouse liver over a 24 h period. In this study, proteins were harvested from groups of mice on a second cycle after transfer from synchronized (12 h light:12 h dim red light) to free running conditions (constant dim red light). Proteins were extracted from each liver and pooled from six mice per 4-h time point. HC (by average distance correlation) was used to investigate the expression of 49 novel circadian proteins. This gave a range of phase groups with 10 proteins peaking during the subjective day and 39 proteins distributed between two clusters, which were most abundant during the subjective night (adapted from (32)).
Finally, additional information may be gleaned by mapping proteins found to be changing by DIGE to existing biological pathways and networks. Many software solutions and services are becoming available for this type of extended analysis (e.g., KEGG pathways, Ingenuity Pathways Analysis, WebGestalt, DeCyder EDA). Although additional validation is necessary to establish biological significance, the mapping of members of a “list of proteins” to established pathways and networks can provide validating support for the proteins observed by DIGE alone. In some cases, it can also indicate potential proteins associated with the biological question that were not accessible in the DIGE analysis. For example, Friedman et al. (10) recently reported the use of network/pathway mapping for proteins found by DIGE/MS in MCF10A cells overexpressing the HER2 receptor after treatment with TGF-β.
The majority of proteins identified with DIGE/MS mapped to a network of pathways involving TGF-β as a major hub, but also included an intercalating pathway involving p53 that affected many proteins that were independently identified in the DIGE/MS experiments. This insight linking new players to those identified with DIGE/MS led to the further investigation of a direct role for p53 in the expression of the tumor suppressor maspin (53).
4. Notes
1. 2DE has traditionally been a popular method for differential display proteomics on a global scale, but until recently, these strategies lacked the ability to directly quantify abundance changes in the same fashion as in stable isotope LC/MS-based strategies (2,3,4). This has been mainly due to the inability to directly correlate migration patterns and protein staining between gel separations (gel-to-gel variation). Stable isotopes have been used in gel-based proteomics as well, whereby different proteomes have been separately labeled with different stable isotopes (e.g., growing cells using 14N vs. 15N-labeled medium) prior to
mixing and running together through the same 2DE separation (5). In this case, abundance changes can be monitored during the mass spectrometry (MS) stage on individual proteins, but this requires in-gel digestion and MS on every protein present to discover the subset of proteins that is changing.
2. Both hydrophobicity and molecular weight influence how proteins migrate during SDS-PAGE, yielding information on apparent molecular mass.
3. In comparison, commonly used silver or colloidal coomassie blue (ca. 5–10 ng sensitivity) stains typically exhibit a dynamic range of less than two orders of magnitude (8,9).
4. The CyDye labeling system is compatible with the downstream processing commonly used to identify proteins via MS and database interrogation, which involves the generation of tryptic peptides within excised gel plugs. Trypsin cleaves the peptide bonds on the C-terminal side of lysine and arginine residues, but peptide generation is mostly unhindered as so few lysine residues are modified by dye labeling.
5. DIGE experiments can still be performed using the internal standard methodology with only two CyDyes, but twice as many gels are required to analyze the same number of samples compared with the three-dye minimal labeling scheme. With saturation labeling, one dye is used to label the internal standard, and the other is used to label individual samples. A dye-swap scheme is not necessary in this case because the individual samples are always labeled with the same CyDye.
6. Hydroxyethyl disulfide (commercially available as “DeStreak reagent”), combined with anodic cup loading, should be used for enhanced resolution for IEF above pH 8 (11).
7. Running every DIGE gel with the maximal amount of protein (without adversely affecting first dimension resolution) not only enables detection of lower abundance proteins, but also provides more material for subsequent protein identification using MS.
This makes every gel in a coordinated DIGE experiment a “pick-able” gel, without the need to run subsequent preparative gels with increased protein load that then have to be carefully matched to a lower abundant, analytical gel. When combined with narrow range IEF, maximizing the protein amount also allows interrogation of the lower abundant proteins in a sample.
8. If one sample within a study has very skewed protein distributions compared with others, then many of the “novel proteins” within this sample will effectively be diluted out in the pool. Such a sample outlier can be easily identified using the multivariate statistical analyses described. Repetition not only enables the identification of subtle differences with statistical confidence, it is also vital to control for nonbiological variation. In most cases biological variation will outweigh technical variation; therefore, only biological replicates are necessary. Thus it is important that each replicate sample is derived from an independent experiment, ideally performed on different occasions and perhaps using different batches of medium. The independent samples can then be
analyzed coordinately using the pooled-sample internal standard methodology. See Table 1 for an example of this design.
9. All solutions should be prepared using water that has a resistivity of 18.2 MΩ·cm; this is referred to as “water” throughout the text.
10. Mix equal parts of butanol and water and shake vigorously. Let the two phases separate overnight, and use the butanol phase for overlay. Butanol that is not completely water saturated can extract water from the top of the gel. A more recent improvement is to use a 0.1% SDS solution in a conventional spray bottle, used to carefully spray a fine mist over the top of the gels to thoroughly cover the top of the gel (the gel/overlay interface will not be as obvious).
11. DMF can degrade, producing amines, which can react with the NHS-ester CyDyes. DMF stocks should be kept fresh (<3 months) and anhydrous to ensure optimal labeling conditions.
12. For buffers that contain urea, care should be taken to ensure the urea is fresh and free of the natural breakdown product isocyanate, which will carbamylate free amines and thereby neutralize the protonatable epsilon-amine groups of lysine residues. This is problematic for several reasons, the foremost being that it gives rise to artificial charge train isoforms in the first dimension IEF. Heating samples above 37°C should also be avoided, as this facilitates the conversion to isocyanate. Any buffer component that contains a primary amine, such as pharmalytes, ampholytes, HEPES buffers, etc., should be avoided as these components may react with the CyDyes, thus reducing their effective concentration.
13. For example, 500 μg of material may be loaded onto a pH 4–7 24 cm IPG strip, but the overall distribution of proteins in the sample, as well as a sometimes unusually high abundance of a subset of proteins, may result in much less material actually resolving between the electrodes.
A good rule to follow is to load the desired amount based on the protein concentrations, and then adjust the load by eye as necessary.
14. This is DIGE in its most simplistic form, and can show differences between the samples without interference from gel-to-gel variation, but provides limited statistical power to help distinguish true biological variation from background such as artificial noise introduced during sample preparation.
15. Employing a dye-swapping approach will control for any dye-specific effect that may result from preferential labeling or different fluorescence characteristics of acrylamide at the different wavelengths of excitation for Cy2, Cy3, and Cy5, especially at low protein spot volumes. This is easily incorporated into any DIGE analysis where repetitive samples are used (along with the internal standard to compare across multiple DIGE gels).
16. For saturation chemistry, general methods and considerations are the same as for the minimal chemistry, but there are several unique features to also consider for the saturation chemistry. First, careful optimization of the labeling conditions must be carried out for each new sample set to ensure complete reduction of cysteine residues. Insufficient labeling will lead to multiple spots in the second
dimension due to MW and hydrophobicity shifts. Overlabeling results in side reactions with the epsilon-amine groups of lysine side chains, but since the maleimide dyes do not carry compensatory charge, this results in the overall loss of a charge, which creates a series of isoelectric forms in the first dimension (“charge trains”). Labeling buffer should not contain any components with free thiols, as these will react with the satCyDyes.
17. L-buffer volume can be increased if necessary for complete resolubilization, although 100–250 μg or more should resolubilize readily in this volume. The volume of labeling buffer used for resolubilization should not exceed 40 μL per sample when using cup loading for sample entry to ensure that the final volumes will not exceed the capacity of the cup loading (ca. 100–150 μL).
18. These methods are provided assuming that all gels to be run will be used both for analytical (quantification) as well as preparative (providing material for subsequent MS) purposes. Current recommendations from the manufacturer are to label 50 μg of sample with 400 pmol CyDye. A sufficient amount of unlabeled sample can be added to the quenched reactions to achieve final protein amounts to facilitate subsequent MS. Alternatively, many have found that the ratios can be adjusted to label increasing amounts of sample (up to 200 μg with 200 pmol dye) without adversely affecting the overall labeling reaction (presented here).
19. If samples are to be introduced using anodic cup loading, simply bring this mixture up to 100 μL in R-buffer and proceed with cup loading. R-buffer can always be supplemented with additional DTT using the 200 mg/mL DTT stock solution. In the presence of DeStreak reagent for focusing in pH ranges above pH 8, the addition of an equal volume of 1× R-buffer should provide a sufficient amount of DTT without interfering with the DeStreak reagent.
20. Comparison of minimally labeled protein 2D maps with unlabeled protein maps is generally not a problem, as the addition of only one dye molecule does not generally prevent the facile matching of small alterations in protein mobility between the 2–5% of labeled protein and the remaining unlabeled protein that will provide enough material for MS. Poststaining is not necessary with saturation DIGE, since an unlabeled population with potentially different migration characteristics will not exist.
21. This treatment binds the gel to one of the glass plates and therefore prevents shrinking/swelling during the poststaining and protein excision processes, thereby facilitating accurate robotic protein excision.
22. Nothing should be placed on top of wipes that are covering bind silane-treated plates, as this may leave impressions that are detected during the scanning phase. Assembly and casting too soon may create a binding surface on the opposite glass plate, preventing the gel from being subsequently poststained and picked. Automated protein excision can be facilitated for certain systems by placing fluorescent alignment reference targets on the plate, which can be performed at this stage.
23. A stacking gel is not required for 2D gel electrophoresis, as the proteins are effectively “stacked” to the height of the IPG strip. SDS is also not essential in the separating gel, as the SDS associated with the proteins during the equilibration
step, and present in the running buffer, is sufficient (although many traditionally use it in the separating gel). Using 2× concentration running buffer in the upper buffer chamber can produce higher quality separations in some circumstances.
24. Samples of similar nature should always be focused simultaneously for optimal reproducibility. Focusing programs vary for some pH gradients. A typical program for many ranges is 500 V for 500 V-h, stepping to 1000 V for 1000 V-h, followed by a final step to 8000 V until >50 V-h has been reached. Check recommendations from specific vendors.
25. Volume of equilibration buffer should be large to ensure sufficient removal of ampholines and other components of the first dimensional run.
26. Carefully wash out any remaining liquid on top of the SDS-PAGE gel. Prewet the IPG strip with 1× running buffer and place the strip between the gel plates with the plastic backing adhering to the inside surface of one of the glass plates. The prewetted running buffer will facilitate the manipulation of the IPG strip down the inside surface of the plate and on top of the SDS-PAGE gel.
27. An agarose overlay, used by many protocols, is not absolutely necessary to ensure proper contact between the IPG strip and the second dimensional SDS-PAGE gel. Using a thin card or ruler to carefully tamp down the IPG strip to the gel is usually sufficient and removes the added problems associated with the overlay, such as trapped air bubbles in the solidified agarose.
28. Running gels at less than 1 W/gel can improve resolution in the high molecular weight regions of the second dimension gel. Use wattage appropriate for the second dimensional unit being used. Many different gel units can accommodate increased power by compensating for the increased heat.
29. Absorption/emission maxima in DMF are 491/506 nm for Cy2, 553/572 nm for Cy3, and 648/669 nm for Cy5, although care must be taken to scan in regions of each spectrum that do not contain absorbance or emission in the other spectra, which may mean using a nonmaximal region of a given spectrum.
30. Comparison of the 2D spot maps between saturation-labeled samples and minimally labeled or unlabeled samples is impossible, as proteins containing multiple cysteine residues may appear as significantly larger Mr species when labeled with the saturation dyes, which of course cannot be predicted without first knowing the protein identity.
31. Almost all software packages for 2D electrophoresis involve matching of protein spot patterns between gels. For DeCyder, it is used in the BVA module to match the quantitative data obtained from the triply coresolved protein signals from each gel in the DIA module (where gel-to-gel variation does not come into play). Manual verification of the matching is almost always required with any software package.
32. There are many “all-or-none” types of experiments where the single gel comparison may be valid, and subtle changes are not expected. Nevertheless, using independent replicates and the pooled-sample internal standard methodology is still needed to control for nonbiological sample preparation error.
33. The multigel approach allows many data points to be collected for each group to be compared. Spots of interest can be selected by looking for significant change across the groups. Student’s t-test and ANOVA probability scores (p) indicate the probability that the observed change occurred due to stochastic, random events (null hypothesis). Probability values <0.05 are traditionally used to determine a statistically significant difference from the null hypothesis. As this represents 50 potential false positives for 1000 resolved proteins, confidence intervals within the 99th percentile (p < 0.01) are arguably more valid, and can be attained using DIGE (10,14,24,47,48,49,50,51).
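The false-positive arithmetic in Note 33 can be sketched as follows. This is a simplified illustration with a hypothetical function name; it assumes the worst case in which no protein is truly changing, so that every significant result is a false positive.

```python
# Sketch of the expected false-positive count at a given p-value threshold.
# Assumes every tested spot is unchanged (all null hypotheses true).

def expected_false_positives(n_spots, alpha):
    """Expected number of spots passing threshold alpha by chance alone."""
    return n_spots * alpha


n_resolved = 1000  # resolved protein spots, as in the example in Note 33
for alpha in (0.05, 0.01):
    count = expected_false_positives(n_resolved, alpha)
    print(f"p < {alpha}: roughly {count:.0f} chance findings in {n_resolved} spots")
```

Tightening the threshold from 0.05 to 0.01 reduces the expected chance findings fivefold, which is the basis for preferring the stricter cutoff when many spots are tested at once.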
References 1. Petricoin, E., Wulfkuhle, J., Espina, V. and Liotta, L.A. (2004) Clinical proteomics: revolutionizing disease detection and patient tailoring therapy. J Proteome Res 3(2):209–17. 2. Gygi, S.P., Rist, B., Gerber, S.A., Turecek, F., Gelb, M.H. and Aebersold, R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10):994–99. 3. Mason, D.E. and Liebler, D.C. (2003) Quantitative analysis of modified proteins by LC-MS/MS of peptides labeled with phenyl isocyanate. J Proteome Res 2(3):265– 72. 4. Ross, P.L., Huang, Y.N., Marchese, J.N., Williamson, B., Parker, K., Hattan, S., Khainovski, N., Pillai, S., Dey, S., Daniels, S., Purkayastha, S., Juhasz, P., Martin, S., Bartlet-Jones, M., He, F., Jacobson, A. and Pappin, D.J. (2004) Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 3(12):1154–69. Epub 2004 Sep 22. 5. Vogt, J.A., Schroer, K., Holzer, K., Hunzinger, C., Klemm, M., Biefang-Arndt, K., Schillo, S., Cahill, M.A., Schrattenholz, A., Matthies, H. and Stegmann, W. (2003) Protein abundance quantification in embryonic stem cells using incomplete metabolic labelling with 15N amino acids, matrix-assisted laser desorption/ionisation timeof-flight mass spectrometry, and analysis of relative isotopologue abundances of peptides. Rapid Commun Mass Spectrom 17(12):1273–82. 6. Gorg, A., Obermaier, C., Boguth, G., Harder, A., Scheibe, B., Wildgruber, R. and Weiss, W. (2000) The current state of two-dimensional electrophoresis with immobilized pH gradients. Electrophoresis 21(6):1037–53. 7. Gorg, A., Postel, W., Domscheit, A. and Gunther, S. (1988) Two-dimensional electrophoresis with immobilized pH gradients of leaf proteins from barley (Hordeum vulgare): method, reproducibility and genetic aspects. Electrophoresis 9(11):681–92. 8. 
Tonge, R., Shaw, J., Middleton, B., Rowlinson, R., Rayner, S., Young, J., Pognan, F., Hawkins, E., Currie, I. and Davison, M. (2001) Validation and development of fluorescence two-dimensional differential gel electrophoresis proteomics technology. Proteomics 1(3):377–96.
9. Lilley, K.S., Razzaq, A. and Dupree, P. (2002) Two-dimensional gel electrophoresis: recent advances in sample preparation, detection and quantitation. Curr Opin Chem Biol 6(1):46–50. 10. Friedman, D.B., Wang, S.E., Whitwell, C.W., Caprioli, R.M. and Arteaga, C.L. (2007) Multi-variable difference gel electrophoresis and mass spectrometry: A case study on TGF-beta and ErbB2 signaling. Mol Cell Proteomics 6:150–69. 11. Olsson, I., Larsson, K., Palmgren, R. and Bjellqvist, B. (2002) Organic disulfides as a means to generate streak-free two-dimensional maps with narrow range basic immobilized pH gradient strips as first dimension. Proteomics 2(11):1630–32. 12. Wolters, D.A., Washburn, M.P. and Yates, J.R. 3rd (2001) An automated multidimensional protein identification technology for shotgun proteomics. Anal Chem 73(23):5683–90. 13. Alban, A., David, S.O., Bjorkesten, L., Andersson, C., Sloge, E., Lewis, S. and Currie, I. (2003) A novel experimental design for comparative two-dimensional gel analysis: two-dimensional difference gel electrophoresis incorporating a pooled internal standard. Proteomics 3(1):36–44. 14. Friedman, D.B., Hill, S., Keller, J.W., Merchant, N.B., Levy, S.E., Coffey, R.J. and Caprioli, R.M. (2004) Proteome analysis of human colon cancer by twodimensional difference gel electrophoresis and mass spectrometry. Proteomics 4(3):793–811. 15. Gerbasi, V.R., Weaver, C.M., Hill, S., Friedman, D.B. and Link, A.J. (2004) Yeast Asc1p and mammalian RACK1 are functionally orthologous core 40S ribosomal proteins that repress gene expression. Mol Cell Biol 24(18):8276–87. 16. Sitek, B., Luttges, J., Marcus, K., Kloppel, G., Schmiegel, W., Meyer, H.E., Hahn, S.A. and Stuhler, K. (2005) Application of fluorescence difference gel electrophoresis saturation labelling for the analysis of microdissected precursor lesions of pancreatic ductal adenocarcinoma. Proteomics 5(10):2665–79. 17. Hu, Y., Malone, J.P., Fagan, A.M., Townsend, R.R. and Holtzman, D.M. 
(2005) Comparative proteomic analysis of intra- and interindividual variation in human cerebrospinal fluid. Mol Cell Proteomics 4(12):2000–9. 18. Zhang, X., Guo, Y., Song, Y., Sun, W., Yu, C., Zhao, X., Wang, H., Jiang, H., Li, Y., Qian, X., Jiang, Y. and He, F. (2006) Proteomic analysis of individual variation in normal livers of human beings using difference gel electrophoresis. Proteomics 6(19):5260–68. 19. Karp, N.A., Spencer, M., Lindsay, H., O’Dell, K. and Lilley, K.S. (2005) Impact of replicate types on proteomic expression analysis. J Proteome Res 4(5): 1867–71. 20. Meunier, B., Dumas, E., Piec, I., Bechet, D., Hebraud, M. and Hocquette, J.F. (2007) Assessment of hierarchical clustering methodologies for proteomic data mining. J Proteome Res 6(1):358–66. 21. Fodor, I.K., Nelson, D.O., Alegria-Hartman, M., Robbins, K., Langlois, R.G., Turteltaub, K.W., Corzett, T.H. and McCutchen-Maloney, S.L. (2005) Statistical challenges in the analysis of two-dimensional difference gel electrophoresis experiments using DeCyder. Bioinformatics 21(19):3733–40.
122
Friedman and Lilley
22. Karp, N., Kreil, D. and Lilley, K. (2004) Determining a significant change in protein expression with DeCyderTM during a pair-wise comparison using twodimensional difference gel electrophoresis. Proteomics 4(5):1421–32. 23. Kreil, D., Karp, N. and Lilley, K. (2004) DNA microarray normalization methods can remove bias from differential protein expression analysis of 2-D difference gel electrophoresis results. Bioinformatics 20(13):2026–34. 24. Friedman, D.B., Stauff, D.L., Pishchany, G., Whitwell, C.W., Torres, V.J. and Skaar, E.P. (2006) Staphylococcus aureus redirects central metabolism to increase iron availability. PLoS Pathog 2(8):e87. 25. Fujii, K., Kondo, T., Yamada, M., Iwatsuki, K. and Hirohashi, S. (2006) Toward a comprehensive quantitative proteome database: protein expression map of lymphoid neoplasms by 2-D DIGE and MS. Proteomics 3:3. 26. Fujii, K., Kondo, T., Yokoo, H., Yamada, T., Matsuno, Y., Iwatsuki, K. and Hirohashi, S. (2005) Protein expression pattern distinguishes different lymphoid neoplasms. Proteomics 5(16):4274–86. 27. Karp, N.A., Griffin, J.L. and Lilley, K.S. (2005) Application of partial least squares discriminant analysis to two-dimensional difference gel studies in expression proteomics. Proteomics 5(1):81–90. 28. Seike, M., Kondo, T., Fujii, K., Yamada, T., Gemma, A., Kudoh, S. and Hirohashi, S. (2004) Proteomic signature of human cancer cells. Proteomics 4(9):2776–88. 29. Suehara, Y., Kondo, T., Fujii, K., Hasegawa, T., Kawai, A., Seki, K., Beppu, Y., Nishimura, T., Kurosawa, H. and Hirohashi, S. (2006) Proteomic signatures corresponding to histological classification and grading of soft-tissue sarcomas. Proteomics 6(15):4402–09. 30. Hatakeyama, H., Kondo, T., Fujii, K., Nakanishi, Y., Kato, H., Fukuda, S. and Hirohashi, S. (2006) Protein clusters associated with carcinogenesis, histological differentiation and nodal metastasis in esophageal cancer. Proteomics 6(23): 6300–16. 31. 
Verhoeckx, K.C., Gaspari, M., Bijlsma, S., van der Greef, J., Witkamp, R.F., Doornbos, R.P. and Rodenburg, R.J. (2005) In search of secreted protein biomarkers for the anti-inflammatory effect of beta2-adrenergic receptor agonists: application of DIGE technology in combination with multivariate and univariate data analysis tools. J Proteome Res 4(6):2015–23. 32. Reddy, A.B., Karp, N.A., Maywood, E.S., Sage, E.A., Deery, M., O’Neill, J.S., Wong, G.K., Chesham, J., Odell, M., Lilley, K.S., Kyriacou, C.P. and Hastings, M.H. (2006) Circadian orchestration of the hepatic proteome. Curr Biol 16(11):1107–15. 33. Lee, I.N., Chen, C.H., Sheu, J.C., Lee, H.S., Huang, G.T., Yu, C.Y., Lu, F.J. and Chow, L.P. (2005) Identification of human hepatocellular carcinomarelated biomarkers by two-dimensional difference gel electrophoresis and mass spectrometry. J Proteome Res 4(6):2062–69. 34. Liang, C.R., Leow, C.K., Neo, J.C., Tan, G.S., Lo, S.L., Lim, J.W., Seow, T.K., Lai, P.B. and Chung, M.C. (2005) Proteome analysis of human hepatocellular
Optimizing DIGE Technology
35. 36.
37.
38.
39.
40.
41.
42.
43.
44.
45.
123
carcinoma tissues by two-dimensional difference gel electrophoresis and mass spectrometry. Proteomics 5(8):2258–71. Nabetani, T., Tabuse, Y., Tsugita, A. and Shoda, J. (2005) Proteomic analysis of livers of patients with primary hepatolithiasis. Proteomics 5(4):1043–61. Huang, H.L., Stasyk, T., Morandell, S., Dieplinger, H., Falkensammer, G., Griesmacher, A., Mogg, M., Schreiber, M., Feuerstein, I., Huck, C.W., Stecher, G., Bonn, G.K. and Huber, L.A. (2006) Biomarker discovery in breast cancer serum using 2-D differential gel electrophoresis/ MALDI-TOF/TOF and data validation by routine clinical assays. Electrophoresis 27(8):1641–50. Somiari, R.I., Sullivan, A., Russell, S., Somiari, S., Hu, H., Jordan, R., George, A., Katenhusen, R., Buchowiecka, A., Arciero, C., Brzeski, H., Hooke, J. and Shriver, C. (2003) High-throughput proteomic analysis of human infiltrating ductal carcinoma of the breast. Proteomics 3(10):1863–73. Nishimori, T., Tomonaga, T., Matsushita, K., Oh-Ishi, M., Kodera, Y., Maeda, T., Nomura, F., Matsubara, H., Shimada, H. and Ochiai, T. (2006) Proteomic analysis of primary esophageal squamous cell carcinoma reveals downregulation of a cell adhesion protein, periplakin. Proteomics 6(3):1011–18. Zhou, G., Li, H., DeCamp, D., Chen, S., Shu, H., Gong, Y., Flaig, M., Gillespie, J.W., Hu, N., Taylor, P.R., Emmert-Buck, M.R., Liotta, L.A., Petricoin, E.F. 3rd and Zhao, Y. (2002) 2D differential in-gel electrophoresis for the identification of esophageal scans cell cancer-specific protein markers. Mol Cell Proteomics 1(2):117–24. Yu, K.H., Rustgi, A.K. and Blair, I.A. (2005) Characterization of proteins in human pancreatic cancer serum using differential gel electrophoresis and tandem mass spectrometry. J Proteome Res 4(5):1742–51. Wan, J., Sun, W., Li, X., Ying, W., Dai, J., Kuai, X., Wei, H., Gao, X., Zhu, Y., Jiang, Y., Qian, X. and He, F. 
(2006) Inflammation inhibitors were remarkably upregulated in plasma of severe acute respiratory syndrome patients at progressive phase. Proteomics 6(9):2886–94. Greengauz-Roberts, O., Stoppler, H., Nomura, S., Yamaguchi, H., Goldenring, J.R., Podolsky, R.H., Lee, J.R. and Dynan, W.S. (2005) Saturation labeling with cysteine-reactive cyanine fluorescent dyes provides increased sensitivity for protein expression profiling of laser-microdissected clinical specimens. Proteomics 5(7):1746–57. Kondo, T., Seike, M., Mori, Y., Fujii, K., Yamada, T. and Hirohashi, S. (2003) Application of sensitive fluorescent dyes in linkage of laser microdissection and two-dimensional gel electrophoresis as a cancer proteomic study tool. Proteomics 3(9):1758–66. Sitek, B., Potthoff, S., Schulenborg, T., Stegbauer, J., Vinke, T., Rump, L.C., Meyer, H.E., Vonend, O. and Stuhler, K. (2006) Novel approaches to analyse glomerular proteins from smallest scale murine and human samples using DIGE saturation labelling. Proteomics 3:3. Tetu, B., Lacasse, B., Bouchard, H.L., Lagace, R., Huot, J. and Landry, J. (1992) Prognostic influence of HSP-27 expression in malignant fibrous histiocytoma:
124
46.
47.
48.
49.
50.
51.
52.
53.
Friedman and Lilley
a clinicopathological and immunohistochemical study. Cancer Res 52(8): 2325–28. Wessel, D. and Flugge, U.I. (1984) A method for the quantitative recovery of protein in dilute solution in the presence of detergents and lipids. Anal Biochem 138(1):141–43. Knowles, M.R., Cervino, S., Skynner, H.A., Hunt, S.P., de Felipe, C., Salim, K., Meneses-Lorente, G., McAllister, G. and Guest, P.C. (2003) Multiplex proteomic analysis by two-dimensional differential in-gel electrophoresis. Proteomics 3:1162–71. Prabakaran, S., Swatton, J.E., Ryan, M.M., Huffaker, S.J., Huang, J.J., Griffin, J.L., Wayland, M., Freeman, T., Dudbridge, F., Lilley, K.S., Karp, N.A., Hester, S., Tkachev, D., Mimmack, M.L., Yolken, R.H., Webster, M.J., Torrey, E.F. and Bahn, S. (2004) Mitochondrial dysfunction in schizophrenia: evidence for compromised brain metabolism and oxidative stress. Mol Psychiatry 9(7):684–97. Wang, D., Jensen, R., Gendeh, G., Williams, K. and Pallavicini, M.G. (2004) Proteome and transcriptome analysis of retinoic acid-induced differentiation of human acute promyelocytic leukemia cells, NB4. J Proteome Res 3(3):627–35. Zhang, W. and Chait, B.T. (2000) ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal Chem 72(11):2482–89. Zhang, Y.Q., Matthies, H.J., Mancuso, J., Andrews, H.K., Woodruff, E. 3rd, Friedman, D. and Broadie, K. (2004) The Drosophila fragile X-related gene regulates axoneme differentiation during spermatogenesis. Dev Biol 270(2): 290–307. Yokoo, H., Kondo, T., Fujii, K., Yamada, T., Todo, S. and Hirohashi, S. (2004) Proteomic signature corresponding to alpha fetoprotein expression in liver cancer cells. Hepatology 40(3):609–17. Wang, S.E., Narasanna, A., Whitell, C.W., Wu, F.Y., Friedman, D.B. and Arteaga, C.L. (2007) Convergence of P53 and TGFbeta signaling on activating expression of the tumor suppressor gene maspin in mammary epithelial cells. J Biol Chem 4:4.
7
MALDI/SELDI Protein Profiling of Serum for the Identification of Cancer Biomarkers
Lisa H. Cazares, Jose I. Diaz, Rick R. Drake, and O. John Semmes
Summary
The ability to visualize the full depth of the serum proteome in a high-throughput manner is a major goal of clinical proteomics. Methodologies that combine higher throughput with the ability to observe differential protein expression levels have been applied to this goal. An example of such a system is the coupling of robotic sample processing to matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS). Within this paradigm is a modification of MALDI-TOF termed surface-enhanced laser desorption/ionization-TOF (SELDI-TOF). Both conventional MALDI and SELDI have been used to generate protein expression profiles that reflect potential peptide changes in serum. This information can be used to identify proteins that may enable new diagnostic and therapeutic strategies.
Key Words: matrix-assisted laser desorption ionization; surface-enhanced laser desorption ionization; mass spectrometry; protein profiling; proteomics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ

1. Introduction
Mining the serum proteome for the discovery of new biomarkers is a major goal of many clinical proteomics efforts. Surface-enhanced laser desorption/ionization (SELDI) and matrix-assisted laser desorption/ionization (MALDI) have been used extensively for protein profiling in efforts to discover biomarkers in serum from patients with cancers including prostate, lung, head and neck, ovarian, and colon cancer (1,2,3,4,5,6). MALDI techniques usually require some up-front fractionation of the serum to reduce the complexity of the sample (7,8,9), and the ease of sample fractionation is considered an advantage of SELDI. An advantage of MALDI-TOF instrumentation is the improved resolution over SELDI instruments and the ability to directly identify peaks of interest by analyzing samples in TOF/TOF mode. For routine linear mode profiling, both types of instrumentation give similar results with human serum (see Fig. 1).

Fig. 1. Comparison of SELDI and MALDI spectra using QC sera. (A) MALDI spectra generated using QC sera processed with IMAC-Cu2+ magnetic beads (Bruker). (B) SELDI spectra from QC sera processed on IMAC-Cu2+ chips (Ciphergen). The three primary peaks used for instrument standardization and optimization (labeled 1, 2, and 3) are indicated.

Besides the instrumentation and methodologies related to mass spectrometry analysis, the quality and quantity of the clinical samples to be tested are important considerations. Serum is one of the most common sample types used in biomarker discovery because it is routinely obtained in the clinic, a large proportion of blood clotting factors has been removed, and it is a rich source of molecules that may indicate systemic function. Blood plasma is an alternative source; however, clinical plasma collection uses various anticoagulants, the choice of which should be standardized to allow for universal analysis. Whether serum or plasma is used, every effort should be made to standardize the sample collection and processing protocols. Several studies have highlighted this point and determined that multiple factors can affect the spectra generated from serum samples (10,11). These factors include the elapsed time between venipuncture and separation of plasma or serum, the type of serum collection tube, storage conditions, and the number of freeze-thaw cycles. In our laboratory, we routinely use serum for proteomic profiling. The following protocols outline our method for collection and storage of serum samples for subsequent analysis via MALDI-MS.

Reduction of sample complexity is an essential step in the generation of high-quality TOF mass spectrometry data from serum. One method of MALDI sample preparation that reduces the complexity of serum while remaining robust and easily amenable to automated high-throughput applications is sample fractionation using magnetic beads (MBs) combined with prestructured MALDI sample supports (AnchorChip technology). Several MB types with different surface chemistries can be used to fractionate serum and increase the number of detectable peaks (12) (see Fig. 2). In addition, depletion of high-abundance proteins such as albumin and IgG (13,14) serves to reduce ion suppression as well as to reveal less abundant species. Unfortunately, fractionation greatly increases the number of samples to be processed, which in turn increases the complexity of the experimental procedure. Processing of samples is therefore best performed with robotics, which increases throughput and improves reproducibility; however, small sample sets can be processed manually with careful attention to detail and to the protocols and methods contained in this chapter. Another caveat to depletion strategies is that highly abundant proteins such as albumin bind low-abundance species, which may be removed inadvertently along with them (15,16). For comprehensive biomarker discovery, the benefits of depletion and fractionation often outweigh these drawbacks. We have used both depleted and nondepleted serum strategies for biomarker discovery, and this continues to be a major area of methodological development.

Fig. 2. MALDI spectra of serum fractionated with magnetic beads. Example of spectra produced on the Ultraflex-TOF/TOF when serum is fractionated with different magnetic bead types (WAX, 80 peaks; C18, 62 peaks; IMAC, 85 peaks; WCX, 84 peaks). A total of 203 unique peaks are resolved in the m/z range of 1000–10,000. Axes: m/z versus intensity (a.u.).

2. Materials
2.1. Serum Collection and Storage
1. Becton Dickinson vacutainer serum separator tube (SST) plus blood collection tube (16 mm × 100 mm, draw volume 8.5 mL) (Becton Dickinson #367988)
2. Screw cap microtubes for cryo-storage (2.0 mL) (Sarstedt Inc. #72.609.001, with caps #65.716)
3. Microcentrifuge tubes for aliquots (1.7 mL) (Corning-Costar #3620)
2.2. Serum Processing for MALDI Using MB-Based Fractionation
1. MB kit(s) (immobilized metal affinity-Cu, hydrophobic interaction, weak cation exchange, or weak anion exchange) (Bruker Daltonics, Billerica, MA)
2. Optional: ClinProt robotic workstation (Bruker Daltonics)
3. Magnetic separators for manual processing: large (1.5 mL) or small (0.5 mL) tube format (Bruker Daltonics)
4. α-Cyano-4-hydroxycinnamic acid (CHCA) (Bruker Daltonics)
5. Ethanol, ultra pure, 100%
6. Acetone, ultra pure, 100%
7. Micropipette capable of delivering 1 μL accurately
8. Peptide standard mix (Bruker)
9. Microtiter plate AnchorChip 600/384 MALDI target, 600 μm diameter (Bruker Daltonics)
2.3. Serum Processing for SELDI
1. Water, high performance liquid chromatography (HPLC) grade (Fisher Scientific, Hampton, NH)
2. Copper sulfate, anhydrous (Sigma-Aldrich, St. Louis, MO)
3. Sodium acetate trihydrate salt
4. Phosphate buffered saline (PBS) buffer, pH 7.4
5. Urea, at least 99% pure (Promega, Madison, WI)
6. CHAPS, ultra purity (Fisher Scientific)
7. Sinapinic acid (SPA) (5 mg tube) (Ciphergen Biosystems, Palo Alto, CA)
8. IMAC protein chip arrays (Ciphergen)
9. Bioprocessor holder (Ciphergen) for the processing of 12 chips in a 96-well format
10. Bioprocessor accessory, 96-well disposable reservoir and gasket (Ciphergen)
11. Acetonitrile, ultra high purity grade
12. Trifluoroacetic acid (TFA) (100%, 1 mL ampules) [Sigma/Aldrich Chemical Company 26,977-8, (589-37-37)]
13. Plate seals
14. For calibration (all from Ciphergen Biosystems): NP20 ProteinChip arrays, All-in-one peptide standard, All-in-one protein standard
15. Optional: BioMek 2000 robotic workstation, adapted to process ProteinChip arrays (Ciphergen Biosystems)
16. DPC MicroMix 5 shaker (Diagnostic Products Corporation, Los Angeles, CA) or another type of rotary or platform shaker
17. Micropipet capable of delivering 1 μL accurately
18. Pooled serum for quality control (QC)
19. 100 mM CuSO4 in water [room temperature (RT)]: 1.6 g CuSO4 (MW = 159.6) made up to 100 mL in HPLC grade water
20. 100 mM sodium acetate, pH 4.0 (RT): 9.0 mL 0.2 M sodium acetate stock (27.2 g/L), 50 mL HPLC water, 41.0 mL 0.2 M acetic acid (11.6 mL/L made from concentrated; add gradually to reach pH 4.0)
21. PBS buffer, pH 7.4 (RT): 10 mL PBS buffer (10×) made up to 100 mL in HPLC water. Check pH.
22. 10% TFA stock: 1 mL TFA (100%), 9 mL HPLC water (store in amber bottle)
23. 1% TFA working solution (store in amber bottle and make fresh every 2 weeks): take 1 mL TFA (10%) and add 9 mL HPLC water
24. 8 M urea, 1% CHAPS in PBS, pH 7.4: 48.05 g urea, up to 90 mL PBS pH 7.4; stir until dissolved (may need warming). Add 1 g CHAPS. Bring the final volume to 100 mL with PBS. Filter through a 0.4 μm filter. Aliquot into 5 mL volumes and freeze.
25. 1 M urea, 0.125% CHAPS in PBS, pH 7.4: dilute the 8 M stock above in PBS (100 mL 8 M in 700 mL PBS).
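The reagent recipes above can be checked with simple molarity and dilution arithmetic. The following Python sketch is illustrative only and is not part of the original protocol; the function names are hypothetical, and the molecular weights of urea (60.06 g/mol) and sodium acetate trihydrate (136.08 g/mol) are standard values assumed here rather than taken from the chapter (only the CuSO4 value of 159.6 is given in the text).

```python
def grams_needed(molarity, mol_weight, volume_ml):
    """Mass of solute (g) needed for `volume_ml` of solution at `molarity` (mol/L)."""
    return molarity * mol_weight * volume_ml / 1000.0


def diluted_concentration(stock_conc, stock_ml, diluent_ml):
    """Concentration after mixing `stock_ml` of stock with `diluent_ml` of diluent."""
    return stock_conc * stock_ml / (stock_ml + diluent_ml)


# Molecular weights in g/mol; urea and sodium acetate trihydrate are assumed values.
MW_CUSO4 = 159.6
MW_UREA = 60.06
MW_NAOAC_TRIHYDRATE = 136.08

cuso4_g = grams_needed(0.100, MW_CUSO4, 100)                  # 100 mM CuSO4, 100 mL
urea_g = grams_needed(8.0, MW_UREA, 100)                      # 8 M urea, 100 mL
naoac_g_per_l = grams_needed(0.2, MW_NAOAC_TRIHYDRATE, 1000)  # 0.2 M stock, 1 L

# 100 mL of the 8 M urea, 1% CHAPS stock diluted into 700 mL PBS
urea_working_m = diluted_concentration(8.0, 100, 700)
chaps_working_pct = diluted_concentration(1.0, 100, 700)

print(f"CuSO4: {cuso4_g:.2f} g, urea: {urea_g:.2f} g, NaOAc: {naoac_g_per_l:.1f} g/L")
print(f"Working buffer: {urea_working_m:.2f} M urea, {chaps_working_pct:.3f}% CHAPS")
```

Under these assumptions, the calculated masses agree with the quantities listed (approximately 1.6 g CuSO4 and 48.05 g urea per 100 mL), and diluting 100 mL of the 8 M urea, 1% CHAPS stock into 700 mL PBS gives 1 M urea, 0.125% CHAPS.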
2.4. SELDI and MALDI Spectra Acquisition
1. SELDI PBS II, IIc, or PCS 4000 instrument (Ciphergen Biosystems)
2. Ultraflex I or II MALDI-TOF–TOF (Bruker Daltonics)
3. Methods
3.1. Serum Collection
Obtain proper patient consent before collection.
1. Perform venipuncture into a 10 cc SST vacutainer tube (without anticoagulant).
2. Allow blood to clot at RT for 30 min.
3. Spin blood at 1700 rcf for 10 min, then immediately decant the serum and freeze it at –70°C in a screw cap freezer vial (Sarstedt). If this is not possible, the serum can be stored at –20°C for up to 5 days before being moved to a –70°C freezer.
4. Prior to SELDI or MALDI analysis, thaw the sample and divide it into small-volume aliquots to avoid multiple freeze-thaw cycles. When possible, no sample should be taken through more than two freeze-thaw cycles, and the number of cycles should be recorded if unused volumes are returned to the freezer.
3.2. Preparation of Human Serum
Expression profiling of proteins/peptides uses both peak mass and intensity to quantify differences between spectra. This necessitates the use of a QC standard to monitor instrument performance (17). The QC sample routinely used in our lab is pooled human serum collected with the same protocol used for the experimental samples (see Subheading 3.1). Efforts have been made to develop a standardized QC sample for serum mass spectrometry profiling (18). Until such a standard is available, a large volume of serum can be pooled and aliquoted to be run with every experimental sample set. This QC sample should be assayed using the same processing technique that will be employed for the experimental samples, and the data from multiple runs should be analyzed. In this way, the inter- and intra-assay variability can be determined. Additionally, the spectra obtained from the QC sample can be used as a benchmark for the integrity of processing, instrument optimization, and ProteinChip variability. We therefore recommend including several QC samples on a MALDI target and one QC spot on each SELDI ProteinChip. Acceptable levels of reproducibility need to be established for any new technology, and sample preparation is the most critical step in the production of reproducible spectra (see Notes 3, 4, and 5). We have optimized the SELDI system with high-throughput robotics, and previous studies in our laboratory have determined that the mass accuracy of SELDI spectra is highly reproducible, with coefficients of variation (CVs) of 0.05%. Operating in linear mode, we have found the mass accuracy of an Ultraflex-TOF–TOF to be 0.01% CV. Overall, normalized intensity values for individual peaks in QC sera routinely have CVs below 20% for samples prepared robotically in our lab using either SELDI or MALDI-MS.
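The reproducibility figures quoted above are coefficients of variation (CV = standard deviation / mean × 100). As a minimal sketch of how inter-assay variability might be summarized from repeated QC runs, the Python below computes the percent CV for one peak. The function names, the replicate values, and the use of 20% as a pass/fail limit are illustrative assumptions, not data or acceptance criteria taken from this chapter.

```python
from statistics import mean, stdev


def percent_cv(values):
    """Percent coefficient of variation: sample SD / mean x 100."""
    if len(values) < 2:
        raise ValueError("need at least two replicate measurements")
    return 100.0 * stdev(values) / mean(values)


def within_limit(values, max_cv_percent=20.0):
    """True if the replicate CV does not exceed the chosen limit (assumed 20%)."""
    return percent_cv(values) <= max_cv_percent


# Hypothetical normalized intensities of one QC peak across five runs.
qc_intensities = [41.2, 38.5, 44.0, 39.8, 42.6]
print(f"Intensity CV: {percent_cv(qc_intensities):.1f}%")
print("Within limit:", within_limit(qc_intensities))
```

The same function can be applied to the observed m/z of a QC peak across runs to summarize mass reproducibility.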
3.3. Serum Protein Profiling on the MALDI-TOF–TOF
3.3.1. MB Fractionation of Human Serum
These steps are performed by the ClinProt robot. Below is an outline of a comparable manual method. Sequential fractionation can also be performed with multiple bead types.
1. Vortex MBs thoroughly for at least 1 min.
2. In a 0.5 mL Eppendorf tube, pretreat 5 μL of MBs with 50 μL MB-IMAC Cu binding solution.
3. Place the tube in the magnetic bead separator (MBS) and move it between adjacent wells 10 times.
4. Collect the beads on the wall of the tube for 20 s and remove the supernatant carefully with a pipette.
5. Repeat this pretreatment two more times.
6. Add 20 μL of serum and mix carefully with the beads by pipetting up and down five times.
7. Keep at RT for 2 min.
8. Place the tube in the MBS and wait 20 s for the beads to separate.
9. Remove the supernatant carefully with a pipette tip (the unbound fraction can be discarded or saved for analysis or a second fractionation step, if desired).
10. To wash, add 80 μL MB-IMAC Cu wash solution and place the tube in the MBS again. Move the tube back and forth between adjacent wells 10 times.
11. Collect the beads on the tube wall for 20 s and remove the supernatant carefully with a pipette.
12. Repeat this wash two more times.
13. To elute, add 10 μL MB-IMAC Cu elution solution and mix. Let the beads sit for 5 min at RT.
14. Place the tube on the MBS and wait 20 s for the beads to separate.
15. Transfer the eluate to a fresh tube.
3.3.2. Data Collection on MALDI-TOF–TOF Instrument
To best detect proteins over the entire mass range on a MALDI instrument, it is necessary to optimize the instrument settings separately for low mass (typically 2000–20,000 Da) and high mass (20,000–100,000 Da or greater). The best sensitivity and resolution are obtained in the mass range below m/z 20,000, and this is the range we routinely use for most profiling experiments.
1. Prepare samples on an anchor plate by diluting the eluates 1:10 in CHCA matrix prepared according to the AnchorChip protocol (0.3 mg/mL in ethanol:acetone 2:1). SPA and/or 2,5-dihydroxybenzoic acid may also be used.
2. Spot 1 μL of the sample diluted in matrix onto the 600 μm diameter AnchorChip target. Also spot 1 μL of the peptide standard diluted according to the manufacturer’s instructions.
3. Allow spots to dry.
4. Perform external calibration with the peptide standard using a linear mode method.
5. Collect at least 300 shots in linear mode, adjusting the laser energy and detection sensitivity to maximize signal and resolution of the major peaks using a QC spot. Typically, in linear mode the resolution of the three major peaks should be greater than 600.
6. Instrument settings will vary based on instrument set-up and are more numerous than is feasible to describe in this chapter, but the most important settings to optimize are acceleration voltage (IS1), laser power, time lag focusing (or PIE), detector settings, and matrix suppression. Our basic instrument settings in linear mode are as follows:
IS1, 22
Laser, 37% with laser attenuation offset at 48%, range at 40%
Time lag focus, 200 ns
Detector gain, 24×
Matrix suppression, gated with suppression up to m/z 800

All spectra should be processed using the same baseline subtraction protocol. Perform peak detection using a uniform definition of the requisite signal-to-noise ratio and mass window. Although MALDI techniques have the potential to produce protein profiles that contain patterns capable of distinguishing disease and identifying biomarkers, a single analysis may produce many hundreds of protein peaks (see Note 2). The data analysis required to discern the differentiating patterns therefore poses a major challenge, and the interpretation of the enormous volumes of proteomic data remains an unsolved bioinformatics problem. Many different classification tools are currently being used with success for the analysis of MALDI data. These approaches include Fisher discriminant analysis, CART (19,20), support vector machine (21), artificial neural network (22), boosted decision tree analysis (23), and genetic algorithm (24). General data preparation before any type of analysis should include averaging intensity values for duplicate samples, baseline subtraction, and peak picking.

3.4. Protein Identification Using MALDI-TOF/TOF
Biomarker candidates detected by protein profiling can be subjected to TOF/TOF analysis for the identification of peptides directly from serum profiles, using the same sample spot and/or respotting of the sample. Initial analysis in the reflectron mode allows visualization of the target or parent peak. Metastable fragment ions of the respective precursor ion are then analyzed after a second acceleration step, and the resulting fragment pattern is interpreted and used for peptide identification via database search. The ability to directly sequence the peptides of interest is a powerful feature of this method (see Fig. 3).

Fig. 3. Identification of a serum peptide directly from the serum profile. Serum profile (A) was generated in linear mode on the Ultraflex-TOF/TOF, from which a peptide (m/z 1469.09) was selected for MS/MS analysis, resulting in a fragmentation spectrum (B). This peptide showed homology to fibrinopeptide A using the Mascot search engine (C). Mascot peptide view in (C): MS/MS fragmentation of DSGEGDFLAEGGGVR, found in gi|229185, fibrinopeptide A; start–end 2–16; observed 1465.72; Mr(expt) 1464.72; Mr(calc) 1464.65; delta 0.07; missed cleavages 0.

3.5. Serum Protein Profiling on SELDI-TOF
3.5.1. Preparation of Serum
Note: All of the following steps, including the ProteinChip preparation and serum incubation on the arrays, are performed robotically by the BioMek 2000 robot. The protocols below outline a manual method.
1. Thaw human serum samples on ice. Use separate aliquots to set up duplicates or triplicates.
2. Add 20 μL human serum into a 1.7 mL microcentrifuge tube (alternatively, this can be performed in a v-bottom 96-well plate for large sample sets).
3. Add 30 μL of 8 M urea, 1% CHAPS in PBS pH 7.4.
4. Vortex the tube at 4°C for 10 min or, if using a plate, seal it and place it on the MicroMix 5 shaker at 4°C for 10 min (shaker settings: form 20, amplitude 5, time 10 min).
5. Add 100 μL of 1 M urea, 0.125% CHAPS in PBS pH 7.4.
6. Vortex or pipette up and down to mix (total volume 150 μL).
7. Dilute the sample 1:5 in PBS pH 7.4 by adding 600 μL PBS. If using a plate, remove 35 μL of the serum–urea mixture from the first plate and transfer it to a second plate, then add 140 μL of PBS. Mix by vortexing the tube or pipetting up and down.
8. Store on ice until ready to add samples to a bioprocessor containing ProteinChip arrays.
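The volumes in steps 2–7 determine how much serum and denaturant reach each array. The following Python sketch works through that arithmetic for the tube format; the variable names are hypothetical, and the calculation assumes that volumes are simply additive.

```python
# Volumes (uL) taken from the tube-format steps above.
serum_ul = 20.0
urea8_ul = 30.0     # 8 M urea, 1% CHAPS
urea1_ul = 100.0    # 1 M urea, 0.125% CHAPS
pbs_ul = 600.0      # final 1:5 dilution in PBS

denatured_ul = serum_ul + urea8_ul + urea1_ul   # volume after step 6
final_ul = denatured_ul + pbs_ul                # volume after step 7

serum_dilution = final_ul / serum_ul            # overall fold dilution of serum
urea_final_m = (urea8_ul * 8.0 + urea1_ul * 1.0) / final_ul
chaps_final_pct = (urea8_ul * 1.0 + urea1_ul * 0.125) / final_ul

# 100 uL of diluted sample is applied to each array in the incubation step.
serum_per_array_ul = 100.0 / serum_dilution

print(f"Volume after step 6: {denatured_ul:.0f} uL; after step 7: {final_ul:.0f} uL")
print(f"Overall serum dilution: 1:{serum_dilution:.1f}")
print(f"Final urea: {urea_final_m:.2f} M; final CHAPS: {chaps_final_pct:.3f}%")
print(f"Serum per 100 uL applied: {serum_per_array_ul:.2f} uL")
```

With these volumes the serum is diluted 1:37.5 overall, so each 100 μL applied to an array corresponds to roughly 2.7 μL of the original serum. The plate format (35 μL of mixture plus 140 μL PBS) gives the same 1:5 final dilution.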
3.5.2. Preparation of ProteinChip Arrays
This protocol describes the preparation of IMAC-Cu2+ ProteinChips. Other types of chips should be prepared according to the manufacturer’s (Ciphergen) instructions.
1. Label or number IMAC chips on the reverse side and place them into the bioprocessor according to the manufacturer’s instructions (see Note 1).
2. Add 50 μL of 100 mM CuSO4 onto each spot or array.
3. Shake on MicroMix 5 for 10 min at RT.
4. Shaker settings: form 20, amplitude 5, time 10 min.
5. Flick plate to remove CuSO4 to waste and pat upside down onto a clean paper towel to remove residual liquid (liquid can also be removed by aspiration, but be careful not to touch the array surface with the pipette tip).
6. Wash with 200 μL of HPLC water 2 × 5 min at RT on the MicroMix shaker at the same settings for form and amplitude as before.
7. Flick plate and pat on paper towel.
8. Add 50 μL of 100 mM sodium acetate pH 4.0.
9. Shake on MicroMix shaker for 5 min at RT.
10. Flick plate and pat as before.
11. Wash with HPLC water 2 × 5 min at RT on MicroMix.
12. Add 200 μL PBS pH 7.4.
13. Flick plate and pat as before.
14. Wash with PBS pH 7.4 2 × 5 min at RT on MicroMix.
Leave last volume of PBS on plate until ready to use.
3.5.3. Incubation of Serum on ProteinChip Arrays
1. Remove PBS from the bioprocessor with a multichannel pipettor, one row at a time to avoid drying the chips.
2. Add 100 μL of each sample to the respective arrays. Note: samples should be randomized as to their placement on the ProteinChip arrays. Duplicate samples should also be randomly placed.
3. Seal plate and shake bioprocessor on MicroMix (form 20, amplitude 5) for 30 min at RT.
4. Remove samples carefully with a pipette, changing tips to avoid cross contamination.
5. Add 200 μL PBS pH 7.4 to each array and shake on MicroMix for 5 min at RT using the same shaker settings.
6. Remove PBS with a multichannel pipettor, changing tips for each row.
7. Wash with 200 μL HPLC water, shaking on MicroMix for 5 min at RT.
8. Remove water with a multichannel pipettor.
9. Repeat water wash.
10. Remove chips from bioprocessor and allow chips to dry completely.
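Step 2 of this subheading calls for randomized placement of samples and duplicates on the arrays. A minimal Python sketch of one way to generate such a layout is shown below; it is not the procedure used by the BioMek software. It assumes eight spots per chip (12 chips in the 96-well bioprocessor) and reserves one spot per chip for the QC serum, as recommended in Subheading 3.2. The function name, sample labels, and seed are hypothetical.

```python
import random


def randomized_layout(sample_ids, replicates=2, spots_per_chip=8,
                      qc_label="QC", seed=None):
    """Return a list of chips, each a list of spot labels in random order.

    Every sample appears `replicates` times across the run, and each chip
    carries exactly one QC spot at a random position.
    """
    rng = random.Random(seed)
    spots = [sid for sid in sample_ids for _ in range(replicates)]
    rng.shuffle(spots)

    per_chip = spots_per_chip - 1      # one spot per chip is kept for QC
    chips = []
    for start in range(0, len(spots), per_chip):
        chip = spots[start:start + per_chip] + [qc_label]
        rng.shuffle(chip)              # randomize the QC position as well
        chips.append(chip)
    return chips


samples = [f"S{n:02d}" for n in range(1, 43)]   # 42 hypothetical samples
layout = randomized_layout(samples, seed=2008)
for chip_no, chip in enumerate(layout, start=1):
    print(f"Chip {chip_no:2d}: {' '.join(chip)}")
```

Recording the seed with the run makes the layout reproducible. This simple shuffle does not prevent both replicates of a sample from landing on the same chip; that constraint would need to be added if it is required by the study design.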
3.5.4. Adding SPA Matrix to the Chips
1. To one tube of SPA, add 200 μL acetonitrile (100%).
2. Add 200 μL 1% TFA (final concentration of SPA: 12.5 mg/mL in 50% acetonitrile, 0.5% TFA).
3. Vortex for 5 min at RT.
4. Quick spin.
5. Add 1.0 μL SPA matrix to each dry spot, being careful not to touch the pipette tip to the array surface.
6. Allow to dry.
7. Arrays are now ready to read on the SELDI instrument.
Note: The arrays should be stored in the dark in a cool, dry place. It is recommended to read the chips within a few hours of the addition of the matrix. Some signal degradation may occur if the arrays are stored for more than 24 h.
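The stated final matrix composition can be cross-checked from the volumes in steps 1 and 2. The short Python sketch below does this; note that it infers the amount of SPA per tube from the stated 12.5 mg/mL in 400 μL, which is an assumption rather than a manufacturer's specification, and the variable names are hypothetical.

```python
acetonitrile_ul = 200.0
tfa_1pct_ul = 200.0
total_ul = acetonitrile_ul + tfa_1pct_ul

spa_final_mg_per_ml = 12.5                               # stated in step 2
spa_mass_mg = spa_final_mg_per_ml * total_ul / 1000.0    # implied SPA per tube

acetonitrile_pct = 100.0 * acetonitrile_ul / total_ul    # % (v/v) acetonitrile
tfa_final_pct = 1.0 * tfa_1pct_ul / total_ul             # % TFA after 1:1 dilution

print(f"Implied SPA per tube: {spa_mass_mg:.1f} mg")
print(f"Solvent: {acetonitrile_pct:.0f}% acetonitrile, {tfa_final_pct:.1f}% TFA")
```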
3.5.5. Collection of Spectra on SELDI-TOF We describe here the collection of spectra using the PBS II Ciphergen instrument. 3.5.5.1. Calibration
Calibration of the SELDI instrument is crucial to the accurate mass analysis of the proteins present in samples. Smaller ions fly faster than larger ions, and their m/z ratio can be calculated from their flight time once the instrument has been calibrated with compounds of known mass. For the most accurate mass assignments, the instrument should be calibrated using conditions identical to the experimental conditions. Calibration should be performed at the beginning of an experimental run, and thereafter on every day that experimental data are collected. When obtaining calibration spectra, use instrument settings (i.e., detector voltage, lag time, etc.) as close as possible to those used for serum profiling.
1. Reconstitute one vial each of the seven-in-one peptide and protein standards, according to the manufacturer’s instructions. Aliquot and freeze.
Cazares et al.
2. Mix standards with SPA according to package insert.
3. Deposit 1 μL of each standard onto an array of an NP20 ProteinChip.
4. Air-dry the arrays completely, usually 30–60 min.
5. Read the array in the SELDI instrument using a spot protocol created to read the experimental samples (see below). The laser intensity should be lowered such that the peaks from the standards do not exceed 75% maximum signal intensity.
6. Follow the calibration dialogue in the software of the PBSII SELDI instrument to save the calibration equations.
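The idea behind the calibration equations can be illustrated with the simplest time-of-flight model, in which the square root of m/z is linear in flight time. This is a generic sketch and not the instrument software's actual algorithm. The function names, the calibrant masses, and the constants used to simulate flight times are all made up for illustration, and singly charged ions are assumed.

```python
import math

def fit_tof_calibration(times_us, masses_da):
    """Least-squares fit of sqrt(m/z) = k * (t - t0); returns (k, t0)."""
    y = [math.sqrt(m) for m in masses_da]
    n = len(times_us)
    mean_t = sum(times_us) / n
    mean_y = sum(y) / n
    sxx = sum((t - mean_t) ** 2 for t in times_us)
    sxy = sum((t - mean_t) * (v - mean_y) for t, v in zip(times_us, y))
    k = sxy / sxx                 # slope of sqrt(m/z) versus time
    t0 = mean_t - mean_y / k      # time offset where the fitted line crosses zero
    return k, t0

def mass_from_time(t_us, k, t0):
    """Convert a flight time to m/z with the fitted constants."""
    return (k * (t_us - t0)) ** 2

# Simulated calibrants: masses and instrument constants are hypothetical
true_k, true_t0 = 2.0, 0.5
calibrant_masses = [1084.2, 2465.2, 5733.6, 12230.9]
calibrant_times = [math.sqrt(m) / true_k + true_t0 for m in calibrant_masses]

k, t0 = fit_tof_calibration(calibrant_times, calibrant_masses)
print(round(k, 6), round(t0, 6))
print(round(mass_from_time(math.sqrt(7764.0) / true_k + true_t0, k, t0), 3))
```

Real calibration equations usually include additional terms and are fitted by the vendor software; the point here is only that known masses fix the constants that convert flight time to m/z.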
3.5.5.2. SELDI Instrument Settings Optimization
SELDI instrument optimization refers to the adjustment of the settings used for data collection so as to maximize signal intensity while retaining optimal resolution and the lowest noise. In our studies, three protein peaks (m/z 5900, 7764, 9284 ± 0.2%) are consistently present in the QC sera processed on IMAC-Cu2+ ProteinChips and are used as benchmarks for instrument optimization (see Fig. 1). Based on multiple runs, the instrument settings are adjusted to maximize signal-to-noise and resolution for these three peaks. Thereafter, specific criteria were set to ensure instrument optimization (see Semmes et al. (17)). Generally, when trying to obtain a specific overall intensity level (e.g., to get two instruments to behave similarly, or to obtain similar intensity levels over time), three parameters can be adjusted: laser intensity, detector sensitivity, and detector voltage. The following spot protocols for data collection on the SELDI reader are a starting point. The settings will differ from instrument to instrument and will change over time, based on cumulative laser utilization and detector settings.
Data collection: standard spot protocol for QC serum on IMAC-Cu (for a PBSII)
1. Set detector voltage to 1650.
2. Set high mass to 100,000 Da, optimized from 3000 to 50,000 Da.
3. Set starting laser intensity to 220.
4. Set starting detector sensitivity to 7.
5. Focus lag time at 900 ns.
6. Set data acquisition method to SELDI quantitation.
7. Set SELDI acquisition parameters: starting position 20, delta 4, transients per position 12, ending position 80.
8. Set warming positions with two shots at intensity 230 and do not include warming shots.
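The number of laser shots collected per spot follows from the acquisition parameters in step 7. The sketch below reads those parameters as starting position 20, delta 4, 12 transients per position, and ending position 80; that reading is an assumption, although it reproduces the 192 total shots quoted for the same parameters later in this subheading. Warming shots are excluded, as in step 8.

```python
def total_shots(start, delta, transients_per_position, end):
    """Count sampled positions and total transients for one spot."""
    positions = list(range(start, end + 1, delta))  # 20, 24, ..., 80
    return len(positions), len(positions) * transients_per_position

n_positions, shots = total_shots(start=20, delta=4,
                                 transients_per_position=12, end=80)
print(n_positions, shots)  # 16 positions, 192 shots
```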
When adjusting to meet QC criteria:
• Increasing detector voltage typically increases signal and noise. Change this in units of 25 V.
• Increasing laser increases signal and generally decreases resolution. Change this in units of 10.
• Increasing sensitivity increases signal intensity. Typical working range is six to eight.
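Whether the three QC benchmark peaks (m/z 5900, 7764, 9284 ± 0.2%) are present in a spectrum can be checked with a simple tolerance match. This is an illustrative sketch only; the function name and the example peak list are hypothetical, and the full QC criteria (signal-to-noise and resolution thresholds) are those of Semmes et al. (17), which are not reproduced here.

```python
QC_PEAKS = (5900.0, 7764.0, 9284.0)  # benchmark m/z values from the text
TOLERANCE = 0.002                    # +/- 0.2%

def match_qc_peaks(observed_mz, expected=QC_PEAKS, tol=TOLERANCE):
    """Map each expected QC peak to the closest observed peak within tolerance."""
    result = {}
    for target in expected:
        hits = [mz for mz in observed_mz if abs(mz - target) <= tol * target]
        result[target] = min(hits, key=lambda mz: abs(mz - target)) if hits else None
    return result

# Hypothetical detected peak list from one QC spectrum
found = match_qc_peaks([4210.3, 5905.1, 7790.2, 9280.0])
for target, hit in found.items():
    print(target, "->", hit)
```

In this made-up example the peak near 7764 lies outside the 0.2% window and is reported as missing, which would prompt a check of the calibration.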
For example, if the settings above are not meeting QC specifications, try the following:
If S/N passes easily but resolution is low, reduce detector voltage or laser intensity:
1. Set detector voltage to 1625.
2. Set high mass to 100,000 Da, optimized from 3000 to 50,000 Da.
3. Set starting laser intensity to 220.
4. Set starting detector sensitivity to 7.
5. Focus lag time at 900 ns.
6. Set data acquisition method to SELDI quantitation.
7. Set SELDI acquisition parameters: starting position 20, delta 4, transients per position 12, ending position 80 (192 total shots).
8. Set warming positions with two shots at intensity 230 and do not include warming shots.
If resolution passes but S/N is low, increase laser intensity or detector voltage:
1. Set detector voltage to 1650.
2. Set high mass to 100,000 Da, optimized from 3000 to 50,000 Da.
3. Set starting laser intensity to 230.
4. Set starting detector sensitivity to 7.
5. Focus lag time at 900 ns.
6. Set data acquisition method to SELDI quantitation.
7. Set SELDI acquisition parameters: starting position 20, delta 4, transients per position 12, ending position 80.
8. Set warming positions with two shots at intensity 230 and do not include warming shots.
If intensity is too high (i.e., generally stay under 65), reduce laser intensity and/or sensitivity:
1. Set detector voltage to 1650.
2. Set high mass to 100,000 Da, optimized from 3000 to 50,000 Da.
3. Set starting laser intensity to 220.
4. Set starting detector sensitivity to 6.
5. Focus lag time at 900 ns.
6. Set data acquisition method to SELDI quantitation.
7. Set SELDI acquisition parameters: starting position 20, delta 4, transients per position 12, ending position 80.
8. Set warming positions with two shots at intensity 230 and do not include warming shots.
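The three adjustment examples above can be summarized as a small decision rule. The sketch below mirrors only the single-step changes shown in those examples (detector voltage in units of 25 V, laser intensity in units of 10, sensitivity in units of 1). The class name, the function name, and the order in which the conditions are tested are assumptions for illustration; in practice either of the two parameters named in each example may be adjusted.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SpotSettings:
    detector_voltage: int = 1650
    laser_intensity: int = 220
    detector_sensitivity: int = 7

def adjust(settings, sn_ok, resolution_ok, intensity_ok):
    """Return the next settings to try, changing one parameter by one step."""
    if not intensity_ok:
        # Intensity too high: lower sensitivity (laser could be lowered instead)
        return replace(settings,
                       detector_sensitivity=settings.detector_sensitivity - 1)
    if sn_ok and not resolution_ok:
        # Resolution low: reduce detector voltage by 25 V
        return replace(settings,
                       detector_voltage=settings.detector_voltage - 25)
    if resolution_ok and not sn_ok:
        # S/N low: raise laser intensity by 10
        return replace(settings,
                       laser_intensity=settings.laser_intensity + 10)
    return settings  # all criteria met, or a case not covered by the examples

start = SpotSettings()
print(adjust(start, sn_ok=True, resolution_ok=False, intensity_ok=True))
print(adjust(start, sn_ok=False, resolution_ok=True, intensity_ok=True))
print(adjust(start, sn_ok=True, resolution_ok=True, intensity_ok=False))
```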
After data collection, each spectrum should be calibrated for mass using the current peptide calibration. If higher molecular weight data is included for analysis, the protein standard calibration should be used for the peaks in this mass range. Spectra should be normalized using total ion current (this is a feature in the Ciphergen software) with the same normalization coefficient and low mass cutoff (2000 Da for SPA matrix to exclude matrix peaks). All spectra should also be processed using the same baseline subtraction protocol. Perform peak detection using a uniform definition of requisite signal-to-noise ratio (usually 3) and mass window (usually 0.2–0.3%).
4. Notes
1. Use powder-free nitrile (not latex) gloves when processing SELDI ProteinChips. Repetitive peaks at 3000–4000 Da will appear in the spectra if samples are contaminated with latex.
2. Use sample sets of sufficient size. A sample set of at least 30 should be included in each classification group in order to do multivariate analysis and to give >90% statistical confidence in a single marker with p values <0.01.
3. Make every effort to ensure that serum samples are collected, processed, and stored in a consistent manner. Retrospective studies may be problematic, because in most cases, samples were not processed or stored similarly.
4. Avoid hemolytic samples as hemoglobin can affect affinity binding interactions and may confound data interpretation.
5. Initially when designing a protein profiling study, avoid using serum samples collected at separate sites unless strict standard operating procedures for serum collection have been followed and documented.
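The total ion current normalization described at the end of Subheading 3.5.5.2 can be sketched as follows. This is a generic illustration and not the Ciphergen software's implementation: it scales every spectrum so that its summed intensity above the low mass cutoff equals the mean across the set. The function names and the toy spectra are hypothetical, and each spectrum is assumed to have a nonzero total ion current above the cutoff.

```python
def total_ion_current(mz, intensity, low_mass_cutoff=2000.0):
    """Sum of intensities at or above the low mass cutoff (Da)."""
    return sum(i for m, i in zip(mz, intensity) if m >= low_mass_cutoff)

def normalize_spectra(spectra, low_mass_cutoff=2000.0):
    """Scale each (mz, intensity) spectrum to the mean TIC of the set."""
    tics = [total_ion_current(mz, inten, low_mass_cutoff)
            for mz, inten in spectra]
    mean_tic = sum(tics) / len(tics)
    return [(mz, [i * mean_tic / tic for i in inten])
            for (mz, inten), tic in zip(spectra, tics)]

# Two toy spectra; the 1500 Da point falls below the cutoff (matrix region)
mz_axis = [1500.0, 3000.0, 6000.0]
spectra = [(mz_axis, [50.0, 10.0, 30.0]), (mz_axis, [5.0, 20.0, 60.0])]
normalized = normalize_spectra(spectra)
for mz, inten in normalized:
    print(inten, total_ion_current(mz, inten))
```

Applying the same cutoff and the same reference value to every spectrum is what makes intensities comparable across arrays and runs.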
References
1. Adam, B. L., Qu, Y., Davis, J. W., Ward, M. D., Clements, M. A., Cazares, L. H., Semmes, O. J., Schellhammer, P. F., Yasui, Y., Feng, Z., and Wright, G. L. Jr. (2002). Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res, 62: 3609–3614.
2. Wadsworth, J. T., Somers, K. D., Cazares, L. H., Malik, G., Adam, B. L., Stack, B. C. Jr., Wright, G. L. Jr., and Semmes, O. J. (2004). Serum protein profiles to identify head and neck cancer. Clin Cancer Res, 10: 1625–1632.
3. Petricoin, E. F., Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C., and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359: 572–577.
4. de Noo, M. E., Mertens, B. J., Ozalp, A., Bladergroen, M. R., van der Werff, M. P., van de Velde, C. J., Deelder, A. M., and Tollenaar, R. A. (2006). Detection of colorectal cancer using MALDI-TOF serum protein profiling. Eur J Cancer, 42: 1068–1076.
5. Sidransky, D., Irizarry, R., Califano, J. A., Li, X., Ren, H., Benoit, N., and Mao, L. (2003). Serum protein MALDI profiling to distinguish upper aerodigestive tract cancer patients from control subjects. J Natl Cancer Inst, 95: 1711–1717.
6. Howard, B. A., Wang, M. Z., Campa, M. J., Corro, C., Fitzgerald, M. C., and Patz, E. F. Jr. (2003). Identification and validation of a potential lung cancer serum biomarker detected by matrix-assisted laser desorption/ionization-time of flight spectra analysis. Proteomics, 3: 1720–1724.
7. Baumann, S., Ceglarek, U., Fiedler, G. M., Lembcke, J., Leichtle, A., and Thiery, J. (2005). Standardized approach to proteome profiling of human serum based on magnetic bead separation and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Clin Chem, 51: 973–980.
8. Orvisky, E., Drake, S. K., Martin, B. M., Abdel-Hamid, M., Ressom, H. W., Varghese, R. S., An, Y., Saha, D., Hortin, G. L., Loffredo, C. A., and Goldman, R. (2006). Enrichment of low molecular weight fraction of serum for MS analysis of peptides associated with hepatocellular carcinoma. Proteomics, 6: 2895–2902.
9. Feuerstein, I., Rainer, M., Bernardo, K., Stecher, G., Huck, C. W., Kofler, K., Pelzer, A., Horninger, W., Klocker, H., Bartsch, G., and Bonn, G. K. (2005). Derivatized cellulose combined with MALDI-TOF MS: a new tool for serum protein profiling. J Proteome Res, 4: 2320–2326.
10. Rai, A. J., Gelfand, C. A., Haywood, B. C., Warunek, D. J., Yi, J., Schuchard, M. D., Mehigh, R. J., Cockrill, S. L., Scott, G. B., Tammen, H., Schulz-Knappe, P., Speicher, D. W., Vitzthum, F., Haab, B. B., Siest, G., and Chan, D. W. (2005). HUPO plasma proteome project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics, 5: 3262–3277.
11. Banks, R. E., Stanley, A. J., Cairns, D. A., Barrett, J. H., Clarke, P., Thompson, D., and Selby, P. J. (2005). Influences of blood sample processing on low-molecular weight proteome identified by surface-enhanced laser desorption/ionization mass spectrometry. Clin Chem, 51: 1637–1649.
12. Villanueva, J., Philip, J., Entenberg, D., Chaparro, C. A., Tanwar, M. K., Holland, E. C., and Tempst, P. (2004). Serum peptide profiling by magnetic particle-assisted, automated sample processing and MALDI-TOF mass spectrometry. Anal Chem, 76: 1560–1570.
13. Guerrier, L., Thulasiraman, V., Castagna, A., Fortis, F., Lin, S., Lomas, L., Righetti, P. G., and Boschetti, E. (2006). Reducing protein concentration range of biological samples using solid-phase ligand libraries. J Chromatogr B Analyt Technol Biomed Life Sci, 833: 33–40.
14. Fountoulakis, M., Juranville, J. F., Jiang, L., Avila, D., Roder, D., Jakob, P., Berndt, P., Evers, S., and Langen, H. (2004). Depletion of the high-abundance plasma proteins. Amino Acids, 27: 249–259.
15. Lowenthal, M. S., Mehta, A. I., Frogale, K., Bandle, R. W., Araujo, R. P., Hood, B. L., Veenstra, T. D., Conrads, T. P., Goldsmith, P., Fishman, D., Petricoin, E. F. 3rd, and Liotta, L. A. (2005). Analysis of albumin-associated peptides and proteins from ovarian cancer patients. Clin Chem, 51: 1933–1945.
16. Mehta, A. I., Ross, S., Lowenthal, M. S., Fusaro, V., Fishman, D. A., Petricoin, E. F. 3rd, and Liotta, L. A. (2003). Biomarker amplification by serum carrier protein binding. Dis Markers, 19: 1–10.
17. Semmes, O. J., Feng, Z., Adam, B. L., Banez, L. L., Bigbee, W. L., Campos, D., Cazares, L. H., Chan, D. W., Grizzle, W. E., Izbicka, E., Kagan, J., Malik, G., McLerran, D., Moul, J. W., Partin, A., Prasanna, P., Rosenzweig, J., Sokoll, L. J., Srivastava, S., Srivastava, S., Thompson, I., Welsh, M. J., White, N., Winget, M., Yasui, Y., Zhang, Z., and Zhu, L. (2005). Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin Chem, 51: 102–112.
18. Rai, A. J., Stemmer, P. M., Zhang, Z., Adam, B. L., Morgan, W. T., Caffrey, R. E., Podust, V. N., Patel, M., Lim, L. Y., Shipulina, N. V., Chan, D. W., Semmes, O. J., and Leung, H. C. (2005). Analysis of human proteome organization plasma proteome project (HUPO PPP) reference specimens using surface enhanced laser desorption/ionization-time of flight (SELDI-TOF) mass spectrometry: multiinstitution correlation of spectra and identification of biomarkers. Proteomics, 5: 3467–3474.
19. Semmes, O. J., Cazares, L. H., Ward, M. D., Qi, L., Moody, M., Maloney, E., Morris, J., Trosset, M. W., Hisada, M., Gygi, S., and Jacobson, S. (2005). Discrete serum protein signatures discriminate between human retrovirus-associated hematologic and neurologic disease. Leukemia, 19: 1229–1238.
20. Qian, H. G., Shen, J., Ma, H., Ma, H. C., Su, Y. H., Hao, C. Y., Xing, B. C., Huang, X. F., and Shou, C. C. (2005). Preliminary study on proteomics of gastric carcinoma and its clinical significance. World J Gastroenterol, 11: 6249–6253.
21. Ressom, H. W., Varghese, R. S., Abdel-Hamid, M., Eissa, S. A., Saha, D., Goldman, L., Petricoin, E. F., Conrads, T. P., Veenstra, T. D., Loffredo, C. A., and Goldman, R. (2005). Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics, 21: 4039–4045.
22. Liu, J., Zheng, S., Yu, J. K., Zhang, J. M., and Chen, Z. (2005). Serum protein fingerprinting coupled with artificial neural network distinguishes glioma from healthy population or brain benign tumor. J Zhejiang Univ Sci B, 6: 4–10.
23. Qu, Y., Adam, B. L., Yasui, Y., Ward, M. D., Cazares, L. H., Schellhammer, P. F., Feng, Z., Semmes, O. J., and Wright, G. L. Jr. (2002). Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem, 48: 1835–1843.
24. Papadopoulos, M. C., Abel, P. M., Agranoff, D., Stich, A., Tarelli, E., Bell, B. A., Planche, T., Loosemore, A., Saadoun, S., Wilkins, P., and Krishna, S. (2004). A novel and accurate diagnostic test for human African trypanosomiasis. Lancet, 363: 1358–1363.
8
Urine Sample Preparation and Protein Profiling by Two-Dimensional Electrophoresis and Matrix-Assisted Laser Desorption Ionization Time of Flight Mass Spectroscopy
Panagiotis G. Zerefos and Antonia Vlahou
Summary
Urine represents the most easily attainable and consequently one of the most common samples in clinical analysis and diagnostics. However, urine is also considered one of the most difficult proteomic samples to work with due to its highly variable contents, as well as the presence of various proteins in low abundance or modified forms. In this chapter, we describe simple protocols and troubleshooting tips for urinary protein preparation and profiling by two-dimensional electrophoresis or directly via matrix-assisted laser desorption ionization time of flight mass spectroscopy. Direct dilution, protein precipitation, ultrafiltration, and solid phase extraction, in combination with the above profiling technologies, provide the means for reliable proteomic analysis of one of the most significant yet very complex biological samples.
Key Words: urine; 2DE; MALDI-TOF-MS; protein profiling; sample preparation.
Abbreviations: ACT: Acetone, CE: Capillary electrophoresis, CHAPS: 3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate, CHCA: α-Cyano-4-hydroxycinnamic acid, d: Dalton, 2DE: Two-dimensional gel electrophoresis, DHB: Dihydroxybenzoic acid, DTE: 1,4-Dithioerythritol, IEF: Isoelectric focusing, IPG: Immobilized pH gradient, LC: Liquid chromatography, MALDI: Matrix-assisted laser desorption ionization, MS: Mass spectrometry, MW: Molecular weight, MWCO: Molecular weight cut-off, ns: Nano-second, o/n: Overnight, RCF: Relative centrifugal forces, SA: Sinapinic acid, SDS: Sodium dodecylsulfate, SELDI: Surface-enhanced laser desorption, SPE: Solid phase extraction, TCA: Trichloroacetic acid, TFA: Trifluoroacetic acid, TGS: Tris-Glycine-SDS, TOF: Time of flight, UF: Ultrafiltration
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Biological fluids play a central role in clinical chemistry. Investigation of their cellular (cell number, morphology, etc.), biochemical (metabolites, biomolecules), and physicochemical (pH, transparency, absorption, etc.) attributes assists in formulating the clinical judgment on disease prognosis, diagnosis, and treatment. Urine, according to the International Union of Pure and Applied Chemistry, is the human fluid that contains water and metabolic products and is excreted by the kidneys, stored in the bladder, and normally discharged by way of the urethra. The protein content of urine is very low under normal conditions (1) and derives mainly from human plasma proteins, which are not filtered through the renal glomeruli. The presence of proteins at high concentrations in urine is usually the result of disease or pharmaceutical treatment. Creatinine assay in urine is one of the most common clinical examinations and serves this exact purpose, to assess unexpected protein excretion. It should be noted that besides the soluble proteins, urine also contains proteins included in exfoliated cells as well as in membrane components known as exosomes (2). In this chapter, we focus on the description of methods for the analysis of the soluble urinary proteins and refer the interested reader to the review by Pisitkun et al. (2) for a thorough description of the other urinary protein components.
In comparison to other proteomics samples, urine is still less explored. The main reason for this is that urine is a difficult and diverse sample. Its composition is age, sex, health, and drug dependent. In addition, tremendous daily variation in protein content exists between first void, midstream, morning, and random catch urine samples of a single donor. Despite these facts, protein markers for disease have been detected in urine and have been approved for use as adjuncts to clinical assays for disease diagnosis and prognosis (3,4). This justifies and triggers an in-depth analysis of the urinary proteome, particularly with the advent of contemporary proteomics technologies, with the objective of identifying novel disease diagnostic/prognostic biomarkers. Specifically, the urine proteome has been studied thoroughly by a series of proteomics technologies. These include two-dimensional electrophoresis (5,6), liquid chromatography (LC) in combination with mass spectroscopy (MS) (7,8), matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) or surface-enhanced laser desorption (SELDI)-TOF profiling (9,10,11,12,13), capillary electrophoresis coupled to MS (14,15), and combinations thereof, implementing several separation steps both chromatographic and electrophoretic (15,16,17,18,19). The great interest in the investigation of the urinary proteome is reflected by the recent establishment of the human urine and kidney proteome initiative (http://hkupp.kir.jp) within the Human Proteome Organization, which targets the integration of existing research efforts in this field.
In this chapter, we provide detailed protocols and troubleshooting tips, as experienced by the authors, for the preparation and analysis of urinary proteins by two-dimensional gel electrophoresis (2DE) or directly by MALDI-TOF-MS. We selected these two profiling approaches since the former is a classical high resolution profiling approach (see also Chapters 4–6), whereas the latter offers the advantage of high throughput (see also Chapters 7 and 13). In general, the process of urine analysis for the investigation of its protein content can be divided into three main steps: sample collection (usually performed at the physician’s office), protein extraction, and protein separation and detection. Each of these steps is crucial and significantly affects the output of the proteomics experiment. In this chapter, emphasis is given to the description of the various protein preparation/extraction methodologies, including ultrafiltration, precipitation, and solid phase extraction (SPE), as they complement 2DE and MALDI-TOF-MS profiling. Additional protein preparation methods exist, such as dialysis, ultracentrifugation, etc. (see Note 1); however, we have focused on the three aforementioned methods due to their simplicity, increased reproducibility, and overall compatibility with the 2DE and MALDI MS profiling approaches.
1.1. Protein Precipitation
Protein precipitation is a very common purification procedure employed for the isolation of macromolecules. The denaturation and precipitation of proteins occur in solutions of extreme ionic strength, very low pH, or high concentrations of organic solvents. Under such conditions, biopolymers do not retain a conformation capable of sustaining their solubility. Commonly used reagents are ammonium sulfate ((NH4)2SO4), used for salting out proteins at concentrations of 3 M; trichloroacetic acid (TCA), used at concentrations higher than 5% (w/v); and several organic solvents (ethanol, acetone, acetonitrile, chloroform, methanol, and isopropanol), used at final concentrations higher than 50% (v/v). The choice of the precipitation methodology depends primarily on the analytical procedure employed. In general, salting out is avoided in proteomics sample preparation since residual salts inhibit further analysis by 2DE and mass spectrometry. TCA precipitation followed by acetone washes is very popular and efficient, especially in cases of very dilute protein solutions. Organic solvents offer very high yields, but some of them are toxic (methanol, acetonitrile), while others like chloroform (also toxic) involve rather complicated precipitation procedures. A detailed description of these approaches for urinary protein preparation is provided in Section 3.
1.2. Ultrafiltration-SPE
Ultrafiltration is a technique based on the use of molecular filters in combination with centrifugal force. The whole procedure is performed in a centrifuge at temperatures ranging from 4°C to ambient conditions. It presents many advantages; for example, proteins are kept in solution and are more easily handled. Its major disadvantages are the cost of the approach and the fact that even traces of the filter materials, when eluted, produce significant problems in MS based methodologies. Solid phase extraction in combination with MS is a newly added approach for urine clinical proteomics (22). SPE in the form of magnetic particles was recently developed as the front end of direct profiling of biological fluids by MS (23). We have found that acetone or TCA precipitation and ultrafiltration are very efficient urinary protein preparation approaches, highly compatible with 2DE analysis (Figs. 1, 2). In the case of MALDI MS profiling, we favor the utilization of ultrafiltration, SPE, and direct dilution of urine in MS compatible buffers as front end protein preparation methods (see Note 2, Fig. 3). The detailed protocols are provided below.
Fig. 1. Comparison of urinary sample preparation approaches. Lanes correspond to: (1) marker, (2) urine starting material, (3) TCA/acetone precipitation supernatant, (4) TCA precipitate, (5) urine supernatant after 3 h centrifugation at 200,000 RCF, (6) protein pellet after ultracentrifugation of 5 mL urine, (7) urine filtrate after ultrafiltration through 5 kd MWCO, and (8) urine retentate after ultrafiltration. In lanes 2, 3, 5, and 7 equal volumes of urine sample were utilized; similarly, lanes 4 and 8 correspond to same amount of starting urine material in order to facilitate comparison of the approaches.
2. Materials
2.1. Sample Collection, Handling, and Storage
1. Polypropylene aliquoting tubes (1.5, 2, 15, and 50 mL), Sarstedt Corporation (Nümbrecht, Germany)
2.2. Urine Sample Preparation/Protein Precipitation
2.2.1. TCA/Acetone Precipitation Protocol
1. Trichloroacetic acid, ultra pure (store solutions at 2–8°C), Sigma Corporation (St. Louis, MO, USA)
2. Acetone, analytical purity grade, Sigma Corporation
2.2.2. Organic Solvent Precipitation Protocol
1. Acetone, analytical purity grade, Sigma Corporation
2. Isopropanol, analytical purity grade, Sigma Corporation
3. Ethanol, analytical purity grade, Sigma Corporation
2.2.3. Urine Ultrafiltration
1. Amicon ultrafiltration devices, Millipore Corporation (Billerica, MA, USA)
2.2.4. Urine SPE
1. Bioselect C18 SPE cartridges, Grace Vydac (Columbia, MS, USA)
2. Methanol, high performance liquid chromatography (HPLC) grade, Sigma Corporation
3. Acetonitrile, HPLC grade, Sigma Corporation
4. Trifluoroacetic acid, HPLC grade, Sigma Corporation
2.3. Analytical/Profiling Techniques
2.3.1. Two-Dimensional Separation
1. Protean isoelectric focusing (IEF) cell, Biorad (Hercules, CA, USA)
2. Nonlinear immobilized pH gradient (IPG) strips (pH 3–10), 17 cm long
3. 2DE sample buffer: 7 M urea, 2 M thiourea, 4% CHAPS w/v, 0.4% 1,4-dithioerythritol (DTE) w/v, 2% IPG buffer (Biorad) w/v. All components are of molecular biology grade
4. Mineral oil
5. Equilibration buffer I: 6 M urea, 50 mM Tris–HCl, pH 8.8, 30% glycerol, 2.0% sodium dodecylsulfate (SDS), 30 mM DTE
6. Equilibration buffer II: 6 M urea, 50 mM Tris–HCl, pH 8.8, 30% glycerol (v/v), 2.0% SDS (w/v), 230 mM iodoacetamide. All components are of molecular biology grade
7. Fixation solution: 5% phosphoric acid (p.a. grade, Sigma) w/v, 50% methanol v/v (HPLC grade, Sigma)
8. Colloidal coomassie brilliant blue staining kit, Invitrogen (Carlsbad, CA, USA)
9. GS-800 calibrated densitometer and PDQuest software, Biorad
2.3.2. MALDI-TOF-MS
1. Matrix solution: 50% acetonitrile v/v, 0.1% trifluoroacetic acid (TFA) v/v, 0.75% α-cyano-4-hydroxycinnamic acid (CHCA), Sigma Corporation. Caution: all MALDI matrices are light sensitive; avoid unnecessary light exposure. Fresh preparation is advised; otherwise, store at 4°C and keep for 1 week at most
2. MALDI ground steel target plate
3. Ultraflex I MALDI-TOF-TOF-MS (Bruker Daltonics, Bremen, Germany)
4. FlexAnalysis 2.2 software, Bruker Daltonics
2.4. Miscellaneous
HPLC grade water (resistivity >18 MΩ·cm, total organic carbon (TOC) <2 ppb) was utilized throughout the whole experimental process, and its use is advised for all electrophoretic and MS approaches.
3. Methods
3.1. Sample Collection, Handling, and Storage
1. Collect urine sample (see Note 3).
2. If immediate processing and storage at –80°C is not possible, store at 4°C for a short period of time (up to 6 h) (see Note 4).
3. Remove cellular components by centrifugation at approximately 3000 relative centrifugal force (RCF) for 20–30 min at 4–8°C. Aliquot the supernatant (4–5 mL each aliquot) and store at –80°C, until further use (see Note 5).
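Centrifuge speeds in this chapter are given as RCF, whereas many centrifuges are set in revolutions per minute. The standard conversion, RCF = 1.118 × 10^-5 × r × rpm^2 with the rotor radius r in centimeters, can be sketched as follows. The 10 cm radius in the example is a hypothetical value; use the radius specified for the rotor in hand.

```python
import math

RCF_CONSTANT = 1.118e-5  # for radius in cm and speed in rpm

def rcf_for_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) at a given speed and rotor radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2

def rpm_for_rcf(rcf, radius_cm):
    """Speed (rpm) needed to reach a given RCF at a given rotor radius."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))

# Step 3 asks for approximately 3000 RCF; radius_cm=10.0 is an assumed rotor
rpm = rpm_for_rcf(3000, radius_cm=10.0)
print(round(rpm))
```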
3.2. Urine Sample Preparation/Protein Precipitation
3.2.1. TCA/Acetone Precipitation Protocol
1. Thaw a urine sample aliquot (see Note 6).
2. Add TCA to a final concentration of 15%, vortex, and store overnight (o/n) at 4°C (see Note 7).
3. Centrifuge in a standard refrigerated bench-top centrifuge (for Eppendorf-type tubes) for 15 min at an RCF of 16,000–17,000 and 4°C. Discard the supernatant (see Note 8).
5. Wash pellet with ice-cold acetone (–20°C), leave for 5–10 min at –20°C, and centrifuge again for 15 min at 16,000–17,000 RCF. Discard supernatant and repeat the washing step once more (see Notes 9, 10).
Fig. 2. Two-dimensional profiling of (A) 24 h collected urine concentrated by (1) ultrafiltration through 5000 MWCO, (2) TCA precipitation, (3) acetone precipitation without washing of the protein pellet, and (4) acetone precipitation with pellet washing. In these cases (1,2,3,4), the starting material was preconcentrated via membrane filtration (Pellicon 2 system, Millipore Corporation); ultrafiltration and TCA or acetone precipitation, as applicable, were applied for the further concentration of the sample prior to 2DE analysis. (B) Two-dimensional profiling of random catch urine (50 mL starting volume without any preconcentration) concentrated via (1) ultrafiltration through 5000 MWCO and (2) acetone precipitation. In all cases, 1 mg of protein was analyzed and visualized with colloidal coomassie stain in 3–10 nonlinear IPG strips.
6. Let pellet dry at ambient temperature (see Note 11).
7. Solubilize pellet in 2DE sample buffer and proceed with 2DE analysis (see Subheading 3.3.1, Note 12, and Fig. 2).
8. The protein pellet may also be subjected to solubilization with MS compatible buffers and analyzed by MS profiling (see Note 13, Subheading 3.3.2, and Fig. 3).
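The volume of TCA stock needed to reach the 15% final concentration in step 2 follows from a mass balance, sketched below. The 100% (w/v) stock concentration and the 1 mL sample volume are assumptions for illustration, since the protocol does not state the stock strength; volumes are assumed to be additive.

```python
def stock_volume_for_final(sample_ul, stock_pct, final_pct):
    """Volume of stock to add so that the mixture reaches final_pct.

    Solves stock_pct * v = final_pct * (sample_ul + v) for v.
    """
    if final_pct >= stock_pct:
        raise ValueError("final concentration must be below stock concentration")
    return sample_ul * final_pct / (stock_pct - final_pct)

# stock_pct=100.0 is a hypothetical TCA stock; sample_ul=1000.0 is an example
tca_ul = stock_volume_for_final(sample_ul=1000.0, stock_pct=100.0, final_pct=15.0)
print(round(tca_ul, 1))
```

With these assumed values roughly 176 μL of stock is added per milliliter of urine; a weaker stock requires a proportionally larger volume.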
3.2.2. Organic Solvent Precipitation Protocol
1. Add to the urine sample at least an equal volume of the desired organic solvent (ethanol, acetone, or isopropanol) and mix (see Notes 14, 15).
2. Keep at –20°C o/n (see Note 16).
[Fig. 3 plot: stacked spectra A–G; y-axis, intensity (×10^4); x-axis, mass to charge (1000–10,000).]
Fig. 3. MALDI-TOF-MS profiling of urine. (A) Ultrafiltration retentate through 5000 MWCO, diluted 10× in 0.1% TFA; (B) 10× dilution of urine in 0.1% TFA; (C) supernatant of urine (diluted in 0.1% TFA) after protein precipitation via acetone; (D) urine protein pellet from acetone precipitation reconstituted in 0.1% TFA; (E) urine protein pellet from acetone precipitation reconstituted in 50% acetonitrile, 0.1% TFA; (F) acetone precipitation (supernatant) and further purification of the supernatant by C18-SPE followed by dilution in 0.1% TFA; (G) C18-SPE eluate in 50% acetonitrile, 0.1% TFA. Extensive reproducibility studies indicated that urine processing by ultrafiltration or direct dilution in 0.1% TFA provides the most robust spectra of the methods tested. Adapted from (13).
3. Centrifuge in a standard refrigerated bench-top centrifuge (for Eppendorf-type tubes) for 15 min at an RCF of 16,000–17,000 and 4°C. Discard the supernatant.
4. Wash pellet with ice-cold acetone, leave for 5–10 min at –20°C, and centrifuge again. Discard supernatant and repeat the washing step once more (see Note 17).
5. Let pellet dry at ambient temperature.
6. Solubilize pellet and proceed with 2DE analysis. The protein pellet or supernatant may also be subjected to solubilization with MS compatible buffers and analyzed by MS profiling (see Notes 12, 13, Subheading 3.3.1, Figs. 2, 3).
3.2.3. Urine Ultrafiltration
1. Place one volume of urine in a 5 kd (5000 d) molecular weight cut-off (MWCO) Amicon ultrafiltration device (see Notes 18–20).
2. Spin in a refrigerated centrifuge at 3500 RCF and 8–12°C (see Notes 21, 22).
3. After concentration, collect the retentate and discard or keep the filtrate depending on the specific application (see Notes 23–25).
4. For 2DE, add the appropriate volume of sample buffer to the retentate and proceed with IEF (see Notes 26–27, Subheading 3.3.1, and Fig. 2).
5. For MALDI profiling, dilute the retentate 10 times with 0.1% TFA v/v, and proceed as described below (see Subheading 3.3.2, Fig. 3).
3.2.4. Urine SPE (see Note 28)
1. Activate cartridge with a total of 1 mL methanol (two applications of 500 μL each).
2. Wash cartridge with 2 mL acetonitrile (four applications of 500 μL each, see Note 29).
3. Equilibrate cartridge with a total of 1 mL 0.1% TFA v/v (two applications of 500 μL each).
4. Load cartridge with 1 mL urine acidified by TFA at 0.1% (v/v) final concentration.
5. Wash cartridge with 1 mL 0.1% TFA v/v (two applications of 500 μL each).
6. Elute compounds by adding 100 μL of 50% acetonitrile, 0.1% TFA v/v.
7. Take 1 μL eluate, place on MALDI target, and process for MALDI MS profiling (see Subheading 3.3.2, Fig. 3).
3.2.5. Direct Dilution of Urine
This method is used only in conjunction with direct MALDI MS profiling.
• Dilute urine 10 times with 0.1% TFA v/v (see Notes 30, 31).
• Apply 1 μL of the urine sample on MALDI target.
• Apply 1 μL matrix solution.
• Proceed with MALDI-TOF-MS (see Subheading 3.3.2, Fig. 3).
3.3. Analytical/Profiling Techniques 3.3.1. Two-dimensional Separation 1. Measure protein concentration of the sample (pretreated by precipitation or ultrafiltration) by the use of a commercially available protein kit. 2. Take 0.5–1 mg of urinary proteins diluted in 300 μL of 2DE sample buffer (see Note 32). 3. Distribute the sample volume equally in a lane of the IEF focusing tray. 4. Place the strip carefully, with the gel face down and in contact with the electrodes (see Note 33). 5. Rehydrate actively for 16 h at 50 V and 20°C. Caution: do not cover the strip with mineral oil immediately but after 1 h of rehydration (see Note 34). 6. After rehydration, place moistened IEF papers between the strip and electrodes. 7. Start IEF. The typical program is: 250 V for 30 min, linear increment up to 5000 V in 12 h, 5000 V for 16 h (total 110,000 V-h) (see Note 35).
Zerefos and Vlahou
8. After IEF is complete, equilibrate strip with 10 mL equilibration buffer I for 20 min at ambient temperature. 9. Alkylate with 10 mL equilibration buffer II for 20 min (see Note 36). 10. Place strip on top of 12.5% polyacrylamide gel, cover with 0.5% melted agarose in TGS buffer and start second dimension. Start with 10 mA current for 1 h and continue with 40 mA for approximately another 4 h (see Note 37). 11. Fix gel for 2 h with fixation solution. 12. Stain o/n with colloidal coomassie blue stain (Fig. 2).
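The total of about 110,000 V-h quoted for the IEF program in step 7 can be checked with a short calculation. The sketch below assumes that the ramp starts at 250 V and rises linearly in time to 5000 V, so that each segment contributes its mean voltage multiplied by its duration. The function name and the segment representation are hypothetical.

```python
def volt_hours(program):
    """Integrate voltage over time for a list of (start_V, end_V, hours) segments.

    A hold is written with start_V == end_V; a linear ramp contributes the
    mean of its start and end voltages multiplied by its duration.
    """
    total = 0.0
    for start_v, end_v, hours in program:
        total += 0.5 * (start_v + end_v) * hours
    return total


# Typical program from step 7 (ramp start voltage of 250 V is an assumption).
TYPICAL_PROGRAM = [
    (250, 250, 0.5),    # 250 V for 30 min
    (250, 5000, 12),    # linear increase to 5000 V in 12 h
    (5000, 5000, 16),   # 5000 V for 16 h
]

if __name__ == "__main__":
    print(volt_hours(TYPICAL_PROGRAM))
```

Under these assumptions the program integrates to 111,625 V-h, which agrees with the rounded total given in the protocol. The same function can be used to rescale a program when a different strip type or sample load is used (see Note 35).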
3.3.2. MALDI-TOF-MS 1. Place 1 μL sample on the MALDI target plus 1 μL matrix solution and mix on the spot (dried droplet technique, see Notes 38 and 39). 2. Leave the target to dry at ambient temperature in the dark. 3. Load the sample in the instrument and execute the appropriate MS method. Run the instrument in linear mode (see Note 40). 4. Optimize ion acceleration; tampering with the sensitivity of the detector is not recommended prior to MS method establishment (see Note 41). 5. Set pulsed ion extraction (delayed ion acceleration) according to the profiling region in use. Typically, when α-cyano-cinnamic acid is utilized, 50–150 ns are applied for large peptides (3–5 kd), 150–300 ns for small molecular weight proteins (15 kd), and more than 300 ns for proteins (>20 kd, see Notes 42 and 43). 6. Collect 1000–2000 shots per sample and sum the collected data (see Note 44).
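The summing of shots in step 6 can be sketched as below. This is an illustration only and not the acquisition software of any instrument. It assumes that each region of a spot yields one sub-spectrum on a shared m/z axis and that every region contributes the same number of shots, which is one possible reading of Note 44. All names are hypothetical.

```python
RECOMMENDED_TOTAL_SHOTS = (1000, 2000)  # range given in step 6


def sum_spectra(subspectra):
    """Sum sub-spectra acquired on different regions of one spot.

    subspectra: list of (n_shots, intensities) pairs, where intensities is a
    list of detector counts on a shared m/z axis.
    Returns (total_shots, summed_intensities).
    """
    if not subspectra:
        raise ValueError("at least one sub-spectrum is required")
    shot_counts = {n_shots for n_shots, _ in subspectra}
    if len(shot_counts) != 1:
        raise ValueError("every region must contribute the same number of shots")
    lengths = {len(intensities) for _, intensities in subspectra}
    if len(lengths) != 1:
        raise ValueError("all sub-spectra must share the same m/z axis length")
    total_shots = sum(n_shots for n_shots, _ in subspectra)
    # Add the intensities channel by channel across all regions.
    summed = [sum(values) for values in zip(*(i for _, i in subspectra))]
    return total_shots, summed


def within_recommended_total(total_shots):
    """True if the summed shot count lies in the range given in step 6."""
    low, high = RECOMMENDED_TOTAL_SHOTS
    return low <= total_shots <= high
```

For example, five regions of 200 shots each give a summed spectrum of 1000 shots, which lies at the lower end of the recommended range.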
4. Notes
1. Dialysis is one of the most classical methods for buffer exchange and for separating high from low molecular weight constituents of a sample. Although it has been utilized elsewhere (20), we consider it rather laborious and costly, and it serves solely purification and not condensation purposes. Ultracentrifugation has been applied (21) for the isolation of higher molecular weight urinary proteins prior to 2DE (Fig. 1). In our opinion, centrifugal isolation of proteins is a very diverse and complicated issue, and reproducibility is consequently compromised. Precipitation of biopolymers by ultracentrifugation requires solutions of very well calculated composition in order to derive the velocity for protein isolation from the theoretical Svedberg values. Urine samples differ too much in density (d = m/v) and pH to serve such purposes in a well-defined and reproducible manner.
2. It should be emphasized that the various methods are extensively complementary; the combined application of different methods is therefore recommended in order to increase protein resolution.
3. Urine samples can be first void, midstream, morning, random catch, or 24 h. Due to its high bacterial content, first morning urine is usually not recommended in biomarker discovery studies.
4. Upon collection, if not stored immediately at –80°C, urine samples should be stored at 4°C. Published data (9,10) indicate that, for analysis by 2DE or SELDI/MALDI MS, the generated proteomic profiles are usually stable for up to 24 h of urine storage at 4°C prior to deep freezing. We have observed occasional profile changes after such prolonged storage at 4°C, and we therefore favor shorter times.
5. The soluble supernatant may be enriched for cellular proteins if a mild sonication (sonicator bath) for 5–10 min is applied prior to the centrifugation step.
6. The volume of urine required depends on the specific downstream application. For 2DE analysis an aliquot of at least 15 mL of urine is required. For direct MALDI MS profiling a 1 mL urine aliquot is sufficient.
7. The TCA can be added as a solid to a final concentration of 15% (w/v) (TCA is extremely hygroscopic and is easily solubilized). Alternatively, the appropriate volume of 100% (w/v) TCA may be added to the urine sample to reach a final concentration of 15% (w/v). TCA precipitation can also be performed at –20°C with o/n storage, occasionally with slightly better efficiency. Caution: TCA solutions may form bilayer aqueous–organic systems at –20°C or lower temperatures, depending on the salt concentration of the urine. The precipitation efficiency depends on the protein concentration of a given sample; in our experience, for example, the precipitation yield for a starting material of 0.5 mg/mL protein concentration (i.e., 1 mg total protein in 2 mL sample) ranges from 40 to 70%; in contrast, the precipitation efficiency for a starting material of 0.1 mg/mL protein concentration (i.e., 1 mg protein in 10 mL sample) is 0–30%. For this reason, avoid adding TCA solution to very dilute protein samples.
8. Where the highest available centrifugal force is only 4000–5000 RCF, longer centrifugation times (45 min) are recommended.
9. The volume of acetone utilized for washing depends on the size of the protein pellet. A general rule is to use 1 mL acetone for every 1 mL of urine starting material.
10. Acetone washes are needed to drive off excess TCA; otherwise the pellet is extremely acidic and neutralizes the buffers utilized in further steps. In addition, TCA (a nonvolatile acid) may inhibit IEF, PAGE, LC, or MS analysis. We have found that acetone washes of the pellet do not induce significant protein losses.
11. The pellet should not be dried completely, since this makes its subsequent solubilization in 2DE or other buffers difficult. Acetone evaporation at elevated temperatures is not recommended for the same reason.
12. If the pellet does not go into solution, try mild sonication (5 min in a sonicator bath) or incubate at ambient temperature for 30 min with intermittent vortexing. However, heating should be avoided (particularly if the pellet is resuspended in 2DE buffer, since urea decomposes when heated and reacts with amino acids). The buffer volume required for solubilization depends on the protein content (pellet size) and the type of downstream application (2DE or MALDI-TOF-MS).
13. The protein pellet may be solubilized in 0.1% TFA v/v (roughly 100 μL of solubilization buffer for every milliliter of urine starting material) and analyzed by MALDI-TOF-MS. However, in our experience, plasticizers possibly extracted during the precipitation process are frequently detected and reproducibility problems are observed. Therefore, unless additional purification steps are introduced (SPE, etc.), we do not favor the application of precipitation methods at the front end of MALDI MS profiling.
14. The use of ethanol, acetone, or isopropanol is favored. These are hydrophobic, water miscible (even at elevated salt concentrations), nontoxic, and volatile. In particular, we favor the use of acetone since it is cheap, extremely volatile, and rarely forms aqueous–organic bilayers. Organic solvent mixtures, e.g., isopropanol–acetone, do not increase precipitation efficiencies; in our experience their use induces reproducibility problems and is therefore not recommended.
15. The sample to solvent ratio depends on the downstream application and the sample protein concentration. For dilute urine samples (protein concentration in the micrograms per milliliter range) a solvent to sample ratio of 3 provides relatively high precipitation efficiencies. We have observed that for more concentrated samples (for example, preconcentrated urine or, in general, starting material with a protein content in the milligrams per milliliter range), the precipitation efficiency for lower MW constituents reaches its maximum at a solvent to sample ratio of about 9.
16. Precipitation is most efficient at –20°C (lower efficiencies have been observed at 4°C, whereas at –80°C bilayer systems may form, which inhibit the procedure).
17. Acetone washes of the pellet are not customary in organic solvent precipitation protocols. From our experience, however, washing offers great advantages, especially when 2DE separation is the downstream application, since salts and other interfering substances are removed (Fig. 2). This washing step renders 2DE gels produced after acetone precipitation as good as those generated following TCA precipitation. Acetone washing induces negligible protein losses.
18. There are Amicon UF devices that can accommodate sample volumes of up to 4 mL (UF4) or 15 mL (UF15). We regularly utilize the UF4 devices when MALDI MS profiling is to be performed and UF15 when 2DE is the downstream application.
19. Amicon devices are available with several MWCO values. We propose the use of 5 kd (5000 Da) MWCO for the isolation and condensation of “total” urine protein content. The use of different MWCO values is advised for specific isolation of molecular weight groups (see also Note 25). It should be emphasized that UF is not an absolute size-exclusion separation method, and cross-contamination between different protein size groups is expected and regularly observed.
20. UF can be performed in the presence of chemical additives. The kind of additives in use depends on the downstream application (2DE, MALDI profiling, LC-MS, etc.), since in all cases chemical compatibility with the latter should be maintained. For example, we have observed that in the case of direct MALDI MS profiling most additives (detergents such as octyl-glucopyranoside, Triton X-100, and Tween-20, and organic solvents such as trifluoroethanol <5%, acetonitrile <20%,
and isopropanol <20%) should be avoided, since compounds eluting from the membrane cause significant ion suppression. From our experience, in the case of simple urine condensation via 5 kd MWCO for 2DE analysis no additives are necessary. In rare cases where proteins precipitate and plug the membrane filter, the use of urea (4–6 M solutions) with 0.1–1% 2DE-compatible detergent is recommended. The same applies if the removal of smaller proteins (e.g., application of 10 kd MWCO) is intended via UF. In this case, chaotropes disrupt protein–protein interactions and ease the separation of different MW groups.
21. The centrifugal forces are selected based on the manufacturer’s instructions.
22. The temperature of the UF procedure should be kept relatively low (8–12°C). In our experience, however, temperatures lower than 8°C may generate problems such as protein or urea precipitation. We therefore favor 8–12°C, since at this temperature the UF procedure is not significantly slowed and at the same time the rate of protein degradation is decreased.
23. Where the rate of condensation is very slow, pause the centrifugation, pipette the retentate up and down, and continue with the centrifugation.
24. Pipette up and down thoroughly when collecting the retentate in order to reduce adsorption losses. Additionally, the filter may be washed with a small volume of the desired buffer (e.g., sample buffer for 2DE). The latter approach is favored since it minimizes protein losses and increases reproducibility. Amicon UF devices have a minimum cut-off volume (approximately 50 and 250 μL for 4 and 15 mL devices, respectively). Always keep a record of the final volume, since that will allow the estimation of the condensation factor. For example, if you started with 15 mL of urine and finally collected 250 μL retentate, you have concentrated your sample’s higher MW constituents by 60×.
25. When specific isolation of molecular weight groups is desired, sequential UF may be performed starting from 50, then 30, then 10, or 5 kd MWCO; in each case keep the retentate and use the filtrate for concentration through the next UF device.
26. Buffer (0.1% TFA in the case of MALDI-TOF or sample buffer in the case of 2DE) may be added to the filtrate and the process of condensation through centrifugation may be repeated. Caution: in our opinion, washing of the sample with sequential UF for MALDI profiling is not advised; we have observed increased chemical noise, possibly attributable to accumulation of plasticizers.
27. When 2DE is to be performed, do not use wash buffers containing high concentrations of IEF-incompatible reagents. For example, even 50 mM Tris–HCl may cause problems in IEF. Always prefer no additives (for example, addition of ultrapure water) or urea solutions (with or without CHAPS) in concentrations compatible with the filter device. Caution: incompatible chaotrope concentrations may cause filter corrosion and sample loss.
28. Solid phase extraction exhibits a wide selection range for many types of separation; even isolation of proteins–peptides with specific characteristics
(e.g., phospho- or glycopeptides) is feasible, and this is what differentiates SPE from other sample preparation steps. From our point of view, SPE in combination with direct MS profiling is encouraged.
29. All chromatographic and SPE media contain residuals and plasticizers, which should be driven off prior to analyte binding. Failure to perform this step may result in complete ionization suppression during MALDI profiling.
30. The user may have to try different dilutions of the urine sample. In MALDI MS profiling experiments, there is a range of protein concentration within which the spectra quality is not affected. It is advised to conduct preliminary experiments in order to address this issue.
31. In addition to TFA, the use of several additives (urea, octyl-glucopyranoside, Triton X-100, Tween-20, NP-40, cholate, and organic solvents) at MALDI MS compatible concentrations has been tested on urinary peptide–protein ionization. However, we did not observe any clear advantage in protein resolution or ionization in these cases.
32. The recommended protein amount of 0.5–1 mg is suitable for strips of 17–18 cm length and 3–10 or 4–7 pH range. The protein amount will vary if different strip types are utilized, according to the manufacturer’s guidelines (for additional tips on 2DE see Chapters 4–6).
33. Noncup loading was found to provide better resolution in urine analysis by 2DE compared to the cup loading method.
34. Direct addition of the mineral oil might cause extraction of hydrophobic proteins to the oil layer.
35. These running conditions are for the analysis of a 1 mg protein sample on wide range (3–10 or 4–7) 17 or 18 cm IPG strips. The program will vary depending on the sample quantity and the type of strip in use.
36. Reduction and alkylation are necessary for higher protein resolution in SDS-PAGE and also for protein identification through peptide mass fingerprinting.
37. The low starting current is needed for the slow migration of the proteins from the strip to the polyacrylamide gel. Direct electrophoresis with 40 mA current may cause protein losses. Alternatively, the gel may be run at 10 mA o/n. Although slower, the latter approach provides gels of higher resolution, in our experience (for additional tips on 2DE see Chapters 4–6).
38. Several sample application techniques were tested (thin layer preparation, double layer, and variations of dried droplet). Of those, we found that dried droplet (with simultaneous sample and matrix application) was the simplest, fastest, and most reliable method. In addition, the simultaneous drying of sample and matrix solution (rather than sample and matrix separately) increases reproducibility and minimizes losses during subsequent spot washes. In contrast, if sample and matrix are mixed prior to their application on the target, their consumption is much higher and the sample exposure to plastics increases, thereby increasing the chances of sample contamination and subsequent ion suppression by plasticizers.
39. If crystal formation is hindered by high salt content in the sample, wash the spot by pipetting two to three times with 2 μL of cool 0.1% TFA solution v/v (let dry again, do not wipe dry). Always prefer spot-by-spot washing rather than washing the entire target, in order to avoid sample cross-contamination.
40. Instrument calibration is performed according to the manufacturer’s specifications. In any case, we propose daily calibration to ensure precision and accuracy.
41. Acceleration of biomolecules is affected first of all by the voltage settings of the ion source. Settings of the analyzer (TOF) mainly affect resolution parameters, while detector settings should be adjusted only to improve the signal to noise characteristics of a given sample.
42. The mass spectrum should be divided into subregions, and data for each subregion should be collected separately, in order to increase protein resolution. This is because ionization kinetics (and consequently instrument settings) are completely different for different protein sizes.
43. Different matrices (e.g., CHCA or dihydroxybenzoic acid for peptides and SA for proteins) require different laser focusing settings. In general, large crystals (such as the ones formed by SA) and larger protein molecules require more concentrated energy bursts than smaller ones, where more disperse hits may be used.
44. Always sum the same number of laser shots and select as many regions of a spot as possible to ensure high reproducibility.
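Note 7 describes adding 100% (w/v) TCA to urine until a final concentration of 15% (w/v) is reached, and reports precipitation yields for two starting concentrations. The calculation can be sketched as follows, assuming that the volumes of urine and TCA stock are simply additive. The function names are hypothetical and are introduced only for this illustration.

```python
def stock_volume_for_final_conc(sample_volume_ml, stock_percent, final_percent):
    """Volume of stock (mL) to add to a sample to reach the final concentration.

    Assumes additive volumes, so that
    final_percent = stock_percent * v / (sample_volume_ml + v).
    """
    if sample_volume_ml <= 0:
        raise ValueError("sample_volume_ml must be positive")
    if not 0 < final_percent < stock_percent:
        raise ValueError("final_percent must lie between 0 and stock_percent")
    return final_percent * sample_volume_ml / (stock_percent - final_percent)


def expected_recovery_mg(total_protein_mg, yield_low, yield_high):
    """Range of recovered protein (mg) for a fractional yield range."""
    return total_protein_mg * yield_low, total_protein_mg * yield_high


if __name__ == "__main__":
    # 15 mL urine aliquot (Note 6), 100% (w/v) stock, 15% (w/v) final.
    print(round(stock_volume_for_final_conc(15.0, 100.0, 15.0), 2), "mL TCA stock")
    # Yield range reported in Note 7 for 1 mg protein at 0.5 mg/mL.
    print(expected_recovery_mg(1.0, 0.40, 0.70))
```

For a 15 mL aliquot this gives roughly 2.65 mL of stock. The recovery range illustrates why Note 7 advises against TCA precipitation of very dilute samples.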
Acknowledgments This study was supported by the Greek Ministry of Health.
References
1. Norden, G.W.A., Sharratt, P., Cutillas, P.R., Cramer, R., Gardner, S.C. and Unwin, R.J. (2004) Quantitative amino acid and proteomics analysis: Very low excretion of polypeptides >750 Da in normal urine. Kidney International 66, 1994–2003.
2. Pisitkun, T., Johnstone, R. and Knepper, M.A. (2006) Discovery of urinary biomarkers. Molecular and Cellular Proteomics 5, 1760–1771.
3. Nielsen, M.E., Schaeffer, E.M., Veltri, R.W., Schoenberg, M.P. and Getzenberg, R.H. (2006) Urinary markers in the detection of bladder cancer: What’s new? Current Opinion in Urology 16, 350–355.
4. Thongboonkerd, V. and Malasit, P. (2005) Renal and urinary proteomics: Current applications and challenges. Proteomics 5, 1033–1042.
5. Pieper, R., Gatlin, C.L., McGrath, A.M., Makusky, A.J., Mondal, M., Seonarain, M., Field, E., Schatz, C.R., Estock, M.A., Ahmed, N., Anderson, N.G. and Steiner, S. (2004) Characterization of the human urinary proteome: A method for high-resolution display of urinary proteins on two-dimensional electrophoresis gels with a yield of nearly 1400 distinct protein spots. Proteomics 4, 1159–1174.
6. Oh, J., Pyo, J., Jo, E., Hwang, S., Kang, S., Jung, J., Park, E., Kim, S., Choi, J. and Lim, J. (2004) Establishment of a near-standard two-dimensional human urine proteomic map. Proteomics 4, 3485–3497.
7. Spahr, C.S., Davis, M.T., McGinley, M.D., Robinson, J.H., Bures, E.J., Beierle, J., Mort, J., Courchesne, P.L., Chen, K., Wahl, R.C., Yu, W., Luethy, R. and Patterson, S.D. (2001) Towards defining the urinary proteome using liquid chromatography-tandem mass spectrometry I. Profiling an unfractionated tryptic digest. Proteomics 1, 93–107.
8. Cutillas, P.R., Norden, A., Cramer, R., Burlingame, A. and Unwin, R.J. (2003) Detection and analysis of urinary peptides by on-line liquid chromatography and mass spectrometry: Application to patients with renal Fanconi syndrome. Clinical Science 104, 483–490.
9. Schaub, S., Wilkins, J., Weiler, T., Sangster, K., Rush, D. and Nickerson, P. (2004) Urine protein profiling with SELDI TOF MS. Kidney International 65, 323–332.
10. Rogers, M.A., Clarke, P., Noble, J., Munro, N.P., Paul, A., Selby, P.J. and Banks, R.E. (2003) Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis: Identification of key issues affecting potential clinical utility. Cancer Research 63, 6971–6983.
11. Vlahou, A., Schellhammer, P.F., Mendrinos, S., Patel, K., Kondylis, F.I., Gong, L., Nasim, S. and Wright, G.L. Jr. (2001) Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine. The American Journal of Pathology 158, 1491–1502.
12. Vlahou, A., Giannopoulos, A., Gregory, B.W., Manousakas, T., Kondylis, F.I., Wilson, L.L., Schellhammer, P.F., Semmes, O.J. and Wright, G.L. Jr. (2004) Protein profiling in urine for the diagnosis of bladder cancer. Clinical Chemistry 50, 1438–1445.
13. Zerefos, P.G., Prados, J., Kalousis, A. and Vlahou, A. (2007) Sample preparation and bioinformatics in MALDI profiling of urinary proteins. Journal of Chromatography B. Analyt Technol Biomed Life Sci. 15, 20–30.
14. Zürbig, P., Renfrow, M.B., Schiffer, E., Novak, J., Walden, M., Wittke, S., Just, I., Pelzing, M., Neusüß, C., Theodorescu, D., Root, K.E., Ross, M.M. and Mischak, H. (2006) Biomarker discovery by CE-MS enables sequence analysis via MS/MS with platform-independent separation. Electrophoresis 27, 2111–2125.
15. Mischak, H., Kaiser, T., Walden, M., Hillmann, M., Wittke, S., Herrmann, A., Knueppel, S., Haller, H. and Fliser, D. (2004) Proteomic analysis for the assessment of diabetic renal damage in humans. Clinical Science 107, 485–495.
16. Zerefos, P.G., Vougas, K., Dimitraki, P., Kossida, S., Petrolekas, A., Stravodimos, K., Giannopoulos, A., Fountoulakis, M. and Vlahou, A. (2006) Characterization of the human urine proteome by preparative electrophoresis in combination with 2-DE. Proteomics 6, 4346–4355.
17. Pang, J.X., Ginanni, N., Dongre, A.R., Hefta, S.A. and Opiteck, G.J. (2002) Biomarker discovery in urine by proteomics. Journal of Proteome Research 1, 161–169.
18. Sun, W., Li, F., Wu, S., Wang, X., Zheng, D., Wang, J. and Gao, Y. (2005) Human urine proteome analysis by three separation approaches. Proteomics 5, 4994–5001.
19. Soldi, M., Sarto, C., Valsecchi, C., Magni, F., Proserpio, V., Ticozzi, D. and Mocarelli, P. (2005) Proteome profile of human urine with two-dimensional liquid phase fractionation. Proteomics 5, 2641–2647.
20. Rasmussen, H.H., Orntoft, T.F., Wolf, H. and Celis, J.E. (1996) Towards a comprehensive database of proteins from the urine of patients with bladder cancer. The Journal of Urology 6, 2113–2119.
21. Thongboonkerd, V., McLeish, K.R., Arthur, J.M. and Klein, J.B. (2002) Proteomic analysis of normal human urinary proteins isolated by acetone precipitation or ultracentrifugation. Kidney International 62, 1461–1469.
22. Hortin, G.L., Meilinger, B. and Drake, S.K. (2004) Size-selective extraction of peptides from urine for mass spectrometric analysis. Clinical Chemistry 50, 1092–1095.
23. Zhang, X., Leung, S., Morris, C.R. and Shigenaga, M.K. (2004) Evaluation of a novel, integrated approach using functionalized magnetic beads, benchtop MALDI-TOF-MS with prestructured sample supports, and pattern recognition software for profiling potential biomarkers in human plasma. Journal of Biomolecular Techniques 15, 167–175.
9
Combining Laser Capture Microdissection and Proteomics Techniques
Dana Mustafa, Johan M. Kros, and Theo Luider
Summary Laser microdissection is an effective technique to harvest pure cell populations from complex tissue sections. In addition to the use of microdissected cells in DNA and RNA studies, it has been shown that the small number of cells obtained by this technique can also be used for proteomics analysis. Combining laser capture microdissection with different types of mass spectrometers has opened ways to find and identify proteins that are specific for various cell types, tissues, and their morbid alterations. Although the combination of microdissection with the currently available proteomics techniques has not yet reached the stage of genome-wide representation of all proteins present in a tissue, it is a feasible way to find significantly differentially expressed proteins in target tissues. Recent developments in mass spectrometric detection, followed by proper statistics and bioinformatics, make it possible to analyze the proteome of no more than 100–200 cells. Obviously, validation of the results is essential. The present review describes and discusses the various methods developed to target cell populations of interest by laser microdissection, followed by analysis of their proteome.
Key Words: laser capture microdissection; matrix-assisted laser desorption/ionization; Fourier transform mass spectrometry; time-of-flight mass spectrometry; liquid chromatography-electrospray ionization tandem mass spectrometry; two-dimensional polyacrylamide gel electrophoresis; differential in-gel electrophoresis; protein chip technology.
Abbreviations: LCM: Laser Capture Microdissection, LMM: Laser Microbeam Microdissection, LPC: Laser Pressure Catapulting, 2D PAGE: Two-dimensional Polyacrylamide Gel Electrophoresis, 2D DIGE: Differential In-gel Electrophoresis, SDS: Sodium Dodecyl Sulphate, MALDI-TOF/MS: Matrix-assisted Laser Desorption/Ionization Time-of-flight Mass Spectrometry, MALDI-FTMS: Matrix-assisted Laser Desorption/Ionization Fourier Transform Mass Spectrometry, LC-ESI-MS/MS: Liquid Chromatography-Electrospray Ionization Tandem Mass Spectrometry, HPLC: High Performance Liquid Chromatography, SELDI-TOF: Surface-enhanced Laser Desorption/Ionization Time-of-flight, ICAT: Isotope-coded Affinity Tag
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction Over the last years, significant progress in the analysis of the entire genome has triggered efforts to further analyze normal and abnormal protein expression patterns. There is, for instance, an eagerness to discover more and better diagnostic markers for specific diseases. High expectations of better biomarkers for improving diagnosis and monitoring treatment have initiated technical developments. Human tissues are usually composed of rather complex mixtures of different cell types. Many techniques have been used for the isolation of pure cell populations, and each technique has its advantages and limitations. For example, immunohistochemistry is an established and relatively easy technique for localizing protein expression. A drawback of immunohistochemistry is that it does not allow quantitative assessment of proteins. Another method to obtain information about particular cell populations is growing cell cultures in order to amplify target cells. Despite the technical feasibility of this technique, the biological characteristics of the original cells may not be accurately preserved in an in vitro environment (1). Alternatively, xenografts mimic the normal situation better, but again this method reflects the real situation of cells in vivo only to some extent (2). Another way of separating cell populations for further investigation is flow cytometry, which has successfully been applied in the study of many disease processes. Flow cytometric analysis is applied to cell suspensions, and specific markers for the selection of cell populations are required. To the best of our knowledge, the combination of flow cytometry and subsequent mass spectrometry (MS) has not yet been described for the analysis of solid tissues. In this review, we discuss methods of cell purification and harvesting by laser microdissection that are currently applied for further MS analysis.
2. Laser Capture Microdissection In order to select for specific cell populations in heterogeneous tissues, several microdissection techniques have been described. Most techniques involve the use of a needle to scrape off cells of interest under direct microscopic
visualization (3,4). This method, however, tends to be slow, tedious, and highly operator dependent (2). In 1992, Shibata and coworkers described a new method of cell isolation. They used a specific pigment placed over small numbers of cells in a tissue section, which served as an umbrella protecting the covered cells from being destroyed. Ultraviolet light was used to destroy the DNA/RNA of the uncovered cells (5). Shortly afterward, laser capture microdissection (LCM) under direct microscopic visualization was developed by Liotta and coworkers at the National Cancer Institute. This way of target cell isolation permits rapid, reliable laser microdissection to collect specific cell populations from a section of a complex, heterogeneous tissue (6). For this approach, a tissue section is placed in the holder of an inverted microscope. A transparent, thermoplastic polymer coating [e.g., ethylene vinyl acetate (EVA) (7)] is placed in contact with the tissue. The EVA polymer is positioned over microscopically selected cell clusters, and the polymer is then precisely activated by a near-infrared laser pulse steered by the investigator. The laser activation of the polymer results in specific binding to the targeted area. When the EVA and the tissue bound to it are removed from the section, the selected cell aggregates are isolated for molecular analysis (8). LCM is compatible with a variety of cellular staining methods and tissue preservation protocols (9). Depending on the microlaser dissection device used, the collection caps are positioned in different ways. For instance, the caps in the PixCell II (Arcturus Engineering, Mountain View, CA, USA) technique make contact with the tissue sections; therefore, strict requirements for preparations apply.
The PALM microlaser dissector (PALM Microlaser Technologies AG, Bernried, Germany) provides a powerful separation in which an important application of the cutting UV laser is laser microbeam microdissection combined with laser pressure catapulting (10). Specific glass slides covered with a polyethylene naphthalate membrane aid in stabilizing the morphological integrity of the captured area (11) (Fig. 1). In this method, collecting caps do not make any contact with the tissue sections, which increases the flexibility with respect to section preparation (12). Both LCM techniques are specific enough to dissect single cells. The PALM can dissect smaller sections of tissue than the PixCell system. The two methods of microdissection yield RNA retrievals of comparable quality and quantity, but they have not been directly compared with regard to recent developments in protein retrieval for mass spectrometric applications (13). The collection of large quantities of cells by LCM is a time-consuming procedure requiring the microscopical visualization of the cells of interest in a stained tissue section before lasering. The software and the hardware of the different types of laser microdissection are still developing.
Fig. 1. A scheme that represents the principle of laser capture microdissection. Labeled elements in the figure: buffer droplet, cap, microdissected tissue, tissue section, PEN membrane, slide, stage, laser, objective.
3. LCM and Two-Dimensional Gel Electrophoresis A new development is the application of LCM for protein retrieval from tissues for further analysis by proteomic techniques. So far, several approaches have been applied to cells obtained by laser microdissection. In 2000, Emmert-Buck and coworkers applied two-dimensional polyacrylamide gel electrophoresis (2D PAGE) to 50,000 microdissected epithelial cells (14). They compared tumor cells and normal controls from two patients with oesophageal cancer (14). Silver staining of the gels visualized 675 distinct proteins and isoforms. Seventeen differentially expressed spots were further analyzed by MS. This resulted in the identification of two specific proteins, cytokeratin 1 and annexin I. It was assumed that these proteins were present in an abundance range of 50,000–1,000,000 copies per cell (14). Using colon cancer as a model, Lawrie and coworkers also showed the feasibility of investigating protein expression by combining LCM with proteome analysis technologies such as 2D PAGE and MS (15). To overcome the limitation that LCM produces relatively low numbers of cells, an extra step has been added to the separation method. In addition to the 2D PAGE of the microdissected cells, an extra 2D PAGE of the whole section of the same set of samples can be useful. The comparison of silver-stained 2D gels created from microdissected epithelial cells of ovarian cancer with 2D gels created from the whole section of the same ovarian samples facilitated the discovery of 23 proteins differentially expressed between low malignant potential and invasive ovarian cancers (16). In-gel digestion of the specific gel spots followed by MS/MS analysis resulted in the identification of glyoxalase I, RhoGDI, and a 52 kDa FK506 binding protein (16). In another study based on 2D PAGE, 315 protein spots were identified by collecting 100,000 cells by LCM of normal and cancer ductal units from breast tissue
Combining LCM and Proteomics Techniques
sections (17). Subsequent measurement of the spots by MS resulted in the identification of 57 differentially expressed proteins between the two groups of samples (17). The relatively low number of microdissected cells emphasizes the importance of loading equivalent amounts of protein on the gels. Shekouh and coworkers (18) therefore followed a strategy to increase the accuracy of 2D PAGE from LCM samples. The samples were first separated by one-dimensional sodium dodecyl sulphate (SDS)-PAGE, stained with silver, and subsequently subjected to densitometry. The staining intensity was then used to normalize the samples. The silver-stained 2D PAGE images from 50,000 microdissected adenocarcinoma cells were compared with the images from whole sections of pancreatic samples. Spots of interest were subjected to MALDI-TOF/TOF MS, resulting in the identification of S100A6 as an overexpressed protein in pancreatic cancer cells (18). The same methodology has been used to understand the role of a specific molecule, HER-2/neu, in breast cancer (19). About 50,000–70,000 cells were microdissected from breast cancer tissue of three HER-2/neu-positive tumors and three HER-2/neu-negative tumors. This led to the detection of about 500–600 protein spots in each gel. The comparison of these two groups allowed the identification of cytokeratin 19 (CK19) as an overexpressed protein in HER-2/neu-positive breast cancer patients (19). In another study, the 2D PAGE of 10,000 microdissected cells of hepatocellular carcinoma (HCC) samples was compared with that of normal surrounding tissue. The investigators visualized about 868 spots, of which 20 were considered differentially expressed proteins. The digestion of these proteins into peptides was followed by ESI-MS/MS, which allowed the identification of 11 proteins. Four of these 11 proteins were considered novel candidate markers of hepatitis B-related HCC (20).
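The densitometry-based normalization described above can be illustrated with a short calculation. The sketch below is not taken from the cited study; it assumes that the integrated lane intensity of the silver-stained 1D gel is proportional to the amount of protein loaded, and the function name, sample names, intensities, and volumes are hypothetical.

```python
def normalized_load_volumes(lane_intensity, loaded_volume_ul, target_volume_ul):
    """Scale loading volumes so that every sample delivers the same protein amount.

    lane_intensity   -- dict: sample name -> integrated intensity of its 1D lane
    loaded_volume_ul -- lysate volume (microliters) loaded in each 1D lane
    target_volume_ul -- volume to load on the 2D gel for the most dilute sample
    """
    # Intensity per microliter is used as a relative protein concentration
    # (assumption: staining is linear over the measured range).
    conc = {s: i / loaded_volume_ul for s, i in lane_intensity.items()}
    weakest = min(conc.values())
    # The most dilute sample receives the full target volume;
    # more concentrated samples are scaled down proportionally.
    return {s: target_volume_ul * weakest / c for s, c in conc.items()}


# Hypothetical densitometry readings for two LCM lysates.
volumes = normalized_load_volumes({"tumor": 8000.0, "normal": 5000.0}, 10.0, 20.0)
print(volumes)
```

With these made-up readings the more concentrated tumor lysate is loaded at a proportionally smaller volume, so that both gels receive an equivalent amount of protein.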
This approach of separating the proteins of microdissected cells by 2D PAGE, followed by in-gel protein digestion and MS measurements for the identification of biomarkers, has been applied to a wide range of cancers, using various numbers of microdissected cells. Successful applications of 2D electrophoresis have used between 10,000 and 100,000 cells harvested by LCM (Table 1).
4. LCM and Differential In-Gel Electrophoresis
In 2002, Zhou and coworkers described a new technique called differential in-gel electrophoresis (DIGE) (21). Two pools of proteins are labeled with the fluorescent dyes 1-(5-carboxypentyl)-1-propylindocarbocyanine halide (Cy3) N-hydroxysuccinimidyl ester and 1-(5-carboxypentyl)-1-methylindodicarbocyanine halide (Cy5) N-hydroxysuccinimidyl ester (21). The labeled proteins are mixed and separated in the same 2D gel. This strategy improves
Table 1 Overview of Different Methods to Combine Laser Microdissection and Different Proteomics Techniques

| Tissue used | Reference | Number of samples/study | Separation technique | Number of microdissected cells/sample | Number of visualized proteins | Identification technique | Number of significant differentially identified proteins |
|---|---|---|---|---|---|---|---|
| Esophageal cancer | (14) | 2 cancer samples and 2 normal samples | 2D PAGE, silver staining | 50,000 | Approximately 675 distinct proteins including isoforms | Mass spectrometry and immunoblot analysis | n = 2; cytokeratin 1 and annexin I |
| Colon cancer | (15) | 2 cancer samples and 2 normal samples | 2D PAGE, silver staining | 1–5 μg of total cellular protein | Not determined | Mass spectrometry data from all the protein spots cut from the gels | n = 3; cytokeratin 8, cytokeratin 18, and β-actin |
| Ovarian cancer | (16) | 3 invasive OV and 2 noninvasive (LMP) OV | 2D PAGE, silver staining | 50,000 | 23 differentially expressed proteins were discussed | ESI-MS identification from gels made of whole sections | n = 3; FK506 binding protein, glyoxalase I, and RhoGDI |
| Breast cancer | (17) | 6 samples of DCIS and 6 samples of normal ductal/lobular units | 2D PAGE, silver staining | 100,000 | 315 protein spots | MS identification from gels made of whole sections | n = 57 observed proteins; n = 2 after confirmation |
| Pancreas cancer | (18) | 4 cancer samples and 4 normal samples | 2D PAGE, silver staining | 50,000 | 800 protein spots | MALDI-TOF/TOF | n = 1; calcium-binding protein, S100A6 |
| HER-2/neu-positive breast cancer cells | (19) | 3 HER-2/neu-positive samples and 3 HER-2/neu-negative samples | 2D PAGE, silver staining | 50,000–70,000 | 500–600 protein spots | MALDI-TOF mass spectrometer | n = 7; cytokeratin 19, tropomyosin 3, aldolase A, glyoxalase I, cathepsin D chain 3, albumin, and MnSOD |
| Hepatic cancer cells, hepatitis B-positive cells | (20) | 10 hepatic cancer cell samples | 2D PAGE, silver staining | 10,000 | 868 protein spots | Nano-flow ESI-MS/MS | n = 11 proteins, four of them were novel markers |
| Esophageal carcinoma | (21) | One sample contained normal cells and one sample contained cancer cells | 2D DIGE, lysine-specific dyes | 250,000 | 1038–1088 protein spots | Capillary LC tandem mass analysis | n = 1; tumor rejection antigen (gp96) |
| Gastric metaplasia samples | (22) | One sample contained gastric mucosa and one SPEM | 2D DIGE, lysine-specific dyes | 30,000 | 1200 protein spots | MALDI-TOF measurements | No further identifications |
| Breast epithelium cells | (23) | Five samples contained malignant and normal breast tissue | 2D DIGE, lysine-specific dyes | 50,000 | Not applicable | MALDI-TOF and/or immunoblotting for protein identification | n = 32 |
| Gastric adenocarcinoma | (25) | Cultured oncogene-transduced epithelial cells and precancerous versus cancerous tissue | 2D DIGE, cysteine-specific dyes | 5000 | ∼1000 protein spots | MALDI-MS and MS/MS measurements | n = 40 |
| Kidney glomeruli | (26) | 3 different protein extracts from human glomeruli and 3 independent isolated glomeruli and cortex from 3 mice | 2D DIGE, cysteine-specific dyes | Between 100 and 10 glomeruli, which equals 0.5–3 μg protein | Between 1400 and 900 protein spots | Nano LC-ESI-MS/MS | n = 23 between mice glomeruli and mice cortex |
| Renal carcinoma | (30) | 2 samples contained renal cell carcinoma and normal kidney tissues | (IPG-IEF) 2D-PAGE gel | Proteins, 3.8 μg | Not applicable | Mass spectrometry | n = 29 |
| Breast cancer | (31) | 12 ER+/PR− and 12 ER+/PR+ tumors were grouped into four pools | (IPG-IEF) 2D-PAGE gel | Approximately <180 ng per multiplex protein sample per 54-cm gel | Not applicable | Mass spectrometry | Quantitative differences in 6 progesterone receptor proteins |
| Breast cancer cell line (SKBR-3) | (34) | 3 slides from the same cell culture | HPLC system | 10,000 | Not applicable | ESI mass spectrometry followed by MS/MS | n = 9 |
| Ductal carcinoma of the breast | (29) | 2 samples with invasive ductal carcinoma of the breast | 16O/18O isotopic labeling of peptides | 10,000 | Not applicable | Reverse-phase LC-ESI-MS/MS on an ion trap mass spectrometer | n = 76 |
| Prostate cancer | (42) | 17 prostate carcinoma samples that contained normal tissue and BPH tissue, and 7 BPH samples | Gel-free method | 30,000–50,000 | Not applicable | SELDI-TOF/MS | n = 1; prostate carcinoma-associated protein (PCa-24) |
| Endometrial cancer | (36) | 8 endometrioid adenocarcinomas, 4 proliferative endometria, and 4 secretory endometria | Gel-free method | ∼2000 | Not applicable | MALDI-TOF/MS | n = 2; calgranulin A and chaperonin 10 |
| Placenta samples | (37) | 1 placenta sample contained trophoblasts and surrounding stroma cells | Gel-free method | 150 | Not applicable | MALDI-TOF/MS | No protein identifications. Unique peptide pattern of ∼35 peptides for trophoblast and stroma cells |
| Breast cancer | (38) | 6 invasive ductal breast carcinomas containing cancer and normal cells | Gel-free method | 2000–2400 | Not applicable | MALDI-TOF/TOF mass spectrometry | No protein identifications. 9 differentially expressed peptides |
| Breast cancer | (39) | 2 replicate samples of breast cancer epithelial cells | Gel-free method | 3000 | Not applicable | Nano LC-FTICR mass spectrometry | n = 1003 proteins identified |
| Head and neck cancer | (41) | 57 head and neck tumor samples and 44 mucosa samples | ProteinChip technology | 3000–5000 | Not applicable | Isolation by two-dimensional gel electrophoresis and tandem mass spectrometry analysis | n = 1; annexin V |
| Colorectal cancer | (40) | 39 colorectal tumor samples, 40 normal mucosa samples, and 29 adenoma samples | ProteinChip technology | 3000–5000 | Not applicable | Isolation by reverse-phase chromatography and SDS-PAGE, then identification by MS/MS analysis | n = 1; heat shock protein 10 |

Abbreviations: 2DE: two-dimensional gel electrophoresis, OV: ovarian cancer, LMP: low malignant potential, DCIS: ductal carcinoma in situ, HCC: hepatocellular carcinoma, BPH: benign prostatic hyperplasia, SPEM: spasmolytic polypeptide expressing metaplasia, PR: progesterone receptor, ER: estrogen receptor
the sensitivity of detection and enlarges the range of candidate proteins for detection. Molecular weight- and charge-matched cyanine dyes enable multiplex labeling, with different samples run on the same gel. The same investigators described the combination of 2D DIGE with MS as a powerful tool for the molecular characterization of cancer progression and the identification of cancer-specific protein markers. They compared the 2D DIGE pattern of about 250,000 microdissected cells from oesophageal carcinoma with that of normal epithelial cells from the oesophagus. The cancer cell lysate yielded 1038 protein spots, while the normal epithelial lysate yielded 1088 protein spots. In-gel digestion of the differentially expressed protein spots was followed by capillary high performance liquid chromatography (HPLC) tandem mass analysis to achieve further identification. In this way, tumor rejection antigen (gp96) was found to be upregulated in oesophageal squamous cell cancer (21). Applying the same procedure to smaller numbers of microdissected cells from biopsy samples with gastric metaplasia appeared to be successful as well (22). Approximately 1200 spots were detected from 30,000 microdissected cells. Twenty-eight of these spots were overexpressed in the metaplasia samples as compared to the normal surface cells (22). However, subsequent MALDI-TOF measurements of the spots did not result in the identification of proteins. The same procedure was applied to 50,000 microdissected cells, resulting in the identification of 32 proteins in breast epithelial cancer cells, of which thirteen had not been associated previously with the tumors (23). One technical aspect of the 2D DIGE method needs special attention: the nature of the fluorescent dyes and their ability to bind to lysine residues only (21). Proteins with high percentages of lysine residues are labeled more efficiently than proteins containing little or no lysine.
The development of a new generation of dyes that react with cysteine residues has improved the sensitivity of DIGE (24). Although cysteine is in general less abundant than lysine in proteins, cysteine labeling can be carried to saturation. Lysine labeling must be limited to 1–3% of all the residues to prevent loss of solubility when bulky hydrophobic dyes are coupled to the polar lysine residues (24). Greengauz-Roberts and coworkers applied saturation labeling of cysteine residues to study about 5000 cells obtained by LCM of metaplasia and cancer cells. A total of 1471 distinct protein features were observed from this relatively small number of cells. Ninety-six of these spots were investigated further. MALDI-MS and MS/MS measurements, in addition to the specific position of the protein in the gel, resulted in the identification of 42 proteins in cancer samples (25). Sitek and coworkers also described a novel approach to analyze glomerular proteins from mouse and human samples using DIGE saturation labeling (26). Only 10 glomeruli (0.5 μg) picked by LCM from a slide of a human kidney biopsy appeared to be sufficient to visualize 900 spots using the DIGE technique (26). 2D DIGE holds several
Mustafa et al.
advantages over the conventional 2D gel. One of the most important is improved reproducibility. Gel-to-gel differences are minimized because the pooled samples are separated in the same gel. Therefore, differences in protein expression between two cell populations or samples can be assessed more accurately and identified more easily. Quantitative differences in protein content are also measured better with fluorescent dyes. In addition, 2D DIGE enables a higher throughput analysis of 2D gels because it is suited to automated gel imaging. Labeling of proteins with fluorescent dyes does not affect protein identification by MS, because only a small percentage of the molecules of each protein is labeled. Finally, 2D DIGE requires fewer microdissected cells for protein identification than the other 2D electrophoresis techniques (Table 1).
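Because both samples are resolved in the same gel, the quantitative comparison reduces to a per-spot ratio of the two dye signals. The following is a minimal sketch of such a calculation and not the procedure of any cited study or imaging software; the function name, the median centering used to remove a global dye or loading bias, and the spot volumes are assumptions made for illustration.

```python
import math
from statistics import median


def dige_log2_ratios(cy3_volumes, cy5_volumes):
    """Return median-centered log2(Cy5/Cy3) ratios for matched spots."""
    if len(cy3_volumes) != len(cy5_volumes):
        raise ValueError("spot lists must have the same length")
    raw = [math.log2(c5 / c3) for c3, c5 in zip(cy3_volumes, cy5_volumes)]
    # Centering on the median assumes that most spots are unchanged,
    # so a global offset reflects dye or loading bias rather than biology.
    center = median(raw)
    return [r - center for r in raw]


# Hypothetical spot volumes: the third spot is enriched in the Cy5 sample.
ratios = dige_log2_ratios([100.0, 200.0, 400.0], [200.0, 400.0, 3200.0])
print(ratios)
```

In this made-up example the first two spots share the same twofold offset and are centered to zero, whereas the third spot retains a log2 ratio of 2, that is, a fourfold difference after correction.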
5. LCM and Different Labeling Techniques
The comparison of the proteomes of two different samples (for instance, normal and tumor cells) is facilitated by labeling. In 2004, Li and coworkers described a method for qualitative and quantitative protein analysis that combines LCM with isotope-coded affinity tag labeling technology and two-dimensional liquid chromatography coupled with tandem mass spectrometry (2D-LC-MS/MS) (27). Approximately 50,000–100,000 cells of HCC and non-HCC hepatocytes were microdissected; a total of 644 proteins in HCC hepatocytes were qualitatively determined, and 261 differential proteins between the two groups were quantified (28). Also in 2004, 16O/18O isotopically labeled peptides were generated from 10,000 microdissected cells of ductal carcinoma of the breast. The approach allowed the identification of 76 proteins (29). By using reverse-phase liquid chromatography-electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS), Zang and coworkers were able to identify proteins that were significantly upregulated in the breast tumor cells (29). Separating radioactively labeled peptides on a high-resolution 54-cm serial immobilized pH gradient isoelectric focusing 2D-PAGE gel provided a precise estimate of the abundance ratio for proteins from two samples (30). The radioiodination of 3.8 μg renal carcinoma proteins and 3.8 μg normal kidney proteins with both 125I and 131I, followed by mass spectrometric identification, revealed 29 differentially expressed proteins (30). Applying the same methodology of radioactive labeling to a pool of microdissected breast cancer cells provided a sensitive method to identify differentially expressed proteins that correlate with the presence of progesterone receptor in estrogen receptor-positive breast cancer (31).
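For the 16O/18O labeling mentioned above, the labeled and unlabeled forms of a peptide appear as a pair of signals with a fixed mass spacing. The sketch below illustrates that spacing and a naive abundance ratio. It assumes the commonly described exchange of two oxygen atoms at the peptide C-terminus, which is not stated in this chapter, and the function names and intensities are hypothetical; the ratio ignores isotope-envelope overlap and incomplete labeling, which a real analysis has to correct for.

```python
# Monoisotopic masses in daltons (standard values).
MASS_16O = 15.9949146
MASS_18O = 17.9991610


def o18_mz_shift(charge, oxygens_exchanged=2):
    """m/z spacing between the 16O and 18O forms of one peptide.

    oxygens_exchanged=2 is an assumption (two C-terminal oxygens replaced).
    """
    return oxygens_exchanged * (MASS_18O - MASS_16O) / charge


def heavy_to_light_ratio(light_intensity, heavy_intensity):
    """Naive abundance ratio of the 18O-labeled to the 16O-labeled sample."""
    return heavy_intensity / light_intensity


shift_1plus = o18_mz_shift(charge=1)   # spacing for a singly charged peptide
shift_2plus = o18_mz_shift(charge=2)   # spacing for a doubly charged peptide
ratio = heavy_to_light_ratio(light_intensity=100.0, heavy_intensity=250.0)
print(shift_1plus, shift_2plus, ratio)
```

Under these assumptions the pair is separated by about 4.008 Da, so the spacing on the m/z axis halves for a doubly charged peptide.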
6. Combining LCM and Different Separation Methods
It has been shown previously that the number of detected and identified peptides and proteins increases significantly when MALDI-MS (32) or ESI-MS (33) is coupled to a peptide or protein separation system. In 2003, Wu and coworkers described a method for discovering biomarkers from microdissected homogeneous cells from breast cancer cell lines (34). After the cells were captured, the peptide digest was fractionated by reversed-phase HPLC and analyzed by ion trap MS (34). HPLC fractionation of about 10,000 endothelial cells from a breast cancer cell line (SKBR-3) followed by ESI MS resulted in the identification of low-expressed proteins in the cell line. Capillary isoelectric focusing combined with reverse-phase nano-LC in an automated and integrated platform provides systematic resolution of complex peptide mixtures generated from limited protein quantities (7). This method separates the mixture of peptides based on differences in isoelectric point and hydrophobicity, and it eliminates peptide loss and analyte dilution (7). Coupled to ESI-tandem MS, this separation enabled the detection of 6866 peptides, leading to the identification of 1820 proteins from 20,000 microdissected cells of glioblastoma (7). In order to increase the number of identified proteins from LCM of brain samples, Gozal and coworkers added an extra separation step (35). After the cells were collected by LCM, the total protein was extracted and resolved on an SDS gel. The gel was cut into multiple pieces, which were digested with trypsin. The peptides were then subjected to highly sensitive liquid chromatography-tandem mass spectrometry (LC-MS/MS). This approach identified hundreds to thousands of proteins (35).
7. LCM and Gel-Free Mass Spectrometry
The peptide digest of cells harvested by LCM can also be measured directly by MS, without an initial separation step on 2D PAGE (known as “gel-free MS”).
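Both the gel-based workflows and the gel-free measurements start from a tryptic peptide digest. As an illustration of what such a digest contains, the sketch below performs an in-silico digestion using the widely used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). The function name and the example sequence are hypothetical, and the sketch is not part of any cited protocol.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides of a protein sequence (one-letter code)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        # Cleave C-terminal to K or R, but not when followed by P.
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    # Join neighbouring fragments to model incomplete digestion.
    peptides = []
    for n in range(missed_cleavages + 1):
        for j in range(len(fragments) - n):
            peptides.append("".join(fragments[j:j + n + 1]))
    return peptides


# Hypothetical sequence; the R-P bond is left intact by the rule above.
peptides = tryptic_digest("MKTAYRPLLKGG")
with_missed = tryptic_digest("MKTAYRPLLKGG", missed_cleavages=1)
print(peptides)
print(with_missed)
```

Allowing missed cleavages enlarges the peptide list, which is one reason why search settings affect the number of peptides and proteins that are reported.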
Guo and coworkers directly analyzed endometrial epithelium cells obtained by LCM using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF/MS) (36). A total of 16 physiologic and malignant endometrial samples, including four proliferative endometria, four secretory endometria, and eight endometrioid adenocarcinomas, were used for this study. Approximately 2000 cells appeared to be sufficient to confirm overexpression of two proteins, calgranulin A and chaperonin 10, in the epithelial cells of endometrial adenocarcinoma samples (36). In another study, the direct analysis of 125 trophoblast and stroma cells of placental tissue resulted in the detection of significant differences in protein expression between these two cell types (37). Differentially expressed proteins between breast cancer and normal samples can also be detected by direct MALDI-TOF/MS measurements
of 2000–2400 LCM cells (38). In a recent study, it was possible to identify over 1000 proteins from 3000 microdissected cells by combining advanced nanoLC with high-resolution Fourier transform mass spectrometry (FTMS) (39).
8. LCM and Protein Chip Technology
There are currently two approaches to produce arrays capable of generating protein network information. The first is the forward phase array, in which each spot on the slide represents a specific antibody. Such an array is therefore incubated with only one test sample (9). The second is the reverse phase array, in which each spot represents an individual test sample; the array is composed of multiple, different samples, which can then be tested under the same experimental conditions. In addition, when the arrays are probed separately with two different classes of antibodies, it is possible to specifically detect the total and phosphorylated forms of the protein of interest (9). By combining LCM with protein chip technology, Melle and coworkers identified annexin V as a specific protein in head and neck cancer patients, and heat shock protein 10 as a biomarker in colorectal cancer patients (40,41). The protein lysates from 3000 to 5000 microdissected cells were analyzed on both strong anion exchange arrays and weak cation exchange arrays, followed by separation steps (e.g., 2D gel or reverse-phase chromatography and SDS-PAGE), MS measurements, and MS/MS analysis (40,41). In both cases, a validation step by immunohistochemistry confirmed the findings. In other studies, surface-enhanced laser desorption/ionization time-of-flight analysis was applied to microdissected cells because it is sensitive to smaller amounts of material than other techniques such as 2D gel (42). Using 30,000–50,000 cells of prostate carcinoma specimens, the unique expression of a prostate carcinoma-associated protein, called PCa-24, in the epithelial cells was demonstrated (42).
Protein microarrays pose several technical challenges (43). Nevertheless, their application offers the advantages of scalability, flexibility, and automated processing (43). Arrays may also enable the control of key parameters such as temperature, pH, and cofactor concentration, which are not easily controlled in cell-based systems.
9. Perspectives of LCM and Mass Spectrometry Analysis
The use of LCM to obtain (relatively) pure populations of cells for further analysis of their proteome is an important addition to the arsenal of techniques in bioscience. However, this technique is still time consuming and yields relatively small numbers of cells. To overcome this problem, alternative
[Figure 2: MALDI FTMS spectrum, intensity versus m/z (1000–3500), with an inset covering m/z 1700–2000; annotated peptide peaks include GAPDH, fibrinogen, CD34 antigen, GFAP, tubulin, and Hb alpha 2.]
Fig. 2. MALDI FTMS spectrum obtained from 150 microdissected cells from a frozen glioma tissue sample. The spectrum contains approximately one thousand monoisotopic peaks between 700 and 3000 m/z at relatively high peak intensities. The inset zooms in on a small part of the spectrum, between 1700 and 2000 m/z, and shows the very high number of peaks obtained from a very small number of cells. The peaks can be identified by different sequencing MS techniques; some examples of identified peptides are indicated in the spectrum.
steps of processing tissues are needed. Sample collection and preparation are crucial. During the microdissection procedure, care should be taken to prevent waste and contamination of the target material. For instance, material should not drop from, or stick to, the cap of the tubes used. Another consideration is to minimize the number of transfers of the collected material from one tube into another. The use of low protein binding tubes is also recommended. A protocol for sample preparation is included in this chapter (Box 1). 2D PAGE is a well-established technique that has been used in combination with LCM in many studies so far. As indicated in Table 1, the need for relatively large numbers of cells prevents the measurement of large numbers of samples. In addition, the relatively low reproducibility hampers sound statistical analysis. 2D DIGE improves reproducibility and also lowers the required amount of microdissected tissue. However, this technique is suitable for experimental research only.
Box 1. LCM sample preparation protocol: Cryosections of 8 μm were made from glioma brain tumor tissue and mounted on polyethylene naphthalate covered glass slides (PALM Microlaser Technologies AG, Bernried, Germany) as described previously (38). The slides were fixed in 70% ethanol and stored at –20°C for not more than 2 days. After fixation and immediately before microdissection, the slides were washed twice with Milli-Q water, stained for 10 s in haematoxylin, washed again twice with Milli-Q water, subsequently dehydrated in a series of 50, 70, 95, and 100% ethanol solutions, and air dried. The PALM laser microdissection and pressure catapulting device, type P-MB, was used with PalmRobo v2.2 software at 40× magnification. Estimating that a cell has dimensions of 10 × 10 × 10 μm, we microdissected an area of about 190,000 μm² of blood vessels and another area of the same size of the surrounding tumor tissue from each sample, resulting in approximately 1500 cells per sample. The microdissected cells were collected in caps of PALM tubes in 5 μl of 0.1% RapiGest buffer (Waters, Milford, MA, USA). The caps were cut and placed onto 0.5 ml Eppendorf protein LoBind tubes (Eppendorf, Hamburg, Germany). Subsequently, these tubes were centrifuged at 12,000 g for 5 min. To make sure that all the cells were covered with buffer, another 5 μl of RapiGest was added to the cells. All samples were stored at –80°C. After thawing, the microdissected tissue was disrupted by external sonication for 1 min at 70% amplitude at a maximum temperature of 25°C (Branson Ultrasonics, Danbury, USA). For protein solubilization and denaturation, the samples were incubated at 37°C for 5 min and then at 100°C for 15 min. To each sample, 1.5 μl of 100 ng/μl gold grade trypsin (Promega, Madison, WI, USA) in 3 mM Tris–HCl, diluted 1:10 in 50 mM NH4HCO3, was added, and the samples were incubated overnight at 37°C for protein digestion.
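The cell-number estimate in this protocol follows from simple geometry: the dissected volume (area times section thickness) divided by the assumed volume of one cell. The short sketch below reproduces that arithmetic with the values given in the protocol; the function name is hypothetical, and treating each cell as a cube of 10 μm edge length is the estimate stated above, not a measured property.

```python
def estimate_cell_count(area_um2, section_thickness_um, cell_edge_um=10.0):
    """Estimate the number of cells in a dissected area of a tissue section."""
    dissected_volume = area_um2 * section_thickness_um  # cubic micrometers
    cell_volume = cell_edge_um ** 3                      # cube-shaped cell
    return dissected_volume / cell_volume


# 190,000 square micrometers dissected from an 8-micrometer cryosection.
cells = estimate_cell_count(area_um2=190_000, section_thickness_um=8)
print(cells)
```

The result of about 1500 cells agrees with the number quoted in the protocol; a different assumed cell size changes the estimate with the third power of the edge length.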
To inactivate trypsin and to degrade the RapiGest, 2 μl of 500 mM HCl was added and the samples were incubated for 30 min at 37°C. Samples were dried in a Speedvac (Thermo Savant, Holbrook, NY, USA) and reconstituted in 5 μl of 50% acetonitrile/0.5% trifluoroacetic acid/water prior to measurement. Samples were measured immediately or stored for a maximum of 10 days at 4°C.
Recently, the improvement of resolution and detection limits in modern mass spectrometers, particularly in FTMS, has opened a new research field: the analysis of small numbers of microdissected cells (in the range of 200–5000). FTMS combines unrivalled mass resolution (in the order of 100,000–1,000,000), high mass accuracy (below 1 ppm), a dynamic range of three to four orders of magnitude, and a good signal-to-noise ratio (44). These features facilitate combining this technique with LCM. For instance, by MALDI-FTMS,
peptide digests of no more than 150 cells taken from biological samples (e.g., glioma vessel tissue) resulted in informative mass spectra (Fig. 2). It is expected that techniques like FTMS will soon be implemented in routine laboratories for the detection of disease-related proteins in clinical specimens.
References
1. Zhang, L., Zhou, W., Velculescu, V. E., Kern, S. E., Hruban, R. H., Hamilton, S. R., Vogelstein, B. and Kinzler, K. W. (1997) Gene expression profiles in normal and cancer cells. Science 276, 1268–1272.
2. Curran, S., McKay, J. A., McLeod, H. L. and Murray, G. I. (2000) Laser capture microscopy. Mol Pathol 53, 64–68.
3. Going, J. J. and Lamb, R. F. (1996) Practical histological microdissection for PCR analysis. J Pathol 179, 121–124.
4. Zhuang, Z., Bertheau, P., Emmert-Buck, M. R., Liotta, L. A., Gnarra, J., Linehan, W. M. and Lubensky, I. A. (1995) A microdissection technique for archival DNA analysis of specific cell populations in lesions <1 mm in size. Am J Pathol 146, 620–625.
5. Shibata, D., Hawes, D., Li, Z. H., Hernandez, A. M., Spruck, C. H. and Nichols, P. W. (1992) Specific genetic analysis of microscopic tissue after selective ultraviolet radiation fractionation and the polymerase chain reaction. Am J Pathol 141, 539–543.
6. Emmert-Buck, M. R., Bonner, R. F., Smith, P. D., Chuaqui, R. F., Zhuang, Z., Goldstein, S. R., Weiss, R. A. and Liotta, L. A. (1996) Laser capture microdissection. Science 274, 998–1001.
7. Wang, Y., Rudnick, P. A., Evans, E. L., Li, J., Zhuang, Z., Devoe, D. L., Lee, C. S. and Balgley, B. M. (2005) Proteome analysis of microdissected tumor tissue using a capillary isoelectric focusing-based multidimensional separation platform coupled with ESI-tandem MS. Anal Chem 77, 6549–6556.
8. Suarez-Quian, C. A., Goldstein, S. R., Pohida, T., Smith, P. D., Peterson, J. I., Wellner, E., Ghany, M. and Bonner, R. F. (1999) Laser capture microdissection of single cells from complex tissues. Biotechniques 26, 328–335.
9. Espina, V., Milia, J., Wu, G., Cowherd, S. and Liotta, L. A. (2006) Laser capture microdissection. Methods Mol Biol 319, 213–229.
10. Schutze, K., Posl, H. and Lahr, G. (1998) Laser micromanipulation systems as universal tools in cellular and molecular biology and in medicine. Cell Mol Biol (Noisy-le-grand) 44, 735–746.
11. Gillespie, J. W., Gannot, G., Tangrea, M. A., Ahram, M., Best, C. J., Bichsel, V. E., Petricoin, E. F., Emmert-Buck, M. R. and Chuaqui, R. F. (2004) Molecular profiling of cancer. Toxicol Pathol 32(Suppl. 1), 67–71.
12. Niyaz, Y., Stich, M., Sagmuller, B., Burgemeister, R., Friedemann, G., Sauer, U., Gangnus, R. and Schutze, K. (2005) Noncontact laser microdissection and pressure catapulting: sample preparation for genomic, transcriptomic, and proteomic analysis. Methods Mol Med 114, 1–24.
13. Ball, H. J. and Hunt, N. H. (2004) Needle in a haystack: microdissecting the proteome of a tissue. Amino Acids 27, 1–7.
14. Emmert-Buck, M. R., Gillespie, J. W., Paweletz, C. P., Ornstein, D. K., Basrur, V., Appella, E., Wang, Q. H., Huang, J., Hu, N., Taylor, P. and Petricoin, E. F. 3rd (2000) An approach to proteomic analysis of human tumors. Mol Carcinog 27, 158–165.
15. Lawrie, L. C., Curran, S., McLeod, H. L., Fothergill, J. E. and Murray, G. I. (2001) Application of laser capture microdissection and proteomics in colon cancer. Mol Pathol 54, 253–258.
16. Jones, M. B., Krutzsch, H., Shu, H., Zhao, Y., Liotta, L. A., Kohn, E. C. and Petricoin, E. F. 3rd (2002) Proteomic analysis and identification of new biomarkers and therapeutic targets for invasive ovarian cancer. Proteomics 2, 76–84.
17. Wulfkuhle, J. D., Sgroi, D. C., Krutzsch, H., McLean, K., McGarvey, K., Knowlton, M., Chen, S., Shu, H., Sahin, A., Kurek, R., Wallwiener, D., Merino, M. J., Petricoin, E. F. 3rd, Zhao, Y. and Steeg, P. S. (2002) Proteomics of human breast ductal carcinoma in situ. Cancer Res 62, 6740–6749.
18. Shekouh, A. R., Thompson, C. C., Prime, W., Campbell, F., Hamlett, J., Herrington, C. S., Lemoine, N. R., Crnogorac-Jurcevic, T., Buechler, M. W., Friess, H., Neoptolemos, J. P., Pennington, S. R. and Costello, E. (2003) Application of laser capture microdissection combined with two-dimensional electrophoresis for the discovery of differentially regulated proteins in pancreatic ductal adenocarcinoma. Proteomics 3, 1988–2001.
19. Zhang, D. H., Tai, L. K., Wong, L. L., Sethi, S. K. and Koay, E. S. (2005) Proteomics of breast cancer: enhanced expression of cytokeratin19 in human epidermal growth factor receptor type 2 positive breast tumors. Proteomics 5, 1797–1805.
20. Ai, J., Tan, Y., Ying, W., Hong, Y., Liu, S., Wu, M., Qian, X. and Wang, H. (2006) Proteome analysis of hepatocellular carcinoma by laser capture microdissection. Proteomics 6, 538–546.
21. Zhou, G., Li, H., DeCamp, D., Chen, S., Shu, H., Gong, Y., Flaig, M., Gillespie, J. W., Hu, N., Taylor, P. R., Emmert-Buck, M. R., Liotta, L. A., Petricoin, E. F. 3rd and Zhao, Y. (2002) 2D differential in-gel electrophoresis for the identification of esophageal scans cell cancer-specific protein markers. Mol Cell Proteomics 1, 117–124.
22. Lee, J. R., Baxter, T. M., Yamaguchi, H., Wang, T. C., Goldenring, J. R. and Anderson, M. G. (2003) Differential protein analysis of spasomolytic polypeptide expressing metaplasia using laser capture microdissection and two-dimensional difference gel electrophoresis. Appl Immunohistochem Mol Morphol 11, 188–193.
23. Hudelist, G., Singer, C. F., Pischinger, K. I., Kaserer, K., Manavi, M., Kubista, E. and Czerwenka, K. F. (2006) Proteomic analysis in human breast cancer: identification of a characteristic protein expression profile of malignant breast epithelium. Proteomics 6, 1989–2002.
24. Shaw, J., Rowlinson, R., Nickson, J., Stone, T., Sweet, A., Williams, K. and Tonge, R. (2003) Evaluation of saturation labelling two-dimensional difference gel electrophoresis fluorescent dyes. Proteomics 3, 1181–1195.
25. Greengauz-Roberts, O., Stoppler, H., Nomura, S., Yamaguchi, H., Goldenring, J. R., Podolsky, R. H., Lee, J. R. and Dynan, W. S. (2005) Saturation labeling with cysteine-reactive cyanine fluorescent dyes provides increased sensitivity for protein expression profiling of laser-microdissected clinical specimens. Proteomics 5, 1746–1757.
26. Sitek, B., Potthoff, S., Schulenborg, T., Stegbauer, J., Vinke, T., Rump, L. C., Meyer, H. E., Vonend, O. and Stuhler, K. (2006) Novel approaches to analyse glomerular proteins from smallest scale murine and human samples using DIGE saturation labelling. Proteomics 6, 4337–4345.
27. Li, C., Hong, Y., Tan, Y. X., Zhou, H., Ai, J. H., Li, S. J., Zhang, L., Xia, Q. C., Wu, J. R., Wang, H. Y. and Zeng, R. (2004) Accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Mol Cell Proteomics 3, 399–409.
28. Gygi, S. P., Rist, B., Gerber, S. A., Turecek, F., Gelb, M. H. and Aebersold, R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999.
29. Zang, L., Palmer Toy, D., Hancock, W. S., Sgroi, D. C. and Karger, B. L. (2004) Proteomic analysis of ductal carcinoma of the breast using laser capture microdissection, LC-MS, and 16O/18O isotopic labeling. J Proteome Res 3, 604–612.
30. Poznanovic, S., Wozny, W., Schwall, G. P., Sastri, C., Hunzinger, C., Stegmann, W., Schrattenholz, A., Buchner, A., Gangnus, R., Burgemeister, R. and Cahill, M. A. (2005) Differential radioactive proteomic analysis of microdissected renal cell carcinoma tissue by 54 cm isoelectric focusing in serial immobilized pH gradient gels. J Proteome Res 4, 2117–2125.
31. Neubauer, H., Clare, S. E., Kurek, R., Fehm, T., Wallwiener, D., Sotlar, K., Nordheim, A., Wozny, W., Schwall, G. P., Poznanovic, S., Sastri, C., Hunzinger, C., Stegmann, W., Schrattenholz, A. and Cahill, M. A. (2006) Breast cancer proteomics by laser capture microdissection, sample pooling, 54cm IPG IEF, and differential iodine radioisotope detection. Electrophoresis 27, 1840–1852.
32. Preisler, J., Hu, P., Rejtar, T., Moskovets, E. and Karger, B. L. (2002) Capillary array electrophoresis-MALDI mass spectrometry using a vacuum deposition interface. Anal Chem 74, 17–25.
33. Bergstrom, S. K., Samskog, J. and Markides, K. E. (2003) Development of a poly(dimethylsiloxane) interface for on-line capillary column liquid chromatography-capillary electrophoresis coupled to sheathless electrospray ionization time-of-flight mass spectrometry. Anal Chem 75, 5461–5467.
34. Wu, S. L., Hancock, W. S., Goodrich, G. G. and Kunitake, S. T. (2003) An approach to the proteomic analysis of a breast cancer cell line (SKBR-3). Proteomics 3, 1037–1046.
35. Gozal, Y. M., Cheng, D., Duong, D. M., Lah, J. J., Levey, A. I. and Peng, J. (2006) Merger of laser capture microdissection and mass spectrometry: a window into the amyloid plaque proteome. Methods Enzymol 412, 77–93.
178
Mustafa et al.
36. Guo, J., Colgan, T. J., DeSouza, L. V., Rodrigues, M. J., Romaschin, A. D. and Siu, K. W. (2005) Direct analysis of laser capture microdissected endometrial carcinoma and epithelium by matrix-assisted laser desorption/ionization mass spectrometry. Rapid Commun Mass Spectrom 19, 2762–2766. 37. de Groot, C. J., Steegers-Theunissen, R. P., Guzel, C., Steegers, E. A. and Luider, T. M. (2005) Peptide patterns of laser dissected human trophoblasts analyzed by matrix-assisted laser desorption/ionisation-time of flight mass spectrometry. Proteomics 5, 597–607. 38. Umar, A., Dalebout, J. C., Timmermans, A. M., Foekens, J. A. and Luider, T. M. (2005) Method optimisation for peptide profiling of microdissected breast carcinoma tissue by matrix-assisted laser desorption/ionisation-time of flight and matrix-assisted laser desorption/ionisation-time of flight/time of flight-mass spectrometry. Proteomics 5, 2680–2688. 39. Umar, A., Luider, T. M., Foekens, J. A. and Pasa-Tolic, L. (2007) NanoLC-FTICR Ms improves proteome coverage attainable for approximately 3000 lasermicrodissected breast carcinoma cells. Proteomics 7, 323–329. 40. Melle, C., Bogumil, R., Ernst, G., Schimmel, B., Bleul, A. and von Eggeling, F. (2006) Detection and identification of heat shock protein 10 as a biomarker in colorectal cancer by protein profiling. Proteomics 6, 2600–2608. 41. Melle, C., Ernst, G., Schimmel, B., Bleul, A., Koscielny, S., Wiesner, A., Bogumil, R., Moller, U., Osterloh, D., Halbhuber, K. J. and von Eggeling, F. (2003) Biomarker discovery and identification in laser microdissected head and neck squamous cell carcinoma with ProteinChip technology, two-dimensional gel electrophoresis, tandem mass spectrometry, and immunohistochemistry. Mol Cell Proteomics 2, 443–452. 42. Zheng, Y., Xu, Y., Ye, B., Lei, J., Weinstein, M. H., O’Leary, M. P., Richie, J. P., Mok, S. C. and Liu, B. C. (2003) Prostate carcinoma tissue proteomics for biomarker discovery. Cancer 98, 2576–2582. 43. Cutler, P. 
(2003) Protein arrays: the current state-of-the-art. Proteomics 3, 3–18. 44. Dekker, L. J., Burgers, P. C., Guzel, C. and Luider, T. M. (2007) Ftms and TOF/TOF mass spectrometry in concert: identifying peptides with high reliability using matrix prespotted MALDI target plates. J Chromatogr B Analyt Technol Biomed Life Sci 847, 62–64. 45. Mustafa, D. A., Burgers, P. C., Dekker, L. J., Charif, H., Titulaer, M. K., Smitt, P. A., Luider, T. M. and Kros, J. M., (2007) Identification of glioma neovascularization-related proteins by using MALDI-FTMS and nano-LC fractionation to microdissected tumor vessels. Mol Cell Proteomics 6, 1147–1157.
III Clinical Proteomics by LC-MS Approaches
10 Comparison of Protein Expression by Isotope-Coded Affinity Tag Labeling
Zhen Xiao and Timothy D. Veenstra
Summary
Isotope-coded affinity tag (ICAT) labeling, in combination with mass spectrometry (MS), has been widely adopted as an effective method for comparing protein abundance levels. This chapter describes the ICAT labeling procedure as used in a search for celecoxib-regulated proteins in a colon cancer cell line. Celecoxib, a cyclooxygenase-2 (COX-2)-specific inhibitor, is used as a colorectal cancer preventative drug in clinical trials. Here, celecoxib is used to inhibit the expression of COX-2 in the colon cancer cell line HT-29. To elucidate the proteomic changes induced by celecoxib, protein lysates are prepared from the treated and control cells. The cysteine-containing proteins are labeled with the heavy and light ICAT reagents, respectively. The labeled proteins are then combined and digested with trypsin. The ICAT-labeled peptides are purified on an avidin column, and the biotin tags are then cleaved. This chapter focuses on the ICAT labeling procedure itself, because sample preparation is the most critical step of an ICAT-based protein expression comparison experiment. Related procedures, such as the cation exchange high performance liquid chromatography separation of peptides and MS analysis, are detailed elsewhere in this book.
Key Words: isotope-coded affinity tags; quantitative proteomics; mass spectrometry.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
The application of mass spectrometry (MS) has rapidly expanded from simple identification of protein components to the quantitative comparison of proteomic changes under various biological and physiological conditions (1,2,3). In many studies, it is desirable to identify proteins and quantify their
levels simultaneously using MS. While the ability to target specific molecules for quantitation is well established, experimental and technical issues limit the accuracy of direct quantitation of hundreds (or thousands) of species in a single MS experiment and make it extremely challenging (4,5,6,7). To overcome this hurdle, a variety of chemical-based labeling and derivatization techniques have been developed (5,7,8,9). One of these techniques, isotope-coded affinity tags (ICATs), has been widely adopted and remains the model system from which most other differential labeling methods have been developed (10). The reagent used in ICAT studies is composed of four parts: (1) an iodoacetamide group that covalently reacts with cysteine residues within proteins; (2) an isotope-coded linker region, which is prepared in two distinct versions containing either nine 13C atoms (heavy version) or nine 12C atoms (light version); (3) a biotin tag that facilitates the purification of labeled peptides via its specific binding to avidin; and (4) an acid-labile bond that is situated between the biotin and the isotopically differential domain of the reagent (Fig. 1). After the cysteine residues are labeled, the protein mixture is enzymatically digested (usually with trypsin) and the labeled peptides are purified via avidin chromatography. Following the enrichment of the ICAT-labeled peptides, the cleavable linker and the biotin tag are removed using trifluoroacetic acid (TFA). The removal of the biotin tag reduces the mass of the remaining tag attached to the peptide and increases the fragmentation efficiency and ultimately the success rate of peptide identification by tandem MS. The advantage of ICAT labeling is the identical chemistry, yet differential mass, of the heavy and light reagents, which enables the protein abundances within two complex proteome samples to be compared simultaneously.
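The mass offset produced by the nine-carbon substitution described above can be estimated directly from the 13C–12C mass difference. The following is a minimal sketch for illustration only and is not part of the original protocol; the function name, its parameters, and the example charge states are assumptions.

```python
# Mass difference between one 13C atom and one 12C atom, in daltons.
C13_MINUS_C12 = 1.0033548


def icat_pair_spacing(n_cysteines: int = 1, charge: int = 1,
                      n_heavy_carbons: int = 9) -> float:
    """Expected m/z spacing between a light- and heavy-labeled peptide ion.

    n_cysteines: number of labeled cysteine residues in the peptide.
    charge: charge state of the observed ion.
    n_heavy_carbons: 13C atoms per heavy tag (nine for the reagent described).
    """
    if n_cysteines < 1 or charge < 1:
        raise ValueError("n_cysteines and charge must be at least 1")
    mass_shift = n_cysteines * n_heavy_carbons * C13_MINUS_C12
    return mass_shift / charge


if __name__ == "__main__":
    # Hypothetical example: one labeled cysteine observed at charge 1, 2, and 3.
    for z in (1, 2, 3):
        print(f"charge {z}+: spacing = {icat_pair_spacing(charge=z):.3f} m/z")
```

For a peptide with a single labeled cysteine, the spacing is about 9 Da for a singly charged ion and shrinks in proportion to the charge state, which may help when locating pairs in spectra of multiply charged ions.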
Fig. 1. The structure of cleavable isotope-coded affinity tag reagent.
Following their coelution from a nanoflow reversed-phase liquid chromatography column, the light- and heavy-labeled peptides are easily recognized within the mass spectrum, being separated by ∼9 Da. The tandem MS spectrum enables the peptide to be identified, while the ratio of the areas of the two peaks is used as a measure of the peptide’s relative abundance in the samples being compared. Since their inception, the ICAT reagents have been modified, improved, and made available commercially via Applied Biosystems
as a kit (11). The combination of ICAT labeling, peptide fractionation, and liquid chromatography tandem mass spectrometry has enabled the rapid and simultaneous identification and quantitation of changes in complex protein mixtures (12,13,14,15,16).
In this chapter, the ICAT labeling procedure is described as part of an experiment to identify celecoxib-induced proteomic changes in colon cancer cells. Celecoxib is a nonsteroidal anti-inflammatory drug that specifically inhibits cyclooxygenase-2 (COX-2) (17,18). In clinical trials, it has been shown to inhibit the development of precancerous polyposis in the colon (19,20). In this study, a COX-2-expressing colon cancer cell line (HT-29) is used (21,22). After the cells are treated with celecoxib, cell lysate is prepared and labeled with the ICAT reagents. A schematic diagram of the ICAT labeling and peptide analysis procedure is shown in Fig. 2. Since the core of the ICAT-based quantitative proteomic analysis is sample preparation, this chapter is dedicated to the details of the ICAT labeling protocol itself. For information on strong cation exchange (SCX) high performance liquid chromatography (HPLC) separation of peptides, analysis by nanoflow reversed-phase liquid chromatography tandem mass spectrometry, and bioinformatics analysis, refer to the chapter on “Analysis of the Extracellular Matrix and Secreted Vesicle Proteomes by Mass Spectrometry” (Subheadings 3.6–3.8). The methods described in this chapter can be used to (1) understand the proteomic changes in response to a drug; (2) illustrate the molecular mechanisms underlying the drug effects; and (3) search for biomarkers or endpoints that can be used to monitor and evaluate therapeutic and intervention approaches.
2. Materials
2.1. Cell Culture and Harvest
1. T-75 cell culture flasks
2. McCoy’s 5a medium supplemented with 10% (v/v) fetal bovine serum, 50 U/mL penicillin, 50 μg/mL streptomycin, and 1.5 mM L-glutamine (American Type Culture Collection (ATCC), Manassas, VA)
3. Dimethylsulfoxide (DMSO, cell culture use)
4. HT-29 cell line (ATCC, Manassas, VA)
5. Celecoxib (Pfizer, New York, NY)
6. 75 μM celecoxib: dissolve celecoxib in DMSO to make a 100 mM stock solution. Further dilute to 75 μM with McCoy’s 5a cell culture medium. Use the same concentration of DMSO in medium as negative control
7. Sterile phosphate-buffered saline (PBS) solution
8. 500 mM EDTA, pH 8
9. 2 mM EDTA in sterile PBS: add 80 μL of 500 mM EDTA, pH 8, to 20 mL of PBS
10. Centrifuge (maximum force: ∼17,000×g)
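The dilutions listed above follow the usual C1·V1 = C2·V2 relationship. The sketch below, offered only as an illustration, reproduces the 80 μL EDTA figure and works out the celecoxib dilution; the function names and the 10 mL medium volume in the example are hypothetical and do not come from the protocol.

```python
def stock_volume_ul(stock_mM: float, final_mM: float, final_volume_ml: float) -> float:
    """Volume of stock (in uL) needed for a given final concentration and volume.

    Uses C1 * V1 = C2 * V2 and treats the final volume as fixed.
    """
    if stock_mM <= 0 or final_mM <= 0 or final_volume_ml <= 0:
        raise ValueError("all arguments must be positive")
    return final_mM * final_volume_ml * 1000.0 / stock_mM


def vehicle_percent(stock_mM: float, final_mM: float) -> float:
    """Percent (v/v) of stock solvent carried into the final solution."""
    return 100.0 * final_mM / stock_mM


if __name__ == "__main__":
    # 2 mM EDTA in 20 mL of PBS from a 500 mM stock.
    print(f"EDTA stock: {stock_volume_ul(500, 2, 20):.1f} uL")
    # 75 uM (0.075 mM) celecoxib from a 100 mM DMSO stock; 10 mL is an assumed volume.
    print(f"Celecoxib stock: {stock_volume_ul(100, 0.075, 10):.2f} uL per 10 mL medium")
    # DMSO fraction to match in the negative control.
    print(f"DMSO in medium: {vehicle_percent(100, 0.075):.3f}% (v/v)")
```

The vehicle percentage is independent of the batch volume, so the same DMSO fraction can be used for the negative control regardless of how much medium is prepared.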
Fig. 2. Schematic diagram of the ICAT labeling procedure applied to the quantitative proteomic analysis.
2.2. Cell Lysis, Desalting, and Protein Quantitation
1. Lysis buffer: 50 mM Tris–HCl, pH 7.2, 1% Triton X-100, 10 mM sodium fluoride (NaF), 1 mM sodium orthovanadate (Na3VO4), and 1 mM EDTA
2. Digital sonifier (Model 250, Branson Ultrasonics Corporation, Danbury, CT)
3. Bicinchoninic acid (BCA) protein assay reagent kit (Pierce, Rockford, IL)
4. D-Salt™ Excellulose plastic desalting column, 5 mL (maximum binding capacity is 1.25 mg per column) (Pierce, Rockford, IL)
5. 50 mM NH4HCO3, pH 8.3
6. Coomassie blue reagent: Coomassie Plus – The Better Bradford™ assay reagent (Pierce, Rockford, IL)
7. Centrifuge (maximum force: ∼17,000×g)
8. Vacuum centrifuge
2.3. Denaturing and Reducing the Proteins
1. Denaturing buffer: 6 M guanidine in 50 mM NH4HCO3, pH 8.3
2. 100 mM Tris (2-carboxyethyl) phosphine (TCEP) (Pierce, Rockford, IL)
3. Boiling water bath
2.4. Labeling with Cleavable ICAT Reagents, Desalting, and Tryptic Digestion
1. Cleavable ICAT™ reagents (light and heavy sulfhydryl modifying biotinylating reagents) (Applied Biosystems, Foster City, CA). Store at –20 °C. One unit of either light or heavy reagent labels 100 μg of protein. The regular kit offers both reagents in 1 unit/tube. The bulk kit offers both reagents in 10 units/tube. The method described here is based on the use of a regular kit, i.e., 1 unit that labels 100 μg of protein/tube
2. Acetonitrile
3. 37 °C water bath
4. D-Salt™ Excellulose plastic desalting column, 5 mL (Pierce, Rockford, IL)
5. 50 mM NH4HCO3, pH 8.3
6. Coomassie blue reagent: Coomassie Plus – The Better Bradford™ assay reagent (Pierce, Rockford, IL)
7. Trypsin gold, MS grade (Promega, Madison, WI)
2.5. Purifying the Labeled Peptides
1. Phenylmethanesulfonyl fluoride (PMSF) (Sigma Chemical Co., St. Louis, MO)
2. Glass wool
3. 5–3/4˝ disposable pasteur glass pipettes
4. Ultralink™ immobilized monomeric avidin slurry [50% (v/v)] (Pierce, Rockford, IL)
5. Teflon tubing that fits the tip of the 5–3/4˝ disposable pasteur glass pipettes
6. 2× PBS buffer, pH 7.2: dissolve 14.2 g of Na2HPO4 and 8.77 g of NaCl in 450 mL of H2O. Adjust pH to 7.2 by adding about 350 μL of 85% (v/v) H3PO4. Add H2O to make a total volume of 500 mL. The final concentration is 200 mM Na2HPO4 and 300 mM NaCl
7. 1× PBS, pH 7.2: dilute 2× PBS 1:1 in H2O
8. 2 mM biotin solution: dissolve 9.8 mg of D-biotin ImmunoPure (MW 244.31, Pierce, Rockford, IL) in 20 mL of 2× PBS, pH 7.2
9. Acetonitrile [20% (v/v)] in 50 mM NH4HCO3, pH 8.3
10. Acetonitrile [30% (v/v)] containing 0.4% (v/v) formic acid
11. pH paper (pH 2–9)
12. Dry ice
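The stated final concentrations of the 2× PBS and biotin solutions can be cross-checked from the weighed masses. This is a small illustrative sketch rather than part of the protocol; the molecular weights for Na2HPO4 and NaCl are standard values and assume the anhydrous salt, which the text does not specify, and the function name is hypothetical.

```python
def molarity_mM(mass_g: float, mw_g_per_mol: float, volume_ml: float) -> float:
    """Concentration in mM for a solute mass brought to a final volume."""
    moles = mass_g / mw_g_per_mol
    liters = volume_ml / 1000.0
    return moles / liters * 1000.0


# Molecular weights in g/mol.
MW_NA2HPO4 = 141.96   # assumption: anhydrous disodium hydrogen phosphate
MW_NACL = 58.44
MW_BIOTIN = 244.31    # value given in the materials list

if __name__ == "__main__":
    print(f"Na2HPO4: {molarity_mM(14.2, MW_NA2HPO4, 500):.1f} mM")
    print(f"NaCl:    {molarity_mM(8.77, MW_NACL, 500):.1f} mM")
    print(f"Biotin:  {molarity_mM(0.0098, MW_BIOTIN, 20):.2f} mM")
```

If a hydrated phosphate salt is used instead, the mass to weigh would need to be scaled by the ratio of the molecular weights.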
2.6. Cleaving Biotin
1. Cleaving reagent A (10 mL) (Applied Biosystems, Foster City, CA): contains concentrated TFA. Store in fume hood at room temperature
2. Cleaving reagent B (Applied Biosystems, Foster City, CA): store at –20 °C
3. 37 °C water bath
4. Vacuum centrifuge
3. Methods
3.1. Cell Culture and Harvest
1. On day 1, plate HT-29 cells in T-75 flasks at 5 × 10^6 cells/flask.
2. On day 2, aspirate medium. Culture cells with fresh medium containing 75 μM of celecoxib or DMSO (negative control).
3. On day 3, 24 h after treating cells, aspirate cell culture medium. Rinse cells once quickly with 6 mL of PBS.
4. Add 3 mL of 2 mM EDTA-PBS per flask and put the flask into the 37 °C incubator. Monitor the detachment of cells carefully. Cells usually detach within 5 min. For the celecoxib-treated cells, it takes less than 5 min (see Note 1).
5. Tap the side of the flask against the palm of the hand to dislodge cells. When the cells are visibly detached, add 7 mL of PBS to the flask. Resuspend cells and transfer the cell suspension to a 15 mL centrifuge tube. Harvest the treated and control cells in separate tubes.
6. Centrifuge the cell suspension at 500×g for 5 min. Remove the supernatant.
7. Wash the cell pellet with 10 mL of PBS three times. Centrifuge at 500×g for 5 min. Remove PBS after each centrifugation.
8. The cell pellet is ready for lysis. Leave the cell pellet on ice before proceeding to the next step, or store the pellet at –80 °C.
3.2. Cell Lysis, Desalting and Protein Quantitation
1. Add 500 μL of lysis buffer to the cell pellet harvested from each T-75 flask. Transfer the resuspended cells to a 1.5 mL Eppendorf tube. Vortex briefly.
2. Clean the sonifier probe with H2O and then methanol, and let it air dry before use.
3. To break the cells, set the digital sonifier amplitude at 16%. Hold up the Eppendorf tube with suspended cells. Let the probe plunge halfway into the lysis buffer. Pulse for 10 s, pause for 50 s. Repeat this cycle five times. Rest the tube on ice between pulses. Lift the tube up again in time before the next 10 s pulse cycle starts (see Note 2).
4. Clean the sonifier probe as in step 2 before starting the next sample.
5. Centrifuge cell lysate at 15,000×g for 15 min at 4 °C.
6. Transfer cell lysate to a fresh Eppendorf tube (see Note 3).
7. Quantify the protein in cell lysate using the BCA assay (see Note 4).
8. Prepare the desalting column (D-Salt™ Excellulose Plastic Desalting Column, 5 mL, Pierce) by washing the column with 5× bed volume (i.e., 25 mL) of 50 mM NH4HCO3, pH 8.3 (see Note 5).
9. Based on the BCA assay results, load up to 1.25 mg of cell lysate into each desalting column. Discard the flow through (see Note 6).
10. Add 0.5 mL of 50 mM NH4HCO3, pH 8.3 into the column. Collect the flow through into one Eppendorf tube. Repeat this step to collect the eluant in a total of seven 0.5 mL fractions.
11. Take 10 μL of eluant from each fraction and mix with 300 μL (1:30) of Coomassie blue reagent (Pierce). Visually examine the color of each tube. The color of the protein-containing fractions should change from brown to blue. Proteins normally elute in fractions 3–5.
12. Pool the tubes containing protein. Mix well. Discard the tubes that do not contain protein.
13. Measure the protein concentration using the BCA assay (see Note 4).
14. Based on the BCA assay results, transfer 800 μg of protein from each of the treated and control samples into two separate Eppendorf tubes (see Note 7).
15. Lyophilize these two samples in a vacuum centrifuge (see Note 8).
3.3. Denaturing and Reducing the Proteins
1. Freshly prepare denaturing buffer and 100 mM TCEP.
2. Add denaturing buffer and 100 mM TCEP to the protein samples. For 800 μg of protein, add 640 μL of denaturing buffer and 8 μL of TCEP (see Note 9).
3. Vortex until the sample is completely dissolved in the buffer.
4. Boil the sample for 10 min.
5. Vortex to mix well. Spin the samples in a centrifuge briefly. Cool to room temperature.
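When a different amount of protein is processed, the volumes in step 2 can be scaled from the 800 μg reference amounts. The sketch below assumes simple linear scaling, which the protocol implies but does not state explicitly; the function name is hypothetical.

```python
def denaturing_volumes(protein_ug: float) -> tuple[float, float, float]:
    """Scale denaturing buffer and TCEP volumes from the 800 ug reference.

    Returns (buffer_ul, tcep_ul, final_tcep_mM). Linear scaling is an assumption.
    """
    if protein_ug <= 0:
        raise ValueError("protein_ug must be positive")
    buffer_ul = protein_ug * 640.0 / 800.0   # 640 uL denaturing buffer per 800 ug
    tcep_ul = protein_ug * 8.0 / 800.0       # 8 uL of 100 mM TCEP per 800 ug
    final_tcep_mM = 100.0 * tcep_ul / (buffer_ul + tcep_ul)
    return buffer_ul, tcep_ul, final_tcep_mM


if __name__ == "__main__":
    for amount in (800, 100):
        buf, tcep, conc = denaturing_volumes(amount)
        print(f"{amount} ug: {buf:.0f} uL buffer, {tcep:.0f} uL TCEP "
              f"(about {conc:.2f} mM TCEP final)")
```

Under this scaling the sample ends up at roughly 100 μg of protein per 80 μL, which matches the volume later added to each reagent tube.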
3.4. Labeling with Cleavable ICAT Reagents, Desalting, and Tryptic Digestion
1. Remove the ICAT reagents from the –20 °C freezer. Bring to room temperature. Avoid exposing them to light. To label 800 μg of protein (control or treated), use eight tubes of reagent (light or heavy; each tube labels 100 μg of protein). Spin in a centrifuge briefly to bring the powder down from the wall to the bottom of the tube.
2. In the chemical hood with lights off, add 20 μL of acetonitrile into each of the eight reagent tubes (light or heavy). Add 80 μL (i.e., 100 μg) of protein sample into each tube. Tighten the tube caps. Vortex to mix well. Spin briefly in a centrifuge (see Note 10).
3. Pool the eight light-labeled tubes into one tube and the eight heavy-labeled tubes into another. This pooling should result in one light and one heavy label tube with 800 μL of protein mixture in each.
4. Incubate the samples in the 37 °C water bath for 2 h. Keep the samples from being exposed to light.
5. Combine the light- and heavy-labeled samples together into one tube. Proceed with desalting.
6. Use the same type of desalting column as in the previous section. Since the binding capacity per column is 1.25 mg, prepare two columns for a total of 1.6 mg of labeled protein. Wash each column with 5× bed volume (i.e., 25 mL) of 50 mM NH4HCO3, pH 8.3 (see Note 11).
7. Load 800 μg of the combined and labeled proteins per column. Follow steps 8–12 in Subheading 3.2. At the end of elution, pool the protein-containing eluant fractions (usually fractions 3–5) into one 15 mL tube (see Note 12).
8. Prepare trypsin freshly by reconstituting 20 μg of trypsin in 20 μL of 50 mM NH4HCO3, pH 8.3. Add trypsin to the labeled protein at a trypsin-to-protein ratio of 1:40 (w/w). For 1.6 mg of protein, add 40 μg of trypsin (see Note 13).
9. Wrap the 15 mL tube with aluminum foil. Incubate at 37 °C overnight (see Note 14).
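The quantities used in this subheading (reagent tubes, desalting columns, and trypsin) all follow from the amount of protein per sample. The sketch below collects that arithmetic in one place; it is an illustration under the stated kit and column capacities, and the function and key names are hypothetical.

```python
import math


def labeling_plan(protein_ug_per_sample: float,
                  ug_per_reagent_tube: float = 100.0,
                  column_capacity_ug: float = 1250.0,
                  trypsin_ratio: float = 40.0) -> dict:
    """Estimate consumables for one treated/control ICAT pair.

    Defaults reflect the values quoted in the protocol: 100 ug of protein per
    reagent tube, 1.25 mg per desalting column, and trypsin at 1:40 (w/w).
    """
    if protein_ug_per_sample <= 0:
        raise ValueError("protein_ug_per_sample must be positive")
    tubes = math.ceil(protein_ug_per_sample / ug_per_reagent_tube)
    combined_ug = 2 * protein_ug_per_sample  # light plus heavy sample
    return {
        "reagent_tubes_per_sample": tubes,
        "acetonitrile_ul_per_sample": 20.0 * tubes,  # 20 uL per reagent tube
        "combined_protein_ug": combined_ug,
        "desalting_columns": math.ceil(combined_ug / column_capacity_ug),
        "trypsin_ug": combined_ug / trypsin_ratio,
    }


if __name__ == "__main__":
    for key, value in labeling_plan(800).items():
        print(f"{key}: {value}")
```

For the 800 μg case described above, this gives eight reagent tubes per sample, two desalting columns, and 40 μg of trypsin.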
3.5. Purifying the Labeled Peptides
1. Boil the peptide solution for 10 min to deactivate trypsin.
2. Freshly prepare 100 mM PMSF in methanol. Vortex to dissolve well.
3. Add PMSF at a 1:100 dilution (v/v) to the trypsin-digested samples. For 3 mL of digests, add 30 μL of PMSF. The final PMSF concentration is 1 mM. Vortex briefly to mix.
4. Prepare the avidin column: put a small amount of glass wool gently into a 5–3/4˝ pasteur glass pipette. Push it from the top down for about 4–1/2˝. This packing creates a support for the resin to settle onto (see Note 15).
5. Add 0.5 mL of water into the pipette. Let the water level fall until it reaches the glass wool. At this point, the flow should stop naturally. Block the bottom of the pipette. Then slowly add 1.5 mL of water into the pipette. Mark the water level as an indicator for the volume of 1.5 mL.
6. Gradually add the avidin slurry to the 1.5 mL mark. Connect Teflon tubing to the pipette tip to increase the flow rate (see Note 16).
7. Condition the column using the following washing buffers and sequence (see Note 17):
a. 2× PBS, pH 7.2, 8 mL (5× bed volume)
b. 2 mM biotin solution, 6 mL (4× bed volume)
c. 30% (v/v) acetonitrile, 0.4% (v/v) formic acid, 6 mL (4× bed volume)
d. 2× PBS, pH 7.2, 8 mL (5× bed volume)
8. Sample loading and incubation: take the Teflon tubing off. Load 1.5 mL of the digest sample into the column. After the sample flows through, incubate at room temperature for 15 min. Load another 1.5 mL (or the rest) of sample. Incubate for 15 min (see Note 18).
9. Connect the Teflon tubing back to the tip of the pipette. Wash the column bound with ICAT-labeled peptides with the following buffers and sequence:
a. 2× PBS, pH 7.2, 8 mL (5× bed volume)
b. 1× PBS, pH 7.2, 8 mL (5× bed volume)
c. 20% (v/v) acetonitrile in 50 mM NH4HCO3, pH 8.3, 6 mL (4× bed volume)
10. Final wash: take off the Teflon tubing. Add 1.3 mL (a volume slightly less than the bed volume) of 30% (v/v) acetonitrile, 0.4% (v/v) formic acid as a final wash. Discard the flow through. Measure the pH of the last drop of this wash step with pH paper. The pH should be >8 (basic), suggesting that acetonitrile has not eluted the peptides off and that the peptides are still retained on the beads (see Note 19).
11. Elute the peptides with 4 mL of 30% (v/v) acetonitrile, 0.4% (v/v) formic acid into one 15 mL tube. Mix well and divide into four 1 mL aliquots. Briefly freeze the peptides on dry ice or at –80 °C and then lyophilize in a vacuum centrifuge (see Note 20).
3.6. Cleaving Biotin
1. Prepare the cleaving reagent mixture in a chemical hood. For 1.6 mg of labeled peptides, mix 760 μL of cleaving reagent A with 40 μL of cleaving reagent B. Add the cleaving reagent mixture to the dry peptides. Dispense the mixture equally to all four peptide aliquots (see Note 21).
2. Close the tube caps. Vortex well to dissolve the peptides.
3. Incubate the samples in a 37 °C water bath for 2 h.
4. Pool all the aliquots together when the incubation is finished. Freeze briefly on dry ice or at –80 °C. Lyophilize the peptides in a vacuum centrifuge.
5. Store at –80 °C prior to the next step (i.e., fractionation by SCX HPLC).
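The cleaving reagent volumes in step 1 are eight times the per-200 μg amounts given in Note 21 (95 μL of reagent A and 5 μL of reagent B). A short sketch of that scaling follows; it assumes the mixture is split evenly across the peptide aliquots, and the function and key names are hypothetical.

```python
def cleaving_mix(labeled_peptide_ug: float, n_aliquots: int = 4) -> dict:
    """Scale cleaving reagents A and B from the per-200 ug amounts in Note 21."""
    if labeled_peptide_ug <= 0 or n_aliquots < 1:
        raise ValueError("labeled_peptide_ug must be positive and n_aliquots >= 1")
    units = labeled_peptide_ug / 200.0   # one unit = 100 ug light + 100 ug heavy
    reagent_a_ul = 95.0 * units
    reagent_b_ul = 5.0 * units
    return {
        "reagent_a_ul": reagent_a_ul,
        "reagent_b_ul": reagent_b_ul,
        "mix_ul_per_aliquot": (reagent_a_ul + reagent_b_ul) / n_aliquots,
    }


if __name__ == "__main__":
    # 1.6 mg of labeled peptides split over four aliquots, as in the protocol.
    print(cleaving_mix(1600))
```

For 1.6 mg of labeled peptides this reproduces the 760 μL and 40 μL figures, with 200 μL of mixture going to each of the four aliquots.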
4. Notes
1. Dislodging cells using a low concentration of EDTA preserves the integrity of cell surface proteins, which is critical in quantitative proteomic analysis.
2. For the Branson digital sonifier, use the following program settings: pulse on for 10 s; off for 50 s; amplitude = 16%. If bubbles are generated during sonication, decrease the amplitude setting. Depending on the sample volume, the setting can sometimes be lowered to 14%. The clumps of cells should disappear when sonication is complete.
3. After this step the cell lysate can be stored at –80 °C. Otherwise, proceed to the next step, i.e., BCA assay and desalting.
4. Protein quantitation is a common laboratory procedure. The instructions are included within the BCA assay kit (Pierce); therefore, the procedure is not described in this chapter.
5. It is helpful to assemble a funnel reservoir on the top of the column to hold a larger volume (up to 25 mL) of buffer.
6. The maximum binding capacity of the desalting column is 1.25 mg of protein per column.
7. The method described here is based on the labeling of 800 μg of protein from each of the treated and control samples. This amount of protein is desirable if enough cell lysate is available. However, as little as 100 μg of protein from each of the treated and control samples can be labeled using this protocol.
8. It takes about 3 h to lyophilize the samples. If necessary, leave the samples in the vacuum centrifuge overnight to dry.
9. It is important to keep the pH of the cell lysate above 7 (ideally between 8 and 9). A pH below 7 will inhibit the reaction between cysteine residues and the iodoacetamide group of the ICAT reagents.
10. Usually the control sample is labeled with the light reagent and the treated sample is labeled with the heavy reagent.
11. To save time, set the two columns up on the stand during the 2-h labeling incubation. It is better to attach a funnel reservoir to the top of each column to hold up to 25 mL of wash buffer.
12. Normally the volume of sample after pooling is about 3 mL. Desalted samples may appear opaque because of the protein present in the sample.
13. Instead of using the buffer provided by the manufacturer, resuspend trypsin in 50 mM NH4HCO3, pH 8.3. Keep the trypsin-to-protein ratio between 1:40 and 1:50.
14. The digestion mixture is incubated overnight for approximately 16–18 h.
15. Make sure the glass wool is well packed. There should be no holes present; however, it should still allow liquid to flow through at a reasonable rate. Check the flow rate by adding 0.5 mL of water into the pipette. The water should flow through quickly. Note that the flow rate will be considerably slower once the avidin slurry is packed into the column. Take these recommendations into consideration and do not pack too much or too little glass wool.
16. The protein binding capacity of avidin slurry is 1.6 mg protein per milliliter of packed avidin. One 1.5 mL column should offer sufficient capacity to enrich the labeled peptides from 1.6 mg of protein.
17. The binding of 2 mM biotin to the column and the elution by 30% (v/v) acetonitrile, 0.4% (v/v) formic acid preclear the column of any potential nonspecific binding activities.
18. The Teflon tubing is a useful tool to adjust the flow rate. Connecting the Teflon tubing to the tip of the column will increase the flow rate. On the other hand, the flow rate will be slower without the Teflon tubing attached.
19. The final wash is intended to remove any nonspecifically bound proteins. Using a volume slightly less than the bed volume ensures that the labeled peptides are retained on the column. The volume of the final wash buffer can be adjusted according to the actual bed volume. When the bed volume of avidin is smaller, the volume of the final wash buffer needs to be scaled down. If the pH of the last drop is less than 3, the labeled peptides may have started to elute, meaning potential loss of the labeled peptides.
20. The elution should be performed in a chemical fume hood to avoid inhaling acetonitrile. The quick freezing of samples on dry ice can prevent sample spill during vacuum centrifugation and reduce the time needed for the samples to dry.
21. For every 200 μg of labeled peptides (i.e., 100 μg each of heavy or light labeled in the pair), mix 95 μL of cleaving reagent A and 5 μL of cleaving reagent B together first and transfer to the labeled peptides.
Acknowledgments
This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
References
1. Aebersold, R., Rist, B. and Gygi, S. P. (2000) Quantitative proteome analysis: methods and applications. Ann N Y Acad Sci 919, 33–47.
2. Gygi, S. P., Rist, B. and Aebersold, R. (2000) Measuring gene expression by quantitative proteome analysis. Curr Opin Biotechnol 11, 396–401.
3. Yates, J. R. 3rd. (2004) Mass spectral analysis in proteomics. Annu Rev Biophys Biomol Struct 33, 297–316.
4. Ong, S. E. and Mann, M. (2005) Mass spectrometry-based proteomics turns quantitative. Nat Chem Biol 1, 252–262.
5. Zieske, L. R. (2006) A perspective on the use of iTRAQ reagent technology for protein complex and profiling studies. J Exp Bot 57, 1501–1508.
6. Yan, W. and Chen, S. S. (2005) Mass spectrometry-based quantitative proteomic profiling. Brief Funct Genomic Proteomic 4, 27–38.
7. Bronstrup, M. (2004) Absolute quantification strategies in proteomics based on mass spectrometry. Expert Rev Proteomics 1, 503–512.
8. Conrads, T. P., Issaq, H. J. and Hoang, V. M. (2003) Current strategies for quantitative proteomics. Adv Protein Chem 65, 133–159.
9. Leitner, A. and Lindner, W. (2004) Current chemical tagging strategies for proteome analysis by mass spectrometry. J Chromatogr B Analyt Technol Biomed Life Sci 813, 1–26.
10. Gygi, S. P., Rist, B., Gerber, S. A., Turecek, F., Gelb, M. H. and Aebersold, R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999.
11. Flory, M. R., Griffin, T. J., Martin, D. and Aebersold, R. (2002) Advances in quantitative proteomics using stable isotope tags. Trends Biotechnol 20, S23–S29.
12. Han, D. K., Eng, J., Zhou, H. and Aebersold, R. (2001) Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat Biotechnol 19, 946–951.
13. Conrads, K. A., Yu, L. R., Lucas, D. A., Zhou, M., Chan, K. C., Simpson, K. A., Schaefer, C. F., Issaq, H. J., Veenstra, T. D., Beck, G. R. Jr. and Conrads, T. P. (2004) Quantitative proteomic analysis of inorganic phosphate-induced murine MC3T3-E1 osteoblast cells. Electrophoresis 25, 1342–1352.
14. Gygi, S. P., Rist, B., Griffin, T. J., Eng, J. and Aebersold, R. (2002) Proteome analysis of low-abundance proteins using multidimensional chromatography and isotope-coded affinity tags. J Proteome Res 1, 47–54.
15. Tao, W. A. and Aebersold, R. (2003) Advances in quantitative proteomics via stable isotope tagging and mass spectrometry. Curr Opin Biotechnol 14, 110–118.
16. Conrads, K. A., Yi, M., Simpson, K. A., Lucas, D. A., Camalier, C. E., Yu, L. R., Veenstra, T. D., Stephens, R. M., Conrads, T. P. and Beck, G. R. Jr. (2005) A combined proteome and microarray investigation of inorganic phosphate-induced pre-osteoblast cells. Mol Cell Proteomics 4, 1284–1296.
17. Koehne, C. H. and Dubois, R. N. (2004) COX-2 inhibition and colorectal cancer. Semin Oncol 31, 12–21.
18. Sinicrope, F. A. and Gill, S. (2004) Role of cyclooxygenase-2 in colorectal cancer. Cancer Metastasis Rev 23, 63–75.
19. Steinbach, G., Lynch, P. M., Phillips, R. K., Wallace, M. H., Hawk, E., Gordon, G. B., Wakabayashi, N., Saunders, B., Shen, Y., Fujimura, T., Su, L. K. and Levin, B. (2000) The effect of celecoxib, a cyclooxygenase-2 inhibitor, in familial adenomatous polyposis. N Engl J Med 342, 1946–1952.
20. Thun, M. J., Henley, S. J. and Patrono, C. (2002) Nonsteroidal anti-inflammatory drugs as anticancer agents: mechanistic, pharmacologic, and clinical issues. J Natl Cancer Inst 94, 252–266.
21. Arico, S., Pattingre, S., Bauvy, C., Gane, P., Barbat, A., Codogno, P. and Ogier-Denis, E. (2002) Celecoxib induces apoptosis by inhibiting 3-phosphoinositide-dependent protein kinase-1 activity in the human colon cancer HT-29 cell line. J Biol Chem 277, 27613–27621.
22. Lev-Ari, S., Strier, L., Kazanov, D., Madar-Shapiro, L., Dvory-Sobol, H., Pinchuk, I., Marian, B., Lichtenberg, D. and Arber, N. (2005) Celecoxib and curcumin synergistically inhibit the growth of colorectal cancer cells. Clin Cancer Res 11, 6738–6744.
11 Analysis of Microdissected Cells by Two-Dimensional LC-MS Approaches
Chen Li, Yi-Hong, Ye-Xiong Tan, Jian-Hua Ai, Hu Zhou, Su-Jun Li, Lei Zhang, Qi-Chang Xia, Jia-Rui Wu, Hong-Yang Wang, and Rong Zeng
Summary
Laser capture microdissection (LCM) is a powerful tool that enables the isolation of specific cell types from tissue sections, overcoming the problems of tissue heterogeneity and contamination. We combined LCM with isotope-coded affinity tag (ICAT) technology and two-dimensional liquid chromatography to investigate the qualitative and quantitative proteomes of hepatocellular carcinoma (HCC). The effects of three different histochemical stains on tissue sections were compared, and toluidine blue proved to be the most suitable stain for LCM followed by proteomic analysis. The solubilized proteins from microdissected HCC and non-HCC hepatocytes were qualitatively and quantitatively analyzed with two-dimensional liquid chromatography tandem mass spectrometry (2D-LC-MS/MS) alone or coupled with cleavable isotope-coded affinity tag (cICAT) labeling technology. A total of 644 proteins were qualitatively identified and 261 proteins were unambiguously quantified. These results show that the clinical proteomic method using LCM coupled with ICAT and 2D-LC-MS/MS supports both large-scale and accurate qualitative and quantitative analysis.
Key Words: hepatocellular carcinoma; laser capture microdissection; isotope-coded affinity tag; two-dimensional liquid chromatography; mass spectrometry.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Hepatocellular carcinoma (HCC) is one of the most frequent tumors worldwide. There are 0.25–1 million newly diagnosed cases of HCC each year (1). The highest frequencies of HCC are observed in sub-Saharan Africa and
in Asia. In China, HCC has ranked as the second leading cause of cancer death since the 1990s. The major risk factors for HCC are chronic hepatitis B virus (HBV) and hepatitis C virus (HCV) infections, chronic exposure to the mycotoxin aflatoxin B1 (AFB1), and alcoholic cirrhosis. To date, the mainstay of HCC diagnosis includes serological tumor markers, such as alpha-fetoprotein, the L3 fraction of alpha-fetoprotein, and PIVKA-II, as well as imaging modalities (1,2,3). To improve the diagnosis and prognosis of HCC, there is an urgent need to identify molecular markers that detect the disease. Using tissue samples from patients with HCC may be the most direct and persuasive way to find useful diagnostic and/or prognostic markers. Recently, proteomic analysis has been applied to HCC tissues. Paik and colleagues analyzed 19 cases of HCC by two-dimensional electrophoresis (2DE) and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) (4,5,6). Jung et al. observed proteome alterations in normal, cirrhotic, and tumorous tissue using a 2DE-MALDI-TOF-MS assay (7). Kim et al. analyzed 11 cases of HCC using 2DE and delayed extraction-matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (DE-MALDI-TOF-MS) (8).
Non-enzymatic sample preparation (NESP) is now one of the regular techniques for tissue sample preparation and can be modified according to tissue-type-specific properties (9). However, problems may arise from tissue heterogeneity and contaminating proteins, e.g., blood proteins. Several approaches have been developed to resolve those problems, and the selection of cell types of interest by dissection has received a great deal of attention. Since 1996, a laser-assisted technique, laser capture microdissection (LCM), has emerged as a good choice. LCM under direct microscopic visualization permits rapid one-step procurement of selected cell populations from a section of complex, heterogeneous tissue (10,11).
LCM has been used to isolate specific types of cells for protein, DNA, and RNA analysis. In the age of proteomics, proteins obtained from laser capture microdissected cells can be analyzed by two-dimensional gel electrophoresis (2DE gel) (12,13), immunoassay (14,15), and surface-enhanced laser desorption and ionization time-of-flight (SELDI-TOF) (16,17,18,19,20,21). The only shortcoming of LCM may be that it requires a long time to collect sufficient cells for one experiment: 2–7 h for 20,000–40,000 cells per immunoassay and 15 h for 250,000 cells per 2DE gel (22). Our previous work applied proteomic analysis to HCC cell lines (23,24) and HCC metastatic cells (25). We then extended our work to clinical tissues using LCM. However, the present LCM assay yields only about several hundred micrograms of protein after several hours of dissection, which is difficult to analyze by the traditional 2DE-MS proteomic route, especially by preparative 2DE gels followed by MS identification.
Proteomic Analysis of Clinical HCC Using LCM
Since 1999, the isotope-coded affinity tag (ICAT) strategy has been a leading technology for relative protein quantification relying on post-harvest stable isotope labeling (26). Post-harvest labeling with stable isotopes can be used for protein quantification in cells and tissues from any organism, and the ICAT method as initially described has been shown to be capable of accurate quantification of proteins in complex mixtures (26). After the first-generation ²H-ICAT reagents, the second-generation cleavable ¹³C-ICAT reagents provided improved performance (27,28,29). The 2D chromatography MS/MS method has been shown to be capable of identifying a large number of proteins, including proteins of low abundance (30,31). In this study, we used LCM to isolate HCC and non-HCC hepatocytes and, for the first time, combined LCM with cleavable isotope-coded affinity tag (cICAT) labeling technology and two-dimensional liquid chromatography tandem mass spectrometry (2D-LC-MS/MS) to carry out accurate qualitative and quantitative analysis of HCC and non-HCC tissues. The workflow is outlined in Fig. 1. In total, 644 proteins in HCC hepatocytes were qualitatively identified and 261 proteins were quantified in HCC relative to non-HCC hepatocytes. To date, this is one of the largest qualitative and quantitative proteomes for HCC and non-HCC tissues. Our strategy and method provide an accurate, fast, and sensitive approach for proteomic analysis of clinical tissues, which will facilitate the understanding of the mechanism of HCC or other diseases and the mining of potential markers and drug targets for diagnosis and treatment.

[Fig. 1 flowchart: Frozen sections of HCC tissues → staining with toluidine blue → laser capture microdissection → HCC hepatocytes and non-HCC hepatocytes → solubilized proteins labeled with cICAT light chain (HCC) or cICAT heavy chain (non-HCC) → digestion of protein mixture → 2D-LC-MS/MS → analysis by bioinformatics]
Fig. 1. Outline of accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Reprinted with permission from (34).

2. Materials
2.1. Tissue Specimen and Sample Preparation by Nonenzymatic Method (NESP)
1. Tissues from an HCC patient are isolated from fresh, partially hepatectomized HCC tissues at Shanghai Eastern Hepatobiliary Surgery Hospital. Access to human tissues complies with both Chinese laws and the guidelines of the Ethics Committee.
2. Glutamine-free RPMI 1640 medium: 5% fetal calf serum, 0.2 mM phenylmethylsulfonyl fluoride, 1 mM ethylenediaminetetraacetic acid tetrasodium salt dihydrate (EDTA), and antibiotics (oxacillin 25 μg/ml, gentamycin 50 μg/ml, penicillin 100 U/ml, streptomycin 100 μg/ml, amphotericin B 0.25 μg/ml, nystatin 50 U/ml). Store at 4°C.
3. Ceramic mortar and pestle (SIBAS Corp., Shanghai, China).
4. Lysis buffer: 8 M urea, 4% 3-[(3-cholamidopropyl)dimethylammonio]-1-propane sulfonate (CHAPS), 40 mM Tris-HCl (pH 8.3), 65 mM dithiothreitol (DTT). Store in aliquots at –80°C.
5. Proteinase inhibitor tablet mixture (Roche).
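As a worked example of the lysis buffer recipe in Subheading 2.1, the following Python sketch computes the mass of each component needed for a given buffer volume. It is an illustration only: the molar masses are standard reference values rather than values from this chapter, the 4% CHAPS is assumed to mean weight per volume, Tris is assumed to be weighed as the free base before pH adjustment, and the function name is hypothetical.

```python
# Molar masses in g/mol (standard reference values, not taken from the chapter).
MOLAR_MASS = {"urea": 60.06, "Tris base": 121.14, "DTT": 154.25}


def lysis_buffer_masses(volume_ml):
    """Return grams of each component for `volume_ml` of lysis buffer.

    Recipe: 8 M urea, 4% CHAPS (assumed w/v), 40 mM Tris, 65 mM DTT.
    """
    liters = volume_ml / 1000.0
    return {
        "urea": 8.0 * MOLAR_MASS["urea"] * liters,               # 8 M
        "CHAPS": 0.04 * volume_ml,                                # 4 g per 100 ml
        "Tris base": 0.040 * MOLAR_MASS["Tris base"] * liters,    # 40 mM
        "DTT": 0.065 * MOLAR_MASS["DTT"] * liters,                # 65 mM
    }


if __name__ == "__main__":
    for name, grams in lysis_buffer_masses(10.0).items():
        print(f"{name}: {grams:.4f} g")
```

For a 10-ml batch this gives roughly 4.8 g of urea and 0.4 g of CHAPS, which shows that urea contributes substantially to the final volume; the solids are therefore normally dissolved first and the volume adjusted afterwards.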
2.2. Laser Capture Microdissection
1. Tissues from an HCC patient are isolated from fresh, partially hepatectomized HCC tissues at Shanghai Eastern Hepatobiliary Surgery Hospital. Access to human tissues complies with both Chinese laws and the guidelines of the Ethics Committee. The tissues are from a 50-year-old male patient with Edmondson grade III HCC (HBV infected, AFP 7.3 μg/L, size 15 × 13 × 10.5 cm).
2. Freezing microtome CM1900 (Leica).
3. O.C.T. compound (Tissue-Tek).
4. Hematoxylin, eosin, and toluidine blue stain (Shanghai Genebase Corp.).
5. Leica AS LMD Laser Capture Microdissection System (Leica).
6. Lysis buffer: 8 M urea, 4% CHAPS, 40 mM Tris, 65 mM DTT. Store in aliquots at –80°C.
7. Proteinase inhibitor tablet mixture (Roche).
2.3. Removal of Toluidine Blue and Digestion of Protein Mixture for Qualitative Analysis
1. Precipitation solution: 50% acetone, 50% ethanol, 0.1% acetic acid (HAc). Store at –20°C.
2. Redissolving buffer: 6 M guanidine HCl, 100 mM Tris-HCl (pH 8.3). Store at 4°C.
3. DTT and iodoacetamide (IAA) are from Bio-Rad. Sequencing grade TPCK-trypsin is from Promega.
4. YM3 ultrafiltration membranes (molecular mass cutoff, 3 kDa) are from Millipore Corp. All buffers are prepared with Milli-Q water (Millipore).
2.4. Cleavable Isotope-Coded Affinity Tag Labeling of Proteins
1. Tri-n-butylphosphate (TBP) is from Bio-Rad.
2. cICAT light and heavy reagents, Avidin cartridge, affinity buffer–elute, affinity buffer–load, affinity buffer–wash 1, affinity buffer–wash 2, and cleaving reagents A and B are from Applied Biosystems.
3. Sequencing grade TPCK-trypsin (Promega).
4. YM3 ultrafiltration membranes (molecular mass cutoff, 3 kDa) are from Millipore Corp. All buffers are prepared with Milli-Q water (Millipore).
2.5. One-Dimensional and Two-Dimensional Liquid Chromatography Coupled with Tandem Mass Spectrometry
1. Formic acid is obtained from Aldrich, and acetonitrile (HPLC gradient grade) is obtained from Merck.
2. The LCQ™ Deca XP system, ProteomeX™ Workstation, and TurboSequest software are purchased from Thermo Electron Corporation.
2.6. Bioinformatics Analysis
1. ExPASy proteomics tools are accessed from cn.expasy.org/tools/#proteome.
2. Program TMHMM 2.0 is accessed from the Center for Biological Sequence Analysis (www.cbs.dtu.dk/services/TMHMM/).
3. Classification tools are accessed from www.geneontology.org.
3. Methods
Two principles apply throughout the LCM and 2D-LC-MS/MS workflow: speed and purity. Sample preparation by LCM must be done as quickly as possible, including fixation of fresh tissues, preparation of frozen sections, histochemical staining, and microdissection. Impurities, such as histochemical stains, should be removed as completely as possible by centrifugation, precipitation, and ultrafiltration before trypsin digestion and LC-MS/MS analysis.
Fixation and histochemical staining are the two initial steps in LCM technology, and the appropriate selection of both methods is important for the subsequent steps. In this work, we used freshly prepared liver tissues to make frozen sections (8 μm thick), and we fixed the sections with ethanol to avoid effects on proteins such as the crosslinking caused by formalin fixation. Several histochemical stains (hematoxylin, eosin, methyl green, and toluidine blue) have been tested with 2DE gels (33): staining with a single stain (hematoxylin) was better than staining with two stains simultaneously (hematoxylin and eosin), and methyl green and toluidine blue staining were both compatible with the analysis of proteins by 2D-PAGE. The results with toluidine blue staining indicated a direct link between the intensity of tissue section staining and problems with generating good-quality protein separations.
In our study, the proteins from cells after LCM were subjected to tryptic digestion and LC-MS/MS analysis. The staining material might affect the pH of the digestion buffer or inactivate the trypsin; therefore, we tried to remove the stains using precipitation and ultrafiltration prior to digestion. We stained frozen sections with each of three histochemical stains (hematoxylin, eosin, and toluidine blue). Among these, we found that almost all of the toluidine blue stain could be removed after precipitation in the solution (50% acetone, 50% ethanol, 0.1% acetic acid) and desalting by ultrafiltration. In addition, proteins from sections stained with toluidine blue solubilized better, whereas some colored protein precipitate appeared on the filtration membrane when hematoxylin or eosin was used.
Therefore, we chose toluidine blue stain to optimize the experimental conditions, including staining, microdissection, and protein digestion.
3.1. Tissue Specimen and Sample Preparation by Nonenzymatic Method (NESP)
1. The tissues used were from a 50-year-old male patient with Edmondson grade III HCC (HBV infected, AFP 7.3 μg/L, size 15 × 13 × 10.5 cm). Tumorous tissues and their adjacent paired nontumorous tissues (3 cm away from the edge of HCC lesions, about 0.1 g) were isolated from fresh, partially hepatectomized tissues of HBV-associated HCC. A part of the resected tissue was used for histological analysis.
2. The tissues were rinsed several times with cold glutamine-free RPMI 1640 medium and were homogenized in a liquid nitrogen-cooled mortar and pestle (see Note 1).
3. The tissue powders obtained were dissolved in lysis buffer (see Note 2).
4. The samples were sonicated on ice for 30 s (intensity: below 50 W) using an ultrasonic processor and centrifuged for 1 h at 20,627×g to remove DNA, RNA, and any particulate materials.
5. The protein concentrations of the samples were measured with the Bio-Rad Protein Assay kit. All samples were stored at –80°C until use (see Note 3).
3.2. Laser Capture Microdissection
1. Embed fresh tissues carefully in OCT in a plastic mold, taking care not to trap air bubbles around the tissue. Freeze the tissue by setting the mold on top of liquid nitrogen until 70–80% of the block turns white, and then put the block on top of dry ice.
2. For the cutting step, mount the frozen block on the cryostat holder. Never, at any point, let the tissue warm up to temperatures above –15°C. Allow frozen blocks to equilibrate in the cryostat chamber for about 5 min. Cut 8-μm sections.
3. Wash the 8-μm sections of freshly prepared liver tissue with cold phosphate buffered saline (PBS, pH 7.4), and stain with toluidine blue using the manufacturer's standard protocol with minor modifications (see Note 4).
4. Fix the sections in cold 95% ethanol for 10 min, air-dry, and microdissect with the Leica AS LMD Laser Capture Microdissection System.
5. Using laser pulses of 7.5 μm diameter, 70 mW, and 2–3 ms duration, microdissect approximately 50,000 or 100,000 cells of HCC and non-HCC hepatocytes; store in microdissection caps at –80°C until lysed (see Note 5). An example of the results obtained with a hematoxylin and eosin (H&E) stained section is shown in Fig. 2.
6. Each cell population was determined to be 95% homogeneous by microscopic visualization of the captured cells. Dissolve the laser capture microdissected HCC and non-HCC hepatocytes in lysis buffer (see Note 2).
7. Sonicate the samples on ice for a while using an ultrasonic processor and centrifuge for 1 h at 20,627×g to remove DNA, RNA, and any particulate materials.
8. Measure the protein concentrations of the samples with the Bio-Rad Protein Assay kit. Store all the samples at –80°C until use (see Note 3).
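The centrifugation steps above are specified as a relative centrifugal force (20,627×g), whereas many benchtop centrifuges are set in rpm. The sketch below uses the standard relation RCF = 1.118 × 10⁻⁵ × r × rpm² (r in cm) to convert between the two. The rotor radius is not given in this chapter, so the 8.0 cm used in the example is purely a placeholder, and the function names are hypothetical.

```python
import math

# Standard conversion constant for radius in cm and speed in rpm.
RCF_CONSTANT = 1.118e-5


def rcf_from_rpm(rpm, radius_cm):
    """Relative centrifugal force (×g) for a given speed and rotor radius."""
    return RCF_CONSTANT * radius_cm * rpm ** 2


def rpm_from_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a given RCF at a given radius."""
    return math.sqrt(rcf / (RCF_CONSTANT * radius_cm))


if __name__ == "__main__":
    radius_cm = 8.0  # placeholder; read the real value from the rotor manual
    print(f"{rpm_from_rcf(20627, radius_cm):.0f} rpm for 20,627 x g")
```

The required speed depends strongly on the rotor, so the radius should always be taken from the rotor documentation rather than from this example.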
3.3. Removal of Toluidine Blue and Digestion of Protein Mixture for Qualitative Analysis
1. Place the samples prepared by NESP or LCM technology in precipitation solution (50% acetone, 50% ethanol, 0.1% acetic acid; sample volume:precipitation solution volume = 1:5) for at least 12 h at –20°C. Wash the pellets with 100% acetone and 70% ethanol, and lyophilize (see Note 6).
2. Redissolve the pellets in 6 M guanidine HCl, 100 mM Tris (pH 8.3); measure the concentrations with the Bio-Rad Protein Assay kit.
Fig. 2. HCC tissues before (A) and after (B) LCM. Reprinted with permission from (34).
4. Reduce 200 μg of solubilized proteins with DTT (final concentration 20 mM) and subsequently alkylate with IAA (final concentration 40 mM).
5. After desalting with YM3 ultrafiltration membranes, incubate the protein mixture with trypsin (trypsin:protein mixture = 1:30, w/w; Promega, Madison, WI) at 37°C for 16 h (see Note 7).
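The ratios and final concentrations in the steps above translate into simple bench arithmetic. The following Python sketch shows one way to compute them: the volume of precipitation solution for the 1:5 ratio, the trypsin mass for the 1:30 (w/w) ratio, and the volume of a concentrated stock needed to reach a stated final concentration. The stock concentration and sample volume in the example (1 M DTT, 100 μl) are assumptions for illustration and do not come from this chapter; the function names are hypothetical.

```python
def precipitation_solution_volume_ul(sample_ul, fold=5.0):
    """Volume of precipitation solution for a sample:solution ratio of 1:fold."""
    return sample_ul * fold


def trypsin_mass_ug(protein_ug, protein_per_trypsin=30.0):
    """Trypsin mass for a trypsin:protein ratio of 1:protein_per_trypsin (w/w)."""
    return protein_ug / protein_per_trypsin


def stock_volume_ul(final_mM, stock_mM, sample_ul):
    """Stock volume to add so that the mixture reaches `final_mM`.

    Solves stock_mM * v = final_mM * (sample_ul + v) for v, i.e. the added
    volume is accounted for in the final concentration.
    """
    if stock_mM <= final_mM:
        raise ValueError("stock must be more concentrated than the target")
    return final_mM * sample_ul / (stock_mM - final_mM)


if __name__ == "__main__":
    print(precipitation_solution_volume_ul(100.0))     # 1:5 ratio
    print(round(trypsin_mass_ug(200.0), 2))            # 1:30 for 200 ug protein
    print(round(stock_volume_ul(20.0, 1000.0, 100.0), 2))  # 20 mM DTT from assumed 1 M stock
```

For 200 μg of protein the 1:30 ratio corresponds to about 6.7 μg of trypsin.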
3.4. Cleavable Isotope-Coded Affinity Tag Labeling of Proteins
1. Reduce 100 μg of HCC and 100 μg of non-HCC solubilized proteins prepared by LCM technology with TBP (final concentration 5 mM) (see Note 8).
2. Transfer the reduced HCC and non-HCC solubilized proteins into the vials containing cICAT light and heavy reagent, respectively, and mix. After a brief centrifugation, incubate the proteins for 2 h at 37°C in the dark.
3. Combine the labeled proteins into one tube. After desalting with YM3 ultrafiltration membranes, incubate the protein mixture with trypsin (trypsin:protein mixture = 1:30, w/w; Promega, Madison, WI) at 37°C for 16 h (see Note 7).
4. Use the Avidin cartridge (Applied Biosystems) to purify the ICAT-labeled peptides from the tryptic digest according to the manufacturer's protocol. In brief, activate the Avidin cartridge with 2 ml of affinity buffer–elute and 2 ml of affinity buffer–load. Slowly inject (∼1 drop/5 s) the peptide sample onto the Avidin cartridge. Wash the cartridge with 500 μl of affinity buffer–load, 1 ml of affinity buffer–wash 1, 1 ml of affinity buffer–wash 2, and 1 ml of Milli-Q water. To elute the labeled peptides, slowly inject (∼1 drop/5 s) the affinity buffer–elute and collect the eluate. Dry the eluate from the Avidin cartridge by lyophilization.
5. Dissolve the dried cICAT-labeled peptides in the cleaving reagents and cleave for 2 h at 37°C. Concentrate the cICAT-labeled peptides by lyophilization.
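Because the HCC sample carries the light tag and the non-HCC sample the heavy tag, each cysteine-containing peptide appears in the mass spectrum as a light/heavy pair. The sketch below estimates the expected m/z spacing of such a pair and the resulting HCC/non-HCC ratio from peak areas. It rests on an assumption that is not stated in this chapter: that the heavy cleavable reagent carries nine ¹³C atoms per labeled cysteine, giving a mass shift of about 9.03 Da; this should be checked against the reagent documentation. The function names are hypothetical, and this is not the algorithm used by the quantification software.

```python
C13_MINUS_C12 = 1.0033548  # Da, mass difference between 13C and 12C
HEAVY_13C_PER_TAG = 9      # ASSUMPTION: 13C atoms per heavy cICAT tag


def pair_spacing_mz(n_cys, charge):
    """Expected m/z distance between the light and heavy forms of a peptide."""
    if n_cys < 1 or charge < 1:
        raise ValueError("need at least one cysteine and a positive charge")
    return n_cys * HEAVY_13C_PER_TAG * C13_MINUS_C12 / charge


def hcc_to_non_hcc_ratio(light_area, heavy_area):
    """HCC/non-HCC ratio, with HCC labeled light and non-HCC labeled heavy."""
    if heavy_area <= 0:
        raise ValueError("heavy peak area must be positive")
    return light_area / heavy_area


if __name__ == "__main__":
    # One cysteine, doubly charged peptide.
    print(f"{pair_spacing_mz(1, 2):.3f} m/z")
    print(hcc_to_non_hcc_ratio(3.0e6, 1.5e6))
```

Under this assumption, a doubly charged peptide with one cysteine would show its two forms about 4.5 m/z units apart.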
3.5. One-Dimensional and Two-Dimensional Liquid Chromatography Coupled with Tandem Mass Spectrometry (1D- and 2D-LC-MS/MS)
1. All the 2D HPLC separations are performed on a ProteomeX™ (Thermo Finnigan Corp., San Jose, CA) equipped with two LC pumps. The flow rates of both the salt and analytical pumps are 200 μl/min, and about 2 μl/min after the split. The strong cation exchange (SCX) column has a 300 μm inner diameter (SCX resin, 5 μm), and the reverse-phase (RPC) column has a 150 μm inner diameter (C18 resin, 300 Å, 5 μm) (see Note 9).
2. Nine salt concentration steps (0, 25, 50, 75, 100, 150, 200, 400, and 800 mM ammonium chloride) are used for the step gradient.
3. The mobile phases used for reverse phase are A: 0.1% formic acid in water, pH 3.0, and B: 0.1% formic acid in acetonitrile.
4. Load about 200 μg of peptides digested from the LCM protein onto the SCX column with the autosampler. The elution conditions are described in step 2. Load the peptides eluted at each salt step onto the RPC column. Wash the RPC column with 20 column volumes of 95% mobile phase A. Finally, separate the peptides using a 100-min linear gradient from 5 to 80% mobile phase B. The eluting peptides enter an LCQ ProteomeX™ mass spectrometer (Thermo Electron, San Jose, CA) through the metal needle (see Note 10).
5. The 1D HPLC separation uses the same system and experimental steps, but without the strong cation exchange column.
6. An electrospray (ESI) ion-trap mass spectrometer (LCQ Deca XP, Thermo Finnigan, San Jose, CA) is used for peptide detection.
7. The positive ion mode is employed and the spray voltage is set at 3.2 kV. The spray temperature is set at 150°C for peptides.
8. The collision energy is set automatically by the LCQ Deca XP. After the acquisition of each full-scan mass spectrum, three MS/MS scans are acquired for the three most intense ions using dynamic exclusion.
9. Peptides and proteins are identified using TurboSequest (Thermo Finnigan, San Jose, CA), which uses the MS and MS/MS spectra of peptide ions to search against the publicly available NCBI non-redundant protein database (www.ncbi.nlm.nih.gov).
10. The protein identification criteria that we used are based on Delta CN (≥0.1) and Xcorr (one charge ≥ 1.8, two charges ≥ 2.2, three charges ≥ 3.7). An example of the results produced is shown in Table 1 (see Note 11).
11. For quantitative analysis with cICAT technology and 2D-LC-MS/MS, database searching and quantification by Xpress (TurboSequest software) are followed by a manual check. Quantitative analysis results for 261 proteins from LCM-ICAT-2D-LC-MS/MS are shown in Fig. 3. In our experiment, a total of 149 differentially expressed proteins with at least twofold quantitative alterations between HCC and non-HCC hepatocytes were detected, including 55 upregulated proteins (32 with 2∼5 folds, 13 with 5∼10 folds, 10 with >10 folds) and 94 downregulated proteins in HCC hepatocytes (62 with 2∼5 folds, 17 with 5∼10 folds, 15 with >10 folds).
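The fold-change categories reported in step 11 can be reproduced with a small classification routine. The Python sketch below assigns an HCC/non-HCC ratio to the bins used in Fig. 3 (ratio below 2 in either direction, 2 to 5, above 5 up to 10, and above 10) and checks that the published bin counts add up to the stated totals. The function name and bin labels are hypothetical, and the routine is only an illustration of the binning, not the procedure used by Xpress.

```python
def classify_ratio(ratio):
    """Assign an HCC/non-HCC ratio to a fold-change bin.

    Ratios below 1 are inverted so that downregulation is binned by
    non-HCC/HCC, mirroring the legend of Fig. 3.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    direction = "up" if ratio >= 1 else "down"
    fold = ratio if ratio >= 1 else 1.0 / ratio
    if fold < 2:
        return "unchanged"
    if fold <= 5:
        return f"{direction} 2-5"
    if fold <= 10:
        return f"{direction} 5-10"
    return f"{direction} >10"


# Bin counts as reported in the chapter (step 11 and Fig. 3).
PUBLISHED_COUNTS = {
    "up 2-5": 32, "up 5-10": 13, "up >10": 10,
    "down 2-5": 62, "down 5-10": 17, "down >10": 15,
}

if __name__ == "__main__":
    up = sum(n for k, n in PUBLISHED_COUNTS.items() if k.startswith("up"))
    down = sum(n for k, n in PUBLISHED_COUNTS.items() if k.startswith("down"))
    print(up, down, up + down, 261 - (up + down))
```

The totals are internally consistent: 55 upregulated plus 94 downregulated proteins give 149, which leaves 112 of the 261 quantified proteins with less than a twofold change.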
3.6. Bioinformatics Analysis
1. The pI and Mr of the proteins are analyzed using ExPASy proteomics tools accessed from http://cn.expasy.org/tools/#proteome. Examples of the results produced are shown in Table 1 and Fig. 5A and 5B.
Fig. 3. Quantitative analysis results of 261 proteins from LCM-ICAT-2D-LC-MS/MS. A total of 149 differentially expressed proteins with at least twofold quantitative alterations between HCC and non-HCC hepatocytes were detected, including 55 upregulated proteins (32 with 2∼5 folds, 13 with 5∼10 folds, 10 with >10 folds) and 94 downregulated proteins in HCC hepatocytes (62 with 2∼5 folds, 17 with 5∼10 folds, 15 with >10 folds). The remaining 112 proteins had a ratio (HCC/non-HCC or non-HCC/HCC) below 2. Reprinted with permission from (34).
Table 1
Summary of Total Proteins Identified in HCC-NESP-1D-LC-MS/MS, HCC-NESP-2D-LC-MS/MS and HCC-LCM-2D-LC-MS/MS

                                      HCC-NESP-1D-LC-MS/MS   HCC-NESP-2D-LC-MS/MS   HCC-LCM-2D-LC-MS/MS
Protein quantity                      200 μg                 200 μg                 200 μg
Total proteins identified             208                    626                    644
Hydrophobic proteins                  25 (12.0%)             64 (10.2%)             80 (12.4%)
Trans-membrane proteins               8 (3.9%)               30 (4.8%)              54 (8.4%)
Proteins with Mr >100 kDa or <10 kDa  19 (9.1%)              77 (12.3%)             75 (11.6%)
Proteins with pI >9                   21 (10.1%)             78 (12.5%)             126 (19.6%)
2. The grand average of hydropathicity (GRAVY) score is calculated as the arithmetic mean of the hydropathic indices of all amino acids in the sequence, i.e., the sum of the indices divided by the sequence length (32). Examples of the results produced are shown in Table 1 and Fig. 5C.
3. The trans-membrane prediction is conducted using the TMHMM server 2.0, which can be accessed from the CBS (http://www.cbs.dtu.dk/services/TMHMM/). Examples of the results produced are shown in Table 1 and Fig. 5D.
4. All identified proteins are classified by their molecular function, cellular component, and biological process with the tools at http://www.geneontology.org. An example of the results produced is shown in Fig. 4.
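The GRAVY calculation described in step 2 is straightforward to reproduce. The following Python sketch uses the Kyte and Doolittle hydropathy indices (32) and averages them over the sequence. The function name and the example peptide are hypothetical, and the chapter does not state the cutoff used to call a protein hydrophobic, so none is applied here.

```python
# Kyte-Doolittle hydropathy indices for the 20 standard amino acids (32).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy index over all residues of a sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


if __name__ == "__main__":
    # Hypothetical peptide used only to exercise the function.
    print(round(gravy("MKVLAAGIVG"), 3))
```

Positive scores indicate an overall hydrophobic sequence and negative scores a hydrophilic one, which matches the axis of Fig. 5C.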
4. Notes
1. Glutamine-free RPMI 1640 medium must be cold (4°C) before use. Washing should be done as quickly as possible, until no contaminants (blood, etc.) remain on the tissues. Glutamine-free RPMI 1640 medium can be replaced by PBS (pH 7.4), 0.9% NaCl solution, or any other isotonic buffer.
2. Store the lysis buffer in small aliquots at –80°C to avoid multiple freeze-thaw cycles. The protease inhibitor tablet mixture (Roche Molecular Biochemicals) should be dissolved in the lysis buffer.
3. Store the samples in small aliquots at –80°C to avoid multiple freeze-thaw cycles. Protein concentrations of the samples should be about 10 μg/μl for subsequent experiments.
4. The sections should be stained very lightly with toluidine blue, only enough to distinguish hepatocytes during microdissection. Otherwise, the excess stain could affect follow-up experiments.
5. To reduce microdissection time, operators can choose either to capture hepatocytes or to remove other cells, depending on the condition of each section.
Fig. 4. Classification of differentially expressed proteins obtained by LCM-ICAT-2D-LC-MS/MS. (A) shows proteins with at least twofold increased expression levels in HCC hepatocytes. (B) shows proteins with at least twofold decreased expression levels in HCC hepatocytes. Reprinted with permission from (34).
6. Precipitation solution, acetone, and ethanol must be precooled to –20°C before use.
7. Ultrafiltration is very important for removing excess salts, stain, and other impurities so that the follow-up steps work properly.
8. TBP is a much stronger but more toxic reducing agent than DTT for the ICAT labeling reaction.
Fig. 5. Characteristics of differentially expressed proteins obtained by LCM-ICAT-2D-LC-MS/MS. (A) shows the Mr distribution (protein number versus Mr range: <10 kDa, 10–30 kDa, 30–50 kDa, 50–100 kDa, >100 kDa); (B) shows the pI distribution (protein number versus pI range); (C) presents the hydrophile and hydrophobicity distribution (protein number versus hydrophilic and hydrophobic value); and (D) shows the trans-membrane protein distribution (protein number versus number of trans-membrane regions). Reprinted with permission from (34).
9. The LCQ ProteomeX™ Workstation (Thermo Electron, San Jose, CA) is an automatic 2D LC/MS system that can be used in high-throughput proteomic research. However, other equipment can be used to separate the proteomic sample by offline SCX fractionation. The steps involved in offline SCX fractionation are almost the same as for the online method. The difference is that the peptides eluted at each salt step must be loaded manually onto the RPC column.
10. If the nanospray kit and a 75-μm inner diameter RPC column are used, the eluted peptides can enter the mass spectrometer directly. The sensitivity in nanospray mode is higher than in metal needle mode.
11. The protein identification criteria can vary with the type of mass spectrometer or other analytical needs. For example, we use Delta CN (≥0.1) and Xcorr (one charge ≥ 1.9, two charges ≥ 2.2, three charges ≥ 3.75) as criteria when using an LTQ linear ion trap mass spectrometer (Thermo Finnigan, San Jose, CA).
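The instrument-dependent identification criteria in Note 11 and in step 10 of Subheading 3.5 can be expressed as a simple lookup. The Python sketch below applies the Delta CN and charge-dependent Xcorr thresholds quoted in this chapter for the LCQ Deca XP and the LTQ. The function and dictionary names are hypothetical, the treatment of charge states above three (rejected) is an assumption, and the code illustrates the filtering logic only; it does not parse TurboSequest output.

```python
# Minimum Xcorr per precursor charge state, as quoted in this chapter.
XCORR_MIN = {
    "LCQ": {1: 1.8, 2: 2.2, 3: 3.7},
    "LTQ": {1: 1.9, 2: 2.2, 3: 3.75},
}
DELTA_CN_MIN = 0.1


def passes_filter(xcorr, delta_cn, charge, instrument="LCQ"):
    """Return True if a peptide match meets the Delta CN and Xcorr criteria.

    Charge states without a listed threshold are rejected (an assumption;
    the chapter gives criteria only for charges 1 to 3).
    """
    thresholds = XCORR_MIN[instrument]
    if charge not in thresholds:
        return False
    return delta_cn >= DELTA_CN_MIN and xcorr >= thresholds[charge]


if __name__ == "__main__":
    # The same singly charged match passes on the LCQ but not on the LTQ.
    print(passes_filter(1.85, 0.12, 1, "LCQ"))
    print(passes_filter(1.85, 0.12, 1, "LTQ"))
```

Keeping the thresholds in one table makes it easy to add criteria for another instrument without touching the filtering logic.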
Acknowledgments
This work was supported by the National High-Technology Project (2001AA233031, 2002BA711A11) and the Basic Research Foundation (2001CB210501).
References
1. Feitelson M.A., Sun B., Satiroglu Tufan N.L., Liu J., Pan J. and Lian Z. (2002) Genetic mechanisms of hepatocarcinogenesis. Oncogene 21, 2593–2604.
2. Fujiyama S., Tanaka M., Maeda S., Ashihara H., Hirata R. and Tomita K. (2002) Tumor markers in early diagnosis, follow-up and management of patients with hepatocellular carcinoma. Oncology 62(Suppl 1), 57–63.
3. Qin L.X. and Tang Z.Y. (2002) The prognostic molecular markers in hepatocellular carcinoma. World J Gastroenterol 8, 385–392.
4. Park K.S., Cho S.Y., Kim H. and Paik Y.K. (2002) Proteomic alterations of the variants of human aldehyde dehydrogenase isozymes correlate with hepatocellular carcinoma. Int J Cancer 97, 261–265.
5. Park K.S., Kim H., Kim N.G., Cho S.Y., Choi K.H., Seong J.K. and Paik Y.K. (2002) Proteomic analysis and molecular characterization of tissue ferritin light chain in hepatocellular carcinoma. Hepatology 35, 1459–1466.
6. Cho S.Y., Park K.S., Shim J.E., Kwon M.S., Joo K.H., Lee W.S., Chang J., Kim H., Chung H.C., Kim H.O. and Paik Y.K. (2002) An integrated proteome database for two-dimensional electrophoresis data analysis and laboratory information management system. Proteomics 2, 1104–1113.
7. Lim S.O., Park S.J., Kim W., Park S.G., Kim H.J., Kim Y.I., Sohn T.S., Noh J.H. and Jung G. (2002) Proteome analysis of hepatocellular carcinoma. Biochem Biophys Res Commun 291, 1031–1037.
8. Kim J., Kim S.H., Lee S.U., Ha G.H., Kang D.G., Ha N.Y., Ahn J.S., Cho H.Y., Kang S.J., Lee Y.J., Hong S.C., Ha W.S., Bae J.M., Lee C.W. and Kim J.W. (2002) Proteome analysis of human liver tumor tissue by two-dimensional gel electrophoresis and matrix assisted laser desorption/ionization-mass spectrometry for identification of disease-related proteins. Electrophoresis 23, 4142–4156.
9. Franzen B., Hirano T., Okuzawa K., Uryu K., Alaiya A.A., Linder S. and Auer G. (1995) Sample preparation of human tumors prior to two-dimensional electrophoresis of proteins. Electrophoresis 16, 1087–1089.
10. Emmert-Buck M.R., Bonner R.F., Smith P.D., Chuaqui R.F., Zhuang Z., Goldstein S.R., Weiss R.A. and Liotta L.A. (1996) Laser capture microdissection. Science 274, 998–1001.
11. Bonner R.F., Emmert-Buck M., Cole K., Pohida T., Chuaqui R., Goldstein S. and Liotta L.A. (1997) Laser capture microdissection: molecular analysis of tissue. Science 278, 1481–1483.
12. Ornstein D.K., Gillespie J.W., Paweletz C.P., Duray P.H., Herring J., Vocke C.D., Topalian S.L., Bostwick D.G., Linehan W.M., Petricoin E.F., III and Emmert-Buck M.R. (2000) Proteomic analysis of laser capture microdissected human prostate cancer and in vitro prostate cell lines. Electrophoresis 21, 2235–2242.
13. Jones M.B., Krutzsch H., Shu H., Zhao Y., Liotta L.A., Kohn E.C. and Petricoin E.F., III (2002) Proteomic analysis and identification of new biomarkers and therapeutic targets for invasive ovarian cancer. Proteomics 2, 76–84.
14. Simone N.L., Remaley A.T., Charboneau L., Petricoin E.F., III, Glickman J.W., Emmert-Buck M.R., Fleisher T.A. and Liotta L.A. (2000) Sensitive immunoassay of tissue cell proteins procured by laser capture microdissection. Am J Pathol 156, 445–452.
15. Ornstein D.K., Englert C., Gillespie J.W., Paweletz C.P., Linehan W.M., Emmert-Buck M.R. and Petricoin E.F., III (2000) Characterization of intracellular prostate-specific antigen from laser capture microdissected benign and malignant prostatic epithelium. Clin Cancer Res 6, 353–356.
16. Sauter E.R., Zhu W., Fan X.J., Wassell R.P., Chervoneva I. and Du Bois G.C. (2002) Proteomic analysis of nipple aspirate fluid to detect biologic markers of breast cancer. Br J Cancer 86, 1440–1443.
17. Verma M., Wright G.L., Jr., Hanash S.M., Gopal-Srivastava R. and Srivastava S. (2001) Proteomic approaches within the NCI early detection research network for the discovery and identification of cancer biomarkers. Ann N Y Acad Sci 945, 103–115.
18. Jain K.K. (2002) Recent advances in oncoproteomics. Curr Opin Mol Ther 4, 203–209.
19. Wright G.L., Jr., Cazares L.H., Leung S.M., Nasim S., Adam B.L., Yip T.T., Schellhammer P.F., Gong L. and Vlahou A. (1999) ProteinChip surface enhanced laser desorption/ionization (SELDI) mass spectrometry: a novel protein biochip technology for detection of prostate cancer biomarkers in complex protein mixtures. Prostate Cancer Prostatic Dis 2, 264–276.
20. Batorfi J., Ye B., Mok S.C., Cseh I., Berkowitz R.S. and Fulop V. (2003) Protein profiling of complete mole and normal placenta using ProteinChip analysis on laser capture microdissected cells. Gynecol Oncol 88, 424–428.
21. Wulfkuhle J.D., Paweletz C.P., Steeg P.S., Petricoin E.F., III and Liotta L. (2003) Proteomic approaches to the diagnosis, treatment, and monitoring of cancer. Adv Exp Med Biol 532, 59–68.
22. Seow T.K., Liang R.C., Leow C.K. and Chung M.C. (2001) Hepatocellular carcinoma: from bedside to proteomics. Proteomics 1, 1249–1263.
23. Yu L.R., Shao X.X., Jiang W.L., Xu D., Chang Y.C., Xu Y.H. and Xia Q.C. (2001) Proteome alterations in human hepatoma cells transfected with antisense epidermal growth factor receptor sequence. Electrophoresis 22, 3001–3008.
24. Yu L.R., Zeng R., Shao X.X., Wang N., Xu Y.H. and Xia Q.C. (2000) Identification of differentially expressed proteins between human hepatoma and normal liver cell lines by two-dimensional electrophoresis and liquid chromatography-ion trap mass spectrometry. Electrophoresis 21, 3058–3068.
25. Ding S.J., Li Y., Tan Y.X., Jiang M.R., Tian B., Liu Y.K., Shao X.X., Ye S.L., Wu J.R., Zeng R., Wang H.Y., Tang Z.Y. and Xia Q.C. (2004) From proteomic analysis to clinical significance: overexpression of cytokeratin 19 correlates with hepatocellular carcinoma metastasis. Mol Cell Proteomics 3(1), 73–81.
26. Gygi S.P., Rist B., Gerber S.A., Turecek F., Gelb M.H. and Aebersold R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999.
27. Li J., Steen H. and Gygi S.P. (2003) Protein profiling with cleavable isotope coded affinity tag (cICAT) reagents: the yeast salinity stress response. Mol Cell Proteomics 2 (11), 1198–204. 28. Oda Y., Owa T., Sato T., Boucher B., Daniels S., Yamanaka H., Shinohara Y., Yokoi A., Kuromitsu J. and Nagasu T. (2003) Quantitative chemical proteomics for identifying candidate drug targets. Anal Chem 75, 2159–2165. 29. Hansen K.C., Schmitt-Ulms G., Chalkley R.J., Hirsch J., Baldwin M.A. and Burlingame A.L. (2003) Mass spectrometric analysis of protein mixtures at low levels using cleavable 13C-isotope-coded affinity tag and multidimensional chromatography. Mol Cell Proteomics 2, 299–314. 30. Washburn M.P., Wolters D. and Yates J.R., III (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat Biotechnol 19, 242–247. 31. Gygi S.P., Corthals G.L., Zhang Y., Rochon Y. and Aebersold R. (2000) Evaluation of two-dimensional gel electrophoresis-based proteome analysis technology. Proc Natl Acad Sci USA 97, 9390–9395. 32. Kyte J. and Doolittle R.F. (1982) A simple method for displaying the hydropathic character of a protein. J Mol Biol 157, 105–132. 33. Craven R.A., Totty N., Harnden P., Selby P.J. and Banks R.E. (2002) Laser capture microdissection and two-dimensional polyacrylamide gel electrophoresis: evaluation of tissue preparation and sample limitations. Am J Pathol 160, 815–822. 34. Li C., Hong Y., Tan Y.X., Zhou H., Ai J.H., Li S.J., Zhang L., Xia Q.C., Wu J.R., Wang Y. and Zeng R. (2004) Accurate qualitative and quantitative proteomic analysis of clinical hepatocellular carcinoma using laser capture microdissection coupled with isotope-coded affinity tag and two-dimensional liquid chromatography mass spectrometry. Mol Cell Proteomics 3(4), 399–409.
12
Label-Free LC-MS Method for the Identification of Biomarkers
Richard E. Higgs, Michael D. Knierman, Valentina Gelfanova, Jon P. Butler, and John E. Hale
Summary
Pharmaceutical companies and regulatory agencies are pursuing biomarkers as a means to increase the productivity of drug development. Quantifying differential levels of proteins from complex biological samples like plasma or cerebrospinal fluid is one specific approach being used to identify markers of drug action, efficacy, toxicity, etc. Academic investigators are also interested in markers that are diagnostic or prognostic of disease states. We report a comprehensive, fully automated, and label-free approach to relative protein quantification including: sample preparation, proteolytic protein digestion, LC-MS/MS data acquisition, de-noising, mass and charge state estimation, chromatographic alignment, and peptide quantification via integration of extracted ion chromatograms. Additionally, we describe methods for transformation and normalization of the quantitative peptide levels in multiplexed measurements to improve precision for statistical analysis. Lastly, we outline how the described methods can be used to design and power biomarker discovery studies.
Key Words: relative quantification; label-free quantification; biomarkers; proteomics; LC-MS/MS.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
Recent advances in analytical technology, particularly mass spectrometry, are finding broad applications in the search for biomarkers. Biomarkers may be defined as indicators of biological processes and encompass a variety of measures including imaging, polynucleotides, proteins, and small molecule
metabolites, among others. These new biomarker discovery activities are motivated by the need to improve diagnosis, guide targeted therapies, and monitor therapeutic efficacy and toxicity throughout a treatment regimen. Biomarkers of drug efficacy or toxicity have the potential to shorten the drug development timeline as they may provide early indications of a drug’s activity. This potential for increased drug development productivity from high-quality biomarkers has fueled increased attention from pharmaceutical companies, biotechnology companies, and regulatory agencies alike (1,2).
Within the field of protein biomarkers, mass spectrometry is playing a central role in the discovery of biomarkers from various biological sample matrices. Quantification of small organic molecules using extracted ion chromatograms (XICs) from liquid chromatography mass spectrometry (LC-MS) experiments has a long history in analytical chemistry. Similar techniques using LC-MS experiments with proteolytic protein digests are now routinely being applied to quantify peptide and protein levels in biological samples. Early LC-MS peptide quantification methods relied on the modification of peptides with reagents enriched in stable isotopes to introduce mass shifts in the peptides from one sample so that relative peptide levels could be compared with those in another, unlabeled sample (3,4). The number of biological samples required for statistical power in many applications, the restriction that study samples must be paired or pooled for these label-based methods, and the increased cost due to specialized reagents have limited their application and motivated the search for label-free methods of non-targeted protein profiling. We report here a comprehensive analytical system to collect and automatically process the data from non-targeted LC-MS/MS analyses of complex protein mixtures.
In contrast to pattern-based (5,6), difference-based (7), or identification-based quantification methods (8,9), the approach presented here simply integrates the peptide parent ion current in order to obtain a relative peptide level in each study sample. No labeling or pooling of study samples is required. The output from this approach is an N × P table in which each of P peptides has been quantified in each of the N study samples. This table maximizes the flexibility in downstream statistical data analysis including transformation, normalization, and an analysis suited to the experimental design. The described method is based on the collective efforts of the applied biochemistry and statistics groups within Lilly Research Laboratories (10,11,12).
Because this is a broad, discovery-oriented assay, it is important to note the limitations imposed by the approach. An assay designed to detect and quantify many analytes simultaneously compromises on sensitivity, selectivity, dynamic range, and absolute quantification relative to a targeted assay designed for a particular analyte. Ion suppression and co-elution of peptides from complex mixtures have the potential to interfere with the ion current attributed to a peptide, thus confounding any inference that may be made about the relative quantities of
the peptide. The limited dynamic range of these uncalibrated assays tends to underestimate the magnitude of a change in protein levels for peptides that do not lie near the linear portion of the instrument response curve. Nonetheless, these non-targeted methods have shown promise in identifying relative changes in protein levels that can be followed in subsequent studies using more targeted assays (e.g., multiple reaction monitoring) (13) to verify the findings in a new sample set.
The described method focuses on biomarker discovery from human plasma and cerebrospinal fluid (CSF). Biomarker discovery from these fluids has proven challenging as the highly abundant proteins (e.g., albumin, IgG) are difficult to completely remove and tend to mask the detection of lower abundance proteins that may be directly associated with the biology of interest. However, the analytical and statistical methods described here are directly applicable to more targeted sample matrices (e.g., tissues) in both clinical and pre-clinical models. Such matrices may increase the probability of technical success because the samples are more directly associated with the biology of interest and contain fewer abundant, masking proteins to remove.
Sample collection and handling procedures are critical in reducing the overall variability in biomarker discovery studies. Age, gender, diet, time of day, and medication may affect the plasma or CSF protein profile and should be considered in study designs. Similarly, consistent sample handling tailored to proteomics profiling (e.g., preservatives, rapid sample freezing, controlling for blood contamination in CSF sampling, number of sample freeze-thaw cycles, etc.) is an important consideration to ensure high-quality starting material.
The proteome is arguably the most modulated class of biomolecules in disease, treatment, and toxicity, resulting in the promise of proteomics for biomarker discovery.
Despite this promise and rapid advancements in technology, progress has been slow (14,15). However, the potential to fulfill the promise is high if the right technology is applied at the appropriate stage of the biomarker discovery life cycle (16): (1) applying non-targeted, hypothesis-generating methods like those described here to sample matrices proximal to the biology, (2) using targeted MS assays to verify early discoveries in new sample sets, and (3) performing clinical validation using established diagnostic assay formats (e.g., ELISAs).
2. Materials
2.1. Albumin/IgG Depletion
1. Montage equilibration buffer, wash buffer, and columns are provided with the Montage Albumin Deplete Kit™ (Millipore®).
2. ProteinG-Sepharose (Amersham Biosciences®).
2.2. Reduction, Alkylation, and Digestion
1. Denaturing solution and internal standard: 8 M urea in 100 mM (NH4)2CO3 buffer containing chicken lysozyme (Sigma, St Louis, MO; 10.4 μg/mL), pH 11.0.
2. Reduction/alkylation cocktail: 97.5% ACN, 2% iodoethanol, and 0.5% triethylphosphine (v/v).
3. Trypsin solution: TPCK-treated bovine pancreatic trypsin (Worthington, Lakewood, NJ) is dissolved at 1 mg/mL in H2O and stored in single-use aliquots at –80°C. Working solutions are prepared by diluting to 5 μg/mL in 100 mM ammonium bicarbonate, pH 8.0, prior to use.
2.3. HPLC
1. The C-18 reversed phase column was a Zorbax SB300 1 × 50 mm (Agilent).
2. Solvent A: 0.1% formic acid (Aldrich) in water (Burdick and Jackson HPLC grade).
3. Solvent B: 50% acetonitrile, 0.1% formic acid (Aldrich) in water (Burdick and Jackson HPLC grade).
4. Solvent C: 80% acetonitrile, 0.1% formic acid (Aldrich) in water (Burdick and Jackson HPLC grade).
2.4. Mass Spectrometry
1. LTQ ion trap mass spectrometer (ThermoFinnigan).
3. Methods
3.1. Plasma Sample Preparation
3.1.1. Albumin/IgG Depletion
1. Dilute a 25 μL aliquot of plasma (1.25 mg protein assuming 50 mg/mL total protein concentration) with Montage equilibration buffer to a volume of 200 μL (see Note 1).
2. Add 100 μL of a 50% proteinG-Sepharose bead suspension and rock the mixture for 1 h at RT.
3. Pellet the G-Sepharose beads at 2000 rpm for 2 min and transfer 200 μL of the effluent to a pre-equilibrated Montage column. Pre-equilibration is performed with 400 μL of equilibration buffer and centrifugation for 2 min at 500×g (see Note 2).
4. Centrifuge the Montage column at 500×g for 2 min, re-apply the flow-through to the column, and centrifuge again. Pass two consecutive 200 μL washes of Montage wash buffer over the column via 500×g centrifugation for 2 min (final volume approximately 600 μL).
3.1.2. Reduction, Alkylation, and Digestion
1. Spike a 120 μL aliquot of the diluted and depleted plasma with 120 μL of the denaturing and internal standard solution (see Note 3).
2. Add an equal volume (240 μL) of reduction/alkylation cocktail (see Note 4).
3. Cap the solutions and incubate for 1 h at 37°C.
4. Speed vacuum the solutions to dryness (at least 3 h).
5. Re-dissolve the pellet in 600 μL of the working trypsin solution. Digest overnight at 37°C (17).
3.2. Cerebrospinal Fluid Sample Preparation
3.2.1. Albumin/IgG Depletion
1. Dilute an aliquot of CSF (34 μg protein based on a Bradford total protein assay) with Montage equilibration buffer to a volume of 200 μL (see Note 5).
2. Add 100 μL of a 50% proteinG-Sepharose bead suspension and rock the mixture for 1 h at RT.
3. Pellet the G-Sepharose beads at 2000 rpm for 2 min and transfer 200 μL of the effluent to a pre-equilibrated Montage column. Pre-equilibration is performed with 400 μL of equilibration buffer and centrifugation for 2 min at 500×g (see Note 2).
4. Centrifuge the Montage column at 500×g for 2 min, re-apply the flow-through to the column, and centrifuge again. Pass two consecutive 200 μL washes of Montage wash buffer over the column via 500×g centrifugation for 2 min (final volume approximately 600 μL).
3.2.2. Reduction, Alkylation, and Digestion
1. Speed vacuum the CSF samples to approximately 30–50 μL and mix with 40 μL of the denaturing and internal standard solution (see Note 3).
2. Add 100 μL of reduction/alkylation cocktail (see Note 4).
3. Cap the solutions and incubate for 1 h at 37°C.
4. Speed vacuum the solutions to dryness (at least 3 h).
5. Re-dissolve the pellet in 600 μL of the working trypsin solution. Digest overnight at 37°C (17).
3.3. HPLC Conditions
1. A Surveyor autosampler and MS HPLC pump (ThermoFinnigan) are used for separation. Tryptic digests (100 μL; 4.2 μg plasma non-depleted equivalent protein or 14 μg CSF non-depleted equivalent protein) are injected onto the reversed phase column at a flow rate of 50 μL/min (see Note 6). The gradient conditions are: 10–95% B (90–5% A) over 120 min, followed by a 0.1 min ramp to 100% C, followed by 5 min at 100% C, followed by a 0.1 min ramp to 10% B (90% A), and a hold for 17 min at 10% B (90% A). The effluent is diverted to waste for the first 2 min to keep the mass spectrometer source clean.
2. Between each sample in the set, an injection of water is made and a shortened (60 min) gradient, otherwise identical to the above, is performed to reduce carryover.
3.4. Mass Spectrometer Conditions
1. The total column effluent (50 μL/min) is directed to the electrospray interface of the ion trap mass spectrometer.
2. The source is operated in positive ion mode with a 4.8 kV electrospray potential, a sheath gas flow of 20 arbitrary units, and a capillary temperature of 225°C. The source lenses should be set by maximizing the ion current for the 2+ charge state of angiotensin.
3. Data are collected in the triple play mode with the following parameters: centroid parent scan set to one microscan and 50 ms maximum injection time, profile zoom scan set to three microscans and 500 ms maximum injection time, and a centroid MS/MS scan set to two microscans and 2000 ms maximum injection time (see Note 7).
4. Dynamic exclusion settings are set to a repeat count of one, exclusion list duration of 2 min, and rejection widths of –0.75 m/z and +2.0 m/z.
5. Collisional activation is carried out with a relative collision energy of 35% and an exclusion width of 3 m/z.
6. Study samples should be injected in a random order to reduce any effects of carryover or confounding with a non-random injection order (see Note 8).
7. All water blank samples should be analyzed by the mass spectrometer in the same manner as study samples in order to monitor carryover (see Note 9).
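Steps 6 and 7, together with the water injections described in Subheading 3.3, amount to building a simple run queue. The following is a minimal sketch of one way to do this, not the scheduling procedure used by the authors. The function name, the sample labels, and the seed are hypothetical and serve only as an illustration.

```python
import random

def build_injection_queue(sample_ids, seed=None):
    """Shuffle study samples and place a water blank between consecutive samples."""
    rng = random.Random(seed)          # seeded so the run order can be reproduced
    order = list(sample_ids)
    rng.shuffle(order)                 # random injection order (step 6)
    queue = []
    for position, sample_id in enumerate(order):
        queue.append(("sample", sample_id))
        if position < len(order) - 1:  # blank between samples, none after the last
            queue.append(("blank", "water"))
    return queue

# Hypothetical study of six samples from two groups.
samples = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"]
queue = build_injection_queue(samples, seed=428)
for kind, label in queue:
    print(kind, label)
```

Recording the seed alongside the study data allows the injection order to be regenerated later, for example when checking whether an apparent group difference tracks run order.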
3.5. Zoom Scan Data Processing
The data collected from a zoom scan triple-play experiment are used to estimate the quality of the subsequent MS/MS spectrum, the charge state of the peptide, and the monoisotopic and average mass of the peptide. The quality estimate is used to eliminate those scan events that are triggered by noise or small molecules from further downstream processing. Peptide mass and charge state estimates are used in subsequent steps for peptide identification. Eliminating low-quality scan events and more accurately estimating the charge state and mass of peptides ultimately reduces the number of false positives that must be dealt with at the peptide identification stage of the process.
1. Assume the charge state of the detected peptide is 1+.
2. Given the m/z of the scan event and the assumed charge state, estimate the theoretical isotope distribution intensities for a peptide of the hypothesized mass using the relationships given in Fig. 1 (see Note 10). Begin by determining the relative intensity of the 12C peak (I0) using the relationship in Fig. 1A and the MW for the assumed charge state. Next, estimate the relative peak intensity of the 13C peak (I1) by multiplying the estimate of I0 by the I1/I0 ratio from Fig. 1B using the MW for the assumed charge state. Isotope intensities I2 and I3 are derived in a similar manner using the ratios from Fig. 1C–D at the MW for the assumed charge state.
3. Convolve the estimated theoretical isotope stick spectrum with a Gaussian peak shape that has a peak width similar to that produced in a typical zoom scan spectrum (18). Linearly scale the result of this convolution such that the maximum value is one.
[Figure 1: four panels plotting isotope peak intensity ratios against monoisotopic MW (500–2500): (A) I0/max(I0, I1, I2, I3); (B) I1/I0; (C) I2/I0; (D) I3/I0.]
Fig. 1. Empirically derived relationships (from 15,493 example peptides) between isotope peak intensities used to estimate the theoretical isotope pattern for a peptide.
(A) I0/max(I0, I1, I2, I3), non-linear least squares fit:
\[ \frac{I_0}{\max(I_0, I_1, I_2, I_3)} = \begin{cases} 1 & \text{if } MW < 1800 \\ e^{-0.00132\,(MW - 1800)^{0.865}} & \text{if } MW \ge 1800 \end{cases} \]
(B) I1/I0, linear least squares fit:
\[ I_1/I_0 = -0.00498 + 0.000560\,MW \]
(C) I2/I0, linear least squares fit:
\[ I_2/I_0 = -0.367 + 0.000516\,MW + 1.59 \times 10^{-7}\,(MW - 1527.34)^2 \]
(D) I3/I0, non-linear least squares fit:
\[ I_3/I_0 = 0.0000605\, e^{0.00251\,MW - 2.70 \times 10^{-7}\,MW^2} \]
Reprinted with permission from (10).
4. Convolve the result from step 3 above with the measured zoom scan to obtain the matched filter output between the expected zoom scan spectrum from the assumed charge state and the measured zoom scan spectrum. Record the maximum value of the output of this convolution along with the x-axis (m/z) value where the maximum occurred.
5. Repeat steps 2–4 above for assumed charge states of 2+, 3+, and 4+.
The detected peptide charge state and mass are estimated from the best match between the observed zoom scan spectrum and the theoretically derived spectrum for the possible charge states of 1+, 2+, 3+, and 4+. The cross-correlation between the best matching theoretical isotope pattern at the m/z shift value associated with the convolution maximum and the measured zoom scan is used as an intensity-independent matching score between the measured and the best matching theoretical spectrum. Triple play events with a cross-correlation score greater than 0.6 are retained for identification. Triple plays below this threshold represent scans of non-peptide species, mixtures of several peptides in the ion trap, or very low signal-to-noise measurements. These lower quality scan events are not retained for any further processing.
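The charge state search in steps 1–5 can be sketched as follows. This is an illustrative simplification under stated assumptions rather than the implementation described in (10): only the I1/I0 and I2/I0 fits of Fig. 1B–C are used (I3 is omitted, and Fig. 1A is not needed because the template is rescaled to a maximum of one), the Gaussian width `sigma` is a guess, and a normalized (cosine) cross-correlation stands in for both the matched filter output and the matching score. All function and variable names are hypothetical.

```python
import math

PROTON = 1.00728        # proton mass in Da
ISOTOPE_GAP = 1.00335   # 13C - 12C mass difference in Da

def isotope_heights(mw):
    """Relative heights of I0, I1, I2 from the linear fits of Fig. 1B-C (I3 omitted)."""
    r1 = -0.00498 + 0.000560 * mw
    r2 = -0.367 + 0.000516 * mw + 1.59e-7 * (mw - 1527.34) ** 2
    return [1.0, max(r1, 0.0), max(r2, 0.0)]

def template(grid, mz, z, sigma=0.05):
    """Gaussian-broadened isotope pattern for charge z on `grid`, scaled to a maximum of 1."""
    mw = z * (mz - PROTON)                       # neutral mass implied by the assumed charge
    heights = isotope_heights(mw)
    profile = []
    for x in grid:
        y = 0.0
        for k, h in enumerate(heights):
            centre = mz + k * ISOTOPE_GAP / z    # isotope peaks are 1.00335/z apart
            y += h * math.exp(-0.5 * ((x - centre) / sigma) ** 2)
        profile.append(y)
    top = max(profile)
    return [v / top for v in profile]

def shifted(values, lag):
    """Shift a profile by `lag` grid points, padding with zeros."""
    n = len(values)
    if lag >= 0:
        return [0.0] * lag + values[:n - lag]
    return values[-lag:] + [0.0] * (-lag)

def similarity(a, b):
    """Normalized (cosine) cross-correlation; independent of absolute intensity."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def estimate_charge(grid, zoom, mz, charges=(1, 2, 3, 4), max_lag=25, threshold=0.6):
    """Try each charge state, slide its template over the zoom scan, keep the best match."""
    step = grid[1] - grid[0]
    best = (-1.0, None, None)
    for z in charges:
        pattern = template(grid, mz, z)
        for lag in range(-max_lag, max_lag + 1):
            score = similarity(shifted(pattern, lag), zoom)
            if score > best[0]:
                best = (score, z, mz + lag * step)
    score, z, mono_mz = best
    return {"charge": z, "mono_mz": mono_mz, "mono_mass": z * (mono_mz - PROTON),
            "score": score, "keep": score > threshold}

# Synthetic doubly charged peptide on a 0.02 m/z grid (illustrative values only).
grid = [799.0 + 0.02 * i for i in range(300)]
zoom = [1000.0 * v for v in template(grid, 800.40, 2)]
result = estimate_charge(grid, zoom, 800.40)
print(result)
```

Because the score is normalized, scaling the zoom scan intensities by any positive constant leaves the selected charge state and score unchanged, which is the property the 0.6 threshold relies on.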
3.6. MS/MS Spectral Filtering
In order to reduce the effect of MS/MS noise peaks on the identification of peptides, a dynamic MS/MS noise level is estimated for each spectrum. This noise level estimate is then subtracted from all MS/MS peak intensities with any resulting differences less than zero set to zero. The spectral noise level is estimated based on the observation that ideal MS/MS spectra of peptides have relatively few peaks (e.g., y-ions, b-ions, adducts, etc.) in a theoretical or high signal-to-noise ratio spectrum, while noisy MS/MS spectra typically have a high density of peaks within a local m/z neighborhood (interpreted as chemical noise). Therefore, the filtering approach uses a percentile of the peak intensities within a local m/z neighborhood as the noise estimate, where the percentile used is based on the density of peaks in the neighborhood: a higher peak density results in a higher percentile to estimate the local noise level, and a lower peak density results in a lower percentile.
1. Bin the MS/MS spectrum into a vector of equally spaced m/z values (bin width of 0.1 m/z).
2. At 200 equally spaced m/z design points between the minimum and maximum m/z values observed in the MS/MS spectrum, estimate the local peak density by counting the number of non-zero intensities in a ±20 m/z window around each of the 200 design points. Define the local peak density at these 200 design points as the number of non-zero peaks counted divided by 40 (peaks per m/z).
3. Transform the local peak density values to a filtering percentile value using the relationship shown in Fig. 2.
Fig. 2. Filtering percentile as a function of local MS/MS peak density. Peak density is defined as the number of MS/MS peaks in a 40 m/z window divided by 40.
\[ \text{Filtering Percentile} = \begin{cases} 0 & \text{if PeakDensity} \le 0.1 \\ \dfrac{0.75}{1 + e^{(0.15 - \text{PeakDensity})/0.05}} & \text{if PeakDensity} > 0.1 \end{cases} \]
Reprinted with permission from (10).
4. Obtain an initial noise level estimate by the percentile of MS/MS peak intensities at each of the 200 design points, where the percentile used at each point is derived from step 3 above (see Note 11).
5. Smooth the initial noise estimates with a Gaussian kernel smoother (150 m/z bandwidth) and interpolate between the 200 design points to obtain the final MS/MS noise estimate at each measured m/z value. Subtract this estimate from the measured MS/MS peak intensities and set any negative values to zero.
An example of a high and low signal-to-noise MS/MS spectrum and the resulting estimated noise levels is shown in Fig. 3.
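As an illustration of steps 2–5, the sketch below estimates and subtracts a density-dependent noise level from a peak list. It is a simplified reading of the procedure, not the authors' code, and rests on several assumptions: the input is a list of (m/z, intensity) pairs that has already been binned, the percentile is taken over the non-zero peaks in the same ±20 m/z window used for the density, a filtering percentile of zero is treated as a zero noise level, the 150 m/z bandwidth is used as the standard deviation of the Gaussian kernel, and smoothing and interpolation are combined by evaluating the kernel smoother directly at each peak. The function names are hypothetical.

```python
import math

def filtering_percentile(density):
    """Fig. 2 relationship: map local peak density to a percentile expressed as a fraction."""
    if density <= 0.1:
        return 0.0
    return 0.75 / (1.0 + math.exp((0.15 - density) / 0.05))

def percentile(sorted_values, fraction):
    """Linearly interpolated percentile of an ascending list; 0.0 for an empty list."""
    if not sorted_values:
        return 0.0
    position = fraction * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (position - low) * (sorted_values[high] - sorted_values[low])

def subtract_noise(peaks, n_points=200, half_window=20.0, bandwidth=150.0):
    """peaks: list of (mz, intensity) pairs. Returns the peaks with estimated noise removed."""
    mzs = [mz for mz, _ in peaks]
    low, high = min(mzs), max(mzs)
    design = [low + i * (high - low) / (n_points - 1) for i in range(n_points)]

    # Steps 2-4: local density -> percentile -> initial noise level at each design point.
    noise = []
    for d in design:
        local = sorted(i for mz, i in peaks if abs(mz - d) <= half_window and i > 0)
        fraction = filtering_percentile(len(local) / (2.0 * half_window))
        # Assumption: a percentile of zero means no noise is subtracted.
        noise.append(percentile(local, fraction) if fraction > 0 else 0.0)

    # Step 5: Gaussian kernel smooth evaluated at each peak, then subtract and clip at zero.
    cleaned = []
    for mz, intensity in peaks:
        weights = [math.exp(-0.5 * ((mz - d) / bandwidth) ** 2) for d in design]
        level = sum(w * n for w, n in zip(weights, noise)) / sum(weights)
        cleaned.append((mz, max(intensity - level, 0.0)))
    return cleaned

# Synthetic spectrum: dense chemical noise (intensity 10) plus one fragment ion (intensity 1000).
spectrum = [(200.0 + 0.5 * i, 10.0) for i in range(801)] + [(400.25, 1000.0)]
cleaned = subtract_noise(spectrum)
signal = [(mz, i) for mz, i in cleaned if i > 1.0]
print(signal)
```

In this synthetic case the peak density is about two peaks per m/z, so the filtering percentile saturates near 0.75 and the dense background is removed while the isolated fragment ion survives.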
[Figure 3: two MS/MS spectra, panels (A) and (B), plotting intensity against m/z with the estimated noise level overlaid.]
Fig. 3. Example MS/MS spectra and their estimated noise levels. 443 original peaks reduced to 118 peaks above estimated noise level in high-noise spectrum (A). 589 original peaks reduced to 173 peaks above estimated noise level in lower noise spectrum (B). Reprinted with permission from (10).
3.7. Peptide Identification
A detailed description of peptide identification is beyond the scope of this chapter, but some general discussion is warranted given the importance of the subject and its linkage to quantification with the proposed method. The primary problem with peptide identification is controlling for false-positive identifications while maintaining a reasonable sensitivity to detect correct identifications. Our approach utilizes the outputs of two search engines, Sequest (19) and X! Tandem (20), along with other descriptive features of identification (e.g., charge state, peptide length, etc.) as inputs to a classifier that has been trained to identify correct identifications (21). The output of the classifier provides a unit-less score indicative of the likelihood of a correct identification. False-positive identifications are controlled by running the searches against reversed versions of the protein databases and estimating p-values: the probability of observing a model score from the reversed database search that exceeds the observed score from the correct database. P-values alone are insufficient due to the large number of tests (identifications) being done (i.e., with a 0.05 p-value cutoff, 5% of all candidate identifications would be declared correct even in the null condition where there are truly no matches to any MS/MS spectra). To account for multiple testing, false discovery rates (FDRs) (q-values) for
peptide identifications are estimated from p-values using the method described by Benjamini and Hochberg (22). Peptides with identification q-values less than a threshold, say 0.10, are retained for quantification. Proteins identified by only one peptide are visually examined to eliminate obvious incorrect identifications (e.g., fewer than four consecutive y- or b-ions). We estimate that the proportion of false identifications using such a procedure is less than or equal to 2%. Overall, the method is similar in strategy to PeptideProphet (23) with the following extensions: multiple search engines are employed, a more flexible classifier (e.g., Random Forests) is used, and statistical significance is estimated from a null distribution of classifier scores derived from reversed database searching instead of fitting a mixture model to the distribution of classifier output scores. The method is described in detail in Higgs et al. (11).
In general, we typically restrict biomarker hypothesis generation to identified peptides. The same relative quantification method can be used with unidentified peptides (MS features), although in practice these features need to be identified to be of practical use to clinicians and biologists. To maximize the coverage of proteins identified in a study, identifications from all samples in the study are pooled and used to create a list of peptides to quantify in each sample. Thus, a confident identification needs to be made only once, in any one sample, in order for the associated peptide ion current to be quantified in all study samples. Pooling the identifications across all samples in a study significantly increases the number of identifications relative to the number of identifications from any single sample.
3.8. Chromatographic Alignment
Variability in the abundance of individual peptides between different samples may result in that peptide triggering an MS/MS scan in one sample and not in another.
The area of this peptide may still be extracted from the primary mass spectrum in each sample. However, doing so requires high-quality chromatographic alignment between the samples so that a consistent region in the extracted ion chromatogram (XIC) is used for integration across all samples in a study. Large biomarker studies can produce chromatographic retention time shifts greater than 1 min between pairs of samples run several days and many samples apart. Simply expanding the integration window by 1 or 2 min to account for chromatographic variability is not an option in our experience as we are analyzing complex samples with multiple co-eluting peaks at most XIC masses. An expanded integration window that includes multiple peaks masks the quantification of individual peptides, produces results that are confounded with multiple peptides contributing to a value, and increases variability. Peak picking is another option, but was not applied here due to the computational
cost as well as the inherent heuristic nature of peak picking algorithms with an associated variability in what is being integrated. We have found a simple pair-wise alignment between all samples and a select reference sample in the study to work well for numerous biomarker discovery projects. This approach to alignment is founded on the following assumptions: (a) the samples included in the study are generally quite similar to each other with respect to their peptide content (i.e., there are many peptides or landmarks in common between the samples), (b) the same chromatographic conditions are used for each sample in the study, and (c) in a local region of retention time, the retention time offset between any two samples is approximately constant (see Note 12).
1. Identify the landmarks in the reference sample by taking all triple-play scan events with a zoom scan cross-correlation score of 0.65 or greater. This set of reference sample landmarks will be matched against other samples in the study.
2. Identify the matching landmarks in a study sample by declaring a landmark match if the sample and reference triple-play events satisfy all of the following: (a) the retention times of the triple-play events in the two samples agree to within a user-specified amount (5 min), (b) the charge state of the peptide matches, (c) the m/z values of the monoisotopic peak from the zoom scans agree to within a user-specified amount (0.7 Da), (d) the zoom scan cross-correlation coefficient of both peptides to their respective theoretical isotope patterns exceeds a threshold (0.65), and (e) the similarity between the corresponding MS/MS spectra exceeds a threshold (e.g., 0.75). The MS/MS similarity metric has been implemented as a cross-correlation coefficient between two MS/MS spectra following a convolution of each MS/MS stick-spectrum with a Gaussian peak shape.
3. For each matching pair of landmarks identified in step 2 above, generate the XIC for the feature in a local retention time window (e.g., ±5 min of scan event time in each sample). Convolve the two XICs to identify the time shift value that maximizes the convolution result between the landmark XICs in both samples. Record the time shift and cross-correlation at the optimal shift value for each landmark. The cross-correlation value will be used as a weighting factor in the subsequent smoothing step below.
4. The optimal time shift values for each pair of landmarks between a sample and the reference define a warping function that can be used to transform the retention time values of a sample to the reference. Estimate a smooth warping function by fitting a weighted loess (24) to the time shift versus retention time values for each sample. The loess should be done in a weighted manner using the XIC cross-correlation values from step 3 above as weights. The result is a smooth function that can be used to transform a sample’s retention time to a common time defined by the reference sample (Fig. 4).
5. The loess warping function for a sample is then applied to all the retention times in the chromatogram (landmark or not). Thus, all samples in a study are projected onto the same retention time scale. The warping function between two samples is generally not monotonic over the entire retention time range, and no restriction on overall monotonicity is used in our estimate of the warping function. We do, however, preserve the overall rank order of the retention times following alignment by constraining the bandwidth (span = 0.5) used in the loess fitting (24) (see Note 13).
[Figure 4: retention time shift (min) plotted against retention time (min, 0–120) for n = 462 landmarks, with the loess fit overlaid.]
Fig. 4. Example chromatographic alignment (“warping”) function between two rat serum samples. Retention time shift (min) vs. retention time (min) for 462 landmark peptides are plotted with the resulting loess fit. Reprinted with permission from (10).
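Steps 3–5 can be sketched as a lag search on each landmark XIC pair followed by a weighted local regression. The code below is a minimal illustration under stated assumptions, not the alignment software described in (10): both XICs are assumed to be sampled on the same uniform time grid, the loess is reduced to a single weighted local linear fit with tricube weights (without the robustness iterations of a full loess), and all function names and the landmark table are hypothetical.

```python
import math

def best_shift(ref_xic, smp_xic, dt, max_lag):
    """Time shift (min) of the sample XIC relative to the reference, and its correlation."""
    norm = math.sqrt(sum(a * a for a in ref_xic) * sum(b * b for b in smp_xic))
    best_lag, best_cc = 0, -1.0
    n = len(ref_xic)
    for lag in range(-max_lag, max_lag + 1):
        num = sum(ref_xic[i] * smp_xic[i + lag] for i in range(n) if 0 <= i + lag < n)
        cc = num / norm if norm > 0 else 0.0
        if cc > best_cc:
            best_lag, best_cc = lag, cc
    return best_lag * dt, best_cc

def loess_at(x0, xs, ys, ws, span=0.5):
    """Weighted local linear fit at x0 using tricube weights over the nearest span*n points."""
    k = max(3, int(math.ceil(span * len(xs))))
    h = sorted(abs(x - x0) for x in xs)[min(k, len(xs)) - 1] or 1e-12
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y, w in zip(xs, ys, ws):
        u = abs(x - x0) / h
        if u >= 1.0:
            continue
        kw = w * (1.0 - u ** 3) ** 3       # landmark weight times tricube kernel weight
        d = x - x0
        s0 += kw
        s1 += kw * d
        s2 += kw * d * d
        t0 += kw * y
        t1 += kw * d * y
    if s0 == 0.0:
        return 0.0
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return t0 / s0                     # fall back to a weighted mean
    return (s2 * t0 - s1 * t1) / det       # intercept of the local line = fitted value at x0

def align_time(t, landmark_rt, landmark_shift, landmark_cc):
    """Map a sample retention time onto the reference time scale."""
    return t - loess_at(t, landmark_rt, landmark_shift, landmark_cc)

# One landmark: the same Gaussian peak elutes 0.3 min later in the sample.
dt = 0.05
times = [i * dt for i in range(201)]                       # 10 min local window
ref = [math.exp(-0.5 * ((t - 5.0) / 0.2) ** 2) for t in times]
smp = [math.exp(-0.5 * ((t - 5.3) / 0.2) ** 2) for t in times]
shift, cc = best_shift(ref, smp, dt, max_lag=100)

# Hypothetical landmark table whose shift drifts linearly with retention time.
rts = [10.0 * i for i in range(1, 12)]                     # 10, 20, ..., 110 min
shifts = [0.1 + 0.002 * rt for rt in rts]
ccs = [0.9] * len(rts)
aligned = align_time(50.0, rts, shifts, ccs)
print(shift, cc, aligned)
```

Here a positive shift means the peak elutes later in the sample than in the reference, so the aligned time is obtained by subtracting the smoothed shift. Low-correlation landmarks contribute less to the fit because their cross-correlation enters as a weight.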
3.9. Peptide Quantification
Relative quantification of peptides is carried out by integration of the XIC peak (using normalized retention times from the chromatographic alignment) from the primary mass spectrum within each sample. A list of peptides to integrate within each sample is constructed by pooling together all triple-play events across all the samples. This pooling can be done with or without the use of peptide identification. As previously noted, we typically restrict the analyses to identified peptides. For each identified peptide, perform the following steps:
1. For each sample in which the peptide was identified, extract the XIC for the peptide and compute the centroid (weighted average of retention time values where the weighting factor is the XIC ion current) of the XIC in a small retention time neighborhood (–0.5 min to +1.0 min from triple-play trigger time) using the aligned time values in the XIC. Compute the mean centroid time for the peptide over all samples in which the peptide was identified. Also compute the mean average m/z value estimated from the zoom scan spectrum for each sample in which the peptide was identified.
2. For each sample in the study, create an XIC for the peptide using the mean zoom scan average m/z value determined in step 1. 3. Estimate a local XIC baseline level and subtract the baseline from the XIC intensity values from each sample. A local linear baseline can be estimated by fitting a line between the lowest intensity XIC point before the peak and the lowest intensity XIC point following the peak in a local neighborhood (e.g., 5 min). This simple local linear baseline estimate always results in a baseline estimate below the signal intensity in the local neighborhood, leading to a low bias in the estimated baseline. For large peaks, this bias is negligible but for small peaks the bias may have a more pronounced effect on quantification. Alternatively, an asymmetric least squares smoothing approach may be used to estimate the baseline XIC values in order to reduce the potential bias with the simple local linear approach (25). 4. A fixed retention time window (±0.5 min for the chromatography described) around the mean centroid time value described in step 1 is used for integration. The width of this window is dependent on the chromatography method used. For the chromatography method reported here, the peak width remains relatively constant across the HPLC gradient (i.e., no band-broadening is observed). If band-broadening is observed, then the integration window width should be modeled as a function of the retention time (e.g., integration window width = intercept + slope × retention time). 5. Integrate the baseline corrected XIC values within the fixed retention time window for each sample in the study using a numerical integration algorithm such as the trapezoid rule. Record the XIC area values for each peptide in each sample. An example of XIC integration for a small study is shown in Fig. 5.
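The centroid, baseline subtraction, and fixed-window integration steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the XIC is given as parallel lists of aligned retention times (min) and intensities, the baseline is the simple local linear estimate described in step 3, and negative baseline-corrected values are clipped to zero (a choice made here for the sketch, not stated in the text).

```python
def centroid(times, intensities):
    """Intensity-weighted mean retention time of an XIC segment."""
    total = sum(intensities)
    if total <= 0:
        raise ValueError("XIC segment has no signal")
    return sum(t * y for t, y in zip(times, intensities)) / total

def subtract_linear_baseline(times, intensities, apex_index):
    """Subtract a line drawn between the lowest points on each side of the apex."""
    left = min(range(apex_index + 1), key=lambda i: intensities[i])
    right = min(range(apex_index, len(times)), key=lambda i: intensities[i])
    t0, y0 = times[left], intensities[left]
    t1, y1 = times[right], intensities[right]
    slope = (y1 - y0) / (t1 - t0) if t1 != t0 else 0.0
    # Clip at zero so points below the fitted line do not contribute negative area.
    return [max(y - (y0 + slope * (t - t0)), 0.0)
            for t, y in zip(times, intensities)]

def integrate_window(times, intensities, center, half_width=0.5):
    """Trapezoid-rule area within center ± half_width (minutes)."""
    pts = [(t, y) for t, y in zip(times, intensities)
           if abs(t - center) <= half_width]
    return sum((t2 - t1) * (y1 + y2) / 2.0
               for (t1, y1), (t2, y2) in zip(pts, pts[1:]))
```

If band-broadening is observed, half_width could be supplied as intercept + slope × retention time instead of a constant, as suggested in step 4.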
3.10. Data Transformation and Normalization Following the integration of peptide-specific XIC peaks in all study samples, we have a rectangular data table with N rows corresponding to N samples in the study, and P columns corresponding to peptides detected in the study. The cell values in this data table are the peptide peak areas. With this table in hand, the usual operations of transformation and normalization may be applied prior to any statistical analysis. 1. Peptide peak areas are approximately log-normal distributed. Apply a log2 transformation to all peak area values (see Note 14). 2. Normalize the log2 transformed peak areas using a quantile normalization procedure (26) (see Note 15). 3. Normalized log2 peptide areas may be used directly as input to the statistical analysis for the study (peptide level analysis). Additionally, the average of normalized log2 peptide areas for all the peptides identified from a protein can be used as an overall estimate of the protein level (protein level analysis, see Note 16).
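A minimal sketch of the log2 transformation, quantile normalization, and protein-level averaging is given below. It assumes a complete N × P table (no missing peak areas and no special handling of ties), which is a simplification; the function names are hypothetical and the procedure is the basic rank-based form of quantile normalization (26).

```python
import math

def log2_table(areas):
    """Apply log2 to every peak area in an N x P table (rows are samples)."""
    return [[math.log2(a) for a in row] for row in areas]

def quantile_normalize(table):
    """Give every sample (row) the same distribution of values.

    The k-th smallest value in each sample is replaced by the mean of the
    k-th smallest values across all samples.
    """
    n, p = len(table), len(table[0])
    sorted_rows = [sorted(row) for row in table]
    reference = [sum(r[k] for r in sorted_rows) / n for k in range(p)]
    normalized = []
    for row in table:
        order = sorted(range(p), key=lambda j: row[j])
        out = [0.0] * p
        for rank, j in enumerate(order):
            out[j] = reference[rank]
        normalized.append(out)
    return normalized

def protein_level(sample_row, peptide_columns):
    """Average the normalized log2 areas of the peptides from one protein."""
    return sum(sample_row[j] for j in peptide_columns) / len(peptide_columns)
```

The normalized rows can be passed directly to a peptide-level analysis, while protein_level gives one value per sample for a protein-level analysis.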
Fig. 5. XICs from the 2+ α-1 macroglobulin peptide ATPLSLCALTAVDQSVLLLKPEAK for eight rat serum samples following chromatographic alignment. Note that the peak from all samples fits within the highlighted [83.2, 84.2] integration region. Reprinted with permission from (10).
3.11. Study Design, Power, Sample Size, and Analysis Our strategy of producing an N × P table of relative peptide levels allows the analysis to be done in a manner consistent with the study design. Note that no part of the described method imposes any limitation on the final statistical analysis of the study (e.g., pooling of samples, subtractive or difference-based methods, etc.). In general, the statistical analysis used for identifying potential protein biomarkers in a study should follow the same approach as a primary clinical endpoint analysis would take (i.e., a simple paired design should be analyzed with a paired t-test, a crossover design with repeated measures within period should be analyzed as a crossover study with repeated measures within period, etc.). An analysis of a single clinical endpoint may use the familiar type I error threshold of 0.05 as a measure of statistical significance. This approach does not work well when testing hundreds or thousands of proteins in a study because, by definition, 5% of all tests from a null experiment (an experiment in which there is truly no treatment or group effect) are expected to have a p-value less than 0.05. The Bonferroni approach to control the family-wise type I error (controlling for no errors in the set of declared changes) has been commonly employed as a means to control false-positive findings (27). However, many investigators doing proteomic hypothesis generation are willing to tolerate some level of false-positive findings in a declared set as long as it is relatively low and estimated. The use of FDR as a means to identify a set of declared findings with a specified proportion of false-positives has been widely applied in genomics (22) and is the current recommendation for proteomic hypothesis-generating experiments. There are numerous estimators of FDR (28,29); the original method described by Benjamini and Hochberg (22) is used in the work presented here.
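As an illustration of FDR control, the Benjamini and Hochberg step-up procedure (22) can be expressed as adjusted p-values, one per tested peptide or protein. The sketch below is a generic implementation with a hypothetical function name; it assumes the list contains one raw p-value per hypothesis.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Declaring significant every hypothesis whose adjusted p-value is at or below a target such as 0.10 gives a set of findings with the FDR controlled at that level.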
Just as multiple comparisons should be considered in the analysis of study data, they should also be considered at the design stage of a new study aimed at generating hypotheses from highly multiplexed measurements like proteomics. This is a relatively new field of research with several methods recently reported (30,31,32,33). A simple approach originally suggested by Benjamini and Hochberg (22), and adapted by Bemis (34), uses traditional sample-size calculations with the following expression for the average type I error ($\alpha_{ave}$) over a set of tested hypotheses:
$$\alpha_{ave} = \frac{f_{ave}\, q^{*}\, m_1}{m_1 + m_0\,(1 - q^{*})}$$
where $f_{ave}$ is the average power of hypothesis tests conducted in a study, $q^{*}$ is the rate at which FDR is to be controlled, $m_0$ is the number of true null hypotheses tested, and $m_1$ is the number of true alternative hypotheses tested. Sample-size estimates are made by first estimating $\alpha_{ave}$ using the desired values for $f_{ave}$ and $q^{*}$ and assumed values for $m_0$ and $m_1$, and then using $\alpha_{ave}$ in existing sample-size calculators for a given study design. An example set of sample-size curves using this approach for the two-sample t-test design is given in Fig. 6.
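The calculation can be sketched as follows, under several assumptions that are ours rather than the chapter's. First, the average type I error is taken to satisfy α_ave = f_ave q* m1 / (m1 + m0 (1 − q*)), which follows from setting the Benjamini and Hochberg threshold q* R / (m0 + m1) equal to α_ave with R ≈ α_ave m0 + f_ave m1 expected rejections. Second, a normal approximation stands in for an exact two-sample t-test calculator. Third, total variability given as a CV is converted to a log-scale standard deviation assuming log-normal data. Function names are hypothetical.

```python
import math
from statistics import NormalDist

def average_alpha(power, q, null_fraction):
    """Average type I error for average power, FDR level q, and m0/(m0+m1)."""
    m0, m1 = null_fraction, 1.0 - null_fraction
    return power * q * m1 / (m1 + m0 * (1.0 - q))

def n_per_group(fold_change, cv, power, alpha):
    """Approximate subjects per group for a two-sided two-sample comparison."""
    z = NormalDist()
    sd = math.sqrt(math.log(1.0 + cv ** 2))  # log-scale SD from CV
    delta = math.log(fold_change)            # effect size on the log scale
    n = 2.0 * ((z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * sd / delta) ** 2
    return math.ceil(n)

# Settings matching the Fig. 6 caption: 85% power, FDR 0.10, 98% true nulls.
alpha_ave = average_alpha(0.85, 0.10, 0.98)
for cv in (0.10, 0.20, 0.30, 0.40):
    print(cv, n_per_group(1.5, cv, 0.85, alpha_ave))
```

Because of the normal approximation, the values printed here will be somewhat smaller than those from an exact t-test calculator at small sample sizes and should be read only as a rough guide.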
Fig. 6. Estimated sample size required to detect protein changes in a two-sample t-test design. Number of subjects in each of the two groups is plotted against the detectable effect size expressed as a fold-change. Four different levels of total variability are shown (10% CV, 20% CV, 30% CV, and 40% CV). Sample size estimates were made using 85% power, a 0.10 target FDR for declaring significance, and an estimated proportion of true null hypotheses, $m_0/(m_0 + m_1)$, set to 0.98.
4. Notes
1. We find that plasma total protein concentration, as measured by a Bradford assay, has a total coefficient of variation (CV) of approximately 11% (includes inter-subject, intra-subject, and assay error) and ranges between approximately 48 and 68 mg/mL (12). Because plasma total protein concentration appears to be highly regulated, it is not generally necessary to measure total protein concentration for each sample in a study in order to load a consistent amount of protein.
2. The depletion material used is based on a dye affinity removal method for albumin. There are commercially available antibody-based depletion kits that may improve albumin removal at a reasonable cost. Abundant protein depletion is an open and active research area at the time of this writing.
3. Chicken lysozyme is added as a spiked internal standard at this stage in order to qualitatively assess the digestion efficiency as well as to quantitatively assess the measurement error across the samples in a study. Other internal standard(s) could also be used.
4. The reduction/alkylation solution should be prepared just before use. Triethylphosphine is pyrophoric and should be handled in a fume hood in accordance with the material safety data sheet. The use of volatile reagents for this step reduces the variability in the sample prep by minimizing sample handling steps and removing the majority of reduction and alkylating reagents. The digestion is performed with trypsin, which is sensitive to the presence of reducing reagents.
5. We find that CSF total protein concentration, as measured by a Bradford assay, has a total CV of approximately 27% (includes inter-subject, intra-subject, and assay error) with a range between approximately 0.12 and 0.41 μg/mL (12). The higher overall variability is attributed to a significantly higher inter-subject variability relative to plasma total protein (12). Due to the higher variability of CSF total protein, we use the results of the Bradford total protein assay to process a consistent total CSF protein amount in the proteomics assay.
6. The HPLC pumps must be capable of producing a smooth gradient at 50 μL/min. The gradient formation should be verified by using water in A and 1% acetone in water for B and running the gradient with UV monitoring at 254 nm. New HPLC columns should be conditioned with at least four runs of digested serum before use in the method.
7. The mass spectrometer’s source should be carefully cleaned to minimize chemical noise. Monitor above 300 m/z and try to maximize the injection time as this is directly proportional to the achievable dynamic range in an ion trap mass spectrometer. The spray conditions should be optimized for a peptide of about 1700 Da.
8. Alternatively, a design could be used to balance various study factors (e.g., treatment, gender, age, etc.) with injection order. This approach may be most appropriate for small studies (e.g., <15 samples) where an unfortunate randomization could result in a confounding of injection order with an important study factor, like treatment.
9. Using the methods as described for plasma, we have found that the total area under the base peak chromatogram for water blanks is <1.5% of the total area for the corresponding preceding plasma sample (<0.0225% total carryover between plasma samples). The largest peak in plasma water blanks is an albumin peptide (2+ ALVLIAFAQYLQQCPFEDHVK) with a peak area of 9% of the value of the preceding plasma sample (0.81% carryover of this peptide estimated between plasma samples). For CSF, we have found that the total area under the base peak chromatogram for the water blanks is <6% of the total area for the corresponding preceding CSF sample (<0.36% total carryover between CSF samples). The largest peak in CSF water blanks is an acidic transthyretin peptide (2+ TSESGELHGLTTEEEFVEGIYKVEIDTK) with a peak area of 32% of the value of the preceding CSF sample (10% carryover of this peptide estimated between CSF samples) (12). A compressed, 60 min gradient can be used for water blank injections in order to reduce the overall cycle time.
10. Theoretical isotope relative heights were derived from an analysis of 15,493 peptides ranging in length from 2 to 38 residues with a median length of 13. We have denoted the intensities of the four isotope peaks I0 for the 12C monoisotopic peak, I1 for the +1 13C isotopic peak, I2 for the +2 13C isotopic peak, and I3 for the +3 13C isotopic peak. The 15,493 example peptides were then used to derive relationships for I0/max(I0, I1, I2, I3), I1/I0, I2/I0, and I3/I0 as functions of the peptide monoisotopic molecular weight (Fig. 1).
11. Percentile transformation is done to define the noise level as the Xth percentile of the peak intensities in a local m/z neighborhood, where X is dependent on the peak density in the neighborhood (higher peak density → higher percentile → higher estimated noise level).
12. One potential improvement to this alignment strategy would be to create a composite list of landmarks across all study samples instead of relying on a single sample to serve as the retention time reference. This could easily be accomplished by grouping or clustering landmarks from all samples, enforcing a match on m/z, charge state, retention time, and MS/MS spectral similarity. This has not been employed yet due to the increased computational cost and the lack of data demonstrating any significant problems with the single reference sample approach. In practice, several different samples are evaluated as potential alignment reference samples, and the best sample based on a qualitative assessment of the alignment warping functions is chosen.
13. A visual examination of the alignment warping functions for all samples included in a study is an effective means to detect and diagnose chromatography problems encountered in the analysis of dozens of study samples. For example, oscillatory warping functions have been associated with pump mixing problems, while large-magnitude, mostly linear warping functions have been associated with column degradation.
14. Log2 is convenient because a unit change can be interpreted as a twofold change on the original scale.
15. Normalization can be particularly important for minimizing systematic biases in ion current introduced by sample collection and handling, sample concentration, instrument sensitivity drift during the course of data acquisition, etc. The spiked internal standard, chicken lysozyme, can be helpful in diagnosing and monitoring ion intensities before and after normalization. Quantile normalization assumes that the overall distribution of log2 peptide peak areas is unchanged from sample to sample. This is generally a reasonable assumption, but there are cases where a treatment effect may modulate the level of most of the proteins detected in a study, and in such cases quantile normalization should not be used. In these cases, the spiked internal standard, chicken lysozyme, can be used to normalize any systematic effects of the process on ion current occurring only after the standard was spiked.
16. In practice, we will analyze a study at both the peptide and protein levels. Peptide-level analyses are generally specific to the identified peptide and allow the opportunity to discover biologically related changes in peptide level due to processing of a specific region of a protein. Protein-level analyses provide additional statistical power to detect smaller magnitude changes in protein levels since we are averaging multiple peptide values, all of which have a high positive covariance.
Acknowledgments We thank John Saalwaechter and Andrew Kaczorek and the entire scientific computing team for their efforts in developing and maintaining the high-availability grid-computing environment used for this work. We also thank Jude Onyia and the statistical and mathematical sciences management team for supporting us in the development of these methods.
References 1. FDA Critical Path Initiative 2006 (http://www.fda.gov/oc/initiatives/criticalpath). 2. NIH Road Map for Medical Research 2006 (http://www.nihroadmap.nih.gov/ index.asp). 3. Gygi, S.P., Rist, B., Gerber, S.A., Turecek, F., Gelb, M.H., and Aebersold, R. 1999. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17: 994–999. 4. Aggarwal, K., Choe, L.H., and Lee, K.H. 2006. Shotgun proteomics using the iTRAQ isobaric tags. Brief. Funct. Genomic. Proteomic. 5: 112–120. 5. Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C. et al 2002. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359: 572–577. 6. Radulovic, D., Jelveh, S., Ryu, S., Hamilton, T.G., Foss, E., Mao, Y., and Emili, A. 2004. Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 3: 984–997. 7. Wiener, M.C., Sachs, J.R., Deyanova, E.G., and Yates, N.A. 2004. Differential mass spectrometry: a label-free LC-MS method for finding significant differences in complex peptide and protein mixtures. Anal. Chem. 76: 6085–6096. 8. Gao, J., Opiteck, G.J., Friedrichs, M.S., Dongre, A.R., and Hefta, S.A. 2003. Changes in the protein expression of yeast as a function of carbon source. J. Proteome. Res. 2: 643–649. 9. Colinge, J., Chiappe, D., Lagache, S., Moniatte, M., and Bougueleret, L. 2005. Differential Proteomics via probabilistic peptide identification scores. Anal. Chem. 77: 596–606. 10. Higgs, R.E., Knierman, M.D., Gelfanova, V., Butler, J.P., and Hale, J.E. 2005. Comprehensive label-free method for the relative quantification of proteins from biological samples. J. Proteome. Res. 4: 1442–1450. 11. Higgs, R.E., Knierman, M.D., Freeman, A.B., Gelbert, L.M., Patil, S.T., and Hale, J.E. 2007. 
Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. J. Proteome. Res. 6: 1758–1767. 12. Patil, S.T., Higgs, R.E., Brandt, J.E., Knierman, M.D., Gelfanova, V., Butler, J.P., Downing, A.M., Dorocke, J., Dean, R.A., Potter, W.Z. et al. 2007. Identifying pharmacodynamic protein markers of centrally active drugs in humans: a pilot study in a novel clinical model. J. Proteome. Res. 6: 955–966.
13. Anderson, L., and Hunter, C.L. 2006. Quantitative mass spectrometric multiple reaction monitoring assays for major plasma proteins. Mol Cell Proteomics 5: 573–588. 14. Anderson, N.L., and Anderson, N.G. 2002. The human plasma proteome: history, character, and diagnostic prospects. Mol Cell Proteomics 1: 845–867. 15. Gutman, S., and Kessler, L.G. 2006. The US Food and Drug Administration perspective on cancer biomarker development. Nat. Rev. Cancer 6: 565–571. 16. Rifai, N., Gillette, M.A., and Carr, S.A. 2006. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat. Biotechnol. 24: 971–983. 17. Hale, J.E., Butler, J.P., Gelfanova, V., You, J.S., and Knierman, M.D. 2004. A simplified procedure for the reduction and alkylation of cysteine residues in proteins prior to proteolytic digestion and mass spectral analysis. Anal. Biochem. 333: 174–181. 18. Proakis, J.G., and Manolakis, D.G. 1992. Digital Signal Processing – Principles, Algorithms and Applications. Prentice Hall, New York, NY. 19. Eng, J.K., Mccormack, A.L., and Yates, J.R. 1994. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 5: 976–989. 20. Craig, R., and Beavis, R.C. 2003. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17: 2310–2316. 21. Ulintz, P.J., Zhu, J., Qin, Z.S., and Andrews, P.C. 2006. Improved classification of mass spectrometry database search results using newer machine learning approaches. Mol Cell Proteomics 5: 497–509. 22. Benjamini, Y., and Hochberg, Y. 1995. Controlling the false discovery rate - a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B-Methodological 57: 289–300. 23. Keller, A., Nesvizhskii, A.I., Kolker, E., and Aebersold, R. 2002. 
Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74: 5383–5392. 24. Cleveland, W.S., Grosse, E., and Shyu, W.M. 1992. Local regression models. In Statistical Models in S. J.M. Chambers and T.J. Hastie, eds. Wadsworth & Brooks/Cole, Pacific Grove, CA. 25. Boelens, H.F., Dijkstra, R.J., Eilers, P.H., Fitzpatrick, F., and Westerhuis, J.A. 2004. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and Raman spectroscopic detection. J. Chromatogr. A 1057: 21–30. 26. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. 2003. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185–193. 27. Miller, R.G., Jr. 1991. Simultaneous Statistical Inference. Springer-Verlag, New York.
28. Butler, K.W., Deslauriers, R., Geoffrion, Y., Storey, J.M., Storey, K.B., Smith, I.C., and Somorjai, R.L. 1985. 31P nuclear magnetic resonance studies of crayfish (Orconectes virilis). The use of inversion spin transfer to monitor enzyme kinetics in vivo. Eur. J. Biochem. 149: 79–83. 29. Efron, B. 2004. Large-scale simultaneous hypothesis testing: the choice of a null distribution. J. Am. Stat. Assoc. 99: 96–104. 30. Pounds, S., and Cheng, C. 2005. Sample size determination for the false discovery rate. Bioinformatics 21: 4263–4271. 31. Hu, J., Zou, F., and Wright, F.A. 2005. Practical FDR-based sample size calculations in microarray experiments. Bioinformatics 21: 3264–3272. 32. Jung, S.H. 2005. Sample size for FDR-control in microarray data analysis. Bioinformatics 21: 3097–3104. 33. Li, S.S., Bigler, J., Lampe, J.W., Potter, J.D., and Feng, Z. 2005. FDR-controlling testing procedures and sample size determination for microarrays. Stat. Med. 24: 2267–2280. 34. Bemis, K.G. 2005. Statistical Issues with Mass Spectrometry Proteomics for Biomarker Discovery. In International Workshop on Statistical Methodology in Clinical and Nonclinical R&DDIA conference, Nice, France.
13 Analysis of the Extracellular Matrix and Secreted Vesicle Proteomes by Mass Spectrometry Zhen Xiao, Thomas P. Conrads, George R. Beck, Jr., and Timothy D. Veenstra
Summary The extracellular matrix (ECM) and secreted vesicles are unique structures outside of cells that carry out dynamic biological functions. ECM is created by most cell types and is responsible for the three-dimensional structure of the tissue or organ in which it originates. Many cells also produce or secrete specialized vesicles into the ECM, which are thought to influence the extracellular environment. ECM is not simply a physical structure that connects cells in a tissue or organ. The proteins in ECM and secreted vesicles are critical to cell function, differentiation, motility, and cell-to-cell interaction. Although a number of major structural proteins of ECM and secreted vesicles have long been known, an appreciation of the role of less-abundant non-collagenous proteins has just begun to emerge. This chapter outlines a series of methods used to isolate and enrich ECM constituents and secreted vesicles from bone-forming osteoblast cells, enabling comprehensive profiles of their proteomes to be obtained by mass spectrometry. These methods can be easily adapted to study ECM and secreted vesicles in other cell types, primary cell cultures derived from animal models, or tissue specimens.
Key Words: extracellular matrix; matrix vesicle; osteoblast; proteomics; mass spectrometry.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction Most cells reside in a matrix environment called the extracellular matrix (ECM), which offers the structural and nutritional support as well as a protective barrier required for cells to survive, interact, and differentiate. In addition to
the intracellular and tissue-related processes, it is becoming increasingly clear that alterations in the ECM can affect the pathogenesis of disease. While much effort has been devoted to the understanding of intracellular processes, the characteristics and functions of ECM have not been equally well studied. The evidence gathered to date has shown that ECM is a complicated organelle formed of various proteins that play central roles in cell differentiation, migration, and cell-to-cell communication (1,2,3). The complexity of ECM is exemplified in the structure of the skeleton. The formation and homeostasis of bone is an ongoing process throughout life, and involves the recruitment, replication, and differentiation of osteoblasts and osteoclasts (4). Osteoblasts are derived from mesenchymal stem cells and have the potential to further develop into either osteocytes or lining cells. When induced by the appropriate stimuli, such as ascorbic acid and β-glycerophosphate, osteoblasts undergo proliferation and maturation toward the osteocyte phenotype (Fig. 1) (5). This process is accompanied by the accumulation of an ECM and ultimately mineralization of the ECM in the form of hydroxyapatite (6). The deposition of hydroxyapatite in ECM is initiated by a unique type of vesicle secreted by osteoblasts, called matrix vesicles (MVs). With diameters ranging from 30 to 300 nm, these vesicles reside in the ECM and play a critical role in mineralization (7,8). They serve as nucleation sites for mineralization and sustain the accumulation of ECM (9). A number of proteins, such as annexins and phosphatases, have been identified within MVs. These proteins are responsible for the enrichment of calcium and phosphate within the vesicles (8,10,11,12,13). Although the presence and
Fig. 1. The three-stage timeline of the osteoblast cell differentiation. The mineral deposition is visualized by alizarin red staining of the osteoblasts cultured in the differentiation medium.
function of other proteins are largely unknown, changes in ECM and MV proteins are associated with diseases such as osteoporosis (14), arteriosclerosis (15,16,17,18), tumor development, and metastasis (19,20,21,22). A comprehensive profile of the proteins present in these extracellular organelles enables a greater understanding of the pathophysiology underlying these clinical manifestations. The development of mass spectrometry (MS) technology combined with appropriate protein enrichment and peptide separation strategies has made this aim achievable (23,24,25,26). This chapter describes the extraction of ECM constituents and MVs from the osteoblast cell line MC3T3-E1 followed by the analysis of their respective proteomic profiles by liquid chromatography (LC) fractionation combined with MS analysis (27). The ECM and MVs are isolated and enriched using centrifugation and enzymatic approaches. The enrichment of MVs is confirmed by the measurement of elevated alkaline phosphatase (ALP) activity. Following the creation of a complex mixture of peptides via a tryptic digestion of the extracted proteins, this mixture is fractionated using strong cation exchange (SCX) LC. These fractions are analyzed by nanoflow reversed-phase LC-tandem mass spectrometry (nanoRPLC-MS/MS), and proteins are identified by searching the data against an appropriate protein database.
2. Materials
2.1. Cell Culture
1. MC3T3-E1 pre-osteoblast cell line (see Note 1)
2. Cell culture medium MEM (Irvine Scientific, Santa Ana, CA)
3. Fetal bovine serum (Atlanta Biologicals, Atlanta, GA)
4. Penicillin-streptomycin solution (10,000 I.U./ml penicillin, 10,000 μg/ml streptomycin) (Invitrogen Corp., Carlsbad, CA)
5. 200 mM l-glutamine (Invitrogen Corp.)
6. Growth medium: MEM supplemented with 10% fetal bovine serum, 50 U/ml penicillin, 50 μg/ml streptomycin, and 2 mM l-glutamine
7. Differentiation medium: growth medium supplemented with 50 μg/ml ascorbic acid (Sigma Chemical Co., St. Louis, MO) and 10 mM β-glycerophosphate (Sigma Chemical Co.)
8. Phosphate-buffered saline (PBS)
9. Trypsin/EDTA (0.25% (w/v) trypsin/0.53 mM EDTA solution in Hank’s BSS without calcium or magnesium) (ATCC, Manassas, VA)
2.2. Extraction of the ECM Constituents 1. Liberase/blendzyme 1 (0.14 Wünsch units/ml) (Roche Applied Science, Indianapolis, IN) 2. Centrifuge 3. Bicinchoninic acid (BCA) protein assay reagent kit (Pierce, Rockford, IL)
2.3. Enrichment of MVs from the ECM 1. Liberase/blendzyme 1 (0.14 Wünsch units/ml) (Roche Applied Science, Indianapolis, IN) 2. Centrifuge
2.4. Isolation of MVs from Medium 1. Ultra-Clear™ centrifuge tubes: 1 × 3.5 in (38 ml) and 5/8 × 4 in (17 ml) (Beckman, Palo Alto, CA) 2. Optima L-90K preparative ultracentrifuge (Beckman Coulter, Inc., Palo Alto, CA)
2.5. Alkaline Phosphatase Assay 1. Mild lysis buffer: 250 mM NaCl, 50 mM HEPES, pH 7.5, 0.1% NP-40 2. ALP assay kit, including alkaline buffer (1.5 mM 2-amino-2-methyl-1-propanol, pH 10.3), p-nitrophenyl phosphate (PNPP) (4 mg/ml) and p-nitrophenol (PNP) standard solution (10 μmol/ml) (Sigma, St. Louis, MO) 3. Flat bottom 96-well plate 4. Lumimark microplate reader (Bio-Rad, Hercules, CA)
2.6. Strong Cation Exchange Liquid Chromatography of Peptides
1. Trypsin Gold, mass spectrometry grade (Promega, Madison, WI)
2. 25% (v/v) acetonitrile containing 0.1% (v/v) formic acid
3. SCX-LC column (1 mm × 150 mm, polysulfoethyl A) (PolyLC, Columbia, MD)
4. Mobile phase A: 25% (v/v) acetonitrile
5. Mobile phase B: 25% (v/v) acetonitrile containing 0.5 M ammonium formate, pH 3
6. 0.1% (v/v) formic acid
7. Vacuum centrifuge
8. Laser-induced fluorescence (LIF) detector
Fig. 2. Transmission electron microscopic image of matrix vesicles in the ultracentrifuge pellets (A). The high magnification image (B) shows fine-needle deposits and black dots, likely signs of calcification, both inside and around the vesicles. Also note the bilayer membrane of the vesicles (arrowhead).
2.7. Nanoflow Reversed-phase Liquid Chromatography Tandem Mass Spectrometry 1. Slurry packer model 1666 (Alltech, Columbia, MD) 2. Ceramic cutter 3. 75 μm i.d. × 360 μm o.d. × 12 cm long fused silica capillary column (Polymicro Technologies, Phoenix, AZ) 4. 5 μm, 300 Å pore size C-18 silica-bonded stationary RP particles (Jupiter, Phenomenex, Torrance, CA) 5. Agilent 1100 nanoLC system (Agilent Technologies, Palo Alto, CA) coupled with a linear ion-trap (LIT) mass spectrometer (LTQ, ThermoElectron, San Jose, CA) 6. Glass sample injection vials 12 × 32 mm (Wheaton, Millville, NJ) 7. Mobile phase A: 0.1% (v/v) formic acid 8. Mobile phase B: 0.1% formic acid (v/v) in acetonitrile
2.8. Bioinformatic Analysis 1. 20-node Beowulf cluster computer server 2. SEQUEST Cluster version 3.1 SR1 (Thermo Electron Corp., Waltham, MA) 3. Bioworks Browser software 3.2 (Thermo Electron Corp.)
2.9. Validation by Immunofluorescence Staining 1. Primary antibodies: anti-annexin V, anti-emilin-1, anti-IQGAP1 (Santa Cruz Biotechnology, Inc., Santa Cruz, CA) 2. Secondary antibodies: goat anti-rabbit IgG-FITC, and donkey anti-goat IgG-TR (Santa Cruz Biotechnology) 3. PBS solution 4. 18 × 18 × 0.15 mm thick glass cover slips 5. Regular microscope glass slides 6. Blocking serum: 10% normal blocking serum in PBS. The blocking serum is derived from the same species in which the secondary antibody is raised. For example, if the secondary antibody is raised in goat, use the normal goat serum diluted to 10% in PBS as the blocking serum. 7. Fixative solution: 3.7% (v/v) formaldehyde in PBS 8. DAPI diluted 1:50,000 in PBS (Invitrogen, Carlsbad, CA) 9. ProLong mounting reagent (Invitrogen) 10. Confocal fluorescence microscope LSM 510 Meta NLO (Carl Zeiss, Oberkochen, Germany)
3. Methods The ECM proteins are extracted from cultured cells by a short exposure to an ECM-degrading enzyme. To isolate MVs that are either confined to the ECM or reside in the cell culture medium, two approaches may be used: (1) For MVs confined to the ECM, an ECM-degrading enzyme is first applied followed by centrifugation and ultracentrifugation; (2) for MVs in the medium, centrifugation and ultracentrifugation are applied. The characterization of ECM and MV proteomes is performed using LC fractionation and MS analysis. 3.1. Cell Culture 1. Grow the murine calvaria-derived osteoblast MC3T3-E1 cells in growth medium. The medium is changed every two or three days. Passage the cells with trypsin/EDTA (see Note 1). 2. Once the cell culture reaches ∼50% confluency, replace the growth medium with 10 ml of differentiation medium per plate to induce osteoblast differentiation. 3. Extract the ECM or harvest culture medium on the day indicated in the methods below.
3.2. Extraction of the ECM Constituents 1. Grow MC3T3-E1 cells in differentiation medium on 10-cm plates. Change the medium every two or three days (see Note 2). 2. On day 21, aspirate the medium from the plates. Wash the cells with 10 ml of PBS solution three times. 3. Add 3 ml of liberase/blendzyme 1 solution to each plate. Incubate at 37°C for 30 min. 4. Carefully collect the digested supernatant from the plates without disturbing the cells. 5. Centrifuge the supernatant at 2000×g for 5 min to remove any free cells. The resulting supernatant contains ECM proteins. 6. Quantify the amount of ECM proteins using the BCA assay (see Note 3).
3.3. Enrichment of MVs from the ECM 1. Follow the same procedure described earlier to grow and prepare cells (see Subheading 3.2, steps 1 and 2, and Note 2). 2. On day 21, aspirate the medium and wash the cells three times with PBS. 3. Add 3 ml of liberase/blendzyme 1 solution to each plate. Incubate at 37°C for 30 min (see Note 4). 4. Collect the supernatant from the plates without disturbing the cells. Centrifuge the supernatant at 2000×g for 5 min to remove any cells that may have been detached from the plate. Collect the supernatant. 5. Centrifuge the supernatant at 20,000×g at 4°C for 30 min.
6. Transfer the supernatant to the Ultra-Clear™ centrifuge tubes. Use the centrifuge tubes that fit the volume of the supernatant. Fill the tubes with PBS up to about 2–3 mm from the top. 7. Subject the supernatant to ultracentrifugation at 100,000×g at 4°C for 60 min. Carefully remove the supernatant without disturbing the pellet. 8. The pellets are enriched with MVs designated as collagenase-released MVs (CRMVs) (see Note 5). 9. Confirm the enrichment of CRMVs by assaying the ALP activity using an aliquot of the pellet (see Note 6 and Subheading 3.5). 10. Resuspend the rest of the pellet in 25 mM NH4HCO3, pH 8.4. Quantify the amount of CRMV proteins in the pellet by BCA assay (see Note 3).
3.4. Isolation of MVs from Medium 1. Grow MC3T3-E1 cells in differentiation medium in four 10-cm plates. 2. On day 15, collect the media from multiple plates (see Note 2). 3. Separate cellular debris from the medium by centrifugation at 20,000×g for 30 min at 4°C. 4. Transfer the supernatant to Ultra-Clear™ centrifuge tubes. Use the centrifuge tubes that fit the volume of the supernatant. 5. Further centrifuge the supernatant by ultracentrifugation at 100,000×g for 60 min. 6. Carefully remove the supernatant. The MVs in the pellet are designated as medium MVs (MMVs) (see Note 5 and Fig. 1). 7. Resuspend an aliquot of the MMV sample in 25 mM NH4HCO3, pH 8.4. Determine the protein concentration in the pellet by BCA assay.
3.5. Alkaline Phosphatase Assay 1. For the standard curve: Dilute PNP standard 1:10 in dH2O. Add 0, 2, 4, 6, 8, 10, 20, 30, 40, and 50 μl of the standard (i.e., 0, 2, 4, 6, 8, 10, 20, 30, 40, and 50 nmol, respectively) to the wells of a flat-bottom 96-well microtiter plate. Add mild lysis buffer to make a total volume of 135 μl. 2. For the CRMV and MMV samples: Resuspend an aliquot of the ultracentrifuged pellet in mild lysis buffer. Quantify the protein by BCA assay. Based on the BCA assay results, add 25 μg of protein to the 96-well microtiter plate. Add further mild lysis buffer to make a total volume of 135 μl/well. 3. Add 25 μl of alkaline buffer and 25 μl of p-nitrophenyl phosphate (PNPP) to each well. 4. Incubate the microtiter plate at 37°C for up to 3 h. Monitor the colorimetric change every hour by measuring absorbance at 405 nm using the microtiter plate reader. Stop incubation when the absorbance of the sample reaches the range of the standards. 5. Determine the ALP activity in MV samples by comparing to the PNP standard curve. Report the ALP activity as nmol PNP produced per minute per milligram of protein used (see Note 6).
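The calculation in step 5 can be sketched as follows. This is a minimal illustration only, assuming that the standard curve is linear over the range used and is fitted by ordinary least squares; the function names and all absorbance readings are hypothetical and are not taken from the protocol.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def alp_activity(a405, slope, intercept, minutes, protein_ug):
    """ALP activity in nmol PNP per minute per mg of protein."""
    nmol_pnp = (a405 - intercept) / slope   # read nmol PNP off the standard curve
    protein_mg = protein_ug / 1000.0
    return nmol_pnp / minutes / protein_mg


# Standard amounts from step 1 (nmol PNP per well).
standard_nmol = [0, 2, 4, 6, 8, 10, 20, 30, 40, 50]
# Hypothetical A405 readings, invented purely for illustration.
standard_a405 = [0.05 + 0.02 * n for n in standard_nmol]

slope, intercept = fit_line(standard_nmol, standard_a405)

# Hypothetical sample: A405 = 0.45 after 60 min with 25 ug of protein per well.
activity = alp_activity(0.45, slope, intercept, minutes=60, protein_ug=25)
```

Because the incubation time in step 4 varies between runs, the time actually elapsed at the reading used should be passed as `minutes`.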
3.6. Strong Cation Exchange Liquid Chromatography of Peptides 1. Digest 100 μg of ECM, CRMV, or MMV proteins in 25 mM NH4HCO3, pH 8.4, with trypsin using a trypsin-to-protein ratio of 1:40. For 100 μg of protein, add 2.5 μg of trypsin. Incubate the digestion at 37°C overnight (see Note 7). 2. Lyophilize the peptide digests in a vacuum centrifuge. 3. Dissolve peptide digests in 100 μl of 25% (v/v) acetonitrile containing 0.1% (v/v) formic acid. 4. Inject the peptides onto an SCX-LC column (1 × 150 mm, polysulfoethyl A). 5. Maintain the flow rate of the column at 50 μl/min. Mobile phase A is 25% (v/v) acetonitrile, and mobile phase B is 25% (v/v) acetonitrile with 0.5 M ammonium formate (pH 3). 6. Elute the peptides using the following 96-min gradient method: 3% B for 3 min, followed by a linear increase to 10% B in 43 min, a further increase to 45% B in 40 min, and then to 100% B in 10 min. Monitor the peptide separation by fluorescence (266 nm excitation/350 nm emission). Collect fractions every minute for 96 min (see Note 8). 7. Based on the chromatogram, pool the adjacent fractions into a total of 20 fractions and lyophilize (see Notes 9 and 10). 8. Resuspend each pooled fraction in 20 μl of 0.1% (v/v) formic acid prior to nanoRPLC-MS analysis.
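The arithmetic in steps 1 and 6 can be checked with a short sketch. It encodes the gradient segments exactly as listed above and assumes linear interpolation within each segment; the names `SCX_GRADIENT`, `percent_b`, and `trypsin_ug` are illustrative and are not part of any instrument software.

```python
# (duration in min, starting %B, ending %B), taken from step 6.
SCX_GRADIENT = [
    (3, 3.0, 3.0),       # hold at 3% B
    (43, 3.0, 10.0),     # linear increase to 10% B
    (40, 10.0, 45.0),    # further increase to 45% B
    (10, 45.0, 100.0),   # final ramp to 100% B
]


def percent_b(t_min, gradient=SCX_GRADIENT):
    """Return %B at time t_min, interpolating linearly within each segment."""
    elapsed = 0.0
    for duration, start, end in gradient:
        if t_min <= elapsed + duration:
            fraction = (t_min - elapsed) / duration
            return start + fraction * (end - start)
        elapsed += duration
    return gradient[-1][2]  # past the end of the program: stay at the final value


def trypsin_ug(protein_ug, ratio=40):
    """Trypsin mass for a 1:ratio trypsin-to-protein ratio (step 1)."""
    return protein_ug / ratio


# The segment durations should add up to the 96-min method length.
total_minutes = sum(duration for duration, _, _ in SCX_GRADIENT)
```

With one fraction collected per minute, `percent_b(n)` gives the mobile phase composition at the end of fraction `n`, which can help when deciding which adjacent fractions to pool.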
3.7. Nanoflow Reversed-Phase Liquid Chromatography Tandem Mass Spectrometry 1. Cut a 12-cm piece of 75 μm i.d. × 360 μm o.d. fused silica capillary column. Use a torch to briefly flame the section about 2 cm near one end. Once the flamed section is soft, pull the column to make a 10-cm long section with a closed tip. To make a fine and flat opening at the end of the tip, lightly score near the end of the closed tip using a ceramic cutter, and then break the end away. 2. Connect the column to the slurry packer. Pack the column with 5 μm, 300 Å pore size C-18 silica-bonded stationary reversed-phase particles. 3. Connect the column to an Agilent 1100 nanoLC system coupled with a LIT mass spectrometer (LTQ, ThermoElectron, operated with Xcalibur 1.4 SR1 software). 4. Transfer the peptide fractions into glass vials. Inject 6 μl of the solution. 5. Mobile phase A is 0.1% (v/v) formic acid and B is 0.1% (v/v) formic acid in acetonitrile. Elute the peptides using the following gradient method: 2% B at 500 nl/min in 30 min; a linear increase of 2–42% B at 250 nl/min in 110 min; 42–98% in 30 min including the first 15 min at 250 nl/min and then 15 min at 500 nl/min; 98% at 500 nl/min for 10 min. 6. Set the capillary temperature and electrospray voltage at 160°C and 1.5 kV, respectively. The LIT-MS is operated in a data-dependent MS/MS mode where the five most abundant peptide molecular ions in every MS scan are sequentially selected for collision-induced dissociation (CID) using a normalized collision
energy of 35%. Apply dynamic exclusion to minimize repeated selection of peptides previously selected for CID (see Notes 11 and 12).
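The data-dependent selection logic described in step 6 can be illustrated with a simplified sketch. It is not the instrument's acquisition algorithm: the exclusion duration and the m/z tolerance are not given in this protocol, so `exclusion_s` and `mz_tol` are hypothetical values, and the scan data are invented.

```python
def select_precursors(ms_scan, scan_time, exclusion, top_n=5,
                      exclusion_s=60.0, mz_tol=0.5):
    """Pick up to top_n of the most intense ions that are not currently excluded.

    ms_scan   : list of (mz, intensity) tuples from one MS scan
    scan_time : acquisition time of the scan in seconds
    exclusion : dict mapping excluded m/z -> time (s) at which the exclusion expires
    exclusion_s, mz_tol : hypothetical settings, not specified in the protocol
    """
    # Drop exclusion entries that have expired.
    for mz in [m for m, expiry in exclusion.items() if expiry <= scan_time]:
        del exclusion[mz]

    selected = []
    # Walk through the ions from most to least intense.
    for mz, _intensity in sorted(ms_scan, key=lambda ion: ion[1], reverse=True):
        if any(abs(mz - excluded) <= mz_tol for excluded in exclusion):
            continue  # selected for CID recently, so skip it
        selected.append(mz)
        exclusion[mz] = scan_time + exclusion_s
        if len(selected) == top_n:
            break
    return selected


# Invented example: the same seven ions appear in three MS scans.
scan = [(400.2, 900), (512.3, 800), (650.4, 700), (701.5, 600),
        (820.6, 500), (915.7, 400), (1001.8, 300)]
exclusion = {}
first = select_precursors(scan, 0.0, exclusion)    # five most intense ions
second = select_precursors(scan, 2.0, exclusion)   # only the two not yet excluded
third = select_precursors(scan, 61.0, exclusion)   # first five are eligible again
```

The example shows why dynamic exclusion increases the number of distinct peptides fragmented: in the second scan the lower-abundance ions are selected because the five most abundant ones are temporarily excluded.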
3.8. Bioinformatic Analysis 1. Search the tandem mass spectra against the UniProt proteomic database from the European Bioinformatics Institute (http://www.ebi.ac.uk/) with SEQUEST operating on a 40-node Beowulf cluster (SEQUEST Cluster version 3.1 SR1, Bioworks Browser 3.2). Limit the search to peptides generated with fully tryptic cleavage constraints. 2. Set legitimate peptide identification criteria as follows: charge state-dependent cross-correlation (Xcorr) scores of at least 1.9 for [M + H]⁺, 2.2 for [M + 2H]²⁺, and 3.1 for [M + 3H]³⁺, and a minimum delta correlation (ΔCn) of 0.08. 3. Base protein identification exclusively on unique peptide hits, i.e., peptides whose sequence is unique to a given protein (see Notes 13 and 14).
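The filtering in steps 2 and 3 can be expressed as a short sketch. It assumes that the listed Xcorr values are minimum thresholds, that charge states other than 1+ to 3+ are rejected, and that search results have already been exported as simple tuples; the data layout and function names are hypothetical and do not correspond to the SEQUEST or Bioworks output format.

```python
XCORR_MIN = {1: 1.9, 2: 2.2, 3: 3.1}  # charge state -> minimum Xcorr (step 2)
DELTA_CN_MIN = 0.08                   # minimum delta correlation (step 2)


def passes_filter(charge, xcorr, delta_cn):
    """Apply the charge-dependent Xcorr and delta-Cn criteria."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:
        return False  # assumption: charge states not listed are rejected
    return xcorr >= threshold and delta_cn >= DELTA_CN_MIN


def proteins_from_unique_peptides(hits, peptide_to_proteins):
    """Keep proteins supported by peptides that map to exactly one protein.

    hits                : iterable of (peptide, charge, xcorr, delta_cn)
    peptide_to_proteins : dict mapping peptide -> set of proteins containing it
    Returns a dict mapping protein -> set of its accepted unique peptides.
    """
    identified = {}
    for peptide, charge, xcorr, delta_cn in hits:
        if not passes_filter(charge, xcorr, delta_cn):
            continue
        proteins = peptide_to_proteins.get(peptide, set())
        if len(proteins) == 1:          # sequence is unique to one protein (step 3)
            (protein,) = proteins
            identified.setdefault(protein, set()).add(peptide)
    return identified


# Invented example data; the peptide and protein labels are placeholders.
hits = [
    ("PEPTIDEA", 2, 2.5, 0.10),   # passes, unique to P1
    ("PEPTIDEB", 2, 2.0, 0.20),   # fails Xcorr for 2+
    ("PEPTIDEC", 3, 3.5, 0.05),   # fails delta-Cn
    ("PEPTIDED", 1, 1.9, 0.08),   # passes at the thresholds, unique to P2
    ("PEPTIDEE", 2, 3.0, 0.30),   # passes, but shared by P1 and P3
]
mapping = {
    "PEPTIDEA": {"P1"}, "PEPTIDEB": {"P1"}, "PEPTIDEC": {"P2"},
    "PEPTIDED": {"P2"}, "PEPTIDEE": {"P1", "P3"},
}
identified = proteins_from_unique_peptides(hits, mapping)
```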
3.9. Immunofluorescence Staining 1. Plate 50,000 cells on glass cover slips in 6-well plates. Culture in differentiation medium. 2. On day 15, briefly wash the cells with PBS. 3. Fix the cells in 3.7% (v/v) formaldehyde in PBS for 10 min. 4. Incubate with 10% (v/v) normal blocking serum in PBS. 5. Briefly wash the cells with PBS; incubate with primary antibodies for 1.5 h. 6. Wash the cells three times with PBS for 5 min each, and then incubate with secondary antibodies conjugated with fluorochrome (FITC or Texas Red) for 1 h. 7. Wash the cells three times with PBS for 5 min each, including once with DAPI diluted 1:50,000 in PBS to stain nuclei. 8. Mount the cover slips on microscope glass slides with ProLong mounting reagent. 9. Observe the cells using a confocal fluorescence microscope (see Note 14).
4. Notes 1. MC3T3-E1 pre-osteoblast cells are derived from newborn murine calvaria (28). These cells closely resemble primary cell cultures in their proliferation, differentiation, and mineralization (29,30,31). The combination of ascorbic acid and β-glycerophosphate stimulates MC3T3-E1 to undergo differentiation, which is characterized by substantial matrix mineralization (32,33). Therefore, it is a suitable model for the enrichment of ECM and isolation of MVs. 2. It is necessary to culture multiple 10-cm plates (four or more at approximately 4 × 10⁶ cells/plate) in order to obtain a sufficient amount of protein from ECM or MVs. 3. Protein quantitation is a common laboratory procedure. The instructions are included within the BCA assay kit (Pierce); therefore, the procedure is not described in this chapter.
4. The liberase/blendzyme 1 is a mixture of highly purified collagenase and dispase that offers gentle protease activity as compared to other ECM-degrading enzymes. Note that four blendzyme mixtures with increasing levels of enzymatic strength are available from Roche. Blendzyme 1 is the mildest version. The digestion time varies depending on the cell or tissue type. Alternatively, collagenase/dispase (1 mg/ml of collagenase/dispase in PBS, containing collagenase at 0.1 U/ml and dispase at 0.8 U/ml) (Sigma Chemical Co., St. Louis, MO) can be used. Collagenase/dispase enzyme mixture is commonly used to digest the ECM. 5. Two approaches are designed to isolate MVs either from the ECM or directly from the cell culture medium. In the first approach, enzymatic digestion and ultracentrifugation are combined to release MVs embedded in the ECM (designated as CRMVs). In the second approach, ultracentrifugation is applied to the medium to isolate MVs, designated as MMVs (34). To confirm the enrichment of MVs, the ultracentrifugation pellets are fixed and examined using transmission electron microscopy (Fig. 2). 6. Measurement of the enzymatic activity of ALP is a standard marker for MV isolation (35,36). 7. Instead of using the buffer provided along with trypsin, it is desirable to resuspend trypsin in 25 mM NH4HCO3, pH 8.4. The trypsin-to-protein ratio should be between 1:40 and 1:50. The digestion mixture is incubated overnight (approximately 16 h). 8. The LIF detector used in this method can be constructed in-house (37). The LIF detector is more sensitive than a conventional lamp-based fluorescence detector. The use of a LIF detector is particularly advantageous when a narrow bore column (<1 mm i.d.) or a micro column (<300 μm i.d.) is used. Some conventional fluorescence detectors can be used with the narrow bore or micro column; however, the sensitivity is lower.
When the peptide content is low and a narrow diameter column is being used, the LIF detector offers better sensitivity. For a peptide to be detectable using fluorescence detection, it must contain at least one aromatic residue, particularly tryptophan. Although tryptophan-containing proteins are comparatively rare, the complexity of the peptide mixture compensates to provide a good estimate of the separation. An alternative to LIF detection is UV. The advantage of a UV detector is that it detects amide bonds, which are universally present in peptides. The main disadvantage, however, is its limited compatibility with biological buffers. Volatile salts, such as ammonium formate, used in this method are incompatible with UV detection since formate absorbs strongly at 214 nm, which is the wavelength used for peptide detection. Sodium chloride is compatible with UV detection, but it is non-volatile. In that case, desalting of the peptide fractions is needed before downstream analysis. This desalting step may lead to sample loss. 9. All the automated two-dimensional (online) LC systems use chloride as the salt in the SCX first dimension, and a desalting step has to be implemented in the program. However, we found that the offline multi-dimensional separation of
peptides is capable of identifying more proteins than the online procedure. Thus, the offline separation is described in this chapter. 10. The pooling step is optional. The peptide fractions can be pooled based on the complexity of the chromatogram. In general, pooling to about 20 fractions is appropriate. It will save LC-MS running time without compromising the number of proteins that the approach can identify. 11. In general, the MS data acquisition time is set to 150 min, starting 30 min after the beginning of the peptide elution gradient and synchronized to end with the elution gradient. 12. An alternative approach: the resulting ECM, CRMV, or MMV protein samples can be resolved by SDS-PAGE and the proteins visualized by Coomassie staining. The protein bands that are of greater intensity than those prepared from undifferentiated cells can be excised and subjected to in-gel digestion with trypsin and analyzed using nanoRPLC-MS/MS (27). 13. Proteins that are identified in both CRMV and MMV purifications can be considered as authentic MV proteins with a higher degree of confidence than those that were identified in only one of the preparations. 14. Gene ontology (GO) (www.geneontology.org) can be used to annotate the identified proteins and categorize them according to their cellular location, molecular function, and cellular processes they are associated with. 15. The validation of known MV proteins is conducted using Western blotting or immunofluorescence staining. Annexin V, a known constituent of MVs, is used as a protein landmark to locate vesicles in these experiments (38). The osteoblast cells can be double-stained with anti-annexin V and an additional antibody against either the extracellular protein emilin-1 or the ras GTPase, IQGAP1 (27).
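The comparison of CRMV and MMV identifications described in the notes above amounts to a set intersection. A minimal sketch follows, assuming that each preparation has been reduced to a list of protein identifiers; the function name and the placeholder identifiers are hypothetical.

```python
def classify_mv_proteins(crmv_proteins, mmv_proteins):
    """Split identifications into those seen in both preparations or in only one."""
    crmv, mmv = set(crmv_proteins), set(mmv_proteins)
    return {
        "both": crmv & mmv,        # higher-confidence MV proteins
        "crmv_only": crmv - mmv,
        "mmv_only": mmv - crmv,
    }


# Placeholder identifiers, not real results.
groups = classify_mv_proteins(
    ["protein_A", "protein_B", "protein_C"],
    ["protein_B", "protein_C", "protein_D"],
)
```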
Acknowledgments This project has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does the mention of trade names, commercial products, or organization imply endorsement by the US Government. References 1. Holmbeck, K. and Szabova, L. (2006) Aspects of extracellular matrix remodeling in development and disease. Birth Defects Res C Embryo Today 78, 11–23. 2. Brooke, B. S., Karnik, S. K. and Li, D. Y. (2003) Extracellular matrix in vascular morphogenesis and disease: structure versus signal. Trends Cell Biol 13, 51–56. 3. Tahinci, E. and Lee, E. (2004) The interface between cell and developmental biology. Curr Opin Genet Dev 14, 361–366.
4. Harada, S. and Rodan, G. A. (2003) Control of osteoblast function and regulation of bone mass. Nature 423, 349–355. 5. Beck, G. R., Jr. (2003) Inorganic phosphate as a signaling molecule in osteoblast differentiation. J Cell Biochem 90, 234–243. 6. Aubin, J. E. (2001) Regulation of osteoblast formation and function. Rev Endocr Metab Disord 2, 81–94. 7. Anderson, H. C. (1995) Molecular biology of matrix vesicles. Clin Orthop Relat Res, 266–280. 8. Anderson, H. C. (2003) Matrix vesicles and calcification. Curr Rheumatol Rep 5, 222–226. 9. Anderson, H. C., Garimella, R. and Tague, S. E. (2005) The role of matrix vesicles in growth plate development and biomineralization. Front Biosci 10, 822–837. 10. Kirsch, T. (2005) Annexins – their role in cartilage mineralization. Front Biosci 10, 576–581. 11. Hessle, L., Johnson, K. A., Anderson, H. C., Narisawa, S., Sali, A., Goding, J. W., Terkeltaub, R. and Millan, J. L. (2002) Tissue-nonspecific alkaline phosphatase and plasma cell membrane glycoprotein-1 are central antagonistic regulators of bone mineralization. Proc Natl Acad Sci USA 99, 9445–9449. 12. Johnson, K. A., Hessle, L., Vaingankar, S., Wennberg, C., Mauro, S., Narisawa, S., Goding, J. W., Sano, K., Millan, J. L. and Terkeltaub, R. (2000) Osteoblast tissue-nonspecific alkaline phosphatase antagonizes and regulates PC-1. Am J Physiol Regul Integr Comp Physiol 279, R1365–1377. 13. Morris, D. C., Masuhara, K., Takaoka, K., Ono, K. and Anderson, H. C. (1992) Immunolocalization of alkaline phosphatase in osteoblasts and matrix vesicles of human fetal bone. Bone Miner 19, 287–298. 14. Baldini, V., Mastropasqua, M., Francucci, C. M. and D’Erasmo, E. (2005) Cardiovascular disease and osteoporosis. J Endocrinol Invest 28, 69–72. 15. Dao, H. H., Essalihi, R., Bouvet, C. and Moreau, P. (2005) Evolution and modulation of age-related medial elastocalcinosis: impact on large artery stiffness and isolated systolic hypertension. Cardiovasc Res 66, 307–317. 16. Reynolds, J.
L., Joannides, A. J., Skepper, J. N., McNair, R., Schurgers, L. J., Proudfoot, D., Jahnen-Dechent, W., Weissberg, P. L. and Shanahan, C. M. (2004) Human vascular smooth muscle cells undergo vesicle-mediated calcification in response to changes in extracellular calcium and phosphate concentrations: a potential mechanism for accelerated vascular calcification in ESRD. J Am Soc Nephrol 15, 2857–2867. 17. Abedin, M., Tintut, Y. and Demer, L. L. (2004) Vascular calcification: mechanisms and clinical ramifications. Arterioscler Thromb Vasc Biol 24, 1161–1170. 18. Tintut, Y. and Demer, L. L. (2001) Recent advances in multifactorial regulation of vascular calcification. Curr Opin Lipidol 12, 555–560. 19. Stewart, D. A., Cooper, C. R. and Sikes, R. A. (2004) Changes in extracellular matrix (ECM) and ECM-associated proteins in the metastatic progression of prostate cancer. Reprod Biol Endocrinol 2, 2.
20. Yin, J. J., Pollock, C. B. and Kelly, K. (2005) Mechanisms of cancer metastasis to the bone. Cell Res 15, 57–62. 21. Mundy, G. R. (2002) Metastasis to bone: causes, consequences and therapeutic opportunities. Nat Rev Cancer 2, 584–593. 22. Roodman, G. D. (2004) Mechanisms of bone metastasis. N Engl J Med 350, 1655–1664. 23. Yates, J. R., III. (2004) Mass spectral analysis in proteomics. Annu Rev Biophys Biomol Struct 33, 297–316. 24. Yates, J. R., III, Gilchrist, A., Howell, K. E. and Bergeron, J. J. (2005) Proteomics of organelles and large cellular structures. Nat Rev Mol Cell Biol 6, 702–714. 25. Domon, B. and Aebersold, R. (2006) Mass spectrometry and protein analysis. Science 312, 212–217. 26. Aebersold, R. and Mann, M. (2003) Mass spectrometry-based proteomics. Nature 422, 198–207. 27. Xiao, Z., Camalier, C. E., Nagashima, K., Chan, K. C., Lucas, D. A., de la Cruz, M. J., Gignac, M., Lockett, S., Issaq, H. J., Veenstra, T. D., Conrads, T. P. and Beck Jr, G. R. (2006) Analysis of the extracellular matrix vesicle proteome in mineralizing osteoblasts. J Cell Physiol, In press. 28. Sudo, H., Kodama, H. A., Amagai, Y., Yamamoto, S. and Kasai, S. (1983) In vitro differentiation and calcification in a new clonal osteogenic cell line derived from newborn mouse calvaria. J Cell Biol 96, 191–198. 29. Choi, J. Y., Lee, B. H., Song, K. B., Park, R. W., Kim, I. S., Sohn, K. Y., Jo, J. S. and Ryoo, H. M. (1996) Expression patterns of bone-related proteins during osteoblastic differentiation in MC3T3-E1 cells. J Cell Biochem 61, 609–618. 30. Quarles, L. D., Yohay, D. A., Lever, L. W., Caton, R. and Wenstrup, R. J. (1992) Distinct proliferative and differentiated stages of murine MC3T3-E1 cells in culture: an in vitro model of osteoblast development. J Bone Miner Res 7, 683–692. 31. Franceschi, R. T., Iyer, B. S. and Cui, Y. (1994) Effects of ascorbic acid on collagen matrix formation and osteoblast differentiation in murine MC3T3-E1 cells. J Bone Miner Res 9, 843–854. 
32. Beck, G. R., Jr, Sullivan, E. C., Moran, E. and Zerler, B. (1998) Relationship between alkaline phosphatase levels, osteopontin expression, and mineralization in differentiating MC3T3-E1 osteoblasts. J Cell Biochem 68, 269–280. 33. Beck, G. R., Jr, Zerler, B. and Moran, E. (2001) Gene array analysis of osteoblast differentiation. Cell Growth Differ 12, 61–83. 34. Johnson, K., Moffa, A., Chen, Y., Pritzker, K., Goding, J. and Terkeltaub, R. (1999) Matrix vesicle plasma cell membrane glycoprotein-1 regulates mineralization by murine osteoblastic MC3T3 cells. J Bone Miner Res 14, 883–892. 35. Ali, S. Y., Sajdera, S. W. and Anderson, H. C. (1970) Isolation and characterization of calcifying matrix vesicles from epiphyseal cartilage. Proc Natl Acad Sci USA 67, 1513–1520. 36. Dean, D. D., Schwartz, Z., Bonewald, L., Muniz, O. E., Morales, S., Gomez, R., Brooks, B. P., Qiao, M., Howell, D. S. and Boyan, B. D. (1994) Matrix vesicles
produced by osteoblast-like cells in culture become significantly enriched in proteoglycan-degrading metalloproteinases after addition of beta-glycerophosphate and ascorbic acid. Calcif Tissue Int 54, 399–408. 37. Chan, K. C., Muschik, G. M. and Issaq, H. J. (2000) Solid-state UV laser-induced fluorescence detection in capillary electrophoresis. Electrophoresis 21, 2062–2066. 38. Wang, W., Xu, J. and Kirsch, T. (2005) Annexin V and terminal differentiation of growth plate chondrocytes. Exp Cell Res 305, 156–165.
IV Clinical Proteomics and Antibody Arrays
14 Miniaturized Parallelized Sandwich Immunoassays Hsin-Yun Hsu, Silke Wittemann, and Thomas O. Joos
Summary This chapter describes the development and use of bead-based miniaturized multiplexed sandwich immunoassays for focused protein profiling. Bead-based protein arrays or suspension microarrays allow simultaneous analysis of a variety of parameters within a single experiment. In suspension microarrays, capture antibodies are coupled onto color-coded microspheres. The applications of suspension microarrays are described, which allow the analysis of proteins present in different types of body fluids, such as serum or plasma, cerebrospinal, pleural and synovial fluids, as well as cell culture supernatants. The chapter is divided into the generation of suspension microarrays, sample preparation, processing of suspension microarrays, validation of analytical performance, and finally pattern generation using bioinformatics tools.
Key Words: suspension microarray; microspheres; immunoassay; protein profiling; biological fluids; serum; pleura; cell culture supernatants; cerebrospinal fluid; synovial fluid.
1. Introduction Protein microarray technology allows simultaneous determination of a large variety of analytes from a minute amount of sample within a single experiment. Assay systems based on this technology are currently applied for identification and quantitation of proteins. Protein microarray technology is of major interest for proteomic research in basic and applied biology as well as for diagnostic applications. Miniaturized and parallelized assay systems have reached adequate sensitivity, and hence have the potential to replace singleplex analysis systems. From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
Besides the well-known planar microarray-based systems, which are perfectly suited to screen a large number of target proteins, bead-based systems named suspension assays are a very interesting alternative, especially when the number of parameters of interest is comparably low. Suspension assay systems employ different color-coded or size-coded microspheres as the solid support for capture molecules. A flow cytometer, which is able to identify each individual type of bead and quantify the amount of captured targets on each individual bead, is used as a readout system. In the first step, antigen-specific capture antibodies are immobilized on the individual bead type. Different bead types are combined and incubated with the sample of interest. A labeled secondary antibody detects the captured analytes and is visualized with a fluorescent reporter system. Sensitivity, reliability, and accuracy are similar to those observed with standard microtiter ELISA procedures (1). Color-coded microspheres can be used to perform up to a hundred different assay types simultaneously. The flow cytometer identifies several thousand microspheres in a second, and simultaneously quantitates the amount of captured analytes (2,3,4,5,6). Suspension microarrays are currently advanced within the field of miniaturized multiplexed ligand binding assays with respect to automation and throughput (7). Miniaturized parallelized assay systems have to demonstrate appropriate sensitivity, precision, and reliability before they will be applied for screening or diagnostic purposes. This chapter describes the development and use of suspension antibody microarrays for protein profiling of several human body fluids. The standard methodology guidance is described to validate immunoassays (10,11,12) and to determine the sensitivity, precision, and accuracy of the multiplexed analysis. In the final section, data analysis is described to show how to deal with high-dimensional data sets (13,14).
2. Materials 2.1. Equipment 1. Centrifuge: 5415D (Eppendorf) 2. Vortex Mixer (Neolab) 3. Ultrasonic bath 4. Thermomixer (Eppendorf) 5. Luminex100 instrument (Luminex Corp.) 6. Vacuum manifold (Millipore) 7. Filterplates (Millipore 96-well plate, cat. # MAB1250) 8. Microcentrifuge tubes (Starlab 1.5 ml, cat. # I1415-2500) 9. Carboxylated Beads (Qiagen, cat. # 922400 or Luminex Corp.) 10. Deionized water
2.2. Common Reagents and Materials 1. Bovine serum albumin (BSA, Roth T844.2) 2. PBS (Fischer Scientific, cat. # 9472615) 3. EDC (Pierce) 4. Sulfo-NHS (Pierce) 5. Detection reagent: Streptavidin-phycoerythrin (Streptavidin-PE) stock solution (1 mg/ml) in 100 mM NaCl, 100 mM sodium phosphate, pH 7.5, containing 2 mM sodium azide (Molecular Probes, cat. # S21388)
2.3. Buffers 1. Activation buffer [100 mM sodium phosphate (Na2HPO4), pH 6.2] 2. Coupling buffer (50 mM MES, pH 5.0) 3. Washing buffer [PBS, pH 7.4, and 0.05% (v/v) Tween-20] 4. Blocking/storage (B/S) buffer: 1% BSA fraction IV (Roth, cat. # T844.2) in 1× PBS 5. Assay buffer formulation: 1% BSA fraction IV in 1× PBS
3. Methods 3.1. Principle The principle of suspension antibody microarrays is based on sandwich immunoassays as represented in Fig. 1. First, capture antibodies are coupled to carboxylated microspheres. For performing suspension antibody microarrays, the samples are incubated with the coupled microspheres. Bound analytes are detected with biotinylated antibodies. Phycoerythrin-labeled streptavidin is used for signal detection. Finally, microspheres are identified by a flow cytometer, hence allowing the quantitation of the captured analytes. 3.2. Production of Suspension Microarrays: Antibody Coupling to Carboxylated Microspheres (see Note 1) Using proven carbodiimide coupling chemistry, the antibodies are covalently immobilized on carboxylated beads via the amine groups in lysine side chains. Before coupling, the beads are first activated using EDC/Sulfo-NHS. Fig. 1. Processing of suspension microarrays. Schematic representation of the steps required for performing a suspension microarray immunoassay. Figure reproduced from Proteomics of Human Body Fluids: Principles, Methods and Applications, edited by Thongboonkerd (2006).
The antibodies should not contain foreign protein, azide, glycine, Tris, or any other reagent containing primary amine groups. Otherwise, the antibodies must be purified by gel-filtration chromatography or dialysis before use. 3.2.1. Bead Activation 1. Sonicate the carboxylated bead stock suspension for 15–20 s to yield a homogeneous bead suspension. Thoroughly vortex the bead stock suspension for at least 10 s. Take 2.5 × 10⁶ beads per coupling reaction. 2. Transfer the bead stock suspension to a Starlab microcentrifuge tube. 3. Briefly centrifuge the bead suspension (a quick spin up to 3000×g is sufficient) and discard the supernatant. 4. Wash the beads with 80 μl activation buffer. Briefly vortex and centrifuge at 10,000×g for 2 min. Discard the supernatant and repeat washing. 5. Resuspend the beads in 80 μl activation buffer. Sonicate for 15–20 s to yield a homogeneous bead suspension. 6. Freshly prepare EDC solution (50 mg/ml) and Sulfo-NHS solution (50 mg/ml) (see Notes 2 and 3). 7. Add 10 μl of EDC solution and 10 μl of Sulfo-NHS solution to the bead suspension. Incubate for 20 min at room temperature (15–25°C) in the dark.
3.2.2. Coupling of Antibodies to Activated Carboxylated Beads 8. Dilute the protein stock solution with coupling buffer to a concentration of 100 μg/ml in a volume of 500 μl. 9. Centrifuge the beads at 10,000×g for 2 min and discard the supernatant. 10. Wash the beads with 500 μl of coupling buffer. Briefly vortex and centrifuge at 10,000×g for 2 min. Discard the supernatant and repeat washing. 11. Add the diluted antibody solution (500 μl) from step 8. 12. Wrap the tube in aluminum foil to exclude light. Gently agitate the tube with activated beads and antibody solution on a plate shaker for 2 h at room temperature (15–25°C).
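The dilution in step 8 is a simple C1V1 = C2V2 calculation. The sketch below assumes that the antibody stock concentration is known in μg/ml; the function name and the example stock concentration are hypothetical.

```python
def dilution_volumes(stock_ug_per_ml, target_ug_per_ml=100.0, final_ul=500.0):
    """Return (stock volume, coupling buffer volume) in ul for the step 8 dilution."""
    if stock_ug_per_ml < target_ug_per_ml:
        raise ValueError("stock is more dilute than the target concentration")
    # C1 * V1 = C2 * V2, solved for the stock volume V1.
    stock_ul = target_ug_per_ml * final_ul / stock_ug_per_ml
    return stock_ul, final_ul - stock_ul


# Hypothetical antibody stock at 1000 ug/ml (1 mg/ml).
stock_ul, buffer_ul = dilution_volumes(1000.0)

# Antibody mass offered to the beads in one coupling reaction (ug).
antibody_ug = 100.0 * 500.0 / 1000.0
```

At the stated 100 μg/ml in 500 μl, each coupling reaction offers 50 μg of antibody to the activated beads.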
3.2.3. Washing and Storage of Coupled Carboxylated Beads 13. Centrifuge the beads at 10,000×g for 2 min and carefully remove and discard the supernatant. 14. Wash the beads with 500 μl of washing buffer. Briefly vortex and centrifuge at 10,000×g for 2 min. Discard the supernatant and repeat washing. 15. Resuspend the bead pellet in 1 ml B/S buffer including 0.05% (w/v) azide. 16. Determine the bead concentration of the suspension using a cell-counting chamber.
3.2.4. Counting Beads Using a Cell-Counting Chamber 1. Add 5 μl of beads to 45 μl of PBS and mix. 2. The hemacytometer is filled with 10 μl of the sample by placing the pipette tip against the loading “V” of the hemacytometer at a 45° angle. The sample is slowly released between the slide and the cover slip until the counting chamber is loaded. It is important to fill both sides of the chamber and wait for 2–3 min to allow the beads to settle. 3. Count the beads at two opposite corners of the scored chamber and take an average. Each of the nine squares on the grid has an area of 1 mm², and the coverglass rests 0.1 mm above the floor of the chamber. Thus, the volume over each counting square is 0.1 mm³ or 0.1 μl. Multiply the average number of beads per counting square by 10,000 to obtain the number of beads per milliliter of diluted sample. Multiply by the dilution factor of 10 to obtain the number of beads per milliliter of the undiluted suspension. 4. Store the beads at 25×, typically 5 × 10⁶ beads/ml.
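The bead count calculation in step 3 can be written out as a short sketch. It uses the chamber geometry stated above (1 mm² squares, 0.1 mm depth, hence 0.1 μl per square and a factor of 10,000 per milliliter) and the 1:10 dilution from step 1; the function name and the example counts are hypothetical.

```python
SQUARES_PER_ML = 10_000  # 1 ml divided by the 0.1 ul volume over one 1 mm^2 square


def beads_per_ml(counts, dilution_factor=10):
    """Bead concentration of the undiluted suspension from per-square counts."""
    mean_count = sum(counts) / len(counts)
    return mean_count * SQUARES_PER_ML * dilution_factor


# Hypothetical counts from two opposite corner squares.
concentration = beads_per_ml([48, 52])
```

With these invented counts the average is 50 beads per square, which corresponds to 5 × 10⁶ beads/ml in the undiluted suspension.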
3.3. Processing of Bead-Based Multiplex Assays 3.3.1. Sample Preparation Here, the preparation of proteins from clinical specimens or cell culture for use in multiplexed assays is described. Subheading 3.3.1.1 describes the use of serum or plasma; Subheading 3.3.1.2 describes the analysis of proteins present in cell culture supernatants; Subheading 3.3.1.3 describes the sample preparation of cerebrospinal, synovial, and pleural fluids. 3.3.1.1. Serum or Plasma Samples
Serum and plasma samples should be spun down (8000×g) prior to assay to remove particulates and lipid layers. This prevents blocking of the wash plate and the sample needle. The samples should be handled as biohazards since they may carry infectious agents. Freeze-thaw cycles might result in a measurable breakdown of some proteins (e.g., cytokines), and so the samples should be aliquoted before any experiment. The storage of aliquoted samples at –80°C is recommended. When we analyzed eight matched serum and plasma samples on the Luminex platform, no differences were seen after a freeze-thaw cycle in the levels of TNF, Eotaxin, IL-13, MCP-1, IFN, IL-12p70, MIP-1, IP-10, or GM-CSF. There was, however, a significant increase in IL-1 after freeze-thaw, suggesting that this process may liberate IL-1 from insoluble receptors. IL-1 and MCP-1 levels were significantly higher in plasma as compared to the matched serum sample. IP-10 was higher in serum. Figure 2 shows the freeze-thaw experiments to evaluate 10-plex soluble receptor assays. It seemed that the signal from some analytes was slightly decreased after a freeze-thaw cycle; however, no statistically significant differences were observed. Another important consideration in analyzing serum or plasma samples is the need for an appropriate buffer (described in Subheading 3.3.2).
Fig. 2. Serum samples were drawn from three healthy donors. Each sample was divided into two parts. One part was measured directly after serum was taken, and the other part was subjected to a freeze-thaw cycle. Soluble receptors were analyzed using Luminex technology. There were no significant differences in MFI signals attributed to the freeze-thaw cycle. (Plot: MFI on a logarithmic axis from 1 to 10,000 for fresh and thawed samples; analytes: gp130, ICAM, Fas, TNFRII, VCAM, IL-2R, E-sel, TNFRI, RAGE, and MIF.)
3.3.1.2. Cell Culture Samples
Before use, the cell culture supernatants should be centrifuged at 14,000×g to remove any particulates. The supernatants can be diluted in their corresponding cell culture medium. As with serum samples, cell culture supernatants should be aliquoted and frozen at –80°C before any experiment.
3.3.1.3. Cerebrospinal, Synovial, and Pleural Fluids
Precious samples of limited volume such as cerebrospinal fluid (CSF) and synovial fluid are ideal candidates for multiplex analysis. Animal serum should be added to synovial fluid to prevent binding of heterophilic antibodies and rheumatoid factor (RF), which can cause false positives. For cytokine assays, the samples may be filtered with a 50-kDa filter to remove the interfering antibodies. Another recently described method uses protein L to remove RF from serum (8). CSF samples have been analyzed for 22 cytokines using the Luminex platform; 11 cytokines were detected (9). The authors performed spike recovery experiments and describe the recoveries as good.
3.3.2. Diluent
It is important that the diluents selected for reconstitution and dilution of the standards reflect the environment of the samples being measured. Diluents for specific sample types have to be validated prior to use. For analyzing cell culture samples, the standards and samples are diluted in the respective cell culture medium. It is important to use the same lot of fetal bovine serum (FBS), as there may be significant differences between lots, which can interfere with the assay. Another factor to control is the pH of the sample, which affects antibody binding. For assaying serum samples, each laboratory should develop and validate an appropriate diluent. We suggest starting with PBS supplemented with 10–50% animal serum (e.g., fetal calf serum, horse serum or goat serum, depleted human serum). The goal is to mimic the serum matrix to ensure similar binding kinetics in both serum and standard samples. The serum samples may also require dilution with small amounts of serum to prevent false positives, as some human antibodies may show reactivity toward the mouse capture antibodies. Generally, 1–2% of each species of antibodies is sufficient. The serum diluent must not be used to dilute the detection antibody or the streptavidin-PE.
3.3.3. Detection Antibody
The concentration of detection antibody used can be varied to create an immunoassay with different sensitivity and dynamic range. The authors typically use detection antibody at a concentration between 0.5 μg/ml and 1.0 μg/ml. Optimization is necessary. The quantitative range of the assay can be shifted by changing the antibody concentration. The dilution of the detection antibody shifts the standard curve to the lower concentration range, whereas an increased concentration shifts the curve to the higher concentration range.
3.3.4. General Protocol for Processing Bead-Based Multiplex Assays for the Determination of Proteins in Human
1. Centrifuge the sample at 14,000×g to precipitate any particulates before diluting into appropriate diluent. The dilution factors will vary depending on sample type and concentration of analyte.
2. Resuspend the standard into appropriate diluent and prepare an eight-point standard curve using twofold serial dilutions.
3. Wet filter plate with 100 μl assay buffer.
4. Plate fitting: Add 50 μl of the standard or sample to each well.
5. Sonicate the coupled beads for 15–20 s to yield a homogeneous suspension. Thoroughly vortex the beads for at least 10 s.
6. Dilute the beads to 1500 beads per well, and add 25 μl of diluted bead suspension to each well.
7. Incubate for 2 h in the dark at room temperature (see Note 4).
8. Washing step: Apply vacuum manifold to the bottom of filter plate to remove liquid. Wash by adding 100 μl of assay buffer. Repeat washing twice. Resuspend the beads in 75 μl of assay buffer.
9. Add 25 μl of the detection antibody solution to each well.
10. Incubate for 1.5 h in the dark at room temperature.
11. Washing step: Apply vacuum manifold to the bottom of filter plate to remove liquid. Wash by adding 100 μl of assay buffer. Repeat washing twice. Resuspend the beads in 75 μl of assay buffer.
12. Add 25 μl of Streptavidin-Phycoerythrin solution to each well.
13. Incubate for 0.5 h in the dark at room temperature.
14. Washing step: Apply vacuum manifold to the bottom of filter plate to remove liquid. Wash by adding 100 μl of assay buffer. Repeat washing twice. Resuspend the beads in 125 μl of assay buffer.
15. Incubate on a plate shaker for 1 min.
16. Read the results on a Luminex 100 instrument.
17. Data evaluation: We recommend extrapolating the sample concentrations from a four- or five-parameter logistic (4-PL or 5-PL) curve.
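To illustrate the data evaluation in step 17, the following minimal sketch writes out the four-parameter logistic (4-PL) model and its inverse, which converts a measured MFI back to a concentration. The parameter values are hypothetical; in practice they would be obtained by fitting the standard curve with the instrument software or a least-squares routine, which is not shown here.

```python
def four_pl(conc, a, b, c, d):
    """4-PL model: a = response at zero concentration, d = response at
    infinite concentration, c = inflection point, b = slope factor."""
    return d + (a - d) / (1.0 + (conc / c) ** b)


def inverse_four_pl(mfi, a, b, c, d):
    """Concentration corresponding to an MFI value.

    Only defined for MFI values strictly between the two asymptotes.
    """
    low, high = min(a, d), max(a, d)
    if not low < mfi < high:
        raise ValueError("MFI lies outside the asymptotes of the curve")
    return c * ((a - d) / (mfi - d) - 1.0) ** (1.0 / b)


# Hypothetical fitted parameters (MFI units for a and d, pg/ml for c)
params = dict(a=10.0, b=1.0, c=500.0, d=10000.0)

mfi_at_inflection = four_pl(500.0, **params)
print(mfi_at_inflection)
print(inverse_four_pl(mfi_at_inflection, **params))
```

Samples whose MFI falls outside the asymptotes cannot be assigned a concentration from the curve, which is why the sketch raises an error rather than returning a value.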
3.3.5. Screening Protocol: 10plex Soluble Receptor Assay for Serum Samples
1. Resuspend the standard into appropriate diluent and prepare an eight-point standard curve using twofold serial dilutions.
2. Block the plate with 100 μl B/S buffer (1% BSA in PBS).
3. Beads: 1500 beads of each colored code.
4. Prepare an eight-point standard row mixture in 10% horse serum in B/S buffer by 1:2 serial dilutions. The highest concentration (ng/mL) used in the standard curves is shown in the following table:

Molecule      ng/mL
IL-2R         2
E-Selectin    6
ICAM          5
Fas           1
gp130         2
TNFRI         0.8
TNFRII        1.5
RAGE          2
VCAM          5
MIF           4

5. Prepare the samples by 1:10 dilution in B/S buffer.
6. Add 30 μl beads and 30 μl sample (or standard) into the wells.
7. Incubate and shake for 1.5 h at room temperature.
8. Wash 3×, each time with 100 μl PBS.
9. Prepare the detection antibody mixture in B/S buffer as shown below:

Det. Ab            μg/mL
anti-IL-2R         0.4
anti-E-Selectin    1
anti-ICAM          0.4
anti-Fas           0.4
anti-gp130         1
anti-TNFRI         1
anti-TNFRII        0.6
anti-RAGE          0.8
anti-VCAM          0.8
anti-MIF           0.8

10. Add 30 μl detection antibody mixture to each well, incubate, and shake for 1 h at room temperature.
11. Wash 3×, each time with 100 μl PBS.
12. Prepare Streptavidin-PE solution (5 μg/mL) in B/S buffer and pipette 30 μl to each well.
13. Incubate and shake for 30 min at room temperature.
14. Wash 3×, each time with 100 μl PBS.
15. Resuspend the beads in 100 μl B/S buffer.
16. Read the data on the Luminex 100.
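As a worked example of the eight-point 1:2 standard series in step 4, the short sketch below lists the concentrations that result from the highest standard concentrations given in the protocol. The function and variable names are illustrative only.

```python
# Highest standard concentrations (ng/mL) as listed in step 4
TOP_NG_PER_ML = {
    "IL-2R": 2, "E-Selectin": 6, "ICAM": 5, "Fas": 1, "gp130": 2,
    "TNFRI": 0.8, "TNFRII": 1.5, "RAGE": 2, "VCAM": 5, "MIF": 4,
}


def standard_series(top, points=8, fold=2):
    """Concentrations of a serial dilution, starting at the top standard."""
    return [top / fold ** i for i in range(points)]


for molecule, top in TOP_NG_PER_ML.items():
    series = standard_series(top)
    # Report the lowest standard in pg/mL for comparison with the curves
    print(f"{molecule}: {top} ng/mL down to {series[-1] * 1000:.1f} pg/mL")
```

Seven twofold steps span a 128-fold range, so the lowest standard for IL-2R, for example, is about 15.6 pg/mL.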
3.4. Validation of Analytical Performance of Miniaturized Multiplexed Protein Assays
3.4.1. Accuracy
Accuracy is expressed by the closeness of the measured value to the true value. It should be assessed using a minimum of five determinations over a minimum of three concentrations across the expected range of the assay. A deviation of up to 15% of the measured value from the true value is acceptable. Several methods for estimating accuracy are available:
1. by comparing the measured analyte values with those of reference data;
2. by adding known quantities of the analyte into an appropriate test matrix (e.g., serum, plasma). The recovery is then expressed as the measured analyte concentration relative to the background concentration of the matrix plus the added analyte concentration. The recovery (%) is calculated as follows:

\[
\text{Recovery (\%)} = \frac{\text{Measured analyte concentration}}{\text{Background analyte concentration in test matrix} + \text{Added analyte concentration}} \times 100
\]
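The recovery calculation can be sketched as follows. The numerical values are invented for illustration, and the 15% tolerance simply reuses the acceptance limit stated above for accuracy.

```python
def recovery_percent(measured, background, added):
    """Recovery (%) = measured / (background + added) * 100."""
    expected = background + added
    if expected <= 0:
        raise ValueError("background + added concentration must be positive")
    return measured / expected * 100.0


def within_tolerance(recovery, tolerance=15.0):
    """True if the recovery deviates from 100% by no more than the tolerance."""
    return abs(recovery - 100.0) <= tolerance


# Hypothetical spike experiment (all concentrations in pg/ml)
rec = recovery_percent(measured=540.0, background=100.0, added=500.0)
print(f"recovery = {rec:.1f}%, acceptable = {within_tolerance(rec)}")
```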
3.4.2. Selectivity
Selectivity can be assessed by performing cross-reactivity experiments in which the multiplex assay is performed with each of the standards assayed separately. This ensures that each capture antibody in the assay is selective for its respective analyte only.
3.4.3. Specificity
Specificity is defined by the ability of an assay to measure unequivocally the amount of an analyte in the presence of interfering substances. Non-specificity might be derived from cross-reactivity of the antibody used in the assay with other proteins or antibodies present in the sample.
3.4.4. Precision
Precision is expressed by the closeness of agreement between a series of repeated measurements. It should be assessed using a minimum of five determinations over a minimum of three concentrations across the expected range of the assay. The coefficient of variation (CV) at each concentration should not exceed 15%.
3.4.4.1. Repeatability
Intra-assay precision, or repeatability, expresses the precision under constant conditions. The measurements are performed within 1 day by the same analyst using identical reagents and the same instruments.
3.4.4.2. Reproducibility
Inter-assay precision, or reproducibility, expresses the precision by changing the measurement conditions, which may involve different analysts, reagents, instruments, and laboratories. 3.4.5. Limits of Detection and Quantitation (see Note 5) 3.4.5.1. Detection Limit
The limit of detection (LOD) is the lowest amount of analyte in a sample that can be detected but not quantitated as an exact value. According to the IUPAC definition (2), the limit of detection is estimated as the mean of the zero standard signal plus three times the standard deviation (SD) obtained on the zero standard signal:
\[ \mathrm{LOD} = \mathrm{Mean}_{\text{zero standard}} + 3 \times \mathrm{SD}_{\text{zero standard}} \]
3.4.5.2. Quantitation Limit
The limit of quantitation (LOQ) is the lowest amount of analyte in a sample that can be quantitated with acceptable statistical significance. According to the IUPAC definition, the limit of quantitation is estimated as the mean of the zero standard signal plus 10 times the SD obtained on the zero standard signal:
\[ \mathrm{LOQ} = \mathrm{Mean}_{\text{zero standard}} + 10 \times \mathrm{SD}_{\text{zero standard}} \]
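The two limits can be computed directly from replicate zero-standard readings, as in the sketch below. The replicate values are invented, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator is meant. Note that both limits are obtained in signal (MFI) units; converting them to concentrations requires the standard curve.

```python
from statistics import mean, stdev


def detection_limits(zero_standard_signals):
    """Return (LOD, LOQ) in signal units from zero-standard replicates.

    LOD = mean + 3 * SD, LOQ = mean + 10 * SD.
    stdev() is the sample standard deviation (assumption, see lead-in).
    """
    if len(zero_standard_signals) < 2:
        raise ValueError("at least two replicates are required")
    m = mean(zero_standard_signals)
    sd = stdev(zero_standard_signals)
    return m + 3 * sd, m + 10 * sd


# Hypothetical MFI readings of the zero standard
lod, loq = detection_limits([10, 12, 11, 9, 13])
print(f"LOD = {lod:.2f} MFI, LOQ = {loq:.2f} MFI")
```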
3.4.6. Linearity Linearity is defined as the ability of an analytical procedure to produce signals that are directly proportional to the analyte concentration of the sample.
3.4.7. Range
The range of an analytical procedure is defined by the interval between the upper and lower amounts of analyte within which the analyte can be detected with a suitable level of accuracy, precision, and linearity.
3.4.8. Robustness
Robustness expresses the extent to which the measured values remain unaffected by small variations in method parameters like temperature, reagent concentration, or instrumental parameters. It indicates the reliability of an analytical procedure during normal usage. Figure 3 shows the standard curves of the 10plex soluble receptor assay. The data demonstrate the feasibility and robustness of the assays.

[Figure 3: 10plex soluble receptors assay. Standard curves of MFI (logarithmic axis, 1 to 10,000) versus concentration (pg/ml, logarithmic axis, 10 to 100,000) for MIF, VCAM, RAGE, TNFRII, TNFRI, gp130, Fas, ICAM, IL-2R, and E-sel.]
Fig. 3. The standard curves of 10plex soluble receptors assay were plotted according to average MFI readings from several individual measurements; standard deviation bars were included. The data reflected the range of the linearity and also the robustness of the assays.

3.5. Pattern Generation
After optimization of the assays, screening jobs can be performed, and huge amounts of data will be generated. To deal with high-dimensional data sets, some bioinformatic tools have been provided. For example, clustering analysis to distinguish different diseases or symptoms of diseases can lead to useful taxonomies, and correct diagnosis of clusters of symptoms is also essential for successful therapy. Table 1 summarizes the main features of CIMminer (Clustered Image Maps) (13) and MeV (MultiExperiment Viewer) (14). Both platforms can be applied for the purposes mentioned above. Unsupervised hierarchical clustering analysis can be performed using the online tool CIMminer developed by the National Cancer Institute. MeV is a more integrated freeware package developed by TIGR (The Institute for Genomic Research). It offers 23 analysis modules. Its clustering methods, such as HCL (Hierarchical clustering) and ST (Support Trees), and statistical methods such as TTEST (T-tests), SAM (Significance Analysis of Microarrays), ANOVA (Analysis of Variance), and TFA (Two-factor ANOVA) can help users discover significant parameters based on statistical analysis. Further sophisticated techniques can be applied, including PCA (Principal Components Analysis), SOTA (Self Organizing Tree Algorithm), RN (Relevance Networks), KMC (K-Means/K-Medians Clustering), KMS (K-Means/K-Medians Support), CAST (Clustering Affinity Search Technique), QTC (QT CLUST), SOM (Self Organizing Maps), GSH (Gene Shaving), FOM (Figures of Merit), PTM (Template Matching), SVM (Support Vector Machines), KNNC (K-Nearest-Neighbor Classification), DAM (Discriminant Analysis Module), COA (Correspondence Analysis), TRN (Expression Terrain Maps), and EASE (Expression Analysis Systematic Explorer).
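To make the idea of unsupervised hierarchical clustering concrete, the following minimal sketch clusters a few samples by their analyte profiles using Euclidean distance and average linkage. It is not the implementation used by CIMminer or MeV, which offer many more distance and linkage options; the sample names and log-transformed MFI values are invented for illustration.

```python
from itertools import combinations
from math import dist


def average_linkage(profiles):
    """Agglomerative clustering of samples.

    profiles: dict mapping sample name -> list of analyte values
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def cluster_distance(a, b):
        # Mean Euclidean distance over all cross-cluster sample pairs
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical log10(MFI) profiles for three analytes
samples = {
    "S1": [2.0, 3.0, 1.0],
    "S2": [2.1, 3.1, 1.0],
    "S3": [3.5, 1.0, 2.5],
    "S4": [3.4, 1.1, 2.6],
}

for a, b, d in average_linkage(samples):
    print(a, "+", b, f"at distance {d:.3f}")
```

In this toy data set the two pairs of similar samples merge first, and the two resulting groups join last at a much larger distance, which is the pattern a clustered image map would display as two distinct blocks.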
Table 1
Comparison of the Main Features in CIMminer and MeV

Feature                CIMminer                                              MeV
Contributor            NCI                                                   TIGR
Analysis platform      Web-based (http://discover.nci.nih.gov/cimminer/)     Off-line / free software (http://www.tm4.org/mev.html)
Input file             “.txt”, “.zip”                                        “.txt”, “.mev”, “.tav”, “.gpr”
Order algorithm        More                                                  Less
Statistical analysis   No                                                    Yes, significant parameters could be found out
Results                Color-coded image                                     Color-coded image
Reference              Science 1997; 275:343–9                               Biotechniques 2003; 34:374–8
4. Notes
1. This method can also be adapted for coupling reactions of antigens, receptors, or other proteins.
2. Minimize the exposure of EDC and Sulfo-NHS to air, and close containers tightly. Use fresh aliquots for each coupling reaction and discard after use.
3. S-NHS solution (50 mg/ml) can be prepared and stored at –20°C.
4. Incubation time can be varied. The authors typically incubate between 30 min and 2 h. The primary incubation of the bead and sample can be performed overnight at 4°C for greater low-end sensitivity.
5. The detection limit is primarily dependent on the quality of the antibodies used. Additionally, the detection limit is influenced by detection conditions (e.g., antibody concentration, incubation time), complexity of the multiplex assay, and matrix proteins.
References
1. Morgan, E., Varro, R., Sepulveda, H., Ember, J.A., Apgar, J., Wilson, J., Lowe, L., Chen, R., Shivraj, L., Agadir, A., Campos, R., Ernst, D., Gaur, A. (2004) Cytometric bead array: a multiplexed assay platform with applications in various areas of biology. Clin Immunol, 110, 252–66.
2. Dasso, J., Lee, J., Bach, H., Mage, R.G. (2002) A comparison of ELISA and flow microsphere-based assays for quantification of immunoglobulins. J Immunol Methods, 263, 23–33.
3. Carson, R.T., Vignali, D.A. (1999) Simultaneous quantitation of 15 cytokines using a multiplexed flow cytometric assay. J Immunol Methods, 227, 41–52.
4. Dunbar, S.A., Vander Zee, C.A., Oliver, K.G., Karem, K.L., Jacobson, J.W. (2003) Quantitative, multiplexed detection of bacterial pathogens: DNA and protein applications of the Luminex LabMAP system. J Microbiol Methods, 53, 245–52.
5. Joos, T.O., Stoll, D., Templin, M.F. (2002) Miniaturised multiplexed immunoassays. Curr Opin Chem Biol, 6, 76–80.
6. Prabhakar, U., Eirikis, E., Davis, H.M. (2002) Simultaneous quantification of proinflammatory cytokines in human plasma using the LabMAP assay. J Immunol Methods, 260, 207–18.
7. Lu, J., Getz, G., Miska, E.A., Alvarez-Saavedra, E., Lamb, J., Peck, D., Sweet-Cordero, A., Ebert, B.L., Mak, R.H., Ferrando, A.A., Downing, J.R., Jacks, T., Horvitz, H.R., Golub, T.R. (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834–8.
8. de Jager, W., Prakken, B.J., Bijlsma, J.W., Kuis, W., Rijkers, G.T. (2005) Improved multiplex immunoassay performance in human plasma and synovial fluid following removal of interfering heterophilic antibodies. J Immunol Methods, 300, 124–35.
9. Natelson, B.H., Weaver, S.A., Tseng, C.L., Ottenweller, J.E. (2005) Spinal fluid abnormalities in patients with chronic fatigue syndrome. Clin Diagn Lab Immunol, 12, 52–5.
10. Findlay, J.W., Smith, W.C., Lee, J.W., Nordblom, G.D., Das, I., DeSilva, B.S., Khan, M.N., Bowsher, R.R. (2000) Validation of immunoassays for bioanalysis: a pharmaceutical industry perspective. J Pharmaceutical Biomed Anal, 21, 1249–73.
11. Sanchez-Carbayo, M. (2006) Antibody arrays: technical considerations and clinical applications in cancer. Clin Chem, 52, 1651–9.
12. Kingsmore, S.F. (2006) Multiplexed protein measurement: technologies and applications of protein and antibody arrays. Nat Rev Drug Discov, 5, 310–20.
13. Weinstein, J.N., Myers, T.G., O’Connor, P.M., et al. (1997) An information-intensive approach to the molecular pharmacology of cancer. Science, 275, 343–9.
14. Saeed, A.I., Sharov, V., White, J., Li, J., Liang, W., Bhagabati, N., Braisted, J., Klapa, M., Currier, T., Thiagarajan, M., Sturn, A., Snuffin, M., Rezantsev, A., Popov, D., Ryltsov, A., Kostukovich, E., Borisovsky, I., Liu, Z., Vinsavich, A., Trush, V., Quackenbush, J. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34(2), 374–8.
15 Dissecting Cancer Serum Protein Profiles Using Antibody Arrays Marta Sanchez-Carbayo
Summary
Antibody arrays represent one of the high-throughput techniques enabling detection of multiple proteins simultaneously. One of the main advantages of the technology over other proteomic approaches is that the identities of the measured proteins are known up front in the experimental design or can be readily characterized, facilitating a biological interpretation of the results obtained. This chapter reviews the technical issues of the main antibody array formats as well as various applications using serum specimens in the context of neoplastic diseases. Clinical applications of antibody arrays range from biomarker discovery for diagnosis, prognosis, and drug response to characterization of protein pathways and modification changes associated with disease development and progression. As a high-throughput tool addressing protein levels and post-translational modifications, the technology improves the functional characterization of the molecular bases of cancer. Furthermore, the identification and validation of protein expression patterns characteristic of cancer progression and tumor subtypes may enable tailored therapeutic intervention and improvement in the clinical management of cancer patients. Technical features such as lower sample volume and antibody concentration requirements, format versatility, and high reproducibility support their increasing impact in cancer research.
Key Words: antibody arrays; protein profiling; serum; direct labeling.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ

1. Introduction
1.1. Antibody Arrays in the Context of Other Proteomic Strategies
Two main proteomic strategies, termed untargeted and targeted, can be taken to investigate the cancer proteome. The terminology refers to whether the proteins to be measured are unknown and identified in the course of an untargeted proteomic approach, or known and considered in the experimental design for targeted strategies. Untargeted architecture platforms are best suited for first-pass comparisons of proteomes to identify relatively few, novel, or known proteins that exhibit the greatest differences in abundance. The two most commonly used technologies are two-dimensional electrophoresis (2D) and low- and high-resolution mass spectrometry (1,2,3). Targeted architecture proteomic platforms measure and quantify proteins of interest identified previously, and are suited for analyses of quantitative differences in abundance among known protein families and pathways. The versatility of targeted platforms allows reproducibility, scalability, and precise quantification to be controlled and estimated, leading to high sensitivity and coverage. This approach allows experimental designs to address specific hypotheses and supports biological interpretation of the results obtained. However, the number of proteins amenable to these analyses depends on the availability of antibodies with high affinity and specificity for a target protein. The main targeted techniques used for large-scale analysis of many samples and proteins include protein microarrays, multiplexed Western blots, and tissue arrays. Protein arrays represent the most versatile among the proteomics techniques available to date, since antigens, peptides, complex protein solutions, or antibodies can be immobilized to capture and quantify the presence of specific antibodies or proteins, respectively (1,2,3,4).
1.2. Antibody Array Formats
Innovation in the immobilization surfaces and detection strategies has led to an increasing number of planar antibody array technologies and bead-based versions. Planar antibody arrays represent the most common type of protein arrays and are the major focus of the present chapter.
This section describes the main formats of planar arrays, covering their differences from bead-based assays (Fig. 1; for bead-based arrays, see also Chapter 14). The main planar label-based types comprise one-antibody assays (using one antibody to capture the target molecule) and sandwich assays (using two antibodies to capture the target protein) (1,2,3,4). One-antibody and sandwich assays each have advantages and pitfalls relative to the other. In one-antibody label-based assays, the targeted proteins are captured by an immobilized antibody and detected through labeling with a tag (Fig. 1A). In direct labeling, the proteins are labeled with a fluorophore, such as cyanines (Cy3 or Cy5). In indirect labeling, the proteins are labeled with a tag that is later detected by a labeled antibody. One-antibody label-based assays allow the incubation of two different samples, each labeled with a different tag, on the arrays. Normalization is facilitated by co-incubating a reference sample with a test sample (1,2,3,4).
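As an illustration of how co-incubation with a reference sample supports normalization in the two-color format, the sketch below expresses each spot as the log2 ratio of the test-channel to the reference-channel intensity. The antibody names and intensities are invented, the intensities are assumed to be background-subtracted, and further normalization steps that a real analysis may require are not shown.

```python
from math import log2


def log_ratios(test, reference):
    """log2(test / reference) per antibody spot.

    test, reference: dicts mapping antibody name -> background-subtracted
    intensity from the two label channels (e.g., Cy5 and Cy3).
    """
    ratios = {}
    for antibody, test_signal in test.items():
        ref_signal = reference[antibody]
        if test_signal <= 0 or ref_signal <= 0:
            raise ValueError(f"non-positive intensity for {antibody}")
        ratios[antibody] = log2(test_signal / ref_signal)
    return ratios


# Hypothetical spot intensities for three antibodies
test_channel = {"Ab-A": 2000.0, "Ab-B": 500.0, "Ab-C": 1000.0}
reference_channel = {"Ab-A": 1000.0, "Ab-B": 1000.0, "Ab-C": 1000.0}

for antibody, ratio in log_ratios(test_channel, reference_channel).items():
    print(f"{antibody}: log2 ratio = {ratio:+.1f}")
```

A positive value indicates higher abundance in the test sample than in the reference, a negative value lower abundance, and zero no difference.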
[Figure 1: schematic with four panels. Antibody-based arrays: panels A and B. Antigen-based arrays: panels C and D. Labels recoverable from the figure: competitive; direct labeling (Cy5, Cy3); indirect labeling (biotin, digoxigenin); detection by RCA, RLS, ECL, TSA, or Bio-SA-Cy3; suspension: bead based (whole cell, membrane, soluble); reverse phase (complex lysate); tumor-associated antigen arrays (tumor antigen, e.g., p53; autoantibody, e.g., anti-p53).]
Fig. 1. Main formats of planar and suspension protein arrays. RCA: rolling-circle amplification; RLS: resonance light scattering; ECL: enhanced chemiluminescence; TSA: tyramide signal amplification; SA: streptavidin.
Another benefit is that these assays are competitive, since the analytes in the test and reference solutions compete for binding to the antibodies (1,2,3,4). This improves the linearity of response and the dynamic range as compared to non-competitive assays (4). The main disadvantage is the potential disruption of the analyte–antibody interaction by the label, which may limit detection as well as sensitivity and specificity. In the sandwich label-based format, antibodies capture unlabeled proteins, which are detected by another antibody using one of several methods to generate the signal for detection (Fig. 1B). The use of two antibodies targeting each analyte increases the specificity as compared to one-antibody label-based assays. The reduced background of these assays also increases the sensitivity. The sandwich format allows only non-competitive assays, since only one sample can be incubated on each array (1,2,3,4). This results in a sigmoidal binding response, as compared to the linear response in the competitive format, and requires standard curves of known concentrations of analytes to achieve accurate calibration of concentrations (4). As compared to one-antibody label-based assays, sandwich assays are more difficult to develop in a multiplexed manner, since matched pairs of antibodies and purified antigens may not be available for each target, and the potential cross-reactivity among detection antibodies increases with additional analytes (2,4). Currently, the practical size of multiplexed sandwich
assays is limited to 30–50 different targets (1,2,3,4). This contrasts with one-antibody assays, where only the availability of antibodies and space on the substrate limit the number of targets being analyzed. In addition to the planar arrays, suspension or bead-based arrays use different fluorescent beads, each coated with a different antibody and spectrally resolvable from each other [(5,6,7,8,9) and see Chapter 14]. The beads are incubated with a sample to allow protein binding to the capture antibodies, and the mixture is incubated with a cocktail of detection antibodies, each corresponding to one of the capture antibodies. The detection antibodies are tagged to allow fluorescent detection. The beads are passed through a flow cytometer system, and each bead is probed by two lasers, one to read the color or identity of the bead, and another to read the amount of detection antibody on the bead (5,6,7,8,9). Multiplexed bead-based flow-cytometry assays represent an active area of development. Differentially identifiable beads coated with either proteins, autoantigens, or antibodies can identify a variety of bound antibodies or proteins using a cytometer system (5,6,7,8,9). Advances in instrumentation and bead chemistries will probably make this approach very valuable for the detection of circulating cancer cells in clinical practice. In another version of this concept, suspensions of cells can be incubated on antibody arrays, and the number of cells bound to each antibody can be quantified by dark field microscopy. These arrays have the potential to characterize multiple membrane proteins in specific cell populations or changes in cell surfaces induced by drug therapies. It is important to distinguish antibody arrays from two other main protein array formats that can be applied to serum samples and are also based on the binding of antibodies to specific antigens.
The development and design of tumor-associated antigen (TAA) arrays enhance the detection of autoantibodies against TAAs for cancer diagnosis (Fig. 1C). The rationale is the presence in cancer sera of antibodies that react with a unique group of autologous cellular antigens or TAAs (10,11). Complex protein extracts can also be spotted onto membranes and probed with antibodies targeting specific proteins on the so-called reverse-phase arrays (12,13) (Fig. 1D).
1.3. Types of Planar Antibody Arrays Based on the Labeling-Hybridization Methods
The increasing number of detection modalities has led to several types and applications for antibody arrays (see Note 1). A number of labeling and detection methods can be employed for one-antibody and sandwich label-based planar arrays (Fig. 2).

[Figure 2 panels: A) Antibody direct sandwich; B) Species-specific tertiary antibody; C) Biotinylated antibodies with fluorescent streptavidin conjugates; D) 2 SAPE layers; E) Tyramide signal amplification; F) Alkaline phosphatase linked to a species tertiary Ab activated chemiluminescence; G) Rolling circle amplification; H) Resonance light scattering. "B" in the schematic marks biotin.]
Fig. 2. Several labeling and detection methods can be employed for antibody arrays.

The signal can be generated by a fluorescently labeled detection antibody (Fig. 2A). This approach represents the standard sandwich array; it requires chemical labeling of all secondary detection antibodies, but the assay is a simple two-step procedure that does not require a separate staining step (14,15). An alternative approach employs a species-specific fluorescently labeled tertiary antibody (Fig. 2B). This option avoids the use of large chemically modified detection antibodies, but limits the species of capture antibodies. A third option is the use of available biotinylated detection antibodies (Fig. 2C) (15). In these assays, detection occurs after staining of the sandwich complex with Cy3-labeled streptavidin or other streptavidin variants, such as Texas Red conjugates or streptavidin-R-phycoerythrin (SAPE) (15). In a fourth option, the fluorescent signal is further amplified using a second layer of SAPE coupled to the first layer via an anti-SAPE antibody (Fig. 2D). Alternatively, in the fifth option, the number of biotin labels can be increased via tyramide signal amplification (Fig. 2E) (2). An anti-biotin horseradish peroxidase (HRP) conjugate generates a tyramide radical that cross-links a biotin or a fluorophore to all exposed tyrosine residues of any protein near the recognition event (2). Chemiluminescence can also be implemented in multiplexed sandwich assays as a sixth possibility (Fig. 2F), using a streptavidin-HRP or a species-specific antibody conjugated with HRP or alkaline phosphatase and chemiluminescence substrates. Chemiluminescence is typically more sensitive than standard fluorescence applications. A polymer decorated with streptavidin and europium chelates is utilized not only for
microplate but also for microarray measurements. Evanescent waveguide technology is employed as an alternative for ultrasensitive fluorescence (16). Rolling-circle amplification can be applied as a seventh option for signal generation (Fig. 2G). The 5′ end of an oligonucleotide primer is attached to an anti-biotin antibody (17). After binding of the anti-biotin antibody to the biotinylated detection antibody of the sandwich, the oligonucleotide is enzymatically extended using a circular DNA sequence as template. Fluorescently labeled short oligos are then hybridized to the extended DNA, decorating each bound antibody with thousands of fluorophores (15). An alternative eighth staining method, yielding sensitivity similar to evanescent wave technology and rolling-circle amplification, involves the use of colloidal gold particles coated with an anti-biotin antibody (18). Because of resonance light scattering (RLS), these particles scatter white light very intensely, and quantitative readouts of miniaturized sandwich assays can be obtained with a simple charge-coupled device (CCD) camera-based imaging system (18) (Fig. 2H). Unlike fluorescence or chemiluminescence labels, RLS particles do not show any photobleaching (14,15,16,17,18,19,20). Given the high versatility of the labeling-hybridization methods available to date, the present chapter describes in detail the reagents and protocol for direct labeling of serum specimens, as summarized in Figure 3.
1.4. Applications in Cancer Research Using Serum Specimens
Direct labeling methods have been applied in cancer diagnostics to the detection of proteins in the serum of patients with prostate cancer (21). The use of a two-color rolling-circle amplification method improves the detection of low-abundance proteins. This method has also been shown to provide adequate reproducibility and accuracy for protein profiling on serum specimens and clinical applications (17,22,23,24).
Sandwich assays can also measure protein abundances in body fluids using detection methods such as RLS (25), enhanced chemiluminescence (26), tyramide signal amplification (27), and fluorescence (28). Reverse protein arrays have also been optimized to spot serum specimens and obtain high-throughput measurement of IgA in thousands of sera in a single experiment (29). For example, a recent report designed antibody arrays for bladder cancer by selecting antibodies against targets differentially expressed in bladder tumors identified by gene profiling (24). Serum protein profiles obtained by two independent antibody arrays represent a comprehensive means for bladder cancer diagnosis and clinical outcome stratification (24). Validation analyses with ELISA and immunohistochemistry on tissue microarrays represent alternative approaches to confirm the relevance of identified proteins for tumor progression. Such a strategy provides experimental
evidence for the use of several integrated technologies and strengthens the process of biomarker discovery. Serum specimens can be utilized to profile the humoral immune signature of cancer patients to detect both autoantibodies against tumor antigens and secreted cytokines. The combined detection of antibodies against a group of TAAs has provided high sensitivity for diagnosis of prostate cancer (10). The use of phage display arrays can enhance tumor subtype specificity of such measurements (10,11). Cytokine profiling on serum and plasma specimens can differentiate cancer patients from control subjects, and can also stratify patients with leukemia based on clinical outcome. Several reports have also compared the reproducibility and differences among several technologies available for multiplexing cytokine measurements, including planar and bead-based antibody arrays (5,6,7). In summary, antibody arrays can be utilized for the following applications: (1) the discovery of candidate disease biomarkers (21,24); (2) characterizing signaling pathways (28), disease progression, clinical subtypes, and outcomes (21,24); (3) measurement of changes in post-translational modifications or expression levels of disease-related proteins (28); (4) identifying binding partners of proteins, which is especially important when conducting functional studies for drug discovery; (5) epitope mapping to determine the regions of proteins that bind specific antibodies.
2. Materials
2.1. Printing of Antibody Arrays
1. Antibodies. A critical step is the selection of the antibodies to be printed onto the antibody arrays. The antibodies are selected based on their known affinity characterization and on the experimental design (see Note 2).
2. Antibody purification with Affi-gel Protein A MAPS II kit (Bio-Rad, Hercules, CA).
3. Protein concentration measurements with BCA Protein Assay (Pierce, Rockford, IL).
4. Fast Slides (Schleicher and Schuell Biosciences, Keene, NH) or HydroGel coated glass microscope slides (Perkin Elmer Life Sciences, Waltham, MA).
5. Polypropylene 384-well microtiter plates (Genetix, New Milton, Hampshire, UK or MJ Research, Waltham, MA).
6. Seal aluminum scotch brand foil tape (R.S. Hughes, Sunnyvale, CA).
7. Printer.
2.2. Labeling and Hybridization of Serum Samples
1. NHS-linked Cy3 and Cy5 protein labeling agents (Amersham, GE Healthcare, Piscataway, NJ).
2. Microscopic slide staining chamber with slide racks (Shandon Lipshaw, Pittsburgh, PA).
3. Diamond scribe (VWR, West Chester, PA).
4. Hydrophobic marker (PAP pen, Immunotech, Marseille).
5. Coverslips (Lifterslip, Erie Scientific, Portsmouth, NJ).
6. Wafer handling tweezers (Technitool, West Berlin, NJ).
7. Clinical centrifuge with flat swinging buckets for holding slide racks.
8. Spin columns for protein cleanup (Bio-Rad Micro Bio-Spin P-6).
9. Microcon YM-50 (Millipore, Bedford, MA).
10. Complete protease inhibitors (Roche, Indianapolis, IN).
11. Buffers: phosphate buffered saline (PBS), pH 7.4 (137 mM NaCl, 4.3 mM Na2HPO4, 1.4 mM KH2PO4); carbonate buffer, pH 8.5 (50 mM NaHCO3); PBST, PBS containing 0.5% (v/v) Tween-20; 0.1 M PBS, pH 7.2 (68.4 ml 1 M Na2HPO4, 31.6 ml 1 M NaH2PO4, 900 ml dH2O); NP40 lysis buffer: 50 mM Hepes-OH, EDTA, 50 mM NaCl, 10 mM NaPPi (tetrasodium diphosphate decahydrate), 50 mM NaF, 1% (v/v) NP40, 10 mM sodium vanadate, pH 7.5–8.0; saturated NaCl (Sigma); blocking buffer: 1% (w/v) bovine serum albumin (BSA) in PBST; 7–10 mM dye stock in DMSO: dissolve one tube of Cy3 or Cy5 dye in 30 μl of DMSO, aliquot, and freeze at –80°C.
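The buffer and dye-stock recipes above involve simple arithmetic that can be checked or rescaled in a few lines. The following is a minimal sketch, not part of the original protocol: the function names are hypothetical, and the 200 nmol of dye per vial is the figure quoted later in Subheading 3.2.

```python
def dye_stock_millimolar(dye_nmol: float, dmso_ul: float) -> float:
    """Concentration (mM) of a dye stock: nmol per microliter equals mmol per liter."""
    return dye_nmol / dmso_ul


def scale_phosphate_buffer(final_ml: float) -> dict:
    """Scale the 0.1 M PBS, pH 7.2 recipe given per liter in the Materials list."""
    per_liter_ml = {"1 M Na2HPO4": 68.4, "1 M NaH2PO4": 31.6, "dH2O": 900.0}
    factor = final_ml / 1000.0
    return {name: round(ml * factor, 2) for name, ml in per_liter_ml.items()}


def phosphate_molarity(na2hpo4_ml: float, nah2po4_ml: float, water_ml: float) -> float:
    """Total phosphate molarity when both stock solutions are 1 M."""
    total_ml = na2hpo4_ml + nah2po4_ml + water_ml
    return (na2hpo4_ml + nah2po4_ml) * 1.0 / total_ml


if __name__ == "__main__":
    # One vial (assumed 200 nmol) dissolved in 30 microliters of DMSO
    print(f"dye stock: {dye_stock_millimolar(200, 30):.1f} mM")
    print("500 ml of 0.1 M PBS:", scale_phosphate_buffer(500))
    print(f"phosphate: {phosphate_molarity(68.4, 31.6, 900):.2f} M")
```

With these inputs the dye stock works out to roughly 6.7 mM, close to the lower end of the stated 7–10 mM range.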
2.3. Detection
1. ScanArray microarray scanner at 543 nm and 633 nm wavelengths (Packard Bioscience, Research Parkway Meriden, CT).
2. GenePix Pro 3.0 (Axon Instruments, Union City, CA) software program employed to quantify the image data.
3. Methods
Three main steps can be considered along the overall process of setting up custom-made antibody arrays: antibody array construction, sample labeling and hybridization onto the antibody array, and scanning and data analysis. The success of the whole process depends greatly on the availability of high-quality antibodies for capturing the target proteins and of serum samples that have been well handled, preserved, and characterized.
3.1. Antibody Array Construction
1. Select the antibodies (see Note 2).
2. Purify the antibodies (see Note 3).
3. Keep stable and quantify the antibodies (see Notes 4–7).
4. Prepare the printing plate with antibodies. Put 5–7 μl of antibody solution in each well of a 384-well plate (see Note 8).
5. Prepare slides for printing (see Note 9).
For nitrocellulose slides, no preparation is needed (see Note 9). For hydrogel slides: The hydrogel slides should be prepared just before use (i.e., only when you are ready to print the arrays). Load the hydrogels into a slide rack, briefly rinse (1 s) in purified water, and wash three times at room temperature with gentle rocking for 10 min each time in purified water. A microscope slide staining chamber is useful for the washing steps. The staining chambers come with slide racks that hold 10–30 slides. The racks can be transferred between staining chambers containing different washing buffers as well as to a clinical centrifuge for drying the slides.
6. Centrifuge slides to dry at no more than 350 × g for about 3 min. A clinical centrifuge with flat swinging bucket holders works well for this task. Place a paper towel layer on the bottom of the swinging bucket to absorb water removed from the slides, and place the slide rack on the paper towel before centrifuging.
7. Place the hydrogel slides in a 40°C water bath for 20 min, using the staining chamber with a paper towel placed in the bottom.
8. Remove the slides from the incubator and allow slides to cool at room temperature for 5 min. The slides are now ready for printing.
9. Print the antibodies on the slides (see Note 10).
10. Start the post-print processing of microarrays.
For hydrogels:
• Prepare staining chambers with a wet paper towel soaked in saturated NaCl at the bottom.
• After printing, incubate the slides in a humidified staining chamber overnight at room temperature to allow adsorption of the antibodies to the matrix.
• The next day, circumscribe the array boundaries on each slide with a marker (e.g., PAP pen). Leave at least 3–4 mm between the array and the marker line. Allow the hydrophobic marker lines to fully dry.
For nitrocellulose (FAST, Schleicher and Schuell) slides:
• Allow the slides to dry for at least 1 h (let the slides dry on a slide-staining chamber).
• Store in a refrigerator on a slide rack in a humidified staining chamber.
• The next day, circumscribe the array boundaries on each slide with a marker (e.g., PAP pen). Leave at least 3–4 mm between the array and the marker line. Allow the hydrophobic marker lines to fully dry.
11. Rinse the slides as follows:
a. Rinse briefly (for 30 s) in PBST.
b. Wash in PBST for 3 min with gentle rocking.
c. Wash in PBST for 30 min with gentle rocking.
[Fig. 3 panel labels: test proteins and reference proteins are each reacted with Cy3 or Cy5 dye; free dye is separated; the labeled samples are mixed, placed on the antibody-coated slide, and scanned.]
Fig. 3. Scheme of the whole process when working with custom-made antibody arrays. Once antibodies are selected and printed on the arrays, serum samples are labeled and hybridized onto the antibody arrays. Scanning and data analyses of fluorescence will provide quantitative measurement of multiple proteins simultaneously.
12. Block the slides. Once the antibodies are immobilized, it is necessary to block non-specific protein-binding sites on the printed microarrays. Typical blocking solutions include diluted BSA or casein solutions (1,2,9,12,19). If the arrays are not to be used for a day or more, leave them in the BSA-blocking solution in the refrigerator. Prepare the blocking buffer right before use. Add sodium azide to the blocking buffer if you intend to store for more than one day and then begin with step b shown below:
a. Block in the blocking buffer for 1 h at room temperature with constant shaking.
b. Briefly rinse with PBST twice or alternatively rinse the second time with 0.1 M PBS, pH 7.2, for 20 min.
c. Dry the slides by centrifugation immediately prior to incubating with the labeled samples using a clinical centrifuge with flat swinging bucket holders.
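Step 4 and Note 8 call for a print plate and a spreadsheet recording which antibody sits in which well. The sketch below shows one way such a plate map could be generated; the antibody codes, the row-major fill order, and the 6 μl default volume are illustrative assumptions rather than part of the published protocol.

```python
import csv
import io
import string

ROWS = string.ascii_uppercase[:16]  # a 384-well plate has rows A-P
COLS = range(1, 25)                 # and columns 1-24


def build_print_plate_map(antibody_codes, volume_ul=6.0):
    """Assign antibody codes to wells in row-major order (A1, A2, ..., P24)."""
    wells = [f"{row}{col}" for row in ROWS for col in COLS]
    if len(antibody_codes) > len(wells):
        raise ValueError("more antibodies than wells on a 384-well plate")
    return [
        {"well": well, "antibody_code": code, "volume_ul": volume_ul}
        for well, code in zip(wells, antibody_codes)
    ]


def plate_map_to_csv(rows):
    """Serialize the plate map as CSV text for downstream data processing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["well", "antibody_code", "volume_ul"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    codes = [f"AB{i:03d}" for i in range(1, 31)]  # hypothetical antibody codes
    print(plate_map_to_csv(build_print_plate_map(codes)))
```

Listing the same antibody code more than once places it in several wells, which matches the advice in Note 8 to load one antibody into multiple wells when printing varies between pins.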
3.2. Labeling of Samples and Hybridization
A protocol for direct labeling is provided, summarized in Figure 3.
1. Select the serum samples for labeling (see Note 11).
2. Determine the volume of each serum sample to label with Cy3 and Cy5. When deciding whether to label the samples or the references with Cy3 or Cy5, note that Cy3 gives the more consistent and brighter signal. For the samples, divide the volume to be placed on the array by the desired final dilution factor of the sample (final dilutions vary from 1/30 to 1/50). For a 20 μl volume (the volume used for a 12 × 12-mm standard hydrogel) and a 1/50 final dilution, use 0.4 μl of serum sample (20/50) per array.
If a pooled reference is to be used, each component of the reference is first labeled and then pooled (as opposed to pooling and then labeling). The amount to be labeled of each component of the reference is (Va × A)/Nr, where Va is the volume per array (0.4 μl in the above case), A is the number of arrays the reference will be used in, and Nr is the number of samples pooled in the reference. For example, if a pool of 10 samples will be used as the reference for 20 arrays, the volume of each sample to be used in the Cy5 labeling mix will be (0.4 × 20)/10 = 0.8 μl.
3. Dilute the serum sample approximately 15× with carbonate buffer or phosphate buffer at pH 7.5 spiked with 0.5 μg/ml dinitrophenol (DNP) flag (if the flag is to be used for normalization). Do not use buffers with an amine group such as Tris-base.
4. Add a 20th volume of dye stock to each sample. The final concentration of the NHS-ester activated Cy dyes within the serum protein solution should be between 100 and 300 μM (each vial of dye contains 200 nmol).
5. Mix each dye and serum protein solution and let the reaction proceed on ice in the dark for 2 h. Normally, mix the reference protein solution with the Cy3 dye solution, and the test protein solution with the Cy5 dye solution.
6. Add a 20th volume of 1 M Tris-HCl pH 7.5–8.0 (or glycine) to each of the reactions to quench (stop the labeling), so that at least a 200-fold excess of quencher over dye is achieved.
7. Load the samples onto a microconcentrator having the appropriate molecular cutoff, such as the Bio-Rad Bio-spin 6 microcolumn, and spin at 1000×g for 2 min. A 3000-D cutoff captures most proteins while still removing the dye. If smaller proteins are not important, the 10,000-D cutoff is faster. Centrifuge according to the microconcentrator instructions. The 10,000-D microcon typically requires 20 min, and the 3000-D microcon requires 80 min of centrifugation at 10,000×g at room temperature.
8. Make 10× blocking solution: 30% (w/v) non-fat milk in PBS and 1% (v/v) Tween-20 (e.g., 3 ml milk in 10 ml buffer). Centrifuge the milk solution at 10,000×g for 10 min to remove particulate matter.
9. After centrifuging the samples through the microconcentrator column, add to the flow-through (collection tube) 1 μl of the supernatant of the blocking mix per array and 1 μl of 10× protease inhibitor per array.
10. Pool the reference samples and divide among the test samples according to the experimental plan.
11. Add 1× PBS to bring to 20–25 μl per array, if necessary.
12. The labeled samples may be stored overnight at 4°C.
13. Start hybridization of the labeled serum samples on the printed antibody arrays. Distribute the Cy3-labeled reference protein solution to the appropriate Cy5-labeled test protein solutions. Add PBS to each mix to achieve a volume of 20–25 μl per array. It is recommended to remove any particulate matter or precipitate by (1) filtering with a 0.45-μm spin filter, or (2) centrifuging for 10 min at 14,000×g and pipetting out the supernatant.
14. Load the appropriate amount of labeled sample on the slides within the marked boundaries, and cover with a Lifterslip. Use 20 μl for the 12 × 12-mm hydrogels. The cover slip should be at least 1/4 inch longer than the dimensions of the array. (The background is often higher at the edges of the cover slip.)
15. Incubate for 2 h at room temperature with constant shaking.
16. Rinse briefly in PBST to remove the Lifterslip.
17. Wash three more times for 10 min in fresh changes of PBST. (All washes are performed in racks at room temperature.)
18. Rinse for 20 s in PBS. Alternatively, final washes with H2O can be performed for 5 min each with gentle agitation.
19. Dry the slides by centrifugation prior to scanning.
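The volume calculations in step 2 of this subheading (serum per array, and the amount of each component of a pooled reference, (Va × A)/Nr) can be sketched in code as follows. The function names are hypothetical and chosen only for illustration; the numbers in the usage lines are the worked examples from the text.

```python
def serum_volume_per_array(array_volume_ul: float, final_dilution: float) -> float:
    """Serum volume per array: array volume divided by the final dilution factor.

    final_dilution is the denominator of the dilution, e.g. 50 for a 1/50 dilution.
    """
    if final_dilution <= 0:
        raise ValueError("final_dilution must be positive")
    return array_volume_ul / final_dilution


def pooled_reference_component_volume(
    volume_per_array_ul: float, n_arrays: int, n_pooled: int
) -> float:
    """Volume of each serum to label for a pooled reference: (Va x A) / Nr."""
    if n_pooled <= 0:
        raise ValueError("n_pooled must be positive")
    return volume_per_array_ul * n_arrays / n_pooled


if __name__ == "__main__":
    va = serum_volume_per_array(20, 50)  # 20 microliter array, 1/50 dilution
    each = pooled_reference_component_volume(va, n_arrays=20, n_pooled=10)
    print(f"serum per array: {va:.2f} microliters")
    print(f"each reference component: {each:.2f} microliters")
```

In practice a small excess is usually added to cover pipetting losses; that margin is not specified in the protocol and is left out of the sketch.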
3.3. Scanning and Data Analysis
1. Scan the slides at 552 nm and 635 nm using a microarray fluorescence scanner (see Note 12).
2. Process the data: grid the arrays and reject unsatisfactory data points (see Note 13).
3. Normalize the data (see Note 14).
4. Analyze the data (see Note 15).
5. Interpret the data (see Note 16).
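Notes 12 and 13 describe the core of the data processing: subtract the local background in each color channel, reject spots with low net signal in either channel, take the ratio of sample-channel to reference-channel net signal, and average the ratios of replicate spots. The following is a minimal sketch of that sequence, assuming a simple list-of-dictionaries input; the field names and the `min_net` threshold are hypothetical, and the normalization step of Note 14 is not included.

```python
from collections import defaultdict
from statistics import mean


def average_spot_ratios(spots, min_net=0.0):
    """Return {antibody: mean sample/reference ratio over accepted replicate spots}.

    Each spot is a dict with keys: antibody, sample_fg, sample_bg, ref_fg, ref_bg
    (foreground and local background intensity in each color channel).
    """
    ratios = defaultdict(list)
    for spot in spots:
        net_sample = spot["sample_fg"] - spot["sample_bg"]
        net_ref = spot["ref_fg"] - spot["ref_bg"]
        # Reject spots whose net signal is too low in either channel,
        # since noisy low-intensity values distort the ratio.
        if net_sample <= min_net or net_ref <= min_net:
            continue
        ratios[spot["antibody"]].append(net_sample / net_ref)
    return {antibody: mean(values) for antibody, values in ratios.items()}


if __name__ == "__main__":
    example = [
        {"antibody": "AB001", "sample_fg": 1200, "sample_bg": 200, "ref_fg": 600, "ref_bg": 100},
        {"antibody": "AB001", "sample_fg": 150, "sample_bg": 100, "ref_fg": 600, "ref_bg": 100},
        {"antibody": "AB002", "sample_fg": 500, "sample_bg": 100, "ref_fg": 900, "ref_bg": 100},
    ]
    print(average_spot_ratios(example, min_net=100))
```

An antibody whose replicate spots are all rejected is simply absent from the result, so missing values stay explicit instead of being reported as a ratio of zero.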
4. Notes 1. Radioactivity, fluorescence, or chemiluminescence detection methods have been used with antibody arrays. Radioactivity is not frequently used due to its safety concerns and its longer exposure times (up to 10 h). Fluorescence is one of the most frequently utilized detection methods. Fluorophores, like chromogens, exist in many formulations and have defined emission spectra. Fluorescein, rhodamine (Texas Red), phycobiliproteins, nitrobenzoxadiazole (NBD), acridines, Cy3, Cy5, and bodipy compounds are commonly used for protein labeling (13,14,15,16,17). The selection of fluorophores for use with microarrays depends on sample type, substratum, emission characteristics, and even the number of analytes to be assayed. Not all substrates are compatible with fluorescent detection strategies due to inherent autofluorescence of the material (14,15,16,17), which significantly reduces the signal-to-noise ratios. Nitrocellulose-coated slides cause light scatter and higher background as compared to aldehyde-treated slides with laser scanner detection methods, limiting the use of nitrocellulose substrata for fluorescent detection methods (13,14,15,16,17). The sample may also have components that interfere with a selected fluorophore. Flavoproteins autofluoresce and emit light in the same region as fluorescein, limiting the use of this fluorophore in samples rich in flavoproteins, e.g., liver and kidney tissues. Photobleaching and quenching of
fluorophores can decrease the total signal observed on an array. The Cy3 and Cy5 dyes are commonly used for fluorescent detection because they overcome these effects. They are well suited for fluorescence detection strategies due to their decreased dye interactions, increased brightness, and the ability to add charged groups to the molecules (13,14,15,16,17). Fluorescent-tagged proteins, including antibodies, can be used for detection of immobilized molecules on a microarray using either indirect or sandwich strategies. Streptavidin-biotin or RCA amplification chemistries can also be applied to fluorescence detection strategies (22,23,24), providing sufficient sensitivity for most applications. Chemiluminescent detection methods are based on Western blotting protocols for detection of antigen-bound antibodies with secondary antibodies conjugated to alkaline phosphatase or HRP (13,14,15,16,17,18). Chemiluminescent detection methods can be applied to any of the label detection methods. Chemiluminescence is highly sensitive but may pose limitations due to its dynamic range and compatibility with multiplexing. Amplification strategies such as biotinyltyramide can be applied to chemiluminescence. A useful application consists of total protein determination made directly on arrays using a ruthenium organic complex, which interacts non-covalently with proteins immobilized on nitrocellulose (13,14,15,16,17,18). The dye is applicable to arrays printed on nitrocellulose membranes. This type of total protein analysis is useful for minute sample volumes in which a standard protein spectrophotometric analysis would not be feasible.
2. Antibody selection. The first critical step is the selection of protein targets to be measured with the antibody arrays, which depends on the experimental design and objectives of the analyses undertaken. It is advisable to have biological or experimental criteria supporting the search for specific proteins in the serum.
An efficient approach is to perform high-throughput profiling at the DNA or RNA level prior to protein profiling, to increase the probability of finding a target protein in the serum. Not all proteins are suitable for measurement with this assay, since their size and their likely abundance in the samples are limiting factors. If a protein is very small (or is a polypeptide), it may not be compatible with direct labeling detection methods, which use size-based separation of labeled product from the label. If a protein is in very low abundance, it may fall below the detection limit of the assay. Detection limits for the assay depend on the antibody used, the protein background in the sample, and the detection conditions. In general, the direct labeling method described here can give detection limits in the low ng/ml range for targets present in the serum background. Once the list of target proteins is assembled, the search for antibodies begins. The main bottleneck to the development of highly multiplexed planar antibody arrays is the requirement for specific affinity ligands for each analyte. Commercially available antibodies against novel or rare proteins may not exist, which leaves the option of having the antibody custom-produced. Custom antibody generation is lengthy, expensive, and probably not a viable choice for more than a few antibodies. If a protein target is more common and a choice of
antibody exists, it is advisable to search for antibodies that work efficiently for enzyme immunoassays, since these assays are quite similar to antibody arrays. Monoclonal antibodies seem to have a higher success rate, but polyclonals may also work well, although they may lead to high background and reduced specificity and sensitivity as compared to monoclonal antibodies. In vitro selection of antibodies using phage, ribosome, or mRNA display technologies, and the use of engineered binding molecules, are playing an increasingly important role in generating specific affinity ligands for analytes for which antibodies are unavailable (14). An alternative strategy to produce specific antibodies has been validated, based on designing protein sub-fragments of a selected size with minimal sequence similarity to other proteins. The fragments are selected using an alignment scanning procedure based on the principle of lowest sequence similarity to other human proteins, in order to generate antibodies with high selectivity (20). If the direct labeling method is to be used, only one antibody per target is needed. If using a sandwich assay, a matched pair of antibodies is needed. The direct labeling method works well for mid- to high-abundance proteins, while sandwich assays or amplification protocols are recommended for low-abundance proteins. Since antibodies cannot be manufactured with known affinity and specificity, it is advisable to validate the specificity and sensitivity of each antibody prior to use as a probe for protein arrays. The identification of a single band at the specified molecular weight on Western blotting represents a standard validation strategy for the specificity and sensitivity of the proposed antibody, as does immunoprecipitation followed by mass spectrometry (1,6). The antigen-antibody properties of the antibodies printed on the arrays can be evaluated by the estimation of random and systematic errors.
Western blotting analyses can serve to evaluate the specificity of the antibodies. Commercial or custom-made enzyme immunoassays can be utilized to validate, by an independent method, the performance of the antibodies on the same serum specimens profiled using antibody arrays. Recombinant antigens can be utilized as positive and negative controls for the process of printing (depositing the antibodies onto the slides), calibration, and detection methods (1,2,9). The linearity range of the assay depends on the antibody-antigen affinity. Linearity can only be achieved when the concentrations of the analyte and antibody are matched to the affinity constant. It is advisable that dilution and recovery experiments evaluating the specificity and affinity of the antibodies for their ligands are included when utilizing antibody arrays (2,9).
3. Purity of antibodies. Antibodies work best in the arrays when they are highly purified. The use of antibodies in a high background of other proteins often results in a weakened or non-specific signal, since the background proteins occupy many binding sites on the microarray. Some purified antibodies come in a BSA or gelatin stabilizer. It may be desirable to remove gelatin, since it can bind some biological molecules. BSA rarely has the problem of non-specific binding, but if it is at a much higher concentration than the antibody, it could significantly
reduce the signal from the antibody, which would warrant further purification of the antibody. Some antibodies come in a high concentration (8–50%) of glycerol to improve stability. While glycerol will not interfere with the assay, the added viscosity may negatively affect the printing process. Glycerol concentrations above 20% should be avoided. To change the buffer of an antibody, it is advisable to use the Bio-Rad Micro Bio-Spin P30 column. These columns come with two types of buffers: saline-sodium citrate (SSC) and Tris buffer. The filtrate will come through in the packing buffer. This packing buffer can be changed by running a different buffer through the column three times. The P30 column removes solution components smaller than 30 kD, and the P6 column removes components smaller than 6 kD. Thus, the P30 column is better for purification of antibodies, and the P6 column is better for purification of complex mixtures in which low-molecular-weight species should be preserved. If the antibody is to be subsequently labeled, it is recommended not to put the antibody in a Tris or other amine-containing buffer. Polyclonal antibodies come either as unpurified antisera, the IgG fraction of antisera, or the affinity-purified (purified using the antigen) fraction of antisera. Affinity-purified antibody is best, since it yields the highest purity of specific antibody. IgG-purified fractions of antisera usually work well. Antibodies that arrive in pure ascites fluid may also need to be purified. A good monoclonal antibody will work well without further purification, so such antibodies should be tested first. For purification of IgG antibodies, the Affi-gel Protein A MAPS II kit (Bio-Rad) is recommended.
In general, the following antibody buffer requirements should be considered: (1) all antibodies that arrive as antisera need to be IgG purified; (2) antibodies in ascites fluid may also need to be purified, although they can first be tested without purification.
4. Stability and concentration. Antibodies are stable when refrigerated in a standard buffer such as PBS. The concentration of an antibody can be measured using a protein concentration kit such as the BCA 200 Protein Assay Kit (Pierce Biotechnology). The optimal spotting concentration range is 100–200 μg/ml. Higher concentrations could yield better signal strengths and lower detection limits, and may be desirable if the consumption of antibody is not a concern. Each antibody’s concentration should be kept constant across printing sets, since concentration variations in an antibody can affect data. Simply stated, if a set of data is produced using a particular antibody at 300 μg/ml, subsequent experiments should use that antibody at 300 μg/ml for better comparison of the results.
5. Antibody storage. Most antibodies can be stored refrigerated for up to a year. New antibodies should be divided into aliquots that will last approximately a year each. One aliquot should be kept in the refrigerator as a working stock, and the others frozen at –70°C. Aliquoting the antibody stocks helps to avoid repeated freeze/thawing that can damage the proteins. Protein stocks should not be frozen diluted in PBS; it is better to freeze them undiluted. When retrieving antibodies/proteins from
a freezer stock, thawing should be done slowly on ice to reduce damage to the antibody from the thawing process.
6. Tracking antibodies. It is helpful to keep information about the antibodies in a database. It is advisable to provide a number code for each antibody, and if changes are made to an antibody’s buffer composition, a new code should be assigned to the new preparation. Relevant information to track includes clonality, manufacturer, animal of origin, concentration, and aliquot age. It is important to record as much of the information provided in the antibody datasheet as possible, and to label aliquots accordingly.
7. Maintaining antibody stocks. A refrigerator stock of ready-to-use antibodies (kept at working concentration) should be maintained. Except for the antibodies that should not be frozen, only one tube of each antibody should be stored in the refrigerator at a time. The amount of each antibody in the refrigerator stock should be sufficient to last for six months or up to a year (normally around 100 μl). The rest of the antibody stock should be aliquoted into similar volumes and frozen at –80°C. If the antibody in the refrigerator stock needs to be diluted in order to reach the working stock concentration, dilute only sufficient stock for the working solution. When retrieving antibodies/proteins from a freezer stock, they should be thawed slowly on ice in order to reduce damage from the thawing process. The protein stock master list will need to be updated to indicate when the antibodies are thawed and frozen.
8. Print plate preparation. After the antibodies have been acquired and prepared at proper purity and concentration, they are assembled into a “print plate,” which is a microtiter plate used in the robotic printing of microarrays. Polypropylene microtiter plates are preferable to polystyrene because of lower protein adsorption. The plate should be rigid and precisely machined for optimal functioning with printing robots. The 384-well plates are generally more compatible with printing robots than 96-well plates and require less volume per well. Load about 6–10 μl of each antibody into each well of the 384-well print plate. The volume may depend on the shape of the well and how far the print tips descend into the well. Too much volume may lead to droplets of antibody solution sticking to the outside of the print tip. The volume may also need to be optimized for particular applications, such as multiple draws from each well, which would require a greater volume. If printing is sometimes inconsistent or variable between printing tips, it is desirable to fill multiple wells with the same antibody solution so that different print pins spot the same antibody. Store the 384-well print plates sealed in the refrigerator until ready to use. Aluminum foil tape provides a good seal. Enclosing the covered plate in a sealed plastic bag ensures long-term, evaporation-free storage. It is very important to prepare a spreadsheet containing the well identities for use in downstream data processing applications.
9. Selection of slides. The various immobilization and detection strategies are devised depending on which target molecules are going to be measured and which ones are used to capture them. The attributes of an ideal substratum
for antibody arrays include limited non-specific binding, high surface area-to-volume ratio, inertness toward biological molecules, minimal autofluorescence, and compatibility with available detection methods. A variety of surfaces and immobilization chemistries have been described for antibody arrays. Derivatized supports where capture antibodies are immobilized include surfaces such as polyvinylidene difluoride, nitrocellulose, agarose, polyacrylamide, or hydrogels. Glass slides are frequently coated with one-, two-, or three-dimensionally structured surface modifications, being activated with aldehyde, polylysine, or a homo-functional cross-linker as part of the initial optimization experiments (2,9,14). The advantages of the use of distinct coatings or surfaces under different blocking, pH buffering, or UV cross-linking conditions for specific applications have been described (14). Silane-coated glass slides or acrylamide hydrogel can provide good reproducibility from day to day, efficient immobilization of antibodies, and low background when used in conjunction with fluorescence detection. Various substrates for antibody arrays have been reported, such as poly-lysine coated glass (1), aldehyde-coated glass (30), nitrocellulose (31), and a poly-acrylamide based hydrogel (32). Hydrogels and nitrocellulose give good results for the direct labeling method described here. Nitrocellulose slides do not require any preparation before printing, and give clean and low background results. Hydrogel coating on glass slides (such as those supplied by PerkinElmer Life Sciences) can support multiple layers of protein, thus increasing the binding capacity and signal strengths, and it should be noted that the hydrophilic matrix of the hydrogel may better retain native protein structure. Hydrogels should be stored dry at room temperature. They must be used within 2 days after preparation.
10. Printing of antibody arrays. The details of printing will depend on the printing robot used.
It is necessary to immobilize antibodies in such a way that the functional component is efficiently deposited without interfering with subsequent binding. Conditions such as humidity, temperature, dust levels, and pin washing should also be stringently controlled during the printing step. It is important to minimize the time for which the print plates are unsealed and exposed, in order to keep the evaporation of antibody solutions low. Maintaining a moderately high humidity in the printing environment (around 45%) will minimize evaporation and maintain spot quality. Excessive humidity can lead to overly large spots. Proper functioning of the printing robot should be confirmed with test prints on dummy slides before starting the microarray production. It is advisable to use 500 μg/ml BSA in 1× PBS for the test prints. If the tips are washed in a wash bath, make sure the water is changed regularly, every 6–12 loads, to prevent contamination of the tips. It is also desirable to confirm sufficient washing of the pins and lack of carry-over from load to load. This test can be done by loading labeled protein into one of the print plate wells in a dummy print, followed by scanning of the unwashed slide. If fluorescence is seen in spots printed after the fluorescently labeled material, the pins need to be washed more stringently. Most microarrayers will allow the printing of replicate spots on each array from the same well of the print plate. Replicate spots are useful to obtain more precise data through averaging
and ensure the acquisition of data if a portion of the array is somehow unusable. Six to ten spots per array per antibody are recommended.
11. Serum sample handling and storage. Sera should be collected in red gel tubes, allowing the clot to retract, centrifuged at 3000 × g for 10 min, aliquoted, and stored at –80°C. All samples should be consecutively numbered so that no record compromises the identity of the patients or controls under study. Serum samples should be handled as biohazards. Tips and tubes that contact serum samples should be disposed of in a biohazard bag. Upon the first thaw, the samples need to be aliquoted. Samples should be aliquoted so that no more than four thaws are necessary for every experiment. Low-volume aliquots (approximately 10–15 μl) of each specimen are recommended. For greater than approximately 50 samples, it is convenient to use a microtiter plate for aliquoting. In this case, approximately 50 μl from each sample is placed into each well of a 96-well microtiter plate. Either a robot or a matrix multichannel pipettor is used to aliquot small volumes into replicate 96-well plates.
12. Scanning. The fluorescence signal from the microarrays is detected using a microarray scanner. The GenePix Pro 3.0 (Axon Instruments) software program quantifies the image data. The local background in each color channel is subtracted from the signal at each antibody spot, and spots having obvious defects, no detectable signal by GenePix, or a low net fluorescence in either color channel are removed from analysis. The ratio of net signal from the sample-specific channel to the net signal from the reference-specific channel is calculated for each antibody spot, and ratios from replicate antibody measurements in each array are averaged. An intensity-dependent normalization algorithm for antibody arrays is recommended. Some of the particulars of the scanning method will depend on the instrument, but some general principles may be followed.
Scanning of an experiment set should be performed immediately after incubation of the microarrays and all on the same day, if possible, to minimize noise introduced by variable breakdown of the dyes on the array (particularly Cy5). The microarrays should be kept in the dark to minimize bleaching of the fluorescent dyes. Scanners typically have adjustments for laser power, detector gain, and scan rate. Set both lasers to about 95% and adjust the scanner to achieve the desired signal intensities. Adjust the laser power so that at least 50% of the pixels of each spot are saturated. The laser power should almost always be set very close to the maximum, since the maximum powers of the small commercial scanners are still less than optimal. Lower scan rates will generally produce higher signal-to-noise ratios. Scanning is performed at either 50 or 25% speed, depending on practical time limitations: when large sets of arrays must be scanned, the time available sets a lower limit on the scan rate. To find the optimal scanner settings, it is advisable to set the laser power close to maximum, set the scan rate to the lowest acceptable value, and then adjust the detector gain as high as possible without showing signal saturation in the data. When scanning a large set of arrays as part of a single experiment set, it is desirable to use similar settings for all the arrays to minimize the differences
in conditions between the arrays. It may not always be possible to use the same settings for every slide because of large variations in signal and background strengths, but subsequent normalization should readjust the data accordingly. Scanned images are typically stored as TIFF files to be analyzed by microarray analysis programs. It is advisable to name the scanned images by their slide number followed by either Cy3 or Cy5 and the date of scanning.
13. Gridding and rejection of data points. The analysis of scanned microarray data depends somewhat on whether the experiment is a one-color or a two-color direct-labeling experiment. In all experiment types, the image data first need to be converted into numbers. Various software programs that come with current scanners, such as GenePix with Axon scanners and ArrayQuant with PerkinElmer scanners, accomplish this. The details for using such programs are not discussed here, but the principles that these programs use are mentioned. The quantification of microarray data begins with loading the scanned images (usually in TIFF format) into an analysis program and overlaying a grid that defines the locations of the antibody spots. After aligning the grid to the image data, the program calculates the intensities and various statistics for image areas both inside and outside the spots. The user can “flag” or reject spots if obvious gross defects are present. Spots with very low intensity in one or both of the color channels yield unreliable data and should be rejected. It is especially important to reject low-intensity spots in two-color ratio experiments, since noisy low-intensity data can greatly affect the ratio. It is desirable to define statistical criteria for rejecting low-intensity spots rather than relying on user judgment. A threshold based on the overall variation in background on the arrays can be defined.
The median signal intensity at each spot should be at least three standard deviations (of the background areas) above the local background median intensity. This objective criterion provides a uniform, statistically based standard for all data.
14. Normalization of data. The signals obtained from each array need to be corrected or normalized for possible changes in the overall signal intensity due to factors such as scanner settings and dye labeling efficiency. This process uses signals from antibodies targeting an internal standard of known concentration. Antibodies against proteins commonly present in serum, such as immunoglobulin isotypes, albumin, or C-reactive protein, can be used as internal controls. A normalization factor is calculated for each array that sets the data from the normalization antibodies to the expected or known values. A highly specific and quantitatively accurate antibody is required for measurement of the normalization protein. The protein standards can either be present naturally in the sample or be spiked in. A spiked-in standard that works well is FLAG-labeled BSA; FLAG is a widely used peptide tag for which commercial labeling kits are available. Other tags such as DNP can work well too. It is recommended that normalization be based on an intensity-dependent algorithm, as follows (24). In this case, the local background in each color channel is subtracted from the signal at each antibody spot, and spots having obvious defects, no detectable signal by GenePix, or a low net fluorescence
in either color channel are removed from analysis. The ratio of net signal from the sample-specific channel to the net signal from the reference-specific channel is calculated for each antibody spot, and ratios from replicate antibody measurements in the same array are averaged. It is common to plot a red (Cy5) versus green (Cy3) channel scatter plot to examine the distribution of intensities; however, transforming to fold change versus average intensity displays the data in a more easily readable form. If $I_{\mathrm{red}}$ is the background-subtracted red channel intensity and $I_{\mathrm{green}}$ is the background-subtracted green channel intensity, then the following variables are created: $R = I_{\mathrm{red}}/I_{\mathrm{green}}$ and $A = \sqrt{I_{\mathrm{red}} \times I_{\mathrm{green}}}$, where R is simply the fold change ratio and A is the average intensity (the geometric mean, which is equivalent to averaging the log intensities). Curvature in the plot of log R versus A indicates a dependence of the ratio R on the overall intensity. A curve c(A) fitted to this plot is then used to normalize the data: $\log(I_{\mathrm{red}}/I_{\mathrm{green}}) \rightarrow \log(I_{\mathrm{red}}/I_{\mathrm{green}}) - c(A)$. This is equivalent to multiplying the green channel intensity (or dividing the red) by an intensity-dependent normalization constant k(A), where $\log[k(A)] = c(A)$. The optimally normalized data should be horizontal and centered (24).
15. Data analysis. A critical step when using quantitative data obtained through antibody arrays is the establishment of a filtering process to assess the quality of the data. The conceptual similarity of label-based antibody arrays to two-color competitive detection genomic arrays has allowed the normalization and data analysis tools classically used for cDNA arrays to be applied to protein profiling with antibody arrays (24).
To measure multiple proteins simultaneously and efficiently, with high sensitivity, specificity, reproducibility, and quantitative accuracy over large concentration ranges, it is necessary to consider quality control issues in the design of the arrays (1,4,9). Optimal assessment of the technology through filtering and data analysis procedures will later address the linearity, calibration, and specificity of the antibodies, as well as whether the labeling and/or hybridization protocols are optimized adequately to ensure high signal-to-noise ratios (3,24). The very first level of quality control deals with the experimental design of the printing of antibody arrays, which should include replicated spots dispersed over the complete surface of the array, as well as controls in every single experiment to evaluate the intra- and inter-assay reproducibility of the measurements (1,4,9). The array should also include appropriate means of testing for potential antibody interferences and cross-reactivity. In this regard, the quantity of antibody spotted can be used to standardize the antigen concentration. It is possible to use an internally controlled system where one color represents the amount of antibody spotted, and the other color represents the amount of antigen, which is used to quantify the level of protein expression. This normalization for antibody spot intensity can decrease variability and lower the limits of detection of antibody arrays. The initial control of scanned data is at the spot level using the scanner software, e.g., GenePix (24). The customized report created can be used to analyze the quality of spots, and it is then possible to flag those spots of
low quality. The criteria to flag the spots may include the number of standard deviations above background, the R², or the percent saturation (3,24). At the array level of comparison, the quality control of data includes normalization of the array, as well as calculation of the average and standard deviation of the intensities of each antibody across its replicates on the slide (3–24). Spots with a high standard deviation between replicates can be filtered out. Normalization of the arrays can be performed using the average intensity of each array (24), protein standards such as immunoglobulin G (1,21), or internal controls based on antibody spot intensity (31). In the next level of data filtering, each experiment set is compared, and the results are calibrated to a dilution series of antibodies by a best-fit line, removing data with high variability. The results can also be correlated with independent measurements obtained through the enzyme immunoassays (ELISA) available to quantify targets included in the antibody arrays. At this step, if the dilution series for an antibody is of poor quality, the antibody can be flagged. It is possible to set thresholds of expression for an antibody, specifying a maximum and minimum ratio for spots to be considered in further analyses (24). This is a critical step because it filters the input data based on the standard deviation between replicate spots, and also the output data based on the standard deviation of the dilution experiments. The last level of quality control refers to the comparison of independent experiment sets based on internal controls, which allows comparison between experiments performed on different days. The combined use of unsupervised and supervised methods can identify protein patterns associated with disease progression and clinical outcome.
16.
One should be aware that research procedures based on antibody arrays have limitations, associated with false positive and false negative results, which may be overcome using different strategies. Causes of false negative results on antibody arrays include: (1) The protein product may have been degraded by serum proteases during sample handling. (2) Interferences in the antibody–antigen binding process may result in low detection of the target protein. The specificity of the targets for bladder cancer progression is addressed by immunohistochemistry and by using antibodies targeting different epitopes. The specificity of antigen–antibody binding is assessed by reverse protocols, printing purified proteins, and Western blots. Addition of protease inhibitors and serum preservation at –80 °C will avoid protein degradation during sample handling. Serum aliquots will avoid the degradation associated with repeated freeze–thaw cycles. Modifications in amplification protocols, such as rolling-circle amplification, may increase signal detection. Similarly, the causes of false positive results on antibody arrays include: (A) The antibody binds non-specific molecules or degradation products of the target protein. (B) Gelatin or other protein-based additives are present in the antibodies printed onto the arrays. (C) Heterophilic antibodies are present in the serum samples. (D) Antibodies present in patients with autoimmune or other diseases bind non-specifically. False positive results can be addressed in several ways. Cross-reactivity
can be overcome by selecting alternative antibodies directed to other epitopes (A) or by using preservatives without gelatin (B). In cases C and D, the interference and recovery experiments proposed for the analytical validation of antibodies, using dilution and recovery coefficients, will estimate the amount of interference. Clinical records on other coexisting diseases in the patients analyzed, enzyme immunoassays, and immunohistochemical analyses will help to interpret unexpected results. The specificity of antigen–antibody binding can be assessed by reverse protocols, printing purified proteins, and Western blots.
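The spot rejection criterion of Note 13 and the intensity-dependent normalization of Note 14 can be illustrated with a short sketch. This is not the GenePix procedure or the algorithm of ref. 24: the function names and the example intensities are hypothetical, base-2 logarithms are an arbitrary choice, and c(A) is assumed here to be a straight line in log A fitted by least squares, whereas a smoother curve may suit real data better.

```python
import math

def passes_threshold(spot_median, bkg_median, bkg_sd, n_sd=3.0):
    """Keep a spot only if its median signal is at least n_sd background
    standard deviations above the local background median (Note 13)."""
    return spot_median >= bkg_median + n_sd * bkg_sd

def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def normalize_log_ratios(red, green):
    """Return log2(R) - c(A) for background-subtracted, positive intensities.

    R = red / green and A = sqrt(red * green). The fit c(A) is ASSUMED to be
    linear in log2(A); this is an illustrative simplification.
    """
    log_r = [math.log2(r / g) for r, g in zip(red, green)]
    log_a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    intercept, slope = fit_line(log_a, log_r)
    return [lr - (intercept + slope * la) for lr, la in zip(log_r, log_a)]

# Invented net intensities with a ratio that drifts with overall intensity.
green = [100.0, 400.0, 1600.0, 6400.0]
red = [g ** 1.1 for g in green]
normalized = normalize_log_ratios(red, green)
```

Spots should be filtered with a criterion such as `passes_threshold` before normalization, because the logarithms require positive net intensities in both channels. After the correction, the normalized log ratios of this artificial example lie on a horizontal line centered on zero, which is the behavior the note describes for well-normalized data.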
5. Final Remarks
The methods and applications of antibody arrays are increasing in scope and effectiveness. Current antibody array formats, and new ones that may be developed in the near future, are likely to markedly accelerate biomarker discovery and the characterization of cancer-specific pathways. This should eventually lead to the development of individualized therapies that take into account markers of disease predisposition and therapeutic response. However, multiple challenges remain in the design and application of antibody arrays (33–35): (1) poor understanding of protein immobilization; (2) limited dynamic ranges of no more than three orders of magnitude; (3) achieving accuracy and reproducibility similar to clinical immunoassays; (4) molecular protein complexity and denaturation affecting immunoreactivity; (5) lack of standards and calibrators; (6) development of high-affinity and specific antibodies for target antigens. Such challenges are being addressed by the multi-institutional effort of the Human Proteome Organization (HUPO) toward the standardization of critical parameters in serum or plasma proteomic analyses. Initial studies provide guidance on pre-analytical variables that can alter the analysis of blood-derived samples, including choice of sample type, stability during storage, use of protease inhibitors, and clinical standardization (33; see also Chapter 2). As part of the HUPO approach, it is also critical to standardize the statistical strategies for high-confidence protein identification and data analyses. These efforts and strategies toward integrating proteomic datasets should lead to an accurate and comprehensive representation of human proteomes (34,35).
References
1. Haab BB, Dunham MJ, Brown PO. (2001). Protein microarrays for highly parallel detection and quantitation of specific proteins and antibodies in complex solutions. Genome Biol. 2(2): research 0004.1–0004.13. 2. Chan SM, Ermann J, Su L, Fathman CG, Utz PJ. (2004). Protein microarrays for multiplex analysis of signal transduction pathways. Nat Med. 10, 1390–6.
3. Sanchez-Carbayo M. (2006). Antibody arrays: technical considerations and clinical applications in cancer. Clin Chem. 52, 1651–9. 4. Barry R, Diggle T, Terrett J, Soloviev M. (2003). Competitive assay formats for high-throughput affinity arrays. J Biomol Screen. 8, 257–63. 5. Pang S, Smith J, Onley D, Reeve J, Walker M, Foy C. (2005). A comparability study of the emerging protein array platforms with established ELISA procedures. J Immunol Meth. 302, 1–13. 6. Lash GE, Scaife PJ, Innes BA, Otun HA, Robson SC, Searle RF, Bulmer JN. (2006). Comparison of three multiplex cytokine analysis systems: Luminex, SearchLight and FAST Quant. J Immunol Meth. 309, 205–8. 7. de Jager W, Rijkers GT. (2006). Solid-phase and bead-based cytokine immunoassay: a comparison. Methods 38, 294–303. 8. Waterboer T, Sehr P, Pawlita M. (2006). Suppression of non-specific binding in serological Luminex assays. J Immunol Methods. 309, 200–4. 9. Kingsmore SF. (2006). Multiplexed protein measurement: technologies and applications of protein and antibody arrays. Nat Rev Drug Discov. 5, 310–21. 10. Wang X, Yu J, Sreekumar A, Varambally S, Shen R, Giacherio D, Mehra R, Montie JE, Pienta KJ, Sanda MG, Kantoff PW, Rubin MA, Wei JT, Ghosh D, Chinnaiyan AM. (2005). Autoantibody signatures in prostate cancer. N Engl J Med. 353, 1224–35. 11. Anderson KS, LaBaer J. (2005). The sentinel within: exploiting the immune system for cancer biomarkers. J Proteome Res. 4, 1123–33. 12. Petricoin EF III, Bichsel VE, Calvert VS, Espina V, Winters M, Young L, Belluco C, Trock BJ, Lippman M, Fishman DA, Sgroi DC, Munson PJ, Esserman LJ, Liotta LA. (2005). Mapping molecular networks using proteomics: a vision for patient-tailored combination therapy. J Clin Oncol. 23, 3614–21. 13. Angenendt P, Glokler J, Murphy D, Lehrach H, Cahill DJ. (2002). Toward optimized antibody microarrays: a comparison of current microarray support materials. Anal Biochem. 309, 253–60. 14. 
Espina V, Woodhouse EC, Wulfkuhle J, Asmussen HD, Petricoin EF III, Liotta LA. (2004). Protein microarray detection strategies: focus on direct detection technologies. J Immunol Methods. 290, 121–33. 15. Levit-Binnun N, Lindner AB, Zik O, Eshhar Z, Moses E. (2003). Quantitative detection of protein arrays. Anal Chem. 75, 1436–41. 16. Pawlak B, Gordon R. (2005). Density estimation for positron emission tomography. Technol Cancer Res Treat. 4, 131–42. 17. Schweitzer B, Roberts S, Grimwade B, Shao W, Wang M, Fu Q, Shu Q, Laroche I, Zhou Z, Tchernev VT, Christiansen J, Velleca M, Kingsmore SF. (2002). Multiplexed protein profiling on microarrays by rolling-circle amplification. Nat Biotechnol. 20, 359–65. 18. Pasternack RF, Collings PJ. (1995). Resonance light scattering: a new technique for studying chromophore aggregation. Science. 269, 935–9. 19. Stich N, Gandhum A, Matyushin V, Raats J, Mayer C, Alguel Y, Schalkhammer T. (2002). Phage display antibody-based proteomic device using resonance-enhanced detection. J Nanosci Nanotechnol. 2, 375–81.
20. Lindskog M, Rockberg J, Uhlen M, Sterky F. (2005). Selection of protein epitopes for antibody production. Biotechniques. 38, 723–7. 21. Miller JC, Zhou H, Kwekel J, Cavallo R, Burke J, Butler EB, Teh BS, Haab BB. (2003). Antibody microarray profiling of human prostate cancer sera: antibody screening and identification of potential biomarkers. Proteomics. 3, 56–63. 22. Zhou H, Bouwman K, Schotanus M, Verweij C, Marrero JA, Dillon D, Costa J, Lizardi P, Haab BB. (2004). Two-color, rolling-circle amplification on antibody microarrays for sensitive, multiplexed serum-protein measurements. Genome Biol. 5, R28. 23. Shao W, Zhou Z, Laroche I, Lu H, Zong Q, Patel DD, Kingsmore S, Piccoli SP. (2003). Optimization of rolling-circle amplified protein microarrays for multiplexed protein profiling. J Biomed Biotechnol. 5, 299–307. 24. Sanchez-Carbayo M, Socci ND, Lozano JJ, Haab BB, Cordon-Cardo C. (2006). Profiling bladder cancer using targeted antibody arrays. Am J Pathol. 168, 93–103. 25. Saviranta P, Okon R, Brinker A, Warashina M, Eppinger J, Geierstanger BH. (2004). Evaluating sandwich immunoassays in microarray format in terms of the ambient analyte regime. Clin Chem. 50, 1907–20. 26. Huang R, Lin Y, Shi Q, Flowers L, Ramachandran S, Horowitz IR, Parthasarathy S, Huang RP. (2004). Enhanced protein profiling arrays with ELISA-based amplification for high-throughput molecular changes of tumor patients plasma. Clin Cancer Res. 10, 598–609. 27. Varnum SM, Woodbury RL, Zangar RC. (2004). A protein microarray ELISA for screening biological fluids. Methods Mol Biol. 264, 161–72. 28. Gembitsky DS, Lawlor K, Jacovina A, Yaneva M, Tempst P. (2004). A prototype antibody microarray platform to monitor changes in protein tyrosine phosphorylation. Mol Cell Proteomics. 3, 1102–18. 29. Janzi M, Odling J, Pan-Hammarstrom Q, Sundberg M, Lundeberg J, Uhlen M, Hammarstrom L, Nilsson P. (2005). Serum microarrays for large scale screening of protein levels. Mol Cell Proteomics. 4, 1942–7. 
30. MacBeath G, Schreiber SL. (2000). Printing proteins as microarrays for high-throughput function determination. Science. 289, 1760–3. 31. Knezevic V, Leethanakul C, Bichsel VE, Worth JM, Prabhu VV, Gutkind JS, Liotta LA, Munson PJ, Petricoin EF 3rd, Krizman DB. (2001). Proteomic profiling of the cancer microenvironment by antibody arrays. Proteomics. 1, 1271–8. 32. Arenkov P, Kukhtin A, Gemmell A, Voloshchuk S, Chupeeva V, Mirzabekov A. (2000). Protein microchips: use for immunoassay and enzymatic reactions. Anal Biochem. 278, 123–31. 33. Rai AJ, Gelfand CA, Haywood BC, Warunek DJ, Yi J, Schuchard MD, Mehigh RJ, Cockrill SL, Scott GB, Tammen H, Schulz-Knappe P, Speicher DW, Vitzthum F, Haab BB, Siest G, Chan DW. (2005). HUPO Plasma Proteome Project specimen collection and handling: towards the standardization of parameters for plasma proteome samples. Proteomics. 5, 3262–77. 34. States DJ, Omenn GS, Blackwell TW, Fermin D, Eng J, Speicher DW, Hanash SM. (2006). Challenges in deriving high-confidence protein identifications from
data gathered by a HUPO plasma proteome collaborative study. Nat Biotechnol. 24, 333–8. 35. Uhlen M, Bjorling E, Agaton C, Szigyarto CA, Amini B, Andersen E, Andersson AC, Angelidou P, Asplund A, Asplund C, Berglund L, Bergstrom K, Brumer H, Cerjan D, Ekstrom M, Elobeid A, Eriksson C, Fagerberg L, Falk R, Fall J, Forsberg M, Bjorklund MG, Gumbel K, Halimi A, Hallin I, Hamsten C, Hansson M, Hedhammar M, Hercules G, Kampf C, Larsson K, Lindskog M, Lodewyckx W, Lund J, Lundeberg J, Magnusson K, Malm E, Nilsson P, Odling J, Oksvold P, Olsson I, Oster E, Ottosson J, Paavilainen L, Persson A, Rimini R, Rockberg J, Runeson M, Sivertsson A, Skollermo A, Steen J, Stenvall M, Sterky F, Stromberg S, Sundberg M, Tegel H, Tourle S, Wahlund E, Walden A, Wan J, Wernerus H, Westberg J, Wester K, Wrethagen U, Xu LL, Hober S, Ponten F. (2005). A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics. 4, 1920–32.
V Statistics and Bioinformatics in Clinical Proteomics Data Analysis
16
2D-PAGE Maps Analysis
Emilio Marengo, Elisa Robotti, and Marco Bobba
Summary
Due to the low reproducibility affecting 2D gel-electrophoresis and the complex maps provided by this technique, the use of effective and robust methods for the comparison and classification of 2D maps is fundamental to the development of automated diagnostic methods. A review of classical and recently developed methods for the comparison of 2D maps is presented here. The methods proposed cover both the analysis of spot volume datasets through multivariate statistical tools (pattern recognition methods, cluster analysis, and classification methods) and the analysis of 2D map images through fuzzy logic, three-way PCA, and the use of moment functions. The theoretical basis of each procedure is briefly introduced, together with a review of the most interesting applications present in recent literature.
Key Words: principal component analysis; cluster analysis; classification; SIMCA; image analysis; moment functions; fuzzy logic; three-way PCA; multidimensional scaling; spot volume data.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
The development of new and effective methods for the identification of differences between groups of 2D-PAGE maps represents one of the frontiers in the field of proteomics, for the development of reliable diagnostic/prognostic tools. The comparison of sets of 2D maps is not in fact a trivial problem due to some experimental limitations affecting 2D gel-electrophoresis. In spite of being a very powerful tool for the separation of proteins in cellular extracts, 2D gel-electrophoresis is characterized by quite low reproducibility:
this limit is dictated by both the specificity of the specimen and the instrumental procedure employed to obtain the final electrophoretic maps. In fact, the analyzed biological samples are often complex protein mixtures, covering a wide range of structures, properties, and molecular weights. The complexity of the sample is reflected in the complexity of the final map, which may contain hundreds or thousands of spots, with the further appearance of spurious spots due to impurities or side reactions. The second factor reducing reproducibility in 2D gel-electrophoresis is related to the instrumental technique itself, from sample preparation to the electrophoretic run. Sample pre-treatment, in fact, follows a multi-step procedure consisting of several purification and extraction steps, which increases the overall experimental uncertainty. In addition, the final result is strongly dependent on a great number of instrumental factors that have to be kept under strict control: polymerization conditions, temperature, running conditions, and time and temperature during the staining and de-staining steps. An unexpected or random variation of one or more of these instrumental parameters can strongly affect the reproducibility of the position, size, and intensity of the spots on the final map. The large number of spots present on each map and the low reproducibility of 2D gel-electrophoresis hinder a clear classification of samples and make it quite difficult to use 2D-PAGE maps for diagnostic and prognostic purposes or for drug-design studies. In this perspective, the use of effective and robust methods for the comparison and classification of 2D maps is a key point in the development of automated diagnostic tools based on proteomics. To take due account of the low reproducibility affecting the experimental protocol, sets of replicate 2D maps are usually run and compared.
The classical analysis of 2D-PAGE maps is usually carried out by dedicated software packages, which will be briefly described here. The second part of the chapter will focus on the use of multivariate statistical tools for a more effective analysis of the so-called “spot volume datasets” produced by software packages dedicated to 2D-PAGE image analysis. The final part of the chapter will be devoted to the most advanced applications of image analysis tools for the study and classification of 2D maps; the methods presented are based on fuzzy logic principles coupled with multivariate statistical tools or on the calculation of mathematical moments of the images.
2. Gel Analysis Via Dedicated Software Packages
The analysis of sets of 2D maps is usually carried out via dedicated software packages; among the most popular are PDQuest, Progenesis, Melanie, Z3, Phoretix, and Z4000, but many other solutions are commercially available.
Many papers have appeared in the last decade on the development of software packages (1–3), the comparison of the performance of different packages (4,5), or the exploration of particular topics such as point pattern matching, reproducibility, matching efficiency, and spot overlapping (6–16). All software solutions presently available perform the analysis of sets of 2D maps based on the digitized images of gels obtained by laser densitometry, phosphor imaging, or a CCD camera. The analysis of digitized images involves several steps, which are described here in more detail with particular reference to one of the most widely used packages, namely the PDQuest system (17–19):
1. Scanning. Gel images are turned into pixel data; each pixel is characterized by a pair of x–y coordinates indicating its position on the 2D image and a Z value corresponding to the signal intensity of the pixel. Each map is finally turned into a series of pixels described by their optical density (OD) value.
2. Filtering images. This step performs a pre-processing of gel images, allowing the elimination of noise, background effects, specks, and other imperfections.
3. Automated spot detection. Spot detection involves the identification of the spots present on each gel independently. The operator has to select the faintest spot (to set the sensitivity and minimum peak value parameters), the smallest spot (to set the size scale parameter), and the largest spot that one aims to detect. A final smoothing is applied to remove spots close to the background level. Spots are then located on the gel image (i.e., each spot is identified by a pair of x–y coordinates indicating its position on the gel), fitted by ideal Gaussian distributions, and quantified by the sum of the OD values within each Gaussian distribution.
4. Matching of protein profiles.
Sets of 2D gels are then edited and matched to one another in a “match set.” Each identified spot is matched to the same spot in all the other gels of the set under investigation. For this purpose, landmarks are needed, consisting of reference spots used by PDQuest to align and position the match set members for matching. The identification of the landmarks sets some parameters accounting for distortions existing among the gels to be compared.
5. Normalization. Normalization is then applied to the maps to compensate for gel-to-gel variations due to sample preparation and loading, staining and de-staining procedures, etc.
6. Differential analysis. This step allows the comparison of different sets of 2D maps, e.g., control and diseased samples. Within each group of 2D maps, a “sample group” is created containing the average values of all the spots identified. Once the sample groups have been created (e.g., control and diseased samples), the groups are compared to find differentially expressed proteins. Usually, only spots showing at least a two-fold variation (100% variation) are accepted as significantly changed.
7. Statistical analysis. Statistical analysis is then applied to the differentially expressed proteins. It is usually based on Student’s t-test (p < 0.05) (see also Chapter 17).
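Steps 6 and 7 can be illustrated for a single matched spot with the sketch below. It is not PDQuest's implementation: the function names and spot volumes are invented for illustration, the pooled-variance form of Student's t-test is assumed, and the two-sided p-value is obtained by numerically integrating the t density with Simpson's rule rather than by calling a statistics library.

```python
import math
from statistics import mean, variance

def fold_change(control, diseased):
    """Ratio of the group averages (the 'sample group' values) for one spot."""
    return mean(diseased) / mean(control)

def student_t(control, diseased):
    """Pooled-variance Student's t statistic and its degrees of freedom.
    Assumes the replicates are not all identical (non-zero pooled variance)."""
    n1, n2 = len(control), len(diseased)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(control) + (n2 - 1) * variance(diseased)) / df
    t = (mean(diseased) - mean(control)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t|,
    evaluated with Simpson's rule (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3

def is_differential(control, diseased, min_fold=2.0, alpha=0.05):
    """Apply the two-fold rule (in either direction) and the t-test."""
    fc = fold_change(control, diseased)
    t, df = student_t(control, diseased)
    changed = fc >= min_fold or fc <= 1 / min_fold
    return changed and two_sided_p(t, df) < alpha

# Invented normalized spot volumes for three replicate gels per group.
control = [100.0, 110.0, 90.0]
diseased = [220.0, 240.0, 200.0]
spot_is_changed = is_differential(control, diseased)
```

Note that the fold change uses only the group averages, whereas the t-test also uses the spread of the individual replicates, so the two criteria can disagree when replicate gels are poorly reproducible.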
The final result of the overall procedure, therefore, depends heavily on the accuracy of the software package adopted, and so the choice of the most suitable analysis software is critical. Commercial software packages, in spite of being powerful tools for image analysis, present two main disadvantages. The first one is related to human interference, which is introduced mainly in steps 2 and 3. The second disadvantage is related to the problem of replicates; the comparison of different groups of 2D maps is performed on the basis of the “sample group” obtained for each class, i.e., a gel containing the average of the information common to all replicates. In this way, single replicates are not considered, and the information about the reproducibility of the maps is not taken into proper consideration.
3. Analysis of Spot Volume Datasets
Spot volume datasets coming from the differential analysis via dedicated software (step 6 of the procedure described in Section 2) are particularly suitable for investigation by means of multivariate statistical tools; this is due both to their large dimensionality (a large number of spots identified on each map) and to the difficulty in identifying the small differences existing between groups of maps when hundreds of spots are detected simultaneously on each sample. From this point of view, multivariate statistical tools represent the best alternative, since they are able to provide a clear representation of the case under study, considering all the variables simultaneously, and produce robust results, i.e., results from which the contribution of experimental uncertainty has been eliminated.
Among the statistical techniques that have recently been applied successfully to spot volume datasets are pattern recognition methods, e.g., Principal Component Analysis (PCA) and Cluster Analysis; classification methods, e.g., Linear Discriminant Analysis (LDA) and Soft Independent Modeling of Class Analogy (SIMCA); and regression methods, e.g., discriminant analysis–partial least squares regression (DA-PLS). Data from spot volume datasets have a multivariate structure, in which several samples (maps) are described by a large number of variables (spots identified). Multivariate data are usually arranged in matrices to undergo the statistical analysis. The datasets taken into account hereafter are arranged in data matrices of dimensions n × p, where n is the number of samples (one for each row of the matrix) and p is the number of variables (one for each column of the matrix).
3.1. Principal Component Analysis
Principal Component Analysis (20,21) is a multivariate pattern recognition method that represents the objects, described by the original variables, in a
new reference system characterized by new variables called principal components (PCs; see also Chapter 17). Each PC has the property of explaining the maximum possible amount of residual variance contained in the original dataset: the first PC explains the maximum amount of variance contained in the overall dataset, while the second one explains the maximum residual variance. The PCs are then calculated hierarchically so that experimental noise and random variations are contained in the last PCs. The PCs maintain a strict relationship with the original reference system, since they are calculated as linear combinations of the original variables. They are also orthogonal to each other, thus containing independent sources of information (Fig. 1). The hierarchical way in which PCs are calculated makes them useful for operating a dimensionality reduction of the original dataset: in fact, a large number of original variables can be substituted by a smaller number of significant PCs, containing a relevant amount of information when compared to the overall amount of variance contained in the original dataset, but eliminating experimental uncertainty (which is accounted for by the last PCs). Principal Component Analysis provides two main tools for data analysis: the scores and the loadings. The scores represent the coordinates of the samples in the new reference system, while the loadings represent the coefficients of the linear combination describing each PC, i.e., the weights of the original variables on each PC. The graphical representation of the scores in the space of the PCs allows the identification of groups of samples showing a similar behavior (samples close to one another in the graph) or different characteristics (samples far from each other). By looking at the corresponding loading plot, it is possible to identify the variables that are responsible for the analogies or the differences detected for the samples in the score plot. 
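The relationship between scores, loadings, and explained variance described above can be illustrated with a minimal, dependency-free sketch. It is not the implementation used by the authors: the spot volumes are hypothetical, only the first PC is extracted, and power iteration on the covariance matrix of mean-centered data is an assumption chosen to avoid external libraries (a linear algebra package would normally be used).

```python
import math

def center(X):
    """Subtract the column mean from every value (mean centering)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def first_pc(Z, n_iter=200):
    """Loadings, scores and explained-variance fraction of PC1.

    Z must already be mean-centered. The leading eigenvector of the
    covariance matrix is found by power iteration.
    """
    n, p = len(Z), len(Z[0])
    cov = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(p)]
           for a in range(p)]
    w = [1.0 / math.sqrt(p)] * p  # arbitrary start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    eigenvalue = sum(w[a] * cov[a][b] * w[b]
                     for a in range(p) for b in range(p))
    explained = eigenvalue / sum(cov[a][a] for a in range(p))
    # Scores: coordinates of each sample on PC1 (linear combination of variables).
    scores = [sum(z[j] * w[j] for j in range(p)) for z in Z]
    return w, scores, explained

# Hypothetical spot volumes: rows = maps (three of group A, three of group B),
# columns = spots.
X = [[10.0, 8.0, 2.0], [11.0, 9.0, 1.5], [9.5, 8.5, 2.5],
     [4.0, 3.0, 7.0], [5.0, 2.5, 8.0], [4.5, 3.5, 7.5]]
loadings, scores, explained = first_pc(center(X))
```

In this toy dataset the two groups take scores of opposite sign on PC1, and the signs of the loadings indicate which spots are more intense in each group.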
An example of loading and score plot is represented in Fig. 2. Data belong to four groups of 2D maps (24 maps described by more than 1000 spots). From the score plot, it is possible to discriminate the four groups of samples present:
Fig. 1. Construction of the principal components.
Fig. 2. Example of loadings (A) and scores plots (B). Panel A: loading plot (PC1 on the x axis, PC2 on the y axis). Panel B: score plot (PC1 on the x axis, PC2 on the y axis) of the samples of groups A, B, AB, and C.
one group in each quadrant. The first PC is able to discriminate samples C and A (negative scores on PC1) from samples B and AB (positive scores on PC1); PC2 separates samples C and B (positive scores on PC2) from samples A and AB (negative scores on PC2). The analysis of the corresponding loading plot explains the reasons for the separation of samples in the four groups: sample C shows large intensities of the spots in the 2nd quadrant and small intensities of the spots in the 4th quadrant; sample A shows large intensities of the spots in the 3rd quadrant and very small intensities in the 1st quadrant; samples AB present a behavior opposite to that of sample C, while sample B presents a behavior opposite to that of sample A.
For the identification of groups of samples and variables existing in a dataset, PCA is a very powerful visualization tool, which allows the representation of multivariate datasets by means of only a few PCs identified as the most relevant. In proteomics, the representation of loadings appears more effective on a virtual 2D map. In proteomic datasets, in fact, each variable represents a spot, characterized by a pair of x–y coordinates defining its position on the 2D maps used for analysis. The loadings of each PC can then be represented on a “virtual” 2D map, where each spot is represented as a circle centered in the corresponding x–y position: each spot can be described on a color scale, with the increasing color tone corresponding to an increasing positive or negative loading. This representation was proposed for the first time by Marengo et al. (22,23). An example is represented in Fig. 3, where positive and negative loadings of the first PC are represented, referring to the example of Fig. 2. The representation appears clearer than the loading plot of Fig. 2, allowing the immediate identification of the spots showing the most relevant loadings (darker grey tones) on the corresponding PC.

3.2. Cluster Analysis

Cluster analysis techniques are pattern recognition methods that help to identify the existence of groups of samples or of variables in a dataset, through the investigation of the relationships between the objects or variables. Cluster analysis tools are unsupervised methods, where the operator does not know the dataset partition and wants to identify potential groups of objects. From this point of view, they are different from classification methods, where the operator does know the separation of objects in classes and wants to obtain the best classification of objects in the corresponding class.
The most used clustering methods belong to the class of agglomerative hierarchical methods (24), where the objects are grouped (linked together) on the basis of a measure of their similarity. The most similar objects or groups of objects are linked first. The final result is a graph, called dendrogram; the objects are represented on the x axis and are connected at decreasing levels of similarity along the y axis. An example is reported in Fig. 4, referring to the dataset already presented in Figs. 2 and 3. The four groups of samples can be identified by applying a horizontal cut of the dendrogram, i.e., at a dissimilarity level of 25%, and identifying the number of vertical lines present. The clustering technique applied shows a first partition of the samples into two main groups that can be further separated into three groups at a dissimilarity level of 50%. The four groups present can be identified only by applying a further cut at a dissimilarity level of 25%. Samples B and AB, thus, appear the most similar groups.
Fig. 3. Positive and negative loadings of PC1 represented on a virtual 2D-map. Panels: Positive Loadings; Negative Loadings.
Fig. 4. Dendrogram (Ward method, Euclidean distances). x axis: samples, grouped as AB, B, A, and C; y axis: (Dleg / Dmax)*100.
The results of hierarchical clustering methods depend on the specific measure of similarity and on the linking method, and so different methods are usually adopted to have a general idea of the number of groups present. In general, the linking methods that provide the best results with regard to the clarity of groups identified are the Ward method and the Complete Linkage method. With regard to the measure of similarity, the Euclidean distances are usually adopted. Clustering techniques can be applied both to the original variables and to the results of PCA (scores of the significant PCs), thus achieving a cluster of samples eliminating the contribution of experimental error and exploiting only useful sources of variation.
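The agglomerative procedure can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Complete Linkage (one of the two linking methods named above, chosen here because it is simpler to write than the Ward method used for Fig. 4), Euclidean distances, and hypothetical score coordinates on two PCs arranged one group per quadrant; the function name `complete_linkage` is introduced for this sketch.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(samples, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the largest pairwise Euclidean
    distance between their members (complete linkage). Stopping at
    n_clusters plays the role of a horizontal cut of the dendrogram.
    Returns the clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(euclidean(samples[i], samples[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical (PC1, PC2) scores: A1, A2, B1, B2, AB1, AB2, C1, C2.
points = [(-12.0, -10.0), (-11.0, -12.0),   # A
          (10.0, 4.0), (11.0, 5.0),         # B
          (10.0, -4.0), (12.0, -5.0),       # AB
          (-10.0, 12.0), (-12.0, 11.0)]     # C
four_groups = complete_linkage(points, 4)
three_groups = complete_linkage(points, 3)
```

With these coordinates, stopping at four clusters recovers the four groups, while stopping at three merges B and AB, mirroring the behavior described for the dendrogram of Fig. 4.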
3.3. Classification Methods

Classification methods are particularly suitable for the analysis of proteomic spot volume datasets, since the primary need in this application is to assign samples belonging to different groups, e.g., control and diseased individuals, to their proper class. The final aim is both the development of diagnostic tools and the identification of differences existing
between the classes to shed light on the mechanism of action of a disease or of a new drug. Here, two of the most exploited classification methods will be briefly described: LDA and SIMCA.

3.3.1. Linear Discriminant Analysis

Linear Discriminant Analysis (25,26) belongs to the so-called Bayesian classification methods, since it exploits Bayes’s rule; it classifies the samples present in a dataset on the basis of its multivariate structure. In Bayesian classification methods, an object, $x$, is assigned to the class, $g$, for which the posterior probability $P(g \mid x)$ is maximum. Posterior probability is computed according to Bayes’s formula:

$$P(g \mid x) = \frac{P_g \, f_g(x)}{\sum_k P_k \, f_k(x)}$$

where $P_g$ is the prior probability of class $g$; $P_k$ is the prior probability of class $k$ (the sum runs over all the classes $k$); $f_g(x)$ is the probability density function of class $g$; and $f_k(x)$ is the probability density function of class $k$. One usual assumption is that each class is described by a multivariate Gaussian probability distribution, so that the numerator of Bayes’s formula for object $x_i$ becomes:

$$P_g \, f_g(x_i) = \frac{P_g}{(2\pi)^{p/2} \, |S_g|^{1/2}} \; e^{-\frac{1}{2} (x_i - c_g)^T S_g^{-1} (x_i - c_g)}$$

where $P_g$ is the prior probability of class $g$; $S_g$ is the covariance matrix of class $g$; $c_g$ is the centroid of class $g$; and $p$ is the number of descriptors. The quadratic form in the exponent:

$$(x_i - c_g)^T S_g^{-1} (x_i - c_g)$$

is the (squared) Mahalanobis distance between object $x_i$ and the centroid of class $g$, and it takes into consideration the class covariance structure since it
contains the covariance matrix. The covariance matrix accounts for the relationships existing among the variables for each class, i.e., the shape of the class. Taking the logarithm of the posterior probability, multiplying it by −2, and eliminating the terms common to all classes gives the so-called discriminant score; each object is classified in the class $g$ for which this score is minimum:

$$D_g(x_i) = (x_i - c_g)^T S_g^{-1} (x_i - c_g) + \ln |S_g| - 2 \ln P_g$$
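As a worked illustration of the discriminant score, the following sketch evaluates it for two variables and assigns an object to the class with the smallest score. Everything numerical here is hypothetical (centroids, covariance matrix, priors), the function names are introduced for this example, and the restriction to a 2 × 2 covariance matrix is an assumption made so that the inverse can be written explicitly; both classes share one covariance matrix, which is the simplification adopted in LDA.

```python
import math

def discriminant_score(x, centroid, cov, prior):
    """Discriminant score for two variables.

    cov is a 2x2 covariance matrix given as nested lists; the score is the
    squared Mahalanobis distance plus ln|S| minus 2 ln(prior).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - centroid[0], x[1] - centroid[1]]
    mahalanobis = sum(dx[i] * inv[i][j] * dx[j]
                      for i in range(2) for j in range(2))
    return mahalanobis + math.log(det) - 2.0 * math.log(prior)

def classify(x, classes):
    """Assign x to the class with the smallest discriminant score."""
    return min(classes, key=lambda g: discriminant_score(x, *classes[g]))

# Hypothetical class models: (centroid, common covariance matrix, prior).
common_cov = [[1.0, 0.3], [0.3, 1.0]]
classes = {
    "control": ((2.0, 2.0), common_cov, 0.5),
    "diseased": ((6.0, 5.0), common_cov, 0.5),
}
label_1 = classify((2.5, 1.8), classes)
label_2 = classify((5.5, 5.2), classes)
```

For an object lying exactly on a centroid, the Mahalanobis term vanishes and the score reduces to ln|S| − 2 ln P, which is a convenient check of the implementation.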
In LDA, the covariance matrix of each class is approximated with the pooled (across the classes) covariance matrix, thus considering all the classes as having a common shape, i.e., a weighted average of the shape of each class present in the dataset. The variables contained in the LDA model, which discriminate the classes present in the dataset, can be chosen by a stepwise algorithm, selecting the most discriminating variables iteratively. LDA can be performed either on the original variables or on PCs, the latter choice eliminating the contribution to variation given by experimental uncertainty.

3.3.2. Soft-Independent Model of Class Analogy

The SIMCA method (27) is based on the independent modeling of each class by means of PCA; in fact, each class is described by its relevant PCs. The samples of each class are then contained in the so-called SIMCA boxes, defined by the relevant PCs of each class. This represents one of the most important advantages of SIMCA; the classification of each sample is not affected by experimental uncertainty and spurious information, since each class is modeled only by its relevant PCs. Moreover, this method is also useful when small datasets are analyzed (more variables than objects), since it performs substantial dimensionality reduction. Thus, SIMCA classification starts with PCA calculated on each class independently, with the identification of the relevant PCs for each class. They define the so-called class model. If the data are autoscaled (mean centering followed by division by the standard deviation of each variable), each value $x_{ivg}$ (variable $v$ of object $i$ belonging to class $g$) is modeled as:

$$x_{ivg} = \sum_{a=1}^{A_g} t_{iag} \, l_{vag} + r_{ivg} \qquad g = 1, \ldots, G; \quad i = 1, \ldots, n_g; \quad v = 1, \ldots, P$$

($G$ = number of classes present; $A_g$ = number of significant PCs for class $g$; $n_g$ = number of samples in class $g$; $P$ = number of original variables)
where $t_{iag}$ = score of the i-th object of class $g$ on the a-th PC; $l_{vag}$ = loading of the v-th variable on the a-th PC of class $g$; and $r_{ivg}$ = residual of the i-th object of class $g$ for variable $v$. The values estimated by the model are then:

$$\hat{x}_{ivg} = \sum_{a=1}^{A_g} t_{iag} \, l_{vag}$$

while the residuals are defined as:

$$r_{ivg} = x_{ivg} - \hat{x}_{ivg}$$

The classification rule of object $i$ is based on Fisher’s F-test, so that object $i$ is classified in class $g$ if:

$$\frac{rsd_{ig}^2}{rsd_g^2} < F_{critic}\big(\alpha;\; \nu_1 = p - A_g,\; \nu_2 = (p - A_g)(n_g - A_g - 1)\big)$$

where $rsd_{ig}$ = residual standard deviation of object $i$ on class $g$; $rsd_g$ = residual standard deviation of class $g$; $F_{critic}$ = critical value of F defining the SIMCA box; $\alpha$ = significance level (usually set at 0.05, corresponding to a probability level of 95%); $\nu_1$, $\nu_2$ = degrees of freedom of the numerator and denominator of the F-test, respectively; and $p$ = number of variables (written $P$ in the class model). The residual standard deviation of each object $i$ (i.e., its distance from the model of class $g$) is then compared to the residual standard deviation of class $g$ (i.e., the typical distance of class $g$); if the ratio of their squares is smaller than the critical F value based on the degrees of freedom and on the significance level, object $i$ is classified in class $g$. SIMCA also provides some statistics useful for deeper analysis of the classification performed. The modeling power (MP) of each variable on each class model is a measure of the weight that each variable has on each class model, i.e., the ability of the variable to describe and characterize the corresponding class, defined as:

$$MP_{vc} = 1 - \frac{rsd_{vc}}{sd_{vc}}$$
where $sd_{vc}$ = standard deviation of variable $v$ on class $c$; $rsd_{vc}$ = residual standard deviation of variable $v$ of the objects of class $c$ from the model of their own class. The MP ranges from 0 (variable irrelevant to the definition of the class model) to 1 (variable perfectly described by the class model). A typical representation of MP is given in Fig. 5, where the variables are represented on the x axis, and MP is represented as a bar diagram on the y axis. Figure 5 represents the MPs of class C in the example of Figs. 2–4. The discrimination power (DP) is a measure of the ability of each variable to discriminate between two classes ($c$ and $g$) at a time. The greater the DP, the more a variable weighs on the classification of an object in class $c$ or $g$. It is defined as:

$$DP_{vc} = \frac{rsd_{vcg}^2 + rsd_{vgc}^2}{rsd_{vc}^2 + rsd_{vg}^2}$$
Fig. 5. Modeling power of a class of six control samples. x axis: variables; y axis: modeling power (0.0 to 1.0).
where $rsd_{vcg}^2$ = squared residual standard deviation of variable $v$ of the objects of class $c$ from the model of class $g$; $rsd_{vgc}^2$ = squared residual standard deviation of variable $v$ of the objects of class $g$ from the model of class $c$; $rsd_{vc}^2$ = squared residual standard deviation of variable $v$ of the objects of class $c$ from the model of their own class; $rsd_{vg}^2$ = squared residual standard deviation of variable $v$ of the objects of class $g$ from the model of their own class. The DP is non-negative but has no upper bound. A representation of DP is shown in Fig. 6; the variables are represented on the x axis, and the DPs as a bar diagram on the y axis. Figure 6 represents the DPs of classes A and B for the example of Figs. 2–5. In general, when the dataset consists of two classes, a single set of DPs is obtained, corresponding to the discrimination between the two classes present. On the other hand, where more than two classes are present, it is possible to obtain a set of DPs for each pair of classes compared. Modeling powers and DPs can be represented on a color scale on “virtual” 2D maps, as seen for the loading plots, for clearer representation. An example is given in Fig. 7, where the MPs and DPs represented as bar diagrams in Figs. 5 and 6 are represented on virtual 2D maps.

Fig. 6. Discriminating power of two classes: treated with drug A (six samples) and with drug B (six samples).
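The modeling power and discrimination power defined above can be computed directly once the residual matrices of the class models are available. The sketch below takes those residuals as given (all numbers are hypothetical) and makes one explicit assumption not stated in the text: standard deviations and residual standard deviations are computed as root mean squares with divisor n, whereas software packages usually apply degrees-of-freedom corrections. The function names are introduced for this example.

```python
import math

def rsd(residuals, v):
    """Residual standard deviation of variable v (assumed divisor: n)."""
    return math.sqrt(sum(r[v] ** 2 for r in residuals) / len(residuals))

def modeling_power(data_c, resid_cc, v):
    """MP = 1 - rsd / sd for variable v of class c."""
    n = len(data_c)
    mean = sum(row[v] for row in data_c) / n
    sd = math.sqrt(sum((row[v] - mean) ** 2 for row in data_c) / n)
    return 1.0 - rsd(resid_cc, v) / sd

def discrimination_power(resid_cg, resid_gc, resid_cc, resid_gg, v):
    """DP of variable v for the pair of classes c and g."""
    num = rsd(resid_cg, v) ** 2 + rsd(resid_gc, v) ** 2
    den = rsd(resid_cc, v) ** 2 + rsd(resid_gg, v) ** 2
    return num / den

# Hypothetical class c data (3 objects x 2 variables) and residual matrices.
data_c = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.1]]
resid_cc = [[0.1, 0.12], [-0.1, -0.15], [0.0, 0.03]]   # class c on its own model
resid_gg = [[0.1, 0.1], [-0.1, -0.1], [0.0, 0.1]]      # class g on its own model
resid_cg = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.1]]       # class c on model of g
resid_gc = [[-2.0, 0.1], [-1.9, 0.0], [-2.1, -0.1]]    # class g on model of c

mp = [modeling_power(data_c, resid_cc, v) for v in range(2)]
dp = [discrimination_power(resid_cg, resid_gc, resid_cc, resid_gg, v)
      for v in range(2)]
```

In this toy case the first variable is well described by its class model and fits the other class model poorly, so both its MP and its DP are high; the second variable is close to noise in both respects.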
Fig. 7. MPs and DPs of Figs 5 and 6 represented on virtual 2D-maps. Panels: Modeling Power of class C; Discrimination Power classes A–B.
3.4. Partial Least Squares (PLS) Regression and Discriminant Analysis–Partial Least Squares (DA-PLS) Regression

Partial least squares is a regression method that uses the information contained in the X data matrix to predict the behavior of the Y data matrix. The PLS method models both X and Y variables simultaneously to find the latent variables in X that will predict the latent variables in Y. These PLS components (latent variables) are similar to the PCs. If there are several responses, they are modeled together in a multivariate way (28,29,30). PLS can be used for discriminant analysis (DA-PLS) by creating a response variable for each category: in the case of proteomic data, one response variable for each group of samples. Each response variable is assigned a value of 1 for the samples belonging to the corresponding class, and a value of 0 for the samples belonging to different classes.

3.5. Applications

3.5.1. Pattern Recognition Methods

Many applications of multivariate tools to the analysis of spot volume datasets are reported in the literature. PCA can be considered quite a classical approach, with its first application to spot volume data dating back to the mid-1980s, as reported by Anderson (31) in the USA and Tarroux (32) in France. Anderson (31) reports an application of PCA coupled to cluster analysis to identify the differences among a panel of human cell lines; all the groups were successfully separated considering only the subset of proteins present in all the cell lines simultaneously. Tarroux et al. (32) applied PCA in the HERMeS software package, again coupled to cluster analysis. More recently, both PCA and cluster analysis have been applied to the study of DNA and RNA fragments of several biological systems by the groups of Couto (33), Johansson (34), and Boon (35) and to the immunological diagnosis of hydatidosis (36,37). Other applications are from the group of Kovarova (38,39) and De Moor et al. (40), who applied multivariate tools to microarray data.
Iwadate et al. (41) applied discriminant analysis to the classification of human gliomas; the proteomic patterns of 85 tissue samples were compared (52 glioblastoma multiforme, 13 anaplastic astrocytomas, 10 astrocytomas, 10 normal brain tissues). Normal brain tissues could be correctly distinguished from glioma tissues by cluster analysis, which proved to be significantly correlated with the patient survival. Discriminant analysis extracted a set of 37 proteins differentially expressed based on histological grading. Principal Component Analysis has also been applied to toxicological studies by the groups of Amin (42), Heijne (43), and Anderson (44). The first paper (42) reports a study of the effect of three nephrotoxicants (cisplatin, gentamicin, and puromycin) on the gene expression profile of rats, as a function of time after initial administration. PCA and gene expression-based clustering of compound effects confirmed sample separation based on dose, time, and degree of renal toxicity. Heijne (43) studied the acute hepatotoxicity induced in rats by bromobenzene administration; the physiological symptoms recorded coincided with many changes of hepatic mRNA and protein content. PCA proved to be effective in the discrimination between control and treated samples for both protein and gene expression profiles; some of the proteins that significantly changed upon bromobenzene treatment were identified by mass spectrometry. Anderson (44) investigated the effects of five peroxisome proliferators on the protein profile in the livers of treated mice at 5- and 35-day time points. Data for the selected set of 107 liver protein spots, which respond strongly to at least one of the test compounds, were subjected to PCA to search for global protein pattern changes. PC1 was identified as a global measure of peroxisome proliferation by its correlation with enzymatic peroxisomal β-oxidation, while PC2 separated the samples on the basis of exposure time. Perrot et al. (45) applied PCA to the comparison of protein expression of gel-entrapped Escherichia coli cells submitted to a cold shock at 4 °C with those of exponential- and stationary-phase free-floating cells. Ten different incubation conditions were considered; each experiment was replicated three times and each gel was run in duplicate. PCA was carried out on the 203 spots identified as significantly reproducible than those corresponding to synthesis at 37 °C, using the average spot intensities for each experimental condition adopted. In order to remove the variability of staining conditions among the gels, each spot volume was normalized by the sum of volumes of all the spots detected on each map. The data were autoscaled before PCA.
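The two preprocessing steps just mentioned, normalization of each spot volume by the total volume of its map followed by autoscaling, can be sketched as follows. The spot volumes are hypothetical, the function names are introduced for this example, and the use of the sample standard deviation (divisor n − 1) in autoscaling is an assumption, since the text does not specify it.

```python
import math

def normalize_by_total_volume(maps):
    """Divide each spot volume by the summed volume of all spots on its map."""
    out = []
    for row in maps:
        total = sum(row)
        out.append([v / total for v in row])
    return out

def autoscale(X):
    """Mean-center each column and divide it by its standard deviation."""
    n, p = len(X), len(X[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [row[j] for row in X]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        for i in range(n):
            Z[i][j] = (col[i] - mean) / sd
    return Z

# Hypothetical spot volumes (rows = maps, columns = spots). The second map is
# the first one stained twice as intensely.
maps = [[100.0, 300.0, 600.0],
        [200.0, 600.0, 1200.0],
        [150.0, 250.0, 500.0]]
relative = normalize_by_total_volume(maps)
Z = autoscale(relative)
```

After normalization the first two maps become identical, which is the intended effect of removing staining variability; autoscaling then gives every spot zero mean and unit variance, so that intense and faint spots weigh equally in the subsequent PCA.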
From score analysis, it was possible to point out that the protein response of immobilized cells after the cold shock was significantly different from those of exponential- and stationary-phase free-floating organisms. The reasons for these differences could be sought in the loadings analysis, from which the identification of nine families of proteins could also be confirmed. Principal Component Analysis was applied to identify the differences in macrophage maturation in the U937 human lymphoma cell line by Verhoeckx et al. (46). PCA proved to be effective in the identification of variations between samples belonging to different macrophage maturation times, where standard t-tests identified a smaller number of biomarkers. Another application (47) consisted of the characterization of anti-inflammatory compounds. Other applications from Marengo (22,23,48) exploit PCA coupled to both cluster analysis and SIMCA classification for the identification of differences between groups of maps. The first application (48) refers to a spot quantity dataset comprising 435 spots detected in 18 samples belonging to two different
cell lines of control (untreated) and drug-treated pancreatic ductal carcinoma cells. The study was conceived to identify the role played by the drugs on the different cell lines. PCA allowed clear discrimination of the four groups of samples with the use of three PCs, and the analysis of the loadings provided reasons for the differences among groups of samples. The results were further confirmed by cluster analysis. Some of the most relevant spots were also identified by mass spectrometry. The other two applications (22,23) concern the use of PCA and SIMCA for the classification of proteomic maps. The first paper (22) shows an application to the adrenal glands of healthy and diseased mice. PCA was able to discriminate the two classes of samples by means of the first PC, the loadings of which allowed the identification of the spots responsible for the differences. SIMCA was then applied to classify the samples in the two classes, and it correctly classified all the samples with one PC in the SIMCA model of each class. SIMCA allowed the identification of the most discriminating spots by the analysis of DPs. The comparison between the maps showed up- and down-regulation of 84 polypeptide chains out of a total of 700 spots detected. An analogous approach was followed for the comparison of the phenotypic expression of the mantle cell lymphoma GRANTA-519 and MAVER-1 cell lines (23). Marengo proposed an alternative method to show loadings from PCA, and modeling and discriminating powers calculated by SIMCA. To obtain a clearer representation of the results, the spots showing relevant discriminating and/or modeling power (and loadings as well) are represented on a virtual 2D-PAGE map. Each discriminating spot is represented as a circle on a virtual 2D map; the position of each spot is determined by its x–y coordinates identified by standard software packages (PDQuest in this case).
The spots are represented on a color scale: darker red tones identify spots showing a larger discriminating or modeling power. The use of such representations in common software packages could be a valid alternative to the standard visualization of loadings for each variable in the space given by two PCs at a time. Fujii et al. (49) studied the histological subtypes of lymphoid neoplasms: 42 cell lines from human lymphoid neoplasms were included. The discriminating spots were selected by means of different methods used in sequence: (1) Wilcoxon or Kruskal–Wallis tests to find spots whose intensity was significantly (p < 0.05) different among the cell line groups, (2) statistical learning methods to prioritize the spots according to their contribution to the classification, and (3) unsupervised classification methods to validate the robustness of the classification given by the selected spots. Thirty-one spots proved to be significant, 24 of which were identified by mass spectrometry.
Other applications are in the field of food quality (coupled to cluster analysis and discriminant analysis): several examples are reported in the literature on cheese classification (50) and on the identification of the protein content in wheat and bread (51,52).

3.5.2. Discriminant Analysis–Partial Least Squares

With regard to the application of DA-PLS methods, many papers have appeared in the last few years. Jessen et al. (53) demonstrated with two examples how information can be extracted from 2DE data by discriminant PLSR with variable selection. The time course of post mortem proteome changes in the muscle tissues of pigs was investigated. A first discriminant PLSR was performed on the spot volume dataset derived from the usual analysis via dedicated software (Bioimage 2D Analyser, Genomic Solutions, USA), the independent response being a binary indicator of the individual pig considered or of the sampling time (increasing post mortem time). PLS proved to be successful in the identification of spots characterized by systematic variation. In order to identify only those spots showing actual relevant variation among the groups identified, a variable selection procedure was applied, and non-relevant spots were iteratively eliminated from the model: the final model chosen contained the minimum number of spots giving the best correlation with the response. For variable selection, a jack-knifing procedure was selected. Kleno et al. (54) applied PCA and PLS to the identification of the mechanism of action of hydrazine toxicity in rat liver samples. PCA was carried out on a data matrix of dimensions 30 × 431 (30 being the 2D maps: 5 animals × 3 doses of hydrazine × 2 times after administration; 431 being the spots revealed on the maps). PC1 was able to separate the samples according to three different dose levels, while PC4 allowed the separation of the two times after the administration, but only for the largest dose level.
The analysis of the loadings did not allow a clear identification of the most relevant discriminating spots, so a PLSR was applied to model the Y variable (dose level of hydrazine), with variable selection by jack-knifing. The PLS regression allowed the identification of spots that play an important role in the differentiation of samples according to the dose level administered. The results were compared with standard univariate t-tests, showing that some spots identified by PLS could not be identified as relevant by standard t-tests; this is because PLS takes into account the correlation structure of the dataset. Kjaersgard et al. (55) studied the change in the proteomic profile of cod muscle samples under different storage conditions. Eleven storage conditions were taken into account, deriving from a large factorial design including storage temperature (two levels), storage period (four levels), and chill storage period
(5 levels). Each sample was replicated twice, and the replicates were run in different batches. PCA grouped the samples on the basis of frozen storage time, but no information emerged about the differences between samples with respect to the other two parameters. The study was refined through the application of DA-PLS with variable selection by a jack-knife procedure, which allowed the identification of the spots relevant to the differentiation of samples according to storage time. The authors also paid attention to the optimal normalization of the data before multivariate analysis. Autoscaling is in fact the most widely used method for data normalization in proteomics, but it carries the risk of amplifying noise; this is particularly true in proteomics, where experimental uncertainty is large. To avoid this problem, the data were mean centered and then normalized by dividing each mean-centered value by (SD + B) (SD = standard deviation of each variable, B = constant term to be optimized). The authors identified the appropriate range for B (2500 in their case) by plotting the mean volume of each variable (spot) against its standard deviation in a scatter diagram; the best value was then selected among several candidate values of B as the one giving the best agreement between the univariate and multivariate approaches. Gottfries et al. (56) applied both PCA and DA-PLS to the study of two different datasets: the first dataset consists of samples of cerebrospinal fluid from control individuals and individuals affected by different pathologies (12 control, 15 with Alzheimer’s disease, 15 with frontotemporal dementia, and 10 with Parkinson’s disease), giving a final dataset of dimension 52 × 96 (96 spots identified on 52 maps).
The second dataset consists of liver samples from normal and obese mice (samples were grouped into six groups comprising four to eight animals each); the final dataset has dimension 30 × 603 (30 being the samples, and 603 the spots identified). In both cases, the groups of samples present in each dataset could be separated by means of the first three PCs after the application of PCA. DA-PLS was then applied to each dataset in order to identify the spots responsible for the differences between each pair of groups: in all cases the first latent variable computed was able to classify the samples correctly. In another application, Karp et al. (57) demonstrated the effectiveness of PLS-DA in identifying the differences in three proteomic datasets; these included a dataset in which no difference was expected between the two groups of samples considered: in this case, as expected, PLS-DA provided no model. Finally, Norden et al. (58) applied PCA and DA-PLS to the identification of the differences between urine samples of smoking and non-smoking individuals. The great number of applications of PCA, PLS, and other multivariate tools in proteomics (31–59) gives a clear idea of the importance of multivariate
methods in this field; such techniques are in fact able to identify a larger number of variables (spots) relevant for discrimination between the classes of samples than the classical t-tests usually carried out by standard software packages.
4. Image Analysis
The second approach to 2D-PAGE analysis focuses on the direct analysis of 2D map images. This approach could offer a fundamental advantage for proteomic data analysis: the elimination of the contribution of the operator, which is usually relevant when dedicated software packages for proteomic map analysis are used. Several methods for direct 2D map image analysis are reported in the literature, but they are not yet widespread enough to be included in common software packages; these methods mainly exploit artificial neural networks, fuzzy logic principles, and the calculation of mathematical moments. Such procedures represent the frontier in bioinformatics, and some of them are still under development. The main principles of these methods are presented here, together with a review of the most interesting applications in the literature.
4.1. Fuzzy Logic
The low reproducibility of 2D gel-electrophoresis, pointed out earlier in this chapter, produces significant differences even among maps corresponding to replicates of the same electrophoretic run; these differences consist of changes in spot position, size, and shape. The precise description of the position of each spot in terms of x–y coordinates thus appears very difficult to accomplish. The uncertainty on the position and shape of each spot can be effectively treated by fuzzy logic principles. Marengo et al. (60,61,62,63,64) successfully applied fuzzy logic principles coupled to multivariate statistical tools to the analysis of sets of 2D maps. Their four-step procedure consists of:
1. image digitalization;
2. image defuzzyfication;
3. image refuzzyfication;
4. application of multivariate tools to fuzzy maps.
4.1.1. Image Digitalization The first step consists of scanning each map by a densitometer to provide a description of the map as a grid of a given step containing in each cell the OD
ranging from 0 to 1. The contribution of the background to the signal of each map is eliminated by applying a cut-off value to each map (generally 0.3/0.4): the values below the cut-off value are transformed into null values. The cut-off value has to be optimized independently for each case study.
4.1.2. Image Defuzzyfication
The second step performs the defuzzyfication of each map, which eliminates the sensitivity due to the destaining protocol. The digitalized image is, in fact, turned into a grid of binary values: 0 is assigned to the cells where no signal is detected, 1 to the cells where a value above the cut-off threshold is present.
4.1.3. Refuzzyfication
The previous step also eliminates the information about spatial uncertainty, since each spot is no longer described by grey-scale values but only by binary values (presence/absence). This step therefore reintroduces the information about spatial uncertainty. Each cell containing a value of 1 in step 2 is substituted by a 2D probability function. The most suitable distribution is a 2D Gaussian function. The probability of finding a signal in cell $(x_i, y_j)$ when a signal is already present in cell $(x_k, y_l)$ is given by:
$$
f(x_i, y_j, x_k, y_l) = \frac{1}{2\pi\,\sigma_x \sigma_y \sqrt{1-\rho^2}}\,
\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_i-x_k)^2}{\sigma_x^2}+\frac{(y_j-y_l)^2}{\sigma_y^2}\right]\right\}
$$
where $\rho$ is the correlation between the first and second dimensions; $(x_k, y_l)$ is the position of the occupied cell influencing the cell in position $(x_i, y_j)$; $\sigma_y$ is the standard deviation along the first dimension; and $\sigma_x$ is the standard deviation along the second dimension. The correlation between the two dimensions ($\rho$) is usually fixed at 0, corresponding to the complete independence of the two electrophoretic runs; the two standard deviations, $\sigma_x$ and $\sigma_y$, correspond to the standard deviations of the 2D Gaussian function along the x and y axes. Keeping them identical corresponds to an identical repeatability of the result with respect to the two electrophoretic runs (according to the pH gradient and molecular mass): in this case, the parameter analyzed for its effect on the final result is $\sigma = \sigma_x = \sigma_y$. Alternatively, the two parameters can be fixed at different values,
usually $\sigma_x = 1.5\,\sigma_y$, corresponding to an uncertainty along the second dimension that is about 50% larger than that along the first dimension. The separation according to molecular mass is in fact expected to show a larger uncertainty (in-house polymerization of the gel for the second run versus a first-dimension run on commercial strips). A change in the parameter $\sigma$ (or in the parameters $\sigma_x$ and $\sigma_y$) modifies the distance over which an occupied cell exerts its effect: large values produce a perturbation operating at larger distances, whereas smaller values correspond to a perturbation operating at smaller distances, with spots exerting a smaller effect on their neighbourhood and a crisper final image. Therefore, the larger the parameter, the higher the level of fuzzyfication applied to the maps. In general, the best results are expected for intermediate values of the parameters, corresponding to maps that are not excessively fuzzyfied (and final images that are not excessively blurred). With respect to the choice of probability function, the Gaussian distribution appeared to be the best alternative, since spots can be described as intensity/probability distributions with the highest intensity/probability value at the center of the spot and decreasing intensities/probabilities as the distance from the center increases. In addition, the integral of the Gaussian function over the whole domain of the 2D-PAGE is 1, so the total signal is blurred but, at the same time, kept quantitatively coherent. The value of the signal $S_k$ in each cell $(x_i, y_j)$ of the fuzzy map is calculated as the sum of the effects of all neighbor cells $(x_{i'}, y_{j'})$ containing spots:
$$
S_k = \sum_{i',j'=1}^{n} f(x_i, y_j, x_{i'}, y_{j'})
$$
Even if the sum runs over all the cells in the grid, only the neighbor cells are influenced by the presence of a signal, to an extent that depends on the $\sigma$ parameter. The procedure turns each digitalized image into a virtual map containing, in each cell, the sum of the influence of all the spots of the original 2D-PAGE; these virtual maps can be called fuzzy matrices or fuzzy maps. Owing to the existence of complex spots of irregular shape in real maps, the Gaussian function is associated with each cell instead of with each spot. Figure 8 shows an example of the fuzzyfication of a map at different $\sigma$ values; the example shows the digitalized and defuzzyfied maps and the fuzzyfication of the map for five increasing $\sigma$ values.
4.1.4. Application of Multivariate Tools to Fuzzy Maps
The final fuzzy maps can then be analyzed by several multivariate tools for diagnostic/prognostic purposes. Two approaches will be presented here: (1) the coupling of PCA and classification tools; (2) the use of multi-dimensional scaling (MDS) techniques.
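The cut-off, defuzzyfication, and refuzzyfication steps described above can be illustrated with a minimal NumPy sketch. It assumes that the correlation between the two dimensions is fixed at 0, so that the 2D Gaussian factorizes into two one-dimensional kernels. The function names, the default cut-off, and the mapping of array rows and columns onto the two standard deviations are illustrative assumptions, not the implementation used in (60–64).

```python
import numpy as np


def defuzzify(od_map, cutoff=0.3):
    """Binary map: 1 where the optical density exceeds the cut-off, 0 elsewhere."""
    return (np.asarray(od_map, dtype=float) > cutoff).astype(float)


def gaussian_kernel_matrix(n, sigma):
    """n x n matrix whose entry (i, k) is a normalized 1D Gaussian of (i - k)."""
    idx = np.arange(n)
    d = idx[:, None] - idx[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)


def refuzzify(binary_map, sigma_rows, sigma_cols):
    """Fuzzy map: each occupied cell is replaced by a 2D Gaussian (correlation 0).

    Assumption: rows and columns of the array play the role of the two
    electrophoretic dimensions; which one is x or y is a labeling choice.
    """
    n_rows, n_cols = binary_map.shape
    g_rows = gaussian_kernel_matrix(n_rows, sigma_rows)
    g_cols = gaussian_kernel_matrix(n_cols, sigma_cols)
    # S[i, j] = sum over occupied cells (k, l) of g_rows[i, k] * g_cols[j, l]
    return g_rows @ binary_map @ g_cols.T
```

Because the product of the two 1D normalizing factors equals the 2D factor for zero correlation, a single occupied cell far from the borders yields a fuzzy map whose cells sum to approximately 1, in line with the quantitative coherence mentioned in the text.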
Fig. 8. Sample ILL1 from (61): digitalized image (A); defuzzyfied image (B); fuzzyfication at five $\sigma$ values (C–G): $\sigma$ = 0.50, 1.00, 1.50, 2.00, and 2.50.
4.1.4.1. PCA and Classification Methods (61)
Marengo et al. (61) reported an application of PCA and LDA to the fuzzy maps of a set of eight 2D maps belonging to control and mantle cell lymphoma samples. Principal Component Analysis can be applied to images after unwrapping each image; each sample (map) is turned into a series of variables describing the signal in each position of the map. In this case, 200 × 200 pixel images were considered, providing a final set of 40,000 variables for each map. PCA is particularly useful here to detect a small number of components accounting for the differences between the groups of samples while, at the same time, reducing dimensionality. The significant PCs were used to build an LDA model to classify the samples; the variables for the LDA model, which discriminates between the classes present in the dataset, were selected by a forward stepwise algorithm ($F_{\text{to-enter}}$ = 4.0). The procedure was repeated for different values of the $\sigma$ parameter in order to find the value providing correct classification of the samples with the smallest number of components in the final LDA model. The best results (100% correct assignments) were obtained for $\sigma$ values ranging from 1.75 to 2.25, with PC1 and PC4 in the final LDA model. The differences between the two groups of samples could then be investigated by analyzing the loadings on the first and fourth PCs. Figure 9 shows the score plot and the loading plots of PC1 and PC4 for $\sigma$ = 2.00. The loadings are again represented on a virtual map on a color scale: white tones correspond to the zones of the map characterized by large positive loadings and black tones to the zones characterized by large negative loadings on the corresponding PC.
4.1.4.2. Multi-Dimensional Scaling
In other applications of multivariate tools to fuzzy maps, Marengo et al. (62,63) describe the use of MDS procedures. MDS performs a substantial dimensionality reduction and provides an effective graphical representation of the data on the basis of the similarity calculated between pairs of objects. MDS searches for the smallest number of dimensions in which the objects can be represented as points, matching, as closely as possible, the distances between the objects in the new reference system with those calculated in the original reference system. In these applications, the calculations were performed by the Kruskal iterative method; the search for the coordinates was based on the steepest descent minimization algorithm, where the target function is the so-called stress (S), which is a measure of the ability of the configuration of points to reproduce the original distance matrix.
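The idea of searching for point coordinates by steepest descent on a stress function can be sketched as follows. This is a simplified metric variant written for illustration: it minimizes the raw squared differences between configuration distances and target dissimilarities and omits the monotone regression step of Kruskal's non-metric method. The normalized form of the stress, the step size, the number of iterations, and all function names are assumptions, not details taken from (62,63).

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(points, target):
    """Normalized stress (assumed form): sqrt(sum (d - delta)^2 / sum delta^2)."""
    d = pairwise_distances(points)
    upper = np.triu_indices_from(target, k=1)  # count each pair once
    num = ((d[upper] - target[upper]) ** 2).sum()
    return np.sqrt(num / (target[upper] ** 2).sum())


def mds_steepest_descent(target, n_dims=2, n_iter=2000, step=0.01, seed=0):
    """Fixed-step steepest descent on the raw stress sum_{i<j} (d_ij - delta_ij)^2."""
    rng = np.random.default_rng(seed)
    n = target.shape[0]
    points = rng.normal(scale=0.1, size=(n, n_dims))  # random starting configuration
    history = np.empty(n_iter)
    for it in range(n_iter):
        diff = points[:, None, :] - points[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)       # avoid 0/0; diagonal terms are zeroed below
        coeff = (d - target) / d
        np.fill_diagonal(coeff, 0.0)
        grad = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        points = points - step * grad
        history[it] = stress(points, target)
    return points, history
```

In practice the dissimilarity matrix passed as `target` would be derived from the data (for instance from a similarity index), and the run would be repeated for an increasing number of dimensions until the stress is acceptably low.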
Fig. 9. Score plot (A) and loading plots (B) of PC1 and PC4 with $\sigma$ = 2.00. The score plot (PC1 versus PC4) shows samples HEA1–HEA4 and ILL1–ILL4; the two loading maps are titled Loadings PC1 and Loadings PC4.
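The unwrapping and PCA step behind score and loading plots such as those of Fig. 9 can be sketched with a singular value decomposition. This is a minimal illustration: the function name, the mean centering without further scaling, and the return values are assumptions, and the subsequent stepwise LDA is not included.

```python
import numpy as np


def pca_on_maps(maps, n_components=2):
    """PCA of a stack of images after unwrapping each image into one row.

    maps: array of shape (n_samples, height, width), e.g. 8 x 200 x 200,
    which gives 40,000 variables per sample after unwrapping.
    """
    maps = np.asarray(maps, dtype=float)
    n, h, w = maps.shape
    X = maps.reshape(n, h * w)           # unwrapping: one row per map
    Xc = X - X.mean(axis=0)              # mean centering (assumed pre-treatment)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    # Each loading vector is wrapped back into an image for display on a virtual map
    loadings = Vt[:n_components].reshape(n_components, h, w)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained
```

The scores can then be passed to a classifier, while each wrapped loading image can be shown on a color scale to locate the regions of the map that drive the separation.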
As for the previous applications based on PCA and LDA, several values of the $\sigma$ parameter were investigated, and the one providing the best classification was selected. In this case, a similarity matrix has to be built for each value of the parameter. From the match between two fuzzy maps k and l, the common signal $SC_{kl}$ (the sum of all signals present in both maps) and the total signal $ST_{kl}$ can be computed:
$$
SC_{kl} = \sum_{i=1}^{n} \min\left(S_{ik}, S_{il}\right)
$$
$$
ST_{kl} = \sum_{i=1}^{n} \max\left(S_{ik}, S_{il}\right)
$$
where n is the number of cells in the grid and $S_{ik}$ is the signal in cell i of map k. The similarity index is then computed by:
$$
S_{kl} = \frac{SC_{kl}}{ST_{kl}}
$$
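The similarity index defined by the common and total signals translates directly into a few lines of NumPy. The function names are illustrative, and raising an error for two empty maps is an assumption, since the ratio is undefined in that case.

```python
import numpy as np


def similarity_index(map_k, map_l):
    """Ratio of common signal (sum of cell-wise minima) to total signal (sum of maxima)."""
    a = np.asarray(map_k, dtype=float)
    b = np.asarray(map_l, dtype=float)
    common = np.minimum(a, b).sum()
    total = np.maximum(a, b).sum()
    if total == 0:
        raise ValueError("both maps are empty; the similarity index is undefined")
    return common / total


def similarity_matrix(maps):
    """Symmetric matrix of similarity indices for a list of non-empty fuzzy maps."""
    n = len(maps)
    sim = np.eye(n)  # a non-empty map is identical to itself
    for k in range(n):
        for l in range(k + 1, n):
            sim[k, l] = sim[l, k] = similarity_index(maps[k], maps[l])
    return sim
```

A dissimilarity suitable for MDS could then be taken, for example, as one minus the similarity; that conversion is an assumption and is not specified in the text.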
$S_{kl}$ ranges from 0 (two maps showing no common structure) to 1 (two identical maps). In both applications, the optimal $\sigma$ values that provide the best classification of the samples with only one or two dimensions could be identified.
4.2. Moment Functions
Moment functions have been widely used in image analysis, in applications related to invariant pattern recognition, object classification, pose estimation, image coding, and reconstruction (65,66,67,68,69). A set of moments computed from a digital image generally represents global characteristics of the image shape and provides a great deal of information about different types of geometrical features of the image. Geometric moments were the first to be applied to images, as they are computationally very simple. With the progress of research in image processing, many new types of moment functions have been introduced, such as orthogonal moments, rotational moments, and complex moments; these are useful tools in pattern recognition and can be used to describe features of objects such as shape, area, border, location, and orientation. Naturally, each moment function has its own advantages in specific applications. The most important and most widely used moments are orthogonal moments (e.g., Legendre (70,71,72) and Zernike moments (73,74,75)), which can attain a zero value of the redundancy measure in a set of moment functions, so that these orthogonal moments correspond to independent characteristics of the image. In other words, moments with orthogonal basis functions can be used to represent the image by a set of mutually independent descriptors, with a minimum amount of information redundancy. Moreover, orthogonal moments have the additional property of being more robust than non-orthogonal ones in the presence of image noise. Orthogonal moments also permit the analytical reconstruction of an image intensity function from a finite set of moments, using the inverse moment transform.
Legendre moments are the most used orthogonal moments and can be implemented as feature descriptors for the classification of 2D-PAGE maps. The main advantages of using Legendre moments to cluster the maps derive from the possibility of obtaining invariance to translation, scale, and rotation; in other words, the original maps can be used for classification without any pre-treatment, and the use of complex commercial software can be avoided entirely.
The number of calculated moments is very large, and many of them do not contain information related to the specific target of correctly classifying the 2D-PAGE maps; for this reason a method for selecting the moments having the highest DP must be applied (e.g., LDA).
4.2.1. Legendre Moments
The Legendre polynomials form a complete orthogonal set on the interval [–1, 1]. Moments with Legendre polynomials as kernel functions were first introduced by Teague (68). The kernels of Legendre moments are products of Legendre polynomials defined along the rectangular image coordinate axes inside the square [–1, 1] × [–1, 1]. The two-dimensional Legendre moments of order (p + q) of an image intensity map f(x, y) are defined as:
$$
L_{pq} = \frac{(2p+1)(2q+1)}{4}\int_{-1}^{1}\int_{-1}^{1} P_p(x)\,P_q(y)\,f(x,y)\,dx\,dy, \qquad x, y \in [-1, 1]
$$
where the Legendre polynomial $P_p(x)$ of order p is given by:
$$
P_p(x) = \sum_{\substack{k=0 \\ p-k\ \text{even}}}^{p} (-1)^{\frac{p-k}{2}}\,\frac{1}{2^p}\,\frac{(p+k)!\,x^k}{\left(\frac{p-k}{2}\right)!\,\left(\frac{p+k}{2}\right)!\,k!}
$$
The recurrence relation of the Legendre polynomials is:
$$
P_p(x) = \frac{(2p-1)\,x\,P_{p-1}(x) - (p-1)\,P_{p-2}(x)}{p}
$$
where $P_0(x) = 1$, $P_1(x) = x$, and $p > 1$. Since the region of definition of Legendre polynomials is the interval [–1, 1], a square image of N × N pixels with intensity function f(i, j), 0 ≤ i, j ≤ (N – 1), is scaled to the region –1 ≤ x, y ≤ 1. Legendre moments can be expressed in discrete form as:
$$
L_{pq} = \lambda_{pq}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} P_p(x_i)\,P_q(y_j)\,f(i,j)
$$
where the normalizing constant is
$$
\lambda_{pq} = \frac{(2p+1)(2q+1)}{N^2}
$$
and $x_i$ and $y_j$ denote the normalized pixel coordinates in the range [–1, 1]:
$$
x_i = \frac{2i}{N-1} - 1 \quad \text{and} \quad y_j = \frac{2j}{N-1} - 1
$$
The reconstruction of the image function from the calculated moments can be performed by the following inverse transformation:
$$
f(i,j) = \sum_{p=0}^{p_{\max}}\sum_{q=0}^{q_{\max}} L_{pq}\,P_p(x_i)\,P_q(y_j)
$$
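The discrete Legendre moments and the inverse transformation can be sketched in NumPy using the recurrence relation for the polynomials. The function names are illustrative assumptions, and the sketch follows the discrete formulas given above rather than any specific implementation from (76); the discrete normalization makes the reconstruction approximate.

```python
import numpy as np


def legendre_basis(n_pixels, max_order):
    """Row p holds P_p evaluated at x_i = 2i/(N-1) - 1, built by the recurrence."""
    x = 2.0 * np.arange(n_pixels) / (n_pixels - 1) - 1.0
    P = np.empty((max_order + 1, n_pixels))
    P[0] = 1.0
    if max_order >= 1:
        P[1] = x
    for p in range(2, max_order + 1):
        P[p] = ((2 * p - 1) * x * P[p - 1] - (p - 1) * P[p - 2]) / p
    return P


def legendre_moments(image, max_order):
    """Matrix of discrete Legendre moments L[p, q] for p, q = 0..max_order."""
    f = np.asarray(image, dtype=float)
    n = f.shape[0]
    if f.shape != (n, n):
        raise ValueError("a square N x N image is expected")
    P = legendre_basis(n, max_order)
    odd = 2 * np.arange(max_order + 1) + 1
    lam = np.outer(odd, odd) / n ** 2      # normalizing constants (2p+1)(2q+1)/N^2
    return lam * (P @ f @ P.T)             # double sum over pixels as a matrix product


def reconstruct(moments, n_pixels):
    """Approximate inverse transform: f(i, j) ~ sum_pq L[p, q] P_p(x_i) P_q(y_j)."""
    P = legendre_basis(n_pixels, moments.shape[0] - 1)
    return P.T @ moments @ P
```

With a maximum order of 100 in each direction the moment matrix has 101 × 101 = 10,201 entries, which can be flattened into one row of variables per map before variable selection.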
Marengo et al. (76) report an interesting application of Legendre moments to a set of 2D-PAGE maps belonging to two different cell lines of control (untreated) and drug-treated pancreatic ductal carcinoma cells. The aim of the work was to obtain the correct classification of the 18 samples using the Legendre moments as discriminant variables. Each 2D-PAGE map, which was automatically digitalized, was described by a 200 × 200 matrix of pixels; the value of each pixel varies from 0 to 1 to indicate the staining intensity in the given position. The Legendre moments of the 18 digitalized images were calculated up to a maximum order of 100. Each matrix of moments held the global information of the corresponding 2D-PAGE map. The final dataset contained 18 samples and 10,201 variables. The number of variables was very large, and many of them were either redundant or did not contain information related to the specific target of correctly classifying the samples; for this reason a method for selecting the variables having the highest power of discrimination was applied (forward stepwise LDA with $F_{\text{to-enter}}$ = 4.0). The results of the stepwise LDA procedure showed that only six different Legendre moments were necessary in order to classify the 18 samples correctly. The results demonstrate that Legendre moments can be successfully applied for fast classification and similarity analysis of 2D-PAGE maps.
4.3. Other Methods
Schultz et al. (77), together with the application of PCA and PLS to spot volume data, applied PCA to the analysis of gel images after digitalization and unwrapping. The choice of the alignment procedure for the sets of gels proved decisive for the final result. PCA proved to be effective in the identification of the groups of maps present. Marengo et al. (78) also applied three-way PCA to the identification of the differences among groups of 2D maps.
Proteomic datasets are suitable for three-way methods owing to their three-way structure: the first dimension being the pH gradient, the second the molecular mass, and the third the samples. In three-way PCA, the observed modes (conventionally called I, J, and K) can be synthesized in more fundamental modes, each element of a reduced mode expressing a particular structure existing between all or a part
of the elements of the associated observation mode. The final result is given by three sets of loadings together with a core array describing the relationship among them. Each of the three sets of loadings can be displayed and interpreted in the same way as a score plot of standard PCA. Three-way PCA was preceded by a data transformation to scale all the samples and make them comparable; for this purpose, maximum scaling was selected and the digitalized 2D-PAGE maps were scaled one at a time to the maximum value of each map. This method was successfully applied to datasets of human lymph nodes and rat sera, allowing the identification of the main differences existing among the sets of 2D maps.
References 1. Mahon, P., Dupree, P., (2001) Quantitative and reproducible two-dimensional gel analysis using Phoretix 2D Full, Electrophoresis 22, 2075–2085 2. Rubinfeld, A., Keren-Lehrer, T., Hadas, G., Smilansky, Z., (2003) Hierarchical analysis of large-scale two-dimensional gel-electrophoresis experiments, Proteomics 3, 1930–1935 3. Anderson, N.L., Taylor, J., Scandora, A.E., Coulter, B.P., Anderson, N.G., (1981) The TYCHO system for computer analysis of two-dimensional gel electrophoresis patterns, Clinical Chemistry 27 (11), 1807–1820 4. Rosengren, A.T., Salmi, J.M., Aittokallio, T., Westerholm, J., Lahesmaa, R., Nyman, T.A., Nevalainen, O.S., (2003) Comparison of PDQuest and Progenesis software packages in the analysis of two dimensional electrophoresis gels, Proteomics 3, 1936–1946 5. Raman, B., Cheung, A., Marten, M.R., (2002) Quantitative comparison and evaluation of two commercially available, two-dimensional electrophoresis image analysis software packages, Z3 and Melanie, Electrophoresis 23, 2194–2202 6. Panek, J., Vohradsky, J., (1999) Point pattern matching in the analysis of twodimensional gel electropherograms, Electrophoresis 20, 3483–3491 7. Pleissner, K.P., Hoffman, F., Kriegel, K., Wenk, C., Wegner, S., Sahistrom, A., Oswald, H., Alt, H., Fleck, E., (1999) New algorithmic approaches to protein spot detection and pattern matching in two-dimensional electrophoresis gel databases, Electrophoresis 20, 755–765 8. Voss, T., Haberl, P., (2000) Observations on the reproducibility and matching efficiency of two-dimensional electrophoresis gels: consequences for comprehensive data analysis, Electrophoresis 21, 3345–3350 9. Cutler, P., Heald, G., White, I.R., Ruan, J., (2003) A novel approach to spot detection for two-dimensional gel electrophoresis images using pixel value collection, Proteomics 3, 392–401 10. 
Molloy, M.P., Brzezinski, E.E., Hang, J., McDowell, M.T., VanBogelen, R.A., (2003) Overcoming technical variation and biological variation in quantitative proteomics, Proteomics 3, 1912–1919
11. Moritz, B., Meyer, H.E., (2003) Approaches for the quantification of protein concentration ratios, Proteomics 3, 2208–2220 12. Wheelock, A.M., Buckpitt, A.R., (2005) Software-induced variance in twodimensional gel electrophoresis image analysis, Electrophoresis 26, 4508–4520 13. Almeida, J.S., Stanislaus, R., Krug, E., Arthur, J.M., (2005) Normalisation and analysis of residual variation in two-dimensional gel electrophoresis for quantitative differential proteomics, Proteomics 5, 1242–1249 14. Pietrogrande, M.C., Marchetti, N., Dondi, F., Righetti, P.G., (2003) Spot overlapping in two-dimensional polyacrylamide gel electrophoresis maps: relevance to proteomics, Electrophoresis 24, 217–224 15. Pietrogrande, M.C., Marchetti, N., Dondi, F., Righetti, P.G., (2002) Spot overlapping in two-dimensional polyacrylamide gel electrophoresis separations: a statistical study of complex protein maps, Electrophoresis 23, 283–291 16. Campostrini, N., Areces, L.B., Rappsilber, J., Pietrogrande M.C., Dondi, F., Pastorino, F., Ponzoni, M., Righetti, P.G., (2005) Spot overlapping in twodimensional maps: a serious problem ignored for much too long, Proteomics 2005 (5), 2385–2395 17. Garrels, J.I., (1979) Two dimensional gel electrophoresis and computer analysis of proteins synthesized by clonal cell lines, J. Biol. Chem. 254, 7961–7977 18. Garrels, J.I., Farrar, J.T., Burwell IV, C.B., (1984) In: Celis, J.E., Bravo, R. (Eds.), Two-dimensional Gel Electrophoresis of Proteins, Academic Press, Orlando, FA, USA, pp. 38–91 19. Garrels, J.I., (1989) The QUEST system for quantitative analysis of twodimensional gels, J. Biol. Chem. 264, 5269–5282 20. Massart, D.L., Vandeginste, B.G.M., Deming, S.M., Michotte, Y., Kaufman, L., (1988) Chemometrics: A Textbook. Amsterdam, Elsevier 21. Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J., Smeyers-Verbeke, J., (1998) Handbook of Chemometrics and Qualimetrics: Part B. Amsterdam, Elsevier 22. 
Marengo, E., Robotti, E., Righetti, P.G., Campostrini, N., Pascali, J., Ponzoni, M., (2004) Study of Proteomic changes associated with healthy and tumoral murine samples in Neuroblastoma by Principal Component Analysis and classification methods, Clinica Chimica Acta 345, 55–67 23. Marengo, E., Robotti, E., Bobba, M., Liparota, M.C., Antonucci, F., Rustichelli, C., Zamò, A., Chilosi, M., Hamdan, M., Righetti, P.G., (2006) Characterisation of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel-electrophoresis and multivariate statistical tools, Electrophoresis 27, 484–494 24. Massart, D.L., Kaufman, L., (1983) In: Elving, P.J., Winefordner, J.D. (Eds.), The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Wiley, New York, USA 25. Eisenbeis, R.A. (Ed.), (1972) Discriminant Analysis and Classification Procedures: Theory and Applications. Lexington, USA
26. Klecka, W.R. (Ed.), (1980) Discriminant Analysis. Sage Publications, Beverly Hills, USA 27. Wold, S., (1976) Pattern recognition by means of disjoint principal components models, Pattern Recognition 8, 127–139 28. Martens, H., Naes, T., (1989) Multivariate Calibration, Wiley, London 29. Kleinbaum, D., Kupper, L., Muller, K., (1988) Applied Regression Analysis and Other Multivariate Methods, 2nd ed.. Pws-Kent, Boston 30. De Noord, O.E., (1994) Multivariate calibration standardization, Chemometr. Intell. Lab. Syst. 25, 85–97 31. Anderson, N.L., Hofmann, J.P., Gemmell, A., Taylor, J., (1984) Global approaches to quantitative analysis of gene-expression patterns observed by use of twodimensional gel electrophoresis, Clin Chem. 30, 2031–2036 32. Tarroux, P., Vincens, P., Rabilloud, T., (1987) HERMeS: a second generation approach to the automatic analysis of two-dimensional electrophoresis gels. Part V: Data analysis, Electrophoresis 8, 187–199 33. Couto, M.M.B., Vogels, J.T.W.E., Hofstra, H., Husiintveld, J.H.J., Vandervossen, J.M.B.M., (1995) Random amplified polymorphic DNA and restriction enzyme analysis of PCR amplified RDNA in taxonomy, 2 Identification techniques for food-borne yeasts, J. Applied Bacteriology 79 (5), 525–535 34. Johansson, M.L., Quednau, M., Ahrne, S., Molin, G., (1995) Classification of lactobacillus-plantarum by restriction-endonuclease analysis of total chromosomal DNA using conventional agarose-gel electrophoresis, International J. of Systematic Bacteriology 45 (4), 670–675 35. Boon, N., De Windt, W., Verstraete, W., Top, E.M., (2002) Evaluation of nested PCR-DGGE (denaturing gradient gel electrophoresis) with group-specific 16S rRNA primers for the analysis of bacterial communities from different wastewater treatment plants, FEMS Microbiology Ecology 39 (2), 101–112 36. 
Gadea, I., Ayala, G., Diago, M.T., Cunat, A., Garcia de Lomas J., (2000) Immunological diagnosis of human hydatid cyst relapse: utility of the enzyme-linked immunoelectrotransfer blot and discriminant analysis, Clinical and Diagnostic Laboratory Immunology 7 (4), 549–552 37. Gadea, I., Ayala, G., Diago, M.T., Cunat, A., Garcia de Lomas, J., (1999) Immunological diagnosis of human cystic echinococcosis: utility of discriminant analysis applied to the enzyme-linked mmunoelectrotransfer blot, Clinical and Diagnostic Laboratory Immunology 6 (4), 504–508 38. Kovarova, H., Hajduch, M., Korinkova, G., Halada, P., Krupickova, S., Gouldsworthy, A., Zhelev, N., Strnad, M., (2000) Proteomics approach in classifying the biochemical basis of the anticancer activity of the new olomoucinederived synthetic cyclin-dependent kinase inhibitor, bohemine, Electrophoresis 21, 3757–3764 39. Kovarova, H., Radzioch, D., Hajduch, M., Sirova, M., Blaha, V., Macela, A., Stulik, J., Hernychova, L., (1998) Natural resistance to intracellular parasites: a study by two-dimensional gel electrophoresis coupled with multivariate analysis, Electrophoresis 19 (8–9), 1325–1331
17
Finding the Significant Markers
Statistical Analysis of Proteomic Data
Sebastien Christian Carpentier, Bart Panis, Rony Swennen, and Jeroen Lammertyn
Summary
After separation through two-dimensional gel electrophoresis (2DE), several hundred individual protein abundances can be quantified in a cell population or sample tissue. Both a good experimental setup and a valid statistical approach are essential to gain insight into the data and to draw correct conclusions. High-throughput 2DE proteomics yields large and complex datasets with a huge disproportion between the hundreds of variables and the restricted number of replicates. However, the most commonly used statistical tests have been designed to cope with a high number of replicates and a restricted number of variables. There is some inconsistency in the proteomics community in the use of statistics. Two approaches to data analysis can be distinguished: exploratory data analysis and confirmatory data analysis. Currently, most proteomic data are analyzed with the emphasis on confirmatory analysis, and exploratory data analysis is not taken into account. This chapter gives an overview of the typical exploratory and confirmatory statistical tools available and suggests case-specific guidelines for a reliable statistical approach to 2DE analysis. Examples are given for an experimental setup based on classical staining methods as well as for the more advanced difference gel electrophoresis.
Key Words: assumptions; confirmatory data analysis; experimental set-up; exploratory data analysis; missing values; multivariate statistics; non-parametric test; parametric test; principal component analysis; univariate statistics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
The conventional approach to analyzing a biological problem is to collect data in order to test a particular hypothesis. Starting from this hypothesis, data are collected that should lead to an objective and reliable decision: the hypothesis can be accepted, revised, or rejected. This confirmatory way of data analysis is accompanied by a number of steps that define the experimental setup. However, our understanding of a biological system is usually rather limited, and data may be very heterogeneous and complex. Exploratory data analysis approaches a biological problem from a different angle and tries to describe patterns, relationships, trends, outlying data, etc.
Two-dimensional gel electrophoresis (2DE) simultaneously quantifies hundreds of individual protein abundances in a cell population or sample tissue. High-throughput 2DE proteomics yields large and complex datasets with a huge disproportion between the hundreds of variables and the restricted number of replicates. The most commonly used statistical tests are intended for confirmatory data analysis and have been designed to cope with a high number of replicates and a restricted number of variables. Both a good experimental setup and a valid statistical approach are extremely important. There is some inconsistency in the proteomics community in this respect: proteomic data are currently analyzed by a variety of approaches. The objective of this chapter is to give a concise overview of statistical methods used in functional genomics and to find a good compromise between statistics and proteome analysis in practice. This chapter deals with experimental design and data analysis and, at the end, provides two practical examples (classical staining approach and DIGE approach). Section 2 discusses the issues of replicates and the pooling of samples, and briefly discusses the calibration, normalization, and quantification of data.
Section 3 discusses confirmatory univariate and exploratory multivariate analysis, together with the related assumptions and associated problems.
2. Experimental Design
The design of an experiment is crucial for the robustness of the results obtained. Careful planning is essential to maximize the information output of an experiment. The experimental conditions must be well designed in order to keep variation within an experimental group as small as possible, and the experimental setup should be kept as simple as possible in order to keep the data manageable. When the impact of a particular treatment is to be examined, proper controls should be included (positive and negative control), and irrelevant external influences should be eliminated or anticipated (e.g., by randomized design).
The conventional approach to analyzing a biological problem is to collect data in order to test a particular hypothesis. The collected data should enable the researcher to make an objective and reliable decision concerning the hypothesis. The experimental setup usually includes a procedure that involves several steps: (1) state a null hypothesis (H0) (e.g., there is no difference in protein abundance(s) between the treatments) and its alternative (H1) (e.g., there is a difference between the treatments), (2) choose the most appropriate test statistic to check the hypothesis, (3) specify a significance level (i.e., the accepted risk of obtaining a false positive result by unjustly rejecting the null hypothesis), (4) specify the sample size (number of replicates) needed for sufficient power, and (5) collect the data. The power of a statistical test is its ability to detect real differences between the experimental groups. The power, that is, the reduction of false negative results, depends on the variance, the change in abundance, the number of replicates, the statistical test chosen, and the predetermined significance level. Lilley and Karp have illustrated the relationship between power, replicate number, and relative expression change in a proteomics experiment (1). Urfer et al. consider the effect of testing all the proteins simultaneously by means of the family-wise error rate and the false discovery rate (2). Adjusting the number of replicates is the best way to control the power of a statistical test. Given the labor and cost involved in 2DE analysis, the number of replicates is often restricted, and thus the variance (technical and biological) should be kept under control.
2.1. Replicates
A well-discussed subject is the nature of replicates.
Two types of replicates are reported in 2DE studies: (1) technical replicates (repeated measurements of the same sample, e.g., the same protein extract) and (2) biological replicates (measurements of different samples within the same experimental group). Ideally, only biological replicate samples should be used, and one should try to limit the technical variability to the strict minimum so that a repeated measurement of the same sample is not necessary (Fig. 1A). Therefore, both a reliable sample preparation method (3) and extensive experience in electrophoresis and proteomic techniques are indispensable (4,5,6,7). Technical variability can be introduced at the level of (1) sample collection, (2) sample preparation and protein extraction, (3) sample loading and electrophoresis, and (4) staining and image analysis. Some staining methods, like silver staining, involve many steps, and each sample is run in an individual gel, which makes the approach susceptible to technical variation. Technical replicates might be considered in experiments with a low sample yield, with cost restrictions, or when the technical variability is still too high (high inter-gel variability) (Fig. 1B).
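The dependence of power on the number of replicates, described in Section 2, can be illustrated with a small simulation. The sketch below is not taken from the cited studies: the function names, the effect size (a log2 fold change of 1), the standard deviation, and the number of simulation runs are illustrative assumptions, and the data are assumed to be normally distributed with equal variance in both groups. The critical values are the standard two-sided 5% values of Student's t for 2n − 2 degrees of freedom.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t for df = 2n - 2 (standard table),
# keyed by the number of replicates n per group.
T_CRIT = {3: 2.776, 4: 2.447, 5: 2.306, 6: 2.228, 8: 2.145, 10: 2.101}


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic for groups a and b."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * statistics.variance(a)
           + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.mean(b) - statistics.mean(a)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5


def simulated_power(n, log2_fold_change, sd, n_sim=2000, seed=1):
    """Fraction of simulated experiments in which H0 is rejected at the 5% level.

    Hypothetical helper: n replicates per group, normal data, equal variance.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, sd) for _ in range(n)]
        treated = [rng.gauss(log2_fold_change, sd) for _ in range(n)]
        if abs(t_statistic(control, treated)) > T_CRIT[n]:
            rejections += 1
    return rejections / n_sim


if __name__ == "__main__":
    for n in sorted(T_CRIT):
        print(n, simulated_power(n, log2_fold_change=1.0, sd=0.5))
```

Under these assumptions the estimated power rises with the number of replicates, and with no true difference the rejection rate stays close to the chosen significance level.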
In any case, one should take care when analyzing technical replicates alongside biological replicates. Statistically speaking, we are dealing with mixed models and nested designs (8,9). Karp et al. discuss the impact of mixing biological and technical replicates in a proteomics experiment (10). Treating technical replicates as biological replicates can increase the rate of false positives. Analyzing biological and technical replicates in one test would seem reasonable only in a nested ANOVA test. If another statistical test is used, only the biological replicates should be included (Fig. 1A), and the technical repetition of the same biological samples (protein extracts) should be considered as a distinct and confirmatory analysis. With the low technical variance observed with the difference gel electrophoresis (DIGE) approach (see below), the value of analyzing technical replicates can be questioned, and they can hence be skipped (Fig. 1A).
2.2. Pooling
Another well-debated subject is the pooling of biological samples. Pooling of individual biological tissues or cells averages the sample. On the one hand, pooling reduces the variability and thereby increases the power; on the other hand, there is an incontestable loss of relevant information about individuals. The pooling of samples reduces biological variance when detecting changes in protein abundance between the averages of the experimental groups. Pooling is usually done when the biological variation within an experimental group is too large (Fig. 1C and 1D), or when the starting material from one individual is insufficient for protein extraction. Pooling of samples might be useful, but must be evaluated for each individual experimental setup.
2.3. Data Processing
Common strategies for quantitative determination of gel-separated proteins include organic dyes (e.g., colloidal Coomassie blue), silver staining, radiolabeling, and fluorescent stains (e.g., Deep Purple, Flamingo, SYPRO Orange/Red/Ruby, and other ruthenium complexes and succinimidyl ester derivatives of cyanine dyes). The choice of a particular staining method should be considered carefully, taking into account the lab equipment available, the budget, and the power of the method. The dynamic range of staining methods and the technical variability both have a great impact on the power of a statistical test and are decisive for the experimental setup (the number of replicates) and the choice of the statistical test. Data from 2DE analysis are generated through image analysis software that detects and quantifies protein abundances and matches the same proteins across different gels. An important challenge in 2DE is to estimate the protein concentration in order to ensure that all gels are loaded with an equal amount of proteins, and hence to minimize the technical variation. Most current software packages take this into account and introduce a calibration or normalization in order to compensate for image differences caused by protein loading, staining, and scanning.
Fig. 1. Experimental setup. Theoretical examples of experimental setup, control vs. treatment. (A) Small intra-group variation and small technical variation: four biological replicates for control and four biological replicates for treatment. (B) Small intra-group variation and big technical variation (mixed model): four biological and three technical replicates for control and the same for treatment. (C) Big intra-group variation and small technical variation: four replicates of a biological pool for control and the same for treatment. (D) Big intra-group variation and big technical variation (mixed model): four replicates of a biological pool and three technical replicates for control and the same for treatment.
2.3.1. Classical Approach
Calibration in a classical approach (like silver or Coomassie staining) is designed to account for differences in scanning properties (such as image depth). Scanner grey values are converted to optical densities so that intensities are no longer dependent on the original pixel depth. The most logical normalization procedure to anticipate possible loading differences for a classical staining is % volume, where the individual spot volumes are normalized by the total volume of all spots. Normalized data, whether or not transformed, can subsequently be analyzed by a relevant statistical test (see below). The most commonly used organic stain is Coomassie brilliant blue (CBB). CBB staining has a relatively good dynamic range (approximately 10^3) and is perfectly compatible with MS. However, its sensitivity is relatively low. The limit of protein detection for colloidal CBB stain is approximately 8–10 ng (11). Therefore, several modifications have been proposed to improve its sensitivity. For an overview, see (12). The introduction of the first sensitive silver staining method (13) was a major breakthrough in the field of protein detection, which led to extensive research and various alternative silver staining protocols (14). Silver staining is still one of the most sensitive non-radioactive detection techniques, with a detection limit in the lower nanogram range. However, the linearity and dynamic range are relatively poor (approximately 10^2 or less), the staining is protein-dependent, and gel-to-gel variation is not negligible due to numerous solution changes and other carefully timed steps.
2.3.2. Difference Gel Electrophoresis Approach
Fluorescence-based methods are overtaking the conventional technologies in use. A standard UV transilluminator can be used for visualization of most fluorescent stains, but more sophisticated and expensive CCD cameras or laser scanners are appropriate for quantitative determination. The development of succinimidyl ester derivatives of different cyanine fluorescent dyes that modify free amino groups of proteins prior to separation (15) was a major achievement in terms of reproducibility and throughput. The DIGE approach uses fluorophores that have different absorption optima, making it possible to run multiple samples simultaneously in the same gel. Several dyes were designed to ensure that a protein acquires the same relative mobility irrespective of the dye used to tag it.
The difference in molecular weight (MW) introduced by linkers of different length is compensated by different alkyl moieties opposite the linker moiety. Originally, only two different cyanine dyes were included (Cy3 and Cy5), but the concept was extended with a third dye (Cy2), which opened the way for a totally new experimental design that further exploits the sample multiplexing capabilities of the dyes by including an internal standard (16,17). The internal standard is a mixture of equal amounts of each sample and guarantees a powerful normalization procedure for high accuracy of protein quantification. This normalization reduces the variability considerably and provides reasonable grounds for using powerful parametric statistics after transformation of the standardized volume. If multiple conditions have to be tested spread over different electrophoresis runs, one common internal standard should be created and included in all the gels of each run. However, if an experimental setup is too complex, the internal standard will contain too many samples, possibly resulting in an overlap of spots of different samples. The minimal labeling approach has a dynamic range of four to five orders of magnitude, and it is currently marginally less sensitive than silver staining (18). Although the dyes have been carefully designed, care should be taken in the experimental design to take into account possible dye-specific effects. Therefore, a supervised randomization of the Cy3/Cy5 labeling is highly recommended. Not only should the labeling be randomized, but the samples representing an experimental group should also be mixed across gels in order to avoid systematic gel artefacts.
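The two normalization ideas of Section 2.3 can be sketched in a few lines. The code below is only an illustration under stated assumptions, not the procedure of any particular image analysis package: the function names and the example spot identifiers are hypothetical, the "standardized volume" is assumed to be the ratio of a sample spot volume to the internal standard spot volume on the same gel, and a log10 transformation is assumed. Commercial software typically applies further normalization steps that are not shown here.

```python
import math


def percent_volume(spot_volumes):
    """Classical staining: express each spot volume as a percentage of the
    total volume of all spots on the same gel (% volume)."""
    total = sum(spot_volumes.values())
    return {spot: 100.0 * volume / total for spot, volume in spot_volumes.items()}


def standardized_log_abundance(sample_volumes, standard_volumes):
    """DIGE (assumed form): log10 of the ratio between the sample spot volume
    and the internal standard spot volume measured on the same gel."""
    return {
        spot: math.log10(sample_volumes[spot] / standard_volumes[spot])
        for spot in sample_volumes
        if spot in standard_volumes  # only spots matched to the standard
    }


# Hypothetical spot volumes for one gel.
gel = {"spot_1": 1500.0, "spot_2": 500.0, "spot_3": 2000.0}
standard = {"spot_1": 750.0, "spot_2": 500.0, "spot_3": 4000.0}

normalized = percent_volume(gel)
ratios = standardized_log_abundance(gel, standard)
```

Because every gel carries the same internal standard, the ratios are comparable across gels, which is the property the DIGE design relies on.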
3. Data Analysis
3.1. Confirmatory Univariate Data Analysis
Univariate statistical methods examine the individual protein spots one by one, considering the different proteins as independent measurements. Table 1 gives an overview of some commonly used parametric and non-parametric univariate tests. Univariate methods start from the null hypothesis that there is no difference between the two experimental populations.

Table 1
Overview of Some Commonly Used Univariate Tests

Classes of data          Univariate statistics
                         Parametric    Non-parametric
Comparing 2 treatments   T-test        Mann–Whitney/Wilcoxon; Kolmogorov–Smirnov test
Comparing k treatments   ANOVA         Kruskal–Wallis test

Parametric models like the Student's T-test start from the observed sampling and assume that the observed sample mean and variance approximate the real population mean and variance, and that the variances of the two experimental populations are equal. Based on the observed mean and variance, the two populations are considered normally distributed and a model is made (Fig. 2). If the test statistic (or T value) is large enough, the null hypothesis is rejected (Eq. 1). The numerator measures the distance between the experimental means and is thus an estimation of the inter-group variability; the denominator approximates the real variability and estimates the intra-group variability.
$$T^2 = \frac{(\bar{y}_2 - \bar{y}_1)^2}{S_P^2 \left( 1/n_1 + 1/n_2 \right)} \qquad (1)$$
where $\bar{y}_i$: experimental mean (estimate of the population mean, $\mu_i$); $S_P^2$: pooled sample variance (estimate of the variance; it is a weighted average of the group variances accounting for the number of replicates or samples in each group); $n_i$: number of replicates per experimental group. Parametric univariate statistical tests are very powerful, but the data must respect the restrictive assumptions (continuous and normally distributed data, homogeneity of variance, and independent samples), and the assumptions must be tested. Levene's test is commonly used to assess homogeneity of variances, and the Shapiro-Wilk test to assess normality (19). If one assumption is not met, the significance levels and the power of the test might be invalidated. Transformation of data (e.g., log function, arcsine, square root) is frequently used to improve the distribution characteristics (normality and homogeneity of variance) (20). The problem with proteomic data is the low number of replicates. These assumptions cannot be tested reliably at the low sample sizes commonly used in 2DE experiments: tests like Levene's test and the Shapiro-Wilk test are designed for larger sample sizes and have very limited power at the sample sizes commonly used in proteomics experiments. Given the labor and cost involved in 2DE analysis, the number of replicates is often restricted and usually ranges between 3 and 6.
Fig. 2. Distribution of two normal populations with a homogeneous variance. μi : real population average estimated by the sample average.
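As a worked illustration of Eq. 1, the following sketch computes the pooled sample variance and the squared T statistic for a single protein spot. The abundance values are invented for the example and the function names are hypothetical; only the formula itself comes from the text.

```python
import statistics


def pooled_variance(group1, group2):
    """Weighted average of the two group variances (weights n_i - 1)."""
    n1, n2 = len(group1), len(group2)
    return ((n1 - 1) * statistics.variance(group1)
            + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)


def t_squared(group1, group2):
    """Squared T statistic of Eq. 1: squared distance between the group means
    divided by the pooled variance scaled by (1/n1 + 1/n2)."""
    n1, n2 = len(group1), len(group2)
    difference = statistics.mean(group2) - statistics.mean(group1)
    return difference ** 2 / (pooled_variance(group1, group2) * (1 / n1 + 1 / n2))


# Invented (already transformed) abundances of one spot, four replicates per group.
control = [1.0, 1.2, 0.9, 1.1]
treated = [1.6, 1.9, 1.7, 1.8]

t2 = t_squared(control, treated)  # (0.7)^2 / (0.05/3 * 0.5) = 58.8
```

With four replicates per group there are six degrees of freedom; the two-sided 5% critical value of T is 2.447 (T squared about 5.99), so for these invented values the null hypothesis would be rejected.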
Although some empirical evidence illustrates that slight deviations from the assumptions underlying parametric tests may not have radical effects on the obtained probability levels, there is no general agreement as to what constitutes a “slight” deviation (21). An alternative to parametric tests is the use of non-parametric tests, which do not assume any distribution for the data but usually have relatively low power (21). Their assumptions are that the samples are independent and that the data are continuous and ordinal. A useful non-parametric test is the Kolmogorov–Smirnov test, which determines whether or not the experimental groups come from the same distribution. To this end, the data points in each experimental group are sorted in ascending order, and an empirical distribution function is calculated without any assumption about distribution or variance. The Kolmogorov–Smirnov test statistic D is defined as the maximum distance between the cumulative distributions of the two experimental groups (for an example, see Fig. 5).
$$D_{n_1 n_2} = \max_X \left| S_{n_1}(X) - S_{n_2}(X) \right| \qquad (2)$$
where $S_{n_i}(X) = K_i / n_i$; $K_i$: number of data points less than or equal to $X$; $n_i$: number of replicates per experimental group.
3.2. Exploratory Multivariate Data Analysis
Univariate statistical tests, such as the T-test, the Kolmogorov–Smirnov test, ANOVA, or the Kruskal–Wallis test, have not been designed to analyze complex datasets containing multiple correlated variables. Proteomic datasets generally contain hundreds of different proteins that are correlated. Proteins fit within the larger entity of networks and interact with each other. Univariate statistics test the individual variables one by one and cannot detect correlations with other variables (proteins). Moreover, testing hundreds of variables (protein spots) one by one and reporting them with an acceptance of a certain risk of false positives (α) enhances the chance of reporting false positive cases (multiple testing issue), and assumes that the different variables (proteins) are uncorrelated. Proteins are not uncorrelated; they fit within multiple biological pathways and might be closely correlated. The field of multivariate analysis consists of those statistical techniques that consider two or more related random variables as a single entity and attempt to produce an overall result taking the relationship among the variables into account (22). In contrast to a univariate approach, it displays the inter-relationships between a large number of variables and is able to correlate multiple proteins to a specific experimental group. The data from different image analysis software packages can be exported, introduced, and analyzed using several software packages to
perform multivariate analysis. Some commonly used packages are Unscrambler, Matlab, SAS, and Statistica. GE Healthcare developed a statistical software package (EDA, extended data analysis) for the DIGE approach, which is linked to the image analysis software DeCyder. The package offers both univariate and multivariate tools. Here, we will mainly discuss the use of Principal Component Analysis (PCA) (for an overview of other possibilities of the EDA package and more DIGE-related statistical examples, see Chapter 6).
3.2.1. Principal Component Analysis
Principal Component Analysis is one of the multivariate options for performing explorative data analysis. A comprehensive overview of the use of PCA in statistics is given by Sharma (23). The basics of PCA date back to Karl Pearson in 1901 (24), and the final procedure as we know it today was developed by Harold Hotelling in 1933 (25). The use of multivariate methods in the analysis of 2DE was already established in the early days of 2DE (26) and is an emerging application in transcriptomics and proteomics (27,28,29,30,31). PCA condenses the information contained in a huge dataset into a smaller number of artificial factors, which explain most of the variance observed. The most logical modus operandi is to consider the different biological replicate samples of the experimental groups as observations (score plot). The score plot allows the detection of trends in the samples, and the loading plot allows the identification of the relevant proteins that explain the trends. A principal axis transformation transforms the correlated variables (proteins) into new uncorrelated variables. A principal component (PC) is a linear combination calculated from the existing variables (proteins) [PC1 = a1 (protein 1) + a2 (protein 2) + … + an (protein n); PC2 = b1 (protein 1) + b2 (protein 2) + … + bn (protein n)]. The relation between the original variables (proteins) and the PCs is displayed in the loading plot.
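The linear-combination view of PCA described above can be made concrete with a small numerical sketch. It uses NumPy and an eigendecomposition of the sample covariance matrix, which is one common way to compute principal components; it is not the implementation of any of the packages named above. The function name `pca`, the simulated data (three control and three treated samples, five proteins, two of which change), and the noise level are all illustrative assumptions.

```python
import numpy as np


def pca(data):
    """PCA of a samples (rows) x proteins (columns) matrix.

    Returns the scores (samples projected on the PCs), the loadings
    (coefficients of each protein in each PC, one PC per column), and the
    eigenvalues (variance of each PC), sorted by decreasing variance.
    """
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)          # proteins x proteins
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]                 # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = centered @ eigenvectors
    return scores, eigenvectors, eigenvalues


# Simulated example: proteins 0 and 1 are more abundant in the treated group.
rng = np.random.default_rng(0)
control = rng.normal(0.0, 0.1, size=(3, 5))
treated = rng.normal(0.0, 0.1, size=(3, 5))
treated[:, :2] += 2.0
data = np.vstack([control, treated])

scores, loadings, eigenvalues = pca(data)
explained = eigenvalues / eigenvalues.sum()   # share of variance per PC
```

In this toy dataset the first column of `scores` separates control from treated samples (the score plot), and the largest absolute values in the first column of `loadings` belong to the two proteins that were changed (the loading plot).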
If a protein has a high loading score for a specific PC, that protein explains an important part of the sample variance. The starting point for PCA is the sample covariance matrix. It has been proven that the sum of the original variances is equal to the sum of the eigenvalues of the sample covariance matrix. The eigenvalues are the variances of the PCs. The ratio of each eigenvalue to the total variance indicates the portion of the total variability accounted for by each PC. For the fundamentals of data manipulation and a more detailed description of the properties and mechanisms of multivariate analysis and PCA, the reader is referred to the books of Jackson and Sharma (22,23). It is very important to have an insight into what is calculated and what the assumptions of the different models are. The EDA software lets the user choose what is treated as observations and what as variables. Hence, the user also has the possibility to use the transposed data matrix, and to consider the gel images as
variables (loading plot) and the proteins as observations (score plot). This might be helpful to improve the image analysis and to detect protein mismatches, but should not be used to explore the inter- and intra-group variability of the biological samples. Explorative PCA does not impose strict requirements on the data. The majority of PCA applications are descriptive in nature. In these instances, distributional assumptions are of secondary importance (22). The only requirement that must be met is that the dataset has to be complete, meaning that there must be no missing spot values among the different samples. The problem can be solved either by techniques that perform PCA in the absence of complete data or by techniques that estimate the missing data. Several methods for estimating missing data have been reported from the microarray community (32,33,34). A missing value in 2DE proteomics occurs when a spot is detected in the reference or master gel but not detected in one of the other sample gel images, or when it is detected but not matched to the reference or master gel. The causes of missing values might be (1) faint spots close to the detection limit that are detected in one gel but not in another, (2) mismatches, probably caused by distortions in the protein pattern, or (3) absence of spots due to bad transfer from the first to the second dimension. Grove et al. show that the staining procedure was an important source of missing values (27). The concept of DIGE with its common internal standard anticipates the missing value problem to some extent by matching the different internal standard images. Good sample preparation (3) and solid experience in electrophoresis and proteomic techniques also reduce this problem, but missing values are inherent to 2DE and must be faced. Some software packages replace the missing values with the value zero, and others remove all the variables with missing values.
Introducing zeros leaves the results open to serious bias when a protein is mismatched in a particular sample or when the spot is missing due to a technical error. This particular protein will get an important loading value for the sample in question, influencing incorrectly the score for this particular sample. In the case a protein is really absent or below the detection limit of the staining method, those missing values can be filled either with zeros or with a threshold value (35). A better alternative might be to average the samples within an experimental group and to explore the data based on the group mean. A missing value will still be considered as a zero and will lower the group mean, but the impact of loading on the sample score plot is buffered by the average. The EDA package offers this possibility (see example below). Taking into account only the proteins that are detected and matched to the master or reference gel solves the problem of missing values, but a lot of useful information is lost (see example below). The EDA package offers the possibility to filter the base dataset and to select only those proteins that are 100% matched. Troyanskaya et al. show that averaging is an improvement upon replacing missing values
338
Carpentier et al.
with zeros, but it yields drastically lower accuracy than estimation methods such as singular value decomposition and weighted K-nearest neighbors (32). We recommend performing the initial PCA on the complete dataset and not only on the proteins that appear to be significantly different in the individual univariate analyses. Multivariate statistics have additional value because they can differentiate the experimental groups in terms of correlated expression rather than absolute expression (28,36). Both approaches are complementary. Performing the analysis only on the significant proteins from the univariate analysis might disregard useful information. We recommend starting with explorative multivariate analysis and subsequently comparing the results with the confirmatory univariate analysis of the individual proteins.
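The relationship between eigenvalues and explained variance stated earlier in this section can be checked on a toy case. The following is a minimal sketch, assuming only two proteins measured in six gels (all values are invented), so that the eigenvalues of the 2 x 2 sample covariance matrix have a closed form. Real datasets with hundreds of spots need a numerical eigen-decomposition, as performed by statistical software.

```python
import math

# Hypothetical log abundances of two proteins (variables) in six gels (observations).
protein_a = [1.0, 1.2, 0.9, 2.1, 2.3, 2.0]
protein_b = [0.5, 0.7, 0.4, 1.4, 1.6, 1.5]

def covariance(x, y):
    """Sample covariance (n - 1 in the denominator); covariance(x, x) is the variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

var_a = covariance(protein_a, protein_a)
var_b = covariance(protein_b, protein_b)
cov_ab = covariance(protein_a, protein_b)

# Closed-form eigenvalues of the covariance matrix [[var_a, cov_ab], [cov_ab, var_b]].
half_trace = (var_a + var_b) / 2
half_gap = math.sqrt((var_a - var_b) ** 2 + 4 * cov_ab ** 2) / 2
eigenvalues = [half_trace + half_gap, half_trace - half_gap]

total_variance = var_a + var_b
explained = [value / total_variance for value in eigenvalues]

print("sum of original variances:", round(total_variance, 6))
print("sum of eigenvalues:       ", round(sum(eigenvalues), 6))
for number, share in enumerate(explained, start=1):
    print(f"PC{number} explains {share:.1%} of the total variance")
```

The two printed sums agree, and the ratios in `explained` are the portions of the total variability attributed to each PC.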
3.2.2. Marker Selection
Principal component analysis is very effective at detecting outlying data and correlations among the different variables (proteins), but it cannot determine a threshold for identifying which proteins are significant in classifying the experimental groups, which would allow an objective removal of variables (proteins) that do not contribute to the class distinction. Several algorithms exist to select a subset of features from the whole dataset and to perform a classification. In proteome analysis, this corresponds to selecting the proteins that best discriminate the experimental groups. The use of partial least squares (PLS) as a regression technique has been promoted primarily within the area of chemometrics (37). In contrast to PCA, PLS is a supervised technique mainly applied to link (or regress) a continuous response variable (or dependent variable) to a set of independent variables (e.g., proteins in a gel). However, in proteomic data, the response variable is often discrete (e.g., treatment A, B, C,…) and takes only a fixed number of values. PLS-DA offers an algorithm to deal with this typical data structure. An analysis of the score and (correlation) loading plot allows the proteins that are important in discriminating the different experimental treatments to be defined. The variable importance plot (VIP) is an interesting tool for this purpose. According to the user manual, the PLS algorithm of EDA creates a supervised model of the data (predefined experimental groups) and then uses the variable influence on the projection (VIP) scores from the model to create a list that ranks how well each protein discriminates between the experimental groups. Discriminant analysis (DA) methods in general, and PLS-DA in particular, are used to calculate the probability or accuracy of the marker selection. The purpose of DA is to assign individual observations (samples) to one of the experimental
Statistical Analysis of Proteomic Data
339
groups [e.g., the classification of patient samples as healthy and tumor based on protein extractions (38)].
4. Examples
4.1. Classical Dyes, 2 Conditions
In this example, we examine two different conditions, analyse six biological samples per condition, and perform the analysis with classical CBB staining. The data have been analyzed with the Image Master Platinum software version 5 (GE Healthcare). Image Master version 5 offers the possibility to compensate for technical variance and offers intensity calibration and spot normalization. The relative volume (%vol) spot normalization is the best spot normalization procedure because this takes into account the intensity of a spot as well as the area (Eq. 3).
$$\%\mathrm{vol} = \frac{\mathrm{vol}}{\sum_{S=1}^{n} \mathrm{vol}_S} \tag{3}$$
where $\mathrm{vol}_S$ is the volume of spot $S$ in a gel containing $n$ detected spots. Although this spot normalization procedure reduces the possible technical variance, it has consequences for the data. Normalizing all the spots transforms the data and creates an asymmetric population (Fig. 3). A logarithmic transformation of the data improves the distribution characteristics (Fig. 4). However, univariate statistical methods are not designed to analyze all the spots simultaneously as in Figs. 3 and 4. They examine the individual protein spots (variables) one by one, considering the different proteins as independent measurements. Therefore, one should consider each spot individually, and the real population for the experimental groups of this particular protein spot should be estimated based on the six replicates. Performing distribution tests such as Levene’s test and the Shapiro-Wilk test on six replicates is possible, but it is unlikely that the null hypotheses (homogeneous variance and normal distribution, respectively) will be rejected. The sample size needs to be large enough to minimize the number of false results (i.e., with small samples the populations will appear to be normally distributed and of equal variance although this is not necessarily the case). Taking into account the typical heterogeneity of variance associated with classical dyes, the %vol spot normalization of Image Master, and the limited sample size, a non-parametric statistical test seems to be the best choice in this case. We opted here for the non-parametric univariate Kolmogorov–Smirnov test. The test is one of the options offered by Image Master. It is a two-sample test with high power efficiency for small sample sizes. The reduced power of a non-parametric test was anticipated by including a
Fig. 3. Distribution of protein spots analyzed by Image Master and normalized using the %vol criterion (histogram of the number of observations with the expected normal curve; Shapiro-Wilk W = 0.35883, p = 0.0000). There is an asymmetrical distribution, with the majority of the spots lying between 0 and 0.1%.
Fig. 4. A logarithmic transformation of the %vol data of Fig. 3 (histogram of the number of observations with the expected normal curve; Shapiro-Wilk W = 0.98283, p = 0.00000).
higher number (6) of biological replicates. Figure 5 shows an example of an individual Kolmogorov–Smirnov test. For the complete experimental setup and biological background, see Carpentier et al. (39). The options of the Image Master Platinum software are rather limited and are focused on two experimental groups. The multivariate analysis offered by Image Master Platinum is factor analysis, a technique similar in nature to PCA. The results of both techniques are quite similar, except that factor analysis explains correlations between variables, whereas PCA explains variability (22). In Image Master Platinum, the gels (images) are used for the loading plot and the proteins for the score plot. Factor 1 (explaining the majority of the variability) is in our case associated with protein abundance, and the second factor is associated with inter-group variability. As stated above, this might be useful to improve the image analysis and to detect protein mismatches, but to explore the inter- and intra-group variability of the biological samples, it might be better to export the
Fig. 5. Example of Kolmogorov–Smirnov test. Panels A to C are labeled 1373 and plot %vol; panel D plots frequency against %vol. (A) Descriptive statistics displaying the experimental mean and standard deviation of the two experimental groups (A and B). (B) Descriptive statistics of the individual biological samples of the two experimental groups. (C) The data sorted in ascending order. (D) Empirical cumulative distribution functions of the two experimental groups.
data to a statistical program. For an example of classical staining and uni- and multivariate analysis, see Pedreschi et al. (40).
4.2. DIGE Approach, 4 Conditions
In this example, we are interested in the effects of a specific treatment over time. Using the DIGE approach, we consider here four time points. At each time point, three biological samples were analyzed, quantifying several hundreds of protein spots (i.e., variables) per sample per time point. To process and analyze the gels, the Decyder software version 6.5 was used in combination with the EDA module (GE Healthcare). The standardized normalization procedure in Decyder 2D BVA is based on the concept of having for each gel the Cy2 labeled internal standard image as reference. This standard image is used to normalize the abundance ratios between the different gels. Decyder offers the possibility to perform transformation and normalization of the data: log standardized abundance (Eq. 4).
$$\text{Log standardized abundance} = \log_{10}\left(\frac{\mathrm{vol}_{\mathrm{Cy5\ or\ Cy3}}}{\mathrm{vol}_{\mathrm{Cy2}}}\right) \tag{4}$$
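The two normalizations used in these examples (Eqs. 3 and 4) can be sketched in a few lines. The function names and all spot volumes below are hypothetical; the code only mirrors the two equations as printed and is not the implementation of Image Master or Decyder.

```python
import math

def percent_vol(volumes):
    """Eq. 3 as printed: each spot volume divided by the summed volume of the n spots."""
    total = sum(volumes)
    return [volume / total for volume in volumes]

def log_standardized_abundance(sample_volume, standard_volume):
    """Eq. 4: log10 of the Cy3 or Cy5 volume over the Cy2 internal standard volume."""
    return math.log10(sample_volume / standard_volume)

# Hypothetical spot volumes from one classically stained gel.
gel_volumes = [1200.0, 300.0, 4500.0, 80.0]
relative = percent_vol(gel_volumes)
# Shown as percentages (ratio x 100), an assumption about how %vol is reported.
print([round(100 * value, 2) for value in relative])

# Hypothetical DIGE spot: Cy5 sample volume and Cy2 internal standard volume.
print(round(log_standardized_abundance(2000.0, 1000.0), 3))
```

Because every %vol value depends on the summed volume of all spots in the gel, a change in a few abundant spots shifts the values of all the others, which is one reason the normalized data behave differently from the raw volumes.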
Using the DIGE approach, Karp and Lilley gathered reasonable arguments to assume that the restrictive assumptions of parametric statistics are not too strongly violated after the logarithmic transformation of the standardized abundance (1). The use of parametric statistics seems, therefore, acceptable. However, univariate statistics test the individual variables one by one and cannot correlate multiple proteins. Moreover, testing hundreds of variables (protein spots) one by one, each with an accepted risk of a false positive (α), increases the chance of reporting false positive cases (multiple testing issue). It is, therefore, advisable to first gain insight into the complex dataset by exploring the data via multivariate analysis and then to validate the individual differences via univariate statistics. Not all proteins are relevant for understanding the differences between the time points. Therefore, it would be interesting to distinguish relevant proteins from irrelevant proteins whose abundance does not change over time. To facilitate the discovery of the differences, we used the PCA of the extended data analysis module of Decyder. PCA reduces more than 1000 variables into PCs that explain most of the variance between the treatment times. PCA is not supervised, meaning that the samples are analyzed without knowledge of the sampling time. In Fig. 6, the score and loading plots are displayed, taking into account the two most important PCs. The different repetitions of the same time point cluster together, and the most important PC (i.e., PC1) is able to separate the clustered treatment times. In practice, this means that proteins with a high positive PC1 value will be abundantly present in the 2-day gels and less
abundant in 14-day gels and vice versa for proteins with a highly negative PC1 value. Proteins that cluster together have a similar impact on the PCs and have a similar expression pattern (Fig. 6). This rough approach explains only a small part of the variability. The first PC explains 34.2% of the variability and explains a great part of the inter-group biological variability (time effect). A high positive PC1 value is correlated to 2 days, and a high negative value is correlated to 14 days. Most proteins cluster around the origin, indicating a poor contribution to the variance and probably do not change in abundance during the examined time period. The second PC explains 15.1% of the variability and seems to explain mainly (technical) intra-group variability. By default EDA ignores the missing values. By anticipating the missing value issue and taking the average of each experimental group and reducing some technical variability, the first component explains 60.9% of the variability and the second PC 23.4%. Taking into account only the proteins that have been matched and detected in all the gels reduces the number of examined proteins by more than 50% and discards very useful proteins that have, for instance, a very low
Fig. 6. PCA analysis. (A) Score plot. The big circle is based on the Hotelling’s T² test statistic and is used to detect outlying observables (0.95). The three biological replicates of the same experimental group cluster together, indicating an acceptable intragroup variability (grey ellipse). The different experimental groups are also separated, indicating a certain inter-group variability. There is a clear difference between 2 and 14 days of treatment. (B) The loading plot indicates the correlation between the original variables. A protein with a high loading score for a specific PC explains an important part of the sample variance.
abundance in the early days of treatment and higher abundances at the end and vice versa. As an example, we focus on five proteins that seem highly correlated in the loading plot (highlighted in Fig. 6B). Confirmatory differential expression analysis via ANOVA confirms that all five proteins have a very similar expression pattern over time (Fig. 7). This might suggest a common regulatory mechanism or an interaction between the proteins. The individual confirmatory univariate statistics (ANOVA and multiple comparison test) confirm for four out of the five proteins that 2 days is significantly different from 4 days, 8 days, and 14 days, and that 14 days is significantly different from 4 days and 8 days (α ≤ 0.01). We could identify four proteins as lectin isoforms (39), which confirms, at a first level, the correlation between the proteins. One protein could not be identified and is under further investigation. This protein is likely to share a common regulatory mechanism (possibly being a lectin-like protein as well), or it might form a complex or interact with lectin proteins. This particular protein shows exactly the same expression pattern as the four identified lectins, but the overall ANOVA has a p value of 0.0122. This is a nice illustration of
Fig. 7. Confirmatory differential expression analysis—expression pattern of the individual proteins selected from Fig. 6. The different normalized relative abundances are displayed for the different time points (14 days, 8 days, 4 days, and 2 days). The mean of each individual isoform is displayed as a cross.
how exploratory data analysis performs, indicating correlation but also bringing up candidate markers that would have been missed when using only confirmatory data analysis (α ≤ 0.01).
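The multiple testing issue raised in this example can be quantified with a short calculation. The sketch below assumes, for illustration only, that no spot truly changes and that the tests are independent (which correlated protein spots do not guarantee); the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when none of the n_tests spots truly changes."""
    return n_tests * alpha

def chance_of_at_least_one(n_tests, alpha):
    """Probability of one or more false positives, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

alpha = 0.01
for n_tests in (1, 100, 1000):
    print(n_tests, "tests:",
          "expected false positives =", round(expected_false_positives(n_tests, alpha), 2),
          "| chance of at least one =", round(chance_of_at_least_one(n_tests, alpha), 4))
```

Under these assumptions, a per-test risk that looks small for a single spot becomes a near certainty of at least one false report once about a thousand spots are tested, which is why exploring the data first and confirming afterwards is advisable.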
5. Conclusions
The experimental conditions are important and must be well designed. Ideally, only biological replicate samples should be used, and one should try to limit the technical variability to the strict minimum. Reliable sample preparation and extensive experience in electrophoresis and proteomic techniques are indispensable. With the low technical variance observed with the DIGE approach, the need for analyzing technical replicates can be questioned. The pooling of samples reduces the biological variance to detect changes in protein abundance between the averages of the experimental groups. Pooling of samples might be useful but must be reconsidered for each individual experimental setup. The use of a particular staining method should be carefully considered, taking into account the available lab equipment, budget, and power of a particular method. The dynamic range of the staining methods and the technical variability have a great impact on the power of a statistical test and are decisive for the experimental setup (the number of replicates) and the choice of the statistical test. Univariate statistics test the individual variables one by one and cannot correlate multiple proteins. Moreover, testing hundreds of variables (protein spots) one by one, each with an accepted risk of a false positive (α), increases the chance of reporting false positive cases (multiple testing issue). Therefore, it is advisable to first gain insight into the complex dataset by exploring the data via multivariate analysis and then to validate the individual differences via univariate statistics. Using a classical approach with the typical heterogeneity of variance associated with classical dyes and the limited sample sizes, a non-parametric test seems to be the best choice.
Using the DIGE approach, the restrictive assumptions of parametric statistics are not too strongly violated after the logarithmic transformation of the standardized abundance. The use of parametric statistics seems, therefore, acceptable.
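For the non-parametric route recommended for classical dyes, the two-sample Kolmogorov–Smirnov statistic is the largest vertical distance between the empirical cumulative distribution functions of the two groups (the curves plotted in Fig. 5D). The following is a minimal sketch with invented %vol values for six replicates per group. It computes only the test statistic, not the associated probability, and is not the implementation used by Image Master.

```python
def ecdf(sample, x):
    """Fraction of the sample that is less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

def ks_statistic(group_a, group_b):
    """Largest vertical distance between the two empirical distribution functions."""
    points = sorted(set(group_a) | set(group_b))
    return max(abs(ecdf(group_a, x) - ecdf(group_b, x)) for x in points)

# Hypothetical %vol values of one spot in six biological replicates per group.
group_a = [0.21, 0.25, 0.19, 0.23, 0.27, 0.22]
group_b = [0.48, 0.55, 0.44, 0.26, 0.51, 0.60]

d = ks_statistic(group_a, group_b)
print("Kolmogorov-Smirnov statistic D =", round(d, 4))
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap between the groups). Turning it into a decision requires the reference distribution for the given sample sizes, which statistical software provides.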
Acknowledgments
The authors would like to thank Romina Pedreschi for critical reading and suggestions and Prof. Verbeke for sharing his files. Financial support from the Belgian National Fund for Scientific Research (FWO-Flanders) is gratefully acknowledged.
References
1. Karp, N. A. & Lilley, K. S. (2005) Proteomics 5, 3105–3115.
2. Urfer, W., Grzegorczyk, M., & Jung, K. (2006) Proteomics S2, 48–55.
3. Carpentier, S. C., Witters, E., Laukens, K., Deckers, P., Swennen, R., & Panis, B. (2005) Proteomics 5, 2497–2507.
4. Bjellqvist, B., Ek, K., Righetti, P. G., Gianazza, E., Gorg, A., Westermeier, R., & Postel, W. (1982) J. Biochem. Biophys. Methods 6, 317–339.
5. Westermeier, R. (2001) Electrophoresis in Practice. Wiley-VCH, Weinheim.
6. Westermeier, R. & Naven, T. (2002) Proteomics in Practice. Wiley-VCH, Weinheim.
7. Rabilloud, T. (2000) Proteome research: two dimensional gel electrophoresis and identification methods. Springer, Heidelberg.
8. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 958–1010.
9. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 1121–1164.
10. Karp, N. A., Spencer, M., Lindsay, H., O’dell, K., & Lilley, K. S. (2005) J. Proteome Res. 4, 1867–1871.
11. Patton, W. F. (2000) Electrophoresis 21, 1123–1144.
12. Westermeier, R. (2006) Proteomics S2, 61–64.
13. Switzer, R. C., Merril, C. R., & Shifrin, S. (1979) Anal. Biochem. 98, 231–237.
14. Rabilloud, T., Vuillard, L., Gilly, C., & Lawrence, J. (1994) Cellular and Molecular Biology 40, 57–75.
15. Unlu, M., Morgan, M. E., & Minden, J. S. (1997) Electrophoresis 18, 2071–2077.
16. Alban, A., Currie, I., Lewis, S., Stone, T., & Sweet, A. C. (2002) Mol. Biol. Cell 13, 407A–408A.
17. Alban, A., David, S. O., Bjorkesten, L., Andersson, C., Sloge, E., Lewis, S., & Currie, I. (2003) Proteomics 3, 36–44.
18. Tonge, R., Shaw, J., Middleton, B., Rowlinson, R., Rayner, S., Young, J., Pognan, F., Hawkins, E., Currie, I. et al. (2001) Proteomics 1, 377–396.
19. Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W. (1996) In: Applied Linear Statistical Models (Neter, J., Kutner, M. H., Nachtsheim, C. J., & Wasserman, W., eds.). Irwin, Chicago, pp. 95–152.
20. Gustafsson, J. S., Ceasar, R., Glasbey, C. A., Blomberg, A., & Rudemo, M. (2004) Proteomics 4, 3791–3799.
21. Siegel, S. C. N. J. (1988) Non Parametric Statistics for Behavioral Sciences. McGraw-Hill Book Company, Singapore.
22. Jackson, J. E. (2003) A User’s Guide to Principal Components. Wiley, New York.
23. Sharma, S. Applied Multivariate Techniques. Wiley, Hoboken, NJ.
24. Pearson, K. (1901) Phil. Mag. Ser. B. 2, 559–572.
25. Hotelling, H. (1933) J. Educ. Psychol. 24, 417–441.
26. Tarroux, P. (1983) Electrophoresis 4, 63–70.
27. Grove, H., Hollung, K., Uhlen, A. K., Martens, H., & Faergestad, E. M. (2006) J. Proteome Res. 5, 3399–3410.
28. Marengo, E., Robotti, E., Bobba, M., Liparota, M. C., Rustichelli, C., Zamoo, A., Chilosi, M., & Righetti, P. G. (2006) Electrophoresis 27, 484–494.
29. Schultz, J., Gottlieb, D. M., Petersen, M., Nesic, L., Jacobsen, S., & Sondergaard, I. (2004) Electrophoresis 25, 502–511.
30. Verhoeckx, K. C. M., Gaspari, M., Bijlsma, S., Van Der Greef, J., Witkamp, R. F., Doornbos, R. P., & Rodenburg, R. J. T. (2005) J. Proteome Res. 4, 2015–2023.
31. Gottlieb, D. M., Schultz, J., Bruun, S. W., Jacobsen, S., & Sondergaard, I. (2004) Phytochemistry 65, 1531–1548.
32. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001) Bioinformatics 17, 520–525.
33. Scheel, I., Aldrin, M., Glad, I. K., Sorum, R., Lyng, H., & Frigessi, A. (2005) Bioinformatics 21, 4272–4279.
34. Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., & Ishii, S. (2003) Bioinformatics 19, 2088–2096.
35. Wood, J., White, I. R., & Cutler, P. (2004) Signal Process. 84, 1777–1788.
36. Karp, N. A., Griffin, J. L., & Lilley, K. S. (2005) Proteomics 5, 81–90.
37. Wold, S. (1985) Encyc. Stat. Sci. 6, 581–591.
38. Nguyen, D. V. & Rocke, D. M. (2002) Bioinformatics 18, 39–50.
39. Carpentier, S. C., Witters, E., Laukens, K., Van Onckelen, H., Swennen, R., & Panis, B. (2007) Proteomics 7, 92–105.
40. Pedreschi, R., Vanstreels, E., Carpentier, S., Robben, J., Noben, J. P., Swennen, R., Lammertyn, J., Vanderleyden, J., & Nicolaï, B. M. Proteomics 7, 2083–2099.
18
Web-Based Tools for Protein Classification
Costas D. Paliakasis, Ioannis Michalopoulos, and Sophia Kossida
Summary
Current proteomics technologies generate large amounts of data, among which the investigator has to identify promising diagnostic/prognostic biomarkers as well as potential therapeutic targets. For the latter, classification of proteins into meaningful families is needed. Current databases, featuring a high level of interconnectivity (cross-referencing), provide the tools necessary to bring various data together, facilitating protein classification and the elucidation of protein function and interoperativity. This chapter provides guidelines for exploring the information-rich peptide sequences generated by proteomics methodologies with web-based tools, with the objective of predicting potential protein function. After proper preprocessing (e.g., for internal repeats) of a query protein sequence, known domains can be identified, which aid in dividing the query into smaller meaningful parts. Any unclassified remainder of the protein provides the material for low-level comparative analysis for the discovery of distant homologues or candidate novel domain types to be verified experimentally.
Key Words: protein classification; domain families; recurrent tertiary structural motifs; sequence–structure relationships; (protein) structural evolution; protein database; homology searches; domain inference; protein structure redundancy.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ
1. Introduction
From the times of the “one man-one gene” approach, when individuals were working on single protein sequences, which were decoded from the corresponding DNA sequences, to the era of high-throughput techniques, when massive automated procedures produce large numbers of peptide sequences, one task remains virtually the same: individual protein sequences need classification. We, humans, have an amazing instinctive capability to categorize
objects, even the most complex ones, which in particular can be categorized along various kinds of natural or arbitrary schemes. Proteins feature multiple attributes, such as sequence, structure, function, organelle specificity, evolutionary origin, affinity, isoelectric point, and size (not to mention tissue specificity and antigenicity in higher organisms), all of which offer means for classification. For instance, 2D gel spots corresponding to proteins, which have been separated in terms of their size and isoelectric point, reflect a primary attempt for classification; affinity (e.g., nucleoprotein, lipoprotein, metalloprotein, etc.) and function (e.g., enzyme, carrier) offer another basis for classification, both relating to the chemistry of a protein, and basic spectroscopic data, like those of circular dichroism (which suggest an estimate of the relative amounts of β-stranded vs. α-helical structure), permit classification to the all-α, all-β or mixed α/β classes. However, classification schemes based on general attributes (e.g., the physicochemical properties of proteins) suffer from heterogeneity within their classes. For instance, a number of otherwise unrelated proteins can be classified as “metalloproteins.” In general, two requirements with opposing effects should be satisfied by any classification scheme: specificity, which leads to particularization (i.e., a higher number of narrower classes) and abstraction, which leads to generalization (i.e., a smaller number of wider classes). In the end, a comprehensive and useful hierarchy is a trade-off between specificity and abstraction (i.e., the most general classes possible that are still useful in some desired way). Proteins, the structures of which represent successful solutions to the problem of thermodynamic stability and at the same time can accommodate a biologically useful function, provide the basis of all kinds of radiant variation at the level of protein sequence (and consequently function). 
Each protein variant, that survives the evolutionary pressure of competition against other potential variants, has emerged after a series of modifications of various extents; an explanation is presented later on why this is the preferred mode of action. Common ancestry classification schemes provide the specificity necessary to define sensible protein classes, in contrast to those classification schemes, which follow general features. In the former, all members of each class share a common tertiary structure across very wide evolutionary spans, while similarities at the level of amino acid sequence remain exploitable, even in cases where they are hard to detect. Therefore, evolution-based classification schemes are not driven by our natural impulse to categorize objects drawing arbitrary borderlines, but reflect basic principles of the protein nature. In fact, classification with respect to evolutionary history and structure comes so naturally, that when function is not preserved, we tend to refer to a “-like” form within the same family of proteins, rather than to a different family.
Protein sequences derived from a common ancestor by divergent evolution, share a high degree of similarity (both with each other and naturally with their ancestor, although the latter may be unknown). This similarity persists over quite a wide evolutionary span, before it is worn out by divergence and rendered undetectable by direct pair-wise sequence alignments. Conveniently, it is highly unlikely that proteins without common evolutionary origin share a high degree of similarity; in fact, the higher the similarity the more recent the speciation. It will be shown how these nearest relatives provide the guidelines to identify the features that are crucial for the definition of a family of proteins, before the detection of the most remote relationships is attempted. In conclusion, the amino acid sequence offers a highly specific key to classification, albeit intermediary members, and structure may need to be consulted, before any remote members of a class can be detected. The evolution-based classification schemes, as well as the tools available over the web to explore them, constitute the subject of the following notes. Many researchers in the relevant fields tend to take simple homology searches and domain assignment tools for granted, until an unexpected outcome sheds doubt and confusion; it is the authors’ intention that by the end of this chapter, the reader will be capable to conduct those (otherwise routine) tasks with a higher degree of both awareness and confidence.
2. Materials
The procedure of protein classification comprises several more or less independent steps. Although these steps have been arranged (in the present notes) in the order in which they are usually employed, this order can change, depending on the nature of the information available at each point. Steps can also be omitted if they are unnecessary or their target has already been accomplished (although performing them will provide further reassurance). Each of the steps described is a small protocol in its own right; a number of web tools, some of them in a number of variations, implement each of those steps. However, improved user friendliness on the one hand and users’ skills on the other have made the procedure look like a single protocol; in fact, automation sometimes hides a number of steps, of which only the results can be viewed in the form of a compiled web page. Instead of listing the websites of all relevant tools, a small and comprehensive selection of entry points is suggested in Table 1, via which a wealth of tools is then accessible. All of those websites provide user-friendly interfaces. It is suggested that the reader browse (and become familiar with) at least those main websites before attempting to delve deeper into the realm of web-based analysis tools.
Table 1
Main Entry Points to the World Wide Web for Protein Classification

| Resource | Address | Description |
|---|---|---|
| ExPASy | www.expasy.org | A wide range of software tools for the analysis of protein sequences and structures as well as 2D PAGE, can be found here. It also offers an entry point to a rich collection of other web sites, mainly the SwissProt/UniProt databases |
| BLAST | www.ncbi.nlm.nih.gov/BLAST | A convenient starting point for on-line search of sequence databases (both protein and DNA ones). Many other sites feature some version of BLAST as well |
| EnsEMBL | www.ensembl.org | A collection of complete genomes, which offers an entry point from a different view: that of a genome rather than that of a sequence |
| Pfam | www.sanger.ac.uk/software/pfam | A collection of profiles of protein families against which a sequence can be matched, for initial domain recognition |
| Protein data bank | www.pdb.org and www.rcsb.org | The archive of experimentally determined 3D-structures (by crystallography, NMR, and other techniques) of biological macromolecules (proteins, nucleic acids, sugars, etc.) |
| InterPro | www.ebi.ac.uk/interpro | An effort to integrate information from several diverse sources to a unified comprehensible form |
3. Methods
3.1. Theoretical Issues: Classification Based on Sequence or Structure?
The specifics that define a set of sequences as a protein family (i.e., molecular function and involved amino acid residues, other kinds of sequence fingerprints, post-translational modification, etc.) have to be accommodated within a structural framework (Fig. 1). However, a 3D structure is not reserved for one protein family. In fact, there seems to be a countable set of spatially local packing arrangements between α-helices and β-sheets, which, when combined,
Web-Based Tools for Protein Classification
Fig. 1. Complex shapes can be misclassified by a general property like size, because of small (or larger) parts missing in relation to the simplest forms from which they derive. More specific (“shape-related”) attributes can bring all stars (and parts thereof) together, as they can do with triangles, squares, and circles. Once a proper overall scheme is in place, general attributes (like color) can then detail the distribution within each class.
lead to 3D structural assemblages, stable in terms of thermodynamics and useful in terms of function (1). The participating elements may be distant along the sequence, or they may even belong to different chains. The small number of packing options leads to the occurrence of common 3D structural themes, termed recurrent tertiary motifs, e.g., “up-and-down” helical bundles, barrels, etc. Descriptions at this level of abstraction take into account neither the sequential order of the helices and strands nor their length. Tertiary structural domains in proteins of unrelated evolutionary origin (or function), with apparently unrelated sequences, may adopt the same tertiary motif (usually including further 3D structural elements) [(2); see also Note 1]. It can be claimed that the abstract idea of a recurrent tertiary motif leans toward the basic packing arrangements, whereas the implemented domains are closer to the protein families. The 3D environment of certain positions on the structure (a different set of positions for each recurrent tertiary structural motif) poses physicochemical
     5-vdef sNIR[enpvtpwnpeps]
     : * : * + : *+
R1:  A PVID PT AYID PE ASVI G
R2:  E VTIG AN VMVS PM ASIR S[degm]
R3:  P IFVG DR SNVQ DG VVLH A[letineegepiednivevdgkey]
R4:  A VYIG NN VSLA HQ SQVH G
R5:  P AAVG DD TFIG MQ AFVF -
R6:  K SKVG NN CVLE PR SAAI -
R7:  G VTIP DG RYIP AG MVVT - <sqaea..>
     ------------------------
CNS: a VfIG DN vyIa pQ AvVh(g|s) (Consensus)
       BS#1 T1 BS#2 T2 BS#3
Fig. 2. The seven repeats that form the β-helix in MT-CA demonstrate the level of impact that structure can have on sequence. The β-strands (groups of four residues) are shown separated from the intervening “turns” (groups of two). The turns that connect successive repeats are split: one residue at the left end and a second one, which is missing in some cases, at the right end. Parts of the sequence in square brackets [] are intervening connecting loops; the part in angle brackets <> follows this core motif and is not part of the repeat sequence. The His residues that coordinate the Zn atom are underlined, and stem from positions (within the repeat) marked by a plus sign (+). A partial repeat (every six positions) has been proposed on the basis of other sequences that adopt this structure; the positions marked by stars (*) correspond to main positions in this (partial) repeat, and the ones marked by a colon (:) correspond to the secondary ones. No repetition of this kind (i.e., every six positions) is apparent for any other positions, leaving the 17–18 residue long repeat unit as the only complete one. Positions Asn10–Arg12 (top row) form a small extension of β-sheet #3; the preceding residues are shown for completeness and only to emphasize that the repeat does not extend into them. In the consensus, drawn in the bottom row, the main ingredients of the repeat unit are shown in capital letters.
requirements (which can usually be best met by one or a few amino acid types), thus defining a scale of preferences (3). These preferences are reflected in patterns that may arise at the level of the primary sequences (that adopt the relevant recurrent tertiary motifs), whenever these spatially defined positions are close along the sequence (Fig. 2). It should be noted that these patterns are reflections along the sequence of the abstract tertiary theme and that they are much more general than the detailed protein family-specific sequence fingerprints. Simplified lattice models suggest that a small number of 3D structural motifs set loose requirements that can be met by a large number of sequences, along their evolutionary pathway (4). In this case, nature appears to reuse a
successful structural solution in evolutionarily unrelated sequences (see Note 2). At the other extreme, a large number of 3D structural motifs pose requirements so manifold and exact that only a few sequences can be compatible with them. The resultant patterns of preferences along the sequence occasionally appear strong enough to permit structural motif prediction from the sequence alone (5). It can be claimed that no more than 200 recurrent tertiary structural motifs (the exact number depending on the stringency of their definition) provide the structural basis of perhaps 95% of the nonredundant set of protein structures (2). The average residue coverage is a much smaller figure, because additional structural elements are needed to complete a domain. Conversely, a large number of tertiary structural motifs are so rare that they provide the basis of only the small remaining proportion of protein structures (see Note 3). Detailed specialization into families takes place within this structural framework: Chothia (6) estimated long ago that 95% of the protein information to be discovered will derive from no more than 1000 protein families. In fact, for a substantial (and growing) proportion of newly identified protein sequences, enough information already exists in the databases to build a 3D model (7). The reason for this lies in a simple fact: during the creation of new protein families, the relatively small number of structural alternatives directs nature to a strong preference for the reuse of already successful solutions at the level of sequence (not structure), especially when similar problems are to be solved, rather than discovering new ones on the basis of the same or a different structure. The traits inherited through the reuse of sequences are usually the ones exploited in protein classification.
On the other hand, this small set of structural motifs, the ones easily accessible to protein families of unrelated origin and/or function, occasionally leads otherwise unrelated proteins to elevated sequence similarity scores (which sometimes appear too high to be explained by chance), just because they fold in the same manner (see Note 4). The traits being developed (as opposed to being inherited) reflect convergent evolution. Protein structure has also served as the basis of classification in some schemes. However, the theoretical considerations discussed here (in particular, the fact that unrelated proteins may fold in the same way) suggest that classification on the basis of 3D structure alone will tend to be on a coarser scale. On the other hand, the availability of detailed structural data for a (preferably representative) member of a protein family, experimentally derived by means of X-ray crystallography or NMR spectroscopy, besides facilitating other procedures (e.g., structure-based protein design), offers a valuable aid in sequence-based classification. It provides very solid ground on which to assess any sequence-based classification, and a great tool for detecting the most remote members. However, unless classifying protein structure per se
(rather than proteins in their entirety), it appears that a common structural architecture alone is not sufficient evidence to place proteins in the same class. Evolutionarily refined variants of tertiary structural domains, “similar-yet-different” within a given repertoire, appear in different combinations with those of other repertoires: a domain for a different cofactor or regulatory factor (e.g., GDP vs. ADP) may be combined with a catalytic domain for a slightly different substrate (fructose vs. glucose). Thus, the most complicated and best tuned series of (simpler) functions necessary for life can be accomplished in a spatially ordered and efficient manner. On the other hand, this fact makes it essentially imperative that any classification proceed in terms of domains: it suffices to describe any sequence in question as comprising “an N-terminal domain of type X and a C-terminal domain of type Y, joined by a loop region of type Z”; otherwise, extensive subtyping and the “Russian doll” effect (see Note 5) will soon be confronted. In practice, the classification procedure starts with the detection of some similarity between a protein (or part thereof) and a prototype (e.g., a profile extracted from a multiple alignment or a structure through which it is threaded), which is too high to be explained by chance alone. The tools to demonstrate this similarity are presented under Subheading 3.2. In any case, it will be the network of similarities within a set of data (sequences, structures, etc.) that will clarify the underlying reason for the observed similarity.
3.2. The Practical Side
It cannot be stressed enough that most protein sequences are nowadays translations of the relevant nucleic acid sequences. It is important to identify cDNA originals if possible, to ensure that the nucleic acid sequence employed corresponds to the protein in a reliable way.
When the original data are supplied in the form of genomic DNA fragments, introns could still be included and alternative splicing remains a possibility. Current gene recognition programs like GENSCAN (8), normally expected in genome-oriented databases like EnsEmbl (9) (see Note 6), can efficiently detect and remove introns, but errors may still infiltrate. If this is the origin of the protein data, certain precautions should be taken:
• Search for relevant proteins with reliable sequences, e.g., by means of a preliminary Basic Local Alignment Search Tool (BLAST) (10) search against SwissProt (11).
• Align the sequence of interest to any trustworthy matches and observe the pattern of conservation. Sudden insertions in the sequence in question (especially ones with highly biased composition, short tandem repeats, or repetitions of other parts of the protein, especially partial ones, etc.) do not necessarily represent extra features or minidomains; they may be introns that were not removed. Conversely, deleted parts may have been mistakenly considered to be introns.
• Isolate “candidate” insertions and try to find similar sequences in the databases; see if any trustworthy match makes sense in terms of biology.
• Alternatively, try to find a protein in the Protein Data Bank (PDB) (12) that is similar (even remotely) to the one in question (excluding the insert) and whose 3D structure is experimentally known (see Note 7). The location of the candidate insertion/deletion on the structure may verify or reject it.
• Parts of the query protein matching expressed sequence tags (ESTs) (13) provide an extra source of verification (see Note 8): a part matching an EST is an expressed part.
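The alignment check in the list above can be partly mechanized. The following is a minimal sketch, not part of the original protocol: it assumes that the query and a trusted reference have already been aligned by some external tool (with "-" as the gap character), and it lists the query stretches that face only gaps in the reference. The function name `candidate_insertions` and the `min_len` cutoff of 10 residues are illustrative assumptions.

```python
def candidate_insertions(query_aln, reference_aln, min_len=10):
    """List query stretches that are aligned only to gaps in the reference.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    Returns (start, end, segment) tuples; start and end are 0-based,
    end-exclusive positions in the ungapped query. min_len is an arbitrary
    illustrative cutoff, not a recommended value.
    """
    if len(query_aln) != len(reference_aln):
        raise ValueError("aligned rows must have equal length")

    hits = []
    q_pos = 0          # current position in the ungapped query
    run_start = None   # ungapped query position where the current run began
    run = []           # residues of the current run

    for q, r in zip(query_aln, reference_aln):
        if q != "-" and r == "-":
            # query residue with no counterpart in the reference
            if run_start is None:
                run_start = q_pos
            run.append(q)
        elif run_start is not None:
            # the run has ended; keep it only if it is long enough
            if len(run) >= min_len:
                hits.append((run_start, run_start + len(run), "".join(run)))
            run_start, run = None, []
        if q != "-":
            q_pos += 1

    # a run may extend to the very end of the alignment
    if run_start is not None and len(run) >= min_len:
        hits.append((run_start, run_start + len(run), "".join(run)))
    return hits
```

Each reported segment can then be searched against the databases on its own, or located on a known 3D structure, as the remaining items of the list suggest.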
Other criteria may apply to verify the integrity of a processed putative gene. For example, if the protein has been biochemically characterized, then any experimentally observed property must match that of the sequence predicted from the gene (or there must be a good reason why it does not). Another very serious issue is the fact that many annotations are automatically transferred between similar sequences of the same or different databases. Even SwissProt entries are crowded with annotations assigned “by similarity.” The number of proteins with primary annotations is many orders of magnitude smaller than the number of annotated sequences in the current databases. These annotations should be considered as hints that can direct experiments to promising routes rather than as secure data.
3.2.1. Preprocessing the Query
A preliminary check of the protein sequence itself is recommended. Repeats and parts of low complexity are of particular interest.
3.2.1.1. REPEATS
Regularities in biological macromolecular structure (like the helical nature of DNA or the super-coiled structure of some protein assemblies) and multimerization create room for repetitions along protein sequences. Repeats can range in length from a few amino acid residues to complete domains (e.g., as a result of domain duplication). In the latter case, the repetition count is usually small, just two to three copies (14), although much higher counts do occur. When catalytic domains are repeated, the repetition may have no basis in structural regularities; it may, for instance, reflect a need for efficiency (e.g., cooperativity between different copies of a domain). In database searches with multidomain protein queries, it is in any case recommended to treat different domains separately, for reasons explained later on; the difference here is that the separate copies can be aligned, and their consensus (or profile) can be extracted and serve as the query.
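As an illustration of extracting a consensus from aligned copies, the sketch below takes the seven repeat cores shown in Fig. 2 (strands and turns only, without the flanking turn residues and connecting loops) and reports the most frequent residue in each column. The function name, the 50% cutoff for capital letters, the alphabetical tie-break, and the use of "." for columns in which no residue occurs twice are assumptions made here for illustration. The consensus drawn in Fig. 2 may reflect information beyond these seven rows, so the two need not agree.

```python
from collections import Counter

# Repeat cores R1-R7 as printed in Fig. 2: BS#1 (4) + T1 (2) + BS#2 (4) + T2 (2) + BS#3 (4)
REPEAT_CORES = [
    "PVID" "PT" "AYID" "PE" "ASVI",
    "VTIG" "AN" "VMVS" "PM" "ASIR",
    "IFVG" "DR" "SNVQ" "DG" "VVLH",
    "VYIG" "NN" "VSLA" "HQ" "SQVH",
    "AAVG" "DD" "TFIG" "MQ" "AFVF",
    "SKVG" "NN" "CVLE" "PR" "SAAI",
    "VTIP" "DG" "RYIP" "AG" "MVVT",
]


def consensus(copies, strong=0.5):
    """Column-wise majority consensus of pre-aligned, equal-length copies.

    Upper case: the top residue occupies at least `strong` of the column.
    Lower case: the top residue occurs at least twice.
    '.':        every residue in the column is different.
    Ties are broken alphabetically (an arbitrary, illustrative choice).
    """
    if len({len(c) for c in copies}) != 1:
        raise ValueError("copies must be pre-aligned to equal length")

    out = []
    for column in zip(*copies):
        counts = Counter(column)
        top = max(counts.values())
        residue = min(r for r, c in counts.items() if c == top)
        if top / len(copies) >= strong:
            out.append(residue.upper())
        elif top >= 2:
            out.append(residue.lower())
        else:
            out.append(".")
    return "".join(out)


if __name__ == "__main__":
    print(consensus(REPEAT_CORES))
```

A plain majority vote ignores similarity between residue types (e.g., Ile vs. Val), which is one reason why a profile is usually a better query than a consensus string.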
On the other hand, short tandem repeats (e.g., about 10 amino acid residues long or shorter) normally reflect some structural regularity. In a dot-plot style alignment of a protein sequence to itself, they manifest themselves as a (moderate-to-high) number of tracks that run parallel to the main diagonal (and to each other) in a regular manner (Fig. 3). Since combinations of parts coming from different tracks produce significant alternative alignments, procedures that attempt to report all possible alternative alignments between two proteins will be severely confounded (see Note 9 on BLAST in particular). A consensus or a profile may again be extracted by a proper alignment of the repeats. However, statistically significant matches cannot be expected for a resultant query of only (say) 6 or 12 amino acid residues. One possible cure is to concatenate a small number of repeats, to produce a query no longer than 50 amino acid residues (see Note 10 on why 50). The small number of repeats (e.g., four repeats of length 11) helps to avoid the explosion of alternatives, although a few of them cannot be completely avoided. If this step is taken, it is suggested that the output of a dot-plot utility (such as DOTLET, a Java-based tool hosted on the ExPASy server; Table 1) be consulted at all times.
3.2.1.2. Parts of Low Complexity
Low complexity occurs when some part of the sequence comprises only a few types of amino acid residues, leading database queries to nonspecific results (see Note 11); the situation can be even worse if some of these types are similar to each other. In general, it is important to know beforehand any significant deviations in amino acid composition, as well as the presence of special features such as signal peptides or groupings of biologically relevant charged side chains (see Note 12). Relevant search procedures, like BLAST (10), detect stretches of low complexity and offer to ignore them during the search; however, what appears to be a part of low complexity may be, e.g., a transmembrane stretch. The action to take depends on both the importance and the position of the stretch:
• If a single transmembrane part makes sense (or is known to exist), the extra- and intracellular moieties can be separate queries.
• A signal peptide (especially when located at the extreme N-terminus) can usually be excluded from the procedure, profitably or at least without problem.
• A stretch of low complexity that appears to be of no special significance in terms of structure/function/evolution is best left to the search procedure to mask.
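Both short repeats and low-complexity stretches show up in a self-comparison dot-plot. The sketch below is a deliberately simplified stand-in for a tool such as DOTLET: it scores windows by plain residue identity rather than by a substitution matrix, and the function names, window size, and cutoff are assumptions chosen for illustration only. Counting the dots per off-diagonal offset gives a quick numerical view of the parallel tracks: a tandem repeat of period p produces peaks at offsets p, 2p, and so on, whereas a low-complexity block produces dots at nearly every small offset.

```python
def self_dotplot(seq, window=5, min_matches=4):
    """Dots of a self-comparison, upper triangle only.

    Returns the set of (i, j) window start positions, i < j, whose windows of
    length `window` share at least `min_matches` identical positions.
    Identity scoring is an illustrative simplification.
    """
    n = len(seq) - window + 1
    dots = set()
    for i in range(n):
        for j in range(i + 1, n):
            matches = sum(a == b for a, b in
                          zip(seq[i:i + window], seq[j:j + window]))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def diagonal_counts(dots):
    """Number of dots on each off-diagonal, keyed by the offset j - i."""
    counts = {}
    for i, j in dots:
        counts[j - i] = counts.get(j - i, 0) + 1
    return counts


if __name__ == "__main__":
    # hypothetical toy sequence: three exact copies of a 10-residue unit
    toy = "ACDEFGHIKL" * 3
    print(diagonal_counts(self_dotplot(toy, window=5, min_matches=5)))
```

The quadratic loop is adequate for single proteins; it is not meant for database-scale comparisons.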
Relevant tools are available from the Web (e.g., the ExPASy site). Alternatively, a simple dot-plot style alignment of the protein sequence can be run vs. itself. Besides repeats, this will reveal areas of low complexity as square blocks of elevated average score, symmetrical around the main diagonal (Fig. 3). If low
complexity occurs within the boundaries of a repeat, similar square blocks will appear around the relevant parallel off-diagonal tracks.
3.2.2. Inference of Domains
In the spirit of the theoretical analysis earlier in this chapter, classification can take the form of assigning parts of the sequence to domains. Hence, using a domain inference tool like the ones offered by Pfam (15) and SMART (16) should be among the first steps in the classification of a protein based on its sequence (see Note 13). This information serves to divide the sequence of interest into pieces and handle them separately (see Note 14). Given the high coverage achieved by those collections (more than 75% of proteins have at least one domain recognized by them, and on average about two-thirds of the length of a protein can be described this way) (15), some protein sequence classification efforts end here (see Note 15). In fact, database search procedures should soon be expected to exploit high-level features extracted from the query and the relevant sequences, resorting to amino acids alone only for those parts where such attempts fail.
3.2.3. Querying Other Databases
Despite the current high coverage of protein sequences in terms of known domains, parts of these sequences still elude assignment. These parts may simply be members too distant from the families they belong to, which have therefore failed the thresholds of automatic procedures. Those parts should be isolated, properly preprocessed (mainly for compositional biases), and queried against SwissProt and PDB.
• Entries (records) in SwissProt (11) offer rich annotation and cross-references to a number of resources, all in a mainly human-readable form and via a user-friendly interface. The high level of curation (including annotation derived by similarity) will save duplicate effort and may provide valuable hints on how to move on.
Fig. 3. (A) Schematic representation of a dot-plot style alignment of a protein against itself; to depict the special cases presented in the text, the protein is supposed to feature two copies of some domain, a low complexity N-terminus, and a C-terminal part dominated by some short internal repeat, except for a tail, which appears unique. (B) Alignment of a small part (from a real protein) of low complexity against itself. The situation here is worse than suspected, because the few types of amino acid residues are related to each other (alanine to valine and glycine and, to a lesser extent, to proline and serine).
• Search for similar sequences in PDB (12) will reveal experimentally determined 3D structures of protein instances, possibly related (e.g., through evolution) to the protein of interest. A 3D structure offers a model (even before a model of the query sequence is built, following this information) to think on, a toy on which to visualize and handle data in far more efficient ways (see Note 16).
If domains are inferred by the relevant procedures (or supplied by SwissProt annotation) and/or long stretches (say 30–40 amino acid residues or longer) of special behavior are observed, it is a good idea to handle each sequence part separately, or in small meaningful combinations; for instance, there may be no reason to treat, say, a propeptide separately from the main body of the domain it belongs to (see Notes 17 and 18). If a few top hits of a database search can be aligned to the query with confidence, and the next ones are marginal (see Note 18), the multiple alignment of the best hits (including the query) should be converted to some kind of profile [e.g., a position-specific scoring matrix (PSSM)] and the database should be scanned with the resulting profile (see Note 19). The marginal hits of the initial query (i.e., the protein of interest) that match positions conserved throughout the profile will have their statistical significance increased and they will surface. If domain inference programs can detect some kind of domain on those (initially marginal) hits, this information can then be transferred to the initial query with confidence (recall that the query is a part on which no domain was detected). Sometimes even the few top hits will be marginal (see Note 18). In that case, each of the “best” marginal hits should be used as a query, and a number of homologues (about 10; see Note 20) should be collected and aligned without the initial query (i.e., the protein of interest). Some kind of profile (e.g., a PSSM) should be produced from each of those alignments, and the relevant part of the initial query should be aligned against it. If the initial query matches the profile at conserved positions (see Note 21), the hit was not fortuitous. Again, if domain inference programs can detect some kind of domain along the sequences that formed the profile, this information can then be transferred to the initial query with confidence.
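To make the idea of a profile concrete, the sketch below builds a very simple log-odds PSSM from a block of pre-aligned homologues and slides it, without gaps, along a query. It is only an illustration under stated assumptions: a uniform background frequency, a flat pseudocount, no sequence weighting, and no gap handling, none of which reflects how PSI-BLAST or the Pfam tools actually compute or calibrate their profiles. The function names and the toy sequences in the usage lines are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Log-odds (base 2) position-specific scoring matrix.

    `alignment` is a list of equal-length, pre-aligned sequences. Characters
    outside the 20 standard residues (e.g., gaps) are ignored in the counts.
    A uniform background of 1/20 is assumed for simplicity.
    """
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("sequences must be aligned to equal length")

    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in alignment if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_ungapped_match(pssm, query):
    """Slide the profile along the query; return (best_score, start).

    Residues not covered by the matrix contribute a score of 0.
    Returns (-inf, -1) if the query is shorter than the profile.
    """
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        score = sum(pssm[k].get(query[start + k], 0.0) for k in range(width))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    block = ["VYIGNN", "VFIGDN", "VTIGAN", "AAVGDD"]   # hypothetical aligned hits
    score, start = best_ungapped_match(build_pssm(block), "MKKVYIGNNPLE")
    print(round(score, 2), start)
```

Because conserved columns carry the largest scores, a query that agrees with the block at exactly those columns scores well even when its overall identity to any single member is low, which is the effect the text relies on.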
Other databases provide high-level annotation on specific tasks. InterPro (17) offers a convenient entry point to a number of them, especially for manual sequence classification (as opposed to some massive automated procedure). SuperFamily (18) builds its information on the classification of 3D structures (a hit here implies structural similarity regardless of common function or evolutionary origin). PRINTS (19) and PROSITE (20) are further examples, and one may continue with a long list in which each member targets a specific problem (e.g., if the protein of interest is found to be a peptidase, MEROPS (21) may be consulted for further relevant classification).
4. Notes
1. Often it is just a simple operation (e.g., a single function) that (part of) a sequence implements as a 3D domain. For instance, there are tertiary structural domains that simply bind a cofactor and feature an allosteric position, where some regulatory factor (e.g., ADP) will dock to exert its role. The active site may reside on a separate domain, or may be shared between two of them, within the range of the cofactor.
2. Unpublished work (C.D.P., Ph.D. thesis) in continuation of (3) suggests that the requirements set (albeit too vaguely) by an α-helical “up-and-down” bundle, which is an abundant tertiary structural motif, raise the relevant parts of the sequence to the extreme 0.1–1% of a suitable distribution when proteins in a databank are scored for compatibility. This shift is not enough for structure prediction from the sequence alone (too many false positives), but it still reflects a possibly minimal set of requirements posed by the structure for compatible sequences.
3. There is a tendency to treat the observed structural solutions, i.e., the recurrent tertiary structural motifs and domains, as the end product of evolution in our days. In fact, all the preceding evolutionary steps (as well as the future ones, probably) had to employ one of the solutions provided in this relatively narrow set. If we depict this set so that similar architectures are close to each other, then “evolution” is a “walk” through this set. Whether this set is continuous or partitioned in a discontinuous manner is the subject of ongoing research.
4. A continuum is thus established in the scale of similarities between protein sequences: at one end, the small biases due to simple facts (e.g., two transmembrane pieces are coincidentally matched); in the middle of the scale, remote similarities due to common structural architecture; and at the other end, the 30% (or more) identity observed, due to common origin, between a mammalian protein and its bacterial homologue (and, usually, more than 80%, e.g., between mammals, etc.).
5. This effect characterizes the situation in which a particular domain includes a smaller one, plus some extra structural elements (“decorations”); then, the new total constitutes part of a larger domain, which includes some further structural elements, and so on. Orengo and coworkers (2) have presented a number of examples in their series of papers on the classification of protein structure.
6. The version of BLAST featured in EnsEmbl can run against the results of GENSCAN; this does not simply translate genomic DNA into Open Reading Frames (ORFs) before comparison, but it also attempts to “splice” it, after predicting and removing potential introns. Other task-specific databases feature relevant tools.
7. The version of BLAST at the National Center for Biotechnology Information (NCBI) has access to all protein sequences of known structure. Alternatively, the PDB resource (Table 1) can be directly accessed for this purpose, losing, however, the interconnection to other databases offered by NCBI.
8. As in Note 7, access through the means provided by NCBI is recommended.
9. For example, BLAST seeks all the instances where a small part of the query matches a database sequence. Then, to form longer alignments, BLAST, depending on its version, either expands these “seed alignments” to contiguous subalignments, uninterrupted by gaps, which are then joined in all valid combinations, or expands the seeds in a gapped alignment fashion. The presence of short repeats may make the output particularly hard to follow, due to the numerous alternatives.
10. Sander and Schneider (22) suggest that the minimum percentage of identity between two proteins that is required to imply structural similarity converges to about 27% for a common alignment length of about 80 amino acids. However, the change in the range of 50–80 is too small to justify the inclusion of further repeats, which would increase the number of alternative alignments. See also Note 18.
11. For instance, assume that a stretch about 20 amino acids long or longer is dominated by leucine, isoleucine, and perhaps a couple of phenylalanines. Not only will this part be nonspecifically matched to any sequence that features a similar deviation in composition, but the resulting alignment will also appear unstable in this part, because of the numerous and almost equivalent alternative ways in which two stretches of this kind can be aligned.
12. For example, a large deviation toward lysine and alanine will make the sequence look like a histone. When a databank is scanned for similar peptide sequences, the results will tend to include nonspecific stretches rich in positive (and, to a lesser extent, negative) charges in general.
13. The NCBI/BLAST server (Table 1) offers CDD (conserved domain databank), which is based on both Pfam and SMART, further including collections internal to NCBI. Other servers may offer similar compilations. However, for detailed inquiries one may need to resort to the original resources. The information presented by the original collection can be much richer. Furthermore, each specialized collection offers tools for flexible searches in terms of combinations of various domains, to help detect proteins of similar architecture, reference similarities to other related domains, and so on.
14. The fact that tertiary structural domains tend to behave independently should be exploited. Bench work can usually be facilitated by studying isolated domains; e.g., if some part of a protein makes the molecule hard to crystallize, the relevant information (if available) could indicate which part to remove. Information derived using domain inference tools can serve to divide a sequence of interest into meaningful pieces. Bioinformatics work may profit similarly, e.g., during database searches: assume, for example, that a protein includes a general hydrolase domain (e.g., an esterase), which is found in many combinations with other domains that particularize its use, and that it also contains a domain that is specific for the family this sequence belongs to. It will be the latter that will boost the most relevant sequences to the top of the sorted list of BLAST results; accordingly, it will be the one to drive the query protein to the correct subfamily within the framework of a larger family.
15. In the case of multidomain proteins, each hit to a constituent domain (or a significant part of it) signifies the existence of a related part in the databank. Occasionally, some domains will seem to be missing: either the relevant part of the sequence appears deleted or an expected domain is not recognized along it. Given the statistical nature of the recognition procedure and the nucleotide nature of the underlying primary data, the tempting conclusion that this domain/part is not present is by no means secure.
• If the relevant part of the sequence is present, you may check whether domains that were recognized by domain inference programs along remote homologues of this part can be transferred by means of alignment involving preformed multiple alignments, as described in Subheading 3.2.3 for the case of remote hits.
• If the relevant part of the sequence seems absent, then, despite the efficiency of genetic data manipulation procedures, parts of the sequence may have been accidentally considered as introns. Once some major part of a multidomain protein has been located on the complete genome, the hits should serve as pointers to the location to be searched more carefully. Perhaps the next generation of data-mining will perform this retro-search of missing parts automatically (as iterative BLAST is performed today). Until then, and in spite of the times of high-level annotation (which will retrieve the major part of the information being hunted), one should be ready for a straightforward TBLASTN of minor parts of the sequence in hand, to rule out their existence conclusively and beyond reasonable doubt.
16. When an experimentally determined 3D structure for a similar sequence exists in PDB, the sequence of interest and the matching structure can be input to some automated model building server (like the SwissModel Server; some servers may also need a ready-made alignment between the two) to obtain a 3D structural approximation of the query protein. If nothing else, inspection of this model will explain any mutational data available and will reveal key locations for experimentation by means of site-directed mutagenesis and other kinds of modification and querying (instead of blind trials along the sequence), in order to infer the mechanism of function or other valuable information. If the quality of the alignment is poor, but both the sequence and the structure can be aligned to, e.g., a profile, this intermediary link can mediate the alignment between the protein of interest and the distantly related sequence of known structure. Alternatively, the remote match may serve as the query to retrieve further sequences homologous to the hit, in order to align the original query to their preformed multiple alignments, as described under Subheading 3.2.3.
17. The expectancy value (E-value) provided with the sorted hit list by BLAST depends on the product of the length of the database and the length of the query. Assuming that matching counterparts exist for just one of the domains and that this domain comprises a small part of the total protein, BLAST may
Web-Based Tools for Protein Classification
18.
19.
20.
21.
365
miss matching hits of marginal similarity, just because the length product was unnecessarily (thanks to domain independence) too large. The expectancy value should be regarded as only a rough measure. It would be a more accurate measure of the expected number of hits, if databases were nonredundant (i.e., they contained absolutely nonhomologous sequences) and there were no biases toward specific types of amino acid residues or toward sequence patterns (e.g., the amphipathic ones met in -helices, which account for about one quarter of protein structure in general). Besides, Sander and Schneider (22) have long shown that as soon as a subalignment of a given size exceeds a relevant level of identity, 3D structural similarity can be assumed, independently of the length of the proteins which participate in the comparison or the number of sequences which the query is compared to. They suggest a threshold t(L) = 290.15 × L?0562 for L < 80 and about 27% for L > 80; cases with identity level higher than t(L) assume related structure, allowing only a small acceptable number of false positives. Alignments lying at the lower side of the line as this derives from the equation mentioned above, do not necessarily signify proteins of unrelated structure. For them, structural similarity, if existant, cannot be simply asserted with confidence. Similarity is rendered more and more improbable as the relevant figures decrease. Details on how to make or use a PSSM may change with implementation. It is worth spending some time on the on-line help offered on PSSM under their implementation at the NCBI. In any case, Clustal (23) may be used to align a sequence to a block of prealigned sequences, or even to two preformed multiple alignments. In both cases, if conserved positions in the “reference” block are conserved along the query sequence (or the query block) the match is reliable. 
Pfam (15) offers tools for another approach involving hidden Markov models, the explanation of which is beyond the scope of the present notes. Following the results of Henikoff and Henikoff (24,25), it seems that about 10 homologues are usually enough, with the reservation that they should cover, if possible, the whole range of similarities from 90% down to 30–40%. If all of them are too similar to each other, it will be as if the same sequence were included 10 times. If all of them are too dissimilar to each other, the risk of mistakes in their multiple alignment will be too high. As a reassurance, if a hit is correct, some of the sequences that are homologous to the hit should have appeared in the hit list of the initial search (i.e., the one in which the protein of interest was the query sequence). If just one protein from a large family was reported, chances are that the hit was coincidental.
References
1. Richardson J.S. and Richardson D.C. (1989) “Principles and patterns of protein conformation.” In: Fasman G. (ed) “Prediction of Protein Structure and the Principles of Protein Conformation.” Plenum Press, NY, pp 1–98.
2. Orengo C.A. and Thornton J.M. (2005) “Protein families and their evolution – a structural perspective.” Annu. Rev. Biochem. 74, 867–900.
3. Paliakasis C.D. and Kokkinidis M. (1992) “Relationships between sequence and structure for the four-α-helix bundle tertiary motif in proteins.” Protein Eng. 5, 739–748.
4. Lattman E.E., Fiebig K.M. and Dill K.A. (1994) “Modeling compact denatured states in proteins.” Biochemistry 33, 6158–6166.
5. Lupas A., Van Dyke M. and Stock J. (1991) “Predicting coiled-coils from protein sequences.” Science 252, 1162–1164.
6. Chothia C. (1992) “One thousand families for the molecular biologist.” Nature 357, 543–544.
7. Schwede T., Kopp J., Guex N. and Peitsch M.C. (2003) “SWISS MODEL: an automated protein homology modeling server.” Nucleic Acids Res. 31, 3381–3385.
8. Burge C. and Karlin S. (1997) “Prediction of complete gene structures in human genomic DNA.” J. Mol. Biol. 268, 78–94.
9. Hubbard T., Andrews D., Caccamo M., et al. (2005) “Ensembl 2005.” Nucleic Acids Res. 33, D447–D453.
10. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W. and Lipman D.J. (1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res. 25, 3389–3402.
11. Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M., Martin M.J., Natale D.A., O’Donovan C., Redaschi N. and Yeh L-S.L. (2005) “The universal protein resource (UniProt).” Nucleic Acids Res. 33, D154–D159.
12. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N. and Bourne P.E. (2000) “The protein data bank.” Nucleic Acids Res. 28, 235–242.
13. Boguski M.S., Lowe T.M.J. and Tolstoshev C.M. (1993) “dbEST – database for expressed sequence tags.” Nature Genet. 4, 332–333.
14. Apic G., Gough J. and Teichmann S.A. (2001) “Domain combinations in archaeal, eubacterial and eukaryotic proteomes.” J. Mol. Biol. 310, 311–325.
15. Bateman A., Coin L., Durbin R., Finn R.D., Hollich V., Griffiths-Jones S., Khanna A., Marshall M., Moxon S., Sonnhammer E.L.L., Studholme D.J., Yates C. and Eddy S.R. (2004) “The Pfam protein families database.” Nucleic Acids Res. 32, D138–D141.
16. Letunic I., Copley R.R., Pils B., Pinkert S., Schultz J. and Bork P. (2006) “SMART 5: domains in the context of genomes and networks.” Nucleic Acids Res. 34, D257–D260.
17. The InterPro Consortium; Mulder N.J., Apweiler R., Attwood T.K., et al. (2005) “InterPro, Progress and Status in 2005.” Nucleic Acids Res. 33, D201–D205.
18. Madera M., Vogel C., Kummerfeld S.K., Chothia C. and Gough J. (2004) “The SUPERFAMILY database in 2004: additions and improvements.” Nucleic Acids Res. 32, D235–D239.
19. Attwood T.K., Bradley P., Flower D.R., Gaulton A., Maudling N., Mitchell A.L., Moulton G., Nordle A., Paine K., Taylor P., Uddin A. and Zygouri C. (2003) “PRINTS and its automatic supplement, preprints.” Nucleic Acids Res. 31, 400–402.
20. Hulo N., Bairoch A., Bulliard B., Cerutti L., de Castro E., Langendijk-Genevaux P.S., Pagni M. and Sigrist C.J.A. (2006) “The PROSITE database.” Nucleic Acids Res. 34, D227–D230.
21. Rawlings N.D., Morton F.R. and Barrett A.J. (2006) “MEROPS: the peptidase database.” Nucleic Acids Res. 34, D270–D272.
22. Sander C. and Schneider R. (1991) “Database of homology-derived protein structures and the structural meaning of sequence alignment.” Proteins: Struct. Fun. Gen. 9, 56–68.
23. Thompson J.D., Higgins D.G. and Gibson T.J. (1994) “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice.” Nucleic Acids Res. 22, 4673–4680.
24. Henikoff S. and Henikoff J.G. (1992) “Amino acid substitution matrices from protein blocks.” Proc. Natl. Acad. Sci. USA 89, 10915–10919.
25. Henikoff S. and Henikoff J.G. (1993) “Performance evaluation of amino acid substitution matrices.” Proteins Struct. Fun. Gen. 17, 49–61.
19
Open-Source Platform for the Analysis of Liquid Chromatography-Mass Spectrometry (LC-MS) Data
Matthew Fitzgibbon, Wendy Law, Damon May, Andrea Detter, and Martin McIntosh
Summary
The analysis of protein mixtures by liquid chromatography-mass spectrometry (LC-MS) requires tools for viewing and navigating LC-MS data, locating peptides in LC-MS data, and eliminating low-quality peptides. msInspect, an open-source platform, can carry out these steps for single experiments and can align and normalize peptide features in comparative studies with multiple LC-MS runs. In addition, msInspect can analyze quantitative studies with and without isotopic labels to generate peptide arrays.
Key Words: liquid chromatography-mass spectrometry; peptide identification; filtering; alignment; quantitation.
1. Introduction
msInspect is an open-source platform comprising algorithms and visualization tools that process liquid chromatography-mass spectrometry (LC-MS) data files to locate peptides in two dimensions [time and mass over charge (m/z)] and perform various analyses on them (1). msInspect can be used for:
• Visually inspecting LC-MS spectra and peptide features
• Automatically locating peptide features in high mass accuracy MS spectra
• Filtering peptide features by various quality measures
• Quantitating label-free peptide features between experiments via alignment and normalization of the data to create a peptide array
• Identifying isotopically labeled pairs [e.g., isotope-coded affinity tagging (ICAT), stable isotope labeling with amino acids in cell culture (SILAC)] for quantitative peptide analysis within a single experiment
• Comparing and developing MS feature-finding algorithms
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols Edited by: A. Vlahou © Humana Press, Totowa, NJ
msInspect implements multiple algorithms specifically designed for LC-MS data. The signal processing component exploits the two-dimensional nature of the data to identify coeluting isotopes and then groups them based on the similarity of the observed isotopic distributions to those of naturally occurring peptides. The alignment method estimates the underlying nonlinear mapping of retention times between experiments. The normalization approach (2) adapts methods developed for genomic arrays to accommodate natural variation of LC-MS signal intensities across runs. Ultimately, the goal of msInspect is to mine LC-MS data and to produce peptide arrays that can then be analyzed using tools traditionally applied to genomic arrays. msInspect also contains a complete Accurate Mass and Time (AMT) analysis workflow (3). These analytical techniques combine LC-MS and LC-MS/MS data in order to expand peptide coverage and enhance the confidence of peptide identifications.
2. Materials
To run msInspect, the Java Runtime Environment must be installed. To perform alignment of multiple runs, the R environment must also be installed. Both programs must be properly configured and on the computer’s PATH. Information on acquiring these software packages is provided in Subheading 2.1 below. Please contact your local IT systems support group for details on installing this software properly. msInspect reads mass spectra from files in the open mzXML format (4). For background on mzXML and information about converting data from particular instruments to mzXML, see Note 1.
2.1. Software
1. msInspect is written in platform-independent Java and requires that the Java Runtime Environment, version 1.5 or later, be installed and on the computer’s PATH. Installation of the Java Runtime Environment will also install the latest version of Java Web Start, which allows msInspect to be run without needing to explicitly install it or update it as new versions are released (see Note 2).
a. Windows, Linux, and Solaris users can download “J2SE 5.0” from http://java.sun.com/j2se/1.5.0/download.jsp.
b. Macintosh users running Mac OS X v10.4 or later can download Java from http://www.apple.com/support/downloads.
2. To align multiple runs into a peptide array, the R environment for statistical computing, version 2.1.0 or later, must be installed and on the computer’s PATH. R executables for various operating systems are available from http://www.r-project.org.
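Whether both prerequisites are reachable on the PATH can be checked before launching msInspect. The following is a minimal sketch using only the Python standard library; the executable names `java` and `R` are assumptions about how the two environments are installed on a given system, and the function name is hypothetical.

```python
import shutil


def find_missing_tools(tools=("java", "R")):
    """Return the names in `tools` that cannot be located on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found on PATH")
```

This confirms only that the executables can be located; it does not check their versions.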
2.2. Hardware
msInspect will run on any computer that supports the software listed in Subheading 2.1. For large input files, typical of high mass accuracy measurements, feature extraction can require several hundred megabytes of memory (see Note 3). msInspect has been tested on computers running Windows XP, GNU Linux, and Mac OS X with at least 1 GB of main memory.
2.3. Data Files
msInspect will open any version 2.0 mzXML file containing MS1 data. However, msInspect was designed using high-resolution liquid chromatography-electrospray ionization-time of flight mass spectrometer data, so it may not perform as well with an mzXML file from another type of mass spectrometer (e.g., a matrix-assisted laser desorption-time of flight mass spectrometer). Sample mzXML files that may be used to follow all of the steps in Section 3 are available on the Web (see Note 4).
3. Methods
3.1. Viewing and Navigating LC-MS Data
1. Launch msInspect from http://proteomics.fhcrc.org/download/tools/msInspect/viewer.jnlp by clicking on “Launch msInspect with Java Web Start.” “Fred Hutchinson Cancer Research Center” must be accepted as a trusted software publisher for the download to be completed.
2. Upon launching msInspect, the Open File dialog box will automatically open. Browse for the mzXML file to be viewed, select the file, and left click the Open button (see Note 5). You may load a different mzXML file by selecting File > Open from the main msInspect menu bar.
3. The msInspect window (Fig. 1) contains several panes for viewing and navigating the MS run:
a. An image of the MS run will be displayed in the Image Pane (the largest pane in the center of the msInspect window).
b. The Properties Pane (left side of the window) will display detailed information from the mzXML file loaded. This pane will later be used to display details of individual peptide features. It can be hidden with Windows > Show/hide properties.
c. The Detail Pane is on the right side of the window and the Chart Pane is at the bottom part of the window. Each provides a more detailed view of a region of the spectrum. The Detail Pane provides a zoomed view of the area selected in the full Image Pane. The Chart Pane plots intensity versus m/z (to show the isotopes in a single scan) or intensity versus scan (to show the elution profile of a single isotope).
4. Hold the mouse cursor over a location in the Image Pane. A floating tag will appear displaying the scan number and m/z coordinates of that position.
5. Areas containing peptide features in the Image Pane will appear dark. Left click in a dark area of the image where there appear to be many peptide features, as shown in Fig. 1.
a. The Detail Pane (right) shows a detailed view of the area selected. Feature finding is automatically launched in this area, and after a few seconds of computation, detected peptide features are circled. Xs indicate the monoisotopic peaks in each feature (see Note 6).
b. To see detailed information about a detected peptide, position the mouse cursor over the monoisotopic peak. A floating tag will display scan number, m/z (followed by mass in parentheses), inferred charge state, intensity/background intensity/median intensity, and the first and last scan for the feature.
c. The Chart Pane (bottom) displays the m/z spectrum for the scan corresponding to the vertical red line in the Detail Pane.
6. Zoom in on features in the Chart Pane by highlighting a desired area. To do this, anchor the mouse cursor by left clicking at the top left corner of the desired area and continue to hold down the left mouse button while dragging the mouse cursor down and to the right. When the mouse button is released, the chart will be redrawn to produce a magnified view of the selected area (see Note 7). To restore the original chart, left click anywhere in the Chart Pane and drag the cursor up or to the left.
7. Select “elution” from the drop-down menu at the top of the Chart Pane to display an elution profile plot. This display shows peaks along the scan axis rather than the m/z axis. Note that the Detail Pane now displays a horizontal line corresponding to the m/z value for the profile, as shown in Fig. 2.
8. Zoom in on the Image Pane by right clicking and selecting a magnification value from the list (e.g., 200%).
Fig. 1. msInspect window showing the Properties Pane (top left), Image Pane (top center), Detail Pane (top right), and Chart Pane (bottom).
Fig. 2. msInspect window displaying an elution profile plot in the Chart Pane and corresponding horizontal line in the Detail Pane.
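Because msInspect expects MS1 data (Subheading 2.3), it can be useful to confirm what a file contains before opening it in the viewer. The sketch below counts scans per MS level with the Python standard library. It assumes that scans are stored as `scan` elements carrying an `msLevel` attribute; the function name and the embedded toy document are illustrative only and are not real instrument output.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_scans_by_level(source):
    """Count scan elements per msLevel in an mzXML file or file-like object."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Elements are usually namespaced, so compare the local tag name only.
        if elem.tag.rsplit("}", 1)[-1] == "scan":
            counts[int(elem.get("msLevel", 0))] += 1  # 0 = level not stated
            elem.clear()  # drop peak data to keep memory use low
    return counts


# Toy document for illustration: two MS1 scans, one with a nested MS2 scan.
SAMPLE = b"""<?xml version="1.0"?>
<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_2.0">
  <msRun scanCount="3">
    <scan num="1" msLevel="1" peaksCount="0"/>
    <scan num="2" msLevel="1" peaksCount="0">
      <scan num="3" msLevel="2" peaksCount="0"/>
    </scan>
  </msRun>
</mzXML>"""

print(dict(count_scans_by_level(io.BytesIO(SAMPLE))))
```

A file path can be passed in place of the in-memory object. A file with no level-1 scans would not be expected to display usefully.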
3.2. Locating Peptides in LC-MS Data
A Feature Set file, which lists all of the peptide features detected in a run, can be generated using one of the algorithms included in the platform (see Note 8).
1. Under the Tools menu, select two dimensional (2D) Peak Alignment. This is the default feature-finding algorithm and is recommended for most purposes.
2. To initiate feature finding, select Tools > Find All Features. This will bring up the Extract Features dialog box as shown in Fig. 3.
3. In the “Save Features to File” field, enter (or browse for) a path and add a name for the new Feature Set file.
4. Specify a scan range in the “Start Scan” and “End Scan” fields to limit feature finding to a subset of scans. By default, msInspect will attempt to find peptides in all scans (see Note 9).
5. Left click the Find Features button to begin the feature finding process. As the file is processed, the status bar at the bottom of the msInspect window will display progress. For a large input file, processing may take upwards of 20–30 min.
6. When processing is complete, features will be written to the specified output file and highlighted as colored crosses in the Image and Detail Panes. The status bar will display “Finding features complete. See file yourfilepath\yourfile.peptides.tsv.” Place the mouse cursor over one of the detected features to display a summary of its properties. Left click on the feature to view details in the Properties Pane (display by Windows > Show/hide Properties).
7. Select Tools > Display Peptides… to open the Display Features dialog box as shown in Fig. 4A for customization:
Fig. 3. Extract Features dialog box.
a. Display or hide the colored crosses by checking or unchecking the box under the “Display” field.
b. Change the color of the crosses by left clicking on the colored box under the “Color” field. A new color can be selected from a color palette.
c. View the Feature Set browser by left clicking on the “…” button. This browser lists details of all peptides in the Feature Set. This list can be sorted and edited, comments can be added to a feature, features can be deleted, and the modified Feature Set file may be saved (see Note 10).
3.3. Filtering to Eliminate Low-quality Peptides
Low-quality peptides can be removed in msInspect by applying user-specified filtering criteria (e.g., a minimum number of isotopic peaks detected). Removing low-quality peptides is particularly helpful when peptide arrays are to be generated (described in Subheading 3.4.1).
1. Select Tools > Display Peptides….
2. Left click the Filter tab at the bottom of the Display Features dialog box. This tab displays several parameters by which features can be filtered.
3. Set Min Charge = 1, Min Scans = 3, Min Intensity = 5, Max KL = 1.0, and Min Peaks = 2 as shown in Fig. 4A (see Note 11).
4. Left click the Apply button. The Detail Pane now shows only the features that meet these filtering criteria.
5. Save the filtered Feature Set file over the original file by left clicking on the “…” button at the top right of the Display Features dialog box, then left clicking on the Save button.
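The filter settings in step 3 can also be expressed as a small script applied to a tab-delimited feature table outside the viewer. The following is a sketch only: the column names (`charge`, `scanCount`, `intensity`, `kl`, `peaks`, `mz`), the handling of `#` comment lines, and the toy data are assumptions made for illustration and should be checked against an actual Feature Set file.

```python
import csv
import io

# Thresholds from Subheading 3.3, step 3.
THRESHOLDS = {"min_charge": 1, "min_scans": 3, "min_intensity": 5.0,
              "max_kl": 1.0, "min_peaks": 2}


def passes_filter(row, t=THRESHOLDS):
    """True if one feature row meets every threshold (column names assumed)."""
    return (int(row["charge"]) >= t["min_charge"]
            and int(row["scanCount"]) >= t["min_scans"]
            and float(row["intensity"]) >= t["min_intensity"]
            and float(row["kl"]) <= t["max_kl"]
            and int(row["peaks"]) >= t["min_peaks"])


def filter_features(lines, t=THRESHOLDS):
    """Read tab-delimited feature rows, skipping '#' comment lines (assumed)."""
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return [row for row in reader if passes_filter(row, t)]


# Toy table for illustration only.
TOY_TSV = "\n".join([
    "# toy data, not real msInspect output",
    "\t".join(["mz", "charge", "scanCount", "intensity", "kl", "peaks"]),
    "\t".join(["500.25", "2", "6", "40.0", "0.3", "4"]),  # meets all criteria
    "\t".join(["610.10", "0", "4", "12.0", "0.0", "1"]),  # stray peak
    "\t".join(["702.80", "3", "5", "25.0", "2.4", "3"]),  # poor isotopic fit
])

kept = filter_features(io.StringIO(TOY_TSV))
print([row["mz"] for row in kept])
```

Keeping the thresholds in one dictionary makes it easy to apply identical criteria to every run that will later be combined into a peptide array.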
3.4. Quantitation of Peptide Features
3.4.1. Quantitation Using Label-free Approaches
Features from multiple experiments can be compared in msInspect by simultaneously opening Feature Set files from multiple LC-MS runs, displaying them together, and generating a peptide array. Below are directions for comparing multiple LC-MS runs after Feature Set files have been produced (as described above in Subheadings 3.1–3.3) for all LC-MS runs to be compared.
1. Select Tools > Display Peptides….
2. Left click on the Add Files button (Fig. 4A).
Fig. 4. (A) Display Features dialog box with one file loaded and the Filter tab selected. (B) Display Features dialog box with two files loaded and the Peptide Array tab selected.
3. Browse to find another Feature Set file (with file extension .peptides.tsv) and open it. A different colored cross is assigned in the Image Pane to the features from each newly opened file. In this way, multiple Feature Set files can be opened and overlaid in the Image Pane (see Note 12).
4. Left click on the Filter tab (Fig. 4A) at the bottom of the Display Features dialog box and make sure the filter criteria are still set to the values entered in Subheading 3.3 (Min Charge = 1, Min Scans = 3, Min Intensity = 5, Max KL = 1.0, and Min Peaks = 2). Left click on the Apply button if any changes are made.
5. Left click on the Peptide Array tab (Fig. 4B) to set criteria for the peptide array to be generated:
a. Enter a name for the peptide array file that will be generated. By convention, this file name should end with “.pepArray.tsv.”
b. Click the Optimize button to have msInspect search for reasonable tolerances for matching features across runs (see Note 13).
c. Check the Normalization box if normalization of features is desired (2).
d. Click the Calculate button to compute the peptide array.
6. The generated peptide array file consists of one column of intensities for each run and one row for each matched feature. The file is stored in a simple tab-delimited format, which can be exported (to Excel and other programs) and analyzed using tools traditionally applied to genomic arrays (see Note 14).
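Because the peptide array is a plain tab-delimited table with one intensity column per run, it can be read with standard tools. As a minimal sketch, the following computes log2 intensity ratios between two runs for features detected in both. The column names (`id`, `intensity_1`, `intensity_2`), the use of empty cells for unmatched features, and the toy values are all assumptions for illustration, not the documented file layout.

```python
import csv
import io
import math


def log2_ratios(lines, col_a, col_b, id_col="id"):
    """Return (feature id, log2(a / b)) for rows with positive values in both runs."""
    results = []
    for row in csv.DictReader(lines, delimiter="\t"):
        a, b = row.get(col_a, ""), row.get(col_b, "")
        if not a or not b:
            continue  # feature not matched in one of the runs
        a, b = float(a), float(b)
        if a > 0 and b > 0:
            results.append((row[id_col], math.log2(a / b)))
    return results


# Toy array for illustration only: feature 2 is missing in the second run.
TOY_ARRAY = "\n".join([
    "\t".join(["id", "intensity_1", "intensity_2"]),
    "\t".join(["1", "200", "100"]),
    "\t".join(["2", "50", ""]),
    "\t".join(["3", "30", "120"]),
])

for feature_id, ratio in log2_ratios(io.StringIO(TOY_ARRAY), "intensity_1", "intensity_2"):
    print(feature_id, round(ratio, 3))
```

Rows with a missing intensity are skipped here for simplicity; how missing values should be treated in a real analysis is a separate question (2).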
3.4.2. Quantitation Using Isotopic Labeling
A common method of relative quantitation of peptides involves applying heavy and light isotopic labels separately to two samples, then mixing them prior to collecting LC-MS data. Typically, tandem MS/MS (or MS2) experiments are used to analyze these labeled samples. Peptide sequencing in MS/MS can detect the number of labeled residues in each peptide and therefore determine the expected mass difference between light and heavy forms of each peptide. msInspect can perform relative quantitation even in the absence of MS/MS information. Provided with the mass of the light and heavy reagents and with a threshold on the number of labeled residues to consider, msInspect will search for pairs of features consistent with isotopic labeling.
1. Open the file to be analyzed as described in Subheading 3.1.
2. Select Tools > Find All Features.
3. This will again bring up the Extract Features dialog box as shown in Fig. 3. Enter a new output file name and select a scan range of interest as described in Subheading 3.2.3–3.2.4.
4. Note the “Quantitate” check box in this dialog. Selecting this box will enable several options for relative quantitation.
5. Select one of several common isotopic labeling strategies (e.g., Cleavable ICAT and O16/O18) from the pull-down menu. Details can be entered including masses for light and heavy label reagents, the particular amino acid labeled, and the maximum number of labeled residues to consider.
6. Left click on the “Find Features” button to locate all features in the specified scan range. Display features from the Feature Set file as described in Subheading 3.2.7. An additional matching step is performed to locate isotopically labeled pairs. A pair is indicated by a vertical bar connecting the light and heavy partners in the Detail Pane. Selecting a pair by left clicking in the Detail Pane will display feature properties including the light and heavy intensities, the ratio of light to heavy, and the number of isotopic labels detected.
7. The results of this quantitation process are stored in a tab separated value (TSV) file specified in step 3.4.2.3. One record is written for each isotopically labeled pair and for each unlabeled peptide (see Note 15).
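The idea behind the pair search can be sketched as follows: two features are candidate light and heavy partners if they have the same charge, elute close together, and differ in mass by a whole number of label mass differences. This is a simplified illustration and not msInspect's actual matching code; the `Feature` class, the tolerances, the same-charge requirement, and the example mass difference of 9.0302 Da are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    scan: int
    mass: float       # neutral monoisotopic mass in Da
    charge: int
    intensity: float


def find_labeled_pairs(features, delta, max_labels=3, mass_tol=0.05, scan_tol=5):
    """Return (light, heavy, label count, light/heavy ratio) for candidate pairs.

    `delta` is the heavy-minus-light reagent mass difference per labeled residue.
    """
    pairs = []
    for light in features:
        for heavy in features:
            if heavy is light or heavy.charge != light.charge:
                continue
            if abs(heavy.scan - light.scan) > scan_tol:
                continue
            diff = heavy.mass - light.mass
            for n in range(1, max_labels + 1):
                if abs(diff - n * delta) <= mass_tol:
                    pairs.append((light, heavy, n, light.intensity / heavy.intensity))
                    break
    return pairs


# Toy features for illustration only; 9.0302 Da is an assumed label difference.
DELTA = 9.0302
toy = [
    Feature(100, 1500.7000, 2, 400.0),
    Feature(101, 1509.7302, 2, 200.0),  # one label heavier than the first
    Feature(250, 2000.0000, 3, 90.0),
    Feature(251, 2018.0604, 3, 30.0),   # two labels heavier than the third
    Feature(100, 1600.0000, 2, 50.0),   # no partner
]
for light, heavy, n, ratio in find_labeled_pairs(toy, DELTA):
    print(f"{light.mass:.4f} -> {heavy.mass:.4f}: {n} label(s), ratio {ratio:.2f}")
```

In a dense run, several features may satisfy these criteria by chance, which is one reason results obtained without sequence information are best treated as suggestive (see Note 15).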
4. Notes
1. More information on the mzXML file format, as well as utilities to convert native acquisition files from many common MS instruments to mzXML, can be found on the Sashimi website at http://sashimi.sourceforge.net.
2. Running msInspect via Java Web Start is highly recommended for casual use, as it greatly simplifies installation and update of the software. msInspect’s major features, such as feature finding and peptide array creation, are available from the command line as well, and command-line use is more appropriate for batch processing of large numbers of mzXML files. To use msInspect from the command line, the stand-alone JAR file can be downloaded from http://proteomics.fhcrc.org/CPL/msinspect.html. This web page also allows download of the msInspect user’s guide, which contains detailed instructions on installation, using msInspect’s features from the command line, and full source code for the released version (5).
3. Feature extraction can require a great deal of memory since it operates on several scans at a time. By default the Java Web Start version of msInspect allows up to 384 MB of memory to be allocated so that a number of scans and intermediate results may be cached. If additional memory is available on the computer, the amount of memory accessible by msInspect may be increased when running msInspect from the command line with the “-Xmx” option when invoking Java. For example “java -Xmx512M -jar viewerApp.jar.”
4. Sample data files are available at https://proteomics.fhcrc.org/CPAS. From that website, follow the “Published Experiments” link on the lower left side and then left click on the “MiMB Clinical Proteomics” link on the left side. Because LC-MS files can be quite large, the samples provided for download are only small subregions of the files used as figures in Section 3. Some browsers, such as Internet Explorer, may add a “.mzXML.xml” suffix when downloading these
files. This should not affect msInspect’s ability to read the files and may be safely modified to “.mzXML” if desired.
5. The first time a particular mzXML file is loaded, msInspect will write a “.inspect” file in the same directory where the mzXML file is located. This file contains an index of each scan in the original file, which will speed subsequent file access. Construction of this index file can take some time for larger input files; the status bar at the bottom of the msInspect window will indicate progress.
6. The area shown in the Detail Pane is indicated in the main Image Pane by a blue rectangle. Several aspects of Detail Pane behavior can be adjusted by selecting Detail Pane Settings from the Tools menu. There, feature detection can be turned on or off, background noise that falls below a threshold can be hidden, and the color scheme of the Detail Pane can be modified.
7. Note that in Fig. 1 the Chart Pane clearly shows individual isotopic peaks because the data is from a high-resolution instrument (in this case a Waters LCT Premier). msInspect depends on resolving individual isotopes to infer the charge state of the peptide and therefore its mass. The charge is derived from the reciprocal of the distance between adjacent peaks. In Fig. 1 the peaks of the peptide on the left side of the Chart Pane are 0.5 m/z units apart, therefore msInspect infers that this peptide has a charge of 2. It is not possible to infer a charge for a single peak, so “stray peaks” that cannot be grouped into an isotopic cluster are assigned a charge of zero.
8. msInspect includes a number of feature extraction algorithms, which can be selected in the Tools menu. The default, two dimensional (2D) peak alignment, is recommended for most purposes. The single scan algorithm may be useful if there is little or no scan-to-scan coherence. The feature extraction algorithms in msInspect have been designed to work on high-resolution profile mode data. The algorithms have been successfully applied to centroided data, but performance will depend on the particular centroiding algorithm used and on the noise characteristics of the run under consideration. For such data, the centroided scan algorithm may be appropriate.
9. Once peptides have been located, some amount of visual curation is recommended. The Heat Map view (accessed from the Tools menu) can provide a global view of features grouped by charge state and sorted by various metrics such as mass or intensity. Each column in the Heat Map view consists of a small intensity window around each feature, colored from low intensity (red) to high intensity (yellow). Clicking on a feature in the Heat Map will highlight it in the other windows. By sorting on KL score or intensity and inspecting a few features, one can gain a sense of what filtering criteria might be appropriate for a given data set. When new filter settings are applied, as described in Subheading 3.3, the Heat Map view is automatically updated.
10. A typical example of editing a Feature Set file:
a. Sort by ascending KL score (left click on the “KL” column header).
b. Find a feature with KL < 1 that was misidentified by examining its spectrum in the msInspect window’s Chart Pane.
c. Double click in the Description field for the feature to add a comment to the Feature Set List noting that this feature is “questionable.”
d. Click “Save” to save changes by overwriting the old Feature Set file.
11. Filtering peptide features can improve the performance of subsequent steps such as construction of peptide arrays. Specific filtering criteria will depend on instrumentation and the experiment goals. The most frequently used filtering criteria include:
a. Minimum charge – msInspect locates features by first finding peaks and then grouping them into isotopic distributions consistent with individual peptides. Some peaks will not group with any others and are referred to as “stray peaks.” As described in Note 7, it is not possible to infer the charge state of these stray peaks, so they are assigned a charge of zero. Setting the minimum charge to 1 when filtering will remove these stray peaks, which are often due to noise or chemical contaminants.
b. Minimum number of peaks – confidence in the location and charge state assignment of a peptide feature may be greater if it is supported by more isotopic peaks. Setting the minimum number of peaks to 2 will also eliminate the stray peaks described above.
c. Minimum number of scans – set the minimum number of scans that a peptide must span in order to be considered. This has the effect of eliminating peptide features that persist for only a brief time.
d. Minimum intensity – setting a minimum intensity threshold is often appropriate, although the specific value used will depend on the instrument.
e. Maximum KL score – peaks are grouped by how well they match a model of the isotopic distribution of a peptide with a given mass. The KL score described in Bellew et al. (1) measures how much an extracted group of peaks deviates from this model; in general, a lower KL score indicates a better match.
12. When multiple feature sets are loaded, it is often useful to hide particular sets or to change the colors of the crosses that mark features in a given set. Both of these can be accomplished in the Display Features dialog box as shown in Fig. 4A (select Tools > Display Peptides).
For each feature set, this dialog box provides a checkbox to control visibility and a color palette to select colors for the crosses.
13. After optimization, the mass and scan window values that give the best alignment results automatically populate the Peptide Array tab.
14. A number of high-quality open source tools are available for microarray analysis. To analyze peptide arrays produced by msInspect, tools from the Bioconductor project (http://www.bioconductor.org) and from the TM4 microarray software suite (http://www.tm4.org) have been used.
15. Results from isotopic labeling should be treated as suggestive rather than authoritative. Without peptide sequence information, the mass difference between heavy and light partners cannot be definitively ascertained. The quality of the matching is therefore dependent on the quality of feature filtering and the density of features in each run.
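The charge inference described in Note 7 (charge as the reciprocal of the spacing between adjacent isotopic peaks) can be illustrated with a short sketch. This is not msInspect's implementation; the function names, the tolerance, the maximum charge considered, and the example m/z values are assumptions chosen for illustration.

```python
PROTON_MASS = 1.007276  # Da


def infer_charge(peak_mzs, max_charge=6, tol=0.02):
    """Infer charge from isotope spacing; return 0 for stray or ambiguous peaks."""
    if len(peak_mzs) < 2:
        return 0  # a single peak carries no spacing information
    mzs = sorted(peak_mzs)
    spacing = (mzs[-1] - mzs[0]) / (len(mzs) - 1)  # mean adjacent spacing
    if spacing <= 0:
        return 0
    charge = round(1.0 / spacing)
    if 1 <= charge <= max_charge and abs(spacing - 1.0 / charge) <= tol:
        return charge
    return 0


def neutral_mass(mz, charge):
    """Neutral mass of a positive ion observed at `mz` with the given charge."""
    return (mz - PROTON_MASS) * charge


# Peaks 0.5 m/z units apart, as in the example of Note 7, imply a charge of 2.
cluster = [650.30, 650.80, 651.30]
z = infer_charge(cluster)
print("charge:", z, "neutral mass:", round(neutral_mass(cluster[0], z), 4))
```

Returning zero for a single peak mirrors the convention in Note 7 that stray peaks are assigned a charge of zero, which is what allows the Min Charge = 1 filter to remove them.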
Acknowledgments
The authors would like to thank Matthew Bellew, Marc Coram, Jimmy Eng, Ruihua Fang, Mark Igra, and Tim Randolph for their intellectual contributions to the development of msInspect. This work was supported by contract # 23XS144A from the National Cancer Institute.
References
1. Bellew, M., Coram, M., Fitzgibbon, M., Igra, M., Randolph, T., Wang, P., May, D., Eng, J., Fang, R., Lin, C.W., Chen, J., Goodlett, D., Whiteaker, J., Paulovich, A., and McIntosh, M. (2006) A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics Advance Access published on June 9, 2006 http://bioinformatics.oxfordjournals.org/cgi/reprint/btl276v1.
2. Wang, P., Tang, H., Zhang, H., Whiteaker, J., Paulovich, A.G., and McIntosh, M. (2006) Normalization regarding non-random missing values in high-throughput mass spectrometry data. Proceedings of the Pacific Symposium on Biocomputing 11, 315–326.
3. May, D., Fitzgibbon, M., Liu, Y., Holzman, T., Eng, J., Kemp, C.J., Whiteaker, J., Paulovich, A., and McIntosh, M. (2007) A Platform for Accurate Mass and Time Analyses of Mass Spectrometry Data. Journal of Proteome Research 6(7), 2685–2694.
4. Pedrioli, P.G., Eng, J.K., Hubley, R., Vogelzang, M., Deutsch, E.W., Raught, B., Pratt, B., Nilsson, E., Angeletti, R.H., Apweiler, R., Cheung, K., Costello, C.E., Hermjakob, H., Huang, S., Julian, R.K., Kapp, E., McComb, M.E., Oliver, S.G., Omenn, G., Paton, N.W., Simpson, R., Smith, R., Taylor, C.F., Zhu, W., and Aebersold, R. (2004) A common open representation of mass spectrometry data and its application to proteomics research. Nature Biotechnology 22(11), 1459–1466.
5. Computational Proteomics Laboratory. msInspect website. Accessed on June 28, 2006 at http://proteomics.fhcrc.org/CPL/msinspect.html.
20 Pattern Recognition Approaches for Classifying Proteomic Mass Spectra of Biofluids Ray L. Somorjai
Summary The statistical classification strategy we have developed for magnetic resonance, infrared, and Raman spectra for the analysis of biomedical data is discussed, particularly as it applies to proteomic mass spectra. A general discussion of the current use of pattern recognition methods is given, with caveats and suggestions relevant for clinical applicability.
Key Words: visualization; preprocessing; feature selection/extraction; robust classifier; classifier aggregation; proteomics; mass spectroscopy; magnetic resonance spectroscopy; biodiagnostics.
From: Methods in Molecular Biology, vol. 428: Clinical Proteomics: Methods and Protocols. Edited by: A. Vlahou © Humana Press, Totowa, NJ

1. Introduction
Unlike magnetic resonance spectroscopy (MRS), infrared spectroscopy (IRS), and Raman spectroscopy (RS) (1,2,3), proteomic mass spectroscopy (PMS) is a relative newcomer to the field of biodiagnostics. However, with the goal of discriminating various diseases and disease states, it is a welcome complementary technique that provides yet another means of analyzing biofluids. In particular, this complementarity extends the range of characterizing biofluids, from vibrational states of specific chemical groups (IRS, RS), through the identification of small molecules (MRS), to proteins and protein fragments (PMS). Being an emerging field, PMS suffers from growing pains. In particular, there are experimental difficulties specific to PMS that have yet to be addressed
(see Note 1) (in the following, the author assumes that the spectra for which classifiers are to be developed have been properly “processed”). Typically, biomedical data consist of relatively few (of the order of 10–100) samples (patterns) that are initially presented in a very high-dimensional feature space (feature ≡ m/z intensity), with dimensionality L (dimension ≡ number of features) of order 1000–10,000. Unfortunately, these two characteristics lead to two curses that impede the development of robust classifiers: the curse of dimensionality and the curse of dataset sparsity (3). The consequence of the two curses is that the sample-to-feature ratio (SFR) is 1/10–1/1000, instead of the minimum of 5–10 that the machine learning community generally accepts as required for robust classification. In this chapter, the author presents the specific strategy [dubbed statistical classification strategy (SCS)] that we have developed over the last dozen years to deal with such problems, particularly as they apply to MR, IR, and Raman spectra. We have been adapting this strategy and applying it with success to biomedical data derived from both proteomics mass spectra and microarrays (see Note 2). The author compares the differences and similarities of the SCS with the proteomics data analysts’ current tools and, wherever possible, makes recommendations.

2. The Statistical Classification Strategy
Lifting the twin curses of high dimensionality and dataset sparsity requires special approaches. The “strategy” part of the SCS reflects the fact that no single approach is, or can be, optimal [“there are no panaceas in data analysis” (4)], and that a data-driven, multistage strategy is necessary or even essential. Using a divide-and-conquer philosophy, the SCS consists of five stages:
1. Data visualization
2. Preprocessing
3. Feature selection/extraction
4. Robust classifier development
5. Classifier aggregation (ensembles)
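Before any of the five stages is applied, it can help to quantify how far a dataset is from the SFR of 5–10 quoted in the Introduction. The following is a minimal arithmetic sketch only: the function names and the two example dataset sizes are assumptions chosen to fall inside the ranges given above, and the default target of 5 is simply the lower end of the quoted 5–10 range.

```python
def sample_to_feature_ratio(n_samples: int, n_features: int) -> float:
    """SFR = number of samples divided by number of features."""
    return n_samples / n_features


def max_features_for_sfr(n_samples: int, target_sfr: float = 5.0) -> int:
    """Largest feature count for which n_samples / n_features >= target_sfr."""
    return int(n_samples // target_sfr)


# Hypothetical dataset sizes inside the ranges quoted in the text.
for n_samples, n_features in [(100, 1000), (10, 10000)]:
    sfr = sample_to_feature_ratio(n_samples, n_features)
    keep = max_features_for_sfr(n_samples)
    print(f"N={n_samples}, L={n_features}: SFR={sfr:g}; "
          f"at most {keep} features give SFR >= 5")
```

With 100 samples, at most 20 features can be retained at an SFR of 5, which illustrates why the feature selection/extraction stage has to be so aggressive.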
The five stages are, of course, intimately interrelated; in particular, we use the visualization stage to constantly monitor how well the other stages of the strategy are working. Figure 1 provides a flowchart of the SCS. A more detailed description of the SCS can be found in (5) (see Note 3).

2.1. Visualization of High-Dimensional Data
Proper data visualization is an essential first step that requires dimensionality-reducing mapping/projection from typically a very large, L-dimensional feature
[Figure 1: flowchart of five boxes in sequence: Data Visualization, Preprocessing, Feature Selection/Extraction, Classifier Development, Classifier Aggregation]
Fig. 1. Flowchart for the five stages of the SCS.
space to one to three dimensions. Of course, mapping from high dimensions to lower ones cannot preserve all distances exactly, because most of the original degrees of freedom are lost. However, if only class separability is required, exact visualization, our primary goal, is both achievable and sufficient. In fact, we recently proposed such an approach (6). It involves mapping high-dimensional patterns to a special plane, the relative distance plane (RDP). The mapping procedure starts with the selection of a distance measure. This can range from Euclidean, city block, and maximum norm to Mahalanobis and its generalization (Anderson–Bahadur, AB) (7). Next, two reference patterns are chosen, one from each class. The critical observation, on which the RDP mapping relies, is that the distance of any other pattern to these two reference points is preserved exactly even after the mapping. This is because a triangle remains a triangle in any dimension and for any distance metric. Hence, the three distances of any such triangle can be displayed in two dimensions without distortion. By cycling through all possible reference pairs, we can display and visualize the data with respect to these pairs, i.e., from a large number of possible “perspectives” (as an analogy, consider looking at a sculpture from every angle to assess its shape and form). This is a very powerful approach for detecting outliers (e.g., poor quality spectra), discovering additional subgroups within a class (clustering), assessing whether training and test sets derive from the same distributions, etc.; in short, for establishing and ensuring quality control.

2.2. Preprocessing
Preprocessing enables the user to adapt, or “tune,” the data so that the subsequent stages of the SCS are optimized.
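The triangle argument behind the RDP mapping of Subheading 2.1 can be sketched in a few lines. This is a minimal illustration, not the published RDP software (6): the function names are hypothetical, and placing the first reference at the origin and the second on the horizontal axis is one natural construction that is assumed here rather than taken from the source. It requires a distance measure that obeys the triangle inequality.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def euclidean(a: Vector, b: Vector) -> float:
    """L2 distance between two patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a: Vector, b: Vector) -> float:
    """L1 (city block) distance between two patterns."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rdp_coordinates(x: Vector, ref1: Vector, ref2: Vector,
                    dist: Distance = euclidean) -> Tuple[float, float]:
    """Map pattern x to a plane in which ref1 sits at (0, 0) and ref2 at (d12, 0).

    The planar distances from the mapped point to the two references equal
    dist(x, ref1) and dist(x, ref2) in the original feature space.
    """
    d12 = dist(ref1, ref2)
    if d12 == 0.0:
        raise ValueError("reference patterns must be distinct")
    d1, d2 = dist(x, ref1), dist(x, ref2)
    # Law of cosines gives the coordinate along the ref1-ref2 axis.
    u = (d1 ** 2 - d2 ** 2 + d12 ** 2) / (2.0 * d12)
    # Height above the axis; clipped at zero to absorb rounding error.
    v = math.sqrt(max(d1 ** 2 - u ** 2, 0.0))
    return u, v
```

Calling `rdp_coordinates` for every pattern with a fixed reference pair gives one “perspective”; looping over reference pairs (one from each class) gives the full set of views described in Subheading 2.1.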
For spectra, whether MS or MR, we found that the most useful preprocessing approaches, alone or in combination, are normalization (“whitening,” or scaling to unit area), smoothing (filtering), and/or peak alignment (with respect to some internal or external
reference). Various transformations of the spectra lead frequently to better classification. Examples of such transformations include replacing the spectra by their (numerical) derivatives or by rank-ordered variants (the nonlinear rank-ordering replaces the original features by their ranks, thus minimizing the influence of accidentally large or small feature values) and combinations of these. Furthermore, creating differently preprocessed versions of the same dataset, selecting different sets of features from these (stage 3), and developing different classifiers using these feature sets (stage 4) facilitates the aggregation of these multiple classifiers for possibly increased accuracy (stage 5). The achieved classifier’s accuracy and reliability are also assessed by visualization of the results (stage 1). This demonstrates how the strategy uses the stages in an interactive, feedback fashion.
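The preprocessing options named above can be sketched as small, composable functions. This is an illustrative sketch under stated assumptions, not the SCS software: the function names are hypothetical, the spectrum is assumed to be a list of non-negative intensities on a uniformly spaced axis (so the sum stands in for the area), the smoothing filter is a plain moving average, and tied values receive their average rank.

```python
from typing import List, Sequence


def normalize_to_unit_area(spectrum: Sequence[float]) -> List[float]:
    """Scale intensities so that they sum to 1 (uniform spacing assumed)."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("spectrum has zero total intensity")
    return [v / total for v in spectrum]


def moving_average(spectrum: Sequence[float], window: int = 3) -> List[float]:
    """Smooth with a centered moving average; the window is truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out


def first_difference(spectrum: Sequence[float]) -> List[float]:
    """Numerical first derivative as the difference of adjacent points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def rank_transform(spectrum: Sequence[float]) -> List[float]:
    """Replace each intensity by its 1-based rank; ties share the average rank."""
    order = sorted(range(len(spectrum)), key=lambda i: spectrum[i])
    ranks = [0.0] * len(spectrum)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and spectrum[order[j + 1]] == spectrum[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks
```

Note how `rank_transform` maps an accidentally large value such as 1000.0 to a rank no larger than the spectrum length, which is the robustness property the text attributes to rank ordering.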
2.3. Feature Selection/Extraction
In general, this stage is one of the two most important components of the SCS. It is essential not only for dimensionality reduction (which helps lift the curse of dimensionality) but, when done properly, also for arriving at biologically relevant and transparent interpretations of the data (“biomarker” identification). The driving force behind feature selection/extraction (FSE) is the goal of satisfying one of the two critical requirements for any reliable classifier development: lifting the curse of dimensionality. Spectra, whether mass or MR, are peculiar: their “intrinsic dimensionality,” the number of independent, relevant features they possess, is generally much smaller than their original dimensionality. This is because spectra have many irrelevant features (“noise”), and adjacent features are strongly correlated. Some of these correlated features correspond to spectral peaks, representing small molecules (MRS), or small proteins, protein fragments, or peptides (PMS). Thus, it is clearly beneficial to eliminate irrelevant features and identify discriminatory peaks (potential “biomarkers”). For spectra, principal component analysis, a frequently used dimension reduction method (often the principal tool of many PMS data analysts), is doubly dangerous. First, it “scrambles” the original features, making discriminatory feature identification and selection problematic; second, since the principal components (PCs) are ordered according to the maximum variance explained in the data, there is no guarantee that the first few PCs are discriminatory for classification. Even if one were to choose the first M ≪ L PCs from the original, total L-term set, these are rarely the best discriminators. One could try selecting m < M PCs as optimal for classification (e.g., by exhaustive search); our early experience indicates that some of the good discriminators are among the remaining k = M + 1,…,L
subset of PCs. All these difficulties point to the need for a feature selection method specific to spectral data, one that preserves spectral interpretability. There are two generic approaches to feature selection (8). The filter method selects features without consideration of the classifiers to be used with these features. The wrapper (embedding) method finds optimal features, while using the eventual classifier to guide the selection method. We have developed a genetic algorithm-based optimal region selection (GA-ORS) method that finds discriminatory features without losing spectral interpretability (9). The GA-ORS is based on the wrapper approach and is an example of feature extraction. It has the advantage that the spectral ranges found are averaged over adjacent data points (thus equivalent to peak area determination). Such averaging increases the signal-to-noise ratio, a bonus. Within the GA-ORS suite of programs, one can also control the widths of the selected spectral subregions (discriminatory peaks); this helps to eliminate those regions that appear to be discriminatory simply because of accidental differences in the “noise” regions due to the limited sample size (9,10). The GA-ORS has been very successful in identifying discriminatory subregions of MR, IR, and Raman spectra of biofluids and tissues, obtained for distinguishing between various diseases and disease states (1). In the context of feature selection, many proteomic mass spectroscopists first identify “relevant” peaks, sometimes in an ad hoc fashion, as possible contributors to discrimination. Although using all available “domain knowledge” is very important and should always be considered when available, it can also introduce bias, because of possible preconceived notions of what is relevant for discrimination.
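The region-averaging step that keeps GA-ORS features interpretable can be illustrated in isolation. The sketch below is not the GA-ORS program: the genetic algorithm search and the classifier that guides it are omitted, the function name and the `min_width` parameter are hypothetical, and the regions are assumed to be supplied by whatever search procedure is in use.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # (start, stop) data-point indices, stop exclusive


def region_features(spectrum: Sequence[float],
                    regions: Sequence[Region],
                    min_width: int = 1) -> List[float]:
    """Average the intensities inside each contiguous spectral subregion.

    Each output feature is the mean over adjacent data points, so it still
    refers to an identifiable part of the spectrum.
    """
    features = []
    for start, stop in regions:
        if not 0 <= start < stop <= len(spectrum):
            raise ValueError(f"region {(start, stop)} lies outside the spectrum")
        if stop - start < min_width:
            raise ValueError(f"region {(start, stop)} is narrower than {min_width}")
        window = spectrum[start:stop]
        features.append(sum(window) / len(window))
    return features


# Hypothetical 10-point spectrum with two candidate regions.
spectrum = [1.0, 1.0, 5.0, 7.0, 6.0, 1.0, 1.0, 2.0, 8.0, 2.0]
print(region_features(spectrum, [(2, 5), (7, 10)]))
```

A minimum-width constraint of this kind is one simple way to express the control over subregion widths mentioned in the text: very narrow regions are the ones most likely to look discriminatory by accident.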
Our feature selection approach, sketched above, removes most of such bias, by identifying hitherto unsuspected, novel discriminatory “peaks,” or more accurately, discriminatory spectral subregions. Furthermore, by its explicit multivariate nature, GA-ORS tends to identify a “fingerprint,” a “panel” of peaks whose simultaneous interaction is necessary for discrimination. When the multidimensional feature space does not arise from spectra, e.g., microarray data or preselected discrete peaks in PMS, for which averaging adjacent features is not meaningful, direct application of the GA-ORS methodology may not be appropriate [although we have used it as a preliminary, clustering-type feature selection “trick” (5)]. However, a search for optimal or near-optimal discriminatory feature subsets is still feasible (exhaustive when possible, dynamic programming-based when not) and is one of the options available in GA-ORS. Figure 2 demonstrates the importance of feature selection, and the relevance of an interactive, feedback-mode visualization of data. For the two-class, prostate cancer vs. healthy proteomic (mass spectral) dataset (11), we display a Euclidean distance-based mapping, either directly from the original 15,154
[Figure 2: two panels. Left panel title: “Prostate Cancer – L2 Mapping from 15,154 Dimensions”; right panel title: “5 Dimensions”]
Fig. 2. Mapping from the original 15,154 dimensions (left panel) misclassified eight samples from the training set (TS; class 1, black disks, class 2, black crosses) and nine from the independent validation (test) set (VS; class 1, grey triangles, class 2, grey squares). The mapping from five dimensions (right panel), classified correctly all TS and the VS samples. The dashed lines shown are the optimal LDA separators.
dimensions (left panel) or from five dimensions, reduced via GA-ORS (right panel). Clearly, the success of class separation depends on the dimensionality of the feature space. When mapping from the original 15,154 dimensions, the optimal two-dimensional separation of the training set (TS; black disks for class 1, black crosses for class 2) and the test set (VS; grey triangles for class 1, grey squares for class 2) misclassifies eight samples from the training set and nine from the independent test set. For the mapping from five dimensions, all samples are classified correctly (see Note 4).

2.4. Robust Classifier Development
There are two generally interrelated goals for supervised classifiers. First, we want robust classifiers, i.e., with high generalization power. This is realized when the classifier classifies new, unknown “patterns” correctly and reliably. Second, we want to identify the smallest subset of maximally discriminatory features. Eventual disease management/treatment would benefit from having only a few, biologically relevant and interpretable features. Ideally, both classification goals should be achieved, especially in clinically relevant studies. Unfortunately, achieving the first goal is frequently at the expense of the second. A good example is the recent use of support vector machines (SVMs) for classification. These have become particularly popular because of their
persuasive theoretical foundations (12,13) (see Note 5). However, because the SVMs project the data into even higher dimensional feature spaces to achieve linear separability of the classes, relevant, discriminatory feature identification becomes more difficult. The technical complexity and sophistication of the classifiers used range from the simplest correlation techniques, through k nearest neighbors, linear and quadratic discriminant analysis, decision trees, neural nets, etc., to (nonlinear) SVMs. However, the choice of classifier seems not to be dictated by the data to be classified, but rather by “expert” recommendation (usually based on other types of data), personal experience or preference, or simply software availability. The maxim “simpler is better” has mostly been ignored [see, however, (14)]. In general, no specific effort has been expended on choosing the most appropriate, optimal type of classifier for a given dataset. With a few exceptions, the proteomics (mass spectroscopy) community tends to use the “best” (i.e., the most sophisticated) classifier, whether appropriate or not! If the dataset size is sufficiently large, then the optimum approach for developing a robust classifier is to partition the data into a training set, a monitoring set, and a completely independent test (validation) set. Such partitioning is required to prevent overfitting, which occurs when the classifier adapts itself too closely to the peculiarities of a training set that comprises a limited number of samples. Using a monitoring set helps decide when to stop training. The ultimate assessment of the classifier’s generalization capability is how well it does on the independent test set that was in no way involved in creating the classifier. Unfortunately, a sufficiently large sample size is a luxury rarely available to analysts of biomedical data. The only recourse is to use some version of cross-validation (CV) (15).
CV comes in different flavors, each with its advantages and disadvantages. All of them are designed to deal with the bias introduced by using the entire dataset both to develop the “optimal” classifier and to estimate the classification error (see Note 6). It is important to re-emphasize that because of the typically small sample size of biomedical data, the best approach to robust classifier development is to select the simplest classifier possible. This suggests linear classifiers. Complex classifiers have too many parameters that need optimization, inevitably raising the specter of overfitting (see Note 7). Dimensionality reduction (FSE) is, of course, essential for obtaining an appropriate SFR. Realizing the role of the SFR is important when developing classifiers. However, an essential caveat is that data sparsity can render any classification result statistically suspect, even if the SFR is satisfied (3). The importance of guaranteeing the appropriate SFR is being recognized. However, the consequences of dataset sparsity are still not appreciated (16).
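The bias described above, which arises when the same samples are used both to build the classifier and to estimate its error, is avoided only if every data-dependent step, including feature selection, is repeated inside each CV fold. The sketch below illustrates that structure under loud assumptions: all names are hypothetical, a nearest-class-mean rule stands in for the simple linear classifiers recommended in the text (it is not LDA), and a crude class-mean-gap filter stands in for the wrapper-based selection of Subheading 2.3. Each training fold is assumed to contain both classes.

```python
import random
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature vector, class label 0 or 1)


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def class_means(train: Sequence[Sample],
                features: Iterable[int]) -> Dict[int, List[float]]:
    """Mean of each listed feature, computed separately for each class."""
    features = list(features)
    return {label: [mean(x[f] for x, y in train if y == label) for f in features]
            for label in (0, 1)}


def select_features(train: Sequence[Sample], n_keep: int) -> List[int]:
    """Stand-in filter: keep the features with the largest gap between class means."""
    n_features = len(train[0][0])
    m = class_means(train, range(n_features))
    gaps = [abs(m[0][f] - m[1][f]) for f in range(n_features)]
    return sorted(range(n_features), key=lambda f: -gaps[f])[:n_keep]


def predict(x: Sequence[float], features: Sequence[int],
            means: Dict[int, List[float]]) -> int:
    """Assign x to the class whose mean is nearer in the selected features."""
    def sq_dist(label: int) -> float:
        return sum((x[f] - c) ** 2 for f, c in zip(features, means[label]))
    return 0 if sq_dist(0) <= sq_dist(1) else 1


def cross_validate(data: Sequence[Sample], k: int = 5,
                   n_keep: int = 2, seed: int = 0) -> List[float]:
    """Return the accuracy on each held-out fold."""
    accuracies = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [s for i, s in enumerate(data) if i not in held_out]
        features = select_features(train, n_keep)  # training fold only
        means = class_means(train, features)
        hits = sum(predict(data[i][0], features, means) == data[i][1]
                   for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies
```

The mean and standard deviation of the returned accuracies give the summary described in Note 6, and setting `k` equal to the number of samples gives the leave-one-out variant.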
The control of disparate sensitivities and specificities produced by classifiers when the dataset is imbalanced has particular clinical relevance (typically, there are many more samples from normal subjects than from patients with particular diseases), and tuning methods are needed for the classifiers developed. The standard method in the pattern recognition literature is either oversampling (taking multiple samples from the sparser class) or undersampling (taking a subset of the samples from the larger class), such that the sample sizes in the two classes become balanced (sensitivity, SE ≈ specificity, SP). However, this approach fails quite frequently. Our approach is based on penalizing misclassification of members of the smaller class until SE ≈ SP (note that the penalty weight is generally not equal to the ratio of the class sizes).

2.5. Classifier Aggregation
Clinically relevant classifiers require statistically significant class assignments for the samples. Thus, when a classifier’s assignment probability for a sample is “fuzzy” (e.g., less than 75% for a two-class problem), that assignment is not really useful from a clinical point of view. If the overall accuracy of a classifier is low and the assignments are fuzzy, a multiple classifier strategy (classifier aggregation) can frequently be beneficial. The idea is to combine the outputs of several classifiers, with the expectation that the new classifier thus formed will be more accurate and less fuzzy than the best of the individual constituents. One of the requirements for accurate ensemble-based classifiers is diversity. It is believed that the component classifiers should be as different as possible. This can be achieved in several ways. One of these approaches used conceptually and methodologically very different classifiers [linear discriminant analysis (LDA), neural nets, and dynamic programming] on the same, unmodified data (17).
However, our more recent experiments and experience suggest that classifier diversity is not necessarily required. Comparable accuracy can be achieved in a simpler way, by employing a single, simple classifier (e.g., LDA) and producing diversity using different transformations of the data (we have already discussed some of these in the context of preprocessing). How are we to combine the outcomes of the various classifiers? Combination rules range from the simple majority rule to more complex, trainable rules, e.g., stacked generalization (SG) (18). SG uses the output probabilities of the constituent classifiers as input features for a new classifier. Boosting (19) is a very powerful version of a learnable classifier combination rule (see Note 8). It was used for identifying proteomic biomarkers for cancer detection (20). There are many classifier combination rules. When choosing such a rule, it is important to take into account both sample size and classifier complexity.
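The simplest combination rules discussed in this subheading can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, probability averaging is shown as one simple non-trainable rule alongside the majority rule (it is not stacked generalization, which would train a new classifier on the constituent outputs), and the 0.75 default is the example “fuzziness” threshold quoted above for a two-class problem.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return the most frequent class label; refuse to decide on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError("tie between classes; use an odd number of classifiers")
    return counts[0][0]


def average_probability(probs_class1: Sequence[float]) -> float:
    """Average the class-1 probabilities reported by the constituent classifiers."""
    return sum(probs_class1) / len(probs_class1)


def crisp_assignment(p_class1: float, threshold: float = 0.75) -> Optional[int]:
    """Return a class label only when the assignment is not 'fuzzy'.

    None means that neither class reaches the threshold probability.
    """
    if p_class1 >= threshold:
        return 1
    if 1.0 - p_class1 >= threshold:
        return 0
    return None


# Three hypothetical classifiers (e.g., one classifier on three data transformations).
print(majority_vote([1, 0, 1]))
print(crisp_assignment(average_probability([0.9, 0.8, 0.85])))
```

Reporting `None` rather than forcing a label keeps fuzzy assignments visible, which matters when the assignment is meant to support a clinical decision.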
3. Discussion
Of course, experimental quality control is essential for good classifiers, i.e., those that have useful generalization properties. Much has been made of the “surprising” observation that different (or even the same) experimental groups, using different classifiers, end up with totally different sets of discriminatory features (21). These differences are ascribed to various possible experimental differences in the spectral acquisition, etc. (22,23,24). Although these are indeed significant contributing factors, and must be considered and corrected, sight is lost of the important fact that when nonunique discriminatory sets are found, they are as likely caused by dataset sparsity (3) as by differences in experimental protocols. The initial euphoria is over: one cannot (or should not be able to) publish in prestigious journals (e.g., Science, Nature, Lancet, PNAS) proteomic results based on very limited sample sizes. Furthermore, even when there are enough data to produce a respectable classifier, high-impact journals are unlikely to accept a manuscript unless the results are independently validated. In particular, the chemical/biological identification of the discriminatory proteins, protein fragments, or peptides must accompany the classification results. This increased focus on establishing the clinical relevance of putative biomarkers is definitely a good sign. However, at this stage of the game, it is possibly premature, and one would prefer first to have a quick, noninvasive, reliable diagnostic/prognostic tool. To be clinically relevant, many more samples are required to develop such a tool (i.e., a sufficiently robust classifier; this requirement will likely rule out the reliable detection of rare diseases). Unfortunately, currently available sample sizes preclude the discovery of unique biomarker “fingerprints” of a disease.
This nonuniqueness due to data sparsity leads inevitably to expensive, onerous, and unnecessary laboratory investigations to sift out medically relevant, unique subsets from the plethora of putative biomarkers found and suggested for various diseases. Understanding the biochemical causes is, of course, essential for, say, finding a possible cure, but should follow the diagnostic/prognostic stage. Despite such caveats, the proteomics field is maturing and, once the technical problems are successfully resolved, will undoubtedly provide important medical/clinical insights. The author further suggests that the power of proteomic spectroscopy can be enhanced by the simultaneous consideration of other experimental modalities that complement PMS, especially MRS, which could identify smaller discriminatory compounds also present in biofluids.

4. Notes
1. Amongst these are correcting the nonflat baselines arising from the matrix material, peak alignment of the spectra, reconciling data acquisition at different times, in different laboratories, with mass spectrometers of different sensitivity,
correcting high-frequency noise, etc. Proper experimental design, including rigorous quality assessment and control, is essential before any classifier development is attempted. Good discussions and summaries are given in (21,22,23,24).
2. The realization that some classification strategy is essential for the analysis of proteomic data is recent. That these strategies are different emphasizes that not only is there no best classifier, but also that no unique, best strategy exists either; different groups discovered different strategies that worked well for the data they analyzed (20,25). What is common is that all strategies are multistage.
3. The data-driven nature of the SCS emphasizes the fact that there is no simple, universal prescription for creating an optimal classifier (4), i.e., no simple, ready “recipe” is, or is likely to be, available.
4. This much-improved result underscores the importance of feature selection. Note that both mappings were done using the Euclidean distance. This was necessary because, in the original 15,154-dimensional space, one cannot use any other distance measure (e.g., Mahalanobis) that involves matrix inversion. After feature selection, when the number of features is smaller than the number of samples, much more powerful and relevant distance measures can be used. For a fair comparison, the Euclidean distance is used for both cases presented in Fig. 2 [for further possible improvements obtainable using other distance measures, see (6)].
5. In practice, SVMs are not nearly as effective as suggested by theory. In fact, we have found (26) that a simple LDA classifier with wrapper-driven feature selection, when applied to several publicly available proteomic mass spectra datasets and to six microarray datasets, generally outperformed a linear SVM, even when the latter was used with feature selection. Furthermore, SVM-based classifiers frequently produce classification results that are distinctly out of balance: the accuracy obtained for one of the classes is, most of the time, considerably better than for the other. This imbalance between sensitivity and specificity is of clinical relevance when trying to minimize false negatives and/or false positives.
6. Different variants of CV deal differently with the so-called bias-variance dilemma, particularly acute for datasets with limited sample size. The simplest version, the leave-one-out (LOO) method, removes one of the N samples, develops a classifier with the remaining N – 1 samples, and tests its prediction accuracy on the left-out sample. By cycling through all N samples, N accuracy assessments are found. For small N (for which the data partition, as described in the main text, is not possible), LOO suffers from large variance, even though it minimizes the bias. K-fold CV is frequently used to balance bias and variance. The samples are partitioned into K roughly equal subsets. K – 1 subsets are used for training the classifier, while the left-out subset is the current test set. Cycling through the K partitions and then calculating the mean and standard deviation of the accuracies over the K test sets indicates how well and how reliably one is expected to classify new, unknown samples. K is typically chosen to be 5 or 10, whether or not the sample size warrants this choice. A more reasonable approach is to determine the best K via CV. Particularly powerful is Efron’s bootstrapping approach (15). This involves the entire dataset, but uses a random resampling with replacement strategy. A large number of artificial datasets
of the same size as the original are thus produced. A classifier is created for each of these, and the outcomes are averaged. Bootstrapping is supposed to reduce both large bias and variance. Inspired by the bootstrapping concept, we have been using, with some success, its generalization (27).
7. Instead of the direct use of nonlinear classifiers, with the attendant optimization problems, a simple trick is to use nonlinear terms but retain the simplicity of a linear classifier. One approach we found useful is to first develop a linear classifier (with feature selection) and then augment the linear features by constructing from them nonlinear functions, say, quadratic terms. This, of course, increases the number of parameters to be determined. However, the problem remains linear in the augmented feature space and linear classifiers can be developed. Furthermore, our explicit approach produces new features that remain interpretable as interaction terms. This is unlike the SVM classifiers that map implicitly into a much higher dimensional linear feature space, without interpretability. In addition, we can reduce the dimensionality of our augmented feature space by additional feature selection via exhaustive search, optimized by CV.
8. Boosting requires “weak” base classifiers, $C_j$, $j = 1, 2, \ldots, J$, that are combined into a more accurate composite classifier, $D_J = C_1 + C_2 + \cdots + C_J$. At stage $m$, the boosting algorithm carries out a weighted selection of a base classifier, given all previously chosen base classifiers. For the new base classifier $C_m$, larger weights are given to samples that are incorrectly classified by the current composite classifier $D_{m-1}$, so that $C_m$ will be chosen with a tendency to correctly classify previously incorrectly classified samples.
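The reweighting idea in Note 8 can be made concrete with a single update step. Note 8 does not specify the update rule, so the exponential rule below, taken from the widely used AdaBoost variant of boosting, is an assumption rather than the method of (19) or (20); the function name is hypothetical, and the sample weights are assumed to sum to 1 on entry.

```python
import math
from typing import List, Sequence, Tuple


def reweight(weights: Sequence[float],
             correct: Sequence[bool]) -> Tuple[List[float], float]:
    """One AdaBoost-style update of the sample weights.

    weights -- current sample weights, assumed to sum to 1
    correct -- True where the newly chosen base classifier was right
    Returns the renormalized weights and the vote alpha of the base classifier.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are up-weighted, correctly classified ones down-weighted.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha


# Four equally weighted samples; the base classifier misses the last one.
new_weights, alpha = reweight([0.25] * 4, [True, True, True, False])
print(new_weights, alpha)
```

After the update, the single misclassified sample carries half of the total weight, so the next base classifier is pushed toward classifying it correctly, which is the tendency described in Note 8.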
Acknowledgments
The author thanks the entire Biomedical Informatics Group for their decade-long, essential contributions to the development of the algorithms and software described.

References
1. Lean, C. L., Somorjai, R. L., Smith, I. C. P., Russell, P., Mountford, C. E. (2002) Accurate diagnosis and prognosis of human cancers by proton MRS and a three stage classification strategy. Annual Reports on NMR Spectroscopy 48, 71–111.
2. Somorjai, R. L., Dolenko, B., Nikulin, A., Nickerson, P., Rush, D., Shaw, A. et al. (2002) Distinguishing normal from rejecting renal allografts: application of a three-stage classification strategy to MR and IR spectra of urine. Vibrational Spectroscopy 28, 97–102.
3. Somorjai, R. L., Dolenko, B., Baumgartner, R. (2003) Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 19, 1484–1491.
4. Huber, P. J. (1985) Projection pursuit. Ann. Statistics 13, 435–475.
5. Somorjai, R. L., Alexander, M., Baumgartner, R., Booth, S., Bowman, C., Demko, A., Dolenko, B., Mandelzweig, M., Nikulin, A. E., Pizzi, N., Pranckeviciene, E., Summers, R., Zhilkin, P. (2004) A data-driven, flexible machine learning strategy for the classification of biomedical data. In: Dubitzky, W. and Azuaje, F. (eds.) Artificial Intelligence Methods and Tools for Systems Biology, Chapter 5. Computational Biology Series, Vol. 5. Springer, pp. 67–85.
6. Somorjai, R. L., Demko, A., Mandelzweig, M., Dolenko, B., Nikulin, A. E., Baumgartner, R. et al. (2004) Mapping high-dimensional data onto a relative distance plane – a novel, exact method for visualizing and characterizing high-dimensional patterns. Journal of Biomedical Informatics 37, 366–379.
7. Anderson, T. W., Bahadur, R. R. (1962) Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics 33, 420–431.
8. Kohavi, R., John, G. H. (1997) Wrappers for feature subset selection. Artificial Intelligence 97, 273–324.
9. Nikulin, A. E., Dolenko, B., Bezabeh, T., Somorjai, R. L. (1998) Near-optimal region selection for feature space reduction: novel preprocessing methods for classifying MR spectra. NMR in Biomedicine 11, 209–217.
10. Li, J., Zhang, Zh., Rosenzweig, J., Wang, Y. Y., Chan, D. W. (2002) Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clinical Chemistry 48, 1296–1304.
11. Dataset “JNCI-7-3-02,” downloaded from the NIH/FDA Clinical Proteomics Program Databank (http://clinicalproteomics.steem.com).
12. Vapnik, V. N. (2000) The nature of statistical learning theory, 2nd edition, Statistics for Engineering and Information Science. Springer, New York.
13. Schölkopf, B., Smola, A. J. (2002) Learning with Kernels. Support Vector Machines, Regularization, and Beyond. The MIT Press, Cambridge, Mass.
14. Lee, K. R., Lin, X., Park, D. C., Eslava, S. (2003) Megavariate data analysis of mass spectrometric proteomics data using latent variable projection method. Proteomics 3, 1680–1686.
15. Efron, B. (1982) The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
16. Diamandis, E. P. (2003) Proteomic patterns in biological fluids: do they represent the future of cancer diagnostics? Clinical Chemistry 49(8), 1272–1278.
17. Somorjai, R. L., Nikulin, A. E., Pizzi, N., Jackson, D., Scarth, G., Dolenko, B., Gordon, H., Russel, P., Lean, C. L., Delbridge, L., Mountford, C. E., Smith, I. C. P. (1995) Computerized consensus diagnosis: a classification strategy for the robust analysis of MR spectra. I. Application to 1H spectra of thyroid neoplasms. Magnetic Resonance in Medicine 33, 257–263.
18. Wolpert, D. H. (1992) Stacked generalization. Neural Networks 5, 241–259.
19. Schapire, R. E. (1990) The strength of weak learnability. Machine Learning 5, 197–227.
20. Yasui, Y., Pepe, M., Thomson, M. L., Adam, B.-L., Wright Jr., G. L., Qu, Y., Potter, J. D., Winget, M., Thornquist, M., Feng, Z. (2003) A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional data for cancer detection. Biostatistics 3, 449–463.
21. Diamandis, E. P. (2004) Mass spectrometry as a diagnostic and a cancer biomarker discovery tool. Molecular and Cellular Proteomics 3(4), 367–378.
22. Baggerly, K. A., Morris, J. S., Coombes, K. (2004) Cautions about reproducibility in mass spectrometry patterns: joint analysis of several proteomic data sets. Bioinformatics 20, 777–785.
23. Hu, J., Coombes, K. R., Morris, J. S., Baggerly, K. A. (2005) The importance of experimental design in mass spectrometry experiments: some cautionary tales. Briefings in Functional Genomics and Proteomics 3(4), 322–331.
24. Shin, H. and Markey, M. K. (2006) A machine learning perspective on the development of clinical decision support systems utilizing mass spectra of blood samples. Journal of Biomedical Informatics 39, 227–248.
25. Zhu, W., Wang, X., Ma, Y., Rao, M., Glimm, J., Kovach, J. S. (2003) Detection of cancer-specific markers amid massive mass spectral data. Proceedings of the National Academy of Sciences USA 100(25), 14666–14671.
26. Somorjai, R. L. and Pranckeviciene, E. (2006) (Unpublished).
27. Somorjai, R. L., Dolenko, B., Nikulin, A., Nickerson, P., Rush, D., Shaw, A., De Glogowski, M., Rendell, J., Deslauriers, R. (2002) Distinguishing normal from rejecting renal allografts: application of a three-stage classification strategy to MR and IR spectra of urine. Vibrational Spectroscopy 28, 97–102.
Index
Affi-gel Protein A MAPS II kit, 277 Aflatoxin B1 (AFB1), 194 Alkaline phosphatase (ALP) assay, 233, 237 Alpha-fetoprotein, 194 Alzheimer’s disease, 310 Annexin V, 172 ANOVA, analysis of variance, 100, 112, 114, 259, 330, 335, 344 Antibody arrays construction, 270–272 direct labeling methods, for cancer diagnostics, 268–269 formats for, 264–266 labeling and hybridization, of serum samples, 269–270, 272–274 and other proteomic strategies, 263–264 planar, labeling-hybridization methods and, 266–268 printing, 269 scanning and data analysis, 274 Anti-SAPE antibody, 267 ArrayQuant scanners, 281 AutoPix™, 48. See also Laser-capture microdissection Axon scanners, 281
Bayesian classification methods. See Linear Discriminant Analysis Bayes’s rule, 300 BCA 200 Protein Assay Kit, 277 Bead-based multiplex assays. See also Suspension antibody microarrays detection antibody, 254 diluents, 254 general protocol for, 254–255 sample preparation, 252–254 screening protocol, 255–256 Biological variation analysis (BVA) module, of DeCyder, 112–113 “Biomarker panel,” 11 Bio-Rad Micro Bio-Spin P30 column, 277 Biotinyl-tyramide, 275
BLAST, 352, 358 Blood samples, preanalytical phase collection of, 36 processing of, 37–38 protease inhibitors, 38 serum and plasma specimens, characteristics of, 36–37 Bradford assay, 225
Carboxylated beads, 249. See also Suspension antibody microarrays activation, 251 antibodies coupling to activated, 251 cell-counting chamber and, 252 washing and storage of coupled, 251 1-(5-Carboxypentyl)-1-methylindodicarbocyanine halide (Cy5) N-hydroxy-succinimidyl ester, 163 1-(5-Carboxypentyl)-1-propylindocarbocyanine halide (Cy3) N-hydroxy-succinimidyl ester, 163 CAST. See Clustering Affinity Search Technique Celecoxib, and cyclooxygenase-2 (COX-2), 183 Charge-coupled device (CCD) camera-based imaging system, 268, 293, 332 CIMminer (Clustered Image Maps), 259 Cleavable isotope-coded affinity tag (cICAT) labeling technology, 195, 197, 200–201 Clinical proteomics, 1 biological specimens, 6–7 biomarker discovery and, 9–14 overview and scope of, 2–3 sample specimens and processing techniques, 4–9 Cluster analysis techniques, 297–299, 306 gene expression-based, 307 Clustering Affinity Search Technique, 259 Coomassie brilliant blue (CBB) staining, 68, 332, 339 Creatinine assay, 142 Cyanines (Cy3/Cy5), 264, 333 Cyclooxygenase-2 (COX-2) and celecoxib, 183
CyDye labeling, 95, 105–106, 109–110. See also Difference gel electrophoresis (DIGE) technology Cy2-labeled internal standard, 98–99 minimal labeling method, 96 pooled-sample internal standard for, 107 saturation labeling, 96 Cy3-labeled streptavidin, 267 Cytokeratin 19 (CK19), 163 DA-PLS method. See Discriminant analysis–partial least squares method DeCyder software, 101, 112–113, 342. See also Difference gel electrophoresis (DIGE) technology Delayed extraction-matrix assisted laser desorption/ionization time-of-flight mass spectrometry (DE-MALDI-TOF-MS), 194 Dendrogram, 297, 299 Dialysis, 150. See also Urine protein profiling, by 2DE and MALDI-TOF-MS Difference gel electrophoresis (DIGE) technology, 78, 93, 330, 332–333, 342–345 ANOVA, 100, 112, 114 in clinical setting, 103 CyDye labeling, 95, 105–106, 109–110 Cy2-labeled internal standard, 98–99 minimal labeling method, 96 pooled-sample internal standard for, 107 saturation labeling, 96 DeCyder suite of software tools, 101, 112–113 2D gel electrophoresis and poststaining, 94, 110–111 experimental design, 108–109 and statistical confidence, 112–114 extended data analysis (EDA) software module, 101, 113 false discovery rate (FDR), 100 hierarchical clustering (HC), 102 labeling materials, 104–105 LCM and, 163–170 MeOH/CHCl3 protocol, 106 MuDPIT, 97 multivariate statistical analysis, 114–115 principal component analysis, 101 SDS-polyacrylamide gel electrophoresis, 104 software algorithms, 111–112 Student’s t-test, 100, 112, 114 DIGE/MS analysis, 103, 115 Direct labeling, 264, 268 protocol for, 272–274 Discriminant analysis–partial least squares method, 306, 309–311
Discrimination power (DP), 303–305 Dithiothreitol (DTT), 68 Dot-plot style alignment, of protein sequence, 358–359 DTT/IAA equilibration procedure, 73 ECM. See Extracellular matrix EDA software. See Extended data analysis software EDC/Sulfo-NHS, 249. See also Suspension antibody microarrays 2DE-MALDI-TOF-MS assay, 194 EnsEmbl, 352, 356 Escherichia coli, 307 Ethylene vinyl acetate (EVA) polymer, 161 Ettan 2D electrophoresis system, 110 Exosomes, 142 ExPASy proteomics tools, 202, 352 Expressed sequence tags (ESTs), 357 Extended data analysis software, 101, 113 Extracellular matrix, 8 and matrix vesicles (MVs) proteomes, MS and, 231–232 alkaline phosphatase assay, 234, 237 immunofluorescence staining and, 235, 239 MC3T3-E1, osteoblast cell line, 233, 236–237, 239 nanoRPLC-MS/MS, 235, 238–239 strong cation exchange liquid chromatography, of peptides, 234–235, 238 Extracted ion chromatogram, 219, 221–222, 224 Fetal bovine serum (FBS), 254 Fisher’s F-test, 302 Flow cytometric analysis, 160 Fluorophores, 264, 267 photobleaching and quenching of, 274–275 Fourier transform mass spectrometry (FTMS), 172–174 Free flow electrophoresis (FFE), plasma samples fractionation and, 60–61, 67 Frontotemporal dementia, 310 GAORS method. See Genetic algorithm-based optimal region selection method 2D Gaussian function, 312 Gaussian multivariate probability distribution, 300 2-D Gel-electrophoresis (2-D GE), 292. See also 2D-PAGE maps analysis LCM cells analysis by, 77 HER-2/neu positive and -negative breast tumors, 87–88
isoelectric focusing (IEF), 79–80, 83–84 MASCOT search engine, 87 paraffin-embedded sections staining, 81–82 preparation and analysis, 61, 67–69 protein sample preparation, 79, 82–83 SDS-PAGE, 79–80, 84–85 silver staining and image analysis, 80, 85–86 tissue block and tissue section preparation, 78–79, 81 trypsin digestion and MS analysis, 80, 86–87 Gel-free mass spectrometry and LCM, 171–172 Gene expression microarrays, 45 GenePix Pro 3.0 software program, 280–281 GeneScan program, 356 Genetic algorithm-based optimal region selection method, 387–388. See also Proteomic mass spectroscopy gp96, tumor rejection antigen, 169 GRANTA-519, 308
HCC. See Hepatocellular carcinoma HCL. See Hierarchical clustering Hematoxylin and eosin (H&E) staining, tissue sample collection, 44, 47–48 Hepatitis B/C virus (HBV/HCV), 194 Hepatocellular carcinoma, 8, 11, 59, 67, 163, 170, 193 qualitative and quantitative proteomic analysis of cICAT labeling technology, 195, 197, 200–201 2DE-MALDI-TOF-MS assay, 194 2D-LC-MS/MS for, 195–197, 201–202 ExPASy proteomics tools, 202 LCM for, 194–196, 199 nonenzymatic method (NESP), 196, 198–199 toluidine blue removal and protein mixture digestion, 197, 199–200 HERMeS software package, PCA and, 306 HER-2/neu oncogene, 85–86, 163 Hierarchical clustering, 259, 299. See also Cluster analysis techniques High performance liquid chromatography, 169, 171, 183, 212–214 Horseradish peroxidase (HRP), 267 HPLC. See High performance liquid chromatography HSP27 protein, 103 HT-29, COX-2 expressing colon cancer cell line, 183 Human Proteome Organization, 143 Hydrogels, 271. See also Antibody arrays
ICAT labeling. See Isotope-coded affinity tag labeling IMAC-Cu2+ ProteinChips, 134, 136 Image analysis. See also 2D-PAGE maps analysis by fuzzy logic principles image defuzzyfication, 312 image digitalization, 311–312 multi-dimensional scaling (MDS), 315–317 PCA and classification methods, 315 refuzzyfication, 312–313 moment functions, 317 Legendre moments, 318–319 Image Master Platinum software, 339, 341 Immobilized pH gradient strip. See also Two-dimensional electrophoresis (2DE) isoelectric focusing (IEF) with, 60, 65 rehydration of, 64–65 Immunofluorescence staining, 235 InterPro, 352, 361 Iodoacetamide (IAA), 68 IPG strip. See Immobilized pH gradient strip Isotope-coded affinity tag labeling, 78, 195 mass spectrometry (MS) and, 181 celecoxib, cyclooxygenase-2 (COX-2) and, 183 cell culture and harvest, 183, 186 cell lysis, desalting, and protein quantitation, 184–187 cleavable reagents, 182, 185, 187–188 cleaving biotin, 186, 189 labeled peptides purification, 185–186, 188–189 proteins, denaturation and reduction of, 185, 187 quantitative proteomic analysis and, 184 Java Runtime Environment, 370. See also msInspect, for LC-MS data analysis KMC (K-Means/K-Medians Clustering), 259 Kolmogorov–Smirnov test, 335, 339, 341 Kruskal–Wallis test, 335 Laser-capture microdissection, 8, 44–45, 160. See also Tissue sample collection, for proteomics analysis AutoPix™, 48 cells analysis, by 2-D GE, 77 HER-2/neu positive and -negative breast tumors, 87–88 isoelectric focusing (IEF), 79–80, 83–84
MASCOT search engine, 87 paraffin-embedded sections staining, 81–82 protein sample preparation, 79, 82–83 SDS-PAGE, 79–80, 84–85 silver staining and image analysis, 80, 85–86 tissue block and tissue section preparation, 78–79, 81 trypsin digestion and MS analysis, 80, 86–87 development, 161 different labeling techniques and, 170 DIGE and, 163–170 and 2-D GE, 162–163 gel-free mass spectrometry and, 171–172 for HCC and non-HCC hepatocytes isolation, 194–195, 199 LCM lysate, 49–50 and mass spectrometry analysis, 172–174 PixCell II instrument, 48–49, 161 and protein chip technology, 172 separation methods and, 171 for tissue sample collection, 44–45 Veritas™, 48 Laser microdissection and pressure catapulting, 8 LC-ESI-MS/MS. See Liquid chromatography-electrospray ionization tandem mass spectrometry LCM. See Laser-capture microdissection LC-MS data. See Liquid chromatography-mass spectrometry data LC-MS/MS. See Liquid chromatography-tandem mass spectrometry LDA. See Linear Discriminant Analysis Legendre moments, 317–319 Levene’s test, 334 Linear Discriminant Analysis, 300–301, 315–316 Liquid chromatography-mass spectrometry data, 370, 374–376, 377 Liquid chromatography-mass spectrometry data analysis, msInspect for, 369 data viewing and navigation, 371–373 locating peptides in, 373–376 low-quality peptides, elimination of, 376 peptide quantitation, 376–378 software installation for, 370 Liquid chromatography-tandem mass spectrometry, 170, 171 label-free, for biomarker identification, 209–210 albumin/IgG depletion, 211–213 chromatographic alignment, 218–221 data transformation and normalization, 222 HPLC, 212–214 mass spectrometer, 212, 214
MS/MS spectral filtering, 216–217 peptide identification, 217–218 peptide quantification, 221–222 statistical analysis, 223 zoom scan data processing, 214–216 two-dimensional (2D-LC/MS/MS), 78 LMPC. See Laser microdissection and pressure catapulting Lysine labeling, 169 MALDI/SELDI protein profiling, of serum, 125–126 on MALDI-TOF–TOF data collection, 131–132 MB fractionation, of human serum, 131 protein identification by, 132–133 MB-based fractionation, 127, 128, 131 SELDI and MALDI spectra acquisition, 129 SELDI ProteinChip, 130 (Magnetic bead based) on SELDI-TOF, 133 ProteinChip arrays, 134–135 SPA matrix addition, 135 spectra collection on, 135–138 MALDI-TOF-MS. See Matrix-assisted laser desorption time of flight mass spectrometry MALDI-TOF, peptide mass fingerprinting (PMF) and, 62, 71 MALDI-TOF–TOF, serum protein profiling on data collection, 131–132 MB fractionation, of human serum, 131 protein identification by, 132–133 Maleimide labeling, of cysteine sulfhydryls, 96 MARS. See Multiple affinity removal system MASCOT software, 81, 87–88 Mass spectrometry, 58–59, 214 ICAT labeling and, 181 celecoxib, cyclooxygenase-2 (COX-2) and, 183 cell culture and harvest, 183, 186 cell lysis, desalting, and protein quantitation, 184–187 cleavable reagents, 182, 185, 187–188 cleaving biotin, 186, 189 labeled peptides purification, 185–186, 188–189 proteins, denaturation and reduction of, 185, 187 quantitative proteomic analysis and, 184 LCM and, 172–174
Matrix-assisted laser desorption time of flight mass spectrometry, 125–126, 142, 163, 194 LCM and, 171 for urine protein profiling. See Urine protein profiling, by 2DE and MALDI-TOF-MS MAVER-1 cell lines, 308 MC3T3-E1, osteoblast cell line, 233, 236–237, 239 MDS technique. See Multi-dimensional scaling techniques MeOH/CHCl3 protocol, 106 Metalloproteins, 350 MicroSol-IEF, ZOOM®, 60, 65–66 Miniaturized parallelized sandwich immunoassays. See Suspension antibody microarrays MS. See Mass spectrometry MS-Fit software, 81 msInspect, for LC-MS data analysis, 369 data viewing and navigation, 371–373 locating peptides in, 373–376 low-quality peptides, elimination, 376 peptide quantitation, 376–378 software installation for, 370 MS/MS spectral filtering, 216–217 Multi-dimensional scaling techniques, 313, 315–317 MultiExperiment Viewer (MeV), 259 Multiple affinity removal system, 59, 63–64 Multiplexed bead-based flow-cytometry assays, 266 Nanoflow reversed-phase LC-tandem mass spectrometry (nanoRPLC-MS/MS), 233, 235, 238–239 Non-enzymatic sample preparation (NESP), 194, 196, 198–199 One-antibody label-based assays, 264–266 One-dimensional liquid chromatography coupled with tandem mass spectrometry (1D-LC-MS/MS), 201–202. See also Hepatocellular carcinoma ¹⁶O/¹⁸O isotopic labeling, 78 Osteoblasts, 232. See also Extracellular matrix MC3T3-E1, 233, 236–237, 239 2D-PAGE maps analysis, 291 dedicated software packages and, 292–294 image analysis fuzzy logic, 311–317 moment functions, 317–319 spot volume datasets, analysis of, 294 cluster analysis, 297–299 DA-PLS method, 309–311
linear discriminant analysis, 300–301 pattern recognition methods, 306–309 PLS regression and DA-PLS regression, 306 principal component analysis, 294–297 SIMCA method, 301–305 PALM microlaser dissector, 161 Parkinson’s disease, 310 Partial least squares regression, 306, 308, 338 Pattern recognition methods cluster analysis. See Cluster analysis techniques PCA. See Principal component analysis proteomic mass spectroscopy and. See Proteomic mass spectroscopy SIMCA classification. See Soft-independent model of class analogy method PCA. See Principal component analysis PCa-24 protein, in epithelial cells, 172 PDB. See Protein data bank PDQuest system, 293, 308 Peptide mass fingerprinting, MALDI-TOF and, 62, 71 Peptide/protein separation system, 171 PerkinElmer scanners, 281 Pfam, 352, 360 PIN. See Prostatic intraepithelial neoplasia PIVKA-II, 194 PixCell II system, 48–49, 77, 82–83, 161. See also Laser-capture microdissection Planar antibody arrays, 248, 264. See also Antibody arrays main formats of, 265 types of, labeling-hybridization methods and, 266–268 10plex soluble receptor assay, 255–256, 258. See also Bead-based multiplex assays PLS regression. See Partial least squares regression PMF. See Peptide mass fingerprinting PMS. See Proteomic mass spectroscopy Position-specific scoring matrix, 361 Post-translational modification (PTM) profiling, on selected spots, 71–72 Principal component analysis, 101, 259, 294–297, 308, 315–316, 343. See also 2D-PAGE maps analysis Escherichia coli, 307 for explorative data analysis, 336–338 in HERMeS software package, 306 U937 human lymphoma cell line and, 307 Prostatic intraepithelial neoplasia, 44 Protein chip technology and LCM, 172 Protein data bank, 352, 360–361 Protein precipitation, 143–144
Protein profiling of human plasma samples, by two-dimensional electrophoresis, 57 coomassie brilliant blue G-250 staining, 68 destaining, in-gel deglycosylation and in-gel tryptic digestion, 61–62, 69 2D gels preparation and analysis, 61, 67–69 difference in gel electrophoresis (DIGE) system, 59 free flow electrophoresis (FFE), samples fractionation by, 60–61, 67 high-abundance proteins depletion, by immunoaffinity column, 59, 63–64 HPPP, 58 IPG gel strip rehydration, 64–65 isoelectric focusing (IEF), with IPG strip, 60, 65 MALDI plating and peptides desalting, 62, 69–71 mass spectrometry (MS), 58–59 microscale solution isoelectric focusing, ZOOM®, 60, 65–66 peptide mass fingerprinting, MALDI-TOF and, 62, 71 PTMs profiling, on selected spots, 71–72 samples preparation, 59, 62 TCA/acetone precipitation, 64 Proteomic data, statistical analysis, 327 classical dyes, 339–342 confirmatory univariate data analysis, 333–335 DIGE approach, 342–345 experimental design for, 328 data processing, 330–333 pooling, 330 replicates, 329–330 exploratory multivariate data analysis, 335 marker selection, 338–339 principal component analysis, 336–338 Proteomic mass spectroscopy, 383 statistical classification strategy (SCS) for classifier aggregation, 390 data visualization, 384–385 feature selection/extraction (FSE), 386–388 preprocessing, 385–386 robust classifier development, 388–390 Proteomics analysis, for tissue sample collection formalin fixation, 43–44 hematoxylin staining, 47–48 immunocapture procedure, 46 immunofluorescence staining, 48 laser-capture microdissection (LCM), 44–45 AutoPix™, 48 PixCell II instrument, 48–49
Veritas™, 48 LCM lysate, 49–50 SELDI-TOF-MS, 46 PSSM. See Position-specific scoring matrix QTC (QT CLUST), 260 Resonance light scattering (RLS), 268 Reverse protein arrays, 268 Rolling-circle amplification (RCA), 268 SCX-LC. See Strong cation exchange liquid chromatography SDS-PAGE. See Sodium dodecyl sulfate-polyacrylamide gel electrophoresis SELDI. See Surface-enhanced laser desorption/ionization SELDI-TOF. See Surface-enhanced laser desorption/ionization time-of-flight Self Organizing Maps (SOM), 259 Self Organizing Tree Algorithm (SOTA), 259 Shapiro-Wilk test, 334, 339 Significance Analysis of Microarrays (SAM), 259 Silver staining, 80, 332–333. See also Laser-capture microdissection and image analysis, 85–86 SIMCA method. See Soft-independent model of class analogy method SKBR-3, breast cancer cell line, 171 Sodium dodecyl sulfate-polyacrylamide gel electrophoresis, 84–85, 94, 96, 104, 110–111 isoelectric focusing (IEF) and, 79–80 PROTEAN II xi Cell system (Bio-Rad) for, 84 Soft-independent model of class analogy method, 301–305, 307–308 Streptavidin-R-Phycoerythrin (SAPE), 267 Strong cation exchange liquid chromatography, 234–235, 238 Strong cation exchange liquid chromatography, of peptides, 233, 234–235, 238 Student’s T-test, 334 2-(4-Sulfophenylazo)-1,8-dihydroxy-3,6-naphthalenedisulfonic acid (SPADNS), 60, 67 Support vector machines, 388–389. See also Proteomic mass spectroscopy Surface-enhanced laser desorption/ionization, 9, 13, 125–126, 142, 172, 194 serum protein profiling on, 133 ProteinChip arrays, 134–135 SPA matrix addition, 135 spectra collection on, 135–138
Suspension antibody microarrays, 247–248 bead-based multiplex assays processing, 252–256 limit of detection (LOD), 257 miniaturized multiplexed protein assays, analytical performance, 256–259 pattern generation, 259–260 principle of, 249 production, coupling to carboxylated microspheres, 249–252 SVMs. See Support vector machines
TAAs arrays. See Tumor-associated antigen arrays TCA/acetone precipitation, 2DE and, 64 Tissue sample collection, for proteomics analysis formalin fixation, 43–44 hematoxylin staining, 47–48 immunocapture procedure, 46 immunofluorescence staining, 48 laser-capture microdissection (LCM), 44–45 AutoPix™, 48 PixCell II instrument, 48–49 Veritas™, 48 LCM lysate, 49–50 SELDI-TOF-MS, 46 Tributylphosphine (TBP), 68 Trichloroacetic acid (TCA) precipitation, 143–144, 146–147, 151 Trifluoroacetic acid (TFA), 182 Tris buffer, 277 TTEST (T-tests), 259 Tumor-associated antigen arrays, 266, 269 Two-dimensional electrophoresis (2DE), 11, 194, 328 biological replicates, 329–330 LCM and, 162–163 for protein profiling of human plasma samples, 57 coomassie brilliant blue G-250 staining, 68 destaining, in-gel deglycosylation and in-gel tryptic digestion, 61–62, 69 2D gels preparation and analysis, 61, 67–69 difference in gel electrophoresis (DIGE) system, 59 free flow electrophoresis (FFE), samples fractionation by, 60–61, 67 high-abundance proteins depletion, by immunoaffinity column, 59, 63–64 HPPP, 58 IPG gel strip rehydration, 64–65 isoelectric focusing (IEF), with IPG strip, 60, 65
MALDI plating and peptides desalting, 62, 69–71 mass spectrometry (MS), 58–59 microscale solution isoelectric focusing, ZOOM®, 60, 65–66 peptide mass fingerprinting, MALDI-TOF and, 62, 71 PTMs profiling, on selected spots, 71–72 samples preparation, 59, 62 TCA/acetone precipitation, 64 technical replicates, 329–330 for urine protein profiling. See Urine protein profiling, by 2DE and MALDI-TOF-MS Two-dimensional fluorescence difference gel electrophoresis (2-D DIGE), 78. See also Difference gel electrophoresis (DIGE) technology Two-dimensional liquid chromatography tandem mass spectrometry (2D-LC-MS/MS), 78, 170. See also Liquid chromatography-tandem mass spectrometry for HCC and non-HCC hepatocytes isolation, 195–197, 201–202 Two-dimensional polyacrylamide gel electrophoresis (2D PAGE), 162–163, 174. See also 2D gel electrophoresis, 2D gels Two-factor ANOVA (TFA), 259 Ultrafiltration technique, 144 Urine protein profiling, by 2DE and MALDI-TOF-MS, 141–142 analytical/profiling techniques, 145–146 organic solvent precipitation protocol, 145, 147–148 protein precipitation, 143–144 TCA/acetone precipitation protocol, 145–147 ultrafiltration-SPE, 144–145, 148–149 urine SPE, 149 Veritas™, 48. See also Laser-capture microdissection Web-based tools, for protein classification, 349 BLAST, 352, 358 dot-plot style alignment, of protein sequence, 358–359 EnsEmbl, 352, 356 evolution-based classification schemes, 351 ExPASy, 352 expressed sequence tags (ESTs), 357 GeneScan program, 356
InterPro, 352, 361 MEROPS, 361 metalloproteins, 350 PDB, 352, 360–361 Pfam, 352, 360 PRINTS, 361 PROSITE, 361 sequence and structure of proteins and, 352–356
SMART, 360 Western blotting protocols, 275 XIC. See Extracted ion chromatogram ZOOM®, MicroSol-IEF, 60, 65–66 Zoom scan triple-play experiment, 214