GENE DISCOVERY FOR DISEASE MODELS
Edited by
Weikuan Gu and Yongjun Wang
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2011 by John Wiley & Sons, Inc. All rights reserved Published by John Wiley & Sons, Inc., Hoboken, New Jersey Published simultaneously in Canada No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. 
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Gene discovery for disease models / edited by Weikuan Gu and Yongjun Wang. p. ; cm. Includes bibliographical references and index. ISBN 978-0-470-49946-7 (cloth) 1. Medical genetics. 2. Mutation (Biology) 3. Genomics. 4. Genetic disorders. I. Gu, Weikuan and Yongjun Wang. [DNLM: 1. Genetic Association Studies–methods. 2. Models, Genetic. 3. Mutation. QU 450] RB155.G3584 2011 616′.042–dc22 2010028355
Printed in Singapore.
CONTENTS
Preface
Acknowledgments
Contributors
1. Gene Discovery: From Positional Cloning to Genomic Cloning
Weikuan Gu and Daniel Goldowitz
2. High-Throughput Gene Expression Analysis and the Identification of Expression QTLs
Rudi Alberts and Klaus Schughart
3. DNA Methylation in the Pathogenesis of Autoimmunity
Xueqing Xu, Ping Yang, Zhang Shu, Yun Bai, and Cong-Yi Wang
4. Cell-Based Analysis with Microfluidic Chip
Wang Qi and Zhao Long
5. Missing Dimension: Protein Turnover Rate Measurement in Gene Discovery
Gary Guishan Xiao
6. Bioinformatics Tools for Gene Function Prediction
Yan Cui
7. Determination of Genomic Locations of Target Genetic Loci
Bo Chang
8. Mutation Discovery Using High-Throughput Mutation Screening Technology
Kai Li, Hanlin Gao, Hong-Guang Xie, Wanping Sun, and Jia Zhang
9. Candidate Screening through Gene Expression Profile
Michal Korostynski
10. Candidate Screening through High-Density SNP Array
Ching-Wan Lam and Kin-Chong Lau
11. Gene Discovery by Direct Genome Sequencing
Kunal Ray, Arijit Mukhopadhyay, and Mainak Sengupta
12. Candidate Screening through Bioinformatics Tools
Song Wu and Wei Zhao
13. Using an Integrative Strategy to Identify Mutations
Yan Jiao and Weikuan Gu
14. Determination of the Function of a Mutation
Bouchra Edderkaoui
15. Confirmation of a Mutation by Multiple Molecular Approaches
Hector Martinez-Valdez and Blanca Ortiz-Quintero
16. Confirmation of a Mutation by MicroRNA
Hongwei Zheng and Yongjun Wang
17. Confirmation of Gene Function Using Translational Approaches
Caroline J. Zeiss
18. Confirmation of Single Nucleotide Mutations
Jochen Graw
19. Initial Identification and Confirmation of a QTL Gene
David C. Airey and Chun Li
20. Gene Discovery of Crop Disease in the Postgenome Era
Yulin Jia
21. Impact of Genomewide Structural Variation on Gene Discovery
Lisenka E.L.M. Vissers and Joris A. Veltman
22. Impact of Whole Genome Protein Analysis on Gene Discovery of Disease Models
Sheng Zhang, Yong Yang, and Theodore W. Thannhauser
Index
PREFACE
The availability of an annotated mouse genome sequence now provides the most efficient tool yet in the gene hunter’s toolkit. One can move directly from genetic mapping to identification of candidate genes, and the experimental process is reduced to PCR amplification and sequencing of exons and other conserved elements in the candidate interval. With this streamlined protocol, it is anticipated that many decades-old mouse mutants will be understood precisely at the DNA level in the near future.
This paragraph, from Initial Sequencing and Comparative Analysis of the Mouse Genome (2002) by the Mouse Genome Sequencing Consortium, summarizes the historical transition in both the strategy and the direction of gene discovery. It describes the tremendous changes in our protocols for positional cloning and announces the evolution of positional gene cloning. Positional gene cloning no longer requires years of teamwork. In fact, it is now possible for a single laboratory to identify several genes within one year. This is not only an extraordinary milestone but also a new starting point for better comprehending the biological function(s) of genes throughout the whole genome. By implication, the identification of mutated genes in human diseases and in other animal models should benefit in the same way once their genomes have been sequenced as well.
More than half a decade has passed since that paragraph’s publication. We are pleased to prepare a book detailing a new set of concepts and protocols representing the current progress in gene discovery. One goal of this book is to provide readers with a comprehensive understanding of the new concepts and protocols implemented in gene discovery in the present post-genome era.
The dramatic progress of gene discovery is built on tremendous resources, such as genome sequences, literature, and databases, as well as technologies such as high-throughput gene expression platforms and mutation-screening systems. With the availability of whole genome sequences, molecular markers can be precisely located on the genome and within a particular region of interest. In addition, every genomic element can be obtained easily through genome databases. Online literature and databases for gene information allow a quick search for a specific gene’s function.
Large databases of gene expression profiles created from microarray data provide information on both the expression level and the specificity of almost every gene in humans and major organisms. High-throughput mutation screening systems are capable of identifying mutations in hundreds of genes within a minimal time frame. This book not only provides a systematic introduction to the available resources and technologies for gene discovery but, most important, teaches readers how to use all the available tools and data to find new mutated genes.
Another goal of this book is to predict the future of gene discovery. We intend for readers not only to understand the current concepts and technologies but also to learn how to take advantage of new resources and technologies in the future, that is, to be able to adapt to new discoveries in the genetic sciences. In the post-genome era, a good geneticist should be able to use the rapidly accumulating genome resources and the rapidly developing biotechnologies effectively and efficiently. Benchwork now produces tens of thousands of times more data than was possible decades ago. Gene discovery is, to a certain degree, data collection and analysis from large resources. We do not expect that every reader will be able to predict the future of genetic research, but we do hope that, by reading this book, readers can sharpen their minds in preparation for expected or unexpected challenges in this booming era of genomic research.
This book can be used as a handbook for gene cloning and discovery as well as a reference book for teachers and students in the fields of genetics and biology.
Weikuan Gu and Yongjun Wang
ACKNOWLEDGMENTS
We thank everyone who contributed to this book for their dedicated work in making it available. This is a unique, international team with strong scientific backgrounds and broad experience. We appreciated the discussions and exchange of ideas during the preparation of each chapter. We would like to thank the following people for kindly reviewing the chapters for this book: Beth Bennett, Cong-Yi Wang, Daniel Goldowitz, David C. Airey, Griffin Gibson, Junming Yue, Qing Xiong, and Yan Cui. Special thanks to Drs. Bruce Roe, Hongwen Deng, Xinmin Li, Xingen Lei, and Wesley Beamer for their suggestions and kind support during the preparation of this book. We appreciate the assistance of David L. Armbruster and Griffin Gibson in editing the chapters and, finally, we would also like to thank Griffin Gibson, XiaoYue Liu, Lishi Wang, and Yue Huang for their assistance in formatting the chapters.
CONTRIBUTORS
David C. Airey, Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, United States
Rudi Alberts, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany
Yun Bai, Department of Medical Genetics, Third Military Medical University, Chongqing, China
Bo Chang, Jackson Laboratory, Bar Harbor, ME, United States
Yan Cui, Department of Molecular Sciences, University of Tennessee Health Science Center, Memphis, TN, United States
Bouchra Edderkaoui, School of Medicine, Loma Linda University, Loma Linda, CA, and Research Scientist, Musculoskeletal Disease Center, JLP Memorial VA Medical Center, Loma Linda, CA, United States
Hanlin Gao, DNA core, City of Hope National Medical Center, Duarte, CA, United States
Daniel Goldowitz, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Jochen Graw, Helmholtz Center Munich, German Research Center for Environmental Health, Institute of Developmental Genetics, Neuherberg, Germany
Weikuan Gu, Department of Orthopedic Surgery–Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States
Yulin Jia, USDA-ARS Dale Bumpers National Rice Research Center, University of Arkansas, Stuttgart, AR, United States
Yan Jiao, Department of Orthopedic Surgery–Campbell Clinic, University of Tennessee Health Science Center, Memphis, TN, United States
Michal Korostynski, Department of Molecular Neuropharmacology, Institute of Pharmacology Polish Academy of Sciences, Krakow, Poland
Ching-Wan Lam, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China
Kin-Chong Lau, Department of Pathology, the University of Hong Kong, Queen Mary Hospital, Hong Kong, China
Chun Li, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, United States
Kai Li, Department of Pharmacology, Suzhou University, Suzhou, Jiangsu, China
Zhao Long, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China
Hector Martinez-Valdez, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States
Arijit Mukhopadhyay, Genomics & Molecular Medicine, Institute of Genomics & Integrative Biology (CSIR), Delhi, India
Blanca Ortiz-Quintero, Department of Immunology, The University of Texas M. D. Anderson Cancer Center, Houston, TX, United States
Wang Qi, Respiratory Department, The Second Hospital Affiliated with Dalian Medical University, Dalian, China
Kunal Ray, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India
Klaus Schughart, Department of Infection Genetics, Helmholtz Center for Infection Research & University of Veterinary Medicine, Hannover, Germany
Mainak Sengupta, Molecular & Human Genetics Division, Indian Institute of Chemical Biology (CSIR), Kolkata, India
Zhang Shu, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Wanping Sun, Department of Pharmacology, College of Pharmacy, Suzhou University, Suzhou, Jiangsu, China
Theodore W. Thannhauser, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States
Joris A. Veltman, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands
Lisenka E.L.M. Vissers, Radboud University Nijmegen Medical Centre, Nijmegen Centre of Molecular Life Sciences, Department of Human Genetics, Nijmegen, The Netherlands
Cong-Yi Wang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, GA, United States
Yongjun Wang, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
Song Wu, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States
Gary Guishan Xiao, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
Hong-Guang Xie, Hospital Central Laboratory, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
Xueqing Xu, Department of Medical Genetics, Third Military Medical University, Chongqing, China
Ping Yang, The Center for Biomedical Research, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Yong Yang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States
Caroline J. Zeiss, Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, United States
Jia Zhang, DNA core, GNF Institute, San Diego, CA, United States
Sheng Zhang, Proteomics and Mass Spectrometry Core Facility, Robert W. Holley Center for Agriculture & Health, USDA-ARS, Cornell University, Ithaca, NY, United States
Wei Zhao, Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN, United States
Hongwei Zheng, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
Figure 2.2. A microarray produced by a scanner. Each of the spots on the microarray represents a gene, and the color represents the amount of fluorescence that is measured, hence the amount of cDNA that was present in the original sample. Reprinted from Reinke (2006).
Figure 2.8. (a) The genomewide genotypes of eight recombinant inbred lines (RILs) generated from a cross between two homozygous parents (A and B). Each row indicates the genome of a single RIL. The light or dark gray color in each of the RILs indicates whether that part of its genome was inherited from parent A or B. (b) Gene expression values are determined by microarrays. Four values are shown for each parent and one value for each of the RILs. (c) For three molecular markers, the gene expression values of the RILs are divided into two groups according to the allele they carry for that molecular marker (light or dark gray). A statistical test at each marker location calculates whether the means of the two groups differ. The significances of the tests are plotted genomewide as a QTL plot. Here, a QTL peak is found for the second marker. Triangles in the QTL plot indicate the position of the gene whose expression was used. If the gene coincides with the QTL peak, the QTL is referred to as a cis-QTL; otherwise, it is called a trans-QTL. Adapted from Alberts et al. (2005) by permission of Oxford University Press.
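The marker-by-marker test described in this caption can be sketched in Python. This is a minimal illustration only, not the procedure used by Alberts et al. (2005): the genotypes and expression values are invented, the helper names (welch_t, scan_markers, classify) are hypothetical, Welch's t statistic stands in for whichever test a real analysis would use, and the cis/trans call is reduced to asking whether the gene sits at the peak marker.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def scan_markers(genotypes, expression):
    """Return |t| per marker; genotypes[marker] holds one 'A'/'B' allele per RIL."""
    scores = {}
    for marker, alleles in genotypes.items():
        a = [e for e, g in zip(expression, alleles) if g == "A"]
        b = [e for e, g in zip(expression, alleles) if g == "B"]
        scores[marker] = abs(welch_t(a, b))
    return scores

def classify(peak_marker, gene_marker):
    """Call the QTL cis if the gene lies at the peak marker, otherwise trans."""
    return "cis" if peak_marker == gene_marker else "trans"

# Hypothetical data: eight RILs, three markers, one transcript.
genotypes = {
    "marker1": list("AABBABAB"),
    "marker2": list("AAAABBBB"),
    "marker3": list("ABABABAB"),
}
expression = [9.8, 10.1, 10.0, 9.9, 6.1, 5.9, 6.0, 6.2]

scores = scan_markers(genotypes, expression)
peak = max(scores, key=scores.get)  # marker with the strongest association
print(peak, classify(peak, gene_marker="marker2"))
```

In this toy data set only the second marker splits the RILs into a high-expression and a low-expression group, mirroring the single QTL peak in panel (c).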
Unmethylated DNA (left): GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT
Methylated DNA (right): GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC
Figure 3.2. Bisulfite-specific PCR and direct sequencing. Genomic DNA underwent bisulfite conversion followed by PCR amplification. The resulting PCR products were then purified and directly sequenced. The results at the left are from the unmethylated DNA; the results at the right are for the methylated DNA.
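As an illustration of how the two reads are interpreted, the Python sketch below compares the two bisulfite-converted sequences shown with this figure, position by position. Bisulfite treatment converts unmethylated cytosine to uracil, which is read as T after PCR, whereas methylated cytosine is read as C; a T in one read opposite a C in the other therefore marks a cytosine that was protected by methylation. The function name methylated_positions is hypothetical, and the sketch assumes the two reads are already aligned and of equal length.

```python
# Reads transcribed from the figure: unmethylated template first, methylated second.
unmethylated = "GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT"
methylated = "GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC"

def methylated_positions(converted_unmeth, converted_meth):
    """0-based positions read as T in the unmethylated read but as C in the methylated read."""
    if len(converted_unmeth) != len(converted_meth):
        raise ValueError("reads must be aligned and of equal length")
    return [
        i
        for i, (u, m) in enumerate(zip(converted_unmeth, converted_meth))
        if u == "T" and m == "C"
    ]

positions = methylated_positions(unmethylated, methylated)
print(positions)
```

Every other position is identical between the two reads, which is what is expected when the only difference between the templates is their methylation state.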
[Figure 4.4 labels: (a) staurosporine; microsyringe pump; TLM detection (532 nm); drain; X-Y scanning stage; cell culturing flask; 30 µm. (b) Signal scale from 0–1 to 9–10; + staurosporine; 30 µm.]
Figure 4.4. A single-cell analysis system in a glass microchip using a thermal lens microscope (TLM). (a) Cell culture chip design and TLM scanning method. A microflask (1 mm × 10 mm × 0.1 mm) was fabricated in a glass microchip, and a cell suspension was introduced into it. After cultivation, the microchip with capillaries connected to syringe pumps was mounted on the TLM stage, and TLM signals were measured while scanning the stage to obtain a 2D-image. (b) Direct imaging of cytochrome-c in a cell and its distribution change during apoptosis (Tamaki et al., 2002).
[Figure 4.5 labels: (a) cell culture chambers a–h; A23187 concentrations from low to high: 0.0, 0.86, 1.71, 2.57, 3.42, 4.28, 5.13, and 6.00 µM. (b) x-axis: A23187 concentration (µM); y-axis: normalized fluorescence intensity per cell.]
Figure 4.5. The expression of GRP78 at the protein level by immunofluorescence in SK-MES-1 cells. (a) Cells were treated with A23187 for 24 h and observed by fluorescence microscope (IX-71; Olympus Optical Co., Tokyo, Japan) (×400). The fluorescence indicates GRP78 in the cytoplasm. (b) The average expression of GRP78 per cell, reflected by the normalized fluorescence intensity, increased with the concentration of A23187. The normalized fluorescence intensity per cell was determined by dividing the fluorescence intensity in a region by the number of cells in that region (Wang et al., 2009).
Figure 6.3. Function annotation with Blast2GO. In this example, 10 sequences are annotated. The upper panel shows the annotations transferred from homologous sequences. The lower panel shows the part of GO DAG containing the annotation terms assigned to the query sequences. Node color represents the annotation intensity.
[Figure 8.1 labels: (a) PCR; conformation; electrophoresis. (b) Variant (2677T, Ser893); MDR1*1 (2677G, Ala893).]
Figure 8.1. (a) Band patterns of single-stranded PCR products as visualized on a gel differ with the change in their conformations. (b) An example of how to identify two variants in the MDR1 gene using SSCP. (Kim et al., 2001.)
[Figure 9.1 diagram labels: model system; clinical samples; gene expression profiling; list of transcripts; co-expression of genes; association with phenotype; mechanism of regulation; gene expression signature; gene function; validation; candidate drug targets; candidate biomarkers; discovery of new drug; discovery of new diagnostic marker.]
Figure 9.1. Major concepts in gene transcription profiling.
[Figure 9.4 panels (a)–(f). Axis labels: log intensity; intensity ratio; average intensity; log-odds ratio; log2 fold change; sample quantiles; theoretical quantiles; PCA1; PCA2 (Class 1, Class 2).]
Figure 9.4. Various methods of presentation of microarray quality control and data analysis. (See text for full caption.)
[Figure 10.1 labels: probe offsets -4, -1, 0, +1, +4, each tiled with PMA, MMA, PMB, and MMB probes; hybridization patterns for genotypes AA, AB, and BB.]
Target sequence (250–2000 bp): ...CAGACAGAGTCTTG[A/C]AATCTATTTCTCATA...
Probe sequences (25 bp):
PMA: TGTCTTCAGAACTTTAGATAAAGAG
MMA: TGTCTTCAGAACATTAGATAAAGAG
PMB: TGTCTTCAGAACGTTAGATAAAGAG
MMB: TGTCTTCAGAACCTTAGATAAAGAG
Figure 10.1. Probe array tiling and hybridization patterns (from Affymetrix).
Standard PCR protocol (48 or 96 per batch)
1. Pool 700 µL PCR into deep well plate
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Transfer PCR + beads to filter plate
5. Apply vacuum until all wells are dry (60–90 min)
6. Add 1.8 mL 75% ethanol wash
7. Apply vacuum until all wells are dry (10–20 min)
8. Dry beads for further 10 min under vacuum
9. Tap off excess ethanol and attach catch plate
10. Add 55 µL elution buffer
11. Incubate on vortexer for 10 min
12. Apply vacuum until all wells are dry (15–30 min)
13. Centrifuge for 5 min at 1400 RCF at RT
14. Remove catch plate with eluate (~50 µL)
Modified PCR protocol (no batch size limitation)
1. Pool 700 µL PCR into 2-mL microcentrifuge tube
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Place on magnetic stand for 10 min
5. Pipette out the supernatant
6. Add 1.8 mL 75% ethanol wash
7. Vortex at 75% power for 2.5 min; incubate 7.5 min
8. Place on magnetic stand for 10 min
9. Pipette out the supernatant; air-dry for 15 min
10. Add 55 µL elution buffer
11. Vortex at 75% power for 2.5 min; incubate 7.5 min
12. Place on magnetic stand for 10 min
13. Collect the eluate (~55 µL)
Figure 10.3. Comparing PCR purification workflow between the Affymetrix standard and our modified protocols.
Figure 10.10. Identification of homozygous DYFS mutations in the homozygous region detected by SNP Array 6.0 (Lau et al., 2009).
[Figure 11.2 labels: (a) shearing of genomic DNA; end repairing of sheared DNA; addition of dATP at the 3′ ends; adapter ligation; purification; adapter-mediated PCR enrichment of fragments. (b) Prepared library + hybridization buffer + biotinylated probes; hybridization with optimum temperature regulation; streptavidin-coated magnetic beads; bead capture; unbound fraction discarded; wash beads and remove probes; amplify; sequencing.]
Figure 11.2. The hybridization-based sequencing method. (a) Genomic DNA is sheared and end repaired or modified. A poly-A tail is added to the fragments, adapters are ligated to the 3′-end of the fragments, and excess adapters or unligated primers are removed. The amplicons are purified, and adapter-specific PCR amplification is done to enrich the product pool to prepare a library. (b) The prepared library is hybridized with relevant biotinylated probes (specific sequences, whole exome, etc.) in solution in a hybridization array. The probes bind to the relevant sequences from the library. Then streptavidin-coated magnetic beads are released in the array, and a magnet is used to capture biotinylated probes bound to their complementary sequences. Those specific sequences can then be sequenced on an appropriate platform. [Panel (b) of the illustration has been adapted from Protocol version 1.0.1, October 2009; SureSelect Human All Exon Kit from Agilent Technologies.]
[Figure 11.3 labels: (a) steps 1–4; (b) primer library, steps 5–6; (c) genomic DNA template, microfluidic chip, steps 7–9; chromosomes 1–22, X, Y; genomic DNA; droplet PCR; break emulsion; gDNA removal; fragmentation and nick translation; sequence.]
Figure 11.3. Microdroplet PCR workflow. (a) Primer library generation: 1, Identify targeted sequences of interest in the genome. 2, Design and synthesize forward and reverse primer pairs for each targeted sequence (library element). 3, Generate primer pair droplets for each library element. A microfluidic chip is used to encapsulate the aqueous PCR primers in inert fluorinated carrier oil with a block-copolymer surfactant to generate the equivalent of a picoliter-scale test tube compatible with standard molecular biology. 4, Mix primer pair droplets of library elements together so that each library element has an equal representation. (b) Genomic DNA template mix preparation: 5, Biotinylate (red dots), fragment into 2- to 4-kb fragments, and purify genomic DNA. 6, Mix purified genomic DNA together with all of the components of the PCR reaction (DNA polymerase, dNTPs and buffer) except for the PCR primers. (c) Droplet merge and PCR: 7, Dispense primer library droplets to the microfluidic chip. 8, Deliver the genomic DNA template as an aqueous solution; template droplets are formed within the microfluidic chip. Then pair the primer pair droplets and template droplets in a 1:1 ratio. 9, Allow the paired droplets to flow through the channel of the microfluidic chip and pass through a merge area, where an electric field induces the two discrete droplets to coalesce into a single PCR droplet. Collect ∼1.5 million PCR droplets in a single 0.2-mL PCR tube. Process the PCR droplets (PCR library) in a standard thermal cycler for targeted amplification; break the emulsion of PCR droplets to release the PCR amplicons into solution for genomic DNA (gDNA) removal, purification, and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology, Tewhey et al., Microdroplet-based PCR enrichment for large-scale targeted sequencing, 27, 1025–1031, 2009.)
[Figure 15.1 labels: (A) gel lanes A, C, G, T. (B) Chromatogram base calls: ATTCGATATCAAGCTTATCGATACCGTCGACCT.]
Figure 15.1. Manual Versus Automated DNA Sequencing. (A) Shows acrylamide gel electrophoresis results resolving typical chain termination reactions (Ho et al. unpublished data). Each lane corresponds to a designated reaction terminated with ddATP (A), ddCTP (C), ddGTP (G) and ddTTP (T) analogs (Ho et al. unpublished), which identifies respective nucleotides on target DNA template. (B) Depicts a color-coded chromatogram of typical automated DNA sequencing data (Albrechtson et al. unpublished).
Figure 15.2. Chromosome locus assignment of a newly discovered gene. Fluorescent in situ hybridization analysis, using a biotin-labeled genomic probe, which reveals the new gene at the 9q32 locus (Sims-Mourtada et al., 2005) upon binding of fluorescent Streptavidin (shown herein as pseudo yellow fluorescence) against DAPI (blue fluorescence) background. Arrows emphasize gene locus assignment on respective chromosome 9 alleles.
Figure 15.3. Gene Expression by Histological Methods. (See text for full caption.)
Figure 17.2. WT and rd-1 mouse retina at postnatal day 12. In rd-1 mice, a mutation in the cyclic phosphodiesterase beta gene results in rapid loss of the outer nuclear layer in the first 3 weeks of life.
CHAPTER 1
Gene Discovery: From Positional Cloning to Genomic Cloning
WEIKUAN GU and DANIEL GOLDOWITZ
Contents
1.1 Concept of Classic Positional Cloning
1.2 Concept of Gene Discovery in the Post-Genome Era
1.3 Strategies for Gene Discovery in the Post-Genome Era
1.4 Future Direction
1.5 References
Despite the highly significant advances in studying the genetics and genomics of human populations, there are still large gaps in our understanding of the molecular genetic mechanisms involved in the pathogenesis of many human diseases. The mutated genes in many human diseases remain unknown. Identification of these mutations is crucial for correlating disease pathology and biology to the molecular basis of the disease. Discovery of new gene functions depends on the identification of the mutated genes responsible for disease in humans and other species. The techniques of positional cloning have oftentimes discovered new functions of known genes or new genes for known diseases. The goal of this book is to provide illustrations of the strategy in the post-genomic era for the identification and initial characterization of mutated genes in inherited human diseases and animal models.
1.1 CONCEPT OF CLASSIC POSITIONAL CLONING
Positional cloning, also called reverse genetics, is the identification and cloning of a specific gene, with its chromosomal location being the only available information about that gene (Collins, 1990). The identification of the X-linked gene for chronic granulomatous disease in 1986 was the first report employing such a strategy (Baehner et al., 1986; Royer-Pokora et al., 1986). For the past several decades, positional cloning has been widely used in humans, animals, and plants to isolate genes known only by their phenotypic effects. Underlying positional cloning is the assumption that a gene’s location can be pinpointed with sufficient precision to narrow it down to a DNA segment that is small enough to be sequenced and/or subjected to transformation/complementation experiments.
The classic procedure for positional cloning usually includes several steps, as shown in Figure 1.1. It starts with phenotype collection from a genetically mappable population. The population genetics necessary for creating the mappable population is beyond the scope of this chapter (Holsinger and Weir, 2009; Zou, 2009). Briefly, however, a mutant phenotype can be genetically mapped when (1) the phenotype shows Mendelian inheritance, (2) the phenotype is differentially distributed among individuals within the population, and (3) the population is large enough to reach statistical significance when the phenotype is analyzed using mapping software.
[Figure 1.1 flowchart steps: collection of phenotype information and collection of genotype information; initial mapping of trait locus; fine mapping of trait locus; genomic contig construction; analysis of genomic elements; selection of candidate genes; confirmation of candidate genes.]
Figure 1.1. Procedure for identifying a mutated gene using the strategy of classic positional cloning.
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
Parallel to the phenotype collection, genotype information of the same individuals in the same population is collected. Usually, molecular markers that segregate in the population along each and every chromosome are analyzed. The collected phenotype and genotype data from the population are used in conducting linkage analysis with one of a variety of software packages to define the chromosomal regions that the locus is likely to occupy. If a trait is controlled by a single gene or locus, the linkage analysis should point to a single chromosomal region. For traits regulated by multiple genes, multiple loci, or quantitative trait loci, multiple chromosomal regions are identified.
To actually identify the gene underlying the trait of interest, fine mapping has to be conducted to narrow down the chromosomal regions so that genomic searching is practical. The next step, then, is to construct a genomic contiguous region (contig), which is defined as a set of overlapping segments of DNA, to connect and cover all the genomic elements in the targeted area. After a precise contig is constructed, it will be sequenced and analyzed by a technique termed chromosomal walking. This is a lengthy procedure that involves the recognition of potential genes, noncoding genes, and/or coding and noncoding regions. Finally, potential candidate genes should be confirmed using a variety of genetic and biochemical methods.
Because all of these procedures require a large amount of work, positional cloning typically requires a team effort, and positional cloning projects have been known to take many years. First, the genetic region needs to be narrowed down as precisely as possible by means of initial linkage analysis and fine mapping. Second, linkage analysis requires both the availability of a large pedigree and PCR-based analysis of microsatellite markers of that pedigree to allow a whole-genome search for linkage. Fine mapping is a particularly difficult task consisting of breaking the linkage and identifying useful markers in the targeted region.
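To make the linkage analysis step concrete, the Python sketch below computes a two-point LOD score for a fully informative, phase-known cross: the log10 ratio of the likelihood of the observed recombinants at recombination fraction theta to the likelihood under free recombination (theta = 0.5). It illustrates the standard statistic only; it is not the algorithm of any particular mapping package, and the function names and the counts (4 recombinants among 40 meioses) are invented for the example.

```python
from math import log10

def lod_score(recombinants, meioses, theta):
    """Two-point LOD score at recombination fraction theta (0 < theta <= 0.5)."""
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    nonrecombinants = meioses - recombinants
    # log10 likelihood of the data if marker and trait locus recombine at rate theta
    log_l_theta = recombinants * log10(theta) + nonrecombinants * log10(1 - theta)
    # log10 likelihood under no linkage (theta = 0.5)
    log_l_null = meioses * log10(0.5)
    return log_l_theta - log_l_null

def best_theta(recombinants, meioses, steps=50):
    """Grid search over theta = 0.01 ... 0.50 for the maximum LOD score."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    return max(grid, key=lambda t: lod_score(recombinants, meioses, t))

# Hypothetical counts for one marker tested against the trait locus.
theta_hat = best_theta(recombinants=4, meioses=40)
peak_lod = lod_score(4, 40, theta_hat)
print(f"theta = {theta_hat:.2f}, LOD = {peak_lod:.2f}")
```

A LOD score of 3 or more is the conventional threshold for declaring linkage; repeating the calculation for every marker along each chromosome yields the kind of genomewide scan that points to the chromosomal region occupied by the locus.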
Contig construction entails identification of a large-insert genomic library, either BAC (bacterial artificial chromosomes) or YAC (yeast artificial chromosomes), with known markers. Analysis of genetic elements within a contig can be very difficult because of the lack of knowledge of both genes and gene organization. However, the recent completion of the human and mouse genome projects (e.g., Mouse Genome Sequencing Consortium, 2002), along with other new technology, such as mutation analysis and microarrays, allows unprecedented progress in positional cloning of mutant genes. There are five major changes in the technique of positional cloning (Hinkes et al., 2006): (1) Contig construction is no longer needed because of the availability of whole genomes that have been sequenced. (2) Sequencing of an entire region, usually 10 Mbp of the genome, is no longer necessary, as those sequences are now readily available through public (Ensembl) and private (Celera) databases. (3) Sequence analysis requires much less time and effort since annotations of whole genomes have been done (e.g., we now know that the majority of the mouse genome is made up of repetitive sequences, such as transposons, that are easy to identify and, therefore, can be eliminated from further analysis). (4) Because of the availability of whole genome sequences and high-throughput
technologies, we can now work on much larger genomic regions, which eliminates fine mapping. (5) Annotation of genomes and bioinformatic algorithms have kept pace with the rapid acquisition of genomic data and permit an in silico assessment of candidate genes. This is the major theme of this book. As a result of new high-throughput technologies and whole-genome libraries, a genome-based integrative strategy is the most practical method for gene discovery in our current post-genome era (Gu et al., 2002; Jiao et al., 2005a, 2005b, 2007, 2008). Consequently, pure positional cloning in humans, animals, or plants is no longer necessary. Positional cloning is defined as cloning or identifying a gene with a specific function purely according to its position. In humans, mice, and rats, it is now rare for a mutation to be localized to a gene about which nothing, not even its expression, is known. For example, microarray manufacturers have arrayed probes for every known gene on their chips. As a result, microarray analysis of gene expression profiles has become routine in many laboratories. Therefore, expression data for every gene in every tissue may soon be available to the public. Thus, for any gene, even if nothing else is known about that gene, its expression level in a tissue can be assessed. As such, the classic positional cloning method is of little utility in the rapidly evolving arena of functional genomics. A new procedure that integrates both genomic and high-throughput technology has been created and will be, and should be, the next generation's tool of choice.
1.2 CONCEPT OF GENE DISCOVERY IN THE POST-GENOME ERA
The strategy for gene discovery using positional cloning depends on the availability of genetic-based data and technology. The new approach for gene discovery is highly integrative and is based on the availability of genome resources and biotechnology (Rintisch et al., 2008). There are three distinct and significant differences between new gene discovery strategies and classical positional cloning. The first is the elimination of fine mapping. Rather than narrowing down the genomic regions using several approaches, a large number of genomic regions can be searched to discover the genes of interest all at once. The second is the direct investigation of genetic elements within the targeted region, without contig construction or sequencing, because of the availability of genomic sequences and annotation of genomic elements. The third is the high-throughput screening of candidates within the targeted region. The high-speed analytical methods include mutation screening, resequencing, and both gene expression profiling and functional predictions (Jiao et al., 2008). The following chapters provide detailed information on each of those aspects. The first part of this book introduces the technologies and resources used in gene discovery in our post-genome era. The second part of this book provides experimental procedures and methodologies for gene discovery using both genome resources and high-throughput technologies. The third and final part of this book predicts the future direction of gene discovery based on the elucidation of genomes and developing technologies. We are living in an era of both technology explosion and unparalleled expansion of biological resources. The advances in gene discovery, however, are rooted in the technology of genome sequencing. Without the completion of whole genome sequences for humans and other species, gene discovery would still be stuck in the classic positional cloning approach. Therefore, gene discovery in every chapter is based on the fact that genomic sequences are available for the subjects of interest. Parallel with the necessity of completed genomes is the demand for, and rapid development of, the high-throughput technologies necessary for mutation screening, genome analysis, and bioinformatics. Without these tools, there would be no effective method for capitalizing on the completion of whole genomes or for enabling our current rapid methods for gene discovery. Because of their significance, Chapters 2–4 introduce these technologies. Chapters 2–6 illustrate a variety of approaches, including SNP analysis, DNA methylation, protein turnover rate measurement, microarray analysis, and bioinformatic tools. Finally, the integrative analysis of data from a variety of procedures provides clues to potential candidate genes for follow-up experiments, such as RT-PCR, DNA sequencing of the potential mutation(s), and/or northern or western blot analysis to determine the significance of the mutated gene. An important reminder to readers is that although this book mainly focuses on coding sequences known as genes, mutations in many other genetic elements could be identified using the same or similar technologies or procedures.
Those non-gene elements of the genome include not only the introns and the 5′ and 3′ ends of genes but also many others (Chen et al., 2008), such as transcription factor binding sites, microRNAs, cis-acting elements, palindromic motifs, and/or conserved k-tuples (phylogenetic footprints) (Hui and Bindereif, 2010). Readers should keep in mind that gene regulation is a complicated process and that regulators are not necessarily near the genes that they influence. Regulators located at long distances, such as enhancers, repressors, and silencers, are called distant regulatory elements (REs) (Gotea and Ovcharenko, 2008). In addition, repetitive sequences sometimes play unexpected roles in gene regulation (Hui and Bindereif, 2005).
1.3 STRATEGIES FOR GENE DISCOVERY IN THE POST-GENOME ERA
Current experimental strategies for mutation screening have been summarized (Jiao et al., 2008) and are shown in Figure 1.2. Individual chapters in this book focus on one or more steps or different approaches of this strategy. We briefly touch on screening for mutations in DNA in this introduction using
Figure 1.2. Strategy of gene discovery through mutation models. [Flowchart boxes: identify mutation models and chromosomes of their disease loci; bioinformatic search; screen of coding regions; whole genome sequencing; determination of mutated gene; knockdown; gene network; knockin; function elucidation of mutated genes.]
the mouse as the model. Detailed procedures and methodologies are presented in Chapters 7–13. The first step is to determine the total number of genes/transcripts within the targeted region. Chapter 7 describes the genetic markers and methods for determining the genomic location of target genetic loci. Any of the many recently developed software programs (see, for example, www.genediscovery.org/pgmapper/index.jsp; Xiong et al., 2008a) can be used to identify every candidate gene from a defined genomic region. The next step is to evaluate candidate genes to reduce the list to a more workable number (Chapters 8–13). At this step, obvious candidate genes are evaluated first. We believe that a large number of differences exist between the gene of interest (GOI) in the mutant and in the wild type (control). Our current knowledge of gene function and bioinformatics should allow us to eliminate most of the unlikely candidate genes. A series of comparisons and functional analyses should be made to rule out variations in intron sequences as candidates if those sequences do not affect the phenotype (Chapters 11–13). In the end, a short list of candidate genes is expected or, in the best-case scenario, only one gene will remain. Next, mutation evaluation or testing is carried out (Chapters 14–20). This evaluation considers differences between the GOI and control, sequence differences in these genes, potential gene function changes due to these differences, and whether other strains or populations have similar differences. Information on differences is combined with gene expression profiling and possible gene function to determine a list of candidate genes. Finally, selected candidate genes are tested and confirmed using a variety of experimental approaches, such as gene knockout and/or knockin.
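The first two steps described above (list every gene in the mapped interval, then reduce the list using available evidence) can be sketched as follows. This is only an illustration under stated assumptions: the `Gene` record, its two evidence flags, the gene symbols, and the coordinates are all hypothetical, and real tools such as PGMapper draw on genome annotation and literature databases rather than a hand-written list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int                # base-pair coordinates on the chromosome
    end: int
    coding_variant: bool      # sequence difference between mutant and control
    expression_changed: bool  # expression difference between mutant and control

def genes_in_region(genes, chrom, start, end):
    """Return every gene that overlaps the interval [start, end] on chrom."""
    return [g for g in genes
            if g.chrom == chrom and g.start <= end and g.end >= start]

def shortlist(candidates):
    """Drop genes with no supporting evidence; rank the rest by evidence count."""
    scored = [(int(g.coding_variant) + int(g.expression_changed), g)
              for g in candidates]
    kept = [(score, g) for score, g in scored if score > 0]
    kept.sort(key=lambda pair: (-pair[0], pair[1].start))
    return [g for _, g in kept]

# Hypothetical annotation for illustration only.
genes = [
    Gene("GeneA", "14", 1_000_000, 1_050_000, False, False),
    Gene("GeneB", "14", 2_400_000, 2_480_000, True, True),
    Gene("GeneF", "14", 3_000_000, 3_010_000, False, False),
    Gene("GeneC", "14", 3_900_000, 4_100_000, False, True),
    Gene("GeneD", "14", 9_000_000, 9_020_000, True, False),
    Gene("GeneE", "7", 2_500_000, 2_600_000, True, True),
]
candidates = genes_in_region(genes, "14", 2_000_000, 4_000_000)
ranked = shortlist(candidates)
```

Here the interval search returns three genes and the evidence filter keeps two of them, with the gene supported by both a sequence difference and an expression difference ranked first. The shortlisted genes would then go on to experimental confirmation.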
1.4 FUTURE DIRECTION
Gene discovery or mutation identification has gone through two stages, as we have discussed: the classical and the post-genome era. The next stage of gene discovery will depend on development of high-throughput technology and
Figure 1.3. Different stages of positional cloning (from left to right): classic, postgenome era, and future (dashed blue line). [Diagram labels: disease phenotype/trait of interest; mapping information (which chromosome); fine mapping (where on chromosome); contig construction (DNA assembly); search based on genome sequences (small region) (large region).]
bioinformatic tools. As shown in Figure 1.3, in the first stage (the classical approach), positional cloning of a GOI has to go through every step, including initial mapping, fine mapping, contig construction, and candidate searching based on genome sequences. Currently, at the second stage, fine mapping and contig construction are in most cases unnecessary because the genomic sequences and genetic elements within the targeted region are already available. The next stage of genomic cloning will allow researchers to conduct a search of candidate genes without mapping information (shown as dashed lines in Figure 1.3). At that stage, once a phenotype is found in an animal model or an individual, a search of candidate genes can be done based on the annotation of every gene or regulatory element in the genome. To reach the next stage, two critical improvements in our genomic research are needed. The first is the complete evaluation of the potential function of every gene and regulatory element in the whole genome. This seemingly large amount of work is most likely to be done within a decade or even sooner, as technologies for the analysis of gene function, SNP analysis, and proteomics are rapidly developing. The second is the availability of software for rapid automatic high-throughput searching. Currently, some programs, such as PGMapper (Xiong et al., 2008a), provide the capability to search genome regions of several megabases. The capability of searching whole chromosomes and whole genomes within a reasonable time (under an hour) will follow the development of computational tools in coordination with genome and literature databases.
1.5 REFERENCES
Baehner RL, Kunkel LM, Monaco AP, Haines JL, Conneally PM, Palmer C, Heerema N, Orkin SH. (1986). DNA linkage analysis of X chromosome-linked chronic granulomatous disease. Proc Natl Acad Sci U S A 83(10):3398–401.
Chen HP, Lin A, Bloom JS, Khan AH, Park CC, Smith DJ. (2008). Screening reveals conserved and nonconserved transcriptional regulatory elements including an E3/E4 allele-dependent APOE coding region enhancer. Genomics 92(5):292–300. Epub Sept. 3.
Collins FS. (1990). Identifying human disease genes by positional cloning. Harvey Lect 86:149–64.
Gotea V, Ovcharenko I. (2008). DiRE: identifying distant regulatory elements of coexpressed genes. Nucleic Acids Res 36:W133–39. Epub May 17.
Gu W, Li X, Lau KH, Edderkaoui B, Donahue LR, Rosen CJ, Beamer WG, Shultz KL, Srivastava A, Mohan S, Baylink DJ. (2002). Gene expression between a congenic strain that contains a quantitative trait locus of high bone density from CAST/EiJ and its wild-type strain C57BL/6J. Funct Integr Genomics 1(6):375–86.
Hinkes B, Wiggins RC, Gbadegesin R, Vlangos CN, Seelow D, Nürnberg G, Garg P, Verma R, Chaib H, Hoskins BE, Ashraf S, Becker C, Hennies HC, Goyal M, Wharram BL, Schachter AD, Mudumana S, Drummond I, Kerjaschki D, Waldherr R, Dietrich A, Ozaltin F, Bakkaloglu A, Cleper R, Basel-Vanagaite L, Pohl M, Griebel M, Tsygin AN, Soylu A, Müller D, Sorli CS, Bunney TD, Katan M, Liu J, Attanasio M, O'Toole JF, Hasselbacher K, Mucha B, Otto EA, Airik R, Kispert A, Kelley GG, Smrcka AV, Gudermann T, Holzman LB, Nürnberg P, Hildebrandt F. (2006). Positional cloning uncovers mutations in PLCE1 responsible for a nephrotic syndrome variant that may be reversible. Nat Genet 38(12):1397–405. Epub Nov. 5.
Holsinger KE, Weir BS. (2009). Genetics in geographically structured populations: defining, estimating and interpreting F(ST). Nat Rev Genet 10(9):639–50.
Hui J, Bindereif A. (2005). Alternative pre-mRNA splicing in the human system: unexpected role of repetitive sequences as regulatory elements. Biol Chem 386(12):1265–71.
Jiao Y, Li X, Beamer WG, Yan J, Tong Y, Goldowitz D, Roe B, Gu W. (2005a). Identification of a deletion causing spontaneous fracture by screening a candidate region of mouse chromosome 14. Mamm Genome 16(1):20–31.
Jiao Y, Yan J, Zhao Y, Donahue LR, Beamer WG, Li X, Roe BA, Ledoux MS, Gu W. (2005b). Carbonic anhydrase-related protein VIII deficiency is associated with a distinctive lifelong gait disorder in waddles mice. Genetics Epub Aug. 22.
Jiao Y, Yan J, Jiao F, Yang H, Donahue LR, Li X, Roe BA, Stuart J, Gu W. (2007). A single nucleotide mutation in Nppc is associated with a long bone abnormality in lbab mice. BMC Genet 8:16.
Jiao Y, Jin X, Yan J, Zhang C, Jiao F, Li X, Roe BA, Mount DB, Gu W. (2008). A deletion mutation in Slc12a6 is associated with neuromuscular disease in gaxp mice. Genomics 91(5):407–14.
Koppel I, Aid-Pavlidis T, Jaanson K, Sepp M, Palm K, Timmusk T. (2010). BAC transgenic mice reveal distal cis-regulatory elements governing BDNF gene expression. Genesis 48(4):214–19.
Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–62.
Rintisch C, Ameri J, Olofsson P, Luthman H, Holmdahl R. (2008). Positional cloning of the Igl genes controlling rheumatoid factor production and allergic bronchitis in rats. Proc Natl Acad Sci U S A 105(37):14005–10. Epub Sept. 8.
Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, Curnutte JT, Orkin SH. (1986). Cloning the gene for an inherited human disorder--chronic granulomatous disease--on the basis of its chromosomal location. Nature 322(6074):32–38.
Xiong Q, Qiu Y, Gu W. (2008a). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13. Epub Jan. 18.
Xiong Q, Jiao Y, Hasty KA, Stuart JM, Postlethwaite A, Kang AH, Gu W. (2008b). Genetic and molecular basis of QTL of rheumatoid arthritis in rat: genes and polymorphisms. J Immunol 181(2):859–64.
Xiong Q, Jiao Y, Hasty KA, Canale ST, Stuart JM, Beamer WG, Deng HW, Baylink D, Gu W. (2009). Quantitative trait loci, genes, and polymorphisms that regulate bone mineral density in mouse. Genomics 93(5):401–14.
Zou F. (2009). QTL mapping in intercross and backcross populations. Methods Mol Biol 573:157–73.
CHAPTER 2
High-Throughput Gene Expression Analysis and the Identification of Expression QTLs
RUDI ALBERTS and KLAUS SCHUGHART
Contents
2.1 Concepts in High-Throughput Gene Expression Analysis
2.2 Technologies of High-Throughput Gene Expression Analysis
  2.2.1 Gene Expression Microarrays
  2.2.2 One-Channel Versus Two-Channel Microarrays
  2.2.3 Oligonucleotide Versus Spotted Microarrays
  2.2.4 Whole-Transcript Arrays
  2.2.5 Genome Tiling Arrays
  2.2.6 MicroRNA Arrays
2.3 Protocols
  2.3.1 Image Analysis
  2.3.2 Normalization
  2.3.3 Quality Control
2.4 Applications and Limitations
  2.4.1 Identification of Expression QTL and Gene Regulatory Networks
  2.4.2 Identification of Differentially Expressed Genes
  2.4.3 Identification of Cell-Type-Specific Genes
  2.4.4 Determination of the Downstream Effects of a Mutation
  2.4.5 Determination of the Downstream Effects of a Signaling Molecule
  2.4.6 Predicting Vaccine Efficacy
  2.4.7 Determination of Host Responses after Infection
  2.4.8 Limitations
2.5 Questions and Answers
2.6 Acknowledgments
2.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
2.1 CONCEPTS IN HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS
Many diseases have a genetic basis. Together with influences from the environment, these genetic factors determine whether a certain disease will develop and how severe it will be. In some cases, a disease is determined by only one gene. Sickle-cell disease, for example, is caused by a mutation in the hemoglobin gene. This causes red blood cells to adopt an abnormal sickle shape and results in a risk of various complications and a shortened life expectancy. Another example of a single-gene disease is cystic fibrosis. This disease affects the exocrine glands of the lungs, liver, pancreas, and intestines and results in progressive disability and a severely shortened life expectancy. It is caused by a mutation in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. However, in most human diseases, multiple genes play a role in the development of the pathological symptoms. Examples of these so-called complex genetic diseases are cancer, obesity, diabetes, hypertension, asthma, and heart disease. Here, each gene contributes to a certain degree to the establishment of the phenotype, and we can assume that the contributing genes and their products operate in regulatory networks. They may enhance or inhibit each other. If multiple genes contribute to the development of a disease and the individual contribution of each gene is small, it is a major challenge to identify the causal disease genes and their interactions. The advent of new high-throughput analyses now makes it possible to study such complex genetic interactions and thus unravel the molecular basis of complex genetic diseases in humans. For example, high-throughput gene expression analysis allows one to measure the expression of tens of thousands of genes at the same time. Researchers can now compare complete gene expression profiles for diseased and healthy samples and obtain a direct insight into global gene expression changes.
Thus, these new technologies allow researchers to unravel the interplay between genes and to reconstruct gene regulatory networks for biological processes.
The analysis of gene expression is based on the following basic biological principles. The genetic information of a cell is stored in genes, which are part of the DNA in the nucleus. DNA is transcribed into RNA, which is then processed into messenger RNA (mRNA) that transfers the information to the cytoplasm. Here the mRNA is translated into protein. Many proteins are enzymes that catalyze biochemical reactions. Other proteins have mechanical or structural functions. Proteins are also important in biological signaling processes, such as growth factor responses, immune responses, cell adhesion, and the cell cycle. Since proteins are major players in living organisms, they are also involved in the development of diseases. Therefore, to gain an understanding of processes that lead to disease, it is of great value to have a global picture of the amount of mRNA of all genes that are expressed in diseased and in healthy subjects. High-throughput gene expression microarrays measure these global changes and differences in mRNA abundance and, therefore, give a very good indication of the processes that are abnormal in diseased tissues.
2.2 TECHNOLOGIES OF HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS
2.2.1 Gene Expression Microarrays
Gene expression microarray technology enables the measurement of mRNA abundances in a high-throughput manner. Instead of directly using mRNA, more stable cDNA molecules are used, which are complementary copies of the RNA. This copy is created by a viral enzyme, reverse transcriptase, in a process called reverse transcription. Microarrays are small glass plates that are subdivided into thousands of spots. Short sequences of the nucleotides A, C, T, and G, commonly referred to as probes, are bound as spots to the glass surface (Fig. 2.1). All probes in one spot have a sequence that is reverse complementary to part of the sequence of the cDNA of a specific gene. The idea is that the cDNA generated from the mRNA that is expressed from this gene will hybridize (bind) to the probes on the specific spot. To make it possible to measure the amount of cDNA hybridized to the microarray, the cDNA is labeled with a fluorescent dye. After hybridization and removal of the cDNA that did not bind, the microarray is inserted into a scanner that reads the amount of fluorescence for each of the spots. These measurements represent the level of gene expression for all genes on the microarray and are generally represented in the manner shown in Figure 2.2. Using specialized software, the intensity of each spot in the image is quantified, providing a quantitative value of the mRNA expression level for each of the genes on the microarray.
2.2.2 One-Channel Versus Two-Channel Microarrays
There are different kinds of microarrays. A general distinction can be made between one-channel and two-channel microarrays. On a one-channel microarray, one fluorescently labeled sample is hybridized and the resulting expression values are read as absolute expression values for that sample. To compare the expression values between multiple samples, it is necessary to use multiple microarrays. The most widely known provider of one-channel microarrays is Affymetrix (www.affymetrix.com). On two-channel microarrays, one can directly compare the gene expression values of two different samples. Each of the samples is fluorescently labeled
Figure 2.1. Hybridization of labeled cDNAs to a gene expression microarray. The small glass plate contains millions of probes. Fluorescently labeled (spheres) cDNA binds to the probes on the microarray. Image courtesy of Affymetrix. [Image labels: RNA fragments with fluorescent tags from sample to be tested; RNA fragment hybridizes with DNA on GeneChip array.]
using a different dye. In most cases a Cy5 (red) dye is used for one sample and Cy3 (green) for the other. This produces images like Figure 2.2 with black, yellow, red, and green spots. A red spot indicates that the sample with the red labeling has a higher expression value (and vice versa for green), and a yellow spot indicates that both samples have similar expression values. If the spot remains black, there is no expression in either of the samples. Well-known providers of two-channel microarrays are Agilent (www.agilent.com) and Illumina (www.illumina.com).
2.2.3 Oligonucleotide Versus Spotted Microarrays
A second important distinction between microarray setups is that between oligonucleotide arrays and spotted arrays. On oligonucleotide arrays, the probes are attached to the microarray by the manufacturer. For example, Affymetrix uses chemical synthesis and photolithographic masks to build up the probes on the microarray. Here, all probes on the microarray are simultaneously synthesized nucleotide by nucleotide. This results in high-density prefabricated microarrays.
Figure 2.2. (See color insert.) A microarray image produced by a scanner. Each of the spots on the microarray represents a gene, and the color represents the amount of fluorescence that is measured and hence the amount of cDNA that was present in the original sample. Reprinted from Reinke (2006).
On the other hand, probes on spotted microarrays are synthesized before they are added (spotted) onto the glass. Such microarrays are sold without probes, and laboratories have to design and fabricate their own probes and fix them onto the microarray. This is often a cheaper solution since the gene density can be much lower. Also, the researcher can customize the microarray to each experiment.
2.2.4 Whole-Transcript Arrays
The classical microarrays mentioned earlier interrogate the mRNA at only one specific location. Agilent, for example, uses one 60-mer probe per gene to measure its expression. Affymetrix uses multiple 25-mer probes per gene to measure mRNA abundances, all of them located at the end of the gene, either in the 3′ untranslated region (3′ UTR) or in the last exon or exons (Fig. 2.3). Recent studies, however, indicate that alternative splicing plays a major role in the generation of proteins and thereby functional diversity in metazoan organisms (Blencowe, 2006). Alternative splicing means that different transcript isoforms are produced from the same gene by variations in pre-mRNA
Figure 2.3. Probe coverage along the transcript. Gray regions represent exons, and black regions are introns that are removed during splicing. The short dashes underneath the exon regions indicate probes of the exon array and the classical 3′ array setup. [Diagram labels: genome; mRNA transcripts; exon array probes; 3′ array probes.]
splicing. It is estimated that 40–60% of human genes have multiple splice forms (Modrek and Lee, 2002). These findings led to the development of a new type of microarray, the whole-transcript array, which is able to measure mRNA levels over the whole length of the gene. As depicted in Figure 2.3, Affymetrix exon arrays cover every exon of a gene with, on average, four probes. By using these microarrays, one can study global gene expression profiles as before but also detect different isoforms of a gene, such as transcripts with alternative 5′ start sites or an undefined 3′ end, nonpolyadenylated messages, or truncated or alternatively spliced transcripts.
2.2.5 Genome Tiling Arrays
The design of gene expression arrays and whole-transcript arrays is based on sequence information and annotation of known transcripts. Genome tiling arrays contain probes that are tiled over the whole genome at regular intervals, including both annotated regions of the genome and regions considered to be noncoding. Tiling arrays can thus be used to discover novel transcripts. The Affymetrix Human Tiling 1.0 R array set is a set of 14 microarrays that contain 45 million oligonucleotide probes covering the whole human genome. Probes have a length of 25 nucleotides and are tiled at an average resolution of 35 bp, leaving an average gap of 10 bp between probes.
2.2.6 MicroRNA Arrays
MicroRNAs (miRNAs) are very short single-stranded RNAs, 21–23 nucleotides in length. They do not code for proteins but are complementary to certain mRNA sequences. Binding of an miRNA to its target mRNA causes degradation of the target. In this way, miRNAs can regulate gene expression. It has been shown that miRNAs have an effect on various biological processes, for example, the development of cancer (He et al., 2005) and heart disease (Thum et al., 2007). Several commercial products are available for large-scale identification of miRNA. Example vendors are Affymetrix, Agilent, Invitrogen, Applied Biosystems, and Exiqon.
2.3 PROTOCOLS
2.3.1 Image Analysis
After the microarrays have been scanned, one obtains an image with thousands of individual spots (Fig. 2.2) representing the mRNA levels for each gene. Image analysis is then needed to quantify the intensity of each spot. Most of the microarray vendors provide software that performs image analysis and outputs quantitative intensity values per gene. Several steps are performed in such an image analysis. First, the image is filtered. This is a cleaning procedure by which small contamination artifacts such as dust particles are removed. Next, the location of the center of each spot is identified. This is called gridding. Next comes segmentation: for each of the pixels in the spot area, it is decided whether it belongs to the signal or to the background (signal detected by the scanner in the area where no hybridization has taken place). Finally, in the quantification step, the pixel values of each spot are summarized into one gene expression value and a background value.
2.3.2 Normalization
During a microarray experiment, multiple factors can introduce unwanted variation into the data. For example, if the experiment includes analysis of many samples that cannot be labeled and hybridized in one day, the quality of the labeling may differ from day to day. This will lead to global expression differences between microarrays that are not due to biological differences. The aim of normalization is to remove such unwanted variation from the data so that different samples can be properly compared and real biological differences detected. Several normalization techniques exist. In the following sections, we describe the ones that are most commonly used. As a rule of thumb in microarray normalization, Wit and McClure (2004) suggest first normalizing all local features and then gradually progressing to normalizations that deal with several or all microarrays. This procedure involves the following steps.
2.3.2.1 Spatial Correction
Since probes are randomly distributed over the microarray, one expects a similar distribution of signals at each location on the array. After performing microarray experiments, however, there might be microarrays where this is not the case. For example, there might be an array where all signals tend to be systematically lower in one corner of the array. Yang et al. (2002) observed that the variation in signal can also be different at different locations on the array.
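A simple way to check an array for the kind of spatial effect described here is to compare the median signal in different regions of the array with the overall median. The sketch below is a diagnostic only, not a correction method from the literature cited in this chapter; the function names, the 2 × 2 block layout, the tolerance of 0.5 log2 units, and the toy intensity grid are all assumptions made for illustration.

```python
from statistics import median

def block_medians(grid, blocks=2):
    """Split a 2-D grid of log intensities into blocks x blocks tiles
    and return the median of each tile."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for bi in range(blocks):
        r0, r1 = bi * rows // blocks, (bi + 1) * rows // blocks
        tile_row = []
        for bj in range(blocks):
            c0, c1 = bj * cols // blocks, (bj + 1) * cols // blocks
            values = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            tile_row.append(median(values))
        result.append(tile_row)
    return result

def flag_blocks(grid, blocks=2, tolerance=0.5):
    """Return (row, column) indices of tiles whose median differs from the
    overall median by more than the tolerance."""
    overall = median([v for row in grid for v in row])
    flagged = []
    for bi, tile_row in enumerate(block_medians(grid, blocks)):
        for bj, value in enumerate(tile_row):
            if abs(value - overall) > tolerance:
                flagged.append((bi, bj))
    return flagged

# Toy 4 x 4 array of log2 intensities with a dim bottom-right corner.
grid = [[10.0] * 4 for _ in range(4)]
for r in (2, 3):
    for c in (2, 3):
        grid[r][c] = 8.0
tile_medians = block_medians(grid)
suspect_tiles = flag_blocks(grid)
```

With randomly placed probes, every tile should have roughly the same median, so a flagged tile suggests a technical artifact rather than biology.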
The spatial effect can be removed by robust smoothing of the expression data across the array in each channel separately. Here, a smooth surface is fit to the data and subsequently subtracted from the data. To also correct for differences in variation at different locations of the array, one can divide by a location-dependent scale parameter. This parameter is obtained by smoothing the absolute differences between the expression values and the first smoothed surface (Wit and McClure, 2004). In cases of very strong abnormal local effects, it may even be best to exclude the array and repeat the experiment.
2.3.2.2 Background Correction
Microarray scanners always detect a background signal, even in places where no true signal is present. To obtain more accurate quantifications of gene expression values, several methods have been proposed to adjust for this background. Some methods work with local background values per spot. These background values are measured directly near the spot. Eisen (1999) simply subtracts the background value from the observed value to obtain a signal value. Kooperberg et al. (2002) apply a Bayesian approach, assuming that the mean of the observed pixel values is the sum of the mean true signal and the mean background signal. Because the background measurements are taken so close to the signal measurements, there is a possibility that the background values are contaminated with true signal. Therefore, several global background correction methods have been proposed that do not use a background value per spot. Wit and McClure (2004) suggest calculating the mean value of all empty spots on the array, subtracting that mean from all measurements, and setting any resulting negative values to zero. Irizarry et al.
(2003) propose a probabilistic model that determines the conditional expectation of the true signal given the observed signal, assuming that the observed signal is the sum of the true signal and a background signal and that the spot intensities are drawn from one exponential distribution and the background intensities from a normal distribution. Both methods give similar results. 2.3.2.3 Dye-Effect Correction The most commonly used dyes in twochannel microarray experiments are Cy5 (red) and Cy3 (green). Slight differences in the characteristics of these dyes, such as in the size of the molecules, lead to unwanted effects in the observed intensity signals. For Cy5 and Cy3 it was observed that the dyes often have an intensity-dependent effect. That is, for large expression values, one of the dyes tends to give higher expression values, while for small expression values they may give lower expression values (Fig. 2.4a). Yang and Speed (2003) suggest transforming the Cy3 vs. Cy5 scatter plot into an MA plot, which is basically a 45° rotation of the Cy3 vs. Cy5 scatter plot (Fig. 2.4b). The values of M and A are calculated as follows: M = log (Cy 5) − log (Cy3),
c02.indd 18
(2.1)
1/12/2011 9:44:00 AM
Figure 2.4. (a) Scatterplot of Cy3 versus Cy5 signal (axes: log2(Cy3), log2(Cy5)). (b) MA plot (axes: A, M). Gray curve fitted by loess. (c) Normalized MA plot. (d) Normalized Cy3 versus Cy5 plot.
$$A = \tfrac{1}{2}\left(\log(\mathrm{Cy5}) + \log(\mathrm{Cy3})\right). \tag{2.2}$$
Then, using a function such as loess, a smooth curve is fitted through all data points, and a normalized MA plot is created by subtracting the fitted curve from the M values (Fig. 2.4c). The normalized MA plot is transformed back into a normalized Cy5 vs. Cy3 scatter plot by applying the inverse of equations 2.1 and 2.2 to the normalized data (Fig. 2.4d):

$$\log(\mathrm{Cy5}) = A + M/2, \tag{2.3}$$

$$\log(\mathrm{Cy3}) = A - M/2. \tag{2.4}$$
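The dye-effect correction of equations 2.1 to 2.4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind Figure 2.4: base-2 logarithms are assumed (as on the axes of Figure 2.4), the names `loess_fit` and `normalize_dye_effect` and the span parameter `frac` are hypothetical, and the smoother is a simplified loess (local linear regression with tricube weights, no robustness iterations) written with the standard library only. The intensities are simulated.

```python
import math
import random

def loess_fit(x, y, frac=0.4):
    """Simplified loess: local linear fit with tricube weights at every x[i]."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))      # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = [abs(xj - x0) for xj in x]
        h = max(sorted(dists)[k - 1], 1e-12)     # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1.0 - (d / h) ** 3) ** 3 for d in dists]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))    # value of the local line at x0
    return fitted

def normalize_dye_effect(cy5, cy3, frac=0.4):
    """Return A, raw M, normalized M, and normalized log2 Cy5 and Cy3 values."""
    log5 = [math.log2(v) for v in cy5]           # base 2 is an assumption
    log3 = [math.log2(v) for v in cy3]
    m = [r - g for r, g in zip(log5, log3)]              # eq. 2.1
    a = [0.5 * (r + g) for r, g in zip(log5, log3)]      # eq. 2.2
    curve = loess_fit(a, m, frac)
    m_norm = [mi - ci for mi, ci in zip(m, curve)]       # subtract the fitted curve
    log5_norm = [ai + mi / 2 for ai, mi in zip(a, m_norm)]   # eq. 2.3
    log3_norm = [ai - mi / 2 for ai, mi in zip(a, m_norm)]   # eq. 2.4
    return a, m, m_norm, log5_norm, log3_norm

# Simulated spots: Cy5 carries an intensity-dependent bias relative to Cy3.
random.seed(0)
true_log = [random.uniform(6, 14) for _ in range(200)]
cy3 = [2 ** (t + random.gauss(0, 0.1)) for t in true_log]
cy5 = [2 ** (t + 0.15 * (t - 10) + random.gauss(0, 0.1)) for t in true_log]
a, m, m_norm, log5_norm, log3_norm = normalize_dye_effect(cy5, cy3)
```

The loop is quadratic in the number of spots, which is acceptable for a sketch; for real arrays an established loess or lowess routine would be the more practical choice.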
2.3.2.4 Normalization between Arrays
The global range of gene expression values can differ between arrays from one experiment to the next. These global changes are often the result of slight variations during sample preparation, labeling, microarray hybridization, and washes. The aim of normalization between arrays is to remove global expression differences so that multiple samples can be properly compared and real biological differences detected. The most straightforward way to normalize between microarrays is to equalize the median or mean value of each array and to adjust the scale to some fixed value. A disadvantage of this method is that it performs a linear scaling, which is not optimal if the shapes of the distributions of the expression values differ. Therefore, another method, called quantile normalization, was proposed by several authors (e.g., Bolstad et al., 2003). This method equalizes the distributions of the expression values of all microarrays. The procedure is as follows:

1. Given n arrays of length g, form matrix A of dimension g × n, where each array is a column and each gene is a row.
2. Sort each column of A to give A_sort.
3. Take the means across rows of A_sort and assign this mean to each element in the row to get A_sortmean.
4. Get A_normalized by rearranging each column of A_sortmean to have the same ordering as the original A.

Figure 2.5 shows the distribution of all gene expression values before and after the quantile normalization procedure has been applied. After normalization, the distributions of values are all equal.

2.3.3 Quality Control
Performing a microarray experiment involves many steps, and there are many stages at which things can go wrong. Here we describe the most common procedures for quality control and explain how they can be used to inspect the quality of the data.

2.3.3.1 Inspection of Signal Plots
As a first quality control measure, one can make a signal plot of all measured signals for each microarray before and after normalization. If probes are randomly distributed over the microarrays, no patterns should be visible in these plots.
Visual inspection of these images may reveal cases in which, for example, a hair or pieces of dust disturb the signals. One can also check whether spatial effects have been properly corrected by the normalization methods.

2.3.3.2 Dissimilarity Measures
To detect deviating microarrays, Wit and McClure (2004) suggest calculating dissimilarity measures between all pairs of microarrays. They consider two types of measures: absolute similarities indicating whether genes have similar levels over
Figure 2.5. Quantile normalization. (a) Box plots of the log2 signals for four microarrays before normalization, showing differences in the distributions. (b) The distributions of the log2 signals for the same microarrays after quantile normalization.
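The four steps of quantile normalization listed in Section 2.3.2.4 translate almost literally into code. The sketch below is an illustration under stated assumptions rather than a reference implementation: each array is passed as a list of expression values with genes in the same order (one column of the matrix A), ties within an array are broken by their original order instead of being averaged, the function name `quantile_normalize` is hypothetical, and the example values are invented.

```python
def quantile_normalize(columns):
    """Quantile-normalize n arrays, each given as a list of g expression values."""
    g = len(columns[0])
    if any(len(col) != g for col in columns):
        raise ValueError("all arrays must contain the same number of genes")
    # Step 2: sort each column of A.
    sorted_cols = [sorted(col) for col in columns]
    # Step 3: mean across arrays at each rank (row means of the sorted matrix).
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(g)]
    # Step 4: return the rank means to the original gene order of each column.
    normalized = []
    for col in columns:
        order = sorted(range(g), key=lambda i: col[i])   # gene indices, lowest value first
        out = [0.0] * g
        for rank, gene in enumerate(order):
            out[gene] = rank_means[rank]
        normalized.append(out)
    return normalized

# Three invented arrays with four genes each.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for col in normalized:
    print([round(v, 2) for v in col])
```

After the call, every array contains exactly the same set of values, which is what panel (b) of Figure 2.5 shows for real data, while the ranking of genes within each array is unchanged.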
different arrays and correlations indicating coordinated changes of genes between arrays. As absolute similarity measures they use power distances

$$d_p(x, y) = \sqrt[p]{\sum_{i=1}^{n_{\mathrm{genes}}} \left| x_i - y_i \right|^{p}} \tag{2.5}$$
and use both the Manhattan distance d1 (p = 1) and the Euclidean distance d2 (p = 2). Inspection of the dissimilarity matrices directly identifies microarrays in which processing problems may have occurred.

2.3.3.3 Dimensionality Reduction
Another way to check for possible problems in the data is to perform a dimensionality reduction. A popular method is principal component analysis (PCA), which transforms a number of variables (gene expression values in this case) into a number of uncorrelated variables called principal components. The first component accounts for as much of the variability in the data as possible, and each following component accounts for as much of the remaining variability as possible. A quicker method for dimensionality reduction is Sammon mapping (Sammon, 1969). Instead of using the whole gene expression data matrix, it uses the distance matrix between arrays. It aims to find a representation of the arrays in a lower-dimensional space such that the distances between the arrays are as close as possible to the distances in the original matrix. Inspection of a two-dimensional (2D) Sammon mapping can indicate whether samples have been swapped or whether specific arrays behave differently from the rest. For example, the
Figure 2.6. Sammon mapping of microarray data of two strains of mice (A and B) infected with a virus, measured 3 days after the infection. Each measurement has been performed in three replicates: A.1.3 means mouse A, day 1 postinfection, replicate 3.
Sammon mapping in Figure 2.6 indicates that samples A.2.3 and B.2.3 have probably been swapped.

2.3.3.4 Pairwise Scatter Plots
Another good way to detect deviating microarrays is to inspect scatter plots of all expression values for all possible pairs of microarrays. Normally, the number of genes that are differentially expressed between two experimental conditions is small relative to the total number of genes on the microarray. A plot in which the expression values of all genes in the two conditions are plotted against each other should therefore reveal a cloud of points on the diagonal with relatively few points off the diagonal. Comparing the scatter plots of all possible pairs of microarrays might reveal a single microarray that shows deviating scatter plots with all other microarrays, for example, scatter plots in which the cloud on the diagonal is broader than in the other pairwise plots. This would indicate that this microarray shows many more and larger changes in gene expression relative to the other samples than are seen in the other comparisons. If these changes are not expected from a biological point of view, a technical problem might have caused them, and it is better to repeat the microarray experiment for this sample.
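The pairwise comparisons of Sections 2.3.3.2 and 2.3.3.4 can also be summarized numerically. The following sketch computes the power distance of equation 2.5 for all pairs of arrays and reports the array that is, on average, farthest from the others. Summarizing each array by its mean distance to all other arrays is an illustrative choice made here, not a rule taken from Wit and McClure (2004); the function names are hypothetical and the data are simulated.

```python
import random

def power_distance(x, y, p):
    """Power distance d_p between two arrays (equation 2.5)."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def distance_matrix(arrays, p):
    """Symmetric matrix of power distances between all pairs of arrays."""
    n = len(arrays)
    return [[power_distance(arrays[i], arrays[j], p) for j in range(n)]
            for i in range(n)]

def mean_distance_to_others(dist):
    """Average distance of each array to all other arrays (the diagonal is zero)."""
    n = len(dist)
    return [sum(row) / (n - 1) for row in dist]

# Simulated log2 signals: four arrays share one profile; array 2 is much noisier.
random.seed(1)
profile = [random.uniform(6, 14) for _ in range(500)]
noise_sd = [0.2, 0.2, 1.0, 0.2]
arrays = [[v + random.gauss(0, sd) for v in profile] for sd in noise_sd]

for p, name in ((1, "Manhattan"), (2, "Euclidean")):
    scores = mean_distance_to_others(distance_matrix(arrays, p))
    suspect = scores.index(max(scores))
    print(name, [round(s, 1) for s in scores], "most deviating array:", suspect)
```

A large mean distance corresponds to a broad cloud in the pairwise scatter plots; whether such an array should be repeated still depends on whether the differences are biologically plausible.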
2.3.3.5 Sex-Specific Gene Expression
If the experiment involves both male and female samples, mislabeled microarrays can be detected by comparing the expression values of sex-specific genes. Xist is one example of a female-specific gene: it should be expressed only in female samples. In this way, samples that have been mixed up can easily be identified.

2.4 APPLICATIONS AND LIMITATIONS

2.4.1 Identification of Expression QTL and Gene Regulatory Networks
Combining gene expression profiles with genetic information represents a new, powerful approach for identifying genes in disease models. This approach is generally referred to as the identification of expression quantitative trait loci (eQTL) (Rockman and Kruglyak, 2006) or genetical genomics (Jansen and Nap, 2001). A quantitative trait locus is a specific region of the genome containing one or more genes that most likely regulate a phenotypic trait. By making use of specific populations of genetically related organisms, measuring the trait values for many individuals of the population, and combining them with the genetic information of the individual organisms, one can identify the locations on the genome that regulate the trait. Recombinant inbred lines (RILs) are often used for QTL analysis. In mice, these lines are obtained by crossing two genetically different inbred parental lines and then performing brother–sister matings, starting from a large number of F1 hybrid pairs, for about 20 generations. As can be seen from Figure 2.7, the parental strains produce F1 offspring that are heterozygous. The F1 offspring are mated to produce F2 animals. After many generations of brother–sister matings that start from a given F2 pair, a recombinant inbred line is obtained whose genome represents a fixed mixture of the parental genomes and in which all individuals are again homozygous at every location.
The fixed parental genome mixture is, however, different from one line (starting with a given F2 pair) to another line (starting with a different F2 pair). The genetic makeup of each RIL is then determined using molecular markers. RILs thus make it possible to measure phenotypes and subsequently relate them to the genotypes. For the identification of eQTL, global gene expression profiles are determined with gene expression microarrays for each RIL. Subsequently, the expression profile of each gene across the RILs is taken as a quantitative trait and is compared in a genome scan with the distribution of molecular markers (Fig. 2.8). For each molecular marker, the expression trait values are divided into two groups according to the alleles that the individuals carry at that marker (Fig. 2.8c). Then a statistical test is performed to determine whether the means of the two groups differ significantly. If they do, an eQTL is detected at that marker, as shown for the second marker in Figure 2.8. This result indicates that, with a high
Figure 2.7. Generation of recombinant inbred lines (diagram labels: Parents, F1, F2, F20).
probability, there are one or several factors (genes) at that genomic location that regulate the expression of the target gene, since all individuals carrying one allele have a low expression value and all individuals carrying the other allele have a high expression value. Once a QTL is identified, one can compare the location of the QTL with the location of the gene. If they coincide, the QTL is referred to as a cis-QTL or local QTL; otherwise it is called a trans-QTL or distant QTL. Several methods exist for the identification of quantitative trait loci. The most straightforward method is called single-marker analysis. Here, a genomewide scan is performed, and at each molecular marker a regression test determines whether a QTL is present. In a second method, called interval mapping, the QTL likelihood is also determined at locations in between markers. Using fixed genomic intervals and the information from the surrounding markers, this method calculates QTL scores at the markers themselves and at positions in between. Based on the idea that multiple QTL can regulate a quantitative trait, Jansen (1993) and Zeng (1993) proposed the method of composite interval mapping (multiple QTL mapping). Here, the existence of multiple QTL regulating one trait is modeled. This makes it possible to identify epistatic QTL, that is, multiple QTL regions that regulate the trait by interacting with each other. It also allows the identification of multiple linked QTL. GeneNetwork (www.genenetwork.org) has been established as a rich resource for systems genetics. It contains a large collection of genotypes, phenotypes, and gene expression profiles for multiple organisms and genetic
Figure 2.8. (See color insert.) (a) The genomewide genotypes of eight recombinant inbred lines generated from a cross between two homozygous parents (A and B). Each row indicates the genome of a single RIL. The light or dark gray color in each of the RILs indicates whether that part of its genome was inherited from parent A or B. (b) Gene expression values are determined by microarrays. Four values are shown for each parent and one value for each of the RILs. (c) For three molecular markers, the gene expression values of the RILs are dissected into two groups, according to the allele they carry for that molecular marker (light or dark gray). A statistical test of each marker location calculates whether the means of both groups differ. The significances of the tests are plotted in a genomewide plot as a QTL plot. Here, a QTL peak is found for the second marker. Triangles in the QTL plot indicate the position of the gene, whose expression was used. If the gene coincides with the QTL peak, the QTL is referred to as a cis-QTL, otherwise, it is called a trans-QTL. Adapted from Alberts et al. (2005) by permission of Oxford University Press.
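The marker-by-marker test illustrated in Figure 2.8c can be sketched as follows. This is a simplified stand-in for single-marker analysis, not the procedure of any particular QTL package: Welch's t statistic is used as the test of whether the two group means differ, p-values and multiple-testing correction are omitted, and the function names, genotype strings, and expression values are all hypothetical. As in the figure, the example has eight RILs and three markers.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    diff = mean(group_a) - mean(group_b)
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se

def single_marker_scan(genotypes, expression):
    """Return |t| per marker; genotypes[m][i] is the allele ('A' or 'B') of RIL i at marker m."""
    scores = []
    for alleles in genotypes:
        group_a = [e for g, e in zip(alleles, expression) if g == "A"]
        group_b = [e for g, e in zip(alleles, expression) if g == "B"]
        if len(group_a) < 2 or len(group_b) < 2:
            scores.append(0.0)               # too few lines in one group to test
        else:
            scores.append(abs(welch_t(group_a, group_b)))
    return scores

# Hypothetical data: one string of parental alleles per marker, one value per RIL.
genotypes = ["AABBABAB", "AAAABBBB", "ABABABAB"]
expression = [7.1, 6.9, 7.3, 7.0, 10.2, 9.8, 10.1, 10.4]

scores = single_marker_scan(genotypes, expression)
peak = scores.index(max(scores))
print([round(s, 2) for s in scores], "strongest marker:", peak + 1)
```

Plotting the scores against marker position gives the kind of QTL plot shown in the figure; comparing the position of the peak with the position of the gene itself would then separate cis-QTL from trans-QTL.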
reference populations. It offers good tools for QTL and correlation analysis and for the identification of QTL genes and gene networks. The identification of a trans-QTL means that the location probably regulates the expression of another gene, the target gene. Furthermore, several genes may map to the same trans-QTL, which suggests that all these genes have a common regulator. By following the links between genes revealed by trans-QTLs, one can build up gene regulatory networks. These regulatory networks have the potential to explain the complex interplay of genes and their products affecting complex traits and diseases. Ferrara et al. (2008) demonstrated how eQTL can be used to reconstruct networks. They used an F2 intercross between a diabetes-resistant and a
diabetes-susceptible mouse strain and identified expression QTLs (eQTLs) as well as metabolite QTLs (mQTLs). mQTLs were determined by taking metabolite abundances as quantitative traits. For one metabolite, glutamate, they identified an mQTL interval that also contains eQTLs and transcripts with eQTLs elsewhere. Using this information, they reconstructed a regulatory network and demonstrated its validity by showing that the genes respond to changes in glutamate. Crawford et al. (2008) described how eQTLs can be used to derive a transcriptional network that predicts breast cancer survival. Previous work had shown that extracellular matrix (ECM) gene dysregulation predicts both mouse mammary tumorigenesis and human breast cancer. Crawford et al. identified three reproducible eQTLs that regulate ECM gene expression. Using correlation analyses and known associations with metastasis, they identified seven candidate genes. Six of the seven candidates appeared to suppress metastasis.

2.4.2 Identification of Differentially Expressed Genes
Microarrays are most often used for the identification of differentially expressed genes. In disease gene discovery, healthy samples and disease samples are compared, and genes that are differentially expressed are identified. Thuong et al. (2008), for example, compared gene expression profiles of macrophages from individuals with different clinical manifestations of Mycobacterium tuberculosis infection. For three clinical phenotypes (latent, pulmonary, and meningeal tuberculosis) they identified lists of differentially expressed genes. Comparing the three phenotypes, they identified 261 genes with a greater than fivefold change in expression between any of the three conditions. Pennings et al. (2008) compared multiple microarray studies on acute lung inflammation models. The models included air pollutants; bacterial, viral, and parasitic infections; and allergic asthma models. They identified a cluster of 383 genes with an expression response that was common to all pulmonary diseases.

2.4.3 Identification of Cell-Type-Specific Genes
Another application of microarrays is the identification of genes that are expressed in specific cell types. Sugimoto et al. (2006), for example, compared gene expression profiles in CD25+CD4+ regulatory T cells and CD25−CD4+ naive T cells. They found multiple genes that were expressed in a pattern specific to regulatory T cells. These genes are thought to be involved in the differentiation and homeostasis of regulatory T cells.

2.4.4 Determination of the Downstream Effects of a Mutation
Von Bernuth et al. (2008) used microarrays to determine the downstream effects of a MyD88 mutation in humans. Nine patients with MyD88 deficiency suffered from life-threatening, often recurrent pyogenic bacterial infections. The authors identified the functional pathways in healthy fibroblasts that were regulated after treatment with interleukin 1β (IL-1β), tumor necrosis factor α (TNF-α), or poly(I:C) and compared them to the expression levels obtained from cells derived from patients. They identified a complete, specific lack of response to IL-1β as a defining characteristic of MyD88 deficiency.

2.4.5 Determination of the Downstream Effects of a Signaling Molecule
Type 1 interferon (IFN) contributes significantly to innate immune responses. Malakhova et al. (2006) reported that UBP43 is highly expressed in macrophages and inhibits type 1 IFN signaling. To understand the effect of UBP43 on type 1 IFN signaling, Zou et al. (2007) analyzed the genomewide expression profiles of IFN-β-stimulated genes in wild-type and UBP43−/− bone marrow–derived macrophages (BMMs). They identified 749 genes that were uniquely upregulated in UBP43−/− BMMs, including a large number of previously unidentified IFN-stimulated genes.

2.4.6 Predicting Vaccine Efficacy
Another application of microarrays is the identification of gene signatures that have a predictive value for a biological response. For example, Querec et al. (2009) used this approach to predict vaccine efficacy. They vaccinated humans with the yellow fever vaccine YF-17D and performed microarray experiments on days 0, 1, 3, 7, and 21 after vaccination in two independent trials. Using the DAMIP classification model (Lee, 2007; Brooks and Lee, 2008), they identified innate immune signatures that could predict subsequent adaptive immune responses. One signature predicted YF-17D CD8+ T cell responses with up to 90% accuracy, and another signature predicted the neutralizing antibody response with up to 100% accuracy.

2.4.7 Determination of Host Responses after Infection
Microarrays have been successfully used in the characterization of host responses after infection.
For example, Kash et al. (2006) infected mice with a contemporary human influenza A/Texas/36/91 H1N1 virus (Tx91) and with a reconstructed recombinant 1918 H1N1 virus (r1918), the virus of the 1918 pandemic that caused about 50 million deaths worldwide. They found that mice infected with the r1918 virus showed a much stronger inflammatory response. As another example, Ding et al. (2008) found differences between mouse strains after infection with influenza A.

2.4.8 Limitations
A limitation of microarrays is that they measure mRNA abundances and not protein levels. Posttranscriptional modifications or mRNA degradation might cause actual protein levels to differ from the gene expression levels measured with microarrays. In these situations, the transcriptional profiles obtained with microarrays do not fully correspond to the proteome within the cell.
2.5 QUESTIONS AND ANSWERS
Q1. Why are multiple microarrays in an experiment normalized?
Q2. What is an expression QTL (eQTL), and how can it be used to discover gene-interaction networks?
A1. Multiple microarrays are normalized to remove nonbiological variation, such as technical variation, so that only the biological variation remains.
A2. An eQTL is an expression quantitative trait locus: a genomic region that very likely contains one or multiple genes regulating the expression of another gene. Trans-eQTLs represent genomic regions that influence the expression of a gene located elsewhere in the genome. A trans-eQTL region will very likely contain genes that directly influence the expression of the target gene(s). By relating multiple trans-eQTLs to multiple target genes, one may obtain valuable hypotheses about gene–gene regulatory interactions.
2.6 ACKNOWLEDGMENTS
We would like to thank Dr. Robert Geffers for fruitful discussions. This work was supported by intramural grants from the Helmholtz Association (Program Infection and Immunity) and by a research grant for the virtual institute GeNeSys (German Network for Systems Genetics, No VHVI-242) from the Helmholtz Association.
2.7 REFERENCES
Alberts R, Fu J, Swertz MA, Lubbers LA, Albers CJ, Jansen RC. (2005). Combining microarrays and genetic analysis. Briefings Bioinformatics 6(2):135–45.
Blencowe BJ. (2006). Alternative splicing: new insights from global analyses. Cell 126:37–47.
Bolstad BM, Irizarry RA, Astrand M, Speed TP. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185–93.
Brooks JP, Lee EK. (2008). Analysis of the consistency of a mixed integer programming based multi-category constrained discriminant model. Ann Oper Res 164:1–20.
Crawford NP, Walker RC, Lukes L, Officewala JS, Williams RW, Hunter KW. (2008). The Diasporin pathway: a tumor progression-related transcriptional network that predicts breast cancer survival. Clin Exp Metastasis 25(4):357–69.
Ding M, Lu L, Toth LA. (2008). Gene expression in lung and basal forebrain during influenza infection in mice. Genes Brain Behav 7(2):173–83.
Eisen M. (1999). ScanAlyze User Manual. Available at http://rana.lbl.gov/manuals/ScanAlyzeDoc.pdf.
Ferrara CT, Wang P, Neto EC, Stevens RD, Bain JR, Wenner BR, Ilkayeva OR, Keller MP, Blasiole DA, Kendziorski C, Yandell BS, Newgard CB, Attie AD. (2008). Genetic networks of liver metabolism revealed by integration of metabolic and transcriptional profiling. PLoS Genet 4(3):e1000032.
He L, Thomson JM, Hemann MT, Hernando-Monge E, Mu D, Goodson S, Powers S, Cordon-Cardo C, Lowe SW, Hannon GJ, Hammond SM. (2005). A microRNA polycistron as a potential human oncogene. Nature 435(7043):828–33.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2):249–62.
Jansen RC. (1993). Interval mapping of multiple quantitative trait loci. Genetics 135:205–11.
Jansen RC, Nap JP. (2001). Genetical genomics: the added value from segregation. Trends Genet 17(7):388–91.
Kash JC, Tumpey TM, Proll SC, Carter V, Perwitasari O, Thomas MJ, Basler CF, Palese P, Taubenberger JK, García-Sastre A, Swayne DE, Katze MG. (2006). Genomic analysis of increased host immune and cell death responses induced by 1918 influenza virus. Nature 443(7111):578–81.
Kooperberg C, Fazzio TG, Tsukiyama T. (2002). Improved background correction for spotted DNA microarrays. J Computat Biol 9:57–68.
Lee EK. (2007). Large-scale optimization-based classification models in medicine and biology. Ann Biomed Eng 35:1095–109.
Malakhova OA, Kim KI, Luo JK, Zou W, Kumar KG, Fuchs SY, Shuai K, Zhang DE. (2006). UBP43 is a novel regulator of interferon signaling independent of its ISG15 isopeptidase activity. EMBO J 25(11):2358–67.
Modrek B, Lee C. (2002). A genomic view of alternative splicing. Nat Genet 30:13–19.
Pennings JLA, Kimman TG, Janssen R. (2008). Identification of a common gene expression response in different lung inflammatory diseases in rodents and macaques. PLoS ONE 3(7):e2596.
Querec TD, Akondy RS, Lee EK, Cao W, Nakaya HI, Teuwen D, Pirani A, Gernert K, Deng J, Marzolf B, Kennedy K, Wu H, Bennouna S, Oluoch H, Miller J, Vencio RZ, Mulligan M, Aderem A, Ahmed R, Pulendran B. (2009). Systems biology approach predicts immunogenicity of the yellow fever vaccine in humans. Nat Immunol 10(1):116–25.
Reinke V. (2006). Germline genomics. Available at www.wormbook.org.
Rockman MV, Kruglyak L. (2006). Genetics of global gene expression. Nat Rev Genet 7:862–72.
Sammon JW. (1969). A non-linear mapping for data structure analysis. IEEE Trans Comput 18:401–09.
Sugimoto N, Oida T, Hirota K, Nakamura K, Nomura T, Uchiyama T, Sakaguchi S. (2006). Foxp3-dependent and -independent molecules specific for CD25+CD4+ natural regulatory T cells revealed by DNA microarray analysis. Int Immunol 18(8):1197–209.
Thum T, Galuppo P, Wolf C, Fiedler J, Kneitz S, van Laake LW, Doevendans PA, Mummery CL, Borlak J, Haverich A, Gross C, Engelhardt S, Ertl G, Bauersachs J. (2007). MicroRNAs in the human heart: a clue to fetal gene reprogramming in heart failure. Circulation 116(3):258–67.
Thuong NT, Dunstan SJ, Chau TT, Thorsson V, Simmons CP, Quyen NT, Thwaites GE, Thi Ngoc Lan N, Hibberd M, Teo YY, Seielstad M, Aderem A, Farrar JJ, Hawn TR. (2008). Identification of tuberculosis susceptibility genes with human macrophage gene expression profiles. PLoS Pathog 4(12):e1000229.
von Bernuth H, Picard C, Jin Z, Pankla R, Xiao H, Ku CL, Chrabieh M, Mustapha IB, Ghandil P, Camcioglu Y, Vasconcelos J, Sirvent N, Guedes M, Vitor AB, Herrero-Mata MJ, Aróstegui JI, Rodrigo C, Alsina L, Ruiz-Ortiz E, Juan M, Fortuny C, Yagüe J, Antón J, Pascal M, Chang HH, Janniere L, Rose Y, Garty BZ, Chapel H, Issekutz A, Maródi L, Rodriguez-Gallego C, Banchereau J, Abel L, Li X, Chaussabel D, Puel A, Casanova JL. (2008). Pyogenic bacterial infections in humans with MyD88 deficiency. Science 321(5889):691–96.
Wit E, McClure J. (2004). Statistics for Microarrays: Design, Analysis and Inference. New York, John Wiley and Sons.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30(4):e15.
Yang YH, Speed T. (2003). Design and analysis of comparative microarray experiments. In Statistical Analysis of Gene Expression Microarray Data (ed. Speed T). Chapman & Hall/CRC, Boca Raton, FL, pp. 35–92.
Zeng ZB. (1993). Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci U S A 90:10972–76.
Zou W, Kim JH, Handidu A, Li X, Kim KI, Yan M, Li J, Zhang DE. (2007). Microarray analysis reveals that type I interferon strongly increases the expression of immune-response related genes in UBP43 (USP18) deficient macrophages. Biochem Biophys Res Commun 356(1):193–99.
CHAPTER 3
DNA Methylation in the Pathogenesis of Autoimmunity
XUEQING XU*, PING YANG*, ZHANG SHU*, YUN BAI*, and CONG-YI WANG*
Contents
3.1 Introduction
3.2 General Information for DNA Methylation in Mammals
3.3 DNA Methyltransferases and Methyl-CpG-Binding Domain (MBD) Proteins
  3.3.1 DNA Methyltransferases
  3.3.2 MBD Proteins
3.4 DNA Methylation in T and B Cell Development
  3.4.1 DNA Methylation of IFN-γ Locus in Th1 Cell Development
  3.4.2 DNA Methylation of Th2 Cytokine Locus in Th2 Cell Development
  3.4.3 DNA Methylation in Regulatory T Cell and Th17 Development
  3.4.4 DNA Methylation in B Cell Maturation and Functionality
3.5 The Implication of DNA Methylation in Autoimmune Diseases
  3.5.1 DNA Methylation in Systemic Lupus Erythematosus
  3.5.2 DNA Methylation in Rheumatoid Arthritis
  3.5.3 DNA Methylation in Type 1 Diabetes
3.6 Common Technological Approaches for Assay of DNA Methylation
  3.6.1 Methylation-Specific PCR
  3.6.2 Bisulfite PCR
  3.6.3 Arbitrary Primed PCR
  3.6.4 Methylated DNA Immunoprecipitation Chip
3.7 Summary
3.8 Acknowledgments
3.9 References
* These authors contributed equally to this work. Correspondence should be addressed to Dr. Cong-Yi Wang,
[email protected]. Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
3.1 INTRODUCTION
Despite the characterization of our genome at the DNA base-pair level, we are still far from understanding the molecular events underlying phenotypic variations such as disease susceptibility. It is now realized that epigenetic factors also contribute significantly to a particular phenotype, indicating a dual inheritance for mammalian cells. The epigenome-encoded information is superimposed on the DNA sequence and consists of three interconnected molecular mechanisms: DNA methylation (methylome) (Jeltsch et al., 2006; Vanden Berghe et al., 2006; Wilson et al., 2006), histone modification (chromatin remodeling) (Kouskouti and Talianidis, 2005; Trouche et al., 2003), and RNA interference (siRNAs and microRNAs) (Andersen and Panning, 2003; Cheng et al., 2005; Mattick, 2001). Among these, DNA methylation is the most profound epigenetic mechanism, because changes in DNA methylation are also linked with aberrant patterns of histone modification and microRNA expression (Callinan and Feinberg, 2006; Esteller, 2006; Jones and Martienssen, 2005; Rauscher, 2005; Wilson et al., 2006). DNA methylation is a postsynthesis modification that takes place after every cycle of DNA replication. It involves the addition of a methyl group to the number 5 carbon of the cytosine ring. It is the oldest epigenetic mechanism known to correlate with gene repression and is also the most thoroughly studied epigenetic modification so far. DNA methylation most notably influences gene expression: hypermethylation of promoter regions is usually associated with transcriptional repression, while hypomethylation of control regions is generally associated with active gene transcription. However, as indicated, the presence of methylcytosine in the promoter of specific genes also has profound consequences for local chromatin structure, in addition to the regulation of gene expression.
Therefore, the end results of DNA methylation go far beyond the control of gene expression, with much broader implications in genetic regulation such as genomic imprinting, X chromosome inactivation, and chromatin structure modification (Tollefsbol, 2004). DNA methylation has been found in every vertebrate examined. In adult somatic tissues, DNA methylation typically occurs in a CpG dinucleotide context, while non-CpG methylation is prevalent in embryonic stem cells. In plants, cytosines are methylated both symmetrically (CpG or CpNpG) and asymmetrically (CpNpN; N can be any nucleotide but guanine). In mammals, CpG dinucleotides are usually located at the promoters of genes, and the methylation of DNA in such promoter regions plays an important role in gene activation and expression. Like the nuclear genome, the DNA methylome comprises all heritable information encoded by DNA methylation, which is established during development in a tissue-specific fashion (Yung et al., 2001). Methylation of CpG dinucleotides in CpG islands (regions of DNA enriched with CpG sites) is a unique epigenomic mechanism for suppressing the expression of genes that are not essential, or are potentially detrimental, to cellular function. Abnormal
demethylation of these CpG islands would lead to active transcription of the suppressed genes, which is associated with the development of diseases such as cancer. In this chapter we provide an overview of the role of DNA methylation in the development of autoimmunity. We particularly discuss its possible implication in the pathogenesis of systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), and type 1 diabetes (T1D).
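To make the sequence contexts mentioned in this introduction concrete, the following is a minimal Python sketch (not part of the original chapter) that tallies cytosines on a single strand by context. It assumes the conventional CG / CHG / CHH labels (H = A, C, or T) as stand-ins for the CpG, CpNpG, and asymmetric contexts described above; the function name and the example sequence are hypothetical.

```python
def cytosine_contexts(seq):
    """Count cytosines on the given strand by sequence context.

    Assumption: seq contains only A, C, G, T (no ambiguity codes), and
    only the strand as written is examined.
    CG  = C followed by G (symmetric CpG context)
    CHG = C, then a non-G base, then G (symmetric on the opposite strand)
    CHH = C followed by two non-G bases (asymmetric context)
    """
    seq = seq.upper()
    counts = {"CG": 0, "CHG": 0, "CHH": 0}
    for i, base in enumerate(seq):
        if base != "C":
            continue
        following = seq[i + 1:i + 3]
        if len(following) >= 1 and following[0] == "G":
            counts["CG"] += 1
        elif len(following) == 2 and following[1] == "G":
            counts["CHG"] += 1
        elif len(following) == 2:
            counts["CHH"] += 1
        # Cytosines too close to the 3' end to classify are skipped.
    return counts


if __name__ == "__main__":
    # Hypothetical 11-base sequence containing one cytosine of each context.
    print(cytosine_contexts("ACGTCAGCTTC"))
```

A real analysis would examine both strands and handle ambiguity codes; this sketch is only meant to show how the three contexts differ.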
3.2 GENERAL INFORMATION FOR DNA METHYLATION IN MAMMALS
DNA methylation is one of the most important epigenetic alterations in mammals. This modification can be inherited through cell division. DNA methylation is typically removed during zygote formation and reestablished through successive cell divisions during development. It is a crucial part of normal organism development and cellular differentiation in mammals, including the regulation of imprinted genes and X chromosome inactivation. DNA methylation can also serve as a protective mechanism; for example, many bacteria use DNA methylation to protect their own DNA against the endonuclease activity that destroys foreign DNA. DNA methylation stably alters the expression pattern of a particular gene in cells such that the cells can “remember where they have been.” In this way, cells programmed to be pancreatic islet cells during embryonic development remain within the pancreatic islets throughout the life of the organism, with no need for additional signals to keep them there. In general, 70–80% of all CpGs in the human genome are methylated, and the majority of unmethylated CpG islands are located within or near gene promoters or first exons of housekeeping genes (Heller et al., 2010). In contrast, the promoters of noncoding RNAs and regulatory regions of transposable elements are methylated, thereby preventing the parasitic transposable and repetitive elements from replicating (Dolinoy et al., 2006). Studies in animal models indicated that DNA methylome-encoded information is both mitotically and meiotically heritable (Morgan et al., 1999; Rakyan et al., 2003). Alterations in the DNA methylome can be induced by a plethora of environmental insults (discussed later).
Studies in a large collection of human monozygotic (MZ) twins indicated that these changes in DNA methylome are accumulated during one’s lifetime, which establish a quantitative threshold for the induction of gene expression. This quantitative effect contributes to the phenotypic discordance, including the susceptibilities to disease and a wide range of anthropomorphic features among MZ twins (Fraga et al., 2005). DNA methylation may impact the transcription of genes in two ways. The methylation of DNA may itself physically impede the binding of transcription factors to the gene. More important, methylated DNA may be bound by proteins known as methyl-CpG-binding domain (MBD) proteins. MBD proteins
then recruit additional proteins to the locus, such as histone deacetylases and other chromatin remodeling proteins that can modify histones, thereby forming compact, inactive chromatin termed silent chromatin (Fatemi and Wade, 2006; Fraga et al., 2003).
3.3 DNA METHYLTRANSFERASES AND METHYL-CPG-BINDING DOMAIN (MBD) PROTEINS 3.3.1 DNA Methyltransferases DNA methylation is carried out by DNA methyltransferases (DNMTs), and at least five DNA methyltransferases (DNMT1, DNMT3a, DNMT3b, DNMT3L, and DNMT2) have been identified in the eukaryotic kingdom. DNMT1 preferentially methylates hemimethylated substrates, such as DNA in the S phase, and is primarily involved in the maintenance of methylation patterns with each cell replication (Bestor and Ingram, 1983; Leonhardt et al., 1992). In contrast, DNMT3a and DNMT3b, which have significantly higher de novo methylation activity than DNMT1, contribute to de novo methylation during embryogenesis (Fatemi et al., 2002). Another two DNMTs—DNMT2 and DNMT3L—are found without significant methylating activity. DNMT3L binds to DNMT3a and DNMT3b and regulates their functionality. The main function for DNMT2 is found to methylate the aspartyl-tRNA (Goll et al., 2006). Given the role of DNMTs in maintenance methylation and de novo methylation, they can be divided into two general classes. Maintenance methylation is necessary to preserve DNA methylation after every cellular DNA replication cycle. In the absence of this activity, the replication machinery itself would generate daughter strands that are unmethylated and over time would result in passive demethylation. DNMT1 is proposed to be a maintenance methyltransferase responsible for copying DNA methylation patterns to the daughter strands during DNA replication (Goyal et al., 2006). Mouse models with both copies of DNMT1 deleted are embryonic lethal at approximately day 9, due to the requirement of DNMT1 activity for development in mammalian cells (Chappell et al., 2006). It is thought that DNMT3a and DNMT3b are the de novo methyltransferases that set up DNA methylation patterns early in development. DNMT3L is a homologous protein to the other DNMT3s but has no catalytic activity. 
Instead, DNMT3L assists the de novo methyltransferases by increasing their binding capacity to DNA and stimulating their enzymatic activity (Brenner and Fuks, 2006). Recently, DNMT2 (TRDMT1) has been identified as a DNA methyltransferase homolog containing all 10 sequence motifs common to all DNA methyltransferases. However, DNMT2 (TRDMT1) does not methylate DNA; instead, it methylates cytosine-38 in the anticodon loop of aspartic acid transfer RNA (Goll et al., 2006).
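The passive demethylation described in this section can be illustrated with a short calculation. The following Python sketch is a deliberately simplified model, not a method from the chapter: it assumes that each replication pairs a parental strand with a newly synthesized strand, that the new strand is methylated only where maintenance methylation succeeds, and that no de novo methylation or active demethylation occurs. The function name and the `efficiency` parameter are hypothetical.

```python
def methylation_after_divisions(initial, efficiency, divisions):
    """Average per-strand methylation level after successive replications.

    initial    -- starting fraction of methylated CpG sites (0.0 to 1.0)
    efficiency -- assumed probability that maintenance methylation copies
                  a parental mark onto the new strand (0.0 to 1.0)
    divisions  -- number of DNA replication cycles

    After one replication, half of the strands are parental (level m) and
    half are new (level efficiency * m), so the average becomes
    m * (1 + efficiency) / 2.
    """
    if not 0.0 <= initial <= 1.0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("initial and efficiency must lie between 0 and 1")
    levels = [initial]
    for _ in range(divisions):
        levels.append(levels[-1] * (1 + efficiency) / 2)
    return levels


if __name__ == "__main__":
    # No maintenance activity: methylation halves with every replication.
    print(methylation_after_divisions(1.0, 0.0, 3))  # [1.0, 0.5, 0.25, 0.125]
    # Perfect maintenance: the pattern is preserved.
    print(methylation_after_divisions(1.0, 1.0, 3))  # [1.0, 1.0, 1.0, 1.0]
```

Under these assumptions, complete loss of maintenance activity dilutes methylation by half per cell cycle, which is the sense in which the demethylation is "passive."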
3.3.2 MBD Proteins
An evolutionarily conserved family of DNA-binding proteins characterized by a common sequence motif called the methyl-CpG-binding domain (MBD) is generally believed to convert the information represented by methylation patterns into the appropriate functional state (Fatemi and Wade, 2006; Hendrich et al., 1999). The MBD forms a wedge-shaped structure composed of a β-sheet superimposed over an α-helix and loop. Amino acid side chains in two of the β-strands, along with residues immediately N-terminal to the α-helix, interact with the cytosine methyl groups within the major groove, which provides the structural basis for the selective recognition of methylated CpG dinucleotides (Ballestar et al., 2001; Ohki et al., 2001). Thus far, five MBD proteins (MeCP2 and MBD1–MBD4) have been identified in mammals (Fatemi and Wade, 2006). These proteins share little primary-sequence homology with one another outside the MBD motif, except for MBD2 and MBD3, which share substantial sequence similarity. Furthermore, unlike its amphibian counterpart, mammalian MBD3 cannot selectively recognize methylated DNA because of a tyrosine-to-phenylalanine substitution within the MBD motif (Fraga et al., 2003). The other four MBD proteins are believed to function, at least in part, in transcriptional repression (Hendrich and Tweedie, 2003; Wade, 2001). Interestingly, MBD4 possesses DNA N-glycosylase activity and may function in DNA repair (Bird and Wolffe, 1999). The MBD proteins are, for the most part, ubiquitously expressed (Hendrich and Bird, 1998; Meehan et al., 1992). They represent an important class of chromosomal proteins that associate with protein partners to play active roles in transcriptional repression and/or heterochromatin formation. Therefore, it is believed that the DNA methylation pattern is “read” by the MBD proteins.
Although their functions may partly overlap, genetic analysis suggests functional heterogeneity among the MBD proteins. Most MBD-deficient animals survive to adulthood, although with varying abnormalities. However, mice deficient for MBD3 fail to survive embryogenesis (Hendrich et al., 2001). Lack of MeCP2 is associated with specific neurological defects in the mouse that mimic the symptoms observed in the human neurological disorder Rett syndrome, which is caused by mutation in the human MECP2 gene (Amir et al., 1999). Loss of MBD1 function is associated with neuronal defects, potentially related to the subtle upregulation of a specific class of endogenous retroelements (Zhao et al., 2003). Mice deficient for MBD4 showed an increased frequency of C → T transitions at CpG sites and accelerated tumor formation with CpG → TpG mutations in the Apc gene (Millar et al., 2002), indicating that MBD4 may be important for suppressing CpG mutability and tumorigenesis. In contrast, MBD2 deficiency is associated with a decreased incidence of tumors of the colon promoted by mutation of the adenomatous polyposis coli gene (Sansom et al., 2003). More interestingly, MBD2 could also be important in the regulation of immune
response as loss of MBD2 function leads to changes in the abundance of transcripts for certain cytokines essential for T cell development (Hutchins et al., 2002; Hutchins et al., 2005).
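The CpG → TpG transitions noted above for MBD4-deficient mice can be pictured at the sequence level. The sketch below is an illustrative Python toy, not an analysis from the chapter: it assumes that an unrepaired deamination of a methylated cytosine on the strand as written turns that C into a T, and it ignores repair, the opposite strand, and any notion of mutation rate. The function name, the example sequence, and the chosen positions are hypothetical.

```python
def deaminate_methylated_cpgs(seq, methylated_positions):
    """Return seq with C -> T at each listed position that is the C of a CpG.

    seq                  -- DNA sequence (strand as written, 5' to 3')
    methylated_positions -- 0-based indices assumed to carry 5-methylcytosine

    Positions that are not the cytosine of a CpG dinucleotide are ignored,
    because the model only covers CpG methylation.
    """
    bases = list(seq.upper())
    for pos in methylated_positions:
        in_range = 0 <= pos < len(bases) - 1
        if in_range and bases[pos] == "C" and bases[pos + 1] == "G":
            bases[pos] = "T"  # unrepaired deamination is read as thymine
    return "".join(bases)


if __name__ == "__main__":
    # Hypothetical sequence with CpG sites at positions 1 and 4.
    print(deaminate_methylated_cpgs("ACGTCGA", [1]))     # ATGTCGA
    print(deaminate_methylated_cpgs("ACGTCGA", [1, 4]))  # ATGTTGA
```

Each affected site changes from CG to TG, which is the CpG → TpG signature reported for the Apc gene in the absence of MBD4.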
3.4 DNA METHYLATION IN T AND B CELL DEVELOPMENT
There is increasing evidence that the DNA methylome organizes the ability of signal transduction pathways to generate a restricted set of progeny from a multipotent progenitor. In addition, epigenomic effects seem to allow dividing cells to memorize, or imprint, signaling events that occurred earlier in their development (Ballestar et al., 2006; Reiner, 2005; Richardson, 2003; Sekigawa et al., 2003, 2006; Teitell and Richardson, 2003). Therefore, DNA methylation is emerging as a common strategy in the development and function of the mammalian immune system. Recent studies have not only demonstrated the importance of DNA methylation in T cell development and differentiation but also established its pivotal role in T cell polarization (Eivazova and Aune, 2004; Fields et al., 2004; Fitzpatrick and Wilson, 2003; Lee et al., 2002; Reiner, 2005). T cells and B cells are two of the most important components of the immune system, and the differentiation of T helper cells (i.e., Th1, Th2, or Th17) and the maturation of B cells play an important role in specific immune responses and antibody production. Although their genomic DNA is the same, different T cell subsets have different functions, most likely because they express different proteins. Th1 cells are characterized by the production of IFN-γ, Th2 cells secrete the cytokines IL-4 and IL-13, and Th17 cells are preferential producers of IL-17A, IL-17F, IL-21, and IL-22. Another important T cell subset, regulatory T cells (Tregs), is characterized by expression of Foxp3 in the nucleus. There is mounting evidence that the production of cytokines and the expression of transcription factors necessary for T helper cell differentiation into different subsets are regulated at the DNA methylation level.
3.4.1 DNA Methylation of IFN-γ Locus in Th1 Cell Development
IFN-γ is a signature cytokine for Th1 cells. Unlike the well-defined Th2 cytokine loci, little is known about the regulatory elements that govern the expression of interferon. The methylation status of CpG sites in the regulatory elements around IFN-γ locus is quite complicated, different CpG sites show inconsistent methylation status during Th1 development. Multiple regulatory elements and conserved noncoding sequences (CNS) have been identified in a region that extends 60–70 kb upstream and downstream of the mouse Ifn-γ locus, which include enhancers at CNS-34, CNS-22, CNS-6, CNS+18–20 and CNS+29, as well as a putative insulator at CNS+46 (Hatton et al., 2006; Wilson et al., 2009). In naive murine T cells, the IFN-γ promoter +29 and +46 are unmethylated, while CNS-54, CNS-6, intron 3, CNS-18, CNS-20, and CNS-55 were methylated (Bowen et al., 2008). In Th1 cells, methyl groups were removed
from CpG dinucleotides at CNS-54, CNS-6, some CpGs in IFN-γ intron 3, CNS-18 and CNS-20, indicating a prerequisite role for demethylation of these elements in IFN-γ expression for polarizing Th1 development. The involvement of DNA methylation in IFN-γ expression during Th1 development is further supported by the observation in Th2 cells. The Th2 CpG methylation pattern resembles that in naive cells apart from the fact that more CpGs within the IFN-γ promoter were methylated. Studies have shown that during Th2 polarization, the DNA methyltransferase DNMT3a was enriched to the IFN-γ promoter; as a consequence, the promoter undergoes a progressive de novo methylation (Jones and Chen, 2006). It was found that CpGs located at the 53 position become methylated rapidly and such methylation inhibits ATF2/c-Jun and CREB transcription factors binding to the IFN-γ promoter, which then suppresses IFN-γ expression to polarize naive T cells to the Th2 condition. 3.4.2 DNA Methylation of Th2 Cytokine Locus in Th2 Cell Development The IL-4, IL-5, and IL-13 genes are linked closely in an evolutionarily conserved cytokine gene cluster. The region is called the Th2 cytokine locus and contains IL-4, IL-5, IL-13, and the constitutively expressed Rad50 genes, which transcribe well-known cytokines essential for Th2 development. Sustained TCR stimulation in the presence of IL-4 polarizes naive T cells to the Th2 condition, which silences the IFN-γ gene while activating the IL-4, IL-5, and IL-13 genes. Of note, the Th2 cytokine locus is conserved in the genomes of mammals in terms of gene composition and the linear relationship of genes within the locus (Wilson et al., 2009). The transcription of those Th2 cytokines is tightly regulated through their promoters and by several additional regulatory elements, implicating the epigenetic regulatory mechanisms such as DNA methylation. 
For example, the transcription of murine IL-4 is tightly controlled by the regulatory elements mapped to the DNAse I hypersensitive site I (HSI) and HSII in the second intron of IL-4, the DNAse I hypersensitive site VA (HSVA) and HSV located in the 3′ region of IL-4, and the DNAse I hypersensitive site s1 (HSs1) and HSs2 located between IL-13 and IL-4 (Wilson et al., 2009). During T helper cell differentiation the IL-4 locus undergoes a complex series of methylation and demethylation steps. The 5′ region of the IL-4 locus is hypermethylated in naive T cells and becomes specifically demethylated in Th2 cells, whereas the highly conserved HSVA and HSV regions at the 3′ end show the converse behavior, being hypomethylated in naive T cells and becoming methylated during Th1 differentiation. The 5′ demethylation is not required for primary transcription of the IL-4 gene but is strongly associated with efficient, high-level induction of IL-4 transcripts by differentiated Th2 cells (Lee et al., 2002). In humans, CpG methylation is predominantly present at the IL-4 and IL13 gene in naive and Th1 cells, while CpG demethylation occurs only in Th2-
cells around the Th2-specific DNAse I hypersensitive sites (Santangelo et al., 2002). This wave of CpG demethylation during Th2 development coincides with the consensus binding sites for the Th2-specific transcription factor GATA-3. Similarly, Makar and colleagues found that as naive T cells differentiated into Th1 populations, the decreased capacity for IL-4 expression was accompanied by increased recruitment of DNMTs to the IL-4 and IL-13 promoters along with increased CpG methylation at these regions (Makar et al., 2003; Makar and Wilson, 2004). In contrast, these promoter regions were significantly demethylated once the cells were polarized to the Th2 condition (Bowen et al., 2008; Makar et al., 2003). Together, these data establish that the CpG sites in the IL-4/IL-13 locus are mostly methylated in naive T cells and Th1 cells but are demethylated during Th2 differentiation.
3.4.3 DNA Methylation in Regulatory T Cell and Th17 Development
Regulatory T cells and Th17 cells are two recently described lymphocyte subsets with opposing actions. Tregs, also called suppressor T cells, are a specialized subpopulation of T cells that act to suppress activation of the immune system and thereby maintain immune system homeostasis and tolerance to self-antigens (Bettini and Vignali, 2009; Wing and Sakaguchi, 2010). This subset of T cells is defined by the expression of the forkhead family transcription factor Foxp3. Because Foxp3 is required for Treg development, it has been used as a marker for Tregs in most current studies. CD4+Foxp3+ Tregs are a very heterogeneous population in both mice and humans, and different subsets of Tregs possess different levels of CpG DNA methylation at the Foxp3 locus. Increased methylation of CpG dinucleotides at the Foxp3 locus has been linked with lower Foxp3 expression, decreased Treg stability, and reduced suppressive Treg function (Lal et al., 2009; Miyara et al., 2009). Regulation of the Foxp3 locus by methylation is perhaps the most thoroughly studied example. Demethylation induced by 5-aza-cytidine, an inhibitor of DNA methylation, in human NK cells leads to Foxp3 expression (Zorn et al., 2006). Approximately 70% of CpGs in the human Foxp3 promoter are methylated in CD4+CD25lo cells, in contrast to about 5% in CD4+CD25hi T cells (Janson et al., 2008). Similarly, the methylation status of the CpG residues in the proximal promoter region has an essential role in murine Foxp3 expression. CpGs in the promoter region of mouse natural Tregs (nTregs) are all demethylated. In contrast, 10–45% of these CpG sites are methylated in naive CD4+CD25– T cells (Kim and Leonard, 2007). It has been known for some time that TGF-β induces Foxp3 expression in peripheral naive CD4+CD25– T cells (Fu et al., 2004). It appears that TGF-β may increase Foxp3 expression by inducing demethylation of CpG islands at the promoter region in CD4+CD25– T cells (Luo et al., 2008).
TGF-β was found to inhibit DNMT expression by suppressing the phosphorylation of ERK, and, accordingly, inhibition of DNMT with either siRNA or chemicals leads to Foxp3 expression in CD4+ T cells (Lal et al., 2009).
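Percentages such as the promoter methylation levels quoted above are simple proportions of methylated CpG calls. As a rough illustration only, the Python sketch below computes such a percentage from per-clone, per-CpG calls of the kind a bisulfite-sequencing experiment might yield. The data layout, the function name, and the example values are all hypothetical and are not taken from the cited studies.

```python
def percent_methylated(clones):
    """Percentage of CpG calls that are methylated across all clones.

    clones -- list of sequenced clones; each clone is a list of booleans,
              one per CpG site (True = methylated, False = unmethylated)
    """
    total_calls = sum(len(clone) for clone in clones)
    if total_calls == 0:
        raise ValueError("no CpG calls supplied")
    methylated_calls = sum(sum(clone) for clone in clones)
    return 100.0 * methylated_calls / total_calls


if __name__ == "__main__":
    # Made-up calls for two clones covering five CpG sites each.
    example = [
        [True, True, False, True, True],
        [True, False, True, True, False],
    ]
    print(percent_methylated(example))  # 70.0
```

Pooling calls across clones, as done here, weights every CpG call equally; averaging per site or per clone first would be an equally defensible choice and can give different numbers when coverage is uneven.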
In addition to promoter elements, there are regulatory cis-elements between noncoding exons that act as intronic enhancers. The intronic region (+4201 to +4500) of Foxp3 is highly conserved, and CpG islands within this region are completely methylated in naive CD4+CD25– T cells but fully demethylated in nTregs in mice (Floess et al., 2007; Kim and Leonard, 2007) and in humans (Baron et al., 2007). Consistent with these observations, TGF-β stimulation results in different levels of demethylation of CpG islands in this region in both mice and humans (Baron et al., 2007; Floess et al., 2007), leading to increased Foxp3 expression. Stable Foxp3 expression is vital in maintaining Treg function (Zhou et al., 2009a, 2009b). It has been noted that a fraction of CD4+Foxp3+ nTregs adoptively transferred into lymphopenic mice converted into Foxp3– T cells (Komatsu et al., 2009). Furthermore, under certain inflammatory conditions, Foxp3+ Tregs lose Foxp3 expression and suppressive function in an IL-6-dependent manner (Lal et al., 2009; Pasare and Medzhitov, 2003). Studies now provide evidence suggesting that DNA methylation may also play a role in the maintenance of Treg stability. For example, stable Foxp3 expression in nTregs is associated with demethylated CpG islands at the Foxp3 locus. In contrast, TGF-β-induced Tregs show methylated CpGs, associated with loss of their capability to maintain constitutive Foxp3 expression after restimulation in the absence of TGF-β (Lal et al., 2009). This emerging role is further supported by the observation that manipulation of DNA methylation can be used to induce Foxp3 expression and promote the conversion of naive T cells to Tregs (Moon et al., 2009). In contrast to Tregs, much less is known about the regulatory mechanisms and epigenetic processes that control Th17 development.
Th17 cells are derived from naive CD4+ precursor cells and preferentially secrete a characteristic profile of cytokines, including IL-17A, IL-17F, IL-21, and IL-22 (Eisenstein and Williams, 2009). As a novel subset of effector cells, they have been implicated in the pathogenesis of allograft rejection and autoimmune diseases such as arthritis and experimental autoimmune encephalomyelitis (EAE) (Hofstetter et al., 2009; Yuan et al., 2008). It is interesting that both Th17 and Tregs can be developed from naive CD4+ T cell precursors in the presence of the same cytokine, the transforming growth factor β1 (TGF-β1), suggesting that they may be coordinately regulated by shared regulatory elements. Exposure of a naive CD4+ T cell to TGF-β1 and IL-6 results in the induction of RORγt, a orphan retinoic acid nuclear receptor, that directs Th17-specific differentiation (Ivanov et al., 2006). It is interesting to note that IL-6 suppresses the development and function of Tregs (Samanta et al., 2008). Studies have shown that IL-6 induces DNMT1 expression and enhances its activity (Hodge et al., 2001), which then leads to STAT3-dependent methylation of the upstream Foxp3 enhancer and, as a consequence, represses Foxp3 expression. Therefore, the action of IL-6 for polarizing naive T cells to the Th17 condition is likely, at least in part, by inducing remethylation of CpG DNA at the upstream enhancers of Foxp3 promoter.
3.4.4 DNA Methylation in B Cell Maturation and Functionality
Many transcription factors act in a hierarchical and combinatorial manner to induce the specific gene expression necessary for each stage of differentiation of B-lineage cells. Interestingly, the DNA methylation profile changes dynamically during B-cell differentiation (Renaudineau et al., 2009). At the early B-cell differentiation stage, regulatory regions for the Pax5, Pu-1, and Igα/mb1 genes are demethylated (Amaravadi and Klemsz, 1999; Danbara et al., 2002; Maier et al., 2003). At the pre-pro-B stage, demethylation occurs at the CD19 promoter (Walter et al., 2008), while the CD21 promoter is demethylated in mature B cells (Schwab and Illges, 2001). In unstimulated B cells, DNMT1 and DNMT3a are expressed at low levels, and histones are mainly acetylated. Once the BCR is engaged, DNMT1 is induced and overexpressed; it then methylates many CpG islands, leading to B cell maturation. The involvement of DNA methylation in B cell development is further manifested by the detection of DNMT3b missense mutations in patients with immunodeficiency, centromeric region instability, and facial anomalies (ICF) syndrome (Hansen et al., 1999; Shirohzu et al., 2002), a rare recessive disease characterized by B-cell differentiation abnormalities. Studies have shown that DNMT3b missense mutations account for 40% of patients reported worldwide (Ehrlich et al., 2006). More recently, studies have demonstrated that altered DNA methylation is associated with autoantibody production. For example, hydralazine and procainamide were found to interact with DNA (Dubroff and Reid, 1980; Lee et al., 2005), which in turn reverses the methylation of cytosines present in CpG islands. Hydralazine suppresses the induction of DNMT1 and DNMT3 transcription by inhibiting the extracellular signal-regulated kinase pathway, while procainamide inhibits DNMT1 enzymatic activity, as does 5-azacytidine (Deng et al., 2003; Lee et al., 2005).
Both drugs affect the methylation status of CpG islands within regulatory sequences, leading to phenotypic changes in peripheral blood lymphocytes (PBL). As a result, treatment of animals with hydralazine induced autoantibodies against self-antigens associated with the development of lupus erythematosus (Deng et al., 2003; Dubroff and Reid, 1980; Mazari et al., 2007).
3.5 THE IMPLICATION OF DNA METHYLATION IN AUTOIMMUNE DISEASES
It is well accepted that susceptibility to autoimmune diseases such as rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and type 1 diabetes (T1D) is influenced by both genetic and environmental (epigenetic) factors. DNA methylation is now recognized as the major epigenetic mechanism that initiates and maintains heritable patterns of gene expression and gene function without changing the sequence of the genome (Callinan and Feinberg, 2006; Dennis, 2003; Holliday, 2006;
Jones and Martienssen, 2005; Sutherland and Costa, 2003; Wilson et al., 2006). Therefore, it acts as a “footprint” for gene–environment interactions or accumulated environmental exposures (Ushijima et al., 2006). Alterations in DNA methylation can be induced by a plethora of environmental factors such as diet (Aggarwal and Shishodia, 2006; Junien et al., 2005; Liu et al., 2003), lifestyle (Fraga et al., 2005), stress (Dennis, 2003; Meaney and Szyf, 2005), chronic inflammation (Tao and Robertson, 2003; Vanden Berghe et al., 2006), bacterial and viral infections (Li et al., 2005a; Maekita et al., 2006; Waterland and Jirtle, 2004), irradiation (Koturbash et al., 2006a, 2006b), and toxicants (Bombail et al., 2004). More importantly, these changes are not only heritable but also stably accumulated during an individual’s lifetime (Ballestar et al., 2006; Egger et al., 2004; Fraga et al., 2005).
3.5.1 DNA Methylation in Systemic Lupus Erythematosus
SLE is a chronic inflammatory connective tissue disorder that can involve joints, kidneys, mucous membranes, and blood vessel walls (Wardle, 2009). The presence of anti-DNA autoantibody (Ab) and high levels of circulating free DNA is a hallmark for the onset of SLE. Therefore, sera originated from SLE patients show high levels of low molecular weight DNA (e.g., 100–250 bp) enriched with Alu sequences (55% versus 13% in the whole genome) (Li and Steinman, 1989; Sano et al., 1983). These Alu sequences were potentially derived from Z-DNA (Van Helden, 1985), and contain large amounts of demethylated CpG motifs associated with increased anti-DNA recognition in SLE patients and in mouse lupus models. In line with this observation, hydralazine was found being able to induce Z-DNA conformation in a polynucleotide and elicits anti(Z-DNA) antibodies in treated patients (Thomas et al., 1993). It is believed that these circulating free DNAs were derived from apoptotic cells as manifested by the observation that apoptotic hypomethylated DNAs are immunogenic (Wen et al., 2007). Studies from Richardson and colleagues have systematically demonstrated the role of DNA methylation in the occurrence of SLE (Kaplan et al., 2004; Quddus et al., 1993; Richardson, 1986; Richardson et al., 1990, 1992; Yung et al., 1996, 1995; Yung and Richardson, 1994). The group first noted global DNA hypomethylation in T cells derived from SLE patients (Richardson et al., 1990). They next reported that antigen-specific CD4 T cells develop self-reactivity to major histocompatibility complex determinants (HLA-D molecules) upon the treatment with 5-aza-deoxycytidine (5-aza C), a DNA methylation inhibitor (Richardson, 1986), associated with the expression of leukocyte function-associated antigen 1 (LFA-1), an essential molecule implicated in T cell activation. 
The abnormal activation of LFA-1 is likely caused by 5-aza C-mediated demethylation of regulatory elements, and overexpression of LFA-1 alone seems to be sufficient to cause autoreactivity in SLE patients (Richardson et al., 1992). As a result, T cells with exogenous LFA-1 expression can induce a disease similar to SLE (Yung et al., 1996). They further
reported that altered perforin expression in SLE T cells is probably induced as a result of DNA hypomethylation, which could in part account for the increase of T cell–mediated apoptosis in SLE patients (Kaplan et al., 2004). Studies in animals have shown that adoptive transfer of 5-aza C-treated cloned or polyclonal T cells induces diverse autoimmune manifestations such as anti-DNA autoantibody production, immune complex-mediated glomerulonephritis, central nervous system lesions resembling those seen in human SLE, and pulmonary alveolitis (Quddus et al., 1993; Yung et al., 1995; Yung and Richardson, 1994). Together, the results suggest that local hypomethylation of regulatory elements predisposes individuals to an increased risk of developing SLE. Given the effect of DNA demethylation or hypomethylation on gene transcription, these data support the hypothesis that hypomethylation of SLE-associated genes results in their expression or overexpression and subsequent development of the disease.
3.5.2 DNA Methylation in Rheumatoid Arthritis
RA is a chronic inflammatory autoimmune disease that may affect many tissues and organs but principally attacks the joints, producing an inflammatory synovitis that often progresses to the destruction of articular cartilage and ankylosis of the joints (Korb et al., 2009). Similar to other complex disorders, genetic predisposition is involved in RA pathogenesis. However, studies have demonstrated that the influence of epigenetic processes (environmental triggers) on the development of rheumatic diseases is probably as strong as genetic factors in terms of disease predisposition. It has now become more and more evident that epigenetic factors increase the risk to RA development in those genetic predisposed individuals. Studies have shown that RA synovial fibroblasts (RASFs) have decreased levels of global DNA methylation (Karouzakis et al., 2009), which is associated with altered expression of cell-activating genes and could stimulate innate immune response through TLR9 signaling. In line with this observation, a retrotransposable element LINE-1 is reactivated and transcribed in RASF because of hypomethylation of CpG islands in its promoter (Neidhart et al., 2000). There is substantial evidence supporting that IL-6 is implicated in RA pathogenesis both in animal models and clinical patients (Alonzi et al., 1998; Nishimoto et al., 2004). It is interesting that IL-6 expression is tightly regulated by both transcriptional and posttranscriptional mechanisms. Studies in peripheral blood mononuclear cells (PBMCs) from RA patients and healthy controls found that a specific CpG site in the IL-6 promoter showed a lower level of methylation in cells from RA patients, which rendered PBMCs from RA patients with much higher sensitivity for induction of IL-6 expression upon LPS stimulation (Nile et al., 2008). 
In sharp contrast, CpG islands within the death receptor 3 (DR3) promoter are highly methylated in first- or second-passage RA synovial cells, along with significant downregulation of DR3 expression (Takami et al., 2006). Since DR3 is a member
of the apoptosis-inducing Fas gene family, the downregulation of DR3 could be responsible for the resistance to apoptosis in these cells. Together, the data reviewed here suggest a strong interplay between DNA methylation and RA development.
3.5.3 DNA Methylation in Type 1 Diabetes
T1D is an autoimmune disorder resulting from the breakdown of peripheral tolerance (Wang et al., 2006; Wang and She, 2008). Similar to other autoimmune diseases, a characteristic feature of T1D is the selective targeting of a specific type of cell, the insulin-secreting β cells of the islets of Langerhans in the pancreas, by a certain population of autoreactive immune cells (Li et al., 2005b; Wang et al., 2006, 2008; Wang and She, 2008). It is well accepted that exogenous (epigenetic) factors modulate T1D susceptibility in genetically predisposed individuals (Dahlquist, 1998; Knip et al., 2005; Metcalfe et al., 2001). As a result, a great deal of research in the past few years has focused on dissecting environmental triggers for autoimmunity. There is an ever-increasing body of evidence demonstrating that T1D development and progression are associated with diverse environmental triggers such as viral infection. The most popular hypothesis circulating within and beyond the scientific community is that viral infections enhance or elicit autoimmune disorders such as T1D (Filippi and von Herrath, 2008; van der Werf et al., 2007). Indeed, environmental triggers are often invoked to explain differences in disease frequency across populations and the rapid rise in disease frequency in the last few decades (Atkinson, 2005). As mentioned, studies in the last several years have not only demonstrated the importance of DNA methylation in T cell development and differentiation but also established its pivotal role in T cell polarization (Eivazova and Aune, 2004; Fields et al., 2004; Fitzpatrick and Wilson, 2003; Lee et al., 2002; Reiner, 2005). To generate an appropriate response to an infectious condition, the type of cytokine as well as the cell type, dose range, and kinetics of its expression are of critical importance.
The NFκB transcription factors are, therefore, crucial in rapid responses to stress and pathogens (innate immunity) as well as in the development and differentiation of immune cells (acquired immunity). Interestingly, recent studies confirmed that DNA methylome settings are the ultimate integration sites of both environmental and differentiative inputs, determining proper expression of each NFκB-dependent gene (Egger et al., 2004; Fischle et al., 2003; Fuks, 2005; Henikoff et al., 2004; Natoli et al., 2005; Teferedegne et al., 2006; Vanden Berghe et al., 2006). Accordingly, the expression of many cytokines implicated in T1D development, such as IL-2, IFNγ, and IL-4, is also regulated by DNA methylation (Bix and Locksley, 1998; Bruniquel and Schwartz, 2003; Fitzpatrick et al., 1999; Grogan et al., 2001; Lee et al., 2002). Tissue-specific DNA hypomethylation was noted in male rats with type 1 diabetes (Williams et al., 2008). A recent study revealed that CpG islands in both the mouse INS2 and human INS promoters
are uniquely demethylated in insulin-producing pancreatic β cells (Williams et al., 2008). Methylation of these CpG sites could suppress insulin promoter-driven reporter gene activity by almost 90%; in particular, specific methylation of the CpG sites in the cAMP responsive element (CRE) within the promoter alone can reduce promoter activity by 50%. In NOD mice, a T1D-prone model of the human disease, we found that altered DNA methylation not only dysregulates T cell function but also results in abnormal DC activation (unpublished data). DNA methylation also affects all aspects of apoptosis, from its initiation to its execution (Fulda et al., 2001; Gopisetty et al., 2006). The DNA methylome in NOD pancreatic islets undergoes a rapid switch to hypomethylation upon the development of insulitis, suggesting that the autoimmune insult activates genes responsible for β cell apoptosis. In line with this notion, cytokine-induced apoptosis in NIT-1 cells, a NOD-derived β cell line, was significantly exaggerated upon the addition of a DNA methylation inhibitor, 5-azacytidine (5-azaC) (unpublished data). Taken together, these observations point to an important role for DNA methylation in regulating the gene expression that controls autoimmunity and autoimmune-mediated β cell destruction during T1D development.
3.6 COMMON TECHNOLOGICAL APPROACHES FOR ASSAY OF DNA METHYLATION
The analysis of DNA methylation patterns was once considered a formidable technical challenge. Methylation information is not retained during the amplification steps that form the basis of most standard molecular biology techniques, such as PCR, biological amplification by cloning in Escherichia coli, and signal amplification by probe hybridization. Therefore, methods for DNA methylation analysis generally rely on a methylation-dependent modification of the original genomic DNA before any amplification step (Eads et al., 2000). These methods can be roughly divided into two types: global and gene-specific methylation analysis. For global methylation analysis, the conventional approach exploits the inability of some restriction enzymes to cut methylated DNA; HpaII–MspI (CCGG) and SmaI–XmaI (CCCGGG) are the two classical enzyme pairs widely used for this purpose (Sadri and Hornsby, 1996). In contrast, microarray-based high-throughput technology is a typical example of the modern technologies used for global DNA methylation analysis (Hayashi et al., 2007). For gene-specific methylation analysis, a large number of techniques have been developed. In the past, the most common way to study DNA methylation of particular sequences relied almost entirely on enzymes that can distinguish between methylated and unmethylated recognition sites in genes of interest. The digestion products can be detected by either PCR or Southern blotting, which generates different signal patterns depending on whether the sites are methylated or unmethylated (Ariel,
2001). Recently, bisulfite conversion-based techniques have become quite popular and are referred to as the second generation of methylation assays (Frommer et al., 1992). In this section, we briefly introduce several commonly used techniques for the analysis of both global and gene-specific DNA methylation patterns.
3.6.1 Methylation-Specific PCR
Methylation-specific PCR (MSP) is a technique developed to rapidly assess the methylation status of practically any group of CpG sites within a CpG island, independent of cloning or methylation-sensitive restriction enzymes (Herman et al., 1996). The technique begins with modification of the DNA by sodium bisulfite, which converts all unmethylated, but not methylated, cytosines to uracil, followed by PCR amplification using primers specific for methylated versus unmethylated DNA (Fig. 3.1). MSP requires only a very small amount of DNA and is sensitive to 0.1% methylated alleles at a given CpG island locus. It also has the capacity to examine almost all CpG sites, not just those within sequences recognized by methylation-sensitive restriction enzymes, which markedly increases the number of sites that can be assessed. Another advantage of this approach is that it can be applied to DNA extracted from paraffin-embedded samples (Herman et al., 1996).
Figure 3.1. Methylation-specific PCR. After bisulfite conversion, genomic DNA was amplified with primers specific for methylated or unmethylated DNA. The resulting products were analyzed on an agarose gel. Sample 1 shows products from unmethylated DNA only; samples 2 and 3 show PCR products from both methylated and unmethylated DNA.
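The logic of bisulfite conversion and methylation-specific priming can be illustrated with a minimal in-silico sketch. It is not a primer-design tool: the 17-nt template, the function names, and the exact-match annealing rule are hypothetical simplifications introduced here for illustration, and real MSP primers are longer and are judged by annealing temperature rather than by perfect identity.

```python
from typing import Iterable, List


def cpg_cytosines(seq: str) -> List[int]:
    """Return 0-based indices of cytosines that sit in a CpG dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def bisulfite_convert(seq: str, methylated: Iterable[int]) -> str:
    """Top strand as read after bisulfite treatment and PCR.

    Unmethylated C is deaminated to U and amplified as T;
    methylated C (indices listed in `methylated`) is protected and stays C.
    """
    protected = set(methylated)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq.upper())
    )


def primer_anneals(primer: str, converted: str) -> bool:
    """Toy annealing rule (assumption): the primer must match the 5' end exactly."""
    return converted.startswith(primer)


# Hypothetical template with CpG cytosines at indices 2, 9 and 14.
template = "ATCGGCCTACGTTCCGA"

# Fully CpG-methylated versus fully unmethylated versions after conversion.
methylated_dna = bisulfite_convert(template, cpg_cytosines(template))
unmethylated_dna = bisulfite_convert(template, [])

# Forward primers end on the second CpG, so their 3' ends differ (CG versus TG).
m_primer = methylated_dna[:11]
u_primer = unmethylated_dna[:11]

print("converted, methylated:  ", methylated_dna)
print("converted, unmethylated:", unmethylated_dna)
print("M primer on methylated / unmethylated:",
      primer_anneals(m_primer, methylated_dna),
      primer_anneals(m_primer, unmethylated_dna))
print("U primer on methylated / unmethylated:",
      primer_anneals(u_primer, methylated_dna),
      primer_anneals(u_primer, unmethylated_dna))
```

In this toy model each primer pair amplifies only its own template, which corresponds to the separate methylated and unmethylated products resolved on the gel in Figure 3.1.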
Furthermore, MSP can eliminate the frequent false-positive results caused by incomplete digestion by methylation-sensitive enzymes, and the amplified products can be easily validated by differential restriction patterns. Although MSP is a quick way to detect DNA methylation patterns, it has some drawbacks. It relies on PCR of complex CG-rich DNA and on the differential signal patterns of the resulting PCR products. Quantitative assessment is therefore quite difficult because of PCR bias between different samples and different primers. The key issue for this approach is to design primers that achieve good PCR efficiency. Primer sequences are usually chosen in regions containing frequent cytosines to distinguish unmodified from modified DNA. The CpG sites should lie near the 3′ end of the primers to provide maximal discrimination between methylated and unmethylated DNA. Because the amplification primers have to cover CpG sites, this approach can analyze only a limited number of CpG residues in one assay, which renders the technique inapplicable to genomewide methylation analysis.
3.6.2 Bisulfite PCR
Similar to MSP, bisulfite PCR (BSP) also requires an initial bisulfite conversion. Unlike MSP, this approach is suitable for quantitative assessment and is applicable to genomewide analysis. The most important difference between the two assays is the primer design. In contrast to MSP, primers for BSP do not cover any CpG site. Therefore, the primers bind equally to methylated and unmethylated templates, and there is no discrimination during the PCR step. This strategy leads to the simultaneous amplification of all sequence variants resulting from the various patterns of DNA methylation in the region between the two primers. The relative abundance of each sequence variant in the PCR mixture can then be assessed by a variety of standard methods such as DNA sequencing (Fig. 3.2). Because the PCR conditions are the same for methylated and unmethylated templates, the results produced by this approach are more comparable.
Unmethylated DNA (left): GTGTGTATGGTTGGGTGTTTTTGGGGTGGGTAGGGAGGTGT
Methylated DNA (right): GCGCGTACGGTCGGGCGTTTTTGGGGTGGGTAGGGAGGCGC
Figure 3.2. (See color insert.) Bisulfite-specific PCR and direct sequencing. Genomic DNA underwent bisulfite conversion followed by PCR amplification. The resulting PCR products were then purified and directly sequenced. The results at the left are from the unmethylated DNA; the results at the right are from the methylated DNA.
3.6.3 Arbitrary Primed PCR
Arbitrary primed PCR is a simple and reproducible way to generate fingerprints of complex genomes without any prior sequence information (Welsh and McClelland, 1990). This approach can be combined with methylation-specific assays such as methylation-sensitive restriction digestion for the analysis of whole-genome methylation patterns. Several methods based on this approach have been developed, such as methylation-sensitive arbitrary primed PCR (Gonzalgo et al., 1997), methylated CpG-island amplification (MCA) (Toyota et al., 1999), and amplification of intermethylated sites (AIMS) (Frigola et al., 2002). These methods are particularly useful because arbitrary primed PCR is carried out using DNA templates that have been enriched for methylated sequences, which leads to preferential amplification of CpG islands and gene-rich regions. However, all of these techniques require further validation by bisulfite genomic sequencing, and a background of PCR “noise” resulting from repetitive sequences must be taken into account.
3.6.4 Methylated DNA Immunoprecipitation Chip
The completion of the Human Genome Project has brought significant advances in high-throughput technologies. One typical example is the development of high-density oligonucleotide-based whole-genome microarrays (tiling arrays), which have now emerged as a new platform for genomic analysis far beyond simple gene expression profiling. Tiling arrays differ in the nature of the probes: short fragments or probes (around 70-mer) are designed to cover the entire genome. These probes can be synthesized directly on the surface of the arrays by photolithography using light-sensitive synthetic chemistry and photolithographic masks (Fodor et al., 1991; Pease et al., 1994) or programmable optical mirrors (Nuwaysir et al., 2002). Tiling arrays can be made with >6,000,000 discrete features per chip, with each feature comprising millions of copies of a distinct probe sequence. Techniques for enriching methylated DNA sequences genomewide have also been developed, and one such technique is methylated DNA immunoprecipitation (MeDIP), which is carried out by chromatin immunoprecipitation of
methylated DNA using either monoclonal antibodies specific to 5-methylcytidine (5mC) or a methyl-binding domain (MBD) protein specific to methylated CpGs. One emerging technique that is gaining popularity is MeDIP chip, which combines chromatin immunoprecipitation with microarray analysis and helps screen for unknown methylation sites on a genomewide scale. For MeDIP, DNA-protein complexes are first cross-linked in cells with formaldehyde, followed by immunoprecipitation with specific antibodies against the protein of interest (e.g., MBD2). DNAs bound by MBD protein are sheared to 0.2- to 2-kb fragments by sonication. The immunoprecipitated DNA and appropriate controls are fluorescently labeled and subsequently applied to chips for microarray analysis. Using input DNA as background, the profile of methylated DNA in the genome can then be characterized by comparing the immunoprecipitated DNA with the background control. Unlike the conventional PCR amplification of specific target sequences from immunoprecipitated material, MeDIP chip is a genomewide “reverse-genetic” approach (Wu et al., 2006). Another advantage of the MeDIP chip is that it targets genes directly bound by protein factors, whereas classic expression arrays cannot distinguish between directly regulated genes and those changed secondarily.
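The comparison of immunoprecipitated DNA with the input background can be sketched as a per-probe log2 ratio. The example below is only an illustration under stated assumptions: the probe names and intensities are invented, the pseudocount and the twofold threshold are arbitrary choices, and median centering assumes that most probes on the array are not enriched. It does not reproduce any particular published analysis pipeline.

```python
import math
from statistics import median
from typing import List, Sequence


def log2_enrichment(ip: Sequence[float], input_: Sequence[float],
                    pseudocount: float = 1.0) -> List[float]:
    """Per-probe log2(IP / input); the pseudocount avoids division by zero."""
    return [math.log2((i + pseudocount) / (c + pseudocount))
            for i, c in zip(ip, input_)]


def call_enriched(probe_ids: Sequence[str], ip: Sequence[float],
                  input_: Sequence[float], threshold: float = 1.0) -> List[str]:
    """Return probes whose median-centred log2 ratio reaches the threshold.

    Median centering is an assumption: it treats the typical probe as
    unenriched background, so only probes well above it are reported.
    """
    ratios = log2_enrichment(ip, input_)
    centre = median(ratios)
    return [pid for pid, r in zip(probe_ids, ratios) if r - centre >= threshold]


# Hypothetical fluorescence intensities for five tiling-array probes.
probes = ["p1", "p2", "p3", "p4", "p5"]
ip_signal = [200.0, 210.0, 1600.0, 190.0, 820.0]      # immunoprecipitated DNA
input_signal = [200.0, 200.0, 200.0, 200.0, 200.0]    # input (background) DNA

enriched = call_enriched(probes, ip_signal, input_signal)
print("candidate methylated probes:", enriched)
```

Probes reported by such a screen are candidates only; as with the other genomewide methods in this section, they would normally be confirmed by a locus-specific assay.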
3.7 SUMMARY
Along with the characterization of our genome at the DNA base-pair level, it is now recognized that DNA methylome changes and the interactions between cis-acting elements and protein factors may play a central role in gene regulation, which could have significant implications for genome function in both health and disease. Altered DNA methylation is therefore thought to be an important risk factor contributing to the development of autoimmunity in genetically predisposed subjects. The novel high-throughput array-based method enables analysis of the DNA methylome on a genomewide scale. The results obtained with this method demonstrate its effectiveness for reliably profiling many CpG sites in parallel, through which informative methylation markers could be identified for disease prediction and prognosis. The MeDIP chip approach is thus far the preferable strategy for advancing the genomewide analysis of the DNA methylome.
3.8 ACKNOWLEDGMENTS
This work was supported by grants from the Juvenile Diabetes Research Foundation International (JDRFI) and the EFSD/CDC/Lilly Program for Collaborative Diabetes Research between China and Europe to CYW. The authors declare that they have no competing financial interests.
3.9 REFERENCES
Aggarwal BB, Shishodia S. (2006). Molecular targets of dietary agents for prevention and therapy of cancer. Biochem Pharmacol 71:1397–421.
Alonzi T, Fattori E, Lazzaro D, Costa P, Probert L, Kollias G, De Benedetti F, Poli V, Ciliberto G. (1998). Interleukin 6 is required for the development of collagen-induced arthritis. J Exp Med 187:461–68.
Amaravadi L, Klemsz MJ. (1999). DNA methylation and chromatin structure regulate PU.1 expression. DNA Cell Biol 18:875–84.
Amir RE, Van den Veyver IB, Wan M, Tran CQ, Francke U, Zoghbi HY. (1999). Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 23:185–88.
Andersen AA, Panning B. (2003). Epigenetic gene regulation by noncoding RNAs. Curr Opin Cell Biol 15:281–89.
Ariel M. (2001). A PCR-based method for studying DNA methylation. Methods Mol Biol 181:205–16.
Atkinson MA. (2005). ADA Outstanding Scientific Achievement Lecture 2004. Thirty years of investigating the autoimmune basis for type 1 diabetes: why can’t we prevent or reverse this disease? Diabetes 54:1253–63.
Ballestar E, Esteller M, Richardson BC. (2006). The epigenetic face of systemic lupus erythematosus. J Immunol 176:7143–47.
Ballestar E, Pile LA, Wassarman DA, Wolffe AP, Wade PA. (2001). A Drosophila MBD family member is a transcriptional corepressor associated with specific genes. Eur J Biochem 268:5397–406.
Baron U, Floess S, Wieczorek G, Baumann K, Grutzkau A, Dong J, Thiel A, Boeld TJ, Hoffmann P, Edinger M, Turbachova I, Hamann A, Olek S, Huehn J. (2007). DNA demethylation in the human FOXP3 locus discriminates regulatory T cells from activated Foxp3(+) conventional T cells. Eur J Immunol 37:2378–89.
Bestor TH, Ingram VM. (1983). Two DNA methyltransferases from murine erythroleukemia cells: purification, sequence specificity, and mode of interaction with DNA. Proc Natl Acad Sci U S A 80:5559–63.
Bettini M, Vignali DA. (2009). Regulatory T cells and inhibitory cytokines in autoimmunity. Curr Opin Immunol 21:612–18.
Bird AP, Wolffe AP. (1999). Methylation-induced repression—belts, braces, and chromatin. Cell 99:451–54.
Bix M, Locksley RM. (1998). Independent and epigenetic regulation of the interleukin-4 alleles in CD4+ T cells. Science 281:1352–54.
Bombail V, Moggs JG, Orphanides G. (2004). Perturbation of epigenetic status by toxicants. Toxicol Lett 149:51–58.
Bowen H, Kelly A, Lee T, Lavender P. (2008). Control of cytokine gene transcription in Th1 and Th2 cells. Clin Exp Allergy 38:1422–31.
Brenner C, Fuks F. (2006). DNA methyltransferases: facts, clues, mysteries. Curr Top Microbiol Immunol 301:45–66.
Bruniquel D, Schwartz RH. (2003). Selective, stable demethylation of the interleukin-2 gene enhances transcription by an active process. Nat Immunol 4:235–40.
Callinan PA, Feinberg AP. (2006). The emerging science of epigenomics. Hum Mol Genet 15(Spec. no 1):R95–101.
Chappell C, Beard C, Altman J, Jaenisch R, Jacob J. (2006). DNA methylation by DNA methyltransferase 1 is critical for effector CD8 T cell expansion. J Immunol 176:4562–72.
Cheng LC, Tavazoie M, Doetsch F. (2005). Stem cells: from epigenetics to microRNAs. Neuron 46:363–67.
Dahlquist G. (1998). The aetiology of type 1 diabetes: an epidemiological perspective. Acta Paediatr Suppl 425:5–10.
Danbara M, Kameyama K, Higashihara M, Takagaki Y. (2002). DNA methylation dominates transcriptional silencing of PAX5 in terminally differentiated B cell lines. Mol Immunol 38:1161–66.
Deng C, Lu Q, Zhang Z, Rao T, Attwood J, Yung R, Richardson B. (2003). Hydralazine may induce autoimmunity by inhibiting extracellular signal-regulated kinase pathway signaling. Arthritis Rheum 48:746–56.
Dennis C. (2003). Epigenetics and disease: altered states. Nature 421:686–88.
Dolinoy DC, Weidman JR, Jirtle RL. (2006). Epigenetic gene regulation: linking early developmental environment to adult disease. Reprod Toxicol 23:297–307.
Dubroff LM, Reid RJ Jr. (1980). Hydralazine-pyrimidine interactions may explain hydralazine-induced lupus erythematosus. Science 208:404–06.
Eads CA, Danenberg KD, Kawakami K, Saltz LB, Blake C, Shibata D, Danenberg PV, Laird PW. (2000). MethyLight: a high-throughput assay to measure DNA methylation. Nucleic Acids Res 28:E32.
Egger G, Liang G, Aparicio A, Jones PA. (2004). Epigenetics in human disease and prospects for epigenetic therapy. Nature 429:457–63.
Ehrlich M, Jackson K, Weemaes C. (2006). Immunodeficiency, centromeric region instability, facial anomalies syndrome (ICF). Orphanet J Rare Dis 1:2.
Eisenstein EM, Williams CB. (2009). The T(reg)/Th17 cell balance: a new paradigm for autoimmunity. Pediatr Res 65:26R–31R.
Eivazova ER, Aune TM. (2004). Dynamic alterations in the conformation of the IFNγ gene region during T helper cell differentiation. Proc Natl Acad Sci U S A 101:251–56.
Esteller M. (2006). The necessity of a human epigenome project. Carcinogenesis 27:1121–25.
Fatemi M, Hermann A, Gowher H, Jeltsch A. (2002). DNMT3a and Dnmt1 functionally cooperate during de novo methylation of DNA. Eur J Biochem 269:4981–84.
Fatemi M, Wade PA. (2006). MBD family proteins: reading the epigenetic code. J Cell Sci 119:3033–37.
Fields PE, Lee GR, Kim ST, Bartsevich VV, Flavell RA. (2004). Th2-specific chromatin remodeling and enhancer activity in the Th2 cytokine locus control region. Immunity 21:865–76.
Filippi CM, von Herrath MG. (2008). Viral trigger for type 1 diabetes: pros and cons. Diabetes 57:2863–71.
Fischle W, Wang Y, Allis CD. (2003). Histone and chromatin cross-talk. Curr Opin Cell Biol 15:172–83.
Fitzpatrick DR, Shirley KM, Kelso A. (1999). Cutting edge: stable epigenetic inheritance of regional IFN-gamma promoter demethylation in CD44highCD8+ T lymphocytes. J Immunol 162:5053–57.
Fitzpatrick DR, Wilson CB. (2003). Methylation and demethylation in the regulation of genes, cells, and responses in the immune system. Clin Immunol 109:37–45.
Floess S, Freyer J, Siewert C, Baron U, Olek S, Polansky J, Schlawe K, Chang HD, Bopp T, Schmitt E, Klein-Hessling S, Serfling E, Hamann A, Huehn J. (2007). Epigenetic control of the Foxp3 locus in regulatory T cells. PLoS Biol 5:e38.
Fodor SP, Read JL, Pirrung MC, Stryer L, Lu AT, Solas D. (1991). Light-directed, spatially addressable parallel chemical synthesis. Science 251:767–73.
Fraga MF, Ballestar E, Montoya G, Taysavang P, Wade PA, Esteller M. (2003). The affinity of different MBD proteins for a specific methylated locus depends on their intrinsic binding properties. Nucleic Acids Res 31:1765–74.
Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, Boix-Chornet M, Sanchez-Aguilera A, Ling C, Carlsson E, Poulsen P, Vaag A, Stephan Z, Spector TD, Wu YZ, Plass C, Esteller M. (2005). Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci U S A 102:10604–09.
Frigola J, Ribas M, Risques RA, Peinado MA. (2002). Methylome profiling of cancer cells by amplification of inter-methylated sites (AIMS). Nucleic Acids Res 30:e28.
Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL. (1992). A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A 89:1827–31.
Fu S, Zhang N, Yopp AC, Chen D, Mao M, Chen D, Zhang H, Ding Y, Bromberg JS. (2004). TGF-beta induces Foxp3 + T-regulatory cells from CD4 + C. Am J Transplant 4:1614–27.
Fuks F. (2005). DNA methylation and histone modifications: teaming up to silence genes. Curr Opin Genet Dev 15:490–95.
Fulda S, Kufer MU, Meyer E, van Valen F, Dockhorn-Dworniczak B, Debatin KM. (2001). Sensitization for death receptor- or drug-induced apoptosis by re-expression of caspase-8 through demethylation or gene transfer. Oncogene 20:5865–77.
Goll MG, Kirpekar F, Maggert KA, Yoder JA, Hsieh CL, Zhang X, Golic KG, Jacobsen SE, Bestor TH. (2006). Methylation of tRNAAsp by the DNA methyltransferase homolog DNMT2. Science 311:395–98.
Gonzalgo ML, Liang G, Spruck CH III, Zingg JM, Rideout WM III, Jones PA. (1997). Identification and characterization of differentially methylated regions of genomic DNA by methylation-sensitive arbitrarily primed PCR. Cancer Res 57:594–99.
Gopisetty G, Ramachandran K, Singal R. (2006). DNA methylation and apoptosis. Mol Immunol 43:1729–40.
Goyal R, Reinhardt R, Jeltsch A. (2006). Accuracy of DNA methylation pattern preservation by the DNMT1 methyltransferase. Nucleic Acids Res 34:1182–88.
Grogan JL, Mohrs M, Harmon B, Lacy DA, Sedat JW, Locksley RM. (2001). Early transcription and silencing of cytokine genes underlie polarization of T helper cell subsets. Immunity 14:205–15.
Hansen RS, Wijmenga C, Luo P, Stanek AM, Canfield TK, Weemaes CM, Gartler SM. (1999). The DNMT3B DNA methyltransferase gene is mutated in the ICF immunodeficiency syndrome. Proc Natl Acad Sci U S A 96:14412–17.
Hatton RD, Harrington LE, Luther RJ, Wakefield T, Janowski KM, Oliver JR, Lallone RL, Murphy KM, Weaver CT. (2006). A distal conserved sequence element controls Ifng gene expression by T cells and NK cells. Immunity 25:717–29.
Hayashi H, Nagae G, Tsutsumi S, Kaneshiro K, Kozaki T, Kaneda A, Sugisaki H, Aburatani H. (2007). High-resolution mapping of DNA methylation in human genome using oligonucleotide tiling array. Hum Genet 120:701–11.
Heller G, Zielinski CC, Zochbauer-Muller S. (2010). Lung cancer: from single-gene methylation to methylome profiling. Cancer Metastasis Rev 29(1):95–107.
Hendrich B, Abbott C, McQueen H, Chambers D, Cross S, Bird A. (1999). Genomic structure and chromosomal mapping of the murine and human MBD1, MBD2, MBD3, and MBD4 genes. Mamm Genome 10:906–12.
Hendrich B, Bird A. (1998). Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 18:6538–47.
Hendrich B, Guy J, Ramsahoye B, Wilson VA, Bird A. (2001). Closely related proteins MBD2 and MBD3 play distinctive but interacting roles in mouse development. Genes Dev 15:710–23.
Hendrich B, Tweedie S. (2003). The methyl-CpG binding domain and the evolving role of DNA methylation in animals. Trends Genet 19:269–77.
Henikoff S, Furuyama T, Ahmad K. (2004). Histone variants, nucleosome assembly and epigenetic inheritance. Trends Genet 20:320–26.
Herman JG, Graff JR, Myohanen S, Nelkin BD, Baylin SB. (1996). Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci U S A 93:9821–26.
Hodge DR, Xiao W, Clausen PA, Heidecker G, Szyf M, Farrar WL. (2001). Interleukin-6 regulation of the human DNA methyltransferase (HDNMT) gene in human erythroleukemia cells. J Biol Chem 276:39508–11.
Hofstetter H, Gold R, Hartung HP. (2009). Th17 cells in MS and experimental autoimmune encephalomyelitis. Int MS J 16:12–18.
Holliday R. (2006). Dual inheritance. Curr Top Microbiol Immunol 301:243–56.
Hutchins AS, Artis D, Hendrich BD, Bird AP, Scott P, Reiner SL. (2005). Cutting edge: a critical role for gene silencing in preventing excessive type 1 immunity. J Immunol 175:5606–10.
Hutchins AS, Mullen AC, Lee HW, Sykes KJ, High FA, Hendrich BD, Bird AP, Reiner SL. (2002). Gene silencing quantitatively controls the function of a developmental trans-activator. Mol Cell 10:81–91.
Ivanov II, McKenzie BS, Zhou L, Tadokoro CE, Lepelley A, Lafaille JJ, Cua DJ, Littman DR. (2006). The orphan nuclear receptor RORgammat directs the differentiation program of proinflammatory IL-17+ T helper cells. Cell 126:1121–33.
Janson PC, Winerdal ME, Marits P, Thorn M, Ohlsson R, Winqvist O. (2008). Foxp3 promoter demethylation reveals the committed Treg population in humans. PLoS One 3:e1612.
Jeltsch A, Walter J, Reinhardt R, Platzer M. (2006). German human methylome project started. Cancer Res 66:73–78.
Jones B, Chen J. (2006). Inhibition of IFN-gamma transcription by site-specific methylation during T helper cell development. EMBO J 25:2443–52.
Jones PA, Martienssen R. (2005). A blueprint for a Human Epigenome Project: the AACR Human Epigenome Workshop. Cancer Res 65:11241–46.
Junien C, Gallou-Kabani C, Vige A, Gross MS. (2005). [Nutritionnal epigenomics: consequences of unbalanced diets on epigenetics processes of programming during lifespan and between generations]. Ann Endocrinol (Paris) 66:2S19–28.
Kaplan MJ, Lu Q, Wu A, Attwood J, Richardson B. (2004). Demethylation of promoter regulatory elements contributes to perforin overexpression in CD4+ lupus T cells. J Immunol 172:3652–61.
Karouzakis E, Gay RE, Michel BA, Gay S, Neidhart M. (2009). DNA hypomethylation in rheumatoid arthritis synovial fibroblasts. Arthritis Rheum 60:3613–22.
Kim HP, Leonard WJ. (2007). CREB/ATF-dependent T cell receptor-induced FoxP3 gene expression: a role for DNA methylation. J Exp Med 204:1543–51.
Knip M, Veijola R, Virtanen SM, Hyoty H, Vaarala O, Akerblom HK. (2005). Environmental triggers and determinants of type 1 diabetes. Diabetes 54(Suppl 2):S125–36.
Komatsu N, Mariotti-Ferrandiz ME, Wang Y, Malissen B, Waldmann H, Hori S. (2009). Heterogeneity of natural Foxp3+ T cells: a committed regulatory T-cell lineage and an uncommitted minor population retaining plasticity. Proc Natl Acad Sci U S A 106:1903–08.
Korb A, Pavenstadt H, Pap T. (2009). Cell death in rheumatoid arthritis. Apoptosis 14:447–54.
Koturbash I, Baker M, Loree J, Kutanzi K, Hudson D, Pogribny I, Sedelnikova O, Bonner W, Kovalchuk O. (2006a). Epigenetic dysregulation underlies radiation-induced transgenerational genome instability in vivo. Int J Radiat Oncol Biol Phys 66:327–30.
Koturbash I, Rugo RE, Hendricks CA, Loree J, Thibault B, Kutanzi K, Pogribny I, Yanch JC, Engelward BP, Kovalchuk O. (2006b). Irradiation induces DNA damage and modulates epigenetic effectors in distant bystander tissue in vivo. Oncogene 25:4267–75.
Kouskouti A, Talianidis I. (2005). Histone modifications defining active genes persist after transcriptional and mitotic inactivation. EMBO J 24:347–57.
Lal G, Zhang N, van der Touw W, Ding Y, Ju W, Bottinger EP, Reid SP, Levy DE, Bromberg JS. (2009). Epigenetic regulation of Foxp3 expression in regulatory T cells by DNA methylation. J Immunol 182:259–73.
Lee BH, Yegnasubramanian S, Lin X, Nelson WG. (2005). Procainamide is a specific inhibitor of DNA methyltransferase 1. J Biol Chem 280:40749–56.
Lee DU, Agarwal S, Rao A. (2002). Th2 lineage commitment and efficient IL-4 production involves extended demethylation of the IL-4 gene. Immunity 16:649–60.
Leonhardt H, Page AW, Weier HU, Bestor TH. (1992). A targeting sequence directs DNA methyltransferase to sites of DNA replication in mammalian nuclei. Cell 71:865–73.
Li HP, Leu YW, Chang YS. (2005a). Epigenetic changes in virus-associated human cancers. Cell Res 15:262–71.
Li JZ, Steinman CR. (1989). Plasma DNA in systemic lupus erythematosus. Characterization of cloned base sequences. Arthritis Rheum 32:726–33.
Li M, Guo D, Isales CM, Eizirik DL, Atkinson M, She JX, Wang CY. (2005b). Sumo wrestling with type 1 diabetes. J Mol Med 83:504–13.
Liu L, Wylie RC, Andrews LG, Tollefsbol TO. (2003). Aging, cancer and nutrition: the DNA methylation connection. Mech Ageing Dev 124:989–98.
Luo X, Zhang Q, Liu V, Xia Z, Pothoven KL, Lee C. (2008). Cutting edge: TGF-beta-induced expression of Foxp3 in T cells is mediated through inactivation of ERK. J Immunol 180:2757–61.
Maekita T, Nakazawa K, Mihara M, Nakajima T, Yanaoka K, Iguchi M, Arii K, Kaneda A, Tsukamoto T, Tatematsu M, Tamura G, Saito D, Sugimura T, Ichinose M, Ushijima T. (2006). High levels of aberrant DNA methylation in Helicobacter pylori-infected gastric mucosae and its possible association with gastric cancer risk. Clin Cancer Res 12:989–95.
Maier H, Colbert J, Fitzsimmons D, Clark DR, Hagman J. (2003). Activation of the early B-cell-specific MB-1 (IG-alpha) gene by Pax-5 is dependent on an unmethylated ETS binding site. Mol Cell Biol 23:1946–60.
Makar KW, Perez-Melgosa M, Shnyreva M, Weaver WM, Fitzpatrick DR, Wilson CB. (2003). Active recruitment of DNA methyltransferases regulates interleukin 4 in thymocytes and T cells. Nat Immunol 4:1183–90.
Makar KW, Wilson CB. (2004). DNA methylation is a nonredundant repressor of the Th2 effector program. J Immunol 173:4402–06.
Mattick JS. (2001). Non-coding RNAs: the architects of eukaryotic complexity. EMBO Rep 2:986–91.
Mazari L, Ouarzane M, Zouali M. (2007). Subversion of B lymphocyte tolerance by hydralazine, a potential mechanism for drug-induced lupus. Proc Natl Acad Sci U S A 104:6317–22.
Meaney MJ, Szyf M. (2005). Environmental programming of stress responses through DNA methylation: life at the interface between a dynamic environment and a fixed genome. Dialogues Clin Neurosci 7:103–23.
Meehan RR, Lewis JD, Bird AP. (1992). Characterization of MeCP2, a vertebrate DNA binding protein with affinity for methylated DNA. Nucleic Acids Res 20:5085–92.
Metcalfe KA, Hitman GA, Rowe RE, Hawa M, Huang X, Stewart T, Leslie RD. (2001). Concordance for type 1 diabetes in identical twins is affected by insulin genotype. Diabetes Care 24:838–42.
Millar CB, Guy J, Sansom OJ, Selfridge J, MacDougall E, Hendrich B, Keightley PD, Bishop SM, Clarke AR, Bird A. (2002). Enhanced CpG mutability and tumorigenesis in MBD4-deficient mice. Science 297:403–05.
Miyara M, Yoshioka Y, Kitoh A, Shima T, Wing K, Niwa A, Parizot C, Taflin C, Heike T, Valeyre D, Mathian A, Nakahata T, Yamaguchi T, Nomura T, Ono M, Amoura Z, Gorochov G, Sakaguchi S. (2009). Functional delineation and differentiation dynamics of human CD4+ T cells expressing the FoxP3 transcription factor. Immunity 30:899–911.
Moon C, Kim SH, Park KS, Choi BK, Lee HS, Park JB, Choi GS, Kwan JH, Joh JW, Kim SJ. (2009). Use of epigenetic modification to induce Foxp3 expression in naive T cells. Transplant Proc 41:1848–54.
Morgan HD, Sutherland HG, Martin DI, Whitelaw E. (1999). Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23:314–18.
Natoli G, Saccani S, Bosisio D, Marazzi I. (2005). Interactions of NF-kappaB with chromatin: the art of being at the right place at the right time. Nat Immunol 6:439–45.
Neidhart M, Rethage J, Kuchen S, Kunzler P, Crowl RM, Billingham ME, Gay RE, Gay S. (2000). Retrotransposable L1 elements expressed in rheumatoid arthritis synovial tissue: association with genomic DNA hypomethylation and influence on gene expression. Arthritis Rheum 43:2634–47.
Nile CJ, Read RC, Akil M, Duff GW, Wilson AG. (2008). Methylation status of a single CpG site in the IL6 promoter is related to IL6 messenger RNA levels and rheumatoid arthritis. Arthritis Rheum 58:2686–93.
Nishimoto N, Yoshizaki K, Miyasaka N, Yamamoto K, Kawai S, Takeuchi T, Hashimoto J, Azuma J, Kishimoto T. (2004). Treatment of rheumatoid arthritis with humanized anti-interleukin-6 receptor antibody: a multicenter, double-blind, placebo-controlled trial. Arthritis Rheum 50:1761–69.
Nuwaysir EF, Huang W, Albert TJ, Singh J, Nuwaysir K, Pitas A, Richmond T, Gorski T, Berg JP, Ballin J, McCormick M, Norton J, Pollock T, Sumwalt T, Butcher L, Porter D, Molla M, Hall C, Blattner F, Sussman MR, Wallace RL, Cerrina F, Green RD. (2002). Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res 12:1749–55.
Ohki I, Shimotake N, Fujita N, Jee J, Ikegami T, Nakao M, Shirakawa M. (2001). Solution structure of the methyl-CpG binding domain of human MBD1 in complex with methylated DNA. Cell 105:487–97.
Pasare C, Medzhitov R. (2003). Toll pathway-dependent blockade of CD4+CD25+ T cell-mediated suppression by dendritic cells. Science 299:1033–36.
Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP. (1994). Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc Natl Acad Sci U S A 91:5022–26.
Quddus J, Johnson KJ, Gavalchin J, Amento EP, Chrisp CE, Yung RL, Richardson BC. (1993). Treating activated CD4+ T cells with either of two distinct DNA methyltransferase inhibitors, 5-azacytidine or procainamide, is sufficient to cause a lupus-like disease in syngeneic mice. J Clin Invest 92:38–53.
Rakyan VK, Chong S, Champ ME, Cuthbert PC, Morgan HD, Luu KV, Whitelaw E. (2003). Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proc Natl Acad Sci U S A 100:2538–43.
Rauscher FJ III. (2005). It is time for a Human Epigenome Project. Cancer Res 65:11229.
Reiner SL. (2005). Epigenetic control in the immune response. Hum Mol Genet 14(Special issue 1):R41–46.
Renaudineau Y, Garaud S, Le Dantec C, Alonso-Ramirez R, Daridon C, Youinou P. (2009). Autoreactive B cells and epigenetics. Clin Rev Allergy Immunol 39:85–94.
Richardson B. (1986). Effect of an inhibitor of DNA methylation on T cells. II. 5-Azacytidine induces self-reactivity in antigen-specific T4+ cells. Hum Immunol 17:456–70.
c03.indd 55
1/12/2011 9:44:02 AM
56
DNA METHYLATION IN THE PATHOGENESIS OF AUTOIMMUNITY
Richardson B. (2003). DNA methylation and autoimmune disease. Clin Immunol 109:72–79. Richardson B, Scheinbart L, Strahler J, Gross L, Hanash S, Johnson M. (1990). Evidence for impaired T cell DNA methylation in systemic lupus erythematosus and rheumatoid arthritis. Arthritis Rheum 33:1665–73. Richardson BC, Strahler JR, Pivirotto TS, Quddus J, Bayliss GE, Gross LA, O’Rourke KS, Powers D, Hanash SM, Johnson MA. (1992). Phenotypic and functional similarities between 5-azacytidine-treated T cells and a T cell subset in patients with active systemic lupus erythematosus. Arthritis Rheum 35:647–62. Sadri R, Hornsby PJ. (1996). Rapid analysis of DNA methylation using new restriction enzyme sites created by bisulfite modification. Nucleic Acids Res 24:5058–59. Samanta A, Li B, Song X, Bembas K, Zhang G, Katsumata M, Saouaf SJ, Wang Q, Hancock WW, Shen Y, Greene MI. (2008). TGF-beta and IL-6 signals modulate chromatin binding and promoter occupancy by acetylated FOXP3. Proc Natl Acad Sci U S A 105:14023–27. Sano H, Imokawa M, Steinberg AD, Morimoto C. (1983). Accumulation of guaninecytosine-enriched low M.W. DNA fragments in lymphocytes of patients with systemic lupus erythematosus. J Immunol 130:187–90. Sansom OJ, Berger J, Bishop SM, Hendrich B, Bird A, Clarke AR. (2003). Deficiency of MBD2 suppresses intestinal tumorigenesis. Nat Genet 34:145–47. Santangelo S, Cousins DJ, Winkelmann NE, Staynov DZ. (2002). DNA methylation changes at human Th2 cytokine genes coincide with DNase I hypersensitive site formation during CD4(+) T Cell differentiation. J Immunol 169:1893–903. Schwab J, Illges H. (2001). Silencing of CD21 expression in synovial lymphocytes is independent of methylation of the CD21 promoter CpG island. Rheumatol Int 20:133–37. Sekigawa I, Kawasaki M, Ogasawara H, Kaneda K, Kaneko H, Takasaki Y, Ogawa H. (2006). DNA methylation: its contribution to systemic lupus erythematosus. Clin Exp Med 6:99–106. 
Sekigawa I, Okada M, Ogasawara H, Kaneko H, Hishikawa T, Hashimoto H. (2003). DNA methylation in systemic lupus erythematosus. Lupus 12:79–85. Shirohzu H, Kubota T, Kumazawa A, Sado T, Chijiwa T, Inagaki K, Suetake I, Tajima S, Wakui K, Miki Y, Hayashi M, Fukushima Y, Sasaki H. (2002). Three novel DNMT3B mutations in Japanese patients with ICF syndrome. Am J Med Genet 112:31–37. Sutherland JE, Costa M. (2003). Epigenetics and the environment. Ann N Y Acad Sci 983:151–60. Takami N, Osawa K, Miura Y, Komai K, Taniguchi M, Shiraishi M, Sato K, Iguchi T, Shiozawa K, Hashiramoto A, Shiozawa S. (2006). Hypermethylated promoter region of DR3, the death receptor 3 gene, in rheumatoid arthritis synovial cells. Arthritis Rheum 54:779–87. Tao Q, Robertson KD. (2003). Stealth technology: how Epstein-Barr virus utilizes DNA methylation to cloak itself from immune detection. Clin Immunol 109: 53–63. Teferedegne B, Green MR, Guo Z, Boss JM. (2006). Mechanism of action of a distal NF-kappaB-dependent enhancer. Mol Cell Biol 26:5759–70.
c03.indd 56
1/12/2011 9:44:02 AM
REFERENCES
57
Teitell M, Richardson B. (2003). DNA methylation in the immune system. Clin Immunol 109:2–5. Thomas TJ, Seibold JR, Adams LE, Hess EV. (1993). Hydralazine induces Z-DNA conformation in a polynucleotide and elicits anti(Z-DNA) antibodies in treated patients. Biochem J 294(Pt 2):419–25. Tollefsbol TO. (2004). Methods of epigenetic analysis. Methods Mol Biol 287:1–8. Toyota M, Ho C, Ahuja N, Jair KW, Li Q, Ohe-Toyota M, Baylin SB, Issa JP. (1999). Identification of differentially methylated sequences in colorectal cancer by methylated CpG island amplification. Cancer Res 59:2307–12. Trouche D, Khochbin S, Dimitrov S. (2003). Chromatin and epigenetics: dynamic organization meets regulated function. Mol Cell 12:281–86. Ushijima T, Nakajima T, Maekita T. (2006). DNA methylation as a marker for the past and future. J Gastroenterol 41:401–07. van der Werf N, Kroese FG, Rozing J, Hillebrands JL. (2007). Viral infections as potential triggers of type 1 diabetes. Diabetes Metab Res Rev 23:169–83. Van Helden PD. (1985). Potential Z-DNA-forming elements in serum DNA from human systemic lupus erythematosus. J Immunol 134:177–79. Vanden Berghe W, Ndlovu MN, Hoya-Arias R, Dijsselbloem N, Gerlo S, Haegeman G. (2006). Keeping up NF-kappaB appearances: epigenetic control of immunity or inflammation-triggered epigenetics. Biochem Pharmacol 72:1114–31. Wade PA. (2001). Methyl CpG-binding proteins and transcriptional repression. Bioessays 23:1131–37. Walter K, Bonifer C, Tagoh H. (2008). Stem cell-specific epigenetic priming and B cellspecific transcriptional activation at the mouse CD19 locus. Blood 112:1673–82. Wang CY, Han J, She JX. (2008). Genetic factors for human type 1 diabetes. In Current Topics in Human Genetics (Studies in Complex Diseases). Vol. 24. Ed. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ 07601, pp. 693–727. Wang CY, Podolsky R, She JX. (2006). Genetic and functional evidence supporting SUMO4 as a type 1 diabetes susceptibility gene. 
Ann N Y Acad Sci 1079:257–67. Wang CY, She JX. (2008). SUMO4 and its role in type 1 diabetes pathogenesis. Diabetes Metab Res Rev 24:93–102. Wardle EN. (2009). Systemic lupus erythematosus conundrums. Saudi J Kidney Dis Transpl 20:731–36. Waterland RA, Jirtle RL. (2004). Early nutrition, epigenetic changes at transposons and imprinted genes, and enhanced susceptibility to adult chronic diseases. Nutrition 20:63–68. Welsh J, McClelland M. (1990). Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Res 18:7213–18. Wen ZK, Xu W, Xu L, Cao QH, Wang Y, Chu YW, Xiong SD. (2007). DNA hypomethylation is crucial for apoptotic DNA to induce systemic lupus erythematosus-like autoimmune disease in SLE-non-susceptible mice. Rheumatology (Oxford) 46: 1796–803. Williams KT, Garrow TA, Schalinske KL. (2008). Type I diabetes leads to tissue-specific DNA hypomethylation in male rats. J Nutr 138:2064–69. Wilson CB, Rowell E, Sekimata M. (2009). Epigenetic control of T-helper-cell differentiation. Nat Rev Immunol 9:91–105.
c03.indd 57
1/12/2011 9:44:02 AM
58
DNA METHYLATION IN THE PATHOGENESIS OF AUTOIMMUNITY
Wilson IM, Davies JJ, Weber M, Brown CJ, Alvarez CE, MacAulay C, Schubeler D, Lam WL. (2006). Epigenomics: mapping the methylome. Cell Cycle 5:155–58. Wing K, Sakaguchi S. (2010). Regulatory T cells exert checks and balances on self tolerance and autoimmunity. Nat Immunol 11:7–13. Wu J, Smith LT, Plass C, Huang TH. (2006). ChIP-chip comes of age for genome-wide functional analysis. Cancer Res 66:6899–902. Yuan X, Paez-Cortez J, Schmitt-Knosalla I, D’Addio F, Mfarrej B, Donnarumma M, Habicht A, Clarkson MR, Iacomini J, Glimcher LH, Sayegh MH, Ansari MJ. (2008). A novel role of CD4 Th17 cells in mediating cardiac allograft rejection and vasculopathy. J Exp Med 205:3133–44. Yung R, Powers D, Johnson K, Amento E, Carr D, Laing T, Yang J, Chang S, Hemati N, Richardson B. (1996). Mechanisms of drug-induced lupus. II. T cells overexpressing lymphocyte function-associated antigen 1 become autoreactive and cause a lupuslike disease in syngeneic mice. J Clin Invest 97:2866–71. Yung R, Ray D, Eisenbraun JK, Deng C, Attwood J, Eisenbraun MD, Johnson K, Miller RA, Hanash S, Richardson B. (2001). Unexpected effects of a heterozygous dnmt1 null mutation on age-dependent DNA hypomethylation and autoimmunity. J Gerontol A Biol Sci Med Sci 56:B268–76. Yung RL, Quddus J, Chrisp CE, Johnson KJ, Richardson BC. (1995). Mechanism of drug-induced lupus. I. Cloned Th2 cells modified with DNA methylation inhibitors in vitro cause autoimmunity in vivo. J Immunol 154:3025–35. Yung RL, Richardson BC. (1994). Role of T cell DNA methylation in lupus syndromes. Lupus 3:487–91. Zhao X, Ueba T, Christie BR, Barkho B, McConnell MJ, Nakashima K, Lein ES, Eadie BD, Willhoite AR, Muotri AR, Summers RG, Chun J, Lee KF, Gage FH. (2003). Mice lacking methyl-CpG binding protein 1 have deficits in adult neurogenesis and hippocampal function. Proc Natl Acad Sci U S A 100:6777–82. Zhou L, Chong MM, Littman DR. (2009a). Plasticity of CD4+ T cell lineage differentiation. Immunity 30:646–55. 
Zhou X, Bailey-Bucktrout S, Jeker LT, Bluestone JA. (2009b). Plasticity of CD4(+) FoxP3(+) T cells. Curr Opin Immunol 21:281–85. Zorn E, Nelson EA, Mohseni M, Porcheray F, Kim H, Litsa D, Bellucci R, Raderschall E, Canning C, Soiffer RJ, Frank DA, Ritz J. (2006). IL-2 regulates Foxp3 expression in human CD4+CD25+ regulatory T cells through a STAT-dependent mechanism and induces the expansion of these cells in vivo. Blood 108:1571–79.
c03.indd 58
1/12/2011 9:44:02 AM
CHAPTER 4
Cell-Based Analysis with Microfluidic Chip
WANG QI and ZHAO LONG
Contents
4.1 Introduction
4.2 Fabrication of the Microfluidic Chip and Cell Culture
4.2.1 Fabrication of the Microfluidic Chip
4.2.2 Cell Culture and Analysis
4.3 Application of the Cell-Based Microfluidic Chip
4.3.1 Genomic Analysis on Chip
4.3.2 Protein Analysis on Chip
4.3.3 Analysis of Chemotherapy Resistance in Tumor Cells
4.4 Conclusions and Future Prospects
4.5 Acknowledgments
4.6 References
4.1 INTRODUCTION
Micro total analysis systems (μTAS), also called labs on a chip, integrate the analytical processes for sequential operations such as sampling, sample pretreatment, analytical separation, chemical reaction, analytical detection, and data analysis in a single microfluidic device. Microfluidic-based research has advanced significantly over the last few years and has become a very active field. Because of their advantages (low reagent and power consumption, short reaction time, portability for in situ use, low cost, versatility in design, and potential for parallel operation and integration with other miniaturized devices), microfluidic chip-based systems for biological cell studies have attracted significant attention. The microfluidic technique has started to play an increasingly important role in discoveries in cell biology, neurobiology, pharmacology, and tissue engineering.
Because cell-based assays are deemed essential for the functional characterization and detection of drugs, pathogens, toxicants, and odorants, a number of articles on microfluidics for cell analysis have been published during the past few years. Andersson and van den Berg (2003) reviewed microfluidics for cellomics, covering microfluidic devices for cell sampling, cell trapping and sorting, cell treatment, and cell analysis. Erickson and Li (2004) focused on integrated microfluidic devices for cell handling and cytometry, dielectrophoretic cellular manipulation and sorting, and general cellular analysis. A large number of reports have described microfluidic-based biological applications such as cell culture, PCR, DNA separation, DNA sequencing, and clinical diagnostics (Auroux et al., 2002; Vilkner et al., 2004). More recently, microfluidics for cell culture, flow cytometers, and other microscale flow-based cell analysis systems were presented (Tai and Shuler, 2003; Huh et al., 2005), including cell detection and enumeration systems and microfluidic fluorescence-activated cell sorting systems (Huh et al., 2005).
This chapter provides an in-depth look at the applications of cell-based analysis with microfluidic devices. We summarize reports on the use of cell-based microfluidic chips to carry out specific functions. The chapter consists mainly of three sections, covering fabrication of the microfluidic devices; cell culture and manipulation; and cell-based analysis, including protein analysis, genomic analysis, and tumor cell chemotherapy analysis. Selected examples of the manipulation methods are described in detail to reveal their advantages and disadvantages. An outlook on this promising and rapidly expanding area is presented in the concluding remarks. Some typical microchips are shown in Figure 4.1.
4.2 FABRICATION OF THE MICROFLUIDIC CHIP AND CELL CULTURE
Fabrication of a miniaturized cell manipulator is the first step toward microfluidics for cell-based assays. The commonly used methods for cell manipulation on chip can be categorized according to the manipulating force employed. In this section, the fabrication of a few selected microfluidic devices and their use for cell culture are described.
4.2.1 Fabrication of the Microfluidic Chip
Microfluidic devices can be fabricated from a variety of materials using different techniques. The main materials are silicon, glass, and polydimethylsiloxane (PDMS), each with different characteristics; the main fabrication techniques are lithography, UV laser ablation, and hot embossing.
Figure 4.1. Typical microchips: glass four-channel array electrophoresis chip; glass PCR chip; quartz electrophoresis chip; PDMS-glass drug-screening chip; gradient chip for cell culture and drug screening.
4.2.1.1 Different Materials
Silicon and glass are traditional materials that have been used to create microfluidic devices (Li et al., 1999; McClain et al., 2003). The particularly attractive advantages of glass and quartz include their well-defined surfaces and excellent optical properties, which are highly desirable for fluorescence readout of microarrays. Because of the drawbacks of silicon and glass, such as their hardness and high cost, polymers are rapidly evolving as alternative substrate materials for many microfluidic applications. Their diverse properties can be selected to suit the needs of a particular application, and structures can be microfabricated in a high-production mode at low cost. Recently, PDMS, a polymer, has become one of the most extensively used materials in microfluidic devices because of its biocompatibility, low toxicity, high oxidative and thermal stability, optical transparency, low permeability to water, and low electrical conductivity; furthermore, it can be easily fabricated into microstructures using soft lithography.
4.2.1.2 Fabrication Technique
The fabrication of the prerequisite fluidic networks on polymer substrates can be achieved by lithography, UV laser ablation, hot embossing, injection molding, or direct micromilling techniques (Ford et al., 1999; Friedrich et al., 1997; Grass et al., 2001; Ihlemann and Rubahn, 2000; Martynova et al., 1997; Qi et al., 2002; Robert et al., 1997; Rossier et al., 1999; Soper et al., 2000). Lithography is the earliest of these techniques and was first established for glass and silicon substrates. However, the fabrication processes involved are fairly time-consuming, labor-intensive, and expensive. Using UV laser ablation or micromilling, a polymer substrate is exposed to laser radiation or a milling bit, resulting in the direct writing of the microstructures into the polymer part. In the case of laser ablation, the laser photons, typically from a pulsed excimer laser, are focused directly on the polymer. Fluidic channels are produced either by moving the polymer substrate or by moving the focused laser beam across the surface. This method can form complex features with various geometries, even in three dimensions, because the patterning beam can be moved both horizontally and vertically on the substrate. In either case, fluidic prototypes can be rapidly produced using direct writing techniques to optimize device performance before mass production using microreplication techniques.
Hot embossing and injection molding involve the use of mold masters containing the desired microstructures to stamp patterns into the required substrate. The patterns on the mold masters can be made either by micromachining or by a lithography-based approach, such as LiGA (Lithographie, Galvanoformung, Abformung; lithography, electroplating, molding). The LiGA processing steps, which use x-ray lithography to build the microstructures on the mold master, are shown in Figure 4.2. The important aspect of this technology is that the difficult fabrication steps, lithography and electroplating, are performed only once; parts are then microreplicated from the master mold using either injection molding or hot embossing.
Soft lithography, also called PDMS fabrication technology, was developed by Whitesides and co-workers and has been widely adopted for the fabrication of different microfluidic networks (Anderson et al., 2000a; Duffy et al., 1998, 1999; McDonald and Whitesides, 2002; Sia and Whitesides, 2003). Fabrication of microfluidic structures using PDMS technology involves pouring a solution containing the PDMS prepolymer and a curing agent over a relief (mold) containing the prerequisite microstructures, followed by curing to cross-link the polymer.
This technology first requires creation of positive reliefs using a variety of techniques (Becker and Gartner, 2000; Love et al., 2001), such as photolithography followed by wet etching of silicon or glass, micromachining of metals, or reactive ion etching. A solution containing the prepolymer and curing agent is then cast against the relief, and the cross-linked polymer conforms to the shape of the relief. After casting, the polymer is simply removed from the relief, resulting in a replica that contains the network of microfluidic channels. Finally, the PDMS replica can be plasma oxidized and sealed to other surfaces by conformal contact to enclose the fluidic network. Plasma oxidation also renders the PDMS channels hydrophilic so that they can easily be filled with aqueous solutions. Besides plasma oxidation, Quake and co-workers (2000) have described a method for sealing fluidic networks created in PDMS that uses two slabs: one slab contains an excess of base, while the other contains the curing agent. The two slabs are brought into conformal contact and then cured.
Figure 4.2. Processing steps required to prepare a mold master using x-ray LiGA and to mold finished parts. [Panels: lithography (synchrotron irradiation, absorber structure, mask membrane, resist, base plate, resist structure); electroplating (metal, resist structure, electrically conductive base plate); molding (mold cavity, plastic structure).] The process starts with a lithography step using a mask that transfers the desired pattern into a photoresist. Following development of the resist, the underlying metal layer is exposed, which serves as a plating base for another metal that fills the voids left by the resist that was removed due to exposure to the patterning radiation. The deposited metal forms the desired microstructures. The unexposed resist is then removed, forming the finished master mold. This master mold is then available for microreplicating parts in polymers or other materials using hot embossing or injection molding. At the right are the master mold and parts that have been prepared from it (Situma et al., 2006; Ford et al., 1999).
4.2.2 Cell Culture and Analysis
Microfluidic devices are especially suitable for biological applications, particularly at the cellular level, because the scale of the channels is commensurate with
that of the cells (Walker et al., 2004), and the scale of the device allows important factors (e.g., growth factors) to accumulate locally, forming a stable microenvironment for cell culture (Mahoney et al., 2005). Compared to traditional culture tools, microfluidic platforms provide much greater control over cell microenvironments and allow rapid optimization of media composition using relatively small numbers of cells.
4.2.2.1 Static Cell Culture
In recent years, there have been many reports concerning cell culture in microdevices; various cells, such as epithelial cells, interstitial cells, cancer cells, and even stem cells, grow well in these devices. Most of the culture processes involve static medium. For instance, our research group developed a single-channel PDMS chip for the culture of the lung cancer cell line H446 to explore the function of glucose-regulated protein 78 (GRP78), an endoplasmic reticulum chaperone, in chemotherapy-resistant lung cancer (Ying-Yan et al., 2008). Wlodkowic et al. (2009) described an inexpensive method for production of reusable, optical-grade PDMS microculture chips that provide a static and self-contained microwell system analogous to conventional polystyrene multiwell plates. Shao and co-workers (2009) developed an integrated microfluidic cell culture platform in which endothelial cells (ECs) are kept under static conditions or exposed to a pulsatile and oscillatory shear stress. Static cell culture remains a key method for studying the biological function of cells. However, static environments are far from representative of cells in vivo and rarely support long-term cell culture. Therefore, it is essential to establish a microfluidic platform that is supplied with fresh medium carrying oxygen and nutrients at a continuous, controlled flow rate, so as to mimic cellular microenvironments and to perform cell culture and biomedical assays in an environment closer to the in vivo situation.
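As a rough illustration of what a continuous, controlled flow rate implies for the cultured cells, the following Python sketch estimates the Reynolds number, the wall shear stress, and the medium residence time for perfusion through a single rectangular channel. It is only a sketch under stated assumptions: the channel dimensions, flow rate, and medium properties are hypothetical values chosen for illustration, not values taken from the studies cited above, and the shear stress uses the standard parallel-plate approximation, which holds only when the channel is much wider than it is high.

```python
"""Back-of-envelope perfusion estimates for a rectangular microchannel.

All numerical values are illustrative assumptions, not data from the chapter.
"""


def mean_velocity(flow_rate, width, height):
    """Mean flow velocity (m/s): volumetric flow rate / cross-sectional area."""
    return flow_rate / (width * height)


def reynolds_number(flow_rate, width, height, density, viscosity):
    """Reynolds number based on the hydraulic diameter of a rectangular duct."""
    hydraulic_diameter = 2.0 * width * height / (width + height)
    velocity = mean_velocity(flow_rate, width, height)
    return density * velocity * hydraulic_diameter / viscosity


def wall_shear_stress(flow_rate, width, height, viscosity):
    """Wall shear stress (Pa), parallel-plate approximation (height << width)."""
    return 6.0 * viscosity * flow_rate / (width * height ** 2)


def residence_time(flow_rate, width, height, length):
    """Time (s) needed to replace one channel volume of medium."""
    return width * height * length / flow_rate


# Hypothetical channel and medium (assumptions for illustration only).
WIDTH = 1.0e-3            # m (1 mm)
HEIGHT = 1.0e-4           # m (100 um)
LENGTH = 1.0e-2           # m (10 mm)
FLOW_RATE = 1.0e-9 / 60   # m^3/s (1 uL/min)
DENSITY = 1.0e3           # kg/m^3, water-like medium
VISCOSITY = 7.0e-4        # Pa*s, roughly water at 37 degrees C

if __name__ == "__main__":
    re = reynolds_number(FLOW_RATE, WIDTH, HEIGHT, DENSITY, VISCOSITY)
    tau = wall_shear_stress(FLOW_RATE, WIDTH, HEIGHT, VISCOSITY)
    t_res = residence_time(FLOW_RATE, WIDTH, HEIGHT, LENGTH)
    print(f"Reynolds number:   {re:.3f}")
    print(f"Wall shear stress: {tau * 1e3:.2f} mPa")
    print(f"Residence time:    {t_res:.0f} s")
```

With these assumed values the Reynolds number is far below 1, so the flow is laminar; the wall shear stress is a few millipascals, and the channel volume is exchanged about once per minute. Changing the flow rate scales the shear stress linearly, which is why the flow rate has to be controlled when cells are sensitive to shear.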
4.2.2.2 Dynamic Cell Culture
To achieve long-term cell culture at higher density and in larger numbers in a microenvironment closer to in vivo conditions, continuous nutrient and oxygen supply and waste removal through the culture medium have to be ensured. Takayama and co-workers (2004) described the use of horizontally oriented mini-reservoir arrays as a gravity-driven pumping system to generate multiple fluid streams inside microfluidic cell culture channels at a constant flow rate for prolonged periods. Maharbiz et al. (2003) presented a microfabricated electrolytic oxygen generator for high-density miniature cell culture arrays. Long-term (>2 weeks) culture of muscle cells spanning the whole process of differentiation from myoblasts to myotubes was reported by Folch and co-workers. Prokop et al. (2004) developed nanoliter bioreactors (NBRs) for the long-term culture and maintenance of up to several hundred cultured mammalian cells in volumes three orders of magnitude smaller than those in standard multiwell screening plates. Refreshable Braille display–based microfluidic bioreactors, which are more densely packed and not limited to linear and unidirectional perfusion, were developed for cell culture for up to 3 weeks under perfusion (Gu et al., 2004). Yasuda and co-workers developed a type of on-chip microcultivation chamber, in which the opening and closing of the valve could be measured directly by optical microscopy, for long-term cultivation of swimming cells. Fujii and co-workers achieved large-scale cell culture (up to 10⁷ cells/cm³) in a microfluidic device with a multilayer bioreactor containing an oxygen supply system. Lee and co-workers presented a microfluidic cell culture array that could run 100 different cell-based experiments in parallel for long-term cellular monitoring. Repeated cell growth/passage cycles, reagent introduction, and real-time optical analysis could all be achieved in this microdevice. Recently, our research group developed an integrated microfluidic system consisting of an MS26 syringe pump and a single-channel PDMS chip, which was used to culture a lung cancer cell line (SPCA1) for at least 5 days (Zhao et al., 2010) (Fig. 4.3).
Figure 4.3. Long-term cell culture of SPCA cells in the microfluidic system. [Chip schematic labels: 1, 2; 3 cm, 5 cm.] An LM photo for each time point (24 h, 48 h, 72 h, and 96 h) indicated that SPCA cells grew well and propagated gradually. Magnification is ×100 (Zhao et al., 2010).
4.2.2.3 3D Cell Culture
To improve the efficiency of cell culture and to better approximate the three-dimensional arrangement of cells in vivo, several approaches have been presented, such as fabricating three-dimensional (3D) microstructures (Kojima et al., 2003; Moriguchi et al., 2002; Tan and Desai, 2004) and trying other biocompatible materials (Leclerc et al., 2004). Yasuda and co-workers developed a method that
made use of noncontact 3D photothermal etching with a focused 1480-nm or 1064-nm infrared laser beam to shape agar microstructures for cultivating cells. Desai and co-workers fabricated a 3D heterogeneous multilayer tissue-like structure inside microchannels for cell culture. Single and multiple stepwise microstructures were fabricated in the photosensitive biodegradable polymer poly(caprolactone (CL)-dl-lactide (LA)) tetraacrylate for static culture of Hep G2 cells and fetal human hepatocytes (FHHs) (Leclerc et al., 2004). Other approaches for microfluidic cell culture were also studied. Cell cultivation was performed in a highly parallelized manner in fluid segments that were formed as droplets at a channel junction, where organic and cell-containing aqueous phases were merged (Martin et al., 2003). A multichannel microelectrode array that could influence and record electrical cellular activity was integrated into the microfluidics for cell culture (Pearce et al., 2005). A gradient-generating microfluidic platform for optimizing proliferation and differentiation of neural stem cells in culture was described by Jeon and co-workers.
4.2.2.4 Single-Cell Analysis Systems
Our understanding of many biological processes would greatly benefit from the ability to analyze the content of single cells. Today, only a few conventional systems enable direct intrinsic studies of single cells, including capillary electrophoresis (CE) and flow cytometry. Because these systems are based on conventional technologies and instrumentation, they give only limited information about the cell content and do not offer a general method for single-cell analysis. Recent rapid developments in microfabrication and nanofabrication technologies have already led to the successful laboratory-on-a-chip (LOC) concept, and these developments offer great opportunities for the analysis of single cells. Takayama et al.
(2001) reported that, by using multiple laminar streams, which constitute one of the most important characteristics of a microfluidic channel, it is possible to stimulate only selected domains inside cells. These investigators used fluorescent tags to detect the subcellular positioning of small molecules. Microchips with complicated microstructures can also be used to screen for gap-junction formation between two adjacent cardiomyocytes. Tamaki and co-workers (2002) developed a novel single-cell analysis system consisting of a scanning thermal lens microscope (TLM) detection system and a cell culture microchip (Figure 4.4a). TLM is a sensitive technique for detecting nonfluorescent molecules in a microspace. Detection is based on absorption of visible or UV light followed by a photothermal process (Kitamori et al., 2004). Generally, cells are labeled with fluorescent probes to detect the subcellular positioning of small molecules, and it is difficult to detect subcellular molecules directly without fluorescent tags. The TLM-based system, in contrast, could detect nonfluorescent biological substances with extremely high sensitivity without any labeling materials, and it had a high spatial resolution of ∼1 μm. The microchip format gave good liquid control and simplified troublesome procedures. This system was applied to monitoring cytochrome-c distribution in a neuroblastoma-glioma hybrid cell cultured in the microflask (Fig. 4.4b). The system seems to be applicable to monitoring compounds released in cells when used in combination with analytical microchips.
Figure 4.4. (See color insert.) A single-cell analysis system in a glass microchip using a thermal lens microscope (TLM). (a) Cell culture chip design and TLM scanning method. [Panel labels: staurosporine, microsyringe pump, TLM detection (532 nm), drain, X-Y scanning stage, cell culturing flask; scale bar 30 µm.] A microflask (1 mm × 10 mm × 0.1 mm) was fabricated in a glass microchip, and a cell suspension was introduced into it. After cultivation, the microchip with capillaries connected to syringe pumps was mounted on the TLM stage, and TLM signals were measured while scanning the stage to obtain a 2D image. (b) Direct imaging of cytochrome-c in a cell and its distribution change during apoptosis. [Panel labels: signal scale 0–10; + staurosporine; scale bar 30 µm.] (Tamaki et al., 2002).
4.3 APPLICATION OF THE CELL-BASED MICROFLUIDIC CHIP
Microfluidic chips are used for transporting and manipulating minute amounts of fluids and/or biological entities through microchannel manifolds, allowing integration of various chemical and biochemical processes into fast and automated monolithic microflow systems. It has become evident that there is tremendous market potential for microdevices that aid in diagnostics, drug discovery, and evaluation of new pharmaceuticals, since these devices are expected to satisfy the urgent demand for high-throughput and large-scale applications. In this section, applications of microfluidic devices to genomic and proteomic analysis and to chemotherapy resistance in cancer are described.
4.3.1 Genomic Analysis on Chip
Separation of nucleic acids is one of the leading applications of microchip-based analysis today. Because the deteriorating effects of Joule heating during electric field–mediated separations in microchannels are negligible and very small, well-defined sample plugs can be injected, the resolving power of microfluidic bioanalytical devices is mainly diffusion limited, resulting in superior performance compared to slab gel and capillary electrophoresis. Rapid, high-quality separations on chip have been demonstrated for the analysis of oligonucleotides and RNA and DNA fragments.
4.3.1.1 Sample Preparation and Separation
The shorter separation distances, compared to those used in conventional CE, present new challenges for the optimization of separation conditions, such as electrokinetic manipulations, channel geometry, and sieving media. The geometrical effects of folded microchannel structures on band broadening have been extensively studied. Sieving media developed for CE instruments, based on viscous solutions of entangled water-soluble polymers, have been successfully applied to microchip electrophoresis. Linear polyacrylamide and its derivatives, as well as polydimethylacrylamide, polyethylene oxide, polyethylene glycol with fluorocarbon tails, hydroxyethylcellulose and various cellulose derivatives, and other polysaccharides, have been used for size separation of nucleic acid molecules in capillaries, and some of these matrices have also been adapted for microchip electrophoresis. Recently introduced thermoresponsive co-polymers, made up of hydrophobic and hydrophilic blocks, have also shown promising results in terms of efficiency and the possibility of theoretical modeling of acrylamide grafted with polyethylene oxide chains (Liang et al., 1999). These matrices have a pronounced temperature-dependent viscosity transition point, which suggests promising implementations.
In particular, thermoresponsive polymers can offer some practical advantages for microchannel electrophoresis, enabling easier handling and loading of the viscous polymer solutions without the requirement of a high-pressure manifold. Buchholz et al. (2001) have constructed interesting “viscosity switch” materials, which respond to changes of temperature, pH, or ionic strength. These matrices are based on co-polymers of acrylamide derivatives with variable hydrophobicity and possess a reversible temperature-controlled viscosity
switch from high-viscosity solutions at room temperature to low-viscosity colloid dispersions at elevated temperatures. High resolving power and good DNA sequencing performance were also achieved with these sieving media.
4.3.1.2 DNA Analysis and Genotyping
Major applications of electrophoresis microchips include sizing of double-stranded DNA fragments (Effenhauser et al., 1997; Woolley and Mathies, 1994; McCormick et al., 1997; Duffy et al., 1999; Ronai et al., 2001), short single-stranded oligonucleotides (Effenhauser et al., 1994), and ribosomal RNA fragments (Ogura et al., 1998). High separation performance and speed have been achieved. DNA genotyping on microchips enables quick identification of genes and can substantially enhance the capabilities of genomic, diagnostic, pharmacogenetic, and forensic tests. The identification of genes related to hereditary diseases, such as hemochromatosis (Woolley et al., 1997), has been successfully performed on chip. An ultrafast allelic profiling assay for the analysis of short tandem repeats (STRs) has also been demonstrated (Schmalzing et al., 1999). Separation of a CTTv quadruplex system was accomplished in <2 min using a 26-mm separation distance, which is 10–100 times faster than capillary or slab gel electrophoresis, respectively (Schmalzing et al., 1997). Two-color multiplex analysis of eight STR loci has also been performed by the same group.
4.3.1.3 Polymerase Chain Reaction
The polymerase chain reaction (PCR) is a key process in genetic analysis and has played an important role in modern biology and biochemistry research. However, conventional 96-well PCR devices usually have a thermal cycling rate of 1–2°C/s; therefore, a complete PCR amplification needs 1–2 h or even longer. Moreover, with this approach, both the sample preparation and the post-PCR product detection need to be performed off-line, resulting in a long analysis process and a high risk of cross-contamination.
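The 1–2 h figure quoted above for conventional instruments can be reproduced with a rough estimate. The following Python sketch is illustrative only: the three-step temperature profile (95, 55, and 72°C), the hold times, and the 30-cycle count are assumed typical values that do not come from this chapter, the function name is hypothetical, and the initial denaturation and final extension steps are ignored.

```python
def pcr_run_time_s(ramp_rate_c_per_s,
                   cycles=30,                      # assumed cycle count
                   temps_c=(95.0, 55.0, 72.0),     # assumed denature/anneal/extend temperatures
                   holds_s=(30.0, 30.0, 60.0)):    # assumed hold time at each temperature
    """Estimate the thermal-cycling time of a three-step PCR protocol, in seconds."""
    if ramp_rate_c_per_s <= 0:
        raise ValueError("ramp rate must be positive")
    # Temperature path within one cycle, returning to the first set point.
    path = list(temps_c) + [temps_c[0]]
    # Total temperature change that must be ramped through per cycle.
    ramp_span_c = sum(abs(b - a) for a, b in zip(path, path[1:]))
    per_cycle_s = ramp_span_c / ramp_rate_c_per_s + sum(holds_s)
    return cycles * per_cycle_s


if __name__ == "__main__":
    for rate in (1.0, 2.0):
        minutes = pcr_run_time_s(rate) / 60.0
        print(f"{rate:.0f} C/s ramp: about {minutes:.0f} min")
```

Under these assumptions, a 1°C/s ramp gives about 100 min and a 2°C/s ramp about 80 min, both within the 1–2 h range cited. With the assumed hold times, ramping accounts for 40% of each cycle at 1°C/s, so a faster ramp shortens the run only in proportion to that share unless the hold times are shortened as well.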
Therefore, there have been many driving forces to exploit the potential benefits of microscale PCR devices (also called PCR microfluidic chips), which include reduced consumption of samples and reagents, shorter analysis time, greater sensitivity, portability, and disposability. The obvious benefits of such miniaturization come from the improved thermal energy transfer compared to conventional macrovolumes, resulting in a greatly increased speed of thermal cycling, reduced consumption of expensive reagents, and the possibility of creating versatile, multifunctional integrated systems. However, potential problems associated with PCR downscaling, such as surface-related effects, become greatly enhanced with shrinking reaction volumes and the resulting increase in surface-to-volume ratio. Various biocompatible, PCR-friendly surfaces have been investigated by passivating silicon, fused silica, glass, or plastic walls with coatings applied by chemical modification and/or physical adsorption. Another issue is related to handling of small volumes of liquids and the incorporation of miniaturized DNA amplification units into the chain of
sample processing and analysis, to develop truly automated and high-throughput technology. A number of studies have been done to optimize microscale DNA amplification. Two possible embodiments can be envisioned: closed microchamber and open format, both of which can be multiplexed to an array. Fast real-time PCR analysis was demonstrated in silicon microchambers by several groups (Belgrader et al., 1999; Northrup et al., 1998; Wilding et al., 1998). DNA amplification can be coupled with special microfluidic cartridges designed to carry out several sequential steps of DNA extraction from different sample types (tissues, cells, whole blood, soil, food, etc.) before the PCR. DNA amplification in open format has not yet been reported, but other chemical and enzymatic reactions conducted in nanovials and picovials have shown promising results (Litborn et al., 1999, 2000; Clark and Ewing, 1998; Bernhard et al., 2001).
4.3.2 Protein Analysis on Chip
The conventional approach to analyzing protein molecules typically comprises their extraction from cells, separation and detection by one- or two-dimensional gel electrophoresis, excision of the bands and in-gel digestion, followed by MS analysis of the resulting peptide mixture. These traditional techniques are slow and labor intensive. Fast and high-throughput integrated multisample systems are in great demand. Microfluidic devices offer opportunities that are unavailable in traditional protein analysis technologies, with the potential to control and automate multiple sample processing steps. The majority of recent developments in microfluidics for protein applications have been related to the combination of single or multidimensional microchannel separations and MS detection.
4.3.2.1 Separation and System Integration
A number of research groups have been focusing on the realization of traditional 2D gel electrophoresis in a microchip format.
The problems of protein purification and desalting before the analysis were addressed in microfabricated structures using semipermeable membranes, sandwiched between microfluidic manifold layers and connecting the buffer and analyte counterflows (Jiang et al., 2001). Such microdialysis enabled effective cleanup of analytes of interest from low molecular mass compounds and enhanced the following MS characterization. Figeys and Pinto (2001) reported examples of isoelectric focusing (IEF) of proteins in microchips, demonstrating the potential of microscale techniques. Chip-based IEF has been accomplished in 30 s in a 7-cm channel with a peak capacity of 30–40 peaks (Hofmann et al., 1999). Electroosmotically driven mobilization of the focused zones was found the most suitable technique for the microchip approach due to its easy implementation and high speed. An integrated IEF-electrospray ionization-MS plastic microfluidic device, coupling an electrospray tip to an IEF microchip, has also been reported (Wen et al., 2000).
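As a rough reading of the chip-based IEF figures reported by Hofmann et al. (1999), the sketch below converts a peak capacity into an average focused-zone width. It assumes the simple definition of peak capacity as channel length divided by mean zone width, and that the whole 7-cm channel is used for focusing; both are assumptions made for illustration and may differ from how the original authors derived their values. The function name is hypothetical.

```python
def mean_zone_width_mm(channel_length_mm: float, peak_capacity: float) -> float:
    """Average focused-zone width implied by a peak capacity.

    Assumes peak capacity = channel length / mean zone width
    (an illustrative definition, not taken from the chapter).
    """
    if channel_length_mm <= 0 or peak_capacity <= 0:
        raise ValueError("length and peak capacity must be positive")
    return channel_length_mm / peak_capacity


if __name__ == "__main__":
    channel_mm = 70.0  # the 7-cm channel quoted in the text
    for capacity in (30, 40):  # the reported peak-capacity range
        width = mean_zone_width_mm(channel_mm, capacity)
        print(f"peak capacity {capacity}: mean zone width {width:.2f} mm")
```

On these assumptions, a peak capacity of 30–40 in a 7-cm channel corresponds to focused zones roughly 1.8–2.3 mm wide, resolved within the 30 s focusing time quoted above.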
Chromatography and electrophoresis are the two major separation techniques for proteins. The latter is by far the more popular in microchip applications due to its easier realization in miniaturized formats. Nevertheless, on-chip chromatographic separations have recently been attempted by trapping coated beads in a microcavity within a microchannel network or by using in situ microfabricated (He et al., 1999) or polymerized (Ericson et al., 2000) beds. The development of more sophisticated microfluidic systems with both horizontal integration (by building parallel analysis lanes for high-throughput applications) and vertical integration (by implementing several functions on a single device) is perhaps the most exciting trend in the microchip world. Recently, a new microfluidic glass device was developed for an integrated protein sizing assay that performs separation, staining, virtual destaining, and detection steps and can sequentially analyze 10 different samples in <30 min (Bousse et al., 2001). Universal noncovalent fluorescent on-chip labeling was employed in combination with postseparation dilution to reduce the fluorescent background associated with SDS micelle-bound dye and increase the signal-to-noise ratio. The applicability of this technique to microscale protein separations has been investigated by several research groups (Colyer et al., 1997; Liu et al., 2000; Csapo et al., 2000). The high speed and sensitivity of microseparations and the option of on-chip precolumn or postcolumn labeling, mixing, and destaining proved to be clearly beneficial in combination with noncovalent protein staining dyes. High separation efficiency and good detection limits were demonstrated.
4.3.2.2 Immunoassay
Immunoassays represent a very important tool in clinical diagnostics and biomedical research and industry.
Various miniaturized immunoassays, mostly based on competitive antigen–antibody interactions, were successfully performed on electrophoresis microchips, demonstrating greater speed and the feasibility of automated analysis in a portable format. Among recent developments, Gottschlich et al. (2000) reported an integrated microchip device that sequentially performed enzymatic reactions, electrophoretic separation of the products, and postseparation on-line labeling of the peptides and proteins, followed by fluorescence detection. Our group also detected the expression of glucose-regulated protein 78 (GRP78) by immunoassay in an integrated PDMS chip, which showed several advantages over conventional testing in terms of time and sensitivity (Wang et al., 2009). A new class of microfluidic immunoassays is based on solid-supported lipid bilayers (Yang et al., 2001). The bilayers, created on the PDMS surfaces of an array of parallel microchannels, contained dinitrophenyl (DNP)-conjugated lipids for binding with fluorescently labeled anti-DNP antibodies at a different concentration in each channel. The methodology can be used for performing rapid and accurate heterogeneous assays in a single experiment, and the amount of protein required is significantly reduced compared to conventional methods.
4.3.2.3 Microchip and Mass Spectrometry
Since MS is one of the most frequently used methods in protein analysis and characterization, a number of successful efforts have been reported to couple microfluidic devices to mass spectrometers (Figeys and Pinto, 2001; Licklider et al., 2000). Such combinations enabled automated sample delivery and enhanced MS analysis efficiency, for instance, through sample processing, enrichment, cleanup, and fractionation before detection. These devices can transport the analyte fluid electrokinetically or by pressure and generate electrospray via an attached capillary or more complex emitter couplings. Enzymatic digestion was monitored by peptide mass fingerprinting in real time, with high detection sensitivity, on a hybrid microchip nanoelectrospray device (Lazar et al., 2001). Ekstrom et al. (2000) also described automated MS analysis using a porous microfabricated digestion chip integrated with a sample pretreatment robot and a microdispenser for transferring the reaction products to a MALDI target plate.
4.3.3 Analysis of Chemotherapy Resistance in Tumor Cells
Cancer is the third leading cause of death worldwide. The major clinical strategy for managing advanced-stage cancer is chemotherapy. However, the success of the treatment is limited by the intrinsic or acquired resistance of cancer cells to chemotherapy. Microchip-based systems, or labs on a chip, have many advantages, such as a short diffusion distance, a large specific interface area, and rapid and efficient reactions, showing a promising future for research on chemotherapy resistance. Recently, our group and other researchers have done much work on chemotherapy resistance with this platform.
4.3.3.1 Assay of Drug Resistance Related Protein
The conventional methods for the in vitro detection of drug resistance related proteins mainly include immunoprecipitation, Western blotting, and gel electrophoresis on cells cultured in flasks (Ramsay et al., 2005; Reddy et al., 2003). These processes involve a relatively long time, troublesome liquid-handling procedures, and large quantities of reagents; some require labor-intensive purification of the protein and complex analysis steps. Our group fabricated three kinds of chips: a straight channel chip, a gradient chip, and a micropump chip. Figure 4.5 shows a gradient chip, which was composed of an upstream concentration gradient generator (CGG) and a downstream cell culture module. The chip was fabricated in PDMS using standard soft lithography and rapid prototyping techniques. Gradient concentrations of an anticancer drug can be formed through the concentration gradient generator. Lung cancer cells were introduced into the cell culture chambers and grew well there. This work aimed to study the correlation between the expression of GRP78 and the resistance to the anticancer drug VP-16 in the human lung squamous carcinoma cell line SK-MES-1 using the integrated microfluidic gradient chip
[Figure 4.5: (a) fluorescence micrographs, panels (a)–(h), of cell culture chambers a–h at A23187 concentrations of 0.0, 0.86, 1.71, 2.57, 3.42, 4.28, 5.13, and 6.00 µM (low to high); (b) plot of normalized fluorescence intensity per cell against A23187 concentration (µM).]
Figure 4.5. (See color insert.) Expression of GRP78 at the protein level, detected by immunofluorescence in SK-MES-1 cells. (a) Cells were treated with A23187 for 24 h and observed under a fluorescence microscope (IX-71; Olympus Optical Co., Tokyo, Japan) (×400). The fluorescence indicates GRP78 in the cytoplasm. (b) The average expression of GRP78 per cell, reflected by the normalized fluorescence intensity, increased with the concentration of A23187. The normalized fluorescence intensity per cell was determined as the fluorescence intensity in a region divided by the number of cells in that region (Wang et al., 2009).
device. We used A23187, a GRP78 inducer, at gradient concentrations generated in the upstream network of the device to induce the expression of GRP78 in the cells cultured downstream before the addition of VP-16. The expression of GRP78 was detected by immunofluorescence, and apoptosis of the cells treated with VP-16 was assessed morphologically by 4′,6-diamidino-2-phenylindole (DAPI) staining. The results indicated that the expression of GRP78 increased greatly in the cells induced with A23187 in a dose-dependent manner, while the percentage of apoptotic cells decreased significantly after treatment with VP-16. The results confirmed the role GRP78 plays in chemotherapy resistance to VP-16 in the SK-MES-1 cell line, suggesting that integrated microfluidic systems may be a unique approach for characterizing cellular responses (Wang et al., 2009).
4.3.3.2 Anticancer Drug Cytotoxicity Assay
Sugiura et al. (2008) reported a pressure-driven perfusion culture chip developed for parallel drug cytotoxicity assays. The device is composed of an 8 × 5 array of cell culture microchambers with independent perfusion microchannels. It is equipped with a simple interface for convenient access by a micropipette and connection to an external pressure source, which enables easy operation without special training. The unique microchamber structure was carefully designed with consideration of hydrodynamic parameters and was fabricated out of polydimethylsiloxane by using multilayer photolithography and replica molding. The microchamber structure enables uniform cell loading and perfusion culture without cross-contamination between neighboring microchambers. A parallel cytotoxicity assay was successfully carried out in the 8 × 5 microchamber array to analyze the cytotoxic effects of seven anticancer drugs.
The pressure-driven perfusion culture chip, with its simple interface and well-designed microfluidic network, will likely become a platform for future high-throughput drug screening by microchip.
4.3.3.3 Analysis of Anticancer Drugs Efficiency
Popovtzer and coworkers (2008) describe a new method for rapid, sensitive, and high-throughput detection of colon cancer cells’ response to differentiation therapy using a novel electrochemical lab-on-a-chip system. Differentiation-inducing agents such as butyric acid and its derivatives were introduced to miniature colon cancer samples within the nanovolume chip chambers. The efficacy of each of the differentiation-inducing agents was evaluated by electrochemical detection of the cellular enzymatic activity level, where reappearance of normal enzymatic activity denotes effective therapy. The results demonstrate the ability to evaluate multiple drug effects simultaneously on miniature tumor samples (approximately 15 cells) rapidly (5 min) and sensitively, with quantitative correlation between the number of cancer cells and the induced current. The use of miniature analytical devices is of special interest in clinically relevant samples, in that it requires less tissue for diagnosis and enables
high-throughput analysis and comparison of various drug effects on one small tumor sample, while maintaining uniform biological and environmental conditions.
4.4 CONCLUSIONS AND FUTURE PROSPECTS
In this chapter we reviewed a sampling of recently reported cell-based bioanalysis with microfluidic chips. The objective was to provide a glimpse into the current state of the art in these fields. As we have stated, microfluidic devices are powerful tools for cell-based analysis, exhibiting numerous advantages, such as reduced reagent consumption, improved performance, multifunctional possibilities with interconnected networks of channels, inherent mechanical stability of monolithic systems, and the possibility of parallelization and inexpensive mass production. System integration plays one of the most significant roles in miniaturization and will lead to greater process intensification. As miniaturization greatly improves heat and mass transfer, it opens new horizons and leads to revolutionary changes in bioanalytical sciences and separation technology. Microfluidic systems enable rapid identification of nucleic acids, proteins, drugs, and other compounds using subnanogram amounts of sample and provide an online interface with upstream and downstream processing and analysis, such as sample preparation, microdialysis, separation, and MS detection. At the same time, current cell-based bioanalysis with microfluidic chips still requires improvements, especially in microfabrication techniques. The successful applications of cell-based microfluidic analysis include cell lysis, cell culture, chemical and biochemical sensing, and other processes. On-chip cell lysis can be realized using chemical, mechanical, and electrical methods. All these lysis methods should be integrated with microfluidics. To achieve efficient delivery of lysis reagents or on-chip generation of lysis reagents, more microfluidic components must be fabricated, increasing the complexity of the microdevices. Mechanical lysis faces the same challenge.
Electrical lysis is the most widely used method in microfluidic cell lysis due to its easy integration and simplified instrumentation, and it will find more applications. On-chip cell culture can be successfully achieved on a microfluidic platform. Different materials (silicon, PDMS, and borosilicate) have proved to be suitable substrates for cell growth. Due to its inherent properties, PDMS has become the dominant material for fabricating cell culture chips. Micropumps and oxygen generators have been incorporated into microfluidic devices to achieve long-term cell culture and to culture cells to higher densities and in larger numbers. To improve the efficiency of cell culture, 3D microstructures have also been fabricated in cell culture chips. Electroporation and electrofusion microchips can achieve the necessary electric field with a much lower applied voltage, avoiding the potential risks of using high voltage, and dissipate heat more quickly due to their large surface-to-volume ratio. Though they
still suffer from the low survival rate, their advantages have not been fully exploited. Complicated optical setup and complex operation limit applications of optimization of microchips. Miniaturization of analytical devices promises numerous benefits, including reduced cell consumption, automated and reproducible reagent delivery, and improved performance. Microcytometry is the most successful application of microfluidic systems for cell-based assays. Various detection methods, such as confocal detection, optical stretcher, impedance spectra, and coincident light scattering detection, can be incorporated into microcytometers for different applications. Many of the necessary components and functionalities of a typical room-size laboratory, such as pumps, dampers, switch valves, and input and output wells, can be integrated into a small chip for cell sorting. Chemical cytometry can also be successfully achieved on a microfluidic platform. A variety of small molecules, large proteins, and DNA can be simultaneously analyzed in a chemical cytometer chip within a single run. Especially arresting is the possibility of treating and analyzing single living cells using microfluidic devices. A variety of intracellular parameters and cell metabolites have been analyzed and detected using microfluidics, which also demonstrates the versatility of microfluidics. With the employment of nanotechnology, single cell-based microfluidic devices for various sophisticated experiments will be the future direction of this research area.
4.5 ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (30670532).
4.6 REFERENCES
Anderson JR, Chiu DT, Jackman RJ, Cherniavskaya O, McDonald JC, Wu H, Whitesides SH, Whitesides GM. (2000a). Fabrication of topologically complex three-dimensional microfluidic systems in PDMS by rapid prototyping. Anal Chem 72(14):3158–64.
Andersson H, van den Berg A. (2003). Microfluidic devices for cellomics: a review. Sens Actuators B Chem 92(3):315–25.
Auroux PA, Iossifidis D, Reyes DR, Manz A. (2002). Micro total analysis systems. 2. Analytical standard operations and applications. Anal Chem 74:2637–52.
Becker H, Gartner C. (2000). Polymer microfabrication methods for microfluidic analytical applications. Electrophoresis 21(1):12–26.
Belgrader P, Benett W, Hadley D, Richards J, Stratton P, Mariella R Jr, Milanovich F. (1999). PCR detection of bacteria in seven minutes. Science 284:449–50.
Bernhard DD, Mall S, Pantano P. (2001). Fabrication and characterization of microwell array chemical sensors. Anal Chem 73:2484–90.
Bousse L, Mouradian S, Minalla A, Yee H, Williams K, Dubrow K. (2001). Protein sizing on a microchip. Anal Chem 73:1207–12.
Buchholz BA, Doherty EA, Albarghouthi MN, Bogdan FM, Zahn JM, Barron AE. (2001). Microchannel DNA sequencing matrices with a thermally controlled “viscosity switch.” Anal Chem 73(2):157–64.
Clark RA, Ewing AG. (1998). Characterization of electrochemical responses in picoliter volumes. Anal Chem 70:1119–25.
Colyer CL, Mangru SD, Harrison DJ. (1997). Microchip-based capillary electrophoresis of human serum proteins. J Chromatogr 781:271–76.
Csapo Z, Gerstner A, Sasvari-Szekely M, Guttman A. (2000). Automated ultra-thin-layer SDS gel electrophoresis of proteins using noncovalent fluorescent labeling. Anal Chem 72:2519–25.
Duffy DC, Schueller OJA, Brittain ST, Whitesides GM. (1999). Rapid prototyping of microfluidic switches in poly(dimethylsiloxane) and their actuation by electroosmotic flow. J Micromech Microeng 9(3):211–17.
Effenhauser CS, Bruin GJM, Paulus A. (1997). Integrated chip-based capillary electrophoresis. Electrophoresis 18:2203.
Effenhauser CS, Paulus A, Manz A, Widmer HM. (1994). High-speed separation of antisense oligonucleotides on a micromachined capillary electrophoresis device. Anal Chem 66:2949–53.
Ekstrom S, Onnerfjord P, Nilsson J, Bengtsson M, Laurell T, Marko-Varga G. (2000). Integrated microanalytical technology enabling rapid and automated protein identification. Anal Chem 72:286–93.
Ericson C, Holm J, Ericson T, Hjerten S. (2000). Electroosmosis and pressure-driven chromatography in chips using continuous beds. Anal Chem 72:81–87.
Erickson D, Li DQ. (2004). Integrated microfluidic devices. Anal Chim Acta 507:11–26.
Figeys D, Pinto D. (2001). Proteomics on a chip: promising developments. Electrophoresis 22:208–16.
Ford SM, Davies J, Kar B, Qi SD, McWhorter S, Soper SA, Malek CK. (1999). Micromachining in plastics using x-ray lithography for the fabrication of microelectrophoresis devices. J Biomech Eng 121(1):13–21.
Friedrich CR, Coane PJ, Vasile MJ. (1997). Micromilling development and applications for microfabrication. Microelectron Eng 35(1–4):367–72.
Gottschlich N, Culbertson CT, McKnight TE, Jacobson SC, Ramsey JM. (2000). Integrated microchip-device for the digestion, separation, and postcolumn labeling of proteins and peptides. J Chromatogr 745:243–49.
Grass B, Neyer A, Johnck M, Siepe D, Eisenbeiss F, Weber G, Hergenroder R. (2001). A new PMMA-microchip device for isotachophoresis with integrated conductivity detector. Sens Actuators B 72(3):249–58.
Gu W, Zhu X, Futai N, Cho BS, Takayama S. (2004). From the cover: computerized microfluidic cell culture using elastomeric channels and braille displays. Proc Natl Acad Sci U S A 101:15861–66.
He B, Li J, Regnier F. (1999). Capillary electrochromatography of peptides in a microfabricated system. J Chromatogr A 853:257–62.
Hofmann O, Chi D, Cruickshank KA, Muller UR. (1999). Adaptation of capillary isoelectric focusing to microchannels on a glass chip. Anal Chem 71:678–86.
Huh D, Gu W, Kamotani Y, Grotberg JB, Takayama S. (2005). Microfluidics for flow cytometric analysis of cells and particles. Physiol Meas 26:R73–98.
Ihlemann J, Rubahn K. (2000). Excimer laser micro-machining: fabrication and applications of dielectric masks. Appl Surf Sci 154–155:587–92.
Jiang Y, Wang P-C, Locascio LE, Lee CS. (2001). Integrated plastic microfluidic devices with ESI-MS for high throughput drug screening and residue analysis. Anal Chem 73:2048–53.
Kitamori T, Tokeshi M, Hibara A, Sato K. (2004). Thermal lens microscopy and microchip chemistry. Anal Chem 76(3):52A–60A.
Kojima K, Moriguchi H, Hattori A, Kaneko T, Yasuda K. (2003). Two-dimensional network formation of cardiac myocytes in agar microculture chip with 1480-nm infrared laser photo-thermal etching. Lab Chip 3:292–96.
Lazar IM, Ramsey RS, Ramsey JM. (2001). On-chip proteolytic digestion and analysis using “wrong-way-round” electrospray time-of-flight mass spectrometry. Anal Chem 73:1733–39.
Leclerc E, Furukawa KS, Miyata F, Sakai Y, Ushida T, Fujii T. (2004). Fabrication of microstructures in photosensitive biodegradable polymers for tissue engineering applications. Biomaterials 25:4683–90.
Li J, Thibault P, Bings NH, Skinner CD, Wang C, Colyer C, Harrison J. (1999). Integration of microfabricated devices to capillary electrophoresis electrospray mass spectrometry using a low dead volume connection: application to rapid analyses of proteolytic digests. Anal Chem 71(15):3036–45.
Liang D, Song L, Zhou S, Zaitsev VS, Chu B. (1999). Poly(N-isopropylacrylamide)-g-poly(ethyleneoxide) for high resolution and high speed separation of DNA by capillary electrophoresis. Electrophoresis 20:2856–63.
Licklider L, Wang X, Desai A, Tai YC, Lee T. (2000). A micromachined chip-based electrospray source for mass spectrometry. Anal Chem 72:367–75.
Litborn E, Emmer A, Roeraade J. (1999). Chip-based nanovials for tryptic digest and capillary electrophoresis. Anal Chim Acta 401:11–19.
Litborn E, Emmer A, Roeraade J. (2000). Parallel reactions in open chip-based nanovials with continuous compensation for solvent evaporation. Electrophoresis 21:91–99.
Liu Y, Foote RS, Jacobson SC, Ramsey RS, Ramsey JM. (2000). Electrophoretic separation of proteins on a microchip with noncovalent, postcolumn labeling. Anal Chem 72:4608–13.
Love JC, Wolfe DB, Jacobs HO, Whitesides GM. (2001). Microscope projection photolithography for rapid prototyping of masters with micronscale features for use in soft lithography. Langmuir 17(19):6005–12.
Maharbiz MM, Holtz WJ, Sharifzadeh S, Keasling JD, Howe RT. (2003). A microfabricated electrochemical oxygen generator for high-density cell culture arrays. Microelectromech Syst 12:590–99.
Mahoney MJ, Chen RR, Tan J, Saltzman WM. (2005). The influence of microchannels on neurite growth and architecture. Biomaterials 26:771–78.
Martin K, Henkel T, Baier V, Grodrian A, Schön T, Roth M, Köhler JM, Metze J. (2003). Generation of larger numbers of separated microbial populations by cultivation in segmented-flow microdevices. Lab Chip 3:202–07.
Martynova L, Locascio LE, Gaitan M, Kramer GW, Christensen RG, MacCrehan WA. (1997). Fabrication of plastic microfluid channels by imprinting methods. Anal Chem 69(23):4783–89.
McClain MA, Culbertson CT, Jacobson SC, Allbritton NL, Sims CE, Ramsey JM. (2003). Microfluidic devices for the high-throughput chemical analysis of cells. Anal Chem 75(21):5646–55.
McCormick RM, Nelson RJ, Alonso-Amigo MG, Benvegnu DJ, Hooper HH. (1997). Microchannel electrophoretic separations of DNA in injection-molded plastic substrates. Anal Chem 69:2626–30.
McDonald JC, Whitesides GM. (2002). Poly(dimethylsiloxane) as a material for fabricating microfluidic devices. Acc Chem Res 35(7):491–99.
Moriguchi H, Wakamoto Y, Sugio Y, Takahashi K, Inoue I, Yasuda K. (2002). An agar-microchamber cell-cultivation system: flexible change of microchamber shapes during cultivation by photo-thermal etching. Lab Chip 2:125–30.
Northrup MA, Benett B, Hadley D, Landre P, Lehew S, Richards J, Stratton P. (1998). A miniature analytical instrument for nucleic acids based on micromachined silicon reaction chambers. Anal Chem 70:918–22.
Ogura M, Agata Y, Watanabe K, McCormick RM, Hamaguchi Y, Aso Y, Mitsuhashi M. (1998). RNA chip: quality assessment of RNA by microchannel linear gel electrophoresis in injection-molded plastic chips. Clin Chem 44:2249–55.
Oleschuk RD, Shultz-Lockyear LL, Ning Y, Harrison DJ. (2000). Trapping of bead-based reagents within microfluidic systems: on-chip solid phase extraction and electrochromatography. Anal Chem 72:585–90.
Pearce TM, Wilson JA, Oakes SG, Chiu SY, Williams JC. (2005). Integrated microelectrode array and microfluidics for temperature clamp of sensory neurons in culture. Lab Chip 5:97–101.
Popovtzer R, Neufeld T, Popovtzer A, Rivkin I, Margalit R, Engel D, Nudelman A, Rephaeli A, Rishpon J, Shacham-Diamand Y. (2008). Electrochemical lab on a chip for high-throughput analysis of anticancer drugs efficiency. Nanomedicine 4(2):121–26.
Prokop A, Prokop Z, Schaffer D, Kozlov E, Wikswo J, Cliffel D, Baudenbacher F. (2004). Nanoliter bioreactor: long-term mammalian cell culture at nanofabricated scale. Biomed Microdevices 6:325–39.
Qi S, Liu X, Ford S, Barrows J, Thomas G, Kelly K, McCandless A, Lian K, Goettert J, Soper SA. (2002). Microfluidic devices fabricated in poly(methyl methacrylate) using hot-embossing with integrated sampling capillary and fiber optics for fluorescence detection. Lab Chip 2(2):88–95.
Ramsay RG, Ciznadija D, Mantamadiotis T, Anderson R, Pearson R. (2005). Expression of stress response protein glucose regulated protein-78 mediated by c-Myb. Int J Biochem Cell Biol 37:1254–68.
Reddy RK, Mao C, Baumeister P, Austin RC, Kaufman RJ, Lee AS. (2003). Endoplasmic reticulum chaperone protein GRP78 protects cells from apoptosis induced by topoisomerase inhibitors: role of ATP binding site in suppression of caspase-7 activation. J Biol Chem 278:20915–24.
Roberts MA, Rossier JS, Bercier P, Girault H. (1997). UV laser machined polymer substrates for the development of microdiagnostic systems. Anal Chem 69(11):2035–42.
CHAPTER 5
Missing Dimension: Protein Turnover Rate Measurement in Gene Discovery
GARY GUISHAN XIAO
Contents
5.1 Protein Turnover as Significant Missing Dimension in Gene Function and Discovery
5.2 Determination of the Rate of Turnover of Specific Proteins
5.3 Future Direction
5.4 Questions and Answers
5.5 Acknowledgments
5.6 References
5.1 PROTEIN TURNOVER AS SIGNIFICANT MISSING DIMENSION IN GENE FUNCTION AND DISCOVERY
Multiple levels of analysis are normally exploited in functional genomics, including the genome, transcriptome, proteome, and metabolome. The dynamic changes of these molecules (mRNAs, proteins, and metabolites) depend on the physiological, developmental, or pathological state of living cells (Pratt et al., 2002). A change in the proteome may be the most important for the analysis of gene function and interaction. Because of the wide range of protein concentrations in cells, it is also the most difficult to study in a truly comprehensive manner. Standard proteomics usually compares amounts of proteins in cells in two different states (e.g., disease vs. normal) or conditions (e.g., treatment vs. no treatment) (Xiao et al., 2008b); it does not address the dynamics of the proteome in the different biological states that are being compared, nor does it provide information about the mechanisms whereby the system changes from one state to the other. Protein turnover, which is also
known as protein accretion, is the balance between protein synthesis and protein degradation (or breakdown). To illustrate, an increase in the level of expression of a protein could be achieved by an enhanced rate of synthesis or a diminished rate of degradation (Pratt et al., 2002; Xiao et al., 2008a; Zhao et al., 2009a). Protein turnover is believed to decrease with age in all senescent organisms, including humans. This results in an increase in the amount of damaged protein within the body. It is unknown whether this is a cause or a consequence of aging, but it seems likely that it is both. The damaged protein results in slower protein turnover, which in turn results in more damaged protein, causing an exponential increase in damage to all protein within the body and contributing to aging. Protein turnover is being considered as a missing dimension in proteomics for biomedical research (Pratt et al., 2002). The dynamics of protein turnover is another key feature for understanding the regulation of protein expression and protein–protein interaction in cells (Xiao et al., 2008a; Zhao et al., 2009a). The level of expression of a protein depends on the rates of its synthesis and degradation. Thus the turnover of a protein is an important indicator of its functional significance in cells. Despite its evident importance, the role of protein turnover has not previously been considered in analyses of the proteome. Protein turnover can be quantified on a protein-by-protein basis.
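The point that the same protein level can arise from different synthesis and degradation rates can be made concrete with a small numerical sketch. The model below is an assumption introduced for illustration and is not taken from the chapter: synthesis is treated as a constant rate `k_syn`, and degradation as first order with rate constant `k_deg`, so that dP/dt = k_syn - k_deg * P. All function names and rate values are hypothetical.

```python
import math


def steady_state_level(k_syn, k_deg):
    """Steady-state protein level for the assumed model dP/dt = k_syn - k_deg * P."""
    return k_syn / k_deg


def half_life(k_deg):
    """Half-life of the protein pool under first-order degradation."""
    return math.log(2) / k_deg


def level_at(t, p0, k_syn, k_deg):
    """Analytical solution P(t) = P_ss + (P0 - P_ss) * exp(-k_deg * t)."""
    p_ss = steady_state_level(k_syn, k_deg)
    return p_ss + (p0 - p_ss) * math.exp(-k_deg * t)


# Hypothetical rates in arbitrary units (amount per hour, and per hour).
baseline = steady_state_level(k_syn=10.0, k_deg=0.10)
more_synthesis = steady_state_level(k_syn=20.0, k_deg=0.10)    # synthesis doubled
less_degradation = steady_state_level(k_syn=10.0, k_deg=0.05)  # degradation halved

# Both changes double the steady-state level, but the pools turn over
# at different speeds, which an abundance measurement alone cannot show.
fast_pool_half_life = half_life(0.10)
slow_pool_half_life = half_life(0.05)
```

Under these assumptions an abundance comparison sees the same twofold increase in both cases, while the half-lives differ by a factor of two. This is the kind of information a turnover measurement adds.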
5.2 DETERMINATION OF THE RATE OF TURNOVER OF SPECIFIC PROTEINS
The dynamics of protein turnover is key to understanding the regulation of protein expression in cells (Doherty et al., 2005; Pratt et al., 2002). Unlike other quantitative proteomics methods that label the protein with stable isotopes after its synthesis, such as ICAT (Gygi et al., 1999) and iTRAQ (Gan et al., 2007), measurement of protein synthesis requires labeling of protein in vitro or in vivo through the pathways of protein synthesis and amino acid metabolism (Pratt et al., 2002; Xiao et al., 2008a; Zhao et al., 2009a). In general, measurement of protein synthesis depends on the use of the precursor–product relationship. For example, when a radioactive amino acid is used to trace protein synthesis, one has to first determine the specific activity of the radioactive amino acid in the cell and the specific activity of the same amino acid in the protein of interest (Fern and Garlick, 1974). With the advent of mass spectrometry, many stable isotope methods have been developed for the estimation of protein turnover (Busch et al., 2006; Cargile et al., 2004; Papageorgopoulos et al., 1999, 2002; Pratt et al., 2002; Previs et al., 2004). In these methods precursor enrichment is determined either directly or indirectly. In the deuterated water methods of Busch et al. (2006) and Previs et al. (2004), plasma alanine enrichment is used as the precursor enrichment, and fractional synthesis is determined from the ratio of alanine enrichment in protein to its enrichment in plasma. Precursor enrichment can also be determined indirectly by isotopomer ratio methods
such as those established for fatty acid synthesis using MIDA (Lee et al., 1992, 1994). Such an approach was employed in the recent method of Doherty et al. (2005), where relative isotope abundance was estimated from peptides containing two or three molecules of leucine. Depending on how precursor enrichment is estimated, the approach by which enrichment in the labeled protein is determined also differs among these published methods for measurement of protein turnover.
Recently, we developed a general method for determination of protein synthesis using deuterated water and mass isotopomer distribution analysis (Xiao et al., 2008a). The approach is based on the concept that the mass isotopomer distribution in the newly synthesized protein due to isotope incorporation is a concatenation of 13C isotopomers from 13C natural abundance with 2H isotopomers. Precursor and product enrichment can be determined by comparing the labeled and unlabeled spectra. The fractional synthesis rate (FSR) is provided by the ratio of the old and new isotopomer distributions. Our method differs from the methods of Busch et al. (2006) and Previs et al. (2004) in that enrichment of 2H in amino acids and in the peptide is determined from the mass isotopomer distribution and average mass of the peptide. Once precursor enrichment is known, protein synthesis is determined from the isotopomer distribution and average mass of the peptide, similar to the approach of Vogt et al. (2003, 2005). This method for measuring protein turnover rate has since been improved further. We extended the application of mass isotopomer distribution analysis to in vitro research using 15N labeling to illustrate the handling of spectral data when precursor enrichment is high, resulting in sufficient mass shift in the new protein.
Since the change in average mass (mass shift) of a new peptide is given by the product of the number of nitrogen atoms (NN) in the peptide and the average enrichment of 15N in the amino acids (p′), the 15N enrichment can be estimated from the mass shift by curve fitting (Fig. 5.1), and the expected isotopomer distribution of the new peptide can be generated by the concatenation function. FSR can be calculated by multiple linear regression analysis of the observed peptide spectrum on the expected new and the old (unlabeled) spectra. Since the method for determining precursor and product enrichment dictates the nature of the isotope and its infusion as well as the sensitivity and precision in estimating protein turnover of proteins with widely different half-lives, the approach described for use of low enrichment of deuterated water (Xiao et al., 2008a) and 15N amino acids should be generally applicable in most in vivo and in vitro studies of quantitative proteomics. The isotopomer distribution of a peptide is the concatenation of isotopomers from stable isotope species. For example, isotopomer distribution of a protein synthesized in the presence of 15N amino acids is the result of concatenation of the isotopomers contributed by natural abundance of 13C and the isotopomers contributed by the 15N in amino acids. When the natural abundance of the stable isotope species is low (e.g., 15N and deuterium), the mass
[Figure 5.1 plot: molar fraction (y-axis, 0.0 to 0.4) versus isotopomer (x-axis, 0 to 18) for the 13C, 15N, and concatenated distributions; the means of the curves are marked NCp, NNp′, and NCp + NNp′.]
Figure 5.1. The isotope envelopes of three theoretical mass isotopomer distributions showing the effect of concatenation. The first curve (◆) is the distribution of 13C isotopomers calculated from a binomial distribution assuming the carbon number to be 94 and the natural enrichment of 13C to be 0.0111. The second curve (■) represents the distribution of 15N isotopomers with a nitrogen number of 21 and an enrichment of 0.333. The third curve (▲) is the concatenation of these two distributions, representing the newly synthesized protein after incorporation of 15N. The sum of molar fractions for each curve is 1. Reprinted with permission from Zhao et al. (2008).
shift of the isotope envelope formed by the isotopomers can be shown to be a function of N, the number of positions that can be substituted, and p, the enrichment of the isotope in question. The observed spectrum of a peptide (both preexisting and newly synthesized) is then a linear combination of the isotopomers of the natural (unlabeled) and the expected labeled peptides, and the ratio of labeled to total isotopomers provides the molar fraction of the newly synthesized protein. Since the presence of 13C and the presence of 15N in an amino acid are independent of each other and are not mutually exclusive, the spectrum of the labeled peptide can be constructed using the concatenation operation on the 13C and 15N isotopomers. The 13C isotopomer distribution of a hypothetical natural peptide and the 15N isotopomer distribution of a labeled peptide each show a typical binomial distribution (Fig. 5.1). The mean and variance of the 13C isotopomer distribution around the monoisotopic peak (m0) are NCp and NCp(1 − p), respectively, where NC is the number of carbon atoms in the peptide and p is the natural abundance of 13C. Similarly, the mean and variance of the 15N isotopomer distribution are NNp′ and NNp′(1 − p′), respectively, where NN is the number of nitrogen atoms and p′ is the average 15N enrichment. The isotopomer distribution after concatenation is also shown in Figure 5.1.
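The concatenation operation can be sketched numerically with the values quoted in the caption of Figure 5.1 (94 carbons at 0.0111 and 21 nitrogens at 0.333). The sketch assumes that concatenation of two independent isotopomer distributions corresponds to a discrete convolution; that is a reading of the text, not a formula the chapter states, and the function names are hypothetical.

```python
from math import comb


def binomial_distribution(n, p):
    """Molar fractions of isotopomers m0..mn for n sites at enrichment p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def concatenate(dist_a, dist_b):
    """Combine two independent isotopomer distributions (discrete convolution)."""
    out = [0.0] * (len(dist_a) + len(dist_b) - 1)
    for i, a in enumerate(dist_a):
        for j, b in enumerate(dist_b):
            out[i + j] += a * b
    return out


def mean(dist):
    """Mean isotopomer number, measured from the monoisotopic peak m0."""
    return sum(k * f for k, f in enumerate(dist))


def variance(dist):
    m = mean(dist)
    return sum((k - m) ** 2 * f for k, f in enumerate(dist))


N_C, P_C = 94, 0.0111  # carbon number and natural 13C abundance (Figure 5.1)
N_N, P_N = 21, 0.333   # nitrogen number and 15N enrichment (Figure 5.1)

c13 = binomial_distribution(N_C, P_C)  # unlabeled (old) peptide
n15 = binomial_distribution(N_N, P_N)  # 15N contribution alone
labeled = concatenate(c13, n15)        # expected newly synthesized peptide

# The shift of the envelope mean depends only on the 15N term, so the
# enrichment can be recovered from the shift once the nitrogen count is known.
mass_shift = mean(labeled) - mean(c13)
estimated_enrichment = mass_shift / N_N
```

In this idealized case the recovered enrichment equals the input value. With measured spectra the mass shift would carry noise, which is presumably why the text speaks of curve fitting rather than a single division.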
The mean and variance of the new distribution are given by (NCp + NNp′) and [NCp(1 − p) + NNp′(1 − p′)], respectively. The mass shift of the new distribution is [(NCp + NNp′) − NCp] = NNp′. Therefore, the mass shift is a function of the 15N isotopomer distribution. Because NN can be determined from the known peptide sequence, the 15N enrichment (p′) can be determined by curve fitting (mass shift/NN). Once the natural isotopomers and the 15N isotopomers are known, the isotopomers of the 13C- and 15N-labeled peptide can be constructed using the concatenation operation. The resultant distribution is the expected distribution of the new peptide and can be used to determine the newly synthesized fraction.
This method (Zhao et al., 2009a, 2009b) has been successfully applied to cancer research in the hope of discovering biomarkers and drug targets (Xiao et al., 2008a; Zhao et al., 2009b). Understanding the principles that govern cancer responses to small molecule metabolic inhibitors and to drugs targeting specific signaling pathways remains a major challenge in the development of chemotherapeutic agents. The alteration of intracellular signaling pathways by extrinsic growth factors or by mutation is currently the basic premise on which development of chemopreventive and chemotherapeutic targets is based. In a recent study of pancreatic cancer, we used this method to study protein dynamics in MIA PaCa-2 pancreatic cancer cells in response to treatment with oxythiamine (OT), a transketolase inhibitor in the pentose phosphate cycle of the glycolysis pathway. In that study, we investigated the effect of inhibition of the transketolase pathway on signaling pathways in MIA PaCa-2 cancer cells using the newly developed proteomic techniques described above (Fig. 5.2). MIA PaCa-2 cells were cultured in media containing an algal 15N amino acid mixture at 50% enrichment, with and without OT, to determine protein expression and synthesis.
Analysis of cell lysates using 2-DE-MALDI-TOF/TOF MS identified 12 phosphoproteins that were significantly suppressed by OT treatment. Many of these proteins are involved in the regulation of cell cycle activities and apoptosis. Among the proteins identified, expression of the phosphorylated heat shock protein 27 (Hsp27) was dramatically inhibited by OT treatment, while the level of its total protein remained unchanged. Hsp27 expression and phosphorylation are known to be associated with drug resistance and cancer cell survival. The changes in phosphorylation of key proteins of cancer proliferation and survival suggest that protein phosphorylation is the confluence of the effects of OT on metabolic and signaling pathways. This study revealed that a small molecule metabolic inhibitor such as OT, through its metabolic action, can result in specific and nonspecific inhibition of translational and posttranslational changes in signaling molecules. Our understanding of the cellular metabolic network would suggest that the inhibition of a single metabolic pathway invariably results in metabolic inefficiency of other metabolic reactions in the network (Lee, 2006). Inhibition of the transketolase reaction may cause a global deficiency in high-energy phosphate bonds, resulting in a nonspecific decrease in phosphorylation of proteins. Depending on the enzymatic properties of these kinases, the decrease in
[Figure 5.2 plot: two mass spectra, panels (a) and (b), showing intensity (a.u.) versus m/z; a 4 Da mass shift is marked.]
Figure 5.2. Mass spectra of the 960 m/z fragment from Hsp27 in lysates of cells grown in the presence of (a) natural amino acids and (b) a 50% enriched 15N algal amino acid mixture. The control spectrum in panel a shows the binomial distribution of isotopomer peaks, largely due to the natural abundance of 13C. Incorporation of 15N resulted in an obvious mass shift in the isotopomer distribution. The synthesis rate of the protein was determined from the isotopomer distributions of these two spectra.
phosphorylation may appear to be specific, as illustrated by the phosphorylation of Hsp27. We found that phosphoproteins that are sensitive to OT inhibition have high turnover rates (>60%), suggesting that the function of these proteins is regulated by both protein synthesis and protein degradation. The intracellular concentration of a protein depends on its synthesis and breakdown. Therefore, protein concentration can be controlled by changes in either its synthesis or its degradation. It is generally held that the turnover rate of a protein must be high in order for it to change its concentration quickly inside a cell. Our finding
of relatively high turnover rates in these signaling proteins is in agreement with this view and suggests that the determination of protein turnover may be useful for identifying signaling targets in chemotherapy. As reported in our previous study, in contrast with SILAC, labeling of proteins with a low-enrichment 15N amino acid mixture (mSILAC) provides a characteristic isotope envelope in the peptide spectrum, which can be used for the determination of the newly synthesized fraction (Zhao et al., 2009a, 2009b). The importance of a secreted protein can be ranked by its synthesis rate. In another study, we identified secreted proteins from cultured MIA PaCa-2 pancreatic cancer cells and their response to OT treatment using 15N amino acids and serum-free media (Xiao et al., 2010). Using two-dimensional gel electrophoresis (2-DE), we detected 14 differentially expressed proteins in media of control (untreated) and OT-treated cells. Among the proteins identified, tissue inhibitor of metalloproteases 1 (TIMP 1) and cytokeratin 10, two known pancreatic cancer biomarkers, were heavily labeled with 15N, and their expression was inhibited by OT treatment. The results from ELISA assays with human pancreatic cancer sera showed that TIMP 1, although having a weaker signal than CA19-9, has the characteristics of a biomarker of pancreatic cancer. Thus labeling of proteins with 15N amino acids in serum-free medium is a potentially useful approach to serum biomarker discovery using cell culture. In summary, the newly developed technology provides a general method for determination of protein synthesis rate using labeling of amino acids with deuterium or 15N at low enrichment (Xiao et al., 2008a; Zhao et al., 2009a). Modified SILAC can measure protein synthesis rate quantitatively based on mass isotopomer distribution analysis (MIDA) of the newly synthesized protein (Xiao et al., 2008a; Zhao et al., 2009a).
Once precursor enrichment is known, protein synthesis is determined from the isotopomer distribution. This method obviates the need for a 100% newly synthesized protein as a reference, as required in the methods of Vogt et al. (2003, 2005). The concatenation function provides an ideal 100% labeled spectrum, and multiple regression analysis uses all the information from the mass spectrum. Our mathematical algorithm represents a major improvement in the calculation of protein synthesis rate, permitting the use of isotope labeling of protein through the pathways of amino acid metabolism with low-cost isotopes (Lee et al., 1992, 1994; Xiao et al., 2008a, 2008b; Zhao et al., 2009a, 2009b).
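The regression step summarized above can be illustrated with a short sketch. It assumes the observed envelope is modeled as a weighted sum of the old (unlabeled) and expected new (100% labeled) spectra, fits the two weights by ordinary least squares without an intercept, and takes the newly synthesized fraction as the new weight divided by the sum of the weights. The observed spectrum here is synthetic, built with a hypothetical fraction of 0.6 and the peptide parameters from the Figure 5.1 caption. The function names are hypothetical, and the published algorithm may differ in detail.

```python
from math import comb


def binomial_distribution(n, p):
    """Molar fractions of isotopomers m0..mn for n sites at enrichment p."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def concatenate(dist_a, dist_b):
    """Combine two independent isotopomer distributions (discrete convolution)."""
    out = [0.0] * (len(dist_a) + len(dist_b) - 1)
    for i, a in enumerate(dist_a):
        for j, b in enumerate(dist_b):
            out[i + j] += a * b
    return out


def fit_old_and_new(observed, old, new):
    """Least-squares weights (a, b) for observed ~ a * old + b * new."""
    s_oo = sum(x * x for x in old)
    s_nn = sum(x * x for x in new)
    s_on = sum(x * y for x, y in zip(old, new))
    s_oy = sum(x * y for x, y in zip(old, observed))
    s_ny = sum(x * y for x, y in zip(new, observed))
    det = s_oo * s_nn - s_on**2  # nonzero when the two spectra differ in shape
    a = (s_oy * s_nn - s_ny * s_on) / det
    b = (s_ny * s_oo - s_oy * s_on) / det
    return a, b


def newly_synthesized_fraction(observed, old, new):
    """Share of the fitted signal attributed to the labeled (new) spectrum."""
    a, b = fit_old_and_new(observed, old, new)
    return b / (a + b)


# Expected spectra for a peptide with 94 carbons and 21 nitrogens (Figure 5.1).
c13 = binomial_distribution(94, 0.0111)
new = concatenate(c13, binomial_distribution(21, 0.333))
old = c13 + [0.0] * (len(new) - len(c13))  # pad to the length of the new spectrum

# Synthetic "observed" spectrum: 60% new protein, arbitrary intensity scale.
TRUE_FRACTION = 0.6
observed = [1.0e4 * ((1 - TRUE_FRACTION) * o + TRUE_FRACTION * n)
            for o, n in zip(old, new)]

fraction = newly_synthesized_fraction(observed, old, new)
```

Because every isotopomer peak enters the fit, the estimate does not rest on any single peak ratio, which matches the text's remark that the regression uses all the information in the spectrum. With real spectra, noise and overlapping envelopes would make the fit approximate rather than exact.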
5.3 FUTURE DIRECTION
The recent development of mass spectrometry–based strategies for absolute protein quantification offers great opportunity for biomarker discovery. While the prospects of this technology are exciting and promising, the current methodology is far from perfect. First, most current methods require complicated sample preparation, such as immuno-subtraction, multidimensional LC separation, immunoaffinity and solid phase extraction, to enhance the analytical
dynamic range and detection sensitivity. To establish a high-throughput pipeline, we should ideally have a one-step preparation. Second, few useful and validated biomarkers have emerged from these methods, because low-abundance biomarkers are masked by large quantities of abundant proteins, especially in plasma samples. There is much room for improvement in sensitivity.
5.4 QUESTIONS AND ANSWERS
Q1. Which dimension is missed in current proteomics research?
Q2. What are the advantages of mSILAC over classical SILAC?
Q3. What are the limitations of this technique?
A1. Current research, which relies on information obtained from cells at steady state, focuses on comparing gene expression either between two states (disease vs. healthy) or between two conditions (treatment vs. control). As demonstrated by recent studies, protein turnover (protein dynamics) can truly reflect cellular physiology and function; this is the dimension missed in current proteomics studies in biomedical research.
A2. mSILAC offers three advantages over conventional SILAC. (1) More information on protein dynamics can be obtained. (2) It is fast and economical: the technique requires an experimental time of 2 days and low concentrations of stable isotope (e.g., as low as 5–10% stable isotope enrichment), whereas SILAC requires an experimental time of 1 week and 100% enrichment of the stable isotope used for the experiment. (3) Unique algorithms deal with overlapping isotope envelopes (as shown in Figure 5.2).
A3. This technique must be improved for the analysis of low-abundance proteins.
5.5 ACKNOWLEDGMENTS
This work is fully supported by grants awarded to Dr. Gary Guishan Xiao (GGX) from the Bone Biology Program of the Cancer and Smoking Related Disease Research Program and the Nebraska Tobacco Settlement Biomedical Research Program (LB692, LB595, and LB506).
5.6 REFERENCES
Busch R, Kim YK, Neese RA, Schade-Serin V, Collins M, Awada M, Gardner JL, Beysen C, Marino ME, Misell LM, Hellerstein MK. (2006). Measurement of protein turnover rates by heavy water labeling of nonessential amino acids. Biochim Biophys Acta 1760(5):730–44.
Cargile BJ, Bundy JL, Grunden AM, Stephenson JL Jr. (2004). Synthesis/degradation ratio mass spectrometry for measuring relative dynamic protein turnover. Anal Chem 76(1):86–97.
Doherty MK, Whitehead C, McCormack H, Gaskell SJ, Beynon RJ. (2005). Proteome dynamics in complex organisms: using stable isotopes to monitor individual protein turnover rates. Proteomics 5(2):522–33.
Fern EB, Garlick PJ. (1974). The specific radioactivity of the tissue free amino acid pool as a basis for measuring the rate of protein synthesis in the rat in vivo. Biochem J 142(2):413–19.
Gan CS, Chong PK, Pham TK, Wright PC. (2007). Technical, experimental, and biological variations in isobaric tags for relative and absolute quantitation (iTRAQ). J Proteome Res 6(2):821–27.
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. (1999). Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10):994–99.
Lee WN, Bassilian S, Ajie HO, Schoeller DA, Edmond J, Bergner EA, Byerley LO. (1994). In vivo measurement of fatty acids and cholesterol synthesis using D2O and mass isotopomer analysis. Am J Physiol 266(5 pt 1):E699–708.
Lee WN, Bergner EA, Guo ZK. (1992). Mass isotopomer pattern and precursor-product relationship. Biol Mass Spectrom 21(2):114–22.
Lee WN. (2006). Characterizing phenotype with tracer based metabolomics. Metabolomics 2(1):31–39, 5–20.
Papageorgopoulos C, Caldwell K, Shackleton C, Schweingrubber H, Hellerstein MK. (1999). Measuring protein synthesis by mass isotopomer distribution analysis (MIDA). Anal Biochem 267(1):1–16.
Papageorgopoulos C, Caldwell K, Schweingrubber H, Neese RA, Shackleton CH, Hellerstein M. (2002). Measuring synthesis rates of muscle creatine kinase and myosin with stable isotopes and mass spectrometry. Anal Biochem 309(1):1–10.
Pratt JM, Petty J, Riba-Garcia I, Robertson DH, Gaskell SJ, Oliver SG, Beynon RJ. (2002). Dynamics of protein turnover, a missing dimension in proteomics. Mol Cell Proteomics 1(8):579–91.
Previs SF, Fatica R, Chandramouli V, Alexander JC, Brunengraber H, Landau BR. (2004). Quantifying rates of protein synthesis in humans by use of 2H2O: application to patients with end-stage renal disease. Am J Physiol Endocrinol Metab 286(4):E665–72.
Vogt JA, Hunzinger C, Schroer K, Hölzer K, Bauer A, Schrattenholz A, Cahill MA, Schillo S, Schwall G, Stegmann W, Albuszies G. (2005). Determination of fractional synthesis rates of mouse hepatic proteins via metabolic 13C-labeling, MALDI-TOF MS and analysis of relative isotopologue abundances using average masses. Anal Chem 77(7):2034–42.
Vogt JA, Schroer K, Hölzer K, Hunzinger C, Klemm M, Biefang-Arndt K, Schillo S, Cahill MA, Schrattenholz A, Matthies H, Stegmann W. (2003). Protein abundance quantification in embryonic stem cells using incomplete metabolic labelling with 15N amino acids, matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry, and analysis of relative isotopologue abundances of peptides. Rapid Commun Mass Spectrom 17(12):1273–82.
Xiao GG, Garg M, Lim S, Wong D, Go VL, Lee WN. (2008a). Determination of protein synthesis in vivo using labeling from deuterated water and analysis of MALDI-TOF spectrum. J Appl Physiol 104(3):828–36.
Xiao GG, Recker RR, Deng HW. (2008b). Recent advances in proteomics and cancer biomarker discovery. Clin Med Oncol 2:63–72.
Xiao J, Lee WN, Zhao Y, Cao R, Go VL, Recker RR, Wang Q, Xiao GG. (2010). Profiling pancreatic cancer-secreted proteome using 15N amino acids and serum-free media. Pancreas 39(1):e17–23.
Zhao Y, Lee WN, Lim S, Go VL, Xiao J, Cao R, Zhang H, Recker RR, Xiao GG. (2009a). Quantitative proteomics: measuring protein synthesis using 15N amino acid labeling in pancreatic cancer cells. Anal Chem 81(2):764–71.
Zhao Y, Lee WN, Xiao GG. (2009b). Quantitative proteomics and biomarker discovery in human cancer. Expert Rev Proteomics 6(2):115–18.
CHAPTER 6
Bioinformatics Tools for Gene Function Prediction
YAN CUI
Contents
6.1 Gene Ontology: Description of Gene Function with Controlled and Structured Vocabulary
6.2 Sequence-Based Function Prediction
  6.2.1 Annotation Transfer by Sequence Homology
  6.2.2 Phylogenomic Methods for Function Prediction
  6.2.3 Function Prediction Using Sequence Motif
6.3 Structure-Based Function Prediction
  6.3.1 Protein Structure Comparison for Function Prediction
  6.3.2 Predicting Functional Sites on Protein Surface
6.4 Function Prediction Using Integrated Data
  6.4.1 Types of Data for Facilitating Function Prediction
  6.4.2 Integrative Methods for Function Prediction
6.5 Questions and Answers
6.6 References
6.1 GENE ONTOLOGY: DESCRIPTION OF GENE FUNCTION WITH CONTROLLED AND STRUCTURED VOCABULARY
Traditionally, gene functions are described in natural languages that are difficult for computers to process. A standardized and computer-readable system for gene function description is needed for automated annotation of the huge amounts of sequence and structure data produced by the genome sequencing projects and structural genomics projects. Gene ontology (GO)
[Figure 6.1 diagram: part of the molecular function DAG drawn over seven levels, from the root term "Molecular function" through "Binding" and "Catalytic activity" down to specific terms such as "Protein binding", "DNA binding", "Transcription factor activity", "Peptidase activity", "Serine-type endopeptidase activity", "Chymotrypsin activity", "Granzyme B activity", and "Trypsin activity". Annotated genes and evidence codes shown: GZMB (IEA, TAS), MYC (IPI, TAS), MYBL2 (TAS), SOX4 (TAS).]
Figure 6.1. Gene function annotation with gene ontology.
(Ashburner et al., 2000) is the most widely used gene function classification system. It uses well-structured, controlled vocabularies to describe gene function. Gene ontology includes three independent ontologies: biological process, molecular function, and cellular component. Each ontology has a hierarchical structure, which can be visualized as a directed acyclic graph (DAG). A small part of the molecular function DAG is shown in Figure 6.1; the whole DAG is too big to be displayed here. The broad function terms are near the root, and more specific terms are toward the bottom of the DAG. The functional annotations of four genes are shown in the figure. The rectangular nodes represent the GO terms that are associated with the genes. Each annotation has an evidence code (in parentheses) to show how the annotation was obtained. There are five classes of evidence codes (Table 6.1). Detailed descriptions of the evidence codes can be found at the gene ontology website (www.geneontology.org/GO.evidence.shtml). The figure was generated by GeneInfoViz (http://genenet2.uthsc.edu), a web server for functional analysis
c06.indd 94
1/12/2011 9:44:06 AM
SEQUENCE-BASED FUNCTION PREDICTION
95
TABLE 6.1. Gene Ontology Evidence Codes Experimental evidence codes • Inferred from experiment (EXP) • Inferred from direct assay (IDA) • Inferred from physical interaction (IPI) • Inferred from mutant phenotype (IMP) • Inferred from genetic interaction (IGI) • Inferred from expression pattern (IEP) Computational analysis evidence codes • Inferred from sequence or structural similarity (ISS) • Inferred from sequence orthology (ISO) • Inferred from sequence (ISA) • Inferred from sequence model (ISM) • Inferred from genomic context (IGC) • inferred from reviewed computational analysis (RCA) Author statement evidence codes • Traceable author statement (TAS) • Nontraceable author statement (NAS) Curatorial statement codes • Inferred by curator (IC) • No biological data available (ND) Automatically assigned evidence code • Inferred from electronic annotation (IEA)
of gene lists (Zhou and Cui, 2004). Many tools have been developed for searching and browsing GO or using GO for annotation and data analysis (www.geneontology.org/GO.tools.shtml?all). All of the software tools we introduce in this chapter use gene ontology as the primary system for function annotation.
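Because each ontology is a DAG, an annotation to a specific term also implies annotation to every broader term on the paths to the root. The following minimal sketch illustrates that idea on a toy fragment of the molecular function ontology. The term names follow Figure 6.1, but the parent links in `PARENTS` are assumptions made for illustration and are not taken from the GO files.

```python
# Toy fragment of the molecular function ontology.
# Each term maps to its list of parent terms; the edges are illustrative assumptions.
PARENTS = {
    "molecular function": [],
    "binding": ["molecular function"],
    "catalytic activity": ["molecular function"],
    "hydrolase activity": ["catalytic activity"],
    "peptidase activity": ["hydrolase activity"],
    "endopeptidase activity": ["peptidase activity"],
    "serine-type peptidase activity": ["peptidase activity"],
    "serine-type endopeptidase activity": [
        "endopeptidase activity",
        "serine-type peptidase activity",
    ],
}


def ancestors(term, parents=PARENTS):
    """Return all broader terms reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def implied_terms(annotated_terms, parents=PARENTS):
    """Return the annotated terms together with every ancestor they imply."""
    implied = set(annotated_terms)
    for term in annotated_terms:
        implied |= ancestors(term, parents)
    return implied


if __name__ == "__main__":
    for name in sorted(implied_terms(["serine-type endopeptidase activity"])):
        print(name)
```

A term with two parents, such as the serine-type endopeptidase term here, is what makes the structure a DAG rather than a tree; the `seen` set keeps shared ancestors from being visited twice.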
6.2 SEQUENCE-BASED FUNCTION PREDICTION
6.2.1 Annotation Transfer by Sequence Homology
In comparative genomics, homology refers to a significant similarity between DNA or protein sequences in different species. If the similarity between two sequences is high, they may descend from a common ancestral sequence and have inherited identical or similar functions. Therefore, the functional annotations can be transferred between homologous genes. Gene/protein sequences from newly sequenced genomes are often searched against databases to identify homologous genes/proteins for annotation transfer. The most commonly used methods for detecting sequence similarity include PSI-BLAST (Altschul et al., 1997), HMMER (Bateman et al., 1999), and SAM (Karplus et al., 2005).
These sequence alignment programs are often integrated with DNA/protein sequence databases such as UniProt (Bairoch et al., 2005) to provide web-based services for sequence analysis (Johnson et al., 2008). In this section, I introduce two software tools for homology-based function prediction: GOtcha (Martin et al., 2004) and Blast2GO (Conesa et al., 2005; Gotz et al., 2008). These tools take DNA or protein sequences as input and assign GO terms to them.
GOtcha is a web-based tool for gene function prediction and annotation. The user submits a query protein sequence. GOtcha performs a BLAST search against many genomes to identify homologous proteins. GO terms associated with the homologous proteins are evaluated by scores based on the E-values of the BLAST hits (Martin et al., 2004). An empirical relationship between the scores and prediction accuracy is established by applying the scoring scheme to annotated sequences from the Swiss-Prot database. Thus a percentage indicating the likelihood of correctness can be assigned to each predicted annotation (Martin et al., 2004). Figure 6.2 shows the GOtcha annotation for the Hox-D9 protein sequence (human homeobox protein). Only the biological process annotations are shown. The DAG in the upper panel includes the GO terms assigned to the query sequence. Each GO term (node) has a percentage indicating the likelihood of correctness of the annotation. The nodes are colored by the likelihood percentage. The table in the lower panel shows the estimated likelihood percentage, the score, and the standard deviation (SD) for each GO term.
Blast2GO is a program for high-throughput functional annotation. It is freely available from the Blast2GO website (www.blast2go.org) via Java Web Start. Blast2GO annotates DNA/protein sequences based on BLAST search outputs and can process and manage a large number of sequences. Sequence annotation is run in three steps.
The first is the BLAST step, in which the query sequences are searched against Swiss-Prot or NCBI databases such as nr (nonredundant) and RefSeq (Maglott et al., 2000; Pruitt et al., 2007). The second step is gene ontology mapping. In this step, the GO terms associated with the homologous sequences of each query sequence are retrieved. The third step is annotation, in which the GO terms retrieved in the mapping step are used to annotate the query sequences according to an annotation rule. The annotation rule is designed to assign the most specific and reliable GO terms to a query sequence by taking into account the sequence similarity, the GO evidence code, and the structure of the GO DAG (Conesa et al., 2005). Blast2GO also provides options for retrieving and mapping annotations from other functional databases, such as InterPro (Hunter et al., 2009) and KEGG (Ogata et al., 1999). The GO DAG for each annotated query sequence can be visualized. A combined graph for all query sequences can also be displayed (Fig. 6.3). Node color represents the annotation intensity, which is determined by the number of sequences annotated at the node and its child nodes, and the distances to the child nodes in the GO DAG.
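The general idea behind both tools, scoring candidate GO terms from the BLAST hits that carry them, can be sketched as follows. This is a simplified illustration only, not the published GOtcha scoring scheme or the Blast2GO annotation rule: the weight `-log10(E-value)`, the cap of 300, and the toy terms `root`, `A`, `B`, and `C` are all assumptions made for the example.

```python
import math
from collections import defaultdict


def ancestors_of(term, parents):
    """Return every ancestor of `term`; `parents` maps a term to its parent terms."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def hit_weight(e_value, cap=300.0):
    """Turn a BLAST E-value into a non-negative weight (assumed form: -log10 E, capped)."""
    if e_value <= 0.0:
        return cap  # E-values reported as 0 would otherwise give an infinite weight
    return min(cap, max(0.0, -math.log10(e_value)))


def score_terms(hits, parents):
    """Sum hit weights per GO term, crediting each term's ancestors as well.

    `hits` is a list of (e_value, go_terms) pairs, one pair per homologous sequence.
    Each hit contributes at most once to any given term.
    """
    scores = defaultdict(float)
    for e_value, terms in hits:
        weight = hit_weight(e_value)
        credited = set()
        for term in terms:
            credited.add(term)
            credited |= ancestors_of(term, parents)
        for term in credited:
            scores[term] += weight
    return dict(scores)


if __name__ == "__main__":
    toy_parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["root"]}
    toy_hits = [(1e-50, ["B"]), (1e-10, ["C"])]
    for term, score in sorted(score_terms(toy_hits, toy_parents).items()):
        print(term, round(score, 1))
```

Crediting ancestors means that broad terms accumulate support from every hit, while specific terms are supported only by the hits that carry them, which is one reason a specific prediction usually has a lower score than its parent terms.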
Figure 6.2. GOtcha assigns GO annotations to a query protein. (Figure not reproduced: a DAG of biological process GO terms, from the root GO:0008150 down to GO:0006351, with each node labeled by a likelihood of 27%, 55%, or 97%.)
Figure 6.3. (See color insert.) Function annotation with Blast2GO. In this example, 10 sequences are annotated. The upper panel shows the annotations transferred from homologous sequences. The lower panel shows the part of GO DAG containing the annotation terms assigned to the query sequences. Node color represents the annotation intensity.
6.2.2 Phylogenomic Methods for Function Prediction
In homology-based annotation transfer, the difference between two types of homologous sequences, orthologs and paralogs, must be taken into consideration (Friedberg, 2006). Orthologs are homologous sequences derived from a common ancestral sequence through speciation. Paralogs are homologous sequences derived from a duplication event, which is often associated with processes of functional divergence (Gabaldon et al., 2009). Therefore, orthologs typically perform identical or similar functions, while paralogs are more likely to evolve new functions (Kuzniar et al., 2008). The top BLAST hits could be either orthologous or paralogous to the query sequence. Homology-based annotation transfer may lead to two or more sets of different annotations if both orthologous and paralogous sequences are among the top BLAST hits. Annotations should be transferred from the orthologous sequences even when a paralogous sequence has the best similarity score. Phylogenomic methods such as RIO (Zmasek and Eddy, 2002), Orthostrapper (Storm and Sonnhammer, 2002), and SIFTER (Engelhardt et al., 2005) infer the evolutionary history of the homologous sequences and estimate the probability of orthology. They can therefore facilitate annotation transfer by distinguishing between orthologous and paralogous relationships.
6.2.3 Function Prediction Using Sequence Motif
Sequence motifs (or signatures) refer to recurring sequence patterns that are relatively short but functionally important (Bork and Koonin, 1996; D’Haeseleer, 2006). Members of a protein family often share common sequence motifs. Nonhomologous proteins may also share a common motif (and therefore a common function), even though their overall sequence similarity is low. PROSITE (http://ca.expasy.org/prosite) is a database of protein families and domains (Hulo et al., 2008; Sigrist et al., 2010). It currently contains patterns and profiles for more than a thousand protein families or domains. Figure 6.4 shows a protein sequence motif, the ATP P2X receptors signature (Kennedy and Leff, 1995; Surprenant et al., 1995). The consensus pattern of this motif is: G-G-x-[LIVM]-G-[LIVM]-x-[IV]-x-W-x-C-[DN]-L-D-x(5)-C-x-P-x-Y-x-F.
Figure 6.4. ATP P2X receptors signature (PROSITE ID: PDOC00932). The height of a letter or a position represents the degree of sequence conservation.
InterPro (www.ebi.ac.uk/interpro) (Hunter et al., 2009) integrates motifs representing protein domains, families, and functional sites from many databases, including Gene3D (Yeats et al., 2008), PANTHER (Mi et al., 2007), Pfam (Finn et al., 2008), PIRSF (Wu et al., 2004), PRINTS (Attwood et al., 2003), ProDom (Bru et al., 2005), PROSITE (Hulo et al., 2008; Sigrist et al., 2010), SMART (Letunic et al., 2006), SUPERFAMILY (Wilson et al., 2007), and TIGRFAMs (Haft et al., 2003). InterPro is accessible through InterProScan (Quevillon et al., 2005), a set of web-based computational tools for protein signature recognition, such as BlastProDom (Zdobnov and Apweiler, 2001), FingerPRINTScan (Scordis et al., 1999), SignalPHMM (Dyrløv Bendtsen et al., 2004), Hmmpfam (http://hmmer.janelia.org), and TMHMM (Sonnhammer et al., 1998). Query sequences can be submitted through the web interface of InterProScan (www.ebi.ac.uk/Tools/InterProScan). InterProScan runs the signature recognition programs to assign annotations to the query sequences.
6.3 STRUCTURE-BASED FUNCTION PREDICTION
6.3.1 Protein Structure Comparison for Function Prediction
Protein structure comparison can often detect remote evolutionary relationships even in the absence of significant sequence similarity (Gherardini and Helmer-Citterich, 2008). Frequently used global structural comparison tools include CE (Shindyalov and Bourne, 2001), DaliLite (Holm et al., 2008), FATCAT (Ye and Godzik, 2004), and Matras (Kawabata, 2003); for a complete list, see Gherardini and Helmer-Citterich (2008). These programs search protein structure databases such as PDB (Berman et al., 2000) for similar structures. Although all of these structure comparison methods can assist function prediction by detecting homologous structures, this section focuses on AnnoLite (Marti-Renom et al., 2007), a computational tool specifically designed for structure-based annotation transfer. AnnoLite (http://salilab.org/DBAli/?page=tools&action=f_annolitechain) predicts protein function by transferring known annotations from homologous structures to the query structure. The criteria that AnnoLite uses to identify homologous structures are a minimum of 75% of Cα atoms aligned within 4 Å and a maximum Cα RMSD of 4 Å after superposition of the two structures. The hits are then filtered by sequence identity. An enrichment test (Fisher’s exact test) is performed for the functional annotations of the homologous chains, with all annotated chains in the PDB as the background group. Only the annotations with a significant p-value are assigned to the query structure (Marti-Renom et al., 2007). Many types of annotations, including the CATH (Greene et al., 2007) and SCOP (Lo Conte et al., 2000) fold assignments, InterPro and Pfam (Finn et al., 2008) entries, EC numbers, and GO terms, are integrated by AnnoLite.
Figure 6.5. AnnoLite annotates a protein (PDB ID: 2azw) by searching the DBAli database for precalculated alignments of its structure.
6.3.2 Predicting Functional Sites on Protein Surface
Like sequence motifs, structure patterns are often associated with specific biological functions and therefore can be used for protein function prediction even in the absence of overall structure similarity. Several methods have been developed for predicting functional sites on the protein surface. Q-SiteFinder (www.modelling.leeds.ac.uk/qsitefinder) uses an energy-based method to predict ligand binding sites on the protein surface (Laurie and Jackson, 2005). PINTS (Patterns in Nonhomologous Tertiary Structures; www.russell.embl.de/pints) is a web-based tool that enables the query of a protein structure against a database of patterns (Stark and Russell, 2003). HotPatch (http://hotpatch.mbi.ucla.edu) is a web server for automated detection of functional sites on protein structures (Pettit et al., 2007). It uses trained neural networks to recognize surface patches of unusual physicochemical properties and estimates the probability that the predicted structure patch overlaps with a functional site. Other resources for analyzing patterns in protein structure include the Catalytic Site Atlas (Porter et al., 2004), ConCavity (Capra et al., 2009), FINDSITE (Skolnick and Brylinski, 2009), PdbFun (Ausiello et al., 2005), PDBSiteScan (Ivanisenko et al., 2004), SuMo (Jambon et al., 2005), WebFEATURE (Liang et al., 2003), and PatchFinder (Nimrod et al., 2005).
6.4 FUNCTION PREDICTION USING INTEGRATED DATA
6.4.1 Types of Data for Facilitating Function Prediction
6.4.1.1 Protein–Protein Interaction Data
Large-scale protein–protein interaction (PPI) data have become available in recent years (von Mering et al., 2002). The interacting proteins form networks called protein interaction networks, in which each protein is represented by a node and each pair of interacting proteins is connected by an edge (Bork et al., 2004). Proteins that interact with each other are likely to be involved in the same biological processes (von Mering et al., 2002). This observation inspired many methods that seek to propagate functional annotations through the PPI network (Sharan et al., 2007). The simplest method considers only the immediate neighbors of an unannotated protein and assigns to it the functions that are most common among those neighbors (Schwikowski et al., 2000). More complex methods take the topology of the PPI network into consideration and use graph-theoretic and module-based methods for function prediction (see Sharan et al., 2007).
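The simplest neighbor-based method described above can be sketched in a few lines. This is a minimal illustration of majority voting among direct interaction partners, not a reimplementation of any published tool; the protein identifiers and function labels in the usage example are hypothetical.

```python
from collections import Counter


def predict_by_neighbors(protein, interactions, annotations, top_k=1):
    """Rank functions by how many direct interaction partners carry them.

    interactions: iterable of (protein_a, protein_b) pairs (undirected edges).
    annotations:  dict mapping an annotated protein to its list of functions.
    Returns up to `top_k` (function, neighbor_count) pairs, most common first;
    ties are broken alphabetically so the result is deterministic.
    """
    neighbors = set()
    for a, b in interactions:
        if a == protein:
            neighbors.add(b)
        elif b == protein:
            neighbors.add(a)
    neighbors.discard(protein)  # ignore self-interactions

    counts = Counter()
    for neighbor in neighbors:
        # set() ensures a neighbor votes once per function
        counts.update(set(annotations.get(neighbor, ())))

    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
    toy_annotations = {
        "P2": ["DNA repair"],
        "P3": ["DNA repair", "transport"],
        "P4": ["DNA repair"],
    }
    print(predict_by_neighbors("P1", toy_interactions, toy_annotations, top_k=2))
```

Because only immediate neighbors vote, a protein whose partners are all unannotated receives no prediction, which is one motivation for the topology-aware methods mentioned above.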
6.4.1.2 Gene Expression Data
The expression microarray is the first-generation high-throughput technology for gene expression analysis (Brown and Botstein, 1999). Recently developed deep-sequencing technology is revolutionizing transcriptome profiling. RNA sequencing (RNA-seq) uses deep-sequencing technology to provide more accurate measures of the levels of transcripts and their isoforms (Wang et al., 2009). The massive gene expression datasets from these high-throughput transcriptome profiling technologies provide very useful information for function prediction. Genes involved in the same biological processes tend to co-express across time and experimental conditions. The well-known guilt by association (GBA) method predicts gene function based on co-expression: if two genes co-express, they may share a common function (Walker et al., 1999). Supervised machine learning algorithms, such as support vector machines, have been applied to expression-based function prediction but with limited success (Brown et al., 2000).
6.4.1.3 Genetic Interaction Data
Genetic interactions may be revealed by the combined effects of two mutated genes that are not exhibited by either of them alone (Kelley and Ideker, 2005). Large-scale genetic interaction data have become available in recent years with the development of high-throughput methods, such as synthetic genetic arrays (SGAs) (Tong et al., 2001) and synthetic lethal analysis by microarrays (SLAMs) (Ooi et al., 2003). Genome-wide genetic interaction networks have been constructed from high-throughput screenings of genetic interactions. In a genetic interaction network, each node represents a gene, and two genes are connected if a synthetic genetic interaction between them is detected. Genes involved in similar biological processes tend to group together in genetic interaction networks (Costanzo et al., 2010). Therefore, analysis of genetic interaction networks may be taken as an approach to gene function prediction.
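The network representation just described can be built directly from a list of detected interaction pairs. The sketch below does that and then counts, for each pair of genes, how many interaction partners they have in common. Counting shared partners is used here only as one simple, assumed way of looking for genes that group together; the gene names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_network(pairs):
    """Build an undirected genetic interaction network as gene -> set of partners."""
    network = defaultdict(set)
    for gene_a, gene_b in pairs:
        if gene_a != gene_b:
            network[gene_a].add(gene_b)
            network[gene_b].add(gene_a)
    return dict(network)


def shared_partners(network):
    """Return {(gene1, gene2): count} for gene pairs with at least one common partner."""
    result = {}
    for gene1, gene2 in combinations(sorted(network), 2):
        common = network[gene1] & network[gene2]
        if common:
            result[(gene1, gene2)] = len(common)
    return result


if __name__ == "__main__":
    toy_pairs = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("C", "X")]
    toy_network = build_network(toy_pairs)
    for pair, count in sorted(shared_partners(toy_network).items()):
        print(pair, count)
```

In the toy data, genes A and B never interact with each other but share both of their partners, which is the kind of pattern a function prediction method might treat as evidence of related function.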
Genes with genetic interactions have been predicted to share function in the same pathway (Tong et al., 2001, 2004; Wong et al., 2004). It has also been shown that genetic interactions may bridge parallel pathways; therefore, congruent gene pairs that share synthetic lethal partners may function in the same pathway (Ye et al., 2005).
6.4.1.4 Phenotype Data
In recent years, large-scale gene knock-out and RNA interference, combined with high-throughput phenotyping, have produced huge amounts of phenotype data. Dedicated databases such as PhenomicDB (Groth et al., 2007) and the Mouse Phenome Database (Grubb et al., 2009) have been developed for storing, searching, and retrieving phenotype data. Phenotype data can also be used for gene function prediction. It has been observed that genes associated with similar or correlated phenotypes may share common functions (Piano et al., 2002). Text-clustering methods have been used to group genes based on their phenotype descriptions, and function annotations were then transferred among genes in the same cluster (Groth et al., 2008). The Online Mendelian Inheritance in Man (OMIM) database contains information on the associations between human genes and diseases (Hamosh et al., 2005). As of March 2010, OMIM contained a total of 2,729 phenotype descriptions with known molecular basis. Like the gene-phenotype data of other organisms, the disease association data can be used for gene function prediction.
6.4.2 Integrative Methods for Function Prediction
As discussed in the previous sections, many types of data can be used for gene/protein function prediction. In this section I introduce several algorithms that integrate multiple types of genomic evidence to predict gene/protein function. The performance of these algorithms was rigorously evaluated and compared in a recent contest for gene function prediction (Hughes and Roth, 2008). A collection of mouse data was assembled for the contest, including gene expression data, sequence patterns, protein interaction data, phenotype data, conservation patterns across species, and disease associations (Pena-Castillo et al., 2008).
6.4.2.1 Calibrated Ensembles of SVMs
The support vector machine (SVM) (Boser et al., 1992) is a widely used learning method. Obozinski and co-workers (2008) used calibrated ensembles of SVMs to predict gene function. First, a support vector machine is trained for each GO term. Making predictions for each GO term independently may cause inconsistency with the GO graph structure. For example, inconsistency occurs when a GO term is assigned to a gene while its parent term is not. To reconcile such inconsistencies, 11 methods were evaluated. Among these reconciliation methods, isotonic regression showed the best performance (Obozinski et al., 2008).
6.4.2.2 Multilabel Hierarchical Classification and Bayesian Integration
This method also takes the hierarchical structure of the GO graph into consideration. First, an SVM classifier is trained for each GO term on each individual dataset. Then Bayesian networks are used to enforce hierarchical consistency across GO terms and obtain the most probable consistent set of predictions (Barutcuoglu et al., 2006). A naive Bayesian classifier is used to make the final prediction based on the outputs of the per-dataset SVM classifiers (Guan et al., 2008).
6.4.2.3 GeneMANIA
GeneMANIA (Mostafavi et al., 2008) takes a network-based approach. MANIA stands for Multiple Association Network Integration Algorithm.
GeneMANIA integrates multiple functional association networks and propagates functional annotations through the network. The data used by GeneMANIA to construct networks include physical interactions, genetic interactions, co-localization, shared protein domains, and pathways. GeneMANIA is accessible through a web interface (www.genemania.org). It takes a list of query genes as input and generates a network that includes the query genes and the genes associated with them. An example is shown in Figure 6.6, where the query list consists of 11 human genes.
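The two ideas in this description, combining several association networks into one weighted network and then spreading query-gene labels through it, can be illustrated with a small sketch. This is not GeneMANIA's algorithm: the network weights are supplied by hand rather than learned, and the simple iterative update with the mixing parameter `alpha` is an assumed stand-in for its label propagation step. Gene names are hypothetical.

```python
def combine_networks(networks, weights):
    """Merge several networks into one by a weighted sum of their edge strengths.

    networks: list of dicts mapping (gene_a, gene_b) -> edge strength.
    weights:  one weight per network (assumed given, not learned).
    """
    combined = {}
    for network, weight in zip(networks, weights):
        for (gene_a, gene_b), strength in network.items():
            key = tuple(sorted((gene_a, gene_b)))  # treat edges as undirected
            combined[key] = combined.get(key, 0.0) + weight * strength
    return combined


def propagate(combined, seeds, alpha=0.5, iterations=50):
    """Spread seed labels over the combined network.

    Each round, a gene's score becomes a mix of its own seed label (weight 1 - alpha)
    and the weighted average score of its neighbors (weight alpha).
    """
    neighbors = {}
    for (gene_a, gene_b), weight in combined.items():
        neighbors.setdefault(gene_a, {})[gene_b] = weight
        neighbors.setdefault(gene_b, {})[gene_a] = weight

    genes = set(neighbors) | set(seeds)
    label = {gene: (1.0 if gene in seeds else 0.0) for gene in genes}
    score = dict(label)
    for _ in range(iterations):
        updated = {}
        for gene in genes:
            linked = neighbors.get(gene, {})
            total = sum(linked.values())
            spread = (
                sum(w * score[other] for other, w in linked.items()) / total
                if total
                else 0.0
            )
            updated[gene] = (1.0 - alpha) * label[gene] + alpha * spread
        score = updated
    return score


if __name__ == "__main__":
    physical = {("Q1", "G1"): 1.0}
    pathway = {("Q1", "G1"): 1.0, ("G1", "G2"): 1.0}
    merged = combine_networks([physical, pathway], [0.5, 0.5])
    scores = propagate(merged, seeds={"Q1"})
    for gene in sorted(scores, key=scores.get, reverse=True):
        print(gene, round(scores[gene], 3))
```

In this toy run, the gene directly linked to the query gene ends with a higher score than the gene two steps away, mirroring how predicted genes are ranked by their degree of association with the query list.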
Figure 6.6. GeneMANIA propagates functional annotations through the integrated networks. On the left is the integrated network. The dark nodes represent query genes, and the white nodes represent predicted genes. The size of a predicted gene node is determined by the score assigned to the gene (on the right side of the figure), which reflects the degree of association between the gene and query genes. The datasets from which the networks were constructed and the weights of the networks are shown in the middle.
6.4.2.4 Combination of Classifier Ensemble and Gene Network
Kim and co-workers (2008) compared the performance of two types of prediction strategies: a supervised machine-learning approach (an ensemble of 2,815 naive Bayesian classifiers, one for each GO term) and a network-based approach. In their comparison, the network approach generally outperformed the naive Bayesian classifier. Each approach uses a score to evaluate the annotation. Therefore, the two approaches can be easily combined by using the average score instead of each individual score. It was found that the combination of the two approaches shows better performance than either approach alone (Kim et al., 2008).
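The averaging step can be written down directly. The sketch below assumes that both approaches score the same (gene, GO term) pairs on a comparable scale, for example between 0 and 1; that assumption, the function name, and the example identifiers are illustrative and are not taken from Kim et al. (2008).

```python
def combine_scores(classifier_scores, network_scores):
    """Average two score tables keyed by (gene, go_term).

    Only pairs scored by both approaches are returned, since an average of a
    single score would not reflect the combination.
    """
    combined = {}
    for key in classifier_scores.keys() & network_scores.keys():
        combined[key] = (classifier_scores[key] + network_scores[key]) / 2.0
    return combined


if __name__ == "__main__":
    from_classifier = {("gene1", "GO:1"): 0.9, ("gene2", "GO:1"): 0.2}
    from_network = {
        ("gene1", "GO:1"): 0.5,
        ("gene2", "GO:1"): 0.6,
        ("gene3", "GO:1"): 0.1,
    }
    for pair, score in sorted(combine_scores(from_classifier, from_network).items()):
        print(pair, round(score, 2))
```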
6.5 QUESTIONS AND ANSWERS
Q1. Gene ontology includes three independent ontologies. What are they?
Q2. What are orthologs and paralogs? Why must they be distinguished in gene/protein function prediction?
Q3. Why use protein structure comparison for function prediction?
A1. Biological process, molecular function, and cellular component.
A2. Orthologs are homologous sequences derived from a common ancestral sequence in a speciation event. Paralogs are homologous sequences derived from a duplication event, which is often associated with processes of functional divergence. Therefore, orthologs typically perform identical or similar functions, while paralogs are more likely to have evolved new functions. Homology-based annotation transfer may lead to two or more sets of different annotations if both orthologous and paralogous sequences are among the top BLAST hits. Annotations should be transferred from the orthologous sequences even when a paralogous sequence has the best similarity score.
A3. Protein structure comparison can often detect remote evolutionary relationships, even in the absence of significant sequence similarity.
6.6 REFERENCES
Altschul SF et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 25(17):3389–402.
Ashburner M et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–9.
Attwood TK et al. (2003). PRINTS and its automatic supplement, prePRINTS. Nucl Acids Res 31(1):400–02.
Ausiello G et al. (2005). PdbFun: mass selection and fast comparison of annotated PDB residues. Nucl Acids Res 33(Suppl 2):W133–37.
Bairoch A et al. (2005). The Universal Protein Resource (UniProt). Nucl Acids Res 33(Suppl 1):D154–59.
Barutcuoglu Z et al. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics 22(7):830–36.
Bateman A et al. (1999). Pfam 3.1: 1,313 multiple alignments and profile HMMs match the majority of proteins. Nucl Acids Res 27(1):260–62.
Berman HM et al. (2000). The protein data bank. Nucl Acids Res 28(1):235–42.
Bork P et al. (2004). Protein interaction networks from yeast to human. Curr Opin Struct Biol 14(3):292–99.
Bork P, Koonin EV. (1996). Protein sequence motifs. Curr Opin Struct Biol 6(3):366–76.
Boser B et al. (1992). A training algorithm for optimal margin classifiers. Paper presented at the Fifth Annual Workshop on Computational Learning Theory: July 27–29, 1992, Pittsburgh, PA.
Brown MPS et al. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97(1):262–67.
Brown PO, Botstein D. (1999). Exploring the new world of the genome with DNA microarrays. Nat Genet 21:33–37.
Bru C et al. (2005). The ProDom database of protein domain families: more emphasis on 3D. Nucl Acids Res 33(Suppl 1):D212–15.
Capra JA et al. (2009). Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol 5(12):e1000585.
Conesa A et al. (2005). Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18):3674–76.
Costanzo M et al. (2010). The genetic landscape of a cell. Science 327(5964):425–31.
D’Haeseleer P. (2006). What are DNA sequence motifs? Nat Biotech 24(4):423–25.
Dyrløv Bendtsen J et al. (2004). Improved prediction of signal peptides: SignalP 3.0. J Molec Biol 340(4):783–95.
Engelhardt BE et al. (2005). Protein molecular function prediction by Bayesian phylogenomics. PLoS Comput Biol 1(5):e45.
Finn RD et al. (2008). The Pfam protein families database. Nucl Acids Res 36(Suppl 1):D281–88.
Friedberg I. (2006). Automated protein function prediction: the genomic challenge. Brief Bioinform 7(3):225–42.
Gabaldon T et al. (2009). Joining forces in the quest for orthologs. Genome Biol 10(9):403.
Gherardini PF, Helmer-Citterich M. (2008). Structure-based function prediction: approaches and applications. Brief Funct Genomic Proteomic 7(4):291–302.
Gotz S et al. (2008). High-throughput functional annotation and data mining with the Blast2GO suite. Nucl Acids Res 36(10):3420–35.
Greene LH et al. (2007). The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucl Acids Res 35(Suppl 1):D291–97.
Groth P et al. (2007). PhenomicDB: a new cross-species genotype/phenotype resource. Nucl Acids Res 35:D696–99.
Groth P et al. (2008). Mining phenotypes for gene function prediction. BMC Bioinformatics 9(1):136.
Grubb SC et al. (2009). Mouse phenome database. Nucl Acids Res 37(Suppl 1):D720–30.
Guan Y et al. (2008). Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 9(Suppl 1):S3.
Haft DH et al. (2003). The TIGRFAMs database of protein families. Nucl Acids Res 31(1):371–73.
Hamosh A et al. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucl Acids Res 33(Suppl 1):D514–17.
Holm L et al. (2008). Searching protein structure databases with DaliLite v.3. Bioinformatics 24(23):2780–81.
Hughes T, Roth F. (2008). A race through the maze of genomic evidence. Genome Biol 9(Suppl 1):S1.
Hulo N et al. (2008). The 20 years of PROSITE. Nucl Acids Res 36(Suppl 1):D245–49.
Hunter S et al. (2009). InterPro: the integrative protein signature database. Nucl Acids Res 37(Suppl 1):D211–15.
Ivanisenko VA et al. (2004). PDBSiteScan: a program for searching for active, binding and posttranslational modification sites in the 3D structures of proteins. Nucl Acids Res 32(Suppl 2):W549–54.
Jambon M et al. (2005). The SuMo server: 3D search for protein functional sites. Bioinformatics 21(20):3929–30.
Johnson M et al. (2008). NCBI BLAST: a better web interface. Nucl Acids Res 36(Suppl 2):W5–9.
Karplus K et al. (2005). SAM-T04: what is new in protein-structure prediction for CASP6. Proteins 61(S7):135–42.
Kawabata T. (2003). MATRAS: a program for protein 3D structure comparison. Nucl Acids Res 31(13):3367–69.
Kelley R, Ideker T. (2005). Systematic interpretation of genetic interactions using protein networks. Nat Biotech 23(5):561–66.
Kennedy C, Leff P. (1995). How should P2x purinoceptors be classified pharmacologically? Trends Pharmacol Sci 16(5):168–74.
Kim W et al. (2008). Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy. Genome Biol 9(Suppl 1):S5.
Kuzniar A et al. (2008). The quest for orthologs: finding the corresponding gene across genomes. Trends Genet 24(11):539–51.
Laurie ATR, Jackson RM. (2005). Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21(9):1908–16.
Letunic I et al. (2006). SMART 5: domains in the context of genomes and networks. Nucl Acids Res 34(Suppl 1):D257–60.
Liang MP et al. (2003). WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures. Nucl Acids Res 31(13):3324–27.
Lo Conte L et al. (2000). SCOP: a structural classification of proteins database. Nucl Acids Res 28(1):257–59.
Maglott DR et al. (2000). NCBI’s LocusLink and RefSeq. Nucl Acids Res 28(1):126–28.
Marti-Renom M et al. (2007). The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 8(Suppl 4):S4.
Martin D et al. (2004). GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 5(1):178.
Mi H et al. (2007). PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucl Acids Res 35(Suppl 1):D247–52.
Mostafavi S et al. (2008). GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol 9(Suppl 1):S4.
Nimrod G et al. (2005). In silico identification of functional regions in proteins. Bioinformatics 21(Suppl 1):i328–37.
Obozinski G et al. (2008). Consistent probabilistic outputs for protein function prediction. Genome Biol 9(Suppl 1):S6.
Ogata H et al. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl Acids Res 27(1):29–34.
Ooi SL et al. (2003). DNA helicase gene interaction network defined using synthetic lethality analyzed by microarray. Nat Genet 35(3):277–86.
Pena-Castillo L et al. (2008). A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 9(Suppl 1):S2.
Pettit FK et al. (2007). HotPatch: A statistical approach to finding biologically relevant features on protein surfaces. J Mol Biol 369(3):863–79.
Piano F et al. (2002). Gene clustering based on RNAi phenotypes of ovary-enriched genes in C. elegans. Curr Biol 12(22):1959–64.
Porter CT et al. (2004). The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl Acids Res 32(Suppl 1):D129–33.
Pruitt KD et al. (2007). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl Acids Res 35(Suppl 1):D61–65.
Quevillon E et al. (2005). InterProScan: protein domains identifier. Nucl Acids Res 33(Suppl 2):W116–20.
Schwikowski B et al. (2000). A network of protein-protein interactions in yeast. Nat Biotech 18(12):1257–61.
Scordis P et al. (1999). FingerPRINTScan: intelligent searching of the PRINTS motif database. Bioinformatics 15(10):799–806.
Sharan R et al. (2007). Network-based prediction of protein function. Mol Syst Biol 3:88.
Shindyalov IN, Bourne PE. (2001). A database and tools for 3-D protein structure comparison and alignment using the Combinatorial Extension (CE) algorithm. Nucl Acids Res 29(1):228–29.
Sigrist CJA et al. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucl Acids Res 38(Suppl 1):D161–66.
c06.indd 109
1/12/2011 9:44:07 AM
110
BIOINFORMATICS TOOLS FOR GENE FUNCTION PREDICTION
Skolnick J, Brylinski M. (2009). FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinformatics 10(4):378–91. Sonnhammer EL et al. (1998). A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 6:175–82. Stark A, Russell RB. (2003). Annotation in three dimensions. PINTS: Patterns in Nonhomologous Tertiary Structures. Nucl Acids Res 31(13):3341–44. Storm CEV, Sonnhammer ELL. (2002). Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18(1):92–99. Surprenant A et al. (1995). P2x receptors bring new structure to ligand-gated ion channels. Trends Neurosci 18(5):224–29. Tong AHY et al. (2001). Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294(5550):2364–68. Tong AHY et al. (2004). Global mapping of the yeast genetic interaction network. Science 303(5659):808–13. von Mering C et al. (2002). Comparative assessment of large-scale data sets of proteinprotein interactions. Nature 417(6887):399–403. Walker MG et al. (1999). Prediction of Gene Function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res 9(12):1198–203. Wang Z et al. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63. Wilson D et al. (2007). The SUPERFAMILY database in 2007: families and functions. Nucl Acids Res 35(Suppl 1):D308–13. Wong SL et al. (2004). Combining biological networks to predict genetic interactions. Proc Nat Acad Sci USA 101(44):15682–87. Wu CH et al. (2004). PIRSF: family classification system at the Protein Information Resource. Nucl Acids Res 32:D112–14. Ye P et al. (2005). Gene function prediction from congruent synthetic lethal interactions in yeast. Mol Syst Biol 1:26. Ye Y, Godzik A. (2004). FATCAT: a web server for flexible structure comparison and structure similarity searching. Nucl Acids Res 32(Suppl 2):W582–85. 
Yeats C et al. (2008). Gene3D: comprehensive structural and functional annotation of genomes. Nucl Acids Res 36(Suppl 1):D414–18. Zdobnov EM, Apweiler R. (2001). InterProScan—an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17(9):847–48. Zhou M, Cui Y. (2004). GeneInfoViz: constructing and visualizing gene relation networks. Silico Biol 4(3): 323–33. Zmasek C, Eddy S. (2002). RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3(1):14.
c06.indd 110
1/12/2011 9:44:07 AM
CHAPTER 7
Determination of Genomic Locations of Target Genetic Loci
BO CHANG
Contents
7.1 Concepts of Genomic Location
7.2 Genetic Loci of Eye Diseases in Human and Animal Models
7.2.1 Discovery of Eye Diseases
7.2.2 Determination of the Mode of Inheritance
7.2.3 Single Versus Multiple Gene Traits
7.2.4 Examples
7.3 Genetic Markers for the Localization of Disease Loci
7.3.1 Background
7.3.2 Types of Genetic Markers
7.3.3 Uses of Genetic Markers
7.4 Defining Genomic Regions of Disease Loci Using Genetic Markers
7.4.1 Defining a Disease Locus Using Microsatellite Markers
7.4.2 Defining a Disease Locus Using SNP Markers
7.4.3 When Markers Are Missing in the Genome
7.4.4 Limitations and Alternative Procedures
7.5 Gene Identification Based on a Defined Genomic Region
7.5.1 Collection of Genetic Elements within a Targeted Region
7.5.2 Gene Screening and Discovery
7.6 Questions and Answers
7.7 Acknowledgments
7.8 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
7.1 CONCEPTS OF GENOMIC LOCATION
Over the last 20 years, the sequencing of the human genome, along with the genomes of related organisms, has represented one of the largest scientific endeavors in the history of humankind. Since the publication of the full genome sequence (Lander et al., 2001; Venter et al., 2001), scientists have been working to identify the genomic location of all the gene products involved in the complex biological processes in a single organism. However, they have been able to identify only a fraction of those locations. Discovery of the genomic location of disease genes in human populations is difficult due to the extreme genetic heterogeneity in humans and to the diversity of environmental and nutritional variables that can influence disease phenotypes. Models to study the diseases that occur in humans are important as reproducible experimental systems for elucidating the genomic location as well as the pathways of normal development and function. We have used a strategy of working from a known trait toward the identification of the genomic location of the underlying molecular defect causing the trait. By taking this approach, we are assured that the gene we eventually identify must be necessary for normal development and/or function in the human being. Also important, this strategy offers the possibility of identifying genomic locations that contain previously unknown genes and novel pathways.

To understand the concept of genomic location, a number of terms must be defined. A trait is a particular aspect of the phenotype that can be measured or observed directly, such as blood pressure, body weight, or coat color. Heredity is the passing of traits to offspring from their parents or ancestors. The study of heredity in biology is called genetics. The heritable unit that may influence a trait is called a gene. A gene is the basic unit of heredity in a living organism and is a sequence within a strand of DNA, which is part of a very long and compacted string of DNA called a chromosome.
A genetic locus is a specific location on a chromosome, similar to an address. A modern working definition of a gene is “a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and/or other functional sequence regions” (Pearson, 2006; Pennisi, 2007). An allele is one of a series of different sequence variations of a genetic locus and is used to describe variant forms of a gene detected as variant phenotypes. In diploid organisms, with two copies of each chromosome, the genotype for each gene is made up of the pair of alleles present at that locus, which are the same in homozygotes and different in heterozygotes. For example, at the gene locus for ABO blood type proteins in humans, there are three alleles (IA, IB, and IO) that determine compatibility of blood transfusions. Any individual has one of six possible genotypes (AA, AO, BB, BO, AB, and OO) that produce one of four possible phenotypes: A (produced by AA homozygous and AO heterozygous genotypes), B (produced by BB homozygous and BO heterozygous genotypes), AB (produced by AB heterozygotes), and O (produced by OO homozygotes) (Johnson and Hopkinson, 1992; Ugozzoli and Wallace, 1992).
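The genotype and phenotype counts in the ABO example can be checked by enumeration. The following is a minimal sketch, assuming the usual dominance relationships (A and B codominant, O recessive to both); the single-letter allele names and the function name are illustrative shorthand, not notation from the text.

```python
from itertools import combinations_with_replacement

ALLELES = ["A", "B", "O"]  # shorthand for the IA, IB, and IO alleles


def phenotype(genotype):
    """Return the ABO blood group for an unordered pair of alleles."""
    present = set(genotype) - {"O"}  # O is recessive to both A and B
    if not present:
        return "O"
    return "".join(sorted(present))  # "A", "B", or "AB" (A and B are codominant)


# Unordered pairs of three alleles give the six possible genotypes.
genotypes = ["".join(pair) for pair in combinations_with_replacement(ALLELES, 2)]

# Group the genotypes by the phenotype they produce.
by_phenotype = {}
for g in genotypes:
    by_phenotype.setdefault(phenotype(g), []).append(g)

for pheno, genos in sorted(by_phenotype.items()):
    print(pheno, genos)
```

Under these assumptions the six genotypes collapse into the four phenotype classes described above, with two genotypes each for groups A and B.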
Genomics is the study of the full genome, both coding and noncoding, of an organism or multiple organisms; genetics or molecular biology focuses more on individual genes, whereas genomics pays attention to broader genome phenomena, such as epistasis and heterosis. The field includes mapping genes and sequencing DNA. Gene mapping is the process of determining the locus for a particular biological trait and the creation of a genetic map assigning each locus to a specific location on a chromosome. The process of identifying a genetic element that is responsible for a disease is also referred to as mapping; the mapping information is derived from the investigation of disease manifestations in large families (genetic linkage) or from population-based genetic association studies. DNA sequencing refers to the methods for determining the order of the nucleotide bases adenine (A), guanine (G), cytosine (C), and thymine (T) in a molecule of DNA. A genetic map is a map showing the position of genes or markers on a chromosome. There are three types of genetic maps: (1) A linkage map is a type of genetic map showing relative gene positions based on meiotic recombination frequencies. The unit of measurement is the centimorgan (cM). (2) A cytogenetic map relates gene positions to chromosomal banding patterns. These maps are built by relating the positions of genes to cytogenetic markers or by in situ hybridization. (3) A physical map of DNA shows distances between and within genes or specified markers measured in base pairs of DNA. It is based on the direct measurement of DNA. When a genome is first investigated, this map is nonexistent. The genetic map improves with scientific progress, and when the genomic DNA sequencing of the species has been completed the resulting physical map provides a comprehensive framework.
7.2 GENETIC LOCI OF EYE DISEASES IN HUMAN AND ANIMAL MODELS
Virtually all nontraumatic and noninfectious human ocular diseases are genetic in origin or have strong genetic components; genetic disorders account for more than 95% of common eye diseases. Most genetic eye disorders, such as congenital cataract and retinitis pigmentosa (RP), are the direct result of a mutation in one gene. A search for cataract in the Online Mendelian Inheritance in Man (OMIM) database (OMIM, 2010) returns a total of 364 genes or genetic loci that have been discovered in humans. The same search in the Online Mendelian Inheritance in Animals (OMIA) database (OMIA, 2010) (other than human and mouse) returns a total of 11 cataract-associated genetic loci discovered in other species, including cat (Felis catus), chicken (Gallus gallus), cow (Bos taurus), dog (Canis familiaris), domestic guinea pig (Cavia porcellus), European polecat (Mustela putorius), horse (Equus caballus), macaques (Macaca) and sheep (Ovis aries). There are a total of 557 genes or genetic loci of retinal disease in the OMIM database and 17 genetic loci in the OMIA database. In the search for mouse models of human eye diseases at the Jackson Laboratory, a total of 109 have been reported (Chang et al., 2005); these mouse models include diseases that affect each part of the eye, including the eyelid, cornea, iris, lens, and retina, and result in corneal diseases, glaucoma, cataract, and retinal degenerations.

Age-related eye diseases (AREDs) are the leading causes of vision impairment and blindness throughout the world. Of the AREDs, age-related macular degeneration (AMD) increases dramatically with age in men and women and is the most important cause of irreversible visual loss in the elderly. Cataracts are the leading cause of blindness in the world; the Eye Diseases Prevalence Research Group (2004) estimated that 20.5 million (17.2%) Americans older than 40 years have a cataract in one eye. Through the search for late-onset eye disease in mice at the Jackson Laboratory, many age-related eye diseases have been found to exist in mouse strains. For example, among 35 inbred mouse strains, 15 were found to have age-related retinal degeneration and 21 were found to have age-related cataract (Chang, 2008).

One of the most difficult problems ahead is to determine how specific genes contribute to diseases that have a complex pattern of inheritance, as is particularly the case for AREDs. In all of these cases, no single gene has the yes or no power to determine whether a person develops a particular disease. It is likely that more than one mutation is required before the disease is manifest, and a number of genes may each make a subtle contribution to an individual’s susceptibility to a disease. Specific genes may also affect how a person reacts to environmental factors, as in senile or age-related cataract. With the completion of the human genome sequence and the discovery of thousands of DNA markers (such as SNPs or CNVs), it is possible to work through these problems and find out how genes contribute to diseases that have a complex pattern of inheritance by means of a genome-wide association study (GWAS).
Through GWAS, many genetic loci have been discovered (Hindorff et al., 2009, 2010). For example, three susceptibility loci associated with primary open-angle glaucoma were identified by GWAS in a Japanese population (Nakano et al., 2009).

7.2.1 Discovery of Eye Diseases
Discovery of the genetic causes of eye disease in human populations is difficult due to extreme heterogeneity in humans and to the diversity of environmental and nutritional variables. Consequently, animal models with defined genetic backgrounds maintained in controlled environments are of particular importance for eye disease gene discovery and for understanding the cellular pathways and molecular mechanisms that lead to specific eye diseases. Once disease etiology is understood at the molecular level, pharmaceutical or genetic therapies are more easily designed to delay onset of the disease or provide a cure, and mouse models can be used for initial preclinical testing of these therapies. Mice carrying mutations that alter developmental pathways or metabolic functions provide model systems for analyzing the defects in comparable human disorders and for testing modes of therapy. Mutant mice also provide reproducible experimental systems for understanding pathways of normal development and function. The most important feature of inherited mutations is that they are genetically transmitted to the next generation and thus can be propagated. The development of inherited model systems requires two steps. First, the mutant mice must be discovered and characterized in sufficient detail to determine their scientific value. Second, in-depth analyses of the molecular and biochemical mechanisms underlying the disease process must be performed. Such in-depth studies can require specialized characterization plus the research to elucidate the function of a single mutant gene.

To study the inherited eye diseases in mice, we have developed a screening protocol that we apply to several mice of each sex from each inbred strain or stock of interest. The screening examination starts with an external evaluation of the eyelids, globe, cornea, and iris, first by visual inspection; a biomicroscope (slit lamp) is then used to check the cornea for clarity, size (buphthalmos vs. microcornea), surface texture, and vascularization. The iris is checked for pupil size, constriction, reflected luminescence, and synechia. The pupil is then dilated with 1% atropine and the lens is checked for cataract. Finally, an indirect ophthalmoscope is used to examine the fundus for signs of retinal degeneration, such as retinal vessel constriction, retinal pigment epithelial disturbance, drusen or deposits, and structural or optic nerve head abnormalities. If the fundus is normal, an electroretinogram (ERG) test is used to see if there is retinal function loss in the rods and cones. Mice with a suspected abnormality are followed up with a secondary examination of more mice of the same strain and of genetically related strains.
The second-level screen also includes (1) an ERG for suspected retinal problems, (2) a histological check of all eye tissues, (3) a comparison of mice at different ages to determine age of onset of the condition, and (4) a comparison of the new mutant’s clinical features to those in established mutant eye stocks.

7.2.2 Determination of the Mode of Inheritance
Once an eye disease (phenotype) is discovered from the screening procedure, the next step is to determine whether the phenotype is genetically transmitted; if so, the mode of inheritance is determined by mating the mouse with a new phenotype to a mouse from a related but independent strain. To determine if the eye disease is transmitted (inherited), the mice with the eye disease are mated to unrelated mice to see if the eye disease (phenotype) appears in the F1 progeny (Fig. 7.1). If the F1 progeny have the eye disease, it is inherited in a dominant or semidominant fashion. If the F1 progeny have no eye disease, these F1 progeny are intercrossed or backcrossed to the mutant parent to see if the phenotype appears in the F2 or N2 progeny. If the phenotype is recovered in the F2 or N2 progeny, it is inherited in a recessive fashion. If the phenotype is not recovered in the F2 or N2 progeny, it is not inherited and is not useful for genetic study. The mode of inheritance generally follows these rules (Fig. 7.2):
Figure 7.1. How to determine whether the eye deviant is genetically transmitted. [Flowchart: the eye deviant is mated to an unrelated mouse. Affected F1 mice indicate a dominant or semidominant trait; in the F2, a dominant trait gives 3 affected : 1 normal and a semidominant trait gives 1 more severe : 2 like parents : 1 normal. Normal F1 mice are carried to the F2, where 1 affected : 3 normal indicates a recessive (inherited) trait and all normal indicates that the trait is not transmitted (not inherited).]

Figure 7.2. The mode of inheritance is determined by genetic analysis. [Table of parent genotype, progeny genotype, progeny phenotype, and mode of inheritance:]
Dominant: D/+ × D/+ gives 1 D/D and 2 D/+ (affected) and 1 +/+ (normal).
Semi-dominant: S/+ × S/+ gives 1 S/S (affected), 2 S/+ (intermediate), and 1 +/+ (normal).
Recessive: r/+ × r/+ gives 1 r/r (affected) and 2 r/+ and 1 +/+ (normal).
X-linked dominant: X/XD × X/Y gives 1 X/X and 1 X/Y (normal) and 1 XD/X and 1 XD/Y (affected); X/X × XD/Y gives 2 X/XD (affected) and 2 X/Y (normal).
X-linked recessive: X/Xr × X/Y gives 1 X/X, 1 X/Y, and 1 Xr/X (normal) and 1 Xr/Y (affected); X/X × Xr/Y gives 2 X/Xr and 2 X/Y (all normal).
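The autosomal ratios summarized in Figures 7.1 and 7.2 can be reproduced by enumerating the gametes of each parent. This is a minimal sketch for a single autosomal locus with full penetrance; the function names and the use of "D", "r", and "+" as allele symbols in code are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Enumerate offspring genotypes for one autosomal locus.

    Each parent is a pair of allele symbols, e.g. ("D", "+").
    Returns {genotype: expected fraction}; alleles are sorted so that
    ("D", "+") and ("+", "D") count as the same genotype.
    """
    counts = Counter(tuple(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


def affected_fraction(offspring, is_affected):
    """Sum the expected fractions of all genotypes judged affected."""
    return sum(f for g, f in offspring.items() if is_affected(g))


het_dom = ("D", "+")  # heterozygous carrier of a dominant mutation
het_rec = ("r", "+")  # heterozygous carrier of a recessive mutation
wild = ("+", "+")     # unrelated, unaffected mouse

# Dominant mutation: deviant x unrelated (F1), then heterozygote intercross (F2).
f1_dominant = affected_fraction(cross(het_dom, wild), lambda g: "D" in g)
f2_dominant = affected_fraction(cross(het_dom, het_dom), lambda g: "D" in g)
# Recessive mutation: only r/r progeny of the intercross are affected.
f2_recessive = affected_fraction(cross(het_rec, het_rec), lambda g: g == ("r", "r"))

print(f1_dominant, f2_dominant, f2_recessive)
```

The three printed fractions correspond to the 50% affected F1, 75% affected F2, and 25% affected F2 expectations; real litters will deviate from these values by sampling error and, where present, incomplete penetrance.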
1. Dominant, semidominant: phenotype is recovered in 50% of F1 mice.
2. Semidominant: phenotype in 25% of the F2 mice is more severe than in F1 mice.
3. Recessive: phenotype is recovered in only approximately 25% of the F2 mice.
4. Sex-linked:
a. X-linked dominant: female deviant mice transmit the mutated X to 50% of offspring, so half of the daughters and half of the sons display the mutation; male deviant mice transmit the phenotype to all their daughters but to none of their sons, because the only X chromosome they have to pass on is the one with the mutation and their sons receive the Y chromosome instead (Fig. 7.2).
b. X-linked recessive: heterozygous (carrier) females bred to unaffected males have no daughters that express the recessive mutation because the daughters can inherit at most one mutant X chromosome, but half of all sons display the mutation because the only X chromosome they receive is from their mother and the paired chromosome from their father is the Y chromosome.
c. Y-linked: expressed only in males, who then transmit it only to sons; this is rare.

As long as the mutation follows Mendelian inheritance and does not have incomplete penetrance, these patterns hold.

Based on the procedures above, if the abnormal phenotype appears among the F1 progeny, the mutation is inherited as a dominant or semidominant gene. If the F1 progeny are intercrossed, a recessively inherited gene will reappear as about 25% affected F2 progeny; a semidominant gene will produce an intermediate (50%) and more severe (25%) phenotype in F2 progeny, and a dominant gene will produce 75% affected progeny. X- and Y-linked mutations, of course, are ascertained by their sex-linked mode of inheritance. If the deviant dies at a young age or is infertile, the breeding test is done with the parents or siblings. Simultaneously, parents and siblings of the deviant mouse are mated to ensure propagation of the phenotype if it is genetic.

7.2.3 Single Versus Multiple Gene Traits
Many traits are due to the actions of a single gene. However, individual traits can also be affected by multiple genes. Phenotypic traits can be of two types: single gene (or qualitative) traits and multiple gene (or quantitative) traits. The qualitative traits are the classical Mendelian traits of kind, such as structural genes (e.g., eye lens proteins and many forms of cataract), pigment genes (e.g., black or agouti coat of mice), and antigens (e.g., human blood group types).
Each qualitative trait may be under the genetic control of two or more alleles of a single gene with little or no environmental modifications to obscure the gene effects. The qualitative traits have distinct phenotypic classes and are said to exhibit discontinuous variations, and many of these types of genes have been characterized in human beings and animal models. The quantitative traits, however, are measurable phenotypic traits of degree such as height, weight, red blood cell count, plasma concentration of high-density lipoprotein cholesterol (HDL), etc. The quantitative traits are also called metric traits. They do not show clear-cut differences between individuals, but instead are a spectrum of phenotypes that blend imperceptibly from one type to another to cause continuous variations. In contrast to qualitative traits, quantitative traits may be modified, to varying degrees, by environmental conditions and are usually governed by many factors or genes (perhaps 10 or 100 or more), each contributing such a small amount to the phenotype that their individual effects often cannot be detected by Mendelian methods but only by statistical methods. Genes that contribute in combination with other, nonallelic genes to the phenotype of a single quantitative trait are called polygenes or cumulative genes. The inheritance of polygenes or quantitative traits is called quantitative inheritance, multiple factor inheritance, multiple gene inheritance, or polygenic inheritance. Since the completion of the human genome sequence, many genetic loci have been discovered by GWAS. As of January 2010, the catalog of GWAS identified 476 publications and a total of 2220 SNP-trait associations in many different human diseases, including age-related macular degeneration and open-angle glaucoma (Hindorff et al., 2010).

7.2.4 Examples
Two examples serve to clarify the process of disease model characterization.

Lens opacity 10 (Lop10) is an autosomal semidominant mutation causing cataracts, which was discovered among progeny of a cross between BALB/cJ and AKR/J mice. Mice homozygous for the Lop10 mutation develop microphthalmia with dense cataracts, first observed when the mice open their eyes at day 12. Heterozygous mice have normal size eyes with a variable expression of cataracts ranging from no expression with clear lenses, through partial expression with snowflake opacities, to full expression with dense fetal nuclear opacities. A genomewide screen was used to map the Lop10 mutation to mouse chromosome 3 between D3Mit11 and D3Mit288, where the gap junction protein alpha 8 (Gja8) gene is located. Two pairs of PCR primers covering the whole coding region were designed from the mouse α8 connexin sequence in GenBank. Sequence analysis showed that Lop10 is a missense mutation (G to C) at codon 22 of the Gja8 gene that results in glycine being replaced by arginine (G22R). Therefore, the gene symbol for the Lop10 allele is Gja8Lop10 (Chang et al., 2002b). Mutations of connexin α8 (also called Gja8 or Cx50) in humans have been reported to cause cataract with semidominant inheritance patterns, and seven alleles have been identified:
1. Three alleles: a C-T transition (PRO88SER) (Shiels et al., 1997, 1998), a G-A transition (GLU48LYS) (Berry et al., 1999), and a T-G transversion (ILE247MET) (Polyakov et al., 2001) in the Gja8 gene were identified in human patients with cataract, zonular pulverulent 1 (CZP1, OMIM 116200).
2. Two alleles: a T-A transversion (VAL44GLU) and a G-A transition (ARG198GLN) (Devi and Vijayalakshmi, 2006) in the Gja8 gene were discovered in patients with cataract-microcornea syndrome (OMIM 116150).
3. One allele of a G-C transversion (ARG23THR) (Willoughby et al., 2003) was identified in human patients with nuclear progressive cataract (OMIM 607304).
4. One allele of a G-A transition (ASP47ASN) (Arora et al., 2008) in the Gja8 gene was discovered in human patients with nuclear pulverulent cataract (OMIM 600897).

No b-wave 2 (nob2) is an X-linked recessive mutation. Mice with the nob2 mutation were found by fundus examination to have small retinal vessels. ERG tests detected a selective absence of rod b-waves and a much reduced cone response in homozygous (nob2/nob2) mice from 3 weeks to 9 months of age. Histopathology demonstrated that the retinal outer plexiform layer was missing in nob2/nob2 mutant mice at 3 weeks of age. This disorder is an X-linked retinal outer plexiform layer dystrophy that mapped to mouse chromosome X near the nob (nob2 in rod ERG) gene. A test for allelism (also called a complementation test) with nob was negative (double heterozygotes are the wild type; that is, the mutations are not alleles and they complement each other). Thus this new mutation was named nob2. The nob2 mutation mapped close to the gene for the alpha 1F subunit of the voltage-dependent calcium channel (Cacna1f). Subsequent sequence analysis showed that the nob2 mutation is caused by a 184 basepair insertion (Mus musculus transposon ETn) in exon 2 of the Cacna1f gene, proving the nob2 mutation to be an allele of Cacna1f (Cacna1f nob2) (Chang et al., 2007b). The transposon ETn insertion causes an aberration in the voltage-gated calcium channel, presumably causing a decrease in neurotransmitter release from photoreceptor presynaptic terminals. Many mutations in the human CACNA1F gene have been identified that cause incomplete X-linked congenital stationary night blindness (CSNB2) (Strom et al., 1998; Bech-Hansen et al., 1998; Boycott et al., 2001; Wutz et al., 2002; Nakamura et al., 2003; Hemara-Wahanui et al., 2005; Jalkanen et al., 2006, 2007).
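The transition and transversion labels used for the Gja8 alleles listed above follow from the chemical class of the two bases involved: a purine-to-purine or pyrimidine-to-pyrimidine change is a transition, and a change between the two classes is a transversion. A minimal sketch of that rule, with an illustrative function name:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid base substitution: {ref}>{alt}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Base changes named in the allele list above.
changes = [("C", "T"), ("G", "A"), ("T", "G"), ("T", "A"), ("G", "C")]
for ref, alt in changes:
    print(f"{ref}-{alt}: {substitution_class(ref, alt)}")
```

By the same rule, the G to C change reported for the mouse Lop10 allele is a transversion.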
7.3 GENETIC MARKERS FOR THE LOCALIZATION OF DISEASE LOCI
A genetic marker is a gene or DNA sequence with a known location on a chromosome that is associated with a particular gene or trait. It can be described as an observable variation, arising from mutation or alteration at a genomic locus, or as a specific gene that produces a recognizable trait and can be used in family or population studies. A genetic marker may be an identifiable segment of DNA with an identifiable physical location on a chromosome, whose inheritance can be followed, such as a single nucleotide polymorphism (SNP), a restriction fragment length polymorphism (RFLP), a variable number of tandem repeats (VNTR), or microsatellite DNAs. Since DNA segments that lie near each other on a chromosome tend to be inherited together, markers are often used as indirect ways of tracking the inheritance pattern of a gene that has not yet been identified but whose approximate or exact location is known.
A genetic marker is a tool that allows a scientist to detect variation (or the absence of variation) among individuals or between alleles in a particular segment of DNA, to identify an individual disease or trait or trace its inheritance within a family. A genetic marker can be a polymorphic locus that can be used in linkage studies, and the polymorphism may be anything to do with the DNA at the locus or its possible product so long as it can be recognized with an appropriate test. Examples include polymorphisms such as a single nucleotide change, an RFLP, a microsatellite polymorphism, an electrophoretic variant of a protein or enzyme, a blood group, and a genetic disease. Any segment of DNA that can be identified or whose chromosomal location is known can be a genetic marker in that it can be used as a reference point to map or locate other genes.

7.3.1 Background
For many years, gene mapping was limited in most organisms by traditional genetic markers that include genes encoding easily observable characteristics, such as coat colors and white spotting, skin and hair texture, and skeletal mutants. For example, an autosomal recessive cataract called lens opacity 18 (lop18) was mapped to mouse chromosome 17 (chr 17) based on linkage with a visible skeletal mutant gene called the T locus (T/+ mice are short-tailed) (Chang et al., 1996). Among 89 backcrossed mice, 15 recombinants appeared between lop18 and the T locus for a recombination percentage of 16 ± 3.9 (15/89). The T locus is located 4 cM distal to the centromere on chr 17, suggesting a possible location for lop18 at approximately 20 cM on chr 17, which is very close to the αA-crystallin (Cryaa) gene. Subsequently, a missense mutation was identified in the αA-crystallin gene of the lop18 mouse (Chang et al., 1999). Aside from outwardly observable genetic markers, biochemically based genetic markers can also be used for gene mapping. Some genes have alleles encoding electrophoretic variants of their protein products, and these have proven useful in mapping. For example, the mouse retinal degeneration 3 (rd3) mutation was mapped to the distal end of mouse Chromosome 1 based on its linkage pattern with the isozyme markers alkaline phosphatase 1 and peptidase 3 in a three-point linkage test (Chang et al., 1993).
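The arithmetic behind the lop18 backcross result can be sketched as follows. The recombination percentage is the number of recombinants divided by the number of mice scored, and the sketch assumes that the quoted uncertainty is the binomial standard error. It also assumes that, over a short interval, 1% recombination corresponds to roughly 1 cM, so that the estimate can be added to the 4 cM position of the T locus. Both assumptions, and the function name, are illustrative rather than taken from the cited study.

```python
import math


def recombination_estimate(recombinants, total):
    """Return (percent recombination, standard error in percent) for a backcross."""
    r = recombinants / total
    se = math.sqrt(r * (1 - r) / total)  # binomial standard error (assumed)
    return 100 * r, 100 * se


percent, se = recombination_estimate(15, 89)
print(f"recombination: {percent:.1f} +/- {se:.1f} %")

# Rough position of lop18, assuming 1% recombination ~ 1 cM and that
# lop18 lies distal to the T locus at 4 cM.
t_locus_cm = 4
approx_position = t_locus_cm + percent
print(f"approximate position: {approx_position:.0f} cM")
```

The computed values (about 16.9 ± 4.0) are close to the 16 ± 3.9 quoted in the text, and the approximate position agrees with the stated location near 20 cM; the small difference in the percentage may reflect rounding in the original report.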
Since the advent of nucleotide sequencing of genes in the early 1970s (Min Jou et al., 1972; Fiers et al., 1976; Maxam and Gilbert 1977; Gilbert and Maxam 1973; Sanger and Coulson 1975; Sanger et al., 1977), numerous types of genetic markers have been developed, including RFLPs, random amplified polymorphic DNA (RAPDs), simple sequence repeats (SSRs), quantitative trait loci (QTLs), cleavage amplification polymorphism (CAP), sequence-specific amplification polymorphisms (SSAPs), intersimple sequence repeats (ISSRs), sequence tagged sites (STSs), sequence characterized amplification regions (SCARs), selective amplification of microsatellite polymorphic loci (SAMPLs), SNPs, expressed sequence tags (ESTs), sequence-related amplified polymorphisms (SRAPs), target region amplification polymorphisms (TRAPs), microarrays, diversity arrays technology (DArT), single-strand conformation polymorphisms (SSCPs), denaturing gradient gel electrophoresis (DGGE), temperature-gradient gel electrophoresis (TGGE), and methylation-sensitive PCR (Jones et al., 2009) as well as copy number variations (CNVs) (Sebat et al., 2004; Iafrate et al., 2004).

7.3.2 Types of Genetic Markers
Although many types of genetic markers have been developed, only a few of the commonly used types will be discussed here.

7.3.2.1 RFLP
In RFLP analysis, the DNA sample is digested into pieces by restriction enzymes. The resulting DNA restriction fragments are separated according to their lengths by agarose gel electrophoresis, and then transferred to a membrane via the Southern blot procedure. Hybridization of the membrane to a labeled DNA probe then determines the length of the fragments that are complementary to the probe. An RFLP occurs when the length of a detected fragment varies between individuals. Each fragment length is considered an allele, and can be used in genetic analysis.

7.3.2.2 Microsatellite Polymorphism
A microsatellite polymorphism or marker is a short (up to several hundred basepairs) segment of DNA that consists of multiple tandem repeats of a two- or three-basepair sequence. Microsatellites expand and contract (that is, add or remove repeat units) with a frequency much higher than other types of mutations, making them useful as polymorphic markers in closely related mouse strains. One large series of microsatellite markers in the mouse was developed at the Massachusetts Institute of Technology. These markers have been used to align the physical and linkage maps in mouse (Dietrich et al., 1994, 1996). One common example of a microsatellite is a (CA)n repeat, where n is variable between alleles. These markers often present high levels of interspecific and intraspecific polymorphism, particularly when the number of tandem repeats is 10 or greater (Queller et al., 1993). The repeated sequence is often simple, consisting of two, three, or four nucleotides (dinucleotide, trinucleotide, and tetranucleotide repeats, respectively), and can be repeated 10 to 100 times. CA nucleotide repeats are very frequent in human and mouse genomes and are present every few thousand basepairs.
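A (CA)n repeat can be located in a sequence, and its repeat number n counted, with a simple pattern search. The sketch below is only an illustration: the two allele sequences and their flanking bases are hypothetical, and the threshold of 10 repeats is taken from the figure quoted above rather than from any marker database.

```python
import re


def find_repeats(sequence, unit="CA", min_repeats=10):
    """Return (start index, repeat count) for each run of `unit`
    repeated at least `min_repeats` times in a row."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_repeats},}}")
    return [
        (m.start(), len(m.group()) // len(unit))
        for m in pattern.finditer(sequence.upper())
    ]


# Two hypothetical alleles of one locus: identical flanking sequence,
# different numbers of CA repeats.
allele_1 = "GGATCC" + "CA" * 12 + "TTGACG"
allele_2 = "GGATCC" + "CA" * 17 + "TTGACG"

print(find_repeats(allele_1))
print(find_repeats(allele_2))
print("length difference (bp):", len(allele_2) - len(allele_1))
```

Because the flanking sequence is shared, a difference in repeat number shows up as a difference in the length of the amplified fragment, which is what makes such alleles distinguishable on a gel.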
As there are often many alleles present at a microsatellite locus, genotypes within pedigrees are often fully informative, in that the progenitor of a particular allele can often be identified. In this way, microsatellites are ideal for paternity studies, population genetic studies, and recombination mapping. Microsatellite markers can be amplified for identification by PCR, using the unique sequences of flanking regions as primers. DNA is repeatedly denatured at a high temperature to separate the double strand, then cooled to allow annealing of primers and the extension of nucleotide sequences through the microsatellite. This process results in production of enough DNA to be visible on agarose or polyacrylamide gels; only small amounts of template DNA are
needed for amplification as thermocycling in this manner creates an exponential increase in the replicated segment (Dietrich et al., 1994, 1996).
7.3.2.3 SNP
SNP (pronounced snip) refers to a particular nucleotide (or base) in a DNA sequence that is variable within a species (or between related species). For example, at a certain position in a DNA sequence there may be a cytosine (C) present in some individuals but a thymine (T) present in others. SNPs represent the most basic form of genetic polymorphism. With the completion of the Human Genome Project in 2003, researchers began to pinpoint areas of the genome that varied among individuals. Shortly thereafter, they discovered that the most common type of DNA sequence variation found in the genome is the single nucleotide polymorphism; in fact, there are an estimated 10 million SNPs that commonly occur in the human genome. A worldwide effort known as the HapMap Project seeks to identify and localize these and other genetic variants and to learn how the variants are distributed within and among populations from different parts of the world. To date, the International HapMap Consortium (2007) has identified over 3.1 million SNPs across the human genome that are common to individuals of African, Asian, and European ancestry. Single nucleotides may be changed (substitution), removed (deletion), or added (insertion) to a polynucleotide sequence. SNPs may fall within coding sequences of genes, noncoding regions of genes, or in the intergenic regions between genes. SNPs within a coding sequence will not necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of the genetic code. A SNP in which both forms lead to the same polypeptide sequence is termed synonymous (sometimes called a silent mutation); if a different polypeptide sequence is produced, the SNP is nonsynonymous.
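The effect of a coding SNP can be worked out directly from the standard genetic code. The sketch below is illustrative only: the function name is hypothetical, it assumes the standard nuclear codon table, and it takes the reference and alternate codons as input. It labels a substitution as synonymous, missense (a different amino acid), or nonsense (a premature stop codon).

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-codon change using the standard genetic code.

    Simplification (assumption): a change that removes a stop codon is
    reported as "missense" here, because only the three classes named in
    the text are distinguished.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


for ref, alt in [("GAA", "GAG"), ("GAA", "GAC"), ("TGG", "TGA")]:
    print(ref, "->", alt, classify_coding_snp(ref, alt))
```

The first example changes only the third base of a glutamate codon and leaves the protein unchanged, which illustrates the degeneracy of the genetic code mentioned above.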
A nonsynonymous change may be either missense or nonsense: a missense change results in a different amino acid, and a nonsense change results in a premature stop codon. SNPs that are not in protein-coding regions may still have consequences for gene splicing, transcription factor binding, or the sequence of noncoding RNA.
7.3.2.4 CNV
A CNV is a segment of DNA in which copy number differences have been found by comparison of two or more genomes. The segment may range from one kilobase to several megabases in size (Cook and Scherer, 2008). The fact that DNA copy number variation is a widespread and common phenomenon among humans was first uncovered following the completion of the Human Genome Project (Sebat et al., 2004; Iafrate et al., 2004). It is estimated that approximately 0.4% of the genomes of unrelated people typically differ with respect to copy number (Kidd et al., 2008). A CNV can be discovered by cytogenetic techniques, such as fluorescence in situ hybridization, comparative genomic hybridization, and array comparative genomic hybridization, and by virtual karyotyping with SNP arrays. CNVs can be caused by genomic rearrangements (deletions, duplications, inversions, and translocations) and
may be either inherited or caused by a mutation. In humans, CNVs encompass more DNA than SNPs. CNVs can be limited to a single gene or include a contiguous set of genes and can result in having either too many or too few of the dosage sensitive genes, which may be responsible for a substantial amount of human phenotypic variability, complex behavioral traits, and disease susceptibility (Redon et al., 2006; Freeman et al., 2006).
7.3.3 Uses of Genetic Markers
Genetic markers can be used to study the relationship between an inherited disease and its genetic cause, which is commonly, but not always, a specific mutation in a gene that results in a defective protein. Because chromosomal cross-over occurs with limited frequency, DNA loci that lie near each other on a chromosome tend to be inherited together. This property enables the use of a marker to determine the inheritance pattern of a linked gene that has not yet been exactly localized. To be useful for gene mapping, genetic markers have to be easily identifiable, associated with a specific locus, and highly polymorphic.
GWAS is an approach that involves rapidly scanning genetic markers (such as SNPs and CNVs) across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat, and prevent the disease. Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases, such as asthma, cancer, diabetes, heart disease, mental illness, AMD, and glaucoma. Researchers already have reported considerable success using this new strategy. For example, in 2005, three independent studies found that a common form of blindness is associated with variation in the gene for complement factor H, which produces a protein involved in regulating inflammation (Hageman et al., 2005; Klein et al., 2005; Edwards et al., 2005). Few previously thought that inflammation might contribute so significantly to this type of blindness, which is called age-related macular degeneration. GWAS represents a recently developed research technique with many implications on both a global and an individual scale. GWAS seeks to identify the SNPs that are common to the human genome and to determine how these polymorphisms are distributed across different populations.
On a broad scale, these studies help scientists uncover associations between individual SNPs and disorders that are passed from one generation to the next in Mendelian fashion. On a small scale, GWAS can be used to determine an individual’s risk of developing a particular disorder. Genetic markers have also been used to measure the genomic response to selection in livestock. Natural and artificial selection lead to changes in the genome. The presence of different alleles due to distorted segregation at specific genetic markers indicates the difference between selected and nonselected livestock (Gomez-Raya et al., 2002).
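The comparison at the heart of a GWAS, namely whether an allele is more frequent in people with the disease than in people without it, can be illustrated for a single SNP with a 2 × 2 table of allele counts. The following is a minimal sketch, not the analysis pipeline of any study cited here: the function name and the allele counts are hypothetical, and it uses a simple Pearson chi-square test with one degree of freedom with no adjustment for covariates or for the number of markers tested.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (odds ratio, chi-square, p-value) for a 2x2 table of allele counts."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square statistic for a 2x2 table (1 degree of freedom).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles per person.
odds_ratio, chi2, p_value = allelic_association(
    case_alt=700, case_ref=1300, control_alt=600, control_ref=1400
)
print(f"OR = {odds_ratio:.2f}, chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

In a real genomewide scan the same comparison is repeated for hundreds of thousands of markers, so a far more stringent significance level than the conventional 0.05 is required before an association is reported.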
7.4 DEFINING GENOMIC REGIONS OF DISEASE LOCI USING GENETIC MARKERS
Once a mouse eye disorder has been shown to be inherited (that is, due to a new mutation), a genetic linkage cross is set up to determine the mutant gene’s chromosomal location. Each new spontaneous mutant gene that is mapped contributes to our knowledge of the functional genome of the mouse. If a recessive mutation produces lethality or sterility, the homozygotes cannot be used to breed, and progeny testing has to be used to identify heterozygotes for further propagation of the recessive mutation. With the knowledge of its genetic location, the mutation can be efficiently maintained in repulsion or in coupling with a closely linked nonlethal genetic marker. The marker used to tag the mutation is useful not only for more efficient maintenance of the mutant stock but also for developmental studies of mutant animals until the gene is characterized at the molecular level. Because each new mutation is a unique event, each requires its own cross to position the new mutation on a chromosome with respect to other genes, including candidate loci. Our preferred mapping strategy uses crosses to the M. m. castaneus–derived inbred strain CAST/Ei, since about 95% of the polymorphic loci tested in this strain differ from standard inbred strains. Progeny are genotyped by PCR analysis of polymorphic microsatellite markers, also called simple sequence length polymorphism (SSLP) markers, most of which were developed by Dr. Eric Lander’s group at the Massachusetts Institute of Technology (the so-called MIT markers) (Dietrich et al., 1994, 1996). Briefly, carriers of a new recessive or semidominant mutation are mated to CAST/Ei mice; the F1 mice are intercrossed and tissues from affected (homozygous mutant) F2 mice are saved for DNA extraction. This is efficient because each mouse analyzed represents two potentially recombinant chromosomes, one from each F1 parent.
For the initial genome scan, we use the pooled sample method to increase efficiency, typing MIT-derived simple sequence repeat markers by PCR (Taylor et al., 1994). DNA samples from homozygous F2 progeny are pooled for initial analysis. When the band for the CAST/Ei allele for a particular marker is missing or reduced in intensity, individual samples from the pooled mice are typed for that marker only. Dominant mutations are analyzed using backcrosses (BC) with CAST/Ei or other wild-derived strains. MIT microsatellite markers are typed by PCR amplification and then visualized by agarose gel separation and ethidium bromide staining. The Map Manager computer program is used to aid in linkage detection, determine gene order, and estimate genetic distances (Manly et al., 2001). Once a new mutation is positioned with respect to at least two other loci, the F2 or BC DNAs are typed for nearby MIT markers. This defines the position more precisely, correlates the mutation’s position with the molecular maps generated in other laboratories, and may identify a candidate gene or a starting point for identifying the mutated gene. This mapping information is valuable for others interested in cloning the mutant gene.
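As a rough illustration of how linkage is detected and genetic distance estimated from backcross data, the sketch below computes the recombination fraction between a mutation and a marker, an approximate map distance, and a two-point LOD score. It is not the algorithm used by Map Manager; the function name and the example counts are hypothetical, and the map distance is approximated as 100 times the recombination fraction, which is reasonable only for closely linked loci.

```python
import math


def backcross_linkage(recombinants: int, total: int):
    """Return (recombination fraction, approximate cM, LOD score) for a backcross.

    The LOD score compares the likelihood of the data at the observed
    recombination fraction with the likelihood under no linkage (0.5).
    """
    theta = recombinants / total
    nonrecombinants = total - recombinants
    log_likelihood = 0.0
    # Terms with a zero count contribute nothing (avoids log10(0)).
    if recombinants:
        log_likelihood += recombinants * math.log10(theta)
    if nonrecombinants:
        log_likelihood += nonrecombinants * math.log10(1 - theta)
    lod = log_likelihood - total * math.log10(0.5)
    return theta, 100 * theta, lod


# Hypothetical example: 5 recombinants among 100 backcross progeny.
theta, distance_cm, lod = backcross_linkage(5, 100)
print(f"theta = {theta:.3f}, ~{distance_cm:.1f} cM, LOD = {lod:.2f}")
```

A LOD score of 3 or more is the conventional threshold for declaring linkage, so even a modest number of progeny gives strong evidence when few recombinants are observed.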
7.4.1 Defining a Disease Locus Using Microsatellite Markers
A recessive mutation, retinal degeneration 10 (rd10), was first observed in a recombinant inbred strain CXB1/TyJ in the Jackson Laboratory’s distribution colony. The mutation must have occurred near the time of discovery because it was still segregating in that inbred strain. Mice homozygous for the rd10 mutation show retinal degeneration with sclerotic retinal vessels at 4 weeks of age; the phenotype is easily distinguished from the normal retinal appearance by indirect ophthalmoscope examination. Genetic analysis shows that the rd10 mutant phenotype is inherited as a single autosomal recessive mutation. To define the chromosomal location of the rd10 locus, C57BL/6J-rd10/rd10 mice were mated to CAST/EiJ mice. The F1 mice, which exhibited no retinal abnormalities, were backcrossed to C57BL/6J-rd10/rd10 mice. In the initial genomewide screen, two pools of DNA samples with equal contribution from 10 affected (retinal degeneration) or 10 unaffected (normal retina) backcrossed mice were genotyped for 57 microsatellite markers (3 per chromosome) polymorphic between the two parental strains. When the DNA band representing a CAST/EiJ allele was missing or reduced in intensity, suggesting linkage of the mutant gene to that microsatellite marker, the same marker was used to genotype the individual DNA samples to confirm the linkage. More nearby markers were then genotyped to define the map location of the mutant gene. The rd10 mutation was mapped between microsatellite markers D5Mit7 (genetic map: chr5 at 45 cM; sequence map: chr5:93726231–93726392 bp) and D5Mit291 (genetic map: chr5 at 70 cM; sequence map: chr5:126695216–126695353 bp) on mouse chromosome 5 (Chang et al., 2007a).
The progressive motor neuron degeneration (mnd) mouse was originally identified as having spontaneous adult-onset neurological disease with symptoms beginning at about 6 months of age, progressing to total spastic paralysis with premature death (Messer and Flaherty, 1986).
It was originally thought that the mnd/mnd mouse was a model for amyotrophic lateral sclerosis (ALS) based on the neurologic findings. However, while screening mouse strains for retinal degeneration and other ocular disorders, we found that the mnd homozygous mouse also has a retinal degeneration that starts at about 5 weeks of age with loss of photoreceptors and retinal atrophy by 6 months (Chang et al., 1994). To confirm the hypothesis that the retinal degeneration was a pleiotropic effect of mnd and to map the mnd locus more precisely we conducted a multipoint linkage test with markers on chromosome 8 where the mnd locus had previously been assigned. Mice homozygous for the mnd mutation on the C57BL/6J background were crossed to SWR/J; then F1 females were backcrossed to the mnd homozygotes. All backcross offspring were typed for retinal degeneration by ERG and histology and for the hindlimb paresis of the neurologic phenotype by visual examination from 4 to 12 months of age. Among the 182 backcross offspring, there were no recombinants between retinal degeneration, motor neuron degeneration, and the microsatellite marker
D8Mit124 (genetic map: chr 8 at 6 cM; sequence map: chr 8:14723160–14723288 bp). The retinal degeneration is therefore an early expression of the mnd locus, and it is tightly linked to the microsatellite marker D8Mit124. The retinal degeneration in mnd homozygotes on the C57BL/6J background is a manifestation of the mnd locus, and the gene order is D8Mit124-mnd-D8Mit94-D8Mit4 (Chang et al., 1998).
7.4.2 Defining a Disease Locus Using SNP Markers
Allele-specific PCR (AS-PCR) can be used on genomic DNA to genotype SNP markers. SNPs represent the most basic form of genetic polymorphism, and almost all common SNPs have only two alleles. If one nucleotide (or base) at a position in a DNA sequence is more common than the other, the common one can be designated wild type (W) and the other mutant type (M). Three oligonucleotide primers, namely wild type (W), mutant (M), and common (C) primers, can be designed with the Web-based Allele Specific Primer tool (http://bioinfo.biotec.or.th/WASP). The AS-PCR assay is conducted as two parallel reactions: one with the wild type and common primers (WC) and the other with the mutant and common primers (MC). For each SNP genotyped by AS-PCR, three outcomes can be distinguished: homozygous wild type, heterozygous, and homozygous mutant type. An electrophoretic band from the WC assay but none from the MC assay indicates homozygous wild type. Conversely, a band from the MC assay but none from the WC assay indicates homozygous mutant type. If a band is generated in both the WC and MC assays, the sample is from a heterozygote (Wangkumhang et al., 2007). With the AS-PCR assay, individual laboratories can use SNPs to define a disease locus or to fine-map a disease locus in the genome; we have used SNPs on mouse chromosome 2 to define a retinal disease locus (data not shown).
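The interpretation of the two AS-PCR reactions described above amounts to a small decision table, sketched below. The function name and the sample identifiers are hypothetical, and the "no call" outcome for a sample with no band in either reaction is an assumption added here (it most likely reflects a failed amplification rather than a genotype).

```python
def call_as_pcr_genotype(wc_band: bool, mc_band: bool) -> str:
    """Call a SNP genotype from the wild type+common (WC) and mutant+common (MC) reactions."""
    if wc_band and mc_band:
        return "heterozygous"
    if wc_band:
        return "homozygous wild type"
    if mc_band:
        return "homozygous mutant"
    # Assumption: neither reaction amplified, so the assay should be repeated.
    return "no call"


# Hypothetical gel readings: (band in WC lane, band in MC lane) per sample.
gel_readings = {
    "mouse_01": (True, False),
    "mouse_02": (True, True),
    "mouse_03": (False, True),
    "mouse_04": (False, False),
}
genotypes = {sample: call_as_pcr_genotype(*bands) for sample, bands in gel_readings.items()}
for sample, genotype in genotypes.items():
    print(sample, genotype)
```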
Another example is a GWAS on human pancreatic cancer to identify common variants associated with pancreatic cancer. To carry out a genomewide association study, researchers use two groups of participants: people with the disease being studied and ethnically related people without the disease. Researchers obtain DNA from each participant, usually by drawing a blood sample or by rubbing a cotton swab along the inside of the mouth to harvest cells. Each person’s complete set of DNA, or genome, is then purified from the sample cells, placed on tiny chips, and scanned on automated laboratory machines. The machines quickly survey each participant’s genome for strategically selected markers of genetic variation, or SNPs. If certain genetic variations are found to be significantly more frequent in people with the disease compared to people without disease, the variations are said to be associated with the disease. The associated genetic variations can serve as powerful pointers to the region of the human genome where the disease-causing problem resides. In a two-stage genomewide association study of pancreatic cancer, 558,542 SNPs in 1,896 individuals with pancreatic cancer and 1,939 controls were typed
and a combined analysis was performed of these groups plus an additional 2,457 affected individuals and 2,654 controls from eight case-control studies, adjusting for study, sex, ancestry, and five principal components. This GWAS identified an association between a locus on 9q34 and pancreatic cancer marked by the SNP rs505922 (combined P = 5.37 × 10⁻⁸; multiplicative per-allele odds ratio 1.20; 95% confidence interval 1.12–1.28). This SNP maps to the first intron of the ABO blood group gene. The protective allele T of rs505922 is in complete linkage disequilibrium with the O allele of the ABO locus, consistent with earlier epidemiologic evidence suggesting that people with blood group O may have a lower risk of pancreatic cancer than those with groups A or B (Amundadottir et al., 2009).
7.4.3 When Markers Are Missing in the Genome
Gene mapping was very difficult before microsatellite markers were developed in the early 1990s. To map a disease gene with the limited number of markers available in each inbred strain of mice, multiple genetic linkage crosses had to be set up, and each cross could be used for only a small part of the genome because markers present in one cross were missing in the others. For a genome-wide scan for a disease gene location in mouse, the disease strain had to be compared with many other inbred strains for the availability of genetic markers in different parts of the genome, and a total of 20 linkage crosses might be needed because each linkage cross might cover only one mouse chromosome. Since microsatellite markers were developed and used in genomic mapping, it is rare for markers to be missing in the genome; when useful markers are absent in part of the genome in one linkage cross, the solution is to set up a separate linkage cross with an inbred background that has markers present in the region of interest.
7.4.4 Limitations and Alternative Procedures
If recombination mapping is inadequate because large genomic regions are indivisible by recombination mapping, the positional complementation approach can be used to obtain higher resolution; large insert clones, such as BACs and P1s, which contain intact candidate genes, can be used in transgene rescue experiments to determine whether introduction of a wild type clone compensates for the dysfunction in mice that are homozygous for the recessive mutation under study. This in vivo complementation approach has been used successfully in several positional cloning efforts for mouse mutations (Antoch et al., 1997; Hamilton et al., 1997; Majumder et al., 1998; Probst et al., 1998). Large insert genomic DNA clones containing wild type putative candidate genes are microinjected into pronuclei of zygotes that are homozygous for the recessive mutation. The clones are selected and prioritized so that a minimum number of microinjections are needed to cover the minimal candidate interval. Mice resulting from these microinjections are then tested for the presence of the transgenic clone and their phenotypes are assessed for abnormalities. If
mice homozygous for the mutation under study but carrying a transgene are normal in phenotype, it can be deduced that the transgenic clone contains a wild type allele of the mutated gene. If mice are abnormal, conclusions are uncertain because the entire gene or all requisite regulatory sequences may not have been included in the transgenic clone. Genetic background can impact the expression of a mutation. In the event that a new mutant phenotype fails to be expressed in a cross with CAST/Ei (wild derived strain), an alternative inbred strain must be selected for the mapping cross. The mutation carriers should be crossed to the inbred strain found to have the greatest polymorphic differences as defined by biochemical, retroviral, and DNA polymorphisms that differ between the two strains. Information about comparative strain polymorphisms can be found at the Mouse Genome Informatics website.
7.5 GENE IDENTIFICATION BASED ON A DEFINED GENOMIC REGION
Identifying a gene based simply on a defined genomic region is called positional cloning. Because this is the reverse of how things have been done traditionally, this process has also been called reverse genetics. Positional cloning is a method of gene identification in which the gene for a specific phenotype is identified from only its approximate chromosomal location, without knowledge of its function. This approximate chromosomal location is also known as the candidate region. Initially, the candidate region can be defined using techniques such as linkage analysis; positional cloning is then used to narrow the candidate region until the gene and its mutations are found. For example, previous genetic mapping localized the retinal degeneration 6 (rd6) mutation to mouse chromosome 9, ∼24 cM from the centromere (Hawes et al., 2000; Chang et al., 2002). To map rd6 more precisely, a high-resolution genetic map of the region was constructed based on a total of 657 F2 progeny, and the genetic interval containing the rd6 mutation was further defined to 0.15 ± 0.11 cM. Eighteen transcripts were identified within the region, and the membrane-type frizzled-related protein (Mfrp) gene was found to be mutated in rd6 mice (Kameya et al., 2002). Positional cloning typically involves the isolation of partially overlapping DNA segments from genomic libraries to progress along the chromosome toward a specific gene. For genomes in which the regions of genetic polymorphisms are known, positional cloning involves identifying polymorphisms that flank the mutation. This process requires that DNA fragments from the closest known genetic marker are progressively cloned and sequenced, getting closer to the mutant allele with each new clone. This process produces a contig map of the locus and is known as chromosome walking. Depending on the size of the mapping population, the mutant locus can be narrowed down to a small region (<30 kb).
Sequence comparison between wild type and mutant DNA in that region
is then required to locate the DNA mutation that causes the phenotypic difference. With the completion of genome sequencing projects such as the Human Genome Project, modern positional cloning can use ready-made contigs from the genome sequence databases directly. Modern positional cloning can more directly extract information from genomic sequencing projects and existing data by analyzing the genes in the candidate region. Potential disease genes from the candidate region can then be prioritized, generally reducing the amount of work involved. Genes that have expression patterns consistent with the disease phenotype, or show a (putative) function related to the phenotype, or are homologous to another gene linked to the phenotype are all priority candidates. Generalization of positional cloning techniques in this manner is also known as positional candidate cloning or positional gene discovery.
7.5.1 Collection of Genetic Elements within a Targeted Region
The most efficient way to identify the molecular basis of a mutation is to minimize the number of genetic elements or genes that require testing. This reduction can be achieved by narrowing the mutation-containing interval between genetic recombination events in a linkage cross (Kameya et al., 2002). Only those genetic elements or transcripts within the defined interval need be further considered as candidates for the mutation. The genetic interval will be reduced by producing additional linkage cross progeny (meiotic products) to increase the probability that recombination has occurred near the mutation. Informative recombination events effectively eliminate all genes that do not co-segregate with the mutation. The identification of strategically located markers that can be used to demarcate the boundaries of the recombination events is also important for reducing the minimal genetic interval. Assuming there are 1600 cM and approximately 25,000 genes in the mouse genome, then there are approximately 16 genes per cM, thus it is important to reduce the minimum genetic interval to less than 0.5 cM (preferably <0.2 cM) to reduce the number of candidate genes to a feasible number for testing. All genes (proven and predicted) or genetic elements that reside within the minimal genetic intervals are considered candidates for the mutation, and all genetic and biological information related to each candidate is collected from all available databases, such as Mouse Genome Informatics (www.informatics.jax.org), and resources available from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). Based on the phenotype(s) of the mutation and proven or predicted phenotype(s) of candidate genes, the order in which candidate genes are tested can be prioritized. For example, a gene expressed in the retina can be tested first if the mutation’s phenotype is retinal disease (Chang et al., 2008). 
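The gene-density estimate above can be reproduced with a few lines of arithmetic. The sketch below uses the genome-wide figures quoted in the text (1600 cM and approximately 25,000 genes); the function name is hypothetical, and the calculation assumes that genes are spread evenly along the genetic map, which is only an approximation because gene density and recombination rate both vary along the chromosomes.

```python
def expected_genes(interval_cm: float, genome_cm: float = 1600, genome_genes: int = 25000) -> float:
    """Expected number of genes in an interval, assuming uniform gene density."""
    return interval_cm * genome_genes / genome_cm


genes_per_cm = expected_genes(1.0)       # about 16 genes per cM
genes_in_half_cm = expected_genes(0.5)   # target interval of 0.5 cM
genes_in_fifth_cm = expected_genes(0.2)  # preferred interval of 0.2 cM
print(round(genes_per_cm, 1), round(genes_in_half_cm, 1), round(genes_in_fifth_cm, 1))
```

Under these assumptions an interval of 0.5 cM is expected to hold about eight genes and an interval of 0.2 cM about three, which is why intervals of this size are considered feasible for candidate gene testing.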
The possibility exists that the genes underlying retinal abnormalities in the mutant mice have no similarities with known described genes or that published expression patterns and interpretations of gene function may be misleading.
Therefore, the order in which candidate genes are tested can be prioritized by, but is not limited by, their potential causal relationships with the mutant phenotype. Prioritization of candidate genes for testing is influenced by similarities or relationships with other genes or pathways known to cause retinal abnormalities or those genes that are in regions homologous with genes associated with retinal disease in humans.
7.5.2 Gene Screening and Discovery
For each of the selected candidate genes, the gene expression and DNA sequence are compared for any differences between mutant and control mice. If the tissue and time of expression are known for a candidate gene, the cDNA sequences of mutant and control mice are first compared for sequence alterations. cDNAs are prepared from total RNA with the SuperScript Preamplification System for First Strand cDNA Synthesis. The candidate gene cDNAs are used as templates to PCR-amplify overlapping fragments of both mutant and control cDNAs. These PCR-amplified DNA products are purified with the Qiagen PCR Purification Kit and sequenced using an Applied Biosystems 373A DNA Sequencer and an optimized DyeDeoxy Terminator Cycle Sequencing method. The same primers used for PCR amplification are also used for cycle sequencing. Direct sequencing of PCR products, rather than sequencing of clones derived from them, circumvents the problem of PCR amplification artifacts. Although other techniques, such as single strand conformational analysis and gradient gel electrophoresis, could be used to assay PCR products for alterations, direct sequencing of the resulting PCR products is most efficient. If the expression profile of the candidate gene is unknown or if no cDNA sequence differences are detected, the genomic DNA is assessed for differences in exons or splice sites between mutants and controls. This approach is undertaken by PCR methods because intron-exon structure and junction sequences of candidate genes are now available online from the NCBI public genome sequence database web browsers. Primers can be designed from flanking intron sequences to amplify each exon of the candidate gene, and PCR products from mutant and control genomic DNA can be sequenced. If no differences are detected in cDNA, exon, or splice site sequences, the gene expression differences are examined by quantitative PCR (Q-PCR) methods.
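Once mutant and control sequences of the same fragment are in hand, the comparison described above reduces to locating the positions at which they differ. The following minimal sketch reports single-base substitutions between two sequences of equal length. The function name and both sequences are hypothetical; insertions and deletions change the sequence length and would require a proper alignment step, which this sketch does not attempt.

```python
def find_substitutions(control: str, mutant: str):
    """Return (1-based position, control base, mutant base) for each mismatch.

    Assumes the two sequences cover the same fragment and have equal length.
    """
    if len(control) != len(mutant):
        raise ValueError("sequences differ in length; align them before comparing")
    return [
        (position, c, m)
        for position, (c, m) in enumerate(zip(control.upper(), mutant.upper()), start=1)
        if c != m
    ]


# Hypothetical cDNA fragments from control and mutant mice.
control_cdna = "ATGGCCTGGAAGCTGACC"
mutant_cdna = "ATGGCCTGAAAGCTGACC"

differences = find_substitutions(control_cdna, mutant_cdna)
print(differences)
```

In this invented example the single difference converts the codon TGG to TGA, so a change found this way would next be checked against the reading frame to see whether it alters the encoded protein.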
Mutations that disrupt gene regulatory sequences, rather than transcript sequences, can cause quantitative differences in mRNA levels. Either TaqMan primers and probe sets available through Applied Biosystems or custom designed SYBR Green assays can be used for Q-PCR. If a change in transcript level is observed between mutants and controls but no coding sequence or splice site differences are identified, noncoding genomic sequences such as promoter elements or introns with suspected deletions or insertions should next be examined. For gross gene structure rearrangements, large deletions or insertions (such as retroviral elements) can be screened by Southern blot
analysis. Genomic DNA for Southern blots can be prepared from adult spleens by standard phenol/chloroform extraction and ethanol precipitation methods. Blotting, probe labeling, and hybridization procedures used for Southern blots are as described in the literature (Johnson et al., 1992). This type of analysis was successfully used for detecting a mutation caused by the insertion of a retroviral element into an intron of the Eya1 gene (Johnson et al., 1999). For highly expressed genes, quantitative differences in expression and differences in pre-mRNA splice sites (causing altered transcript sizes) can be confirmed by northern blot analysis. For this analysis, total RNA can be extracted from appropriate tissues or embryos at known developmental stages and enriched for mRNA by PolyA selection. Probe labeling and hybridization procedures used for northern blots are the same as for Southern blots.
7.6 QUESTIONS AND ANSWERS
Q1. How would you discover a mouse model for eye disease?
Q2. Once an eye disease is discovered in mice, how can you be sure it is inherited?
Q3. What is a genetic map and how many types of genetic maps are there?
Q4. What is a genetic marker?
Q5. What is gene mapping and how is it done?
Q6. What is a GWAS and how is it performed?
Q7. What is positional cloning?
Q8. How do you identify a potential disease-causing gene from a mapped candidate region?
Q9. How do you identify the mutation(s) in a candidate gene?
A1. Eye diseases are discovered through eye screening procedures, which start with an external evaluation of the eyelids, globe, cornea, and iris first with visual inspection, and then using a biomicroscope. The eye is then dilated with 1% atropine and the lens is checked for cataract. Finally, an indirect ophthalmoscope is used to examine the fundus. If the fundus is normal, an electroretinogram (ERG) test is used to see if there is loss of retinal function in the rods and cones. Mice with a suspected abnormality are followed up with a secondary examination by examining more mice of the same strain and genetically related strains via an ERG, a histological check, comparison of mice at different ages, and comparison to established eye mutant stocks.
A2. To be sure an eye disease is inherited, the mouse with the eye disease is mated to an unrelated mouse without the eye disease to see if the eye disease (phenotype) appears in F1 progeny. If the F1 progeny have the eye disease, it is inherited in a dominant or semidominant fashion. If the F1 progeny have no eye disease, these F1 progeny should be intercrossed to see if the phenotype appears in the F2 progeny. If the phenotype is recovered in the F2 progeny, it is inherited in a recessive fashion; if the phenotype is not recovered in the F2 progeny, it is not inherited and is not a candidate for genetic study.
A3. A genetic map is a map showing the position of genes or markers on a chromosome. There are three types of genetic maps. (1) A linkage map is a type of genetic map showing relative gene positions based on meiotic recombination frequencies. The unit of measurement is the centimorgan (cM). (2) A cytogenetic map is a type of genetic map relating gene positions to chromosomal banding patterns. Cytogenetic maps are built from relating the positions of genes to cytogenetic markers or by in situ hybridization. (3) A physical map is a map of DNA showing distances between and within genes or specified markers measured in basepairs of DNA. It is based on the direct measurement of DNA.
A4. A genetic marker is a gene or DNA sequence with a known location on a chromosome and associated with a particular gene or trait. Genetic markers are used for gene mapping.
A5. Gene mapping is the process of identifying the location of a gene within the genome. Since each new mutation is a unique event, each requires its own cross to identify the position of the new mutation on a chromosome with respect to other genes, including candidate loci. A cross is set up to an inbred strain (e.g., CAST/EiJ), and progeny are genotyped by PCR analysis.

Once a new mutation is positioned with respect to at least two other loci, the F2 or backcross DNA samples are typed for nearby markers, such as MIT markers. This defines the position of the mutant gene more precisely, correlates the position of the mutant gene with the molecular maps generated in other laboratories, and may identify a candidate gene or a starting point for identifying the mutant gene. This mapping information is valuable for anyone interested in cloning the mutant gene.

A6. A GWAS is an approach that involves rapidly scanning genetic markers (such as SNPs or CNVs) across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat, and prevent the disease. GWAS is a powerful tool for discovering genetic loci. As of January 2010, the Catalog of Published Genome-Wide Association Studies listed 476 publications and a total of 2220 SNP-trait associations in many different
human diseases, including age-related macular degeneration and open-angle glaucoma.

A7. Positional cloning typically involves the isolation of partially overlapping DNA segments from genomic libraries to progress along the chromosome toward a specific gene. Depending on the size of the mapping population, the mutant allele can be narrowed down to a small region (<30 kb). Sequence comparison between wild-type and mutant DNA in that region is then required to locate the DNA mutation that causes the phenotypic difference. With the completion of genome sequencing projects such as the Human Genome Project, modern positional cloning can use ready-made contigs from the genome sequence databases directly and can draw on existing data to analyze the genes in the candidate region.

A8. All genes (proven and predicted) or genetic elements that reside within the genetic interval of the candidate region are considered candidates for containing the mutation, and all genetic and biological information related to each gene is collected from all available databases. Based on the mutation's phenotype(s), the order in which candidate genes are tested can be prioritized to reduce the amount of work involved. Genes that have expression patterns consistent with the disease phenotype, that show a (putative) function related to the phenotype, or that are homologous to another gene linked to the phenotype are all priority candidates. For example, a gene expressed in the retina should be tested first if the mutation's phenotype is retinal disease.

A9. For each of the selected candidate genes, gene expression and DNA sequence are compared between mutant and control mice. If the tissue and time of expression are known for a candidate gene, the cDNA sequences of mutant and control mice are first compared for coding sequence alterations.
This involves preparing cDNA from total RNA, using the cDNA as a template for PCR, sequencing the PCR-amplified DNA products, and comparing the results. If the expression profile of the candidate gene is unknown or if no cDNA sequence differences are detected, genomic DNA from both mutants and controls is sequenced to detect any differences in exons or splice sites.

7.7 ACKNOWLEDGMENTS
This work has been supported by the Foundation Fighting Blindness (FFB), Macula Vision Research Foundation (MVRF) and National Eye Institute Grant EY19943. I am grateful to Melissa Berry and Da Chang for their critical reading and editing of the manuscript.
c07.indd 133
1/12/2011 9:44:08 AM
134
7.8
DETERMINATION OF GENOMIC LOCATIONS OF TARGET GENETIC LOCI
REFERENCES
Amundadottir L, Kraft P, Stolzenberg-Solomon RZ, Fuchs CS, Petersen GM, Arslan AA, Bueno-de-Mesquita HB, Gross M, Helzlsouer K, Jacobs EJ, LaCroix A, Zheng W et al. (2009). Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nature Genet 41:986–90. Antoch MP, Song EJ, Chang AM, Vitaterna MH, Zhao Y, Wilsbacher LD, Sangoram AM. (1997). Functional identification of the mouse circadian clock gene by transgenic BAC rescue. Cell 89:655–67. Arora A, Minogue PJ, Liu X, Addison PK, Russel-Eggitt I, Webster AR, Hunt DM, Ebihara L, Beyer EC, Berthoud VM, Moore AT. (2008). A novel connexin 50 mutation associated with congenital nuclear pulverulent cataracts. J Med Genet 45:155–60. Bech-Hansen NT, Naylor MJ, Maybaum TA, Pearce WG, Koop B, Fishman GA, Mets M, Musarella MA, Boycott KM. (1998). Loss-of-function mutations in a calciumchannel alpha-1-subunit gene in Xp11.23 cause incomplete X-linked congenital stationary night blindness. Genet 19:264–67. Berry V, Mackay D, Khaliq S, Francis PJ, Hameed A, Anwar K, Mehdi SQ, Newbold RJ, Ionides A, Shiels A, Moore T, Bhattacharya SS. (1999). Connexin 50 mutation in a family with congenital “zonular nuclear” pulverulent cataract of Pakistani origin. Hum Genet 105:168–70. Boycott KM, Maybaum TA, Naylor MJ, Weleber RG, Robitaille J, Miyake Y, Bergen AAB, Pierpont ME, Pearce WG, Bech-Hansen NT. (2001). A summary of 20 CACNA1F mutations identified in 36 families with incomplete X-linked congenital stationary night blindness, and characterization of splice variants. Hum Genet 108:91–97. Chang B, Hawes NL, Hurd RE, Wang J, Howell D, Davisson MT, Roderick TH, Nusinowitz S, Heckenlively JR. (2005). Mouse models of ocular diseases. Vis Neurosci 22:587–93. Chang B, Hawes NL, Pardue MT, German AM, Hurd RE, Davisson MT, Nusinowitz S, Rengarajan K, Boyd AP, Sidney SS, Phillips MJ, Stewart RE, Chaudhury R, Nickerson JM, Heckenlively JR, Boatright JH. (2007a). 
Two mouse retinal degenerations caused by missense mutations in the “beta”-subunit of rod cGMP phosphodiesterase gene. Vis Res 47:624–33. Chang B, Wang X, Hawes NL, Ojakian R, Davisson MT, Lo W, Gong X. (2002a). A Gja8 (α8 connexin) Point mutation causes functional impairment of α3 connexin in semi-dominant cataracts of Lop10 mice. Hum Mol Genet 11(5):507–13. Chang B. (2008). Age-related eye diseases. In Eye, Retina, and Visual System of the Mouse. Chalupa LM, Williams RW (eds), MIT Press, Cambridge, MA, pp. 581–90. Chang B, Bronson RT, Hawes NL, Roderick TH, Peng C, Davisson MT, Heckenlively JR. (1998). Improved genetic map for the mnd gene, using the retinal degeneration aspect of the phenotype. MGI direct data submission. Chang B, Bronson RT, Hawes NL, Roderick TH, Peng C, Hageman GS, Heckenlively JR. (1994). Retinal degeneration in motor neuron degeneration: a mouse model of ceroid lipofuscinosis. Invest Ophthalmol Vis Sci 35(3):1071–77. Chang B, Hawes NL, Hurd RE, Davisson MT, Nusinowitz S, Heckenlively JR. (2002b). Retinal degeneration mutants in the mouse. Vision Res 42(4):517–25.
c07.indd 134
1/12/2011 9:44:08 AM
REFERENCES
135
Chang B, Hawes NL, Roderick TH, Smith RS, Heckenlively JR, Horwitz J, Davisson MT. (1999). Identification of a missense mutation in the alphaA-crystallin gene of the lop18 mouse. Mol Vis 10(5):21. Chang B, Hawes NL, Smith RS, Heckenlively JR, Davisson MT, Roderick TH. (1996). CHROMOSOMAL localization of a new mouse lens opacity gene (lop18). Genomics 36(1):171–73. [Correction 39(2):237.] Chang B, Heckenlively JR, Bayley PR, Brecha NC, Davisson MT, Hawes NL, Hirano AA, Hurd RE, Ikeda A, Johnson BA, McCall MA, Morgans CW, Nusinowitz S, Peachey NS, Rice DS, Vessey KA, Gregg RG. (2007b).The nob2 mouse, a null mutation in Cacna1f: anatomical and functional abnormalities in the outer retina and their consequences on ganglion cell visual responses. Vis Neurosci 23(1):11–24. Chang B, Heckenlively JR, Hawes NL, Roderick TH. (1993). New mouse primary retinal degeneration (rd-3). Genomics 16(1):45–49. Chang B, Mandal MN, Chavali VR, Hawes NL, Khan NW, Hurd RE, Smith RS, Davisson ML, Kopplin L, Klein BE, Klein R, Iyengar SK, Heckenlively JR, Ayyagari R. (2008). Age-related retinal degeneration (arrd2) in a novel mouse model due to a nonsense mutation in the Mdm1 gene. Hum Mol Genet 17(24):3929–41. Cook EH, Scherer SW. (2008). Copy-number variations associated with neuropsychiatric conditions. Nature 455(7215):919–23. Devi RR, Vijayalakshmi P. (2006). Novel mutations in GJA8 associated with autosomal dominant congenital cataract and microcornea. Molec Vis 12:190–95. Dietrich WF, Miller J, Steen R, Merchant MA, Damron-Boles D, Husain Z, Dredge R, Daly MJ, Ingalls KA, O’Connor TJ, et al. (1996). A comprehensive genetic map of the mouse genome. Nature 380(6570):149–52. [Erratum 381(6578):172.] Dietrich WF, Miller JC, Steen RG, Merchant M, Damron D, Nahf R, Gross A, Joyce DC, Wessel M, Dredge RD, et al. (1994). A genetic map of the mouse with 4,006 simple sequence length polymorphisms. Nat Genet 7(2 Spec no):220–45. 
Edwards AO, Ritter R III, Abel KJ, Manning A, Panhuysen C, Farrer LA. (2005). Complement factor H polymorphism and age-related macular degeneration. Science 308(5720):421–24. Fiers W, Contreras R, Duerinck F, et al. (1976). Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature 260(5551):500–07. Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME, Carter NP, Scherer SW, Lee C. (2006). Copy number variation: new insights into genome diversity. Genome Res 16:949–61. Gilbert W, Maxam A. (1973). The nucleotide sequence of the lac operator. Proc Natl Acad Sci U S A 70 (12):3581–84. Gomez-Raya L, Olsen HG, Lingaas F, Klungland H, Våge DI, Olsaker I, Talle SB, Aasland M, Lien S. (2002). The use of genetic markers to measure genomic response to selection in livestock. Genetics 162:1381–88. Hageman GS, Anderson DH, Johnson LV, Hancox LS, Taiber AJ, Hardisty LI, Hageman JL, Stockman HA, Borchardt JD, Gehrs KM, Smith RJ, Silvestri G, Russell SR, Klaver CC, Barbazetto I, Chang S, Yannuzzi LA, Barile GR, Merriam JC, Smith RT, Olsh AK, Bergeron J, Zernant J, Merriam JE, Gold B, Dean M, Allikmets R. (2005). A common haplotype in the complement regulatory gene factor H (HF1/CFH)
c07.indd 135
1/12/2011 9:44:08 AM
136
DETERMINATION OF GENOMIC LOCATIONS OF TARGET GENETIC LOCI
predisposes individuals to age-related macular degeneration. Proc Natl Acad Sci U S A 102(20):7227–32. Hamilton BA, Smith DJ, Mueller KL, Kerrebrock AW, Bronson RT, van Berkel V, Daly MJ. (1997). The vibrator mutation causes neurodegeneration via reduced expression of PITPa: positional complementation cloning and extragenic suppression. Neuron 18:711–22. Hawes NL, Chang B, Hageman GS, Nusinowitz S, Nishina PM, Schneider BS, Smith RS, Roderick TH, Davisson MT, Heckenlively JR. (2000). Retinal degeneration 6 (rd6): a new mouse model for human retinitis punctata albescens. Invest Ophthalmol Vis Sci 41(10):3149–57. Hemara-Wahanui A, Berjukow S, Hope CI, Dearden PK, Wu SB, Wilson-Wheeler J, Sharp DM, Lundon-Treweek P, Clover GM, Hoda JC, Striessnig J, Marksteiner R, Hering S, Maw MA. (2005). A CACNA1F mutation identified in an X-linked retinal disorder shifts the voltage dependence of Ca(v)1.4 channel activation. Proc Nat Acad Sci U S A 102: 7553–58. Hindorff LA, Junkins HA, Mehta JP, Manolio TA. (2010). A Catalog of Published Genome-Wide Association Studies. Available at www.genome.gov/gwastudies; accessed 1/16/2010. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS and Manolio TA. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106(23):9362–67. Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. (2004). Detection of large-scale variation in the human genome. Nature Genet 36:949–51. Jalkanen R, Bech-Hansen NT, Tobias R, Sankila EM, Mantyjarvi M, Forsius H, de la Chapelle A, Alitalo T. (2007). A novel CACNA1F gene mutation causes Aland Island eye disease. Invest Ophthal Vis Sci 48:2498–502. Jalkanen R, Mantyjarvi M, Tobias R, Isosomppi J, Sankila EM, Alitalo T, Bech-Hansen NT. (2007). X linked cone-rod dystrophy, CORDX3, is caused by a mutation in the CACNA1F gene [Letter]. J Med Genet 43:699–704. 
Johnson KR, Cook SA, Davisson MT. (1992). Chromosomal localization of the murine gene and two related sequences encoding high-mobility-group I and Y proteins. Genomics 12:503–09. Johnson KR, Cook SA, Erway LC, Matthews AN, Sanford LP, Paradies NE, Friedman RA. (1999). Inner ear and kidney anomalies caused by IAP insertion in an intron of the Eya1 gene in a mouse model of BOR syndrome. Hum Molec Genet 8: 645–53. Johnson PH, Hopkinson DA. (1992). Detection of ABO blood group polymorphism by denaturing gradient gel electrophoresis. Hum Molec Genet 1:341–44. Jones N, Ougham H, Thomas H. (2009). Pasakinskiene I. Markers and mapping revisited: finding your gene. New Phytol 183(4):935–67. Kameya S, Hawes NL, Chang B, Heckenlively JR, Naggert JK, Nishina PM. (2002). Mfrp, a gene encoding a frizzled related protein, is mutated in the mouse retinal degeneration 7. Hum Mol Genet 11(16):1879–87. Kidd JM, Cooper GM, Donahue WF, et al. (2008). Mapping and sequencing of structural variation from eight human genomes. Nature 453(7191):56–64.
c07.indd 136
1/12/2011 9:44:08 AM
REFERENCES
137
Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, Barnstable C, Hoh J. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308(5720):385–89. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. (2001). Initial sequencing and analysis of the human genome. Nature 409:860–921. Majumder K, Shawlot W, Schuster G, Harrison W, Elder FFB, Overbeek PA. (1998). YAC rescue of downless locus mutations in mice. Mamm Genome 9:863–68. Manly KF, Cudmore RH, Jr., Meer JM. (2001). Map Manager QTX, cross-platform software for genetic mapping. Mamm Genome 12:930–32. Maxam AM, Gilbert W. (1977). A new method for sequencing DNA. Proc Natl Acad Sci U S A 74(2): 560–64. Messer A, Flaherty L. (1986). Autosomal dominance in a late-onset motor neuron disease in the mouse. J Neurogenet 3(6):345–55. Min Jou W, Haegeman G, Ysebaert M, Fiers W. (1972). Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature 237(5350):82–88. Nakamura M, Ito S, Piao CH, Terasaki H, Miyake Y. (2003). Retinal and optic disc atrophy associated with a CACNA1F mutation in a Japanese family. Arch Ophthal 121:1028–33. Nakano M, Ikeda Y, Taniguchi T, Yagi T, Fuwa M, Omi N, Tokuda Y, Tanaka M, Yoshii K, Kageyama M, Naruse S, Matsuda A, Mori K, Kinoshita S, Tashiro K. (2009). Three susceptible loci associated with primary open-angle glaucoma identified by genomewide association study in a Japanese population. Proc Natl Acad Sci U S A 106(31): 12838–42. OMIA (Online Mendelian Inheritance in Animals). Available at www.ncbi.nlm.nih. gov/omia, accessed January 9, 2010. OMIM (Online Mendelian Inheritance in Man). Available at www.ncbi.nlm.nih.gov/ omim, accessed January 9, 2010. Pearson H. (2006). Genetics: what is a gene? Nature 441:398–401. Pennisi E. (2007). DNA study forces rethink of what it means to be a gene. Science 316:1556–57. 
Polyakov AV, Shagina IA, Khlebnikova OV, Evgrafov OV. (2001). Mutation in the connexin 50 gene (GJA8) in a Russian family with zonular pulverulent cataract [Letter]. Clin Genet 60:476–78. Probst FJ, Fridell RA, Raphael Y, Saunders TL, Wang A, Liang Y, Morell RJ. (1998). Correction of deafness in shaker-2 mice by an unconventional myosin in a BAC transgene. Science 280:1444–47. Queller DC, Strassman JE, Hughes CR. (1993). Microsatellites and kinship. Trends in ecology and evolution. 8:285–88. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles
c07.indd 137
1/12/2011 9:44:08 AM
138
DETERMINATION OF GENOMIC LOCATIONS OF TARGET GENETIC LOCI
ME. (2006). Global variation in copy number in the human genome. Nature 444:444–54. Sanger F, Coulson AR. (1975). A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94(3):441–48. Sanger F, Nicklen S, Coulson AR. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74(12):5463–67. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, Navin N, Lucito R, Healy J, Hicks J, Ye K, Reiner A, Gilliam TC, Trask B, Patterson N, Zetterberg A, Wigler M. (2004). Large-scale copy number polymorphism in the human genome. Science 305:525–28. Shiels A, Mackay D, Ionides A, Berry V, Moore A, Bhattacharya S. (1998). A missense mutation in the human connexin50 gene (GJA8) underlies autosomal dominant “zonular pulverulent” cataract, on chromosome 1q. Am J Hum Genet 62:526–32. Shiels A, Mackay D, Ionides A, Berry V, Moore A, Bhattacharya S. (1997). A missense mutation in the GJA8 gene underlies autosomal dominant cataract on human chromosome 1q. Am J Hum Genet 61(suppl.):A21. Strom TM, Nyakatura G, Apfelstedt-Sylla E, Hellebrand H, Lorenz B, Weber BHF, Wutz K, Gutwillinger N, Ruther K, Drescher B, Sauer C, Zrenner E, Meitinger T, Rosenthal A, Meindl A. (1998). An L-type calcium-channel gene mutated in incomplete X-linked congenital stationary night blindness. Nature Genet 19:260–63. Taylor BA, Navin A, Phillips SJ. (1994). PCR-amplification of simple sequence repeat variants from pooled DNA samples for rapidly mapping new mutations of the mouse. Genomics 21:626–32. The Eye Diseases Prevalence Research Group. (2004). Prevalence of cataract and pseudophakia/aphakia among adults in the United States. Arch Ophthalmol 122:487–94. The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–61. Ugozzoli L, Wallace RB. (1992). 
Application of an allele-specific polymerase chain reaction to the direct determination of ABO blood group genotypes. Genomics 12:670–74. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. (2001). The sequence of the human genome. Science 291:1304–51. Wangkumhang P, Chaichoompu K, Ngamphiw C, Ruangrit U, Chanprasert J, Assawamakin A,Tongsima S. (2007). WASP: a Web-based allele-specific PCR assay designing tool for detecting SNPs and mutations. BMC Genomics 8:275. Willoughby CE, Arab S, Gandhi R, Zeinali S, Arab S, Luk D, Billingsley G, Munier FL, Heon E. (2003). A novel GJA8 mutation in an Iranian family with progressive autosomal dominant congenital nuclear cataract. J Med Genet 40:e124. Wutz K, Sauer C, Zrenner E, Lorenz B, Alitalo T, Broghammer M, Hergersberg M, de la Chapelle A, Weber BHF, Wissinger B, Meindl A, Pusch CM. (2002). Thirty distinct CACNA1F mutations in 33 families with incomplete type of XLCSNB and Cacna1f expression profiling in mouse retina. Eur J Hum Genet 10:449–57.
c07.indd 138
1/12/2011 9:44:08 AM
CHAPTER 8
Mutation Discovery Using High-Throughput Mutation Screening Technology
KAI LI, HANLIN GAO, HONG-GUANG XIE, WANPING SUN, and JIA ZHANG

Contents
8.1 Introduction
8.2 Classical Technologies for High-Throughput Mutation Analysis
8.2.1 Hybridization-Based Assays
8.2.2 Configuration-Based Assays
8.2.3 Primer Extension–Based Assays
8.2.4 Sequencing-Based Assays
8.3 Microarray-Based Mutation Detection
8.4 Miscellaneous Technological Advances for Mutation Detection
8.4.1 Restriction Fragment Length Polymorphism
8.4.2 Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry
8.4.3 TaqMan Assay and Real-Time PCR
8.5 Summary
8.6 Acknowledgments
8.7 References

8.1 INTRODUCTION
The term mutation refers to a base-pair sequence in a gene that differs from the corresponding sequence in the so-called reference genome. The major genetic material of human beings is DNA, which contains four predominant nucleotides, a few epigenetically modified nucleotides such as 5-methyl C and 6-methyl A, and many
nonenzymatically modified nucleotides such as dU and 8-oxo-G. Thus the genome contains hundreds or more modified nucleotides. The human genome contains four or five sequence-specific nucleotides, depending on the life stage. Therefore, from a medical point of view, the term mutation covers changes in the sequence of the A, T, C, and G nucleotides as well as chemical modifications of these bases, because epimutation and loss of genetic imprinting are both disease related.

Mutation can be used to describe a variety of sequence variations, such as point mutations, insertions or deletions of a few bases, short tandem repeats, gene copy number variations (CNVs), and structural variants (gross deletions, insertions, inversions, and rearrangements). A single nucleotide polymorphism (SNP) is a DNA sequence variation that occurs when a single nucleotide (A, T, C, or G) in the genome differs among members of a species. An SNP or mutation that lies in a coding region may be synonymous or nonsynonymous, depending on how it changes the affected codon.

Accumulating evidence has documented that genetic diseases and cancers are caused by gene mutations. Many polygenic diseases, as well as disease susceptibility, biodiversity, and interindividual variability in drug metabolism and response, may all be the result of genetic variation (or mutation). Identifying the mutation is therefore the first step in investigating its functional significance, and various methodologies and techniques for screening and detecting genetic mutations have been developed for genotype–phenotype correlation studies. In this chapter, the principles of some of these technologies and their applications in the identification and detection of genetic mutations are discussed.
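The distinction between synonymous and nonsynonymous coding SNPs can be illustrated with a short Python sketch that translates a codon before and after a single-base change using the standard genetic code. The function name and the example codon are hypothetical: GCT is used only because a G-to-T change at its first position gives an Ala-to-Ser substitution of the kind labeled in Figure 8.1, and it is not claimed to be the actual MDR1 codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_snp(codon, position, alt_base):
    """Return (reference amino acid, variant amino acid, class) for one SNP.

    codon: three-letter reference codon on the coding strand (DNA alphabet).
    position: 0, 1, or 2, the position within the codon that changes.
    alt_base: the variant base at that position.
    """
    variant = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[variant]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind

print(classify_snp("GCT", 0, "T"))  # first-position change: Ala (A) to Ser (S)
print(classify_snp("GCT", 2, "C"))  # third-position change: Ala (A) stays Ala
```

A fuller classifier would also separate nonsense changes (a variant codon that translates to a stop) from missense changes; the sketch keeps only the two classes named in the text.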

8.2 CLASSICAL TECHNOLOGIES FOR HIGH-THROUGHPUT MUTATION ANALYSIS

8.2.1 Hybridization-Based Assays

The term hybridization is defined as the annealing of an allele-specific probe to its target sequence, which requires that the two share a stretch of perfectly or largely complementary sequence. Under optimized assay conditions, even a one-base mismatch can destabilize the hybrid, so that annealing of the designed allele-specific probe to a mismatched target sequence is less efficient. When such allele-specific probes are immobilized on a solid support, labeled target DNA samples are captured by hybridization and visualized by detecting the label after the unbound targets are washed away. The genotype of the target DNA samples can then be inferred from the location of the known probes on the solid support. Allele-specific hybridization is the basis of several high-throughput genotyping methods. Conventional hybridization and macroarray hybridization were widely used for genetic studies in past decades. However, oligo-based hybridization, derived from classic Southern and northern blotting, is more powerful both in throughput and in mutation discrimination, as will be discussed in detail later.
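The genotype-calling logic of allele-specific hybridization can be sketched in Python under a deliberately idealized assumption: a probe captures a target only when the number of mismatches does not exceed a threshold (zero by default). Real assays depend on hybridization thermodynamics and wash conditions, which this sketch ignores. The sequences, allele names, and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def mismatches(probe, target_site):
    """Number of positions at which the probe fails to pair with the target site."""
    if len(probe) != len(target_site):
        raise ValueError("probe and target site must have the same length")
    expected = reverse_complement(target_site)
    return sum(p != e for p, e in zip(probe, expected))

def call_genotype(sample_sites, probes, max_mismatches=0):
    """Names of the allele-specific probes that capture at least one target.

    sample_sites: target-site sequences present in the labeled sample
                  (one for a homozygote, two for a heterozygote).
    probes: dict mapping an allele name to its immobilized probe sequence.
    """
    return sorted(name for name, probe in probes.items()
                  if any(mismatches(probe, site) <= max_mismatches
                         for site in sample_sites))

# Hypothetical 15-base target sites that differ at the central base (C versus T).
REF_SITE = "AGGTCAGCTAAGGCT"
VAR_SITE = "AGGTCAGTTAAGGCT"
PROBES = {"ref": reverse_complement(REF_SITE), "var": reverse_complement(VAR_SITE)}

print(call_genotype([REF_SITE], PROBES))            # homozygous reference sample
print(call_genotype([REF_SITE, VAR_SITE], PROBES))  # heterozygous sample
```

On an array, each probe name would correspond to a known location on the solid support, so the returned list plays the role of the set of spots that light up.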
Figure 8.1. (See color insert.) (a) Band patterns of single-stranded PCR products as visualized on a gel differ with the change in their conformations (panel labels: PCR, Conformation, Electrophoresis). (b) An example of how to identify two variants in the MDR1 gene using SSCP (bands shown: variant 2677T, Ser893; MDR1*1 2677G, Ala893). (Kim et al., 2001.)

8.2.2 Configuration-Based Assays

Single-stranded conformation polymorphism (SSCP) is a widely used technique for screening and identifying new genetic mutations and, to a lesser extent, for determining known genetic polymorphisms. Technically, it is a gel-based separation of denatured (and therefore single-stranded) PCR products, which migrate differently during gel electrophoresis because a subtle variation in sequence (such as a single base change) alters their secondary structure. Since it was first reported in 1989 (Orita et al., 1989), SSCP has been extensively applied to the screening and detection of DNA polymorphisms and mutations.

The basic principle of SSCP is that, in the absence of a complementary strand, a single strand is comparatively unstable and folds through intrastrand base pairing into loops and folds that give it a unique 3D conformation. Even a very small change in sequence alters this conformation and therefore the mobility of the single-stranded PCR product, as shown in Figure 8.1. In contrast, the mobility of double-stranded PCR products in gel electrophoresis depends on their size or length but is relatively independent of small changes in sequence.

The key steps of an SSCP analysis are as follows:
1. A highly specific primer pair (forward and reverse) is designed to amplify the desired DNA fragment from each individual. For optimal results, the PCR product is typically 150–300 bp long, depending on the sequence of the amplified fragment, such as its G/C content (Vorechovsky, 2005).
2. Single-stranded PCR products are easily produced by denaturing the PCR products with a loading dye in a water bath set at 95°C and
immediately cooling them on ice until they are loaded onto a TBE gel in a gel box filled with 1× TBE running buffer. Note that the denatured samples must be cooled for a sufficient time to give good results.
3. Gel electrophoresis should be run steadily at a constant temperature and voltage.
4. Finally, the gel can be stained with either ethidium bromide (Figure 8.1b) or silver. Samples that show an altered band pattern on the gel need to be sequenced to identify the exact mutation.

Following these steps, PCR-based SSCP has been extensively applied to identify and detect polymorphisms in the genes that encode drug-metabolizing enzymes, drug transporters, and drug receptors. For example, SSCP has been used to detect simultaneously three functionally significant CYP2C9 variants, *3 (A1075C), *4 (T1076C), and *5 (C1080G) (Xie and Kim, 2004); several variants in the MDR1 gene (Kim et al., 2001; Xie, 2011); and the two functionally important nonsynonymous mutations often present in the β-2 adrenergic receptor (ADRB2) gene, Arg16Gly and Gln27Glu (Xie et al., 1999; Kanki et al., 2002). Classical SSCP can be made more efficient by replacing isotope-based visualization with fluorescence, and mutation discrimination can be enhanced simply by changing the temperature or running buffer (Feng et al., 2007; Mroske et al., 2007).

Like PCR-based SSCP, denaturing high-performance liquid chromatography (d-HPLC) is a high-throughput method used for the screening and detection of polymorphisms. Its working principle is the use of reversed-phase HPLC to analyze the mobility of different homoduplex and/or heteroduplex amplicons, because the conformations of amplified PCR products differ under partially denaturing conditions (which make the amplicons, at least in part, single stranded) (Xie, 2011).
So far, d-HPLC has been used to identify unknown polymorphisms, such as a single nucleotide polymorphism (SNP), an insertion, or a deletion of a few bases, with high sensitivity in PCR products of up to 1 kb in length and in a highly automated fashion. For example, it has been used to determine polymorphisms in the genes CYP2B6 (Zanger et al., 2002) and MDR1 (Hitzl et al., 2001) with high reproducibility and specificity.
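Step 1 of the SSCP procedure above suggests a simple pre-check that can be sketched in Python: locate the amplicon defined by a primer pair on a template, then report its length against the 150–300 bp range and its G/C content. The function names, primers, and template are hypothetical, and exact string matching stands in for real primer-binding behavior.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_amplicon(template, forward, reverse):
    """Return the amplified region of the template, or None if a primer has no exact site."""
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)  # where the reverse primer anneals
    site = template.find(reverse_site, start + len(forward))
    if site == -1:
        return None
    return template[start:site + len(reverse_site)]

def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def in_sscp_size_range(seq, min_len=150, max_len=300):
    """True if the amplicon length falls in the range quoted in step 1."""
    return min_len <= len(seq) <= max_len

# Hypothetical template: 4 nt of flank, a 200 bp amplicon, 4 nt of flank.
FORWARD = "ATGGCCTTAGCAGGATCCAA"
REVERSE_SITE = "GGTTCAAGCTTGACCTGAGT"
REVERSE = reverse_complement(REVERSE_SITE)
TEMPLATE = "TTTT" + FORWARD + "ACGT" * 40 + REVERSE_SITE + "AAAA"

product = find_amplicon(TEMPLATE, FORWARD, REVERSE)
print(len(product), in_sscp_size_range(product), round(gc_content(product), 2))
```

The text gives no numeric G/C threshold, so the sketch only reports G/C content and leaves its interpretation to the user.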

8.2.3 Primer Extension–Based Assays

Primer extension, a very robust allele discrimination mechanism, typically refers either to sequencing (or allele-specific nucleotide incorporation, used to determine the identity of the base or bases incorporated at the polymorphic site in the amplified target DNA sequence) or to allele-specific PCR, in which the target DNA is amplified only when the 3′ end of a PCR primer is perfectly complementary to the target DNA sequence (Kwok, 2001). Practically,
allele-specific primer extension is not always specific, because some false positives occur when Taq DNA polymerase is used. In recent years, both exo- and exo+ DNA polymerases have been widely used in the development of mutation detection technologies.

8.2.3.1 Exo- Polymerase-Mediated Primer-Dependent Mutation Assay

8.2.3.1.1 Allele-Specific Primer Amplification and Apyrase-Mediated Primer Amplification

Allele-specific amplification (ASA), also called PCR amplification of specific alleles (PASA), allele-specific PCR (ASPCR), and the amplification refractory mutation system (ARMS) (Sommer et al., 1992), uses a pair of primers designed to amplify the mutation-containing region. Typically, one of the two primers overlaps the mutated base at its 3′ end, although primers in which the mismatch lies one or more bases from the 3′ end can also work (Sarkar et al., 1990). Primer extension is expected to occur only with a matched primer and not with a mismatched one. Direct haplotyping can be performed by amplifying with downstream and upstream primers that are each allele-specific for one of two variants lying close enough together for PCR to be performed successfully (Sarkar and Sommer, 1991). Extension products generally are visualized using gel electrophoresis.

Some variants of allele-specific primer extension have improved the SNP assay and doubled its efficiency by detecting two alleles in a single-tube reaction. Example strategies are the use of allele-specific primers of different lengths or of different tags on the 5′ ends of the allele-specific primers. The former is gel based, and the latter is real-time PCR based. With universally labeled primers targeting the specific tags on the allele-specific primers, a closed-tube, real-time PCR-based SNP assay is able to discriminate more than one allele in a single reaction. With optimization, ASA assays can be developed for virtually any SNP (Sommer et al., 1992).
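The idea behind allele-specific amplification can be sketched in Python under an idealized rule: a primer is extended only when it matches the template exactly, including its 3′-terminal base on the SNP. The sketch therefore models the intended behavior and not the mismatch extension that causes false positives in practice. The sequences, the G/T alleles, and the function names are hypothetical.

```python
def allele_specific_primer(template, snp_index, allele, length=20):
    """Forward primer whose 3'-terminal base lies on the SNP and equals `allele`."""
    start = snp_index - length + 1
    if start < 0:
        raise ValueError("not enough upstream sequence for a primer of this length")
    return template[start:snp_index] + allele

def is_extended(primer, sample, snp_index):
    """Idealized rule: extension occurs only on a perfect match ending at the SNP."""
    start = snp_index - len(primer) + 1
    return sample[start:snp_index + 1] == primer

def asa_genotype(sample_copies, primers, snp_index):
    """Map each allele to True if its primer is extended on any copy in the sample."""
    return {allele: any(is_extended(primer, copy, snp_index) for copy in sample_copies)
            for allele, primer in primers.items()}

# Hypothetical locus with a G/T SNP immediately after a 25-base upstream flank.
UPSTREAM = "ACGTTGCAAGCTTGGATCCATGCAA"
DOWNSTREAM = "CTGAAGGTCA"
SNP_INDEX = len(UPSTREAM)
REF_COPY = UPSTREAM + "G" + DOWNSTREAM
VAR_COPY = UPSTREAM + "T" + DOWNSTREAM
PRIMERS = {a: allele_specific_primer(REF_COPY, SNP_INDEX, a) for a in ("G", "T")}

print(asa_genotype([REF_COPY], PRIMERS, SNP_INDEX))            # homozygous G
print(asa_genotype([REF_COPY, VAR_COPY], PRIMERS, SNP_INDEX))  # heterozygous G/T
```

In a two-tube assay each allele-specific primer would be paired with a common reverse primer, and the presence or absence of a product on the gel corresponds to the True or False values returned here.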

One problem in allele-specific primer extension using exo- polymerases is extension from mismatched primers (Goodman, 1988). Introducing an artificially mismatched nucleotide near the 3′ terminus can minimize the products from mismatched primers. However, this strategy could potentially increase the occurrence of false negatives. Another approach for circumventing the false positives observed with exo- polymerases in allele-specific primer extension is to combine the polymerase with apyrase, a nucleotide-degrading enzyme that removes nucleotides before the slower extension from a mismatched primer can take place, as in apyrase-mediated allele-specific extension (AMASE) (Kaller et al., 2004). AMASE has been formatted for both real-time PCR and microarray genotyping (O'Meara, 2002). The use of the 3′ to 5′ exonuclease activity intrinsic to exo+ polymerases in allele-specific primer extension has had an enormous impact on assay reliability and has led to the development of three new types of SNP assays, as described in detail below.

8.2.3.2 Pyrophosphorolysis and Pyrophosphorolysis Activated Polymerization (PAP)

With their ubiquitous physiological role in vivo, exo+ polymerases have surprisingly few in vitro applications, especially given the
broad application of polymerases in fundamental and clinical research (Bi and Stambrook, 1998; Mir, 2000; Kwok, 2001). Exo+ proofreading polymerases are used only in situations that demand high fidelity, such as subcloning or long-range PCR. In all other cases, exo- polymerase seems to be the enzyme of choice, particularly in conventional SNP genotyping and sequencing analyses. Therefore, most polymerases isolated from nature have been genetically engineered to delete the intrinsic 3′ exonuclease activity. A report in 1992 demonstrated the potential of primer-3′-terminal modification for adapting PCR to high-fidelity polymerases (Skerra, 1992). Recently, a variety of mutation assays directly employing the proofreading mechanism of the exo+ polymerase were developed. These three assays can be expanded with different experimental designs and modifications of the reaction components. For example, the application of ddNTP with a phosphorothioate-modified primer forms a novel single-base extension for SNP screening (Di Giusto and King, 2003, 2004; King, 2004). With single- or double-labeled allele-specific primers, the 3′ terminal labeled primer extension can be used in combination with polarization, molecular beacon, and fluorescence resonance energy transfer platforms (Li and Zhang, 2003b; Zhang and Li, 2001, 2003b; Zhang, 2003a).

8.2.3.2.1 SNP-Triggered Off/On Switch

The SNP-triggered off/on switch (also called the reversed on/off switch) works in a manner complementary to proofreading assays with 3′ exonuclease-resistant or 3′ labeled primers (Bi and Stambrook, 1998; Zhang and Li, 2003a). With the introduction of inert allele-specific primers or proprimers, matched amplicons turn off DNA polymerization, while mismatched amplicons turn it on. Inert primers of a matched amplicon are not sent to the 3′ exo domain of the high-fidelity DNA polymerases. In this circumstance, the inert primers stay inactivated and no DNA polymerization can occur.
However, the inert primers of a mismatched amplicon trigger the 3′ exonuclease excision process, which removes the mismatched 3′ terminal nucleotide. In effect, proofreading the mismatched primer activates the inert primer in the reversed on/off switch system. After removal of the 3′ terminal nucleotide from the original inert primer, the product of 3′ exonuclease digestion possesses an active 3′ hydroxyl group for DNA polymerization. Technically, a negative result with no PCR product indicates that the template is matched with the 3′ terminus of the inert primer, which is a one-to-one complementary relationship in nucleotide detection. The off/on switch is a robust mutation detection tool because it detects all three mismatched nucleotides, other than the complementary one, with positive results. At this moment, two types of 3′ dehydroxylated primers have been evaluated: the 3′ phosphorylated and the 3′ hydrogenized. The 3′ phosphorylated primers can always be extended, regardless of whether low-fidelity or high-fidelity polymerases are used. Fortunately, the 3′ hydrogenized primer works well as an inert primer that cannot be extended without activation through removal of the 3′ terminal nucleotide residue.
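The decision logic of the off/on switch can be summarized in a short Python sketch. It is an idealized model only: it assumes a fully inert (3′ hydrogenized) primer, a homozygous template, and complete excision of every mismatch. The function names and the `matching_base` parameter are hypothetical and are introduced here for illustration.

```python
# Idealized model of the SNP-triggered off/on switch (hypothetical names).
BASES = "ACGT"


def off_on_product(primer_3prime_base, matching_base, proofreading=True):
    """Return True if a PCR product is expected from an inert (3'-blocked) primer.

    matching_base is the base that would pair perfectly with the template at
    the interrogated site; it is an assumption of this sketch, not an assay input.
    """
    mismatched = primer_3prime_base != matching_base
    # Only a mismatch that an exo+ polymerase excises exposes a free 3' hydroxyl.
    return mismatched and proofreading


def call_base(matching_base):
    """Run four idealized reactions, one inert primer per 3' base, and call the site."""
    products = {base: off_on_product(base, matching_base) for base in BASES}
    silent = [base for base, made in products.items() if not made]
    # Exactly one silent reaction identifies the matched base.
    call = silent[0] if len(silent) == 1 else None
    return products, call


if __name__ == "__main__":
    products, call = call_base("G")
    print(products, "->", call)
```

In this model the single reaction without product identifies the matched base, while the three mismatched primers all give positive results, as the text describes.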
8.2.3.2.2 The 3′ Terminally Labeled Primer Extension

Terminal-labeled primer extension is a SNP assay consisting of 3′ terminal-labeled allele-specific primers and DNA polymerases with proofreading activity (Zhang and Li, 2001). Perfect-match primers generate labeled product, whereas 3′ mismatched primers yield either no product or products without a label, depending on annealing temperature and duration. At the stage of signal detection, perfect-match primers show positive results, and 3′ terminally mismatched primers show negative results. Both 3′ terminal [3H]-labeled and fluorescent-labeled primers have been applied in SNP assays (Zhang, 2003c). A mismatched 3′ terminal nucleotide that bears the signal to be detected is removed by the proofreading function, whereas the label is retained in the products when primer and template are perfectly matched. Interestingly, Rox-labeled 3′ terminal nucleotides were difficult to remove under standard PCR conditions, suggesting potential interference by the label with 3′ exonuclease digestion.

The terminal-labeled primer extension approach has several advantages over current SNP assays. Its most significant advantage is that it greatly decreases false positives. A high rate of false positives is one of the main obstacles to the clinical application of current SNP technologies. This advantage is a consequence of the proofreading activity of exo+ polymerases. The second advantage is high sensitivity: terminally labeled primer extension harnesses the power of PCR to improve the efficiency of genetic analysis. Cahill and co-workers (2003) applied terminally labeled primer extension to a large-scale SNP assay. Compared with allele-specific oligonucleotide hybridization as a control, by which SNPs were called 98.1% of the time, terminally labeled primer extension had a higher call rate with 100% accuracy.
As stated earlier, one of the advantages of terminally labeled primer extension is that SNP genotyping can be applied directly to genomic DNA. In Cahill’s study, genomic DNA and PCR-amplified samples were compared, and no significant difference was found in the assay’s sensitivity and accuracy.

8.2.3.2.3 SNP-Triggered On/Off Switch

Exo+ polymerases together with 3′ phosphorothioate-modified mismatched primers work as an off switch in DNA polymerization (Di Giusto and King, 2003; Li and Zhang, 2003b; Zhang and Li, 2003a, 2003b; Zhang, 2003a). For allele-specific primers with a 3′ phosphorothioate modification, a perfect-match primer turns DNA polymerization on and a mismatched primer turns it off. The off-switch effect results from the exonuclease-resistant property of the phosphorothioate modification, which blocks mismatch excision. Phosphorothioate modification renders oligonucleotides nuclease resistant, a strategy widely used in antisense technology as well as in single-base extension (Di Giusto and King, 2003; Li and Zhang, 2001). In a comparative analysis of 3′ mismatched primers, the 3′ phosphorothioate-modified primer was applied to fully demonstrate the critical role played by successful mismatch
excision by 3′ exonuclease in DNA polymerization. Exo- polymerases generate primer-dependent products from either unmodified or 3′ phosphorothioate-modified mismatched primers over relatively broad ranges of annealing temperatures. A remarkable phenomenon was observed when exo+ polymerases proofread phosphorothioate-modified primer 3′ termini carrying a mismatch between the primer’s 3′ terminus and the template. Instead of producing the template-dependent products obtained from exonuclease-digestible primers, exo+ polymerases generated no products from mismatched primers with a 3′ phosphorothioate modification. This breakthrough observation of an on/off switch action was repeatedly confirmed using either short artificial amplicons or natural genomic DNA templates (Peng, 2003). These data strongly favor the use of phosphorothioate-modified primers for practical SNP assays. The off-switch effect, which results directly from the 3′ exonuclease activity, was well supported by the comparison of a variety of DNA polymerases in both linear and exponential amplification with phosphorothioate-modified allele-specific primers. The crucial structural components of the on/off switch are (1) allele-specific primers with a 3′ terminal exonuclease-resistant modification and (2) DNA polymerases having 3′ exonuclease activity.

In a recent study by Di Giusto and King (2003), four types of DNA polymerases (T4+, T7+, KF+, and Vent) were tested, and similar off-switch effects were observed when they were combined with 3′ phosphorothioate-modified primers. Based on the new model of this proofreading mechanism (Fig. 8.2), polymerases with 3′ exonuclease function should have higher base discrimination ability than exo- polymerases regardless of the properties of the substrates used. This speculation was confirmed by Di Giusto and King’s (2003) experiments. In addition to comparing nine different DNA polymerases, they evaluated the effect of dNTP, ddNTP, and acyNTP on the accuracy of primer extension.
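The outcomes reported above for matched and mismatched primers can be collected into a small decision function. The Python sketch below is a qualitative summary under idealized assumptions (complete excision of digestible mismatches, complete resistance of the phosphorothioate linkage). The function name and the wording of the returned labels are hypothetical.

```python
# Qualitative outcome table for the on/off switch (hypothetical names and labels).
from itertools import product


def extension_outcome(matched, exo_plus, phosphorothioate):
    """Summarize the expected primer extension result for one reaction."""
    if matched:
        # A perfectly matched 3' terminus is extended by either polymerase type.
        return "product (switch on)"
    if not exo_plus:
        # Exo- enzymes can extend the mismatch itself, a possible false positive.
        return "primer-dependent product (possible false positive)"
    if phosphorothioate:
        # The mismatch cannot be excised, so polymerization is switched off.
        return "no product (switch off)"
    # A digestible mismatch is excised first and the primer is then extended.
    return "template-dependent product (mismatch excised)"


if __name__ == "__main__":
    for matched, exo_plus, ps in product((True, False), repeat=3):
        print(f"matched={matched!s:5} exo+={exo_plus!s:5} PS={ps!s:5} ->",
              extension_outcome(matched, exo_plus, ps))
```

Only the combination of a mismatched primer, an exo+ polymerase, and a phosphorothioate-modified 3′ terminus gives no product in this model, which is the off-switch condition.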
The maintenance of high fidelity with ddNTP and acyNTP allows the exo+ polymerases to be applied in both exponential and linear primer extension in SNP analysis. The latter was tested with MALDI-TOF and is compatible with many other detection formats. In addition, the SNP-operated on/off switch is not only a simple candidate method for mutation detection but is also mechanistically complementary to the off/on switch (Zhang, 2005). The identical reaction conditions of the two types of exo+ polymerase–mediated primer extension are a particularly important feature shared by the on/off switch and the off/on switch. In large-scale SNP scanning, the application of two complementary assays within one platform, such as a multiwell plate or a microarray, will help minimize incorrectly genotyped SNP sites caused by particular local sequence contexts. Although an increasing number of SNP assays have been developed, SNP genotyping still remains a technical challenge for modern personalized medicine. With the application of the off/on switch using inert primers and the on/off switch using 3′ exonuclease-resistant primers, the complementary effect of these two primer types will help increase assay sensitivity and reliability in genetic analysis. The on/off switch assays offer precise detection of mutation site and
[Figure 8.2 graphic. Panel labels: single-base extension/removal; partial chain termination/replacement extension. The primer acgcttaga is extended stepwise (T, TAA, TAAT, TAATG) through cycles of ddATP/dATP with αS-dATP, ddTTP/dTTP with αS-dTTP, ddCTP/dCTP (no C signal), and ddGTP/dGTP with αS-dGTP.]
Figure 8.2. The cycled proofreading genotyping processes. The colored diamonds represent the fluorescent-labeled ddNTPs and the unlabeled dNTPs. Single-base extension/removal is followed by sequential partial chain termination/replacement steps. In partial chain termination, the ddNTP is incorporated proportionally. If there is a run of TT on the template, the primers are extended with a tandem AA: some ddATP molecules are incorporated at the first A and some at the second. The length of a homopolymer of the same nucleotide is detected by the intensity of the fluorescent signals (representative peaks here). After scanning, the microchip is subjected to a replacing extension with proofreading DNA polymerase in the presence of phosphorothioate-modified dATP (αS-dATP), which removes the labeled ddATP and completes the extension with unlabeled αS-dATP. The partial chain termination and the replacing extension continue → C, → G, → T, and the cycling decodes more sequence, as shown at the right.
type, whereas the off/on switch provides a very powerful and efficient assay for scanning unknown mutations.

8.2.3.3 Application of the On/Off Switch in Mutation Analysis

Many of the mutation-detection methods discussed in this chapter have been widely used in pharmacogenetic studies. However, exo+ polymerase-mediated assays are relatively new and not well recognized by laboratories working on pharmacogenomic studies. This section briefly discusses the possible application of the on/off switch in pharmacogenetic studies, including the detection of point mutations or SNPs, small insertions and deletions, somatic mutations, allele frequencies, and high-fidelity gene expression profiling. These applications of the on/off switch also apply to the other two proofreading assays, but the on/off switch assay is more accessible to most clinical and fundamental research laboratories.
8.2.3.3.1 Mutation Detection Application in SNP Assay

The on/off switch assay was initially identified during the development of a SNP assay. The termination of DNA replication, or off-switch effect, has been well documented for a variety of DNA polymerases in both linear and exponential amplification with phosphorothioate-modified allele-specific primers. Mismatches either at the primer 3′ terminus or several bases upstream from the 3′ terminus can efficiently trigger the off switch. In addition to high genotyping accuracy, another advantage of the exo+ polymerase–mediated on/off switch is its versatility across different types of platforms, as mentioned. The on/off switch assay has been used by several groups with both genomic DNA and previously amplified PCR products (Cahill, 2003; Di Giusto and King, 2003; King, 2004; Li and Zhang, 2003; Zhang, 2003a, 2003b, 2003c, 2005; Zhang and Li, 2001, 2003a, 2004).

8.2.3.3.2 Analysis of Short Deletion/Insertion Mutations

Mutations involving large stretches of DNA arise during recombination, whereas single-base mismatches or short insertions/deletions usually originate from inaccuracies of the DNA replication machinery. The on/off switch, in combination with phosphorothioate-modified allele-specific primers and high-fidelity polymerases, works in vitro as an accurate DNA replication system. The on/off switch assay mimics the in vivo fidelity maintenance mechanism and is thus ideal for detection of point mutations and short insertions/deletions. Short insertions, deletions, and indels are closely associated with human diseases, particularly when a single-base insertion or deletion generates a frameshift and a nonsense mutation. Short insertions/deletions can be targeted by the allele-specific primer 3′ terminus, or by the base one position upstream from the primer 3′ terminus in the case of a short string of identical nucleotides.
Except for the deletion or insertion of short repeats, the application of this on/off switch provides efficient high-throughput screening for mutation analysis.

8.2.3.3.3 Detection of Rare Alleles and Mutation Load Determination

Somatic mutations are associated with the development of cancers and a variety of other human diseases. But because spontaneous mutation frequencies vary widely between species, tissues, and genes, it has been difficult to determine mutation loads. A notable achievement in this area has been the ability to detect rare mutations present in <10% of the cells assayed, using strategies such as preferential amplification of the mutant allele, preferential destruction of the wild type allele, and spatial separation of the mutant and wild type alleles. Although great strides have been made in genomewide SNP screening, somatic mutation load analysis remains an unmet technical challenge, considering that the spontaneous mutation rate for humans is lower than 10⁻⁹ per cell per generation. One outstanding feature of the novel on/off switch is its flexibility, including the ability to apply double switches in a single reaction (i.e., both the forward
and the reverse primers are 3′ phosphorothioate modified and allele specific). High-fidelity DNA polymerases have a mutation rate of <10⁻⁶. Application of the double off-switch strategy could theoretically lead to the detection of a single mutation in 10⁹ wild type DNA molecules. Rare mutation detection allows somatic pharmacogenetic studies to be performed. With the application of the on/off switch assay, rare mutations in mitochondrial DNA and in the genomic DNA of normal cells as well as cancer cells can be quickly determined.

8.2.3.3.4 Allele Frequency Estimation in Pooled DNA Samples

The central aim of genetics is to correlate specific molecular variations with particular phenotypic changes. As the third generation of genetic markers, the highly abundant SNPs are now widely used in mapping association studies. To reduce the time and cost of genotyping every SNP allele for all selected individuals in the population under study, pooled DNA samples can be employed instead. The on/off switch has the advantage of simplicity, sensitivity, and accuracy over the conventional methods currently used in allele frequency estimation. The application of the novel on/off switch in allele frequency estimation is expected to lower the cost and increase the accuracy of allele estimation using pooled DNA samples. In addition to SNP analysis, allele frequency estimation has implications for mutation annotation. Establishing whether a single-base change is a rare genetic variant or a causative mutation is a routine difficulty in genetic diagnostics. With the availability of reference values for a specific allele, normal and pathological genetic variants can be quickly recognized, which is the basis for individualized medicine.

8.2.4 Sequencing-Based Assays
Sequencing-based assays used for mutation screening are often classified as direct or indirect sequencing. The conventional, clone-based, capillary sequencing assay is set up for sequencing purified PCR products of interest in both directions, performed by a DNA sequencing core facility. It has been widely recognized that direct sequencing is the gold standard of genotyping. However, the so-called next-generation sequencing method (i.e., a platform for massively parallel production of DNA sequencing reads) and its instruments became available in 2004, allowing highly streamlined sample preparation steps that do not require amplification of DNA fragments before DNA sequencing (Mardis, 2008). As conventional sequencing is already a routine practice, widely used in research and clinical diagnosis, the next generation of sequencing technologies is discussed in detail here. The first practical sequencing technology was based on chemical degradation of the DNA. But this technology is not as convenient or efficient as the second technology, which is based on chain termination. The Human Genome Project could not have been completed without the high-throughput capability of the sequencing technology developed by Sanger. The first two
sequencing methods are electrophoresis dependent and cannot be simply applied to microarray technology. We reported a strategy to integrate chain termination sequencing and microarray-based sequencing by synthesis in a microchip at a conference in 2008; the method cycles between inactivating and reactivating the arrayed primers. The cycled proofreading procedure is crucial for this new method, which consists of partial chain termination with mixed ddNTP/dNTP targeting one type of nucleotide at a time, followed by the corresponding replacing extension with unlabeled phosphorothioate-modified dNTP of the same type of nucleotide used in the partial chain termination. Replacing extension eliminates the fluorescent signals integrated at previous steps, proofreads errors, and reactivates and synchronizes the 3′ termini of the extended DNA molecules. This new sequencing method inherits the advantages of sequencing by termination and sequencing by extension and is more powerful than conventional chain termination sequencing, because the cycled proofreading procedure has the potential to decode DNA sequences for as long as primer extension can proceed. One unique advantage of this strategy is that the fluorescent terminators are the ddNTPs that have been used for decades and have well-confirmed properties. It is hoped that this microchip-based sequencing method will contribute to the efforts to reach the goal of the $1000-genome project.

The second-generation sequencing technologies do not yet have strategies or prototype sequencing machines; they are practically available for research and may be used for clinical diagnostics soon. The following section focuses on several parallel sequencing methods currently used for research. Three technologies (pyrosequencing, ligase-based sequencing, and sequencing by synthesis) are discussed as representatives. The major application of the next generation of sequencing technologies is genome resequencing.
One such approach is the COSMIC project, which has successfully identified more than 1000 genes with somatic mutations in human cancer tissues.

8.2.4.1 Pyrosequencing with PicoTiterPlate

The first practical approach to $1000-per-genome sequencing, and probably the most successful non-Sanger method developed to date, is pyrosequencing (Hyman, 1988). In contrast to conventional direct sequencing, pyrosequencing is an indirect sequencing-based assay developed for short DNA sequences. In theory, pyrosequencing is a cascade of enzymatic reactions yielding detectable light that is proportional to the de novo incorporation of nucleotides. In other words, this technology is based on the luminometric detection of the equimolar amount of pyrophosphate released on nucleotide incorporation (Xie, 2011; Ronaghi et al., 1996; Ahmadian et al., 2000; Nordstrom et al., 2000; Ronaghi, 2001): the pyrophosphate is used by sulfurylase to generate ATP, which is subsequently used by luciferase to generate light. Finally, that light is captured and converted, and is displayed as a peak in a pyrogram. Each peak height reflects the number of nucleotides that have been incorporated. Simultaneously, apyrase degrades unincorporated nucleotides. The pyrosequencing reaction
is carried out with biotinylated PCR products (i.e., single-stranded DNA isolated by means of a biotinylated primer sequence), and the sequence of peaks is decoded from the end of a sequencing primer by stepwise addition of nucleotides in a selected dispensation order. This nonfluorescence technique measures the release of inorganic pyrophosphate during DNA polymerization, which is proportionally converted into visible light by a series of enzymatic reactions (Ronaghi et al., 1996, 1998; Gharizadeh et al., 2002; Nyren et al., 1993). Upon addition of the complementary dNTP, DNA polymerase extends the primer and theoretically pauses when it encounters a noncomplementary base. DNA synthesis is reinitiated after the addition of the next complementary dNTP in the dispensing cycle. The light generated during the reaction corresponds to the order of complementary dNTPs incorporated and reveals the underlying DNA sequence. Applications of pyrosequencing have been reviewed by Ronaghi (2001) and Langaee and Ronaghi (2005). One drawback of this technology is the difficulty of sequencing homopolymeric regions longer than three bases. This drawback is not limited to pyrosequencing but is common to most sequencing-by-synthesis methods.

The 454 Corporation introduced a whole-genome sequencing strategy by integrating pyrosequencing with their PicoTiterPlate (PTP) platform a few years ago (Figure 8.3). The PTP is able to amplify and image approximately 300,000 clonal PCR templates captured on Sepharose beads (Leamon et al., 2003). The PTP is manufactured by anisotropic etching of a fiberoptic faceplate, with a well diameter of approximately 40 μm. These microwells are the chambers for pyrosequencing after the clonal amplification of single DNA molecules discussed above (Margulies et al., 2005). Because it amplifies single molecules, the clonal amplification strategy can efficiently distinguish homozygous from heterozygous bases.
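The relationship between the dispensation order, the peak heights, and the decoded sequence can be sketched in a few lines of Python. The example idealizes the chemistry: each dispensation extends the primer through a complete homopolymer run, the peak height equals the run length exactly, and no signal is lost or carried over. The function names and the cyclic ACGT dispensation order are assumptions of this sketch, and the test sequence is the TAATG extension shown in Figure 8.2.

```python
# Idealized pyrogram simulation (hypothetical function names).
from itertools import cycle, islice


def simulate_pyrogram(synthesized_strand, dispensation="ACGT", max_dispensations=200):
    """Return (nucleotide, peak_height) pairs for a cyclic dispensation order.

    synthesized_strand is the strand built on the template, read 5'->3'.
    """
    position = 0
    peaks = []
    for nucleotide in islice(cycle(dispensation), max_dispensations):
        if position >= len(synthesized_strand):
            break
        run = 0
        # The polymerase incorporates the dispensed dNTP across the whole run.
        while position < len(synthesized_strand) and synthesized_strand[position] == nucleotide:
            run += 1
            position += 1
        # A height of 0 means the dNTP was not complementary and was degraded.
        peaks.append((nucleotide, run))
    return peaks


def decode_pyrogram(peaks):
    """Rebuild the synthesized sequence from the idealized peak heights."""
    return "".join(nucleotide * height for nucleotide, height in peaks)


if __name__ == "__main__":
    peaks = simulate_pyrogram("TAATG")
    print(peaks)
    print(decode_pyrogram(peaks))
```

In practice the light signal stops being strictly proportional to run length for longer homopolymers, which is the drawback noted in the text; the sketch deliberately ignores that effect.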
Currently, pyrosequencing assays are commercially available (e.g., from Pyrosequencing, Uppsala, Sweden). Real-time pyrosequencing has been developed to determine polymorphisms in genes such as CYP2D6 *2, *3, *4, and *6 (Zackrisson and Lindblom, 2003); CYP2B6 variants G516T, C785G, and C1459T (Rohrbacher et al., 2006); CYP3A4 *2 and *3 (Garsa et al., 2005); and CYP3A5*3 (Garsa et al., 2005). This technology has proven to be a rapid, reliable, high-throughput technique for genotyping.

8.2.4.2 Oligonucleotide Ligation–Mediated Parallel Sequencing with Applied Biosystems SOLiD

Sequencing by ligation is substantially different from the concept of sequencing by hybridization, which led to the oligoarrays developed by Affymetrix. The oligoarray process is a simple reversal of classical Southern blotting; both were originally developed by Southern. Sequencing by ligation couples hybridization with a ligation reaction, and the repeated application of cleavable probes or primers allows these two processes to cycle (Tomkinson et al., 2006; Landegren et al., 1988). Technically, the cycle can be repeated either by using cleavable probes to remove the
Major platforms (throughput per run; read length):
Applied Biosystems ABI 3730XL: 10 kb/run; 500–800 bp
Roche/454 Genome Sequencer FLX: 0.5 Gb/run; 400–500 bp
Applied Biosystems SOLiD: over 20 Gb/run; 50 bp
Illumina/Solexa Genome Analyzer: up to 20 Gb/run; 50–75 bp
Helicos: up to 30 Gb/run; 25–55 bp
Figure 8.3. The major platforms currently used for next-generation sequencers.
fluorescent dye and regenerate a 5′-PO4 group for subsequent ligation cycles or by removing the extended primer and hybridizing a new primer to the template. Applied Biosystems (ABI) has commercialized its sequencing-by-ligation (SBL) platform, called support oligonucleotide ligation detection (SOLiD) (Valouev et al., 2008). SOLiD uses two slides per run; each can be partitioned into four or eight regions, called spots. Although SOLiD may underestimate true variants and may have a high substitution error rate similar to that of Illumina/Solexa reads (Shen et al., 2008), it has unparalleled throughput and >99.94% base-calling accuracy, making it a highly accurate, massively parallel genomic analysis platform (Figure 8.3).

8.2.4.3 Fluorescently Labeled Sequencing by Synthesis with Illumina/Solexa Genome Analyzer II

Another success in developing sequencing-by-synthesis technology is the Illumina/Solexa Genome Analyzer. Unlike the 454 Life Sciences platform, this platform uses fluorescently labeled, reversible nucleotide terminator chemistry, a typical cyclic reversible termination. The system uses a dense array of small adapter molecules covalently bound to a glass surface. A small percentage of these adapters are also covalently bound to a DNA template. The tethered fragments are then replicated using several rounds of semisolid PCR to form a very dense cluster of DNA templates tethered to adapters. Although the semisolid PCR is not highly efficient, the products are usually sufficient for the following step of sequencing that begins
with the addition of sequencing primers, fluorescently labeled reversible dNTP terminators, and DNA polymerase. Each inactivation and reactivation cycle decodes one base; thus the length of sequence decoded depends on the number of cycles repeated. Currently, the Illumina/Solexa Genome Analyzer II supports 36–50 cycles with the accuracy required for resequencing. Both the cycled proofreading genotyping processes we reported and the Illumina/Solexa Genome Analyzer II use cyclic reversible termination (CRT). The former becomes feasible by combining the well-confirmed ddNTPs with the exonuclease activity of a high-fidelity polymerase, while the latter uses reversibly modified dNTPs instead. Compared with Sanger’s sequencing method, CRT is advantageous in (1) eliminating gel electrophoresis and (2) formatting the assay in a highly parallel fashion. The CRT assay can be performed on a number of highly parallel platforms, such as high-density oligonucleotide arrays (Albert et al., 2003), PTP arrays (Leamon et al., 2003), polony arrays (Mitra et al., 1999), or random dispersions of single molecules. The Affymetrix oligoarray differs in that its probes lack a free 3′ terminus. Albert et al. (2003) have demonstrated the 5′ → 3′ synthesis of oligonucleotides on a high-density array, which allows a single-base extension microarray for SNP analysis. The cycled proofreading genotyping processes discussed earlier were also prompted by the approach of Albert et al. (2003).

The flexibility and compatibility of the Illumina/Solexa system seem better than those of the 454 pyrosequencing PTP. The Illumina/Solexa Genome Analyzer II uses the clonally amplified template method coupled with the four-color CRT method. The four colors are detected by total internal reflection fluorescence (TIRF) imaging using two lasers. The slide is partitioned into eight channels, which allows independent samples to be run simultaneously.
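The statement that each cycle of cyclic reversible termination decodes one base, so that read length equals the number of cycles, can be made concrete with a minimal Python sketch. It assumes error-free incorporation, imaging, and cleavage, and a template written in the 3′ → 5′ direction so that the synthesized read comes out 5′ → 3′. The function name and the example template are hypothetical.

```python
# Idealized cyclic reversible termination (CRT) read-out (hypothetical names).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def crt_read(template_3to5, cycles):
    """Return the bases decoded after the given number of CRT cycles."""
    read = []
    for template_base in template_3to5[:cycles]:
        # One cycle: incorporate a single labeled terminator, image its color,
        # then remove the label and the block to reactivate the 3' terminus.
        read.append(COMPLEMENT[template_base])
    return "".join(read)


if __name__ == "__main__":
    template = "ATTAC"
    for cycles in (1, 3, 5):
        print(cycles, crt_read(template, cycles))
```

Unlike the pyrosequencing read-out, a homopolymer run such as TT costs two cycles here and is read as two separate single-base signals.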
However, the combination of reversible terminator and polymerase shows bias in base incorporation, resulting in substitution errors commonly observed at AT-rich (Dohm et al., 2008; Hillier et al., 2008; Harismendy et al., 2009) and GC-rich regions; this has alternatively been attributed to amplification bias during template preparation rather than to errors at the sequencing step. Varied base substitution errors were reported in different studies (Frazer et al., 2009; Ley et al., 2008; Sarin et al., 2008).

8.2.4.4 Real-Time Sequencing Developed by Pacific Biosciences

Real-time sequencing, under development by Pacific Biosciences, may have commercial potential in the next-generation sequencing market. Real-time sequencing involves imaging the continuous incorporation of dye-labeled nucleotides during DNA synthesis (Metzker, 2009; Eid et al., 2009). Considerable progress has been made in optimizing individual zero-mode waveguide detectors (ZMW detectors) (Levene et al., 2003), in engineering DNA polymerases for an enhanced signal by fluorescence resonance energy transfer (Hardin et al., 2000), and in developing dye-quencher nucleotides (Williams, 1998).
[Figure 8.4 graphic. Panels: (a) Roche/454, Life/APG, Polonator: emulsion PCR; one DNA molecule per bead, clonal amplification to thousands of copies in emulsion microreactors; 100–200 million beads. (b) Illumina/Solexa: solid-phase amplification; one DNA molecule per cluster; bridge amplification; 100–200 million molecular clusters. (c) Helicos BioSciences, one-pass sequencing: single molecule, primer immobilized; billions of primed, single-molecule templates. (d) Helicos BioSciences, two-pass sequencing: single molecule, template immobilized; billions of primed, single-molecule templates. (e) Pacific Biosciences, Life/Visigen, LI-COR Biosciences: single molecule, polymerase immobilized; thousands of primed, single-molecule templates. Credit line: Nature Reviews | Genetics.]
Figure 8.4. The next-generation sequencing with emulsion PCR.
In emulsion PCR (emPCR), a reaction mixture consisting of an oil–aqueous emulsion is created to encapsulate bead–DNA complexes into single aqueous droplets. PCR amplification is performed within these droplets to create beads containing several thousand copies of the same template sequence. EmPCR beads can be chemically attached to a glass slide or deposited into PicoTiterPlate wells (Fig. 8.4) (Metzker, 2010). Solid-phase amplification is composed of two basic steps: initial priming and extension of the single-stranded, single-molecule template, and bridge amplification of the immobilized template with immediately adjacent primers to form clusters. Three approaches are shown for immobilizing single-molecule templates to a solid support: immobilization by a primer, immobilization by a template, and immobilization of the polymerase. dNTP, 2′-deoxyribonucleoside triphosphate.

The Helicos Single Molecule Sequencer is similar to the real-time sequencing just mentioned. Helicos BioSciences’s single-molecule sequencer was
based on the work of Quake and colleagues (Braslavsky et al., 2003) and was first successfully applied in sequencing a viral genome (Harris et al., 2008). A common bottleneck of sequencing by synthesis is the quality of the reversible terminator. Developers from Helicos BioSciences reported a 5% error rate when using Cy5-12ss-dNTPs. Single-molecule sequencing does not require prior amplification of the template to be sequenced and thus may have the potential to minimize sequencing errors once optimized reversible terminators are available.
8.3 MICROARRAY-BASED MUTATION DETECTION
Comparative genomic hybridization (CGH) is a high-throughput hybridization-based assay that has been developed as an efficient approach to scanning the entire genome for DNA copy number variations (Pinkel and Albertson, 2005). Typically, genomic DNA from test and reference samples is labeled with different fluorophores and then co-hybridized to representative BAC clones or oligo probes. The gain or loss of a genomic region can be determined by comparing the fluorescence ratio. The resolution ranges from megabases for BAC clones to kilobases for oligo probes. However, the CGH array cannot detect polyploidy or any balanced translocations or inversions. For SNP analysis, a microarray with single-base extension (SBE) is highly efficient. SBE is an extreme case of conventional sequencing in that it involves extension by just one base. When the template base to be copied belongs to a SNP, the extended single base decodes the SNP either through the resulting difference in molecular weight of the incorporated ddNTP or through differential labeling of each ddNTP. Simple and predictable base extension renders SBE very compatible with both conventional methods and microarray approaches (Pastinen et al., 2000; Tonisson et al., 2002). Examples of applications include arrayed primer extension, fluorescence resonance energy transfer (FRET), and MALDI-TOF MS. Since SBE exhibits linear amplification, prior exponential PCR amplification is required. Di Giusto and King (2003) reported that replacing exo- polymerase with proofreading polymerase in SBE microarrays significantly increased the accuracy of base decoding. Another example of microarray-based mutation detection is the AmpliChip CYP450 GeneChip (AmpliChip), which was developed by Roche Molecular Systems and Affymetrix.
It is an oligonucleotide microarray hybridization method for genotyping two important genes, CYP2C19 and CYP2D6, that encode drug-metabolizing enzymes responsible for the metabolism of an estimated 25% of all prescription drugs (Xie and Frueh, 2005). A total of 35 polymorphisms and mutations (33 in the CYP2D6 gene and 2 SNPs in the CYP2C19 gene), including gene deletions and duplications, are covered in a single assay for both genes. According to the manufacturer’s instructions (AmpliChip, Roche), fragmentation, labeling, hybridization, staining, and scanning are the AmpliChip’s
major steps. First, the two target genes, CYP2D6 and CYP2C19, are amplified in two independent reactions (A and B) using two specific CYP450 Primer Mixes (A and B), the CYP450 Master Mix, and the protocol provided by the manufacturer. After 35 cycles of PCR amplification, the DNA amplicons from the two separate amplification reactions of each individual are pooled and then cut by DNase I to generate small DNA fragments (an average size of 50–200 bp). After alkaline phosphatase degrades residual dNTPs present in the PCR reactions, the target DNA fragments are labeled with biotin at their 3′ end using Terminal Transferase and AmpliChip TdT Labelling Reagent and then hybridized to the oligos located on the AmpliChip CYP450 Microarray with the help of a hybridization buffer. Finally, the hybridized microarray is washed, stained with streptavidin-conjugated phycoerythrin, and scanned by an Affymetrix GeneChip scanner using a laser that excites the fluorescent label bound to the hybridized target DNA fragments. The amount of emitted light is proportional to the amount of target DNA bound at each location on the probe microarray. The image and derived data are stored and processed, and a report is generated for the genotype assignment and for the prediction of enzymatic activity. The AmpliChip is a highly reliable genotyping method (Heller et al., 2006; Rebsamen et al., 2009). In addition to the Roche AmpliChip, other chips have been developed for various purposes (Ragoussis, 2009), such as whole-genome association studies, long-range haplotyping, and identification of many new genes associated with common diseases.
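The fluorescence-ratio comparison described for array CGH earlier in this section can be illustrated with a minimal Python sketch. It assumes that each probe yields one test and one reference intensity that have already been background-corrected and normalized. The function name, the probe names, the intensity values, and the log2 thresholds of ±0.3 are hypothetical choices made only for illustration; they are not taken from any particular platform.

```python
import math


def classify_copy_number(test_signal, ref_signal,
                         gain_threshold=0.3, loss_threshold=-0.3):
    """Return (log2 ratio, call) for a single probe.

    The thresholds are illustrative assumptions; real analyses tune them
    to the noise level of the array and usually smooth across probes.
    """
    if test_signal <= 0 or ref_signal <= 0:
        raise ValueError("fluorescence intensities must be positive")
    log2_ratio = math.log2(test_signal / ref_signal)
    if log2_ratio >= gain_threshold:
        call = "gain"
    elif log2_ratio <= loss_threshold:
        call = "loss"
    else:
        call = "normal"
    return log2_ratio, call


# Hypothetical (test, reference) intensities for three probes.
probes = {
    "probe_A": (1500.0, 1000.0),  # more test signal than reference
    "probe_B": (1000.0, 1000.0),  # equal signal
    "probe_C": (500.0, 1000.0),   # half the reference signal
}

calls = {name: classify_copy_number(t, r) for name, (t, r) in probes.items()}
for name, (ratio, call) in calls.items():
    print(f"{name}: log2 ratio = {ratio:+.2f}, call = {call}")
```

Because the call depends only on the test-to-reference ratio, a rearrangement that leaves the amount of DNA at each probe unchanged gives ratios near zero everywhere, which is consistent with the inability of CGH arrays to detect balanced translocations or inversions.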
8.4 MISCELLANEOUS TECHNOLOGICAL ADVANCES FOR MUTATION DETECTION
8.4.1 Restriction Fragment Length Polymorphism
Restriction fragment length polymorphism (RFLP) analysis is based on alterations in restriction sites (the gain or loss of a restriction site due to a point mutation) and was initially used in linkage analysis (Davies et al., 1983; Wieacker et al., 1983). It involves PCR amplification of a fragment spanning the restriction site, followed by restriction enzyme digestion. RFLP has been widely used in association studies because it is simple and inexpensive. With the use of efficient restriction enzymes, RFLP can yield results with an accuracy comparable to that of direct sequencing. However, RFLP cannot assay SNPs that do not reside within restriction sites. Furthermore, great care is needed with SNPs that are assayed with weakly cutting restriction enzymes, in order to differentiate between fragments that cannot be digested and fragments that are merely incompletely digested. For some point mutations or SNPs that reside in sequences one nucleotide away from endonuclease restriction sites, allele-specific primers that introduce a point mutation may be used to artificially create a restriction site for RFLP (Haliassos et al., 1989).
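The logic of an RFLP readout can be sketched in a few lines of Python. The example below is an illustrative assumption rather than a protocol: the two amplicon sequences are invented, the enzyme is EcoRI (recognition site GAATTC, cut after the first G), digestion is treated as complete, and only the top strand of a linear amplicon is searched, which suffices for a palindromic site. The helper names `digest` and `expected_bands` are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the number of bases from the start of the recognition
    site to the cut position on the top strand (1 for EcoRI, G^AATTC).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(len(sequence) - start)
    return fragments


def expected_bands(allele_a, allele_b, site, cut_offset):
    """Distinct fragment lengths expected on a gel for a diploid genotype."""
    bands = set(digest(allele_a, site, cut_offset))
    bands |= set(digest(allele_b, site, cut_offset))
    return sorted(bands, reverse=True)


# Invented 28-bp amplicons that differ at one base inside the EcoRI site.
left, right = "ATGCCGTACC", "TTGACCGTAGCA"
allele_with_site = left + "GAATTC" + right      # site present, is cut
allele_without_site = left + "GAGTTC" + right   # A>G change destroys the site

ECORI_SITE, ECORI_OFFSET = "GAATTC", 1

cut_homozygote = expected_bands(allele_with_site, allele_with_site,
                                ECORI_SITE, ECORI_OFFSET)
uncut_homozygote = expected_bands(allele_without_site, allele_without_site,
                                  ECORI_SITE, ECORI_OFFSET)
heterozygote = expected_bands(allele_with_site, allele_without_site,
                              ECORI_SITE, ECORI_OFFSET)
print(cut_homozygote, uncut_homozygote, heterozygote)
```

In this idealized model, a heterozygote and an incompletely digested homozygote for the cut allele would show the same three bands, which is why incomplete digestion needs the care described above.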
8.4.2 Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry
DNA polymerases are not directly involved in mutation discrimination in MALDI-MS (Tost and Gut, 2003). Instead, they are used solely to extract the specific sequences and to amplify a sufficient quantity of DNA for the assay. MALDI-TOF MS uses time-of-flight mass spectrometry to detect differences in the mass/charge ratio between DNA fragments generated by primer extension. The key steps in this method are DNA amplification, ionization of the products, and detection of the mass/charge ratio. Since the charge and mass contributed by a single nucleotide are relatively small, limiting the fragments to be analyzed to short lengths maximizes the power of this assay. High equipment costs limit its application in fundamental research.
8.4.3 TaqMan Assay and Real-Time PCR
TaqMan assays are used for both gene expression profiling and SNP analysis (Andras et al., 2001). The exponential amplification of signal by polymerase renders the TaqMan assay very sensitive in real-time SNP analysis and in gene expression profiling. Base discrimination by TaqMan or by hybridization array is mediated by hybridization between DNA molecules. DNA polymerases do not play any direct role in base discrimination in either of these two assays. In the TaqMan assay, the polymerase simply helps generate the signal by freeing the fluorophore from the nearby quencher via its 5′ exonuclease activity during the elongation phase of PCR.
8.5 SUMMARY
Fast-growing knowledge of genomics and advances in genomic research are leading to an explosion of technologies and methods for scanning and detecting DNA sequence variants. Since DNA sequencing technology was developed in the 1970s, sequencing has been used as the gold standard for mutation detection, and it remains the gold standard. Hybridization-based assays and PCR-based assays are well developed for easier use and higher throughput in mutation analysis. High throughput can mean either screening a specific mutation in a large number of samples or scanning for unknown mutations in a large number of nucleotides (for example, a long fragment or even a whole genome). Many conventional technologies, such as PCR-based assays, were developed for high-throughput screening of a specific mutation, whereas microarray technology and next-generation sequencing technologies open avenues to efficiently scan large genes or even a whole genome in a short time. While traditional sequencing technology has been partially replaced in mutation analysis, next-generation sequencing has shown great potential for future genetic diagnostics. Except for de novo sequencing, the
second-generation sequencing technologies are highly efficient in resequencing as well as many other applications in genomic studies. Although the second-generation sequencing technologies are mainly used in research, it is reasonable to expect their application in individualized medicine, particularly when the third-generation sequencing technologies mature and become available.
8.6 ACKNOWLEDGMENTS
This work was partially supported by a Chinese National 863 grant (2008AA02Z436), a grant from the National Natural Science Foundation of China (30970877), and an NIH grant (5P01 GM031304).
8.7 REFERENCES
Ahmadian A, Gharizadeh B, Gustafsson AC, Sterky F, Nyren P, Uhlen M, Lundeberg J. (2000). Single-nucleotide polymorphism analysis by pyrosequencing. Anal Biochem 280:103–10. Albert TJ, Norton J, Ott M, Richmond T, Nuwaysir K, Nuwaysir EF, Stengele KP, Green RD. (2003). Light-directed 5′ → 3′ synthesis of complex oligonucleotide microarrays. Nucl Acids Res 31:e35. Andras SC, Power JB, Cocking EC, Davey MR. (2001). Strategies for signal amplification in nucleic acid detection. Mol Biotechnol 19:29–44. Bi WL, Stambrook PJ. (1998). Detection of known mutation by proof-reading PCR. Nucl Acids Res 26:3073–75. Braslavsky I, Hebert B, Kartalov E, Quake SR. (2003). Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 100:3960–64. Cahill P, Bakis M, Hurley J, Kamath V, Nielsen W, Weymouth D, Dupuis J, DoucetteStamm L, Smith DR. (2003). Exoproofreading, a versatile SNP scoring technology. Genome Res 13:925–31. Davies KE, Pearson PL, Harper PS, Murray JM, O’Brien T, Sarfarazi M, Williamson R. (1983). Linkage analysis of two cloned DNA sequences flanking the Duchenne muscular dystrophy locus on the short arm of the human X chromosome. Nucl Acids Res 11:2303–12. Di Giusto D, King GC. (2003). Single base extension (SBE) with proofreading polymerases and phosphorothioate primers: improved fidelity in single-substrate assays. Nucl Acids Res 31:e7. Di Giusto DA, King GC. (2004). Strong positional preference in the interaction of LNA oligonucleotides with DNA polymerase and proofreading exonuclease activities: implications for genotyping assays. Nucl Acids Res 32:e32. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucl Acids Res 36:e105.
Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G. (2009). Real-time DNA sequencing from single polymerase molecules. Science 323:133–38. Feng J, Yan J, Li W, Chen J, Sommer SS. (2007). Candidate gene analyses by scanning or brute force fluorescent sequencing: a comparison of DOVAM-S with gel-based and capillary-based sequencing. Genet Test 11:235–40. Frazer KA, Murray SS, Schork NJ, Topol EJ. (2009). Human genetic variation and its contribution to complex traits. Nat Rev Genet 10:241–51. Garsa AA, McLeod HL, Marsh S. (2005). CYP3A4 and CYP3A5 genotyping by Pyrosequencing. BMC Med Genet 6:19. Gharizadeh B, Nordstrom T, Ahmadian A, Ronaghi M, Nyren P. (2002). Anal Biochem 301:82–90. Goodman MF. (1988). DNA replication fidelity: kinetics and thermodynamics. Mutat Res 200:11–20. Haliassos A, Chomel JC, Grandjouan S, Kruh J, Kaplan J, Kitzis A. (1989). Detection of minority point mutations by modified PCR technique: a new approach for a sensitive diagnosis of tumor-progression markers. Nucl Acids Res 17:8093–99. Hardin S, Gao X, Briggs J, Willson R, Tu SC. (2000). Methods for real-time single molecule sequence determination. U.S. Pat. 7329492. Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, et al. (2009). Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10:R32. Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, Causey M, et al. (2008). Single-molecule DNA sequencing of a viral genome. Science 320:106–09. Heller T, Kirchheiner J, Armstrong VW, Luthe H, Tzvetkov M, Brockmoller J, et al. (2006). AmpliChip CYP450 GeneChip: a new gene chip that allows rapid and accurate CYP2D6 genotyping. Ther Drug Monit 28:673–77. Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, et al. (2008). Whole-genome sequencing and variant discovery in C elegans. Nat Meth 5:183–88. Hitzl M, Drescher S, van der KH, Schaffeler E, Fischer J, Schwab M, et al. (2001). 
The C3435T mutation in the human MDR1 gene is associated with altered efflux of the P-glycoprotein substrate rhodamine 123 from CD56+ natural killer cells. Pharmacogenetics 11:293–98. Hyman ED. (1988). A new method of sequencing DNA. Anal Biochem 174:423–36. Kaller M, Ahmadian A, Lundeberg J. (2004). Microarray-based AMASE as a novel approach for mutation detection. Mutat Res 554:77–88. Kanki H, Yang P, Xie HG, Kim RB, George AL Jr, Roden DM. (2002). Polymorphisms in beta-adrenergic receptor genes in the acquired long QT syndrome. J Cardiovasc Electrophysiol 13:252–56. Kim RB, Leake BF, Choo EF, Dresser GK, Kubba SV, Schwarz UI, et al. (2001). Identification of functionally variant MDR1 alleles among European Americans and African Americans. Clin Pharmacol Ther 70:189–99. King GC, Di Giusto DA, Wlassoff WA, Giesebrecht S, Flening E, Tyrelle GD. (2004). Proofreading genotyping assays and electrochemical detection of SNPs. Hum Mutat 23:420–25.
Kwok PY. (2001). Methods for genotyping single nucleotide polymorphisms. Annu Rev Genomics Hum Genet 2:235–58. Landegren U, Kaiser R, Sanders J, Hood L. (1988). A ligase-mediated gene detection technique. Science 241:1077–80. Langaee T, Ronaghi M. (2005). Genetic variation analyses by pyrosequencing. Mutat Res 573:96–102. Leamon JH, Lee WL, Tartaro KR, Lanza JR, Sarkis GJ, deWinter AD, Berka J, Weiner M, Rothberg JM, and Lohman KL. (2003). A massively parallel PicoTiterPlate™ based platform for discrete picoliter-scale polymerase chain reactions. Electrophoresis 24:3769–77. Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW, et al. (2003). Zero-mode waveguides for singlemolecule analysis at high concentrations. Science 299:682–86. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, et al. (2008). DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456:66–72. Li K, Zhang J. (2001). ISIS-3521 (ISIS Pharmaceutics). Curr Opin Invest Drug 2:1454–61. Li K, Zhang J. (2003). New SNP assays from an old concept of proofreading. Curr Drug Disc 11:37–39. Mardis ER. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437:376–80. Metzker ML. (2009). Sequencing in real time. Nat Biotech 27:150–51. Metzker ML. (2010). Sequencing technologies—the next generation. Nat Rev Genet 11:31–46. Mir KU, Edwin M, Southern EM. (2000). Sequence variation in genes and genomic DNA: Methods for large-scale analysis. Annu Rev Genomics Hum Gene 1:329–60. Mitra R, Church G. (1999). In situ localized amplification and contact replication of many individual DNA molecules. Nucl Acids Res 27:e34. Mroske C, Muci J, Wang J, Li K, Song W, Yan J, Feng J, Liu Q, Sommer SS. (2007). 
Toward a fluorescent single-strand conformation polymorphism technique that detects all mutations: F-DOVAM-S. Anal Biochem 368:250–57. Nordstrom T, Nourizad K, Ronaghi M, Nyren P. (2000). Method enabling pyrosequencing on double-stranded DNA. Anal Biochem 282:186–93. Nordstrom T, Ronaghi M, Forsberg L, de Faire U, Morgenstern R, Nyren P. (2000). Direct analysis of single-nucleotide polymorphism on double-stranded DNA by pyrosequencing. Biotechnol Appl Biochem 31(Pt 2):107–12. Nyren P, Pettersson B, Uhlen M. (1993). Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal Biochem 208:171–75. O’Meara D, Ahmadian A, Odeberg J, Lundeberg J. (2002). SNP typing by apyrasemediated allele-specific primer extension on DNA microarrays. Nucl Acids Res 30:e75.
Orita M, Iwahana H, Kanazawa H, Hayashi K, Sekiya T. (1989). Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms. Proc Natl Acad Sci U S A 86:2766–70. Pastinen T, Raitio M, Lindroos K, Tainola P, Peltonen L, Syvanen AC. (2000). A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res 10:1031–42. Peng CY, Zhang J, Guo ZF, Chen LL, Liao DF. (2003). Discrimination of C to T point mutation in GJB3 sensorineural deafiness by SNP-operated on/off switch. J Nanhua University 31:132–34. Pinkel D, Albertson DG. (2005). Comparative genomic hybridization. Annu Rev Genomics Hum Genet 6:331–54. Ragoussis J. (2009). Genotyping technologies for genetic research. Annu Rev Genomics Hum Genet 10:117–33. Rebsamen MC, Desmeules J, Daali Y, Chiappe A, Diemand A, Rey C, et al. (2009). The AmpliChip CYP450 test: cytochrome P450 2D6 genotype assessment and phenotype prediction. Pharmacogenomics J 9:34–41. Rohrbacher M, Kirchhof A, Geisslinger G, Lotsch J. (2006). Pyrosequencing-based screening for genetic polymorphisms in cytochrome P450 2B6 of potential clinical relevance. Pharmacogenomics 7:995–1002. Ronaghi M, Karamohamed S, Pettersson B, Uhlén M, Nyrén P. (1996). Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 242:84–89. Ronaghi M. (2001). Pyrosequencing sheds light on DNA sequencing. Genome Res 11:3–11. Ronaghi M, Uhlén M, Nyrén P. (1998). A sequencing method based on real-time pyrophosphate. Science 281:363–65. Sarin S, Prabhu S, O’Meara MM, Pe’er I, Hobert O. (2008). Caenorhabditis elegans mutant allele identification by whole-genome sequencing. Nat Meth 2008(5): 865–67. Sarkar G, Sommer SS. (1991). Haplotyping by double PCR amplification of specific alleles. Biotechniques 10:436–40. Sarkar G, Cassady J, Bottema CDK, Sommer SS. (1990). Characterization of polymerase chain reaction amplification of specific alleles. Anal Biochem 186: 64–68. 
Shen Y, Sarin S, Liu Y, Hobert O, Pe’er I. (2008). Comparing platforms for C. elegans mutant identification using high-throughput whole-genome sequencing. PLoS ONE 3:e4012. Skerra A. (1992). Phosphorothioate primers improve the amplification of DNA sequences by DNA polymerases with proofreading activity. Nucl Acids Res 20: 3551–54. Sommer SS, Groszbach A, Bottema CDK. (1992). PCR amplification of specific alleles (PASA) is a general method for rapidly detecting known single-base changes. Biotechniques 12:82–97. Tomkinson AE, Vijayakumar S, Pascal JM, Ellenberger T. (2006). DNA ligases: structure, reaction mechanism, and function. Chem Rev 106:687–99. Tonisson N, Zernant J, Kurg A, Pavel H, Slavin G, Roomere H, Meiel A, Hainaut P, Metspalu A. (2002). Evaluating the arrayed primer extension resequencing assay of TP53 tumor suppressor gene. Proc Natl Acad Sci U S A 99:5503–08.
Tost J, Gut IG. (2003). Genotyping single nucleotide polymorphisms by mass spectrometry. Mass Spectrom Rev 21:388–418. Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, et al. (2008). A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 18:1051–63. Vorechovsky I. (2005). Single-strand conformation polymorphism (SSCP) analysis. In Medical Biomethods Handbook. Walker JM, Rapley R (eds). New York City, Humana Press, pp. 73–77. Wieacker P, Horn N, Pearson P, Wienker TF, McKay E, Ropers HH. (1983). Menkes kinky hair disease: a search for closely linked restriction fragment length polymorphism. Hum Genet 64:139–42. Williams JGK. (1998). System and methods for nucleic acid sequencing of single molecules by polymerase synthesis. U.S. Pat. 16,255,083. Xie HG. (2011). Chapter 52. Genetically polymorphic cytochrome P450s and transporters and personalized antimicrobial chemotherapy. In Molecular Microbiology: Diagnostic Principles and Practice. 2nd edition. Persing DH, Tenover FC, Tang YW, Nolte FS, Hayden RT, van Belkum A (eds). ASM Press, Washington, DC, pp. 803–32. Xie HG, Frueh FW. (2005). Pharmacogenomics steps toward personalized medicine. Personalized Med 2:325–37. Xie HG, Kim RB. (2004). Chapter 49. Genetic polymorphisms of cytochrome P-450s and drug transporters and infectious disease management. In Molecular Microbiology: Diagnostic Principles and Practice. 1st edition. Persing DH, Tenover FC, Tang YW, Nolte FS, Hayden RT, van Belkum A (eds). ASM Press, Washington, DC, pp. 655–69. Xie HG, Stein CM, Kim RB, Xiao ZS, He N, Zhou HH, et al. (1999). Frequency of functionally important beta-2 adrenoceptor polymorphisms varies markedly among African-American, Caucasian and Chinese individuals. Pharmacogenetics 9: 511–16. Zackrisson AL, Lindblom B. (2003). Identification of CYP2D6 alleles by single nucleotide polymorphism analysis using pyrosequencing. 
Eur J Clin Pharmacol 59: 521–26. Zanger UM, Fischer J, Klein K, Lang T. (2002). Detection of single nucleotide polymorphisms in CYP2B6 gene. Meth Enzymol 357:45–53. Zhang J, Chen LL, Guo ZF, Peng CY, Liao DF, Li K. (2003a). On/off switch mediated by exo+ polymerases: Experimental analysis for its physiological and technological implications. J Biochem Mol Biol 36:529–32. Zhang J, Li K. (2004). New performance from an old member: SNP assay and de novo sequencing mediated by exo+ DNA polymerase. J Biochem Mol Biol 37: 269–74. Zhang J, Li K. (2003a). On/off regulation of 3′ exonuclease excision to DNA polymerization by exo+ polymerase. J Biochem Mol Biol 36:525–28. Zhang J, Li K. (2003b). Single base discrimination mediated by proofreading 3′ phosphorothioate-modified primers. Mol Biotechnol 25:223–28. Zhang J, Li K. (2001). Terminal labeled primer extension: A new method for SNP analysis and expression profiling. Curr Drug Disc 9:21–23.
Zhang J, Li K, Deng Z, Liao D, Fang W, Zhang X. (2003b). Efficient mutagenesis method for producing the template of single nucleotide polymorphisms. Mol Biotechnol 24:105–10. Zhang J, Li K, Liao D, Pardinas JR, Chen L, Zhang X. (2003c). Different application of polymerase with and without proofreading activity in SNP analysis. Lab Invest 83:1147–54. Zhang J, Li K, Pardinas JR, Sommer SS, Yao KT. (2005). Proofreading genotyping assays mediated by high fidelity exo+ DNA polymerases. Trends Biotechnol 23:92–96.
CHAPTER 9
Candidate Screening through Gene Expression Profile
MICHAL KOROSTYNSKI
Contents
9.1 Concepts in High-Throughput Gene Expression Analysis
9.2 High-Throughput Gene Expression Analysis Technologies
9.2.1 Microarrays
9.2.2 Sequence-Based Sampling Methods
9.2.3 Multigene Quantitative PCR Assays
9.3 Applications and Limitations
9.4 Microarrays: Protocols in Gene Discovery
9.4.1 Before a Microarray Experiment
9.4.2 RNA Quality
9.4.3 Experimental Design
9.4.4 Data Analysis
9.5 Gene Expression Profiling Data Analysis
9.5.1 Microarray Data Preprocessing
9.5.2 Candidate Gene Selection
9.5.3 Identification and Characterization of Co-Expressed Gene Transcription Modules
9.5.4 Advance Transcript and Candidate Gene Analyses
9.6 Questions and Answers
9.7 Acknowledgments
9.8 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
9.1 CONCEPTS IN HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS
After a given genome is sequenced, the next challenge is to determine how and when cells use their genetic information to control gene transcription and the synthesis of new proteins (Lander et al., 2001; Venter et al., 2001). Alterations in gene expression profiles are early biological responses to intracellular and extracellular stimuli as well as to environmental stressors. Gene expression is controlled by transcription factors via an on/off system. However, regulation of gene transcription is complex and depends on interactions between various activators and repressors (Sandelin et al., 2007). The temporal profile of gene transcription is functionally related to changes in cell state. Therefore, cellular levels of mRNA for specific proteins provide useful markers of biological processes taking place in a particular cell or tissue as well as in the entire organism (Fig. 9.1). Several techniques are routinely used to measure eukaryotic gene expression; two of these are quantitative PCR and in situ hybridization. These traditional methods are limited in the number of genes they can simultaneously measure (typically, a single gene is analyzed) and are often based on a priori assumptions that the expression of a selected candidate gene is regulated under the particular conditions tested. The introduction of DNA microarrays has opened a new era of whole-genome transcriptional analysis in molecular biology (Brown et al., 1999). Microarray-based high-throughput gene expression profiling allows the transcription levels of tens of thousands of genes to be measured simultaneously (Lockhart et al., 2000). The results provide a researcher with a global picture of cellular function under precisely selected conditions and within a specific time window.
Whole-genome approaches in genetics and genomics have engendered a conceptual switch from purely hypothesis-driven science to discovery-driven, exploratory research (Kell et al., 2004). This technological progress has, furthermore, transformed genetic research from a process of searching for previously unknown genes to the identification of gene function, the characterization of gene regulation, and the evaluation of relationships between groups of genes and proteins. Other methods of gene expression profiling, such as suppression subtractive hybridization or serial analysis of gene expression (SAGE), are also available; however, microarray analysis is the least complicated, best standardized, and relatively inexpensive, and is therefore the most accessible approach for many investigators in all fields of biology and medicine. The future of high-throughput gene expression profiling will most probably be tied to the development of next-generation sequencing technologies (Metzker, 2005, 2009). Whole-transcriptome resequencing will allow for detection of splicing and sequence variants, as well as for quantitative determination of mRNA abundance (Wang et al., 2009).
[Figure 9.1 flowchart boxes: Model system; Clinical samples; Gene expression profiling; List of transcripts; Co-expression of genes; Association with phenotype; Mechanism of regulation; Gene expression signature; Gene function; Validation; Candidate drug targets; Candidate biomarkers; Discovery of new drug; Discovery of new diagnostic marker]
Figure 9.1. (See color insert.) Major concepts in gene transcription profiling.
The growing contribution of genomics to clinical medicine and pharmacology provides hope that novel molecular targets for therapy of various disorders, including those that currently lack a known molecular mechanism, can be identified (van’t Veer et al., 2002). To this end, gene expression profiling might be used as a preclinical tool for disease classification and for evaluation of cellular responses to toxic substances (Golub et al., 1999; Rosenwald et al., 2002; Shipp et al., 2002). Genomic markers of the pharmacological and toxicological effects of drugs provide tools for therapeutic candidates as well as for food safety risk assessments (Searfoss et al., 2005; Lamb et al., 2006; Yang et al., 2006). Other potential applications include molecular diagnostics, for instance, the identification of endogenous markers and transcriptional signatures related to disease progression or to the effectiveness of pharmacotherapy. Sets of specific genomic biomarkers for drug responses can be used as clinical diagnostic tools (Cheok et al., 2003; Simon, 2003; Potti et al., 2006). A microarray-based test for the identification of genetic variants associated with interindividual variation in the ability to metabolize drugs is already available (e.g., AmpliChip CYP450 Test) (de Leon et al., 2006). However, the
identification of additional genetic and transcriptional biomarkers is necessary for developing reliable and extensive tests to predict drug effectiveness and side effects (Goodsaid et al., 2006). Genetical genomics approaches also provide powerful tools for studying the cause-and-effect relationships between gene transcription and heritable biochemical or complex behavioral traits (Nica et al., 2008). These approaches are based on individual differences in profiles of gene expression. High-throughput methods are unique tools with which to evaluate patterns of co-expressed transcripts and to investigate networks of functionally connected genes, including regulatory factors and their effectors (Dobrin et al., 2009). Integration of bioinformatic tools with gene expression profiling provides in silico methods for the analysis of cis-acting regulatory elements overrepresented in promoter regions of co-expressed genes (Wasserman et al., 2004). Characterization of the interactions between promoter sequences, epigenetic modifications, and protein regulatory factors helps us understand the complex nature of gene transcription regulation. Characterization of coregulated groups of genes may further indicate that there is a master switch transcription factor crucial for cellular pathways related to basic cellular functions or pathological responses (Emilsson et al., 2008; Nica et al., 2008). Finally, novel genes identified by gene expression profiling are potential targets for pharmacotherapy or gene therapy (Welsh et al., 2001).
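One simple way to make the idea of co-expressed transcripts concrete is to correlate expression profiles across samples. The Python sketch below is only an illustration under stated assumptions: the gene names and expression values are invented, Pearson correlation is used as the similarity measure, and the cutoff of 0.9 is an arbitrary choice. Real analyses typically work on normalized data for thousands of transcripts and correct for multiple testing.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Invented expression levels of three genes measured in five samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with gene_a
    "gene_c": [5.0, 3.0, 4.0, 1.0, 2.0],   # tends to fall as gene_a rises
}

CUTOFF = 0.9  # hypothetical threshold for calling a pair co-expressed

co_expressed = [
    (g1, g2)
    for g1, g2 in combinations(sorted(profiles), 2)
    if pearson(profiles[g1], profiles[g2]) >= CUTOFF
]
print(co_expressed)
```

Groups of genes linked by high pairwise correlation are natural starting points for the promoter and regulatory-factor analyses mentioned above.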
9.2 HIGH-THROUGHPUT GENE EXPRESSION ANALYSIS TECHNOLOGIES
In recent years, substantial advances have been made in methodologies for measuring gene expression. The level of mRNA abundance for the entire transcriptome can be assessed simultaneously using fast, reliable, and relatively inexpensive techniques. Here, I compare the most popular technologies used in high-throughput gene expression analysis, with particular focus on various DNA microarray platforms. Some of the methods described are interesting from the historical point of view but are rarely used today (Table 9.1).
9.2.1 Microarrays
The first paper to describe the potential of array technology was published in 1987 (Kulesh et al., 1987). The technology has subsequently evolved from self-spotted cDNA arrays to silicon chips printed with short oligonucleotide probes by photolithography (Schena et al., 1995; Brown et al., 1999; Eisen et al., 1999). Transcriptional profiling is still the most popular application of microarrays; however, several other microarray-based approaches are available, for example, measuring genetic variation and protein-DNA interactions (Gunderson et al., 2005; Hoheisel, 2006; Bilitewski, 2009). Profiling gene expression at the level
TABLE 9.1. Comparison of High-Throughput Gene Expression Profiling Methods

Technology | Advantages | Disadvantages
DNA Microarrays | |
Spotted cDNA arrays | Customizable, relatively cheap and easy to perform without dedicated equipment | Noisy, large amount of total RNA is necessary, outdated
Two-channel labeling system | Relatively cheap, one array for two samples | Not recommended for multifactorial design, lack of standardization (self-spotted)
Oligonucleotide microarrays | Standardized, reliable, whole genome coverage, >100 ng of total RNA required, free dedicated software | Only previously known transcripts are detected, expensive
Exon profiling | Exon level analysis, detection of splicing variants, multiple probes per gene | Complicated procedure, large amount of data generated, more expensive than alternatives
Sequence-Based Sampling | |
Subtractive libraries | Detection of unknown genes and transcript variants | Noisy, not always quantitative, outdated
Serial analysis of gene expression (SAGE) | Detection of unknown genes and transcript variants, automated | Complicated procedure, not always quantitative
Next-generation sequencing | Whole transcriptome, variant detection | Large amount of data generated, expensive
PCR | |
PCR-based arrays | Low concentration of template needed, perfect for microarray validation | High price per gene, maximum of 384 genes
Figure 9.2. A microarray system. (a) Different platforms of whole-genome gene expression microarrays: Affymetrix, Illumina, and Agilent. (b) Instruments for RNA quality and quantity assessment: spectrophotometer NanoDrop-1000, and bioanalyzer Agilent. (c) Automated hybridization station and array scanner, both from Affymetrix. (d) Data storage and analysis server. [The figure includes an electropherogram trace, fluorescence versus time [s], with 18S and 28S peaks marked.]
of exons and global analysis of transcript splicing variants is already possible using microarrays (Blencowe, 2006; Bemmo et al., 2008). The main advantages of gene expression profiling using microarrays are standardization and whole-genome coverage. However, using this method, only previously known transcripts can be detected and analyzed. The microarray processing platform used in the method includes a hybridization unit, a scanner, and a computer workstation (Fig. 9.2). Assessment of RNA quality and quantity is critical to accurate gene expression profiling. Therefore, low-volume spectrophotometer and bioanalyzer instruments are essential components of the setup. The following discussion of microarray technologies will categorize them based on their applications and technical properties.
9.2.1.1 Spotted cDNA Arrays
In its nascent form, microarray analysis was similar to a large Southern blot, in which a sample was hybridized to a number of target DNAs attached to a solid support. In this method, spotted DNA microarrays are produced using a printing robot, which attaches previously prepared single-stranded DNA probes on the order of 100–1000 bp in length to a solid substrate. The printed probes are complementary to known gene sequences and include probes for housekeeping genes and internal controls. As mRNA is unstable, it is converted to complementary DNA (cDNA) using reverse transcriptase after it is extracted from the analyzed cells. The cDNA acts as a copy of the expressed mRNA in DNA form. The oligonucleotides on the slide are hybridized to fluorescently (e.g., biotin or cyanine dyes) or isotopically (e.g., 32P) labeled cDNA samples, and the emission of the hybridized array is scanned using equipment appropriate to the label. The intensity of the signal obtained is considered to be a relative measure of mRNA abundance. In the early stages of microarray technology, results were often difficult to replicate due to disturbances that occurred during printing, labeling, and scanning. However, the system was nonetheless effective for the selection of candidate genes with large transcriptional changes. Furthermore, when commercial arrays were not available, customized microarrays provided a useful approach for research on nonmodel organisms (Wiseman et al., 2002; Lemoine et al., 2009). Currently, spotted cDNA microarrays have been almost completely replaced by more advanced techniques.
9.2.1.2 Two-Channel Labeling System
The main difference between using one- and two-channel microarrays is the strategy of sample labeling. Two-channel microarrays are typically hybridized to cDNA prepared from two samples labeled with different fluorophores to allow comparative measurement of each (Shalon et al., 1996).
This system assumes competitive hybridization and supports experimental designs involving direct comparison between drug-treated and control samples, diseased and healthy tissues, or knock-out and wild-type animals. Fluorescent dyes commonly used for cDNA two-color labeling include cyanine 3 (Cy3, which emits in the green part of the light spectrum) and cyanine 5 (Cy5, which emits in the red part of the light spectrum). The two Cy-labeled cDNA samples are mixed and hybridized to a single microarray that is then scanned at dual excitation wavelengths to visualize fluorescence of both dyes. The relative intensity of each fluorophore is used in a ratio-based analysis to identify genes that are up- or down-regulated under different conditions. The main disadvantage of two-channel microarrays is that they are not well suited for use in multifactorial gene expression experiments. Two-color experiments can be done either with custom self-printed arrays or using whole-genome commercially available products.
9.2.1.3 Oligonucleotide Microarrays
Despite the availability of multiple methodologies for high-throughput gene expression profiling, the approach that involves oligonucleotide DNA microarrays is clearly dominant. The
source of the scientific and commercial success of oligonucleotide DNA microarrays lies in the standardized manufacturing process and the use of shorter oligos (Nuwaysir et al., 2002). Short probes can be selected on the basis of their sequence specificities and either synthesized in situ on a solid surface or conventionally synthesized and then robotically deposited. The most popular commercial system is the Affymetrix Gene Chip platform (www.affymetrix.com). In oligonucleotide microarrays, each gene is typically represented by more than one probe. An ensemble of probes mapping to different regions of the same gene is called a probe set. Commercial chips are available with oligonucleotide representations that span tens of thousands of genes and offer coverage of the entire genomes of human, mouse, and other model organisms. Oligonucleotide microarrays often contain control probes designed to hybridize to RNA spike-ins (McCall et al., 2008). The arrays are designed to give estimates of the absolute levels of gene expression and can be used to compare genes within the same sample as well as across multiple arrays. An alternative system, such as that developed by Illumina, uses arrays based on beads instead of chips (www.illumina.com). Microarrays of this type are created either by impregnating beads with different concentrations of fluorescent dye or by the use of barcoding technology (Gunderson et al., 2004). The beads are specifically identifiable and are used to detect specific binding events that occur on their surfaces. Different oligonucleotide sequences (∼70mers) are attached to each bead, and thousands of beads can be self-assembled on a fiber bundle. A subsequent decoding process is carried out to determine which bead occupies which well. Complementary oligonucleotides present in the sample bind to the beads, and bound oligonucleotides are measured by using a fluorescent label. 
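As a rough illustration of how the several probes of a probe set can be reduced to a single value per gene, the sketch below takes the median of the log2 probe intensities in each probe set. The probe-set identifiers, the intensity values, and the choice of the median are assumptions made for illustration only; this is not the summarization algorithm of any particular commercial platform.

```python
import math
from statistics import median


def summarize_probe_sets(probe_intensities):
    """Collapse each probe set to the median of its log2 probe intensities.

    probe_intensities maps a (hypothetical) probe-set ID to the raw
    intensities of the probes that target the same gene.
    """
    summary = {}
    for probe_set_id, intensities in probe_intensities.items():
        # Non-positive intensities cannot be log-transformed and are skipped.
        log_values = [math.log2(v) for v in intensities if v > 0]
        if log_values:
            summary[probe_set_id] = median(log_values)
    return summary


# Made-up intensities for two probe sets.
example_intensities = {
    "probe_set_A": [256.0, 512.0, 1024.0],
    "probe_set_B": [64.0, 64.0, 128.0, 32.0],
}
probe_set_summary = summarize_probe_sets(example_intensities)
```

The median is used here only because it is less sensitive than the mean to a single aberrant probe; established preprocessing packages apply more elaborate models.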
9.2.1.4 Expression Profiling of Transcript Variants
The results of gene expression profiling indicate that the eukaryotic transcriptome is highly complex (Johnson et al., 2003; Blencowe, 2006). A single gene can be transcribed in such a way as to yield different transcript variants, which are further translated to different protein isoforms, contributing to proteome complexity. A recent generation of oligonucleotide microarrays capable of analyzing different mRNA isoforms and splicing variants has been developed. The Affymetrix Exon platform was created using approximately 4 probes per exon and roughly 40 probes per gene (Bemmo et al., 2008). These arrays enable two complementary levels of analysis: gene expression and alternative splicing of the resultant transcripts. Multiple probes per exon enable exon-level analysis and allow transcripts representing the various isoforms of a gene to be specifically distinguished. On the whole-genome scale, exon-level analysis opens the way to detection of specific alterations in exon usage that may play a central role in the etiology and mechanisms of disease (Johnson et al., 2003). At the level of the analysis of gene expression, results from multiple probes that recognize different exons can be summarized into an expression value that represents all transcripts arising from the same gene. Currently, exon arrays provide the most comprehensive coverage of the transcriptome, especially because they include both empirically supported and predicted transcribed sequences. Exon arrays have thus enabled the discovery of previously unidentified splicing events. In such expression analysis, high resolution has been achieved by interrogating over 1 million exon clusters within the known and predicted transcribed regions of the entire genome (Clark et al., 2007; Bemmo et al., 2008). Similar technology can be used to produce tiling arrays that contain high-density oligonucleotide probes spanning the entire genome or contigs of the genome (Mockler et al., 2005). One application of this type of microarray is determination of the location of DNA-binding sites for transcriptional factors within gene promoter regions using the ChIP-on-chip method (Mockler et al., 2005; Hoheisel, 2006; Pillai et al., 2009). This approach is complementary to gene expression profiling.
9.2.2 Sequence-Based Sampling Methods
Sequence-based sampling techniques are not based on direct hybridization between samples and probes. Some, but not all, of these methods provide comparative qualitative measurement of mRNA levels. The main advantage of this technology is that the mRNA sequences sought need not be known a priori; thus, unknown genes or gene variants can be discovered. High-throughput, whole-genome DNA sequencing is available at large genomic facilities equipped with next-generation sequencers.
9.2.2.1 Subtractive Libraries
The purpose of generating a subtractive library is to enrich for transcripts that are highly expressed in one condition compared to a second or control condition (Diatchenko et al., 1996). This facilitates screening for the cDNA of interest because the complexity of the resulting library is much reduced and screening of far fewer clones is thus required. Subtractive libraries can be used to generate a highly select group of clones (in the range of 100) for further sequencing. Techniques involving the use of subtractive libraries depend on the differential elimination of duplex mRNA/cDNA or cDNA/cDNA hybrids that form between genes expressed in both conditions, leaving only single-stranded mRNAs or cDNAs of interest.
9.2.2.2 Serial Analysis of Gene Expression
SAGE is a powerful tool that employs digital analysis of patterns of overall gene expression (Velculescu et al., 1995; Anisimov, 2008). Because SAGE does not require a preexisting clone, it can be used to identify and quantify new genes in addition to known genes. A snapshot of the mRNA population in a sample of interest is obtained in the form of small tags that correspond to fragments of the transcripts present in the sample. Several variants have been developed, including a particularly robust version that uses an increased tag length of 25–27 bp to enable precise annotation of existing genes as well as discovery of new genes within genomes (Sutcliffe et al., 2000).
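The "digital" character of tag-based profiling can be illustrated with a minimal sketch: each sequenced tag is counted, and the counts are scaled to tags per million so that libraries of different sizes can be compared. The tag sequences below are invented, and the scaling to one million tags is an assumption chosen for illustration rather than a step prescribed in this chapter.

```python
from collections import Counter


def tag_abundance(tags):
    """Return {tag: (raw_count, tags_per_million)} for a list of sequenced tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {
        tag: (count, count * 1_000_000 / total)
        for tag, count in counts.items()
    }


# Hypothetical 10-bp tags read from one library.
observed_tags = ["AAACCTGGTC", "TTGACCATGA", "AAACCTGGTC", "AAACCTGGTC"]
abundance = tag_abundance(observed_tags)
```

In practice each tag would then be matched to a transcript database; tags without a match are the candidates for previously unknown transcripts.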
9.2.2.3 Transcriptome Resequencing
Technological advances have resulted in the development of transcriptome resequencing methods that combine the advantages of both tag-based and microarray approaches (Mardis, 2008). This technology provides the opportunity to gather a full profile of gene expression that includes genome annotation, identification of novel transcripts, splice variant detection, allele-specific expression, the ability to distinguish paralogous genes, assembly of full-length genes, and the discovery of coding SNPs or other variations such as insertions or deletions (indels). Genome sequencing has already been used to obtain unbiased transcriptome surveys by rapid sequencing of full-length cDNA libraries. Moreover, next-generation sequencing (NGS) systems are capable of determining the complete sequence of a genome (either de novo or by resequencing) and of generating ChIP-seq and metagenomic data (Mardis, 2008; Fox et al., 2009; Park, 2009). Integrated platforms of NGS systems include a set of high-throughput sequencers supplied with a cluster of computers for raw data analysis.
9.2.3 Multigene Quantitative PCR Assays
Quantitative RT-PCR (qPCR) is widely used to detect and quantify specific DNA or cDNA sequences (Higuchi et al., 1993; Heid et al., 1996). In this technique, amplifications are performed in the presence of a DNA-binding fluorescent dye. A macroarray is a gene expression array with a predesigned set of a few dozen functionally or biologically related genes. PCR macroarrays are therefore reliable tools for analyzing the expression of specific sets of genes that are related to specific cellular pathways such as development or metabolism or to specific diseases (for instance, cancer or allergy). Moreover, genomewide qPCR arrays for microRNAs (measuring 100 miRNAs) are also available. In this analysis, very low amounts of starting material are required. Custom PCR-based macroarrays are commercially available from Applied Biosystems and SuperArray. Such qPCR-based macroarray analysis requires a PCR instrument equipped with 96-well or 384-well units.
9.3 APPLICATIONS AND LIMITATIONS
Currently, most gene profiling experiments are performed using DNA microarrays. However, the early era of high-throughput gene transcription profiling was marked by limited reliability (Allison et al., 2006; Draghici et al., 2006). Early problems with the replication of results and failed validation using alternative techniques initially raised serious doubts about the usefulness of such high-throughput methods. Yet technical improvements in DNA microarray technology and advances in general knowledge regarding application of these tools have transformed this technique into a key method of quantifying gene expression (Bammler et al., 2005; Larkin et al., 2005). The techniques
employed have evolved from rough screening of selected genes to exon-level whole-genome profiling. Microarray analysis can provide us with a picture of the entire transcriptome and allow us to further identify not only solitary candidate genes but also co-regulated groups of transcripts (Dobrin et al., 2009). Investigators in the clinical sciences are testing the usefulness of these features as diagnostic tools for the classification of cancers and the prediction of the therapeutic effectiveness of potential treatments (van’t Veer et al., 2002; Goodsaid et al., 2006; Potti et al., 2006). Anyone who is planning to use microarrays should be aware of the technical limitations of gene expression profiling and should avoid making assumptions that may bias the experimental results. It is important to note that gene transcription is not always proportional to the translation of an associated protein because each mRNA has a specific lifespan and different rates of degradation; furthermore, cellular activities are not necessarily characterized by the transcriptome profile (Unwin et al., 2006). A gene expression profile is essentially a snapshot of the mRNA present in a population of cells at a given point in time, and interpretation of this snapshot requires careful thought about the results without excessive speculation. The measurement of transient transcriptional responses uncovers signatures of protein translation and molecular markers associated with specific cell states. Transcriptomes of complex tissue usually include several overlapping cell-type specific gene transcription profiles. A variety of technological approaches are being developed that can be used to separate expression profiles of different cell types, including combinations of flow cytometry and transgenics or laser microdissection (Emmert-Buck et al., 1996; Heiman et al., 2008). Detailed time-course experiments are often used to analyze the temporal dynamics of transcriptional changes. 
One caveat to this approach is that the identification of previously unknown transcripts is not possible using microarrays and is limited to high-throughput sequencing. On the other hand, genome sequencing projects for all of the commonly used model organisms are currently nearly completed, eliminating the need to identify totally novel genes. There remains room for technical improvement and further development in microarray technology. Reductions in the levels of background, increases in the specificity of hybridization and better annotation of transcripts are gradually being made with each new version of DNA microarray. Moreover, microarrays are now widely used as a method of genomewide genotyping and analysis of variation in copy numbers (CNV) (Gunderson et al., 2005). Alternative technologies are also under development. Next-generation sequencing methods may provide solutions for some of the current methodological problems of microarrays (Wang et al., 2009). The amount of raw data generated by high-throughput gene expression profiling methods is enormous. Advances in bioinformatics and data-mining methods are therefore critical for the future of genomic research. Large collections of previously collected microarray data have been made publicly available. Among the best such data sources are the Gene Expression Omnibus
(GEO; www.ncbi.nlm.nih.gov/geo) and the ArrayExpress database (www.ebi.ac.uk/microarray-as/ae) (Edgar et al., 2002; Parkinson et al., 2005). GEO is a gene expression repository supporting MIAME-compliant data submissions and is maintained as a curated online resource for gene expression data browsing, query, and retrieval. Minimum information about a microarray experiment (MIAME) is a standard intended to specify all of the information necessary to interpret the results of an experiment unambiguously and to define the conditions for the experiment to be reproduced (Brazma et al., 2001).
9.4 MICROARRAYS: PROTOCOLS IN GENE DISCOVERY
This section is focused on protocols and examples of gene expression profiling applications using microarray technology. Short guidelines for planning gene expression profiling experiments, together with some additional information important from the practical point of view, are presented. For a successful microarray experiment, the most important issues are concept and design, sample quality, and data analysis.
9.4.1 Before a Microarray Experiment
The biological questions motivating an experiment must be clearly and strongly connected to predicted changes or differences in gene transcription or differences in mRNA levels (Abdullah-Sayani et al., 2006). Hypotheses may differ for gene expression profiling in clinical material, tissue samples obtained from model organisms or in vitro cultured cells. Samples derived from human blood and postoperative or postmortem tissues are usually used for screening of diagnostic and prognostic biomarkers, whereas in vivo and in vitro model systems provide controlled conditions more appropriate for the study of biological mechanisms and gene function. Alterations in gene expression profiles may be temporarily induced or long-lasting and adaptive. Therefore, the time points of an analysis and the selection of cell line, model organism, or tissue must be well thought out. It is also recommended that pilot qPCR experiments be performed and that effective doses of a drug or time points of maximal response be determined before gene expression profiling is undertaken. The controls are critical and require appropriate time, sham, or vehicle-treated groups of samples.
9.4.2 RNA Quality
The quality of the extracted total RNA must always be estimated without bias (Auer et al., 2003). The best method by which to assess RNA quality is based on a microfluidics bioanalyzer platform (Fig. 9.2) (Imbeaud et al., 2005). In this method, the analysis of RNA is standardized and assigns a
quantitative measure of quality designated as an RNA integrity number, or RIN (Schroeder et al., 2006). Typically, total RNA samples with a RIN between 8 and 10 are considered to be of good quality. All samples compared should be of as similar quality as possible. The use of an RNA stabilization reagent, such as RNAlater, is recommended for maximizing quality during tissue collection. Moreover, RNA isolation protocols should be optimized for each particular species and tissue. Two methods of RNA sample quantification are available, spectrophotometric and fluorescent; either or both may be used to equalize RNA concentrations between samples. The quality of RNA isolated from postoperative or postmortem human tissues is particularly variable and should be even more carefully controlled. However, even if the RNA samples are of a low quality (e.g., moderate RNA degradation), it is reasonable to test them on a microarray. If the amount of starting material is small, microarray analysis may be possible after linear amplification of RNA using amplification kits.
9.4.3 Experimental Design
The overall design of the experiment is also very important. The number of replicates and distribution of samples across hybridization batches require careful consideration (Churchill, 2002; Pan et al., 2002; Yang et al., 2002). With commercially available platforms of oligonucleotide microarrays, technical replicates are no longer considered necessary; instead, experiments should be performed using a rational number of biological replicates (Wei et al., 2004). For statistical reasons, the experiment must have at least three biological replicates; however, five replicates are recommended. To reduce variation or to obtain a sufficient amount of starting material, samples from two or three individuals might be pooled and hybridized to a single microarray (Kendziorski et al., 2003). On the other hand, individual data points are valuable for analyses of correlations between gene expression and phenotypic traits. A simple microarray experiment can compare gene expression between two groups— for example, the comparison of treated samples versus control. More complex experiments using several time points, multiple drug doses, and different tissue types, strains, or combinations thereof can be designed to analyze multifactorial biological models. The design of such experiments should take into consideration hidden biological factors such as circadian rhythm and the effects of vehicle or of different genetic backgrounds, which can significantly influence gene expression profiles. Experimental design should also take into account technical factors—for example, gene expression profiling platform, microarray version, and differences across hybridization batches (Moreau et al., 2003; Barnes et al., 2005; Cahan et al., 2007). Complex experiments allow detailed insight into transcriptional networks and the mapping of genomic activity, and the results may elucidate previously unknown gene functions or novel relationships between proteins.
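The point about distributing samples across hybridization batches can be made concrete with a small sketch. Samples from each experimental group are shuffled and then dealt to batches in round-robin order, so that each batch receives a near-equal share of every group and the batch effect is not confounded with treatment. The function name, sample labels, group sizes, and number of batches are all hypothetical.

```python
import random
from collections import Counter


def assign_batches(samples_by_group, n_batches, seed=0):
    """Deal the samples of each group across batches in round-robin order."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    assignment = {}
    position = 0  # running position keeps batch sizes even across groups
    for group in sorted(samples_by_group):
        shuffled = list(samples_by_group[group])
        rng.shuffle(shuffled)  # randomize which sample lands in which batch
        for sample in shuffled:
            assignment[sample] = position % n_batches
            position += 1
    return assignment


# Hypothetical experiment: 6 control and 6 treated samples, 2 batches.
samples_by_group = {
    "control": [f"control_{i}" for i in range(1, 7)],
    "treated": [f"treated_{i}" for i in range(1, 7)],
}
batch_of = assign_batches(samples_by_group, n_batches=2)
per_batch = Counter(
    (batch, name.split("_")[0]) for name, batch in batch_of.items()
)
```

With these inputs every batch holds three control and three treated samples. When group sizes are not multiples of the batch number, the allocation is balanced only to within one sample.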
9.4.4 Data Analysis
In contrast to many of the techniques used in molecular biology, a relatively short amount of time is needed to obtain raw microarray data compared to the time required to analyze it. To obtain comprehensive results and biologically sound answers from such data, the proper use of statistical methods is critical (Schadt et al., 2000). There is no gold standard for microarray data analysis. However, a few general guidelines can help ensure its quality (Tilstone, 2003; Allison et al., 2006). First, it is better to use a combination of statistical tests rather than simple fold changes (ratio of mRNA abundance mean levels in control and experimental groups) as a primary means of filtering. Second, correction for multiple comparisons is necessary. Because it is not always easy to define the balance between the levels of false positive and false negative results, the optimal method of correction of the data should be ascertained. For instance, use of the Bonferroni correction (α′ = α/n, where α is the significance level and n is the number of hypotheses tested) might be too stringent for experiments with a low number of replicates. The investigator should be wary of unnecessary standardizations because these can potentially introduce artifacts instead of removing technical bias. Statistical and knowledge-based gene selection should be used in a complementary fashion. The computational resources needed to analyze a small or medium-sized microarray experiment (<30 microarrays) are not demanding and, for such experiments, an average PC will usually be sufficient. However, for large datasets (>100 microarrays or >30 exon-level arrays), more powerful computers may be required.
Examples of microarray experimental design follow; see also Figure 9.3.
A. Profiling transcriptional response of human-derived macrophages to cytokines (Fig. 9.3a).
Aim: The identification of novel cytokine-responsive genes.
Materials: 15 Affymetrix Human 133A arrays containing ∼25 k probe sets; the majority of probes recognize the 3′ untranslated region of genes. High-quality RNA extracted from in vitro cultured macrophages.
Design: 5 biological replicates, 5 distinct hybridization batches for 5 different donors, untreated control, 1 time point.
Analysis: Data standardization for batch effects using z-score transformation. Statistical analysis using one-way ANOVA for treatment effect and correction for multiple comparisons using control of the false discovery rate (FDR method by Benjamini and Hochberg).
B. Effects of drugs of abuse on the profile of gene expression in the brain (Fig. 9.3b).
Aim: The identification of drug-induced gene expression patterns in brain and the identification of new drug-responsive genes.
Materials: 96 Illumina Mouse WG-6 microarrays containing ∼45 k probes. High-quality RNA extracted from mouse brain.
Figure 9.3. Design of gene expression profiling experiments. (a) Profiling transcriptional response of human-derived macrophages to cytokines. (b) Effects of drugs of abuse on the profile of gene expression in the brain. (c) Comparison of liver transcriptomes in two strains of rat.
Panel (a): macrophages; stimulation with interleukin 1 or interleukin 6 versus control; 2 cytokines, 1 donor per array, 5 biological replicates; Affymetrix Human 133A 2.0 microarray; one-factor experiment (treatment); outcomes: cytokine-regulated transcripts, individual differences, functional classification.
Panel (b): mouse brain; 6 drugs (cocaine, morphine, ethanol, heroin, nicotine, methamphetamine) plus naive and saline controls; 4 time points (1, 2, 4, and 8 h after injection); 2 animals per array, 3 biological replicates; Illumina Mouse WG-6 microarray (1 plate = 6 arrays) with balanced batch effect; two-factor experiment (treatment, time); outcomes: drug-specific gene regulation, co-expressed patterns, identification of regulatory factors.
Panel (c): rat liver; 2 strains (high-ethanol preferring and low-ethanol preferring); 2 animals per array, 5 biological replicates; Affymetrix Rat Exon ST 1.0 microarray; one-factor experiment (strain); conventional array probes compared with all exon array probes across isoforms 1 to 3 of a pre-RNA transcript; outcome: detection of differences in mRNA level and expression of splicing variants.
Design: Pooled total RNA from 2 mice, 3 biological replicates, brain tissue from inbred mouse model, 4 time points, saline-treated and naive controls represented at each time point, with a careful balance between the 2 batches.
Analysis: Standardization for batch effect; two-way ANOVA for time and drug interactions, with Bonferroni correction (or FDR) applied for multiple comparisons.
C. Comparison of liver transcriptome between rat strains (Fig. 9.3c).
Aim: The identification of genes and transcript variants that are expressed in association with genetic backgrounds and phenotypic differences.
Materials: Affymetrix Exon ST 1.0 arrays, containing ∼1 million probes, ∼250 k exons and 25 k transcript clusters. High-quality RNA extracted from rat liver.
Design: 5 biological replicates, total RNA from liver tissue of 2 rats pooled for 1 array, unselected outbred rats used as a control, 1 hybridization batch.
Analysis: One-way ANOVA, followed by correction for multiple tests and then exon-level analysis.
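The two corrections for multiple comparisons named in the examples above can be contrasted with a short sketch. The Bonferroni procedure judges every test against the significance level divided by the number of tests, whereas the Benjamini and Hochberg procedure converts the ranked p-values into adjusted values that control the false discovery rate. The p-values below are invented for illustration, and the function names are not taken from any package mentioned in this chapter.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value is at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from per-gene tests (e.g., one ANOVA per probe set).
p_values = [0.001, 0.012, 0.025, 0.035, 0.20]
bonferroni_hits = bonferroni_significant(p_values)
q_values = benjamini_hochberg(p_values)
fdr_hits = [q <= 0.05 for q in q_values]
```

For these five values, the Bonferroni threshold of 0.01 retains a single gene, while control of the false discovery rate at 5% retains four, which illustrates why the stricter correction may discard true positives in small experiments.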
9.5 GENE EXPRESSION PROFILING DATA ANALYSIS
This short introduction to gene expression profiling analysis contains a description of some basic statistical and bioinformatics methods (Table 9.2). Understanding of these methods and practical knowledge of how to use them are essential for obtaining coherent biological interpretations of microarray results. The microarray data analysis process, from raw data to the identification and characterization of candidate genes, consists of the following steps.
9.5.1 Microarray Data Preprocessing
Microarray scanners generate raw data in different image formats (e.g., *.dat or *.tif files). To obtain gene expression profiling results, the data must be preprocessed. Several freeware software platforms that automatically perform these analyses have been developed, for example, R with Bioconductor packages (e.g., affy and beadarray) and dChip (Pan et al., 2002; Gautier et al., 2004; Gentleman et al., 2004; Dunning et al., 2007). Commercial statistical and visual data analysis platforms are also available from Partek (www.partek.com) or GeneSpring (www.agilent.com/chem/genespring). The first steps in the analysis are quality control and normalization of the microarray data (Tseng et al., 2001; Durinck, 2008). Quality control of array hybridization is usually based on the number and percent of probes detected on a microarray or comparison of a signal from positive (e.g., spike-in hybridization control) and negative control probes. Single microarray points with weak parameters should be considered as outliers and excluded from further analyses. Inspection of the
TABLE 9.2. Free Microarray Software and Databases

Name | Address | Application
Software | |
R | www.r-project.org, www.bioconductor.org | Environment for statistical and bioinformatic data analysis
dChip | http://biosun1.harvard.edu/complab/dchip | Statistical and bioinformatic microarray data analyses, graphical interface
MeV | www.tm4.org/mev | Analysis, visualization and data mining of large-scale genomic data
ErmineJ | www.bioinformatics.ubc.ca/ermineJ | Analyses of gene sets in expression microarray data
SAM | www-stat.stanford.edu/~tibs/SAM | Supervised learning software for genomic expression data mining (Excel plugin)
DAVID 2008 | http://david.abcc.ncifcrf.gov | The database for annotation, visualization and integrated discovery
oPOSSUM | www.cisreg.ca/oPOSSUM | Identification of overrepresented transcription factor binding sites in sets of genes
Database | |
GEO/ArrayExpress | www.ncbi.nlm.nih.gov/geo, www.ebi.ac.uk/microarray-as/ae | A public functional genomics data repository supporting MIAME-compliant data submissions
Allen Atlas | www.brain-map.org | Database of gene expression patterns in the brain
GeneNetwork | www.genenetwork.org | Resources and analysis tools for systems genetics
distribution of the signal across each microarray and within the set of microarrays in the experiment (using histograms and box plots) can help identify further outliers. It is also possible to detect low-quality template RNA by analyzing the 5′/3′ signal ratio using specific microarray probes (Popova et al., 2008). This comparison helps distinguish between problems with sample quality and problems with array hybridization. The next step of the analysis is
normalization, which is necessary to obtain reliable gene expression measurements. Normalization is required because of variations in experimental and hybridization conditions, and it reduces systematic variation in gene expression measurements (Verducci et al., 2006; Durinck, 2008). The major sources of such variation are technical and biological bias (Gibson, 2008). Different normalization methods, such as LOESS, quantile, or invariant set normalization, have been used in an effort to improve quality (Steinhoff et al., 2006). Based on the signal level from blank probes or negative hybridization controls, it is possible to establish background levels for microarray signals. Genes with hybridization signals close to the background level may be considered low-expressed and filtered out of the analysis. Frequently used statistical tests assume a normal distribution of values. Therefore, to obtain an approximately normal distribution of microarray data, a logarithmic (log2) transformation can be performed. Once gene expression values have been normalized and log-transformed, they are ready for statistical analyses.

9.5.2 Candidate Gene Selection
Often, an important question is how many genes are regulated under the particular conditions being studied. Some genes are constitutively transcribed, and these genes are usually considered not regulated (Vandesompele et al., 2002). Other genes show strongly inducible transcription under various conditions and in different tissues. However, the scale of alterations in gene expression profiles often differs between experiments, and the number of regulated genes must therefore be estimated in each experiment. Fold change in mRNA abundance between control and experimental groups can be used as an additional parameter to provide a list of genes for further investigation, but genes with low levels of expression usually show higher fold changes between examined and control samples. From the transcriptional point of view, the identification of groups of co-expressed genes and regulatory networks is the most interesting (Pilpel et al., 2001; Dobrin et al., 2009). However, the researcher may also select individual candidates from a list of the most significantly regulated genes based on their known biological functions. Nevertheless, novel genes with unknown functions are even more interesting as putative components of new molecular mechanisms involved in the biological process under investigation. To select genes of interest from whole-genome gene expression profiles, it is necessary to perform statistical analysis. A good approach is to use traditional statistical methods, such as the t-test or Welch test (for a single comparison) or ANOVA or MANOVA (for multifactorial experiments), whichever best suits the design of the experiment (Churchill, 2004). To obtain reliable results of gene expression profiling, the number of results (e.g., the length of a list of regulated genes) must be significantly higher than the number obtained by chance in the particular dataset.
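For a single comparison, the Welch test statistic and the fold change mentioned above can be sketched for one gene as follows. This is only an illustration: the function names and the expression values are hypothetical, the values are assumed to be normalized log2 intensities, and the conversion of t and its degrees of freedom into a p-value (which needs the t distribution, available in R and similar packages) is left out.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of group means; inputs are assumed to be log2 values."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(control), len(treated)
    # Squared standard error contributed by each group (sample variance / n).
    se1 = variance(control) / n1
    se2 = variance(treated) / n2
    t = (mean(treated) - mean(control)) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical log2 intensities for one gene, three replicates per group.
control = [8.0, 8.2, 7.8]
treated = [9.0, 9.4, 9.2]
t_stat, dof = welch_t(control, treated)
print(round(log2_fold_change(control, treated), 2), round(t_stat, 2), round(dof, 1))
```

Because such a test is run once for every transcript on the array, a number of small p-values will arise by chance alone.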
Therefore, the statistical values (p) should be corrected for the number of comparisons performed in the analysis (number of analyzed
transcripts). Several methods of correction for multiple comparisons exist, including the classical Bonferroni method. Alternatively, the false discovery rate (FDR) can be estimated using the Benjamini and Hochberg method or q-value methods (Tusher et al., 2001; Storey et al., 2003). These correction methods are available in R. Moreover, the FDR for a set of potentially differentially expressed transcripts in a particular comparison can be estimated empirically by permutation analysis (Efron et al., 2002). Once a list of genes has been obtained at a given statistical threshold with an estimated FDR (a threshold of <5% is usually accepted), the data are ready for further bioinformatic analyses.

9.5.3 Identification and Characterization of Co-Expressed Gene Transcription Modules

A global profile of gene expression contains a number of co-expressed transcriptional patterns (Dobrin et al., 2009). One of the main goals of gene expression profiling is to define particular modules of co-regulated genes and predict important transcription factors (Werner, 2001). Transcriptional signatures and patterns of gene expression can be identified using various methods of gene aggregation (Golub et al., 1999). One of the most frequently used methods is cluster analysis, an approach that groups together gene expression profiles that are close to one another (Azuaje, 2003; Datta, 2003; Thalamuthu et al., 2006). Hierarchical clustering is based on repeated calculation of distance measures (for instance, correlation measures or Euclidean distance) between genes. The unsupervised learning algorithms k-means and self-organizing maps (SOM) are also useful for grouping differentially regulated genes into sets based on their expression patterns. These procedures classify a given data set into a certain number of clusters fixed a priori.
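The repeated distance calculation behind hierarchical clustering can be sketched in a few lines. The example below is an assumption-laden illustration rather than a reference implementation: it uses 1 minus the Pearson correlation as the distance, average linkage between clusters, and stops merging when a requested number of clusters remains (equivalent to cutting the tree at that level). Gene names and profiles are invented, and profiles are assumed not to be constant.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(cluster1, cluster2, profiles):
    """Mean correlation distance over all gene pairs between two clusters."""
    dists = [1 - pearson(profiles[g1], profiles[g2])
             for g1 in cluster1 for g2 in cluster2]
    return sum(dists) / len(dists)


def cluster_genes(profiles, n_clusters):
    """Agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[gene] for gene in profiles]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], profiles),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical log2 expression over four time points.
example_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 2.9, 4.2, 5.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [5.0, 4.1, 2.8, 2.0],
}
print(cluster_genes(example_profiles, 2))  # [['geneA', 'geneB'], ['geneC', 'geneD']]
```

With a correlation-based distance, genes are grouped by the shape of their profile rather than by absolute expression level, which is usually what is wanted when looking for co-regulated genes.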
Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations; it can be used to simplify the analysis and visualization of microarray datasets (Yeung et al., 2001). All these methods can extract lists of genes with similar patterns of expression from gene transcription profiles. Several examples of the visualization of microarray results are presented in Figure 9.4. Lists of a few dozen co-expressed genes can be analyzed using automated data-mining tools that help in understanding the entire gene expression profile as well as in obtaining additional information contained in the data. Gene ontology (GO) annotation provides statistics on the biological terms related to the data (Ashburner et al., 2000; Huang et al., 2007). GO allows mining of co-regulated gene profiles and detection of functional gene clusters, which is important for classifying the biological processes that may influence the gene expression profile. Literature mining is the process of applying data-mining techniques to databases of the published literature (Lee et al., 2005). This approach is able to place multiplex biological and medical data into
[Figure 9.4, panels (a) to (f). Axis labels: intensity ratio, average intensity, log intensity, sample quantiles, theoretical quantiles, log-odds ratio, log2 fold change, PCA1, PCA2. Legend: Class 1, Class 2.]
Figure 9.4. (See color insert.) Various methods of presentation of microarray quality control and data analysis. (a) MA plots are used to compare data from one microarray against other microarrays. This method is used to visualize the intensity-dependent ratio of raw microarray data. M is the intensity ratio, and A is the average intensity for a dot in the plot. (b) Box plots are used to compare data from an entire group of microarrays. Each box is a graphical depiction of the gene expression data from one microarray through their five-number summaries: the smallest observation (sample minimum), lower quartile, median, upper quartile and largest observation (sample maximum) are presented. (c) The quantile–quantile (Q-Q) plot is a probability plot, a graphical method for comparing two probability distributions by plotting the observed data values against their expected quantiles. (d) A volcano plot is a graph that summarizes the results from the experiment; both fold-change and statistical criteria are plotted. (e) A hierarchical clustering for both genes and samples. (f) The results of PCA presented as a plot.
context relative to published work. To find points of reference for specific microarray results, lists of genes from the co-expressed gene patterns can be compared with previously described changes in gene transcription profiles. Literature mining is based on lists of genes reported as regulated in published manuscripts or found in publicly available datasets. The analysis of common motifs in promoter regions of co-expressed genes is helpful in determining putative regulatory mechanisms of gene transcription (Pilpel et al., 2001). This method allows searches for overrepresented transcription factor binding sites (TFBS) within the conserved regions of genes. Gene promoter sequences can be investigated in silico for the identification and overrepresentation of TFBS (Wasserman et al., 2004). TFBS matrices are stored in repositories such as JASPAR and TRANSFAC (Sandelin et al., 2004). Using these matrices, it is possible to search for TFBS and perform statistical analysis of the overrepresented sites using the oPOSSUM or Genomatix databases. The identification of master-switch transcription factors is important for the interpretation of gene expression profiling data.

9.5.4 Advanced Transcript and Candidate Gene Analyses
Candidate genes selected on the basis of microarray screening are further investigated for their function and relation to the phenotype being studied. An alteration in gene transcription suggests a direct or indirect connection between the specific gene and the biological process or disease. Information about gene sequence, splicing variants and homologous genes may provide further insight into the molecular control of gene expression. These data are accessible through several genetic and genomic databases, such as GenBank, Ensembl and UCSC (www.ncbi.nlm.nih.gov/Genbank, www.ensembl.org, and genome.ucsc.edu) (Kent et al., 2002; Benson et al., 2009; Hubbard et al., 2009). The comparison of multiple microarray probes designed to detect different regions of specific genes can be used to confirm alterations in the expression of transcripts or to show changes in the expression of specific mRNA forms. Using exon-level microarrays, it is possible to identify expression of specific splicing variants or novel mRNA forms (Clark et al., 2007; Bemmo et al., 2008). When individuals of different genetic backgrounds are compared, genetic polymorphism may also influence the results of gene expression profiling (Alberts et al., 2007). Moreover, indels and SNPs close to the target regions of microarray probes may affect hybridization. Copy number variations (CNVs) can also be detected using DNA microarrays. Significant associations between genes or gene variants and specific phenotypes obtained from genome-wide association studies are stored in the dbGaP database (Mailman et al., 2007). It is also possible to determine how much of the variation in gene expression is due to genetic variation using different human and mouse datasets. The analysis of correlations between the expression of a particular gene and various behavioral or physiological phenotypes in a panel of inbred mouse strains is possible using the GeneNetwork database (Wu et al., 2004; Druka et al., 2008).
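The kind of expression-to-phenotype correlation described in the last sentence can be illustrated with a small rank correlation. The sketch below is not how GeneNetwork computes its results; it is a stand-alone Spearman correlation over the strains that two tables have in common. The function names are hypothetical and the expression and phenotype values are invented for the example.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def expression_phenotype_correlation(expression, phenotype):
    """Spearman correlation over strains present in both dictionaries."""
    shared = sorted(set(expression) & set(phenotype))
    x = [expression[s] for s in shared]
    y = [phenotype[s] for s in shared]
    return pearson(average_ranks(x), average_ranks(y)), len(shared)


# Invented strain means: log2 expression of one gene and one phenotype score.
expression = {"C57BL/6J": 8.1, "DBA/2J": 9.4, "BXD1": 8.6, "BXD2": 9.0, "BXD5": 8.3}
phenotype = {"C57BL/6J": 12.0, "DBA/2J": 30.5, "BXD1": 18.2, "BXD2": 25.1,
             "BXD5": 19.0, "BXD9": 21.0}  # BXD9 has no expression value and is skipped
rho, n_strains = expression_phenotype_correlation(expression, phenotype)
print(round(rho, 2), n_strains)  # 0.9 5
```

A rank correlation is used here because it is less sensitive to a single extreme strain than the Pearson correlation on raw values; either could be appropriate depending on the data.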
9.6 QUESTIONS AND ANSWERS
Q1. Do I need any special equipment to perform microarray analysis?
Q2. For which species/tissues are microarray chips available?
Q3. Which technology is better, high-throughput sequencing or microarray?
Q4. Which microarray technology/platform would be optimal for my experiment?
Q5. How much total RNA do I need?
Q6. How many replicates should be performed?
Q7. How do I process raw microarray data?
Q8. How many genes are constitutively expressed, and how many are under regulation in my experiment?
Q9. How do I interpret the results of a long list of regulated genes?
Q10. Is it possible to compare microarray results between different batches/versions/platforms/labs?

A1. A complete microarray hybridization and scanning system is relatively expensive, and performing reliable and repeatable experiments requires expertise and experience. Therefore, the purchase of a full microarray system is advisable only if you plan to conduct these experiments on a large scale. For a small number of experiments, the best choice is often to contact a microarray facility that offers services using the chosen platform. With this option, the investigator need only design the experiment, collect tissue or cells, and analyze the resulting data.
A2. Dedicated whole-genome microarrays are available for most (if not all) model organisms and for tissues from which sufficient amounts of total RNA can be extracted. The list of commercially available products is also constantly increasing. Where a nonmodel organism is being investigated, it is possible to use arrays dedicated to a closely related species (however, it is recommended that you include information about sequence differences in the analysis). Alternatively, a custom spotted microarray can be printed if you have access to an appropriate cDNA library.
A3. High-throughput sequencing and microarray technologies cover different fields of whole-genome gene expression profiling. Microarrays are still irreplaceable in huge multifactorial experiments involving a large number of samples. High-throughput sequencing is capable of obtaining sequences of novel genes and genomes as well as detecting genetic polymorphisms. Next-generation sequencing is a novel approach.
Because of its novelty, it is expensive, and its quantitative value must be confirmed in further experiments.
A4. The choice of microarray technology/platform depends on the biological question driving the experiment. For example, microarrays based on 3′ UTR probes are less resistant to sequence polymorphisms; in cases involving strain comparisons, exon-level microarray analysis should work better. The quality of the results is critical; therefore, the use of unreliable microarray platforms is not economically beneficial even if the initial costs appear lower. On the other hand, if the overall reliability of two platforms is similar, it is recommended that the less expensive option be chosen and more biological replicates performed.
A5. For commercially available oligonucleotide microarrays, at least 100 ng of high-quality total RNA is needed, and about 1 μg is optimal. When the starting material is highly limited, linear amplification of the RNA using a two-cycle protocol is possible. However, to ensure that the initial ratio of the differences between samples is preserved, it is highly recommended that RNA quality and quantity be carefully checked before the experiment.
A6. It is always up to each researcher to determine how many replicate arrays should be performed in a particular experiment. Three independent biological replicates are considered the minimum; six to nine replicates would be close to the optimal and most cost-effective number (behavioral and pharmacological experiments usually test six to nine replicates in parallel). For highly variable samples, such as postoperative or postmortem human tissues, the number of replicates should be higher.
A7. The methods by which raw microarray data are analyzed are specific to the particular microarray platform. Usually, preprocessing of data includes background correction, data normalization and gene annotation. Raw microarray data can often be analyzed using software released by the array manufacturer, R packages, or other software, such as dChip. Using any of these methods, you will be able to generate files that contain gene expression measures for each of the microarray probes.
A8. The number of genes constitutively transcribed (host-genes) depends on cell type and stage of development. Expression of more than 10,000 transcripts in a particular cell (from about 25,000 genes) is considered a basic gene profile, whereas the number of genes regulated in a particular experiment depends on experimental conditions such as age and treatment. The best way to estimate the number of regulated genes is to use a method that estimates the false discovery rate for each particular comparison (e.g., 1% or 5%).
A9. To further characterize microarray results, you can use various publicly available bioinformatic tools. Functional connections might be revealed by overrepresented functional terms in gene ontology, by literature mining software, or by analysis of transcription factor binding sites (TFBS) in promoter regions of co-regulated groups of genes using web-based tools.
A10. Comparisons between microarray data from different sources can be very enlightening; however, it is not always possible to perform a direct comparison, even when the data are obtained using the same platform. The best way to approach joint analysis of microarray data from different batches/versions/platforms/labs is to standardize the experiments using controls or z-score transformation. Each of these methods can reduce the influence of technical bias and improve the search for genes that are co-expressed in different datasets.
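As a rough illustration of the z-score transformation mentioned in A10, the sketch below standardizes the values of one gene separately within each batch, so that every batch ends up with mean 0 and standard deviation 1 for that gene. This is one possible reading of the answer, not a prescribed procedure: the function name, the batch labels and the values are hypothetical, and the approach assumes that each batch contains a comparable mix of experimental groups.

```python
from statistics import mean, stdev


def zscore_within_batches(values, batches):
    """Standardize one gene's values separately inside each batch.

    values  -- expression values of a single gene, one per sample
    batches -- batch label of each sample, in the same order
    """
    result = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_values = [values[i] for i in idx]
        m = mean(batch_values)
        # A batch with one sample (or no variation) carries no scale information.
        s = stdev(batch_values) if len(batch_values) > 1 else 0.0
        for i in idx:
            result[i] = (values[i] - m) / s if s > 0 else 0.0
    return result


# Hypothetical log2 values: batch B reads about 5 units higher than batch A.
values = [5.0, 6.0, 7.0, 10.0, 11.0, 12.0]
batches = ["A", "A", "A", "B", "B", "B"]
print(zscore_within_batches(values, batches))  # [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
```

After the transformation the offset between the two batches is gone, while the ordering of samples within each batch is preserved.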
9.7 ACKNOWLEDGMENTS
This work was supported by the Polish Ministry of Science and Higher Education (MSHE) grant No. N N405 274137 and statutory funds from the Institute of Pharmacology PAS.
9.8 REFERENCES

Abdullah-Sayani A, Bueno-de-Mesquita JM, van de Vijver MJ. (2006). Technology Insight: tuning into the genetic orchestra using microarrays—limitations of DNA microarrays in clinical practice. Nat Clin Pract Oncol 3(9):501–16. Alberts R, Terpstra P, Li Y, Breitling R, Nap JP, Jansen RC. (2007). Sequence polymorphisms cause many false cis eQTLs. PLoS One 2(7):e622. Allison DB, Cui X, Page GP, Sabripour M. (2006). Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 7(1):55–65. Anisimov SV. (2008). Serial analysis of gene expression (SAGE): 13 years of application in research. Curr Pharm Biotechnol 9(5):338–50. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29. Auer H, Lyianarachchi S, Newsom D, Klisovic MI, Marcucci G, Kornacker K. (2003). Chipping away at the chip bias: RNA degradation in microarray analysis. Nat Genet 35(4):292–93. Azuaje F. (2003). Clustering-based approaches to discovering and visualising microarray data patterns. Brief Bioinform 4(1):31–42.
Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, Bradford BU, et al. (2005). Standardizing global gene expression analysis between laboratories and across platforms. Nat Meth 2(5):351–56. Barnes M, Freudenberg J, Thompson S, Aronow B, Pavlidis P. (2005). Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms. Nucl Acids Res 33(18):5914–23. Bemmo A, Benovoy D, Kwan T, Gaffney DJ, Jensen RV, Majewski J. (2008). Gene expression and isoform variation analysis using Affymetrix Exon Arrays. BMC Genomics 9:529. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. (2009). GenBank. Nucl Acids Res 37:D26–31. Bilitewski U. (2009). DNA microarrays: an introduction to the technology. Meth Mol Biol 509:1–14. Blencowe BJ. (2006). Alternative splicing: new insights from global analyses. Cell 126(1):37–47. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FCP, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29(4):365–71. Brown PO, Botstein D. (1999). Exploring the new world of the genome with DNA microarrays. Nat Genet 21(1 Suppl): 33–37. Cahan P, Rovegno F, Mooney D, Newman JC, St Laurent G III, McCaffrey TA. (2007). Meta-analysis of microarray results: challenges, opportunities, and recommendations for standardization. Gene 401(1–2):12–19. Cheok MH, Yang W, Pui CH, Downing JR, Cheng C, Naeve CW, Relling MV, Evans WE. (2003). Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat Genet 34(1):85–90. Churchill GA. (2002). Fundamentals of experimental design for cDNA microarrays. Nat Genet 32(Suppl):490–95. Churchill GA. (2004). Using ANOVA to analyze microarray data. 
Biotechniques 37(2):173–75, 177. Clark TA, Schweitzer AC, Chen TX, Staples MK, Lu G, Wang H, Williams A, Blume JE. (2007). Discovery of tissue-specific exons using comprehensive human exon microarrays. Genome Biol 8(4):R64. Datta S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4):459–66. de Leon J, Susce MT, Murray-Carmichael E. (2006). The AmpliChip CYP450 genotyping test: Integrating a new clinical tool. Mol Diagn Ther 10(3):135–51. Diatchenko L, Lau YF, Campbell AP, Chenchik A, Moqadam F, Huang B, Lukyanov S, Lukyanov K, Gurskaya N, Sverdlov ED, Siebert PD. (1996). Suppression subtractive hybridization: a method for generating differentially regulated or tissue-specific cDNA probes and libraries. Proc Natl Acad Sci U S A 93(12):6025–30. Dobrin R, Zhu J, Molony C, Argman C, Parrish ML, Carlson S, Allan MF, Pomp D, Schadt EE. (2009). Multi-tissue coexpression networks reveal unexpected subnetworks associated with disease. Genome Biol 10(5):R55.
Draghici S, Khatri P, Eklund AC, Szallasi Z. (2006). Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 22(2):101–09. Druka A, Druka I, Centeno AG, Li H, Sun Z, Thomas WT, Bonar N, Steffenson BJ, Ullrich SE, Kleinhofs A, Wise RP, Close TJ, Potokina E, Luo Z, Wagner C, Schweizer GF, Marshall DF, Kearsey MJ, Williams RW, Waugh R. (2008). Towards systems genetic analyses in barley: Integration of phenotypic, expression and genotype data into GeneNetwork. BMC Genet 9:73. Dunning MJ, Smith ML, Ritchie MEM, Tavare S. (2007). Beadarray: R classes and methods for Illumina bead-based data. Bioinformatics 23(16):2183–84. Durinck S. (2008). Pre-processing of microarray data and analysis of differential expression. Meth Mol Biol 452: 89–110. Edgar R, Domrachev M, Lash AE. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucl Acids Res 30(1):207–10. Efron B, Tibshirani R. (2002). Empirical bayes methods and false discovery rates for microarrays. Genet Epidemiol 23(1):70–86. Eisen MB, Brown PO. (1999). DNA arrays for analysis of gene expression. Meth Enzymol 303:179–205. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, et al. (2008). Genetics of gene expression and its effect on disease. Nature 452(7186):423–29. Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, Liotta L. (1996). Laser capture microdissection. Science 274(5289): 998–1001. Fox S, Filichkin S, Mockler TC. (2009). Applications of ultra-high-throughput sequencing. Meth Mol Biol 553:79–109. Gautier L, Cope L, Bolstad BM, Irizarry RA. (2004). Affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20(3):307–15. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitski G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J. (2004). 
Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80. Gibson G. (2008). The environmental contribution to gene expression profiles. Nat Rev Genet 9(8):575–81. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–37. Goodsaid F, Frueh F. (2006). Process map proposal for the validation of genomic biomarkers. Pharmacogenomics 7(5):773–82. Gunderson KL, Kruglyak S, Graige MS, Garcia F, Kermani BG, Zhao C, Che D, Dickinson T, Wickham E, Bierle J, Doucet D, Milewski M, Yang R, Siegmund C, Haas J, Zhou L, Oliphant A, Fan JB, Barnard S, Chee M. (2004). Decoding randomly ordered DNA arrays. Genome Res 14(5):870–77. Gunderson KL, Steemers FJ, Lee G, Mendoza LG, Chee MS. (2005). A genome-wide scalable SNP genotyping assay using microarray technology. Nat Genet 37(5):549–54.
Heid CA, Stevens J, Livak KJ, Williams PM. (1996). Real time quantitative PCR. Genome Res 6(10):986–94. Heiman M, Schaefer A, Gong S, Peterson JD, Day M, Ramsey KE, Suarez-Farinas M, Schwarz C, Stephan DA, Surmeier DJ. (2008). A translational profiling approach for the molecular characterization of CNS cell types. Cell 135(4):738–49. Higuchi R, Fockler C, Dollinger G, Watson R. (1993). Kinetic PCR analysis: real-time monitoring of DNA amplification reactions. Biotechnology 11(9):1026–30. Hoheisel JD. (2006). Microarray technology: beyond transcript profiling and genotype analysis. Nat Rev Genet 7(3):200–10. Huang DW, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lan HC, Lempicki RA. (2007). The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol 8(9):R183. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, et al. (2009). Ensembl 2009. Nucl Acids Res 37:D690–97. Imbeaud S, Graudens E, Boulanger V, Barlet X, Zaborski P, Eveno E, Muller O, Schroeder A, Auffray C. (2005). Towards standardization of RNA quality assessment using user-independent classifiers of microcapillary electrophoresis traces. Nucl Acids Res 33(6):e56. Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD. (2003). Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302(5653): 2141–44. Kell DB, Oliver SG. (2004). Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the post-genomic era. Bioessays 26(1):99–105. Kendziorski CM, Zhang Y, Lan H, Attie AD. (2003). The efficiency of pooling mRNA in microarray experiments. Biostatistics 4(3):465–77. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. (2002). The human genome browser at UCSC. Genome Res 12(6):996–1006. 
Kulesh DA, Clive DR, Zarlenga DS, Greene JJ. (1987). Identification of interferon-modulated proliferation-related cDNA sequences. Proc Natl Acad Sci U S A 84(23):8453–57. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, et al. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313(5795):1929–35. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. (2001). Initial sequencing and analysis of the human genome. Nature 409(6822):860–921. Larkin JE, Frank BC, Gavras H, Sultana R, Quackenbush J. (2005). Independence and reproducibility across microarray platforms. Nat Meth 2(5):337–44. Lee HK, Braynen W, Keshav K, Pavlidis P. (2005). ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics 6:269. Lemoine S, Combes F, Crom SL. (2009). An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucl Acids Res 37(6):1726–39. Lockhart DJ, Winzeler EA. (2000). Genomics, gene expression and DNA arrays. Nature 405(6788):827–36.
CHAPTER 10
Candidate Screening through High-Density SNP Array

CHING-WAN LAM and KIN-CHONG LAU

Contents
10.1 Introduction
  10.1.1 What Is High-Density SNP Array Technology?
10.2 Platforms and Protocols of SNP Microarray
  10.2.1 The First Platform and Protocol of a High-Density SNP Array
  10.2.2 The Second Generation of High-Density SNP Array Platform and Protocol
  10.2.3 A Much Advanced Platform and Protocol of an Ultra-High-Density SNP Array
10.3 How a High-Density SNP Array Can Be Used to Localize a Possible Disease Loci
  10.3.1 LOH in Cancer
  10.3.2 Copy-Neutral LOH in Genetic Diseases due to Consanguinity
  10.3.3 LOH in Other Clinical Cytogenetics Analysis
10.4 Discussion
10.5 References
10.1 INTRODUCTION

The International HapMap Project was launched in 2002 to create a genomewide database of common single nucleotide polymorphisms (SNPs), which facilitates the development of inexpensive, accurate technologies for high-throughput SNP genotyping (The International HapMap Consortium, 2005). To date, two human genomes have been entirely sequenced. There are about 3 billion DNA base pairs (bp) with various genetic variations, including >3.1 million SNPs and >1,400 copy number variable regions (CNVs) of at least 1 kb DNA segments spread throughout the human genome (The International HapMap Consortium, 2007; Redon et al., 2006). SNPs, usually with two alleles, are the most common form of sequence variation in the human genome, occurring approximately every 1,200 bp. The combination of an SNP database and a high-density SNP array allows the efficient use of SNPs as informative polymorphic markers for Mendelian diseases with complex traits. With high-density and high-resolution SNP arrays, we can detect even the smallest structural changes and identify loss of heterozygosity (LOH), uniparental disomy (UPD), and genetic regions identical by descent (IBD) that would have been missed by conventional low-density cytogenetic techniques used for prognosis and diagnosis (Table 10.1). LOH, a form of allelic imbalance, results from the complete loss of an allele or appears as copy-neutral LOH without copy number change. UPD refers to a homozygous chromosomal region or segment that originates from only one parent. In IBD, two alleles are identical copies of the same ancestral allele. A region of copy-neutral LOH due to UPD or IBD is a chromosomal region that potentially harbors mutations causing autosomal recessive diseases.

TABLE 10.1. High-Density SNP Array vs. Conventional Cytogenetic Methods

Technology | FISH | Karyotyping | High-density SNP array
Genome view | Specific region of genome | Full genome view | Full genome view
Resolution | High | Limited to CNV of multiple Mb | High
Prior knowledge required | Need to know where to look | Do not need to know where to look | Do not need to know where to look
Time to result | ~4–7 days | ~4–7 days | ~3–5 days
Quality control | Qualitative QC (subjective) | Qualitative QC (subjective) | Quantitative QC
Results and interpretation | Manual read (subjective) | Manual read (subjective) | Automated quantitative read method (objective)

Note: Data are modified from Affymetrix.

10.1.1 What Is High-Density SNP Array Technology?
A high-density SNP array is a kind of DNA microarray that queries thousands of SNPs in a single experiment to globally analyze the human genome for genetic alterations. In standard microarrays developed by Affymetrix, one of the microarray platforms, various probes targeting thousands of SNPs are immobilized on a glass or silicon solid surface (commonly called a chip) at specified positions called probe cells or features. Other microarray platforms, such as Illumina, use microscopic beads instead of the large solid support. For each SNP targeted in an Affymetrix array, 40 different short oligonucleotide probes (20–25 bp) are tiled, each with a slight variation in perfect matches, mismatches, and flanking sequence around the SNP (PMA, MMA, PMB, MMB). PMA/B denotes the probe designed to be a perfect match to allele A or B. MMA/B is the mismatch probe with the same sequence except for a single base mismatch at or near the SNP site of allele A or B. After the biotin-labeled targets have hybridized to probes on the array with sufficient sequence complementarity for sufficient time, the excess sample is washed off the solid surface, and the array is stained with a complex of streptavidin phycoerythrin (SAPE) and biotinylated antistreptavidin IgG antibody. The wash and stain procedures are run automatically under the control of the Affymetrix system. The scanner captures high-resolution intensity data for automatic calculation of the intensity values of each SNP marker. Algorithms of the Genotype Viewer in Microarray Suite Software evaluate the quality of the hybridization intensity data from each set of the four different complementary probes (Fig. 10.1). A metric called the relative allele signal (RAS) is derived from the hybridization intensity signal and is represented as an arbitrary number on a continuous scale from 0 to 1, indicating the relative representation of either of the two possible alleles in the target mixture. Using these values from the sense and antisense target strands, genotype calls are automatically made for each SNP marker. For example, an RAS value close to 1 indicates that the sample is homozygous for the A allele, an RAS value close to 0 indicates that it is homozygous for the B allele, and an RAS value of about 0.5 signifies a heterozygote.
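The relation between RAS and a genotype call can be illustrated with a minimal Python sketch. It is not the Affymetrix algorithm: the chapter does not give the RAS formula or the decision thresholds, so the form RAS = A / (A + B), with each allele signal taken as the median of PM minus MM, and the 0.1 margins are assumptions chosen only for illustration. The function names are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def relative_allele_signal(pm_a: Sequence[float], mm_a: Sequence[float],
                           pm_b: Sequence[float], mm_b: Sequence[float]) -> Optional[float]:
    """Return an RAS-like value in [0, 1], or None when no signal is usable.

    Assumption: allele signal = median(PM - MM), floored at zero, and
    RAS = A / (A + B). The source text does not state the exact formula.
    """
    signal_a = max(median(p - m for p, m in zip(pm_a, mm_a)), 0.0)
    signal_b = max(median(p - m for p, m in zip(pm_b, mm_b)), 0.0)
    total = signal_a + signal_b
    if total == 0:
        return None
    return signal_a / total


def call_genotype(ras: Optional[float], hom_margin: float = 0.1,
                  het_margin: float = 0.1) -> str:
    """Map an RAS value to a call; both margins are hypothetical thresholds."""
    if ras is None:
        return "No signal"
    if ras >= 1 - hom_margin:          # close to 1: homozygous A
        return "A"
    if ras <= hom_margin:              # close to 0: homozygous B
        return "B"
    if abs(ras - 0.5) <= het_margin:   # about 0.5: heterozygous
        return "AB"
    # In between: the value cannot be assigned confidently to either class.
    return "AB_A" if ras > 0.5 else "AB_B"


# Illustrative intensities for one SNP (invented numbers, not real data).
ras = relative_allele_signal(pm_a=[1000, 1100, 900], mm_a=[100, 120, 90],
                             pm_b=[110, 100, 95], mm_b=[100, 105, 90])
print(call_genotype(ras))
```

Treating the intermediate bands as the ambiguous AB_A and AB_B calls is a simplification; the real software also weighs probe-level quality tests and both target strands before making a call.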
Six possible calls can be generated: (1) A (homozygous A allele), (2) B (homozygous B allele), (3) AB (heterozygous), (4) AB_A (two possible genotypes, AB or A, that could not be distinguished), (5) AB_B (two possible genotypes, AB or B, that could not be distinguished), or (6) No signal (insufficient data passed the quality tests to perform an analysis). An Affymetrix SNP array can also be used to generate a virtual karyotype, using specialized software to determine the copy number of each SNP on the array and then align the SNPs in chromosomal order. One of the unique features of the SNP array is that SNP allele differences confirm DNA copy number changes and designate copy-neutral homozygosity correlated with UPD and consanguinity.

10.2 PLATFORMS AND PROTOCOLS OF SNP MICROARRAY

10.2.1 The First Platform and Protocol of a High-Density SNP Array
We suggested that using a high-density SNP array would become the standard for molecular investigations of genetic diseases due to consanguinity (Lam et al., 2007). Early in 2002, we started whole-genome scanning for DNA-based diagnosis using Affymetrix SNP arrays (Table 10.2). The Affymetrix GeneChip HuSNP Mapping Assay is an early microarray platform that yields approximately 1300 SNP genotypes per sample with >99% reproducibility and >98% accuracy. The HuSNP array is manufactured using technology that combines photolithographic methods and combinatorial chemistry. Tens to hundreds of thousands of different oligonucleotide probes are synthesized in a 0.81- by 0.81-cm area on the glass substrate of each array. The fully mapped markers are evenly distributed across the human genome with a median marker gap size of 1.2 cM.

Figure 10.1. (See color insert.) Probe array tiling and hybridization patterns (from Affymetrix). [Figure not reproduced. Grid of probe cells with column labels −4, −1, 0, +1, +4 and row labels MMA, PMA, PMB, MMB, with example hybridization patterns for genotypes AA, BB, and AB.]
Target sequence (250–2000 bp): ...CAGACAGAGTCTTG[A/C]AATCTATTTCTCATA...
Probe sequences (25 bp):
PMA: TGTCTTCAGAACTTTAGATAAAGAG
MMA: TGTCTTCAGAACATTAGATAAAGAG
PMB: TGTCTTCAGAACGTTAGATAAAGAG
MMB: TGTCTTCAGAACCTTAGATAAAGAG

TABLE 10.2. Timeline of the Clinical Applications of Different Affymetrix Microarrays

Year | 2002 | 2003–2007 | 2008–Present
Array type | GeneChip HuSNP Mapping Assay | GeneChip Human Mapping 10K Array Xba 131 | Genome-Wide Human SNP Array 6.0
Number of SNP markers | 1,494 | 11,500 | 906,600
Number of CNV markers | 0 | 0 | 945,826

We were able to identify the xeroderma pigmentosum, complementation group C gene (XPC) as the disease-causing gene by detecting LOH in a xeroderma pigmentosum (XP) patient with consanguineous parents (Lam et al., 2005). We mapped the chloride channel 7 gene (CLCN7) for malignant osteopetrosis, or autosomal recessive osteopetrosis (ARO), in a consanguineous family (Lam et al., 2007). Using HuSNP probe arrays also allowed us to detect allelic imbalance with patterns of LOH in paraffin-embedded tissues of renal cell carcinoma (RCC) (Lam et al., 2006). We showed that a high-density SNP array can detect both previously described and new LOH sites in cancer genomic studies.

According to the manufacturer's protocol, starting with 120 ng of genomic DNA, a set of 24 simultaneously run multiplex PCRs amplifies the human SNPs represented in the GeneChip HuSNP genetic mapping assay. The amplified SNPs are further amplified and concomitantly labeled using biotinylated primers in a second set of 24 simultaneously run labeling PCRs. The biotinylated PCR products are then pooled, concentrated, and prepared for hybridization. The biotinylated amplification products, which reflect the biallelic genotype in the sample DNA, are hybridized to the GeneChip HuSNP probe arrays during an overnight incubation at 44°C in the GeneChip hybridization oven. On the following day, the probe arrays are thoroughly washed and stained with streptavidin and antistreptavidin antibody.
The automated wash and stain procedures are run on the GeneChip Fluidics Station 400 (Affymetrix), under the control of Affymetrix Microarray Suite software running on a workstation with a Windows NT operating system. The stained probe arrays are scanned twice to capture the light emitted at wavelengths of 530 nm and 570 nm, generating two scan image files. Affymetrix Microarray Suite processes the two scan images to calculate all of the signal intensities on the probe array.

10.2.2 The Second Generation of High-Density SNP Array Platform and Protocol

Further technical advancement of chip development and high-resolution scanning allows efficient genotyping of >10,000 SNPs in one array. The Affymetrix GeneChip Human Mapping 10K Array Xba 131 contains approximately 11,500 SNP markers, giving higher genomic coverage with a median intermarker distance of 105 kb. Each array has 18- by 18-μm features, each consisting of more than 1 million copies of a 25-bp oligonucleotide probe of defined sequence, synthesized in parallel by photolithographic manufacturing. This platform gives a significant increase in genetic power and provides more informative markers for molecular investigation in consanguineous families.
We repeated the molecular investigations on the same samples of the ARO and XP cases using this improved platform. The results were consistent with the diagnoses made from HuSNP probe arrays. We then standardized the use of the 10K mapping array for DNA-based diagnosis of genetic diseases due to consanguinity in our laboratory. We successfully identified homozygous mutations in sulfite oxidase deficiency and hypophosphatasia after mapping the homozygous regions in the consanguineous families (unpublished data). We also detected a small homozygous region around the disease-causing gene, solute carrier family 25 (carnitine/acylcarnitine translocase), member 20 (SLC25A20), in a case of sudden neonatal death with nonconsanguineous parents, and confirmed carnitine-acylcarnitine translocase deficiency by sequencing the SLC25A20 gene (Lam et al., 2003).

A total of 250 ng of genomic DNA is digested with 10 U of XbaI restriction enzyme and ligated to adapters that recognize the cohesive overhangs. Each SNP that lies within the XbaI fragments is amplified by a generic primer that recognizes the adapter sequence. PCR conditions have been optimized to preferentially amplify fragments in the 250 to 1000 bp range. The amplified DNA is purified over MinElute 96 UF PCR purification plates by vacuum pumping. The PCR amplicons are then fragmented and biotin labeled. Hybridization is carried out at 48°C overnight in a rotisserie rotating at 60 rpm. The array runs on the standard GeneChip instrument system, including the Fluidics Station 450 and the GeneChip Scanner 3000, for automated washing, staining, and scanning. The image is processed with GCOS software (Affymetrix) to obtain hybridization signal intensity values, and genotype calling is performed with GDAS analysis software (Affymetrix). LOH and copy number (as a log2 ratio) are calculated through the CNAT analysis embedded in GTYPE software (Affymetrix).
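To make the log2-ratio unit concrete, the arithmetic can be sketched as follows. This is only the idealized relation between a sample-to-reference intensity ratio and copy number, assuming a diploid reference with 2 copies. It is not the CNAT procedure, whose normalization and smoothing steps are not described here, and the function names are hypothetical.

```python
import math


def log2_ratio(sample_intensity: float, reference_intensity: float) -> float:
    """Log2 of the sample signal relative to a reference signal."""
    return math.log2(sample_intensity / reference_intensity)


def estimated_copy_number(ratio: float, reference_copies: int = 2) -> float:
    """Invert the log2 ratio, assuming the reference carries `reference_copies`."""
    return reference_copies * 2 ** ratio


# Invented intensities: a single-copy gain, no change, and a single-copy loss.
for sample, reference in [(150.0, 100.0), (100.0, 100.0), (50.0, 100.0)]:
    ratio = log2_ratio(sample, reference)
    print(f"log2 ratio {ratio:+.3f} -> about {estimated_copy_number(ratio):.1f} copies")
```

Under this idealized relation, a log2 ratio of 0 corresponds to two copies, about +0.58 to three copies, and −1 to one copy. Measured array ratios are typically noisier and compressed relative to these values.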
10.2.3 A Much Advanced Platform and Protocol of an Ultra-High-Density SNP Array

The Affymetrix Genome-Wide Human SNP Array 6.0 is an advanced microarray platform containing 906,600 SNP probes and 945,826 copy number probes on a single array for studying LOH and CNV simultaneously (Table 10.3). The median intermarker distance taken over all SNP and CNV markers is <700 bases (Fig. 10.2). More than half of the SNP probes are selected from a previous generation of microarray, SNP Array 5.0. The rest are designed to increase genome coverage on the X and Y chromosomes, mitochondrial DNA, and recombination hotspots. Of the copy number probes, nearly 80% are evenly spaced along the genome, while the rest interrogate previously identified CNV regions. This probe design improves the detection of novel CNVs present in the genome. While other chip-based methods (e.g., comparative genomic hybridization) can detect only genomic gains or deletions, the SNP array has the additional advantage of detecting copy-neutral LOH due to UPD. The high genomic coverage and high resolution of the array, which offers a total of more than 1.8 million markers of different genetic variations across the genome, allow detection of the smallest structural changes and regions of autozygosity.

TABLE 10.3. Two Advanced High-Density SNP Arrays from Two Different Microarray Platforms

Platform | Affymetrix SNP Array 6.0* | Illumina 1M*
Total number of genetic markers | >1.8 million | ~1.0 million
Number of SNP probes | ~906 K | ~1050 K
Number of CNV probes | ~946 K | ~22 K
Median marker spacing | 680 bases | 1700 bases
DNA required | 500 ng | 750 ng

* Data from specification sheets on company websites.

Figure 10.2. Affymetrix SNP Array 6.0: SNP and CNV markers across multiple chromosomes (from Affymetrix). [Figure not reproduced. Y-axis: number of markers per Mb (0 to 800); x-axis: chromosomes chr1 to chrX; series: #SNPs/Mb, #CNs/Mb, #SNP+CN.]

We performed genomewide scanning in a case of ring chromosome and in a consanguineous family with three siblings affected by limb-girdle muscular dystrophy (LGMD) using the ultra-high-density SNP array (Lau et al., 2009). The workflow of the GeneChip Mapping Assay is similar in some respects to that of the previous generations of Affymetrix microarrays. Instead of the fixed batch size of 48 or 96 samples described in the standard Affymetrix protocol, every two samples were processed simultaneously in 0.2-mL centrifuge tubes following the manufacturer's instructions, with some modifications tailored to a random-access approach for personalized genomic medicine (Lau et al., 2009).

1. Prepare two sets of DNA samples for each individual assay; use 250 ng of genomic DNA by adding 5 μL of sample at a concentration of 50 ng/μL.
2. Digest the genomic DNA with 10 U Sty I and 10 U Nsp I, respectively, at 37°C for 2 h. Inactivate the enzymes at 65°C for 20 min.
3. Ligate 50 μM restriction enzyme–specific Adaptor (Affymetrix) to the digested DNA with 800 U T4 DNA Ligase at 16°C for 3 h, and inactivate the enzymes at 70°C for 20 min.
4. Dilute the ligated DNA fourfold and take 10 μL for 30 cycles of PCR amplification using 2 μL TITANIUM Taq DNA polymerase. Use the correct PCR program for the model of thermocycler (for details, see the Affymetrix manual). Perform a set of 3 simultaneously run multiplex PCRs for Sty I-digested DNA and a set of 4 simultaneously run multiplex PCRs for Nsp I-digested DNA.
5. Pool 700 μL of PCR products from the two sets (7 simultaneously run multiplex PCRs in total) in a 2.0-mL round-bottom tube before purification.
6. Mix 1 mL magnetic beads with the pooled PCR products by pipetting up and down 5 times. Leave the DNA-bead mixture at room temperature (RT) for 10 min for binding. PCR products ≥100 bp bind to the beads based on solid-phase reversible immobilization (SPRI) technology. The beads are paramagnetic microparticles of small size (1.0 μm ± 8%) with a carboxylate-modified polymer coating that gives them a high nucleic acid binding capacity.
7. To separate the magnetic beads, place the 2.0-mL tube on the magnetic stand for 10 min until the solution becomes clear before proceeding to the next step.*
8. Leaving the tube on the stand, pipette off the supernatant without disturbing the bead pellet on the tube wall.
9. Wash the beads with 1.8 mL 75% ethanol, vortex at 75% power for 2.5 min, and incubate at RT for 7.5 min. Repeat steps 7 and 8.
10. Allow the beads to air dry on the stand for 15 min.
11. To elute the DNA from the beads, add 55 μL Buffer EB. Vortex the tube for 2.5 min, and incubate at RT for 7.5 min. Repeat steps 7 and 8.
12. Collect the purified DNA, and concentrate it to 47 μL in a speed-vac concentrator. Use 2 μL of purified DNA for quantitation with a spectrophotometer (NanoDrop ND-1000, NanoDrop Technologies). A high concentration of 4–6 μg/μL purified DNA would be sufficient for subsequent assay procedures.
13. Fragment the remaining 45 μL of purified DNA at 37°C for 35 min using 2.5 U DNase I-containing Fragmentation reagent (Affymetrix). Inactivate the enzymes at 95°C for 15 min. Remove 1.5 μL of fragmented DNA for running on a 4% QC gel.
14. Biotin-label the fragmented DNA with 100 U TdT enzyme and 30 mM DNA labeling reagent at 37°C for 4 h. Inactivate the enzymes at 95°C for 15 min.
15. Combine all the components of the hybridization cocktail, and mix with the labeled DNA. Hybridize to a single SNP Array 6.0 chip in a hybridization oven at 50°C overnight (about 16 h) with 60 rpm rotation on the rotisserie.
16. Proceed to automated washing and staining in a Fluidics Station 450.
17. Scan the chip in the GeneChip Scanner 3000 7G. The 7G refers to the seventh generation of the GeneChip technology platform, which offers higher-resolution scanning at pixelations from 2.5 μm down to 0.51 μm; this spot size is 50% smaller than that of the previous scanner. Scanning a typical 49-format array at 2.5-μm pixelation takes only 5 min.

* The PCR purification steps in the standard Affymetrix protocol use vacuum pumping and depend on a fixed batch size of 48 or 96 samples per run, which is designed specifically for high-throughput genomewide association studies. For personalized genomic medicine, we use a magnetic stand device (six-tube holder) as a magnetic particle concentrator instead of a vacuum pump (Fig. 10.3). The 40% iron content gives the beads a very quick magnetic response time, so that they are separated rapidly and completely from suspension on application of the magnetic force. This modification changes the genotyping to a random-access assay, which is practical for clinical application.

Figure 10.3. (See color insert.) Comparing PCR purification workflow between the Affymetrix standard and our modified protocols.

Standard PCR protocol (48 or 96 per batch):
1. Pool 700 μL PCR into deep well plate
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Transfer PCR + beads to filter plate
5. Apply vacuum until all wells are dry (60–90 min)
6. Add 1.8 mL 75% ethanol wash
7. Apply vacuum until all wells are dry (10–20 min)
8. Dry beads for further 10 min under vacuum
9. Tap off excess ethanol and attach catch plate
10. Add 55 μL elution buffer
11. Incubate on vortexer for 10 min
12. Apply vacuum until all wells are dry (15–30 min)
13. Centrifuge for 5 min at 1400 RCF at RT
14. Remove catch plate with eluate (~50 μL)

Modified PCR protocol (no batch size limitation):
1. Pool 700 μL PCR into 2-mL microcentrifuge tube
2. Add 1 mL magnetic beads
3. Pipette up and down 5×; incubate 10 min at RT
4. Place on magnetic stand for 10 min
5. Pipette out the supernatant
6. Add 1.8 mL 75% ethanol wash
7. Vortex at 75% power for 2.5 min; incubate 7.5 min
8. Place on magnetic stand for 10 min
9. Pipette out the supernatant; air-dry for 15 min
10. Add 55 μL elution buffer
11. Vortex at 75% power for 2.5 min; incubate 7.5 min
12. Place on magnetic stand for 10 min
13. Collect the eluate (~55 μL)

The SNP Array 6.0 platform has several checkpoints before GeneChip hybridization to exclude experimental errors: intact genomic DNA, PCR amplicon size, and DNase I-digested fragment size are checked by electropherograms. The quantities of starting DNA and purified PCR products are measured with a spectrophotometer. Accurate genotype calls for each sample are determined by the Birdseed version 2 genotype-calling algorithm embedded in the Affymetrix Genotyping Console 3.0 software (Nishida et al., 2008). The platform includes quality control (QC) probes for 3022 SNPs to assess the overall quality of a sample based on the dynamic model (DM) algorithm.

10.3 HOW A HIGH-DENSITY SNP ARRAY CAN BE USED TO LOCALIZE A POSSIBLE DISEASE LOCI

Allelic differences at polymorphic SNPs confirm genomic imbalance by identifying regions of LOH associated with deletions, allele-specific dosage gain associated with duplications, and long contiguous stretches of homozygosity (LCSH) associated with UPD and consanguinity. SNP genotyping therefore aids in the identification of candidate genes in both complex and Mendelian disorders in clinical practice.

10.3.1 LOH in Cancer
Most human cancers are characterized by chromosomal aberrations in which allelic imbalances can be identified by LOH. These LOH events sometimes affect known genes, and mutations may suggest regions of novel somatic events contributing to tumorigenesis by activating potential oncogenes or unmasking mutated tumor suppressor genes. The LOH regions are specific to the tumor type or subtype (Tuna et al., 2009). Copy-neutral LOH is one example of a cancer genomic abnormality and is important in cancer clonal evolution. In the HuSNP assay, allelic imbalance usually indicates true loss of heterozygosity, whereas amplifications are rarely detected by the HuSNP array. The software calculates the difference in RAS values between two samples and reports it as the delta RAS value. We compare the quantitative representation of alleles in samples obtained from normal tissue with that in samples obtained from tumor tissue to determine the location and extent of chromosomal loss in tumor cells. Significant shifts (P < 0.05) in delta RAS between tumor and germline DNA indicate the presence of LOH. To allow for possible genotyping errors in the analysis, we declare a chromosomal region as having LOH only when there are more than two SNPs in the LOH region.
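The decision rule in this paragraph can be sketched in Python under stated assumptions. The sketch restricts the comparison to SNPs that are heterozygous in the normal sample, since only those can reveal a loss, and it expresses each delta RAS in units of a standard deviation, in the spirit of the Std Units of Figure 10.4. The 1.96 cutoff (a two-sided P < 0.05 under a normal approximation), the heterozygosity margin, and all names are assumptions for illustration, not the statistical test used by the Affymetrix software.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SnpPair:
    chrom: str
    position: int
    ras_normal: float   # RAS in germline (normal) DNA
    ras_tumor: float    # RAS in tumor DNA


def loh_regions(snps: List[SnpPair], het_sd: float, z_cutoff: float = 1.96,
                het_margin: float = 0.1, min_snps: int = 3) -> List[Tuple[str, int, int, int]]:
    """Return (chrom, start, end, n_snps) for runs of significantly shifted SNPs.

    het_sd is the RAS standard deviation observed in heterozygotes.
    min_snps=3 encodes "more than two SNPs"; z_cutoff is an assumed threshold.
    """
    # Only SNPs heterozygous in normal tissue are informative for LOH.
    informative = sorted(
        (s for s in snps if abs(s.ras_normal - 0.5) <= het_margin),
        key=lambda s: (s.chrom, s.position),
    )
    regions: List[Tuple[str, int, int, int]] = []
    run: List[SnpPair] = []

    def flush() -> None:
        # Keep the current run only if it is long enough, then start afresh.
        if len(run) >= min_snps:
            regions.append((run[0].chrom, run[0].position, run[-1].position, len(run)))
        run.clear()

    for snp in informative:
        if run and snp.chrom != run[0].chrom:
            flush()                       # runs never span two chromosomes
        shifted = abs(snp.ras_tumor - snp.ras_normal) / het_sd >= z_cutoff
        if shifted:
            run.append(snp)
        else:
            flush()                       # a retained heterozygote ends the run
    flush()
    return regions


# Invented example values, not patient data.
example = [
    SnpPair("chr3", 100, 0.50, 0.90), SnpPair("chr3", 200, 0.48, 0.10),
    SnpPair("chr3", 250, 0.98, 0.97),  # homozygous in normal: uninformative
    SnpPair("chr3", 300, 0.52, 0.88), SnpPair("chr3", 400, 0.50, 0.52),
    SnpPair("chr14", 10, 0.50, 0.90), SnpPair("chr14", 20, 0.50, 0.12),
]
print(loh_regions(example, het_sd=0.05))
```

In this example the two shifted SNPs on chr14 are not reported, because a run of two does not meet the more-than-two-SNPs requirement.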
Figure 10.4. LOH regions in RCC detected by HuSNP probe array (Lam et al., 2006). On the x-axis, markers are arranged by chromosome number and mapped positions. On the y-axis, Std Units (plotted from 0 to 16) are the normalized delta RAS, using the observed RAS standard deviations in heterozygotes. [Figure not reproduced.]
Renal-cell carcinoma is the most common malignancy in the kidney. Figure 10.4 shows the whole-genome view of LOH regions in one of our RCC samples (Lam et al., 2006). We identified a common 14q LOH that has been shown to be significantly associated with tumor aggressiveness and disease-specific mortality with a hazard ratio of 1.22; 95% CI = 1.02–1.45; p = 0.039. The deletion of 3p as a simple deletion or by translocation has strongly suggested the loss of function of the tumor suppressor gene, von Hippel-Lindau (VHL). Mutations of VHL are found in about 60% of those RCC that exhibit chromosome 3p loss (Gnarra et al., 1994). 10.3.2 Copy-Neutral LOH in Genetic Diseases due to Consanguinity Because of the consanguineous parents, the two disease-causing locus (one from each parent) should be located in an autozygous chromosomal region that is IBD. Consequently, the disease-causing locus of this family should fall in a chromosomal region marked by homozygous SNPs and can be identifiable by LOH. As these regions segregate and are cut by additional generations of recombination, they become fewer and smaller proportional to the degree of inbreeding. After whole-genome scanning, we examine the homozygosity of SNPs flanking all the possible disease-causing loci. Then, we rank the quality of all the LOH regions for prioritization of mutational analysis as follows: (1) the size of the homozygous chromosomal region (normalized with the size of the respective chromosome), (2) the number of SNPs in the homozygous chromosomal region, and (3) the number of SNPs on the centromeric and telomeric side of the disease-causing locus in the homozygous chromosomal region. The best of each was scored as 3 and the worst was scored as 1. A high-quality
homozygous chromosomal region should have the largest size, the highest number of SNPs, and an equal number of SNPs on both sides of the possible causal locus. The possible disease-causing locus within the LOH region with the highest total score is selected for mutation detection by direct sequencing.

Consanguinity of parents is common in patients with rare autosomal recessive diseases and, for example, has been reported in about 30% of the XP cases (Kraemer et al., 1987). We successfully applied 10K mapping assays for homozygosity mapping in different autosomal recessive diseases (Fig. 10.5). We then identified the homozygous mutations by direct sequencing of the candidate genes. Table 10.4 summarizes some of the homozygous mutations in our clinical samples that are associated with hypophosphatasia (Patient A), Wilson disease (Patient B) (Mak et al., 2008), sulfite oxidase deficiency (Patients C, D, and E) (Lam et al., 2002), Pompe disease (Patient H), and molybdenum cofactor deficiency (Patient I). This approach is useful not only for prenatal diagnosis in the next pregnancy in the same family but also for identifying novel mutations or additional disease-causing genes (Lam et al., 2002; Chiang et al., 2006). In an abortus suspected of osteogenesis imperfecta (Patient A), we found a novel homozygous mutation, p.M219V, of the ALPL (alkaline phosphatase) gene for hypophosphatasia but found no mutation in any of the coding exons of the collagen type 1 alpha2 gene (COL1A2). Using the same approach in a patient with sulfite oxidase deficiency (Patient E), we identified a novel mutation, p.D512Y, of the SUOX (sulfite oxidase) gene. Molybdenum cofactor deficiency results in pleiotropic loss of the activity of all molybdoenzymes and displays the symptoms of a combined deficiency of sulfite oxidase and xanthine dehydrogenase (XDH) (Johnson et al., 1989).
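The three-criterion ranking of LOH regions described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the class and function names, the linear rank-to-score scaling between 1 and 3, the use of a min/max ratio to measure how evenly SNPs are split around the locus, and the three example regions are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class LohRegion:
    gene: str               # candidate gene inside the homozygous region
    region_bp: int          # length of the homozygous region
    chromosome_bp: int      # chromosome length, used to normalize region size
    snps_centromeric: int   # homozygous SNPs on the centromeric side of the locus
    snps_telomeric: int     # homozygous SNPs on the telomeric side of the locus


def rank_to_score(values):
    """Scale values by rank so the worst gets 1 and the best gets 3 (assumed linear)."""
    distinct = sorted(set(values))
    if len(distinct) == 1:
        return [3.0] * len(values)
    step = 2.0 / (len(distinct) - 1)
    return [1.0 + step * distinct.index(v) for v in values]


def prioritize(regions):
    """Return (gene, total score) pairs, highest total first."""
    size = rank_to_score([r.region_bp / r.chromosome_bp for r in regions])
    count = rank_to_score([r.snps_centromeric + r.snps_telomeric for r in regions])
    # Balance is 1.0 when both sides carry the same number of SNPs (assumed metric).
    balance = rank_to_score([
        min(r.snps_centromeric, r.snps_telomeric)
        / max(r.snps_centromeric, r.snps_telomeric, 1)
        for r in regions
    ])
    totals = [(r.gene, s + c + b) for r, s, c, b in zip(regions, size, count, balance)]
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Hypothetical candidate regions, for illustration only.
candidates = [
    LohRegion("GENE_A", 10_000_000, 100_000_000, 10, 10),
    LohRegion("GENE_B", 5_000_000, 100_000_000, 4, 8),
    LohRegion("GENE_C", 1_000_000, 100_000_000, 1, 5),
]
ranking = prioritize(candidates)
```

The first entry of `ranking` is the locus that would be sequenced first. With more than three candidates, intermediate ranks receive fractional scores under the assumed linear scaling.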
In addition to a candidate gene, molybdenum cofactor synthesis 2 (MOCS2), Figure 10.6 shows the homozygous regions that harbor the SUOX and XDH genes. The long 10-Mb stretch of LOH on chromosome 5 and the identification of a small deletion in the MOCS2 gene confirmed molybdenum cofactor deficiency in Patient I. A high-density SNP array decreases the burden of completely sequencing all possible loci for genetic diseases with extensive genetic heterogeneity such as ARO, XP, Bardet-Biedl syndrome (BBS), and LGMD, which have 7, 8, 9, and 13 disease-causing genes, respectively (Lam et al., 2005, 2007; Chiang et al., 2006; Lau et al., 2009). Based on the history of consanguinity in our XP case (Patient F), the XPC locus was prioritized for mutational analysis after LOH detection by genome-wide SNP genotyping (Lam et al., 2005). Figure 10.7 shows two of the eight candidate genes of XP that were found in homozygous regions detected in both the HuSNP and 10K mapping assays. The degrees of LOH shown in the 10K mapping assay are consistent with the scoring system used in the HuSNP mapping assay, in which XPC received the highest score for subsequent mutation analysis. A homozygous nonsense mutation, c.445G>T or p.E149X, was identified in the patient and was heterozygous in the parents. Most ARO cases have been ascribed to a mutation in the T-cell immune regulator 1, ATPase, H+ transporting, lysosomal V0 subunit A3 (TCIRG1)
Figure 10.5. The 10K mapping assays for different autosomal recessive diseases. The y-axis shows the degree of LOH. ATP7B (NM_000053) for Wilson disease is located at chromosome 13q14.3; SUOX (NM_000456) for sulfite oxidase deficiency is located at chromosome 12q13.2; ALPL (NM_000478) for hypophosphatasia is located at chromosome 1p36.12; GAA (NM_000152) for Pompe disease is located at chromosome 17q25.2–25.3.
TABLE 10.4. Mutation Screening of Candidate Genes Identified in LOH Sites

Patient Number   Genetic Disease                                 Gene       Mutation (All Are Homozygous)
A                Hypophosphatasia                                ALPL       p.M219V
B                Wilson disease                                  ATP7B      p.R778L
C                Sulfite oxidase deficiency                      SUOX       c.1521_1524delTTGT
D                Sulfite oxidase deficiency                      SUOX       p.R217Q
E                Sulfite oxidase deficiency                      SUOX       p.D512Y
F                Xeroderma pigmentosum (type C)                  XPC        p.E149X
G                Malignant osteopetrosis                         CLCN7      p.I261F
H                Pompe disease                                   GAA        p.R224W
I                Molybdenum cofactor deficiency                  MOCS2      c.346_349delGTCA
J, K, and L      Limb-girdle muscular dystrophy type IIB         DYSF       p.D1837N
M                Carnitine-acylcarnitine translocase deficiency  SLC25A20   c.199-10T>G
[Figure 10.6 panels: Patient I, LOH plotted along chromosome 5 (42.77–61.29 Mb), chromosome 12 (47.49–66.03 Mb), and chromosome 2 (20.10–40.31 Mb); y-axis ticks 0.00–4.00.]
Figure 10.6. Comparison of the LOH regions from 10K mapping assays for three candidate genes of molybdenum cofactor deficiency. The y-axis shows the degree of LOH. MOCS2 (NM_176806.2) is located at chromosome 5q11; SUOX (NM_000456) is located at chromosome 12q13.2; XDH (NM_000379) is located at chromosome 2p23.1.
[Figure 10.7 panels: Patient F, LOH plotted along chromosome 3 (2.77–26.69 Mb; y-axis ticks 0–25) and chromosome 16 (9.51–17.77 Mb; y-axis ticks 0.00–4.00).]
Figure 10.7. Comparison of the LOH regions from 10K mapping assays for two candidate genes of XP in a patient whose parents are blood relatives (fifth degree). The y-axis shows the degree of LOH. XPC is located at chromosome 3p25.1; XPF (ERCC4) is located at chromosome 16p13.12.
gene, with only a few cases attributed to a mutation in the CLCN7 gene (Cleiren et al., 2001). We were able to prioritize the CLCN7 locus for mutation screening in a Chinese patient (Patient G) and identify a novel homozygous missense mutation, c.781A>T or p.I261F, which was heterozygous in the parents (Lam et al., 2007, 2010). In the results of the 10K mapping arrays, which carry more SNP markers, among the three candidate genes CLCN7, TCIRG1, and osteopetrosis associated transmembrane protein 1 (OSTM1), only the region of chromosome 16p harboring the CLCN7 gene shows a high degree of LOH (>6.0), which is consistent with the HuSNP assay (Fig. 10.8). In a consanguineous family with LGMD (Fig. 10.9), we identified a long homozygous stretch of 25 Mb on the short arm of chromosome 2 in all three affected siblings (Patients J, K, and L) (Lau et al., 2009). The LCSH is greater than that shown in Figure 10.7 because multiple generations of recombination have diluted the consanguineous chromosomes in Patient F. The SNP genotyping revealed a homozygous candidate region for further mutation analysis. The ultra-high density SNP Array 6.0 mapped the gene dysferlin (DYSF), located on 2p13.3, in this LOH region (Fig. 10.10). A homozygous missense mutation, c.5509G>A or p.D1837N, of DYSF was identified in all the affected siblings.

10.3.3 LOH in Other Clinical Cytogenetics Analysis

Using 10K mapping arrays, we identified in a Chinese neonate presenting with sudden unexpected death (Patient M) a 7-Mb LOH region at 3p21.1–21.31
[Figure 10.8 panels: Patient G, LOH plotted along chromosome 16 (0.00–21.54 Mb), chromosome 11 (66.53–82.13 Mb), and chromosome 6 (103.09–115.53 Mb).]
Figure 10.8. Comparison of the LOH regions from 10K mapping assays for the three candidate genes of ARO in a patient whose parents are first cousins. CLCN7 (NM_001287) is located at chromosome 16p13.3; TCIRG1 (NM_006019) is located at chromosome 11q13.2; OSTM1 (NM_014028) is located at chromosome 6q21.
[Pedigree diagram spanning generations I to III, with the affected siblings J, K, and L marked.]
Figure 10.9. Pedigree of a consanguineous family with limb-girdle muscular dystrophy.
(Fig. 10.11). The SLC25A20 gene, located at chromosome 3p21.31, encodes the protein carnitine-acylcarnitine translocase (CACT). Despite the lack of consanguinity between the parents, we found homozygous IVS2-10T>G, a known mutation for CACT deficiency (Lam et al., 2003). Since the parents are nonconsanguineous, the homozygous region is likely due to linkage disequilibrium (Wong et al., 1998; Okubo et al., 1999; Lam, 1999). We suggest that this mutation within the IBD region may be a founder mutation in the Chinese population. In a case of ring chromosome, a karyotype report from a clinical laboratory
Figure 10.10. (See color insert.) Identification of homozygous DYSF mutations in the homozygous region detected by SNP Array 6.0 (Lau et al., 2009).
suggested that the ring chromosome was derived from chromosome 21 (46, XY, r21) (Lau et al., 2009). According to the genotyping results of SNP Array 6.0, a segment of approximately 7 Mb was lost at the end of the long arm of chromosome 21. Figure 10.12 shows the breakpoint (with changes in both LOH and copy number) located at chromosome 21q22.2, indicating the deletion of the gene Down syndrome cell adhesion molecule (DSCAM).
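The distinction drawn in this section, between copy-neutral LOH (as in autozygous regions) and LOH that coincides with a copy-number change (as at the ring chromosome breakpoint), can be sketched as a simple classification rule. The following is an illustrative sketch only: the function names, the assumption of an autosome with a normal copy number of 2, and the segment coordinates are hypothetical and are not taken from the array software or from the case described.

```python
def classify_segment(is_loh, copy_number):
    """Label an autosomal segment from its LOH call and integer copy number."""
    if copy_number < 2:
        return "deletion"           # loss of one copy also appears as LOH
    if copy_number > 2:
        return "gain"
    if is_loh:
        return "copy-neutral LOH"   # two copies present, but homozygous
    return "normal"


def find_breakpoints(segments):
    """Return start positions (Mb) where the segment label changes.

    segments: list of (start_mb, end_mb, is_loh, copy_number), sorted by start.
    """
    breakpoints = []
    previous_label = None
    for start_mb, _end_mb, is_loh, copy_number in segments:
        label = classify_segment(is_loh, copy_number)
        if previous_label is not None and label != previous_label:
            breakpoints.append(start_mb)
        previous_label = label
    return breakpoints


# Hypothetical segmentation of one chromosome, for illustration only.
example_segments = [
    (0.0, 20.0, False, 2),
    (20.0, 40.0, False, 2),
    (40.0, 47.0, True, 1),   # terminal segment with LOH and one copy
]
labels = [classify_segment(loh, cn) for _, _, loh, cn in example_segments]
breakpoints = find_breakpoints(example_segments)
```

In this toy input the terminal segment is labeled a deletion rather than copy-neutral LOH because the LOH call coincides with a copy number of 1.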
[Figure 10.11 panel: Patient M, LOH plotted along chromosome 3 (43.88–54.65 Mb, bands p21.33 to p14.3); y-axis ticks 0.00–4.00.]
Figure 10.11. Detection of a homozygous region in the short arm of chromosome 3 in a case of CACT deficiency by 10K mapping array. SLC25A20 (NM_000387) is located at chromosome 3p21.31.
Figure 10.12. Positional mapping of ring chromosome 21 by the SNP Array 6.0 platform (Lau et al., 2009).
10.4 DISCUSSION

A comprehensive genome-wide scan using an automated high-density SNP array offers significant cost and time benefits in sample preparation, processing, and data analysis. The cost savings are greater when the disease shows marked locus heterogeneity, because LOH detection prioritizes the loci for mutational analysis. Accurate mapping of disease-causing genes by detecting LOH with a high-density SNP array provides a better basis for reliable DNA-based prenatal diagnosis, since prenatal ultrasound scans are sometimes normal for an affected fetus. The approach is of particular advantage in finding disease loci in consanguineous families. The results of the mapping study and the mutation study proved to be consistent, validating this approach. LOH sites may also have prognostic significance and may be ethnic-specific in different populations. The patterns of LOH obtained by SNP microarray are in excellent agreement with those obtained by both microsatellite genotyping and comparative genomic hybridization in some cancer studies (Lam et al., 2006). Unlike microsatellites, SNPs are not susceptible to the repeat expansion that is so often observed in cancer. We strongly suggest that the use of high-density SNP arrays be standardized for molecular investigations of genetic diseases due to consanguinity. High-density SNP arrays are useful not only for personalized genomic medicine but also for building disease-specific databases (Lau et al., 2009; Seelow et al., 2009). With precise medical informatics, we can make genotype–phenotype correlations of early warning signs or improved drug response for better disease management. With further technical and software advancements, high-density SNP genotyping will continue to help explore the pathophysiology of more disorders and create targeted drugs in the near future.
10.5 REFERENCES

Chiang AP, Beck JS, Yen HJ, et al. (2006). Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11). Proc Natl Acad Sci USA 103:6287–92.
Cleiren E, Benichou O, Van Hul E, et al. (2001). Albers-Schönberg disease (autosomal dominant osteopetrosis, type II) results from mutations in the ClCN7 chloride channel gene. Nature 415:287–94.
Gnarra JR, Tory K, Weng Y, et al. (1994). Mutations of the VHL tumor suppressor gene in renal carcinoma. Nat Genet 7:85–90.
Johnson JL, Wuebbens MM, Mandell R, Shih VE. (1989). Molybdenum cofactor biosynthesis in humans. Identification of two complementation groups of cofactor-deficient patients and preliminary characterization of a diffusible molybdopterin precursor. J Clin Invest 83:897–903.
Kraemer KH, Lee MM, Scotto J. (1987). Xeroderma pigmentosum. Cutaneous, ocular, and neurologic abnormalities in 830 published cases. Arch Dermatol 123:241–50.
Lam CW. (1999). Origin of the Japanese population. Science 284:1125.
Lam CW. (2010). Genome-based diagnosis of genetic disease. Indian J Med Res 131:484–85.
Lam CW, Cheung KKT, Luk NM, Chan SW, Lo KK, Tong SF. (2005). DNA-based diagnosis of xeroderma pigmentosum group C by whole-genome scan using single-nucleotide polymorphism microarray. J Invest Dermatol 124:97–91.
Lam CW, Lai CK, Chow CB, Tong SF, Yuen YP, Mak YF, Chan YW. (2003). Ethnic-specific splicing mutation of the carnitine-acylcarnitine translocase gene in a Chinese neonate presenting with sudden unexpected death. Chin Med J (Engl) 116:1110–12.
Lam CW, Li CK, Lai CK, Tong SF, Chan KY, Ng GSF, Yuen YP, Cheng AWF, Chan YW. (2002). DNA-based diagnosis of isolated sulfite oxidase deficiency by denaturing high-performance liquid chromatography. Mol Genet Metab 75:91–95.
Lam CW, To KF, Tong SF. (2006). Genome-wide detection of allelic imbalance in renal cell carcinoma using high-density single-nucleotide polymorphism microarrays. Clin Biochem 39:187–90.
Lam CW, Tong SF, Wong K, et al. (2007). DNA-based diagnosis of malignant osteopetrosis by whole-genome scan using a single-nucleotide polymorphism microarray: standardization of molecular investigations of genetic diseases due to consanguinity. J Hum Genet 52:98–101.
Lau KC, Mak CM, Leung KY, Tsoi TH, Tang HY, Lee P, Lam CW. (2009). A fast modified protocol for random-access ultra-high density whole-genome scan: A tool for personalized genomic medicine, positional mapping, and cytogenetic analysis. Clin Chim Acta 406:31–35.
Mak CM, Lam CW, Tam S, et al. (2008). Mutational analysis of 65 Wilson disease patients in Hong Kong Chinese: identification of 17 novel mutations and its genetic heterogeneity. J Hum Genet 53:55–63.
Nishida N, Koike A, Tajima A, et al. (2008). Evaluating the performance of Affymetrix SNP Array 6.0 platform with 400 Japanese individuals. BMC Genom 9:431.
Okubo M, Horinishi A, Murase T, Hamada K. (1999). 1176C polymorphism in Japanese patients with glycogen storage disease type 1a. Hum Genet 104:193.
Redon R, Ishikawa S, Fitch KR, et al. (2006). Global variation in copy number in the human genome. Nature 444(7118):444–54.
Seelow D, Schuelke M, Hildebrandt F, Nürnberg P. (2009). HomozygosityMapper—an interactive approach to homozygosity mapping. Nucl Acids Res 37:W593–10.
The International HapMap Consortium. (2005). A haplotype map of the human genome. Nature 437(7063):1299–13110.
The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature 449(7164):851–61.
Tuna M, Knuutila S, Mills GB. (2009). Uniparental disomy in cancer. Trends Mol Med 15:120–28.
Wong LJ, Liang MH, Hwu WL, Lam CW. (1998). Linkage disequilibrium and linkage analysis of the glucose-6-phosphatase gene. Hum Genet 103:199–203.
CHAPTER 11
Gene Discovery by Direct Genome Sequencing
KUNAL RAY, ARIJIT MUKHOPADHYAY, and MAINAK SENGUPTA
Contents
11.1 Introduction
11.2 Gene Discovery by Direct Genome Sequencing
11.2.1 Discovery of Mutations in Mendelian Diseases
11.2.2 Discovery of QTL or Single Nucleotide Mutations
11.3 Applications and Protocols
11.3.1 Identification and Capturing of the Targeted Genomic Region
11.3.2 Selection of Suitable Platform
11.4 The Limitations of Direct Genome Sequencing
11.5 References
11.1 INTRODUCTION

The last few decades saw unprecedented growth in our understanding of the genetic basis of diseases as the underlying molecular defects were unraveled, especially for Mendelian disorders. Most of these discoveries were made possible by the Sanger method of direct DNA sequencing (Sanger and Coulson, 1978; Sanger et al., 1977, 1992). The development of the Sanger sequencing method and its widespread use made forward genetics more powerful than reverse genetics. We could identify the causal variants even before we knew the molecular basis of the causal relationship with the trait under study. However, the OMIM database reveals that there are only 344 entries (out of 19,864) for which gene mutations have been definitively linked with phenotypes, indicating that most diseases are not purely monogenic and are probably controlled by one or more modifier loci. Even the most commonly used
textbook example of a Mendelian disease, hemophilia, has recently been shown to have a quasi-quantitative character with variable penetrance (Chavali et al., 2008, 2009). To add to the complexity, Gottlieb et al. (2009), while studying a condition called abdominal aortic aneurysm (AAA), found that SNPs in the BAK1 gene differed between aortic tissue and blood samples, even in samples taken from the same individuals. Based on these findings, the authors advised caution in interpreting genetic associations based on DNA from blood samples alone. Hence, to achieve a proper genotype-phenotype correlation, we need to probe deeper into the genome, matching it with the epigenome and transcriptome to get a glimpse of the complex networks that exist at the proteome level. This realization has changed the way research in human genetics is carried out. Today researchers use high-throughput genotyping tools to first map a disease locus and then sequence the entire candidate region to detect all possible nucleotide variations that might contribute to biology, either singly or in synergy. A large number of discoveries are being reported in the literature in which classical exon or ORF sequencing does not identify the causal variant, which is found only by means of direct genome sequencing. In this chapter we describe these recent approaches and emphasize that researchers need to align themselves with the changing paradigm to discover novel genes.
11.2 GENE DISCOVERY BY DIRECT GENOME SEQUENCING

As described in the previous chapters, recent technological developments in the field of high-throughput genotyping have allowed us to scan the human genome at a very high resolution (the latest arrays have a marker every 1.5 kb on average). These arrays allow us to look for causal association of phenotypes with single nucleotide polymorphisms as well as copy number variations, both independently and in combination. These platforms are becoming cost effective and are gradually replacing classical microsatellite-based linkage or association analysis for hereditary traits. This possibility has revolutionized the field, especially in the area of complex multifactorial diseases, where the lack of large pedigrees prohibits the use of the more powerful linkage analysis, leaving us with only association studies. In the last few years, there has been an explosion of data from various genomewide association studies (GWAS), which led to the discovery of many unexpected disease-causing genes. As we know from experience, discovering these genes in the context of specific diseases by the traditional candidate gene approach would take a very long time. All these high-throughput screening technologies have enabled us to take an unbiased approach toward the genetic basis of diseases. An exemplary success of GWAS is the discovery of the association of the p.Y402H variant of the complement factor H gene with age-related macular degeneration. This variant, identified by a GWAS, helped explain a large proportion of the disease burden and led to a new field of research in
the understanding and cure of this blinding disorder in the elderly (Haines et al., 2005). However, these high-throughput screening technologies have some inherent limitations as well. Recent studies suggest that the expression of a complex array of disease phenotypes usually results from the effect of a few rare variants with higher penetrance in combination with many common variants with lower penetrance. The higher the penetrance of the rare variant, the greater the Mendelian character of the phenotype. By virtue of the design of high-throughput genotyping arrays, the markers they carry are in most cases variants that have been validated in different population groups. As a result, GWAS are very poor at detecting the rare variants that would be expected to be highly penetrant. This means that GWAS can detect only less penetrant common variants, which requires a large sample size to give the study enough power. Another limitation concerns the present level of marker saturation, at which the average spacing between markers ranges from 1.5 to 5 kb. It is debatable whether one would gain significantly in power to identify causal variants by increasing the number of markers in any region. Recombination is distributed nonrandomly across the human genome (giving rise to linkage disequilibrium blocks), implying that increasing the number of markers in a block would not necessarily increase the power to detect a causal variant. In addition to these conceptual limitations, there are some technical limitations that are dealt with in other chapters of this book. Direct genome sequencing, on the other hand, can circumvent all the limitations described above. Technically speaking, direct genome sequencing is powerful enough to detect the causal variant even from a single individual's sample. One can probe the entire genomic region of interest without having to hypothesize about where the causal variant might lie.
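The earlier point about linkage disequilibrium blocks can be made concrete with the standard pairwise measure r², computed from haplotype frequencies as r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA·pB. The sketch below is only an illustration: the function name and the haplotype counts are hypothetical. When two markers in a block have r² close to 1, genotyping the second one adds almost no independent information.

```python
def ld_r_squared(hap_counts):
    """Pairwise LD (r^2) between two biallelic markers.

    hap_counts maps the four haplotypes 'AB', 'Ab', 'aB', 'ab' to observed counts,
    where A/a are the alleles of marker 1 and B/b the alleles of marker 2.
    """
    total = sum(hap_counts[h] for h in ("AB", "Ab", "aB", "ab"))
    p_ab_hap = hap_counts["AB"] / total
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / total   # frequency of allele A
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / total   # frequency of allele B
    d = p_ab_hap - p_a * p_b                              # disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both markers must be polymorphic")
    return d * d / denominator


# Hypothetical haplotype counts, for illustration only.
same_block = ld_r_squared({"AB": 60, "Ab": 0, "aB": 0, "ab": 40})
independent = ld_r_squared({"AB": 25, "Ab": 25, "aB": 25, "ab": 25})
```

In the first toy data set the two markers always travel together, so r² is 1 and the second marker is redundant; in the second they are independent, so r² is 0.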
Presently, the cost of direct genome sequencing for large regions or for a large number of samples is prohibitive, but the rapid technological breakthroughs under way will almost certainly reduce the cost substantially. Different strategies for direct genome sequencing are described later in this chapter.

11.2.1 Discovery of Mutations in Mendelian Diseases
For a Mendelian disorder, where one mutation is often penetrant enough to precipitate a disease phenotype, one would collect samples from one or more large pedigrees with multiple affected members, carry out linkage analysis, and then sequence the exons of all the genes in the linkage interval. Exclusion of an ORF or a gene is often decided on the basis of the lack of a causal variant in the exons. Classically, these sequencing approaches involve bidirectional Sanger sequencing. Recently, expression analysis (den Hollander et al., 2007; Mukhopadhyay et al., 2006), microRNA sequencing (Mencía et al., 2009), and direct genome sequencing (Ng et al., 2009; Nikopoulos et al., 2010) have been successfully used to find the causal gene or mutation for Mendelian disorders.
In spite of the enormous advances in the field of disease genetics and genomics, there are almost 1800 identified genetic loci or phenotypes for which the causal gene is not known and another 2000 suspected Mendelian traits with an unknown molecular defect (according to OMIM). These statistics suggest that identifying the molecular defects in genetic diseases remains a challenging task, even after the underlying loci have been pinned down using the traditional tools and strategies of molecular genetic analysis. The major impediment to better success is hidden mutations in regions of the gene that are not intuitively obvious and immediately testable, for example promoter regions, locus control regions, and miRNA binding sites in untranslated regions. Such hurdles could be surmounted if multiple unrelated patient samples were available with defects in the same genetic locus but harboring different mutations. Some mutations may remain refractory to limited sequencing efforts, or a variant may lie in an uncharacterized region of the gene (promoter, UTR), but other mutations may be easily related to a biological aberration. However, for rare genetic diseases, finding multiple unrelated families or patients is not easy. In such circumstances, direct genome sequencing plays a very important role in the discovery of novel genes or mutations in Mendelian disorders. For example, a recent report describes the identification of a causal mutation in a novel gene for familial exudative vitreoretinopathy (FEVR), a Mendelian disease of the eye (Nikopoulos et al., 2010). The authors used high-throughput genotyping technology to find a common linkage interval spanning 40 Mb in two pedigrees. They then used next-generation sequencing (NGS) technology to perform direct genome sequencing of the entire 40-Mb interval and identified a causal variant in TSPAN12, a novel gene for this phenotype (Nikopoulos et al., 2010).
It is likely that the classical approach to characterizing the causal gene for the disease and the underlying defect in the patient(s) would have taken much longer.

11.2.2 Discovery of QTL or Single Nucleotide Mutations

As discussed at the beginning of this chapter, most diseases are influenced by the modifying effects of other loci and the environment. To understand the etiology of a disease, one needs to quantify the contribution of each trait or parameter responsible for the disease phenotype under study. These parameters are not binary in nature and form a continuum in the population. The genetic locus controlling such a trait is called a quantitative trait locus (QTL). For example, one can study type 2 diabetes and decide on the disease status in a binary manner (i.e., either affected with the disease or not). However, the most important parameter for making the binary decision is the level of blood glucose, which is not binary and has a range of values in cases and controls. Hence blood glucose level becomes a quantitative trait for type 2 diabetes, and understanding the genetic loci (QTLs) controlling the blood glucose level can be more useful in elucidating the disease etiology than studying the disease status itself.
In contrast to what has been discussed for Mendelian diseases, identifying genetic signatures for QTL is much more challenging. Typically, the variants contributing to any QTL are frequent in both cases and controls for a phenotype with lower penetrance, and hence larger samples are needed to detect the genetic variant with enough statistical power. In addition, one may not have a correct estimate of the number of QTLs contributing to a phenotype and hence would have to do very robust and deep phenotyping, so that during genetic analysis the contribution of each phenotypic trait can be assessed using regression-based approaches.
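As a minimal illustration of a regression-based assessment, a quantitative trait can be regressed on genotype dosage (0, 1, or 2 copies of an allele) under an additive model. The sketch below uses ordinary least squares written out by hand. The function name, the additive coding, and the trait values are assumptions made for illustration; a real analysis would also adjust for covariates and test significance.

```python
def additive_effect(dosages, trait_values):
    """Least-squares fit of trait = intercept + slope * dosage.

    Returns (slope, intercept, r_squared). The slope is the estimated change
    in the trait per additional copy of the allele.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(trait_values) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("all individuals carry the same genotype")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, trait_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in trait_values)
    ss_residual = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(dosages, trait_values)
    )
    r_squared = 1.0 - ss_residual / ss_total if ss_total > 0 else 0.0
    return slope, intercept, r_squared


# Hypothetical data: allele dosage and fasting blood glucose (mmol/L).
dosage = [0, 0, 1, 1, 2, 2]
glucose = [5.0, 5.2, 5.5, 5.7, 6.0, 6.2]
slope, intercept, r_squared = additive_effect(dosage, glucose)
```

Here `r_squared` is the share of trait variance explained by this single locus in the toy data, which is the kind of per-locus contribution the paragraph above refers to.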
11.3 APPLICATIONS AND PROTOCOLS
In the previous sections we discussed the need to use direct genome sequencing to discover novel genes and variations causing diseases, both Mendelian and multifactorial. In this section, we briefly describe the various technical approaches used in direct genome sequencing for gene discovery.

11.3.1 Identification and Capturing of the Targeted Genomic Region
11.3.1.1 PCR-Based Method for Targeted Deep Sequencing

The most obvious and commonly used method to generate fragments from targeted genomic regions is traditional single-plex PCR. Either long-range PCR (LR-PCR) can be used to capture a few kilobases of DNA in one amplicon, or multiple exons of multiple genes can be amplified individually and then pooled for a particular sample before proceeding with parallel sequencing approaches. Multiple sources now provide specialized PCR reagents for long-range PCR that can amplify fragments up to 25 kb. Recently, Yeager et al. (2008) successfully used the LR-PCR approach to resequence a 136-kb region from 8q24 to detect variations associated with prostate and colon cancer. They designed primers to amplify fragments ranging from 2.0 to 5.5 kb and kept an overlap of more than 500 bp between any two adjacent fragments. After checking for successful amplification in each reaction, they pooled equimolar amounts of each fragment to represent the entire 136-kb region and then performed parallel sequencing using the 454 technology from Roche. This approach may be useful when the region under study represents one contiguous stretch of the genome. However, for a multifactorial disease one would typically like to sequence multiple small regions from one individual, and LR-PCR will not be effective. With the PCR-based approach one has to do many independent PCR reactions and then follow the usual pipeline of quantitation, pooling, library preparation, and parallel sequencing. Both approaches are vulnerable to variability between PCR reactions and need a large amount of DNA as input (each PCR reaction needs 10–20 ng DNA) compared to other available options. In addition, for degraded samples,
as is often the case for cancer, long-range PCR does not work well. These traditional PCR-based methods are both highly specific and sensitive but are difficult to scale up, which leads to underutilization of the massive throughput provided by the NGS platforms. Recently, several other methods have evolved to circumvent these problems.

11.3.1.2 Multiplex Amplification by Padlock Molecular Inversion Probe

Padlock probes, circular oligonucleotides for localized DNA detection, were described as early as 1994 (Nilsson et al., 1994). This technology was later used for multiplex SNP genotyping and termed the molecular inversion probe (MIP) assay (Hardenbol et al., 2003, 2005). With this approach, “inverted” probes are generated in which the SNP information is reformatted into tag sequences, enabling large-scale screening using a tag DNA microarray. MIPs are single-stranded DNA molecules with sequences complementary to the sequences flanking the SNP under study. Each probe also contains two universal primer sites separated by an endo-ribonuclease recognition site. During the assay, the probes undergo a unimolecular rearrangement: they are (1) circularized by filling gaps with nucleotides corresponding to the SNPs in separate allele-specific polymerization (A, C, G, and T) and ligation reactions and (2) linearized in enzymatic reactions. As a result, they become inverted. This step is followed by PCR amplification and sequencing (Fig. 11.1). Recently, Porreca et al. (2007) have shown the utility of the MIP assay in generating multiplex amplification. In evaluating multiplex targeting methods, key performance parameters to consider include multiplexity, specificity, and uniformity. Multiplexity refers to the number of independent capture reactions performed simultaneously in a single reaction. Specificity is measured as the fraction of captured nucleic acids derived from the targeted regions. Uniformity is defined as the relative abundance of targeted sequences after selective capture.
Ideally, a multiplex targeting method will perform adequately by all three measures. An additional concern is cost; targeted capture necessarily requires one or more oligos to specify each target, which is potentially very expensive at high degrees of multiplexing (Porreca et al., 2007). To overcome these problems the authors synthesized 100-mer oligos on a programmable microarray and released them from it. This complex pool was PCR amplified and then restriction digested to release a mixture of single-stranded 70-mer capture probes. Individual probes consist of a universal 30-nucleotide motif flanked by unique 20-nt segments (targeting arms). Each linked pair of targeting arms is designed to hybridize immediately upstream and downstream of a specific genomic target, for example, an exon. The capture event itself, a modification of the molecular inversion probe strategy developed for multiplex genotyping, is achieved by polymerase-driven extension from the 3′ end of the capture probe to copy the target, followed by ligation to the 5′ end to complete the circle. Subsequent steps enrich and amplify these circles or generate products amenable to shotgun sequencing library production (Porreca et al., 2007). This method circumvents the low scalability of a PCR-based approach, is highly specific, and represents >90% of the targets.
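The probe layout described above (two 20-nt targeting arms around a 30-nt universal motif, 70 nt in total) can be sketched in Python. This is an illustration under stated assumptions rather than the authors' design software: the function names are hypothetical, the genome is given as a single 5′→3′ strand, and the arm orientation is derived only from the requirement that the polymerase extends from the 3′ end of the probe across the target.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def design_capture_probe(genome, target_start, target_end, backbone, arm_len=20):
    """Return a linear capture probe (5'->3') for genome[target_start:target_end].

    Assumed layout: 5'-[arm pairing with the upstream flank]-[universal
    backbone]-[arm pairing with the downstream flank]-3'. The 3' arm anneals
    next to the target so that extension copies the target and ends at the
    5' arm, where ligation closes the circle.
    """
    if target_start < arm_len or target_end + arm_len > len(genome):
        raise ValueError("not enough flanking sequence for the targeting arms")
    upstream = genome[target_start - arm_len:target_start]
    downstream = genome[target_end:target_end + arm_len]
    return revcomp(upstream) + backbone + revcomp(downstream)

def gap_fill_product(genome, target_start, target_end):
    """Sequence the polymerase adds to the 3' arm before ligation."""
    return revcomp(genome[target_start:target_end])
```

With 20-nt arms and a 30-nt backbone the probe is 70 nt long, matching the description in the text. After gap filling and ligation, the circle contains the reverse complement of the upstream flank, the target, and the downstream flank, joined by the backbone.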
Figure 11.1. Molecular inversion probe assay. 1, The probe and the genomic DNA with the polymorphic base to be genotyped. The bases in black are complementary to portions of the probe (bolded) flanking the polymorphic site. The white and textured arrows are regions complementary to universal PCR primers, and the black region is the cleavage site containing a restriction endonuclease recognition sequence. 2, The terminal portions of the probe that are complementary to the genomic DNA sequence hybridize with it, while the remaining part is looped out and a gap is created at the polymorphic site. 3, The gap in the probe is filled by incorporation of dNTPs using polymerase and ligase. Unreacted probes are digested by exonuclease. 4, The probe is linearized by restriction enzyme digestion and 5, released from the genomic DNA. The probe now has an inverted appearance compared to its original conformation. 6, The probe is enriched by PCR amplification with the universal PCR primer pairs (arrows). 7, The polymorphic base is identified by sequencing or hybridization.
However, published results show that there is a >100-fold range in coverage of the targeted sequences, and 34–42% of the sequencing capacity is consumed by sequencing either primer sequences or the linker backbone of the molecular inversion probes (Tewhey et al., 2009a). This results in unnecessary sequencing of unwanted regions and therefore less coverage of the sequence of interest.

11.3.1.3 Hybridization-Based Method for Genome Sequencing

The third approach, based on hybridization with long oligonucleotides that are either matrix bound or in solution, captures and pulls down the target sequences (Albert et al., 2007; Gnirke et al., 2009; Hodges et al., 2007; Okou et al., 2007). The solid-phase hybridization approach has been used to capture the entire human exome, as reported in several published studies (Hodges et al., 2007; Ng et al., 2009). However, the process is difficult to scale up for large population studies. A proof-of-principle study of solution-phase hybridization using long 170-bp capture probes has recently been published (Gnirke et al., 2009). Although this study clearly demonstrated the utility of the approach at a depth of 84× coverage, the variant-detection sensitivity was only 64–80% within the exonic sequences, likely because of insufficient coverage uniformity (Tewhey et al., 2009a). Recently, a modification of the solution-based hybridization method has been published for enrichment of sequencing targets (Tewhey et al., 2009a). The authors noted that the tiling frequency of the 120-bp capture probes is important for obtaining highly uniform coverage across the targeted sequences. Their results suggest that sequence coverage improves if each targeted base pair is contained within two different capture probes but is not improved further by a greater tiling density. They also demonstrated that, for optimal coverage of human exons shorter than 180 bp, at least three 120-mer capture probes per exon should be used.
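The tiling rules reported above can be expressed as a small coordinate calculation. The Python sketch below is one possible reading of those rules, not the published probe-design procedure: it places 120-mer probes so that every exon base lies within at least two probes and uses at least three probes for exons shorter than 180 bp. The function name, the half-open 0-based coordinates, and the decision to let probes overhang the exon by 60 bp on each side are assumptions.

```python
import math

def tile_probes(exon_start, exon_end, probe_len=120,
                short_exon=180, min_probes_short=3):
    """Return probe start coordinates for the exon [exon_start, exon_end).

    Probes are spaced at most probe_len // 2 apart and overhang the exon by
    probe_len // 2 on each side, so every exon base is covered at least twice.
    Start coordinates may fall outside the exon (or below zero near the start
    of a contig); handling that is left to the caller.
    """
    length = exon_end - exon_start
    if length <= 0:
        raise ValueError("exon_end must be greater than exon_start")
    step = probe_len // 2              # 60 bp spacing gives 2x tiling
    overhang = probe_len - step        # how far the first probe starts upstream
    n = math.ceil(length / step) + 1   # probes needed for 2x coverage
    if length < short_exon:
        n = max(n, min_probes_short)   # short exons get at least three probes
    first = exon_start - overhang
    # spread the n starts evenly between the first and last allowed position
    return [first + round(i * length / (n - 1)) for i in range(n)]

def probes_covering(base, starts, probe_len=120):
    """Number of probes that contain a given base position."""
    return sum(s <= base < s + probe_len for s in starts)
```

For a 120-bp exon starting at position 1000, the sketch returns three probes starting at 940, 1000, and 1060, and `probes_covering` reports at least two probes for every exon base.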
The hybridization-based methods have good capture rates, uniform coverage of target sequences, and good reproducibility. However, the methods are known to be biased toward repetitive elements, which can result in a high proportion of reads that are not uniformly distributed. In addition, sequences that are highly homologous to other sequences in the genome cannot be individually targeted. The solution-based method is presented in Figure 11.2.

11.3.1.4 Microdroplet-Based PCR Enrichment for Large-Scale Sequencing

To combine the strengths of the approaches described, a microdroplet-based PCR enrichment method for large-scale targeted sequencing has been developed (Tewhey et al., 2009b). In short, the method takes advantage of massively parallel singleplex amplification while retaining the specificity and sensitivity of PCR. This is achieved by encapsulating each reaction in a discrete microdroplet, which prevents interaction between primer pairs; up to 4000 targets have been amplified successfully in this way. The authors report that the method involves the preparation of 1.5 million separate PCR reactions from 20 μL of template solution containing 7.5 μg of genomic DNA. This technology is well suited for processing DNA for massively parallel amplification of sequencing targets. Microdroplet PCR consists of the following steps: merging picoliter-scale droplets of fragmented genomic DNA template (2–4 kb, by DNase I digestion) with primer pair droplets, pooled thermal cycling of the PCR reactions, and destabilizing the droplets to release the PCR products for purification and sequencing. Figure 11.3 shows the methodology.

[Figure 11.2 panel labels: (a) shearing of genomic DNA; end repairing of sheared DNA; addition of dATP at the 3′ ends; adapter ligation; purification; adapter-mediated PCR enrichment of fragments. (b) prepared library, hybridization buffer, and biotinylated probes; hybridization with optimum temperature regulation; streptavidin-coated magnetic beads; bead capture; unbound fraction discarded; wash beads and remove probes; amplify; sequencing.]

Figure 11.2. (See color insert.) The hybridization-based sequencing method. (a) Genomic DNA is sheared and end repaired or modified. An adenine (A) overhang is added to the 3′ ends of the fragments, adapters are ligated to the 3′ ends of the fragments, and excess adapters or unligated primers are removed. The fragments are purified, and adapter-specific PCR amplification is done to enrich the product pool and prepare a library. (b) The prepared library is hybridized in solution with the relevant biotinylated probes (specific sequences, whole exome, etc.). The probes bind to the relevant sequences from the library. Streptavidin-coated magnetic beads are then added, and a magnet is used to capture the biotinylated probes bound to their complementary sequences. Those specific sequences can then be sequenced on an appropriate platform. [Panel (b) of the illustration has been adapted from Protocol version 1.0.1, October 2009; SureSelect Human All Exon Kit from Agilent Technologies.]

[Figure 11.3 panel labels: (a) primer library; (b) genomic DNA template; (c) microfluidic chip, droplet PCR, break emulsion, gDNA removal, fragmentation and nick translation, sequence. Steps are numbered 1–9 as in the caption.]

Figure 11.3. (See color insert.) Microdroplet PCR workflow. (a) Primer library generation: 1, Identify targeted sequences of interest in the genome. 2, Design and synthesize forward and reverse primer pairs for each targeted sequence (library element). 3, Generate primer pair droplets for each library element. A microfluidic chip is used to encapsulate the aqueous PCR primers in inert fluorinated carrier oil with a block-copolymer surfactant to generate the equivalent of a picoliter-scale test tube compatible with standard molecular biology. 4, Mix primer pair droplets of library elements together so that each library element has an equal representation. (b) Genomic DNA template mix preparation: 5, Biotinylate (red dots), fragment into 2- to 4-kb fragments, and purify genomic DNA. 6, Mix purified genomic DNA together with all of the components of the PCR reaction (DNA polymerase, dNTPs, and buffer) except for the PCR primers. (c) Droplet merge and PCR: 7, Dispense primer library droplets to the microfluidic chip. 8, Deliver the genomic DNA template as an aqueous solution; template droplets are formed within the microfluidic chip. Then pair the primer pair droplets and template droplets in a 1:1 ratio. 9, Allow the paired droplets to flow through the channel of the microfluidic chip and pass through a merge area, where an electric field induces the two discrete droplets to coalesce into a single PCR droplet. Collect ∼1.5 million PCR droplets in a single 0.2-mL PCR tube. Process the PCR droplets (PCR library) in a standard thermal cycler for targeted amplification; break the emulsion of PCR droplets to release the PCR amplicons into solution for genomic DNA (gDNA) removal, purification, and sequencing. (Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology, Tewhey et al., Microdroplet-based PCR enrichment for large-scale targeted sequencing, 27, 1025–1031, 2009.)
The microdroplet PCR technology and the improved solution-based hybridization technology were developed by the same group of researchers (Tewhey et al., 2009a, 2009b). They identified important parameters to consider when choosing an enrichment method for targeted sequencing: (1) uniformity of coverage of targeted sequences, (2) the detection rate and calling accuracy of sequence variants, (3) the efficiency of the enrichment over background sequences, (4) universality of the capture method (fraction of the genome that can be uniquely captured), and (5) the multiplicity of the reaction (amount of sequence that can be targeted). Compared to other enrichment methods, microdroplet PCR generates substantially more uniform coverage of targeted sequences, resulting in a higher variant detection rate: microdroplet PCR (94.5%), solution-based hybridization (64–89%), molecular inversion probe (75%). Microdroplet PCR is a universal method allowing unique capture of most sequences, including those highly similar to other regions of the genome. By anchoring a primer in the divergent portion of a homologous sequence or in an adjacent unique region, almost any interval can be specifically targeted. In contrast, hybridization-based methods cannot capture individual repetitive elements or homologous exons. The authors further commented that they have already used microdroplet PCR to enrich ∼4,000 targeted sequences in a single tube per sample and are currently working on scaling it up to 20,000 targets (∼7.5 Mb, ∼1/10th of the exome) using an expanded content format with five sets of primers in each droplet and no other changes to the workflow. The requirement for 7.5 μg of starting DNA in this study limits the applicability of microdroplet PCR to samples that are available only in limited quantities. The authors commented that optimization has reduced the current requirement to 2 μg and that it is being further reduced to nanogram quantities of DNA (Tewhey et al., 2009b).
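The figures quoted for microdroplet PCR (about 1.5 million droplets, 4,000 or 20,000 targets, five primer sets per droplet, 7.5 μg of template) lend themselves to a short back-of-envelope calculation. The Python sketch below is only that: it assumes the droplets are shared equally among library elements and that the template DNA is spread evenly over all droplets, neither of which is stated as an exact property in the source, and the function names are hypothetical.

```python
import math

def droplet_budget(total_droplets, n_targets, primer_sets_per_droplet=1):
    """Return (library elements, mean droplets per library element).

    A library element is one kind of primer droplet; with several primer
    sets per droplet, fewer elements are needed for the same target count.
    """
    elements = math.ceil(n_targets / primer_sets_per_droplet)
    return elements, total_droplets / elements

def template_per_droplet_pg(template_ug, total_droplets):
    """Mean template mass per droplet in picograms (1 ug = 1e6 pg)."""
    return template_ug * 1e6 / total_droplets

# Current format: 4,000 targets, one primer pair per droplet.
print(droplet_budget(1_500_000, 4_000))
# Expanded format: 20,000 targets, five primer sets per droplet.
print(droplet_budget(1_500_000, 20_000, primer_sets_per_droplet=5))
# Template mass per droplet for 7.5 ug spread over 1.5 million droplets.
print(template_per_droplet_pg(7.5, 1_500_000))
```

Under these assumptions both formats use the same number of library elements, which is consistent with the statement that the expanded format requires no other changes to the workflow.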
As discussed, a researcher today can choose among several sample preparation methods, depending on the objective of the research question. Table 11.1 compares all the methods.

11.3.2 Selection of Suitable Platform

In terms of ease of sample preparation and accuracy of data, the traditional Sanger method of DNA sequencing is still the best. However, the highest achievable throughput of the method is 1 kb of sequence per hour. On the other hand, some of the NGS platforms can sequence the equivalent of 10 human genomes (30 Gb) in about 1 week. To achieve higher throughput and to address pangenomic questions, research strategies are being formulated to match the requirements of NGS platforms. As described earlier, choosing the right approach for sample preparation is perhaps the most crucial step in determining the effectiveness of NGS technology. After the samples have been prepared for direct genome sequencing using the method of choice, they are ready to go onto the NGS platform for massively parallel sequencing. At present there are three major players providing appropriate platforms for this purpose: 454 by Roche, Genome Analyzer by Illumina, and SOLiD by Applied Biosystems. The
TABLE 11.1. Sample Preparation Methods for Direct Genome Sequencing

Approach: PCR and LR-PCR
  Merits:
    • Highest specificity and sensitivity
  Limitations:
    • Difficult to scale up
    • Not suited for sequencing multiple genomic regions

Approach: Multiplex amplification by padlock molecular inversion probe
  Merits:
    • High specificity
    • Can multiplex up to 10,000 different regions
  Limitations:
    • High cost of oligo synthesis
    • ∼40% of total data are from known, unwanted oligo or linker sequences

Approach: Hybridization-based methods
  Merits:
    • High specificity
    • High depth of coverage
    • Can capture targeted regions from the entire genome
  Limitations:
    • Expensive to use in large population studies
    • Biased for repetitive sequences
    • Nonuniform coverage

Approach: Microdroplet-based PCR enrichment
  Merits:
    • Very high sensitivity and specificity
    • Highest variant detection sensitivity (94.5%)
    • Can amplify up to 4000 different targets simultaneously and can perform 1.5 million PCR reactions in a small volume
    • Suitable for multiple regions and many samples
  Limitations:
    • Depends on a high-quality oligo library
    • Currently requires a large amount (7.5 μg) of genomic DNA
details about these platforms are available in other chapters of this book; here we will discuss their performance in regard to direct genome sequencing for gene discovery. The NGS technologies generate a large amount of sequence in each run. For example, both the Illumina Genome Analyzer and the ABI SOLiD can produce 30–70 Gb of raw sequence reads per week, but one run on the 454 currently produces only 500 Mb of sequence. However, for the platforms that produce short sequence reads, more than half of this sequence is not usable. On average, 55% of the Illumina Genome Analyzer reads pass quality filters, of which approximately 77% align to the reference sequence. For ABI SOLiD, approximately 35% of the reads pass quality filters, and 96% of the filtered reads then align to the reference sequence. Thus only 43% and 34% of the Illumina Genome Analyzer and ABI SOLiD raw reads, respectively, are usable. In contrast to the platforms generating short read lengths, approximately 95% of the Roche 454 reads uniquely align to the target sequence. When designing experiments and calculating the target coverage for a region, one must consider the fraction of alignable sequence (Harismendy et al., 2009). It has been reported that, for the Genome Analyzer platform, 50% of the generated sequence represents the first 50 bp of the amplicons initially generated by LR-PCR (Harismendy and Frazer, 2009), making half of the data unusable. The authors have shown that blocking the 5′ end of the PCR primers reduces this overrepresentation in sequencing, resulting in more uniform coverage.

11.3.2.1 The New Range of Possibilities

All the available NGS platforms use amplification-based methods. However, the latest technologies in the field have bypassed that requirement. The true Single Molecule Sequencing (tSMS) technique from HELICOS is the first of its kind available commercially (Gupta, 2008; Milos, 2008). It first fragments the genomic DNA to a length between 100 and 200 nucleotides. Then a universal poly-A stretch is added at one end of every fragment. These DNA fragments are then hybridized onto a lawn of immobilized poly-T oligos. Subsequently, DNA polymerase and one of the four fluorescently labeled nucleotides are passed over the array (flow cell) and images are captured. The residual reagents are washed away, and the step is repeated until the desired length is sequenced (Fig. 11.4). This platform will revolutionize the field, as it allows probing a single cell without the need for any amplification. The recent findings that the genomic DNA content, as well as the variations in the DNA, vary from cell to cell (Gottlieb et al., 2009) could be further probed using this platform, which was not possible earlier. The removal of the amplification step also drastically reduces the time required for each experiment; currently the platform can sequence an entire human genome in one day, compared to one week for other NGS platforms. The product has recently been launched. As the technology comes into use, its possibilities and limitations will become clearer.
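The cyclic chemistry described for single-molecule sequencing (one labeled nucleotide type per flow, imaging, washing, and repetition) can be mimicked with a toy simulation. The Python sketch below is an illustration, not a model of the commercial instrument: the flow order, the function name, and in particular the assumption that at most one base is incorporated per flow on each strand are choices made here and are not taken from the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def simulate_flows(template, flow_order="CTAG", max_flows=40):
    """Return the read synthesized on one template strand.

    template:   bases of the captured fragment in the order the polymerase
                encounters them
    flow_order: assumed order in which the four labeled nucleotides are
                passed over the flow cell, one type per flow
    A base is added only when the flowed nucleotide pairs with the next
    template base; each flow adds at most one base (an assumption).
    """
    read = []
    position = 0
    for flow in range(max_flows):
        if position == len(template):
            break                      # template fully copied
        nucleotide = flow_order[flow % len(flow_order)]
        if COMPLEMENT[template[position]] == nucleotide:
            read.append(nucleotide)    # this flow would give a signal here
            position += 1
        # otherwise no signal; reagents are washed away before the next flow
    return "".join(read)
```

In this toy model the template `AAC` needs eight flows to yield the read `TTG`, because the two adjacent A bases are read in separate cycles; the read length therefore depends on both the number of flows and the sequence context.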
However, as the technology depends on fragmentation and uses a single primer for sequencing, one can imagine that the computational issues will be more challenging for this platform.

Nanopore-based sequencing is another novel approach for PCR-free massively parallel sequencing (Branton et al., 2008). This technology uses the polarity of the DNA strands; upon application of a potential difference across a nanometer-scale tube (nanopore), the DNA strand is attracted toward the pore. Each type of nucleotide changes the current flow through the nanopore in a characteristic way as it interacts with the pore, enabling one to read the order of the nucleotides, that is, to sequence the DNA (Fig. 11.5). Although this technology is still under development, once available it should be more powerful than the HELICOS tSMS platform, as it will not require any fluorescent labeling. Also, the technical design of the platform is likely to provide the longest read lengths in the field. Currently, the biggest challenge is to ensure entry of every base into the nanopore as it is cleaved by the exonuclease.

As discussed, multiple platforms for massively parallel sequencing are now available to the research community. Each comes with its own pros and cons. A comparison is given in Table 11.2.
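When comparing platforms for a targeted study, the read-filtering figures quoted earlier in this section for the short-read platforms can be turned into a planning estimate. The Python sketch below is a minimal illustration with hypothetical function names. It multiplies the pass-filter and alignment rates to obtain a usable fraction and then scales the raw yield needed for a chosen mean coverage. With the rounded rates from the text the products are about 42% and 34%, close to the reported 43% and 34%.

```python
def usable_fraction(pass_filter_rate, aligned_rate):
    """Fraction of raw reads that pass quality filters and then align."""
    return pass_filter_rate * aligned_rate

def raw_bases_needed(target_bp, mean_coverage, usable):
    """Raw sequence (in bases) required for a given mean target coverage.

    Assumes usable reads are spread over the target only; any off-target
    or uneven coverage would raise the requirement further.
    """
    if not 0 < usable <= 1:
        raise ValueError("usable must be in the interval (0, 1]")
    return target_bp * mean_coverage / usable

illumina = usable_fraction(0.55, 0.77)   # rates quoted in the text
solid = usable_fraction(0.35, 0.96)
print(f"Illumina GA usable fraction: {illumina:.1%}")
print(f"ABI SOLiD usable fraction:   {solid:.1%}")

# Example: raw yield for 30x mean coverage of a 1-Mb target on each platform.
for name, frac in (("Illumina GA", illumina), ("ABI SOLiD", solid)):
    print(name, f"{raw_bases_needed(1_000_000, 30, frac) / 1e6:.0f} Mb raw")
```

The same calculation applies to any platform once its pass-filter and alignment rates are known; the 1-Mb target and 30× coverage are arbitrary example values.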
Figure 11.4. Single molecule sequencing. a, Genomic DNA is sheared and a poly-A tail is added to the fragments of DNA. The fragments are then hybridized to an array containing poly-T probes attached to the chip. After hybridization, fluorescently labeled dNTPs are added (different fluorescence for different types of bases; one type at a time) and are incorporated at the relevant positions on the poly-T probe, depending on the sequence context of the poly-A-tailed fragment hybridized to the respective probe. Excess dNTPs are washed away, the fluorescence emitted by the incorporated nucleotide is captured by the relevant detectors, and the fluorescent label is then cleaved from the nucleotide.
11.4 THE LIMITATIONS OF DIRECT GENOME SEQUENCING

The biggest limitation of direct genome sequencing at present is its prohibitive cost for most investigators. Although technological innovations in this area are rapidly bringing down the cost toward the $1000 “magic number” for genome sequencing, it is still beyond the reach of most individual investigators. One approach to overcome this problem is to pool samples (Ingman and Gyllensten, 2009) or to index them. For example, if one needs to sequence a common genomic region in many individuals to detect common at-risk variants for a common phenotype or QTL, multiple samples can be pooled together. Thus many samples from cases and controls can be combined into two pooled samples, one for cases and one for controls. This will ensure overall higher coverage of the sequence data and will amplify the signal of the variants that are more frequent in a particular pool. However, this has several limitations: (1) multiple PCR or LR-PCR reactions from every sample have to work with similar efficiency; (2) unequal pooling of samples will lead to nonuniform
[Figure 11.5 labels: nanopore platform; DNA strand; nanopore; electrodes (+ and −); voltage (V); measurement of the alteration in the magnitude of current with passage of each nucleotide through the pore.]
Figure 11.5. Nanopore-based sequencing technology. Genomic DNA is made to pass through a nanopore immersed in a conducting fluid and a potential (voltage) is applied across it. The electric current generated due to conduction of ions through the nanopore is assayed. As individual nucleotides pass through the nanopore, each nucleotide obstructs the nanopore to a different, characteristic degree, thereby varying the amount of current that passes through the nanopore at any given moment. The change in the current through the nanopore represents a direct reading of the DNA sequence.
representation of every individual in a pool; and (3) upon detection of variants, one cannot identify the individuals who harbor the variant, hence another round of sequencing or genotyping of the targeted loci is required for each sample. Another method of sequencing multiple samples on NGS platforms is indexing. Currently, almost all NGS platforms provide an indexing kit. Indexing attaches unique tags to the ends of the universal primers; the more unique tags available, the more samples can be multiplexed together. Pooling and indexing are applicable only to targeted resequencing and not to whole-genome sequencing approaches.

The next challenge is to handle the data computationally once the machine churns out a few gigabytes of sequence data in a matter of days. One run on the Genome Analyzer from Illumina currently generates >5 terabytes of image data, followed by almost 1 terabyte of raw sequence data once the images are analyzed. It is a challenging endeavor to keep up with the growing need for computational storage space as well as the requirement for fast processors to analyze the large amount of data. Finally, identification of genomic variants of specific interest from a very large pool of background changes is a remarkably daunting task.

In short, we have come a long way since the discovery of DNA sequencing. Powered by technological innovations and computational capacity building,
TABLE 11.2. Sequencing Methods for Direct Genome Sequencing

Approach: Sanger sequencing
  Merits:
    • Highest specificity and sensitivity; high coverage not needed
    • Maximum read length (∼1 kb)
  Limitations:
    • Lowest throughput
    • Sequence quality depends on the nucleotide composition

Approach: Emulsion PCR based pyrosequencing
  Merits:
    • Liquid-phase emulsion PCR for high throughput
    • Larger sequence read length (400 bp)
    • Most suitable for de novo sequencing of small genomes (metagenomics)
    • Highest fraction (∼95%) of usable data
  Limitations:
    • High cost of consumables
    • Lower quantity of sequence per run (∼500 Mb) compared to other next-generation sequencing platforms

Approach: Polymerase-based sequencing by synthesis
  Merits:
    • High throughput (>50 Gb of sequence per experiment)
    • Comparatively low consumable cost
    • Robust workflow ensuring fewer failures
  Limitations:
    • Small read length (75 nt)
    • Nonuniform coverage
    • Only 43% of raw data is usable

Approach: Ligation-based parallel sequencing
  Merits:
    • High throughput (∼100 Gb of sequence per experiment)
    • Uses proprietary two-base recognition method for high accuracy
  Limitations:
    • Only 34% of raw data is usable
    • Comparatively smaller read length

Approach: Single molecule sequencing
  Merits:
    • Sequence from single molecule, amplification free
    • Enables researchers to analyze cell-to-cell differences in genomic composition
    • Should provide high read lengths
  Limitations:
    • Latest entry in the field, needs more data to compare
    • Computationally challenging to analyze
sequencing technology has accelerated the pace of learning about the information embedded in the genome sequence. Direct rapid sequencing of large regions of the genome has largely replaced the need for painstaking fine physical mapping to narrow down the critical region for a trait of interest until it is amenable to sequencing. Sequencing of the genome from a single cell is now a reality. It is likely that a large number of complete genomes will soon be available in the public domain, which will make the reference genome more complete and provide better coverage in resequencing approaches for gene
discovery. It is tempting to speculate that within a few years we will be beyond the era of the $1000 genome and of global projects sequencing 1000 genomes, with exponential growth in our knowledge of biology. Of course, the key challenge remains cost, which will determine the extent to which the scientific community can benefit from new innovations in direct genome sequencing.

11.5 REFERENCES

ABI SOLiD: solid.appliedbiosystems.com
HELICOS: www.helicosbio.com
Illumina genome analyzer: www.illumina.com/systems/genome_analyzer.ilmn
Nanopore: www.nanoporetech.com
OMIM: www.ncbi.nlm.nih.gov/Omim
Roche 454: www.454.com

Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA. (2007). Direct selection of human genomic loci by microarray hybridization. Nat Meth 4:903–05.
Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, Di Ventra M, Garaj S, Hibbs A, Huang X, Jovanovich SB, Krstic PS, Lindsay S, Ling XS, Mastrangelo CH, Meller A, Oliver JS, Pershin YV, Ramsey JM, Riehn R, Soni GV, Tabard-Cossa V, Wanunu M, Wiggin M, Schloss JA. (2008). The potential and challenges of nanopore sequencing. Nat Biotechnol 26:1146–53.
Chavali S, Ghosh S, Bharadwaj D. (2009). Hemophilia B is a quasi-quantitative condition with certain mutations showing phenotypic plasticity. Genomics 94:433–37.
Chavali S, Sharma A, Tabassum R, Bharadwaj D. (2008). Sequence and structural properties of identical mutations with varying phenotypes in human coagulation factor IX. Proteins 73:63–71.
Den Hollander AI, Koenekoop RK, Mohamed MD, Arts HH, Boldt K, Towns KV, Sedmak T, Beer M, Nagel-Wolfrum K, McKibbin M, Dharmaraj S, Lopez I, Ivings L, Williams GA, Springell K, Woods CG, Jafri H, Rashid Y, Strom TM, van der Zwaag B, Gosens I, Kersten FF, van Wijk E, Veltman JA, Zonneveld MN, van Beersum SE, Maumenee IH, Wolfrum U, Cheetham ME, Ueffing M, Cremers FP, Inglehearn CF, Roepman R. (2007). Mutations in LCA5, encoding the ciliary protein lebercilin, cause Leber congenital amaurosis. Nat Genet 39:889–95.
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 27:182–89.
Gottlieb B, Chalifour LE, Mitmaker B, Sheiner N, Obrand D, Abraham C, Meilleur M, Sugahara T, Bkaily G, Schweitzer M. (2009). BAK1 gene variation and abdominal aortic aneurysms. Hum Mutat 30:1043–47.
Gupta PK. (2008). Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 26:602–11.
Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, Gallins P, Spencer KL, Kwan SY, Noureddine M, Gilbert JR, Schnetz-Boutaud N, Agarwal A, Postel EA, Pericak-Vance MA. (2005). Complement factor H variant increases the risk of age-related macular degeneration. Science 308:419–21.
Hardenbol P, Banér J, Jain M, Nilsson M, Namsaraev EA, Karlin-Neumann GA, Fakhrai-Rad H, Ronaghi M, Willis TD, Landegren U, Davis RW. (2003). Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 21:673–78.
Hardenbol P, Yu F, Belmont J, Mackenzie J, Bruckner C, Brundage T, Boudreau A, Chow S, Eberle J, Erbilgin A, Falkowski M, Fitzgerald R, Ghose S, Iartchouk O, Jain M, Karlin-Neumann G, Lu X, Miao X, Moore B, Moorhead M, Namsaraev E, Pasternak S, Prakash E, Tran K, Wang Z, Jones HB, Davis RW, Willis TD, Gibbs RA. (2005). Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 15:269–75.
Harismendy O, Frazer K. (2009). Method for improving sequence coverage uniformity of targeted genomic intervals amplified by LR-PCR using Illumina GA sequencing-by-synthesis technology. Biotechniques 46:229–31.
Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA. (2009). Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10:R32.
Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR. (2007). Genome-wide in situ exon capture for selective resequencing. Nat Genet 39:1522–27.
Ingman M, Gyllensten U. (2009). SNP frequency estimation using massively parallel sequencing of pooled DNA. Eur J Hum Genet 17:383–86.
Mencía A, Modamio-Høybjør S, Redshaw N, Morín M, Mayo-Merino F, Olavarrieta L, Aguirre LA, del Castillo I, Steel KP, Dalmay T, Moreno F, Moreno-Pelayo MA. (2009). Mutations in the seed region of human miR-96 are responsible for nonsyndromic progressive hearing loss. Nat Genet 41:609–13.
Milos P. (2008). Helicos BioSciences. Pharmacogenomics 9:477–80.
Mukhopadhyay A, Nikopoulos K, Maugeri A, de Brouwer AP, van Nouhuys CE, Boon CJ, Perveen R, Zegers HA, Wittebol-Post D, van den Biesen PR, van der Velde-Visser SD, Brunner HG, Black GC, Hoyng CB, Cremers FP. (2006). Erosive vitreoretinopathy and Wagner disease are caused by intronic mutations in CSPG2/Versican that result in an imbalance of splice variants. Invest Ophthalmol Vis Sci 47:3565–72.
Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–76.
Nikopoulos K, Gilissen C, Hoischen A, van Nouhuys CE, Boonstra FN, Blokland EA, Arts P, Wieskamp N, Strom TM, Ayuso C, Tilanus MA, Bouwhuis S, Mukhopadhyay A, Scheffer H, Hoefsloot LH, Veltman JA, Cremers FP, Collin RW. (2010). Next-generation sequencing of a 40 Mb linkage interval reveals TSPAN12 mutations in patients with familial exudative vitreoretinopathy. Am J Hum Genet 86:240–47.
Nilsson M, Malmgren H, Samiotaki M, Kwiatkowski M, Chowdhary BP, Landegren U. (1994). Padlock probes: circularizing oligonucleotides for localized DNA detection. Science 265:2085–88.
Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. (2007). Microarray-based genomic selection for high-throughput resequencing. Nat Meth 4:907–09.
Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J. (2007). Multiplex amplification of large sets of human exons. Nat Meth 4:931–36.
Sanger F, Coulson AR. (1978). The use of thin acrylamide gels for DNA sequencing. FEBS Lett 87:107–10.
Sanger F, Nicklen S, Coulson AR. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–67.
Sanger F, Nicklen S, Coulson AR. (1992). DNA sequencing with chain-terminating inhibitors. Biotechnology 24:104–08.
Tewhey R, Nakano M, Wang X, Pabón-Peña C, Novak B, Giuffre A, Lin E, Happe S, Roberts DN, LeProust EM, Topol EJ, Harismendy O, Frazer KA. (2009a). Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol 10:R116.
Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, Kotsopoulos SK, Samuels ML, Hutchison JB, Larson JW, Topol EJ, Weiner MP, Harismendy O, Olson J, Link DR, Frazer KA. (2009b). Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 27:1025–31.
Yeager M, Xiao N, Hayes RB, Bouffard P, Desany B, Burdett L, Orr N, Matthews C, Qi L, Crenshaw A, Markovic Z, Fredrikson KM, Jacobs KB, Amundadottir L, Jarvie TP, Hunter DJ, Hoover R, Thomas G, Harkins TT, Chanock SJ. (2008). Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum Genet 124:161–70.
c11.indd 233
1/12/2011 9:44:18 AM
CHAPTER 12
Candidate Screening through Bioinformatics Tools
SONG WU and WEI ZHAO
Contents
12.1 Introduction
12.2 Computing Environment: R and Bioconductor
12.3 Bioinformatic Databases
  12.3.1 Literature Database: PubMed
  12.3.2 Biological Ontology Databases
  12.3.3 Protein–Protein Interaction Databases
12.4 Bayesian Network to Analyze Expression Data: NATbox
12.5 Weighted Gene Co-Expression Network Analysis
  12.5.1 Generation of Weighted Gene Co-Expression Network
  12.5.2 Detection of Modules
  12.5.3 Define Measures of Gene Significance and Module Relevance
  12.5.4 Functional Enrichment Studies of Gene Modules
  12.5.5 Relating Intramodular Connectivity to Gene Significance
  12.5.6 Network-Based Screening Strategy
  12.5.7 Brain Tumor Example
12.6 In Silico Screening of Candidate Genes
  12.6.1 Input Gene List Preparation
  12.6.2 Gene Set Enrichment Analysis
  12.6.3 Protein–Protein Interaction Network Analysis
  12.6.4 PID Example
  12.6.5 Other Bioinformatics Tools
12.7 Future Directions
12.8 Questions
12.9 Acknowledgments
12.10 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
12.1 INTRODUCTION
With the rapid development of DNA sequencing technologies, whole genome sequences have become available for many species. Accompanying this, genomewide high-throughput experiments, such as gene expression assays and single-nucleotide polymorphism (SNP) arrays, have been developed and industrialized. The ability to do large-scale screening with these assays makes them popular among researchers who are searching for disease genes. In most cases, the result of a genomewide array experiment is a list of hundreds or even thousands of significant genes. Besides the assay experiments, traditional marker-based linkage analysis is another disease gene hunting method, in which quantitative trait loci associated with disease traits can be identified. A quantitative trait locus usually corresponds to a large genomic region that contains several hundred genes. Thus, in either high-throughput assays or linkage analyses, causal genes underlying a disease frequently hide within a large set of genes, and it is a daunting exercise to validate all of them experimentally. Typically, researchers pick a few top genes or cherry-pick a handful of interesting genes for further experiments. However, by retrieving and integrating the information from multiple bioinformatic databases, better strategies may be applied to prioritize the resulting genes. In this chapter, we review several bioinformatics tools that explore the structure within a long list of significant genes to generate a short list of candidate genes. Because candidate gene screening by bioinformatics tools is essentially a gene prioritization process, the terms candidate gene screening and gene prioritization will be used interchangeably for ease of presentation.
Generally speaking, two types of gene prioritization analyses can be done. One is based on data-driven network analysis, which aims to infer the structure of the gene regulation process from assay data (Friedmen, 2000; Zhao, 2006); the other is based on information-driven analysis, which aims to retrieve and integrate biological knowledge from multiple databases to reveal the gene relationships (Sun et al., 2009; Ortutay et al., 2009). For the data-driven analysis, we focus on gene expression experiments, in which the data reveal not only differentially expressed genes but also their coexpression patterns, that is, the gene correlations. Based on this, it is possible to query the interactions between genes and form a gene network from their interconnectedness. We will dedicate two sections to demonstrating some bioinformatics tools for network analysis. For the information-driven analysis, we focus on how to generate a small list of justifiable disease candidate genes solely from bioinformatic resources. The main idea is that perturbation of genes that are involved in the same pathway or biological process important for a disease will produce the same or very similar disease phenotypes. We describe in detail how this can be done.
12.2 COMPUTING ENVIRONMENT: R AND BIOCONDUCTOR
R is a language and environment that provides a wide variety of statistical computing and graphics techniques. It is free software and can be downloaded from http://cran.r-project.org. The R environment has many notable features, one of which is its great extensibility through add-on packages that can be easily installed. Packages contributed by developers from all areas of statistics research have greatly enriched the available choices and have benefited biological and medical researchers by providing good-quality analyses. In this sense, R can be viewed as an integrated suite of software facilities. Due to its flexibility and data manipulation capacity, R has become one of the most widely used tools for bioinformatics.
Bioconductor is an R-based open source and open development software project specializing in tools for the analysis and comprehension of genomic data (http://www.bioconductor.org). The functional scope of Bioconductor includes the analysis of almost all types of genomic data, such as DNA microarray, serial analysis of gene expression, sequence, and SNP data. All analysis packages in Bioconductor are distributed as R packages and are compatible with the R environment. Bioconductor also includes many up-to-date data packages for easy annotation. Most analysis software tools discussed in this chapter are implemented in R/Bioconductor and are freely available. Readers who are not familiar with R or who are interested in learning more about its applications in bioinformatics can consult Gentleman (2008).
12.3 BIOINFORMATIC DATABASES
Bioinformatic databases serve as the arsenal for bioinformatics analyses. Before we go further, it is necessary to introduce the bioinformatic resources. Since a huge number of databases exist and it is impossible to introduce them all, only those closely related to the material discussed in this chapter will be reviewed.
12.3.1 Literature Database: PubMed
PubMed is a service of the U.S. National Library of Medicine (USNLM) that currently includes more than 19 million citations from MEDLINE and other life science journals for biomedical articles (www.ncbi.nlm.nih.gov/pubmed). It is the single largest literature resource online. For the average researcher, hands-on PubMed usage might just mean searching keywords in PubMed and trying to read all the related abstracts. This exercise was feasible a decade ago when the body of literature was relatively small. However, as the number of articles grows exponentially each year, it is no longer practical to read
every detail within all the relevant literature. More efficient methods to retrieve information from articles are needed. Recognizing this, USNLM has developed tools and data files that facilitate fast information extraction by annotating published articles; they can be found on the NCBI repositories site (ftp://ftp.ncbi.nlm.nih.gov). For example, the file gene2pubmed under the gene directory and DATA subdirectory contains annotations between genes and PubMed IDs, which provides fast conversion between genes and PubMed articles.
12.3.2 Biological Ontology Databases
Ontology is the science of what is (Smith, 2001). It is a rigorous and exhaustive organization of a knowledge domain that is usually hierarchical and contains all the relevant entities and their relations. From a practical view, biological ontologies provide deeper and more robust representations of the biological domains about which we wish to reason and solve problems. Here we discuss two relevant ontology databases.
12.3.2.1 Gene Ontology
Gene ontology (GO) provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data (www.geneontology.org). It consists of three independent areas: biological process, molecular function, and cellular component. The terms in GO can be structured as a graph, with terms as nodes and the relations between the terms as arcs (Fig. 12.1). The relations between GO terms are also categorized and defined; they include is_a (is a subtype of); part_of; has_part; and regulates, negatively_regulates, and positively_regulates relationships. The properties of each relation are specified in the OBO format and can be viewed graphically with an OBO ontology editor (http://oboedit.org) or browsed on the web (http://amigo.geneontology.org). More important, each GO term is annotated with a set of genes, which can be used for functional enrichment analysis.
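The graph view of GO described above can be sketched in a few lines of base R. The edge list below is a hand-made assumption loosely based on the term names in Fig. 12.1; it is not a complete or authoritative copy of the GO graph, and go_edges and go_ancestors are hypothetical names introduced only for this illustration.

```r
# Toy GO fragment as an edge list: child term -> parent term, with the relation type.
# The edges are illustrative only and do not reproduce the full GO graph.
go_edges <- data.frame(
  child  = c("nucleotide-excision repair", "DNA repair", "DNA metabolic process",
             "regulation of DNA repair", "regulation of DNA repair"),
  parent = c("DNA repair", "DNA metabolic process", "cellular metabolic process",
             "regulation of cellular response to stress", "DNA repair"),
  relation = c("is_a", "is_a", "is_a", "is_a", "regulates"),
  stringsAsFactors = FALSE
)

# Collect every term reachable from `term` by following arcs of the given relation types.
go_ancestors <- function(term, edges, relations = "is_a") {
  edges <- edges[edges$relation %in% relations, ]
  found <- character(0)
  frontier <- term
  while (length(frontier) > 0) {
    parents <- unique(edges$parent[edges$child %in% frontier])
    frontier <- setdiff(parents, found)   # only terms not visited yet
    found <- c(found, frontier)
  }
  found
}

is_a_only <- go_ancestors("regulation of DNA repair", go_edges)
with_regulates <- go_ancestors("regulation of DNA repair", go_edges,
                               relations = c("is_a", "regulates"))
```

Following only is_a arcs keeps the query within the regulation branch, whereas adding the regulates arcs also reaches DNA repair and its parents. This is one reason why tools that use GO need to state which relation types they traverse.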
The gene annotations to GO terms can be found on the GO website or obtained in a cleaner format from BioMart (http://www.biomart.org), a generic query-oriented data management system developed jointly by the Ontario Institute for Cancer Research and the European Bioinformatics Institute. The GO database has become so important that most candidate gene screening algorithms rely on it in some way.
Figure 12.1. A graphic example of the GO term regulation of DNA repair, showing the hierarchical structure of the terms. I, is_a relationship; R, regulates relationship.
12.3.2.2 Medical Subject Headings
Medical Subject Headings (MeSH) is the USNLM’s controlled vocabulary thesaurus used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts (www.ncbi.nlm.nih.gov/mesh). Similar to GO, MeSH has more than one ontology and has 16 areas. However, for disease candidate gene screening, MeSH C (Diseases) and MeSH D (Chemicals and Drugs) are the more relevant terms. These MeSH terms are very useful when combined with literature searches to generate candidate gene lists. Some software, such as G2D (Genes to Diseases, www.ogic.ca/projects/g2d_2), uses MeSH terms for candidate gene prioritization.
12.3.3 Protein–Protein Interaction Databases
Protein–protein interactions (PPIs) are essential to all biological processes. Over the past few years, the number of known PPIs has grown at a substantial pace, owing either to direct experimental evidence or to in silico evidence derived from a deeper understanding of PPI mechanisms. Many protein interaction repositories have been built to store PPI knowledge and are widely used for investigating molecular networks or pathways. There are six major PPI databases: the Human Protein Reference Database (HPRD), the Biomolecular Interaction Network Database (BIND), the Biological General Repository for Interaction Datasets (BioGRID), the Molecular INTeraction database (MINT), the Database of Interacting Proteins (DIP), and the IntAct molecular interaction database (IntAct). Each differs in scope and content. If possible, it is better to combine all interactions for PPI analysis. In practice, however, it is also acceptable to use only HPRD (http://www.hprd.org), since it alone contains about 80% of the interactions (Mathivanan et al., 2006).
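Combining interactions from several repositories mainly requires treating A-B and B-A as the same undirected pair. The following is a minimal base R sketch under that assumption; the two small tables are made up for illustration and were not downloaded from HPRD or BioGRID, and the column and function names are hypothetical.

```r
# Hypothetical interaction tables from two sources (illustrative pairs only).
hprd <- data.frame(a = c("BRCA1", "TP53", "ATM"),
                   b = c("BARD1", "MDM2", "TP53"),
                   stringsAsFactors = FALSE)
biogrid <- data.frame(a = c("BARD1", "CHEK2", "TP53"),
                      b = c("BRCA1", "TP53", "MDM2"),
                      stringsAsFactors = FALSE)

# Order each pair alphabetically so that A-B and B-A count as one undirected interaction.
normalize_pairs <- function(ppi, source) {
  data.frame(p1 = pmin(ppi$a, ppi$b),
             p2 = pmax(ppi$a, ppi$b),
             source = source,
             stringsAsFactors = FALSE)
}

all_ppi <- rbind(normalize_pairs(hprd, "HPRD"), normalize_pairs(biogrid, "BioGRID"))

# One row per unique interaction, listing the databases that report it.
combined <- aggregate(source ~ p1 + p2, data = all_ppi,
                      FUN = function(s) paste(sort(unique(s)), collapse = ";"))
```

Keeping the source column makes it possible to see how much each database contributes, which is the kind of comparison behind the statement that HPRD alone covers most interactions.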
12.4 BAYESIAN NETWORK TO ANALYZE EXPRESSION DATA: NATBOX
A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional independencies. Each gene is a variable in a Bayesian network, which is essentially a directed acyclic graph (DAG) based on the causal relationships among genes (e.g., upregulation of a transcription factor promotes the expression of its downstream regulatory genes). Bayesian networks have many advantages in modeling gene expression networks: (1) they explicitly relate the DAG model of the causal relations among the gene expression levels to a statistical analysis; (2) they have broad applications and include linear models, nonlinear models, Boolean networks, and hidden Markov models as special cases; (3) there are already well-developed algorithms for searching for Bayesian networks from observational data; (4) they allow for the introduction of a stochastic element and hidden variables; and (5) they allow explicit modeling of the process by which the data are collected (Spirtes et al., 2000). Bayesian networks assume that a variable is independent of its nondescendants, given its parents in the network. This conditional independence assumption allows the decomposition of the joint distribution of the network. Figure 12.2 is a simple example given by Friedmen (2000) that clearly demonstrates this. The joint probability distribution of A, B, C, D, and E can be decomposed as $P(A, B, C, D, E) = P(A)\,P(B \mid A, E)\,P(C \mid B)\,P(D \mid A)\,P(E)$. Learning the network structure from observed data is an NP-hard problem (Chickering, 1996). Many search algorithms have been proposed (Margaritis, 2003; Tsamardinos et al., 2003, 2006; Yaramakala and Margaritis, 2005). Currently, there are two R packages that perform Bayesian network analysis, BNArray (Chen et al., 2006) and NATbox (Chavan et al., 2009), and we found that NATbox is superior.
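The factorization above can be checked numerically. The sketch below assumes five binary genes and invents the conditional probability tables; none of the numbers come from the chapter or from Friedmen (2000). It multiplies the five local terms for every state, confirms that the result is a proper joint distribution, and verifies one implied independence (C is independent of A, given B).

```r
# Hypothetical conditional probability tables for five binary genes (1 = "on", 0 = "off").
p_A <- 0.3                                      # P(A = 1)
p_E <- 0.6                                      # P(E = 1)
p_B <- function(a, e) 0.1 + 0.5 * a + 0.3 * e   # P(B = 1 | A = a, E = e)
p_C <- function(b) 0.2 + 0.6 * b                # P(C = 1 | B = b)
p_D <- function(a) 0.4 + 0.4 * a                # P(D = 1 | A = a)

# Probability of observing value x (0 or 1) when P(X = 1) = p.
bern <- function(x, p) ifelse(x == 1, p, 1 - p)

# Enumerate all 2^5 states and multiply the five local terms of the factorization.
states <- expand.grid(A = 0:1, B = 0:1, C = 0:1, D = 0:1, E = 0:1)
states$joint <- with(states,
  bern(A, p_A) * bern(B, p_B(A, E)) * bern(C, p_C(B)) * bern(D, p_D(A)) * bern(E, p_E))

total <- sum(states$joint)   # a valid joint distribution sums to 1

# P(C = 1 | B = 1, A = a), computed from the joint table; it should not depend on a.
cond_C <- function(a) {
  s <- states[states$B == 1 & states$A == a, ]
  sum(s$joint[s$C == 1]) / sum(s$joint)
}
```

Calling cond_C(0) and cond_C(1) returns the same value, which is the table entry P(C = 1 | B = 1). Only 1 + 1 + 4 + 2 + 2 = 10 parameters are needed here instead of the 31 required for an unrestricted joint distribution over five binary variables.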
Figure 12.2. A simple example of a Bayesian network. The network can be decomposed as follows: (1) A and E are independent; (2) B and D are independent, given A and E; (3) C is independent of A, D, and E, given B; (4) D is independent of B, C, and E, given A; (5) E is independent of A and D.
NATbox is a menu-driven graphical user interface (GUI) implemented in R for modeling and analysis of functional relationships in gene expression data (Chavan et al., 2009). The input data should be saved as a tab-delimited *.txt file, with each column representing a gene; row names are not allowed. All functions are accessible with a simple click; no command needs to be entered once the software is running. This gives NATbox an advantage over BNArray, because researchers with less programming experience can use it easily. The software also provides more search algorithms for optimizing Bayesian networks, versus only two in BNArray. The backbone of the software is bnlearn, an R package developed by Marco Scutari. NATbox calls bnlearn functions through its GUI to conduct network searches and draw network plots. Given an adjacency matrix, NATbox also provides the option to draw a network plot and perform network analysis through its Social Network Analysis tool. However, the tool cannot label genes by their names and does not allow users to interact with the network graphs. Stand-alone software, such as VisANT, or more sophisticated R packages, such as graph, RBGL, and Rgraphviz, can be used to draw network plots by those who are interested in a particular network structure.
As an example, we have constructed a new Bayesian network of the 17 DNA repair genes using NATbox’s GS algorithm (Figures 12.3 and 12.4). This list of genes was originally from a yeast experiment to study genes that are involved in cell cycle regulation (Spellman et al., 1998). The data include 77 yeast gene expression microarrays and around 6200 genes. The 17 DNA repair genes among 799 differentially expressed genes were selected by the authors of BNArray as a tutorial example of their software. The input data can be derived using the code in Box 12.1. The resulting network differs markedly from that constructed using BNArray. The network by BNArray has far more edges than that by NATbox, and the directions of many edges are also different. We also compared the networks constructed by these two programs using the example data found at www.bnlearn.com. Both programs work well with simple examples, but BNArray is less useful for complicated networks. Not only does it generate more edges, but some of the edges are also in the wrong direction. For that reason, we recommend NATbox for Bayesian network analysis.
Figure 12.3. A screen shot of the NATbox GUI.
Despite all their advantages, Bayesian networks have many pitfalls as well. First, they require a large sample size that most microarray studies cannot afford. Simulation studies inferring DAG structure from sample data indicate that even for relatively sparse graphs, sample sizes of several hundred are required for high accuracy. Second, learning a Bayesian network poses an insurmountable task for large networks. The relationship between two genes A and B has four possibilities: A causes B, B causes A, A and B mutually cause each other, and A and B have no causal relationship. For n genes, there are $4^n$ possibilities. This number becomes astronomical even for a small network made up of 100 genes. Although search algorithms, such as genetic algorithms and PC algorithms, have been developed to make global searches possible, identifying all causal relationships among the genes in even a small network remains an ordeal (Aten et al., 2008). Third, because of the nature of the Bayesian network, two networks constructed from the same data may differ if the program is run at different times. It is important to be aware of the pitfalls of Bayesian networks and to interpret the results cautiously.
Figure 12.4. A screen shot of the NATbox network analysis of the DNA repair gene network. Nodes shown: YOR033C, YDL101C, YOL090W, YNL312W, YDR097C, YER095W, YNL082W, YGL021W, YML061C, YGL163C, YML060W, YML021C, YIL066C, YLR288C, YKL113C, YLR032W, YLR383W.
BOX 12.1. Code to Retrieve DNA Repair Gene Expression Data
> library(BNArray)
> data(total.data)
> attach(total.data)
> ori.compact = LLSimpute(total.data$df.all, total.data$df.ori, total.data$n.changed)
> ori.compact = FinalImpute(ori.compact)
> bn.data = PrepareCompData(ori.compact)
# the names of DNA repair genes can be found in http://www.cls.zju.edu.cn/binfo/BNArray/
> dnarepair = read.csv(file="DNA repair.csv", header=FALSE)
> bn.data = bn.data[, as.character(unlist(dnarepair))]
12.5 WEIGHTED GENE CO-EXPRESSION NETWORK ANALYSIS
Weighted gene co-expression network analysis (WGCNA) is based on the concept of a scale-free network. Metabolic networks in all organisms have been suggested to be scale-free networks, and scale-free network phenomena have been observed in many empirical studies (Zhang and Horvath, 2005; Dong and Horvath, 2007; Horvath et al., 2006; Carlson et al., 2006; Gargalovic et al., 2006). In a scale-free network, the distribution of gene connectivity, p(k), follows a power law, $p(k) \sim k^{-\gamma}$ (Ravasz et al., 2002; Barabasi and Albert, 1999). One key feature of a scale-free network is the existence of a few highly connected hub nodes that participate in a very large number of metabolic reactions. With a large number of links, these hubs integrate all substrates into a single, integrated web. Scale-free networks have been shown to be robust against accidental failures but vulnerable to coordinated attacks (Albert et al., 2000). Identifying the hub molecules involved in certain diseases could lead to new drugs that target those hubs (Barabasi and Bonabeau, 2003). Thus knowing the connectivity of genes helps prioritize the candidate genes (Zhao et al., 2006).
12.5.1 Generation of Weighted Gene Co-Expression Network
Weighted network construction is performed in R as described by Zhang et al. (2005). Briefly, the absolute value of the Pearson correlation coefficient is calculated for all pair-wise comparisons of gene expression values across all microarray samples. The Pearson correlation matrix is then transformed into an adjacency matrix A, that is, a matrix of connection strengths, using a power function. Thus the connection strength $a_{ij}$ between gene expression profiles $x_i$ and $x_j$ is defined as $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\beta}$. The power $\beta$ is chosen large enough that the resulting network exhibits approximate scale-free topology. The network connectivity $k_i$ of the ith gene expression profile $x_i$ is the sum of the connection strengths with all other genes in the network: $k_i = \sum_{j=1}^{N} a_{ij}$.
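The two definitions above translate directly into base R. The following is a minimal sketch on simulated data; the expression values, the choice beta = 6, and the decision to zero the diagonal (so that the sum runs over all other genes, as the text describes) are assumptions made for illustration.

```r
set.seed(1)
# Simulated expression matrix: 20 samples (rows) by 6 genes (columns); values are made up.
n_samples <- 20
driver <- rnorm(n_samples)                 # shared signal that makes g1-g3 co-expressed
expr <- cbind(
  g1 = driver + rnorm(n_samples, sd = 0.3),
  g2 = driver + rnorm(n_samples, sd = 0.3),
  g3 = driver + rnorm(n_samples, sd = 0.3),
  g4 = rnorm(n_samples),
  g5 = rnorm(n_samples),
  g6 = rnorm(n_samples)
)

beta <- 6                                  # soft-thresholding power (an illustrative choice)
adjacency <- abs(cor(expr))^beta           # a_ij = |cor(x_i, x_j)|^beta
diag(adjacency) <- 0                       # exclude self-connections from the sums
connectivity <- rowSums(adjacency)         # k_i = sum over j of a_ij
```

Raising the correlations to a power pushes weak correlations toward zero while keeping strong ones, so the three co-expressed genes should end up with much larger connectivity than the unrelated ones. In a real analysis beta would be chosen by checking the scale-free fit rather than fixed in advance.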
12.5.2 Detection of Modules
The next step in network construction is to identify groups of genes with similar patterns of connection strengths with all other genes in the network. The topological overlap matrix (Ravasz, 2002; Yip et al., 2007; Zhang and Horvath, 2005) is used as a measure of gene similarity. This amounts to defining a module as a set of highly co-expressed genes. A pair of genes is said to have high topological overlap if they are both strongly connected to the same group of genes. The use of topological overlap thus serves as a filter to exclude spurious or isolated connections during network construction. After the topological overlap has been calculated for all pairs of genes in the network, this information is used in conjunction with a hierarchical clustering algorithm to identify groups, or modules, of densely interconnected genes. In the resulting dendrogram, discrete branches of the tree correspond to modules of co-expressed genes (Fig. 12.5a). After modules of co-expressed genes have been identified, each module in effect becomes a subnetwork, and a new measure of connectivity, intramodular connectivity, is defined as the sum of a gene’s connection strengths with all other genes in its module.
Figure 12.5. Brain cancer network results. a, Average linkage hierarchical clustering tree colored by modules (first color band) and high/low gene significance (white/black) in the second color band. Note that the brown module is enriched with significant genes. b, Scatterplot between intramodular connectivity (x-axis) and gene significance (y-axis) in the brown module. c, Proportion of significant genes in the test set data as a function of different sizes of gene lists. Green and red bars correspond to network screening and gene significance screening, respectively. [Panel labels: a, Standard TOM Measure, colored by GTOM1 modules and by GeneSignificance; b, brown, cor = 0.58, GeneSignificance versus Connectivity; c, mean (P < 0.05) versus Size.]
12.5.3 Define Measures of Gene Significance and Module Relevance
Based on the clinical outcome y, the gene significance of the ith gene expression profile $x_i$ is defined as the absolute value of the Student t-test statistic for testing differential expression between cases and controls, $GS_i = \text{Gene significance}(x_i) = |t\text{-test}_i|$. Depending on the type of clinical outcome, the gene significance can also be defined as an F-test statistic, a Pearson correlation, -log10 of a Cox regression p value, or another reasonable statistic. An important step in gene network analysis is to study the biological or clinical relevance of network modules. To relate gene modules to a clinical trait y, it is natural to make use of the gene significance measure. Specifically, we define a measure of module relevance by the mean gene significance in the qth module,
$\text{Module relevance}_q = \frac{\sum_{i=1}^{n_q} GS_i}{n_q}$,
where i indexes the genes in the qth module and $n_q$ equals the qth module size. By considering the module relevance measure in our applications, we find that certain modules can be singled out as being enriched with genes that are differentially expressed between cases and controls.
12.5.4 Functional Enrichment Studies of Gene Modules
Identification of biologically plausible modules that are relevant for the clinical outcome y is an important step toward our main goal: finding the important genes within these relevant modules. In real data applications, one would certainly want to study the functional enrichment (gene ontology information) of gene modules and the expression profiles of the module genes in other tissues to further elucidate the meaning of the identified modules. Available software for functional enrichment analysis includes topGO (discussed later), EASE (http://david.niaid.nih.gov/david/ease/ease.jsp), Ingenuity, GeneGo, and others. In practice, important complementary information may help with selection of the biologically most plausible module.
12.5.5 Relating Intramodular Connectivity to Gene Significance
Highly connected hub genes are far more likely than nonhub genes to be essential for survival (Giaever, 2002; Han, 2004; Winzeler, 1999). Therefore, we
hypothesize that hub genes may also be more significant according to the gene significance measure. Empirically, we find that this intuition is correct in relevant modules and usually not true for nonrelevant modules.
12.5.6 Network-Based Screening Strategy
The fact that intramodular connectivity is significantly correlated with gene significance in a relevant module suggests that intramodular connectivity can be used to obtain complementary information for finding prognostic genes. To select genes based on a gene significance measure and connectivity, we propose the following network-based gene screening strategy:
• Input S is the number of genes that should be selected.
• Define a gene significance measure based on the clinical outcome of interest (for example, the absolute value of the t-test statistic for testing differential expression).
• Construct a weighted gene co-expression network.
• Identify modules of highly co-expressed (correlated) genes.
• Identify relevant modules based on the module relevance measure (see equation in section 12.5.3).
• Within the relevant modules, select S genes with high gene significance and high intramodular connectivity.
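The steps above can be sketched end to end in base R. This is only an illustration under several assumptions that are not part of the chapter: the data are simulated; the topological overlap formula is the one given by Zhang and Horvath (2005); the tree is cut into a fixed number of modules with cutree rather than by a branch-detection method; only the single most relevant module is used; and gene significance and intramodular connectivity are combined by a simple rank sum.

```r
set.seed(42)
# Simulated data: 30 samples (15 controls, 15 cases) by 12 genes; all values are made up.
group <- rep(c(0, 1), each = 15)
latent <- rnorm(30) + 1.5 * group            # disease-related signal shared by genes 1-5
noise_factor <- rnorm(30)                    # unrelated signal shared by genes 6-9
expr <- cbind(
  sapply(1:5, function(i) latent + rnorm(30, sd = 0.5)),
  sapply(1:4, function(i) noise_factor + rnorm(30, sd = 0.5)),
  sapply(1:3, function(i) rnorm(30))
)
colnames(expr) <- paste0("gene", 1:12)

S <- 3                                       # step 1: number of genes to select

# Step 2: gene significance = |t statistic| for cases versus controls.
GS <- apply(expr, 2, function(x) abs(unname(t.test(x[group == 1], x[group == 0])$statistic)))
names(GS) <- colnames(expr)

# Step 3: weighted co-expression network.
adj <- abs(cor(expr))^6
diag(adj) <- 0

# Step 4: topological overlap, then average-linkage hierarchical clustering.
k <- rowSums(adj)
shared <- adj %*% adj                        # l_ij = sum over u of a_iu * a_uj
tom <- (shared + adj) / (outer(k, k, pmin) + 1 - adj)
diag(tom) <- 1
tree <- hclust(as.dist(1 - tom), method = "average")
module <- cutree(tree, k = 3)                # fixed number of modules, for simplicity

# Step 5: module relevance = mean gene significance within each module.
relevance <- tapply(GS, module, mean)
best <- as.integer(names(which.max(relevance)))

# Step 6: within the most relevant module, rank by GS and intramodular connectivity.
in_best <- module == best
k_in <- rowSums(adj[in_best, in_best, drop = FALSE])
score <- rank(GS[in_best]) + rank(k_in)      # one simple way to combine the two criteria
selected <- names(sort(score, decreasing = TRUE))[seq_len(min(S, sum(in_best)))]
```

The vector selected holds the candidate genes. In this simulation the disease-related genes form one tightly co-expressed group, so the most relevant module is expected to be built around them; with real data, several relevant modules may need to be examined, and how the two criteria are weighted is a matter of judgment.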
In general, since the number of genes selected for Bayesian network analysis is much smaller than that for WGCNA, the genes for Bayesian network analysis can be treated as a gene module. Once the network is constructed, the connectivity of each gene (or the degree) can also be calculated. Thus the network-based screening strategy applies to Bayesian networks as well.
12.5.7 Brain Tumor Example
The network-based gene screening method was applied to a brain cancer study. Dataset 1 consisted of 55 glioblastomas and was used as the training set, and dataset 2 consisted of 65 independent glioblastoma samples and served as the validation set. Expression of 22,215 probe sets (15,005 unique transcripts) was measured using Affymetrix HG-U133A microarrays. The absolute value of the Pearson correlation between expression profiles of all pairs of genes was determined for the 8,000 most varying nonredundant transcripts. Since module identification is computationally intensive, only the 3,600 most connected genes were considered for module detection. Since module genes tend to have high connectivity, this step does not lead to a big loss of information (www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ASPMgene). The gene significance of a gene is defined as -log10 of its univariate Cox regression p value. Thus the gene significance is proportional to the number
of zeroes in the Cox p value. The network-based gene screening method that incorporates connectivity information is hypothesized to be more likely to identify genes that validate in an independent dataset than traditional screening methods that ignore network connectivity. The validation success rate of a candidate set of prognostic genes is defined as the proportion of those genes for which the Cox regression p value in an independent data set is smaller than 0.05. Network-based screening methods have significantly higher validation success rates than the traditional method (Fig. 12.5c). The code to perform WGCNA has been developed into publicly accessible R packages. For examples and tutorials, visit www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork.
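The validation success rate defined above is a simple proportion. The sketch below uses invented p values for eight hypothetical genes; in a real analysis they would come from Cox regressions fitted separately in the training and validation sets.

```r
# Hypothetical Cox regression p values for eight candidate genes.
p_train <- c(g1 = 1e-5, g2 = 3e-4, g3 = 0.002, g4 = 0.01,
             g5 = 0.03, g6 = 0.04, g7 = 0.2, g8 = 0.6)
p_valid <- c(g1 = 0.001, g2 = 0.20, g3 = 0.03, g4 = 0.04,
             g5 = 0.50, g6 = 0.07, g7 = 0.30, g8 = 0.01)

gene_significance <- -log10(p_train)         # gene significance from the training set

# Proportion of a candidate list whose validation p value is below the cutoff.
validation_success_rate <- function(genes, p_validation, cutoff = 0.05) {
  mean(p_validation[genes] < cutoff)
}

top4 <- names(sort(gene_significance, decreasing = TRUE))[1:4]
rate_top4 <- validation_success_rate(top4, p_valid)
```

Here three of the four top-ranked genes validate, so rate_top4 is 0.75. Repeating the calculation for candidate lists of different sizes gives the kind of curve summarized in Fig. 12.5c.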
12.6 IN SILICO SCREENING OF CANDIDATE GENES
In the previous two sections, we discussed data-based gene prioritization. However, researchers sometimes want to explore a disease trait but do not have the resources, or are otherwise unable, to perform large-scale screening assays. Several in silico methods that mine public databases have been proposed for such purposes. Given a disease phenotype, savvy investigators can prepare an input list of a few thousand genes that show preliminary evidence of association with the disease, based on prior knowledge or experimental data. Further functional analysis can then be applied to the list to prioritize a handful of genes for validation experiments. In this section, we discuss in detail how to start from an interesting phenotype and develop a short list of candidate genes.
12.6.1 Input Gene List Preparation
There are several ways to build the input list. The first and easiest way is to search through the literature. By using Entrez Programming Utilities (eUtils), a set of tools developed by NCBI to facilitate information retrieval from Entrez data, including PubMed, one can automate the search for gene-disease associations in the literature. A detailed description of how to use eUtils can be found at http://eutils.ncbi.nlm.nih.gov. In short, a fixed URL syntax is used to translate a standard set of input parameters into the values necessary for various NCBI software components to retrieve the requested data. It is easy to call eUtils from R/Bioconductor to run batch searches. In the following, we describe how to combine Esearch, one of the eUtils, with the annotate package in Bioconductor to retrieve abstracts from PubMed. The key step is to build an appropriate query URL, which consists of the base URL http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term= followed by the search terms. Some special characters needed for the term syntax are
Text Char     space    [      ]      "      #
Query Char    +        %5B    %5D    %22    %23
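The character substitutions listed above can be applied with a small helper before the query is appended to the base URL. This is a minimal sketch: encode_term is a hypothetical name, only the five characters from the list are handled, and the query string is the one used in Box 12.2.

```r
# Replace the special characters with their query equivalents.
encode_term <- function(term) {
  map <- c(" " = "+", "[" = "%5B", "]" = "%5D", "\"" = "%22", "#" = "%23")
  for (ch in names(map)) {
    term <- gsub(ch, map[[ch]], term, fixed = TRUE)   # literal, not regex, matching
  }
  term
}

base_url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
query <- "\"iron overload\" AND (Case Reports[ptyp])"
query_url <- paste0(base_url, encode_term(query))
```

Building the term programmatically makes it straightforward to loop over many gene names or phenotype keywords in a batch search.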
For example, to obtain abstracts for all case reports that contain the key words iron overload, the regular input in PubMed would be ("iron overload" AND Case Reports[ptyp]). The implementation with R is shown by the code in Box 12.2. The code can be easily modified for tasks such as detecting co-occurrences of genes and disease phenotypes in the literature. Input genes can also be collected from experimental evidence, including association studies, linkage scans, and gene expression. Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo) is a public functional genomics data repository storing thousands of genomic array datasets. The input list could result from analyses of these experiments with loose criteria, such as a calculated false discovery rate <0.5. Investigators should seek resources for as many high-throughput assays as possible. The Online Mendelian Inheritance in Man (OMIM) is another good resource. It is a phenotypic companion to the human genome project and is a continuously updated catalog of human genes and genetic disorders (www.ncbi.nlm.nih.gov/Omim). Genetic map loci linked to various diseases have been summarized there, and the linked loci can be used to generate input genes as well. There are still other ways to prepare the input gene list. For instance, if the disease is liver related, the input genes could simply be all genes expressed in liver. Researchers can be creative in this process but at the same time should be as exhaustive as possible to include genes that show some evidence of a link to the disease trait. Tremendous effort may be required for this step. Examples can be found in Sun et al. (2009) and Ortutay et al. (2009). Once the input gene list is built, further functional analyses can be done to eliminate insignificant genes.
12.6.2 Gene Set Enrichment Analysis
It is well known that genes form functional modules and work cooperatively to conduct biological processes. It is also known that a disease trait is often associated with dysfunction of certain biological processes. Therefore, the association between genes and disease traits can be bridged through the biological processes they share. The hypothesis for gene set enrichment analysis (GSEA) is that if several genes known to be associated with a disease are involved in one biological function, other genes in the same function group are likely to be linked with the formation of that disease as well. GO terms are the most widely used functional annotations for enrichment analysis. The basic principle is that each GO term is annotated with a set of genes, so for a particular GO term u, evidence of enrichment can be assessed by the probability of finding at least as many input genes in term u if the same number of genes were drawn at random from the total gene pool (Fig. 12.6).
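The random-draw probability described above is a hypergeometric tail probability, and it coincides with the p-value of a one-sided Fisher's exact test. The sketch below uses base R only; all counts are hypothetical and chosen purely for illustration.

```r
# Toy counts (hypothetical, for illustration only)
N.pool  = 1000  # genes in the total gene pool
n.term  = 50    # genes annotated to GO term u
n.input = 100   # genes in the input list
k       = 12    # input genes that fall in term u

# P(at least k of n.input randomly drawn genes fall in term u)
p.hyper = phyper(k - 1, n.term, N.pool - n.term, n.input, lower.tail = FALSE)

# The same probability from a one-sided Fisher's exact test on the 2 x 2 table
tab = matrix(c(k, n.input - k,
               n.term - k, N.pool - n.term - (n.input - k)),
             nrow = 2,
             dimnames = list(c("in term u", "not in term u"),
                             c("input", "not input")))
p.fisher = fisher.test(tab, alternative = "greater")$p.value
```

With these counts only 5 input genes would be expected in term u by chance, so 12 observed genes give a small p-value under either calculation.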
BOX 12.2. Sample Code to Retrieve Search Abstracts from PubMed
> library(annotate); library(XML)
> # base URL of ESearch plus the encoded query term
> base.url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
> my.term = "%22iron+overload%22+AND+(Case+Reports%5Bptyp%5D)"
> # first call: read the response to get the total number of hits
> tmp = scan(paste(base.url, my.term, sep=""), what="")
> url.txt = gsub(">", "<", tmp[grep("<eSearchResult>", tmp)])
> totalcount = unlist(strsplit(url.txt, "<"))[5]
> # second call: request all matching PubMed IDs
> url.txt = scan(paste(base.url, my.term, "&retmax=", totalcount, sep=""), what="")
> ids = url.txt[grep("<Id>", url.txt)]
> ids = gsub("<Id>", "", ids); ids = gsub("</Id>", "", ids)
> # fetch the records and extract the abstract text
> x = pubmed(ids); a = xmlRoot(x); numAbst = length(xmlChildren(a))
> arts = vector("list", length = numAbst)
> absts = rep(NA, numAbst)
> for (i in 1:numAbst) {
+   arts[[i]] = buildPubMedAbst(a[[i]])
+   absts[i] = abstText(arts[[i]])
+ }
Let p0 = (number of input genes in term u)/(number of genes in term u) and p1 = (number of input genes not in term u)/(number of genes in the gene pool but not in term u). The test hypotheses are formulated as H0: p0 = p1 against H1: p0 > p1. Rejection of the null hypothesis H0 suggests that the input gene set is significantly enriched in term u. Standard tests such as Fisher's exact test or the Kolmogorov-Smirnov test can be applied. Several software tools can perform GSEA based on GO terms (Al-Shahrour et al., 2004; Beissbarth and Speed, 2004). However, most of them do not consider the hierarchical structure of the GO terms. In this section, we introduce topGO, a Bioconductor package that takes into account the dependency structure of the GO terms.
12.6.2.1 Software: topGO
topGO is a Bioconductor package developed by Alexa et al. (2006) for scoring functional groups by decorrelating the GO graph structure. R code for topGO functions can be found at www.koders.com/noncode/fidBD151204CB40891793D227DE8E474F119A9020A7.aspx. The key step in using the topGO package is to create a topGOdata object, for
Figure 12.6. Enrichment of a significant gene set in GO terms.
which two inputs are needed: a GO term-to-term structure relationship file and a GO term gene annotation file. The GO structure can be obtained from GO.db, a data package from Bioconductor, so users do not need to prepare this input. The gene annotations for GO, however, require some work because different applications have different setups. This input can be prepared from the BioMart database. The code in Box 12.3 describes how to perform GSEA with topGO. Before the code is run, two variables, geneList and gene2GO.goa, need to be constructed. geneList is a factor vector of 0s and 1s with input genes coded as 1. The vector has the same length as the number of genes in the gene pool, and the vector names are the gene names. gene2GO.goa is a list in which each entry is a gene annotated by a set of GO terms. It requires the same gene order as the names of geneList. The genes that are in the significant GO terms but not in the original input list are considered to be candidate genes. This process eliminates genes that are unlikely to be associated with the disease of interest. The limitation of the process is that the analysis depends entirely on the GO annotation file, which is generated from current knowledge. Functional terms that are less well understood will contain fewer genes. Therefore, results based on GO terms are generally biased toward what is already known. Nevertheless, it is a good exploratory analysis, and further experiments are needed to confirm any findings from it.
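The two inputs described above can be assembled with base R alone. The sketch below shows only the expected data shapes; the gene pool, the input genes, and the gene-to-GO pairings are hypothetical toy values, and in practice the annotation table would come from a source such as BioMart.

```r
# Hypothetical toy data: gene pool, input genes, and gene-to-GO pairs
gene.pool  = c("A1BG", "A1CF", "A2M", "AAAS")
gene.input = c("A1CF", "AAAS")
annot = data.frame(
  gene = c("A1BG", "A1CF", "A1CF", "A2M"),
  go   = c("GO:0003674", "GO:0003723", "GO:0005730", "GO:0005576"),
  stringsAsFactors = FALSE
)

# geneList: factor of 0s and 1s, named by gene, with input genes coded as 1
geneList = factor(as.integer(gene.pool %in% gene.input), levels = c(0, 1))
names(geneList) = gene.pool

# gene2GO.goa: one entry per gene, in the same order as names(geneList);
# genes without any annotation get an empty character vector
gene2GO.goa = split(annot$go, factor(annot$gene, levels = gene.pool))
```

Fixing the factor levels to gene.pool in the call to split is what keeps the list in the same gene order as geneList, which the text above requires.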
BOX 12.3. Sample Code for Gene Set Enrichment Analysis by topGO
> library(topGO)
> head(geneList)
44M2.3   A1BG   A1CF A1IGU5 A1L167 A1L4H1
     0      0      1      0      0      0
> head(gene2GO.goa)
$`1`
[1] "GO:0000166" "GO:0003723" "GO:0004527" "GO:0005622" "GO:0005730" "GO:0016787"
$`2`
[1] "GO:0003674" "GO:0005576" "GO:0008150"
> # create a topGO object
> GOdata = new("topGOdata", ontology = "MF", allGenes = geneList, annot = annFUN.gene2GO,
+ gene2GO = gene2GO.goa)
> # classic test without considering the correlation among GO terms
> test.stat = new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")
> resultFis = getSigGroups(GOdata, test.stat)
> # weighted test with consideration of the correlation among GO terms
> test.stat = new("weightCount", testStatistic = GOFisherTest, name = "Fisher test", sigRatio =
+ "ratio")
> resultWeight = getSigGroups(GOdata, test.stat)
> # results summary and plotting the significant GO graphs
> l = list(classic = score(resultFis), weight = score(resultWeight))
> allRes = genTable(GOdata, l, orderBy = "weight", ranksOf = "classic", top = 20)
> showSigOfNodes(GOdata, score(resultWeight), firstTerms = 5, useInfo = "all")
12.6.3 Protein–Protein Interaction Network Analysis
PPIs are important for virtually every aspect of cellular function. Although small diagrams of PPIs are commonly seen, the whole PPI network is hard to visualize, and its complexity makes it difficult to generate and analyze. Network analysis is one way to approach this problem: it captures notions of which genes in a gene network are important, and such genes have a good chance of being related to the disease trait. The hypothesis is that a change of gene function in the hub nodes/genes causes greater instability of the whole network, so working on these genes has a higher chance of yielding experimentally positive results. Several measures can be used to describe a network structure and to search for the important nodes, among them degree, vulnerability, and closeness centrality. It is natural to view a network as a graph, so the fundamental elements of a network graph are its nodes and edges. For a node i, its degree is simply the number of edges incident on the node; it measures the connection of a node to its neighboring nodes. It is reasonable to assume that the higher a node's degree, the more it contributes to network stability. Unlike degree, closeness centrality measures the importance of a node in a more global sense. It is based on the notion of how close a node is to all other nodes in the graph. The closeness centrality is calculated as

$$CC_i = \frac{|V_i| - 1}{\sum_{j \neq i} d_{ij}},$$
where $|V_i|$ is the size of the subnetwork reachable from node i, and $d_{ij}$ is the shortest distance between node i and node j (Kolaczyk, 2008). Vulnerability is another measure of the importance of a node to a network and is calculated from the network efficiency (Gol'dshtein et al., 2004). The network efficiency quantifies the efficiency of information transmission within the network. Assuming that the efficiency between two nodes is inversely proportional to their distance, measured in edges, the global efficiency of a network is calculated as

$$E = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}},$$
where N is the total number of nodes in the network. Then, the vulnerability of node i is

$$V_i = \frac{E - E_i}{E},$$
where $E_i$ is the global efficiency of the network after node i and all edges connected to it have been removed. Vulnerability is therefore the relative loss of network efficiency when node i is missing. Studies have shown that these measures capture some level of importance in a PPI network, as important genes tend to occupy more central positions (Ortutay and Vihinen, 2009). These values can be easily calculated with igraph.
12.6.3.1 Software: igraph
igraph is a free software tool for creating and manipulating undirected and directed graphs (Csardi and Nepusz, 2006; http://igraph.sourceforge.net). It supports R and can be installed as an R package.
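To make the three measures defined above concrete, the following sketch computes them by hand for a four-node toy network. It deliberately uses base R (a small Floyd-Warshall routine) rather than igraph so that each formula is visible; the network, the function names shortest and global.eff, and the node labels are assumptions made for this illustration.

```r
# Toy undirected network (hypothetical): a hub A linked to B, C, and D
nodes = c("A", "B", "C", "D")
adj = matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["A", c("B", "C", "D")] = 1
adj = adj + t(adj)   # make the adjacency matrix symmetric

# All-pairs shortest path lengths (Floyd-Warshall); Inf if unreachable
shortest = function(adj) {
  n = nrow(adj)
  d = ifelse(adj > 0, 1, Inf)
  diag(d) = 0
  for (k in seq_len(n)) for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] = min(d[i, j], d[i, k] + d[k, j])
  }
  d
}

# Global efficiency: sum over i != j of 1/d_ij, divided by N(N - 1)
global.eff = function(adj) {
  n = nrow(adj)
  d = shortest(adj)
  diag(d) = Inf   # drop the i == j terms, since 1/Inf = 0
  sum(1 / d) / (n * (n - 1))
}

d = shortest(adj)
degree = rowSums(adj)

# Closeness: (number of reachable nodes including i, minus 1)
# divided by the sum of distances from i to those nodes
closeness = sapply(nodes, function(i) {
  reach = is.finite(d[i, ])
  (sum(reach) - 1) / sum(d[i, reach])
})

# Vulnerability: relative efficiency loss after removing node i
E = global.eff(adj)
vulnerability = sapply(nodes, function(i) {
  keep = nodes != i
  (E - global.eff(adj[keep, keep, drop = FALSE])) / E
})
```

In this toy network the hub A has the highest degree, closeness, and vulnerability. Removing A disconnects all remaining nodes, so the efficiency of the reduced network is 0 and the vulnerability of A is 1.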
BOX 12.4. Sample Code for Network Analysis
> library(igraph)  # function names below follow igraph 1.0 and later (1-based vertex indices)
> ppis.sel = ppi[ppi[,1] %in% gene.input | ppi[,2] %in% gene.input, ]
> ppis.dm = data.frame(p1 = as.character(ppis.sel[,1]), p2 = as.character(ppis.sel[,2]))
> ppi.graph = graph_from_data_frame(ppis.dm, directed = FALSE)
> geneNames = V(ppi.graph)$name
> # function to calculate global efficiency
> global.eff = function(graph){
+   n = vcount(graph)
+   spi = distances(graph)   # all-pairs shortest paths; Inf if unreachable
+   spi[spi == 0] = Inf      # drop the i == j terms
+   sum(1/spi)/n/(n-1)
+ }
> E = global.eff(ppi.graph)
> # calculating the vulnerability of node i
> # assumption: gene.sel holds the genes of interest that are present in the network
> gene.sel = intersect(gene.input, geneNames); N.gene = length(gene.sel)
> Eis = rep(0, N.gene)
> for (i in 1:N.gene){
+   ppi.gi = delete_vertices(ppi.graph, gene.sel[i])
+   Eis[i] = global.eff(ppi.gi)
+ } # this process may take a few hours if the network is large
> Vis = (E - Eis)/E  # vulnerability
The PPI data can be stored in the following format, with each row corresponding to a PPI:
A1CF     SYNCRIP
A1CF     TNPO2
A26C3    MME
A2BP1    ATN1
…        …
The network characteristics for a list of genes of interest can be calculated using the code in Box 12.4, where ppi is a matrix read from the data file and gene.input contains the gene list. The top-ranked genes are considered candidate genes worthy of further pursuit.
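Once each gene has a score under every network measure, the top-ranked genes have to be pulled out in some way. The sketch below assumes one possible choice: take the k best genes under each measure and merge the lists. The scores, the gene labels, the helper top.k, and the cutoff k = 2 are all hypothetical.

```r
# Hypothetical scores for six genes (illustrative values only)
genes = c("G1", "G2", "G3", "G4", "G5", "G6")
scores = data.frame(
  degree        = c(12, 3, 7, 1, 9, 2),
  closeness     = c(0.61, 0.40, 0.55, 0.30, 0.35, 0.58),
  vulnerability = c(0.20, 0.02, 0.01, 0.00, 0.15, 0.03),
  row.names = genes
)

# Hypothetical helper: names of the k highest-scoring genes for one measure
top.k = function(x, k) names(sort(x, decreasing = TRUE))[seq_len(k)]

k = 2  # assumed cutoff for this toy example

# Top-k list for each measure (one list element per column of scores)
top.lists = lapply(scores, function(col) top.k(setNames(col, rownames(scores)), k))

# Merge the per-measure lists into one candidate list without duplicates
candidates = Reduce(union, top.lists)
```

Taking the union keeps any gene that ranks highly under at least one measure; an intersection would be a stricter alternative.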
12.6.4 PID Example
To illustrate the methods discussed, we take the primary immunodeficiency (PID) study (Ortutay and Vihinen, 2009) as an example. PID occurs when part of the body's immune system is missing or does not function properly. Numerous mechanisms can cause PID, most of which are related to dysfunctional immune genes. To search for candidate genes for PID, a set of 847 genes that are crucial for the immune system was first constructed as the input list, based on exhaustive analyses of the literature and databases (Ortutay and Vihinen, 2008). A PPI network for the 847 genes was built from the HPRD PPI database and used for network analysis. The top genes (e.g., 50) from analyses of degree, vulnerability, and closeness centrality were merged for further GO enrichment analysis. The combination of interaction and enrichment analyses resulted in a list of 39 significant genes, 13 of which were previously known to be PID genes. This suggests that the other 26 could be very promising PID candidate genes.
12.6.5 Other Bioinformatics Tools
Many other web-based tools can be used to search for candidate disease genes as well. Some of them share similar ideas and use similar procedures, with slight differences in how the algorithms are implemented. Here we briefly discuss four of them. Details about how to apply the software can be found on the corresponding websites.
12.6.5.1 GeneSeeker (www.cmbi.ru.nl/GeneSeeker)
GeneSeeker is a server that gathers information from several online databases to filter positional candidate disease genes (van Driel et al., 2003; van Driel et al., 2005). The rationale is that genes causing a disease are most likely expressed in the tissue affected by that disease. In addition, through synteny or protein homology comparison, information from other species such as mice can be borrowed to infer the function of human genes/proteins. GeneSeeker automates the combination of data from cytogenetic locations, phenotypes, and expression patterns. It is particularly well suited for syndromes in which disease genes alter their expression patterns in the affected tissues.
12.6.5.2 G2D (www.ogic.ca/projects/g2d_2)
G2D (gene to disease) provides three strategies (phenotype, known genes, and interactions) to prioritize disease candidate genes (Perez-Iratxeta et al., 2005; Perez-Iratxeta et al., 2007). The inputs of each algorithm include a location box, in which the user provides a genomic region of interest that may be linked with a disease trait, and an additional box containing either a MIM disease number (phenotype), Entrez gene identifiers known to be associated with diseases (known genes), or another locus region (interaction). The phenotype algorithm searches MeSH C and MeSH D terms associated with a disease from PubMed and uses them
to associate with GO terms. The genes annotated to these GO terms are then compared with the genes in the location box by sequence homology. The main assumption of this method is that for a given disease with an undiscovered associated gene X and a phenotypically similar disease with a known associated gene Y, some functions of genes X and Y will be related and relevant to those phenotypes (Perez-Iratxeta et al., 2002). The known gene algorithm compares the known genes in enriched GO terms with the locus of interest. The interaction algorithm performs PPI analysis on genes from both loci. The justification for the interaction algorithm is that mutations in two proteins that participate in the same pathway or directly interact will produce the same or very similar disease phenotypes.
12.6.5.3 SUSPECTS (www.genetics.med.ed.ac.uk/suspects)
The inputs of SUSPECTS are exactly the same as those of the known gene algorithm in G2D: one is the coordinates of the requested genomic region and the other is a list of known genes involved in the disease of interest. Users may also simply enter the name of the disease, and SUSPECTS will search for an appropriate gene list in OMIM, the HGMD, and GAD (Adie et al., 2006). This list is known as the training set. SUSPECTS scores each gene in the requested region on three features: (1) how well its GO annotation compares with the annotation found in the training set (similar to what is done in G2D), (2) how well its InterPro domains are shared with the training set, and (3) how its gene expression profile compares with the profiles from the match set using Spearman's rho rank-order correlation. A weighted average is then calculated to rank genes in order of likelihood of involvement in the disease. Genes near the top of the list are, in theory, better candidates than those farther down.
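The scoring scheme described for SUSPECTS, a rank correlation of expression profiles combined with other features in a weighted average, can be sketched in a few lines of base R. This is not the SUSPECTS implementation: the expression profiles, the GO and domain scores, and in particular the weights are hypothetical values chosen for illustration.

```r
# Hypothetical expression profiles across five tissues (illustrative values)
training.profile = c(2.1, 8.5, 3.3, 0.4, 6.0)
candidate.profiles = rbind(
  geneA = c(1.9, 9.0, 2.8, 0.2, 5.5),
  geneB = c(7.0, 0.5, 4.1, 6.6, 1.2),
  geneC = c(3.0, 3.1, 2.9, 3.2, 3.3)
)

# Expression feature: Spearman's rho between each candidate and the training profile
expr.score = apply(candidate.profiles, 1, function(p)
  cor(p, training.profile, method = "spearman"))

# GO and domain features: assumed to be precomputed similarity scores in [0, 1]
go.score     = c(geneA = 0.8, geneB = 0.1, geneC = 0.4)
domain.score = c(geneA = 0.6, geneB = 0.3, geneC = 0.5)

# Assumed weights; the weights SUSPECTS actually uses are not stated in the text
w = c(go = 0.4, domain = 0.3, expr = 0.3)

# Weighted average of the three features, then rank from best to worst
total = w[["go"]] * go.score + w[["domain"]] * domain.score + w[["expr"]] * expr.score
ranking = names(sort(total, decreasing = TRUE))
```

Because Spearman's rho uses ranks only, geneA scores a perfect 1 on the expression feature even though its absolute expression values differ from the training profile.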
12.6.5.4 PGMapper (www.genediscovery.org/pgmapper)
PGMapper is a software tool for automatically matching phenotypes to genes from a defined genome region or a group of given genes. PGMapper retrieves information on all genes in the prespecified region from the Ensembl database and then searches OMIM and PubMed to find candidate genes relevant to a disease trait in the literature. Users can specify key words that describe the particular disease/phenotype features. PGMapper is currently available for candidate gene searches in humans, mice, rats, zebrafish, and 12 other species (Xiong et al., 2008).
12.7 FUTURE DIRECTIONS
In this chapter, we reviewed several bioinformatics tools for candidate gene screening, and we are certain that many more will be developed in the foreseeable future. Tools that are currently in dire need are those that can integrate data and information from multiple resources and produce consistent findings, such as how to construct a reliable network based on different platforms like SNP
data, gene expression data, and PPI data. Because different platforms rely on different high-throughput technologies or depths of knowledge, the networks constructed from them contain similar but not identical information. They query the genome at different stages and therefore provide a mechanism to double-check the edges and directions of the network (e.g., for Bayesian networks). Conceptually, a network that integrates more information should be more dependable. Groundbreaking work has been done to use SNP array data to guide the directions of network edges (Aten et al., 2008) and to create a method that integrates a seeded prior network to boost the reliability of a gene expression network (Djebbari and Quackenbush, 2008). More research should be directed along these lines.
12.8 QUESTIONS
1. Compare the Bayesian networks constructed by BNArray and NATbox. The sample data can be found at www.bnlearn.com.
2. Visit www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ and replicate the WGCNA provided on the web site.
3. Generate a GO Biological Process term annotation file for probes in an Affy HG-U133A array. (Hint: Biomart database.)
4. Randomly pick 100 probes from the Affy HG-U133A array, and use the annotation file above to perform a GO enrichment analysis by topGO.
5. Randomly choose 500 genes and construct a PPI subnetwork containing only those from the HPRD database. (Hint: igraph package.)
6. Pick a disease you are interested in (e.g., Alzheimer disease or obesity) and use the method used for the PID example to find a candidate gene list.
7. For Alzheimer disease (OMIM #104300), compare the candidate genes within the genomic region of 17q23 found by GeneSeeker, G2D, SUSPECTS, and PGMapper.
12.9 ACKNOWLEDGMENTS
We thank David Galloway in Scientific Editing at St Jude Children's Research Hospital for his professional editing support. This work was supported in part by the American Lebanese Syrian Associated Charities.
12.10 REFERENCES
Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS. (2006). SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics 22:773.
Albert R, Jeong H, Barabasi AL. (2000). Error and attack tolerance of complex networks. Nature 406:378.
Alexa A, Rahnenfuhrer J, Lengauer T. (2006). Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22:1600.
Al-Shahrour F, Díaz-Uriarte R, Dopazo J. (2004). FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20:578–580.
Aten JE, Fuller TF, Lusis AJ, Horvath S. (2008). Using genetic markers to orient the edges in quantitative trait networks: The NEO software. BMC Syst Biol 2:34.
Barabasi AL, Albert R. (1999). Emergence of scaling in random networks. Science 286:509.
Barabasi AL, Bonabeau E. (2003). Scale-free networks. Scientific American 288:60–69.
Beissbarth T, Speed TP. (2004). GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20:1464–65.
Carlson MRJ, Zhang B, Fang Z, Mischel PS, Horvath S, Nelson SF. (2006). Yeast Network Application. Gene Connectivity, Function, Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks. BMC Genomics 7:40.
Chavan SS, Bauer MA, Scutari M, Nagarajan R. (2009). NATbox: A network analysis toolbox in R. BMC Bioinformatics 10:S14.
Chen X, Chen M, Ning KD, et al. (2006). BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network. Bioinformatics 22:2952.
Chickering DM. (1996). Learning Bayesian networks is NP-complete. Springer Verlag.
Csardi G, Nepusz T. (2006). The igraph software package for complex network research. Int. J. Complex Syst. 1695.
Djebbari A, Quackenbush J. (2008). Seeded Bayesian networks: constructing genetic networks from microarray data. BMC Syst Biol 2:57.
Dong J, Horvath S. (2007). Understanding network concepts in modules. BMC Syst Biol 1:24.
Friedman N, Linial M, Nachman I, Pe'er D. (2000). Using Bayesian networks to analyze expression data. J Comput Biol 7:601–20.
Gargalovic PS, Imura M, Zhang B, Gharavi NM, Clark MJ, Pagnon J, Yang WP, He AQ, Truong A, Patel S, Nelson SF, Horvath S, Berliner JA, Kirchgessner TG, Lusis AJ. (2006). Identification of inflammatory gene modules based on variations of human endothelial cell responses to oxidized lipids. Proc Natl Acad Sci USA 103:12741.
Gentleman RR. (2008). Programming for Bioinformatics. CRC Press.
Giaever G, Chu AM, Ni L, Connelly C, Riles L, et al. (2002). Functional profiling of the Saccharomyces cerevisiae genome. Nature 418:387–391.
Gol'dshtein V, Koganov G, Surdutovich G. (2004). Vulnerability and hierarchy of complex networks. Arxiv Prepr Cond–Mater 0409298.
Han JD, Bertin N, Hao T, Goldberg DS, Berriz GF, et al. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430:88.
Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS. (2006). Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target. Proc Natl Acad Sci USA 103:17402–07.
Kolaczyk ED. (2008). Statistical Analysis of Network Data. Springer, New York.
Margaritis D. (2003). Learning Bayesian Network Model Structure from Data. Ph.D. Thesis, Carnegie-Mellon University, Pittsburgh, PA.
Mathivanan S, Periaswamy B, Gandhi TK, Kandasamy K, Suresh S, Mohmood R, Ramachandra YL, Pandey A. (2006). An evaluation of human protein-protein interaction data in the public domain. BMC Bioinformatics 18:S19.
Ortutay C, Vihinen M. (2008). Efficiency of the immunome protein interaction network increases during evolution. Immunome Res 4:4.
Ortutay C, Vihinen M. (2009). Identification of candidate disease genes by integrating Gene Ontologies and protein-interaction networks: case study of primary immunodeficiencies. Nucleic Acids Res 37:622.
Perez-Iratxeta C, Bork P, Andrade MA. (2002). Association of genes to genetically inherited diseases using data mining. Nat Genet 31:316.
Perez-Iratxeta C, Wjst M, Bork P, Andrade MA. (2005). G2D: A Tool for Mining Genes Associated to Disease. BMC Genetics 6:45.
Perez-Iratxeta C, Bork P, Andrade-Navarro MA. (2007). Update of the G2D tool for prioritization of gene candidates to inherited diseases. Nucleic Acids Res 35:W212.
Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL. (2002). Hierarchical organization of modularity in metabolic networks. Science 297:1551.
Smith B, Welty C. (2001). Ontology: towards a new synthesis. In FOIS '01: Proceedings of the international conference on Formal Ontology in Information Systems. October 17–19, 2001. Ogunquit, Maine.
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. (1998). Comprehensive Identification of cell cycle–regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273.
Spirtes P, Glymour C, Scheines R, Kauffman S, Aimale V, Wimberly F. (2000). Constructing bayesian network models of gene expression networks from microarray data. Available at www.phil.cmu.edu/projects/genegroup/papers/spirtes2002a.pdf.
Sun J, Jia P, Fanous AH, Webb BT, van den Oord EJ, Chen X, Bukszar J, Kendler KS, Zhao Z. (2009). A multi-dimensional evidence-based candidate gene prioritization approach for complex diseases-schizophrenia as a case. Bioinformatics 25:2595.
Tsamardinos I, Aliferis CF, Statnikov A. (2003). Algorithms for Large Scale Markov Blanket Discovery. Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference.
Tsamardinos I, Brown LE, Aliferis CF. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning 65:3.
van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG. (2003). A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet 11:57.
van Driel MA, Cuelenaere K, Kemmeren PPCW, Leunissen JAM, Brunner HG, Vriend G. (2005). GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases. Nucleic Acids Res 33:W758–61.
Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285:901.
Xiong Q, Qiu YH, Gu WK. (2008). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24:1011–13.
Yaramakala S, Margaritis D. (2005). Speculative Markov Blanket Discovery for Optimal Feature Selection. Paper presented at the Fifth IEEE International Conference on Data Mining. November 27–30, 2005, Houston, Texas, USA.
Yip AM, Horvath S. (2007). Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 8:22.
Zhang B, Horvath S. (2005). A general framework for weighted gene coexpression network analysis. Stat Appl Genet Mol Biol 4:1.
Zhao W, Mischel P, Carlson M, Zhang B, Nelson SF, Horvath S. (2006). A network-based gene screening approach for improving the validation success of microarray. Paper presented at the 2006 International Conference on Bioinformatics & Computational Biology. June 26–29, 2006, Las Vegas, Nevada, USA.
CHAPTER 13
Using an Integrative Strategy to Identify Mutations
YAN JIAO and WEIKUAN GU
Contents
13.1 Introduction
13.2 Identifying Possible Candidate Genes within the Genome Region of Interest
13.2.1 Selection of Genomic Database
13.2.2 Identification of All Genes and Other Genetic Elements
13.3 Identification of Possible Nucleotide Differences/Mutations within the GRI
13.3.1 Confirmation of Genomic Mutations in cDNA
13.3.2 Limitations and Alternative Approaches
13.4 Identifying Differentially Expressed Candidate Genes within GRI
13.4.1 Analyzing Candidate Gene Expression Levels
13.4.2 Microarray
13.4.3 Gene Expression Profiles Determined by Gene Microarray Analysis and Quantitative RT-PCR
13.5 Functional Prediction for Genes within GRI by Bioinformatics Approaches
13.5.1 Literature and Webpage Searching
13.5.2 Gene Network Analysis
13.5.3 Limitations and Alternative Approaches
13.6 Candidate Selection and Prioritization
13.6.1 The Prioritization of Candidate Genes Should Be Done According to Possible Function and the Nature of the Differences
13.6.2 The Importance of a Gene's Potential Function in the Candidate Gene Selection Process
13.6.3 Final Prioritization of Candidate Genes Should Be Based on Integrative Information from All the Analyses
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
13.6.4 Limitations and Alternative Approaches
13.7 Confirmation of the Function of Selected Genes in GRI
13.7.1 Molecular Biological Approach
13.7.2 Genetic Approach
13.7.3 Limitation and Alternative Approaches
13.8 Questions and Answers
13.9 References
13.1 INTRODUCTION
The initial work of positional cloning began in the late 1980s (Baehner et al., 1986; Royer-Pokora et al., 1986). However, the classic protocol for positional cloning was challenged in 2002, when the initial version of the sequence of the whole mouse genome was completed by the Mouse Genome Sequencing Consortium (2002). Eight years later, a historic transition in the strategy and direction of gene discovery has indeed occurred. Extraordinary progress has been made in gene discovery through positional cloning: (1) a tremendously large number of mutated genes have been identified, (2) the time needed to identify mutations has been greatly shortened, and (3) the mutated genes of many decades-old mouse mutants have been discovered. In this chapter we describe an integrative strategy that has been used successfully to search for mutated genes in animal models of human diseases, mostly using the mouse model. The relative importance of each candidate gene depends on the relevance of the gene either to the tissues of interest or to the phenotype: specifically, whether its sequence differs between the mutant and the wild type or control, and whether the gene is expressed in certain tissues. Therefore, before evaluating candidates, one needs to fully characterize each candidate gene. First, one needs to identify every possible difference in the candidate genes. Next, the expression profile of each gene must be examined. Finally, one needs to elucidate their possible known functions by bioinformatics analysis or literature searching. By doing so, a profile for each candidate gene can be established for mutation confirmation and further functional analyses.
13.2 IDENTIFYING POSSIBLE CANDIDATE GENES WITHIN THE GENOME REGION OF INTEREST
For most chronic animal disease models, it appears that the causative mutation is not among well-known candidate genes. Therefore, it is important not to dismiss any possibility, and, accordingly, it is necessary to examine all possible genes and other genetic components such as regulatory elements. Understanding
the genetic elements within the genome region of interest (GRI) ensures that no candidate gene will be missed. The first imperative step is to identify every possible genetic factor within this region. Although the entire genomes of humans and many animals have officially been sequenced, gaps and errors in the sequence assembly still remain. It is essential to obtain not only accurate genome information but also every possible candidate gene within the genome sequences. This seemingly simple step is crucial in searching for the candidate gene. With current bioinformatics, we can select candidate genes by hierarchically examining every nucleotide in the GRI. First, all of the coding sequences of the chromosomal region of interest are identified. Second, introns and the 5′ and 3′ sequences are determined. Third, nucleotide organization, gene ordering, and chromosomal structure are analyzed. Importantly, gene regulatory elements in non-gene regions of the GRI should also be carefully identified.
13.2.1 Selection of Genomic Database
Genome data are essential for determining and verifying the genome sequences of the GRI interval. Currently, several genomic databases provide genomic sequences. The Ensembl Genome Browser (www.ensembl.org/index.html) is one example and is commonly used as the basic database for gene identification. Previously, Ensembl was used to determine the location of targeted genome regions in the identification of mutated genes for several mouse disease models, during which one assembly problem was identified (Jiao et al., 2005a). Currently, Ensembl provides the most accurate assembly of genome sequences. Other databases include the University of California Santa Cruz genome browser (http://genome.ucsc.edu) and the NCBI genome database (www.ncbi.nlm.nih.gov/sites/entrez?db=genome). Both of these databases are used for comparing and confirming the results obtained from Ensembl.
13.2.2 Identification of All Genes and Other Genetic Elements
Although the Ensembl database has complete sequence coverage of entire genomes, genes and transcripts annotated by other sources/programs (e.g., EMBL mRNAs, Unigene, Genscan) are presented in the genome as well. The combination of Ensembl and these other databases provides complete genome information, which is critical for identifying candidate genes.
13.3 IDENTIFICATION OF POSSIBLE NUCLEOTIDE DIFFERENCES/MUTATIONS WITHIN THE GRI
Two major types of nucleotide change cause variation in phenotype: mutations and polymorphisms. Mutations usually refer to deletion, insertion, replacement, and duplication of nucleotides that lead to a change of function of a genetic element, which in turn results in visible or detectable changes in phenotype. Polymorphism usually refers to a single nucleotide polymorphism (SNP) or variation. Polymorphic nucleotides, individually or in groups, generally cause relatively small but continuous changes in phenotype; the underlying loci are usually called quantitative trait loci (QTL). Phenotype variation due to QTL usually is not considered disease. SNPs exist and segregate in the general population. Identifying an SNP for a QTL can be achieved using available genome databases. Initially, some genome regions can be eliminated through haplotype analysis and polymorphic comparison. Such an analysis can be done with a method partially based on a suggestion by Wade and Daly (2005): use comparative haplotype analysis to limit GRI intervals and to identify likely candidate genes in these regions. A haplotype is a group of markers retained as a block. Haplotypes are typically used to characterize regions where linkage disequilibrium occurs; here, the purpose is to identify nonpolymorphic marker blocks between the GRI of the subject and the control. Identifying these blocks is useful because they can be excluded from further consideration as GRI intervals. Having removed all nonpolymorphic blocks, one can focus gene identification on the regions that are polymorphic between the GRI and the control. Dense marker coverage in GRI regions (averaging one SNP per 400–500 bp) is available by using SNPs between the GRI and a control, archived at Mouse Genome Informatics (www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF) and GeneNetwork (www.genenetwork.org/cgi-bin/beta/snpBrowser.py), to determine haplotypes.
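As a rough illustration of excluding nonpolymorphic blocks, the sketch below groups position-sorted SNPs into runs and keeps only the runs in which the GRI strain and the control carry different alleles. It is a simplification under stated assumptions: a real haplotype analysis would compare several strains and use a proper block definition, and the tuple layout and example alleles here are hypothetical.

```python
from itertools import groupby


def polymorphic_blocks(snps):
    """Group SNPs into runs and return the runs where the strains differ.

    `snps` is a list of (position, gri_allele, control_allele) tuples.
    Each returned block is (first_position, last_position, n_markers).
    """
    ordered = sorted(snps)  # sort by position
    blocks = []
    for differs, run in groupby(ordered, key=lambda s: s[1] != s[2]):
        run = list(run)
        if differs:  # keep polymorphic runs, drop identical ones
            blocks.append((run[0][0], run[-1][0], len(run)))
    return blocks


# Hypothetical SNP calls, for illustration only.
snps = [
    (100, "A", "A"),
    (500, "G", "G"),
    (900, "C", "T"),
    (1300, "A", "G"),
    (1700, "T", "T"),
    (2100, "C", "A"),
]
blocks = polymorphic_blocks(snps)
```

Only the intervals returned in `blocks` would be carried forward for gene identification; everything between them is shared by the two strains at the marker density available.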
Next, identifying differences can be done, to some extent, with available SNP databases such as the Roche SNP database (http://mousesnp.roche.com), the MGI mouse SNP database (www.informatics.jax.org/javawi2/servlet/WIFetch?page=snpQF), or the NCBI SNP database (www.ncbi.nlm.nih.gov/SNP). In contrast to SNPs, mutated nucleotides usually cause disease and therefore do not exist in the normal population. Accordingly, mutations need to be discovered by sequencing or mutation screening. While sequence confirmation is the final call for a mutation, many mutation detection systems have been used for rapidly screening a large number of samples. Current major mutation analysis systems include chemical cleavage of mismatch (CCM) (Tabone et al., 2006) and denaturing high-pressure liquid chromatography (DHPLC) (Hall et al., 2001). CCM is one of the few methods capable of detecting nearly all single base mismatches. Mutation detection by CCM is based on the chemical modification and cleavage at the site of mismatched C or T in heteroduplexes by using hydroxylamine or osmium tetroxide (OsO4) as chemical probes. DHPLC compares two or more DNA fragments as a mixture of denatured and reannealed PCR amplicons, thereby revealing the presence of a mutation by the differential retention of homo- and heteroduplex DNA on reversed-phase chromatography supports under partial denaturation. Differences among DNA fragments can be detected successfully by UV or fluorescence monitoring. One of the major commercial DHPLC systems used for discovering mutations in animal models is the SpectruMedix system (Jiao et al., 2005a). The SpectruMedix system includes high-throughput capillary electrophoresis instruments, specialized separation polymers, and a suite of automated software applications.
Figure 13.1. Characterization of candidate genes according to nucleotide difference. [Flowchart: nucleotide differences unique to the GRI are sorted by location (coding sequence, 3′ and 5′ ends, intron) and by effect (amino acid change, known motif or regulatory domain, splicing element), ranked from lowest to highest importance, and selected for prioritization.]
Mutation screening should start with examining coding regions in all the candidate genes for differences. As illustrated in Figure 13.1, initial screening identifies all the differences in nucleotides between the GRI and control. The next step is to identify which of those differences are unique to the GRI by sequencing or by searching known sequences. Theoretically, a mutation should not exist in other wild type strains or populations. If one or more different nucleotides are unique to the GRI, then each difference’s location in the targeted genome must be examined. If the difference is located in an exon, it is necessary to determine whether it leads to a change in amino acids and/or whether the change leads to the alteration of a functional domain, such as from hydrophobic to hydrophilic. The following step examines other regions to identify new differences. In addition to identifying known differences, screening with mutation detection systems identifies any other differences. In practice, if the change is in the regulatory region, the next question is whether the change is in a known motif or another regulatory element. If it is in an intron or nongene region, the potential impact of the mutation or polymorphism on gene regulation needs to be investigated. For each such investigation,
extensive searching and comparison should be conducted using sequencing and other databases such as Ensembl.
13.3.1 Confirmation of Genomic Mutations in cDNA
Once a difference in a coding region is found, several approaches should be taken to confirm the mutation. First, the obvious question is whether it is located within the genetic region of the GRI locus. Second, is this mutation the only defect detected among the candidate genes and ESTs within the GRI locus? The answer to this question ensures that there are no other differences between the GRIs of the subject and normal controls, so that the possibility of another mutation is ruled out. Third, do the cDNA sequence results agree with the genomic DNA data? Last, is the mutation unique in the subject GRI compared with other available populations or inbred strains (if the mutation is from the mouse disease model)?
13.3.2 Limitations and Alternative Approaches
The next step identifies all possible coding sequences (open reading frames), promoters, intron/exon junctions, sequences that match known genes, and other biologically significant sequences such as repetitive elements in the targeted region. Investigators can usually treat the identified fragments as if they were genes or parts of genes. However, fragments that the software does not pick up should be considered as undetermined sequences rather than noncoding sequences. In general, the only excluded sequences in this step should be well-known repetitive sequences. Errors may exist among the currently assembled genome sequences. During positional cloning for the spontaneous fracture (sfx) mutation (Jiao et al., 2005a), inconsistencies in the number of exons between the NCBI and Ensembl databases were discovered. Searching through different databases may not be sufficient to obtain a complete list of candidates. Further steps should be taken, such as obtaining information from different sources, including the bacterial artificial chromosome and yeast artificial chromosome contigs deposited in GenBank. Thus every possible measure must be taken to search every possible resource so that the specific errors in the genome sequence assembly can be identified. Alternatively, search regions can be increased (e.g., extending 0.5 Mb on each side of the GRI) to ensure all possible genes are considered.
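The screening logic of this section (Figure 13.1) can be sketched as a small classifier. This is a hedged illustration only: the codon table is the standard genetic code, but the four amino acid groups are one common textbook grouping, and the numeric ranks, the function names, and the `in_known_element` flag are assumptions introduced here rather than part of the published workflow.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# One common grouping of amino acids by side-chain type (an assumption).
AA_GROUPS = {"nonpolar": "AVLIMFWPG", "polar": "STCYNQ", "acidic": "DE", "basic": "KRH"}
AA_TYPE = {aa: group for group, members in AA_GROUPS.items() for aa in members}


def codon_effect(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if "*" in (ref_aa, alt_aa):
        return "stop_change"  # gain or loss of a stop codon
    if AA_TYPE[ref_aa] == AA_TYPE[alt_aa]:
        return "same_type"
    return "different_type"


def rank_difference(unique_to_gri, location, ref_codon=None, alt_codon=None,
                    in_known_element=False):
    """Return a hypothetical rank from 0 (drop) to 4 (highest importance)."""
    if not unique_to_gri:
        return 0  # also present in controls: stop
    if location == "coding":
        effect = codon_effect(ref_codon, alt_codon)
        return {"synonymous": 1, "same_type": 3}.get(effect, 4)
    if location in ("utr", "intron"):
        # Known motif, regulatory domain, or splicing element raises the rank.
        return 2 if in_known_element else 1
    raise ValueError(f"unknown location: {location}")
```

For example, `rank_difference(True, "coding", "GAG", "GTG")` evaluates a glutamate-to-valine change, which crosses amino acid groups and therefore receives the top rank in this scheme.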
13.4 IDENTIFYING DIFFERENTIALLY EXPRESSED CANDIDATE GENES WITHIN GRI
Identifying differentially expressed genes in the GRI is an important step for identifying mutated genes and for analyzing molecular pathways of mutated genes. If the expression level of a gene in the GRI is altered, the gene is either the mutated one or affected by the mutated gene. It is important to keep in mind that mutations in a regulatory region are expected to lead to changes in gene expression. However, mutations in a coding region of a gene may or may not lead to changes in gene expression.
Figure 13.2. Characterization of candidate genes according to gene expression. [Flowchart: candidate genes are screened by expression in the tissue of interest, expression level, differential expression (DE) larger than twofold, and confirmation by quantitative PCR, ranked from lowest to highest importance, and selected for prioritization.]
13.4.1 Analyzing Candidate Gene Expression Levels
An illustration of the experimental steps is shown in Figure 13.2. Briefly, the first step is to examine whether a gene is expressed in the tissues of interest or other relevant tissues. If it is not expressed, that gene is most likely not relevant to the GRI phenotype. The next step is to examine the expression levels and determine whether there is a difference between the GRI and the control. If there is a difference, then the significance of the difference should be investigated. The expression level and the differential expression of a gene should be confirmed by real-time PCR. This selection process involves many experimental steps and is described below. To analyze gene expression level,
whole genome expression and exon arrays offer rapid high-throughput solutions. Currently, several commercial systems are available, including Illumina, Agilent, and Affymetrix systems. Here we use the Affymetrix system as an example for analyzing gene expression levels. Currently, the Affymetrix array contains almost every gene found in the mouse genome. The advantage of using Affymetrix exon arrays is that they allow investigators to characterize gene network(s) by the expression levels of all the possible genes, as well as by their exons. This, in turn, should allow investigators to better assess the potential role each candidate gene has within the GRI or phenotype. In case any genetic element is not covered by an Affymetrix array, real-time PCR (RT-PCR) or similar experimental technologies must be used to confirm the data. High-quality RNA is essential for this step. Many investigators have been very successful in extracting high-quality RNA by using a modified procedure from Life Technologies that employs TRIzol as a reagent for microarray analysis (Gu et al., 2002).
13.4.2 Microarray
For each tissue block of interest, it is important to use equally mixed RNA from multiple samples with at least three replicates and the same number of controls. Total RNA for each group should be used for cDNA synthesis with SuperScript Choice System for cDNA Synthesis. Generated cRNA should be hybridized to an Affymetrix GeneChip 430 2.0 array, representing ∼36,000 mouse transcripts, at 45°C for 16 h. Chips should be washed and stained in a fluidic station. MAS 5.0 should be used to control image scanning by Agilent GeneArray Scanner and data generation.
13.4.3 Gene Expression Profiles Determined by Gene Microarray Analysis and Quantitative RT-PCR
Raw image data from microarray fluorescence scanning should be subjected to quality control analysis by using dChip software. Hybridization signals should be analyzed using software released by Affymetrix (currently, GCOS). Depending on the data distribution of individual transcripts, parametric (ANOVA) or nonparametric (e.g., Kruskal-Wallis) multigroup comparison analyses for replicate samples should be run to determine which transcripts are differentially expressed in both the tissues of interest and lung tissues from the different mouse groups. Analyses can be done using customized SAS analysis tools (SAS Institute, 2001), dChip, GeneSpring (Silicon Genetics, 2002), and R-Affy analysis packages. Additional analysis of array data can be performed with the current or upgraded 8 + 1 node Linux cluster.
Figure 13.3. Characterization of candidate genes according to known function. [Flowchart: candidate genes are ranked from lowest to highest importance by whether they have a known relevant function, lie in the pathway of a relevant gene, or have been well investigated, and are then selected for prioritization.]
The next step is to use quantitative RT-PCR to analyze and confirm the expression level of the various candidate genes. Not all genes are included in an Affymetrix array. The quantitative (q)RT-PCR should analyze the genes not included in Affymetrix arrays, as well as confirm the other genes’ data from Affymetrix arrays. Depending on the number of candidate genes, one should vary the selection of samples for each assay. If only a few candidates are identified, one should use a more conventional 96-well format and larger volumes. In either case, the data should detect genes that show differences between the control and GRI. In addition, investigators should use some genes randomly selected from a microarray experiment so that the genes can be independently evaluated by qRT-PCR.
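The twofold criterion of Figure 13.2 and the qRT-PCR confirmation step can be sketched as follows. This is a minimal sketch under stated assumptions: inputs are linear-scale expression values (array signals, and qRT-PCR results already converted to relative quantities), the comparison uses simple replicate means rather than the ANOVA or Kruskal-Wallis tests named above, and the function and parameter names are hypothetical.

```python
from math import log2
from statistics import mean


def log2_fold_change(gri_values, control_values):
    """Signed log2 ratio of mean expression, GRI over control."""
    return log2(mean(gri_values) / mean(control_values))


def confirmed_de(array_gri, array_ctrl, qpcr_gri, qpcr_ctrl, min_fold=2.0):
    """True if array and qRT-PCR both show at least `min_fold` change
    and agree on the direction of the change."""
    array_lfc = log2_fold_change(array_gri, array_ctrl)
    qpcr_lfc = log2_fold_change(qpcr_gri, qpcr_ctrl)
    threshold = log2(min_fold)
    same_direction = array_lfc * qpcr_lfc > 0
    return (abs(array_lfc) >= threshold
            and abs(qpcr_lfc) >= threshold
            and same_direction)


# Hypothetical triplicate measurements, for illustration only.
array_gri, array_ctrl = [400, 420, 380], [100, 110, 90]
qpcr_gri, qpcr_ctrl = [3.1, 2.9, 3.0], [1.0, 1.1, 0.9]
is_confirmed = confirmed_de(array_gri, array_ctrl, qpcr_gri, qpcr_ctrl)
```

Working on the log2 scale treats a twofold increase and a twofold decrease symmetrically, which is why the threshold is applied to the absolute value.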
13.5 FUNCTIONAL PREDICTION FOR GENES WITHIN GRI BY BIOINFORMATICS APPROACHES
Functional analysis of candidate genes should be conducted via multiple approaches, including gene annotation, sequence comparison or domain recognition, known reported functions, and gene network construction.
13.5.1 Literature and Webpage Searching
One obvious approach is literature searching and comparison (Fig. 13.3). The literature search should be conducted using web-based search programs such as PGMapper (www.genediscovery.org/pgmapper) (Xiong et al., 2008a). PGMapper is a software tool that automatically matches phenotypes to their causative genes/genotype. PGMapper provides detailed information concerning the candidate genes and all related references from the OMIM and PubMed databases to support their candidacy. PGMapper can search publications and examine each candidate gene’s respective encoded protein structure to look for a possible connection between gene function and phenotype of interest. If the candidate gene is novel, GenBank should be searched for possible similarities between the selected candidate genes and other known genes. The available protein structure models should be used to predict the probable function for each candidate gene. Detailed procedures for functional analysis when using PGMapper should follow the outline detailed in publications (Xiong et al., 2008b, 2009). Briefly, first, names of all the candidate genes and key terms should be put into the space provided on the PGMapper webpage. Second, the program should be instructed to search through OMIM and PubMed for any publications containing the name of the candidate gene and/or the key words. Third, all identified publications will be retrieved from the databases into a list. Fourth, the abstracts or the full content of each retrieved publication must be reviewed individually to determine the relevance of the identified gene to tissues of interest or the fibrotic phenotype in mouse GRI. Finally, a table consisting of a list of candidate genes and publication references relevant to the phenotype or trait of interest should be constructed.
13.5.2 Gene Network Analysis
The second approach should be to examine the role of each candidate gene in the gene network or pathway network (Rhodes et al., 2002; Selaru et al., 2002; Pereira et al., 2004). Not all genes relevant to the GRI disease have been studied or reported.
However, large numbers of gene networks based on whole genome microarray gene chips currently provide connections among genes. If a gene has not been studied for roles in the tissues or trait of interest but is connected, in a gene network or pathway, to a gene of high relevance to a known function of interest, its importance in the candidate list should be much greater than that of a gene with no known function or relevance to the function of interest. One of the resources used to construct the gene network is the microarray data. User-friendly software has been used with great success (e.g., data clustering performed with Cluster and TreeView (Rhodes et al., 2002) (Eisen Laboratory, http://rana.lbl.gov/EisenSoftware.htm); biological profiling of altered gene expression patterns with Ingenuity Pathways Analysis; and construction and visualization of gene relation networks using the web server of the University of Tennessee (http://132.192.64.224/geneinfoviz/search.php)). Another important resource is the database Genenetwork (www.genenetwork.org). Currently, this database contains gene expression data from mouse tissues. Genenetwork is extremely important for studying gene pathways, particularly in the mouse models, because it contains gene expression profiles from a variety of tissues of more than 60 recombinant inbred mouse lines derived from two popular mouse strains, C57BL/6 and DBA/2. By inputting the name of a gene, this database provides a network that contains the inputted gene and its correlation with other genes.
13.5.3 Limitations and Alternative Approaches
Investigators should realize that the one thing not under their control is the genome database. For example, the mouse genome database in Ensembl is improved/modified from time to time and is currently at its 50th version. Nevertheless, there is no guarantee of 100% accuracy for every sequence and the assembly of the sequences. Investigators should keep searching for updated information from the genome database and make corrections whenever necessary. For a similar reason, literature searches on the reported function of genes are limited by the available literature and research data.
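The network argument of Section 13.5.2, that an unstudied gene gains importance when it is connected to a gene of known relevance, can be illustrated with a simple adjacency lookup. This is only a sketch: the edge list is assumed to have been exported from a network resource beforehand, and the gene names and function name are hypothetical.

```python
def connected_to_relevant(edges, candidates, relevant_genes):
    """Map each candidate gene to the known relevant genes it is
    directly linked to in an undirected gene network.

    `edges` is an iterable of (gene_a, gene_b) pairs.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    relevant = set(relevant_genes)
    return {
        gene: sorted(neighbors.get(gene, set()) & relevant)
        for gene in candidates
    }


# Hypothetical network edges, for illustration only.
edges = [("Cand1", "KnownA"), ("Cand2", "Other"), ("KnownB", "Cand1")]
links = connected_to_relevant(edges, ["Cand1", "Cand2", "Cand3"], {"KnownA", "KnownB"})
```

Candidates with a non-empty list would be moved up the candidate list even when nothing has been published about their own function.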
13.6 CANDIDATE SELECTION AND PRIORITIZATION
Candidate selection and prioritization will determine which genes have the greatest potential to be responsible for the dermal fibrosis GRI. Through gene searching and profile analysis, many differences and functionally relevant genes for the GRI should be identified. However, it is unlikely that all of them regulate the variations in tissues of interest. For example, it is possible to find differences in every gene. Clearly, not every gene should be a candidate. It is reasonable to assume that, if a difference is related to the GRI phenotype, the difference should be specific to the trait of interest. Thus any nucleotide differences should co-segregate with the phenotype of interest. In contrast, if a difference is randomly segregated among different strains, it should not be considered a candidate. In a similar example, not all the genes expressed in the tissues of interest are responsible for the trait of interest and not all phenotypically relevant genes function in the regulation of a particular disease. Therefore, one needs to eliminate obvious noncandidate genes. If there are still too many candidate genes to handle, it is necessary to prioritize the remaining genes according to current knowledge. Prioritization of candidate genes should be done according to their sequence differences, expression profiles, and known and/or potential biological function. Investigators should select the most favorable candidate gene(s) according to an overall evaluation. As shown in Figure 13.4, among differences, expression levels, and potential functions, a confirmed useful difference should be given much more weight than either expression or known functional relevance.
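The co-segregation requirement described above can be checked mechanically once genotypes are available for several strains. The sketch below assumes one allele call per strain at a single position and a known set of affected strains; the strain names, alleles, and function name are hypothetical.

```python
def cosegregates(genotypes, affected):
    """True if a single allele is carried by every affected strain
    and by none of the unaffected strains.

    `genotypes` maps strain name to allele; `affected` is the set of
    strain names showing the phenotype of interest.
    """
    affected_alleles = {a for s, a in genotypes.items() if s in affected}
    unaffected_alleles = {a for s, a in genotypes.items() if s not in affected}
    return len(affected_alleles) == 1 and affected_alleles.isdisjoint(unaffected_alleles)


# Hypothetical genotype calls, for illustration only.
specific = cosegregates({"mutant": "T", "B6": "C", "DBA": "C", "C3H": "C"}, {"mutant"})
random_like = cosegregates({"mutant": "T", "B6": "T", "DBA": "C"}, {"mutant"})
```

A difference such as the second one, shared with an unaffected strain, would be treated as randomly segregating and dropped from the candidate list.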
Figure 13.4. Prioritization and candidate gene selection. [Flowchart: candidate genes are ranked from lowest to highest priority by relevant function, unique polymorphism, and differential expression to arrive at the favorite candidate(s).]
13.6.1 The Prioritization of Candidate Genes Should Be Done According to Possible Function and the Nature of the Differences
In regard to the position of a difference, in general, the relative importance is (1) coding regions, (2) 3′ and 5′ end sequences, and (3) intron sequences, ranked from the highest to lowest importance. However, in many cases, the nature of the difference is more important than the position. For example, within coding sequences, a difference that alters the amino acid is much more important than a difference that changes only the nucleotide. Among the differences that alter amino acids, the changes in the conserved sequences are more important than those in nonconserved sequences. Changes that convert an amino acid into another type (e.g., a hydrocarbon R-group becomes an acidic or basic R-group) are more important than changes that yield a similar type of amino acid (e.g., one hydrocarbon R-group replaces another) (Jiao et al., 2007). Whenever examining a difference in coding sequences, it is essential to first check the amino acid and codon table to see whether the difference leads to changes in amino acids. Furthermore, the difference should be examined to determine whether the change encodes the same type or a different type of amino acid. Although differences in noncoding regions are generally regarded as a secondary priority to differences in the coding region, some differences may be of more importance than those in coding regions, especially when the differences are relevant to gene expression and splicing. One should try to identify commonly known potential functional motifs, sequences that may affect differential splicing, and common promoter sequences. Software for automatically identifying promoter sequences is available for such a search. For example, Promoter 2.0 Prediction Server (www.cbs.dtu.dk/services/promoter) is a tool for predicting potential promoter regions from a given sequence. Another web-based searching tool is Transcription Regulatory Element Search (TRES). TRES can simultaneously search up to 20 promoter sequences for known transcription factor binding sites, cis-acting elements, palindromic motifs, and/or conserved k-tuples (phylogenetic footprints). It is useful for comparative promoter sequence analysis to elucidate common themes (modules) in functionally or phylogenetically related promoters. Accordingly, one should select sequences that contain identified, potential regulatory elements first. Next, one should examine the rest of the nonrepetitive sequences. The repetitive sequences should be examined last. After examining noncoding regions, the next task is to identify differences that are segregated between the GRI and the genome region of the control or wild type and that are associated with tissues of interest. Nucleotide differences or SNPs within those haplotypes, strains, or substrains should be confirmed and analyzed. One should amplify, from each strain/substrain/haplotype, every DNA fragment in the GRI interval that is polymorphic between the control and the GRI.
13.6.2 The Importance of a Gene’s Potential Function in the Candidate Gene Selection Process
The first priority should be genes relevant to the susceptibility of the disease of interest. The second priority is novel genes that have no known biological function. The next is genes with known function in other pathways. The lowest priority genes are those that have been extensively studied (indicated by a large number of literature reports in searching) but have no connection to the susceptibility of the disease of interest in either functional or molecular pathways.
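The four function-based priority levels of Section 13.6.2 map directly onto a small ordering function. This is a minimal sketch; the three boolean flags are an assumed encoding of the literature-search outcome, and the gene names are hypothetical.

```python
def function_priority(relevant_to_disease, has_known_function, extensively_studied):
    """Return 1 (highest) to 4 (lowest) priority based on known function."""
    if relevant_to_disease:
        return 1  # relevant to susceptibility of the disease of interest
    if not has_known_function:
        return 2  # novel gene with no known biological function
    if not extensively_studied:
        return 3  # known function in other pathways
    return 4      # extensively studied, no connection to the disease


# Hypothetical genes: (relevant, known function, extensively studied).
genes = {
    "GeneW": (False, True, True),
    "GeneX": (True, True, True),
    "GeneY": (False, False, False),
    "GeneZ": (False, True, False),
}
ordered = sorted(genes, key=lambda g: function_priority(*genes[g]))
```

Note that a novel gene ranks above a well-characterized but unrelated one: the absence of evidence keeps it in play, whereas extensive study without any link to the disease counts against a gene.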
13.6.3 Final Prioritization of Candidate Genes Should Be Based on Integrative Information from All the Analyses
The first priority is the genes that possess a unique difference affecting the amino acid, are highly expressed in the tissues of interest, and have a demonstrated connection to the disease of interest. The next level of priority is the genes that carry differences translating into amino acid changes and are expressed in the tissues of interest. At the level of sequence difference, emphasis should be put on the amino acid change and the possible impact it may have. If a difference is in a noncoding region, it may be in an intron, the 5′ or 3′ end, or in sequences between genes. The importance of such a difference is difficult to evaluate. However, differences in noncoding regions may affect regulation of gene expression. Much importance should be given to combining literature, gene function, and gene expression profiling.
If a gene shows no change in its expression level between the control and the GRI, and no nucleotide difference anywhere in the gene co-segregates with the GRI in individuals with the disease phenotype or trait of interest, the gene should be eliminated from the candidate list. At the nucleotide level, some differences in a gene may be retained for further study, while others may not. Polymorphisms can occur in the regulatory regions, 5′ or 3′ end, coding region, or introns. If the change is in the regulatory 5′ or 3′ position, it is important to know whether the difference causes any change in gene expression, posttranscriptional modification, translation, or function. If a gene is not differentially expressed between the control and GRI and if the difference in a noncoding region of this gene has no obvious function, the difference may be considered unimportant. However, if the difference is in a coding region, it is necessary to examine whether the difference causes any possible changes in protein translation and whether the change in protein sequence causes any change in the function or activity of the protein. If a difference potentially results in altered splicing, the different splicing should be reflected in the data from the exon arrays. It is important to closely examine or repeat the experiment to confirm the data from the exon arrays and to determine the importance of potential differential splicing (Jiao et al., 2007).
13.6.4 Limitations and Alternative Approaches
It is not possible to predict the extent to which the numbers of candidate genes can be reduced. For some genes, especially novel genes within the GRI, one may have no information as to their function. It will be difficult to eliminate them according to their function. In this case, difference and expression screening should play key roles in eliminating such genes. One should have concerns about polymorphic comparisons between multiple strains. It is not necessarily the case that the same GRI or the same gene regulates the disease within a strain and among available strains. Also, the phenotype of interest may be affected by a modifier. With all of these uncertain factors, one should come to a conclusion concerning nucleotide comparison with great caution. Prioritization must be based on a combination of nucleotide-specific differences, as well as functional and bioinformatics information.
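The integrative prioritization of Section 13.6 can be sketched as an elimination step followed by a weighted score. Everything numeric here is an assumption made for illustration: the 0 to 4 scale for the sequence difference, the 1 to 4 scale for functional priority, and the weights, which are chosen only so that a confirmed difference outweighs expression and function, as the text recommends.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    difference_rank: int            # 0 = no co-segregating difference, 4 = strongest
    differentially_expressed: bool  # confirmed expression change, GRI vs. control
    function_priority: int          # 1 = most relevant function, 4 = least


# Hypothetical weights: sequence difference counts most.
W_DIFFERENCE, W_EXPRESSION, W_FUNCTION = 3.0, 1.0, 1.0


def score(c):
    return (W_DIFFERENCE * c.difference_rank
            + W_EXPRESSION * c.differentially_expressed
            + W_FUNCTION * (4 - c.function_priority))


def prioritize(candidates):
    """Drop genes with neither a co-segregating difference nor an
    expression change, then sort the rest from highest to lowest score."""
    kept = [c for c in candidates
            if c.difference_rank > 0 or c.differentially_expressed]
    return sorted(kept, key=score, reverse=True)


# Hypothetical candidates, for illustration only.
ranked = prioritize([
    Candidate("GeneB", 0, True, 1),
    Candidate("GeneC", 0, False, 1),
    Candidate("GeneA", 4, True, 1),
    Candidate("GeneD", 3, False, 4),
])
ranked_names = [c.name for c in ranked]
```

In this example a gene with a strong sequence difference but no functional information still outranks a functionally attractive gene with no difference, which mirrors the caution above about eliminating novel genes on function alone.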
13.7 CONFIRMATION OF THE FUNCTION OF SELECTED GENES IN GRI
In general, one should expect that one or more mutations lead to the phenotype of interest. The confirmation of the candidate gene(s) should be an integrative process that includes both bioinformatics and experiment. The consequence of the mutation may be easily seen according to the change in the DNA codon and resulting amino acid. The mutation can be tested experimentally for its transcription and translational products. The mutation can also be transferred into another mouse strain to examine the phenotype. After initially confirming the finding of the gene defect(s) in the GRI, one should examine the identified genes to see whether (1) the gene has been studied for the same or similar phenotype, (2) the gene has not been studied and therefore the potential pathway in which it participates is unexplored, (3) there is a significant role the new gene may play in the pathway, (4) the gene is a known gene or the function revealed represents a new function of the gene, and (5) the importance of the new function affects the phenotype of interest. Confirmation of the function of a candidate gene can be accomplished using a combination of molecular biological and genetic approaches described next.
13.7.1 Molecular Biological Approach
13.7.1.1 Relative Quantitative RT-PCR
It is likely that the GRI gene(s) will be identified as one or more genes or coding regions. In this case, it will be necessary to perform expression studies by RT-PCR. Unique probes should be designed from the sequence data of the GRI gene obtained earlier. To determine in what tissues and to what extent the gene is expressed, message levels for GRI gene(s) should be determined by RT-PCR (Jiao et al., 2005b, 2008).
13.7.1.2 Expression of GRI Gene(s) In Vitro
To compare the protein sequences from normal and GRI genes, one should insert the corresponding nucleotide sequences into expression vectors. First, one should pay attention to designing a pair of primers that flank the entire normal gene. This pair of primers will then be used to amplify cDNA derived by RT-PCR from normal and GRI mouse tissues. Protein products should be analyzed using SDS-PAGE and/or native PAGE electrophoresis with known enzymes to confirm the predicted differences between the normal and congenic proteins.
13.7.1.3 In Situ Hybridization
In situ hybridization is still useful in detecting gene expression in different tissues and at different time points. With a gene that has not been studied or with an unknown function, one needs to detect its expression during its developmental process and in a variety of tissues. The procedure should follow that outlined in our previous publications (Jiao et al., 2005a, 2005b, 2008).
13.7.1.4 Antibody Generation and Immunolocalization
The generation of one or more high-quality antibodies to the encoded proteins of interest should allow for subcellular localization studies and, in addition, corroborate the in situ hybridization results (Jiao et al., 2005a). Furthermore, antibodies may be needed for protein–protein interaction and other biochemical studies.
In addition, BLAST analyses should be performed to exclude significant homology to other peptide sequences.
13.7.1.5 Western Blotting to Characterize Antibody Specificity
Standard immunohistochemical methods should be used to localize encoded proteins in the tissues of interest (Gu et al., 2002; Jiao et al., 2005a).
13.7.2 Genetic Approach
Depending on the number of candidate genes and the nature of the mutations, one or more of the following approaches should be used to confirm the function(s) of candidate genes.
13.7.2.1 Transfer Selected Gene from GRI to the Control or Wild Type
One should plan to transfer the selected gene into its original control background. Theoretically, the genes in the GRI and control are different only in the mutation that results in the fibrotic phenotype. Therefore, if the identified mutation truly causes disease, transferring the selected GRI gene back into the control should result in the expected phenotype. The experimental protocol is straightforward. Initially, a cross between a heterozygous individual and a control should be made. Beginning with F1, one should use primers that flank the mutation to select heterozygous individuals for backcrossing to the control. After five to six generations of backcrossing, one should cross heterozygous mice to create homozygous individuals for the phenotype test. A key disadvantage of this approach is the difficulty in transferring only the selected GRI gene into the control mice.
13.7.2.2 Creation of Transgenic Mice That Overexpress, or Knockout Mice That Do Not Express, the Candidate Gene
Whether one expresses or knocks out the GRI gene depends on the nature of the selected GRI gene. If it is expressed at a much higher level than in the wild type control mice, using an srRNA approach to knock it out or suppress its expression may cure the disease phenotype. If the expression of the selected GRI gene is decreased or disappears in the mutant, then creation of a transgenic mouse with normal expression of the GRI gene may rescue the diseased individual.
13.7.3 Limitations and Alternative Approaches
It is possible that the expression results obtained give no indication of the function of the selected GRI genes. In this case, one should perform microarray analyses comparing homozygous GRI/+ as well as GRI/− tissues of interest to +/+ tissues of interest to identify genes that are either up- or down-regulated by the selected GRI compared to controls. If there are multiple mutations in multiple genes, double or triple knockout or transgenic individuals need to
be created. This process will involve much more work and additional difficulties. A further complicating issue is the interaction between mutation(s) within the GRI locus and the genomic background. Phenotypic variation of the same mutation in different genomic backgrounds is very common. Once the mutation(s) is identified, interactions with other genes and in different genomic backgrounds should be an important aspect of future studies.
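The microarray comparison suggested earlier in this section can be illustrated with a simple fold-change filter. This is only a minimal sketch under stated assumptions: the gene names, expression values, and the two-fold cutoff are hypothetical, and a real analysis would also need replicates, normalization, and a statistical test.

```python
import math


def classify_regulation(mutant, control, min_log2_fc=1.0):
    """Label each gene 'up', 'down', or 'unchanged' in mutant versus control.

    mutant, control: dicts mapping gene name -> expression value (> 0).
    min_log2_fc: hypothetical cutoff; 1.0 corresponds to a two-fold change.
    """
    calls = {}
    for gene, control_value in control.items():
        log2_fc = math.log2(mutant[gene] / control_value)
        if log2_fc >= min_log2_fc:
            calls[gene] = "up"
        elif log2_fc <= -min_log2_fc:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical expression values for +/+ (control) and mutant tissue.
control = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}
mutant = {"geneA": 400.0, "geneB": 50.0, "geneC": 55.0}
calls = classify_regulation(mutant, control)
print(calls)
```

Genes labeled "up" or "down" would then be the starting list for the follow-up described above.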
13.8 QUESTIONS AND ANSWERS
Q1. When determining the relative importance of a gene in the genome region of interest of a disease model, which three major features of the gene need to be considered?
A1. Whether there is a difference in DNA sequence between the study subject and the wild type or control, whether there is a difference in expression level between subject and control, and whether the gene has a reported relevant function or a pathway connection to the disease.
13.9 REFERENCES
Baehner RL, Kunkel LM, Monaco AP, Haines JL, Conneally PM, Palmer C, Heerema N, Orkin SH. (1986). DNA linkage analysis of X chromosome-linked chronic granulomatous disease. Proc Natl Acad Sci U S A 83(10):3398–401.
Gu W, Li X, Lau KH, Edderkaoui B, Donahae LR, Rosen CJ, Beamer WG, Shultz KL, Srivastava A, Mohan S, Baylink DJ. (2002). Gene expression between a congenic strain that contains a quantitative trait locus of high bone density from CAST/EiJ and its wild-type strain C57BL/6J. Funct Integr Genomics 1(6):375–86.
Hall AG, Hamilton P, Minto L, Coulthard SA. (2001). The use of denaturing high-pressure liquid chromatography for the detection of mutations in thiopurine methyltransferase. J Biochem Biophys Meth 47(1–2):65–71.
Jiao Y, Jin X, Yan J, Zhang C, Jiao F, Li X, Roe BA, Mount DB, Gu W. (2008). A deletion mutation in Slc12a6 is associated with neuromuscular disease in gaxp mice. Genomics 91(5):407–14.
Jiao Y, Li X, Beamer WG, Yan J, Tong Y, Goldowitz D, Roe B, Gu W. (2005a). Identification of a deletion causing spontaneous fracture by screening a candidate region of mouse chromosome 14. Mammal Genome 16(1):20–31.
Jiao Y, Yan J, Jiao F, Yang H, Donahue LR, Li X, Roe BA, Stuart J, Gu W. (2007). A single nucleotide mutation in Nppc is associated with a long bone abnormality in lbab mice. BMC Genet 8:16.
Jiao Y, Yan J, Zhao Y, Donahue LR, Beamer WG, Li X, Roe BA, Ledoux MS, Gu W. (2005b). Carbonic anhydrase-related protein VIII deficiency is associated with a distinctive lifelong gait disorder in waddles mice. Genetics 171(3):1239–46.
Mouse Genome Sequencing Consortium. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–62.
Pereira E, Tamia-Ferreira MC, Cardoso RS, Mello SS, Sakamoto-Hojo ET, Passos GA, Donadi EA. (2004). Immunosuppressive therapy modulates T lymphocyte gene expression in patients with systemic lupus erythematosus. Immunology 113(1):99–105.
Rhodes DR, Miller JC, Haab BB, Furge KA. (2002). CIT: identification of differentially expressed clusters of genes from microarray data. Bioinformatics 18(1):205–06.
Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, Cole FS, Curnutte JT, Orkin SH. (1986). Cloning the gene for an inherited human disorder–chronic granulomatous disease–on the basis of its chromosomal location. Nature 322(6074):32–38.
Selaru FM, Zou T, Xu Y, et al. (2002). Global gene expression profiling in Barrett’s esophagus and esophageal cancer: a comparative analysis using cDNA microarrays. Oncogene 21(3):475–78.
Tabone T, Sallmann G, Chiotis M, Law M, Cotton R. (2006). Chemical cleavage of mismatch (CCM) to locate base mismatches in heteroduplex DNA. Nat Protoc 1(5):2297–304.
Wade CM, Daly MJ. (2005). Genetic variation in laboratory mice. Nat Genet 37(11):1175–80.
Xiong Q, Jiao Y, Hasty KA, Canale ST, Stuart JM, Beamer WG, Deng HW, Baylink D, Gu W. (2009). Quantitative trait loci, genes, and polymorphisms that regulate bone mineral density in mouse. Genomics 93(5):401–14.
Xiong Q, Jiao Y, Hasty KA, Stuart JM, Postlethwaite A, Kang AH, Gu W. (2008b). Genetic and molecular basis of QTL of rheumatoid arthritis in rat: genes and polymorphisms. J Immunol 181(2):859–64.
Xiong Q, Qiu Y, Gu W. (2008a). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13.
CHAPTER 14
Determination of the Function of a Mutation
BOUCHRA EDDERKAOUI
Contents
14.1 Introduction
14.2 Concept of Quantitative Trait Loci
14.2.1 Mouse Model
14.2.2 Human Diseases and Association Studies
14.3 How to Determine if a Mutation is Functional
14.3.1 Test to Determine if the Mutation is Null
14.3.2 Dosage Analysis: Quantitative or Qualitative Changes Due to Mutation
14.3.3 Complementation Test
14.4 Effect of Mutations on the Function of the Gene
14.4.1 Nonsynonymous SNPs
14.4.2 Synonymous SNPs
14.4.3 Regulatory SNPs
14.5 General Strategy to Assess the Effect of a Mutation on the Function of a Gene and the Observed Phenotypic Variations
14.5.1 Sequence Analyses
14.5.2 Tissue Specificity
14.5.3 Expression Profiling
14.5.4 In Vitro Functional Studies
14.5.5 In Vivo Functional Studies and Mouse Model
14.6 Questions and Answers
14.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
14.1 INTRODUCTION
Mutations are permanent changes in the DNA sequence. They range in size from a single DNA building block (DNA base) to a large segment of a chromosome and can be inherited from a parent or acquired during an individual's lifetime. The concept of mutation has enflamed imaginations for generations, as its potential power to benefit or harm humans is enormous. Depending on the type of mutation and the type of cells affected by the mutation, the results can be positive, negative, or neutral. Some genetic changes are very rare; others are common in the population. The single nucleotide polymorphism (SNP) is the simplest form of DNA variation among individuals. SNPs can be of the transition or transversion type. They occur throughout the genome at a frequency of about one in 1,000 bp (Shastry, 2009), but some genomic sequences show a higher SNP frequency than others. SNPs that change the encoded amino acid are called nonsynonymous SNPs; those that change the DNA sequence but not the encoded amino acid are called synonymous SNPs. SNPs can also occur in noncoding regions, where they may influence promoter activity (gene expression), messenger RNA (mRNA) conformation (stability), and subcellular localization of mRNAs and/or proteins and hence may lead to disease. These small DNA variations induce diversity among individuals and could be responsible for genome evolution, for the most common familial traits such as skin and eye color, and for interindividual differences in drug response. They can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. Therefore, understanding the effect of specific mutations on gene function and on an individual's health is key to developing the concept of personalized medicine. To determine the effect of a mutation on the function of a gene, the following parameters need to be carefully analyzed:
• Mutation-phenotype linkage.
• Position of the mutation in the gene sequence.
• Genomic changes caused by the mutation.
• Protein changes caused by the mutation.
• Tissue specificity.
In this chapter, I will discuss each parameter and the strategies that could lead to a better understanding of the impact of different mutations on gene function, with a focus on SNPs.
14.2 CONCEPT OF QUANTITATIVE TRAIT LOCI
14.2.1 Mouse Model
A quantitative trait locus (QTL) is a polymorphic locus containing alleles that differentially affect the expression of a specific phenotypic trait (a genetic basis
for physiological variation) (Nadeau and Frankel, 2000). The purpose of QTL research is to identify genes and gene variants or mutations that contribute to the expression of these traits. QTL are identified via association of the studied traits with genetic markers, which are polymorphic sequences that characterize each species or strain of mice. The use of inbred strains of mice in this setting has proven to be a viable alternative to human genetic studies, given the degree of control that can be exercised over experimental parameters such as environment, breeding scheme, and detailed phenotyping. Over the past 20 years, QTL mapping has led to the identification of numerous genetic loci for a variety of traits relevant to human diseases, including behavioral differences, lipid levels, obesity, atherosclerosis, and osteoporosis (Korstanje and Paigen, 2002; Allayee et al., 2003; Williams and Spector, 2007). Genomewide SNP screens combined with microsatellite markers have been used to successfully identify QTL for complex traits. Bice et al. (2009) genotyped a total of 867 informative SNPs in an F2 population of 989 mice derived from a cross between high- and low-alcohol-preferring mouse lines. QTL analyses detected significant evidence of association between the phenotypic variations and multiple chromosomal regions in the mouse. Several of the identified regions included candidate genes previously associated with alcohol dependence in humans or other animal models. (More details on QTL studies are described in Chapter 19.) After the identification of a QTL, most researchers perform further fine mapping to reduce the size of the region and hence refine the list of potential candidate genes. This can be achieved by creating additional recombination events through selective breeding (Darvasi, 1998) or by exploiting historical recombinations (Cardon and Bell, 2001).
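The marker-trait association that underlies QTL detection can be sketched as a single-marker regression. The example below is an illustrative simplification, not the method of the studies cited above: it assumes an additive genotype coding (0, 1, or 2 copies of one parental allele), a normally distributed trait, and hypothetical data, and it reports a LOD score computed from the residual sums of squares of the null and marker models.

```python
import math


def single_marker_lod(genotypes, phenotypes):
    """LOD score for one marker by simple linear regression.

    genotypes: allele counts (0, 1, 2) per animal, assumed additive coding.
    phenotypes: quantitative trait values, one per animal.
    LOD = (n / 2) * log10(RSS_null / RSS_marker).
    """
    n = len(phenotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    rss_null = sum((y - mean_y) ** 2 for y in phenotypes)
    slope = sxy / sxx
    rss_marker = rss_null - slope * sxy  # residual variation after fitting the marker
    return (n / 2) * math.log10(rss_null / rss_marker)


# Hypothetical F2 data: one trait, one linked and one unlinked marker.
trait = [10.1, 9.8, 12.2, 11.9, 14.1, 13.8, 10.0, 12.0, 14.0, 12.1]
linked_marker = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
unlinked_marker = [0, 1, 2, 0, 1, 2, 1, 0, 2, 1]

lod_linked = single_marker_lod(linked_marker, trait)
lod_unlinked = single_marker_lod(unlinked_marker, trait)
print(round(lod_linked, 2), round(lod_unlinked, 2))
```

A marker whose genotype tracks the trait yields a much higher LOD than an unlinked one. Genomewide significance thresholds are usually set by permutation, which this sketch omits.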
After fine mapping, the candidate genes are selected based on the following criteria: (1) position within the QTL, (2) known function, (3) expression profile, and (4) sequence variations between the strains of mice analyzed. Several genes responsible for complex traits have been identified by combining QTL analyses, gene expression profiling, and mutation analyses (Kleeberger et al., 2000; Klein et al., 2004; Edderkaoui et al., 2007).
14.2.2 Human Diseases and Association Studies
14.2.2.1 Genomewide Associations
Scientists have spent decades mapping human disease genes. Initially, the focus was on identifying genetic mutations responsible for single-gene disorders. More recently, however, new technology has made it possible to link multiple genes to a single disease and to connect multiple diseases to one another through the genes associated with them. Genomewide association (GWA) studies are performed to identify the genes involved in complex traits and human diseases; this method searches the genome for DNA variations that are associated with different traits. If association is present, a particular allele, genotype, or haplotype of one or more polymorphisms will be seen more often
than expected by chance within a population that shows the trait. Thus a person carrying one or two copies of a high-risk variant is at increased risk of developing the associated disease or having the associated trait. Because GWA studies examine DNA variations across the genome, they represent a promising way to study complex, common diseases in which many genetic variations contribute to different degrees to the development of the disease. In comparison to family linkage-based approaches, association studies have two key advantages: (1) They are able to capitalize on all meiotic recombination events in a population rather than only those in the families studied. Thus association signals are localized to small regions of the chromosome containing only one or a few genes, enabling rapid detection of the actual disease susceptibility gene. (2) GWA studies allow the identification of disease genes that confer only modest increases in risk. Such genes are the very type one expects for common disorders, and their detection is severely limited in linkage studies. The power to detect association between genetic variation and disease is a function of several factors, including the frequency of the risk allele or genotype, the relative risk conferred by the disease-associated allele or genotype, the correlation between the genotyped marker and the risk allele, sample size, disease prevalence, and genetic heterogeneity of the sample population. While the first three factors are unknown before a specific GWA study (GWAS), their impact can be influenced by the study design. The key elements for the success of an association study include sufficient sample sizes, rigorous phenotypes, comprehensive maps, accurate high-throughput genotyping technologies, sophisticated information technology infrastructure, rapid algorithms for data analysis, and rigorous assessment of genomewide signatures. GWAS have played a large role in unraveling these complex relationships.
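The idea that an associated allele is "seen more often than expected by chance" can be made concrete with an allelic chi-square test on a single SNP. The following is a minimal sketch under stated assumptions: the allele counts are hypothetical, the function name is illustrative, and a real GWAS would add quality control, correction for population structure, and a genomewide significance threshold.

```python
import math


def allelic_association(case_risk, case_other, control_risk, control_other):
    """Chi-square test (1 degree of freedom) on a 2 x 2 table of allele counts.

    Returns (chi-square statistic, p-value, odds ratio for the risk allele).
    """
    table = [[case_risk, case_other], [control_risk, control_other]]
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_risk * control_other) / (case_other * control_risk)
    return chi2, p_value, odds_ratio


# Hypothetical counts: 1,000 cases and 1,000 controls, two alleles per person.
chi2, p_value, odds_ratio = allelic_association(600, 1400, 500, 1500)
print(round(chi2, 2), p_value, round(odds_ratio, 3))
```

The example shows how a modest odds ratio can still give a small p-value at this sample size, which is why sample size appears among the determinants of power listed above.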
Although these studies do not account for the many environmental factors that contribute to disease, they have revealed numerous gene–disease associations (Lees and Satsangi, 2009; Bajaj et al., 2010; Vanunu et al., 2010). These findings have encouraged healthcare professionals to think about the molecular pathways of diseases, which in turn has led to the development of various new treatment options. Of course, there is still a long way to go before much of the knowledge provided by GWAS can actually be used to treat or cure human disorders. One prominent obstacle along the path to this goal involves determining how best to manage the ever-growing body of gene association data.
14.2.2.2 Haplotypes
Genetic association studies can also be performed with haplotypes. A haplotype is the series of genetic variants on a specific chromosome that are inherited from one parent. In subsequent generations, the chromosomal haplotype is progressively broken up by crossing over events in meiosis. In general, the term haplotype refers to closely linked genetic loci. SNPs that are located in close proximity tend to travel together,
a phenomenon that is known as linkage disequilibrium (LD). In general, loci that are located more closely together on a chromosome will be in stronger LD than loci located far apart, but the correlation between LD and the physical distance separating two loci is modest: some loci that are separated by 20 bp will not be in LD, whereas other loci separated by 200,000 nucleotide bases will be in tight LD (Pritchard and Przeworski, 2001). One approach is to assign the most likely haplotypes to each individual in a study population and then determine if the distribution of assigned haplotypes differs between cases and control subjects or within families. However, this approach does not adjust for the uncertainty in haplotype assignment. Therefore, approaches that explicitly incorporate the relative probabilities of each haplotype for each individual are preferred. The statistical genetic issues in haplotype association studies of unrelated individuals have been reviewed by Schaid (2004). A variety of haplotype-based association methods have been developed for both unrelated subjects and families (Clayton and Jones, 1999; Schaid et al., 2002; Horvath et al., 2004; Morris et al., 2004; Satten and Epstein, 2004). Schaid et al. (2002) developed a regression-based score test for haplotype association in unrelated subjects that allows testing of both global haplotype association and individual haplotype association, as implemented in the Haplo.Stats program. Such regression-based approaches have a number of advantages, including the inclusion of covariates for environmental and other nongenetic factors as well as the inclusion of haplotype by environment interactions (Lake et al., 2003).
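The strength of LD between two biallelic loci is commonly summarized by the statistics D, D', and r². The sketch below computes them from haplotype and allele frequencies. It is an illustrative example rather than part of the methods cited above: the frequencies are hypothetical, and the function assumes both loci are polymorphic (allele frequencies strictly between 0 and 1).

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a, p_b: frequencies of allele A and allele B (each strictly in (0, 1)).
    """
    d = p_ab - p_a * p_b  # deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


# Hypothetical frequencies: the A-B haplotype is more common than independence predicts.
d, d_prime, r_squared = linkage_disequilibrium(p_ab=0.4, p_a=0.5, p_b=0.5)
print(round(d, 3), round(d_prime, 3), round(r_squared, 3))

# Independent loci give D = 0.
d_indep, d_prime_indep, r_squared_indep = linkage_disequilibrium(0.25, 0.5, 0.5)
```

The r² value is the one that relates most directly to association power, because it measures how well a genotyped marker predicts the untyped risk allele.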
14.3 HOW TO DETERMINE IF A MUTATION IS FUNCTIONAL
14.3.1 Test to Determine if the Mutation is Null
When analyzing gene function, geneticists start by searching for multiple mutations with a particular phenotype in a genetic screen. Once this is accomplished, the next step is usually to do complementation tests to figure out how many genes are represented by the mutations that were isolated in the screen. However, it is also important to figure out whether the individual mutations are loss of function, gain of function, or something else (dominant negative, neomorphic, etc.). In the case of null mutations, the function of the gene is completely abolished due to either a total absence of the translated product or the production of a totally inactive product. Thus one way to know if the mutation is null is to know the type of DNA modification caused by the mutation and its effect on the protein structure. Another way is to determine whether the phenotype is completely altered by the mutation and to evaluate whether the gene is transcribed and the protein translated, by simply comparing the mRNA and protein expression levels of the mutated and the control gene.
Three types of mutations can cause loss of function:
• Deletion is the loss of one or several bases.
• Insertion is the addition of one or several bases, which can be of various natures, for instance, duplication of a preexisting DNA sequence or insertion of a foreign sequence, such as a viral sequence.
• Substitution is the replacement of one base by a different one, with no change in the total number of bases in the sequence.
Deletions, insertions, or substitutions can create a stop codon at the affected codon or further downstream in the DNA sequence, which leads to either a truncated protein or altered translation. When located in noncoding regions, deletions or insertions may affect gene expression; when located in coding regions, they may affect the structure of the protein and consequently its function. In the case of base substitutions, three subtypes can be identified:
• Nonsense mutation, where the replacement of one base by another creates a stop codon in place of a codon specifying an amino acid.
• Missense mutation, where the mutated codon specifies a different amino acid.
• Splice mutation, where a splicing site is suppressed or created.
Depending on the position of the polymorphism within the gene, it may affect DNA transcription, RNA splicing, RNA translation, or protein structure or quantity. Functional consequences at the phenotypic level may vary in degree.
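The coding consequences of a base substitution listed above can be worked out mechanically from the standard genetic code. The following sketch classifies a single-codon change as synonymous, missense, nonsense, or stop-loss, and labels the base change as a transition or transversion. The function names are illustrative, the input is assumed to be an in-frame DNA codon from the coding strand, and splice mutations are outside its scope because they depend on sequence context rather than on the codon alone.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}


def classify_substitution(ref_codon, alt_codon):
    """Classify the coding effect of replacing ref_codon with alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def base_change_type(ref_base, alt_base):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; otherwise transversion."""
    same_class = (ref_base.upper() in PURINES) == (alt_base.upper() in PURINES)
    return "transition" if same_class else "transversion"


results = {
    ("TGG", "TGA"): classify_substitution("TGG", "TGA"),  # Trp -> stop
    ("GAA", "GAG"): classify_substitution("GAA", "GAG"),  # Glu -> Glu
    ("GAG", "GTG"): classify_substitution("GAG", "GTG"),  # Glu -> Val
}
print(results, base_change_type("A", "G"), base_change_type("A", "T"))
```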
14.3.2 Dosage Analysis: Quantitative or Qualitative Changes Due to Mutation
14.3.2.1 Loss of Function
A loss of function mutation generally behaves in a recessive way, because the normal allele of a heterozygous carrier retains its function and may sometimes even be transcribed at a higher level than in the normal homozygote. Thus the heterozygous carrier will show an intermediate amount of the gene product, which can be sufficient to maintain the function. Variations may occur if the function is greatly impaired when the amount of the gene product falls below a certain threshold level (relative to the wild type level). This latter phenomenon corresponds to a dosage effect (Fig. 14.1). If the amount of the gene product in the heterozygous carrier is below the threshold, then the mutation will behave in a dominant way. This situation is also described as haploinsufficiency, where the presence of one normal allele is not sufficient to maintain the function. Because the normal activity of a gene can be disturbed in many different ways, one can expect considerable molecular diversity at the origin of loss of function mutations.
Figure 14.1. The dosage effect for a loss of function mutation. The vertical axis shows the amount of gene product for the normal, heterozygous, and homozygous mutant genotypes. Threshold values a and b are used to distinguish between dominance and recessivity: (a) threshold > 50%, phenotype H = phenotype M (dominance); (b) threshold < 50%, phenotype H = phenotype N (recessivity). N represents the homozygous normal genotype, H, the heterozygous genotype for the mutation, and M, the homozygous mutant genotype (Tixier-Boichard, 2002).
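The threshold logic of Figure 14.1 can be written as a short decision rule. This is a minimal sketch, assuming gene product is expressed as a percentage of the normal homozygote level and that function is maintained whenever the amount is at or above a single threshold; the function name and numeric values are hypothetical.

```python
def classify_loss_of_function(heterozygote_percent, threshold_percent):
    """Predict how a loss of function mutation behaves in a heterozygous carrier.

    heterozygote_percent: gene product in the heterozygote, as % of normal.
    threshold_percent: minimum % of normal product needed to maintain function.
    """
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold_percent must be between 0 and 100")
    if heterozygote_percent >= threshold_percent:
        return "recessive"  # heterozygote phenotype equals the normal phenotype
    return "dominant"  # haploinsufficiency: heterozygote resembles the mutant


# Hypothetical cases corresponding to thresholds b (< 50%) and a (> 50%).
case_b = classify_loss_of_function(heterozygote_percent=50, threshold_percent=30)
case_a = classify_loss_of_function(heterozygote_percent=50, threshold_percent=70)
# Compensatory up-regulation of the normal allele can lift a carrier above threshold.
case_compensated = classify_loss_of_function(heterozygote_percent=65, threshold_percent=60)
print(case_b, case_a, case_compensated)
```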
Purely genetic evidence, without biochemical studies, can often suggest whether a phenotype is caused by loss or gain of function. When a clinical phenotype results from loss of function of a gene, we would expect any change that inactivates the gene product to produce the same clinical result. We should be able to find point mutations that have the same effect as mutations that delete or disrupt the gene. Waardenburg syndrome type 1 provides a good example of loss of function mutation, since missense mutations, nonsense mutations, and, in some patients, complete deletion of the PAX3 sequence all produce the same clinical result (Sheffer and Zlotogora, 1992; Wollnik et al., 2003).
14.3.2.2 Dominant Negative Mutations
This situation is observed when the gene product of the mutated allele is only partially active and may interfere with the normal gene product. This is particularly the case when the gene product acts antagonistically to the wild type allele as a cofactor or if it is involved in the formation of a dimer or polymer. In cases of polymeric molecules, such as collagen, dominant negative mutations are often more deleterious than mutations causing the production of no gene product (null mutations or null alleles). The defect of one component in a dimer or polymer may be sufficient to impair the overall function because the abnormal allele also disturbs the normal allele when both allelic products form a dimer/polymer. The mutation behaves in a dominant way because only one mutated allele is able to impair the gene function. This phenomenon is well illustrated in human genetics by mutations of the nuclear hormone receptors (Yen and Chin, 1994). The same applies when the dimer involves the product of a different gene; the mutation of one gene has a negative epistatic effect on the function usually associated with the other gene. Another example of dominant negative mutation was reported by Thomas et al. (1997). The group identified a point mutation in the gene encoding cartilage-derived morphogenetic protein 1 (CDMP-1). The mutation substitutes a tyrosine for the first of seven highly conserved cysteine residues in the mature active domain of the protein. They showed that the mutation results in a protein that is not secreted and is inactive in vitro. It produces a dominant negative effect by preventing the secretion of other related bone morphogenetic protein family members.
14.3.2.3 Gain of Function Mutations
Gain of function mutations usually cause a dominant phenotype. They change the gene product such that it gains a new and abnormal function. A new function can be obtained with the production of an aberrant protein or the expression of a normal protein in abnormal conditions, either at an unusual age or in an aberrant location. These mutations may have severe phenotypic effects. Activating mutations have most commonly been found in thyroid nodules; many are not heritable (germ line) but are somatic mutations that develop under specific environmental conditions such as long-term iodine deficiency or exposure to goitrogens. Several germline mutations within the thyroid-stimulating hormone receptor (TSHR) have been identified as gain of function mutations (Esapa et al., 1999; Alberti et al., 2001) in families that showed hyperthyroidism with severe thyrotoxicosis. The effect of the mutation on the function of the gene was evaluated after cloning and directed mutagenesis. COS-7 cells were transfected with wild type or mutated TSHR cDNA. The cells transfected with mutated TSHR displayed increased constitutive activity toward the cAMP pathway when compared with the wild type TSHR (Alberti et al., 2001).
14.3.3 Complementation Test
Genetic studies start with the identification of the molecular components that contribute to a particular process and then determine how those molecules communicate or work together to execute that process.
There are several strategies to unravel the networking of each process. However, certain rules apply to all situations. The first is to identify multiple mutations with a particular phenotype in a genetic screen. Once this is accomplished, the next step is to do complementation tests to figure out how many genes are represented by the mutations that were isolated in the screen. It is important to figure out whether the individual mutations isolated are loss of function, gain of function, or something else (dominant negative, neomorphic, etc.). For that purpose, there is a need to combine mutations in different genes that give opposite or at least different phenotypes. Combining mutations with the same or similar phenotypes is not often informative, since the combination is likely to give the same phenotype as either individual mutation. The one exception is when two mutations act as enhancers of each other. In this case, the two mutations that generate a similar phenotype can be combined to generate either a new phenotype or, more probably, a stronger phenotype. Two
loss of function mutations that give different phenotypes or a loss of function allele of one gene with a gain of function allele of another gene can be combined in the suppressor and enhancer problem set. Suppressor and enhancer screens are usually designed to find additional genes that act in the same process or related/parallel processes. They generally work by mutating cells or animals that already carry one mutation and looking for a milder phenotype (to find suppressors) or a stronger or different phenotype (to find enhancers).
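The complementation test described in this section, which asks how many genes are represented by the mutations isolated in a screen, can be sketched as a grouping problem. The example below is a simplified illustration: it assumes recessive loss of function mutations, treats every pair that fails to complement as allelic, and ignores exceptions such as intragenic complementation. The mutation names and test results are hypothetical.

```python
def complementation_groups(mutations, fails_to_complement):
    """Group mutations into complementation groups (putative genes).

    mutations: list of mutation names.
    fails_to_complement: pairs (a, b) whose trans-heterozygote shows the mutant
        phenotype, taken here as evidence that a and b affect the same gene.
    """
    parent = {m: m for m in mutations}

    def find(m):
        # Follow parent links to the group representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in fails_to_complement:
        parent[find(a)] = find(b)

    groups = {}
    for m in mutations:
        groups.setdefault(find(m), []).append(m)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical screen: five mutations, three failed complementation tests.
mutations = ["m1", "m2", "m3", "m4", "m5"]
failed_pairs = [("m1", "m2"), ("m2", "m3"), ("m4", "m5")]
groups = complementation_groups(mutations, failed_pairs)
print(groups)
```

In this hypothetical screen the five mutations fall into two groups, suggesting two genes. Every pair not listed is assumed to have complemented.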
14.4 EFFECT OF MUTATIONS ON THE FUNCTION OF THE GENE
The candidate mutation, whether identified through linkage, candidate gene, and/or bioinformatics analyses, can be verified through a number of methods. The choice of the approaches to follow to determine the effect of a specific mutation on the function of a gene depends on the position of the mutation and the type of mutation (synonymous, nonsynonymous, regulatory SNPs).
14.4.1 Nonsynonymous SNPs
SNPs located within the coding region of genes have been extensively studied, including those that cause amino acid codon alterations (nonsynonymous variants) that can lead to protein misfolding, polarity shift, improper phosphorylation, and other functional consequences. The impact of nonsynonymous SNPs can be assessed using bioinformatic tools to evaluate the importance of the amino acids they affect. Nonsynonymous SNPs fall into three categories: (1) the SNP that causes a change in a part of the protein sequence that is not conserved over long evolutionary distances; (2) the SNP that causes a change in a conserved domain of the protein, but the changed residue is not conserved; and (3) the SNP that changes a conserved residue within a conserved domain. Using publicly available sequences and some simple computational tools available at the websites listed below, the conserved domains of the protein as well as the amino acids conserved through species evolution can be identified. The SNP can then be ranked as more or less likely to affect the function of the protein.
www.ensembl.org/index.html
www.ncbi.nlm.nih.gov/guide/sequence-analysis
www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi
Other, more sophisticated computational tools can be used. The Sorting Intolerant from Tolerant (SIFT) software uses sequence homology to predict whether an amino acid substitution will affect protein function and hence
potentially alter the phenotype (Ng and Henikoff, 2001, 2002). The program is available at http://sift.jcvi.org. After computational analyses, it is still necessary to perform functional studies to confirm the effect of the mutation on the function of the gene. The different experiments that can be performed to confirm the function of specific mutations are explained later in this book.
14.4.2 Synonymous SNPs
Due to the relatively high frequency of SNPs in the human genome, synonymous SNPs (sSNPs) were often disregarded in pharmacogenomic studies based on the assumption that they are silent mutations. However, recent genetic studies have shown that synonymous mutations can have important fitness consequences, with >40 genetic diseases now associated with such silent mutations (Chamary et al., 2006). There is now clear evidence that synonymous codons are not used randomly, that preferred codons correlate strongly with the relative abundance of the corresponding tRNAs, and that natural selection acts on synonymous mutations. Kimchi-Sarfaty et al. (2007) investigated the effect of a synonymous SNP in the human multidrug resistance 1 (MDR1) gene. The SNP was selected as part of a haplotype previously linked to altered function of the MDR1 gene product, P-glycoprotein (P-gp), which is implicated both in determining drug pharmacokinetics and in multidrug resistance in human cancer cells. The SNP results in P-gp with altered drug and inhibitor interactions. No differences in mRNA or protein levels were found in cells transfected with the mutant allele compared to the wild type allele, but an altered conformation was found in the mutated P-gp. Therefore, the authors suggested that the presence of a rare codon, marked by the synonymous polymorphism, affects the timing of cotranslational folding and insertion of P-gp into the membrane, thereby altering the structure of substrate and inhibitor interaction sites. Similarly, Nackley et al. (2006) reported that haplotypes divergent in synonymous SNPs in the catechol-O-methyltransferase gene exhibited the largest difference in enzymatic activity, due to a reduced amount of translated protein. The mutated gene showed a change in the local RNA stem-loop structures, such that the most stable structure was associated with the lowest protein levels and enzymatic activity.
Site-directed mutagenesis that eliminated the stable structure restored the amount of translated protein, which highlights the functional significance of synonymous SNPs. Furthermore, it has been shown that a synonymous SNP of the corneodesmosin gene leads to increased mRNA stability, and carriers of haplotypes that include this SNP are more likely to develop psoriasis (Capon et al., 2004). In summary, there are at least three relatively well-resolved mechanisms by which synonymous mutations can affect fitness (Joanna and Parmley, 2007): (1) mRNA structure and stability, (2) kinetics of translation, and (3) alternate splicing (Fig. 14.2). It is also likely that overlapping transcripts, which may well
Figure 14.2. Mechanisms by which synonymous mutations could affect protein expression and/or function. The diagram lists three mechanisms (mRNA structure and stability, kinetics of translation, and alternative splicing), each leading to a change in protein amount, structure, and/or function.
be much more common than one would think (Sun et al., 2006), will impose some form of extra constraint (Lipman, 1997) on mutations that are synonymous in only one of the two genes. Very recently, Okamoto et al. (2010) identified the potassium inwardly rectifying channel subfamily J member 15 (KCNJ15) gene as a new type 2 diabetes mellitus (T2DM) susceptibility gene. An sSNP in exon 4 of this gene, rs3746876 (C566T), has been associated with T2DM in three independent Japanese sample sets. To determine the effect of the sSNP on the function of the gene, the authors cloned wild-type and mutated KCNJ15 and expressed them in human embryonic kidney 293 cells. The functional analysis demonstrated that the risk allele of the sSNP increased KCNJ15 expression via increased mRNA stability, which resulted in higher protein expression than with the nonrisk allele. Overexpression of KCNJ15 decreased insulin secretion under high-glucose conditions, while no significant change was found under normoglycemic conditions.
14.4.3 Regulatory SNPs
SNPs located within noncoding regions of the genome are the least predictable. While mostly regarded as nonfunctional, this type of alteration can affect gene regulatory sequences such as promoters, enhancers, and silencers (Ponomarenko et al., 2002). Termed regulatory SNPs (rSNPs), these variations have featured more prominently in recent studies (Wang et al., 2005, 2007; Knight, 2003, 2005). Transcription factor (TF) binding sites are the most attractive
Figure 14.3. The impact of a regulatory SNP in a transcription factor binding site (TFBS). The diagram shows a TF bound to a TFBS in the promoter upstream of the ATG, with five possible outcomes of a SNP in the TFBS: no change, increased binding, decreased binding, no binding, and novel binding (expression controlled by another TF). In most cases, the SNP will not change TF binding activity or target gene expression because the TF, in general, tolerates variation in the consensus sequence of the binding site. In some cases, a SNP may increase or decrease the stability of the binding, leading to allele-specific gene expression. Rarely, the SNP eliminates the natural binding site or generates a novel binding site, and consequently the gene is no longer controlled by the original TF (Chorley et al., 2008).
regions to search for functional rSNPs. A SNP in a TF binding site can have multiple consequences. In most cases, a SNP changes neither the interaction between the TF and its binding site nor gene expression, since a TF will usually recognize a considerable number of binding-site variants. In some cases, a SNP may increase or decrease binding, leading to allele-specific gene expression. In rare cases, a SNP may eliminate the natural binding site or generate a novel binding site, so that the gene can no longer be regulated by the original TF. Thus, functional rSNPs in TF binding sites may predictably lead to differences in gene expression (Fig. 14.3) and phenotypes, and ultimately affect susceptibility to environmental exposure. Indeed, there are numerous examples of rSNPs associated with disease susceptibility, including hypercholesterolemia (Ono et al., 2003), hyperbilirubinemia (Sugatani et al., 2002; Bosma et al., 1995), myocardial infarction (Nakamura et al., 2002), acute lung injury (Marzec et al., 2007), and asthma (Jinnai et al., 2004). Identification and experimental verification of functional rSNPs are key limiting steps in an efficient functional polymorphism discovery process.
14.4.3.1 Bioinformatics
Successful bioinformatics identification of functional rSNPs requires identification of putative regulatory sequences, or motifs, and the co-location of SNPs in these sequences. Some genes have cis-regulatory sequences within 10 kb (sometimes farther) of the transcription start site. Computational methods for the identification of cis-regulatory sequences have been successfully applied to simple organisms such as yeast
and worm. Although some methods have been plagued by high false-positive rates in mammals, primarily because of the very large quantity of intergenic sequence present (Chang et al., 2006), many recent bioinformatics algorithms have improved prediction (Warner et al., 2008). These include examining evolutionarily conserved regulatory sequences in the upstream sequences of orthologous genes across species (Wang et al., 2007; Boffelli et al., 2003; Sun et al., 2004; Cliften et al., 2003) and identifying statistically overrepresented motifs in the upstream regions of genes that are co-regulated in microarray expression profiles (Warner et al., 2008; Haverty et al., 2004).
14.4.3.2 Experimental Assessment of Regulatory SNPs
After computational analyses, it is still necessary to verify the functional impact of novel gene polymorphisms using basic molecular biology techniques such as the electrophoretic mobility shift assay (EMSA) to assess DNA binding; chromatin immunoprecipitation (ChIP), a binding assay that provides insight into gene regulation in an endogenous state; and luciferase reporter constructs to test the effect of a SNP on regulatory element function. These procedures tend to be laborious and are not well suited to screening large numbers of DNA elements and SNPs. It is, therefore, imperative to develop high-throughput methods to assess regulatory regions in the genome. Indeed, Chorley and collaborators (2008) have reviewed some new high-throughput methodologies, such as surface plasmon resonance (SPR) imaging array analysis, oligoconjugated microsphere binding assays, and allelic imbalance methods, that can help shorten the process of assessing rSNP function.
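As a rough illustration of the motif-based reasoning in Section 14.4.3.1, the sketch below scores the reference and alternate alleles of a SNP against a position weight matrix and reports the predicted direction of change in TF binding, mirroring the outcomes summarized in Fig. 14.3. The count matrix, pseudocount, example sequences, and the 1.0-bit threshold are hypothetical values chosen for illustration; they are not taken from any TF database or from the studies cited here, and a prediction of this kind would still need the experimental verification described above.

```python
import math

# Hypothetical position count matrix for a 6-bp motif (one dict per position).
# The counts are invented for illustration only.
COUNTS = [
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 1, "C": 1, "G": 17, "T": 1},
    {"A": 17, "C": 1, "G": 1, "T": 1},
    {"A": 1, "C": 16, "G": 2, "T": 1},
    {"A": 5, "C": 5, "G": 5, "T": 5},  # degenerate position
    {"A": 1, "C": 1, "G": 1, "T": 17},
]
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 1.0   # assumed smoothing value


def log_odds_score(site: str) -> float:
    """Sum of per-position log2(p_motif / p_background) for one candidate site."""
    if len(site) != len(COUNTS):
        raise ValueError("site length must match motif length")
    score = 0.0
    for base, column in zip(site.upper(), COUNTS):
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        prob = (column[base] + PSEUDOCOUNT) / total
        score += math.log2(prob / BACKGROUND)
    return score


def classify_snp(ref_site: str, alt_site: str, threshold: float = 1.0) -> str:
    """Compare two alleles; the threshold (in bits) is an arbitrary cutoff."""
    delta = log_odds_score(alt_site) - log_odds_score(ref_site)
    if delta > threshold:
        return "increased binding predicted"
    if delta < -threshold:
        return "decreased binding predicted"
    return "no change predicted"


if __name__ == "__main__":
    print(classify_snp("TGACAT", "TGACCT"))  # SNP at the degenerate position
    print(classify_snp("TGACAT", "TCACAT"))  # SNP at a conserved position
```

In this toy motif, a substitution at the degenerate fifth position leaves the score unchanged, whereas a substitution at the conserved second position lowers it, which corresponds to the "no change" and "decreased binding" cases of Fig. 14.3.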
14.5 GENERAL STRATEGY TO ASSESS THE EFFECT OF A MUTATION ON THE FUNCTION OF A GENE AND THE OBSERVED PHENOTYPIC VARIATIONS
In summary, the overall strategy for determining the effect of a mutation on the function of a gene, or the mechanism by which a mutation affects a specific phenotype, starts with identifying the type of mutation, its location within the gene sequence, and its effect on the RNA or protein structure. In vitro functional studies then allow evaluation of the molecular pathways affected by the mutation, and in vivo studies determine the effect of the mutation on cell interactions and on the overall phenotype expressed by the affected individual.
14.5.1 Sequence Analyses
As described earlier, the first step in determining the role of a mutation in the function of a gene is to determine the type of mutation, its position in the DNA sequence, and its effect on the structure of the RNA or on the amino acid sequence. This can be done by sequencing followed by analysis with computer tools, as described.
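The sequence-analysis step can be sketched in a few lines of code. The example below is a minimal illustration, not one of the tools referred to in this chapter: it takes a coding sequence, a 0-based position, and an alternate base, and reports whether the substitution is synonymous, missense (nonsynonymous), nonsense, or stop-loss under the standard genetic code. The function name `classify_coding_snp` and the 15-bp coding sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds: str, position: int, alt_base: str):
    """Classify a single-base substitution in an in-frame coding sequence.

    Returns (category, reference amino acid, alternate amino acid);
    '*' denotes a stop codon. `position` is 0-based.
    """
    cds = cds.upper()
    alt_base = alt_base.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    if not 0 <= position < usable_length:
        raise ValueError("position is outside the complete codons of the CDS")
    if alt_base == cds[position]:
        raise ValueError("alternate base equals the reference base")

    start = position - position % 3
    offset = position % 3
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        category = "synonymous"
    elif alt_aa == "*":
        category = "nonsense"
    elif ref_aa == "*":
        category = "stop-loss"
    else:
        category = "missense"
    return category, ref_aa, alt_aa


if __name__ == "__main__":
    example_cds = "ATGGGCATCTGGTAA"  # hypothetical: Met-Gly-Ile-Trp-stop
    print(classify_coding_snp(example_cds, 5, "T"))   # GGC > GGT
    print(classify_coding_snp(example_cds, 6, "G"))   # ATC > GTC
    print(classify_coding_snp(example_cds, 11, "A"))  # TGG > TGA
```

A classification of "synonymous" here says only that the amino acid is unchanged; as Section 14.4 makes clear, such a variant may still affect mRNA stability, translation kinetics, or splicing.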
14.5.2 Tissue Specificity
It is important to determine which cells or tissues express the gene under analysis, since some genes are tissue or cell specific. Gene expression therefore has to be evaluated in different tissues and cells. This information helps in estimating the function of the gene and therefore in choosing the experiments used to test the effect of the mutation on that function.
14.5.3 Expression Profiling
There are different methods for determining the effect of a mutation on the expression of the mutated gene or protein. The tissues or cells that express the analyzed gene are isolated from affected and nonaffected individuals, and gene expression is then evaluated in the wild-type and affected cells by quantitative real-time polymerase chain reaction (RT-PCR) (Logan et al., 2009). Protein expression can be evaluated by western immunoblotting; protocols that describe different methods for isolating proteins and performing western blotting are available online (www.westernblotting.org). Immunohistochemical staining is also used to determine the effect of a mutation on protein expression and on the localization of the protein in different cell compartments.
14.5.4 In Vitro Functional Studies
Biologists usually start with in vitro studies to determine the functional consequence of a specific mutation. For in vitro studies, the choice of cells is crucial for the success of the study; in the case of a candidate gene/mutation, some investigators first use the cells that show high expression of the candidate. The expression and activity of the candidate gene, or of any molecule that interacts with it, are assessed in cells isolated from affected and nonaffected individuals, on the assumption that any difference between the affected and nonaffected cells is due to the mutation of interest. This approach allows investigators to measure different parameters with great precision and to explore the molecular mechanisms of the mutation studied. However, it does not eliminate the effect of the differing genetic backgrounds of the two cell populations, especially when the cells are human derived. It is therefore necessary to confirm the results by cloning and site-directed mutagenesis. cDNA from nonaffected individuals is generated with reverse transcriptase, and the cDNA of the analyzed gene is amplified and cloned under a strong promoter. The mutation of interest is then introduced by site-directed mutagenesis. The vector carrying the mutated allele and the control vector are each transfected into a single cell line to overcome the effect of different genetic backgrounds. The choice of cell line depends on the planned experiments. Wataha et al. (1994) tested four cell lines to determine the best cell line
for in vitro biological tests that assess the cytotoxicity of dental materials. Lindén et al. (2007) tested nine human gastrointestinal epithelial cell lines to improve in vitro model systems for gastrointestinal infection studies. The use of in vitro cell models for functional studies has many advantages over in vivo systems. First, variation among individuals is eliminated, especially when cell lines are used, as are confounding interactions with cell types other than the one under study. Moreover, in vitro systems can be manipulated in ways not possible in vivo, allowing investigators to measure the effects of different variables (e.g., temperature and pharmacological agents) with greater precision and to explore the molecular mechanisms of the gene/mutation studied. However, these advantages are offset by the loss of the in vivo context (e.g., cues from the extracellular matrix and from other cell types), which undoubtedly provides levels of regulation that are missing in vitro. For this reason, it is important to confirm the in vitro results in vivo and to compare the results obtained in the two systems.
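When expression of the candidate gene is compared between cells carrying the mutant and control vectors by quantitative RT-PCR (Section 14.5.3), the threshold cycle (Ct) values have to be converted into a relative expression level. The chapter does not specify how this is done; the sketch below uses the widely used 2^-ΔΔCt calculation as an assumption, with a reference (housekeeping) gene for normalization and an assumed amplification efficiency of about 100% for both genes. All Ct values are hypothetical.

```python
from statistics import mean


def fold_change(target_test, reference_test, target_control, reference_control):
    """Relative expression (test vs. control) by the 2^-ddCt calculation.

    Each argument is a list of replicate Ct values. The calculation assumes
    roughly 100% amplification efficiency for target and reference genes.
    """
    delta_ct_test = mean(target_test) - mean(reference_test)
    delta_ct_control = mean(target_control) - mean(reference_control)
    delta_delta_ct = delta_ct_test - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical Ct values, three technical replicates each.
    mutant_target = [21.9, 22.0, 22.1]      # candidate gene, mutant-vector cells
    mutant_reference = [18.0, 18.1, 17.9]   # reference gene, mutant-vector cells
    control_target = [24.0, 24.1, 23.9]     # candidate gene, control-vector cells
    control_reference = [18.0, 17.9, 18.1]  # reference gene, control-vector cells

    ratio = fold_change(mutant_target, mutant_reference,
                        control_target, control_reference)
    print(f"Expression in mutant relative to control: {ratio:.2f}-fold")
```

With these invented values the candidate transcript crosses threshold two cycles earlier in the mutant cells after normalization, which corresponds to roughly a fourfold higher level. A real analysis would also report replicate variability and check the efficiency assumption.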
14.5.5 In Vivo Functional Studies and Mouse Model
Animal models such as the fruit fly (Drosophila melanogaster), zebrafish (Danio rerio), and mouse (Mus musculus) have been used since 1930 to investigate different human traits and diseases (Paigen, 2003; Lieschke and Currie, 2007; Rosenthal and Brown, 2007). However, the laboratory mouse is considered the model organism of choice for determining in vivo the mechanisms by which different mutations influence specific human diseases, for the following reasons: (1) the mouse genome is the most completely described of any animal model, so mouse gene sequences can be compared with human sequences; (2) mice and humans share 99% of their genes; (3) mice and humans share most physiological and pathological features, and similarities in the nervous, cardiovascular, endocrine, immune, musculoskeletal, and other internal organ systems have been extensively documented; and (4) in a mouse model, environmental factors as well as genetic variation can be controlled.
At present, a number of mutagenesis strategies based on embryonic stem (ES) cells are used, all of which use homologous recombination to alter genes in their original location, producing either knockouts to cripple gene function or knockins to introduce a mutated version of the gene. Typically, this is done in mice, since the technology is more refined in this species and mouse ES cells are easily manipulated. The rationale for using ES cells to introduce mutations is that they have the capacity for self-renewal and broad differentiation plasticity. ES cells can be propagated as a homogeneous, uncommitted cell population for an almost unlimited period of time without losing their pluripotency or their stable karyotype.
Figure 14.4. Two methods of generating standard transgenic mice.
Even after extensive genetic manipulation, mouse ES cells are able to reintegrate fully into viable embryos when injected into a host blastocyst or aggregated with a host morula.
14.5.5.1 Standard Transgenic Mice
Strictly defined, transgenesis is the introduction of DNA from one species into the genome of another species, but at present the term transgenic is applied to any animal that carries foreign DNA (even from the same species) that was deliberately inserted into its genome. Many of the first transgenic mice were generated to study the overexpression of a human protein (Masliah and Rockenstein, 2000). Generating a standard transgenic mouse requires the gene carrying the mutation of interest, a strong mouse gene promoter and enhancer to allow the gene to be expressed, and bacterial or viral vector DNA to enable the transgene to be inserted into the mouse genome. The investigator can choose either a cell-specific promoter, which will induce gene expression only in specific cells, or a promoter that is active in all cells. It is important to add a selection marker, such as a neomycin resistance gene that permits selection with G418. Two methods of producing transgenic mice are widely used (Fig. 14.4). The first is to transform ES cells growing in tissue culture; the successfully transformed cells are then selected and injected into the inner cell mass of mouse blastocysts to generate embryos.
The second is to inject the desired gene into the pronucleus of a fertilized mouse egg. This method has been used to introduce a mutation at the nuclear factor interleukin 6 (NF-IL-6) DNA binding site in a mouse model and to evaluate the role of IL-6 in the response to environmental oxygen deprivation (Yan et al., 1997). Chiesa et al. (1998) used this transgenic method to evaluate the effect of an insertional mutation in a prion gene on a neurological disorder characterized clinically by ataxia and neuropathologically by cerebellar atrophy.
With either method, the embryos are then transferred into the uterus of a pseudopregnant foster mouse. Since the implantation success rate is estimated to be no more than 33% (Wang and Dey, 2006), at least three embryos should be transferred each time so that, on average, one embryo can be expected to implant. The offspring are then tested: a small piece of tissue from the tail is isolated, and its DNA is examined for the specific mutation. It is estimated that 10–20% of the progeny will carry the introduced mutation, and they will be heterozygous for the mutated allele. Heterozygous mice are then mated, and their offspring are screened for the one in four expected to be homozygous for the transgene. The transgenic mouse approach is relatively quick, but carries the risk that the DNA may insert itself into a critical locus, causing an unexpected, detrimental mutation. For this reason, several independent mouse lines containing the same transgene must be created and studied to ensure that any resulting phenotype is not due to toxic gene dosage or to mutations created at the site of transgene insertion.
14.5.5.2 Knockin Mice
To avoid the problems of standard transgenics, biologists have relied over the last 10 years on knockin mice to study the exogenous expression of a protein. In this method, a mutated DNA sequence is exchanged for the endogenous sequence without any other disruption of the gene. Knockin strategies rely on a method developed by Orban et al. (1992).
This procedure comprises heritable tissue-specific and site-specific DNA recombination as a function of recombinase expression in transgenic mice. Transgenes encoding the bacteriophage P1 Cre recombinase and a loxP-flanked β-galactosidase gene were used to generate transgenic mice. Gene vectors containing flanking sequences, termed loxP sites, are constructed to delete a specific exon of a gene in embryonic stem cells. When exposed to the enzyme Cre recombinase, the loxP sites undergo reciprocal recombination, leading to deletion of the intervening DNA. With this method, it is possible to replace the wild-type gene sequence with the mutated sequence, or vice versa, and to delete unnecessary sequences. The gene for Cre recombinase has been knocked into targeted loci in a way that brings its expression under the direction of the endogenous gene promoter, thus allowing tissue-specific or temporally specific expression of the Cre enzyme and hence recombination of the loxP sites that flank the gene of interest. Site-specific knockins result in a more consistent level of expression of the transgene from generation to generation because it is known that the
overexpression cassette is present as a single copy. Also, because a targeted transgene does not interfere with a critical locus, the investigator can be more certain that any resulting phenotype is due to the exogenous expression of the protein. The knockin mouse procedure requires more time to assemble the vector and to identify ES cells that have undergone homologous recombination, but it avoids many of the problems of a traditional transgenic mouse. The applications of this method are numerous, and some are already clinically useful. Knockin mouse models of Huntington disease have been developed by introducing the mutation responsible for this fatal disease, an abnormally expanded and unstable CAG repeat within the coding region of the gene encoding huntingtin (Menalled, 2005). Furthermore, knockin mouse models that carry the R345W mutation in the EFEMP1 gene (also called fibulin-3) have been developed to determine the mechanism by which this mutation causes age-related macular degeneration (Fu et al., 2007).
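The Cre/loxP deletion step described in this section can be pictured as a simple string operation. The sketch below is a toy model only: it treats a DNA molecule as text, looks for two loxP sites in the same orientation, and returns the product in which the intervening DNA and one loxP site have been removed, leaving a single loxP site behind. The 34-bp sequence is the commonly cited loxP site; the flanking and "exon" sequences and the function name `cre_excise` are hypothetical, and inversion between oppositely oriented sites is deliberately not modeled.

```python
# Commonly cited 34-bp loxP site: two 13-bp inverted repeats around an
# asymmetric 8-bp spacer, which gives the site its orientation.
LOXP = "ATAACTTCGTATAGCATACATTATACGAAGTTAT"


def cre_excise(dna: str) -> str:
    """Toy model of Cre-mediated excision between two direct-repeat loxP sites.

    Returns the sequence with the first loxP site and the intervening DNA
    removed, so that one loxP site remains. Sequences with fewer than two
    same-orientation sites are returned unchanged.
    """
    dna = dna.upper()
    first = dna.find(LOXP)
    if first == -1:
        return dna
    second = dna.find(LOXP, first + len(LOXP))
    if second == -1:
        return dna
    return dna[:first] + dna[second:]


if __name__ == "__main__":
    upstream = "GGGCCC"        # hypothetical flanking sequence
    floxed_exon = "AAAATTTT"   # hypothetical exon to be deleted
    downstream = "CCCGGG"      # hypothetical flanking sequence
    allele = upstream + LOXP + floxed_exon + LOXP + downstream

    recombined = cre_excise(allele)
    print(len(allele), len(recombined))
    print(floxed_exon in recombined)
```

In a real conditional allele, recombination occurs only in cells that express Cre, which is what makes the tissue-specific and temporally specific strategies described above possible.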
14.6 QUESTIONS AND ANSWERS
Q1. What is the advantage of association mapping over linkage analysis?
Q2. How do you distinguish between dominant and recessive mutations?
Q3. What is the purpose of a complementation test?
Q4. Which of the following SNPs can affect the expression and the function of the protein: synonymous, nonsynonymous, or regulatory SNPs?
Q5. What is the possible effect of a mutation in a noncoding region?
A1. In comparison with family-based linkage approaches, association studies have two key advantages. (1) They are able to capitalize on all meiotic recombination events in a population, rather than only those in the families studied. Because of this, association signals are localized to small regions of the chromosome containing only one or a few genes, enabling rapid detection of the actual disease susceptibility gene. (2) GWAS allow the identification of disease genes that confer only a modest increase in risk; the inability to detect such genes is considered a severe limitation of linkage studies.
A2. You can distinguish between dominant and recessive mutations by comparing the gene product of the heterozygous mutant with the normal product. If the heterozygote product is equal or close to the normal product, the mutation is considered recessive. If the heterozygote product differs from the normal product and is close to that of the homozygous mutant, the mutation is dominant.
A3. A complementation test allows the identification of the molecules that interact with the gene studied and determines whether the gene studied
is upstream or downstream of the others and whether it inhibits or activates a downstream target.
A4. All three types of SNP can affect function. A nonsynonymous SNP can affect the conformation of the protein and therefore alter its function, which could in turn affect the expression of the gene/protein. A synonymous SNP could affect the translation of the protein, and regulatory SNPs could alter promoter activity (gene expression), mRNA conformation (stability), and the subcellular localization of mRNAs and/or proteins.
A5. A mutation in a noncoding region can affect the expression of the affected gene and therefore its function. It can alter the stability of transcription factor binding, so that expression is increased, decreased, or even completely altered, or it can create a new transcription factor binding site.
14.7 REFERENCES
Alberti L, Proverbio MC, Costagliola S, Weber G, Beck-Peccoz P, Chiumello G, Persani L. (2001). A novel germline mutation in the TSH receptor gene causes nonautoimmune autosomal dominant hyperthyroidism. Eur J Endocrinol 145(3):249–54.
Allayee H, Ghazalpour A, Lusis AJ. (2003). Using mice to dissect genetic factors in atherosclerosis. Arterioscler Thromb Vasc Biol 23:1501–09.
Bajaj A, Driver JA, Schernhammer ES. (2010). Parkinson’s disease and cancer risk: a systematic review and meta-analysis. Cancer Causes Control 21(5):697–707.
Bice P, Valdar W, Zhang L, Liu L, Lai D, Grahame N, Flint J, Li TK, Lumeng L, Foroud T. (2009). Genomewide SNP screen to detect quantitative trait loci for alcohol preference in the high alcohol preferring and low alcohol preferring mice. Alcohol Clin Exp Res 33(3):531–37.
Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299:1391–94.
Bosma PJ, Chowdhury JR, Bakker C, Gantla S, de Boer A, Oostra BA, Lindhout D, Tytgat GNJ, Jansen PLM, Elferink RPJO, Chowdhury NR. (1995). The genetic basis of the reduced expression of bilirubin UDP-glucuronosyltransferase 1 in Gilbert’s syndrome. N Engl J Med 333:1171–75.
Capon F, Allen MH, Ameen M, Burden AD, Tillman D, Barker JN, Trembath RC. (2004). A synonymous SNP of the corneodesmosin gene leads to increased mRNA stability and demonstrates association with psoriasis across diverse ethnic groups. Hum Mol Genet 13:2361–68.
Cardon LR, Bell JI. (2001). Association study designs for complex diseases. Nat Rev Genet 2:91–99.
Chamary J-V, Parmley JL, Hurst LD. (2006). Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet 7:98–108.
Chang LW, Nagarajan R, Magee JA, Milbrandt J, Stormo GD. (2006). A systematic model to predict transcriptional regulatory mechanisms based on overrepresentation of transcription factor binding profiles. Genome Res 16:405–14.
Chiesa R, Piccardo P, Ghetti B, Harris DA. (1998). Neurological illness in transgenic mice expressing a prion protein with an insertional mutation. Neuron 21(6):1339–51.
Chorley BN, Wang X, Campbell MR, Pittman GS, Noureddine MA, Bell DA. (2008). Discovery and verification of functional single nucleotide polymorphisms in regulatory genomic regions: current and developing technologies. Mutat Res 659(1–2):147–57.
Clayton D, Jones H. (1999). Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet 65:1161–69.
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. (2003). Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76.
Darvasi A. (1998). Experimental strategies for the genetic dissection of complex traits in animal models. Nat Genet 18:19–24.
Edderkaoui B, Baylink DJ, Beamer WG, Wergedal JE, Porte R, Chaudhuri A, Mohan S. (2007). Identification of mouse Duffy antigen receptor for chemokines (Darc) as a BMD QTL gene. Genome Res 17(5):577–85.
Esapa CT, Duprez L, Ludgate M, Mustafa MS, Kendall-Taylor P, Vassart G, Harris PE. (1999). A novel thyrotropin receptor mutation in an infant with severe thyrotoxicosis. Thyroid 9(10):1005–10.
Fu L, Garland D, Yang Z, Shukla D, Rajendran A, Pearson E, Stone EM, Zhang K, Pierce EA. (2007). The R345W mutation in EFEMP1 is pathogenic and causes AMD-like deposits in mice. Hum Mol Genet 16(20):2411–22.
Haverty PM, Hansen U, Weng Z. (2004). Computational inference of transcriptional regulatory networks from expression profiling and transcription factor binding site identification. Nucleic Acids Res 32:179–88.
Horvath S, Xu X, Lake SL, Silverman EK, Weiss ST, Laird NM. (2004). Family based tests for associating haplotypes with general phenotype data: application to asthma genetics. Genet Epidemiol 26:61–69.
Jinnai N, Sakagami T, Sekigawa T, Kakihara M, Nakajima T, Yoshida K, Goto S, Hasegawa T, Koshino T, Hasegawa Y, Inoue H, Suzuki N, Sano Y, Inoue I. (2004). Polymorphisms in the prostaglandin E2 receptor subtype 2 gene confer susceptibility to aspirin-intolerant asthma: a candidate gene approach. Hum Mol Genet 13:3203–17.
Joanna L, Parmley LDH. (2007). How do synonymous mutations affect fitness? Bioessays 29:515–19.
Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM. (2007). A silent polymorphism in the MDR1 gene changes substrate specificity. Science 315:525–28.
Kleeberger SR, Reddy S, Zhang LY, Jedlicka AE. (2000). Genetic susceptibility to ozone-induced lung hyperpermeability: role of toll-like receptor 4. Am J Respir Cell Mol Biol 22(5):620.
Klein RF, Allard J, Avnur Z, Nikolcheva T, Rotstein D, Carlos AS, Shea M, Waters RV, Belknap JK, Peltz G, Orwoll ES. (2004). Regulation of bone mass in mice by the lipoxygenase gene Alox15. Science 303:229–32.
Knight JC. (2003). Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin Sci (Lond) 104:493–501.
Knight JC. (2005). Regulatory polymorphisms underlying complex disease traits. J Mol Med 83:97–109.
Korstanje R, Paigen B. (2002). From QTL to gene: the harvest begins. Nat Genet 31:235–36.
Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. (2003). Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered 55:56–65.
Lees CW, Satsangi J. (2009). Genetics of inflammatory bowel disease: implications for disease pathogenesis and natural history. Expert Rev Gastroenterol Hepatol 3(5):513–34.
Lieschke GJ, Currie PD. (2007). Animal models of human disease: zebrafish swim into view. Nat Rev Genet 8(5):353–67.
Lindén SK, Driessen KM, McGuckin MA. (2007). Improved in vitro model systems for gastrointestinal infection by choice of cell line, pH, microaerobic conditions, and optimization of culture conditions. Helicobacter 12(4):341–53.
Lipman DJ. (1997). Making (anti)sense of non-coding sequence conservation. Nucleic Acids Res 25:3580–83.
Logan J, Edwards K, Saunders N. (2009). Real-Time PCR: Current Technology and Applications. Applied and Functional Genomics. Health Protection Agency, London.
Marzec JM, Christie JD, Reddy SP, Jedlicka AE, Vuong H, Lanken PN, Aplenc R, Yamamoto T, Yamamoto M, Cho HY, Kleeberger SR. (2007). Functional polymorphisms in the transcription factor NRF2 in humans increase the risk of acute lung injury. FASEB J 21(9):2237–46.
Masliah E, Rockenstein E. (2000). Genetically altered transgenic models of Alzheimer’s disease. J Neural Transm Suppl 59:175–83.
Menalled LB. (2005). Knock-in mouse models of Huntington’s disease. NeuroRx 2(3):465–70.
Morris AP, Whittaker JC, Balding DJ. (2004). Little loss of information due to unknown phase for fine-scale linkage-disequilibrium mapping with single-nucleotide-polymorphism genotype data. Am J Hum Genet 74:945–53.
Nackley AG, Shabalina SA, Tchivileva IE, Satterfield K, Korchynskyi O, Makarov SS, Maixner W, Diatchenko L. (2006). Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science 314:1930–33.
Nadeau JH, Frankel WN. (2000). The roads from phenotypic variation to gene discovery: mutagenesis versus QTL. Nat Genet 25:381–84.
Nakamura S, Kugiyama K, Sugiyama S, Miyamoto S, Koide S, Fukushima H, Honda O, Yoshimura M, Ogawa H. (2002). Polymorphism in the 5’-flanking region of human glutamate-cysteine ligase modifier subunit gene is associated with myocardial infarction. Circulation 105:2968–73.
Ng PC, Henikoff S. (2001). Predicting deleterious amino acid substitutions. Genome Res 11(5):863–74.
Ng PC, Henikoff S. (2002). Accounting for human polymorphisms predicted to affect protein function. Genome Res 12(3):436–46.
Okamoto K, Iwasaki N, Nishimura C, Doi K, Noiri E, Nakamura S, Takizawa M, Ogata M, Fujimaki R, Grarup N, Pisinger C, Borch-Johnsen K, Lauritzen T, Sandbaek A, Hansen T, Yasuda K, Osawa H, Nanjo K, Kadowaki T, Kasuga M, Pedersen O, Fujita T, Kamatani N, Iwamoto Y, Tokunaga K. (2010). Identification of KCNJ15 as a susceptibility gene in Asian patients with type 2 diabetes mellitus. Am J Hum Genet 86(1):54–64.
Ono S, Ezura Y, Emi M, Fujita Y, Takada D, Sato K, Ishigami T, Umemura S, Takahashi K, Kamimura K, Bujo H, Saito Y. (2003). A promoter SNP (-1323T>C) in G-substrate gene (GSBS) correlates with hypercholesterolemia. J Hum Genet 48:447–50.
Orban PC, Chui D, Marth JD. (1992). Tissue- and site-specific DNA recombination in transgenic mice. Proc Natl Acad Sci U S A 89(15):6861–65.
Paigen K. (2003). One hundred years of mouse genetics: an intellectual history. II. The molecular revolution (1981–2002). Genetics 163(4):1227–35.
Ponomarenko JV, Orlova GV, Merkulova TI, Gorshkova EV, Fokin ON, Vasiliev GV, Frolov AS, Ponomarenko MP. (2002). rSNP guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites. Hum Mutat 20:239–48.
Pritchard JK, Przeworski M. (2001). Linkage disequilibrium in humans: models and data. Am J Hum Genet 69:1–14.
Rosenthal N, Brown S. (2007). The mouse ascending: perspectives for human-disease models. Nat Cell Biol 9(9):993–99.
Satten GA, Epstein MP. (2004). Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epidemiol 27:192–201.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. (2002). Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425–34.
Schaid DJ. (2004). Evaluating associations of haplotypes with traits. Genet Epidemiol 27:348–64.
Shastry BS. (2009). SNPs: impact on gene function and phenotype. Methods Mol Biol 578:3–22.
Sheffer R, Zlotogora J. (1992). Autosomal dominant inheritance of Klein-Waardenburg syndrome. Am J Med Genet 42(3):320–22.
Sugatani J, Yamakawa K, Yoshinari K, Machida T, Takagi H, Mori M, Kakizaki S, Sueyoshi T, Negishi M, Miwa M. (2002). Identification of a defect in the UGT1A1 gene promoter and its association with hyperbilirubinemia. Biochem Biophys Res Commun 292:492–97.
Sun YV, Boverhof DR, Burgoon LD, Fielden MR, Zacharewski TR. (2004). Comparative analysis of dioxin response elements in human, mouse and rat genomic sequences. Nucleic Acids Res 32:4512–23.
Sun M, Hurst LD, Carmichael GG, Chen J. (2006). Evidence for variation in abundance of antisense transcripts between multicellular animals but no relationship between antisense transcription and organismic complexity. Genome Res 16:922–33.
Thomas JT, Kilpatrick MW, Lin K, Erlacher L, Lembessis P, Costa T, Tsipouras P, Luyten FP. (1997). Disruption of human limb morphogenesis by a dominant negative mutation in CDMP1. Nat Genet 17(1):58–64. Tixier-Boichard M. (2002). From phenotype to genotype: Major genes in chickens. World’s Poul Sc J 58:35–43, 65–75. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R. (2010). Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 6(1):e1000641 Wang H, Dey SK. (2006). Roadmap to embryo implantation: Clues from mouse models. Nat Rev Genet 7(3):185–99. Review. Wang X, Tomso DJ, Chorley BN, Cho HY, Cheung VG, Kleeberger SR, Bell DA. (2007). Identification of polymorphic antioxidant response elements in the human genome. Hum Mol Genet 16:1188–200. Wang X, Tomso DJ, Liu X, Bell DA. (2005). Single nucleotide polymorphism in transcriptional regulatory regions and expression of environmentally responsive genes. Toxicol Appl Pharmacol 207:84–90. Warner JB, Philippakis AA, Jaeger SA, He FS, Lin J, Bulyk ML. (2008). Systematic identification of mammalian regulatory motifs’ target genes and functions. Nat Methods 5:347–53. Wataha JC, Hanks CT, Sun Z. (1994). Effect of cell line on in vitro metal ion cytotoxicity. Dent Mater 10:156–61. Williams FM, Spector TD. (2007). The genetics of osteoporosis. Acta Reumatol Port 32(3):231–40. Wollnik B, Tukel T, Uyguner O, Ghanbari A, Kayserili H, Emiroglu M, Yuksel-Apak M. (2003). Homozygous and heterozygous inheritance of PAX3 mutations causes different types of Waardenburg syndrome. Am J Med Gene A 122A(1):42–5. Yan SF, Zou YS, Mendelsohn M, Gao Y, Naka Y, Du Yan S, Pinsky D, Stern D. (1997). Nuclear factor interleukin 6 motifs mediate tissue-specific gene transcription in hypoxia. J Biol Chem 272(7):4287–94. Yen PM, Chin WW. (1994). Molecular mechanisms of dominant negative activity by nuclear hormone. receptors. Mol Endocrinol 8:1450–54.
CHAPTER 15
Confirmation of a Mutation by Multiple Molecular Approaches
HECTOR MARTINEZ-VALDEZ and BLANCA ORTIZ-QUINTERO
Contents
15.1 Introduction
  15.1.1 Gene Expression Overview
  15.1.2 Mutations
  15.1.3 Other Factors That Affect Gene Expression
15.2 mRNA Expression by Real-Time PCR
  15.2.1 Theory, Scheme, and Scope
  15.2.2 Comparative Appraisals
  15.2.3 Quantitation and Data Report
15.3 DNA Sequencing
  15.3.1 Scope and Evolution
  15.3.2 Chemistry, Reaction, and Analysis
  15.3.3 Outsourcing
15.4 In Situ Hybridization
  15.4.1 Experimental Considerations
  15.4.2 Scope and New Developments
15.5 Expression at the Protein Level
  15.5.1 Overview
  15.5.2 Antibody Technology
  15.5.3 Evolution and Scope of Protein Analyses
15.6 Genetically Engineered Animals
  15.6.1 Enforced Expression in Transgenic Mice
  15.6.2 Gene Targeting/Knockout
  15.6.3 Pitfalls and Solutions
15.7 Concluding Remarks
15.8 Acknowledgments
15.9 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
15.1 INTRODUCTION
15.1.1 Gene Expression Overview
In most eukaryotic cells, the DNA content is several orders of magnitude greater than that required for the coding of proteins (Rangel et al., 2005). However, beyond the sequence annotations for protein coding, the human genome stands out with a complex gene expression diversity that breaches the limits of gene usage and function (Rangel et al., 2005). Such diversity is largely contributed by genetic rearrangements (Chen and Alt, 1993), differential RNA processing, and alternative translation initiation mechanisms (Sims-Mourtada et al., 2005). In keeping with this notion, genes previously thought to yield only noncoding germline mRNAs are increasingly being documented to productively translate into small and long polypeptides, which conform to the cell lineage phenotype and function (Rangel et al., 2005; Frances et al., 1994; Jolly and O’Neill, 1997; Saint-Ruf et al., 1994; Erdmann et al., 2000; McKeller and Martinez-Valdez, 2006). Such remarkable diversity stems from the distinct expression patterns of genes in different tissues (Rangel et al., 2005). Whereas some gene products are required for basal cell functions and are constitutively expressed by most cells, restricted transcription and translation dictate tissue- and cell-specific functions (Rangel et al., 2005). For instance, the expression of antigen (Ag) receptor genes is developmentally controlled and is exclusive to immune cells (Alt et al., 1992). A problem with such restriction, as experienced by the adaptive immune system in mammals, is the requirement for a broad repertoire of Ag receptor genes that surpasses the coding capacity of the genome (Yang et al., 2003). However, nature has tailored gene rearrangement mechanisms to circumvent the diversity constraints of the Ag receptor loci and to ensure a large repertoire of Ag specificities (Chen and Alt, 1993; Rajewsky, 1996; Sleckman et al., 1996).
Notably, mouse and human genomes carry fewer than twice the number of genes present in lower eukaryotic genomes, which indicates that their greater complexity requires mechanisms that diversify and increase the number of gene products derived from a single gene (Sims-Mourtada et al., 2005; Landry et al., 2003). Among those, mRNA alternative splicing is perhaps the most frequent and diversifying (Sims-Mourtada et al., 2005). Recent estimates indicate that 35–50% of mammalian genes undergo alternative splicing (Landry et al., 2003; Wen et al., 2004). Whereas alternative splicing frequently leads to changes in protein structure, subcellular localization, and/or function, in some cases differential mRNA processing affects only the 5′ or 3′ untranslated regions without altering the makeup of functional motifs or protein structures. However, taken as a whole, alternative splicing has a profound influence on protein output and magnifies gene expression diversity. Another means of enhancing gene expression diversity is the use of alternative promoters. This has the additional feature of having the potential to create
complex regulatory diversity (Landry and Mager, 2002). Although not as common as alternative splicing, alternative promoter usage is a frequent regulatory mechanism that occurs in at least 18% of the mammalian genes (Trinklein et al., 2003). The use of alternative promoters permits genes to exhibit more than one pattern of tissue specificity and developmental control (Trinklein et al., 2003; Saleh et al., 2002; Medstrand et al., 2001). In some cases, alternative promoters not only provide regulatory versatility but dictate the expression of different protein isoforms, thereby greatly expanding both the translational and transcriptional capacity of the genome (Sims-Mourtada et al., 2005). In higher eukaryotes, polyadenylation serves as an added layer of regulatory diversification, which is accomplished by the addition of poly(A) tails at the 3′ end of all mRNAs, except histone transcripts (Zarudnaya et al., 2003). The site where polyadenylation occurs can have profound functional consequences, as it dictates the length and sequence of the 3′ untranslated region, which in turn can control RNA splicing events, mRNA stability, and translation rates (Zarudnaya et al., 2003). Thus gene rearrangement, alternative splicing, differential promoter usage, and alternative polyadenylation constitute important mechanisms that have significant impact on the control of gene expression and functional diversity.

15.1.2 Mutations
The complexity of the human genome is underscored by its unprecedented sequence diversity, variation in gene copy number, inherent potential for base-pair mutations that can lead to single nucleotide polymorphisms (SNPs), and the enormous plasticity for gene rearrangements, insertions, deletions, and translocations, which can radically change gene expression, cellular functions, and organism phenotype. In keeping with this notion, point mutations, deletions, insertions, and recombinations of selected gene loci can occur, under physiological conditions, as a result of programmed mechanisms that enhance genome usage. For instance, the efficacy of antibody protection against pathogens relies on specific recognition functions. Because antibodies are encoded by distinct immunoglobulin (Ig) gene segments, mechanisms are tailored (the good news) to ensure a broad spectrum of pathogen recognition by multifaceted gene rearrangements and mutations. Conversely, these mechanisms are susceptible to intrinsic (genetic) and/or extrinsic (biological, physical, or chemical gene lesions) derailment that can result in the loss of the programmed function or the gain of deleterious pathology (the bad news). The physiological (good news) and pathological (bad news) genetic scenarios are detailed below.

15.1.2.1 The Good News
Physiological gene rearrangement, somatic hypermutation and class switch recombination (CSR). During B lymphocyte development in the bone marrow, gene segments on the immunoglobulin (Ig)
heavy (H) and light (L) chain loci assemble in a defined, ordered manner (Rangel et al., 2005; Chen and Alt, 1993; Frances et al., 1994; McKeller and Martinez-Valdez, 2006; Alt et al., 1992; Tonegawa, 1983; Dudley et al., 2005; Puebla-Osorio and Zhu, 2008). Thus germline B cell progenitors (pro-B cells) rearrange DH and JH genes (pre-B1 cells) and subsequently join a given VH gene to the rearranged DH-JH segment (pre-B2 cells), allowing the expression of IgH chain. Finally, if a successful VL to JL rearrangement occurs, the cells can express κ or λ chains and can mature into B cells, displaying surface IgM antigen receptor (Rangel et al., 2005; Chen and Alt, 1993; Frances et al., 1994; McKeller and Martinez-Valdez, 2006; Alt et al., 1992; Tonegawa, 1983; Dudley et al., 2005; Puebla-Osorio and Zhu, 2008). The subsequent acquisition of B cell memory from naïve cells takes place within a highly specialized microenvironment, the germinal centers (GC) of the secondary lymphoid organs (Siddiqa et al., 2001; Guzman-Rojas et al., 2002).
Within the GC microenvironment, the B cell maturation program faces a series of genetic events that include changes in the regulation of cell cycle checkpoint genes that result in the proliferation of Ag-specific B cells (Thorbecke et al., 1994; Siepmann et al., 2001), somatic diversification of the IgV domains by the introduction of point mutations that occur as a result of active DNA polymerase during clonal cell expansion (Jacob et al., 1991; Pascual et al., 1994; Liu et al., 1996a; 1996b), selection of the functional (high-affinity) Ag-specific B cell repertoire against low-affinity and autoreactive cells by a concerted regulation of survival or death-inducing genes (Choe et al., 1996; Martinez-Valdez et al., 1996; Rathmell et al., 1996), and the intramolecular switch of the constant Ig regions, from IgM to the IgG, IgA, or IgE isotypes (Liu et al., 1996b, 1996c), to express immunoglobulin (Ig) receptors with high-affinity Ag-binding (IgV) domains that are associated with μ, γ, α, or ε constant regions (Liu et al., 1996a, 1996c). Thus, the memory of the immune system is borne by B and T lymphocytes, which make rapid and robust humoral (antibody) responses or cell-mediated responses upon repeated antigenic invasion (Siddiqa et al., 2001; Guzman-Rojas et al., 2002; Liu et al., 1996b, 1996c; Martinez-Valdez et al., 1996).

15.1.2.2 The Bad News
Autoreactive and malignant GC B cells represent a unique class of disorders because they originate from cells of the immune system that divert from the normal maturation programs, via genetic rearrangements or somatic mutations (Guzman-Rojas et al., 2002). Since gene rearrangement, somatic hypermutation, Ag receptor editing, and isotype switching are the physiological landmarks of Ag-driven GC responses (Guzman-Rojas et al., 2002; Unniraman and Schatz, 2006; Casellas et al., 2001; Meffre et al., 1998; Franco et al., 2006), the risk for genetic lesions and hence autoimmunity and malignant transformation is exponentially enhanced.
In line with this rationale, T cells, which do not undergo receptor editing, somatic hypermutation, or isotype switching, give rise to 10 to 20 times fewer lymphoproliferative diseases than B cells (Guzman-Rojas et al., 2002). It is during these
stages that B lymphocytes become targets of aberrant rearrangements and mutations that result in diverse forms of lymphoproliferative disorders, including autoimmunity, leukemia, and lymphoma (Frances et al., 2000; Malisan et al., 1996a; O’Brien et al., 1995; Sawyers et al., 1991). Moreover, somatic hypermutation is a fundamental mechanism by which diversity for the antibody repertoire is created, and such function is exponentially amplified within GC (Liu et al., 1996b, 1996c). Whereas immunoglobulin (Ig) genes rapidly became the stereotyped targets, due to their high molecular frequency, non-Ig genes can also be targeted for somatic hypermutation (Pasqualucci et al., 1998; Storb et al., 2001; Parsa et al., 2007; Watanabe-Fukunaga et al., 1992; Takahashi et al., 1994; Muschen et al., 2002; Muschen et al., 2000a, 2000b). Consequently, Fas and other GC genes gained celebrity-like notoriety and gradually became de facto endangered tumor suppressors at the mercy of the predatory GC somatic mutation machinery (Guzman-Rojas et al., 2002).

15.1.3 Other Factors That Affect Gene Expression
Decades of research in diverse walks of science have unveiled how chromosome rearrangements, mutations and epigenetic mechanisms contribute to altered cell functions and organism phenotypes (Pfeifer and Besaratinia, 2009; Partanen et al., 2009; Mathews et al., 2009; Herceg and Hainaut, 2007). However, deciphering the pathological scenario at the molecular level can be complex and involve the cooperative alteration of multiple genes, which override normal checkpoint mechanisms and produce an intricate phenotype (Hussain et al., 2009; Jones and Thompson, 2009). Moreover, the majority of gene lesions affect cells undergoing distinct developmental stages (Malisan et al., 1996b; O’Brien et al., 1995; Sawyers et al., 1991), where the genetic and epigenetic events can equally target proto-oncogene and tumor suppressor functions and lead to deregulation of cell proliferation and/or survival (Hussain et al., 2009; Jones and Thompson, 2009; Porter and Polyak, 2003; Fusco and Fedele, 2007; Van Vlierberghe et al., 2008). It is for this reason that the characterization of genes, whose function may equally influence the regulation of normal cell development and tumorigenesis, is of critical significance. As an example of how ancillary genetic and epigenetic events can be efficiently documented, genomewide screens and data processing emerge as a technological breakthrough (Landvik et al., 2009; Marcucci et al., 2008; Savas and Liu, 2009; Thye et al., 2003) that enables array-based comparative genome hybridization (CGH) studies and concomitant assessment of major regions of chromosome fragility (Blaveri et al., 2005; Engelmark et al., 2008; Nymark et al., 2006; Ross et al., 2007). 
A major advantage of the technology is its versatility to query common fragile sites (CFS) of synteny with mouse chromosomes (Bauer and Rondini, 2009; Helmrich et al., 2006), which prompts the rationale for the generation and study of genetically engineered animal models (Sims-Mourtada et al., 2005; Bauer and Rondini, 2009; Helmrich et al., 2006;
Callahan et al., 2003; Festing et al., 1998; Kleeberger et al., 2000). Furthermore, genomewide scans can also interrogate gene structure, expression profiles, and copy number variations of normal, experimental, and pathological specimens (Geisert et al., 2009; Xiong et al., 2008a; Ikram et al., 2009; Schejeide et al., 2009; Takezaki and Nei, 2009; Yan et al., 2009; Li et al., 2008). Added features of genomewide screening that are amenable to quantitative measurement include transcriptional activity (Jiao et al., 2009; Xiong et al., 2008b), DNA methylation and/or acetylation, and regulatory microRNA expression. As gene screens reveal exceedingly broad variations in chromatin revisions, microRNA engagement, gene copy numbers, insertions, deletions, inversions, and translocations (Bild et al., 2006; Hartmann et al., 2008; Yuille et al., 2001; Pritchard and Przeworski, 2001; Mullighan et al., 2009; Tay et al., 2009; Fan et al., 2007; Zhang et al., 2009; Jaillard et al., 2009; Karnan et al., 2006; Calin and Croce, 2007; Liu et al., 2008; Yu et al., 2008), apt technology must be applied to validate the pathophysiological relevance of genomewide screens (Zhang et al., 2009; Bentley et al., 2008; Dunckley et al., 2007; Gallardo et al., 2008; Scheinfeldt et al., 2009; Wheeler et al., 2008). These are discussed in subsequent sections.

15.2 MRNA EXPRESSION BY REAL-TIME PCR
15.2.1 Theory, Scheme, and Scope
As stated, genomewide microarray screens permit high-throughput genetic queries and large-scale gene expression analyses. However, quantitative confirmation is necessary for meaningful interpretations, and real-time PCR technology possesses the accuracy, sensitivity, and specificity to validate gene discovery, structure, mutation patterns, polymorphisms, and expression.
Quantitative real-time PCR has held center-stage attention for over a decade (Gibson et al., 1996; Luthra and Medeiros, 2006; Mocellin et al., 2003; Murphy and Bustin, 2009; Rooney et al., 2005; Wang and Brown, 1999; Garcia-Castillo and Barros-Nunez, 2009) for quite pragmatic reasons: dynamics, high sensitivity, unprecedented specificity and reproducibility, and the potential for large sample management (Murphy and Bustin, 2009; Rooney et al., 2005; Wong and Medrano, 2005; Szczepanski, 2007). The incredible pace of real-time PCR applications, virtually reaching personalized genetic analyses, enables molecular medicine to quantitatively confirm gene expression patterns, copy number variations, single nucleotide polymorphisms, allelic discrimination, and gene lesions (deletions, insertions, or inversions), a particularly critical feat when mutations in the sample population are underrepresented (Deepak et al., 2007). In keeping with personalized genetic analyses, individual records can be cross-examined against multifaceted databases that include clinical, drug response, epidemiology, gender, race, phenotype, job history, and geography/environment parameters. Cross-examined information can thus serve to establish genetic links to disease, adverse reactions to medication, and drug resistance traits (Deepak et al., 2007;
Severino and Zompo, 2004). Hence coordinate application of quantitative real-time PCR with multiparameter database information could become a rational means to personalized medical intervention and assessment of individual responses to therapy. Cancer, immunodeficiency, autoimmunity, pathogen infection, diabetes, neurodegeneration, and cardiovascular and respiratory disorders are among the most challenging threats, where confirmatory real-time PCR can be of enormous value to decode genetic substrates associated with the pathology.

Real-time PCR measures the initial content of DNA or reverse-transcribed mRNA templates by combining log amplification and detection parameters, whose progression can be monitored in real time. These key features are in contrast with other PCR methods, which are designed to record only the end-point amplified product (Espy et al., 2006; Freeman et al., 1999; Raeymaekers, 2000). Hence real-time PCR has become the preferred method to quantitatively assess gene structure, copy number, rearrangements, deletions, insertions, inversions, and expression (Gibson et al., 1996; Luthra and Medeiros, 2006; Mocellin et al., 2003; Murphy and Bustin, 2009; Garcia-Castillo and Barros-Nunez, 2009; Muller et al., 2004). The principle of real-time PCR analysis is the detection of the fluorescence emitted at each reaction cycle, which measures amplicon production per cycle. Toward that end, real-time PCR relies on the quantitative detection of reporter probes, whose fluorescence emission increases proportionally with the amount of PCR product that is generated. The quantitation of the fluorescence emitted per cycle serves as the means to assess the state of the reaction at an exponential stage, in which the level of the amplified product corresponds to the starting amount of template. The overall conceptual message is that the higher the concentration of the DNA or cDNA template, the earlier the detection of log fluorescent increments.
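The link between starting template and the cycle of first detection can be made explicit with a simple idealized model. It is offered here as an illustrative sketch, not as a formula taken from the works cited above. The symbols are assumptions introduced for this example: N_0 is the initial number of template copies, E the per-cycle amplification efficiency (E = 1 for perfect doubling), N_c the number of copies after c cycles, N_T the copy number at which fluorescence becomes detectable above background, and C_t the cycle at which that level is reached.

```latex
\begin{align}
N_c &= N_0\,(1+E)^{c} \\
N_T &= N_0\,(1+E)^{C_t}
\quad\Longrightarrow\quad
C_t = \frac{\log_{10} N_T - \log_{10} N_0}{\log_{10}(1+E)}
\end{align}
```

Under this model, C_t falls linearly as the logarithm of N_0 rises. With perfect doubling (E = 1), each tenfold increase in starting template brings detection forward by 1/log10(2), or about 3.32 cycles.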
Whereas there are multiple (albeit redundant) modes to define the progression and significance of real-time PCR stages, in the interest of simplicity a standard reaction can be divided into three main stages: (1) exponential, (2) linear, and (3) plateau. During the early exponential stage, fluorescence emission reaches a level of increment above background baseline that reflects template concentration at origin. The threshold value obtained at the cycle where the increased fluorescence shift is first detected, known as cycle threshold (Ct), is applied to quantitatively determine the changes in the experimental samples (Gibson et al., 1996; Heid et al., 1996; Lay and Wittwer, 1997). Under optimal conditions, exponential doubling of DNA or cDNA copies reaching the most favorable amplification ratios follows the primary fluorescent shift. The linear stage is characterized by a higher degree of variability, which results from substrate consumption and degradation, and which ultimately leads to a slower pacing reaction. The plateau stage is basically the end point of the reaction, marked by the arrest of amplification products and increased vulnerability for DNA or cDNA degradation. It must be emphasized that upon reaching the plateau stage, changes in fluorescence emission are negligible, irrelevant, and without quantitative value (Wong and Medrano, 2005; Bustin, 2000).
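The threshold-crossing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the algorithm of any particular instrument: `simulate_curve` is a hypothetical generator of fluorescence traces (exponential growth damped toward a plateau), the baseline, scale, and threshold values are arbitrary, and `cycle_threshold` simply interpolates linearly between the two cycles that bracket the threshold.

```python
def simulate_curve(n0, cycles=40, efficiency=1.0, plateau=1e12,
                   baseline=0.02, scale=1e-11):
    """Hypothetical fluorescence trace: near-exponential growth that
    slows as the copy number approaches a plateau."""
    trace = []
    copies = float(n0)
    for _ in range(cycles):
        # Growth per cycle is damped as copies approach the plateau.
        copies += efficiency * copies * (1.0 - copies / plateau)
        trace.append(baseline + scale * copies)
    return trace


def cycle_threshold(trace, threshold):
    """Return the fractional cycle (1-based) at which the trace first
    reaches the threshold, or None if it never does."""
    if trace and trace[0] >= threshold:
        return 1.0
    for i in range(1, len(trace)):
        lo, hi = trace[i - 1], trace[i]
        if lo < threshold <= hi:
            # trace[i - 1] is cycle i and trace[i] is cycle i + 1;
            # interpolate linearly between them.
            return i + (threshold - lo) / (hi - lo)
    return None


# Two hypothetical samples that differ tenfold in starting template.
for copies_in in (1e3, 1e4):
    ct = cycle_threshold(simulate_curve(copies_in), threshold=0.1)
    print(f"{copies_in:.0e} starting copies -> Ct = {ct:.2f}")
```

In this sketch the sample with ten times more template crosses the threshold a little over three cycles earlier, in line with the near-doubling per cycle expected in the exponential stage. The threshold is deliberately placed well below the plateau, where changes in fluorescence no longer carry quantitative information.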
15.2.2 Comparative Appraisals
Given the nature of the real-time PCR chemistry, kinetic measurements of DNA or cDNA amplification can be achieved during the early exponential phase of the reaction. This is in contrast with nonquantitative standard PCR, in which detection can be accomplished only at the end of the reaction, through the resolution of final plateau-phase PCR products by gel (agarose or acrylamide) electrophoresis and fluorescent staining (ethidium bromide, acridine orange, and SYBR dyes) or autoradiography (alternatively phosphoimaging), when electrophoresis is followed by Southern transfer and hybridization with radioactively labeled (commonly ³²P-NTP) probes. It must be emphasized that the qualitative information obtained by combined gel electrophoresis, fluorescent staining, and/or autoradiography originates from end-point PCR products. In this context, the results are largely based on reference size discrimination, which is prone to inaccuracy due to potential product degradation and hence without quantitative value. Congruent with these differences, real-time PCR quantitation does not require postreaction processing, which facilitates the management of large and complex sample analyses, enables the simultaneous processing of multiple repeats for statistical validation, and reduces the risk of sample contamination. Moreover, unlike nonquantitative methods, real-time PCR possesses a broader range over which variations in template concentration can be accurately detected (roughly 10⁷-fold, against 10³-fold for standard PCR), commonly known as reaction dynamics (Wagatsuma et al., 2005; Louvel et al., 2008). In brief, a broad dynamic range provides permissible concentration ratios between experimental and housekeeping control templates to carry the reaction with comparable sensitivity and specificity. In other words, accuracy of the reaction is proportional to the dynamic range.
Overall, real-time PCR not only surpasses conventional PCR in quantitative reach, accuracy, and sensitivity but also stands alone against the most widely used approaches to genetic analyses, including RNAse protection assay (Wang and Brown, 1999; Wong and Medrano, 2005) and dot-blot hybridization (Wong and Medrano, 2005; Malinen et al., 2003). Among the key features that gained real-time PCR center-stage prominence in biomedicine are the feasibility of detecting a single copy of DNA or mRNA (Hyvarinen et al., 2009; Palmer et al., 2003; Barragan et al., 2001), the capability to discern subtle differences in detection levels between compared templates (Gentle et al., 2001; Reil et al., 2008), and the capacity to discriminate virtually identical gene copy isoforms (Wong and Medrano, 2005; Rodriguez-Manotas et al., 2006; Louis et al., 2004). Because real-time PCR applications can amplify both DNA and mRNA substrates, inherent differences must be taken into account. For instance, under homeostatic conditions, DNA content, gene mutations, and polymorphisms are equally represented in all cells of a given organism and usually uninfluenced by internal or external environmental cues (Bustin and Nolan, 2004; Nannya et al., 2005). In contrast, mRNA transcription, processing and
stability largely depend on both intrinsic and extrinsic factors, which ultimately determine the physiological steady-state levels (Neu-Yilik and Kulozik, 2008). This means that whereas DNA presents a stable number of target gene templates, mRNA copies of a given transcript can be highly variable and depend on the state of cell maturation, differentiation, and/or activation (Bustin and Nolan, 2004; Neu-Yilik and Kulozik, 2008). Unlike DNA, mRNA is relatively unstable and susceptible to degradation during extraction procedures, which can contribute to target template variability in quality and quantity. Such limitation impinges not only on the accuracy of reverse-transcription and real-time PCR reactions but also on data interpretation and biological impact. Whereas human error cannot be completely ruled out, the fragile quality of mRNA preparations is inevitably associated with the nature of the biological specimen, such as those obtained postmortem, sampled as transoperatory biopsies, or collected as sorted or microdissected single cell preparations. Multiple preventive approaches (not discussed in this chapter) are now in place to preserve the quality of unique mRNA samples; hence cDNA amplification by real-time PCR remains the state-of-the-art method to evaluate common and rare gene expression. Other noteworthy considerations include the cost of equipment and validated reagents. Although equipment may not be a limiting factor, since most departments of prominent institutions provide the necessary hardware, real-time PCR reagents of superior quality that ensure reproducible experimentation are expensive consumables and are the sole responsibility of the investigator. Last, real-time PCR reactions are designed neither to measure the size of the amplified product nor to discriminate between DNA and cDNA templates.
Then again, these parameters are not needed to evaluate the accuracy, specificity, or reproducibility of real-time PCR reactions and can be assessed independently. For example, amplified products can be readily analyzed by electrophoresis or directly sequenced, which concomitantly provides size and gene identity information. On the other hand, RNA-specific chromatography, oligo d(T)-dependent reverse transcription, and/or enzymatic clearing of DNA can be applied to achieve discriminating cDNA amplification.

15.2.3 Quantitation and Data Report
Real-time PCR quantitation is a direct function of the detection of fluorescent reporters, whose log increase is proportional to initial template input and the magnitude of amplified DNA or cDNA. Namely, faster fluorescence detection reflects higher template input. For pragmatic purposes, three categories of fluorescent reporter chemistries applied to real-time PCR are herein considered: (1) hydrolysis, (2) hybridization, and (3) DNA-binding (Mackay and Landt, 2007). As an example of hydrolysis chemistries (often referred to as 5′ nuclease reactions), TaqMan fluorescence is emitted only when 5′ exonucleolytic Taq polymerase activity cleaves the reporter probe, usually a 20- to 30-mer oligonucleotide
that carries a 5′ fluorochrome and a 3′ quencher (with or without native fluorescence). The function of the quencher in the intact probe is to prevent the emission of fluorescence from the reporter by fluorescence resonance energy transfer (FRET). However, when the sequence of interest is recognized, the reporter probe specifically intercalates between the sites where the amplifying primers anneal. Upon hybridization of the probe to the target sequence, the fluorochrome reporter dissociates from the quencher by 5′ exonuclease activity, which thus enables fluorescence emission. In this context, increased fluorescence emission by the reporter fluorochrome is directly proportional to PCR product buildup. The specificity of these reactions stems from the sequence complementarity between the probe and the target of amplification. In considering the chemistries that depend on probe hybridization to DNA and cDNA templates, annealing and melting temperatures stand out as inherent features of nucleotide sequences that are key in the design and application of real-time PCR technology to a variety of demanding research projects (Mackay and Landt, 2007). The exploit of DNA melting chemistries is equally central to equipment development, designed to control and record temperature shifts (Arya et al., 2005; Winter et al., 2004) and the use of suitable fluorescent DNA probes (Kutyavin et al., 2003; Wong and Bai, 2006). Probe hybridization chemistries can be designed in two alternative formats. One uses the amplifying oligonucleotides intercalated head to tail by two template-specific probes proximal to each forward and reverse primer. In this format, the probe proximal to the forward primer is tagged at its 3′ end (acceptor fluorochrome), whereas the one nearing the reverse primer is tagged at its 5′ end (donor fluorochrome). Increased fluorescence is emitted upon probe DNA binding, via FRET. 
Alternatively, the forward primer can carry the acceptor fluorochrome at its 3′ end and thus reduce the reaction to a tripartite oligonucleotide reaction. Although the technology is designed to facilitate multiple gene/sequence target assessments in a single reaction, the approach is demanding and entails laborious optimizations. The availability of new fluorochromes and detection modules should circumvent these limitations. Irrespective of the format, the probe hybridization approach enables high-resolution assessment of template amplifications under stringent conditions, which records fluorescence emission. DNA-binding dye chemistries target double-stranded (ds) DNA and are designed to measure PCR amplification products through sequence-independent fluorochrome intercalation. Examples of these dyes are SYBR Green and ethidium bromide, which are weak fluorochromes as free molecules but emit increased fluorescence when bound to dsDNA (Lutfalla and Uze, 2006). Since PCR products double with every amplifying cycle, template availability increases, more DNA binding ensues, and higher fluorescence levels are proportionately registered. Whereas this method has been applied and improved in numerous and varied applications, it requires standardization and validation at different levels to ensure accuracy and specificity (Lutfalla and Uze, 2006). Also, it must be noted that despite the plasticity of sequence-independent techniques, which allows the analysis of different genes with the
same type of probe, the use of DNA-binding dyes precludes the setting of multiparametric reactions and often results in spurious amplification products (Kutyavin et al., 2003). Nevertheless, as pointed out earlier, high-stringency conditions together with confirmed amplifying primer specificity are routinely applied to reduce the risk of false-positive products. Saturation of fluorescence detection is a disadvantage often associated with DNA-binding dye chemistries, particularly when longer sequences are amplified. However, amplicon limits and detection parameters can be adjusted. Upon data processing and irrespective of the chemistry, it is important to bear in mind the significance of cycle threshold (Ct) settings, since the slope generated at the log exponential phase measures the amplification efficiency (Gibson et al., 1996; Wong and Medrano, 2005; Heid et al., 1996; Lay and Wittwer, 1997). Ct is routinely set above the amplification baseline, namely within the exponential phase, which becomes linear with log conversion. While amplification efficiency is roughly estimated to be near 100%, it can be affected by multiple factors, including the quality of the amplifying oligonucleotides, structural constraints, contaminants, and, as stated earlier, the size of the target sequence (Wong and Medrano, 2005; Bustin and Nolan, 2004; Yuan et al., 2006; Kubista et al., 2006; Mehra and Hu, 2005). Hence steps must be taken to prevent exogenous sources of error that can mislead interpretation of data. For absolute quantitation, standards with validated concentrations are either commercially available or custom prepared, depending on the application. Such standards are used as internal Ct references with experimental linear-log slope validation, usually obtained from serial dilution measurements that cover a relatively broad dynamic range.
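The relationship between the standard-curve slope and amplification efficiency can be made concrete with a short calculation. The sketch below is illustrative only: the dilution series and Ct values are hypothetical, and the function names are introduced here for the example. It fits Ct against log10 of the starting copy number, derives efficiency from the slope using the standard relation E = 10^(-1/slope) - 1 (a slope near -3.32 corresponds to 100% efficiency, that is, doubling per cycle), and interpolates an unknown sample from the fitted line.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standard_curve(copies, ct_values):
    """Fit Ct versus log10(copies); return slope, intercept, efficiency."""
    log_copies = [math.log10(c) for c in copies]
    slope, intercept = fit_line(log_copies, ct_values)
    # Efficiency of 1.0 means the product doubles every cycle (100%).
    efficiency = 10 ** (-1.0 / slope) - 1.0
    return slope, intercept, efficiency


def copies_from_ct(ct, slope, intercept):
    """Interpolate the starting copy number of a sample from its Ct."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical ten-fold serial dilution of a validated standard.
standard_copies = [1e7, 1e6, 1e5, 1e4, 1e3]
standard_ct = [14.1, 17.4, 20.8, 24.1, 27.5]

slope, intercept, efficiency = standard_curve(standard_copies, standard_ct)
unknown_copies = copies_from_ct(22.0, slope, intercept)

print(f"slope = {slope:.2f}, efficiency = {efficiency:.1%}")
print(f"estimated copies at Ct 22.0: {unknown_copies:.0f}")
```

A slope that departs markedly from about -3.3 would suggest that one of the factors listed above (primer quality, contaminants, amplicon size) is affecting the reaction, so the standard curve would need to be revisited before any unknowns are interpolated.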
Examples of these standards include dsDNA, single-stranded (ss) DNA, and complementary RNA (cRNA) that carry the target sequence of amplification (Wong and Medrano, 2005). A basic requirement for absolute quantitation is that the standards routinely used possess quasi-constant amplification efficiencies. Conversely, relative quantitation relies on comparative housekeeping genes that serve as reference parameters for amplification levels. Toward that end, relative quantitation is best accomplished when expression of the endogenous reference is relatively abundant and independent of pathophysiological events. Moreover, target amplification can be normalized for concentration differences relative to the endogenous control by the amount of total template added to the experimental reaction. Among the options for quantitative reference of amplification, ribosomal RNA (rRNA), glyceraldehyde 3-phosphate dehydrogenase (G3PDH), β-actin, and α-tubulin stand out as routine housekeeping controls (Wang et al., 2009; Teste et al., 2009). However, these controls have been noted to exhibit some limitations. For instance, rRNA lacks poly-(A) tailing and hence does not represent a true reference for normalization of oligo d(T)-dependent reverse-transcribed mRNA, whereas G3PDH, β-actin, and α-tubulin expression can be influenced by physiological or pathological conditions (Wang et al., 2009; Teste et al., 2009). In keeping with these constraints, careful selection of reference genes is necessary to ensure accurate application of relative quantitation (Wang et al., 2009; Teste et al., 2009).
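Normalization to a housekeeping reference can be illustrated with the widely used comparative Ct calculation (often written 2^-ΔΔCt), which the text does not name explicitly. The sketch below is a minimal example under stated assumptions: the Ct values are hypothetical, the function name is introduced here, and the target and reference are assumed to amplify with the same efficiency (100% by default).

```python
def relative_expression(ct_target_test, ct_ref_test,
                        ct_target_ctrl, ct_ref_ctrl,
                        efficiency=1.0):
    """Fold change of a target gene in a test sample versus a control sample.

    Each sample's target Ct is first normalized to the housekeeping
    reference (delta Ct); the test sample is then compared with the
    control sample (delta-delta Ct). Assumes target and reference
    amplify with the same efficiency (1.0 = doubling per cycle).
    """
    delta_test = ct_target_test - ct_ref_test
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl
    delta_delta = delta_test - delta_ctrl
    return (1.0 + efficiency) ** (-delta_delta)


# Hypothetical Ct values: the reference gene is unchanged between samples,
# while the target crosses threshold two cycles earlier in the test sample.
fold_change = relative_expression(
    ct_target_test=24.0, ct_ref_test=18.0,
    ct_target_ctrl=26.0, ct_ref_ctrl=18.0,
)
print(f"fold change relative to control: {fold_change:.1f}")
```

The calculation is only as reliable as the reference gene: if the housekeeping control itself shifts with the physiological or pathological condition under study, the apparent fold change is distorted by the same factor.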
Considering that mutations inevitably impinge on gene expression, real-time PCR has become the technology of choice to quantitatively assess the impact of the given genetic lesions.
15.3 DNA SEQUENCING
15.3.1 Scope and Evolution
The original design for nucleotide sequencing reactions over three decades ago (Sanger et al., 1977) launched an unprecedented quest to decode eukaryote and prokaryote genomes. Since then, DNA and RNA sequencing have become virtually ordinary practices in most research laboratories. Complete characterization of newly identified genes, assessment of physiological and aberrant gene rearrangements and mutations, confirmation of genetic polymorphisms, legal proof of parental imprinting, criminal investigation, and evolution research are only a handful of molecular practices where DNA sequencing created tremendous experimental impact. DNA polymerase-dependent reactions, using radioactively (mainly 35S-dNTP) or nonisotopically (commonly biotin) labeled nucleotides and resolved by polyacrylamide gel electrophoresis, characterized the pioneer DNA sequencing practices, whose results were revealed by autoradiography and read by eye base by base (Sanger et al., 1977; Church and Gilbert, 1984). The advent of fluorescence-based reactions, automated detection, and commercialization of high-throughput DNA sequencers gave rise to large and ambitious projects that included the sequencing of the human genome. The phenomenon empowered technology improvement and fostered spinoffs of ancillary methodology to support the demands (McPherson, 2009; Watts and MacBeath, 2001; MacBeath et al., 2001). Although nucleotide reading from an autoradiograph rarely occurs at the present time, the fundamental principle of the chain termination reaction remains, and it was naturally followed by rational modifications that gave rise to primer walking, unidirectional deletions, direct sequencing by PCR, and formamide gel-based RNA sequencing (Sinden et al., 1999; Reddy et al., 2008; Voss et al., 1995; Brent and Guigo, 2004; Motta et al., 2006).
15.3.2 Chemistry, Reaction, and Analysis
The principles of the nucleotide sequencing chemistry are summarized here. Basically, the method involves (1) a synthetic oligonucleotide complementary to the target DNA that serves as a primer and anneals at the 3′ end of the sequence, and (2) the enzyme DNA polymerase, which directionally catalyzes a 5′ to 3′ reaction to synthesize a complementary copy of the template strand. The in vitro DNA synthesis requires a pool of dATP, dCTP, dGTP, and dTTP to support the extension of the copying strand. While maintaining the equimolarity of the reaction, the dNTP pool also incorporates a labeled deoxy-nucleotide,
Figure 15.1. (See color insert.) Manual Versus Automated DNA Sequencing. (A) Shows acrylamide gel electrophoresis results resolving typical chain termination reactions (Ho et al. unpublished data). Each lane corresponds to a designated reaction terminated with ddATP (A), ddCTP (C), ddGTP (G) and ddTTP (T) analogs (Ho et al. unpublished), which identifies respective nucleotides on target DNA template. (B) Depicts a color-coded chromatogram of typical automated DNA sequencing data (Albrechtson et al. unpublished).
usually α-35S-dATP or α-35S-dCTP if radioactivity is used, or biotin or digoxigenin when nonisotopic labeling is applied. Since nucleotide elongation needs the presence of 3′-hydroxyl (OH) groups, this requirement is exploited to terminate the reaction at each given base by the use of di-deoxy nucleotide (ddNTP) analogs, which lack the OH. Congruent with this rationale, four different reactions are set up, each containing one of the four distinct ddNTP terminators, and are electrophoretically resolved through independent lanes of an acrylamide slab gel. Since the DNA polymerase-dependent reaction stops every time an analog base is incorporated into the growing strand, this creates labeled nucleotide chains of different lengths. Hence the four reactions, each run in a separate gel lane, together provide the precise nucleotide sequence identity (Fig. 15.1A). The spinoffs, unprecedented improvements, and multifaceted applications of this remarkable technology are now history (McPherson, 2009; MacBeath et al., 2001; Sinden et al., 1999; Reddy et al., 2008; Voss et al., 1995; Brent and Guigo, 2004; Motta et al., 2006; Lander et al., 2001). However, essential parameters remain valid and applicable irrespective of chemistry innovations, creative new reagents, and continuously updated equipment. These include pure and intact templates, versatile polymerase enzymes, rational design of priming reagents, and reliable detection systems. Whereas laboratory-based advances such as bacterial artificial chromosome (BAC) cloning and shotgun analyses continue to be the force behind the landmark leap of the nucleotide sequencing technology, software development can be credited with resolving the conundrum of managing the enormous
outburst of data generated by the powerful sequencing hardware. The imaginative software eased complex sequence annotations and analyses, thus giving rise to the numerous, efficient, and constantly evolving sequence databases, including the National Center for Biotechnology Information (NCBI), the University of California Santa Cruz (UCSC)/Bioinformatics, and the European Molecular Biology Laboratory (EMBL)/Ensembl genome browsers (Kent et al., 2001; Hubbard et al., 2002; Wheeler et al., 2001). Among the platforms with relevant clinical interest, PolyPhred, PolyScan, and SNPDetector are technologies in continued evolution that support complex genotyping analyses, including SNP, insertion/deletion variant, and heterozygosity assessment (Chen et al., 2007; Zhang et al., 2005; Bhangale et al., 2006).
15.3.3 Outsourcing
The new generation of automated sequencing hardware continues to be a constant challenge for software evolution. Several sequencers can now produce over 1 × 10^6 reads with sequence lengths of 400 base pairs (bp) or more, which handily satisfies the demands of the most ambitious projects. The cost of nucleotide sequencing has dropped significantly through supply and demand. The most sophisticated sequencing services are available either intramurally or extramurally, with sample pickup and electronic data downloading included in the cost. Owning nucleotide sequencing equipment or manually performing sequencing experiments is no longer cost effective. The quality and quantity of data provided by most nucleotide sequencing core facilities (Fig. 15.1B) are highly competitive and complete with superb downloadable software linkage. Unless a laboratory is fully engaged in the sequence analysis field, the wise action is to routinely outsource the sequence characterization of the genes of interest. Irrespective of the approach, nucleotide sequencing is an indispensable method that can uncover modification imprints affecting gene expression and function. On one hand, it provides the direct means to validate physiologically relevant gene mutations, rearrangements, and recombinatorial switches, which forecast successful achievement of gene expression diversity (the good news). On the other hand, it enables researchers to diagnose gene lesions of deleterious consequences (the bad news). Moreover, nucleotide sequencing has become a pillar in forensic medicine.
15.4 IN SITU HYBRIDIZATION
15.4.1 Experimental Considerations
Among the multitude of methods designed to investigate gene content, amplification, mutation, and expression, in situ hybridization can be distinguished for its unique property to provide information in the context of the chromosomal, nuclear, cellular, and/or histological microenvironments. By and large, DNA and RNA are the prominent targets of this powerful technology.
Genetic traits linked to diverse forms of disease are recognized as inherent consequences of chromosomal abnormalities, which can affect gene dosage, structure, processing, and function. Therefore, it is not surprising that gene duplications, deletions, and aberrant rearrangements resulting from chromosomal translocations can lead to dramatic phenotypes, which are often lethal or the cause of severe morphological and functional defects (Shaffer and Bejjani, 2004). The development of techniques that enabled the visualization, identification, and analysis of chromosomes marked a new era for accurate counts, integrity assessments, and detection of deletions and translocations (Garcia-Sagredo, 2008). Most of these methods, such as chromosome G-banding, are routine in most laboratories where cytogenetics is performed and, as with most technologies, progress continues to be achieved when high-resolution techniques are required to reveal subtle and/or complex gene abnormalities. Applications include karyotype, chromosome gene assignment, chromatin structure, DNA recombination, gene expression, and radiation dosimetry assessments (Maierhofer et al., 2002). Detection of specific chromosome segments to assess structurally inaccessible gene lesions is laborious and technically demanding. Hence the need for innovative approaches, imposed by clinical demands, resolution limitations, and inconsistency of available methods, moved the creative development of technology to the next level, and thus fluorescent in situ hybridization (FISH) became a reality. Historically, FISH appeared on the lab scene nearly 30 years ago, and despite its rapid and remarkable evolution, the elemental principles and the applications prevail. Basically, FISH involves the hybridization of fluorescently labeled probes to target segments of chromosomes that have been chemically fixed on slide preparations (Fig. 15.2).
Alternatively, probes can be labeled with biotin or digoxigenin-conjugated nucleotides, whose detection can be indirectly achieved by fluorescent conjugates, such as streptavidin/avidin or antidigoxigenin antibodies. Irrespective of the use of direct or indirect methods, FISH detection technology has exploded in the past decade, and fluorochrome reagents, probe engineering, and visualization hardware and software are as diverse as they are sophisticated. Accordingly, FISH earned solid credibility for its chromosome/gene mapping capabilities, specificity, precision, flexibility, and superb microscopy and digital imaging support, thus rapidly becoming an indispensable tool in biomedical research and practice. For instance, FISH-dependent genetic queries find widespread use in a variety of biomedical fields, including genetics, neurosciences, reproduction, toxicology, ecology, and evolution (Volpi and Bridger, 2008), to name some.
15.4.2 Scope and New Developments
Because of the enormous diversity of FISH technology, where acronyms are coined for any given application (Volpi and Bridger, 2008), it may be justified to conclude that FISH now appears on the menu in multiple and varied application flavors. Because the aim of the present chapter is to underscore the
Figure 15.2. (See color insert.) Chromosome locus assignment of a newly discovered gene. Fluorescent in situ hybridization analysis, using a biotin-labeled genomic probe, which reveals the new gene at the 9q32 locus (Sims-Mourtada et al., 2005) upon binding of fluorescent Streptavidin (shown herein as pseudo yellow fluorescence) against DAPI (blue fluorescence) background. Arrows emphasize gene locus assignment on respective chromosome 9 alleles.
impact of the various molecular strategies applied to the pathophysiological assessment of mutations of discovered genes, only brief appraisals of FISH applications are presented here. Unambiguous karyotype analyses, which can concomitantly query gene-specific location, detect cryptic gene fusions, and resolve intricate chromosome rearrangements, called for multicolor chromosome painting. The design and application of multiple fluorochromes and development of broad wavelength range detection systems led the way to multiplex-FISH (M-FISH) and gave a record boost to cytogenetics (Volpi and Bridger, 2008; Kearney, 2006). The invention was of particular biomedical impact on cancer research. Alterations in chromosome numbers (aneuploidy) are common in trisomy syndromes, as in trisomy 2, and accurate assessment of micronucleation events is not trivial. Blocking cytoplasm partition with the drug cytochalasin-B (CB) in combination with FISH (CB-FISH) became instrumental in assessing an array of chromosome segregation abnormalities (Volpi and Bridger, 2008; Migliore et al., 1999). Quantitative determination of telomere loss in aging can exploit the power of telomere hybridization, using peptide nucleic acid (PNA) FISH, combined with the versatility of flow cytometry (flow-FISH) that can measure fluorescent telomere signals in cell suspensions (Baerlocher et al., 2006; Potter and Wener, 2005). The approach enables us to manage multiple cell analyses with high precision and conveys high clinical potential. DNA strand breaks are physiologically and pathologically important, and determination of chromosome loci susceptibility is of relevant biomedical interest. The electrophoretic migration of DNA from the nucleus onto an agarose gel field, a detection method known as the comet assay, measures the degree of DNA breaks at the single-cell level. When combined with FISH (comet-FISH), the procedure reveals the chromosome sites with relevant DNA breakage susceptibility (Glei et al., 2009; Escobar et al., 2007). The use of antibody probes in combination with FISH (immuno-FISH) to simultaneously detect precise gene loci and protein complexes has quasi-unlimited potential (Yang et al., 2004; Zinner et al., 2007; Sun et al., 2003). Likewise, the accurate capture of aberrant sister-chromatid exchanges by combining BrdU/cell cycle labeling with FISH (harlequin-FISH) technology (Pala et al., 2001; Jordan et al., 1999) is enticing. Focused cytogenetic analysis of gene fusions resulting from chromosome rearrangements found a niche that has relevant diagnostic and prognostic value. By the clever application of dual-color fluorescent probes flanking the breakpoint site of chromosomal translocations (split-signal FISH), precise identification of the rearranging loci can be achieved (Volpi and Bridger, 2008; van Rijk et al., 2008, 2009). Determination of gene expression in situ at nuclear or cytosol locales, using fluorescence-based methods (RNA/Expression-FISH), opened a whole new dimension for evaluating transcription, mRNA processing, and decay. The method enables one to equally assess endogenous transcription, enforced expression resulting from plasmid-mediated transfection or lentivirus/retrovirus-dependent transduction/infection, and overexpression in transgenic (Tg) animal models.
The potential applications are virtually unlimited, from single cell-based allelic expression to phenotypically/pathologically distinct expression arrays, gene organization and regulation, nuclear/cytosol traffic, and diagnostic-based expression analyses (Volpi and Bridger, 2008; Ferrai et al., 2010; Mahadevaiah et al., 2009; Voss et al., 2006). As stated above, applications of the in situ hybridization technology in general and FISH in particular exploded in extraordinary directions. The unparalleled flexibility of the methods developed predicts that a variety of FISH flavors will be on the table of leading biomedical researchers.
15.5 EXPRESSION AT THE PROTEIN LEVEL
15.5.1 Overview
The pathophysiological consequences of known and newly identified gene lesions are nowhere more evident than in protein expression, subcellular and extracellular location, molecular interactions, and proprietary functions.
Proteins can be present in the nucleus to carry out both genetic and epigenetic roles; locate at the nuclear/cytosol boundaries as active gatekeepers; confer structural support and cell plasticity; perform an unimaginable array of cytosol, mitochondrial, lysosome, Golgi, centriole, vacuolar, and endoplasmic reticulum actions; direct intracellular signal traffic; be deployed at the cell surface as intracellular and extracellular communicators; and be secreted as intercellular signal exporters (Scott and Pawson, 2009; Xylourgidis and Fornerod, 2009; Nigg and Raff, 2009; Michelsen and von Hagen, 2009; Koizumi et al., 2007). Whereas certain gene mutations may not preclude transcription, translation, subcellular location, and function, thus resulting in silent phenotypes, others can produce dramatic effects that range from lethal phenotypes to severe functional impairments. Either way, formal proof of silent or deleterious point mutations, polymorphisms, deletions, insertions, inversions, or fusions must be obtained at the protein level before the pathophysiological significance of a newly identified gene can be established. Consistent with these needs, enormous progress has been achieved in technology development and generation of sensitive and specific probes.
15.5.2 Antibody Technology
Although in vitro translation became instrumental to obtaining primary evidence of protein expression and continues to be applied to gene discovery (Frances et al., 1994), the generation of mouse monoclonal antibody (mAb) probes (Kohler and Milstein, 1975) completely transformed the approach to protein analysis. Consistent with its broad reach, the groundbreaking discovery found applications in virtually all fields of science, empowered gene discovery, and enabled the pathophysiological characterization of known and newly identified molecules. Notably, biomedical research became a major beneficiary of both diagnostic and therapeutic reagents, which successfully turned to lifesaving uses in immunology, oncology, and cardiovascular, respiratory, neurological, and infectious disorders (Nissim and Chernajovsky, 2008). The cloning, engineering, and generation of recombinant monoclonal antibody probes made it possible to circumvent adverse reactions to proteins of murine origin and led the way to the development of human or humanized mAb reagents (Frances et al., 2000; Nissim and Chernajovsky, 2008). Despite the progress, antibody therapeutics still faces significant hurdles to prevent undesirable effects resulting from systemic and relatively long-term use of mAbs, including anti-idiotype reactions. On the other hand, the use of polyclonal antibody probes, obtained directly from sera of immunized animals (goat, rabbit, donkey, or chicken), is noteworthy because it provides increased detection diversity, particularly when simultaneous analysis of distinct protein targets is needed.
15.5.3 Evolution and Scope of Protein Analyses
In addition to monoclonal and polyclonal antibody probes, peptide and protein tags can be engineered to facilitate in vitro, ex vivo, and in vivo detection of
proteins of interest, evaluation of their subcellular location, and even assessment of their function, such as cell structure, motility, migration, phagocytosis, killing, and survival (Siddiqa et al., 2001; Chen et al., 2009; Dross et al., 2009; Flannagan and Grinstein, 2009; Li et al., 2007; Lohela and Werb, 2009; Karan et al., 2004). Methods, reagents, and equipment to determine protein expression at the organ, histological, cellular, and biochemical levels are now in place to provide accurate morphological and physiological parameters on the status of wild-type and mutant protein forms. For example, immunohistochemistry (Fig. 15.3A) and immunofluorescent histology (Fig. 15.3B) can reveal the presence or absence of protein expression (Frances et al., 2000), subcellular localization (Siddiqa et al., 2001), cell homing to specific organs (Sims-Mourtada et al., 2003), and altered tissue development and function resulting from partial or complete gene inactivation (Aldrich et al., 2003). Enzyme-linked immunosorbent assays (ELISA) and the spinoff application ELISPOT quantitatively measure extracellular mediators, such as immunoglobulins, active peptides, cytokines, and chemokines (Bondada and Robertson, 2003; Bocchino et al., 2009; Hogrefe, 2005), whereas confocal microscopy captures the dynamics of proteins in action and cell signal synapses (Chen et al., 2009; Dross et al., 2009; Flannagan and Grinstein, 2009; Li et al., 2007; Contento et al., 2008). In retrospect, it is not difficult to appraise how all areas of cell biology research came to soar on the availability of antibody probes and the feasibility of engineering molecular tags. Flow cytometry, for instance, enabled researchers to accurately achieve comprehensive phenotyping, discover new and unique cell populations, and provide evidence of cell surface assembly of receptor proteins.
Moreover, flow cytometry allowed concomitant assessment of cell identity and function, evaluation of cell cycle progression and proliferation, and recording of cell senescence and death events (Rangel et al., 2005; Frances et al., 1994; Liu et al., 1996a, 1996b, 1996c; Malisan et al., 1996a; Matteucci and Giampietro, 2008; Krysko et al., 2008; Malisan et al., 1996b; Challen et al., 2009; Passos and von Zglinicki, 2007). Rationally, access to the innovative and continually evolving antibody and recombinant protein tag reagents made it possible to undertake more challenging biochemical tasks. These ranged from simultaneous verification of protein identity, molecular mass, and covalent protein bond formations, using single and two-dimensional immunoblotting (Rangel et al., 2005; Frances et al., 1994; McKeller and Martinez-Valdez, 2006), to confirmatory assessment of deleterious mutations (Perlman et al., 2003). Furthermore, protein purification, by means of antibody-based (affinity) and tag-dependent chromatography, respectively, brought within reach the achievement of immunoproteomic experiments (Wu and Mohan, 2009) and the feasibility of resolving molecular structures (Wetterholm et al., 2008). Moreover, exploiting the use of antibody probes and genetically engineered tags facilitates the biochemical elucidation of complex molecular interactions, which can be relevant to cell surface, cytosol, nuclear, or extracellular functions. After protein extraction, the experiment involves a two-step procedure that includes (1) an antibody-based precipitation or immunoprecipitation (IP) by
Figure 15.3. (See color insert.) Gene Expression by Histological Methods. (A) Immunohistochemical (IHC) detection of IgD and CD38-expressing B lymphocytes, respectively identified in the follicular mantle (FM: blue) and germinal centers (GC: red) of cryopreserved tonsil sections (Martinez-Valdez, unpublished), obtained according to institutional guidelines. IHC reactions result from specific monoclonal anti-IgD or CD38 antibody reactivity, revealed by enzyme-dependent blue or red chromogen activation. (B) Representative immunofluorescent histology (IFH) to confirm the presence of CD3 (red fluorescence) and interleukin-8 (IL-8) receptor/CxCR1-expressing (green fluorescence) T lymphocytes on DAPI-stained cryostat tonsil sections (Sims-Mourtada et al., 2003).
controlled centrifugation, also known as pull down, and (2) a subsequent immunoblotting to reveal the identity of the IP protein complexes. Simple reciprocal IP/immunoblots can provide confirmatory information when the identity of the suspected protein interactions is known and intentionally tested (Rangel et al., 2005). The use of a recombinant tag does not usually override the need for an antibody probe, since anti-tag antibodies are routinely used either to perform the IP or as a revealing probe to identify the protein(s) in the immune complex. When novel protein entities are being characterized and possess structural motifs that are hypothesized as crucial for physiological molecular interactions, the IP/immunoblot approach can be instrumental to determine the consequences of genetic mutations (engineered or naturally occurring) at the active site. When the identity of proteins interacting with a molecule under study needs to be elucidated, the alternative to the two-step IP/immunoblot approach involves state-of-the-art proteomics, which, though technically demanding, has been widely implemented and is available in most institutional cores. The experiments typically involve target-specific IP in conjunction with two-dimensional (2-D) electrophoresis and mass spectrometry (Long et al., 2004). A judicious evolution of the IP technology led to the development of chromatin IP (ChIP), which directly queries protein/DNA interactions, implicates genetic and epigenetic functions, and has reached high-throughput capacity. The basic method involves the cross-linking of nucleic acid/protein complexes, antibody-dependent IP, amplification of the target DNA region by PCR, and nucleotide sequence identification. Variations of the technology logically developed, and large scanning projects became feasible (Lund-Olesen et al., 2008; Collas and Dahl, 2008; Barski and Zhao, 2009; Jothi et al., 2008; Barski and Frenkel, 2004; Trelle and Jensen, 2007).
Nothing has impacted biomedicine more than the application of antibody technology to diagnosis and therapy, as diagnostic biomarkers, neutralizing agents, inhibitors, or active magic bullets. Understandably, while this chapter cannot cover all dimensions of the antibody technology, the discussion does underscore the fact that since the inception of monoclonal antibodies (Kohler and Milstein, 1975) science has
never been the same. Currently, unconjugated and fluorescently or enzymatically conjugated antibody probes are commercially available, and while these reagents can be in-house/custom tailored for new molecules or specific applications, outsourcing to commercial vendors can be more cost-effective.
15.6 GENETICALLY ENGINEERED ANIMALS
15.6.1 Enforced Expression in Transgenic Mice
In vitro studies can uncover fundamental features of key gene functions and involvement in pathways that may be crucial for cellular operations, including transcriptional activation/repression, survival/death, proliferation, motility, signal communication, immunity, and reproduction. Yet, gene discovery, elucidation of intricate mechanisms, transcriptional/translational interference, and evaluation of dominant-negative mutations can never reach the pathophysiological significance of in vivo measurements. It is for this reason that engineered transgenic (Tg) mouse models are needed to assess the consequences of enforced gene overexpression (Blyth et al., 2009). Toward that end, mice can be engineered to enforce transgene expression ubiquitously or in a tissue-specific manner (Blyth et al., 2009). The choice of tissue/cell-specific Tg expression depends on whether the aim is to determine how the excess gene expression can alter the balance of known cell pathways or how ectopic expression can uncover additional gene pathophysiology. A typical example is highlighted here. Up until now, the mechanisms by which antigen (Ag)-specific germinal center (GC) B cells survive and die have not been clear, and hence overexpression of candidate genes with suspected potential to confer resistance to physiological death could provide fundamental clues. To facilitate the in vivo interrogation in a broad or restricted context, ubiquitous and cell-specific promoters can be applied to Tg mouse technology (Blyth et al., 2009; Zhou et al., 2004; Bordon et al., 2008). For instance, the Eμ-SV system uses IgM regulatory sequences to selectively drive transgene expression in the B cell compartment (Blyth et al., 2009; Bordon et al., 2008), whereas the β-globin promoter can direct transgene expression throughout animal tissues (Zhou et al., 2004). An added feature of Tg mouse engineering is that tags can be incorporated into the gene under study to facilitate detection.
A brief example is the engineering of gene X to generate a B cell–specific Eμ/promoter-gene X-flag transgene, which can be achieved by inserting the entire gene, flag-tagged by 3′ PCR insertion, into available cloning sites of the Eμ-SV40 vector. An alternative option is to use dicistronic transgene constructs, in which larger protein-based (rather than peptide) tags can be independently translated from the same transcript through the incorporation of an internal ribosomal entry site (IRES) (Bouabe et al., 2008). When the gene locus of interest spans only a few kilobases (kb) or the target gene bears no introns
(Guzman-Rojas et al., 2000; Drysdale et al., 2002), the entire gene can be engineered for overexpression. However, cDNA-based constructs are by and large the most frequently used transgenes. In some applications where tissue-specific overexpression is envisioned, certain gene promoters that are used to target a given cell lineage can drive undesirable side effects. A typical observation is that, when the transgene of interest is placed under Eμ-driven control, expression can leak into cell lineages other than B cells. The interpretation of the Tg phenotype may remain unaffected if the information sought is exclusively focused on the B cell compartment. Should the data become confusing because of a striking phenotype that indirectly impinges on B cell physiology, the rational remedy is to replace the promoter with a more stringent cell lineage-specific one, if available. In the described scenario, the CD19 promoter represents the ideal substitution. Advances in Tg technology enable us to test the pathophysiological effect of dominant-negative function interference by the overexpression of function-defective mutant proteins (Halabi et al., 2008). Likewise, potentially deleterious point mutations within functional motifs can be tested using knock-in Tg mice (Yang et al., 2008), which often resemble human disease and thus provide invaluable diagnostic/prognostic clues. Most institutions house genetically engineered mouse core facilities, which operate Tg and knockout (KO) projects within pathogen-free barriers (required for precise applications) and under the supervision of an institutional animal care and use committee (IACUC). Thus a given Tg project follows a fairly standard routine: after sequence verification that the gene of interest has the correct in-frame orientation within the promoter-driven construct of choice, the transgene is excised from the vector and purified according to the institution's Tg core facility specifications. 
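The sequence-verification step can be illustrated with a minimal Python sketch. It assumes a hypothetical gene X coding sequence (the toy `GENE_X_CDS` below is invented for illustration), replaces the native stop codon with one possible FLAG-encoding sequence plus a new stop codon, and checks that the fused open reading frame starts with ATG, has a length divisible by three, and carries no premature stop codon. The function names, the codon choice for the tag, and the toy sequence are all assumptions and do not represent any core facility's protocol.

```python
# Minimal sketch: check that a 3' FLAG-tagged coding sequence stays in frame.
# GENE_X_CDS is an invented toy sequence, not a real gene.

STOP_CODONS = {"TAA", "TAG", "TGA"}

# One DNA sequence encoding the FLAG peptide DYKDDDDK (codon choice is an assumption).
FLAG_DNA = "GACTACAAAGACGATGACGACAAG"


def codons(seq):
    """Split a DNA sequence into consecutive triplets (trailing bases are dropped)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def add_3prime_flag(cds):
    """Replace the native stop codon with the FLAG codons followed by a stop codon."""
    cds = cds.upper()
    if cds[-3:] not in STOP_CODONS:
        raise ValueError("coding sequence must end with a stop codon")
    return cds[:-3] + FLAG_DNA + "TAA"


def is_in_frame(orf):
    """True if the ORF starts with ATG, is a multiple of three, and has one terminal stop."""
    orf = orf.upper()
    if len(orf) % 3 != 0 or not orf.startswith("ATG"):
        return False
    triplets = codons(orf)
    has_terminal_stop = triplets[-1] in STOP_CODONS
    has_premature_stop = any(t in STOP_CODONS for t in triplets[:-1])
    return has_terminal_stop and not has_premature_stop


GENE_X_CDS = "ATGGCTGCCAAGGAGCTGTGA"  # toy ORF: Met-Ala-Ala-Lys-Glu-Leu-stop
tagged = add_3prime_flag(GENE_X_CDS)
shifted = "ATGG" + tagged[3:]          # a single inserted base shifts the frame

print(is_in_frame(tagged))   # True
print(is_in_frame(shifted))  # False
```

In practice the check would be run on the sequencing read of the finished construct rather than on a designed sequence; the sketch only shows the reading-frame logic.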
From here, the Tg core performs all steps from the pronuclear injections to the delivery of Tg mice. The relative transgene copy number against the endogenous gene homologue is then determined using quantitative dot-blot hybridization and phosphorimaging. Typically, titrations can be performed using a standard genomic reference, e.g., the G3PDH gene. After F1 Tg mouse lines are established, reverse transcription-based real-time PCR is performed to select the Tg mouse lines with relevant transgene expression that will be used for further studies. Immunoblot confirmation can be applied when the transgene is tagged and/or specific antibody reagents are available. Alternatively, tissue sections and/or cell suspensions can be prepared and subjected to immunohistochemistry, immunofluorescence histology/cytology, or flow cytometry, should appropriate antibody reagents be available or the transgene carry a fluorescent tag. In keeping with the Eμ-gene X-flag Tg mouse example, downstream pathophysiological assessment of Tg mouse phenotypes involves comprehensive and multiparametric tests, in which the experimental approach is tailored to the information available on the gene of interest. Accordingly, B cell-specific
parameters can disclose stage-specific accumulation of progenitor, precursor, and/or immature B cell subsets found in single-cell suspensions from bone marrow. Pure mononuclear cells can be obtained by discontinuous-gradient centrifugation, followed by either fluorescent or magnetic sorting of untouched B cells, using negative selection protocols that deplete all other cell lineages. Cell lineage, morphology, and developmental stage of the pure B cell preparations can be assessed by multiparametric flow cytometry and immunocytology assays. After primary characterization of Tg B cells, basal functions can be examined, including cell division patterns and the capacity to proliferate in the absence of growth and differentiation factors. In the event that the B cells exhibit a potential for autonomous clonal expansion, simple cytogenetic experiments can be performed to investigate whether autonomous proliferation and clonal expansion originate from illegitimate gene rearrangements. Additional cellular tests can be carried out to determine whether the enforced expression of transgene X confers a cell survival advantage or whether clonal expansion emerged from resistance to bone marrow microenvironment-induced death signaling. Depending on the results of the primary evaluation of the Tg B cell phenotype, the information can be supplemented by more mechanistically oriented experimentation. This can include comparative cDNA microarrays and/or protein-profiling experiments to reveal the gene expression programs, mechanisms, and cellular pathways affected by the enhanced and sustained expression of the transgene. In the peripheral organs, critical parameters of Ag-dependent B cell responses within secondary lymphoid organs can be evaluated. 
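Returning briefly to the line-selection steps described earlier (copy-number titration against a genomic reference and real-time RT-PCR), the underlying arithmetic can be sketched in Python. The sketch rests on several assumptions that are not stated in the text: the hybridization probe detects the transgene and the endogenous homologue equally, the endogenous gene is present at two copies per diploid genome, signal is proportional to copy number, and the real-time PCR calculation uses the common 2^(-ΔΔCt) approximation with roughly 100% amplification efficiency. All function names and numerical values are hypothetical.

```python
# Minimal sketch of two normalizations used when selecting Tg lines.
# All signal and Ct values below are invented for illustration.


def transgene_copies(tg_probe, tg_ref, wt_probe, wt_ref, endogenous_copies=2):
    """Estimate transgene copies per diploid genome from dot-blot signals.

    tg_probe / wt_probe: gene X probe signal in Tg and wild-type DNA.
    tg_ref / wt_ref: reference-gene (e.g., G3PDH) signal in the same samples.
    Assumes the probe detects transgene and endogenous homologue equally.
    """
    if min(tg_probe, tg_ref, wt_probe, wt_ref) <= 0:
        raise ValueError("signals must be positive")
    tg_norm = tg_probe / tg_ref          # loading-corrected signal, Tg sample
    wt_norm = wt_probe / wt_ref          # loading-corrected signal, wild type
    total_copies = endogenous_copies * tg_norm / wt_norm
    return total_copies - endogenous_copies


def fold_expression(ct_target_tg, ct_ref_tg, ct_target_wt, ct_ref_wt):
    """Relative expression (Tg versus wild type) by the 2^(-ddCt) approximation."""
    delta_tg = ct_target_tg - ct_ref_tg  # target normalized to reference, Tg
    delta_wt = ct_target_wt - ct_ref_wt  # target normalized to reference, wild type
    return 2.0 ** -(delta_tg - delta_wt)


print(transgene_copies(6000, 1000, 2400, 1200))  # 4.0
print(fold_expression(22.0, 18.0, 25.0, 18.0))   # 8.0
```

With these invented numbers, the Tg line would carry about four transgene copies and express the target roughly eightfold above wild type; real data would require replicate measurements and an efficiency-corrected model.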
For example, if the working hypothesis is that enforced overexpression of Eμ-gene X-flag may influence mature B cell survival and promote massive accumulation following immunization, focused attention must be directed to perturbations in all peripheral lymphoid organs. This basically entails postmortem examination of Tg mice for splenomegaly and lymphadenopathies, followed by histopathology experiments to assess architectural abnormalities and defects in GC formation (Guzman-Rojas et al., 2002; Sims-Mourtada et al., 2003; Aldrich et al., 2003). Flow cytometry and immunohistochemistry experiments can help reveal signs of abnormal clonal expansion within the GC microenvironment. Notably, enhanced Tg B cell survival or resistance to death-inducing stimuli can equally be determined by flow cytometry parameters. Following the same line of investigation used for primary B cell maturation programs in the bone marrow, differential gene expression arrays and protein profiles can be queried to determine whether sustained Eμ-gene X-flag overexpression impinges upon target GC cellular pathways. Although the hypothetical gene X, proposed herein as an example, serves to illustrate how high-affinity/Ag-specific GC B cells might acquire cell survival benefits, the overall take-home message of this chapter is that the ultimate goal of a given Tg project is to reveal the in vivo significance of
gene X expression. In keeping with this rationale, the study must remain open to evaluating the unexpected emergence of ancillary information resulting from the enforced overexpression of the gene under investigation. This may involve the spontaneous expansion and persistence of abnormal cells, including tumorigenic phenotypes, which can conceivably lead to proliferative disorders. The concept of spontaneous expansion and persistence of abnormal cell phenotypes must be evaluated with caution, since it is equally conceivable that the enforced overexpression of gene X alone is not responsible for the observed phenomenon. Because tumor progression is likely to be complex and multifactorial, the concept of spontaneity and persistence needs to be rigorously evaluated in the context of complementary studies. Accordingly, breeding hypothetical Tg X mice into either tumor suppressor KO or proto-oncogene Tg backgrounds, in the presence or absence of carcinogens, represents a complementary strategy to further assess gene X’s pathophysiological potential.
15.6.2 Gene Targeting/Knockout
Naturally occurring mutations provide useful information about the function of altered genes, particularly those that affect hematopoietic development. The introduction of gene-targeted deletions, whereby mutations are engineered in the mouse (Thomas and Capecchi, 1987; Capecchi, 1989), led to remarkable advances in the analysis of cellular functions. The principle of the method, also known as gene knockout (KO), is the use of murine embryonic stem (ES) cells, which can be maintained in a pluripotent state in culture (Weiss, 1997). When mutant ES cells are introduced into a host blastocyst, they contribute to most tissues of the chimeric mouse, including germ cells. Breeding of the chimeras results in transmission of the mutation, first in the heterozygous state. Heterozygous mice can then be mated to develop homozygous (-/-) strains. Genes that exhibit a potential role in development, such as regulators of growth and differentiation, stand out as key candidates. The choice of a gene target largely relies on its background: (1) the pattern of expression in developing and mature cells, (2) known or suspected lineage specificity, (3) expression related to the growth of malignancies (e.g., leukemia and lymphoma), and (4) prior in vitro or in vivo evidence for function in a particular cellular pathway and relationship to other molecules. The most widely used practice for disrupting a gene in vivo to analyze its function is the deletion of a segment of, or the complete, targeted gene by substitution of the wild-type gene with a mutant allele through homologous recombination, coupled with drug selection (the neomycin resistance gene, for example). At least three targeting strategies, whose effectiveness can be monitored in ES cells, can be envisioned. In the first model, expression of the gene of interest can be inactivated by replacing its 5′ upstream region and first exon with a Neo gene
cassette. The second model could interrupt gene expression by swapping the largest exon with the Neo cassette. Alternatively, homologous recombination can target selected exons encoding known functional motifs or enzyme catalytic sites by Neo cassette replacement in the opposite transcriptional and translational orientation (Scacheri et al., 2001). The bidirectional placement of the Neo cassette marker with respect to the gene target aims at dominant Neo gene transcription from the shared locus to abrogate endogenous gene activity. However, neomycin/G418 selection is by no means proof of gene inactivation, and hence confirmation at the transcriptional and protein levels must be obtained before ES/blastocyst transfers. While the approach can be technically demanding, numerous cassette constructs are commercially available. The generation of preliminary in vitro data on the function of the gene under study can be of the utmost value in considering the pathophysiological consequences of its in vivo inactivation. Information on tissue distribution; expression pattern during embryonic and adult development, cell differentiation, and/or activation; subcellular location; function; potential physiological redundancy; and pathway characterization could prove instrumental in the design of the targeting strategy and the assessment of the KO phenotype. Even when embryonic or fetal deaths are not anticipated, constant monitoring of the gene-deficient colony in conjunction with veterinarians and animal facility staff is customary. Should death occur at any stage, necropsy procedures must be carried out, followed by organ fixation, tissue section preparation, and anatomopathological examination. When in utero deaths occur, thorough examination of whole-mount embryos and cryopreserved fetal sections can be carried out by immunohistochemistry and in situ hybridization to register potential developmental alterations that led to the death of the mouse. 
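One simple way to notice unanticipated in utero loss during colony monitoring, offered here as an illustrative assumption rather than a procedure described in this chapter, is to compare the genotype counts of pups from heterozygous intercrosses with the expected 1:2:1 Mendelian ratio. The Python sketch below computes a chi-square goodness-of-fit statistic with two degrees of freedom and compares it with the standard 0.05 critical value (5.991). The function names and litter counts are hypothetical.

```python
# Minimal sketch: test genotype counts from het x het matings against 1:2:1.
# Counts are invented; a deficit of -/- pups may suggest embryonic lethality.

CHI2_CRITICAL_DF2_ALPHA_0_05 = 5.991  # chi-square critical value, 2 degrees of freedom


def mendelian_chi_square(n_wt, n_het, n_ko):
    """Chi-square statistic for observed (+/+, +/-, -/-) counts versus 1:2:1."""
    total = n_wt + n_het + n_ko
    if total == 0:
        raise ValueError("at least one genotyped pup is required")
    observed = (n_wt, n_het, n_ko)
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def deviates_from_mendelian(n_wt, n_het, n_ko):
    """True if the counts depart from 1:2:1 at the 0.05 level."""
    return mendelian_chi_square(n_wt, n_het, n_ko) > CHI2_CRITICAL_DF2_ALPHA_0_05


print(round(mendelian_chi_square(30, 62, 8), 2))  # 15.44
print(deviates_from_mendelian(30, 62, 8))         # True
print(deviates_from_mendelian(26, 49, 25))        # False
```

A significant deficit of homozygous pups at weaning would only prompt the timed-mating and embryo analyses described above; it does not by itself establish when or why the embryos were lost.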
Likewise, necropsies of animals that die postnatally should be routinely performed, and tissue sections and cell suspension preparations processed for immunohistochemistry, in situ hybridization, and flow cytometry. These experiments can reveal altered cellularity, morphology, and gene expression. Tissue sections can also be used to perform terminal deoxynucleotidyl transferase (TdT)-mediated dUTP nick-end labeling (TUNEL) assays and uncover defects in cell survival resulting from the targeted gene deficiency. Isolation and purification of distinct cell subsets carry equal experimental value because they give the investigator access to a thorough assessment of phenotype and function, using flow cytometry and cell culture assays. The approach can be of particular immunological value when cell preparations are obtained from bone marrow, thymus, and spleen, which can provide key information on developmental, maturation, and differentiation deficiencies within the immune cell compartment. Otherwise, viable litters can be routinely examined for anatomical integrity, weight at birth, gain or loss of weight from the neonatal period to adulthood, fur color, number and characteristics of limbs, ambulatory properties,
ability to thrive, behavior, and the capacity to respond to external stimuli (light, sound, mild changes in temperature).
15.6.3 Pitfalls and Solutions
While it is tenable that the in vivo inactivation of certain genes often leads to phenotypes that are perceived a priori as inconsequential, numerous experimental approaches can be designed to challenge behavioral, neurological, reproductive, cardiovascular, respiratory, renal, gastrointestinal, metabolic, and immunological functions (Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2010; Komine, 2009; Lerebours et al., 2002; Li et al., 2009; McGivern and Lemon, 2009; Morisawa et al., 2008). Within the immunological context, for instance, lymphoid organ analyses can be performed via preimmunization and postimmunization schemes to examine ligand/receptor interactions, plasma cell differentiation, and the production of Ag-specific antibodies (Siddiqa et al., 2001; Guzman-Rojas et al., 2002; Aldrich et al., 2003; Malisan et al., 1996b; Briere et al., 1996). Flow cytometry, cell sorting, immunohistology, cell culture, ELISA, and mixed lymphocyte reactions (MLR), together with standard genetic/epigenetic, biochemical, and molecular biology parameters, make up the tools routinely used for evaluating immune functions. On the other hand, inactivation of genes that are critical for mouse development or exert critical functions in adult vital organs frequently results in embryonic or perinatal lethality. To circumvent these limitations, alternative gene targeting strategies can be devised. For example, conditional cell lineage-specific gene targeting, which makes use of the Cre/loxP recombination system of bacteriophage P1, can be envisioned (Aoki and Taketo, 2008; Kirschner, 2009). The target gene is flanked by the recombinase recognition loxP sites (a floxed target gene), which are introduced by homologous recombination into the ES cells. In such models, the expression of the target gene should be the same as in wild-type strains, unless the gene becomes inactivated by Cre-mediated deletion. 
The deletion of the target gene can then be selectively accomplished in vivo in a cell lineage-restricted manner by mating the mouse strain carrying a floxed target with a transgenic mouse expressing the Cre recombinase under the control of a cell lineage-specific promoter (Aoki and Taketo, 2008; Kirschner, 2009). Furthermore, advances in gene targeting technology have introduced procedures to induce gene inactivation in mice, practically at will, during specific stages of mouse development. The approach exploits inducible promoters to control the expression of diverse Cre recombinase constructs (Aoki and Taketo, 2008). An example of the approach features the Mx1 promoter (Gutierrez et al., 2008), which transiently drives high levels of transcriptional activity under the influence of interferon-α or -β. Thus mice carrying a floxed target gene and an Mx-Cre transgene can be induced to inactivate the target locus following treatment with
IFN (Gutierrez et al., 2008). Such approaches are of great value and continue to be developed. The overall lesson from considering how genetically engineered mouse technology emerged and where it stands today is that, given the challenge of assessing the biological significance of a new gene, one can rest assured that supporting reagents, methods, and expertise are within reach.
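The conditional and inducible targeting logic described in this section can be summarized in a short Python sketch. It is a deliberately simplified model built on assumptions: deletion is treated as all-or-none within a lineage (real Cre lines often delete incompletely), a floxed allele is deleted only where the Cre-driving promoter is active, and an inducible promoter requires the inducer (for example, IFN for an Mx1-driven Cre). The class, the function, and the lineage labels are hypothetical names chosen for illustration.

```python
# Minimal sketch of Cre/loxP conditional and inducible gene deletion logic.
# Lineage names and mouse genotypes below are hypothetical examples.

from dataclasses import dataclass


@dataclass(frozen=True)
class Mouse:
    floxed_alleles: int                    # 0, 1, or 2 floxed copies of the target gene
    cre_lineages: frozenset = frozenset()  # lineages in which the Cre promoter is active
    cre_inducible: bool = False            # True if Cre expression requires an inducer


def target_deleted(mouse, lineage, inducer_given=False):
    """Return True if the floxed target is expected to be deleted in this lineage."""
    if mouse.floxed_alleles == 0:
        return False                       # no loxP-flanked allele, nothing to excise
    if lineage not in mouse.cre_lineages:
        return False                       # promoter silent: gene behaves as wild type
    return inducer_given or not mouse.cre_inducible


b_cell_ko = Mouse(2, frozenset({"B cell"}))                        # lineage-restricted Cre
inducible_ko = Mouse(2, frozenset({"hematopoietic"}), cre_inducible=True)

print(target_deleted(b_cell_ko, "B cell"))                                 # True
print(target_deleted(b_cell_ko, "T cell"))                                 # False
print(target_deleted(inducible_ko, "hematopoietic"))                       # False
print(target_deleted(inducible_ko, "hematopoietic", inducer_given=True))   # True
```

The model makes the key design point explicit: in the absence of Cre activity the floxed gene is expected to behave as in wild-type mice, which is what allows lethal null phenotypes to be bypassed.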
15.7 CONCLUDING REMARKS
Evidence that mutations of pathophysiologically relevant genes account for a large share of human morbidity and mortality is readily available in most biomedical fields (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2010; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). Numerous reports of cardiovascular, respiratory, neurological, renal, metabolic, reproductive, and sexual failures, together with increased susceptibility to inflammation, oxidative toxicity, anaphylactic shock, infections, and tumorigenicity, support this notion (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2010; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). Notably, defective leukocyte reactivities are widely documented in association with genetic lesions of proinflammatory cytokines, chemotactic factors, cell cycle regulators, tumor suppressors, proto-oncogenes, surface receptors, and proteolytic metalloproteases (Mullighan et al., 2009; Bhangoo and Jacobson-Dickman, 2009; Chan, 2008; Cihakova and Rose, 2008; Gungor et al., 2010; Komine, 2009; Lerebours et al., 2002; McGivern and Lemon, 2009; Morisawa et al., 2008). The collective supporting information validates the development and use of multiparametric methodology to accurately investigate new gene entities, not only to reveal the physiological benefits of their expression but also to assess the pathological risks associated with their lesions. As science is in constant motion and continues to yield information that challenges existing paradigms, the pragmatic purpose of this chapter is to draw the reader to the thought that new findings lead to new questions, pursuits, and problems to be solved. 
A thorough scrutiny of what we can learn about genetics in particular and biology in general may serve to ignite creativity that contributes to biomedical endeavors.
15.8 ACKNOWLEDGMENTS
The writing of this book chapter was supported by the National Institutes of Health (NIH) R01 grant AI065796-01.
15.9 REFERENCES
Aldrich MB, et al. (2003). Impaired germinal center maturation in adenosine deaminase deficiency. J Immunol 171(10):5562–70. Alt FW, et al. (1992). VDJ recombination. Immunol Today 13(8):306–15. Aoki K, Taketo MM. (2008). Tissue-specific transgenic, conditional knockout and knock-in mice of genes in the canonical Wnt signaling pathway. Methods Mol Biol 468:307–31. Arya M, et al. (2005). Basic principles of real-time quantitative PCR. Expert Rev Mol Diagn 5(2):209–19. Baerlocher GM, et al. (2006). Flow cytometry and FISH to measure the average length of telomeres (flow FISH). Nat Protoc 1(5):2365–76. Barragan E, et al. (2001). Quantitative detection of AML1-ETO rearrangement by real-time RT-PCR using fluorescently labeled probes. Leuk Lymphoma 42(4):747–56. Barski A, Frenkel B. (2004). ChIP Display: novel method for identification of genomic targets of transcription factors. Nucl Acids Res 32(12):e104. Barski A, Zhao K. (2009). Genomic location analysis by ChIP-Seq. J Cell Biochem 107(1):11–18. Bauer AK, Rondini EA. (2009). Review paper: the role of inflammation in mouse pulmonary neoplasia. Vet Pathol 46:369–90. Bentley DR, et al. (2008). Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218):53–59. Bhangale TR, et al. (2006). Automating resequencing-based detection of insertion-deletion polymorphisms. Nat Genet 38(12):1457–62. Bhangoo A, Jacobson-Dickman E. (2009). The genetics of idiopathic hypogonadotropic hypogonadism: unraveling the biology of human sexual development. Pediatr Endocrinol Rev 6(3):395–404. Bild AH, et al. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(7074):353–57. Blaveri E, et al. (2005). Bladder cancer stage and outcome by array-based comparative genomic hybridization. Clin Cancer Res 11:7012–22. Blyth K, et al. (2009). Runx1 promotes B-cell survival and lymphoma development. Blood Cells Mol Dis 43(1):12–19. Bocchino M, et al. (2009). 
IFN-gamma release assays in tuberculosis management in selected high-risk populations. Expert Rev Mol Diagn 9(2):165–77. Bondada S, Robertson DA. (2003). Assays for B lymphocyte function. Curr Protoc Immunol Chapter 3, Unit 3.8. Bordon A, et al. (2008). Enforced expression of the transcriptional coactivator OBF1 impairs B cell differentiation at the earliest stage of development. PLoS One 3(12):e4007. Bouabe H, et al. (2008). Improvement of reporter activity by IRES-mediated polycistronic reporter system. Nucl Acids Res 36(5):e28. Brent MR, Guigo R. (2004). Recent advances in gene structure prediction. Curr Opin Struct Biol 14(3):264–72.
Briere F, et al. (1996). [B lymphocytes of patients with complete IgA deficiency secrete IgA in response to interleukin 10]. Nephrologie 17(5):289–95. Bustin SA. (2000). Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol 25(2):169–93. Bustin SA, Nolan T. (2004). Pitfalls of quantitative real-time reverse-transcription polymerase chain reaction. J Biomol Tech 15(3):155–66. Calin GA, Croce CM. (2007). Chromosomal rearrangements and microRNAs: a new cancer link with clinical implications. J Clin Invest 117(8):2059–66. Callahan G, et al. (2003). Characterization of the common fragile site FRA9E and its potential role in ovarian cancer. Oncogene 22:590–61. Capecchi MR. (1989). The new mouse genetics: altering the genome by gene targeting. Trends Genet 5(3):70–76. Casellas R, et al. (2001). Contribution of receptor editing to the antibody repertoire. Science 291(5508):1541–44. Challen GA, et al. (2009). Mouse hematopoietic stem cell identification and analysis. Cytometry A 75(1):14–24. Chan LS. (2008). Atopic dermatitis in 2008. Curr Dir Autoimmun 10:76–118. Chen J, Alt FW. (1993). Gene rearrangement and B-cell development. Curr Opin Immunol 5(2):194–200. Chen K, et al. (2007). PolyScan: an automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Res 17(5):659–66. Chen Y, et al. (2009). Automated 5-D analysis of cell migration and interaction in the thymic cortex from time-lapse sequences of 3-D multi-channel multi-photon images. J Immunol Methods 340(1):65–80. Choe J, et al. (1996). Cellular and molecular factors that regulate the differentiation and apoptosis of germinal center B cells. Anti-Ig down-regulates Fas expression of CD40 ligand-stimulated germinal center B cells and inhibits Fas-mediated apoptosis. J Immunol 157(3):1006–16. Church GM, Gilbert W. (1984). Genomic sequencing. Proc Natl Acad Sci U S A 81(7):1991–95. Cihakova D, Rose NR. (2008). 
Pathogenesis of myocarditis and dilated cardiomyopathy. Adv Immunol 99:95–115. Collas P, Dahl JA. (2008). Chop it, ChIP it, check it: the current status of chromatin immunoprecipitation. Front Biosci 13:929–43. Contento RL, et al. (2008). CXCR4-CCR5: a couple modulating T cell functions. Proc Natl Acad Sci U S A 105(29):10101–06. Deepak S, et al. (2007). Real-time PCR: revolutionizing detection and expression analysis of genes. Curr Genomics 8(4):234–51. Dross N, et al. (2009). Mapping eGFP oligomer mobility in living cell nuclei. PLoS One 4(4):e5041. Drysdale J, et al. (2002). Mitochondrial ferritin: a new player in iron metabolism. Blood Cells Mol Dis 29(3):376–83. Dudley DD, et al. (2005). Mechanism and control of V(D)J recombination versus class switch recombination: similarities and differences. Adv Immunol 86:43–112. Dunckley T, et al. (2007). Whole-genome analysis of sporadic amyotrophic lateral sclerosis. N Engl J Med 357(8):775–88.
Engelmark MT, et al. (2008). Polymorphisms in 9q32 and TSCOT are linked to cervical cancer in affected sib-pairs with high mean age at diagnosis. Hum Genet 123: 437–43. Erdmann VA, et al. (2000). Non-coding, mRNA-like RNAs database Y2K. Nucl Acids Res 28(1):197–200. Escobar PA, et al. (2007). Leukaemia-specific chromosome damage detected by comet with fluorescence in situ hybridization (comet-FISH). Mutagenesis 22(5):321–27. Espy MJ, et al. (2006). Real-time PCR in clinical microbiology: applications for routine laboratory testing. Clin Microbiol Rev 19(1):165–256. Fan YS, et al. (2007). Detection of pathogenic gene copy number variations in patients with mental retardation by genomewide oligonucleotide array comparative genomic hybridization. Hum Mutat 28(11):1124–32. Ferrai C, et al. (2010). Poised transcription factories prime silent uPA gene prior to activation. PLoS Biol 8(1):e1000270. Festing MF, et al. (1998). At least four loci and gender are associated with susceptibility to the chemical induction of lung adenomas in A/J x BALB/c mice. Genomics 53:129–36. Flannagan RS, Grinstein S. (2009). The application of fluorescent probes for the analysis of lipid dynamics during phagocytosis. Meth Mol Biol 591:121–34. Frances V, et al. (1994). A surrogate 15 kDa JC kappa protein is expressed in combination with mu heavy chain by human B cell precursors. EMBO J 13(24):5937–43. Frances V, et al. (2000). The human anti-bullous pemphigoid monoclonal autoantibody P22 is encoded by genes of the IGHV4 and IGLV4 families. J Autoimmun 15(4):459–68. Franco S, et al. (2006). Pathways that suppress programmed DNA breaks from progressing to chromosomal breaks and translocations. DNA Repair (Amst) 5(9–10): 1030–41. Freeman WM, et al. (1999). Quantitative RT-PCR: pitfalls and potential. Biotechniques 26(1):112–22, 124–15. Fusco A, Fedele M. (2007). Roles of HMGA proteins in cancer. Nat Rev Cancer 7(12):899–910. Gallardo D, et al. (2008). 
Mapping of quantitative trait loci for cholesterol, LDL, HDL, and triglyceride serum concentrations in pigs. Physiol Genomics 35(3):199–209. Garcia-Castillo H, Barros-Nunez P. (2009). Detection of clonal immunoglobulin and T-cell receptor gene recombination in hematological malignancies: monitoring minimal residual disease. Cardiovasc Hematol Disord Drug Targets 9(2):124–35. Garcia-Sagredo JM. (2008). Fifty years of cytogenetics: a parallel view of the evolution of cytogenetics and genotoxicology. Biochim Biophys Acta 1779(6–7):363–75. Geisert EE, et al. (2009). Gene expression in the mouse eye: an online resource for genetics using 103 strains of mice. Mol Vis 15:1730–63. Gentle A, et al. (2001). High-resolution semi-quantitative real-time PCR without the use of a standard curve. Biotechniques 31(3):502–08. Gibson UE, et al. (1996). A novel method for real time quantitative RT-PCR. Genome Res 6(10):995–1001. Glei M, et al. (2009). Use of Comet-FISH in the study of DNA damage and repair: review. Mutat Res 681(1):33–43.
Gungor N, et al. (2010). Genotoxic effects of neutrophils and hypochlorous acid. Mutagenesis 25(2):149–54. Gutierrez L, et al. (2008). Ablation of Gata1 in adult mice results in aplastic crisis revealing its essential role in steady-state and stress erythropoiesis. Blood 111(8):4375–85. Guzman-Rojas L, et al. (2000). PRELI, the human homologue of the avian px19, is expressed by germinal center B lymphocytes. Int Immunol 12(5):607–12. Guzman-Rojas L, et al. (2002). Life and death within germinal centres: a double-edged sword. Immunology 107(2):167–75. Halabi CM, et al. (2008). Interference with PPAR gamma function in smooth muscle causes vascular dysfunction and hypertension. Cell Metab 7(3):215–26. Hartmann S, et al. (2008). Detection of genomic imbalances in microdissected Hodgkin and Reed-Sternberg cells of classical Hodgkin’s lymphoma by array-based comparative genomic hybridization. Haematologica 93(9):1318–26. Heid CA, et al. (1996). Real time quantitative PCR. Genome Res 6(10):986–94. Helmrich A, et al. (2006). Common fragile sites are conserved features of human and mouse chromosomes and relate to large active genes. Genome Res 16:1222–30. Herceg Z, Hainaut P. (2007). Genetic and epigenetic alterations as biomarkers for cancer detection, diagnosis and prognosis. Mol Oncol 1(1):26–41. Hogrefe WR. (2005). Biomarkers and assessment of vaccine responses. Biomarkers 10(Suppl 1):S50–57. Hubbard T, et al. (2002). The Ensembl genome database project. Nucl Acids Res 30(1):38–41. Hussain S, et al. (2009). DUBs and cancer: the role of deubiquitinating enzymes as oncogenes, non-oncogenes and tumor suppressors. Cell Cycle 8(11):1688–97. Hyvarinen K, et al. (2009). Detection and quantification of five major periodontal pathogens by single copy gene-based real-time PCR. Innate Immun 15(4):195–204. Ikram MA, et al. (2009). Genomewide association studies of stroke. N Engl J Med 360(17):1718–28. Jacob J, et al. (1991). 
Intraclonal generation of antibody mutants in germinal centres. Nature 354(6352):389–92. Jaillard S, et al. (2009). Identification of gene copy number variations in patients with mental retardation using array-CGH: Novel syndromes in a large French series. Eur J Med Genet, Epub ahead of print, October 28. Jiao Y, et al. (2009). ENU induced single mutation locus on chr 16 leads to high-frequency hearing loss in mice. Genes Genet Syst 84(3):219–24. Jolly CJ, O’Neill HC. (1997). Specific transcription of the unrearranged TCR V beta 8.2 gene in lymphoid tissues occurs independently of V(D)J rearrangement. Immunol Cell Biol 75(1):13–20. Jones RG, Thompson CB. (2009). Tumor suppressors and cell metabolism: a recipe for cancer growth. Genes Dev 23(5):537–48. Jordan R, et al. (1999). Detection of chromosome aberrations by FISH as a function of cell division cycle (harlequin-FISH). Biotechniques 26(3):532–34. Jothi R, et al. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucl Acids Res 36(16):5221–31.
Karan G, et al. (2004). Expression of wild type and mutant ELOVL4 in cell culture: subcellular localization and cell viability. Mol Vis 10:248–53. Karnan S, et al. (2006). Genomewide array-based comparative genomic hybridization analysis of acute promyelocytic leukemia. Genes Chromosomes Cancer 45(4):420–25. Kearney L. (2006). Multiplex-FISH (M-FISH): technique, developments and applications. Cytogenet Genome Res 114(3–4):189–98. Kent WJ. (2002). BLAT—the BLAST-like alignment tool. Genome Res 12(4):656–64. Kent WJ, Haussler D. (2001). Assembly of the working draft of the human genome by GigAssembler. Genome Res 11(9):1541–48. Kirschner LS. (2009). Use of mouse models to understand the molecular basis of tissue-specific tumorigenesis in the Carney complex. J Intern Med 266(1):60–68. Kleeberger SR, et al. (2000). Genetic susceptibility to ozone-induced lung hyperpermeability: role of toll-like receptor 4. Am J Respir Cell Mol Biol 22:620–27. Kohler G, Milstein C. (1975). Continuous cultures of fused cells secreting antibody of predefined specificity. Nature 256(5517):495–97. Koizumi K, et al. (2007). Chemokine receptors in cancer metastasis and cancer cell-derived chemokines in host immune response. Cancer Sci 98(11):1652–58. Komine M. (2009). Analysis of the mechanism for the development of allergic skin inflammation and the application for its treatment: keratinocytes in atopic dermatitis—their pathogenic involvement. J Pharmacol Sci 110(3):260–64. Krysko DV, et al. (2008). Apoptosis and necrosis: detection, discrimination and phagocytosis. Methods 44(3):205–21. Kubista M, et al. (2006). The real-time polymerase chain reaction. Mol Aspects Med 27(2–3):95–125. Kutyavin I, et al. (2003). Chemistry of minor groove binder-oligonucleotide conjugates. Curr Protoc Nucleic Acid Chem Chapter 8, Unit 8.4. Lander ES, et al. (2001). Initial sequencing and analysis of the human genome. Nature 409(6822):860–921. Landry JR, Mager DL. (2002). 
Widely spaced alternative promoters, conserved between human and rodent, control expression of the Opitz syndrome gene MID1. Genomics 80(5):499–508. Landry JR, et al. (2003). Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet 19(11):640–48. Landvik NE, et al. (2009). A specific interleukin-1B haplotype correlates with high levels of IL1B mRNA in the lung and increased risk of non-small cell lung cancer. Carcinogenesis 30:1186–92. Lay MJ, Wittwer CT. (1997). Real-time fluorescence genotyping of factor V Leiden during rapid-cycle PCR. Clin Chem 43(12):2262–67. Lerebours F, et al. (2002). Evidence of chromosome regions and gene involvement in inflammatory breast cancer. Int J Cancer 102(6):618–22. Li H, et al. (2009). Matrix metalloproteinase-9 inhibition ameliorates pathogenesis and improves skeletal muscle regeneration in muscular dystrophy. Hum Mol Genet 18(14):2584–98. Li J, et al. (2007). Noninvasive intravital imaging of thymocyte dynamics in medaka. J Immunol 179(3):1605–15.
c15.indd 335
1/12/2011 9:44:28 AM
336
CONFIRMATION OF A MUTATION BY MULTIPLE MOLECULAR APPROACHES
Li X, et al. (2008). Clinical utility of microarrays: current status, existing challenges and future outlook. Curr Genomics 9(7):466–74. Liu CG, et al. (2008). MicroRNA expression profiling using microarrays. Nat Protoc 3(4):563–78. Liu YJ, et al. (1996a). Normal human IgD + IgM- germinal center B cells can express up to 80 mutations in the variable region of their IgD transcripts. Immunity 4(6):603–13. Liu YJ, et al. (1996b). Sequential triggering of apoptosis, somatic mutation and isotype switch during germinal center development. Semin Immunol 8(3):169–77. Liu YJ, et al. (1996c). Within germinal centers, isotype switching of immunoglobulin genes occurs after the onset of somatic mutation. Immunity 4(3):241–50. Lohela M, Werb Z. (2009). Intravital imaging of stromal cell dynamics in tumors. Curr Opin Genet Dev 20(1):72–78. Long A, et al. (2004). A multidisciplinary approach to the study of T cell migration. Ann N Y Acad Sci 1028:313–19. Louis M, et al. (2004). Rapid combined genotyping of factor V, prothrombin and methylenetetrahydrofolate reductase single nucleotide polymorphisms using minor groove binding DNA oligonucleotides (MGB probes) and real-time polymerase chain reaction. Clin Chem Lab Med 42(12):1364–69. Louvel S, et al. (2008). Detection of drug-resistant HIV minorities in clinical specimens and therapy failure. HIV Med 9(3):133–41. Lund-Olesen T, et al. (2008). Sensitive on-chip quantitative real-time PCR performed on an adaptable and robust platform. Biomed Microdevices 10(6):769–76. Lutfalla G, Uze G. (2006). Performing quantitative reverse-transcribed polymerase chain reaction experiments. Meth Enzymol 410:386–400. Luthra R, Medeiros LJ. (2006). TaqMan reverse transcriptase-polymerase chain reaction coupled with capillary electrophoresis for quantification and identification of bcr-abl transcript type. Meth Mol Biol 335:135–45. MacBeath JR, et al. (2001). Automated fluorescent DNA sequencing on the ABI PRISM 377. Meth Mol Biol 167:119–52. 
Mackay J, Landt O. (2007). Real-time PCR fluorescent chemistries. Meth Mol Biol 353:237–61. Mahadevaiah SK, et al. (2009). Using RNA FISH to study gene expression during mammalian meiosis. Meth Mol Biol 558:433–44. Maierhofer C, et al. (2002). Multicolor FISH in two and three dimensions for clastogenic analyses. Mutagenesis 17(6):523–27. Malinen E, et al. (2003). Comparison of real-time PCR with SYBR Green I or 5′nuclease assays and dot-blot hybridization with rDNA-targeted oligonucleotide probes in quantification of selected faecal bacteria. Microbiology 149(Pt 1): 269–77. Malisan F, et al. (1996a). B-chronic lymphocytic leukemias can undergo isotype switching in vivo and can be induced to differentiate and switch in vitro. Blood 87(2): 717–24. Malisan F, et al. (1996b). Interleukin-10 induces immunoglobulin G isotype switch recombination in human CD40-activated naive B lymphocytes. J Exp Med 183(3): 937–47.
c15.indd 336
1/12/2011 9:44:28 AM
REFERENCES
337
Marcucci G, et al. (2008). MicroRNA expression in cytogenetically normal acute myeloid leukemia. N Engl J Med 358:1919–28. Martinez-Valdez H, et al. (1996). Human germinal center B cells express the apoptosisinducing genes Fas, c-myc, P53, and Bax but not the survival gene bcl-2. J Exp Med 183(3):971–77. Mathews LA, et al. (2009). Epigenetic gene regulation in stem cells and correlation to cancer. Differentiation 78(1):1–17. Matteucci E, Giampietro O. (2008). Flow cytometry study of leukocyte function: analytical comparison of methods and their applicability to clinical research. Curr Med Chem 15(6):596–603. McGivern DR, Lemon SM. (2009). Tumor suppressors, chromosomal instability, and hepatitis C virus-associated liver cancer. Annu Rev Pathol 4:399–415. McKeller MR, Martinez-Valdez H. (2006). The kappa-like pre-B receptor: surplus biology or a missing link? Semin Immunol 18(1):40–43. McPherson JD. (2009). Next-generation gap. Nat Methods 6(11 Suppl):S2–5. Medstrand P, et al. (2001). Long terminal repeats are used as alternative promoters for the endothelin B receptor and apolipoprotein C-I genes in humans. J Biol Chem 276(3):1896–903. Meffre E, et al. (1998). Antigen receptor engagement turns off the V(D)J recombination machinery in human tonsil B cells. J Exp Med 188(4):765–72. Mehra S, Hu WS. (2005). A kinetic model of quantitative real-time polymerase chain reaction. Biotechnol Bioeng 91(7):848–60. Michelsen U, von Hagen J. (2009). Isolation of subcellular organelles and structures. Meth Enzymol 463:305–28. Migliore L, et al. (1999). Preferential occurrence of chromosome 21 malsegregation in peripheral blood lymphocytes of Alzheimer disease patients. Cytogenet Cell Genet 87(1–2):41–46. Mocellin S, et al. (2003). Quantitative real-time PCR: a powerful ally in cancer research. Trends Mol Med 9(5):189–95. Morisawa T, et al. (2008). Organ-specific profiles of genetic changes in cancers caused by activation-induced cytidine deaminase expression. 
Int J Cancer 123(12): 2735–40. Motta FC, et al. (2006). Comparison between denaturing gradient gel electrophoresis and phylogenetic analysis for characterization of A/H3N2 influenza samples detected during the 1999–2004 epidemics in Brazil. J Virol Meth 135(1):76–82. Muller MC, et al. (2004). Standardization of preanalytical factors for minimal residual disease analysis in chronic myelogenous leukemia. Acta Haematol 112(1–2):30–33. Mullighan CG, et al. (2009). Deletion of IKZF1 and prognosis in acute lymphoblastic leukemia. N Engl J Med 360(5):470–80. Murphy J, Bustin SA. (2009). Reliability of real-time reverse-transcription PCR in clinical diagnostics: gold standard or substandard? Expert Rev Mol Diagn 9(2):187–97. Muschen M, et al. (2000a). Somatic mutation of the CD95 gene in human B cells as a side-effect of the germinal center reaction. J Exp Med 192(12):1833–40. Muschen M, et al. (2000b). Somatic mutations of the CD95 gene in Hodgkin and ReedSternberg cells. Cancer Res 60(20):5640–43.
c15.indd 337
1/12/2011 9:44:28 AM
338
CONFIRMATION OF A MUTATION BY MULTIPLE MOLECULAR APPROACHES
Muschen M, et al. (2002). The origin of CD95-gene mutations in B-cell lymphoma. Trends Immunol 23(2):75–80. Nannya Y, et al. (2005). A robust algorithm for copy number detection using highdensity oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 65(14):6071–79. Neu-Yilik G, Kulozik AE. (2008). NMD: multitasking between mRNA surveillance and modulation of gene expression. Adv Genet 62:185–243. Nigg EA, Raff JW. (2009). Centrioles, centrosomes, and cilia in health and disease. Cell 139(4):663–78. Nissim A, Chernajovsky Y. (2008). Historical development of monoclonal antibody therapeutics. Handb Exp Pharmacol (181):3–18. Nymark P, et al. (2006). Identification of specific gene copy number changes in asbestosrelated lung cancer. Cancer Res 66:5737–43. O’Brien S, et al. (1995). Advances in the biology and treatment of B-cell chronic lymphocytic leukemia. Blood 85(2):307–18. Pala FS, et al. (2001). In vitro transmission of chromosomal aberrations through mitosis in human lymphocytes. Mutat Res 474(1–2):139–46. Palmer S, et al. (2003). New real-time reverse transcriptase-initiated PCR assay with single-copy sensitivity for human immunodeficiency virus type 1 RNA in plasma. J Clin Microbiol 41(10):4531–36. Parsa JY, et al. (2007). AID mutates a non-immunoglobulin transgene independent of chromosomal position. Mol Immunol 44(4):567–75. Partanen JI, et al. (2009). 3D view to tumor suppression: Lkb1, polarity and the arrest of oncogenic c-Myc. Cell Cycle 8(5):716–24. Pascual V, et al. (1994). Analysis of somatic mutation in five B cell subsets of human tonsil. J Exp Med 180(1):329–39. Pasqualucci L, et al. (1998). BCL-6 mutations in normal germinal center B cells: evidence of somatic hypermutation acting outside Ig loci. Proc Natl Acad Sci U S A 95(20):11816–821. Passos JF, von Zglinicki T. (2007). Methods for cell sorting of young and senescent cells. Meth Mol Biol 371:33–44. Perlman S, et al. (2003). Ataxia-telangiectasia: diagnosis and treatment. 
Semin Pediatr Neurol 10(3):173–82. Pfeifer GP, Besaratinia A. (2009). Mutational spectra of human cancer. Hum Genet 125(5–6):493–506. Porter D, Polyak K. (2003). Cancer target discovery using SAGE. Expert Opin Ther Targets 7(6):759–69. Potter AJ, Wener MH. (2005). Flow cytometric analysis of fluorescence in situ hybridization with dye dilution and DNA staining (flow-FISH-DDD) to determine telomere length dynamics in proliferating cells. Cytometry A 68(1):53–58. Pritchard JK, Przeworski M. (2001). Linkage disequilibrium in humans: models and data. Am J Hum Genet 69(1):1–15. Puebla-Osorio N, Zhu C. (2008). DNA damage and repair during lymphoid development: antigen receptor diversity, genomic integrity and lymphomagenesis. Immunol Res 41(2):103–22.
c15.indd 338
1/12/2011 9:44:28 AM
REFERENCES
339
Raeymaekers, L. (2000). Basic principles of quantitative PCR. Mol Biotechnol 15(2): 115–22. Rajewsky, K. (1996). Clonal selection and learning in the antibody system. Nature 381(6585):751–58. Rangel R, et al. (2005). Assembly of the kappa preB receptor requires a V kappa-like protein encoded by a germline transcript. J Biol Chem 280(18):17807–815. Rathmell JC, et al. (1996). Expansion or elimination of B cells in vivo: dual roles for CD40- and Fas (CD95)-ligands modulated by the B cell antigen receptor. Cell 87(2):319–29. Reddy PS, et al. (2008). A high-throughput genome-walking method and its use for cloning unknown flanking sequences. Anal Biochem 381(2):248–53. Reil H, et al. (2008). Clinical validation of a new triplex real-time polymerase chain reaction assay for the detection and discrimination of Herpes simplex virus types 1 and 2. J Mol Diagn 10(4):361–67. Rodriguez-Manotas M, et al. (2006). Real time PCR assay with fluorescent hybridization probes for genotyping intronic polymorphism in presenilin-1 gene. Clin Chim Acta 364(1–2):343–44. Rooney PH. (2005). Multiplex quantitative real-time PCR of laser microdissected tissue. Meth Mol Biol 293:27–37. Ross AJ, et al. (2007). Transcriptional profiling of mucociliary differentiation in human airway epithelial cells. Am J Respir Cell Mol Biol 37:169–85. Saint-Ruf C, et al. (1994). Analysis and expression of a cloned pre-T cell receptor gene. Science 266(5188):1208–12. Saleh A, et al. (2002). Identification of a novel Ly49 promoter that is active in bone marrow and fetal thymus. J Immunol 168(10):5163–69. Sanger F, et al. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74(12):5463–67. Savas S, Liu G. (2009). Genetic variation as cancer prognostic markers: review and update. Hum Mutat 30:1369–77. Sawyers CL, et al. (1991). Leukemia and the disruption of normal hematopoiesis. Cell 64(2):337–50. Scacheri PC, et al. (2001). 
Bidirectional transcriptional activity of PGK-neomycin and unexpected embryonic lethality in heterozygote chimeric knockout mice. Genesis 30(4):259–63. Scheinfeldt LB, et al. (2009). Population genomic analysis of ALMS1 in humans reveals a surprisingly complex evolutionary history. Mol Biol Evol 26(6):1357–67. Schjeide BM, et al. (2009). GAB2 as an Alzheimer disease susceptibility gene: follow-up of genomewide association results. Arch Neurol 66(2):250–54. Scott JD, Pawson T. (2009). Cell signaling in space and time: where proteins come together and when they’re apart. Science 326(5957):1220–24. Severino G, Del Zompo M. (2004). Adverse drug reactions: role of pharmacogenomics. Pharmacol Res 49(4):363–73. Shaffer LG, Bejjani BA. (2004). A cytogeneticist’s perspective on genomic microarrays. Hum Reprod Update 10(3):221–26. Siddiqa A, et al. (2001). Regulation of CD40 and CD40 ligand by the AT-hook transcription factor AKNA. Nature 410(6826):383–87.
c15.indd 339
1/12/2011 9:44:28 AM
340
CONFIRMATION OF A MUTATION BY MULTIPLE MOLECULAR APPROACHES
Siepmann K, et al. (2001). Rewiring of CD40 is necessary for delivery of rescue signals to B cells in germinal centres and subsequent entry into the memory pool. Immunology 102(3):263–72. Sims-Mourtada JC, et al. (2003). In vivo expression of interleukin-8, and regulated on activation, normal, T-cell expressed, and secreted, by human germinal centre B lymphocytes. Immunology 110(3):296–303. Sims-Mourtada JC, et al. (2005). The human AKNA gene expresses multiple transcripts and protein isoforms as a result of alternative promoter usage, splicing, and polyadenylation. DNA Cell Biol 24(5):325–38. Sinden RR, et al. (1999). DNA-directed mutations. Leading and lagging strand specificity. Ann N Y Acad Sci 870:173–89. Sleckman BP, et al. (1996). Accessibility control of antigen-receptor variable-region gene assembly: role of cis-acting elements. Annu Rev Immunol 14:459–81. Storb U, et al. (2001). Somatic hypermutation of immunoglobulin and nonimmunoglobulin genes. Philos Trans R Soc Lond B Biol Sci 356(1405):13–19. Sun Y, et al. (2003). Specific interaction of PML bodies with the TP53 locus in Jurkat interphase nuclei. Genomics 82(2):250–52. Szczepanski, T. (2007). Why and how to quantify minimal residual disease in acute lymphoblastic leukemia? Leukemia 21(4):622–26. Takahashi T, et al. (1994). Generalized lymphoproliferative disease in mice, caused by a point mutation in the Fas ligand. Cell 76(6):969–76. Takezaki N, Nei M. (2009). Genomic drift and evolution of microsatellite DNAs in human populations. Mol Biol Evol 26(8):1835–40. Tay SK, et al. (2009). Global discovery of primate-specific genes in the human genome. Proc Natl Acad Sci U S A 106(29):12019–024. Teste MA, et al. (2009). Validation of reference genes for quantitative expression analysis by real-time RT-PCR in Saccharomyces cerevisiae. BMC Mol Biol 10:99. Thomas KR, Capecchi MR. (1987). Site-directed mutagenesis by gene targeting in mouse embryo-derived stem cells. Cell 51(3):503–12. Thorbecke GJ, et al. 
(1994). Biology of germinal centers in lymphoid tissue. FASEB J 8(11):832–40. Thye T, et al. (2003). Genomewide linkage analysis identifies polymorphism in the human interferon-gamma receptor affecting Helicobacter pylori infection. Am J Hum Genet 72:448–53. Tonegawa, S. (1983). Somatic generation of antibody diversity. Nature 302(5909): 575–81. Trelle MB, Jensen ON. (2007). Functional proteomics in histone research and epigenetics. Expert Rev Proteomics 4(4):491–503. Trinklein ND, et al. (2003). Identification and functional analysis of human transcriptional promoters. Genome Res 13(2):308–12. Unniraman S, Schatz DG. (2006). AID and Igh switch region-Myc chromosomal translocations. DNA Repair (Amst) 5(9–10):1259–64. van Rijk A, et al. (2008). Translocation detection in lymphoma diagnosis by split-signal FISH: a standardised approach. J Hematop 1(2):119–26.
c15.indd 340
1/12/2011 9:44:28 AM
REFERENCES
341
van Rijk A, et al. (2009). Double staining chromatic in situ hybridization as a useful alternative of split-signal in situ hybridization. Hematologica 95(2):247–52. Epub 2009 Sep 22. Van Vlierberghe P, et al. (2008). Molecular-genetic insights in paediatric T-cell acute lymphoblastic leukaemia. Br J Haematol 143(2):153–68. Volpi EV, Bridger JM. (2008). FISH glossary: an overview of the fluorescence in situ hybridization technique. Biotechniques 45(4):385–90. Voss H, et al. (1995). Efficient low redundancy large-scale DNA sequencing at EMBL. J Biotechnol 41(2–3):121–29. Voss TC, et al. (2006). Single-cell analysis of glucocorticoid receptor action reveals that stochastic post-chromatin association mechanisms regulate ligand-specific transcription. Mol Endocrinol 20(11):2641–55. Wagatsuma A, et al. (2005). Determination of the exact copy numbers of particular mRNAs in a single cell by quantitative real-time RT-PCR. J Exp Biol 208(Pt 12): 2389–98. Wang F, et al. (2009). Normalizing genes for real-time PCR in epithelial and nonepithelial cells of mouse small intestine. Anal Biochem 399(2):211–17. Wang T, Brown MJ. (1999). mRNA quantification by real time TaqMan polymerase chain reaction: validation and comparison with RNase protection. Anal Biochem 269(1):198–201. Watanabe-Fukunaga R, et al. (1992). Lymphoproliferation disorder in mice explained by defects in Fas antigen that mediates apoptosis. Nature 356(6367):314–17. Watts D, MacBeath JR. (2001). Automated fluorescent DNA sequencing on the ABI PRISM 310 Genetic Analyzer. Meth Mol Biol 167:153–70. Weiss MJ. (1997). Embryonic stem cells and hematopoietic stem cell biology. Hematol Oncol Clin North Am 11(6):1185–98. Wen F, et al. (2004). The impact of very short alternative splicing on protein structures and functions in the human genome. Trends Genet 20(5):232–36. Wetterholm A, et al. (2008). 
High-level expression, purification, and crystallization of recombinant rat leukotriene C(4) synthase from the yeast Pichia pastoris. Protein Expr Purif 60(1):1–6. Wheeler DA, et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189):872–76. Wheeler DL, et al. (2001). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 29(1):11–16. Winter H, et al. (2004). Direct gene expression analysis. Curr Pharm Biotechnol 5(2): 191–97. Wong LJ, Bai RK. (2006). Real-time quantitative polymerase chain reaction analysis of mitochondrial DNA point mutation. Meth Mol Biol 335:187–200. Wong ML, Medrano JF. (2005). Real-time PCR for mRNA quantitation. Biotechniques 39(1):75–85. Wu T, Mohan C. (2009). Proteomic toolbox for autoimmunity research. Autoimmun Rev 8(7):595–98. Xiong Q, et al. (2008a). A close examination of genes within quantitative trait loci of bone mineral density in whole mouse genome. Crit Rev Eukaryot Gene Expr 18(4):323–43.
c15.indd 341
1/12/2011 9:44:28 AM
342
CONFIRMATION OF A MUTATION BY MULTIPLE MOLECULAR APPROACHES
Xiong Q, et al. (2008b). PGMapper: a web-based tool linking phenotype to genes. Bioinformatics 24(7):1011–13. Xylourgidis N, Fornerod M. (2009). Acting out of character: regulatory roles of nuclear pore complex proteins. Dev Cell 17(5):617–25. Yan H, et al. (2009). IDH1 and IDH2 mutations in gliomas. N Engl J Med 360(8): 765–73. Yang F, et al. (2004). Cytogenetic and immuno-FISH analysis of the 4q subtelomeric region, which is associated with facioscapulohumeral muscular dystrophy. Chromosoma 112(7):350–59. Yang SH, et al. (2008). Progerin elicits disease phenotypes of progeria in mice whether or not it is farnesylated. J Clin Invest 118(10):3291–300. Yang XO, et al. (2003). Regulation of T-cell receptor D beta 1 promoter by KLF5 through reiterated GC-rich motifs. Blood 101(11):4492–99. Yu W, et al. (2008). Epigenetic silencing of tumour suppressor gene p15 by its antisense RNA. Nature 451(7175):202–06. Yuan JS, et al. (2006). Statistical analysis of real-time PCR data. BMC Bioinformatics 7:85. Yuille MR, et al. (2001). TCL1 is activated by chromosomal rearrangement or by hypomethylation. Genes Chromosomes Cancer 30(4):336–41. Zarudnaya MI, et al. (2003). Downstream elements of mammalian pre-mRNA polyadenylation signals: primary, secondary and higher-order structures. Nucleic Acids Res 31(5):1375–86. Zhang F, et al. (2009). Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10:451–81. Zhang J, et al. (2005). SNPdetector: a software tool for sensitive and accurate SNP detection. PLoS Comput Biol 1(5):e53. Zhou W, et al. (2004). The role of p22 NF-E4 in human globin gene switching. J Biol Chem 279(25):26227–32. Zinner R, et al. (2007). Biochemistry meets nuclear architecture: multicolor immunoFISH for co-localization analysis of chromosome segments and differentially expressed gene loci with various histone methylations. Adv Enzyme Regul 47: 223–41.
c15.indd 342
1/12/2011 9:44:28 AM
CHAPTER 16
Confirmation of a Mutation by MicroRNA
HONGWEI ZHENG and YONGJUN WANG
Contents
16.1 Basic Concept of MicroRNA and Relevance to Gene Function
  16.1.1 Introduction
  16.1.2 Biogenesis of MiRNAs
  16.1.3 Biological Functions of MiRNAs
  16.1.4 The Mechanism of MiRNA-Target Recognition
  16.1.5 The Mechanism of MiRNA Regulation
  16.1.6 Involvement of MiRNA in Human Diseases
  16.1.7 Variation of MiRNA Binding Sites (Single-Base Mutation/Polymorphism) within 3′-UTR and MiRNA Functional Deregulations
16.2 Designing an Experiment Using MiRNA to Confirm Gene Mutation Function
  16.2.1 Background
  16.2.2 Experiment Design
16.3 Procedure of Confirmation of a Gene Mutation by MiRNA
  16.3.1 Luciferase Reporter Assays
  16.3.2 MiRNA Target Gene Expression Analysis
16.4 Limitations and Troubleshooting
16.5 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
16.1 BASIC CONCEPT OF MICRORNA AND RELEVANCE TO GENE FUNCTION

16.1.1 Introduction

MicroRNAs (miRNAs) are evolutionarily conserved, endogenous, ∼22-nucleotide (nt), noncoding small RNAs that regulate gene expression in a sequence-specific manner via mRNA degradation, transcriptional regulation, or translational repression. Vertebrate miRNA targets are thought to be numerous. Computational analysis estimates that up to ∼1000 miRNAs may be encoded in the human genome (Berezikov et al., 2005). More than 721 of them have been identified by molecular cloning and registered in the miRNA database, miRBase, and it is predicted that they regulate 30% of protein-encoding transcripts (Lewis et al., 2005; Xie et al., 2005).

16.1.2 Biogenesis of MiRNAs

The basic scheme of the microRNA pathway is shown in Figure 16.1. MiRNAs are generated in multiple steps, and the biogenesis and function of miRNA require a common set of proteins. First, miRNAs are transcribed by RNA polymerase II as long RNA precursors (pri-miRNAs) (Lee et al., 2002; Cai et al., 2004; Lee et al., 2004), which are usually several kilobases long and contain a 7-methylguanosine cap structure and a poly(A) tail similar to those of protein-coding mRNAs. The RNase III enzyme drosha processes the pri-miRNAs into 60- to 70-nt precursor miRNAs (pre-miRNAs) with a hairpin-shaped stem-loop secondary structure, a 5′ phosphate, and a 2-nt 3′ overhang (Lee et al., 2003). Drosha associates with the double-stranded RNA-binding protein DGCR8 in humans (Gregory et al., 2004) or pasha in flies (Denli et al., 2004) to form the microprocessor complex, which is required for directing the specific cleavage of pri-miRNA by drosha. Pre-miRNAs are exported to the cytoplasm by exportin-5 (Yi et al., 2003; Lund et al., 2004), further processed by another RNase III enzyme, dicer, and released as 22-nt double-stranded miRNA (Hutvagner et al., 2001). After the duplex is unwound by a helicase, only one mature miRNA strand (the guide strand) is incorporated into an RNA-induced silencing complex (RISC) that mediates cleavage or translational inhibition of target mRNAs, while the other strand (the passenger strand) is quickly degraded (Matranga et al., 2005; Rand et al., 2005). RISC is composed of dicer, argonaute 2 (Ago2), and the double-stranded RNA-binding protein TRBP. It cleaves target mRNAs more efficiently when supplied with pre-miRNAs than with duplex RNAs that lack the stem-loop structure, a finding that suggests that processing by dicer may be coupled with assembly of the mature miRNA into RISC (Gregory et al., 2005). Ago2, the key component of RISC, may function as an endonuclease that cleaves target mRNAs (Hammond et al., 2001). RISC is guided by the incorporated miRNA to the complementary sequence in the 3′ untranslated region (UTR) of target mRNAs.

Figure 16.1. Biogenesis of miRNAs. The primary miRNA (pri-miRNA) is transcribed by RNA polymerase II (Pol II) in the nucleus and then modified by the addition of a 5′-m7G cap and a 3′-poly(A) tail. The pri-miRNA is then processed by the RNase drosha and its co-factor, pasha, to form the hairpin-structured precursor miRNA (pre-miRNA), which is exported from the nucleus by exportin 5. In the cytoplasm, the pre-miRNA is further processed by the RNase dicer and unwound by a putative helicase to yield a mature single-stranded miRNA, which is loaded into the RNA-induced silencing complex (RISC) in tight association with the argonaute protein (Ago). The miRNA is now ready to interact with its target mRNAs (Cowland et al., 2007).
MiRNAs that bind to the 3′ UTR of the target mRNA with perfect or near-perfect complementarity lead to degradation of the target mRNA by Ago2. In contrast, partial base pairing between an miRNA and a target mRNA leads to translational silencing of the target mRNA without degradation. In systematic mutation experiments, the binding of some nucleotides in the 5′ region of miRNAs seems to be functionally important in partial base pairing (Doench, 2004; Kiriakidou et al., 2004).

16.1.3 Biological Functions of MiRNAs

Less than 20 years ago, the lin-4 gene, which controls the timing of C. elegans larval development, was unexpectedly found to produce a 21-nt-long noncoding RNA that suppressed lin-14 protein expression without noticeably affecting lin-14 mRNA levels (Lee et al., 1993; Wightman et al., 1993). This first miRNA was initially treated as a genetic oddity and virtually ignored for nearly a decade, but we now recognize that hundreds of these small RNAs exist in the genomes of divergent species and posttranscriptionally regulate gene expression by base pairing with complementary sites in the 3′-UTR of target mRNAs, thereby negatively affecting translation (Callis et al., 2008). Increasing evidence indicates that miRNAs may in fact be key regulators of processes such as development (Reinhart et al., 2000; Giraldez et al., 2005), cell proliferation and death (Brennecke et al., 2003), apoptosis and fat metabolism (Xu et al., 2003), hematopoiesis (Chen et al., 2004), and stem cell division (Hatfield et al., 2005).

16.1.3.1 MiRNA Emerges as a Central Regulator for Development

Basic research found that miRNAs are involved in regulating developmental processes. For instance, without miR-430, zebrafish embryos develop defects that can be rescued by supplying miR-430 (Giraldez et al., 2005). Another study of C. elegans miRNAs showed that without lin-4, C. elegans is unable to make the transition from the first to the second larval stage because of a differentiation defect caused by a failure to posttranscriptionally repress lin-14, the target gene of lin-4 (Lee et al., 1993; Wightman et al., 1993). Similarly, loss of let-7 can cause a failure of the larval-to-adult transition (Reinhart et al., 2000). It is known that lin-41, hbl-1, daf-12, and the fork head transcription factor pha-4 are direct targets of let-7 during this transition (Slack et al., 2000; Abrahante et al., 2003; Grosshans et al., 2005). McGlinn et al. (2009) identified a layer of regulatory control provided by the miR-196 family in defining the boundary of Hox gene expression along the anterior-posterior (A-P) embryonic axis in chick development. Following knockdown of miR-196, they observed a homeotic transformation of the last cervical vertebra toward a thoracic identity.

16.1.3.2 MiRNAs Are Involved in Cell Proliferation and Apoptosis

A number of miRNAs have been shown to balance cell proliferation and survival. For example, members of the miR-17-92 cluster are frequently upregulated in lymphomas, representing potential oncomiRs.
It was shown in Eμ-Myc transgenic mice that the miR-17-92 cluster, but not the individual miRNAs, could enhance tumorigenesis by inhibiting apoptosis in c-Myc-overexpressing tumors (He et al., 2005). Additional studies in human cell lines showed that transcription of the miR-17-92 cluster was directly regulated by c-Myc and that the individual miRNAs miR-17-5p and miR-20 regulate the translation of E2F1, a transcription factor with both pro-apoptotic and pro-proliferative activity. Thus co-expression of c-Myc and miR-17 is believed to fine-tune E2F1 activity so that proliferation is enhanced and apoptosis is inhibited (O’Donnell et al., 2005). In addition, miR-21, which is highly expressed in glioblastoma, has anti-apoptotic activity (Ciafre et al., 2005). Knockdown of miR-21 in breast tumor and glioblastoma cell lines led to inhibition of BCL-2 activity, caspase reactivation, and increased apoptosis (Chan et al., 2005; Si et al., 2007).

16.1.3.3 MiRNAs Might Contribute to Maintaining Tissue Identity

Basic research found that the expression levels of miRNA targets are significantly lower in all mature mouse tissues and later life stages of Drosophila than in the embryos, which indicates that miRNAs might play roles in determining the timing of tissue differentiation and maintaining tissue identity during adulthood (Giraldez et al., 2005).

Figure 16.2. Possible mechanisms of miRNA-target recognition. a, Perfect (or near-perfect) complementarity between miRNA and mRNA leads to cleavage of the target mRNA through the siRNA pathway. b, Translation is repressed by an miRNA with incomplete complementarity to the mRNA, through inhibition of ribosomal elongation or recruitment of a protease that degrades the nascent polypeptide chain. Because ribosomes are still associated with the mRNA, the complex cannot enter the P-body. c, Inhibition of translation initiation by interaction between RISC and the translation initiation complex protein eIF4E leads to a ribosome-free miRNA:mRNA structure that is directed to the P-body, where it interacts with the Ccr4:Not1 deadenylase complex. This initiates degradation of the mRNA. Alternatively, the miRNA:mRNA complex may be stored in the P-body and, after an appropriate stimulus, reenter the cytoplasm for renewed translation (Cowland et al., 2007).

16.1.4 The Mechanism of MiRNA-Target Recognition
The precise mechanism by which individual miRNAs recognize their target sites on mRNAs has not yet been completely unraveled, but some general patterns have been determined (Fig. 16.2). The miRNA binding motif is situated in the 3′-UTR of the transcription production—that is, between the protein-coding region of the mRNA and its poly (A) tail (Stark et al., 2005). By sequence comparison of miRNAs and their cognate mRNA target sequences, it has been found that nucleotides 2 to 8 of the miRNA 5-region
constitute a seed region, which mediates the interaction of miRNA and its target (Lewis et al., 2003; Brennecke et al., 2005).

1. In most cases, the seed region binds to a perfectly complementary recognition sequence on the mRNA (Lewis et al., 2003; Brennecke et al., 2005). The central part of the miRNA (typically nucleotides 10 and 11) usually lacks complementarity to the mRNA, whereas the 3′ region of the miRNA binds more or less specifically to the mRNA and contributes partly to the specificity and affinity of the miRNA:mRNA complex (Brennecke et al., 2005).
2. In a few instances, the seed region does not show complete complementarity to the target sequence, and, in these cases, strong binding of the miRNA 3′ region to the mRNA is required to stabilize the RNA duplex (Enright et al., 2003).

MiRNAs that rely mainly on their seed sequence for binding may exert a function on the mRNA by themselves, whereas those that bind less strongly due to a weaker seed sequence often have to act in concert with other miRNAs binding to the same mRNA to cause an effect (Brennecke et al., 2005). Multiple searchable databases computationally predict miRNA targets in several species using various algorithms, but the predictions require further experimental validation. At present, these databases are vital in guiding the experimental validation of miRNA targets (Giraldez et al., 2005).

16.1.5 The Mechanism of MiRNA Regulation
The repression of mRNA is achieved in two different ways, depending on the degree of complementarity between the miRNA and the target.

16.1.5.1 The Perfectly Complementary Pathway

If perfect base complementarity exists between the miRNA and mRNA, the mRNA is processed through the siRNA pathway and cleaved in an miRNA-directed manner by argonaute proteins (Ago2 in humans), the catalytically active component of RISC (Yekta et al., 2004; Bagga et al., 2005) (Fig. 16.2). RISC is the ribonucleoprotein effector complex for miRNA-mediated regulation of gene expression and consists of argonaute protein family members and accessory factors such as R2D2, along with an miRNA and the targeted mRNA (Hammond et al., 2000; Filipowicz, 2005). This mode of gene silencing is common in plants but occurs only occasionally in animals (for instance, the miR-196-directed degradation of the HOXB8 transcript during mouse embryogenesis) (Yekta et al., 2004).

16.1.5.2 The Imperfectly Complementary Pathway

In animals, the vast majority of miRNAs are imperfectly complementary to the 3′-UTR of targeted mRNAs, which results in suppression of translation and
subsequent partial mRNA decay (Bartel, 2004; Bagga et al., 2005). Three modes of action have been unraveled (Fig. 16.2):

1. Repression of the initiation step of translation. For instance, the distribution of polysomes of let-7-repressed mRNAs was shifted toward the lighter fractions of a sucrose gradient in a manner similar to that observed when using known inhibitors of translational initiation. Analogously, the cationic amino acid transporter 1 (CAT-1) mRNA, which is repressed by miR-122 in hepatocarcinoma cells under regular growth conditions, was found in the light polysomal fraction (Bhattacharyya et al., 2006).
2. Repression of the elongation phase of translation. In Caenorhabditis elegans, repression of the lin-14 mRNA by miRNA lin-4 does not involve a change in polysome distribution, indicating that repression occurs after initiation of translation (Olsen and Ambros, 1999).
3. General destabilization of the transcript as a result of poly(A)-tail shortening. This mechanism relies on the recruitment of deadenylating and decapping enzymes by miRNAs, with subsequent degradation of the cognate transcript (Behm-Ansmant et al., 2006; Wu et al., 2006).

Mounting data indicate that mRNAs silenced by miRNA accumulate in cytoplasmic compartments known as processing bodies (P-bodies) (Liu et al., 2005; Pillai et al., 2005; Sen and Blau, 2005). The mRNAs found in these locations are devoid of ribosomes and other translation factors (Teixeira et al., 2005). The P-bodies are rich in enzymes involved in mRNA deadenylation, decapping, and degradation and are believed to cause decay of the miRNA-inhibited mRNAs (Sheth and Parker, 2003; Behm-Ansmant et al., 2006). In some instances, however, mRNAs instead appear to be stored in an inactive form in the P-body with the potential to reenter the cytoplasm and reengage in translation (Brengues et al., 2005).
One example of this phenomenon is the miR-122-directed repression of CAT-1 in hepatocarcinoma cells during normal growth, which is relieved by starvation and results in retranslation of the CAT-1 mRNA (Bhattacharyya et al., 2006).

16.1.6 Involvement of MiRNA in Human Diseases
Many studies have revealed a large number of miRNA-disease associations and have shown the mechanisms by which miRNAs are involved in diseases. Mutation of miRNAs, dysfunction of miRNA biogenesis, and deregulation of miRNAs and their targets may all result in disease. Currently, 70 diseases associated with miRNAs have been reported (see http://cmbi.bjmu.edu.cn/hmdd) (Lu et al., 2008). Giraldez et al. (2005) reported a linkage of miRNAs to cardiac hypertrophy and offered new insight into the regulation of this disease process. They found that the expression profiles for a number of miRNAs changed during cardiac hypertrophy. Furthermore, misexpression of miRNAs and loss-of-function experiments in mice demonstrated that specific
miRNAs can augment or attenuate the hypertrophic growth response and suggested the potential of these molecules as novel therapeutic targets. Ample evidence also shows that components of the miRNA machinery, miRNAs themselves, and their binding motifs are involved in many cellular processes that are altered in cancer, such as differentiation, proliferation, and apoptosis. Some miRNAs exhibit differential expression levels in cancer and have demonstrated the capability to affect cellular transformation, carcinogenesis, and metastasis by acting either as oncogenes (oncomiRs) or tumor suppressors (TSmiRs) (Medina and Slack, 2008). In general, the majority of miRNAs are downregulated in cancer specimens (Lu et al., 2005). Because miRNAs have several potential targets that may be the mRNAs of both oncogenes and tumor suppressors, the actual function of a particular miRNA as either a TSmiR or an oncomiR may depend on the cellular context (Lee et al., 1993).

Although the miRNA era started only a few years ago, it has brought great promise for cancer diagnosis, prognosis, and therapy. The rapid development of powerful techniques such as miRNA microarrays, bead-based miRNA profiling, specific quantitative PCR of miRNAs, and anti-sense technologies is expected to have a significant impact on clinical oncology in the next decade (Medina and Slack, 2008). One mRNA generally dictates the translation of a single protein, whereas one miRNA molecule has the capacity to regulate the translation of an array of genes governing a certain function (John et al., 2004; Sayed et al., 2007). A single miRNA can therefore regulate a cellular function, which makes miRNA levels more powerful than individual mRNA levels in predicting functional outcome. Moreover, mature miRNA levels are more tightly regulated and less variable. Thus miRNA is expected to provide a superior predictive parameter in diseases. When Lu et al.
(2005) put this idea to the test, they found that miRNA profiling was highly accurate in predicting the differentiation state of tumors and in classifying poorly differentiated tumors, neither of which could be achieved by mRNA profiling. Another study achieved almost perfect accuracy in classifying the tissue origin of 400 tumor samples from 22 different tumor tissues and metastases (Rosenfeld et al., 2008) and demonstrated the effectiveness of miRNAs as biomarkers for tracing the tissue of origin of cancers of unknown primary origin, a major clinical problem. MiRNA expression profiles also provide important information regarding the prognosis of cancer patients. For example, miRNA expression profiles obtained by miRNA microarrays correlated with survival in lung adenocarcinomas, including those at early pathological stages. High levels of miR-155 and low let-7a-2 expression correlated with poor survival (Yanaihara et al., 2006). Another recent miRNA profiling effort in lung cancer identified five miRNAs as important for prognosis: high levels of miR-221 and let-7a appeared to be protective, while high levels of miR-137, miR-372, and miR-182 correlated with worse clinical outcome. The levels of these miRNAs could also help in predicting relapse of the cancer (Yu et al., 2008). A recent study focused on colorectal cancer showed that high miR-21 expression was associated with poor survival and poor therapeutic outcome (Schetter et al.,
2008). Clearly, however, more studies are needed to further validate the predictive power of miRNA in cancer.

16.1.7 Variation of MiRNA Binding Sites (Single-Base Mutation/Polymorphism) within 3′-UTR and MiRNA Functional Deregulations

The 3′-UTRs of human protein-coding genes play a pivotal role in regulating mRNA 3′ end formation, stability/degradation, nuclear export, subcellular localization, and translation and hence are particularly rich in cis-acting regulatory elements. One recent addition to the already large repertoire of known cis-acting regulatory elements is the miRNA target binding sites that are present in the 3′-UTRs of many human genes (Chuzhanova et al., 2007). Recently, SNPs residing in miRNA-binding sites were shown to affect the expression of miRNA targets and contribute to the susceptibility to complex disorders such as cancer, asthma, cardiovascular disease, and Tourette’s syndrome (Abelson et al., 2005; Martin et al., 2007; Saunders et al., 2007; Tan et al., 2007; Yu et al., 2007). As each miRNA is expected to regulate the translation of up to 100 mRNAs (Brennecke et al., 2005; Lim et al., 2005; Xie et al., 2005), any disturbance of miRNA expression level, processing of the miRNA precursors, or mutation in the sequence of the miRNA, its precursor, or its target mRNA may have detrimental effects on cell physiology. One source of such disturbance is mutation in the miRNA:mRNA interacting sequences. Inappropriate base pairing resulting from variations in the 3′-UTR sequence of the target mRNAs or in the mature miRNA sequence is likely to weaken the interaction between the miRNA and mRNA (He et al., 2005; Iwai and Naraba, 2005) and might contribute to alterations in the translation efficiency of the target mRNA. This interaction is especially sensitive to mutations in the seed region (Brennecke et al., 2005).
Indeed, naturally occurring polymorphisms in miRNA binding sites have been documented in Tourette’s syndrome in humans and muscularity in sheep (Abelson et al., 2005; Clop et al., 2006). Loss of the KIT protein in thyroid cancers has been associated with high expression of miR-221, -222, and -146b, and polymorphic changes in the 3′-UTR of the KIT mRNA were demonstrated in half of these cases. Owing to the high incidence of familial thyroid cancer, researchers speculated that these polymorphisms might predispose carriers to this disease (He et al., 2005).
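The effect of a single-base variant on seed pairing can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the miRNA and the 7-nt target sites are hypothetical sequences invented for the sketch, the function names are not taken from any published tool, and counting Watson-Crick pairs is a crude stand-in for the thermodynamic and contextual criteria that real prediction programs apply. It classifies each seed:site base pair as Watson-Crick, G:U wobble, or mismatch and reports whether the variant allele weakens or strengthens pairing relative to the wild type.

```python
# Illustrative sketch only: all sequences are hypothetical, not real miRNAs or UTRs.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def seed(mirna: str) -> str:
    """Return nucleotides 2-8 of the miRNA (written 5'->3'), the seed region."""
    return mirna[1:8]


def pair_types(mirna_seed: str, site: str) -> list[str]:
    """Classify each seed:site base pair.

    `site` is the 7-nt mRNA segment written 5'->3'. The duplex is
    antiparallel, so seed position i pairs with site position 6 - i.
    """
    if len(mirna_seed) != 7 or len(site) != 7:
        raise ValueError("seed and site must both be 7 nt long")
    kinds = []
    for i, base in enumerate(mirna_seed):
        partner = site[6 - i]
        if (base, partner) in WATSON_CRICK:
            kinds.append("WC")
        elif (base, partner) in WOBBLE:
            kinds.append("wobble")
        else:
            kinds.append("mismatch")
    return kinds


def compare_alleles(mirna: str, wild_type_site: str, variant_site: str) -> str:
    """Compare the number of Watson-Crick seed pairs for two alleles of a site."""
    s = seed(mirna)
    wt = pair_types(s, wild_type_site).count("WC")
    var = pair_types(s, variant_site).count("WC")
    if var < wt:
        return "weakened"
    if var > wt:
        return "strengthened"
    return "unchanged"


HYPOTHETICAL_MIRNA = "UCAGGCUAACGUUAGCCAUGGA"  # seed (nt 2-8) is CAGGCUA

# A C-to-U change turns a G:C pair into a G:U wobble.
print(compare_alleles(HYPOTHETICAL_MIRNA, "UAGCCUG", "UAGCUUG"))
# A G-to-A change turns a U:G wobble into a U:A Watson-Crick pair.
print(compare_alleles(HYPOTHETICAL_MIRNA, "UGGCCUG", "UAGCCUG"))
```

A comparison of this kind only flags candidate variants; whether the predicted change in pairing alters expression still has to be tested experimentally.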
16.2 DESIGNING AN EXPERIMENT USING MIRNA TO CONFIRM GENE MUTATION FUNCTION

16.2.1 Background
Because miRNAs have emerged as a new class of regulatory genes, miRNA target sites within the 3′-UTRs of human protein-coding genes constitute a new class of cis-acting regulatory elements (Lee et al., 1993). In humans, about
one third of all protein-coding genes contain conserved target sequences for the 163 miRNA families that are conserved among different species (Gardner and Vinther, 2008). Upon binding to their cognate targets, miRNAs post-transcriptionally downregulate gene expression by inducing either mRNA degradation or translational repression. A base change in the mature miRNA or in the target sequence of the mRNA will weaken the interaction between the miRNA and mRNA (He et al., 2005; Iwai and Naraba, 2005). This interaction is especially sensitive to mutations in the seed region (Brennecke et al., 2005). Here we highlight the feasibility of using miRNA to confirm miRNA target site variations by functional experiments in vitro. Several studies have identified genetic miRNA target site variants that are claimed to be associated with disorders ranging from Parkinson’s disease to cancer (Sethupathy and Collins, 2008). One such lesion has recently been reported: a G-to-A transition (absent in 4296 control chromosomes), which replaces a G:U wobble base pair with an A:U Watson-Crick pairing in a binding site for human miRNA hsa-miR-189 within the 3′-UTR of the Slit and Trk-like 1 gene, was identified in two unrelated patients with Tourette’s syndrome and obsessive-compulsive symptoms (Abelson et al., 2005). In vitro functional analysis demonstrated that, in the presence of hsa-miR-189, the mutant allele gave rise to decreased repression of the reporter gene as compared with the wild type allele. Other studies have reported that, in the THPO and PTGS2 genes, polymorphisms affect the binding between the miRNA and the target mRNA.
The reduced complementarity of the THPO rs6141(+24) G > A variant allele to hsa-miR-431 and of the PTGS2 9850A > G variant allele to hsa-miR-132, as compared with their respective wild type alleles, led to overexpression of these genes and was, therefore, consistent with the functional consequences (Cox et al., 2004; Garner et al., 2006). Alternatively, a polymorphism or mutation may create new, illegitimate miRNA binding sites. For example, the muscular phenotype of the Texel sheep strain is the result of a mutation in the myostatin 3′-UTR that creates a binding site for miR-1 and miR-206, miRNAs that are highly expressed in skeletal muscle. Because myostatin is a key negative regulator of muscle mass, even slight decreases in its activity yield muscle overgrowth (Flynt and Lai, 2008). Based on this evidence, we can use miRNA to confirm the variations in the 3′-UTR of candidate genes.

16.2.2 Experiment Design
16.2.2.1 Search the Existing Databases

As of December 2009, various techniques including small RNA cloning and, most recently, deep-sequencing-based approaches have characterized 721 human miRNAs, which are listed in the official miRNA database (miRBase). Most computational approaches have suggested that there are well over 1000 human miRNAs (Berezikov et al., 2005) and, according to some projections, even tens of thousands (Rigoutsos et al., 2006). Target predictions based primarily on conserved seed pairing and
local sequence or structural features suggest that individual animal miRNAs often have >100 targets and that at least 20–30% of animal transcripts bear one or more conserved miRNA binding sites in their 3′-UTR (Krek et al., 2005; Xie et al., 2005; Ruby et al., 2007). Additional targets may potentially be regulated through miRNA binding to atypical sites with imperfect seeds (Callis et al., 2008; Brennecke et al., 2003, 2005) or nonconserved sites (Farh et al., 2005; Giraldez et al., 2006; Sood et al., 2006). Therefore, the direct target network of animal miRNAs is inferred to be quite substantial. One can also ask whether the miRNA operates through one target or many targets, each of which might behave differently with respect to the quantitative and qualitative consequence of miRNA control (Flynt and Lai, 2008). To bridge the information in the miRNA database with the biology of the cell, a number of computer programs have been developed for predicting mRNA targets for these miRNAs in animals. In summary, the common criteria used for target prediction by these computer programs are (1) the degree of base complementarity between the miRNA and mRNA, with special focus on identifying perfect (or near-perfect) complementarity between a target mRNA and the miRNA in the seed region (i.e., nts 2–8 of the miRNA), (2) the calculated thermodynamic stability of the predicted miRNA:mRNA complex, and (3) the degree of conservation of orthologous target sites in the 3′-UTR of different species. The different programs, however, do not use the same algorithm for calculating the targets, and therefore only a partial overlap is seen between the hit lists of each program (Lee et al., 1993). Several existing tools and resources provide updated data regarding each of these areas of research. The Sanger Institute’s miRBase serves as the central database for experimentally supported mature miRNA sequences (Griffiths-Jones et al., 2006).
For each supported miRNA, miRBase provides the genomic coordinates of the predicted precursor sequence, the nucleotide sequences of both the precursor and mature miRNA, and predicted targets of the mature miRNA according to the prediction programs miRanda, PicTar, and TargetScanS. Two additional databases, ARGONAUTE (Shahi et al., 2006) and miRNAMap (Hsu et al., 2006), offer enhanced interfaces to the data contained in miRBase for human, mouse, rat, and dog. MiRNAMap also reports computationally predicted miRNAs and their predicted targets, according to the programs miRanda and RNAhybrid (Kruger and Rehmsmeier, 2006). Moreover, it provides cross-links to other biological databases to provide tissue expression and cross-species sequence conservation data for each supported and predicted miRNA. ARGONAUTE, published simultaneously with miRNAMap, provides much of the same information with perhaps a larger miRNA tissue expression dataset, collected from various miRNA expression studies. In addition, TarBase offers a manually curated and comprehensive set of experimentally supported targets in eight different species (Sethupathy et al., 2006). It contains over 550 target genes and over 750 individual target sites. For each miRNA:target interaction that has gained experimental support, TarBase reports on the
sufficiency of the interaction to independently induce translational silencing, the type of translational silencing that is induced (repression vs. immediate cleavage), the location of the target site along the 3′-UTR, the nature of the base pairing between the miRNA and target sequence according to the minimum free-energy hybridization, and the types of experimental methods used for verification. Recently, the suggestion that a polymorphism/mutation in miRNA binding sites (poly-miRTS) can lead to disease was strengthened by a study from Clop et al. (2006). They provided rigorous in vivo evidence that an miRNA target site mutation in myostatin (GDF8 or MSTN) contributes to muscular hypertrophy in sheep. In addition, other studies have claimed association of one or more poly-miRTS with various human diseases ranging from colorectal cancer to Parkinson’s disease (Sethupathy and Collins, 2008). Roughly 20,000 poly-miRTS have now been cataloged in databases such as PolymiRTS (http://compbio.utmem.edu/miRSNP) and Patrocles (www.patrocles.org) (Georges et al., 2006; Bao et al., 2007) and used to study natural selection on miRNA target sites (Chen and Rajewsky, 2006; Saunders et al., 2007).

16.2.2.2 Confirmation of the Base Variation within the MiRNA Binding Sites by Functional Experiment

16.2.2.2.1 Luciferase Activity Assay

Further evaluation of the predicted target in a biological system is therefore needed. A widely used method is to make a plasmid construct that encodes a reporter such as firefly luciferase fused to the 3′-UTR of the predicted miRNA target and to transfect it into a cell expressing the cognate miRNA. If the target and miRNA interact, a decreased luciferase activity should be measured (Taganov et al., 2006; Voorhoeve et al., 2006). Conversely, a similar reporter construct with a mutated target sequence should show no reduction in luciferase activity. This approach has been widely used in miRNA functional studies. Zhao et al.
(2009) found that miR-15a inhibits reporter luciferase activity in a dose-dependent manner by binding to its seed-matched regions in the c-myb 3′-UTR. Compared with controls, relative luciferase activity was significantly decreased with as little as 1 nM miR-15a. The maximal decrease was obtained with a 50-nM concentration of this miRNA. In contrast, increasing concentrations of control miRNA had little effect on luciferase activity, even when the concentration of these RNAs was 100 nM. To demonstrate that miR-15a interacts with a specific target sequence localized in the human c-myb 3′-UTR, three additional mutant reporter constructs were generated in which two 7-bp seed sequences (i.e., ACGACGA) were deleted individually or simultaneously. The resulting constructs, pBub1/Myb3U/miR-15a1 and pBub1/Myb3U/miR-15a2, were co-transfected together with miR-15a into HEK293T cells. Luciferase activity in the respective cells was then measured. Compared with the decrease in luciferase activity observed when the authentic c-myb 3′-UTR was cotransfected with miR-15a, deletion of the miR-15a binding sites in the c-myb 3′-UTR resulted in a twofold to threefold increase in luciferase activity, indicating that miR-15a was no longer able to bind the 3′-UTR with the same avidity. All these data are consistent with the hypothesis that miR-15a hybridizes with the predicted sequence in the c-myb 3′-UTR and that alteration of the sequence to which the miRNA hybridizes results in enhanced luciferase activity.

Figure 16.3. a, Plasmid construction (pLuci, pLuci-p20-3′, and pLuci-3′M, each with a CMV promoter, the luciferase coding region, and an SV40 poly(A) signal). A segment of Hsp20 3′-UTR or a mutated segment was cloned downstream of the luciferase encoding region. b, Relative luciferase activity (Luci/β-Gal) in H9c2 cells cotransfected with miR-Ctl or miR-320 and the various vectors indicated. *p < .05 relative to respective controls (Ren and Wu et al., 2009).

Further support for this method comes from the work of Ren et al. (2009) (Fig. 16.3). To validate whether miR-320 directly recognizes the 3′-UTR of Hsp20, they cotransfected H9c2 cells with a construct containing the 3′-UTR of Hsp20 fused downstream of the luciferase coding sequence along with miR-320 or a negative control miRNA. Overexpression of miR-320 strongly inhibited the luciferase activity from the reporter construct containing the 3′-UTR segment of Hsp20, whereas no effect was observed with a construct containing a mutated segment of Hsp20 3′-UTR (seed sequence
CAGCUUU was mutated to GACACAA). This effect was specific, because no change was seen in luciferase reporter activity when a negative control miRNA was cotransfected with either reporter construct. Collectively, these data indicate that variations located in the 3′-UTR may influence the complementarity and affinity between an miRNA and its binding site, which can be tested by the luciferase activity assay.

16.2.2.2.2 Altered Expression of the Target Genes

Another way to confirm a mutation is expression analysis in a suitable cell model. The experimental cells are cotransfected with a plasmid containing a known miRNA target gene that carries the binding site mutation in its 3′-UTR, the miRNA mimic, and a suitable control. Because the mutation can weaken the interaction between the miRNA and the target binding site, it releases the mRNA from negative regulation and results in elevated protein expression. The expression level of the target gene is then compared by Western blotting or immunostaining in the cell model. Calin et al. (2002), using expression analyses, determined that as many as 68% of all chronic lymphocytic leukemias (CLLs) showed downregulation of miR-15 and miR-16. Both miRNAs were shown to act as tumor suppressors by targeting translation of the mRNA of anti-apoptotic BCL-2 (Calin et al., 2002), an oncogene that is frequently overexpressed in CLL. Downregulation of miR-15 and miR-16 has been shown to correlate with overexpression of the BCL-2 protein, and transfection with either of the two miRNAs completely abolished protein expression and reestablished apoptosis in a leukemia model (Cimmino et al., 2005). Another early and well-documented finding was the downregulation of oncogenic Ras by the let-7 family of miRNAs in lung cancer (Johnson et al., 2005). Low let-7 expression was observed to correlate with shortened postoperative survival in lung cancer patients who had undergone potentially curative operative procedures (Takamizawa et al., 2004).
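The reporter comparisons described in this section reduce to simple arithmetic: each well's reporter signal is divided by an internal-control signal, replicates are averaged, and the miRNA-treated group is expressed as a fraction of the control-miRNA group. The following Python sketch shows that calculation. It is a minimal illustration, not the analysis used in the cited studies; the function names and all readings are hypothetical, and no statistical test is included.

```python
from statistics import mean


def relative_activity(reporter: float, internal_control: float) -> float:
    """Reporter signal divided by the internal-control signal from the same well."""
    if internal_control <= 0:
        raise ValueError("internal control signal must be positive")
    return reporter / internal_control


def group_mean(wells: list[tuple[float, float]]) -> float:
    """Mean relative activity over replicate wells given as (reporter, control)."""
    return mean([relative_activity(r, c) for r, c in wells])


def fraction_of_control(mirna_wells, control_wells) -> float:
    """Relative activity with the miRNA as a fraction of the control-miRNA value."""
    return group_mean(mirna_wells) / group_mean(control_wells)


# Hypothetical readings (arbitrary units), not data from any cited study.
wild_type = {
    "control": [(3000, 100), (3300, 110), (2700, 90)],
    "mirna": [(1200, 100), (1320, 110), (1080, 90)],
}
mutated = {
    "control": [(3000, 100), (2400, 80)],
    "mirna": [(2850, 100), (2280, 80)],
}

for name, wells in (("wild-type 3'-UTR", wild_type), ("mutated 3'-UTR", mutated)):
    frac = fraction_of_control(wells["mirna"], wells["control"])
    print(f"{name}: {frac:.2f} of control activity")
```

With these invented readings the wild-type reporter retains 0.40 of its control activity and the mutated reporter 0.95, which is the pattern expected when the mutation disrupts the binding site. Real data would also need replicate variability and a significance test.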
16.3 PROCEDURE OF CONFIRMATION OF A GENE MUTATION BY MIRNA

16.3.1 Luciferase Reporter Assays
16.3.1.1 Target Prediction

Target prediction programs are very useful for defining potential miRNA targets. MiRNAs do not switch off their target genes completely but rather fine-tune their expression through the binding site within the 3′-UTR. Identifying the target mRNAs of miRNAs is an important step in elucidating the interaction between miRNAs and their targets. Several computational target prediction programs have been developed, but the overlap between the sets of target genes predicted for a given miRNA by different programs is surprisingly low (Sethupathy et al., 2006), suggesting a number of false positive predictions (Nicolas et al., 2008).
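One practical way to handle the low overlap between programs is to quantify it and to rank candidates by how many programs agree. The Python sketch below computes the pairwise Jaccard index between predicted target sets and extracts the targets predicted by at least a chosen number of programs. It is an illustration only: the gene names are placeholders, the sets are invented, and the program names merely label the hypothetical lists rather than reflecting real output from those tools.

```python
from collections import Counter
from itertools import combinations


def pairwise_jaccard(predictions: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Jaccard index (shared / union) for every pair of prediction programs."""
    result = {}
    for a, b in combinations(sorted(predictions), 2):
        union = predictions[a] | predictions[b]
        shared = predictions[a] & predictions[b]
        result[(a, b)] = len(shared) / len(union) if union else 0.0
    return result


def consensus_targets(predictions: dict[str, set[str]], min_programs: int = 2) -> set[str]:
    """Targets predicted by at least `min_programs` of the programs."""
    votes = Counter(gene for targets in predictions.values() for gene in targets)
    return {gene for gene, n in votes.items() if n >= min_programs}


# Hypothetical predicted target lists for one miRNA (placeholder gene names).
predictions = {
    "miRanda": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "PicTar": {"GENE_B", "GENE_C", "GENE_E"},
    "TargetScanS": {"GENE_C", "GENE_F"},
}

for pair, score in pairwise_jaccard(predictions).items():
    print(pair, round(score, 2))
print(sorted(consensus_targets(predictions, min_programs=2)))
```

Candidates supported by several programs are a reasonable starting point for the reporter assays described below, although agreement between programs does not replace experimental validation.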
TABLE 16.1. Online Databases for MiRNA Research

Name of the Database | Website Linkage | Description
miRBase | www.mirbase.org | Contains three main sections: (1) miRBase sequences contains all published miRNA sequences, genomic locations, and associated annotations; (2) miRBase targets is a newly developed database of predicted miRNA target genes; (3) miRBase registry provides a confidential service assigning official names for novel miRNA genes before publication of their discovery
ARGONAUTE | www.ma.uni-heidelberg.de/apps/zmf/argonaute | Mammalian miRNAs and their function in gene and pathway regulation
miRNAMap | http://mirnamap.mbc.nctu.edu.tw | Collects experimentally verified miRNAs and experimentally verified miRNA target genes in human, mouse, rat, and other metazoan genomes
TargetScanS | http://genes.mit.edu/targetscan | Predicts biological targets of miRNAs by searching for the presence of conserved 8mer and 7mer sites that match the seed region of each miRNA
PolymiRTS | http://compbio.utmem.edu/miRSNP | Naturally occurring DNA variations in putative miRNA target sites
Patrocles | www.patrocles.org/Patrocles.htm | Polymorphic miRNA-target interactions
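The 8mer and 7mer seed-matched sites mentioned for TargetScanS in Table 16.1 can be located with a short scan of the 3′-UTR. The Python sketch below is a simplified illustration and not the TargetScanS algorithm: it assumes the commonly described site definitions (7mer-m8, a match to miRNA nucleotides 2–8; 7mer-A1, a match to nucleotides 2–7 followed by an A; 8mer, both features together) and ignores conservation entirely. The site CAGCUUU and its mutated form GACACAA are the sequences quoted earlier for the Hsp20 3′-UTR; the miRNA and the flanking UTR bases are hypothetical placeholders chosen so that nucleotides 2–8 pair with that site.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_seed_sites(mirna: str, utr: str) -> list[tuple[int, str]]:
    """Return (0-based start, site type) for each 7mer or 8mer site in the UTR.

    Site definitions are an assumption of this sketch (see the text above).
    """
    core = reverse_complement(mirna[1:7])  # pairs with miRNA nucleotides 2-7
    m8 = COMPLEMENT[mirna[7]]              # pairs with miRNA nucleotide 8
    sites = []
    start = utr.find(core)
    while start != -1:
        has_m8 = start > 0 and utr[start - 1] == m8
        has_a1 = utr[start + 6:start + 7] == "A"
        if has_m8 and has_a1:
            sites.append((start - 1, "8mer"))
        elif has_m8:
            sites.append((start - 1, "7mer-m8"))
        elif has_a1:
            sites.append((start, "7mer-A1"))
        start = utr.find(core, start + 1)
    return sites


# Hypothetical miRNA whose nucleotides 2-8 (AAAGCUG) pair with the site CAGCUUU.
HYPOTHETICAL_MIRNA = "UAAAGCUGCCAUGGAUCCGAUA"

# Hypothetical UTR fragments: only the 7-nt site itself comes from the text.
wild_type_utr = "GGA" + "CAGCUUU" + "AUCC"
mutated_utr = "GGA" + "GACACAA" + "AUCC"

print(find_seed_sites(HYPOTHETICAL_MIRNA, wild_type_utr))
print(find_seed_sites(HYPOTHETICAL_MIRNA, mutated_utr))
```

Running the same scan on the wild-type and variant 3′-UTR of a candidate gene shows at a glance whether a variant removes an existing site or creates a new one.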
16.3.1.1.1 Online Databases and Software

Available online data resources are summarized in Table 16.1. For example, using default parameters, Griffiths-Jones et al. (2006) searched 79 collated variants of upstream sequences (USS), the region between the translational termination codon and the upstream core polyadenylation signal (UCPAS), for miRNA binding sites with miRBase software. For each variant, both the wild type 3′-UTR sequence and its mutated counterpart, each with a total length of ∼50 bp flanking the site of mutation, could be screened for the presence of miRNA binding sites, with all possible 25-bp fragments within these flanking sequences being examined sequentially. Although three of the six databases overlap in many predicted targets, they diverge in others. Thus it might be beneficial to search all the databases for potential targets of a miRNA
of interest for experimental validation. Once the candidate miRNA has been chosen, its sequence can also be searched online in the different databases. Ready-to-use mature miRNA is commercially available (e.g., Ambion).

16.3.1.2 Experimental Validation by Luciferase Reporter Assay

16.3.1.2.1 Plasmid Construction

The 3′-UTR containing the target binding site, with or without the mutation, can be PCR-amplified from DNA or cDNA samples, and the product can be subcloned into a commercial luciferase vector downstream of the firefly luciferase coding region (Fig. 16.4). The authenticity and orientation of the inserts relative to the luciferase gene should be confirmed by sequencing. The experimental reporter vector containing the target elements under test is usually cotransfected with a second reporter vector, which serves as an internal control for transfection efficiency. The dual-luciferase reporter (DLR) assay system offers sequential measurement of the activities of two distinct reporter luciferases in the same cell lysate obtained from cells cotransfected with experimental and control reporter vectors, for example, firefly luciferase (encoded by the experimental vector) and Renilla luciferase (encoded by the control vector). Many commercially available vectors containing a suitable luciferase gene exist, and a particular vector should be selected according to the study (Matuszyk, 2002).

16.3.1.2.2 Cell Culture Conditions
1. One day before the transfection experiment, adjust the cell concentration and plate the cells.
2. Culture the cells overnight to achieve 60–80% confluence.

16.3.1.2.3 Transfection of Experimental Cells

Transfect the experimental cells with the luciferase reporter constructs described above, along with the internal control vector and the appropriate commercially available miRNA mimic, using a transfection reagent according to the manufacturer’s instructions (e.g., Lipofectamine 2000, Invitrogen). After 48 h incubation, harvest the cells. The relative luciferase activity can be expressed as the ratio of experimental to internal control luciferase activity. The steps are as follows:

1. Before the transfection, replace the medium with fresh medium.
2. Prepare the transfection reagent.
3. Prepare the experimental DNA and miRNA; mix gently.
4. Incubate the transfection reagent with the DNA/RNA mixture for a minimum of 15 min at room temperature to allow the complex to form.
5. When ready, add this mixture dropwise directly to the cells through the medium, distributing the droplets evenly over the entire area. There is no need to remove the medium and replace it with fresh medium.
6. Incubate for 36–48 h, then harvest the cells.

Figure 16.4. Construction of the reporter plasmid. [Schematic: vector with a CMV promoter, the luciferase reporter gene, the 3′-UTR with or without the mutation, and an SV40 poly(A) signal.]

16.3.1.2.4 Preparation of Cell Lysate for Luciferase Activity Assay
1. Remove the growth medium from the cultured cells.
2. Rinse the cells in washing buffer (e.g., 1× PBS) without dislodging them. Remove as much of the final wash as possible.
3. Dispense a minimal volume of 1× lysis reagent into each culture vessel (e.g., 200–400 μL per 60-mm culture dish).
4. For culture dishes, scrape the attached cells from the dish, and transfer the cells and solution to a microcentrifuge tube.
5. Pellet debris by brief centrifugation, and transfer the supernatant to a new tube.
6. Mix 20 μL of cell lysate with 100 μL of luciferase assay reagent and measure the light produced by using a luminometer.

From Nature protocol: www.natureprotocols.com/2006/10/27/transient_transfection_and_luc.php (May 30, 2009).

16.3.2 MiRNA Target Gene Expression Analysis

16.3.2.1 Target Prediction

Predicted targets are available in online databases.

16.3.2.2 Experimental Cell Model

1. Choose a cell line with a known mutation to be tested.
2. Transfect a cell line with the miRNA target gene containing the 3′-UTR with or without the mutation to be tested (Fig. 16.5).

16.3.2.3 Transfection

Cotransfect the cells with commercially available miRNA mimics along with the target gene expression plasmid.
Target gene With/without mutation
Vector
CMV
3’-UTR
Coding Region
Figure 16.5. Construction of expression plasmid.
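For the reporter assay of Section 16.3.1.2.3, the relative luciferase activity is the ratio of the experimental reporter reading to the internal control reading. The following Python sketch illustrates that calculation only. The function names, the extra normalization to a negative-control mimic, and all readings are hypothetical and are not taken from the cited protocol.

```python
from statistics import mean


def relative_luciferase_activity(reporter, internal_control):
    """Ratio of the experimental reporter reading to the internal control reading."""
    if internal_control <= 0:
        raise ValueError("internal control reading must be positive")
    return reporter / internal_control


def fold_of_reference(sample_wells, reference_wells):
    """Mean relative activity of the sample wells divided by that of the reference wells.

    Each well is a (reporter, internal_control) pair of luminometer readings.
    """
    sample = mean(relative_luciferase_activity(r, c) for r, c in sample_wells)
    reference = mean(relative_luciferase_activity(r, c) for r, c in reference_wells)
    return sample / reference


# Hypothetical readings in arbitrary light units: (reporter, internal control).
control_mimic = [(2000.0, 4000.0), (2100.0, 4200.0), (1900.0, 3800.0)]
wild_type_utr = [(1000.0, 4000.0), (500.0, 4000.0), (1500.0, 4000.0)]
mutant_utr = [(1900.0, 4000.0), (2000.0, 4000.0), (2100.0, 4000.0)]

for label, wells in (("wild-type 3'-UTR", wild_type_utr), ("mutant 3'-UTR", mutant_utr)):
    print(f"{label}: {fold_of_reference(wells, control_mimic):.2f} of control")
```

Dividing by the internal control corrects each well for differences in transfection efficiency and cell number. Under the assumption that the mutation disrupts the miRNA binding site, the mutant construct would be expected to stay near the control value while the wild-type construct is repressed.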
16.3.2.4 Detection of the Target Gene Expression
16.3.2.4.1 A Brief Protocol of Western Blotting with Monoclonal Antibodies
Sample Preparation
1. Wash dishes three times with 1× PBS.
2. Apply lysis buffer and scrape the cells from the dishes.
3. Transfer the lysate to a microcentrifuge tube and boil for 5 min in a boiling water bath. To reduce viscosity, the sample may be sonicated briefly or passed several times through a 26-gauge needle.
4. Centrifuge the sample for 5 min to pellet insoluble material and collect the supernatant.
5. Determine the protein concentration with the BCA protein assay (Pierce) according to the manufacturer's instructions.
Polyacrylamide Gel Electrophoresis
1. Add the appropriate volume of electrophoresis sample buffer to the sample tube and boil for 3–5 min.
2. Load 5–20 μg of total protein in each lane. Refer to the antibody datasheet for the appropriate positive control cell lysate.
3. Electrophorese until the bromophenol blue in the samples reaches the bottom of the gel. Keep gels in running buffer until ready to transfer.
Semidry Transfer
1. Transfer the protein from the gel to a PVDF membrane at 1.2 mA/cm² for 1.75 h in transfer buffer.
Protein Blotting
1. Blocking: Transfer the blot from the transfer apparatus or staining tray to blocking buffer (5% nonfat dry milk, 10 mM Tris pH 7.5, 100 mM NaCl, 0.1% Tween 20). Incubate the blot for 30 min at 37°C, 1 h at room temperature, or overnight at 4°C.
2. Primary antibody: Decant the blocking buffer from the blot, add the antibody solution at the optimized dilution, and incubate with agitation for 30 min at 37°C, 1 h at room temperature, or overnight at 4°C.
3. Decant the primary antibody solution, and wash the blot with 1× PBS three times, 10 min each.
4. Enzyme-conjugated secondary antibody: Add the enzyme-conjugated secondary antibody and incubate with agitation for 30 min at 37°C or 1 h at room temperature.
5. Decant the secondary antibody solution, and wash the blot with 1× PBS three times, 10 min each.
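Two quantities in the electrophoresis and transfer steps above require simple arithmetic: the lysate volume that delivers a chosen protein mass per lane, and the total current that corresponds to 1.2 mA/cm² for a given gel. The sketch below is an illustration only. The function names, the lysate concentrations, and the gel dimensions are hypothetical.

```python
def loading_volume_ul(protein_ug, concentration_ug_per_ul):
    """Volume of lysate (in microliters) that contains the requested protein mass."""
    if concentration_ug_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return protein_ug / concentration_ug_per_ul


def transfer_current_ma(gel_width_cm, gel_height_cm, density_ma_per_cm2=1.2):
    """Total current (mA) for a semidry transfer at the given current density."""
    return gel_width_cm * gel_height_cm * density_ma_per_cm2


# Hypothetical BCA results in micrograms per microliter.
lysates = {"wild-type 3'-UTR": 2.5, "mutant 3'-UTR": 4.0}
for name, concentration in lysates.items():
    print(f"{name}: load {loading_volume_ul(20.0, concentration):.1f} uL for 20 ug")

# Hypothetical gel dimensions in centimeters.
print(f"8 cm x 6 cm gel: {transfer_current_ma(8.0, 6.0):.1f} mA")
```

Loading the same protein mass in every lane keeps band intensities comparable between the constructs with and without the mutation.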
Develop the Blot
1. Put the blot into the chemiluminescent working solution and expose it to X-ray film, adjusting the exposure to obtain the best signal.
From BD Biosciences: www.bdbiosciences.com/pharmingen/protocols/Western_Blotting.shtml (May 30, 2009).
16.3.2.4.2 Brief Protocol of Immunostaining
1. Gently rinse the slides containing sections in freezing-cold acetone or another fixative for 20 min.
2. Return the slides to room temperature before the experiment.
3. Rinse slides 3× in PBS, 2 min each time.
4. Block endogenous peroxidase activity by incubating the slides in 0.3% H2O2 solution in PBS for 10 min.
5. Rinse slides 3× in PBS, 2 min each time.
6. Block nonspecific binding by incubating with blocking buffer (10% serum from the host species of the secondary antibody diluted in PBS, or 10% FBS in PBS) for 30–60 min at room temperature in a humidified chamber.
7. Dilute the primary antibody in the antibody diluent. Alternatively, use a buffered solution with a source of protein as antibody diluent. Apply the diluted antibody to the tissue sections on the slide. Incubate for 1 h at room temperature in a humidified chamber.
8. Rinse slides 3× in PBS, 2 min each time.
9. Dilute the biotinylated secondary antibody in the antibody diluent. Alternatively, use a buffered solution with a source of protein as antibody diluent. Apply to the tissue sections on the slide, and incubate for 30 min at room temperature.
10. Rinse slides 3× in PBS, 2 min each time.
11. Apply prediluted streptavidin-horseradish peroxidase to the tissue sections on the slide, and incubate for 30 min at room temperature.
12. Rinse slides 3× in PBS, 2 min each time.
13. Prepare DAB substrate solution following the manufacturer's recommendations. Safety note: DAB is a suspected carcinogen. Handle with care. Wear gloves, lab coat, and eye protection.
14. Drain PBS from the slides and apply the DAB substrate solution. Allow the slides to incubate for 5 min or until the desired color intensity is reached.
15. Wash 3× in water, 2 min each time.
16. Counterstain slides: dip twice in hematoxylin.
17. Rinse thoroughly in water.
18. Dip twice in bluing reagent or diluted ammonia water.
19. Rinse thoroughly in water.
20. Dehydrate through four changes of alcohol (95%, 95%, 100%, and 100%). Clear in three changes of xylene (or xylene substitute), and coverslip using mounting solution.
Note: From BD Biosciences: www.bdbiosciences.com/pharmingen/protocols/Frozen_Tissue_Sections.shtml (May 30, 2009).
16.4 LIMITATIONS AND TROUBLESHOOTING
Limitations
• Target predictions based primarily on conserved seed pairing and local sequence or structural features suggest that individual animal miRNAs often have >100 targets and that at least 20–30% of animal transcripts bear one or more conserved miRNA binding sites in their 3′-UTR. Thus the level of an mRNA or its translation product is governed by the combinatorial effect of the miRNAs that target it.
• One candidate target gene of a known miRNA may have more than one binding site within its 3′-UTR. For example, HMGA2 encodes a chromatin-associated protein and contains seven let-7 sites in its 3′-UTR (Chuzhanova et al., 2007). Search several databases and select the candidate sequences on which their predictions overlap most.
• It is estimated that 1–4% of genes in the human genome are miRNAs and that a single miRNA can regulate as many as 200 mRNAs. It is clear that disturbances of miRNA binding may have detrimental effects on cell physiology (Esquela-Kerscher and Slack, 2006).
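The seed pairing and multiple binding sites mentioned in these limitations can be illustrated with a short Python sketch. It derives the target motif complementary to miRNA nucleotides 2–8 and lists every occurrence of that motif in a 3′-UTR, with and without a point mutation. It is a minimal illustration, not a substitute for the prediction databases: only perfect seed matches are considered, the UTR fragments are invented, the function names are hypothetical, and the let-7a sequence is written as it is commonly listed and should be verified against a current database before use.

```python
def seed_match_motif(mirna, start=2, end=8):
    """DNA motif in the target that pairs with miRNA nucleotides start..end (1-based)."""
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    complement = {"A": "T", "U": "A", "G": "C", "C": "G"}
    # The target site is the reverse complement of the seed.
    return "".join(complement[base] for base in reversed(seed))


def find_sites(utr, motif):
    """0-based start positions of every (possibly overlapping) motif occurrence."""
    utr = utr.upper().replace("U", "T")
    positions = []
    index = utr.find(motif)
    while index != -1:
        positions.append(index)
        index = utr.find(motif, index + 1)
    return positions


LET_7A = "UGAGGUAGUAGGUUGUAUAGUU"  # assumed let-7a sequence; verify before use

# Invented 3'-UTR fragments: the mutant differs by one base inside the first site.
wild_type_utr = "AAGCTACCTCAGGTTTCTACCTCAAT"
mutant_utr = "AAGCTACGTCAGGTTTCTACCTCAAT"

motif = seed_match_motif(LET_7A)
print("seed-match motif:", motif)
print("wild-type sites:", find_sites(wild_type_utr, motif))
print("mutant sites:", find_sites(mutant_utr, motif))
```

A mutation that removes one of several sites, as in this invented example, may weaken rather than abolish repression, which is one reason the combinatorial effect noted above complicates interpretation.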
Troubleshooting
• Use low concentrations.
• Use independent miRNAs to the same target.
• MiRNAs cannot be used to confirm mutations within the protein coding region.
• High cost is an important limitation of this method.
16.5 REFERENCES
Abelson JF, Kwan KY, O’Roak BJ, Baek DY, Stillman AA, Morgan TM, Mathews CA, Pauls DL, Rasin MR, Gunel M, Davis NR, Ercan-Sencicek AG, Guez DH, Spertus JA, Leckman JF, Dure LS 4th, Kurlan R, Singer HS, Gilbert DL, Farhi A, Louvi A, Lifton RP, Sestan N, State MW. (2005). Sequence variants in SLITRK1 are associated with Tourette’s syndrome. Science 310(5746):317–20. Abrahante JE, Daul AL, Li M, Volk ML, Tennessen JM, Miller EA, Rougvie AE. (2003). The Caenorhabditis elegans hunchback-like gene lin-57/hbl-1 controls developmental time and is regulated by microRNAs. Dev Cell 4(5):625–37.
Bagga S, Bracht J, Hunter S, Massirer K, Holtz J, Eachus R, Pasquinelli AE. (2005). Regulation by let-7 and lin-4 miRNAs results in target mRNA degradation. Cell 122(4):553–63. Bao L, Zhou M, Wu L, Lu L, Goldowitz D, Williams RW, Cui Y. (2007). PolymiRTS Database: linking polymorphisms in microRNA target sites with complex traits. Nucl Acids Res 35:D51–4. Bartel DP. (2004). MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 116(2):281–97. Behm-Ansmant I, Rehwinkel J, Doerks T, Stark A, Bork P, Izaurralde E. (2006). mRNA degradation by miRNAs and GW182 requires both CCR4:NOT deadenylase and DCP1:DCP2 decapping complexes. Genes Dev 20(14):1885–98. Berezikov E, Guryev V, van de Belt J, Wienholds E, Plasterk RH, Cuppen E. (2005). Phylogenetic shadowing and computational identification of human microRNA genes. Cell 120(1):21–24. Bhattacharyya SN, Habermacher R, Martine U, Closs EI, Filipowicz W. (2006). Relief of microRNA-mediated translational repression in human cells subjected to stress. Cell 125(6):1111–24. Brengues M, Teixeira D, Parker R. (2005). Movement of eukaryotic mRNAs between polysomes and cytoplasmic processing bodies. Science 310(5747):486–89. Brennecke J, Stark A, Russell RB, Cohen SM. (2005). Principles of microRNA-target recognition. PLoS Biol 3(3):e85. Brennecke J, Hipfner DR, Stark A, Russell RB, Cohen SM. (2003). bantam encodes a developmentally regulated microRNA that controls cell proliferation and regulates the proapoptotic gene hid in Drosophila. Cell 113(1):25–36. Cai X, Hagedorn CH, Cullen BR. (2004). Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 10(12):1957–66. Calin GA, Dumitru CD, Shimizu M, Bichi R, Zupo S, Noch E, Aldler H, Rattan S, Keating M, Rai K, Rassenti L, Kipps T, Negrini M, Bullrich F, Croce CM. (2002). Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci USA 99(24):15524–29.
Callis TE, Tatsuguchi M, Wang DZ. (2008). miRNAs and their emerging role in cardiac hypertrophy. In VA Erdmann, W Poller, J Barciszewski (eds.), RNA Technologies in Cardiovascular Medicine and Research. Springer-Verlag Berlin Heidelberg. Chan JA, Krichevsky AM, Kosik KS. (2005). MicroRNA-21 is an antiapoptotic factor in human glioblastoma cells. Cancer Res 65(14):6029–33. Chen CZ, Li L, Lodish HF, Bartel DP. (2004). MicroRNAs modulate hematopoietic lineage differentiation. Science 303(5654):83–86. Chen K, Rajewsky N. (2006). Natural selection on human microRNA binding sites inferred from SNP data. Nat Genet 38(12):1452–56. Chuzhanova N, Cooper DN, Férec C, Chen JM. (2007). Searching for potential microRNA-binding site mutations amongst known disease-associated 3′ UTR variants. Genomic Med 1(1–2):29–33. Ciafrè SA, Galardi S, Mangiola A, Ferracin M, Liu CG, Sabatino G, Negrini M, Maira G, Croce CM, Farace MG. (2005). Extensive modulation of a set of microRNAs in primary glioblastoma. Biochem Biophys Res Commun 334(4):1351–58.
Cimmino A, Calin GA, Fabbri M, Iorio MV, Ferracin M, Shimizu M, Wojcik SE, Aqeilan RI, Zupo S, Dono M, Rassenti L, Alder H, Volinia S, Liu CG, Kipps TJ, Negrini M, Croce CM. (2005). miR-15 and miR-16 induce apoptosis by targeting BCL2. Proc Natl Acad Sci U S A 102(39):13944–49. Clop A, Marcq F, Takeda H, Pirottin D, Tordoir X, Bibé B, Bouix J, Caiment F, Elsen JM, Eychenne F, Larzul C, Laville E, Meish F, Milenkovic D, Tobin J, Charlier C, Georges M. (2006). A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep. Nat Genet 38(7):813–18. Cowland JB, Hother C, Grønbaek K. (2007). MicroRNAs and cancer. APMIS 115(10):1090–106. Cox DG, Pontes C, Guino E, Navarro M, Osorio A, Canzian F, Moreno V; Bellvitge Colorectal Cancer Study Group. (2004). Polymorphisms in prostaglandin synthase 2/cyclooxygenase 2 (PTGS2/COX2) and risk of colorectal cancer. Br J Cancer 91(2):339–43. Denli AM, Tops BB, Plasterk RH, Ketting RF, Hannon GJ. (2004). Processing of primary microRNAs by the Microprocessor complex. Nature 432(7014):231–35. Doench JG, Sharp PA. (2004). Specificity of microRNA target selection in translational repression. Genes Dev 18(5):504–11. Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. (2003). MicroRNA targets in Drosophila. Genome Biol 5(1):R1. Esquela-Kerscher A, Slack FJ. (2006). Oncomirs—microRNAs with a role in cancer. Nat Rev Cancer 6(4):259–69. Farh KK, Grimson A, Jan C, Lewis BP, Johnston WK, Lim LP, Burge CB, Bartel DP. (2005). The widespread impact of mammalian microRNAs on mRNA repression and evolution. Science 310(5755):1817–21. Filipowicz, W. (2005). RNAi: the nuts and bolts of the RISC machine. Cell 122(1): 17–20. Flynt AS, Lai EC. (2008). Biological principles of microRNA-mediated regulation: shared themes amid diversity. Nat Rev Genet 9(11):831–42. Gardner PP, Vinther J. (2008). Mutation of miRNA target sequences during human evolution. Trends Genet 24(6):262–65. 
Garner C, Best S, Menzel S, Rooks H, Spector TD, Thein SL. (2006). Two candidate genes for low platelet count identified in an Asian Indian kindred by genome-wide linkage analysis: glycoprotein IX and thrombopoietin. Eur J Hum Genet 14(1): 101–108. Georges M, Clop A, Marcq F, Takeda H, Pirottin D, Hiard S, Tordoir X, Caiment F, Meish F, Bibé B, Bouix J, Elsen JM, Eychenne F, Laville E, Larzul C. (2006). Polymorphic microRNA-target interactions: a novel source of phenotypic variation. Cold Spring Harb Symp Quant Biol 71:343–50. Giraldez AJ, Cinalli RM, Glasner ME, Enright AJ, Thomson JM, Baskerville S, Hammond SM, Bartel DP, Schier AF. (2005). MicroRNAs regulate brain morphogenesis in zebrafish. Science 308(5723):833–38. Giraldez AJ, Mishima Y, Rihel J, Grocock RJ, Van Dongen S, Inoue K, Enright AJ, Schier AF. (2006). Zebrafish MiR-430 promotes deadenylation and clearance of maternal mRNAs. Science 312(5770):75–79.
Gregory RI, Yan KP, Amuthan G, Chendrimada T, Doratotaj B, Cooch N, Shiekhattar R. (2004). The Microprocessor complex mediates the genesis of microRNAs. Nature 432(7014):235–40. Gregory RI, Chendrimada TP, Cooch N, Shiekhattar R. (2005). Human RISC couples microRNA biogenesis and posttranscriptional gene silencing. Cell 123(4):631–40. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucl Acids Res 34:D140–44. Grosshans H, Johnson T, Reinert KL, Gerstein M. (2005). The temporal patterning microRNA let-7 regulates several transcription factors at the larval to adult transition in C. elegans. Dev Cell 8(3):321–30. Hammond SM, Bernstein E, Beach D, Hannon GJ. (2000). An RNA-directed nuclease mediates post-transcriptional gene silencing in Drosophila cells. Nature 404(6775):293–96. Hammond SM, Boettcher S, Caudy AA, Kobayashi R, Hannon GJ. (2001). Argonaute2, a link between genetic and biochemical analyses of RNAi. Science 293(5532):1146–50. Hatfield SD, Shcherbata HR, Fischer KA, Nakahara K, Carthew RW, Ruohola-Baker H. (2005). Stem cell division is regulated by the microRNA pathway. Nature 435(7044):974–78. He H, Jazdzewski K, Li W, Liyanarachchi S, Nagy R, Volinia S, Calin GA, Liu CG, Franssila K, Suster S, Kloos RT, Croce CM, de la Chapelle A. (2005). The role of microRNA genes in papillary thyroid carcinoma. Proc Natl Acad Sci U S A 102(52):19075–80. Hsu PW, Huang HD, Hsu SD, Lin LZ, Tsou AP, Tseng CP, Stadler PF, Washietl S, Hofacker IL. (2006). miRNAMap: genomic maps of microRNA genes and their target genes in mammalian genomes. Nucl Acids Res 34:D135–39. Hutvágner G, McLachlan J, Pasquinelli AE, Bálint E, Tuschl T, Zamore PD. (2001). A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. Science 293(5531):834–38. Iwai N, Naraba H. (2005). Polymorphisms in human pre-miRNAs. Biochem Biophys Res Commun 331(4):1439–44.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. (2004). Human microRNA targets. PLoS Biol 2(11):e363. Johnson SM, Grosshans H, Shingara J, Byrom M, Jarvis R, Cheng A, Labourier E, Reinert KL, Brown D, Slack FJ. (2005). RAS is regulated by the let-7 microRNA family. Cell 120(5):635–47. Kiriakidou M, Nelson PT, Kouranov A, Fitziev P, Bouyioukos C, Mourelatos Z, Hatzigeorgiou A. (2004). A combined computational-experimental approach predicts human microRNA targets. Genes Dev 18(10):1165–78. Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, MacMenamin P, da Piedade I, Gunsalus KC, Stoffel M, Rajewsky N. (2005). Combinatorial microRNA target predictions. Nat Genet 37(5):495–500. Krüger J, Rehmsmeier M. (2006). RNAhybrid: microRNA target prediction easy, fast and flexible. Nucl Acids Res 34:W451–4. Lee RC, Feinbaum RL, Ambros V. (1993). The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75(5):843–54.
Lee Y, Ahn C, Han J, Choi H, Kim J, Yim J, Lee J, Provost P, Rådmark O, Kim S, Kim VN. (2003). The nuclear RNase III Drosha initiates microRNA processing. Nature 425(6956):415–19. Lee Y, Jeon K, Lee JT, Kim S, Kim VN. (2002). MicroRNA maturation: stepwise processing and subcellular localization. EMBO J 21(17):4663–70. Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN. (2004). MicroRNA genes are transcribed by RNA polymerase II. EMBO J 23(20):4051–60. Lewis BP, Burge CB, Bartel DP. (2005). Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120(1):15–20. Lewis BP, Shih IH, Jones-Rhoades MW, Bartel DP, Burge CB. (2003). Prediction of mammalian microRNA targets. Cell 115(7):787–98. Lim LP, Lau NC, Garrett-Engele P, Grimson A, Schelter JM, Castle J, Bartel DP, Linsley PS, Johnson JM. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433(7027):769–73. Liu J, Valencia-Sanchez MA, Hannon GJ, Parker R. (2005). MicroRNA-dependent localization of targeted mRNAs to mammalian P-bodies. Nat Cell Biol 7(7): 719–23. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR. (2005). MicroRNA expression profiles classify human cancers. Nature 435(7043): 834–38. Lu M, Zhang Q, Deng M, Miao J, Guo Y, Gao W, Cui Q. (2008). An analysis of human microRNA and disease associations. PLoS One 3(10):e3420. Lund E, Güttinger S, Calado A, Dahlberg JE, Kutay U. (2004). Nuclear export of microRNA precursors. Science 303(5654):95–98. Martin MM, Buckenberger JA, Jiang J, Malana GE, Nuovo GJ, Chotani M, Feldman DS, Schmittgen TD, Elton TS. (2007). The human angiotensin II type 1 receptor +1166 A/C polymorphism attenuates microrna-155 binding. J Biol Chem 282(33): 24262–69. Matranga C, Tomari Y, Shin C, Bartel DP, Zamore PD. (2005). 
Passenger-strand cleavage facilitates assembly of siRNA into Ago2-containing RNAi enzyme complexes. Cell 123(4):607–20. Matuszyk J. (2002). Selection of a control reporter vector for the dual-luciferase reporter assay for transcription activation. E. ZIOLO 7:63. McGlinn E, Yekta S, Mansfield JH, Soutschek J, Bartel DP, Tabin CJ. (2009). In ovo application of antagomiRs indicates a role for miR-196 in patterning the chick axial skeleton through Hox gene regulation. Proc Natl Acad Sci U S A 106(44): 18610–15. Medina PP, Slack FJ. (2008). microRNAs and cancer: an overview. Cell Cycle 7(16): 2485–92. Nicolas FE, Pais H, Schwach F, Lindow M, Kauppinen S, Moulton V, Dalmay T. (2008). Experimental identification of microRNA-140 targets by silencing and overexpressing miR-140. RNA 14(12):2513–20. O’Donnell KA, Wentzel EA, Zeller KI, Dang CV, Mendell JT. (2005). c-Myc-regulated microRNAs modulate E2F1 expression. Nature 435(7043):839–43.
Olsen PH, Ambros V. (1999). The lin-4 regulatory RNA controls developmental timing in Caenorhabditis elegans by blocking LIN-14 protein synthesis after the initiation of translation. Dev Biol 216(2):671–80. Pillai RS, Bhattacharyya SN, Artus CG, Zoller T, Cougot N, Basyuk E, Bertrand E, Filipowicz W. (2005). Inhibition of translational initiation by let-7 microRNA in human cells. Science 309(5740):1573–76. Rand TA, Petersen S, Du F, Wang X. (2005). Argonaute2 cleaves the anti-guide strand of siRNA during RISC activation. Cell 123(4):621–29. Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR, Ruvkun G. (2000). The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403(6772):901–06. Ren XP, Wu J, Wang X, Sartor MA, Qian J, Jones K, Nicolaou P, Pritchard TJ, Fan GC. (2009). MicroRNA-320 is involved in the regulation of cardiac ischemia/reperfusion injury by targeting heat-shock protein 20. Circulation 119(17):2357–66. Rigoutsos I, Huynh T, Miranda K, Tsirigos A, McHardy A, Platt D. (2006). Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes. Proc Natl Acad Sci U S A 103(17): 6605–10. Rosenfeld N, Aharonov R, Meiri E, Rosenwald S, Spector Y, Zepeniuk M, Benjamin H, Shabes N, Tabak S, Levy A, Lebanony D, Goren Y, Silberschein E, Targan N, Ben-Ari A, Gilad S, Sion-Vardy N, Tobar A, Feinmesser M, Kharenko O, Nativ O, Nass D, Perelman M, Yosepovich A, Shalmon B, Polak-Charcon S, Fridman E, Avniel A, Bentwich I, Bentwich Z, Cohen D, Chajut A, Barshack I. (2008). MicroRNAs accurately identify cancer tissue origin. Nat Biotechnol 26(4):462–69. Ruby JG, Stark A, Johnston WK, Kellis M, Bartel DP, Lai EC. (2007). Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Genome Res 17(12):1850–64. Saunders MA, Liang H, Li WH. (2007). 
Human polymorphism at microRNAs and microRNA target sites. Proc Natl Acad Sci U S A 104(9):3300–05. Sayed D, Hong C, Chen IY, Lypowy J, Abdellatif M. (2007). MicroRNAs play an essential role in the development of cardiac hypertrophy. Circ Res 100(3):416–24. Schetter AJ, Leung SY, Sohn JJ, Zanetti KA, Bowman ED, Yanaihara N, Yuen ST, Chan TL, Kwong DL, Au GK, Liu CG, Calin GA, Croce CM, Harris CC. (2008). MicroRNA expression profiles associated with prognosis and therapeutic outcome in colon adenocarcinoma. JAMA 299(4):425–36. Sen GL, Blau HM. (2005). Argonaute 2/RISC resides in sites of mammalian mRNA decay known as cytoplasmic bodies. Nat Cell Biol 7(6):633–36. Sethupathy P, Corda B, Hatzigeorgiou AG. (2006). TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA 12(2):192–97. Sethupathy P, Collins FS. (2008). MicroRNA target site polymorphisms and human disease. Trends Genet 24(10):489–97. Sethupathy P, Megraw M, Hatzigeorgiou AG. (2006). A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods 3(11):881–86. Shahi P, Loukianiouk S, Bohne-Lang A, Kenzelmann M, Küffer S, Maertens S, Eils R, Gröne HJ, Gretz N, Brors B. (2006). Argonaute–a database for gene regulation by mammalian microRNAs. Nucleic Acids Res 34:D115–18.
Sheth U, Parker R. (2003). Decapping and decay of messenger RNA occur in cytoplasmic processing bodies. Science 300(5620):805–08. Si ML, Zhu S, Wu H, Lu Z, Wu F, Mo YY. (2007). miR-21-mediated tumor growth. Oncogene 26(19):2799–803. Slack FJ, Basson M, Liu Z, Ambros V, Horvitz HR, Ruvkun G. (2000). The lin-41 RBCC gene acts in the C. elegans heterochronic pathway between the let-7 regulatory RNA and the LIN-29 transcription factor. Mol Cell 5(4):659–69. Sood P, Krek A, Zavolan M, Macino G, Rajewsky N. (2006). Cell-type-specific signatures of microRNAs on target mRNA expression. Proc Natl Acad Sci U S A 103(8):2746–51. Stark A, Brennecke J, Bushati N, Russell RB, Cohen SM. (2005). Animal microRNAs confer robustness to gene expression and have a significant impact on 3′UTR evolution. Cell 123(6):1133–46. Taganov KD, Boldin MP, Chang KJ, Baltimore D. (2006). NF-kappaB-dependent induction of microRNA miR-146, an inhibitor targeted to signaling proteins of innate immune responses. Proc Natl Acad Sci U S A 103(33):12481–86. Takamizawa J, Konishi H, Yanagisawa K, Tomida S, Osada H, Endoh H, Harano T, Yatabe Y, Nagino M, Nimura Y, Mitsudomi T, Takahashi T. (2004). Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Res 64(11):3753–56. Tan Z, Randall G, Fan J, Camoretti-Mercado B, Brockman-Schneider R, Pan L, Solway J, Gern JE, Lemanske RF, Nicolae D, Ober C. (2007). Allele-specific targeting of microRNAs to HLA-G and risk of asthma. Am J Hum Genet 81(4):829–34. Teixeira D, Sheth U, Valencia-Sanchez MA, Brengues M, Parker R. (2005). Processing bodies require RNA for assembly and contain nontranslating mRNAs. RNA 11(4):371–82. Voorhoeve PM, le Sage C, Schrier M, Gillis AJ, Stoop H, Nagel R, Liu YP, van Duijse J, Drost J, Griekspoor A, Zlotorynski E, Yabuta N, De Vita G, Nojima H, Looijenga LH, Agami R. (2006). A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors. Cell 124(6):1169–81. Wightman B, Ha I, Ruvkun G. (1993). Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75(5):855–62. Wu L, Fan J, Belasco JG. (2006). MicroRNAs direct rapid deadenylation of mRNA. Proc Natl Acad Sci U S A 103(11):4034–39. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. (2005). Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434(7031):338–45. Xu P, Vernooy SY, Guo M, Hay BA. (2003). The Drosophila microRNA Mir-14 suppresses cell death and is required for normal fat metabolism. Curr Biol 13(9):790–95. Yanaihara N, Caplen N, Bowman E, Seike M, Kumamoto K, Yi M, Stephens RM, Okamoto A, Yokota J, Tanaka T, Calin GA, Liu CG, Croce CM, Harris CC. (2006). Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell 9(3):189–98.
Yekta S, Shih IH, Bartel DP. (2004). MicroRNA-directed cleavage of HOXB8 mRNA. Science 304(5670):594–96. Yi R, Qin Y, Macara IG, Cullen BR. (2003). Exportin-5 mediates the nuclear export of pre-microRNAs and short hairpin RNAs. Genes Dev 17(24):3011–16. Yu SL, Chen HY, Chang GC, Chen CY, Chen HW, Singh S, Cheng CL, Yu CJ, Lee YC, Chen HS, Su TJ, Chiang CC, Li HN, Hong QS, Su HY, Chen CC, Chen WJ, Liu CC, Chan WK, Chen WJ, Li KC, Chen JJ, Yang PC. (2008). MicroRNA signature predicts survival and relapse in lung cancer. Cancer Cell 13(1):48–57. Yu Z, Li Z, Jolicoeur N, Zhang L, Fortin Y, Wang E, Wu M, Shen SH. (2007). Aberrant allele frequencies of the SNPs located in microRNA target sites are potentially associated with human cancers. Nucl Acids Res 35(13):4535–41. Zhao H, Kalota A, Jin S, Gewirtz AM. (2009). The c-myb proto-oncogene and microRNA-15a comprise an active autoregulatory feedback loop in human hematopoietic cells. Blood 113(3):505–16.
CHAPTER 17
Confirmation of Gene Function Using Translational Approaches
CAROLINE J. ZEISS
Contents
17.1 Introduction
17.2 Sources of Phenotypic Variability in Genetically Altered Mice
17.2.1 The Effect of Mouse Strain on Expression of an Induced Genetic Alteration
17.2.2 Strain-Specific Pathology
17.2.3 Environmental Phenomena
17.3 Gene-Driven or Reverse Genetics Approach to Mouse Research
17.3.1 Method of Genetic Manipulation and Phenotypic Variability
17.3.2 Determining the Phenotypic Effects of a Known or Novel Gene
17.3.3 Embryonal Lethal Phenotypes
17.4 Phenotype-Driven or Forward Genetics Approach to Mouse Research
17.4.1 Spontaneous Mutations
17.4.2 Genomewide Mutagenesis
17.4.3 High-Throughput Phenotyping
17.5 Information Resources
17.6 Questions and Answers
17.7 References
17.1 INTRODUCTION
The ultimate intent of translational research is to advance patient care by integrating information obtained in basic molecular studies with clinical trials (Woolf, 2008; Goldblatt and Lee, 2010). This approach works in both directions.
Figure 17.1. Forward and reverse approaches to identifying gene function in mice. [Schematic. Forward genetic screen: random mutagenesis, observe phenotype, identify causative gene and infer function. Reverse genetic screen: alter a known gene, observe phenotype, infer function.]
Clinical findings provide the impetus for in vitro studies or experiments in animal models that elucidate mechanisms of disease or their proximate causes. Similarly, pathophysiologic clues to the progression of a disease entity, or new solutions for mitigating its effects, typically evolve from applied studies in animals. The mouse has been the primary vehicle of translational exploration. Genetically altered mice provide superb models of human physiology and disease. They allow us to evaluate the effects of single altered genes in the context of the whole organism and provide tremendous insight into gene function. The genetic basis for a multitude of hereditary, neoplastic, and degenerative disorders in humans has long been known. Sequencing the human genome provided the coding sequences of genes causing or contributing to these conditions. Simultaneously, the capacity to genetically manipulate specific genes in the mouse genome (reverse genetics) has resulted in a vast body of knowledge about how genetic alterations in humans create disease in a mammalian system and how they may be remedied (Figure 17.1). Although murine studies form the bulk of such investigations, they are increasingly supplemented by more basic studies in simpler model organisms (zebrafish, nematodes, and frogs) and applied clinical studies in larger models such as dogs, pigs, and nonhuman primates. The sequencing of multiple organismal genomes and the ability to manipulate genes in vivo have rapidly accelerated our ability to dissect gene function and apply these findings to human disease. Before these developments, exploration of gene function relied on spontaneous appearance of hereditary disorders in animals or random mutagenesis followed by observation of progeny for a
phenotype (forward genetics) (Figure 17.1). This approach depends on relatively laborious identification of the causative gene defect using classical genetics, or on the fortuitous choice of the correct candidate gene. In the forward approach, random mutagenesis in a founder results in numerous defects in unknown genes. Progeny are screened for a phenotype, and those showing one are used to develop individual lines in which the causative defect is mapped. In the reverse approach, a specific gene is mutagenized or introduced and the phenotypic effect on the progeny is observed. Recording the physiologic effect of a gene defect on an animal system is broadly termed phenotyping. Most commonly, the approach taken is hypothesis driven and limited by the interests of the individual investigator. Less commonly, a non-hypothesis-driven approach to standardized characterization of mutants is employed by facilities or consortia devoted to generating and/or screening large numbers of novel mutant animals. The purpose of this chapter is to describe these approaches so as to generate an overall understanding of their intent, as well as to provide a guide to performing phenotypic studies in mice.
17.2 SOURCES OF PHENOTYPIC VARIABILITY IN GENETICALLY ALTERED MICE
Variation is intrinsic to animal populations, and results are more satisfactory when the experimental intervention is the only source of that variation. Unexpected variation in murine studies most commonly arises from several sources, including background strain, concurrent infection, and the methods used to create the genetic alteration of interest.
17.2.1 The Effect of Mouse Strain on Expression of an Induced Genetic Alteration
Inbred mouse strains differ dramatically in their appearance, physiologic characteristics, and spectrum of spontaneous disease. These differences arise from strain-specific allelic variation across the genome and profoundly affect expression of the genetic alteration under study as well as the spectrum of spontaneous disease in individual mouse strains.
17.2.1.1 Genetic Manipulation in 129 and C57BL/6J Strains
The C57BL/6 mouse is the reference strain for the mouse genome sequence (Waterston et al., 2002) and is the most commonly used strain for biomedical research. However, the most commonly used strain for embryonic stem (ES) cell manipulation is the 129 strain (Simpson et al., 1997), because ES cells from this strain display robust germline transmission of the genetic alteration. The 129-derived ES cells are injected into C57BL/6J blastocysts because of the high fecundity and good mothering of the latter strain. Chimeric animals are typically bred
to C57BL/6J mice, producing genetically similar F1 animals sharing similar chromosomal complements of 129 and C57BL/6J strains. Due to meiotic recombination, the local influence of each in the genome becomes increasingly variable if progeny are simply bred together in subsequent generations. Phenotypic variability may be caused by alleles that are adjacent to or interact with the target locus. The increasing use of double/triple knockout combinations, inducible transgenes, and cell-specific knockouts provides additional opportunities for creating a genetic background so mixed that results cannot be replicated by other investigators. Genetic heterogeneity is not limited to interstrain variation. Both 129 and C57BL/6 strains exhibit heterogeneity within strain. Over time, genetic drift within a population is inevitable; this means that wild type C57BL/6J mice from the colony of one investigator are no longer identical to the same strain in another facility. Consequently, wild type littermates are more appropriate control animals than wild type animals from another source.

17.2.1.2 Backcrossing
Ideally, the genetic background of control and experimental animals should be identical, with the exception of the target locus. In cases where background effects are likely to be important, the target locus is best propagated in congenic strains by successive backcrossing to one inbred strain. After breeding parental strains, F1 progeny are bred back to one parental strain (usually C57BL/6J). Progeny from this mating are then similarly bred back to the parental strain until, after 6 backcross breedings (a process which generally takes 2 years), the resultant offspring are 99% similar to the chosen strain, with the exception of the region surrounding the target locus. After 10 generations, the induced gene defect, enclosed in about 100 Mb of donor genome, is all that remains within a pure homozygous recipient genome.
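The arithmetic behind these backcrossing figures can be sketched in a few lines of Python. This is a minimal illustration, assuming simple Mendelian expectation for genome unlinked to the target locus (each backcross halves the expected donor contribution, starting from 50% in the F1); the function name is hypothetical, and the calculation ignores the donor segment retained around the selected locus as well as chance variation between individual animals.

```python
def expected_recipient_fraction(backcrosses: int) -> float:
    """Expected fraction of unlinked genome derived from the recipient strain.

    backcrosses = 0 is the F1 generation (50% from each parent); every
    further backcross halves the expected donor contribution.
    Assumption: simple Mendelian expectation, no marker-assisted selection.
    """
    if backcrosses < 0:
        raise ValueError("backcrosses must be non-negative")
    donor_fraction = 0.5 ** (backcrosses + 1)
    return 1.0 - donor_fraction


if __name__ == "__main__":
    # Print the expected recipient contribution at a few generations.
    for n in (0, 1, 6, 10):
        fraction = expected_recipient_fraction(n)
        print(f"{n:2d} backcrosses: {fraction:.2%} recipient genome")
```

Under these assumptions, six backcrosses give an expected recipient contribution of a little over 99%, in line with the figure quoted above, and further generations mainly shrink the donor region flanking the target locus.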
This strategy also provides the opportunity to place the target locus on a number of genetic backgrounds to assess the effects of strain-specific modifier loci (Sigmund, 2000). Although 10 generations of backcrossing is recommended, circumstances frequently dictate that the phenotype of genetically altered animals be evaluated long before congenic strains can be generated. In these cases, the use of age-matched control littermates, rather than inbred animals from another source, is essential.

17.2.1.3 C57BL/6 Embryonic Stem Cells
Genetic manipulation of C57BL/6 ES cells would eliminate the need for backcrossing of progeny; however, development of these has been hindered by low germline transmission rates of C57BL/6 ES cells in contrast to 129-derived ES cells. Large-scale mutagenesis programs have been established to mutate all protein-coding genes in the mouse genome, and for these, C57BL/6 embryonic stem cells have been chosen to reduce the effects of mixed background (Austin et al., 2004). Recently, robust germline transmission has been achieved with C57BL/6N ES cells (Pettitt et al., 2009). Repair of the Agouti locus (imparting agouti color
Figure 17.2. (See color insert.) WT and rd-1 mouse retina at postnatal day 12 (panels: WT, rd-1; ONL, outer nuclear layer). In rd-1 mice, a mutation in the cyclic phosphodiesterase beta gene results in rapid loss of the outer nuclear layer in the first 3 weeks of life.
rather than black) in C57BL/6N ES cells and subsequent injection into C57BL/6J blastocysts allows identification of ES cell-derived chimeras by mixed agouti and black coat color (Pettitt et al., 2009).

17.2.2 Strain-Specific Pathology
Mice experience a spectrum of spontaneous diseases that are heavily influenced by strain. Some, such as retinal degeneration caused by the rd1 mutation (in cyclic GMP phosphodiesterase), occur in specific strains such as C3H and FVB mice. Investigators studying retinal development or disease would be advised to avoid these strains. Other conditions, such as ulcerative dermatitis, are common to many strains but occur with higher frequency in C57BL/6 mice. The significance of this condition is that it frequently necessitates euthanasia of affected mice and may be worsened by genetic intervention. Clinical examination should
precede behavioral testing as a host of unrelated factors can cause profound artifactual deficits on behavioral tests. Strain-specific background pathology (e.g., callosal defects in BALB/c, 129, and other mice) may affect behavioral test results. C57BL/6 mice tend to be more active than other strains, and this hyperactivity is normal for the strain. Strain-specific anatomy and pathology are described and referenced in several excellent texts (Maronpot et al., 1999; Hof et al., 2000; Ward et al., 2000; Brayton et al., 2001). In addition, online resources such as the Mouse Phenome Database, the database of Inbred Strain Characteristics, and the Mouse Tumor Biology Database provide searchable databases of strain-specific anatomy and pathology (Table 17.1). The latter is complemented by a recent text on murine tumor classification (Mohr, 2001).

17.2.3 Environmental Phenomena
In addition to the combined effects of genetic background and the induced mutation, environmental factors may confound the phenotype. The most significant of these is the presence of subclinical infection within the colony.

17.2.3.1 Concurrent Infection
Experimental mouse colonies are subjected to a rigorous disease-monitoring program to prevent the spread of infectious agents. Consequently, most of the diseases that plagued mouse colonies 30 years ago have been eradicated. However, several infections persist in most colonies and are tolerated because they cause only subclinical disease and are difficult to detect or eradicate. Currently, these include Helicobacter sp., mouse parvovirus, and to a lesser extent, mouse hepatitis virus. The significance of these infections is probably limited to those investigators working on immunologic topics. Although subclinical, they do stimulate immune responses that could confound immunologic studies. In general, phenotypes that affect the immune system are most likely to suffer potential confounding effects of a prevalent but subclinical infectious disease. Not infrequently, helicobacteriosis will present as clinical disease (rectal prolapse), particularly in animals prone to inflammatory bowel disease (Chin et al., 2000) as the result of ablation of components of their immune system. Pinworm outbreaks are a relatively frequent occurrence but can be controlled by quarantine and treatment. If animals have been produced at a research facility, information regarding the health status of the room in which they live will be available from the veterinary staff.

17.2.3.2 Epigenetic Phenomena
Environmental phenomena such as stress, food composition, and the light/dark cycle can have substantial effects on phenotype. Behavioral phenotypes, and phenotypes such as obesity that are affected by behavior and feeding, are particularly susceptible (Crabbe et al., 1999; Tordoff et al., 1999).
TABLE 17.1. Online Resources for Mouse Phenotyping

Resource    URL

Mouse Development Atlases
  The House Mouse. Atlas of Embryonic Development by Karl Theiler    http://genex.hgu.mrc.ac.uk/Atlas/Theiler_book_download.html
  Edinburgh Mouse Atlas Project    http://genex.hgu.mrc.ac.uk/intro.html
  Caltech μMRI Atlas of Mouse Development    http://mouseatlas.caltech.edu/
  3D embryo images at Duke Center for In Vivo Microscopy    http://www.civm.duhs.duke.edu/devatlas/index.html
  Correlation of the Theiler system with embryonic age, size and morphologic features    http://genex.hgu.mrc.ac.uk/

Mouse Brain Atlases
  Allen Brain Atlas    http://www.brain-map.org/
  MBL: The Mouse Brain Library    http://www.mbl.org/
  Mouse Atlas Project    http://map.loni.ucla.edu/
  High resolution Mouse Brain Atlas    http://www.hms.harvard.edu/research/brain/
  Electronic Prenatal Mouse Brain Atlas    http://www.epmba.org/

Whole Body Mouse Atlases
  Three-dimensional atlas of the mouse    http://www.mrpath.com/previousvisiblemouse.html
  Mouse anatomy    http://www.informatics.jax.org/cookbook/chapters/contents2.shtml

Mouse Gene Expression Atlases
  Allen Brain Atlas    http://www.brain-map.org/
  Mouse Atlas of Gene Expression    http://www.mouseatlas.org/data/mouse/
  Gene Expression in the PN 7 Mouse Brain    http://www.geneatlas.org/gene/

Mouse Genomics and Phenotyping Resources
  Mouse Genome Resources    http://www.ncbi.nlm.nih.gov/projects/genome/guide/mouse/
  Mouse Genome Informatics    http://www.informatics.jax.org/
  Mouse Phenome Database    http://phenome.jax.org/pub-cgi/phenome/mpdcgi?rtn=docs/home
  Inbred Strain Characteristics    http://www.informatics.jax.org/external/festing/search_form.cgi
  Mouse Tumor Biology Database    http://tumor.informatics.jax.org/mtbwi/index.do
  European Mouse Phenotyping Resource of Standardized Screens    http://empress.har.mrc.ac.uk/
  EuroPhenome    http://www.europhenome.org/
  EUMORPHIA consortium    http://www.eumorphia.org/
  Knockout Mouse Project    http://www.komp.org/
  International Knockout Mouse Consortium    http://www.knockoutmouse.org/
  European Conditional Mouse Mutagenesis Program    http://www.eucomm.org/

17.3 GENE-DRIVEN OR REVERSE GENETICS APPROACH TO MOUSE RESEARCH
The completion of the Mouse Genome Project (www.ncbi.nlm.nih.gov/projects/genome/guide/mouse) provided the impetus to explore the function of specific genes by introducing a mutation in a gene, followed by observation of the phenotype in mutant progeny. This approach forms the basis of hypothesis-driven research performed by the majority of individual research labs. Animals derived from targeted gene manipulations are analyzed according to the interests of the lab; the type of analysis ranges from general overall screening to highly specialized organ-specific techniques. The general approach towards this type of analysis is described in this section.

17.3.1 Method of Genetic Manipulation and Phenotypic Variability
Understanding the technology used in the experiment is necessary to identify potential factors that may confound the phenotype. Detailed descriptions of methodologies used to create genetically altered animals can be found in Williams and Wagner (2000) and Adams and van der Weyden (2008). The following discussion will address only pitfalls associated with the more commonly used methods to generate transgenic and knockout mice. Creating a transgenic animal employs random chromosomal integration of foreign DNA after injection into fertilized oocytes. The resulting offspring are screened to identify those animals in which stable chromosomal integration of the foreign DNA has occurred. In addition to the target gene, the transgenic construct contains a transcriptional regulatory region that directs both expression level and tissue specificity of the inserted gene. Depending on the aim of the experiment, the target protein may be overexpressed (excessive amounts of normal protein expressed in tissues that normally express it) or ectopically expressed (a normal protein is expressed in tissues that do not normally express it). Alternatively, the transgene may be modified to create a gain of function mutant (by which the protein is constitutively active) or a loss of function mutant (by which the protein interacts with its partners in a dominant negative fashion). The nature of the transgenic manipulation will determine the extent to which individual tissues are examined. Because of the random nature of transgene insertion after pronuclear injection, each resultant founder contains the transgene at a different site in the genome (Clark et al., 1994). This position effect can profoundly affect the expression of both the transgene and endogenous genes whose regulatory elements may be disrupted by the insertion event. Several factors may influence the resultant phenotype. The foreign DNA usually integrates as linear arrays, resulting in variable levels of gene dosage.
The site of chromosomal integration may affect the regulatory function of the transcriptional element contained within the construct. These factors result in variable expression levels of the transgene in different founder lines. In addition, random integration of the transgene may disrupt endogenous genes (insertional mutagenesis)
thus further confounding the phenotype. Consequently, it is essential that lines from several (at least two) different founders be examined before a conclusion relating a specific phenotype to transgene expression is made (Sigmund, 2000; Williams and Wagner, 2000). To assess dose-response relationships between transgene expression and phenotype, it is also important to assess lines of mice that express the transgene at different levels. The uncertainties of random integration may be circumvented by the more challenging technology used to create knockout mice. Using homologous recombination, the coding region of a specific endogenous gene can be interrupted to eliminate gene expression (knockout) or replaced with a modified variant of the gene (knockin). The foreign DNA is inserted into cultured ES cells, followed by identification of clones that have the correct mutation and then injection of these clones into mouse blastocysts. If chimeric mice have integrated the foreign DNA into their germline, they can pass it along to their progeny to establish a colony of genetically altered animals. Although gene expression may be more precisely controlled with this method, it is possible to destroy transcriptional elements that control expression of a neighboring gene, thus creating varying phenotypes (Olson et al., 1996). More sophisticated methods of genetic manipulation are accompanied by their own particular pitfalls. These methods include Cre-Lox technology to create conditional mutants and drug-regulated transgene expression (Adams and van der Weyden, 2008).

17.3.2 Determining the Phenotypic Effects of a Known or Novel Gene
In general, the effect of eliminating the gene (knockout model) is investigated first, followed by more sophisticated interventions such as introducing single nucleotide polymorphisms in an otherwise functional gene or inducing tissue-specific expression of a mutant gene.
The investigator is required to assess the phenotypic impact of single gene alterations on complex molecular pathways. The effects of genetic background and the variability inherent in the gene construct used to create the animals frequently confound this assessment. Finally, findings must be integrated with published information to draw conclusions and design new experiments.

17.3.2.1 Preparatory Steps
When initiating a phenotypic examination, it is important to collect as much information about the experiment as possible. This includes (1) the aim and design of the experiment, (2) the known physiology of the target gene and the methods used to manipulate its expression, and (3) potential sources of phenotypic variability. Last, but most important, the induced genetic defect should be characterized. At a minimum, the genomic lesion should be characterized, and expression of the transcript and cognate protein assessed in mutant animals. This will ensure that the target gene has been successfully deleted or otherwise altered prior to embarking upon a phenotyping effort.

17.3.2.2 Assessing the Live Animal
Simple observations and good record-keeping are the fundamentals of phenotypic assessment of a new mutant line. The initial step in any phenotyping project is to determine whether the altered gene is inherited at the expected Mendelian frequency in live progeny. Following birth of progeny, the investigator notes whether neonatal deaths occur in the litter. At some point between birth and weaning, pups are genotyped, and the Mendelian spread of wild type, heterozygous, and homozygous mutant animals is assessed. If fewer homozygous mutant live animals are detected than expected, this is generally the first indication that the altered gene may cause early neonatal or embryonal death. In this case, the approach taken is described in Section 17.3.3. If a normal Mendelian spread is identified, the investigator typically has a selection of mutant, heterozygous, and wild type littermates from each mating. These should be examined clinically and weighed at regular intervals, e.g., in the early neonatal period (P5), at weaning, and at sexual maturity (6–8 weeks). To assess reproductive performance, mutant mice should be mated with one another and with wild type mice to determine whether reproduction of each gender is normal. This provides a broad but sensitive test of normal physiology in a multitude of organ systems. Not uncommonly, no clinical abnormalities are noted at all. In this case, baseline clinical and anatomic pathology phenotyping can be performed in young (8–12 weeks) and older (12–15 months) adult animals. In general, 6–10 animals of each genotype and age are used for terminal assessment. This typically comprises clinical chemistry, hematology, and a relatively detailed panel of histologic tissues. Detailed descriptions of hematologic and morphologic evaluation of mice are given in Brayton et al. (2001) and Car and Eng (2001).
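One way to check whether genotype counts depart from the expected Mendelian spread is a chi-square goodness-of-fit test. The sketch below assumes the pups come from heterozygote × heterozygote matings, so that the expected ratio of wild type : heterozygous : homozygous mutant is 1:2:1; the function name and the counts in the usage example are hypothetical. With three genotype classes the test has 2 degrees of freedom, for which the p-value is exactly exp(-chi2/2), so only the standard library is needed.

```python
import math


def mendelian_chi_square(wild_type: int, heterozygous: int, homozygous: int):
    """Goodness-of-fit test of genotype counts against a 1:2:1 ratio.

    Returns (chi-square statistic, p-value). Assumes a heterozygous
    intercross. With 2 degrees of freedom the chi-square survival
    function is exactly exp(-x / 2).
    """
    observed = (wild_type, heterozygous, homozygous)
    total = sum(observed)
    if total == 0:
        raise ValueError("at least one genotyped pup is required")
    expected = (0.25 * total, 0.50 * total, 0.25 * total)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts at weaning: 35 wild type, 55 heterozygous, 10 mutant.
    chi2, p = mendelian_chi_square(35, 55, 10)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A small p-value combined with a shortfall of homozygous mutants would point toward the embryonal or neonatal lethality workup described in Section 17.3.3. With small litters, counts should be pooled across matings before testing, because the chi-square approximation is unreliable when expected counts are low.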

17.3.2.3 Mice with No Phenotype
Failure to identify a clear phenotype following genetic manipulation is not uncommon. In these cases, the steps outlined above represent the minimum that is typically required to complete a publishable study. The reasons for failure of a phenotype to emerge vary from failure of the genetic alteration to create a corresponding abnormality in the protein, to the activation of compensatory mechanisms (Susulic et al., 1995; Cummings et al., 1996). Such compensation is typically identified by the altered or increased expression of related genes in the presence of a relatively normal phenotype.

17.3.2.4 Mice with a Clinical Phenotype
If it is relatively obvious, a phenotype may be identified during initial observation and provides a direction for further analysis. Alternatively, the phenotype may be subtle and revealed only when the investigator employs a series of tests designed to reveal a phenotype of interest. Numerous protocols for the antemortem physiologic assessment of mutant mice exist. These are succinctly reviewed in Rao and Verkman (2000).
Figure 17.3. Horizontal activity over 3 days in wild type and mutant mice (y-axis: horizontal activity, XT, 0–900; x-axis: time, 12-h light:dark cycle; series: mutant, WT). Data are collected in a metabolic cage where horizontal movement by the mouse elicits a beam break (XT). Activity is recorded over several light/dark cycles. During the day (white segments), mutant mice are more active than WT mice.
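Activity traces of the kind shown in Figure 17.3 are often easier to compare between genotypes once the beam-break counts are averaged separately for the light and dark phases. The following is a minimal sketch of that summary, not the analysis used for the figure; the function name, the record format, and the light schedule (lights on at 06:00 for 12 h) are assumptions chosen for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_activity_by_phase(records, lights_on_hour=6, light_hours=12):
    """Average beam-break counts separately for the light and dark phases.

    records: iterable of (timestamp, beam_break_count) pairs.
    lights_on_hour and light_hours describe the room's light cycle; the
    defaults are assumptions for this sketch and should be replaced with
    the facility's actual schedule.
    """
    phases = {"light": [], "dark": []}
    for timestamp, count in records:
        hours_since_lights_on = (timestamp.hour - lights_on_hour) % 24
        phase = "light" if hours_since_lights_on < light_hours else "dark"
        phases[phase].append(count)
    # Phases with no recordings are reported as None rather than zero.
    return {name: (mean(values) if values else None)
            for name, values in phases.items()}


if __name__ == "__main__":
    # Hypothetical recordings for one animal.
    example = [
        (datetime(2024, 1, 1, 10, 0), 100),
        (datetime(2024, 1, 1, 14, 0), 200),
        (datetime(2024, 1, 1, 22, 0), 600),
        (datetime(2024, 1, 2, 2, 0), 800),
    ]
    print(mean_activity_by_phase(example))
```

Running the same summary for each mutant and wild type animal gives per-phase values that can then be compared between genotypes with whatever statistical test suits the study design.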
For progressive conditions, animals in early, middle, and late stages of the condition should be chosen for histologic analysis. The final number of animals assessed in each study is unique and is determined by the variability of the data. This is, in turn, affected by the subtlety of the phenotype, the variability inherent in the tests used to assess physiology, and additional factors such as background strain. General testing protocols for the majority of body systems have been established over the last few years by the EUMORPHIA consortium (www.eumorphia.org), which developed a number of standardized screens known as EMPReSS (European Mouse Phenotyping Resource of Standardized Screens). These are available online from the EMPReSS website (http://empress.har.mrc.ac.uk) and provide a good overall starting point for more in-depth screening of live mice. Following characterization of live animals, age-matched animals of mutant and wild type genotypes are sacrificed and characterized as described above. Frequently, this step is combined with collection of tissues for molecular analysis. Often, the mouse phenotype represents a relatively minor portion of a publication and is located between the description of the genomic intervention and the molecular work describing the mechanism constituting the essence of the paper.

17.3.3 Embryonal Lethal Phenotypes
Identifying the time of fetal death requires euthanasia of pregnant dams at successive stages of pregnancy to determine the time at which embryos are lost. Good reviews detailing the evaluation of embryonic death and perinatal mortality can be found in Brayton et al. (2001) and Ward et al. (2000).

17.3.3.1 Collection of Embryos at Specific Developmental Stages
Matings between fertile males and spontaneously cycling females are usually set up in the late afternoon or early evening. Females in proestrus can be
selected by vaginal inspection (Champlin et al., 1973). Approximately half of the females selected this way will mate that night. Consequently, a relatively large number of matings need to be set up to obtain the required number of timed pregnant females. Alternatively, females can be superovulated using intraperitoneal pregnant mare's serum gonadotropin (PMSG; typical dose 5 IU) followed 48 h later by human chorionic gonadotropin (hCG; typical dose 5 IU). Ovulation occurs approximately 12 h later. Depending on the dose administered, ranging from 2.5 IU (physiologic) to 10 IU (high), large numbers of embryos may implant and result in artifactual changes from overcrowding (Kaufman, 2000). Observation of a vaginal plug is required to accurately determine the developmental stage of embryos. In mice kept in a standard 12 h light/dark cycle, it is assumed that mating occurs at the mid-dark point, at approximately 2 a.m. If a vaginal plug is identified the next morning, embryos will be assumed to be E (embryonic day) 0.5 or 0.5 dpc (days post coitum) old (Hogan et al., 1994). Implantation usually occurs at E4.5 and the duration of pregnancy is 19.5–21 days. Before implantation, embryos may be retrieved by flushing the oviduct and uterus with phosphate buffered saline. Between E4.5 and E8.5, it is best to isolate the embryo within its intact decidual swelling to avoid damaging it. After E8.5, the embryo can be dissected from the uterus and its yolk sac. It can be retained within the amnion, but considerable care should be taken to avoid damage.

17.3.3.2 Fixation, Embedding, and Orientation
Following collection of tissue for genotyping, embryos may be fixed in Bouin’s solution, 10% formalin, or 4% paraformaldehyde. Particularly with Bouin’s solution, the tissue will become brittle if placed in fixative for too long. Embryos with a crown–rump length of 2 mm require only 1 h in fixative, while those with a crown–rump length of ∼15 mm can be placed in fixative for up to 24 h.
After removal from fixative, embryos may be placed in 70% ethanol for long-term storage at room temperature. Before embedding, the embryos are dehydrated through graded stages of alcohol and then placed in a 1:1 mixture of 100% ethanol:benzene (see Kaufman [2000] for detailed procedures). The addition of a few drops of eosin at the 90% ethanol stage will stain the embryo pink and facilitate its visualization during embedding. Embryos older than E8.5 can be oriented relatively easily, as the head and tail are readily visualized and the embryos tend to fall on their sides in the wax block. Younger embryos within their decidual swellings can be sectioned in the transverse plane by using the decidual swelling to orient the embryo. A large number of specimens may be required to obtain useful sections of specimens under E8.5. Transverse sections are generally done through the majority of the embryo and provide the most morphologic information.

17.3.3.3 Histologic Interpretation and Staging
The most commonly used staging system is that of Theiler (1972, 1989). This system has been
adopted by recent standard texts on mouse embryology (Kaufman, 1994; Kaufman and Bard, 1999). A table correlating the Theiler system with embryonic age, size and morphologic features can be found online (http://genex.hgu.mrc.ac.uk). The Atlas of Mouse Development (Kaufman, 1994) provides the most comprehensive illustration of each of the Theiler stages. Each Theiler stage, up to about E11.5 (Theiler stage 20), lasts for about 12 h. As tissues develop so rapidly at these stages, a precise identification of embryo age may be difficult. Aging of embryos is easier after E12, when each Theiler stage encompasses about 24 h. Aging can also be done by examining the sequence of long bone ossification in whole embryos or tissue sections. This method is best used after E15.5 (Patton and Kaufman, 1995) when ossification centers are present. The pathologist should be aware of intrinsic variations in normal embryonal development. Within the same litter, developmental maturity can vary by 6–12 h. Hematoxylin and eosin staining is sufficient for initial screening. Further analyses frequently make use of the spectrum of techniques traditionally used in light microscopy, for example, special stains, histochemistry (Kaufman and Schnebelen, 1986), immunohistochemistry, and in situ hybridization (Durrant, 1996; Kadkol et al., 1999).

17.3.3.4 Newer Techniques to Assess Embryonal Phenotypes
Detailed two- or three-dimensional visualization of embryonal anatomy can be achieved using a variety of imaging techniques. Volumetric X-ray computed tomography of osmium tetroxide-stained tissues is able to generate virtual histologic images (Johnson et al., 2006). Three-dimensional images can be obtained by magnetic resonance imaging (Petiet et al., 2008). An atlas of normal embryos obtained using this method can be accessed online at the Duke Center for In Vivo Microscopy (CIVM; Table 17.1). These and other in vivo techniques are reviewed by Kulandavelu et al. (2006).
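The timing conventions of Section 17.3.3.1 lend themselves to a small planning helper. The sketch below is illustrative only: the function names are hypothetical, the intervals (hCG 48 h after PMSG, ovulation about 12 h after hCG) and stage boundaries (implantation at E4.5, dissection after E8.5) are taken from the text above, and the nominal embryonic day simply counts the morning a plug is found as E0.5 and adds one day per calendar day thereafter.

```python
from datetime import date, datetime, timedelta

IMPLANTATION_STAGE = 4.5  # embryonic day at which implantation usually occurs


def superovulation_times(pmsg_time: datetime):
    """Return (hCG injection time, approximate ovulation time)."""
    hcg_time = pmsg_time + timedelta(hours=48)
    ovulation_time = hcg_time + timedelta(hours=12)
    return hcg_time, ovulation_time


def embryonic_day(plug_date: date, collection_date: date) -> float:
    """Nominal embryonic day, counting the morning a plug is found as E0.5."""
    days = (collection_date - plug_date).days
    if days < 0:
        raise ValueError("collection cannot precede the plug date")
    return 0.5 + days


def retrieval_method(stage: float) -> str:
    """Suggested way to recover embryos at a given embryonic day."""
    if stage < IMPLANTATION_STAGE:
        return "flush oviduct and uterus with phosphate buffered saline"
    if stage <= 8.5:
        return "isolate embryo within its intact decidual swelling"
    return "dissect embryo from uterus and yolk sac"


if __name__ == "__main__":
    # Hypothetical dates for a timed mating.
    stage = embryonic_day(date(2024, 3, 1), date(2024, 3, 10))
    print(f"E{stage}: {retrieval_method(stage)}")
```

Because littermates can differ in developmental maturity by 6–12 h, the nominal embryonic day is only a starting point, and morphologic staging against the Theiler system remains necessary.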

17.4 PHENOTYPE-DRIVEN OR FORWARD GENETICS APPROACH TO MOUSE RESEARCH
In the forward genetics approach, genetic lesions are introduced randomly throughout the genome, followed by observation for a phenotype in progeny. Once the phenotype can be established through successive generations, the causative gene is mapped and identified. Many of the animal models with which we have been familiar for many decades resulted from the forward genetics approach applied to spontaneous mutations that arose in inbred colonies of mice and larger animals. The advantage of this approach is that novel mutants derived from alteration of as yet uncharacterized genes can be developed. Also, the induced mutations tend to be point mutations that can result in subtle or dominant negative phenotypes (Rajan and Kopito, 2005). Different point mutations in the same gene can generate an allelic series, thus more accurately reflecting the disease phenotypes arising from mutations in the
corresponding human disease gene (Bendotti and Carrì, 2004). Last, point mutations and knockouts may have very different phenotypes (Signorini et al., 1997; Jensen et al., 1999).

17.4.1 Spontaneous Mutations
Modern inbred mouse strains were created by progressive inbreeding and are defined by over 20 generations of brother–sister matings. It is not surprising that this endeavor generated a body of inherited diseases in mice that have since become established models for their corresponding human diseases. Spontaneous mutants still arise in mouse colonies, and in facilities equipped to pursue these, they form the basis of new spontaneous models of disease. While some transgenic models have been created in larger animals, most commonly pigs, spontaneous inherited diseases in larger animal models remain the most common means by which these models are developed to complement murine and human studies.

17.4.2 Genomewide Mutagenesis
Random genomewide mutagenesis is typically employed by facilities that generate large numbers of novel mutant mice. For example, the aim of the Knockout Mouse Project (now the International Knockout Mouse Consortium) is to mutate all protein-coding genes in the mouse using a combination of gene trapping and gene targeting in C57BL/6 mouse embryonic stem (ES) cells (www.knockoutmouse.org). This approach is also used by individual investigators, most commonly to identify novel genes that modify a known phenotype. Techniques of mutagenesis can be broadly divided into those that facilitate subsequent identification of the mutagenized gene using a genetic tag (such as gene trapping or transposon-based mutagenesis) and those that do not (such as ENU mutagenesis). In all cases, backcrossing is needed to identify recessive phenotypes.

17.4.2.1 ENU Mutagenesis
Because of its ability to induce single base pair mutations in any gene, N-ethyl-N-nitrosourea (ENU) has become a standard mutagen for the phenotype-driven approach in the mouse (Barbaric et al., 2007). Mutagenesis is achieved in male founders by single or multiple treatments with ENU. These males (G0) are bred to normal females, and G1 progeny can be screened for dominant mutations. To obtain recessive mutants, G1 males (each carrying a unique set of mutations) are bred to normal females, and the female G2 progeny of this mating are bred back to their G1 father. The G3 progeny of these matings are screened for phenotypes, a higher proportion of which can now be expected to be recessive (Vitaterna et al., 2006). ENU-induced mutations are most commonly A:T to T:A transversions or A:T to G:C transitions that cause missense mutations; genes with larger coding sequences are most commonly affected (Barbaric et al., 2007). Because ENU mutagenesis does not
introduce a selectable marker, there is no direct means to identify the mutation site; this must be established by positional cloning.

17.4.2.2 Gene Trapping
Gene trapping is a method of random mutagenesis in which insertion of a synthetic DNA element into endogenous genes results in their transcriptional disruption (Brennan and Skarnes, 2008). A gene trap construct consists of a splice acceptor, selectable marker gene, and polyadenylation signal that is placed within a retroviral genome. Retroviral particles are used to infect the ES cell line; when insertions occur within transcriptionally active regions, the marker is transcribed and expressed, allowing selection of positive clones. This insertion also results in disruption of the endogenous transcript and its associated function. In addition, the selectable marker can be used as a tag to identify the insertion location and the disrupted gene.

17.4.2.3 Transposons
DNA transposons are genetic elements consisting of inverted terminal DNA repeats (TRs), which in their naturally occurring configuration flank a transposase coding sequence (CDS). This transposase follows a cut-and-paste mechanism to excise the transposon from its original genomic location and insert it into a new locus (Adams and van der Weyden, 2008; Ivics et al., 2009; Largaespada, 2009). These genetic elements are responsible for both ancient and new phenotypes in mice. Approximately 10% of the mouse genome is made up of endogenous retrovirus (ERV) sequences, which represent the remains of ancient germ line infections by transposable elements. These interrupt gene function at specific loci and are associated with strain-specific cancers, predominantly mammary and lymphoid tumors (Stocking and Kozak, 2008). Recently, two classes of transposons have been engineered to induce random genomewide mutagenesis.
Both the Tc1-like DNA transposon known as Sleeping Beauty and the insect-derived transposon piggyBac have been optimized to induce random mutagenesis in mouse cells. In addition to the transposase, the inverted terminal DNA repeats enclose a selectable marker to aid subsequent identification of the insertion site (Ivics et al., 2009; Largaespada, 2009).

17.4.3 High-Throughput Phenotyping
Several facilities exist, primarily in Europe, that perform high-throughput, standardized phenotyping studies on mice. These are best suited to characterizing mice that are created by random mutagenesis and thus employ a non-hypothesis-driven approach. Their strength lies in the uniformity of the approach, as well as the breadth of data collected on each mouse. The majority of tests are performed on live mice; consequently, pathology data constitute only a fraction of the entire dataset. Currently, a multinational effort to knock out all genes in the mouse genome has been created, comprising EUCOMM in Europe, NorCOMM in Canada, and KOMP in the United States. EMPReSS is a
database of SOPs (Mallon et al., 2008), developed by a now-defunct European consortium known as EUMORPHIA (www.eumorphia.org). A truncated version of these SOPs, known as EMPReSS Slim, is currently in use in four major phenotyping facilities in France, Germany, and the UK. Using these protocols, normal baseline data from several inbred mouse strains have already been established and are available at the EuroPhenome site (www.europhenome.org). Current and future data from high-throughput phenotyping of mutant strains will be deposited in the EuroPhenome database (Morgan et al., 2009).

17.5 INFORMATION RESOURCES

After completion of the initial phenotyping panel, the data should be assessed in the light of the experimental design. This requires integration of current knowledge of the cellular process in which the target gene is involved, and comparison with described related mouse phenotypes. Collective analysis of numerous mutant mouse studies has generated a massive body of information that is still undergoing organization. Currently, no comprehensive resource exists that correlates structure and function of genes to their cognate cellular pathways and mutant phenotypes. However, extensive data exist for each of these disciplines independently (Hancock and Mallon, 2007), so it falls to the pathologist and investigator to integrate them. In addition to the resources listed in this chapter, Table 17.1 provides a list of the most comprehensive resources.

17.6 QUESTIONS AND ANSWERS

Q1. You have generated a knockout line and notice that after genotyping, most litters contain the following distribution of pups at weaning: homozygous mutant 10%, heterozygous 55%, homozygous wild type 35%. What could be happening and how do you investigate this?

Q2. You have generated a knockout line on a mixed C57BL/6/129 background. You are testing your line for elevated blood pressure (BP) and notice that the data are so variable that you cannot establish significant differences in BP between genotypes. What could be happening and what could you do?

Q3. You have generated a knockout line and fail to identify a phenotype. What is the minimum amount of data you have to show to claim that your mice have “no phenotype”?

Q4. You have characterized a cancer phenotype in mice carrying a dominant negative allele of the p53 gene. You wish to identify modifier genes that either worsen or improve the phenotype. What approaches can you take?
A1. These data indicate that most homozygous mutant pups are dying in utero. Because some do survive, they also suggest that there is variable penetrance of the phenotype. To investigate this, you need to do timed matings and sacrifice the mother at various stages of pregnancy, working backward from E19/20. Genotype all pups in the uterus until you find a normal Mendelian spread. This will tell you on which day pups are lost in utero. Phenotype mutant and control pups and placentas 1–2 days before the mutants are lost to identify the cause. You are likely to see a spectrum of severity in pups that die in utero and those that survive.

A2. You are working with a phenotype (blood pressure) that is variable to begin with. The additional variability created by a mixed genetic background is probably masking any differences between mutant and control animals. Backcross your mice for 6–10 generations to C57BL/6J mice. Also develop a standardized technique of BP measurement that you rigorously apply to all mice, and make sure mutant and control animals are age and sex matched.

A3. First, ensure that the genetic defect is characterized and that you demonstrate loss or alteration of the transcript and protein. Then ensure that there is a normal Mendelian spread of the mutant, heterozygous, and wild type genotypes in progeny and that these mice have comparable growth curves and morphology until sexual maturity. Determine whether mutant mice breed normally and produce normal offspring carrying the mutant allele. Age some mice to 2 years to assess whether they develop age-related phenotypes. Sacrifice age-matched cohorts of 4–6 male and female mice of all genotypes at 6–12 weeks and at around 1 year, and perform clinical chemistry, hematology, and histology.

A4. You could induce germline genomewide point mutations in a male p53 mutant mouse using ENU mutagenesis. Breeding this male to a female p53 mutant would reveal dominant phenotypes in G1. To assess recessive phenotypes, G1 males should be backcrossed to p53 females. Female G2 mice are bred back to their father and phenotypes assessed in the progeny. Each G1 male will be used to develop one line of mice. This approach will reveal point mutations and subtle interactions in interacting genes, but requires positional cloning to identify the new gene. Using gene trapping or transposon-mediated genomewide disruption has the advantage of introducing a selectable marker, allowing rapid identification of the new locus.
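As a rough illustration of Q1 and A1, a departure from the expected 1:2:1 genotype ratio can be checked with a chi-square goodness-of-fit test before timed matings are planned. The sketch below is only an example under stated assumptions: the cohort of 100 genotyped pups is hypothetical (the question gives percentages, not counts), and the function name is illustrative.

```python
import math


def mendelian_chi_square(observed, ratio=(1, 2, 1)):
    """Chi-square goodness-of-fit of genotype counts against a Mendelian ratio.

    observed: counts in the order (homozygous mutant, heterozygous,
    homozygous wild type). Returns (statistic, p_value). The p-value
    formula exp(-x / 2) is exact only for 2 degrees of freedom, which is
    the case for three genotype classes.
    """
    if len(observed) != 3 or len(ratio) != 3:
        raise ValueError("this sketch handles exactly three genotype classes")
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.exp(-stat / 2)


# Hypothetical cohort: 100 pups with the Q1 percentages (10% / 55% / 35%).
stat, p = mendelian_chi_square((10, 55, 35))
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

With real data, the counts pooled over all litters would replace the hypothetical tuple; a small p-value only indicates that the spread is not Mendelian and says nothing about when or why pups are lost.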
17.7 REFERENCES

Adams DJ, van der Weyden L. (2008). Contemporary approaches for modifying the mouse genome. Physiol Genomics 34(3):225–38.
Austin CP, Battey JF, Bradley A, Bucan M, Capecchi M, Collins FS, Dove WF, Duyk G, Dymecki S, Eppig JT, et al. (2004). The knockout mouse project. Nat Genet 36:921–24.
Barbaric I, Wells S, Russ A, Dear TN. (2007). Spectrum of ENU-induced mutations in phenotype-driven and gene-driven screens in the mouse. Environ Mol Mutagen 48(2):124–42.
Bendotti C, Carrì MT. (2004). Lessons from models of SOD1-linked familial ALS. Trends Mol Med 10(8):393–400.
Brayton C, Justice M, Montgomery CA. (2001). Evaluating mutant mice: anatomic pathology. Vet Pathol 38:1–19.
Brennan J, Skarnes WC. (2008). Gene trapping in mouse embryonic stem cells. Methods Mol Biol 461:133–48.
Car BD, Eng VM. (2001). Special considerations in the evaluation of the hematology and hemostasis of mutant mice. Vet Pathol 38:20–30.
Champlin AK, Dorr DL, Gates AH. (1973). Determining the stage of the estrus cycle in the mouse by the appearance of the vagina. Biol Reprod 8:491–94.
Chin EY, Dangler CA, Fox JG, Schauer DB. (2000). Helicobacter hepaticus infection triggers inflammatory bowel disease in T cell receptor alpha-beta mutant mice. Comp Med 50:586–94.
Clark AJ, Bissinger P, Bullock DW, Damak S, Wallace R, Whitelaw CB, Yull F. (1994). Chromosomal position effects and the modulation of transgene expression. Reprod Fertil Dev 6:589–98.
Copp AJ, Cockcroft DL. (1990). Postimplantation Mammalian Embryos: A Practical Approach. IRL Press, Oxford.
Crabbe JC, Wahlsten D, Dudek BC. (1999). Genetics of mouse behavior: interactions with laboratory environment. Science 284:1670–72.
Cummings DE, Brandon EP, Planas JV, Motamed K, Idzerda RL, McKnight GS. (1996). Genetically lean mice result from targeted disruption of the RII beta subunit of protein kinase A. Nature 382:622–26.
Durrant I. (1996). Nonradioactive in situ hybridization for cells and tissues. Methods Mol Biol 58:155–67.
Goldblatt EM, Lee WH. (2010). From bench to bedside: the growing use of translational research in cancer medicine. Am J Transl Res 2(1):1–18.
Hancock JM, Mallon AM. (2007). Phenobabelomics: mouse phenotype data resources. Brief Funct Genomic Proteomic 6(4):292–301.
Hof PR, Young WG, Bloom F. (2000). Comparative Cytoarchitectonic Atlas of the C57BL/6 and 129/Sv Mouse Brains. Elsevier Science, New York, NY.
Hogan B, Beddington R, Costantini F, Lacy E. (1994). Manipulating the Mouse Embryo: A Laboratory Manual. 2nd ed. Cold Spring Harbor Laboratory, New York.
Ivics Z, Li MA, Mátés L, Boeke JD, Nagy A, Bradley A, Izsvák Z. (2009). Transposon-mediated genome manipulation in vertebrates. Nat Methods 6(6):415–22.
Jensen P, Surmeier DJ, Goldowitz D. (1999). Rescue of cerebellar granule cells from death in weaver NR1 double mutants. J Neurosci 19(18):7991–98.
Johnson JT, Hansen MS, Wu I, Healy LJ, Johnson CR, Jones GM, Capecchi MR, Keller C. (2006). Virtual histology of transgenic mouse embryos for high-throughput phenotyping. PLoS Genet 2(4):e61.
Kadkol SS, Gage WR, Pasternack GR. (1999). In situ hybridization: theory and practice. Mol Diagn 4:169–83.
Kaufman MH, Schnebelen MT. (1986). The histochemical identification of primordial germ cells in diploid parthenogenetic mouse embryos. J Exp Zool 238:103–11.
Kaufman MH. (1994). The Atlas of Mouse Development. Academic Press, London.
Kaufman MH. (2000). Gestational mortality in genetically engineered mice. In Pathology of Genetically Engineered Mice. Ward JM, Mahler JF, Maronpot RR, Sundberg JP (eds.). Iowa State University Press, Ames, pp. 63–88; 103–122.
Kaufman MH, Bard JB. (1999). The Anatomical Basis of Mouse Development. Academic Press, San Diego, CA.
Kulandavelu S, Qu D, Sunn N, Mu J, Rennie MY, Whiteley KJ, Walls JR, Bock NA, Sun JC, Covelli A, Sled JG, Adamson SL. (2006). Embryonic and neonatal phenotyping of genetically engineered mice. ILAR J 47(2):103–17.
Largaespada DA. (2009). Transposon mutagenesis in mice. Methods Mol Biol 530:379–90.
Maronpot RR, Boorman GA, Gaul BW. (1999). Pathology of the Mouse: Reference and Atlas. Cache River Press, Vienna, IL.
Mallon AM, Blake A, Hancock JM. (2008). EuroPhenome and EMPReSS: online mouse phenotyping resource. Nucl Acids Res 36:D715–8.
Mohr U. (2001). International Classification of Rodent Tumours: The Mouse. Springer Verlag, Heidelberg, Germany.
Morgan H, Beck T, Blake A, Gates H, Adams N, Debouzy G, Leblanc S, Lengger C, Maier H, Melvin D, Meziane H, Richardson D, Wells S, White J, Wood J, de Angelis MH, Brown SD, Hancock JM, Mallon AM. (2009). EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucl Acids Res 38(Database issue):D577–85.
Olson EN, Arnold HH, Rigby PW, Wold BJ. (1996). Know your neighbors: three phenotypes in null mutants of the myogenic bHLH gene MRF4. Cell 85:1–4.
Patton JT, Kaufman MH. (1995). The timing of ossification of the limb bones, and growth rates of various long bones of the fore and hind limbs of the prenatal and early postnatal laboratory mouse. J Anat 186(pt 1):175–85.
Petiet AE, Kaufman MH, Goddeeris MM, Brandenburg J, Elmore SA, Johnson GA. (2008). High-resolution magnetic resonance histology of the embryonic and neonatal mouse: a 4D atlas and morphologic database. Proc Natl Acad Sci U S A 105(34):12331–36.
Pettitt SJ, Liang Q, Rairdan XY, Moran JL, Prosser HM, Beier DR, Lloyd KC, Bradley A, Skarnes WC. (2009). Agouti C57BL/6N embryonic stem cells for mouse genetic resources. Nat Methods 6(7):493–95.
Rajan RS, Kopito RR. (2005). Suppression of wild-type rhodopsin maturation by mutants linked to autosomal dominant retinitis pigmentosa. J Biol Chem 280(2):1284–91.
Rao S, Verkman AS. (2000). Analysis of organ physiology in transgenic mice. Am J Physiol Cell Physiol 279(1):C11–C18.
Signorini S, Liao YJ, Duncan SA, Jan LY, Stoffel M. (1997). Normal cerebellar development but susceptibility to seizures in mice lacking G protein-coupled, inwardly rectifying K+ channel GIRK2. Proc Natl Acad Sci U S A 94(3):923–27.
Sigmund CD. (2000). Viewpoint: are studies in genetically altered mice out of control? Arterioscler Thromb Vasc Biol 20:1425–29.
Simpson EM, Linder CC, Sargent EE, Davisson MT, Mobraaten LE, Sharp JJ. (1997). Genetic variation among 129 substrains and its importance for targeted mutagenesis in mice. Nat Genet 16(1):19–27.
Susulic VS, Frederich RC, Lawitts J, Tozzo E, Kahn BB, et al. (1995). Targeted disruption of the beta 3-adrenergic receptor gene. J Biol Chem 270:29483–92.
Stocking C, Kozak CA. (2008). Murine endogenous retroviruses. Cell Mol Life Sci 65(21):3383–98.
Theiler K. (1972). The House Mouse: Development and Normal Stages from Fertilization to 4 Weeks of Age. Springer Verlag, Berlin.
Theiler K. (1989). The House Mouse: Atlas of Embryonic Development. Springer Verlag, New York.
Tordoff MG, Bachmanov AA, Friedman MI, Beauchamp GK. (1999). Testing the genetics of behavior in mice. Science 285:2069.
Vitaterna MH, Pinto LH, Takahashi JS. (2006). Large-scale mutagenesis and phenotypic screens for the nervous system and behavior in mice. Trends Neurosci 29(4):233–40.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420(6915):520–62.
Ward JM, Mahler JF, Maronpot RR, Sundberg JP. (2000). Pathology of Genetically Engineered Mice. Iowa State University Press, Ames.
Woolf SH. (2008). The meaning of translational research and why it matters. JAMA 299:211–13.
Williams RS, Wagner PD. (2000). Transgenic animals in integrative biology: approaches and interpretations of outcome. J Appl Physiol 88:1119–26.
CHAPTER 18
Confirmation of Single Nucleotide Mutations
JOCHEN GRAW
Contents
18.1 Introduction: Why Single Nucleotide Mutations Are Difficult to Confirm
18.2 Initial Confirmation by Co-Segregation in the Family
18.3 Second Confirmation by Population Screening
18.3.1 Recurrent Mutation or Founder Effect?
18.4 Third Confirmation by Expression Analysis and Functional Studies in Model Systems
18.5 Recapitulation of Human Mutations in Animal Models
18.6 Conclusions and Outlook
18.7 Acknowledgments
18.8 Questions and Answers
18.9 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.

18.1 INTRODUCTION: WHY SINGLE NUCLEOTIDE MUTATIONS ARE DIFFICULT TO CONFIRM

The human genome, like any other genome, is dynamic and undergoes a broad variety of changes. These changes may alter the biological function of a given sequence, but in many cases the result is just a polymorphic site without any (actual) consequence. It is a game of evolution, and only in the context of other modifications or under new environmental conditions might such a change have a positive or negative effect for the organism. One of the simplest modifications seems to be the exchange of one single nucleotide. If such point mutations occur within a coding sequence, they may alter the encoded amino
acid (missense mutation). The single-base-pair mutation may also create a stop codon (nonsense mutation), leading to a premature stop of translation. In such cases, instability of the corresponding mRNA is discussed, leading finally to its nonsense-mediated decay. Biochemically, mutations like this can be identified by western blotting, through a band representing a lower molecular weight of the corresponding protein or even its absence (in the case of nonsense-mediated decay). Frequently, however, the mutation affects the third base of a triplet and is predicted to have no effect on the protein sequence (silent mutation). In a few cases, a mutation can also change a stop codon into an amino acid-coding codon, extending the length of a given protein. Even for silent mutations, one should consider the fact that the different tRNAs carrying the same amino acid do not have the same concentration within a cell. Therefore, the actual amount of a translated protein might depend on the available amount of tRNA; in other words, the tRNA might be a rate-limiting factor during protein synthesis.

Point mutations can occur rather frequently outside the coding region of a gene, namely in its promoter, in its 5′- or 3′-untranslated regions, in an intron, or even in intergenic regions, where they may affect enhancers, chromosomal domains, or functionally less defined regions. A rather new but emerging field concerns the small regulatory RNAs, where point mutations may also occur. The latter points address the difficulties we have in confirming single nucleotide mutations: what has to be confirmed is the biological function, against the major background noise of single nucleotide polymorphisms (SNPs), whose frequency in a given population can range from very rare to highly frequent.
If a missense or nonsense mutation occurs in any coding region, the functional outcome can be deduced from its consequences with respect to the amino acid sequence and the effects on charge, structure, or possible posttranslational modification sites. In a promoter, a single nucleotide mutation can be tested in a reporter gene assay to quantify its effect on gene expression as compared to the wild type sequence. A mutation in an intron may affect splicing, which can be confirmed by the analysis of cDNA in cases where the mRNA (for making cDNA) is easily accessible. However, in all other cases the functional analysis is difficult, making a statement concerning a causative relationship to the genetic disorder of interest rather complicated. An overview of the entire strategy to confirm single point mutations as disease-causing mutations is given in Figure 18.1.

Point mutations can be caused by a failure during DNA replication or DNA repair, or can be induced by chemicals or radiation (Nomura, 2008; Sankaranarayanan, 2006). In model systems like mice, alkylating agents like ethylnitrosourea (ENU) are very potent mutagens, leading to the modification of bases, mispairing during replication, and fixation of the mutation in the next replication period (Ehling et al., 1985). Because of the higher rate of cell divisions in male germ cell development compared to females, base substitutions arising from errors during replication tend to be paternal in origin (Eichenlaub-Ritter et al., 2007).
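The coding-sequence categories introduced in this section (missense, nonsense, silent, and the conversion of a stop codon into a sense codon) can be made concrete with a small classifier. This is a minimal sketch, assuming the standard nuclear genetic code and a single in-frame codon; the function name and the label "stop-loss" are illustrative choices, and the example codons are generic, not the actual codons of any gene discussed in this chapter.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon, position, new_base):
    """Classify a single-base substitution within one codon.

    codon: three-letter DNA codon (coding strand); position: 0, 1, or 2.
    Returns 'silent', 'missense', 'nonsense', or 'stop-loss'.
    """
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    if mutant == codon:
        raise ValueError("new_base must differ from the original base")
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if before == after:
        return "silent"        # same amino acid (or stop to stop)
    if after == "*":
        return "nonsense"      # sense codon becomes a stop codon
    if before == "*":
        return "stop-loss"     # stop codon becomes a sense codon
    return "missense"          # one amino acid replaced by another


print(classify_point_mutation("ATT", 2, "G"))  # Ile codon changed to Met codon
print(classify_point_mutation("TGG", 2, "A"))  # Trp codon changed to stop
```

A classification of this kind covers only the predicted effect on the protein sequence; it cannot capture the tRNA availability, splicing, or regulatory effects discussed above.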
Figure 18.1. Confirmation of point mutations in humans. [Flowchart: a patient with a suggested hereditary disease undergoes family analysis; sporadic cases or small families (n < 10) proceed by a functional candidate approach, and large families (n > 10) by linkage analysis; both routes lead to identification of a mutation, followed by genetic confirmation (co-segregation with the disease in the family; absence in the healthy population) and functional confirmation (analysis in model systems: cell culture, animal models).]
18.2 INITIAL CONFIRMATION BY CO-SEGREGATION IN THE FAMILY
From a genetic point of view, a causative mutation has to co-segregate with the disease within a family. Therefore, familial analysis has to come first in confirming a newly detected mutation. It should be kept in mind, however, that this confirmation is valid for monogenic Mendelian disorders only; special features will be discussed below. A recent example is the observation of a point mutation in the GJA8 gene encoding connexin 50. The mutation leads to an exchange of Ile to Met at position 247; it is therefore referred to as Cx50I247M (Graw et al., 2009). The mutation was previously described in a Russian family as causative for a dominant congenital cataract (opacity of the eye lens); the authors found co-segregation within the family and did not find the mutation in 25 healthy unrelated people (Polyakov et al., 2001). However, the mutation in the German patient analyzed by Graw et al. (2009) showed no co-segregation within the family, since the mutation was also present in the unaffected mother. This finding
excluded the Cx50I247M mutation in the GJA8 gene as a causative mutation for the cataract; the observation that this particular mutation was not present in 179 controls demonstrated only that it is a very rare allele. Further functional studies by the authors in cell culture systems showed that the mutant protein has the same functional characteristics as the wild type protein. Thus, the biochemical studies confirmed the first conclusions, which had been based on genetic studies only.

In practice, the sequence of investigations can also be reversed, since in large families the first step in mutation analysis of rare diseases or of disorders with genetic heterogeneity (e.g., retinitis pigmentosa; Rivolta et al., 2002; Hamel, 2006) will be a linkage analysis. Such a positional cloning approach will exclude many candidate genes and end up with a critical region in the range of a few megabases (or even less); however, the suggested mutation has to be present in all affected members of the family, but not in the healthy ones. In this context, a haplotype analysis is helpful. The term haplotype is a contraction of the term haploid genotype and refers to a combination of alleles at multiple closely linked loci (including genetic markers such as SNPs or microsatellites) that are transmitted together on the same chromatid. A haplotype contains only a few loci; the borders of a haplotype in a given family are determined by the recombination events that have occurred during the family history. It allows a putative disease allele to be traced through the family history (“identity by descent”; Visscher, 2009). However, if the families of interest are rather small (in the extreme case just a trio: the parents and one child), only a functional candidate gene approach is possible. In such cases only known genes for the diagnosed diseases can be tested.
However, all other aspects for the control of the result have to be considered (segregation of the mutation within the small family, population control of the particular mutation, and its functional relevance, if possible). In the case of the GJA8 mutation (I247M), the biochemical experiments were crucial for the final categorization of the mutation as a polymorphism. However, if the biochemical investigations are not considered (or if such experiments cannot be performed), the result in the German cataract family could be interpreted in a different manner using the term reduced penetrance. Penetrance is 100% in classical Mendelian disorders; if one assumes a Mendelian mode of inheritance but the disorder cannot be diagnosed in all carriers of the same mutation, we refer to this feature as reduced penetrance. This term, however, is only a formal description of unknown mechanisms modifying the outcome of a given mutation. An example was published by de Lange et al. (2007), who analyzed a large family suffering from cherubism, a benign fibro-osseous disease of the jaws. The disease has an autosomal dominant inheritance, and causative mutations have been found in the SH3BP2 gene encoding the SH3 domain binding protein 2 on chromosome 4p16. They identified the P418T mutation in the SH3BP2 gene in five members of the family; however, two of those five were apparently healthy.
The new P418T mutation occurs at the codon most frequently altered in cherubism (Pro418 to Leu, Arg, or His). Because it is a substitution of the nonpolar Pro by the polar Thr, the authors suggest that the change to Thr is also pathogenic. However, since no further biochemical or genetic data are given, the pathogenic mechanism of this particular mutation remains to be elucidated.

Another unsolved problem in genotype–phenotype correlation is the clinical heterogeneity of particular mutations. An example was published by Tein et al. (2008) for mutations in the ACADS gene (encoding the short-chain acyl-CoA dehydrogenase) among individuals of Ashkenazi Jewish origin. In this population, the heterozygous carrier frequency of a particular point mutation (319C->T; resulting in an R107C mutation in the precursor protein or an R83C mutation in the mature enzyme) is 1:15, and the homozygous frequency is 1:900; therefore, this mutation is considered a founder mutation in this population. Another mutation, 625G->A (resulting in a G209S mutation in the precursor protein or a G185S mutation in the mature enzyme), is also quite frequent in the general population and is discussed as a susceptibility mutation. The clinical analysis of 10 homozygotes for the 319C->T mutation or compound heterozygotes for the 319C->T/625G->A mutations revealed a broad range of clinical heterogeneity, including hypotonia and developmental delay in most of the patients but also myopathy of different grades of severity, facial weakness, lethargy, feeding difficulties, or congenital abnormalities in only a few of the patients. These differences might be caused by biochemical differences between the two mutations, resulting in 1–6% of residual enzyme activity. However, the heterogeneity indicates that additional modifiers are involved in the pathogenic process, which still remain to be elucidated.

In addition to the topics discussed above, some particular modes of inheritance need specific consideration.
First of all, if the gene of interest is located on a sex chromosome, it has to be established whether it lies in one of the two pseudo-autosomal regions of the X or Y chromosome. In these cases the classical scheme of sex-linked inheritance is not valid. Similar difficulties in the interpretation of a pedigree might arise if the mutation affects a fertility gene on the Y chromosome. If the mutation occurs in the mitochondrial genome, matrilineal transmission of the inherited disease has to be taken into account; additionally, heteroplasmy might be a cause of clinical heterogeneity in these cases.
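The carrier and homozygote frequencies quoted above for the ACADS 319C->T mutation (1:15 and 1:900) are related in the way expected under Hardy-Weinberg equilibrium. The sketch below shows the arithmetic; the equilibrium assumption and the function name are introduced here for illustration and are not taken from the cited study.

```python
import math


def allele_freq_from_carriers(carrier_freq, exact=True):
    """Estimate the mutant allele frequency q from the carrier frequency.

    Assumes Hardy-Weinberg equilibrium, under which heterozygous carriers
    make up 2*q*(1-q) of the population. With exact=False the rare-allele
    approximation carriers ~ 2*q is used instead.
    """
    if not 0 <= carrier_freq <= 0.5:
        raise ValueError("carrier frequency must lie between 0 and 0.5")
    if exact:
        # Smaller root of 2*q*(1 - q) = carrier_freq.
        return (1 - math.sqrt(1 - 2 * carrier_freq)) / 2
    return carrier_freq / 2


carriers = 1 / 15
q_approx = allele_freq_from_carriers(carriers, exact=False)
q_exact = allele_freq_from_carriers(carriers, exact=True)

# Expected homozygote frequency is q squared under the same assumption.
print(f"approximate: q = {q_approx:.4f}, homozygotes about 1 in {1 / q_approx**2:.0f}")
print(f"exact:       q = {q_exact:.4f}, homozygotes about 1 in {1 / q_exact**2:.0f}")
```

The approximate calculation reproduces the 1:900 figure; the exact solution gives a slightly higher homozygote frequency, a difference that is small compared with the sampling uncertainty of any population survey.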
18.3 SECOND CONFIRMATION BY POPULATION SCREENING

The essential distinction between a true disease-causing point mutation and a rare polymorphism (as some SNPs are) is its distribution among the healthy population. The human genome contains at least 11 million SNPs, with ∼7 million of these occurring with a minor allele frequency of over 5% and the remainder having minor allele frequencies between 1 and 5% (Frazer et al., 2009). A rare polymorphism can also be found within a healthy population,
but a true mutation must not be found. (Caveat: the tested population has to be screened for the critical characteristics of the corresponding disease; otherwise a finding within the population may just pick up a new patient!) This dogma is a formal one and holds true for classical Mendelian disorders only. For Mendelian disorders it is therefore necessary to include ∼100 controls (= 200 chromosomes); if the observed mutation does not occur within this number of control persons, at least in a heterozygous condition, the observed allele frequency is <0.5% (fewer than 1 in 200 chromosomes) and the observed mutation is very likely a true mutation (for an exception, see the GJA8/connexin 50 mutation above).

The dogma is obviously not true for complex or polygenic disorders, in which several genes interact and disturbances of this interaction lead to a disease of varying severity. In such cases, no clear Mendelian pattern of inheritance exists, but only an increased relative risk. A paradigm is venous thrombosis: keeping the blood flow constant (hemostasis) needs an orchestrated balance of prothrombotic and antithrombotic factors in the vasculature (Kottke-Marchant, 2002), and venous thrombosis is caused by alterations in this balance. One of the moderately strong risk factors is a particular mutation (R506Q) in the gene encoding coagulation factor V (FV; because it was first described in the Dutch city of Leiden, it is also frequently referred to as FV-Leiden). This mutation inactivates a cleavage site of the FV protein, leading to loss of its inactivation; that is, it remains in its active state. FV-Leiden is a common mutation, with a prevalence of carriers among whites of approximately 5%; it is also listed in the SNP database under accession number rs6025. It is found in ∼50% of patients with familial thrombophilia. The risk of thrombosis is increased fivefold in heterozygotes and fiftyfold in homozygotes.
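Returning to the control-screening rule stated earlier in this section (about 100 controls, i.e. 200 chromosomes), the following sketch shows what a screen with zero hits can and cannot say about the allele frequency. It assumes independently sampled chromosomes and a simple binomial model; the function names and the 95% confidence level are illustrative choices, not part of the original text.

```python
def smallest_detectable_frequency(n_chromosomes):
    """Allele frequency corresponding to a single hit in the control sample."""
    return 1 / n_chromosomes


def prob_missed(allele_freq, n_chromosomes):
    """Probability that an allele with the given population frequency is
    absent from every sampled chromosome (independent draws assumed)."""
    return (1 - allele_freq) ** n_chromosomes


def upper_bound_zero_hits(n_chromosomes, confidence=0.95):
    """One-sided upper confidence bound on the allele frequency when the
    allele was not observed in any of the sampled chromosomes."""
    return 1 - (1 - confidence) ** (1 / n_chromosomes)


n = 200  # 100 diploid controls
print(f"one hit would correspond to {smallest_detectable_frequency(n):.3%}")
print(f"chance of missing a 0.5% allele: {prob_missed(0.005, n):.2f}")
print(f"95% upper bound after zero hits: {upper_bound_zero_hits(n):.3%}")
```

Under these assumptions, an allele with a true frequency of 0.5% would still be missed in a substantial fraction of such screens, which is one reason why absence from controls is treated here as supporting evidence to be combined with co-segregation and functional data.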
However, FV-Leiden has limited relevance for the individual: in thrombophilic families approximately 50% of the carriers will have developed thrombosis by age 65 (Rosendaal and Reitsma, 2009). In addition, a gene-by-environment interaction might be important for the onset of thrombophilia in female FV-Leiden carriers, because the use of contraceptives further increases the relative risk for venous thrombosis (Kottke-Marchant, 2002).

18.3.1 Recurrent Mutation or Founder Effect?
Mutations are considered to be rare events; the overall mutation rate for autosomal dominant mutations is 16,500 per 10⁶ live births (Sankaranarayanan, 2006). Therefore, it is surprising that point mutations frequently occur at the same position. Examples can be found in every mutation database, provided the total number of entries is high. As a paradigm, we might look at point mutations of the gene encoding coagulation factor IX (FIX) leading to hemophilia B: the mutation G60S occurs 59 times, the mutation R248Q was found 59 times, and the mutation T296M was identified as many as 106 times in the FIX gene. All these mutations occur at CpG dinucleotides, making CpG dinucleotides a hotspot of mutations. The underlying mechanism is a spontaneous deamination of 5-methylcytosine to thymine, which leads after the next cell division
to a C->T (or G->A) transition (depending on which strand of the DNA carried the 5-methylcytosine); if this occurs during the development of the germ cells, it is fixed as a mutation. 5-Methylcytosine is formed by DNA methylation of cytosine residues, which occurs frequently at CpG dinucleotides during regulatory processes (Mukherjee et al., 2003). For such frequently occurring mutations the question arises whether they are independent de novo mutations or founder mutations that have spread throughout the population. An answer to this question can come from haplotype analysis. If the haplotypes are different, independent de novo mutations have to be considered; conversely, identical haplotypes point to a common origin (i.e., a founder mutation). In the case of the T296M mutation in the FIX gene, 36 patients underwent an additional haplotype analysis, resulting in 15 different haplotypes. This observation strongly argues in favor of a major contribution of de novo mutations as opposed to a founder effect (Mukherjee et al., 2003). In another case, Loidi et al. (2006) investigated CYP21A2 mutations in unrelated Spanish patients suffering from congenital adrenal hyperplasia. They observed the R444X mutation seven times in a total of 138 patients. Using haplotype analysis, the authors could demonstrate that six of the seven patients share a common haplotype, indicating a unique ancestral origin of this particular mutation.
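The CpG deamination mechanism described at the start of this subsection can be sketched in a few lines: for each CpG in a coding-strand sequence, deamination of 5-methylcytosine on that strand gives C->T, and deamination on the complementary strand appears as G->A. The function name is illustrative, and the example uses a generic arginine codon (CGA) chosen for this sketch; the actual codons of the FIX and CYP21A2 mutations are not given in the text.

```python
def cpg_transitions(seq):
    """List the substitutions expected from deamination at each CpG in seq.

    For a CpG with C at index i and G at index i + 1 on the given strand:
      - 5-methylcytosine deaminated on this strand      -> C->T at i
      - 5-methylcytosine deaminated on the other strand -> G->A at i + 1
    Returns a list of (index, change, mutated_sequence) tuples.
    """
    seq = seq.upper()
    results = []
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == "CG":
            results.append((i, "C->T", seq[:i] + "T" + seq[i + 1:]))
            results.append((i + 1, "G->A", seq[:i + 1] + "A" + seq[i + 2:]))
    return results


# Generic arginine codon containing a CpG (illustrative only).
for index, change, mutant in cpg_transitions("CGA"):
    print(index, change, mutant)
```

For the codon CGA the two outcomes are TGA (a stop codon) and CAA (glutamine), which illustrates how a single CpG can give rise to both nonsense and missense changes at an arginine position.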
18.4 THIRD CONFIRMATION BY EXPRESSION ANALYSIS AND FUNCTIONAL STUDIES IN MODEL SYSTEMS

The determination of a mutation in the DNA sequence alone is just an indicator that it might be causative for an inherited defect, even if all formal criteria are fulfilled (co-segregation within the family, absence in the healthy population). It is also important to make a statement concerning the underlying mechanism. Therefore, expression studies are necessary, as are functional studies in appropriate model systems (model organisms, cell culture, or biochemical tests). It is obvious that a gene that is expressed in the eye lens only cannot be responsible for heart failure. However, if it is expressed in both tissues, a mutation in this gene can be causative for cataracts (lens opacities) as well as for heart problems, as is the case for mutations in CRYAB encoding αB-crystallin (Graw, 2009). Such pleiotropic effects are well known even in classical genetics and describe effects of the same mutation on different organs or tissues. However, the opposite situation can also be observed: a child suffered from an apparently new syndrome of cataract and macular hypoplasia. The first mutation screening using a functional candidate approach in the small family revealed a de novo point mutation in the CRYAA gene (R21L) encoding αA-crystallin, explaining the cataract. Since CRYAA is not expressed in the retina, the mutation cannot be responsible for the other pathological findings.
Additional screening revealed compound heterozygosity in the OCA2 gene (R419Q and A481T); each of the unaffected parents carried one of the two alleles. The macular hypoplasia was explained by the combined effect of the compound heterozygous mutations in the OCA2 gene, manifesting as a mild form of oculocutaneous albinism (Graw et al., 2006). Besides expression analysis, functional studies in appropriate model systems are necessary to confirm a mutation or to characterize it as a rare polymorphism without pathological potential. As mentioned above, the I247M mutation in the GJA8 gene (encoding connexin50, Cx50) was not found in more than 200 healthy people from two different populations. Nevertheless, when expressed in HeLa cells, both wild-type Cx50 and I247M-Cx50 formed gap junction plaques. Moreover, both wild-type Cx50 and I247M-Cx50 induced gap junctional currents in pairs of Xenopus oocytes, indicating no functional differences between these two isoforms of Cx50 (Graw et al., 2009).
18.5 RECAPITULATION OF HUMAN MUTATIONS IN ANIMAL MODELS
The gold standard in confirming single nucleotide mutations is the recapitulation of the same mutation in the mouse. From a genetic point of view, the mouse is the best animal model for studying genetic effects in a living organism for comparison to humans. Several ways of generating mouse mutants are available: knockout of a gene of interest, leading to loss-of-function mutations (in genetic terms, a null allele); replacement of the wild-type allele by the mutation of interest (knock-in approach); or random mutagenesis using ENU as the mutagen. The knockout of every gene in the mouse is the goal of major consortia worldwide (www.eucomm.org, www.komp.org). This approach has made a lot of information available about genes and their functions. In many cases, the mouse and human diseases are similar or even identical, particularly if the human point mutation also leads to a loss of function (e.g., by forming a premature stop codon). These mouse mutants and their comparison to wild-type mice allow the study of the expression of the gene of interest in the tissue(s) or organ(s) of interest and of the physiological consequences of its loss at the molecular, cellular, and whole-organ level. Mouse models have been used to prove or disprove causality, necessity, and sufficiency of various genes and their encoded proteins, or their absence, in causing pathological situations in many organs (an excellent review of such mouse models for cardiovascular diseases was published by Yutzey and Robbins [2007]). However, loss-of-function mutations represent only one aspect of the broad spectrum of consequences of mutations in humans; they do not cover hypermorphic or hypomorphic alleles and their pathophysiological consequences, which are part of daily practice when dealing with human point mutations. Therefore, allelic series of mutations are absolutely required for
understanding the frequently observed clinical heterogeneity. A combination of gene targeting and spontaneous or randomly induced mutations is therefore necessary to represent the entire clinical spectrum. One of the most interesting sets of mutations was published very early in the field for the Mitf gene in the mouse (encoding a microphthalmia-associated transcription factor; Steingrímsson et al., 2004). The mutation spectrum in the mouse ranges from large deletions to point mutations that differ in disease severity (from severely affected dominant mutants to recessive mutants with almost no pathological phenotype). The human homolog, MITF, is mutated in patients with the pigmentary and deafness disorder Waardenburg syndrome type 2A. A current summary of all available mouse mutants for Mitf (and all other genes) can be found on the Jackson Laboratory website (www.informatics.jax.org/), which lists 35 Mitf alleles. For comparison, the database for human genetic diseases (Online Mendelian Inheritance in Man [OMIM], www.ncbi.nlm.nih.gov/omim) gives information for 8 selected alleles. Point mutations in the mouse can be induced randomly and with high efficiency by ENU (Ehling et al., 1985). This treatment schedule has been widely used and has yielded a large collection of point mutations in the mouse (Acevedo-Arozena et al., 2008); mutants are picked up because of an interesting phenotype. The underlying mutations have to be characterized in a similar way to human mutations, that is, by linkage analysis, sequencing of positional candidate genes, and exclusion of polymorphisms. Another route is offered by the Harwell sperm bank, which contains over 4000 DNA samples from individual F1 ENU-mutagenized mice (paralleled by frozen sperm samples). This archive can be screened for mutations in many genes, which allows target-oriented phenotyping of the mutants afterward (Quwailid et al., 2004).
In this context, it is important to perform phenotyping in a highly standardized manner so that the resulting data sets are comparable to those from human clinics. One example of a high-throughput, standardized phenotyping unit for mutant mice is the German Mouse Clinic (GMC; Gailus-Durner et al., 2009).
18.6 CONCLUSIONS AND OUTLOOK
Point mutations are frequently causative for inherited disorders in humans. Since single-nucleotide changes also occur frequently as polymorphisms, it is necessary to confirm the pathological nature of a mutation by its co-segregation with the disease in the family, by its absence in the general population (in the case of Mendelian disorders), and by its biological meaning. Next-generation sequencing techniques will allow fast sequencing of entire individual genomes at low cost, and the amount of data on single-nucleotide mutations is expected to increase significantly. Therefore, a clear pipeline of tests is necessary to confirm the identified mutations as causative for a given disorder.
18.7 ACKNOWLEDGMENTS
I thank those clinicians who have sent us samples for mutation analysis. Moreover, I thank Erika Bürkle, Monika Stadler and Maria Kugler for expert technical assistance in the analysis of numerous mutations in mice and humans. This project was supported by grants from the European Community (EUMODIC; LSHG-2006-037188) and from the National Genome Network (NGFN plus; BMBF 01GS0850).
18.8 QUESTIONS AND ANSWERS
Q1. What does SNP mean?
Q2. Give three criteria that define a real single nucleotide mutation as causative for a Mendelian disease.
Q3. Which criterion does not make sense for complex disorders?
A1. Scottish National Party, single nucleotide polymorphism or Schneider-Neureither & Partner.
A2. Co-segregation in the family, absence in the population, biological meaning.
A3. Absence in the population.
18.9 REFERENCES
Acevedo-Arozena A, Wells S, Potter P, Kelly M, Cox RD, Brown SD. (2008). ENU mutagenesis, a way forward to understand gene function. Annu Rev Genomics Hum Genet 9:49–69.
de Lange J, van Maarle MC, van den Akker HP, Redeker EJ. (2007). A new mutation in the SH3BP2 gene showing reduced penetrance in a family affected with cherubism. Oral Surg Oral Med Oral Pathol Oral Radiol Endod 103:378–81.
Eichenlaub-Ritter U, Adler ID, Carere A, Pacchierotti F. (2007). Gender differences in germ-cell mutagenesis and genetic risk. Environ Res 104:22–36.
Ehling UH, Charles DJ, Favor J, Graw J, Kratochvilova J, Neuhäuser-Klaus A, Pretsch W. (1985). Induction of gene mutations in mice: the multiple endpoint approach. Mutation Res 150:393–401.
Frazer KA, Murray SS, Schork NJ, Topol EJ. (2009). Human genetic variation and its contribution to complex traits. Nat Rev Genet 10:241–51.
Gailus-Durner V, Fuchs H, Adler T, Aguilar Pimentel A, Becker L, Bolle I, Calzada-Wack J, Dalke C, Ehrhardt N, Ferwagner B, Hans W, Hölter SM, Hölzlwimmer G, Horsch M, Javaheri A, Kallnik M, Kling E, Lengger C, Mörth C, Mossbrugger I, Naton B, Prehn C, Puk O, Rathkolb B, Rozman J, Schrewe A, Thiele F, Adamski J, Aigner B, Behrendt H, Busch DH, Favor J, Graw J, Heldmaier G, Ivandic B, Katus H, Klingenspor M, Klopstock T, Kremmer E, Ollert M, Quintanilla-Martinez L, Schulz H, Wolf E, Wurst W, de Angelis MH. (2009). Systemic first-line phenotyping. Meth Mol Biol 530:463–509.
Graw J. (2009). Crystallins: cataract and beyond. Exp Eye Res 88:173–89.
Graw J, Klopp N, Illig T, Preising MN, Lorenz B. (2006). Congenital cataract and macular hypoplasia in humans associated with a de novo mutation in CRYAA and compound heterozygous mutations in P. Graefe’s Arch Clin Exp Ophthalmol 244:912–19.
Graw J, Schmidt W, Minogue PJ, Rodriguez J, Tong JJ, Klopp N, Illig T, Ebihara L, Berthoud VM, Beyer EC. (2009). The GJA8 allele encoding CX50I247M is a rare polymorphism, not a cataract-causing mutation. Mol Vis 14:1881–85.
Hamel C. (2006). Retinitis pigmentosa. Orphanet J Rare Dis 1:40 (doi: 10.1186/1750-1172-1-40).
Kottke-Marchant K. (2002). Genetic polymorphisms associated with venous and arterial thrombosis. Arch Pathol Lab Med 126:295–304.
Loidi L, Quinteiro C, Parajes S, Barreiro J, Lestón DG, Cabezas-Agrícola JM, Sueiro AM, Araujo-Vilar D, Catro-Feijóo L, Costas J, Pombo M, Domínguez F. (2006). High variability in CYP21A2 mutated alleles in Spanish 21-hydroxylase deficiency patients, six novel mutations and a founder effect. Clin Endocrinol 64:330–36.
Mukherjee S, Mukhopadhyay A, Chaudhuri K, Ray K. (2003). Analysis of haemophilia B database and strategies for identification of common point mutations in the factor IX gene. Haemophilia 9:187–92.
Nomura T. (2008). Transgenerational effects from exposure to environmental toxic substances. Mutat Res 659:185–93.
Polyakov AV, Shagina IA, Khlebnikova OV, Evgrafov OV. (2001). Mutation in the connexin 50 gene (GJA8) in a Russian family with zonular pulverulent cataract. Clin Genet 60:476–78.
Quwailid MM, Hugill A, Dear N, Vizor L, Wells S, Horner E, Fuller S, Weedon J, McMath H, Woodman P, Edwards D, Campbell D, Rodger S, Carey J, Roberts A, Glenister P, Lalanne Z, Parkinson N, Coghill EL, McKeone R, Cox S, Willan J, Greenfield A, Keays D, Brady S, Spurr N, Gray I, Hunter J, Brown SD, Cox RD. (2004). A gene-driven ENU-based approach to generating an allelic series in any gene. Mamm Genome 15:585–91.
Rivolta C, Sharon D, DeAngelis MM, Dryja TP. (2002). Retinitis pigmentosa and allied diseases: numerous diseases, genes, and inheritance patterns. Hum Mol Genet 11:1219–27.
Rosendaal FR, Reitsma PH. (2009). Genetics of venous thrombosis. J Thromb Haemost 7(suppl. 1):301–4.
Sankaranarayanan K. (2006). Estimation of the genetic risks of exposure to ionizing radiation in humans: current status and emerging perspectives. J Radiat Res 47(suppl.):B57–66.
Steingrímsson E, Copeland NG, Jenkins NA. (2004). Melanocytes and the microphthalmia transcription factor network. Annu Rev Genet 38:365–411.
Tein I, Elpeleg O, Ben-Zeev B, Korman SH, Lossos A, Lev D, Lerman-Sagie T, Leshinsky-Silver E, Vockley J, Berry GT, Lamhonwah AM, Matern D, Roe CR, Gregersen N. (2008). Short-chain acyl-CoA dehydrogenase gene mutation (c.319C>T) presents with clinical heterogeneity and is candidate founder mutation in individuals of Ashkenazi Jewish origin. Mol Genet Metab 93:179–89.
Visscher PM. (2009). Whole genome approaches to quantitative genetics. Genetica 36:351–58.
Yutzey KE, Robbins J. (2007). Principles of genetic murine models for cardiac disease. Circulation 115:792–99.
CHAPTER 19
Initial Identification and Confirmation of a QTL Gene
DAVID C. AIREY and CHUN LI
Contents
19.1 Introduction
19.1.1 What Are QTLs?
19.1.2 What Are the Goals of QTL Mapping?
19.1.3 Why Map QTL in Mice and Rats?
19.2 Initial Mapping of QTL
19.2.1 Software
19.2.2 Segregating Crosses
19.2.3 Genetic Reference Populations
19.2.4 Experimental Design and Statistical Power
19.3 Fine Mapping QTL
19.3.1 Selective Phenotyping and Recombinant Progeny Testing
19.3.2 Congenics
19.3.3 Advanced Intercrosses
19.3.4 Heterogeneous Stock and Outbred Mice
19.3.5 Recombinant Inbred Segregation Tests
19.3.6 Haplotype Association Mapping
19.3.7 Multiple Cross Mapping and Combining Crosses
19.3.8 Populations on the Horizon: Diversity Outbred and Collaborative Cross Mice
19.4 Confirmation of a QTL Gene
19.4.1 What Is Required to Claim a Gene?
19.5 Bioinformatics, Systems Genetics, and Networks
19.6 Pharmacogenomics and Dynamic Phenotyping
19.7 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
19.1 INTRODUCTION
19.1.1 What Are QTLs?
Quantitative genetics is broadly defined as the study of biological variation (Lynch and Walsh, 1998). Most phenotypes show continuous variation when carefully measured, despite the tendency of the medical model to categorize disease. Quantitative trait loci (QTLs) are the genetic loci that contribute to quantitative variation in a trait, and QTL mapping in mice and rats is the effort to identify QTLs through an experimental cross. This chapter serves as a pragmatic tour guide for those considering rodent QTL mapping experiments for the first time. The main parts of the chapter summarize QTL mapping using mice and rats, from initial identification to confirmation of a QTL gene. We then briefly discuss the intersection of bioinformatics and systems genetics with QTL mapping. We end the chapter with a suggestion for QTL mapping in the area of pharmacogenomics.
19.1.2 What Are the Goals of QTL Mapping?
The goals of QTL mapping in rodents include detection, localization, and estimation of effect size. For biomedical purposes, where the goal is translation to improved understanding and treatment of human disease, we care most about these goals in that order. Were we conducting agricultural experiments where the end goal was genotype-assisted selection for an improved phenotype, then the heritability of the QTL might be of greatest interest. The heritability of a QTL is the fraction of phenotypic variance explained by it, and is a measure of QTL effect size. In biomedical applications of QTL mapping in rodents, by contrast, any identified QTL gene, regardless of its effect size, may provide clues to disease mechanism.
19.1.3 Why Map QTL in Mice and Rats?
All common human diseases of great economic burden, such as obesity, heart disease, hypertension, diabetes, cancer, and psychiatric illness, have complex genetic etiology. The underlying genetic variation of such diseases is polygenic, and the effect of individual genetic variants is small. In other words, it is difficult to discover the genetic causes of any of these diseases for the vast majority of affected individuals, despite recent progress with genomewide association (GWA) studies in humans. Manolio et al. (2009) outlined several strategies to improve GWA studies in humans, given that the amount of phenotypic variance collectively explained by all discovered genes in each disease is thus far much lower than prior estimates of disease heritability. One strategy proposed by the authors is improved phenotyping. Ascertainment of human phenotypes can be limited by access and ethics. Thus a complementary approach to finding the “missing heritability of complex diseases” in humans not outlined by
Manolio et al. (2009) is the use of experimental crosses in rodents. Within the ethical guidelines for animal research, experimental crosses in mice and rats provide access to disease model phenotypes not feasible or possible in human genetics research.
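As a hedged illustration of the "fraction of phenotypic variance explained," the single-QTL heritability defined in Section 19.1.2 can be written out for an F2 intercross. The symbols below are standard quantitative-genetics notation introduced here as assumptions, not taken from the chapter: the three genotypes at the QTL have values $+a$, $d$, and $-a$ and occur at frequencies 1/4, 1/2, and 1/4, and $\sigma^2_P$ is the total phenotypic variance.

```latex
h^2_{\mathrm{QTL}} = \frac{\sigma^2_{\mathrm{QTL}}}{\sigma^2_P},
\qquad
\sigma^2_{\mathrm{QTL}} = \frac{a^2}{2} + \frac{d^2}{4}
\quad \text{(F2 intercross)}
```

For example, under these assumptions a purely additive QTL ($d = 0$) with $a = 0.5$ in a trait with $\sigma^2_P = 1$ explains $0.125$ of the phenotypic variance.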
19.2 INITIAL MAPPING OF QTL
It is unfortunate that much of the primary and secondary literature describing the methods of QTL mapping requires a high level of statistical sophistication. As an accessible starting point, Broman (2001) is recommended. Another recent review is by Zou (2009). For non-statisticians willing or required to go further, a recently released book by Karl Broman and Śaunak Sen (2009) provides a complete and highly useful exposition of QTL mapping using the software package R/QTL (www.rqtl.org). Every practitioner of QTL mapping should own a copy of this book; there are other good sources (e.g., Siegmund and Yakir, 2007; Wu et al., 2007), but they are not as useful to the biologist. Broman and Sen (2009) give enough practical detail to actually carry out QTL mapping without oversimplifying the material, and enough statistical methodology to support good judgment by the biomedical scientist-practitioner without digressing too far into it. In the following sections, we discuss software first, followed by a synopsis of the workflow in two contrasting types of mapping populations for QTL mapping in mice and rats: segregating crosses and genetic reference populations. The distinguishing feature is that genetic variation lies between individuals in segregating crosses and between isogenic lines or strains in genetic reference populations.
19.2.1 Software
19.2.1.1 Why Is Advanced Software Necessary?
Rodent QTL mapping methods differ in the organization of the genomes used (e.g., backcross, intercross, advanced intercross, recombinant inbred, consomic, congenic, recombinant congenic, inbred strain diversity panel, or outbred lines of mice), in the type of statistical association evaluated (e.g., t-test, ANOVA, correlation, regression, generalized linear model, nonparametric approaches, or Bayesian approaches), or in both. All are fundamentally similar, however, in that they relate phenotypic variation to genetic variation. It is important to note that if we had complete genotype data for the entire mapping population, then using a t-test or ANOVA to compare the means of the animals grouped by genotype at every genetic location (called marker regression) would be a satisfactory approach for a single-QTL model. However, three central problems conspire against using this simple approach and result in the requirement of sophisticated software like R/QTL. These three problems are the model
selection problem, the missing data problem, and the test multiplicity problem (Broman and Sen, 2009). First, even if we had complete genotype data, we would still have the daunting problem of finding a good model. For example, the t-test would be inadequate if the phenotype were best predicted by a covariate-moderated two-QTL interaction. This illustrates the model selection problem. Second, because we generally observe only a finite set of markers and not the QTL genotype, we must infer an association between marker and phenotype due to linkage. This is the missing data problem, and it is inadequately handled by marker regression. Marker regression drops animals with missing marker data from the analysis, has reduced power depending on marker density, and cannot distinguish a smaller-effect QTL close to a marker from a larger-effect QTL further away from the marker. Third and finally, the testing of hundreds or thousands of genetic markers makes the nominal criterion for statistical significance invalid. Software such as R/QTL flexibly defines the genomewide statistical significance criterion by permutation methods.
19.2.1.2 Features of R/QTL
Introduced in 2003 by Broman et al., R/QTL is a mature software package for mapping QTLs in rodents. R/QTL runs as a package within R, a free and open-source, cross-platform language and environment for statistical computing and graphics (www.r-project.org). Although all R/QTL functions are executed from the command line, a more user-friendly Java graphical interface, called J/QTL, is available (Smith et al., 2009).
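The marker regression and permutation ideas described in Section 19.2.1.1 can be sketched as follows. This is a simplified illustration in plain Python, not R/QTL code and not the algorithm R/QTL implements: all function names and the simulated data are hypothetical, markers are simulated as independent (real markers are linked), genotypes are complete, and the QTL is assumed to sit exactly at a marker.

```python
import math
import random


def lod_score(genotypes, phenotypes):
    """LOD comparing a per-genotype-means model with a single-mean model."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    rss_alt = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss_alt += sum((y - group_mean) ** 2 for y in ys)
    # For a normal model, LOD = (n/2) * log10(RSS_null / RSS_alt)
    return (n / 2) * math.log10(rss_null / rss_alt)


def max_lod(markers, phenotypes):
    """Largest LOD over all markers (the genome-scan statistic)."""
    return max(lod_score(g, phenotypes) for g in markers)


def permutation_threshold(markers, phenotypes, n_perm=200, alpha=0.05, seed=1):
    """Estimate a genomewide LOD threshold by shuffling phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype association
        null_maxima.append(max_lod(markers, shuffled))
    null_maxima.sort()
    # Index of the (1 - alpha) quantile of the null maxima
    index = min(n_perm - 1, math.ceil((1 - alpha) * n_perm) - 1)
    return null_maxima[index]


# Hypothetical backcross: 100 animals, 50 independent markers coded 0/1,
# with a QTL of 1.5 residual SDs located exactly at marker 10.
rng = random.Random(42)
n_animals, n_markers, qtl_marker = 100, 50, 10
markers = [[rng.randint(0, 1) for _ in range(n_animals)] for _ in range(n_markers)]
phenotypes = [1.5 * g + rng.gauss(0.0, 1.0) for g in markers[qtl_marker]]

threshold = permutation_threshold(markers, phenotypes)
qtl_lod = lod_score(markers[qtl_marker], phenotypes)
print(f"5% genomewide threshold: {threshold:.2f}; LOD at QTL marker: {qtl_lod:.2f}")
```

Taking the maximum LOD across the genome within each permutation is what makes the threshold genomewide: it calibrates against the best marker found by chance rather than against any single test. Interval mapping and missing-genotype handling, which motivate dedicated software, are deliberately left out of this sketch.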
R/QTL has numerous functions for genetic mapping and map construction, data summarization and results plots of many kinds, single-QTL interval mapping scans by four different estimation methods (standard maximum likelihood interval mapping, Haley-Knott approximation, an extended Haley-Knott approximation, and multiple imputation), two-QTL epistasis scans, binary trait and nonparametric mapping, allowance for additive or interacting covariates, proper handling of the X chromosome (missing in other packages), statistical significance by permutation and stratified permutation, multiple-QTL models with a routine for automated model selection, and, underlying it all, a hidden Markov model that gracefully handles missing data. The rationale and use of each of these many features are clearly explained by Broman and Sen (2009). We add that some caution should be exercised when using automated model selection routines, because overfitting and lack of generalization to similar crosses may occur.
19.2.1.3 Missing from R/QTL
Broman and Sen (2009) note the lack of multiple-trait mapping methods in R/QTL, where two or more related phenotypes are jointly mapped. Also desirable but missing is the structural or causal analysis of multiple traits described by Li et al. (2006). With a touch of humor, Broman and Sen (2009, pp. 257–58) note the lack of Bayesian approaches to QTL mapping in R/QTL, suggesting that while these approaches are
attractive, they require considerable training to use. Multiple-QTL model construction after multiple imputation supports Haley-Knott interval mapping only.
19.2.1.4 Other Software
There are several other free and commercial QTL mapping software environments, such as QTL Cartographer (http://statgen.ncsu.edu/qtlcart/WQTLCart.htm). This software has a Windows graphical user interface and is also scriptable like R/QTL, although it does not run on top of a general-purpose statistical language like R. QTL Cartographer has tools for single-marker analysis, interval mapping, composite interval mapping, Bayesian interval mapping, multiple interval mapping (the extension of standard interval mapping by maximum likelihood to multiple-QTL models), multiple trait analysis, and categorical trait analysis. Brief online documentation is available (http://statgen.ncsu.edu/qtlcart/HTML/index.html) and support is offered by email. For online analysis of genetic reference populations, the website www.webqtl.org is recommended (Wang et al., 2003a; Rosen et al., 2007). For analysis of outbred populations, see, for example, Mott et al. (2000) and Johannesson et al. (2009). Both Karl Broman and Brian Yandell maintain directories of software for QTL mapping using rodents (e.g., see www.rqtl.org).
19.2.2 Segregating Crosses
19.2.2.1 Phenotyping
Successful crosses benefit from using mouse or rat inbred strains that differ in the phenotype of interest. Before firsthand assessment of the technical variance and heritability of a phenotype (www.nervenet.org/papers/shortcourse98.html), primary literature and online resources should be investigated. For the mouse, a large number of strain phenotypes are deposited at the Jackson Laboratory Mouse Phenome Database (www.jax.org/phenome; Grubb et al., 2009). Similar resources are available for the rat (http://rgd.mcw.edu; Twigger et al., 2006).
An important difference in phenotyping exists between segregating crosses and genetic reference populations. Repeated measurement of mice in a segregating cross can only reduce technical variance. Repeated measurement of mice from genetic reference populations can reduce technical variance (through measures on the same mouse) or environmental variance (through measures on different mice of the same genotype). While careful planning does allow multiple phenotypes to be collected for each mouse from an intercross or backcross (Solberg et al., 2006), genetic reference populations offer greater flexibility in obtaining multiple phenotypes.
19.2.2.2 Genotyping
Each individual mouse or rat from a segregating cross is genetically unique and requires genotyping at equally spaced intervals if it is to contribute genetic information to the mapping of QTLs. Fortunately, multiplex genotyping can be performed rapidly by high-throughput assays,
such as the Illumina Mouse LD (low-density, 377 loci) and Mouse MD (medium-density, 1449 loci) Linkage genotyping panels. Reagent cost per biological sample as of March 2010 is approximately $70 for the LD panel and $90 for the MD panel. For qualified NIH-funded projects, use of the Illumina Mouse MD Linkage panel can be free (www.nih.gov/science/models/mouse/resources/geno-service.html). For very high genetic resolution, appropriate in fine mapping (see below) and other applications, a JAX Mouse Diversity Genotyping Array is available with 620,000 SNPs (http://jaxservices.jax.org/mdarray/index.html) (Yang et al., 2009). From a cost-savings point of view, it should be emphasized that it is not necessary to genotype every cross progeny; selective genotyping can be considered when using segregating crosses in QTL mapping experiments (Sen et al., 2007, 2005).
In rats, although genotyping in segregating crosses is still performed using PCR of microsatellites and polyacrylamide gel separation, the basis for high-throughput assays is becoming available (Anegon, 2011; Jacob, 2010; STAR Consortium et al., 2008).
A highly important new development for mouse QTL mapping is an improved standard genetic map based on a large heterogeneous mouse population (Cox et al., 2009). When 78 mapped QTLs were re-examined using this “revised Shifman” map, 15 (19%) showed altered localization. The new map also improved concordance between mouse QTLs and human GWAS loci. Using the new map, Ackert-Bicknell et al. (2010) found that 26 of 28 human GWAS loci for bone mineral density coincided with mouse QTL support intervals, with 14 GWAS loci within 3 cM of a mouse QTL peak.
19.2.2.3 Mapping with Backcrosses and Intercrosses
The purpose of an experimental cross is to create genetic variation, so that we can study the association between genetic markers and the phenotype. Backcrosses and intercrosses are the two standard segregating crosses used to map QTLs.
In the backcross, two inbred strains are crossed to produce isogenic F1 mice that are heterozygous at all loci that differ between the two strains. F1 mice are crossed back to one of the inbred strains to produce backcross mice that are, at each such locus, either homozygous for the backcross parent genotype or heterozygous. In the intercross, following the F1 stage, two F1 mice are crossed to produce F2 intercross mice that are, at each locus, either homozygous for one of the two inbred strain genotypes or heterozygous. Although no more will be said about it here, to map the X chromosome, one has to keep track of the direction of the crosses with respect to sex (see Broman and Sen, 2009, §4.2). Following creation of a backcross or intercross population, phenotyping, and genotyping, a data set is generated that records columns of data for animal identification matched to one or more phenotypes, sex, covariates, and genotypes with associated chromosomal and centimorgan (cM) position information. Covariates are secondary measurements that may affect the phenotype of interest (in general, if we cannot experimentally control a covariate, then we measure subjects in different strata of the covariate, or at least measure the covariate). The general workflow using R/QTL as illustrated by Broman and Sen (2009) follows.
1. Format and import the data into R/QTL (or J/QTL, see above).
2. Perform various data quality checks, such as checking the distribution shape of the phenotype or the integrity or order of the genotype data.
3. Perform single-QTL analysis by interval mapping.
4. Determine genomewide significance by permutation.
5. Determine the confidence interval estimates of location.
6. Determine effect size estimates for significant QTL.
7. Investigate additive covariates or QTL × covariate interactions, possibly also using QTL markers as covariates (composite interval mapping) to enhance single-QTL mapping.
8. Perform a two-QTL epistasis scan.
9. Explore and construct a multiple-QTL model using multiple imputation combined with Haley-Knott interval mapping.
Some additional aspects of this workflow can be highlighted. First, the authors note that it is not uncommon to spend half of the intellectual effort of a project editing and cleaning one’s data. Second, some of the permutation procedures can take considerable time (days) and require processing on multiple CPUs. Third, an automated (stepwise) version of steps 3–9 above is implemented in R/QTL and demonstrated in two case studies at the end of Broman and Sen (2009).
19.2.3 Genetic Reference Populations
19.2.3.1 Phenotyping
As stated earlier, the distinguishing feature between segregating crosses and genetic reference populations (GRPs) is that genetic variation lies between individuals in segregating crosses and between isogenic lines or strains in GRPs. This allows greater flexibility in phenotyping, because the same genotype can be repeatedly sampled under the same conditions to reduce measurement error and environmental variance, or under different conditions to facilitate multivariate analysis. How many animals per strain are usefully measured has been addressed by Crusio (2004) and others.
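The benefit of replicate animals per strain can be made concrete with a standard expression for the heritability of strain means. This is a sketch under assumptions not stated in the chapter: $\sigma^2_G$ denotes the between-strain genetic variance, $\sigma^2_E$ the within-strain (environmental plus technical) variance, and $n$ the number of independent animals measured per strain.

```latex
\operatorname{Var}(\bar{y}_{\text{strain}}) = \sigma^2_G + \frac{\sigma^2_E}{n},
\qquad
h^2_{\bar{y}} = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_E / n}
```

With illustrative values $\sigma^2_G = 0.2$ and $\sigma^2_E = 0.8$, the heritability of strain means rises from 0.20 at $n = 1$ to 0.50 at $n = 4$ and about 0.67 at $n = 8$, so each additional animal per strain buys less than the previous one.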
A great advantage of GRPs is the possibility of using phenotypes from the literature (Philip et al., 2009; Wang et al., 2003a), but caution about environmental differences is appropriate (Wahlsten et al., 2006). Warehouses of GRP data are available at www.webqtl.org and at the JAX Mouse Phenome Database.
19.2.3.2 Genotyping
Although genotyping of segregating crosses has become much easier with high-throughput assays like those described above, once genotyping is established for a GRP, it is archival and need not be repeated (by contrast, every new segregating cross must be genotyped). For example, 15,260 SNPs were genotyped on 480 recombinant inbred lines and standard inbred strains by the Wellcome Trust Centre for Human Genetics (http://mus.well.ox.ac.uk/mouse/INBREDS).
19.2.3.3 Mapping with Recombinant Inbred Lines
Recombinant inbred lines (RILs) are created by inbreeding F2 intercross progeny for 20 or more generations. Each resulting RIL is an isogenic genetic mosaic of the parental inbred strain genomes. The genotype at any locus is homozygous for one or the other parent strain’s alleles. Because creating RILs is expensive and requires years of effort, QTL mapping with RILs typically uses available mapping populations. RIL populations are available for both mice and rats (e.g., Jirout et al., 2003; Peirce et al., 2004), but they are relatively few in number and small in size compared to the opportunities offered by segregating crosses. The advantages and disadvantages of using RILs in QTL mapping led to the proposal to create a larger panel of RILs using more than two parental strains (Churchill et al., 2004; Roberts et al., 2007; Chesler et al., 2008). Mapping with RILs is usually performed on the strain averages, and follows the same general progression described above for segregating crosses.
19.2.3.4 Mapping with Other GRPs
There are other types of GRPs besides RILs that can be used for initial QTL mapping. These are chromosome substitution strains (also called consomic lines), recombinant congenic lines, and standard inbred strain diversity panels. A chromosome substitution strain (CSS) is an inbred strain with one of its chromosomes replaced by the homologous chromosome of another inbred strain. For example, the C57BL/6J (host) × A/J (donor) CSSs are 22 in number, including one CSS for each autosome, the X and Y sex chromosomes, and the mitochondrial DNA (all available through the Jackson Laboratory). To map a QTL to a particular chromosome, a CSS is simply compared to the control (host) strain C57BL/6J (Hill et al., 2006; Shao et al., 2008). A recombinant congenic line is similar in concept to the chromosome substitution strain, except that a smaller donor region of a chromosome is present on the host background (Fortin et al., 2007).
The same, simple mapping principles apply, but with greater QTL mapping resolution. Finally, there is the possibility of directly mapping QTL using a large panel of inbred strains. This approach, called haplotype association mapping (HAM), has been advocated for both initial mapping and fine mapping, but has also been controversial (Tsaih and Korstanje, 2009). Nevertheless, some experimental success has been achieved using HAM (e.g., Burgess-Herbert et al., 2009). 19.2.4
Experimental Design and Statistical Power
19.2.4.1 Experimental Design
Good experimental design balances pragmatism against the scientific goals of a study and often requires extensive experience. Broman and Sen (2009) review the experimental design choices related to QTL mapping, including the relative strengths and weaknesses of the backcross, the intercross, and RILs, along with some underlying theory. The intercross is the most versatile cross, able to detect both additive and dominance effects. The backcross may be more powerful for
dominance effects, but only if the right backcross is chosen. RILs are theoretically the most powerful choice to detect a purely additive locus, but only if enough lines are available. RILs are incapable of detecting overdominant loci that lack additive effects. Given their different merits, it is a reasonable strategy to use more than one cross type.
19.2.4.2 Statistical Power
The R package “qtlDesign” (Sen et al., 2007) can be used to explore the statistical power associated with different experimental design choices. For example, Broman and Sen (2009) use this package to show equivalent power to detect an additive QTL of a given effect size using an intercross of 162 animals, a backcross of 247 animals, or 42 RILs with four replicate mice per line. Whereas the choice of cross type and its size can influence the power to detect a QTL, selective genotyping may reduce experimental cost. Broman and Sen (2009) note that this is generally useful only when a single specific phenotype is of interest (because selective genotyping is done on the phenotypic extremes) and when the cost of raising and phenotyping mice is much less than the cost of genotyping. Finally, the R functions provided by Sen et al. (2007) are approximations that hold under a set of assumptions. Sen et al. (2007) also provide tools to simulate experimental design choices, and although this approach may be more difficult, it may also prove more accurate (Broman and Sen, 2009).
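To make the idea of simulating design choices concrete, the following is a minimal Monte Carlo sketch in Python. It is not qtlDesign and does not use its functions. It assumes a single, fully informative marker sitting on a purely additive QTL, normally distributed residuals, no polygenic background, and an arbitrary LOD threshold of 3.0. The function names, the effect size `a`, and the threshold are illustrative assumptions. Because of these simplifications, it should not be expected to reproduce the sample sizes quoted from Broman and Sen (2009).

```python
import math
import random

# Donor-allele counts and their probabilities at the QTL for each cross type.
GENOTYPES = {
    "backcross": ([0, 1], [0.5, 0.5]),
    "intercross": ([0, 1, 2], [0.25, 0.5, 0.25]),
    "ril": ([0, 2], [0.5, 0.5]),
}

def lod_score(x, y):
    """LOD for a linear regression of phenotype y on allele count x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # no genotype or phenotype variation: nothing to test
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -(n / 2) * math.log10(1.0 - r2)

def simulate_power(cross, n, a, sigma=1.0, reps=1,
                   lod_threshold=3.0, n_sim=300, seed=1):
    """Fraction of simulated experiments whose LOD reaches the threshold.

    n is the number of animals (or RIL lines); reps is the number of
    animals averaged per unit (use reps > 1 for RIL strain means).
    """
    rng = random.Random(seed)
    alleles, probs = GENOTYPES[cross]
    sd = sigma / math.sqrt(reps)  # residual SD of a mean of `reps` animals
    hits = 0
    for _ in range(n_sim):
        x = rng.choices(alleles, weights=probs, k=n)
        y = [a * xi + rng.gauss(0.0, sd) for xi in x]
        if lod_score(x, y) >= lod_threshold:
            hits += 1
    return hits / n_sim

# Compare three designs for one hypothetical additive effect size.
for cross, n, reps in [("intercross", 162, 1), ("backcross", 247, 1), ("ril", 42, 4)]:
    print(cross, n, simulate_power(cross, n, a=0.3, reps=reps))
```

A more realistic simulation would add marker spacing, background genetic variance among RILs, and a permutation-derived genomewide threshold. Each of these changes the relative ranking of the designs.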
19.3 FINE MAPPING QTL
The methods described earlier are sufficient to localize QTL to an interval of 10 cM or more. To further reduce the support interval of a QTL, additional fine-mapping methods are required. Because these methods involve considerable expense and effort, independent confirmation of a mapped QTL is desirable before advancing to fine mapping (Abiola et al., 2003). Approaches to fine mapping a QTL will generally result in a 1–5 cM support interval containing 20–100 candidate genes. Although many approaches are touched on briefly below, congenics and advanced intercrosses are popular and proven methods.
19.3.1
Selective Phenotyping and Recombinant Progeny Testing
Once a QTL is initially mapped using an intercross or backcross, additional progeny can be bred, tail snipped, and genotyped within the QTL support interval. Animals recombinant within the interval can then be selectively phenotyped, allowing the QTL to be confirmed and its mapping interval reduced. Alternatively, animals that have a recombinant chromosome within a QTL support interval can be backcrossed to a parental strain to determine the location of the QTL relative to the recombination point, a procedure called recombinant progeny testing (Darvasi, 1998).
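The selection step described above can be sketched in a few lines of Python. This is an illustrative sketch only: the marker names, animal IDs, and genotype codes ("A" for homozygous, "H" for heterozygous, as in a backcross) are hypothetical, and an animal is flagged simply when its calls at the two markers flanking the support interval differ. That rule misses double recombinants within the interval.

```python
def recombinants_in_interval(genotypes, left_marker, right_marker):
    """Return IDs of animals whose calls differ at the two flanking markers."""
    selected = []
    for animal_id, calls in genotypes.items():
        left = calls.get(left_marker)
        right = calls.get(right_marker)
        if left is None or right is None:
            continue  # skip animals with a missing call at either flank
        if left != right:
            selected.append(animal_id)
    return selected

# Hypothetical backcross progeny genotyped at the flanking markers.
progeny = {
    "m01": {"mk_left": "A", "mk_right": "A"},
    "m02": {"mk_left": "A", "mk_right": "H"},
    "m03": {"mk_left": "H", "mk_right": "H"},
    "m04": {"mk_left": "H", "mk_right": "A"},
    "m05": {"mk_left": "A"},  # right flank not genotyped
}

to_phenotype = recombinants_in_interval(progeny, "mk_left", "mk_right")
print(to_phenotype)
```

Only the flagged animals would then be phenotyped, or backcrossed for recombinant progeny testing.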
19.3.2
Congenics
Congenic animals contain a defined chromosomal segment from a donor inbred strain on the background of a host inbred strain. Congenics are created by repeatedly backcrossing one inbred strain onto another with selection for a particular marker from the donor strain. When additional selection for the genetic background is performed with markers spanning the genome, speed congenics can be created in 5 generations (15–19 months) rather than the 10 generations (30–36 months) required for traditional congenics. The Jackson Laboratory provides full services for speed congenic creation ($25,000) or partial services for genotyping and pairing advice ($7,000–9,000). A variation on this approach, called interval-specific congenics, has been described by Darvasi (1997).
19.3.3
Advanced Intercrosses
Advanced intercross (AI) animals are created by breeding beyond the F2 stage. Because mapping resolution depends on recombination density, each additional generation in an advanced intercross introduces additional recombination and reduces the support interval of a mapped QTL. For example, after eight additional breeding generations a fivefold reduction in the QTL confidence interval is expected (Darvasi and Soller, 1995). Several groups have used the advanced intercross effectively (Fawcett et al., 2009; Iraqi et al., 2000; Wang et al., 2003b), but some care is required in the breeding of the population (Schmitt et al., 2009) and the analysis (Peirce et al., 2008; Valdar et al., 2009). 19.3.4
Heterogeneous Stock and Outbred Mice
Heterogeneous stock (HS) mouse and rat populations were each separately created from eight inbred strains. HS animals have both high recombination density (50–60+ generations) and genetic diversity (8 strains). The use of these populations for high-resolution genomewide mapping has been successfully pioneered by Flint and colleagues (Valdar et al., 2006, 2009; Johannesson et al., 2009; Huang et al., 2009; Woods et al., 2010). In each of these applications, careful attention was paid to the family structure of this highly recombinant population. HS mouse and rat populations are a very promising resource for fine mapping QTLs. Other outbred stocks, such as MF1 mice (Ghazalpour et al., 2008; Yalcin et al., 2004), have also been used to fine map QTL. The use of HS and other outbred stocks is a rapidly developing area of interest in rodent complex trait genetics (Aldinger et al., 2009).
19.3.5
Recombinant Inbred Segregation Tests
To perform a recombinant inbred segregation test (RIST) (Darvasi, 1998), recombinant inbred lines that have a recombination event within a QTL support interval are crossed to both parental lines. In one of the crosses, the
QTL will segregate and in the other it will not, providing a test of whether the QTL is above or below the recombination event. Because the RIST is limited to available RILs, yin-yang crosses try to generalize the RIST by treating available inbred strains as RILs and allowing a greater pool of possible recombination events to choose from (Flint et al., 2005). 19.3.6
Haplotype Association Mapping
HAM (also called in silico mapping) looks for associations between the phenotype and haplotypes of mouse inbred strains, treating inbred strains as individuals. Although the method was initially regarded as suspect, experimental validation and careful consideration of potential pitfalls have led to continued interest in the approach (e.g., Burgess-Herbert et al., 2009). Tsaih and Korstanje (2009) provide an excellent and current review of HAM methods in mice.
19.3.7
Multiple Cross Mapping and Combining Crosses
In some cases it may be possible to combine QTL mapping crosses by assuming the same locus is segregating in each cross. While this approach can reduce the QTL support interval considerably (Hitzemann et al., 2003), the assumption is safest with crosses of the same parental strains (Peirce et al., 2007; Malmanger et al., 2006).
19.3.8 Populations on the Horizon: Diversity Outbred and Collaborative Cross Mice
Two large-scale QTL mapping populations are being developed that will greatly increase the power of complex trait genetics in mice: collaborative cross (CC) mice and diversity outbred (DO) mice (http://jaxmice.jax.org/jaxnotes/514/514b.html). CC mice are a large panel of recombinant inbred lines derived from an eight-way cross of standard inbred strains and wild-derived strains (Chesler et al., 2008; Roberts et al., 2007). DO mice are outbred mice derived from the founder breeding mice used to create the collaborative cross. The combination of these two populations, perhaps combined with RIST methods (Flint et al., 2005) and the dense genotyping offered by the JAX Mouse Diversity Genotyping Array, should enable researchers to rapidly map genetic loci at high resolution and identify individual genes involved in disease. The expected availability of DO and CC mice is 2010 and 2012, respectively.
19.4 CONFIRMATION OF A QTL GENE
19.4.1
What Is Required to Claim a Gene?
A consortium of authors (Abiola et al., 2003) presented a community white paper describing the evidence needed to identify a candidate gene for a
mapped QTL. A predominance of evidence, including more than one of the following types, should suffice when supported by peer review.
• Polymorphism in coding or regulatory regions.
• Gene function related to the mapped trait.
• In vitro functional studies.
• Transgenesis (BAC lines).
• Knockins.
• QTL-knockout interaction test.
• Mutational analysis.
• Homology searches.
To gain a sense of how these guidelines are used, the reader is directed to outstanding reviews of success stories in mice (Flint et al., 2005) and rats (Aitman et al., 2010). Tables 1 and 2 in Flint et al. (2005) and Table 2.1 in Aitman et al. (2010) list cloned quantitative trait loci along with the approaches used. The most noticeable difference between the mouse and rat QTL endgame is the lack of large numbers of available rat knockouts. Aitman et al. (2010) do describe new developments in rat genetic engineering, and the overall outlook for using rats in QTL mapping is extremely promising, given the long history of this excellent animal model of physiology and pharmacology; for additional papers on rat genomics see Anegon (2011).
19.5 BIOINFORMATICS, SYSTEMS GENETICS, AND NETWORKS
Perhaps the most remarkable change in rodent genomics over the last decade, and more recently in human genetics, has been the transformation to systems approaches and perspectives (Sieberts and Schadt, 2007; Cookson et al., 2009). The combination of genome mapping with the ability to monitor the transcriptome has allowed both gene expression mapping to discover eQTL, which can very quickly nominate candidate genes for classical trait QTL (Lu et al., 2008), and grander schemes for delineating expression networks causally affected by DNA polymorphism and predictive of disease. Two studies illustrate such outstanding scientific achievement (Chen et al., 2008; Emilsson et al., 2008). Chen et al. (2008) used a large F2 intercross to define a liver and adipose tissue gene expression network by correlations with a suite of traits related to metabolic disorders and used this network to define novel obesity genes. Emilsson et al. (2008) created an adipose gene expression network from a large sample of human volunteers and discovered significant overlap with the mouse network of Chen and colleagues.
Because the mouse network was predictive of obesity genes, the human network was examined for its relation to obesity. First, Emilsson et al. (2008) found that the human network was
robustly correlated with body mass index (BMI) across subjects. Second, and as a predictive test, Emilsson et al. (2008) genotyped a collection of SNPs from the vicinity of each gene in the human network and found a significant collective association with BMI in a large independent group of humans. Although the brevity of this chapter limits detailed discussion of systems genetics in mice and rats, it is worth pointing out two recently published practical reviews: one on the importance and utility of reproducible bioinformatic workflows in support of eQTL studies (Fisher et al., 2009), and one on how to actually do eQTL mapping in mice or rats (Tesson and Jansen, 2009). QTL and combined eQTL studies make use of very large data sets (that potentially change over time and build). The analysis often involves bioinformatic services and specialized programs that are only loosely connected by hyperlinks. As Fisher et al. (2009) note, researchers can quickly become overwhelmed. As one solution, researchers should know about workflow systems. Fisher et al. (2009) outline the use of one such workflow system, Taverna (www.taverna.org.uk), to support discovery of classical trait QTL candidate genes using gene expression data sets. Perhaps most useful to those considering eQTL analysis in mice or rats is the chapter by Tesson and Jansen (2009), who provide a detailed step-by-step guide to performing genomewide linkage analysis in an eQTL mapping experiment by using the R statistical framework. Tesson and Jansen (2009) provide a literal computational protocol that more or less (depending on your R skill!) demystifies the process.
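For readers who want a feel for the core computation before turning to a full protocol, here is a minimal Python sketch of a marker-by-marker scan for expression traits. It is not the Tesson and Jansen (2009) protocol and it is not R. It assumes strain-mean expression values, complete genotypes at each marker, and a simple comparison of a per-genotype-mean model against a single-mean model. The marker names, transcript names, and data values are hypothetical.

```python
import math

def lod_by_genotype(genos, values):
    """LOD comparing a per-genotype-mean model with a single-mean model."""
    n = len(values)
    grand = sum(values) / n
    rss0 = sum((v - grand) ** 2 for v in values)
    if rss0 == 0:
        return 0.0  # no expression variation to explain
    groups = {}
    for g, v in zip(genos, values):
        groups.setdefault(g, []).append(v)
    rss1 = 0.0
    for vals in groups.values():
        m = sum(vals) / len(vals)
        rss1 += sum((v - m) ** 2 for v in vals)
    rss1 = max(rss1, 1e-12 * rss0)  # guard against log of zero on perfect fits
    return (n / 2) * math.log10(rss0 / rss1)

def peak_eqtl(marker_genos, expression):
    """For each transcript, return (best marker, LOD at that marker)."""
    peaks = {}
    for transcript, values in expression.items():
        lods = {m: lod_by_genotype(g, values) for m, g in marker_genos.items()}
        best = max(lods, key=lods.get)
        peaks[transcript] = (best, lods[best])
    return peaks

# Hypothetical genotypes for eight RI strains at two markers ("B" or "D" allele).
marker_genos = {
    "mkA": ["B", "B", "B", "B", "D", "D", "D", "D"],
    "mkB": ["B", "D", "B", "D", "B", "D", "B", "D"],
}
# Hypothetical strain-mean expression for two transcripts.
expression = {
    "tx1": [5.0, 5.2, 4.9, 5.1, 7.0, 7.1, 6.9, 7.2],
    "tx2": [3.0, 4.1, 2.9, 4.0, 3.1, 3.9, 3.0, 4.2],
}

for transcript, (marker, lod) in peak_eqtl(marker_genos, expression).items():
    print(transcript, marker, round(lod, 2))
```

In a real analysis the peak LOD for each transcript would be judged against a permutation-based genomewide threshold, and the scan would cover thousands of transcripts and markers.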
19.6 PHARMACOGENOMICS AND DYNAMIC PHENOTYPING
We began this chapter by acknowledging how the use of QTL mapping with mice and rats can complement the study of human genetics, especially with phenotypes that cannot be feasibly or ethically obtained. We now return to where we started by discussing a novel approach to pharmacogenomics. Use of mice and rats can elucidate novel drug targets as well as increase understanding of undesirable or toxic properties of current drugs, and thus contribute to the goals of personalized medicine (Harrill et al., 2009; Lum et al., 2009). We suggest that functional QTL mapping of drug dose-response is best applied to genetic reference populations of mice or rats. Functional mapping of QTLs uses biologically motivated nonlinear mathematical models (e.g., the four-parameter logistic dose-response function; Kenakin, 2009; Motulsky and Christopoulos, 2004) to make associations between genotype and phenotype in dynamically patterned data (Gong et al., 2004; Wu and Lin, 2006). Mapping QTL for drug effects in mice typically employs a single dose level. Because drug effects are dose dependent and heritable, choosing a single dose is not an optimal design for the study of multiple genetic backgrounds. Failure to accommodate dose-response can result in reduced signal or interpretive
error. For example (Fig. 19.1, top left), if a QTL affects the maximum response, choosing too low a dose will underestimate the maximum and reduce statistical power. Choosing too high a dose runs the risk of introducing confounding toxicity responses or altered phenotypes. If a QTL affects the ED50 (Fig. 19.1, top right), then not only is it possible to miss the optimal dose and reduce statistical power, but choosing an optimum single dose will not discriminate the differences as a simple right-shift versus a change in maximum. Functional mapping of dose-response QTL avoids these problems by estimating the full curve for each independent genotype. Functional mapping of dose-response QTL is enhanced by using isogenic lines of mice that make up a genetic reference population. Independent mice of an identical genotype can be exposed to different drug doses to achieve genotype specific curves, while also allowing invasive phenotypes. This assumes that environmental similarity is maintained aside from the differences in drug dose level. Shown in the middle left and right panels of Figure 19.1 are two demonstrations of significantly different dose-response profiles from common inbred (isogenic) strains of mice. On the left are three hyperactivity profiles in the open field test chamber in response to a drug of abuse, MDMA (Ecstasy), using two independent mice per dose level per strain (unpublished data). On the right are the dose-response profiles for two isogenic strains of mice, using three independent mice per dose level, for the head twitch response, a behavioral response to the 5-HT2A/5-HT2C receptor agonist and hallucinogen DOI (Canal et al., 2010). Means ± 1 SEM error bars are indicated. Two approaches to functional mapping of dose-response QTL using genetic reference populations of mice are shown in the bottom of Figure 19.1. The simplest approach (bottom left) is to use chromosome substitution strains. 
To locate QTL to a chromosome, each CSS is compared to a reference or control parental strain using nonlinear regression. A potentially more powerful, but also more complex, approach is the use of recombinant inbred lines, such as the BXD RILs, derived from a C57BL/6J × DBA/2J intercross (bottom right). BXD RIL dose-response QTL analysis requires grouping dose-response curves for a set of RILs by genotype at each of many markers along each chromosome and testing for a difference in curve shape by genotype. This approach requires nonlinear mixed model methods with random effects for RILs nested in genotype. The use of genetic reference populations of mice is key to enabling this approach. Genetic variation is between strains and lines rather than between individual animals in these groups of mice. A comparable approach in segregating populations, such as an F2 intercross or human volunteers, would require multiple dosing of the same individual, which poses potentially insurmountable feasibility and ethical concerns. The use of animals also allows collection of invasive phenotypes and terminal endpoint data. Although Figure 19.1 outlines pharmacodynamic dose-response modeling, similar approaches can be developed for pharmacokinetic questions, or for combined pharmacokinetic-pharmacodynamic models.
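The single-dose problem discussed in this section can be illustrated with the four-parameter logistic function itself. The Python sketch below is illustrative only: the three parameter sets are hypothetical values chosen for the example, not estimates from any data in this chapter, and no curve fitting is performed. It shows that a genotype with a right-shifted ED50 and a genotype with a reduced maximum can give the same response at one dose while differing clearly across the full dose range.

```python
def four_pl(dose, bottom, top, ed50, hill):
    """Four-parameter logistic response at a positive dose."""
    return bottom + (top - bottom) / (1.0 + (ed50 / dose) ** hill)

# Hypothetical parameter sets for three genotypes (illustrative values only).
reference = dict(bottom=0.0, top=100.0, ed50=1.0, hill=1.0)
right_shift = dict(bottom=0.0, top=100.0, ed50=3.0, hill=1.0)  # QTL alters ED50
lower_max = dict(bottom=0.0, top=50.0, ed50=1.0, hill=1.0)     # QTL alters maximum

doses = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
for name, params in [("reference", reference),
                     ("right_shift", right_shift),
                     ("lower_max", lower_max)]:
    curve = [round(four_pl(d, **params), 1) for d in doses]
    print(name, curve)

# At the single dose 1.0 the two variant genotypes are indistinguishable,
# although both differ from the reference genotype.
single_dose = 1.0
print(four_pl(single_dose, **right_shift), four_pl(single_dose, **lower_max))
```

With these values, both variant genotypes give a response of 25 at a dose of 1.0, so a single-dose design would detect a difference from the reference but could not say which curve parameter the QTL affects.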
Figure 19.1. Functional mapping of dose-response in genetic reference populations of mice. See text for details.
19.7 REFERENCES Abiola O, Angel JM, Avner P, Bachmanov AA, Belknap JK, Bennett B, Blankenhorn EP, Blizard DA, Bolivar V, Brockmann GA, Buck KJ, Bureau JF, Casley WL, Chesler EJ, Cheverud JM, Churchill GA, Cook M, Crabbe JC, Crusio WE, Darvasi A, de Haan G, Dermant P, Doerge RW, Elliot RW, Farber CR, Flaherty L, Flint J, Gershenfeld H, Gibson JP, Gu J, Gu W, Himmelbauer H, Hitzemann R, Hsu HC, Hunter K, Iraqi FF, Jansen RC, Johnson TE, Jones BC, Kempermann G, Lammert F, Lu L, Manly KF, Matthews DB, Medrano JF, Mehrabian M, Mittlemann G, Mock BA, Mogil JS, Montagutelli X, Morahan G, Mountz JD, Nagase H, Nowakowski RS, O’Hara BF, Osadchuk AV, Paigen B, Palmer AA, Peirce JL, Pomp D, Rosemann M, Rosen GD, Schalkwyk LC, Seltzer Z, Settle S, Shimomura K, Shou S, Sikela JM, Siracusa LD, Spearow JL, Teuscher C, Threadgill DW, Toth LA, Toye AA, Vadasz C, Van Zant G, Wakeland E, Williams RW, Zhang HG, Zou F; Complex Trait Consortium. (2003). The nature and identification of quantitative trait loci: a community’s view. Nat Rev Genet 4(11):911–16. Ackert-Bicknell CL, Karasik D, Li Q, Smith RV, Hsu YH, Churchill GA, Paigen BJ, Tsaih SW. (2010). Mouse BMD quantitative trait loci show improved concordance with human genome wide association loci when recalculated on a new, common mouse genetic map. J Bone Miner Res Epub ahead of print, February 23. Aitman TJ, Petretto E, Behmoaras J. (2010). Genetic mapping and positional cloning. Meth Mol Biol 597:13–32. Aldinger KA, Sokolo G, Rosenberg DM, Palmer AA, Millen KJ. (2009). Genetic variation and population substructure in outbred CD-1 mice: implications for genomewide association studies. PLoS One 4(3):e4729. Anegon I, ed. (2011). Rat Genomics. Springer, New York. Broman KW. (2001). Review of statistical methods for QTL mapping in experimental crosses. Lab Anim (NY), 30(7):44–52. Broman K, Sen S. (2009). A Guide to QTL Mapping with R/QTL. Statistics for Biology and Health, vol. 2848. Springer, New York. 
Broman KW, Wu H, Sen S, Churchill GA. (2003). R/QTL: QTL mapping in experimental crosses. Bioinformatics 19(7):889–90. Burgess-Herbert SL, Tsaih S-W, Stylianou IM, Walsh K, Cox AJ, Paigen B. (2009). An experimental assessment of in silico haplotype association mapping in laboratory mice. BMC Genet 10:81. Canal CE, Olaghere da Silva UB, Gresch PJ, Watt EE, Sanders-Bush E, Airey DC. (2010). The serotonin 2C receptor potently modulates the head-twitch response in mice induced by a phenethylamine hallucinogen. Psychopharmacology (Berl) 209(2):163–74. Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, Leonardson A, Castellini LW, Wang S, Champy MF, Zhang B, Emilsson V, Doss S, Ghazalpour A, Horvath S, Drake TA, Lusis AJ, Schadt EE. (2008). Variations in DNA elucidate molecular networks that cause disease. Nature 452(7186):429–35. Chesler EJ, Miller DR, Branstetter LR, Galloway LD, Jackson BL, Philip VM, Voy BH, Culiat CT, Threadgill DW, Williams RW, Churchill GA, Johnson DK, Manly KF.
(2008). The collaborative cross at Oak Ridge National Laboratory: developing a powerful resource for systems genetics. Mamm Genome 19(6):382–89. Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, Beatty J, Beavis WD, Belknap JK, Bennett B, Berrettini W, Bleich A, Bogue M, Broman KW, Buck KJ, Buckler E, Burmeister M, Chesler EJ, Cheverud JM, Clapcote S, Cook MN, Cox RD, Crabbe JC, Crusio WE, Darvasi A, Deschepper CF, Doerge RW, Farber CR, Forejt J, Gaile D, Garlow SJ, Geiger H, Gershenfeld H, Gordon T, Gu J, Gu W, de Haan G, Hayes NL, Heller C, Himmelbauer H, Hitzemann R, Hunter K, Hsu HC, Iraqi FA, Ivandic B, Jacob HJ, Jansen RC, Jepsen KJ, Johnson DK, Johnson TE, Kempermann G, Kendziorski C, Kotb M, Kooy RF, Llamas B, Lammert F, Lassalle JM, Lowenstein PR, Lu L, Lusis A, Manly KF, Marcucio R, Matthews D, Medrano JF, Miller DR, Mittleman G, Mock BA, Mogil JS, Montagutelli X, Morahan G, Morris DG, Mott R, Nadeau JH, Nagase H, Nowakowski RS, O’Hara BF, Osadchuk AV, Page GP, Paigen B, Paigen K, Palmer AA, Pan HJ, Peltonen-Palotie L, Peirce J, Pomp D, Pravenec M, Prows DR, Qi Z, Reeves RH, Roder J, Rosen GD, Schadt EE, Schalkwyk LC, Seltzer Z, Shimomura K, Shou S, Sillanpää MJ, Siracusa LD, Snoeck HW, Spearow JL, Svenson K, Tarantino LM, Threadgill D, Toth LA, Valdar W, de Villena FP, Warden C, Whatley S, Williams RW, Wiltshire T, Yi N, Zhang D, Zhang M, Zou F; Complex Trait Consortium. (2004). The collaborative cross, a community resource for the genetic analysis of complex traits. Nat Genet 36(11): 1133–137. Cookson W, Liang L, Abecasis G, Moffatt M, Lathrop M. (2009). Mapping complex disease traits with global gene expression. Nature Reviews Genetics (PMID: 19223927) March Vol. 10: 184–94. Cox A, Ackert-Bicknell CL, Dumont BL, Ding Y, Bell JT, Brockmann GA, Wergedal JE, Bult C, Paigen B, Flint J, Tsaih SW, Churchill GA, Broman KW. (2009). A new standard genetic map for the laboratory mouse. Genetics 182(4):1335–44. Crusio WE. (2004). 
A note on the effect of within-strain sample sizes on QTL mapping in recombinant inbred strain studies. Genes Brain Behav 3(4):249–51. Darvasi A. (1998). Experimental strategies for the genetic dissection of complex traits in animal models. Nat Genet 18(1):19–24. Darvasi A. (1997). Interval-specific congenic strains (ISCS): an experimental design for mapping a QTL into a 1-centimorgan interval. Mamm Genome 8(3):163–67. Darvasi A, Soller M. (1995). Advanced intercross lines, an experimental population for fine genetic mapping. Genetics 141(3):1199–207. Emilsson V, Thorleifsson G, Zhang B, Leonardson AS, Zink F, Zhu J, Carlson S, Helgason A, Walters GB, Gunnarsdottir S, Mouy M, Steinthorsdottir V, Eiriksdottir GH, Bjornsdottir G, Reynisdottir I, Gudbjartsson D, Helgadottir A, Jonasdottir A, Jonasdottir A, Styrkarsdottir U, Gretarsdottir S, Magnusson KP, Stefansson H, Fossdal R, Kristjansson K, Gislason HG, Stefansson T, Leifsson BG, Thorsteinsdottir U, Lamb JR, Gulcher JR, Reitman ML, Kong A, Schadt EE, Stefansson K. (2008). Genetics of gene expression and its effect on disease. Nature 452(7186):423–28. Fawcett GL, Jarvis JP, Roseman CC, Wang B, Wolf JB, Cheverud JM. (2009). Finemapping of obesity-related quantitative trait loci in an F(9/10) advanced intercross line. Obesity 18(7):1383–92.
Fisher P, Noyes H, Kemp S, Stevens R, Brass A. (2009). A systematic strategy for the discovery of candidate genes responsible for phenotypic variation. Meth Mol Biol 573:329–45. Flint J, Valdar W, Shifman S, Mott R. (2005). Strategies for mapping and cloning quantitative trait genes in rodents. Nat Rev Genet 6(4):271–86. Fortin A, Diez E, Henderson JE, Mogil JS, Gros P, Skamene E. (2007). The AcB/BcA recombinant congenic strains of mice: strategies for phenotype dissection, mapping and cloning of quantitative trait genes. Novartis Found Symp 281:141–53 (discussion 153–55, 208–09). Ghazalpour A, Doss S, Kang H, Farber C, Wen PZ, Brozell A, Castellanos R, Eskin E, Smith DJ, Drake TA, Lusis AJ. (2008). High-resolution mapping of gene expression using association in an outbred mouse stock. PLoS Genet 4(8):e1000149. Gong Y, Wang Z, Liu T, Zhao W, Zhu Y, Johnson JA, Wu R. (2004). A statistical model for functional mapping of quantitative trait loci regulating drug response. Pharmacogenomics J 4(5):315–21. Grubb SC, Maddatu TP, Bult CJ, and Bogue MA. (2009). Mouse phenome database. Nucl Acids Res 37:D720–30. Harrill AH, Ross PK, Gatti DM, Threadgill DW, Rusyn I. (2009). Population-based discovery of toxicogenomics biomarkers for hepatotoxicity using a laboratory strain diversity panel. Toxicol Sci 110(1):235–43. Hill AE, Lander ES, Nadeau JH. (2006). Chromosome substitution strains: a new way to study genetically complex traits. Meth Mol Med 128:153–72. Hitzemann R, Malmanger B, Reed C, Lawler M, Hitzemann B, Coulombe S, Buck K, Rademacher B, Walter N, Polyakov Y, Sikela J, Gensler B, Burgers S, Williams RW, Manly K, Flint J, Talbot C. (2003). A strategy for the integration of QTL, gene expression, and sequence analyses. Mamm Genome 14(11):733–47. Huang GJ, Shifman S, Valdar W, Johannesson M, Yalcin B, Taylor MS, Taylor JM, Mott R, Flint J. (2009). High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues. Genome Res 19(6):1133–40. 
Iraqi F, Clapcott SJ, Kumari P, Haley CS, Kemp SJ, Teale AJ. (2000). Fine mapping of trypanosomiasis resistance loci in murine advanced intercross lines. Mamm Genome 11(8):645–48. Jacob HJ. (2010). The rat: a model used in biomedical research. Meth Mol Biol 597:1–11. Jirout M, Krenová D, Kren V, Breen L, Pravenec M, Schork NJ, Printz MP. (2003). A new framework marker-based linkage map and SDPs for the rat HXB/BXH strain set. Mamm Genome 14(8):537–46. Johannesson M, Lopez-Aumatell R, Stridh P, Diez M, Tuncel J, Blázquez G, MartinezMembrives E, Cañete T, Vicens-Costa E, Graham D, Copley RR, Hernandez-Pliego P, Beyeen AD, Ockinger J, Fernández-Santamaría C, Gulko PS, Brenner M, Tobeña A, Guitart-Masip M, Giménez-Llort L, Dominiczak A, Holmdahl R, Gauguier D, Olsson T, Mott R, Valdar W, Redei EE, Fernández-Teruel A, Flint J. (2009). A resource for the simultaneous high-resolution mapping of multiple quantitative trait loci in rats: the NIH heterogeneous stock. Genome Res 19(1):150–58. Kenakin TP. (2009). A Pharmacology Primer: Theory, Applications, and Methods. 3rd ed. Academic Press/Elsevier, Amsterdam.
Li R, Tsaih SW, Shockley K, Stylianou IM, Wergedal J, Paigen B, Churchill GA. (2006). Structural model analysis of multiple quantitative traits. PLoS Genet 2(7):e114. Lu L, Wei L, Peirce JL, Wang X, Zhou J, Homayouni R, Williams RW, Airey DC. (2008). Using gene expression databases for classical trait QTL candidate gene discovery in the BXD recombinant inbred genetic reference population: mouse forebrain weight. BMC Genomics 9:444. Lum PY, Derry JMJ, Schadt EE. (2009). Integrative genomics and drug development. Pharmacogenomics 10(2):203–12. Lynch M, Walsh B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA. Malmanger B, Lawler M, Coulombe S, Murray R, Cooper S, Polyakov Y, Belknap J, Hitzemann R. (2006). Further studies on using multiple-cross mapping (MCM) to map quantitative trait loci. Mamm Genome 17(12):1193–204. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. (2009). Finding the missing heritability of complex diseases. Nature 461(7265): 747–53. Mott R, Talbot CJ, Turri MG, Collins AC, Flint J. (2000). A method for fine mapping quantitative trait loci in outbred animal stocks. Proc Natl Acad Sci U S A 97(23): 12649–54. Motulsky H Christopoulos A. (2004). Fitting Models to Biological Data Using Linear and Nonlinear Regression: A Practical Guide to Curve Fitting. Oxford University Press, Oxford, UK. Peirce JL, Broman KW, Lu L, Chesler EJ, Zhou G, Airey DC, Birmingham AE, Williams RW. (2008). Genome reshuffling for advanced intercross permutation (GRAIP): simulation and permutation for advanced intercross population analysis. PLoS One 3(4):e1977. Peirce JL, Broman KW, Lu L, Williams RW. (2007). 
A simple method for combining genetic mapping data from multiple crosses and experimental designs. PLoS One 2(10):e1036. Peirce JL, Lu L, Gu J, Silver LM, Williams RW. (2004). A new set of BXD recombinant inbred lines from advanced intercross populations in mice. BMC Genet 5:7. Philip VM, Duvvuru S, Gomero B, Ansah TA, Blaha CD, Cook MN, Hamre KM, Lariviere WR, Matthews DB, Mittleman G, Goldowitz D, Chesler EJ. (2009). Highthroughput behavioral phenotyping in the expanded panel of BXD recombinant inbred strains. Genes Brain Behav. Epub ahead of print, September 22. Roberts A, Pardo-Manuel de Villena F, Wang W, McMillan L, Threadgill DW. (2007). The polymorphism architecture of mouse genetic resources elucidated using genome-wide resequencing data: implications for qtl discovery and systems genetics. Mamm Genome 18(6–7):473–81. Rosen GD, Chesler EJ, Manly KF, Williams RW. (2007). An informatics approach to systems neurogenetics. Meth Mol Biol 401:287–303. Schmitt AO, Bortfeldt R, Neuschl C, Brockmann GA. (2009). RANDOMATE: a program for the generation of random mating schemes for small laboratory animals. Mamm Genome 20(5):321–25.
Sen S, Satagopan JM, Broman KW, Churchill GA. (2007). R/QTLDESIGN: inbred line cross experimental design. Mamm Genome 18(2):87–93. Sen S, Satagopan JM, Churchill GA. (2005). Quantitative trait locus study design from an information perspective. Genetics 170(1):447–64. Shao H, Burrage LC, Sinasac DS, Hill AE, Ernest SR, O’Brien W, Courtland HW, Jepsen KJ, Kirby A, Kulbokas EJ, Daly MJ, Broman KW, Lander ES, Nadeau JH. (2008). Genetic architecture of complex traits: large phenotypic effects and pervasive epistasis. Proc Natl Acad Sci U S A 105(50):19910–14. Sieberts SK, Schadt EE. (2007). Moving toward a system genetics view of disease. Mamm Genome 18(6–7):389–401. Siegmund D, Yakir B. (2007). The Statistics of Gene Mapping. Springer, New York. Smith R, Sheppard K, DiPetrillo K, Churchill G. (2009). Quantitative trait locus analysis using J/QTL. Meth Mol Biol 573:175–88. Solberg LC, Valdar W, Gauguier D, Nunez G, Taylor A, Burnett S, Arboledas-Hita C, Hernandez-Pliego P, Davidson S, Burns P, Bhattacharya S, Hough T, Higgs D, Klenerman P, Cookson WO, Zhang Y, Deacon RM, Rawlins JN, Mott R, Flint J. (2006). A protocol for high-throughput phenotyping, suitable for quantitative trait analysis in mice. Mamm Genome 17(2):129–46. STAR Consortium, Saar K, Beck A, Bihoreau MT, Birney E, Brocklebank D, Chen Y, Cuppen E, Demonchy S, Dopazo J, Flicek P, Foglio M, Fujiyama A, Gut IG, Gauguier D, Guigo R, Guryev V, Heinig M, Hummel O, Jahn N, Klages S, Kren V, Kube M, Kuhl H, Kuramoto T, Kuroki Y, Lechner D, Lee YA, Lopez-Bigas N, Lathrop GM, Mashimo T, Medina I, Mott R, Patone G, Perrier-Cornet JA, Platzer M, Pravenec M, Reinhardt R, Sakaki Y, Schilhabel M, Schulz H, Serikawa T, Shikhagaie M, Tatsumoto S, Taudien S, Toyoda A, Voigt B, Zelenika D, Zimdahl H, Hubner N. (2008). SNP and haplotype mapping for genetic analysis in the rat. Nat Genet 40(5):560–66. Tesson BM, Jansen RC. (2009). eQTL analysis in mice and rats. Meth Mol Biol 573:285–309. Tsaih S-W, Korstanje R. (2009). 
Haplotype association mapping in mice. Meth Mol Biol 573:213–22. Twigger SN, Smith S, Zuniga-Meyer A, Bromberg SK. (2006). Exploring phenotypic data and the rat genome database. Current Protocols in Bioinformatics, 14:1, 14.1– 1.14.27. Valdar W, Holmes CC, Mott R, Flint J. (2009). Mapping in structured populations by resample model averaging. Genetics 182(4):1263–77. Valdar W, Solberg LC, Gauguier D, Burnett S, Klenerman P, Cookson WO, Taylor MS, Rawlins JN, Mott R, Flint J. (2006). Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38(8):879–87. Wahlsten D, Bachmanov A, Finn DA, Crabbe JC. (2006) Stability of inbred mouse strain differences in behavior and brain size between laboratories and across decades. Proc Natl Acad Sci USA 103(44):16364–69. Wang J, Williams RW, Manly KF. (2003a). WebQTL: web-based complex trait analysis. Neuroinformatics 1(4):299–308. Wang X, Le Roy I, Nicodeme E, Li R, Wagner R, Petros C, Churchill GA, Harris S, Darvasi A, Kirilovsky J, Roubertoux PL, Paigen B. (2003b). Using advanced inter-
cross lines for high-resolution mapping of HDL cholesterol quantitative trait loci. Genome Res 13(7):1654–64. Woods LCS, Holl K, Tschannen M, Valdar W. (2010). Fine-mapping a locus for glucose tolerance using heterogeneous stock rats. Physiol Genomics 41(1):102–08. Wu R, Lin M. (2006). Functional mapping—how to map and study the genetic architecture of dynamic complex traits. Nat Rev Genet 7(3):229–37. Wu R, Ma C-X, Casella G. (2007). Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. Springer, New York. Yalcin B, Willis-Owen SA, Fullerton J, Meesaq A, Deacon RM, Rawlins JN, Copley RR, Morris AP, Flint J, Mott R. (2004). Genetic dissection of a behavioral quantitative trait locus shows that Rgs2 modulates anxiety in mice. Nat Genet 36(11): 1197–202. Yang H, Ding Y, Hutchins LN, Szatkiewicz J, Bell TA, Paigen BJ, Graber JH, de Villena FP, Churchill GA. (2009). A customized and versatile high-density genotyping array for the mouse. Nat Methods 6(9):663–66. Zou F. (2009). QTL mapping in intercross and backcross populations. Meth Mol Biol 573:157–73.
CHAPTER 20
Gene Discovery of Crop Disease in the Postgenome Era
YULIN JIA
Contents
20.1 Introduction
20.2 Map-Based Cloning
20.2.1 Mapping Population for Map-Based Cloning
20.3 A Plant Model: The Rice Blast System
20.3.1 The Structure and Function of Blast R Gene
20.3.2 Co-Evolution of Host R and Pathogen AVR Genes
20.4 R Gene Use in Breeding
20.5 Future Prospects
20.6 Acknowledgments
20.7 References
20.1 INTRODUCTION
Plants producing essential foods and fibers for human survival have been subjected to intensive studies worldwide. In nature, plants are attacked by numerous viral, bacterial, and fungal pathogens. Unlike animals and humans, plants cannot move away from the pathogens by themselves. To survive in natural ecosystems, plants evolved sophisticated innate defense systems governed by R genes to fight against invading pathogens. These R genes each provide robust power to battle against pathogens and are distributed in plant germplasm worldwide. Crop plants are domesticated plant species for human cultivation. Before breeding and genetics, farmers knew to save seeds that survived disease epidemics. Since then, R genes have been accumulated in
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
numerous crop landraces. These landraces have been used as R gene donors for breeding to improve resistance worldwide. In contrast with animals and humans, genetic crosses can be made in plants without ethical concerns. This feature has made plants excellent models for R gene discovery. Genetic analysis of crop germplasm thus far has identified hundreds of plant R genes that are effective in preventing infections by numerous races of diverse pathogens containing the corresponding avirulence (AVR) genes (Flor, 1971). Many of these have been mapped with diverse molecular markers for breeding and genetic studies, and some of them have been isolated. The Pto gene in tomato was the first plant R gene cloned that conferred gene-for-gene resistance (Martin et al., 1993). Subsequently, numerous R genes have been cloned from other crop plants and Arabidopsis (Martin et al., 1993). Cloning and characterization of crop R genes have been greatly expedited in the postgenomic era. A wide range of biotechnologies, including numerous functional genomic tools, have been successfully applied to isolate and characterize plant R genes. Among them, the most common technique used for R gene isolation is map-based positional cloning. Map-based cloning of an R gene takes several years and sometimes more than 10 years. The time needed can be reduced when genome sequences are available. The genome sequences of two subspecies of rice (Oryza sativa indica cv 93-11 and japonica cv Nipponbare) have allowed rapid R gene isolation using in silico cloning (Goff et al., 2002; Yu et al., 2002). Rapid improvement of sequencing technology has made whole-genome sequences of diverse crop plants available. These technological advances have laid a solid foundation for R gene discovery and for studying the molecular mechanisms of R gene-mediated defense responses.
The purposes of this chapter are to describe the contemporary map-based cloning for plant R gene discovery and review the current understanding of structure and function of plant R genes with emphasis on R genes to the world renowned crop killer Magnaporthe oryzae.
20.2 MAP-BASED CLONING
Map-based cloning used for plant R gene discovery involves the following three steps: (1) construction of a genetic linkage map using DNA markers, (2) determination of genetic and physical locations of candidate genes, and (3) functional confirmation of candidate genes by complementation test (Fig. 20.1).
20.2.1 Mapping Population for Map-Based Cloning
20.2.1.1 Mapping Population
A mapping population is required for the construction of a genetic linkage map. Such a mapping population can be created using any sexually compatible
Figure 20.1. The procedure of map-based cloning. a, Integrated genetic and physical maps with left and right borders shown. A candidate R gene was identified from a BAC or a YAC clone. b, Complementation test using either Agrobacterium-mediated transformation or particle bombardment to verify that the resistance was due to the presence of the candidate gene. Primary transformant T1 expressing the candidate R gene is resistant and T2, progeny of T1 segregate as a ratio of 3 resistant:1 susceptible, demonstrating the resistant function of the R gene.
plant. There are three different mapping populations that are commonly used for map-based cloning of plant genes.
1. An F2 population: An F1 hybrid is produced by crossing two parents. Selfing an F1 hybrid gives rise to multiple F2 progeny, and these F2 progeny are the members of an F2 mapping population. The advantage of an F2 population is that each individual is genetically unique; the disadvantage is that the phenotype cannot be replicated at F2 because each individual cannot be reused. One solution to this problem is to use a controlled phenotyping method. For example, detached leaf inoculation was developed to repeatedly measure the phenotype of each F2 individual (Jia et al., 2003).
2. A recombinant inbred line (RIL) population: A RIL population overcomes the disadvantage of an F2 population because the phenotype can be evaluated repeatedly for each segregating progeny. The most common method to develop a RIL population is to self each F2 individual for an additional three to nine generations. It can then be assumed that each RIL at F5 to F9 has reached a homozygous state at nearly all genetic loci. A RIL population is ideal for mapping and cloning quantitative trait loci (QTL); however, it normally takes 2 to 5 years to develop a RIL population (Jia et al., 2010). In rice,
two generations can be advanced per year and sometimes three generations per year for short-season rice varieties under greenhouse conditions. In some tropical areas of the world, such as Los Banos in the Philippines, Puerto Rico, and Hainan Island in China, three generations can be advanced per year.
3. A doubled haploid (DH) population: A DH population is designed to reduce the time needed for developing a RIL population. F1 pollen is cultured using the anther culture method, and the chromosomes of the resulting haploids are doubled by chemical treatment so that each locus in a doubled haploid is homozygous. A DH population can be developed within 2 years; however, some rice varieties are not suitable for tissue culture because their regeneration frequencies are extremely low.
20.2.1.2 The Procedure of Map-Based Cloning
20.2.1.2.1 Construction of a Genetic Linkage Map
A genetic linkage map orders genetic markers according to the frequency of crossing-over events between them during meiosis. For example, a genetic linkage map of rice was constructed using JoinMap (Kosambi, 1944; Liu et al., 2008). The genetic distance between any two markers may differ from one linkage map to another; however, the physical locations of the majority of markers should be the same for a particular crop. For any plant genome, DNA markers that distinguish the two parents should first be identified for constructing a genetic linkage map. These polymorphic DNA markers, evenly distributed among all chromosomes, are used to determine the genotype of each segregating progeny in a mapping population. Over the past decade, genetic markers have evolved from restriction fragment length polymorphism (RFLP), to random amplified polymorphic DNA (RAPD) (Dioh et al., 2000), to simple sequence repeat (SSR), and to single nucleotide polymorphism (SNP) (Rafalski, 2002). User-friendly co-dominant DNA markers, SSR and SNP, can be readily identified in the genome if sequence is available. In rice, a high-density linkage map consisting of abundant genetic markers on each chromosome was constructed by McCouch and colleagues (1998).
20.2.1.2.2 Determination of Physical Location of the Candidate R Gene
After a genetic linkage map with evenly distributed DNA markers is constructed, the phenotype of each progeny in a mapping population should be evaluated. Replicated experiments are often required for evaluating the phenotypes to exclude environmental effects. Replications can be easily set up with each individual RIL and DH line because of unlimited seed supplies. For mapping using an F2 mapping population, F2 phenotypes can be verified in F3 families. The R gene is then placed on the genetic map using both the phenotype and the genotype of each individual of the mapping population.
The most common software for map construction for major R genes is Mapmaker (Lincoln et al., 1992), and for quantitative trait loci (QTL) it is QTL Cartographer (Wang et al., 2007).
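The conversion from observed recombination to map distance can be illustrated with a short sketch. It is not the procedure implemented in JoinMap or Mapmaker; it only assumes a doubled haploid population in which each line carries parental allele "A" or "B" at every marker, so that the fraction of lines with differing alleles at two markers estimates the recombination fraction r directly. The Kosambi mapping function then gives the distance in centimorgans. The function names and the marker calls are hypothetical.

```python
import math


def recombination_fraction(marker_1, marker_2):
    """Estimate r as the fraction of lines whose alleles differ at two markers.

    Assumes a doubled haploid population scored as 'A' or 'B';
    lines with a missing call (None) at either marker are skipped.
    """
    pairs = [(a, b) for a, b in zip(marker_1, marker_2)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no lines scored at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)


def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Hypothetical calls for 100 DH lines: 10 lines are recombinant.
    m1 = ["A"] * 50 + ["B"] * 50
    m2 = ["A"] * 45 + ["B"] * 5 + ["B"] * 45 + ["A"] * 5
    r = recombination_fraction(m1, m2)
    print(f"r = {r:.3f}, Kosambi distance = {kosambi_cm(r):.2f} cM")
```

With 10 recombinant lines out of 100, r is 0.10 and the Kosambi distance is roughly 10.1 cM. For RIL or F2 data the estimate of r requires a different calculation, which this sketch does not attempt.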
The next step is to construct a physical map to determine the physical location of the candidate R gene on a chromosome. Physical mapping can be expedited by using available sequence information. A genomic library that provides several-fold coverage of the genome is needed to ensure the identification of all possible candidate genes. The two most commonly used genomic libraries for cloning a plant R gene are the yeast artificial chromosome (YAC) library and the bacterial artificial chromosome (BAC) library. Physical mapping begins with the DNA markers most closely linked to the R locus, which are used to generate a contig of overlapping clones spanning the locus. The YAC or BAC clones spanning the R locus are then sequenced, and candidate R genes are identified in the contig based on homology to known R genes in public databases.
20.2.1.2.3 Functional Confirmation of the Candidate Gene
The candidate R gene can be transferred into a susceptible parent using genetic transformation. There are two methods of transformation routinely used for plants.
20.2.1.2.3.1 Agrobacterium-Mediated Transformation
The most commonly used method for plant transformation is Agrobacterium-mediated transformation using binary vectors (Bevan, 1984). Agrobacterium is commonly used for transferring new genes into plant cells and for securing their stable integration into the host genome. The principle behind plant transformation using Agrobacterium is that A. tumefaciens, a soil bacterium, causes tumors at wound sites (crown gall) by introducing the genetic information for tumor formation into the plant genome. Rather than being located on the bacterial chromosome, this transferred DNA (T-DNA) is found on a Ti (tumor-inducing) plasmid, flanked by the right (RB) and left (LB) borders.
Agrobacteria can modify plants genetically and the original T-DNA containing the information for tumor formation can be replaced with foreign DNA. The Agrobacterium can therefore be used as a transport vector to introduce new genes. Since an indicator is necessary to identify transformed cells, a marker such as hygromycin resistance is often placed on the Ti plasmid alongside the target gene. However, transformation using only one plasmid can lead to random introgression of unwanted DNA from plasmid into the plant genome. Binary vectors were later developed to avoid random T-DNA introduction into plant cells. With a binary vector, the target gene (T-DNA) and virulence gene, which are naturally joined on one plasmid, are split between two plasmids. The initial Ti plasmid contains virulence (vir) genes that are necessary for T-DNA transfer. In the case of binary vectors, the vir genes are removed, and the desired DNA can be integrated between the borders. These disarmed binary plasmids are easy to propagate in E. coli and are also much smaller than a normal Ti plasmid. The vir genes needed for transfer to the plant are arranged on a second plasmid where the T-DNA and both borders are removed. Using
binary vectors, one plasmid is used to transfer the target gene, while the other helps with the transformation to avoid random integration of unwanted plasmid DNA into the plant genome.
20.2.1.2.3.2 Particle Bombardment
Another common alternative for genetic transformation is particle bombardment (Christou et al., 1997). Particle bombardment is based on the fact that gold particles can be used as a carrier to introduce plasmid DNA that expresses the candidate gene into a plant cell. Fine gold particles are coated with a plasmid containing a DNA construct expressing the coding region of the candidate R gene and are co-introduced into calli along with a plasmid expressing an antibiotic selectable marker, such as hygromycin resistance.
20.2.1.2.3.3 Plant Regeneration and Progeny Analysis
Plant calli without plasmids expressing hygromycin resistance do not survive, and only transformed calli can give rise to seedlings in culture media containing hygromycin. All surviving seedlings can be examined for the presence of the transgene using PCR and/or Southern blot. Once primary T1 transformants are produced, the next step is to determine their reactions to pathogen infection using standard disease infection assays. A ratio of 3 resistant (with the candidate R gene) to 1 susceptible (without the candidate R gene) is expected in T2 progeny if the candidate R gene is responsible for the newly acquired resistance.
20.2.1.2.3.4 Positional Cloning of the First Gene-Specific Plant R Gene
Using the map-based cloning strategy, the Pto gene in tomato was the first race-specific R gene cloned in plants. Pto confers race-specific resistance to Pseudomonas syringae DC3000, which causes bacterial speck disease. The Pto gene was mapped within 3 cM using RFLP markers and an F2 mapping population consisting of 251 individuals.
The candidate R gene on a YAC clone was identified from a YAC library using a RFLP marker that co-segregated with resistance (Martin et al., 2003). After a decade of investigation, molecular mechanisms underlying the Pto gene-mediated defense response are now well understood (Martin et al., 2003). Subsequently, more R genes have been isolated from tomato, peppers, tobacco, and rice (Martin et al., 2003).
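The 3 resistant : 1 susceptible expectation for T2 progeny described in the complementation test above can be checked with a chi-square goodness-of-fit test. The sketch below is only an illustration under stated assumptions: a single transgene insertion that behaves as a dominant Mendelian locus, and hypothetical progeny counts. The function name is not taken from any package. For one degree of freedom, the upper-tail probability of the chi-square statistic x equals erfc(sqrt(x / 2)), so only the standard library is needed.

```python
import math


def chi_square_3_to_1(resistant, susceptible):
    """Test observed counts against a 3:1 ratio (1 degree of freedom).

    Returns (chi-square statistic, p-value). A large p-value means the
    counts are consistent with 3 resistant : 1 susceptible.
    """
    total = resistant + susceptible
    if total <= 0:
        raise ValueError("at least one progeny must be scored")
    expected_r = 0.75 * total
    expected_s = 0.25 * total
    chi2 = ((resistant - expected_r) ** 2 / expected_r
            + (susceptible - expected_s) ** 2 / expected_s)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical T2 family: 78 resistant and 22 susceptible plants.
    chi2, p = chi_square_3_to_1(78, 22)
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

For these hypothetical counts the statistic is 0.48, well below the conventional 5% critical value of 3.84, so the 3:1 hypothesis would not be rejected. A significant departure could point to multiple insertions or to incomplete penetrance rather than to a wrong candidate gene.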
20.3 A PLANT MODEL: THE RICE BLAST SYSTEM
Rice blast disease, caused by the filamentous ascomycete fungus Magnaporthe oryzae (formerly M. grisea), is one of the most damaging rice diseases (Khush and Jena, 2007). Rice, one of the most important food crops, has been cultivated under diverse environments around the globe. M. oryzae is highly adaptive to the environment and infection begins with asexual conidia. The conidial
Figure 20.2. Symptoms of rice blast disease on rice plants without the major blast R genes. a, Four asexual conidia of Magnaporthe oryzae that infect rice seedlings. b, Leaf blast with a close-up of an eye-shaped lesion from an irrigated rice field in the United States. c, Leaf blast showing severely diseased leaves in an upland rice field in Colombia. d, Panicle blast showing affected grain from an irrigated rice field in Arkansas.
infection is a semi-biotrophic process resulting in loss of productivity and quality of rice grains (Howard et al., 1991; Fig. 20.2). M. oryzae shifted hosts to rice after rice cultivation became widespread. On the other hand, R genes in rice have evolved over time to prevent infections by M. oryzae. Evidently the AVR genes in strains of M. oryzae that determine the efficacy of R genes are highly unstable (Kang et al., 2001; Zhou et al., 2007). Under favorable environmental conditions, blast epidemics sometimes cause significant economic losses in some rice growing areas. This never-ending battle between rice and M. oryzae makes the rice blast system a model for plant R gene discovery.
20.3.1 The Structure and Function of Blast R Gene
Historically the blast R genes, called Pi-genes, confer resistance to the blast fungus in a race-specific manner (Silue et al., 1992). The Pi-genes have been commonly discovered in landrace varieties and wild rice relatives. Today, rice blast R genes are one of the best plant models for understanding the
coevolutionary dynamics of R genes with their cognate AVR genes. Thus far, over 80 blast Pi genes have been tagged with molecular markers; some of them have been cloned and others have been used for breeding for improved resistance in plants (Ballini et al., 2008). Similar to other plant R genes, the 13 cloned blast R genes encode putative receptor proteins with an NBS-LRR domain, except for Pi-d2 and pi21. Pi-d2 encodes a predicted B-lectin receptor kinase and pi21 is a defective proline protein (Table 20.1). Pi-b was the first blast R gene cloned, and it encodes a predicted protein with an NBS-LRR domain (Wang et al., 1999). Pi-ta was the second blast R gene isolated from rice. Pi-ta encodes a putative receptor with NBS and a degenerated LRR domain (referred to as LRD) (Fig. 20.3; Bryan et al., 2000; Jia et al., 2000). Among the cloned genes, three blast R genes, Pi-ta, Pi-d2, and Pi-36, are single members in the rice genome. The others are members of small gene families. Interestingly, two members of the Pi-km and Pi-5 families are required for complete resistance to some races of blast fungus (Ashikawa et al., 2008; Lee et al., 2009b). The ability of R proteins to recognize pathogen signaling molecules is the most critical component of signal transduction. Such recognition can be direct or indirect. The Pto gene in tomato was the first plant R gene whose product was demonstrated to bind to the pathogen signaling molecule (Scofield et al., 1996; Tang et al., 1996). Similar to Pto, Pi-ta is the only blast R gene whose role as a cytoplasmic receptor has been demonstrated with a putative product of the corresponding AVR gene, AVR-Pita176 (Jia et al., 2000). A model for Pi-ta-mediated resistance is proposed (Fig. 20.4). In this model, the AVR gene product AVR-Pita, with 223 amino acids, is processed by unknown mechanisms into an active protein with 176 amino acids, AVR-Pita176.
The AVR-Pita176 protein was demonstrated to interact with the Pi-ta protein, an interaction that may also involve the Pi-ta2 protein in resistance to other races of M. oryzae (Bryan et al., 2000; Jia et al., 2000). As a result of these interactions, defense signals are triggered and presumably transferred to another plant modifier, Ptr(t), subsequently activating plant defense gene expression that stops the invading blast fungus (Jia et al., 2002; Jia and Martin, 2008). To date, four other NBS-LRR proteins have been shown to physically interact with cognate AVR proteins in other plant-pathogen systems: the Arabidopsis thaliana protein RRS1-R with PopP2 from Ralstonia solanacearum (Deslandes et al., 2003); the flax L5, L6, and L7 proteins with Avr567 proteins of flax rust (Melampsora lini) (Dodds et al., 2006); N from tobacco with the p50 elicitor from tobacco mosaic virus (Ueda et al., 2006); and the flax M protein with AvrM of flax rust (Catanzariti et al., 2010). Indirect binding of R proteins with pathogen effectors was demonstrated in the A. thaliana and P. syringae system (Axtell et al., 2003; Mackey et al., 2003; Shao et al., 2003). In the rice blast system, it is not an easy task to isolate the cognate AVR gene because of the difficulty of making genetic crosses in M. oryzae. It can take a long time to isolate one AVR gene using map-based cloning (Orbach et al., 2000). Pi-ta and AVR-Pita are still the only pair of R and AVR genes that are well characterized thus far. By genomewide association, three AVR genes from
TABLE 20.1. DNA Sequences, Chromosomal Locations, and Structural Characteristics of Cloned Rice Blast R Genes
Name of R Gene | GenBank Accession(a) | Chromosome | Motif | Reference
Pi-b | AB013448 | 2 | NBS-LRR | Wang et al. (1999)
Pi-ta | AF207842 | 12 | NBS-LRR | Bryan et al. (2000)
Pi-9 | ABB88855 | 6 | NBS-LRR | Qu et al. (2006)
Pi2/Pizt | ABC94599/DQ352040 | 6 | NBS-LRR | Zhou et al. (2006)
Pi-d2 | Not available | 6 | B-lectin Receptor Kinase | Chen et al. (2006)
Pi36 | DQ90896 | 8 | NBS-LRR | Li et al. (2009)
Pi37 | DQ923494 | 1 | NBS-LRR | Lin et al. (2007)
Pi5 | EU869185 and EU869186 | 9 | NBS-LRR | Lee et al. (2009)
Pit | AB379815-AB379822 | 1 | NBS-LRR | Hayashi et al. (2009)
Pikm | AB462256, AB462324, and AB462325 | 11 | NBS-LRR | Ashikawa et al. (2008)
Pi-d3 | FJ745365 (Pid3-A4), FJ745366 (Pid3-ZYQ8), FJ745367 (Pid3-TP309), FJ745368 (Pid3-LTH), FJ773285 (Pid3-9311), FJ773286 (Pid3-Nip) | 6 | NBS-LRR | Shang et al. (2009)
Pi-21 | AB430852, AB430853, and AB430854 | 4 | Defected proline protein | Fukuoka et al. (2009)
(a) Available at www.ncbi.nlm.nih.gov.
M. oryzae were rapidly isolated (Yoshida et al., 2009). Similarly, isolation of blast R genes has been accelerated by available genomic resources. In fact, 10 of 13 blast R genes were cloned within the past 5 years (Table 20.1). With more matched R and AVR genes cloned in the future, the recognition mechanisms of blast R genes can be further examined.
20.3.2 Co-Evolution of Host R and Pathogen AVR Genes
The structures of rice R genes are extremely conserved. However, AVR genes are known to encode random molecules that may play important roles in pathogen fitness and pathogenicity. One outstanding question is how host R genes have evolved the ability to detect these random molecules from the pathogens. In the blast system, transposition, alternative splicing, gene
Figure 20.3. Cloning of the rice blast resistance gene Pi-ta using map-based cloning. Pi-ta-mediated resistance was located near the centromere of chromosome 12, and a 1-Mb BAC contig was assembled and sequenced. All candidates were analyzed by bioinformatic tools, and the candidate for Pi-ta was identified and isolated from the contig. The resistance function of the candidate Pi-ta gene was verified by transforming the candidate gene into a susceptible rice cultivar (Bryan et al., 2000).
Figure 20.4. A model for the Pi-ta gene-mediated disease resistance response. The avirulence gene product AVR-Pita is predicted to be processed into an active form, AVR-Pita176, in plant cells by an unknown mechanism. Once inside the plant cell, AVR-Pita176 is predicted to bind to the Pi-ta protein. Binding of Pi-ta with AVR-Pita176 may need the Pi-ta2 and Ptr(t) proteins to activate signaling for producing plant proteins that prevent further invasive growth of the blast fungus.
clustering, diversification, and genomic rearrangements are known mechanisms of genetic changes that drive the co-evolution of host R genes with the pathogen AVR genes.
20.3.2.1 Transposition
Transposons are sequences of DNA that can move around to different positions within the genome. Transposons can influence the structure of genes and genomes via insertion, excision, chromosome breakage, and ectopic recombination, often with alteration of gene expression (Bennetzen, 2000). In the rice blast system, Pi-ta is located near the centromere of rice chromosome 12, a region that contains fewer active genes than other regions of the chromosome. Interestingly, a transposon was found in the promoter region of the Pi-ta gene. The presence of this transposon was found to be strictly associated with resistance in the rice germplasm surveyed (Lee et al., 2009a). Similarly, an ancient blast R gene, Pit, was demonstrated to be activated by another transposon in the promoter region (Hayashi et al., 2009). Both cases led to the hypothesis that transposons play a positive role in regulating ancient blast R genes.
20.3.2.2 Alternative Splicing
Alternative splicing is a process by which the exons of the RNA produced by transcription of a gene are reconnected in multiple ways during RNA splicing. The resulting different mRNAs may be translated into different protein isoforms; therefore, a single gene may code for multiple proteins. The Pi-ta gene was predicted to encode 12 distinct putative products of between 315 and 1033 amino acids. Among them, five preserve complete NBS-LRR domains and two couple the original NBS-LRR domain of the Pi-ta protein with a C-terminal thioredoxin (TRX) domain. Gene expression analysis demonstrated that transcript variants encoding the TRX domain had the highest level of expression in comparison to other full-length or truncated transcripts. These posttranscriptional modifications of Pi-ta produced a series of transcripts that could have a significant impact on newly evolved resistance specificity.
20.3.2.3 Clustering
Historically, resistance to different pathogens or different races of the same pathogen is often mapped within a small genomic interval. The presence of clustered R genes suggests that R genes in plant genomes have evolved in clusters to fight against pathogens. This is also true for most of the cloned blast R genes because they are members of small gene families. Whether these family members are in fact R genes remains to be demonstrated. Most noticeably, a large linkage block (5.4–27 Mbp) was found in rice cultivars that contain the resistant Pi-ta alleles. These findings suggest that many genes involved in gene-specific blast resistance may reside within a small genomic region on the same chromosome.
20.3.2.4 Diversification
Surveys of Pi-ta alleles in a wide range of rice germplasm and their wild rice relatives revealed that selection constraints had occurred at the Pi-ta locus in cultivars but not in wild rice relatives (Jia et al., 2003; Lee et al., 2009a; Wang et al., 2008).
Diversification was not common at the Pi-ta gene among cultivated rice varieties (O. sativa); however, diversification was more pronounced in O. rufipogon, a predicted ancestor of the cultivated species of rice. In contrast, pathogen AVR gene products are often involved in promoting virulence and fitness. Diversification of AVR genes is one of the most important strategies that the fungus employs to overcome resistance controlled by R genes. Surveys thus far have identified 37 AVR-Pita variants with minor amino acid differences in field isolates of M. oryzae. In addition, partial and complete deletions and frame-shift mutations were found in the AVR-Pita variants in virulent field and laboratory races (Zhou et al., 2007). These genomic rearrangements can alter the resistance stability of deployed R genes and eventually lead to severe blast epidemics.
20.4 R GENE USE IN BREEDING
The use of plant R genes is the most economical and environmentally friendly method of crop protection. For a long time, breeding for improved resistance
has been accomplished through traditional genetic crosses of donor parents with recurrent parents and then selecting resistant individuals in subsequent generations. In practically any given rice variety, the exact number of R genes present is unknown; similarly, in any race of M. oryzae there is often an unknown number of AVR genes. Despite this fact, an effective international differential system has been in place for predicting the spectra of R genes. It is known that resistance to one race of M. oryzae is governed by a matched pair consisting of an R gene in rice and an AVR gene in M. oryzae; interaction of the products of the R and AVR genes can result in complete resistance. The presence of one matched pair therefore can mask expression of other matched pairs of R and AVR genes. Selection based on disease reactions can identify resistant progeny but cannot accurately identify a particular R gene. DNA markers closely linked to an R gene can immediately be used for R gene selection in classical plant breeding using marker-assisted selection (MAS). The use of markers is a relatively new tool in plant breeding programs. Under normal circumstances, the use of markers can reduce the time needed for developing a cultivar because trait selection can be made at the seed and seedling stages under controlled laboratory settings. One short-term benefit of cloning a plant R gene is being able to develop DNA markers from portions of cloned genes. These types of markers are derived from the R genes themselves and are referred to as perfect markers. Perfect markers for two blast R genes, Pi-ta and Pi-b, have been effectively developed (Fjellstrom et al., 2004; Jia et al., 2000, 2002, 2009; Wang et al., 2007). Among them, the perfect markers for the Pi-ta gene have been effectively used for MAS since 2002 (Jia et al., 2009). In the long term, cloned R genes can be used to engineer durable resistance to the pathogens.
Transferring R genes using transformation into advanced breeding lines can accelerate R gene incorporation and also avoid linkage drag associated with resistance (Jia, 2009).
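The practical difference between a closely linked marker and a perfect marker in MAS can be made concrete with a small calculation. The sketch below is an illustration only and rests on several assumptions not stated in the text: an F2 population from a resistant donor and a susceptible parent, a co-dominant marker in coupling phase with the R gene, and the same recombination fraction r in both meioses. Under these assumptions each gamete carrying the donor marker allele also carries the R gene with probability 1 - r. The function names are hypothetical.

```python
def prob_homozygous_resistant(r):
    """P(plant is homozygous for the R gene | homozygous for donor marker allele).

    Both gametes must be non-recombinant between marker and R gene.
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5]")
    return (1.0 - r) ** 2


def prob_carries_r_gene(r):
    """P(plant carries at least one R allele | homozygous for donor marker allele).

    The R gene is lost only if both gametes are recombinant.
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5]")
    return 1.0 - r ** 2


if __name__ == "__main__":
    # r = 0.0 corresponds to a perfect marker derived from the gene itself.
    for r in (0.0, 0.01, 0.05, 0.10):
        print(f"r = {r:.2f}: homozygous R = {prob_homozygous_resistant(r):.4f}, "
              f"carries R = {prob_carries_r_gene(r):.4f}")
```

With r = 0.05, about 90% of marker-selected F2 plants would be homozygous for the R gene, whereas a perfect marker (r = 0) selects the gene without error under these assumptions.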
20.5 FUTURE PROSPECTS
It is now well known that plants evolved an array of highly regulated defense strategies to prevent pathogen invasion. Among them, R genes regulate effector-triggered immune responses. Rapid advances in biotechnology, including controlled phenotyping, DNA sequence analysis, gene expression analysis using DNA microarrays, and serial analysis of gene expression, have accelerated efforts of crop R gene discovery and use. With dramatic reductions in the cost of biotechnology, the speed of isolation and characterization of plant R genes will increase greatly. Cloning and characterization of crop R genes have facilitated a better understanding of the molecular mechanisms of disease resistance, co-evolution mechanisms, interaction and signaling recognition, and transduction (Jia et al., 2000). However, several challenges lie ahead, including understanding how crop R genes keep pace with the rapid changes of pathogen effectors. Specifically, (1) the pathogen effector is meant
to promote plant diseases; however, cellular targets for the effector proteins in plant cells are still unknown. (2) The existence of one or more master controllers for plant disease resistance is undetermined. If there is a master controller, why has it not been discovered? (3) Pathogens can overcome engineered resistance within a short time after resistance is deployed, and a methodology that ensures the durability of R gene-mediated resistance has not been demonstrated. (4) Gene flow has been commonly observed in crop fields; however, methods to prevent crop R genes from escaping to weedy species of crop plants have not been developed. Finally, (5) the penalty for crop productivity and quality is unknown if a crop plant is immune to invading pathogens (Tian et al., 2003). Besides these challenges, a better-defined genetic system is urgently needed to study resistance to necrotrophic pathogens, such as the soilborne fungal pathogen Rhizoctonia solani, which causes rice sheath blight disease. For sheath blight, major R genes are not functional, and the resistance mechanisms are unknown. Recently, progress has been made in improving a phenotyping method and tagging the major QTLs for MAS (Jia et al., 2007; Liu et al., 2009). Continued identification and cloning of major QTLs will be another important priority for crop protection. In summary, cloned plant R genes can be used directly both for genetic engineering of effective resistance and for MAS. There is no doubt that genetic engineering is one of the fastest approaches to developing disease-resistant crops. However, MAS also holds great promise, even though it is still a relatively young tool in crop breeding (Jia, 2003). The knowledge gained from the plant R genes characterized thus far has established a solid foundation for continued exploration of sophisticated natural defense systems, with the aim of effective crop protection and stable crop production that should benefit humanity.
20.6 ACKNOWLEDGMENTS
The author thanks the present and past members of the Molecular Plant Pathology program of USDA-ARS Dale Bumpers National Rice Research Center (DB NRRC) for excellent technical support, Melissa Jia (staff scientist) and Ellen McWhirter (English editor) of DB NRRC for proofreading, Seonghee Lee (Plant Pathologist, Noble Foundation) and Stefano Costanzo (Plant Pathologist, DB NRRC) for critical reading, and ARS 301 National Program, National Science Foundation and Arkansas Rice Research and Promotion Board for financial support.
20.7 REFERENCES
Ashikawa I, Hayashi N, Yamane H, Kanamori H, Wu J, Matsumoto T, Ono K, Yano M. (2008). Two adjacent nucleotide-binding site-leucine-rich repeat class genes are required to confer Pikm-specific rice blast resistance. Genetics 180:2267–76.
Axtell MJ, Chisholm ST, Dahlbeck D, Staskawicz BJ. (2003). Genetic and molecular evidence that the Pseudomonas syringae type III effector protein AvrRpt2 is a cysteine protease. Mol Microbiol 49:1537–46. Ballini E, Morel JB, Droc G, Price A, Courtois B, Notteghem JL, Tharreau D. (2008). A genome-wide meta-analysis of rice blast resistance genes and quantitative trait loci provides new insights into partial and complete resistance. Mol Plant Microbe Interact 21:859–68. Bennetzen JL. (2000). Transposable element contributions to plant gene and genome evolution. Plant Mol Biol 42:251–69. Bevan M. (1984). Binary Agrobacterium vectors for plant transformation. Nucl Acids Res 12:8711–21. Bryan GT, Wu K, Farrall L, Jia Y, Hershey HP, McAdams SA, Faulk KN, Donaldson GK, Tarchini R, Valent B. (2000). A single amino acid difference distinguishes resistant and susceptible alleles of rice blast resistance gene Pi-ta. Plant Cell 12:2033–45. Catanzariti A-M, Dodds PN, Ve T, Kobe B, Ellis JG, Staskawicz BJ. (2010). The AvrM effector from flax rust has a structured C-terminal domain and interacts directly with the M resistance protein. Mol Plant Microbe Interact 23:49–57. Chen X, Shang J, Chen D, Lei C, Zou Y, Zhai W, Liu G, Xu J, Ling Z, Cao G, Ma B, Wang Y, Zhao X, Li S, Zhu L. (2006). A B-lectin receptor kinase gene conferring rice blast resistance. Plant J 46:794–804. Christou P. (1997). Rice transformation: bombardment. Plant Mol Biol 35:193–203. Deslandes L, Olivier J, Peeters N, Feng DX, Khounlotham M, Boucher C, Somssich I, Genin S, Marco Y. (2003). Physical interaction between RRS1-R, a protein conferring resistance to bacterial wilt, and PopP2, a type III effector targeted to the plant nucleus. Proc Natl Acad Sci U S A 100:8024–29. Dioh W, Tharreau D, Notteghem JL, Orbach M, Lebrun MH. (2000). Mapping of avirulence genes in the rice blast fungus, Magnaporthe grisea, with RFLP and RAPD markers. Mol Plant Microbe Interact 13:217–27.
Dodds PN, Lawrence GJ, Catanzariti A, Teh T, Wang CI, Ayliffe MA, Kobe B, Ellis JG. (2006). Direct protein interaction underlies gene-for-gene specificity and coevolution of the flax resistance genes and flax rust avirulence genes. Proc Natl Acad Sci U S A 103:8888–93. Fjellstrom RG, Conaway-Bormans CA, McClung AM, Marchetti MA, Shank AR, Park WD. (2004). Development of DNA markers suitable for marker assisted selection of three Pi genes conferring resistance to multiple Pyricularia grisea pathotypes. Crop Sci 44:1790–98. Flor HH. (1971). Current status of the gene-for-gene concept. Annu Rev Phytopathol 9:275–96. Fukuoka S, Saka N, Koga H, Ono K, Shimizu T, Ebana K, Hayashi N, Takahashi A, Hirochika H, Okuno K, Yano M. (2009). Loss of function of a proline-containing protein confers durable disease resistance in rice. Science 325:998–1001. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J et al. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100.
Hayashi K, Yoshida H. (2009). Refunctionalization of the ancient rice blast disease resistance gene Pit by the recruitment of a retrotransposon as a promoter. Plant J 57:413–25. Howard R, Ferrari J, Roach MA, Roach DH, Money NP. (1991). Penetration of hard substrates by a fungus employing enormous turgor pressures. Proc Natl Acad Sci U S A 88:11281–284. Jia Y. (2003). Marker assisted selection for the control of rice blast disease. Pesticide Outlook 14:150–52. Jia Y. (2009). Artificial introgression of a large chromosome fragment around the rice blast resistance gene Pi-ta in backcross progeny and several elite rice cultivars. Heredity 103:333–39. Jia Y, Correa-Victoria, FJ, McClung A, Zhu L, Liu G, Wamishe Y, Xie J, Marchetti MA, Pinson SRM, Rutger JN, Correll JC. (2007). Rapid determination of rice cultivar responses to the sheath blight pathogen Rhizoctonia solani using a micro-chamber screening method. Plant Dis 91:485–89. Jia Y, Lee F, McClung A. (2009). Determination of resistance spectra to US races of Magnaporthe oryzae causing blast in a recombinant inbred line population. Plant Dis 93:639–44. Jia Y, Martin R. (2008). Identification of a new locus, Ptr(t), required for rice blast resistance gene Pi-ta-mediated resistance. Mol Plant Microbe Interact 21:396–403. Jia Y, McAdams S, Bryan G, Hershey H, Valent B. (2000). Direct interaction of resistance gene and avirulence gene products confers rice blast resistance. EMBO J 19:4004–14. Jia Y, Moldenhauer K. (2010). Development of mono and digenic rice lines of rice blast resistance gene Pi-ta, Pi-k(s/h). Plant Reg 4:163–66. Jia Y, Valent B, Lee FN. (2003). Determination of host responses to Magnaporthe grisea on detached rice leaves using a spot inoculation method. Plant Dis 87:129–33. Jia Y, Wang Z, Singh P. (2002). Development of dominant rice blast resistance Pi-ta gene markers. Crop Sci 42:2145–49. Kang S, Lebrun MH, Farrall L, Valent B. (2001). 
Gain of virulence caused by insertion of a Pot3 transposon in a Magnaporthe grisea avirulence gene. Mol Plant Microbe Interact 14:671–74. Khush G, Jena K. (2007). Current status and future prospects of research on blast disease in rice (Oryza sativa). Paper presented at the 4th International Rice Blast Conference, Changsha, China. Kosambi DD. (1944). The estimation of map distances from recombination values. Ann Eugen 12:172–75. Lee S, Costanzo S, Jia Y, Olsen K, Caicedo A. (2009a). Evolutionary dynamics of the genomic region around the blast resistance gene Pi-ta in AA genome Oryza species. Genetics 183:1315–25. Lee SK, Song MY, Seo YS, Kim HK, Ko S, Cao PJ, Suh JP, Yi G, Roh JH, Lee S, An G, Hahn TR, Wang GL, Ronald P, Jeon JS. (2009b). Rice Pi5-mediated resistance to Magnaporthe oryzae requires the presence of two coiled-coil-nucleotide-binding-leucine-rich repeat genes. Genetics 181:1627–38. Li B, Wang J, Wu Y, Hu X, Zhang Z, Zhang Q, Zhao Q, Feng H, Zhang Z, Wang GL, Wang G, Lu B, Han Z, Wang Z, Zhou B. (2009). The Magnaporthe oryzae avirulence
gene AvrPiz-t encodes a predicted secreted protein that triggers the immunity in rice mediated by the blast resistance gene Piz-t. Mol Plant Microbe Interact 22:411–20. Lin F, Chen S, Que Z, Wang L, Liu X, Pan Q. (2007). The blast resistance gene Pi37 encodes a nucleotide binding site-leucine-rich repeat protein and is a member of a resistance gene cluster on rice chromosome 1. Genetics 177:1871–80. Lincoln S, Daly M, Lander ES. (1992). Constructing Genetic Maps with MAPMAKER/EXP 3.0 in Whitehead Institute Technical Report. 2nd ed., Whitehead Institute, Cambridge, UK. Liu G, Bernhardt JL, Jia MH, Wamishe YA, Jia Y. (2008). Molecular characterization of the recombinant inbred line population derived from a japonica-indica rice cross. Euphytica 159:73–82. Liu G, Jia Y, Correa-Victoria F, Prado GA, Yeater KM, McClung A, Correll JC. (2009). Mapping quantitative trait loci responsible for resistance to sheath blight in rice. Phytopathology 99:1078–84. Liu X, Lin F, Wang L, Pan Q. (2007). The in silico map-based cloning of Pi36, a rice coiled-coil–nucleotide-binding site–leucine-rich repeat gene that confers race-specific resistance to the blast fungus. Genetics 176:2541–49. Mackey D, Belkhadir Y, Alonso JM, Ecker JR, Dangl JL. (2003). Arabidopsis RIN4 is a target of the type III virulence effector AvrRpt2 and modulates RPS2-mediated resistance. Cell 112:379–89. Martin GB, Bogdanove A, Sessa G. (2003). Understanding the functions of plant disease resistance proteins. Annu Rev Plant Biol 54:23–61. Martin GB, Brommonschenkel S, Chunwongse J, Frary A, Ganal MW, Spivey R, Wu T, Earle ED, Tanksley SD. (1993). Map-based cloning of a protein kinase gene conferring disease resistance in tomato. Science 262:1432–36. McCouch SR, Kochert G, Yu Z, Wang Z, Khush GS, Coffman WR, Tanksley SD. (1998). Molecular mapping of rice chromosomes. Theor Appl Genet 76:815–29. Orbach MJ, Farrall L, Sweigard JA, Chumley FG, Valent B. (2000).
A telomeric avirulence gene determines efficacy for the rice blast resistance gene Pi-ta. Plant Cell 12:2019–32. Qu S, Liu G, Zhou B, Bellizzi M, Zeng L, Dai L, Han B, Wang GL. (2006). The broad-spectrum blast resistance gene Pi9 encodes a nucleotide-binding site–leucine-rich repeat protein and is a member of multigene family in rice. Genetics 172:1901–14. Rafalski A. (2002). Applications of single nucleotide polymorphisms in crop genetics. Curr Opin Plant Biol 5:94–100. Scofield SR, Tobias CM, Rathjen JP, Chang JH, Lavelle DT, Michelmore RW, Staskawicz BJ. (1996). Molecular basis of gene-for-gene specificity in bacterial speck disease of tomato. Science 268:661–67. Shang J, Tao Y, Chen X, Zou Y, Lei C, Wang J, Li X, Zhao X, Zhang M, Lu Z, Xu J, Cheng Z, Wan J, Zhu L. (2009). Identification of a new rice blast resistance gene, Pid3, by genomewide comparison of paired nucleotide-binding site-leucine-rich repeat genes and their pseudogene alleles between the two sequenced rice genomes. Genetics 182:1303–11. Shao F, Golstein C, Ade J, Stoutemyer M, Dixon JE, Innes RW. (2003). Cleavage of Arabidopsis PBS1 by a bacterial type III effector. Science 301:1230–33.
Silue D, Notteghem JL, Tharreau D. (1992). Evidence for a gene for gene relationship in the Oryza sativa-Magnaporthe grisea pathosystem. Phytopathology 82:577–82. Tang X, Frederick R, Zhou J, Halterman DA, Jia Y, Martin GB. (1996). Initiation of plant disease resistance by physical interaction of avrPto and Pto kinase. Science 274:2060–63. Tian D, Traw MB, Chen JQ, Kreitman M, Bergelson J. (2003). Fitness costs of R-gene-mediated resistance in Arabidopsis thaliana. Nature 423:74–77. Ueda H, Yamaguchi Y, Sano H. (2006). Direct interaction between the tobacco mosaic virus helicase domain and the ATP-bound resistance protein, N factor during the hypersensitive response in tobacco plants. Plant Mol Biol 61:31–45. Wang S, Basten CJ, Zeng ZB. (2007). Windows QTL Cartographer 2.5. Department of Statistics, North Carolina State University, Raleigh. Available at http://statgen.ncsu.edu/qtlcart/WQTLCart.htm. Wang X, Jia Y, Shu QY, Wu D. (2008). Haplotype diversity at the Pi-ta locus in cultivated rice and its wild relatives. Phytopathology 98:1305–11. Wang X, Yano M, Yamanouchi U, Iwamoto M, Monna L, Hayasaka H, Katayose Y, Sasaki T. (1999). The Pi-b gene for rice blast resistance belongs to the nucleotide binding and leucine-rich repeat class of plant disease resistance genes. Plant J 19:55–64. Yoshida K, Saitoh H, Fujisawa S, Kanzaki H, Matsumura H, Yoshida K, Tosa Y, Chuma I, Takano Y, Win J, Kamoun S, Terauchi R. (2009). Association genetics reveals three novel avirulence genes from the rice blast fungal pathogen Magnaporthe oryzae. Plant Cell 21:1573–91. Yu J, Hu S, Wang J, Wong G, Li S, Liu B, Deng Y, et al. (2002). A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92. Zhou E, Jia Y, Singh P, Correll JC, Lee FN. (2007). Instability of the Magnaporthe oryzae avirulence gene AVR-Pita alters virulence. Fungal Genet Biol 44:1024–34. Zhou B, Qu S, Liu G, Dolan M, Sakai H, Lu G, Bellizzi M, Wang G. (2006).
The eight amino acid differences within three leucine-rich repeats between Pi2 and Piz-t resistance proteins determine the resistance specificity to Magnaporthe grisea. Mol Plant Microbe Interact 19:1216–28.
CHAPTER 21
Impact of Genomewide Structural Variation on Gene Discovery
LISENKA E.L.M. VISSERS and JORIS A. VELTMAN
Contents
21.1 A Historical Perspective of the Detection of Genomewide Structural Variation and Its Relevance to Disease
21.1.1 Human Genomic Variation and Visualization of Structural Variants
21.1.2 Chromosomal Rearrangements Causing Disease
21.1.3 The Detection of Submicroscopic Chromosomal Rearrangements
21.1.4 The Clinical Consequence of Submicroscopic Chromosome Rearrangements
21.1.5 Array-Based Comparative Genomic Hybridization
21.2 The Basic Concept for Disease Gene Discovery through Genomewide Profiling Strategies
21.2.1 Single Gene Disorders
21.2.2 Contiguous Gene Syndromes
21.2.3 Point Mutations, Deletions, and Duplications May Lead to the Same Phenotype
21.3 Disease Gene Discovery through Genomewide Profiling Strategies
21.3.1 CHARGE Syndrome
21.3.2 The 9q Subtelomeric Deletion Syndrome
21.3.3 Defining New Microdeletion Syndromes
21.4 Disease Gene Identification for Common Diseases
21.4.1 Rare CNVs in Common Diseases
21.5 Discriminating the Disease-Related CNV from All Normal CNVs
21.5.1 Forging Links between Human Phenotypes and Mouse Gene Knockout Models
21.6 Next-Generation Sequencing for the Detection of Structural Variation
21.6.1 CNV Detection Using Shotgun Sequencing
21.6.2 CNV Detection Using Mate-Pair Sequencing
21.7 Conclusion
21.8 Questions and Answers
21.9 Acknowledgments
21.10 References
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
21.1 A HISTORICAL PERSPECTIVE OF THE DETECTION OF GENOMEWIDE STRUCTURAL VARIATION AND ITS RELEVANCE TO DISEASE
21.1.1 Human Genomic Variation and Visualization of Structural Variants
Human genomic variation is present in many forms, including single nucleotide polymorphisms (SNPs), variable number of tandem repeats (VNTRs), transposable elements, and structural variation. All these variants differ among individuals and as such define human phenotypic variation. Whereas SNPs involve only a single base, structural chromosome variants involve multiple bases and are defined by a chromosome breakage and subsequent joining in an altered configuration. Large structural variants, involving millions of bases, can be visualized using light microscopy, which is referred to as karyotyping. The altered configuration may lead to a loss or gain of genetic material, in which case the rearrangement is unbalanced. Alternatively, all genetic material is retained, in which case the new configuration may be fully balanced. Structural chromosome abnormalities include (1) deletions, (2) duplications, (3) isochromosomes, (4) ring chromosomes, (5) inversions, and (6) translocations (Fig. 21.1). By definition, deletions, isochromosomes, duplications, and ring chromosomes are unbalanced in nature. Translocations and inversions can either be balanced or unbalanced. Structural chromosome rearrangements can lead to a wide variety of serious clinical manifestations, including mental retardation (MR) and congenital malformations, as the altered configuration may affect several genes. The exact clinical manifestations depend on the size of the rearrangement and the genetic information affected. That is, larger genome segments are likely to contain more genes, and as such, may lead to a more severe phenotype.
Figure 21.1. Structural rearrangements. a, Deletion with loss of genetic material. b, Duplication with insertion of genetic material (gray). c, Isochromosome of the p arm with loss of one chromosome arm and duplication of the other arm (duplication in gray). d, Ring chromosome with joining of two sticky chromosome ends caused by deletion of genetic material on both chromosome arms. e, Inversion with reversion of genetic material on a single chromosome. f, Translocation with transfer of genetic material from one chromosome to another. In this case a translocation of chromosomes 4 (in black) and 7 (in gray). Part of the 4q arm is exchanged with material originating from the 7q arm.
21.1.2 Chromosomal Rearrangements Causing Disease
With the availability of karyotyping, chromosome rearrangements have been linked to disease. Down syndrome was the first clinical syndrome linked to
a specific chromosome rearrangement, namely trisomy of chromosome 21. Since this discovery, the concept of linking a disease phenotype to a chromosome rearrangement has been explored further. The strategy most often used was to identify overlapping deletions in multiple patients presenting with a similar disease phenotype. The smallest region of overlap, deleted in all patients, would then define the region causing the phenotype under investigation. Subsequently, a positional gene approach can be used to identify the disease gene. In this approach, all genes present in the shortest region of deletion overlap are examined by DNA sequencing to identify mutations in patients with a similar phenotype but without a causative structural variation. A related disease-gene identification approach starts with the identification of a (balanced) translocation and studies the gene(s) disrupted by the translocation as potential disease gene(s). These approaches have been very successful for localizing genes for several diseases, including holoprosencephaly, retinoblastoma, and Gardner syndrome (Lele et al., 1963; Riccardi et al., 1978; Herrera et al., 1986; Schmickel, 1986; Munke, 1989; Roessler et al., 1996; Brown et al., 1998; Wallis et al., 1999; Gripp et al., 2000). Gross chromosomal rearrangements can be detected by karyotyping. However, this genomewide approach has several limiting factors. First, karyotyping requires actively dividing cells to obtain chromosomes in the optimal configuration for visualization (i.e., metaphase chromosomes). Second, karyotyping has a limited resolution (i.e., the structural aberration needs to be of sufficient size to be detectable, involving at least 5–10 Mb). There is, however, no reason why smaller structural genomic variations should not cause disease, as even a single base-pair mutation can cause disease.
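The smallest-region-of-overlap strategy described above amounts to intersecting the deleted intervals of all patients and then listing the genes that fall in the shared segment. The following is a minimal sketch of that idea only; the coordinates (in Mb on a single chromosome), the gene names, and the helper functions are hypothetical and are not taken from any of the cited studies.

```python
def smallest_region_of_overlap(deletions):
    """Intersect (start, end) deletion intervals from several patients.

    Returns the shared (start, end) segment, or None if the deletions
    have no region in common.
    """
    start = max(s for s, _ in deletions)
    end = min(e for _, e in deletions)
    return (start, end) if start < end else None


def genes_in_region(genes, region):
    """Names of genes whose (start, end) span overlaps the region, sorted."""
    r_start, r_end = region
    return sorted(
        name
        for name, (g_start, g_end) in genes.items()
        if g_start < r_end and g_end > r_start
    )


# Hypothetical deletions (Mb) in three patients with a similar phenotype.
patient_deletions = [(10.0, 18.5), (12.2, 20.0), (11.0, 14.0)]

# Hypothetical gene positions (Mb) on the same chromosome.
genes = {
    "GENE_A": (10.5, 10.9),
    "GENE_B": (12.5, 13.1),
    "GENE_C": (13.8, 14.6),
    "GENE_D": (17.0, 17.4),
}

sro = smallest_region_of_overlap(patient_deletions)
candidates = genes_in_region(genes, sro)
print("Smallest region of overlap (Mb):", sro)
print("Candidate genes to sequence:", candidates)
```

In this toy example the shared segment is much smaller than any single deletion, which is why adding patients with overlapping deletions narrows the list of candidate genes to be sequenced.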
For disease gene identification studies, moreover, the detection of a smaller structural variation is much more useful than that of a larger one, because the number of candidate disease genes in the affected genomic locus will be limited.
21.1.3 The Detection of Submicroscopic Chromosomal Rearrangements
For a considerable number of clinical disorders, the genetic cause has been established to be a rearrangement smaller than 5–10 Mb in size, which therefore remains undetectable by karyotyping. To detect these rearrangements, more sensitive techniques were needed. The resolution of chromosome analysis has greatly benefited from the introduction of fluorescent in situ hybridization (FISH) (Van Prooijen-Knegt et al., 1982). This technology relies on the unique ability of single-stranded fluorescently labeled DNA, known as a probe, to anneal to its complementary sequence in the chromosomes. Next, the location of the annealed (or hybridized) probe can be visualized by use of a fluorescent microscope. Depending on the application, different types of FISH probes can be used, such as telomere-specific probes, whole chromosome painting probes
and locus-specific probes. The latter type of probe in particular can be used, for instance, to validate a clinical diagnosis by showing that the genomic locus involved in the disease is indeed not present in the normal copy number of two. Although FISH and related high-resolution technologies, such as quantitative PCR, provide a higher resolution (100–300 kb) than karyotyping does, these techniques can interrogate only a limited number of loci in a single experiment and require a priori knowledge of the genomic locus to investigate. As such, these types of technologies are too expensive and labor intensive to use for a genomewide analysis. In subsequent years, FISH technologies were further modified to allow for a genomewide analysis. The optimization resulted in the introduction of comparative genomic hybridization (CGH) (Kallioniemi et al., 1992; Lichter et al., 2000). CGH is based on the comparison of two genomic DNA populations, one derived from a test (patient) sample, and one derived from a normal reference sample. Equal amounts of DNA are differentially labeled and simultaneously hybridized onto normal human metaphase chromosomes, thus competing for the same targets on the chromosomes. Variation in fluorescence intensities of test and reference DNA along each chromosome target reveals the genomic locations of chromosome rearrangements in the test DNA. The advantage of CGH over karyotyping is that it does not require actively dividing cells from the test sample as a source for metaphase spreads. As such, CGH can be performed on virtually all samples from which DNA can be extracted. CGH has proven particularly useful in cancer research. In general, tumour samples are difficult to culture and harvest for preparing metaphase spreads. In addition, these spreads are difficult to analyze by conventional karyotyping because of the abundance and complexity of the rearrangements present. The resolution of CGH, however, still depends on the resolution of the target metaphase chromosomes (i.e., it remains difficult to detect rearrangements below the level of 5–10 Mb) (Forozan et al., 1997).
21.1.4 The Clinical Consequence of Submicroscopic Chromosome Rearrangements
Over the last decades, various FISH studies have revealed that submicroscopic subtelomeric rearrangements account for approximately 6% of all previously unexplained cases of MR (Flint et al., 1995; Knight et al., 1999; de Vries et al., 2003). Similarly, it was found that interstitial submicroscopic chromosome rearrangements account for a large proportion of contiguous gene syndromes (Osborne et al., 2001; Shaikh et al., 2001). However, in cases where the clinical phenotype has not previously been associated with a known genomic rearrangement, there is no a priori knowledge of the region to be tested and, hence, FISH is no longer the method of choice. To detect such submicroscopic rearrangements on a genomewide scale, novel technologies were needed that
combine the resolution of targeted FISH technologies with the genomewide approach of CGH. One such technology is microarray-based comparative genomic hybridization (array CGH).
21.1.5 Array-Based Comparative Genomic Hybridization
Through the development of novel technologies such as array CGH, the resolving power of conventional chromosome analysis techniques has increased from the megabase to the kilobase level (Solinas-Toldo et al., 1997; Pinkel et al., 1998). Tools that have facilitated the development of these technologies include (1) genomewide clone resources integrated into the finished human genome sequence, (2) high-throughput microarray platforms, and (3) optimized CGH protocols and data analysis systems. Together, these microarray-based technological developments have culminated in a so-called molecular karyotyping approach that allows for the sensitive and specific detection of submicroscopic single copy number changes throughout the entire human genome. Array CGH builds on conventional CGH procedures in such a way that the target metaphase spreads are replaced by genomic fragments with known physical locations in a microarray format. In comparison with conventional CGH, the microarray format provides a higher resolution, a higher dynamic range, and a better possibility for automation. In addition, it allows for direct linking of (submicroscopic) chromosome rearrangements, also referred to as copy number variation (CNV), to known genomic sequences and, thus, to genes which may be involved in the disease under investigation (Fig. 21.2). Initially, genomic microarrays were developed in academia and contained mostly genomic fragments obtained from large-insert genomic clones, mainly bacterial artificial chromosomes. Different clone sets have been used; the most popular contained one clone per 1 Mb or, later, a tiling-resolution clone set of approximately 30,000 clones covering the genome with one clone per 100 kb (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Menten et al., 2006; Redon et al., 2006; Rosenberg et al., 2006).
Recently, genomic microarray production has been taken over by private enterprises, and many companies are now offering microarrays for genomewide copy number profiling containing more than a million oligonucleotides and targeting random sequences, SNPs, or a combination thereof. These oligonucleotides have been more evenly spaced across the genome, and optimized protocols are now available for the quantitative detection of CNVs (Fig. 21.2). With this, CNV detection can now reliably be performed at the kilobase level, resulting in the detection of hundreds of CNVs per individual (Redon et al., 2006). These advances have made genomic profiling technology an excellent tool for the genomewide detection of CNVs in health and disease (Friedman et al., 2006; Wagenstaller et al., 2007; Shao et al., 2008; Zhang et al., 2008; McMullan et al., 2009). Consequently, disease gene discovery has been facilitated using these approaches.
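The quantitative step behind array-based copy number detection can be sketched as follows: for each array target, the patient signal is divided by the reference signal, the ratio is log2-transformed, and a loss or gain is called when the value crosses a threshold. This is only an illustration under stated assumptions. The threshold values of -0.3 and +0.3, the probe names, and the intensities are hypothetical and are not taken from any particular platform or from the studies cited above; real analyses also normalize the data and require several consecutive targets before calling a CNV.

```python
import math

# Assumed thresholds for illustration only; real platforms set their own.
LOSS_THRESHOLD = -0.3
GAIN_THRESHOLD = 0.3


def log2_ratio(test_intensity, reference_intensity):
    """log2 of the patient (test) signal over the reference signal."""
    return math.log2(test_intensity / reference_intensity)


def call_copy_number(ratio, loss=LOSS_THRESHOLD, gain=GAIN_THRESHOLD):
    """Classify one array target as 'loss', 'gain', or 'normal'."""
    if ratio <= loss:
        return "loss"
    if ratio >= gain:
        return "gain"
    return "normal"


# In the ideal case, one copy versus two gives log2(1/2) = -1 and
# three copies versus two gives log2(3/2), roughly 0.58.
# Hypothetical (name, patient intensity, reference intensity) triples.
probes = [
    ("probe_1", 1000.0, 1000.0),
    ("probe_2", 520.0, 1000.0),
    ("probe_3", 1480.0, 1000.0),
]

calls = {name: call_copy_number(log2_ratio(t, r)) for name, t, r in probes}
for name, t, r in probes:
    print(name, round(log2_ratio(t, r), 2), calls[name])
```

Because a single-copy gain shifts the log2 ratio by less than a single-copy loss does, duplications are generally the harder of the two to separate from noise with fixed thresholds.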
Figure 21.2 panel labels. a (array CGH; patient and reference): DNA isolation; differential labeling; simultaneous hybridization in 1:1 ratio to the microarray; detection of labeled DNA; computational analysis using internal control; plot of log2 patient/reference ratio with clones ordered by Mb position on the chromosome, with duplication and deletion indicated. b (SNP array; patient only): DNA isolation; restriction enzyme digestion; adaptor ligation; PCR amplification and complexity reduction; fragmentation and end-labeling; hybridization to unique probe sequences; detection of hybridized material (genotypes AA, AB, BB); computational analysis using external controls.
Figure 21.2. Overview of (a) the array CGH procedure using BAC arrays and (b) the SNP array technology. a, Genomic DNA samples from a test (patient; left) and reference (normal control; right) are differentially labeled with different fluorochromes, usually Cy3 and Cy5 (green and red, respectively; indicated here by light and dark gray asterisks). The two DNA samples are mixed in equal amounts and hybridized to the microarray, onto which large-insert clone DNAs (e.g., BAC clones) have been robotically spotted as targets. Subsequent computer imaging assesses the relative fluorescence levels of each labeled DNA for each array target. Clones to which equal amounts of patient DNA and reference DNA have been hybridized will appear in yellow, clones deleted in the patient DNA but not the reference DNA will appear in red, and clones that are duplicated in the patient DNA will appear in green. b, For SNP arrays, genomic DNA of a single patient is hybridized onto the array (single color hybridization). Signal intensities for all probes are determined, and intensity ratios are calculated in silico using signal intensities obtained in previous array runs with control DNA. For both BAC arrays and SNP arrays, the output visualizes the copy number variation, with deletions and duplications showing ratios below and above preset thresholds for loss and gain, respectively.
21.2 THE BASIC CONCEPT FOR DISEASE GENE DISCOVERY THROUGH GENOMEWIDE PROFILING STRATEGIES 21.2.1
Single Gene Disorders
In single gene disorders, or monogenic diseases, the phenotypic spectrum observed can be attributed to the malfunctioning of a single gene. Such malfunctioning can be caused by physical deletion or duplication of (parts of) a genomic copy of the gene or by more subtle intragenic mutations. The ultimate effect of a deletion or mutation that, for example, leads to a premature stop in the reading frame of the affected gene is haploinsufficiency, a state in which a decrease in the level of the corresponding protein gives rise to the phenotype. Such genes are dosage sensitive. Conversely, duplications or gain-of-function mutations create proteins that exhibit an increase in constitutive activity, even in the absence of a physiological activator, or that are insensitive to negative regulators. To date, over 10,000 single gene disorders are known and listed in the database of Online Mendelian Inheritance in Man (OMIM).
21.2.2 Contiguous Gene Syndromes
In contrast to single gene disorders, it has been shown that several conditions, including mental retardation and additional congenital/developmental abnormalities, may be due to submicroscopic chromosome rearrangements encompassing several genes. In 1986, the term contiguous gene syndrome was coined for these disorders (Schmickel, 1986). Since the introduction of this term, many alternatives have been suggested, including microdeletion/microduplication syndrome and segmental aneusomy syndrome. All these terms intend to imply that the phenotype of the disorder results from an inappropriate dosage of more than one critical gene located within the genomic region affected; that is, individual genes located in such genomic regions contribute to distinct clinical features of the syndrome. It has, therefore, been suggested that the extent of the chromosomal region involved in each case would correlate with the ultimate phenotype and that individual clinical features might be inherited in isolation (Budarf and Emanuel, 1997).
21.2.3 Point Mutations, Deletions, and Duplications May Lead to the Same Phenotype
As outlined above, it is becoming increasingly clear that the only real requirement for a candidate microdeletion syndrome gene is that it should be dosage sensitive. In case of microduplications, the effect of having a complete extra copy of a gene may result in a phenotype that is not mirrored by other mutations in this gene. The frequencies at which microdeletions or microduplications are encountered in monogenic diseases differ markedly. For example,
there are monogenic diseases that are mostly caused by gene mutations and rarely by deletions or duplications, such as Rubinstein-Taybi syndrome and Alagille syndrome (Krantz et al., 1997; Petrij et al., 2000). In other monogenic diseases, however, large deletions or duplications involving a dosage-sensitive gene are responsible for the majority of the cases, including Pelizaeus-Merzbacher syndrome and Smith-Magenis syndrome (Juyal et al., 1996; Mimault et al., 1999). Thus, microdeletions and microduplications occur at various frequencies in many monogenic diseases with a known genetic cause, and the difference between a microdeletion syndrome with rare mutations and a single gene disorder with occasional large deletions may be gradual rather than absolute. The availability of genomewide technologies to detect submicroscopic CNVs may further enhance the possibilities for a straightforward mapping of the genes underlying these disorders (Fig. 21.3).
21.3 DISEASE GENE DISCOVERY THROUGH GENOMEWIDE PROFILING STRATEGIES
To identify disease-causing genes, a stringent clinical preselection of patients whose DNA can be interrogated using high-resolution genomewide technologies for the detection of CNVs is an important first step. Second, the platform to conduct the molecular study needs to be selected. Whenever possible, the platform chosen should be the one with the highest resolution available at the time of testing. Since smaller CNVs can be detected only using higher resolution platforms, the chance of disease gene identification increases with increasing resolution. It is, however, noteworthy that with increasing resolution, more CNVs per individual will be detected (up to 100 CNVs per individual). With this observation, discriminating between benign CNVs occurring in the general population and the causative CNV becomes increasingly important to facilitate disease gene discovery (see Section 21.5). The first syndrome successfully resolved using a high-resolution genomewide approach was CHARGE syndrome, through the detection and characterization of microdeletions by array CGH (Vissers et al., 2004).
21.3.1 CHARGE Syndrome
[Figure 21.3 panel labels: (a) prioritize CNVs to find causative variant; (b) prioritize genes to find dosage-sensitive gene; (c) sequence for mutations in large patient cohort.]
Figure 21.3. Disease gene identification strategy using genomewide structural variation. From all CNVs detected in a patient (a), the causal variant needs to be identified using diverse criteria, including de novo occurrence and absence in healthy controls (b). Prioritization of the candidate genes located within this CNV determines which gene to sequence for disease-causing mutations in a larger cohort of patients showing the same phenotype but who do not harbor the deletion or duplication (c). Dark gray and light gray arrowheads in a represent deletions and duplications, respectively. Arrows in b signify the orientation of the genes located within the deleted interval. Circles in c represent mutations in patients without the deletion.
CHARGE syndrome (MIM #214800) is an autosomal dominant disorder with a prevalence of one in 10,000 (Blake et al., 1998). The acronym stands for the cardinal clinical features of the syndrome: coloboma, heart malformation, choanal atresia, retardation of growth and/or development, genital anomalies, and ear anomalies (Pagon et al., 1981). Most cases of CHARGE syndrome are sporadic, but several aspects of this condition support the involvement of a genetic factor that had remained elusive until recently (Tellier et al., 1998, 2000; Martin et al., 2002; Lalani et al., 2003). With the availability of microarray-based approaches, unbiased, genomewide screens were performed hypothesizing that microdeletions and/or microduplications might be the underlying cause of CHARGE syndrome (Vissers et al., 2004). Initial screening of two patients with CHARGE syndrome on a 1 Mb array revealed a microdeletion of ∼5 Mb in one of the patients at chromosome locus 8q12. Subsequent array analysis of an apparently balanced chromosome 8 translocation (based on
karyotyping) with estimated breakpoints within the 8q12 region (Hurst et al., 1991) unraveled two interspersed microdeletions, overlapping with the microdeletion of the first patient. This result showed (1) that higher resolution genomic analyses can reveal deletions at the breakpoint of a translocation, which are impossible to detect with the lower-resolution karyotyping technology, and (2) that deletions in two unrelated patients with the same syndrome point to chromosome 8q12 as the disease-causing locus. Subsequent analyses of DNA from 17 additional CHARGE patients on a tiling resolution chromosome 8 array did not show any additional microdeletions in this genomic locus. As such, it was reasoned that the disease in these patients was caused by point mutations in one of the genes residing in the shortest region of deletion overlap. Sequence analysis of all nine genes located within this region indeed revealed de novo mutations in CHD7, a novel member of the chromodomain helicase DNA-binding gene family, in the majority of individuals with CHARGE syndrome without deletions. Based on these results, it was concluded that CHARGE syndrome is caused by haploinsufficiency of the CHD7 gene, either by microdeletions encompassing the CHD7 gene, or by mutations within this gene (Vissers et al., 2004).
21.3.2 The 9q Subtelomeric Deletion Syndrome
A second well-illustrated example of gene discovery through deletion and/or translocation mapping is the discovery of the euchromatin histone methyl transferase 1 (EHMT1) gene causing 9q subtelomeric deletion syndrome (MIM #610253). Submicroscopic subtelomeric deletions of chromosome 9q (9qSTDS) are associated with a recognizable mental retardation syndrome (Harada et al., 2004; Stewart et al., 2004; Kleefstra et al., 2009).
The identification of the molecular cause of 9qSTDS started with the initial FISH screening of subtelomeric rearrangements in 12 patients, narrowing down the commonly deleted region to an ∼1.2 Mb interval (Stewart et al., 2004). Subsequently, this region was further reduced to ∼700 kb, still containing at least five genes and several ESTs (Yatsenko et al., 2005). The first evidence that 9qSTDS was a single gene disorder came from the characterization of the breakpoints of a balanced translocation t(X;9)(p11.23;q34.3) in a patient presenting with typical features of 9qSTDS, whose chromosome 9 breakpoint disrupted the EHMT1 gene in intron 9 (Kleefstra et al., 2005). Additional evidence was provided by deletion screening and sequence analysis of the gene in 23 patients with a clinical presentation reminiscent of 9qSTDS (Kleefstra et al., 2006). Of these 23 patients, 3 patients showed a deletion including the EHMT1 gene and 2 patients showed a de novo mutation in the EHMT1 gene. With this discovery, it was established that haploinsufficiency of the EHMT1 gene, either by deletion or mutation, leads to 9qSTDS (Kleefstra et al., 2009). Other examples for which this strategy has been successful include Potocki-Lupski syndrome (RAI1), Peters-plus syndrome (B3GALTL), and the MECP2
duplication syndrome (MECP2) (van Esch et al., 2005; Lesnik Oberstein et al., 2006; Potocki et al., 2007).
21.3.3 Defining New Microdeletion Syndromes
In addition to revealing disease genes for known syndromes, the worldwide use of high-resolution platforms on large patient cohorts has also been instrumental for defining novel microdeletion syndromes. Here, the concept of a phenotype-first approach, referring to the identification of a disease gene in a patient cohort (e.g., CHARGE syndrome), is changed into a genotype-first approach. In the genotype-first approach, overlapping CNVs are identified in a large clinically heterogeneous patient cohort. After this molecular finding, more careful examination of the patients' phenotypes may show phenotypic overlap that is not expected for such a heterogeneous disease, hence allowing the definition of a new syndrome. The 17q21.31 microdeletion syndrome was the first new microdeletion syndrome identified through this approach by studying large cohorts of patients with unexplained mental retardation. The identification of this microdeletion syndrome encompassing 17q21.31 was simultaneously described by three groups (Koolen et al., 2006; Sharp et al., 2006; Shaw-Smith et al., 2006). Apart from mental retardation, these patients turned out to have additional clinical features in common, such as hypotonia and a specific facial feature (bulbous nose) (Koolen et al., 2006; Sharp et al., 2006; Shaw-Smith et al., 2006). Other examples of novel syndromes include 15q24 microdeletion syndrome, 3q29 microdeletion syndrome, Xq28 microduplication syndrome, and 16p11.2 microdeletion/microduplication syndrome (van Esch et al., 2005; Willatt et al., 2005; Sharp et al., 2007; Weiss et al., 2008; Brunetti-Pierri et al., 2008; Mefford et al., 2008; El-Hattab et al., 2009). Whether these novel microdeletion and microduplication syndromes can also be attributed to a single dosage-sensitive gene, similar to the syndromes that have been resolved using CNV as the initial discovery, has not yet been established.
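Both the shortest region of deletion overlap used for CHARGE syndrome and the overlapping CNVs of the genotype-first approach come down to intersecting genomic intervals. The following is a minimal sketch, assuming one contiguous interval per patient on a single chromosome; the coordinates, gene names, and function names are hypothetical, and real cases (for example, a patient with two interspersed deletions) need more careful handling.

```python
def shortest_region_of_overlap(intervals):
    """Intersect (start, end) intervals on one chromosome.

    Returns the shared (start, end) region, or None if the intervals
    have no base in common.
    """
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start < end else None


def genes_in_region(region, genes):
    """Return the sorted names of genes whose span overlaps the region."""
    r_start, r_end = region
    return sorted(
        name
        for name, (g_start, g_end) in genes.items()
        if g_start < r_end and g_end > r_start
    )


# Hypothetical deletions found in three patients (same chromosome).
patient_deletions = [
    (60_000_000, 65_000_000),
    (61_000_000, 62_500_000),
    (61_800_000, 64_000_000),
]

# Hypothetical gene annotation for the region.
genes = {
    "geneA": (61_900_000, 62_100_000),
    "geneB": (62_300_000, 62_450_000),
    "geneC": (63_000_000, 63_200_000),
}

sro = shortest_region_of_overlap(patient_deletions)
candidates = genes_in_region(sro, genes)
print(sro, candidates)
```

The genes returned for the shared region are the candidates to sequence in patients with the same phenotype who carry no deletion.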
21.4 DISEASE GENE IDENTIFICATION FOR COMMON DISEASES
Although many common diseases, such as schizophrenia, mental retardation, and autism spectrum disorder, occur at high frequencies in the general population and show an overall high heritability, the genetic contribution to these diseases is only partially explained (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Friedman et al., 2006, 2008; Rosenberg et al., 2006; Owen et al., 2007; Wagenstaller et al., 2007; Ingason et al., 2009; Kirov et al., 2008, 2009; Sanders et al., 2008; Shao et al., 2008; Vrijenhoek et al., 2008; Walsh et al., 2008; Weiss et al., 2008; Xu et al., 2008; Zhang et al., 2008; McMullan et al., 2009). Genomewide CNV microarrays have recently been introduced in these diseases (Gijsbers et al., 2009; Koolen et al., 2009). For mental retardation, CNV studies have already entered the
diagnostic arena and replaced karyotyping as the gold standard in studying chromosomal abnormalities. For the other diseases this has not yet happened, but these have benefited from the fact that CNVs can now be reliably called on SNP-based microarrays that are being routinely used for genomewide association studies. Until now, most studies in all of these common diseases have focused on the detection and interpretation of rare CNVs, as these are easier to link to disease than CNVs that also occur frequently in the normal population (Vissers et al., 2003; Shaw-Smith et al., 2004; de Vries et al., 2005; Schoumans et al., 2005; Friedman et al., 2006; Rosenberg et al., 2006; Ullmann et al., 2007; Wagenstaller et al., 2007; Mefford et al., 2008; Shao et al., 2008; Sharp et al., 2008; Zhang et al., 2008; van Bon et al., 2009; Hannes et al., 2009; McMullan et al., 2009).
21.4.1 Rare CNVs in Common Diseases
One study evaluating the role of rare CNVs in schizophrenia combined a genomewide CNV screen in patients with deficit schizophrenia with a more targeted follow-up study in a general-schizophrenia patient–control cohort (Vrijenhoek et al., 2008). The discovery cohort of deficit schizophrenia patients revealed a set of four CNVs containing candidate genes, not reported to be copy number variant in healthy individuals. The genes located within these rare CNVs (NRXN1, CTNND2, MYT1L, and ASTN2) were further studied for copy number variation in more than 700 patients with more generalized schizophrenia as well as more than 700 unaffected controls. In total, four additional CNVs were identified in the patient cohort, all leading to deletions, duplications, or disruptions of one of the genes, thereby suggesting an important role in the etiology of schizophrenia (Vrijenhoek et al., 2008). Similarly, dosage variation of the CNTNAP2 gene has been linked to a combination of schizophrenia and epilepsy in three individuals with overlapping aberrations involving this gene (Friedman et al., 2008). In addition, several other rare variants have been reported in patients with schizophrenia, including deletion of the ERBB4 gene and a fusion of the SKP2 and the SLC1A gene due to deletion of an intervening segment (Walsh et al., 2008). Although individually rare, the total number of disease-causing structural variants in common diseases such as mental retardation and schizophrenia indicates that these contribute substantially to the disease etiology.
21.5 DISCRIMINATING THE DISEASE-RELATED CNV FROM ALL NORMAL CNVS
Taking all of the above into account, the general strategy to identify disease genes through CNV mapping is best suited for resolving those diseases that are monogenic/oligogenic and involve haploinsufficiency as the disease-causing mechanism (Vissers et al., 2005). Hence the identification of a CNV that is (1)
relatively large, (2) rare, and (3) de novo in a patient provides a strong indicator of clinical significance, as this combination is rare in the normal population (Conrad et al., 2006; Redon et al., 2006; Lupski, 2007). Increases in microarray resolution have, however, been revealing a much higher rate of CNVs per individual than previously thought (McMullan et al., 2009), and an increasing number of genomic loci are showing variable inheritance and penetrance (Ullmann et al., 2007; Mefford et al., 2008; Sharp et al., 2008; van Bon et al., 2009; Hannes et al., 2009). These observations complicate direct identification of the causative CNV (potentially harboring the disease-causing gene) and, as such, argue for methods that predict whether a CNV is benign or disease causing.
21.5.1 Forging Links between Human Phenotypes and Mouse Gene Knockout Models
A first elegant strategy to make this distinction is by forging a link between human MR-associated CNVs and mouse gene knockout models (Webber et al., 2009). In this novel approach, all genes located in 148 MR-associated CNVs were collected and functionally compared to the genes in more than 26,000 CNVs from the general population. The MR-CNVs were found to be significantly enriched in two classes of genes: those whose mouse orthologues, when disrupted, result in either abnormal axon or abnormal dopaminergic neuron morphologies. Additional enrichments highlighted correspondence between relevant mouse phenotypes and secondary presentations, including brain abnormalities, cleft palate, and seizures. Already, this approach has identified 78 new candidate genes contributing to MR and associated phenotypes (Webber et al., 2009), thereby demonstrating the power of exploiting mouse knockout data to better understand the distinction between benign and disease-associated CNVs.
These novel candidate genes within the pathogenic CNV(s) can now be prioritized for high-throughput sequencing in large cohorts of patients with a similar phenotype, potentially leading to the identification of mutations in novel disease genes.
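The enrichment idea behind Section 21.5.1 can be made concrete with a one-sided hypergeometric test: given how many genes in a background set carry a particular mouse knockout phenotype, how surprising is the number seen among genes inside disease-associated CNVs? The sketch below is only an illustration of that idea. The chapter does not state which statistical test Webber et al. used, and all counts here are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which carry the phenotype annotation of interest."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical counts, for illustration only.
N_BACKGROUND = 20_000   # genes with a mouse orthologue
K_ANNOTATED = 200       # genes whose knockout shows the phenotype
N_IN_CNVS = 500         # genes inside disease-associated CNVs
K_OBSERVED = 15         # annotated genes among those 500 (5 expected by chance)

p_value = hypergeom_upper_tail(K_OBSERVED, N_IN_CNVS, K_ANNOTATED, N_BACKGROUND)
print(p_value)
```

A small p-value here would suggest that the phenotype class is over-represented among genes in the disease-associated CNVs; in practice, testing many phenotype classes also calls for multiple-testing correction.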
21.6 NEXT-GENERATION SEQUENCING FOR THE DETECTION OF STRUCTURAL VARIATION
The ultimate resolution to screen the human genome for disease-causing mutations and structural variants is at the basepair level. Major advances in DNA sequencing technologies, collectively termed next-generation sequencing (NGS) technologies, are now enabling the comprehensive analysis of whole genomes (Korbel et al., 2007; Levy et al., 2007; Kidd et al., 2008; Mardis, 2008; Rusk and Kiermer, 2008; Wheeler et al., 2008; Conrad et al., 2009; Ng et al., 2009). Currently, NGS includes three main non-Sanger-based sequencing methods: (1) pyrosequencing (Roche 454 technology), (2) sequencing with reversible terminators (Solexa technology), and (3) sequencing by ligation (SOLiD technology) (Rusk and Kiermer, 2008). The main differences among the methods are read length, number of reads per run, and the costs involved (Mardis, 2008). It is interesting that all NGS methods are in principle capable of detecting single base mutations and structural variation, including both balanced and unbalanced rearrangements.
[Figure 21.4 panel labels: (a) Shotgun; (b) Mate-pairs. Horizontal axis: Position (Mb), 27.56 to 27.58.]
Figure 21.4. Detecting structural variation using next-generation sequencing. a, Detection of structural variation by shotgun sequencing depends directly on the read depth of the individual sequence reads derived from patient DNA. As such, a heterozygous deletion will contain half the number of sequence reads compared to flanking genomic regions for which two genomic copies are present. In addition, split-reads will be present, indicating the breakpoints of the deletion interval. b, Alternatively, structural variation can be detected by sequencing a mate-paired library providing positional information. Deletions can be detected by mate pairs spanning a larger genomic segment than anticipated based on the library size. Light gray boxes represent individual sequence reads. Split-reads in a are indicated by dark gray boxes and connected by black dotted lines. In b, appropriately mapped mate pairs are shown in light gray boxes and connected by solid gray lines, indicating the distance between the pairs. Mate pairs that map outside the expected size distribution are shown in dark gray boxes, connected by black solid lines.
21.6.1 CNV Detection Using Shotgun Sequencing
Copy number variation can be identified using shotgun sequencing by studying local differences in read depth, that is, the number of reads mapping to a specific genomic locus, also referred to as coverage (Fig. 21.4a). Hence, for heterozygous deletions, half the number of reads should be expected compared to the surrounding regions where two copies are present, whereas for duplications 1.5× the number of sequence reads should be present. Additional evidence for copy number variants can be provided by the presence of split-reads, in which one part of the sequence read maps to one side of the deleted or duplicated interval, whereas the remainder of the sequence read maps to the other side of the interval.
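The read-depth reasoning above (half the reads for a heterozygous deletion, 1.5× for a duplication) can be sketched as a simple per-window estimate. This is a minimal illustration that assumes fixed-size windows, a mostly diploid sample, and no correction for GC content or mappability; the read counts and function name are hypothetical.

```python
from statistics import median


def estimate_copy_number(window_depths, ploidy=2):
    """Estimate an integer copy number per window from read counts.

    The median depth is taken as the two-copy baseline, which assumes
    that most windows in the sample are diploid.
    """
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline) for depth in window_depths]


# Hypothetical read counts in nine consecutive windows.
depths = [100, 98, 103, 51, 49, 101, 150, 99, 102]
copies = estimate_copy_number(depths)
print(copies)
```

Windows with roughly half the baseline depth come out as one copy and the window with 1.5× the baseline as three copies, matching the expectation stated in the text.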
21.6.2 CNV Detection Using Mate-Pair Sequencing
Currently, the most specific NGS application to identify all structural variation, including balanced rearrangements, is paired-end mapping or mate-paired library sequencing. This application directly provides detailed positional information, in contrast to array-based methods or shotgun sequencing (Fig. 21.4b) (Korbel et al., 2007; Kidd et al., 2008). For mate-pair runs, genomic DNA is randomly sheared and size selected. After several processing steps, shotgun reads are obtained by sequencing both ends of the fragments in the size-selected DNA library. The positional information comes from the size selection, which constrains the placement of paired reads within the reference genome. Deviations from this expected size distribution may point to deletions, duplications, and insertions (Fig. 21.5). For example, fragments sequenced from a 3-kb library are expected to map ∼3 kb apart when mapped back onto the reference genome, whereas fragments mapping ∼100 kb apart may point to a deletion in the DNA library tested. Similarly, differences in strand location, orientation, or mapping positions to different chromosomes may indicate inversions and translocations (Fig. 21.5d). It is interesting that paired-end mapping strategies have identified numerous structural variants currently not annotated in the reference genome, suggesting that the reference genome is still incomplete (Kidd et al., 2008). With this ultimate resolution to screen the genome, new disease genes await discovery. The labor-intensive candidate-gene approach of sequencing all genes within a deletion interval, as for CHARGE syndrome, is now no longer required. A simple unbiased next-generation sequencing run will immediately lead to the identification of the causative gene. Also, inversions and balanced rearrangements that were difficult to fully sequence in the pre-NGS era will now be analyzed in the greatest detail, potentially unraveling disrupted genes and fusions thereof.
21.7 CONCLUSION
In conclusion, the impact of genomewide structural variation on gene discovery has been enormous. The ability to obtain detailed quantitative copy number information for the whole genome in a single experiment has led to the identification of a significant number of disease genes. Without doubt, further implementation of next-generation sequencing technologies and medical resequencing strategies will continue disease gene identification at a more rapid pace than ever before. Eventually, the vast majority of Mendelian disorders, if not all, may be explained by copy number-dependent gene dosage variations or single base pair substitutions. There are many challenges ahead in the clinical interpretation of structural variation related to disease, especially since not all of these variants will be fully penetrant and not all of these variants will contain functional genes. Nevertheless it can be expected that many more disease genes will be identified through the study of structural genomic variation.
Figure 21.5. The interpretation of structural variation using mate-paired library sequencing. a, Mate-paired library of a given size is mapped to the reference genome. If both tags are interspersed by the expected size insert, no structural variation is present. b, For mate pairs spanning a deletion, the distance between the tags exceeds the expected size distance. c, For insertions in the test sample, the mate pair spans a shorter distance on a reference genome than expected. d, Balanced rearrangements, such as inversions, are detected by the altered orientation of one of the pairs.
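The four cases in this caption can be written down as a small classifier over mapped mate pairs. This is a minimal sketch under stated assumptions: the 3-kb expected distance follows the example in the text, while the 500-bp tolerance, the expected strand orientation of ("+", "-"), and all names and coordinates are hypothetical. Real libraries differ in expected orientation, and a real caller would require several supporting pairs before reporting a variant.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def classify(pair, expected=3000, tolerance=500, expected_strands=("+", "-")):
    """Classify one mapped mate pair against the expected library geometry."""
    if pair.chrom1 != pair.chrom2:
        return "possible translocation"
    if (pair.strand1, pair.strand2) != expected_strands:
        return "possible inversion"
    distance = abs(pair.pos2 - pair.pos1)
    if distance > expected + tolerance:
        return "possible deletion"      # tags map farther apart than expected
    if distance < expected - tolerance:
        return "possible insertion"     # tags map closer together than expected
    return "concordant"


pairs = [
    MatePair("chr8", 1000, "+", "chr8", 4050, "-"),
    MatePair("chr8", 1000, "+", "chr8", 101000, "-"),
    MatePair("chr8", 1000, "+", "chr8", 1900, "-"),
    MatePair("chr8", 1000, "+", "chr8", 4000, "+"),
    MatePair("chr8", 1000, "+", "chr9", 4000, "-"),
]
labels = [classify(p) for p in pairs]
print(labels)
```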
21.8 QUESTIONS AND ANSWERS
1. How are microarray-based technologies used for the detection of chromosomal rearrangements and what are the differences between the most widely used platforms?
2. What is more difficult to detect on a genomic microarray: single copy number deletions or single copy number duplications?
3. Explain how point mutations and microdeletions/duplications may result in the same clinical phenotype and why this is useful to identify novel disease genes.
4. In a clinical diagnostic setting, a patient's DNA is examined using a high-resolution genomic microarray because a CNV is expected to cause the clinical phenotype. Lab results show that several CNVs are present in the patient's DNA. How do you proceed to determine which of these variants is the disease-causing variant?
5. What are the advantages and disadvantages of using next-generation sequencing for the detection of structural variation when compared to microarray-based technologies?

1. By definition, microarray-based technologies use microarrays containing target probes representing anything of interest. For the detection of human chromosomal rearrangements, microarrays that contain target probes representing the entire human genome with an even spacing between the probes are preferred. These target probes can be either genomic DNA fragments (e.g., BAC arrays; two-color hybridization) or oligonucleotides representing SNPs (e.g., SNP arrays; single-color hybridization). Patient DNA is hybridized to this microarray. In case of microarrays using BAC clones or random oligonucleotides, control DNA labeled with a different fluorochrome is simultaneously hybridized onto the same microarray. For SNP arrays, control DNA is hybridized to a separate microarray. Subsequently, hybridization intensities are determined using laser scanners for each target probe on the array. Next, for each target on the array, the ratio of the hybridization intensity of patient DNA (T) to the hybridization intensity of the control DNA (R) is calculated. This ratio is a relative measure for the copy number state of each probe on the array.
Using statistical tools, all target probes are ordered by physical genome position, and deletions and duplications can be determined over the entire human genome.
2. Single copy number duplications are more difficult to detect than single copy number deletions. This is because a microarray measures relative changes in copy number, and this relative change is smaller for single copy number duplications (from two copies to three, a ratio of 3/2, or about 0.58 on a log2 scale) than for single copy number deletions (from two copies to one, a ratio of 1/2, or -1 on a log2 scale).
3. Point mutations may generate premature stop codons leading to haploinsufficiency of the gene in which the mutation is present. The remaining copy of the gene on the other allele is not enough for normal functioning of the gene. Similarly, microdeletions may lead to the physical absence of one copy of the gene, thereby also leading to haploinsufficiency of the gene. Point mutations leading to a gain of function can make the gene product constitutively active or cause overstimulation of downstream target genes. Similarly, duplications lead to an additional copy of the gene, thus potentially leading to the same array of consequences. The fact that deletions and duplications may lead to similar phenotypes as point mutations is used to find disease genes according to the following principle. Point mutations are very difficult to localize in a genomewide manner without a priori information on which genomic region to screen for mutations, at least using traditional Sanger sequencing. Next-generation sequencing technology will soon allow unbiased genomewide mutation screening. However, microdeletions/duplications can already be identified in an unbiased genomewide fashion. Since a phenotype can be caused by both deletions/duplications and mutations, any patient with a deletion/duplication may point to the genomic locus to screen for gene mutations in patients with a similar phenotype but not showing the deletion/duplication.
4. If more than one CNV is found in a patient, there are several steps that will help identify the causative CNV. The first step is to examine whether the CNVs are de novo by testing the parents for the same CNVs. Second, the presence of the CNV must be checked in control cohorts. This can be done using either in-house tested control samples or online databases, such as the Database of Genomic Variants or the HapMap consortium, which collect CNV data on healthy controls. Third, classifier programs and phenotype databases can be used to predict the disease potential of a given CNV. Especially for those patients whose parental samples are not available, classifier prediction programs are of great importance. Current diagnostic practice mostly includes testing parental samples and testing the occurrence of the CNVs in in-house collected CNV databases of healthy controls to determine the disease-causing CNV.
5. The advantages of using next-generation sequencing for the detection of structural variation over microarray-based technology include the detection of balanced rearrangements, such as inversions and balanced translocations, which both remain undetected using microarray technologies. In addition, direct positional information is acquired, which directly points to breakpoints for deletions and to the inserted location for duplications. Also, the exact copy number can be established for duplications, which cannot be obtained using microarray-based technologies. Currently, the disadvantages of next-generation sequencing technology include the relatively high costs involved per experiment and the complex practical and bioinformatic workflow. Also, the biological interpretation of the data is still challenging.
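The first two steps of answer 4 (test for de novo occurrence, then check control cohorts), together with the size criterion from Section 21.5, can be sketched as a simple filter. This is an illustrative sketch only: the record fields, the CNV names, and the rule of requiring zero control carriers are assumptions, not a description of any diagnostic pipeline.

```python
from dataclasses import dataclass


@dataclass
class CNV:
    name: str
    size_bp: int
    in_parents: bool        # True if either parent carries the same CNV
    control_carriers: int   # carriers observed in healthy control data


def prioritize(cnvs):
    """Keep de novo CNVs absent from controls, largest first."""
    kept = [c for c in cnvs if not c.in_parents and c.control_carriers == 0]
    return sorted(kept, key=lambda c: c.size_bp, reverse=True)


# Hypothetical CNVs detected in one patient.
patient_cnvs = [
    CNV("cnv_1", 120_000, in_parents=True, control_carriers=35),
    CNV("cnv_2", 2_300_000, in_parents=False, control_carriers=0),
    CNV("cnv_3", 45_000, in_parents=False, control_carriers=12),
    CNV("cnv_4", 600_000, in_parents=False, control_carriers=0),
]

ranked = [c.name for c in prioritize(patient_cnvs)]
print(ranked)
```

A hard filter like this is a deliberate simplification; as Section 21.5 notes, some loci show variable inheritance and penetrance, so an inherited CNV is not automatically benign.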
21.9 ACKNOWLEDGMENTS
This work was supported by grants from the Netherlands Organisation for Health Research and Development (ZonMW 916.86.016 to LELMV, ZonMW
917.66.363 to JAV) and grants from the AnEUploidy project (LSHG-CT-2006-037627 to JAV) supported by the European Commission under FP6.
21.10 REFERENCES
Blake KD, Davenport SL, Hall BD, Hefner MA, Pagon RA, Williams MS, Lin AE, Graham JM Jr. (1998). CHARGE association: an update and review for the primary pediatrician. Clin Pediatr (Phila) 37(3):159–73.
Brown SA, Warburton D, Brown LY, Yu CY, Roeder ER, Stengel-Rutkowski S, Hennekam RC, Muenke M. (1998). Holoprosencephaly due to mutations in ZIC2, a homologue of Drosophila odd-paired. Nat Genet (2):180–83.
Brunetti-Pierri N, Berg JS, Scaglia F, Belmont J, Bacino CA, Sahoo T, Lalani SR, Graham B, Lee B, Shinawi M, Shen J, Kang SH, Pursley A, Lotze T, Kennedy G, Lansky-Shafer S, Weaver C, Roeder ER, Grebe TA, Arnold GL, Hutchison T, Reimschisel T, Amato S, Geragthy MT, Innis JW, Obersztyn E, Nowakowska B, Rosengren SS, Bader PI, Grange DK, Naqvi S, Garnica AD, Bernes SM, Fong CT, Summers A, Walters WD, Lupski JR, Stankiewicz P, Cheung SW, Patel A. (2008). Recurrent reciprocal 1q21.1 deletions and duplications associated with microcephaly or macrocephaly and developmental and behavioral abnormalities. Nat Genet 40(12):1466–71.
Budarf ML, Emanuel BS. (1997). Progress in the autosomal segmental aneusomy syndromes (SASs): single or multi-locus disorders? Hum Mol Genet 6(10):1657–65.
Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. (2006). A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38(1):75–81.
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, MacDonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME. (2009). Origins and functional impact of copy number variation in the human genome. Nature 464(7239):704–12.
de Vries BB, Pfundt R, Leisink M, Koolen DA, Vissers LE, Janssen IM, Reijmersdal S, Nillesen WM, Huys EH, Leeuw N, Smeets D, Sistermans EA, Feuth T, van Ravenswaaij-Arts CM, Geurts van Kessel A, Schoenmakers EF, Brunner HG, Veltman JA. (2005). Diagnostic genome profiling in mental retardation. Am J Hum Genet 77(4):606–16. de Vries BB, Winter R, Schinzel A, van Ravenswaaij-Arts C. (2003). Telomeres: a diagnosis at the end of the chromosomes. J Med Genet 40(6):385–98. El-Hattab AW, Smolarek TA, Walker ME, Schorry EK, Immken LL, Patel G, Abbott MA, Lanpher BC, Ou Z, Kang SH, Patel A, Scaglia F, Lupski JR, Cheung SW, Stankiewicz P. (2009). Redefined genomic architecture in 15q24 directed by patient deletion/duplication breakpoint mapping. Hum Genet 126(4):589–602. Flint J, Wilkie AO, Buckle VJ, Winter RM, Holland AJ, McDermid HE. (1995). The detection of subtelomeric chromosomal rearrangements in idiopathic mental retardation. Nat Genet 9(2):132–40.
c21.indd 462
1/12/2011 9:44:41 AM
REFERENCES
463
Forozan F, Karhu R, Kononen J, Kallioniemi A, Kallioniemi OP. (1997). Genome screening by comparative genomic hybridization. Trends Genet 13(10):405–09. Friedman JM, Baross A, Delaney AD, Ally A, Arbour L, Armstrong L, Asano J, Bailey DK, Barber S, Birch P, Brown-John M, Cao M, Chan S, Charest DL, Farnoud N, Fernandes N, Flibotte S, Go A, Gibson WT, Holt RA, Jones SJ, Kennedy GC, Krzywinski M, Langlois S, Li HI, McGillivray BC, Nayar T, Pugh TJ, RajcanSeparovic E, Schein JE, Schnerch A, Siddiqui A, Van Allen MI, Wilson G, Yong SL, Zahir F, Eydoux P, Marra MA. (2006). Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation. Am J Hum Genet 79(3):500–13. Friedman JI, Vrijenhoek T, Markx S, Janssen IM, van de Stelt I, Faas BH, Knoers NV, Cahn W, Kahn RS, Edelmann L, Davis KL, Silverman JM, Brunner HG, Geurts van Kessel A, Wijmenga C, Ophoff RA, Veltman JA. (2008). CNTNAP2 gene dosage variation is associated with schizophrenia and epilepsy. Mol Psychiatry 13(3):261–66. Gijsbers AC, Lew JY, Bosch CA, Schuurs-Hoeijmakers JH, van HA, den Hollander NS, Kant SG, Bijlsma EK, Breuning MH, Bakker E, Ruivenkamp CA. (2009). A new diagnostic workflow for patients with mental retardation and/or multiple congenital abnormalities: test arrays first. Eur J Hum Genet 17(11):1394–402. Gripp KW, Wotton D, Edwards MC, Roessler E, Ades L, Meinecke P, Richieri-Costa A, Zackai EH, Massague J, Muenke M, Elledge SJ. (2000). Mutations in TGIF cause holoprosencephaly and link NODAL signalling to human neural axis determination. Nat Genet 25(2):205–08. Harada N, Visser R, Dawson A, Fukamachi M, Iwakoshi M, Okamoto N, Kishino T, Niikawa N, Matsumoto N. (2004). A 1-Mb critical region in six patients with 9q34.3 terminal deletion syndrome. J Hum Genet 49(8):440–44. Hannes FD, Sharp AJ, Mefford HC, de RT, Ruivenkamp CA, Breuning MH, Fryns JP, Devriendt K, Van BG, Vogels A, Stewart H, Hennekam RC, Cooper GM, Regan R, Knight SJ, Eichler EE, Vermeesch JR. (2009). 
Recurrent reciprocal deletions and duplications of 16p13.11: the deletion is a risk factor for MR/MCA while the duplication may be a rare benign variant. J Med Genet 46(4):223–32. Herrera L, Kakati S, Gibas L, Pietrzak E, Sandberg AA. (1986). Gardner syndrome in a man with an interstitial deletion of 5q. Am J Med Genet 25(3):473–76. Hurst JA, Meinecke P, Baraitser M. (1991). Balanced t(6;8)(6p8p;6q8q) and the CHARGE association. J Med Genet 28(1):54–5. Ingason A, Rujescu D, Cichon S, Sigurdsson E, Sigmundsson T, Pietilainen OPH, Buizer-Voskamp JE, Strengman E, Francks C, Muglia P, Gylfason A, Gustafsson O, Olason PI, Steinberg S, Hansen T, Jakobsen KD, Rasmussen HB, Giegling I, Moller HJ, Hartmann A, Crombie C, Fraser G, Walker N, Lonnqvist J, Suvisaari J, TuulioHenriksson A, Bramon E, Kiemeney LA, Franke B, Murray R, Vassos E, Toulopoulou T, Muhleisen TW, Tosato S, Ruggeri M, Djurovic S, Andreassen OA, Zhang Z, Werge T, Ophoff RA, Rietschel M, Nothen MM, Petursson H, Stefansson H, Peltonen L, Collier D, Stefansson K, Clair DMS. (2009). Copy number variations of chromosome 16p13.1 region associated with schizophrenia. Mol Psychiatry doi 10.1038/mp. 2009.101. Juyal RC, Figuera LE, Hauge X, Elsea SH, Lupski JR, Greenberg F, Baldini A, Patel PI. (1996). Molecular analyses of 17p11.2 deletions in 62 Smith-Magenis syndrome patients. Am J Hum Genet 58(5):998–1007.
c21.indd 463
1/12/2011 9:44:41 AM
464
IMPACT OF GENOMEWIDE STRUCTURAL VARIATION ON GENE DISCOVERY
Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D. (1992).Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 258(5083):818–21. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tuzun E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, Smith JD, Korn JM, McCarroll SA, Altshuler DA, Peiffer DA, Dorschner M, Stamatoyannopoulos J, Schwartz D, Nickerson DA, Mullikin JC, Wilson RK, Bruhn L, Olson MV, Kaul R, Smith DR, Eichler EE. (2008). Mapping and sequencing of structural variation from eight human genomes. Nature 453(7191):56–64. Kirov G, Gumus D, Chen W, Norton N, Georgieva L, Sari M, O’Donovan MC, Erdogan F, Owen MJ, Ropers HH, Ullmann R. (2008). Comparative genome hybridization suggests a role for NRXN1 and APBA2 in schizophrenia. Hum Mol Genet 17(3): 458–65. Kirov G, Zaharieva I, Georgieva L, Moskvina V, Nikolov I, Cichon S, Hillmer A, Toncheva D, Owen MJ, O’Donovan MC. (2009). A genome-wide association study in 574 schizophrenia trios using DNA pooling. Mol Psychiatry 14(8):796–803. Kleefstra T, Smidt M, Banning MJ, Oudakker AR, Van Esch H, de Brouwer AP, Nillesen W, Sistermans EA, Hamel BC, de Bruijn D, Fryns JP, Yntema HG, Brunner HG, de Vries BB, van Bokhoven H. (2005). Disruption of the gene Euchromatin Histone Methyl Transferase1 (Eu-HMTase1) is associated with the 9q34 subtelomeric deletion syndrome. J Med Genet 42(4):299–306. Kleefstra T, Koolen DA, Nillesen WM, de Leeuw N, Hamel BC, Veltman JA, Sistermans EA, van Bokhoven H, van Ravenswaaij C, de Vries BB. (2006). Interstitial 2.2 Mb deletion at 9q34 in a patient with mental retardation but without classical features of the 9q subtelomeric deletion syndrome. Am J Med Genet A 140(6):618–23. 
Kleefstra T, van Zelst-Stams WA, Nillesen WM, Cormier-Daire V, Houge G, Foulds N, van Dooren M, Willemsen MH, Pfundt R, Turner A, Wilson M, McGaughran J, Rauch A, Zenker M, Adam M, Innes M, Davies C, Gonzalez-Meneses LA, Casalone R, Weber A, Brueton LA, Delicado NA, Palomares BM, Venselaar H, Stegmann SP, Yntema HG, van Bokhoven H, Brunner HG. (2009). Further clinical and molecular delineation of the 9q subtelomeric deletion syndrome supports a major contribution of EHMT1 haploinsufficiency to the core phenotype. J Med Genet 46(9):598–606. Knight SJ, Regan R, Nicod A, Horsley SW, Kearney L, Homfray T, Winter RM, Bolton P, Flint J. (1999). Subtle chromosomal rearrangements in children with unexplained mental retardation. Lancet 354(9191):1676–81. Koolen DA, Pfundt R, de LN, Hehir-Kwa JY, Nillesen WM, Neefs I, Scheltinga I, Sistermans E, Smeets D, Brunner HG, Geurts van Kessel A, Veltman JA, de Vries BB. (2009).Genomic microarrays in mental retardation: a practical workflow for diagnostic applications. Hum Mutat 30(3):283–92. Koolen DA, Vissers LE, Pfundt R, de LN, Knight SJ, Regan R, Kooy RF, Reyniers E, Romano C, Fichera M, Schinzel A, Baumer A, Anderlid BM, Schoumans J, Knoers NV, Geurts van Kessel A, Sistermans EA, Veltman JA, Brunner HG, de Vries BB. (2006). A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat Genet 38(9):999–1001.
c21.indd 464
1/12/2011 9:44:41 AM
REFERENCES
465
Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, Kim PM, Palejev D, Carriero NJ, Du L, Taillon BE, Chen Z, Tanzer A, Saunders AC, Chi J, Yang F, Carter NP, Hurles ME, Weissman SM, Harkins TT, Gerstein MB, Egholm M, Snyder M. (2007). Paired-end mapping reveals extensive structural variation in the human genome. Science 318(5849):420–26. Krantz ID, Rand EB, Genin A, Hunt P, Jones M, Louis AA, Graham JM, Jr., Bhatt S, Piccoli DA, Spinner NB. (1997). Deletions of 20p12 in Alagille syndrome: frequency and molecular characterization. Am J Med Genet 70(1):80–86. Lalani SR, Stockton DW, Bacino C, Molinari LM, Glass NL, Fernbach SD, Towbin JA, Craigen WJ, Graham JM Jr., Hefner MA, Lin AE, McBride KL, Davenport SL, Belmont JW. (2003). Toward a genetic etiology of CHARGE syndrome: I. A systematic scan for submicroscopic deletions. Am J Med Genet A 118A(3):260–66. Lele KP, Penrose LS, Stallard HB. (1963). Chromosome deletion in a case of retinoblastoma. Ann Hum Genet 27:171–74. Lesnik Oberstein SA, Kriek M, White SJ, Kalf ME, Szuhai K, den Dunnen JT, Breuning MH, Hennekam RC. (2006). Peters Plus syndrome is caused by mutations in B3GALTL, a putative glycosyltransferase. Am J Hum Genet 79(3):562–66. Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G, Lin Y, MacDonald JR, Pang AW, Shago M, Stockwell TB, Tsiamouri A, Bafna V, Bansal V, Kravitz SA, Busam DA, Beeson KY, McIntosh TC, Remington KA, Abril JF, Gill J, Borman J, Rogers YH, Frazier ME, Scherer SW, Strausberg RL, Venter JC. (2007). The diploid genome sequence of an individual human. PLoS Biol 5(10):e254. Lichter P, Joos S, Bentz M, Lampel S. (2000). Comparative genomic hybridization: uses and limitations. Semin Hematol 37(4):348–57. Lupski JR. (2007). Genomic rearrangements and sporadic disease. Nat Genet 39(7 suppl):S43–S47. Mardis ER. (2008). Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402. 
Martin DM, Probst FJ, Fox SE, Schimmenti LA, Semina EV, Hefner MA, Belmont JW, Camper SA. (2002). Exclusion of PITX2 mutations as a major cause of CHARGE association. Am J Med Genet 111(1):27–30. McMullan DJ, Bonin M, Hehir-Kwa JY, de Vries BB, Dufke A, Rattenberry E, Steehouwer M, Moruz L, Pfundt R, de LN, Riess A, tug-Teber O, Enders H, Singer S, Grasshoff U, Walter M, Walker JM, Lamb CV, Davison EV, Brueton L, Riess O, Veltman JA. (2009). Molecular karyotyping of patients with unexplained mental retardation by SNP arrays: A multicenter study. Hum Mutat 30(7):1082–92. Mefford HC, Sharp AJ, Baker C, Itsara A, Jiang Z, Buysse K, Huang S, Maloney VK, Crolla JA, Baralle D, Collins A, Mercer C, Norga K, de RT, Devriendt K, Bongers EM, de LN, Reardon W, Gimelli S, Bena F, Hennekam RC, Male A, Gaunt L, Clayton-Smith J, Simonic I, Park SM, Mehta SG, Nik-Zainal S, Woods CG, Firth HV, Parkin G, Fichera M, Reitano S, Lo GM, Li KE, Casuga I, Broomer A, Conrad B, Schwerzmann M, Raber L, Gallati S, Striano P, Coppola A, Tolmie JL, Tobias ES, Lilley C, Armengol L, Spysschaert Y, Verloo P, De CA, Goossens L, Mortier G, Speleman F, van BE, Nelen MR, Hochstenbach R, Poot M, Gallagher L, Gill M, McClellan J, King MC, Regan R, Skinner C, Stevenson RE, Antonarakis SE, Chen C, Estivill X, Menten B, Gimelli G, Gribble S, Schwartz S, Sutcliffe JS, Walsh T,
c21.indd 465
1/12/2011 9:44:41 AM
466
IMPACT OF GENOMEWIDE STRUCTURAL VARIATION ON GENE DISCOVERY
Knight SJ, Sebat J, Romano C, Schwartz CE, Veltman JA, de Vries BB, Vermeesch JR, Barber JC, Willatt L, Tassabehji M, Eichler EE. (2008). Recurrent rearrangements of chromosome 1q21.1 and variable pediatric phenotypes. N Engl J Med 359(16):1685–99. Menten B, Maas N, Thienpont B, Buysse K, Vandesompele J, Melotte C, de RT, Van VS, Balikova I, Backx L, Janssens S, De PA, De MB, Moreau Y, Marynen P, Fryns JP, Mortier G, Devriendt K, Speleman F, Vermeesch JR. (2006). Emerging patterns of cryptic chromosomal imbalance in patients with idiopathic mental retardation and multiple congenital anomalies: a new series of 140 patients and review of published reports. J Med Genet 43(8):625–33. Mimault C, Giraud G, Courtois V, Cailloux F, Boire JY, Dastugue B, Boespflug-Tanguy O. (1999). Proteolipoprotein gene analysis in 82 patients with sporadic PelizaeusMerzbacher disease: duplications, the major cause of the disease, originate more frequently in male germ cells, but point mutations do not. The Clinical European Network on Brain Dysmyelinating Disease. Am J Hum Genet 65(2):360–69. Munke M. (1989). Clinical, cytogenetic, and molecular approaches to the genetic heterogeneity of holoprosencephaly. Am J Med Genet 34(2):237–45. Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. (2009). Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461(7261):272–76. Osborne LR, Li M, Pober B, Chitayat D, Bodurtha J, Mandel A, Costa T, Grebe T, Cox S, Tsui LC, Scherer SW. (2001). A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome. Nat Genet 29(3):321–25. Owen MJ, Craddock N, Jablensky A. (2007). The genetic deconstruction of psychosis. Schizophr Bull 33(4):905–11. Pagon RA, Graham JM, Jr., Zonana J, Yong SL. (1981). Coloboma, congenital heart disease, and choanal atresia with multiple anomalies: CHARGE association. 
J Pediatr 99(2):223–27. Petrij F, Dauwerse HG, Blough RI, Giles RH, van der Smagt JJ, Wallerstein R, Maaswinkel-Mooy PD, van Karnebeek CD, van Ommen GJ, van HA, Rubinstein JH, Saal HM, Hennekam RC, Peters DJ, Breuning MH. (2000). Diagnostic analysis of the Rubinstein-Taybi syndrome: five cosmids should be used for microdeletion detection and low number of protein truncating mutations. J Med Genet 37(3): 168–76. Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, Dairkee SH, Ljung BM, Gray JW, Albertson DG. (1998). High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 20(2):207–11. Potocki L, Bi W, Treadwell-Deering D, Carvalho CM, Eifert A, Friedman EM, Glaze D, Krull K, Lee JA, Lewis RA, Mendoza-Londono R, Robbins-Furman P, Shaw C, Shi X, Weissenberger G, Withers M, Yatsenko SA, Zackai EH, Stankiewicz P, Lupski JR. (2007).Characterization of Potocki-Lupski syndrome (dup(17)(p11.2p11.2)) and delineation of a dosage-sensitive critical interval that can convey an autism phenotype. Am J Hum Genet 80(4):633–49. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos
c21.indd 466
1/12/2011 9:44:41 AM
REFERENCES
467
M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME. (2006). Global variation in copy number in the human genome. Nature 444(7118):444–54. Riccardi VM, Sujansky E, Smith AC, Francke U. (1978). Chromosomal imbalance in the Aniridia-Wilms’ tumor association: 11p interstitial deletion. Pediatrics 61(4):604–10. Roessler E, Belloni E, Gaudenz K, Jay P, Berta P, Scherer SW, Tsui LC, Muenke M. (1996). Mutations in the human Sonic Hedgehog gene cause holoprosencephaly. Nat Genet 14(3):357–60. Rosenberg C, Knijnenburg J, Bakker E, Vianna-Morgante AM, Sloos W, Otto PA, Kriek M, Hansson K, Krepischi-Santos AC, Fiegler H, Carter NP, Bijlsma EK, van Haeringen A, Szuhai K, Tanke HJ. (2006). Array-CGH detection of micro rearrangements in mentally retarded individuals: clinical significance of imbalances present both in affected children and normal parents. J Med Genet 43(2):180–86. Rusk N, Kiermer V. (2008). Primer: sequencing—the next generation. Nat Meth 5(1):15. Sanders AR, Duan J, Levinson DF, Shi J, He D, Hou C, Burrell GJ, Rice JP, Nertney DA, Olincy A, Rozic P, Vinogradov S, Buccola NG, Mowry BJ, Freedman R, Amin F, Black DW, Silverman JM, Byerley WF, Crowe RR, Cloninger CR, Martinez M, Gejman PV. (2008). No significant association of 14 candidate genes with schizophrenia in a large European ancestry sample: implications for psychiatric genetics. Am J Psychiatry 165(4):497–506. Schmickel RD. (1986). Contiguous gene syndromes: a component of recognizable syndromes. J Pediatr 109(2):231–41. Schoumans J, Ruivenkamp C, Holmberg E, Kyllerman M, Anderlid BM, Nordenskjold M. (2005). 
Detection of chromosomal imbalances in children with idiopathic mental retardation by array based comparative genomic hybridisation (array-CGH). J Med Genet 42(9):699–705. Shaikh TH, Kurahashi H, Emanuel BS. (2001). Evolutionarily conserved low copy repeats (LCRs) in 22q11 mediate deletions, duplications, translocations, and genomic instability: an update and literature review. Genet Med 3(1):6–13. Shao L, Shaw CA, Lu XY, Sahoo T, Bacino CA, Lalani SR, Stankiewicz P, Yatsenko SA, Li Y, Neill S, Pursley AN, Chinault AC, Patel A, Beaudet AL, Lupski JR, Cheung SW. (2008). Identification of chromosome abnormalities in subtelomeric regions by microarray analysis: a study of 5,380 cases. Am J Med Genet A 146A(17):2242–51. Sharp AJ, Hansen S, Selzer RR, Cheng Z, Regan R, Hurst JA, Stewart H, Price SM, Blair E, Hennekam RC, Fitzpatrick CA, Segraves R, Richmond TA, Guiver C, Albertson DG, Pinkel D, Eis PS, Schwartz S, Knight SJ, Eichler EE. (2006). Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat Genet 38(9):1038–42. Sharp AJ, Mefford HC, Li K, Baker C, Skinner C, Stevenson RE, Schroer RJ, Novara F, De GM, Ciccone R, Broomer A, Casuga I, Wang Y, Xiao C, Barbacioru C, Gimelli G, Bernardina BD, Torniero C, Giorda R, Regan R, Murday V, Mansour S, Fichera M, Castiglia L, Failla P, Ventura M, Jiang Z, Cooper GM, Knight SJ, Romano C, Zuffardi O, Chen C, Schwartz CE, Eichler EE. (2008). A recurrent
c21.indd 467
1/12/2011 9:44:41 AM
468
IMPACT OF GENOMEWIDE STRUCTURAL VARIATION ON GENE DISCOVERY
15q13.3 microdeletion syndrome associated with mental retardation and seizures. Nat Genet 40(3):322–28. Sharp AJ, Selzer RR, Veltman JA, Gimelli S, Gimelli G, Striano P, Coppola A, Regan R, Price SM, Knoers NV, Eis PS, Brunner HG, Hennekam RC, Knight SJ, de Vries BB, Zuffardi O, Eichler EE. (2007). Characterization of a recurrent 15q24 microdeletion syndrome. Hum Mol Genet 16(5):567–72. Shaw-Smith C, Pittman AM, Willatt L, Martin H, Rickman L, Gribble S, Curley R, Cumming S, Dunn C, Kalaitzopoulos D, Porter K, Prigmore E, Krepischi-Santos AC, Varela MC, Koiffmann CP, Lees AJ, Rosenberg C, Firth HV, de SR, Carter NP. (2006). Microdeletion encompassing MAPT at chromosome 17q21.3 is associated with developmental delay and learning disability. Nat Genet 38(9):1032–37. Shaw-Smith C, Redon R, Rickman L, Rio M, Willatt L, Fiegler H, Firth H, Sanlaville D, Winter R, Colleaux L, Bobrow M, Carter NP. (2004). Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features. J Med Genet 41(4):241–48. Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P. (1997). Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer 20(4):399–407. Stewart DR, Huang A, Faravelli F, Anderlid BM, Medne L, Ciprero K, Kaur M, Rossi E, Tenconi R, Nordenskjold M, Gripp KW, Nicholson L, Meschino WS, Capua E, Quarrell OW, Flint J, Irons M, Giampietro PF, Schowalter DB, Zaleski CA, Malacarne M, Zackai EH, Spinner NB, Krantz ID. (2004). Subtelomeric deletions of chromosome 9q: a novel microdeletion syndrome. Am J Med Genet A 128A(4):340–51. Tellier AL, Amiel J, Delezoide AL, Audollent S, Auge J, Esnault D, Encha-Razavi F, Munnich A, Lyonnet S, Vekemans M, ttie-Bitach T. (2000). Expression of the PAX2 gene in human embryos and exclusion in the CHARGE syndrome. 
Am J Med Genet 93(2):85–88. Tellier AL, Cormier-Daire V, Abadie V, Amiel J, Sigaudy S, Bonnet D, de LonlayDebeney P, Morrisseau-Durand MP, Hubert P, Michel JL, Jan D, Dollfus H, Baumann C, Labrune P, Lacombe D, Philip N, LeMerrer M, Briard ML, Munnich A, Lyonnet S. (1998). CHARGE syndrome: report of 47 cases and review. Am J Med Genet 76(5):402–09. Ullmann R, Turner G, Kirchhoff M, Chen W, Tonge B, Rosenberg C, Field M, ViannaMorgante AM, Christie L, Krepischi-Santos AC, Banna L, Brereton AV, Hill A, Bisgaard AM, Muller I, Hultschig C, Erdogan F, Wieczorek G, Ropers HH. (2007). Array CGH identifies reciprocal 16p13.1 duplications and deletions that predispose to autism and/or mental retardation. Hum Mutat 28(7):674–82. van Bon BW, Mefford HC, Menten B, Koolen DA, Sharp AJ, Nillesen WM, Innis JW, de Ravel TJ, Mercer CL, Fichera M, Stewart H, Connell LE, Ounap K, Lachlan K, Castle B, van der Aa N, van Ravenswaaij C, Nobrega MA, Serra-Juhe C, Simonic I, de Leeuw N, Pfundt R, Bongers EM, Baker C, Finnemore P, Huang S, Maloney VK, Crolla JA, van KM, Elia M, Vandeweyer G, Fryns JP, Janssens S, Foulds N, Reitano S, Smith K, Parkel S, Loeys B, Woods CG, Oostra A, Speleman F, Pereira AC, Kurg A, Willatt L, Knight SJ, Vermeesch JR, Romano C, Barber JC, Mortier G, PerezJurado LA, Kooy F, Brunner HG, Eichler EE, Kleefstra T, de Vries BB. (2009).
c21.indd 468
1/12/2011 9:44:41 AM
REFERENCES
469
Further delineation of the 15q13 microdeletion and duplication syndromes: A clinical spectrum varying from non-pathogenic to a severe outcome. J Med Genet 46(8):511–23. Van Esch H, Bauters M, Ignatius J, Jansen M, Raynaud M, Hollanders K, Lugtenberg D, Bienvenu T, Jensen LR, Gecz J, Moraine C, Marynen P, Fryns JP, Froyen G. (2005). Duplication of the MECP2 region is a frequent cause of severe mental retardation and progressive neurological symptoms in males. Am J Hum Genet 77(3):442–53. Van Prooijen-Knegt AC, Van Hoek JF, Bauman JG, Van DP, Wool IG, Van der PM. (1982). In situ hybridization of DNA sequences in human metaphase chromosomes visualized by an indirect fluorescent immunocytochemical procedure. Exp Cell Res 141(2):397–407. Vissers LE, de Vries BB, Osoegawa K, Janssen IM, Feuth T, Choy CO, Straatman H, van der Vliet WA, Huys EH, van RA, Smeets D, van Ravenswaaij-Arts CM, Knoers NV, van de Burgt I, de Jong PJ, Brunner HG, Geurts van Kessel A, Schoenmakers EF, Veltman JA. (2003). Array-based comparative genomic hybridization for the genomewide detection of submicroscopic chromosomal abnormalities. Am J Hum Genet 73(6):1261–70. Vissers LE, van Ravenswaaij CM, Admiraal R, Hurst JA, de Vries BB, Janssen IM, van der Vliet WA, Huys EH, de Jong PJ, Hamel BC, Schoenmakers EF, Brunner HG, Veltman JA, Geurts van Kessel A. (2004). Mutations in a new member of the chromodomain gene family cause CHARGE syndrome. Nat Genet 36(9):955–57. Vissers LE, Veltman JA, Geurts van Kessel A, Brunner HG. (2005). Identification of disease genes by whole genome CGH arrays. Hum Mol Genet 14 (Spec No. 2): R215–R23. Vrijenhoek T, Buizer-Voskamp JE, van de Stelt I, Strengman E, Sabatti C, Geurts van Kessel A, Brunner HG, Ophoff RA, Veltman JA. (2008). Recurrent CNVs disrupt three candidate genes in schizophrenia patients. Am J Hum Genet 83(4):504–10. Wagenstaller J, Spranger S, Lorenz-Depiereux B, Kazmierczak B, Nathrath M, Wahl D, Heye B, Glaser D, Liebscher V, Meitinger T, Strom TM. 
(2007). Copy-number variations measured by single-nucleotide-polymorphism oligonucleotide arrays in patients with mental retardation. Am J Hum Genet 81(4):768–79. Wallis DE, Roessler E, Hehr U, Nanni L, Wiltshire T, Richieri-Costa A, GillessenKaesbach G, Zackai EH, Rommens J, Muenke M. (1999). Mutations in the homeodomain of the human SIX3 gene cause holoprosencephaly. Nat Genet 22(2): 196–98. Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, Nord AS, Kusenda M, Malhotra D, Bhandari A, Stray SM, Rippey CF, Roccanova P, Makarov V, Lakshmi B, Findling RL, Sikich L, Stromberg T, Merriman B, Gogtay N, Butler P, Eckstrand K, Noory L, Gochman P, Long R, Chen Z, Davis S, Baker C, Eichler EE, Meltzer PS, Nelson SF, Singleton AB, Lee MK, Rapoport JL, King MC, Sebat J. (2008). Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 320(5875):539–43. Webber C, Hehir-Kwa JY, Nguyen DQ, de Vries BB, Veltman JA, Ponting CP. (2009). Forging links between human mental retardation-associated CNVs and mouse gene knockout models. PLoS Genet 5(6):e1000531. Weiss LA, Shen Y, Korn JM, Arking DE, Miller DT, Fossdal R, Saemundsen E, Stefansson H, Ferreira MA, Green T, Platt OS, Ruderfer DM, Walsh CA, Altshuler
c21.indd 469
1/12/2011 9:44:41 AM
470
IMPACT OF GENOMEWIDE STRUCTURAL VARIATION ON GENE DISCOVERY
D, Chakravarti A, Tanzi RE, Stefansson K, Santangelo SL, Gusella JF, Sklar P, Wu BL, Daly MJ. (2008). Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 358(7):667–75. Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189):872–76. Willatt L, Cox J, Barber J, Cabanas ED, Collins A, Donnai D, FitzPatrick DR, Maher E, Martin H, Parnau J, Pindar L, Ramsay J, Shaw-Smith C, Sistermans EA, Tettenborn M, Trump D, de Vries BB, Walker K, Raymond FL. (2005). 3q29 microdeletion syndrome: clinical and molecular characterization of a new syndrome. Am J Hum Genet 77(1):154–60. Xu B, Roos JL, Levy S, van Rensburg EJ, Gogos JA, Karayiorgou M. (2008). Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet 40(7):880–85. Yatsenko SA, Cheung SW, Scott DA, Nowaczyk MJ, Tarnopolsky M, Naidu S, Bibat G, Patel A, Leroy JG, Scaglia F, Stankiewicz P, Lupski JR. (2005). Deletion 9q34.3 syndrome: genotype-phenotype correlations and an extended deletion in a patient with features of Opitz C trigonocephaly. J Med Genet 42(4):328–35. Zhang ZF, Ruivenkamp C, Staaf J, Zhu H, Barbaro M, Petillo D, Khoo SK, Borg A, Fan YS, Schoumans J. (2008). Detection of submicroscopic constitutional chromosome aberrations in clinical diagnostics: a validation of the practical performance of different array platforms. Eur J Hum Genet 16(7):786–92.
c21.indd 470
1/12/2011 9:44:41 AM
CHAPTER 22
Impact of Whole Genome Protein Analysis on Gene Discovery of Disease Models
SHENG ZHANG, YONG YANG, and THEODORE W. THANNHAUSER
Contents
22.1 Introduction
22.2 Proteomics Strategies and Workflow
    22.2.1 MS Instrumentation for Proteomics
    22.2.2 Importance of Experimental Design for Proteomics
    22.2.3 Sample Preparation and Separation Technologies
    22.2.4 MS Analytical Strategies for Proteomics
    22.2.5 Protein Identification by Database Searching
    22.2.6 Quantitative Proteomics
    22.2.7 PTM Characterization: Phosphoproteome
22.3 Biological Impact of Proteomic Technologies
    22.3.1 Understanding Complex Biological Processes
    22.3.2 Proteomics-Driven Discovery of Cancer Biomarkers
    22.3.3 Proteogenomics: From Proteome to Genome
22.4 Conclusions and Future Perspectives
22.5 Questions and Answers
22.6 Acknowledgments
22.7 References
22.1 INTRODUCTION

The emergence of technologies that facilitate genomewide data acquisition (DNA sequence, mRNA expression, protein expression, and associated
metabolite profiles, etc.) heralds a new paradigm for functional genomics, one that will allow a deep understanding of gene regulation and other complex molecular and cellular processes (Cox et al., 2007; de Hoog et al., 2004). Over the past decade, comprehensive genomic sequence information has become available for an ever-increasing number of species, the most significant milestone being the completion of the Human Genome Project (Collins et al., 2004; Venter et al., 2001; Lander et al., 2001). The development of DNA microarray technologies to study the transcriptional regulation of genes at the messenger RNA level has permitted genomewide expression analysis in response to various stimuli. This has provided unparalleled opportunities for biomarker discovery and has given rise to a new discipline, transcriptomics. DNA microarrays have become popular because they allow the expression of thousands of genes to be evaluated in parallel and make it possible to assess interactions between expressed genes. However, mRNA levels do not provide a complete picture of cellular function. First, the vast majority of cellular functions involve the interaction of proteins. Second, protein expression levels depend not only on transcript levels but also on translational efficiency and regulated degradation (Gygi et al., 1999; Lu et al., 2007). Third, proteins function at specific subcellular locations and are subject to posttranslational modification (PTM), which is often required for function, in ways that cannot be predicted from transcript expression levels or from the genome sequence. Therefore, it is essential to supplement DNA microarray data with direct measurements at the protein level. In many cases it is proteins that act as the cellular machinery and directly carry out the function of genes through enzymatic catalysis, molecular signaling, and physical/chemical interactions.
It is at the protein level that most regulatory processes take place, where the primary disease processes occur, and where most drugs act. Unfortunately, analogous protein array technologies are much more difficult to implement because proteins cannot be synthesized or replicated as easily as nucleic acids. Furthermore, the physical properties of proteins vary much more widely than those of nucleic acids, making protein–protein binding less predictable and more subject to nonspecific interactions. Protein arrays also require an antibody for each protein of interest. Since antibodies recognize only a portion (the epitope) of the target molecule, they tend to cross-react with similar or coincidentally homologous proteins, and they are generally unable to distinguish between microheterogeneous forms of the target protein (arising from posttranslational modifications, for example). As a result, the rapidly developing field of proteomics has largely been limited to the systematic study of protein expression in particular cell or tissue types as a function of time and of biological or environmental stress. These studies typically involve the identification, quantification, and characterization of proteins but are often extended to determine the subcellular localization of proteins of interest.

Parallel to the success in genomics and transcriptomics, considerable progress has been made in proteomics over the past decade. Mass spectrometry (MS) has emerged as an indispensable tool for the investigation of the protein components in biological systems (Han et al., 2008; Domon et al., 2006; Yates, 2004; Aebersold and Mann, 2003). Advances in MS, together with new methods of biochemical separation, protein tagging, and chemical labeling and the development of new bioinformatics tools, have allowed initial efforts focused on protein identification to evolve. Proteomics is now being applied to high-throughput quantitative analysis (Han et al., 2008; Ong and Mann, 2005; Ong et al., 2003), to the characterization of protein modification states (Wiesner et al., 2008; Mirza and Olivier, 2007; Cantin and Yates, 2004; Mann and Jensen, 2003), and to the study of large protein complexes. Moreover, modern MS can be used to study time-resolved changes in protein structure and interactions within a given subcellular compartment or superstructure (Cox et al., 2007; Aebersold and Mann, 2003; Gingras et al., 2007, 2005). These developments have yielded considerable insight into the composition, regulation, and function of molecular complexes and the metabolic pathways they engender (Cravatt et al., 2007; Yates et al., 2005). Discovery-based quantitative proteomics has been widely applied to various disease states (such as cancers) with the aim of identifying biomarkers associated with the diseases or targets for potential therapeutics (Pan et al., 2009; Ferrer-Alcon et al., 2009). These efforts have led to a significant increase in the number of novel biomarker candidates identified. However, further characterization and validation of the vast majority of the putative biomarkers remain extremely challenging because of the dynamic nature and complexity of cellular proteomes.
Driven by the challenges associated with protein characterization and quantification and the need to develop strategies to deal with the inherent complexities of the proteome, a wide range of new MS-based analytical platforms and technologies has been developed. It is quite clear that the advances in proteomics have been closely tied to the continuous improvement in mass spectrometry technology, experimental design, strategies for sample preparation, and data mining tools. Despite the fact that the target analytes of the genome and proteome are fundamentally different, there is a strong and synergistic relationship between proteomics and genomics, as the two disciplines investigate the molecular makeup of the cell at complementary levels and each provides information that enhances the effectiveness of the other. For example, the MS-based peptide-spectral matching that is used in proteomics to identify peptides is possible only with extensive knowledge of the genome sequence. Genomics provides complete genomic sequences, a critical resource for identifying proteins quickly and robustly by the correlation of MS spectra with sequence databases. Meanwhile, recent advances in mass spectrometry hardware and software have enabled the production of large proteomics datasets with broad coverage of the proteome through high-throughput LC-MS/MS analysis. The resulting peptide and fragment mass information has proven useful for genome annotation.
This new discipline of proteogenomics complements computationally based genome annotation in that it unambiguously determines reading frame, translation start and stop sites, and splice boundaries and provides validation of short ORFs (Ansong et al., 2008). By combining gene model-based annotation with proteogenomics, an accurate and more complete protein catalog can be obtained. This chapter reviews the advanced MS-based proteomics technologies, the biological impact of proteomics on biomarker and gene discovery associated with diseases, and the recent development of proteogenomics studies for facilitating genome annotation. Finally, the future perspectives in MS-based proteomics development including its applications and challenges are discussed.
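As a rough illustration of how an identified peptide can settle the reading frame of a stretch of genomic DNA, the sketch below translates a sequence in all six frames and reports which frames contain the peptide. The DNA and peptide are made-up toy sequences, the function names are hypothetical, and only the standard genetic code is assumed; real proteogenomic pipelines work from MS/MS search results rather than exact string matching.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation tables are needed for this toy example).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; stop codons appear as '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame names (+1..+3, -1..-3) to their translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_supporting(peptide, dna):
    """List the frames whose translation contains the observed peptide."""
    return [name for name, protein in six_frame_translations(dna).items()
            if peptide in protein]


# Hypothetical genomic fragment and a peptide assumed to come from MS/MS data.
toy_dna = "CATGAAAACCGCGTATATTGCGAAATAAGG"
observed_peptide = "MKTAYIAK"
supporting = frames_supporting(observed_peptide, toy_dna)
print(supporting)
```

Only one frame of the toy fragment contains the peptide, which is the sense in which peptide evidence can confirm a reading frame that a gene-prediction program could only infer.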
22.2 PROTEOMICS STRATEGIES AND WORKFLOW
A fundamental aspect of proteomics is the ability to systematically identify every protein expressed in a cell or tissue and to comprehensively characterize the alterations of identified proteins in their abundance, state of modification/complexation, and subcellular location in response to environmental and physiological factors of the cell. The general workflow and technology for such analyses require the integration of effective separation methods for reduction of sample complexity, advanced MS-based analytical techniques for the identification and quantitation of analytes, and bioinformatics tools for data analysis and interpretation. Among all the hardware and software tools required for proteomics analysis, MS technology has become increasingly important, to the exclusion of almost every other strategy. The combined analytical features of MS (enhanced sensitivity, high selectivity, high mass accuracy, and wide dynamic range) offer unique abilities to handle the sample complexity, detect low abundance protein components, and identify sites of modification and protein complexes.
22.2.1 MS Instrumentation for Proteomics
Mass spectrometers measure the mass-to-charge ratio (m/z) of gas phase ions. Fundamentally, all mass spectrometers consist of three core parts: an ion source that converts analyte molecules into gas phase ions, a mass analyzer that allows the separation of ionized peptides on the basis of m/z ratio, and a detector that registers the number of ions at each m/z value. Historically, MS was limited to the analysis of small, volatile, and thermostable compounds until the late 1980s when two soft ionization techniques—electrospray ionization (ESI) (Fenn et al., 1989) and matrix-assisted laser desorption/ionization (MALDI) (Tanaka et al., 1988; Karas et al., 1988)—were developed and introduced into protein analysis. These two effective ionization techniques allow polar, nonvolatile, and thermally unstable protein/peptide molecules to be ionized and transferred from a condensed phase into the gas phase
without excessive fragmentation. These developments have revolutionized the analysis of peptides and proteins, a fact that was recognized by the award of the 2002 Nobel Prize in Chemistry to Fenn and Tanaka. It is not surprising that ESI and MALDI are the two most common ionization techniques integrated into modern mass spectrometers intended for proteomic applications. MALDI is a solid-phase ionization technique that ionizes sample molecules out of a crystalline matrix via laser excitation. It is important that the absorbance spectrum of the matrix be well matched to the wavelength of light emitted by the laser. The MALDI matrix absorbs laser energy, becomes excited, and vaporizes, carrying the macromolecular analyte molecules into the gas phase. In this tumultuous and explosive process the analyte molecules undergo collisions with excited matrix ions that are sufficiently energetic to cause the transfer of electrons and protons, creating a population of charged macromolecular ions that can be analyzed, typically in a time-of-flight (TOF) mass analyzer. Two closely related techniques often used are atmospheric pressure MALDI (AP-MALDI) and surface-enhanced laser desorption ionization (SELDI). AP-MALDI allows an easy interchange between MALDI and ESI sources on the same MS instrument. SELDI is essentially MALDI that has been targeted to a specific class of molecules through the introduction of surface affinity ligands on the target plate before analysis. Unless MALDI analysis is coupled with off-line reverse phase liquid chromatography (RPLC), which results in relatively low throughput, direct MALDI-MS is limited to the analysis of relatively simple peptide mixtures. Unlike MALDI, ESI is a solution-based ionization method and is therefore readily coupled with liquid-based separation techniques such as liquid chromatography.
ESI is initially driven by a high voltage difference applied between the sample delivery probe and the inlet of the mass spectrometer, which creates a spray of electrically charged droplets. In the low pressure of the mass spectrometer inlet, with the assistance of a heated capillary and sheath gas flow, the liquid in the droplets continues to vaporize leaving droplets of smaller size with an ever-increasing surface charge. When the repulsion from the charged ions on the surface of the droplet overcomes the surface tension of the liquid, the droplets undergo a Coulombic explosion, creating a set of smaller droplets. This process continues until liquid of the droplet is fully depleted and the residual charges are deposited on the macromolecules contained within, creating the multiply charged ions observed (the charge deposition model), or the curvature of the droplet surface becomes so great that the field strength at the surface of the droplet is sufficient to cause direct ionization of the multiply charged macromolecules from the droplet surface (direct ionization model). Nanoelectrospray ionization (nanoESI) MS introduced by Wilm and Mann (Wilm et al., 1996; Wilm and Mann, 1996) has become the most widely used ionization technique for proteomics studies due to its low flow rates, low sample consumption and improved detection limits when coupled with upfront nanoLC. Both MALDI and ESI have strengths and
weaknesses, but many studies have shown that they are highly complementary (Bodnar et al., 2003; Stapels and Barofsky, 2004; Yang et al., 2007). The mass analyzers are the core component of the mass spectrometer, as they can store and/or separate ions based on their m/z and can be manipulated to select specific ions for further analysis. There are four types of mass analyzers widely used in proteomics research: (1) the quadrupole mass analyzer, which separates ions based on trajectory stability; (2) TOF, based on velocity (flight time); and (3) ion trap (IT) and (4) Fourier-transform ion cyclotron resonance (FTICR), both based on m/z resonance frequency. Each of the mass analyzers has unique physical properties that help determine the instrument's performance with respect to key parameters, such as detection mass range, scan speed, sensitivity, mass accuracy, resolution, and dynamic range. Each of the mass analyzers can be used individually, but often they are used in combination, creating hybrid tandem mass spectrometry (MS/MS) instruments, such as triple quadrupole (Q-q-Q), Q-q-Q-linear IT, Q-TOF, TOF-TOF, and LIT-FTICR. Thus many different types of MS and MS/MS instruments are commercially available for proteomics research. Each has strengths and weaknesses, depending on the specific application; these characteristics are used to determine which is the most appropriate to use. Many excellent recent reviews have covered the latest developments in MS instrumentation (Han et al., 2008; Domon and Aebersold, 2006; Perry et al., 2008; Liu et al., 2007). In tandem mass spectrometers, the first mass analyzer is always used for separation and isolation of ions, with subsequent fragmentation of a specific m/z in a collision cell. The m/z values of the product ions are measured in a second stage of mass analysis and then interpreted to yield information concerning the structure of the selected precursor ion. MS/MS is a fundamental and essential technique for protein/peptide analysis.
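Because every analyzer described above ultimately reports m/z rather than mass, a small worked sketch may help. It converts between a neutral monoisotopic mass and the m/z of its protonated charge states, and expresses a measurement error in ppm, the unit in which mass accuracy is usually quoted. The function names and the example mass are illustrative assumptions; only positive-mode protonation is considered.

```python
PROTON = 1.007276466  # mass of a proton in Da


def mz(neutral_mass, charge):
    """m/z of an ion formed by adding `charge` protons to a neutral molecule."""
    return (neutral_mass + charge * PROTON) / charge


def neutral_mass(mz_value, charge):
    """Invert mz(): recover the neutral mass from an observed m/z and charge."""
    return charge * (mz_value - PROTON)


def ppm_error(observed, theoretical):
    """Relative deviation of an observed value, in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# Hypothetical peptide of neutral monoisotopic mass 1500.0 Da.
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz(1500.0, z):.4f}")
```

The same neutral molecule therefore appears at several m/z values in an ESI spectrum, one per charge state, whereas a MALDI spectrum is typically dominated by the singly charged species.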
Thus far the most widely used fragmentation means in MS/MS analysis for proteomics is collision-induced dissociation (CID) (Shukla and Futrell, 2000). In the CID process, a gas-phase protein/peptide cation is selected and transmitted into a high-pressure region where it undergoes a number of collisions with gas atoms or molecules. During these inelastic collisions, a portion of kinetic energy is converted into the internal energy in the ion, making the ion unstable, which drives peptide backbone fragmentation resulting in the series of b-fragment and y-fragment ions acquired in the second stage mass analysis (Roepstorff and Fohlman, 1984; Biemann, 1990a, 1990b). In addition to the typical b- and y-ion series observed in low-energy CID analysis, internal fragmentation and neutral-loss of H2O, NH3 and labile modification molecules can be seen due to slow-heating and other energetic features associated with CID. These can combine to produce very complex spectra which often restrict the amount of sequence information one can obtain from large peptides and intact proteins. ESI, which typically produces multiply charged ions, is most often coupled with low energy CID to produce high-quality and sequence-specific MS/MS data. MALDI typically produces singly charged ions that require higher energy to fragment effectively. Thus MALDI instruments are often
coupled to high-energy collision cells or make use of other high-energy fragmentation strategies, such as postsource decay (PSD), to produce sequence-specific fragment ion spectra. While high-energy fragmentation has some unique advantages (such as the ability to distinguish between leucine and isoleucine residues), it generally results in more internal fragmentation and side-chain fragmentation, making the spectra more difficult to interpret than low-energy CID spectra. A relatively new fragmentation method, electron-capture dissociation (ECD), was developed and introduced by the McLafferty group in 1998 (Zubarev, 2006; Zubarev et al., 1998). ECD involves excitation of the mass-selected multiply protonated peptide/protein cation by the capture of a thermal, low-energy electron and subsequent fragmentation of the resulting odd-electron ion at the amino alkyl (N-Cα) bond to produce c-type and z-type fragment ions in abundance. Because the process is nonergodic (it does not involve any intramolecular vibrational energy redistribution), the fragmentation of large protein ions with the preservation of labile modifications becomes possible. As ECD produces far more backbone cleavages than CID, particularly for large proteins and peptides, it offers better sequence coverage for proteomics analysis and PTM characterization. Therefore, ECD has become a useful tool. However, its use was initially confined to expensive and sophisticated FTICR mass spectrometers. Electron transfer dissociation (ETD) is another nonergodic fragmentation method that is analogous to ECD and has recently been developed by the Hunt laboratory. It uses electron transfer between singly charged anions with low electron affinity and multiply charged peptide cations to induce backbone fragmentation at the N-Cα bond (Coon et al., 2005; Syka et al., 2004). ETD fragmentation creates complementary c- and z-type ion series, yielding information highly complementary to conventional CID fragmentation.
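The relationship between the CID-type (b, y) and ECD/ETD-type (c, z) ion series can be made concrete with a short calculation. The sketch below is a minimal illustration, not a spectrum simulator: it assumes an unmodified peptide, singly charged fragments, monoisotopic residue masses, and the common convention that the observed z ion is the radical z• species (y minus NH3 plus one hydrogen atom). The function name and the example peptide are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565    # H2O
AMMONIA = 17.026549  # NH3
PROTON = 1.007276    # charge carrier
H_ATOM = 1.007825    # hydrogen atom


def fragment_ions(peptide):
    """Singly charged b, y, c and z-dot m/z values for backbone positions 1..n-1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {"b": [], "y": [], "c": [], "z_dot": []}
    for i in range(1, len(masses)):
        b = sum(masses[:i]) + PROTON            # N-terminal fragment (CID)
        y = sum(masses[-i:]) + WATER + PROTON   # C-terminal fragment (CID)
        ions["b"].append(b)
        ions["y"].append(y)
        ions["c"].append(b + AMMONIA)           # N-terminal fragment (ECD/ETD)
        ions["z_dot"].append(y - AMMONIA + H_ATOM)  # C-terminal radical (ECD/ETD)
    return ions


example = fragment_ions("PEPTIDE")
for series, values in example.items():
    print(series, [round(v, 4) for v in values])
```

Leucine and isoleucine share the same residue mass in this table, which is why backbone fragments alone cannot tell them apart and side-chain cleavage at higher energy is needed.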
More importantly, ETD can be implemented on relatively inexpensive RF ion trap mass spectrometers, making it available to a much larger number of researchers. As with ECD, ETD preserves labile PTMs as fragmentation occurs along the peptide backbone in a sequence-independent manner. Thus ETD has been increasingly recognized as an important alternative dissociation technique to CID for analysis of many PTMs (Wiesner et al., 2008; Mikesh et al., 2006), particularly phosphorylation (Lu et al., 2008; Chi et al., 2007; Molina et al., 2007) and glycosylation (Khidekel et al., 2007; Wuhrer et al., 2007; Catalina et al., 2007). ETD can be used to analyze large peptides and small intact proteins through a sequential proton transfer reaction (PTR), by which the reduction of charge states for ETD generated multiply charged ions is performed and readily measured on a bench top instrument. As a result, ETD integrated ion trap instruments enable rapid sequencing of large peptides (middle-down workflow) and small intact proteins (top-down strategy), which allows the determination of a 15–40 amino acid sequence at both the N- and C-terminals of proteins (Bunger et al., 2008; Chi et al., 2007; Wu et al., 2007). In addition, because the ion/ion reaction is highly efficient and fast, ETD can be performed
at a fast scan rate (∼3 scans/s) and is therefore compatible with the chromatographic timescale (Udeshi et al., 2008, 2007) typically used for shotgun analysis, providing high sensitivity and superior sequence coverage for peptides of all sizes. Furthermore, both CID and ETD can be combined in the same experiment, generating two sets of highly complementary data (Hart et al., 2009; Swaney et al., 2008; Good et al., 2007). Table 22.1 summarizes the key features of commonly used MS instruments with their available ion sources, fragmentation techniques, and specific applications in proteomics analysis.
22.2.2 Importance of Experimental Design for Proteomics
The focus of proteomics research has been systematic identification and quantitation of all expressed cellular components, their associated biological modifications and isoforms in response to a particular treatment or environmental stimulus. Given the dynamic nature of proteins in any given biological system, proteome samples should be collected in specific conditions at various time points for proteomic analysis to fully reflect the various states of proteins in a cell. In contrast to the traditional target analysis for single protein characterization, proteomics studies often take a brute-force discovery strategy (nontarget analysis) at a global (hypothesis free) level. Consequently, proteomics research requires large-scale, multistep analysis on multiple complex samples and the collection of large amounts of data. In addition, there are a plethora of techniques available to carry out experiments, each of which will generate enormous sets of complementary data. For these reasons, generating highly reliable and reproducible methodologies and optimized workflows has been a significant challenge in the entire field (Rifai et al., 2006). Past experience has proven that all these data, often generated at a significant cost, can have very little value if appropriate attention is not paid to the design of the experiment. Thus it is advisable to make an effort to design the experiment to ensure that the right type of data are collected to answer the question of interest as efficiently as possible. Perhaps the most important question to be answered when considering the design of a proteomics experiment is, What are the objectives? Once the experimental objectives are defined, the resources, protocols and instrumentation can be selected based on their ability to achieve the objectives with their accuracy, precision, and reliability. 
TABLE 22.1. Key Features of Commonly Used MS Instruments with Their Available Ion Sources, Fragmentation Techniques, and Specific Applications in Proteomics Analysis

| Features | QIT | LTQ | Q-q-Q | Q-q-LIT | TOF | TOF-TOF | Q-q-TOF | Q-IMS-TOF | FTICR | LTQ-Orbitrap |
|---|---|---|---|---|---|---|---|---|---|---|
| Mass resolution (a) | 1,000 | 10,000 | 1,000 | 10,000 | 10,000–20,000 | 10,000–20,000 | 10,000–20,000 | 10,000–20,000 | 50,000–750,000 | 30,000–100,000 |
| Mass accuracy (ppm) | 100–1,000 | 100–500 | 100–1,000 | 100–500 | 10–20 (b); <5 (c) | 10–20 (b); <5 (c) | 10–20 (b); <5 (c) | <3 (b) | <2 | <5 |
| Sensitivity (pm; fm; am) | pm | fm | am to fm | fm | fm | fm | fm | fm | fm | fm |
| m/z range | 50–2,000; 200–4,000 | 50–2,000; 200–4,000 | 10–4,000 | 5–2,800 | No upper limit | No upper limit | 20–100,000 | 20–100,000 | 50–2,000; 200–4,000 | 50–2,000; 200–4,000 |
| Scan rate (d) | Moderate | Fast | Moderate | Fast | Fast | Fast | Moderate to fast | Moderate to fast | Slow | Moderate to fast |
| Dynamic range | 1E+3 | 1E+4 | 6E+6 | 4E+6 | 1E+4 | 1E+4 | 1E+4 | 1E+4 | 1E+3 | 4E+3 |
| MS/MS capability | MSn | MSn | MS/MS | MSn | n/a (e) | MS/MS | MS/MS | MS/MS | MSn | MSn |
| Ion source | ESI | ESI | ESI | ESI | MALDI | MALDI | ESI; MALDI | ESI; MALDI | ESI; MALDI | ESI; MALDI |
| Fragmentation technology | CID | CID/ETD | CID | CID | CID/PSD | CID/PSD | CID | CID | CID/ECD | CID/ETD |
| Applications (f) | 1,3,6 | 1,3,6 | 1–6 | 1–6 | 1 | 1–3 | 1–3,6 | 1–3, 6–8 | 1–4,6,7 | 1,3,4,6 |

a Mass resolution achieved at normal scan rate; higher resolution achievable at slower scan rate.
b With external calibration.
c With internal calibration.
MSn: n > 2, up to 15.
d Slow: <200 Da/s; moderate: 200–2,000 Da/s; fast: >2,000 Da/s.
e Fragmentation achievable by post source decay.
f 1, identification; 2, quantification; 3, PTM detection; 4, PTM characterization; 5, MRM (SRM) quantification; 6, high throughput; 7, top-down proteomics; 8, IMS.

It is necessary to identify the known or expected sources of variability within the experiment so that efforts can be made to reduce their impact on our ability to answer the question of interest. One designs an experiment to improve the precision of the answer. Proteomics is applied to a broad array of experimental objectives. These include dissecting the biomolecular interactions involved in the formation of protein complexes, deciphering the intricacies of metabolic processes, and identifying proteins and/or peptides that are characteristic of a particular environmental stress or disease state. Furthermore, proteomic researchers frequently have different sets of analytical instrumentation and other resources available to them. Given the diversity of problems to which proteomic methods are applied and the variety of instruments and technologies that have been developed to address them, it is unlikely that a universal approach involving rigid requirements for the instrumentation and methodologies can be found. Thus, from the beginning, the objectives of every proteomics study must be defined precisely, and the resources available must be marshaled in such a way that the data generated are of sufficient quality to secure the desired objective. The main factors affecting experimental reproducibility are the biological and technical variations introduced through sample source selection, protein extraction and collection, sample preparation, storage protocols, LC-MS settings and conditions, data mining, and the processing algorithms used for automated data interpretation (Prakash et al., 2007). The complexity of the samples, techniques, and resulting data requires the application of comprehensive statistical tools to enable the extraction of the signature features responsible for the alteration of biological phenotype. The experimental variation and sources of bias often observed in proteomics research demonstrate the need for applying principles of statistical experimental design (Oberg and Vitek, 2009). Statistics-based experimental design therefore has a large impact on the outcome of a proteomics study, because it helps to avoid systematic error and to minimize random variation. Proteomics experiments need to integrate optimized algorithms with sound statistics and to be designed so as to maximize data reliability, reproducibility, and profiling information while minimizing the variation contributed by each step and each source.
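One concrete form that such statistical planning can take is a replicate-number calculation for a two-group comparison. The sketch below is a simplification under stated assumptions: it uses the normal approximation to the two-sample t-test (which slightly underestimates the replicates needed when n is small), assumes equal group sizes and a common standard deviation, and works in arbitrary units such as log2 intensity. The function names and the numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def replicates_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided, two-sample comparison.

    effect: smallest difference in means worth detecting
    sd:     expected standard deviation within each group
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)  # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)


def approximate_power(n, effect, sd, alpha=0.05):
    """Approximate power achieved with n replicates per group."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sd * math.sqrt(2 / n)
    return _STD_NORMAL.cdf(effect / standard_error - z_alpha)


# Hypothetical planning values: within-group SD of 0.5 log2 units.
for effect in (1.0, 0.5, 0.25):
    n = replicates_per_group(effect, sd=0.5)
    print(f"effect {effect}: about {n} replicates per group")
```

Under these assumptions, halving the effect size to be detected roughly quadruples the number of replicates, while reducing the variance by the same factor would leave the replicate number unchanged.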
22.2.2.1 Size Matters The most commonly used technique to determine whether the difference in protein expression revealed in a proteomics experiment is significant is Student's t-test. This is a univariate statistical test, and when using such a test it is essential that a sufficient number of replicates be performed to achieve the required statistical power. The power of an experiment is defined as the probability of correctly rejecting the null hypothesis (H0) and is equal to (1 − β), where β is the false-negative rate. The power of an experiment refers to its ability to detect what the researcher is looking for. It depends on a number of parameters, including system noise or variance, the number of replicates, the size of the effect measured, and the significance or confidence level desired. The most common way to increase the power of an experiment is to increase the number of replicates; however, increasing the size of the study comes at a cost. Doubling the number of samples will likely double the cost of the experiment. Nevertheless, an undersized (underpowered) study will not have the ability to distinguish biologically important changes in expression from random fluctuations. Analyzing more samples is an easy (albeit more expensive) way to improve the experimental outcome. However, one must be careful because oversized studies are wasteful as they tend to use more resources
than are necessary. Furthermore, they present serious ethical problems when human or animal subjects are used. Thus there is little virtue in conducting an oversized study. Unfortunately, it is not always possible to increase the number of replicates used. Sometimes additional sample material is just not available and, from a practical perspective, all experiments are limited by the time and budget available. If the number of samples is fixed and too small to provide the needed statistical power, it would be inappropriate to conclude that the study should not be carried out. It simply requires a reevaluation of the experimental design. The number of replicates is just one of the parameters that affect the power of the experiment. If the sample number is limited, then one must focus on the other parameters that determine power. Reducing system variance by narrowing the scope of the study, using more accurate instrumentation to better determine the value of the response variable, and incorporating sample stratification and blocking into the experimental design (see below) are all effective ways to improve the statistical power of experiments with limited sample availability.
22.2.2.2 Randomization It is extremely difficult for researchers to eliminate bias using their scientific judgment alone; thus it has become common practice to incorporate the principle of randomization into the experimental design. Randomization is a way to avoid hidden biases by equalizing the unknown factors that affect the data and cannot otherwise be controlled. It is a key concept in experimental design, and it is important to incorporate it in all phases of the study, including the assignment of individuals to specific treatment groups, the order in which samples are analyzed, and the labeling of samples from different treatment groups.
22.2.2.3 Blocking and Stratification In a completely randomized experimental design, objects are assigned to treatment groups completely at random.
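A minimal sketch of these ideas follows: complete randomization of subjects to treatment groups, randomization of the order in which samples are analyzed, and, for comparison, the within-block variant taken up in the text that follows. The subject labels, the blocking factor, and the function names are hypothetical, and a fixed seed is used only so that the allocation can be reproduced and audited.

```python
import random
from collections import Counter


def completely_randomized(subjects, treatments, rng):
    """Shuffle all subjects, then deal them to the treatments in turn."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}


def randomized_blocks(blocks, treatments, rng):
    """Randomize separately inside each homogeneous block of subjects."""
    assignment = {}
    for members in blocks.values():
        assignment.update(completely_randomized(members, treatments, rng))
    return assignment


rng = random.Random(2011)  # fixed seed: the allocation can be regenerated
treatments = ["control", "treated"]
blocks = {  # hypothetical blocking factor with four subjects per block
    "female": ["F1", "F2", "F3", "F4"],
    "male": ["M1", "M2", "M3", "M4"],
}
all_subjects = [s for members in blocks.values() for s in members]

simple = completely_randomized(all_subjects, treatments, rng)
blocked = randomized_blocks(blocks, treatments, rng)

# The order of sample analysis is randomized as well.
run_order = list(all_subjects)
rng.shuffle(run_order)

print(Counter(simple.values()), run_order)
```

With complete randomization the groups are balanced in size but a block may still end up unevenly split between treatments; randomizing within blocks guarantees that each block contributes equally to every treatment.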
However, as we have seen, this is not always a practical approach, particularly when the statistical power of the experiment is low and the amount of sample material is limiting. In these cases, a randomized block design may be preferable. If the researchers are aware of specific criteria among the test subjects that could affect the outcome of the experiment (age, sex, race, etc.), they can first divide the population into homogeneous blocks. Then within each block individuals are assigned to treatment groups using a completely randomized design. Stratification is a related concept. In stratified random sampling, the population of interest is stratified (divided/blocked) into strata based on one or more shared characteristics (like male versus female). Then from each of the different blocks or strata, random samples are selected in the same proportions as the blocking characteristics occur in the population.
22.2.2.4 Validation It must be recognized that the results derived from many proteomics experiments contain a considerable amount of ambiguity.
This is particularly true for shotgun and other peptide-centric approaches where proteins are identified by matching entries in incomplete sequence databases to mass spectral data. The goal of proteomics studies is often discovery or hypothesis generation; therefore, the instrumentation used and data generated are not well suited to confirmatory analysis or validation. Thus it is important that the hypotheses generated through a proteomics approach be validated by an orthogonal method of analysis. In this way, Yang et al. (2008) used coimmunoprecipitation with purified virus to validate their finding, from a two-dimensional gel electrophoresis experiment, that certain aphid proteins were specifically associated with the ability of S. graminum to transmit cereal yellow dwarf virus-RPV. Beyond the fundamental statistical principles considered at the early design stage, the experimental design needs to address the following aspects: (1) careful setup of control/test groups with biological and technical repeat studies; (2) selection of the least complex sample source for addressing relevant biological questions; (3) development of software tools and generation of statistical models that can assess the experimental reproducibility for sets of large-scale LC-MS data with the essential information on source, magnitude, and distribution of errors; and (4) a pilot experiment for estimating and testing the above experimental design. When selecting a statistical model, all potential sources of experimental variation should be accounted for. It is extremely important to integrate experimental design as a key part of the proteomics pipeline and workflow, so as to maximize the possibility of yielding new discoveries, gaining better understanding, and/or raising a testable hypothesis for further assessment by other approaches.
22.2.3 Sample Preparation and Separation Technologies
The first, and perhaps most important, challenge to be faced when carrying out a proteomics study is the extraction of proteins from the material of interest to produce, in a reproducible way, an analyzable and biologically relevant sample. Many protocols have been developed to carry out total protein extractions from various types of biological samples. However, it is now widely recognized that all of these approaches incorporate biases with respect to particular classes of proteins that often lead to incomplete and inconsistent results. This problem is particularly severe for whole body or whole cell extracts where the target proteome is exceedingly large and complex. An effective way to minimize these problems is to focus on a subfraction of the entire proteome by targeting a particular tissue, cell type, or subcellular structure. Such an approach allows one to reduce the complexity of the protein sample and use optimized protein extraction solvents and protocols. An additional advantage of this approach is that it provides information concerning the location of proteins within the tissue, cell, or subcellular structure. Knowledge of protein location can provide useful information concerning cellular pathways, protein–protein interactions, and the pathogenetic mechanism involved in different
diseases (Uhlen and Ponten, 2005). The prefractionation protocols used to isolate purified organelles or large cellular structures often involve traditional cell biology methodologies such as centrifugation, filtration, or selective affinity isolation of certain cellular compartments. Even with subfractionation, the wide range of protein concentrations found within subproteomes and their highly complex nature make it necessary to fractionate samples further before MS analysis in order to reduce their complexity, enrich for low abundance proteins, and minimize the ionization-inhibiting effects of sample matrix components and the superabundant proteins. A variety of sample preparation strategies incorporating many separation technologies have been developed and implemented for a wide array of proteomics applications. These can be divided into two general strategies: gel-based and chromatography-based separations. Gel-based separation covers one-dimensional gel electrophoresis, where the separation is based on either protein molecular weight or isoelectric point, and two-dimensional gels, consisting of the pI-based isoelectric focusing (IEF) separation followed by an orthogonal separation based on size employing SDS-polyacrylamide gel electrophoresis (SDS-PAGE). Historically, the chromatography-based approach involved a single reverse-phase separation (RPLC) coupled directly to MS analysis; however, it now often incorporates a two-dimensional liquid chromatographic separation where the first dimension is strong cation exchange chromatography and the second is reverse phase. It should be noted that the sample preparation and separation technologies are sometimes used to distinguish between bottom-up and top-down proteomics. There are two related but slightly different concepts used for the definitions of bottom-up/top-down proteomics.
The most widely accepted definition is based on whether peptides or proteins are being directly introduced into mass spectrometers for analysis. Bottom-up approaches are those in which peptides are the species that are introduced into the mass spectrometer, whereas top-down indicates direct MS analysis of intact proteins. By this definition, any of the gel-based approaches that involve subsequent in-gel proteolytic digestion before MS analysis would be considered a bottom-up workflow. Another less commonly used but equally valid definition is determined by the nature of the material being initially separated. By this definition, top-down proteomics would include all procedures that begin with a separation of intact proteins, including one- and two-dimensional gel electrophoresis. In this review, for the sake of clarity, we have adopted the traditional (former) definition using MS detection as a benchmark.
22.2.3.1 Gel Electrophoresis Techniques Both one-dimensional (1D) SDS-PAGE (1DE) and two-dimensional (2D) gel electrophoresis (2DE) have been used extensively for separation of complex protein samples. For very complex proteomes, 1DE separation of proteins alone does not provide adequate separation to justify immediate digestion and MS analysis. However,
1DE-separated samples can be combined with solution-phase LC techniques for direct RPLC-MS analysis. Typically, protein extracts are fractionated by 1DE, and either individual gel bands of interest are excised or an entire gel lane is excised and divided into 10–30 fractions. Proteins in the gel slices are then subjected to in-gel digestion with trypsin, after which the tryptic peptides are extracted and the extracted peptide mixtures are analyzed by RPLC-MS. The 1DE approach has many advantages: it is well established, simple to use, highly reliable, tolerant of interfering components in the original sample matrix, and applicable to proteins with a wide range of molecular masses, pI values, and hydrophobicities. Historically, 2DE has been the most frequently used gel-based protein quantification/separation method due to its unrivaled ability to resolve proteins from complex samples. The orthogonal combination of a pI-based separation (first dimension) and a size-based separation (second dimension, based on relative molecular mass) results in the sample proteins being distributed across an area in (pI, Mw) space. Using current technology, >5000 protein spots can be resolved on a gel in a single experiment. More than 2000 protein spots containing <1 ng total protein per spot have been detected (Gorg et al., 2004). With high-sensitivity visualization (such as fluorescence-based Sypro Ruby or silver staining), the resulting 2D electropherograms are amenable to quantitative computer analysis, allowing one to detect changes in intensity between identical protein spots from different samples. These spots of interest are then excised, digested, and analyzed by MS for protein identification. Despite the increasing popularity of chromatographic alternatives, 2DE is still a widely used means of resolving complex protein samples and remains one of the best separation tools for intact proteins.
It enables the identification of proteins containing posttranslational modifications (PTMs) and reveals the presence of isoforms and proteolytic processing in a convenient reference map format. Furthermore, highly reproducible differential protein expression profiles can be obtained for protein extracts under different cellular states, particularly with the introduction of sophisticated image analysis software and the use of difference gel electrophoresis (DIGE) techniques. However, the main drawback of the 2DE approach is its inability to resolve an entire proteome in a single experiment, particularly for proteins of extreme size, charge, or hydrophobicity. It is also considered labor intensive, time-consuming, and difficult to automate, resulting in low throughput. Finally, 2DE is technique sensitive, which introduces reproducibility problems when experiments are carried out by those not skilled in the art.

22.2.3.2 Gel-Free (Solution) Separation Techniques

Solution-phase separation methods include technologies for both protein and peptide separation. Platforms available for the separation of complex protein samples, such as the IEF-based Rotofor apparatus and the 2D liquid-phase ProteomeLab PF 2D technique (which combines charge-based separation with hydrophobicity-based separation), have had some success in proteomics analysis. But the utilities of
those solution-based separation platforms at the protein level, compared with the gel-based approach, for solubilizing an entire proteome and avoiding protein precipitation and aggregation are still somewhat unclear and, as a result, they are not widely applied. The overwhelming majority of gel-free separations currently carried out in proteomics analysis involve separation at the peptide level. Complex protein samples are denatured in solution with strong chaotropes, reduced and alkylated, and then digested, typically with trypsin, although other enzymes or combinations of enzymes may be used. The resulting peptide mixture is then subjected to RPLC (based on peptide hydrophobicity) or to multiple dimensions of LC separation, such as ion exchange (IEX), hydrophilic-interaction (HILIC), and affinity chromatography, prior to MS analysis. A unique feature of RPLC is that the elution solvents used are directly compatible with ESI, making it the ideal choice for the last dimension of separation prior to MS analysis. Because ESI-MS is a concentration-sensitive technique, a nanoscale column (<100 μm i.d.) coupled with a nanoscale LC system capable of delivering flow rates at the nanoliter/min scale is typically used. Such a system operating at flow rates of 200–300 nL/min produces low-volume, high-concentration peaks that maximize MS sensitivity. As peptide mixtures become more complex, improved separation resolution provides greater peak capacity, ultimately yielding higher proteome coverage. RPLC-MS analysis is therefore the obvious choice for enhanced depth of coverage. The recent introduction of nanoscale ultra performance liquid chromatography (nUPLC) systems, designed for use with sub-2-μm-diameter packing materials and able to operate at pressures up to 15,000 psi, is one of the efforts with the potential to truly revolutionize the field. The UPLC platform allows improved peak capacity and resolution and reduced analysis time compared with a conventional HPLC system.
These high-quality chromatographic separations can be achieved in one tenth the time of those attainable using a standard LC, and the low dispersion of the UPLC system allows for a fourfold increase in sensitivity. Despite these improvements, no single separation method is sufficient to adequately fractionate an entire proteome. The most widely used approach to overcoming the limited resolving power of a single RPLC stage when tackling highly complex samples is to introduce a prefractionation step using an orthogonal second dimension of chromatography. The most popular choice for the prefractionation step is strong cation exchange (SCX) LC, which separates peptides based on their number of positive charges. SCX LC can be performed offline or online, either with a linear gradient of increasing ionic strength or by sequential elution of the peptides with a series of washes of increasing salt concentration. In the offline approach the two chromatographic procedures are carried out separately. In the online approach each salt fraction is captured by an in-line RP trapping column as it elutes from the SCX column. As each fraction is captured, it is automatically desalted, after which the trapping column is coupled to the analytical column for RPLC-MS/MS analysis. It is also possible to use a mixed-bed column consisting of
an SCX segment followed by an RP layer inside a fused silica capillary. The chromatography is carried out in cycles: each step increases the salt concentration to elute peptides off the SCX segment and onto the RP layer, and is followed by a wash step and then a gradient of increasing acetonitrile concentration to elute peptides from the RP segment into the mass spectrometer. This latter approach is called multidimensional protein identification technology (MudPIT) and was pioneered by the Yates lab (Gharbi et al., 2002; Link et al., 1999; Yates et al., 1999). An alternative to the standard SCX-RP LC approach is a fractionation scheme using OffGel IEF, which has been found to improve resolution and increase protein identification. The OffGel device provides IEF separation of peptides using standard hydrated IPG strips with a tightly fitting frame that divides the strip into 12 or 24 fractions. The sample is loaded by distributing it evenly across all the wells of the strip. When a voltage is applied, the peptides migrate horizontally through the gel until they reach the well that includes their isoelectric point, where they focus into tight bands. While the voltage is applied, their position on the horizontal axis is fixed, but they are free to migrate vertically. Thus they partition between the gel and the solution above it according to the difference in volume. Typically, the volume contained in the gel represents only one fifth of the total fraction volume, and 80% of the peptides are recovered from the liquid portion. Compared with the standard SCX-RPLC approach, OffGel peptide IEF consistently identified one third more proteins with an equal number of fractions (Hubner et al., 2008). Peptide IEF also allows a considerable increase in proteome coverage of very complex samples prepared from total cell extracts (Ernoult et al., 2008). One advantage of in-solution separation is the greater solubility of peptides as compared to proteins in aqueous buffers.
It also eliminates potential sample loss during extraction from gel slices. In addition, the gel-free platform is more readily automated and requires less sample handling than the gel-based approach. Another way to reduce the complexity of a sample is through specific enrichment of low-abundance proteins or PTM-containing proteins before analysis. For example, immunodepletion of highly abundant proteins in serum or plasma is one of the most effective and commonly used steps to enhance the detection of low-abundance proteins in a complex sample (Pan et al., 2009). Biotinylation affinity enrichment (targeted to exposed lysine residues on surface proteins) is also widely used and has proven to be a robust method for membrane protein enrichment (Lu et al., 2008; Speers and Wu, 2007). Epitope tagging of targeted proteins expressed within cells, followed by immunoprecipitation (IP) using epitope-specific antibodies, is another popular and effective sampling approach for capturing targeted proteins together with their possible interacting proteins as protein complexes. Various epitope-tagging systems such as FLAG-tag are commercially available and widely used. The resulting IP pull-down protein complexes are suitable for subsequent gel-based or solution-based separation and MS analysis
(Gingras et al., 2007; Mann et al., 2001). PTM proteins often play an important regulatory role in the cell but are often present in low abundance. In addition, the low stoichiometry of PTMs such as phosphorylation poses a significant analytical challenge and requires enrichment to an MS-detectable level. Affinity purification can generally be integrated into the multidimensional separation, either off-line or directly coupled to RPLC.

22.2.4 MS Analytical Strategies for Proteomics

As described in the previous section, current strategies used for proteomic studies are based on a variety of separation techniques followed by MS analysis of either separated intact proteins or their proteolytic peptides. This distinction provides the basis for dividing proteomics into two broad categories of approaches: the top-down approach, for direct MS analysis of prepurified proteins, and the bottom-up approach, for analysis of enzymatic digests of complex protein mixtures. The strategies for MS-based proteomics are summarized in Figure 22.1. With so many technologies and experimental options available for proteomics analysis, many effective MS-based proteomics strategies have been developed to meet different biological and analytical challenges. In contrast to the separation methods, which often involve multiple orthogonal steps, MS data acquisition is almost exclusively performed in a data-dependent acquisition (DDA) mode to minimize the collection of redundant MS spectra. In DDA analysis, the information from the survey MS scan (a full-scan MS measuring the masses of the protein or peptide ions) is used to select the precursor ions for the subsequent MS/MS scans (fragmentation scans that generate the primary amino acid sequence information).
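The precursor selection logic of a DDA cycle, together with the exclusion list discussed next, can be illustrated with a short Python sketch. This is a simplified model and not the acquisition software of any instrument: the class and function names, the 0.01 m/z matching window, and the 30 s exclusion time are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicExclusion:
    """Remembers recently fragmented precursors (hypothetical helper)."""
    tolerance_mz: float = 0.01   # assumed m/z matching window
    duration_s: float = 30.0     # assumed exclusion time in seconds
    entries: list = field(default_factory=list)  # (mz, expiry time) pairs

    def is_excluded(self, mz, now):
        # Forget expired entries, then test the ones that remain.
        self.entries = [(m, t) for m, t in self.entries if t > now]
        return any(abs(m - mz) <= self.tolerance_mz for m, _ in self.entries)

    def add(self, mz, now):
        self.entries.append((mz, now + self.duration_s))


def select_precursors(survey_scan, now, exclusion, top_n=3):
    """Pick the top_n most intense survey-scan ions that are not excluded.

    survey_scan is a list of (mz, intensity) pairs from one full MS scan.
    """
    chosen = []
    for mz, _ in sorted(survey_scan, key=lambda p: p[1], reverse=True):
        if len(chosen) == top_n:
            break
        if not exclusion.is_excluded(mz, now):
            chosen.append(mz)
            exclusion.add(mz, now)  # do not fragment this ion again for a while
    return chosen


survey = [(500.25, 1.0e6), (620.31, 8.0e5), (433.20, 5.0e5), (710.40, 2.0e5)]
excl = DynamicExclusion()
print(select_precursors(survey, 0.0, excl, top_n=2))   # [500.25, 620.31]
print(select_precursors(survey, 1.0, excl, top_n=2))   # [433.2, 710.4]
print(select_precursors(survey, 40.0, excl, top_n=2))  # [500.25, 620.31]
```

In the second cycle the two most intense ions are skipped because they were fragmented one second earlier, so less abundant co-eluting ions get sampled; once the exclusion time has elapsed the intense ions become eligible again.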
The m/z value of a specific precursor ion at a specified time can be automatically entered into an exclusion list, which avoids collecting redundant data by repeating MS/MS on the same ions in the following scan cycles. As a result, DDA defines the maximal scan rate at which mass spectrometers can acquire MS/MS data for co-eluting peptides.

22.2.4.1 Bottom-Up Proteomics

Technically, this strategy depends on measuring the mass of a proteolytic peptide and/or analyzing the peptide's fragmentation pattern to obtain its sequence, from which the protein identity is inferred. This approach currently dominates proteomics research for large-scale, high-throughput analysis of complex samples. In this approach, proteins are generally digested enzymatically into peptides before MS analysis. There are two specific workflows for the bottom-up approach: (1) complex protein mixtures are simplified through off-line separation such as 1D or 2D gel electrophoresis, or enriched through affinity purification, before protein digestion; and (2) proteins in complex samples are directly digested, with subsequent separation, often multidimensional, followed by MS/MS analysis. In the first workflow, 2D-gel-separated proteins are digested and subjected
Figure 22.1. Overview of proteomics strategies. Protein mixtures extracted from biological materials are typically analyzed by bottom-up or top-down approaches. In the bottom-up analysis, protein extracts are directly subjected to digestion, creating a pool of peptides that is fractionated by methods such as SCX chromatography or OffGel electrophoresis before a second-dimension reverse phase chromatography step coupled to tandem MS in a data-dependent analysis mode. Alternatively, the protein extracts can be fractionated by 1D or 2D gel electrophoresis or LC, followed by digestion and either direct analysis by MALDI-TOF/TOF or further separation by reverse phase chromatography on-line to a tandem mass spectrometer. Protein identifications and characterizations are achieved by searching peptide MS/MS spectra against available databases and by manual inspection. In the top-down analysis, protein mixtures are generally separated into relatively pure, less complex protein mixtures, which are delivered into a high-resolution mass spectrometer such as an FTICR instrument by off-line static infusion for measurement of the intact protein mass and its fragment ions.
to direct peptide analysis by peptide mass fingerprinting (PMF) using MALDI-TOF. Due to its inherent simplicity and high sample throughput, MALDI-TOF is commonly used as a screening method for such a direct analysis of 2D-gel-extracted peptides. Confidence in protein identification by this approach is completely dependent on the accuracy of the mass measurement and can be improved through the inclusion of a tandem time-of-flight
(TOF/TOF) mass analysis (Vanrobaeys et al., 2005). Digests of 1D-gel-separated protein bands are usually too complex for direct MS analysis and require peptide separation by LC-MS/MS (GeLC-MS/MS) (Schirle et al., 2003). The second workflow, in which a complex protein sample is directly digested and analyzed by 2D LC-MS/MS, is commonly referred to as shotgun proteomics by analogy to shotgun genomics: the concept behind the shotgun DNA sequencing strategy is applied to proteins. The shotgun approach is currently the most widely used strategy in proteomics (McDonald and Yates, 2003). MS/MS spectra are collected from as many distinct peptide ions as possible through a combination of orthogonal 2D LC sampling steps, DDA at fast instrument scanning rates, and nonergodic fragmentation technologies such as ETD (Scigelova and Makarov, 2006; Makarov et al., 2006a, 2006b). The MS/MS spectra are then searched, using an algorithm such as Sequest or Mascot (Perkins et al., 1999), against a database of protein sequences derived from the genomic sequence of the organism from which the sample was obtained. The LC resolving power, peak capacity, and scan rate of the mass spectrometer are critical to the success of shotgun analysis. Recently, the development of an alternate scanning acquisition technique (LC-MS^E) has enabled parallel MS/MS analysis of multiple precursor ions simultaneously. Coupling this development with the high resolution of UPLC and the high-frequency sampling rates of modern mass spectrometers has led to significant improvements in both protein and proteome coverage. Unlike traditional data-dependent analysis, LC-MS^E uses other data, including the chromatographic peak area, peak shape, combined charge state, and apex retention time of each precursor and its corresponding fragment ions, to make peptide identifications.
The improved protein coverage makes it possible to carry out label-free quantitative analysis of very complex samples (Silva et al., 2006a, 2006b). The shotgun approach is well suited for the quantitative analysis of chemically modified peptides/proteins. The use of peptides allows for a better initial separation of analytes than would be possible using protein fractionation and offers higher detection sensitivity than the top-down approach. The inherent limitation of the shotgun approach is that it greatly increases the complexity of the sample analyzed, requiring improved resolving power, efficient separation, and faster-scanning mass spectrometers. Even with the use of UPLC and mass spectrometers capable of ultra-high scan rates, the rate at which peptides elute from the column often exceeds the rate at which MS/MS spectra can be acquired. This leads to severe undersampling of the peptides, which greatly reduces sequence coverage and has important consequences for throughput and experimental reproducibility. In addition, the limited dynamic range of MS analysis favors the detection of the most abundant proteins. As a result, peptides from low-abundance proteins, particularly those containing PTMs, are often not identified. Furthermore, unlike the 2DE-based or top-down approaches, the shotgun approach cannot unambiguously identify the protein of origin for peptide sequences shared by more than one protein.
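The ambiguity caused by shared peptide sequences can be made concrete with a small Python sketch that separates peptides pointing to a single protein from those that map to several. The function name, the protein labels, and the peptide strings are hypothetical and serve only as an illustration; this is not the inference procedure of any particular search engine.

```python
from collections import defaultdict


def classify_peptides(protein_peptides):
    """Split identified peptides into unique and shared ones.

    protein_peptides maps a protein name to the set of peptide
    sequences that its in silico digest could produce.
    """
    owners = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            owners[peptide].add(protein)

    # A unique peptide identifies exactly one protein in the database.
    unique = {pep: next(iter(prots))
              for pep, prots in owners.items() if len(prots) == 1}
    # A shared peptide is compatible with two or more proteins.
    shared = {pep: sorted(prots)
              for pep, prots in owners.items() if len(prots) > 1}
    return unique, shared


# Two hypothetical isoforms that differ in only one tryptic peptide each.
identified = {
    "ISOFORM_A": {"AEFVEVTK", "LGEYGFQNALIVR"},
    "ISOFORM_B": {"YLYEIAR", "LGEYGFQNALIVR"},
}
unique, shared = classify_peptides(identified)
print(unique)  # AEFVEVTK maps to ISOFORM_A, YLYEIAR to ISOFORM_B
print(shared)  # LGEYGFQNALIVR maps to both isoforms
```

If only the shared peptide is observed, the data cannot say which isoform was present in the sample, whereas an intact-protein measurement would retain that information.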
22.2.4.2 Top-Down Proteomics

The top-down strategy involves direct high-resolution mass measurement of intact protein ions and their sequence-specific fragment ions without prior proteolytic digestion (McLafferty et al., 2007; Kelleher, 2004). Since this approach involves the entire contiguous sequence of the protein, it can be used to reveal both the nature and position of any modifications and to characterize protein isoforms through database searching or de novo sequencing. A fundamental advance that made top-down proteomics practical was the development of nonergodic fragmentation techniques such as ECD and ETD, which are now available in FTICR and ion trap instruments for direct fragmentation of small to mid-size proteins. Because of the increased complexity of the gas-phase ions’ tertiary structures, current top-down proteomics methods are not capable of dealing with larger proteins (>50 kD). A related strategy, the middle-down approach, makes use of limited proteolytic digestion to produce larger peptides (>5 kD) to get around the size limitations. This approach has had limited success in the characterization of large proteins or specific protein domains using ECD and ETD fragmentation techniques (Garcia et al., 2007a, 2007b). The key advantages of the top-down approach include higher sequence coverage, which allows for the identification of protein isoforms, and better characterization of PTMs. Also, the top-down approach is reported to deliver more reliable protein quantification than the bottom-up approach (Yates et al., 2009). However, several technological limitations make the top-down approach a less attractive alternative to bottom-up approaches. Up-front separation of proteins from high-complexity samples is much more difficult than the separation of peptide mixtures.
In addition, the top-down approach requires more sophisticated instruments with higher mass accuracy and resolution, such as FTICR mass spectrometers, which are much more expensive to purchase and maintain and require greater expertise to operate than the more common RF ion trap mass spectrometers used for bottom-up proteomics. Moreover, generic and efficient methods for fragmenting large proteins (>50 kD) are still limited. Due to these limitations, top-down proteomics relies heavily on direct infusion analysis of a single protein or simple protein mixture and depends on time-consuming and inefficient off-line multiple-step separations. As a result, analytical throughput and efficiency for large-scale proteome analysis involving large proteins continue to pose a significant challenge. Recent efforts toward extending top-down analysis to complex mixtures renew hope that these analytical challenges can be overcome. These efforts have made possible high-throughput analysis of intact proteins on a chromatographic time scale by LTQ-FTICR analysis with automated database searching informatics (Parks et al., 2007), the first demonstration of large-scale top-down proteomics. Similar work has also been reported using an automated top-down ETD MS approach (Bunger et al., 2008; Chi et al., 2007).
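One reason an intact mass measurement can reveal the nature of a modification is that common PTMs add characteristic mass increments. As a minimal illustrative sketch (not taken from any of the cited software), the Python function below compares an observed intact mass with the theoretical mass of the unmodified sequence and lists the modifications consistent with the difference. The function name, the 0.02 Da tolerance, the limit of three copies, and the short modification table are assumptions made for illustration; the monoisotopic shifts themselves are standard values.

```python
# Monoisotopic mass increments (Da) for three common modifications.
KNOWN_SHIFTS = {
    "phosphorylation": 79.96633,  # +HPO3
    "acetylation": 42.01057,      # +C2H2O
    "methylation": 14.01565,      # +CH2
}


def explain_mass_shift(observed_mass, theoretical_mass,
                       tolerance_da=0.02, max_count=3):
    """Return (modification, count) pairs that explain the mass difference.

    Only multiples of a single modification type are considered, which is
    a deliberate simplification for this sketch.
    """
    delta = observed_mass - theoretical_mass
    matches = []
    for name, shift in KNOWN_SHIFTS.items():
        for count in range(1, max_count + 1):
            if abs(delta - count * shift) <= tolerance_da:
                matches.append((name, count))
    return matches


# A hypothetical 10 kD protein observed about 159.93 Da heavier than expected.
print(explain_mass_shift(10159.93266, 10000.0))  # [('phosphorylation', 2)]
```

The example also shows why high mass accuracy matters: one acetylation (42.011 Da) and three methylations (42.047 Da) differ by less than 0.04 Da, so they can only be told apart when the intact mass is measured to within a few hundredths of a dalton. Locating the modified residue still requires the fragment ions.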
22.2.5 Protein Identification by Database Searching
A key advance in MS-based proteomics was the development of algorithms for protein identification by matching mass spectrometry data to a sequence database. Originally this was done using accurate peptide mass data (MS spectra); however, it is now almost exclusively done by combining the fragmentation spectra of selected precursor ions (MS/MS spectra) with their peptide mass information. The algorithms allow automated interrogation of genomic databases with large acquired MS and MS/MS data sets, using predetermined parameters and other search criteria to generate lists of putative peptide spectrum matches. As DNA sequencing technologies have continued to advance, the number of species with fully sequenced genomes has increased exponentially in the past decade. To better exploit these new resources, many new searching algorithms have been developed. A variety of databases can be used for database searches, including protein, DNA, and expressed-sequence tag (EST) databases. Table 22.2 shows a partial list of commonly used open access and commercially available databases. In addition to the sequence entries of proteins, databases also contain annotation information, which provides links to information about protein function and homology, the presence and position of PTMs, domain and higher-order structures, etc. Many search algorithms have been developed and have become publicly available. Among them, MASCOT (Perkins et al., 1999) and SEQUEST (Eng et al., 1994) are the most widely used database searching tools based on MS/MS sequence ion data. These search engines compare the observed fragment ion spectra with those calculated for all possible peptides generated through an in silico digestion of all the proteins present in the database. Each of the putative matches is ranked by a score, which can be a simple cross-correlation, a probability, or an arbitrary similarity measurement. We use these scores as a measure of the significance of each potential match.
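The in silico digestion on which these search engines rely can be sketched in a few lines of Python. The example below is a simplified illustration and does not reproduce MASCOT or SEQUEST: it applies the usual trypsin rule (cleavage after K or R unless the next residue is P), computes monoisotopic peptide masses from standard residue masses, and shortlists the peptides whose mass matches an observed precursor mass. The function names, the 10 ppm tolerance, and the toy protein sequence are assumptions; real engines go on to score predicted fragment ions against the MS/MS spectrum.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(sequence, missed_cleavages=0):
    """Cleave after K or R, except when the next residue is P."""
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 1 + missed_cleavages, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """m/z of the protonated peptide at the given charge state."""
    return (monoisotopic_mass(peptide) + charge * PROTON) / charge


def candidates_for_mass(observed_mass, proteins, tol_ppm=10.0):
    """List (protein, peptide) pairs within tol_ppm of the observed mass."""
    hits = []
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            mass = monoisotopic_mass(peptide)
            if abs(mass - observed_mass) / mass * 1e6 <= tol_ppm:
                hits.append((name, peptide))
    return hits


toy_db = {"P1": "AKRPGKLM"}            # hypothetical protein sequence
print(tryptic_digest(toy_db["P1"]))    # ['AK', 'RPGK', 'LM']
print(candidates_for_mass(456.2808, toy_db))  # [('P1', 'RPGK')]
```

The R-P bond in the toy sequence is left intact, which is why RPGK appears as a single peptide.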
The search engines can also identify PTMs by treating each candidate modification as either fixed or variable on the specified residues of the peptides. Some of these algorithms take fragment ion intensity into account in addition to m/z, while others consider m/z alone. This is an important distinction, because those that do not factor in ion abundance tend to assign low-intensity signals to the missing ions of a sequence-specific ion series, greatly increasing the chance of a false-positive match. It is therefore important to set appropriate thresholds to avoid assigning background noise to the spectrum. Manual inspection of the MS/MS spectra is often necessary to verify that the most abundant ions are indeed the ones used to justify the proposed match. Nevertheless, false-positive identifications cannot always be avoided through manual inspection, particularly in shotgun applications, as they arise simply as a consequence of searching large MS data sets against large sequence databases.
TABLE 22.2. A Partial List of Commonly Used Open Access and Commercially Available Databases

Database     Description                            URL                                        Frequency of updates
GenBank      Nucleotide                             www.ncbi.nlm.nih.gov/Genbank/index.html    Daily with specific releases every 3 months
EMBL         Nucleotide/protein                     www.ebi.ac.uk                              Daily with specific releases every 3 months
GenPept      Nucleotide/protein                     www.ncbi.nlm.nih.gov/Genbank/index.html    Daily with specific releases every 3 months
Swiss-Prot   Protein                                www.ebi.ac.uk/swissprot                    Daily with specific releases every 3 months
TrEMBL       Protein                                www.expasy.ch                              Daily with specific releases every 6 months
UniProt      Protein                                www.uniprot.org                            Daily with specific releases every 6 months
OWL          Protein                                www.bioinf.man.ac.uk/dbbrowser/OWL         Daily with specific releases every 6 months
DDBJ         Nucleotide/protein                     www.ddbj.nig.ac.jp                         Daily with specific releases every 3 months
PIR          Protein                                http://pir.georgetown.edu/pirwww/dbinfo    Daily with specific releases every 3 months
dbEST        Expressed-sequence tag                 www.ncbi.nlm.nih.gov/dbEST                 Daily with specific releases every week
PDB          Protein three-dimensional structures   www.rcsb.org/pdb                           Weekly
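The false positives that arise from large-scale searching are usually quantified with the target/decoy estimate of the false discovery rate (FDR) described in this section. A minimal Python sketch of that calculation is given here. It assumes that each search hit has already been reduced to a single score where higher is better, and it uses sequence reversal to build the decoys; the function names, the DECOY_ prefix, and the example scores are illustrative assumptions rather than the conventions of any particular search engine.

```python
def reverse_decoy(target_db):
    """Build a decoy database by reversing every target sequence."""
    return {f"DECOY_{name}": seq[::-1] for name, seq in target_db.items()}


def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """FDR = decoy hits / (target hits + decoy hits) at a score cutoff."""
    hits_target = sum(score >= threshold for score in target_scores)
    hits_decoy = sum(score >= threshold for score in decoy_scores)
    total = hits_target + hits_decoy
    return hits_decoy / total if total else 0.0


def threshold_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest observed score cutoff whose estimated FDR is <= max_fdr."""
    for cutoff in sorted(set(target_scores) | set(decoy_scores)):
        if fdr_at_threshold(target_scores, decoy_scores, cutoff) <= max_fdr:
            return cutoff
    return None


# Made-up scores from searching the same spectra against both databases.
target_scores = [50, 45, 40, 30, 20, 10]
decoy_scores = [25, 12, 8]

print(reverse_decoy({"P1": "AKRPGKLM"}))                   # {'DECOY_P1': 'MLKGPRKA'}
print(fdr_at_threshold(target_scores, decoy_scores, 22))   # 0.2
print(threshold_for_fdr(target_scores, decoy_scores))      # 30
```

With a cutoff of 22, four target hits and one decoy hit pass, giving an estimated FDR of 1/5; raising the cutoff to 30 removes the last decoy hit.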
Statistically, there is a finite probability that some of the MS/MS spectral data generated in the experiment will match, by chance, the spectral data expected for some of the peptides present in the database. The actual number of these false discoveries depends on the quality of the MS data as well as the size of the data set and of the database used. To properly interpret the protein identification data, it is necessary to have a reliable estimate of the false discovery rate (FDR), which is a measure of the percentage of putative protein identifications that are likely to be false. The most commonly used method to determine the FDR associated with a particular set of experimental data involves a target/decoy strategy. In this approach a database appropriate to the mixture of proteins to be analyzed is defined as the target. The decoy database is derived from the target by reversing or shuffling the target sequences, or by generating the decoy sequences randomly using a model with parameters derived from the target sequences. Both target and decoy databases are searched with the same experimental data, and the FDR is defined (for a given score threshold) as $\mathrm{FDR} = \mathrm{Hits}_{\mathrm{decoy}} / (\mathrm{Hits}_{\mathrm{target}} + \mathrm{Hits}_{\mathrm{decoy}})$. With the high quality of MS and MS/MS spectra acquired by high mass accuracy/resolution instruments
and with the criterion of at least two distinct peptide hits for each identified protein, FDRs of <1% are achievable.

22.2.6 Quantitative Proteomics

One of the advantages of MS-based proteomics is the ability to systematically identify differentially expressed proteins in biological samples by quantitative analysis. Differential expression of proteins in a cell reflects different regulated states of the cell, a disease state, or other biological perturbations such as stress from external factors or experimental manipulation. Quantitative proteomics is one of the most widely used techniques for identifying biomarkers associated with a particular biological pathway, phenotype, or disease. Such analysis is the first important step toward understanding the function of proteins in systems biology and can be used for early diagnostic intervention and prevention of a disease. A general strategy for screening biomarkers is to analyze samples from several different states (e.g., disease vs. healthy control) and compare the protein expression patterns across the samples in a relative quantitative proteomics analysis. Both the 2D gel approach and the shotgun approach, using either stable isotope-labeling methods or label-free techniques, are frequently applied in comparative proteomics studies. As these are unbiased discovery methods, they make it possible to find novel target proteins and correlations between the expression of different proteins. It should be pointed out that it is extremely important to evaluate replicate datasets in order to distinguish biological variation from technical variation, as discussed in the section on experimental design.

22.2.6.1 Two-Dimensional Gel Electrophoresis-Based Quantitation

2DE is still a widely used method for quantitative proteomics due to its general availability, its requirement for less expensive instrumentation, and the rich literature available concerning its application to a broad array of biological problems.
In a classical 2DE approach, proteins from control and test samples are processed on separate gels, which are then stained with a suitable dye (colloidal Coomassie blue, Sypro Ruby, etc.). After images of the stained gels are acquired, the spot densities are determined and compared by image analysis software, which provides quantitative information that identifies differentially expressed protein spots. Spots whose intensity change exceeds some arbitrary threshold of significance are selected as possible biomarkers and are analyzed further by mass spectrometry. Unfortunately, this approach suffers from limitations related to gel reproducibility and is inherently laborious and error prone. Fluorescent 2D difference gel electrophoresis (DIGE), on the other hand, avoids these limitations by allowing comparative analyses to be performed in a single gel (Gharbi et al., 2002; Unlu et al., 1997). DIGE involves covalent labeling of two different protein extracts (e.g., from tissue A and tissue B) with
Figure 22.2. Two-dimensional-gel-based comparative proteomics workflow. In this approach, quantitative information is acquired through analysis of gel images to determine spot densities, which are compared gel to gel to identify differentially expressed protein spots. These spots of interest are then subjected to in-gel digestion and MS-based protein identification and characterization.
one of two fluorescent cyanine dyes (typically Cy3 and Cy5). A third fluorescent dye (Cy2) can be used to label a third protein sample (e.g., a small amount of a mixture of the A and B extracts) to provide an internal standard for sample normalization. The three labeled protein samples are then mixed and separated on the same 2D gel, so that specific proteins from different extracts migrate to the same gel position. The gel is scanned with a variable-wavelength laser-based imaging system. Since Cy2, Cy3, and Cy5 exhibit distinct excitation and emission spectra, it is possible to rapidly quantify and distinguish among proteins present in all three samples. Reports describing the use of DIGE have clearly demonstrated the power of this technique (Gharbi et al., 2002; Minden et al., 2009). Figure 22.2 shows a schematic of the 2D-gel-based comparative proteomics workflow. Despite the inherent power of DIGE and the high resolution provided by the 2DE approach, it remains impossible to resolve an entire proteome in a single electrophoretic analysis. Reports have shown that many seemingly well-resolved protein spots isolated from large-format 2D gels contain more than a single protein. This observation poses a significant
challenge for the proper interpretation of comparative gel experiments in which changes in protein abundance are inferred from changes in gel spot volumes (staining intensities). Clearly, it is necessary to know how the observed change is distributed among the various proteins present in the spot. To tackle this issue, we have developed an integrated workflow to quantitatively evaluate the abundance of each protein identified within a single gel spot. This approach integrates 2D-gel-based GeLC-MS/MS analysis with an empirically derived tool, the exponentially modified protein abundance index (emPAI), to correctly distribute the change in spot intensity determined by gel staining and image analysis among the protein constituents of the spot (Yang et al., 2007).
22.2.6.2 Shotgun Proteomics-Based Quantitation
Stable isotope labeling techniques have clearly become the dominant approach in shotgun-based quantitative analysis, although various label-free methods have also been proposed and used. The analytical idea behind the stable isotope technique is to create a mass shift that distinguishes the same peptides from different samples within a single MS or MS/MS analysis. Stable isotope labels can be introduced by metabolic, chemical, or enzymatic processes. Metabolic labeling occurs during protein synthesis, through incorporation of either a stable isotope of a particular element (e.g., 15N) or isotopically labeled amino acids. Stable isotope labeling with amino acids in cell culture (SILAC) has become a popular method (Ong et al., 2002). In a typical procedure, one sample is grown with amino acids containing the light isotope and the other with amino acids containing a heavy isotope. After the two samples are mixed in a 1 : 1 ratio, proteins are extracted, digested, and analyzed by LC-MS/MS. The signal intensities of the labeled and unlabeled forms of each peptide are measured in the full-MS scan, providing the quantitative information.
The biggest advantage is that, after labeling, samples can be combined before lysis, which eliminates quantitation errors introduced by subsequent sample processing. The main limitation of metabolic labeling is that it is not practical for organisms that cannot be grown in culture. In addition, labeling with stable isotopes usually makes downstream MS analysis more complex. Chemical labeling techniques for quantitative proteomics include isotope-coded affinity tags (ICAT), used for labeling free cysteine at both the protein and peptide levels (Gygi et al., 1999), and isobaric tags for relative and absolute quantification (iTRAQ), used for labeling free amine groups, mainly at the peptide level (Ross et al., 2004). iTRAQ has emerged as one of the most widely used techniques in quantitative shotgun analysis. The isobaric tag consists of an amine-specific reactive group, a balance group, and a reporter group that provides the mass signature. The key feature of this approach is its high degree of multiplexing: up to eight samples can be monitored in a single analysis. After labeling, a full-MS scan shows the same mass for the same peptide from different samples because the tags are isobaric. It is only upon fragmentation at the MS/MS stage, with neutral loss of the balance group, that
the individual mass tags, reflecting the relative concentration of the original labeled peptide in each of the samples, become evident. This multiplexing feature greatly reduces the run time required to analyze multiple samples. Enzymatic labeling is carried out using 18O water either during the peptide bond hydrolysis step of digestion (Yao et al., 2001) or after digestion (Yao et al., 2003), exploiting the carboxy-terminal oxygen exchange activity of the protease. Hydrolysis of a peptide bond by a C-terminal protease such as trypsin incorporates one 18O atom into the carboxy terminus of each peptide generated, through nucleophilic attack of the water on the carbonyl carbon of the cleaved peptide bond. Following the bond cleavage step, the peptides undergo carboxyl oxygen exchange with the water in a reaction that is analogous to the enzymatic formation of a peptide bond (Miyagi and Rao, 2007). Thus, if the reaction is allowed to continue to equilibrium, two 18O atoms are incorporated, creating a total mass shift of 4 Da. This allows for accurate relative peptide quantitation in the full-MS scan. One potential advantage of 18O labeling is that every tryptic peptide of a protein (except the C-terminal peptide) can be labeled, which can enhance confidence in quantitation through multiple peptide data points. Limitations of this approach include the variability in labeling kinetics observed for different substrates and the requirement for a high-accuracy mass spectrometer to distinguish between the labeled and unlabeled peptides. It should be noted that the labeling efficiency of the carboxyl oxygen exchange reaction is much lower than that of the peptide bond cleavage step, because there is only a 50% chance of replacing the 16O atom with an 18O atom in each enzymatic cycle. Calculations show that roughly five enzymatic cycles are required to achieve essentially complete incorporation.
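The "roughly five cycles" estimate can be reproduced with a short calculation. The sketch below is only an illustration under stated assumptions: each enzymatic cycle is treated as an independent event with a 50% chance of exchange (as stated above), back exchange is ignored, and "complete" is taken to mean at least 95% incorporation, a threshold chosen here for illustration rather than one given in the text. The function names are hypothetical.

```python
import math


def fraction_exchanged(cycles: int, p_exchange: float = 0.5) -> float:
    """Fraction of peptides whose remaining 16O has been replaced by 18O
    after `cycles` independent exchange cycles (back exchange ignored)."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return 1.0 - (1.0 - p_exchange) ** cycles


def cycles_needed(target: float, p_exchange: float = 0.5) -> int:
    """Smallest whole number of cycles that reaches `target` incorporation.
    `target` is an assumed definition of 'complete' (e.g. 0.95)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_exchange))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, round(fraction_exchanged(n), 4))
    print("cycles for >=95%:", cycles_needed(0.95))
```

Under these assumptions, five cycles give about 97% incorporation, which is consistent with the figure quoted above; real peptides exchange at different rates, so this is an idealized bound rather than a prediction.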
It is not surprising that the rates of the carboxyl oxygen exchange reaction are unpredictable, as they vary from peptide to peptide. Furthermore, there can be significant back exchange with 16O, particularly when the reaction volume is small and not scrupulously protected from atmospheric water. Thus the inherent weakness of this approach is variable 18O incorporation, which greatly complicates peptide quantification and leads to significant errors (Miyagi and Rao, 2007). One way of avoiding these complications is to substitute an amino-terminal (NT) protease such as Lys-N or Asp-N for the more commonly used trypsin. Like trypsin, NT proteases incorporate a single 18O atom into the carboxyl terminus formed when the peptide bond is hydrolyzed. However, because an NT protease has no affinity for the newly generated carboxy-terminal amino acid, it does not catalyze any further exchange. Thus the total mass shift is 2 Da, and quantification is not complicated by either variable rates of carboxyl oxygen exchange or back exchange (Miyagi and Rao, 2007). Despite the popularity of label-based methods, their main limitations are the cost of isotopic labeling and the increased complexity of the experimental procedures, which can lead to sample loss and add experimental variation. As a result, label-free quantitative approaches have recently received considerable attention as promising alternatives. The label-free
methods are based on comparisons of LC-MS analyses of peptide ions between different samples. Peptide ions with nearly the same mass, charge, and retention time in each sample are compared and quantified on the basis of the difference in peptide signal intensity, which correlates well with protein abundance in complex samples (Domon and Aebersold, 2006). A similar, slightly modified approach is based on the observation that the average MS signal response for the three most intense tryptic peptides of every protein in an experiment is constant within a coefficient of variation of about ±10%. Thus, with a known concentration of a standard protein spiked into the complex samples as an internal standard, absolute quantification can be achieved for all proteins identified on the basis of three or more peptide IDs (Silva et al., 2006a). Another commonly used label-free method is spectral counting, in which the number of MS/MS spectra assigned to each protein is correlated with protein abundance (Choi et al., 2008; Liu et al., 2004). A related measure is the exponentially modified protein abundance index (emPAI), in which the number of peptides matched to each protein serves as a surrogate for protein abundance (Ishihama et al., 2005). Nevertheless, despite the success of label-free quantitation in some studies, from a practical perspective it is extremely challenging to generate reproducible patterns across all samples and time points and to develop software tools that can reliably align and match related patterns (Domon and Aebersold, 2006). A target-driven quantitative approach based on conventional selected reaction monitoring (SRM) MS analysis has been developed for absolute quantitation (Gerber et al., 2003). Technically, this approach directly adapts the SRM-based absolute quantitation of small molecules by traditional triple quadrupole MS analysis.
The absolute quantitation of a target peptide selected from a previously identified target protein is obtained by the use of a stable isotope labeled synthetic peptide serving as an internal standard (IS). The sample is spiked with a known amount of the isotope-labeled synthetic IS peptide and analyzed by LC-SRM (often called multiple reaction monitoring, MRM, reflecting SRM on multiple transition ion pairs). Because the IS peptide and the endogenous target peptide have identical structures but different masses, the two peptides co-elute from the LC yet are resolved in separate MRM channels by monitoring their distinct transition ion pairs. The ratio of the chromatographic peak areas of the target and IS peptides is calculated, and from it the absolute quantity of the target is obtained. This approach has been used successfully for absolute quantitation of phosphopeptides (Gerber et al., 2003; Mayya et al., 2006). MRM-based quantitation has proven to be robust, highly selective, and sensitive and is currently thought to be the most accurate method available for both absolute and relative quantitation. Limitations of this approach include the requirement for prior knowledge of the target protein/peptide and the need for an isotopically labeled synthetic IS peptide for each targeted peptide. Therefore, the MRM approach is not practically affordable for large-scale peptide quantitation. Figure 22.3 summarizes the strategies of quantitative peptide analysis for shotgun proteomics-based quantitation.
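The peak-area ratio calculation described for the IS-based MRM approach can be sketched in a few lines. This is a minimal illustration, not the published procedure: it assumes the labeled IS peptide and the endogenous peptide give the same signal response per mole and that both peaks lie in the linear range of the instrument. The function name, argument names, and the numbers in the example are hypothetical.

```python
def absolute_amount(target_area: float, is_area: float, is_amount_fmol: float) -> float:
    """Estimate the amount of endogenous target peptide (fmol) from the
    target/IS peak-area ratio and the known spiked amount of IS peptide.

    Assumes equal response for the light (endogenous) and heavy (IS) forms.
    """
    if is_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    if target_area < 0 or is_amount_fmol < 0:
        raise ValueError("areas and amounts must be non-negative")
    ratio = target_area / is_area
    return ratio * is_amount_fmol


if __name__ == "__main__":
    # Hypothetical peak areas from the two MRM channels, 50 fmol IS spiked in.
    print(absolute_amount(target_area=2.4e6, is_area=1.2e6, is_amount_fmol=50.0))
```

In practice the same ratio is usually averaged over several transition ion pairs per peptide; a single pair is used here to keep the arithmetic visible.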
Figure 22.3. Generic strategies for quantitative peptide analyses in shotgun proteomics. There are two main categories of MS-based peptide quantitative methods: (a) labeled and (b) label-free approaches. In a typical stable isotope labeling procedure, one sample is labeled with light isotope containing amino acids and the other with a heavy isotope. After mixing these in a 1 : 1 ratio, the mixed peptides are analyzed by LC-MS/MS. a-1, Signal intensities of the heavy isotope-labeled peptides and the light isotope-labeled peptides are measured in the full-MS scan, providing quantitative information. a-2, Alternatively, for mixed peptides labeled with isobaric tags, the relative signal intensities of reporter ions generated in the MS/MS of the isobaric tagged peptide precursor ion allow for relative peptide quantitation of multiple samples initially tagged with four to eight different reporters. For label-free applications, three specific methods are commonly used. b-1, One popular approach focuses on the analysis of 2D images of ion intensity as a function of retention time and m/z from an LC-MS/MS run, where the extracted ion chromatogram for specific ions is generated and its peak intensities are used as the abundance measure. b-2, Another approach involves spectral counting, where the number of spectra matched to peptides from a protein and/or the number of identified peptides (protein abundance index, PAI, and exponentially modified PAI, emPAI) is used as a measure of protein abundance. b-3, The MRM-based quantitative technique is a target-driven quantitative approach used for the detection of a specific peptide with known mass and fragmentation properties in complex samples. The MRM transition ion pair consists of one precursor ion (m/z) selected by Q1 and one specific fragment ion characteristic of that precursor selected by Q3.
The instrument detects and cycles through a series of transition pairs and registers the signal as a function of chromatographic elution time. Quantitation of a target peptide is obtained by the use of a stable isotope labeled synthetic peptide serving as an internal standard (IS) spiked into each sample. The ratio of the chromatographic peak area for target/IS peptides can be calculated and relative/absolute quantitation can be achieved across samples. It should be noted that in the label-free approach, each sample must be run individually and then the data are compared by one of three methods.
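The PAI and emPAI measures named in panel b-2 can be written down compactly. As defined by Ishihama et al. (2005), PAI is the number of observed peptides divided by the number of observable peptides for a protein, emPAI = 10^PAI - 1, and the molar fraction of a protein is its emPAI divided by the sum of emPAI values over all identified proteins. The sketch below is illustrative only; the protein names and peptide counts are invented, and how "observable" peptides are counted is left to the caller.

```python
def empai(n_observed: int, n_observable: int) -> float:
    """Exponentially modified protein abundance index: 10**PAI - 1,
    where PAI = observed peptides / observable peptides."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    if n_observed < 0:
        raise ValueError("n_observed must be non-negative")
    return 10.0 ** (n_observed / n_observable) - 1.0


def molar_fractions(empai_by_protein: dict) -> dict:
    """Convert emPAI values into molar fractions that sum to 1."""
    total = sum(empai_by_protein.values())
    if total <= 0:
        raise ValueError("at least one protein must have emPAI > 0")
    return {name: value / total for name, value in empai_by_protein.items()}


if __name__ == "__main__":
    # Hypothetical proteins co-migrating in one gel spot:
    # (observed peptides, observable peptides).
    counts = {"protein_A": (10, 10), "protein_B": (5, 10)}
    indices = {name: empai(obs, total) for name, (obs, total) in counts.items()}
    for name, frac in molar_fractions(indices).items():
        print(name, round(indices[name], 3), round(frac, 3))
```

Fractions of this kind are what allow a change in gel spot intensity to be apportioned among the proteins found in that spot, as in the integrated workflow described earlier.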
22.2.7 PTM Characterization: Phosphoproteome
Protein phosphorylation is one of the most important PTMs in living cells. It is one of the few modifications described to date that has been shown to be reversible. It plays an important regulatory role in various cell activities, including growth, differentiation, division, and metabolism. It is necessary to pinpoint the phosphorylation sites in a protein in order to fully understand the regulatory mechanism involved, yet this remains a significant analytical challenge. The difficulties are due, in part, to the wide dynamic range of phosphorylated proteins, the low stoichiometry at any given site, and the labile nature of the phosphate group with its poor ionization efficiency in electrospray analysis. MS has proven to be the most efficient means of identifying and quantifying phosphorylated residues in a protein mixture.
22.2.7.1 Enrichment Strategies
Due to their low stoichiometry, phosphopeptides often represent a small proportion of the total number of peptides in a shotgun sample. Therefore, selective enrichment of phosphopeptides before MS analysis has become increasingly important to the success of a project. To date, the most successful and widely used methods for enrichment have combined chromatography with affinity-based approaches. The chromatographic techniques include immobilized metal affinity chromatography (IMAC), metal oxide affinity chromatography (MOAC), antibody-based affinity chromatography, strong cation-exchange chromatography (SCX), and hydrophilic interaction chromatography (HILIC). Affinity-based techniques rely on immunopurification using antibodies against phosphoamino acid epitopes, on magnetic particles and nanoparticles, and on metal ion-phosphopeptide precipitation (Han et al., 2008). IMAC is based on metal chelators, such as iminodiacetic acid and nitrilotriacetic acid, covalently linked to a stationary support that immobilizes certain trivalent metal ions (Ficarro et al., 2002; Anderson and Porath, 1986).
The high affinity of the metal ion coordination sites for the negatively charged phosphate groups allows for the specific capture of phosphorylated peptides. One of the challenges in IMAC is that strongly acidic peptides rich in glutamic acid or aspartic acid residues can also be captured by the metal complexes. Consequently, additional chemical derivatization such as methylation is often needed to prevent displacement of phosphopeptides by abundant acidic peptides. MOAC is an alternative enrichment technique that does not require a separate metal ion charging step, as the metal ions are part of the solid metal oxide support. Titanium dioxide (TiO2) is the most commonly used material for this type of enrichment (Pinkse et al., 2004). The disadvantage of the TiO2 technique is that the anion-exchange properties of TiO2 lead to similar nonspecific binding of acidic peptides. As a result, 2,5-dihydroxybenzoic acid (DHB) is frequently included in the buffer to minimize nonspecific binding, because it competes with carboxylic acids for binding (Larsen et al., 2005).
Phosphopeptide enrichment by SCX is based on the difference in net charge at pH 2.7 between phosphopeptides (typically +1 for singly phosphorylated tryptic peptides) and nonphosphopeptides (typically +2 for tryptic peptides). Because of their decreased net charge, phosphopeptides elute in the early fractions of cation-exchange chromatography and are thereby enriched (Beausoleil et al., 2004). It should be noted that multiphosphorylated peptides are often not retained on the SCX column under normal conditions because their net charge is zero. Those phosphopeptides will be in the unbound (flow-through) fraction. As a result, SCX is usually used as a first separation/enrichment step, followed by IMAC or TiO2 enrichment of the flow-through and early eluting fractions. The combination of SCX with IMAC or TiO2 enrichment has been applied in many large-scale phosphoproteome projects (Villen et al., 2007; Olsen et al., 2006). HILIC has recently emerged as an attractive alternative for large-scale enrichment of phosphopeptides (McNulty and Annan, 2008). In HILIC analysis, peptide retention is based on hydrophilicity. Phosphopeptides are bound to the HILIC column in an organic solvent and eluted with a gradient of aqueous solvent. As a result, HILIC is truly orthogonal to RPLC-MS analysis, and phosphopeptides elute from HILIC columns in a solution that is directly compatible with subsequent RPLC-MS/MS analysis. Drawbacks of HILIC include the decreased solubility of longer peptides in the organic phase and the tighter binding of multiply phosphorylated peptides, which makes them difficult to elute. Immunoaffinity purification using antiphosphotyrosine antibodies is well established for investigation and enrichment of the phosphotyrosine proteome (Rush et al., 2005).
The antibodies can be applied directly to peptide digests, which allows for specific enrichment and mapping of individual phosphotyrosine-containing peptides and thus for large-scale determination of tyrosine phosphorylation sites. Peptides containing phosphotyrosine are isolated directly from protease-digested protein extracts with a phosphotyrosine-specific antibody and are identified by tandem mass spectrometry. Applying this approach to several cell systems, including cancer cell lines, showed that it can be used to identify activated protein kinases and their phosphorylated substrates without prior knowledge of the signaling networks that are activated (Rush et al., 2005; Rikova et al., 2007).
22.2.7.2 Phosphopeptide Analysis by MS
Phosphopeptides can be identified and sequenced by product ion scanning on tandem mass spectrometers in positive mode, in the same way as described earlier for nonphosphorylated peptides. For phosphoserine (pS)- and phosphothreonine (pT)-containing peptides, the fragmentation spectrum often reveals a loss of phosphoric acid (H3PO4, 98 Da) due to a β-elimination reaction. As a result, the location of the phosphorylation site for pS can be identified by a mass difference between two successive fragments of either 167 (87 + 80) Da or 69 Da (β-elimination product, dehydroalanine). Similarly, a mass difference
of either 181 (101 + 80) Da or 83 Da (β-elimination product, dehydroamino-2-butyric acid) represents the site for pT. Phosphotyrosine (pY) is a relatively stable modification, as its phosphate group is not found in the β position, and mass spectrometric fragmentation yields product ions with the modification still attached. The pY site can be determined by a mass difference of 243 Da (163 + 80). Precursor ion scanning is one of the most commonly used MS acquisition strategies for selectively fragmenting and sequencing phosphopeptides in complex mixtures. In this method, the characteristic diagnostic fragment ions are the phospho group (at m/z 79 in negative ion mode) and the pY immonium ion (at m/z 216.043 in positive ion mode), which are produced in the collision cell (q2) and selectively monitored in the second mass analyzer. In precursor ion scanning mode, any precursor peptide ion that produces the designated diagnostic ion after fragmentation is registered as a signal at its m/z in the first mass analyzer. In other words, the detector signal is the result of all precursor ions that can fragment to a common diagnostic product ion. Because only ions that produce the specific diagnostic fragment ions are detected, this technique allows for highly selective and sensitive detection of phosphopeptides in complex samples. However, one weakness of this method is that MS and MS/MS spectra in negative ion mode are generally of poor quality, which significantly reduces detection sensitivity. Furthermore, an additional MS/MS scan in positive ion mode is often required to determine the sequence of the targeted peptide ion that gave rise to the diagnostic ion in Q3.
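The nominal mass differences quoted earlier in this section for pS, pT, and pY (167, 181, and 243 Da, and 69 or 83 Da after β-elimination) follow from standard monoisotopic residue masses. The sketch below recomputes them and matches an observed difference between two successive fragment ions against the resulting table. It assumes singly charged fragment ions and uses an illustrative tolerance of 0.02 Da; the function names and the tolerance are assumptions, not part of the original workflow.

```python
# Standard monoisotopic masses (Da).
HPO3 = 79.96633          # mass added by phosphorylation
H2O = 18.01056           # water; Ser/Thr minus H2O gives the beta-elimination product
RESIDUE = {"S": 87.03203, "T": 101.04768, "Y": 163.06333}


def diagnostic_deltas() -> dict:
    """Mass differences between successive fragments that mark a phosphosite."""
    deltas = {f"p{aa}": mass + HPO3 for aa, mass in RESIDUE.items()}
    # Loss of H3PO4 from pS/pT leaves dehydroalanine / dehydroamino-2-butyric acid,
    # which has the same mass as the unmodified residue minus water.
    for aa in ("S", "T"):
        deltas[f"p{aa} (beta-eliminated)"] = RESIDUE[aa] - H2O
    return deltas


def match_delta(observed: float, tol: float = 0.02) -> list:
    """Return the labels whose expected mass difference lies within `tol` Da."""
    return [label for label, expected in diagnostic_deltas().items()
            if abs(observed - expected) <= tol]


if __name__ == "__main__":
    for label, value in diagnostic_deltas().items():
        print(f"{label}: {value:.4f}")
    print(match_delta(69.02))
```

On low-resolution instruments a wider tolerance would be needed, and an unmodified residue or residue combination of similar mass could then match as well, so a hit is a candidate site rather than proof.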
Recently, a fully automated nanoLC-MS/MS method with polarity switching was successfully applied to the identification of phosphorylation sites in complex samples using a hybrid triple quadrupole linear ion trap, where precursor ion scanning of m/z 79 in negative ion mode served as a survey scan to trigger subsequent MS/MS spectra of phosphopeptide candidates in positive ion mode (Williamson et al., 2006). Neutral loss scanning is another useful MS detection technique, carried out in positive ion mode on triple quadrupole instruments, for selective detection of pS and pT residues as a result of their β-elimination reaction (Carr et al., 2005). Instead of using Q1 to select specific ions for fragmentation in q2, as in typical product ion scanning, Q1 is scanned in coordination with Q3 at a constant mass offset that corresponds to a specific neutral loss. The detected signal is the result of the loss of a specific neutral species, forming a product ion with a characteristic mass difference. For pS- and pT-containing phosphopeptides, the neutral loss of phosphoric acid produces a difference of 97.98, 48.99, or 32.66 m/z units relative to the singly, doubly, or triply charged phosphorylated precursor ion, respectively. Neutral loss scanning is not as popular as precursor ion scanning because of the propensity for false-positive signals and the need to know the charge state of the ion that loses the phosphate. However, it has the advantage that it is carried out in the positive ion mode and can therefore be
used with direct data-dependent scanning to acquire MS/MS spectra in a single analysis. This feature has made the method one of the most widely used approaches in phosphoproteome analysis. For example, detection of a neutral loss of phosphate can serve as a survey scan that selectively triggers subsequent data-dependent MS/MS scanning in a hybrid triple quadrupole linear ion trap (Cox et al., 2005). Alternatively, a neutral loss MS/MS product ion can selectively trigger its MS/MS/MS in a 2D linear ion trap (LTQ) with software-controlled neutral loss-dependent MS/MS/MS capabilities (Zhang et al., 2004). In the neutral loss-dependent MS/MS/MS mode of operation, the software automatically looks for a diagnostic neutral loss of 98, 49, or 32.7 amu, corresponding to a singly, doubly, or triply charged phosphopeptide, respectively, in all acquired data-dependent MS/MS spectra. If one is observed, the software automatically triggers MS/MS/MS fragmentation of the ion generated by the neutral loss (Zhang et al., 2004; Gruhler et al., 2005). In this way, the sequence of the phosphorylated parent is verified. Besides traditional CID analysis of phosphopeptides using the aforementioned techniques, ECD and the related ETD fragmentation technologies have been particularly valuable for analysis of labile phosphorylation modifications, as both were shown to fragment phosphopeptides while preserving the phosphate moiety on all fragment ions (Chi et al., 2007; Stensballe et al., 2000). It is also possible to use both CID and ETD in the same experiment in a combined strategy, with CID for doubly charged peptides and ETD for more highly charged peptides (Swaney et al., 2008; Good et al., 2007).
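The neutral-loss values of about 98, 49, and 32.7 quoted above are simply the monoisotopic mass of phosphoric acid (97.9769 Da) divided by the precursor charge. The sketch below shows that arithmetic together with a simple check of the kind a neutral loss-triggered method performs on an MS/MS spectrum. It is an illustration under assumptions: the function names, the 0.3 m/z tolerance, and the example m/z values are hypothetical and do not describe any vendor's acquisition software.

```python
H3PO4 = 97.97690  # monoisotopic mass of phosphoric acid (Da)


def neutral_loss_mz(charge: int) -> float:
    """m/z shift produced when an ion of the given charge loses H3PO4."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return H3PO4 / charge


def find_neutral_loss_ions(precursor_mz: float, charge: int,
                           fragment_mzs: list, tol: float = 0.3) -> list:
    """Return fragment m/z values consistent with loss of H3PO4 from the precursor.

    The fragment is assumed to keep the precursor charge; `tol` is an
    illustrative matching window in m/z units.
    """
    expected = precursor_mz - neutral_loss_mz(charge)
    return [mz for mz in fragment_mzs if abs(mz - expected) <= tol]


if __name__ == "__main__":
    for z in (1, 2, 3):
        print(z, round(neutral_loss_mz(z), 2))
    # Hypothetical doubly charged precursor and MS/MS peak list.
    print(find_neutral_loss_ions(650.30, 2, [500.20, 601.31, 640.10]))
```

The dependence on charge is the reason the charge state of the ion losing phosphate must be known: the same observed offset corresponds to different neutral masses at different charges, which is one source of false positives.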
With the implementation of a data-dependent decision tree algorithm for unsupervised selection of either CID or ETD on the basis of precursor charge and size, the combined approach enables identification of significantly more phosphopeptides (7422) than either CID (2801) or ETD (5874) alone (Swaney et al., 2008).
22.2.7.3 Quantitation of Phosphorylation Sites
Over the past decade, numerous approaches have emerged to quantitatively analyze complex proteomics samples, including the phosphoproteome, as described above. These strategies fall into three major categories: gel-based methods, stable isotope labeling, and label-free detection. Gel-based approaches generally rely on 2DE combined with visualization by x-ray film for 32P-labeled samples, by western blotting for pY samples, or by a phosphoprotein-specific stain such as Pro-Q Diamond (Smith and Figeys, 2008; Steinberg et al., 2003). The main drawback of the gel-based methods is that they yield little or no information for quantitation of individual phosphorylation sites. Label-free peptide quantitation strategies have also been used for phosphopeptides. These rely on computational reconstruction of the LC-MS/MS elution profile of each identified peptide from its precursor ion mass, with the intensity of each ion normalized to an internal reference. The peak areas of a particular phosphopeptide identified in a separate experiment can be used to
quantify the relative changes between samples (Wang et al., 2006). However, the reproducibility and linearity of label-free quantitation of highly complex phosphoproteomes remain a significant challenge. Stable isotope labeling approaches, including enzymatic 18O-labeling, 15N-labeling, SILAC, AQUA, and iTRAQ, remain the core technologies used in MS-based proteomic quantification. Among them, SILAC and iTRAQ are currently the most frequently used techniques in quantitative phosphoproteomics (Macek et al., 2009). These techniques allow for measurement of site-specific phosphorylation changes in response to stimuli, the stoichiometry of phosphorylation sites, and the site occupancy. For absolute quantitation of phosphorylation sites, both iTRAQ and MRM-based AQUA approaches can be used to determine concentration. Practically, both are limited to small-scale analysis, as making synthetic phosphopeptide standards on a comprehensive scale is currently not cost-effective.
22.3 BIOLOGICAL IMPACT OF PROTEOMIC TECHNOLOGIES
As described above, remarkable progress in the development of proteomics technologies has been achieved over the past decade. MS-based proteomics allows for parallel analysis and rapid characterization of proteins in complex samples and has emerged as an indispensable discipline for molecular, cell, and systems biology. The ability to acquire high-content quantitative information, determine the state of modification, and study protein–protein interactions for thousands of proteins in native biological samples provides key insights into the composition, regulation, and function of protein complexes and pathways, which allows us to develop a better understanding of cellular and physiological processes and disease mechanisms. One of the important goals in proteomics research is biomarker discovery: the identification of biomolecules whose abundance is altered in response to certain disorders and is indicative of a physiological or pathological state. Combined with genomics and bioinformatics, proteomics has led to renewed interest in discovering novel biomarkers for different cell states and diseases such as cancer, which would also facilitate gene discovery for such diseases. Quantitative proteomics technologies provide opportunities to identify new biological targets (biomarkers) for cancer diagnosis, classification, prognosis and, most importantly, the development of therapies. Biomarkers can also be used to monitor the response to therapies. As a result, proteomics as a discipline is expected to have a broad and practical impact on biomedicine in addition to its impact on basic cell and systems biology. Furthermore, proteomics has continually benefited from growing genomic information through the integration of gene prediction algorithms, cDNA sequences, and comparative genomics, which enable the production of large proteomics datasets with broad coverage of the proteome.
Proteomic information can now be used for complementing nucleotide-based annotation through unambiguous determination of reading frame, translation start and
stop sites, splice boundaries, and the validation of short ORFs. This emerging field, called proteogenomics, in which proteomics data are used to assist in genome annotation, accelerates the gene annotation pipeline and offers a more accurate and complete picture of the annotated genome. Therefore, the impact of proteomics on gene annotation through proteogenomics is expected to become increasingly important to genomic projects.
22.3.1 Understanding Complex Biological Processes
MS-based proteomics technologies have increasingly become the preferred method for in-depth characterization of protein components in enormously complex biological systems. Functional proteomics attempts to correlate identification and quantitative results with the function of proteins and protein networks. With the discovery that more and more proteins have specific binding partners that depend on biological state, it has become well accepted that most proteins are present in complexes and function within complex protein networks. Thus characterization of protein complexes becomes a prerequisite for understanding the function of proteins within the cell. In the past decade, MS-based proteomics has produced substantial advances in determining the composition, regulation, and function of molecular complexes, leading to a greater understanding of the molecular basis of complex biological processes. These studies include those that map protein localization, identify protein–protein interactions, and determine sites of posttranslational modification and altered protein composition.
22.3.1.1 Protein Localization
Knowledge about protein localization within the cell often provides useful information concerning the functions of certain proteins and the metabolic pathways in which they are involved. Protein location information can also support or refute the findings of binding studies that suggest potential protein–protein interactions. Information on protein localization obtained through a subcellular approach provides spatial information that can facilitate the interpretation of results (Andersen and Mann, 2006) and can provide knowledge of the pathogenetic mechanisms involved in different diseases (Uhlen and Ponten, 2005). Traditionally, protein-specific antibodies combined with fluorescence microscopy have been the preferred technique for studying protein subcellular localization.
However, when combined with stringent and efficient prefractionation methods such as gradient centrifugation, filtration, or selective affinity isolation, MS-based proteomics approaches have become an attractive alternative for localizing proteins in organelles or large cellular structures. The two approaches are complementary: antibody-based microscopy examines single cells, whereas organellar proteomics analyzes the constituents of organelles that have been biochemically enriched, averaging over millions of cells. In addition, the microscopy approach requires the artificial creation of antibodies, which raises specificity issues with respect to the endogenous protein. Therefore, a combination of MS-based proteomics and microscopy techniques provides an attractive
validation method. Andersen and Mann (2006) summarize recent organellar proteomics work in their review, emphasizing the advantages of subcellular resolution for functional genomics. In one experiment, Mootha and colleagues (2003) used the mitochondrial proteome to identify the gene mutated in a form of Leigh syndrome, a human cytochrome c oxidase deficiency that maps to chromosome 2p16-21 by genome-wide association studies. The clinical and biochemical features suggest that the pathway underlying this disorder is involved in mitochondrial biology, and genetic analysis of affected families has implicated a specific genomic region in the development of the disease. An integrated strategy combining genomic information and tandem MS data was applied to identify candidate genes and the putative disease pathway. A single candidate gene, LRPPRC (leucine-rich pentatricopeptide repeat-containing protein), was shown to be the causative gene for the disease. Resequencing identified two mutations on two independent haplotypes, providing definitive genetic proof that LRPPRC mutations cause this form of Leigh syndrome. Data from the same mitochondrial proteome mapping experiment were also used in a sophisticated bioinformatic analysis to identify transcription-factor-binding sites upstream of mitochondrial genes. Furthermore, a gene-expression neighborhood index was defined to capture genes whose transcripts showed significant co-regulation with mitochondrial genes. The LRPPRC gene was found to encode an mRNA-binding protein involved in the processing and trafficking of mitochondrial DNA-encoded transcripts, suggesting that LRPPRC participates in mRNA processing in both the nucleus and the mitochondria. This finding also indicates that defective processing of mitochondrial DNA-encoded mRNA represents a mechanism in mitochondrial pathophysiology that had not been previously described.
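The idea of a gene-expression neighborhood index can be illustrated with a minimal sketch. It assumes, for illustration only, that the index for a gene is the number of known mitochondrial genes among its k most strongly co-expressed genes, with Pearson correlation as the measure of co-regulation. The gene names, expression values, and the choice of k are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def neighborhood_index(gene, profiles, reference_set, k):
    """Count reference-set genes among the k genes most correlated with `gene`."""
    others = [g for g in profiles if g != gene]
    ranked = sorted(others,
                    key=lambda g: pearson(profiles[gene], profiles[g]),
                    reverse=True)
    return sum(1 for g in ranked[:k] if g in reference_set)

# Hypothetical expression profiles across five tissues (illustrative values only).
profiles = {
    "candidate": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mito_a":    [1.1, 2.1, 2.9, 4.2, 5.1],
    "mito_b":    [0.9, 1.8, 3.2, 3.9, 4.8],
    "other_a":   [5.0, 4.0, 3.0, 2.0, 1.0],
    "other_b":   [2.0, 2.0, 2.1, 2.0, 1.9],
}
known_mito = {"mito_a", "mito_b"}  # assumed reference set of mitochondrial genes

candidate_index = neighborhood_index("candidate", profiles, known_mito, k=2)
control_index = neighborhood_index("other_a", profiles, known_mito, k=2)
```

In this toy data set the candidate gene scores higher than the control gene, which is the kind of signal that would flag it as co-regulated with mitochondrial genes.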
The same approach was also proposed for identifying the genes that cause other diseases, such as diabetes. This integrative approach holds great promise for disease-gene discovery. MALDI imaging mass spectrometry (IMS), which investigates the spatial distribution of proteins and small molecules within a biological system through in situ analysis of intact tissue sections, is a recent technique that has proven useful for various targets (Chaurand et al., 2006; Meistermann et al., 2006; Walch et al., 2008). MALDI-IMS can determine the distribution of hundreds of unknown compounds in a single measurement and enables the acquisition of cellular expression profiles while maintaining cellular and molecular integrity. Because it detects specific molecular masses at defined positions of a tissue section, the technique also provides spatial distribution patterns for many molecular species in a single tissue section. Compared with traditional immunohistochemistry and in situ hybridization, this technique offers the advantage that no target-specific reagent, such as an antibody, tag, or label, is required.
22.3.1.2 Protein Interactions and Protein Complexes
Interactions between proteins are fundamental to almost all biological processes (Cusick
et al., 2005). Vital cellular functions such as DNA replication, transcription, and mRNA translation require the coordinated action of many proteins that are assembled into an array of multiprotein complexes of distinct composition and structure. Also, many important biological processes such as cell communication consist of, and are regulated by, dynamic signaling networks of interacting proteins that directly or indirectly respond to specific effector molecules. Thus a comprehensive determination of protein–protein interactions within an organism provides a framework for understanding biology as an integrated system and aids in the assignment of individual protein function. Protein complexes can vary in size and composition from assemblies of several dozen proteins to smaller clusters of only a few. Moreover, the dynamic nature of protein complexes, whose composition and stability change as a function of time and cell type or state, presents a significant challenge in determining their structure and function. A number of different methods for determining interacting pairs of proteins have been developed, including the yeast two-hybrid (YTH) system, the first technique used for large-scale interactome mapping (Uetz et al., 2000). More recently, a combination of affinity purification and mass spectrometry (AP-MS) (Gingras et al., 2007) has greatly facilitated the characterization of protein complexes. A commonly used affinity purification approach for isolating complexes is the tandem affinity purification (TAP)-tag method, in which a dual affinity tag is engineered in frame with the target protein. The TAP-tag allows for a stringent two-step capturing and washing procedure yielding a sample ready for subsequent LC-MS/MS analysis. Compared to YTH, AP-MS can be carried out under near-physiological conditions in a systematic and unbiased manner.
The AP-MS approach does not perturb relevant PTMs present in the complexes targeted by affinity purification; instead, it offers the ability to identify those PTMs and, when combined with quantitative proteomics techniques, to probe dynamic changes in the composition of protein complexes. Using the AP-MS approach, large-scale data sets of protein complexes and binding partners can be generated, which in turn can be represented as complex networks or interactome maps. It should be noted that if the bait protein is a component of multiple complexes, a single AP-MS analysis cannot decipher the multiplicity of associations. Consequently, multiple bait proteins (baits) are required, and the interactions should be validated by multiple AP-MS analyses. Krogan and colleagues (2006) used TAP-tagged AP-MS to identify yeast protein complexes with a graph-clustering algorithm that identifies highly connected modules in protein–protein interaction networks. A total of 4562 TAP-tagged yeast bait proteins were processed and >7100 protein–protein interactions were detected, in which the 4562 baits were shown to interact with 4087 different endogenous proteins. This corresponds to 72% of the proteins predicted from the yeast genome. Through clustering algorithms, the authors found a total of 547 distinct heteromeric protein complexes. This finding suggests that the majority of proteins have one or more interaction partners.
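How bait–prey pairs from AP-MS become an interaction network that can be partitioned into candidate complexes can be sketched as follows. This is a deliberately simplified stand-in: it groups proteins by connected components using breadth-first search, which is far cruder than the graph-clustering algorithm used in the study cited above. The protein identifiers are hypothetical.

```python
from collections import defaultdict, deque

def build_network(bait_prey_pairs):
    """Build an undirected interaction graph from (bait, prey) pairs."""
    graph = defaultdict(set)
    for bait, prey in bait_prey_pairs:
        graph[bait].add(prey)
        graph[prey].add(bait)
    return graph

def connected_modules(graph):
    """Return groups of proteins linked directly or indirectly (connected components)."""
    seen, modules = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        module, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            module.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        modules.append(sorted(module))
    return modules

# Hypothetical bait-prey observations from several purifications.
pairs = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"), ("P4", "P5")]
modules = connected_modules(build_network(pairs))
```

A real analysis would also weight each interaction by its confidence and allow a protein to belong to more than one complex, which connected components cannot express.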
The AP-MS technique can be combined with other biochemical fractionation, chemical crosslinking, and quantitative proteomics techniques to reveal the stoichiometry, structural organization, and dynamics of complexes (Gingras et al., 2007). Protein–protein interaction mapping provides insight into how the biochemical properties of proteins and protein complexes are integrated into biological systems. Protein interaction databases are also a useful source for predicting the function(s) of thousands of genes by computational genomics approaches that combine sequence similarities within genomes and orthologies across genomes. Another approach for detecting noncovalent protein complexes and their interactions with DNA, RNA, ligands, and cofactors is based on the soft ionization of ESI-MS. This technique analyzes the macromolecules at physiological pH in 50–100 mM ammonium acetate at a minimal MS source temperature to generate intact complex ions in the gas phase. This ESI-MS approach is used as an alternative to in vitro approaches to discern higher-order structural assemblies of protein complexes, including determining stoichiometry and subunit binding affinity. Over the last decade, significant advances have been made in ionization and mass analysis techniques, including the development of ion mobility mass spectrometry, making the investigation of intact large and heterogeneous assemblies feasible (Heck et al., 2004; Loo et al., 2005; Ruotolo et al., 2005). These technological developments have paved the way to study intact noncovalent protein–protein interactions, assembly and disassembly in real time, subunit exchange, cooperative effects, and effects of cofactors, giving us a better understanding of proteins in cellular processes (Heck et al., 2004).
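One way stoichiometry can be read from an ESI mass spectrum of an intact complex is sketched below. It assumes positive-ion mode with protonation as the only charge carrier, so that an ion of neutral mass M and charge z appears at m/z = M/z + 1.00728. Two adjacent charge-state peaks then fix z, and the complex mass divided by the subunit mass gives the copy number. The peak positions and the subunit mass are hypothetical values chosen for illustration.

```python
PROTON_MASS = 1.00728  # Da, mass of the proton charge carrier

def charge_of_higher_mz_peak(mz_high, mz_low):
    """Charge z of the higher-m/z peak, given the adjacent peak carrying charge z + 1."""
    return round((mz_low - PROTON_MASS) / (mz_high - mz_low))

def neutral_mass(mz, charge):
    """Neutral mass of an ion observed at `mz` with `charge` protons attached."""
    return charge * (mz - PROTON_MASS)

def subunit_copies(complex_mass, subunit_mass):
    """Number of identical subunits, assuming a homo-oligomer with no bound ligands."""
    return round(complex_mass / subunit_mass)

# Hypothetical adjacent charge-state peaks of an intact complex.
mz_high, mz_low = 5001.007, 4762.912
z = charge_of_higher_mz_peak(mz_high, mz_low)
mass = neutral_mass(mz_high, z)
copies = subunit_copies(mass, subunit_mass=25_000.0)  # assumed subunit mass in Da
```

Real spectra of noncovalent complexes show peaks broadened by residual solvent and buffer adducts, so measured masses are typically somewhat higher than the sequence mass and several charge states are averaged.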
22.3.1.3 Kinases and Signal Transduction Pathways
Protein phosphorylation is a key posttranslational modification for regulating protein function in almost every basic cellular process, including metabolism, growth, division, differentiation, motility, organelle trafficking, immunity, learning, and memory (Ubersax et al., 2007). In particular, phosphorylation is reversibly controlled by the complementary action of protein kinases and protein phosphatases, and can therefore dynamically modulate most known cell signaling pathways. The signaling pathways associated with the development of cancer provide various points of control, which can be viewed as likely drug targets in oncology drug discovery (Kruse et al., 2008). Understanding the role of signaling pathways in any given biological context (stimulus, disease) requires the measurement of pathway outputs, dynamic changes, and cross talk (Yates et al., 2009). One major challenge is that signaling proteins (such as kinases, phosphatases, and scaffold proteins) are usually present in low amounts compared with other cellular proteins. In addition, characterizing the enormous number of endogenous substrates of the more than 500 kinases found in the human proteome (Manning et al., 2002) is a significant task in phosphoproteome research. Therefore, different approaches, including global and site-specific strategies, are needed to accomplish the job.
The receptor-mediated tyrosine kinase signaling pathway is one well-studied signaling pathway in which tyrosine phosphorylation is initiated to relay the signal from the cell membrane to the nucleus. Among the >500 kinases in the human proteome, 91 are tyrosine kinases (Manning et al., 2002), which can be divided into two classes: plasma membrane-spanning receptor protein tyrosine kinases (RTKs) and intracellular nonreceptor tyrosine kinases (NRTKs). RTKs are the receptors for many secreted polypeptide growth factors, including epidermal growth factor (EGF) and fibroblast growth factor (FGF), and have been heavily analyzed because of their potential involvement in cancer-associated signaling pathways and the possibility of using them as targets for anticancer drugs (Kruse et al., 2008). EGF receptor (EGFR) has been a key target as it is often overexpressed in cancer cells and has been associated with a poor prognosis. EGFR belongs to the HER family of receptor tyrosine kinases. The binding of EGF induces homodimerization of the receptors as well as heterodimerization with other members of the HER family. Dimerization leads to transactivation of the kinases, in which one receptor tyrosine kinase phosphorylates the other at its carboxyl-terminal tail, and to association with various adaptor proteins. This creates binding sites for other signaling molecules and initiates a downstream signaling cascade. Downstream signaling for EGF is mediated by two signaling pathways: the Ras-Raf mitogen-activated protein kinase pathway (ERK1 and ERK2) and the phosphatidylinositol 3-kinase/AKT pathway. The AKT pathway plays an important role in cell survival, whereas the ERK pathway regulates cell proliferation, transformation, and the progression to metastasis (Kruse et al., 2008). Overexpression of EGFR protein in tumor cells can lead to ligand-independent dimerization and receptor activation, whereas in normal tissues ligand binding is required for activation.
Overexpressed EGFR is estimated to occur in 40–80% of non-small-cell lung cancers, making EGFR a promising target for anticancer drugs such as Tarceva (erlotinib) from Genentech. Blagoev et al. (2004) applied the SILAC approach to study the global dynamics of phosphotyrosine-based signaling events in early EGF stimulation. The authors studied the time course of EGF stimulation in HeLa cells using three cell states and two combined SILAC experiments encoded with different stable isotopic forms of arginine, followed by antibody-based enrichment and quantitative LC-MS. The authors identified 31 putative novel effectors as EGFR substrates upon EGF stimulation and 81 signaling proteins, together with a time course of phosphorylation events for known EGFR signaling effectors. An iTRAQ method was also employed to quantify, in a time-resolved manner, the hundreds of tyrosine phosphorylation sites found after EGF treatment (Zhang et al., 2005). Such global activation profiles provide an informative perspective on cell signaling. Similarly, the insulin-signaling pathway has been intensively investigated by both SILAC- and iTRAQ-based approaches, yielding numerous novel components (Schmelzle et al., 2006; Kruger et al., 2008).
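The quantitative step of a three-state SILAC experiment can be sketched as a ratio calculation. The sketch assumes that each peptide yields one intensity per isotope label and that the light-labeled cells are the unstimulated reference. The label names, the mapping of labels to stimulation times, and the intensities are hypothetical and are not taken from the cited study.

```python
def silac_ratios(intensities, reference="light"):
    """Fold change of each labeled state relative to the reference state."""
    ref = intensities[reference]
    return {label: value / ref
            for label, value in intensities.items()
            if label != reference}

# Hypothetical peak intensities for one phosphopeptide in a triple-label experiment.
peptide_intensities = {"light": 2.0e6, "medium": 6.0e6, "heavy": 3.0e6}

# Assumed experimental design: minutes of EGF stimulation for each label.
label_to_minutes = {"light": 0, "medium": 5, "heavy": 10}

ratios = silac_ratios(peptide_intensities)
time_course = sorted((label_to_minutes[label], ratio)
                     for label, ratio in ratios.items())
```

Because the three cell populations are mixed before enrichment and LC-MS, the ratios are largely insensitive to losses during sample handling. Combining two such experiments that share a common time point extends the time course beyond three states.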
22.3.2 Proteomics-Driven Discovery of Cancer Biomarkers
A biomarker is defined as a feature that is objectively measured and evaluated as an indicator of specific physiological or pathological states. Cancer biomarkers can be used for early diagnosis, disease prognosis, and predicting or monitoring therapeutic response, providing great promise for significant improvements in clinical outcomes for cancer patients. In the past decade, much effort has been expended integrating genomics, bioinformatics and proteomics technologies to enhance the possibility of discovering novel biomarkers from different cell states of cancers and other diseases, including blood-based cancer biomarkers. Quantitative proteomics technologies have emerged as the preferred tool, promising systematic detection of protein targets and their characteristic PTMs, which collectively constitute a molecular fingerprint secreted into blood or other tissues, reflecting the presence of cancer and disease phenotype. Blood is by far the most commonly used, least invasive, accessible material feasible for monitoring over long periods of time and, therefore, has been the most-used biospecimen as a biomarker discovery matrix so far (Rifai et al., 2006). However, discovery of blood-based biomarkers for cancer poses a significant challenge to current proteomics technologies. One reason is that blood contains an enormous number of proteins, and many of them are found in multiple forms. Some forms result from alternative RNA splicing and others come from proteolytic cleavages and PTMs. The second reason is the extraordinary dynamic range in protein abundance from as high as 40 μg/μL for albumin to cytokines (∼5 pg/μL) with 22 of the most abundant proteins representing 99% of blood protein content (Simpson et al., 2008). Furthermore, proteins in circulation that are released by tumors and might prove useful as potential biomarkers are expected to have very low concentrations. 
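The scale of the abundance problem can be made concrete with a short calculation using only the two concentrations quoted above (40 μg/μL for albumin and roughly 5 pg/μL for cytokines). The comparison value of five orders of magnitude is the upper end of the instrument dynamic range quoted later in this section; the variable names are illustrative.

```python
from math import log10

def orders_of_magnitude(high, low):
    """Number of decades separating two concentrations given in the same unit."""
    return log10(high / low)

ALBUMIN_G_PER_UL = 40e-6    # 40 ug/uL, as quoted in the text
CYTOKINE_G_PER_UL = 5e-12   # ~5 pg/uL, as quoted in the text

span = orders_of_magnitude(ALBUMIN_G_PER_UL, CYTOKINE_G_PER_UL)

# Upper end of the MS dynamic range cited in this section (about 10^5).
INSTRUMENT_ORDERS = 5
shortfall = span - INSTRUMENT_ORDERS

print(f"blood span: {span:.1f} decades; beyond instrument range by {shortfall:.1f}")
```

Even with these two reference points alone, the span exceeds what a single MS measurement covers by nearly two decades, and tumor-derived proteins may be scarcer still.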
These factors create extreme difficulties for the identification of low-abundance biomarkers directly in blood. Although advanced MS technologies can now detect and identify subfemtomole amounts of peptides, the dynamic range of the MS (∼10^4 to 10^5) is still a limiting factor in the analysis of the blood proteome. As a result, additional separation strategies have been developed to overcome these challenges. These strategies include immunological depletion of high-abundance proteins such as albumin and immunoglobulins (Liu et al., 2006), extensive fractionation using orthogonal three-dimensional protein separation (Wang et al., 2005) to reduce complexity, and targeted enrichment of specific groups of proteins or peptides of interest, such as glycoproteins and cysteine-rich proteins (Bernhard et al., 2007). These analytical strategies successfully identified low-abundance proteins in blood such as EGFR (1.3–3.5 μg/mL) and the hepatocyte growth factor activator (400 ng/mL), demonstrating that mass spectrometry can be used to identify proteins at the concentrations expected for potential biomarkers associated with cancer. Unfortunately, these approaches involved
extensive sample fractionation and processing and therefore lead to poor reproducibility, which adversely affects clinical validation. It should also be noted that immunodepletion of abundant proteins carries a risk, as these proteins can act as carriers for some low-abundance molecules. Cytokines and other low-abundance proteins have been found to be co-depleted with albumin (Granger et al., 2005). The use of blood peptides as a source for potential biomarker discovery has been a focus of protein-chip technology in past years. This technique involves surface chemistry-based separation with subsequent detection by surface-enhanced laser desorption/ionization (SELDI)-TOF MS and generates a proteomic signature pattern that has been used for the detection of several cancers, including early detection of ovarian cancer (Petricoin et al., 2002). Consequently, it once received considerable attention in the field. However, a number of critical limitations have been identified, including bias from artifacts related to clinical sample handling, the bioinformatics tools used for pattern recognition, and its failure to identify well-established cancer biomarkers, which have made it fall out of favor. MALDI-TOF has been used more recently to identify novel biomarkers in the low-molecular-weight plasma/serum peptidome (Lopez et al., 2005). These peptidomic patterns were also reported to distinguish not only controls from cancers (Lopez et al., 2007) but also various types of cancers from one another (Villanueva et al., 2006). One major concern with all serum peptidomics data is still the issue of artifacts related to sample collection and handling, as serum contains substantial endo- and exoproteolytic enzymatic activity. Another promising approach for biomarker detection involves targeting the proteins generated by the body’s immune response against tumor proteins. Much evidence suggests that human subjects produce autoantibodies in response to a developing tumor.
These antibodies have been identified against a number of intracellular and surface antigens in patients with various types of tumors. This approach is based on the hypothesis that multiple proteins induce autoantibodies that together may constitute a panel-specific fingerprint for a cancer type. The tumor proteins have been identified using 2DE separation, followed by western blotting analysis of 2D gels incubated with patient sera (Le Naour, 2007). Brichory et al. (2001) reported autoantibodies against a carbohydrate epitope of glycosylated annexin I and annexin II, indicating that PTMs contribute to the immune response. MS analysis of the immunoreactive spots from western blotting is required to determine which proteins serve as antigens, followed by independent methods to confirm the specificity of the autoantibody reactivity. In addition, tumor cell-based proteomics can be performed using tumor cells and cancer cell lines as sources of potential cancer biomarkers. A typical shotgun approach for cell protein analysis can yield quantitative information when combined with labeling such as SILAC or iTRAQ, as described in the previous section. Targeted approaches for mapping protein signaling pathways within tumors have been used to identify new targets for anticancer
drug discovery, as detailed in the section on kinases and signal transduction pathways. Many groups have reported the use of MS-based quantitative techniques to discover biomarker candidates or putative marker proteins associated with cancers and other diseases in different body fluids. Barnidge et al. (2004) demonstrated the feasibility of absolute quantitation of prostate-specific antigen, a biomarker for prostate cancer, as a model biomarker in serum. Whiteaker et al. (2007) verified osteopontin and fibulin 2 as breast cancer biomarkers in a mouse model using peptide-based antibodies to enrich the targeted low-abundance peptides, followed by quantitative MS analysis. A known biomarker, carcinoembryonic antigen, and a subset of putative biomarker candidates for lung cancer were also quantitatively assessed in serum samples from patients (Nicol et al., 2008). Anderson (2005) proposed a list of 177 protein candidates associated with cardiovascular disease and stroke to be used for targeted analysis in plasma. Recently, an excellent example of how proteomics analysis facilitates the discovery of a gene associated with a human tumor was reported (Hao et al., 2009). In characterizing Sdh5, one of the mitochondrial proteins conserved among eukaryotes, using yeast as the primary model system, the authors discovered that the SDH5 gene is required for flavination of succinate dehydrogenase 1 and for SDH-dependent respiration through formation of a complete SDH complex. More importantly, they also found that a single-nucleotide mutation, c.232G>A in exon 2 of human SDH5 (hSDH5), corresponding to a G78R amino acid substitution in the most conserved region of the protein, is directly responsible for a neuroendocrine tumor, hereditary paraganglioma (PGL2) (Hao et al., 2009). This mitochondrial proteomics analysis in yeast thus led to the discovery of a human tumor susceptibility gene whose mutational inactivation confers tumor susceptibility in humans.
It should be stressed that although MS-based proteomics holds special promise for the discovery of novel biomarkers that might form the foundation for new clinical blood tests, to date no putative biomarker identified by this approach has been clinically validated. The lack of a coherent pipeline connecting marker discovery with well-established methods for validation is believed to be part of the reason (Rifai et al., 2006). A pipeline connecting biomarker discovery with the necessary verification, assay optimization, validation, and commercialization is required to move discoveries to the clinic (Rifai et al., 2006).
22.3.3 Proteogenomics: From Proteome to Genome
As genomic sequencing technologies have been developed at an extraordinarily rapid pace, the completion of a genome sequence has transitioned to a relatively routine process for both prokaryotic and eukaryotic species. However, to acquire the full biological value of the sequenced genome,
accurate identification of the protein-coding genes in each genome and their functional protein products is required. As a result, considerable attention has recently shifted from genome sequencing to genome annotation. The current genome annotation process relies on a number of computational tools that together constitute an automated high-throughput annotation pipeline combining information from noncomparative (ab initio gene prediction) and comparative (sequence similarity) approaches (Ansong et al., 2008). Many significant challenges remain in the genome annotation process. Protein-coding genes usually constitute a small fraction of a eukaryotic genome (e.g., <5% of the human genome), making it difficult to identify coding sequences against the ubiquitous background of noncoding sequences. The high frequency of alternative splicing in eukaryotic genes further complicates the identification of the correct gene structure. Other challenges include the accurate determination of open reading frames (ORFs) and of translation start and stop sites, and the prediction of short ORFs. In the human ENCODE genome annotation assessment project, Guigo et al. (2006) reported that the best de novo gene prediction programs predict the correct gene structure for fewer than 50% of genes. In a reanalysis of 143 annotated prokaryotic genomes, up to 60% of the genes in some genomes were found to have a possibly incorrect start codon (Nielsen et al., 2005). Therefore, verification of the protein-coding gene predictions made by computational tools becomes imperative. Current approaches for verifying the genomic structure of predicted protein-coding genes involve systematic RT-PCR followed by direct sequencing of the PCR products.
This expression-based validation approach has two limitations: (1) it shows only that a predicted protein-coding gene is transcribed and provides no information about its translation into a protein, and (2) the RT-PCR results can be biased by the initial gene annotation. Thus the best means to independently and unambiguously verify protein-coding genes is to directly identify the translated protein complement of the genome in a systematic manner through proteomics approaches. High-throughput LC-MS/MS analysis has emerged as a routine technique for directly measuring peptide fragmentation and generating large-scale peptide sequence information on a variety of organisms. The resulting peptide sequences are used to confirm the existence of naturally translated protein products from a specific genome and serve to improve or validate an annotation. In addition to directly validating predicted genes, MS/MS-based peptide sequence data produce a more confident annotation of targeted genomes through the detection of novel (unannotated) genes, independent confirmation of hypothetical ORFs, and the correction of erroneous gene predictions. Such proteomics results also allow translation start and stop sites to be determined accurately and the existence of splice variants to be verified at the translation level. Moreover, the improved genome annotation achieved by using proteomics data provides invaluable information that can be incorporated into
existing gene prediction algorithms to further enhance gene prediction accuracy (Ansong et al., 2008). Clearly, rapidly advancing proteomics research is dramatically facilitating genome annotation efforts. Thus the recently emerged discipline of proteogenomics represents an increasingly important strategy for integrating protein and peptide information into the genome annotation pipeline to enhance annotation quality (Gupta et al., 2007). Proteogenomics studies have been reported for genome annotation in bacteria, yeast, fruit fly, plants, and humans, among others. Findlay et al. (2009) recently reported the proteomic discovery of 19 previously unannotated genes encoding seminal fluid proteins (Sfps) that are transferred from males to females during mating in Drosophila. Using bioinformatics, the authors detected putative orthologs of these genes. Gene expression analysis revealed that almost all predicted orthologs are transcribed and that most are expressed in a male-specific or male-biased manner. Like annotated Sfps, many of these newly found proteins show a pattern of adaptive evolution, consistent with their potential role in influencing male sperm competitive ability. However, in contrast to annotated Sfps, these new genes are shorter, have a higher rate of nonsynonymous substitution, and have a markedly lower GC content in the coding region, which apparently explains why these genes escaped computational prediction (Findlay et al., 2009). Tanner et al. (2007) described algorithms, including exon splice graphs, that enable efficient searches for potential coding sequences, such as peptides spanning splice junctions, rather than searching translated genomes directly. The authors validated 39,000 exons and 11,000 introns at the level of translation for the human genome using 18.5 million MS/MS spectra.
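For contrast with the splice-graph approach, the baseline it improves on, searching a directly translated genome, can be sketched in a few lines. The sketch translates a DNA sequence in all six reading frames with the standard genetic code and reports the frames in which an MS-identified peptide occurs. It assumes an uppercase sequence containing only A, C, G, and T; the example sequence and peptide are hypothetical. A search of this kind cannot find a peptide that spans a splice junction, which is the gap that exon splice graphs address.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop codon.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(dna):
    """Reverse complement of an uppercase ACGT sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate consecutive codons from position 0, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Map frame labels (+1..+3 forward, -1..-3 reverse strand) to translations."""
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames

def frames_containing(dna, peptide):
    """Frames whose translation contains the peptide as an exact substring."""
    return [name for name, protein in six_frame_translations(dna).items()
            if peptide in protein]

# Hypothetical genomic fragment; the peptide is encoded in the second forward frame.
dna = "CATGAAAACCGCGTATCGTTAA"
hits = frames_containing(dna, "KTAYR")
```

A hit in a frame that overlaps no annotated gene is the kind of evidence that suggests a novel or mis-annotated coding region, subject to the usual controls for false peptide identifications.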
Meanwhile, translation-level evidence for novel or extended exons in 16 genes was discovered, and the translation of 224 hypothetical human proteins was confirmed. Over 40 alternative splicing events in the human genome and 308 coding SNPs were also discovered. Overall, a total of 800 correct exons were added by using integrated proteogenomics-based automated gene prediction (Tanner et al., 2007). Baerenfaller et al. (2008) reported that broad coverage of the Arabidopsis thaliana proteome, achieved through extensive sampling, had a significant impact on refining plant genome annotation. From 1354 LC-MS/MS runs, 86,456 unique peptides covering 13,029 proteins were identified, and 57 new gene models that were not represented in the protein database were discovered at the translation level (Baerenfaller et al., 2008). Combining an exon splice graph of Arabidopsis thaliana for efficient database searching with extensive proteome sampling to obtain novel peptides, Castellana et al. (2008) identified 144,079 distinct peptides from 45 LC-MS/MS runs. The majority of the peptides (126,055) matched existing gene models (12,769 proteins) and represented 40% of annotated genes. Surprisingly, 1473 new or revised genes were discovered from 18,024 novel peptides that did not correspond to annotated genes; 280 of these were previously unrecognized, 498 were previously annotated as pseudogenes,
and 695 were revisions of known genes that had been annotated in the wrong reading frame or with missing or incomplete exons. The proteogenomics results suggest that 13% of the Arabidopsis protein-coding genes either were not identified or contained significant errors in their exon definition (Castellana et al., 2008). In summary, as both genomics and proteomics technologies continue to improve, we can anticipate that they will drive advances in proteogenomics and pave the way for improvements in biomarker discovery.
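The peptide and gene counts reported for the Castellana et al. (2008) study can be checked for internal consistency with a few lines of arithmetic. The figures are those quoted above; the variable and function names are illustrative.

```python
def partition_is_consistent(total, parts):
    """True when the listed parts add up exactly to the reported total."""
    return sum(parts) == total

# Peptide counts quoted in the text.
peptides_total = 144_079
peptides_matching_models = 126_055
peptides_novel = 18_024

# Gene counts quoted in the text.
genes_new_or_revised = 1_473
genes_unrecognized, genes_former_pseudogenes, genes_revised = 280, 498, 695

peptides_ok = partition_is_consistent(
    peptides_total, [peptides_matching_models, peptides_novel])
genes_ok = partition_is_consistent(
    genes_new_or_revised,
    [genes_unrecognized, genes_former_pseudogenes, genes_revised])

# Share of all distinct peptides that fell outside existing gene models.
novel_fraction = peptides_novel / peptides_total
```

Both partitions add up, and about one peptide in eight fell outside the existing gene models.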
22.4 CONCLUSIONS AND FUTURE PERSPECTIVES
Over the past decade, proteomics analysis has become the driving force for the rapid development of separation science, mass spectrometry technologies, and associated bioinformatic algorithms. Technological advances in instrumentation, fragmentation techniques, complementary analysis strategies, and automated data mining have led to significant improvements in separation resolving power, mass accuracy and resolution, detection sensitivity, and quantification accuracy. These developments have propelled MS to become a central tool for the systematic analysis of protein expression, protein interactions, and protein modifications. Many effective MS analysis strategies have emerged, including stable isotope labeling for quantitative proteomics, fractionation techniques for reducing sample complexity, and low-redundancy targeted workflows, providing important insights into the composition, regulation, and function of cellular and biological processes and pathways. As the postgenomic era has developed, it has become increasingly clear that large-scale peptide sequence information from expressed proteins is superior to EST and cDNA sequences for structural and functional gene annotation. LC-MS/MS-based proteomics offers direct verification of predicted coding genes as well as identification of novel protein-coding genes, which greatly facilitates genome annotation. This proteogenomics approach has been strongly recommended for all current and future genome sequencing and annotation projects (Ansong et al., 2008). MS-based quantitative proteomics has demonstrated tremendous promise for the discovery of biomarkers and genes associated with diseases. Almost all candidate biomarkers identified by proteomics are still at the discovery stage and require further validation.
Nevertheless, once validated, these proteomics-derived biomarkers are expected to be of great benefit to clinicians, as they will provide new approaches to disease diagnosis, prognosis, and drug development. Despite the substantial progress made to date, significant technical challenges in MS-based proteomics remain, due mainly to the inherent complexity of proteins in biological systems. The wide range of biochemical properties and natural abundances of proteins poses tremendous challenges in several technical aspects. Perhaps foremost among these is sample preparation. To reflect the full proteome one must develop an efficient and unbiased
solubilization method for all cellular proteins. Two major approaches (in-gel and in-solution digestion) are currently used for breaking proteins extracted from biological material down into peptides before MS analysis. Both approaches have potential weaknesses, such as low peptide recovery from the gel matrix for in-gel digestion or incomplete protein solubilization for in-solution digestion. The recently developed filter-aided sample preparation method, a universal sample preparation approach that combines the advantages of in-gel and in-solution digestion, appears to overcome many of these weaknesses (Wisniewski, 2008). In addition to improved sample extraction and digestion, we anticipate that a variety of orthogonal separation techniques at both the protein and peptide levels will be further developed and applied to reduce sample complexity before MS analysis. Affinity purification-based enrichment strategies and their expanding utilities are also expected to be further developed. With enhanced selectivity and specificity, they are poised to become the most effective techniques for reduction of sample complexity and will readily combine with other approaches to acquire increased information content, particularly on protein complexes, PTMs, and subproteomes. This approach will be invaluable for blood-based proteomics and clinical applications, where a high-throughput pipeline integrating a highly specific, single-step sample preparation strategy is often required. Continued development and improvement in MS technology and analysis methodology are crucial for future progress in MS-based proteomics, both to make high-throughput analysis of whole proteomes routine and to enhance proteome coverage. Technically, peptide sequencing capacity by LC-MS/MS analysis is determined by the scan rate, sensitivity, and detection range of the MS instrument, as well as front-end LC performance, including peak resolving power and capacity.
Fast MS scan speed enables the coupling of fast and higher-resolution LC methods for higher-throughput analysis, which allows increased sampling of peptide ions and the ability to detect lower-abundance ions, providing enhanced dynamic range. Improved MS sensitivity, mass accuracy, and resolution will further increase dynamic range, strengthen confidence in identifications of low-abundance peptides, and facilitate the discovery of protein PTMs. It is estimated that the peptide identification capacity of any given MS instrument needs to be at least 10,000 peptides/analysis for the yeast proteome, and 100,000 peptides/analysis for mammalian proteomes, to enable true proteome-wide quantification (Ong and Mann, 2005; Macek et al., 2009). In addition, to detect low-level biomarkers in human blood, the lower detection limit must be improved at least 100-fold (Ong and Mann, 2005; Simpson et al., 2008). Clearly, mass spectrometers need to be improved to meet the analytical demands of the biological system. In addition, we expect that the development of effective fragmentation methods for analyzing large proteins and improvement of the methods for sample delivery to the mass spectrometer will be key areas for development in top-down strategies. MALDI imaging MS for direct tissue analysis is an increasingly important technology for assessing and tracing the localization of molecular species and
revealing the underlying molecular signatures indicative of a disease state. Although MALDI imaging MS is still at a very early stage, it is anticipated that this technology will develop rapidly as an alternative and complementary approach for determining the change in spatial distribution patterns of proteins/peptides that are associated with a range of pathologies. The analysis and interpretation of tens of thousands of MS/MS spectra collected in a single analysis is another challenge to the field. Automated MS/MS-based peptide identification relies on bioinformatic tools that incorporate a database search engine with statistically based scoring algorithms. Thus the quality of both the database search engine and the database will have a significant impact on the outcome of proteomics research. As genomic sequences become available for more and more organisms, the availability of genomic information is generally no longer the bottleneck it once was. There are many different search engines commercially available, and about half a dozen of them are widely used in research labs. Database matching and curation of protein identifications have been identified as likely reasons for the inconsistent findings reported from lab to lab (Bell et al., 2009; Aebersold, 2009). Thus improvements to current search engines, including standardized algorithms and uniform databases, are needed for improved reproducibility. For the application of proteogenomics to genome annotation, the need for improved data mining and informatics tools as well as instrumentation is recognized as a critical challenge. It appears necessary that new features be built into both current and future data mining and informatics tools so that they can take full advantage of the value-added information from LC-MS/MS datasets.
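As a rough illustration of why the sequence database matters so much to the search outcome, the sketch below shows only the first step that a database search engine typically performs: digesting each database protein in silico and keeping the peptides whose theoretical mass agrees with an observed precursor mass. It is a simplified sketch under stated assumptions, not any particular search engine: the function names, the toy database, and the 10 ppm default tolerance are hypothetical; trypsin is assumed to cleave after K or R except before P, with no missed cleavages and no modifications; and the statistical scoring of MS/MS fragment spectra that real engines apply is omitted entirely.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide for the free termini


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def candidate_peptides(observed_mass, database, tol_ppm=10.0):
    """Return (protein name, peptide, theoretical mass) within tol_ppm."""
    hits = []
    for name, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            theoretical = peptide_mass(peptide)
            if abs(observed_mass - theoretical) / theoretical * 1e6 <= tol_ppm:
                hits.append((name, peptide, theoretical))
    return hits
```

With a hypothetical database `{"protA": "AAKGGRPLKM", "protB": "GGK"}`, an observed neutral mass of 288.1797 Da selects only the peptide AAK from protA. A peptide whose parent protein is missing or mis-annotated in the database can never be selected at this step, whatever the quality of the spectrum.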
Perhaps the biggest challenge for future proteomics is to interrogate the acquired data and distil the useful information from the enormous data sets being collected. It is anticipated that emphasis in proteomics studies will shift from discovery to validation (Cox and Mann, 2007). This is because the discovery of biomarkers from high-throughput proteomics workflows leads to large, multidimensional datasets and a certain percentage of false discoveries. It is imperative to carefully design validation studies in a representative population of samples to characterize the sensitivity and specificity of each biomarker candidate with a view toward clinical applications. Construction of a comprehensive biomarker pipeline connecting biomarker discovery with verification, assay optimization, validation, and commercialization is required to move a biomarker candidate into the clinic. Better understanding of the overall process of biomarker discovery and validation should improve experimental study design, in turn increasing the efficiency of biomarker development (Rifai et al., 2006). Nevertheless, the development of MS-based techniques has provided a complementary platform in which a few known low-abundance biomarkers associated with cancers have been validated, providing proof of principle that proteomics-derived biomarkers will eventually provide breakthroughs for the diagnosis and treatment of disease.
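The sensitivity and specificity that a validation study is meant to characterize can be computed directly from the assay calls and the known disease status of each sample. The following Python sketch shows the arithmetic only; the function name and the cohort numbers in the usage note are hypothetical and are not taken from any study cited in this chapter.

```python
def sensitivity_specificity(has_disease, test_positive):
    """Return (sensitivity, specificity) for one biomarker candidate.

    has_disease   -- list of booleans, True if the sample is a confirmed case
    test_positive -- list of booleans, True if the assay called the sample positive
    """
    if len(has_disease) != len(test_positive):
        raise ValueError("both lists must describe the same samples")
    pairs = list(zip(has_disease, test_positive))
    true_pos = sum(1 for disease, call in pairs if disease and call)
    false_neg = sum(1 for disease, call in pairs if disease and not call)
    true_neg = sum(1 for disease, call in pairs if not disease and not call)
    false_pos = sum(1 for disease, call in pairs if not disease and call)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one case and one control")
    sensitivity = true_pos / (true_pos + false_neg)  # fraction of cases detected
    specificity = true_neg / (true_neg + false_pos)  # fraction of controls cleared
    return sensitivity, specificity
```

In a hypothetical cohort of 10 cases and 50 controls in which the assay detects 8 of the cases and wrongly flags 5 of the controls, the function returns a sensitivity of 0.8 and a specificity of 0.9. Because both values are proportions, their precision depends on the number of cases and controls, which is one reason the validation population needs to be representative and adequately sized.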
22.5 QUESTIONS AND ANSWERS
1. What is mass spectrometry (MS)? Why has MS technology increasingly become an indispensable analytical tool in the proteomics field?
2. A fundamental challenge to the application of MS to any class of analytes is the production of gas-phase ions of those species. One of the significant advances in MS technology is the development of new ionization techniques, including MALDI and ESI for protein/peptide analysis. Describe how MALDI and ESI techniques work.
3. Describe the importance of statistics-based experimental design for proteomics.
4. Why are sample preparation and separation technologies so critical for the outcomes of proteomics experiments? What are the common approaches used before MS analysis in biomarker discovery efforts?
5. Define bottom-up and top-down proteomics.
6. Describe the importance of bioinformatics tools for MS data analysis and interpretation. Why does proteomics data analysis rely heavily on genomic data and information?
7. Describe the commonly used quantitative proteomics workflow in biomarker discovery.
8. In the past decade, what general contributions of MS-based proteomics have led to an enhanced understanding of complex biological processes?
9. Describe the promises and challenges of proteomics-driven biomarker discovery for cancer.
10. What is proteogenomics? Why can proteomics data be used to assist in the gene discovery and gene annotation pipeline?
1. MS is an analytical technique that measures the molecular masses of individual compounds and atoms by converting them into gas-phase ions. MS can be used for the identification and quantification of molecules present in a sample and can guide the elucidation of molecular structures. Among all the hardware and software tools used for proteomics analysis, MS technology has become the most important.
The combination of analytical features available on modern MS instruments—enhanced sensitivity, high selectivity, high mass accuracy and resolution, fast scanning rate, and accurate quantitation over a wide dynamic range—offers unique abilities to accommodate the complexity of proteomics samples, allowing the detection of low abundance protein components, the unambiguous identification of modification sites, and the ability to characterize protein complexes in a high-throughput manner.
2. MALDI is a solid-phase ionization technique that ionizes sample molecules from a crystalline matrix via laser excitation. The mechanism of MALDI is widely accepted to be a two-step process: a primary matrix ionization event, followed by in-plume secondary ion-molecule reactions. The MALDI matrix absorbs laser energy, becomes excited, and vaporizes, carrying the macromolecular analyte molecules into the gas phase. In this tumultuous and explosive process the analyte molecules undergo collisions with excited matrix ions that are sufficiently energetic to cause the transfer of electrons and protons, creating a population of charged macromolecular ions that can be analyzed, typically in a TOF mass analyzer. ESI is a solution technique that generates a continuous stream of ions through a three-step process: droplet formation, droplet shrinkage, and ion desorption. Electrospray ionization is achieved by passing the sample, in solution, through a metal capillary that is held at a high voltage. As the sample exits the capillary it is nebulized by a flow of gas to produce a fine aerosol of charged droplets. With additional gentle heating, the droplets evaporate. As the droplets decrease in size the sample ions in the droplet are forced together until repulsion causes them to be ejected from the surface. The ions are then extracted through a small orifice into the mass spectrometer vacuum.
3. Given the dynamic nature of proteins in any given biological system and the plethora of techniques available to carry out experiments, generating highly reliable and reproducible methodologies and optimized workflows has been a significant challenge for the entire proteomics field. Both biological and technical variations affect experimental reproducibility.
The complexity of the samples, techniques, and data requires the application of comprehensive statistical tools to enable the detection of the signature features responsible for the alteration of biological phenotype. Using a statistics-based experimental design will have a tremendous impact on the outcome of a proteomics study by helping to avoid systematic error and minimize random variation. Proteomics experiments require the integration of optimized algorithms and the power of statistics. They need to be designed in a way that leads to maximum data reliability and reproducibility and to minimal technical variation for each step from each source.
4. The extraction of proteins from the material of interest to generate, in a reproducible way, an analyzable and biologically relevant sample has been a long-standing challenge for proteomics. It is widely recognized that when total protein extractions from various types of biological samples are carried out, many extraction approaches incorporate biases with respect to particular classes of proteins that often lead to incomplete and inconsistent results. Thus reduction of protein complexity by targeting a subfraction of the entire proteome and the use of optimized protein extraction solvents and protocols have proven to be effective ways to minimize
variations arising from sample preparation. Given the wide range of protein concentrations in subproteomes and their highly complex nature, a variety of efficient protein (or peptide) separation strategies are required to further fractionate samples prior to MS analysis to reduce their complexity, enrich for low-abundance proteins/peptides, and minimize the ionization suppression effects of sample matrix components. The two most commonly used separation approaches are gel-based separation, including size-based 1D SDS gel analysis and 2D gel analysis (which combines size- and pI-based separations), and chromatography-based separation, including RPLC coupled directly to MS or the combination of multiple LC separations before MS analysis.
5. There are two related but slightly different concepts used for the definitions of bottom-up/top-down proteomics. The most widely accepted definition is based on whether peptides or proteins are being directly introduced into the mass spectrometer for analysis. Bottom-up proteomics is an analytical strategy that relies on measuring the masses of proteolytic peptides and analyzing their sequences, from which protein identity is inferred. Top-down proteomics involves direct high-resolution mass measurement of intact protein ions and their sequence-specific fragment ions without prior proteolytic digestion. Another less commonly used but equally valid definition is determined by the nature of the material being initially separated. By this definition, top-down proteomics would include all procedures that begin with a separation of intact proteins, including 1D and 2D gel electrophoresis.
6. The analysis and interpretation of the huge amounts of MS data produced by typical experiments have posed a significant challenge in the proteomics field. Manual analysis is not only incompatible with the thousands of spectra acquired in even a single data file, because of the time involved, but is also affected by subjective interpretation and inconsistency.
The development of database search algorithms and relevant bioinformatics tools incorporating statistical principles for automatic MS data analysis has been a key advance in proteomics studies. Such search algorithms allow for robust peptide/protein identification by correlating experimental MS data with specific genomic or protein sequence databases and use an accepted set of parameters and criteria in a consistent way such that statistically valid methods can be used to confidently determine the correct peptide spectral match from the sequences in the database. Therefore, the development of MS analysis methods, sequence database search engines, and the availability of a growing number of complete sequence databases represent the essential components of the proteomics pipeline. In addition to the raw sequence entries, these databases also contain annotation information, which provides links to information about protein identity, function, homology, position of PTMs, domain, and higher-order structures.
7. A general strategy to screen biomarkers is to analyze samples from several different states (e.g., disease vs. healthy control) and compare the protein expression patterns across the samples in a quantitative proteomics analysis. Both the 2D-gel approach and the shotgun approach, using either stable isotope-labeling methods or label-free techniques, are frequently applied in comparative proteomics studies. In the 2D-gel approach, the quantitative information is obtained through gel image analysis, and protein identification of the spots of interest is achieved by MS analysis. In the shotgun-based approach, both identification and quantitation are acquired through MS analysis. For the stable isotope labeling strategy, the quantitative information is obtained from a label mass shift that distinguishes the same peptides from different samples. In the label-free approach, the peptide signal intensity for species with the same mass, charge, and retention time across all the samples in the LC-MS/MS analysis is used for correlating peptide and protein abundance in different samples.
8. In the past decade, substantial advances in determining the composition, regulation, and function of molecular complexes have been obtained by MS-based proteomics, leading to a greater understanding of the molecular basis of complex biological processes. These studies include those that map protein localization, identify protein–protein interactions, determine the sites of posttranslational modifications, and reveal altered protein composition.
9. Current cancer biomarkers suffer from low diagnostic sensitivity and specificity.
The rapid growth and availability of large-scale and high-throughput proteomics technologies has resulted in increased popularity and promise for the concept that cancer biomarkers can be discovered through the following strategies: (1) blood-based strategies for the proteomics profiling of cancers, (2) identification of candidate biomarkers using tissues or biological fluids near to a tumor site of origin, (3) identification of soluble secreted proteins and shed membrane proteins from tumor cell lines grown in vitro, and (4) the analysis of serum for autoantibodies against tumor proteins. Despite the promise of MS-based proteomics for the discovery of cancer biomarkers, many practical challenges and limitations have been identified. These challenges include the extraordinary complexity of proteins and high dynamic range of protein concentration in blood and biological fluids, the inherently limited quantitative nature of MS technology, bias toward identification of high-abundance molecules, and poor lab-to-lab reproducibility. Other important shortcomings are inherent variations in the blood proteome due to genetic polymorphisms and bias from artifacts related to clinical sample collection, processing time, sample preparation, and storage.
10. Proteomics has continually accrued the benefits from growing genomic information through the integration of gene prediction algorithms, cDNA
sequences, and comparative genomics. Proteomic information can now be used to complement gene model-based annotation through unambiguous determination of reading frames, translation start and stop sites, splice boundaries, and the validation of short ORFs. The application of proteomics methodologies to genome annotation is called proteogenomics. The peptide sequences obtained from proteomics data can be used to confirm the existence of naturally translated protein products from a specific genome and serve to improve or validate annotation. In addition, MS/MS-based peptide sequence data produce a more confident annotation of targeted genomes through detection of novel (unannotated) genes, independent confirmation of hypothetical ORFs, and the correction of erroneous gene predictions. Proteomics results also allow translation start and stop sites to be determined accurately and the existence of splice variants to be verified at the translational level.
22.6 ACKNOWLEDGMENTS
We thank Dr. Celeste Ptak and Mr. Robert Sherwood for helpful discussions and comments on the manuscript.
22.7 REFERENCES
Aebersold R. (2009). A stress test for mass spectrometry-based proteomics. Nat Meth 6(6):411–12. Aebersold R, Mann M. (2003). Mass spectrometry-based proteomics. Nature 422:198–207. Andersen JS, Mann M. (2006). Organellar proteomics: turning inventories into insights. EMBO Rep 7:874–79. Anderson L. (2005). Candidate-based proteomics in the search for biomarkers of cardiovascular disease. J Physiol 563:23–60. Andersson L, Porath J. (1986). Isolation of phosphoproteins by immobilized metal (Fe3+) affinity chromatography. Anal Biochem 154:250–54. Ansong C, Purvine SO, Adkins JN, Lipton MS, Smith RD. (2008). Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Brief Funct Genomic Proteomic 7:50–62. Baerenfaller K, Grossmann J, Grobei MA, et al. (2008). Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science 320:938–41. Barnidge DR, Goodmanson MK, Klee GG, Muddiman DC. (2004). Absolute quantification of the model biomarker prostate-specific antigen in serum by LC-MS/MS using protein cleavage and isotope dilution mass spectrometry. J Proteome Res 3:644–52. Beausoleil SA, Jedrychowski M, Schwartz D, et al. (2004). Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc Natl Acad Sci U S A 101:12130–35.
Bell AW, Deutsch EW, Au CE, et al. (2009). A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Meth 6(6):423–30. Bernhard OK, Kapp EA, Simpson RJ. (2007). Enhanced analysis of the mouse plasma proteome using cysteine-containing tryptic glycopeptides. J Proteome Res 6:987–95. Biemann K. (1990a). Appendix 5. Nomenclature for peptide fragment ions (positive ions). Meth Enzymol 193:886–87. Biemann K. (1990b). Sequencing of peptides by tandem mass spectrometry and high-energy collision-induced dissociation. Meth Enzymol 193:455–79. Blagoev B, Ong SE, Kratchmarova I, Mann M. (2004). Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics. Nat Biotechnol 22:1139–45. Bodnar WM, Blackburn RK, Krise JM, Moseley MA. (2003). Exploiting the complementary nature of LC/MALDI/MS/MS and LC/ESI/MS/MS for increased proteome coverage. J Am Soc Mass Spectrom 14:971–79. Brichory FM, Misek DE, Yim AM, et al. (2001). An immune response manifested by the common occurrence of annexins I and II autoantibodies and high circulating levels of IL-6 in lung cancer. Proc Natl Acad Sci U S A 98:9824–29. Bunger MK, Cargile BJ, Ngunjiri A, Bundy JL, Stephenson JL Jr. (2008). Automated proteomics of E. coli via top-down electron-transfer dissociation mass spectrometry. Anal Chem 80:1459–67. Chi A, Huttenhower C, Geer LY, et al. (2007). Analysis of phosphorylation sites on proteins from Saccharomyces cerevisiae by electron transfer dissociation (ETD) mass spectrometry. Proc Natl Acad Sci U S A 104:2193–98. Cantin GT, Yates JR. (2004). Strategies for shotgun identification of post-translational modifications by mass spectrometry. J Chromatogr A 1053:7–14. Carr SA, Annan RS, Huddleston MJ. (2005). Mapping posttranslational modifications of proteins by MS-based selective detection: application to phosphoproteomics. Meth Enzymol 405:82–115. Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. (2008).
Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A 105:21034–38. Catalina MI, Koeleman CA, Deelder AM, Wuhrer M. (2007). Electron transfer dissociation of N-glycopeptides: loss of the entire N-glycosylated asparagine side chain. Rapid Commun Mass Spectrom 21:1053–61. Chaurand P, Norris JL, Cornett DS, Mobley JA, Caprioli RM. (2006). New developments in profiling and imaging of proteins from tissue sections by MALDI mass spectrometry. J Proteome Res 5:2889–900. Chi A, Bai DL, Geer LY, Shabanowitz J, Hunt DF. (2007). Analysis of intact proteins on a chromatographic time scale by electron transfer dissociation tandem mass spectrometry. Int J Mass Spectrom 259:197–203. Choi H, Fermin D, Nesvizhskii AI. (2008). Significance analysis of spectral count data in label-free shotgun proteomics. Mol Cell Proteomics 7:2373–85. Collins FS, Lander ES, Rogers J, Waterson RH. (2004). Finishing the euchromatic sequence of the human genome. Nature 431:931–45.
Coon JJ, Ueberheide B, Syka JE, et al. (2005). Protein identification using sequential ion/ion reactions and tandem mass spectrometry. Proc Natl Acad Sci U S A 102:9463–68. Cox DM, Zhong F, Du M, Duchoslav E, Sakuma T, McDermott JC. (2005). Multiple reaction monitoring as a method for identifying protein posttranslational modifications. J Biomol Tech 16:83–90. Cox J, Mann M. (2007). Is proteomics the new genomics? Cell 130:395–98. Cravatt BF, Simon GM, Yates JR III. (2007). The biological impact of mass-spectrometry-based proteomics. Nature 450:991–1000. Cusick ME, Klitgord N, Vidal M, Hill DE. (2005). Interactome: gateway into systems biology. Hum Mol Genet 14(spec 2):R171–81. de Hoog CL, Mann M. (2004). Proteomics. Annu Rev Genomics Hum Genet 5:267–93. Domon B, Aebersold R. (2006). Mass spectrometry and protein analysis. Science 312:212–17. Eng JK, McCormack AL, Yates JR III. (1994). An approach to correlate tandem mass spectral data of peptide with amino acid sequences in a protein database. J Am Soc Mass Spectrom 5:976–89. Ernoult E, Gamelin E, Guette C. (2008). Improved proteome coverage by using iTRAQ labelling and peptide OFFGEL fractionation. Proteome Sci 6:27. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM. (1989). Electrospray ionization for mass spectrometry of large biomolecules. Science 246:64–71. Ferrer-Alcon M, Arteta D, Guerrero MJ, Fernandez-Orth D, Simon L, Martinez A. (2009). The use of gene array technology and proteomics in the search of new targets of diseases for therapeutics. Toxicol Lett 186:45–51. Ficarro SB, McCleland ML, Stukenberg PT, et al. (2002). Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat Biotechnol 20:301–05. Findlay GD, MacCoss MJ, Swanson WJ. (2009). Proteomic discovery of previously unannotated, rapidly evolving seminal fluid genes in Drosophila. Genome Res 19:886–96. Garcia BA, Shabanowitz J, Hunt DF. (2007a).
Characterization of histones and their post-translational modifications by mass spectrometry. Curr Opin Chem Biol 11:66–73. Garcia BA, Siuti N, Thomas CE, Mizzen CA, Kelleher NL. (2007b). Characterization of neurohistone variants and post-translational modifications by electron capture dissociation mass spectrometry. Int J Mass Spectrom 314:109–12. Gerber SA, Rush J, Stemman O, Kirschner MW, Gygi SP. (2003). Absolute quantification of proteins and phosphoproteins from cell lysates by tandem MS. Proc Natl Acad Sci U S A 100:6940–45. Gharbi S, Gaffney P, Yang A, et al. (2002). Evaluation of two-dimensional differential gel electrophoresis for proteomic expression analysis of a model breast cancer cell system. Mol Cell Proteomics 1:91–98. Gingras AC, Aebersold R, Raught B. (2005). Advances in protein complex analysis using mass spectrometry. J Physiol 563:11–22.
Gingras AC, Gstaiger M, Raught B, Aebersold R. (2007). Analysis of protein complexes using mass spectrometry. Nat Rev Mol Cell Biol 8:645–54. Good DM, Wirtala M, McAlister GC, Coon JJ. (2007). Performance characteristics of electron transfer dissociation mass spectrometry. Mol Cell Proteomics 6:1942–51. Gorg A, Weiss W, Dunn MJ. (2004). Current two-dimensional electrophoresis technology for proteomics. Proteomics 4:3665–85. Granger J, Siddiqui J, Copeland S, Remick D. (2005). Albumin depletion of human plasma also removes low abundance proteins including the cytokines. Proteomics 5:4713–18. Gruhler A, Olsen JV, Mohammed S, et al. (2005). Quantitative phosphoproteomics applied to the yeast pheromone signaling pathway. Mol Cell Proteomics 4:310–27. Guigo R, Flicek P, Abril JF, et al. (2006). EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(Suppl 1):S2 1–31. Gupta N, Tanner S, Jaitly N, et al. (2007). Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Genome Res 17:1362–77. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. (1999). Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17:994–99. Gygi SP, Rochon Y, Franza BR, Aebersold R. (1999). Correlation between protein and mRNA abundance in yeast. Mol Cell Biol 19:1720–30. Han G, Ye M, Zou H. (2008). Development of phosphopeptide enrichment techniques for phosphoproteome analysis. Analyst 133:1128–38. Han X, Aslanian A, Yates JR III. (2008). Mass spectrometry for proteomics. Curr Opin Chem Biol 12:483–90. Hao HX, Khalimonchuk O, Schraders M, et al. (2009). SDH5, a gene required for flavination of succinate dehydrogenase, is mutated in paraganglioma. Science 325:1139–42. Hart SR, Lau KW, Hao Z, et al. (2009). Analysis of the trypanosome flagellar proteome using a combined electron transfer/collisionally activated dissociation strategy. 
J Am Soc Mass Spectrom 20:167–75. Heck AJ, Van Den Heuvel RH. (2004). Investigation of intact protein complexes by mass spectrometry. Mass Spectrom Rev 23:368–89. Hubner NC, Ren S, Mann M. (2008). Peptide separation with immobilized pI strips is an attractive alternative to in-gel protein digestion for proteome analysis. Proteomics 8:4862–72. Ishihama Y, Oda Y, Tabata T, et al. (2005). Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein. Mol Cell Proteomics 4:1265–72. Karas M, Hillenkamp F. (1988). Laser desorption ionization of proteins with molecular masses exceeding 10,000 daltons. Anal Chem 60:2299–301. Kelleher NL. (2004). Top-down proteomics. Anal Chem 76:197A–203A. Khidekel N, Ficarro SB, Clark PM, et al. (2007). Probing the dynamics of O-GlcNAc glycosylation in the brain using quantitative proteomics. Nat Chem Biol 3:339–48. Krogan NJ, Cagney G, Yu H, et al. (2006). Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440:637–43.
Kruger M, Kratchmarova I, Blagoev B, Tseng YH, Kahn CR, Mann M. (2008). Dissection of the insulin signaling pathway via quantitative phosphoproteomics. Proc Natl Acad Sci U S A 105:2451–56. Kruse U, Bantscheff M, Drewes G, Hopf C. (2008). Chemical and pathway proteomics: powerful tools for oncology drug discovery and personalized health care. Mol Cell Proteomics 7:1887–901. Lander ES, Linton LM, Birren B, et al. (2001). Initial sequencing and analysis of the human genome. Nature 409:860–922. Larsen MR, Thingholm TE, Jensen ON, Roepstorff P, Jorgensen TJ. (2005). Highly selective enrichment of phosphorylated peptides from peptide mixtures using titanium dioxide microcolumns. Mol Cell Proteomics 4:873–86. Le Naour F. (2007). Identification of tumor antigens by using proteomics. Meth Mol Biol 360:327–34. Link AJ, Eng J, Schieltz DM, et al. (1999). Direct analysis of protein complexes using mass spectrometry. Nat Biotechnol 17:676–82. Liu H, Sadygov RG, Yates JR III. (2004). A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Anal Chem 76:4193–201. Liu T, Belov ME, Jaitly N, Qian WJ, Smith RD. (2007). Accurate mass measurements in proteomics. Chem Rev 107:3621–53. Liu T, Qian WJ, Mottaz HM, et al. (2006). Evaluation of multiprotein immunoaffinity subtraction for plasma proteomics and candidate biomarker discovery using mass spectrometry. Mol Cell Proteomics 5:2167–74. Loo JA, Berhane B, Kaddis CS, et al. (2005). Electrospray ionization mass spectrometry and ion mobility analysis of the 20S proteasome complex. J Am Soc Mass Spectrom 16:998–1008. Lopez MF, Mikulskis A, Kuzdzal S, et al. (2005). High-resolution serum proteomic profiling of Alzheimer disease samples reveals disease-specific, carrier-protein-bound mass signatures. Clin Chem 51:1946–54. Lopez MF, Mikulskis A, Kuzdzal S, et al. (2007).
A novel, high-throughput workflow for discovery and identification of serum carrier protein-bound peptide biomarker candidates in ovarian cancer samples. Clin Chem 53:1067–74. Lu B, McClatchy DB, Kim JY, Yates JR III. (2008). Strategies for shotgun identification of integral membrane proteins by tandem mass spectrometry. Proteomics 8:3947–55. Lu H, Zong C, Wang Y, et al. (2008). Revealing the dynamics of the 20 S proteasome phosphoproteome: a combined CID and electron transfer dissociation approach. Mol Cell Proteomics 7:2073–89. Lu P, Vogel C, Wang R, Yao X, Marcotte EM. (2007). Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat Biotechnol 25:117–24. Macek B, Mann M, Olsen JV. (2009). Global and site-specific quantitative phosphoproteomics: principles and applications. Annu Rev Pharmacol Toxicol 49:199–222. Makarov A, Denisov E, Kholomeev A, et al. (2006b). Performance evaluation of a hybrid linear ion trap/orbitrap mass spectrometer. Anal Chem 78:2113–20. Makarov A, Denisov E, Lange O, Horning S. (2006a). Dynamic range of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J Am Soc Mass Spectrom 17:977–82.
Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S. (2002). The protein kinase complement of the human genome. Science 298:1912–34. Mann M, Hendrickson RC, Pandey A. (2001). Analysis of proteins and proteomes by mass spectrometry. Annu Rev Biochem 70:437–73. Mann M, Jensen ON. (2003). Proteomic analysis of post-translational modifications. Nat Biotechnol 21:255–61. Mayya V, Rezaul K, Wu L, Fong MB, Han DK. (2006). Absolute quantification of multisite phosphorylation by selective reaction monitoring mass spectrometry: determination of inhibitory phosphorylation status of cyclin-dependent kinases. Mol Cell Proteomics 5:1146–57. McDonald WH, Yates JR III. (2003). Shotgun proteomics: integrating technologies to answer biological questions. Curr Opin Mol Ther 5:302–09. McLafferty FW, Breuker K, Jin M, et al. (2007). Top-down MS, a powerful complement to the high capabilities of proteolysis proteomics. FEBS J 274:6256–68. McNulty DE, Annan RS. (2008). Hydrophilic interaction chromatography reduces the complexity of the phosphoproteome and improves global phosphopeptide isolation and detection. Mol Cell Proteomics 7:971–80. Meistermann H, Norris JL, Aerni HR, et al. (2006). Biomarker discovery by imaging mass spectrometry: transthyretin is a biomarker for gentamicin-induced nephrotoxicity in rat. Mol Cell Proteomics 5:1876–86. Mikesh LM, Ueberheide B, Chi A, et al. (2006). The utility of ETD mass spectrometry in proteomic analysis. Biochim Biophys Acta 1764:1811–22. Minden JS, Dowd SR, Meyer HE, Stuhler K. (2009). Difference gel electrophoresis. Electrophoresis 30(Suppl 1):S156–61. Mirza SP, Olivier M. (2007). Methods and approaches for the comprehensive characterization and quantification of cellular proteomes using mass spectrometry. Physiol Genomics 33:3–11. Miyagi M, Rao KC. (2007). Proteolytic 18O-labeling strategies for quantitative proteomics. Mass Spectrom Rev 26:121–36. Molina H, Horn DM, Tang N, Mathivanan S, Pandey A. (2007).
Global proteomic profiling of phosphopeptides using electron transfer dissociation tandem mass spectrometry. Proc Natl Acad Sci U S A 104:2199–204. Mootha VK, Lepage P, Miller K, et al. (2003). Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc Natl Acad Sci U S A 100:605–10. Nicol GR, Han M, Kim J, et al. (2008). Use of an immunoaffinity-mass spectrometry-based approach for the quantification of protein biomarkers from serum samples of lung cancer patients. Mol Cell Proteomics 7:1974–82. Nielsen P, Krogh A. (2005). Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21:4322–29. Oberg AL, Vitek O. (2009). Statistical design of quantitative mass spectrometry-based proteomic experiments. J Proteome Res 8:2144–56. Olsen JV, Blagoev B, Gnad F, et al. (2006). Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 127:635–48. Ong SE, Blagoev B, Kratchmarova I, et al. (2002). Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol Cell Proteomics 1:376–86.
Ong SE, Foster LJ, Mann M. (2003). Mass spectrometric-based approaches in quantitative proteomics. Methods 29:124–30. Ong SE, Mann M. (2005). Mass spectrometry-based proteomics turns quantitative. Nat Chem Biol 1:252–62. Pan S, Aebersold R, Chen R, et al. (2009). Mass spectrometry based targeted protein quantification: methods and applications. J Proteome Res 8:787–97. Parks BA, Jiang L, Thomas PM, et al. (2007). Top-down proteomics on a chromatographic time scale using linear ion trap Fourier transform hybrid mass spectrometers. Anal Chem 79:7984–91. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551–67. Perry RH, Cooks RG, Noll RJ. (2008). Orbitrap mass spectrometry: instrumentation, ion motion and applications. Mass Spectrom Rev 27:661–99. Petricoin EF, Ardekani AM, Hitt BA, et al. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359:572–77. Pinkse MW, Uitto PM, Hilhorst MJ, Ooms B, Heck AJ. (2004). Selective isolation at the femtomole level of phosphopeptides from proteolytic digests using 2D-NanoLC-ESI-MS/MS and titanium oxide precolumns. Anal Chem 76:3935–43. Prakash A, Piening B, Whiteaker J, et al. (2007). Assessing bias in experiment design for large scale mass spectrometry-based quantitative proteomics. Mol Cell Proteomics 6:1741–48. Rifai N, Gillette MA, Carr SA. (2006). Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 24:971–83. Rikova K, Guo A, Zeng Q, et al. (2007). Global survey of phosphotyrosine signaling identifies oncogenic kinases in lung cancer. Cell 131:1190–203. Roepstorff P, Fohlman J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11:601. Ross PL, Huang YN, Marchese JN, et al. (2004).
Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 3:1154–69. Ruotolo BT, Giles K, Campuzano I, Sandercock AM, Bateman RH, Robinson CV. (2005). Evidence for macromolecular protein rings in the absence of bulk water. Science 310:1658–61. Rush J, Moritz A, Lee KA, et al. (2005). Immunoaffinity profiling of tyrosine phosphorylation in cancer cells. Nat Biotechnol 23:94–101. Schirle M, Heurtier MA, Kuster B. (2003). Profiling core proteomes of human cell lines by one-dimensional PAGE and liquid chromatography-tandem mass spectrometry. Mol Cell Proteomics 2:1297–305. Schmelzle K, Kane S, Gridley S, Lienhard GE, White FM. (2006). Temporal dynamics of tyrosine phosphorylation in insulin signaling. Diabetes 55:2171–79. Scigelova M, Makarov A. (2006). Orbitrap mass analyzer—overview and applications in proteomics. Proteomics 6(Suppl 2):16–22. Shukla AK, Futrell JH. (2000). Tandem mass spectrometry: dissociation of ions by collisional activation. J Mass Spectrom 35:1069–90. Silva JC, Denny R, Dorschel C, et al. (2006b). Simultaneous qualitative and quantitative analysis of the Escherichia coli proteome: a sweet tale. Mol Cell Proteomics 5:589–607.
Silva JC, Gorenstein MV, Li GZ, Vissers JP, Geromanos SJ. (2006a). Absolute quantification of proteins by LCMSE: a virtue of parallel MS acquisition. Mol Cell Proteomics 5:144–56. Simpson RJ, Bernhard OK, Greening DW, Moritz RL. (2008). Proteomics-driven cancer biomarker discovery: looking to the future. Curr Opin Chem Biol 12:72–77. Smith JC, Figeys D. (2008). Recent developments in mass spectrometry-based quantitative phosphoproteomics. Biochem Cell Biol 86:137–48. Speers AE, Wu CC. (2007). Proteomics of integral membrane proteins–theory and application. Chem Rev 107:3687–714. Stapels MD, Barofsky DF. (2004). Complementary use of MALDI and ESI for the HPLC-MS/MS analysis of DNA-binding proteins. Anal Chem 76:5423–30. Steinberg TH, Agnew BJ, Gee KR, et al. (2003). Global quantitative phosphoprotein analysis using Multiplexed Proteomics technology. Proteomics 3:1128–44. Stensballe A, Jensen ON, Olsen JV, Haselmann KF, Zubarev RA. (2000). Electron capture dissociation of singly and multiply phosphorylated peptides. Rapid Commun Mass Spectrom 14:1793–800. Swaney DL, McAlister GC, Coon JJ. (2008). Decision tree-driven tandem mass spectrometry for shotgun proteomics. Nat Meth 5:959–64. Syka JE, Coon JJ, Schroeder MJ, Shabanowitz J, Hunt DF. (2004). Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry. Proc Natl Acad Sci U S A 101:9528–33. Tanaka K, Waki H, Ido Y, et al. (1988). Protein and polymer analyses up to m/z 100 000 by laser ionization time-of-flight mass spectrometry. Rapid Commun Mass Spectrom 2:151–70. Tanner S, Shen Z, Ng J, et al. (2007). Improving gene annotation using peptide mass spectrometry. Genome Res 17:231–39. Ubersax JA, Ferrell JE, Jr. (2007). Mechanisms of specificity in protein phosphorylation. Nat Rev Mol Cell Biol 8:530–41. Udeshi ND, Compton PD, Shabanowitz J, Hunt DF, Rose KL. (2008). 
Methods for analyzing peptides and proteins on a chromatographic timescale by electron-transfer dissociation mass spectrometry. Nat Protoc 3:1709–17. Udeshi ND, Shabanowitz J, Hunt DF, Rose KL. (2007). Analysis of proteins and peptides on a chromatographic timescale by electron-transfer dissociation MS. FEBS J 274:6269–76. Uetz P, Giot L, Cagney G, et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403:623–27. Uhlen M, Ponten F. (2005). Antibody-based proteomics for human tissue profiling. Mol Cell Proteomics 4:384–93. Unlu M, Morgan ME, Minden JS. (1997). Difference gel electrophoresis: a single gel method for detecting changes in protein extracts. Electrophoresis 18:2071–77. Vanrobaeys F, Van Coster R, Dhondt G, Devreese B, Van Beeumen J. (2005). Profiling of myelin proteins by 2D-gel electrophoresis and multidimensional liquid chromatography coupled to MALDI TOF-TOF mass spectrometry. J Proteome Res 4:2283–93. Venter JC, Adams MD, Myers EW, et al. (2001). The sequence of the human genome. Science 291:1304–51.
Villanueva J, Shaffer DR, Philip J, et al. (2006). Differential exoprotease activities confer tumor-specific serum peptidome patterns. J Clin Invest 116:271–84. Villen J, Beausoleil SA, Gerber SA, Gygi SP. (2007). Large-scale phosphorylation analysis of mouse liver. Proc Natl Acad Sci U S A 104:1488–93. Walch A, Rauser S, Deininger SO, Hofler H. (2008). MALDI imaging mass spectrometry for direct tissue analysis: a new frontier for molecular histology. Histochem Cell Biol 130:421–34. Wang G, Wu WW, Zeng W, Chou CL, Shen RF. (2006). Label-free protein quantification using LC-coupled ion trap or FT mass spectrometry: Reproducibility, linearity, and application with complex proteomes. J Proteome Res 5:1214–23. Wang H, Clouthier SG, Galchev V, et al. (2005). Intact-protein-based high-resolution three-dimensional quantitative analysis system for proteome profiling of biological fluids. Mol Cell Proteomics 4:618–25. Whiteaker JR, Zhang H, Zhao L, et al. (2007). Integrated pipeline for mass spectrometry-based discovery and confirmation of biomarkers demonstrated in a mouse model of breast cancer. J Proteome Res 6:3962–75. Wiesner J, Premsler T, Sickmann A. (2008). Application of electron transfer dissociation (ETD) for the analysis of posttranslational modifications. Proteomics 8:4466–83. Williamson BL, Marchese J, Morrice NA. (2006). Automated identification and quantification of protein phosphorylation sites by LC/MS on a hybrid triple quadrupole linear ion trap mass spectrometer. Mol Cell Proteomics 5:337–46. Wilm M, Mann M. (1996). Analytical properties of the nanoelectrospray ion source. Anal Chem 68:1–8. Wilm M, Shevchenko A, Houthaeve T, et al. (1996). Femtomole sequencing of proteins from polyacrylamide gels by nano-electrospray mass spectrometry. Nature 379:466–69. Wisniewski JR. (2008). Protocol to enrich and analyze plasma membrane proteins from frozen tissues. Meth Mol Biol 432:175–83. Wu SL, Huhmer AF, Hao Z, Karger BL. (2007).
On-line LC-MS approach combining collision-induced dissociation (CID), electron-transfer dissociation (ETD), and CID of an isolated charge-reduced species for the trace-level characterization of proteins with post-translational modifications. J Proteome Res 6:4230–44. Wuhrer M, Stam JC, van de Geijn FE, et al. (2007). Glycosylation profiling of immunoglobulin G (IgG) subclasses from human serum. Proteomics 7:4070–81. Yang X, Thannhauser TW, Burrows M, Cox-Foster D, Gildow FE, Gray SM. (2008). Coupling genetics and proteomics to identify aphid proteins associated with vector-specific transmission of polerovirus (Luteoviridae). J Virol 82:291–99. Yang Y, Thannhauser TW, Li L, Zhang S. (2007). Development of an integrated approach for evaluation of 2-D gel image analysis: impact of multiple proteins in single spots on comparative proteomics in conventional 2-D gel/MALDI workflow. Electrophoresis 28:2080–94. Yang Y, Zhang S, Howe K, et al. (2007). A comparison of nLC-ESI-MS/MS and nLC-MALDI-MS/MS for GeLC-based protein identification and iTRAQ-based shotgun quantitative proteomics. J Biomol Tech 18:226–37.
Yao X, Afonso C, Fenselau C. (2003). Dissection of proteolytic 18O labeling: endoprotease-catalyzed 16O-to-18O exchange of truncated peptide substrates. J Proteome Res 2:147–52. Yao X, Freas A, Ramirez J, Demirev PA, Fenselau C. (2001). Proteolytic 18O labeling for comparative proteomics: model studies with two serotypes of adenovirus. Anal Chem 73:2836–42. Yates JR III. (2004). Mass spectral analysis in proteomics. Annu Rev Biophys Biomol Struct 33:297–316. Yates JR III, Carmack E, Hays L, Link AJ, Eng JK. (1999). Automated protein identification using microcolumn liquid chromatography-tandem mass spectrometry. Meth Mol Biol 112:553–69. Yates JR III, Gilchrist A, Howell KE, Bergeron JJ. (2005). Proteomics of organelles and large cellular structures. Nat Rev Mol Cell Biol 6:702–14. Yates J, Ruse CI, Nakorchevsky A. (2009). Proteomics by Mass Spectrometry: Approaches, Advances, and Applications. Annu Rev Biomed Eng 11:49–79. Zhang S, Van Pelt CK. (2004). Chip-based nanoelectrospray mass spectrometry for protein characterization. Expert Rev Proteomics 1:449–68. Zhang Y, Wolf-Yadlin A, Ross PL, et al. (2005). Time-resolved mass spectrometry of tyrosine phosphorylation sites in the epidermal growth factor receptor signaling network reveals dynamic modules. Mol Cell Proteomics 4:1240–50. Zubarev R. (2006). Protein primary structure using orthogonal fragmentation techniques in Fourier transform mass spectrometry. Expert Rev Proteomics 3:251–61. Zubarev RA, Kelleher NL, McLafferty FW. (1998). Electron capture dissociation of multiply charged protein cations. A nonergodic process. J Am Chem Soc 120:3265–66.
INDEX
Abdominal Aortic Aneurysm (AA) 216 Advanced Intercrosses 412 Affinity Purification and Mass Spectrometry (AP-MS) 508 Affymetrix SNP Array 197 Age-Related Eye Diseases (AREDs) 114 Age-Related Macular Degeneration (AMD) 114, 123 Ago2 345 Agrobacterium-Mediated Transformation 429 Allele 112 Allele Frequency Estimation 149 Allele Specific Amplification (ASA) 143 Allele-Specific PCR (ASPCR) 143 Alterations in Chromosome Numbers 318 Altered Expression of the Target Genes 356 Alternative Splicing 436 Amyotrophic Lateral Sclerosis (ALS) 125 Analysis of Anticancer Drugs Efficiency 74 Analysis of Chemotherapy Resistance in Tumor Cells 72 Analysis of Short Deletion/Insertion Mutations 148 Analyzing Candidate Gene Expression Levels 267 Annolite 100 Annotation Transfer by Sequence Homology 95 Antibody Technology 320
Anticancer Drug Cytotoxicity Assay 74 Application of the On/Off Switch in Mutation Analysis 147 Arbitrary Primed PCR 46 Assay of Drug Resistance Related Protein 72 Autoreactive and Malignant GC B Cells 306 B Cell Functionality 40 B Cell Maturation 40 B Cells 36, 324, 326 Backcrosses 408 Background Correction 18 Bardet-Biedl Syndrome (BBS) 206 Bayesian Integration 104 Bayesian Network 240 Bioinformatic Databases 237 Bioinformatics 414 Biological Ontology Databases 238 Biomarkers 511 Bisulfite PCR 46 Blast R Gene 431 Blast2GO 96 Blocking 481 Blood Peptides 512 BNArray 241 Bottom-Up Proteomics 483, 487 Breeding 436 BXD RILs 416 C. elegans 345 Calibrated Ensembles of SVMs 104 Candidate Gene Selection 182, 185 Candidate Selection 271
Gene Discovery for Disease Models, First Edition. Edited by Weikuan Gu, Yongjun Wang. © 2011 John Wiley & Sons, Inc. Published 2011 by John Wiley & Sons, Inc.
Cell Proliferation 346 Cell-Type-Specific Genes 26 Centrality 253 C57BL/6 Embryonic Stem Cells 374 Characterization of Candidate Genes According to Gene Expression 265, 267 CHARGE Syndrome 451 cis-QTL 24 Class Switch Recombination 305 Classic Procedure for Positional Cloning 2 Closeness 253 Clustering 434 Collaborative Cross Mice 413 Collagen Type 1 Alpha2 Gene (COL1A2) 206 Collection of Embryos 381 Collision-Induced Dissociation 476 Combining Crosses 413 Commonly Used Open Access Databases 492 Comparative Genomic Hybridization (CGH) 44, 155, 447–8 Complementation Test 286 Computing Environment 237 Concentration Gradient Generator (CGG) 72 Concurrent Infection 376 Configuration-Based Assays 141 Confirm Gene Mutation Function 351 Confirmation of a QTL Gene 414 Confirmation of Genomic Mutations in cDNA 266 Confirmation of the Base Variation 354 Congenics 412 Congenital Stationary Night Blindness 119 Consanguinity 206 Construction of a Genetic Linkage Map 428 Contiguous Gene Syndromes 450 Conventional Cytogenetic Methods 196 Copy Number Variable Regions (CNVs) 114, 121–3 Copy Number Variations (CNVs) 123, 139, 448 Co-segregation 393
Cycled Proofreading Genotyping Processes 147 Cytogenetic Map 113 Cytogenetics Analysis 209 Data Analysis 178 Database Searching 492 Defining a Disease Locus 125–6 Defining Genomic Regions 124 Degree 253 Detection of Rare Alleles and Mutation 148 Difference Gel Electrophoresis (DIGE) 484, 493 Differentially Expressed Genes 26 Dimensionality Reduction 21 Direct Genome Sequencing 215–31 Directed Acyclic Graph (DAG) 94, 240 Dissimilarity Measures 20 Diversification 434 Diversity Outbred 413 DNA Methylation 32 DNA Methyltransferases 34 DNA Sequencing 113, 314 DNA Strand Breaks 319 Dominant 115–6 Dominant Negative Mutations 285 Dose-Response 416, 417 Doubled Haploid 428 Down Syndrome Cell Adhesion Molecule (DSCAM) 211 Downstream Effects 27 Dye-Effect Correction 18 Dynamic Cell Culture 64 Dynamic Phenotyping 415 Effect of Mouse Strain 373 Electron Transfer Dissociation 477 Electron-Capture Dissociation 477 Embryonal Lethal Phenotypes 381 Embryonic Stem Cells 293 Enforced Expression in Transgenic Mice 324 Enrichment 499 ENU Mutagenesis 384 Environmental Phenomena 376 Epidermal Growth Factor 508 Epigenetic Phenomena 376 Euclidean Distance 21
Exo Polymerase-Mediated Primer-Dependent Mutation Assay 143 Experimental Design for Proteomics 478 Exponentially Modified Protein Abundance Index (emPAI) 498 Expression Analysis 397 Expression Microarray 103 Expression Profiling 292 Expression Profiling of Transcript Variants 172 Expression QTL 23 Expression Quantitative Trait Loci (eQTL) 23 Fabrication of the Microfluidic Chip 60, 61 Familial Exudative Vitreoretinopathy (FEVR) 218 Fine Mapping QTL 411 Fluorescence Resonance Energy Transfer (FRET) 155 Fluorescent in Situ Hybridization (FISH) 446 Fluorescently Labeled Sequencing by Synthesis 152 Forward and Reverse Approaches to Identifying Gene Function 372 Forward Genetics 383 4′,6-Diamidino-2-Phenylindole (DAPI) 74 Free Microarray Software and Databases 181 F2 Population 427 Function Prediction Using Integrated Data 102 Function Prediction Using Sequence Motif 99 Functional Enrichment 246 Functional Mapping 416 Gain of Function Mutations 286 Gel Electrophoresis Techniques 483 Gel-Free (Solution) Separation Techniques 484 Gene 112 Gene Discovery 4 Gene Expression Microarrays 13 Gene Expression Profiles 268
Gene Expression Profiling Data Analysis 180 Gene Expression Profiling Methods 169 Gene Identification Based on a Defined Genomic Region 128 Gene Mapping 113 Gene Network 23, 106, 270 Gene Ontology (GO) 93, 184, 238 Gene Screening 130 Gene Set Enrichment Analysis 249 Gene Significance 246 Gene Trapping 385 GeneMANIA 104 GeneSeeker 255 Genetic Interaction 103 Genetic Locus 112 Genetic Manipulation 373 Genetic Map 113 Genetic Markers 119–21 Genetic Reference Populations 409 Genetically Engineered Animals 324 Genetics 112 Genome Region of Interest (GRI) 263 Genome Tiling Arrays 16 Genome Wide Association Study (GWAS) 114, 216 Genomewide Associations 281 Genomewide Mutagenesis 384 Genomic Analysis on Chip 68 Genomic Cloning 7 Genomic Location 112–3 Genomics 113 Genotype 112 Genotyping 407, 409 Global Methylation Analysis 44 GOtcha 96 Graphical User Interface (GUI) 240 Haplotype Association Mapping 413 Haplotypes 282, 394 Heredity 112 Heterogeneous Stock 412 Heterozygotes 112 High-Density Lipoprotein Cholesterol (HDL) 117 High-Density SNP Array 196, 197 High-Throughput Gene Expression Analysis 12, 166
High-Throughput Mutation Analysis 140 High-Throughput Phenotyping 385 Homozygotes 112 Host Responses 27 Hot Embossing 62 Hybridization-Based Assays 140 Hybridization-Based Method for Genome Sequencing 222 Hydrophilic-Interaction Chromatography (HILIC) 485, 499 Identical by Descent 196 Identifying Differentially Expressed Candidate Genes 266 IFN-γ 36 igraph 253 Image Analysis 17 Imaging Mass Spectrometry 505 Immobilized Metal Affinity Chromatography (IMAC) 499 Immunoassay 71 Immunofluorescence 73 Immunoglobulin 305 Immunoprecipitation (IP) 322, 486 In Emulsion PCR (emPCR) 154 In Silico Screening 248 In Situ Hybridization 316 In Vitro Functional Studies 292, 293 Indexing 229 Initial Mapping of QTL 405 Inspection of Signal Plots 20 Integrative Methods for Function Prediction 104 Intercrosses 408 Ion Exchange (IEX) 485 Isobaric Tags for Relative and Absolute Quantification (iTRAQ) 495 Isotopomers 86 Kinases 509 Knockin Mice 295 Knockout 327 Lens Opacity 18 (lop 18) 120 Limb-Girdle Muscular Dystrophy 201 Linear Ion Trap (LTQ) 504 Linkage Analysis 2 Linkage Disequilibrium 283 Linkage Map 113
Load Determination 148 Long Contiguous Stretches of Homozygosity (LCSH) 204 Long-Range PCR 219 Loss of Function 284 Loss of Heterozygosity (LOH) 196 Luciferase Activity Assay 354 Luciferase Reporter Assays 356 MALDI 474–6, 479 MALDI-TOF 155, 488 Manhattan Distance 21 Map-Based Cloning 426 Mapping Population 426 Mapping with Other GRPs 410 Mapping with Recombinant Inbred Lines 410 Mass Analyzers 474–6 Mass Spectrometry 473 Mate-Pair Sequencing 458 Medical Subject Headings 238 Membrane-Type Frizzled-Related Protein (Mfrp) 128 Mendelian Disorder 217 Mendelian Inheritance 117 Mental Retardation 444 Metal Oxide Affinity Chromatography (MOAC) 501 Methylated DNA Immunoprecipitation Chip 47 Methylation-Specific PCR (MSP) 45 Methyl-Binding Domain (MBD) 47 Methyl-CpG-Binding Domain (MBD) 35 Micro Total Analysis Systems (μTAS) 59 Microarray 168, 176, 268 Microarray Data Preprocessing 180 Microarray-Based Mutation Detection 155 Microchannel Structures 68 Microdeletion Syndrome 450 Microdroplet PCR 224 Microdroplet-Based PCR Enrichment for Large-Scale Sequencing 222 Microfluidic Chip 60 Microfluidic Immunoassay 71 MicroRNA Arrays 16 MicroRNAs 344 Microsatellite Polymorphism 121
MiRNA Binding Sites 351 MiRNA in Human Diseases 349 MiRNA Regulation 348 MiRNA-Target Recognition 347 Missense Mutation 284 Mode of Inheritance 115 Module Relevance 246 Molecular Inversion Probe 220 Multidrug Resistance 1 288 Multigene Quantitative PCR Assays 174 Multilabel Hierarchical Classification 104 Multiple Cross Mapping 413 Multiple Gene Traits 117 Mutation 139, 305 Mutation Detection Application in SNP Assay 148 Mutation Screening 265 Mutation Screening of Candidate Genes 208–9 Nanoelectrospray Ionization (NanoESI) 475 Nanopore-Based Sequencing 227 Nano-Scale Ultra Performance Liquid Chromatography 485 Natbox 240 Network-Based Screening Strategy 247 Networks 414 Neutral Loss Scanning 501 Next-Generation Sequencing (NGS) 218, 456 No B-Wave 2 (nob2) 119 Nonreceptor Tyrosine Kinase (NRTKs) 510 Nonsense Mutation 284 Nonsynonymous SNP 280 Normalization 17 Null Mutations 283 Oligonucleotide Ligation–Mediated Parallel Sequencing 151 Oligonucleotide Microarrays 14, 171 One-Channel Microarrays 13 Online Databases and Software 357 Online Resources for Mouse Phenotyping 377 Nonsynonymous SNP 287 Open Reading Frame 512
Outbred Mice 412 Oxythiamine (OT) 87 Padlock Molecular Inversion Probe 220 Pairwise Scatter Plots 22 Particle Bombardment 430 Pathogen AVR Gene 433 Peptide Mass Fingerprinting 488 PGMapper 256, 270 Pharmacogenomics 415 Phenotypes 112 Phenotypic Effects 379 Phenotyping 407, 409 Phosphopeptide Analysis 500 Phosphopeptide Enrichment 500 Phosphoproteome 499 Phylogenomic Methods for Function Prediction 99 Physical Location of the Candidate R Gene 428 Physical Map 113 Physiological Gene Rearrangement 305 Plant Regeneration 430 Plasmid Construction 358 Platforms and Protocols of SNP Microarray 197 Point Mutations 391 Polygenic Disorders 396 Polymorphism 264 Population Screening 395 Positional Cloning 1, 262, 430 Power Distances 21 Precursor Ion Scanning 501 Predicting Functional Sites on Protein Surface 102 Preparation of Cell Lysate for Luciferase Activity Assay 359 Primary Immunodeficiency 255 Primer Extension–Based Assays 142 Pri-MiRNAs 345 Prioritization of Candidate 271 Progeny Analysis 430 Progressive Motor Neuron Degeneration (mnd) 125 Protein Analyses 320 Protein Analysis on Chip 70 Protein Complexes 505 Protein Interactions 505
Protein Localization 504 Protein Structure Comparison 100 Protein Turnover 83 Protein–Protein Interaction 102 Protein–Protein Interaction Databases 239 Protein–Protein Interaction Network Analysis 252 Proteogenomics 511 Proteomics 474–8 PubMed 237 Pyrophosphorolysis 144 Pyrophosphorolysis Activated Polymerization (PAP) 144 Pyrosequencing with Picotiterplate 150 QTL Cartographer 407 QTL Mapping 404 Quality Control 20 Quantile Normalization 20 Quantitation of Phosphorylation Sites 502 Quantitative Proteomics 495 Quantitative Real-Time PCR 308 Quantitative Trait Loci (QTL) 218, 264, 280, 404 R Environment 237 R Genes 425 R/qtl 406 Randomization 481 Rate of Protein Turnover 84 Real-Time Sequencing 153 Recapitulation of Human Mutations in Animal Models 298 Receptor-Protein Tyrosine Kinases (RTKs) 508 Recessive 115–32 Recombinant Inbred Lines (RIL) 23, 427 Recombinant Inbred Segregation Tests 412 Recurrent Mutation 396 Regulatory SNP 289 Relative Allele Signal (RAS) 197 Restriction Fragment Length Polymorphism (RFLP) 115–32, 156 Retinal Degeneration 10 (Rd10) 125 Retinal Degeneration 3 (Rd3) 120
Retinitis Pigmentosa (RP) 113 Reverse Genetics Approach 378 Rheumatoid Arthritis (RA) 40, 42 Rice 426 Rice Blast Disease 430 RNA Quality 176 Sammon Mapping 21 Segregating Crosses 407 Selected Reaction Monitoring 497 Selection of Genomic Database 263 Semidominant 115 Seminal Fluid Proteins 515 Separation Technologies 482 Sequence Analyses 291 Sequence-Based Function Prediction 95 Sequence-Based Sampling Methods 173 Sequencing Methods for Direct Genome Sequencing 230 Sequencing-Based Assays 149 Serial Analysis of Gene Expression (SAGE) 166, 173 Sex-Linked 116–7 Sex-Specific Gene Expression 23 Shotgun Proteomics-Based Quantitation 497 Shotgun Sequencing 457 Signal Transduction Pathways 507 Single Nucleotide Polymorphism (SNP) 126, 139, 195, 264, 444 Single-Cell Analysis Systems 66 Single-Molecule Sequencing 227 SNP Triggered Off/On Switch 144, 146 Soft Lithography 62 Software for QTL Mapping 405 Somatic Hypermutation 305 Spatial Correction 17 Splice Mutation 284 Spontaneous Mutations 384 Spotted cDNA Arrays 171 Spotted Microarrays 15 Static Cell Culture 64 Statistical Power 411 Strain-Specific Pathology 375 Strategy of Gene Discovery 6 Stratification 481 Strong Cation Exchange (SCX) 485, 491 Structural Chromosome Abnormalities 444
Structure-Based Function Prediction 100 Student’s t-test 480 Submicroscopic Chromosomal Rearrangements 446 Subtractive Libraries 173 Suspects 256 Synonymous SNP 280, 288 Systemic Lupus Erythematosus (SLE) 40, 41 Systems Genetics 414 T Cells 36, 38 Tandem Affinity Purification (TAP) 506 TaqMan Assay 157 Target Prediction 356 Targeted Deep Sequencing 219 Th1 Cells 36 Th2 Cytokine 37 Th17 Cells 38 The 3′ Terminally Labeled Primer Extension 145 3D Cell Culture 65 Tissue Identity 346 Tissue Inhibitor of Metalloproteases 1 (TIMP 1) 88 Tissue Specificity 292 Top-Down Proteomics 492 topGO 251 Topological Overlap Matrix 244 Trait 112 Transcription Factor Binding Sites 185 Transcription Modules 183 Transcription Regulatory Element Search (TRES) 273
Transcriptional Response 178 Transcriptome Resequencing 174 Transfection 359 Transfection of Experimental Cells 358 Transgenesis 294 Transgenic Mice 294 Transposition 433 Transposons 385 Trans-QTL 24 True Single Molecule Sequencing (tSMS) 227 Tumor Proteins 510 Two-Channel Labeling System 171 Two-Channel Microarrays 13 Two-Dimensional Gel Electrophoresis-Based Quantitation 495 Type 1 Diabetes (T1D) 40, 43 Ultra-High-Density SNP Array 200 Uniparental Disomy (UPD) 196 Vaccine Efficacy 27 Validation 481 Variable Number of Tandem Repeats (VNTR) 119 Web-Based Search Programs 269 Weighted Gene Co-Expression Network Analysis 242 Western Blotting 360 Whole-Transcript Arrays 15 Xanthine Dehydrogenase (XDH) 206